I calculated a confusion matrix for my classifier using the method confusion_matrix() from the sklearn package. The diagonal elements of the confusion matrix represent the number of points for which the predicted label is equal to the true label, while off-diagonal elements are those that are mislabeled by the classifier.
I would like to normalize my confusion matrix so that it contains only numbers between 0 and 1. I would like to read the percentage of correctly classified samples from the matrix.
I found several methods how to normalize a matrix (row and column normalization) but I don't know much about maths and am not sure if this is the correct approach. Can someone help please?
Suppose that
>>> y_true = [0, 0, 1, 1, 2, 0, 1]
>>> y_pred = [0, 1, 0, 1, 2, 2, 1]
>>> C = confusion_matrix(y_true, y_pred)
>>> C
array([[1, 1, 1],
[1, 2, 0],
[0, 0, 1]])
Then, to find out how many samples per class have received their correct label, you need
>>> C / C.astype(np.float).sum(axis=1)
array([[ 0.33333333, 0.33333333, 1. ],
[ 0.33333333, 0.66666667, 0. ],
[ 0. , 0. , 1. ]])
The diagonal contains the required values. Another way to compute these is to realize that what you're computing is the recall per class:
>>> from sklearn.metrics import precision_recall_fscore_support
>>> _, recall, _, _ = precision_recall_fscore_support(y_true, y_pred)
>>> recall
array([ 0.33333333, 0.66666667, 1. ])
Similarly, if you divide by the sum over axis=0
, you get the precision (fraction of class-k
predictions that have ground truth label k
):
>>> C / C.astype(np.float).sum(axis=0)
array([[ 0.5 , 0.33333333, 0.5 ],
[ 0.5 , 0.66666667, 0. ],
[ 0. , 0. , 0.5 ]])
>>> prec, _, _, _ = precision_recall_fscore_support(y_true, y_pred)
>>> prec
array([ 0.5 , 0.66666667, 0.5 ])