Python: how to normalize a confusion matrix?

Kaly picture Kaly · Jan 4, 2014 · Viewed 39.6k times · Source

I calculated a confusion matrix for my classifier using the method confusion_matrix() from the sklearn package. The diagonal elements of the confusion matrix represent the number of points for which the predicted label is equal to the true label, while off-diagonal elements are those that are mislabeled by the classifier.

I would like to normalize my confusion matrix so that it contains only numbers between 0 and 1. I would like to read the percentage of correctly classified samples from the matrix.

I found several methods how to normalize a matrix (row and column normalization) but I don't know much about maths and am not sure if this is the correct approach. Can someone help please?

Answer

Fred Foo picture Fred Foo · Jan 5, 2014

Suppose that

>>> y_true = [0, 0, 1, 1, 2, 0, 1]
>>> y_pred = [0, 1, 0, 1, 2, 2, 1]
>>> C = confusion_matrix(y_true, y_pred)
>>> C
array([[1, 1, 1],
       [1, 2, 0],
       [0, 0, 1]])

Then, to find out how many samples per class have received their correct label, you need

>>> C / C.astype(np.float).sum(axis=1)
array([[ 0.33333333,  0.33333333,  1.        ],
       [ 0.33333333,  0.66666667,  0.        ],
       [ 0.        ,  0.        ,  1.        ]])

The diagonal contains the required values. Another way to compute these is to realize that what you're computing is the recall per class:

>>> from sklearn.metrics import precision_recall_fscore_support
>>> _, recall, _, _ = precision_recall_fscore_support(y_true, y_pred)
>>> recall
array([ 0.33333333,  0.66666667,  1.        ])

Similarly, if you divide by the sum over axis=0, you get the precision (fraction of class-k predictions that have ground truth label k):

>>> C / C.astype(np.float).sum(axis=0)
array([[ 0.5       ,  0.33333333,  0.5       ],
       [ 0.5       ,  0.66666667,  0.        ],
       [ 0.        ,  0.        ,  0.5       ]])
>>> prec, _, _, _ = precision_recall_fscore_support(y_true, y_pred)
>>> prec
array([ 0.5       ,  0.66666667,  0.5       ])