Scikit: calculate precision and recall using cross_val_score function

Anil Narassiguin · Dec 8, 2014 · Viewed 18.3k times

I'm using scikit-learn to perform logistic regression on spam/ham data. X_train is my training data and y_train the labels ('spam' or 'ham'), and I trained my LogisticRegression this way:

from sklearn.linear_model import LogisticRegression

classifier = LogisticRegression()
classifier.fit(X_train, y_train)

If I want to get the accuracies for a 10-fold cross-validation, I just write:

accuracy = cross_val_score(classifier, X_train, y_train, cv=10)

I thought it would also be possible to calculate the precision and recall simply by adding one parameter, this way:

precision = cross_val_score(classifier, X_train, y_train, cv=10, scoring='precision')
recall = cross_val_score(classifier, X_train, y_train, cv=10, scoring='recall')

But it results in a ValueError:

ValueError: pos_label=1 is not a valid label: array(['ham', 'spam'], dtype='|S4') 

Is it related to the data (should I binarize the labels?), or did they change the cross_val_score function?

Thank you in advance!

Answer

Anil Narassiguin · Dec 9, 2014

To compute the recall and precision, the labels do indeed have to be binarized, this way:

from sklearn import preprocessing

lb = preprocessing.LabelBinarizer()
y_train_bin = lb.fit_transform(y_train).ravel()  # 'ham'/'spam' -> 0/1
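Putting it together, here is a self-contained sketch with made-up data standing in for X_train and y_train (in recent scikit-learn versions, cross_val_score lives in sklearn.model_selection):

```python
import numpy as np
from sklearn import preprocessing
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Hypothetical stand-ins for the question's X_train / y_train
rng = np.random.RandomState(0)
X_train = rng.rand(100, 5)
y_train = np.array(['spam', 'ham'] * 50)

# Binarize the string labels so the 'precision'/'recall' scorers
# have a numeric positive class (pos_label=1)
lb = preprocessing.LabelBinarizer()
y_train_bin = lb.fit_transform(y_train).ravel()  # shape (n_samples,)

classifier = LogisticRegression()
precision = cross_val_score(classifier, X_train, y_train_bin,
                            cv=10, scoring='precision')
recall = cross_val_score(classifier, X_train, y_train_bin,
                         cv=10, scoring='recall')
```

With binarized labels, class 1 ('spam' here, since LabelBinarizer orders labels alphabetically) is treated as the positive class, and the ValueError goes away.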

Going further, I was surprised that I didn't have to binarize the labels when I wanted to calculate the accuracy:

accuracy = cross_val_score(classifier, X_train, y_train, cv=10)

That's because the accuracy formula doesn't need to know which class is considered positive or negative: accuracy = (TP + TN) / (TP + TN + FN + FP). TP and TN are interchangeable in that formula, which is not the case for recall, precision and F1.
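This symmetry can be checked directly on a small hand-made example: swapping which class is positive leaves accuracy unchanged but changes precision.

```python
from sklearn.metrics import accuracy_score, precision_score

# Tiny hand-made example (not the question's data)
y_true = [1, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 1]

# Accuracy is invariant to relabeling the classes
acc = accuracy_score(y_true, y_pred)                      # 3 correct out of 5 -> 0.6
acc_swapped = accuracy_score([1 - y for y in y_true],
                             [1 - y for y in y_pred])     # also 0.6

# Precision depends on which class is positive
prec_pos1 = precision_score(y_true, y_pred)               # TP=2, FP=1 -> 2/3
prec_pos0 = precision_score(y_true, y_pred, pos_label=0)  # TP=1, FP=1 -> 1/2
```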