I'm using scikit to perform a logistic regression on spam/ham data. X_train is my training data and y_train the labels('spam' or 'ham') and I trained my LogisticRegression this way:
classifier = LogisticRegression()
classifier.fit(X_train, y_train)
If I want to get the accuracies for a 10 fold cross validation, I just write:
accuracy = cross_val_score(classifier, X_train, y_train, cv=10)
I thought it was possible to calculate also the precisions and recalls by simply adding one parameter this way:
precision = cross_val_score(classifier, X_train, y_train, cv=10, scoring='precision')
recall = cross_val_score(classifier, X_train, y_train, cv=10, scoring='recall')
But it results in a ValueError
:
ValueError: pos_label=1 is not a valid label: array(['ham', 'spam'], dtype='|S4')
Is it related to the data (should I binarize the labels ?) or do they change the cross_val_score
function ?
Thank you in advance !
To compute the recall and precision, the data has to be indeed binarized, this way:
from sklearn import preprocessing
lb = preprocessing.LabelBinarizer()
lb.fit(y_train)
To go further, i was surprised that I didn't have to binarize the data when I wanted to calculate the accuracy:
accuracy = cross_val_score(classifier, X_train, y_train, cv=10)
It's just because the accuracy formula doesn't really need information about which class is considered as positive or negative: (TP + TN) / (TP + TN + FN + FP). We can indeed see that TP and TN are exchangeable, it's not the case for recall, precision and f1.