I am calculating precision and recall for off-the-shelf algorithms on a dataset that I recently prepared.
It is a binary classification problem, and I am looking to calculate precision, recall and the F-score for each of the classifiers I built.
test_x, test_y, predics, pred_prob, score = CH.buildBinClassifier(data, allAttribs, 0.3, 50, 'logistic')
The buildBinClassifier method basically builds a classifier, fits it to the training data and returns test_x (the features of the test data), test_y (the ground-truth labels), predics (predictions made by the classifier) and pred_prob (prediction probabilities from the LogisticRegression.predict_proba method).
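(For reference: predict_proba returns one probability column per class, while precision_recall_curve expects a 1-D array of scores for the positive class. If pred_prob were the full two-column array, the positive-class scores would be sliced out roughly as below; this is a hypothetical sketch, since buildBinClassifier isn't shown.)
pos_scores = pred_prob[:, 1]  # probability of class 1 for each test sample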
Below is the code for computing the precision-recall curve:
from sklearn.metrics import precision_recall_curve
pr, re, _ = precision_recall_curve(test_y, pred_prob, pos_label=1)
pr
array([ 0.49852507,  0.49704142,  0.49554896,  0.49702381,  0.49850746,
        0.5       ,  0.5015015 ,  0.50301205,  0.50453172,  0.50606061,
        ...,
        0.875     ,  1.        ,  1.        ,  1.        ,  1.        ,
        1.        ,  1.        ,  1.        ,  1.        ])
re
array([ 1.        ,  0.99408284,  0.98816568,  0.98816568,  0.98816568,
        0.98816568,  0.98816568,  0.98816568,  0.98816568,  0.98816568,
        ...,
        0.04142012,  0.04142012,  0.03550296,  0.0295858 ,  0.02366864,
        0.01775148,  0.01183432,  0.00591716,  0.        ])
I do not understand why precision and recall are arrays. Shouldn't they just be single numbers, since precision is calculated as TP / (TP + FP), and recall similarly from its definition as TP / (TP + FN)?
I am aware of calculating the average precision and recall with the following piece of code, but somehow seeing arrays instead of single TP, FP, precision and recall values is making me wonder what is going on.
from sklearn.metrics import precision_recall_fscore_support as prf
precision, recall, fscore, _ = prf(test_y, predics, pos_label=1, average='binary')
Edit: But without the average and pos_label parameters it reports the precision for each class. Could someone explain the difference between the outputs of these two methods?
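For concreteness, here is a minimal sketch of the two call styles on toy labels (toy test_y and predics, not the real data):

from sklearn.metrics import precision_recall_fscore_support as prf
test_y  = [0, 0, 1, 1, 1]
predics = [0, 1, 1, 1, 0]
# average='binary': single numbers, computed for the pos_label class only
prf(test_y, predics, pos_label=1, average='binary')
# -> (0.666..., 0.666..., 0.666..., None)
# average=None (the default): one entry per class, ordered [class 0, class 1]
prf(test_y, predics, average=None)
# -> (array([ 0.5, 0.66666667]), array([ 0.5, 0.66666667]),
#     array([ 0.5, 0.66666667]), array([2, 3]))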
From the sklearn documentation for precision_recall_curve:
Compute precision-recall pairs for different probability thresholds.
Classifier models like logistic regression do not actually output class labels (like "0" or "1"), they output probabilities (like 0.67). These probabilities tell you the likelihood that the input sample is of a particular class, like the positive ("1") class. But you still need to choose a probability threshold so that the algorithm can convert the probability (0.67) into a class ("1").
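As a small illustration (toy numbers, not from the question's data), the conversion is just a comparison against the threshold:

import numpy as np
probs = np.array([0.12, 0.48, 0.67, 0.91])  # hypothetical positive-class probabilities
labels = (probs >= 0.5).astype(int)         # threshold at 0.5 -> array([0, 0, 1, 1])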
If you choose a threshold of 0.5, then all input samples with predicted probabilities greater than 0.5 will be assigned to the positive class. If you choose a different threshold, you get a different split of samples between the positive and negative classes, and therefore different precision and recall scores.
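precision_recall_curve repeats that conversion at every distinct score, recording one precision-recall pair per threshold, which is exactly why you see arrays. The example from the sklearn documentation shows the correspondence:

import numpy as np
from sklearn.metrics import precision_recall_curve

y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])  # positive-class probabilities
precision, recall, thresholds = precision_recall_curve(y_true, scores)
# thresholds -> array([ 0.35,  0.4 ,  0.8 ])
# precision  -> array([ 0.66666667,  0.5       ,  1.        ,  1.        ])
# recall     -> array([ 1. ,  0.5,  0.5,  0. ])
# A final (precision=1, recall=0) point is appended so the curve is complete,
# which is why precision and recall have one more entry than thresholds.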