scikit weighted f1 score calculation and usage

com · Oct 25, 2015 · Viewed 7.3k times

I have a question regarding the weighted average in sklearn.metrics.f1_score:

sklearn.metrics.f1_score(y_true, y_pred, labels=None, pos_label=1, average='weighted', sample_weight=None)

Calculate metrics for each label, and find their average, weighted by support (the number of true instances for each label). This alters ‘macro’ to account for label imbalance; it can result in an F-score that is not between precision and recall.

First, is there any reference that justifies the usage of weighted-F1? I am just curious in which cases I should use weighted-F1.

Second, I heard that weighted-F1 is deprecated. Is that true?

Third, how is weighted-F1 actually calculated? For example:

{
    "0": {
        "TP": 2,
        "FP": 1,
        "FN": 0,
        "F1": 0.8
    },
    "1": {
        "TP": 0,
        "FP": 2,
        "FN": 2,
        "F1": -1
    },
    "2": {
        "TP": 1,
        "FP": 1,
        "FN": 2,
        "F1": 0.4
    }
}

How do I calculate the weighted-F1 of the above example? I thought it should be something like (0.8*2/3 + 0.4*1/3)/3, but I was wrong.

Answer

jakevdp · Oct 25, 2015

First, is there any reference that justifies the usage of weighted-F1? I am just curious in which cases I should use weighted-F1.

I don't have any references, but if you're interested in multi-label classification where you care about precision/recall of all classes, then the weighted f1-score is appropriate. If you have binary classification where you just care about the positive samples, then it is probably not appropriate.

Second, I heard that weighted-F1 is deprecated. Is that true?

No, weighted-F1 itself is not being deprecated. Only some aspects of the function interface were deprecated, back in v0.16, and then only to make the behavior more explicit in previously ambiguous situations. (See the historical discussion on GitHub, or check out the source code and search the page for "deprecated" to find details.)

Third, how is weighted-F1 actually calculated?

From the documentation of f1_score:

``'weighted'``:
  Calculate metrics for each label, and find their average, weighted
  by support (the number of true instances for each label). This
  alters 'macro' to account for label imbalance; it can result in an
  F-score that is not between precision and recall.

So the average is weighted by the support, which is the number of true samples with a given label. Your example data above does not list the support explicitly, so the weighted F1 score cannot be computed from the F1 values alone; the support has to be recovered first, and it equals TP + FN for each label.
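
As a rough sketch of that weighting, the snippet below works through the counts from the question in plain Python. It assumes that each label's support can be taken as TP + FN (the true instances of that label), and it uses 0 rather than the question's -1 for the label with no true positives, which is the value the formula 2*TP / (2*TP + FP + FN) gives. The names counts and f1_from_counts are made up for illustration and are not part of scikit-learn.

```python
# Per-label counts copied from the question (the "F1" entries are recomputed below).
counts = {
    0: {"TP": 2, "FP": 1, "FN": 0},
    1: {"TP": 0, "FP": 2, "FN": 2},
    2: {"TP": 1, "FP": 1, "FN": 2},
}


def f1_from_counts(tp, fp, fn):
    """Hypothetical helper: F1 = 2*TP / (2*TP + FP + FN), or 0.0 if the denominator is 0."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# ASSUMPTION: support (true instances of a label) is recovered as TP + FN.
support = {label: c["TP"] + c["FN"] for label, c in counts.items()}

# Per-label F1 scores.
f1 = {label: f1_from_counts(c["TP"], c["FP"], c["FN"]) for label, c in counts.items()}

# Weighted average: each label's F1 counts in proportion to its support.
total_support = sum(support.values())
weighted_f1 = sum(f1[label] * support[label] for label in counts) / total_support

print("support:", support)
print("per-label F1:", f1)
print("weighted F1:", round(weighted_f1, 4))
```

Under these assumptions the supports are 2, 2 and 3, and the weighted score comes out at about 0.4. With the raw label arrays at hand, the f1_score(y_true, y_pred, average='weighted') call quoted in the question does this bookkeeping for you.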