I understand the F1-measure is the harmonic mean of precision and recall. But what values define how good or bad an F1-measure is? I can't seem to find any references (Google or academic) answering my question.
Consider `sklearn.dummy.DummyClassifier(strategy='uniform')`, a classifier that makes random guesses (i.e. a bad classifier). We can view `DummyClassifier` as a benchmark to beat; now let's look at its f1-score.

In a binary classification problem with a balanced dataset (6198 total samples: 3099 labelled `0` and 3099 labelled `1`), the f1-score is 0.5 for both classes, and the weighted average is also 0.5 (see the sketch below):
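A minimal sketch that reproduces this first example. The original features aren't shown in the question, so the `X` below is made-up placeholder data; only the 3099/3099 class balance matches the setup above.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import classification_report

rng = np.random.default_rng(0)
X = rng.normal(size=(6198, 5))           # placeholder features (assumed, not from the question)
y = np.array([0] * 3099 + [1] * 3099)    # balanced labels: 3099 of each class

# 'uniform' predicts each class uniformly at random, ignoring X entirely
clf = DummyClassifier(strategy='uniform', random_state=0).fit(X, y)
print(classification_report(y, clf.predict(X)))
# On a balanced dataset, both per-class f1-scores and the weighted average
# come out around 0.5, as reported above.
```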
Second example, using `DummyClassifier(strategy='constant')`, i.e. predicting the same label every time (label `1` in this case): the average of the f1-scores is 0.33, while the f1 for label `0` is 0.00 (sketch below):
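Same setup as before (again with placeholder features), but the dummy always predicts the constant label `1`:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import classification_report

X = np.zeros((6198, 1))                  # features are ignored by the dummy classifier
y = np.array([0] * 3099 + [1] * 3099)

clf = DummyClassifier(strategy='constant', constant=1).fit(X, y)
print(classification_report(y, clf.predict(X), zero_division=0))
# Class 0 is never predicted, so its f1 is 0.00.
# Class 1 gets precision 0.5 and recall 1.0, i.e. f1 ~= 0.67,
# so the averaged f1 is about 0.33, as reported above.
```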
I consider these to be bad f1-scores, given the balanced dataset.
PS: the summaries were generated using `sklearn.metrics.classification_report`.