How to calculate classification error rate

MonsterMMORPG picture MonsterMMORPG · Apr 9, 2012 · Viewed 41.4k times · Source

Alright. Now this question is pretty hard. I am going to give you an example.

Now the left numbers are my algorithm classification and the right numbers are the original class numbers

177 86
177 86
177 86
177 86
177 86
177 86
177 86
177 86
177 86
177 89
177 89
177 89
177 89
177 89
177 89
177 89

So here my algorithm merged 2 different classes into 1. As you can see it merged class 86 and 89 into one class. So what would be the error at the above example ?

Or here another example

203 7
203 7
203 7
203 7
16 7
203 7
17 7
16 7
203 7

At the above example left numbers are my algorithm classification and the right numbers are original class ids. As can be seen above it miss classified 3 products (i am classifying same commercial products). So at this example what would be the error rate? How would you calculate.

This question is pretty hard and complex. We have finished the classification but we are not able to find correct algorithm for calculating success rate :D

Answer

denis picture denis · Apr 11, 2012

Here's a longish example, a real confuson matrix with 10 input classes "0" - "9" (handwritten digits), and 10 output clusters labelled A - J.

Confusion matrix for 5620 optdigits:

True 0 - 9 down, clusters A - J across
-----------------------------------------------------
      A    B    C    D    E    F    G    H    I    J
-----------------------------------------------------
0:    2         4         1       546    1
1:   71  249        11    1    6            228    5
2:   13    5        64    1   13    1       460
3:   29    2       507        20         5    9
4:        33  483         4   38         5    3    2
5:    1    1    2   58    3            480   13
6:    2    1    2       294         1         1  257
7:    1    5    1            546         6    7
8:  415   15    2    5    3   12        13   87    2
9:   46   72    2  357        35    1   47    2
----------------------------------------------------
    580  383  496 1002  307  670  549  557  810  266  estimates in each cluster

y class sizes: [554 571 557 572 568 558 558 566 554 562]
kmeans cluster sizes: [ 580  383  496 1002  307  670  549  557  810  266]

For example, cluster A has 580 data points, 415 of which are "8"s; cluster B has 383 data points, 249 of which are "1"s; and so on.

The problem is that the output classes are scrambled, permuted; they correspond in this order, with counts:

      A    B    C    D    E    F    G    H    I    J
      8    1    4    3    6    7    0    5    2    6
    415  249  483  507  294  546  546  480  460  257

One could say that the "success rate" is 75 % = (415 + 249 + 483 + 507 + 294 + 546 + 546 + 480 + 460 + 257) / 5620
but this throws away useful information — here, that E and J both say "6", and no cluster says "9".

So, add up the biggest numbers in each column of the confusion matrix and divide by the total.
But, how to count overlapping / missing clusters, like the 2 "6"s, no "9"s here ?
I don't know of a commonly agreed-upon way (doubt that the Hungarian algorithm is used in practice).

Bottom line: don't throw away information; look at the whole confusion matrix.

NB such a "success rate" will be optimistic for new data !
It's customary to split the data into say 2/3 "training set" and 1/3 "test set", train e.g. k-means on the 2/3 alone,
then measure confusion / success rate on the test set — generally worse than on the training set alone.
Much more can be said; see e.g. Cross-validation.