Can k-means clustering do classification?

algorithm cluster-analysis data-mining k-means

Sirius Wang · Mar 10, 2014 · Viewed 22k times · Source

I want to know whether the k-means clustering algorithm can be used for classification.

Suppose I have run a simple k-means clustering.

Assume I have a lot of data and k-means gives me 2 clusters, A and B, with Euclidean distance as the distance measure.

Cluster A is on the left side.

Cluster B is on the right side.

So, if I get one new data point, what should I do?

1. Run the k-means clustering algorithm again and see which cluster the new data point ends up in?
2. Keep the final centroids and use Euclidean distance to decide which cluster the new data point belongs to?
3. Some other method?

Answer

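The simplest method is of course 2: assign each new object to the closest centroid. Technically, use the squared Euclidean distance (the sum of squared differences) rather than the Euclidean distance itself; this matches the k-means objective, selects the same nearest centroid, and saves you a sqrt computation.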
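As a minimal sketch of option 2, the following assigns a new point to the nearest stored centroid using the sum of squared differences. The centroid coordinates, the new point, and the helper names `squared_distance` and `assign_to_centroid` are made up for illustration; in practice the centroids would come from your finished k-means run.

```python
def squared_distance(a, b):
    """Sum of squared coordinate differences (no sqrt needed)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign_to_centroid(point, centroids):
    """Return the index of the centroid closest to `point`."""
    return min(range(len(centroids)),
               key=lambda i: squared_distance(point, centroids[i]))


# Hypothetical centroids kept from an earlier k-means run:
# index 0 stands for cluster A (left), index 1 for cluster B (right).
centroids = [(-2.0, 0.0), (3.0, 0.5)]

new_point = (2.1, -0.3)
cluster = assign_to_centroid(new_point, centroids)
print("AB"[cluster])
```

Because the square root is monotonic, skipping it never changes which centroid is closest.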

Method 1 is fragile, as k-means may give you a completely different solution, in particular if it didn't fit your data well in the first place (e.g. too high-dimensional, clusters of very different sizes, too many clusters, ...).

However, the following method may be even more reasonable:

3. Train an actual classifier.

Yes, you can use k-means to produce an initial partitioning, then assume that the k-means partitions could be reasonable classes (you really should validate this at some point though), and then continue as you would if the data had been user-labeled.

I.e. run k-means, train an SVM on the resulting clusters, then use the SVM for classification.
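A rough sketch of that pipeline, assuming scikit-learn and NumPy are available. The two synthetic blobs, the linear kernel, and all parameter values are illustrative choices, not part of the original question.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic stand-in for the real data: one blob on the left, one on the right.
X = np.vstack([
    rng.normal(loc=(-3.0, 0.0), scale=0.5, size=(50, 2)),
    rng.normal(loc=(3.0, 0.0), scale=0.5, size=(50, 2)),
])

# Step 1: k-means produces the initial partitioning.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
pseudo_labels = kmeans.labels_

# Step 2: treat the cluster ids as class labels and train an SVM on them.
svm = SVC(kernel="linear").fit(X, pseudo_labels)

# Step 3: classify new data with the SVM alone; k-means is not re-run.
new_points = np.array([[-2.5, 0.2], [2.8, -0.1]])
predicted = svm.predict(new_points)
print(predicted)
```

The cluster ids 0 and 1 are arbitrary, so which blob gets which id can differ between k-means runs. The classifier is only as good as the partitioning it was trained on, which is why validating the clusters first matters.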

k-NN classification, or even assigning each object to the nearest cluster center (option 2), can be seen as very simple classifiers. The latter is a 1-NN classifier, "trained" on the cluster centroids only.
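To illustrate that last point, here is a small sketch (assuming scikit-learn and NumPy) that fits a 1-NN classifier on two made-up centroids and checks that it agrees with a plain nearest-centroid assignment. The coordinates are hypothetical.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical centroids from a finished k-means run, one label per centroid.
centroids = np.array([[-2.0, 0.0], [3.0, 0.5]])
cluster_ids = np.array([0, 1])

# A 1-NN classifier whose entire "training set" is the centroids.
one_nn = KNeighborsClassifier(n_neighbors=1).fit(centroids, cluster_ids)

points = np.array([[2.1, -0.3], [-1.0, 4.0], [0.4, 0.0]])
via_1nn = one_nn.predict(points)

# Direct nearest-centroid assignment using squared distances.
sq_dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
via_argmin = sq_dists.argmin(axis=1)

print(via_1nn, via_argmin)
```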
