K-means without knowing the number of clusters?

Legend · Jul 7, 2011

I am attempting to apply k-means on a set of high-dimensional data points (about 50 dimensions) and was wondering if there are any implementations that find the optimal number of clusters.

I remember reading somewhere that algorithms generally do this by maximizing the inter-cluster distance and minimizing the intra-cluster distance, but I don't remember where I saw that. It would be great if someone could point me to any resources that discuss this. I am currently using SciPy for k-means, but any related library would be fine as well.

If there are alternative ways of achieving the same result, or a better algorithm, please let me know.

Answer

Alex Reynolds · Jul 7, 2011

One approach is cross-validation.

In essence, you pick a subset of your data, cluster it into k clusters, and ask how well that clustering holds up against the rest of the data: are data points assigned to the same cluster memberships, or do they fall into different clusters?

If the memberships are roughly the same, the data fit well into k clusters. Otherwise, you try a different k.
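As a rough sketch of that idea (one possible reading of it, not a standard recipe), the snippet below splits the data in half, runs k-means on each half with `scipy.cluster.vq.kmeans2`, labels the second half with both sets of centroids, and compares the two labelings. The function names, the 50/50 split, the number of restarts, and the use of the adjusted Rand index as the agreement measure are all my own assumptions for illustration.

```python
import numpy as np
from scipy.cluster.vq import kmeans2, vq


def adjusted_rand_index(labels_a, labels_b):
    """Agreement between two labelings, invariant to how clusters are numbered."""
    _, a = np.unique(labels_a, return_inverse=True)
    _, b = np.unique(labels_b, return_inverse=True)
    table = np.zeros((a.max() + 1, b.max() + 1), dtype=np.int64)
    np.add.at(table, (a, b), 1)  # contingency table of co-occurring labels

    def pairs(x):
        return x * (x - 1) / 2.0  # number of unordered pairs among x items

    sum_cells = pairs(table).sum()
    sum_rows = pairs(table.sum(axis=1)).sum()
    sum_cols = pairs(table.sum(axis=0)).sum()
    expected = sum_rows * sum_cols / pairs(len(a))
    max_index = 0.5 * (sum_rows + sum_cols)
    if max_index == expected:  # degenerate case, e.g. a single cluster
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


def fit_centroids(data, k, n_init=5):
    """Run kmeans2 a few times and keep the centroids with the lowest distortion."""
    best, best_cost = None, np.inf
    for _ in range(n_init):
        centroids, labels = kmeans2(data, k, minit="++")
        cost = ((data - centroids[labels]) ** 2).sum()
        if cost < best_cost:
            best, best_cost = centroids, cost
    return best


def mean_stability(data, k, n_splits=5, seed=0):
    """Average agreement between clusterings fitted on two random halves."""
    rng = np.random.default_rng(seed)  # only controls the random splits
    scores = []
    for _ in range(n_splits):
        order = rng.permutation(len(data))
        half = len(data) // 2
        first, second = data[order[:half]], data[order[half:]]
        # Label the second half with centroids learned from each half.
        labels_from_first, _ = vq(second, fit_centroids(first, k))
        labels_from_second, _ = vq(second, fit_centroids(second, k))
        scores.append(adjusted_rand_index(labels_from_first, labels_from_second))
    return float(np.mean(scores))


# Synthetic stand-in for the 50-dimensional data: three well-separated blobs.
demo_rng = np.random.default_rng(42)
demo_centers = demo_rng.normal(scale=10.0, size=(3, 50))
demo_data = demo_centers[demo_rng.integers(0, 3, size=300)] + demo_rng.normal(size=(300, 50))

for k in range(2, 7):
    print(k, round(mean_stability(demo_data, k, n_splits=3), 3))
```

A k whose score stays close to 1 across splits is a candidate; scores that drop sharply suggest the memberships are not reproducible at that k. Because k-means initialization is random, the exact numbers will vary from run to run.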

Also, you could run PCA (principal component analysis) to reduce your 50 dimensions to a more tractable number. If PCA suggests that most of your variance comes from, say, 4 principal components rather than all 50 dimensions, then you can pick k on that basis, to explore how the four cluster memberships are assigned.
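To make the PCA step concrete, here is a minimal sketch using only NumPy's SVD. The function name `pca_reduce` and the 90% variance threshold are assumptions of mine, not something prescribed by PCA itself; the component count it reports is only the starting value for k suggested above, and the reduced data can be handed to whichever k-means routine you already use.

```python
import numpy as np


def pca_reduce(data, variance_threshold=0.9):
    """Project data onto the fewest principal components that reach the threshold.

    Returns the reduced data and the explained-variance ratio of every component.
    """
    centered = data - data.mean(axis=0)
    # Singular values come back sorted in descending order.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    ratio = singular_values ** 2 / np.sum(singular_values ** 2)
    # First position where the cumulative ratio reaches the threshold.
    n_keep = int(np.searchsorted(np.cumsum(ratio), variance_threshold) + 1)
    n_keep = min(n_keep, len(ratio))  # guard against floating-point round-off
    return centered @ vt[:n_keep].T, ratio


# Synthetic stand-in: 4 strong latent directions embedded in 50 dimensions.
demo_rng = np.random.default_rng(0)
latent = demo_rng.normal(scale=10.0, size=(500, 4))
mixing = np.linalg.qr(demo_rng.normal(size=(50, 4)))[0].T  # orthonormal rows, shape (4, 50)
demo_data = latent @ mixing + demo_rng.normal(scale=0.1, size=(500, 50))

reduced, ratio = pca_reduce(demo_data)
print("components kept:", reduced.shape[1])
print("variance explained by the first components:", np.round(ratio[:6], 3))
```

Clustering `reduced` instead of the original 50 columns also tends to make distance computations cheaper, though how many components to keep remains a judgment call.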