Improving k-means clustering

Dhruv Gairola picture Dhruv Gairola · Jan 10, 2011 · Viewed 7k times · Source

My lecture notes on computer vision mention that the performance of the k-means clustering algorithm can be improved if we know the standard deviation of the clusters. How so?

My thinking is that we can use the standard deviations to come up with a better initial estimate through histogram based segmentation first. What do you think? Thanks for any help!

Answer

ang mo picture ang mo · Jan 10, 2011

Your lecturer might have the 2002 paper by Veenman et al in mind. The basic idea is that you set the maximum variance you allow in each cluster. You start with as many clusters as data points and then you "evolve" clusters by

  • merging neighboring clusters if the resulting cluster's variance is below the threshold
  • isolating elements that are "far" if a cluster's variance is above the threshold
  • or moving some elements between neighboring clusters if it decreases the sum of squared errors

(this evolution acts as a global optimization procedure, and prevents the bad consequences of initial assignment of cluster means you have in k-means)

To sum up, if you know the variance, you know how varied the clusters should be, so it's easier to e.g. detect outliers (which usually should be put into separate clusters).