How to choose initial centroids for k-means clustering

Clint Whaley · Mar 12, 2016 · Viewed 10.1k times

I am implementing k-means clustering in Python. What is a good way to choose the initial centroids for a data set? For instance, I have the following data set:

A,1,1
B,2,1
C,4,4
D,4,5

I need to create two clusters. How do I choose the starting centroids?

Answer

Tony Babarino · Mar 12, 2016

You might want to look at the k-means++ method: it is one of the most popular ways of choosing initial centroids, it is easy to implement, and it gives consistent results. There is a paper describing it. It works as follows:

  1. Choose one center uniformly at random from among the data points.
  2. For each data point x, compute D(x), the distance between x and the nearest center that has already been chosen.
  3. Choose one new data point at random as a new center, using a weighted probability distribution where a point x is chosen with probability proportional to D(x)^2 (you can use scipy.stats.rv_discrete for that).
  4. Repeat steps 2 and 3 until k centers have been chosen.
  5. Now that the initial centers have been chosen, proceed using standard k-means clustering.
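
The seeding steps above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names `squared_distance` and `kmeans_pp_init` are made up for this example, and the weighted draw uses the standard library's `random.choices` instead of the `scipy.stats.rv_discrete` suggested above, so that the snippet has no third-party dependencies. It only picks the initial centers; the usual k-means iterations would follow.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_pp_init(points, k, rng=None):
    """Pick k initial centers from `points` using k-means++ seeding.

    Hypothetical helper for illustration; `rng` is an optional
    random.Random instance so that runs can be made reproducible.
    """
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    if rng is None:
        rng = random.Random()

    # Choose the first center uniformly at random.
    centers = [rng.choice(points)]
    # D(x)^2 for every point, relative to the centers chosen so far.
    d2 = [squared_distance(p, centers[0]) for p in points]

    while len(centers) < k:
        if sum(d2) == 0:
            # Every point coincides with a chosen center (duplicates in the
            # data), so fall back to a uniform pick.
            new_center = rng.choice(points)
        else:
            # Draw a point with probability proportional to D(x)^2.
            new_center = rng.choices(points, weights=d2, k=1)[0]
        centers.append(new_center)
        # Keep, for each point, the distance to its nearest center.
        d2 = [min(old, squared_distance(p, new_center))
              for p, old in zip(points, d2)]
    return centers


# The data set from the question, with k = 2.
data = {"A": (1, 1), "B": (2, 1), "C": (4, 4), "D": (4, 5)}
points = list(data.values())
centers = kmeans_pp_init(points, k=2, rng=random.Random(0))
print(centers)
```

For the data set in the question, suppose A = (1, 1) happens to be the first center. The squared distances are then 0 for A, 1 for B, 18 for C and 25 for D, so the second center is D with probability 25/44, C with probability 18/44 and B with probability 1/44. The second center is therefore very likely to come from the other group of points, which is the whole purpose of the weighting.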