I am working on implementing k-means clustering in Python. What is a good way to choose the initial centroids for a data set? For instance, I have the following data set:
A,1,1
B,2,1
C,4,4
D,4,5
I need to create two different clusters. How do I choose the starting centroids?
You might want to look at the k-means++ method. It is one of the most popular ways of choosing initial centroids, it is easy to implement, and it gives consistent results. There is a paper on it. It works as follows:
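1. Choose one center uniformly at random from among the data points.
2. For each data point x, compute D(x), the distance between x and the nearest center that has already been chosen.
3. Choose one new data point at random as a new center, using a weighted probability distribution where a point x is chosen with probability proportional to D(x)^2 (you can use scipy.stats.rv_discrete for that).
4. Repeat steps 2 and 3 until k centers have been chosen.
5. Now that the initial centers have been chosen, proceed with standard k-means clustering.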
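
As a rough illustration, the seeding could be sketched like this for the four points from the question. The function name `kmeans_pp_init` and the `seed` parameter are made up for this example, and the weighted draw uses NumPy's `Generator.choice` rather than `scipy.stats.rv_discrete`, which should behave equivalently for this purpose. Treat it as a minimal sketch, not a tuned implementation.

```python
import numpy as np

def kmeans_pp_init(points, k, seed=None):
    """Pick k initial centers from `points` using k-means++ style seeding.

    Hypothetical helper for illustration; assumes the data contains at
    least two distinct points, otherwise all distances would be zero.
    """
    rng = np.random.default_rng(seed)
    points = np.asarray(points, dtype=float)
    n = len(points)

    # First center: chosen uniformly at random from the data points.
    centers = [points[rng.integers(n)]]

    while len(centers) < k:
        # D(x)^2: squared distance from each point to its nearest chosen center.
        diffs = points[:, None, :] - np.array(centers)[None, :, :]
        d2 = (diffs ** 2).sum(axis=2).min(axis=1)

        # Draw the next center with probability proportional to D(x)^2.
        # Points already chosen have D(x) = 0, so they cannot be picked again.
        probs = d2 / d2.sum()
        centers.append(points[rng.choice(n, p=probs)])

    return np.array(centers)

# The data set from the question (labels A-D dropped, coordinates kept).
data = np.array([[1, 1], [2, 1], [4, 4], [4, 5]])
centroids = kmeans_pp_init(data, k=2, seed=0)
print(centroids)
```

The returned array can then be used as the starting centroids for the usual k-means assignment and update loop. Because the second center is drawn with probability proportional to squared distance, it will usually, though not always, land in the other group of points.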