If I already have a numpy array that can serve as the initial centroids, how do I properly initialize the k-means algorithm? I am using the scikit-learn KMeans class.
This post (k-means with selected initial centers) indicates that I only need to set n_init=1 when passing a numpy array as the initial centroids, but I am not sure whether my initialization is working properly.
Naftali Harris' excellent visualization page shows what I am trying to do: http://www.naftaliharris.com/blog/visualizing-k-means-clustering/
There, choose "I'll choose" --> "Packed Circles" --> run kmeans.
import numpy as np
import sklearn.cluster as sk

# numpy array of initial centroids, shape (8, 4)
startpts = np.array([[-0.12, 0.939, 0.321, 0.011], [0.0, 0.874, -0.486, 0.862],
                     [0.0, 1.0, 0.0, 0.033], [0.12, 0.939, 0.321, -0.7],
                     [0.0, 1.0, 0.0, -0.203], [0.12, 0.939, -0.321, 0.25],
                     [0.0, 0.874, 0.486, -0.575], [-0.12, 0.939, -0.321, 0.961]], np.float64)
centroids = sk.KMeans(n_clusters=8, init=startpts, n_init=1)
centroids.fit(actual_data_points)
# get the array of fitted cluster centers
centroids_array = centroids.cluster_centers_
Yes, setting the initial centroids via init should work. Here's a quote from the scikit-learn documentation:
init : {‘k-means++’, ‘random’ or an ndarray}
Method for initialization, defaults to ‘k-means++’:
If an ndarray is passed, it should be of shape (n_clusters, n_features)
and gives the initial centers.
What is the shape (n_clusters, n_features) referring to?
The shape requirement means that init must have exactly n_clusters rows, and the number of elements in each row should match the dimensionality of actual_data_points:
>>> init = np.array([[-0.12, 0.939, 0.321, 0.011],
...                  [0.0, 0.874, -0.486, 0.862],
...                  [0.0, 1.0, 0.0, 0.033],
...                  [0.12, 0.939, 0.321, -0.7],
...                  [0.0, 1.0, 0.0, -0.203],
...                  [0.12, 0.939, -0.321, 0.25],
...                  [0.0, 0.874, 0.486, -0.575],
...                  [-0.12, 0.939, -0.321, 0.961]],
...                 np.float64)
>>> init.shape[0] == 8  # n_clusters
True
>>> init.shape[1] == actual_data_points.shape[1]  # n_features
True
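To convince yourself that the supplied centroids are accepted, one option is to run the fit end to end and inspect the result. The sketch below is only illustrative: the real actual_data_points is not shown in the question, so random 4-dimensional data stands in for it (an assumption), and the estimator is named kmeans here purely for readability.

```python
import numpy as np
from sklearn.cluster import KMeans

# Initial centroids from the question: 8 clusters, 4 features each.
startpts = np.array([[-0.12, 0.939, 0.321, 0.011],
                     [0.0, 0.874, -0.486, 0.862],
                     [0.0, 1.0, 0.0, 0.033],
                     [0.12, 0.939, 0.321, -0.7],
                     [0.0, 1.0, 0.0, -0.203],
                     [0.12, 0.939, -0.321, 0.25],
                     [0.0, 0.874, 0.486, -0.575],
                     [-0.12, 0.939, -0.321, 0.961]], dtype=np.float64)

# Stand-in for actual_data_points (assumption: the real data is not shown).
rng = np.random.default_rng(0)
actual_data_points = rng.normal(size=(200, 4))

# The init array must have shape (n_clusters, n_features).
assert startpts.shape == (8, actual_data_points.shape[1])

# n_init=1: with a fixed init array there is nothing to re-draw between runs.
kmeans = KMeans(n_clusters=8, init=startpts, n_init=1)
kmeans.fit(actual_data_points)

centroids_array = kmeans.cluster_centers_
print(centroids_array.shape)  # (8, 4)
print(kmeans.n_iter_)         # number of iterations actually run
```

The fitted cluster_centers_ will generally differ from startpts, because the algorithm moves the centers away from their starting positions as it iterates.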
What is n_features?
n_features is the dimensionality of your samples. For instance, if you were to cluster points on a 2D plane, n_features would be 2.
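As a small illustration of the 2D case, here is a sketch with made-up points (the data and the names points, good_init and bad_init are hypothetical). It also shows what happens when the init array has the wrong number of columns: in the scikit-learn versions I am aware of, this is rejected with a ValueError.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical 2D data: each row is one (x, y) point, so n_features == 2.
points = np.array([[0.0, 0.0], [0.1, 0.2],
                   [5.0, 5.0], [5.1, 4.9],
                   [9.0, 0.0], [9.2, 0.1]])

# Three starting centroids with 2 coordinates each: shape (3, 2).
good_init = np.array([[0.0, 0.0], [5.0, 5.0], [9.0, 0.0]])
kmeans = KMeans(n_clusters=3, init=good_init, n_init=1).fit(points)
labels = kmeans.labels_

# Centroids with 3 coordinates do not match the 2D data.
bad_init = np.array([[0.0, 0.0, 0.0], [5.0, 5.0, 5.0], [9.0, 0.0, 0.0]])
try:
    KMeans(n_clusters=3, init=bad_init, n_init=1).fit(points)
    shape_rejected = False
except ValueError:
    shape_rejected = True

print(kmeans.cluster_centers_.shape)
print(shape_rejected)
```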