I am trying to implement Kmeans
algorithm in python which will use cosine distance
instead of euclidean distance as distance metric.
I understand that using different distance function can be fatal and should done carefully. Using cosine distance as metric forces me to change the average function (the average in accordance to cosine distance must be an element by element average of the normalized vectors).
I have seen this elegant solution of manually overriding the distance function of sklearn, and I want to use the same technique to override the averaging section of the code but I couldn't find it.
Does anyone knows How can it be done ?
How critical is it that the distance metric doesn't satisfy the triangular inequality?
If anyone knows a different efficient implementation of kmeans where I use cosine metric or satisfy an distance and averaging functions it would also be realy helpful.
Thank you very much!
Edit:
After using the angular distance instead of cosine distance, The code looks as something like that:
def KMeans_cosine_fit(sparse_data, nclust = 10, njobs=-1, randomstate=None):
# Manually override euclidean
def euc_dist(X, Y = None, Y_norm_squared = None, squared = False):
#return pairwise_distances(X, Y, metric = 'cosine', n_jobs = 10)
return np.arccos(cosine_similarity(X, Y))/np.pi
k_means_.euclidean_distances = euc_dist
kmeans = k_means_.KMeans(n_clusters = nclust, n_jobs = njobs, random_state = randomstate)
_ = kmeans.fit(sparse_data)
return kmeans
I noticed (with mathematics calculations) that if the vectors are normalized the standard average works well for the angular metric. As far as I understand, I have to change _mini_batch_step()
in k_means_.py. But the function is pretty complicated and I couldn't understand how to do it.
Does anyone knows about alternative solution?
Or maybe, Does anyone knows how can I edit this function with a one that always forces the centroids to be normalized?
So it turns out you can just normalise X to be of unit length and use K-means as normal. The reason being if X1 and X2 are unit vectors, looking at the following equation, the term inside the brackets in the last line is cosine distance.
So in terms of using k-means, simply do:
length = np.sqrt((X**2).sum(axis=1))[:,None]
X = X / length
kmeans = KMeans(n_clusters=10, random_state=0).fit(X)
And if you need the centroids and distance matrix do:
len_ = np.sqrt(np.square(kmeans.cluster_centers_).sum(axis=1)[:,None])
centers = kmeans.cluster_centers_ / len_
dist = 1 - np.dot(centers, X.T) # K x N matrix of cosine distances