Clustering with cosine similarity

user1473883 · Jun 22, 2012 · Viewed 11.5k times

I have a large data set that I would like to cluster. My trial run set size is 2,500 objects; when I run it on the 'real deal' I will need to handle at least 20k objects.

These objects have a cosine similarity between them. Cosine similarity is not a proper mathematical distance metric: even when converted to a distance (1 - similarity), it does not satisfy the triangle inequality.
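A quick way to see the triangle-inequality failure is the minimal sketch below. The three 2-D vectors are purely illustrative (they are not from my data set), and "cosine distance" is taken to mean 1 - cosine similarity. For comparison it also computes the angular distance (the arccosine of the similarity), which does behave like a metric on these points.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cosine_distance(u, v):
    """1 - cosine similarity; not a true metric."""
    return 1.0 - cosine_similarity(u, v)

def angular_distance(u, v):
    """Angle in radians; clamp to [-1, 1] to guard against rounding."""
    s = max(-1.0, min(1.0, cosine_similarity(u, v)))
    return math.acos(s)

# Illustrative vectors: b sits halfway (by angle) between a and c.
a, b, c = (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)

direct = cosine_distance(a, c)
detour = cosine_distance(a, b) + cosine_distance(b, c)

# The triangle inequality would require direct <= detour.
print("cosine distance: direct =", direct, "detour =", detour)
print("triangle inequality violated:", direct > detour)

ang_direct = angular_distance(a, c)
ang_detour = angular_distance(a, b) + angular_distance(b, c)
print("angular distance: direct =", ang_direct, "detour =", ang_detour)
```

Here the direct cosine distance from a to c is larger than the detour through b, so any algorithm that relies on the triangle inequality cannot be trusted with this measure.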

I would like to cluster them in some "natural" way that puts similar objects together without needing to specify beforehand the number of clusters I expect.

Does anyone know of an algorithm that would do that? Really, I'm just looking for any algorithm that doesn't require a) a distance metric and b) a pre-specified number of clusters.

Many thanks!

This question has been asked before: "Clustering from the cosine similarity values" (but that solution only offers k-means clustering) and "Effective clustering of a similarity matrix" (but that solution was rather vague).

Answer

Alex Wilson · Jun 22, 2012

Apache Mahout has a number of clustering algorithms, including some that don't require you to specify the number of clusters and that let you supply your own distance measure.

Mean shift clustering is similar to k-means but does not need a pre-specified number of clusters: https://cwiki.apache.org/confluence/display/MAHOUT/Mean+Shift+Clustering
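To give a feel for the idea, here is a minimal sketch of a mean-shift-style procedure that works directly on cosine similarity. It is not Mahout's implementation: it assumes a flat kernel, and the `min_sim` threshold (playing the role of the bandwidth), the function names, and the sample points are all hypothetical choices for illustration. Each point is moved repeatedly to the normalized mean of its neighbours until it stops moving, and points whose final positions coincide end up in the same cluster.

```python
import math

def normalize(v):
    """Scale a non-zero vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return tuple(x / n for x in v)

def cos_sim(u, v):
    """Cosine similarity of two unit-length vectors (a plain dot product)."""
    return sum(a * b for a, b in zip(u, v))

def mean_shift_cosine(points, min_sim=0.9, max_iter=100, tol=1e-9):
    """Flat-kernel mean shift on the unit sphere (illustrative sketch)."""
    pts = [normalize(p) for p in points]
    modes = []
    for p in pts:
        mode = p
        for _ in range(max_iter):
            # Neighbours: all points at least min_sim similar to the mode.
            neighbours = [q for q in pts if cos_sim(mode, q) >= min_sim]
            if not neighbours:
                break
            mean = [sum(col) for col in zip(*neighbours)]
            new_mode = normalize(mean)
            converged = 1.0 - cos_sim(mode, new_mode) < tol
            mode = new_mode
            if converged:
                break
        modes.append(mode)

    # Merge modes that ended up close together; each distinct mode is a cluster.
    centers, labels = [], []
    for mode in modes:
        for i, c in enumerate(centers):
            if cos_sim(mode, c) >= min_sim:
                labels.append(i)
                break
        else:
            centers.append(mode)
            labels.append(len(centers) - 1)
    return labels, centers

# Two illustrative groups: one near the x-axis, one near the y-axis.
points = [(1.0, 0.1), (0.9, 0.0), (1.0, 0.2),
          (0.1, 1.0), (0.0, 0.8), (0.2, 1.0)]
labels, centers = mean_shift_cosine(points, min_sim=0.9)
print(labels)  # [0, 0, 0, 1, 1, 1]
```

The number of clusters falls out of the threshold rather than being given up front. Note that this naive version compares every mode against every point on each iteration, so for 20k objects you would likely want a library implementation or some form of neighbour indexing.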

More generally, if you would like to try a variety of algorithms, there is a wealth of sophisticated clustering packages available for R, including a few variational Bayesian implementations of EM that will select the best number of clusters. These have proved very useful for some of my research in the past: http://cran.r-project.org/web/views/Cluster.html