Clustering words based on Distance Matrix

user2115183 · Apr 27, 2013 · Viewed 25.8k times

My objective is to cluster words based on how similar they are with respect to a corpus of text documents. I have computed the Jaccard similarity between every pair of words, so I effectively have a sparse distance matrix available. Can anyone point me to a clustering algorithm (and possibly a Python library for it) that takes a distance matrix as input? I also do not know the number of clusters beforehand. I only want to cluster these words and find out which words end up together.

Answer

Andreas Mueller · Apr 27, 2013

You can use most algorithms in scikit-learn with a precomputed distance matrix. Unfortunately, many algorithms need the number of clusters. DBSCAN is the only one that doesn't need the number of clusters and also accepts arbitrary distance matrices. You could also try MeanShift, but that will interpret the distances as coordinates - which might also work.
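
To make the DBSCAN suggestion concrete, here is a minimal sketch. The word list, the similarity values, and the `eps` / `min_samples` settings are made up for illustration and would need tuning on real data. It assumes the Jaccard similarities are turned into distances as 1 - similarity, and it uses a small dense matrix for readability.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical toy data: pairwise Jaccard similarities between six words.
words = ["cat", "kitten", "dog", "car", "truck", "banana"]
similarity = np.array([
    [1.00, 0.80, 0.60, 0.10, 0.10, 0.05],
    [0.80, 1.00, 0.60, 0.10, 0.10, 0.05],
    [0.60, 0.60, 1.00, 0.10, 0.10, 0.05],
    [0.10, 0.10, 0.10, 1.00, 0.70, 0.05],
    [0.10, 0.10, 0.10, 0.70, 1.00, 0.05],
    [0.05, 0.05, 0.05, 0.05, 0.05, 1.00],
])

# Assumption: Jaccard distance = 1 - Jaccard similarity.
distance = 1.0 - similarity

# metric="precomputed" tells DBSCAN that the input is already a distance matrix.
# eps and min_samples are illustrative values, not recommendations.
model = DBSCAN(eps=0.5, min_samples=2, metric="precomputed")
labels = model.fit_predict(distance)

# Group words by cluster label; DBSCAN uses the label -1 for noise points.
clusters = {}
for word, label in zip(words, labels):
    clusters.setdefault(int(label), []).append(word)

print(clusters)
```

On this toy matrix the animal words and the vehicle words form two clusters, and "banana" is labelled -1 (noise) because no other word lies within `eps` of it.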

There is also affinity propagation, but I haven't really seen that work well. If you want many clusters, it might be helpful, though.
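
For completeness, a rough sketch of affinity propagation on a precomputed matrix. Note that scikit-learn's `AffinityPropagation` with `affinity="precomputed"` expects similarities rather than distances, so the Jaccard similarities can be passed directly. The words and values below are invented for illustration, and the result depends on the `preference` setting (left at its default here), so treat the output as something to inspect rather than rely on.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation

# Hypothetical toy data: pairwise Jaccard similarities between six words.
ap_words = ["cat", "kitten", "dog", "car", "truck", "banana"]
ap_similarity = np.array([
    [1.00, 0.80, 0.60, 0.10, 0.10, 0.05],
    [0.80, 1.00, 0.60, 0.10, 0.10, 0.05],
    [0.60, 0.60, 1.00, 0.10, 0.10, 0.05],
    [0.10, 0.10, 0.10, 1.00, 0.70, 0.05],
    [0.10, 0.10, 0.10, 0.70, 1.00, 0.05],
    [0.05, 0.05, 0.05, 0.05, 0.05, 1.00],
])

# affinity="precomputed" means the input is a similarity matrix
# (larger values = more alike), not a distance matrix.
ap_model = AffinityPropagation(affinity="precomputed")
ap_labels = ap_model.fit_predict(ap_similarity)

# Print each word next to the cluster label it was assigned.
for word, label in zip(ap_words, ap_labels):
    print(word, label)
```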

disclosure: I'm a scikit-learn core dev.