Clustering cosine similarity matrix

Stefan D picture Stefan D · May 7, 2015 · Viewed 17k times · Source

A few questions on stackoverflow mention this problem, but I haven't found a concrete solution.

I have a square matrix which consists of cosine similarities (values between 0 and 1), for example:

  |  A  |  B  |  C  |  D
A | 1.0 | 0.1 | 0.6 |  0.4
B | 0.1 | 1.0 | 0.1 |  0.2
C | 0.6 | 0.1 | 1.0 |  0.7
D | 0.4 | 0.2 | 0.7 |  1.0

The square matrix can be of any size. I want to get clusters (I don't know how many) which maximize the values between the elements in the cluster. I.e. for the above example I should get two clusters:

  1. B
  2. A, C, D

The reason being because C & D have the highest value between them, and A & C also have the highest value between them.

An item can be in only one cluster.

Recall is not that important for this problem, but precision is very important. It is acceptable to output three clusters: 1) B, 2) A, 3) C, D . But it is not acceptable to output any solution where B is in a cluster with another element.

I think the diagonal (1.0) is confusing me. My data is guaranteed to have at least one cluster of 2+ elements, and I want to find as many clusters as possible without sacrificing precision.

I will have to implement this in Python.

Answer

Ashkan picture Ashkan · May 7, 2015

You can easily do this using spectral clustering. You can use the ready implementations such as the one in sklearn or implement it yourself. It is rather an easy algorithm.

Here is a piece of code doing it in python using sklearn:

import numpy as np
from sklearn.cluster import SpectralClustering
mat = np.matrix([[1.,.1,.6,.4],[.1,1.,.1,.2],[.6,.1,1.,.7],[.4,.2,.7,1.]])
SpectralClustering(2).fit_predict(mat)
>>> array([0, 1, 0, 0], dtype=int32)

As you can see it returns the clustering you have mentioned.

The algorithm takes the top k eigenvectors of the input matrix corresponding to the largest eigenvalues, then runs the k-mean algorithm on the new matrix. Here is a simple code that does this for your matrix:

from sklearn.cluster import KMeans
eigen_values, eigen_vectors = np.linalg.eigh(mat)
KMeans(n_clusters=2, init='k-means++').fit_predict(eigen_vectors[:, 2:4])
>>> array([0, 1, 0, 0], dtype=int32)

Note that the implementation of the algorithm in the sklearn library may differ from mine. The example I gave is the simplest way of doing it. There are some good tutorial available online describing the spectral clustering algorithm in depth.

For the cases you want the algorithm to figure out the number of clusters by itself, you can use Density Based Clustering Algorithms like DBSCAN:

from sklearn.cluster import DBSCAN
DBSCAN(min_samples=1).fit_predict(mat)
array([0, 1, 2, 2])