What's the fastest way in Python to calculate cosine similarity given sparse matrix data?

Question 1

What's the fastest way in Python to calculate cosine similarity given sparse matrix data?

python numpy pandas similarity cosine-similarity

zbinsd · Jul 13, 2013 · Viewed 117.9k times · Source

Answer

Answer

You can compute pairwise cosine similarity on the rows of a sparse matrix directly using sklearn. As of version 0.17 it also supports sparse output:

from sklearn.metrics.pairwise import cosine_similarity
from scipy import sparse

A =  np.array([[0, 1, 0, 0, 1], [0, 0, 1, 1, 1],[1, 1, 0, 1, 0]])
A_sparse = sparse.csr_matrix(A)

similarities = cosine_similarity(A_sparse)
print('pairwise dense output:\n {}\n'.format(similarities))

#also can output sparse matrices
similarities_sparse = cosine_similarity(A_sparse,dense_output=False)
print('pairwise sparse output:\n {}\n'.format(similarities_sparse))

Results:

pairwise dense output:
[[ 1.          0.40824829  0.40824829]
[ 0.40824829  1.          0.33333333]
[ 0.40824829  0.33333333  1.        ]]

pairwise sparse output:
(0, 1)  0.408248290464
(0, 2)  0.408248290464
(0, 0)  1.0
(1, 0)  0.408248290464
(1, 2)  0.333333333333
(1, 1)  1.0
(2, 1)  0.333333333333
(2, 0)  0.408248290464
(2, 2)  1.0

If you want column-wise cosine similarities simply transpose your input matrix beforehand:

A_sparse.transpose()

Question 2

Given a sparse matrix listing, what's the best way to calculate the cosine similarity between each of the columns (or rows) in the matrix? I would rather not iterate n-choose-two times.

Say the input matrix is:

The sparse representation is:

A = 
0, 1
0, 4
1, 2
1, 3
1, 4
2, 0
2, 1
2, 3

In Python, it's straightforward to work with the matrix-input format:

import numpy as np
from sklearn.metrics import pairwise_distances
from scipy.spatial.distance import cosine

A = np.array(
[[0, 1, 0, 0, 1],
[0, 0, 1, 1, 1],
[1, 1, 0, 1, 0]])

dist_out = 1-pairwise_distances(A, metric="cosine")
dist_out

Gives:

array([[ 1.        ,  0.40824829,  0.40824829],
       [ 0.40824829,  1.        ,  0.33333333],
       [ 0.40824829,  0.33333333,  1.        ]])

That's fine for a full-matrix input, but I really want to start with the sparse representation (due to the size and sparsity of my matrix). Any ideas about how this could best be accomplished? Thanks in advance.

What's the fastest way in Python to calculate cosine similarity given sparse matrix data?

Answer

Related questions