python: How to calculate the cosine similarity of two word lists?

gladys0313 · Mar 2, 2015 · Viewed 7.3k times

I want to calculate the cosine similarity of two lists like the following:

A = [u'home (private)', u'bank', u'bank', u'building(condo/apartment)','factory']

B = [u'home (private)', u'school', u'bank', u'shopping mall']

I know the cosine similarity of A and B should be

3/(sqrt(7)*sqrt(4)).
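
For reference, this value comes from treating each list as a vector of occurrence counts over the combined vocabulary (home (private), bank, building(condo/apartment), factory, school, shopping mall), so that A = (1, 2, 1, 1, 0, 0) and B = (1, 1, 0, 0, 1, 1). Only the two shared words contribute to the dot product:

```latex
\[
\cos\theta
  = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}
  = \frac{1 \cdot 1 + 2 \cdot 1}
         {\sqrt{1^2 + 2^2 + 1^2 + 1^2}\,\sqrt{1^2 + 1^2 + 1^2 + 1^2}}
  = \frac{3}{\sqrt{7}\,\sqrt{4}}
  \approx 0.567
\]
```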

I tried to reshape the lists into a sentence-like form such as 'home bank bank building factory'. However, some elements (e.g. home (private)) contain spaces and some contain brackets, so I find it difficult to count the word occurrences.

Do you know how to count the word occurrences in a list like this, so that for list B they can be represented as

{'home (private)': 1, 'school': 1, 'bank': 1, 'shopping mall': 1}?

Or do you know how to calculate the cosine similarity of these two lists?

Thank you very much

Answer

Hugh Bothwell · Mar 2, 2015
from collections import Counter

# word-lists to compare
a = [u'home (private)', u'bank', u'bank', u'building(condo/apartment)','factory']
b = [u'home (private)', u'school', u'bank', u'shopping mall']

# count word occurrences
a_vals = Counter(a)
b_vals = Counter(b)

# convert to word-vectors
# sorted() fixes the word order; set(...) | set(...) also works on Python 2
words  = sorted(set(a_vals) | set(b_vals))
a_vect = [a_vals.get(word, 0) for word in words]        # [2, 1, 1, 1, 0, 0]
b_vect = [b_vals.get(word, 0) for word in words]        # [1, 0, 0, 1, 1, 1]

# find cosine
len_a  = sum(av*av for av in a_vect) ** 0.5             # sqrt(7)
len_b  = sum(bv*bv for bv in b_vect) ** 0.5             # sqrt(4)
dot    = sum(av*bv for av,bv in zip(a_vect, b_vect))    # 3
cosine = dot / (len_a * len_b)                          # 0.5669467
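
If you need this for more than one pair of lists, the same steps can be wrapped in a function. The following is only a sketch: the name `cosine_similarity` is made up for illustration, and returning 0.0 for an empty list is an assumption (the cosine is undefined there), so adjust it to whatever suits your data. It skips building the explicit word vectors, because only words present in both lists contribute to the dot product.

```python
from collections import Counter
from math import sqrt


def cosine_similarity(a, b):
    """Cosine similarity of two lists of hashable items, based on occurrence counts."""
    a_vals, b_vals = Counter(a), Counter(b)

    # A Counter returns 0 for missing keys, so words absent from b add nothing.
    dot = sum(count * b_vals[item] for item, count in a_vals.items())

    len_a = sqrt(sum(c * c for c in a_vals.values()))
    len_b = sqrt(sum(c * c for c in b_vals.values()))

    # Assumption: an empty list has no direction, so report 0.0 instead of
    # raising ZeroDivisionError.
    if len_a == 0 or len_b == 0:
        return 0.0
    return dot / (len_a * len_b)


a = [u'home (private)', u'bank', u'bank', u'building(condo/apartment)', 'factory']
b = [u'home (private)', u'school', u'bank', u'shopping mall']

print(cosine_similarity(a, b))
```

For the two lists from the question this prints the same value as above, 3/(sqrt(7)*sqrt(4)), which is roughly 0.567.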