Why does Word2Vec use cosine similarity?

opus111 · Jul 17, 2016 · Viewed 7.4k times

I have been reading the papers on Word2Vec (e.g. this one), and I think I understand how the vectors are trained to maximize the probability of other words found in the same contexts.

However, I do not understand why cosine is the right measure of word similarity. Cosine similarity only tells us that two vectors point in the same direction; they could still have very different magnitudes.

For example, cosine similarity makes sense for comparing bag-of-words vectors of documents: two documents might have different lengths but similar distributions of words.

Why not, say, Euclidean distance?

Can anyone explain why cosine similarity works for Word2Vec?
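To make the bag-of-words point concrete, here is a small numpy sketch (the count vectors are invented for illustration). The two "documents" have nearly identical word distributions, so cosine similarity is close to 1, yet Euclidean distance rates them as far apart:

```python
import numpy as np

# Invented bag-of-words count vectors over the same 4-word vocabulary.
# doc_b is roughly three times longer than doc_a but has a similar
# distribution of words.
doc_a = np.array([2.0, 1.0, 0.0, 3.0])
doc_b = np.array([6.0, 3.0, 1.0, 9.0])

def cosine_similarity(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

print(cosine_similarity(doc_a, doc_b))  # ~0.996: almost the same direction
print(np.linalg.norm(doc_a - doc_b))    # ~7.55: large Euclidean distance
```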

Answer

Aaron · Jul 17, 2016

Those two metrics are probably strongly correlated, so it may not matter all that much which one you use. In fact, if you normalize the vectors to unit length, the two agree exactly on rankings: for unit vectors u and v, ||u - v||^2 = 2(1 - cos(u, v)), so Euclidean distance is a monotonically decreasing function of cosine similarity. As you point out, using cosine means we don't have to worry about the lengths of the vectors at all.
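A quick numerical check of that identity, using random vectors rather than real word2vec embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)
u = rng.normal(size=100)
v = rng.normal(size=100)

# Normalize both vectors to unit length.
u_hat = u / np.linalg.norm(u)
v_hat = v / np.linalg.norm(v)

cos = np.dot(u_hat, v_hat)
euclid_sq = np.sum((u_hat - v_hat) ** 2)

# For unit vectors: ||u - v||^2 == 2 * (1 - cos(u, v))
print(np.isclose(euclid_sq, 2.0 * (1.0 - cos)))  # True
```

So once the length information is normalized away, ranking neighbors by Euclidean distance and by cosine similarity gives the same order.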

This paper indicates that there is a relationship between a word's frequency and the length of its word2vec vector: http://arxiv.org/pdf/1508.02297v1.pdf. Normalizing the vectors, which cosine similarity does implicitly, factors that frequency effect out.
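In practice, word2vec toolkits bake this in. A minimal sketch with gensim (API as of gensim 4.x; the toy corpus and parameters are invented for illustration):

```python
from gensim.models import Word2Vec

# Toy corpus, invented purely for illustration.
sentences = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
    ["a", "dog", "chased", "a", "cat"],
    ["a", "cat", "chased", "a", "mouse"],
]

model = Word2Vec(sentences, vector_size=16, window=2, min_count=1, seed=1)

# Both calls are cosine-based: similarity() returns the cosine of the two
# word vectors, and most_similar() ranks the vocabulary by cosine.
print(model.wv.similarity("king", "queen"))
print(model.wv.most_similar("king", topn=3))
```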