Interpreting negative Word2Vec similarity from gensim

alvas · Feb 22, 2017

For example, we train a Word2Vec model using gensim:

from gensim import corpora, models, similarities
from gensim.models.word2vec import Word2Vec

documents = ["Human machine interface for lab abc computer applications",
              "A survey of user opinion of computer system response time",
              "The EPS user interface management system",
              "System and human system engineering testing of EPS",
              "Relation of user perceived response time to error measurement",
              "The generation of random binary unordered trees",
              "The intersection graph of paths in trees",
              "Graph minors IV Widths of trees and well quasi ordering",
              "Graph minors A survey"]

texts = [[word for word in document.lower().split()] for document in documents]
w2v_model = Word2Vec(texts, size=500, window=5, min_count=1)

And when we query the similarity between words, we find negative similarity scores:

>>> w2v_model.similarity('graph', 'computer')
0.046929569156789336
>>> w2v_model.similarity('graph', 'system')
0.063683518562347399
>>> w2v_model.similarity('survey', 'generation')
-0.040026775040430063
>>> w2v_model.similarity('graph', 'trees')
-0.0072684112978664561

How do we interpret the negative scores?

If it's a cosine similarity, shouldn't the range be [0,1]?

What is the upper bound and lower bound of the Word2Vec.similarity(x,y) function? There isn't much written in the docs: https://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.Word2Vec.similarity =(

Looking at the Python wrapper code, there isn't much there either: https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/models/word2vec.py#L1165

(If possible, please point me to the .pyx code where the similarity function is implemented.)

Answer

Eugene K · Mar 1, 2017

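Cosine similarity ranges from -1 to 1, the same as a regular cosine wave. It is the cosine of the angle between the two word vectors, so a negative score means the vectors point more than 90 degrees apart. The [0,1] range only holds when every vector component is non-negative (for example, word-count or tf-idf vectors); Word2Vec embeddings have both positive and negative components.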

[Image: cosine wave]
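
To make the range concrete, here is a small sketch that evaluates the cosine at a few angles. The helper name `cosine_of_angle` is made up for this illustration and is not part of gensim.

```python
import math

def cosine_of_angle(degrees):
    """Cosine of an angle given in degrees (hypothetical helper)."""
    return math.cos(math.radians(degrees))

# Angles below 90 degrees give positive values, 90 gives roughly zero,
# and angles above 90 degrees give negative values, down to -1 at 180.
for angle in (0, 60, 90, 120, 180):
    print(angle, round(cosine_of_angle(angle), 3))
```

Read this way, a score such as -0.04 just means the two vectors sit slightly past a right angle from each other.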

As for the source:

https://github.com/RaRe-Technologies/gensim/blob/ba1ce894a5192fc493a865c535202695bb3c0424/gensim/models/word2vec.py#L1511

def similarity(self, w1, w2):
    """
    Compute cosine similarity between two words.
    Example::
      >>> trained_model.similarity('woman', 'man')
      0.73723527
      >>> trained_model.similarity('woman', 'woman')
      1.0
    """
    return dot(matutils.unitvec(self[w1]), matutils.unitvec(self[w2]))
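
The same computation can be sketched in plain Python, assuming `unitvec` does nothing more than scale a vector to length 1. The names `unit_vector` and `cosine_similarity` below are hypothetical stand-ins written for this example, not gensim functions, and the sample vectors are invented.

```python
import math

def unit_vector(vec):
    """Scale vec to length 1 (hypothetical stand-in for matutils.unitvec)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)  # assumption: leave an all-zero vector unchanged
    return [x / norm for x in vec]

def cosine_similarity(a, b):
    """Dot product of the two unit vectors, i.e. the cosine of their angle."""
    return sum(x * y for x, y in zip(unit_vector(a), unit_vector(b)))

# Non-negative components (e.g. word counts) cannot produce a negative score.
print(cosine_similarity([1.0, 2.0, 0.0], [0.0, 1.0, 3.0]))

# Mixed-sign components, as in dense word embeddings, can.
print(cosine_similarity([1.0, -2.0, 0.5], [-1.0, 1.5, 0.2]))

# Exactly opposite vectors reach the lower bound of the range.
print(cosine_similarity([1.0, 2.0], [-1.0, -2.0]))
```

Since both inputs are normalized before the dot product, the result stays within [-1, 1] up to floating-point rounding, which would be the bounds of `similarity(x, y)` under the assumption above.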