Matching words and vectors in gensim Word2Vec model

patrick picture patrick · Jul 29, 2016 · Viewed 15.2k times · Source

I have had the gensim Word2Vec implementation compute some word embeddings for me. Everything went quite fantastically as far as I can tell; now I am clustering the word vectors created, hoping to get some semantic groupings.

As a next step, I would like to look at the words (rather than the vectors) contained in each cluster. I.e. if I have the vector of embeddings [x, y, z], I would like to find out which actual word this vector represents. I can get the words/Vocab items by calling model.vocab and the word vectors through model.syn0. But I could not find a location where these are explicitly matched.

This was more complicated than I expected and I feel I might be missing the obvious way of doing it. Any help is appreciated!

Problem:

Match words to embedding vectors created by Word2Vec () -- how do I do it?

My approach:

After creating the model (code below*), I would now like to match the indexes assigned to each word (during the build_vocab() phase) to the vector matrix outputted as model.syn0. Thus

for i in range (0, newmod.syn0.shape[0]): #iterate over all words in model
    print i
    word= [k for k in newmod.vocab if newmod.vocab[k].__dict__['index']==i] #get the word out of the internal dicationary by its index
    wordvector= newmod.syn0[i] #get the vector with the corresponding index
    print wordvector == newmod[word] #testing: compare result of looking up the word in the model -- this prints True
  • Is there a better way of doing this, e.g. by feeding the vector into the model to match the word?

  • Does this even get me correct results?

*My code to create the word vectors:

model = Word2Vec(size=1000, min_count=5, workers=4, sg=1)
        
model.build_vocab(sentencefeeder(folderlist)) #sentencefeeder puts out sentences as lists of strings

model.save("newmodel")

I found this question which is similar but has not really been answered.

Answer

Arcyno picture Arcyno · Dec 15, 2016

I have been searching for a long time to find the mapping between the syn0 matrix and the vocabulary... here is the answer : use model.index2word which is simply the list of words in the right order !

This is not in the official documentation (why ?) but it can be found directly inside the source code : https://github.com/RaRe-Technologies/gensim/blob/3b9bb59dac0d55a1cd6ca8f984cead38b9cb0860/gensim/models/word2vec.py#L441