I have had the gensim Word2Vec implementation compute some word embeddings for me. Everything went quite fantastically as far as I can tell; now I am clustering the word vectors created, hoping to get some semantic groupings.
As a next step, I would like to look at the words (rather than the vectors) contained in each cluster. I.e. if I have the vector of embeddings [x, y, z]
, I would like to find out which actual word this vector represents. I can get the words/Vocab items by calling model.vocab
and the word vectors through model.syn0
. But I could not find a location where these are explicitly matched.
This was more complicated than I expected and I feel I might be missing the obvious way of doing it. Any help is appreciated!
Match words to embedding vectors created by Word2Vec ()
-- how do I do it?
After creating the model (code below*), I would now like to match the indexes assigned to each word (during the build_vocab()
phase) to the vector matrix outputted as model.syn0
.
Thus
for i in range (0, newmod.syn0.shape[0]): #iterate over all words in model
print i
word= [k for k in newmod.vocab if newmod.vocab[k].__dict__['index']==i] #get the word out of the internal dicationary by its index
wordvector= newmod.syn0[i] #get the vector with the corresponding index
print wordvector == newmod[word] #testing: compare result of looking up the word in the model -- this prints True
Is there a better way of doing this, e.g. by feeding the vector into the model to match the word?
Does this even get me correct results?
*My code to create the word vectors:
model = Word2Vec(size=1000, min_count=5, workers=4, sg=1)
model.build_vocab(sentencefeeder(folderlist)) #sentencefeeder puts out sentences as lists of strings
model.save("newmodel")
I found this question which is similar but has not really been answered.
I have been searching for a long time to find the mapping between the syn0 matrix and the vocabulary... here is the answer : use model.index2word
which is simply the list of words in the right order !
This is not in the official documentation (why ?) but it can be found directly inside the source code : https://github.com/RaRe-Technologies/gensim/blob/3b9bb59dac0d55a1cd6ca8f984cead38b9cb0860/gensim/models/word2vec.py#L441