I have trained a Word2vec model using Python's Gensim library. I have a tokenized list as shown below. The vocab size is 34, but I am only showing a few of the 34 entries:
b = ['let',
'know',
'buy',
'someth',
'featur',
'mashabl',
'might',
'earn',
'affili',
'commiss',
'fifti',
'year',
'ago',
'graduat',
'21yearold',
'dustin',
'hoffman',
'pull',
'asid',
'given',
'one',
'piec',
'unsolicit',
'advic',
'percent',
'buy']
Model
model = gensim.models.Word2Vec(b,min_count=1,size=32)
print(model)
### prints: Word2Vec(vocab=34, size=32, alpha=0.025) ####
If I try to look up one of the words in the list by doing model['buy'], I get:
KeyError: "word 'buy' not in vocabulary"
Can you suggest what I am doing wrong, and how I can check the model so that it can then be used with PCA or t-SNE to visualize similar words forming a topic? Thank you.
The first parameter passed to gensim.models.Word2Vec is an iterable of sentences. Each sentence is itself a list of words. From the docs:
Initialize the model from an iterable of sentences. Each sentence is a list of words (unicode strings) that will be used for training.
Right now, it treats each word in your list b as a sentence, so it is learning Word2Vec vectors for each character in each word rather than for each word in b. As a result, only single characters can be looked up:
model = gensim.models.Word2Vec(b,min_count=1,size=32)
print(model['a'])
array([ 7.42487283e-03, -5.65282721e-03, 1.28707094e-02, ... ]
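The reason single characters end up in the vocabulary can be reproduced without gensim. The sketch below uses a hypothetical helper, collect_vocab, which only mimics the nested iteration over sentences and then tokens; it is an illustration, not gensim's actual implementation.

```python
def collect_vocab(sentences):
    """Collect tokens by iterating over sentences, then over tokens in each sentence.

    Hypothetical helper for illustration only; not part of gensim.
    """
    vocab = set()
    for sentence in sentences:
        for token in sentence:  # iterating over a str yields its characters
            vocab.add(token)
    return vocab


# A shortened version of the list from the question.
b = ['let', 'know', 'buy']

# Each string is treated as a "sentence", so its characters become the tokens.
vocab = collect_vocab(b)

print(sorted(vocab))   # ['b', 'e', 'k', 'l', 'n', 'o', 't', 'u', 'w', 'y']
print('buy' in vocab)  # False
```

Because a Python string is itself an iterable of characters, passing a flat list of strings raises no error; it just produces a vocabulary of characters.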
To get it to work for words, simply wrap b in another list so that it is interpreted correctly:
model = gensim.models.Word2Vec([b],min_count=1,size=32)
print(model['buy'])
array([-0.01331611, 0.00496594, -0.00165093, -0.01444992, 0.01393849, ... ]
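If the corpus might arrive either as a flat token list or as a list of sentences, a small guard before training can catch this mistake early. The helper name as_sentences below is hypothetical (it is not part of gensim), and it assumes that a flat list of strings is meant to be a single sentence.

```python
def as_sentences(corpus):
    """Return corpus as a list of token lists.

    Hypothetical helper, not part of gensim.
    Assumption: a flat list of strings represents one sentence.
    """
    corpus = list(corpus)
    if corpus and all(isinstance(item, str) for item in corpus):
        return [corpus]  # flat token list -> wrap as a single sentence
    return corpus        # already a list of sentences; leave as-is


b = ['let', 'know', 'buy']

sentences = as_sentences(b)
print(sentences)  # [['let', 'know', 'buy']]

# A corpus that is already a list of sentences is returned unchanged.
nested = as_sentences([['let', 'know'], ['buy']])
print(nested)     # [['let', 'know'], ['buy']]
```

The result can then be passed as the first argument to Word2Vec in place of b. Note that the calls above use the older gensim API; in gensim 4.x the size parameter was renamed to vector_size, and vectors are looked up through model.wv['buy'] rather than model['buy'].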