I am trying to train a word2vec model on very short phrases (5-grams). Since each sentence/example is very short, I believe the window size I can use is at most 2. I am trying to understand the implications of such a small window size on the quality of the learned model, so that I can tell whether my model has learned something meaningful. I tried training a word2vec model on 5-grams, but the learned model does not appear to capture semantics very well.
I am using the following test to evaluate the accuracy of the model: https://code.google.com/p/word2vec/source/browse/trunk/questions-words.txt
I used gensim.Word2Vec to train a model, and here is a snippet of my accuracy scores (using a window size of 2):
[{'correct': 2, 'incorrect': 304, 'section': 'capital-common-countries'},
{'correct': 2, 'incorrect': 453, 'section': 'capital-world'},
{'correct': 0, 'incorrect': 86, 'section': 'currency'},
{'correct': 2, 'incorrect': 703, 'section': 'city-in-state'},
{'correct': 123, 'incorrect': 183, 'section': 'family'},
{'correct': 21, 'incorrect': 791, 'section': 'gram1-adjective-to-adverb'},
{'correct': 8, 'incorrect': 544, 'section': 'gram2-opposite'},
{'correct': 284, 'incorrect': 976, 'section': 'gram3-comparative'},
{'correct': 67, 'incorrect': 863, 'section': 'gram4-superlative'},
{'correct': 41, 'incorrect': 951, 'section': 'gram5-present-participle'},
{'correct': 6, 'incorrect': 1089, 'section': 'gram6-nationality-adjective'},
{'correct': 171, 'incorrect': 1389, 'section': 'gram7-past-tense'},
{'correct': 56, 'incorrect': 936, 'section': 'gram8-plural'},
{'correct': 52, 'incorrect': 705, 'section': 'gram9-plural-verbs'},
{'correct': 835, 'incorrect': 9973, 'section': 'total'}]
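For reference, here is a minimal sketch of how I produce scores like the above with gensim. The corpus path 'phrases.txt' is a placeholder for my data (one 5-gram per line), and the API varies by gensim version: evaluate_word_analogies needs gensim >= 3.4 (older versions used model.accuracy instead), and vector_size was called size before gensim 4.0:

from gensim.models import Word2Vec

# 'phrases.txt' is a placeholder: one short phrase (5-gram) per line
sentences = [line.split() for line in open('phrases.txt', encoding='utf-8')]

# window=2 is the setting used for the scores above
model = Word2Vec(sentences, vector_size=100, window=2, min_count=5, workers=4)

# questions-words.txt is the Google analogy test set linked above;
# each section dict holds lists of correctly/incorrectly answered analogies
score, sections = model.wv.evaluate_word_analogies('questions-words.txt')
for s in sections:
    print({'section': s['section'],
           'correct': len(s['correct']),
           'incorrect': len(s['incorrect'])})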
I also tried running the demo-word-accuracy.sh script outlined here with a window size of 2, and got poor accuracy as well:
Sample output:
capital-common-countries:
ACCURACY TOP1: 19.37 % (98 / 506)
Total accuracy: 19.37 % Semantic accuracy: 19.37 % Syntactic accuracy: -nan %
capital-world:
ACCURACY TOP1: 10.26 % (149 / 1452)
Total accuracy: 12.61 % Semantic accuracy: 12.61 % Syntactic accuracy: -nan %
currency:
ACCURACY TOP1: 6.34 % (17 / 268)
Total accuracy: 11.86 % Semantic accuracy: 11.86 % Syntactic accuracy: -nan %
city-in-state:
ACCURACY TOP1: 11.78 % (185 / 1571)
Total accuracy: 11.83 % Semantic accuracy: 11.83 % Syntactic accuracy: -nan %
family:
ACCURACY TOP1: 57.19 % (175 / 306)
Total accuracy: 15.21 % Semantic accuracy: 15.21 % Syntactic accuracy: -nan %
gram1-adjective-to-adverb:
ACCURACY TOP1: 6.48 % (49 / 756)
Total accuracy: 13.85 % Semantic accuracy: 15.21 % Syntactic accuracy: 6.48 %
gram2-opposite:
ACCURACY TOP1: 17.97 % (55 / 306)
Total accuracy: 14.09 % Semantic accuracy: 15.21 % Syntactic accuracy: 9.79 %
gram3-comparative:
ACCURACY TOP1: 34.68 % (437 / 1260)
Total accuracy: 18.13 % Semantic accuracy: 15.21 % Syntactic accuracy: 23.30 %
gram4-superlative:
ACCURACY TOP1: 14.82 % (75 / 506)
Total accuracy: 17.89 % Semantic accuracy: 15.21 % Syntactic accuracy: 21.78 %
gram5-present-participle:
ACCURACY TOP1: 19.96 % (198 / 992)
Total accuracy: 18.15 % Semantic accuracy: 15.21 % Syntactic accuracy: 21.31 %
gram6-nationality-adjective:
ACCURACY TOP1: 35.81 % (491 / 1371)
Total accuracy: 20.76 % Semantic accuracy: 15.21 % Syntactic accuracy: 25.14 %
gram7-past-tense:
ACCURACY TOP1: 19.67 % (262 / 1332)
Total accuracy: 20.62 % Semantic accuracy: 15.21 % Syntactic accuracy: 24.02 %
gram8-plural:
ACCURACY TOP1: 35.38 % (351 / 992)
Total accuracy: 21.88 % Semantic accuracy: 15.21 % Syntactic accuracy: 25.52 %
gram9-plural-verbs:
ACCURACY TOP1: 20.00 % (130 / 650)
Total accuracy: 21.78 % Semantic accuracy: 15.21 % Syntactic accuracy: 25.08 %
Questions seen / total: 12268 19544 62.77 %
However, the word2vec site claims it is possible to obtain an accuracy of ~60% on these tasks. Hence I would like some insight into the effect of hyperparameters such as window size, and how they affect the quality of the learned models.
Very low scores on the analogy questions are more likely due to limitations in the amount or quality of your training data than to mistuned parameters. (If your training phrases are really only 5 words each, they may not capture the same rich relations as can be discovered from datasets with full sentences.)
You could use a window of 5 on your phrases – the training code trims the window to what's available on either side – but then every word of each phrase affects all of the other words. That might be OK: one of the Google word2vec papers ("Distributed Representations of Words and Phrases and their Compositionality") mentions that to get the best accuracy on one of their phrase tasks, they used "the entire sentence for the context". (On the other hand, on one English corpus of short messages, I found a window size of just 2 created the vectors that scored best on the analogies-evaluation, so larger isn't necessarily better.)
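As a quick sanity check, you could train one model per window size and compare analogy scores directly. A rough sketch, with the same placeholder corpus file as above (evaluate_word_analogies needs gensim >= 3.4):

from gensim.models import Word2Vec

# 'phrases.txt' is a placeholder: one short phrase per line
sentences = [line.split() for line in open('phrases.txt', encoding='utf-8')]

for win in (2, 5):
    # with 5-word phrases, window=5 effectively makes every word
    # part of every other word's context
    m = Word2Vec(sentences, vector_size=100, window=win, min_count=5, workers=4)
    score, _ = m.wv.evaluate_word_analogies('questions-words.txt')
    print('window=%d -> analogy accuracy %.4f' % (win, score))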
A paper by Levy & Goldberg, "Dependency-Based Word Embeddings", speaks a bit about the qualitative effect of window size:
https://levyomer.files.wordpress.com/2014/04/dependency-based-word-embeddings-acl-2014.pdf
They find:
Larger windows tend to capture more topic/domain information: what other words (of any type) are used in related discussions? Smaller windows tend to capture more about the word itself: what other words are functionally similar? (Their own extension, the dependency-based embeddings, seems best at finding most-similar words, synonyms or obvious alternatives that could drop in as replacements for the original word.)
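If you want to see that topical-vs-functional contrast in your own data, one rough check is to compare nearest neighbours from a small-window and a large-window model. Again, 'phrases.txt' is a placeholder corpus file and the probe word is hypothetical; pick something frequent in your vocabulary:

from gensim.models import Word2Vec

sentences = [line.split() for line in open('phrases.txt', encoding='utf-8')]
small = Word2Vec(sentences, vector_size=100, window=2, min_count=5)
large = Word2Vec(sentences, vector_size=100, window=5, min_count=5)

probe = 'good'  # hypothetical probe word; replace with a frequent word in your corpus
print('window=2:', small.wv.most_similar(probe, topn=5))  # expect more functional/drop-in neighbours
print('window=5:', large.wv.most_similar(probe, topn=5))  # expect more topical/domain neighbours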