On this page, it is said that:
[...] skip-gram inverts contexts and targets, and tries to predict each context word from its target word [...]
However, looking at the training dataset it produces, the content of the X and Y pairs seems to be interchangeable, as in these two (X, Y) pairs:
(quick, brown), (brown, quick)
So, why distinguish so much between contexts and targets if they are the same thing in the end?
Also, while doing Udacity's Deep Learning course exercise on word2vec, I wonder why they draw such a sharp distinction between these two approaches in this problem:
An alternative to skip-gram is another Word2Vec model called CBOW (Continuous Bag of Words). In the CBOW model, instead of predicting a context word from a word vector, you predict a word from the sum of all the word vectors in its context. Implement and evaluate a CBOW model trained on the text8 dataset.
Would this not yield the same results?
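For reference, here is a small sketch of how I understand the training examples are built (a toy sentence, a window of 1, and my own helper code, not the course's); it shows both the symmetric-looking skip-gram pairs and the grouped CBOW examples:

```python
# Toy corpus and a context window of 1 on each side (my own choice).
corpus = ["the", "quick", "brown", "fox"]
window = 1

skipgram_pairs = []   # (target, context) pairs, one per neighbouring word
cbow_examples = []    # (all context words, target), one per position

for i, target in enumerate(corpus):
    context = [corpus[j]
               for j in range(max(0, i - window), min(len(corpus), i + window + 1))
               if j != i]
    for c in context:
        skipgram_pairs.append((target, c))
    cbow_examples.append((context, target))

print(skipgram_pairs)
# [('the', 'quick'), ('quick', 'the'), ('quick', 'brown'),
#  ('brown', 'quick'), ('brown', 'fox'), ('fox', 'brown')]
print(cbow_examples)
# [(['quick'], 'the'), (['the', 'brown'], 'quick'),
#  (['quick', 'fox'], 'brown'), (['brown'], 'fox')]
```

The skip-gram pair list is symmetric as a set, which is exactly what prompted my question.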
Here is my oversimplified and rather naive understanding of the difference:
As we know, CBOW learns to predict the word from its context, i.e. it maximizes the probability of the target word given the surrounding words. This happens to be a problem for rare words. For example, given the context "yesterday was a really [...] day", the CBOW model will tell you that the word is most probably "beautiful" or "nice". A word like "delightful" will get much less attention from the model, because the model is designed to predict the most probable word; the rare word is smoothed over many examples containing more frequent words.

On the other hand, the skip-gram model is designed to predict the context. Given the word "delightful", it must understand it and tell us that there is a huge probability that the context is "yesterday was really [...] day", or some other relevant context. With skip-gram, the word "delightful" does not have to compete with the word "beautiful"; instead, "delightful"+context pairs are treated as new observations.
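A rough numpy sketch of where this difference lives in the forward pass (my own simplification, with made-up 3-dimensional vectors and no output layer or training loop):

```python
import numpy as np

# Toy embedding table (made-up vectors, not trained ones).
emb = {
    "yesterday":  np.array([0.1, 0.3, 0.0]),
    "was":        np.array([0.2, 0.1, 0.1]),
    "a":          np.array([0.0, 0.2, 0.2]),
    "really":     np.array([0.3, 0.0, 0.1]),
    "day":        np.array([0.1, 0.1, 0.3]),
    "delightful": np.array([0.9, 0.8, 0.7]),
}

context = ["yesterday", "was", "a", "really", "day"]

# CBOW: one hidden vector per example, the average of all context vectors.
# The prediction for the missing word is driven by this single averaged
# vector, which is where frequent fillers like "beautiful"/"nice" win out.
h_cbow = np.mean([emb[w] for w in context], axis=0)

# Skip-gram: the word's own vector is the hidden layer, and each
# (delightful, context word) pair is a separate training example, so
# "delightful" gets its own gradient updates instead of being averaged away.
h_skipgram = emb["delightful"]

print(h_cbow, h_skipgram)
```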
UPDATE
Thanks to @0xF for sharing this article.
According to Mikolov:
Skip-gram: works well with small amount of the training data, represents well even rare words or phrases.
CBOW: several times faster to train than the skip-gram, slightly better accuracy for the frequent words
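In practice, the two architectures are usually just a switch in the same implementation. A minimal sketch with gensim (parameter names are from gensim 4.x and may differ slightly in older versions):

```python
from gensim.models import Word2Vec

# Toy corpus; any iterable of tokenized sentences works.
sentences = [["yesterday", "was", "a", "really", "delightful", "day"],
             ["the", "quick", "brown", "fox", "jumps"]]

# sg=1 selects skip-gram; sg=0 (the default) selects CBOW.
skipgram = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1)
cbow     = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=0)

print(skipgram.wv["delightful"][:5])  # learned vector for a (here, rare) word
print(cbow.wv["delightful"][:5])
```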
One more addition to the subject can be found here:
In the "skip-gram" mode alternative to "CBOW", rather than averaging the context words, each is used as a pairwise training example. That is, in place of one CBOW example such as [predict 'ate' from average('The', 'cat', 'the', 'mouse')], the network is presented with four skip-gram examples [predict 'ate' from 'The'], [predict 'ate' from 'cat'], [predict 'ate' from 'the'], [predict 'ate' from 'mouse']. (The same random window-reduction occurs, so half the time that would just be two examples, of the nearest words.)