How to use Gensim doc2vec with pre-trained word vectors?

Stergios picture Stergios · Dec 14, 2014 · Viewed 37k times · Source

I recently came across the doc2vec addition to Gensim. How can I use pre-trained word vectors (e.g. found in word2vec original website) with doc2vec?

Or is doc2vec getting the word vectors from the same sentences it uses for paragraph-vector training?

Thanks.

Answer

gojomo picture gojomo · May 20, 2015

Note that the "DBOW" (dm=0) training mode doesn't require or even create word-vectors as part of the training. It merely learns document vectors that are good at predicting each word in turn (much like the word2vec skip-gram training mode).

(Before gensim 0.12.0, there was the parameter train_words mentioned in another comment, which some documentation suggested will co-train words. However, I don't believe this ever actually worked. Starting in gensim 0.12.0, there is the parameter dbow_words, which works to skip-gram train words simultaneous with DBOW doc-vectors. Note that this makes training take longer – by a factor related to window. So if you don't need word-vectors, you may still leave this off.)

In the "DM" training method (dm=1), word-vectors are inherently trained during the process along with doc-vectors, and are likely to also affect the quality of the doc-vectors. It's theoretically possible to pre-initialize the word-vectors from prior data. But I don't know any strong theoretical or experimental reason to be confident this would improve the doc-vectors.

One fragmentary experiment I ran along these lines suggested the doc-vector training got off to a faster start – better predictive qualities after the first few passes – but this advantage faded with more passes. Whether you hold the word vectors constant or let them continue to adjust with the new training is also likely an important consideration... but which choice is better may depend on your goals, data set, and the quality/relevance of the pre-existing word-vectors.

(You could repeat my experiment with the intersect_word2vec_format() method available in gensim 0.12.0, and try different levels of making pre-loaded vectors resistant-to-new-training via the syn0_lockf values. But remember this is experimental territory: the basic doc2vec results don't rely on, or even necessarily improve with, reused word vectors.)