word2vec lemmatization of corpus before training

Luca Fiaschi · May 26, 2014 · Viewed 12k times

Word2vec seems to be mostly trained on raw corpus data. However, lemmatization is a standard preprocessing step for many semantic-similarity tasks. I was wondering if anybody has experience lemmatizing the corpus before training word2vec, and whether it is a useful preprocessing step.

Answer

Daniel · May 27, 2014

I think it really depends on what you want to solve with this; whether lemmatization helps is task-dependent.

Essentially, lemmatization shrinks the input space by collapsing inflected forms into a single entry, which can help if you don't have enough training data.

But since word2vec is typically trained on fairly large corpora, if you have enough training data, lemmatization shouldn't gain you much.
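For what it's worth, here is a minimal sketch (my own illustration, not part of the answer) of what that preprocessing could look like, using NLTK's WordNetLemmatizer and gensim's Word2Vec. Parameter names follow gensim 4.x, and NLTK's punkt and wordnet data must be downloaded beforehand:

```python
# Minimal sketch: lemmatize a toy corpus with NLTK, then train gensim Word2Vec.
# Assumes nltk.download("punkt") and nltk.download("wordnet") have been run.
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from gensim.models import Word2Vec

lemmatizer = WordNetLemmatizer()

corpus = [
    "The cats were chasing mice in the gardens",
    "A cat chases a mouse in the garden",
]

# Lemmatize every token; "cats" and "cat" collapse to the same vocabulary
# entry, so each surface form contributes to a single vector.
sentences = [
    [lemmatizer.lemmatize(tok.lower()) for tok in word_tokenize(line)]
    for line in corpus
]

# Tiny hyperparameters just for the toy corpus (gensim 4.x argument names).
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1)
print(model.wv.most_similar("cat"))
```

On a corpus this small the similarities are meaningless, of course; the point is only where the lemmatization step sits relative to training.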

Something more interesting is how to tokenize with respect to the existing dictionary of word vectors inside the W2V model (or anything else). For example, "Good muffins cost $3.88\nin New York." needs to be tokenized to ['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New York.'], so that each token can then be replaced by its vector from the W2V model. The challenge is that some tokenizers tokenize "New York" as ['New', 'York'], which doesn't make much sense. (For example, NLTK makes this mistake: https://nltk.googlecode.com/svn/trunk/doc/howto/tokenize.html.) This is a problem when you have many multi-word phrases.
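One common way to handle the multi-word-phrase problem (my own suggestion, not something the answer prescribes) is to learn collocations from the corpus itself, e.g. with gensim's Phrases, so that frequent pairs like "New York" are merged into a single token such as "New_York" before training or vector lookup. A rough sketch; the thresholds here are deliberately low so the toy corpus produces a phrase:

```python
# Minimal sketch: merge frequent bigrams into single tokens with gensim.
from gensim.models.phrases import Phrases, Phraser

sentences = [
    ["good", "muffins", "cost", "$", "3.88", "in", "new", "york"],
    ["i", "flew", "to", "new", "york", "last", "week"],
    ["new", "york", "is", "expensive"],
]

# min_count and threshold are set very low only to make the toy example fire;
# real corpora need the defaults or tuned values.
bigram = Phraser(Phrases(sentences, min_count=1, threshold=0.1))

print(bigram[sentences[0]])
# expected: ['good', 'muffins', 'cost', '$', '3.88', 'in', 'new_york']
```

The merged tokens can then be fed to Word2Vec training, or matched against a pretrained vocabulary that stores phrases in the same underscore-joined form.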