FastText using pre-trained word vectors for text classification

JarvisIA · Dec 7, 2017 · Viewed 8.9k times

I am working on a text classification problem, that is, given some text, I need to assign to it certain given labels.

I have tried using the fastText library by Facebook, which has two utilities of interest to me:

A) Word Vectors with pre-trained models

B) Text Classification utilities

However, these seem to be completely independent tools, as I have been unable to find any tutorial that combines the two.

What I want is to be able to classify some text by taking advantage of the pre-trained word vectors. Is there any way to do this?

Answer

Dmitry Kashtanov · Feb 24, 2018

FastText supervised training has a -pretrainedVectors argument, which can be used like this:

$ ./fasttext supervised -input train.txt -output model -epoch 25 \
       -wordNgrams 2 -dim 300 -loss hs -thread 7 -minCount 1 \
       -lr 1.0 -verbose 2 -pretrainedVectors wiki.ru.vec
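
For reference, `fasttext supervised` expects every line of train.txt to begin with one or more labels prefixed by `__label__`, followed by the text. A minimal sketch of preparing such a file (the sample sentences and label names below are made up for illustration):

    # Sketch: write training data in the format `fasttext supervised` expects.
    # Each line: one or more `__label__<name>` tokens, then the raw text.
    samples = [
        ("sports", "the team won the championship last night"),
        ("politics", "parliament passed the new budget bill"),
    ]

    with open("train.txt", "w", encoding="utf-8") as f:
        for label, text in samples:
            f.write(f"__label__{label} {text}\n")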

A few things to consider:

  • The chosen embedding dimension must match the one used in the pretrained vectors. E.g., for the Wiki word vectors it must be 300, which is set by the -dim 300 argument.
  • As of mid-February 2018, the Python API (v0.8.22) doesn't support training with pretrained vectors (the corresponding parameter is ignored), so you must use the CLI (command line interface) version for training. However, a model trained by the CLI with pretrained vectors can be loaded by the Python API and used for predictions (see the sketch after this list).
  • For a large number of classes (in my case there were 340 of them), even the CLI may break with an exception, so you will need to use the hierarchical softmax loss function (-loss hs).
  • Hierarchical softmax performs worse than regular softmax, so it can give away all the gain you got from pretrained embeddings.
  • A model trained with pretrained vectors can be several times larger than one trained without them.
  • In my observation, a model trained with pretrained vectors also overfits faster than one trained without them.
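
A minimal sketch of the CLI-train / Python-predict workflow mentioned above: train with the command shown earlier, then load the resulting model.bin with the Python bindings for prediction. The import name varies between releases (fastText in the 0.8.x bindings, fasttext in later ones), so adjust to your installation; the input sentence is just a placeholder:

    # Sketch: load a model trained by the CLI (with pretrained vectors) and predict.
    # Assumes model.bin was produced by the `./fasttext supervised ...` command above.
    from fastText import load_model  # in newer releases: `import fasttext`

    model = load_model("model.bin")

    # predict() returns the top-k labels and their probabilities for a line of text.
    labels, probs = model.predict("пример текста для классификации", k=3)
    print(list(zip(labels, probs)))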