Computing TF-IDF on the whole dataset or only on training data?

python machine-learning scikit-learn nlp tf-idf

keramat · Dec 12, 2017 · Viewed 8.9k times · Source

In chapter seven of the book "TensorFlow Machine Learning Cookbook", the author uses scikit-learn's fit_transform function during pre-processing to compute the TF-IDF features of the text for training. The author passes all of the text data to the function before splitting it into train and test sets. Is this correct, or must we split the data first and then call fit_transform on the training set and transform on the test set?

Answer

I have not read the book, and I am not sure whether this is actually a mistake in the book, but I will give my 2 cents.

According to the documentation of scikit-learn, fit() is used in order to

Learn vocabulary and idf from training set.

On the other hand, fit_transform() is used in order to

Learn vocabulary and idf, return term-document matrix.

while transform()

Transforms documents to document-term matrix.

On the training set you need to apply both fit() and transform() (or just fit_transform(), which essentially combines both operations). On the testing set, however, you only need to transform() the testing instances (i.e. the documents).
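As a minimal sketch of that workflow (the toy corpus, labels, and variable names here are made up for illustration and are not from the book), one might split first and then fit the vectorizer on the training documents only:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Hypothetical toy corpus and labels, for illustration only.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats are pets",
    "the bird sang in the tree",
    "a fish swam in the pond",
    "birds and fish are animals",
]
labels = [0, 0, 0, 1, 1, 1]

# Split BEFORE computing any TF-IDF statistics.
train_docs, test_docs, y_train, y_test = train_test_split(
    docs, labels, test_size=0.33, random_state=0
)

vectorizer = TfidfVectorizer()

# Learn vocabulary and idf from the training documents, then transform them.
X_train = vectorizer.fit_transform(train_docs)

# Reuse the already-fitted vocabulary and idf on the test documents.
X_test = vectorizer.transform(test_docs)

# Both matrices share the same columns (the training vocabulary).
print(X_train.shape, X_test.shape)
```

Any word that appears only in the test documents is simply ignored by transform(), which mirrors what would happen with genuinely new text at prediction time.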

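Remember that the training set is used for learning (learning is achieved through fit()), while the testing set is used to evaluate whether the trained model generalises well to new, unseen data points. Fitting the vectorizer on the whole dataset would make the vocabulary and idf weights depend on the test documents, so the test set would no longer be truly unseen.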
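To see why the order matters, here is a small sketch (again with made-up documents) comparing a vectorizer fitted on the training documents only with one fitted on training and test documents together. It assumes the default TfidfVectorizer settings:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical documents, for illustration only.
train_docs = ["the cat sat on the mat", "the dog chased the cat"]
test_docs = ["the zebra chased the cat"]

# Fitted on the training documents only.
train_only = TfidfVectorizer().fit(train_docs)

# Fitted on everything, as in the approach being questioned.
whole = TfidfVectorizer().fit(train_docs + test_docs)

# Terms that entered the vocabulary only because the test set was seen.
leaked_terms = set(whole.vocabulary_) - set(train_only.vocabulary_)
print(leaked_terms)  # {'zebra'}

# The idf of a shared term also shifts once the test document is counted.
idf_train = train_only.idf_[train_only.vocabulary_["chased"]]
idf_whole = whole.idf_[whole.vocabulary_["chased"]]
print(idf_train != idf_whole)  # True
```

In this toy case the differences are small, but they show that both the feature columns and their weights change once the test documents take part in fit().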
