What is the difference between fit_transform and transform in sklearn countvectorizer?

Anurag Pandey · Aug 1, 2016 · Viewed 7.3k times

I was recently working through the Bag of Words introduction on Kaggle, and I want to clear up a few things.

We call vectorizer.fit_transform() on the list of *cleaned* train reviews.

When preparing the bag-of-words array for the train reviews we used fit_transform on the list of train reviews. I know that fit_transform does two things: first it fits on the data and learns the vocabulary, and then it builds a vector for each review.

Then, when we used vectorizer.transform() on the list of cleaned test reviews, this just transformed the list of test reviews into a vector for each review, using that same vocabulary.
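Roughly, the pipeline I am following looks like this (a minimal sketch with made-up toy data and variable names):

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.ensemble import RandomForestClassifier

    # Toy stand-ins for the cleaned reviews and sentiment labels (names are made up).
    clean_train_reviews = ["dog is black", "sky is blue", "dog is dancing"]
    clean_test_reviews = ["dog is white", "sky is black"]
    train_sentiment = [0, 1, 0]

    vectorizer = CountVectorizer(max_features=5000)

    # fit_transform: learn the vocabulary from the TRAIN reviews and build their count matrix.
    train_features = vectorizer.fit_transform(clean_train_reviews).toarray()

    # transform: reuse that same vocabulary to build the TEST count matrix.
    test_features = vectorizer.transform(clean_test_reviews).toarray()

    forest = RandomForestClassifier(n_estimators=100)
    forest.fit(train_features, train_sentiment)
    predictions = forest.predict(test_features)
    print(predictions)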

My question is: why not use fit_transform on the test list too? The documentation says it leads to overfitting, but it still makes sense to me to use it anyway; let me give you my perspective:

When we don't use fit_transform, we are essentially building the feature vectors of the test reviews from the most frequent words of the train reviews. Why not build the test feature array from the most frequent words in the test set itself?

Does the random forest even care? If we give the random forest the train feature array and the train sentiment labels to train on, and then give it the test feature array, won't it just produce its sentiment predictions anyway?

Answer

Abhinav Arora · Aug 2, 2016

You do not do a fit_transform on the test data because, when you fit a Random Forest, it learns classification rules based on the values of the features that you provide. If those rules are to be applied to the test set, then you need to make sure that the test features are computed in the same way, using the same vocabulary. If the vocabularies of the training and test features differ, the test features will not really make sense, as they reflect a vocabulary different from the one the model was trained on.

Now, talking specifically about CountVectorizer, consider the following example. Let your training data be these 3 sentences:

  1. Dog is black.
  2. Sky is blue.
  3. Dog is dancing.

The vocabulary set for this will be {Dog, is, black, sky, blue, dancing}. The Random Forest you train will try to learn rules based on the counts of these 6 vocabulary terms, so each of your feature vectors will have length 6. Now suppose the test set is as follows:

  1. Dog is white.
  2. Sky is black.

If you use fit_transform on the test data, your vocabulary will instead be {Dog, white, is, Sky, black}, so each test document will be represented by a vector of length 5 denoting the counts of those terms. That is like comparing apples with oranges: you learned rules for counts over the previous vocabulary, and those rules cannot be applied to this new one. This is why you only fit on the training data.
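Here is a minimal sketch of that example (note that CountVectorizer lowercases tokens by default, so the learned feature names come out lowercase):

    from sklearn.feature_extraction.text import CountVectorizer

    train = ["Dog is black.", "Sky is blue.", "Dog is dancing."]
    test = ["Dog is white.", "Sky is black."]

    vectorizer = CountVectorizer()

    # Fit on the training sentences: the vocabulary has 6 terms.
    train_features = vectorizer.fit_transform(train)
    print(sorted(vectorizer.vocabulary_))  # ['black', 'blue', 'dancing', 'dog', 'is', 'sky']
    print(train_features.shape)            # (3, 6)

    # Transform the test sentences with the SAME vocabulary: still 6 columns,
    # and the unseen word "white" is simply ignored.
    test_features = vectorizer.transform(test)
    print(test_features.shape)             # (2, 6)

    # Fitting a separate vectorizer on the test data gives a different, 5-term
    # vocabulary, so its columns no longer line up with the training features.
    test_only = CountVectorizer().fit_transform(test)
    print(test_only.shape)                 # (2, 5)

The shape mismatch in the last line is exactly the apples-to-oranges problem described above: a forest trained on 6 columns cannot be applied to a 5-column matrix whose columns mean different things.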