How can I pass a preprocessor to TfidfVectorizer? - sklearn - python

eman · May 25, 2014

How can I pass a preprocessor to TfidfVectorizer? I wrote a function that takes a string and returns a preprocessed string, then set the preprocessor parameter to that function ("preprocessor=preprocess"), but it doesn't work. I've searched many times but couldn't find a single example, as if no one uses it.

I have another question. Does the preprocessor parameter override stop-word removal and lowercasing, which could otherwise be done with the stop_words and lowercase parameters?

Answer

David · May 25, 2014

You simply define a function that takes a string as input and returns the preprocessed string. For example, a trivial function that uppercases strings would look like this:

def preProcess(s):
    return s.upper()
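As a slightly more realistic sketch than uppercasing, a preprocessor might fold case and strip markup before tokenization. The function name clean_text and the specific cleanup steps are assumptions chosen for illustration, not something the question specified; the only real requirement is one string in, one string out.

```python
import re

def clean_text(text):
    # Hypothetical preprocessor: accepts one string and returns one string.
    text = text.lower()                     # fold case ourselves
    text = re.sub(r"<[^>]+>", " ", text)    # drop HTML-like tags
    text = re.sub(r"\s+", " ", text)        # collapse runs of whitespace
    return text.strip()

print(clean_text("  <b>Hello</b>   World  "))  # hello world
```

Because it is a plain function, it can be tested on sample strings before it is handed to the vectorizer.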

Once the function is defined, pass it to TfidfVectorizer through the preprocessor parameter. For example:

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
     'This is the first document.',
     'This is the second second document.',
     'And the third one.',
     'Is this the first document?'
     ]

X = TfidfVectorizer(preprocessor=preProcess)
X.fit(corpus)
X.get_feature_names()  # removed in scikit-learn 1.2; use X.get_feature_names_out() on 1.0 or later

Results in:

[u'AND', u'DOCUMENT', u'FIRST', u'IS', u'ONE', u'SECOND', u'THE', u'THIRD', u'THIS']

This also answers your follow-up question: even though lowercase defaults to True, the uppercasing preprocessor overrides it. The documentation says the same:

preprocessor : callable or None (default) Override the preprocessing (string transformation) stage while preserving the tokenizing and n-grams generation steps.
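The stop_words half of the follow-up question can be checked with a small experiment. The sketch below assumes a hypothetical preprocess function and a hand-picked stop-word list; as far as I can tell, stop-word filtering runs on the tokens after tokenization, so it should still apply when a custom preprocessor is set.

```python
import re
from sklearn.feature_extraction.text import TfidfVectorizer

def preprocess(text):
    # Hypothetical cleanup: lowercase manually and replace digits with spaces.
    return re.sub(r"\d+", " ", text.lower())

corpus = [
    "This is the first document, from 2014.",
    "And the third one, number 3.",
]

# stop_words is combined with the custom preprocessor on purpose.
vectorizer = TfidfVectorizer(preprocessor=preprocess, stop_words=["the", "is"])
vectorizer.fit(corpus)

# vocabulary_ maps each kept token to its column index.
vocabulary = sorted(vectorizer.vocabulary_)
print(vocabulary)
```

Neither "the" nor "is" should appear in the printed vocabulary. Stop words are compared against the tokens the preprocessor produced, so with an uppercasing preprocessor a lowercase stop-word list would no longer match.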