How can I pass a preprocessor to TfidfVectorizer? I wrote a function that takes a string and returns a preprocessed string, then set the preprocessor parameter to that function (preprocessor=preprocess), but it doesn't work. I've searched many times but haven't found a single example, as if no one uses it.
I have another question. Does the preprocessor parameter override the stop-word removal and lowercasing that are otherwise controlled by the stop_words and lowercase parameters?
You simply define a function that takes a string as input and returns the preprocessed string. For example, a trivial function that uppercases strings would look like this:
def preProcess(s):
    return s.upper()
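For something closer to real use, here is a minimal sketch of a preprocessor that does a few cleanup steps. The function name and the specific steps (dropping URLs and digits) are just assumptions for illustration; the only requirement is one string in, one string out:

```python
import re

def preprocess(text):
    """Hypothetical preprocessor: takes one string, returns one cleaned string."""
    text = text.lower()                        # lowercase explicitly
    text = re.sub(r"https?://\S+", " ", text)  # blank out URLs
    text = re.sub(r"\d+", " ", text)           # blank out runs of digits
    return re.sub(r"\s+", " ", text).strip()   # collapse leftover whitespace

cleaned = preprocess("Visit https://example.com NOW, 24 hours a day!")
print(cleaned)
```

Punctuation is left alone here because the vectorizer's tokenizer deals with it afterwards.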
Once you have your function, pass it to your TfidfVectorizer object through the preprocessor parameter. For example:
from sklearn.feature_extraction.text import TfidfVectorizer
corpus = [
'This is the first document.',
'This is the second second document.',
'And the third one.',
'Is this the first document?'
]
X = TfidfVectorizer(preprocessor=preProcess)
X.fit(corpus)
X.get_feature_names()  # removed in scikit-learn 1.2; use X.get_feature_names_out() on newer versions
Results in:
[u'AND', u'DOCUMENT', u'FIRST', u'IS', u'ONE', u'SECOND', u'THE', u'THIRD', u'THIS']
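This also answers part of your follow-up question: even though lowercase defaults to True, the custom preprocessor replaces the built-in lowercasing step, so the tokens stay uppercase. Stop-word removal is different. It happens after tokenization, so stop_words is still applied to the tokens produced from your preprocessor's output. The override is also mentioned in the documentation: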
preprocessor : callable or None (default)
    Override the preprocessing (string transformation) stage while preserving the tokenizing and n-grams generation steps.
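If you want to keep the default lowercasing and add your own step on top, one option is to call the vectorizer's own build_preprocessor() inside your function. The sketch below assumes that approach; the digit-stripping step and the variable names are made up for illustration. It also passes stop_words="english" to show that stop words are still filtered when a custom preprocessor is set:

```python
import re
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]

# The stock preprocessing step (lowercasing, plus accent stripping if configured).
default_preprocess = TfidfVectorizer().build_preprocessor()

def preprocess(text):
    # Hypothetical custom step: blank out digits, then reuse the default step
    # so lowercasing is kept even though `preprocessor` replaces it.
    return default_preprocess(re.sub(r"\d+", " ", text))

vectorizer = TfidfVectorizer(preprocessor=preprocess, stop_words="english")
vectorizer.fit(corpus)

# vocabulary_ maps each token to its column index; its keys are the features.
vocabulary = sorted(vectorizer.vocabulary_)
print(vocabulary)
```

Tokens such as "the" should be missing from the printed vocabulary while "document" remains. Note that the stop-word list is lowercase, so a preprocessor that uppercases the text (like preProcess above) would produce tokens that never match it.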