Adding words to the stop_words list in TfidfVectorizer in sklearn

ac11 · Nov 9, 2014

I want to add a few more words to stop_words in TfidfVectorizer. I followed the solution in "Adding words to scikit-learn's CountVectorizer's stop list". My stop word list now contains both the 'english' stop words and the ones I specified, but TfidfVectorizer still does not seem to apply my list: I can still see those words in my feature list. My code is below.

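from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import TfidfVectorizer

# my_words holds my extra stop words
my_stop_words = text.ENGLISH_STOP_WORDS.union(my_words)

vectorizer = TfidfVectorizer(analyzer=u'word',max_df=0.95,lowercase=True,stop_words=set(my_stop_words),max_features=15000)
# documents is my list of raw strings, named so that it does not shadow the imported text module
X = vectorizer.fit_transform(documents)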

I have also tried passing stop_words=my_stop_words directly to TfidfVectorizer, but that does not work either. Please help.

Answer

Pedram · Jul 14, 2017

This is how you can do it:

from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import TfidfVectorizer

# extend the built-in English list; list() because recent scikit-learn releases reject a frozenset here
my_stop_words = list(text.ENGLISH_STOP_WORDS.union(["book"]))

vectorizer = TfidfVectorizer(ngram_range=(1,1), stop_words=my_stop_words)

X = vectorizer.fit_transform(["This is a green apple.", "This is a machine learning book."])

# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
idf_values = dict(zip(vectorizer.get_feature_names_out(), vectorizer.idf_))

# printing the tfidf vectors
print(X)

# printing the vocabulary
print(vectorizer.vocabulary_)

In this example, I created the tfidf vectors for two sample documents:

"This is a green apple."
"This is a machine learning book."

By default, this, is, a, and an are all in the ENGLISH_STOP_WORDS list. I also added book to the stop word list. This is the output:

(0, 1)  0.707106781187
(0, 0)  0.707106781187
(1, 3)  0.707106781187
(1, 2)  0.707106781187
{'green': 1, 'machine': 3, 'learning': 2, 'apple': 0}

As we can see, book no longer appears among the features because we listed it as a stop word. So TfidfVectorizer did accept the manually added word and ignored it when building the vectors.
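
If your own words still show up in the features, one possible cause (an assumption, since the question does not show my_words) is a case mismatch. With lowercase=True, which is the default, the documents are lowercased before stop words are filtered, while the stop list is used as given, so an entry such as "Book" would never match the token "book". Here is a minimal sketch that lowercases the additions and then checks that none of them leaked into the vocabulary. The names extra_words, docs and leaked are made up for this illustration.

```python
from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical extra stop words; note the capital letter in "Book".
extra_words = ["Book", "machine"]

# Lowercase the additions so they match the lowercased tokens, then
# convert the resulting frozenset to a list before passing it on.
my_stop_words = list(
    text.ENGLISH_STOP_WORDS.union(w.lower() for w in extra_words)
)

docs = ["This is a green apple.", "This is a machine learning book."]

vectorizer = TfidfVectorizer(stop_words=my_stop_words)
vectorizer.fit(docs)

# Any stop word that still appears in the vocabulary was not applied.
vocabulary = set(vectorizer.vocabulary_)
leaked = vocabulary.intersection(my_stop_words)

print(sorted(vocabulary))  # ['apple', 'green', 'learning']
print(leaked)              # set()
```

If leaked is not empty, comparing the leaked tokens with the entries in your list (case, punctuation, multi-word phrases) is a reasonable next step.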