ValueError: After pruning, no terms remain. Try a lower min_df or a higher max_df

Jeet Dadhich · Jun 14, 2016
from sklearn.feature_extraction.text import TfidfVectorizer

# tokenize_and_stem_body is the asker's own tokenizer (not shown in the question)
tfidf_vectorizer = TfidfVectorizer(max_df=0.95, max_features=200000,
                                   min_df=0.5, stop_words='english',
                                   use_idf=True, sublinear_tf=True,
                                   tokenizer=tokenize_and_stem_body,
                                   ngram_range=(1, 3))
tfidf_matrix_body = tfidf_vectorizer.fit_transform(totalvocab_stemmed_body)

The above code gives me the error

ValueError: After pruning, no terms remain. Try a lower min_df or a higher max_df.

Can anyone help me with this? I have changed all the values from 80 to 100, but the issue remains the same.

Answer

pnv · Jan 25, 2017

From the scikit-learn documentation for TfidfVectorizer:

max_df : float in range [0.0, 1.0] or int, default=1.0

When building the vocabulary ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words). If float, the parameter represents a proportion of documents, integer absolute counts. This parameter is ignored if vocabulary is not None.

min_df : float in range [0.0, 1.0] or int, default=1

When building the vocabulary ignore terms that have a document frequency strictly lower than the given threshold. This value is also called cut-off in the literature. If float, the parameter represents a proportion of documents, integer absolute counts. This parameter is ignored if vocabulary is not None.
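
To make the two thresholds concrete, here is a minimal pure-Python sketch of the pruning rule described above. It is not scikit-learn's actual code: `prune_vocabulary` and the toy documents are made up for illustration, and the sketch assumes the documents are already tokenized.

```python
from collections import Counter
from numbers import Integral


def prune_vocabulary(tokenized_docs, min_df=1, max_df=1.0):
    """Keep terms whose document frequency lies within [min_df, max_df].

    Floats are read as a proportion of documents, integers as absolute counts.
    """
    n_docs = len(tokenized_docs)
    low = min_df if isinstance(min_df, Integral) else min_df * n_docs
    high = max_df if isinstance(max_df, Integral) else max_df * n_docs

    # Document frequency: in how many documents each term occurs at least once.
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))

    kept = sorted(term for term, count in doc_freq.items() if low <= count <= high)
    if not kept:
        raise ValueError("After pruning, no terms remain.")
    return kept


docs = [["cat", "sat"], ["cat", "ran"], ["dog", "ran"], ["cat", "dog"]]

# 4 documents: min_df=0.5 means "at least 2", max_df=0.95 means "at most 3.8".
print(prune_vocabulary(docs, min_df=0.5, max_df=0.95))  # ['cat', 'dog', 'ran']
```

With `min_df=0.9` the same documents would need a term present in at least 3.6 of the 4 documents, none qualifies, and the function raises the error.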

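Please check the data type of the variable totalvocab_stemmed_body. If it is a list, each element of the list is treated as a separate document. So if it is a flat list of stemmed tokens rather than a list of full texts, every token counts as its own document.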
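
One rough way to run that check is to count how many documents the vectorizer would see and how many whitespace-separated tokens each one holds. The sketch below is only a diagnostic, assuming the corpus is a list of strings; `describe_corpus` is a hypothetical helper, not part of scikit-learn.

```python
def describe_corpus(corpus):
    """Summarize a corpus the way a vectorizer sees it: one element = one document."""
    if isinstance(corpus, str):
        raise TypeError("A single string was passed; wrap the texts in a list.")
    lengths = [len(doc.split()) for doc in corpus]
    return {
        "n_documents": len(lengths),
        "avg_tokens_per_document": sum(lengths) / len(lengths) if lengths else 0.0,
    }


# A flat list of stemmed tokens: every token becomes its own document.
token_list = ["run", "jump", "swim", "run"]
# A list of full texts: each text is one document.
text_list = ["the dog runs and jumps", "the cat swims and runs"]

print(describe_corpus(token_list))
print(describe_corpus(text_list))
```

An average close to 1 token per document suggests the list holds tokens rather than texts, in which case few terms, if any, are likely to reach a min_df of 0.5.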

Case 1: number of documents = 2,000,000, min_df = 0.5.

If you have a large number of files (say 2 million), each containing only a few words and coming from very different domains, there is very little chance that any term appears in at least 1,000,000 (2,000,000 × 0.5) documents.

Case 2: number of documents = 200, max_df = 0.95.

If you have a set of near-duplicate files (say 200), most terms appear in most of the documents. With max_df=0.95 you are telling the vectorizer to ignore every term that appears in more than 190 files (200 × 0.95). If all terms are repeated in nearly every file, they are all discarded and the vectorizer is left with no terms for the matrix.
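
Both cases can be reproduced on toy data. The sketch below assumes scikit-learn is installed; `try_fit` and the sample documents are made up for illustration, and the corpora are scaled-down stand-ins for the numbers above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer


def try_fit(docs, **params):
    """Fit a TfidfVectorizer and return its sorted vocabulary, or the error text."""
    try:
        vectorizer = TfidfVectorizer(**params)
        vectorizer.fit(docs)
        return sorted(vectorizer.vocabulary_)
    except ValueError as exc:
        return str(exc)


# Case 1 in miniature: short documents that share no terms.
unrelated = ["apple pie", "quantum physics", "football match", "jazz music"]
print(try_fit(unrelated, min_df=0.5))  # each term occurs in 1 document, fewer than 2
print(try_fit(unrelated, min_df=1))    # lowering min_df keeps every term

# Case 2 in miniature: the same document repeated 200 times.
repeated = ["hello world"] * 200
print(try_fit(repeated, max_df=0.95))  # each term occurs in 200 documents, more than 190
print(try_fit(repeated, max_df=1.0))   # raising max_df keeps both terms
```

In both failing calls the returned text is the "After pruning, no terms remain" message, and relaxing the offending threshold is enough to get a vocabulary back.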

This is my thought on this topic.