Attribute error while using scikit-learn

Animesh Pandey · Mar 5, 2013

I am trying to find similar questions with scikit-learn, using cosine similarity. I was trying this sample code available on the internet (Link1 and Link2):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from nltk.corpus import stopwords
import numpy as np
import numpy.linalg as LA

train_set = ["The sky is blue.", "The sun is bright."]
test_set = ["The sun in the sky is bright."]
stopWords = stopwords.words('english')

vectorizer = CountVectorizer(stop_words = stopWords)
transformer = TfidfTransformer()

trainVectorizerArray = vectorizer.fit_transform(train_set).toarray()
testVectorizerArray = vectorizer.transform(test_set).toarray()
print 'Fit Vectorizer to train set', trainVectorizerArray
print 'Transform Vectorizer to test set', testVectorizerArray
cx = lambda a, b : round(np.inner(a, b)/(LA.norm(a)*LA.norm(b)), 3)

for vector in trainVectorizerArray:
    print vector
    for testV in testVectorizerArray:
        print testV
        cosine = cx(vector, testV)
        print cosine

transformer.fit(trainVectorizerArray)
print transformer.transform(trainVectorizerArray).toarray()

transformer.fit(testVectorizerArray)
tfidf = transformer.transform(testVectorizerArray)
print tfidf.todense()

I always get this error:

Traceback (most recent call last):
  File "C:\Users\Animesh\Desktop\NLP\ngrams2.py", line 14, in <module>
    trainVectorizerArray = vectorizer.fit_transform(train_set).toarray()
  File "C:\Python27\lib\site-packages\scikit_learn-0.13.1-py2.7-win32.egg\sklearn\feature_extraction\text.py", line 740, in fit_transform
    raise ValueError("empty vocabulary; training set may have"
ValueError: empty vocabulary; training set may have contained only stop words or min_df (resp. max_df) may be too high (resp. too low).

I also tried the code available on this link. There I got a different error: AttributeError: 'CountVectorizer' object has no attribute 'vocabulary'.

How can I solve this issue?

I am using Python 2.7.3 on 32-bit Windows 7 with scikit-learn 0.13.1.

Answer

Fred Foo · Mar 5, 2013

Since I'm running the development (pre-0.14) version, where the feature_extraction.text module got overhauled, I don't get the same error message. But I suspect you can solve this issue with:

vectorizer = CountVectorizer(stop_words=stopWords, min_df=1)

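The min_df parameter causes CountVectorizer to throw away any term that occurs in too few documents (because it won't have any predictive value). By default, it's set to 2. Once the stop words are removed, each remaining term in your two training sentences (sky, blue, sun, bright) occurs in only one document, so all your terms get thrown away and you get an empty vocabulary.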
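To see why a document-frequency threshold of 2 empties the vocabulary here, the filtering can be sketched in plain Python. This is only an illustration and not scikit-learn's actual implementation: the build_vocabulary helper, the abbreviated STOP_WORDS set (standing in for NLTK's English list) and the tokenizing regex are all assumptions made for the example.

```python
# Illustrative sketch only: a simplified stand-in for document-frequency
# filtering, NOT scikit-learn's actual implementation.
import re

# Hypothetical, abbreviated stop-word list (the question uses NLTK's English list).
STOP_WORDS = {"the", "is", "in"}


def build_vocabulary(documents, min_df=1, stop_words=STOP_WORDS):
    """Return the sorted terms that occur in at least `min_df` documents."""
    doc_freq = {}
    for doc in documents:
        # Lowercase, keep tokens of two or more word characters, drop stop words.
        # Using a set counts each term at most once per document.
        tokens = set(re.findall(r"\b\w\w+\b", doc.lower())) - set(stop_words)
        for term in tokens:
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return sorted(term for term, df in doc_freq.items() if df >= min_df)


train_set = ["The sky is blue.", "The sun is bright."]

print(build_vocabulary(train_set, min_df=2))  # [] (no term appears in both documents)
print(build_vocabulary(train_set, min_df=1))  # ['blue', 'bright', 'sky', 'sun']
```

With a threshold of 2 nothing survives, which corresponds to the "empty vocabulary" ValueError in the question; with a threshold of 1 all four content words are kept.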