raise ValueError("np.nan is an invalid document, expected byte or "

Sadhana Singh · Mar 13, 2018

I am using CountVectorizer in scikit-learn to vectorize a feature sequence, and I am stuck on the following error: ValueError: np.nan is an invalid document, expected byte or unicode string.

I am using an example CSV dataset with two columns, CONTENT and sentiment. My code is below:

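import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

df = pd.read_csv("train.csv", encoding="ISO-8859-1")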
X, y = df.CONTENT, df.sentiment

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
print X_train, y_train

vect = CountVectorizer(ngram_range=(1,3), analyzer='word', encoding = "ISO-8859-1")
print vect
X=vect.fit_transform(X_train, y_train)
y=vect.fit(X_test) 
print vect.get_feature_names()

The error I got is:

File "C:/Users/HP/cntVect.py", line 28, in <module>
    X=vect.fit_transform(X_train, y_train)

  File "C:\ProgramData\Anaconda2\lib\site-packages\sklearn\feature_extraction\text.py", line 839, in fit_transform
    self.fixed_vocabulary_)

  File "C:\ProgramData\Anaconda2\lib\site-packages\sklearn\feature_extraction\text.py", line 762, in _count_vocab
    for feature in analyze(doc):

  File "C:\ProgramData\Anaconda2\lib\site-packages\sklearn\feature_extraction\text.py", line 241, in <lambda>
    tokenize(preprocess(self.decode(doc))), stop_words)

  File "C:\ProgramData\Anaconda2\lib\site-packages\sklearn\feature_extraction\text.py", line 121, in decode
    raise ValueError("np.nan is an invalid document, expected byte or "

ValueError: np.nan is an invalid document, expected byte or unicode string.

Answer

MaxU · Mar 13, 2018

Replace the NaNs with spaces; this should make CountVectorizer happy. The NaNs most likely come from empty CONTENT cells in the CSV, which pandas reads as NaN by default:

X, y = df.CONTENT.fillna(' '), df.sentiment
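
To make the fix easier to try without the original file, here is a minimal sketch that assumes a small in-memory DataFrame in place of train.csv (the sample rows are made up for illustration). It counts the missing documents first, then shows two ways to handle them: the fillna approach from above, and dropping the affected rows as an alternative.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-in for train.csv: one CONTENT cell is missing (NaN).
df = pd.DataFrame({
    "CONTENT": ["good movie", np.nan, "bad movie"],
    "sentiment": [1, 0, 0],
})

# Diagnose: how many documents are missing?
n_missing = int(df.CONTENT.isna().sum())

vect = CountVectorizer(ngram_range=(1, 3), analyzer="word")

# Option 1: replace missing text with a blank string, keeping every row.
X_filled = vect.fit_transform(df.CONTENT.fillna(" "))

# Option 2: drop rows whose text is missing; taking X and y from the
# same filtered frame keeps features and labels aligned.
clean = df.dropna(subset=["CONTENT"])
X_dropped = vect.fit_transform(clean.CONTENT)
y_dropped = clean.sentiment
```

With fillna the blank row simply becomes an all-zero feature vector, so whether to keep or drop it depends on whether an empty document is meaningful for the model. Separately from the NaN issue, the test split would normally go through vect.transform(X_test) rather than vect.fit(X_test), so that it reuses the vocabulary learned from the training data.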