UnicodeDecodeError: 'utf8' codec can't decode byte 0xc3 in position 0: unexpected end of data

Vishal Kharde · Jan 14, 2016

I'm writing code to stem tweets, but I'm running into an encoding problem. When I apply the Porter stemmer, it raises an error. Maybe I'm not tokenizing the text properly.

My code is as follows:

import sys
import pandas as pd
import nltk
import scipy as sp
from nltk.classify import NaiveBayesClassifier
from nltk.stem import PorterStemmer
reload(sys)  
sys.setdefaultencoding('utf8')


stemmer=nltk.stem.PorterStemmer()

p_test = pd.read_csv('TestSA.csv')
train = pd.read_csv('TrainSA.csv')

def word_feats(words):
    return dict([(word, True) for word in words])

for i in range(len(train)-1):
    t = []
    #train.SentimentText[i] = " ".join(t)
    for word in nltk.word_tokenize(train.SentimentText[i]):
        t.append(stemmer.stem(word))
    train.SentimentText[i] = ' '.join(t)

When I run it, I get this error:


UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-10-5aa856d0307f> in <module>()
     23     #train.SentimentText[i] = " ".join(t)
     24     for word in nltk.word_tokenize(train.SentimentText[i]):
---> 25         t.append(stemmer.stem(word))
     26     train.SentimentText[i] = ' '.join(t)
     27 

/usr/lib/python2.7/site-packages/nltk/stem/porter.pyc in stem(self, word)
    631     def stem(self, word):
    632         stem = self.stem_word(word.lower(), 0, len(word) - 1)
--> 633         return self._adjust_case(word, stem)
    634 
    635     ## --NLTK--

/usr/lib/python2.7/site-packages/nltk/stem/porter.pyc in _adjust_case(self, word, stem)
    602         for x in range(len(stem)):
    603             if lower[x] == stem[x]:
--> 604                 ret += word[x]
    605             else:
    606                 ret += stem[x]

/usr/lib64/python2.7/encodings/utf_8.pyc in decode(input, errors)
     14 
     15 def decode(input, errors='strict'):
---> 16     return codecs.utf_8_decode(input, errors, True)
     17 
     18 class IncrementalEncoder(codecs.IncrementalEncoder):

UnicodeDecodeError: 'utf8' codec can't decode byte 0xc3 in position 0: unexpected end of data

Does anybody have a clue what is wrong with my code? I'm stuck on this error. Any suggestions?

Answer

roeland · Jan 14, 2016

I think the key line is 604, one frame above the place which raises the error:

--> 604                 ret += word[x]

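Probably ret is a Unicode string and word is a byte string, so Python 2 has to implicitly decode word[x], a single byte, before it can append it to ret. You cannot decode UTF-8 byte by byte, as that loop is trying to do: 0xc3 is only the first byte of a multi-byte character, which is why the message says "unexpected end of data".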
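
To see why a single byte cannot be decoded, here is a minimal sketch. It is written in Python 3 syntax for clarity, whereas the question runs Python 2.7, where the same decode happens implicitly during the concatenation. The character "é" is only an assumed example of text whose UTF-8 encoding starts with 0xc3; the actual tweet content is not shown in the question.

```python
# "é" is an assumed example character; its UTF-8 encoding is two bytes: 0xC3 0xA9.
raw = "é".encode("utf-8")

# Decoding the complete two-byte sequence works.
whole = raw.decode("utf-8")

# Decoding only the first byte fails: 0xC3 announces a two-byte
# character, but the continuation byte is missing.
try:
    raw[:1].decode("utf-8")
    reason = None
except UnicodeDecodeError as exc:
    reason = exc.reason

print(repr(whole), reason)
```

The failure reason matches the one in the traceback, which suggests the stemmer is being handed undecoded bytes.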

The problem is that read_csv is returning bytes, and you are trying to do text processing on those bytes. That simply doesn't work; those bytes have to be decoded to Unicode first. I think you can use:

pandas.read_csv(filename, encoding='utf-8')
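
A small self-contained sketch of that call, assuming pandas is installed. The in-memory CSV and its contents are made up for illustration; only the SentimentText column name comes from the question. Under Python 3, read_csv decodes to text in any case, so this mainly shows the shape of the call and what the column should contain afterwards.

```python
import io

import pandas as pd

# Hypothetical stand-in for TrainSA.csv: UTF-8 bytes with one
# non-ASCII character in the SentimentText column.
csv_bytes = "SentimentText\ncafé time\n".encode("utf-8")

# Decode while reading, so every cell holds text rather than raw bytes.
train = pd.read_csv(io.BytesIO(csv_bytes), encoding="utf-8")

first = train.SentimentText[0]
print(type(first).__name__, first)
```

With decoded text in the column, indexing a word character by character yields whole characters, not fragments of a multi-byte sequence.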

If possible, use Python 3. Then trying to concatenate bytes and unicode will always raise an error, making it much easier to spot these problems.
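
A minimal sketch of that Python 3 behaviour, using made-up values: mixing text and bytes fails immediately with a TypeError, regardless of what the bytes contain, instead of depending on an implicit decode.

```python
text = "caf"                  # str (Unicode text)
raw = "é".encode("utf-8")     # bytes, here b'\xc3\xa9'

# Python 3 refuses to mix the two types instead of guessing an encoding.
try:
    combined = text + raw
    error = None
except TypeError as exc:
    error = exc

# Decoding explicitly makes the concatenation valid.
fixed = text + raw.decode("utf-8")

print(type(error).__name__, fixed)
```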