UnicodeDecodeError: 'utf-8' codec can't decode byte 0xcc in position 3: invalid continuation byte

Josephine M. Ho · Aug 3, 2017

I'm trying to load a CSV file using pd.read_csv, but I get the following Unicode error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xcc in position 3: invalid continuation byte

Answer

bobince · Aug 3, 2017

Unfortunately, CSV files have no built-in method of signalling character encoding.

By default, read_csv assumes that the bytes in the CSV file are text encoded as UTF-8. You get a UnicodeDecodeError when the file actually uses some other encoding and contains bytes that don't form a valid UTF-8 sequence. (If those bytes did happen, by luck, to be valid UTF-8, you wouldn't get the error, but the non-ASCII characters would be read incorrectly, which would be worse really.)
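To make both cases concrete, here is a minimal sketch using only Python's built-in codecs. The sample strings are made up for illustration and are not taken from the asker's file; the first one is chosen so that it produces the same message as in the question.

```python
# Hypothetical sample: "abc" + U+00CC (I with grave) + "d", saved as Windows-1252,
# where U+00CC is stored as the single byte 0xCC.
raw = "abc\u00ccd".encode("cp1252")  # b'abc\xccd'

try:
    raw.decode("utf-8")
    error = None
except UnicodeDecodeError as exc:
    # In UTF-8, 0xCC starts a two-byte sequence, but the next byte ("d")
    # is not a continuation byte, so decoding fails at position 3.
    error = exc

# The "lucky" case: these cp1252 bytes (0xC3 0xA9) are also a valid UTF-8
# sequence, so no error is raised, but the decoded text is different
# from what was written.
intended = "\u00c3\u00a9"  # A with tilde followed by the copyright sign
silently_wrong = intended.encode("cp1252").decode("utf-8")  # one character, U+00E9
```

Here `error` holds the same kind of exception as in the question, while `silently_wrong` shows how a mismatched encoding can pass without any error at all.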

It's up to you to specify which encoding is in play, which requires some knowledge (or guessing) of where the file came from. For example, if it came from MS Excel on a Western-locale install of Windows, it would probably be Windows code page 1252, and you could read it with:

pd.read_csv('../filename.csv', encoding='cp1252')
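If you don't know where the file came from, one rough way to narrow it down is to try a few candidate encodings against the raw bytes. The sketch below is only a heuristic: the helper names and the candidate list are hypothetical, and an encoding that decodes without error is not necessarily the right one (latin-1 in particular accepts any byte sequence).

```python
from pathlib import Path

# Hypothetical candidate list; the order matters, strictest first.
CANDIDATES = ("utf-8", "cp1252", "latin-1")


def first_decodable_encoding(data, candidates=CANDIDATES):
    """Return the first candidate that decodes `data` without error, else None."""
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue  # this candidate cannot represent the bytes; try the next
        return name
    return None


def guess_file_encoding(path, candidates=CANDIDATES):
    """Apply the same check to the raw bytes of a file."""
    return first_decodable_encoding(Path(path).read_bytes(), candidates)
```

The result could then be passed on, for example pd.read_csv(path, encoding=guess_file_encoding(path)), but it is still worth checking the non-ASCII characters in the loaded data by eye.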