Python 3 chokes on CP-1252/ANSI reading

Aaron Altman picture Aaron Altman · Jul 19, 2010 · Viewed 17.4k times · Source

I'm working on a series of parsers where I get a bunch of tracebacks from my unit tests like:

  File "c:\Python31\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 112: character maps to <undefined>

The files are opened with open() with no extra arguemnts. Can I pass extra arguments to open() or use something in the codec module to open these differently?

This came up with code that was written in Python 2 and converted to 3 with the 2to3 tool.

UPDATE: it turns out this is a result of feeding a zipfile into the parser. The unit test actually expects this to happen. The parser should recognize it as something that can't be parsed. So, I need to change my exception handling. In the process of doing that now.

Answer

Marius Gedminas picture Marius Gedminas · Jul 20, 2010

Position 0x81 is unassigned in Windows-1252 (aka cp1252). It is assigned to U+0081 HIGH OCTET PRESET (HOP) control character in Latin-1 (aka ISO 8859-1). I can reproduce your error in Python 3.1 like this:

>>> b'\x81'.decode('cp1252')
Traceback (most recent call last):
  ...
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 0: character maps to <undefined>

or with an actual file:

>>> open('test.txt', 'wb').write(b'\x81\n')
2
>>> open('test.txt').read()
Traceback (most recent call last):
  ...
UnicodeDecodeError: 'utf8' codec can't decode byte 0x81 in position 0: unexpected code byte

Now to treat this file as Latin-1 you pass the encoding argument, like codeape suggested:

>>> open('test.txt', encoding='latin-1').read()
'\x81\n'

Beware that there are differences between Windows-1257 and Latin-1 encodings, e.g. Latin-1 doesn't have “smart quotes”. If the file you're processing is a text file, ask yourself what that \x81 is doing in it.