UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-5: ordinal not in range(128)

Serhii Matrunchyk · Feb 16, 2015 · Viewed 53.6k times

I'm simply trying to decode a string like \uXXXX\uXXXX\uXXXX, but I get an error:

$ python
Python 2.7.6 (default, Sep  9 2014, 15:04:36) 
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.39)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> print u'\u041e\u043b\u044c\u0433\u0430'.decode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-4: ordinal not in range(128)

I'm a Python newbie. What's the problem? Thanks!

Answer

Martijn Pieters · Feb 16, 2015

Python is trying to be helpful. You cannot decode Unicode data; it is already decoded. So Python 2 first encodes the data implicitly (using the default ASCII codec) to get bytes it can then decode. It is this implicit encoding that fails, which is why a call to decode() raises a UnicodeEncodeError.
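The hidden step can be written out by hand. The following is a minimal sketch, assuming Python 3 (where str has no decode() method, so the conversion Python 2 does silently has to be spelled out); implicit_encode_step is a hypothetical helper name used only for illustration:

```python
# Python 3 sketch of the step Python 2 performs implicitly when
# .decode() is called on a unicode object.
text = u'\u041e\u043b\u044c\u0433\u0430'  # already-decoded text


def implicit_encode_step(value):
    """Hypothetical helper: mimic Python 2's hidden value.encode('ascii')."""
    try:
        return value.encode('ascii')
    except UnicodeEncodeError as exc:
        # Return the exception instead of raising so it can be inspected.
        return exc


result = implicit_encode_step(text)
print(type(result).__name__)
print(result)
```

The exception comes from the ASCII encode, before any UTF-8 decoding could start, which matches the error type in the question's traceback.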

If you have Unicode data, it only makes sense to encode to UTF-8, not decode:

>>> print u'\u041e\u043b\u044c\u0433\u0430'
Ольга
>>> u'\u041e\u043b\u044c\u0433\u0430'.encode('utf8')
'\xd0\x9e\xd0\xbb\xd1\x8c\xd0\xb3\xd0\xb0'

If you wanted a Unicode value, then using a Unicode literal (u'...') is all you needed to do. No further decoding is necessary.
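As a rough illustration of which direction each operation goes, here is a small round trip. It is a sketch that assumes Python 3, where text (str) and bytes are separate types and no implicit conversion happens between them; the variable names are arbitrary:

```python
# Text -> bytes is encoding; bytes -> text is decoding.
text = u'\u041e\u043b\u044c\u0433\u0430'   # 5 code points of text

data = text.encode('utf-8')                # text to UTF-8 bytes
print(data)                                # bytes repr, ASCII-safe to print
print(len(text), len(data))                # each Cyrillic letter takes 2 bytes

round_tripped = data.decode('utf-8')       # UTF-8 bytes back to text
print(round_tripped == text)
```

Only the bytes object needs decoding; the text value is already in its final form.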

The same implicit conversion takes place in the other direction; if you tried to encode a bytestring you'd trigger an implicit decoding:

>>> u'\u041e\u043b\u044c\u0433\u0430'.encode('utf8').encode('utf8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd0 in position 0: ordinal not in range(128)
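
The implicit decoding in this direction can be spelled out in the same way. This is a sketch assuming Python 3, where bytes has no encode() method, so the hidden ASCII decode is written explicitly; implicit_decode_step is a hypothetical helper name:

```python
# Python 3 sketch of the step Python 2 performs implicitly when
# .encode() is called on a byte string.
data = u'\u041e\u043b\u044c\u0433\u0430'.encode('utf-8')  # UTF-8 bytes


def implicit_decode_step(value):
    """Hypothetical helper: mimic Python 2's hidden value.decode('ascii')."""
    try:
        return value.decode('ascii')
    except UnicodeDecodeError as exc:
        # Return the exception instead of raising so it can be inspected.
        return exc


result = implicit_decode_step(data)
print(type(result).__name__)
print(result)
```

The first byte, 0xd0, is already outside the ASCII range, so the decode stops at position 0.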