How do I convert a unicode string with cp1252 characters into UTF-8 with Python?

ninapavlich · Jul 25, 2017

I am getting text from an API that returns a Windows-1252 (cp1252) encoded apostrophe (\x92):

> python
>>> title = u'There\x92s thirty days in June'
>>> title
u'There\x92s thirty days in June'
>>> print title
Theres thirty days in June
>>> type(title)
<type 'unicode'>

I'm trying to convert this string to UTF-8 so that it instead prints as: "There’s thirty days in June".

When I try to decode or encode this unicode string, it throws an error:

>>> title.decode('cp1252')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/encodings/cp1252.py", line 15, in decode
    return codecs.charmap_decode(input,errors,decoding_table)
UnicodeEncodeError: 'ascii' codec can't encode character u'\x92' in position 5: ordinal not in range(128)

>>> title.encode("cp1252").decode("utf-8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/encodings/cp1252.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_table)
UnicodeEncodeError: 'charmap' codec can't encode character u'\x92' in position 5: character maps to <undefined>

If I initialize the string as a byte string (str) instead and then decode it, it works:

>>> title = 'There\x92s thirty days in June'
>>> type(title)
<type 'str'>
>>> print title.decode('cp1252')
There’s thirty days in June

My question is: how do I convert the unicode string that I'm getting into a byte string (str) so that I can decode it?

Answer

Anthony Sottile · Jul 25, 2017

It seems your string was originally cp1252 bytes that were decoded with latin1 somewhere upstream (it is already of type unicode, yet still contains \x92).
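
To illustrate why that looks like the cause, here is a minimal sketch. It assumes the API originally held the cp1252 bytes shown below (the variable names are only for illustration), and it is written so it should behave the same on Python 2.7 and Python 3:

```python
# Assumed original payload: cp1252 bytes, where 0x92 is a right single quote.
raw = b'There\x92s thirty days in June'

# latin1 maps every byte 0x00-0xFF to the code point with the same number,
# so decoding with it never fails, but 0x92 becomes the control char U+0092.
wrong = raw.decode('latin1')

# cp1252 maps byte 0x92 to U+2019 (the typographic apostrophe).
right = raw.decode('cp1252')

assert wrong == u'There\x92s thirty days in June'
assert right == u'There\u2019s thirty days in June'

# Because latin1 is a one-to-one mapping, encoding with it again
# recovers the original bytes unchanged.
assert wrong.encode('latin1') == raw
```

That lossless round trip is what makes the repair possible.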

  1. To convert it back to the bytes it originally was, you need to encode using that encoding (latin1).
  2. Then, to get text (unicode) back, you must decode using the proper codec (cp1252).
  3. Finally, if you want UTF-8 bytes, you must encode using the UTF-8 codec.

In code:

>>> title = u'There\x92s thirty days in June'
>>> title.encode('latin1')
'There\x92s thirty days in June'
>>> title.encode('latin1').decode('cp1252')
u'There\u2019s thirty days in June'
>>> print(title.encode('latin1').decode('cp1252'))
There’s thirty days in June
>>> title.encode('latin1').decode('cp1252').encode('UTF-8')
'There\xe2\x80\x99s thirty days in June'
>>> print(title.encode('latin1').decode('cp1252').encode('UTF-8'))
There’s thirty days in June

Depending on whether the consumer of the string takes text (unicode) or bytes, step 3 may not be necessary.
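
If you need this in more than one place, the steps could be wrapped in a small helper. This is only a sketch: the function name `fix_cp1252_mojibake` is made up here, and falling back to the input unchanged when the round trip fails is an assumption about what you would want, not the only sensible choice.

```python
def fix_cp1252_mojibake(text):
    """Repair text that was decoded as latin1 but was really cp1252.

    Returns the input unchanged if the round trip is not possible, e.g.
    when the text already contains characters outside latin1, or bytes
    that Python's cp1252 codec leaves undefined (such as 0x81).
    """
    try:
        return text.encode('latin1').decode('cp1252')
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


fixed = fix_cp1252_mojibake(u'There\x92s thirty days in June')
utf8_bytes = fixed.encode('utf-8')  # step 3, only if bytes are needed
```

Text that is already correct, such as u'There\u2019s', cannot be encoded as latin1 and is therefore returned as-is.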