I am getting text through an API that returns characters with a windows encoded apostrophe (\x92):
> python
>>> title = u'There\x92s thirty days in June'
>>> title
u'There\x92s thirty days in June'
>>> print title
Theres thirty days in June
>>> type(title)
<type 'unicode'>
I'm trying to convert this string to UTF-8 so that it instead returns: "There’s thirty days in June"
When I try to decode or encode this unicode string, it throws an error:
>>> title.decode('cp1252')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/encodings/cp1252.py", line 15, in decode
return codecs.charmap_decode(input,errors,decoding_table)
UnicodeEncodeError: 'ascii' codec can't encode character u'\x92' in position 5: ordinal not in range(128)
>>> title.encode("cp1252").decode("utf-8")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/encodings/cp1252.py", line 12, in encode
return codecs.charmap_encode(input,errors,encoding_table)
UnicodeEncodeError: 'charmap' codec can't encode character u'\x92' in position 5: character maps to <undefined>
If I were to initialize the string as plain-text and then decode it, it works:
>>>title = 'There\x92s thirty days in June'
>>> type(title)
<type 'str'>
>>>print title.decode('cp1252')
There’s thirty days in June
My question is how do I convert the unicode string that I'm getting into a plain-text string so that I can decode it?
It seems your string was decoded with latin1
(as it is of type unicode
)
latin1
)unicode
) you must decode using the proper codec (cp1252
)utf-8
bytes you must encode using the UTF-8
codec.In code:
>>> title = u'There\x92s thirty days in June'
>>> title.encode('latin1')
'There\x92s thirty days in June'
>>> title.encode('latin1').decode('cp1252')
u'There\u2019s thirty days in June'
>>> print(title.encode('latin1').decode('cp1252'))
There’s thirty days in June
>>> title.encode('latin1').decode('cp1252').encode('UTF-8')
'There\xe2\x80\x99s thirty days in June'
>>> print(title.encode('latin1').decode('cp1252').encode('UTF-8'))
There’s thirty days in June
Depending on whether the API takes text (unicode
) or bytes
, 3. may not be necessary.