Decoding double encoded utf8 in Python

Chris Ciesielski picture Chris Ciesielski · Jul 24, 2009 · Viewed 14.9k times · Source

I've got a problem with strings that I get from one of my clients over xmlrpc. He sends me utf8 strings that are encoded twice :( so when I get them in python I have an unicode object that has to be decoded one more time, but obviously python doesn't allow that. I've noticed my client however I need to do quick workaround for now before he fixes it.

Raw string from tcp dump:

<string>Rafa\xc3\x85\xc2\x82</string>

this is converted into:

u'Rafa\xc5\x82'

The best we get is:

eval(repr(u'Rafa\xc5\x82')[1:]).decode("utf8") 

This results in correct string which is:

u'Rafa\u0142' 

this works however is ugly as hell and cannot be used in production code. If anyone knows how to fix this problem in more suitable way please write. Thanks, Chris

Answer

Ivan Baldin picture Ivan Baldin · Jul 24, 2009
>>> s = u'Rafa\xc5\x82'
>>> s.encode('raw_unicode_escape').decode('utf-8')
u'Rafa\u0142'
>>>