Python's urllib.quote and urllib.unquote do not handle Unicode correctly in Python 2.6.5. This is what happens:
In [5]: print urllib.unquote(urllib.quote(u'Cataño'))
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
/home/kkinder/<ipython console> in <module>()
/usr/lib/python2.6/urllib.pyc in quote(s, safe)
1222 safe_map[c] = (c in safe) and c or ('%%%02X' % i)
1223 _safemaps[cachekey] = safe_map
-> 1224 res = map(safe_map.__getitem__, s)
1225 return ''.join(res)
1226
KeyError: u'\xc3'
Encoding the value to UTF8 also does not work:
In [6]: print urllib.unquote(urllib.quote(u'Cataño'.encode('utf8')))
Cataño
It's recognized as a bug and there is a fix, but not for my version of Python.
What I'd like is something similar to urllib.quote/urllib.unquote that handles Unicode values correctly, such that this code would work:
decode_url(encode_url(u'Cataño')) == u'Cataño'
Any recommendations?
Python's urllib.quote and urllib.unquote do not handle Unicode correctly
urllib does not handle Unicode at all. URLs don't contain non-ASCII characters, by definition. When you're dealing with urllib you should use only byte strings. If you want those to represent Unicode characters you will have to encode and decode them manually.
IRIs can contain non-ASCII characters, encoding them as UTF-8 sequences, but Python doesn't, at this point, have an irilib.
Encoding the value to UTF8 also does not work:
In [6]: print urllib.unquote(urllib.quote(u'Cataño'.encode('utf8')))
Cataño
Ah, well now you're typing Unicode into a console, and printing Unicode back to the console. This is generally unreliable, especially on Windows and, in your case, with the IPython console.
Type it out the long way with backslash sequences and you can more easily see that the urllib bit does actually work:
>>> import urllib
>>> u'Cata\u00F1o'.encode('utf-8')
'Cata\xc3\xb1o'
>>> urllib.quote(_)
'Cata%C3%B1o'
>>> urllib.unquote(_)
'Cata\xc3\xb1o'
>>> _.decode('utf-8')
u'Cata\xf1o'
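Those four steps can be wrapped into the encode_url/decode_url pair the question asks for. This is only a minimal sketch, assuming UTF-8 is the encoding you want on the wire. The function names come from the question; the import fallback for Python 3 (urllib.parse.quote and urllib.parse.unquote_to_bytes) is my own addition so that the same snippet runs on either version.

```python
# -*- coding: utf-8 -*-
try:
    # Python 2: quote/unquote live directly in urllib and work on byte strings.
    from urllib import quote, unquote
except ImportError:
    # Python 3 (assumption: you may want to run this there too).
    # unquote_to_bytes is the byte-level counterpart of Python 2's unquote.
    from urllib.parse import quote, unquote_to_bytes as unquote


def encode_url(text):
    """Encode a Unicode string to UTF-8 bytes, then percent-encode them."""
    return quote(text.encode('utf-8'))


def decode_url(quoted):
    """Percent-decode to bytes, then decode those bytes as UTF-8."""
    if not isinstance(quoted, bytes):
        # A percent-encoded value should be pure ASCII, so this is lossless.
        quoted = quoted.encode('ascii')
    return unquote(quoted).decode('utf-8')


original = u'Cata\u00f1o'
quoted = encode_url(original)        # 'Cata%C3%B1o'
round_tripped = decode_url(quoted)   # u'Cata\xf1o'
```

The important design choice is that urllib only ever sees bytes: the Unicode string is encoded before quoting and decoded after unquoting, never the other way round.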