URL encoding/decoding with Python

Joey · Aug 25, 2010 · Viewed 81.8k times

I am trying to encode and store, and decode arguments in Python and getting lost somewhere along the way. Here are my steps:

1) I use gtm_stringByEscapingForURLArgument from Google Toolbox for Mac to escape an NSString so it can be passed as an HTTP argument.

2) On my server (Python), I store these string arguments as something like u'1234567890-/:;()$&@".,?!\'[]{}#%^*+=_\\|~<>\u20ac\xa3\xa5\u2022.,?!\'' (these are the standard keys on an iPhone keypad in the "123" and "#+=" views; the \u and \x escapes are the euro, pound, and yen currency symbols and the bullet character).

3) I call urllib.quote(myString, '') on that stored value to percent-escape it for transport to the client, which can then unescape it.

The result is that I am getting an exception when I try to log the result of % escaping. Is there some crucial step I am overlooking that needs to be applied to the stored value with the \u and \x format in order to properly convert it for sending over http?

Update: The suggestion marked as the answer below worked for me. I am providing some updates to address the comments below to be complete, though.

The exception I received cited an issue with \u20ac. I don't know if it was a problem with that specifically, rather than the fact that it was the first unicode character in the string.

That \u20ac char is the unicode for the 'euro' symbol. I basically found I'd have issues with it unless I used the urllib2 quote method.

Answer

pycruft · Aug 25, 2010

URL-encoding a "raw" unicode string doesn't really make sense, because percent-escapes represent bytes, not code points. What you need to do is .encode("utf8") first, so you have a known byte encoding, and then quote() the result.

The output isn't very pretty but it should be a correct uri encoding.

>>> s = u'1234567890-/:;()$&@".,?!\'[]{}#%^*+=_\|~<>\u20ac\xa3\xa5\u2022.,?!\''
>>> urllib2.quote(s.encode("utf8"))
'1234567890-/%3A%3B%28%29%24%26%40%22.%2C%3F%21%27%5B%5D%7B%7D%23%25%5E%2A%2B%3D_%5C%7C%7E%3C%3E%E2%82%AC%C2%A3%C2%A5%E2%80%A2.%2C%3F%21%27'
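For readers on Python 3, where urllib2 no longer exists, the same encode-then-quote step might look roughly like the sketch below. It assumes the urllib.parse module, which is not used in the original answer, and it quotes only the non-ASCII tail of the string to keep the output short.

```python
from urllib.parse import quote

# The non-ASCII tail of the string above: euro, pound, yen, bullet.
s = '\u20ac\xa3\xa5\u2022'

# Encode to UTF-8 bytes first, then percent-escape those bytes.
# safe='' mirrors the urllib.quote(myString, '') call in the question.
encoded = quote(s.encode("utf-8"), safe='')
print(encoded)  # %E2%82%AC%C2%A3%C2%A5%E2%80%A2

# Python 3's quote() also accepts str and encodes it as UTF-8 by default,
# so the explicit encode is optional there.
assert quote(s, safe='') == encoded
```

The escapes match the tail of the Python 2 output above, which suggests the two approaches agree on the bytes being sent.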

Remember that you will need to both unquote() and decode() this to print it out properly if you're debugging or whatever.

>>> print urllib2.unquote(urllib2.quote(s.encode("utf8")))
1234567890-/:;()$&@".,?!'[]{}#%^*+=_\|~<>â‚¬Â£Â¥â€¢.,?!'
>>> # oops, nasty Â means we've got a utf8 byte stream being displayed as a single-byte encoding (e.g. cp1252)
>>> print urllib2.unquote(urllib2.quote(s.encode("utf8"))).decode("utf8")
1234567890-/:;()$&@".,?!'[]{}#%^*+=_\|~<>€£¥•.,?!'
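The same unquote-then-decode round trip can be sketched in Python 3, again assuming urllib.parse rather than the urllib2 calls used above. Decoding the bytes as cp1252 is only there to illustrate how the garbled output arises; the choice of cp1252 as the "wrong" encoding is an assumption.

```python
from urllib.parse import quote, unquote_to_bytes

s = '\u20ac\xa3\xa5\u2022'             # euro, pound, yen, bullet
encoded = quote(s.encode("utf-8"), safe='')

raw = unquote_to_bytes(encoded)        # percent-escapes back to UTF-8 bytes
wrong = raw.decode("cp1252")           # misread as a single-byte encoding: 'â‚¬Â£Â¥â€¢'
right = raw.decode("utf-8")            # decoded with the encoding actually used: '€£¥•'

assert right == s                      # the round trip recovers the original text
assert wrong != s                      # the single-byte reading does not
```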

Encoding to UTF-8 before quoting is, in fact, what the Django functions mentioned in another answer do:

The functions django.utils.http.urlquote() and django.utils.http.urlquote_plus() are versions of Python’s standard urllib.quote() and urllib.quote_plus() that work with non-ASCII characters. (The data is converted to UTF-8 prior to encoding.)
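A rough idea of what such wrappers do can be sketched with the standard library alone. The names quote_utf8 and quote_plus_utf8 below are hypothetical, the code targets Python 3's urllib.parse, and it is not Django's actual implementation.

```python
from urllib.parse import quote, quote_plus

def quote_utf8(text, safe='/'):
    """Hypothetical helper: UTF-8 encode the text, then percent-escape it."""
    return quote(text.encode("utf-8"), safe=safe)

def quote_plus_utf8(text, safe=''):
    """Same idea, but spaces become '+' as in form encoding."""
    return quote_plus(text.encode("utf-8"), safe=safe)

print(quote_utf8('caf\xe9/\u20ac'))   # caf%C3%A9/%E2%82%AC
print(quote_plus_utf8('5 \u20ac'))    # 5+%E2%82%AC
```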

If you apply any further quoting or encoding on top of this, be careful not to mangle the data, for example by percent-escaping it twice.
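One way things can get mangled is quoting an already-quoted value. A minimal sketch, assuming Python 3's urllib.parse:

```python
from urllib.parse import quote, unquote

once = quote('\u20ac', safe='')    # '%E2%82%AC'
twice = quote(once, safe='')       # the '%' signs themselves get escaped
print(once)    # %E2%82%AC
print(twice)   # %25E2%2582%25AC

# A single unquote() only peels off the outer layer.
assert unquote(twice) == once
assert unquote(unquote(twice)) == '\u20ac'
```

A client that unescapes only once would receive the literal text %E2%82%AC instead of the euro sign, so each layer of quoting needs a matching layer of unquoting.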