base64 encoding unicode strings in python 2.7

Marcin · Mar 5, 2012

Using the requests module, I retrieved a unicode string from a web service. The string contains the bytes of a binary document (PCL, as it happens). One of these bytes has the value 248, and attempting to base64-encode the string leads to the following errors:

In [68]: base64.b64encode(response_dict['content']+'\n')
---------------------------------------------------------------------------
UnicodeEncodeError                        Traceback (most recent call last)
C:\...\<ipython-input-68-8c1f1913eb52> in <module>()
----> 1 base64.b64encode(response_dict['content']+'\n')

C:\Python27\Lib\base64.pyc in b64encode(s, altchars)
     51     """
     52     # Strip off the trailing newline
---> 53     encoded = binascii.b2a_base64(s)[:-1]
     54     if altchars is not None:
     55         return _translate(encoded, {'+': altchars[0], '/': altchars[1]})

UnicodeEncodeError: 'ascii' codec can't encode character u'\xf8' in position 272: ordinal not in range(128)

In [69]: response_dict['content'].encode('base64')
---------------------------------------------------------------------------
UnicodeEncodeError                        Traceback (most recent call last)
C:\...\<ipython-input-69-7fd349f35f04> in <module>()
----> 1 response_dict['content'].encode('base64')

C:\...\base64_codec.pyc in base64_encode(input, errors)
     22     """
     23     assert errors == 'strict'
---> 24     output = base64.encodestring(input)
     25     return (output, len(input))
     26

C:\Python27\Lib\base64.pyc in encodestring(s)
    313     for i in range(0, len(s), MAXBINSIZE):
    314         chunk = s[i : i + MAXBINSIZE]
--> 315         pieces.append(binascii.b2a_base64(chunk))
    316     return "".join(pieces)
    317

UnicodeEncodeError: 'ascii' codec can't encode character u'\xf8' in position 44: ordinal not in range(128)

I find this slightly surprising, because 248 is within the range of an unsigned byte (and can be held in a byte string), but my real question is: what is the best or right way to encode this string?

My current work-around is this:

In [74]: byte_string = ''.join(map(compose(chr, ord), response_dict['content']))

In [75]: byte_string[272]
Out[75]: '\xf8'

This appears to work correctly, and the resulting byte_string is capable of being base64 encoded, but it seems like there should be a better way. Is there?

Answer

Cameron · Mar 5, 2012

You have a unicode string which you want to base64 encode. The problem is that b64encode() only works on bytes, not characters. So, you need to transform your unicode string (which is a sequence of abstract Unicode codepoints) into a byte string.
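The failure can be reproduced without base64 at all. This is a minimal sketch that assumes, as the 'ascii' codec message in the traceback suggests, that Python 2 implicitly encodes the unicode object as ASCII before handing it to binascii. The sample string and the helper name `ascii_error` are invented for illustration.

```python
# Hypothetical sample: plain ASCII text plus the code point 248 (u'\xf8').
sample = u'PCL data \xf8'


def ascii_error(text):
    """Return the error message from encoding `text` as ASCII, or None on success."""
    try:
        text.encode('ascii')
    except UnicodeEncodeError as exc:
        return str(exc)
    return None


message = ascii_error(sample)
print(message)
```

The value 248 fits in a byte, but ASCII only covers 0 to 127, so the conversion stops at that character.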

The mapping of abstract Unicode strings into a concrete series of bytes is called encoding. Python supports several encodings; I suggest the widely used UTF-8 encoding:

byte_string = response_dict['content'].encode('utf-8')
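Combined with the base64 step, this looks roughly like the sketch below. The `content` value is a made-up stand-in for `response_dict['content']`. One caveat, stated as an assumption about the question's data: if every code point is meant to stand for exactly one original byte (which is what the `chr(ord(c))` workaround implies), then `'latin-1'` reproduces those bytes one-for-one, whereas UTF-8 turns u'\xf8' into two bytes.

```python
import base64

# Stand-in for response_dict['content'] (hypothetical value).
content = u'ab\xf8'

# UTF-8 can represent any code point; u'\xf8' becomes the two bytes 0xC3 0xB8.
utf8_bytes = content.encode('utf-8')
utf8_b64 = base64.b64encode(utf8_bytes)

# Latin-1 maps code points 0-255 to the byte with the same value,
# so u'\xf8' stays the single byte 0xF8 (248).
latin1_bytes = content.encode('latin-1')
latin1_b64 = base64.b64encode(latin1_bytes)

print(repr(utf8_bytes))
print(repr(latin1_bytes))
```

Which of the two is appropriate depends on what the receiver expects: text to be decoded later, or the raw bytes of the original document.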

Note that whoever decodes the bytes also needs to know which encoding was used, so that they can get back a unicode string with the complementary decode() function:

# Decode
decoded = byte_string.decode('utf-8')
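A round-trip sketch of both directions, assuming the receiver knows that UTF-8 was used. The sample text and the variable names are invented for illustration.

```python
import base64

# Hypothetical sample text containing the code point 248 (u'\xf8').
original = u'gr\xf8nn'

# Sender: characters -> bytes -> base64
encoded = base64.b64encode(original.encode('utf-8'))

# Receiver: base64 -> bytes -> characters, using the same encoding
decoded = base64.b64decode(encoded).decode('utf-8')

print(decoded == original)
```

Decoding with a different codec than the one used for encoding would either raise an error or silently produce different characters.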

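Good starting points for learning more about Unicode and encodings are the Python docs and Joel Spolsky's article "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)".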