python byte string encode and decode

Question 1

python json unicode utf-8 python-unicode

kung-foo · Mar 7, 2012 · Viewed 30.7k times · Source

Answer

You need to examine the documentation for the software API that you are using. BLOB is an acronym: BINARY Large Object.

If your data is in fact binary, the idea of decoding it to Unicode is of course nonsense.
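
If the blob really is binary (raw Ethernet frames, for example) and still has to travel inside JSON, one common workaround is to carry it as Base64 text. This is a swapped-in technique, not something the API in question is known to provide. A minimal sketch, written for Python 3 where bytes and str are distinct types; the helper names blob_to_json and json_to_blob and the "frame" key are hypothetical:

```python
import base64
import json


def blob_to_json(blob: bytes) -> str:
    """Wrap raw bytes in a JSON document as Base64 text (hypothetical helper)."""
    # b64encode returns ASCII-only bytes, so decoding as ASCII cannot fail.
    text = base64.b64encode(blob).decode("ascii")
    return json.dumps({"frame": text})


def json_to_blob(document: str) -> bytes:
    """Recover the original bytes from a document made by blob_to_json."""
    return base64.b64decode(json.loads(document)["frame"])


raw = b"\x80\x00\xff"
doc = blob_to_json(raw)
assert json_to_blob(doc) == raw  # the round trip preserves every byte
```

The receiving side has to know that the field is Base64 and decode it; JSON itself has no binary type.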

If it is in fact text, you need to know what encoding to use to decode it to Unicode.

Then you pass the decoded object to json.dumps(a_Python_object). There is no need to encode it to UTF-8 yourself; if you do, json (in Python 2) simply decodes it back to Unicode again, so the output is the same:

>>> import json
>>> json.dumps(u"\u0100\u0404")
'"\\u0100\\u0404"'
>>> json.dumps(u"\u0100\u0404".encode('utf8'))
'"\\u0100\\u0404"'
>>>
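
The same steps in Python 3 terms, as a rough sketch: decode the bytes with the encoding you have established, then hand the resulting text to json.dumps. The sample bytes and the choice of UTF-8 are assumptions for illustration; substitute whatever encoding your data actually uses.

```python
import json

# Assumed sample: the UTF-8 encoding of U+0100 followed by U+0404.
raw = b"\xc4\x80\xd0\x84"

# Decode with the encoding the data is known to be in (assumed UTF-8 here).
text = raw.decode("utf-8")

# json.dumps escapes non-ASCII characters by default (ensure_ascii=True).
escaped = json.dumps(text)

# With ensure_ascii=False the characters are kept as-is in the result.
literal = json.dumps(text, ensure_ascii=False)

# Either form parses back to the same text.
assert json.loads(escaped) == json.loads(literal) == text
```

In Python 3, json.dumps raises TypeError if given a bytes object, so the decode step cannot be skipped.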

UPDATE about latin1:

Decoded as Latin-1, '\x80' becomes u'\x80', a meaningless C1 control character, so the encoding is extremely unlikely to be Latin-1. Latin-1 is "a snare and a delusion": every 8-bit byte is decoded to Unicode without raising an exception. Don't confuse "works" with "doesn't raise an exception".
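
To illustrate the difference between "works" and "doesn't raise an exception", here is a small Python 3 sketch that decodes the byte 0x80 three ways. The comparison with cp1252 and strict UTF-8 is added for illustration only; it says nothing about which encoding the data is actually in.

```python
raw = b"\x80"

# Latin-1 maps every byte to the code point of the same value, so it never fails.
as_latin1 = raw.decode("latin-1")
assert as_latin1 == "\x80"  # U+0080, a C1 control character

# cp1252 maps the same byte to the euro sign.
as_cp1252 = raw.decode("cp1252")
assert as_cp1252 == "\u20ac"

# A lone 0x80 is not valid UTF-8, so a strict decode raises.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
assert utf8_ok is False

# Every possible byte value decodes under Latin-1 without error.
assert len(bytes(range(256)).decode("latin-1")) == 256
```

The same byte gives three different outcomes, and only the strict UTF-8 decode signals that something is wrong.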

Question 2

I am trying to convert an incoming byte string that contains non-ASCII characters into a valid UTF-8 string so that I can dump it as JSON.

b = '\x80'
u8 = b.encode('utf-8')
j = json.dumps(u8)

I expected j to be '\xc2\x80' but instead I get:

UnicodeDecodeError: 'ascii' codec can't decode byte 0x80 in position 0: ordinal not in range(128)

In my situation, 'b' is coming from MySQL via Google protocol buffers and is filled with some blob data.

Any ideas?

EDIT: I have Ethernet frames that are stored in a MySQL table as a blob (please, everyone, stay on topic and keep from discussing why there are packets in a table). The table collation is utf-8, and the db layer (SQLAlchemy, non-ORM) grabs the data and creates structs (Google protocol buffers) which store the blob as a Python 'str'. In some cases I use the protocol buffers directly without any issue. In other cases, I need to expose the same data via JSON. What I noticed is that when json.dumps() does its thing, '\x80' can be replaced with the invalid Unicode char (\ufffd iirc).
