How can I compare a unicode type to a string in python?

rGil · May 9, 2013

I am trying to use a list comprehension that compares string objects, but one of the strings is utf-8, the byproduct of json.loads. Scenario:

us = u'MyString' # is the utf-8 string

Part one of my question: why does this return False?

us.encode('utf-8') == "MyString" ## False

Part two - how can I compare within a list comprehension?

myComp = [utfString for utfString in jsonLoadsObj
           if utfString.encode('utf-8') == "MyString"] #wrapped to read on S.O.

EDIT: I'm using Google App Engine, which uses Python 2.7

Here's a more complete example of the problem:

#json coming from remote server:
#response object looks like:  {"number1":"first", "number2":"second"}

data = json.loads(response)
k = data.keys()

I need something like:
myList = [item for item in k if item=="number1"]  

#### I thought this would work:
myList = [item for item in k if item.encode('utf-8')=="number1"]

Answer

Martijn Pieters · May 9, 2013

You must be looping over the wrong data set. Loop directly over the JSON-loaded dictionary; iterating over a dictionary yields its keys, so there is no need to call .keys() first:

data = json.loads(response)
myList = [item for item in data if item == "number1"]  

You may want to use u"number1" to avoid implicit conversions between Unicode and byte strings:

data = json.loads(response)
myList = [item for item in data if item == u"number1"]  

Both versions work fine:

>>> import json
>>> data = json.loads('{"number1":"first", "number2":"second"}')
>>> [item for item in data if item == "number1"]
[u'number1']
>>> [item for item in data if item == u"number1"]
[u'number1']
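
As a rough sketch of why .keys() is unnecessary, the snippet below compares the keys obtained by iterating over the dictionary with those returned by .keys(). The variable names are made up for illustration, and the JSON text is the sample response from the question.

```python
import json

data = json.loads('{"number1": "first", "number2": "second"}')

# Iterating over a dict yields its keys, so .keys() adds nothing here.
keys_by_iteration = sorted(item for item in data)
keys_by_method = sorted(data.keys())
same_keys = keys_by_iteration == keys_by_method

# For a single known key, a membership test avoids the list comprehension.
has_number1 = u"number1" in data
```

If all you need is to know whether the key is present, the membership test is probably the simpler choice.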

Note that in your first example, us is not a UTF-8 string; it is Unicode data, which the json library has already decoded for you. A UTF-8 string, on the other hand, is a sequence of encoded bytes. You may want to read up on Unicode and Python to understand the difference.
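
To make the distinction concrete, here is a minimal sketch using a value with one non-ASCII character. The sample JSON text and the variable names are assumptions chosen for illustration; the code is written to run on Python 2.7 as well as Python 3.

```python
import json

# json.loads returns decoded text (unicode on Python 2), not bytes.
text = json.loads(u'"caf\\u00e9"')

# Encoding turns the text into a UTF-8 byte sequence.
encoded = text.encode('utf-8')

# The accented character is one code point but two UTF-8 bytes.
text_length = len(text)      # 4 code points
byte_length = len(encoded)   # 5 bytes

# Decoding the bytes gives back the original text.
round_trip = encoded.decode('utf-8')
```

The two lengths differ because the text counts characters while the encoded form counts bytes, which is why the two types should not be treated as interchangeable.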

On Python 2, your expectation that the test returns True is correct; you must be doing something else wrong:

>>> us = u'MyString'
>>> us
u'MyString'
>>> type(us)
<type 'unicode'>
>>> us.encode('utf8') == 'MyString'
True
>>> type(us.encode('utf8'))
<type 'str'>

There is no need to encode the strings to UTF-8 to make comparisons; use unicode literals instead:

myComp = [elem for elem in json_data if elem == u"MyString"]
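
If the value you compare against arrives as a byte string (read from a file or a socket, say), one option is to decode it once and compare text with text. The following is only a sketch: to_text is a hypothetical helper, not part of the json library, and it assumes the bytes are UTF-8.

```python
import json


def to_text(value, encoding='utf-8'):
    """Hypothetical helper: decode byte strings, pass text through unchanged."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


json_data = json.loads('["MyString", "Other"]')

# Decode the byte string once, then compare unicode with unicode.
wanted = to_text(b'MyString')
matches = [elem for elem in json_data if elem == wanted]
```

Decoding at the boundary keeps the comparison itself free of implicit conversions, which is the same idea as using a unicode literal.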