I am trying to use a list comprehension that compares string objects, but one of the strings is utf-8, the byproduct of json.loads. Scenario:
us = u'MyString' # is the utf-8 string
Part one of my question: why does this return False?
us.encode('utf-8') == "MyString" ## False
Part two - how can I compare within a list comprehension?
myComp = [utfString for utfString in jsonLoadsObj
if utfString.encode('utf-8') == "MyString"] #wrapped to read on S.O.
EDIT: I'm using Google App Engine, which uses Python 2.7
Here's a more complete example of the problem:
#json coming from remote server:
#response object looks like: {"number1":"first", "number2":"second"}
data = json.loads(response)
k = data.keys()
I need something like:
myList = [item for item in k if item=="number1"]
#### I thought this would work:
myList = [item for item in k if item.encode('utf-8')=="number1"]
You must be looping over the wrong data set. Just loop directly over the JSON-loaded dictionary; there is no need to call .keys() first:
data = json.loads(response)
myList = [item for item in data if item == "number1"]
You may want to use u"number1" to avoid implicit conversions between Unicode and byte strings:
data = json.loads(response)
myList = [item for item in data if item == u"number1"]
Both versions work fine:
>>> import json
>>> data = json.loads('{"number1":"first", "number2":"second"}')
>>> [item for item in data if item == "number1"]
[u'number1']
>>> [item for item in data if item == u"number1"]
[u'number1']
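The implicit conversion mentioned above only works out when the data is pure ASCII. As a minimal sketch (the sample value and variable names are made up for illustration and are not from your data), comparing encoded bytes against unicode text stops matching as soon as a non-ASCII character is involved:

```python
# Hypothetical sample value; \xe9 is a non-ASCII character (e with acute accent).
text = u"caf\xe9"
encoded = text.encode("utf-8")  # byte string holding the UTF-8 encoding

# Mixed comparison: bytes against unicode text.
# On Python 2 this triggers an implicit ASCII decode of the bytes, which
# fails for non-ASCII data, so the result is False (with a UnicodeWarning).
# On Python 3 bytes and text never compare equal, so it is False as well.
mixed_equal = (encoded == text)

# Comparing text with text is reliable: decode first, then compare.
text_equal = (encoded.decode("utf-8") == text)

print(mixed_equal, text_equal)
```

Comparing unicode with unicode avoids the problem entirely, which is why the u"number1" literal is the safer choice.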
Note that in your first example, us is not a UTF-8 string; it is unicode data, because the json library has already decoded it for you. A UTF-8 string, on the other hand, is a sequence of encoded bytes. You may want to read up on Unicode and Python to understand the difference:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky
Pragmatic Unicode by Ned Batchelder
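To make the difference between unicode data and UTF-8 bytes concrete, here is a small sketch; the sample string is an assumption chosen only because it contains a character that needs more than one byte:

```python
# Hypothetical sample: the euro sign (U+20AC) followed by the digit 5.
text = u"\u20ac5"
encoded = text.encode("utf-8")      # text -> UTF-8 bytes
decoded = encoded.decode("utf-8")   # UTF-8 bytes -> text again

# The text has 2 characters, but its UTF-8 encoding takes 4 bytes:
# 3 bytes for the euro sign plus 1 byte for the digit.
print(len(text), len(encoded))

# The two objects have different types (unicode vs. str on Python 2,
# str vs. bytes on Python 3); decoding restores the original text.
print(type(text), type(encoded))
print(decoded == text)
```

json.loads hands you objects of the first kind (decoded text), so there is nothing left to decode or encode before comparing.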
On Python 2, your expectation that your test returns True would be correct, so you must be doing something else wrong:
>>> us = u'MyString'
>>> us
u'MyString'
>>> type(us)
<type 'unicode'>
>>> us.encode('utf8') == 'MyString'
True
>>> type(us.encode('utf8'))
<type 'str'>
There is no need to encode the strings to UTF-8 to make comparisons; use unicode literals instead:
myComp = [elem for elem in json_data if elem == u"MyString"]
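Putting the pieces together, a minimal end-to-end sketch might look like the following. The helper name matching_keys and the hard-coded response string are assumptions for illustration; in your code the response would come from the remote server:

```python
import json


def matching_keys(response, wanted):
    """Return the keys of a JSON object that equal `wanted`.

    `matching_keys` is a hypothetical helper, not part of the json library.
    """
    data = json.loads(response)  # keys and values come back as unicode text
    # Iterate over the dict directly; no .keys() call and no encoding needed.
    return [key for key in data if key == wanted]


# Stand-in for the remote server's response body.
response = '{"number1": "first", "number2": "second"}'
matches = matching_keys(response, u"number1")
print(matches)
```

Passing the search term as a unicode literal keeps both sides of the comparison the same type.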