I am reading data from a file which contains words with French and English letters. I am attempting to construct a list of all of the possible English and French letters (stored as strings). I do this with the code below:
# encoding: utf-8
import urllib2

def trackLetter(letters, line):
    for a in line:
        found = False
        for b in letters:
            if b == a:
                found = True
        if not found:
            letters += a

cur_letters = []  # for storing possible letters
data = urllib2.urlopen('https://duolinguist.wordpress.com/2015/01/06/top-5000-words-in-french-wordlist/', 'utf-8')
for line in data:
    trackLetter(cur_letters, line)
    # works if I print here
print cur_letters
This code prints the following:
['t', 'h', 'e', 'o', 'f', 'a', 'n', 'd', 'i', 'r', 's', 'b', 'y', 'w', 'u', 'm', 'l', 'v', 'c', 'p', 'g', 'k', 'x', 'j', 'z', 'q', '\xc3', '\xa0', '\xaa', '\xb9', '\xa9', '\xa8', '\xb4', '\xae', '-', '\xe2', '\x80', '\x99', '\xa2', '\xa7', '\xbb', '\xaf']
Obviously the French letters have been lost in some sort of conversion to ASCII, despite me specifying the UTF encoding! The strange thing is that when I print out the line directly (shown as a comment), the French characters appear perfectly!
What should I do to preserve these characters (é, è, ê, etc.), or convert them back to their original version?
They aren't lost; they're just escaped when you print the list.
When you print a list in Python 2, it calls the __str__ method of the list itself, not of each individual item, and the list's __str__ method builds its output from the repr() of each item, which escapes your non-ASCII characters. See this excellent answer for more explanation:
The following snippet demonstrates the issue succinctly:
char_list = ['é', 'è', 'ê']
print(char_list)
# ['\xc3\xa9', '\xc3\xa8', '\xc3\xaa']
print(', '.join(char_list))
# é, è, ê
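One further point, offered as a sketch and not as the definitive fix: the list in the question holds single bytes such as '\xc3' and '\xa9', because iterating over a Python 2 byte string yields one byte at a time, and each accented letter occupies two bytes in UTF-8. Decoding each line to unicode before iterating should keep every letter in one piece. The names `track_letters` and `raw_lines` below are hypothetical, and the sample bytes stand in for the lines read from the download.

```python
def track_letters(letters, line):
    """Append each character of `line` that is not already in `letters`."""
    for ch in line:
        if ch not in letters:
            letters.append(ch)

# Hypothetical sample data: the UTF-8 bytes for "été", "très" and "être",
# standing in for the raw lines returned by the URL in the question.
raw_lines = [b'\xc3\xa9t\xc3\xa9\n', b'tr\xc3\xa8s\n', b'\xc3\xaatre\n']

letters = []
for raw in raw_lines:
    # Decode bytes to text first, so the loop sees whole characters
    # instead of the individual bytes of each multi-byte character.
    text = raw.decode('utf-8').strip()
    track_letters(letters, text)
```

After this runs, `letters` holds seven one-character unicode strings, and printing `u', '.join(letters)` instead of the list itself shows them unescaped, for the same reason as in the snippet above.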