Handling French letters in Python

David Ferris · Nov 24, 2016

I am reading data from a file which contains words with French and English letters. I am trying to build a list of all the distinct English and French letters that occur (stored as strings). I do this with the code below:

# encoding: utf-8
import urllib2

def trackLetter(letters, line):
    # append each item of line that is not already in letters
    for a in line:
        if a not in letters:
            letters.append(a)

cur_letters = []  # for storing possible letters

data = urllib2.urlopen('https://duolinguist.wordpress.com/2015/01/06/top-5000-words-in-french-wordlist/', 'utf-8')
for line in data:
    trackLetter(cur_letters, line)
    # works if I print here

print cur_letters

This code prints the following:

['t', 'h', 'e', 'o', 'f', 'a', 'n', 'd', 'i', 'r', 's', 'b', 'y', 'w', 'u', 'm', 'l', 'v', 'c', 'p', 'g', 'k', 'x', 'j', 'z', 'q', '\xc3', '\xa0', '\xaa', '\xb9', '\xa9', '\xa8', '\xb4', '\xae', '-', '\xe2', '\x80', '\x99', '\xa2', '\xa7', '\xbb', '\xaf']

Obviously the French letters have been lost in some sort of conversion to ASCII, despite my specifying the UTF-8 encoding! The strange thing is that when I print out the line directly (shown as a comment), the French characters appear perfectly!

What should I do to preserve these characters (é, è, ê, etc.), or convert them back to their original version?

Answer

Greg · Nov 24, 2016

They aren't lost; they're just escaped when you print the list.

When you print a list in Python 2, Python calls the __str__ method of the list itself, not of each individual item. The list's __str__ builds its output from the repr() of each item, and the repr() of a byte string escapes non-ASCII bytes as \xNN sequences. See this excellent answer for more explanation:

How does str(list) work?
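As a rough illustration of that point, the sketch below checks that the string form of a list matches the repr() of its items joined together. The names `items`, `as_list` and `from_reprs` are made up for this example, and the byte strings are built with escapes so the snippet does not depend on the source file encoding. It should behave the same way on Python 2 and Python 3, although Python 3 adds a `b` prefix to each byte string.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function

# UTF-8 encoded bytes for the letters e-acute and e-grave (illustrative data).
items = [u"\u00e9".encode("utf-8"), u"\u00e8".encode("utf-8")]

# String form of the whole list, as produced when the list is printed.
as_list = str(items)

# The same string rebuilt by hand from repr() of each element.
from_reprs = "[" + ", ".join(repr(item) for item in items) + "]"

print(as_list)
print(as_list == from_reprs)  # True
```

Printing an item on its own goes through that item's __str__ instead, which is why a single string shows up unescaped.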

The following snippet demonstrates the issue succinctly:

# encoding: utf-8
# Python 2, with this source file saved as UTF-8
char_list = ['é', 'è', 'ê']
print(char_list)
# ['\xc3\xa9', '\xc3\xa8', '\xc3\xaa']

print(', '.join(char_list))
# é, è, ê
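One further detail, separate from the escaping: the list in the question holds '\xc3' and '\xa9' as separate entries. That suggests each line was iterated byte by byte, so every accented letter was split into its two UTF-8 bytes. The second positional argument of urllib2.urlopen is `data` (a request body), not an encoding, so the bytes read from the response still have to be decoded explicitly. Below is a minimal sketch of that idea, assuming the input really is UTF-8. `track_letters` and `raw_lines` are hypothetical names, and the byte strings stand in for lines read from the network. It is written to run on both Python 2 and Python 3.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function, unicode_literals


def track_letters(letters, line):
    """Append each character of `line` that is not already in `letters`."""
    for ch in line:
        if ch not in letters:
            letters.append(ch)


# Stand-in for lines read from a response: UTF-8 encoded bytes.
raw_lines = ["été\n".encode("utf-8"), "être\n".encode("utf-8")]

letters = []
for raw in raw_lines:
    # Decode first, so that iterating yields whole characters, not single bytes.
    text = raw.decode("utf-8").strip()
    track_letters(letters, text)

# Join the items instead of printing the list, to avoid the escaped repr().
print(", ".join(letters))
```

In the question's code, the equivalent change would be to call line.decode('utf-8') before passing the line to trackLetter.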