Setting the encoding for the SAX parser in Python

Dan Weaver · May 13, 2009

When I feed UTF-8 encoded XML to an ExpatParser instance:

import codecs
import xml.sax

def test(filename):
    parser = xml.sax.make_parser()
    # codecs.open yields decoded unicode strings, one line at a time
    with codecs.open(filename, 'r', encoding='utf-8') as f:
        for line in f:
            parser.feed(line)

...I get the following:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "test.py", line 72, in search_test
    parser.feed(line)
  File "/System/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/xml/sax/expatreader.py", line 207, in feed
    self._parser.Parse(data, isFinal)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xb4' in position 29: ordinal not in range(128)

I'm probably missing something obvious here. How do I change the parser's encoding from 'ascii' to 'utf-8'?

Answer

Stephan202 · May 13, 2009

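Your code fails in Python 2.6, but works in 3.0. The likely cause is that codecs.open() hands the parser already-decoded unicode strings, which Python 2 then tries to convert back to bytes with the default 'ascii' codec.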

The following does work in 2.6, presumably because the parser receives the raw file and can work out the encoding itself (perhaps by reading the encoding declaration optionally given on the first line of the XML file, and otherwise defaulting to UTF-8):

import xml.sax

def test(filename):
    parser = xml.sax.make_parser()
    with open(filename, 'rb') as f:  # raw bytes; the parser detects the encoding
        parser.parse(f)
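
If you do need to feed the parser incrementally, as in the question, the same idea should carry over: read the file in binary mode and pass undecoded bytes to feed(). Below is a minimal sketch, assuming Python 3. The names TextCollector and feed_bytes are made up for illustration, and an in-memory stream stands in for a real file.

```python
import io
import xml.sax


class TextCollector(xml.sax.ContentHandler):
    """Hypothetical handler that gathers all character data."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, content):
        # The parser delivers text already decoded to str.
        self.chunks.append(content)


def feed_bytes(stream, chunk_size=4096):
    """Feed raw bytes from a binary stream to an incremental SAX parser."""
    handler = TextCollector()
    parser = xml.sax.make_parser()
    parser.setContentHandler(handler)
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        parser.feed(chunk)  # bytes, not decoded text
    parser.close()  # signal the end of the document
    return "".join(handler.chunks)


# In-memory stand-in for a UTF-8 encoded file containing a non-ASCII character.
sample = '<?xml version="1.0" encoding="utf-8"?><doc>caf\u00b4</doc>'.encode("utf-8")
text = feed_bytes(io.BytesIO(sample), chunk_size=16)
```

With a real file, open(filename, 'rb') would take the place of the io.BytesIO object. Because the parser does the decoding, the caller never has to name a codec.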