I am parsing some XML with the elementtree.parse() function. It works, except for some utf-8 characters(single byte character above 128). I see that the default parser is XMLTreeBuilder which is based on expat.
Is there an alternative parser that I can use that may be less strict and allow utf-8 characters?
This is the error I'm getting with the default parser:
ExpatError: not well-formed (invalid token): line 311, column 190
The character causing this is a single byte x92 (in hex). I'm not certain this is even a valid utf-8 character. But it would be nice to handle it because most text editors display this as: í
EDIT: The context of the character is: canít , where I assume it is supposed to be a fancy apostraphe, but in the hex editor, that same sequence is: 63 61 6E 92 74
I'll start from the question: "Is there an alternative parser that I can use that may be less strict and allow utf-8 characters?"
All XML parsers will accept data encoded in UTF-8. In fact, UTF-8 is the default encoding.
An XML document may start with a declaration like this:
`<?xml version="1.0" encoding="UTF-8"?>`
or like this:
<?xml version="1.0"?>
or not have a declaration at all ... in each case the parser will decode the document using UTF-8.
However your data is NOT encoded in UTF-8 ... it's probably Windows-1252 aka cp1252.
If the encoding is not UTF-8, then either the creator should include a declaration (or the recipient can prepend one) or the recipient can transcode the data to UTF-8. The following showcases what works and what doesn't:
>>> import xml.etree.ElementTree as ET
>>> from StringIO import StringIO as sio
>>> raw_text = '<root>can\x92t</root>' # text encoded in cp1252, no XML declaration
>>> t = ET.parse(sio(raw_text))
[tracebacks omitted]
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 1, column 9
# parser is expecting UTF-8
>>> t = ET.parse(sio('<?xml version="1.0" encoding="UTF-8"?>' + raw_text))
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 1, column 47
# parser is expecting UTF-8 again
>>> t = ET.parse(sio('<?xml version="1.0" encoding="cp1252"?>' + raw_text))
>>> t.getroot().text
u'can\u2019t'
# parser was told to expect cp1252; it works
>>> import unicodedata
>>> unicodedata.name(u'\u2019')
'RIGHT SINGLE QUOTATION MARK'
# not quite an apostrophe, but better than an exception
>>> fixed_text = raw_text.decode('cp1252').encode('utf8')
# alternative: we transcode the data to UTF-8
>>> t = ET.parse(sio(fixed_text))
>>> t.getroot().text
u'can\u2019t'
# UTF-8 is the default; no declaration needed