ParseError: not well-formed (invalid token) using cElementTree

BioGeek picture BioGeek · Oct 24, 2012 · Viewed 111.5k times · Source

I receive xml strings from an external source that can contains unsanitized user contributed content.

The following xml string gave a ParseError in cElementTree:

>>> print repr(s)
'<Comment>dddddddd\x08\x08\x08\x08\x08\x08_____</Comment>'
>>> import xml.etree.cElementTree as ET
>>> ET.XML(s)

Traceback (most recent call last):
  File "<pyshell#4>", line 1, in <module>
    ET.XML(s)
  File "<string>", line 106, in XML
ParseError: not well-formed (invalid token): line 1, column 17

Is there a way to make cElementTree not complain?

Answer

iabdalkader picture iabdalkader · Oct 24, 2012

It seems to complain about \x08 you will need to escape that.

Edit:

Or you can have the parser ignore the errors using recover

from lxml import etree
parser = etree.XMLParser(recover=True)
etree.fromstring(xmlstring, parser=parser)