Trying to parse XML with ElementTree that contains an undefined entity (e.g. &nbsp;) raises:
ParseError: undefined entity
In Python 2.x, the XML entity dict can be updated by creating a parser (documentation):
parser = ET.XMLParser()
parser.entity["nbsp"] = unichr(160)
but how to do the same with Python 3.x?
Update: There was a misunderstanding on my side. I overlooked that I was calling parser.parser.UseForeignDTD(1) before trying to update the XML entity dict, which was causing an error with the parser. Luckily, @m.brindley was patient and pointed out that the XML entity dict still exists in Python 3.x and can be updated the same way as in Python 2.x.
The issue here is that the only valid mnemonic entities in XML are quot, amp, apos, lt and gt. This means that almost all (X)HTML named entities must be defined in the DTD using the entity declaration markup defined in the XML 1.1 spec. If the document is to be standalone, this should be done with an inline DTD like so:
<?xml version="1.1" ?>
<!DOCTYPE naughtyxml [
    <!ENTITY nbsp "&#160;">
    <!ENTITY copy "&#169;">
]>
<data>
    <country name="Liechtenstein">
        <rank>1&nbsp;&gt;</rank>
        <year>2008&copy;</year>
        <gdppc>141100</gdppc>
        <neighbor name="Austria" direction="E"/>
        <neighbor name="Switzerland" direction="W"/>
    </country>
</data>
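As a quick check that an inline DTD is enough for ElementTree, here is a minimal sketch. It uses a trimmed-down document (the reduced element set and the XML 1.0 declaration are my own choices for illustration) and relies on expat expanding entities that are declared in the internal subset:

```python
import xml.etree.ElementTree as ET

# Trimmed-down illustrative document: the inline DTD declares both
# entities, so no parser configuration is needed.
doc = '''<?xml version="1.0"?>
<!DOCTYPE data [
    <!ENTITY nbsp "&#160;">
    <!ENTITY copy "&#169;">
]>
<data>
    <rank>1&nbsp;&gt;</rank>
    <year>2008&copy;</year>
</data>'''

root = ET.fromstring(doc)
rank_text = root.find('rank').text
year_text = root.find('year').text
print(repr(rank_text))
print(repr(year_text))
```

Here the default parser is used unchanged; the entity values come entirely from the document itself.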
The XMLParser in xml.etree.ElementTree uses an xml.parsers.expat parser to do the actual parsing. In the init arguments for XMLParser, there is a space for 'predefined HTML entities' but that argument is not implemented yet. An empty dict named entity is created in the init method and this is what is used to look up undefined entities.
I don't think expat (and by extension, the ET XMLParser) is able to handle switching namespaces to something like XHTML to get around this, possibly because it will not fetch external namespace definitions (I tried making xmlns="http://www.w3.org/1999/xhtml" the default namespace for the data element but it did not play nicely), but I can't confirm that. By default, expat will raise an error for non-XML entities, but you can get around that by defining an external DOCTYPE: this causes the expat parser to pass undefined entity entries back to the ET.XMLParser's _default() method.
The _default() method does a lookup of the entity dict in the XMLParser instance and, if it finds a matching key, replaces the entity with the associated value. This maintains the Python 2.x syntax mentioned in the question.
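To make the role of the external DOCTYPE concrete, here is a small sketch that parses the same body twice with the same entity mapping, once with and once without a DOCTYPE. The helper name parse_with_copy and the one-element document are hypothetical, chosen only for this illustration:

```python
import xml.etree.ElementTree as ET

DOCTYPE = ('<!DOCTYPE data PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" '
           '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">')
BODY = '<data>2008&copy;</data>'


def parse_with_copy(doc):
    """Parse doc with a fresh parser whose entity dict maps 'copy'."""
    parser = ET.XMLParser()
    parser.entity['copy'] = chr(169)
    return ET.fromstring(doc, parser)


# With the external DOCTYPE, the undefined entity is looked up in the dict.
with_doctype = parse_with_copy(DOCTYPE + BODY).text

# Without it, expat rejects the entity before the dict is ever consulted.
try:
    parse_with_copy(BODY)
    without_doctype_error = None
except ET.ParseError as exc:
    without_doctype_error = exc

print(repr(with_doctype))
print(without_doctype_error)
```

The DTD named in the DOCTYPE is not downloaded; its presence alone is what changes how expat treats the unknown entity.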
Solutions:
- Use chr() in py3k; unichr() is not a valid name anymore.
- Update XMLParser.entity with html.entities.html5 to map all valid HTML5 mnemonic entities to their characters.
- Use HTMLParser to handle mnemonic entities, but this won't return an ElementTree as desired.
Here is the snippet I used. It parses XML with an external DOCTYPE through HTMLParser (to demonstrate how to add entity handling by subclassing), ET.XMLParser with entity mappings, and expat (which will just silently ignore undefined entities due to the external DOCTYPE). There is a valid XML entity (&gt;) and an undefined entity (&copy;), which I map to chr(0x24B8) with the ET.XMLParser.
from html.parser import HTMLParser
from html.entities import name2codepoint
import xml.etree.ElementTree as ET
import xml.parsers.expat as expat
xml = '''<?xml version="1.0"?>
<!DOCTYPE data PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<data>
    <country name="Liechtenstein">
        <rank>1&gt;</rank>
        <year>2008&copy;</year>
        <gdppc>141100</gdppc>
        <neighbor name="Austria" direction="E"/>
        <neighbor name="Switzerland" direction="W"/>
    </country>
</data>'''
# HTMLParser subclass which handles entities
print('=== HTMLParser')
class MyHTMLParser(HTMLParser):
    def handle_starttag(self, name, attrs):
        print('Start element:', name, attrs)
    def handle_endtag(self, name):
        print('End element:', name)
    def handle_data(self, data):
        print('Character data:', repr(data))
    def handle_entityref(self, name):
        self.handle_data(chr(name2codepoint[name]))
# convert_charrefs=False so that handle_entityref() is actually called;
# with the default (True) entities are converted before the handlers run
htmlparser = MyHTMLParser(convert_charrefs=False)
htmlparser.feed(xml)
# ET.XMLParser parse
print('=== XMLParser')
parser = ET.XMLParser()
parser.entity['copy'] = chr(0x24B8)
root = ET.fromstring(xml, parser)
print(ET.tostring(root))
for elem in root:
    print(elem.tag, ' - ', elem.attrib)
    for subelem in elem:
        print(subelem.tag, ' - ', subelem.attrib, ' - ', subelem.text)
# Expat parse
def start_element(name, attrs):
    print('Start element:', name, attrs)
def end_element(name):
    print('End element:', name)
def char_data(data):
    print('Character data:', repr(data))
print('=== Expat')
expatparser = expat.ParserCreate()
expatparser.StartElementHandler = start_element
expatparser.EndElementHandler = end_element
expatparser.CharacterDataHandler = char_data
# True marks this as the final chunk of input
expatparser.Parse(xml, True)
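The suggestion of updating XMLParser.entity with html.entities.html5 can be sketched as follows. One detail to watch is that the keys in html5 include the trailing semicolon (for example 'hearts;'), while the entity lookup appears to use the bare name, so this sketch strips the semicolon first. The name HTML5_ENTITIES and the sample document are my own, for illustration only:

```python
from html.entities import html5
import xml.etree.ElementTree as ET

# html5 keys carry a trailing ';' (e.g. 'hearts;'); the parser looks names
# up without it, so strip it while building the mapping.
HTML5_ENTITIES = {name[:-1]: value
                  for name, value in html5.items()
                  if name.endswith(';')}

# An external DOCTYPE is still required so that expat hands undefined
# entities over to the entity dict instead of raising an error itself.
doc = ('<!DOCTYPE data PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" '
       '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">'
       '<data>I&nbsp;&hearts;&nbsp;XML</data>')

parser = ET.XMLParser()
parser.entity.update(HTML5_ENTITIES)
root = ET.fromstring(doc, parser)
print(repr(root.text))
```

A parser instance cannot be reused after a document has been parsed, so in practice the update would live in a small factory function that builds a fresh parser per document.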