Python ElementTree support for parsing unknown XML entities?

Bryce picture Bryce · Aug 30, 2011 · Viewed 21.2k times · Source

I have a set of super simple XML files to parse... but... they use custom defined entities. I don't need to map these to characters, but I do wish to parse and act on each one. For example:

<Style name="admin-5678">
    <Rule>
      <Filter>[admin_level]='5'</Filter>
      &maxscale_zoom11;
    </Rule>
</Style>

There is a tantalizing hint at http://effbot.org/elementtree/elementtree-xmlparser.htm that XMLParser has limited entity support, but I can't find the methods mentioned, everything gives errors:

    #!/usr/bin/python
    ##
    ## Where's the entity support as documented at:
    ## http://effbot.org/elementtree/elementtree-xmlparser.htm
    ## In Python 2.7.1+ ?
    ##
    from pprint     import pprint
    from xml.etree  import ElementTree
    from cStringIO  import StringIO

    parser = ElementTree.ElementTree()
   #parser.entity["maxscale_zoom11"] = unichr(160)
    testf = StringIO('<foo>&maxscale_zoom11;</foo>')
    tree = parser.parse(testf)
   #tree = parser.parse(testf,"XMLParser")
    for node in tree.iter('foo'):
        print node.text

Which depending on how you adjust the comments gives:

xml.etree.ElementTree.ParseError: undefined entity: line 1, column 5

or

AttributeError: 'ElementTree' object has no attribute 'entity'

or

AttributeError: 'str' object has no attribute 'feed'           

For those curious the XML is from the OpenStreetMap's mapnik project.

Answer

RayLuo picture RayLuo · Feb 24, 2016

As @cnelson already pointed out in a comment, the chosen solution here won't work in Python 3.

I finally got it working. Quoted from this Q&A.

Inspired by this post, we can just prepend some XML definition to the incoming raw HTML content, and then ElementTree would work out of box.

This works for both Python 2.6, 2.7, 3.3, 3.4.

import xml.etree.ElementTree as ET

html = '''<html>
    <div>Some reasonably well-formed HTML content.</div>
    <form action="login">
    <input name="foo" value="bar"/>
    <input name="username"/><input name="password"/>

    <div>It is not unusual to see &nbsp; in an HTML page.</div>

    </form></html>'''

magic = '''<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
            "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd" [
            <!ENTITY nbsp ' '>
            ]>'''  # You can define more entities here, if needed

et = ET.fromstring(magic + html)