I am trying to write a parsing algorithm that efficiently pulls data from an XML document. I currently walk the document element by element and child by child, but I would like to use iterparse instead. The problem is that I have a list of element names, and whenever one of them is found I want to pull the data from its children. With iterparse, it seems my only options are to filter on a single element name or to receive every single element.
Example XML:
<?xml version="1.0" encoding="UTF-8"?>
<data_object xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
    <source id="0">
        <name>Office Issues</name>
        <datetime>2012-01-13T16:09:15</datetime>
        <data_id>7</data_id>
    </source>
    <event id="125">
        <date>2012-11-06</date>
        <state_id>7</state_id>
    </event>
    <state id="7">
        <name>Washington</name>
    </state>
    <locality id="2">
        <name>Olympia</name>
        <state_id>7</state_id>
        <type>City</type>
    </locality>
    <locality id="3">
        <name>Town</name>
        <state_id>7</state_id>
        <type>Town</type>
    </locality>
</data_object>
Code example:
from lxml import etree

fname = "test.xml"
ELEMENT_LIST = ["source", "event", "state", "locality"]

with open(fname, "rb") as xml_doc:
    context = iter(etree.iterparse(xml_doc, events=("start", "end")))
    event, root = next(context)  # first event is the start of the root element
    base = False
    bname = ""
    for event, elem in context:
        if event == "start" and elem.tag in ELEMENT_LIST:
            base = True
            bname = elem.tag
            # Note: at a "start" event the children may not be parsed yet.
            child_list = [child.tag for child in elem]
            print(bname + ":" + str(child_list))
        elif event == "end" and elem.tag in ELEMENT_LIST:
            base = False
            root.clear()
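With iterparse, the built-in filter (the tag argument) is aimed at a single tag name, so you cannot simply hand it an arbitrary list of tags. (Newer lxml releases may accept a sequence of tags there; check the documentation for your version.) However, it is easy to do manually what you would like to achieve. In the following snippet: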
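from lxml import etree

fname = "test.xml"
ELEMENT_LIST = ["source", "event", "state", "locality"]

# Open in binary mode so the parser can honor the encoding declaration.
with open(fname, "rb") as xml_doc:
    # Listen for "end" events: an element's children are only guaranteed
    # to be fully parsed once its closing tag has been read.
    context = etree.iterparse(xml_doc, events=("end",))
    for event, elem in context:
        if elem.tag in ELEMENT_LIST:
            child_tags = ", ".join(child.tag for child in elem)
            print("this elem is interesting, do some processing: %s: [%s]" % (elem.tag, child_tags))
            # Done with this element, so release its contents.
            elem.clear()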
you limit your search to the interesting tags only. An important part of using iterparse is the call to elem.clear(), which frees the memory held by an element once you no longer need it. That is what makes this approach memory efficient; see http://lxml.de/parsing.html#modifying-the-tree
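If you want the child data itself rather than just the child tag names, the same idea can be wrapped in a small generator. This is only a sketch under a few assumptions: the names iter_records, WANTED_TAGS and SAMPLE are made up for illustration, the sample is a shortened version of the XML from the question, and the fallback to the standard library's ElementTree is just there so the snippet runs where lxml is not installed (the calls used here exist in both).

```python
import io

try:
    from lxml import etree  # the library used in this thread
except ImportError:
    # Assumption: the stdlib parser is an acceptable stand-in for this sketch.
    import xml.etree.ElementTree as etree

# Hypothetical name; a set gives fast membership tests.
WANTED_TAGS = {"source", "event", "state", "locality"}


def iter_records(source, wanted=WANTED_TAGS):
    """Yield (tag, id attribute, {child_tag: child_text}) per wanted element."""
    # Only "end" events are requested, so every element we see is complete.
    for _event, elem in etree.iterparse(source, events=("end",)):
        if elem.tag in wanted:
            children = {child.tag: child.text for child in elem}
            yield elem.tag, elem.get("id"), children
            # The data has been copied out, so the element can be emptied.
            elem.clear()


# Shortened sample based on the XML in the question.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<data_object>
  <state id="7"><name>Washington</name></state>
  <locality id="2"><name>Olympia</name><state_id>7</state_id><type>City</type></locality>
  <note>ignored</note>
</data_object>"""

records = list(iter_records(io.BytesIO(SAMPLE)))
for record in records:
    print(record)
```

Note that only the wanted elements are cleared, so their children still carry their text when the parent's end event arrives. If one wanted tag could be nested inside another, the inner one would already be emptied by the time the outer one is processed, so that case would need extra care.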