A simplified version of my XML parsing function is here:
import xml.etree.cElementTree as ET
def analyze(xml):
it = ET.iterparse(file(xml))
count = 0
for (ev, el) in it:
count += 1
print('count: {0}'.format(count))
This causes Python to run out of memory, which doesn't make a whole lot of sense. The only thing I am actually storing is the count, an integer. Why is it doing this:
See that sudden drop in memory and CPU usage at the end? That's Python crashing spectacularly. At least it gives me a MemoryError
(depending on what else I am doing in the loop, it gives me more random errors, like an IndexError
) and a stack trace instead of a segfault. But why is it crashing?
The documentation does tell you "Parses an XML section into an element tree [my emphasis] incrementally" but doesn't cover how to avoid retaining uninteresting elements (which may be all of them). That is covered by this article by the effbot.
I strongly recommend that anybody using .iterparse()
should read this article by Liza Daly. It covers both lxml
and [c]ElementTree.
Previous coverage on SO:
Using Python Iterparse For Large XML Files
Can Python xml ElementTree parse a very large xml file?
What is the fastest way to parse large XML docs in Python?