I have been working on code that parses external XML files. Some of these files are huge, up to gigabytes of data. Needless to say, they need to be parsed as a stream, because loading them into memory is far too inefficient and often leads to out-of-memory errors.
I have used the libraries miniDOM, ElementTree, cElementTree and I am currently using lxml.
Right now I have a working, pretty memory-efficient script using lxml.etree.iterparse. The problem is that some of the XML files I need to parse contain encoding errors (they advertise as UTF-8, but contain differently encoded characters). When using lxml.etree.parse this can be fixed by using the recover=True option of a custom parser, but iterparse does not accept a custom parser. (see also: this question)
My current code looks like this:
from lxml import etree

events = ("start", "end")
context = etree.iterparse(xmlfile, events=events)
event, root_element = next(context)  # <items>
for action, element in context:
    if action == 'end' and element.tag == 'item':
        # <parse>
        root_element.clear()
Error when iterparse encounters a bad character (in this case, it's a ^Y):
lxml.etree.XMLSyntaxError: Input is not proper UTF-8, indicate encoding !
Bytes: 0x19 0x73 0x20 0x65, line 949490, column 25
I don't even wish to decode this data; I can just drop it. However, I don't know of any way to skip the element. I tried calling next on the context and using continue inside try/except statements.
Any help would be appreciated!
Update
Some additional info: This is the line where iterparse fails:
<description><![CDATA:[musea de la photographie fonds mercator. Met meer dan 80.000 foto^Ys en 3 miljoen negatieven is het Muse de la...]]></description>
According to etree, the error occurs at bytes 0x19 0x73 0x20 0x65.
According to hexedit, 19 73 20 65 translates to ASCII .s e
The . in this place should be an apostrophe (foto's).
I also found this question, which does not provide a solution.
If the problems are actual character encoding problems, rather than malformed XML, the easiest, and probably most efficient, solution is to deal with it at the file reading point. Like this:
import codecs
from lxml import etree

events = ("start", "end")
# xmlfile must be an open file object here (binary mode), not a filename
reader = codecs.EncodedFile(xmlfile, 'utf8', 'utf8', 'replace')
context = etree.iterparse(reader, events=events)
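This will cause bytes that cannot be decoded as UTF-8 to be replaced by the Unicode replacement character (U+FFFD, usually displayed as '?' or a similar placeholder). There are a few other error handlers, such as 'ignore'; see the documentation for the codecs module for more.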
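To see what the 'replace' handler does in practice, here is a minimal sketch that runs the same EncodedFile wrapper over in-memory bytes instead of a real file. The helper name recode_utf8 and the sample byte strings are made up for illustration, and the sketch assumes Python 3. One thing it also shows: a byte such as 0x19 is a valid single-byte UTF-8 sequence (an ASCII control character), so this wrapper alone leaves it in place. Whether the parser then accepts the input is a separate matter, because XML 1.0 does not permit that control character; such bytes would probably need to be filtered out separately.

```python
import codecs
import io


def recode_utf8(raw_bytes):
    """Pass raw bytes through the EncodedFile wrapper and return what a parser would read."""
    source = io.BytesIO(raw_bytes)  # stands in for an open binary file
    reader = codecs.EncodedFile(source, 'utf8', 'utf8', 'replace')
    return reader.read()


# 0x92 is not valid UTF-8 in this position, so it is replaced by U+FFFD,
# which is then re-encoded as the three bytes EF BF BD.
invalid = recode_utf8(b"foto\x92s")

# 0x19 decodes fine as UTF-8 (it is an ASCII control character),
# so the wrapper passes it through untouched.
control = recode_utf8(b"foto\x19s")

print(repr(invalid))
print(repr(control))
```

If the files really contain bytes like the second case, the recoding step will not be enough by itself, and dropping the offending control bytes before they reach iterparse may be the simpler route.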