Parsing an xml file with an ordered dictionary

Omid · Nov 14, 2014 · Viewed 9.6k times

I have an xml file of the form:

<NewDataSet>
    <Root>
        <Phonemic>and</Phonemic>
        <Phonetic>nd</Phonetic>
        <Description/>
        <Start>0</Start>
        <End>8262</End>
    </Root>
    <Root>
        <Phonemic>comfortable</Phonemic>
        <Phonetic>comfetebl</Phonetic>
        <Description>adj</Description>
        <Start>61404</Start>
        <End>72624</End>
    </Root>
</NewDataSet>

I need to process it so that, for instance, when the user inputs "nd", the program matches it against the <Phonetic> tag and returns "and" from the corresponding <Phonemic> tag. I thought that if I could convert the xml file to a dictionary, I would be able to iterate over the data and look up information when needed.

I searched and found xmltodict, which is meant for exactly this purpose:

import xmltodict
with open(r'path\to\1.xml', encoding='utf-8', errors='ignore') as fd:
    obj = xmltodict.parse(fd.read())

Running this gives me an ordered dict:

>>> obj
OrderedDict([('NewDataSet', OrderedDict([('Root', [OrderedDict([('Phonemic', 'and'), ('Phonetic', 'nd'), ('Description', None), ('Start', '0'), ('End', '8262')]), OrderedDict([('Phonemic', 'comfortable'), ('Phonetic', 'comfetebl'), ('Description', 'adj'), ('Start', '61404'), ('End', '72624')])])]))])

Unfortunately, this hasn't made things simpler, and I am not sure how to go about implementing the program with the new data structure. For example, to access nd I'd have to write:

obj['NewDataSet']['Root'][0]['Phonetic']

which is ridiculously complicated. I tried to turn it into a regular dictionary with dict(), but because the structure is nested, only the outer layer is converted and the inner layers remain ordered dicts. My data set is also very large.
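For reference, dict() converts only the outermost mapping, so a nested structure needs a recursive pass. The following is a minimal sketch under stated assumptions: to_plain is a hypothetical helper name, and the literal below is a shortened, hand-written stand-in for the xmltodict output shown above.

```python
from collections import OrderedDict

def to_plain(node):
    """Recursively rebuild every mapping as a plain dict (hypothetical helper)."""
    if isinstance(node, dict):  # OrderedDict is a dict subclass, so it matches too
        return {key: to_plain(value) for key, value in node.items()}
    if isinstance(node, list):
        return [to_plain(item) for item in node]
    return node  # strings and None pass through unchanged

# Shortened stand-in for the parsed structure shown above.
parsed = OrderedDict([("NewDataSet", OrderedDict([("Root", [
    OrderedDict([("Phonemic", "and"), ("Phonetic", "nd")]),
    OrderedDict([("Phonemic", "comfortable"), ("Phonetic", "comfetebl")]),
])]))])

plain = to_plain(parsed)
```

Note that the conversion changes only the container types. The access path plain['NewDataSet']['Root'][0]['Phonetic'] is exactly as long as before, so this alone does not simplify the lookup.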

Answer

Anshul Goyal · Nov 14, 2014

If you are accessing this as obj['NewDataSet']['Root'][0]['Phonetic'], IMO, you are not doing it right.

Instead, you can do the following:

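obj = obj["NewDataSet"]
# xmltodict gives a list for repeated <Root> tags but a single mapping for one,
# so test the value itself rather than its parent
root_elements = obj["Root"] if isinstance(obj["Root"], list) else [obj["Root"]]
# Above step ensures that root_elements is always a list
for element in root_elements:
    print(element["Phonetic"])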

Even though this code looks longer, the advantage is that it stays much more compact and modular once you start dealing with sufficiently large xml.
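The list check matters because xmltodict returns a single mapping, not a one-item list, when a tag occurs only once. Here is a minimal sketch of that normalisation step; as_list is a hypothetical helper name, and the two literals are hand-written stand-ins for parsed output with one and with two <Root> tags.

```python
def as_list(value):
    """Wrap a lone item in a list; return an existing list unchanged."""
    return value if isinstance(value, list) else [value]

# Hand-written stand-ins for parsed output: one <Root> versus two <Root> tags.
single = {"Root": {"Phonemic": "and", "Phonetic": "nd"}}
multiple = {"Root": [
    {"Phonemic": "and", "Phonetic": "nd"},
    {"Phonemic": "comfortable", "Phonetic": "comfetebl"},
]}

# The same loop now works for both shapes.
for dataset in (single, multiple):
    for element in as_list(dataset["Root"]):
        print(element["Phonetic"])
```

xmltodict.parse also accepts a force_list argument that makes the named tags always come back as lists, which may remove the need for the check; it is worth confirming against the version you have installed.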

PS: I had the same issues with xmltodict. But compared with using xml.etree.ElementTree to parse the xml files, xmltodict was much easier to work with, as the code base was smaller and I didn't have to deal with the other inanities of the xml module.
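For comparison, here is a rough sketch of the same loop written with the standard library's xml.etree.ElementTree. It is only an illustration of the alternative mentioned above, using a shortened copy of the sample data.

```python
import xml.etree.ElementTree as ET

xmldata = """<NewDataSet>
    <Root>
        <Phonemic>and</Phonemic>
        <Phonetic>nd</Phonetic>
    </Root>
    <Root>
        <Phonemic>comfortable</Phonemic>
        <Phonetic>comfetebl</Phonetic>
    </Root>
</NewDataSet>"""

dataset = ET.fromstring(xmldata)  # the <NewDataSet> element

# findall always returns a list, so one <Root> and many <Root> look the same.
phonetics = [root.findtext("Phonetic") for root in dataset.findall("Root")]

for phonetic in phonetics:
    print(phonetic)
```

Because findall returns a list in every case, no single-versus-many check is needed here; the trade-off is working with Element objects instead of dictionaries.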

EDIT

The following code works for me:

import xmltodict
from collections import OrderedDict

xmldata = """<NewDataSet>
    <Root>
        <Phonemic>and</Phonemic>
        <Phonetic>nd</Phonetic>
        <Description/>
        <Start>0</Start>
        <End>8262</End>
    </Root>
    <Root>
        <Phonemic>comfortable</Phonemic>
        <Phonetic>comfetebl</Phonetic>
        <Description>adj</Description>
        <Start>61404</Start>
        <End>72624</End>
    </Root>
</NewDataSet>"""

obj = xmltodict.parse(xmldata)
obj = obj["NewDataSet"]
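# xmltodict gives a list for repeated <Root> tags but a single mapping for one,
# so test the value itself rather than its parent
root_elements = obj["Root"] if isinstance(obj["Root"], list) else [obj["Root"]]
# Above step ensures that root_elements is always a list
for element in root_elements:
    print(element["Phonetic"])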
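To get from this loop to the lookup the question asks for ("nd" in, "and" out), one option is to index the elements by their Phonetic value once and then query that index. This is a sketch, not part of the original answer: phonemic_by_phonetic and lookup are hypothetical names, and root_elements is written out by hand in the shape produced above so the snippet runs on its own.

```python
# Stand-in for root_elements from the code above (same keys and values).
root_elements = [
    {"Phonemic": "and", "Phonetic": "nd", "Description": None},
    {"Phonemic": "comfortable", "Phonetic": "comfetebl", "Description": "adj"},
]

# Build the index once; each later lookup is a single dictionary access.
phonemic_by_phonetic = {
    element["Phonetic"]: element["Phonemic"] for element in root_elements
}

def lookup(phonetic):
    """Return the Phonemic form for a Phonetic input, or None if unknown."""
    return phonemic_by_phonetic.get(phonetic)

print(lookup("nd"))
```

If two entries share the same Phonetic value, the later one overwrites the earlier one in this index; mapping each key to a list of matches would be needed to keep both.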