I have an XML file of the form:
<NewDataSet>
<Root>
<Phonemic>and</Phonemic>
<Phonetic>nd</Phonetic>
<Description/>
<Start>0</Start>
<End>8262</End>
</Root>
<Root>
<Phonemic>comfortable</Phonemic>
<Phonetic>comfetebl</Phonetic>
<Description>adj</Description>
<Start>61404</Start>
<End>72624</End>
</Root>
</NewDataSet>
I need to process it so that, for instance, when the user inputs `nd`, the program matches it against the `<Phonetic>` tag and returns `and` from the `<Phonemic>` tag. I thought that if I could convert the XML file to a dictionary, I would be able to iterate over the data and look things up when needed.
I searched and found xmltodict, which is meant for exactly this:
import xmltodict

with open(r'path\to\1.xml', encoding='utf-8', errors='ignore') as fd:
    obj = xmltodict.parse(fd.read())
Running this gives me an `OrderedDict`:
>>> obj
OrderedDict([('NewDataSet', OrderedDict([('Root', [OrderedDict([('Phonemic', 'and'), ('Phonetic', 'nd'), ('Description', None), ('Start', '0'), ('End', '8262')]), OrderedDict([('Phonemic', 'comfortable'), ('Phonetic', 'comfetebl'), ('Description', 'adj'), ('Start', '61404'), ('End', '72624')])])]))])
Now this unfortunately hasn't made things simpler, and I am not sure how to go about implementing the program with the new data structure. For example, to access `nd` I'd have to write:
obj['NewDataSet']['Root'][0]['Phonetic']
which is ridiculously complicated. I tried to turn it into a regular dictionary with `dict()`, but since the structure is nested, the inner layers remain `OrderedDict`s, and my data is very large.
If you are accessing this as `obj['NewDataSet']['Root'][0]['Phonetic']`, IMO, you are not doing it right.
Instead, you can do the following:
obj = obj["NewDataSet"]
# A single <Root> is parsed as a mapping rather than a list, so wrap it in a list
root_elements = obj["Root"] if isinstance(obj["Root"], list) else [obj["Root"]]
# Above step ensures that root_elements is always a list
for element in root_elements:
    print(element["Phonetic"])
Even though this code looks longer, the advantage is that it will be a lot more compact and modular once you start dealing with sufficiently large XML.
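For the lookup the question asks for (input `nd`, get back `and`), one option is to build an index keyed on `Phonetic` once and then query it, instead of looping on every request. This is only a minimal sketch: `build_phonetic_index` is a hypothetical helper name, and the literal `parsed` value is a hand-written stand-in for what `xmltodict.parse` returns for the XML in the question. It assumes every `<Root>` has both tags, and it collects duplicates in a list rather than overwriting them.

```python
from collections import defaultdict


def build_phonetic_index(root_elements):
    """Map each Phonetic value to the list of Phonemic values that share it."""
    index = defaultdict(list)
    for element in root_elements:
        index[element["Phonetic"]].append(element["Phonemic"])
    return dict(index)


# Hypothetical sample mirroring the parsed structure shown in the question.
parsed = {
    "NewDataSet": {
        "Root": [
            {"Phonemic": "and", "Phonetic": "nd", "Description": None,
             "Start": "0", "End": "8262"},
            {"Phonemic": "comfortable", "Phonetic": "comfetebl",
             "Description": "adj", "Start": "61404", "End": "72624"},
        ]
    }
}

roots = parsed["NewDataSet"]["Root"]
# A single <Root> is parsed as a mapping instead of a list, so normalize first.
root_elements = roots if isinstance(roots, list) else [roots]

index = build_phonetic_index(root_elements)
print(index.get("nd", []))  # ['and']
```

Building the index costs one pass over the data; after that each lookup is a plain dictionary access, which matters if the file is large and queried often.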
PS: I had the same issues with `xmltodict`. But compared with parsing the XML files using `xml.etree.ElementTree`, xmltodict was much easier to work with, as the code base was smaller and I didn't have to deal with the other inanities of the xml module.
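For comparison, the same lookup with the standard library's `xml.etree.ElementTree` could be sketched roughly as below. `phonemic_for` is a hypothetical helper name, not part of either library, and the sketch assumes the `<Root>` elements are direct children of `<NewDataSet>` as in the question.

```python
import xml.etree.ElementTree as ET

xmldata = """<NewDataSet>
<Root>
<Phonemic>and</Phonemic>
<Phonetic>nd</Phonetic>
<Description/>
<Start>0</Start>
<End>8262</End>
</Root>
<Root>
<Phonemic>comfortable</Phonemic>
<Phonetic>comfetebl</Phonetic>
<Description>adj</Description>
<Start>61404</Start>
<End>72624</End>
</Root>
</NewDataSet>"""


def phonemic_for(xml_text, phonetic):
    """Return the Phonemic text of every <Root> whose Phonetic matches."""
    dataset = ET.fromstring(xml_text)  # the <NewDataSet> element
    return [
        entry.findtext("Phonemic")
        for entry in dataset.findall("Root")  # direct <Root> children only
        if entry.findtext("Phonetic") == phonetic
    ]


print(phonemic_for(xmldata, "nd"))  # ['and']
```

Here `findall` always returns a list, so the single-`<Root>` case needs no special handling.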
EDIT
The following code works for me:
import xmltodict
from collections import OrderedDict
xmldata = """<NewDataSet>
<Root>
<Phonemic>and</Phonemic>
<Phonetic>nd</Phonetic>
<Description/>
<Start>0</Start>
<End>8262</End>
</Root>
<Root>
<Phonemic>comfortable</Phonemic>
<Phonetic>comfetebl</Phonetic>
<Description>adj</Description>
<Start>61404</Start>
<End>72624</End>
</Root>
</NewDataSet>"""
obj = xmltodict.parse(xmldata)
obj = obj["NewDataSet"]
# A single <Root> is parsed as a mapping rather than a list, so wrap it in a list
root_elements = obj["Root"] if isinstance(obj["Root"], list) else [obj["Root"]]
# Above step ensures that root_elements is always a list
for element in root_elements:
    print(element["Phonetic"])
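As for the inner layers staying `OrderedDict` after calling `dict()`: if plain dictionaries are really wanted, the conversion has to recurse into the nested mappings and lists. A minimal sketch, where `to_plain` is a hypothetical helper name and `nested` is a shortened hand-written stand-in for the parsed result:

```python
from collections import OrderedDict


def to_plain(value):
    """Recursively convert mappings (including OrderedDict) to plain dicts."""
    if isinstance(value, dict):  # OrderedDict is a dict subclass
        return {key: to_plain(item) for key, item in value.items()}
    if isinstance(value, list):
        return [to_plain(item) for item in value]
    return value  # strings and None are returned unchanged


# Shortened stand-in for the structure shown in the question.
nested = OrderedDict([
    ("NewDataSet", OrderedDict([
        ("Root", [
            OrderedDict([("Phonemic", "and"), ("Phonetic", "nd")]),
        ]),
    ])),
])

plain = to_plain(nested)
print(type(plain["NewDataSet"]["Root"][0]) is dict)  # True
```

This changes only the container types; the access path stays the same, so it does not replace the normalization step shown earlier.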