Let's assume that I have an XML like this:
<author type="XXX" language="EN" gender="xx" feature="xx" web="foobar.com">
<documents count="N">
<document KEY="e95a9a6c790ecb95e46cf15bee517651" web="www.foo_bar_exmaple.com"><![CDATA[A large text with lots of strings and punctuations symbols [...]
]]>
</document>
<document KEY="bc360cfbafc39970587547215162f0db" web="www.foo_bar_exmaple.com"><![CDATA[A large text with lots of strings and punctuations symbols [...]
]]>
</document>
<document KEY="19e71144c50a8b9160b3f0955e906fce" web="www.foo_bar_exmaple.com"><![CDATA[A large text with lots of strings and punctuations symbols [...]
]]>
</document>
<document KEY="21d4af9021a174f61b884606c74d9e42" web="www.foo_bar_exmaple.com"><![CDATA[A large text with lots of strings and punctuations symbols [...]
]]>
</document>
<document KEY="28a45eb2460899763d709ca00ddbb665" web="www.foo_bar_exmaple.com"><![CDATA[A large text with lots of strings and punctuations symbols [...]
]]>
</document>
<document KEY="a0c0712a6a351f85d9f5757e9fff8946" web="www.foo_bar_exmaple.com"><![CDATA[A large text with lots of strings and punctuations symbols [...]
]]>
</document>
<document KEY="626726ba8d34d15d02b6d043c55fe691" web="www.foo_bar_exmaple.com"><![CDATA[A large text with lots of strings and punctuations symbols [...]
]]>
</document>
<document KEY="2cb473e0f102e2e4a40aa3006e412ae4" web="www.foo_bar_exmaple.com"><![CDATA[A large text with lots of strings and punctuations symbols [...] [...]
]]>
</document>
</documents>
</author>
I would like to read this XML file and convert it to a pandas DataFrame:
key type language feature web data
e95324a9a6c790ecb95e46cf15bE232ee517651 XXX EN xx www.foo_bar_exmaple.com A large text with lots of strings and punctuations symbols [...]
e95324a9a6c790ecb95e46cf15bE232ee517651 XXX EN xx www.foo_bar_exmaple.com A large text with lots of strings and punctuations symbols [...]
19e71144c50a8b9160b3cvdf2324f0955e906fce XXX EN xx www.foo_bar_exmaple.com A large text with lots of strings and punctuations symbols [...]
21d4af9021a174f61b8erf284606c74d9e42 XXX EN xx www.foo_bar_exmaple.com A large text with lots of strings and punctuations symbols [...]
28a45eb2460823499763d70vdf9ca00ddbb665 XXX EN xx www.foo_bar_exmaple.com A large text with lots of strings and punctuations symbols [...]
This is what I already tried, but I am getting some errors and probably there is a more efficient way of doing this task:
from lxml import objectify
import pandas as pd
path = 'file_path'
xml = objectify.parse(open(path))
root = xml.getroot()
root.getchildren()[0].getchildren()
df = pd.DataFrame(columns=('key','type', 'language', 'feature', 'web', 'data'))
for i in range(0,len(xml)):
obj = root.getchildren()[i].getchildren()
row = dict(zip(['key','type', 'language', 'feature', 'web', 'data'], [obj[0].text, obj[1].text]))
row_s = pd.Series(row)
row_s.name = i
df = df.append(row_s)
Could anybody provide me a better aproach for this problem?
You can easily use xml
(from the Python standard library) to convert to a pandas.DataFrame
. Here's what I would do (when reading from a file replace xml_data
with the name of your file or file object):
import pandas as pd
import xml.etree.ElementTree as ET
import io
def iter_docs(author):
author_attr = author.attrib
for doc in author.iter('document'):
doc_dict = author_attr.copy()
doc_dict.update(doc.attrib)
doc_dict['data'] = doc.text
yield doc_dict
xml_data = io.StringIO(u'''\
<author type="XXX" language="EN" gender="xx" feature="xx" web="foobar.com">
<documents count="N">
<document KEY="e95a9a6c790ecb95e46cf15bee517651" web="www.foo_bar_exmaple.com"><![CDATA[A large text with lots of strings and punctuations symbols [...]
]]>
</document>
<document KEY="bc360cfbafc39970587547215162f0db" web="www.foo_bar_exmaple.com"><![CDATA[A large text with lots of strings and punctuations symbols [...]
]]>
</document>
<document KEY="19e71144c50a8b9160b3f0955e906fce" web="www.foo_bar_exmaple.com"><![CDATA[A large text with lots of strings and punctuations symbols [...]
]]>
</document>
<document KEY="21d4af9021a174f61b884606c74d9e42" web="www.foo_bar_exmaple.com"><![CDATA[A large text with lots of strings and punctuations symbols [...]
]]>
</document>
<document KEY="28a45eb2460899763d709ca00ddbb665" web="www.foo_bar_exmaple.com"><![CDATA[A large text with lots of strings and punctuations symbols [...]
]]>
</document>
<document KEY="a0c0712a6a351f85d9f5757e9fff8946" web="www.foo_bar_exmaple.com"><![CDATA[A large text with lots of strings and punctuations symbols [...]
]]>
</document>
<document KEY="626726ba8d34d15d02b6d043c55fe691" web="www.foo_bar_exmaple.com"><![CDATA[A large text with lots of strings and punctuations symbols [...]
]]>
</document>
<document KEY="2cb473e0f102e2e4a40aa3006e412ae4" web="www.foo_bar_exmaple.com"><![CDATA[A large text with lots of strings and punctuations symbols [...] [...]
]]>
</document>
</documents>
</author>
''')
etree = ET.parse(xml_data) #create an ElementTree object
doc_df = pd.DataFrame(list(iter_docs(etree.getroot())))
If there are multiple authors in your original document or the root of your XML is not an author
, then I would add the following generator:
def iter_author(etree):
for author in etree.iter('author'):
for row in iter_docs(author):
yield row
and change doc_df = pd.DataFrame(list(iter_docs(etree.getroot())))
to doc_df = pd.DataFrame(list(iter_author(etree)))
Have a look at the ElementTree
tutorial provided in the xml
library documentation.