I want to import OData XML datafeeds from the Dutch Bureau of Statistics (CBS) into our database. Using lxml and pandas I thought this should be straigtforward. By using OrderDict I want to preserve the order of the columns for readability, but somehow I can't get it right.
from collections import OrderedDict
from lxml import etree
import requests
import pandas as pd
# CBS URLs
base_url = 'http://opendata.cbs.nl/ODataFeed/odata'
datasets = ['/37296ned', '/82245NED']
feed = requests.get(base_url + datasets[1] + '/TypedDataSet')
root = etree.fromstring(feed.content)
# all record entries start at tag m:properties, parse into data dict
data = []
for record in root.iter('{{{}}}properties'.format(root.nsmap['m'])):
row = OrderedDict()
for element in record:
row[element.tag.split('}')[1]] = element.text
data.append(row)
df = pd.DataFrame.from_dict(data)
df.columns
Inspecting data
, the OrderDict is in the right order. But looking at df.head()
the columns have been sorted alphabetically with CAPS first?
Help, anyone?
Something in your example seems to be inconsistent, as data
is a list
and no dict
, but assuming you really have an OrderedDict
:
Try to explicitly specify your column order when you create your DataFrame:
# ... all your data collection
df = pd.DataFrame(data, columns=data.keys())
This should give you your DataFrame with the columns ordered just in exact the way they are in the OrderedDict (via the data.keys()
generated list)