I know this question has been asked many times. I tried several solutions but I couldn't solve my problem.
I have a large nested JSON file (1.4GB) and I would like to make it flat and then convert it to a CSV file.
The JSON structure is like this:
{
"company_number": "12345678",
"data": {
"address": {
"address_line_1": "Address 1",
"locality": "Henley-On-Thames",
"postal_code": "RG9 1DP",
"premises": "161",
"region": "Oxfordshire"
},
"country_of_residence": "England",
"date_of_birth": {
"month": 2,
"year": 1977
},
"etag": "26281dhge33b22df2359sd6afsff2cb8cf62bb4a7f00",
"kind": "individual-person-with-significant-control",
"links": {
"self": "/company/12345678/persons-with-significant-control/individual/bIhuKnFctSnjrDjUG8n3NgOrl"
},
"name": "John M Smith",
"name_elements": {
"forename": "John",
"middle_name": "M",
"surname": "Smith",
"title": "Mrs"
},
"nationality": "Vietnamese",
"natures_of_control": [
"ownership-of-shares-50-to-75-percent"
],
"notified_on": "2016-04-06"
}
}
I know that this is easy to accomplish with pandas
module but I am not familiar with it.
EDITED
The desired output should be something like this:
company_number, address_line_1, locality, country_of_residence, kind,
12345678, Address 1, Henley-On-Thamed, England, individual-person-with-significant-control
Note that this is just the short version. The output should have all the fields.
For the JSON data you have given, you could do this by parsing the JSON structure to just return a list of all the leaf nodes.
This assumes that your structure is consistent throughout, if each entry can have different fields, see the second approach.
For example:
import json
import csv
def get_leaves(item, key=None):
if isinstance(item, dict):
leaves = []
for i in item.keys():
leaves.extend(get_leaves(item[i], i))
return leaves
elif isinstance(item, list):
leaves = []
for i in item:
leaves.extend(get_leaves(i, key))
return leaves
else:
return [(key, item)]
with open('json.txt') as f_input, open('output.csv', 'w', newline='') as f_output:
csv_output = csv.writer(f_output)
write_header = True
for entry in json.load(f_input):
leaf_entries = sorted(get_leaves(entry))
if write_header:
csv_output.writerow([k for k, v in leaf_entries])
write_header = False
csv_output.writerow([v for k, v in leaf_entries])
If your JSON data is a list of entries in the format you have given, then you should get output as follows:
address_line_1,company_number,country_of_residence,etag,forename,kind,locality,middle_name,month,name,nationality,natures_of_control,notified_on,postal_code,premises,region,self,surname,title,year
Address 1,12345678,England,26281dhge33b22df2359sd6afsff2cb8cf62bb4a7f00,John,individual-person-with-significant-control,Henley-On-Thames,M,2,John M Smith,Vietnamese,ownership-of-shares-50-to-75-percent,2016-04-06,RG9 1DP,161,Oxfordshire,/company/12345678/persons-with-significant-control/individual/bIhuKnFctSnjrDjUG8n3NgOrl,Smith,Mrs,1977
Address 1,12345679,England,26281dhge33b22df2359sd6afsff2cb8cf62bb4a7f00,John,individual-person-with-significant-control,Henley-On-Thames,M,2,John M Smith,Vietnamese,ownership-of-shares-50-to-75-percent,2016-04-06,RG9 1DP,161,Oxfordshire,/company/12345678/persons-with-significant-control/individual/bIhuKnFctSnjrDjUG8n3NgOrl,Smith,Mrs,1977
If each entry can contain different (or possibly missing) fields, then a better approach would be to use a DictWriter
. In this case, all of the entries would need to be processed to determine the complete list of possible fieldnames
so that the correct header can be written.
import json
import csv
def get_leaves(item, key=None):
if isinstance(item, dict):
leaves = {}
for i in item.keys():
leaves.update(get_leaves(item[i], i))
return leaves
elif isinstance(item, list):
leaves = {}
for i in item:
leaves.update(get_leaves(i, key))
return leaves
else:
return {key : item}
with open('json.txt') as f_input:
json_data = json.load(f_input)
# First parse all entries to get the complete fieldname list
fieldnames = set()
for entry in json_data:
fieldnames.update(get_leaves(entry).keys())
with open('output.csv', 'w', newline='') as f_output:
csv_output = csv.DictWriter(f_output, fieldnames=sorted(fieldnames))
csv_output.writeheader()
csv_output.writerows(get_leaves(entry) for entry in json_data)