Split huge (95Mb) JSON array into smaller chunks?

mpen picture mpen · Aug 13, 2011 · Viewed 8.4k times · Source

I exported some data from my database in the form of JSON, which is essentially just one [list] with a bunch (900K) of {objects} inside it.

Trying to import it on my production server now, but I've got some cheap web server. They don't like it when I eat all their resources for 10 minutes.

How can I split this file into smaller chunks so that I can import it piece by piece?


Edit: Actually, it's a PostgreSQL database. I'm open to other suggestions on how I can export all the data in chunks. I've got phpPgAdmin installed on my server, which supposedly can accept CSV, Tabbed and XML formats.


I had to fix phihag's script:

import json
with open('fixtures/PostalCodes.json','r') as infile:
  o = json.load(infile)
  chunkSize = 50000
  for i in xrange(0, len(o), chunkSize):
    with open('fixtures/postalcodes_' + ('%02d' % (i//chunkSize)) + '.json','w') as outfile:
      json.dump(o[i:i+chunkSize], outfile)

dump:

pg_dump -U username -t table database > filename

restore:

psql -U username < filename

(I don't know what the heck pg_restore does, but it gives me errors)

The tutorials on this conveniently leave this information out, esp. the -U option which is probably necessary in most circumstances. Yes, the man pages explain this, but it's always a pain to sift through 50 options you don't care about.


I ended up going with Kenny's suggestion... although it was still a major pain. I had to dump the table to a file, compress it, upload it, extract it, then I tried to import it, but the data was slightly different on production and there were some missing foreign keys (postalcodes are attached to cities). Of course, I couldn't just import the new cities, because then it throws a duplicate key error instead of silently ignoring it, which would have been nice. So I had to empty that table, repeat the process for cities, only to realize something else was tied to cities, so I had to empty that table too. Got the cities back in, then finally I could import my postal codes. By now I've obliterated half my database because everything is tied to everything and I've had to recreate all the entries. Lovely. Good thing I haven't launched the site yet. Also "emptying" or truncating a table doesn't seem to reset the sequences/autoincrements, which I'd like, because there are a couple magic entries I want to have ID 1. So..I'd have to delete or reset those too (I don't know how), so I manually edited the PKs for those back to 1.

I would have ran into similar problems with phihag's solution, plus I would have had to import 17 files one at a time, unless I wrote another import script to match the export script. Although he did answer my question literally, so thanks.

Answer

phihag picture phihag · Aug 13, 2011

In Python:

import json
with open('file.json') as infile:
  o = json.load(infile)
  chunkSize = 1000
  for i in xrange(0, len(o), chunkSize):
    with open('file_' + str(i//chunkSize) + '.json', 'w') as outfile:
      json.dump(o[i:i+chunkSize], outfile)