Efficient ways to write a large NumPy array to a file

Fomite · Jan 8, 2012 · Viewed 7.8k times

I've currently got a project running on PiCloud that involves multiple iterations of an ODE solver. Each iteration produces a NumPy array of about 30 rows and 1500 columns, with each iteration's results appended to the bottom of the array of the previous results.

Normally, I'd just let these fairly big arrays be returned by the function, hold them in memory, and deal with them all at once. Except PiCloud has a fairly restrictive cap on the size of the data that can be returned directly by a function, to keep down transmission costs. Which is fine, except that means I'd have to launch thousands of jobs, each running one iteration, with considerable overhead.

It appears the best solution to this is to write the output to a file, and then collect the file using another function they have that doesn't have a transfer limit.

Is my best bet to do this just dumping it into a CSV file? Should I add to the CSV file each iteration, or hold it all in an array until the end and then just write once? Is there something terribly clever I'm missing?

Answer

Andrew Jaffe · Jan 8, 2012

Unless there is a reason for the intermediate files to be human-readable, do not use CSV, as converting floating-point values to and from decimal text will generally lose precision.

The most efficient is probably tofile (doc), which is intended for quick dumps of an array to disk when you know all of the attributes of the data (dtype and shape) ahead of time.
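A minimal sketch of that round trip, assuming a float64 array of the ~30 × 1500 shape from the question (the filename is just illustrative):

```python
import numpy as np

# Stand-in for one iteration's output.
result = np.random.rand(30, 1500)

# tofile writes raw bytes with no header, so it's fast, but you
# must track the dtype and shape yourself to read it back.
result.tofile("result.bin")

# fromfile returns a flat array; restore the known shape by hand.
loaded = np.fromfile("result.bin", dtype=result.dtype).reshape(30, 1500)
assert np.array_equal(result, loaded)
```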

For platform-independent, but numpy-specific, saves, you can use save (doc).
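For comparison, a sketch of the save/load round trip with the same assumed array:

```python
import numpy as np

result = np.random.rand(30, 1500)

# np.save records the dtype and shape in the .npy header, so
# nothing has to be remembered out of band.
np.save("result.npy", result)

loaded = np.load("result.npy")
assert np.array_equal(result, loaded)
```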

The scientific Python ecosystem also has support for various scientific data formats like HDF5 (via third-party packages such as h5py or PyTables) if you need portability.
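As a sketch of how that could fit the append-as-you-go pattern from the question, here is one resizable HDF5 dataset grown iteration by iteration using h5py; the package, dataset name, shapes, and iteration count are all assumptions, not anything PiCloud or NumPy provides:

```python
import numpy as np
import h5py  # third-party package, assumed installed

with h5py.File("results.h5", "w") as f:
    # A dataset with an unlimited first axis can be extended later.
    dset = f.create_dataset("results", shape=(0, 1500),
                            maxshape=(None, 1500), dtype="f8")
    for _ in range(10):  # stand-in for the ODE solver iterations
        chunk = np.random.rand(30, 1500)
        dset.resize(dset.shape[0] + chunk.shape[0], axis=0)
        dset[-chunk.shape[0]:] = chunk

# Read everything back in one go.
with h5py.File("results.h5", "r") as f:
    all_results = f["results"][:]
```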