Handling very large netCDF files in python

tiago · Aug 22, 2012

I am trying to work with data from very large netCDF files (~400 GB each). Each file has a few variables, all much larger than the system memory (e.g. 180 GB vs 32 GB of RAM). I am trying to use numpy and netCDF4-python to do some operations on these variables by copying one slice at a time and operating on that slice. Unfortunately, just reading each slice takes a really long time, which is killing the performance.

For example, one of the variables is an array of shape (500, 500, 450, 300). I want to operate on the slice [:,:,0], so I do the following:

import netCDF4 as nc

f = nc.Dataset('myfile.ncdf', 'r+')
myvar = f.variables['myvar']
myslice = myvar[:, :, 0]   # this read is the slow step

But the last step takes a really long time (~5 min on my system). If, for example, I save a variable of shape (500, 500, 300) in the netCDF file, then a read operation of the same size takes only a few seconds.

Is there any way I can speed this up? An obvious path would be to transpose the array so that the indices I am selecting come first. But with such a large file this is not possible to do in memory, and attempting it seems even slower given that a simple read already takes so long. What I would like is a quick way to read a slice of a netCDF file, in the fashion of Fortran's get_vara interface, or some way of efficiently transposing the array.
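For context, a back-of-the-envelope calculation suggests why this slice is so expensive. Assuming the variable is stored contiguously in C (row-major) order with 4-byte values (my guess at the layout, not something stated in the file):

```python
# Sketch of why myvar[:, :, 0] is slow on a contiguously stored variable,
# assuming C (row-major) order and 4-byte values.
shape = (500, 500, 450, 300)
itemsize = 4  # bytes per value (assumed float32)

# Fixing index 0 on the third axis leaves one contiguous run of 300
# values per (i, j) pair, separated by large gaps on disk:
runs = shape[0] * shape[1]                  # number of separate reads
run_bytes = shape[3] * itemsize             # contiguous bytes per read
gap_bytes = shape[2] * shape[3] * itemsize  # stride between run starts

print(runs, run_bytes, gap_bytes)  # 250000 reads of 1200 bytes, 540000 bytes apart
```

So the slice turns into a quarter of a million small, widely scattered reads, whereas the (500, 500, 300) variable can be read in one contiguous sweep.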

Answer

Russ Rew · Aug 23, 2012

You can transpose netCDF variables too large to fit in memory by using the nccopy utility, which is documented here:

http://www.unidata.ucar.edu/netcdf/docs/guide_nccopy.html

The idea is to "rechunk" the file by specifying what shapes of chunks (multidimensional tiles) you want for the variables. You can specify how much memory to use as a copy buffer and how much for chunk caches, but it's not clear how to divide memory optimally between these uses, so you may have to try some examples and time them. Rather than completely transposing a variable, you probably want to "partially transpose" it, by specifying chunks that hold a lot of data along the two big dimensions of your slice and only a few values along the other dimensions.
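A sketch of what such an nccopy invocation might look like. The dimension names (x, y, z, t) are placeholders for whatever your file actually uses (check with ncdump -h), and the chunk shapes and memory sizes are starting points to time, not tuned values:

```shell
# Rechunk so that slices like myvar[:, :, k] touch only a few chunks:
# large chunks along the two big slice dimensions (x, y), thin along z.
#   -c  chunk shape per dimension (dim/len,...)
#   -m  copy buffer size
#   -h  chunk cache size
nccopy -c x/100,y/100,z/1,t/300 -m 1G -h 4G myfile.ncdf rechunked.ncdf
```

With chunks of shape (100, 100, 1, 300), reading [:, :, k] needs only 25 chunks, each of which is contiguous on disk; the trade-off is that slices along other axes become correspondingly more expensive.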