Combine multiple NetCDF files into timeseries multidimensional array python

LCook picture LCook · Jun 19, 2015 · Viewed 17.3k times · Source

I am using data from multiple netcdf files (in a folder on my computer). Each file holds data for the entire USA, for a time period of 5 years. Locations are referenced based on the index of an x and y coordinate. I am trying to create a time series for multiple locations(grid cells), compiling the 5 year periods into a 20 year period (this would be combining 4 files). Right now I am able to extract the data from all files for one location and compile this into an array using numpy append. However, I would like to extract the data for multiple locations, placing this into a matrix where the rows are the locations and the columns contain the time series precipitation data. I think I have to create a list or dictionary, but I am not really sure how to allocate the data to the list/dictionary within a loop.

I am new to python and netCDF, so forgive me if this is an easy solution. I have been using this code as a guide, but haven't figured out how to format it for what I'd like to do: Python Reading Multiple NetCDF Rainfall files of variable size

Here is my code:

import glob
from netCDF4 import Dataset
import numpy as np

# Define x & y index for grid cell of interest 
    # Pittsburgh is 37,89
yindex = 37  #first number
xindex = 89  #second number

# Path
path = '/Users/LMC/Research Data/NARCCAP/'  
folder = 'MM5I_ccsm/'

## load data file names    
all_files = glob.glob(path + folder+'*.nc')
all_files.sort()

## initialize np arrays of timeperiods and locations
yindexlist = [yindex,'38','39'] # y indices for all grid cells of interest
xindexlist = [xindex,xindex,xindex] # x indices for all grid cells of interest
ngridcell = len(yindexlist)
ntimestep = 58400  # This is for 4 files of 14600 timesteps

## Initialize np array
timeseries_per_gridcell = np.empty(0)

## START LOOP FOR FILE IMPORT
for timestep, datafile in enumerate(all_files):    
    fh = Dataset(datafile,mode='r')  
    days = fh.variables['time'][:]
    lons = fh.variables['lon'][:]
    lats = fh.variables['lat'][:]
    precip = fh.variables['pr'][:]

    for i in range(1):
        timeseries_per_gridcell = np.append(timeseries_per_gridcell,precip[:,yindexlist[i],xindexlist[i]]*10800)

    fh.close()

print timeseries_per_gridcell     

I put 3 files on dropbox so you could access them, but I am only allowed to post 2 links. Here are they are:

https://www.dropbox.com/s/rso0hce8bq7yi2h/pr_MM5I_ccsm_2041010103.nc?dl=0 https://www.dropbox.com/s/j56undjvv7iph0f/pr_MM5I_ccsm_2046010103.nc?dl=0

Answer

N1B4 picture N1B4 · Jun 20, 2015

Nice start, I would recommend the following to help solve your issues.

First, check out ncrcat to quickly concatenate your individual netCDF files into a single file. I highly recommend downloading NCO for netCDF manipulations, especially in this instance where it will ease your Python coding later on.

Let's say the files are named precip_1.nc, precip_2.nc, precip_3.nc, and precip_4.nc. You could concatenate them along the record dimension to form a new precip_all.nc with a record dimension of length 58400 with

ncrcat precip_1.nc precip_2.nc precip_3.nc precip_4.nc -O precip_all.nc

In Python we now just need to read in that new single file and then extract and store the time series for the desired grid cells. Something like this:

import netCDF4
import numpy as np

yindexlist = [1,2,3]
xindexlist = [4,5,6]
ngridcell = len(xidx)
ntimestep = 58400

# Define an empty 2D array to store time series of precip for a set of grid cells
timeseries_per_grid_cell = np.zeros([ngridcell, ntimestep])

ncfile = netCDF4.Dataset('path/to/file/precip_all.nc', 'r')

# Note that precip is 3D, so need to read in all dimensions
precip = ncfile.variables['precip'][:,:,:]

for i in range(ngridcell):
     timeseries_per_grid_cell[i,:] = precip[:, yindexlist[i], xindexlist[i]]

ncfile.close()

If you have to use Python only, you'll need to keep track of the chunks of time indices that the individual files form to make the full time series. 58400/4 = 14600 time steps per file. So you'll have another loop to read in each individual file and store the corresponding slice of times, i.e. the first file will populate 0-14599, the second 14600-29199, etc.