Is it possible to save a numpy array by appending it to an already existing npy-file, something like np.save(filename, arr, mode='a')?
I have several functions that have to iterate over the rows of a large array. I cannot create the array all at once because of memory constraints. To avoid creating the rows over and over again, I wanted to create each row once and save it to a file, appending it to the previous rows in the file. Later I could load the npy-file in mmap_mode, accessing the slices when needed.
The built-in .npy file format is perfectly fine for working with small datasets, without relying on external modules other than numpy.
However, when you start having large amounts of data, the use of a file format, such as HDF5, designed to handle such datasets, is to be preferred [1].
For instance, below is a solution to save numpy arrays in HDF5 with PyTables:
Step 1: Create an extendable EArray storage
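import tables
import numpy as np

filename = 'outarray.h5'
ROW_SIZE = 100       # number of values in each row
NUM_COLUMNS = 200    # number of rows appended below (despite the name)

f = tables.open_file(filename, mode='w')
atom = tables.Float64Atom()
# The 0 in the shape marks the first dimension as the extendable one
array_c = f.create_earray(f.root, 'data', atom, (0, ROW_SIZE))
for idx in range(NUM_COLUMNS):
    x = np.random.rand(1, ROW_SIZE)  # one row, shape (1, ROW_SIZE)
    array_c.append(x)
f.close()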
Step 2: Append rows to an existing dataset (if needed)
f = tables.open_file(filename, mode='a')
f.root.data.append(x)  # x must have shape (n, ROW_SIZE)
f.close()
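As a minimal sketch of Step 2, the append can be wrapped in a small helper that opens the file with a context manager, so it is closed even if the append fails. The helper name `append_rows` and the temporary demo file are hypothetical, chosen only for illustration; the sketch assumes the array is stored under `/data` as in Step 1.

```python
import os
import tempfile

import numpy as np
import tables

ROW_SIZE = 100  # length of each row, as in Step 1


def append_rows(path, rows):
    """Append one or more rows to the 'data' EArray; return the new row count.

    Hypothetical helper, not part of PyTables.
    """
    # Promote a single 1-D row to shape (1, ROW_SIZE); 2-D input is kept as-is.
    rows = np.atleast_2d(np.asarray(rows, dtype=np.float64))
    with tables.open_file(path, mode='a') as f:  # closed automatically on exit
        f.root.data.append(rows)
        return int(f.root.data.nrows)


# Hypothetical demo file in a temporary directory.
demo_path = os.path.join(tempfile.mkdtemp(), 'append_demo.h5')
with tables.open_file(demo_path, mode='w') as f:
    f.create_earray(f.root, 'data', tables.Float64Atom(), (0, ROW_SIZE))

n_after_one = append_rows(demo_path, np.random.rand(ROW_SIZE))       # one row
n_after_batch = append_rows(demo_path, np.random.rand(5, ROW_SIZE))  # five rows
```

Appending several rows per call, rather than one at a time, reduces the number of times the file has to be opened.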
Step 3: Read back a subset of the data
f = tables.open_file(filename, mode='r')
print(f.root.data[1:10, 2:20])  # e.g. read from disk only this part of the dataset
f.close()
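Since the question is about iterating over the rows of an array that does not fit in memory, here is a rough sketch of reading the stored rows back in blocks. The generator name `iter_row_blocks`, the block size, and the demo file are assumptions made for this example; only one block is held in memory at a time.

```python
import os
import tempfile

import numpy as np
import tables


def iter_row_blocks(path, block_size=64):
    """Yield consecutive blocks of rows from the 'data' array as numpy arrays.

    Hypothetical helper, not part of PyTables.
    """
    with tables.open_file(path, mode='r') as f:
        data = f.root.data
        for start in range(0, int(data.nrows), block_size):
            # Slicing reads only these rows from disk.
            yield data[start:start + block_size]


# Hypothetical demo data: 200 rows of 100 values each.
demo_path = os.path.join(tempfile.mkdtemp(), 'read_demo.h5')
reference = np.random.rand(200, 100)
with tables.open_file(demo_path, mode='w') as f:
    arr = f.create_earray(f.root, 'data', tables.Float64Atom(), (0, 100))
    arr.append(reference)

total_rows = 0
for block in iter_row_blocks(demo_path, block_size=64):
    total_rows += block.shape[0]  # process each block here
```

The file stays open while the generator is being consumed and is closed once the loop finishes.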