I have a question about how best to write to HDF5 files with Python / h5py.
I have data like:
-----------------------------------------
| timepoint | voltage1 | voltage2 | ...
-----------------------------------------
| 178 | 10 | 12 | ...
-----------------------------------------
| 179 | 12 | 11 | ...
-----------------------------------------
| 185 | 9 | 12 | ...
-----------------------------------------
| 187 | 15 | 12 | ...
...
with about 10^4 columns, and about 10^7 rows. (That's about 10^11 (100 billion) elements, or ~100GB with 1 byte ints).
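As a quick sanity check on that size estimate, here is a back-of-the-envelope sketch. It assumes the row and column counts above are exact and ignores compression and HDF5 overhead; NumPy is used only to look up item sizes. One thing worth noting: in NumPy/h5py the dtype string 'i8' means an 8-byte integer, while a 1-byte integer is 'i1', so the dtype='i8' in the create_dataset call further down would take roughly eight times the space estimated here.

```python
import numpy as np

n_rows = 10**7
n_cols = 10**4
n_elements = n_rows * n_cols  # 10**11 elements in the full table


def size_gb(dtype):
    """Uncompressed size of the whole table in decimal gigabytes."""
    return n_elements * np.dtype(dtype).itemsize / 1e9


for dtype in ("i1", "i8"):
    print(dtype, size_gb(dtype), "GB")
```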
With this data, typical use is pretty much write once, read many times, and the typical read case would be to grab column 1 and another column (say 254), load both columns into memory, and do some fancy statistics.
I think a good hdf5 structure would thus be to have each column in the table above be an hdf5 group, resulting in 10^4 groups. That way we won't need to read all the data into memory, yes? The hdf5 structure isn't defined yet though, so it can be anything.
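To make the read pattern concrete, here is a small sketch of a one-column-per-object layout. It assumes each column is stored as its own 1-D dataset (in HDF5, datasets hold the arrays, while groups are containers, much like folders). The file name demo_columns.h5, the tiny row count, and the fill values are made up for illustration.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical scratch file and a tiny row count, for illustration only.
path = os.path.join(tempfile.mkdtemp(), "demo_columns.h5")
n_rows = 1000

with h5py.File(path, "w") as f:
    f.create_dataset("timepoint", data=np.arange(n_rows, dtype="i8"))
    for k in (1, 2, 254):
        # One 1-D dataset per column; each is filled with its own column index.
        f.create_dataset(f"voltage{k}", data=np.full(n_rows, k, dtype="i8"))

with h5py.File(path, "r") as f:
    # Only the two requested columns are read from disk into memory.
    v1 = f["voltage1"][:]
    v254 = f["voltage254"][:]

print(v1.shape, v254.shape)
```

With this layout, the cost of a read should depend on the columns requested rather than on the total number of columns in the file.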
Now the question: I receive the data ~10^4 rows at a time (and not exactly the same number of rows each time), and need to write it incrementally to the hdf5 file. How do I write that file?
I'm considering python and h5py, but could use another tool if recommended. Is chunking the way to go, with e.g.
dset = f.create_dataset("voltage284", (100000,), maxshape=(None,), dtype='i8', chunks=(10000,))
and then when another block of 10^4 rows arrives, replace the dataset?
Or is it better to just store each block of 10^4 rows as a separate dataset? Or do I really need to know the final number of rows? (That'll be tricky to get, but maybe possible).
I can bail on hdf5 if it's not the right tool for the job too, though I think once the awkward writes are done, it'll be wonderful.
Per the FAQ, you can expand the dataset using dset.resize. For example,
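import os
import h5py
import numpy as np

path = '/tmp/out.h5'
# Start from a clean file; skip the removal if it does not exist yet.
if os.path.exists(path):
    os.remove(path)

with h5py.File(path, "a") as f:
    # maxshape=(None,) makes the dataset resizable along axis 0.
    dset = f.create_dataset('voltage284', (10**5,), maxshape=(None,),
                            dtype='i8', chunks=(10**4,))
    # Use integer samples: floats in [0, 1) would all be stored as 0 in an 'i8' dataset.
    dset[:] = np.random.randint(0, 256, size=dset.shape)
    print(dset.shape)
    # (100000,)

    for i in range(3):
        # Grow by one block, then fill the newly added tail.
        dset.resize(dset.shape[0] + 10**4, axis=0)
        dset[-10**4:] = np.random.randint(0, 256, size=10**4)
        print(dset.shape)
        # (110000,)
        # (120000,)
        # (130000,)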
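Since the blocks in the question do not all have the same number of rows, the same resize idea can be wrapped in a small helper. This is only a sketch: append_rows is a hypothetical name (not part of h5py), the block sizes are made up, and the file is written to a temporary directory. It also starts from an empty dataset, which suggests the final row count does not need to be known in advance.

```python
import os
import tempfile

import h5py
import numpy as np


def append_rows(dset, block):
    """Append a 1-D block of any length to the end of a resizable dataset."""
    block = np.asarray(block)
    old_len = dset.shape[0]
    dset.resize(old_len + block.shape[0], axis=0)
    dset[old_len:] = block


# Hypothetical scratch file and unequal block lengths, for illustration only.
path = os.path.join(tempfile.mkdtemp(), "demo_append.h5")
rng = np.random.default_rng(0)
block_sizes = [9_800, 10_000, 10_350]

with h5py.File(path, "w") as f:
    # Start empty; maxshape=(None,) allows unlimited growth along axis 0.
    dset = f.create_dataset("voltage284", shape=(0,), maxshape=(None,),
                            dtype="i8", chunks=(10**4,))
    for n in block_sizes:
        append_rows(dset, rng.integers(0, 256, size=n))
    final_shape = dset.shape

print(final_shape)
```

The chunk size stays fixed at 10^4 regardless of the incoming block length; incoming blocks simply span however many chunks they need.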