I have a number of hdf5 files, each of which has a single dataset. The datasets are too large to hold in RAM. I would like to combine these files into a single file containing all datasets separately (i.e. not to concatenate the datasets into a single dataset).
One way to do this is to create an hdf5 file and then copy the datasets one by one. This would be slow and complicated, because it would need to be a buffered copy.
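For reference, here is roughly what that buffered copy would look like (the file names, dataset layout and chunk size below are made up for illustration):

import h5py

src_files = ['file1.hdf5', 'file2.hdf5']   # my input files, one dataset each
rows_per_chunk = 100000                    # rows to hold in RAM at a time

with h5py.File('combined.hdf5', 'w') as out:
    for path in src_files:
        with h5py.File(path, 'r') as src:
            name, dset = next(iter(src.items()))   # the single dataset in this file
            new = out.create_dataset(name, shape=dset.shape, dtype=dset.dtype)
            for start in range(0, dset.shape[0], rows_per_chunk):
                stop = min(start + rows_per_chunk, dset.shape[0])
                new[start:stop] = dset[start:stop]  # copy one slab at a time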
Is there a simpler way to do this? It seems like there should be, since it is essentially just creating a container file.
I am using python/h5py.
This is actually one of the use-cases of HDF5. If you just want to be able to access all the datasets from a single file, and don't care how they're actually stored on disk, you can use external links. From the HDF5 website:
External links allow a group to include objects in another HDF5 file and enable the library to access those objects as if they are in the current file. In this manner, a group may appear to directly contain datasets, named datatypes, and even groups that are actually in a different file. This feature is implemented via a suite of functions that create and manage the links, define and retrieve paths to external objects, and interpret link names:
myfile = h5py.File('foo.hdf5','a')
myfile['ext link'] = h5py.ExternalLink("otherfile.hdf5", "/path/to/resource")
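So for your case you could do something like the following sketch (the file names and the internal dataset path '/data' are placeholders; use whatever your files actually contain):

import h5py

src_files = ['part1.hdf5', 'part2.hdf5', 'part3.hdf5']   # placeholder names

with h5py.File('foo.hdf5', 'a') as myfile:
    for i, path in enumerate(src_files):
        # each entry appears as a dataset in foo.hdf5 but lives in the other file
        myfile[f'dataset_{i}'] = h5py.ExternalLink(path, '/data')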
Be careful: when opening myfile, you should open it with 'a' if it is an existing file. If you open it with 'w', it will erase its contents.
This would be very much faster than copying all the datasets into a new file. I don't know how fast access to otherfile.hdf5 would be, but operating on all the datasets would be transparent - that is, h5py would see all the datasets as residing in foo.hdf5.
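For example, once the links are in place, reading looks no different from reading local datasets (names as in the sketch above; the linked files just need to stay accessible):

import h5py

with h5py.File('foo.hdf5', 'r') as f:
    print(f['dataset_0'].shape)      # resolved through the external link
    block = f['dataset_0'][:1000]    # only this slice is read from the linked file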