I am processing a number of files. Processing each file outputs several thousand arrays of floats, and I will store the data from all files in one huge dataset in a single HDF5 file for further processing.
Currently I am confused about how to append my data to the HDF5 file (see the comment in the code above). In the two for loops above, as you can see, I want to append one 1-dimensional array of floats to the HDF5 file at a time, not the whole thing at once. My data is in terabytes, so we can only append the data to the file.
There are several questions:
Or is this not possible?
EDIT:
I've been following Simon's suggestion, and currently here is my updated code:
hid_t desFi5;
hid_t fid1;
hid_t propList;
hsize_t fdim[2];
desFi5 = H5Fcreate(saveFilePath, H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
fdim[0] = 3;
fdim[1] = 1;//H5S_UNLIMITED;
fid1 = H5Screate_simple(2, fdim, NULL);
cout << "----------------------------------Space done\n";
propList = H5Pcreate( H5P_DATASET_CREATE);
H5Pset_layout( propList, H5D_CHUNKED );
int ndims = 2;
hsize_t chunk_dims[2];
chunk_dims[0] = 3;
chunk_dims[1] = 1;
H5Pset_chunk( propList, ndims, chunk_dims );
cout << "----------------------------------Property done\n";
hid_t dataset1 = H5Dcreate( desFi5, "des", H5T_NATIVE_FLOAT, fid1, H5P_DEFAULT, propList, H5P_DEFAULT);
cout << "----------------------------------Dataset done\n";
bufi = new float*[1];
bufi[0] = new float[3];
bufi[0][0] = 0;
bufi[0][1] = 1;
bufi[0][2] = 2;
//hyperslab
hsize_t start[2] = {0,0};
hsize_t stride[2] = {1,1};
hsize_t count[2] = {1,1};
hsize_t block[2] = {1,3};
H5Sselect_hyperslab( fid1, H5S_SELECT_OR, start, stride, count, block);
cout << "----------------------------------hyperslab done\n";
H5Dwrite(dataset1, H5T_NATIVE_FLOAT, H5S_ALL, H5S_ALL, H5P_DEFAULT, *bufi);
fdim[0] = 3;
fdim[1] = H5S_UNLIMITED; // COMPLAINS HERE
H5Dset_extent( dataset1, fdim );
cout << "----------------------------------extent done\n";
//hyperslab2
hsize_t start2[2] = {1,0};
hsize_t stride2[2] = {1,1};
hsize_t count2[2] = {1,1};
hsize_t block2[2] = {1,3};
H5Sselect_hyperslab( fid1, H5S_SELECT_OR, start2, stride2, count2, block2);
cout << "----------------------------------hyperslab2 done\n";
H5Dwrite(dataset1, H5T_NATIVE_FLOAT, H5S_ALL, H5S_ALL, H5P_DEFAULT, *bufi);
cout << "----------------------------------H5Dwrite done\n";
H5Dclose(dataset1);
cout << "----------------------------------dataset closed\n";
H5Pclose( propList );
cout << "----------------------------------property list closed\n";
H5Sclose(fid1);
cout << "----------------------------------dataspace fid1 closed\n";
H5Fclose(desFi5);
cout << "----------------------------------desFi5 closed\n";
My current output is:
bash-3.2$ ./hdf5AppendTest.out
----------------------------------Space done
----------------------------------Property done
----------------------------------Dataset done
----------------------------------hyperslab done
HDF5-DIAG: Error detected in HDF5 (1.8.10) thread 0:
#000: /home/hdftest/snapshots-bin-hdf5_1_8_10/current/src/H5D.c line 1103 in H5Dset_extent(): unable to set extend dataset
major: Dataset
minor: Unable to initialize object
#001: /home/hdftest/snapshots-bin-hdf5_1_8_10/current/src/H5Dint.c line 2179 in H5D__set_extent(): unable to modify size of data space
major: Dataset
minor: Unable to initialize object
#002: /home/hdftest/snapshots-bin-hdf5_1_8_10/current/src/H5S.c line 1874 in H5S_set_extent(): dimension cannot exceed the existing maximal size (new: 18446744073709551615 max: 1)
major: Dataspace
minor: Bad value
----------------------------------extent done
----------------------------------hyperslab2 done
----------------------------------H5Dwrite done
----------------------------------dataset closed
----------------------------------property list closed
----------------------------------dataspace fid1 closed
----------------------------------desFi5 closed
Currently, I see that setting a dimension to unlimited with H5Dset_extent still causes a problem at runtime (the problematic call is marked with //COMPLAINS HERE in the code above). I already have a chunked dataset as specified by Simon, so what's the problem here?
On the other hand, without H5Dset_extent, I can write a test array of [0, 1, 2] just fine, but how can we make the code above output the test array to the file like this:
[0, 1, 2]
[0, 1, 2]
[0, 1, 2]
[0, 1, 2]
...
...
Recall: this is just a test array. The real data is bigger, and I cannot hold the whole thing in RAM, so I must write the data part by part, one piece at a time.
EDIT 2:
I've followed more of Simon's suggestion. Here is the critical part:
hsize_t n = 3, p = 1;
float *bufi_data = new float[n * p];
float ** bufi = new float*[n];
for (hsize_t i = 0; i < n; ++i){
bufi[i] = &bufi_data[i * n];
}
bufi[0][0] = 0.1;
bufi[0][1] = 0.2;
bufi[0][2] = 0.3;
//hyperslab
hsize_t start[2] = {0,0};
hsize_t count[2] = {3,1};
H5Sselect_hyperslab( fid1, H5S_SELECT_SET, start, NULL, count, NULL);
cout << "----------------------------------hyperslab done\n";
H5Dwrite(dataset1, H5T_NATIVE_FLOAT, H5S_ALL, fid1, H5P_DEFAULT, *bufi);
bufi[0][0] = 0.4;
bufi[0][1] = 0.5;
bufi[0][2] = 0.6;
hsize_t fdimNew[2];
fdimNew[0] = 3;
fdimNew[1] = 2;
H5Dset_extent( dataset1, fdimNew );
cout << "----------------------------------extent done\n";
//hyperslab2
hsize_t start2[2] = {0,0}; //PROBLEM
hsize_t count2[2] = {3,1};
H5Sselect_hyperslab( fid1, H5S_SELECT_SET, start2, NULL, count2, NULL);
cout << "----------------------------------hyperslab2 done\n";
H5Dwrite(dataset1, H5T_NATIVE_FLOAT, H5S_ALL, fid1, H5P_DEFAULT, *bufi);
From the above, I got the following output for hdf5:
0.4 0.5 0.6
0 0 0
After further experiments with start2 and count2, I see that these variables only affect the starting index and the increment for bufi. They do not move the write position in my dataset at all.
Recall: the final result must be:
0.1 0.2 0.3
0.4 0.5 0.6
Also, it must be *bufi instead of bufi for H5Dwrite, Simon, because bufi gives me completely random numbers.
UPDATE 3:
For the selection part suggested by Simon:
hsize_t start[2] = {0, 0};
hsize_t count[2] = {1, 3};
hsize_t start[2] = {1, 0};
hsize_t count[2] = {1, 3};
These give the following error:
HDF5-DIAG: Error detected in HDF5 (1.8.10) thread 0:
#000: /home/hdftest/snapshots-bin-hdf5_1_8_10/current/src/H5Dio.c line 245 in H5Dwrite(): file selection+offset not within extent
major: Dataspace
minor: Out of range
count[2] should be {3,1} rather than {1,3}, I suppose? And for start[2], if I don't set it to {0,0}, it always produces the error above.
Are you sure this is correct?
How to append the data in this case? What kind of function must I use?
You must use hyperslabs. That's what you need to write only part of a dataset. The function to do that is H5Sselect_hyperslab. Use it on fd1 and use fd1 as your file dataspace in your H5Dwrite call.
I have tried put infinity flag of HDF5 in, but the runtime execution complains.
You need to create a chunked dataset in order to be able to set its maximum size to unlimited (H5S_UNLIMITED). Create a dataset creation property list and use H5Pset_layout to make it chunked. Use H5Pset_chunk to set the chunk size. Then create your dataset using this property list.
I don't want to calculate the data that I have each time; is there a way to just simply keep on adding data in, without caring the value of fdim?
You can do one of two things:
1. Precompute the final size so you can create a dataset that is big enough. It looks like that's what you are doing.
2. Extend your dataset as you go using H5Dset_extent. For this you need to set the maximum dimensions to unlimited, so you need a chunked dataset (see above).
In both cases, you need to select a hyperslab on the file dataspace in your H5Dwrite call (see above).
#include <iostream>
#include <hdf5.h>
// Constants
const char saveFilePath[] = "test.h5";
const hsize_t ndims = 2;
const hsize_t ncols = 3;
int main()
{
First, create an HDF5 file.
hid_t file = H5Fcreate(saveFilePath, H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
std::cout << "- File created" << std::endl;
Then create a 2D dataspace. The size of the first dimension is unlimited. We set it initially to 0 to show how you can extend the dataset at each step. You could also set it to the size of the first buffer you are going to write for instance. The size of the second dimension is fixed.
hsize_t dims[ndims] = {0, ncols};
hsize_t max_dims[ndims] = {H5S_UNLIMITED, ncols};
hid_t file_space = H5Screate_simple(ndims, dims, max_dims);
std::cout << "- Dataspace created" << std::endl;
Then create a dataset creation property list. The layout of the dataset has to be chunked when using unlimited dimensions. The choice of the chunk size affects performance, both in time and in disk space. If the chunks are very small, you will have a lot of overhead. If they are too large, you might allocate space that you don't need and your files might end up being too large. This is a toy example, so we will choose chunks of two lines.
hid_t plist = H5Pcreate(H5P_DATASET_CREATE);
H5Pset_layout(plist, H5D_CHUNKED);
hsize_t chunk_dims[ndims] = {2, ncols};
H5Pset_chunk(plist, ndims, chunk_dims);
std::cout << "- Property list created" << std::endl;
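To make the chunk-size trade-off a bit more concrete, here is a small sketch of the arithmetic, independent of the HDF5 library. It assumes that chunks are allocated whole and that no compression filter is applied. The helper names `chunks_needed` and `padded_rows` are hypothetical and not part of the HDF5 API.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helpers (not HDF5 API).

// Number of chunks needed along the first dimension to hold `rows` rows
// when each chunk holds `chunk_rows` rows (ceiling division).
std::uint64_t chunks_needed(std::uint64_t rows, std::uint64_t chunk_rows)
{
    assert(chunk_rows > 0);  // a chunk must hold at least one row
    return (rows + chunk_rows - 1) / chunk_rows;
}

// Rows that are allocated but unused in the last, partially filled chunk.
std::uint64_t padded_rows(std::uint64_t rows, std::uint64_t chunk_rows)
{
    return chunks_needed(rows, chunk_rows) * chunk_rows - rows;
}
```

For the 5 lines written in this example, chunks of 2 lines give 3 chunks with 1 unused line. Chunks of 1000 lines would give a single chunk with 995 unused lines, while chunks of 1 line would waste nothing but need 5 separate chunks, each with its own bookkeeping overhead.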
Create the dataset.
hid_t dset = H5Dcreate(file, "dset1", H5T_NATIVE_FLOAT, file_space, H5P_DEFAULT, plist, H5P_DEFAULT);
std::cout << "- Dataset 'dset1' created" << std::endl;
Close resources. The dataset is now created, so we don't need the property list anymore. We don't need the file dataspace anymore either: once the dataset is extended, this handle becomes stale because it still holds the previous extent. So we will have to grab the updated file dataspace anyway.
H5Pclose(plist);
H5Sclose(file_space);
We will now append two buffers to the end of the dataset. The first one will be two lines long. The second one will be three lines long.
We create a 2D buffer (contiguous in memory, row-major order).
We will allocate enough memory to store 3 lines, so we can reuse the buffer.
Let us create an array of pointers so we can use the b[i][j] notation instead of buffer[i * ncols + j]. This is purely aesthetic.
hsize_t nlines = 3;
float *buffer = new float[nlines * ncols];
float **b = new float*[nlines];
for (hsize_t i = 0; i < nlines; ++i){
b[i] = &buffer[i * ncols];
}
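The row-pointer trick can be checked in isolation. The sketch below uses `std::vector` instead of `new[]` (an assumption made for brevity; the code in this answer uses raw arrays) and a hypothetical helper `make_row_pointers`. It illustrates that `b[i][j]` and `buffer[i * ncols + j]` name the same element as long as the stride between rows is the number of columns.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: build one pointer per row into a contiguous,
// row-major buffer of nrows * ncols floats.
std::vector<float*> make_row_pointers(std::vector<float>& buffer,
                                      std::size_t nrows, std::size_t ncols)
{
    assert(buffer.size() == nrows * ncols);
    std::vector<float*> rows(nrows);
    for (std::size_t i = 0; i < nrows; ++i) {
        rows[i] = buffer.data() + i * ncols;  // stride is ncols, not nrows
    }
    return rows;
}
```

With a 3 x 3 buffer, writing `rows[1][2] = 0.6f` stores the value at `buffer[1 * 3 + 2]`. Only the contiguous storage is meaningful to H5Dwrite, so the pointer to pass is the start of the data (`buffer.data()` or `rows[0]` here), never the array of row pointers itself.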
Initial values in buffer to be written in the dataset:
b[0][0] = 0.1;
b[0][1] = 0.2;
b[0][2] = 0.3;
b[1][0] = 0.4;
b[1][1] = 0.5;
b[1][2] = 0.6;
We create a memory dataspace to indicate the size of our buffer in memory. Remember the first buffer is only two lines long.
dims[0] = 2;
dims[1] = ncols;
hid_t mem_space = H5Screate_simple(ndims, dims, NULL);
std::cout << "- Memory dataspace created" << std::endl;
We now need to extend the dataset. We set the initial size of the dataset to 0x3, so we need to extend it before writing. Note that we extend the dataset itself, not its dataspace. Remember the first buffer is only two lines long.
dims[0] = 2;
dims[1] = ncols;
H5Dset_extent(dset, dims);
std::cout << "- Dataset extended" << std::endl;
Select hyperslab on file dataset.
file_space = H5Dget_space(dset);
hsize_t start[2] = {0, 0};
hsize_t count[2] = {2, ncols};
H5Sselect_hyperslab(file_space, H5S_SELECT_SET, start, NULL, count, NULL);
std::cout << "- First hyperslab selected" << std::endl;
Write buffer to dataset.
mem_space and file_space should now have the same number of elements selected.
Note that buffer and &b[0][0] are equivalent.
H5Dwrite(dset, H5T_NATIVE_FLOAT, mem_space, file_space, H5P_DEFAULT, buffer);
std::cout << "- First buffer written" << std::endl;
We can now close the file dataspace. We could close the memory dataspace now and create a new one for the second buffer, but we will simply update its size.
H5Sclose(file_space);
New values in buffer to be appended to the dataset:
b[0][0] = 1.1;
b[0][1] = 1.2;
b[0][2] = 1.3;
b[1][0] = 1.4;
b[1][1] = 1.5;
b[1][2] = 1.6;
b[2][0] = 1.7;
b[2][1] = 1.8;
b[2][2] = 1.9;
Resize the memory dataspace to indicate the new size of our buffer. The second buffer is three lines long.
dims[0] = 3;
dims[1] = ncols;
H5Sset_extent_simple(mem_space, ndims, dims, NULL);
std::cout << "- Memory dataspace resized" << std::endl;
Extend dataset. Note that in this simple example, we know that 2 + 3 = 5. In general, you could read the current extent from the file dataspace and add the desired number of lines to it.
dims[0] = 5;
dims[1] = ncols;
H5Dset_extent(dset, dims);
std::cout << "- Dataset extended" << std::endl;
Select hyperslab on file dataset. Again in this simple example, we know that 0 + 2 = 2. In general, you could read the current extent from the file dataspace. The second buffer is three lines long.
file_space = H5Dget_space(dset);
start[0] = 2;
start[1] = 0;
count[0] = 3;
count[1] = ncols;
H5Sselect_hyperslab(file_space, H5S_SELECT_SET, start, NULL, count, NULL);
std::cout << "- Second hyperslab selected" << std::endl;
Append buffer to dataset.
H5Dwrite(dset, H5T_NATIVE_FLOAT, mem_space, file_space, H5P_DEFAULT, buffer);
std::cout << "- Second buffer written" << std::endl;
The end: let's close all the resources:
delete[] b;
delete[] buffer;
H5Sclose(file_space);
H5Sclose(mem_space);
H5Dclose(dset);
H5Fclose(file);
std::cout << "- Resources released" << std::endl;
}
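The walkthrough hard-codes the sizes (2 + 3 = 5 for the extent, 0 + 2 = 2 for the start). The bookkeeping behind a general append can be sketched without the library: given the current number of lines, the new extent and the hyperslab follow directly. `AppendPlan` and `plan_append` are hypothetical names, not HDF5 API, and `std::uint64_t` stands in for `hsize_t` so that the snippet compiles without the HDF5 headers. In real code, the current number of lines would come from the dataset's file dataspace, for example via H5Sget_simple_extent_dims.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical bookkeeping for appending lines to a 2D dataset whose first
// dimension is unlimited. std::uint64_t stands in for hsize_t.
struct AppendPlan {
    std::uint64_t new_dims[2];  // extent to pass to H5Dset_extent
    std::uint64_t start[2];     // hyperslab start in the file dataspace
    std::uint64_t count[2];     // hyperslab count in the file dataspace
};

AppendPlan plan_append(std::uint64_t current_rows, std::uint64_t rows_to_add,
                       std::uint64_t ncols)
{
    assert(ncols > 0);
    AppendPlan plan;
    plan.new_dims[0] = current_rows + rows_to_add;
    plan.new_dims[1] = ncols;
    plan.start[0] = current_rows;  // new lines begin right after the old ones
    plan.start[1] = 0;
    plan.count[0] = rows_to_add;
    plan.count[1] = ncols;
    return plan;
}
```

Calling `plan_append(0, 2, 3)` reproduces the first write above (extent 2x3, start {0, 0}, count {2, 3}), and `plan_append(2, 3, 3)` reproduces the second (extent 5x3, start {2, 0}, count {3, 3}). The memory dataspace still has to describe a buffer of `rows_to_add` by `ncols` elements so that both selections contain the same number of elements.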
NB: I removed the previous updates because the answer was too long. If you are interested, browse the history.