I'm using h5py to save numpy arrays in HDF5 format from Python. Recently I tried to apply compression, and the files I get are actually bigger...
I went from calls like this (every file has several datasets):
self._h5_current_frame.create_dataset(
'estimated position', shape=estimated_pos.shape,
dtype=float, data=estimated_pos)
to calls like this:
self._h5_current_frame.create_dataset(
'estimated position', shape=estimated_pos.shape, dtype=float,
data=estimated_pos, compression="gzip", compression_opts=9)
In one particular example, the compressed file is 172K and the uncompressed one is 72K (and h5diff reports both files are equal). I tried a more basic example and it works as expected... but not in my program.
How is that possible? I don't think the gzip algorithm ever produces a bigger compressed file, so it's probably something about h5py and how I'm using it :-/ Any ideas?
Cheers!!
EDIT:
Looking at the output from h5stat, it seems the compressed version stores a lot more metadata (see the last few lines of each report):
Filename: res_totolaca_jue_2015-10-08_17:06:30_19387.hdf5
File information
# of unique groups: 21
# of unique datasets: 56
# of unique named datatypes: 0
# of unique links: 0
# of unique other: 0
Max. # of links to object: 1
Max. # of objects in group: 5
File space information for file metadata (in bytes):
Superblock extension: 0
User block: 0
Object headers: (total/unused)
Groups: 3798/503
Datasets(exclude compact data): 15904/9254
Datatypes: 0/0
Groups:
B-tree/List: 0
Heap: 0
Attributes:
B-tree/List: 0
Heap: 0
Chunked datasets:
Index: 116824
Datasets:
Heap: 0
Shared Messages:
Header: 0
B-tree/List: 0
Heap: 0
Small groups (with 0 to 9 links):
# of groups with 1 link(s): 1
# of groups with 2 link(s): 5
# of groups with 3 link(s): 5
# of groups with 5 link(s): 10
Total # of small groups: 21
Group bins:
# of groups with 1 - 9 links: 21
Total # of groups: 21
Dataset dimension information:
Max. rank of datasets: 3
Dataset ranks:
# of dataset with rank 1: 51
# of dataset with rank 2: 3
# of dataset with rank 3: 2
1-D Dataset information:
Max. dimension size of 1-D datasets: 624
Small 1-D datasets (with dimension sizes 0 to 9):
# of datasets with dimension sizes 1: 36
# of datasets with dimension sizes 2: 2
# of datasets with dimension sizes 3: 2
Total # of small datasets: 40
1-D Dataset dimension bins:
# of datasets with dimension size 1 - 9: 40
# of datasets with dimension size 10 - 99: 2
# of datasets with dimension size 100 - 999: 9
Total # of datasets: 51
Dataset storage information:
Total raw data size: 33602
Total external raw data size: 0
Dataset layout information:
Dataset layout counts[COMPACT]: 0
Dataset layout counts[CONTIG]: 2
Dataset layout counts[CHUNKED]: 54
Number of external files : 0
Dataset filters information:
Number of datasets with:
NO filter: 2
GZIP filter: 54
SHUFFLE filter: 0
FLETCHER32 filter: 0
SZIP filter: 0
NBIT filter: 0
SCALEOFFSET filter: 0
USER-DEFINED filter: 0
Dataset datatype information:
# of unique datatypes used by datasets: 4
Dataset datatype #0:
Count (total/named) = (20/0)
Size (desc./elmt) = (14/8)
Dataset datatype #1:
Count (total/named) = (17/0)
Size (desc./elmt) = (22/8)
Dataset datatype #2:
Count (total/named) = (10/0)
Size (desc./elmt) = (22/8)
Dataset datatype #3:
Count (total/named) = (9/0)
Size (desc./elmt) = (14/8)
Total dataset datatype count: 56
Small # of attributes (objects with 1 to 10 attributes):
Total # of objects with small # of attributes: 0
Attribute bins:
Total # of objects with attributes: 0
Max. # of attributes to objects: 0
Summary of file space information:
File metadata: 136526 bytes
Raw data: 33602 bytes
Unaccounted space: 5111 bytes
Total space: 175239 bytes
Filename: res_totolaca_jue_2015-10-08_17:03:04_19267.hdf5
File information
# of unique groups: 21
# of unique datasets: 56
# of unique named datatypes: 0
# of unique links: 0
# of unique other: 0
Max. # of links to object: 1
Max. # of objects in group: 5
File space information for file metadata (in bytes):
Superblock extension: 0
User block: 0
Object headers: (total/unused)
Groups: 3663/452
Datasets(exclude compact data): 15904/10200
Datatypes: 0/0
Groups:
B-tree/List: 0
Heap: 0
Attributes:
B-tree/List: 0
Heap: 0
Chunked datasets:
Index: 0
Datasets:
Heap: 0
Shared Messages:
Header: 0
B-tree/List: 0
Heap: 0
Small groups (with 0 to 9 links):
# of groups with 1 link(s): 1
# of groups with 2 link(s): 5
# of groups with 3 link(s): 5
# of groups with 5 link(s): 10
Total # of small groups: 21
Group bins:
# of groups with 1 - 9 links: 21
Total # of groups: 21
Dataset dimension information:
Max. rank of datasets: 3
Dataset ranks:
# of dataset with rank 1: 51
# of dataset with rank 2: 3
# of dataset with rank 3: 2
1-D Dataset information:
Max. dimension size of 1-D datasets: 624
Small 1-D datasets (with dimension sizes 0 to 9):
# of datasets with dimension sizes 1: 36
# of datasets with dimension sizes 2: 2
# of datasets with dimension sizes 3: 2
Total # of small datasets: 40
1-D Dataset dimension bins:
# of datasets with dimension size 1 - 9: 40
# of datasets with dimension size 10 - 99: 2
# of datasets with dimension size 100 - 999: 9
Total # of datasets: 51
Dataset storage information:
Total raw data size: 50600
Total external raw data size: 0
Dataset layout information:
Dataset layout counts[COMPACT]: 0
Dataset layout counts[CONTIG]: 56
Dataset layout counts[CHUNKED]: 0
Number of external files : 0
Dataset filters information:
Number of datasets with:
NO filter: 56
GZIP filter: 0
SHUFFLE filter: 0
FLETCHER32 filter: 0
SZIP filter: 0
NBIT filter: 0
SCALEOFFSET filter: 0
USER-DEFINED filter: 0
Dataset datatype information:
# of unique datatypes used by datasets: 4
Dataset datatype #0:
Count (total/named) = (20/0)
Size (desc./elmt) = (14/8)
Dataset datatype #1:
Count (total/named) = (17/0)
Size (desc./elmt) = (22/8)
Dataset datatype #2:
Count (total/named) = (10/0)
Size (desc./elmt) = (22/8)
Dataset datatype #3:
Count (total/named) = (9/0)
Size (desc./elmt) = (14/8)
Total dataset datatype count: 56
Small # of attributes (objects with 1 to 10 attributes):
Total # of objects with small # of attributes: 0
Attribute bins:
Total # of objects with attributes: 0
Max. # of attributes to objects: 0
Summary of file space information:
File metadata: 19567 bytes
Raw data: 50600 bytes
Unaccounted space: 5057 bytes
Total space: 75224 bytes
First, here's a reproducible example:
import h5py
from scipy.misc import lena
img = lena() # some compressible image data
f1 = h5py.File('nocomp.h5', 'w')
f1.create_dataset('img', data=img)
f1.close()
f2 = h5py.File('complevel_9.h5', 'w')
f2.create_dataset('img', data=img, compression='gzip', compression_opts=9)
f2.close()
f3 = h5py.File('complevel_0.h5', 'w')
f3.create_dataset('img', data=img, compression='gzip', compression_opts=0)
f3.close()
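Side note: scipy.misc.lena was removed from later scipy releases, so if you can't import it, any reasonably compressible 512x512 array of 8-byte integers will reproduce the same effect. A possible stand-in (the gradient pattern is an arbitrary choice of mine, not anything special):
import numpy as np
# a smoothly varying 512x512 image of 8-byte integers, which gzip compresses well
x = np.arange(512)
img = (np.add.outer(x, x) % 256).astype(np.int64)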
Now let's look at the file sizes:
~$ h5stat -S nocomp.h5
Filename: nocomp.h5
Summary of file space information:
File metadata: 1304 bytes
Raw data: 2097152 bytes
Unaccounted space: 840 bytes
Total space: 2099296 bytes
~$ h5stat -S complevel_9.h5
Filename: complevel_9.h5
Summary of file space information:
File metadata: 11768 bytes
Raw data: 302850 bytes
Unaccounted space: 1816 bytes
Total space: 316434 bytes
~$ h5stat -S complevel_0.h5
Filename: complevel_0.h5
Summary of file space information:
File metadata: 11768 bytes
Raw data: 2098560 bytes
Unaccounted space: 1816 bytes
Total space: 2112144 bytes
In my example, compression at gzip level 9 makes sense: although it requires an extra ~10kB of metadata, this is more than outweighed by a ~1794kB decrease in the size of the image data (about a 7:1 compression ratio). The net result is a ~6.6-fold reduction in total file size.
However, in your example the compression only reduces the size of your raw data by ~16kB (a compression ratio of about 1.5:1), which is massively outweighed by a 116kB increase in the size of the metadata. The reason why the increase in metadata size is so much larger than for my example is probably because your file contains 56 datasets rather than just one.
Even if gzip magically reduced the size of your raw data to zero you would still end up with a file that was ~1.8 times larger than the uncompressed version. The size of the metadata is more or less guaranteed to scale sublinearly with the size of your arrays, so if your datasets were much larger then you would start to see some benefit from compressing them. As it stands, your array is so small that it's unlikely that you'll gain anything from compression.
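If you do want to keep compression for the occasional large array without paying the chunking overhead on every tiny one, a simple option is to only pass the compression arguments when the array is big enough for it to pay off. A rough sketch (the save_array helper and the 64 KiB threshold are my own illustrative choices, not anything built into h5py):
def save_array(group, name, arr, min_nbytes=64 * 1024):
    """Create a dataset, enabling gzip only for arrays above a size threshold."""
    if arr.nbytes >= min_nbytes:
        # big enough that chunking + gzip should be a net win
        return group.create_dataset(name, data=arr,
                                    compression='gzip', compression_opts=9)
    # small arrays: plain contiguous layout, no chunk index to store
    return group.create_dataset(name, data=arr)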
The reason the compressed version needs so much more metadata is not really the compression itself, but the fact that in order to use compression filters the dataset has to be split into fixed-size chunks. Presumably a lot of the extra metadata is being used to store the B-tree that indexes those chunks. You can separate the effect of chunking from that of compression:
f4 = h5py.File('nocomp_autochunked.h5', 'w')
# let h5py pick a chunk size automatically
f4.create_dataset('img', data=img, chunks=True)
print(f4['img'].chunks)
# (32, 64)
f4.close()
f5 = h5py.File('nocomp_onechunk.h5', 'w')
# make the chunk shape the same as the shape of the array, so that there
# is only one chunk
f5.create_dataset('img', data=img, chunks=img.shape)
print(f5['img'].chunks)
# (512, 512)
f5.close()
f6 = h5py.File('complevel_9_onechunk.h5', 'w')
f6.create_dataset('img', data=img, chunks=img.shape, compression='gzip',
compression_opts=9)
f6.close()
And the resulting file sizes:
~$ h5stat -S nocomp_autochunked.h5
Filename: nocomp_autochunked.h5
Summary of file space information:
File metadata: 11768 bytes
Raw data: 2097152 bytes
Unaccounted space: 1816 bytes
Total space: 2110736 bytes
~$ h5stat -S nocomp_onechunk.h5
Filename: nocomp_onechunk.h5
Summary of file space information:
File metadata: 3920 bytes
Raw data: 2097152 bytes
Unaccounted space: 96 bytes
Total space: 2101168 bytes
~$ h5stat -S complevel_9_onechunk.h5
Filename: complevel_9_onechunk.h5
Summary of file space information:
File metadata: 3920 bytes
Raw data: 305051 bytes
Unaccounted space: 96 bytes
Total space: 309067 bytes
It's obvious that chunking, rather than compression, is what incurs the extra metadata: nocomp_autochunked.h5 contains exactly the same amount of metadata as complevel_0.h5 above, and introducing compression to the single-chunk version in complevel_9_onechunk.h5 made no difference to the total amount of metadata.
Increasing the chunk size such that the array is stored as a single chunk reduced the amount of metadata by a factor of about 3 in this example. How much difference this would make in your case will probably depend on how h5py automatically selects a chunk size for your input dataset. Interestingly this also resulted in a very slight reduction in the compression ratio, which is not what I would have predicted.
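To see what chunk shapes h5py actually picked in your file, you can walk the file and print each dataset's layout; this only touches metadata, not the raw data (the file name below is just the one from your h5stat output):
import h5py

def print_layout(name, obj):
    # visititems calls this for every group and dataset in the file
    if isinstance(obj, h5py.Dataset):
        print(name, obj.shape, obj.dtype,
              'chunks =', obj.chunks, 'compression =', obj.compression)

with h5py.File('res_totolaca_jue_2015-10-08_17:06:30_19387.hdf5', 'r') as f:
    f.visititems(print_layout)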
Bear in mind that there are also disadvantages to having larger chunks. Whenever you want to access a single element within a chunk, the whole chunk needs to be decompressed and read into memory. For a large dataset this can be disastrous for performance, but in your case the arrays are so small that it's probably not worth worrying about.
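For example, with the single-chunk file from above, even reading one pixel forces the whole chunk through the decompressor:
import h5py

with h5py.File('complevel_9_onechunk.h5', 'r') as f:
    # HDF5 has to decompress the entire 512x512 chunk just to return this one value
    px = f['img'][0, 0]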
Another thing you should consider is whether you can store your data in a single large dataset rather than lots of small ones. For example, if you have K 2D arrays of the same dtype, each with dimensions MxN, then you could store them more efficiently in one KxMxN 3D dataset rather than lots of small datasets. I don't know enough about your data to know whether this is feasible.
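As a rough sketch of that idea, assuming your arrays really do share a shape and dtype (the frames list here is made up for illustration):
import numpy as np
import h5py

frames = [np.random.rand(3, 4) for _ in range(100)]  # K arrays, each MxN

with h5py.File('stacked.h5', 'w') as f:
    # one KxMxN dataset instead of 100 tiny ones: far less per-dataset metadata
    f.create_dataset('frames', data=np.stack(frames),
                     compression='gzip', compression_opts=9)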