Saving in a file an array or DataFrame together with other information

user8682794 · Apr 9, 2018

The statistical software Stata allows short text snippets to be saved within a dataset. This is accomplished using notes and/or characteristics.

This is a feature of great value to me as it allows me to save a variety of information, ranging from reminders and to-do lists to information about how I generated the data, or even what the estimation method for a particular variable was.

I am now trying to come up with similar functionality in Python 3.6. So far, I have looked online and consulted a number of posts, but none of them address exactly what I want to do.

A few reference posts include:

For a small NumPy array, I have concluded that a combination of the function numpy.savez() and a dictionary can adequately store all relevant information in a single file.

For example:

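import numpy as np

a = np.array([[2,4],[6,8],[10,12]])
d = {"first": 1, "second": "two", "third": 3}

np.savez('whatever_name.npz', a=a, d=d)
# the dictionary is pickled into a 0-d object array, so loading needs allow_pickle=True
data = np.load('whatever_name.npz', allow_pickle=True)

arr = data['a']
dic = data['d'].item()  # unwrap the 0-d object array back into a dict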

However, the question remains:

Are there better ways to potentially incorporate other pieces of information in a file containing a NumPy array or a (large) Pandas DataFrame?

I am particularly interested in hearing about the pros and cons of any suggestions you may have, with examples. The fewer dependencies, the better.

Answer

jpp · Apr 24, 2018

There are many options. I will discuss only HDF5, because I have experience using this format.

Advantages: Portable (can be read outside of Python), native compression, out-of-memory capabilities, metadata support.

Disadvantages: Reliance on a single low-level C API, risk of data corruption because everything lives in a single file, deleting data does not reduce size automatically.

In my experience, for performance and portability, avoid PyTables / HDFStore to store numeric data. You can instead use the intuitive interface provided by h5py.

Store an array

import h5py, numpy as np

arr = np.random.randint(0, 10, (1000, 1000))

f = h5py.File('file.h5', 'w', libver='latest')  # use 'latest' for performance

dset = f.create_dataset('array', shape=(1000, 1000), data=arr, chunks=(100, 100),
                        compression='gzip', compression_opts=9)

Compression & chunking

There are many compression choices; for example, blosc and lzf are good choices for compression and decompression performance respectively. Note that gzip is native; other compression filters may not ship by default with your HDF5 installation.

Chunking is another option which, when aligned with how you read data out-of-memory, can significantly improve performance.
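As a rough sketch of how compression and chunking fit together, the snippet below writes the same array with two filters and then reads back a slice that lines up with exactly one chunk. The file name, the dataset names and the (100, 100) chunk shape are assumptions chosen for illustration; the best chunk shape depends on how you actually slice your data.

```python
import os
import tempfile

import h5py
import numpy as np

arr = np.random.randint(0, 10, (1000, 1000))

# Hypothetical file in a temporary directory, so nothing existing is overwritten.
path = os.path.join(tempfile.mkdtemp(), "chunked.h5")

with h5py.File(path, "w") as f:
    # lzf is bundled with h5py; gzip is part of HDF5 itself.
    f.create_dataset("lzf", data=arr, chunks=(100, 100), compression="lzf")
    f.create_dataset("gzip", data=arr, chunks=(100, 100),
                     compression="gzip", compression_opts=9)

with h5py.File(path, "r") as f:
    dset = f["lzf"]
    chunk_shape = dset.chunks       # chunk shape chosen at creation time
    filter_name = dset.compression  # name of the compression filter in use
    # This slice covers exactly one (100, 100) chunk, so only that chunk
    # needs to be read from disk and decompressed.
    block = dset[100:200, 300:400]
```

A slice that cuts across many chunks (for example a single full column here) would touch ten chunks instead of one, which is why the chunk shape should follow the access pattern.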

Add some attributes

dset.attrs['Description'] = 'Some text snippet'
dset.attrs['RowIndexArray'] = np.arange(1000)
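Reading the attributes back works the same way. This is a minimal, self-contained sketch; the file name and the placeholder array are assumptions. Attributes are meant for small pieces of metadata, so anything large is better stored as its own dataset.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical file in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "attrs.h5")

with h5py.File(path, "w") as f:
    dset = f.create_dataset("array", data=np.zeros((1000, 10)))
    dset.attrs["Description"] = "Some text snippet"
    dset.attrs["RowIndexArray"] = np.arange(1000)

with h5py.File(path, "r") as f:
    attrs = f["array"].attrs
    description = attrs["Description"]  # text attribute
    row_index = attrs["RowIndexArray"]  # array attribute, returned as a NumPy array
    names = sorted(attrs.keys())        # attrs behaves like a dictionary
```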

Store a dictionary

for k, v in d.items():
    f.create_dataset('dictgroup/'+str(k), data=v)
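To get the dictionary back, you can iterate over the group. The sketch below is one way to do it, using the dictionary from the question; read_scalar is a hypothetical helper, not part of h5py, and it assumes every value is a scalar number or a string. Depending on the h5py version, strings may come back as bytes, so the helper decodes them.

```python
import os
import tempfile

import h5py

d = {"first": 1, "second": "two", "third": 3}

# Hypothetical file in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "dict.h5")

with h5py.File(path, "w") as f:
    for k, v in d.items():
        f.create_dataset("dictgroup/" + str(k), data=v)


def read_scalar(dataset):
    """Hypothetical helper: turn a scalar dataset back into a plain Python value."""
    value = dataset[()]  # [()] reads the whole (scalar) dataset
    if isinstance(value, bytes):
        return value.decode("utf-8")
    if hasattr(value, "item"):
        return value.item()  # NumPy scalar -> Python scalar
    return value


with h5py.File(path, "r") as f:
    restored = {k: read_scalar(ds) for k, ds in f["dictgroup"].items()}
```

Nested dictionaries or non-scalar values would need extra handling, for example one subgroup per nesting level.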

Out-of-memory access

dictionary = f['dictgroup']
res = dictionary['my_key']
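Indexing into the file only returns a handle; data is read from disk when you slice it. The sketch below illustrates this with a hypothetical file and dataset name, and also uses a with block so the file is closed once you are done.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical file in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "partial.h5")

with h5py.File(path, "w") as f:
    f.create_dataset("array", data=np.arange(1000000).reshape(1000, 1000),
                     chunks=(100, 100))

with h5py.File(path, "r") as f:
    dset = f["array"]                        # a handle; no array data loaded yet
    is_handle = isinstance(dset, h5py.Dataset)
    first_rows = dset[:5, :]                 # only these rows are read into memory
    corner = dset[0, 0]                      # a single element
```

After the slice, first_rows is an ordinary NumPy array and stays usable after the file is closed, whereas dset does not.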

There is no substitute for reading the documentation for h5py, which exposes most of the HDF5 C API, but as the above shows, there is a significant amount of flexibility.