The statistical software Stata allows short text snippets to be saved within a dataset. This is accomplished using either notes and/or characteristics.
This feature is of great value to me, as it allows me to save a variety of information, ranging from reminders and to-do lists to notes about how I generated the data, or even what the estimation method for a particular variable was.
I am now trying to replicate similar functionality in Python 3.6. So far I have looked online and consulted a number of posts, which, however, do not exactly address what I want to do.
A few reference posts include:
What is the difference between save a pandas dataframe to pickle and to csv?
What is the fastest way to upload a big csv file in notebook to work with python pandas?
For a small NumPy array, I have concluded that a combination of the function numpy.savez() and a dictionary can adequately store all relevant information in a single file.
For example:
import numpy as np

a = np.array([[2, 4], [6, 8], [10, 12]])
d = {"first": 1, "second": "two", "third": 3}
np.savez('whatever_name.npz', a=a, d=d)

data = np.load('whatever_name.npz', allow_pickle=True)  # the dict is stored as a pickled object array
arr = data['a']
dic = data['d'].tolist()
However, the question remains: are there better ways to potentially incorporate other pieces of information in a file containing a NumPy array or a (large) Pandas DataFrame?
I am particularly interested in hearing about the pros and cons of any suggestions you may have, with examples. The fewer dependencies, the better.
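For what it's worth, here is a minimal low-dependency sketch of the savez() idea above (filename and variable names are my own): encoding the metadata as a JSON string instead of a raw dict avoids pickled object arrays entirely, so the file loads without allow_pickle=True.

```python
import json
import numpy as np

a = np.array([[2, 4], [6, 8], [10, 12]])
meta = {"first": 1, "second": "two", "third": 3}

# Store the metadata as a JSON string inside the .npz archive;
# no pickling is needed, so loading stays safe and portable.
np.savez('with_meta.npz', a=a, meta=json.dumps(meta))

data = np.load('with_meta.npz')
arr = data['a']                       # the numeric payload
dic = json.loads(data['meta'].item())  # the metadata, back as a dict
```

The trade-off is that JSON only handles basic types (strings, numbers, lists, dicts), which is usually enough for notes and reminders.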
There are many options. I will discuss only HDF5, because I have experience using this format.
Advantages: portable (can be read outside of Python), native compression, out-of-memory capabilities, metadata support.
Disadvantages: reliance on a single low-level C API, risk of corrupting the entire file through a single bad write, and deleting datasets does not automatically reduce file size.
In my experience, for performance and portability, avoid PyTables / HDFStore for storing numeric data. You can instead use the intuitive interface provided by h5py.
Store an array
import h5py, numpy as np

arr = np.random.randint(0, 10, (1000, 1000))

f = h5py.File('file.h5', 'w', libver='latest')  # use 'latest' for performance
dset = f.create_dataset('array', data=arr, chunks=(100, 100),
                        compression='gzip', compression_opts=9)
Compression & chunking
There are many compression filters available; for example, blosc and lzf are good choices for compression and decompression performance, respectively. Note that gzip is native; other compression filters may not ship by default with your HDF5 installation.
Chunking is another option which, when aligned with how you read data out-of-memory, can significantly improve performance.
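To sketch that point (the filename and sizes are my own choices): a read that lines up with the chunk layout only needs to decompress the chunks it actually touches.

```python
import h5py
import numpy as np

arr = np.random.randint(0, 10, (1000, 1000))

with h5py.File('chunked.h5', 'w') as f:
    # lzf ships with h5py and favors fast decompression
    f.create_dataset('array', data=arr, chunks=(100, 100), compression='lzf')

with h5py.File('chunked.h5', 'r') as f:
    dset = f['array']           # lazy handle; nothing is read yet
    block = dset[0:100, 0:100]  # aligned with exactly one 100x100 chunk on disk
```

A misaligned slice such as dset[50:150, 50:150] would instead touch four chunks, which is why matching chunk shape to your access pattern matters.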
Add some attributes
dset.attrs['Description'] = 'Some text snippet'
dset.attrs['RowIndexArray'] = np.arange(1000)
Store a dictionary
d = {'my_key': np.arange(10), 'second': 3.14}  # example dictionary (hypothetical values)
for k, v in d.items():
    f.create_dataset('dictgroup/' + str(k), data=v)
Out-of-memory access
dictionary = f['dictgroup']
res = dictionary['my_key']  # an h5py Dataset: the data stays on disk until sliced
There is no substitute for reading the h5py documentation, which exposes most of the C API, but you should see from the above that there is a significant amount of flexibility.
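Since the question also mentions Pandas: one way to keep a DataFrame and its text snippets together (my own sketch, not an established API — names and the note text are made up) is to store each column as a dataset in a group and attach the notes as attributes on that group.

```python
import h5py
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': np.arange(5), 'y': np.arange(5) * 2.0})

with h5py.File('frame.h5', 'w') as f:
    g = f.create_group('frame')
    for col in df.columns:
        g.create_dataset(col, data=df[col].to_numpy())
    # the Stata-style note travels with the data
    g.attrs['Description'] = 'How this data was generated...'

with h5py.File('frame.h5', 'r') as f:
    g = f['frame']
    note = g.attrs['Description']
    df2 = pd.DataFrame({name: g[name][()] for name in g})
```

This sketch only handles numeric columns and drops the index; richer DataFrames (strings, categoricals, MultiIndex) need more care, which is where HDFStore earns its complexity.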