Here is a sample data frame:
import pandas as pd
NaN = float('nan')
ID = [1, 2, 3, 4, 5, 6, 7]
A = [NaN, NaN, NaN, 0.1, 0.1, 0.1, 0.1]
B = [0.2, NaN, 0.2, 0.2, 0.2, NaN, NaN]
C = [NaN, 0.5, 0.5, NaN, 0.5, 0.5, NaN]
columns = {'A':A, 'B':B, 'C':C}
df = pd.DataFrame(columns, index=ID)
df.index.name = 'ID'
print(df)
      A    B    C
ID
1   NaN  0.2  NaN
2   NaN  NaN  0.5
3   NaN  0.2  0.5
4   0.1  0.2  NaN
5   0.1  0.2  0.5
6   0.1  NaN  0.5
7   0.1  NaN  NaN
I know that pandas has the PyTables-based HDFStore, which is an easy way to efficiently serialize and deserialize a data frame. But those datasets are not easy to load directly with another reader such as h5py or MATLAB. How can I save a data frame using h5py so that I can easily load it back with another HDF5 reader?
Here is my approach to solving this problem. I am hoping either someone else has a better solution or my approach is helpful to others.
First, define a function that makes a numpy structured array (not a record array) from a pandas DataFrame.
import numpy as np

def df_to_sarray(df):
    """
    Convert a pandas DataFrame object to a numpy structured array.
    This is functionally equivalent to but more efficient than
    np.array(df.to_records(index=False))
    :param df: the data frame to convert
    :return: a numpy structured array representation of df
    """
    v = df.values
    cols = df.columns
    # Field names must be str in Python 3; bytes from .encode() are rejected.
    types = [(str(k), df[k].dtype.type) for k in cols]
    dtype = np.dtype(types)
    z = np.zeros(v.shape[0], dtype)
    # Copy the data into the structured array one column at a time.
    for (i, k) in enumerate(z.dtype.names):
        z[k] = v[:, i]
    return z
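To illustrate the distinction between a structured array and a record array mentioned above, here is a minimal sketch using only numpy. The field names and values are made up for the example. A structured array is a plain ndarray whose fields are read by name with square brackets, while a record array is a view of the same data that also allows attribute access.

```python
import numpy as np

# Hypothetical two-field layout, similar in spirit to the data frame above.
dtype = np.dtype([('ID', np.int64), ('A', np.float64)])
sa = np.zeros(3, dtype)
sa['ID'] = [1, 2, 3]
sa['A'] = [np.nan, 0.1, 0.1]

# A structured array is an ordinary ndarray; fields are accessed by key.
print(type(sa), sa.dtype.names)
print(sa['ID'])

# Viewing the same memory as a record array adds attribute access.
rec = sa.view(np.recarray)
print(type(rec), rec.ID)
```

No data is copied by the view, so sa and rec share the same buffer.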
Use reset_index to make a new data frame that includes the index as part of its data, then convert that data frame to a structured array.
sa = df_to_sarray(df.reset_index())
sa
array([(1, nan, 0.2, nan), (2, nan, nan, 0.5), (3, nan, 0.2, 0.5),
       (4, 0.1, 0.2, nan), (5, 0.1, 0.2, 0.5), (6, 0.1, nan, 0.5),
       (7, 0.1, nan, nan)],
      dtype=[('ID', '<i8'), ('A', '<f8'), ('B', '<f8'), ('C', '<f8')])
Save that structured array to an hdf5 file.
import h5py
with h5py.File('mydata.h5', 'w') as hf:
    hf['df'] = sa
Load the dataset back from the h5 file (open it read-only).
with h5py.File('mydata.h5', 'r') as hf:
    sa2 = hf['df'][:]
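Because the structured array is stored as a single compound dataset, another reader can inspect the column names and pull out one column without loading everything. The following is a rough sketch of that idea with h5py. The small array, the temporary path, and the file name example.h5 are stand-ins invented for this example, not part of the steps above.

```python
import os
import tempfile

import h5py
import numpy as np

# Small stand-in for the structured array built from the data frame.
sa = np.zeros(3, dtype=[('ID', np.int64), ('A', np.float64)])
sa['ID'] = [1, 2, 3]
sa['A'] = [0.1, 0.2, 0.3]

# Hypothetical file location, used only for this sketch.
path = os.path.join(tempfile.mkdtemp(), 'example.h5')
with h5py.File(path, 'w') as hf:
    hf['df'] = sa

with h5py.File(path, 'r') as hf:
    dset = hf['df']
    field_names = dset.dtype.names  # column names kept in the compound type
    n_rows = dset.shape[0]          # one row per data frame row
    col_a = dset['A']               # read only the field 'A'

print(field_names, n_rows, col_a)
```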
Extract the ID column and delete it from sa2 (drop_fields lives in numpy.lib.recfunctions).
import numpy.lib.recfunctions as nprec
ID = sa2['ID']
sa2 = nprec.drop_fields(sa2, 'ID')
Make a data frame with index ID from sa2
df2 = pd.DataFrame(sa2, index=ID)
df2.index.name = 'ID'
print(df2)
      A    B    C
ID
1   NaN  0.2  NaN
2   NaN  NaN  0.5
3   NaN  0.2  0.5
4   0.1  0.2  NaN
5   0.1  0.2  0.5
6   0.1  NaN  0.5
7   0.1  NaN  NaN
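As a possible alternative to dropping the ID field by hand, pandas can rebuild the data frame from a structured array in one call with DataFrame.from_records, naming the field to use as the index. This is a sketch under the assumption that the loaded array has an 'ID' field as above. The small array here is a made-up stand-in for sa2.

```python
import numpy as np
import pandas as pd

# Stand-in for the structured array loaded from the HDF5 file.
sa2 = np.zeros(3, dtype=[('ID', np.int64), ('A', np.float64), ('B', np.float64)])
sa2['ID'] = [1, 2, 3]
sa2['A'] = [np.nan, 0.1, 0.1]
sa2['B'] = [0.2, np.nan, 0.2]

# The 'ID' field becomes the index; the remaining fields become columns.
df2 = pd.DataFrame.from_records(sa2, index='ID')
print(df2)
```

The index name is taken from the field name, so there is no need to set df2.index.name afterwards.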