Convert large csv to hdf5

jmilloy picture jmilloy · Nov 29, 2014 · Viewed 26.4k times · Source

I have a 100M line csv file (actually many separate csv files) totaling 84GB. I need to convert it to a HDF5 file with a single float dataset. I used h5py in testing without any problems, but now I can't do the final dataset without running out of memory.

How can I write to HDF5 without having to store the whole dataset in memory? I'm expecting actual code here, because it should be quite simple.

I was just looking into pytables, but it doesn't look like the array class (which corresponds to a HDF5 dataset) can be written to iteratively. Similarly, pandas has read_csv and to_hdf methods in its io_tools, but I can't load the whole dataset at one time so that won't work. Perhaps you can help me solve the problem correctly with other tools in pytables or pandas.

Answer

unutbu picture unutbu · Nov 29, 2014

Use append=True in the call to to_hdf:

import numpy as np
import pandas as pd

filename = '/tmp/test.h5'

df = pd.DataFrame(np.arange(10).reshape((5,2)), columns=['A', 'B'])
print(df)
#    A  B
# 0  0  1
# 1  2  3
# 2  4  5
# 3  6  7
# 4  8  9

# Save to HDF5
df.to_hdf(filename, 'data', mode='w', format='table')
del df    # allow df to be garbage collected

# Append more data
df2 = pd.DataFrame(np.arange(10).reshape((5,2))*10, columns=['A', 'B'])
df2.to_hdf(filename, 'data', append=True)

print(pd.read_hdf(filename, 'data'))

yields

    A   B
0   0   1
1   2   3
2   4   5
3   6   7
4   8   9
0   0  10
1  20  30
2  40  50
3  60  70
4  80  90

Note that you need to use format='table' in the first call to df.to_hdf to make the table appendable. Otherwise, the format is 'fixed' by default, which is faster for reading and writing, but creates a table which can not be appended to.

Thus, you can process each CSV one at a time, use append=True to build the hdf5 file. Then overwrite the DataFrame or use del df to allow the old DataFrame to be garbage collected.


Alternatively, instead of calling df.to_hdf, you could append to a HDFStore:

import numpy as np
import pandas as pd

filename = '/tmp/test.h5'
store = pd.HDFStore(filename)

for i in range(2):
    df = pd.DataFrame(np.arange(10).reshape((5,2)) * 10**i, columns=['A', 'B'])
    store.append('data', df)

store.close()

store = pd.HDFStore(filename)
data = store['data']
print(data)
store.close()

yields

    A   B
0   0   1
1   2   3
2   4   5
3   6   7
4   8   9
0   0  10
1  20  30
2  40  50
3  60  70
4  80  90