I'm reading a sample CSV file and storing it in an .h5 (HDF5) store. The .csv is structured as follows:
User_ID;Longitude;Latitude;Year;Month;String
267261661;-3.86580025;40.32170825;2013;12;hello world
171255468;-3.83879575;40.05035005;2013;12;hello world
343588169;-3.70759531;40.4055946;2014;2;hello world
908779052;-3.8356385;40.1249459;2013;8;hello world
289540518;-3.6723114;40.3801642;2013;11;hello world
635876313;-3.8323166;40.3379393;2012;10;hello world
175160914;-3.53687933;40.35101274;2013;12;hello world
155029860;-3.68555076;40.47688417;2013;11;hello world
I write it to an .h5 store with pandas' to_hdf, intending to pass only a couple of columns to the .h5 file:
import pandas as pd
df = pd.read_csv(filename + '.csv', sep=';')
df.to_hdf('test.h5','key1',format='table',data_columns=['User_ID','Year'])
I get different results for the columns stored in the .h5 file depending on whether I use HDFStore or read_hdf. In particular:
store = pd.HDFStore('test.h5')
>>> store
<class 'pandas.io.pytables.HDFStore'>
File path: /test.h5
/key1 frame_table (typ->appendable,nrows->8,ncols->6,indexers->[index],dc->[User_ID,Year])
which is what I expect (only the 'User_ID' and 'Year' columns listed as data columns), although ncols->6 suggests that all six columns have actually been stored in the .h5 file.
If I try reading the file with pd.read_hdf:
hdf = pd.read_hdf('test.h5','key1')
and asking for the keys:
>>> hdf.keys()
Index([u'User_ID', u'Longitude', u'Latitude', u'Year', u'Month', u'String'], dtype='object')
which is not what I expected, since all the columns of the original .csv file are still in the .h5 store. How can I store only a selection of columns in the .h5 file in order to reduce the size of the database?
Thanks for your help.
Just select out the columns as you write to the file. data_columns does not choose which columns get stored; it only marks columns as queryable with HDFStore.select. To store fewer columns, subset the DataFrame first:
cols_to_keep = ['User_ID', 'Year']
df.loc[:, cols_to_keep].to_hdf(...)
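A minimal end-to-end sketch of the approach above, using a small toy frame with the question's column layout (the values are made up) and a temporary file path. It assumes PyTables is installed, which to_hdf requires:

```python
import os
import tempfile

import pandas as pd

# Toy frame mirroring the question's CSV (hypothetical values)
df = pd.DataFrame({
    'User_ID': [267261661, 171255468, 343588169],
    'Longitude': [-3.86580025, -3.83879575, -3.70759531],
    'Latitude': [40.32170825, 40.05035005, 40.4055946],
    'Year': [2013, 2013, 2014],
    'Month': [12, 12, 2],
    'String': ['hello world'] * 3,
})

cols_to_keep = ['User_ID', 'Year']
path = os.path.join(tempfile.mkdtemp(), 'test.h5')

# Subset BEFORE writing, so only these two columns reach the file;
# data_columns=True additionally makes them queryable via HDFStore.select
df.loc[:, cols_to_keep].to_hdf(path, 'key1', format='table', data_columns=True)

# Reading back now yields just the two columns
hdf = pd.read_hdf(path, 'key1')
print(list(hdf.columns))  # ['User_ID', 'Year']
```

The key point is that the subsetting happens on the DataFrame side (`df.loc[:, cols_to_keep]`), not through any to_hdf argument, so the dropped columns never enter the file and the store shrinks accordingly.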