Get inferred dataframe types iteratively using chunksize

Zelazny7 · Mar 21, 2013 · Viewed 7k times

How can I use pd.read_csv() to iteratively chunk through a file and retain the dtype and other meta-information as if I read in the entire dataset at once?

I need to read in a dataset that is too large to fit into memory. I would like to import the file using pd.read_csv and then immediately append the chunk into an HDFStore. However, the data type inference knows nothing about subsequent chunks.

If the first chunk stored in the table contains only int and a subsequent chunk contains a float, an exception will be raised. So I need to first iterate through the dataframe using read_csv and retain the highest inferred type. In addition, for object types, I need to retain the maximum length as these will be stored as strings in the table.

Is there a pandonic way of retaining only this information without reading in the entire dataset?

Answer

Zelazny7 · Mar 21, 2013

I didn't think it would be this intuitive, otherwise I wouldn't have posted the question. But once again, pandas makes things a breeze. I'm keeping the question up, as this information might be useful to others working with large data:

In [1]: chunker = pd.read_csv('DATASET.csv', chunksize=500, header=0)

# Store the dtypes of each chunk into a list and convert it to a dataframe:

In [2]: dtypes = pd.DataFrame([chunk.dtypes for chunk in chunker])

In [3]: dtypes.values[:5]
Out[3]:
array([[int64, int64, int64, object, int64, int64, int64, int64],
       [int64, int64, int64, int64, int64, int64, int64, int64],
       [int64, int64, int64, int64, int64, int64, int64, int64],
       [int64, int64, int64, int64, int64, int64, int64, int64],
       [int64, int64, int64, int64, int64, int64, int64, int64]], dtype=object)

# Very cool that I can take the max of these data types and it will preserve the hierarchy:

In [4]: dtypes.max().values
Out[4]: array([int64, int64, int64, object, int64, int64, int64, int64], dtype=object)
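The "hierarchy" here comes from NumPy's dtype comparison rules, so `max()` promotes `int64` to `object` when any chunk saw strings. As a hedged alternative sketch (not from the original answer), the same combination can be done explicitly with `np.result_type`, which applies NumPy's promotion rules pairwise; the `dtype_rows` frame below is a hypothetical stand-in for the per-chunk `dtypes` DataFrame built above:

```python
import functools

import numpy as np
import pandas as pd

# Hypothetical per-chunk dtype rows, shaped like the dtypes DataFrame above
dtype_rows = pd.DataFrame([
    {"a": np.dtype("int64"), "b": np.dtype("O")},
    {"a": np.dtype("float64"), "b": np.dtype("int64")},
])

# Fold each column's dtypes through NumPy's promotion rules:
# int64 + float64 -> float64, and anything + object -> object
combined = {
    col: functools.reduce(np.result_type, dtype_rows[col])
    for col in dtype_rows.columns
}
```

This also covers the int-then-float case from the question: a column that is `int64` in one chunk and `float64` in another settles on `float64`.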

# I can now store the above into a dictionary:

In [5]: types = dtypes.max().to_dict()

# And pass it into pd.read_csv for the second run:

In [6]: chunker = pd.read_csv('DATASET.csv', dtype=types, chunksize=500)
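The question also asked for the maximum string length of each object column, since HDFStore needs that width (via `min_itemsize`) when appending string columns. A self-contained sketch of the full two-pass approach, using a tiny generated CSV as a stand-in for the real file (the file name, column names, and chunk size are all illustrative):

```python
import pandas as pd

# Write a tiny CSV so the sketch is self-contained (stand-in for the real dataset)
pd.DataFrame({"id": [1, 2, 3], "name": ["a", "bb", "ccc"]}).to_csv(
    "demo_dset.csv", index=False)

# First pass: collect per-chunk dtypes and, for object columns, the longest
# string seen so far (HDFStore stores these as fixed-width strings)
dtypes, max_lens = [], {}
for chunk in pd.read_csv("demo_dset.csv", chunksize=2):
    dtypes.append(chunk.dtypes)
    for col in chunk.select_dtypes(include="object"):
        max_lens[col] = max(max_lens.get(col, 0), int(chunk[col].str.len().max()))

# Highest inferred type per column, as in the answer above
types = pd.DataFrame(dtypes).max().to_dict()

# Second pass: re-read with the settled dtypes; each chunk can now be
# appended to an HDFStore, e.g. store.append('data', chunk, min_itemsize=max_lens)
chunker = pd.read_csv("demo_dset.csv", dtype=types, chunksize=2)
```

Passing `max_lens` as `min_itemsize` sizes every string column to the widest value seen anywhere in the file, so later chunks can no longer raise the width exception described in the question.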