I have a 20 GB (compressed) .csv file, and I load a couple of columns from it with pandas pd.read_csv()
using a chunksize=10000 parameter.
However, this value is completely arbitrary, and I wonder whether a simple formula could give me a better chunksize that would speed up loading the data.
Any ideas?
There is no "optimal chunksize" [*]. chunksize
only tells you the number of rows per chunk, not the memory size of a single row, so it's meaningless to try to make a rule of thumb based on it alone. ([*] although generally I've only ever seen chunksizes in the range 100..64K)
To reason about memory, you'd have to convert that to a memory size per chunk or per row,
by looking at your number of columns, their dtypes, and the size of each; use df.info(memory_usage='deep')
, or else, for more in-depth memory usage by column:
    print('df memory usage by column (bytes per row):')
    print(df.memory_usage(index=False, deep=True) / df.shape[0])
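If you want to turn that per-row figure into an actual row count, one rough approach (my sketch, not part of the original answer) is to measure a small sample of the file and divide a memory budget by the bytes per row. The file name, column names, sample size and 500 MB budget below are all placeholder assumptions:

    import pandas as pd

    # Hypothetical file and column names; adjust to your data.
    sample = pd.read_csv('data.csv.gz', usecols=['col_a', 'col_b'], nrows=10000)
    bytes_per_row = sample.memory_usage(index=False, deep=True).sum() / len(sample)

    memory_budget = 500 * 1024**2              # assumed budget: ~500 MB per chunk
    chunksize = int(memory_budget // bytes_per_row)
    print('~%d bytes/row -> chunksize %d' % (bytes_per_row, chunksize))

This only gives a ballpark figure, since the sample rows may not be representative of the whole file.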
Make sure you're not blowing through all your free memory while reading the CSV: use your OS's tools (Unix top,
Windows Task Manager, macOS Activity Monitor, etc.) to see how much memory is being used.
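If you'd rather check from inside the script itself, here's a rough sketch using the third-party psutil package (my assumption, not part of the original answer; file and column names are placeholders):

    import pandas as pd
    import psutil  # third-party: pip install psutil

    proc = psutil.Process()
    for chunk in pd.read_csv('data.csv.gz', usecols=['col_a', 'col_b'],
                             chunksize=100000):
        # ... process the chunk ...
        rss_mb = proc.memory_info().rss / 1024**2
        free_mb = psutil.virtual_memory().available / 1024**2
        print('process RSS: %.0f MB, available RAM: %.0f MB' % (rss_mb, free_mb))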
One pitfall with pandas is that missing/NaN values, Python strs and objects take 32 or 48 bytes, instead of the expected 4 bytes for an np.int32 column or 1 byte for an np.int8 column. Even one NaN value in a column causes that memory blowup for the entire column, and the pandas.read_csv() dtype, converters and na_values
arguments will not prevent the np.nan, and will ignore the desired dtype(!). A workaround is to manually post-process each chunk before inserting it in the dataframe, as sketched below.
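A minimal sketch of that per-chunk post-processing, assuming a single integer column whose NaNs can be replaced by a sentinel value (the column name, fill value and target dtype are illustrative, not from the question):

    import numpy as np
    import pandas as pd

    chunks = []
    for chunk in pd.read_csv('data.csv.gz', usecols=['small_int_col'],
                             chunksize=100000):
        # NaNs force the column to float64/object; fill them, then downcast.
        chunk['small_int_col'] = chunk['small_int_col'].fillna(-1).astype(np.int8)
        chunks.append(chunk)

    df = pd.concat(chunks, ignore_index=True)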
And use all the standard pandas read_csv
tricks (both are combined in the sketch below), like:

- dtype
  for each column to reduce memory usage - absolutely avoid every entry being read as a string, especially long unique strings like datetimes, which is terrible for memory usage
- usecols
  if you only want to keep a subset of columns
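A sketch combining both tricks (the column names and dtypes are made up for illustration):

    import numpy as np
    import pandas as pd

    reader = pd.read_csv(
        'data.csv.gz',
        usecols=['user_id', 'score', 'category'],   # keep only the columns you need
        dtype={'user_id': np.int32,
               'score': np.float32,
               'category': 'category'},             # avoid plain object strings
        chunksize=100000,
    )
    for chunk in reader:
        pass  # process each chunk here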