How to profile large datasets with Pandas profiling?

Viv · May 8, 2019 · Viewed 11.2k times

My data is not perfectly clean, but I can work with it in pandas without issue. The pandas library provides many extremely useful functions for EDA.

But when I run profiling on large data, i.e. 100 million records with 10 columns read from a database table, it does not complete and my laptop runs out of memory. The data is around 6 GB as CSV; my RAM is 14 GB, with idle usage of roughly 3-4 GB.

import pandas as pd
import pandas_profiling

df = pd.read_sql_query("select * from table", conn_params)
profile = pandas_profiling.ProfileReport(df)
profile.to_file(outputfile="myoutput.html")

I have also tried the check_recoded=False option, but it does not help profile the full dataset. Is there any way to read the data in chunks and finally generate the summary report as a whole? Or any other way to use this function with a large dataset?

Answer

Giorgos Myrianthous · Mar 18, 2020

v2.4 introduced the minimal mode that disables expensive computations (such as correlations and dynamic binning):

from pandas_profiling import ProfileReport

profile = ProfileReport(df, minimal=True)
profile.to_file(output_file="output.html")
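If even minimal mode is too heavy, a common workaround (not part of pandas-profiling itself) is to profile a representative sample instead of the full table: read the query result in chunks via the `chunksize` parameter of `pd.read_sql_query`, keep a random fraction of each chunk, and profile the concatenated sample. A minimal sketch, using an in-memory SQLite table named `table_name` as a stand-in for the real database:

```python
import sqlite3

import pandas as pd

# Hypothetical setup: a small SQLite table standing in for the real DB table.
conn = sqlite3.connect(":memory:")
pd.DataFrame({"a": range(1000), "b": range(1000)}).to_sql(
    "table_name", conn, index=False
)

# Read the table in chunks so the full result set never sits in memory
# at once, and keep a 10% random sample of each chunk.
chunks = pd.read_sql_query("select * from table_name", conn, chunksize=100)
sample = pd.concat(
    chunk.sample(frac=0.1, random_state=0) for chunk in chunks
)

print(len(sample))  # 10 chunks x 10 sampled rows = 100 rows

# The small sample can then be profiled cheaply, e.g.:
# from pandas_profiling import ProfileReport
# ProfileReport(sample, minimal=True).to_file(output_file="output.html")
```

Note that a sampled profile describes the sample, not the full data: exact counts and extreme values (min/max, rare categories) may differ from the full 100 million rows, so this is best for getting a quick feel for distributions and missingness.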