The data is not perfectly clean, but pandas handles it without issue. The pandas library provides many extremely useful functions for EDA.
However, when I profile a large dataset (about 100 million records with 10 columns) read from a database table, profiling never completes and my laptop runs out of memory. The data is around 6 GB as CSV; my machine has 14 GB of RAM, of which roughly 3–4 GB is in use at idle.
import pandas as pd
from pandas_profiling import ProfileReport

df = pd.read_sql_query("select * from table", conn_params)
profile = ProfileReport(df)
profile.to_file(output_file="myoutput.html")
I have also tried setting the check_recoded = False option, but profiling still fails to complete.
Is there a way to read the data in chunks and still generate a single summary report for the whole dataset? Or is there another way to use this function with a large dataset?
v2.4 introduced minimal mode, which disables expensive computations (such as correlations and dynamic binning):
from pandas_profiling import ProfileReport
profile = ProfileReport(df, minimal=True)
profile.to_file(output_file="output.html")
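If even minimal mode runs out of memory, another option is to profile a random sample instead of the full table. A sketch of that idea, using chunked reading so the full dataset never sits in memory at once (the CSV here is a stand-in for your database table; for a database you could pass `chunksize=` to `pd.read_sql_query` instead):

```python
import pandas as pd
from io import StringIO

# Small in-memory CSV standing in for the large table export.
csv_data = StringIO("\n".join(["a,b"] + [f"{i},{i * 2}" for i in range(1000)]))

# Read in chunks and keep a 10% random sample of each chunk,
# so only one chunk plus the accumulated sample is ever in memory.
sample_parts = []
for chunk in pd.read_csv(csv_data, chunksize=100):
    sample_parts.append(chunk.sample(frac=0.1, random_state=0))

df_sample = pd.concat(sample_parts, ignore_index=True)
print(len(df_sample))  # 10% of 1000 rows -> 100

# The sample can then be profiled as usual, e.g.:
# from pandas_profiling import ProfileReport
# ProfileReport(df_sample, minimal=True).to_file(output_file="output.html")
```

The trade-off is that the report describes the sample, not the full data; for 100 million rows, even a 1% sample is usually large enough for distributions and missing-value rates to be representative.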