How to estimate a DataFrame's real size in PySpark?

TheSilence · May 6, 2016 · Viewed 31.7k times

How do you determine the size of a DataFrame?

Right now I estimate the real size of a dataframe as follows:

# rough estimate: length of the column names plus the string length of every value
headers_size = sum(len(key) for key in df.first().asDict())
rows_size = df.rdd.map(lambda row: sum(len(str(value)) for value in row.asDict().values())).sum()
total_size = headers_size + rows_size

It is too slow and I'm looking for a better way.

Answer

Kiran Thati · Aug 12, 2016

Currently I am using the approach below, but I'm not sure whether it is the best way:

from pyspark import StorageLevel
df.persist(StorageLevel.MEMORY_ONLY)  # cache the DataFrame in memory
df.count()  # run an action so the cache is actually materialized

On the Spark web UI, under the Storage tab, you can check the size, which is displayed in MB. Then I call unpersist to clear the memory:

df.unpersist()
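
If you would rather read that size programmatically instead of from the web UI, something along these lines should work after the persist/count above (and before the unpersist). This is only a sketch: it assumes `spark` is your SparkSession, getRDDStorageInfo is marked as a developer API, and _jsc / sc() are PySpark internals, so the exact accessors may differ between versions.

# assumes df has already been persisted and materialized as shown above
jvm_sc = spark.sparkContext._jsc.sc()        # underlying Scala SparkContext (internal accessor)
for rdd_info in jvm_sc.getRDDStorageInfo():  # developer API listing cached RDDs/DataFrames
    print(rdd_info.name(),
          rdd_info.memSize(), "bytes in memory,",
          rdd_info.diskSize(), "bytes on disk")

The memSize / diskSize values should correspond to what the Storage tab shows, i.e. the cached footprint of the DataFrame rather than a byte-by-byte count of its contents.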