Drop spark dataframe from cache

ankit patel · Aug 26, 2015

I am using Spark 1.3.0 with the Python API. While transforming huge dataframes, I cache many DFs for faster execution:

df1.cache()
df2.cache()

Once a certain dataframe is no longer needed, how can I drop it from memory (i.e., un-cache it)?

For example, df1 is used throughout the code, while df2 is only needed for a few transformations and never used afterwards. I want to forcefully drop df2 to release more memory.

Answer

Alexander · Aug 26, 2015

Just do the following:

df1.unpersist()
df2.unpersist()

Spark automatically monitors cache usage on each node and drops out old data partitions in a least-recently-used (LRU) fashion. If you would like to manually remove an RDD instead of waiting for it to fall out of the cache, use the RDD.unpersist() method. DataFrame.unpersist() behaves the same way, since a DataFrame is backed by an RDD.
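
For completeness, here is a minimal sketch of the whole pattern in PySpark. The data and column names are made up for illustration; the blocking flag on unpersist() and the is_cached attribute are part of the standard PySpark DataFrame API, though the default for blocking has varied across Spark versions:

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="uncache-example")
sqlContext = SQLContext(sc)  # DataFrame entry point in the Spark 1.3 era

# Toy data, just to make the example self-contained
df1 = sqlContext.createDataFrame([(1, "a"), (2, "b")], ["id", "val"])
df2 = sqlContext.createDataFrame([(1, 10), (2, 20)], ["id", "score"])

df1.cache()
df2.cache()

# An action forces both cached DataFrames to be materialized
df1.join(df2, df1.id == df2.id).count()

# df2 is no longer needed; free its cached partitions now.
# blocking=True waits until the blocks are actually removed.
df2.unpersist(blocking=True)

print(df2.is_cached)  # False -- df1 remains cached for later use

Note that unpersist() only marks the cached blocks for removal; with blocking=False (non-blocking) the call returns immediately and the memory is reclaimed asynchronously.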