We can persist an RDD into memory and/or disk when we want to use it more than once. However, do we have to unpersist it ourselves later on, or does Spark do some kind of garbage collection and unpersist the RDD when it is no longer needed? I notice that if I call the unpersist function myself, I get slower performance.
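For concreteness, here is a minimal sketch of the pattern I mean (assuming an existing SparkContext named sc; the file path and filter are made up):

```scala
import org.apache.spark.storage.StorageLevel

// Persist an RDD we intend to reuse across several actions.
val lines = sc.textFile("hdfs:///data/input.txt")
  .persist(StorageLevel.MEMORY_AND_DISK)

val total = lines.count()                                // first action: materializes and caches
val errors = lines.filter(_.contains("ERROR")).count()  // second action: reuses the cache

// Is this explicit call needed, and why does it slow things down?
lines.unpersist()
```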
Yes, Apache Spark will unpersist the RDD when it's garbage collected.
In RDD.persist you can see:

```scala
sc.cleaner.foreach(_.registerRDDForCleanup(this))
```
This puts a WeakReference to the RDD in a ReferenceQueue, which leads to ContextCleaner.doCleanupRDD when the RDD is garbage collected. There you find:

```scala
sc.unpersistRDD(rddId, blocking)
```
For more context see ContextCleaner in general and the commit that added it.
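To make the mechanism concrete, here is a simplified sketch of the general WeakReference/ReferenceQueue pattern that ContextCleaner uses. This is illustrative only, not Spark's actual code; ToyCleaner and its members are invented names, and it assumes Scala 2.12+ for the Runnable lambda:

```scala
import java.lang.ref.{ReferenceQueue, WeakReference}

// Toy version of the cleanup pattern: hold only weak references to
// registered objects, and poll a ReferenceQueue to learn when the JVM
// has garbage collected them.
object ToyCleaner {
  private val queue = new ReferenceQueue[AnyRef]

  // Pair each weak reference with the id needed for cleanup, because the
  // referent itself is already gone by the time the reference is dequeued.
  private class CleanupRef(obj: AnyRef, val rddId: Int)
    extends WeakReference[AnyRef](obj, queue)

  // Keep the CleanupRef wrappers strongly reachable until they are enqueued.
  private var refs = Set.empty[CleanupRef]

  def registerRDDForCleanup(rdd: AnyRef, rddId: Int): Unit =
    synchronized { refs += new CleanupRef(rdd, rddId) }

  // Daemon thread: block until a reference is enqueued, then clean up.
  private val cleanerThread = new Thread(() => {
    while (true) {
      val ref = queue.remove().asInstanceOf[CleanupRef]
      synchronized { refs -= ref }
      println(s"RDD ${ref.rddId} was garbage collected; unpersisting it")
      // In Spark, this is roughly where sc.unpersistRDD(rddId, blocking) runs.
    }
  })
  cleanerThread.setDaemon(true)
  cleanerThread.start()
}
```

Because this cleanup only runs after the JVM's garbage collector notices that the RDD reference is gone, the timing is nondeterministic.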
A few things to be aware of when relying on garbage collection for unpersisting RDDs: