What is the difference between cache and persist?

Ramana picture Ramana · Nov 11, 2014 · Viewed 105.8k times · Source

In terms of RDD persistence, what are the differences between cache() and persist() in spark ?

Answer

ahars picture ahars · Nov 12, 2014

With cache(), you use only the default storage level :

  • MEMORY_ONLY for RDD
  • MEMORY_AND_DISK for Dataset

With persist(), you can specify which storage level you want for both RDD and Dataset.

From the official docs:

  • You can mark an RDD to be persisted using the persist() or cache() methods on it.
  • each persisted RDD can be stored using a different storage level
  • The cache() method is a shorthand for using the default storage level, which is StorageLevel.MEMORY_ONLY (store deserialized objects in memory).

Use persist() if you want to assign a storage level other than :

  • MEMORY_ONLY to the RDD
  • or MEMORY_AND_DISK for Dataset

Interesting link for the official documentation : which storage level to choose