In terms of RDD persistence, what are the differences between `cache()` and `persist()` in Spark?
With `cache()`, you use only the default storage level:

- `MEMORY_ONLY` for RDD
- `MEMORY_AND_DISK` for Dataset

With `persist()`, you can specify which storage level you want for both RDD and Dataset.
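
As a minimal Scala sketch of both calls (the `local[*]` master, app name, and dummy data are assumptions for the example):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object CacheVsPersist {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .master("local[*]")
      .appName("cache-vs-persist")
      .getOrCreate()

    // RDD: cache() is fixed to the default level, MEMORY_ONLY
    val rdd = spark.sparkContext.parallelize(1 to 1000)
    rdd.cache() // equivalent to rdd.persist(StorageLevel.MEMORY_ONLY)

    // persist() accepts any StorageLevel explicitly
    val other = spark.sparkContext.parallelize(1 to 1000)
    other.persist(StorageLevel.MEMORY_AND_DISK) // spills to disk when memory is full

    // Dataset: cache() defaults to MEMORY_AND_DISK instead
    val ds = spark.range(1000)
    ds.cache()

    spark.stop()
  }
}
```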
From the official docs:

- You can mark an RDD to be persisted using the `persist()` or `cache()` methods on it.
- Each persisted RDD can be stored using a different storage level.
- The `cache()` method is a shorthand for using the default storage level, which is `StorageLevel.MEMORY_ONLY` (store deserialized objects in memory).
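
You can check this shorthand yourself with `RDD.getStorageLevel`; a quick spark-shell sketch (where `sc` is the predefined SparkContext):

```scala
import org.apache.spark.storage.StorageLevel

val rdd = sc.parallelize(1 to 100)
rdd.cache()
// cache() assigned the default level for an RDD
assert(rdd.getStorageLevel == StorageLevel.MEMORY_ONLY)
```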
Use `persist()` if you want to assign a storage level other than the default:

- `MEMORY_ONLY` for the RDD
- `MEMORY_AND_DISK` for the Dataset

Interesting link to the official documentation: which storage level to choose
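
For example, a sketch of persisting with a non-default, serialized level and releasing it afterwards (the `data.txt` path is a hypothetical input):

```scala
import org.apache.spark.storage.StorageLevel

val lines = sc.textFile("data.txt")         // hypothetical input file
lines.persist(StorageLevel.MEMORY_ONLY_SER) // serialized: more CPU, less memory
lines.count()                               // first action materializes the cache

lines.unpersist()                           // drop the cached blocks when done
```

Note that an RDD's storage level can only be assigned once; call `unpersist()` first if you need to switch to a different level.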