We all know Spark does the computation in memory. I am just curious on followings.
If I create 10 RDD
in my pySpark shell from HDFS, does it mean all these 10 RDD
s data will reside on Spark Workers Memory?
If I do not delete RDD
, will it be in memory forever?
If my dataset(file) size exceeds available RAM size, where will data to stored?
If I create 10 RDD in my pySpark shell from HDFS, does it mean all these 10 RDD data will reside on Spark Memory?
Yes, All 10 RDDs data will spread in spark worker machines RAM. but not necessary to all machines must have a partition of each RDD. off course RDD will have data in memory only if any action performed on it as it's lazily evaluated.
If I do not delete RDD, will it be in memory forever?
Spark Automatically unpersist the RDD or Dataframe if they are no longer used. In order to know if an RDD or Dataframe is cached, you can get into the Spark UI -- > Storage table and see the Memory details. You can use df.unpersist()
or sqlContext.uncacheTable("sparktable")
to remove the df
or tables from memory.
link to read more
If my dataset size exceeds available RAM size, where will data to stored?
If the RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly each time, when they're needed. link to read more
If we are saying RDD is already in RAM, meaning it is in memory, what is the need to persist()? --As per comment
To answer your question, when any action triggered on RDD and if that action could not find memory, it can remove uncached/unpersisted RDDs.
In general, we persist RDD which need a lot of computation or/and shuffling (by default spark persist shuffled RDDs to avoid costly network I/O), so that when any action performed on persisted RDD, simply it will perform that action only rather than computing it again from start as per lineage graph, check RDD persistence levels here.