Spark 2.0 memory fraction

syl · Sep 23, 2016

I am working with Spark 2.0; the job starts by sorting the input data and storing its output on HDFS.

I was getting out-of-memory errors, and the solution was to increase the value of "spark.shuffle.memoryFraction" from 0.2 to 0.8. This solved the problem, but in the documentation I found that this parameter is deprecated.
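
For context, the setting was applied along these lines (a simplified sketch; the app name, paths, and job skeleton are placeholders, not the actual code):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Simplified sketch of bumping the deprecated knob from 0.2 to 0.8.
// App name, input/output paths and the job body are placeholders.
val conf = new SparkConf()
  .setAppName("sort-and-write-to-hdfs")
  .set("spark.shuffle.memoryFraction", "0.8") // deprecated since Spark 1.6
val sc = new SparkContext(conf)

sc.textFile("hdfs:///input/path")        // placeholder path
  .sortBy(identity)                      // the sort step
  .saveAsTextFile("hdfs:///output/path") // placeholder path
```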

As I understand it, it was replaced by "spark.memory.fraction". How should I modify this parameter, taking into account the sort and the storage on HDFS?

Answer

gsamaras · Sep 23, 2016

From the documentation:

Although there are two relevant configurations, the typical user should not need to adjust them as the default values are applicable to most workloads:

  • spark.memory.fraction expresses the size of M as a fraction of the (JVM heap space - 300MB) (default 0.6). The rest of the space (40%) is reserved for user data structures, internal metadata in Spark, and safeguarding against OOM errors in the case of sparse and unusually large records.
  • spark.memory.storageFraction expresses the size of R as a fraction of M (default 0.5). R is the storage space within M where cached blocks are immune to being evicted by execution.

The value of spark.memory.fraction should be set in order to fit this amount of heap space comfortably within the JVM’s old or “tenured” generation. Otherwise, when much of this space is used for caching and execution, the tenured generation will be full, which causes the JVM to significantly increase time spent in garbage collection.
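
To make the arithmetic behind M and R concrete, here is an illustrative calculation for a hypothetical 4 GB executor heap with the default fractions quoted above (numbers chosen only for the example):

```scala
// Illustrative arithmetic for a hypothetical 4 GB (4096 MB) executor heap,
// using the default fractions from the documentation excerpt above.
val heapMb          = 4096.0
val reservedMb      = 300.0  // fixed reservation mentioned in the docs
val memoryFraction  = 0.6    // spark.memory.fraction
val storageFraction = 0.5    // spark.memory.storageFraction

val m = memoryFraction * (heapMb - reservedMb) // M ≈ 2277.6 MB shared by execution + storage
val r = storageFraction * m                    // R ≈ 1138.8 MB of storage protected from eviction
```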

In your case, I would modify spark.memory.storageFraction.
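
Since the job is dominated by a sort (execution memory) and the output goes to HDFS rather than the block cache, one direction to try, sketched below with an illustrative value rather than a tuned one, is to shrink the storage share and leave spark.memory.fraction at its default:

```scala
import org.apache.spark.sql.SparkSession

// Sketch only: shift more of the unified region M towards execution (shuffle/sort)
// by shrinking R. 0.3 is an illustrative value, not a recommendation.
val spark = SparkSession.builder()
  .appName("sort-and-write-to-hdfs")              // placeholder app name
  .config("spark.memory.storageFraction", "0.3")
  .getOrCreate()
```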


As a side note, are you sure that you understand how your job behaves?

It's typical to fine-tune a job by starting with memoryOverhead, the number of cores, and so on, and only then moving on to the attribute you modified.
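
For example, these are the kinds of knobs usually settled first (values are placeholders, and the overhead key below is the Spark 2.0 YARN-specific name, which assumes a YARN deployment that the question does not state):

```scala
import org.apache.spark.SparkConf

// Placeholder values only; spark.yarn.executor.memoryOverhead is specified in MB
// and applies to YARN deployments in Spark 2.0.
val conf = new SparkConf()
  .set("spark.executor.memory", "4g")
  .set("spark.executor.cores", "4")
  .set("spark.yarn.executor.memoryOverhead", "768")
```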