Which setting to use in Spark to specify compression of `Output`?

nikk · Aug 14, 2016 · Viewed 11.9k times

So, Spark has a defaults file for specifying settings, including which compression codec is to be used and at what stage (RDD, shuffle). Most of these settings can also be set at the application level.

EDITED:

conf = SparkConf()
conf.set("spark.hadoop.mapred.output.compress", "true")
conf.set("spark.hadoop.mapred.output.compression.codec", "org.apache.hadoop.io.compress.SnappyCodec")

How can I use the defaults file to tell Spark to use a particular codec to compress Spark outputs only?

Option 1:

spark.hadoop.mapred.output.compress true
spark.hadoop.mapred.output.compression.codec snappy

Option 2:

spark.mapreduce.output.fileoutputformat.compress true
spark.mapreduce.output.fileoutputformat.compress.codec snappy

Option 3:

mapreduce.output.fileoutputformat.compress true
mapreduce.output.fileoutputformat.compress.codec snappy

Does anyone know the proper way to set this (using any of these options or something similar)? I am running Spark 1.6.1.

Answer

ronhash · May 24, 2017

You should add this to your spark-defaults.xml:

<property>
    <name>spark.hadoop.mapred.output.compress</name>
    <value>true</value>
</property>
<property>
    <name>spark.hadoop.mapred.output.compression.codec</name>
    <value>snappy</value>
</property>
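Note that in a stock Spark installation the defaults file is conf/spark-defaults.conf, which uses whitespace-separated key-value pairs rather than Hadoop-style XML. If you are using that file, the equivalent entries would be:

    spark.hadoop.mapred.output.compress true
    spark.hadoop.mapred.output.compression.codec snappy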

This is the same as passing these flags to the spark-submit command:

--conf spark.hadoop.mapred.output.compress=true
--conf spark.hadoop.mapred.output.compression.codec=snappy