So, Spark has a defaults file, spark-defaults.xml, for specifying settings, including which compression codec is to be used and at which stage (RDD, shuffle). Most of these settings can also be set at the application level:
from pyspark import SparkConf, SparkContext

conf = SparkConf()
conf.set("spark.hadoop.mapred.output.compress", "true")
conf.set("spark.hadoop.mapred.output.compression.codec",
         "org.apache.hadoop.io.compress.SnappyCodec")
sc = SparkContext(conf=conf)
How can I use spark-defaults.xml to tell Spark to use a particular codec to compress only Spark's outputs?
Option 1:
spark.hadoop.mapred.output.compress true
spark.hadoop.mapred.output.compression.codec snappy
Option 2:
spark.mapreduce.output.fileoutputformat.compress true
spark.mapreduce.output.fileoutputformat.compress.codec snappy
Option 3:
mapreduce.output.fileoutputformat.compress true
mapreduce.output.fileoutputformat.compress.codec snappy
Does anyone know the proper way to set this (from any of these options, or something similar)? I am running Spark 1.6.1.
You should add these lines to your spark-defaults.conf. Note that Spark's defaults file is a plain list of key-value pairs, not Hadoop-style XML, and that the codec property expects the full codec class name:

spark.hadoop.mapred.output.compress true
spark.hadoop.mapred.output.compression.codec org.apache.hadoop.io.compress.SnappyCodec
This is the same as passing these flags to the spark-submit command:

--conf spark.hadoop.mapred.output.compress=true
--conf spark.hadoop.mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec
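If you only need compression for a single job rather than as a cluster-wide default, PySpark's saveAsTextFile also accepts the codec class directly. A minimal sketch, assuming a working Spark 1.6 installation with the Snappy native libraries available (the output path is a hypothetical placeholder):

```python
# Per-job alternative: pass the codec class to saveAsTextFile so only
# this output is compressed; cluster-wide defaults stay untouched.
from pyspark import SparkConf, SparkContext

sc = SparkContext(conf=SparkConf().setAppName("compress-output"))
rdd = sc.parallelize(["line one", "line two"])

# The output path below is a placeholder; requires Snappy native libs
# on every node that writes output.
rdd.saveAsTextFile(
    "hdfs:///tmp/out-snappy",
    compressionCodecClass="org.apache.hadoop.io.compress.SnappyCodec",
)
sc.stop()
```

The resulting part files should carry a .snappy extension, which is a quick way to confirm the codec was actually applied.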