I am trying to use Spark SQL to write a Parquet file.
By default Spark SQL supports gzip, but it also supports other compression formats like snappy and lzo.
What is the difference between these compression formats?
Use Snappy if you can handle the higher disk usage in exchange for its performance benefits (lower CPU cost, and it is splittable).
When Spark switched from GZIP to Snappy by default, this was the reasoning:

"Based on our tests, gzip decompression is very slow (< 100 MB/s), making queries decompression-bound. Snappy can decompress at ~500 MB/s on a single core."
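The ratio-versus-CPU tradeoff behind that reasoning is easy to observe even within gzip's own DEFLATE algorithm, using only Python's standard library (snappy itself has no stdlib binding, so this is an analogy, not a Spark benchmark):

```python
import time
import zlib

# Illustrative, highly compressible payload (~1 MB); not real Parquet data
data = b"spark sql parquet compression benchmark " * 25_000

for level in (1, 6, 9):
    t0 = time.perf_counter()
    packed = zlib.compress(data, level)
    elapsed = time.perf_counter() - t0
    # Higher levels spend more CPU time to produce smaller output
    print(f"level {level}: {len(packed):>8} bytes in {elapsed * 1000:.2f} ms")

# Decompression round-trips regardless of the level used to compress
assert zlib.decompress(zlib.compress(data, 1)) == data
```

Snappy sits at the far "fast, larger output" end of this spectrum, which is why it wins for decompression-bound query workloads.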
Snappy: higher storage space, low CPU usage, splittable (see 1 below on whether a Snappy-compressed Parquet file is splittable).

GZIP: smaller output, higher CPU usage, and a raw gzip stream is not splittable on its own.
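If you want to pick the codec explicitly rather than rely on the default, the relevant setting is `spark.sql.parquet.compression.codec`; a minimal config fragment, assuming Spark 2.x:

```
# spark-defaults.conf (or set at runtime via spark.conf.set)
# Valid values in Spark 2.x include: uncompressed, snappy, gzip, lzo
spark.sql.parquet.compression.codec snappy
```

The same choice can also be made per write with `df.write.option("compression", "snappy").parquet(path)`, which overrides the session-level setting for that write.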
1) http://boristyukin.com/is-snappy-compressed-parquet-file-splittable/