Spark SQL - difference between gzip vs snappy vs lzo compression formats

Shankar · Mar 4, 2016 · Viewed 26.9k times

I am trying to use Spark SQL to write a Parquet file.

By default Spark SQL uses gzip compression, but it also supports other compression formats such as snappy and lzo.

What is the difference between these compression formats?

Answer

Garren S · May 30, 2017

Use Snappy if you can accept higher disk usage in exchange for the performance benefits (lower CPU usage, and splittable).

When Spark switched from GZIP to Snappy by default, this was the reasoning:

"Based on our tests, gzip decompression is very slow (< 100MB/s), making queries decompression bound. Snappy can decompress at ~ 500MB/s on a single core."
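If you want a rough feel for the gzip figure on your own machine, here is a minimal sketch using only Python's standard library. The function name `gzip_decompress_throughput` and the synthetic CSV-like payload are my own assumptions for illustration. The result depends heavily on hardware and on the data, so it will not necessarily match the numbers quoted above. Snappy is not in the standard library, so it is left out here.

```python
import gzip
import time


def gzip_decompress_throughput(payload: bytes, repeats: int = 5) -> float:
    """Return single-core gzip decompression speed in MB/s of uncompressed data."""
    compressed = gzip.compress(payload)
    best = float("inf")
    restored = b""
    for _ in range(repeats):
        start = time.perf_counter()
        restored = gzip.decompress(compressed)
        # Keep the fastest run to reduce noise from other processes.
        best = min(best, time.perf_counter() - start)
    # Sanity check: decompression must give back the original bytes.
    assert restored == payload
    # Guard against a zero interval on very small payloads.
    return len(payload) / (1024 * 1024) / max(best, 1e-9)


# Hypothetical test data: a few MB of CSV-like rows (an assumption, not real data).
sample = b"".join(
    f"{i},user{i % 1000},{i * 7 % 9973}\n".encode() for i in range(200_000)
)
print(f"gzip decompression: {gzip_decompress_throughput(sample):.0f} MB/s")
```

Treat this as a sanity check only; a real comparison should be run on your actual Parquet data inside Spark.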

Snappy:

  • Storage Space: High
  • CPU Usage: Low
  • Splittable: Yes (1)

GZIP:

  • Storage Space: Medium
  • CPU Usage: Medium
  • Splittable: No
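
To act on the trade-offs above, the codec can be chosen per write. This is a minimal PySpark sketch, assuming Spark 2.0 or later, where the DataFrameWriter accepts a "compression" option. The helper name `write_parquet`, the `THREAD_CODECS` allow-list and the output path are hypothetical. On older versions the session setting `spark.sql.parquet.compression.codec` would be used instead.

```python
# Codecs discussed in this thread. This allow-list is an illustrative
# assumption, not the full set of codecs Spark accepts.
THREAD_CODECS = {"snappy", "gzip", "lzo", "uncompressed"}


def write_parquet(df, path, codec="snappy"):
    """Write a Spark DataFrame as Parquet with the given compression codec."""
    codec = codec.lower()
    if codec not in THREAD_CODECS:
        raise ValueError(f"unexpected codec: {codec!r}")
    # The per-write "compression" option takes precedence over the
    # session-wide spark.sql.parquet.compression.codec setting.
    df.write.option("compression", codec).parquet(path)
    return codec


# Usage, assuming a running SparkSession (hypothetical path):
#   from pyspark.sql import SparkSession
#   spark = SparkSession.builder.getOrCreate()
#   write_parquet(spark.range(1000), "/tmp/demo_snappy", "snappy")
```

Writing the same DataFrame once with "snappy" and once with "gzip" and comparing the output directory sizes is an easy way to check the storage difference on your own data.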

1) http://boristyukin.com/is-snappy-compressed-parquet-file-splittable/