Amazon Athena and compressed S3 files

MattY · Dec 19, 2016

I have an S3 bucket with several zipped CSV files (utilization logs). I'd like to query this data with Athena, but the output is completely garbled.

It appears Athena is trying to parse the zip files without decompressing them first. Is it possible to force Hive to recognize my files as compressed data?

Answer

jens walter · Dec 19, 2016

Athena does support compression, but only in these formats:

  • Snappy (.snappy)
  • BZIP2 (.bz2)
  • GZIP (.gz)

Athena detects these formats by their filename suffix. If the suffix doesn't match the actual compression, the reader does not decode the content. I tested it with a test.csv.gz file and it worked right away. Zip is not among the supported formats, so try recompressing your files with gzip instead and it should work.
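If you have many archives to convert, something like the following could do the recompression locally before you upload the results back to S3. This is only a minimal sketch using Python's standard library: the function name `zip_to_gzip` and the output layout are my own assumptions, it writes one `.gz` file per zip member, and it ignores the S3 download/upload steps entirely.

```python
import gzip
import shutil
import zipfile
from pathlib import Path


def zip_to_gzip(zip_path, out_dir):
    """Re-compress every file inside a zip archive as its own .gz file.

    Hypothetical helper for illustration; returns the list of paths written.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    with zipfile.ZipFile(zip_path) as archive:
        for member in archive.infolist():
            if member.is_dir():
                continue  # skip directory entries, they hold no data
            # Keep the base name and append .gz so the suffix matches
            # the actual compression (e.g. logs.csv -> logs.csv.gz).
            target = out_dir / (Path(member.filename).name + ".gz")
            with archive.open(member) as src, gzip.open(target, "wb") as dst:
                shutil.copyfileobj(src, dst)  # stream, don't load into memory
            written.append(target)
    return written
```

For example, `zip_to_gzip("logs.zip", "converted")` would leave files such as `converted/logs.csv.gz`, which you can then upload to the S3 location your table points at. Note that members with the same base name in different folders of the archive would overwrite each other in this sketch.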