Unzip a large ZIP file on Amazon S3

Alex Hall · Sep 21, 2015 · Viewed 21.3k times

I'm working at a company that processes very large CSV files. Clients upload the file to Amazon S3 via filepicker. Then multiple server processes can read the file in parallel (i.e. starting from different points) to process it and store it in a database. Optionally the clients may zip the file before uploading.

  1. Am I correct that the ZIP format does not allow decompression of a single file in parallel? That is, there is no way to have multiple processes read the ZIP file from different offsets (maybe with some overlap between blocks) and stream uncompressed data from there?
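For what it's worth, the ZIP format does allow parallelism *across members*: the central directory at the end of the archive records each member's offset, so separate processes can seek straight to different members. What can't be split is a single member, because each one is one DEFLATE stream that must be decompressed front to back. A minimal stdlib sketch of the per-member parallelism (the two CSV members here are made up for illustration):

```python
import io
import zipfile
from concurrent.futures import ThreadPoolExecutor

# Build a small in-memory ZIP with two members to illustrate.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("a.csv", "col1,col2\n1,2\n")
    zf.writestr("b.csv", "col1,col2\n3,4\n")

def read_member(name):
    # Each worker opens its own handle: the central directory lets it
    # seek straight to one member, but that member's DEFLATE stream
    # can only be decompressed sequentially.
    with zipfile.ZipFile(io.BytesIO(buf.getvalue())) as zf:
        return zf.read(name).decode()

with ThreadPoolExecutor() as pool:
    results = list(pool.map(read_member, ["a.csv", "b.csv"]))
```

So if clients zip one giant CSV as a single member, you get no parallelism inside it; if the archive held many members, workers could each take one.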

If I am correct, then I want a way to take the ZIP file on S3 and produce an unzipped CSV, also on S3.

  1. Does Amazon provide any services that can perform this task simply? I was hoping that Data Pipeline could do the job, but it seems to have limitations. For example "CopyActivity does not support copying multipart Amazon S3 files" (source) seems to suggest that I can't unzip anything larger than 5GB using that. My understanding of Data Pipeline is very limited so I don't know how suitable it is for this task or where I would look.
  2. Is there any SaaS that does the job?

I can write code to download, unzip, and multipart upload the file back to S3, but I was hoping for an efficient, easily scalable solution. AWS Lambda would have been ideal for running the code (to avoid provisioning unneeded resources) but execution time is limited to 60 seconds. Plus the use case seems so simple and generic I expect to find an existing solution.
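The download-unzip-reupload path described above can be sketched with boto3's multipart upload API, streaming the decompressed CSV back to S3 in parts so the output can exceed the 5 GB single-PUT limit. This is a sketch, not a production implementation: the bucket and key names are placeholders, boto3 is assumed to be installed, and error handling (e.g. aborting the multipart upload on failure) is omitted.

```python
import tempfile
import zipfile

PART_SIZE = 8 * 1024 * 1024  # multipart parts must be >= 5 MB (except the last)

def unzip_s3_object(bucket, zip_key, csv_key, member=None):
    """Download a ZIP from S3, stream-decompress one member, and
    multipart-upload the resulting CSV back to S3.

    bucket/zip_key/csv_key are hypothetical names, not a fixed API.
    """
    import boto3  # assumed available; imported lazily so the sketch loads without it
    s3 = boto3.client("s3")
    # ZIP needs random access to its central directory, so spool to disk first.
    with tempfile.TemporaryFile() as tmp:
        s3.download_fileobj(bucket, zip_key, tmp)
        tmp.seek(0)
        with zipfile.ZipFile(tmp) as zf:
            name = member or zf.namelist()[0]
            upload = s3.create_multipart_upload(Bucket=bucket, Key=csv_key)
            parts, part_no = [], 1
            with zf.open(name) as stream:
                while True:
                    chunk = stream.read(PART_SIZE)
                    if not chunk:
                        break
                    resp = s3.upload_part(
                        Bucket=bucket, Key=csv_key,
                        UploadId=upload["UploadId"],
                        PartNumber=part_no, Body=chunk,
                    )
                    parts.append({"PartNumber": part_no, "ETag": resp["ETag"]})
                    part_no += 1
            s3.complete_multipart_upload(
                Bucket=bucket, Key=csv_key, UploadId=upload["UploadId"],
                MultipartUpload={"Parts": parts},
            )
```

Nothing here runs in parallel, but because the upload is streamed part by part, memory use stays bounded regardless of how large the unzipped CSV is.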

Answer

E.J. Brennan · Sep 21, 2015

Your best bet is probably to have an S3 event notification sent to an SQS queue every time a zip file is uploaded to S3, and to have one or more EC2 instances polling the queue, waiting for files to unzip.

You may only need one running instance to do this, but you could also have an autoscaling policy that spins up more instances if the SQS queue grows too long for a single instance to keep up with the unzipping (as defined by you).
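The polling side of that setup can be sketched roughly like this, assuming boto3 and the standard S3-event JSON that S3 puts on the queue. The queue URL and the `handle_zip` callback are placeholders; a real worker would add error handling and a visibility-timeout long enough to cover the unzip.

```python
import json

def extract_s3_records(body):
    """Pull (bucket, key) pairs out of an S3 event notification body."""
    event = json.loads(body)
    return [
        (r["s3"]["bucket"]["name"], r["s3"]["object"]["key"])
        for r in event.get("Records", [])
    ]

def poll_and_unzip(queue_url, handle_zip):
    """Long-poll an SQS queue and hand each uploaded ZIP's (bucket, key)
    to handle_zip. queue_url and handle_zip are hypothetical."""
    import boto3  # assumed available; imported lazily so the sketch loads without it
    sqs = boto3.client("sqs")
    while True:
        resp = sqs.receive_message(
            QueueUrl=queue_url, MaxNumberOfMessages=1, WaitTimeSeconds=20
        )
        for msg in resp.get("Messages", []):
            for bucket, key in extract_s3_records(msg["Body"]):
                handle_zip(bucket, key)  # e.g. download, unzip, re-upload
            # Delete only after successful processing, so a crash
            # lets SQS redeliver the message to another worker.
            sqs.delete_message(
                QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"]
            )
```

Deleting the message only after processing is what makes the autoscaling story work: a worker that dies mid-unzip simply lets the message reappear for another instance.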