How to extract files from a zip archive in S3

Rpj · Feb 3, 2015 · Viewed 19.8k times

I have a zip archive uploaded to S3 at a certain location (say /foo/bar.zip). I would like to extract the files within bar.zip and place them under /foo, without downloading or re-uploading the extracted files. How can I do this, so that S3 is treated pretty much like a file system?

Answer

DNA · Feb 18, 2015

S3 isn't really designed to allow this; normally you would have to download the file, process it and upload the extracted files.

However, there may be a few options:

  1. You could mount the S3 bucket as a local filesystem using s3fs and FUSE (see article and GitHub site). This still requires the files to be downloaded and uploaded, but it hides these operations behind a filesystem interface.

  2. If your main concern is to avoid downloading data out of AWS to your local machine, then of course you could download the data onto a remote EC2 instance and do the work there, with or without s3fs. This keeps the data within Amazon data centers.

  3. You may be able to perform remote operations on the files, without downloading them onto your local machine, using AWS Lambda.

You would need to create, package, and upload a small node.js program to access, decompress, and re-upload the files. This processing takes place on AWS infrastructure behind the scenes, so you won't need to download any files to your own machine. See the FAQs.
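As an illustration only, here is a minimal sketch of what such a function might look like, assuming the aws-sdk v2 client that ships with the Node.js Lambda runtime and the third-party adm-zip package bundled into the deployment artifact (the event shape is the standard S3 notification; the bucket and key values are whatever S3 passes in):

```javascript
// Sketch: extract a zip held in S3 and write the entries back alongside it.
// Assumes aws-sdk v2 and the third-party adm-zip package are available.
const AWS = require('aws-sdk');
const AdmZip = require('adm-zip');

const s3 = new AWS.S3();

exports.handler = async (event) => {
  // The S3 notification carries the bucket and key of the uploaded object.
  const record = event.Records[0].s3;
  const bucket = record.bucket.name;
  const key = decodeURIComponent(record.object.key.replace(/\+/g, ' ')); // e.g. "foo/bar.zip"

  // Read the whole archive into memory - fine for small zips only.
  const { Body } = await s3.getObject({ Bucket: bucket, Key: key }).promise();

  const zip = new AdmZip(Body);
  const prefix = key.substring(0, key.lastIndexOf('/') + 1); // e.g. "foo/"

  // Upload each file entry next to the archive, preserving relative paths.
  await Promise.all(
    zip.getEntries()
      .filter((entry) => !entry.isDirectory)
      .map((entry) =>
        s3.putObject({
          Bucket: bucket,
          Key: prefix + entry.entryName,
          Body: entry.getData(),
        }).promise()
      )
  );

  return `Extracted entries from ${key}`;
};
```

Note that this sketch reads the whole archive into memory, so it only suits zips that fit comfortably within the function's memory allocation; larger archives would need a streaming unzip library instead.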

Finally, you need to find a way to trigger this code - typically, in Lambda, this would happen automatically when the zip file is uploaded to S3. If the file is already there, you may need to trigger it manually, via the invoke-async command provided by the AWS CLI and API. See the AWS Lambda walkthroughs and API docs.
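For a file that is already in the bucket, a manual trigger might look like the following sketch, which uses the Lambda client from aws-sdk v2 to request an asynchronous invocation (InvocationType 'Event', the modern equivalent of the older invoke-async command) with a hand-built S3-style payload; the function name, bucket, and key here are placeholders:

```javascript
// Sketch: manually fire the extractor for a zip that already exists in S3.
const AWS = require('aws-sdk');
const lambda = new AWS.Lambda();

// Hand-built payload mimicking the S3 notification the function expects.
const payload = {
  Records: [{ s3: { bucket: { name: 'my-bucket' }, object: { key: 'foo/bar.zip' } } }],
};

lambda.invoke({
  FunctionName: 'unzip-s3-archive',  // hypothetical function name
  InvocationType: 'Event',           // asynchronous: fire-and-forget
  Payload: JSON.stringify(payload),
}).promise()
  .then(() => console.log('Invocation queued'))
  .catch(console.error);
```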

However, this is quite an elaborate way of avoiding downloads, and probably only worth it if you need to process large numbers of zip files! Note also that (as of Oct 2018) Lambda functions are limited to a maximum duration of 15 minutes (the default timeout is 3 seconds), so they may run out of time if your files are extremely large - but since scratch space in /tmp is limited to 500MB, your file size is also limited.