I'm working at a company that processes very large CSV files. Clients upload the file to Amazon S3 via filepicker. Then multiple server processes can read the file in parallel (i.e. starting from different points) to process it and store it in a database. Optionally the clients may zip the file before uploading.
If I understand correctly, what I want is a way to take the ZIP file on S3 and produce an unzipped CSV, also on S3.
I can write code to download, unzip, and multipart-upload the file back to S3, but I was hoping for an efficient, easily scalable solution. AWS Lambda would have been ideal for running the code (it avoids provisioning resources we don't otherwise need), but its execution time is limited to 60 seconds. Besides, the use case seems so simple and generic that I expected to find an existing solution.
Your best bet is probably to have an S3 event notification sent to an SQS queue every time a zip file is uploaded, and to have one or more EC2 instances polling that queue for files to unzip.
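The worker loop could look something like this in Python with boto3 (a minimal sketch, not a drop-in implementation: the queue URL and destination bucket are placeholders you'd supply, the message parsing assumes the standard S3 event-notification JSON, and the whole zip is buffered in memory, which only suits files that fit in RAM):

```python
import io
import json
import zipfile


def unzip_first_member(zipped):
    """Return (name, bytes) of the first entry in a ZIP given as a file-like object."""
    with zipfile.ZipFile(zipped) as zf:
        name = zf.namelist()[0]
        return name, zf.read(name)


def handle_s3_event(body, s3, dest_bucket):
    """Process one S3 event notification: fetch each zip, unzip it, upload the CSV."""
    for record in json.loads(body)["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        buf = io.BytesIO()
        s3.download_fileobj(bucket, key, buf)  # buffers the whole zip in memory
        buf.seek(0)
        name, data = unzip_first_member(buf)
        s3.put_object(Bucket=dest_bucket, Key=name, Body=data)


def poll_forever(queue_url, dest_bucket):
    """Long-poll the queue; delete each message only after successful processing."""
    import boto3  # AWS SDK; assumes credentials are configured in the environment

    s3 = boto3.client("s3")
    sqs = boto3.client("sqs")
    while True:
        resp = sqs.receive_message(
            QueueUrl=queue_url, MaxNumberOfMessages=1, WaitTimeSeconds=20
        )
        for msg in resp.get("Messages", []):
            handle_s3_event(msg["Body"], s3, dest_bucket)
            sqs.delete_message(
                QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"]
            )
```

Deleting the message only after the upload succeeds means a crashed worker's message reappears after the visibility timeout and another instance picks it up.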
You may only need one running instance to do this, but you could also have an autoscaling policy that spins up more instances if the SQS queue grows too big for a single instance to keep up with the unzipping (as defined by you).
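One way to wire that up is a simple scaling policy on the worker Auto Scaling group plus a CloudWatch alarm on the queue's ApproximateNumberOfMessagesVisible metric that triggers it. A sketch with boto3 follows; the group name, queue name, and threshold are placeholders, and the clients are passed in as parameters (in practice they'd be `boto3.client("autoscaling")` and `boto3.client("cloudwatch")`):

```python
def create_scale_out_policy(autoscaling, cloudwatch, asg_name, queue_name, threshold):
    """Add one worker instance whenever the queue backlog stays above `threshold`.

    `autoscaling` and `cloudwatch` are boto3 clients (or stand-ins in a test).
    """
    # Simple scaling policy: +1 instance each time the alarm fires,
    # with a cooldown so instances aren't added faster than they boot.
    resp = autoscaling.put_scaling_policy(
        AutoScalingGroupName=asg_name,
        PolicyName="scale-out-on-queue-backlog",
        AdjustmentType="ChangeInCapacity",
        ScalingAdjustment=1,
        Cooldown=300,
    )
    policy_arn = resp["PolicyARN"]
    # Alarm on the number of visible (i.e. not-yet-claimed) messages in the queue.
    cloudwatch.put_metric_alarm(
        AlarmName=f"{queue_name}-backlog",
        Namespace="AWS/SQS",
        MetricName="ApproximateNumberOfMessagesVisible",
        Dimensions=[{"Name": "QueueName", "Value": queue_name}],
        Statistic="Average",
        Period=300,
        EvaluationPeriods=1,
        Threshold=float(threshold),
        ComparisonOperator="GreaterThanThreshold",
        AlarmActions=[policy_arn],
    )
    return policy_arn
```

A matching scale-in policy (ScalingAdjustment=-1 on a low-backlog alarm) would let the group shrink back when the burst is over.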