Concatenate files in S3 using AWS Lambda

V. Samma picture V. Samma · Sep 21, 2016 · Viewed 9.4k times · Source

Is there a way to use Lambda for S3 file concatenation?

I have Firehose streaming data into S3 with the longest possible interval (15 minutes or 128mb) and therefore I have 96 data files daily, but I want to aggregate all the data to a single daily data file for the fastest performance when reading the data later in Spark (EMR).

I created a solution where Lambda function gets invoked when Firehose streams a new file into S3. Then the function reads (s3.GetObject) the new file from source bucket and the concatenated daily data file (if it already exists with previous daily data, otherwise creates a new one) from the destination bucket, decode both response bodies to string and then just add them together and write to the destination bucket with s3.PutObject (which overwrites the previous aggregated file).

The problem is that when the aggregated file reaches 150+ MB, the Lambda function reaches its ~1500mb memory limit when reading the two files and then fails.

Currently I have a minimal amount of data, with a few hundred MB-s per day, but this amount will be growing exponentially in the future. It is weird for me that Lambda has such low limits and that they are already reached with so small files.

Or what are the alternatives of concatenating S3 data, ideally invoked by S3 object created event or somehow a scheduled job, for example scheduled daily?

Answer

l0b0 picture l0b0 · Sep 21, 2016

I would reconsider whether you actually want to do this:

  • The S3 costs will go up.
  • The pipeline complexity will go up.
  • The latency from Firehose input to Spark input will go up.
  • If a single file injection into Spark fails (this will happen in a distributed system) you have to shuffle around a huge file, maybe slice it if injection is not atomic, upload it again, all of which could take very long for lots of data. At this point you may find that the time to recover is so long that you'll have to postpone the next injection…

Instead, unless it's impossible in the situation, if you make the Firehose files as small as possible and send them to Spark immediately:

  • You can archive S3 objects almost immediately, lowering costs.
  • Data is available in Spark as soon as possible.
  • If a single file injection into Spark fails there's less data to shuffle around, and if you have automated recovery this shouldn't even be noticeable unless some system is running full tilt at all times (at which point bulk injections would be even worse).
  • There's a tiny amount of latency increase from establishing TCP connections and authentication.

I'm not familiar with Spark specifically, but in general such a "piped" solution would involve:

  • A periodic trigger or (even better) an event listener on the Firehose output bucket to process input ASAP.
  • An injector/transformer to move data efficiently from S3 to Spark. It sounds like Parquet could help with this.
  • A live Spark/EMR/underlying data service instance ready to receive the data.
  • In case of an underlying data service, some way of creating a new Spark cluster to query the data on demand.

Of course, if it is not possible to keep Spark data ready (but not queriable ("queryable"? I don't know)) for a reasonable amount of money, this may not be an option. It may also be possible that it's extremely time consuming to inject small chunks of data, but that seems unlikely for a production-ready system.


If you really need to chunk the data into daily dumps you can use multipart uploads. As a comparison, we're doing light processing of several files per minute (many GB per day) from Firehose with no appreciable overhead.