I have written a Lambda that is triggered off an S3 bucket to unzip a zip file and process a text document inside. Due to Lambda's memory limits, I need to move my processing over to something like AWS Batch. Correct me if I am wrong, but my workflow should look something like this.
I believe I need to write a Lambda to put the location of the S3 object on Amazon SQS, where AWS Batch can read the location and do all the unzipping/data processing there, where there is more memory.
Here is my current Lambda: it takes in the event triggered by the S3 bucket, checks whether the object is a zip file, then pushes the name of that S3 key to SQS. Should I tell AWS Batch to start reading the queue here in my Lambda? I am totally new to AWS in general and not sure where to go from here.
import java.io.IOException;
import java.net.URLDecoder;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.log4j.Logger;

import com.amazonaws.regions.Region;
import com.amazonaws.regions.Regions;
import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;
import com.amazonaws.services.lambda.runtime.events.S3Event;
import com.amazonaws.services.s3.event.S3EventNotification.S3EventNotificationRecord;
import com.amazonaws.services.sqs.AmazonSQS;
import com.amazonaws.services.sqs.AmazonSQSClient;
import com.amazonaws.services.sqs.model.CreateQueueRequest;
import com.amazonaws.services.sqs.model.SendMessageRequest;

public class DockerEventHandler implements RequestHandler<S3Event, String> {

    private static BigData app = new BigData();
    private static DomainOfConstants CONST = new DomainOfConstants();
    private static Logger log = Logger.getLogger(DockerEventHandler.class);

    private static AmazonSQS SQS;
    private static CreateQueueRequest createQueueRequest;
    private static Matcher matcher;
    private static String srcBucket, srcKey, extension, myQueueUrl;

    @Override
    public String handleRequest(S3Event s3Event, Context context)
    {
        // The SQS client must be initialized before any message is sent.
        sQSConnection();
        try {
            for (S3EventNotificationRecord record : s3Event.getRecords())
            {
                srcBucket = record.getS3().getBucket().getName();
                // S3 event keys are URL-encoded, with spaces escaped as '+'.
                srcKey = record.getS3().getObject().getKey().replace('+', ' ');
                srcKey = URLDecoder.decode(srcKey, "UTF-8");
                matcher = Pattern.compile(".*\\.([^\\.]*)").matcher(srcKey);
                if (!matcher.matches())
                {
                    log.info(CONST.getNoConnectionMessage() + srcKey);
                    return "";
                }
                extension = matcher.group(1).toLowerCase();
                if (!"zip".equals(extension))
                {
                    log.info("Skipping non-zip file " + srcKey + " with extension " + extension);
                    return "";
                }
                log.info("Sending object location to queue: " + srcBucket + "/" + srcKey);
                // Pass in only a reference to where the object is located.
                createQue(CONST.getQueueName(), srcKey);
            }
        }
        catch (IOException e)
        {
            log.error(e);
        }
        return "Ok";
    }

    /*
     * Set up the connection to Amazon SQS.
     * TODO - Move to the newer client builder API to eliminate the deprecation warning.
     */
    @SuppressWarnings("deprecation")
    public static void sQSConnection() {
        app.setAwsCredentials(CONST.getAccessKey(), CONST.getSecretKey());
        try {
            SQS = new AmazonSQSClient(app.getAwsCredentials());
            Region usEast1 = Region.getRegion(Regions.US_EAST_1);
            SQS.setRegion(usEast1);
        }
        catch (Exception e) {
            log.error(e);
        }
    }

    // Create the queue if it does not already exist (CreateQueue is idempotent
    // for a queue that already exists with the same name), then enqueue the message.
    public static void createQue(String queName, String message) {
        createQueueRequest = new CreateQueueRequest(queName);
        myQueueUrl = SQS.createQueue(createQueueRequest).getQueueUrl();
        sendMessage(myQueueUrl, message);
    }

    // Send a reference to the S3 object's location to the queue.
    public static void sendMessage(String SIMPLE_QUE_URL, String S3KeyName) {
        SQS.sendMessage(new SendMessageRequest(SIMPLE_QUE_URL, S3KeyName));
    }

    // Fire AWS Batch to pull from the queue.
    private static void initializeBatch() {
        //TODO
    }
}
I have set up Docker and understand Docker images. I believe my Docker image should contain all the code to read the queue, unzip, process, and kick the file off to RDS, all in one Docker image/container.
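For the container side, here is a minimal sketch of what that worker loop could look like with the SDK for Java v1, matching the SQS-based plan above: it long-polls the queue for S3 keys, streams each zip out of S3, and hands the entries to the existing processing code. The queue name my-unzip-queue, the bucket name my-source-bucket, and the processEntry helper are hypothetical stand-ins for your own code:

import java.io.InputStream;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.sqs.AmazonSQS;
import com.amazonaws.services.sqs.AmazonSQSClientBuilder;
import com.amazonaws.services.sqs.model.Message;
import com.amazonaws.services.sqs.model.ReceiveMessageRequest;

public class QueueWorker {

    private static final AmazonSQS SQS = AmazonSQSClientBuilder.defaultClient();
    private static final AmazonS3 S3 = AmazonS3ClientBuilder.defaultClient();

    public static void main(String[] args) throws Exception {
        String queueUrl = SQS.getQueueUrl("my-unzip-queue").getQueueUrl(); // hypothetical name
        String bucket = "my-source-bucket";                                // hypothetical name

        while (true) {
            // Long-poll so the worker doesn't spin on an empty queue.
            List<Message> messages = SQS.receiveMessage(
                    new ReceiveMessageRequest(queueUrl)
                            .withMaxNumberOfMessages(1)
                            .withWaitTimeSeconds(20)).getMessages();

            for (Message message : messages) {
                String key = message.getBody(); // the S3 key the Lambda enqueued
                try (InputStream in = S3.getObject(bucket, key).getObjectContent();
                     ZipInputStream zip = new ZipInputStream(in)) {
                    ZipEntry entry;
                    while ((entry = zip.getNextEntry()) != null) {
                        processEntry(entry.getName(), zip); // your existing process/RDS code
                    }
                }
                // Delete only after successful processing so failed work is retried.
                SQS.deleteMessage(queueUrl, message.getReceiptHandle());
            }
        }
    }

    private static void processEntry(String name, InputStream contents) {
        // Existing unzip/process/kick-off-to-RDS logic goes here.
    }
}

Deleting the message only after processing succeeds means that if the container dies mid-file, the message becomes visible again after the visibility timeout and the work is retried.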
I am looking for someone who has done something similar that they could share to help. Something along the lines of:
Mr. S3: Hey Lambda, I have a file.
Mr. Lambda: Okay S3, I see you. Hey AWS Batch, could you unzip and do stuff to this?
Mr. Batch: Gotcha, Mr. Lambda, I'll take care of that and put it in RDS or some database after.
I have not written the class/Docker image yet, but I have all the code to process/unzip and kick off to RDS done. Lambda is just limited on memory, since some of the files are 1 GB or bigger.
Okay, so after looking through the AWS docs on Batch, you don't need an SQS queue. Batch has a concept called a Job Queue, which is similar to an SQS FIFO queue, but different in that these job queues have priorities, and jobs within them can have dependencies on other jobs. The basic process is:
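1. Create a Compute Environment describing the instances Batch may launch.
2. Create a Job Queue and attach it to that compute environment.
3. Register a Job Definition that points at your Docker image and declares its vCPU/memory requirements.
4. Have the Lambda call SubmitJob directly, passing the bucket and key as job parameters.

To make step 4 concrete, here is a minimal sketch of submitting a job from the Lambda with the AWS SDK for Java's Batch client. It assumes the job queue and job definition from steps 2 and 3 already exist; the names unzip-job-queue and unzip-job-def are hypothetical placeholders:

import java.util.HashMap;
import java.util.Map;

import com.amazonaws.services.batch.AWSBatch;
import com.amazonaws.services.batch.AWSBatchClientBuilder;
import com.amazonaws.services.batch.model.SubmitJobRequest;
import com.amazonaws.services.batch.model.SubmitJobResult;

public class BatchSubmitter {

    private static final AWSBatch BATCH = AWSBatchClientBuilder.defaultClient();

    // Submit one Batch job per uploaded zip, passing the S3 location as
    // job parameters instead of routing it through SQS.
    public static String submitUnzipJob(String bucket, String key) {
        Map<String, String> params = new HashMap<>();
        params.put("bucket", bucket);
        params.put("key", key);

        SubmitJobRequest request = new SubmitJobRequest()
                .withJobName("unzip-" + System.currentTimeMillis())
                .withJobQueue("unzip-job-queue")    // hypothetical queue name
                .withJobDefinition("unzip-job-def") // hypothetical job definition name
                .withParameters(params);

        SubmitJobResult result = BATCH.submitJob(request);
        return result.getJobId();
    }
}

The container then receives the bucket and key through the job definition's command (referenced there as Ref::bucket and Ref::key), so no queue polling is needed at all.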