I would like to ingest data into s3 from kinesis firehose formatted as parquet. So far I have just find a solution that implies creating an EMR, but I am looking for something cheaper and faster like store the received json as parquet directly from firehose or use a Lambda function.
Thank you very much, Javi.
After dealing with the AWS support service and a hundred of different implementations, I would like to explain what I have achieved.
Finally I have created a Lambda function that process every file generated by Kinesis Firehose, classifies my events according to the payload and stores the result in Parquet files in S3.
Doing that is not very easy:
First of all you should create a Python virtual env, including all the required libraries (in my case Pandas, NumPy, Fastparquet, etc). As the resulted file (that includes all the libraries and my Lambda function is heavy, it is necessary to launch an EC2 instance, I have used the one included in the free tier). To create the virtual env follow these steps:
Create the lambda_function propertly:
import json
import boto3
import datetime as dt
import urllib
import zlib
import s3fs
from fastparquet import write
import pandas as pd
import numpy as np
import time
def _send_to_s3_parquet(df):
s3_fs = s3fs.S3FileSystem()
s3_fs_open = s3_fs.open
# FIXME add something else to the key or it will overwrite the file
key = 'mybeautifullfile.parquet.gzip'
# Include partitions! key1 and key2
write( 'ExampleS3Bucket'+ '/key1=value/key2=othervalue/' + key, df,
compression='GZIP',open_with=s3_fs_open)
def lambda_handler(event, context):
# Get the object from the event and show its content type
bucket = event['Records'][0]['s3']['bucket']['name']
key = urllib.unquote_plus(event['Records'][0]['s3']['object']['key'])
try:
s3 = boto3.client('s3')
response = s3.get_object(Bucket=bucket, Key=key)
data = response['Body'].read()
decoded = data.decode('utf-8')
lines = decoded.split('\n')
# Do anything you like with the dataframe (Here what I do is to classify them
# and write to different folders in S3 according to the values of
# the columns that I want
df = pd.DataFrame(lines)
_send_to_s3_parquet(df)
except Exception as e:
print('Error getting object {} from bucket {}.'.format(key, bucket))
raise e
Copy the lambda function to the lambda.zip and deploy the lambda_function:
Trigger the to be executed when you like, e.g, each time a new file is created in S3, or even you could associate the lambda function to Firehose. (I did not choose this option because the 'lambda' limits are lower than the Firehose limits, you can configure Firehose to write a file each 128Mb or 15 minutes, but if you associate this lambda function to Firehose, the lambda function will be executed every 3 mins or 5MB, in my case I had the problem of generate a lot of little parquet files, as for each time that the lambda function is launched I generate at least 10 files).