I have a Spark batch job which is executed hourly. Each run generates and stores new data in S3
with the directory naming pattern DATA/YEAR=?/MONTH=?/DATE=?/datafile
.
After uploading the data to S3
, I want to investigate it using Athena
. Also, I would like to visualize them in QuickSight
by connecting to Athena as a data source.
The problem is that after each run of my Spark batch, the newly generated data stored in S3
will not be discovered by Athena, unless I manually run the query MSCK REPAIR TABLE
.
Is there a way to make Athena update the data automatically, so that I can create a fully automatic data visualization pipeline?
There are a number of ways to schedule this task. How do you schedule your workflows? Do you use a system like Airflow, Luigi, Azkaban, cron, or using an AWS Data pipeline?
From any of these, you should be able to fire off the following CLI command.
$ aws athena start-query-execution --query-string "MSCK REPAIR TABLE some_database.some_table" --result-configuration "OutputLocation=s3://SOMEPLACE"
Another option would be AWS Lambda. You could have a function that calls MSCK REPAIR TABLE some_database.some_table
in response to a new upload to S3.
An example Lambda Function could be written as such:
import boto3
def lambda_handler(event, context):
bucket_name = 'some_bucket'
client = boto3.client('athena')
config = {
'OutputLocation': 's3://' + bucket_name + '/',
'EncryptionConfiguration': {'EncryptionOption': 'SSE_S3'}
}
# Query Execution Parameters
sql = 'MSCK REPAIR TABLE some_database.some_table'
context = {'Database': 'some_database'}
client.start_query_execution(QueryString = sql,
QueryExecutionContext = context,
ResultConfiguration = config)
You would then configure a trigger to execute your Lambda function when new data are added under the DATA/
prefix in your bucket.
Ultimately, explicitly rebuilding the partitions after you run your Spark Job using a job scheduler has the advantage of being self documenting. On the other hand, AWS Lambda is convenient for jobs like this one.