Is it possible to trigger an AWS Glue crawler on new files that get uploaded into an S3 bucket, given that the crawler is "pointed" at that bucket? In other words: a file upload generates an event that causes the Glue crawler to analyse it. I know there is schedule-based crawling, but I've never found an event-based one.
As a quick start, here is a blow-by-blow account of how to create a Lambda function in Python to do this. This is the first time I've created a Lambda, so YMMV.
"Action": "glue:StartCrawler",
"Resource": "*"
Then use handler code like the following (substitute your own region and crawler name):

```python
from __future__ import print_function
import json
import boto3

print('Loading function')

# The explicit endpoint_url is optional; boto3 resolves it from the region.
glue = boto3.client(service_name='glue', region_name='ap-southeast-2',
                    endpoint_url='https://glue.ap-southeast-2.amazonaws.com')

def lambda_handler(event, context):
    # print("Received event: " + json.dumps(event, indent=2))
    try:
        glue.start_crawler(Name='my-glue-crawler')
    except Exception as e:
        print(e)
        print('Error starting crawler')
        raise e
```
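If you want the handler to log which object triggered it, the bucket and key can be read from the event. A minimal sketch, assuming the standard S3 put-notification event shape (not something the code above relies on):

```python
    # Inside lambda_handler, before calling start_crawler:
    record = event['Records'][0]
    bucket = record['s3']['bucket']['name']
    key = record['s3']['object']['key']  # note: keys arrive URL-encoded
    print('Triggered by s3://{}/{}'.format(bucket, key))
```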
Finally, assuming you chose to leave the trigger disabled while developing, click the S3 trigger in the designer panel and make sure it is enabled (you may need to save the Lambda after making this change).
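As an aside, if you would rather wire the trigger up programmatically than through the console, something along these lines should work with boto3 (the bucket name, function name, and ARNs below are placeholders, not values from the steps above):

```python
import boto3

lambda_client = boto3.client('lambda')
s3 = boto3.client('s3')

# Allow S3 to invoke the function.
lambda_client.add_permission(
    FunctionName='start-glue-crawler',
    StatementId='s3-invoke',
    Action='lambda:InvokeFunction',
    Principal='s3.amazonaws.com',
    SourceArn='arn:aws:s3:::my-data-bucket',
)

# Fire the function on every object created in the bucket.
s3.put_bucket_notification_configuration(
    Bucket='my-data-bucket',
    NotificationConfiguration={
        'LambdaFunctionConfigurations': [{
            'LambdaFunctionArn': 'arn:aws:lambda:ap-southeast-2:'
                                 '123456789012:function:start-glue-crawler',
            'Events': ['s3:ObjectCreated:*'],
        }]
    },
)
```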
That's it, but note that start_crawler raises an exception if the crawler is already running, so you will want to handle that if you have frequent uploads or long crawls. See: https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-crawling.html#aws-glue-api-crawler-crawling-StartCrawler
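boto3 exposes that error as a modeled exception on the client, so one way to handle it (a sketch, using the same glue client and crawler name as above) is:

```python
def lambda_handler(event, context):
    try:
        glue.start_crawler(Name='my-glue-crawler')
    except glue.exceptions.CrawlerRunningException:
        # Already running: the new objects will be picked up by the
        # in-flight crawl or the next run, so this is safe to ignore.
        print('Crawler already running; skipping')
```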