How to Convert Many CSV files to Parquet using AWS Glue

mark s. picture mark s. · Apr 23, 2018 · Viewed 14.3k times · Source

I'm using AWS S3, Glue, and Athena with the following setup:

S3 --> Glue --> Athena

My raw data is stored on S3 as CSV files. I'm using Glue for ETL, and I'm using Athena to query the data.

Since I'm using Athena, I'd like to convert the CSV files to Parquet. I'm using AWS Glue to do this right now. This is the current process I'm using:

  1. Run Crawler to read CSV files and populate Data Catalog.
  2. Run ETL job to create Parquet file from Data Catalog.
  3. Run a Crawler to populate Data Catalog using Parquet file.

The Glue job only allows me to convert one table at a time. If I have many CSV files, this process quickly becomes unmanageable. Is there a better way, perhaps a "correct" way, of converting many CSV files to Parquet using AWS Glue or some other AWS service?

Answer

Tanveer Uddin picture Tanveer Uddin · Sep 19, 2018

I had the exact same situation where I wanted to efficiently loop through the catalog tables catalogued by crawler which are pointing to csv files and then convert them to parquet. Unfortunately there is not much information available in the web yet. That's why I have written a blog in LinkedIn explaining how I have done it. Please have a read; specially point #5. Hope that helps. Please let me know your feedback.

Note: As per Antti's feedback, I am pasting the excerpt solution from my blog below:

  1. Iterating through catalog/database/tables

The Job Wizard comes with option to run predefined script on a data source. Problem is that the data source you can select is a single table from the catalog. It does not give you option to run the job on the whole database or a set of tables. You can modify the script later anyways but the way to iterate through the database tables in glue catalog is also very difficult to find. There are Catalog APIs but lacking suitable examples. The github example repo can be enriched with lot more scenarios to help developers.

After some mucking around, I came up with the script below which does the job. I have used boto3 client to loop through the table. I am pasting it here if it comes to someone’s help. I would also like to hear from you if you have a better suggestion

import sys
import boto3
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)


client = boto3.client('glue', region_name='ap-southeast-2')

databaseName = 'tpc-ds-csv'
print '\ndatabaseName: ' + databaseName

Tables = client.get_tables(DatabaseName=databaseName)

tableList = Tables['TableList']

for table in tableList:
    tableName = table['Name']
    print '\n-- tableName: ' + tableName

    datasource0 = glueContext.create_dynamic_frame.from_catalog(
        database="tpc-ds-csv", 
        table_name=tableName, 
        transformation_ctx="datasource0"
    )

    datasink4 = glueContext.write_dynamic_frame.from_options(
        frame=datasource0,
        connection_type="s3", 
        connection_options={
            "path": "s3://aws-glue-tpcds-parquet/"+ tableName + "/"
            },
        format="parquet",
        transformation_ctx="datasink4"
    )
job.commit()