Load Parquet files into Redshift

Richard · Sep 6, 2018 · Viewed 10.8k times

I have a bunch of Parquet files on S3 and I want to load them into Redshift in the most efficient way.

Each file is split into multiple chunks. What is the best way to load this data from S3 into Redshift?

Also, how do you create the target table definition in Redshift? Is there a way to infer the schema from Parquet and create the table programmatically? I believe there is a way to do this using Redshift Spectrum, but I want to know whether it can be done from a script.

Appreciate your help!

I am considering all AWS tools, such as Glue and Lambda, to do this in the most efficient way (in terms of performance, security, and cost).

Answer

John Rotenstein · Sep 6, 2018

The Amazon Redshift COPY command can natively load Parquet files by using the parameter:

    FORMAT AS PARQUET

See: Amazon Redshift Can Now COPY from Parquet and ORC File Formats
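A minimal sketch of the full command (the table name, S3 path, and IAM role ARN below are placeholders):

    COPY public.my_table
    FROM 's3://my-bucket/path/to/parquet/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
    FORMAT AS PARQUET;

Pointing FROM at a prefix (or a manifest file) lets COPY pick up all of the chunked files under it and load them in parallel, which is the recommended way to load a multi-file set efficiently.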

The table must be pre-created; it cannot be created automatically.
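On the schema-inference part of your question: one scripted route is to have an AWS Glue crawler infer the schema from the Parquet files into the Glue Data Catalog, expose that catalog to Redshift as an external (Spectrum) schema, and derive the local table from it. A sketch, assuming a crawler has already registered the files as my_parquet_table in a Glue database named my_glue_db (all names and the role ARN are placeholders):

    CREATE EXTERNAL SCHEMA spectrum
    FROM DATA CATALOG
    DATABASE 'my_glue_db'
    IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole';

    -- Materialise a local Redshift table using the column
    -- definitions the crawler inferred from the Parquet files.
    CREATE TABLE public.my_table AS
    SELECT * FROM spectrum.my_parquet_table;

If you would rather load with COPY, you can create the table empty instead (for example, add WHERE 1 = 0 to the SELECT) and then COPY into it. Note that CTAS does not carry over distribution or sort keys, so you may still want to hand-tune the DDL afterwards.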

Also note the following from COPY from Columnar Data Formats - Amazon Redshift:

COPY inserts values into the target table's columns in the same order as the columns occur in the columnar data files. The number of columns in the target table and the number of columns in the data file must match.
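In other words, the DDL has to mirror the files positionally. A hypothetical example, assuming Parquet files whose columns are exactly (event_id, event_time, payload) in that order:

    CREATE TABLE public.my_table (
        event_id   BIGINT,       -- 1st column in the Parquet files
        event_time TIMESTAMP,    -- 2nd column
        payload    VARCHAR(512)  -- 3rd column
    );

Per the quoted documentation, it is the column count and order that must line up with the files, since COPY maps columns by position.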