I have a bunch of Parquet files on S3, and I want to load them into Redshift in the most efficient way.
Each file is split into multiple chunks. What is the best way to load this data from S3 into Redshift?
Also, how do you create the target table definition in Redshift? Is there a way to infer the schema from Parquet and create the table programmatically? I believe there is a way to do this using Redshift Spectrum, but I want to know if it can be done from a script.
Appreciate your help!
I am considering all AWS tools such as Glue, Lambda, etc. to do this in the best way in terms of performance, security, and cost.
The Amazon Redshift COPY command can natively load Parquet files by using the parameter:
FORMAT AS PARQUET
See: Amazon Redshift Can Now COPY from Parquet and ORC File Formats
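A minimal sketch of such a COPY, assuming a placeholder table name, S3 prefix, and IAM role ARN (none of these are from your setup):

```sql
-- Load all Parquet files under the given S3 prefix into an existing table.
-- Table name, bucket path, and role ARN below are hypothetical placeholders.
COPY my_schema.my_table
FROM 's3://my-bucket/parquet-prefix/'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftCopyRole'
FORMAT AS PARQUET;
```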
The table must be pre-created; it cannot be created automatically.
Also note from COPY from Columnar Data Formats - Amazon Redshift:
COPY inserts values into the target table's columns in the same order as the columns occur in the columnar data files. The number of columns in the target table and the number of columns in the data file must match.
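To illustrate that requirement, here is a hedged sketch assuming a Parquet file whose columns are, in order, an id, a timestamp, and an amount; the table is pre-created with the same number of columns in the same order (column names and types here are hypothetical):

```sql
-- Pre-create the target table so its columns line up, by position,
-- with the columns in the Parquet files.
CREATE TABLE my_schema.my_table (
    id         BIGINT,        -- 1st Parquet column
    event_time TIMESTAMP,     -- 2nd Parquet column
    amount     DECIMAL(18,2)  -- 3rd Parquet column
);
```

If the column count or order differs from the Parquet files, the COPY will fail or load values into the wrong columns, so the DDL has to be derived from the file schema up front.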