Read Parquet file stored in S3 with AWS Lambda (Python 3)

Ptah · Dec 26, 2017

I am trying to load, process and write Parquet files in S3 with AWS Lambda. My testing / deployment process is:

It seems that there are two possible approaches, which both work locally in the Docker container:

  1. fastparquet with s3fs: Unfortunately the unzipped size of the package exceeds the 256 MB Lambda limit, so I can't update the Lambda code with it.
  2. pyarrow with s3fs: I followed https://github.com/apache/arrow/pull/916, and when it is executed in the Lambda function I get one of the following (see the sketch after this list):

    • If I prefix the URI with S3 or S3N (as in the code example): in the Lambda environment I get OSError: Passed non-file path: s3://mybucket/path/to/myfile (pyarrow/parquet.py, line 848); locally I get IndexError: list index out of range (pyarrow/parquet.py, line 714).
    • If I don't prefix the URI with S3 or S3N: it works locally (I can read the Parquet data), but in the Lambda environment I get the same OSError: Passed non-file path: s3://mybucket/path/to/myfile (pyarrow/parquet.py, line 848).
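
For reference, here is a minimal sketch of the pyarrow + s3fs pattern from that pull request (the bucket and key below are placeholders, not my actual path):

import s3fs
import pyarrow.parquet as pq

# Placeholder bucket/key; this is the variant without the s3:// prefix
fs = s3fs.S3FileSystem()
dataset = pq.ParquetDataset("mybucket/path/to/myfile", filesystem=fs)
df = dataset.read().to_pandas()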

My questions are:

  • why do I get a different result in my Docker container than in the Lambda environment?
  • what is the proper way to give the URI?
  • is there an accepted way to read Parquet files in S3 through AWS Lambda?

Thanks!

Answer

Igor Tavares · Jan 10, 2020

AWS has a project (AWS Data Wrangler) that handles exactly this, with full Lambda Layers support.

The docs include a step-by-step guide for setting it up.

Code example:

import awswrangler as wr

# Write (df is a pandas DataFrame)
wr.s3.to_parquet(
    df=df,
    path="s3://...",
    dataset=True,
    database="my_database",  # Optional, only if you want it available on Athena/Glue Catalog
    table="my_table",
    partition_cols=["PARTITION_COL_NAME"])

# Read
df = wr.s3.read_parquet(path="s3://...")
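
For completeness, a minimal Lambda handler could look roughly like this (a sketch that assumes the AWS Data Wrangler Lambda Layer is attached to the function; the path is a placeholder):

import awswrangler as wr

def lambda_handler(event, context):
    # Read the Parquet data from S3 inside the Lambda environment (placeholder path)
    df = wr.s3.read_parquet(path="s3://...")
    return {"rows": len(df)}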

Reference