Read Parquet file stored in S3 with AWS Lambda (Python 3)

Ptah · Dec 26, 2017

I am trying to load, process and write Parquet files in S3 with AWS Lambda. My testing / deployment process is:

It seems that there are two possible approaches, which both work locally in the Docker container:

  1. fastparquet with s3fs: Unfortunately the unzipped size of the package exceeds the 256 MB Lambda limit, so I can't update the Lambda code with it.
  2. pyarrow with s3fs: I followed https://github.com/apache/arrow/pull/916, and when it is executed in the Lambda function I get one of the following (see the sketch after this list):

    • If I prefix the URI with S3 or S3N (as in the code example): in the Lambda environment I get OSError: Passed non-file path: s3://mybucket/path/to/myfile (pyarrow/parquet.py, line 848); locally I get IndexError: list index out of range (pyarrow/parquet.py, line 714).
    • If I don't prefix the URI with S3 or S3N: it works locally (I can read the Parquet data), but in the Lambda environment I get the same OSError: Passed non-file path: s3://mybucket/path/to/myfile (pyarrow/parquet.py, line 848).
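
For reference, here is a minimal sketch of the pyarrow + s3fs pattern from that pull request (the bucket and key below are placeholders, not my actual path):

import s3fs
import pyarrow.parquet as pq

# Placeholder bucket/key; this is the variant without the s3:// prefix
fs = s3fs.S3FileSystem()
dataset = pq.ParquetDataset("mybucket/path/to/myfile", filesystem=fs)
df = dataset.read().to_pandas()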

My questions are:

  • why do I get a different result in my Docker container than in the Lambda environment?
  • what is the proper way to give the URI?
  • is there an accepted way to read Parquet files in S3 through AWS Lambda?

Thanks!

Answer

Igor Tavares · Jan 10, 2020

AWS has a project (AWS Data Wrangler) that handles exactly this, with full Lambda Layers support.

The docs include a step-by-step guide for setting it up.

Code example:

import awswrangler as wr

# Write (df is a pandas DataFrame)
wr.s3.to_parquet(
    df=df,
    path="s3://...",
    dataset=True,
    database="my_database",  # Optional, only if you want it available on Athena/Glue Catalog
    table="my_table",
    partition_cols=["PARTITION_COL_NAME"])

# Read
df = wr.s3.read_parquet(path="s3://...")
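
For completeness, a minimal Lambda handler could look roughly like this (a sketch that assumes the AWS Data Wrangler Lambda Layer is attached to the function; the path is a placeholder):

import awswrangler as wr

def lambda_handler(event, context):
    # Read the Parquet data from S3 inside the Lambda environment (placeholder path)
    df = wr.s3.read_parquet(path="s3://...")
    return {"rows": len(df)}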

Reference