I am trying to load, process and write Parquet files in S3 with AWS Lambda. My testing / deployment process is:
It seems that there are two possible approaches, which both work locally in the Docker container:
pyarrow with s3fs: I followed https://github.com/apache/arrow/pull/916 and I get either:
OSError: Passed non-file path: s3://mybucket/path/to/myfile (pyarrow/parquet.py, line 848) when executed in the Lambda function, or
IndexError: list index out of range (pyarrow/parquet.py, line 714) when executed locally.
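For reference, the pyarrow with s3fs pattern from that PR looks roughly like this (a minimal sketch: the bucket path is the one from the error messages, and exact calls may differ between pyarrow and s3fs versions):

import pyarrow.parquet as pq
import s3fs

# Open the S3 path through an explicit s3fs filesystem and read it into pandas
fs = s3fs.S3FileSystem()
dataset = pq.ParquetDataset("s3://mybucket/path/to/myfile", filesystem=fs)
df = dataset.read().to_pandas()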
My questions are:
Thanks!
AWS has a project (AWS Data Wrangler) that handles this, with full Lambda Layers support.
The docs include a step-by-step guide for setting it up.
Code example:
import awswrangler as wr
import pandas as pd

df = pd.DataFrame({"PARTITION_COL_NAME": ["a", "b"], "value": [1, 2]})  # any pandas DataFrame

# Write
wr.s3.to_parquet(
    df=df,
    path="s3://...",
    dataset=True,
    database="my_database",  # Optional, only if you want it available in the Athena/Glue Catalog
    table="my_table",
    partition_cols=["PARTITION_COL_NAME"],
)

# Read
df = wr.s3.read_parquet(path="s3://...")
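For the Lambda use case from the question, a handler built on this API could look roughly like the following (a minimal sketch: the output path and the processing step are placeholders, and the AWS Data Wrangler Lambda Layer from the docs must be attached to the function):

import awswrangler as wr

def lambda_handler(event, context):
    # Load the source Parquet file from S3
    df = wr.s3.read_parquet(path="s3://mybucket/path/to/myfile")

    # Placeholder processing step
    df["processed"] = True

    # Write the result back to S3 as a dataset (output path is hypothetical)
    wr.s3.to_parquet(df=df, path="s3://mybucket/path/to/output/", dataset=True)

    return {"rows_written": len(df)}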