Use AWS Glue Python with NumPy and Pandas Python Packages

jumpman23 · Sep 20, 2017 · Viewed 17.2k times

What is the easiest way to use packages such as NumPy and Pandas within AWS Glue, the new ETL tool on AWS? I have a completed Python script that uses NumPy and Pandas, and I would like to run it in AWS Glue.

Answer

Jasper_Li · Oct 12, 2017

I think the current answer is that you cannot. According to the AWS Glue documentation:

Only pure Python libraries can be used. Libraries that rely on C extensions, such as the pandas Python Data Analysis Library, are not yet supported.

But even when I tried to include a library written in plain Python from S3, the Glue job failed because of an HDFS permission problem. If you find a way to solve this, please let me know as well.
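As a rough way to tell whether a library falls under the "pure Python" restriction quoted above, you could scan the package for compiled files before uploading it to S3. This is only a sketch of a heuristic, not anything provided by AWS Glue: the function name `find_compiled_files` and the suffix list are my own assumptions, and a clean result does not guarantee the library will actually load in a Glue job.

```python
import os
import zipfile

# Suffixes that usually indicate compiled (non-pure-Python) code.
# This list is an assumption for illustration, not an official definition.
COMPILED_SUFFIXES = (".so", ".pyd", ".dll", ".dylib")


def find_compiled_files(path):
    """Return compiled files found in a package directory or a zip-style archive.

    `path` may be a directory or a .zip/.whl/.egg file. An empty list
    suggests (but does not prove) that the package is pure Python.
    """
    if os.path.isdir(path):
        # Collect every file under the directory tree.
        names = [
            os.path.join(root, name)
            for root, _dirs, files in os.walk(path)
            for name in files
        ]
    else:
        # Wheels and eggs are zip archives, so zipfile can list them.
        with zipfile.ZipFile(path) as archive:
            names = archive.namelist()
    return sorted(n for n in names if n.lower().endswith(COMPILED_SUFFIXES))
```

Calling `find_compiled_files` on the directory or archive you plan to upload returns a list of any compiled files it contains. A package like pandas ships compiled extension modules, so its list would not be empty, which matches the restriction in the documentation.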