How do I get Python libraries in pyspark?

thenakulchawla · Mar 25, 2016 · Viewed 44.8k times

I want to use matplotlib.bblpath or shapely.geometry libraries in pyspark.

When I try to import either of them, I get the following error:

>>> from shapely.geometry import polygon
Traceback (most recent call last):
 File "<stdin>", line 1, in <module>
ImportError: No module named shapely.geometry

I know the module isn't present, but how can I make these packages available to pyspark?

Answer

armatita · Mar 25, 2016

On your SparkContext instance (sc in the pyspark shell), try:

sc.addPyFile("module.py")  # a .zip also works

Quoting from the docs:

Add a .py or .zip dependency for all tasks to be executed on this SparkContext in the future. The path passed can be either a local file, a file in HDFS (or other Hadoop-supported filesystems), or an HTTP, HTTPS or FTP URI.
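For a whole package rather than a single file, one option is to zip the package folder and pass the archive to addPyFile. Below is a minimal sketch of that idea, assuming an existing SparkContext named sc (as in the pyspark shell) and a pure-Python package; the helper names zip_package and ship_package and the mypkg paths are made up for illustration. Note that this approach generally only suits pure-Python code. Packages with compiled extensions, which as far as I know includes shapely and matplotlib, usually have to be installed on every worker node instead.

```python
import os
import zipfile


def zip_package(package_dir, zip_path):
    """Zip a pure-Python package so the package folder sits at the archive root."""
    package_dir = os.path.abspath(package_dir)
    parent = os.path.dirname(package_dir)
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for root, _dirs, files in os.walk(package_dir):
            for name in files:
                if name.endswith(".pyc"):
                    continue  # skip compiled bytecode
                full = os.path.join(root, name)
                # Store paths relative to the parent so "import mypkg" resolves.
                zf.write(full, os.path.relpath(full, parent))
    return zip_path


def ship_package(sc, package_dir, zip_path):
    """Zip package_dir and register the archive with an existing SparkContext."""
    sc.addPyFile(zip_package(package_dir, zip_path))


# Hypothetical usage inside the pyspark shell, where sc already exists:
# ship_package(sc, "mypkg", "mypkg.zip")
```

After that call, tasks submitted on the same SparkContext should be able to import mypkg on the executors.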