I have written a class implementing a classifier in Python. I would like to use Apache Spark to parallelize the classification of a huge number of data points using this classifier.
However, when I do the following:
import BoTree
bo_tree = BoTree.train(data)
rdd = sc.parallelize(keyed_training_points)  # create an RDD of 10 (integer, (float, float)) tuples
rdd = rdd.mapValues(lambda point, bt=bo_tree: bt.classify(point[0], point[1]))
out = rdd.collect()
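In case it helps, BoTree.py is shaped roughly like the stand-in below (heavily simplified; the real classifier is more involved and the decision rule here is just a placeholder):

# BoTree.py -- simplified stand-in; the real model is more complicated
class BoTree(object):
    def __init__(self, threshold):
        self.threshold = threshold

    def classify(self, x, y):
        # placeholder decision rule standing in for the real classifier
        return 1 if x + y > self.threshold else 0

def train(data):
    # "train" on an iterable of (float, float) points (details made up here)
    points = list(data)
    threshold = sum(x + y for x, y in points) / float(len(points))
    return BoTree(threshold)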
Spark fails with the following error (just the relevant bit, I think):
File "/root/spark/python/pyspark/worker.py", line 90, in main
command = pickleSer.loads(command.value)
File "/root/spark/python/pyspark/serializers.py", line 405, in loads
return cPickle.loads(obj)
ImportError: No module named BoroughTree
Can anyone help me? Somewhat desperate...
Thanks
Probably the simplest solution is to use the pyFiles argument when you create the SparkContext:
from pyspark import SparkContext
sc = SparkContext(master, app_name, pyFiles=['/path/to/BoTree.py'])
Every file placed there will be shipped to the workers and added to their PYTHONPATH.
If you're working in interactive mode you have to stop the existing context using sc.stop() before you create a new one.
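Putting it together, an interactive session would look roughly like this (master, app_name, data and keyed_training_points as in the snippets above; the path to BoTree.py is a placeholder):

from pyspark import SparkContext

sc.stop()  # stop the context that was created without pyFiles
sc = SparkContext(master, app_name, pyFiles=['/path/to/BoTree.py'])

import BoTree  # BoTree.py is now shipped to the workers as well
bo_tree = BoTree.train(data)

rdd = sc.parallelize(keyed_training_points)
rdd = rdd.mapValues(lambda point, bt=bo_tree: bt.classify(point[0], point[1]))
out = rdd.collect()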
Also make sure that the Spark workers are actually using the Anaconda distribution and not the default Python interpreter; based on your description that is most likely the problem. To set PYSPARK_PYTHON you can use conf/spark-env.sh.
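For example, something along these lines in conf/spark-env.sh on every node (the Anaconda path is a placeholder for wherever it is installed on your cluster):

# conf/spark-env.sh
export PYSPARK_PYTHON=/path/to/anaconda/bin/python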
On a side note, copying the file to lib is a rather messy solution. If you want to avoid pushing files using pyFiles, I would recommend creating either a plain Python package or a Conda package and installing it properly. This way you can easily keep track of what is installed, remove unnecessary packages, and avoid some hard-to-debug problems.
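For example, a minimal package could look like this (hypothetical layout and names, just to illustrate the idea):

botree/
    setup.py
    botree/
        __init__.py      # e.g. re-exports train and the BoTree class
        bo_tree.py       # the classifier code moves here

with a setup.py along these lines:

from setuptools import setup, find_packages

setup(
    name='botree',
    version='0.1.0',
    packages=find_packages(),
)

Installed with pip install . (or built as a Conda package) into the same environment on every node, import botree then works on the workers without shipping anything through pyFiles.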