Google is full of solutions to this problem, but unfortunately, even after trying all of them, I am unable to get it working, so please bear with me and see if something strikes you.
OS: macOS
Spark: 1.6.3 (Scala 2.10)
Jupyter Notebook: 4.4.0
Python: 2.7
Scala: 2.12.1
I was able to successfully install and run Jupyter Notebook. Next, I tried configuring it to work with Spark, for which I installed the Spark interpreter using Apache Toree. Now, when I try running any RDD operation in the notebook, the following error is thrown:
Error from python worker:
/usr/bin/python: No module named pyspark
PYTHONPATH was:
/private/tmp/hadoop-xxxx/nm-local-dir/usercache/xxxx/filecache/33/spark-assembly-1.6.3-hadoop2.2.0.jar
Things already tried:
1. Set PYTHONPATH in .bash_profile
2. Am able to import 'pyspark' in the Python CLI locally
3. Have tried updating the interpreter's kernel.json to the following:
{
  "language": "python",
  "display_name": "Apache Toree - PySpark",
  "env": {
    "__TOREE_SPARK_OPTS__": "",
    "SPARK_HOME": "/Users/xxxx/Desktop/utils/spark",
    "__TOREE_OPTS__": "",
    "DEFAULT_INTERPRETER": "PySpark",
    "PYTHONPATH": "/Users/xxxx/Desktop/utils/spark/python:/Users/xxxx/Desktop/utils/spark/python/lib/py4j-0.9-src.zip:/Users/xxxx/Desktop/utils/spark/python/lib/pyspark.zip:/Users/xxxx/Desktop/utils/spark/bin",
    "PYSPARK_SUBMIT_ARGS": "--master local --conf spark.serializer=org.apache.spark.serializer.KryoSerializer",
    "PYTHON_EXEC": "python"
  },
  "argv": [
    "/usr/local/share/jupyter/kernels/apache_toree_pyspark/bin/run.sh",
    "--profile",
    "{connection_file}"
  ]
}
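To narrow this down, one simple check from a notebook cell (standard library only) is to print what the kernel's Python actually sees; the PYTHONPATH and sys.path entries should include the spark/python directory and the py4j/pyspark zips from the kernel.json above:

import os
import sys

print(sys.executable)                       # which Python interpreter the kernel is running
print(os.environ.get("PYTHONPATH"))         # should contain .../spark/python and the py4j / pyspark zips
print([p for p in sys.path if "spark" in p.lower()])  # Spark-related entries actually on sys.path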
Use the findspark library (pip install findspark) to bypass all of the environment setup. See https://github.com/minrk/findspark for more information.
Use it as below:
import findspark

# Point findspark at the Spark installation before importing anything from pyspark
findspark.init('/path_to_spark/spark-x.x.x-bin-hadoopx.x')
from pyspark.sql import SparkSession
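Note that SparkSession only exists in Spark 2.x; with Spark 1.6.3 (as in the question) you would work with SparkContext / SQLContext instead. As a quick sanity check that pyspark is now importable, a minimal sketch (the findspark.init path below is only a placeholder for your real Spark home) is:

import findspark
findspark.init('/path_to_spark/spark-x.x.x-bin-hadoopx.x')  # placeholder path; point at your actual Spark install

from pyspark import SparkContext

# Run a trivial RDD job locally to confirm the Python workers can import pyspark
sc = SparkContext(master="local", appName="pyspark-smoke-test")
print(sc.parallelize(range(100)).count())  # should print 100
sc.stop()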