Jupyter PySpark: no module named pyspark

Saurabh Mishra · Feb 3, 2017 · Viewed 31.7k times

Google is littered with solutions to this problem, but unfortunately, even after trying all of them, I am unable to get it working, so please bear with me and see if something strikes you.

OS: macOS

Spark : 1.6.3 (2.10)

Jupyter Notebook : 4.4.0

Python : 2.7

Scala : 2.12.1

I was able to successfully install and run Jupyter Notebook. Next, I tried configuring it to work with Spark, for which I installed a Spark interpreter using Apache Toree. Now, when I run any RDD operation in the notebook, the following error is thrown:

Error from python worker:
  /usr/bin/python: No module named pyspark
PYTHONPATH was:
  /private/tmp/hadoop-xxxx/nm-local-dir/usercache/xxxx/filecache/33/spark-assembly-1.6.3-hadoop2.2.0.jar
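
For completeness, the cell is nothing exotic; any RDD action of roughly this shape (an illustrative sketch, not my exact code) triggers it:

# Illustrative sketch -- any RDD action that ships work to a Python worker fails the same way.
rdd = sc.parallelize(range(10))  # `sc` is the SparkContext provided by the Toree PySpark kernel
rdd.count()                      # the action launches /usr/bin/python, which cannot import pyspark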

Things already tried (see the environment sketch after this list):

1. Set PYTHONPATH in .bash_profile.
2. I am able to import pyspark in the Python CLI locally.
3. Tried updating the interpreter kernel.json to the following:

{
  "language": "python",
  "display_name": "Apache Toree - PySpark",
  "env": {
    "__TOREE_SPARK_OPTS__": "",
    "SPARK_HOME": "/Users/xxxx/Desktop/utils/spark",
    "__TOREE_OPTS__": "",
    "DEFAULT_INTERPRETER": "PySpark",
    "PYTHONPATH": "/Users/xxxx/Desktop/utils/spark/python:/Users/xxxx/Desktop/utils/spark/python/lib/py4j-0.9-src.zip:/Users/xxxx/Desktop/utils/spark/python/lib/pyspark.zip:/Users/xxxx/Desktop/utils/spark/bin",
  "PYSPARK_SUBMIT_ARGS": "--master local --conf spark.serializer=org.apache.spark.serializer.KryoSerializer",
    "PYTHON_EXEC": "python"
  },
  "argv": [
    "/usr/local/share/jupyter/kernels/apache_toree_pyspark/bin/run.sh",
    "--profile",
    "{connection_file}"
  ]
}
  4. Have even updated the interpreter's run.sh to explicitly load the py4j-0.9-src.zip and pyspark.zip files. When opening the PySpark notebook and creating a SparkContext, I can see the spark-assembly, py4j, and pyspark packages being uploaded from local, but when an action is invoked, pyspark is still not found.
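
For reference, what items 1 and 3 above are trying to achieve can be sketched in plain Python, reusing the paths from the kernel.json (the SPARK_HOME below is the one from my machine; adjust as needed). The key point is that whatever Python the workers run (here /usr/bin/python) must also be able to see the pyspark and py4j archives:

import os

# Paths taken from the kernel.json above -- adjust to your own Spark install.
SPARK_HOME = "/Users/xxxx/Desktop/utils/spark"

os.environ["SPARK_HOME"] = SPARK_HOME
# Make the pyspark package and the bundled py4j visible to any Python that Spark launches.
os.environ["PYTHONPATH"] = os.pathsep.join([
    os.path.join(SPARK_HOME, "python"),
    os.path.join(SPARK_HOME, "python", "lib", "py4j-0.9-src.zip"),
    os.path.join(SPARK_HOME, "python", "lib", "pyspark.zip"),
])
# Ensure driver and workers use the same interpreter; otherwise workers may
# fall back to /usr/bin/python, which cannot import pyspark (as in the error above).
os.environ["PYSPARK_PYTHON"] = "python"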

Answer

kay · Nov 25, 2017

Use the findspark library to bypass all of the environment setup. See https://github.com/minrk/findspark for more information.

Use it as below.

import findspark

# Point findspark at your Spark installation before importing anything from pyspark
findspark.init('/path_to_spark/spark-x.x.x-bin-hadoopx.x')

from pyspark.sql import SparkSession
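
After findspark.init(), the pyspark package is importable from the notebook's Python. A quick sanity check (a minimal sketch; the spark-x.x.x path above is a placeholder you must fill in, and the app name is arbitrary) looks like:

import findspark
findspark.init('/path_to_spark/spark-x.x.x-bin-hadoopx.x')  # placeholder path

from pyspark.sql import SparkSession

# Build a local session and run a trivial action to confirm the workers
# can import pyspark as well.
spark = SparkSession.builder.master("local[*]").appName("findspark-check").getOrCreate()
print(spark.sparkContext.parallelize(range(5)).sum())
spark.stop()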