Error while running PySpark Dataproc job due to Python version

Kassem Shehady · Jul 19, 2018 · Viewed 7.7k times

I created a Dataproc cluster using the following command:

gcloud dataproc clusters create datascience \
--initialization-actions \
    gs://dataproc-initialization-actions/jupyter/jupyter.sh

However, when I submit my PySpark job, I get the following error:

Exception: Python in worker has different version 3.4 than that in driver 3.7, PySpark cannot run with different minor versions.Please check environment variables PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON are correctly set.
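
For reference, the two variables named in the error can also be pinned per job rather than cluster-wide. A minimal sketch (not from the original post, region flags omitted), assuming Spark 2.1+, where the spark.pyspark.python and spark.pyspark.driver.python properties override those environment variables, and a hypothetical job file main.py:

# Force the driver and the executors onto the same interpreter for one job;
# main.py is a placeholder for the actual job file.
gcloud dataproc jobs submit pyspark main.py \
    --cluster datascience \
    --properties spark.pyspark.python=python3,spark.pyspark.driver.python=python3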

Any thoughts?

Answer

brotich · Jul 22, 2018

This is due to a Python version mismatch between the master and the workers. By default, the Jupyter initialization action installs the latest Miniconda release on the master, which uses Python 3.7, while the workers are still on the cluster's default Python 3.6.
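
One quick way to confirm the mismatch (a sketch; node names follow Dataproc's <cluster>-m / <cluster>-w-N convention, and you may need to pass --zone) is to check the interpreter on each node:

# Print the Python version on the master and on the first worker.
gcloud compute ssh datascience-m --command "python --version"
gcloud compute ssh datascience-w-0 --command "python --version"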

Solution: pin the Miniconda version when creating the cluster, so the master also gets Python 3.6 (Miniconda 4.3.30 ships Python 3.6):

gcloud dataproc clusters create example-cluster \
--metadata=MINICONDA_VERSION=4.3.30 \
--initialization-actions \
    gs://dataproc-initialization-actions/jupyter/jupyter.sh

Note:

  • pinning a specific Miniconda release may need updating over time; a more sustainable approach is to manage the environment explicitly so the driver and the workers always agree on an interpreter, as sketched below
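
For example, one more durable option (a sketch, not part of the original answer; the spark-env: prefix is standard Dataproc cluster-property syntax, but it assumes a python3 binary of the same minor version exists on every node) is to point the driver and the workers at the same interpreter when the cluster is created:

# Write PYSPARK_PYTHON / PYSPARK_DRIVER_PYTHON into spark-env.sh on all nodes,
# so the driver and the executors agree regardless of what Miniconda installs.
gcloud dataproc clusters create datascience \
    --initialization-actions \
        gs://dataproc-initialization-actions/jupyter/jupyter.sh \
    --properties spark-env:PYSPARK_PYTHON=python3,spark-env:PYSPARK_DRIVER_PYTHON=python3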