I created a Dataproc cluster using the following command:
gcloud dataproc clusters create datascience \
    --initialization-actions \
    gs://dataproc-initialization-actions/jupyter/jupyter.sh
However, when I submit my PySpark job, I get the following error:
Exception: Python in worker has different version 3.4 than that in driver 3.7, PySpark cannot run with different minor versions.Please check environment variables PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON are correctly set.
Any thoughts?
This is caused by a Python version mismatch between the master and the worker nodes. By default, the Jupyter initialization action installs the latest Miniconda release, which currently ships Python 3.7, while the workers are still on the cluster's default Python version.
Solution: pin the Miniconda version when creating the cluster, i.e. install Python 3.6 on the master node:
gcloud dataproc clusters create example-cluster \
    --metadata=MINICONDA_VERSION=4.3.30 \
    --initialization-actions \
    gs://dataproc-initialization-actions/jupyter/jupyter.sh
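For context, the error is raised because PySpark compares only the major.minor components of the driver and worker interpreter versions, so even 3.6 vs 3.7 would be rejected. Here is a minimal sketch of that comparison (the function name and message format are illustrative, not PySpark's actual code):

```python
def check_python_versions(worker_version: str, driver_version: str) -> None:
    """Raise if the worker and driver Python minor versions differ,
    mimicking the check behind the error message above."""
    # Only "major.minor" (e.g. "3.6") is compared; the patch level is ignored.
    worker_minor = ".".join(worker_version.split(".")[:2])
    driver_minor = ".".join(driver_version.split(".")[:2])
    if worker_minor != driver_minor:
        raise RuntimeError(
            "Python in worker has different version %s than that in driver %s"
            % (worker_minor, driver_minor)
        )

check_python_versions("3.6.5", "3.6.8")  # same minor version: no error
# check_python_versions("3.4.9", "3.7.2") would raise RuntimeError
```

This is why pinning the Miniconda release matters: both sides of the job must end up on the same major.minor Python, regardless of patch version.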