Connecting to remote Spark Cluster

apache-spark pyspark cluster-computing apache-spark-standalone

beginner_ · Apr 17, 2018 · Viewed 7.6k times · Source

I'm trying to host locally a spark standalone cluster. I have two heterogeneous machines connected on a LAN. Each piece of the architecture listed below is running on docker. I have the following configuration

master on machine 1 (port 7077 exposed)
worker on machine 1
driver on machine 2

I use a test application that opens a file and counts its lines. The application works when the file replicated on all workers and I use SparkContext.readText()

But when when the file is only present on worker while I'm using SparkContext.parallelize() to access it on workers, I have the following display :

INFO StandaloneSchedulerBackend: Granted executor ID app-20180116210619-0007/4 on hostPort 172.17.0.3:6598 with 4 cores, 1024.0 MB RAM
INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20180116210619-0007/4 is now RUNNING
INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20180116210619-0007/4 is now EXITED (Command exited with code 1)
INFO StandaloneSchedulerBackend: Executor app-20180116210619-0007/4 removed: Command exited with code 1
INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20180116210619-0007/5 on worker-20180116205132-172.17.0.3-6598 (172.17.0.3:6598) with 4 cores```

that goes on and on again without actually computing the app.

This is working when I put the driver on the same pc as the worker. So I guess there is some kind of connection to permit between so two across the network. Are you aware of a way to do that (which ports to open, which adress to add in /etc/hosts ...)

Connecting to remote Spark Cluster

Answer

Related questions