I have been experimenting and googling for many hours, with no luck.
I have a Spark Streaming app that runs fine on a local Spark cluster. Now I need to deploy it on Cloudera 5.4.4. I need to be able to start it, have it run continually in the background, and be able to stop it.
I tried this:
$ spark-submit --master yarn-cluster --class MyMain my.jar myArgs
But it just prints these lines endlessly:
15/07/28 17:58:18 INFO Client: Application report for application_1438092860895_0012 (state: RUNNING)
15/07/28 17:58:19 INFO Client: Application report for application_1438092860895_0012 (state: RUNNING)
Question number 1: since it is a streaming app, it needs to run continuously. So how do I run it in a "background" mode? All the examples I can find of submitting spark jobs on yarn seem to assume that the application will do some work and terminate, and therefore that you would want to run it in the foreground. But that is not the case for streaming.
Next up... at this point the app does not seem to be functioning. I figure it could be a bug or misconfiguration on my part, so I tried to look in the logs to see what's happening:
$ yarn logs -applicationId application_1438092860895_0012
But it tells me:
/tmp/logs/hdfs/logs/application_1438092860895_0012 does not have any log files.
So question number 2: If the application is RUNNING, why does it have no log files?
So eventually I just had to kill it:
$ yarn application -kill application_1438092860895_0012
That brings up question number 3: assuming I can eventually get the app launched and running in the background, is "yarn application -kill" the preferred way of stopping it?
You can simply close the spark-submit console: in yarn-cluster mode the job is already running in the background on the cluster by the time it writes out the RUNNING state, so killing or detaching the local client does not affect the application.
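If you want your terminal back, you can just background or nohup the submit command itself, and if your Spark build is 1.4 or newer, the YARN client can also be told not to wait around at all via spark.yarn.submit.waitAppCompletion (that property doesn't exist in older releases, so check whether your CDH version supports it; the log file name here is just a placeholder):

$ nohup spark-submit --master yarn-cluster --class MyMain my.jar myArgs > submit.log 2>&1 &

$ spark-submit --master yarn-cluster --conf spark.yarn.submit.waitAppCompletion=false --class MyMain my.jar myArgs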
yarn application -kill is probably the best way to stop a Spark Streaming application, but it's not perfect. It would be better to do some kind of graceful shutdown that stops all stream receivers and then stops the streaming context, but I personally don't know an established way to do it.
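For what it's worth, here is a minimal sketch of one possible approach, assuming a Scala app: register a JVM shutdown hook that asks the StreamingContext for a graceful stop, so that the SIGTERM sent by yarn application -kill lets in-flight batches finish. StreamingContext.stop(stopSparkContext, stopGracefully) is real Spark Streaming API; the object name and batch interval are made up for illustration:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object MyMain {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("MyMain")
    val ssc = new StreamingContext(conf, Seconds(10))

    // ... build your DStreams here ...

    // On SIGTERM, stop the receivers first and let queued batches
    // drain before tearing down the SparkContext.
    sys.addShutdownHook {
      ssc.stop(stopSparkContext = true, stopGracefully = true)
    }

    ssc.start()
    ssc.awaitTermination()
  }
}

Be aware that YARN follows its SIGTERM with a SIGKILL after a short grace period, so a long graceful stop can still be cut off. Newer Spark releases (1.4+) also have a spark.streaming.stopGracefullyOnShutdown property that does roughly the same thing automatically.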