Error ExecutorLostFailure when running a task in Spark

User17 · Jul 21, 2015


Hi, I am a beginner in Spark. I am trying to run a job on Spark 1.4.1 with 8 slave nodes, each with 11.7 GB of memory and 3.2 GB of disk. I am running the Spark task from one of the 8 slave nodes (so with a storage fraction of 0.7, only about 4.8 GB is available on each node) and using Mesos as the cluster manager. I am using this configuration:

spark.master mesos://uc1f-bioinfocloud-vamp-m-1:5050
spark.eventLog.enabled true
spark.driver.memory 6g
spark.storage.memoryFraction 0.7
spark.core.connection.ack.wait.timeout 800
spark.akka.frameSize 50
spark.rdd.compress true

I am trying to run the Spark MLlib Naive Bayes algorithm on a folder of around 14 GB of data. (There is no issue when I run the task on a 6 GB folder.) I am reading this folder from Google Storage as an RDD and passing 32 as the partition parameter (I have tried increasing the partitions as well). I then use TF to create feature vectors and predict on the basis of those. But when I try to run it on this folder it throws ExecutorLostFailure every time. I have tried different configurations but nothing helps. Maybe I am missing something very basic, but I am not able to figure it out. Any help or suggestion will be highly valuable.
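Roughly, the script does the following (a simplified sketch; the real script at /opt/work/V2ProcessRecords.py is longer, and the feature size and labeling shown here are just placeholders):

    from pyspark import SparkContext
    from pyspark.mllib.feature import HashingTF
    from pyspark.mllib.classification import NaiveBayes
    from pyspark.mllib.regression import LabeledPoint

    sc = SparkContext(appName="V2ProcessRecords")

    # Read the whole folder from Google Storage as (path, content) pairs,
    # hinting 32 partitions.
    docs = sc.wholeTextFiles("gs://uc1f-bioinfocloud-vamp-m/literature/xml/P*/*.nxml", 32)

    # Term-frequency features via the hashing trick (feature size is a placeholder).
    tf = HashingTF(numFeatures=1 << 18)

    def to_labeled_point(record):
        path, text = record
        label = 1.0 if "some-marker" in text else 0.0  # placeholder labeling logic
        return LabeledPoint(label, tf.transform(text.split()))

    training = docs.map(to_labeled_point)
    model = NaiveBayes.train(training)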

Log is:

   15/07/21 01:18:20 ERROR TaskSetManager: Task 3 in stage 2.0 failed 4 times; aborting job    
15/07/21 01:18:20 INFO TaskSchedulerImpl: Cancelling stage 2    
15/07/21 01:18:20 INFO TaskSchedulerImpl: Stage 2 was cancelled    
15/07/21 01:18:20 INFO DAGScheduler: ResultStage 2 (collect at /opt/work/V2ProcessRecords.py:213) failed in 28.966 s    
15/07/21 01:18:20 INFO DAGScheduler: Executor lost: 20150526-135628-3255597322-5050-1304-S8 (epoch 3)    
15/07/21 01:18:20 INFO BlockManagerMasterEndpoint: Trying to remove executor 20150526-135628-3255597322-5050-1304-S8 from BlockManagerMaster.    
15/07/21 01:18:20 INFO DAGScheduler: Job 2 failed: collect at /opt/work/V2ProcessRecords.py:213, took 29.013646 s    
Traceback (most recent call last):    
  File "/opt/work/V2ProcessRecords.py", line 213, in <module>
    secondPassRDD = firstPassRDD.map(lambda ( name, title,  idval, pmcId, pubDate, article, tags , author, ifSigmaCust, wclass): ( str(name), title,  idval, pmcId, pubDate, article, tags , author, ifSigmaCust , "Yes" if ("PMC" + pmcId) in rddNIHGrant else ("No") , wclass)).collect()    
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 745, in collect    
  File "/usr/local/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__    
  File "/usr/local/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 2.0 failed 4 times, most recent failure: Lost task 3.3 in stage 2.0 (TID 12, vamp-m-2.c.quantum-854.internal): ExecutorLostFailure (executor 20150526-135628-3255597322-5050-1304-S8 lost)    
Driver stacktrace:    
        at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1266)
        at       org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1257)
        at    org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1256)
        at     scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
        at     org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1256)
        at    org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
        at scala.Option.foreach(Option.scala:236)
        at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:730)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1450)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1411)
        at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)

15/07/21 01:18:20 INFO BlockManagerMaster: Removed 20150526-135628-3255597322-5050-1304-S8 successfully in removeExecutor
15/07/21 01:18:20 INFO DAGScheduler: Host added was in lost list earlier:vamp-m-2.c.quantum-854.internal
Jul 21, 2015 1:01:15 AM INFO: parquet.hadoop.ParquetFileReader: Initiating action with parallelism: 5
15/07/21 01:18:20 INFO SparkContext: Invoking stop() from shutdown hook



{"Event":"SparkListenerTaskStart","Stage ID":2,"Stage Attempt ID":0,"Task Info":{"Task ID":11,"Index":6,"Attempt":2,"Launch Time":1437616381852,"Executor ID":"20150526-135628-3255597322-5050-1304-S8","Host":"uc1f-bioinfocloud-vamp-m-2.c.quantum-device-854.internal","Locality":"PROCESS_LOCAL","Speculative":false,"Getting Result Time":0,"Finish Time":0,"Failed":false,"Accumulables":[]}}

{"Event":"SparkListenerExecutorRemoved","Timestamp":1437616389696,"Executor ID":"20150526-135628-3255597322-5050-1304-S8","Removed Reason":"Lost executor"} {"Event":"SparkListenerTaskEnd","Stage ID":2,"Stage Attempt ID":0,"Task Type":"ResultTask","Task End Reason":{"Reason":"ExecutorLostFailure","Executor ID":"20150526-135628-3255597322-5050-1304-S8"},"Task Info":{"Task ID":11,"Index":6,"Attempt":2,"Launch Time":1437616381852,"Executor ID":"20150526-135628-3255597322-5050-1304-S8","Host":"uc1f-bioinfocloud-vamp-m-2.c.quantum-device-854.internal","Locality":"PROCESS_LOCAL","Speculative":false,"Getting Result Time":0,"Finish Time":1437616389697,"Failed":true,"Accumulables":[]}} {"Event":"SparkListenerExecutorAdded","Timestamp":1437616389707,"Executor ID":"20150526-135628-3255597322-5050-1304-S8","Executor Info":{"Host":"uc1f-bioinfocloud-vamp-m-2.c.quantum-device-854.internal","Total Cores":1,"Log Urls":{}}} {"Event":"SparkListenerTaskStart","Stage ID":2,"Stage Attempt ID":0,"Task Info":{"Task ID":12,"Index":6,"Attempt":3,"Launch Time":1437616389702,"Executor ID":"20150526-135628-3255597322-5050-1304-S8","Host":"uc1f-bioinfocloud-vamp-m-2.c.quantum-device-854.internal","Locality":"PROCESS_LOCAL","Speculative":false,"Getting Result Time":0,"Finish Time":0,"Failed":false,"Accumulables":[]}} {"Event":"SparkListenerExecutorRemoved","Timestamp":1437616397743,"Executor ID":"20150526-135628-3255597322-5050-1304-S8","Removed Reason":"Lost executor"} {"Event":"SparkListenerTaskEnd","Stage ID":2,"Stage Attempt ID":0,"Task Type":"ResultTask","Task End Reason":{"Reason":"ExecutorLostFailure","Executor ID":"20150526-135628-3255597322-5050-1304-S8"},"Task Info":{"Task ID":12,"Index":6,"Attempt":3,"Launch Time":1437616389702,"Executor ID":"20150526-135628-3255597322-5050-1304-S8","Host":"uc1f-bioinfocloud-vamp-m-2.c.quantum-device-854.internal","Locality":"PROCESS_LOCAL","Speculative":false,"Getting Result Time":0,"Finish Time":1437616397743,"Failed":true,"Accumulables":[]}} {"Event":"SparkListenerStageCompleted","Stage Info":{"Stage ID":2,"Stage Attempt ID":0,"Stage Name":"collect at /opt/work/V2ProcessRecords.py:215","Number of Tasks":72,"RDD Info":[{"RDD ID":6,"Name":"PythonRDD","Parent IDs":[0],"Storage Level":{"Use Disk":false,"Use Memory":false,"Use ExternalBlockStore":false,"Deserialized":false,"Replication":1},"Number of Partitions":72,"Number of Cached Partitions":0,"Memory Size":0,"ExternalBlockStore Size":0,"Disk Size":0},{"RDD ID":0,"Name":"gs://uc1f-bioinfocloud-vamp-m/literature/xml/P*/*.nxml","Scope":"{\"id\":\"0\",\"name\":\"wholeTextFiles\"}","Parent IDs":[],"Storage Level":{"Use Disk":false,"Use Memory":false,"Use ExternalBlockStore":false,"Deserialized":false,"Replication":1},"Number of Partitions":72,"Number of Cached Partitions":0,"Memory Size":0,"ExternalBlockStore Size":0,"Disk Size":0}],"Parent IDs":[],"Details":"","Submission Time":1437616365566,"Completion Time":1437616397753,"Failure Reason":"Job aborted due to stage failure: Task 6 in stage 2.0 failed 4 times, most recent failure: Lost task 6.3 in stage 2.0 (TID 12, uc1f-bioinfocloud-vamp-m-2.c.quantum-device-854.internal): ExecutorLostFailure (executor 20150526-135628-3255597322-5050-1304-S8 lost)\nDriver stacktrace:","Accumulables":[]}} {"Event":"SparkListenerJobEnd","Job ID":2,"Completion Time":1437616397755,"Job Result":{"Result":"JobFailed","Exception":{"Message":"Job aborted due to stage failure: Task 6 in stage 2.0 failed 4 times, most recent failure: Lost task 6.3 in stage 2.0 (TID 12, 
uc1f-bioinfocloud-vamp-m-2.c.quantum-device-854.internal): ExecutorLostFailure (executor 20150526-135628-3255597322-5050-1304-S8 lost)\nDriver stacktrace:","Stack Trace":[{"Declaring Class":"org.apache.spark.scheduler.DAGScheduler","Method Name":"org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages","File Name":"DAGScheduler.scala","Line Number":1266},{"Declaring Class":"org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1","Method Name":"apply","File Name":"DAGScheduler.scala","Line Number":1257},{"Declaring Class":"org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1","Method Name":"apply","File Name":"DAGScheduler.scala","Line Number":1256},{"Declaring Class":"scala.collection.mutable.ResizableArray$class","Method Name":"foreach","File Name":"ResizableArray.scala","Line Number":59},{"Declaring Class":"scala.collection.mutable.ArrayBuffer","Method Name":"foreach","File Name":"ArrayBuffer.scala","Line Number":47},{"Declaring Class":"org.apache.spark.scheduler.DAGScheduler","Method Name":"abortStage","File Name":"DAGScheduler.scala","Line Number":1256},{"Declaring Class":"org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1","Method Name":"apply","File Name":"DAGScheduler.scala","Line Number":730},{"Declaring Class":"org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1","Method Name":"apply","File Name":"DAGScheduler.scala","Line Number":730},{"Declaring Class":"scala.Option","Method Name":"foreach","File Name":"Option.scala","Line Number":236},{"Declaring Class":"org.apache.spark.scheduler.DAGScheduler","Method Name":"handleTaskSetFailed","File Name":"DAGScheduler.scala","Line Number":730},{"Declaring Class":"org.apache.spark.scheduler.DAGSchedulerEventProcessLoop","Method Name":"onReceive","File Name":"DAGScheduler.scala","Line Number":1450},{"Declaring Class":"org.apache.spark.scheduler.DAGSchedulerEventProcessLoop","Method Name":"onReceive","File Name":"DAGScheduler.scala","Line Number":1411},{"Declaring Class":"org.apache.spark.util.EventLoop$$anon$1","Method Name":"run","File Name":"EventLoop.scala","Line Number":48}]}}}

Answer

joshi.n · Oct 3, 2017

As I understand it, the most common cause of ExecutorLostFailure is an OOM (out of memory) error in an executor.

To resolve an OOM issue, you need to figure out what exactly is causing it. Simply increasing the default parallelism or the executor memory is not a strategic solution.

Increasing parallelism creates more, smaller partitions so that each task works on less data. But if your data is skewed, so that the key on which the data is partitioned (for parallelism) carries most of the records, simply increasing parallelism will have no effect.
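One quick way to check for skew (a rough sketch; `rdd` here is a stand-in for whatever RDD feeds the failing stage) is to count how many records land in each partition:

    # Count records per partition to see whether a few partitions carry most of the data.
    sizes = (rdd
             .mapPartitionsWithIndex(lambda idx, it: [(idx, sum(1 for _ in it))])
             .collect())

    # Print the ten largest partitions.
    for idx, n in sorted(sizes, key=lambda x: -x[1])[:10]:
        print("partition %d -> %d records" % (idx, n))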

Similarly, just increasing executor memory is a very inefficient way of handling such a scenario: if only one executor is failing with ExecutorLostFailure, requesting more memory for all the executors makes your application require much more memory than it actually needs.
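If skew really is the problem, one common workaround (a generic sketch, not specific to this question) is to "salt" the hot key so its records are spread across several tasks before the wide operation, then combine the partial results:

    import random

    SALT_BUCKETS = 8  # number of sub-keys to spread a hot key over (tuning knob)

    # `pairs` is a hypothetical (key, value) RDD with an additive value.
    salted = pairs.map(lambda kv: ((kv[0], random.randint(0, SALT_BUCKETS - 1)), kv[1]))

    # First aggregate on the salted key, then strip the salt and aggregate again.
    partial = salted.reduceByKey(lambda a, b: a + b)
    final = (partial
             .map(lambda kv: (kv[0][0], kv[1]))
             .reduceByKey(lambda a, b: a + b))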