I am using HDP 2.5, running spark-submit as yarn cluster mode.
I have tried to generate data using dataframe cross join. i.e
val generatedData = df1.join(df2).join(df3).join(df4)
generatedData.saveAsTable(...)....
df1 storage level is MEMORY_AND_DISK
df2,df3,df4 storage level is MEMORY_ONLY
df1 has much more records i.e 5 million while df2 to df4 has at most 100 records. doing so my explain plain would result with better performance using BroadcastNestedLoopJoin explain plan.
for some reason it always fail. I don't know how can I debug it and where the memory explode.
Error log output:
16/12/06 19:44:08 WARN YarnAllocator: Container marked as failed: container_e33_1480922439133_0845_02_000002 on host: hdp4. Exit status: 143. Diagnostics: Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143
Killed by external signal
16/12/06 19:44:08 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Container marked as failed: container_e33_1480922439133_0845_02_000002 on host: hdp4. Exit status: 143. Diagnostics: Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143
Killed by external signal
16/12/06 19:44:08 ERROR YarnClusterScheduler: Lost executor 1 on hdp4: Container marked as failed: container_e33_1480922439133_0845_02_000002 on host: hdp4. Exit status: 143. Diagnostics: Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143
Killed by external signal
16/12/06 19:44:08 WARN TaskSetManager: Lost task 1.0 in stage 12.0 (TID 19, hdp4): ExecutorLostFailure (executor 1 exited caused by one of the running tasks) Reason: Container marked as failed: container_e33_1480922439133_0845_02_000002 on host: hdp4. Exit status: 143. Diagnostics: Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143
Killed by external signal
I didn't see any WARN or ERROR logs before this error. What is the problem? where should I look for the memory consumption? I cannot see anything on the Storage tab of SparkUI. the log was taken from yarn resource manager UI on HDP 2.5
EDIT
looking at the container log, it seems like it's a java.lang.OutOfMemoryError: GC overhead limit exceeded
I know how to increase the memory, but I don't have any memory anymore. How can I do a cartesian / product join with 4 Dataframes without getting this error.
I also meet this problem and try to solve it by refering some blog. 1. Run spark add conf bellow:
--conf 'spark.driver.extraJavaOptions=-XX:+UseCompressedOops -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps' \ --conf 'spark.executor.extraJavaOptions=-XX:+UseCompressedOops -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintHeapAtGC ' \
Heap after GC invocations=157 (full 98): PSYoungGen total 940544K, used 853456K [0x0000000781800000, 0x00000007c0000000, 0x00000007c0000000) eden space 860160K, 99% used [0x0000000781800000,0x00000007b5974118,0x00000007b6000000) from space 80384K, 0% used [0x00000007b6000000,0x00000007b6000000,0x00000007bae80000) to space 77824K, 0% used [0x00000007bb400000,0x00000007bb400000,0x00000007c0000000) ParOldGen total 2048000K, used 2047964K [0x0000000704800000, 0x0000000781800000, 0x0000000781800000) object space 2048000K, 99% used [0x0000000704800000,0x00000007817f7148,0x0000000781800000) Metaspace used 43044K, capacity 43310K, committed 44288K, reserved 1087488K class space used 6618K, capacity 6701K, committed 6912K, reserved 1048576K }
Both PSYoungGen and ParOldGen are 99% ,then you will get java.lang.OutOfMemoryError: GC overhead limit exceeded if more object was created .
Try to add more memory for your executor or your driver when more memory resources are avaliable:
--executor-memory 10000m \
--driver-memory 10000m \
For my case : memory for PSYoungGen are smaller then ParOldGen which causes many young object enter into ParOldGen memory area and finaly ParOldGen are not avaliable.So java.lang.OutOfMemoryError: Java heap space error appear.
Adding conf for executor:
'spark.executor.extraJavaOptions=-XX:NewRatio=1 -XX:+UseCompressedOops -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps '
-XX:NewRatio=rate rate = ParOldGen/PSYoungGen
It dependends.You can try GC strategy like
-XX:+UseSerialGC :Serial Collector
-XX:+UseParallelGC :Parallel Collector
-XX:+UseParallelOldGC :Parallel Old collector
-XX:+UseConcMarkSweepGC :Concurrent Mark Sweep
Java Concurrent and Parallel GC