Optimizing GC on EMR cluster

Question 1


apache-spark garbage-collection jvm emr amazon-emr

Stormbringer · Dec 8, 2016 · Viewed 7.3k times · Source

Answer


Allocation Failure is the normal, and most common, reason for starting a GC cycle: the young generation has no room left for a new object, so the JVM runs a minor collection.

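The logs show a collection about once a second, each taking roughly 9 ms, so GC accounts for about 1% of the time. Each collection also empties nearly the whole young generation (about 909000K down to under 1000K), and total heap use after collection stays flat at about 1370000K. IMO, there is nothing to optimize here.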
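
The 1% estimate can be checked directly against the log lines quoted in the question. Below is a minimal Python sketch, assuming the ParNew log format shown there; the regular expression and the helper names `parse_events` and `gc_overhead` are illustrative and not part of any JVM or Spark tool.

```python
import re
from datetime import datetime

# Sample lines copied from the executor stdout quoted in the question.
SAMPLE_LOG = """\
2016-12-07T23:42:20.614+0000: [GC (Allocation Failure) 2016-12-07T23:42:20.614+0000: [ParNew: 909549K->432K(1022400K), 0.0089234 secs] 2279433K->1370373K(3294336K), 0.0090530 secs] [Times: user=0.11 sys=0.00, real=0.00 secs]
2016-12-07T23:42:21.572+0000: [GC (Allocation Failure) 2016-12-07T23:42:21.572+0000: [ParNew: 909296K->435K(1022400K), 0.0089298 secs] 2279237K->1370376K(3294336K), 0.0091147 secs] [Times: user=0.11 sys=0.01, real=0.00 secs]
2016-12-07T23:42:22.525+0000: [GC (Allocation Failure) 2016-12-07T23:42:22.525+0000: [ParNew: 909299K->485K(1022400K), 0.0080858 secs] 2279240K->1370427K(3294336K), 0.0082357 secs] [Times: user=0.12 sys=0.00, real=0.01 secs]
2016-12-07T23:42:23.474+0000: [GC (Allocation Failure) 2016-12-07T23:42:23.474+0000: [ParNew: 909349K->547K(1022400K), 0.0090641 secs] 2279291K->1370489K(3294336K), 0.0091965 secs] [Times: user=0.12 sys=0.00, real=0.00 secs]
"""

# Captures the leading timestamp and the total pause, which is the last
# "<seconds> secs]" value immediately before the "[Times:" block.
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3}[+-]\d{4}): \[GC .*?"
    r"(?P<pause>\d+\.\d+) secs\] \[Times:"
)


def parse_events(text):
    """Return a list of (timestamp, pause_seconds) for each matching GC line."""
    events = []
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if m:
            ts = datetime.strptime(m.group("ts"), "%Y-%m-%dT%H:%M:%S.%f%z")
            events.append((ts, float(m.group("pause"))))
    return events


def gc_overhead(events):
    """Estimate the fraction of time spent in GC: mean pause / mean interval."""
    if len(events) < 2:
        raise ValueError("need at least two GC events to measure an interval")
    span = (events[-1][0] - events[0][0]).total_seconds()
    mean_interval = span / (len(events) - 1)
    mean_pause = sum(pause for _, pause in events) / len(events)
    return mean_pause / mean_interval


events = parse_events(SAMPLE_LOG)
overhead = gc_overhead(events)
print(f"{len(events)} collections, GC overhead {overhead:.2%}")
```

For these four sample lines the estimate comes out just under 1%, which agrees with the figure above. On a real executor log the same ratio over a longer window gives a more reliable number.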

Question 2

I am running a Spark Job written in Scala on EMR and the stdout of each executor is filled with GC allocation failures.

2016-12-07T23:42:20.614+0000: [GC (Allocation Failure) 2016-12-07T23:42:20.614+0000: [ParNew: 909549K->432K(1022400K), 0.0089234 secs] 2279433K->1370373K(3294336K), 0.0090530 secs] [Times: user=0.11 sys=0.00, real=0.00 secs] 
2016-12-07T23:42:21.572+0000: [GC (Allocation Failure) 2016-12-07T23:42:21.572+0000: [ParNew: 909296K->435K(1022400K), 0.0089298 secs] 2279237K->1370376K(3294336K), 0.0091147 secs] [Times: user=0.11 sys=0.01, real=0.00 secs] 
2016-12-07T23:42:22.525+0000: [GC (Allocation Failure) 2016-12-07T23:42:22.525+0000: [ParNew: 909299K->485K(1022400K), 0.0080858 secs] 2279240K->1370427K(3294336K), 0.0082357 secs] [Times: user=0.12 sys=0.00, real=0.01 secs] 
2016-12-07T23:42:23.474+0000: [GC (Allocation Failure) 2016-12-07T23:42:23.474+0000: [ParNew: 909349K->547K(1022400K), 0.0090641 secs] 2279291K->1370489K(3294336K), 0.0091965 secs] [Times: user=0.12 sys=0.00, real=0.00 secs]

I am reading a few TB of data (mostly strings), so I am worried that the constant GC will slow down processing.
I would appreciate any pointers on how to interpret these messages and how to tune GC so that it consumes as little CPU time as possible.
