I have a Java application using CMS garbage collection that suffers from a "ParNew (promotion failed)" full GC a few times every day (see below for an example). I understand that a promotion failure occurs when garbage collection cannot find enough (contiguous) space in the old generation into which to promote an object from the new generation. At this point it is forced to do an expensive stop-the-world full GC. I want to avoid such events.
I have read several articles that suggest possible solutions but I wanted to clarify/consolidate them here:
In case it is relevant, here are my current GC options and a sample of logs preceding a promotion failed event.
-Xmx4g -XX:+UseConcMarkSweepGC -XX:NewRatio=1
2014-12-19T09:38:34.304+0100: [GC (Allocation Failure) [ParNew: 1887488K->209664K(1887488K), 0.0685828 secs] 3115998K->1551788K(3984640K), 0.0690028 secs] [Times: user=0.50 sys=0.02, real=0.07 secs]
2014-12-19T09:38:35.962+0100: [GC (Allocation Failure) [ParNew: 1887488K->208840K(1887488K), 0.0827565 secs] 3229612K->1687030K(3984640K), 0.0831611 secs] [Times: user=0.39 sys=0.03, real=0.08 secs]
2014-12-19T09:38:39.975+0100: [GC (Allocation Failure) [ParNew: 1886664K->114108K(1887488K), 0.0442130 secs] 3364854K->1592298K(3984640K), 0.0446680 secs] [Times: user=0.31 sys=0.00, real=0.05 secs]
2014-12-19T09:38:44.818+0100: [GC (Allocation Failure) [ParNew: 1791932K->167245K(1887488K), 0.0588917 secs] 3270122K->1645435K(3984640K), 0.0593308 secs] [Times: user=0.57 sys=0.00, real=0.06 secs]
2014-12-19T09:38:49.239+0100: [GC (Allocation Failure) [ParNew (promotion failed): 1845069K->1819715K(1887488K), 0.4417916 secs][CMS: 1499941K->647982K(2097152K), 2.4203021 secs] 3323259K->647982K(3984640K), [Metaspace: 137778K->137778K(1177600K)], 2.8626552 secs] [Times: user=3.46 sys=0.01, real=2.86 secs]
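To count how often these events happen and how long each one pauses the application, a log in the format above can be scanned for the "(promotion failed)" marker. The following is only a minimal sketch: the class name `PromotionFailedScan` and its helper methods are made up for illustration, and it assumes every line has the same layout as the sample lines shown here.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any JVM tooling.
public class PromotionFailedScan {
    // Matches the wall-clock time at the end of a line, e.g. "real=2.86 secs".
    private static final Pattern REAL_TIME = Pattern.compile("real=(\\d+\\.\\d+) secs");

    /** True if the log line reports a failed promotion during a ParNew collection. */
    public static boolean isPromotionFailed(String line) {
        return line.contains("(promotion failed)");
    }

    /** Returns the wall-clock pause in seconds, or -1 if the line carries no "real=" value. */
    public static double realPauseSeconds(String line) {
        Matcher m = REAL_TIME.matcher(line);
        return m.find() ? Double.parseDouble(m.group(1)) : -1;
    }

    public static void main(String[] args) throws IOException {
        // Reads the GC log from standard input and prints only the promotion failures.
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        int count = 0;
        String line;
        while ((line = in.readLine()) != null) {
            if (isPromotionFailed(line)) {
                count++;
                System.out.println("pause " + realPauseSeconds(line) + "s: " + line);
            }
        }
        System.out.println("promotion failed events: " + count);
    }
}
```

It could be run as `java PromotionFailedScan < gc.log`, where `gc.log` stands for whatever file the GC output is written to.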
Although increasing the memory is indeed the simplest and most general solution, in this case we seem to have had a particular issue that required a particular solution. In my GC logs I would see entries like this:
GC (CMS Initial Mark) [1 CMS-initial-mark: 2905552K(3145728K)]
which shows that the old gen was ~92% full at the start of the CMS cycle (2905552K used out of 3145728K, roughly 2.8 GB of 3 GB). So the JVM had decided that the "occupancy fraction" should be around 90%. This differs from the default it starts with, which I think is around 68%.
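The percentage can be read straight off the initial-mark entry: the first number is the old gen space in use and the number in parentheses is the old gen capacity. Here is a small sketch of that arithmetic; the class and method names are my own, and the regular expression assumes the exact log format shown above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only.
public class CmsOccupancy {
    // Matches e.g. "CMS-initial-mark: 2905552K(3145728K)" -> used, capacity.
    private static final Pattern INITIAL_MARK =
            Pattern.compile("CMS-initial-mark: (\\d+)K\\((\\d+)K\\)");

    /** Returns old gen occupancy in percent, or -1 if the line is not an initial-mark entry. */
    public static double occupancyPercent(String logLine) {
        Matcher m = INITIAL_MARK.matcher(logLine);
        if (!m.find()) {
            return -1;
        }
        long usedK = Long.parseLong(m.group(1));
        long capacityK = Long.parseLong(m.group(2));
        return 100.0 * usedK / capacityK;
    }

    public static void main(String[] args) {
        String line = "GC (CMS Initial Mark) [1 CMS-initial-mark: 2905552K(3145728K)]";
        System.out.printf("old gen occupancy at initial mark: %.1f%%%n", occupancyPercent(line));
    }
}
```

For the line above this works out to 2905552 / 3145728, a little over 92%.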
Apparently my application behaves in a way that makes the JVM think this is a good thing. But then the application seems to surprise the JVM by suddenly needing more space in old gen to promote objects from new gen.
On adding the GC flags
-XX:CMSInitiatingOccupancyFraction=50 -XX:+UseCMSInitiatingOccupancyOnly
we no longer saw any "promotion failed" events. These flags, respectively, set the initiating occupancy fraction to 50% and tell the JVM to use only this fraction as the trigger rather than adjusting it with its own heuristics. Therefore, as soon as old gen occupancy rises above 50%, the JVM starts a CMS cycle, rather than waiting until occupancy reaches 90% or so, where the chance of a "promotion failed" is much higher.
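To get a feel for why the lower fraction helps, it can be useful to compare how much free old gen space is left at the moment a cycle is triggered. The sketch below is a deliberately simplified model with names of my own choosing: it only does the percentage arithmetic on the old gen capacity from the initial-mark line above, and it ignores fragmentation and whatever is promoted while the concurrent cycle is running.

```java
// Hypothetical helper for illustration; a simplified model, not how the JVM decides internally.
public class CmsHeadroom {
    /** Old gen occupancy (in K) at which a CMS cycle would be triggered for the given fraction. */
    public static long triggerThresholdK(long oldGenCapacityK, int occupancyFractionPercent) {
        return oldGenCapacityK * occupancyFractionPercent / 100;
    }

    /** Old gen space (in K) that is still free when the cycle is triggered. */
    public static long headroomK(long oldGenCapacityK, int occupancyFractionPercent) {
        return oldGenCapacityK - triggerThresholdK(oldGenCapacityK, occupancyFractionPercent);
    }

    public static void main(String[] args) {
        long oldGenK = 3145728L; // old gen capacity taken from the initial-mark log line
        for (int fraction : new int[] {50, 92}) {
            System.out.printf("fraction=%d%% -> cycle starts at %dK, headroom %dK%n",
                    fraction,
                    triggerThresholdK(oldGenK, fraction),
                    headroomK(oldGenK, fraction));
        }
    }
}
```

With a 3145728K old gen, a 50% trigger leaves about 1.5 GB free when the cycle starts, whereas a trigger around 92% leaves only about 250 MB, which is less than the new generation can promote in a bad collection.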