In Java, a concurrent mode failure means that the concurrent collector failed to free enough memory from the tenured and permanent generations, so it has to give up and let a full stop-the-world GC kick in. The end result can be very expensive. I understand the concept, but I have never had a good, comprehensive understanding of A) what can cause a concurrent mode failure and B) what the solution is. This lack of clarity leaves me writing and debugging code without many hints in mind, and I often end up shopping around performance flags from Foo to Bar for no particular reason, just to try them. I'd like to learn from the developers here about your experience. If you have previously run into this kind of performance issue, what was the cause and how did you address it? If you have coding recommendations, please don't be too general. Thanks!

+1  A: 

Sometimes it OOMs pretty quickly and gets killed; sometimes it suffers a long GC period (last time was over 10 hours).

It sounds to me like a memory leak is at the root of your problems.

A CMS failure won't (as I understand it) cause an OOM. Rather, a CMS failure happens because the JVM needs to do too many collections too quickly, and CMS cannot keep up. One situation where lots of collection cycles happen in a short period is when your heap is nearly full.

The really long GC time sounds weird ... but is theoretically possible if your machine was thrashing horribly. However, a long period of repeated GCs is quite plausible if your heap is very nearly full.

You can configure the GC to give up when the heap is 1) at max size and 2) still close to full after a full GC has completed. Try doing this if you haven't done so already. It won't cure your problems, but at least your JVM will get the OOM quickly, allowing a faster service restart and recovery.

EDIT - the option to do this is -XX:GCHeapFreeLimit=nnn where nnn is a number between 0 and 100 giving the minimum percentage of the heap that must be free after the GC. The default is 2. The option is listed in the aptly titled "The most complete list of -XX options for Java 6 JVM" page. (There are lots of -XX options listed there that don't appear in the Sun documentation. Unfortunately the page provides few details on what the options actually do.)

You should probably start looking to see if your application / webapp has memory leaks. If it has, your problems won't go away unless those leaks are found and fixed. In the long term, fiddling with the Hotspot GC options won't fix memory leaks.
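One rough way to tell a leak from a merely busy heap is to look at how full each heap pool still is right after a collection; if that figure keeps climbing from one collection to the next, a leak is likely. The sketch below is only an illustration: the class name PostGcOccupancy and its method are made up for this example, and it relies on the standard java.lang.management API (MemoryPoolMXBean.getCollectionUsage()), which not every pool supports.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;

// Hypothetical helper, not part of any library.
final class PostGcOccupancy {

    /**
     * Fraction (0..1) of the pool that was still in use after its most
     * recent collection, or -1 if the pool cannot report it.
     */
    static double fractionUsedAfterGc(MemoryPoolMXBean pool) {
        MemoryUsage afterGc = pool.getCollectionUsage(); // null if unsupported
        if (afterGc == null || afterGc.getMax() <= 0) {  // max is -1 if undefined
            return -1.0;
        }
        return (double) afterGc.getUsed() / afterGc.getMax();
    }

    public static void main(String[] args) {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getType() != MemoryType.HEAP) {
                continue; // skip non-heap pools such as the code cache
            }
            double f = fractionUsedAfterGc(pool);
            String label = f < 0 ? "n/a" : String.format("%.1f%%", f * 100.0);
            System.out.printf("%-30s used after last GC: %s%n", pool.getName(), label);
        }
    }
}
```

Logging this periodically for the old/tenured pool and plotting it over hours gives a trend line; a sawtooth that returns to the same floor is normal, a floor that keeps rising is not.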

Stephen C
Understood. I know there is a slow leak in our program; we just haven't been able to find it yet. :( In the meantime, we are trying to squeeze out as much as we can to see whether a different GC policy can help mitigate the issue. CMS wouldn't directly cause the OOM, but it often comes with a full GC kicking in, and that can cause major problems. We often see performance start to degrade when "concurrent mode failure" appears in our GC log. But, maybe for lack of experience, we haven't yet been able to find the leak or a GC policy that fits well.
jimx
I feel that a full heap does not necessarily mean there will be a severe GC issue, but not being able to collect much garbage after a full GC is a really bad sign. If I'd like to ask the JVM to kill the app early and quickly, what flags should I use? Faster service restart and recovery does sound promising. At least we wouldn't have to suffer long unresponsiveness. I'd rather die fast. Thanks.
jimx
+3  A: 

Quoted from "Understanding Concurrent Mark Sweep Garbage Collector Logs"

The concurrent mode failure can either be avoided by increasing the tenured generation size or initiating the CMS collection at a lesser heap occupancy by setting CMSInitiatingOccupancyFraction to a lower value

However, if there is really a memory leak in your application, you're just buying time.
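To make the trade-off concrete: CMS runs while the application keeps promoting objects, so a cycle has to start early enough to leave room for whatever gets promoted before it finishes. Here is a back-of-the-envelope sketch under the simplifying assumptions of a constant promotion rate and a known cycle duration (both of which you would have to estimate from your own GC logs). The class and method names are hypothetical, and this is not how the JVM itself decides anything.

```java
// Hypothetical helper: a simplified headroom model, not JVM behaviour.
final class CmsHeadroom {

    /**
     * Highest tenured occupancy (in percent) at which a concurrent cycle
     * could start and still finish before the generation fills up,
     * assuming objects are promoted at a constant rate during the cycle.
     */
    static int maxInitiatingOccupancyPercent(long tenuredBytes,
                                             long promotedBytesPerSec,
                                             double cycleSeconds) {
        if (tenuredBytes <= 0) {
            throw new IllegalArgumentException("tenuredBytes must be positive");
        }
        // Bytes expected to be promoted while the concurrent cycle runs.
        double neededHeadroom = promotedBytesPerSec * cycleSeconds;
        double percent = (tenuredBytes - neededHeadroom) * 100.0 / tenuredBytes;
        // Clamp to the range accepted by an occupancy percentage.
        return (int) Math.max(0.0, Math.min(100.0, Math.floor(percent)));
    }
}
```

For example, with a 1 GB tenured generation, 25 MB/s of promotion and a 10 second cycle, maxInitiatingOccupancyPercent(1_000_000_000L, 25_000_000L, 10.0) returns 75, so under these assumptions you would want CMSInitiatingOccupancyFraction set somewhat below 75 to leave a safety margin.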

If you need fast restart and recovery and prefer a 'die fast' approach I would suggest not using CMS at all. I would stick with '-XX:+UseParallelGC'.

From "Garbage Collector Ergonomics"

The parallel garbage collector (UseParallelGC) throws an out-of-memory exception if an excessive amount of time is being spent collecting a small amount of the heap. To avoid this exception, you can increase the size of the heap. You can also set the parameters -XX:GCTimeLimit=time-limit and -XX:GCHeapFreeLimit=space-limit
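Roughly, the rule quoted above boils down to the check sketched here. It only illustrates the documented rule and is not HotSpot's actual code (the real JVM applies further conditions before it throws); the class and method names are invented for this example. The defaults used are 98 for GCTimeLimit and 2 for GCHeapFreeLimit.

```java
// Hypothetical illustration of the documented overhead-limit rule.
final class GcOverheadRule {
    static final int DEFAULT_GC_TIME_LIMIT = 98;     // percent of total time spent in GC
    static final int DEFAULT_GC_HEAP_FREE_LIMIT = 2; // percent of heap free after GC

    /**
     * True when more than timeLimit percent of the time went to GC
     * and less than freeLimit percent of the heap is free afterwards.
     */
    static boolean exceedsLimits(double gcTimePercent, double heapFreePercent,
                                 int timeLimit, int freeLimit) {
        return gcTimePercent > timeLimit && heapFreePercent < freeLimit;
    }

    public static void main(String[] args) {
        // 99% of time in GC, 1% of heap free: both limits are violated.
        System.out.println(exceedsLimits(99.0, 1.0,
                DEFAULT_GC_TIME_LIMIT, DEFAULT_GC_HEAP_FREE_LIMIT));
        // 99% of time in GC, but 10% of heap free: no out-of-memory by this rule.
        System.out.println(exceedsLimits(99.0, 10.0,
                DEFAULT_GC_TIME_LIMIT, DEFAULT_GC_HEAP_FREE_LIMIT));
    }
}
```

Lowering the time limit or raising the free limit makes the JVM give up sooner, which is what a "die fast" setup wants.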

antispam
CMSInitiatingOccupancyFraction was what I had tried. Sounds like it might not be a good idea in our situation. I'd rather die fast.
jimx
Though I don't know if I want to switch to ParallelGC just yet. Our previous experience seems to indicate that the long pauses were mainly caused by CMS failing and full GCs kicking in. Those back-to-back full GCs are really the culprit. I was trying to find out whether there is a way to stay on CMS but kill the JVM when excessive full GCs take too long. Will GCTimeLimit and GCHeapFreeLimit still work under CMS?
jimx
I guess GCTimeLimit and GCHeapFreeLimit only apply to the parallel collector. Until the memory leak is found, we usually do a periodic 'sanity reset' of the JVM in a time window agreed with the user.
antispam
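Independently of which flags apply under which collector, a "die fast" policy can also be approximated from inside the application. The sketch below is one possible approach under stated assumptions, not an established recipe: GcWatchdog, its thresholds and its method names are all hypothetical. It samples the standard GarbageCollectorMXBean.getCollectionTime() counters and halts the JVM when GC has eaten more than a given share of wall-clock time for several samples in a row. What exactly that counter includes varies by collector, and the watchdog thread is itself frozen during stop-the-world pauses, so it can only act in the gaps between them.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical in-process watchdog; names and thresholds are illustrative.
final class GcWatchdog {
    private final double maxGcFraction; // e.g. 0.9 = 90% of wall time in GC
    private final int maxBadSamples;    // consecutive bad samples tolerated
    private int badSamples = 0;

    GcWatchdog(double maxGcFraction, int maxBadSamples) {
        this.maxGcFraction = maxGcFraction;
        this.maxBadSamples = maxBadSamples;
    }

    /** Records one sample; returns true once the limit is hit maxBadSamples times in a row. */
    boolean record(long gcMillisDelta, long wallMillisDelta) {
        double fraction = wallMillisDelta <= 0 ? 0.0 : (double) gcMillisDelta / wallMillisDelta;
        badSamples = fraction > maxGcFraction ? badSamples + 1 : 0;
        return badSamples >= maxBadSamples;
    }

    /** Sum of accumulated collection time over all collectors, in milliseconds. */
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // -1 if the collector does not report it
            if (t > 0) {
                total += t;
            }
        }
        return total;
    }

    /** Starts a daemon thread that halts the JVM when GC overhead stays too high. */
    static Thread start(double maxGcFraction, int maxBadSamples, long periodMillis) {
        Thread thread = new Thread(() -> {
            GcWatchdog watchdog = new GcWatchdog(maxGcFraction, maxBadSamples);
            long lastGc = totalGcMillis();
            long lastWall = System.nanoTime();
            try {
                while (true) {
                    Thread.sleep(periodMillis);
                    long gc = totalGcMillis();
                    long wall = System.nanoTime();
                    if (watchdog.record(gc - lastGc, (wall - lastWall) / 1_000_000L)) {
                        System.err.println("GC overhead too high, halting JVM");
                        Runtime.getRuntime().halt(1); // skips shutdown hooks on purpose
                    }
                    lastGc = gc;
                    lastWall = wall;
                }
            } catch (InterruptedException e) {
                // Interrupted: stop watching and let the thread end.
            }
        }, "gc-watchdog");
        thread.setDaemon(true);
        thread.start();
        return thread;
    }
}
```

Calling something like GcWatchdog.start(0.9, 3, 10_000L) once at startup would halt the process after three consecutive 10 second windows with more than 90% of the time spent in GC; an external supervisor is then expected to restart the service.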