In Java, a concurrent mode failure means that the concurrent collector failed to free up enough memory from the tenured and permanent generations, so it has to give up and let a full stop-the-world GC kick in. The end result can be very expensive. I understand the concept, but I have never had a good, comprehensive understanding of A) what can cause a concurrent mode failure and B) what the solution is. Because of this, I write and debug code without many hints in mind, and I often end up shopping around among performance flags, from Foo to Bar, for no particular reason other than to try them. I'd like to learn from the developers here what your experience has been. If you have encountered this performance issue before, what was the cause and how did you address it? If you have coding recommendations, please don't be too general. Thanks!
Sometimes the application hits an OOM pretty quickly and gets killed; sometimes it suffers a long GC period (the last one was over 10 hours).
It sounds to me like a memory leak is at the root of your problems.
A CMS failure won't (as I understand it) cause an OOM. Rather, a CMS failure happens because the JVM needs to do too many collections too quickly, and CMS cannot keep up. One situation where lots of collection cycles happen in a short period is when your heap is nearly full.
The really long GC time sounds weird ... but is theoretically possible if your machine was thrashing horribly. However, a long period of repeated GCs is quite plausible if your heap is very nearly full.
You can configure the GC to give up when the heap is 1) at max size and 2) still close to full after a full GC has completed. Try doing this if you haven't done so already. It won't cure your problems, but at least your JVM will get the OOM quickly, allowing a faster service restart and recovery.
EDIT - the option to do this is -XX:GCHeapFreeLimit=nnn, where nnn is a number between 0 and 100 giving the minimum percentage of the heap that must be free after the GC. The default is 2. The option is listed in the aptly titled "The most complete list of -XX options for Java 6 JVM" page. (There are lots of -XX options listed there that don't appear in the Sun documentation. Unfortunately the page provides few details on what the options actually do.)
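To get a feel for how close your heap is to that limit, you can read the post-collection occupancy of each heap pool through the standard java.lang.management API. The following is only a monitoring sketch: the class name HeapFreeAfterGc and the helper freePercent are made up for illustration, and the calculation is my approximation of "percentage free after the GC", not the exact check HotSpot performs internally.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;

// Hypothetical helper: reports how much of each heap pool was free after its last collection.
final class HeapFreeAfterGc {

    // Percentage of the pool that is free; returns -1 when the pool reports no defined maximum.
    public static double freePercent(long usedBytes, long maxBytes) {
        if (maxBytes <= 0) {
            return -1.0;
        }
        return 100.0 * (maxBytes - usedBytes) / maxBytes;
    }

    public static void main(String[] args) {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getType() != MemoryType.HEAP) {
                continue; // skip non-heap pools such as the code cache
            }
            // Usage measured right after the most recent collection of this pool;
            // null if the JVM does not support it for this pool.
            MemoryUsage afterGc = pool.getCollectionUsage();
            if (afterGc == null) {
                continue;
            }
            double free = freePercent(afterGc.getUsed(), afterGc.getMax());
            System.out.printf("%s: %.1f%% free after last GC%n", pool.getName(), free);
        }
    }
}
```

If the tenured (old generation) pool stays in the low single digits after every collection, the JVM is in the "still close to full after a GC" state described above.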
You should probably start looking to see if your application / webapp has memory leaks. If it has, your problems won't go away unless those leaks are found and fixed. In the long term, fiddling with the Hotspot GC options won't fix memory leaks.
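As a concrete coding example: one common source of leaks in long-running applications is a static map used as a cache that only ever grows. Its entries survive long enough to be promoted to the tenured generation, and nothing ever makes them unreachable. I don't know whether this applies to your application, so treat the following as a sketch of one possible fix. BoundedCache is a hypothetical name; the technique is the standard LinkedHashMap.removeEldestEntry hook, which gives you a size-bounded, least-recently-used map.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical size-bounded LRU cache, built on LinkedHashMap's eviction hook.
final class BoundedCache<K, V> extends LinkedHashMap<K, V> {
    private static final long serialVersionUID = 1L;

    private final int maxEntries;

    BoundedCache(int maxEntries) {
        // accessOrder = true: iteration order is least-recently-used first
        super(16, 0.75f, true);
        this.maxEntries = maxEntries;
    }

    // Called by LinkedHashMap after each insertion; returning true evicts the eldest entry.
    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxEntries;
    }
}
```

Usage is the same as for any Map, for example new BoundedCache<String, byte[]>(10000). Note that LinkedHashMap is not thread-safe, so in a webapp you would need to wrap it (for example with Collections.synchronizedMap) or use a cache library instead.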
Quoted from "Understanding Concurrent Mark Sweep Garbage Collector Logs"
The concurrent mode failure can either be avoided by increasing the tenured generation size or initiating the CMS collection at a lesser heap occupancy by setting CMSInitiatingOccupancyFraction to a lower value
However, if there is really a memory leak in your application, you're just buying time.
If you need fast restart and recovery and prefer a 'die fast' approach I would suggest not using CMS at all. I would stick with '-XX:+UseParallelGC'.
From "Garbage Collector Ergonomics"
The parallel garbage collector (UseParallelGC) throws an out-of-memory exception if an excessive amount of time is being spent collecting a small amount of the heap. To avoid this exception, you can increase the size of the heap. You can also set the parameters -XX:GCTimeLimit=time-limit and -XX:GCHeapFreeLimit=space-limit
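If you want to see how much time your JVM is actually spending in GC before tuning these limits, the GarbageCollectorMXBean API exposes cumulative collection counts and times. This is a rough sketch only: GcOverhead and gcTimePercent are hypothetical names, and the figure it prints is an average since JVM start, which is an assumption of convenience and not the same measurement the collector uses when it enforces GCTimeLimit.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical helper: prints per-collector totals and the overall share of uptime spent in GC.
final class GcOverhead {

    // Share of uptime spent in GC, as a percentage; 0 when uptime is not yet positive.
    public static double gcTimePercent(long gcMillis, long uptimeMillis) {
        if (uptimeMillis <= 0) {
            return 0.0;
        }
        return 100.0 * gcMillis / uptimeMillis;
    }

    public static void main(String[] args) {
        long totalGcMillis = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long count = gc.getCollectionCount(); // -1 if undefined for this collector
            long millis = gc.getCollectionTime(); // -1 if undefined for this collector
            if (millis > 0) {
                totalGcMillis += millis;
            }
            System.out.printf("%s: %d collections, %d ms%n", gc.getName(), count, millis);
        }
        long uptimeMillis = ManagementFactory.getRuntimeMXBean().getUptime();
        System.out.printf("GC share of uptime: %.2f%%%n",
                gcTimePercent(totalGcMillis, uptimeMillis));
    }
}
```

Sampling these counters periodically and looking at the difference between samples gives a better picture of a recent GC storm than the since-startup average.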