I have a J2EE project running on JBoss, with a maximum heap size of 2048m, which is giving strange results under load testing. I've benchmarked the heap and CPU usage and received the following results (series 1 is heap usage, series 2 is CPU usage):

It seems as if the heap is being used and garbage collected properly around A. When it gets to B, however, there appears to be some kind of bottleneck: there is heap space available, but it never breaks that imaginary line. At the same time, at C, the CPU usage drops dramatically. During this period we also receive an "OutOfMemoryError (GC overhead limit exceeded)", which does not make much sense to me as there is heap space available.

My guess is that there is some kind of bottleneck, but I can't imagine what exactly it might be. How would you suggest going about finding the cause of the issue? I've profiled the memory usage and noticed that there are quite a few instances of one class (around a million), but the total size of these instances is fairly small (around 50MB, if I remember correctly).

Edit: The server is dedicated to this application and the CPU usage given is only for the JVM (there should not be any significant CPU usage outside of the JVM). The memory usage is only for the heap; it does not include the permgen space. This problem is reproducible. My main concern is the limit encountered around B, for which I have not yet found a plausible explanation.

Conclusion: Turns out this was caused by a number of long-running SQL queries being executed concurrently. The returned ResultSets were also very large, possibly explaining the OOME. I still have no reasonable explanation for why there appears to be some limit at B.
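
For anyone hitting something similar: the memory cost of those large ResultSets can be bounded by streaming the rows instead of materialising them all at once. Below is a minimal JDBC sketch, assuming a plain Connection is at hand; the query, fetch size and timeout are made-up illustrations, and setFetchSize is only a hint whose behaviour is driver-dependent.

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;

    public class StreamingQuery {

        // Walks a large result set row by row; with a sane fetch size the driver
        // keeps only a small window of rows in memory instead of the whole result.
        static void process(Connection con) throws SQLException {
            PreparedStatement ps = con.prepareStatement(
                    "SELECT id, payload FROM big_table");   // hypothetical query
            try {
                ps.setFetchSize(500);     // hint: fetch ~500 rows per round trip
                ps.setQueryTimeout(60);   // fail fast rather than pile up long-running queries
                ResultSet rs = ps.executeQuery();
                try {
                    while (rs.next()) {
                        handle(rs.getLong("id"), rs.getString("payload"));
                    }
                } finally {
                    rs.close();
                }
            } finally {
                ps.close();
            }
        }

        private static void handle(long id, String payload) {
            // per-row work goes here
        }
    }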

+2  A: 

From the error message it appears that the JVM is using the parallel scavenge collector. The message is dumped along with the OutOfMemoryError when an excessive proportion of time (by default, more than 98% of the total) is being spent on garbage collection while very little of the heap (by default, less than 2%) is being recovered.

The document from Sun does not specify whether that 98% of total time is to be read as 98% of the CPU time used by the JVM process or 98% of the machine's overall CPU time. In either case, I have to draw the following inferences (with limited information):

  • The garbage collector (or the JVM process as a whole) is not getting enough CPU time, most likely because other processes are consuming CPU at the same time.
  • The garbage collector is not getting enough CPU time because it runs as a low-priority thread, while another memory-intensive (but not CPU-intensive) thread in the JVM is doing work at the same time, which results in the failure to de-allocate memory.

Based on the above inferences (all, one or none of them could be true), it would be worthwhile to correlate the graph that you've obtained with the runtime behavior of the application as far as users are concerned. In other words, you might find it useful to determine which other processes are kicked off when the problem occurs, and which part of the application is in operation at that time.

In any case, the page referenced above does give an option to disable the GC overhead limit check used by this collector.
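
For reference, on the Sun HotSpot JVMs of this era the check can be switched off, and GC activity logged for correlation, with options along these lines. Treat this purely as a sketch and verify each flag against the JVM version actually in use; the heap and permgen sizes are simply the ones mentioned in the question.

    # JAVA_OPTS fragment (e.g. in JBoss's run.conf)
    JAVA_OPTS="$JAVA_OPTS -Xmx2048m -XX:MaxPermSize=256m -XX:-UseGCOverheadLimit -verbose:gc -XX:+PrintGCDetails -Xloggc:gc.log"

The GC log is arguably the more useful part here, since -XX:+PrintGCDetails shows which generation is filling up when the plateau at B is reached.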

EDIT: If the problem occurs periodically and can be reproduced, it might turn out to be a memory leak; otherwise (i.e. if it occurs only sporadically), you are better off tuning the GC algorithm or even changing it.

Vineet Reynolds
The first bullet point doesn't make sense to me, as the CPU usage drops. The second one seems plausible, but I still don't understand why the memory usage seems to reach some kind of maximum that it can't pass. I'd rather not change the GC overhead limit, as I have a feeling that would only address the symptom and not the cause.
Zecrates
The first point makes sense if your graph is reporting only the JVM's CPU usage and not that of other processes. As for the flag on the GC overhead limit, it just disables the check; you cannot change the limit itself. Coming to the second point, it might help if you can determine whether you've hit a high-water mark. What I meant is that it is quite possible the graph is not depicting the entire heap, but only a portion of it (does it include permgen as well?).
Vineet Reynolds
Okay, I understand what you mean with point one. This graph is only for the JVM's CPU usage, and the system is dedicated to this application. The graph does not include permgen, so are you suggesting that it might be the permgen space that causes the OOME? What could affect permgen space over time? In the past, the only problems I've had with permgen were with loading and unloading of classes.
Zecrates
As an additional note: I can understand a lack of permgen space causing the OOME even when there is memory available on the heap, but it still doesn't explain the "high water mark" on the heap.
Zecrates
Well, the 2G allocated to the heap includes the permgen as well. The permgen space can be specified separately, but that affects the amount of memory available for other objects. As more and more classes get loaded, the amount of permgen space occupied can increase (until the maximum specified, beyond which you get a permgen space error). I think, with the information you've provided, you should now look at how much memory is consumed by the various generations of the JVM heap, and also at the CPU usage of other processes, with emphasis on the former.
Vineet Reynolds
The permgen space is specified separately, as 256m. I've updated the question with my conclusion; thanks for your help. The only thing still bothering me is that limit at B, which I just can't explain :(
Zecrates
A: 

If I want to know where the "bottlenecks" are, I just get a few stackshots. There's no need to wonder and guess and play detective. They will just tell you.
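
One way to take such samples on a running JVM, if attaching a debugger under load is awkward, is jstack <pid> from the JDK, or a crude in-process sampler like the sketch below. This is only an illustration of the general idea, not a specific tool.

    import java.util.Map;

    public class StackSampler {

        // Dumps the stack of every live thread every few seconds. The frames that
        // keep reappearing near the top of the busy threads are the bottleneck.
        public static void start(final long intervalMillis) {
            Thread sampler = new Thread(new Runnable() {
                public void run() {
                    while (true) {
                        try {
                            Thread.sleep(intervalMillis);
                        } catch (InterruptedException e) {
                            return;
                        }
                        System.out.println("=== stackshot " + System.currentTimeMillis() + " ===");
                        for (Map.Entry<Thread, StackTraceElement[]> entry
                                : Thread.getAllStackTraces().entrySet()) {
                            System.out.println(entry.getKey());
                            for (StackTraceElement frame : entry.getValue()) {
                                System.out.println("    at " + frame);
                            }
                        }
                    }
                }
            }, "stack-sampler");
            sampler.setDaemon(true);
            sampler.start();
        }
    }

Call StackSampler.start(5000) somewhere during startup, run the load test, and compare a handful of consecutive dumps.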

Usually memory problems and performance problems go hand in hand, so if you fix the performance problems, you will also fix the memory problems (not for certain, though).

Mike Dunlavey