views:

453

answers:

5

We experience lags of several minutes in our server. They are probably triggered by "stop the world" garbage collections. But we use the concurrent mark-and-sweep collector (-XX:+UseConcMarkSweepGC), so I think these pauses are caused by memory fragmentation of the old generation.

How can memory fragmentation of the old generation be analyzed? Are there any tools for it?

Lags happen every hour. Most of the time they last about 20 seconds, but sometimes several minutes.
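Since the question asks for tools: one standard JDK command-line tool for watching generation occupancy and GC counts/times is jstat. It is my suggestion, not something named later in the thread; replace the pid placeholder with the JVM's process id:

```shell
# jstat ships with the JDK; -gcutil prints per-generation
# utilization percentages plus GC counts and accumulated times,
# sampled here every 1000 ms for the given JVM pid.
jstat -gcutil <pid> 1000
```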

A: 

I have used YourKit to good effect for this type of problem.

Eric J.
Yes, a great tool. But it doesn't show memory fragmentation, only memory consumption, which doesn't help with the lags. Or am I missing some cool option? :)
Vitaly
YourKit and other memory profilers will show you when and how frequently GC runs as it attempts to reorganize memory and reduce fragmentation. It won't show you your fragmentation directly (unless I don't know about a cool option too).
Eric J.
I'm going to check the behavior of the GC through VisualVM. By the way, YourKit is dangerous to run on production servers. We had performance problems even with the "disable all" option. And thanks for the advice about big objects.
Vitaly
What kind of performance problems did you see? So far we have only used it in our stress-test environment, but we had considered running it on one application server for a while to gather real-life metrics.
Eric J.
Lags :) If bytecode instrumentation is turned on (the agent's default setting), the lags are huge. If it's off, there are still lags.
Vitaly
+2  A: 

Look at your Java documentation for the "java -X..." options for turning on GC logging. That will tell you whether you are collecting old or new generation, and how long the collections are taking.
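A sketch of the kind of flags this answer refers to, assuming a pre-Java 9 HotSpot JVM (which matches the CMS setup in the question); gc.log and server.jar are placeholder names:

```shell
# Turn on detailed, timestamped GC logging to a file.
# Each log line shows which generation was collected and the pause time.
java -verbose:gc \
     -XX:+PrintGCDetails \
     -XX:+PrintGCTimeStamps \
     -Xloggc:gc.log \
     -jar server.jar
```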

A pause of "several minutes" sounds extraordinary. Are you sure that you aren't just running with a heap size that is too small, or on a machine with not enough physical memory?

  • If your heap is too close to full, the GC will be triggered again and again, so your server spends most of its CPU time in the GC. This will show up in the GC logs.

  • If you use a large heap on a machine with not enough physical memory, a full GC is liable to cause your machine to "thrash", spending most of its time madly moving virtual memory pages to and from disc. You can observe this using system monitoring tools; e.g. by watching the console output from "vmstat 5" on a typical UNIX/Linux system.
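To make the vmstat suggestion from the second bullet concrete (the si/so interpretation is standard vmstat usage, not something stated in the thread):

```shell
# Sample system activity every 5 seconds.
# Sustained non-zero values in the "si"/"so" (swap-in/swap-out)
# columns while the JVM pauses suggest the machine is thrashing.
vmstat 5
```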

Stephen C
Lags of several minutes happen too, once or twice a day. I've edited the question.
Vitaly
I'll try vmstat 5, thanks.
Vitaly
Yes, I'll try verbose GC output. It just prints too much info and can slow down the servers, so I wouldn't want to run it in production :) Right now we use GarbageCollectorMXBeans. The output looks like this: ConcurrentMarkSweep 27459. The lag matches it almost perfectly (27 sec). It happens every hour or so; that's why I suspect memory fragmentation rather than a memory leak.
Vitaly
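A minimal sketch of the GarbageCollectorMXBean polling Vitaly describes; the class name GcWatch is mine, and the output format ("ConcurrentMarkSweep 27459") is an assumption based on his comment:

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Print each garbage collector's name with its accumulated
// collection time and count, similar to the output quoted above.
public class GcWatch {
    public static void main(String[] args) {
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            // getCollectionTime() is the approximate accumulated
            // collection time in milliseconds (-1 if unsupported).
            System.out.println(gc.getName() + " " + gc.getCollectionTime()
                    + " ms over " + gc.getCollectionCount() + " collections");
        }
    }
}
```

Polling these beans periodically and diffing the counters avoids the overhead of verbose GC logging on a production server.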
A: 

There is no memory fragmentation in Java; during a GC run, memory areas are compacted.

Since you don't see high CPU utilization, no GC is running either. So something else must be the cause of your problems. Here are a few ideas:

  • If the database of your application is on a different server, there may be network problems

  • If you run Windows and you have mapped network drives, one of the drives may lock up your computer (again network problems). The same is true for NFS drives on Unix. Check the system log for network errors.

  • Is the computer swapping lots of data to disk? Since CPU utilization is low, the cause of the problem could be that the app was swapped out to disk and the GC run forced it back into RAM. This will take a long time if your server doesn't have enough real RAM to keep the whole Java app resident.

Also, other processes can force the app out of RAM. Check the real memory utilization and your swap space usage.

To understand the output of the GC log, this post might help.

[EDIT] I still can't get my head around "low CPU" and "GC stalls". Those two usually contradict each other. If the GC is stalling, you must see 100% CPU usage. If the CPU is idle, then something else is blocking the GC. Do you have objects that override finalize()? If a finalizer blocks, the GC can take forever.

Aaron Digulla
Well, there IS fragmentation, but the GC will attempt to reduce it when it runs. Having too many large-ish objects (relative to your available heap) that are frequently allocated and deallocated will cause the app to spend a lot of time in GC and hurt performance.
Eric J.
There is memory fragmentation if ConcurrentMarkSweep is used. For example, http://publib.boulder.ibm.com/infocenter/wasinfo/v6r0/index.jsp?topic=/com.ibm.websphere.base.doc/info/aes/ae/rprf_javamemory.html.
Vitaly
No database is used.
Vitaly
Eric J., large objects sound like a great lead, thank you. I'll check the memory dumps.
Vitaly
No, it is not a network problem. The lags reported by the GC match the lags in our internal log.
Vitaly
@Vitaly: Please give some more information about how your app works. Is it creating huge HashMaps or something (i.e. large data graphs with lots of references to other objects) all the time?
Aaron Digulla
@Vitaly: See my edits.
Aaron Digulla
Aaron, about CPU: by "not a high CPU utilization" I meant that all the threads are blocked, so it isn't our code that is causing the lags. CPU usage IS high during the lags :))
Vitaly
No, there are no finalize() methods in the code.
Vitaly
@Aaron About hash maps: we have some HashMaps with 20-30K elements (not sure if that counts as big). We also create fairly big byte[] arrays quite often; they can be 40K-80K in size.
Vitaly
byte[] doesn't matter much for GC; for a GC, only object references matter. Do you add/remove elements from the HashMaps all the time? As for "all threads are blocked", that should be "all threads but the GC thread". And the GC thread eats as much CPU as possible, so while it runs you must see 100% CPU usage, or something is odd.
Aaron Digulla
A: 

Vitaly, there is a fragmentation problem. My observation: if there are many small objects that are updated frequently, they generate a lot of garbage. Although CMS collects the memory occupied by these objects, that memory is fragmented. The Mark-Sweep-Compact thread then comes into the picture (stop the world) and tries to compact this fragmented memory, causing a long pause.

Conversely, if the objects are bigger, the memory is less fragmented and
Mark-Sweep-Compact takes less time to compact it. This may reduce throughput, but it will help you avoid the long pauses caused by GC compaction.

Kishor
We've already handled the problem. Sometimes there wasn't enough free memory in the old generation to copy the objects that survived from the young generation. Starting CMS once a fixed amount of memory is consumed fixed the problem.
Vitaly
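Vitaly doesn't name the exact flags he used; a hedged sketch of one common way to start CMS at a fixed old-generation occupancy (the threshold of 70 and server.jar are illustrative, not from the thread):

```shell
# Start a CMS cycle once old-gen occupancy reaches 70%, and use
# only that threshold (rather than the JVM's adaptive heuristic),
# so collection begins before the old generation runs out of room
# for objects promoted from the young generation.
java -XX:+UseConcMarkSweepGC \
     -XX:CMSInitiatingOccupancyFraction=70 \
     -XX:+UseCMSInitiatingOccupancyOnly \
     -jar server.jar
```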
Vitaly, can you please briefly point out how you solved the fragmentation problem? How exactly did you trigger CMS after a fixed amount of memory was consumed? And how did this solve the fragmentation problem?
Kishor
A: 

To see how Vitaly probably handled this, see Understanding Concurrent Mark Sweep Garbage Collector Logs.