views: 896
answers: 6

We have recently been migrating a number of applications from running under Red Hat Linux on JDK 1.6.0_03 to Solaris 10u8 on JDK 1.6.0_16 (much higher-spec machines) and we have noticed what seems to be a rather pressing problem: under certain loads our JVMs get themselves into a "Death Spiral" and eventually run out of memory. Things to note:

  • this is not a case of a memory leak. These are applications which have been running just fine (in one case for over 3 years), and the out-of-memory errors are not deterministic: sometimes the applications work, sometimes they don't
  • this is not us moving to a 64-bit VM - we are still running 32-bit
  • In one case, using the latest G1 garbage collector on 1.6.0_18 seems to have solved the problem (the flags involved are sketched after this list). In another, moving back to 1.6.0_03 has worked
  • Sometimes our apps are falling over with HotSpot SIGSEGV errors
  • This is affecting applications written in Java as well as Scala
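
For context, G1 was still experimental in the 1.6.0_1x releases, so enabling it means unlocking experimental options; roughly this (the heap size and main class are just placeholders):

```
java -XX:+UnlockExperimentalVMOptions -XX:+UseG1GC -Xmx256m com.example.Main
```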

The most important point is this: the behaviour manifests itself in those applications which suddenly get a deluge of data (usually via TCP). It's as if the VM decides to keep adding more data (possibly promoting it to the tenured generation) rather than running a GC on "newspace" (the young generation), until it realises that it has to do a full GC and then, despite practically everything in the VM being garbage, it somehow decides not to collect it!

It sounds crazy but I just don't see what else it is. How else can you explain an app which one minute falls over with a max heap of 1Gb and the next works just fine (never going above 256M when the app is doing exactly the same thing)?

So my questions are:

  1. Has anyone else observed this kind of behaviour?
  2. Has anyone any suggestions as to how I might debug the JVM itself (as opposed to my app)? How do I prove this is a VM issue?
  3. Are there any VM-specialist forums out there where I can ask the VM's authors (assuming they aren't on SO)? (We have no support contract)
  4. If this is a bug in the latest versions of the VM, how come no-one else has noticed it?
+2  A: 

I have had the same issue on Solaris machines, and I solved it by decreasing the maximum size of the JVM. The 32 bit Solaris implementation apparently needs some overhead room beyond what you allocate for the JVM when doing garbage collections. So, for example, with -Xmx3580M I'd get the errors you describe, but with -Xmx3072M it would be fine.

Rex Kerr
But these apps are really not very big - usually 256Mb - and the machines they are on are beasts (24Gb of RAM) and currently under-utilized. I see no reason why Solaris would be having problems finding any extra memory for housekeeping!
oxbow_lakes
Maybe it's proportional to data throughput and/or GC load, and yours is just that much higher than mine? What did you set the maximum heap size to?
Rex Kerr
I eventually ramped it up to 1Gb in desperation and watched as the app happily started, never going above 256Mb! But this was not deterministic (it didn't work first time) - it failed, it failed, it failed, it failed, it worked!
oxbow_lakes
+1  A: 

What kind of OutOfMemoryError are you getting? Is the heap space exhausted or is the problem related to one of the other memory pools (the error usually has a message giving more details on its cause)?

If the heap is exhausted and the problem can be reproduced (it sounds as if it can), I would first of all configure the VM to produce a heap dump on OutOfMemoryError. You can then analyze the heap and make sure that it's not filled with objects that are still reachable through some unexpected references.
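
Something along these lines should do it (the dump path and the main class name are just placeholders):

```
java -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/tmp/heapdumps com.example.Main
```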

It's of course not impossible that you are running into a VM bug, but if your application is relying on implementation-specific behaviour in 1.6.0_03, it may for some reason or another end up as a memory hog when running on 1.6.0_16. Such problems may also be found if you are using some kind of server container for your application. Some developers are obviously unable to read documentation, but tend to observe the API behaviour and draw their own conclusions about how something is supposed to work. This is of course not always correct and I've run into similar problems both with Tomcat and with JBoss (both products at least used to work only with specific VMs).

jarnbjo
`jhat` doesn't seem to be capable of analyzing any heaps >= 256Mb in size, unfortunately, because *it* runs out of memory! It's a mixture of "heap exhausted" and "GC overhead limit exceeded" errors and it's not running in a container (other than Spring)
oxbow_lakes
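
For what it's worth, `jhat` accepts `-J` options and passes them straight through to its own JVM, so giving it a bigger heap may get past that; for example (the dump file name is just the default `java_pid<pid>.hprof` naming pattern):

```
jhat -J-Xmx2g java_pid1234.hprof
```
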
I've been told to try YourKit but I'm reluctant to spend time on this approach. After all, if the app runs on 1.6.0_03/linux but not on 1.6.0_18/solaris then the issue is surely with the VM - how will profiling my heap help?
oxbow_lakes
I use the Eclipse Memory Analyzer (http://www.eclipse.org/mat/), perhaps you want to take a look at it? I already explained in my answer why your problem is not necessarily caused by a VM bug and why I think you should take a closer look at the heap dump.
jarnbjo
@Jarnbjo - I'm not sure you explained anything of the sort: an application runs fine for 3 years and then starts falling over when migrated to a new VM and this is a memory leak?
oxbow_lakes
Short summary: If your application depends on VM implementation (instead of documented) behaviour, the problem you are seeing may not be a VM bug, but a bug in your application.
jarnbjo
What does "depends on JVM implementation" mean in this case? I expect that if I have garbage the VM will collect it (which is *documented* behaviour any VM should have). I suppose I do depend on the virtual machine not having a bug in it but that is hardly an unreasonable dependency
oxbow_lakes
+1  A: 
  1. Yes, I've observed this behavior before, and usually after countless hours of tweaking JVM parameters it starts working.
  2. Garbage collection, especially in multithreaded situations, is nondeterministic. Defining a bug in nondeterministic code can be a challenge. But you could try DTrace if you are using Solaris, and there are a lot of JVM options for peering into HotSpot (a few are sketched after this list).
  3. Go on Scala IRC and see if Ismael Juma is hanging around (ijuma). He's helped me before, but I think real in-depth help requires paying for it.
  4. I think most people doing this kind of stuff accept that they either need to be JVM tuning experts, have one on staff, or hire a consultant. There are people who specialize in JVM tuning.

In order to solve these problems I think you need to be able to replicate them in a controlled environment where you can precisely duplicate runs with different tuning parameters and/or code changes. If you can't do that, hiring an expert probably isn't going to do you any good, and the cheapest way out of the problem is probably buying more RAM.
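
As a starting point for that kind of visibility, the standard HotSpot GC-logging switches are probably worth turning on; something like this (the log path and main class are placeholders):

```
java -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:/var/tmp/gc.log com.example.Main
```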

Erik Engbrecht
+2  A: 

Interesting problem. It sounds like one of the garbage collectors works poorly in your particular situation.

Have you tried changing the garbage collector being used? There are a LOT of GC options, and figuring out which ones are optimal seems to be a bit of a black art, but I wonder if a basic change would work for you.

I know there is a "Server" GC that tends to work a lot better than the default ones. Are you using that?

Threaded GC (which I believe is the default) is probably the worst for your particular situation; I've noticed that it tends to be much less aggressive when the machine is busy.

One thing I've noticed, it often takes two GCs to convince Java to actually take out the trash. I think the first one tends to unlink a bunch of objects and the second actually deletes them. What you might want to do is occasionally force two garbage collections. This WILL cause a significant GC pause, but I've never seen a case where it took more than two to clean out the entire heap.
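
For reference, selecting a collector explicitly on a 1.6 VM usually looks something like one of these (the heap sizes and main class are just illustrative):

```
java -XX:+UseParallelGC -XX:+UseParallelOldGC -Xms256m -Xmx256m com.example.Main   # throughput ("server-style") collector
java -XX:+UseConcMarkSweepGC -Xms256m -Xmx256m com.example.Main                    # concurrent low-pause collector
```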

Bill K
Well, server is the default on a machine like this. The problem is that this is only happening on our production servers, so it's not like I can just tinker around on my PC (where everything works as normal). Also, previous messing with GC options did not leave me thinking that anything was any better than the default. It would be a massive task to start blindly going through options, but I tried what was recommended here for high throughput, to no avail: http://java.sun.com/performance/reference/whitepapers/tuning.html.
oxbow_lakes
How about monitoring your memory usage and running System.gc() twice whenever it gets, say, 3/4 full? You'd also have to include a mechanism to ensure this doesn't happen too often, but if you only have a problem when your data is bursty, it may be a workable solution. You might also want to set your min memory to the same as your max so that it allocates it all at once instead of during the burst; that makes your 3/4-full measurement more reliable.
Bill K
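
A minimal sketch of that idea, assuming a polling daemon thread and treating the 3/4 threshold and the cool-down interval as tunable guesses:

```java
// Hypothetical watchdog along the lines Bill K describes: poll heap usage and
// force two GCs when it crosses 3/4 of the maximum, with a cool-down so the
// forced collections can't fire back-to-back.
public class GcWatchdog implements Runnable {
    private static final long COOL_DOWN_MS = 60000L;    // at most one forced GC per minute
    private static final long POLL_INTERVAL_MS = 5000L; // check heap usage every 5 seconds

    private long lastForcedGc = 0L;

    public void run() {
        Runtime rt = Runtime.getRuntime();
        while (!Thread.currentThread().isInterrupted()) {
            long used = rt.totalMemory() - rt.freeMemory();
            long max = rt.maxMemory();
            long now = System.currentTimeMillis();
            if (used > (max / 4) * 3 && now - lastForcedGc > COOL_DOWN_MS) {
                System.gc(); // run it twice, per the suggestion above
                System.gc();
                lastForcedGc = now;
            }
            try {
                Thread.sleep(POLL_INTERVAL_MS);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
        }
    }
}
```

You'd start it once as a daemon thread alongside the application; whether forcing collections like this actually helps is what the experiment would show.
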
I tried setting Xms to be the same as Xmx and it didn't make any difference. In one of the apps (which is batch-based) I have added an explicit call to gc and I'll see how that pans out early next week
oxbow_lakes
+1  A: 

Also make sure it's not a hardware fault (try running MemTest86 or similar on the server).

finnw
+1  A: 

Which kind of SIGSEGV errors exactly do you encounter?

If you run a 32-bit VM, it could be what I described here: http://janvanbesien.blogspot.com/2009/08/mysterious-jvm-crashes-explained.html

Jan Van Besien