We have recently been migrating a number of applications from running under RedHat linux JDK1.6.0_03 to Solaris 10u8 JDK1.6.0_16 (much higher spec machines) and we have noticed what seems to be a rather pressing problem: under certain loads our JVMs get themselves into a "Death Spiral" and eventually go out of memory. Things to note:
- this is not a case of a memory leak. These are applications which have been running just fine (in one case for over 3 years) and the out-of-memory errors are not certain in any case. Sometimes the applications work, sometimes they don't
- this is not us moving to a 64-bit VM - we are still running 32 bit
- In one case, using the latest G1 garbage collector on 1.6.0_18 seems to have solved the problem. In another, moving back to 1.6.0_03 has worked
- Sometimes our apps are falling over with HotSpot
SIGSEGV
errors - This is affecting applications written in Java as well as Scala
The most important point is this: the behaviour manifests itself in those applications which suddenly get a deluge of data (usually via TCP). It's as if the VM decides to keep adding more data (possibly progressing it to the TG) rather than running a GC on "newspace" until it realises that it has to do a full GC and then, despite practically everything in the VM being garbage, it somehow decides not to collect it!
It sounds crazy but I just don't see what else it is. How else can you explain an app which one minute falls over with a max heap of 1Gb and the next works just fine (never going about 256M when the app is doing exactly the same thing)
So my questions are:
- Has anyone else observed this kind of behaviour?
- has anyone any suggestions as to how I might debug the JVM itself (as opposed to my app)? How do I prove this is a VM issue?
- Are there any VM-specialist forums out there where I can ask the VM's authors (assuming they aren't on SO)? (We have no support contract)
- If this is a bug in the latest versions of the VM, how come no-one else has noticed it?