views: 1931
answers: 5

We've been debugging this JBoss server problem for quite a while. After about 10 hours under load, the server goes into 100% CPU panic attacks and just stalls. During this time you cannot run any new programs, so you can't even run kill -quit to get a stack trace. These 100% SYS CPU spikes last 10-20 seconds and repeat every few minutes.

We suspect it has something to do with the GC, but we cannot confirm it with a smaller program. We are running on 32-bit i386, RHEL5 and Java 1.5.0_10, using -client and the ParNew collector.

Here's what we have tried so far:

  1. We limited the CPU affinity so we could actually use the server when the high load hits. With strace we see an endless loop of SIGSEGV followed by the signal return.

  2. We tried to reproduce this with a Java program. SYS CPU% does climb high with a WeakHashMap or when accessing null pointers. The problem was that fillInStackTrace took a lot of user CPU%, which is why we never reached 100% SYS CPU.

  3. We know that after 10 hours of stress, the GC goes crazy and a full GC sometimes takes 5 seconds. So we assume it has something to do with memory.

  4. jstack during that period showed all threads as blocked. pstack during that time occasionally showed a MarkSweep stack trace, so we can't be sure about this either. Sending SIGQUIT yielded nothing: Java dumped the stack trace AFTER the SYS% load period was over.

We're now trying to reproduce this problem with a small fragment of code so we can ask Sun.
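
Something along these lines is what we're experimenting with (a rough sketch only; the class name and sizes are illustrative): hammer null pointer accesses, which the VM handles via SIGSEGV (as we saw under strace), while churning a WeakHashMap to keep the GC busy.

  import java.util.Map;
  import java.util.WeakHashMap;

  // Rough reproduction sketch (illustrative only): provoke the JVM's
  // SIGSEGV-based null checks in a tight loop while keeping the garbage
  // collector busy with short-lived, weakly referenced entries.
  public class NpeGcStress {
      public static void main(String[] args) {
          Map<Object, byte[]> map = new WeakHashMap<Object, byte[]>();
          Object nothing = null;
          while (true) {
              try {
                  nothing.hashCode(); // null access, handled via SIGSEGV inside the VM
              } catch (NullPointerException e) {
                  // fillInStackTrace() is what burned user CPU in our earlier attempts
              }
              map.put(new Object(), new byte[1024]); // key is weakly held, collected soon
          }
      }
  }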

If you know what might be causing this, please let us know. We're open to ideas and pretty much clueless at this point, so any idea is welcome :)

Thanks for your time.

A: 

Have you tried a profiler? There are some good profiling tools that can run on production servers. They should tell you whether the GC is running into trouble, and with which objects.

Nuno Furtado
A: 

I had a similar issue with JBoss (JBoss 4, Linux 2.6) last year. I think in the end it did turn out to be related to an application bug, but it was definitely very hard to figure out. I would keep trying to send a 'kill -3' to the process, to get some kind of stack trace and figure out what is blocking. Maybe add logging statements to see if you can figure out what is setting it off. You can use 'lsof' to figure out what files it has open; this will tell you if there is a leak of some resource other than memory.
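
For reference, roughly what that looks like from a shell (the pid here is a placeholder for the JBoss process id; the thread dump goes to the JVM's stdout, i.e. the console run.sh writes to):

  kill -3 <jboss-pid>   # SIGQUIT: ask the JVM for a full thread dump
  lsof -p <jboss-pid>   # list the files and sockets the process has open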

Also, why are you running JBoss with -client instead of -server? (Not that I think it will help in this case, just a general question).

Avi
We're running with -client instead of -server for legacy reasons. I'm trying to change that to -server, though it happens with -server as well. I have checked and saw that -server does not raise SIGSEGV when accessing a null pointer. We tried kill -quit at 40% SYS and couldn't find any thread that wasn't blocked :(
gilm
Are the threads blocked on locks (waiting for other threads to release a lock), or blocked on IO?
Avi
+1  A: 

Hi gilm. If you're certain that GC is the problem (and it does sound like it based on your description), then adding the -XX:+HeapDumpOnOutOfMemoryError flag to your JBoss settings might help (in JBOSS_HOME/bin/run.conf).
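
In run.conf that would look something like this (a sketch, assuming the stock run.conf that builds up JAVA_OPTS):

  # JBOSS_HOME/bin/run.conf -- append to whatever JAVA_OPTS already contains
  JAVA_OPTS="$JAVA_OPTS -XX:+HeapDumpOnOutOfMemoryError"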

You can read more about this flag in Sun's documentation. It was originally added in Java 6, but was later back-ported to Java 1.5.0_07.

Basically, you will get a "dump file" if an OutOfMemoryError occurs, which you can then open in various profiling tools. We've had good luck with the Eclipse Memory Analyzer.

This won't give you any "free" answers, but if you truly have a memory leak, then this will help you find it.

Matt Solnit
A: 

You could try adding the command-line option -verbose:gc, which should print GC activity and heap sizes to stdout. Pipe stdout to a file and see if the high CPU times line up with a major GC.
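
For example (a sketch, assuming you add the flag via run.conf and start JBoss with run.sh; the log file name is just an example):

  JAVA_OPTS="$JAVA_OPTS -verbose:gc"      # in JBOSS_HOME/bin/run.conf
  ./run.sh > console-with-gc.log 2>&1     # GC lines arrive on the JVM's stdout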

I remember having similar issues with JBoss on Windows. Periodically the CPU would go to 100%, and the memory usage reported by Windows would suddenly drop to something like 2.5 MB, far less than JBoss could possibly run in, and then build back up over a few seconds, as if the entire server had come down and restarted itself. I eventually tracked my issue down to a prepared-statement cache in Apache Commons that never expired.

If it does seem to be a memory issue, then you can start taking periodic heap dumps and comparing them, or use something like the JProbe memory profiler to track everything.

rally25rs
+1  A: 

Thanks to everybody for helping out.

Eventually we upgraded (only half of the Java servers) to JDK 1.6 and the problem disappeared. Just don't use 1.5.0_10 :)

We managed to reproduce the problem just by accessing null pointers (it boosts SYS rather than USR CPU, and brings the whole Linux box to a halt).

Again, thanks to everyone.

gilm