We've been debugging this JBoss server problem for quite a while. After about 10 hours of running under load, the server goes into 100% CPU panic attacks and just stalls. During this time you cannot run any new programs, so you can't even run kill -quit to get a stack trace. These 100% SYS CPU loads last 10-20 seconds and repeat every few minutes.
We suspect it has something to do with the GC, but cannot confirm it with a smaller program. We are running on i386 32-bit, RHEL5 and Java 1.5.0_10, using -client and the ParNew GC.
Here's what we have tried so far:
We limited the CPU affinity so we can actually use the server when the high load hits. With strace we see an endless loop of SIGSEGV followed by the signal return.
We tried to reproduce this with a Java program. It's true that SYS CPU% climbs high with WeakHashMap or when accessing null pointers. The problem was that fillInStackTrace took a lot of user CPU%, which is why we never reached 100% SYS CPU.
We know that after 10 hours of stress, GC goes crazy and a full GC sometimes takes 5 seconds. So we assume it has something to do with memory.
jstack during that period showed all threads as blocked. pstack during that time occasionally showed a MarkSweep stack trace, so we can't be sure about this either. Sending SIGQUIT yielded nothing: Java dumped the stack trace AFTER the SYS% load period was over.
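A minimal sketch of the kind of small reproducer described above, for anyone who wants to experiment. It is not the original test program: the class name SegvLoopSketch and both helper methods are hypothetical, and it assumes (as is commonly the case on HotSpot, but not verified here for 1.5.0_10 with -client) that a null dereference is detected through a SIGSEGV handler rather than an explicit check.

```java
import java.util.Map;
import java.util.WeakHashMap;

// Hypothetical reproducer sketch; names are illustrative only.
public class SegvLoopSketch {

    // Dereferences a null reference `iterations` times and returns how many
    // NullPointerExceptions were caught. Each throw also pays the user-CPU
    // cost of filling in the stack trace.
    public static int countNullDereferences(int iterations) {
        Object target = null;
        int caught = 0;
        for (int i = 0; i < iterations; i++) {
            try {
                target.hashCode(); // always throws NullPointerException
            } catch (NullPointerException e) {
                caught++;
            }
        }
        return caught;
    }

    // Fills a WeakHashMap with short-lived keys so the GC has weak references
    // to clear. Returns the number of entries still present afterwards, which
    // depends on when the collector ran.
    public static int churnWeakHashMap(int entries) {
        Map<Object, byte[]> map = new WeakHashMap<Object, byte[]>();
        for (int i = 0; i < entries; i++) {
            map.put(new Object(), new byte[1024]);
        }
        return map.size();
    }

    public static void main(String[] args) {
        // Assumption: 1,000,000 iterations is just a convenient size for
        // watching the process under strace or top; tune as needed.
        int caught = countNullDereferences(1000000);
        int remaining = churnWeakHashMap(100000);
        System.out.println("NPEs caught: " + caught);
        System.out.println("WeakHashMap entries remaining: " + remaining);
    }
}
```

Running this while attaching strace to the Java process is one way to check whether the null-dereference loop shows the same SIGSEGV and signal-return pattern, and how the SYS versus user CPU split compares with the real server.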
We're now trying to reproduce this problem with a small fragment of code so we can ask Sun.
If you know what's causing it, please let us know. We're clueless and open to ideas, so any idea is welcome :)
Thanks for your time.