We've been debugging this JBoss server problem for quite a while. After about 10 hours of work, the server goes into 100% CPU panic attacks and just stalls. During these episodes you cannot start any new programs, so you can't even run kill -QUIT to get a stack trace. The 100% SYS CPU spikes last 10-20 seconds and repeat every few minutes.
We suspect it has something to do with the GC, but cannot confirm it with a smaller program. We are running on 32-bit i386, RHEL5, and Java 1.5.0_10 with -client and the ParNew GC.
Here's what we have tried so far:
- We limited the CPU affinity so we can actually use the server when the high load hits. With `strace` we see an endless loop of `SIGSEGV` followed by the signal return.
- We tried to reproduce this with a Java program. It's true that SYS CPU% climbs high with `WeakHashMap` or when accessing null pointers. The problem was that `fillInStackTrace` took a lot of user CPU%, which is why we never reached 100% SYS CPU.
- We know that after 10 hours of stress, GC goes crazy and a full GC sometimes takes 5 seconds. So we assume it has something to do with memory.
- `jstack` during that period showed all threads as blocked. `pstack` during that time occasionally showed a MarkSweep stack trace, so we can't be sure about this either. Sending `SIGQUIT` yielded nothing: Java dumped the stack trace AFTER the SYS% load period was over.
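For reference, here is a minimal sketch of the kind of null-pointer test described above. It is not our actual test code: the class name `NullDerefLoop`, the `Holder` type, and the method `countNullDerefs` are made up for illustration. It assumes, as far as we understand, that HotSpot implements null checks implicitly by trapping `SIGSEGV`, so each null dereference goes through the signal handler before the `NullPointerException` is created.

```java
public class NullDerefLoop {
    // Hypothetical holder type; reading a field through a null reference is what faults.
    static final class Holder {
        int value;
    }

    /** Dereferences a null reference `iterations` times and returns how many NPEs were caught. */
    static long countNullDerefs(long iterations) {
        Holder holder = null;
        long caught = 0;
        long sink = 0;
        for (long i = 0; i < iterations; i++) {
            try {
                sink += holder.value; // always throws NullPointerException
            } catch (NullPointerException e) {
                caught++; // building the exception (fillInStackTrace) costs user CPU
            }
        }
        if (sink != 0) {
            throw new IllegalStateException("unreachable: the field was never read");
        }
        return caught;
    }

    public static void main(String[] args) {
        // Iteration count is an arbitrary default; pass a larger value to run longer.
        long iterations = args.length > 0 ? Long.parseLong(args[0]) : 1000000L;
        long start = System.currentTimeMillis();
        long caught = countNullDerefs(iterations);
        long elapsed = System.currentTimeMillis() - start;
        System.out.println(caught + " NullPointerExceptions in " + elapsed + " ms");
    }
}
```

Running this while watching the process with `strace` and `top` should show whether the time goes to SYS or user CPU. In our attempts the user-side cost of filling in the stack trace dominated, so a loop like this may not reach 100% SYS CPU on its own.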
We're now trying to reproduce this problem with a small fragment of code so we can ask Sun.
If you know what's causing it, please let us know. We're clueless at this point, so any idea is welcome :)
Thanks for your time.