views: 760
answers: 7

Hello

I've got (the currently latest) JDK 1.6.0_18 crashing while running a web application on (the currently latest) Tomcat 6.0.24, unexpectedly, after anywhere from 4 hours to 8 days of stress testing (30 threads hitting the app at 6 million pageviews/day). This is on RHEL 5.2 (Tikanga).

The crash report is at http://pastebin.com/f639a6cf1 and the consistent parts of the crash are:

  • a SIGSEGV is being thrown
  • on libjvm.so
  • eden space is always full (100%)

JVM runs with the following options:

CATALINA_OPTS="-server -Xms512m -Xmx1024m -Djava.awt.headless=true"

I've also tested the memory for hardware problems using http://memtest.org/ for 48 hours (14 passes of the whole memory) without any error.

I've enabled -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps to look for any GC trends or space exhaustion, but there is nothing suspicious there. GC and full GC happen at predictable intervals, almost always freeing the same amount of memory.
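
For reference, the combined option line looks roughly like this (the -Xloggc path below is only illustrative, not part of my actual setup):

CATALINA_OPTS="-server -Xms512m -Xmx1024m -Djava.awt.headless=true -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:/tmp/gc.log"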

My application does not, directly, use any native code.

Any ideas of where I should look next?

Edit - more info:

1) There is no client vm in this JDK:

[foo@localhost ~]$ java -version -server
java version "1.6.0_18"
Java(TM) SE Runtime Environment (build 1.6.0_18-b07)
Java HotSpot(TM) 64-Bit Server VM (build 16.0-b13, mixed mode)

[foo@localhost ~]$ java -version -client
java version "1.6.0_18"
Java(TM) SE Runtime Environment (build 1.6.0_18-b07)
Java HotSpot(TM) 64-Bit Server VM (build 16.0-b13, mixed mode)

2) Changing the O/S is not possible.

3) I don't want to change the JMeter stress test variables, since this could hide the problem. I've got a use case (the current stress test scenario) which crashes the JVM, so I'd like to fix the crash rather than change the test.

4) I've done static analysis on my application but nothing serious came up.

5) The memory does not grow over time. Memory usage stabilizes very quickly after startup into a very steady pattern, which does not look suspicious.

6) /var/log/messages does not contain any useful information before or during the time of the crash.

More info: Forgot to mention that there was an Apache (2.2.14) fronting Tomcat using mod_jk 1.2.28. Right now I'm running the test without Apache, just in case the JVM crash is related to the mod_jk native code which connects to the JVM (the Tomcat connector).

After that (if the JVM crashes again) I'll try removing some components from my application (caching, Lucene, Quartz), and later on I'll try using Jetty. Since the crash currently happens anywhere between 4 hours and 8 days in, it may take a long time to find out what's going on.

+3  A: 

A few ideas:

  • Use a different JDK, Tomcat and/or OS version
  • Slightly modify test parameters, e.g. 25 threads at 7.2 M pageviews/day
  • Monitor or profile memory usage
  • Debug or tune the Garbage Collector
  • Run static and dynamic analysis
kiwicptn
+1  A: 

Does your memory grow over time? If so, I suggest lowering the memory limits to see if the system fails more frequently when memory is exhausted.

Can you reproduce the problem faster if:

  • You decrease the memory available to the JVM (a one-line sketch follows this list)?
  • You decrease the available system resources (i.e. drain system memory so the JVM does not have enough)?
  • You change your use cases to a simpler model?
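
For instance, a faster-failing run could use a smaller heap along these lines (the numbers are just an example, not a recommendation):

CATALINA_OPTS="-server -Xms128m -Xmx256m -Djava.awt.headless=true"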

One of the main strategies that I have used is to determine which use case is causing the problem. It might be a generic issue, or it might be use case specific. Try logging the start and end of use cases to see if you can determine which ones are more likely to cause the problem. If you partition your use cases in half, see which half fails faster; that half likely contains the more frequent cause of the failure. Naturally, running a few trials of each configuration will increase the accuracy of your measurements.
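
As a rough sketch of what I mean by logging use cases, a servlet filter along these lines (class name and logging destination are placeholders, not anything from your app) records when each request starts and ends, so a crash can be matched to whatever was in flight at the time:

import java.io.IOException;
import javax.servlet.*;
import javax.servlet.http.HttpServletRequest;

// Hypothetical filter: logs the start/end of every request so a crash can be
// correlated with the use case that was running when it happened.
public class UseCaseLoggingFilter implements Filter {
    public void init(FilterConfig config) {}
    public void destroy() {}

    public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
            throws IOException, ServletException {
        String uri = ((HttpServletRequest) req).getRequestURI();
        long start = System.currentTimeMillis();
        System.out.println("START " + uri);
        try {
            chain.doFilter(req, res);
        } finally {
            System.out.println("END   " + uri + " (" + (System.currentTimeMillis() - start) + " ms)");
        }
    }
}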

I have also been known either to make the server do very little work per request, or to loop over the work that the server is doing. Looping makes your application code work a lot harder; doing little work per request makes the web server and application server work a lot harder.

Good luck, Jacob

TheJacobTaylor
Looking at your trace, system memory should not be the issue in this case. Are there any messages in the system log? Also, if I am reading it right, it looks like you might have a rather high number of threads running. There are a ton of threads waiting for available CPU at any given time. I would expect faster average response times with a smaller number of threads.
TheJacobTaylor
+1  A: 

Try switching your servlet container from Tomcat to Jetty http://jetty.codehaus.org/jetty/.

crowne
To see whether the JVM will still crash? Or for completely migrating to jetty?
cherouvim
I would go for completely migrating to Jetty, just because I like what I've seen from Jetty in the past. However, the latest comparisons that I've just googled seem to show that, performance-wise, Jetty 6 and Tomcat 6 are fairly equal, although Jetty does come across as having a lighter memory footprint. From a more methodical standpoint, as long as your application is standards compliant the migration shouldn't be too tough, and then you may be able to either eliminate the container as the root cause or confirm your application as the root cause. Good luck.
crowne
@crowne: thanks for the comment. My application is compliant with all major servers (Tomcat, JBoss, Resin, Jetty, GlassFish) so migration is no problem. I'll definitely try out the stress test on Jetty.
cherouvim
+1  A: 

If I was you, I'd do the following:

  • try slightly older Tomcat/JVM versions. You seem to be running the newest and greatest. I'd go down two versions or so, possibly try JRockit JVM.
  • do a thread dump (kill -3 java_pid) while the app is running to see the full stacks. Your current dump shows lots of threads being blocked, but it is not clear where they block (I/O? some internal lock starvation? anything else?). I'd even schedule kill -3 to run every minute, so you can compare a random thread dump with the one taken just before the crash.
  • I have seen cases where a Linux JDK just dies whereas a Windows JDK is able to gracefully catch an exception (it was a StackOverflowError in that case), so if you can modify the code, add a "catch Throwable" somewhere in the top class, just in case (a rough sketch follows this list).
  • Play with GC tuning options. Turn concurrent GC on/off, adjust NewSize/MaxNewSize. And yes, this is not scientific - more a desperate need for a working solution. More details here: http://java.sun.com/javase/technologies/hotspot/gc/gc_tuning_6.html
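
A very rough sketch of the "catch Throwable" idea, assuming your app has (or can be given) a single top-level servlet to wrap; the class name and the logging are placeholders:

import java.io.IOException;
import javax.servlet.ServletException;
import javax.servlet.http.*;

// Hypothetical top-level servlet: wraps normal processing in catch Throwable
// so anything unusual gets logged before the information is lost.
public class TopLevelServlet extends HttpServlet {
    protected void service(HttpServletRequest req, HttpServletResponse resp)
            throws ServletException, IOException {
        try {
            super.service(req, resp); // dispatches to doGet/doPost as usual
        } catch (Throwable t) {
            t.printStackTrace();      // replace with your logging of choice
            throw new ServletException(t);
        }
    }
}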

Let us know how this was sorted out!

mindas
+2  A: 

Have you tried different hardware? It looks like you're using a 64-bit architecture. In my own experience 32-bit is faster and more stable. Perhaps there's a hardware issue somewhere too. A crash window of anywhere between 4 hours and 8 days is quite spread out for a pure software issue. Although you do say the system log has no errors, so I could be way off. I still think it's worth a try.

Daniil
Trying out different hardware is not an option, but I'll try the 32bit jvm. Thanks
cherouvim
+1  A: 

Is it an option to go to the 32-bit JVM instead? I believe it is the most mature offering from Sun.

Thorbjørn Ravn Andersen
Will try that out. Thanks.
cherouvim
+2  A: 

Do you have compiler output? i.e. PrintCompilation (and, if you're feeling particularly brave, LogCompilation).

I have debugged a case like this in the past by watching what the compiler is doing and, eventually (this took a long time until the light bulb moment), realising that my crash was caused by the compilation of a particular method in the Oracle JDBC driver.

So basically what I'd do is:

  • switch on PrintCompilation
  • since that doesn't give timestamps, write a script that watches the logfile (e.g. sleep for a second, print any new rows) and reports when methods were (or weren't) compiled
  • repeat the test
  • check the compiler output to see if the crash corresponds with compilation of some method
  • repeat a few more times to see if there is a pattern

If there is a discernible pattern, then use .hotspot_compiler (or .hotspotrc) to make it stop compiling the offending method(s), repeat the test and see if it doesn't blow up. Obviously, in your case this process could theoretically take months, I'm afraid.
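
For example, a .hotspot_compiler file dropped in the JVM's working directory could contain something like this (the class and method here are placeholders for whatever your compilation log points at):

exclude com/example/SomeSuspectClass someSuspectMethod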

The other thing I'd do is systematically change the GC algorithm you're using and check the crash times against GC activity (e.g. does it correlate with a young or old GC? what about TLABs?). Your dump indicates you're using parallel scavenge, so try the following (rough flags are sketched after this list):

  • the serial (young) collector (IIRC it can be combined with a parallel old)
  • ParNew + CMS
  • G1
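
Roughly, the flags involved are the following (I haven't re-checked exactly which of these are available and stable on 6u18, so treat this as a starting point rather than a recipe):

serial collector:  -XX:+UseSerialGC
ParNew + CMS:      -XX:+UseParNewGC -XX:+UseConcMarkSweepGC
G1 (experimental): -XX:+UnlockExperimentalVMOptions -XX:+UseG1GC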

If it doesn't recur with a different GC algorithm then you know it's down to that (and you have no fix other than to change the GC algorithm and/or walk back through older JVMs until you find a version of that algorithm that doesn't blow up).

Cheers Matt

Matt
Thanks for bringing PrintCompilation to my attention. Will definitely try this out.
cherouvim