views: 835 | answers: 3
Originally posted on Server Fault, where it was suggested this question might be better asked here.

We are using JBoss to run two of our WARs. One is our web app, the other is our web service. The web app accesses a database on another machine and makes requests to the web service. The web service makes JMS requests to other machines, aggregates the data, and returns it.

At our biggest client, about once a month the JBoss Java process takes 100% of all 8 CPUs on the machine running JBoss. Our web app is still accessible during this time, but pages take about 3 minutes to load. Restarting JBoss restores everything to normal.

The database machine and all the other machines are fine, only the machine running JBoss is affected. Memory usage is normal. Network utilization is normal. There are no suspect error messages in the JBoss logs.

I have set up a test environment as close as possible to the client's production environment and I've done load testing with as much as 2x the number of concurrent users. I have not gotten my test environment to replicate the problem.

Where do we go from here? How can we narrow down the problem?

Currently the only plan we have is to wait until the problem occurs in production on its own, then do some debugging to determine the cause. So far people have just restarted JBoss when the problem occurred to minimize downtime. Next time it happens they will get a developer to take a look. The question is: next time it happens, what can be done to determine the cause?

We could set up a separate JBoss instance on the same box and install the web app separately from the web service. That way, when the problem next occurs, we will know which WAR has the problem (assuming it is our code). This doesn't narrow it down much, though.

Should I enable JMX remote? That way, the next time the problem occurs I can connect with VisualVM and see which threads are taking the CPU and what the hell they are doing. However, is there a significant downside to enabling JMX remote in a production environment?

Is there another way to see what threads are eating the CPU and to get a stacktrace to see what they are doing?
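One possibility, if adding a small diagnostic hook to one of the WARs is acceptable, is the java.lang.management API. The sketch below is mine, not from the thread: it assumes it can run inside the JBoss JVM (for example from a servlet or a scheduled job), since it only sees threads in its own process, and the class name and frame limit are arbitrary.

    import java.lang.management.ManagementFactory;
    import java.lang.management.ThreadInfo;
    import java.lang.management.ThreadMXBean;

    public class CpuHogReport {
        public static void main(String[] args) {
            ThreadMXBean threads = ManagementFactory.getThreadMXBean();
            for (long id : threads.getAllThreadIds()) {
                long cpuNanos = threads.getThreadCpuTime(id); // -1 if disabled or unsupported
                ThreadInfo info = threads.getThreadInfo(id, 30); // at most 30 stack frames
                if (info == null || cpuNanos <= 0) {
                    continue; // thread already died, or CPU timing is unavailable
                }
                System.out.printf("%s (id %d): %d ms CPU%n",
                        info.getThreadName(), id, cpuNanos / 1000000L);
                for (StackTraceElement frame : info.getStackTrace()) {
                    System.out.println("    at " + frame);
                }
            }
        }
    }

Run during a lockup, output like this should show the hot threads and where they are spinning, much like the tools suggested in the answers below.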

Any other ideas?

Thanks!

+2  A: 

I think you should definitely try to set up a test environment with some load testing in order to reproduce your issue. Profiling would then help pinpoint the problem.

A quick first step would be, next time it happens, to send the JBoss process a kill -3 (SIGQUIT) in order to get a thread dump to analyze; it won't terminate the JVM. The second thing I would check is that you are running with the -server flag and that your GC settings are sane. You could also run dstat to see what the process is doing during the lockup. But again - it is probably safer to just set up a load testing environment (via EC2 or so) to reproduce this.

disown
I have a test environment set up and I've been using The Grinder to hammer it. I'm unable to reproduce the problem there; not sure why. Maybe my tests don't exercise the same data or as wide a variety of it. I've profiled my tests to be sure there is normally no thread contention. I did find production wasn't using -server, and I yelled at someone for it. :) GC settings are the default. Is this so bad? I will definitely check out the commands you listed.
NateS
+1 for thread dump
matt b
Sorry Nate, missed the load testing section in your post. I really need to start reading posts before answering them :)
disown
I've got the tools ready to debug the problem next time it happens in production! This client uses Windows (don't ask). I ended up using CDB, a Windows debugging tool, to get cumulative per-thread CPU usage and native thread IDs. I have a script that runs it twice with 10 seconds between runs; the threads whose CPU time changes the most are the culprits. Then I run jstack from the JDK to get the thread stacktraces, which include the native IDs. Now we just need production to chowder again! :)
NateS
+2  A: 

This typically happens with runaway code or unsafe concurrent access to HashMaps. A simple thread dump (kill -3, as @disown says, or Ctrl-Break in a Windows console) will reveal the problem.
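To make the HashMap point concrete (this example is mine, not the answerer's): on older JVMs, unsynchronized concurrent puts can corrupt the map's internal bucket chain during a resize, after which threads spin forever inside HashMap methods at 100% CPU, which is exactly the symptom described. A minimal sketch of the hazard, with arbitrary thread and iteration counts:

    import java.util.HashMap;
    import java.util.Map;

    public class HashMapRace {
        // Shared, unsynchronized map: the kind of code that can pin a CPU
        private static final Map<Integer, Integer> MAP = new HashMap<Integer, Integer>();

        public static void main(String[] args) {
            for (int t = 0; t < 8; t++) {
                new Thread(new Runnable() {
                    public void run() {
                        for (int i = 0; i < 1000000; i++) {
                            MAP.put(i, i); // concurrent resize can corrupt the bucket chain
                        }
                    }
                }).start();
            }
        }
    }

A thread dump taken during such a hang shows the stuck threads deep in HashMap internals; ConcurrentHashMap or external synchronization avoids it.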

Since you're unable to reproduce it using tests, I think it smells like a concurrency issue; it's usually hard to make test scripts behave randomly enough to catch issues of this type.

I normally try to make it standard operating procedure to take a thread dump of any JVM that is restarted due to operational anomalies; it's really a requirement for catching these once-a-month problems.

krosenvold
+1  A: 

There's a quick and dirty way of identifying which threads are using up the CPU time on JBoss. Go to the JMX Console in a browser (usually at http://localhost:8080/jmx-console, but it may be different for you) and look for a bean called ServerInfo; it has an operation called listThreadCpuUtilization which dumps the actual CPU time used by each active thread in a nice tabular format. If one thread is misbehaving, it usually stands out like a sore thumb.

There's also the listThreadDump operation which dumps the stack for every thread to the browser.

Not as good as a profiler, but a much easier way to get the basic information. For production servers, where it's often bad news to connect a profiler, it's very handy.
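For what it's worth, the same operations can also be invoked programmatically over JMX rather than through the browser. The sketch below is an assumption-heavy example of mine: it presumes remote JMX access is enabled, the URL and port 9999 are placeholders, and the bean is assumed to be registered under the usual jboss.system:type=ServerInfo name, which can vary by JBoss version.

    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    public class ListThreadCpu {
        public static void main(String[] args) throws Exception {
            // Placeholder URL: assumes standard remote JMX is enabled on port 9999
            JMXServiceURL url = new JMXServiceURL(
                    "service:jmx:rmi:///jndi/rmi://localhost:9999/jmxrmi");
            JMXConnector connector = JMXConnectorFactory.connect(url);
            try {
                MBeanServerConnection mbsc = connector.getMBeanServerConnection();
                // The ServerInfo bean mentioned above; the exact name may differ by version
                ObjectName serverInfo = new ObjectName("jboss.system:type=ServerInfo");
                Object table = mbsc.invoke(serverInfo,
                        "listThreadCpuUtilization", new Object[0], new String[0]);
                System.out.println(table); // per-thread CPU time, as returned by the operation
            } finally {
                connector.close();
            }
        }
    }

That makes it easy to script periodic snapshots and diff them, along the lines of the CDB approach described above.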

skaffman
I checked this out. It is very useful! Though you have to use thread names rather than IDs to correlate between the thread CPU utilization list and the thread stacktraces.
NateS