views:

390

answers:

4

Hello. I have a nasty issue with load-balanced Tomcat servers that are hanging up. Any help would be greatly appreciated.

The system

I'm running Tomcat 6.0.26 on HotSpot Server 14.3-b01 (Java 1.6.0_17-b04) on three servers that sit behind another server acting as a load balancer. The load balancer runs Apache (2.2.8-1) + mod_jk (1.2.25). All of the servers run Ubuntu 8.04.

Each Tomcat has two connectors configured: an AJP one and an HTTP one. The AJP connector is used by the load balancer, while the HTTP connector lets the dev team connect directly to a chosen server (when we have a reason to do so).
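For reference, the two connectors are defined in server.xml roughly like this (the ports and attribute values here are illustrative, not our exact config):

```xml
<!-- AJP connector used by mod_jk on the load balancer -->
<Connector port="8009" protocol="AJP/1.3" redirectPort="8443" />

<!-- HTTP connector for direct developer access -->
<Connector port="8080" protocol="HTTP/1.1"
           connectionTimeout="20000" redirectPort="8443" />
```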

I have Lambda Probe 1.7b installed on the Tomcat servers to help diagnose and fix the problem described below.

The problem

Here's the problem: after the application servers have been up for about a day, JK Status Manager starts reporting status ERR for, say, Tomcat2. It simply gets stuck in this state, and the only fix I've found so far is to ssh into the box and restart Tomcat.

I must also mention that JK Status Manager takes a lot longer to refresh when there's a Tomcat server in this state.

Finally, the "Busy" count of the stuck Tomcat in JK Status Manager is always high, and won't go down on its own -- I must restart the Tomcat server, wait, then reset the worker on JK.

Analysis

Since I have two connectors on each Tomcat (AJP and HTTP), I can still connect to the application through the HTTP one. The application works just fine this way, very, very fast. That is perfectly normal, since I'm the only one using the server (JK has stopped delegating requests to this Tomcat).

To better understand the problem, I've taken a thread dump from a Tomcat that is no longer responding, and from another one that was restarted recently (say, one hour earlier).
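For anyone wanting to reproduce this, the dumps were taken the usual way (the class name in the pgrep pattern is the stock Tomcat bootstrap entry point; adjust it if your startup script differs):

```shell
# Find the Tomcat JVM (adjust the pattern if your startup differs)
PID=$(pgrep -f org.apache.catalina.startup.Bootstrap)

if [ -n "$PID" ]; then
    # SIGQUIT makes the HotSpot JVM dump all thread stacks to catalina.out
    kill -3 "$PID"

    # Alternatively, jstack (bundled with the JDK) writes the dump to stdout
    jstack "$PID" > /tmp/tomcat-threads.txt
fi
```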

The instance that is responding normally to JK shows most of the TP-ProcessorXXX threads in "Runnable" state, with the following stack trace:

java.net.SocketInputStream.socketRead0 ( native code )
java.net.SocketInputStream.read ( SocketInputStream.java:129 )
java.io.BufferedInputStream.fill ( BufferedInputStream.java:218 )
java.io.BufferedInputStream.read1 ( BufferedInputStream.java:258 )
java.io.BufferedInputStream.read ( BufferedInputStream.java:317 )
org.apache.jk.common.ChannelSocket.read ( ChannelSocket.java:621 )
org.apache.jk.common.ChannelSocket.receive ( ChannelSocket.java:559 )
org.apache.jk.common.ChannelSocket.processConnection ( ChannelSocket.java:686 )
org.apache.jk.common.ChannelSocket$SocketConnection.runIt ( ChannelSocket.java:891 )
org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run ( ThreadPool.java:690 )
java.lang.Thread.run ( Thread.java:619 )

The instance that is stuck shows most (all?) of the TP-ProcessorXXX threads in "Waiting" state. These have the following stack trace:

java.lang.Object.wait ( native code )
java.lang.Object.wait ( Object.java:485 )
org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run ( ThreadPool.java:662 )
java.lang.Thread.run ( Thread.java:619 ) 

I don't know the internals of Tomcat, but I would infer that the "Waiting" threads are simply idle threads sitting in a thread pool. But if they are idle threads waiting in the pool, why doesn't Tomcat put them to work processing requests from JK?
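To illustrate what such a thread looks like: an idle pool thread parked in Object.wait() is harmless by itself. Here is a minimal sketch of the same idiom (a simplified stand-in, not Tomcat's actual ThreadPool code):

```java
// A worker thread parked in Object.wait(), the way Tomcat's
// ThreadPool$ControlRunnable parks idle threads between requests.
public class IdleWorker {
    public static void main(String[] args) throws InterruptedException {
        final Object lock = new Object();

        Thread worker = new Thread(new Runnable() {
            public void run() {
                synchronized (lock) {
                    try {
                        lock.wait(); // shows as "Waiting" in a thread dump
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                }
            }
        }, "TP-Processor-demo");
        worker.start();

        // Poll until the worker has actually parked in wait()
        while (worker.getState() != Thread.State.WAITING) {
            Thread.sleep(10);
        }
        System.out.println(worker.getState()); // prints WAITING

        // The pool wakes an idle thread with notify() when it hands it work
        synchronized (lock) {
            lock.notify();
        }
        worker.join();
    }
}
```

So "Waiting" threads in the dump only prove the pool has idle capacity; the question is why nothing is waking them up with new AJP connections.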

EDIT: I don't know if this is normal, but Lambda Probe shows me, in the Status section, that there are lots of threads in KeepAlive state. Is this somehow related to the problem I'm experiencing?

Solution?

So, as stated above, the only fix I've found is to stop the Tomcat instance, stop the JK worker, wait for the latter's busy count to slowly go down, start Tomcat again, and enable the JK worker once more.

What is causing this problem? How should I further investigate it? What can I do to solve it?

Thanks in advance.

+1  A: 

Check your log file first.

I think the default log file is /var/log/daemon.log (note that this file does not contain only Tomcat's logs).

telebog
+1  A: 

Check your keep-alive time settings. It seems you are getting threads stuck in keepalive state and they never time out, which suggests your server is not detecting client disconnects within a reasonable time. There are several timeout and count variables involved.
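For example, mod_jk's backend-connection idle timeout and Tomcat's connector timeout are meant to be set together (values below are illustrative; note that connection_pool_timeout is in seconds while connectionTimeout is in milliseconds):

```
# workers.properties on the load balancer:
# close backend connections idle for more than 60 seconds
worker.tomcat2.connection_pool_timeout=60
```

```xml
<!-- server.xml on the Tomcat side: match the pool timeout (in ms) -->
<Connector port="8009" protocol="AJP/1.3" connectionTimeout="60000" />
```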

BillThor
+1  A: 

Do you have JVM memory settings and garbage collection configured? You would do this wherever you set your CATALINA_OPTS.

examples:

CATALINA_OPTS="$CATALINA_OPTS -server -Xnoclassgc -Djava.awt.headless=true"
CATALINA_OPTS="$CATALINA_OPTS -Xms1024M -Xmx5120M -XX:MaxPermSize=256m"
CATALINA_OPTS="$CATALINA_OPTS -XX:-UseParallelGC"
CATALINA_OPTS="$CATALINA_OPTS -Xnoclassgc"

There are multiple philosophies on which GC setting is best. It depends on the kind of code that you are executing. The config above worked best for a JSP-intensive environment (taglibs instead of MVC framework).

Hugh Lang
there's a missing line break in my config sample. it somehow got removed by the form submit. (right after UseParallelGC" )
Hugh Lang
@Hugh Lang: fixed the line break for you by making the whole block 'code' - you can do that by selecting the text and click the '1010' button or by indenting 4 spaces.
Simon Groenewolt
A: 

I've had a similar problem with WebLogic. The cause was that too many threads were waiting for network responses and WebLogic was running out of memory. Tomcat probably behaves the same way. Things you can try:

  • Decrease the timeout value of your connections.
  • Decrease the total number of simultaneous connections, so that Tomcat doesn't start new threads once that limit is reached.
  • Easy fix that doesn't correct the root cause: Tomcat may be in an out-of-memory state even though it's not showing up in the logs yet. Increase Tomcat's memory as described in the earlier answer.
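On Tomcat, the first two knobs correspond to connector attributes in server.xml, e.g. (values are illustrative):

```xml
<!-- maxThreads caps the worker threads this connector will create;
     connectionTimeout (ms) drops connections that sit idle too long -->
<Connector port="8009" protocol="AJP/1.3"
           maxThreads="200"
           connectionTimeout="20000" />
```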
Lauri Larjo