We have two Tomcat 6.0.20 servers fronted by Apache, with communication between the two using AJP. Tomcat in turn consumes web services on a JBoss cluster.

This morning, one of the Tomcat machines was using 100% of CPU on 6 of the 8 cores on our machine. We took a heap dump using JConsole, and then tried to connect JVisualVM to get a profile to see what was taking all the CPU, but this caused Tomcat to crash. At least we had the heap dump!

I have loaded the heap dump into Eclipse MAT, where I have found that we have 565 instances of java.lang.Thread. Some of these, obviously, are entirely legitimate, but the vast majority are named "ajp-6009-XXX" where XXX is a number.

I know my way around Eclipse MAT pretty well, but haven't been able to find an explanation for this. If anyone has some pointers as to why Tomcat may be doing this, or some hints on how to track down the cause in Eclipse MAT, that'd be appreciated!

+1  A: 

This isn't a direct answer, I guess, but as a mitigating approach in production you could limit the damage by restricting maxThreads for the AJP connector in your configuration, per http://tomcat.apache.org/tomcat-6.0-doc/config/ajp.html.
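For reference, a minimal sketch of what that might look like in conf/server.xml (port 6009 is only a guess based on the thread names in your dump, and the maxThreads value here is illustrative rather than a recommendation):

    <!-- AJP connector; capping maxThreads bounds how many ajp-6009-* threads can exist -->
    <Connector port="6009" protocol="AJP/1.3" redirectPort="8443" maxThreads="100" />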

The default is 200, which sure is a lot of threads, although that possibly doesn't explain the 565 above. Obviously this has the potential to push the problem elsewhere, but perhaps you'll be better able to debug the problem there, or it will manifest itself in a different way. Is it possible that you're simply under a high amount of load? Is there anything notable in Apache's behaviour in the period leading up to the problem?

Chad
A: 

Impossible to know for sure unless you managed to get a thread dump, but I once experienced a similar problem where all 8 cores were busy at 100% with thousands of threads (it wasn't on Tomcat, however).
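As an aside, if it happens again a thread dump is cheap to take and doesn't need a profiler; assuming a Sun JDK, where <pid> is the Tomcat process id, either of these should work:

    jstack <pid>
    kill -3 <pid>     # SIGQUIT: the JVM prints all thread stacks to stdout, i.e. catalina.out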

In our case, each thread was stuck inside java.util.HashMap in the get() method, spinning tightly in the for loop:

    public V get(Object key) {
        if (key == null)
            return getForNullKey();
        int hash = hash(key.hashCode());
        for (Entry<K,V> e = table[indexFor(hash, table.length)];
             e != null;
             e = e.next) {
            Object k;
            if (e.hash == hash && ((k = e.key) == key || key.equals(k)))
                return e.value;
        }
        return null;
    }

Our theory was that the linked list of entries at a particular bucket had somehow become corrupted and was pointing back to itself, so the loop could never terminate. Since no job ever finished, more and more threads were consumed from the pool as further requests came in.

This can occur if the table has to be resized whilst new entries are being put, but there is unguarded read/write access from several threads: one thread may be extending the linked list at a particular bucket whilst another is busy trying to move it. If access to the hash map is not synchronized then it is quite likely to become corrupted eventually (although the corruption is generally not reproducible).
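To make that concrete, here is a contrived sketch (not your code, and the class and field names are made up) of the kind of unsynchronized shared access that can corrupt a plain HashMap in this way:

    import java.util.HashMap;
    import java.util.Map;

    public class HashMapRaceDemo {
        // Shared, unsynchronized map: this is the bug being illustrated.
        private static final Map<Integer, String> cache = new HashMap<Integer, String>();

        public static void main(String[] args) {
            // Several writer threads hammering the same map. If a resize in one
            // thread interleaves with a put in another, the entry chain at a
            // bucket can end up pointing back at itself, after which get() on
            // that bucket spins forever, exactly as in the loop above.
            for (int t = 0; t < 8; t++) {
                new Thread(new Runnable() {
                    public void run() {
                        for (int i = 0; i < 1000000; i++) {
                            cache.put((int) (Math.random() * 100000), "value");
                        }
                    }
                }).start();
            }
        }
    }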

Check whether there is a shared HashMap (or HashSet) which several threads can access simultaneously. If so, and it is easy to change, either replace it with a ConcurrentHashMap, or use a ReentrantReadWriteLock to guard read/write access to the map. You could of course try Collections.synchronizedMap() too, but that would not be as scalable.
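A rough sketch of the first two options (the class and field names are invented for illustration):

    import java.util.HashMap;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.locks.ReentrantReadWriteLock;

    public class SafeCaches {
        // Option 1: ConcurrentHashMap is safe for concurrent use and scales well;
        // usually a drop-in replacement if you control where the map is created.
        private final Map<String, Object> concurrentCache = new ConcurrentHashMap<String, Object>();

        // Option 2: keep the plain HashMap but guard every access with a
        // read/write lock (readers proceed in parallel, writers are exclusive).
        private final Map<String, Object> guardedCache = new HashMap<String, Object>();
        private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

        public Object get(String key) {
            lock.readLock().lock();
            try {
                return guardedCache.get(key);
            } finally {
                lock.readLock().unlock();
            }
        }

        public void put(String key, Object value) {
            lock.writeLock().lock();
            try {
                guardedCache.put(key, value);
            } finally {
                lock.writeLock().unlock();
            }
        }
    }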

Any of these proposed fixes should prevent the issue, if that turns out to be the root cause of your problem.

See also:

http://lightbody.net/blog/2005/07/hashmapget_can_cause_an_infini.html

http://mailinator.blogspot.com/2009/06/beautiful-race-condition.html

rhu