views:

145

answers:

2

The application, when subjected to load, sometimes hits 100% CPU utilisation.

Doing a kill -QUIT <pid> showed 1100+ threads in WAITING state, like this:

Full thread dump Java HotSpot(TM) 64-Bit Server VM (16.3-b01 mixed mode):

"http-8080-1198" daemon prio=10 tid=0x00007f17b465c800 nid=0x2061 in Object.wait() [0x00007f1762b6e000]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        - waiting on <0x00007f17cb087890> (a org.apache.tomcat.util.net.JIoEndpoint$Worker)
        at java.lang.Object.wait(Object.java:485)
        at org.apache.tomcat.util.net.JIoEndpoint$Worker.await(JIoEndpoint.java:458)
        - locked <0x00007f17cb087890> (a org.apache.tomcat.util.net.JIoEndpoint$Worker)
        at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:484)
        at java.lang.Thread.run(Thread.java:619)

"http-8080-1197" daemon prio=10 tid=0x00007f17b465a800 nid=0x2060 in Object.wait() [0x00007f1762c6f000]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        - waiting on <0x00007f17cb14f460> (a org.apache.tomcat.util.net.JIoEndpoint$Worker)
        at java.lang.Object.wait(Object.java:485)
        at org.apache.tomcat.util.net.JIoEndpoint$Worker.await(JIoEndpoint.java:458)
        - locked <0x00007f17cb14f460> (a org.apache.tomcat.util.net.JIoEndpoint$Worker)
        at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:484)
        at java.lang.Thread.run(Thread.java:619)
............

The state does not change even when the application context is undeployed or the DB is restarted.

Please suggest a probable cause.

App Server: Apache Tomcat 6.0.26

Max Threads: 1500

Threads in WAITING state: 1138

+1  A: 

"waiting on" is not a problem. The thread is waiting to be notified - and in this case it is locked on the JIoEndpoint.Worker

The background thread that listens for incoming TCP/IP connections and hands them off to an appropriate processor.

So I think these threads are just waiting for actual requests to come in.
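
For illustration, here is a minimal sketch of that idle-worker wait/notify pattern (illustrative only, not Tomcat's actual source): the acceptor thread assigns a connection and notifies, and until then the worker sits in wait(), which is exactly the state your dump shows.

    // Minimal sketch of the idle-worker handoff pattern (illustrative, not Tomcat source).
    // Each worker parks in wait() until an acceptor thread assigns it a connection,
    // which is why idle workers show up as WAITING (on object monitor).
    public class IdleWorker implements Runnable {

        private Object socket;            // stands in for the accepted connection
        private boolean available = false;

        // Called by the acceptor thread to hand a connection to this worker.
        public synchronized void assign(Object s) {
            socket = s;
            available = true;
            notifyAll();
        }

        // Called by the worker thread itself: block until work arrives.
        private synchronized Object await() throws InterruptedException {
            while (!available) {
                wait();                   // idle workers sit here, consuming no CPU
            }
            available = false;
            return socket;
        }

        public void run() {
            try {
                while (true) {
                    Object s = await();
                    // ... process the request on s ...
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }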

Firstly, CPU utilization actually increases when you have many threads, because of the high amount of context switching. Do you actually need 1500? Can you try reducing it?

Secondly, is it hogging memory or GC-ing too often?
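
One quick way to check the GC side - just a sketch, assuming you can run a small monitoring thread inside the webapp (e.g. started from a context listener); enabling -verbose:gc gives you the same information with no code:

    import java.lang.management.GarbageCollectorMXBean;
    import java.lang.management.ManagementFactory;

    // Logs cumulative GC counts and times every few seconds. Run it inside the
    // same JVM as the webapp; if the time figures climb steeply under load,
    // GC is likely part of the CPU problem.
    public class GcStatsLogger implements Runnable {
        public void run() {
            try {
                while (!Thread.currentThread().isInterrupted()) {
                    for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
                        System.out.printf("%s: collections=%d, time=%d ms%n",
                                gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
                    }
                    Thread.sleep(5000);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }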

"waiting for" would be a problem if you see those. Do you have any BLOCKED(on object monitor) or waiting to lock () in the stack trace?

JoseK
We are load-testing it with 7500 concurrent users. Is there a ballpark figure for the ratio of threads to concurrent users?
Mohit Nanda
@Mohit: Load testing is the right way to go. It depends on how long each request takes per user and what processing they typically do. http://people.apache.org/~mturk/docs/article/ftwai.html says *To get most out of Tomcat you should limit the number of concurrent requests to 200 per CPU.*
JoseK
7500 concurrent **users or requests** - that's quite a lot - so are you clustered?
JoseK
A: 

On a Solaris system you can use the command

prstat -L -p <pid> 0 1 > filename.txt

This will give you a breakdown of the work each light-weight process (LWP) is doing on the CPU, keyed by LWPID instead of PID. When you look at your thread dump you can match each LWPID up to the nid (or tid, depending on the implementation) shown on the first line of each thread's entry. By matching these two up you will be able to tell which of your threads are the CPU hogs.

Here is an example of the output.

   PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU   PROCESS/LWPID
   687 user      1024M  891M sleep   59    0   0:40:07 12.0% java/5
   687 user      1024M  891M sleep   59    0   0:34:43 15.3% java/4
   687 user      1024M  891M sleep   59    0   0:17:00 7.6%  java/3
   687 user      1024M  891M sleep   59    0   1:00:07 31.4% java/2

Then with a corresponding thread dump, you can find these threads

"GC task thread#0 (ParallelGC)" prio=3 tid=0x00065295 nid=0x2 runnable
"GC task thread#1 (ParallelGC)" prio=3 tid=0x00012345 nid=0x3 runnable
"GC task thread#2 (ParallelGC)" prio=3 tid=0x0009a765 nid=0x4 runnable
"GC task thread#3 (ParallelGC)" prio=3 tid=0x0003456b nid=0x5 runnable

So in this high-CPU case, the problem was in garbage collection. This is seen by matching the nid with the LWPID field (the nid is simply the LWPID written in hexadecimal).
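
Since prstat prints the LWPID in decimal and the dump prints the nid in hex, larger IDs need converting - e.g. the nid=0x2061 in the question's dump is decimal 8289. A one-line sketch of the conversion:

    // Convert between prstat's decimal LWPID and the hexadecimal nid in the thread dump.
    public class NidConverter {
        public static void main(String[] args) {
            int lwpid = 8289;
            System.out.println("nid=0x" + Integer.toHexString(lwpid));   // prints nid=0x2061
            System.out.println(Integer.parseInt("2061", 16));            // prints 8289
        }
    }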

If this helps you out, I would suggest making a script that captures the prstat output and a thread dump at the same moment. This will give you the most accurate representation of your application.
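
A rough sketch of such a capture step, written in Java for consistency with the rest of this thread (a shell script would do just as well); it assumes a Unix host with prstat and kill on the PATH, that you run it as the same user as the Tomcat process, and that the thread dump lands in the JVM's stdout log (catalina.out). Note ProcessBuilder.redirectOutput needs Java 7+.

    import java.io.File;
    import java.io.IOException;

    // Captures prstat output and triggers a thread dump for the same instant,
    // so per-LWP CPU figures can be lined up with the nids in the dump.
    public class CpuThreadSnapshot {
        public static void main(String[] args) throws IOException, InterruptedException {
            String pid = args[0];
            long stamp = System.currentTimeMillis();

            // prstat -L -p <pid> 0 1 > prstat-<stamp>.txt
            new ProcessBuilder("prstat", "-L", "-p", pid, "0", "1")
                    .redirectOutput(new File("prstat-" + stamp + ".txt"))
                    .start()
                    .waitFor();

            // kill -QUIT <pid>: the JVM writes the thread dump to its stdout (catalina.out)
            new ProcessBuilder("kill", "-QUIT", pid)
                    .start()
                    .waitFor();
        }
    }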

As for your original two threads, @JoseK was correct. Those threads are just sitting and waiting to take a request from a user. There is no problem there.

Sean
Thx, will try it out to diagnose it further.
Mohit Nanda