views:

5483

answers:

4

On my busiest production installation, on occasion I get a single thread that seems to get stuck in an infinite loop. I've not managed to figure out who is the culprit, after much research and debugging, but it seems like it should be possible. Here are the gory details:

Current debugging notes:

1) ps -eL 18975 shows me the the Linux pid the problem child thread, 19269

$ps -eL | grep 18975
...
PID   LWP   TTY          TIME CMD
18975 18994 ?        00:00:05 java
18975 19268 ?        00:00:00 java
18975 19269 ?        05:16:49 java
18975 19271 ?        00:01:22 java
18975 19273 ?        00:00:00 java
...

2) jstack -l 18975 says there are no deadlocks, jstack -m 18975 does not work

3) jstack -l 18975 does give me the stack trace for all my threads (~400). Example thread stack (and not the problem):

"http-342.877.573.944-8080-360" daemon prio=10 tid=0x0000002adaba9c00 nid=0x754c in Object.wait() [0x00000000595bc000..0x00000000595bccb0]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        - waiting on  (a org.apache.tomcat.util.net.JIoEndpoint$Worker)
        at java.lang.Object.wait(Object.java:485)
        at org.apache.tomcat.util.net.JIoEndpoint$Worker.await(JIoEndpoint.java:416)
        - locked  (a org.apache.tomcat.util.net.JIoEndpoint$Worker)
        at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:442)
        at java.lang.Thread.run(Thread.java:619)

4) The ps -eL output's thread ID does not match the output from jstack, or at least I cannot see it. (jstack documentation is a bit sparse.)

5) There are no heavy IO, memory usage or other corresponding activity clues to work with.

Platform:

  • Java 6
  • Tomcat 6
  • RHEL 4 (64-bit)

Does anybody know how I can make that connection from the linux ps output to my problem child java thread? So close, yet so far...

+5  A: 

You can use JConsole to view the thread's stack trace.

If your using JDK 1.6.0_07 or above, you can also use visualvm.

Both tools provide a nice view of all the running threads in an application. The visualvm is quite a bit nicer, but hopefully seeing all the threads can help you track down the run-away thread.

Check for threads that are always in a state of RUNNING. When we had a run-away thread, the stack trace would constantly change. So we were able to tell which methods the loop was calling, and track down the loop.

Steve K
But I can't tell which thread is the problem from just jconsole. :( With over 400 threads, and the ps output being helpful, I should be able to make an authoritative analysis.
Stu Thompson
+1 for jconsole. Jstack sounds like it is unsupported: "NOTE - This utility is unsupported and may or may not be available in future versions of the J2SE SDK. jstack is not currently available on Windows platforms or on the Linux Itanium platform."
matt b
(Sorry about the delete and repost.) Hmmmm...OK, a bit kludgy and best done at 2am, but I see your logic...that could work...
Stu Thompson
just because jstack is unsupported does not mean it is not a valid debugging tool.
Stu Thompson
visualvm has a filter for "Show Live threads only" that might be helpful for you.
matt b
A: 

From memory if you CTRL-BREAK on the console you will get a dump of the current threads and a few of their stack trace frames.

From memory (I'm not sure if this is an IntelliJ IDEa feature, or it is default in java) but it will tell you which thread is deadlocked, and which object they are waiting on. You should be able to redirect the output to a file, and just grep for the DEADLOCKED text.

JConsole, VisualVM or other profilers such as JProfiler will also show you the threads and their stacks, however if you don't want to use any external tool I think CTRL-BREAK will give you what you're looking for.

Aidos
+5  A: 

It looks like the nid in the jstack output is the Linux LWP id.

"http-342.877.573.944-8080-360" daemon prio=10 tid=0x0000002adaba9c00 nid=0x754c in Object.wait() [0x00000000595bc000..0x00000000595bccb0]

Convert the nid to decimal and you have the LWP id. In your case 0x754c is 30028. This process is not shown in our ps output, but it was probably one of the LWPs you have omitted to save space.

Here's a little a Perl snippet you can use to pipe the output of jstack to:

#!/usr/bin/perl -w
while (<>) {
    if (/nid=(0x[[:xdigit:]]+)/) {
        $lwp = hex($1);
        s/nid=/lwp=$lwp nid=/;
    }
    print;
}
bromfiets
While I had tried this a year ago with no success, it is working for me now. I'm on a new version of Java, so maybe that is it. Or I just made a `hexNid2dec(pid)` math error. Or maybe was doing something else incredible stupid.
Stu Thompson
Just for the record, 'nid' is 'native id' -- the underlying system's native identifier for the Java thread.
Cowan
A: 

On SUN

Note that prstat by default shows the no of light weight processes , not the LWPID.

To see information for all the lightweight processes for a particular user use the -L option.

prstat -L -v -u weblogic

now use the LWPID and convert it into hex and match it with the nid from the thread dump

yes