I am investigating a very peculiar problem on our lab servers. Whenever we run a Java program on a machine with a 64-bit SUSE SLES11 installation that is accessed through Citrix, the program hangs. The machine has the latest updates installed, but that doesn't help. If any one of these circumstances changes, it works: a 32-bit OS, SLES10.2, or access via Cygwin/Exceed. Other X applications such as xclock also work fine.

This might look like a ServerFault question so far, but what I'm actually looking for is suggestions for tools I can use to trace what the program is actually doing. According to strace, it hangs in a FUTEX_WAIT:

futex(0x7f4e3eaab9e0, FUTEX_WAIT, 19686, NULL

The trace stops right after the NULL and stays there indefinitely. I have found a previous bug report that looks somewhat similar to this problem, but the circumstances are very different.

UPDATE: Apparently, FUTEX_WAIT problems can be a sign of strange race conditions in the kernel/libc locking up processes. I will have to try a newer kernel/libc and see if either makes any difference.

UPDATE 2: The kernel/libc changes made no difference. I did manage to start jvisualvm with a predictable external JMX port, let it hang, and connect to it from another machine. There I found this in the thread trace for main:

Name: main
State: RUNNABLE
Total blocked: 0  Total waited: 0

Stack trace: 
sun.awt.X11GraphicsDevice.getDoubleBufferVisuals(Native Method)
sun.awt.X11GraphicsDevice.makeDefaultConfiguration(X11GraphicsDevice.java:208)
sun.awt.X11GraphicsDevice.getDefaultConfiguration(X11GraphicsDevice.java:182)
   - locked java.lang.Object@1c190c99
sun.awt.X11.XToolkit.<clinit>(XToolkit.java:92)
java.lang.Class.forName0(Native Method)
java.lang.Class.forName(Class.java:169)
java.awt.Toolkit$2.run(Toolkit.java:834)
java.security.AccessController.doPrivileged(Native Method)
java.awt.Toolkit.getDefaultToolkit(Toolkit.java:826)
   - locked java.lang.Class@308a1f38
org.openide.util.ImageUtilities.ensureLoaded(ImageUtilities.java:519)
org.openide.util.ImageUtilities.access$200(ImageUtilities.java:80)
org.openide.util.ImageUtilities$ToolTipImage.createNew(ImageUtilities.java:699)
org.openide.util.ImageUtilities.getIcon(ImageUtilities.java:487)
   - locked java.util.HashMap@3c07ae6d
org.openide.util.ImageUtilities.getIcon(ImageUtilities.java:361)
   - locked java.util.HashMap@1c4c94e5
org.openide.util.ImageUtilities.loadImage(ImageUtilities.java:139)
org.netbeans.core.startup.Splash.loadContent(Splash.java:262)
org.netbeans.core.startup.Splash$SplashComponent.<init>(Splash.java:344)
org.netbeans.core.startup.Splash.<init>(Splash.java:170)
org.netbeans.core.startup.Splash.getInstance(Splash.java:102)
org.netbeans.core.startup.Main.start(Main.java:301)
org.netbeans.core.startup.TopThreadGroup.run(TopThreadGroup.java:110)
java.lang.Thread.run(Thread.java:619)

I tried the deadlock detection button in jvisualvm, but it found no deadlocks.
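As a rough illustration of why a deadlock detector can come up empty here: the standard ThreadMXBean API only reports cycles of threads waiting on Java monitors or java.util.concurrent locks. A thread that is RUNNABLE inside a native method, like main in the trace above, does not count as deadlocked. The sketch below is not what jvisualvm does internally; the class and method names are made up for the example.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Hypothetical helper class, not part of the JDK or of jvisualvm.
public class DeadlockCheck {

    /** Returns how many threads the JVM reports as deadlocked (0 if none). */
    public static int countDeadlockedThreads() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        // findDeadlockedThreads also covers java.util.concurrent locks, but is
        // only available when the JVM supports synchronizer usage monitoring.
        long[] ids = threads.isSynchronizerUsageSupported()
                ? threads.findDeadlockedThreads()
                : threads.findMonitorDeadlockedThreads();
        return ids == null ? 0 : ids.length; // null means no deadlock was found
    }

    /** Formats name, state and stack of every live thread, like a thread dump. */
    public static String dumpThreads() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        StringBuilder out = new StringBuilder();
        for (ThreadInfo info : threads.dumpAllThreads(false, false)) {
            out.append("Name: ").append(info.getThreadName()).append('\n');
            out.append("State: ").append(info.getThreadState()).append('\n');
            for (StackTraceElement frame : info.getStackTrace()) {
                out.append("    ").append(frame).append('\n');
            }
            out.append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println("Deadlocked threads: " + countDeadlockedThreads());
        System.out.print(dumpThreads());
    }
}
```

A count of zero together with a thread sitting in a native frame points at the native call itself rather than at Java-level locking.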

I am currently talking to Citrix Europe about this problem and delivering traces to them. I will update this question if it gets solved.

UPDATE 3: This problem has been traced to Citrix and has been submitted with service request number 60235154. At the moment, it seems the problem is either somewhere in Java or in the Citrix implementation of X11.

A: 

Use gdb to attach to the process. gdb isn't exactly intuitive, but there are plenty of how-tos on the net, for example:

http://dirac.org/linux/gdb/06-Debugging%5FA%5FRunning%5FProcess.php

drhirsch
+1  A: 

Do you have source code for the Java program? If so, you can remotely debug it using Eclipse or another IDE. If you don't have source code, your options are more limited, but you can try connecting to the process via JConsole to gain some insight into what's happening. Java profiling tools are another option, but harder to set up.

Rob H
One of the problems here is that JConsole is one of the programs that fail; otherwise I'd do just that. I'll try starting Eclipse, but I suspect that's going to hang as well.
Stefan Thyberg
You can run JConsole remotely. Make sure it's the Java 6 version.
Rob H
We use jconsole and jvisualvm to reproduce this problem. It doesn't seem possible to pass VM arguments to jconsole, but it is possible with jvisualvm. So today I started jvisualvm that way and, after it hung, connected to it remotely from another instance of jvisualvm on another machine, which gave some more clues about what might be wrong. I updated my question with the result.
Stefan Thyberg
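For readers who want to repeat the remote connection without a GUI client (useful when the GUI tools themselves hang), here is a minimal sketch of a command-line JMX client that prints a thread dump of a remote JVM. It assumes the target JVM was started with the standard com.sun.management.jmxremote.port system property set (plus authenticate=false and ssl=false for a lab setup). The host name, the port 9010, and the class and method names are placeholders, not values from this thread.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import javax.management.MBeanServerConnection;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

// Hypothetical command-line client; it uses no AWT/Swing classes.
public class RemoteThreadDump {

    /** Builds the usual RMI-based JMX service URL for a host and port. */
    public static String serviceUrl(String host, int port) {
        return "service:jmx:rmi:///jndi/rmi://" + host + ":" + port + "/jmxrmi";
    }

    /** Formats one thread in roughly the layout jvisualvm shows. */
    public static String format(ThreadInfo info) {
        StringBuilder out = new StringBuilder();
        out.append("Name: ").append(info.getThreadName()).append('\n');
        out.append("State: ").append(info.getThreadState()).append('\n');
        for (StackTraceElement frame : info.getStackTrace()) {
            out.append("    ").append(frame).append('\n');
        }
        return out.append('\n').toString();
    }

    public static void main(String[] args) throws Exception {
        // Placeholder defaults; pass the real host and port as arguments.
        String host = args.length > 0 ? args[0] : "localhost";
        int port = args.length > 1 ? Integer.parseInt(args[1]) : 9010;

        JMXConnector connector =
                JMXConnectorFactory.connect(new JMXServiceURL(serviceUrl(host, port)));
        try {
            MBeanServerConnection conn = connector.getMBeanServerConnection();
            // Proxy for the remote JVM's thread MXBean.
            ThreadMXBean threads = ManagementFactory.newPlatformMXBeanProxy(
                    conn, ManagementFactory.THREAD_MXBEAN_NAME, ThreadMXBean.class);
            for (ThreadInfo info : threads.dumpAllThreads(false, false)) {
                System.out.print(format(info));
            }
        } finally {
            connector.close();
        }
    }
}
```

It could be run from the second machine as, for example, java RemoteThreadDump labhost 9010, where labhost stands for the machine running the hung JVM.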
+1  A: 

ltrace traces shared-library function calls. That can give you a higher-level view of things. But it can also spew tons more output than strace, since many library functions (e.g. strcmp) don't result in system calls.

But futex is used for locking, so if you get stuck at futex, you probably deadlocked. Or you're just looking at one thread which is waiting for other threads. ltrace/strace -f follows clone/fork to trace all threads/all child processes.

In gdb, thread apply all <command> is sometimes useful for multithreaded processes, e.g. thread apply all bt.
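To illustrate the "one thread waiting for other threads" case: in the sketch below, a thread that simply joins another thread sits in the WAITING state even though nothing is deadlocked. On Linux such a wait would typically surface as a futex wait in strace output for that thread, though that depends on the JVM and libc. All names are invented for the example.

```java
import java.util.concurrent.CountDownLatch;

// Hypothetical demo class showing a blocked thread that is not deadlocked.
public class WaitingIsNotDeadlock {

    /** Starts a worker and a waiter that joins it; returns the waiter's state while it waits. */
    public static Thread.State observeWaiterState() {
        final CountDownLatch release = new CountDownLatch(1);
        final Thread worker = new Thread(new Runnable() {
            public void run() {
                try {
                    release.await(); // stands in for long-running work
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        }, "worker");
        final Thread waiter = new Thread(new Runnable() {
            public void run() {
                try {
                    worker.join(); // blocks until the worker finishes
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        }, "waiter");
        worker.start();
        waiter.start();
        try {
            // Poll until the waiter has actually blocked inside join().
            Thread.State state;
            while ((state = waiter.getState()) != Thread.State.WAITING) {
                Thread.sleep(10);
            }
            release.countDown(); // let the worker finish, which unblocks the waiter
            worker.join();
            waiter.join();
            return state;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while observing threads", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("waiter state while joined: " + observeWaiterState());
    }
}
```

Tracing only the waiter would show it blocked indefinitely, which is why following all threads (strace -f, or thread apply all bt in gdb) matters before concluding that a process is deadlocked.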

Peter Cordes
A: 

Maybe jvisualvm, which comes with the JDK from Sun, has what you need. You can record the state of the virtual machine while your program is running and also tell it to save stack dumps to a file that you can open and look at later. Look for jvisualvm in the bin directory of your JDK. More documentation is available here: http://java.sun.com/javase/6/docs/technotes/tools/share/jvisualvm.html

Good luck!

Pete
A: 

Hello,

Did you find the root cause?

I am facing the same problem but do not know how to fix it. I tried jvisualvm, jconsole and other tools, but the problem is in a native method whose source code is located in the Java 1.5 source tree.

Thanks, Vasily.

Vasily
You should not answer this way on Stack Overflow; this space is for answers, not comments. Please submit a comment with the same text. This answer should be removed.
Stefan Thyberg