ansaurus

Question

Determining the reason for a stalled process on Linux

Answer 1

+2 A:

You can strace the program in question and see what system calls it's making.

Paul Tomblin 2008-10-21 20:25:37

Answer 2

+6 A:

If you are able to spot this "moment of unresponsiveness", then you might use strace to attach to the process in question during that time and try to figure out where it "sleeps":

strace -f -o LOG -p <pid>

A more lightweight, but less reliable, method:

When the process hangs, use top/ps/gdb/strace/ltrace to find out the state of the process (e.g. whether it is waiting in "select" or consuming 100% CPU in some library call).
Knowing the general nature of the call in question, tailor the invocation of strace to log only specific syscalls or groups of syscalls. For example, to log only file access-related syscalls, use:
```
strace -e file -f -o LOG ....
```
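The first step above (finding out which state the hung process is in) can also be done by reading /proc directly, which is cheaper than attaching a tracer. This is only a rough sketch, assuming a Linux /proc filesystem; the function name `process_state` is made up for illustration:

```python
def process_state(pid):
    """Return (state_letter, wait_channel) for pid, read from /proc.

    state_letter is the one-letter code from /proc/<pid>/status, e.g.
    R (running), S (sleeping), D (uninterruptible sleep), T (stopped).
    wait_channel is the kernel symbol the process is sleeping in, or None
    when the kernel does not report one (or the file is not readable).
    """
    state = None
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if line.startswith("State:"):
                # The line looks like "State:\tS (sleeping)"
                state = line.split()[1]
                break
    try:
        with open(f"/proc/{pid}/wchan") as f:
            wchan = f.read().strip()
    except OSError:
        wchan = None
    # "0" (or an empty file) means the process is not waiting in the kernel
    if wchan in ("", "0"):
        wchan = None
    return state, wchan
```

Polling `process_state(<pid>)` while the application is unresponsive gives a first hint: a steady "R" suggests it is burning CPU, "S" with a wait channel suggests a blocking call, and "D" points towards I/O.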

If strace is too heavy a tool for you, try monitoring:

Memory usage with "vmstat 1 > /some/log": maybe the process is being swapped in (or out) during that time.
I/O usage with vmstat/iotop: maybe some other process is thrashing the disks.
/proc/interrupts: maybe the driver for your T1 card is experiencing problems?
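For the /proc/interrupts check, the raw counters are only meaningful when compared over time. A minimal sketch of that comparison, assuming the usual Linux layout (a header row of CPU names, then one row per interrupt source); all function names here are hypothetical:

```python
import time


def parse_interrupts(text):
    """Map each interrupt label to its count summed over all CPUs."""
    lines = text.splitlines()
    if not lines:
        return {}
    ncpus = len(lines[0].split())  # header row: CPU0 CPU1 ...
    totals = {}
    for line in lines[1:]:
        label, sep, rest = line.partition(":")
        if not sep:
            continue  # not a counter row
        count = 0
        for token in rest.split()[:ncpus]:
            if not token.isdigit():
                break  # reached the descriptive columns
            count += int(token)
        totals[label.strip()] = count
    return totals


def interrupt_deltas(before, after):
    """Return {label: increase} for counters that grew between two samples."""
    return {label: after[label] - before.get(label, 0)
            for label in after if after[label] > before.get(label, 0)}


def sample(interval=1.0, path="/proc/interrupts"):
    """Take two snapshots `interval` seconds apart and return the deltas."""
    with open(path) as f:
        before = parse_interrupts(f.read())
    time.sleep(interval)
    with open(path) as f:
        after = parse_interrupts(f.read())
    return interrupt_deltas(before, after)
```

Logging `sample()` once a second and comparing the stall period with a normal period would show whether a particular interrupt source goes quiet or spikes when the application freezes.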

ADEpt 2008-10-21 21:34:53

Answer 3

A:

Thanks - strace sounds useful. Catching the process at the right time will be part of the fun. I came up with a scheme to periodically write a time stamp into shared memory, then monitor with another process. Sending a SIGSTOP would then let me at least examine the application stack with gdb. I don't know if strace on a paused process will tell me much, but I could maybe then turn on strace and see what it will say. Or turn on strace and hit the process with a SIGCONT.
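The heartbeat scheme described here might look roughly like the following. This is a sketch under assumptions, not the actual implementation: it uses a small memory-mapped file as the shared memory, stores a monotonic clock reading instead of a wall-clock time stamp, and every name in it (`open_heartbeat`, `beat`, `seconds_since_beat`, `watch`) is invented for illustration.

```python
import mmap
import os
import signal
import struct
import time

SLOT = struct.Struct("d")  # one double: a time.monotonic() reading


def open_heartbeat(path):
    """Map a small file shared by the application and the monitor."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        os.ftruncate(fd, SLOT.size)
        return mmap.mmap(fd, SLOT.size)  # shared mapping by default on Unix
    finally:
        os.close(fd)  # the mapping stays valid after the fd is closed


def beat(shm, now=None):
    """Called periodically from the application's main loop."""
    SLOT.pack_into(shm, 0, time.monotonic() if now is None else now)


def seconds_since_beat(shm, now=None):
    """How long ago the application last wrote its time stamp."""
    (last,) = SLOT.unpack_from(shm, 0)
    return (time.monotonic() if now is None else now) - last


def watch(shm, pid, limit=5.0, poll=0.5):
    """Monitor loop: stop the process once its heartbeat is older than limit."""
    while seconds_since_beat(shm) <= limit:
        time.sleep(poll)
    os.kill(pid, signal.SIGSTOP)  # freeze it so it can be inspected
```

The application would call `beat()` from its main loop, and a separate monitor process would call `watch()` with the application's pid. A freshly created file holds 0.0, so the monitor should only be started after the first beat. Once the process is stopped, `gdb -p <pid>` can be used to look at the stack.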

2008-10-21 22:04:50

Forgot to add: there is also a companion tool, "ltrace", for tracing library calls (strace traces syscalls only).

ADEpt 2008-10-22 06:22:18
