I'm trying to determine the reason for a stalled process on Linux. It's a telecom application, running under fairly heavy load. There is a separate process for each of 8 T1 spans. Every so often, one of the processes will get very unresponsive - up to maybe 50 seconds before an event is noted in the normally very busy process's log.

It is likely that some system resource is running short. The obvious candidate, CPU usage, looks OK.

Which Linux utilities would be best for catching and analyzing this sort of thing while being as unobtrusive as possible, given that this is a highly loaded system? It seems they would need to be process-oriented rather than system-oriented. Maybe ongoing monitoring of /proc/pid/XX? Top doesn't seem too useful here.

+2  A: 

You can strace the program in question and see what system calls it's making.

Paul Tomblin
+6  A: 

If you are able to spot this "moment of unresponsiveness", then you might use strace to attach to the process in question during that time and try to figure out where it "sleeps":

strace -f -o LOG -p <pid>
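
If the stall cannot be caught by eye, one option is to leave strace attached with timing enabled and search the log afterwards. The sketch below assumes strace's `-tt` (wall-clock timestamps) and `-T` (time spent in each syscall, printed as a trailing `<seconds>` field) options. The helper name `slow_syscalls` and the one-second default threshold are illustrative and not part of the original answer.

```shell
#!/usr/bin/env bash
# Capture with timing information (run separately; <pid> is the stalled process):
#   strace -f -tt -T -o LOG -p <pid>

# slow_syscalls LOG [THRESHOLD]
# Print every logged syscall whose duration (the trailing <seconds> field
# written by -T) is at least THRESHOLD seconds. Default threshold: 1.0.
slow_syscalls() {
    local log=$1 threshold=${2:-1.0}
    awk -v t="$threshold" '
        # Lines interrupted by another thread end in "<unfinished ...>" and
        # carry no duration, so only a numeric <...> last field is considered.
        $NF ~ /^<[0-9]+\.[0-9]+>$/ {
            d = substr($NF, 2, length($NF) - 2)
            if (d + 0 >= t + 0) print
        }
    ' "$log"
}
```

For example, `slow_syscalls LOG 5` would list only the calls that blocked for five seconds or longer, which should narrow a 50-second stall down to a handful of lines.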

A more lightweight, but less reliable, method:

  1. When the process hangs, use top/ps/gdb/strace/ltrace to find out the state of the process (e.g. whether it is waiting in "select" or consuming 100% CPU in some library call)

  2. Knowing the general nature of the call in question, tailor the strace invocation to log only specific syscalls or groups of syscalls. For example, to log only file access-related syscalls, use:

    strace -e file -f -o LOG ....
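
Step 1 can also be approximated without attaching a tracer, by sampling the process's entries under /proc at a cost of a few small reads per sample. This is a minimal sketch, assuming a Linux /proc that exposes `State:` and the context-switch counters in `/proc/<pid>/status` and a readable `/proc/<pid>/wchan`. The function name `sample_proc` and its optional second argument (an alternate /proc root, handy for testing) are hypothetical.

```shell
#!/usr/bin/env bash
# sample_proc PID [PROC_ROOT]
# Print one line with: epoch seconds, process state letter (R = running,
# S = sleeping, D = uninterruptible wait, usually I/O), the kernel wait
# channel, and the voluntary / involuntary context-switch counters.
sample_proc() {
    local pid=$1 proc=${2:-/proc}
    local dir=$proc/$pid
    [ -r "$dir/status" ] || return 1
    local state vol nonvol wchan
    state=$(awk '/^State:/ {print $2}' "$dir/status")
    vol=$(awk '/^voluntary_ctxt_switches:/ {print $2}' "$dir/status")
    nonvol=$(awk '/^nonvoluntary_ctxt_switches:/ {print $2}' "$dir/status")
    # wchan may be missing or unreadable on some systems; fall back to "-".
    wchan=$(cat "$dir/wchan" 2>/dev/null || true)
    printf '%s state=%s wchan=%s vol=%s nonvol=%s\n' \
        "$(date +%s)" "$state" "${wchan:--}" "${vol:--}" "${nonvol:--}"
}
```

Run in a loop such as `while sleep 1; do sample_proc <pid>; done >> /some/log`, a run of `state=D` samples during the stall would suggest blocked I/O, while a jump in `nonvol` would suggest the process was being preempted.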
    

If strace is too heavy a tool for you, try monitoring:

  1. Memory usage with "vmstat 1 > /some/log" - maybe the process is being swapped in (or out) during that time

  2. IO usage with vmstat/iotop - maybe some other process is thrashing the disks

  3. /proc/interrupts - maybe the driver for your T1 card is experiencing problems?
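
For the /proc/interrupts check, the raw counters only become informative when two snapshots are compared. The sketch below assumes the usual layout (a header row of CPU names, then one row per interrupt source with per-CPU counts followed by a description). The function name `irq_delta` and the file paths in the comments are hypothetical.

```shell
#!/usr/bin/env bash
# Take two snapshots around the period of interest, for example:
#   cat /proc/interrupts > /tmp/irq.before
#   sleep 10
#   cat /proc/interrupts > /tmp/irq.after
#   irq_delta /tmp/irq.before /tmp/irq.after

# irq_delta BEFORE AFTER
# Compare two saved copies of /proc/interrupts and print, for every interrupt
# source whose count grew, its label and the increase summed over all CPUs.
irq_delta() {
    awk '
        FNR == 1 { next }                      # skip the "CPU0 CPU1 ..." header
        {
            sum = 0
            # Per-CPU counters follow the label; stop at the first
            # non-numeric field, which starts the description.
            for (i = 2; i <= NF && $i ~ /^[0-9]+$/; i++) sum += $i
            if (NR == FNR) { before[$1] = sum; next }
            delta = sum - before[$1]
            if (delta > 0) printf "%s %d\n", $1, delta
        }
    ' "$1" "$2"
}
```

If the line for the T1 card's IRQ stops growing during a stall while other sources keep counting, that would point towards the driver or the card rather than the application.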

ADEpt
A: 

Thanks - strace sounds useful. Catching the process at the right time will be part of the fun. I came up with a scheme to periodically write a time stamp into shared memory, then monitor with another process. Sending a SIGSTOP would then let me at least examine the application stack with gdb. I don't know if strace on a paused process will tell me much, but I could maybe then turn on strace and see what it will say. Or turn on strace and hit the process with a SIGCONT.
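
The heartbeat scheme could be sketched roughly as follows. This is not the actual implementation: it assumes the application periodically writes the current epoch time, as text, into a file on a tmpfs such as /dev/shm, which stands in here for the raw shared-memory segment. The function names, the file path, and the threshold are all hypothetical.

```shell
#!/usr/bin/env bash
# heartbeat_age FILE [NOW]
# Print how many seconds old the epoch timestamp stored in FILE is.
heartbeat_age() {
    local file=$1 now=${2:-$(date +%s)}
    local last
    last=$(cat "$file") || return 1
    echo $(( now - last ))
}

# watch_heartbeat PID FILE MAX_AGE
# Poll once a second while PID exists; when the heartbeat is older than
# MAX_AGE seconds, freeze the process with SIGSTOP so that it can be
# inspected, report the event, and return.
watch_heartbeat() {
    local pid=$1 file=$2 max_age=$3 age
    while kill -0 "$pid" 2>/dev/null; do
        age=$(heartbeat_age "$file") || return 1
        if [ "$age" -gt "$max_age" ]; then
            kill -STOP "$pid"
            echo "$(date '+%F %T') pid $pid stalled for ${age}s, stopped"
            return 0
        fi
        sleep 1
    done
}

# Example (hypothetical path and threshold):
#   watch_heartbeat <pid> /dev/shm/span3.heartbeat 5
```

Once the watchdog has stopped the process, `gdb -p <pid>` can be used to look at the stack, and `kill -CONT <pid>` resumes it afterwards.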

Forgot to add: there is also a companion tool, "ltrace", for tracing library calls (strace traces syscalls only).
ADEpt