I am writing a daemon in C on Linux. It traps SIGHUP, SIGTERM, SIGINT, and SIGQUIT, logs them using syslog, and quits. If it receives SIGSEGV, it dumps core. Usually everything happens as expected, but once in a while the daemon simply disappears: it does not exit cleanly, does not log a signal, and does not leave a core dump. I am stumped and not sure how to debug the problem. In what ways can it quit other than through these signals? Is there an obvious answer that I am missing? What other practices do you recommend for debugging such a seemingly sporadic problem in a daemon process?

+2  A: 

You can have the parent of the daemon stay around and wait for it, then have the parent log the reason the daemon quit (i.e., whether it was killed by a signal or exited on its own).

William Pursell
This is an excellent solution. How can you get this information from the parent once the child has died?
In the parent, call wait, use WIFEXITED/WIFSIGNALED to determine what happened, then call syslog with the appropriate log message. Check the man page for wait.
William Pursell
+1  A: 

Well, there are lots of other signals that will cause it to quit, including of course SIGKILL, which you can't do anything about. Basically, any signal in the following excerpt from man 7 signal whose Action is Term or Core will end the process (though Core would at least leave a core dump):

   First the signals described in the original POSIX.1-1990 standard.

   Signal     Value     Action   Comment
   -------------------------------------------------------------------------
   SIGHUP        1       Term    Hangup detected on controlling terminal
                                 or death of controlling process
   SIGINT        2       Term    Interrupt from keyboard
   SIGQUIT       3       Core    Quit from keyboard
   SIGILL        4       Core    Illegal Instruction

   SIGABRT       6       Core    Abort signal from abort(3)
   SIGFPE        8       Core    Floating point exception
   SIGKILL       9       Term    Kill signal
   SIGSEGV      11       Core    Invalid memory reference
   SIGPIPE      13       Term    Broken pipe: write to pipe with no readers
   SIGALRM      14       Term    Timer signal from alarm(2)
   SIGTERM      15       Term    Termination signal
   SIGUSR1   30,10,16    Term    User-defined signal 1
   SIGUSR2   31,12,17    Term    User-defined signal 2
   SIGCHLD   20,17,18    Ign     Child stopped or terminated
   SIGCONT   19,18,25    Cont    Continue if stopped
   SIGSTOP   17,19,23    Stop    Stop process
   SIGTSTP   18,20,24    Stop    Stop typed at tty
   SIGTTIN   21,21,26    Stop    tty input for background process
   SIGTTOU   22,22,27    Stop    tty output for background process

   The signals SIGKILL and SIGSTOP cannot be caught, blocked, or ignored.

   Next the signals not in the POSIX.1-1990 standard but described in SUSv2 and POSIX.1-2001.

   Signal       Value     Action   Comment
   -------------------------------------------------------------------------
   SIGBUS      10,7,10     Core    Bus error (bad memory access)
   SIGPOLL                 Term    Pollable event (Sys V). Synonym of SIGIO
   SIGPROF     27,27,29    Term    Profiling timer expired
   SIGSYS      12,-,12     Core    Bad argument to routine (SVr4)
   SIGTRAP        5        Core    Trace/breakpoint trap
   SIGURG      16,23,21    Ign     Urgent condition on socket (4.2BSD)
   SIGVTALRM   26,26,28    Term    Virtual alarm clock (4.2BSD)
   SIGXCPU     24,24,30    Core    CPU time limit exceeded (4.2BSD)
   SIGXFSZ     25,25,31    Core    File size limit exceeded (4.2BSD)

   Up to and including Linux 2.2, the default behaviour for SIGSYS, SIGXCPU, SIGXFSZ, and (on architectures other than SPARC
   and MIPS) SIGBUS was to terminate the process (without a core dump). (On some other Unices the default action for
   SIGXCPU and SIGXFSZ is to terminate the process without a core dump.) Linux 2.4 conforms to the POSIX.1-2001 requirements
   for these signals, terminating the process with a core dump.

   Next various other signals.

   Signal       Value     Action   Comment
   --------------------------------------------------------------------
   SIGIOT         6        Core    IOT trap. A synonym for SIGABRT
   SIGEMT       7,-,7      Term
   SIGSTKFLT    -,16,-     Term    Stack fault on coprocessor (unused)
   SIGIO       23,29,22    Term    I/O now possible (4.2BSD)
   SIGCLD       -,-,18     Ign     A synonym for SIGCHLD
   SIGPWR      29,30,19    Term    Power failure (System V)
   SIGINFO      29,-,-             A synonym for SIGPWR
   SIGLOST      -,-,-      Term    File lock lost
   SIGWINCH    28,28,20    Ign     Window resize signal (4.3BSD, Sun)
   SIGUNUSED    -,31,-     Term    Unused signal (will be SIGSYS)
chaos
+2  A: 

Attach gdb to it with

gdb -p <pid>
Make sure you compiled with the -g flag and take a backtrace as soon as it exits. Good luck!

kmm
I didn't know you could do that! This is great because the daemon is running on a server I don't have physical access to. I am often on the move with my laptop and can't keep a terminal open to monitor it. This way I can attach and detach gdb when needed without shutting down the daemon; excellent!
+3  A: 

If your daemon is working with network sockets, it's quite likely to be SIGPIPE. You get this signal if you try to write to a socket (or pipe) that has been closed by the other side. Note that even if you check whether the socket is writable before writing to it (e.g. with select()), it can still be closed between that check and the write itself.

caf
Ah! I am using sockets and do not trap SIGPIPE. I didn't think of that; I bet that is it. Currently my select() call is in a loop that breaks if it is interrupted, but I want to stay in the loop if the signal is a SIGPIPE. From your comment I gather that a select() call won't ever be interrupted by a SIGPIPE, only read()/write() calls. Is that true?
Your process won't be signalled with `SIGPIPE` from a `select()`; instead, `select()` will return with the file descriptor marked as readable (so that you can find out that it has been closed). `SIGPIPE` is raised only by writes such as `write()`, never by `read()`. If you ignore or handle `SIGPIPE`, the `write()` will instead fail with `errno` set to `EPIPE`.
caf
A: 

A shell wrapper can catch your daemon's exit status. Here's how it works:

$ ./waitstatus true
pid 1512: exit status 0 (success)

$ ./waitstatus false
pid 1514: exit status 1 (abnormal)

$ ./waitstatus perl -e 'exit 21'
pid 1518: exit status 21 (abnormal)

$ ./waitstatus perl -e 'kill TERM => $$'
pid 1520: terminated on signal 15

$ ./waitstatus no-such-command
pid 1522: command not found: no-such-command

$ ./waitstatus /sbin/EACCES.contrived
pid 1524: command not executable: /sbin/EACCES.contrived

... and here's how it's implemented:

$ cat ./waitstatus
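#!/bin/bash
# Run the given command in the background, wait for it, and report how it ended.

"$@" &
PID=$!

wait "$PID"
STATUS=$?

# The shell reports death by signal N as exit status 128 + N.
if [ "$STATUS" -gt 128 ]; then
  MSG="terminated on signal $(( STATUS - 128 ))"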
else
  case $STATUS in
    0)
      MSG="exit status 0 (success)"
      ;;
    127)
      MSG="command not found: $1"
      ;;
    126)
      MSG="command not executable: $1"
      ;;
    *)
      MSG="exit status $STATUS (abnormal)"
      ;;
  esac
fi

echo "pid $PID: $MSG"
exit $STATUS

You might want to change that last echo line to an invocation of your system's logger command to, for example, direct the status message to syslog.

pilcrow