I am writing a system monitor for Linux and want to include some watchdog functionality. In the kernel, you can configure the watchdog to keep going even if /dev/watchdog is closed. In other words, if my daemon exits normally and closes /dev/watchdog, the system would still re-boot 59 seconds later. That may or may not be desirable behavior for the user.

I need to make my daemon aware of this setting because it will influence how I handle SIGINT. If the setting is on, my daemon would need to (preferably) start an orderly shutdown on exit or (at least) warn the user that the system is going to reboot shortly.

Does anyone know of a method to obtain this setting from user space? I don't see anything in sysconf() to get the value. Likewise, I need to be able to tell if the software watchdog is enabled to begin with.

Edit:

Linux provides a very simple watchdog interface. A process can open /dev/watchdog; once the device is opened, the kernel begins a 60-second countdown to reboot unless some data is written to that file, in which case the clock resets.
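The keepalive step described above can be sketched roughly as follows. This is only an illustration: the helper name wd_keepalive is made up here, and it assumes the caller has already opened /dev/watchdog for writing and calls the helper more often than the timeout.

```c
#include <errno.h>
#include <unistd.h>

/* Hypothetical helper: write one byte to an already-open watchdog
 * descriptor so the countdown restarts.
 * Returns 0 on success, -1 on failure (errno is left as set by write()). */
int wd_keepalive(int fd)
{
    ssize_t n;

    do {
        /* Any data resets the timer; a NUL byte is used here so the
         * write is never mistaken for the magic close character 'V'. */
        n = write(fd, "\0", 1);
    } while (n == -1 && errno == EINTR);   /* retry if a signal interrupted us */

    return n == 1 ? 0 : -1;
}
```

A daemon would typically call this from its main loop at an interval comfortably shorter than the timeout, and treat a -1 return as a reason to log and investigate.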

Depending on how the kernel is configured, closing that file may or may not stop the countdown. From the documentation:

The watchdog can be stopped without causing a reboot if the device /dev/watchdog is closed correctly, unless your kernel is compiled with the CONFIG_WATCHDOG_NOWAYOUT option enabled.

I need to be able to tell if CONFIG_WATCHDOG_NOWAYOUT was set from within a user space daemon, so that I can handle the shutdown of said daemon differently. In other words, if that setting is high, a simple:

# /etc/init.d/mydaemon stop

... would reboot the system in 59 seconds, because nothing is writing to /dev/watchdog any longer. So, if it's set high, my handler for SIGINT needs to do additional things (i.e. warn the user, at the least).

I cannot find a way of obtaining this setting from user space :( Any help is appreciated.

A: 

You would need to catch SIGINT in your application with a signal handler. The link below has more information on signal handlers.

http://www.cs.utah.edu/dept/old/texinfo/glibc-manual-0.02/library_21.html

Suroot
Yes, I know how signal handlers work. The handler for SIGINT needs to behave differently if the kernel was configured with CONFIG_WATCHDOG_NOWAYOUT, so I'm looking for some means of obtaining that configuration value from user space.
Tim Post
+1  A: 

A watchdog guards against the system hard-locking, whether because of a software crash or a hardware failure.

What you need is a daemon-monitoring daemon (DMD). Check out 'monit'.

Javier
I am writing a DMD, specifically for use on paravirtualized Xen guests, that also has some Nagios-style functionality (sysinfo data is written to xenbus, where it can be watched by the privileged domain). Please see the additional edits to the question; I wasn't clear enough.
Tim Post
A: 

Surely CONFIG_WATCHDOG_NOWAYOUT influences a preprocessor command?

Therefore, why do you need to check or set that from userspace?

Arafangion
CONFIG_WATCHDOG_NOWAYOUT is a kernel symbol; the program I am writing resides in userspace. Building the program against the kernel tree is very bothersome when packaging it. On some systems it is set, on others it is not. I need to be able to get that value from userspace.
Tim Post
Does your program live _entirely_ in kernel space? Worst case, if you truly do not compile against a specific kernel, you could make an assumption, and refine that assumption by using the gzip'ed kernel configuration that can live in /proc, or the one in /boot.
Arafangion
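A minimal sketch of the fallback suggested in the comment above, under some assumptions: the function name kconfig_lookup is invented for illustration, and it expects plain (already decompressed) kernel config text. /proc/config.gz is gzip-compressed, so the caller would have to decompress it first (for example with zlib or by piping through gzip -dc); some distributions also ship a plain-text config under /boot, but neither file is guaranteed to exist.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: look up `symbol` in kernel config text read from `fp`.
 * Returns 'y' or 'm' when the symbol is set that way, '?' when it is set
 * to some other value, and 'n' when it is absent or only appears as
 * "# SYMBOL is not set". */
char kconfig_lookup(FILE *fp, const char *symbol)
{
    char line[512];
    size_t len = strlen(symbol);

    while (fgets(line, sizeof line, fp) != NULL) {
        /* Require "SYMBOL=" exactly, so CONFIG_WATCHDOG does not
         * match a line for CONFIG_WATCHDOG_NOWAYOUT. */
        if (strncmp(line, symbol, len) != 0 || line[len] != '=')
            continue;
        if (line[len + 1] == 'y' || line[len + 1] == 'm')
            return line[len + 1];
        return '?';
    }
    return 'n';
}
```

Calling kconfig_lookup(fp, "CONFIG_WATCHDOG_NOWAYOUT") on such a stream gives a first guess only; a missing config file still leaves the daemon having to assume something.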
A: 

I hate having to suggest this, but being at a loss for a better idea: have you considered grepping for CONFIG_WATCHDOG_NOWAYOUT across your kernel source to see if it sets a global or gets returned by some function somewhere? You could look for some secondary change it induces in the kernel's exposed interface for something from which to infer its setting.

I haven't got the kernel source to hand or I'd've done this myself before replying, sorry.

Crashworks
It's not my kernel I'm worried about, it's the users of my program who (likely) don't know the particulars of kernel innards. There appears to be no sysconf() or ioctl() that gets this particular setting, nor is it exported in procfs. However, if /proc/config.gz is supported I can get the value there.
Tim Post
+5  A: 

AHA! After digging through linux/watchdog.h and drivers/watchdog/softdog.c, I was able to determine the capabilities of the softdog ioctl() interface. Looking at the capabilities that it announces in struct watchdog_info:

static struct watchdog_info ident = {
                .options =              WDIOF_SETTIMEOUT |
                                        WDIOF_KEEPALIVEPING |
                                        WDIOF_MAGICCLOSE,
                .firmware_version =     0,
                .identity =             "Software Watchdog",
        };

It does support a magic close that (seems to) over-ride CONFIG_WATCHDOG_NOWAYOUT. So, when terminating normally, I have to write a single char 'V' to /dev/watchdog then close it, and the timer will stop counting.

So, a simple ioctl() on a file descriptor for /dev/watchdog, asking for WDIOC_GETSUPPORT, allows one to determine if this flag is set, e.g.:

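#include <fcntl.h>           /* open(), O_WRONLY */
#include <stdio.h>           /* perror(), printf() */
#include <sys/ioctl.h>       /* ioctl() */
#include <linux/watchdog.h>  /* struct watchdog_info, WDIOC_GETSUPPORT */

int fd;
struct watchdog_info info;

/* Note: opening the device starts the countdown. */
fd = open("/dev/watchdog", O_WRONLY);
if (fd == -1) {
    perror("open");
    goto abort;
}

/* Ask the driver which options it supports. */
if (ioctl(fd, WDIOC_GETSUPPORT, &info) == -1) {
    perror("ioctl");
    goto abort;
}

if (info.options & WDIOF_MAGICCLOSE)
    printf("Watchdog supports magic close char\n");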
...

When working with hardware watchdogs, you might want to open the device with O_NONBLOCK so that ioctl(), rather than open(), is the call that blocks (which lets you detect a busy card).

If WDIOF_MAGICCLOSE is not supported, one should just assume that the soft watchdog is configured with NOWAYOUT.
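For the exit path, a rough sketch of the magic close step might look like the following. The name wd_release and its return convention are made up for illustration; `options` is assumed to be the `options` field filled in by WDIOC_GETSUPPORT. Whether the countdown really stops after this still depends on the driver and on how the kernel was configured, so treat it as a best effort.

```c
#include <errno.h>
#include <unistd.h>
#include <linux/watchdog.h>

/* Hypothetical helper: close a watchdog descriptor, writing the magic
 * character first when the driver advertises WDIOF_MAGICCLOSE.
 * Returns  0 if 'V' was written and the descriptor was closed,
 *          1 if magic close is not advertised (descriptor closed anyway;
 *            the caller should assume the countdown continues),
 *         -1 on a write or close error. */
int wd_release(int fd, unsigned int options)
{
    ssize_t n;

    if (!(options & WDIOF_MAGICCLOSE))
        return close(fd) == 0 ? 1 : -1;

    do {
        n = write(fd, "V", 1);             /* the magic close character */
    } while (n == -1 && errno == EINTR);   /* retry if interrupted */

    if (n != 1) {
        close(fd);
        return -1;
    }
    return close(fd) == 0 ? 0 : -1;
}
```

A return value of 1 or -1 is the cue to warn the user or start an orderly shutdown instead of exiting quietly.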

I love Linux, I really do :) Sometimes it just takes a bit of hunting to work with. Thanks for all of the replies.

Edit: Further info

What stinks is that, in order to determine the caps, you have to open the device, and once you open it the counter starts. So, if you have opened it and it does NOT support the magic close character, you had better ensure that you effect a normal shutdown / sync upon receiving SIGINT.
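One way to wire that into the daemon, sketched with assumptions: the handler only sets a flag, and the main loop later decides what to do based on whether magic close was advertised. All names here (stop_requested, install_sigint_handler, plan_exit, the enum) are hypothetical, and the actual warning or shutdown steps are left to the daemon.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>

/* Set from the signal handler, polled by the main loop. */
static volatile sig_atomic_t stop_requested = 0;

static void on_sigint(int signo)
{
    (void)signo;
    stop_requested = 1;   /* keep the handler async-signal-safe */
}

/* Install the SIGINT handler; returns 0 on success, -1 on failure. */
int install_sigint_handler(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_sigint;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGINT, &sa, NULL);
}

/* What the main loop should do once stop_requested is set. */
enum exit_plan { EXIT_QUIETLY, EXIT_WARN_AND_SHUT_DOWN };

enum exit_plan plan_exit(int magic_close_supported)
{
    /* Without magic close, assume the countdown cannot be stopped. */
    return magic_close_supported ? EXIT_QUIETLY : EXIT_WARN_AND_SHUT_DOWN;
}
```

The main loop would check stop_requested between keepalive writes and, for EXIT_WARN_AND_SHUT_DOWN, warn the user and sync before the timer expires.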

Tim Post
A: 

I think the watchdog device drivers are really intended for use on embedded platforms (or at least well controlled ones) where the developers will have control of which kernel is in use.

This could be considered to be an oversight, but I think it is not.

One other thing you could try: if the watchdog was built as a loadable module, unloading it should presumably abort the pending reboot.

MarkR
If the softdog is a loadable module, life gets very easy as it also accepts arguments when loading (and yes, unloading stops it). The problem is, on embedded systems (which I'm working on, as you said) you frequently see monolithic kernels with everything as static objects.
Tim Post
I think it was actually a bit of an oversight, which is why they later changed the soft watchdog to obey the magic close character, i.e. writing a single 'V' to the device and then closing it always stops the countdown.
Tim Post