ansaurus

Question

Answer 1

+5 A:

In embedded systems, a common approach is a watchdog module.

A watchdog checks some location (could be a file, could be a memory location, whatever), and restarts the system under examination if the location does not meet criteria.

So you might have the program being monitored periodically write an epoch timestamp to some programname_watchdog file. This would be part of its regular loop.
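As a rough illustration of the writing side, here is a minimal sketch in C. The function name write_heartbeat and the idea of passing the file path as a parameter are my own assumptions for the example, not something prescribed above.

```c
#include <assert.h>
#include <stdio.h>
#include <time.h>

/* Hypothetical helper: store the current epoch time (in seconds) in the
 * heartbeat file at `path`, replacing any previous stamp.
 * Returns 0 on success, -1 on failure. */
int write_heartbeat(const char *path)
{
    assert(path != NULL);

    FILE *f = fopen(path, "w");
    if (f == NULL)
        return -1;

    /* time_t has no portable printf format, so cast to long long. */
    int ok = fprintf(f, "%lld\n", (long long)time(NULL)) > 0;

    /* fclose flushes the buffer; if it fails the stamp may not be stored. */
    if (fclose(f) != 0)
        ok = 0;

    return ok ? 0 : -1;
}
```

The monitored program would call this once per pass through its regular loop. If the watchdog could read the file while it is half written, writing to a temporary file and renaming it over the real one is a common way around that.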

Then your watchdog (in a totally different process) would check the file. If the timestamp in it is sufficiently outdated, the other program is killed and restarted, since it is deemed to have critically malfunctioned (either hung or crashed). Note that your watchdog will have only some simple logic, so its chances of failing are much lower.
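The checking side could look something like the sketch below. The names check_heartbeat and heartbeat_status, and the choice to treat an unreadable file the same as a missing one, are assumptions made for the example. Killing and restarting the program is platform specific and is left out.

```c
#include <assert.h>
#include <stdio.h>
#include <time.h>

/* Possible verdicts about the monitored program. */
enum heartbeat_status {
    HEARTBEAT_FRESH = 0,
    HEARTBEAT_STALE = 1,
    HEARTBEAT_MISSING = 2
};

/* Hypothetical helper: read the epoch stamp stored in `path` and compare
 * it against `now`. A stamp older than `max_age_seconds` counts as stale. */
enum heartbeat_status check_heartbeat(const char *path, time_t now,
                                      long max_age_seconds)
{
    assert(path != NULL);

    FILE *f = fopen(path, "r");
    if (f == NULL)
        return HEARTBEAT_MISSING;   /* no file: never started, or file removed */

    long long stamp = 0;
    int fields = fscanf(f, "%lld", &stamp);
    fclose(f);

    if (fields != 1)
        return HEARTBEAT_MISSING;   /* unreadable content: treat as no heartbeat */

    long long age = (long long)now - stamp;
    return (age > max_age_seconds) ? HEARTBEAT_STALE : HEARTBEAT_FRESH;
}
```

The watchdog process would call this on a timer with time(NULL) as `now`, and restart the program on anything other than HEARTBEAT_FRESH. Passing `now` in as a parameter keeps the check easy to test.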

I'm positive there are other ways to accomplish this as well. This is just one way.

edit: You have to consider the stack your system is built on. The more external dependencies you have, the more risk of failure. You also have to consider a formal proof of program correctness if you are looking for perfect operation.

The question really becomes what you are expecting from your system; what sort of failures are unacceptable and what sort of failures are expected so you can compensate for them.

This question becomes a proof-hardware-software co-design issue very quickly (and an expensive one, too). I'm curious to see what you are doing and what your solution is.

Paul Nathan 2010-09-03 17:15:27

Of course, watchdogs are one of the strongest and most widely used solutions. I'll take a couple of days to think hard about it and I'll let you know.

Enrico Carlesso 2010-09-03 23:35:18

Answer 2

A:

Like Paul Nathan said, use a watchdog.

There are a few things you can do to make things more robust though, for example:

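#include <stdlib.h>   /* abs */

/* Placeholder value for this sample: largest tolerated tick difference. */
#define ALLOWED_PROCESS_LAG 5

/* Assumed to be provided elsewhere: returns the remote tick counter,
   or a negative error code. */
int GetRemoteTick(void);

/* Reference tick that the remote tick is compared against. */
int lastTick;

int RemoteProcessState(void)
{
    int tick = GetRemoteTick();

    if (tick == -1)
    {
        // Process recoverable error state.
        return -1;
    }

    if (tick == -2)
    {
        // Process unrecoverable error state.
        return -1;
    }

    if (tick < 0)
    {
        // Any other negative value: the watchdog counter has overflowed.
        return -1;
    }

    if (abs(abs(tick) - abs(lastTick)) > ALLOWED_PROCESS_LAG)
    {
        // Resynchronize process
    }
    else
    {
        // Process running normally.
    }

    return 0;
}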

That is a pseudocode sample from real code used in an embedded RTU for process control.

It's primitive, but it works. Not only does this ensure that the remote process is alive, but if the remote process has drifted in calculation speed (scan rates are affected by program size and complexity), it also makes sure that the two processes are still synchronized.

If you want more detail, start investigating the return codes used by Modbus, or how the OPC protocol manages its Quality byte.

entens 2010-09-03 21:17:59

Answer 3

A:

Well, I've thought about this problem for a long time, and two things have come up.

A software watchdog should be so simple that a crash is nearly impossible. For the obsessive, an interesting programming challenge would be to write a net of watchdogs, in different languages, which keep each other alive and together monitor the main process.

Challenging and interesting as that is, it seems like a big waste of time, and the scenario looks like soldiers at war.

Secondly, the application I'm developing has a hardware watchdog, which should always be present in critical operation.

So now my application has a software watchdog which refreshes the hardware one and monitors the program's life.
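One possible way to wire the two together, as a sketch only: the software watchdog records each sign of life from the program and refreshes the hardware watchdog only while the program has reported recently. The struct and function names here are hypothetical, and the policy of simply withholding the refresh (so that the hardware timer expires and resets the system) is my assumption about how the pieces could fit. The actual hardware refresh call is platform specific and is not shown.

```c
#include <assert.h>
#include <stdbool.h>
#include <time.h>

/* State the software watchdog keeps about the monitored program. */
struct sw_watchdog {
    time_t last_heartbeat;     /* when the program last reported in */
    long   max_silence_secs;   /* longest tolerated gap between reports */
};

/* Called whenever the monitored program reports that it is alive. */
void sw_watchdog_heartbeat(struct sw_watchdog *wd, time_t now)
{
    assert(wd != NULL);
    wd->last_heartbeat = now;
}

/* Decide whether the hardware watchdog may be refreshed at time `now`.
 * Returning false means: withhold the refresh and let the hardware
 * timer expire. */
bool sw_watchdog_may_refresh_hw(const struct sw_watchdog *wd, time_t now)
{
    assert(wd != NULL);
    long long silence = (long long)now - (long long)wd->last_heartbeat;
    return silence <= wd->max_silence_secs;
}
```

The watchdog loop would call sw_watchdog_may_refresh_hw on every pass and refresh the hardware timer only when it returns true. That way a hang in either the program or the software watchdog itself ends in a hardware reset.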

In the end, Paul, I completely agree with you.

Enrico Carlesso 2010-09-07 16:11:15

Best practice to monitor program life.