Hi guys!

I want to hear your opinion about program life monitoring.

This is the scenario. You have a simple program that normally works: it's well written, exceptions are handled, and so on.

How would you proceed if you wanted to ensure that this program keeps working FOREVER?

No external tools like crontab are available, but any amount of overhead is acceptable.

Would you use another program that continuously "pings" the main program? Or have the main program touch a file, and have another program check that file's modification time?

And how do you ensure that this second program always works?

So, come on, tell me your opinions or best practices in this context!

As a footnote, I have to write this program in Python, but it's a general-purpose question!

+5  A: 

In embedded systems, this is often done with a watchdog module.

A watchdog checks some location (could be a file, could be a memory location, whatever), and restarts the system under examination if the location does not meet criteria.

So you might have the program under observation periodically write an epoch timestamp to some programname_watchdog file. This would be part of its regular loop.

Then your watchdog (in a totally different process) would check the file. If the timestamp in it were sufficiently outdated, the other program would be killed and restarted, since it would be deemed to have critically malfunctioned (either hung or crashed). Note that your watchdog will contain only simple logic, so its chances of failing are much lower.
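
The program side of this could look roughly like the following in Python. It is only a sketch: the helper names `write_heartbeat` and `read_heartbeat` are made up for illustration, and writing through a temporary file followed by a rename is my own assumption, added so that the watchdog never reads a half-written stamp.

```python
import os
import tempfile
import time


def write_heartbeat(path):
    """Atomically write the current epoch time (in seconds) to `path`."""
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temporary file in the same directory, then rename it
    # over the target, so a reader never sees a partial stamp.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as tmp:
        tmp.write(str(time.time()))
    os.replace(tmp_path, path)


def read_heartbeat(path):
    """Return the stamp stored in `path` as a float, or None if unreadable."""
    try:
        with open(path) as f:
            return float(f.read().strip())
    except (OSError, ValueError):
        return None
```

The main loop would then call something like `write_heartbeat("programname_watchdog")` once per iteration, after the real work of that iteration has completed.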
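
A minimal Python sketch of that watchdog process follows. The names `heartbeat_age`, `needs_restart` and `watch` are hypothetical, and so is the startup grace period (the program gets `max_age` seconds to write its first stamp before it can be judged hung). Treat it as one possible shape, not a hardened implementation.

```python
import os
import subprocess
import time


def heartbeat_age(path, now=None):
    """Seconds since the stamp stored in `path`, or None if missing or unparsable."""
    now = time.time() if now is None else now
    try:
        with open(path) as f:
            return now - float(f.read().strip())
    except (OSError, ValueError):
        return None


def needs_restart(exited, age, uptime, max_age):
    """Decide whether the monitored program should be killed and restarted."""
    if exited:
        return True   # the process crashed or finished
    if uptime < max_age:
        return False  # still inside the startup grace period
    return age is None or age > max_age  # no stamp, or stamp too old: hung


def watch(command, path, max_age, poll_interval=1.0):
    """Run `command` forever, restarting it when it exits or stops stamping `path`."""
    while True:
        try:
            os.remove(path)  # drop any stamp left over from the previous run
        except FileNotFoundError:
            pass
        proc = subprocess.Popen(command)
        started = time.time()
        while True:
            time.sleep(poll_interval)
            exited = proc.poll() is not None
            uptime = time.time() - started
            if needs_restart(exited, heartbeat_age(path), uptime, max_age):
                break
        if proc.poll() is None:
            proc.kill()  # hung: kill it before starting a fresh copy
        proc.wait()
```

It would be started with something like `watch(["python", "main_program.py"], "programname_watchdog", max_age=30)`, where the script name and the 30-second limit are placeholders. Keeping the decision in the small pure function `needs_restart` keeps the watchdog's logic easy to check by hand.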

I'm positive there are other ways to accomplish this as well. This is just one way.

edit: You have to consider the stack your system is built on. The more external dependencies you have, the more risk of failure. You also have to consider a formal proof of program correctness if you are looking for perfect operation.

The question really becomes what you are expecting from your system; what sort of failures are unacceptable and what sort of failures are expected so you can compensate for them.

This question becomes a proof-hardware-software co-design issue very fast (and expensive, too). I'm curious to see what you are doing and what your solution is.

Paul Nathan
Of course, watchdogs are one of the strongest and most widely used solutions. I'll take a couple of days to think hard about it and I'll let you know.
Enrico Carlesso
A: 

Like Paul Nathan said, use a watchdog.

There are a few things you can do to make things more robust though, for example:

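#include <stdlib.h>  // abs

// Placeholder threshold; tune it to the scan rate of your process.
#define ALLOWED_PROCESS_LAG 10

// Provided elsewhere: returns the remote tick, or -1 / -2 on error.
int GetRemoteTick(void);

// Reference tick to compare against; assumed to be updated elsewhere.
int lastTick;

int RemoteProcessState(void)
{
    int tick = GetRemoteTick();

    if (tick == -1)
    {
        // Process recoverable error state.
        return -1;
    }

    if (tick == -2)
    {
        // Process unrecoverable error state.
        return -1;
    }

    if (tick < 0)
    {
        // Any other negative value: the tick counter has overflowed.
        return -1;
    }

    if (abs(abs(tick) - abs(lastTick)) > ALLOWED_PROCESS_LAG)
    {
        // The processes have drifted too far apart: resynchronize.
    }
    else
    {
        // Process running normally.
    }

    return 0;
}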

That is a pseudocode sample from real code used in an embedded RTU for process control.

It's primitive, but it works. Not only does this ensure that the remote process is alive, but if the remote process has drifted in calculation speed (scan rates are affected by program size and complexity), it will also make sure that the two processes are still synchronized.

If you want more data, start investigating the return codes used by Modbus, or how the OPC protocol handles managing its Quality byte.

entens
A: 

Well, I've thought long about this problem, and two things have come up.

First, a software watchdog should be so simple that crashing is nearly impossible. For the obsessive, an interesting programming challenge would be to write a network of watchdogs, in different languages, that keep one another alive and together monitor the main process.

Challenging and interesting as that is, it seems a big waste of time, and the scenario looks like soldiers at war.

Secondly, the application I'm developing has a hardware watchdog, which should always be present in critical operations.

So now my application has a software watchdog that refreshes the hardware one and monitors the program's life.
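
For anyone curious, the layering can be sketched in Python like this. It is an illustration and not my actual code: the function names and the three callbacks (`is_alive`, `restart_program`, `kick_hardware`) are hypothetical, and how the hardware watchdog is really refreshed depends on the board (on many Linux systems it is a write to `/dev/watchdog`, but check your hardware's documentation).

```python
import time


def supervise_once(is_alive, restart_program, kick_hardware):
    """Run one supervision cycle; return True if the program was healthy."""
    healthy = is_alive()
    if not healthy:
        restart_program()
    # Refresh the hardware watchdog only after the check (and any restart)
    # has completed. If this software watchdog hangs anywhere above, the
    # hardware timer is not refreshed and eventually resets the system.
    kick_hardware()
    return healthy


def supervise_forever(is_alive, restart_program, kick_hardware, interval):
    """Repeat the cycle; `interval` must be shorter than the hardware timeout."""
    while True:
        supervise_once(is_alive, restart_program, kick_hardware)
        time.sleep(interval)
```

The point of the ordering is that the hardware watchdog covers the software watchdog, and the software watchdog covers the program, so nothing is left unmonitored.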

In the end, Paul, I completely agree with you.

Enrico Carlesso