When things go badly awry in embedded systems I tend to write an error to a special log file in flash and then reboot (there's not much option if, say, you run out of memory).

I realize even that can go wrong, so I try to minimize the risk (by not allocating any memory during the final write, and by boosting the write process's priority).

But that relies on someone retrieving the log file. Now I was considering sending a message over the intertubes to report the error before rebooting.

On second thoughts, of course, it would be better to send that message after reboot, but it did get me to thinking...

What sort of things ought I be doing if I discover an irrecoverable error, and how can I do them as safely as possible in a system which is in an unstable state?

+1  A: 

I think the best-known example of proper exception handling is a missile self-destructing. The exception was caused by an arithmetic overflow in software. There obviously was a lot of tracing/recording media involved, because the root cause is known: it was discovered and debugged.

So every embedded design must include two features: recording media, like your log file, and a graceful halt, like disabling all timers/interrupts, shutting all ports and sitting in an infinite loop or, in the case of a missile, self-destruction.

RocketSurgeon
+4  A: 

There is no single answer to this. I would start with a Watchdog timer. This reboots the system if things go terribly awry.

Something else to consider - what is not in a log file is also important. If you have routine updates from various tasks/actions logged then you can learn from what is missing.
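One way to act on missing routine updates is to have each task check in periodically and let a monitor report which ones have gone quiet. The following is only a minimal sketch in C: the names `task_checkin`, `tasks_missing` and `TASK_COUNT`, and the idea of a shared tick counter, are assumptions for illustration and not part of any particular RTOS.

```c
#include <stddef.h>
#include <stdint.h>

#define TASK_COUNT 3u   /* hypothetical number of monitored tasks (max 32) */

/* Tick value at which each task last reported that it was alive. */
static uint32_t last_checkin[TASK_COUNT];

/* Called by each task as part of its routine work. */
void task_checkin(size_t task_id, uint32_t now_ticks)
{
    if (task_id < TASK_COUNT) {
        last_checkin[task_id] = now_ticks;
    }
}

/* Returns a bitmask with bit N set if task N has been silent for more than
 * max_silence_ticks. Unsigned subtraction keeps the comparison correct when
 * the tick counter wraps around. */
uint32_t tasks_missing(uint32_t now_ticks, uint32_t max_silence_ticks)
{
    uint32_t missing = 0u;
    for (size_t i = 0; i < TASK_COUNT; i++) {
        uint32_t silence = now_ticks - last_checkin[i];
        if (silence > max_silence_ticks) {
            missing |= (uint32_t)1u << i;
        }
    }
    return missing;
}
```

A monitor task could log the returned mask before a reboot, so the log shows which task stopped reporting rather than just that something went wrong.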

Finally, in the case that things go bad and you are still running: enter a critical section, turn off as much of the OS as possible, shut down peripherals, log as much state info as possible, then reboot!

Art
Three very good points. Thanks
Mawg
+1  A: 

Writing messages to flash before reboot in embedded systems is often a bad idea. As you point out, no one is going to read the message, and if the problem is not transient you wear out the flash.

When the system is in an inconsistent state, there is almost nothing you can do reliably, and the best thing to do is to restart the system as quickly as possible so that you can recover from transient failures (timing, special external events, etc.). In some systems I have written a trap handler that uses some reserved memory so that it can set up the serial port and then emit a stack dump and register contents without requiring extra stack space or clobbering registers.

A simple restart with a dump like that is reasonable: if the problem is transient, the restart will resolve it, and you want to keep things simple and let the device continue. If the problem is not transient, you are not going to make forward progress anyway, and someone can come along and connect a diagnostic device.

Very interesting paper on failures and recovery: "Why Do Computers Stop and What Can Be Done About It?" (Jim Gray, 1985)

janm
+5  A: 

One strategy is to use a section of RAM that is not initialised during power-on/reboot. That can be used to store data that survives a reboot, and then when your app restarts, early on in the code it can check that memory and see if it contains any useful data. If it does, then write it to a log, or send it over a comms channel.

How to reserve a section of RAM that is non-initialised is platform-dependent, and depends on whether or not you're running a full-blown OS (Linux) that manages RAM initialisation. If you're on a small system where RAM initialisation is done by the C start-up code, then your compiler probably has a way to put data (a file-scope variable) in a different section (besides the usual ones, e.g. .bss) which is not initialised by the C start-up code.

If the data is not initialised, then it will probably contain random data at power-up. To determine whether it contains random data or valid data, store a checksum or hash of the data alongside it, e.g. CRC-32, and verify it at start-up. If your processor has a way to tell you whether you're in a reboot or a power-up reset, then you should also use that to decide that the data is invalid after a power-up.
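The approach described above could be sketched roughly as follows. This is a hedged illustration, not a drop-in implementation: the `NOINIT` macro, the record layout and the function names are all hypothetical, and the way a variable is actually kept out of start-up initialisation (a section attribute plus a linker-script entry, on many toolchains) has to come from your own compiler's documentation. The CRC-32 used is the common reflected variant with polynomial 0xEDB88320.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* NOINIT is a placeholder. On GCC-style toolchains it might expand to
 * __attribute__((section(".noinit"))), with a matching linker-script section
 * that the C start-up code skips. Left empty, the code still builds on a PC. */
#ifndef NOINIT
#define NOINIT
#endif

/* Hypothetical record layout; the field names are illustrative only. */
typedef struct {
    uint32_t error_code;
    uint32_t program_counter;
    char     message[32];
    uint32_t crc;               /* CRC-32 of every field above */
} crash_record_t;

static NOINIT crash_record_t g_crash_record;

/* Bitwise CRC-32: slow but tiny, and needs no lookup table. */
uint32_t crc32_compute(const void *data, size_t len)
{
    const uint8_t *p = (const uint8_t *)data;
    uint32_t crc = 0xFFFFFFFFu;
    while (len--) {
        crc ^= *p++;
        for (int bit = 0; bit < 8; bit++) {
            if (crc & 1u) {
                crc = (crc >> 1) ^ 0xEDB88320u;
            } else {
                crc >>= 1;
            }
        }
    }
    return ~crc;
}

/* Call from the fatal-error path just before forcing a reset. */
void crash_record_save(uint32_t error_code, uint32_t pc, const char *message)
{
    memset(&g_crash_record, 0, sizeof g_crash_record);
    g_crash_record.error_code = error_code;
    g_crash_record.program_counter = pc;
    strncpy(g_crash_record.message, message, sizeof g_crash_record.message - 1);
    g_crash_record.crc = crc32_compute(&g_crash_record,
                                       offsetof(crash_record_t, crc));
}

/* Call early after start-up. Returns true and fills *out only if the record
 * survived a reboot intact; a power-up reset always counts as "no record". */
bool crash_record_load(bool power_on_reset, crash_record_t *out)
{
    uint32_t expected = crc32_compute(&g_crash_record,
                                      offsetof(crash_record_t, crc));
    bool valid = !power_on_reset && (g_crash_record.crc == expected);
    if (valid) {
        *out = g_crash_record;
    }
    /* Store a deliberately wrong CRC so the record is not reported twice. */
    g_crash_record.crc = ~expected;
    return valid;
}
```

On a real target you would probably also need to make sure the record has actually reached RAM before the reset is triggered (for example with `volatile` accesses or a memory barrier), which this sketch does not attempt.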

Craig McQueen
Very Amiga Guru Meditation and Kickstart like. Which means I like it. :-)
Amigable Clark Kant
+1 for guru meditation ;-)
Mawg
+2  A: 

The one thing you want to make sure of is that you do not corrupt data that might legitimately be in flash. If you try to write information in a crash situation, do so carefully, knowing that the system might be in a very bad state, so that nothing you do makes things worse.

Generally, when I detect a crash state I try to spit information out a serial port. A UART driver that's accessible from a crashed state is usually pretty simple - it just needs to be a simple polling driver that writes characters to the transmit data register when the busy bit is clear - a crash handler generally doesn't need to play nice with multitasking, so polling is fine. And it generally doesn't need to worry about incoming data, or at least not in any way that can't be handled by polling. In fact, a crash handler generally cannot expect that multitasking and interrupt handling will be working, since the system is screwed up.
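A polling driver of the kind described might look something like this in C. Treat it as a sketch under stated assumptions: the register layout `crash_uart_regs_t`, the busy-bit position and all function names are made up for illustration, and a real part's register names, offsets and bits must come from its reference manual. Passing the register block as a pointer lets the same code be pointed at real hardware or at a fake struct on a PC.

```c
#include <stdint.h>

/* Hypothetical register block; real UARTs differ. */
typedef struct {
    volatile uint32_t status;    /* assumed: bit 0 set while transmitter busy */
    volatile uint32_t tx_data;   /* assumed: writing a byte starts sending it */
} crash_uart_regs_t;

#define CRASH_UART_STATUS_BUSY (1u << 0)

/* Busy-wait until the transmitter is free, then send one character.
 * No interrupts, no buffers, no OS calls. */
void crash_uart_putc(crash_uart_regs_t *uart, char c)
{
    while (uart->status & CRASH_UART_STATUS_BUSY) {
        /* spin; the watchdog bounds how long this can last */
    }
    uart->tx_data = (uint8_t)c;
}

/* Send a NUL-terminated string. */
void crash_uart_puts(crash_uart_regs_t *uart, const char *s)
{
    while (*s != '\0') {
        crash_uart_putc(uart, *s++);
    }
}

/* Send a 32-bit value as eight hex digits, e.g. for register dumps.
 * Avoids printf, which may allocate or use a lot of stack. */
void crash_uart_put_hex32(crash_uart_regs_t *uart, uint32_t value)
{
    static const char digits[] = "0123456789ABCDEF";
    for (int shift = 28; shift >= 0; shift -= 4) {
        crash_uart_putc(uart, digits[(value >> shift) & 0xFu]);
    }
}
```

On a target, the pointer would typically be the UART's base address cast to `crash_uart_regs_t *`; that address is device-specific and deliberately not shown here.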

I try to have it write the register file, a portion of the stack and any important OS data structures (the current task control block or something) that might be available and interesting. A watchdog timer usually is responsible for resetting the system in this state, so the crash handler might not have the opportunity to write everything, so dump the most important stuff first (do not have the crash handler kick the watchdog - you don't want to have some bug mistakenly prevent the watchdog from resetting the system).

Of course this is most useful in a development setup, since when the device is released it might not have anything attached to the serial port. If you want to be able to capture these kinds of crash dumps after release, then they need to get written somewhere appropriate (like maybe a reserved section of flash - just make sure it's not part of the normal data/file system area unless you're sure it can't corrupt that data). Of course you'd need to have something examine that area at boot so it can be detected and sent somewhere useful or there's no point, unless you might get units back post-mortem and can hook them up to a debugging setup that can look at the data.

Michael Burr
A: 

Have you ever considered using a garbage collector ?

And I'm not joking.

If you do dynamic allocation at runtime in embedded systems, why not reserve a mark buffer and mark and sweep when the excrement hits the rotating air blower?

You've probably got the malloc (or whatever) implementation's source, right ?

If you don't have library sources for your embedded system forget I ever suggested it, but tell the rest of us what equipment it is in so we can avoid ever using it. Yikes (how do you debug without library sources?).

If your system is already dead... who cares how long it takes? It obviously isn't critical that it be running this instant; if it was, you couldn't risk "dying" like this anyway.

Tim Williscroft
That is a possibility, but I prefer to make it even simpler: a statically allocated chunk of memory, for use as a buffer pool. In embedded telecomms it is unusual to allocate a message buffer, etc., for more than a minute (ymmv), so during testing I sweep the pool looking for buffers allocated "for too long" and trace those. But while a garbage collector might keep the system running a little longer, it only postpones the inevitable. If garbage collection can regain resources, then I have sloppy coding somewhere.
Mawg
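The buffer-pool sweep described in the comment above could be sketched like this. It is an illustration only: the pool size, the `owner` tag used for tracing and every function name are assumptions, and a real pool would need locking or a critical section around allocation if it is shared between tasks.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define POOL_BUFFERS     4u    /* hypothetical pool dimensions */
#define POOL_BUFFER_SIZE 64u

typedef struct {
    bool        in_use;
    uint32_t    alloc_tick;    /* when the buffer was handed out */
    const char *owner;         /* caller-supplied tag, used when tracing */
    uint8_t     data[POOL_BUFFER_SIZE];
} pool_buffer_t;

/* Statically allocated, so the pool itself can never run the heap dry. */
static pool_buffer_t pool[POOL_BUFFERS];

/* Returns a free buffer, or NULL if the pool is exhausted. */
void *pool_alloc(const char *owner, uint32_t now_ticks)
{
    for (size_t i = 0; i < POOL_BUFFERS; i++) {
        if (!pool[i].in_use) {
            pool[i].in_use = true;
            pool[i].alloc_tick = now_ticks;
            pool[i].owner = owner;
            return pool[i].data;
        }
    }
    return NULL;
}

/* Returns a buffer to the pool; unknown pointers are ignored. */
void pool_free(void *ptr)
{
    for (size_t i = 0; i < POOL_BUFFERS; i++) {
        if (pool[i].in_use && (void *)pool[i].data == ptr) {
            pool[i].in_use = false;
            return;
        }
    }
}

typedef void (*pool_report_fn)(const char *owner, uint32_t age_ticks);

/* Counts buffers held longer than max_age_ticks and, if report is not NULL,
 * calls it once per stale buffer so the holder can be traced. */
size_t pool_sweep(uint32_t now_ticks, uint32_t max_age_ticks,
                  pool_report_fn report)
{
    size_t stale = 0;
    for (size_t i = 0; i < POOL_BUFFERS; i++) {
        if (pool[i].in_use) {
            uint32_t age = now_ticks - pool[i].alloc_tick;
            if (age > max_age_ticks) {
                stale++;
                if (report != NULL) {
                    report(pool[i].owner, age);
                }
            }
        }
    }
    return stale;
}
```

Unlike a garbage collector, the sweep only reports suspect buffers; it frees nothing, which matches the stated aim of finding the sloppy code rather than hiding it.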
What you describe is a kind of GC anyway. If your system crashes then either you _do_ have sloppy coding somewhere OR keeping running 100% of the time isn't a requirement. In which case "good enough" will do. If your embedded system is overwhelmed by external events, again, if it's within spec to fail, go ahead and fail. But if you do, increment your "overwhelmed error" counter. Maybe keep a queue of "overwhelmed at this timestamp" in that spare flash sector. If you connect to a real network, send an SNMP trap.
Tim Williscroft
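As a rough illustration of the "overwhelmed error" counter and timestamp queue suggested in the last comment, here is a small ring buffer in C. The type and function names are hypothetical, and the sketch keeps the log in RAM only; copying it to a spare flash sector or sending an SNMP trap is platform-specific and left out.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical size; a power of two keeps the slot index consistent even if
 * the 32-bit event counter eventually wraps. */
#define OVERLOAD_LOG_SLOTS 8u

typedef struct {
    uint32_t count;                           /* total events ever recorded */
    uint32_t timestamps[OVERLOAD_LOG_SLOTS];  /* most recent events */
} overload_log_t;

/* Record one "overwhelmed" event; the oldest entry is overwritten when the
 * queue is full. */
void overload_log_record(overload_log_t *log, uint32_t timestamp)
{
    log->timestamps[log->count % OVERLOAD_LOG_SLOTS] = timestamp;
    log->count++;
}

/* Copies the stored timestamps, oldest first, into out, which must have room
 * for OVERLOAD_LOG_SLOTS entries. Returns the number of entries copied. */
size_t overload_log_read(const overload_log_t *log, uint32_t *out)
{
    size_t n = (log->count < OVERLOAD_LOG_SLOTS) ? log->count
                                                 : OVERLOAD_LOG_SLOTS;
    uint32_t first = log->count - (uint32_t)n;
    for (size_t i = 0; i < n; i++) {
        out[i] = log->timestamps[(first + i) % OVERLOAD_LOG_SLOTS];
    }
    return n;
}
```

The running `count` tells you how often the system was overwhelmed in total, while the queue shows when the most recent events happened.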