I have an embedded system that has multiple (>20) tasks running at different priorities. I also have a watchdog task that runs to check that all the other tasks are not stuck. I know my watchdog is working because every once in a blue moon it reboots the system when a task does not check in.

How do I determine which task died?

I can't simply blame the task whose last check-in is the oldest, because it might have been held off by a higher-priority task that is not yielding.

Any suggestions?

+1  A: 

Is this pre-emptive? I gather so since otherwise a watchdog task would not run if one of the others had gotten stuck.

You make no mention of the OS, but if the watchdog task can tell that a single task has not checked in, there must be a separate channel of communication between each task and the watchdog.

You'll probably have to modify the watchdog to somehow dump the task number of the one that hasn't checked in and dump the task control blocks and memory so you can do a post-mortem.

Depending on the OS, this could be easy or hard.

paxdiablo
A: 

Depending on your system and OS, there may be different approaches. One very low-level approach I have used is to turn an LED on while each task is running. You may need to put a scope on the LEDs to see very fast task switching.
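For illustration, a minimal sketch of the per-task LED idea. The names led_on()/led_off(), do_taskA_work() and wait_for_next_event() are placeholders for your platform's GPIO driver and the task's real body, not part of this answer:

// Each task drives its own LED, so a scope shows exactly when it runs.
void taskA_main(void)
{
   while(1) {
     led_on(LED_TASK_A);     // task A has the CPU
     do_taskA_work();
     led_off(LED_TASK_A);    // task A is about to block or yield
     wait_for_next_event();
   }
}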

simon
A: 

For an interrupt-driven watchdog, you'd just make the task switcher update the currently running task number each time it is changed, allowing you to identify which one didn't yield.

However, you suggest you wrote the watchdog as a task yourself, so before rebooting, surely the watchdog can identify the starved task? You can store this in memory that persists across a warm reboot, or send it over a debug interface. The problem is that the starved task is probably not the problematic one: you'll probably want to know the last few task switches (and their times) in order to identify the cause.
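As a rough sketch of that idea, assuming the RTOS lets you hook context switches and that the linker places the log in a RAM section that is not cleared on a warm reboot (the hook name, the .noinit section, and read_tick_count() are all assumptions, not part of this answer):

#include <stdint.h>

#define SWITCH_LOG_LEN 16

// Kept in a RAM section that startup code does not zero, so the history
// of recent task switches survives the watchdog reset.
struct switch_record {
   uint8_t  task_id;
   uint32_t tick;
};
static struct switch_record switch_log[SWITCH_LOG_LEN] __attribute__((section(".noinit")));
static uint8_t switch_log_head __attribute__((section(".noinit")));

// Called by the scheduler each time a new task is switched in.
void on_context_switch(uint8_t next_task_id)
{
   switch_log_head = (uint8_t)((switch_log_head + 1u) % SWITCH_LOG_LEN);
   switch_log[switch_log_head].task_id = next_task_id;
   switch_log[switch_log_head].tick    = read_tick_count();   // free-running tick counter (assumed)
}

// After the reboot, walk switch_log backwards from switch_log_head to see
// which tasks ran, and when, just before the watchdog fired.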

Mark
A: 

A simplistic, back of the napkin approach would be something like this:

volatile uint8_t wd_tickle[NUM_TASKS];   // one check-in counter per task

void taskA_main()
{
   ...
   // main loop
   while(1) {
     ...
     wd_tickle[TASKA_NUM]++;
   }
}

... tasks B, C, D... follow similar pattern

void watchdog_task()
{
   // Run once per watchdog period, after the other tasks have had a
   // chance to check in.
   for(int i = 0; i < NUM_TASKS; i++) {
     if(0 == wd_tickle[i]) {
       // Egads! The task didn't kick us! Record the task number and reset
     }
     wd_tickle[i] = 0;   // re-arm the counter for the next period
   }
}
Benoit
The problem is that B is a higher priority than A. If B locks up, A never gets to run, so A doesn't kick the watchdog and A gets blamed for B's lockup.
Robert
+2  A: 

A per-task watchdog requires that the higher-priority tasks yield for enough time that all tasks can kick the watchdog. To determine which task is at fault, you'll have to find the one that's starving the others. You'll need to measure task execution times between watchdog checks to locate the actual culprit.
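One way to gather those measurements, as a sketch only, assuming the scheduler exposes a context-switch hook and a free-running tick counter (on_context_switch() and read_tick_count() are assumed names):

#include <stdint.h>

static volatile uint32_t task_runtime[NUM_TASKS];   // ticks used by each task since the last watchdog check
static uint8_t  current_task;
static uint32_t last_switch_tick;

// Called by the scheduler every time it switches tasks: charge the
// elapsed ticks to the task that just ran.
void on_context_switch(uint8_t next_task_id)
{
   uint32_t now = read_tick_count();
   task_runtime[current_task] += now - last_switch_tick;
   last_switch_tick = now;
   current_task = next_task_id;
}

// In the watchdog task: the entry whose runtime is close to the whole
// check period belongs to the task starving the others; clear the
// array after every check.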

Dingo
+2  A: 

I was also working on a watchdog reset problem over the last few weeks. Fortunately for me, the ramdump files (in an ARM development environment) contain an interrupt handler trace buffer holding the PC and LR at each interrupt. From that trace buffer I could find out exactly which part of the code was running before the watchdog reset.

I think that if you have the same kind of mechanism for storing the PC and LR at each interrupt, you can precisely identify the culprit task.

Chintan
A: 

How is your system working exactly? I always use a combination of software and hardware watchdogs. Let me explain...

My example assumes you're working with a preemptive real-time kernel and you have watchdog support in your CPU/microcontroller. This watchdog will perform a reset if it is not kicked within a certain period of time. You want to check two things:

1) The periodic system timer ("RTOS clock") is running (if not, functions like "sleep" would no longer work and your system is unusable).

2) All threads can run within a reasonable period of time.

My RTOS (www.lieron.be/micror2k) provides the possibility to run code in the RTOS clock interrupt handler. This is the only place where you refresh the hardware watchdog, so you're sure the clock is running all the time (if not the watchdog will reset your system).

In the idle thread (always running at the lowest priority), a "software watchdog" is refreshed. This is simply setting a variable to a certain value (e.g. 1000). In the RTOS clock interrupt (where you kick the hardware watchdog), you decrement and check this value. If it reaches 0, it means the idle thread has not run for 1000 clock ticks and you reboot the system (this can be done by looping indefinitely inside the interrupt handler to let the hardware watchdog reboot).
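A minimal sketch of that combination. kick_hardware_watchdog() and rtos_clock_tick() are placeholders for whatever hooks your RTOS and hardware actually provide:

#include <stdint.h>

#define IDLE_WATCHDOG_RELOAD 1000u

static volatile uint32_t idle_watchdog = IDLE_WATCHDOG_RELOAD;

// Lowest-priority thread: if this ever runs, no higher-priority thread
// is hogging the CPU, so re-arm the software watchdog.
void idle_thread(void)
{
   while (1) {
       idle_watchdog = IDLE_WATCHDOG_RELOAD;
   }
}

// Called from the RTOS clock interrupt handler.
void rtos_clock_tick(void)
{
   kick_hardware_watchdog();        // proves the system clock is alive
   if (idle_watchdog > 0u) {
       idle_watchdog--;
   } else {
       // The idle thread has not run for IDLE_WATCHDOG_RELOAD ticks:
       // spin here so the hardware watchdog resets the system.
       for (;;) { }
   }
}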

Now for your original question. I assume the system clock keeps running, so it's the software watchdog that resets the system. In the RTOS clock interrupt handler, you can do some statistics gathering when the software-watchdog situation occurs: instead of resetting the system right away, record which thread is running at each clock tick (after the problem occurs) and try to find out what's going on. It's not ideal, but it will help.

Another option is to add several software watchdogs at different priorities. Have the idle thread set VariableA to 1000 and have a (dedicated) medium-priority thread set VariableB. In the RTOS clock interrupt handler, you check both variables. With this information you know whether the looping thread has a priority higher than "medium" or lower than "medium". If you wish, you can add a 3rd or 4th or however many software watchdogs you like. Worst case, add a software watchdog for each priority that's used (it will cost you as many extra threads, though).
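A sketch of the two-level variant, reusing the decrement-in-the-clock-interrupt idea from above (the function name is a placeholder; VariableA/VariableB follow the naming in the text):

#include <stdint.h>

static volatile uint32_t VariableA = 1000u;   // re-armed by the idle thread
static volatile uint32_t VariableB = 1000u;   // re-armed by a dedicated medium-priority thread

// Called from the RTOS clock interrupt handler, alongside the hardware
// watchdog kick.
void check_software_watchdogs(void)
{
   if (VariableA > 0u) VariableA--;
   if (VariableB > 0u) VariableB--;

   if (VariableA == 0u && VariableB > 0u) {
       // Idle is starved but the medium thread still runs:
       // the hog has a priority lower than "medium".
   } else if (VariableB == 0u) {
       // Even the medium-priority thread is starved: the hog is at or
       // above medium priority. Record this and let the system reset.
   }
}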

Ron