views:

263

answers:

6

I think I might be having a stack overflow problem or something similar in my embedded firmware code. I am a new programmer and have never dealt with a SO so I'm not sure if that is what's happening or not.

The firmware controls a device with a wheel that has magnets evenly spaced around it and the board has a hall effect sensor that senses when magnet is over it. My firmware operates the stepper and also count steps while monitoring the magnet sensor in order to detect if the wheel has stalled.

I am using a timer interrupt on my chip (8 bit, 8057 acrh.) to set output ports to control the motor and for the stall detection. The stall detection code looks like this...

    //   Enter ISR
    //   Change the ports to the appropriate value for the next step
    //    ...

    StallDetector++;      // Increment the stall detector

    if(PosSensor != LastPosMagState)
    {
        StallDetector = 0;

        LastPosMagState = PosSensor;
    }
    else
    {
        if (PosSensor == ON) 
        {
            if (StallDetector > (MagnetSize + 10))
            {
                HandleStallEvent();
            }
        }
        else if (PosSensor == OFF) 
        {
            if (StallDetector > (GapSize + 10))
            {
                HandleStallEvent();
            }
        }
    }

this code is called every time the ISR is triggered. PosSensor is the magnet sensor. MagnetSize is the number of stepper steps that it takes to get through the magnet field. GapSize is the number of steps between two magnets. So I want to detect if the wheel gets stuck either with the sensor over a magnet or not over a magnet.

This works great for a long time but then after a while the first stall event will occur because 'StallDetector > (MagnetSize + 10)' but when I look at the value of StallDetector it is always around 220! This doesn't make sense because MagnetSize is always around 35. So the stall event should have been triggered at like 46 but somehow it got all the way up to 220? And I don't set the value of stall detector anywhere else in my code.

Do you have any advice on how I can track down the root of this problem?

The ISR looks like this

void Timer3_ISR(void) interrupt 14
{
    OperateStepper();  // This is the function shown above
    TMR3CN &= ~0x80;   // Clear Timer3 interrupt flag        
}

HandleStallEvent just sets a few variable back to their default values so that it can attempt another move...

#pragma save
#pragma nooverlay
void HandleStallEvent()
{
///*
    PulseMotor = 0;                 //Stop the wheel from moving
    SetMotorPower(0);               //Set motor power low
    MotorSpeed = LOW_SPEED;
    SetSpeedHz();
    ERROR_STATE = 2;
    DEVICE_IS_HOMED = FALSE;
    DEVICE_IS_HOMING = FALSE;
    DEVICE_IS_MOVING = FALSE;
    HOMING_STATE = 0;
    MOVING_STATE = 0;
    CURRENT_POSITION = 0;
    StallDetector = 0;
    return;
//*/
}
#pragma restore
+1  A: 

Does HandleStallEvent() "look at" StallDetector within the ISR or does it trigger something on the main loop? If it's on the main loop, are you clearing the interrupt bit?

Or are you looking at StallDetector from a debugger outside the ISR? Then a retriggered interrupt would use the correct value each time, but execute too many times, and you would only see the final, inflated value.

On second thought, more likely you don't have to clear an interrupt-generating register, but rather the interrupt pin is remaining asserted by the sensor. You need to ignore the interrupt after it's first handled until the line deasserts, such as by having the original ISR disable itself and and reinstall it in a second ISR which handles the 1->0 transition.

You might then also need to add debouncing hardware or adjust it if you have it.

Potatoswatter
I am looking at StallDetector using my IDE debugger. I set a breakpoint on HandleStallEvent and when it hits it I look at it
Jordan S
@Jordan: Stack overflow or no, if `StallDetector++` is the first line and you break on that, it should count up by one each time you drop into the debugger. If it doesn't, the debugger isn't showing you the whole story. Try debugging outside the ISR, such as by writing the value of `StallDetector` into a circular buffer and/or setting a flag which triggers a mainloop breakpoint.
Potatoswatter
It does count up by one each time. I can't seem to catch what it's doing right before the error occurs because it generally takes about 5 minutes of running at normal speed before the "stall" is triggered.
Jordan S
+1  A: 

This is definitely not stack overflow. If you blew the stack (overflowed it) your application would simply crash. This sounds more like something we used to call memory stomping in my C++ days. You may not be accessing the memory location that the StallDetector value occupies via StallDetector variable alone. There may be another part of your code "stomping" this particular memory location erroneously.

Unfortunately, this kind of issue is very hard to track down. About the only thing you could do is systematically isolate (remove from execution) chunks of your code until you narrow down and find the bug.

Paul Sasik
yeah that's what I'm afraid of.
Jordan S
+1  A: 

Do you have nest ISRs on your system? Could be something along the lines of start your ISR and increment your count, then interrupt it and do it again. Do this enough times and your interrupt stack can overflow. It could also explain such a high counter variable as well.

Michael Dorgan
+1  A: 

Check your parameter types. If you defined the parameters in a way different than the caller expects then calling your method could overwrite the space that variable is stored in. (For instance if you wrote the function expecting an int but it is pushing a long onto the stack.)

Bill K
+1  A: 

You could see what additional options your debugger supports. In Visual Studio, for example, it is possible to set a "data breakpoint", where you break when a memory location changes (or is set to a certain value, or above a threshold, ...).

If something like this is possible in your case, you could see where the data is changed and if there is someone else writing to the memory erroneously.

Daniel Rose
+2  A: 

Is PosSensor volatile? That is, do you update PosSensor somewhere, or is it directly reading a GPIO?

I assume GapSize is rather large (> 220?) It sounds to me like you might have a race condition.

// PosSensor == OFF, LastPosMagState == OFF
    if(PosSensor != LastPosMagState)
    {
        StallDetector = 0;

        LastPosMagState = PosSensor;
    }
    else
    {
// Race Condition: PosSensor turns ON here
// while LastPosMagState still == OFF
        if (PosSensor == ON) 
        {
            if (StallDetector > (MagnetSize + 10))
            {
                HandleStallEvent();
            }
        }
        else if (PosSensor == OFF) 
        {
            if (StallDetector > (GapSize + 10))
            {
                HandleStallEvent();
            }
        }
    }

You should cache the value of PosSensor once, right after doing StallDetector++, so that in the event PosSensor changes during your code, you don't start testing the new value.

ajs410
It looks like this solved the problem. I am still not sure how StallDetector was able to get up to 200+ but I added one line of code that caches the PosSensor value at the beginning of this function, changed a couple variable names and now it has been running for over 1500 sequential moves with no problems!
Jordan S
PosSensor and LastPosMagState were both OFF. This allowed StallDetector to approach the value (GapSize + 10). Then the race condition hits; PosSensor turns on after if(PosSensor != LastPosMagState), but before if (PosSensor == ON), so StallDetector doesn't get set to 0 when it should have been, allowing the GapSize'd StallDetector value to be compared against your MagnetSize
ajs410
Ohhhhh that makes sense! Thanks for the exp.
Jordan S