I've got a buffer overrun I absolutely can't seem to figure out (in C). First of all, it only happens maybe 10% of the time. The data it pulls from the DB doesn't seem to differ much between executions, at least not enough for me to find any discernible pattern as to when it happens. The exact message from Visual Studio is this:

A buffer overrun has occurred in hub.exe which has corrupted the program's internal state. Press Break to debug the program or Continue to terminate the program.

For more details please see Help topic 'How to debug Buffer Overrun Issues'.

If I debug, I find that it is broken in __report_gsfailure(), which I'm pretty sure comes from the compiler's /GS flag and also signifies that this is an overrun on the stack rather than the heap. I can also see which function it was leaving when the failure was reported, but I can't see anything in there that would cause this behavior. The function has also existed for a long time (10+ years, albeit with some minor modifications), and as far as I know, this has never happened before.

I'd post the code of the function, but it's decently long and references a lot of proprietary functions/variables/etc.

I'm basically just looking for either some idea of what I should be looking for that I haven't or perhaps some tools that may help. Unfortunately, nearly every tool I've found only helps with debugging overruns on the heap, and unless I'm mistaken, this is on the stack. Thanks in advance.

+2  A: 

While it won't help you on Windows, Valgrind is by far the best tool for detecting bad memory behavior.

If you are debugging the stack, you need to get to low-level tools: place a canary in the stack frame (perhaps a buffer filled with something like 0xA5) around any potential suspects. Run the program in a debugger and see which canaries no longer contain the right contents. You will gobble up a large chunk of stack doing this, but it may help you spot exactly what is occurring.
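
As a rough illustration of the canary idea, here is a minimal sketch. The names `guarded_buf`, `guard_init`, `guard_ok`, and `GUARD_CHECK` are made up for this example, and the 8-byte `data` field is just a placeholder for whichever buffer you suspect. It wraps the buffer and its canaries in one struct, on the assumption that the compiler may otherwise reorder separate local variables and move the canaries away from the buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define GUARD_BYTE 0xA5   /* pattern unlikely to appear in real data */
#define GUARD_SIZE 16     /* canary bytes on each side (arbitrary choice) */

/* Hypothetical wrapper: struct members keep their declared order, so the
   canaries stay on either side of the suspect buffer. */
struct guarded_buf {
    unsigned char front[GUARD_SIZE];
    char data[8];                      /* the buffer under suspicion */
    unsigned char back[GUARD_SIZE];
};

/* Fill both canaries with the pattern and clear the buffer. */
static void guard_init(struct guarded_buf *g) {
    memset(g->front, GUARD_BYTE, GUARD_SIZE);
    memset(g->data, 0, sizeof g->data);
    memset(g->back, GUARD_BYTE, GUARD_SIZE);
}

/* Returns 1 if both canaries are intact, 0 if any canary byte changed. */
static int guard_ok(const struct guarded_buf *g) {
    for (size_t i = 0; i < GUARD_SIZE; i++) {
        if (g->front[i] != GUARD_BYTE || g->back[i] != GUARD_BYTE) {
            return 0;
        }
    }
    return 1;
}

/* Stops the program (and breaks into an attached debugger) on a bad canary. */
#define GUARD_CHECK(g) assert(guard_ok(g))
```

Declare a `struct guarded_buf` as the local, call `guard_init` on entry, use `.data` wherever the original buffer was used, and sprinkle `GUARD_CHECK(&buf)` after each suspicious call. Keep in mind that the extra bytes change the stack layout, which can hide the very crash you are chasing.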

Yann Ramin
Yeah, I've used it in the past. While our server code does run on various flavors of Unix (Solaris/HP/AIX), it doesn't look like Valgrind is supported there, so unfortunately, it doesn't quite help me here.
Morinar
A: 

Wrap it in an exception handler and dump out useful information when it occurs.

Peter
+1  A: 

You could try putting some local variables on either end of the buffer, or even sentinels into the (slightly expanded) buffer itself, and trigger a breakpoint if those values aren't what you think they should be. Obviously, using a pattern that is not likely in the data would be a good idea.

dash-tom-bang
I put in a handful of local variable buffers hoping to catch some value and the issue hasn't reproduced in 25 or so tries (which is 2-3x more than I've ever gone before). It's like the buffers I added padded everything just enough so that nothing ever crashed. Even when I was debugging into them, the buffers held the exact values I would expect them to right before returning from the function every time.
Morinar
What if you just expand your buffer, and write some known values to the end of it?
dash-tom-bang
You say that as if I know which buffer I'm overwriting. If that wasn't clear, I have absolutely no idea. If I knew which buffer I was overrunning I'd merely set a hardware breakpoint and win.
Morinar
ha hah oh I see. I assumed it was one in particular, sorry. Usually in the case of stack stomping your overwrite will only destroy local variables, so if you've got a function with a number of local buffers, sentinels between them may help identify which one is overrunning.
dash-tom-bang
You could set watchpoints on all your sentinel values too, so you break into the debugger as soon as one of them changes.
caf
Accepting this as it helped me get to the solution: I ended up tracking this down by putting some local variables around various buffers and figuring out which buffer was overrunning. I then put hardware breakpoints on either side of the buffer and waited for it to reproduce. It was an insidious little bug where we were telling a function an 8-byte buffer was 10 bytes, and it was uppercasing characters (among other things), so it only reproduced if those extra two bytes happened to contain lowercase characters. Thanks all for the help!
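
For anyone curious why it was intermittent, here is a small sketch of that class of bug. It is not the real code: `upcase` and `neighbours_changed` are hypothetical names, and a single 10-byte array stands in for the 8-byte buffer plus the two stack bytes after it so the demonstration stays well-defined.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the routine: uppercases `len` bytes in place. */
static void upcase(char *buf, size_t len) {
    for (size_t i = 0; i < len; i++) {
        buf[i] = (char)toupper((unsigned char)buf[i]);
    }
}

/* Models an 8-byte field followed by 2 neighbouring bytes.
   Returns 1 if the neighbouring bytes were modified, 0 otherwise. */
static int neighbours_changed(const char *field8, const char *next2) {
    char region[10];
    char before[2];

    memcpy(region, field8, 8);       /* the real 8-byte buffer */
    memcpy(region + 8, next2, 2);    /* whatever happens to sit after it */
    memcpy(before, region + 8, 2);

    upcase(region, 10);              /* the bug: length given as 10, not 8 */

    return memcmp(before, region + 8, 2) != 0;
}
```

With neighbours such as `"XY"` or non-letter bytes, the over-long call rewrites them with identical values and nothing is corrupted; only lowercase neighbours such as `"xy"` actually change, which matches the "only sometimes" behavior.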
Morinar
A: 

Does this program recurse at all? If so, I'd check there to ensure you don't have an infinite recursion bug. If you can't see it manually, sometimes you can catch it in the debugger by pausing frequently and observing the stack.

RickNotFred
Nope. No recursion.
Morinar
+1  A: 

One thing I have done in the past to help narrow down a mystery bug like this was to create a variable with global visibility named checkpoint. Inside the culprit function, I set checkpoint = 0; as the very first line. Then, I added ++checkpoint; statements before and after function calls or memory operations that I even remotely suspected might be able to cause an out-of-bounds memory reference (plus peppering the rest of the code so that I had a checkpoint at least every 10 lines or so). When your program crashes, the value of checkpoint will narrow down the range you need to focus on to a handful of lines of code. This may be a bit overkill; I do this sort of thing on embedded systems (where tools like valgrind can't be used), but it should still be useful.
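
A minimal sketch of what that looks like, with made-up names: `suspect_function`, `load_record`, and `format_record` are placeholders for your own function and the calls you suspect. I'm assuming `volatile` here so the compiler doesn't merge or drop the increments.

```c
#include <stddef.h>
#include <stdio.h>

/* Global so it can be inspected from the debugger after a crash;
   volatile so every increment is actually performed. */
volatile int checkpoint = 0;

/* Hypothetical stand-ins for the calls under suspicion. */
static void load_record(char *dst, size_t n) {
    if (n > 0) {
        dst[0] = '\0';
    }
}

static void format_record(char *dst, size_t n) {
    snprintf(dst, n, "%s", "ok");
}

void suspect_function(void) {
    char record[32];

    checkpoint = 0;                        /* very first line */
    ++checkpoint;                          /* 1: before load_record */
    load_record(record, sizeof record);
    ++checkpoint;                          /* 2: after load, before format */
    format_record(record, sizeof record);
    ++checkpoint;                          /* 3: all suspect calls returned */
}
```

If the program dies in the middle of the function, `checkpoint` tells you which pair of increments brackets the failure. If the failure is only detected when the function returns, the counter will simply hold its final value.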

bta
Great idea! Will try that next.
Morinar
I had one practically every other line... the value of it when it crashes was EXACTLY what it should have been. :-p
Morinar
I don't understand what you mean. If `checkpoint` was 6 (for instance) when the program crashed, then your problem happened between the sixth and seventh `++checkpoint` statement. If you are able to read this value after the crash, it should pinpoint the source of your problem.
bta
The crash happens upon exiting from a function. With the function itself full of increments, it crashed as one would expect right after hitting all of them.
Morinar
If the error message is complaining about a buffer overrun and the crash happens upon returning from a function, then it sounds like you most likely have some code that is corrupting the function call stack. Either the stored value of the address to return to or the previous stack frame's cached register values have been corrupted. This might not be easy to track down. Try taking your crashing function and breaking it up into smaller sub-functions. If one of them crashes when it returns, it might give us some hints.
bta
Also, you may want to look at the raw call stack in a debugger and see if the data for the previous stack frame looks familiar (like data that may have been read out of the database, for instance).
bta