I was writing a function to figure out if a given system of linear inequalities has a solution, when all of a sudden it started giving the wrong answers after a seemingly innocuous change.

I undid some changes, re-did them, and then proceeded to fiddle for the next two hours, until I had reduced it to absurdity.

The following, inserted anywhere into the function body, but nowhere else in the program, fixes it:

if(0) {
    __asm__("nop\n");
    __asm__("nop\n");
    __asm__("nop\n");
    __asm__("nop\n");
}

It's for a school assignment, so I probably shouldn't post the function on the web, but this is so ridiculous that I don't think any context is going to help you. And all the function does is a bunch of math and looping. It doesn't even touch memory that isn't allocated on the stack.

Please help me make sense of the world! I'm loath to chalk it up to GCC, since the first rule of debugging is not to blame the compiler. But heck, I'm about to. I'm running Mac OS 10.5 on a G5 tower, and the compiler in question identifies itself as 'powerpc-apple-darwin9-gcc-4.0.1', but I'm thinking it could be an impostor...

UPDATE: Curiouser and curiouser... I diffed the .s files with the nops and without. Not only are there too many differences to check, but without the nops the .s file is 196,620 bytes, and with them it's 156,719 bytes. (!)

UPDATE 2: Wow, should have posted the code! I came back to the code today, with fresh eyes, and immediately saw the error. See my sheepish self-answer below.

+15  A: 

Most times when you modify the code inconsequentially and it fixes your problem, it's a memory corruption problem of some sort. We may need to see the actual code to do proper analysis, but that would be my first guess, based on the available information.

paxdiablo
+3  A: 

Does it happen in both debug and release builds (with symbols and without)? Does it behave the same way under a debugger? Is the code multithreaded? Are you compiling with optimizations? Can you try another machine?

ojblass
None of those were an answer but thanks!
ojblass
+3  A: 

Can you confirm that you are indeed getting different executables when you add the if(0) {nops}? I don't see nops on my system.

$ gcc --version
powerpc-apple-darwin9-gcc-4.0.1 (GCC) 4.0.1 (Apple Inc. build 5490)

$ cat nop.c
void foo()
{
    if (0) {
        __asm__("nop");
        __asm__("nop");
        __asm__("nop");
        __asm__("nop");
    }
}

$ gcc nop.c -S -O0 -o -
    .
    .
_foo:
    stmw r30,-8(r1)
    stwu r1,-48(r1)
    mr r30,r1
    lwz r1,0(r1)
    lmw r30,-8(r1)
    blr

$ gcc nop.c -S -O3 -o -
    .
    .
_foo:
    blr
sigjuice
I just learned so much in three seconds... thank you.
ojblass
What happens if you make it "if(1)"? I always thought -O0 did absolutely no optimization.
paxdiablo
Does the output of -O0 change when using __asm__ volatile("nop"); ? (Curious.)
strager
That's what I thought about if(0). if (1) does emit the four nops, which doesn't surprise me.
sigjuice
@strager if (0) {__asm__ volatile("nop") ... } makes no difference to -O0 or -O3.
sigjuice
@sigjuice does your dungeon master still talk to you? No honestly you have a mastery and understanding I respect.
ojblass
He's complimenting you (in a particularly geeky way, but a compliment nonetheless :-).
paxdiablo
Yeah, I tried it both ways, and the .s files differ wildly. (see above)
rodarmor
+12  A: 

It's faulty pointer arithmetic, either directly (through a pointer) or indirectly (by going past the end of an array). Check all your arrays. Don't forget that if your array is

 int a[4];

then a[4] doesn't exist.

You are accidentally overwriting something on the stack. The stack holds locals, parameters, and the return address of your function. You might be damaging the return address in a way that the extra nops happen to cure.

For example, if some code is adding something to the return address, inserting those extra 16 bytes of nops would cure the problem: instead of returning past the next line of code, you return into the middle of some nops.

One way you might be adding something to the return address is by going past the end of a local array or a parameter, for example

  int a[4];
  a[4]++;
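
A minimal sketch of the bounds rule above. The names ARRAY_LEN, index_in_bounds and checked_increment are hypothetical helpers, not part of the original code. The sketch only guards the subscript; it does not reproduce the corruption itself, because writing to a[4] is undefined behavior and its effect depends on the compiler and stack layout.

```c
#include <stddef.h>

/* Hypothetical helper macro: number of elements in a true array
   (it gives the wrong answer if handed a pointer). */
#define ARRAY_LEN(arr) (sizeof(arr) / sizeof((arr)[0]))

/* Hypothetical helper: nonzero if `index` is a valid subscript
   for an array holding `length` elements (valid range is 0..length-1). */
static int index_in_bounds(size_t index, size_t length)
{
    return index < length;
}

/* Increment a[index] only when the subscript is valid.
   Returns 1 on success, 0 if the write would land outside the array. */
static int checked_increment(int *a, size_t length, size_t index)
{
    if (!index_in_bounds(index, length))
        return 0;
    a[index]++;
    return 1;
}
```

With int a[4], checked_increment(a, ARRAY_LEN(a), 3) succeeds, while index 4 is rejected instead of touching whatever sits next to the array.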
Joel Spolsky
All PowerPC instructions are 32 bits. Four nops would mean an extra 16 bytes.
sigjuice
Yeah, they're nice instructions. NOP is actually ORI R0,R0,0 (OR register 0 with itself and constant 0).
paxdiablo
Given the crazy differences in the two .s files (see update above), I'm thinking that everyday memory corruption may not be it this time.
rodarmor
+2  A: 

My guess is stack corruption -- though gcc should optimize anything inside an if(0) out, I would have thought.

You could try sticking a big array on the stack in your function and see if that also fixes it -- that would also implicate stack corruption.

Are you sure you're running what you think you're running? (dumb question, but it happens.)

smcameron
Not a dumb question at all! One time I was editing on one machine while compiling/running on another. I was pretty surprised when my changes weren't showing up ;-) This time, though, no such luck!
rodarmor
I tried stack allocating a big array, but no dice.
rodarmor
+2  A: 
Trevor Boyd Smith
A: 

Break out that one function into a separate .c file (or .cpp or whatever). Compile just that one file with the nops and without them, to .s files and compare them.

Try an old version of gcc. Go back 5 or 10 years and see if things get stranger.

Windows programmer
+1  A: 

I am the author of "Debugging" so kindly referenced above by Trevor Boyd Smith. He has it right -- the key rules here are #2 Make It Fail (which you seem to be doing okay), and #3 Quit Thinking and Look. The conjectures above are very good (demonstrating mastery of rule #1 -- Understand the System -- in this case the way code size can change a bug). But actually watching it fail with a debugger will show you what's actually happening without guesswork.

+5  A: 

I came back to this after a few days busy with other things, and figured it out right away. Sorry I didn't post the code sooner, but it was hard coming up with a minimal example that displayed the problem.

The root problem was that I left out the return statements in the recursive function. I had:

bool function() {
    /* lots of code */
    function();
}

When it should have been:

bool function() {
    /* lots of code */
    return function();
}

The version without the return appeared to work because, through the magic of optimization, the right value happened to be in the right register at the right time, and made it to the right place.

The bug was originally introduced when I broke the first call out into its own special-cased function. At that point, the extra nops decided whether or not this first case was inlined directly into the general recursive function.

Then, for reasons that I don't fully understand, inlining this first case led to the right value not being in the right place at the right time, and the function returning junk.
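
A minimal sketch of the corrected pattern. The function reaches_zero is a hypothetical stand-in, not the assignment code; it only shows the recursive result being passed back explicitly.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the real function: reports whether
   `depth` can be counted down to exactly zero in steps of one. */
bool reaches_zero(int depth)
{
    if (depth < 0)
        return false;
    if (depth == 0)
        return true;
    /* The result of the recursive call must be returned explicitly.
       Writing just `reaches_zero(depth - 1);` here would fall off the
       end of a non-void function, and a caller that then uses the
       result gets undefined behavior: it may look right or be junk,
       depending on what is left in the return register. */
    return reaches_zero(depth - 1);
}
```

With the return in place, the value no longer depends on inlining decisions or on what the optimizer happens to leave behind.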

rodarmor
You could have caught that by turning on compiler warnings, viz. "test.c:6: warning: control reaches end of non-void function". Always a good idea to compile with "gcc -Wall"
Hugh Allen
@Hugh, something strange is going on with -Wall. It only warns me about falling out of the function when I've got optimization turned off. -Wall -O0 issues the warning, whereas -Wall -O{1,2,3,4} is quiet. Could it be something about deciding to inline the function?
rodarmor
Sounds like it might be a bug. Try with the latest gcc version, then go here: http://gcc.gnu.org/bugs.html
Hugh Allen
Interesting bug :)
Liran Orevi