ansaurus

Question

Bug fixed with four nops in an if(0), world no longer makes sense.

Answer 1

+15 A:

Most times when you modify the code inconsequentially and it fixes your problem, it's a memory corruption problem of some sort. We may need to see the actual code to do proper analysis, but that would be my first guess, based on the available information.

paxdiablo 2009-04-02 04:58:17

Answer 2

+3 A:

Does it happen in both debug and release builds (with symbols and without)? Does it behave the same way under a debugger? Is the code multithreaded? Are you compiling with optimizations? Can you try another machine?

ojblass 2009-04-02 05:07:59

None of those were an answer but thanks!

ojblass 2009-04-02 05:13:05

Answer 3

+3 A:

Can you confirm that you are indeed getting different executables when you add the if(0) {nops}? I don't see nops on my system.

$ gcc --version
powerpc-apple-darwin9-gcc-4.0.1 (GCC) 4.0.1 (Apple Inc. build 5490)

$ cat nop.c
void foo()
{
    if (0) {
        __asm__("nop");
        __asm__("nop");
        __asm__("nop");
        __asm__("nop");
    }
}

$ gcc nop.c -S -O0 -o -
    .
    .
_foo:
    stmw r30,-8(r1)
    stwu r1,-48(r1)
    mr r30,r1
    lwz r1,0(r1)
    lmw r30,-8(r1)
    blr

$ gcc nop.c -S -O3 -o -
    .
    .
_foo:
    blr

sigjuice 2009-04-02 05:19:26

I just learned so much in three seconds... thank you.

ojblass 2009-04-02 05:22:51

What happens if you make it "if(1)"? I always thought -O0 did absolutely no optimization.

paxdiablo 2009-04-02 05:39:53

Does the output of -O0 change when using __asm__ volatile("nop"); ? (Curious.)

strager 2009-04-02 05:42:41

That's what I thought about if(0). if (1) does emit the four nops, which doesn't surprise me.

sigjuice 2009-04-02 05:44:35

@strager if (0) {__asm__ volatile("nop") ... } makes no difference to -O0 or -O3.

sigjuice 2009-04-02 05:46:55

@sigjuice does your dungeon master still talk to you? No honestly you have a mastery and understanding I respect.

ojblass 2009-04-02 05:47:12

He's complimenting you (in a particularly geeky way, but a compliment nonetheless :-).

paxdiablo 2009-04-02 05:52:40

Yeah, I tried it both ways, and the .s files differ wildly. (see above)

rodarmor 2009-04-02 14:19:02

Answer 4

+12 A:

It's faulty pointer arithmetic, either directly (through a pointer) or indirectly (by going past the end of an array). Check all your arrays. Don't forget that if your array is

 int a[4];

then a[4] doesn't exist.

What you're doing is overwriting something on the stack accidentally. The stack contains locals, parameters, and the return address from your function. You might be damaging the return address in a way that the extra nops cure.

For example, if you have some code that is adding something to the return address, inserting those extra 16 bytes of nops would cure the problem, because instead of returning past the next line of code, you return into the middle of some nops.

One way you might be adding something to the return address is by going past the end of a local array or a parameter, for example

  int a[4];
  a[4]++;

Joel Spolsky 2009-04-02 05:25:40

All PowerPC instructions are 32 bits. Four nops would mean an extra 16 bytes.

sigjuice 2009-04-02 05:37:38

Yeah, they're nice instructions. NOP is actually ORI R0,R0,0 (OR register 0 with itself and constant 0).

paxdiablo 2009-04-02 05:43:33

Given the crazy differences in the two .s files (see update above), I'm thinking that everyday memory corruption may not be it this time.

rodarmor 2009-04-02 14:08:21

Answer 5

+2 A:

My guess is stack corruption -- though gcc should optimize anything inside an if(0) out, I would have thought.

You could try sticking a big array on the stack in your function and see if that also fixes it -- that would also implicate stack corruption.

Are you sure you're running what you think you're running? (dumb question, but it happens.)

smcameron 2009-04-02 05:49:39

No way a dumb question! One time I was editing on one machine while compiling/running on another. I was pretty surprised when my changes weren't showing up ;-) This time though, no such luck!

rodarmor 2009-04-02 14:10:11

I tried stack allocating a big array, but no dice.

rodarmor 2009-04-02 14:10:41

Answer 6

+2 A:

Trevor Boyd Smith 2009-04-02 16:00:35

Answer 7

A:

Break out that one function into a separate .c file (or .cpp or whatever). Compile just that one file with the nops and without them, to .s files and compare them.

Try an old version of gcc. Go back 5 or 10 years and see if things get stranger.

Windows programmer 2009-04-03 04:53:07

Answer 8

+1 A:

I am the author of "Debugging" so kindly referenced above by Trevor Boyd Smith. He has it right -- the key rules here are #2 Make It Fail (which you seem to be doing okay), and #3 Quit Thinking and Look. The conjectures above are very good (demonstrating mastery of rule #1 -- Understand the System -- in this case the way code size can change a bug). But actually watching it fail with a debugger will show you what's actually happening without guesswork.

2009-04-07 00:16:51

Answer 9

+5 A:

I came back to this after a few days busy with other things, and figured it out right away. Sorry I didn't post the code sooner, but it was hard coming up with a minimal example that displayed the problem.

The root problem was that I left out the return statements in the recursive function. I had:

bool function() {
    /* lots of code */
    function();    /* result of the recursive call is discarded */
}

When it should have been:

bool function() {
    /* lots of code */
    return function();
}

This worked because, through the magic of optimization, the right value happened to be in the right register at the right time, and made it to the right place.

The bug was originally introduced when I broke the first call into its own special-cased function. And, at that point, the extra nops determined whether this first case was inlined directly into the general recursive function.

Then, for reasons that I don't fully understand, inlining this first case led to the right value not being in the right place at the right time, and the function returning junk.

rodarmor 2009-04-07 15:35:46

You could have caught that by turning on compiler warnings, viz. "test.c:6: warning: control reaches end of non-void function". It's always a good idea to compile with "gcc -Wall".

Hugh Allen 2009-04-08 11:54:31

@Hugh, something strange is going on with -Wall. It only warns me about falling out of the function when I've got optimization turned off. -Wall -O0 issues the warning, whereas -Wall -O{1,2,3,4} is quiet. Could it be something about deciding to inline the function?

rodarmor 2009-04-08 13:26:08

Sounds like it might be a bug. Try with the latest gcc version, then go here: http://gcc.gnu.org/bugs.html

Hugh Allen 2009-04-09 00:56:50

Interesting bug :)

Liran Orevi 2009-04-16 23:25:13
