tags:

views:

154

answers:

6

[Edit: It seems this is an issue on gcc versions prior to 4.4, I got confused because of a gcc bugzilla entry reporting it for 4.5 (latest). Sorry, I should've tested with more recent versions. Still, the problem is somewhat valid as most people don't run gcc 4.4+.]

Is it possible to tell the compiler the variable used in a switch fits within the provided case statements? In particular if it's a small range and there's a jump table generated.

extern int a;
main()
{
        switch (a & 0x7) {   // 0x7  == 111  values are 0-7
        case 0: f0(); break;
        case 1: f1(); break;
        case 2: f2(); break;
        case 3: f3(); break;
        case 4: f4(); break;
        case 5: f5(); break;
        case 6: f6(); break;
        case 7: f7(); break;
        }
}

I tried xor'ing to low bits (as the example), using enums, using gcc_unreachable() to no avail. The generated code always checks if the variable is inside the range, adding a pointless branch conditional and moving away the jump table calculation code.

Note: this is in the innermost loop of a decoder, performance matters significantly.

It seems I'm not the only one.

There is no way to tell gcc that the default branch is never taken, although it will omit the default branch if it can prove that the value is never out of range based on earlier conditional checks.

So, how do you help gcc prove the variable fits and there's no default branch in the example above? (Without adding a conditional branch, of course.)

EDIT1: This was on OS X 10.6 Snow Leopard with GCC 4.2 (default from Xcode.) It didn't happen with GCC 4.4/4.3 in linux (reported by Nathon and Jens Gustedt.)

EDIT2: The functions in the example are there for readability, think those are inlined or just statements. Making a function call on x86 is expensive.

Also the example, as mentioned in the note, belongs inside a loop on data (big data.)

The generated code with gcc 4.2/OS X is:

[...]
andl    $7, %eax
cmpl    $7, %eax
ja  L11
mov %eax, %eax
leaq    L20(%rip), %rdx
movslq  (%rdx,%rax,4),%rax
addq    %rdx, %rax
jmp *%rax
.align 2,0x90
L20:
.long   L12-L20
.long   L13-L20
.long   L14-L20
.long   L15-L20
.long   L16-L20
.long   L17-L20
.long   L18-L20
.long   L19-L20
L19:
[...]

The problem lies on cmp $7, %eax / ja L11.

EDIT3:

OK, I'm going with the ugly solution and adding a special case for gcc versions below 4.4 using a different version without a switch and using goto and gcc's &&label extensions.

static void *jtb[] = { &&c_1, &&c_2, &&c_3, &&c_4, &&c_5, &&c_6, &&c_7, &&c_8 };
[...]
goto *jtb[a & 0x7];
[...]
while(0) {
c_1:
// something
break;
c_2:
// something
break;
[...]
}

Note the array of labels is static so it's not computed every call.

Thanks everyone for the great help! And sorry to those with valid answers who didn't get the mark :(

+2  A: 

Have you tried declaring the switch variable as a bitfield?

struct Container {
  uint16_t a:3;
  uint16_t unused:13;
};

struct Container cont;

cont.a = 5;  /* assign some value */
switch( cont.a ) {
...
}

Hope this works!

Praetorian
Interesting. But no luck, it still compares to 7 and "ja L5". I'm using gcc 4.2 BTW (Forgot to mention!) But the guy on the gcc bugzilla reported it happens in gcc 4.5 for him (though it might be non-x86.)
alecco
+1  A: 

Perhaps just use a default label for the fist or last case?

Jens Gustedt
I forgot to mention I tried this, too. This just makes the conditional jump include the first or last cases, still branching. It seems like gcc can't tell the value is in the range 0-7.
alecco
@aleccolocco: too bad. You didn't tell what version of gcc you have, maybe you just have a bad one? I suppose you also tried all optimizing flags?
Jens Gustedt
@aleccolocco: sorry only read the description of your compiler after my last comment. I experimented on my machine (64 bit, gcc 4.4.3) I get a perfect jump table, and the only warning is the missing `return` for main. I'll also try on my phone an ARM with gcc 4.2
Jens Gustedt
On ARM with gcc 4.2 it produces a jump table too, but isn't able to optimize out the bound check, which 4.4.3 does. The two instructions, `and` and `cmp` are really succeeding each other, so this is not very much optimized.BTW, I observed that the generated code is quite different when I add a `return` statement after the `switch`.
Jens Gustedt
@Jens Gustedt: wow, you got busy on this one! My original code does have return and code after the switch.
alecco
gcc 4.4.3 on i686: jump table, no bound check.gcc 4.0.1 on i686: jump table, bound check.So my guess is that the version makes the difference, but unfortunately I don't have my hands on a machine with 4.5 to do a test.
Jens Gustedt
Yes, it's a GCC <4.4 problem. Thank you! About the original on adding default label to the first or last cases, it could be counter-productive because the branch would be taken sometimes instead of never. (Not tested.)
alecco
+3  A: 
Nathon
Yeah, exactly the same code for -O3 and -O5, as stated on another comment I'm using gcc 4.2 but the bugzilla entry says it happens with gcc 4.5, too.
alecco
Hmm, mine is gcc (Debian 4.4.4-6) 4.4.4.
Nathon
There is no -O5 (http://gcc.gnu.org/onlinedocs/gcc-4.4.4/gcc/Optimize-Options.html), so not really surprising that it does the same as -O3 ;-)
Peter
@Nathon: Just tried it on Gentoo with 4.5 and had the same result as you! Argh. Still, this is for an open source project so can't demand users have the latest gcc version. Now I'm torn on who to assign the answer as correct :-/
alecco
+4  A: 

Perhaps you could use an array of function pointers instead of a switch ?

#include <stdio.h>

typedef void (*func)(void);

static void f0(void) { printf("%s\n", __FUNCTION__); }
static void f1(void) { printf("%s\n", __FUNCTION__); }
static void f2(void) { printf("%s\n", __FUNCTION__); }
static void f3(void) { printf("%s\n", __FUNCTION__); }
static void f4(void) { printf("%s\n", __FUNCTION__); }
static void f5(void) { printf("%s\n", __FUNCTION__); }
static void f6(void) { printf("%s\n", __FUNCTION__); }
static void f7(void) { printf("%s\n", __FUNCTION__); }

int main(void)
{
    const func f[8] = { f0, f1, f2, f3, f4, f5, f6, f7 };
    int i;

    for (i = 0; i < 8; ++i)
    {
        f[i]();
    }
    return 0;
}
Paul R
alecco
(Sorry to dead post…) Function call overhead is a given as long as you're doing a true function call. That should be eliminated if you turn it into a tail call by having fN return int.
Potatoswatter
+1  A: 

This question is certainly interesting from the standpoint of a missed compiler optimization that is seemingly obvious to us, and I did spend considerable time trying to come up with a straightforward solution, largely out of personal curiousity.

That said, I have to admit I am highly skeptical that this additional instruction will ever result in a measurable performance difference in practice, especially on a new mac. If you have any significant amount of data, you'll be I/O bound, and a single instruction will never be your bottleneck. If you have a tiny amount of data, then you'll need to perform a lot lot lot of calculations repeatedly before a single instruction will become a bottleneck.

Would you post some code to show that there really is a performance difference? Or describe the code and data your working with?

John
Yes. If the branch is unlikely the predictor will mark it. But it will consume 2 cycles. I'm working on a high performance compression algorithm and implementation. One of the problems with compression is the unpredictability due to the nature of compression. Is the next block on decompression a literal or a match? (picking source) How long is it? Does it cross a cache-line or page boundary? (serious performance penalty.) This loop will be processed for every "atom" on decompression. Performance matters: Google has their super-secret compression algorithm, Zippy. (dbs, comms, everything)
alecco
When originally posted I thought the bug/issue was broken in recent versions of gcc (because of the gcc bugzilla entry for 4.5 linked.) My bad. @Jens Gustedt above confirmed this. I tested then 4.4/4.5 on a different OS. I'll edit the question to reflect this on top. Thanks! (and have an upvote for valid critical thinking!)
alecco
I agree that performance matters. I'm asking whether that cmpl instruction has a measurable performance effect, and if so, under what real circumstances (because I can't think of any.) It's the difference between this being a question of "how do I hack a specific version of gcc just for the sake of curiousity" and a question about a legitimate code optimization.
John
Yeah. At the moment the test data is random chunks of Wikipedia text and still need to finish other parts of the code. But this is the innermost most important decoding loop, for just 1MB of typical plaintext it will be run ~180k times (from tests.)
alecco
+1  A: 

I didn't try, but I'm not sure that gcc_unreachable does the same thing as __builtin_unreachable. Googling the two, gcc_unreachable appears to be designed as a as an assertion tool for development of GCC itself, perhaps with a branch prediction hint included, whereas __builtin_unreachable makes the program instantly undefined — which sounds like deleting the basic block, which is what you want.

http://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html#index-g_t_005f_005fbuiltin_005funreachable-3075

Potatoswatter
Yeah, here's proof that `gcc_unreachable` is a function and not the abstract representation of void code: http://old.nabble.com/optimization-of-switch-statements-on-i386-td15366926.html
Potatoswatter