ansaurus

Question

Answer 1

+4 A:

If the number of permutations for the code is reasonable, you can define your code as C++ templates and generate all variants.

ndim 2010-04-01 20:27:13

Interesting! I'll try that, although the number of permutation is going to grow badly... We'll have more than a dozen flags and the code itself is over 1000 lines.

Gabriel 2010-04-01 20:40:44

You have a 1000-line long loop body? Ouch.

Steve Jessop 2010-04-01 20:48:15

Answer 2

A:

Is this x64? You might be able to use function pointers and a conditional move to avoid the branch predictor. Load the address of the procedure based on the user settings; one of the procedures could be a dummy that does nothing. You should be able to do this without any inline ASM at all.

danben 2010-04-01 20:28:11

I'd think that if you're trying to avoid the cost of a branch, then introducing an out-of-line function call isn't acceptable. But I may be missing something.

Steve Jessop 2010-04-01 20:33:38

@Steve Jessop - if I'm understanding the OP correctly, he wants to avoid premature execution of the body of his expensive function. Also, it is not the branching itself but the branch penalty that is expensive. (Keep in mind I'm just playing devil's advocate - I have no idea if the branch penalty is making an actual difference in his application, but I see no reason not to give the benefit of the doubt.)

danben 2010-04-01 20:44:57

I'm not sure a CALL would be faster than an IF, even an IF that breaks the branch predictor, which would not be the case anyway.

Gabriel 2010-04-01 20:56:25

If you aren't worried about the branch penalty then what is the point of the question?

danben 2010-04-01 21:04:01

@danben: sure, I'm waving my hands a lot by saying that a call to a variable address is even more expensive than a mis-predicted branch. It might not be, I don't actually know. But I suspect they both play merry havoc with the instruction pipeline.

Steve Jessop 2010-04-01 21:08:39

@Steve Jessop - I believe that is not the case; the conditional move removes any possibility of a branch penalty (as there is no branch). The worst thing it does is delay the call until the address is known. I would have to imagine this would be less expensive than flushing and reloading the pipeline.

danben 2010-04-02 00:16:44

OK, I don't know how x64 (or rather, any specific x64 CPU) does a computed call. You're saying it does it without interrupting the pipeline, so the pipeline will simultaneously contain instructions from here, and instructions from way-megabytes-over-there-determined-at-runtime, just as if it were sequential instructions or a correctly-predicted branch? In that case, I see why you're saying a call through a function pointer is cheaper than a possibly-mispredicted branch, at least ignoring possible overhead of the calling convention. Thanks.

Steve Jessop 2010-04-02 01:01:06

No, that's not what I'm saying at all - I'm just saying that with an indirect function call, the pipeline is only stalled, whereas with a mispredicted branch it needs to be flushed, which I believe (but am not entirely certain) is more expensive.

danben 2010-04-02 02:53:55

Answer 3

+4 A:

You do not specify what compiler and platform you are using, which will prevent most people from being able to help you. For example, on some platforms, the code section is not going to be writeable, so you won't be able to replace the NOPs with a JMP.

You are trying to pick-and-choose the optimizations offered to you by the compiler and second-guessing it. In general, it's a bad idea. Either write the whole inner loop block in assembly, which would prevent the compiler eliminating is as dead code, or put the damn if statement in there and let the compiler do its thing.

I'm also dubious that the branch prediction is bad enough where you will gain any sort of a net win from doing what you're proposing. Are you sure this isn't a case of premature optimization? Have you written the code in the most obvious way possible and only then determined that its performance isn't good enough? That would be my suggested start.

RarrRarrRarr 2010-04-01 20:29:17

You're right, it's fixed. It's VStudio 2008.

Gabriel 2010-04-01 20:36:40

Answer 4

+5 A:

You can cut the cost of the branch by up to 10, just by moving it out of the loop:

int Test()
{
    int x = 2;
    if (should_skip) {
        for (int i=0 ; i<10 ; i++)
        {
            x *= 2;
            x *= 2;
        }
    } else {
        for (int i=0 ; i<10 ; i++)
        {
            x *= 2;
            x *= 2;
            x *= 2;
        }
    }

    return x;
}

In this case, and others like it, that might also provoke the compiler into doing a better job of optimising the loop body, since it will consider the two possibilities separately rather than trying to optimise conditional code, and it won't optimise anything away as dead.

If this results in too much duplicated code to be maintainable, use a template that takes x by reference:

    int x = 2;
    if (should_skip) {
        doLoop<true>(x);
    } else {
        doLoop<false>(x);
    }

And check that the compiler inlines it.

Obviously this increases code size a bit, which will occasionally be a concern. Whichever way you do it though, if this change doesn't produce a measurable performance improvement then I'd guess that yours won't either.

Steve Jessop 2010-04-01 20:32:04

Would have to use templates, no choice. 10 was an arbitrary test value. I'll try that, thx!

Gabriel 2010-04-01 20:42:42

Answer 5

+1 A:

Here's an actual answer to the actual question!

volatile int y = 0;

int Test() 
{
    int x = 2; 
    for (int i=0 ; i<10 ; i++) 
    { 
        x *= 2; 

        __asm {NOP}; // to skip it replace this 
        __asm {NOP}; // by JMP 2 (after the goto) 
        goto dont_do_it;
    keep_my_code:
        x *= 2; // Op to skip or not 
    dont_do_it: 
        x *= 2; 
    }
    if (y) goto keep_my_code;
    return x; 
}

Jeffrey L Whitledge 2010-04-01 20:42:40

Thanks for the suggestion. I actually tried that. You wouldn't believe what the compiler generates... The result is too messy to be changed:1. The code inside the loop, after goto dont_do_it is still dead, so it's removed from the loop2. The end of the loop including the code removed is replicated in place of an actual jump, so keep_my_code is executed only after if(y)You're still the first one to try to actually answer the actual question (:

Gabriel 2010-04-01 20:53:03

@Gabriel - Oh wow. It sounds like the shape of the highly optimized code is just too slippery to ever allow for run-time code manipulation. It sounds like you're going to have to write the whole thing yourself in assembler, if you still want to go down that road.

Jeffrey L Whitledge 2010-04-01 20:59:21

I've though of that, but I'm afraid the compiler might throw away unreachable asm.

Gabriel 2010-04-02 18:50:39

Answer 6

A:

This may give insight:

#pragma optimize for Visual Studio.

That said, for this particular problem I would hand-code into ASM, using the VS asm output as a reference point.

At the meta level, I would have to be very certain this was the best design & algorithm for what I was doing before I started optimizing for the CPU pipe.

Paul Nathan 2010-04-01 21:15:42

Using the vs asm output would be quite painful coz I would have to rework a bunch of asm every time I change the base C++. Remember I'm trying to have a solution that does not require much work whenever I change the C++ code.

Gabriel 2010-04-02 18:46:24

@Gabriel: No, what I am suggesting is writing the whole routine in asm. It would be modified separately, like any other function.

Paul Nathan 2010-04-02 20:06:50

Answer 7

A:

If you get this to work then I would profile it to make sure that it really is faster for you. On modern CPUs there is very little you can do that is slower than modifying code that is already in the cpu cache, or worse, the cpu pipeline. The cpu basically has to throw out all the work that is in the pipeline and start again.

John Burton 2010-04-01 22:48:34

ansaurus

tags:

views:

answers:

How to keep unreachable code?

related questions