views:

410

answers:

7

I'd like to write a function that would have some optional code to be executed or not depending on user settings. The function is cpu-intensive and having ifs in it would be slow since the branch predictor is not that good.

My idea is making a copy in memory of the function and replace NOPs with a jump when I don't want to execute some code. My working example goes like this:

int Test()
{
    int x = 2;
    for (int i=0 ; i<10 ; i++)
    {
        x *= 2;

        __asm {NOP}; // to skip it replace this
        __asm {NOP}; // by JMP 2 (after the goto)
            x *= 2; // Op to skip or not

        x *= 2;
    }
    return x;
}

In my test's main, I copy this function into a newly allocated executable memory and replace the NOPs by a JMP 2 so that the following x *= 2 is not executed. JMP 2 is really "skip the next 2 bytes".

The problem is that I would have to change the JMP operand every time I edit the code to be skipped and change its size.

An alternative that would fix this problem would be:

__asm {NOP}; // to skip it replace this
__asm {NOP}; // by JMP 2 (after the goto)
goto dont_do_it;
    x *= 2; // Op to skip or not
dont_do_it:
x *= 2;

I would then want to skip or not the goto, which has a fixed size. Unfortunately, in full optimization mode, the goto and the x*=2 are removed because they are unreachable at compilation time.

Hence the need to keep that dead code.

I'm using VStudio 2008.

+4  A: 

If the number of permutations for the code is reasonable, you can define your code as C++ templates and generate all variants.

ndim
Interesting! I'll try that, although the number of permutation is going to grow badly... We'll have more than a dozen flags and the code itself is over 1000 lines.
Gabriel
You have a 1000-line long loop body? Ouch.
Steve Jessop
A: 

Is this x64? You might be able to use function pointers and a conditional move to avoid the branch predictor. Load the address of the procedure based on the user settings; one of the procedures could be a dummy that does nothing. You should be able to do this without any inline ASM at all.

danben
I'd think that if you're trying to avoid the cost of a branch, then introducing an out-of-line function call isn't acceptable. But I may be missing something.
Steve Jessop
@Steve Jessop - if I'm understanding the OP correctly, he wants to avoid premature execution of the body of his expensive function. Also, it is not the branching itself but the branch penalty that is expensive. (Keep in mind I'm just playing devil's advocate - I have no idea if the branch penalty is making an actual difference in his application, but I see no reason not to give the benefit of the doubt.)
danben
I'm not sure a CALL would be faster than an IF, even an IF that breaks the branch predictor, which would not be the case anyway.
Gabriel
If you aren't worried about the branch penalty then what is the point of the question?
danben
@danben: sure, I'm waving my hands a lot by saying that a call to a variable address is even more expensive than a mis-predicted branch. It might not be, I don't actually know. But I suspect they both play merry havoc with the instruction pipeline.
Steve Jessop
@Steve Jessop - I believe that is not the case; the conditional move removes any possibility of a branch penalty (as there is no branch). The worst thing it does is delay the call until the address is known. I would have to imagine this would be less expensive than flushing and reloading the pipeline.
danben
OK, I don't know how x64 (or rather, any specific x64 CPU) does a computed call. You're saying it does it without interrupting the pipeline, so the pipeline will simultaneously contain instructions from here, and instructions from way-megabytes-over-there-determined-at-runtime, just as if it were sequential instructions or a correctly-predicted branch? In that case, I see why you're saying a call through a function pointer is cheaper than a possibly-mispredicted branch, at least ignoring possible overhead of the calling convention. Thanks.
Steve Jessop
No, that's not what I'm saying at all - I'm just saying that with an indirect function call, the pipeline is only stalled, whereas with a mispredicted branch it needs to be flushed, which I believe (but am not entirely certain) is more expensive.
danben
+4  A: 

You do not specify what compiler and platform you are using, which will prevent most people from being able to help you. For example, on some platforms, the code section is not going to be writeable, so you won't be able to replace the NOPs with a JMP.

You are trying to pick-and-choose the optimizations offered to you by the compiler and second-guessing it. In general, it's a bad idea. Either write the whole inner loop block in assembly, which would prevent the compiler eliminating is as dead code, or put the damn if statement in there and let the compiler do its thing.

I'm also dubious that the branch prediction is bad enough where you will gain any sort of a net win from doing what you're proposing. Are you sure this isn't a case of premature optimization? Have you written the code in the most obvious way possible and only then determined that its performance isn't good enough? That would be my suggested start.

RarrRarrRarr
You're right, it's fixed. It's VStudio 2008.
Gabriel
+5  A: 

You can cut the cost of the branch by up to 10, just by moving it out of the loop:

int Test()
{
    int x = 2;
    if (should_skip) {
        for (int i=0 ; i<10 ; i++)
        {
            x *= 2;
            x *= 2;
        }
    } else {
        for (int i=0 ; i<10 ; i++)
        {
            x *= 2;
            x *= 2;
            x *= 2;
        }
    }

    return x;
}

In this case, and others like it, that might also provoke the compiler into doing a better job of optimising the loop body, since it will consider the two possibilities separately rather than trying to optimise conditional code, and it won't optimise anything away as dead.

If this results in too much duplicated code to be maintainable, use a template that takes x by reference:

    int x = 2;
    if (should_skip) {
        doLoop<true>(x);
    } else {
        doLoop<false>(x);
    }

And check that the compiler inlines it.

Obviously this increases code size a bit, which will occasionally be a concern. Whichever way you do it though, if this change doesn't produce a measurable performance improvement then I'd guess that yours won't either.

Steve Jessop
Would have to use templates, no choice. 10 was an arbitrary test value. I'll try that, thx!
Gabriel
+1  A: 

Here's an actual answer to the actual question!

volatile int y = 0;

int Test() 
{
    int x = 2; 
    for (int i=0 ; i<10 ; i++) 
    { 
        x *= 2; 

        __asm {NOP}; // to skip it replace this 
        __asm {NOP}; // by JMP 2 (after the goto) 
        goto dont_do_it;
    keep_my_code:
        x *= 2; // Op to skip or not 
    dont_do_it: 
        x *= 2; 
    }
    if (y) goto keep_my_code;
    return x; 
} 
Jeffrey L Whitledge
Thanks for the suggestion. I actually tried that. You wouldn't believe what the compiler generates... The result is too messy to be changed:1. The code inside the loop, after goto dont_do_it is still dead, so it's removed from the loop2. The end of the loop including the code removed is replicated in place of an actual jump, so keep_my_code is executed only after if(y)You're still the first one to try to actually answer the actual question (:
Gabriel
@Gabriel - Oh wow. It sounds like the shape of the highly optimized code is just too slippery to ever allow for run-time code manipulation. It sounds like you're going to have to write the whole thing yourself in assembler, if you still want to go down that road.
Jeffrey L Whitledge
I've though of that, but I'm afraid the compiler might throw away unreachable asm.
Gabriel
A: 

This may give insight:

#pragma optimize for Visual Studio.

That said, for this particular problem I would hand-code into ASM, using the VS asm output as a reference point.

At the meta level, I would have to be very certain this was the best design & algorithm for what I was doing before I started optimizing for the CPU pipe.

Paul Nathan
Using the vs asm output would be quite painful coz I would have to rework a bunch of asm every time I change the base C++. Remember I'm trying to have a solution that does not require much work whenever I change the C++ code.
Gabriel
@Gabriel: No, what I am suggesting is writing the whole routine in asm. It would be modified separately, like any other function.
Paul Nathan
A: 

If you get this to work then I would profile it to make sure that it really is faster for you. On modern CPUs there is very little you can do that is slower than modifying code that is already in the cpu cache, or worse, the cpu pipeline. The cpu basically has to throw out all the work that is in the pipeline and start again.

John Burton