I am trying to optimize a small, heavily used function which uses the bits set in an unsigned short int to indicate which values of an array to sum together. At first I was using the obvious approach shown below. Please note that loop unrolling is not explicitly shown, as it should be done by the compiler.

int total = 0;
for(unsigned short mask = 0x0001, j = 0; mask != 0; mask <<= 1, j++){
    if (i & mask){
        total += value[j];
    }
}

However, later I thought it might be better to remove the branching to help CPU pipelining and came up with the following.

int total = 0;
for(unsigned short mask = 0x0001, j = 0; mask != 0; mask <<= 1, j++){
    total += ((i & mask) != 0) * value[j];
}

Note that since (i & mask) does not yield a boolean result, the comparison with 0 forces the result to be either 1 or 0. Although this second approach eliminates the if-statement from this section of the code, it has to perform a multiplication by 0 or 1 on every iteration, in addition to the rest of the expression.

Which code will run faster?

+1  A: 

This depends totally on the compiler, machine instruction set and, probably, phase of the moon.

There is no specific correct answer because of this. If you really want to know, check the assembly output from the compiler.

From a simplistic point of view, I would say that the second is slower since it involves all the calculations of the first plus a multiply. But the compiler may well be smart enough to optimize that away.

So the correct answer is: it depends.

paxdiablo
+1. Also, unrolling the loop will almost certainly improve performance more than messing around with branch vs. multiply.
Zooba
Agreed, aside from the time I improved performance by rolling up a loop (that function took 80% of the run time, so I was desperate for optimization). The old conventional optimization wisdom is overdue for an overhaul.
David Thornley
+1  A: 

Although the second example doesn't have an explicit branch, there might be an implicit one to convert the result of the comparison to a bool. You might gain a little insight by turning on the assembly listing output for your compiler and looking at that.

Of course the only way to know for sure is to take some timings both ways.

Mark Ransom
Yes, I think you are right, there is an implicit branch. Thank you for pointing that out.
Nixuz
It depends on the architecture - on x86, the int-to-bool can be done branch-free with the two instructions 'cmp' and 'setne'.
Adam Rosenfield
+1  A: 

The answer must surely be: try it on the target hardware and see. And be sure to follow the advice of the multitude of micro-benchmark/stopwatch-benchmark questions posted here on SO over the last few weeks.

Link to one benchmarking question: http://stackoverflow.com/questions/410437/is-stopwatch-benchmarking-acceptable

Personally, I'd go with the if, unless there was a really compelling reason to use the "obfuscated" alternative.

Software Monkey
+12  A: 

Which code will run faster?

Test it to find out.

Also, look at the assembly the compiler emits for it; you might see things there that surprise you, and which hint at further optimizations (for example, using short as you are doing may need more instructions than using the machine's natural integer size).

ChrisW
+9  A: 

Either could be faster. For some processors, the actual input data may change the answer. You will need to profile both approaches with real data. Here are some things that can affect actual performance on x86 hardware.

Let's assume for the moment that you're using a late-model Pentium 4. That processor has two levels of branch predictors baked into the CPU. If the branch predictors can guess correctly the branch direction, I suspect that the first will be fastest. This is most likely to occur if the flags are nearly all the same value or if they alternate in a very simple pattern most of the time. If the flags are truly random, then the branch predictor will be wrong half the time. For our hypothetical 32-stage Pentium 4, this will kill performance. For Pentium 3 chips, Core 2 chips, Core i7, and most AMD chips, the pipelines are shorter, so the cost of bad branch prediction is much lower.

If your value vector is noticeably larger than the processor's cache, then either approach will be limited by memory bandwidth. They'll both have essentially identical performance characteristics. If the value vector fits comfortably in cache, be careful how you do any profiling so that one of the test loops isn't getting penalized for filling the cache and the other one benefits from it.

Mr Fooz
+4  A: 

You could make it branchless without a multiply. It looks like for each bit set you are using that bit position as an index into an array.

First, you can easily extract bits set with:

unsigned short set_mask = i & -i;  /* isolate the lowest set bit */
i &= i - 1;                        /* clear the lowest set bit */

Then, you can get the bit index by counting the bits set in (set_mask - 1). There's a constant time formula for this.

Some platforms also have an intrinsic to get the bit index of a bit set which is probably faster. x86 has bsr, PPC has cntlz.

So the answer is the branchless multiplyless version is probably fastest :)

MSN
Very interesting, but I wonder whether the "constant time formula" might not be worth it. Can you provide some more details about this formula?
Nixuz
http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetParallel
MSN
Thank you, this is a very elegant solution.
Nixuz
This solution implemented: unsigned int total = 0; while (i) { total += value[countBits((i & -i) - 1)]; i &= i - 1; }
Nixuz
GCC has __builtin_popcount(x) which returns the number of set bits in x.
ephemient
A: 

The only real way to determine the truth of a statement is to test it. With that in mind, I would concur with the previous posts that say: try it out!

On most modern processors branching is a costly operation, especially for branches that are infrequently taken. This is because a mispredicted branch forces the pipeline to be flushed, leaving the CPU unable to execute instructions concurrently until it knows where the next instruction will come from. With more than a few branches, the possible control flows become too complex for the CPU to pursue all possibilities simultaneously, so it must resolve the branch and only then resume issuing many instructions at once.

mP
+4  A: 

What about this revision?

int total = 0;
for (unsigned mask = i & 0xFFFF, j = 0; mask != 0; mask >>= 1, j++){
    total += (mask & 0x0001) * value[j];
}

I've made mask a copy of i limited to the 16-bit unsigned range, and the code checks whether the lowest bit of mask is set, multiplying the array value by that bit. This should be faster simply because there are fewer operations per iteration, and only the main loop's branches and conditions are needed. Also, the loop can exit early if i is small to start with.


This demonstrates why measurement is important. I'm using an antiquated Sun SPARC. I wrote a test program as shown, with the two contenders from the question as test 0 and test 1, and my own answer as test 2. And then ran timing tests. The 'sum' is printed as a sanity check - to ensure that the algorithms all give the same answer.

64-bit unoptimized:

gcc -m64 -std=c99 -I$HOME/inc -o x x.c -L$HOME/lib/sparcv9 -ljl -lposix4

Test 0: (sum = 1744366)  7.973411 us
Test 1: (sum = 1744366) 10.269095 us
Test 2: (sum = 1744366)  7.475852 us

Nice: mine is slightly faster than the original, and the souped up version is slower.

64-bit optimized:

gcc -O4 -m64 -std=c99 -I$HOME/inc -o x x.c -L$HOME/lib/sparcv9 -ljl -lposix4

Test 0: (sum = 1744366)  1.101703 us
Test 1: (sum = 1744366)  1.915972 us
Test 2: (sum = 1744366)  2.575318 us

Darn - my version is now dramatically the slowest. The optimizer is good!

32-bit optimized:

gcc -O4 -std=c99 -I$HOME/inc -o x x.c -L$HOME/lib -ljl -lposix4

Test 0: (sum = 1744366)  0.839278 us
Test 1: (sum = 1744366)  1.905009 us
Test 2: (sum = 1744366)  2.448998 us

32-bit unoptimized:

gcc -std=c99 -I$HOME/inc -o x x.c -L$HOME/lib -ljl -lposix4

Test 0: (sum = 1744366)  7.493672 us
Test 1: (sum = 1744366)  9.610240 us
Test 2: (sum = 1744366)  6.838929 us

Same code on (32-bit) Cygwin and a not so geriatric laptop (32-bit, optimized)

Test 0: (sum = 1744366)  0.557000 us
Test 1: (sum = 1744366)  0.553000 us
Test 2: (sum = 1744366)  0.403000 us

Now my code is the fastest. That's why you measure! It also shows why people who run benchmarks for a living get distraught.

Test harness (shout if you want the timer.h and timer.c code):

#include <stdio.h>
#include "timer.h"

static volatile int value[] =
{
    12, 36, 79, 21, 31, 93, 24, 15,
    56, 63, 20, 47, 62, 88,  9, 36,
};

static int test_1(int i)
{
    int total = 0;
    for (unsigned short mask = 0x0001, j = 0; mask != 0; mask <<= 1, j++)
    {
        if (i & mask)
            total += value[j];
    }
    return(total);
}

static int test_2(int i)
{
    int total = 0;
    for (unsigned short mask = 0x0001, j = 0; mask != 0; mask <<= 1, j++)
    {
        total += ((i & mask) != 0) * value[j];
    }
    return(total);
}

static int test_3(int i)
{
    int total = 0;
    for (unsigned mask = i & 0xFFFF, j = 0; mask != 0; mask >>= 1, j++)
    {
        total += (mask & 0x0001) * value[j];
    }
    return(total);
}

typedef int(*func_pointer)(int);

static func_pointer test[] = { test_1, test_2, test_3 };

#define DIM(x)(sizeof(x)/sizeof(*(x)))

int main()
{
    int i, j, k;
    char buffer[32];
    for (i = 0; i < DIM(test); i++)
    {
        Clock t;
        long sum = 0;
        clk_init(&t);
        clk_start(&t);
        for (j = 0; j < 0xFFFF; j += 13)
        {
            int rv;

            for (k = 0; k < 1000; k++)
                rv = (*test[i])(j);
            sum += rv;
        }
        clk_stop(&t);
        printf("Test %d: (sum = %ld) %9s us\n", i, sum,
               clk_elapsed_us(&t, buffer, sizeof(buffer)));
    }
}

I haven't spent time working out why my code is slower when optimized.

Jonathan Leffler
BTW, I'd use j <= 0xFFFF in the inner loop, instead of < (not that it matters). Also I had to change it to use clock.h. Thanks for hacking this up -- I was too lazy.
Darius Bacon
Er, clock() from time.h, that is.
Darius Bacon
Since I can't leave well enough alone, I tried out my 4-bits-at-a-time suggestion too, and that ran in 53% of the time of test_3().
Darius Bacon
+1  A: 

why not do this (assuming i is 32 bits)

  for (i2 = i; i2; i2 = i3) {
    i3 = i2 & (i2 - 1);      /* i2 with its lowest set bit cleared */
    last_bit = i2 - i3;      /* that lowest set bit, isolated */
    a = last_bit & 0xffff;   /* low half (zero unless the bit is in bits 0-15) */
    b = last_bit >> 16;      /* high half, shifted down into table range */
    j = place[a] + big_place[b];
    total += value[j];
  }

Where place is a table of size 2^15+1 such that place[0] = 0, place[1] = 1, place[2] = 2, place[4] = 3, place[8] = 4, ..., place[0x8000] = 16 (the rest of the values don't matter), and big_place is almost identical: big_place[0] = 0, big_place[1] = 17, ..., big_place[0x8000] = 32.

David Lehavi
+1  A: 

Try

total += (-((i & mask) != 0)) & value[j];

in place of

total += ((i & mask) != 0) * value[j];

This avoids the multiply. Whether there will be a branch or not is up to whether the compiler is clever enough to find branch-free code for -(foo != 0). (Which is possible, but I'd be a bit surprised.)

(Of course, this depends on two's-complement representation; the C standard is agnostic on that.)

You might help out the compiler like so, assuming 32-bit ints and that signed >> propagates the sign bit:

total += (((int)((i & mask) << (31 - j))) >> 31) & value[j];

That is, shift the possibly-set bit left to the most-significant position, cast as signed int, then right all the way back to the least-significant position, yielding either all 0's or all 1's, under the above implementation-defined assumptions. (I haven't tested this.)

Another possibility: consider blocks of (say) 4 bits at a time. There are 16 different addition sequences; you can dispatch to unrolled code for each of them, with no tests at all within each code block. The hope here is for one indirect jump to cost less than 4 test-and-branches.

Update: Using Jonathan Leffler's scaffolding, the 4-bits-at-a-time method is fastest by a wide margin on my MacBook. Negate-and turns out to be about the same as multiply. I wonder if the processor multiplies special cases like 0 and 1 faster (or not such a special case if it's faster for most-bits-clear or most-bits-set multiplicands in general).

I didn't code up the accepted answer since it's unlikely to be fastest on this particular benchmark (it should get most of its benefit from enumerating only the set bits, doing best on sparse sets, but fully half of the bits are set in this benchmark). Here are my changes to Leffler's code, in case anyone else is strangely motivated to spend time on this:

#include <stdio.h>
#include <time.h>

static int value[] =
{
    12, 36, 79, 21, 31, 93, 24, 15,
    56, 63, 20, 47, 62, 88,  9, 36,
};

static int test_1(int i)
{
    int total = 0;
    for (unsigned short mask = 0x0001, j = 0; mask != 0; mask <<= 1, j++)
    {
        if (i & mask)
            total += value[j];
    }
    return(total);
}

static int test_2(int i)
{
    int total = 0;
    for (unsigned short mask = 0x0001, j = 0; mask != 0; mask <<= 1, j++)
    {
        total += ((i & mask) != 0) * value[j];
    }
    return(total);
}

static int test_3(int i)
{
    int total = 0;
    for (unsigned mask = i & 0xFFFF, j = 0; mask != 0; mask >>= 1, j++)
    {
        total += (mask & 0x0001) * value[j];
    }
    return(total);
}

static int test_4(int i)
{
    int total = 0;
    for (unsigned mask = i & 0xFFFF, j = 0; mask != 0; mask >>= 1, j++)
    {
        total += -(mask & 0x0001) & value[j];
    }
    return(total);
}

static int test_5(int i)
{
    int total = 0;
    const int *p = value;
    for (unsigned mask = i & 0xFFFF; mask != 0; mask >>= 4, p += 4)
    {
        switch (mask & 0xF)
        {
        case 0x0: break;
        case 0x1: total += p[0]; break;
        case 0x2: total += p[1]; break;
        case 0x3: total += p[1] + p[0]; break;
        case 0x4: total += p[2]; break;
        case 0x5: total += p[2] + p[0]; break;
        case 0x6: total += p[2] + p[1]; break;
        case 0x7: total += p[2] + p[1] + p[0]; break;
        case 0x8: total += p[3]; break;
        case 0x9: total += p[3] + p[0]; break;
        case 0xA: total += p[3] + p[1]; break;
        case 0xB: total += p[3] + p[1] + p[0]; break;
        case 0xC: total += p[3] + p[2]; break;
        case 0xD: total += p[3] + p[2] + p[0]; break;
        case 0xE: total += p[3] + p[2] + p[1]; break;
        case 0xF: total += p[3] + p[2] + p[1] + p[0]; break;
        }
    }
    return(total);
}

typedef int(*func_pointer)(int);

static func_pointer test[] = { test_1, test_2, test_3, test_4, test_5 };

#define DIM(x)(sizeof(x)/sizeof(*(x)))

int main()
{
    int i, j, k;
    for (i = 0; i < DIM(test); i++)
    {
        long sum = 0;
        clock_t start = clock();
        for (j = 0; j <= 0xFFFF; j += 13)
        {
            int rv;

            for (k = 0; k < 1000; k++)
                rv = (*test[i])(j);
            sum += rv;
        }
        clock_t stop = clock();
        printf("(sum = %ld) Test %d: %8.6f s\n", sum, i + 1, 
               (stop - start) / (1.0 * CLOCKS_PER_SEC));
    }
}

Results (gcc -O4 -std=c99 branchmult2.c):

(sum = 1744366) Test 1: 0.225497 s
(sum = 1744366) Test 2: 0.221127 s
(sum = 1744366) Test 3: 0.126301 s
(sum = 1744366) Test 4: 0.124750 s
(sum = 1744366) Test 5: 0.064877 s

Edit 2: I decided the test would be more realistic without the volatile qualifier.

Darius Bacon
+1  A: 

To be uberfast you can avoid the loop, the shifts, and the multiplications altogether - use a switch.

switch (i) {
    case 0: break;
    case 1: total = value[0]; break;
    case 2: total = value[1]; break;
    case 3: total = value[1] + value[0]; break;
    case 4: total = value[2]; break;
    case 5: total = value[2] + value[0]; break;
    ...
}

It is a lot to type, but I guess it will be much faster at run-time. You cannot beat the performance of a lookup table!

I'd rather write a small Perl script that will generate this code for me - just to avoid typing errors.

If you think it is a bit extreme you can use smaller table - for 4 bits, and do a lookup several times, shifting the mask each time. Performance will suffer a bit, but code will be much smaller.

qrdl
Until the switch statement gets too large for a code cache line, and performance suffers.
David Thornley
In this case you can use a smaller lookup table (as I mentioned) and look it up several times.
qrdl
And the code may be faster, but nearby code slower because this version takes up more cache. :-)
Darron