377 views, 7 answers
Hi! I have some code that is going to be run thousands of times, and I was wondering which version is faster. array is a 30-element short array that always holds 0, 1 or 2.

result = (array[29] * 68630377364883.0)
       + (array[28] * 22876792454961.0)
       + (array[27] * 7625597484987.0)
       + (array[26] * 2541865828329.0)
       + (array[25] * 847288609443.0)
       + (array[24] * 282429536481.0)
       + (array[23] * 94143178827.0)
       + (array[22] * 31381059609.0)
       + (array[21] * 10460353203.0)
       + (array[20] * 3486784401.0)
       + (array[19] * 1162261467)
       + (array[18] * 387420489)
       + (array[17] * 129140163)
       + (array[16] * 43046721)
       + (array[15] * 14348907)
       + (array[14] * 4782969)
       + (array[13] * 1594323)
       + (array[12] * 531441)
       + (array[11] * 177147)
       + (array[10] * 59049)
       + (array[9]  * 19683)
       + (array[8]  * 6561)
       + (array[7]  * 2187)
       + (array[6]  * 729)
       + (array[5]  * 243)
       + (array[4]  * 81)
       + (array[3]  * 27)
       + (array[2]  * 9)
       + (array[1]  * 3)
       + (array[0]);

Would it be faster if I use something like:

if(array[29] != 0)
{
    if(array[29] == 1)
    {
        result += 68630377364883.0;
    }
    else
    {
        result += 137260754729766.0; /* 68630377364883.0 * 2 */
    }
}

for each of them. Would this be faster/slower? If so, by how much?

+5  A: 

Even if addition is faster than multiplication, I think you will lose more to the branching. In any case, if addition really is faster than multiplication, a better solution might be to build a table and index into it:

const double table[3] = {0.0, 68630377364883.0, 68630377364883.0 * 2.0};
result += table[array[29]];
jbernadas
Not a very good idea in my opinion; a table lookup prevents SSE vectorization.
aaa
+12  A: 

That is a ridiculously premature "optimization". Chances are you'll be hurting performance because you are adding branches to the code. Mispredicted branches are very costly. And it also renders the code harder to read.

Multiplication in modern processors is a lot faster than it used to be; it can be done in a few clock cycles now.

Here's a suggestion to improve readability:

for (int i = 1; i < 30; i++) {
    result += array[i] * pow(3, i);
}
result += array[0];

You can pre-compute an array with the values of pow(3, i) if you are really that worried about performance.
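
A minimal sketch of that precomputation (the names init_pow3 and encode are mine; I'm assuming 64-bit integer math, which is safe here since 3^30 - 1 fits in 48 bits):

#include <stdint.h>

// Sketch: build 3^0 .. 3^29 once; 3^29 = 68630377364883 fits easily in a uint64_t.
static uint64_t pow3[30];

static void init_pow3(void)
{
    pow3[0] = 1;
    for (int i = 1; i < 30; i++)
        pow3[i] = pow3[i - 1] * 3;
}

// After init_pow3() has run once, the hot path is one multiply-add per digit, no pow() calls:
uint64_t encode(const short array[30])
{
    uint64_t result = 0;
    for (int i = 0; i < 30; i++)
        result += (uint64_t)array[i] * pow3[i];
    return result;
}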

NullUserException
On the readability: I'd rather replace all the multiplier constants with `pow(3, N)` in the original code and let the compiler optimise it (-O1 or higher, assuming GCC).
jweyrich
You misspelled *"pessimization"* ;)
peterchen
Doesn't relying on particular optimizations reduce portability? I'd go with the Horner scheme, as presented by ysap.
peterchen
@peterchen Huh?
NullUserException
@NullUserEx - Horner's method is an efficient way to evaluate polynomials: `a*x^3 + b*x^2 + c*x + d = (((a*x) + b)*x + c)*x + d`
ysap
@ysap I meant to ask what optimizations that reduce portability I included in my code.
NullUserException
@NullUserException: It was for @jweyrich's suggestion, relying on pow(3,i) being folded to a constant by the optimizer. I've worked with many compilers that wouldn't, and that would make a big hit on this function.
peterchen
Explain downvote.
NullUserException
A: 
  1. If you're not sure - why don't you just measure it yourself?
  2. The second example will most likely be much slower, but not because of the addition - mispredicted conditional jumps cost a lot of time.
  3. If you only have 3 possible digit values, the cheapest way might be a static 2D array of precomputed products, e.g. static const int64_t vals[29][3] = {{0, 1*3, 2*3}, {0, 1*9, 2*9}, ...}, and just sum vals[0][array[1]] + vals[1][array[2]] + ... (see the sketch after this list).
  4. Some SIMD instructions might be faster than anything you can write on your own - look at those. Then again, if you're doing a lot of this, handing it off to the GPU might be even faster, depending on your other calculations.
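
A sketch of the 2D-table idea from point 3 (the table layout and names are mine, assuming 64-bit sums):

#include <stdint.h>

// vals[i][d] holds d * 3^(i+1), so each digit costs one load and one add at run time.
static uint64_t vals[29][3];

static void init_vals(void)
{
    uint64_t p = 3;
    for (int i = 0; i < 29; i++, p *= 3)
        for (int d = 0; d < 3; d++)
            vals[i][d] = (uint64_t)d * p;
}

uint64_t encode(const short array[30])
{
    uint64_t result = array[0];
    for (int i = 0; i < 29; i++)
        result += vals[i][array[i + 1]];
    return result;
}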
viraptor
+3  A: 

My first attempt at optimisation would be to remove the floating-point ops in favour of integer arithmetic:

uint64_t total = array[0];
uint64_t x = 3;
for (int i = 1; i < 30; ++i, x *= 3) {
    total += array[i] * x; // x holds 3^i on iteration i
}

uint64_t is not standard C++, but it is very widely available. You just need a version of C99's <stdint.h> for your platform.
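
If your toolchain genuinely lacks <stdint.h>, a common stopgap looks like this (a sketch; check your compiler's actual 64-bit type):

#if defined(_MSC_VER)
typedef unsigned __int64 uint64_t;   // MSVC's spelling of a 64-bit unsigned
#else
typedef unsigned long long uint64_t; // most other compilers
#endif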

There's also optimising for comprehensibility and maintainability - was this code a loop at one point, and did you measure the performance difference when you replaced the loop? Fully unrolling like this might even make the program slower (as well as less readable), since the code is larger and hence occupies more of the instruction cache, and hence results in cache misses elsewhere. You just don't know.

This assumes, of course, that your constants actually are powers of 3 - I haven't bothered checking, which is precisely what I consider to be the readability issue with your code...
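
(If you did want to check, a throwaway assertion would do it; this sketch and its function name are mine:)

#include <assert.h>
#include <stdint.h>

// One-off sanity check that the question's largest constants are powers of 3.
void check_constants(void)
{
    uint64_t p = 1;
    for (int i = 0; i < 29; i++)
        p *= 3;
    assert(p == 68630377364883ULL);      // 3^29, the array[29] multiplier
    assert(p / 3 == 22876792454961ULL);  // 3^28, the array[28] multiplier
}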

Steve Jessop
I tend to disagree, I have been in situations where using 64-bit integers is slower than using doubles (even in a 64-bit OS with a modern 64-bit processor).
jbernadas
That's why I say "attempt" - I'd measure and compare speeds. Surely to disagree with me, you'd have to think it's not even worth trying integer arithmetic ;-)
Steve Jessop
+1 For suggesting not using floats and for the accumulator in `x`
NullUserException
Thanks, although ysap's single accumulator is more cunning.
Steve Jessop
If this is a 32-bit processor, then converting from fixed point to float, doing the math in float, and converting back to fixed (which is what the compiler is going to do if you don't cast) may very well be faster than the integer version, assuming you have an FPU; most processors don't (sure, most desktop ones do, but of all the ISAs out there, most don't).
dwelch
For a processor with 64-bit integers, or at least 64-bit math, fixed point should be noticeably faster: you save the two conversions (int to float and float to int) on top of the multiply, which may be the same speed either way.
dwelch
If your CPU is 32-bit and its multiply is 32bit*32bit=32bit, you have that much more of a problem, as the compiler (or its runtime library) has to add code to guard against overflow and to synthesize even the 32-bit multiplies, as well as the 64-bit ones. Averaged across the possible 32-bit and 64-bit processors out there, though, the integer version is going to be faster. If you are limited to a specific platform, just time it.
dwelch
+6  A: 

First, on most architectures a mispredicted branch is very costly (how costly depends on the depth of the execution pipeline), so I bet the non-branching version is better.

A variation on the code may be:

result = array[29];
for (int i = 28; i >= 0; i--)
    result = result * 3 + array[i];

Just make sure there are no overflows: result must be held in a type wider than a 32-bit integer.
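
Fleshed out with explicit types, that becomes (a sketch; the worst case, all digits equal to 2, is 3^30 - 1 = 205891132094648, which needs only 48 bits, so uint64_t is ample):

#include <stdint.h>

uint64_t encode(const short array[30])
{
    uint64_t result = array[29];
    for (int i = 28; i >= 0; i--)
        result = result * 3 + array[i]; // Horner step: one multiply, one add per digit
    return result;
}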

ysap
And result * 3 can be turned into (result << 1) + result, which is probably faster. The compiler will likely optimize it to something like that anyway.
siride
@siride - this is really architecture-dependent. I am not sure about x86, but many DSPs have native MULT and/or MAC (multiply-accumulate) instructions that execute in 1 cycle (assuming no register dependency).
ysap
x86 is probably fast enough now. But there used to be a trick where you could use the lea instruction to do math of the form x + A*y + B, where A can be 1, 2, 4 or 8 and B can be any 32-bit offset. The key is that at least older x86 implementations had special hardware for effective-address calculations, so this would skip the normal ALU path. Not sure how that works now, though.
siride
A: 

Multiply, because branching is awfully slow.

Dani
+3  A: 

This is basically what strtoull does. If you don't have the digits handy as an ASCII string to feed to strtoull, then I guess you have to write your own implementation. As people have pointed out, branching is what causes the performance hit, so your function is probably best written this way:

#include <tr1/cstdint>
uint64_t base3_digits_to_num(uint8_t digits[30])
{
    uint64_t running_sum = 0;
    uint64_t pow3 = 1; // 3^i for the current digit, maintained incrementally
    for (int i = 0; i < 30; ++i) {
        running_sum += digits[i] * pow3;
        pow3 *= 3;
    }
    return running_sum;
}
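
For comparison, if the digits ever do arrive as text (most significant digit first), the library does the whole conversion in one call; a hypothetical usage sketch:

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    // strtoull parses the most significant digit first;
    // "210" in base 3 is 2*9 + 1*3 + 0 = 21.
    unsigned long long v = strtoull("210", NULL, 3);
    printf("%llu\n", v); // prints 21
    return 0;
}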

It's not clear to me that precomputing your powers of 3 is going to result in a significant speed advantage. You might try it and test for yourself. The one advantage a lookup table might give you is that a smart compiler could possibly unroll the loop into SIMD instructions. But a really smart compiler should be able to do that anyway, and generate the lookup table for you.

Avoiding floating point is also not necessarily a speed win. Floating point and integer operations are about the same on most processors produced in the last 5 years.

Checking whether digits[i] is 0, 1 or 2 and executing different code for each case is a definite speed loss on any processor produced in the last 10 years. The Pentium 3/Pentium 4/Athlon Thunderbird days are when branch mispredictions started to become a really huge hit, and the Pentium 3 is at least 10 years old now.

Lastly, you might think this will be the bottleneck in your code. You're probably wrong. The right implementation is the one that is the simplest and most clear to anybody coming along reading your code. Then, if you want the best performance, run your code through a profiler and find out where to concentrate your optimization efforts. Agonizing this much over a little function when you don't even know that it's a bottleneck is silly.
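
If you do measure, even a crude harness settles the argument faster than speculation can (a sketch; it assumes the base3_digits_to_num above, and clock()'s coarse resolution is why it runs millions of iterations):

#include <stdio.h>
#include <time.h>
#include <stdint.h>

uint64_t base3_digits_to_num(uint8_t digits[30]); // as defined above

int main(void)
{
    uint8_t digits[30] = {2, 1, 0, 2, 1}; // remaining digits are zero
    volatile uint64_t sink = 0;           // keeps the optimizer from deleting the loop
    clock_t t0 = clock();
    for (long n = 0; n < 10000000L; n++)
        sink += base3_digits_to_num(digits);
    clock_t t1 = clock();
    printf("%.3f s for 10M calls\n", (double)(t1 - t0) / CLOCKS_PER_SEC);
    return 0;
}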

And almost nobody here recognized that you are basically doing a base-3 conversion. Even your current primitive hand-unrolling obscured the code enough that most people didn't understand it.

Edit: In fact, I looked at the assembly output. On an x86_64 platform the lookup table buys you nothing, and may in fact be counter-productive because of its effect on the cache. The compiler generates leaq (%rdx,%rdx,2), %rdx in order to multiply by 3. Fetching from a table would be something like movq (%rdx,%rcx,8), %rax, which is basically the same speed apart from requiring a fetch from memory (which might be very expensive). So it's almost certain that my code with the gcc option -funroll-loops is significantly faster than your attempt to optimize by hand.

The lesson here is that the compiler does a much, much better job of optimization than you can. Just make your code as clear and readable to others as possible and let the compiler do the work. And making it clear to others has the additional advantage of making it easier for the compiler to do its job.

Omnifarious
Mixed precision, though: having to convert from fixed point to float to do the float multiply, then convert back to store the result, turns one operation into three, even if the fixed and float operations each take a single instruction period (the same amount of time). Avoiding float is a good thing here, even within the last 5 years.
dwelch