This is basically doing what strtoull does. If you don't have the digits handy as an ASCII string to feed to strtoull, then I guess you have to write your own implementation. As people point out, branching is what causes the performance hit, so your function is probably best written this way:
#include <tr1/cstdint>  // just <cstdint> in C++11 and later

uint64_t base3_digits_to_num(uint8_t digits[30])
{
    uint64_t running_sum = 0;
    uint64_t pow3 = 1;  // 3^i, updated each iteration; digits[0] is least significant
    for (int i = 0; i < 30; ++i) {
        running_sum += digits[i] * pow3;
        pow3 *= 3;
    }
    return running_sum;
}
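If the digits do happen to be available as an ASCII string, the strtoull route mentioned above is a one-liner. A sketch (note the digit order: strtoull reads the most significant digit first, whereas the array version above treats digits[0] as the least significant):

```cpp
#include <cstdlib>   // std::strtoull
#include <cstdint>

// Convert a base-3 ASCII string (most significant digit first) to a number.
// strtoull handles any base from 2 to 36; we just pass base 3.
uint64_t base3_string_to_num(const char *s)
{
    return std::strtoull(s, nullptr, 3);
}
```

For example, base3_string_to_num("1020") is 1*27 + 0*9 + 2*3 + 0 = 33.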
It's not clear to me that precomputing your powers of 3 is going to result in a significant speed advantage. You might try it and measure. The one advantage a lookup table might give you is that a smart compiler could possibly unroll the loop into SIMD instructions. But a really smart compiler should then be able to do that anyway and generate the lookup table for you.
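For comparison, the lookup-table version under discussion would look something like this (a sketch; the table name and layout are my own, not from the question):

```cpp
#include <cstdint>

// Precomputed powers of 3: pow3_table[i] == 3^i, for i = 0..29.
static const uint64_t pow3_table[30] = {
    1ULL, 3ULL, 9ULL, 27ULL, 81ULL, 243ULL, 729ULL, 2187ULL,
    6561ULL, 19683ULL, 59049ULL, 177147ULL, 531441ULL, 1594323ULL,
    4782969ULL, 14348907ULL, 43046721ULL, 129140163ULL, 387420489ULL,
    1162261467ULL, 3486784401ULL, 10460353203ULL, 31381059609ULL,
    94143178827ULL, 282429536481ULL, 847288609443ULL, 2541865828329ULL,
    7625597484987ULL, 22876792454961ULL, 68630377364883ULL
};

// Same conversion as base3_digits_to_num, but reading 3^i from the
// table instead of maintaining a running pow3 variable.
uint64_t base3_digits_to_num_table(const uint8_t digits[30])
{
    uint64_t running_sum = 0;
    for (int i = 0; i < 30; ++i)
        running_sum += digits[i] * pow3_table[i];
    return running_sum;
}
```

It trades one multiply per iteration for a memory load, which is exactly the trade-off examined in the assembly discussion below.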
Avoiding floating point is also not necessarily a speed win. Floating point and integer operations are about the same on most processors produced in the last 5 years.
Checking to see if digits[i] is 0, 1 or 2 and executing different code for each of these cases is definitely a speed loss on any processor produced in the last 10 years. The Pentium 3/Pentium 4/Athlon Thunderbird days are when branches started to really become a huge hit, and the Pentium 3 is at least 10 years old now.
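For concreteness, the kind of branch-per-digit code being warned against might look like this (my reconstruction, not code from the question):

```cpp
#include <cstdint>

// Functionally identical to base3_digits_to_num, but the switch forces
// a data-dependent (and thus hard to predict) branch on every digit.
uint64_t base3_digits_branchy(const uint8_t digits[30])
{
    uint64_t running_sum = 0;
    uint64_t pow3 = 1;
    for (int i = 0; i < 30; ++i) {
        switch (digits[i]) {
        case 0: break;                           // add nothing
        case 1: running_sum += pow3; break;      // add 3^i
        case 2: running_sum += 2 * pow3; break;  // add 2 * 3^i
        }
        pow3 *= 3;
    }
    return running_sum;
}
```

The plain multiply-accumulate version computes the same result with no branch on the digit value at all.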
Lastly, you might think this will be the bottleneck in your code. You're probably wrong. The right implementation is the one that is the simplest and clearest to anybody coming along and reading your code. Then, if you want the best performance, run your code through a profiler and find out where to concentrate your optimization efforts. Agonizing this much over a little function when you don't even know that it's a bottleneck is silly.
And almost nobody here recognized that you were basically doing a base-3 conversion. So even your current primitive hand-unrolled loop obscured your code enough that most people didn't understand it.
Edit: In fact, I looked at the assembly output. On an x86_64 platform the lookup table buys you nothing and may in fact be counter-productive because of its effect on the cache. The compiler generates leaq (%rdx,%rdx,2), %rdx to multiply by 3. Fetching from a table would be something like movq (%rdx,%rcx,8), %rax, which is basically the same speed aside from requiring a fetch from memory (which might be very expensive on a cache miss). So it's almost certain that my code compiled with the gcc option -funroll-loops is significantly faster than your attempt to optimize by hand.
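That leaq is ordinary strength reduction: the compiler rewrites the multiply so the addressing hardware computes it in a single instruction. In C terms it is doing something like this (a sketch):

```cpp
#include <cstdint>

// x * 3 expressed as x + x*2, which the compiler emits as
// leaq (%rdx,%rdx,2), %rdx: base register plus the same register
// scaled by 2, computed by the address-generation unit.
uint64_t times3(uint64_t x)
{
    return x + x * 2;
}
```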
The lesson here is that the compiler does a much, much better job of optimization than you can. Just make your code as clear and readable to others as possible and let the compiler do the work. And making it clear to others has the additional advantage of making it easier for the compiler to do its job.