I'd like a shortcut for the following little function, where performance is very important (the function is called more than 10,000,000 times):

inline int len(uint32 val)
{
    if(val <= 0x000000ff) return 1;
    if(val <= 0x0000ffff) return 2;
    if(val <= 0x00ffffff) return 3;
    return 4;
} 

Does anyone have an idea... a cool bit-operation trick? Thanks in advance for your help!

+5  A: 

If bit ops are faster than comparisons on your target machine, you can do this:

inline int len(uint32 val)
{
    if(val & 0xff000000) return 4;
    if(val & 0x00ff0000) return 3;
    if(val & 0x0000ff00) return 2;
    return 1;
} 
Steve Townsend
Bit ops and compares cost the same on most CPUs (a compare is nothing more than a subtraction), and the number of branches is the same. That said, sorting the tests by likelihood is a good approach.
tristopia
They probably come out the same, but addition and subtraction do involve carry/borrow propagation in the simple digital logic implementation, while each bit's result for bitwise operations is totally independent, so it very well could be faster to use bitwise and.
nategoose
+35  A: 

How about this one?

inline int len(uint32 val)
{
    return 4
        - ((val & 0xff000000) == 0)
        - ((val & 0xffff0000) == 0)
        - ((val & 0xffffff00) == 0)
    ;
}

With the inline keyword removed, g++ -O2 compiles this to the following branchless code:

movl    8(%ebp), %edx
movl    %edx, %eax
andl    $-16777216, %eax
cmpl    $1, %eax
sbbl    %eax, %eax
addl    $4, %eax
xorl    %ecx, %ecx
testl   $-65536, %edx
sete    %cl
subl    %ecx, %eax
andl    $-256, %edx
sete    %dl
movzbl  %dl, %edx
subl    %edx, %eax

If you don't mind machine-specific solutions, you can use the bsr instruction, which finds the index of the highest set bit. Then you simply divide by 8 to convert bits to bytes and add 1 to shift the range 0..3 to 1..4:

int len(uint32 val)
{
    asm("mov 8(%ebp), %eax");
    asm("or  $255, %eax");
    asm("bsr %eax, %eax");
    asm("shr $3, %eax");
    asm("inc %eax");
    asm("mov %eax, 8(%ebp)");
    return val;
}

Note that I am not an inline assembly god, so maybe there's a better solution for accessing val than addressing the stack explicitly. But you should get the basic idea.
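For what it's worth, here is a sketch using GCC's extended asm syntax with register constraints (this assumes a GCC-style toolchain and an x86 target; it is an illustration, not part of the original answer). The constraints let the compiler pick the registers instead of addressing the stack:

inline int len(uint32 val)
{
    int bit;
    // "=r"/"r" constraints let the compiler choose the registers;
    // OR-ing in 255 guarantees at least bit 7 is set, so bsr never sees a zero input.
    asm("bsr %1, %0" : "=r"(bit) : "r"(val | 255));
    return (bit >> 3) + 1;
}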

The GNU compiler also has an interesting built-in function called __builtin_clz:

inline int len(uint32 val)
{
    return ((__builtin_clz(val | 255) ^ 31) >> 3) + 1;
}

This looks much better than the inline assembly version to me :)

FredOverflow
+1: I was writing a version with / and +, but the idea is the same
stefaanv
I'm sorry, I'm not a pro in C/C++. Why should this be faster?
youllknow
@stefaanv At 40 cycles for an integer division, a version with `/` is definitely not the same.
Pascal Cuoq
@youllknow: It removes all comparisons, linearizing the code. This helps out the instruction pipeline by removing any possible branch mispredictions.
Clark Gaebel
While all the branches may be removed by this code (may, as it's far from sure), you introduce dependencies between the tests, and in that case you will always execute all 3 ANDs.
tristopia
This will always perform the same number of ops irrespective of the input. Cascading the comparisons and returning early should result in half as many ops on average, assuming an even distribution of inputs to the function.
Steve Townsend
`BSR` is fast only on Core2 (2cycle latency 1c throughput) on i7 it's 3c/1c, on K7/K8 it's 9c latency 1per/c throughput. On P4 it's 16c latency, throughput ?.
tristopia
See also sylvanaar's answer for discussion and benchmarks.
Brian
@Pascal: most of the compilers I know, optimize divisions by powers of 2 to shifts, so I wouldn't worry about that.
stefaanv
@stef: Note that `i/8` and `i>>3` yield different results given negative inputs, so the compiler *cannot* optimize the former to the latter if `i` is a signed type! For example, `(-1)/8 = 0` but `(-1)>>3 = -1`. That's why I explicitly wrote `>>3` instead of `/8`. I know the result of `__builtin_clz` is always positive, but the compiler does not know that.
FredOverflow
+3  A: 

You can avoid the conditional branches, which can be costly if the distribution of your numbers does not make prediction easy:

return 4 - (val <= 0x000000ff) - (val <= 0x0000ffff) - (val <= 0x00ffffff);

Changing the <= to a & will not change anything much on a modern processor. What is your target platform?

Here is the generated code for x86-64 with gcc -O:

    cmpl    $255, %edi
    setg    %al
    movzbl  %al, %eax
    addl    $3, %eax
    cmpl    $65535, %edi
    setle   %dl
    movzbl  %dl, %edx
    subl    %edx, %eax
    cmpl    $16777215, %edi
    setle   %dl
    movzbl  %dl, %edx
    subl    %edx, %eax

There are comparison instructions cmpl of course, but these are followed by setg or setle instead of conditional branches (as would be usual). It's the conditional branch that is expensive on a modern pipelined processor, not the comparison. So this version saves the expensive conditional branches.

My attempt at hand-optimizing gcc's assembly:

    cmpl    $255, %edi
    setg    %al
    addb    $3, %al
    cmpl    $65535, %edi
    setle   %dl
    subb    %dl, %al
    cmpl    $16777215, %edi
    setle   %dl
    subb    %dl, %al
    movzbl  %al, %eax
Pascal Cuoq
Won't the `<=` still cause a branching instruction to be needed?
Mark B
any modern intel processor (core2duo, i7...)
youllknow
@Mark B Not for most architectures; for instance my compiler uses the `setg` and `setle` instructions to transfer the comparison flag to a register on x86-64 without branching.
Pascal Cuoq
A branch is expensive if it is mispredicted, else it's really cheap. As for the set instruction, I just looked up its latency/throughput, it's 1 cycle latency on all modern chips (P4 stinks with 5 cycles) and a throughput of 1 per cycle (except K7/K8 who can do 2 or 3 per cycle)
tristopia
You should be careful with partial register writes, they can be quite costly on Intel processors, if I remember correctly. http://www.intel.com/software/products/documentation/vlin/mergedprojects/analyzer_ec/mergedProjects/reference_olh/pentium4_hh/lips/lipspro_partial_stall.htm
tristopia
@tristopia What is costly is to attempt to read a full register less than a couple dozen cycles after writing a partial register. That still leaves some creative uses, but this is not one of them: I am never reading a full register made of partial writes. I kept the `movzbl` at the end expressly to avoid this.
Pascal Cuoq
+9  A: 

Binary search MIGHT save a few cycles, depending on the processor architecture.

inline int len(uint32 val)
{
    if (val & 0xffff0000) return (val & 0xff000000)? 4: 3;
    return (val & 0x0000ff00)? 2: 1;
}

Or, finding out which is the most common case might bring down the average number of cycles, if most inputs are one byte (eg when building UTF-8 encodings, but then your break points wouldn't be 32/24/16/8):

inline int len(uint32 val)
{
    if (val & 0xffffff00) {
       if (val & 0xffff0000) {
           if (val & 0xff000000) return 4;
           return 3;
       }
       return 2;
    }
    return 1;
}

Now the short case does the fewest conditional tests.
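As a hypothetical aside on the UTF-8 remark above (not part of the original answer): with code-point break points instead of byte boundaries, the same "common case first" shape would look roughly like this, assuming valid code points up to U+10FFFF.

inline int utf8_encoded_len(uint32 cp)
{
    // Break points follow the UTF-8 encoding ranges, with the
    // one-byte ASCII case (by assumption the most frequent) tested first.
    if (cp < 0x80)    return 1;
    if (cp < 0x800)   return 2;
    if (cp < 0x10000) return 3;
    return 4;
}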

Ben Voigt
+1 -- I was just starting to write about using a binary search.
Jerry Coffin
+1 -- because the benchmark showed a factor 2 improvement :)
Matthieu M.
+1  A: 

OK, one more version. Similar to Fred's, but with fewer operations.

inline int len(uint32 val)
{
    return 1
        + (val > 0x000000ff)
        + (val > 0x0000ffff)
        + (val > 0x00ffffff)
    ;
}
Maciej Hehl
+12  A: 

Do you really have profiling evidence that this is a significant bottleneck in your application? Just do it the most obvious way, and only if profiling shows it to be a problem (which I doubt) should you try to improve things. Most likely you'll get the best improvement by reducing the number of calls to this function rather than by changing something within it.

Mark B
+1  A: 

This gives you fewer comparisons, but it may be less efficient if a memory access costs more than a couple of comparisons.

int precalc[1<<16];
int precalchigh[1<<16];
void doprecalc()
{
    for(int i = 0; i < 1<<16; i++) {
        precalc[i] = (i < (1<<8) ? 1 : 2);
        precalchigh[i] = precalc[i] + 2;
    }
}
inline int len(uint32 val)
{
    return (val & 0xffff0000 ? precalchigh[val >> 16] : precalc[val]);
}
Leonid
+23  A: 

I did a mini unscientific benchmark, just measuring the difference between GetTickCount() calls when calling the function in a loop from 0 to ULONG_MAX under the VS 2010 compiler.

Here's what I saw:

This took 11497 ticks

inline int len(uint32 val)
{
    if(val <= 0x000000ff) return 1;
    if(val <= 0x0000ffff) return 2;
    if(val <= 0x00ffffff) return 3;
    return 4;
} 

While this took 14399 ticks

inline int len(uint32 val)
{
    return 4
        - ((val & 0xff000000) == 0)
        - ((val & 0xffff0000) == 0)
        - ((val & 0xffffff00) == 0)
    ;
}

edit: my idea about why one was faster is wrong because:

inline int len(uint32 val)
{
    return 1
        + (val > 0x000000ff)
        + (val > 0x0000ffff)
        + (val > 0x00ffffff)
        ;
}

This version used only 11107 ticks. Because + is faster than -, perhaps? I'm not sure.

Even faster though was the binary search at 7161 ticks

inline int len(uint32 val)
{
    if (val & 0xffff0000) return (val & 0xff000000)? 4: 3;
    return (val & 0x0000ff00)? 2: 1;
}

And fastest so far is using the MS intrinsic function, at 4399 ticks

#pragma intrinsic(_BitScanReverse)

inline int len2(uint32 val)
{
    DWORD index;
    _BitScanReverse(&index, val);

    return (index>>3)+1;

}

For reference - here's the code I used to profile:

int _tmain(int argc, _TCHAR* argv[])
{
    int j = 0;
    DWORD t1,t2;

    t1 = GetTickCount();

    for(ULONG i=0; i<-1; i++)
        j=len(i);

    t2 = GetTickCount();

    _tprintf(_T("%ld ticks %ld\n"), t2-t1, j);


    t1 = GetTickCount();

    for(ULONG i=0; i<-1; i++)
        j=len2(i);

    t2 = GetTickCount();

    _tprintf(_T("%ld ticks %ld\n"), t2-t1, j);
}

Had to print j to prevent the loops from being optimized out.

sylvanaar
What optimization settings did you use? The compiler should still have eliminated the loops. Use `j += len(i);` instead of `j = len(i);` to prevent the compiler from replacing it with `j = len(~0UL);`
Ben Voigt
+1 for measuring. Why didn't anybody else do this? I don't care for any assembly tricks if nobody has proved them useful.
Markus Kull
@ben Changing to j+= made all the implementations execute FASTER - I don't fully understand why.
sylvanaar
If you made 'j' volatile, the loops wouldn't be optimized out.
Robert
Depending on the "shape" of the actual data, your test will likely be lopsided in favor of versions that have branches. Because your test yields results in "chunks" (all length=1 first, then length=2 etc) the branch predictor in the CPU will be correct most of the time. This is fair if the data to be measured is largely predictable, but if the input data is randomized, the branch-based solutions will degrade whereas the non-branching solutions will have consistent performance.
star
Even with `volatile` though, the compiler could do `ULONG i=0; while(i++ <= 0xff) j = 1; while(i++ <= 0xff00) j = 2; while(i++ <= 0xff0000) j = 3; while(i++ < 0xffffffff) j = 4;` Random data is the only way to prevent optimization, and then time spent in `rand()` will dominate runtime.
Ben Voigt
I suppose you could pregenerate a few thousand random numbers and run them through multiple times.
sylvanaar
Yes, as long as some unknown (to the compiler) data is used it will prevent the compiler from precomputing `len(...)`. For example it could be calling `rand()` in a loop, or reading a disk file, but it should be done before starting the timer or it may obscure the execution time differences in `len()` itself.
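A minimal sketch of that suggestion (a hypothetical harness, not from the original post, assuming the same Windows/VS environment): random inputs are generated before the timer starts, and the results are accumulated so the compiler cannot precompute or discard the len() calls.

#include <cstdio>
#include <cstdlib>
#include <vector>
#include <windows.h>

void benchmark()
{
    std::vector<ULONG> data(100000);
    for (size_t k = 0; k < data.size(); ++k)
        data[k] = ((ULONG)rand() << 16) ^ (ULONG)rand();    // filled before timing starts

    ULONG sum = 0;
    DWORD t1 = GetTickCount();
    for (int pass = 0; pass < 1000; ++pass)
        for (size_t k = 0; k < data.size(); ++k)
            sum += len(data[k]);                            // accumulate so len() isn't optimized away
    DWORD t2 = GetTickCount();

    std::printf("%lu ticks (checksum %lu)\n", (unsigned long)(t2 - t1), (unsigned long)sum);
}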
Ben Voigt
+1  A: 

The minimum number of bits required to store an integer is:

int minbits = (int)floor( log10(n) / log10(2) ) + 1 ;  // n >= 1; beware of floating-point rounding at exact powers of 2

The number of bytes is:

int minbytes = (int)floor( log10(n) / log10(2) / 8 ) + 1 ;  // n >= 1

This is an entirely FPU-bound solution; performance may or may not be better than a conditional test, but it is perhaps worth investigating.

[EDIT] I did the investigation; a simple loop of ten million iterations of the above took 918ms, whereas FredOverflow's accepted solution took just 49ms (VC++ 2010). So this is not an improvement in terms of performance, though it may remain useful if it were the number of bits that was required, and further optimisations are possible.

Clifford
+2  A: 

Just to illustrate, based on FredOverflow's answer (which is nice work, kudos and +1), here is a common pitfall regarding branches on x86. This is FredOverflow's assembly as output by gcc:

movl    8(%ebp), %edx   #1/.5
movl    %edx, %eax      #1/.5
andl    $-16777216, %eax#1/.5
cmpl    $1, %eax        #1/.5
sbbl    %eax, %eax      #8/6
addl    $4, %eax        #1/.5
xorl    %ecx, %ecx      #1/.5
testl   $-65536, %edx   #1/.5
sete    %cl             #5
subl    %ecx, %eax      #1/.5
andl    $-256, %edx     #1/.5
sete    %dl             #5
movzbl  %dl, %edx       #1/.5
subl    %edx, %eax      #1/.5
# sum total: 29/21.5 cycles

(the latency, in cycles, is to be read as Prescott/Northwood)

Pascal Cuoq's hand-optimized assembly (also kudos):

cmpl    $255, %edi      #1/.5
setg    %al             #5
addb    $3, %al         #1/.5
cmpl    $65535, %edi    #1/.5
setle   %dl             #5
subb    %dl, %al        #1/.5
cmpl    $16777215, %edi #1/.5
setle   %dl             #5
subb    %dl, %al        #1/.5
movzbl  %al, %eax       #1/.5
# sum total: 22/18.5 cycles

Edit: FredOverflow's solution using __builtin_clz():

movl 8(%ebp), %eax   #1/.5
popl %ebp            #1.5
orb  $-1, %al        #1/.5
bsrl %eax, %eax      #16/8
sarl $3, %eax        #1/4
addl $1, %eax        #1/.5
ret
# sum total: 20/13.5 cycles

and the gcc assembly for your code:

movl $1, %eax        #1/.5
movl %esp, %ebp      #1/.5
movl 8(%ebp), %edx   #1/.5
cmpl $255, %edx      #1/.5
jbe  .L3             #up to 9 cycles
cmpl $65535, %edx    #1/.5
movb $2, %al         #1/.5
jbe  .L3             #up to 9 cycles
cmpl $16777216, %edx #1/.5
sbbl %eax, %eax      #8/6
addl $4, %eax        #1/.5
.L3:
ret
# sum total: 16/10 cycles - 34/28 cycles

For such a short function, the instruction cache line fetches that come as a side effect of the jcc instructions probably cost nothing.

Branches can be a reasonable choice, depending on the input distribution.

Edit: added FredOverflow's solution, which uses __builtin_clz().

Michael Foukarakis
Interesting, could you also measure my `return ((__builtin_clz(val | 255) ^ 31) >> 3) + 1;` solution?
FredOverflow
What's your source for cycles?
ssg
The Intel architecture manuals, http://siyobik.info/index.php?module=x86, and measurements on the cores mentioned above.
Michael Foukarakis
@mfukar I am a little confused by your reasoning. Are you showing cost estimates in which conditional jumps are counted as zero, and then concluding that conditional jumps aren't very expensive? Isn't that circular reasoning? Conditional branches may be so well predicted as to be free in regular algorithms (say, matrix multiplication) but we haven't even begun talking about input distribution here, and they are not going to be free for all distributions.
Pascal Cuoq
@mfukar Regarding the confirmation that "branches aren't that bad" in Sylvanaar's tests, there is no indication that the compiler did not use a conditional move instruction for compiling the `if`s. I wrote a comment to the effect that any attempt to optimize this function could be only for fun, not for measurable differences. Did you see it? Apparently it was quite controversial.
Pascal Cuoq
I suppose that's a possibility (compiling to CMOV), although I'm not too sure (the latency/throughput data in Intel's manuals on CMOV is erroneous for non-recent cores). One would have to verify the assembly actually executed against the listings appearing here. Also, regarding your previous comment, I've actually measured conditional jumps taking as long as 9 cycles to complete on some input distributions, so my point is essentially moot.
Michael Foukarakis
+3  A: 

On some architectures this could be quicker:

inline int len(uint32_t val) {
   return (int)( log(val) / log(256) ) + 1;  // floor of the log base 256 of val, plus 1; assumes val >= 1
}

This may also be slightly faster (if comparison takes longer than bitwise and):

inline int len(uint32_t val) {
    if (val & ~0x00FFffFF) {
        return 4;
    }
    if (val & ~0x0000ffFF) {
        return 3;
    }
    if (val & ~0x000000FF) {
        return 2;
    }
    return 1;
}

If you are on an 8 bit microcontroller (like an 8051 or AVR) then this will work best:

inline int len(uint32_t val) {
    union int_char { 
          uint32_t u;
          uint8_t a[4];
    } x;
    x.u = val; // doing it this way rather than taking the address of val often prevents
               // the compiler from doing dumb things.
    if (x.a[0]) {
        return 4;
    } else if (x.a[1]) {
       return 3;
    ...
nategoose
+2  A: 

You may have a more efficient solution depending on your architecture.

MIPS has a "CLZ" instruction that counts the number of leading zero-bits of a number. What you are looking for here is essentially 4 - (CLZ(x) / 8) (where / is integer division). PowerPC has the equivalent instruction cntlz, and x86 has BSR. This solution should simplify down to 3-4 instructions (not counting function call overhead) and zero branches.
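A rough sketch of that expression in C (an illustration assuming GCC/Clang's __builtin_clz as a stand-in for the native CLZ; since __builtin_clz(0) is undefined, the argument is OR-ed with 1):

inline int len(uint32_t val)
{
    // 4 - (CLZ(x) / 8); OR-ing in 1 keeps the argument non-zero for val == 0.
    return 4 - (__builtin_clz(val | 1) >> 3);
}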

bta
I just noticed that `BSR` operates slightly differently than `CLZ`. `BSR` returns a bit index, so for a 32-bit input you would have to do `31 - BSR(x)` to yield the equivalent of the MIPS `CLZ` instruction.
bta
+2  A: 

to Pascal Cuoq and the 35 other people who up-voted his comment:

"Wow! More than 10 million times... You mean that if you squeeze three cycles out of this function, you will save as much as 0.03s? "

Such a sarcastic comment is at best rude and offensive.

Optimization is frequently the cumulative result of 3% here, 2% there. 3% in overall capacity is nothing to be sneezed at. Suppose this was an almost saturated and unparallelizable stage in a pipe. Suppose CPU utilization went from 99% to 96%. Simple queuing theory tells one that such a reduction in CPU utilization would reduce the average queue length by over 75% [the quantity load / (1 - load): 0.99/0.01 = 99 versus 0.96/0.04 = 24].

Such a reduction can frequently make or break a particular hardware configuration as this has feed back effects on memory requirements, caching the queued items, lock convoying, and (horror of horrors should it be a paged system) even paging. It is precisely these sorts of effects that cause bifurcated hysteresis loop type system behavior.

Arrival rates of anything tend to go up, and field replacement of a particular CPU or buying a faster box is frequently just not an option.

Optimization is not just about wall clock time on a desktop. Anyone who thinks that it is has much reading to do about the measurement and modelling of computer program behavior.

Pascal Cuoq owes the original poster an apology.

nbourbaki
-1: This should be a comment.
Cedric H.
@cedric H - maybe it should be but it seems that the site software won't let a new member comment on any but his own posts.
nbourbaki
I don't think there is any minimum reputation for using the "flag" button, which is what I would do if I was as offended as you seem to be.
Pascal Cuoq
Actually, I thought it important to point out that getting a 3% improvement in capacity was somewhat different than a dismissed absolute improvement of .03 seconds, especially since some comments seemed not to appreciate that not-so-subtle difference. Perhaps if you had ever had to support 100000 boards out in the field, with arrival rates to the application running on the board doubling every 2 years, you might actually appreciate the difference.
nbourbaki
A: 

If I remember 80x86 asm right, I'd do something like:

  ; Assume value in EAX; count goes into ECX
  cmp eax,16777216 ; Carry set if the value fits in 3 bytes or fewer
  sbb ecx,ecx      ; Load -1 if below, 0 otherwise
  cmp eax,65536
  sbb ecx,0        ; Subtract 1 if below, 0 otherwise
  cmp eax,256
  sbb ecx,-4       ; Add 3 if below, 4 otherwise

Six instructions. I think the same approach would also work for six instructions on the ARM I use.

supercat