I am looking for an efficient way to determine the position of the least significant bit that is set in an integer, e.g. for 0x0FF0 it would be 4.

A trivial implementation is this:

unsigned GetLowestBitPos(unsigned value)
{
   assert(value != 0); // handled separately

   unsigned pos = 0;
   while (!(value & 1))
   {
      value >>= 1;
      ++pos;
   }
   return pos;
}

Any ideas how to squeeze some cycles out of it?

(Note: this question is for people that enjoy such things, not for people to tell me xyzoptimization is evil.)

[edit] Thanks everyone for the ideas! I've learnt a few other things, too. Cool!

+11  A: 

There is an x86 assembly instruction (bsf) that will do it. :)
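
For illustration, the instruction can be reached from C with GCC-style inline assembly (a sketch assuming an x86 target and a GCC-compatible compiler; the wrapper name just mirrors the question's):

```c
#include <assert.h>

unsigned GetLowestBitPos(unsigned value)
{
    assert(value != 0); /* BSF leaves the destination undefined for 0 */
    unsigned pos;
    __asm__ ("bsf %1, %0" : "=r" (pos) : "rm" (value));
    return pos;
}
```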

More optimized?!

Side Note:

Optimization at this level is inherently architecture dependent. Today's processors are so complex (in terms of branch prediction, cache misses, pipelining) that it's hard to predict which code executes faster on which architecture. Decreasing the operation count from 32 to 9, or the like, might even decrease performance on some architectures, and code optimized for one architecture might result in worse code on another. I think you'd either optimize this for a specific CPU, or leave it as it is and let the compiler choose what it thinks is better.

Mehrdad Afshari
-1: That's neither C nor C++
dwc
@dwc: I understand, but I think this clause: "Any ideas how to squeeze some cycles out of it?" makes such an answer perfectly acceptable!
Mehrdad Afshari
+1 His answer is necessarily dependent on his architecture because of endianness, so dropping down to assembly instructions is a perfectly valid answer.
Chris Lutz
Chris: how is the code endian dependent? Maybe I'm missing something, but I don't see it.
dwc
+1 Clever answer, yes it isn't C or C++ but it is the right tool for the job.
Andrew Hare
Wait, nevermind. The actual value of the integer doesn't matter here. Sorry.
Chris Lutz
Unless I am also missing something, there's nothing with endianness going on here. A right shift is a right shift, regardless of endianness. Likewise ANDing with 1.
rmeador
It's important to know that the result of those instructions is undefined when the value you're scanning is zero.
Bastien Léonard
@Bastien: They set ZF=1 if the operand is zero.
Mehrdad Afshari
Oh, you're right.
Bastien Léonard
A: 

You could check whether any of the lower-order bits are set. If so, then look at the lower order of the remaining bits. E.g.:

32-bit int: check whether any of the first 16 bits are set. If so, check whether any of the first 8 are set. If so, ...

If not, check whether any of the upper 16 are set.

Essentially it's binary search.
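
The narrowing described above can be sketched as successive halvings (my rendering of the idea, assuming a 32-bit unsigned and a non-zero input):

```c
unsigned GetLowestBitPos(unsigned value)
{
    /* assumes value != 0; each step discards the empty lower half */
    unsigned pos = 0;
    if (!(value & 0x0000FFFF)) { pos += 16; value >>= 16; }
    if (!(value & 0x000000FF)) { pos += 8;  value >>= 8;  }
    if (!(value & 0x0000000F)) { pos += 4;  value >>= 4;  }
    if (!(value & 0x00000003)) { pos += 2;  value >>= 2;  }
    if (!(value & 0x00000001)) { pos += 1; }
    return pos;
}
```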

Arnshea
I doubt it'll actually increase speed, since the code complexity might reduce the pipeline efficiency.
Mehrdad Afshari
+2  A: 

It can be done with a worst case of less than 32 operations:

Principle: Checking for 2 or more bits is just as efficient as checking for 1 bit.

So for example there's nothing stopping you from checking which grouping it's in first, then checking each bit from smallest to biggest in that group.

So...
if you check 2 bits at a time you have in the worst case (Nbits/2) + 1 checks total.
if you check 3 bits at a time you have in the worst case (Nbits/3) + 2 checks total.
...

Optimal would be to check in groups of 4, which requires in the worst case 11 operations instead of your 32.

The best case goes from your algorithm's 1 check to 2 checks if you use this grouping idea. But that extra check in the best case is worth it for the worst-case savings.

Note: I write it out in full instead of using a loop because it's more efficient that way.

int getLowestBitPos(unsigned int value)
{
    //Group 1: Bits 0-3
    if(value&0xf)
    {
     if(value&0x1)
      return 0;
     else if(value&0x2)
      return 1;
     else if(value&0x4)
      return 2;
     else
      return 3;
    }

    //Group 2: Bits 4-7
    if(value&0xf0)
    {
     if(value&0x10)
      return 4;
     else if(value&0x20)
      return 5;
     else if(value&0x40)
      return 6;
     else
      return 7;
    }

    //Group 3: Bits 8-11
    if(value&0xf00)
    {
     if(value&0x100)
      return 8;
     else if(value&0x200)
      return 9;
     else if(value&0x400)
      return 10;
     else
      return 11;
    }

    //Group 4: Bits 12-15
    if(value&0xf000)
    {
     if(value&0x1000)
      return 12;
     else if(value&0x2000)
      return 13;
     else if(value&0x4000)
      return 14;
     else
      return 15;
    }

    //Group 5: Bits 16-19
    if(value&0xf0000)
    {
     if(value&0x10000)
      return 16;
     else if(value&0x20000)
      return 17;
     else if(value&0x40000)
      return 18;
     else
      return 19;
    }

    //Group 6: Bits 20-23
    if(value&0xf00000)
    {
     if(value&0x100000)
      return 20;
     else if(value&0x200000)
      return 21;
     else if(value&0x400000)
      return 22;
     else
      return 23;
    }

    //Group 7: Bits 24-27
    if(value&0xf000000)
    {
     if(value&0x1000000)
      return 24;
     else if(value&0x2000000)
      return 25;
     else if(value&0x4000000)
      return 26;
     else
      return 27;
    }

    //Group 8: Bits 28-31
    if(value&0xf0000000)
    {
     if(value&0x10000000)
      return 28;
     else if(value&0x20000000)
      return 29;
     else if(value&0x40000000)
      return 30;
     else
      return 31;
    }

    return -1;
}
Brian R. Bondy
I will happily delete this, but just so I understand... why the 3+ down votes?
Brian R. Bondy
+1 from me. It's not the fastest but it's faster than the original, which was the point...
Andrew Grant
I'd distrust this because (1) you haven't tested it (or haven't shared the results), and (2) it's a big wadge of code, so outside of a benchmark scenario it might slow other things down by occupying excess icache. But that applies to most answers here, so no downvote from me.
Steve Jessop
@onebyone.livejournal.com: Even if there was a bug in the code, the concept of grouping is the point I was trying to get across. The actual code sample doesn't matter much, and it could be made more compact but less efficient.
Brian R. Bondy
I'm just wondering if there is a real bad part of my answer, or if people didn't just like that I wrote it out in full?
Brian R. Bondy
I don't mean a bug, I mean that you've said "it's faster" without evidence. So when I say "test" I really meant "profile", i.e. "test the performance". This probably is faster, but over the years I've grown suspicious of blind optimisation based on "it's obvious, innit" arguments.
Steve Jessop
For instance, it's at least plausible that some compilers would recognise the questioner's original loop and replace it with the appropriate count-zero opcode, and do it in one instruction, but fail to do the same replacement for this code. You'd then be *slower* than the loop. So I'd seek proof...
Steve Jessop
@onebyone.livejournal.com: When you compare 2 algorithms, you should compare them as they are, not assuming that one will be magically transformed by an optimization phase. I never claimed my algorithm was "faster" either. Only that it is less operations.
Brian R. Bondy
@onebyone.livejournal.com: ... I don't need to profile the above code to know it is less operations. I can see that clearly. I never made any claims that require profiling.
Brian R. Bondy
... and you did say 'more efficient'. So I wrongly assumed that by 'more efficient' you meant more performant. Perhaps the people who downvoted you assumed the same.
Steve Jessop
A: 
unsigned GetLowestBitPos(unsigned value)
{
    if (value & 1) return 0;
    if (value & 2) return 1;
    if (value & 4) return 2;
    // ... and so on, up to bit 31
}
BoltBait
Excellent! That would certainly be an improvement on the original.
e.James
+1 for -funroll-loops-1 for doing it by hand
Mikeage
It will only be 32 lines of code...
BoltBait
A: 

If you have the resources, you can sacrifice memory in order to improve the speed:

static const unsigned bitPositions[UINT_MAX + 1ULL] = { 0, 0, 1, 0, 2, /* ... */ };

unsigned GetLowestBitPos(unsigned value)
{
    assert(value != 0); // handled separately
    return bitPositions[value];
}

Note: This table would consume at least 4 GB (16 GB if we leave the return type as unsigned). This is an example of trading one limited resource (RAM) for another (execution speed).

If your function needs to remain portable and run as fast as possible at any cost, this would be the way to go. In most real-world applications, a 4GB table is unrealistic.

e.James
So by resources you mean 4GB of memory....
Andrew Grant
The bitpositions array need only be the same size as the max possible value. So as long as you restrict the input, you're fine :)
Stefan Kendall
The range of the input is already specified by the parameter type - 'unsigned' is a 32-bit value so no, you are not fine.
Brian
@Andrew Grant: Yes, or even more, if the size of unsigned uses more bits on a system. The OP asked how to reduce cycles, and this will do it. Like any design decision, it involves some tradeoffs. In some systems, a few cycles saved is worth 4GB of RAM.
e.James
umm... does your mythical system and OS have a concept of paged memory? How much time is that going to cost?
Mikeage
@Mikeage: Good point. Memory paging is definitely a valid concern.
e.James
This is a non-answer. Your solution is completely unrealistic in ALL real-world applications and calling it a "tradeoff" is disingenuous. Your mythical system that has 16GB of ram to devote to a single function just does not exist. You'd have been as well answering "use a quantum computer".
Brian
+4  A: 

Why not use binary search? This will always complete after 5 operations (assuming int size of 4 bytes):

if (0x0000FFFF & value) {
    if (0x000000FF & value) {
        if (0x0000000F & value) {
            if (0x00000003 & value) {
                if (0x00000001 & value) {
                    return 0;
                } else {
                    return 1;
                }
            } else {
                if (0x00000004 & value) {
                    return 2;
                } else {
                    return 3;
                }
            }
            }
        } else { ...
    } else { ...
} else { ...
soulmerge
thx for the corrections, Adam.
soulmerge
+1 This is very similar to my answer. The best case run time is worse than my suggestion, but the worst case run time is better.
Brian R. Bondy
+7  A: 

The fastest (non-intrinsic/non-assembler) solution to this is to find the lowest byte and then use that byte in a 256-entry lookup table. This gives you a worst-case performance of three conditional instructions and a best case of one. Not only is this the fewest instructions, it is also the fewest branches, which is super-important on modern hardware.

Your table (256 8-bit entries) should contain the index of the LSB for each number in the range 0-255. You check each byte of your value, find the lowest non-zero byte, and use this value to look up the real index.

This does require 256 bytes of memory, but if the speed of this function is so important then those 256 bytes are well worth it.

E.g.

typedef unsigned char byte; // assuming 8-bit bytes

byte lowestBitTable[256] = {
.... // left as an exercise for the reader to generate
};

unsigned GetLowestBitPos(unsigned value)
{
  // note that the order in which to check indices depends on whether you are
  // on a big- or little-endian machine. This is for little-endian.
  byte* bytes = (byte*)&value;
  if (bytes[0])
    return lowestBitTable[bytes[0]];
  else if (bytes[1])
    return lowestBitTable[bytes[1]] + 8;
  else if (bytes[2])
    return lowestBitTable[bytes[2]] + 16;
  else
    return lowestBitTable[bytes[3]] + 24;
}
Andrew Grant
It's actually a worst-case of three conditionals :) But yes, this is the fastest approach (and usually what people are looking for in interview questions like this).
Brian
Don't you want a +8, +16, +24 in there somewhere?
Mark Ransom
@Brian/@Mark - good points, thanks! :)
Andrew Grant
Any lookup table increases the chance of cache miss and might incur the cost of memory access which can be several orders of magnitude higher than executing instructions.
Mehrdad Afshari
True, in theory, but then you could say the same thing about I-Cache with all of the unrolled loops in other answers. If the speed of this function is so important one can assume it's called frequently so small amounts of either data are likely to be in the cache. Either way profile and see.
Andrew Grant
@Mehrdad If this function is that important there are easy ways to ensure the table either stays in cache or is prefetched prior to calling the function
Brian
i would even use bit-shifts (shifting it by 8 each time). could be done entirely using registers then. using pointers, you will have to access memory.
Johannes Schaub - litb
+6  A: 

Most modern architectures will have some instruction for finding the position of the lowest set bit, or the highest set bit, or counting the number of leading zeroes etc.

If you have any one instruction of this class you can cheaply emulate the others.

Take a moment to work through it on paper and realise that x & (x-1) will clear the lowest set bit in x, and ( x & ~(x-1) ) will return just the lowest set bit, irrespective of architecture, word length etc. Knowing this, it is trivial to use hardware count-leading-zeroes / highest-set-bit to find the lowest set bit if there is no explicit instruction to do so.

If there is no relevant hardware support at all, the multiply-and-lookup implementation of count-leading-zeroes given here or one of the ones on the Bit Twiddling Hacks page can trivially be converted to give lowest set bit using the above identities and has the advantage of being branchless.
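
As a sketch of that derivation, using GCC's __builtin_clz as a stand-in for a hardware highest-set-bit instruction (the builtin is my choice, not part of the answer):

```c
#include <assert.h>

unsigned GetLowestBitPos(unsigned x)
{
    assert(x != 0);
    unsigned lowest = x & ~(x - 1);               /* isolate the lowest set bit */
    return 31u - (unsigned)__builtin_clz(lowest); /* its position, via count-leading-zeroes */
}
```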

moonshadow
+21  A: 

Bit Twiddling Hacks offers an excellent collection of, er, bit twiddling hacks, with performance/optimisation discussion attached. My favourite solution for your problem (from that site) is «multiply and lookup»:

unsigned int v;  // find the number of trailing zeros in 32-bit v 
int r;           // result goes here
static const int MultiplyDeBruijnBitPosition[32] = 
{
  0, 1, 28, 2, 29, 14, 24, 3, 30, 22, 20, 15, 25, 17, 4, 8, 
  31, 27, 13, 23, 21, 19, 16, 7, 26, 12, 18, 6, 11, 5, 10, 9
};
r = MultiplyDeBruijnBitPosition[((uint32_t)((v & -v) * 0x077CB531U)) >> 27];
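
Wrapped into a self-contained function (the table and constant are taken verbatim from the snippet above; the function name is mine):

```c
#include <stdint.h>

static const int MultiplyDeBruijnBitPosition[32] =
{
  0, 1, 28, 2, 29, 14, 24, 3, 30, 22, 20, 15, 25, 17, 4, 8,
  31, 27, 13, 23, 21, 19, 16, 7, 26, 12, 18, 6, 11, 5, 10, 9
};

int GetLowestBitPos(uint32_t v)
{
    /* v & -v isolates the lowest set bit; multiplying by the De Bruijn
       constant hashes its position into the top 5 bits, which index the table */
    return MultiplyDeBruijnBitPosition[((uint32_t)((v & -v) * 0x077CB531U)) >> 27];
}
```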

Anton Tykhyy
Adam Davis
Brian R. Bondy
Depends on the math processor. A single 32x32 integer multiply with no overflow, though, is exceptionally fast - this is what video games are built on. I'd be interested in seeing benchmarks though...
Adam Davis
That multiply is fast nowadays :) Thanks Anton, I had browse Bit Twiddling Hacks casually, but completely missed this item. d'oh!
peterchen
This code is fast. I'm getting 497 million ops/sec in .Net on a 2GHz CPU. Nothing wrong with that!
IanC
+6  A: 

As always, this problem has had some attention on it in the past :-)

  • If your compiler is GCC, use __builtin_ctz. It does exactly what you want. Remember to check for 0 first.
  • If your compiler is not GCC, then look at the implementation of __builtin_ctz from GCC for your architecture.

As others have said, there is no way to produce a definitive platform-portable optimal implementation for this kind of thing in C. I assume without checking that GCC has a generic implementation, to assist writers of back-ends, which might be a reasonable compromise, or might just be the simplest possible implementation.
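
A minimal sketch of the GCC route (the zero check via assert is my addition):

```c
#include <assert.h>

unsigned GetLowestBitPos(unsigned value)
{
    assert(value != 0); /* __builtin_ctz is undefined for 0 */
    return (unsigned)__builtin_ctz(value); /* count of trailing zeros == LSB position */
}
```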

Steve Jessop
Under visual c++, there is the _bitscanforward intrinsic. http://msdn.microsoft.com/en-us/library/wfd9z0bb.aspx
Eclipse
+8  A: 

Why not use the built-in ffs? (I grabbed a man page from Linux, but it's more widely available than that.)

ffs(3) - Linux man page

Name

ffs - find first bit set in a word

Synopsis

#include <strings.h>
int ffs(int i);
#define _GNU_SOURCE
#include <string.h>
int ffsl(long int i);
int ffsll(long long int i);

Description

The ffs() function returns the position of the first (least significant) bit set in the word i. The least significant bit is position 1 and the most significant position is, e.g., 32 or 64. The functions ffsll() and ffsl() do the same but take arguments of possibly different size.

Return Value

These functions return the position of the first bit set, or 0 if no bits are set in i.

Conforming to

4.3BSD, POSIX.1-2001.

Notes

BSD systems have a prototype in <string.h>.
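
Note that ffs() is 1-based (and returns 0 when no bits are set), so a small adapter is needed to match the question's 0-based convention; a sketch:

```c
#include <assert.h>
#include <strings.h>

unsigned GetLowestBitPos(unsigned value)
{
    assert(value != 0);                   /* ffs() would return 0 here */
    return (unsigned)ffs((int)value) - 1; /* shift to the 0-based convention */
}
```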

ephemient
"Why not use ffs?" Only because strings.h is obscure. Well spotted :-)
Steve Jessop
+1  A: 

See my answer here for how to do it with a single x86 instruction, except that to find the least significant set bit you'll want the BSF ("bit scan forward") instruction instead of BSR described there.

timday
+2  A: 

Inspired by this similar post that involves searching for a set bit, I offer the following:

unsigned GetLowestBitPos(unsigned value)
{
   double d = value ^ (value - !!value); 
   return (((int*)&d)[1]>>20)-1023; 
}

Pros:

  • no loops
  • no branching
  • runs in constant time
  • handles value=0 by returning an otherwise-out-of-bounds result
  • only two lines of code

Cons:

  • assumes little endianness as coded (can be fixed by changing the constants)
  • assumes that double is a real*8 IEEE float
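
For what it's worth, the same trick can be restated without the pointer cast (an editorial sketch: it still assumes IEEE-754 doubles, but memcpy avoids the aliasing pun and reading all 64 bits at once sidesteps the endianness constant):

```c
#include <stdint.h>
#include <string.h>

unsigned GetLowestBitPos(unsigned value)
{
    double d = value ^ (value - !!value);   /* 2^(pos+1) - 1 for nonzero input */
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);         /* read the IEEE-754 representation */
    return (unsigned)((bits >> 52) - 1023); /* unbias the exponent */
}
```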
DocMax
Interesting - I am still scared to use doubles for bit arithmetic, but I'll keep it in mind
peterchen