views:

374

answers:

12

I have an array of uint64 and for all unset bits (0s), I do some evaluations.

The evaluations are not terribly expensive, but very few bits are unset. Profiling says that I spend a lot of time in the finding-the-next-unset-bit logic.

Is there a faster way (on a Core2duo)?

My current code can skip lots of high 1s:

for(int y=0; y<height; y++) {
  uint64_t xbits = ~board[y];
  int x = 0;
  while(xbits) {
    if(xbits & 1) {
      ... with x and y
    }
    x++;
    xbits >>= 1;
  }
}

(And any discussion about how/if to SIMD/CUDA-ise this would be an intriguing tangent!)

+1  A: 

I would suggest using some sort of lookup table (per byte, or short, depending on the available resources) that would tell you which bits are clear in a certain value.

Tal Pressman
+2  A: 

I can think of a few optimization points like Loop unwinding, in which you can try something like

for(int y=0; y < height; y++) {

    uint64_t xbits = ~board[y];
    int x = 0;

    while(xbits) {
     if(xbits & (1 << 0)) {
       ... with x and y
     }
     if(xbits & (1 << 1)) {
       ... with x and y
     }
     if(xbits & (1 << 2)) {
       ... with x and y
     }
     if(xbits & (1 << 3)) {
       ... with x and y
     }
     if(xbits & (1 << 4)) {
       ... with x and y
     }
     if(xbits & (1 << 5)) {
       ... with x and y
     }
     if(xbits & (1 << 6)) {
       ... with x and y
     }
     if(xbits & (1 << 7)) {
       ... with x and y
     }
     x+=8;
     xbits >>= 8;
    }
}

This will remove 7 loop check, 7 additions, 7 shifting for 8 calculations ...

Another way I can think of, is just ignoring consecutive 1's if they are set for example

while (xbits) {

    if (xbits & 0xF) {

          // Process for the four bits !!!
    }

    xbits >>= 4;
}

Warning: If the bits are too much scattered then the above method might make things slow :-(

Alphaneo
I think you should increase x after every `if` statement.
Nick D
Thanks for pointing Nick, but if we can use x+1, x+2, etc, (if x is not used in many places) it should be fine, isn't it?
Alphaneo
+3  A: 

Have you considered a table, which would allow you to handle every byte at once. Essentially by a s single subscript operation you'd retrieve a list of the "x" values that are not set in the byte (to which you'd add 8 * byte-within-uint64 to get your true "x".

By using one byte to store one 1-to-8 bit number value (we could pack this a bit but then the benefit of having a good to go value would be somewhat defeated), and by assuming that we'd have a maximum of 4 0-valued bits (byte values with more 0 bits can be coded with an escape code, which would trigger some conventional bit logic, which would be acceptable then, given the low probability of such events), we need a table of 256 * 4 bytes = 1k.

mjv
+6  A: 

Hacker's Delight suggests a loop-unrolled binary search. Not pretty, but fast for sparse unset bits because it skips dwords/bytes/nibbles/etc. with every bit set.

If you can get a Phenom with SSE4a (not Core2 Duo, unfortunately) you can use POPCNT to write a fast number-of-set-bits function. Then you can get the index of the next unset bit with:

pop(x & (~x-1))

x & (~x-1) clears the set bits above the next zero bit; then you just have to count the remaining bits with POPCNT.

Here's a worked example with a byte:

    01101111 x
    10010000 ~x
    10001111 ~x-1
    00001111 x & ~x-1
pop(00001111) => 4
Dominic Cooney
+3  A: 

If you're willing to use assemply, BSF (Bit Scan Forward) would be the operation to use. It finds 1 bits though, so you'll have to invert your bitmask. IIRC, the XOR will set the zero flag if the result is 0, so you can test that flag before trying BSF. On x86, BSF works on 32 bits registers, so you'll have to split your value. (But then you should be using 32 bits integers in the first place, I'd say).

MSalters
+2  A: 

One approach - partition into nibbles then use switch to select the bits from the nibble. Use templates so the bit selected is known at compile time, and to help unwind the code.

template < int i, int x >
struct process_bit {
    inline static void apply ( int y ) { };
};

template < int x >
struct process_bit < 1, x > {
    inline static void apply ( int y ) {
        evaluate ( x, y );
    }
};

template < int x, int n >
inline void process_nibble_bits ( int y ) {
    process_bit < x & 1, n >::apply( y );
    process_bit < ( x >> 1 ) & 1, n + 1 > ::apply( y );
    process_bit < ( x >> 2 ) & 1, n + 2 > ::apply( y );
    process_bit < ( x >> 3 ) & 1, n + 3 > ::apply( y );
}


template < int n >
inline void process_nibble ( uint64_t xbits, int y ) {
    uint64_t nibble = ( xbits >> n ) & 0xf;
    if ( nibble ) {
        switch ( nibble ) {
            case 0:
            process_nibble_bits < 0, n > ( y );
            break;
            case 1:
            process_nibble_bits < 1, n > ( y );
            break;
            case 2:
            process_nibble_bits < 2, n > ( y );
            break;
            case 3:
            process_nibble_bits < 3, n > ( y );
            break;
            case 4:
            process_nibble_bits < 4, n > ( y );
            break;
            case 5:
            process_nibble_bits < 5, n > ( y );
            break;
            case 6:
            process_nibble_bits < 6, n > ( y );
            break;
            case 7:
            process_nibble_bits < 7, n > ( y );
            break;
            case 8:
            process_nibble_bits < 8, n > ( y );
            break;
            case 9:
            process_nibble_bits < 9, n > ( y );
            break;
            case 10:
            process_nibble_bits < 10, n > ( y );
            break;
            case 11:
            process_nibble_bits < 11, n > ( y );
            break;
            case 12:
            process_nibble_bits < 12, n > ( y );
            break;
            case 13:
            process_nibble_bits < 13, n > ( y );
            break;
            case 14:
            process_nibble_bits < 14, n > ( y );
            break;
            case 15:
            process_nibble_bits < 15, n > ( y );
            break;
        }
    }
}

template < int i, int n >
struct bit_tree {
    inline static void apply ( uint64_t xbits, int y ) {
        // each call to here represents scan of bits in [ n, n + 2i )
        bit_tree < i >> 1, n > ::apply(xbits, y);
        bit_tree < i >> 1, n + i > ::apply(xbits, y);
    };
};


template < int i, int n >
struct bit_tree_with_guard {
    inline static void apply ( uint64_t xbits, int y ) {
        // each call to here represents scan of bits in [ n, n + 2i )
        // so this branch to execute if any in [ n, n + i ) are set

        if ( xbits & ( ( ( ( ( uint64_t ) 1LL ) << i ) - 1 ) << n ) )
            bit_tree < i >> 1, n > ::apply(xbits, y);

        if ( xbits & ( ( ( ( ( uint64_t ) 1LL ) << i ) - 1 ) << ( n + i) ) )
            bit_tree < i >> 1, n + i > ::apply(xbits, y);
    };
};

// put guards on 8 and 16 bit blocks ( for some reason using inheritance is slower ) 
template < int n >
struct bit_tree < 8, n > {
    inline static void apply ( uint64_t xbits, int y ) {
        bit_tree_with_guard < 8, n > ::apply ( xbits, y );
    }
};
template < int n >
struct bit_tree < 16, n > {
    inline static void apply ( uint64_t xbits, int y ) {
        bit_tree_with_guard < 16, n > ::apply ( xbits, y );
    }
};


template < int n >
struct bit_tree < 2, n > {
    inline static void apply ( uint64_t xbits, int y ) {
        process_nibble < n > ( xbits, y );
    }
};


void template_nibbles(int height) {
    for (int y = 0; y < height; y++) {
        uint64_t xbits = ~board[y];
        bit_tree< 32, 0>::apply ( xbits, y );
    }
}

Running it's not as fast as the ffs version, but it is a touch better than the other portable ones, and appears to be consistent with them in results:

$ bin\bit_twiddle_micro_opt.exe                                               
testing will_while()... 3375000 usecs (check 1539404233,1539597930)           
testing will_ffs()... 2890625 usecs (check 675191567,1001386403)              
testing alphaneo_unrolled_8()... 3296875 usecs (check 1539404233,1539597930)  
testing template_nibbles()... 3046875 usecs (check 1539404233,1539597930)

Using a tree all the way doesn't seem to give any gain; not using the switch for the nibble is slower. Anyone know a way of not having to write the 16 cases by hand using only C++?

Pete Kirkham
+1  A: 

Here's a quick micro-benchmark; please run it if you can to get stats for your system, and please add your own algorithms!

The commandline:

g++ -o bit_twiddle_mirco_opt bit_twiddle_mirco_opt.cpp -O9 -fomit-frame-pointer -DNDEBUG -march=native

And the code:

#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <stdint.h>

static unsigned long get_usecs() {
    struct timeval tv;
    gettimeofday(&tv,NULL);
    return tv.tv_sec*1000000+tv.tv_usec;
}

enum { MAX_HEIGHT = 64 };
uint64_t board[MAX_HEIGHT];
int xsum, ysum;

void evaluate(int x,int y) {
    xsum += x;
    ysum += y;
}

void alphaneo_unrolled_8(int height) {
    for(int y=0; y < height; y++) {
     uint64_t xbits = ~board[y];
     int x = 0;  
     while(xbits) {
      if(xbits & (1 << 0))
       evaluate(x,y);
      if(xbits & (1 << 1))
       evaluate(x+1,y);
      if(xbits & (1 << 2))
       evaluate(x+2,y);
      if(xbits & (1 << 3))
       evaluate(x+3,y);
      if(xbits & (1 << 4))
       evaluate(x+4,y);
      if(xbits & (1 << 5))
       evaluate(x+5,y);
      if(xbits & (1 << 6))
       evaluate(x+6,y);
      if(xbits & (1 << 7))
       evaluate(x+7,y);
      x+=8;
      xbits >>= 8;
     }
    }
}

void will_while(int height) {
    for(int y=0; y<height; y++) {
     uint64_t xbits = ~board[y];
     int x = 0;
     while(xbits) {
      if(xbits & 1)
       evaluate(x,y);
      xbits >>= 1;
      x++;
     }
    }
}

void will_ffs(int height) {
    for(int y=0; y<height; y++) {
     uint64_t xbits = ~board[y];
     int x = __builtin_ffsl(xbits);
     while(x) {
      evaluate(x-1,y);
      xbits >>= x;
      xbits <<= x;
      x = __builtin_ffsl(xbits);
     }
    }
}

void rnd_board(int dim) {
    for(int y=0; y<dim; y++) {
     board[y] = ~(((uint64_t)1 << dim)-1);
     for(int x=0; x<dim; x++)
      if(random() & 1)
       board[y] |= (uint64_t)1 << x;
    }
}

void test(const char* name,void(*func)(int)) {
    srandom(0);
    printf("testing %s... ",name);
    xsum = ysum = 0;
    const unsigned long start = get_usecs();
    for(int i=0; i<100000; i++) {
     const int dim = (random() % MAX_HEIGHT) + 1;
     rnd_board(dim);
     func(dim);
    }
    const unsigned long stop = get_usecs();
    printf("%lu usecs (check %d,%d)\n",stop-start,xsum,ysum);
}

int main() {
    test("will_while()",will_while);
    test("will_ffs()",will_ffs);
    test("alphaneo_unrolled_8()",alphaneo_unrolled_8);
    return 0;
}
Will
Core2duo:testing will_while()... 3354148 usecs (check 1556785771,1556733683)testing will_ffs()... 3017699 usecs (check 1556785771,1556733683)testing alphaneo_unrolled_8()... 3269703 usecs (check 1556785771,1556733683)
Will
will_ffs appears to be wrong: loses top 32 bits (which your testcase ignores, 1.5E9 is < 2^31) and you zero out all bits lower than `x` by the shift. You probably want `__builtin_ffsll`, and you can zero the 'x' bit using `xbits ^= (1<<(x-1));`. The loop also becomes simpler if you just write `while(int x = __builtin_ffsll(xbits)) {`
MSalters
Man! Am impressed with the level of analysis you guy did!!!
Alphaneo
@MSalters depends on whether my target is 32 or 64bit? Its a shame that builtins aren't named after their actual type rather than relative type, e.g. __builtin_ffs_64(...)
Will
+2  A: 

Other answers are good. Here's my contribution:

You could invert the word, and then have a loop finding the least-significant 1-bit:

int x = something;

int lsb = x ^ ((x-1) & x);

i.e. if   x = 100100
a = (x - 1) = 100011 // these two steps turn off the lsb
b = (a & x) = 100000
c = (x ^ b) = 000100 // this step detects the lsb
lsb = c

Then to tell if you are done, do x ^= lsb and test for zero.

If you wanted to turn that lsb (which is an actual bit) into a bit-number, that's where a lookup table or an unrolled binary search could be just what you need.

Is that what you wanted?

Mike Dunlavey
A: 

If you think that unset bits will be uncommon, then perhaps a simple

if (xbits != ((uint64_t)-1))
{
   // regular code goes here
}

would be a win. That way, in the common case (all bits in the word are set) you would skip past 64 set bits in one go.

Jeremy Friesner
+1  A: 

Does your profiling indicate you're mostly spending time in the inner while loop, or are you spending most of the time doing the ~board[y] calculation and then incrementing y straight away?

If it's the latter, you might be better off having a second level bitmap, with each bit in that map eliminating an entire 64b word in your board bitmap -- that way you can skip a fair bit further ahead, and if you're lucky, avoid loading entire cache lines of your bitmap.

What's the distribution of the number of bits set in your bitmap?

Anthony Towns
A: 

A variation of the look up table version: Have a lookup table for the next unset bit for 8-bits. Check 8-bit blocks and AND to 0xFF, compare to see if the result's still 0xFF. If it is, skip, else lookup at the table?

Calyth
+1  A: 

If you have very few unset bits, then don't use a bitfield at all, use a sparse representation. By that, I mean keep an array of integers containing the index of each unset bit. Iterating through the unset bits is just iterating through the array. Setting and clearing bits becomes more complicated, but if finding an unset bit is your most expensive operation, using a sparse representation will likely be a win.

Keith Randall
Indeed I do this when the number falls below a certain amount
Will