8 bits representing the number 7 look like this:
00000111
Three bits are set. What is the best algorithm to determine the number of set bits in a 32-bit integer?
8 bits representing the number 7 look like this:
00000111
Three bits are set. What is the best algorithm to determine the number of set bits in a 32-bit integer?
This is known as the 'Hamming Weight', 'popcount' or 'sideways addition'.
The 'best' algorithm really depends on which CPU you are on and what your usage pattern is.
Some CPUs have a single built-in instruction to do it, and others have parallel instructions which act on bit vectors. The parallel instructions will almost certainly be fastest, however, the single-instruction algorithms are 'are usually microcoded loops that test a bit per cycle; a log-time algorithm coded in C is often faster'.
A pre-populated table lookup method can be very fast if your CPU has a large cache and/or you are doing lots of these instructions in a tight loop. However it can suffer because of the expense of a 'cache miss', where the CPU has to fetch some of the table from main memory.
If you know that your bytes will be mostly 0's or mostly 1's then there are very efficient algorithms for these scenarios.
I believe a very good general purpose algorithm is the following, known as 'parallel' or 'variable-precision SWAR algorithm':
int NumberOfSetBits(int i)
{
i = i - ((i >> 1) & 0x55555555);
i = (i & 0x33333333) + ((i >> 2) & 0x33333333);
return ((i + (i >> 4) & 0xF0F0F0F) * 0x1010101) >> 24;
}
This is because it has the best worst-case behaviour of any of the algorithms discussed, so will efficiently deal with any usage pattern or values you throw at it.
References:
http://graphics.stanford.edu/~seander/bithacks.html
http://en.wikipedia.org/wiki/Hamming_weight
http://gurmeetsingh.wordpress.com/2008/08/05/fast-bit-counting-routines/
http://aggregate.ee.engr.uky.edu/MAGIC/#Population%20Count%20(Ones%20Count)
Brian Kernighan's method goes through as many iterations as there are set bits. So if we have a 32-bit word with only the high bit set, then it will only go once through the loop.
Published in 1988, the C Programming Language 2nd Ed. (by Brian W. Kernighan and Dennis M. Ritchie) mentions this in exercise 2-9. On April 19, 2006 Don Knuth pointed out to him that this method "was first published by Peter Wegner in CACM 3 (1960), 322. (Also discovered independently by Derrick Lehmer and published in 1964 in a book edited by Beckenbach.)"
long count_bits(long n) {
unsigned int c; // c accumulates the total bits set in v
for (c = 0; n; c++)
n &= n - 1; // clear the least significant bit set
return c;
}
Note that this is an question used during interviews. The interviewer will add the caveat that you have "infinite memory". In that case, you basically create an array of size 232 and fill in the bit counts for the numbers at each location. Then, this function becomes O(1).
Why not iteratively divide by 2?
count = 0 while n > 0 if (n % 2) == 1 count += 1 n /= 2
EDIT: Matt, I don't currently have enough reputation to comment... :P. I agree that this isn't the fastest, but "best" is somewhat ambiguous. I'd argue though that "best" should have an element of clarity
If you happen to be using Java, the built-in method Integer.bitCount will do that.
Also consider the build-in functions of your compilers.
On the GNU compiler for example you can just use:
int __builtin_popcount (unsigned int x);
In the worst case the compiler will generate a call to a function. In the best case the compiler will emit a cpu instruction to do the same job faster.
The GCC intrinsics even work across multiple platforms. Popcount will become mainstream in the x86 architecture, so it makes sense to start using the intrinsic now. Other architectures have the popcount for years.
What do you means with "Best algorithm"? The shorted code or the fasted code? Your code look very elegant and it has a constant execution time. The code is also very short.
But if the speed is the major factor and not the code size then I think the follow can be faster:
static final int[] BIT_COUNT = { 0, 1, 1, ... 256 values with a bitsize of a byte ... };
static int bitCountOfByte( int value ){
return BIT_COUNT[ value & 0xFF ];
}
static int bitCountOfInt( int value ){
return bitCountOfByte( value )
+ bitCountOfByte( value >> 8 )
+ bitCountOfByte( value >> 16 )
+ bitCountOfByte( value >> 24 );
}
I think that this will not more faster for a 64 bit value but a 32 bit value can be faster.
From Hacker's Delight, p. 66, Figure 5-2
int pop(unsigned x)
{
x = x - ((x >> 1) & 0x55555555);
x = (x & 0x33333333) + ((x >> 2) & 0x33333333);
x = (x + (x >> 4)) & 0x0F0F0F0F;
x = x + (x >> 8);
x = x + (x >> 16);
return x & 0x0000003F;
}
Executes in ~20-ish instructions (arch dependent), no branching.
Hacker's Delight is delightful! Highly recommended.
In my opinion, the "best" solution is the one that can be read by another programmer (or the original programmer two years later) without copious comments. You may well want the fastest or cleverest solution which some have already provided but I prefer readability over cleverness anytime.
unsigned int bitCount (unsigned int value) {
unsigned int count = 0;
while (value > 0) { // until all bits are zero
if ((value & 1) == 1) { // check lower bit
count++;
}
value /= 2; // shift bits, removing lower bit
}
return count;
}
For a happy medium between a 232 lookup table and iterating through each bit individually:
int bitcount(unsigned int num){
int count = 0;
static int nibblebits[] =
{0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4};
for(; num != 0; num >>= 4)
count += nibblebits[num & 0x0f];
return count;
}
The function you are looking for is often called the "sideways sum" or "population count" of a binary number. Knuth discusses it in pre-Fascicle 1A, pp11-12 (although there was a brief reference in Volume 2, 4.6.3-(7).)
The locus classicus is Peter Wegner's article "A Technique for Counting Ones in a Binary Computer", from the Communications of the ACM, Volume 3 (1960) Number 5, page 322. He gives two different algorithms there, one optimized for numbers expected to be "sparse" (i.e., have a small number of ones) and one for the opposite case.
I got bored, and timed a billion iterations of three approaches. Compiler is gcc -O3. CPU is whatever they put in the 1st gen Macbook Pro.
Fastest is the following, at 3.7 seconds:
static unsigned char wordbits[65536] = { bitcounts of ints between 0 and 65535 };
static int popcount( unsigned int i )
{
return( wordbits[i&0xFFFF] + wordbits[i>>16] );
}
Second place goes to the same code but looking up 4 bytes instead of 2 halfwords. That took around 5.5 seconds.
Third place goes to the bit-twiddling 'sideways addition' approach, which took 8.6 seconds.
Fourth place goes to GCC's __builtin_popcount(), at a shameful 11 seconds.
The counting one-bit-at-a-time approach was waaaay slower, and I got bored of waiting for it to complete.
So if you care about performance above all else then use the first approach. If you care, but not enough to spend 64Kb of RAM on it, use the second approach. Otherwise use the readable (but slow) one-bit-at-a-time approach.
It's hard to think of a situation where you'd want to use the bit-twiddling approach.
Edit: Similar results here.
Few open questions:-
1) If the number is negative then?
2) If the number is 1024 , then the "iteratively divide by 2" method will iterate 10 times.
we can modify the algo to support the negative number as follows:-
count = 0
while n != 0
if ((n % 2) == 1 || (n % 2) == -1
count += 1
n /= 2
return count
now to overcome the second problem we can write the algo like:-
int bit_count(int num)
{
int count=0;
while(num)
{
num=(num)&(num-1);
count++;
}
return count;
}
for complete reference see :
http://goursaha.freeoda.com/Miscellaneous/IntegerBitCount.html
A simple way which should work nicely for a small amount of bits it something like this (For 4 bits in this example):
(i & 1) + (i & 2)/2 + (i & 4)/4 + (i & 8)/8
Would others recommend this for a small number of bits as a simple solution?
I wrote a fast bitcount macro for RISC machines in about 1990. It does not use advanced arithmetic (multiplication, division, %), memory fetches (way too slow), branches (way too slow), but it does assume the CPU has a 32-bit barrel shifter (in other words, >> 1 and >> 32 take the same amount of cycles.) It assumes that small constants (such as 6, 12, 24) cost nothing to load into the registers, or are stored in temporaries and reused over and over again.
With these assumptions, it counts 32 bits in about 16 cycles/instructions on most RISC machines. Note that 15 instructions/cycles is close to a lower bound on the number of cycles or instructions, because it seems to take at least 3 instructions (mask, shift, operator) to cut the number of addends in half, so log_2(32) = 5, 5 x 3 = 15 instructions is a quasi-lowerbound.
#define BitCount(X,Y) \
Y = X - ((X >> 1) & 033333333333) - ((X >> 2) & 011111111111); \
Y = ((Y + (Y >> 3)) & 030707070707); \
Y = (Y + (Y >> 6)); \
Y = (Y + (Y >> 12) + (Y >> 24)) & 077;
Here is a secret to the first and most complex step:
input output
AB CD Note
00 00 = AB
01 01 = AB
10 01 = AB - (A >> 1) & 0x1
11 10 = AB - (A >> 1) & 0x1
so if I take the 1st column (A) above, shift it right 1 bit, and subtract it from AB, I get the output (CD). The extension to 3 bits is similar; you can check it with an 8-row boolean table like mine above if you wish.