ansaurus

Question

Best algorithm to count the number of set bits in a 32-bit integer?

Answer 1

+69 A:

This is known as the 'Hamming Weight', 'popcount' or 'sideways addition'.

The 'best' algorithm really depends on which CPU you are on and what your usage pattern is.

Some CPUs have a single built-in instruction to do it, and others have parallel instructions which act on bit vectors. The parallel instructions will almost certainly be fastest, however, the single-instruction algorithms are 'are usually microcoded loops that test a bit per cycle; a log-time algorithm coded in C is often faster'.

A pre-populated table lookup method can be very fast if your CPU has a large cache and/or you are doing lots of these instructions in a tight loop. However it can suffer because of the expense of a 'cache miss', where the CPU has to fetch some of the table from main memory.

If you know that your bytes will be mostly 0's or mostly 1's then there are very efficient algorithms for these scenarios.

I believe a very good general purpose algorithm is the following, known as 'parallel' or 'variable-precision SWAR algorithm':

int NumberOfSetBits(int i)
{
    i = i - ((i >> 1) & 0x55555555);
    i = (i & 0x33333333) + ((i >> 2) & 0x33333333);
    return ((i + (i >> 4) & 0xF0F0F0F) * 0x1010101) >> 24;
}

This is because it has the best worst-case behaviour of any of the algorithms discussed, so will efficiently deal with any usage pattern or values you throw at it.

References:

http://graphics.stanford.edu/~seander/bithacks.html

http://en.wikipedia.org/wiki/Hamming_weight

http://gurmeetsingh.wordpress.com/2008/08/05/fast-bit-counting-routines/

http://aggregate.ee.engr.uky.edu/MAGIC/#Population%20Count%20(Ones%20Count)

Matt Howells 2008-09-20 19:05:22

+1. The first line in your NumberOfSetBits() is very cool -- just 3 instructions, instead of the 4 you would need if you separately masked out the even- and odd-numbered bits and added them (appropriately shifted) together.

j_random_hacker 2009-05-25 12:23:35

ha! love the NumberOfSetBits() function, but good luck getting that through a code review. :-)

Jason S 2009-11-22 06:51:19

It's write-only code. Just put a comment that you are not meant to understand or maintain this code, just worship the gods that revealed it to mankind. I am not one of them, just a prophet. :)

Matt Howells 2009-11-23 09:29:14

+1 for your witty and convincing defense.

Guge 2009-11-27 00:06:58

Maybe it should use `unsigned int`, to easily show that it is free of any sign bit complications. Also would `uint32_t` be safer, as in, you get what you expect on all platforms?

Craig McQueen 2009-12-15 02:18:41

I checked a few things while answering to this question http://stackoverflow.com/questions/2709430/count-number-of-bits-in-a-64-bit-long-big-integer/2709523#2709523 and I think there is a bug. last line: return ((i + (i >> 4) should be changed to: return (((i + (i >> 4))

Maciej Hehl 2010-04-25 19:56:02

@Maciej: Can you please provide a case where it does not return the expected result?

Matt Howells 2010-04-28 10:04:26

@Matt Howells I'm sorry. It looks like I got lost among those parenthesis. I had a bug in my implementation. it didn't work for numbers > 15. I checked the wikipedia article and there were those parenthesis. I was sure that was my problem. I fixed the parenthesis and it started to work. Looks like I fixed something else. But I think I learned something. Inability to reproduce the "bug" made me check the precedence of the operators :). Thank You and I apologize for the mess

Maciej Hehl 2010-04-29 14:19:51

Answer 2

+26 A:

Brian Kernighan's method goes through as many iterations as there are set bits. So if we have a 32-bit word with only the high bit set, then it will only go once through the loop.

Published in 1988, the C Programming Language 2nd Ed. (by Brian W. Kernighan and Dennis M. Ritchie) mentions this in exercise 2-9. On April 19, 2006 Don Knuth pointed out to him that this method "was first published by Peter Wegner in CACM 3 (1960), 322. (Also discovered independently by Derrick Lehmer and published in 1964 in a book edited by Beckenbach.)"

long count_bits(long n) {     
  unsigned int c; // c accumulates the total bits set in v
  for (c = 0; n; c++) 
    n &= n - 1; // clear the least significant bit set
  return c;
}

Note that this is an question used during interviews. The interviewer will add the caveat that you have "infinite memory". In that case, you basically create an array of size 2³² and fill in the bit counts for the numbers at each location. Then, this function becomes O(1).

hoyhoy 2008-09-20 19:08:27

Answer 3

+4 A:

Why not iteratively divide by 2?

count = 0
while n > 0
  if (n % 2) == 1
    count += 1
  n /= 2

EDIT: Matt, I don't currently have enough reputation to comment... :P. I agree that this isn't the fastest, but "best" is somewhat ambiguous. I'd argue though that "best" should have an element of clarity

daniel 2008-09-20 19:10:30

That'll work and is easy to understand, but there are faster methods.

Matt Howells 2008-09-20 19:40:13

Unless you do this a *LOT*, the performance impact would be negligible. So all things being equal, I agree with daniel that 'best' implies "doesn't read like gibberish".

2008-09-20 21:50:32

I deliberately didn't define 'best', to get a variety of methods. Lets face it if we have got down to the level of this sort of bit-twiddling we are probably looking for something uber-fast that looks like a chimp has typed it.

Matt Howells 2008-09-21 17:47:38

Bad code. A compiler might make good one out of it, but in my tests GCC did not. Replace (n%2) with (n AND being much faster than MODULO. Replace (n/=2) with (n>>=1); bitshifting much faster than division.

Mecki 2008-09-25 10:35:31

@Mecki: In my tests, gcc (4.0, -O3) *did* do the obvious optimisations.

2008-09-25 13:32:50

Answer 4

+3 A:

If you happen to be using Java, the built-in method Integer.bitCount will do that.

Noether 2008-09-20 19:14:39

Answer 5

+30 A:

Also consider the build-in functions of your compilers.

On the GNU compiler for example you can just use:

  int __builtin_popcount (unsigned int x);

In the worst case the compiler will generate a call to a function. In the best case the compiler will emit a cpu instruction to do the same job faster.

The GCC intrinsics even work across multiple platforms. Popcount will become mainstream in the x86 architecture, so it makes sense to start using the intrinsic now. Other architectures have the popcount for years.

Nils Pipenbrinck 2008-09-20 19:23:05

I agree that this is good practice in general, but on XCode/OSX/Intel I found it to generate slower code than most of the suggestions posted here. See my answer for details.

2008-09-25 03:29:58

AFAIK the only x86 CPU that could do a pop-count in a single instruction would be the AMD Phenom/Barcelona (Family 10h). has about a latency of 4 cycles or so?

Calyth 2009-01-09 23:38:30

The Intel i5/i7 has the SSE4 instruction POPCNT which does it, using general purpose registers. GCC on my system does not emit that instruction using this intrinsic, i guess because of no -march=nehalem option yet.

matja 2009-11-24 10:31:53

@matja, my GCC 4.4.1 emits the popcnt instruction if I compile with -msse4.2

Nils Pipenbrinck 2009-11-24 13:29:35

@Nils Pipenbrinck, good point, that works.

matja 2009-11-24 14:29:33

use c++'s `std::bitset::count`. after inlining this compiles to a single `__builtin_popcount` call.

caspin 2010-09-04 18:18:38

cool. Didn't knew that. What compiler are you using?

Nils Pipenbrinck 2010-09-05 00:54:55

Answer 6

+3 A:

What do you means with "Best algorithm"? The shorted code or the fasted code? Your code look very elegant and it has a constant execution time. The code is also very short.

But if the speed is the major factor and not the code size then I think the follow can be faster:

       static final int[] BIT_COUNT = { 0, 1, 1, ... 256 values with a bitsize of a byte ... };
        static int bitCountOfByte( int value ){
            return BIT_COUNT[ value & 0xFF ];
        }

        static int bitCountOfInt( int value ){
            return bitCountOfByte( value ) 
                 + bitCountOfByte( value >> 8 ) 
                 + bitCountOfByte( value >> 16 ) 
                 + bitCountOfByte( value >> 24 );
        }

I think that this will not more faster for a 64 bit value but a 32 bit value can be faster.

Horcrux7 2008-09-20 19:31:08

My code has 10 operation. Your code has 12 operation. Your link work with smaller arrays (5). I use 256 elements. With the caching can be a problem. But if you use it very frequently then this is not a problem.

Horcrux7 2008-09-20 21:12:35

This approach is measurably quite a bit faster than the bit-twiddling approach, as it turns out. As for using more memory, it compiles to less code and that gain is repeated every time you inline the function. So it could easily turn out to be a net win.

2008-09-25 03:43:04

Answer 7

+14 A:

From Hacker's Delight, p. 66, Figure 5-2

int pop(unsigned x)
{
    x = x - ((x >> 1) & 0x55555555);
    x = (x & 0x33333333) + ((x >> 2) & 0x33333333);
    x = (x + (x >> 4)) & 0x0F0F0F0F;
    x = x + (x >> 8);
    x = x + (x >> 16);
    return x & 0x0000003F;
}

Executes in ~20-ish instructions (arch dependent), no branching.

Hacker's Delight is delightful! Highly recommended.

Kevin Little 2008-09-20 19:38:00

Ah, another reader of that delightful book. I keep it close and read bits of it almost every day :D Definitely a good book!

freespace 2008-09-21 04:30:05

+1 for the reference.

J.F. Sebastian 2009-01-12 12:48:25

I previewed Hacker's Delight on Amazon and now it's on its way here. Thanks for the recommendation! :D

Carson Myers 2009-11-24 18:25:09

Answer 8

+8 A:

In my opinion, the "best" solution is the one that can be read by another programmer (or the original programmer two years later) without copious comments. You may well want the fastest or cleverest solution which some have already provided but I prefer readability over cleverness anytime.

unsigned int bitCount (unsigned int value) {
    unsigned int count = 0;
    while (value > 0) {           // until all bits are zero
        if ((value & 1) == 1) {   // check lower bit
            count++;
        }
        value /= 2;               // shift bits, removing lower bit
    }
    return count;
}

paxdiablo 2008-09-21 01:14:03

Instead of dividing by 2 and commenting it as "shift bits...", you should just use the shift operator (>>) and leave out the comment.

indiv 2008-09-25 03:42:31

Nah, then he has to comment "divide by 2"...

Johannes Schaub - litb 2009-07-13 19:04:40

wouldn't it make more sense to replace `if ((value

Wallacoloo 2010-04-25 19:04:49

No, the best solution isn't the one most readable in this case. Here the best algorithm is the fastest one.

nikic 2010-09-23 07:55:16

That's entirely your opinion, @nikic, although you're free to downvote me, obviously. There was no mention in the question as to how to quantify "best", the words "performance" or "fast" can be seen nowhere. That's why I opted for readable.

paxdiablo 2010-09-23 08:57:06

@paxdiablo: Right as you are I changed my vote. I haven't read the question carefully enough. (I had to make a ghost-edit so I can change the vote. Hopefully you're okay with that.)

nikic 2010-09-23 10:34:47

Answer 9

+7 A:

For a happy medium between a 2³² lookup table and iterating through each bit individually:

int bitcount(unsigned int num){
    int count = 0;
    static int nibblebits[] =
        {0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4};
    for(; num != 0; num >>= 4)
        count += nibblebits[num & 0x0f];
    return count;
}

From http://ctips.pbwiki.com/CountBits

PhirePhly 2008-09-22 03:55:59

Answer 10

+3 A:

2008-09-23 01:29:53

Answer 11

+4 A:

The function you are looking for is often called the "sideways sum" or "population count" of a binary number. Knuth discusses it in pre-Fascicle 1A, pp11-12 (although there was a brief reference in Volume 2, 4.6.3-(7).)

The locus classicus is Peter Wegner's article "A Technique for Counting Ones in a Binary Computer", from the Communications of the ACM, Volume 3 (1960) Number 5, page 322. He gives two different algorithms there, one optimized for numbers expected to be "sparse" (i.e., have a small number of ones) and one for the opposite case.

Michael Dorfman 2008-09-23 09:20:15

Answer 12

+10 A:

I got bored, and timed a billion iterations of three approaches. Compiler is gcc -O3. CPU is whatever they put in the 1st gen Macbook Pro.

Fastest is the following, at 3.7 seconds:

static unsigned char wordbits[65536] = { bitcounts of ints between 0 and 65535 };
static int popcount( unsigned int i )
{
    return( wordbits[i&0xFFFF] + wordbits[i>>16] );
}

Second place goes to the same code but looking up 4 bytes instead of 2 halfwords. That took around 5.5 seconds.

Third place goes to the bit-twiddling 'sideways addition' approach, which took 8.6 seconds.

Fourth place goes to GCC's __builtin_popcount(), at a shameful 11 seconds.

The counting one-bit-at-a-time approach was waaaay slower, and I got bored of waiting for it to complete.

So if you care about performance above all else then use the first approach. If you care, but not enough to spend 64Kb of RAM on it, use the second approach. Otherwise use the readable (but slow) one-bit-at-a-time approach.

It's hard to think of a situation where you'd want to use the bit-twiddling approach.

Edit: Similar results here.

2008-09-25 02:46:32

@Mike, The table based approach is unbeatable if the table is in the cache. This happens in micro-benchmarks (e.g. do millions of tests in a tight loop). However, a cache miss takes around 200 cycles, and even the most naive popcount will be faster here.It always depends on the application.

Nils Pipenbrinck 2008-09-25 04:42:23

If you're not calling this routine a few million times in a tight loop then you have no reason to care about it's performance at all, and might as well use the naive-but-readable approach since the performance loss will be negligible. And FWIW, the 8bit LUT gets cache-hot within 10-20 calls.

2008-09-25 11:02:03

I don't think it's all that hard to imagine a situation where this is a leaf call made from the method -actually doing the heavy lifting- in your app. Depending on what else is going on (and threading) the smaller version could win. Lots of algorithms have been written that beat their peers due to better locality of reference. Why not this too?

Jason 2010-05-06 08:50:25

Try this with clang, it's _significantly_ smarter at implementing builtins.

Matt Joiner 2010-10-05 05:23:06

Answer 13

+7 A:

2009-10-02 21:34:54

The best non-assembler form like this I've seen unrolled 24 32bit words at a time. http://dalkescientific.com/writings/diary/popcnt.c, http://stackoverflow.com/questions/3693981/bit-popcount-for-large-buffer-assembly-preferred, http://www.dalkescientific.com/writings/diary/archive/2008/07/05/bitslice_and_popcount.html

Matt Joiner 2010-10-05 05:25:53

Answer 14

A:

Few open questions:-

1) If the number is negative then?

2) If the number is 1024 , then the "iteratively divide by 2" method will iterate 10 times.

we can modify the algo to support the negative number as follows:-

count = 0

while n != 0

if ((n % 2) == 1 || (n % 2) == -1

count += 1

n /= 2

return count

now to overcome the second problem we can write the algo like:-

int bit_count(int num)

{

int count=0;

while(num)

{

    num=(num)&(num-1);

    count++;
}

return count;

}

for complete reference see :

http://goursaha.freeoda.com/Miscellaneous/IntegerBitCount.html

Baban 2010-05-07 04:51:57

Answer 15

A:

A simple way which should work nicely for a small amount of bits it something like this (For 4 bits in this example):

(i & 1) + (i & 2)/2 + (i & 4)/4 + (i & 8)/8

Would others recommend this for a small number of bits as a simple solution?

Matthew Mitchell 2010-06-01 21:54:32

Answer 16

+1 A:

I wrote a fast bitcount macro for RISC machines in about 1990. It does not use advanced arithmetic (multiplication, division, %), memory fetches (way too slow), branches (way too slow), but it does assume the CPU has a 32-bit barrel shifter (in other words, >> 1 and >> 32 take the same amount of cycles.) It assumes that small constants (such as 6, 12, 24) cost nothing to load into the registers, or are stored in temporaries and reused over and over again.

With these assumptions, it counts 32 bits in about 16 cycles/instructions on most RISC machines. Note that 15 instructions/cycles is close to a lower bound on the number of cycles or instructions, because it seems to take at least 3 instructions (mask, shift, operator) to cut the number of addends in half, so log_2(32) = 5, 5 x 3 = 15 instructions is a quasi-lowerbound.

#define BitCount(X,Y)           \
                Y = X - ((X >> 1) & 033333333333) - ((X >> 2) & 011111111111); \
                Y = ((Y + (Y >> 3)) & 030707070707); \
                Y =  (Y + (Y >> 6)); \
                Y = (Y + (Y >> 12) + (Y >> 24)) & 077;

Here is a secret to the first and most complex step:

input output
AB    CD             Note
00    00             = AB
01    01             = AB
10    01             = AB - (A >> 1) & 0x1
11    10             = AB - (A >> 1) & 0x1

so if I take the 1st column (A) above, shift it right 1 bit, and subtract it from AB, I get the output (CD). The extension to 3 bits is similar; you can check it with an 8-row boolean table like mine above if you wish.

Don Gillies

systemBuilder 2010-06-11 21:40:40

ansaurus

tags:

views:

answers:

Best algorithm to count the number of set bits in a 32-bit integer?

related questions