+2  A: 
Jaap Weidemann
Works like a charm. Thanks!
BitTwiddler1011
+1  A: 

If you're on a 32-bit machine, split w into two 32-bit words, calculate the popcount separately for each half, then add up. This will get rid of some unneeded operations that are required to synthesize 64-bit operations from 32-bit ones (shifts, mults...). This also allows for increased parallelism if you interleave the calculations.

If you're compiling 64-bit code, you may try this:

int popcnt64(uint64_t w)
{
   uint64_t w1 = (w & 0x2222222222222222) + ((w+w) & 0x2222222222222222);
   uint64_t w2 = (w >> 1 & 0x2222222222222222) + (w >> 2 & 0x2222222222222222);
   w1 = w1 + (w1 >> 4) & 0x0f0f0f0f0f0f0f0f;
   w2 = w2 + (w2 >> 4) & 0x0f0f0f0f0f0f0f0f;
   return (w1 + w2) * 0x0101010101010101 >> 57;
}

This contains more operations, but gives more opportunities of parallel execution to the CPU. On newer CPUs it should be slightly faster, on others it will be slightly slower.

slacker