Given a file (binary or textual), what is the fastest or most elegant way in C++ to count the ones and zeros in the binary representation of that file?
I would read in the file one unsigned int
at a time and count the number of bits turned on in each until EOF. You can use the classic sparse count algorithm to count the number of 1
s in an unsigned int
value.
int bitcount (unsigned int n)
{
int count = 0 ;
while (n) {
count++ ;
n &= (n - 1) ;
}
return count ;
}
Just do the above for all read in unsigned int
values and keep a running total.
On most any system the main execution time will be i/o-bound. And how to achieve fastest i/o depends very much on the system. So there's no single answer to "fastest".
Most "elegant" depends on the eyes that see.
So in summary, neither question has a definitive answer; it sounds like fishing for solutions to a homework assignment. Is it homework?
Create a 256 length char array. Stream through the file a byte at a time incrementing the array position of each byte. Hard code the number of 1s in each of the 256 bytes in another array. Multiply the 2 arrays and sum.
Not sure about elegance and definitely requires more memory, but might be faster than linuxuser27's algorithm.
Whenever I want to know the best bit manipulation trick for a particular task, I start here: http://graphics.stanford.edu/~seander/bithacks.html
Under "Counting bits set, in parallel" they list a 12 operation method for counting the bits set in a 32 bit integer. They also show methods for integers larger than 32 bits.
I would recommend you use results array:
unsigned char cheating[] = { 0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3,
4, 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5, 1, 2, 2, 3, 2,
3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5, 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4,
5, 4, 5, 5, 6, 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5, 2,
3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, 2, 3, 3, 4, 3, 4, 4,
5, 3, 4, 4, 5, 4, 5, 5, 6, 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5,
6, 6, 7, 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5, 2, 3, 3,
4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, 2, 3, 3, 4, 3, 4, 4, 5, 3,
4, 4, 5, 4, 5, 5, 6, 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6,
7, 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, 3, 4, 4, 5, 4,
5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7, 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5,
6, 5, 6, 6, 7, 4, 5, 5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6, 7, 7, 8 };
After you read the file in as unsigned char array you can just sum:
for (int i = 0; i < arraysize; i++) {
sum0+=8-cheating[unsignedbytearray[i]];
sum1+=cheating[unsignedbytearray[i]];
}
It is very hard to write code, that would be faster or more elegant :P
I would try to use std::bitset
, it has a good implementation of the count()
method (count interface) by keeping a pre-computed array of length 256 for count of all possible
bytes. For counting zeros,
std::bitset<N> bs;
size_t zero_count = bs.size() - bs.count();
I would initialize file_zero_count = 0
and open the file. Then in every iteration of a loop, I would read N
bits into a bitset, and add the zero_count
of those N
bits to the file_zero_count
. Then read the next N
bits and so on ...
The crucial choice is the value of N
. My first choice would be such that N
bits fit into one memory page, for e.g. N = 4096
.
One easy and quick way is to build a look-up table which tells you how many 1's any of the chars has, e.g. 'a' with ASCII 97 has 3. You can store such a look-up table in an fixed length array for constant-time access. Load the file page-by-page into the memory to minimize the number of I/O's and count for every char sequentially.
If you more information about the type of data you are processing, more interesting solutions can be created. For example, if you know that the file contains natural language textual data, you can build look-up tables for frequently occurring terms such as 'the', 'of' and 'and' to speed-up counting computations. Such a look-up table can efficiently be stored in hash table.
I would read in the file byte by byte and count the number of 1/0 in each byte:
The reason I would choose byte is that it is easy to put together a lookuptable for the number of 1 in a byte by hand. Note: I was counting the number of ones in a byte. But I built the table backwards (so it is actually a count of the number of 0's (as it is the inverse of the 1's)).
int countOne[] = { 0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4,
1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5, // Top Line + 1 2^5 (16 set)
1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5, // Top Line + 1 2^6 (32 set)
2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, // Top Line + 2 2^5 + 2^6 (16/32 set)
1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5, // Top Line + 1 2^7 (64 set)
2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, // Top Line + 2 2^5 + 2^7 (16/64 set)
2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, // Top Line + 2 2^6 + 2^7 (32/64 set)
3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7, // Top Line + 3 2^5 + ^6 + 2^7 (16/32/64 set)
1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5, // Top Line + 1 2^8 (128 set)
2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, // Top Line + 2 2^5 + 2^8 (16/128 set)
2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, // Top Line + 2 2^6 + 2^8 (32/128 set)
3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7, // Top Line + 3 2^5 + 2^6 + 2^8 (16/32/128 set)
2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, // Top Line + 2 2^7 + 2^8 (64/128 set)
3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7, // Top Line + 3 2^5 + 2^7 + 2^8 (16/64/128 set)
3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7, // Top Line + 3 2^6 + 2^7 + 2^8 (32/64/128 set)
4, 5, 5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6, 7, 7, 8};
Now all you need to do is use a std::for_each that iterater across a stream (without dropping spaces.
counter = std::for_each(std::istreambuf_iterator<unsigned char>(file),
std::istreambuf_iterator<unsigned char>(),
couter);
Now you just need to define an appropriate type for counter and the rest is history.
Here's a complete example (well almost there's an exercise for the implementor at the end...) It uses the 12 instruction count for 32 byte values from http://graphics.stanford.edu/~seander/bithacks.html . It also loads the file using mmap as that is (often) faster than other read methods. I think the only thing to do to make it faster would be to split the count across multiple threads. But on my system it already processes 10MB files in under 0.03s, which seems OK to me.
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <iostream>
#include <unistd.h>
int main()
{
int fd = open("junk.txt",O_RDWR);
struct stat info;
fstat(fd, &info);
void * page = mmap(0, info.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
uint64_t bitcount = 0;
//Lets ignore alignment issues for now - I think mmap guarantees that they're OK.
uint32_t * v_ptr = static_cast<uint32_t*>(page);
for(unsigned int i=0; i<info.st_size/4; ++i)
{
uint32_t v = *v_ptr;
v = v - ((v >> 1) & 0x55555555); // reuse input as temporary
v = (v & 0x33333333) + ((v >> 2) & 0x33333333); // temp
bitcount += ((v + (v >> 4) & 0xF0F0F0F) * 0x1010101) >> 24; // count
++v_ptr;
}
//Need to adjust for the last 0-3 bytes... Exercise for the reader
munmap( page, info.st_size );
std::cout<<"Total of "<<bitcount<<" set bits"<<std::endl;
}