+15  A: 

Unless the data is truly random with a symmetric 1/0 distribution, this simply becomes a lossless data compression problem and is very analogous to CCITT Group 3 compression used for black-and-white (i.e., binary) fax images. CCITT Group 3 uses a Huffman coding scheme. For fax, a fixed set of Huffman codes is used, but for a given data set you can generate a specific set of codes to improve the compression ratio achieved. As long as you only need to access the bits sequentially, as you implied, this will be a pretty efficient approach. Random access would create some additional challenges, but you could probably build a binary search tree index to various offset points in the array that would let you get close to the desired location and then walk in from there.

Note: The Huffman scheme still works well even if the data is random, as long as the 1/0 distribution is not perfectly even. In general, the less even the distribution, the better the compression ratio.

Finally, if the bits are truly random with an even distribution, then, well, according to Mr. Claude Shannon, you are not going to be able to compress them by any significant amount using any scheme.
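
Just to illustrate the per-data-set idea, here is a rough Python sketch that Huffman-codes run lengths in the spirit of Group 3. The helper names are arbitrary, the output is an unpacked bitstring, and the random-access index mentioned above is left out.

    # Rough sketch only: Huffman-code run lengths, Group 3 style.
    import heapq
    from collections import Counter
    from itertools import count, groupby

    def run_lengths(bits):
        """Turn the bit array into alternating 0/1 run lengths."""
        return [sum(1 for _ in group) for _, group in groupby(bits)]

    def huffman_table(symbols):
        """Build a {symbol: bitstring} code from this data set's frequencies."""
        heap = [(weight, i, {sym: ""}) for i, (sym, weight)
                in enumerate(Counter(symbols).items())]
        heapq.heapify(heap)
        if len(heap) == 1:                      # degenerate one-symbol input
            return {sym: "0" for sym in heap[0][2]}
        tie = count(len(heap))                  # tie-breaker so dicts never compare
        while len(heap) > 1:
            w1, _, t1 = heapq.heappop(heap)
            w2, _, t2 = heapq.heappop(heap)
            merged = {s: "0" + c for s, c in t1.items()}
            merged.update({s: "1" + c for s, c in t2.items()})
            heapq.heappush(heap, (w1 + w2, next(tie), merged))
        return heap[0][2]

    # Usage: build a code specific to this data set, then encode its runs.
    # runs = run_lengths(bits)
    # table = huffman_table(runs)
    # encoded = "".join(table[r] for r in runs)   # still an unpacked bitstring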

Tall Jeff
Beautiful solution. It might even be fast, since memory loads are so costly these days.
jlouis
A: 

Straightforward lossless compression is the way to go. To make it searchable, you will have to compress relatively small blocks and create an index into the array of blocks. The index can contain the bit offset of the starting bit in each block.
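
Roughly like this sketch, where zlib stands in for whatever codec you pick, blocks are byte-aligned for simplicity, and the block size is an arbitrary placeholder; here the index is just the list of per-block buffers, but a bit-offset index into one shared stream does the same job.

    # Sketch of block-wise compression with an index; block size is tunable.
    import zlib

    BLOCK_BITS = 1 << 15                         # 32 Kbit per block (arbitrary)

    class CompressedBitArray:
        def __init__(self, packed: bytes):
            # packed: the raw bit array, 8 bits per byte, MSB first
            step = BLOCK_BITS // 8
            self.blocks = [zlib.compress(packed[i:i + step])
                           for i in range(0, len(packed), step)]

        def get(self, bit_index: int) -> int:
            block, offset = divmod(bit_index, BLOCK_BITS)
            raw = zlib.decompress(self.blocks[block])   # inflate one block only
            byte, shift = divmod(offset, 8)
            return (raw[byte] >> (7 - shift)) & 1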

Tim Ring
A: 

Quick combinatoric proof that you can't really save much space:

Suppose you have an arbitrary subset of n/2 bits set to 1 out of n total bits. That gives (n choose n/2) possibilities. By Stirling's formula, this is roughly 2^n * sqrt(2/(pi*n)). If every possibility is equally likely, then there's no way to give the more likely choices shorter representations, so we need log_2 (n choose n/2) bits, which is about n - (1/2) log_2(n) bits.

That's not a very good saving of memory. For example, if you're working with n = 2^20 (a million bits), you can only save about 10 bits. It's just not worth it.
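
If you want to check that figure, log_2 (n choose n/2) can be evaluated exactly with the log-gamma function:

    # Numeric check of the bound: bits saved = n - log2 C(n, n/2).
    import math

    n = 1 << 20                                   # 2^20 bits
    log2_choose = (math.lgamma(n + 1) - 2 * math.lgamma(n // 2 + 1)) / math.log(2)
    print(n - log2_choose)                        # prints roughly 10.3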

Having said all that, it also seems very unlikely that any really useful data is truly random. If there's any more structure to your data, there's probably a more optimistic answer.

Tyler
+1  A: 

Thanks for the answers. This is what I'm going to try for dynamically choosing the right method:

I'll collect the first N hits in a conventional bit array and choose one of three methods based on the symmetry of that sample.

  • If the sample is highly asymmetric, I'll simply store the indexes of the set bits (or maybe the distances between them) in a list; a sketch of that option appears after this list.
  • If the sample is highly symmetric, I'll keep using a conventional bit array.
  • If the sample is moderately symmetric, I'll use a lossless compression method like the Huffman coding suggested by InSciTekJeff.
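
For the first option, gap encoding would look roughly like this (a rough sketch with arbitrary names; storing gaps instead of absolute indexes keeps the numbers small):

    # Sketch of the highly asymmetric case: store the gaps between set bits.
    def to_gap_list(bits):
        gaps, last = [], -1
        for i, b in enumerate(bits):
            if b:
                gaps.append(i - last)            # distance from the previous set bit
                last = i
        return gaps

    def from_gap_list(gaps, length):
        bits, pos = [0] * length, -1
        for g in gaps:
            pos += g
            bits[pos] = 1
        return bits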

The boundaries between the asymmetric, moderate, and symmetric regions will depend on the time required by the various algorithms, balanced against the space they need, with the relative value of time versus space as an adjustable parameter. The space needed for Huffman coding is a function of the symmetry, and I'll profile that with testing. I'll also test all three methods to determine the time requirements of my implementation.
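
The selection step might look something like this, with placeholder thresholds that the profiling would replace:

    # Sketch of the dispatch on sample symmetry; thresholds are placeholders.
    LOW, HIGH = 0.05, 0.40                       # assumed, to be tuned by profiling

    def choose_method(sample_bits):
        density = sum(sample_bits) / len(sample_bits)
        skew = min(density, 1 - density)         # 0 = fully asymmetric, 0.5 = even
        if skew < LOW:
            return "index-list"                  # store positions/gaps of rare bits
        if skew > HIGH:
            return "plain-bit-array"
        return "huffman-blocks"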

It's possible (and actually I'm hoping) that the middle compression method will always beat the list, the bit array, or both. Maybe I can encourage this by choosing a set of Huffman codes adapted for higher or lower symmetry; then I could simplify the system to just two methods.

erickson
+1  A: 

One more compression thought:

If the bit array is not crazy long, you could try applying the Burrows-Wheeler transform before any run-length or Huffman encoding. A naive implementation would take O(n^2) memory during (de)compression and O(n^2 log n) time to decompress; there are almost certainly shortcuts to be had as well. But if there's any sequential structure to your data at all, this should really help the Huffman encoding out.
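
A naive version looks something like this rough sketch (a real implementation would build a suffix array instead of sorting whole rotations):

    # Naive BWT: sort all rotations (O(n^2 log n) time, O(n^2)-ish memory).
    def bwt(data: bytes):
        n = len(data)
        rows = sorted(range(n), key=lambda i: data[i:] + data[:i])
        last = bytes(data[(i - 1) % n] for i in rows)   # last column
        return last, rows.index(0)                      # index of the original row

    def inverse_bwt(last: bytes, idx: int) -> bytes:
        # A stable sort of the last column gives, for each row, where to step next.
        order = sorted(range(len(last)), key=lambda i: last[i])
        out, row = bytearray(), idx
        for _ in range(len(last)):
            row = order[row]
            out.append(last[row])
        return bytes(out)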

You could also apply that idea one block at a time to keep the time and memory usage practical. Working one block at a time would let you keep most of the data structure compressed if you're reading and writing sequentially.

Tyler
+4  A: 

I would strongly consider using range encoding in place of Huffman coding. In general, range encoding can exploit asymmetry more effectively than Huffman coding, and this is especially true when the alphabet is this small. In fact, when the "native alphabet" is simply 0s and 1s, the only way Huffman can get any compression at all is by combining those symbols -- which is exactly what range encoding does, more effectively.
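
To see the idea on a binary alphabet, here is a toy coder using exact fractions; this is only a sketch -- a real range coder uses fixed-width integers with renormalization, and the fixed probability p1 here is a stand-in for an adaptive model.

    # Toy range/arithmetic coder for a binary alphabet; p1 = P(bit == 1).
    from fractions import Fraction

    def encode(bits, p1):
        low, width = Fraction(0), Fraction(1)
        for b in bits:
            split = width * (1 - p1)
            if b:
                low, width = low + split, width * p1
            else:
                width = split
        # Emit the shortest k-bit binary fraction inside [low, low + width).
        k = 1
        while True:
            scaled = low * (1 << k)
            code = Fraction(-(-scaled.numerator // scaled.denominator), 1 << k)
            if code < low + width:
                return code, k                   # k = number of output bits
            k += 1

    def decode(code, n, p1):
        bits, low, width = [], Fraction(0), Fraction(1)
        for _ in range(n):
            split = low + width * (1 - p1)
            if code >= split:
                bits.append(1)
                low, width = split, width * p1
            else:
                bits.append(0)
                width = width * (1 - p1)
        return bits

    # Skewed input compresses below one output bit per input bit, e.g.
    # encode([0]*9 + [1], Fraction(1, 10)) fits in 5 or fewer code bits.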

Thanks, Antaeus. I had actually looked into range coding already, since the accepted answer cited Huffman coding as just one example of lossless compression. However, Huffman is easy to implement and works well on moderately asymmetric input, and for highly asymmetric input, run-length methods are good.
erickson
+2  A: 

Maybe it's too late for you, but there is a very fast and memory-efficient library for (lossless) sparse bit arrays and other data types based on tries: look at Judy arrays.

bill
Thanks Bill. I remember hearing about Judy arrays before but I had completely forgotten about them. I will take another look at them.
erickson