I need a minimal diff for similar 1000-byte blocks. These blocks will have at most 20% of their bits different. The flipped bits will be like radio static -- randomly flipped bits with a uniform distribution over the whole block. Here's my pseudocode using XOR and LZO compression:

minimal_diff=lzo(XOR(block1,block2))

Since the blocks are small, I'm using LZO compression in the hope that its format has minimal boilerplate.
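In Python, the whole round trip might look like the sketch below. I've used zlib as a stand-in since it ships with the standard library; an LZO binding could be swapped in the same way. block1 and block2 stand for the two 1000-byte blocks.

    import zlib

    def make_diff(block1: bytes, block2: bytes) -> bytes:
        """XOR the equal-length blocks and compress the result (biased toward 0 bits)."""
        xored = bytes(a ^ b for a, b in zip(block1, block2))
        return zlib.compress(xored, 9)

    def apply_diff(block1: bytes, diff: bytes) -> bytes:
        """Reconstruct block2 from block1 and the stored diff."""
        xored = zlib.decompress(diff)
        return bytes(a ^ b for a, b in zip(block1, xored))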

I have reviewed algorithms such as xdelta and bsdiff, but these will not work for random static noise like this; they are oriented more toward finding shifted sequences of bytes.

Can error-correcting codes work here for creating a minimal diff? How exactly?

Exact algorithms would be nice. If it's just research-paper theory and not implemented, then I'm not interested.

NOTE: The similar bits in each block line up. There is no shifting. There are just some random bit flips that differentiate the blocks.

A: 

Have you tried standard compression algorithms already? What performance do you see? You should get fairly good compression ratios on the xor of the old and new blocks, due to the high bias towards 0s.

Other than the standard options, one alternative that springs to mind is encoding each diff as a list of variable-length integers specifying the distance between flipped bits. For example, using 5-bit variable-length integers, you could describe gaps of up to 16 bits in 5 bits, gaps of up to 256 bits in 10 bits, and so forth. If there's any regularity to the intervals between flipped bits, you can run a regular compressor on this encoding for further savings.
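For illustration, a rough sketch of that gap encoding in Python. The 1-continuation-bit-plus-4-value-bits layout and the helper names are assumptions of mine, not something the answer specifies; gaps are measured between consecutive set bits of the XOR block.

    def bit_gaps(xor_block: bytes):
        """Yield the gap (>= 1) between each set bit of the XOR block and the previous one."""
        last = -1
        for pos in range(len(xor_block) * 8):
            if (xor_block[pos // 8] >> (7 - pos % 8)) & 1:
                yield pos - last
                last = pos

    def encode_gaps(gaps):
        """Pack each gap as 5-bit groups: 1 continuation bit followed by 4 value bits."""
        bits = []
        for gap in gaps:
            value = gap - 1                      # one group covers gaps 1..16
            groups = []
            while True:
                groups.append(value & 0xF)
                value >>= 4
                if value == 0:
                    break
            for i, group in enumerate(reversed(groups)):
                bits.append(1 if i < len(groups) - 1 else 0)   # 1 = another group follows
                bits.extend((group >> b) & 1 for b in (3, 2, 1, 0))
        return bits                              # 5 bits per group; pack into bytes as needed

On the XOR of the two blocks this would be called as encode_gaps(bit_gaps(xored)), and the packed result can then be fed through a regular compressor as the answer suggests.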

Nick Johnson
+3  A: 

Hi,

If it's truly random noise, then it does not really compress. This means that if you have 8,000 bits (1,000 bytes x 8 bits/byte) and every individual bit has a 1/5 (20%) probability of flipping, then you can't encode the changed bits in fewer than 8,000 x (-4/5 x log2(4/5) - 1/5 x log2(1/5)) = 8,000 x (4/5 x 0.322 + 1/5 x 2.322) = 8,000 x (0.2576 + 0.4644) = 5,776 bits, i.e. 722 bytes. This is based on Shannon's information theory.

Because the trivial way to represent the changed bits takes 1,000 bytes (just store the XOR of the two blocks), you can save at most about 30% of the space by compression. If you consistently achieve more, then the bits are not randomly distributed or the bit-flip probability is less than 20%.

Standard algorithms like Lempel-Ziv are designed for structured data (i.e. data that is not random noise). Random noise like this is best handled by simple entropy coding (Huffman, arithmetic coding, and the like). But you can save at most about 30%, so it's a question of whether it's actually worth the effort.
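To put numbers on this, here is a small self-contained check of my own, flipping exactly 20% of the bits of a random 1000-byte block and comparing zlib's output against the Shannon bound computed above:

    import math
    import random
    import zlib

    random.seed(0)
    n_bytes, p = 1000, 0.20

    block1 = bytes(random.randrange(256) for _ in range(n_bytes))
    flips = bytearray(n_bytes)
    for pos in random.sample(range(n_bytes * 8), int(n_bytes * 8 * p)):
        flips[pos // 8] |= 1 << (7 - pos % 8)          # set exactly 20% of the bits
    block2 = bytes(a ^ f for a, f in zip(block1, flips))

    xored = bytes(a ^ b for a, b in zip(block1, block2))   # equals `flips`

    entropy = -p * math.log2(p) - (1 - p) * math.log2(1 - p)   # bits per bit of the XOR
    print("Shannon bound: %.0f bytes" % (n_bytes * entropy))   # about 722
    print("zlib level 9:  %d bytes" % len(zlib.compress(xored, 9)))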

antti.huima
In your message you said at most 20% of the bits would differ, not 20% of the bytes.
Jason Orendorff
With 20% of the *bits* differing I get an average of 821 bytes with zlib. 996 with bz2, which must be byte-oriented.
Jason Orendorff
Yeah, having 20% of the BYTES randomly CHANGED is very different from FLIPPING 20% of the bits.
antti.huima
Continuing that, note that if you change 20% of the bytes to random bytes, you actually FLIP only 10% of the bits on average (because changing a bit to a random value flips it with only 50% probability). Additionally, the bit flips are correlated, since they cluster inside the changed bytes. That reduces the amount of entropy significantly.
antti.huima
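A quick simulation of that last point, randomizing 20% of the bytes in a larger buffer and counting how many bits actually flip (the ~10% figure from the comment):

    import random

    random.seed(0)
    n = 100000
    orig = bytes(random.randrange(256) for _ in range(n))

    changed = bytearray(orig)
    for i in random.sample(range(n), n // 5):        # overwrite 20% of the bytes at random
        changed[i] = random.randrange(256)           # a random byte may even equal the old one

    flipped = sum(bin(a ^ b).count("1") for a, b in zip(orig, changed))
    print("fraction of bits flipped: %.3f" % (flipped / (n * 8)))   # comes out near 0.10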