I need a minimal diff for similar 1000-byte blocks. These blocks will have at most 20% of their bits different. The flipped bits will be like radio static -- randomly flipped bits with a uniform distribution over the whole block. Here's my pseudocode using XOR and LZO compression:

minimal_diff=lzo(XOR(block1,block2))

Since the blocks are small, I'm using LZO compression in the hope that its format has minimal boilerplate.
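In Python, the whole round trip might look like the sketch below. I've used zlib as a stand-in since it ships with the standard library; an LZO binding could be swapped in the same way. block1 and block2 stand for the two 1000-byte blocks.

    import zlib

    def make_diff(block1: bytes, block2: bytes) -> bytes:
        """XOR the equal-length blocks and compress the result (biased toward 0 bits)."""
        xored = bytes(a ^ b for a, b in zip(block1, block2))
        return zlib.compress(xored, 9)

    def apply_diff(block1: bytes, diff: bytes) -> bytes:
        """Reconstruct block2 from block1 and the stored diff."""
        xored = zlib.decompress(diff)
        return bytes(a ^ b for a, b in zip(block1, xored))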

I have reviewed algorithms such as xdelta and bsdiff, but these will not work for random static noise like this; they are oriented more toward finding shifted sequences of bytes.

Can error-correcting codes work here for creating a minimal diff? How exactly?

Exact algorithms would be nice. If it's just research-paper theory and not implemented, then I'm not interested.

NOTE: The similar bits in each block line up. There is no shifting. There are just some random bit flips that differentiate the blocks.

A: 

Have you tried standard compression algorithms already? What performance do you see? You should get fairly good compression ratios on the xor of the old and new blocks, due to the high bias towards 0s.

Other than the standard options, one alternative that springs to mind is encoding each diff as a list of variable-length integers specifying the distance between flipped bits. For example, using 5-bit variable-length integers, you could describe gaps of up to 16 bits in 5 bits, gaps of up to 256 bits in 10 bits, and so forth. If there's any regularity to the intervals between flipped bits, you can run a regular compressor on this encoding for further savings.
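For illustration, a rough sketch of that gap encoding in Python. The 1-continuation-bit-plus-4-value-bits layout and the helper names are assumptions of mine, not something the answer specifies; gaps are measured between consecutive set bits of the XOR block.

    def bit_gaps(xor_block: bytes):
        """Yield the gap (>= 1) between each set bit of the XOR block and the previous one."""
        last = -1
        for pos in range(len(xor_block) * 8):
            if (xor_block[pos // 8] >> (7 - pos % 8)) & 1:
                yield pos - last
                last = pos

    def encode_gaps(gaps):
        """Pack each gap as 5-bit groups: 1 continuation bit followed by 4 value bits."""
        bits = []
        for gap in gaps:
            value = gap - 1                      # one group covers gaps 1..16
            groups = []
            while True:
                groups.append(value & 0xF)
                value >>= 4
                if value == 0:
                    break
            for i, group in enumerate(reversed(groups)):
                bits.append(1 if i < len(groups) - 1 else 0)   # 1 = another group follows
                bits.extend((group >> b) & 1 for b in (3, 2, 1, 0))
        return bits                              # 5 bits per group; pack into bytes as needed

On the XOR of the two blocks this would be called as encode_gaps(bit_gaps(xored)), and the packed result can then be fed through a regular compressor as the answer suggests.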

Nick Johnson
+3  A: 

Hi,

If it's truly random noise, then it does not really compress. This means that if you have 8,000 bits (1,000 bytes x 8 bits/byte) and every individual bit has a 1/5 (20%) probability of flipping, then you can't encode the changed bits in fewer than 8,000 x (-4/5 x log2(4/5) - 1/5 x log2(1/5)) = 8,000 x (4/5 x 0.322 + 1/5 x 2.322) = 8,000 x (0.2576 + 0.4644) = 5,776 bits, i.e. 722 bytes. This is based on Shannon's information theory.

Because the trivial way to represent the changed bits takes 1,000 bytes (just store the XOR of the two blocks), you can save at most about 30% of the space by compression. If you consistently achieve more, then the bits are not randomly distributed or the bit-flip probability is less than 20%.

Standard algorithms like Lempel-Ziv are designed for structured data (i.e. data that is not random noise). Random noise like this is best handled by simple entropy coding (Huffman, arithmetic coding, and the like). But you can save at most about 30%, so it's a question of whether it's actually worth the effort.
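To put numbers on this, here is a small self-contained check of my own, flipping exactly 20% of the bits of a random 1000-byte block and comparing zlib's output against the Shannon bound computed above:

    import math
    import random
    import zlib

    random.seed(0)
    n_bytes, p = 1000, 0.20

    block1 = bytes(random.randrange(256) for _ in range(n_bytes))
    flips = bytearray(n_bytes)
    for pos in random.sample(range(n_bytes * 8), int(n_bytes * 8 * p)):
        flips[pos // 8] |= 1 << (7 - pos % 8)          # set exactly 20% of the bits
    block2 = bytes(a ^ f for a, f in zip(block1, flips))

    xored = bytes(a ^ b for a, b in zip(block1, block2))   # equals `flips`

    entropy = -p * math.log2(p) - (1 - p) * math.log2(1 - p)   # bits per bit of the XOR
    print("Shannon bound: %.0f bytes" % (n_bytes * entropy))   # about 722
    print("zlib level 9:  %d bytes" % len(zlib.compress(xored, 9)))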

antti.huima
In your message you said at most 20% of the bits would differ, not 20% of the bytes.
Jason Orendorff
With 20% of the *bits* differing I get an average of 821 bytes with zlib. 996 with bz2, which must be byte-oriented.
Jason Orendorff
Yeah, having 20% of the BYTES randomly CHANGED is very different from FLIPPING 20% of the bits.
antti.huima
Continuing that, note that if you change 20% of the bytes to random bytes, you actually FLIP only 10% of the bits on average (because changing a bit to a random value flips it with only 50% probability). Additionally, the bit flips are correlated, since they cluster inside the changed bytes. That reduces the amount of entropy significantly.
antti.huima
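A quick simulation of that last point, randomizing 20% of the bytes in a larger buffer and counting how many bits actually flip (the ~10% figure from the comment):

    import random

    random.seed(0)
    n = 100000
    orig = bytes(random.randrange(256) for _ in range(n))

    changed = bytearray(orig)
    for i in random.sample(range(n), n // 5):        # overwrite 20% of the bytes at random
        changed[i] = random.randrange(256)           # a random byte may even equal the old one

    flipped = sum(bin(a ^ b).count("1") for a, b in zip(orig, changed))
    print("fraction of bits flipped: %.3f" % (flipped / (n * 8)))   # comes out near 0.10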