Say there's an array of 1024 bits that are all zeros:

example: [0,0,0,0,0,0,0,...]

Then I overwrite 20 zeros with ones at completely random positions:

example: [0,1,0,0,0,0,0,...]

What is the theoretical minimum number of bits needed to encode the location of these 20 randomly placed bits, assuming that I had a perfect encoder?

I know there are communication theory equations that will tell me this, but I want to double check my calculations.

Harder bonus question: Show me the code for an algorithm that implements an encoding that approaches this minimum limit.

Bonus bonus: What if the bit flips were at the byte level instead of the bit level? E.g. entire bytes flipped rather than individual bits. Same result?

+1  A: 

If you treated a string of 200 bits as an array of twenty 10-bit numbers, each listing the position of one of the one-bits, you'd save 824 bits compared to sending the raw 1024.

But I don't think this is the minimum. For example, if you store each number relative to the previous one instead of as an absolute position, analysis may show that on average you'd only need, say, 8 bits to encode the distance to the next one-bit. So add a flag bit to the front: when it's 0, 200 bits of absolute positions follow; when it's 1, 160 bits of relative positions follow. This should yield a lower average number of bits to encode the full value.
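A minimal sketch of that flag-bit scheme in Python (assuming the twenty positions are already known and sorted; the 8-bit gap size and 256 cutoff are just one possible choice):

    def encode_flagged(positions):
        # positions: sorted list of 20 indices in [0, 1024)
        # Gaps: distance from the start, then between consecutive one-bits.
        gaps = [positions[0]] + [b - a for a, b in zip(positions, positions[1:])]
        if all(g < 256 for g in gaps):
            # Flag '1': twenty 8-bit relative gaps -> 1 + 160 bits.
            return '1' + ''.join(format(g, '08b') for g in gaps)
        # Flag '0': twenty 10-bit absolute positions -> 1 + 200 bits.
        return '0' + ''.join(format(p, '010b') for p in positions)

For 20 uniformly random positions in 1024 slots the gaps are usually all under 256, so the 161-bit branch is the common case.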

Generalizing, this is simply data compression. There are probably many compression algorithms that could reduce the average number of bits required to encode your "twenty one-bits in 1024" to a very small number. Computing a suitable binary tree, storing a representation of it, then storing the bits required to traverse the tree would likely yield a very efficient algorithm (this is in fact the basis of modern data compression).
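For the binary-tree idea, a bare-bones Huffman table builder is sketched below (assuming you had symbol frequencies, e.g. of gap lengths, to feed it; the function name is mine):

    import heapq
    from itertools import count

    def huffman_codes(freqs):
        # freqs: dict mapping symbol -> frequency.
        # Returns a dict mapping symbol -> bit string.
        tiebreak = count()  # keeps heap comparisons away from the nodes
        heap = [(f, next(tiebreak), sym) for sym, f in freqs.items()]
        heapq.heapify(heap)
        while len(heap) > 1:
            f1, _, left = heapq.heappop(heap)
            f2, _, right = heapq.heappop(heap)
            heapq.heappush(heap, (f1 + f2, next(tiebreak), (left, right)))
        codes = {}
        def walk(node, prefix):
            if isinstance(node, tuple):      # internal node
                walk(node[0], prefix + '0')
                walk(node[1], prefix + '1')
            else:                            # leaf symbol
                codes[node] = prefix or '0'
        walk(heap[0][2], '')
        return codes

One caveat: with only 20 gaps in a single message, most frequencies are 1, so you'd want frequencies from a model of the source rather than from one message.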

Emtucifor
+1  A: 

If you are going with a dictionary-based encoding where the decoder has the dictionary as well, there's no absolute minimum. For a frequency-based encoding, however, what you need is to calculate the entropy:

E = -(P(0) * log_2(P(0)) + P(1) * log_2(P(1)))
E = -(1004/1024 * log_2(1004/1024) + 20/1024 * log_2(20/1024))
E = 0.1388005

So each bit of the input should require 0.1388005 bits of the output on average. In total:

0.1388005 * 1024 = 142.1317 bits.

This means that, in theory, an optimal algorithm can encode any string of 1004 zeros and 20 ones (or the other way around) in 143 bits.
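A quick way to double-check that arithmetic in Python:

    from math import log2

    p1 = 20 / 1024                    # probability of a one-bit
    p0 = 1 - p1                       # probability of a zero-bit
    E = -(p0 * log2(p0) + p1 * log2(p1))
    print(E)                          # ~0.1388005 bits per input bit
    print(E * 1024)                   # ~142.13, rounds up to 143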

Max Shawabkeh
+3  A: 

ceiling(log2(1024 choose 20)) = 139 bits

(calculation on Wolfram Alpha)
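The same figure is easy to verify in Python (math.comb needs 3.8+):

    from math import ceil, comb, log2

    print(ceil(log2(comb(1024, 20))))   # 139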

The other answers saying 143 bits leave out that we know there are exactly 20 ones. Here's a concrete encoding to show one way to use that knowledge: using arithmetic coding, send each of the 1024 '0' or '1' symbols in succession. The first symbol gets weighted at 20/1024 probability of being '1'; but each later symbol gets weighted differently. If the first symbol was '0', use 20/1023 on the next; but if it was '1', use 19/1023. Continue in the same way to the end. Arithmetic coding does all the hard work to fit in about 139 bits, as long as we tell it the right probabilities.
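If you'd rather not pull in an arithmetic coder, here is a sketch of an alternative that reaches the same bound: enumerative (combinatorial) coding, which maps each sorted 20-subset of {0, ..., 1023} to its rank in the combinatorial number system. The rank is an integer below C(1024, 20), so it always fits in 139 bits. (Python 3.8+ for math.comb; the function names are mine.)

    from math import comb

    def rank_positions(positions, k=20):
        # Combinatorial number system: for sorted p_1 < ... < p_k,
        # rank = C(p_1, 1) + C(p_2, 2) + ... + C(p_k, k).
        positions = sorted(positions)
        return sum(comb(p, i + 1) for i, p in enumerate(positions))

    def unrank_positions(rank, k=20):
        # Greedily peel off the largest position first.
        positions = []
        for i in range(k, 0, -1):
            p = i - 1                       # comb(i - 1, i) == 0 <= rank
            while comb(p + 1, i) <= rank:
                p += 1
            rank -= comb(p, i)
            positions.append(p)
        return sorted(positions)

Round trip: unrank_positions(rank_positions(ps)) == sorted(ps), and the rank serializes into 18 bytes with rank.to_bytes(18, 'big').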

On the "bonus bonus": error correction wasn't in the original question. You can layer an error-correcting code on top of first finding an optimal encoding assuming no error, like above (and that's normally a good way to break down the problem). You don't lose any coding efficiency that way, though I think you can lose robustness -- as in, if you get more errors than your ECC can correct, will the message come out as total garbage or will it degrade more gracefully?

Darius Bacon
I'm enjoying just figuring out why this is so. Great.
So any idea of an encoding algorithm for this?
The location of the errors is known ahead of time, as I have shown in this question. I just want to encode the location of those errors in the absolute minimum number of bytes that I can.
Ah, it wasn't clear to me that these 1 bits represented errors. Then that does change the answer, but you can approach it the same way; the answer will probably look more complicated.
Darius Bacon
Also, this exactly-20-errors thing is awfully academic -- I've never heard of an error source helpful enough to always inject exactly 20 errors, never 19 or 21. :-) I hope you enjoyed it anyway.
Darius Bacon
The idea is that Alice has a message with errors, and Bob has a copy of the correct message AND Alice's message with the errors. So Bob can count the exact number of errors and their locations, and just send Alice the differences while using minimal bytes to encode them.
Sure, but to use my answer, Bob would have to send the count of errors as well. The bits for the count exactly outweigh the bits saved by *knowing* the count.
Darius Bacon
@Darius Bacon: What are you talking about? The number of errors can be given in 10 bits, then the encoding for those errors can vary depending on the quantity. If that 10-bit count equals 20, then 143 + 10 = 153 bits is a wild improvement over 1024.
Emtucifor
If the encoding of the number of bits in error takes 10 bits, then you can send the message in this case in 10 + 139 bits, better than 10 + 143. But it's not any better than other coding schemes that might model the message probabilities more directly; to choose a scheme, we should ask about those probabilities. E.g. if the errors take the form of 1024 independent chances, each of probability 20/1024, then just feed that directly to the arithmetic coder, instead of first counting the one bits and sending *that*.
Darius Bacon