Shannon's Theorem is stated in terms of random data and probabilities. Likewise, entropy is only defined for randomly-drawn strings -- it is a property of the distribution the strings come from, not of the strings themselves. So, we can restate Shannon's Theorem informally as:
If you randomly select a string from a given probability distribution, then the best average compression ratio we can get for the string is given by the entropy rate of the probability distribution.
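For reference, the entropy rate appealed to here is just the standard Shannon entropy of the distribution (this formula is standard information theory, not something specific to this post); for an i.i.d. source emitting symbols $x$ with probabilities $p(x)$, it is

$$ H = -\sum_{x} p(x) \log_2 p(x) \quad \text{bits per symbol.} $$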
Given any random string, I can easily write a compression algorithm which will compress that string down to 1 bit, but my algorithm will necessarily increase the length of some other strings. My compression algorithm works as follows (both it and its inverse are sketched in code after the two lists below):
- If the input string equals some pre-chosen random string, the output is the 1-bit string "0"
- Otherwise, if the input is an N-bit string, the output is the (N+1)-bit string "1" followed by the input string
The corresponding decompression algorithm is:
- If the input is "0", the output is the pre-chosen random string
- Otherwise, the output is everything except for the first input bit
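Here is a minimal sketch of that scheme in Python. The pre-chosen string, the function names, and the representation of bit strings as Python strings of "0"/"1" are all illustrative choices, not part of the argument itself.

```python
# A sketch of the scheme above. The pre-chosen string, the function names,
# and the "0"/"1" string representation of bits are illustrative assumptions.

SPECIAL = "1101001110"  # the pre-chosen random string; any fixed string works

def compress(bits: str) -> str:
    """Map the pre-chosen string to the 1-bit string "0"; map any other
    N-bit string to the (N+1)-bit string "1" followed by the input."""
    return "0" if bits == SPECIAL else "1" + bits

def decompress(bits: str) -> str:
    """Invert compress: "0" decodes to the pre-chosen string; anything
    else decodes to everything after the leading marker bit."""
    return SPECIAL if bits == "0" else bits[1:]

# Round-trip check on a few inputs.
for s in (SPECIAL, "0", "0000", "101010"):
    assert decompress(compress(s)) == s
```

Note that the pair is invertible, and that every string other than the pre-chosen one comes out exactly one bit longer -- which is the trade-off the counting argument below demands.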
The key here is that we can't write a single algorithm which compresses all strings from a given distribution at a high rate on average. By a simple counting (pigeonhole) argument, there are just too many strings for that.
If we have a probability distribution over strings, we can calculate the entropy rate of that distribution. If we then randomly pick a string according to the distribution and attempt to compress it using any algorithm, the size of the compressed string relative to the original will, on average, never be less than the entropy rate. This is what Shannon's Theorem says.
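As a rough empirical illustration (my own sketch, not part of the statement above), we can draw bits from a biased coin, compress them with a general-purpose compressor such as zlib, and compare the compressed size per symbol with the entropy rate; the parameters below are arbitrary.

```python
# Empirical sketch: compare zlib's output size per symbol against the
# entropy rate of a Bernoulli(p) bit source. Parameters are arbitrary.
import math
import random
import zlib

p = 0.1            # probability of a 1 bit
n = 200_000        # number of bits to draw (multiple of 8 for easy packing)
random.seed(0)

bits = [1 if random.random() < p else 0 for _ in range(n)]

# Pack 8 bits per byte so zlib sees the raw bit stream, not ASCII "0"/"1".
packed = bytes(
    sum(bit << (7 - i) for i, bit in enumerate(bits[j:j + 8]))
    for j in range(0, n, 8)
)

entropy_rate = -(p * math.log2(p) + (1 - p) * math.log2(1 - p))
compressed_bits_per_symbol = 8 * len(zlib.compress(packed, 9)) / n

print(f"entropy rate:  {entropy_rate:.3f} bits/symbol")
print(f"zlib level 9:  {compressed_bits_per_symbol:.3f} bits/symbol")
```

With p = 0.1 the entropy rate is about 0.47 bits per symbol, and the compressor's output should land at or above that figure -- it may get close, but on average it cannot beat it.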