views: 502
answers: 4

From Shannon's Source Coding Theorem we know that the expected length of an optimally compressed string is bounded from below by the entropy of the source, like so:

H(X) <= L < H(X) + 1/N

where H(X) is the entropy of the source in bits per symbol, N is the length of the source string (the number of symbols encoded as one block), and L is the expected length of the compressed output in bits per source symbol.

This necessarily means that there is a limit to lossless compression.
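For concreteness, here is a minimal sketch of that bound in Python for a made-up i.i.d. source over four symbols (the distribution is purely illustrative):

    import math

    # Hypothetical i.i.d. source distribution over four symbols
    p = {'a': 0.5, 'b': 0.25, 'c': 0.125, 'd': 0.125}

    # Shannon entropy in bits per symbol
    H = -sum(q * math.log2(q) for q in p.values())
    print(H)  # 1.75

    # Bound on expected compressed bits per symbol, coding N symbols at a time
    N = 100
    print(H, "<= L <", H + 1.0 / N)  # 1.75 <= L < 1.76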

What I'd like to know is:

  • Can we directly relate entropy to some expected compression ratio?

  • Can we use the entropy to find some upper bound for the compression ratio?

+2  A: 

You can't directly relate entropy to compression ratio without knowing the length of the source string, but you can see the theoretical limit on the maximum compression ratio by solving for the smallest possible value of L. You can use this limit as a metric for the efficiency of your compression algorithms, although falling short of it doesn't mean that a better algorithm has been discovered or even exists.

So, yes. You can use entropy to find the theoretical maximum lossless compression ratio, but no, you can't use it to determine your expected compression ratio for any given compression algorithm.
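For example, here is a rough sketch of that calculation, assuming the source symbols are stored as one byte each (the distribution is made up for illustration):

    import math

    # Hypothetical source distribution (per-symbol probabilities)
    p = [0.5, 0.25, 0.125, 0.125]
    H = -sum(q * math.log2(q) for q in p)  # entropy: 1.75 bits per symbol

    raw_bits_per_symbol = 8.0            # assuming one byte per symbol
    max_ratio = raw_bits_per_symbol / H  # best possible lossless ratio
    print(max_ratio)  # ~4.57, i.e. compressed size >= ~22% of the original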

Jekke
A: 

Yes! I think this paper would point you in the right direction.

ETA Looks like you need to be an IEEE member to read the actual paper. If someone could find a publicly available resource (or explain the math here), that would be much better of course!

Eric Petroelje
A: 

Yes. The entropy rate of the English language is often cited as 1.5 bits per character (give or take). Typical encodings use 8 bits per character, so a maximally compressed text should be about 1.5/8 (~19%) of the size of the original. Actual results for a plain-text version of Jane Austen's Pride and Prejudice: orig = 701K, bzip2 = 178K, or ~25%.
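If you want to reproduce this kind of estimate yourself, here is a rough sketch ('book.txt' is a placeholder path). Note that a simple character-frequency count measures only order-0 entropy, around 4-5 bits per character for English, so it gives a much weaker bound than the 1.5 bits/character figure, which accounts for context between characters:

    import math
    from collections import Counter

    # 'book.txt' is a placeholder for any plain-text file
    data = open('book.txt', 'rb').read()
    counts = Counter(data)
    n = len(data)

    # Order-0 (single-byte) entropy estimate in bits per byte
    H0 = -sum(c / n * math.log2(c / n) for c in counts.values())

    print("order-0 entropy: %.2f bits/byte" % H0)
    print("order-0 size bound: %d bytes (%.0f%% of original)"
          % (n * H0 / 8, 100 * H0 / 8))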

dwc
For up-to-date results on English text, Matt Mahoney's Large Text Compression Benchmark at http://www.cs.fit.edu/~mmahoney/compression/text.html is very interesting.
schnaader
+4  A: 

Shannon's Theorem is defined in terms of random data and probabilities. Similarly, the entropy of a string is only defined with respect to a distribution over strings -- entropy is a property of the distribution, not of the strings themselves. So, we can restate Shannon's Theorem informally as:

If you randomly select a string from a given probability distribution, then the best average compression ratio we can get for the string is given by the entropy rate of the probability distribution.

Given any random string, I can easily write a compression algorithm which will compress that string down to 1 bit, but my algorithm will necessarily increase the length of some other strings. My compression algorithm works as follows (a runnable sketch is given below):

  1. If the input string equals some pre-chosen random string, the output is the 1-bit string "0"
  2. Otherwise, the output is the (N+1)-bit string "1" followed by the input string, where N is the length of the input

The corresponding decompression algorithm is:

  1. If the input is "0", the output is our previous pre-chosen random string
  2. Otherwise, the output is everything except for the first input bit
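As a runnable sketch of the scheme above (the pre-chosen string and the character-string representation of bits are arbitrary choices for illustration):

    # Minimal sketch; bit strings are represented as Python character strings.
    SPECIAL = '110101'  # the pre-chosen "random" string, chosen arbitrarily

    def compress(bits):
        # The one favoured string maps to a single bit...
        if bits == SPECIAL:
            return '0'
        # ...and every other string grows by one bit.
        return '1' + bits

    def decompress(bits):
        if bits == '0':
            return SPECIAL
        return bits[1:]

    assert decompress(compress(SPECIAL)) == SPECIAL  # 6 bits -> 1 bit
    assert decompress(compress('0011')) == '0011'    # 4 bits -> 5 bits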

The key here is that we can't write down one algorithm which compresses all strings from a given distribution at a high rate on average. There are just too many strings: there are 2^N binary strings of length N, but only 2^N - 1 binary strings shorter than that, so no lossless algorithm can shorten them all.

If we have a given probability distribution of strings, we can calculate the entropy rate of the distribution. Then, if we randomly pick a string according to the distribution and attempt to compress it using any algorithm, the average number of compressed bits per source symbol will never be less than the entropy rate. This is what Shannon's Theorem says.
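A small simulation illustrates this, using zlib as a stand-in for "any algorithm" and a made-up Bernoulli(0.9) binary source (both choices are just for illustration):

    import math, random, zlib

    # Hypothetical i.i.d. binary source; entropy rate in bits per symbol
    p = 0.9
    H = -(p * math.log2(p) + (1 - p) * math.log2(1 - p))  # ~0.469

    random.seed(0)
    n, trials, total_bits = 10000, 20, 0
    for _ in range(trials):
        s = bytes(int(random.random() < p) for _ in range(n))
        total_bits += 8 * len(zlib.compress(s, 9))

    # Any real compressor's average rate stays above the entropy rate
    print("entropy rate: %.3f bits/symbol" % H)
    print("zlib rate:    %.3f bits/symbol" % (total_bits / (trials * n)))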

Adam Rosenfield
Very well said, but I didn't quite follow the argument "there are just too many strings".
Tony Lee