tags:
views: 104
answers: 4

This is a theoretical question, so expect that many details here are not computable in practice or even in theory.

Let's say I have a string s that I want to compress. The result should be a self-extracting binary (it could be x86 asm, but it could also be some other hypothetical Turing-complete low-level language) which outputs s.

Now, we can easily iterate through all possible such binaries/programs, ordered by size. Let B_s be the sub-list of those binaries that output s (of course, B_s is not computable).

As every non-empty set of positive integers has a minimum, there must be a smallest program b_min_s in B_s.

For what languages (i.e. sets of strings) do we know something about the size of b_min_s? Even an estimate would be interesting. (I can construct some trivial examples where I can even compute B_s and b_min_s exactly, but I am interested in more interesting languages.)
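For a toy, non-Turing-complete language the brute-force search over programs ordered by size actually is computable, because every program halts. A minimal sketch (my own illustration, not from the question): a made-up two-instruction language where letters print themselves and `D` doubles the output produced so far.

```python
from itertools import product

def run(program: str) -> str:
    """Interpret the toy total language: each letter appends itself
    to the output; 'D' doubles the output produced so far."""
    out = ""
    for op in program:
        out = out + out if op == "D" else out + op
    return out

def shortest_program(s: str, alphabet=("a", "b", "D")) -> str:
    """Enumerate programs ordered by size and return the first one that
    outputs s -- the analogue of b_min_s for this toy language.  The
    search terminates because every toy program halts and any string
    over {a, b} can at worst be spelled out literally; for a
    Turing-complete language the same search is uncomputable."""
    length = 0
    while True:
        length += 1
        for candidate in product(alphabet, repeat=length):
            program = "".join(candidate)
            if run(program) == s:
                return program

print(shortest_program("abab"))  # "abD": 3 instructions beat the 4-character literal
```

The point of the sketch is where it breaks down: as soon as the language is Turing-complete, `run` may not halt, and no step limit is ever provably sufficient.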

+4  A: 

This is Kolmogorov complexity, and you are correct that it's not computable. If it were, you could create a paradoxical program: one of length n that searches for and prints the first string whose Kolmogorov complexity m exceeds n.
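The contradiction can be sketched as pseudocode (Python-flavored; `K` is a hypothetical oracle for Kolmogorov complexity, which is exactly what cannot exist):

```
# Pseudocode -- NOT runnable: it assumes a computable K,
# which is the assumption being refuted.
def first_complex_string(n):
    for s in all_strings_ordered_by_length():
        if K(s) > n:          # hypothetical oracle call
            return s

# This program needs only about c + log2(n) bits: a fixed amount of
# code plus the constant n.  Pick n large enough that c + log2(n) < n.
# Then a program shorter than n bits outputs a string of complexity
# greater than n, contradicting the definition of K.
```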

Clearly, you can bound b_min_s for given inputs: any concrete compressor gives an upper bound. However, as far as I know, most of the efforts to go further have been existence proofs. For instance, there is an ongoing competition (the Hutter Prize) to compress English Wikipedia.

Matthew Flaschen
Yes, exactly that prize drove me to this question. :) However, such competitions/attempts only give indications, because they establish upper bounds on b_min_s for one particular example string. They don't say anything about an average or a hard limit for a given language (e.g. XML with grammatically correct English as content).
Albert
Here is a good explanation of compression that I'd recommend for further reading: http://www.mattmahoney.net/dc/dce.html - and on the Hutter Prize page there's a link to http://cs.fit.edu/~mmahoney/compression/textdata.html, which is also nice to read.
schnaader
A: 

The maximal (average) compression ratio possible, over all inputs, is 1:1.
The number of possible inputs must equal the number of possible outputs.
It has to, so that every output can be mapped back to its input (the mapping must be invertible).
So, to store all possible outputs, you need a container of the same size as the minimal container for all inputs, giving a 1:1 compression ratio on average.
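The counting argument behind this can be checked directly. A small sketch (my own illustration, not from the answer): there are 2^n binary strings of length n, but only 2^n - 1 binary strings strictly shorter than n, so no lossless compressor can shrink every input.

```python
n = 8

inputs = 2 ** n                            # binary strings of exactly length n
shorter = sum(2 ** k for k in range(n))    # lengths 0 .. n-1: 2^n - 1 in total

print(inputs, shorter)  # 256 255: one string too many for the shorter slots
```

By the pigeonhole principle, at least one length-n input must map to an output of length n or more, so the average over all inputs can never beat 1:1.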

Dani
"The maximal (avarage) compression rate possible is 1:1." What does that actually mean?
Matthew Flaschen
A: 

Basically, you need enough information to rebuild your original information. I guess the other answers are more helpful for your theoretical discussion, but just keep this in mind.

Chad
+2  A: 

Claude Shannon estimated the information density of the English language to be somewhere between 0.6 and 1.3 bits per character in his 1951 paper "Prediction and Entropy of Printed English".
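A zeroth-order estimate of that density is easy to compute from character frequencies alone; it comes out much higher (around 4 bits per character) because it ignores the longer-range context that Shannon's experiments exploited. A minimal sketch (the sample sentence is my own):

```python
import math
from collections import Counter

def entropy_per_char(text: str) -> float:
    """Order-0 Shannon entropy: -sum p(c) * log2 p(c) over the
    character frequencies of the text, in bits per character."""
    counts = Counter(text)
    n = len(text)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

sample = "the quick brown fox jumps over the lazy dog"
print(round(entropy_per_char(sample), 2))  # roughly 4 bits/char, far above 0.6-1.3
```

The gap between this frequency-only figure and Shannon's 0.6-1.3 range is exactly the redundancy carried by spelling, grammar, and meaning.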

phreeza
Hm, I wonder if Kolmogorov complexity is compatible with Shannon's information density. My intuition is that Shannon information treats the input as just a stream of bits. E.g. the pixel stream of a fractal image still has a high information density by Shannon's definition. Under these considerations, I wonder if 0.6 is really a good estimate. Maybe for English text which doesn't contain any redundant information.
Albert
Shannon information makes a statement about the general statistical case, while Kolmogorov complexity is the information content of a single object. So in this example, Shannon information says something about the average character in an English text, while the Kolmogorov complexity is the information content of a specific body of text, e.g. your string s.
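The difference shows up even with an off-the-shelf compressor (a rough stand-in for Kolmogorov complexity, which it only upper-bounds). A highly regular string and a random string of the same length can have similar per-symbol statistics, yet wildly different shortest descriptions:

```python
import os
import zlib

regular = b"ab" * 5000          # 10000 bytes, trivially describable
random_ = os.urandom(10000)     # 10000 bytes, no shorter description expected

# The compressed length is an upper bound on the true complexity.
print(len(zlib.compress(regular)))  # a few dozen bytes
print(len(zlib.compress(random_)))  # close to 10000 bytes: incompressible
```

An order-0 Shannon measure assigns `regular` about 1 bit per byte, but its Kolmogorov complexity is tiny: "print 'ab' 5000 times" is a complete description.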
phreeza