views: 164

answers: 3

I came across this question;

"A lossless compression algorithm claims to guarantee to make some files smaller and no files larger.
Is this:

a) Impossible

b) Possible but may run for an indeterminate amount of time,

c) Possible for compression factor 2 or less,

d) Possible for any compression factor?"

I'm leaning towards (a), but couldn't give a solid explanation as to why. (I'll list the thoughts a friend and I came up with as a possible answer)

+10  A: 

By the pigeon-hole principle, a string of 10 bits gives 1024 possible inputs, which must map to strings of 9 bits or fewer, and there are fewer than 1024 such outputs.

This guarantees that either the algorithm has collisions (i.e. lossy compression), or at some point it chooses to return the unmodified input as output.

In the latter case, you cannot determine how to decompress an arbitrary string of bits. (It could be an unmodified input, or a compressed output from a larger bit string).

-> Impossible.
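
To make the counting explicit, here is a small Python sketch (my own illustration, not part of the original argument) that tallies the 10-bit inputs against the strings of 9 bits or fewer:

```python
# Counting sketch: 2**10 = 1024 strings of exactly 10 bits, but only
# 2**0 + 2**1 + ... + 2**9 = 1023 strings of 9 bits or fewer (including
# the empty string), so no injective "never grows, sometimes shrinks"
# mapping can exist.
n = 10
inputs = 2 ** n                            # 1024 strings of exactly n bits
outputs = sum(2 ** k for k in range(n))    # 1023 strings of fewer than n bits
print(inputs, outputs, inputs > outputs)   # 1024 1023 True
```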

RJFalconer
This is just speculation. I'm hardly an expert on this, just wanted to see what others thought of my answer. Thanks!
RJFalconer
@BlueNovember: don't worry: *every* user that comes across an archiver eventually asks himself whether it's possible to make such a compression :-)
Pavel Shved
Hmm. I didn't, Pavel.
spender
+1 Pigeon-hole principle. Consider that you either compress, or keep the original, and that you need at least one marker bit appended (or prefixed) to the original, to mark it as such.
kaizer.se
More formally: There does not exist an injective function on finite sets whose codomain is smaller than its domain.
LBushkin
+5  A: 

Just a slight clarification of BlueNovember's post...

You only have to have some files becoming smaller, so the claim that a string of 10 bits has to map to 9 bits or fewer isn't quite right. In particular, if someone proposed such a compression mechanism it could map every string of 10 bits or fewer to exactly the same output as its input (i.e. an identity transformation).

However, we are told that there is at least one file which does become smaller. Without loss of generality, say that file starts with x bits and ends up as y bits, where y is strictly less than x.

Now consider the domain of "files with y bits or fewer", which has 2^(y+1) - 1 bit strings (including the empty one). In order for none of those to result in a bigger file, each has to map to a bit string in the same domain, i.e. one of those 2^(y+1) - 1 compressed files. However, we already know that the initial string of length x bits compresses to one of those values - leaving only 2^(y+1) - 2 possible values.

At this point the pigeon hole principle comes in - you clearly can't map 2^(y+1) - 1 inputs to 2^(y+1) - 2 outputs without repeating an output, which violates the reversibility of compression.
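
For concreteness, a quick Python sketch of the same count (illustration only; x and y here are example values, not anything from the question):

```python
# Strings of y bits or fewer: 2**(y+1) - 1 of them (including the empty string).
# One of those outputs is already taken by the compressed x-bit file, so only
# 2**(y+1) - 2 outputs remain for 2**(y+1) - 1 inputs -- pigeonhole forces a repeat.
x, y = 10, 9
domain = sum(2 ** k for k in range(y + 1))   # 1023 inputs of y bits or fewer
codomain = domain - 1                        # 1022 outputs still available
print(domain, codomain, domain > codomain)   # 1023 1022 True
```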

Jon Skeet
When talking about strings, I totally agree. But since we're talking about files isn't there one more variable to consider: the filename? Deciding which file to decompress and which to leave could be based on the file extension. Or am I missing something?
Yannick M.
@Yannick: If you're allowed to *change* the filename, then you can trivially output an empty file with a filename which contains all the data. If you can't change the filename, then it depends on whether there's a correlation between filename and data. If you know that any file with an extension of "000" is all zeroes, then you could indeed compress the data... but I'd suggest that's cheating, and that you should be able to store arbitrary data with arbitrary filenames. At that point, it becomes irrelevant.
Jon Skeet
A: 

a) impossible

If you have a file that cannot be compressed further, you still have to add the information about whether or not it has been compressed, so in that case the file would have to grow.
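
A tiny sketch of that marker-bit overhead (illustration only; the one-bit header is my assumption, not a real format):

```python
# Prefix a hypothetical one-bit header so the decompressor knows whether
# the payload is stored as-is ("0") or compressed ("1").
def store(bits: str, compressed: bool) -> str:
    return ("1" if compressed else "0") + bits

incompressible = "1011001110"               # a 10-bit file that cannot be shrunk
print(len(store(incompressible, False)))    # 11 -- the file grew by the header bit
```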

Guffa
Why the downvote? If you don't say anything at all about what you don't like, it's rather pointless.
Guffa