tags:
views: 392
answers: 9

So the compression process takes a chunk of binary data A and outputs a smaller chunk of binary data B. What characteristics of B make it unable to go through this process again?

+11  A: 

It is not true that data that is already compressed cannot be compressed again. If you take a file consisting of 1 million zeros and compress it using gzip, the resulting compressed file is 1010 bytes. If you compress the compressed file again it is further reduced to just 75 bytes.

$ python
>>> f = open('0.txt', 'w')
>>> f.write('0'*1000000)
>>> f.close()
>>>
$ wc -c 0.txt
1000000 0.txt

$ gzip 0.txt
$ wc -c 0.txt.gz
1010 0.txt.gz

$ mv 0.txt.gz 0.txt
$ gzip 0.txt
$ wc -c 0.txt.gz
75 0.txt.gz

The reason compression is unlikely to work a second time is that the compression process removes redundancy, and once there is less redundancy left it is harder to compress the file further.
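
The same effect can be reproduced from Python with the zlib module (a rough sketch; exact sizes depend on the library and compression level): highly redundant input collapses, while low-redundancy input such as random bytes does not shrink at all.

import os, zlib

# A million '0' characters are extremely redundant and compress dramatically.
zeros = b"0" * 1000000
once = zlib.compress(zeros)
twice = zlib.compress(once)
print(len(zeros), len(once), len(twice))   # the second pass still helps, as in the gzip demo above

# Random bytes have no redundancy to remove; the "compressed" version is
# actually a few bytes larger because of the format's own overhead.
random_data = os.urandom(1000)
print(len(zlib.compress(random_data)))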

Mark Byers
If you can compress it again it means that your original algorithm was not optimal.
Bill K
@Bill K, I doubt that there is *any* compression algorithm that is optimal for every input. The brilliance of this answer is that it points out the exception while explaining the more general case.
Mark Ransom
@Mark Byers, your 3 file example muddies the water a bit. You can indeed get a smaller file by compressing twice, but now you're relying on outside knowledge to know that you need to decompress twice. That outside knowledge requires at least one bit.
Mark Ransom
Good answer. As a side note: this is why the SETI people say they haven't found any ET signals yet. Advanced aliens would move towards increasingly efficient communications, and an optimally efficient communication would have no redundancy, and thus would be indistinguishable from noise.
DanO
@Bill K, @Mark Byers - Not optimal for that PARTICULAR INPUT. Compression algorithms have to optimise for general input. If it were completely optimal for that input it would be useless for general input. For example: a random number generator with a particular seed is a program that can "decompress" tens of bytes into tens of terabytes. The inverse of that generator is a very good compression algorithm, but only for that particular input.
DanO
@Mark Byers: You are right. Optimizing for *typical* input is the best they can do. Optimizing for *general* input would preclude optimal efficiency on *typical* input.
DanO
Yes, I should have said for that input--I thought that was implied--that the program you chose to use had not chosen the best algorithm, OR that it chose the best algorithm but it was still not optimal for that set of data. @Mark Byers--if the algorithm had been optimal, it would have included that second layer--hence the definition of optimal (this implies a level rarely, if ever, achieved, by the way).
Bill K
This is a good explanation, although the example is probably a bit obscure to those unfamiliar with Linux in general or those commands in particular.
-1 In your example, 00 cannot be compressed to just 0, since that leaves no room to compress the other two possibilities, 0 and 1. You might as well say any compression algorithm is sub-optimal since it's possible to encode any particular string of bits as `0` and encode any other string of bits with more than one bit... The concept you are clearly missing is that **you cannot compress random data**, and compressing it makes it more random
BlueRaja - Danny Pflughoeft
+2  A: 

Compression works by recognizing patterns and saying "this pattern is here, here and here, so I'll store it once and remember to put it there, there and there when I decompress".

Most patterns get caught in the first compression. You can achieve further compression after it's compressed, but... there aren't many patterns left.
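
A toy illustration of that idea in Python (my own sketch, not a real compressor): replace one repeated pattern with a short token, and notice that the result leaves little for a second pass to find.

# Replace a repeated pattern with a short token plus a dictionary entry.
data = "ABABABABXYZABABABAB"
pattern = "ABABABAB"
compressed = data.replace(pattern, "#")      # "#XYZ#"
dictionary = {"#": pattern}
print(compressed, dictionary)

# A second pass over "#XYZ#" finds no long repeats, so there is
# almost nothing left to squeeze out.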

glowcoder
+13  A: 

Data has something called entropy: the amount of new information each new bit gives. For example, 10101010101010101010 has low entropy because you don't need the next bit to know what comes next. A perfect compression algorithm would compress to maximum entropy, so every bit gives information and so cannot be removed, making the size a minimum.
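
A rough way to see this in Python (a sketch; entropy_per_byte is just an illustrative helper that estimates entropy from byte frequencies, so a real model that noticed the alternating pattern would score it even lower):

import math, os
from collections import Counter

def entropy_per_byte(data):
    # Shannon entropy estimated from byte frequencies, in bits per byte.
    counts = Counter(data)
    total = len(data)
    return -sum(c / total * math.log2(c / total) for c in counts.values())

repetitive = b"10" * 10000            # the 101010... pattern from the answer
random_bytes = os.urandom(20000)      # maximum-entropy data

print(entropy_per_byte(repetitive))   # ~1.0 (only two symbols, evenly split)
print(entropy_per_byte(random_bytes)) # ~8.0 (every byte value about equally likely)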

murgatroid99
+1 perfectly stated. It's worth also mentioning that there are 2^n possible strings of n-bits but only 2^(n-1) strings of (n-1) bits, so files with maximum entropy (completely random data) are impossible to reliably compress by even 1 bit!
BlueRaja - Danny Pflughoeft
+3  A: 

First, this only applies to lossless compression. Lossy compression (like jpg) can theoretically be applied over and over. Of course the quality of the compressed material drops each time.

For lossless compression, we can think of compression as a routine that takes some data and transforms it into another form (A->B). Since it is lossless, we must be able to take B and recover A (A<-B). If we follow this through, it means that if we take every sequence of 4 bits (16 patterns) and compress them, we must get 16 different results. That means that, on average, no compression was done!

Compression takes advantage of the fact that for certain kinds of data some sequences of data are less common. These less common forms will become larger when compressed. The more common forms that we have chosen our scheme for will then become smaller. On average, the messages are either the same size or larger.

Taking it one step further, if we repeatedly recompress the same message it will on average not change size (again, this is the best case).
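
The 4-bit case can be spelled out in a few lines of Python (a sketch of the counting argument, not any particular algorithm):

from itertools import product

# All 16 possible 4-bit inputs. A lossless compressor must map them to
# 16 *distinct* outputs, otherwise two inputs would decompress identically.
inputs = ["".join(bits) for bits in product("01", repeat=4)]
print(len(inputs))                       # 16

# There are only 2 + 4 + 8 = 14 bit strings of length 1, 2 or 3, so at most
# 14 of the 16 inputs can map to something strictly shorter; the rest must
# stay the same size or grow.
shorter_outputs = sum(2 ** n for n in range(1, 4))
print(shorter_outputs)                   # 14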

Chris
I think it would be slightly more useful to enumerate all sequences of N bits or fewer. If N=4, there are 31 such sequences. See my answer below (which is perhaps overly terse).
supercat
+5  A: 

For a very academic answer to this question, have a look at Information Entropy! If you're like me, though, the article will make your head hurt.

A simpler answer: Assume you could compress again and again, say by a factor of 10 each time. You could compress Wikipedia down to a gigabyte, then 100M, then 10M... do this 9 times and you'd be down to one byte. If all the information in Wikipedia could be compressed down to one byte, people would not have needed to write it; they could just have expanded one of the 256 possible bytes, and one of them would have been the contents of Wikipedia :)

A slightly more sensible answer: Text is redundant: there is information in those bytes that could be expressed a little more tightly. The Wikipedia article mentions the fact that 'q' is almost always followed by 'u', for example. 'E' occurs more often than 'T'. And so forth. Similarly, in a program, 0 is found more often than any other number. This consistency can be exploited and 'squeezed out.' But once you've done that, the original redundancy is mostly gone. The compressed file has hardly any 'wasted bits' left.
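
That skew is easy to see by counting letters in any bit of English text (a small sketch; the sample string is just an arbitrary example):

from collections import Counter

sample = ("Compression exploits the statistical regularities of English text: "
          "some letters appear far more often than others, and q is nearly "
          "always followed by u.")

letters = Counter(c for c in sample.lower() if c.isalpha())
print(letters.most_common(5))      # a handful of letters dominate the counts
print(letters["q"], letters["z"])  # rare letters barely appear at all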

Carl Smotricz
A: 

You can compress data as much as you like, but the effect might not be what you want. After the first level of compression, if you run the same algorithm on it, it probably won't compress enough to make it worthwhile.

Think about this, here is your data:

1001 0011 1110 0100 0011 1001

I'll use a made-up compressor to tokenize by nybble (4 bits) the data as such:

if 1001, compress as 101 since no nybble begins with 101 and 1001 occurs twice
if 0011, compress as 110 since no nybble begins with 110 and 0011 occurs twice

After compression:

101 110 1110 0100 110 101 or 1011 1011 1001 0011 0101

This wouldn't actually work in the real world, but as you can imagine, you could compress this again as it still is binary data.

The next compression does this:

if 1011, compress as 111

After compression: 111 111 1001 0011 0101 or 1111 1110 0100 1101 01

But as you can see, there aren't any more duplicate nybbles, so the compressor I used would have nothing left to compress.

Again, this isn't a real compressor, just an easy way to understand the concept.
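
For the curious, the same made-up substitution can be transcribed into Python (still not a real, reversible compressor, exactly as noted above; it only shows how the second pass runs out of repeated nybbles):

def toy_pass(bits, table):
    # Walk the string nybble by nybble and substitute any nybble found in
    # the table with its shorter code; everything else is copied verbatim.
    out = []
    for i in range(0, len(bits), 4):
        nybble = bits[i:i + 4]
        out.append(table.get(nybble, nybble))
    return "".join(out)

data = "100100111110010000111001"                       # 1001 0011 1110 0100 0011 1001
first = toy_pass(data, {"1001": "101", "0011": "110"})  # 20 bits, as in the answer
second = toy_pass(first, {"1011": "111"})               # 18 bits; no duplicate nybbles remain
print(data, first, second)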

Exit
+1  A: 

It's not that it can be compressed only once, it's that there is a minimum size you can compress any data to before you start losing bits of it (the way you do with a low quality jpg or MP3 file). Most compression algorithms these days are good enough that one pass gets you within a couple of percent of that, so a second pass isn't really worthwhile, rather than not being possible.

To understand the minimum size without reading too much theory, think of a question with two possible answers Yes and No. The smallest you can make this result is a single bit where 0 = No and 1 = Yes (or vice versa). Even that has made a bunch of assumptions (that the person receiving the data understands this encoding already for instance).

On a more complex level the same is true for all other data. In a situation where you have eight possible answers, all equally probable (this is important), the minimum size is three bits - the smallest number of bits to allow you eight options (000, 001, 010, 011, 100, 101, 110, 111).

There are some clever things you can do to reduce it a little under certain circumstances (for instance, using a smaller number of bits for very common answers at the expense of needing more bits for less common ones, but at a lower overall average), but ultimately there is a minimum amount of storage needed to contain the information.
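
The "lower overall average" idea is exactly what the entropy formula measures; a small Python sketch (the skewed probabilities below are made up purely for illustration):

import math

def bits_needed(probabilities):
    # Shannon entropy: the average number of bits per answer that any
    # lossless code must use, given these probabilities.
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

# Eight equally likely answers: exactly 3 bits each, as in the answer above.
print(bits_needed([1/8] * 8))                                       # 3.0

# Skewed probabilities: common answers can get short codes, rare ones long
# codes, and the average drops below 3 bits.
print(bits_needed([0.5, 0.2, 0.1, 0.05, 0.05, 0.04, 0.03, 0.03]))   # ~2.2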

Jon Hopkins
+2  A: 

Take a piece of paper and fold it - you've compressed it by 50%. Now do it again - and keep trying. Notice how it becomes harder and harder, and at some point you have to stop?

Data compression suffers from the same limits. Sure, you can compress it again, and you might save a little more space, but it's a clear example of diminishing returns - every further compression attempt requires more effort for marginal improvements.

Damien Dennehy
+1  A: 

For any number N, there are 2^(N+1)-1 different possible input files of length N bits or shorter. If every different input file will yield a different output file, then for every possible input file of length k which can be reduced to some shorter length, there must be at least one shorter file that gets longer.
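
A quick numeric check of that count for N = 4 (a sketch of the same pigeonhole argument):

# Count every bit string of length 4 or shorter, including the empty string:
# 1 + 2 + 4 + 8 + 16 = 31 = 2^(4+1) - 1.
N = 4
total = sum(2 ** k for k in range(N + 1))
print(total, 2 ** (N + 1) - 1)   # 31 31

# A lossless compressor must send these 31 inputs to 31 distinct outputs, so
# if any input shrinks, some other input has to come out longer than it went
# in: there are not enough short strings to go around.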

supercat