I was thinking about compression, and it seems like there would have to be some sort of limit to how much compression could be applied to a file, otherwise everything would end up as a single byte.

So my question is, how many times can I compress a file before:

  • It does not get any smaller?
  • The file becomes corrupt?

Are these two points the same or different?

Where does the point of diminishing returns appear?

How can these points be found?

I'm not talking about any specific algorithm or particular file, just in general.

+3  A: 

Usually compressing once is good enough if the algorithm is good.
In fact, compressing multiple times could lead to an increase in the size.

Your two points are different.

  • Compression done repeatedly and achieving no improvement in size reduction
    is an expected theoretical condition
  • Repeated compression causing corruption
    is likely to be an error in the implementation (or maybe the algorithm itself)

Now let's look at some exceptions or variations:

  • Encryption may be applied repeatedly without reduction in size
    (in fact, at times an increase in size) for the purpose of increased security
  • Image, video or audio files that are increasingly compressed
    will lose data (effectively be 'corrupted' in a sense)
nik
I think it should be noted that image, video, and audio files would only be 'corrupted' and lose data if a lossy compression (such as MP3, DivX, etc.) is used. If the compression is lossless, then the output of the compression is effectively the same data, only recorded in a different number of bytes.
Totty
@Totty, your point is well taken. A good example for audio is FLAC against MP3.
nik
It's a good idea to apply compression before encryption, because encryption usually disrupts the patterns that (most) compression algorithms use to do their magic.
dar7yl
@dar7yl, you are right. It is also a good idea to TAR first and then compress to get better patterns across the complete data (rather than individual file compresses).
nik
+13  A: 

For lossless compression, the only way you can know how many times you can gain by recompressing a file is by trying. It's going to depend on the compression algorithm and the file you're compressing.

Two files can never compress to the same output, so you can't go down to one byte. How could one byte represent all the files you could decompress to?

The reason that the second compression sometimes works is that a compression algorithm can't do omniscient perfect compression. There's a trade-off between the work it has to do and the time it takes to do it. Your file is being changed from all data to a combination of data about your data and the data itself.

Example

Take run-length encoding (probably the simplest useful compression) as an example.

04 04 04 04 43 43 43 43 51 52 10 bytes

That series of bytes could be compressed as:

[4] 04 [4] 43 [-2] 51 52 7 bytes (I'm putting meta data in brackets)

Where the positive number in brackets is a repeat count and the negative number in brackets is a command to emit the next -n characters as they are found.

In this case we could try one more compression:

[3] 04 [-4] 43 fe 51 52 7 bytes (fe is your -2 seen as two's complement data)

We gained nothing, and we'll start growing on the next iteration:

[-7] 03 04 fc 43 fe 51 52 8 bytes

We'll grow by one byte per iteration for a while, but it will actually get worse. One byte can only hold negative numbers to -128. We'll start growing by two bytes when the file surpasses 128 bytes in length. The growth will get still worse as the file gets bigger.
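For the curious, the walkthrough above can be reproduced with a runnable sketch in Python. The byte-level framing (a positive count byte means "repeat the next byte", a negative count byte in two's complement means "the next n bytes are literals") is as described in the answer; the function names and the length-3 cutoff for runs are my own choices.

```python
def rle_compress(data: bytes) -> bytes:
    """Toy RLE: [n] b repeats b n times; [-n] is followed by n literal bytes."""
    out = bytearray()
    lits = bytearray()

    def flush_literals():
        # emit pending literals as a two's-complement count plus the raw bytes
        for j in range(0, len(lits), 128):
            chunk = lits[j:j + 128]
            out.append((-len(chunk)) & 0xFF)
            out.extend(chunk)
        lits.clear()

    i = 0
    while i < len(data):
        run = 1
        while i + run < len(data) and data[i + run] == data[i] and run < 127:
            run += 1
        if run >= 3:  # a run token pays for itself at length 3
            flush_literals()
            out.append(run)
            out.append(data[i])
        else:
            lits.extend(data[i:i + run])
        i += run
    flush_literals()
    return bytes(out)

def rle_decompress(data: bytes) -> bytes:
    out = bytearray()
    i = 0
    while i < len(data):
        n = data[i] - 256 if data[i] >= 128 else data[i]  # signed count byte
        if n > 0:
            out.extend(data[i + 1:i + 2] * n)   # repeated byte
            i += 2
        else:
            out.extend(data[i + 1:i + 1 - n])   # -n literal bytes
            i += 1 - n
    return bytes(out)

original = bytes([0x04] * 4 + [0x43] * 4 + [0x51, 0x52])
once = rle_compress(original)     # [4] 04 [4] 43 [-2] 51 52
twice = rle_compress(once)        # [3] 04 [-4] 43 fe 51 52
thrice = rle_compress(twice)      # [-7] 03 04 fc 43 fe 51 52
print([len(x) for x in (original, once, twice, thrice)])  # [10, 7, 7, 8]
```

Decompressing three times recovers the original bytes exactly, which is the point: each pass is lossless, it just stops paying off (and then starts costing) after the first one.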

There's a headwind blowing against the compression program--the meta data. And also, for real compressors, the header tacked on to the beginning of the file. That means that eventually the file will start growing with each additional compression.


RLE is a starting point. If you want to learn more, look at LZ77 (which looks back into the file to find patterns) and LZ78 (which builds a dictionary). Compressors like zip often try multiple algorithms and use the best one.

Here are some cases I can think of where multiple compression has worked.

  1. I worked at an Amiga magazine that shipped with a disk. Naturally, we packed the disk to the gills. One of the tools we used let you pack an executable so that when it was run, it decompressed and ran itself. Because the decompression algorithm had to be in every executable, it had to be small and simple. We often got extra gains by compressing twice. The decompression was done in RAM. Since reading a floppy was slow, we often got a speed increase as well!
  2. Microsoft supported RLE compression on bmp files. Also, many word processors did RLE encoding. RLE files are almost always significantly compressible by a better compressor.
  3. A lot of the games I worked on used a small, fast LZ77 decompressor. If you compress a large rectangle of pixels (especially if it has a lot of background color, or if it's an animation), you can very often compress twice with good results. (The reason? You only have so many bits to specify the lookback distance and the length, so a single large repeated pattern is encoded in several pieces, and those pieces are highly compressible.)
Nosredna
+1. Great RLE example.
Grant Wagner
+1  A: 

You can compress a file an infinite number of times. However, the second and subsequent compressions will usually produce a file that is larger than the previous one, so there is no point in compressing more than once.

You would need infinite storage, though. You'd use up the universe.
Nosredna
+1  A: 

You can compress a file as many times as you like. But for most compression algorithms the additional reduction from the second compression onward will be negligible.

Matthew Vines
That reads as if the file would get smaller all the time ...
tanascius
+9  A: 

Generally the limit is one compression. Some algorithms result in a higher compression ratio, and using a poor algorithm followed by a good algorithm will often result in improvements. But using the good algorithm in the first place is the proper thing to do.

There is a theoretical limit to how much a given set of data can be compressed. To learn more about this you will have to study information theory.
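As a taste of that limit: the order-0 Shannon entropy of a byte stream gives a lower bound, in bits per byte, for any code that encodes each byte independently. A quick sketch in Python (the sample string is mine, chosen for its obvious redundancy):

```python
import math
from collections import Counter

def entropy_bits_per_byte(data: bytes) -> float:
    """Order-0 Shannon entropy: -sum(p * log2(p)) over byte frequencies."""
    counts = Counter(data)
    n = len(data)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

text = b"abracadabra abracadabra abracadabra"
print(round(entropy_bits_per_byte(text), 2))  # well under the 8 bits/byte of raw storage
```

Real compressors beat this order-0 bound by exploiting patterns across bytes (the repeated word), but no lossless code can beat the entropy of the true source distribution. Shannon's source coding theorem makes that precise.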

Martin Liversage
Regarding theoretical limit: yes, a good place to start is with the work of Claude Shannon. However, it doesn't say how a given compression algorithm will compress the data, and predicting the _number_ of compression steps with profit is pretty hopeless.
Nosredna
+1  A: 

How many times can I compress a file before it does not get any smaller?

In general, not even one. Whatever compression algorithm you use, there must always exist a file that does not get compressed at all; otherwise, by your same argument, you could always compress repeatedly until you reach 1 byte.

How many times can I compress a file before it becomes corrupt?

If the program you use to compress the file does its job, the file will never corrupt (of course I am thinking of lossless compression).

Federico Ramponi
In fact, AT LEAST HALF of all files will become larger, or remain the same size, with any compression algorithm. However, this says nothing about USEFUL files, which usually contain non-random data and thus are usually compressible.
Brian Postow
Heh, don't stop at 1 byte, continue until you have 1 bit!
tuinstoel
+1  A: 

Compression (I'm thinking lossless) basically means expressing something more concisely. For example

111111111111111

could be more concisely expressed as

15 X '1'

This is called run-length encoding. Another method that a computer can use is to find a pattern that is regularly repeated in a file.

There is clearly a limit to how much these techniques can be used; for example, run-length encoding is not going to be effective on

15 X '1'

since there are no repeating patterns. Similarly, if the pattern-replacement method converts long patterns to 3-char ones, reapplying it will have little effect, because the only remaining repeating patterns will be 3 characters or shorter. Generally, applying compression to an already-compressed file makes it slightly bigger, because of various overheads. Applying good compression to a poorly compressed file is usually less effective than applying just the good compression.

Peter
+1  A: 

Once I made a perfect compression algorithm, it reduced the size of any file to 0, the problem was when trying to get the original back...

fortran
+4  A: 

In general for most algorithms, compressing more than once isn't useful. There's a special case though.

If you have a large number of duplicate files, the zip format will zip each independently, and you can then zip the first zip file to remove duplicate zip information. Specifically, for 7 identical Excel files sized at 108 KB, zipping them with 7-Zip results in a 120 KB archive. Zipping again results in an 18 KB archive. Going past that you get diminishing returns.
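The effect is easy to reproduce with any deflate-based tool. Here is a rough sketch using Python's `zlib` (the payload is made up for illustration, not the Excel files from the answer): each "file" is compressed independently, as zip does per entry, so the duplication between entries survives the first pass and is only removed by a second pass over the whole archive.

```python
import random
import zlib

# A stand-in for seven identical files: pseudo-random payload, so a single
# pass can't compress the content itself, duplicated seven times.
random.seed(0)
member = bytes(random.randrange(256) for _ in range(1000))

# First pass: each "file" compressed independently, like entries in a zip.
archive = b"".join(zlib.compress(member) for _ in range(7))

# Second pass over the whole archive spots the seven identical streams.
rezipped = zlib.compress(archive)

print(len(archive), len(rezipped))  # the second pass is far smaller
```

This works because the seven compressed members are byte-identical and all fit inside deflate's 32 KB match window; archivers with "solid" compression (7-Zip, tar.gz) get the same win in a single pass by compressing across file boundaries.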

CoderTao
Good example. I worked on a few videogames where double-compression was used. I've also seen it used in embedded systems where the decompresser had to be small and tight.
Nosredna
+2  A: 

Suppose we have a file N bits long, and we want to compress it losslessly, so that we can recover the original file. There are 2^N possible files N bits long, and so our compression algorithm has to change one of these files to one of 2^N possible others. However, we can't express 2^N different files in less than N bits.

Therefore, if we can take some files and compress them, we have to have some files that lengthen under compression, to balance out the ones that shorten.

This means that a compression algorithm can only compress certain files, and it actually has to lengthen some. This means that, on average, compressing a random file can't shorten it, but might lengthen it.

Practical compression algorithms work because we don't usually use random files. Most of the files we use have some sort of structure or other properties, whether they're text or program executables or meaningful images. By using a good compression algorithm, we can dramatically shorten files of the types we normally use.

However, the compressed file is not one of those types. If the compression algorithm is good, most of the structure and redundancy have been squeezed out, and what's left looks pretty much like randomness.

No compression algorithm, as we've seen, can effectively compress a random file, and that applies to a random-looking file also. Therefore, trying to re-compress a compressed file won't shorten it significantly, and might well lengthen it some.
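This can be checked empirically with Python's `zlib`, using pseudo-random bytes as a stand-in for "random-looking" data (exact sizes may vary slightly across zlib versions, but the direction won't):

```python
import random
import zlib

# Pseudo-random bytes: a stand-in for random-looking (already compressed) data.
random.seed(1)
data = bytes(random.randrange(256) for _ in range(4096))

once = zlib.compress(data)
twice = zlib.compress(once)

# Each pass finds nothing to squeeze out and adds header/checksum overhead,
# so the output grows slightly instead of shrinking.
print(len(data), len(once), len(twice))
```

Deflate falls back to "stored" (uncompressed) blocks when it can't find redundancy, so the growth per pass is just the container overhead: a few bytes, exactly the asymptotic behavior the counting argument predicts.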

So, the normal number of times a compression algorithm can be profitably run is one.

Corruption only happens when we're talking about lossy compression. For example, you can't necessarily recover an image precisely from a JPEG file. This means that a JPEG compressor can reliably shorten an image file, but only at the cost of not being able to recover it exactly. We're often willing to do this for images, but not for text, and particularly not executable files.

In this case, there is no stage at which corruption begins. It starts when you begin to compress it, and gets worse as you compress it more. That's why good image-processing programs let you specify how much compression you want when you make a JPEG: so you can balance quality of image against file size. You find the stopping point by considering the cost of file size (which is more important for net connections than storage, in general) versus the cost of reduced quality. There's no obvious right answer.

David Thornley
A: 
paperhorse
I'm pretty sure he's joking
Jonah
A: 

It all depends on the algorithm. In other words, the question becomes how many times a file can be compressed using this algorithm first, then this one next...

Cristian