views:

328

answers:

3

What's the concept behind zip compression? I can understand the concept of removing empty space etc, but presumably something has to be added to say how much/where that free space needs to be added back in during decompression?

What's the basic process for compressing a stream of bytes?

+25  A: 

A good place to start would be to look up the Huffman compression scheme. The basic idea behind Huffman is that in a given file some bytes appear more frequently than others (in a plaintext file many bytes won't appear at all). Rather than spending 8 bits to encode every byte, why not use a shorter bit sequence to encode the most common characters, and longer sequences to encode the less common characters (these sequences are determined by building a Huffman tree).

Once you get a handle on using these trees to encode/decode files based on character frequency, imagine that you then start working on word frequency - instead of encoding "they" as a sequence of 4 characters, why not consider it to be a single character due to its frequency, allowing it to be assigned its own leaf in the Huffman tree. This is more or less the basis of ZIP and other lossless compression schemes - they look for common "words" (sequences of bytes) in a file (including sequences of just 1 byte if common enough) and use a tree to encode them. The zip file then only needs to include the tree info (a copy of each sequence and the number of times it appears) to allow the tree to be reconstructed and the rest of the file to be decoded.
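
To make this concrete, here is a rough Python sketch of building a Huffman code from byte frequencies (the helper name build_huffman_code is just illustrative, not from any standard library):

    import heapq
    from collections import Counter

    def build_huffman_code(data: bytes) -> dict:
        # Count how often each byte value occurs.
        freq = Counter(data)
        # Heap entries are (frequency, tiebreaker, tree); a tree is either
        # a byte value (leaf) or a (left, right) pair (internal node).
        heap = [(f, i, sym) for i, (sym, f) in enumerate(freq.items())]
        heapq.heapify(heap)
        count = len(heap)
        while len(heap) > 1:
            f1, _, left = heapq.heappop(heap)
            f2, _, right = heapq.heappop(heap)
            heapq.heappush(heap, (f1 + f2, count, (left, right)))
            count += 1
        # Walk the tree: going left appends "0", going right appends "1".
        codes = {}
        def walk(node, prefix):
            if isinstance(node, tuple):
                walk(node[0], prefix + "0")
                walk(node[1], prefix + "1")
            else:
                codes[node] = prefix or "0"  # degenerate one-symbol input
        walk(heap[0][2], "")
        return codes

    # Frequent bytes (like the space) get short codes; rare bytes get long ones.
    print(build_huffman_code(b"this is an example of a huffman tree"))

To encode whole words instead of single bytes, you would count word frequencies rather than byte frequencies; the tree-building step is identical.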

Follow up:

To better answer the original question, the idea behind lossless compression is not so much to remove empty space as to remove redundant information.

If you created a database to store music lyrics, you'd find a lot of space was being used to store the chorus, which repeats several times. Instead of using all that space, you could simply place the word CHORUS before the first instance of the chorus lines, and then every time the chorus is to be repeated, just use CHORUS as a placeholder. (In fact, this is pretty much the idea behind LZW compression - in LZW, each line of the song would have a number shown before it. If a line repeats later in the song, rather than writing out the whole line, only the number is shown.)
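
Here is a rough sketch of that idea as the classic LZW algorithm over bytes (a simplified illustration; as a comment below notes, zip itself actually uses LZ77 plus Huffman):

    def lzw_compress(data: bytes) -> list:
        # The dictionary starts with every single byte, numbered 0..255.
        table = {bytes([i]): i for i in range(256)}
        current = b""
        output = []
        for byte in data:
            candidate = current + bytes([byte])
            if candidate in table:
                current = candidate            # keep extending the match
            else:
                output.append(table[current])  # emit the number of the longest match
                table[candidate] = len(table)  # number the newly seen sequence
                current = bytes([byte])
        if current:
            output.append(table[current])
        return output

    # The repeated "la " fragments collapse into a short run of dictionary numbers.
    print(lzw_compress(b"la la la la la la"))  # [108, 97, 32, 256, 258, 257, 259, 256]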

David
Excellent way to provide a summary of the answer with links to further reading rather than simply sending the OP to wiki/google.
EBGreen
A more basic compression scheme is probably RLE, but it does not explain much about the more advanced kinds.
Paul de Vrieze
As an additional resource, you could add a link or mention the Security Now! podcast. In episode #205, Steve Gibson discusses compression theory and how it has evolved over time. Here is a link to the transcript: http://www.grc.com/sn/sn-205.txt
EBGreen
Actually you don't even need to store the tree, if you use dynamic Huffman compression. When encoding, you start with some default tree. Then you encode a single character/word using it, update the tree based on the frequency of the characters/words you have already read, and process the next character/word until the end of input is reached. When decoding, you start with the same initial tree and update it based on the input.
Filip Navara
For accuracy's sake - zip is (usually) not LZW-based, but rather LZ77 + (non-dynamic) Huffman based. LZ77 is the compression algorithm that eliminates repeating sequences of letters ("oh baby baby" has one redundant " baby" in it), while Huffman is an efficient encoding of the output of LZ77.
r0u1i
A: 

The concept behind compression is basically statistical. If you've got a series of bytes, the chance of byte N being X in practice depends on the value distribution of the previous bytes 0..N-1. Without compression, you allocate 8 bits for each possible value X. With compression, the number of bits allocated for each value X depends on the estimated probability p(N,X).

For instance, given the sequence "aaaa", a compression algorithm can assign a high value to p(5,a) and lower values to p(5,b). When p(X) is high, the bitstring assigned to X will be short; when p(X) is low, a long bitstring is used. In this way, if p(N,X) is a good estimate, then the average bitstring will be shorter than 8 bits.
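
As a rough illustration (assuming p is estimated from plain symbol frequencies, ignoring the context of previous bytes), the ideal code length for a value with probability p is about -log2(p) bits:

    import math
    from collections import Counter

    def ideal_code_lengths(data: bytes) -> dict:
        # Estimate p(X) from observed frequencies; -log2 p(X) is the
        # ideal number of bits to spend on value X.
        total = len(data)
        return {sym: -math.log2(n / total) for sym, n in Counter(data).items()}

    # 'a' dominates the input, so it earns a code far shorter than 8 bits.
    for sym, bits in ideal_code_lengths(b"aaaaaaaaab").items():
        print(chr(sym), round(bits, 2))  # a 0.15, b 3.32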

MSalters
+5  A: 

The basic concept is that instead of using eight bits to represent each byte, you use shorter representations for more frequently occurring bytes or sequences of bytes.

For example, if your file consists solely of the byte 0x41 (A) repeated sixteen times, then instead of representing it as the 8-bit sequence 01000001, shorten it to the 1-bit sequence 0. The file can then be represented by 0000000000000000 (sixteen 0s), which is the byte 0x00 repeated twice.

So what we have here is that for this file (0x41 repeated sixteen times) the bits 01000001 don't convey any additional information over the bit 0. So, in this case, we throw away the extraneous bits to obtain a shorter representation.

That is the core idea behind compression.
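
Here is a tiny sketch of that bit-packing arithmetic (pack_bits is a hypothetical helper, not a standard function):

    def pack_bits(bits: str) -> bytes:
        # Pack a string of '0'/'1' characters into bytes, most significant bit first.
        padded = bits.ljust((len(bits) + 7) // 8 * 8, "0")
        return bytes(int(padded[i:i + 8], 2) for i in range(0, len(padded), 8))

    # Sixteen copies of 0x41, each encoded as the single bit '0':
    print(pack_bits("0" * 16))  # b'\x00\x00' -- two bytes instead of sixteen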

As another example, consider the eight byte pattern

0x41 0x42 0x43 0x44 0x45 0x46 0x47 0x48

and now repeat it 2048 times. One way to follow the approach above is to represent bytes using three bits.

000 0x41
001 0x42
010 0x43
011 0x44
100 0x45
101 0x46
110 0x47
111 0x48

Now we can represent the above byte pattern by 00000101 00111001 01110111 (this is the three-byte pattern 0x05 0x39 0x77) repeated 2048 times.
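
Reusing the hypothetical pack_bits helper from the sketch above, you can verify those three bytes:

    codes = {0x41: "000", 0x42: "001", 0x43: "010", 0x44: "011",
             0x45: "100", 0x46: "101", 0x47: "110", 0x48: "111"}
    pattern = bytes(range(0x41, 0x49))       # 0x41..0x48
    bits = "".join(codes[b] for b in pattern)
    print(pack_bits(bits).hex())             # '053977', i.e. 0x05 0x39 0x77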

But an even better approach is to represent the byte pattern

0x41 0x42 0x43 0x44 0x45 0x46 0x47 0x48

by the single bit 0. Then we can represent the above byte pattern by 0 repeated 2048 times, which becomes the byte 0x00 repeated 256 times. Now we only need to store the dictionary

0 -> 0x41 0x42 0x43 0x44 0x45 0x46 0x47 0x48

and the byte 0x00 repeated 256 times, and we have compressed the file from 16,384 bytes to (modulo the dictionary) 256 bytes.
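
A small self-contained sketch of that dictionary idea (deliberately simplified to a one-entry dictionary):

    pattern = bytes(range(0x41, 0x49))           # 0x41..0x48
    original = pattern * 2048                    # 16,384 bytes

    # The dictionary stores the pattern once; each occurrence then costs one bit.
    dictionary = {"0": pattern}
    compressed = bytes(2048 // 8)                # 2048 zero bits = 256 zero bytes

    # Decompression replays the dictionary entry once per encoded bit.
    restored = dictionary["0"] * (len(compressed) * 8)
    assert restored == original
    print(len(original), "->", len(compressed))  # 16384 -> 256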

That, in a nutshell, is how compression works. The whole business comes down to finding short, efficient representations of the bytes and byte sequences in a given file. That's the simple idea, but the details (finding the representation) can be quite challenging.

See for example:

  1. Data compression
  2. Run length encoding
  3. Huffman compression
  4. Shannon-Fano coding
  5. LZW
Jason