So I've been thinking lately about how compression might be implemented, and what I've postulated so far is that it might use a sort of hash table whose keys are 'byte signatures' and whose values are the memory locations where each signature should be substituted back in when the compressed item is expanded.

Is this far from the truth?

How is compression typically implemented? No need for a page worth of answer, just in simple terms is fine.

+8  A: 

Check this wiki page...

Lossless compression algorithms usually exploit statistical redundancy in such a way as to represent the sender's data more concisely without error. Lossless compression is possible because most real-world data has statistical redundancy. For example, in English text, the letter 'e' is much more common than the letter 'z', and the probability that the letter 'q' will be followed by the letter 'z' is very small.

Another kind of compression, called lossy data compression or perceptual coding, is possible if some loss of fidelity is acceptable. Generally, a lossy data compression will be guided by research on how people perceive the data in question. For example, the human eye is more sensitive to subtle variations in luminance than it is to variations in color. JPEG image compression works in part by "rounding off" some of this less-important information. Lossy data compression provides a way to obtain the best fidelity for a given amount of compression. In some cases, transparent (unnoticeable) compression is desired; in other cases, fidelity is sacrificed to reduce the amount of data as much as possible.

Lossless compression schemes are reversible so that the original data can be reconstructed, while lossy schemes accept some loss of data in order to achieve higher compression.

However, lossless data compression algorithms will always fail to compress some files; indeed, any compression algorithm will necessarily fail to compress any data containing no discernible patterns. Attempts to compress data that has already been compressed will therefore usually result in an expansion, as will attempts to compress all but the most trivially encrypted data.

In practice, lossy data compression also reaches a point where compressing again does not help, although an extremely lossy algorithm, such as one that always removes the last byte of a file, will keep "compressing" a file until it is empty.

An example of lossless vs. lossy compression is the following string:

25.888888888

This string can be compressed as:

25.[9]8

Interpreted as "twenty-five point nine eights", the original string is perfectly recreated, just written in a smaller form. In a lossy system, using

26

instead, the original data is lost, at the benefit of a smaller file size.
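To make that concrete, here is a minimal run-length-encoding sketch in Python; the function name and the threshold for when a run is worth encoding are illustrative assumptions, not part of the answer above:

# Minimal run-length encoding: a character repeated n times becomes [n]char.
# An illustration of the "25.[9]8" idea above, not a real codec.
def rle_encode(s):
    out = []
    i = 0
    while i < len(s):
        j = i
        while j < len(s) and s[j] == s[i]:
            j += 1
        run = j - i
        # Only encode runs long enough to actually save space.
        out.append("[%d]%s" % (run, s[i]) if run > 3 else s[i] * run)
        i = j
    return "".join(out)

print(rle_encode("25.888888888"))  # -> 25.[9]8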

Chalkey
I'd love to vote you up. Perhaps you could quote some from your reference so that your answer doesn't become completely useless if the link breaks.
tvanfosson
I posted a link because the compression topic is so vast and there are so many different algorithms that I didn't want to spam the page with a massive essay.
Chalkey
+1 for compressing your answer!
Adam Liss
Haha Adam, like it!
Chalkey
+54  A: 

Compression algorithms try to find repeated subsequences and replace them with a shorter representation.

Let's take the 25-byte-long string Blah blah blah blah blah! (200 bit) from An Explanation of the Deflate Algorithm as an example.

Naive approach

A naive approach would be to encode every character with a code word of the same length. We have 7 different characters and thus need code words of length ceil(ld(7)) = 3, where ld denotes the binary logarithm (log2). Our code words can then look like these:

000 → "B"
001 → "l"
010 → "a"
011 → "h"
100 → " "
101 → "b"
110 → "!"
111 → not used

Now we can encode our string as follows:

000 001 010 011 100 101 001 010 011 100 101 001 010 011 100 101 001 010 011 100 101 001 010 011 110
B   l   a   h   _   b   l   a   h   _   b   l   a   h   _   b   l   a   h   _   b   l   a   h   !

That would just need 25·3 bit = 75 bit for the encoded word plus 7·8 bit = 56 bit for the dictionary, thus 131 bit (65.5%).

Or for sequences:

00 → "lah b"
01 → "B"
10 → "lah!"
11 → not used

The encoded word:

01 00    00    00    00    10
B  lah b lah b lah b lah b lah!

Now we just need 6·2 bit = 12 bit for the encoded word and 10·8 bit = 80 bit for the dictionary characters plus 3·8 bit = 24 bit for the length of each word, thus 116 bit (58.0%).
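Here is a rough Python sketch of the naive per-character scheme, assuming nothing beyond what is described above (the exact code-word assignment may differ from the table, but the lengths match):

import math

def fixed_length_encode(text):
    # Give every distinct character a code word of
    # ceil(log2(alphabet size)) bits, as in the table above.
    alphabet = sorted(set(text))
    width = max(1, math.ceil(math.log2(len(alphabet))))
    code = {ch: format(i, "0%db" % width) for i, ch in enumerate(alphabet)}
    return "".join(code[ch] for ch in text), code

bits, code = fixed_length_encode("Blah blah blah blah blah!")
print(len(bits))  # 75 bits for the encoded word, as computed above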

Huffman code approach

The Huffman code is used to encode more frequent characters/substrings with shorter code than less frequent ones:

5 × "l", "a", "h"
4 × " ", "b"
1 × "B", "!"

// or for sequences

4 × "lah b"
1 × "B", "lah!"

A possible prefix code for that (for the per-character case it is not the optimal Huffman code, but it keeps the example simple) is:

0      → "l"
10     → "a"
110    → "h"
1110   → " "
11110  → "b"
111110 → "B"
111111 → "!"

Or for sequences:

0  → "lah b"
10 → "B"
11 → "lah!"

Now our Blah blah blah blah blah! can be encoded as:

111110 0 10 110 1110 11110 0 10 110 1110 11110 0 10 110 1110 11110 0 10 110 1110 11110 0 10 110 111111
B      l a  h   _    b     l a  h   _    b     l a  h   _    b     l a  h   _    b     l a  h   !

Or for sequences:

10 0     0     0     0     11
B  lah b lah b lah b lah b lah!

Now our first code needs just 78 bit, and the sequence code only 8 bit, instead of the 25·8 = 200 bit of our initial string. But we still need to add the dictionary where our characters/sequences are stored. For our per-character example we would need 7 additional bytes (7·8 bit = 56 bit), and our per-sequence example would need 10 bytes for the sequence characters (10·8 bit = 80 bit) plus 3 bytes for the length of each sequence (3·8 bit = 24 bit), thus 104 bit. That would result in:

 56 + 78 = 134 bit (67.0%)
104 +  8 = 112 bit (56.0%)

The actual numbers may not be correct. Please feel free to edit/correct it.
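For comparison, here is a small Python sketch that builds an actual Huffman code using the standard heap-based construction (an illustration, not the code used for the tables above; a real encoder would also have to store the tree):

import heapq
from collections import Counter

def huffman_code(text):
    # Standard Huffman construction: repeatedly merge the two least
    # frequent subtrees; a symbol's code length is its final depth.
    heap = [[freq, i, {ch: ""}]
            for i, (ch, freq) in enumerate(Counter(text).items())]
    heapq.heapify(heap)
    while len(heap) > 1:
        lo = heapq.heappop(heap)
        hi = heapq.heappop(heap)
        merged = {ch: "0" + c for ch, c in lo[2].items()}
        merged.update({ch: "1" + c for ch, c in hi[2].items()})
        heapq.heappush(heap, [lo[0] + hi[0], lo[1], merged])
    return heap[0][2]

text = "Blah blah blah blah blah!"
code = huffman_code(text)
encoded = "".join(code[ch] for ch in text)
print(len(encoded))  # 67 bits: shorter than the 78-bit per-character code above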

Gumbo
what a great reply! kudos for the effort
Eli Bendersky
Is the ld() function log-to-base-2 ?
John Fouhy
+1 Fantastic example of rudimentary compression. Just for reference, Huffman Coding is an example of "Entropy Coding", which is just one technique in lossless compression. Using dictionaries is another popular approach. http://en.wikipedia.org/wiki/Lossless_data_compression contains lots of information on the different types.
Falaina
+5  A: 

Lossless compression algorithms translate each possible input into a distinct output, in such a way that more common inputs translate to shorter outputs. It's mathematically impossible for all possible inputs to be compressed -- otherwise you'd have two inputs A and B compressing to the same form, and when you decompress it, do you get back A or B? In practice, most useful information has some redundancy, and that redundancy fits certain patterns; the data can therefore usefully be compressed because the cases that expand when you compress them don't naturally arise.
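The counting behind that impossibility argument can be spelled out in a few lines of Python (purely illustrative):

# Pigeonhole argument: there are 2**n bit strings of length n, but only
# 2**n - 1 bit strings of length strictly less than n, so no lossless
# scheme can make every n-bit input shorter.
n = 8
inputs = 2 ** n
shorter_outputs = sum(2 ** k for k in range(n))  # lengths 0 .. n-1
print(inputs, shorter_outputs)  # 256 vs 255: at least two inputs would collide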

Lossy compression, such as that used in JPEG or MP3, works by approximating the input data with a signal that can be expressed in fewer bits than the original. When you decompress it, you don't get the original back, but you usually get something close enough.

Edmund
"It's mathematically impossible for all possible inputs to be compressed" ... When I found out that integers and floats used the same amount of memory, I felt so smart that I had realized I could use this to build a super-duper compression scheme in QBasic using the fact that ints can only have 64k different values, but floats can be, like, huge! Well, for some reason my data was corrupted after uncompressing. Boy, was I young...
balpha
If only you could make meaningful data out of nothing... We'd all be out of a job, and much happier for it, I'm sure.
Matthew Scharley
timday
@timday: Ah, a demon of the second kind! http://everything2.com/title/Demon%2520of%2520the%2520Second%2520Kind
John Fouhy
+2  A: 

In VERY simple terms, a common form of compression is a dictionary coder (http://en.wikipedia.org/wiki/Dictionary_coder). This involves replacing longer repeated strings with shorter ones.

For example if you have a file that looks like this:

"Monday Night","Baseball","7:00pm"
"Tuesday Night","Baseball","7:00pm"
"Monday Night","Softball","8:00pm"
"Monday Night","Softball","8:00pm"
"Monday Night","Baseball","5:00pm"

It would be roughly 170 characters, but if you were to do a simple substitution as follows: A="Monday Night", B="Tuesday Night", C="Baseball", D="Softball", E="7:00pm", F="8:00pm", G="5:00pm"

Then the same content could be encoded as:

A,C,E
B,C,E
A,D,F
A,D,F
A,C,G

Using only 25 characters! A clever observer could also see how to easily reduce this further to 15 characters if we assumed some more things about the format of the file. Obviously there is the overhead of the substitution key, but very large files often contain a great deal of this kind of repetition. This can be a very efficient way to compress large files or data structures and still allow them to be "somewhat" human readable.
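A toy version of that substitution in Python (the table and the encode function are just this example's; a real dictionary coder would discover the repeated strings itself):

# Toy dictionary coder for the schedule example above: replace each
# repeated quoted string with a single-letter token.
table = {
    '"Monday Night"': "A", '"Tuesday Night"': "B",
    '"Baseball"': "C", '"Softball"': "D",
    '"7:00pm"': "E", '"8:00pm"': "F", '"5:00pm"': "G",
}

def encode(line):
    for phrase, token in table.items():
        line = line.replace(phrase, token)
    return line

print(encode('"Monday Night","Baseball","7:00pm"'))  # -> A,C,E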

Mainguy
A: 

Rosetta Code has an entry on Huffman Coding, as does an earlier blog entry of mine.

Paddy3118