Compressing algorithms try to find repeated subsequences to replace them with a shorter representation.
Let's take the 25 byte long string Blah blah blah blah blah!
(200 bit) from An Explanation of the Deflate Algorithm for example.
Naive approach
A naive approach would be to encode every character with a code word of the same length. We have 7 different characters and thus need codes with the length of ceil(ld(7)) = 3
. Our code words can than look like these:
000 → "B"
001 → "l"
010 → "a"
011 → "h"
100 → " "
101 → "b"
110 → "!"
111 → not used
Now we can encode our string as follows:
000 001 010 011 100 101 001 010 011 100 101 001 010 011 100 101 001 010 110
B l a h _ b l a h _ b l a h _ b l a !
That would just need 25·3 bit = 75 bit for the encoded word plus 7·8 bit = 56 bit for the dictionary, thus 131 bit (65.5%)
Or for sequences:
00 → "lah b"
01 → "B"
10 → "lah!"
11 → not used
The encoded word:
01 00 00 00 00 10
B lah b lah b lah b lah b lah!
Now we just need 6·2 bit = 12 bit for the encoded word and 10·8 bit = 80 bit plus 3·8 bit = 24 bit for the length of each word, thus 116 bit (58.0%).
Huffman code approach
The Huffman code is used to encode more frequent characters/substrings with shorter code than less frequent ones:
5 × "l", "a", "h"
4 × " ", "b"
1 × "B", "!"
// or for sequences
4 × "lah b"
1 × "B", "lah!"
A possible Huffman code for that is:
0 → "l"
10 → "a"
110 → "h"
1110 → " "
11110 → "b"
111110 → "B"
111111 → "!"
Or for sequences:
0 → "lah b"
10 → "B"
11 → "lah!"
Now our Blah blah blah blah blah!
can be encoded to:
111110 0 10 110 1110 11110 0 10 110 1110 11110 0 10 110 1110 11110 0 10 110 1110 11110 0 10 110 111111
B l a h _ b l a h _ b l a h _ b l a h _ b l a h !
Or for sequences:
10 0 0 0 0 11
B lah b lah b lah b lah b lah!
Now out first code just needs 78 bit or 8 bit instead of 25·8 = 200 bit like our initial string has. But we still need to add the dictionary where our characters/sequences are stored. For our per-character example we would need 7 additional bytes (7·8 bit = 56 bit) and our per-sequence example would need again 7 bytes plus 3 bytes for the length of each sequence (thus 59 bit). That would result in:
56 + 78 = 134 bit (67.0%)
59 + 8 = 67 bit (33.5%)
The actual numbers may not be correct. Please feel free to edit/correct it.