The basic concept is that instead of using eight bits to represent each byte, you use shorter representations for more frequently occuring bytes or sequences of bytes.
For example, if your file consists solely of the byte 0x41 (A
) repeated sixteen times, then instead of representing it as the 8-bit sequence 01000001
shorten it to the 1-bit sequence 0
. Then the file can be represented by 0000000000000000
(sixteen 0
s). So then the file of the byte 0x41
repeated sixteen times can be represented by the file consisting of the byte 0x00
repeated twice.
So what we have here is that for this file (0x41
repeated sixteen times) the bits 01000001
don't convey any additional information over the bit 0
. So, in this case, we throw away the extraneous bits to obtain a shorter representation.
That is the core idea behind compression.
As another example, consider the eight byte pattern
0x41 0x42 0x43 0x44 0x45 0x46 0x47 0x48
and now repeat it 2048 times. One way to follow the approach above is to represent bytes using three bits.
000 0x41
001 0x42
010 0x43
011 0x44
100 0x45
101 0x46
110 0x47
111 0x48
Now we can represent the above byte pattern by 00000101 00111001 01110111
(this is the three-byte pattern 0x05 0x39 0x77
) repeated 2048 times.
But an even better approach is to represent the byte pattern
0x41 0x42 0x43 0x44 0x45 0x46 0x47 0x48
by the single bit 0
. Then we can represent the above byte pattern by 0
repeated 2048 times which becomes the byte 0x00
repeated 256 times. Now we only need to store the dictionary
0 -> 0x41 0x42 0x43 0x44 0x45 0x46 0x47 0x48
and the byte pattern 0x00
repeated 256 times and we compressed the file from 16,384 bytes to (modulo the dictionary) 256 bytes.
That, in a nutshell is how compression works. The whole business comes down to finding short, efficient representations of the bytes and byte sequences in a given file. That's the simple idea, but the details (finding the representation) can be quite challenging.
See for example:
- Data compression
- Run length encoding
- Huffman compression
- Shannon-Fano coding
- LZW