Hi,

Is there any way to project what kind of compression result you'd get using gzip on an arbitrary string? What factors contribute to the worst and best cases? I'm not sure how gzip works, but for example a string like:

"fffffff"

might compress well compared to something like:

"abcdefg"

Where do I start?

Thanks

A: 

gzip uses the deflate algorithm, which, crudely described, compresses files by replacing repeated strings with pointers to the first instance of the string. Thus, highly repetitive data compresses exceptionally well, while purely random data will compress very little, if at all.

By means of demonstration:

[chris@polaris ~]$ dd if=/dev/urandom of=random bs=1048576 count=1
1+0 records in
1+0 records out
1048576 bytes (1.0 MB) copied, 0.296325 s, 3.5 MB/s
[chris@polaris ~]$ ll random
-rw-rw-r-- 1 chris chris 1048576 2010-08-30 16:12 random
[chris@polaris ~]$ gzip random
[chris@polaris ~]$ ll random.gz
-rw-rw-r-- 1 chris chris 1048761 2010-08-30 16:12 random.gz

[chris@polaris ~]$ dd if=/dev/zero of=ordered bs=1048576 count=1
1+0 records in
1+0 records out
1048576 bytes (1.0 MB) copied, 0.00476905 s, 220 MB/s
[chris@polaris ~]$ ll ordered
-rw-rw-r-- 1 chris chris 1048576 2010-08-30 16:12 ordered
[chris@polaris ~]$ gzip ordered
[chris@polaris ~]$ ll ordered.gz
-rw-rw-r-- 1 chris chris 1059 2010-08-30 16:12 ordered.gz

My purely random data sample actually got larger due to overhead, while my file full of zeroes compressed to 0.1% of its previous size.
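
To run the same comparison on the strings from the question, here is a short Python sketch of mine (Python's gzip module wraps the same DEFLATE implementation the gzip tool uses; os.urandom stands in for /dev/urandom):

import gzip
import os

samples = {
    "fffffff (question)": b"fffffff",
    "abcdefg (question)": b"abcdefg",
    "100 KB of 'f'":      b"f" * 100_000,
    "100 KB of random":   os.urandom(100_000),
}
for label, data in samples.items():
    compressed = gzip.compress(data, compresslevel=9)
    print(f"{label}: {len(data)} -> {len(compressed)} bytes")

Note that the two seven-byte strings will both come out larger than the input, because the gzip header and trailer (roughly 20 bytes) dominate at that size; the difference between repetitive and random data only shows up at realistic input sizes.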

Chris Heald
Well, there is more to it, namely the Huffman encoding.
ypnos
There is more to it, hence the "crudely described". :)
Chris Heald
A: 

The algorithm used by gzip is called DEFLATE.

It combines two popular compression techniques: duplicate string elimination (LZ77 back-references) and bit reduction (Huffman coding). Both are explained in the Wikipedia article on DEFLATE.

As a rule of thumb, compression works best when some characters occur much more often than the rest and/or when characters are frequently repeated consecutively. Compression is worst when the characters are uniformly distributed in the input and change every time.
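
To illustrate the rule of thumb, a small sketch of mine using Python's zlib (which implements DEFLATE): a heavily skewed character distribution compresses far better than a uniform one of the same length.

import random
import zlib

random.seed(0)
n = 100_000

# Skewed: 'a' makes up ~93% of the bytes, so Huffman codes get very short.
skewed = bytes(random.choices(b"abcdefgh", weights=[93, 1, 1, 1, 1, 1, 1, 1], k=n))
# Uniform: every byte value equally likely, nothing for the matcher to reuse.
uniform = bytes(random.choices(range(256), k=n))

for label, data in [("skewed", skewed), ("uniform", uniform)]:
    print(label, "->", len(zlib.compress(data, 9)), "bytes from", len(data))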

There are also ways to measure this, such as the entropy of the data: the lower the entropy of the input, the smaller the compressed output can be.
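
For example, the Shannon entropy of the byte histogram gives a lower bound, in bits per byte, on what any coder that looks at single characters in isolation can achieve. A minimal sketch:

import math
from collections import Counter

def entropy_bits_per_byte(data: bytes) -> float:
    # Shannon entropy of the byte distribution, between 0 and 8 bits per byte.
    counts = Counter(data)
    n = len(data)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

print(entropy_bits_per_byte(b"fffffff"))  # 0.0: a single repeated symbol
print(entropy_bits_per_byte(b"abcdefg"))  # ~2.81: seven equally likely symbols

Keep in mind this only captures the character-frequency half of the story: bytes(range(256)) * 100 has the maximum entropy of 8 bits per byte, yet gzips very well, because the duplicate string elimination exploits the repeating 256-byte block.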

ypnos