Hi,

Is there any way to project what kind of compression result you'd get using gzip on an arbitrary string? What factors contribute to the worst and best cases? I'm not sure how gzip works, but for example a string like:

"fffffff"

might compress well compared to something like:

"abcdefg"

Where do I start?

Thanks

A: 

gzip uses the deflate algorithm, which, crudely described, compresses files by replacing repeated strings with pointers to the first instance of the string. Thus, highly repetitive data compresses exceptionally well, while purely random data will compress very little, if at all.

By means of demonstration:

[chris@polaris ~]$ dd if=/dev/urandom of=random bs=1048576 count=1
1+0 records in
1+0 records out
1048576 bytes (1.0 MB) copied, 0.296325 s, 3.5 MB/s
[chris@polaris ~]$ ll random
-rw-rw-r-- 1 chris chris 1048576 2010-08-30 16:12 random
[chris@polaris ~]$ gzip random
[chris@polaris ~]$ ll random.gz
-rw-rw-r-- 1 chris chris 1048761 2010-08-30 16:12 random.gz

[chris@polaris ~]$ dd if=/dev/zero of=ordered bs=1048576 count=1
1+0 records in
1+0 records out
1048576 bytes (1.0 MB) copied, 0.00476905 s, 220 MB/s
[chris@polaris ~]$ ll ordered
-rw-rw-r-- 1 chris chris 1048576 2010-08-30 16:12 ordered
[chris@polaris ~]$ gzip ordered
[chris@polaris ~]$ ll ordered.gz
-rw-rw-r-- 1 chris chris 1059 2010-08-30 16:12 ordered.gz

My purely random data sample actually got larger due to overhead, while my file full of zeroes compressed to 0.1% of its previous size.
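
To run the same comparison on the strings from the question, here is a short Python sketch of mine (Python's gzip module wraps the same DEFLATE implementation the gzip tool uses; os.urandom stands in for /dev/urandom):

import gzip
import os

samples = {
    "fffffff (question)": b"fffffff",
    "abcdefg (question)": b"abcdefg",
    "100 KB of 'f'":      b"f" * 100_000,
    "100 KB of random":   os.urandom(100_000),
}
for label, data in samples.items():
    compressed = gzip.compress(data, compresslevel=9)
    print(f"{label}: {len(data)} -> {len(compressed)} bytes")

Note that the two seven-byte strings will both come out larger than the input, because the gzip header and trailer (roughly 20 bytes) dominate at that size; the difference between repetitive and random data only shows up at realistic input sizes.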

Chris Heald
Well, there is more to it, namely the Huffman encoding.
ypnos
There is more to it, hence the "crudely described". :)
Chris Heald
A: 

The algorithm used by gzip is called DEFLATE.

It combines two popular compression techniques: duplicate string elimination (LZ77 back-references) and bit reduction (Huffman coding). Both are explained in the Wikipedia article on DEFLATE.

As a rule of thumb, compression works best when some characters occur much more often than the rest and/or when characters are frequently repeated consecutively. Compression is worst when the characters are uniformly distributed in the input and change every time.
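
To illustrate the rule of thumb, a small sketch of mine using Python's zlib (which implements DEFLATE): a heavily skewed character distribution compresses far better than a uniform one of the same length.

import random
import zlib

random.seed(0)
n = 100_000

# Skewed: 'a' makes up ~93% of the bytes, so Huffman codes get very short.
skewed = bytes(random.choices(b"abcdefgh", weights=[93, 1, 1, 1, 1, 1, 1, 1], k=n))
# Uniform: every byte value equally likely, nothing for the matcher to reuse.
uniform = bytes(random.choices(range(256), k=n))

for label, data in [("skewed", skewed), ("uniform", uniform)]:
    print(label, "->", len(zlib.compress(data, 9)), "bytes from", len(data))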

There are also ways to measure this, such as the entropy of the data: the lower the entropy of the input, the smaller the compressed output can be.
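
For example, the Shannon entropy of the byte histogram gives a lower bound, in bits per byte, on what any coder that looks at single characters in isolation can achieve. A minimal sketch:

import math
from collections import Counter

def entropy_bits_per_byte(data: bytes) -> float:
    # Shannon entropy of the byte distribution, between 0 and 8 bits per byte.
    counts = Counter(data)
    n = len(data)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

print(entropy_bits_per_byte(b"fffffff"))  # 0.0: a single repeated symbol
print(entropy_bits_per_byte(b"abcdefg"))  # ~2.81: seven equally likely symbols

Keep in mind this only captures the character-frequency half of the story: bytes(range(256)) * 100 has the maximum entropy of 8 bits per byte, yet gzips very well, because the duplicate string elimination exploits the repeating 256-byte block.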

ypnos