By the pigeonhole principle, every lossless compression algorithm can be "defeated", i.e. for some inputs it produces outputs which are longer than the input. Is it possible to explicitly construct a file which, when fed to e.g. gzip or another lossless compression program, will lead to (much) larger output? (Or, better still, a file which inflates ad infinitum upon subsequent compressions?)
Try to gzip the file that results from the following command:
echo a > file.txt
The compression of a 2-byte file resulted in a 31-byte gzipped file!
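If you want to verify this yourself, here is a minimal sketch (the exact size can vary a little with your gzip version, since the header stores metadata such as the original file name and a timestamp):
echo a > file.txt    # 2 bytes: 'a' plus a trailing newline
gzip file.txt        # replaces file.txt with file.txt.gz
wc -c file.txt.gz    # about 31 bytes: header, stored filename, deflate data, CRC/size trailer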
A text file with 1 byte in it (for example a single character like 'A') is stored in 1 byte on disk, but WinRAR rars it to 94 bytes and zips it to 141 bytes.
I know it's a sort of cheat answer but it works. I think it's going to be the biggest % difference between original size and 'compressed' size you are going to see.
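A quick way to try this on the command line (a sketch only; the exact sizes depend on the zip tool, its version, and any extra metadata fields it stores):
echo -n A > a.txt    # a 1-byte file, no trailing newline
zip a.zip a.txt      # using Info-ZIP's command-line zip
wc -c a.txt a.zip    # the archive's fixed per-file overhead dwarfs the 1 byte of data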
Take a look at the formats used for zipping; they are reasonably simple, and the most basic way to make the 'compressed' file larger than the original is to avoid any repeating data.
Random data, or data encrypted with a good cipher, would probably be best (see the sketch below).
But any good packer should only add constant overhead once it decides that it can't compress the data (@Frank). For a fixed overhead, an empty file or a single character will give the greatest percentage overhead.
For packers that include the filename (e.g. rar, zip, tar), you could of course just make the filename really long :-)
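As a sketch of the random-data case (figures will vary slightly by gzip version; on incompressible input DEFLATE falls back to "stored" blocks, which add a few bytes per 64 KB block plus the gzip header and trailer):
head -c 1000000 /dev/urandom > random.bin   # ~1 MB of incompressible data
gzip -c random.bin > random.bin.gz
wc -c random.bin random.bin.gz              # the .gz is slightly larger than the input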
Well, I'd assume eventually it'll max out since the bit patterns will repeat, but I just did:
touch file
gzip file -c > file.1
...
gzip file.9 -c > file.10
And got:
0 bytes: file
25 bytes: file.1
45 bytes: file.2
73 bytes: file.3
103 bytes: file.4
122 bytes: file.5
152 bytes: file.6
175 bytes: file.7
205 bytes: file.8
232 bytes: file.9
262 bytes: file.10
I plotted the sizes of 24,380 such files graphically, and the result really surprised me.
I was not expecting that kind of growth; I would have expected roughly linear growth, since each pass should just encapsulate the existing data in a header with a dictionary of patterns. I intended to run through 1,000,000 files, but my system ran out of disk space well before that.
If you want to reproduce, here is the bash script to generate the files:
#!/bin/bash
touch file.0
for ((i=0; i < 20000; i++)); do
    gzip -c file.$i > file.$(($i+1))
done
wc -c file.* | awk '{print $2 "\t" $1}' | sed 's/file.//' | sort -n > filesizes.txt
The resulting filesizes.txt is a tab-delimited, sorted file for your favorite graphing utility. (You'll have to manually remove the "total" field, or script it away.)
Even an empty file leads to an increase in size in its gzipped version.
Not sure about gzip, but we can come up with 'lossless compression' schemes which do not lead to an ad-infinitum increase in size.
We can view a file as a natural number (interpret it as a bitstring). The compression algorithm is basically a function f which maps a number to another number in a one-to-one fashion.
We can say it is a 'decent' compression function if f(n) is smaller than n for many n.
Consider the function
f(4x)     = 4x - 1
f(4x - 1) = 4x - 2
f(4x - 2) = 4x - 3
f(4x - 3) = 4x
This is a lossless 'compression' scheme which does not have the ad-infinitum increase in size problem.
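Here is a tiny sketch of that scheme in shell (assuming the natural numbers start at 1, so every n falls in some block {4x-3, ..., 4x}): within each block, three of the four values shrink by one and the smallest wraps back to the top, so iterating the "compressor" cycles inside a block of four instead of growing forever.
#!/bin/bash
# f as defined above: f(4x)=4x-1, f(4x-1)=4x-2, f(4x-2)=4x-3, f(4x-3)=4x
f() {
    local n=$1
    if (( n % 4 == 1 )); then    # n = 4x-3 is the only case that grows
        echo $(( n + 3 ))
    else
        echo $(( n - 1 ))
    fi
}

n=8
for i in {1..5}; do
    n=$(f $n)
    printf '%s ' "$n"            # prints: 7 6 5 8 7 (a cycle, not unbounded growth)
done
echo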
All these compression algorithms are looking for redundant data. If your file has little or no redundancy in it (like a sequence of abac…az, bcbd…bz, cdce…cz, etc.), it is very likely that the "deflated" output is rather an inflation.