According to the gzip specification, the uncompressed file size is stored in the last 4 bytes of a .gz file.

I have created 2 files with

dd if=/dev/urandom of=500M bs=1024 count=500000
dd if=/dev/urandom of=5G bs=1024 count=5000000

I gzipped them

gzip 500M 5G

I checked the last 4 bytes doing

tail -c4 500M.gz|od -I      (returns 512000000 as expected)
tail -c4 5G.gz|od -I        (returns 825032704, which is not what I expected)

It seems that crossing the invisible 32-bit barrier turns the value written into ISIZE into complete nonsense, which is more annoying than if they had used some error bit instead.

Does anyone know of a way to get the uncompressed .gz filesize from the .gz without extracting it?

thanks

specification: http://www.gzip.org/zlib/rfc-gzip.html

edit: if anyone wants to try it out, you could use /dev/zero instead of /dev/urandom

+1  A: 

I haven't tried this with a file of the size you mentioned, but I often find the uncompressed size of a .gz file with

zcat file.gz | wc -c

when I don't want to leave the uncompressed file lying around, or bother to compress it again.

Obviously, the data is uncompressed, but is then piped to wc.

It's worth a try, anyway.

EDIT: When I tried creating a 5G file with data from /dev/random, it produced a file named 5G of size 5120000000 bytes, although my file manager reported this as 4.8G (presumably because it counts in units of 2^30 bytes).

Then I compressed it with gzip 5G; the resulting 5G.gz was essentially the same size (random data hardly compresses).

Then zcat 5G.gz | wc -c reported the same size as the original file: 5120000000 bytes. So my suggestion seemed to have worked for this trial, anyway.

Thanks for waiting

pavium
Yes, thanks, but my question was more along the lines of: how do I get the uncompressed file size without actually doing a decompression? For files whose size fits in 32 bits, you can just extract the last 4 bytes. This is not possible for larger files and, as you have done, the only way is to do a decompression.
monkeyking
But my method performed a decompression which didn't affect the original compressed file, and didn't create an extra uncompressed file. There would be no cleaning up afterward. And I think it's worth noting that the answer you accepted said that decompression was the *only* way to get the exact size. It makes sense that *the only way to find out what's in the box, is to open it*.
pavium
Yes, it didn't affect the original file, but my concern was not "not touching" the file, but merely a speed issue. If I want to allocate an array for the entire data, then I need to know the size. That requires doing a decompression, followed by another decompression for the actual data copy. This is not necessary if the file is smaller than 2.1 gig. Standard gunzip can also decompress to stdout: gunzip -c file | wc -c. But thanks for your input :)
monkeyking
+6  A: 

There isn't one.

The only way to get the exact uncompressed size of a compressed stream is to actually go and decompress it (even if you write everything to /dev/null and just count the bytes).
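A minimal sketch of that counting approach, in case it helps. The helper name gz_true_size is made up for illustration; it only relies on gzip -dc and wc -c, and it still pays the full cost of decompression, it just never writes the output to disk.

```shell
# gz_true_size (hypothetical helper): decompress to stdout and count the bytes.
# Nothing is written to disk, but the whole stream is still decompressed once.
gz_true_size() {
    # tr strips the padding some wc implementations put around the number
    gzip -dc -- "$1" | wc -c | tr -d ' '
}
```

Calling gz_true_size 5G.gz on the file from the question should print the full size, at the price of one decompression pass.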

It's worth noting that ISIZE is defined as

ISIZE (Input SIZE)
This contains the size of the original (uncompressed) input
data modulo 2^32.

in the gzip RFC, so it isn't actually breaking at the 32-bit barrier; what you're seeing is expected behavior.
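To make the modulo concrete, here is a small sketch that checks the number from the question and reads ISIZE as an unsigned value. The helper name gz_isize is invented for illustration, and it assumes a little-endian host, since od -tu4 interprets bytes in host order while ISIZE is stored little-endian. For a .gz made of several concatenated members, the trailer only describes the last member.

```shell
# The value seen in the question is the true size reduced modulo 2^32.
echo $(( 5120000000 % 4294967296 ))   # 4294967296 = 2^32; prints 825032704

# gz_isize (hypothetical helper): print the 4-byte ISIZE trailer of a .gz file
# as an unsigned 32-bit integer. Assumes a little-endian host (see above).
gz_isize() {
    tail -c 4 -- "$1" | od -An -tu4 | tr -d ' \n'
}
```

So gz_isize is only trustworthy as the real size when the original input is known to be smaller than 2^32 bytes; otherwise it is just the remainder.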

Kevin Montrose