tags:

views:

452

answers:

5

Using gzip, tell() returns the offset in the uncompressed file.
In order to show a progress bar, I want to know the original (uncompressed) size of the file.
Is there an easy way to find out?

+1  A: 

Unix way: use "gunzip -l file.gz" via subprocess.call / os.popen, capture and parse its output.

yk4ever
Never touch operating systems that are older than me... Seriously speaking: I'm looking for a python solution, as the code is for all platforms.
Paul Oyster
Windows is at least 24 or 25 years old. Version 1 came out around 1985 or so. How old are you?
jmucchiello
44.5 (and last used Unix at 18)
Paul Oyster
A: 

Looking at the source for the gzip module, I see that the underlying file object for GzipFile seems to be fileobj. So:

mygzipfile = gzip.GzipFile()
...
mygzipfile.fileobj.tell()

?

Maybe it would be good to do some sanity checking before doing that, like checking that the attribute exists with hasattr.

Not exactly a public API, but...

Matt Anderson
.tell() works great. What I'm looking for is the original file size.
Paul Oyster
not mygzipfile.tell(), instead mygzipfile.fileobj.tell().
Matt Anderson
+3  A: 

The last 4 bytes of the .gz hold the original size of the file

gnibbler
The last 4 bytes is the “size of the original (uncompressed) input data modulo 2^32.” (http://www.gzip.org/zlib/rfc-gzip.html)
Gumbo
+4  A: 

The gzip format specifies a field called ISIZE that:

This contains the size of the original (uncompressed) input data modulo 2^32.

In gzip.py, which I assume is what you're using for gzip support, there is a method called _read_eof defined as such:

def _read_eof(self):
    # We've read to the end of the file, so we have to rewind in order
    # to reread the 8 bytes containing the CRC and the file size.
    # We check the that the computed CRC and size of the
    # uncompressed data matches the stored values.  Note that the size
    # stored is the true file size mod 2**32.
    self.fileobj.seek(-8, 1)
    crc32 = read32(self.fileobj)
    isize = U32(read32(self.fileobj))   # may exceed 2GB
    if U32(crc32) != U32(self.crc):
        raise IOError, "CRC check failed"
    elif isize != LOWU32(self.size):
        raise IOError, "Incorrect length of data produced"

There you can see that the ISIZE field is being read, but only to to compare it to self.size for error detection. This then should mean that GzipFile.size stores the actual uncompressed size. However, I think it's not exposed publicly, so you might have to hack it in to expose it. Not so sure, sorry.

I just looked all of this up right now, and I haven't tried it so I could be wrong. I hope this is of some use to you. Sorry if I misunderstood your question.

Jorge Israel Peña
I guess this is good enough. In case of a file larger than 4G, it is easy to add some heuristics to the progress bar to set the file-size to 4G + ISIZE, if tell() indicates we we are too close to ISIZE.
Paul Oyster
A: 

GzipFile.size stores the uncompressed size, but it's only incremented when you read the file, so you should prefer len(fd.read()) instead of the non-public GzipFile.size.

Guilherme Salgado