I have a Python program that takes text files as input, but some of these files may be gzip compressed. Is there a cross-platform way, usable from Python, to determine whether a file is gzip compressed? Is the following reliable, or could an ordinary text file 'accidentally' look gzip-like enough to give me false positives?

import gzip

try:
    gzip.GzipFile(filename, 'r')
    # compressed
    # ...
except:
    # not compressed
    # ...
    pass

Thanks, Ryan

+10  A: 

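The magic number for gzip compressed files is 1f 8b, the first two bytes of the file. Although testing for this is not 100% reliable, it is highly unlikely that "ordinary text files" start with those two bytes; in UTF-8 the sequence is not even legal, because 8b is a continuation byte and cannot follow a single-byte character such as 1f.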
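The magic-number test can be sketched in a few lines. This is a minimal sketch, not a definitive implementation: the names GZIP_MAGIC and looks_like_gzip are made up for illustration, and the check only inspects the first two bytes, so it cannot tell whether the rest of the file is a valid gzip stream.

```python
# Hypothetical helper: the names here are illustrative, not from any library.
GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip member


def looks_like_gzip(path):
    """Return True if the file at `path` starts with the gzip magic number."""
    # Open in binary mode so no newline or encoding translation happens,
    # which keeps the check cross-platform.
    with open(path, "rb") as f:
        return f.read(2) == GZIP_MAGIC
```

An empty or one-byte file simply yields False here, since read(2) returns fewer than two bytes.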

Usually gzip compressed files sport the suffix .gz though. Even gzip(1) itself won't unpack files without it unless you --force it to. You could conceivably use that, but you'd still have to deal with a possible IOError (which you have to in any case).

One problem with your approach is that gzip.GzipFile() will not raise an exception if you feed it an uncompressed file; only a later read() will. This means that you would probably have to implement some of your program logic twice. Ugly.
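One way around the duplication is to force the failure early with a one-byte probe read and fall back to a plain file object. The following is a sketch under assumptions: it targets Python 3 (where IOError is an alias of OSError and gzip.BadGzipFile, added in 3.8, is a subclass of it), and open_maybe_gzip is a made-up name.

```python
import gzip


def open_maybe_gzip(path):
    """Return a binary file object for `path`, decompressing if it is gzip.

    Hypothetical helper for illustration only.
    """
    f = gzip.open(path, "rb")  # does not inspect the file contents yet
    try:
        f.read(1)   # the gzip header is only checked on the first read
        f.seek(0)   # rewind so the caller sees the stream from the start
        return f
    except OSError:
        # Not a gzip file: fall back to reading the raw bytes.
        f.close()
        return open(path, "rb")
```

The caller then reads from the returned object the same way in both cases, so the program logic only needs to be written once. A missing file still raises from gzip.open() and has to be handled by the caller.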

hop
gzip compressed files often have the .gz file extension (in fact, I don't think I've ever seen a .gzip extension), but it's generally unsafe to rely on file extension to test for the type of file anyhow.
CanSpice
@CanSpice: of course, typo
hop
Does it? The gzip C library will transparently read uncompressed files. Although it will write files uncompressed, it puts CRC codes through them to allow "gzip -t" (caught me out once).
Martin Beckett
@Martin: it does:
$ gunzip foo
gzip: foo: unknown suffix -- ignored
hop
The C 'library' gzip, i.e. gzopen/gzread/etc., will transparently read uncompressed files. They have an open compression=none mode which does NOT write unchanged flat files.
Martin Beckett