ansaurus

Question

Unzipping part of a .gz file using python

Answer 1

+1 A:

It seems that you need to look into Python's zlib library instead.

The GZIP format relies on zlib, but introduces a file-level compression concept along with CRC checking, and this appears to be what you do not want/need at the moment.

See for example these code snippets from Doug Hellmann.

Edit: the code on Doug Hellmann's site only shows how to compress or decompress with zlib. As indicated above, GZIP is "zlib with an envelope", and you'll need to decode the envelope before getting to the compressed data per se. Here's more info on how to go about it; it's really not that complicated:

See RFC 1952 for details about the GZIP format.
This format starts with a 10-byte header, followed by optional, non-compressed elements such as the file name or a comment, followed by the compressed data (a raw DEFLATE stream), itself followed by an 8-byte trailer: a CRC-32 of the uncompressed data and its length modulo 2^32. (Adler-32 is the checksum of the zlib format, not of GZIP.)
By using Python's struct module, parsing the header should be relatively simple.
The compressed sequence (or its first few thousand bytes, since that is what you want to do) can then be decompressed with Python's zlib module, as shown in the examples above. Because the payload is a raw DEFLATE stream without the zlib header, zlib has to be told so by passing a negative wbits value.
Possible problems to handle: there may be more than one file (member) in the GZip archive, and the second one may start within the block of a few thousand bytes we wish to decompress.

Sorry to provide neither a simple procedure nor a ready-to-go snippet; however, decoding the file with the indications above should be relatively quick and simple.
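That said, the steps above might be sketched roughly as follows. This is only a minimal illustration, not a hardened parser: the helper names skip_gzip_header and decompress_prefix are made up for this example, it assumes the whole header fits inside the chunk you have, and it ignores the multi-member case mentioned above.

```python
import struct
import zlib

# Flag bits defined in RFC 1952, section 2.3.1
FHCRC, FEXTRA, FNAME, FCOMMENT = 0x02, 0x04, 0x08, 0x10


def skip_gzip_header(buf):
    """Return the offset in buf at which the compressed data starts."""
    # Fixed 10-byte header: magic, method, flags, mtime, extra flags, OS
    magic, method, flags, _mtime, _xfl, _os = struct.unpack_from("<2sBBIBB", buf, 0)
    if magic != b"\x1f\x8b" or method != 8:
        raise ValueError("not a gzip stream using DEFLATE")
    pos = 10
    if flags & FEXTRA:
        # 2-byte little-endian length, then that many bytes of extra data
        (xlen,) = struct.unpack_from("<H", buf, pos)
        pos += 2 + xlen
    if flags & FNAME:
        pos = buf.index(b"\x00", pos) + 1  # zero-terminated file name
    if flags & FCOMMENT:
        pos = buf.index(b"\x00", pos) + 1  # zero-terminated comment
    if flags & FHCRC:
        pos += 2  # 16-bit header checksum
    return pos


def decompress_prefix(chunk):
    """Decompress whatever can be decoded from a (possibly truncated) gzip prefix."""
    start = skip_gzip_header(chunk)
    # Negative wbits: raw DEFLATE data, no zlib header or trailer expected
    d = zlib.decompressobj(-zlib.MAX_WBITS)
    return d.decompress(chunk[start:])
```

With the first 2000 bytes of the file in hand, decompress_prefix(first_2000_bytes) returns however many bytes those happen to decode to, which can then be searched. If parsing the header by hand is not a goal in itself, zlib.decompressobj(16 + zlib.MAX_WBITS) asks zlib to handle the gzip envelope itself.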

mjv 2009-11-14 00:19:33

@mjv... Which particular code snippet applies to the example above? I went through the link and read "Working with Streams". Nowhere does it state that it's working with gzip streams. I assume this works with zlib streams (I have tested it with zlib streams).

2009-11-14 00:35:30

@unknown: Check my edit; the code snippets pertain to compressing/decompressing to/from pure zlib. The GZip format implies first parsing a small, uncompressed header, before finding its zlib "payload", which can be decompressed as shown.

mjv 2009-11-14 05:35:39

Answer 2

A:

I can't see any possible reason why you would want to decompress the first 2000 compressed bytes. Depending on the data, this may uncompress to any number of output bytes.

Surely you want to uncompress the file, and stop when you have uncompressed as much of the file as you need, something like:

import gzip

# Read only the first 4000 bytes of decompressed data
with gzip.open('postcode-code.tar.gz', 'rb') as f:
    data = f.read(4000)
print(data)

AFAIK, this won't cause the whole file to be read. It will only read as much as is necessary to get the first 4000 bytes.
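One way to check this is to decompress from an in-memory stream and look at how far the underlying file position has moved. The sketch below uses made-up test data (1 MiB of random bytes, chosen because it barely compresses); the exact number of compressed bytes consumed depends on the buffer sizes of the Python version in use, so no particular figure is claimed here.

```python
import gzip
import io
import os

# Hypothetical test data: random bytes barely compress, so the
# compressed stream is roughly as large as the input.
payload = os.urandom(1 << 20)
compressed = gzip.compress(payload)

raw = io.BytesIO(compressed)
with gzip.GzipFile(fileobj=raw) as f:
    data = f.read(4000)
    consumed = raw.tell()  # position reached in the compressed stream

print(len(data), "bytes decompressed;",
      consumed, "of", len(compressed), "compressed bytes read")
```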

rjmunro 2009-11-14 00:22:20

f.read(2000) here will read the first 2000 bytes of decompressed data. I am interested in the first 2000 bytes of compressed data.

2009-11-14 00:25:00

Why? What on earth is your application?

rjmunro 2009-11-14 00:27:22

:-) I am trying to find the string "xyz" in the first 4k of data. Assuming I decompress 2K of gzipped data and end up with 4K of decompressed data, I can search/grep in this 4k for the string. All the searching code is already in place.

2009-11-14 00:31:41

Assume that all I am going to get is the first 2k of compressed data from a 60K .gz file. After that nothing. Nada. I need to *find* my string in the decompressed part of this 2k.

2009-11-14 00:37:20
