tags:
views: 559
answers: 3

I'm trying to figure out the best way to compress a stream with Python's zlib.

I've got a file-like input stream (input, below) and an output function which accepts a file-like object (output_function, below):

with open("file") as input:
    output_function(input)

And I'd like to gzip-compress input chunks before sending them to output_function:

with open("file") as input:
    output_function(gzip_stream(input))

It looks like the gzip module assumes that either the input or the output will be a gzip'd file on disk… So I assume that the zlib module is what I want.

However, it doesn't natively offer a simple way to create a streaming file-like object… And the stream compression it does support comes by way of manually adding data to a compression buffer, then flushing that buffer.

Of course, I could write a wrapper around zlib.Compress.compress and zlib.Compress.flush (Compress is returned by zlib.compressobj()), but I'd be worried about getting buffer sizes wrong, or something similar.
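(For concreteness, a rough sketch of the kind of wrapper I mean, written as a generator in modern Python — the chunk size is arbitrary, and wbits=31 is just how you ask compressobj for gzip framing:)

```python
import zlib

def gzip_stream(input, chunk_size=8192):
    # chunk_size is arbitrary; wbits=31 makes compressobj emit gzip-framed output
    compressor = zlib.compressobj(wbits=31)
    while True:
        chunk = input.read(chunk_size)
        if not chunk:
            break
        data = compressor.compress(chunk)
        if data:
            yield data
    # flush() emits any buffered data plus the gzip trailer
    yield compressor.flush()

# quick demo with an in-memory "file" (io.BytesIO stands in for a real stream)
import io
compressed = b"".join(gzip_stream(io.BytesIO(b"lumberjack " * 50)))
```

Note this gives back a generator rather than a true file-like, which is part of what I'm unhappy about.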

So, what's the simplest way to create a streaming, gzip-compressing file-like with Python?

Edit: To clarify, the input stream and the compressed output stream are both too large to fit in memory, so something like output_function(StringIO(zlib.compress(input.read()))) doesn't really solve the problem.

+1  A: 

The gzip module supports compressing to a file-like object: pass a fileobj parameter to GzipFile, as well as a filename. The filename you pass in doesn't need to exist; it's only used to fill in the filename field of the gzip header.
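For instance (a sketch in current Python — the in-memory buffer here is just a stand-in for whatever output stream you actually have):

```python
import gzip
import io

# any writable file-like object works as fileobj; BytesIO is a stand-in here
buf = io.BytesIO()
# "file" need not exist on disk; it only populates the gzip header's filename field
with gzip.GzipFile(filename="file", mode="wb", fileobj=buf) as gz:
    gz.write(b"some data to compress")

compressed = buf.getvalue()
```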

Mmmm… I hadn't noticed that… But I'm not sure it will work: either the `fileobj` must be a gzip'd input stream, or an output stream which the gzip'd data will be written to. So, better than nothing, but still not quite what I'd like.
David Wolever
+1  A: 

Use the cStringIO (or StringIO) module in conjunction with zlib:

>>> import zlib
>>> from cStringIO import StringIO
>>> s = StringIO()
>>> s.write(zlib.compress("I'm a lumberjack"))
>>> s.seek(0)
>>> zlib.decompress(s.read())
"I'm a lumberjack"
jcdyer
The problem with this, though, is that the entire input stream must be loaded into memory (when it's passed to `zlib.compress`) and then must be loaded into memory *again* when it is returned from `zlib.decompress`.
David Wolever
It never leaves memory, if you use StringIO. You said in your question that you wanted a "file-like object", which is common python terminology for an object that has similar methods to a file object. It doesn't say anything about whether it lives on disk or not. But then you also suggested that you didn't want a gz file. Can you please be more clear about what you are really looking for?
jcdyer
Err, sorry - yes, that is my fault. In my mind "file-like object" implies "something intended to be processed in chunks", but I guess that's a faulty assumption. I have updated the question.
David Wolever
have you looked at `zlib.compressobj()` and `zlib.decompressobj()`? Perfect for chunking.
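A minimal round-trip, for instance (the chunk boundaries are arbitrary):

```python
import zlib

# feed data in pieces; compress() may buffer internally and return b'' until flush()
c = zlib.compressobj()
compressed = c.compress(b"chunk one ") + c.compress(b"chunk two") + c.flush()

# the same works in reverse with decompressobj()
d = zlib.decompressobj()
original = d.decompress(compressed) + d.flush()
```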
jcdyer
Yup, I have. As I mentioned (albeit not very clearly), they work, but their interface isn't very standard, and it could depend on my getting things like buffer sizes correct.
David Wolever
+3  A: 

It's quite kludgy (self-referencing, etc.; I just spent a few minutes writing it, nothing really elegant), but it does what you want if you're still interested in using gzip instead of zlib directly.

Basically, GzipWrap is a (very limited) file-like object that produces a gzipped file out of a given iterable (e.g., a file-like object, a list of strings, any generator...)

Of course, it produces binary so there was no sense in implementing "readline".

You should be able to expand it to cover other cases or to be used as an iterable object itself.

from gzip import GzipFile

class GzipWrap(object):
    # input is a file-like object that feeds the input
    def __init__(self, input, filename=None):
        self.input = input
        self.buffer = ''
        # GzipFile writes its compressed output back into self via write() below
        self.zipper = GzipFile(filename, mode='wb', fileobj=self)

    def read(self, size=-1):
        if size < 0 or len(self.buffer) < size:
            for s in self.input:
                self.zipper.write(s)
                if size > 0 and len(self.buffer) >= size:
                    self.zipper.flush()
                    break
            else:
                # input exhausted: close the GzipFile so it writes the gzip trailer
                self.zipper.close()
        if size < 0:
            # return everything buffered so far and clear the buffer
            ret, self.buffer = self.buffer, ''
        else:
            ret, self.buffer = self.buffer[:size], self.buffer[size:]
        return ret

    def flush(self):
        pass

    def write(self, data):
        self.buffer += data

    def close(self):
        self.input.close()
Heim
haha very smart - passing `self` to the GzipFile. I like it!
David Wolever
(ok, so I see your point that it's not particularly elegant to pass 'self' to the GzipFile… But I still think it's a neat hack).
David Wolever
I've corrected a little bug in the code. When reading with size < 0, it didn't clear the buffer. I don't think you'll be using it like that, but a bug is a bug... O:)
Heim