I'm querying a database and archiving the results using Python, and I'm trying to compress the data as I write it to the log files. I'm having some problems with it, though.

My code looks like this:

import codecs

log_file = codecs.open(archive_file, 'w', 'bz2')
for id, f1, f2, f3 in cursor:
    log_file.write('%s %s %s %s\n' % (id, f1 or 'NULL', f2 or 'NULL', f3))

However, my output file is 1,409,780 bytes. Running bunzip2 on it yields a file of 943,634 bytes, and running bzip2 on that yields 217,275 bytes. In other words, the uncompressed file is significantly smaller than the file compressed using Python's bzip codec. Is there a way to fix this, other than running bzip2 on the command line?

I tried Python's 'zip' codec (changing the line to codecs.open(archive_file, 'a+', 'zip')) to see if it fixed the problem. I still get large files, but I also get a gzip: archive_file: not in gzip format error when I try to uncompress the file. What's going on there?


EDIT: I originally had the file opened in append mode, not write mode. While this may or may not be a problem, the question still holds if the file's opened in 'w' mode.

A: 

The problem is due to your use of append mode, which results in files that contain multiple compressed streams of data. Look at this example:

>>> import codecs
>>> with codecs.open("myfile.zip", "a+", "zip") as f:
...     f.write("ABCD")

On my system, this produces a file 12 bytes in size. Let's see what it contains:

>>> with codecs.open("myfile.zip", "r", "zip") as f:
...     f.read()
'ABCD'

Okay, now let's do another write in append mode:

>>> with codecs.open("myfile.zip", "a+", "zip") as f:
...     f.write("EFGH")

The file is now 24 bytes in size, and its contents are:

>>> with codecs.open("myfile.zip", "r", "zip") as f:
...     f.read()
'ABCD'

What's happening here is that the decompressor expects a single compressed stream. You'd have to check the specs for the official behavior with multiple concatenated streams, but in my experience most tools process the first one and ignore the rest of the data. That's what Python does.

I expect that bunzip2 is doing the same thing. So in reality your file is compressed, and is much smaller than the data it contains. But when you run it through bunzip2, you're getting back only the first set of records you wrote to it; the rest is discarded.
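
For illustration (this helper isn't part of the original answer), here's a minimal sketch of how you could recover every concatenated stream yourself. The files above were written with the 'zip' (zlib) codec, so it loops a fresh zlib.decompressobj over the unused_data each stream leaves behind:

import zlib

def decompress_all_streams(data):
    # Decode every concatenated zlib stream in `data`, not just the first.
    chunks = []
    while data:
        decomp = zlib.decompressobj()
        chunks.append(decomp.decompress(data))
        data = decomp.unused_data   # bytes left over after this stream ended
    return ''.join(chunks)

print decompress_all_streams(open("myfile.zip", "rb").read())   # prints: ABCDEFGH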

DNS
First, the size differential is the result of running the program once. Running it with 'w' produces exactly the same file that 'a+' does, which is about 30% larger than the uncompressed version. Second, even though Python doesn't read past the first compressed block of data, bunzip2 does.
Chris B.
A: 

I'm not sure how different this is from the codecs way of doing it, but if you use GzipFile from the gzip module, you can incrementally append to the file. However, it won't compress very well unless you write large amounts of data at a time (maybe > 1 KB); this is just the nature of the compression algorithms. If the data you're writing isn't super important (i.e., you can tolerate losing it if your process dies), you could write a buffered GzipFile class wrapping the imported class that writes out larger chunks of data, as sketched below.
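
A minimal sketch of that idea (the class name and the 64 KB threshold are my own assumptions, not part of the answer):

import gzip

class BufferedGzipWriter(object):
    def __init__(self, filename, buffer_size=64 * 1024):
        self.gz = gzip.GzipFile(filename, 'ab')   # appends a new gzip member
        self.buffer = []
        self.buffered = 0
        self.buffer_size = buffer_size

    def write(self, data):
        # Accumulate writes in memory so the compressor sees bigger chunks.
        self.buffer.append(data)
        self.buffered += len(data)
        if self.buffered >= self.buffer_size:
            self.flush()

    def flush(self):
        self.gz.write(''.join(self.buffer))
        self.buffer = []
        self.buffered = 0

    def close(self):
        self.flush()
        self.gz.close()

Anything still sitting in the buffer when the process dies is lost, which is the trade-off mentioned above.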

guidoism
A: 

The problem seems to be that the output is being compressed on every write(). This causes each line to be compressed in its own bzip2 block.

I would try building a much larger string (or list of strings, if you are worried about performance) in memory before writing it out to the file. A good size to shoot for would be 900 KB (or more), as that is the block size that bzip2 uses.
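
A rough sketch of that buffering, reusing the question's codecs-based log_file (the 900 KB threshold just mirrors the bzip2 block size; tune to taste):

import codecs

log_file = codecs.open(archive_file, 'w', 'bz2')
buf, size = [], 0
for id, f1, f2, f3 in cursor:
    line = '%s %s %s %s\n' % (id, f1 or 'NULL', f2 or 'NULL', f3)
    buf.append(line)
    size += len(line)
    if size >= 900 * 1024:            # roughly one bzip2 block
        log_file.write(''.join(buf))  # one write -> one well-filled block
        buf, size = [], 0
if buf:
    log_file.write(''.join(buf))      # flush the remainder
log_file.close()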

cobbal
A: 

As other posters have noted, the issue is that the codecs library doesn't use an incremental encoder to encode the data; instead it encodes every snippet of data fed to the write method as a compressed block. This is horribly inefficient, and just a terrible design decision for a library designed to work with streams.

The ironic thing is that there's a perfectly reasonable incremental bz2 encoder already built into Python. It's not difficult to create a "file-like" class which does the correct thing automatically.

import bz2

class BZ2StreamEncoder(object):
    def __init__(self, filename, mode):
        self.log_file = open(filename, mode)    # use a binary mode, e.g. 'wb' or 'ab'
        self.encoder = bz2.BZ2Compressor()      # incremental compressor for one stream

    def write(self, data):
        # The compressor buffers internally and only emits output as
        # blocks fill up, so frequent small writes stay cheap.
        self.log_file.write(self.encoder.compress(data))

    def flush(self):
        # Note: BZ2Compressor.flush() finalizes the stream; to write more
        # afterwards, you need a fresh encoder (i.e., a new BZ2StreamEncoder).
        self.log_file.write(self.encoder.flush())
        self.log_file.flush()

    def close(self):
        self.flush()
        self.log_file.close()

log_file = BZ2StreamEncoder(archive_file, 'ab')

A caveat: In this example, I've opened the file in append mode; appending multiple compressed streams to a single file works perfectly well with bunzip2, but Python itself can't handle it (although there is a patch for it). If you need to read the compressed files you create back into Python, stick to a single stream per file.
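
If you do need to read a multi-stream file back in Python, a minimal sketch (my addition, not part of the class above) is to loop a fresh BZ2Decompressor over each stream, using its unused_data attribute:

import bz2

def read_bz2_streams(filename):
    # Decompress every concatenated bz2 stream in the file, not just the first.
    data = open(filename, 'rb').read()
    chunks = []
    while data:
        decomp = bz2.BZ2Decompressor()
        chunks.append(decomp.decompress(data))
        data = decomp.unused_data   # whatever followed the end of this stream
    return ''.join(chunks)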

Chris B.