I'm querying a database and archiving the results using Python, and I'm trying to compress the data as I write it to the log files. I'm having some problems with it, though.

My code looks like this:

import codecs

log_file = codecs.open(archive_file, 'w', 'bz2')
for id, f1, f2, f3 in cursor:
    log_file.write('%s %s %s %s\n' % (id, f1 or 'NULL', f2 or 'NULL', f3))

However, my output file is 1,409,780 bytes. Running bunzip2 on it yields a file of 943,634 bytes, and running bzip2 on that yields 217,275 bytes. In other words, the uncompressed file is significantly smaller than the file compressed using Python's bzip codec. Is there a way to fix this, other than running bzip2 on the command line?

I tried Python's 'zip' codec (changing the line to codecs.open(archive_file, 'a+', 'zip')) to see if it fixed the problem. I still get large files, but I also get a gzip: archive_file: not in gzip format error when I try to uncompress the file. What's going on there?


EDIT: I originally had the file opened in append mode, not write mode. While this may or may not be a problem, the question still holds if the file's opened in 'w' mode.

A: 

The problem is due to your use of append mode, which results in files that contain multiple compressed streams of data. Look at this example:

>>> import codecs
>>> with codecs.open("myfile.zip", "a+", "zip") as f:
...     f.write("ABCD")

On my system, this produces a file 12 bytes in size. Let's see what it contains:

>>> with codecs.open("myfile.zip", "r", "zip") as f:
...     f.read()
'ABCD'

Okay, now let's do another write in append mode:

>>> with codecs.open("myfile.zip", "a+", "zip") as f:
...     f.write("EFGH")

The file is now 24 bytes in size, and its contents are:

>>> with codecs.open("myfile.zip", "r", "zip") as f:
...     f.read()
'ABCD'

What's happening here is that the decompressor expects a single compressed stream. You'd have to check the specs for the official behavior with multiple concatenated streams, but in my experience most tools process the first one and ignore the rest of the data. That's what Python does.

I expect that bunzip2 is doing the same thing. So in reality your file is compressed, and is much smaller than the data it contains. But when you run it through bunzip2, you're getting back only the first set of records you wrote to it; the rest is discarded.
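
For illustration (this helper isn't part of the original answer), here's a minimal sketch of how you could recover every concatenated stream yourself. The files above were written with the 'zip' (zlib) codec, so it loops a fresh zlib.decompressobj over the unused_data each stream leaves behind:

import zlib

def decompress_all_streams(data):
    # Decode every concatenated zlib stream in `data`, not just the first.
    chunks = []
    while data:
        decomp = zlib.decompressobj()
        chunks.append(decomp.decompress(data))
        data = decomp.unused_data   # bytes left over after this stream ended
    return ''.join(chunks)

print decompress_all_streams(open("myfile.zip", "rb").read())   # prints: ABCDEFGH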

DNS
First, the size differential is the result of running the program once. Running it with 'w' produces exactly the same file that 'a+' does, which is about 30% larger than the uncompressed version. Second, even though Python doesn't read past the first compressed block of data, bunzip2 does.
Chris B.
A: 

I'm not sure how different this is from the codecs way of doing it, but if you use GzipFile from the gzip module, you can incrementally append to the file. However, it won't compress very well unless you write large amounts of data at a time (maybe > 1 KB); this is just the nature of the compression algorithms. If the data you're writing isn't super important (i.e., you can tolerate losing it if your process dies), you could write a buffered GzipFile class wrapping the imported class that writes out larger chunks of data, as sketched below.
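
A minimal sketch of that idea (the class name and the 64 KB threshold are my own assumptions, not part of the answer):

import gzip

class BufferedGzipWriter(object):
    def __init__(self, filename, buffer_size=64 * 1024):
        self.gz = gzip.GzipFile(filename, 'ab')   # appends a new gzip member
        self.buffer = []
        self.buffered = 0
        self.buffer_size = buffer_size

    def write(self, data):
        # Accumulate writes in memory so the compressor sees bigger chunks.
        self.buffer.append(data)
        self.buffered += len(data)
        if self.buffered >= self.buffer_size:
            self.flush()

    def flush(self):
        self.gz.write(''.join(self.buffer))
        self.buffer = []
        self.buffered = 0

    def close(self):
        self.flush()
        self.gz.close()

Anything still sitting in the buffer when the process dies is lost, which is the trade-off mentioned above.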

guidoism
A: 

The problem seems to be that the output is being compressed on every write(). This causes each line to be compressed in its own bzip2 block.

I would try building a much larger string (or list of strings, if you are worried about performance) in memory before writing it out to the file. A good size to shoot for would be 900 KB (or more), as that is the block size that bzip2 uses.
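
A rough sketch of that buffering, reusing the question's codecs-based log_file (the 900 KB threshold just mirrors the bzip2 block size; tune to taste):

import codecs

log_file = codecs.open(archive_file, 'w', 'bz2')
buf, size = [], 0
for id, f1, f2, f3 in cursor:
    line = '%s %s %s %s\n' % (id, f1 or 'NULL', f2 or 'NULL', f3)
    buf.append(line)
    size += len(line)
    if size >= 900 * 1024:            # roughly one bzip2 block
        log_file.write(''.join(buf))  # one write -> one well-filled block
        buf, size = [], 0
if buf:
    log_file.write(''.join(buf))      # flush the remainder
log_file.close()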

cobbal
A: 

As other posters have noted, the issue is that the codecs library doesn't use an incremental encoder to encode the data; instead it encodes every snippet of data fed to the write method as a compressed block. This is horribly inefficient, and just a terrible design decision for a library designed to work with streams.

The ironic thing is that there's a perfectly reasonable incremental bz2 encoder already built into Python. It's not difficult to create a "file-like" class which does the correct thing automatically.

import bz2

class BZ2StreamEncoder(object):
    def __init__(self, filename, mode):
        self.log_file = open(filename, mode)    # use a binary mode, e.g. 'wb' or 'ab'
        self.encoder = bz2.BZ2Compressor()      # incremental compressor for one stream

    def write(self, data):
        # The compressor buffers internally and only emits output as
        # blocks fill up, so frequent small writes stay cheap.
        self.log_file.write(self.encoder.compress(data))

    def flush(self):
        # Note: BZ2Compressor.flush() finalizes the stream; to write more
        # afterwards, you need a fresh encoder (i.e., a new BZ2StreamEncoder).
        self.log_file.write(self.encoder.flush())
        self.log_file.flush()

    def close(self):
        self.flush()
        self.log_file.close()

log_file = BZ2StreamEncoder(archive_file, 'ab')

A caveat: In this example, I've opened the file in append mode; appending multiple compressed streams to a single file works perfectly well with bunzip2, but Python itself can't handle it (although there is a patch for it). If you need to read the compressed files you create back into Python, stick to a single stream per file.
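
If you do need to read a multi-stream file back in Python, a minimal sketch (my addition, not part of the class above) is to loop a fresh BZ2Decompressor over each stream, using its unused_data attribute:

import bz2

def read_bz2_streams(filename):
    # Decompress every concatenated bz2 stream in the file, not just the first.
    data = open(filename, 'rb').read()
    chunks = []
    while data:
        decomp = bz2.BZ2Decompressor()
        chunks.append(decomp.decompress(data))
        data = decomp.unused_data   # whatever followed the end of this stream
    return ''.join(chunks)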

Chris B.