I'm interested in compressing data using Python's gzip module. It happens that I want the compressed output to be deterministic, because that's often a really convenient property for things to have in general -- if some non-gzip-aware process is going to be looking for changes in the output, say, or if the output is going to be cryptographically signed.

Unfortunately, the output is different every time. As far as I can tell, the only reason for this is the timestamp field in the gzip header, which the Python module always populates with the current time. I don't think you're actually allowed to have a gzip stream without a timestamp in it, which is too bad.

In any case, there doesn't seem to be a way for the caller of Python's gzip module to supply the correct modification time of the underlying data. (The actual gzip program seems to use the timestamp of the input file when possible.) I imagine this is because basically the only thing that ever cares about the timestamp is the gunzip command when writing to a file -- and, now, me, because I want deterministic output. Is that so much to ask?

Has anyone else encountered this problem?

What's the least terrible way to gzip some data with an arbitrary timestamp from Python?

A: 

In Lib/gzip.py, we find the method that builds the header, including the part that does indeed contain a timestamp. In Python 2.5, this begins on line 143:

def _write_gzip_header(self):
    self.fileobj.write('\037\213')             # magic header
    self.fileobj.write('\010')                 # compression method
    fname = self.filename[:-3]
    flags = 0
    if fname:
        flags = FNAME
    self.fileobj.write(chr(flags))
    write32u(self.fileobj, long(time.time())) # The current time!
    self.fileobj.write('\002')
    self.fileobj.write('\377')
    if fname:
        self.fileobj.write(fname + '\000')

As you can see, it uses time.time() to fetch the current time. According to the online module docs, time.time will "return the time as a floating point number expressed in seconds since the epoch, in UTC." So, if you change this to a floating-point constant of your choosing, you can always have the same headers written out. I can't see a better way to do this unless you want to hack the library some more to accept an optional time param that you use while defaulting to time.time() when it's not specified, in which case, I'm sure they'd love it if you submitted a patch!
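For reference, the timestamp lives at a fixed offset: RFC 1952 puts the 4-byte little-endian MTIME field at bytes 4 through 7 of the header, right after the magic number, compression method and flags written above. A minimal sketch for reading it back out of a compressed blob, written for Python 3; read_gzip_mtime is a hypothetical helper name, not something the gzip module provides:

```python
import struct


def read_gzip_mtime(blob):
    """Return the MTIME field (seconds since the epoch) from a gzip header."""
    if len(blob) < 8:
        raise ValueError("too short to be a gzip stream")
    # Layout: 2-byte magic, 1-byte method, 1-byte flags, 4-byte little-endian MTIME.
    magic, method, flags, mtime = struct.unpack("<2sBBI", blob[:8])
    if magic != b"\x1f\x8b":
        raise ValueError("not a gzip stream")
    return mtime
```

Comparing this value across two runs is a quick way to confirm that the timestamp is the only thing changing between otherwise identical outputs.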

Sean
+6  A: 

Yeah, you don't have any pretty options. The time is written with this line in _write_gzip_header:

write32u(self.fileobj, long(time.time()))

Since they don't give you a way to override the time, you can do one of these things:

  1. Derive a class from GzipFile, and copy the _write_gzip_header function into your derived class, but with a different value in this one line.
  2. After importing the gzip module, assign a replacement object to its time attribute. You will essentially be providing a new definition of the name time in the gzip code, so you can change what time.time() means.
  3. Copy the entire gzip module, name it my_stable_gzip, and change the one line you need to.
  4. Pass a cStringIO.StringIO object in as fileobj, and modify the bytestream after gzip is done.
  5. Write a fake file object that keeps track of the bytes written, and passes everything through to a real file, except for the bytes for the timestamp, which you write yourself.
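Option #4 from the list above can be sketched roughly like this. It is written for Python 3 (io.BytesIO standing in for cStringIO), and gzip_then_patch is a hypothetical helper name. It assumes the timestamp occupies bytes 4 through 7 of the header, as RFC 1952 specifies, and that the module writes no header checksum that would need updating afterwards:

```python
import gzip
import io
import struct


def gzip_then_patch(payload, mtime=0):
    """Compress into memory, then overwrite the header timestamp."""
    buf = io.BytesIO()
    with gzip.GzipFile(fileobj=buf, mode="wb") as gz:
        gz.write(payload)
    raw = bytearray(buf.getvalue())
    # MTIME is a little-endian unsigned 32-bit value at offset 4.
    raw[4:8] = struct.pack("<I", mtime)
    return bytes(raw)
```

The obvious cost is that the whole compressed stream sits in memory before anything is written out, so this suits small payloads better than large files.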

Here's an example of option #2 (untested):

class FakeTime:
    def time(self):
        return 1225856967.109

import gzip
gzip.time = FakeTime()

# Now call gzip, it will think time doesn't change!

Option #5 may be the cleanest in terms of not depending on the internals of the gzip module (untested):

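import struct

class GzipTimeFixingFile:
    def __init__(self, realfile, mtime=0):
        self.realfile = realfile
        self.mtime = mtime  # timestamp to store, in seconds since the epoch
        self.pos = 0        # number of bytes gzip has written so far

    def write(self, data):
        # The 4-byte timestamp starts at offset 4 of the header. This relies
        # on gzip emitting it in a single 4-byte write() call.
        if self.pos == 4 and len(data) == 4:
            self.realfile.write(struct.pack("<L", self.mtime))  # Fake time goes here.
        else:
            self.realfile.write(data)
        self.pos += len(data)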
Ned Batchelder
A: 

It's not pretty, but you could monkeypatch time.time temporarily with something like this:

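import gzip
import io
import time

def fake_time():
    return 100000000.0

def do_gzip(content):
    buf = io.BytesIO()
    orig_time = time.time
    time.time = fake_time  # gzip looks up time.time() when it writes the header
    try:
        f = gzip.GzipFile(fileobj=buf, mode='wb')
        f.write(content)
        f.close()
    finally:
        time.time = orig_time  # restore the real clock even if gzip raises
    return buf.getvalue()

It relies on gzip calling time.time() at the moment it writes the header, but it would probably work.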

Tony Arkles
My main objection to this approach is that I'm writing a library, and that my library's caller might be trying to use gzip in another thread, in which case the changes I'd be making would potentially affect the other threads. This is especially dire if other threads try to use the same trick!
zaphod
+1  A: 

Submit a patch in which the computation of the time stamp is factored out. It would almost certainly be accepted.
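For readers on newer interpreters: Python 2.7 and Python 3.1 onward do let the caller supply the timestamp, through an mtime argument to GzipFile. A minimal sketch, assuming one of those versions; gzip_deterministic is a hypothetical helper name:

```python
import gzip
import io


def gzip_deterministic(payload, mtime=0):
    """Compress payload with a caller-chosen header timestamp."""
    buf = io.BytesIO()
    # mtime is stored in the header in place of the current time.
    with gzip.GzipFile(fileobj=buf, mode="wb", mtime=mtime) as gz:
        gz.write(payload)
    return buf.getvalue()
```

Because the BytesIO object has no name, no filename is stored in the header either, so two calls with the same payload and mtime should produce identical bytes on the same Python and zlib versions.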

fivebells
I can't imagine that the patch will show up in Ubuntu (which I happen to be using) for quite some time, which means I still need a workaround. Still, I think this is an excellent answer!
zaphod
A: 

I've taken Mr. Coventry's advice and submitted a patch. However, given the current state of the Python release schedule, with 3.0 just around the corner, I don't expect it to show up in a release anytime soon. Still, we'll see what happens!

In the meantime, I like Mr. Batchelder's option 5 of piping the gzip stream through a small custom filter that sets the timestamp field correctly. It sounds like the cleanest approach. As he demonstrates, the code required is actually quite small, though his example does depend for some of its simplicity on the (currently valid) assumption that the gzip module implementation will choose to write the timestamp using exactly one four-byte call to write(). Still, I don't think it would be very difficult to come up with a fully general version if needed.
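A more general version of that filter might look something like the sketch below, written for Python 3. It tracks the stream offset and overwrites bytes 4 through 7 however the writes happen to be chunked, so it no longer cares whether the timestamp arrives in one four-byte call. FixedMtimeWriter and gzip_fixed are hypothetical names, and the wrapper implements only the write() and flush() methods that GzipFile appears to need from its fileobj:

```python
import gzip
import io
import struct


class FixedMtimeWriter:
    """Write-only wrapper that replaces the gzip header timestamp."""

    def __init__(self, realfile, mtime=0):
        self.realfile = realfile
        self.stamp = struct.pack("<I", mtime)  # little-endian uint32
        self.pos = 0  # bytes seen so far

    def write(self, data):
        data = bytes(data)
        start, end = self.pos, self.pos + len(data)
        lo, hi = max(start, 4), min(end, 8)
        if lo < hi:
            # This chunk overlaps the timestamp field; splice in our bytes.
            data = data[:lo - start] + self.stamp[lo - 4:hi - 4] + data[hi - start:]
        self.realfile.write(data)
        self.pos = end
        return len(data)

    def flush(self):
        self.realfile.flush()


def gzip_fixed(payload, mtime=0):
    sink = io.BytesIO()
    with gzip.GzipFile(fileobj=FixedMtimeWriter(sink, mtime), mode="wb") as gz:
        gz.write(payload)
    return sink.getvalue()
```

Unlike patching the bytes afterwards, this streams straight through to the real file, so nothing has to be held in memory.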

The monkey-patching approach (a.k.a. option 2) is quite tempting for its simplicity but gives me pause because I'm writing a library that calls gzip, not just a standalone program, and it seems to me that somebody might try to call gzip from another thread before my module is ready to reverse its change to the gzip module's global state. This would be especially unfortunate if the other thread were trying to pull a similar monkey-patching stunt! I admit this potential problem doesn't sound very likely to come up in practice, but imagine how painful it would be to diagnose such a mess!

I can vaguely imagine trying to do something tricky and complicated and perhaps not so future-proof to somehow import a private copy of the gzip module and monkey-patch that, but by that point a filter seems simpler and more direct.
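For the curious, the private-copy idea might look roughly like this in Python 3. It is only a sketch: load_private_gzip is a hypothetical helper, and the whole thing assumes the gzip module keeps reaching the clock through its module-level name time, which is an implementation detail rather than a documented interface:

```python
import importlib.util
import io
import types


def load_private_gzip(fixed_time=0.0):
    """Load a second copy of gzip whose clock always returns fixed_time."""
    spec = importlib.util.find_spec("gzip")
    # A fresh module object; it is never registered in sys.modules,
    # so the gzip module everyone else imports is left alone.
    private = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(private)
    # Rebind the name `time` inside the private copy only.
    private.time = types.SimpleNamespace(time=lambda: fixed_time)
    return private


stable_gzip = load_private_gzip()
buf = io.BytesIO()
with stable_gzip.GzipFile(fileobj=buf, mode="wb") as gz:
    gz.write(b"some data")
```

Other threads that import gzip normally never see the patched copy, which addresses the shared-state worry, but the dependence on gzip's internals is exactly the fragility mentioned above.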

zaphod