ansaurus

Question

Answer 1

A:

In lib/gzip.py, we find the method that builds the header, including the part that does indeed contain a timestamp. In Python 2.5, this begins on line 143:

def _write_gzip_header(self):
    self.fileobj.write('\037\213')             # magic header
    self.fileobj.write('\010')                 # compression method
    fname = self.filename[:-3]
    flags = 0
    if fname:
        flags = FNAME
    self.fileobj.write(chr(flags))
    write32u(self.fileobj, long(time.time())) # The current time!
    self.fileobj.write('\002')
    self.fileobj.write('\377')
    if fname:
        self.fileobj.write(fname + '\000')

As you can see, it uses time.time() to fetch the current time. According to the online module docs, time.time will "return the time as a floating point number expressed in seconds since the epoch, in UTC." So if you replace that call with a constant of your choosing (long() truncates it to an integer before it is written), the same header is written out every time. I can't see a better way to do this, unless you want to hack the library further so that it accepts an optional time parameter and falls back to time.time() when none is given. In that case, I'm sure they'd love it if you submitted a patch!
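To see the field in question, here is a minimal sketch that reads the timestamp back out of a gzip stream. It assumes Python 3 (the listing above is from 2.5), and gzip_mtime is a name made up for illustration. The offsets follow from the header code above: two magic bytes, one method byte, one flags byte, then the four-byte little-endian time.

```python
import gzip
import io
import struct

def gzip_mtime(blob):
    """Return the 32-bit timestamp stored at bytes 4..7 of a gzip stream."""
    if blob[:2] != b"\x1f\x8b":
        raise ValueError("not a gzip stream")
    # Header layout: magic (2 bytes), method (1), flags (1), time (4, little-endian).
    (mtime,) = struct.unpack("<I", blob[4:8])
    return mtime

# Compress something without specifying a time, then inspect the header.
buf = io.BytesIO()
gz = gzip.GzipFile(fileobj=buf, mode="wb")
gz.write(b"hello")
gz.close()
blob = buf.getvalue()
print(gzip_mtime(blob))  # seconds since the epoch at the moment of compression
```

Running this twice a few seconds apart prints different numbers, which is exactly why the two outputs differ byte for byte.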

Sean 2008-11-05 03:44:19

Answer 2

+6 A:

Yeah, you don't have any pretty options. The time is written with this line in _write_gzip_header:

write32u(self.fileobj, long(time.time()))

Since they don't give you a way to override the time, you can do one of these things:

1. Derive a class from GzipFile, and copy the _write_gzip_header function into your derived class, but with a different value in this one line.
2. After importing the gzip module, assign new code to its time member. You will essentially be providing a new definition of the name time in the gzip code, so you can change what time.time() means.
3. Copy the entire gzip module, and name it my_stable_gzip, and change the line you need to.
4. Pass a cStringIO.StringIO object in as fileobj, and modify the bytestream after gzip is done.
5. Write a fake file object that keeps track of the bytes written, and passes everything through to a real file, except for the bytes for the timestamp, which you write yourself.
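The fourth choice in that list (compress into memory, then modify the bytestream) might look roughly like this. It is a sketch assuming Python 3, with io.BytesIO standing in for the in-memory file; gzip_with_fixed_mtime and FIXED_MTIME are hypothetical names. Overwriting the field afterwards is safe because the checksum in the gzip trailer covers only the uncompressed data, not the header.

```python
import gzip
import io
import struct

FIXED_MTIME = 1225856967  # arbitrary constant, chosen only for illustration

def gzip_with_fixed_mtime(data, mtime=FIXED_MTIME):
    """Compress data in memory, then overwrite the header timestamp."""
    buf = io.BytesIO()
    gz = gzip.GzipFile(fileobj=buf, mode="wb")
    gz.write(data)
    gz.close()
    raw = bytearray(buf.getvalue())
    # Bytes 4..7 hold the timestamp as a little-endian unsigned 32-bit integer.
    raw[4:8] = struct.pack("<I", mtime)
    return bytes(raw)

first = gzip_with_fixed_mtime(b"same input")
second = gzip_with_fixed_mtime(b"same input")
```

The obvious cost is that the whole compressed stream has to sit in memory before it can be written anywhere.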

Here's an example of option #2 (untested):

class FakeTime:
    def time(self):
        return 1225856967.109

import gzip
gzip.time = FakeTime()

# Now call gzip, it will think time doesn't change!

Option #5 may be the cleanest in terms of not depending on the internals of the gzip module (untested):

class GzipTimeFixingFile:
    def __init__(self, realfile):
        self.realfile = realfile
        self.pos = 0  # number of bytes written so far

    def write(self, data):
        # The timestamp is the 4-byte write that follows the magic, method and flags bytes.
        if self.pos == 4 and len(data) == 4:
            self.realfile.write("XYZY")  # Fake time goes here.
        else:
            self.realfile.write(data)
        self.pos += len(data)

Ned Batchelder 2008-11-05 03:49:51

Answer 3

A:

It's not pretty, but you could monkeypatch time.time temporarily with something like this:

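import gzip
import io  # Python 2.6+; on 2.5 use cStringIO.StringIO instead of io.BytesIO
import time

def fake_time():
    return 100000000.0

def do_gzip(content):
    orig_time = time.time
    time.time = fake_time  # visible to every caller of time.time(), in every thread
    try:
        buf = io.BytesIO()
        gz = gzip.GzipFile(fileobj=buf, mode='wb')
        gz.write(content)
        gz.close()
        result = buf.getvalue()
    finally:
        time.time = orig_time  # restore the real clock even if compression fails
    return result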

It would probably work, though.

Tony Arkles 2008-11-05 04:32:23

My main objection to this approach is that I'm writing a library, and that my library's caller might be trying to use gzip in another thread, in which case the changes I'd be making would potentially affect the other threads. This is especially dire if other threads try to use the same trick!

zaphod 2008-11-06 00:35:05

Answer 4

+1 A:

Submit a patch in which the computation of the time stamp is factored out. It would almost certainly be accepted.
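For illustration, this is roughly what a factored-out timestamp looks like from the caller's side. It assumes a Python version whose gzip.GzipFile accepts an mtime argument, which as far as I know means 2.7 and 3.1 onward, so it does not help on the 2.5 discussed in this thread. gzip_bytes is a hypothetical helper name.

```python
import gzip
import io

def gzip_bytes(data, mtime=0):
    """Compress data with a caller-chosen header timestamp."""
    buf = io.BytesIO()
    # mtime is written into the header in place of the current time.
    gz = gzip.GzipFile(fileobj=buf, mode="wb", mtime=mtime)
    gz.write(data)
    gz.close()
    return buf.getvalue()

a = gzip_bytes(b"payload")
b = gzip_bytes(b"payload")
```

With the time pinned, a and b come out identical no matter when each call happens.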

fivebells 2008-11-05 15:17:54

I can't imagine that the patch will show up in Ubuntu (which I happen to be using) for quite some time, which means I still need a workaround. Still, I think this is an excellent answer!

zaphod 2008-11-05 21:09:56

Answer 5

A:

I've taken Mr. Coventry's advice and submitted a patch. However, given the current state of the Python release schedule, with 3.0 just around the corner, I don't expect it to show up in a release anytime soon. Still, we'll see what happens!

In the meantime, I like Mr. Batchelder's option 5 of piping the gzip stream through a small custom filter that sets the timestamp field correctly. It sounds like the cleanest approach. As he demonstrates, the code required is actually quite small, though his example does depend for some of its simplicity on the (currently valid) assumption that the gzip module implementation will choose to write the timestamp using exactly one four-byte call to write(). Still, I don't think it would be very difficult to come up with a fully general version if needed.
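A more general version might look like the sketch below, which rewrites stream offsets 4 through 7 however the writes happen to be split up. It assumes Python 3 and assumes gzip only needs write() and flush() on the object it is handed; MtimeRewritingWriter is a made-up name.

```python
import gzip
import io
import struct

class MtimeRewritingWriter(object):
    """Pass-through writer that replaces stream bytes 4..7 with a fixed time."""

    def __init__(self, realfile, mtime):
        self.realfile = realfile
        self.fake = struct.pack("<I", mtime)  # little-endian, as in the header
        self.pos = 0  # stream offset of the next byte to be written

    def write(self, data):
        data = bytes(data)
        start = self.pos
        end = start + len(data)
        # Part of this chunk that overlaps stream offsets [4, 8), if any.
        lo = max(start, 4)
        hi = min(end, 8)
        if lo < hi:
            data = data[:lo - start] + self.fake[lo - 4:hi - 4] + data[hi - start:]
        self.realfile.write(data)
        self.pos = end
        return len(data)

    def flush(self):
        self.realfile.flush()

out = io.BytesIO()
gz = gzip.GzipFile(fileobj=MtimeRewritingWriter(out, 1225856967), mode="wb")
gz.write(b"hello")
gz.close()
```

Since it only counts bytes, it keeps working if the header is ever written in a single call or one byte at a time.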

The monkey-patching approach (a.k.a. option 2) is quite tempting for its simplicity but gives me pause because I'm writing a library that calls gzip, not just a standalone program, and it seems to me that somebody might try to call gzip from another thread before my module is ready to reverse its change to the gzip module's global state. This would be especially unfortunate if the other thread were trying to pull a similar monkey-patching stunt! I admit this potential problem doesn't sound very likely to come up in practice, but imagine how painful it would be to diagnose such a mess!

I can vaguely imagine trying to do something tricky and complicated and perhaps not so future-proof to somehow import a private copy of the gzip module and monkey-patch that, but by that point a filter seems simpler and more direct.
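For the curious, the private-copy idea could be sketched like this. It assumes Python 3's importlib and assumes the gzip module reaches the clock only through its module-level name time, which is an implementation detail that could change. FixedClock and load_private_gzip are hypothetical names.

```python
import gzip
import importlib.util
import io
import struct

class FixedClock(object):
    """Stand-in for the time module; only time() is provided."""

    def __init__(self, value):
        self.value = value

    def time(self):
        return self.value

def load_private_gzip(fixed_time):
    """Load a second, unshared copy of gzip and rebind its 'time' name."""
    spec = importlib.util.find_spec("gzip")
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)  # runs gzip.py into a fresh module object
    module.time = FixedClock(fixed_time)  # affects this copy only
    return module

stable_gzip = load_private_gzip(1225856967)
buf = io.BytesIO()
gz = stable_gzip.GzipFile(fileobj=buf, mode="wb")
gz.write(b"hello")
gz.close()
stamp = struct.unpack("<I", buf.getvalue()[4:8])[0]
```

The regular gzip module that other threads import is left untouched, but the whole thing leans on gzip's internals, which is why the filter still looks simpler to me.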

zaphod 2008-11-06 21:17:09


setting the gzip timestamp from Python
