I'm interested in compressing data using Python's gzip
module. It happens that I want the compressed output to be deterministic, because that's often a really convenient property for things to have in general -- if some non-gzip-aware process is going to be looking for changes in the output, say, or if the output is going to be cryptographically signed.
Unfortunately, the output is different every time. As far as I can tell, the only reason for this is the timestamp field in the gzip header, which the Python module always populates with the current time. I don't think you're actually allowed to have a gzip stream without a timestamp in it, which is too bad.
In any case, there doesn't seem to be a way for the caller of Python's gzip
module to supply the correct modification time of the underlying data. (The actual gzip
program seems to use the timestamp of the input file when possible.) I imagine this is because basically the only thing that ever cares about the timestamp is the gunzip
command when writing to a file -- and, now, me, because I want deterministic output. Is that so much to ask?
Has anyone else encountered this problem?
What's the least terrible way to gzip
some data with an arbitrary timestamp from Python?