views:

498

answers:

4

I've heard discussion about how OpenOffice (ODF) files are compressed zip files of XML and other data. So making a tiny change to the file can potentially totally change the data, so delta compression doesn't work well in version control systems.

I've done basic testing on an OpenOffice file, unzipping it and then rezipping it with zero compression. I used the Linux zip utility for my testing. OpenOffice will still happily open it.

So I'm wondering if it's worth developing a small utility to run on ODF files each time just before I commit to version control. Any thoughts on this idea? Possible better alternatives?

Secondly, what would be a good and robust way to implement this little utility? Bash shell that calls zip (probably Linux only)? Python? Any gotchas you can think of? Obviously I don't want to accidentally mangle a file, and there are several ways that could happen.

Possible gotchas I can think of:

  • Insufficient disk space
  • Some other permissions issue that prevents writing the file or temporary files
  • ODF document is encrypted (probably should just leave these alone; the encryption probably also causes large file changes and thus prevents efficient delta compression)
+7  A: 

First, version control system you want to use should support hooks which are invoked to transform file from version in repository to the one in working area, like for example clean / smudge filters in Git from gitattributes.

Second, you can find such filter, instead of writing one yourself, for example rezip from "Management of opendocument (openoffice.org) files in git" thread on git mailing list (but see warning in "Followup: management of OO files - warning about "rezip" approach"),

You can also browse answers in "Tracking OpenOffice files/other compressed files with Git" thread, or try to find the answer inside "[PATCH 2/2] Add keyword unexpansion support to convert.c" thread.

Hope That Helps

Jakub Narębski
Terrific information. I'm most interested in Subversion and Mercurial at the moment. I don't think Subversion has clean/smudge type feature. No idea for Mercurial—I'm relatively new to that.
Craig McQueen
@Craig: Mercurial has hooks.
Borealid
A: 

Here is a Python script that I've put together. It's had minimal testing so far. I've done basic testing in Python 2.6. But I prefer the idea of Python in general because it should abort with an exception if any error occurs, whereas a bash script may not.

This first checks that the input file is valid and not already uncompressed. Then it copies the input file to a "backup" file with ".bak" extension. Then it uncompresses the original file, overwriting it.

I'm sure there are things I've overlooked. Please feel free to give feedback.


#!/usr/bin/python
# Note, written for Python 2.6

import sys
import shutil
import zipfile

# Get a single command-line argument containing filename
commandlineFileName = sys.argv[1]

backupFileName = commandlineFileName + ".bak"
inFileName = backupFileName
outFileName = commandlineFileName
checkFilename = commandlineFileName

# Check input file
# First, check it is valid (not corrupted)
checkZipFile = zipfile.ZipFile(checkFilename)
checkZipFile.testzip()

# Second, check that it's not already uncompressed
isCompressed = False
for fileObject in checkZipFile.infolist():
    if fileObject.compress_type != zipfile.ZIP_STORED:
        isCompressed = True
if isCompressed == False:
    raise Exception("File is already uncompressed")

checkZipFile.close()

# Copy to "backup" file and use that as the input
shutil.copy(commandlineFileName, backupFileName)
inputZipFile = zipfile.ZipFile(inFileName)

outputZipFile = zipfile.ZipFile(outFileName, "w", zipfile.ZIP_STORED)

# Copy each input file's data to output, making sure it's uncompressed
for fileObject in inputZipFile.infolist():
    fileData = inputZipFile.read(fileObject)
    outFileObject = fileObject
    outFileObject.compress_type = zipfile.ZIP_STORED
    outputZipFile.writestr(outFileObject, fileData)

outputZipFile.close()

This is in a Mercurial repository in BitBucket.

Craig McQueen
+1  A: 

I've modified the python program in Craig McQueen's answer just a bit. Changes include:

  • Actually checking the return of testZip (according to the docs, it appears that the original program will happily proceed with a corrupt zip file past the checkzip step).

  • Rewrite the for-loop to check for already-uncompressed files to be a single if-statement.

Here is the new program:

#!/usr/bin/python
# Note, written for Python 2.6

import sys
import shutil
import zipfile

# Get a single command-line argument containing filename
commandlineFileName = sys.argv[1]

backupFileName = commandlineFileName + ".bak"
inFileName = backupFileName
outFileName = commandlineFileName
checkFilename = commandlineFileName

# Check input file
# First, check it is valid (not corrupted)
checkZipFile = zipfile.ZipFile(checkFilename)

if checkZipFile.testzip() is not None:
    raise Exception("Zip file is corrupted")

# Second, check that it's not already uncompressed
if all(f.compress_type==zipfile.ZIP_STORED for f in checkZipFile.infolist()):
    raise Exception("File is already uncompressed")

checkZipFile.close()

# Copy to "backup" file and use that as the input
shutil.copy(commandlineFileName, backupFileName)
inputZipFile = zipfile.ZipFile(inFileName)

outputZipFile = zipfile.ZipFile(outFileName, "w", zipfile.ZIP_STORED)

# Copy each input file's data to output, making sure it's uncompressed
for fileObject in inputZipFile.infolist():
    fileData = inputZipFile.read(fileObject)
    outFileObject = fileObject
    outFileObject.compress_type = zipfile.ZIP_STORED
    outputZipFile.writestr(outFileObject, fileData)

outputZipFile.close()
Jason Grout
A: 

Here's another program I stumbled across: store_zippies_uncompressed by Mirko Friedenhagen.

The wiki also shows how to integrate it with Mercurial.

Craig McQueen