This question on archiving PDFs got me wondering -- if I wanted to compress (for archival purposes) lots of files which are essentially small changes made on top of a master template (a letterhead), it seems like huge compression gains could be had with inter-file compression.

Do any of the standard compression/archiving formats support this? AFAIK, all the popular formats focus on compressing each single file.

+2  A: 

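Since LZ-family compression (which pretty much all of them use: LZ77 inside DEFLATE for zip and gzip, LZW for the older compress) involves building up a history of repeated strings as you go along, a scheme such as the one you want would force you to decompress everything that precedes a file in the archive before you could extract it.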

If this is acceptable in your situation, it may be simpler to implement a method which just joins your files into one big file before compression.
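
A minimal sketch of that idea in Python, assuming a made-up 8 KB random "template" and twenty invented letter bodies (none of this data comes from the question). It only compares the total size of per-file zlib streams against one zlib stream over the joined bytes:

```python
import os
import zlib

# Hypothetical data: a shared 8 KB "letterhead" plus a short unique body per file.
# Random bytes are used so the template is incompressible on its own.
template = os.urandom(8 * 1024)
files = [template + f"Letter number {i}\n".encode() for i in range(20)]

# Per-file compression: every stream starts with an empty history,
# so each one has to store the template again.
separate = sum(len(zlib.compress(f, 9)) for f in files)

# Joined compression: later copies of the template sit within DEFLATE's
# 32 KB window and can be encoded as back-references to the previous copy.
joined = len(zlib.compress(b"".join(files), 9))

print("separate streams:", separate, "bytes")
print("one joined stream:", joined, "bytes")
```

A real archive would also need an index of names and offsets so the files can be split apart again, which is what tar provides.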

James Curran
So, basically, "double-zipping" (zip a zip file)?
Toybuilder
No. The original zip file is made of separately compressed "blobs", so the second pass won't find good repetitions.
Martin Plante
Toybuilder: actually, 'tar' is the usual answer, since it just generates one big archive out of your files by concatenating them with minimal headers. That's why .tar.gz files are so popular in the Unix world.
Edward Kmett
+2  A: 

Take a look at Google's open-vcdiff.

http://code.google.com/p/open-vcdiff/

It is designed for calculating small compressed deltas and implements RFC 3284.

http://www.ietf.org/rfc/rfc3284.txt
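
To get a feel for compressing a file relative to a known template, here is a rough Python sketch. It is not VCDIFF and does not use the open-vcdiff API; it swaps in zlib's preset-dictionary feature (the `zdict` argument) as a stand-in. The template, the document, and the helper names `compress_against` / `decompress_against` are all invented for illustration:

```python
import os
import zlib

# Hypothetical data: an 8 KB random template and one document built on top of it.
template = os.urandom(8 * 1024)
document = template + b"Dear customer, your invoice is attached.\n"

def compress_against(data: bytes, dictionary: bytes) -> bytes:
    """Compress data with the dictionary preloaded into the DEFLATE window."""
    comp = zlib.compressobj(level=9, zdict=dictionary)
    return comp.compress(data) + comp.flush()

def decompress_against(blob: bytes, dictionary: bytes) -> bytes:
    """Inverse of compress_against; needs the very same dictionary."""
    decomp = zlib.decompressobj(zdict=dictionary)
    return decomp.decompress(blob) + decomp.flush()

plain = zlib.compress(document, 9)            # no knowledge of the template
delta = compress_against(document, template)  # template supplied out of band

# Round-trip check: the template is required to restore the document.
assert decompress_against(delta, template) == document
print("standalone:", len(plain), "bytes; against template:", len(delta), "bytes")
```

Note that the preset dictionary lives in the same 32 KB DEFLATE window, so this trick only helps with small templates; a real delta encoder does not have that limit.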

Microsoft has an API for doing something similar, sans any semblance of a standard.

In general the algorithms you are looking for are ones based on Bentley/McIlroy:

http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.11.8470

In particular these algorithms will be a win if the size of the template is larger than the window size (~32k) used by gzip or the block size (100-900k) used by bzip2.
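
The window-size point can be seen with a small Python experiment. This is only a sketch under assumptions: the 64 KB random template is invented, and LZMA (from the standard `lzma` module) is used purely as an example of a compressor with a much larger window, not as an implementation of Bentley/McIlroy:

```python
import lzma
import os
import zlib

# Hypothetical template that is larger than DEFLATE's 32 KB window.
template = os.urandom(64 * 1024)
two_copies = template + template

# DEFLATE can only look back 32 KB, so by the time the second copy starts,
# the matching bytes of the first copy are already out of reach.
deflate_size = len(zlib.compress(two_copies, 9))

# LZMA's default dictionary is several megabytes, so the second copy
# can be encoded as references to the first.
lzma_size = len(lzma.compress(two_copies))

print("input:", len(two_copies), "bytes")
print("zlib :", deflate_size, "bytes")
print("lzma :", lzma_size, "bytes")
```

With the repeat 64 KB away, zlib ends up storing both copies, while the large-window compressor stores roughly one.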

They are used by Google internally inside of their Bigtable implementation to store compressed web pages, for much the same reason you are seeking them.

Edward Kmett
+5  A: 

Several formats do inter-file compression.

The oldest example is .tar.gz; a .tar has no compression but concatenates all the files together, with headers before each file, and a .gz can compress only one file. Both are applied in sequence, and it's a traditional format in the Unix world. .tar.bz2 is the same, only with bzip2 instead of gzip.

More recent examples are formats with optional "solid" compression (for instance, RAR and 7-Zip), which can internally concatenate all the files before compressing, if enabled by a command-line flag or GUI option.
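
For a quick comparison of the two approaches, here is a sketch using only Python's standard `zipfile` and `tarfile` modules. The file names and the 8 KB random template are made up for the example, and both archives are built in memory:

```python
import io
import os
import tarfile
import zipfile

# Hypothetical input: twenty letters sharing one 8 KB random template.
template = os.urandom(8 * 1024)
files = {f"letter{i:02d}.txt": template + f"Letter {i}\n".encode()
         for i in range(20)}

# ZIP: every member is compressed on its own.
zip_buf = io.BytesIO()
with zipfile.ZipFile(zip_buf, "w", zipfile.ZIP_DEFLATED) as zf:
    for name, data in files.items():
        zf.writestr(name, data)

# tar.gz: members are concatenated first, then gzip runs over the whole stream.
tar_buf = io.BytesIO()
with tarfile.open(fileobj=tar_buf, mode="w:gz") as tf:
    for name, data in files.items():
        info = tarfile.TarInfo(name)
        info.size = len(data)
        tf.addfile(info, io.BytesIO(data))

zip_size = len(zip_buf.getvalue())
tar_gz_size = len(tar_buf.getvalue())
print("zip   :", zip_size, "bytes")
print("tar.gz:", tar_gz_size, "bytes")
```

The trade-off is the one mentioned in the other answer: pulling a single file out of the .tar.gz means decompressing everything before it, while the zip allows random access to each member.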

CesarB
Thanks - I didn't know about the term "solid compression". This helped!
Toybuilder
Another interesting one, but not popular so a bit outside the original question: rzip.
CesarB
And before .tar.gz there was .tar.Z, but that is not in use anymore (it was completely replaced by .tar.gz).
CesarB