This is similar to a previous question, but the answers there don't satisfy my needs and my question is slightly different:

I currently use gzip compression for some very large files which contain sorted data. When the files are not compressed, binary search is a handy and efficient way to support seeking to a location in the sorted data.

But when the files are compressed, things get tricky. I recently found out about zlib's Z_FULL_FLUSH option, which can be used during compression to insert "sync points" in the compressed output (inflateSync() can then begin reading from various points in the file). This is OK, though files I already have would have to be recompressed to add this feature (and strangely gzip doesn't have an option for this, but I'm willing to write my own compression program if I must).
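For what it's worth, here is roughly how I picture inserting those sync points myself, using Python's zlib bindings (which expose Z_FULL_FLUSH); the function name and the 10 MB interval are just placeholders:

    import zlib

    SYNC_EVERY = 10 * 1024 * 1024   # placeholder: one sync point per 10 MB of input

    def gzip_with_sync_points(src_path, dst_path):
        """Compress src_path into a gzip file, emitting a Z_FULL_FLUSH point
        every SYNC_EVERY bytes of uncompressed input."""
        comp = zlib.compressobj(9, zlib.DEFLATED, 16 + zlib.MAX_WBITS)  # 16+MAX_WBITS = gzip wrapper
        sync_offsets = []   # compressed offsets of the sync points (a tiny in-memory index)
        written = 0
        with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
            while True:
                chunk = src.read(SYNC_EVERY)
                if not chunk:
                    break
                written += dst.write(comp.compress(chunk))
                # Z_FULL_FLUSH resets the dictionary, so decompression can restart here.
                written += dst.write(comp.flush(zlib.Z_FULL_FLUSH))
                sync_offsets.append(written)
            dst.write(comp.flush(zlib.Z_FINISH))
        return sync_offsets

To restart decompression at one of those offsets, the bytes from that point onward can be fed to zlib.decompressobj(-zlib.MAX_WBITS) (raw deflate), since the gzip header only appears at the very start of the file.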

It seems from one source that even Z_FULL_FLUSH is not a perfect solution: not only is it not present in most gzip archives, but merely detecting sync points in an archive can produce false positives, either by coincidence with the magic byte sequence that marks a sync point, or because Z_SYNC_FLUSH also produces sync points, but ones that are not usable for random access.

Is there a better solution? I'd like to avoid having auxiliary files for indexing if possible, and explicit, default support for quasi-random access would be helpful (even if it's large-grained--like being able to start reading at each 10 MB interval). Is there another compression format with better support for random reads than gzip?

Edit: As I mentioned, I wish to do binary search in the compressed data. I don't need to seek to a specific (uncompressed) position--only to seek with some coarse granularity within the compressed file. I just want support for something like "Decompress the data starting roughly 50% (25%, 12.5%, etc.) of the way into this compressed file."

+4  A: 

I'm not sure if this would be practical in your exact situation, but couldn't you just split each large file into smaller pieces, say 10 MB each, and gzip each piece? You would end up with a bunch of files: file0.gz, file1.gz, file2.gz, etc. Given an offset into the original large file, you would look in the file named "file" + (offset / 10485760) + ".gz". The offset within that chunk's uncompressed data would be offset % 10485760.
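In Python this might look something like the following rough sketch (the function names and the single-chunk assumption in read_at are mine):

    import gzip, os

    CHUNK = 10 * 1024 * 1024   # 10 MB of uncompressed data per piece

    def split_and_compress(src_path, out_dir):
        """Write file0.gz, file1.gz, ... each holding CHUNK uncompressed bytes of src_path."""
        i = 0
        with open(src_path, "rb") as src:
            while True:
                data = src.read(CHUNK)
                if not data:
                    break
                with gzip.open(os.path.join(out_dir, "file%d.gz" % i), "wb") as out:
                    out.write(data)
                i += 1

    def read_at(out_dir, offset, length):
        """Return `length` bytes starting at uncompressed `offset`, assumed to lie in one chunk."""
        with gzip.open(os.path.join(out_dir, "file%d.gz" % (offset // CHUNK)), "rb") as f:
            f.seek(offset % CHUNK)   # GzipFile seeks by decompressing up to that point
            return f.read(length)

A read that crosses a chunk boundary would simply continue into the next file.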

William Brendel
Or you could TAR them all and end up with a .GZ.TAR. :)
Vilx-
That would definitely make things cleaner. I was just trying to go for simplicity here, but your suggestion is well taken :-)
William Brendel
.gz.tar is not really random access, since you must jump through all the headers to get to one file
jpalecek
Well, yes and no. With fixed size chunks (10 MB in this case), you would not have to walk through a list of headers. This relies on the assumption that the tar will order the files alphabetically (which happens to be the case in GNU-land).
William Brendel
Yes, but then the files would not really be compressed (10 MB uncompressed is needed for your indexing expression to work, while 10 MB compressed is needed for direct access within the tar to work). It's hard to compress anything to an exact fixed size, although you could make that size sufficiently large and handle the excess space with sparse files.
jpalecek
If you want the various chunk files to be packaged in a single archive, you could just use an archive format like .zip that supports a directory allowing random access to individual files. William Brendel's key idea is to chunk the original file and compress the chunks independently.
Michael Burr
This is a similar concept to what Windows NTFS compression does to allow random access to data - each cluster (or 2K chunk - or some unit) is independently compressed.
Michael Burr
All those who pointed out that tarring the files would break random access: you are correct. I wasn't thinking :-) I'll remove that point from my answer. Thanks for the observations.
William Brendel
+2  A: 

I don't know of any compressed file format which would support random access to a specific location in the uncompressed data (well, except for multimedia formats), but you can brew your own.

For example, bzip2 compressed files are composed of independently compressed blocks (each under 1 MB uncompressed) which are delimited by sequences of magic bytes, so you could parse the bzip2 file, find the block boundaries and then just uncompress the right block. This would need some indexing to remember where the blocks start.
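A rough sketch of scanning for those boundaries in Python follows; the block-header magic is the 48-bit value 0x314159265359, but blocks are bit-aligned rather than byte-aligned, so the scan has to work at the bit level, and a hit can occasionally be a false positive inside compressed data:

    BLOCK_MAGIC = 0x314159265359      # bzip2 compressed-block header magic (48 bits)
    MASK = (1 << 48) - 1

    def find_bzip2_blocks(path):
        """Yield bit offsets of candidate bzip2 block headers."""
        window = 0
        bit_pos = 0
        with open(path, "rb") as f:
            while True:
                chunk = f.read(1 << 16)
                if not chunk:
                    break
                for byte in chunk:
                    for shift in range(7, -1, -1):   # slide a 48-bit window bit by bit
                        window = ((window << 1) | ((byte >> shift) & 1)) & MASK
                        bit_pos += 1
                        if window == BLOCK_MAGIC:
                            yield bit_pos - 48

Extracting a single block then means shifting its bits onto a byte boundary and prepending a bzip2 stream header, which is essentially what bzip2recover does when it splits a damaged file into per-block files.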

Still, I think the best solution would be to split your file into chunks of your choice and then compress them with an archiver, such as zip or rar, that supports random access to the individual files in the archive.
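For instance, with a zip archive of fixed-size chunks, pulling one chunk back out is trivial (a sketch; the member naming is made up):

    import zipfile

    def read_chunk(archive_path, chunk_index):
        with zipfile.ZipFile(archive_path) as zf:
            # The zip central directory lets us jump straight to this member.
            return zf.read("chunk%06d" % chunk_index)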

jpalecek
I don't need to seek to a specific uncompressed position--only to seek somewhat randomly with some coarse granularity within the compressed file. I don't mind at all if all I can do is say "uncompress the data starting here, about 700 MB into this file."
John Zwinck
@John Zwinck: Add your comment to your question as an update. Note that given the variable compression of data (some stuff I compress shrinks by 94% or so - usually, except when it only shrinks by about 50% or so), your estimate of where to start decompressing might be very hit and miss.
Jonathan Leffler
A: 

Because lossless compression works better on some regions than others, if you store compressed data in blocks of convenient length BLOCKSIZE, then even though each block contains exactly the same number of compressed bytes, some blocks will expand into a much longer piece of plaintext than others.

You might look at "Compression: A Key for Next-Generation Text Retrieval Systems" by Nivio Ziviani, Edleno Silva de Moura, Gonzalo Navarro, and Ricardo Baeza-Yates in Computer magazine, November 2000: http://doi.ieeecomputersociety.org/10.1109/2.881693

Their decompressor takes 1, 2, or 3 whole bytes of compressed data and decompresses (using a vocabulary list) into a whole word. One can directly search the compressed text for words or phrases, which turns out to be even faster than searching uncompressed text.

Their decompressor lets you point to any word in the text with a normal (byte) pointer and start decompressing immediately from that point.

You can give every word a unique two-byte code, since you probably have fewer than 65,000 unique words in your text. (There are almost 13,000 unique words in the KJV Bible.) Even if there are more than 65,000 words, it's pretty simple to assign the first 256 two-byte code "words" to all possible bytes, so you can spell out words that aren't in the lexicon of the 65,000 or so "most frequent words and phrases". (The compression gained by packing frequent words and phrases into two bytes is usually worth the "expansion" of occasionally spelling out a word using two bytes per letter.)

There are a variety of ways to pick a lexicon of "frequent words and phrases" that will give adequate compression. For example, you could tweak an LZW compressor to dump the "phrases" it uses more than once to a lexicon file, one line per phrase, and run it over all your data. Or you could arbitrarily chop your uncompressed data into 5-byte phrases and list them, one line per phrase, in a lexicon file. Or you could chop your uncompressed data into actual English words, and put each word -- including the space at the beginning of the word -- into the lexicon file. Then use "sort --unique" to eliminate duplicate words in that lexicon file. (Is picking the perfect "optimum" lexicon wordlist still considered NP-hard?)

Store the lexicon at the beginning of your huge compressed file, pad it out to some convenient BLOCKSIZE, and then store the compressed text -- a series of two byte "words" -- from there to the end of the file. Presumably the searcher will read this lexicon once and keep it in some quick-to-decode format in RAM during decompression, to speed up decompressing "two byte code" to "variable-length phrase". My first draft would start with a simple one line per phrase list, but you might later switch to storing the lexicon in a more compressed form using some sort of incremental coding or zlib.

You can pick any random even byte offset into the compressed text, and start decompressing from there. I don't think it's possible to make a finer-grained random access compressed file format.
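A toy sketch of this scheme in Python (the names are made up, and the byte-escape trick for out-of-lexicon words is omitted):

    import re, struct

    def tokenize(text):
        # Each token is a word together with the whitespace in front of it,
        # so concatenating tokens reproduces the original text.
        return re.findall(r"\s*\S+", text)

    def compress(text):
        words = tokenize(text)
        lexicon = sorted(set(words))
        assert len(lexicon) < 65536, "a real version needs the byte-escape trick described above"
        code = {w: i for i, w in enumerate(lexicon)}
        body = b"".join(struct.pack("<H", code[w]) for w in words)
        return lexicon, body    # in practice, store the lexicon at the front of the file

    def decompress_from(lexicon, body, byte_offset):
        # Any even byte offset into the body is a valid starting point.
        assert byte_offset % 2 == 0
        count = (len(body) - byte_offset) // 2
        codes = struct.unpack_from("<%dH" % count, body, byte_offset)
        return "".join(lexicon[c] for c in codes)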

David Cary
+1  A: 

Take a look at dictzip. It is compatible with gzip and allows coarse random access.

An excerpt from its man page:

dictzip compresses files using the gzip(1) algorithm (LZ77) in a manner which is completely compatible with the gzip file format. An extension to the gzip file format (Extra Field, described in 2.3.1.1 of RFC 1952) allows extra data to be stored in the header of a compressed file. Programs like gzip and zcat will ignore this extra data. However, [dictzcat --start] will make use of this data to perform pseudo-random access on the file.
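For the curious, here is a rough sketch of reading that Extra Field in Python. The RFC 1952 framing (subfield ID, length, data) is standard; the layout I assume for the 'RA' payload (version, chunk length, chunk count, then one 16-bit compressed size per chunk, all little-endian) should be double-checked against the dictzip source:

    import struct

    FEXTRA = 0x04   # gzip FLG bit indicating an Extra Field is present

    def read_dictzip_index(path):
        """Return (uncompressed chunk length, list of compressed chunk sizes)."""
        with open(path, "rb") as f:
            header = f.read(10)                       # fixed part of the gzip header
            if header[:2] != b"\x1f\x8b" or (header[3] & FEXTRA) == 0:
                raise ValueError("not a gzip file with an Extra Field")
            xlen, = struct.unpack("<H", f.read(2))
            extra = f.read(xlen)
        pos = 0
        while pos + 4 <= len(extra):                  # walk the RFC 1952 subfields
            sub_id = extra[pos:pos + 2]
            sub_len, = struct.unpack_from("<H", extra, pos + 2)
            data = extra[pos + 4:pos + 4 + sub_len]
            if sub_id == b"RA":
                ver, chlen, chcnt = struct.unpack_from("<3H", data, 0)
                sizes = struct.unpack_from("<%dH" % chcnt, data, 6)
                return chlen, list(sizes)
            pos += 4 + sub_len
        raise ValueError("no dictzip 'RA' subfield found")

Given the chunk length and the per-chunk compressed sizes, the chunk holding uncompressed offset X is X // chlen, and its compressed data is found by summing the sizes of the preceding chunks, starting just past the end of the gzip header.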

The dictzip package is available in Ubuntu, and its source code ships in dictd-*.tar.gz. It is licensed under the GPL, so you are free to study it.

Ivo Danihelka