views:

221

answers:

3

One of the core steps in file compression like ZIP is to use the previous decoded text as a reference source. For example, the encoded stream might say "the next 219 output characters are the same as the characters from the decoded stream 5161 bytes ago." This lets you represent 219 characters with just 3 bytes or so. (There's more to ZIP than that, like Huffman compression, but I'm just talking about the reference matching.)

My question is what the strategy(ies) for the string matching algorithm is. Even looking at source code from zlib and such don't seem to give a good description of the compression matching algorithm.

The problem might be stated as: Given a block of text, say 30K of it, and an input string, find the longest reference in the 30K of text which exactly matches the front of the input string." The algorithm must be efficient when iterated, ie, the 30K block of text will be updated by deleting some bytes from the front and adding new ones to the rear and a new match performed.

I'm a lot more interested in discussions of the algorithm(s) to do this, not source code or libraries. (zlib has very good source!) I suspect there may be several approaches with different tradeoffs.

+1  A: 

I think you're describing a modified version of the Longest Common Substring Problem.

Jason Punyon
Yes, it's very related. The big (and crucial) difference is that the match is repeated literally millions of times, each time with a new, unique source string (from sliding the previous source into a new window.)
SPWorley
+1  A: 

You could look at the details of the LZMA Algorithm used by 7-zip. The 7-zip author claims to have improved on the algorithm used by zlib et al.

Kristo
A: 

Well, I notice that you go into some detail about the problem but don't mention the information provided in section 4 of RFC 1951 (the specification for the DEFLATE Compressed Data Format, i.e. the format used in ZIP) which leads me to believe you might have missed this resource.

Their basic approach is a chained hash table using three-byte sequences as keys. As long as the chain is not empty, all the entries along it are scanned to a) eliminate false collisions, b) eliminate matches that are too old, and c) pick the longest match out of those remaining.

(Note that their recommendation is shaped by the factor of patents; it may be that they knew of a more effective technique but could not be sure that it was not covered by someone's patent. Personally, I've always wondered why one couldn't find the longest matches by examining the matches for the three-byte sequences that start at the second byte of the incoming data, the third byte, etc. and weeding out matches that don't match up. i.e., if your incoming data is "ABCDEFG..." and you've got hash matches for "ABC" at offsets 100, 302 and 416 but your only hash match for "BCD" is at offset 301, you know that unless you have two entirely coincidental overlapping hash matches -- unlikely -- then 302 is your longest match.)

Also note their recommendation of optional "lazy matching" (which ironically does more work): instead of automatically taking the longest match that starts at the first byte of the incoming data, the compressor checks for an even longer match starting at the next byte. If your incoming data is "ABCDE..." and your only matches in the sliding window are for "ABC" and for "BCDE", you're better off encoding the "A" as a literal byte and the "BCDE" as a match.