One of the core steps in file compression like ZIP is to use the previous decoded text as a reference source. For example, the encoded stream might say "the next 219 output characters are the same as the characters from the decoded stream 5161 bytes ago." This lets you represent 219 characters with just 3 bytes or so. (There's more to ZIP than that, like Huffman compression, but I'm just talking about the reference matching.)
My question is what the strategy(ies) for the string matching algorithm is. Even looking at source code from zlib and such don't seem to give a good description of the compression matching algorithm.
The problem might be stated as: Given a block of text, say 30K of it, and an input string, find the longest reference in the 30K of text which exactly matches the front of the input string." The algorithm must be efficient when iterated, ie, the 30K block of text will be updated by deleting some bytes from the front and adding new ones to the rear and a new match performed.
I'm a lot more interested in discussions of the algorithm(s) to do this, not source code or libraries. (zlib has very good source!) I suspect there may be several approaches with different tradeoffs.