Java Deflater strategies - DEFAULT_STRATEGY, FILTERED and HUFFMAN_ONLY

The strategy options mentioned in the Java Deflater originated in the zlib (C) implementation of ZLIB and (RFC 1950) and DEFLATE (1951), I believe. They are present in virtually all compression libraries that implement DEFLATE.

To understand what they mean, you need to know a little about DEFLATE. The compression algorithm combines LZ77 and Huffman coding. The basics are:

LZ77 compression works by finding sequences of data that are repeated. Implementations typically use a "sliding window" of between 1k and 32k, to keep track of data that went before. For any repeats in the original data, instead of inserting the repeated data in the output, the LZ77 compression inserts a "back-reference". Imagine the back reference saying "here, insert the same data you saw 8293 bytes ago, for 17 bytes".
Huffman coding substitutes codes for the actual data. When the data says X, the Huffman code says Y. This obviously helps compression only if the substitute is shorter than the original. (the counter-example is in Yes Man, when Norm uses "Car" for a shortname for Carl. Carl points out that Carl is already pretty short.) The Huffman encoding algorithm does a frequency analysis, and uses the shortest substitutes for the data sequences that occur most often.

Deflate combines these, so that one can use Huffman codes on LZ77 back-refs. The strategy options on various DEFLATE/ZLIB compressors just tells the library how much to weight Huffman versus LZ77.

FILTERED usually means the LZ77 matches are stopped at a length of 5. So when the documentation says

Use (Filtered) for data produced by a filter (or predictor), ... Filtered data consists mostly of small values with a somewhat random distribution.

(from the zlib man page)
...my reading of the code says that it does LZ77 matching, but only up to sequences of 5 or fewer bytes. That's what the doc means by "small values" I guess. But the number 5 isn't mentioned in the doc, so there's no guarantee that number won't be changed from rev to rev, or from one implementation of ZLIB/DEFLATE to another (like the C version and the Java version).
HUFFMAN_ONLY says, only do the substitution codes based on frequency analysis. HUFFMAN_ONLY is very very fast, but not very effective in compression for most data. Unless you have a evry small range of byte values (for example, if bytes in your actual datastream take one of only 20 of the possible 255 values), or have extreme requirements for speed in compression at the expense of size, HUFFMAN_ONLY will not be what you want.
DEFAULT combines the two in the way the authors expected it would be most effective for most applications.

As far as I know there is no way to do only LZ77. There is no LZ77_ONLY strategy.

Keep in mind that the strategy never affects correctness of compression; it affects only the performance of it, either in speed or size.

There are other ways to tweak the compressor. One is to set the size of the LZ77 sliding window. In the C library, this is specified with a "Window bits" option. If you understand LZ77, then you know that a smaller window means less searching back, which means faster compression, at the expense of missing some matches. This is often the more effective knob to turn when compressing.

The bottom line is that, for the 80% case, you don't care to tweak strategy. You might be interested in fiddling with window bits, just to see what happens. But only do that when you've done everything else you need to do in your app.

reference:
How DEFLATE works, by Antaeus Feldspar

ansaurus

tags:

views:

answers:

Java Deflater strategies - DEFAULT_STRATEGY, FILTERED and HUFFMAN_ONLY

related questions