I have an application where I am reading and writing small blocks of data (a few hundred bytes) hundreds of millions of times. I'd like to generate a compression dictionary based on an example data file and use that dictionary forever as I read and write the small blocks. I'm leaning toward the LZW compression algorithm. The Wikipedia page (http://en.wikipedia.org/wiki/Lempel-Ziv-Welch) lists pseudocode for compression and decompression. It looks fairly straightforward to modify it such that the dictionary creation is a separate block of code. So I have two questions:

  1. Am I on the right track or is there a better way?
  2. Why does the LZW algorithm add to the dictionary during the decompression step? Can I omit that, or would I lose efficiency in my dictionary?

Thanks.
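
To make concrete what I mean by splitting the dictionary creation into its own step, here is the rough shape I have in mind (an untested C# sketch; the byte-to-char trick for dictionary keys and the 4096-entry cap are just placeholders):

    using System;
    using System.Collections.Generic;

    static class LzwTraining
    {
        // Run the normal LZW table-building loop over the example file once,
        // keep the resulting table, and throw the output codes away.
        // The frozen table would then be reused for every small block.
        public static Dictionary<string, int> BuildDictionary(byte[] sample, int maxEntries)
        {
            var dict = new Dictionary<string, int>();
            for (int i = 0; i < 256; i++)                // seed with every single byte
                dict[((char)i).ToString()] = i;

            string w = string.Empty;
            foreach (byte b in sample)
            {
                string wb = w + (char)b;
                if (dict.ContainsKey(wb))
                {
                    w = wb;                              // keep extending the current match
                }
                else
                {
                    if (dict.Count < maxEntries)
                        dict[wb] = dict.Count;           // learn a new phrase
                    w = ((char)b).ToString();            // (a real compressor would emit dict[w] here)
                }
            }
            return dict;
        }
    }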

Edit: Now I'm thinking the ideal case would be to find a library that lets me store the dictionary separately from the compressed data. Does anything like that exist?

Edit: I ended up taking the code at http://www.enusbaum.com/blog/2009/05/22/example-huffman-compression-routine-in-c and adapting it. I am Chris in the comments on that page. I emailed my modifications back to the blog's author, but I haven't heard back yet. The compression rates I'm seeing with that code are not at all impressive. Maybe that is due to the 8-bit tree size.

A: 
  1. The right track is to use an existing library -- almost every modern language has a compression library: C#, Python, Perl, Java, VB.net, whatever you use.

  2. LZW saves some space by making the dictionary depend on the previous input. It has an initial dictionary, and as you decompress, new entries are added to it -- so the dictionary keeps growing. (I am omitting some details here, but this is the general idea.)

You can omit this step by supplying the whole (complete) dictionary as the initial one, but this would cost some space.

J-16 SDiZ
I'm fine with using a compression library. I didn't think I'd find one that would give me such low-level access to the dictionary. Do you have an example?
Fantius
+1  A: 

LZW adds to the dictionary during decompression to ensure it has the same dictionary state as the compressor. Otherwise the decoding would not function properly.

However, if you were in a state where the dictionary was fixed then, yes, you would not need to add new codes.
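
To illustrate, here is a sketch of textbook LZW decoding in C# (just an illustration, not code from this question). The marked line is the update that keeps the decoder's dictionary in lock-step with the encoder's, and it is exactly what you could drop once the dictionary is fixed:

    using System;
    using System.Collections.Generic;
    using System.Text;

    static class LzwDecode
    {
        // Plain LZW decoding. The dictionary starts in the same state the
        // compressor started in and grows by exactly the same rule, so the
        // two sides never have to exchange the dictionary itself.
        public static string Decompress(IList<int> codes)
        {
            var dict = new List<string>();
            for (int i = 0; i < 256; i++)                // same initial dictionary as the compressor
                dict.Add(((char)i).ToString());

            string w = dict[codes[0]];
            var output = new StringBuilder(w);

            for (int i = 1; i < codes.Count; i++)
            {
                int k = codes[i];
                string entry = k < dict.Count
                    ? dict[k]
                    : w + w[0];                          // code not in the table yet: the "cScSc" special case

                output.Append(entry);

                dict.Add(w + entry[0]);                  // <-- the sync step: mirrors the compressor's add.
                                                         //     With a frozen, pre-built dictionary this line
                                                         //     (and the special case above) go away.
                w = entry;
            }
            return output.ToString();
        }
    }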

Your approach will work reasonably well, and it's easy to use existing tools to prototype and measure the results. That is, compress the example file by itself, then compress the example file with a test block appended. The size of the latter less the former approximates the compressed size of that block.
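
With the stock .NET classes that measurement is only a few lines (a rough sketch; DeflateStream is just a stand-in for whatever compressor you end up testing):

    using System.IO;
    using System.IO.Compression;

    static class SizeEstimate
    {
        static int CompressedSize(byte[] data)
        {
            MemoryStream ms = new MemoryStream();
            using (DeflateStream ds = new DeflateStream(ms, CompressionMode.Compress))
            {
                ds.Write(data, 0, data.Length);
            }
            return ms.ToArray().Length;   // ToArray still works after the streams are closed
        }

        // Expected compressed size of one block, given the example file as shared context:
        // size(example + block) minus size(example).
        public static int EstimateBlockSize(byte[] example, byte[] block)
        {
            byte[] combined = new byte[example.Length + block.Length];
            example.CopyTo(combined, 0);
            block.CopyTo(combined, example.Length);
            return CompressedSize(combined) - CompressedSize(example);
        }
    }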

LZW is a clever way to build up a dictionary on the fly and gives decent results. But a more thorough analysis of your typical data blocks is likely to generate a more efficient dictionary.

There's also room for improvement in how LZW represents compressed data. For instance, each dictionary reference could be Huffman encoded to a length closer to optimal, based on the expected frequency of its use. To be truly optimal, the codes should be arithmetic encoded.

George Phillips
Ahh, your answer to #2 makes perfect sense and seems obvious in hindsight. Thanks.
Fantius
A: 

I would look at your data to see if there's an obvious reason it's so easy to compress. You might be able to do something much simpler than LZ78. I've done both LZ77 (lookback) and LZ78 (dictionary).

Try running an LZ77 compressor on your data. There's no dictionary with LZ77, so you could use a library without alteration. Deflate is an implementation of LZ77.

Your idea of using a common dictionary is a good one, but it's hard to know whether the files are similar to each other or just self-similar without doing some tests.

Nosredna
I have tried deflate. With data as small as only a few hundred bytes, it is not so good.
Fantius
According to Wikipedia, "LZ78 decompression allows random access to the input as long as the entire dictionary is available." That sounds perfect for my application, but I don't know of a library that allows such operations.
Fantius
Yeah. You need to freeze a dictionary. The equivalent thing to do in LZ77 would be to have fake data lying around that precedes the real data, allowing lookback into the fake data. The fake data could just be a couple of representative files.
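
Something along these lines with DeflateStream, as a rough sketch of the idea (the primer handling here is made up for illustration):

    using System;
    using System.IO;
    using System.IO.Compression;

    static class PrimedDeflate
    {
        // 'primer' is a chunk of representative data shared by writer and reader.
        // Prepending it lets Deflate's lookback window find matches for the real
        // block inside the primer, as if it had already "seen" typical data.
        public static byte[] Compress(byte[] primer, byte[] block)
        {
            MemoryStream ms = new MemoryStream();
            using (DeflateStream ds = new DeflateStream(ms, CompressionMode.Compress))
            {
                ds.Write(primer, 0, primer.Length);
                ds.Write(block, 0, block.Length);
            }
            return ms.ToArray();
        }

        public static byte[] Decompress(byte[] compressed, int primerLength, int blockLength)
        {
            using (DeflateStream ds = new DeflateStream(new MemoryStream(compressed), CompressionMode.Decompress))
            {
                byte[] scratch = new byte[primerLength];
                ReadFully(ds, scratch, primerLength);        // decompress and discard the primer
                byte[] block = new byte[blockLength];
                ReadFully(ds, block, blockLength);           // keep the real block
                return block;
            }
        }

        static void ReadFully(Stream s, byte[] buffer, int count)
        {
            int offset = 0;
            while (offset < count)
            {
                int read = s.Read(buffer, offset, count - offset);
                if (read == 0) throw new EndOfStreamException();
                offset += read;
            }
        }
    }

Note that the compressed output still carries the primer bytes, so on disk this is only a win if the library can take the primer out of band (zlib's preset-dictionary feature does that); it mainly shows how much a shared history would help.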
Nosredna
+1  A: 

The hard part in my mind is how you build your static dictionary. You don't want to use the LZW dictionary built from your sample data. LZW wastes a bunch of time learning since it can't build the dictionary faster than the decompressor can (a token will only be used the second time it's seen by the compressor, so the decompressor can add it to its dictionary the first time it's seen). The flip side of this is that it's adding things to the dictionary that may never get used, just in case the string shows up again. (e.g., to have a token for 'stackoverflow' you'll also have entries for 'ac','ko','ve','rf' etc...)

However, looking at the raw token stream from an LZ77 algorithm could work well. You'll only see tokens for strings seen at least twice. You can then build a list of the most common tokens/strings to include in your dictionary.
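
A crude substitute for that token-stream analysis, if you just want candidate strings (a sketch, not the LZ77 approach described above): count how often each short substring occurs in the sample and rank by a rough estimate of the bytes a dictionary entry would save.

    using System;
    using System.Collections.Generic;
    using System.Linq;

    static class DictionaryBuilder
    {
        // Count every substring of length minLen..maxLen in the sample, then rank
        // by (occurrences - 1) * length, a rough proxy for the bytes a dictionary
        // entry would save. Take the top 'take' strings as dictionary candidates.
        public static List<string> FrequentSubstrings(byte[] sample, int minLen, int maxLen, int take)
        {
            string text = new string(Array.ConvertAll(sample, b => (char)b));  // one char per byte
            var counts = new Dictionary<string, int>();

            for (int len = minLen; len <= maxLen; len++)
                for (int i = 0; i + len <= text.Length; i++)
                {
                    string s = text.Substring(i, len);
                    int c;
                    counts.TryGetValue(s, out c);
                    counts[s] = c + 1;
                }

            return counts
                .Where(kv => kv.Value > 1)                  // only strings seen at least twice
                .OrderByDescending(kv => (long)(kv.Value - 1) * kv.Key.Length)
                .Take(take)
                .Select(kv => kv.Key)
                .ToList();
        }
    }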

Once you have a static dictionary, using LZW sans the dictionary update seems like an easy implementation, but to get the best compression I'd consider a static Huffman table instead of the traditional 12-bit fixed-size tokens (as George Phillips suggested). An LZW dictionary will burn tokens on all the sub-strings you may never actually encode (e.g., if you can encode 'stackoverflow', there will also be tokens for 'st', 'sta', 'stac', 'stack', 'stacko', etc.).

At this point it really isn't LZW -- what makes LZW clever is how the decompressor can build the same dictionary the compressor used while seeing only the compressed data stream, something you won't be using. But all LZW implementations have a state where the dictionary is full and is no longer updated; this is how you'd use it with your static dictionary.
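
A sketch of that frozen mode (greedy longest match against a table that never changes, using the same string-keyed, one-char-per-byte convention as the sketch in the question; codes are plain ints here rather than Huffman coded):

    using System;
    using System.Collections.Generic;

    static class FrozenLzw
    {
        // Compress a block against a fixed dictionary: repeatedly take the longest
        // prefix of the remaining input that is in the table and emit its code.
        // The dictionary must contain every single byte so a match always exists.
        public static List<int> Compress(byte[] block, Dictionary<string, int> dict, int maxPhraseLen)
        {
            string text = new string(Array.ConvertAll(block, b => (char)b));
            var codes = new List<int>();
            int pos = 0;
            while (pos < text.Length)
            {
                int bestLen = 1;
                int bestCode = dict[text.Substring(pos, 1)];
                int limit = Math.Min(maxPhraseLen, text.Length - pos);
                for (int len = 2; len <= limit; len++)        // naive longest-match scan
                {
                    int code;
                    if (dict.TryGetValue(text.Substring(pos, len), out code))
                    {
                        bestLen = len;
                        bestCode = code;
                    }
                }
                codes.Add(bestCode);
                pos += bestLen;
            }
            return codes;
        }
    }

Decompression against the same table is then just a reverse lookup per code, with no updates.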

Tony Lee