views: 447
answers: 4

Noticing that byte-pair encoding (BPE) is sorely lacking from the large text compression benchmark, I very quickly made a trivial literal implementation of it.

The compression ratio - considering that there is no further processing, e.g. no Huffman or arithmetic encoding - is surprisingly good.

The runtime of my trivial implementation was less than stellar, however.

How can this be optimized? Is it possible to do it in a single pass?

+1  A: 

I don't believe this can be done in a single pass unless you find a way to predict, given a byte-pair replacement, whether the new byte-pair created by the replacement will itself be a good candidate for replacement.

Here are my thoughts at first sight. Maybe you already do all this, or have already considered it.

I would try the following.

Two adjustable parameters:

  1. Minimum number of byte-pair occurrences in a chunk of data before it is considered for replacement. (So that the dictionary doesn't grow faster than the chunk shrinks.)
  2. Minimum number of replacements per pass below which it's probably not worth replacing anymore. (So that the algorithm stops wasting time when there's maybe only 1 or 2 % left to gain.)

I would do passes, as long as it is still worth compressing one more level (according to parameter 2). During each pass, I would keep a count of byte-pairs as I go.

I would play with the two parameters a little and see how they influence compression ratio and speed. They should probably change dynamically, according to the length of the chunk to compress (and maybe one or two other things).

Another thing to consider is the data structure used to store the count of each byte-pair during the pass. There very likely is a way to write a custom one which would be faster than generic data structures.
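
For illustration, a minimal sketch of what such a custom structure could look like in C (names are illustrative, not taken from any of the implementations discussed here): since a byte pair is only a 16-bit value, a flat array of 65536 counters indexed by the pair itself avoids any hashing or tree overhead.

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    /* One counter per possible byte pair: the pair (a, b) is its own index. */
    static uint32_t pair_count[256 * 256];

    /* Count every adjacent byte pair in one chunk of data. */
    static void count_pairs(const unsigned char *chunk, size_t len)
    {
        memset(pair_count, 0, sizeof pair_count);
        for (size_t i = 0; i + 1 < len; i++)
            pair_count[(chunk[i] << 8) | chunk[i + 1]]++;
    }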

Keep us posted if you try something and get interesting results!

M. Joanis
+3  A: 

This is a summary of my progress so far:

Googling found this little report that links to the original code and cites the source:

an article by Philip Gage titled 'A New Algorithm for Data Compression', which appeared in 'The C Users Journal', February 1994 edition.

The links to the code on the Dr Dobb's site are broken, but that webpage mirrors them.

That code uses a hash table to track the used digraphs and their counts on each pass over the buffer, so as to avoid recomputing them from scratch every pass.

My test data is enwik8 from the Hutter Prize.

|----------------|----------------|--------------------------------------------------|
| Implementation | Time (min.sec) | Notes                                            |
|----------------|----------------|--------------------------------------------------|
| bpev2          | 1.24           | The current version in the large text benchmark  |
| bpe_c          | 1.07           | The original version by Gage, using a hash table |
| bpev3          | 0.25           | Uses a list, custom sort, less memcpy            |
|----------------|----------------|--------------------------------------------------|

bpev3 creates a list of all digraphs; the blocks are 10KB in size, and there are typically 200 or so digraphs above the threshold (of 4, the smallest count at which a replacement still gains a byte); this list is sorted and the first substitution is made.

As the substitutions are made, the statistics are updated; typically only around 10 or 20 digraphs change each pass; these are 'painted' and sorted, and then merged with the digraph list; this is substantially faster than always re-sorting the whole digraph list each pass, since the list is nearly sorted.
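
A rough sketch of that incremental bookkeeping, assuming a flat 65536-entry table of pair counts and illustrative names (this is not the actual bpev3 code): when the pair (a, b) at some position is replaced by a new code byte, only the consumed pair and its two neighbours change, so the counts can be adjusted locally instead of recounting the whole block.

    #include <stddef.h>
    #include <stdint.h>

    static uint32_t pair_count[256 * 256];   /* counts from the counting pass */

    /* Adjust pair counts when the pair (a, b) starting at position i in 'buf'
     * is about to be replaced by the single byte 'code'.  Only the consumed
     * pair and its two neighbours are affected; the buffer compaction that
     * accompanies the replacement is left out of this sketch. */
    static void update_counts(const unsigned char *buf, size_t len,
                              size_t i, unsigned char code)
    {
        unsigned char a = buf[i], b = buf[i + 1];

        pair_count[(a << 8) | b]--;                    /* one (a, b) disappears */
        if (i > 0) {
            pair_count[(buf[i - 1] << 8) | a]--;       /* (prev, a) disappears  */
            pair_count[(buf[i - 1] << 8) | code]++;    /* (prev, code) appears  */
        }
        if (i + 2 < len) {
            pair_count[(b << 8) | buf[i + 2]]--;       /* (b, next) disappears  */
            pair_count[(code << 8) | buf[i + 2]]++;    /* (code, next) appears  */
        }
    }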

The original code moved data between 'tmp' and 'buf' byte buffers; bpev3 just swaps buffer pointers, which alone is worth about 10 seconds of runtime.
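
The swap itself is just a pointer exchange, something along these lines (illustrative, not the original code):

    /* Swap the input and output buffers between passes instead of copying. */
    static void swap_buffers(unsigned char **in, unsigned char **out)
    {
        unsigned char *tmp = *in;
        *in  = *out;
        *out = tmp;
    }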

Given that the buffer-swapping fix alone would bring bpev2's exhaustive search in line with the hashtable version, I think the hashtable is of arguable value, and that a list is a better structure for this problem.

It's still multi-pass though, and so it's not a generally competitive algorithm.

If you look at the Large Text Compression Benchmark, the original bpe has been added. Because of its larger blocksizes, it performs better than my bpe on enwik9. Also, the performance gap between the hash tables and my lists is much closer there - I put that down to the march=PentiumPro setting that the LTCB uses.

There are of course occasions where it is suitable and used; Symbian use it for compressing pages in ROM images. I speculate that the 16-bit nature of Thumb binaries makes this a straightforward and rewarding approach; compression is done on a PC, and decompression is done on the device.

Will
+1  A: 

I've done work optimizing an LZF compression implementation, and some of the same principles I used to improve performance are usable here.

To speed up performance on byte-pair encoding:

  1. Limit the block size to less than 65kB (probably 8-16 kB will be optimal). This guarantees not all bytes will be used, and allows you to hold intermediate processing info in RAM.
  2. Use a hashtable or a simple lookup table indexed by the pair as a short integer (more RAM, but faster) to hold counts for byte pairs. There are 65,536 possible 2-byte pairs, and at most BlockSize instances of any of them (max blocksize 64k), so a 16-bit count suffices; this gives you a table of roughly 128k.
  3. Allocate and reuse data structures capable of holding a full compression block, replacement table, byte-pair counts, and output bytes in memory. This sounds wasteful of RAM, but when you consider that your block size is small, it's worth it. Your data should be able to sit entirely in CPU L2 or (worst case) L3 cache. This gives a BIG speed boost.
  4. Do one fast pass over the data to collect counts, THEN worry about creating your replacement table.
  5. Pack bytes into integers or short ints whenever possible (applicable mostly to C/C++). A single entry in the counting table can be represented by an integer (16-bit count, plus byte pair); see the sketch after this list.
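
A hedged sketch of point 5 (names are illustrative and not from any particular codebase):

    #include <stdint.h>

    /* Pack a 16-bit count and a byte pair into one 32-bit counting-table entry. */
    static inline uint32_t pack_entry(uint16_t count, uint8_t first, uint8_t second)
    {
        return ((uint32_t)count << 16) | ((uint32_t)first << 8) | second;
    }

    static inline uint16_t entry_count(uint32_t e)  { return (uint16_t)(e >> 16); }
    static inline uint8_t  entry_first(uint32_t e)  { return (uint8_t)(e >> 8);  }
    static inline uint8_t  entry_second(uint32_t e) { return (uint8_t)e;         }

Because the count occupies the most significant bits, sorting an array of such packed entries in descending numeric order ranks candidate pairs by frequency without needing a custom comparison function.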
BobMcGee
Very good advice, and exactly how the code works. If you have anything online about your LZF, I'd very much like to look at it, as I am very interested in the subject generally.
Will
Much of what I did got folded into the CompressLZF code for the H2 Database. Don't let the choice of Java fool you -- the code is very tightly optimized in the same fashion C might be, and can compress at over 100 MB/s and decompress at nearly 1/2 to 1/4 of your memory bandwidth.
BobMcGee
A: 

Yes, keep us posted.

guarantee?

BobMcGee gives good advice. However, I suspect that "Limit the block size to less than 65kB ... . This guarantees not all bytes will be used" is not always true. I can generate a (highly artificial) binary file less than 1kB long that has a byte pair that repeats 10 times, but cannot be compressed at all with BPE because it uses all 256 bytes -- there are no free bytes that BPE can use to represent the frequent byte pair.

If we limit ourselves to 7-bit ASCII text, we have 128 free byte values available, so all files that repeat a byte pair enough times can be compressed at least a little by BPE. However, even then I can (artificially) generate a file that uses only the isgraph() ASCII characters and is less than 30kB long that eventually hits the "no free bytes" limit of BPE, even though there is still a byte pair remaining with over 4 repeats.

single pass

It seems like this algorithm can be slightly tweaked in order to do it in one pass. Assuming 7-bit ASCII plaintext: scan over the input text, remembering all pairs of bytes that we have seen so far in some sort of internal data structure, somehow counting the number of unique byte pairs we have seen, and copying each byte to the output (with high bit zero). Whenever we encounter a repeat, emit a special byte that represents a byte pair (with high bit 1, so we don't confuse literal bytes with byte pairs). Include that special byte in the internal list of byte "pairs", so that the compressor can later emit some other special byte that represents this special byte plus a literal byte -- the net effect of that other special byte is then to represent a triplet. As phkahler pointed out, this sounds practically the same as LZW.
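
A compressor-side sketch of that idea might look like the following; this is only one reading of it, assuming 7-bit ASCII input so that codes 0x80-0xFF are free, and it glosses over how a decompressor would mirror the table construction (the LZW-like part), just as the description above does.

    #include <stddef.h>
    #include <stdint.h>

    enum { FIRST_CODE = 0x80 };

    /* seen[a][b] == 0: pair never seen; == 1: seen once, no code assigned yet;
     * >= FIRST_CODE: the code assigned to the pair (a, b). */
    static uint16_t seen[256][256];

    /* Greedy one-pass substitution over 7-bit ASCII input; 'out' must hold at
     * least 'len' bytes.  Returns the number of output symbols written. */
    static size_t one_pass_sketch(const unsigned char *in, size_t len,
                                  unsigned char *out)
    {
        size_t n = 0;
        unsigned next_code = FIRST_CODE;
        int prev = -1;                        /* last symbol written to 'out' */

        for (size_t i = 0; i < len; i++) {
            unsigned cur = in[i];
            if (prev >= 0) {
                uint16_t s = seen[prev][cur];
                if (s >= FIRST_CODE) {        /* pair already has a code      */
                    n--;                      /* take back 'prev'             */
                    cur = s;                  /* emit the code instead        */
                } else if (s == 1 && next_code <= 0xFF) {
                    seen[prev][cur] = (uint16_t)next_code++;  /* second sight */
                } else if (s == 0) {
                    seen[prev][cur] = 1;                      /* first sight  */
                }
            }
            out[n++] = (unsigned char)cur;
            prev = (int)cur;
        }
        return n;
    }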

David Cary