views:

508

answers:

1

For a personal project of mine, I'm writing up a small class to compress to and decompress from a rather obscure format. I've got the full spec, but that's not where the problem is.

First, this 'format' uses a set of 6 different compression types as well as uncompressed blocks of byte data. The formats are RLE, an offshoot of RLE where the number increments each byte (e.g. 3, 4, 5, ...), a 16-bit RLE, LZ-Copy, a reverse LZ-copy, and LZ-Copy Xor'd with 255. It's not the cleanest of specs, but I didn't design it either.

My compression routine is supposed to take in an array of anywhere from 1 to 65535 bytes, and (hopefully) compress it as much as possible. My previous attempt at this simply calculated out, starting from any index in the uncompressed stream, which of the compression techniques above will provide the best compression, and then compresses however many bytes that method will compress to the array of compressed bytes before repeating from the new 'uncompressed' index, e.g.:

{0,0,0,1,2,3,4}

The algorithm would first read that there were three zeroes at the start, and then output the RLE encoding for them that the spec used, and then starting from the fourth element, would read that incrementing RLE would cover the '1,2,3,4' well enough and compress that before returning.

The problem summarized is that while trying to find out the best spec to use, the routine is very slow even on small (20-30) byte arrays. Can anyone help with tips on how I might look at optimizing this, or if there's any more information I could provide to help?

A: 

It sounds like what you're trying to do is work out a large number of compression possibilities for every possible segment (let's call your variable length 1-64K blocks segments) of the file. Correct me if I'm wrong, but are you working out the best compression for the first segment from the following choices (method 0 is uncompressed):

  • compression method 0, length 1 byte.
  • compression method 1, length 1 byte.
  • : : : : :
  • compression method 6, length 1 byte.
  • compression method 0, length 2 bytes.
  • compression method 1, length 2 bytes.
  • : : : : :
  • compression method 6, length 65534 bytes.
  • compression method 0, length 65535 bytes.
  • compression method 1, length 65535 bytes.
  • compression method 2, length 65535 bytes.
  • compression method 3, length 65535 bytes.
  • compression method 4, length 65535 bytes.
  • compression method 5, length 65535 bytes.
  • compression method 6, length 65535 bytes.

That's going to take a huge amount of time (roughly 420,000 compression attempts per segment). If that is what you're doing, you'll be better off choosing a single segment size (e.g., 64K) and applying each of the seven compression methods to it to choose the best. Then, for each segment, output the "method" byte followed by the compressed data.

paxdiablo
No, not quite like that. I look up basically to see how many bytes each compression type can compress starting from a given index in the uncompressed bytes, and then pick the one which gives the most compressed bytes, before starting again from the new source index.
Sukasa
compression method 0, length 1 byte, index 0....compression method 7, length 1 byte, index 0.compression method 0, length 1 byte, index 8....compression method 7, length 1 byte, index 8.compression method 0, length 1 byte, index 36....compression method 7, length 1 byte, index 36.
Sukasa
Note that this is probably an NP-complete problem, since changing the source index of attempt nr. 2 might produce different results. What I mean is, that even if you pick a sub-optimal first method, the rest of the compression might give a better result overall.
Lasse V. Karlsen