I've got a large number of integer arrays. Each one has a few thousand integers in it, and each integer is generally the same as the one before it or is different by only a single bit or two. I'd like to shrink each array down as small as possible to reduce my disk IO.

Zlib shrinks it to about 25% of its original size. That's nice, but I don't think its algorithm is particularly well suited for the problem. Does anyone know a compression library or simple algorithm that might perform better for this type of information?

Update: zlib after converting it to an array of xor deltas shrinks it to about 20% of the original size.

A: 

Did you try bzip2 for this? http://bzip.org/

It's always worked better than zlib for me.

Jason Coco
+4  A: 

Have you considered Run-length encoding?

Or try this: Instead of storing the numbers themselves, you store the differences between the numbers. 1 1 2 2 2 3 5 becomes 1 0 1 0 0 1 2. Now most of the numbers you have to encode are very small. To store a small integer, use an 8-bit integer instead of the 32-bit one you'll encode on most platforms. That's a factor of 4 right there. If you do need to be prepared for bigger gaps than that, designate the high-bit of the 8-bit integer to say "this number requires the next 8 bits as well".

You can combine that with run-length encoding for even better compression ratios, depending on your data.

Neither of these options is particularly hard to implement, and both run very fast with very little memory (as opposed to, say, bzip).
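A rough Python sketch of that delta-plus-small-codeword idea (function names are mine, and I've added a zigzag step, which the answer doesn't mention, so that negative differences also stay small):

```python
def encode(values):
    """Delta-encode, then emit each delta as continuation-bit varint bytes."""
    out, prev = bytearray(), 0
    for v in values:
        d, prev = v - prev, v
        u = (d << 1) ^ (d >> 63)               # zigzag: small magnitudes -> small unsigned
        while True:
            if u >> 7:
                out.append((u & 0x7F) | 0x80)  # high bit set: more bytes follow
                u >>= 7
            else:
                out.append(u)
                break
    return bytes(out)

def decode(data):
    values, prev, u, shift = [], 0, 0, 0
    for byte in data:
        u |= (byte & 0x7F) << shift
        shift += 7
        if not byte & 0x80:                    # last byte of this varint
            prev += (u >> 1) ^ -(u & 1)        # undo zigzag
            values.append(prev)
            u, shift = 0, 0
    return values
```

With the example sequence 1 1 2 2 2 3 5 every delta fits in a single byte.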

Marvin
The process of delta-encoding followed by run-length encoding has worked really well for me in the past. I used it to compress word location data in a full-text indexing system.
Ferruccio
A: 

http://en.wikipedia.org/wiki/LZMA

HTH

plan9assembler
+2  A: 

Perhaps the answer is to pre-filter the arrays in a way analogous to the filtering used to create small PNG images. Here are some ideas right off the top of my head. I've not tried these approaches, but if you feel like playing, they could be interesting.

  1. Break each of your ints into 4 bytes, so i0, i1, i2, ..., in becomes b0,0, b0,1, b0,2, b0,3, b1,0, b1,1, b1,2, b1,3, ..., bn,0, bn,1, bn,2, bn,3. Then write out all the bi,0s, followed by the bi,1s, bi,2s, and bi,3s. If most of the time your numbers differ only by a bit or two, you should get nice long runs of repeated bytes, which should compress really nicely using something like run-length encoding or zlib. This is my favourite of the methods I present.

  2. If the integers in each array are closely related to the one before, you could maybe store the original integer, followed by diffs against the previous entry - this should give a smaller set of values to draw from, which typically results in a more compressed form.

  3. If you have various bits differing, you may still have largish numeric differences, but if those large differences usually correspond to only one or two bits differing, you may be better off with a scheme where you create a byte array - use the first 4 bytes to encode the first integer, and then for each subsequent entry, use 0 or more bytes to indicate which bits should be flipped - storing 0, 1, 2, ..., or 31 in each byte, with a sentinel (say 32) to indicate when you're done. This could reduce the raw number of bytes needed to represent an integer to something close to 2 on average, with most bytes coming from a limited set (0 - 32). Run that stream through zlib, and maybe you'll be pleasantly surprised.
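Idea 1 is easy to try in Python with struct and zlib (a sketch, assuming unsigned 32-bit little-endian integers; function names are mine):

```python
import struct
import zlib

def byte_planes(ints):
    """Transpose 32-bit ints into 4 planes: all byte 0s, then all byte 1s, etc."""
    raw = struct.pack("<%dI" % len(ints), *ints)
    return b"".join(raw[i::4] for i in range(4))

def unplanes(planed, n):
    """Inverse transform: interleave the 4 planes back into 32-bit ints."""
    raw = bytearray(4 * n)
    for i in range(4):
        raw[i::4] = planed[i * n:(i + 1) * n]
    return list(struct.unpack("<%dI" % n, bytes(raw)))

# When only low bits change, the three upper planes become long runs of
# identical bytes, which zlib's LZ77 stage handles very well.
data = [1000 + (i % 2) for i in range(4096)]
compressed = zlib.compress(byte_planes(data))
```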

Blair Conrad
+2  A: 

You want to preprocess your data -- reversibly transform it to some form that is better-suited to your back-end data compression method, first. The details will depend on both the back-end compression method, and (more critically) on the properties you expect from the data you're compressing.

In your case, zlib is a byte-wise compression method, but your data comes in (32-bit?) integers. You don't need to reimplement zlib yourself, but you do need to read up on how it works, so you can figure out how to present it with easily compressible data, or if it's appropriate for your purposes at all.

Zlib implements a form of Lempel-Ziv coding. JPEG and many other formats use Huffman coding as their back end. Run-length encoding is popular for many ad hoc uses. Etc., etc. ...

comingstorm
A: 

Since your concern is to reduce disk IO, you'll want to compress each integer array independently, without making reference to other integer arrays.

A common technique for your scenario is to store the differences, since a small number of differences can be encoded with short codewords. It sounds like you need to come up with your own coding scheme for differences, since they are multi-bit differences, perhaps using an 8-bit byte something like this as a starting point:

  • 1 bit to indicate that a complete new integer follows, or that this byte encodes a difference from the last integer;
  • 1 bit to indicate that there are more bytes following, recording more single-bit differences for the same integer;
  • 6 bits to record the bit number to flip in your previous integer.

If there are more than 4 bits different, then store the integer.

This scheme might not be appropriate if you also have a lot of completely different codes, since they'll take 5 bytes each now instead of 4.
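Here's one possible reading of that byte layout in Python - the exact flag-bit positions and the "no bits changed" marker are my assumptions, since the answer leaves them open:

```python
import struct

NEW, MORE = 0x80, 0x40   # flag bits (exact layout is an assumption)
SAME = 0x3F              # reserved "no bits changed" code (also an assumption;
                         # the answer doesn't say how to mark an identical integer)

def encode(ints):
    out, prev = bytearray(), None
    for v in ints:
        flips = [] if prev is None else [b for b in range(32) if (v ^ prev) >> b & 1]
        if prev is None or len(flips) > 4:
            out.append(NEW)                  # raw 32-bit integer follows
            out += struct.pack("<I", v)
        elif not flips:
            out.append(SAME)
        else:
            for i, b in enumerate(flips):    # one byte per flipped bit
                out.append((MORE if i < len(flips) - 1 else 0) | b)
        prev = v
    return bytes(out)

def decode(data):
    vals, prev, i = [], 0, 0
    while i < len(data):
        b = data[i]; i += 1
        if b & NEW:
            (prev,) = struct.unpack_from("<I", data, i)
            i += 4
        elif b != SAME:
            prev ^= 1 << (b & SAME)
            while b & MORE:                  # further single-bit flips
                b = data[i]; i += 1
                prev ^= 1 << (b & SAME)
        vals.append(prev)
    return vals
```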

Stephen Denne
+5  A: 

If most of the integers really are the same as the previous, and the inter-symbol difference can usually be expressed as a single bit flip, this sounds like a job for XOR.

Take an input stream like:

1101
1101
1110
1110
0110

and output:

1101
0000
0011
0000
1000

A bit of pseudocode:

compressed[0] = uncompressed[0]
loop
  compressed[i] = uncompressed[i-1] ^ uncompressed[i]

We've now reduced most of the output to 0, even when a high bit is changed. The RLE compression in any other tool you use will have a field day with this. It'll work even better on 32-bit integers, and it can still encode a radically different integer popping up in the stream. You're saved the bother of dealing with bit-packing yourself, as everything remains an int-sized quantity.

When you want to decompress:

uncompressed[0] = compressed[0]
loop
  uncompressed[i] = uncompressed[i-1] ^ compressed[i]

This also has the advantage of being a simple algorithm that is going to run really, really fast, since it is just XOR.
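The pseudocode above maps directly onto Python's array module (a sketch, assuming unsigned 32-bit values; function names are mine):

```python
from array import array

def xor_deltas(values):
    """First value kept as-is; every later value XORed with its predecessor."""
    out = array("I", values)
    for i in range(len(out) - 1, 0, -1):  # walk backwards so each XOR sees the original neighbour
        out[i] ^= out[i - 1]
    return out

def un_xor(deltas):
    """Invert xor_deltas by re-accumulating the XORs front to back."""
    out = array("I", deltas)
    for i in range(1, len(out)):
        out[i] ^= out[i - 1]
    return out
```

Pass out.tobytes() to zlib afterwards and the long runs of zero bytes do the rest.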

Jay Kominek
This is a really good idea. It compresses down to 20% of the original size, which is better than what I had.
twk
A: 

"Zlib shrinks it by a factor of about 4x." means that a file of 100K now takes up -300K; that's pretty impressive by any definition :-). I assume you mean it shrinks it by 75%, i.e., to 1/4 its original size.

One possibility for an optimized compression is as follows (it assumes a 32-bit integer and at most 3 bits changing from element to element).

  • Output the first integer (32 bits).
  • Output the number of bit changes (n=0-3, 2 bits).
  • Output n bit specifiers (0-31, 5 bits each).

Worst case for this compression is 3 bit changes in every integer (2+5+5+5 bits) which will tend towards 17/32 of original size (46.875% compression).

I say "tends towards" since the first integer is always 32 bits but, for any decent-sized array, that first integer would be negligible.

Best case is a file of identical integers (no bit changes for every integer, just the 2 zero bits) - this will tend towards 2/32 of original size (93.75% compression).

Where you average 2 bits different per consecutive integer (as you say is your common case), you'll get 2+5+5 bits per integer which will tend towards 12/32 or 62.5% compression.

Your break-even point (if zlib gives 75% compression) is 8 bits per integer which would be

  • single-bit changes (2+5 = 7 bits) : 80% of the transitions.
  • double-bit changes (2+5+5 = 12 bits) : 20% of the transitions.

This means your average would have to be 1.2 bit changes per integer to make this worthwhile.
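Those figures are easy to sanity-check with a small helper (my own function, assuming the 2+5n-bit scheme above and ignoring the initial 32-bit integer):

```python
def estimated_ratio(p):
    """Expected compressed size as a fraction of the original, where p[n] is
    the probability that an integer differs from its predecessor in exactly
    n bits (n = 0..3)."""
    bits_per_int = sum(prob * (2 + 5 * n) for n, prob in enumerate(p))
    return bits_per_int / 32.0

# Worst case (always 3 bits change) gives 17/32; best case (no changes) gives
# 2/32; the 80/20 single/double-bit mix lands exactly on the 25% break-even.
```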

One thing I would suggest looking at is 7zip - this has a very liberal licence and you can link it with your code (I think the source is available as well).

I notice it performs MUCH better than WinZip on a Windows platform so it may also outperform zlib.

paxdiablo