Anyone have a recommendation on a good compression algorithm that works well with double precision floating point values? We have found that the binary representation of floating point values results in very poor compression rates with common compression programs (e.g. Zip, RAR, 7-Zip, etc.).

The data we need to compress is a one-dimensional array of 8-byte values sorted in monotonically increasing order. The values represent temperatures in Kelvin, with a span typically under 100 degrees. The number of values ranges from a few hundred to at most 64K.

Clarifications

  • All values in the array are distinct, though repetition does exist at the byte level due to the way floating point values are represented.

  • A lossless algorithm is desired since this is scientific data. Conversion to a fixed point representation with sufficient precision (~5 decimals) might be acceptable provided there is a significant improvement in storage efficiency.
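If the fixed-point route is on the table, here is a minimal sketch of what the conversion could look like (the 1e-5 K step size and the function names are just assumptions for illustration):

    #include <stdint.h>
    #include <math.h>

    /* Hypothetical fixed-point packing: store each temperature as a count of
       1e-5 K (~5 decimal places) steps above the array minimum.  With a span
       under 100 K this needs at most about 24 bits per value instead of 64. */
    uint32_t to_fixed(double kelvin, double base)
    {
        return (uint32_t)llround((kelvin - base) * 1e5);
    }

    double from_fixed(uint32_t units, double base)
    {
        return base + units * 1e-5;
    }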

Update

Found an interesting article on this subject. Not sure how applicable the approach is to my requirements.

http://users.ices.utexas.edu/~burtscher/papers/dcc06.pdf

+1  A: 

Can you do a delta between adjacent values?
Is there a limit to how much a value can change between measurements? Is it acceptable to limit this change to some maximum rate (at the cost of introducing some smoothing)?

There is obviously a limit to the real accuracy of the values from the thermal sensor: do you need to store 64 bits of precision, or are you better off storing an integer number of, say, 0.01-Kelvin units?

If you can live with some more error and the increase is relatively smooth, you might be better off just fitting a function to the data and storing only a few terms of that function.

EDIT:
Take a look at a typical data set and at the range of differences between adjacent values. Then look at what accuracy you need to represent them.

E.g. if you have a maximum 1-degree difference between readings, you could store changes in units of 1/256 of a degree in a byte. If you need to store a bigger range or more accuracy, use a short divided by some factor.
So the next reading would be last_reading + (float)increment/256.0
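A rough, untested sketch of that byte-delta idea (names are made up; it assumes successive readings differ by less than one degree):

    #include <stddef.h>
    #include <stdint.h>
    #include <math.h>

    /* Lossy delta coding: each byte encodes a step of 1/256 degree, so the
       next reading is reconstructed as last_reading + increment / 256.0. */
    void encode_deltas(const double *temps, size_t n, uint8_t *out /* n-1 bytes */)
    {
        double prev = temps[0];               /* store temps[0] separately at full precision */
        for (size_t i = 1; i < n; i++) {
            uint8_t inc = (uint8_t)lround((temps[i] - prev) * 256.0);
            out[i - 1] = inc;
            prev += inc / 256.0;              /* track the reconstructed value to avoid drift */
        }
    }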

Martin Beckett
We have considered using Delta encoding of some sort, but have not yet devised an algorithm. The values relate to an infrared image and thus do not have any significant discontinuities.
David Taylor
A: 

You could think about re-coding your data with an entropy coder (Huffman, Shannon-Fano, arithmetic coding). But this will only give good results if you have many repetitions among the data points and if you know which symbols will appear with which probability.

macs
+1  A: 

Compression algorithms live on repetitions and regularities, and floating-point numbers don't do well at that.

The first question is whether you can use single-precision floating-point values, which will give you 50% compression immediately. Few thermometers are accurate to seven digits, and the exponent will represent temperatures considerably below anything I've been told you can actually get.

If not, can you filter your temperatures, rounding them off to the equivalent of N digits (more likely N/.301 bits)? That may introduce enough regularities to be useful.
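A minimal sketch of such a rounding filter, assuming the number of digits is something you pick empirically:

    #include <math.h>

    /* Round a temperature to "digits" decimal places before compression (lossy). */
    double round_to_digits(double kelvin, int digits)
    {
        double scale = pow(10.0, digits);
        return round(kelvin * scale) / scale;
    }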

If you really have to store 64 bits of information for each temperature reading, and all the bits are significant and not predictable from the other bits, then you effectively can't compress it.

David Thornley
Single precision floats do not have sufficient precision. The data is related to infrared imaging, with device sensitivity on the order of 0.08 Celsius or better. While the floating point values are non-repetitive, there is significant repetition at the byte level which the compression algorithms we have tried are unable to take advantage of due to their design. Based on some Google searches, this is a known problem with compression of scientific data.
David Taylor
@David -- Can you explain why you think single precision floats are not adequate? With a span of 100 degrees and a resolution of even 0.01 degrees, a single precision float should have more than enough precision/resolution.
Paul R
Um, yeah. A single precision float will get you six significant digits easy. For a range of, say, 0-1000K, that will get you resolution better than 0.001 kelvin, which is a whole lot better than your imager sensitivity. Are you trying to preserve the random noise or something?
David Thornley
Technically yes, but the end customer expects the "same" value as their original image. Data + Noise in -> gospel out ;) The loss in precision is readily apparent to the end user, though the meaningfulness of the "lost data" is insignificant. The perception of data loss outweighs the rational analysis. Such is life.
David Taylor
+3  A: 

First thing to consider: try compressing the data before you convert it to double precision. Re your comment to David Thornley, unless your IR imaging ADCs have 24 significant bits, 32-bit floats should have more than enough precision; it is only your requirement to exactly preserve the noise generated by subsequent processing that is a problem. Failing that, it might conceivably be practical to reverse-engineer your processing by determining a table of values it generates, and storing an index into this table instead.

Second: if your compression algorithm knows that your data is in 8-byte chunks, it will be much more effective; this is because it will not throw your most significant bytes in with the least significant bytes. As a crude preprocessing method, you could try prefixing each double with a distinctive byte (ASCII comma, perhaps?) before piping it through a byte-based compressor like gzip; this should result in better total compression even though the intermediate stream is 12% larger. Less crude but more effort would be to write your own compression adapted to this task -- perhaps using an 8-level tree to represent the expected values of each byte in your double.
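As a rough illustration of the crude prefix idea (untested; the output stream would then be piped through gzip or similar):

    #include <stdio.h>

    /* Prefix every 8-byte double with a distinctive marker byte so a
       byte-oriented compressor sees the values on a regular 9-byte grid. */
    void write_prefixed(const double *vals, size_t n, FILE *out)
    {
        for (size_t i = 0; i < n; i++) {
            fputc(',', out);
            fwrite(&vals[i], sizeof(double), 1, out);
        }
    }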

Third: as image data is highly redundant, some form of delta coding or other image-related compression should save some space. However, it will not gain you a terribly large amount if you demand lossless compression, as the image noise is inherently incompressible. Also, it will not help you deal with the pseudo-random hash in the less-significant bits of your doubles, as explained above.

comingstorm
Very perceptive. I am already "palettizing" the data and compressing the resulting array using techniques optimized for 16-bit grayscale (actually multi-planar for A/D > 16-bit). The resulting temperature lookup is the part I need to compress more efficiently. I have also tested a variety of approaches to address the "chunking" issue you note, with some success (e.g. compressing the nth byte of each chunk together). The 8-level tree idea sounds interesting, but will take some work. Delta coding with a good predictor function might work well.
David Taylor
There is a more efficient generalization of palettization called Vector Quantization. See here for example: http://www.gamasutra.com/view/feature/3090/image_compression_with_vector_.php This is very heavy on processing time for compression, but super lightweight on decompression. Perhaps that is what you are doing already, though.
3yE
Thanks for the pointer on Vector Quantization, a very interesting approach. Learn something new every day!
David Taylor
+1  A: 

All the encoders you list are byte oriented, and are thrown off by a couple of properties of doubles. For one, there is the layout: the 12 bits of sign and exponent don't really play well with byte boundaries. For another, there's the noisiness of your input. The first part is easy to deal with in a multitude of ways; the second will limit the effectiveness of any lossless compression that you throw at it. I think that even the best result will be less than amazing; I don't know your data, but I'd suspect you can count on a mere 25% saving, more or less.
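For reference, a small sketch of how those 64 bits break down (standard IEEE 754 layout, nothing specific to this data):

    #include <stdint.h>
    #include <string.h>

    /* Sign (1 bit) + exponent (11 bits) = 12 bits, i.e. one and a half bytes,
       which is why byte-oriented compressors see them smeared across boundaries. */
    void split_double(double v, int *sign, int *exponent, uint64_t *mantissa)
    {
        uint64_t bits;
        memcpy(&bits, &v, sizeof bits);       /* reinterpret the bit pattern */
        *sign     = (int)(bits >> 63);
        *exponent = (int)((bits >> 52) & 0x7FF);
        *mantissa = bits & ((UINT64_C(1) << 52) - 1);
    }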

Off the top of my head, and perhaps useless because you have thought of everything on this list already...

  1. Treat the stream as 64-bit integers and delta-encode adjacent values. If you have runs of values with the same exponent, it will effectively zero it out, as well as possibly some high mantissa bits. There will be overflows, but the data still needs only 64 bits and the operation can be reversed.

  2. At this stage you can optionally try some crude integer prediction, and save differences.

  3. If you have followed the previous suggestion, almost half of the values will start with 000... and almost half with FFF... To eliminate that, rotate the value to the left (ROL) by 1 bit and, if the LSB is now 1, XOR everything except that LSB with Fs. The reverse: if the LSB is 1, apply the same XOR, then ROR by 1 bit.

On second thought, simply XORing predictions with the true values can be better than taking differences, because then you don't have to do step 3 at all.

  4. You can try reordering the bytes to group bytes with the same significance together: first all the most significant bytes, and so on (see the sketch after this list). At the very least you should get something like a massive run of zeroes, with at most a few bits of noise, up front.

  5. Run the result through a generic compressor, or even through RLE on the runs of zeroes first, then an entropy encoder like Huffman or, better, the range encoder from 7-Zip/LZMA.
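Putting the XOR variant and the byte reordering together, a rough, untested sketch (names are made up; the output buffer would then go to gzip/LZMA):

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    /* XOR each 64-bit pattern with the previous one, then group bytes by
       significance (all byte 0s, then all byte 1s, ...) so that the mostly-zero
       high bytes end up as contiguous runs for a generic compressor.
       Fully reversible. */
    void preprocess(const double *vals, size_t n, uint8_t *out /* 8*n bytes */)
    {
        uint64_t prev = 0;
        for (size_t i = 0; i < n; i++) {
            uint64_t bits, x;
            memcpy(&bits, &vals[i], sizeof bits);  /* raw bit pattern of the double */
            x = bits ^ prev;                       /* "delta" as XOR of adjacent values */
            prev = bits;
            for (int b = 0; b < 8; b++)
                out[(size_t)b * n + i] = (uint8_t)(x >> (8 * b));
        }
    }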

There is one good thing about your data: it is monotonic. There is a bad thing about your data: it's simply too small a set. How much do you want to save, mere kilobytes? What for? The compression effectiveness will suffer a lot if adjacent values often differ in exponent.

If you are processing a large number of those data sets, you should consider using their similarity to compress them together better - perhaps interleave them at some stage. If you can live with some loss, zeroing out some of the least significant bytes might be a good idea - perhaps both on the source data and on the prediction, so that you don't reintroduce noise there.
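If some loss really is acceptable, zeroing the low bytes could look something like this (how many bytes, k, is an assumed tuning knob):

    #include <stdint.h>
    #include <string.h>

    /* Clear the k least significant bytes of the value; with 1 <= k <= 6 this
       only touches the 52-bit mantissa, leaving sign and exponent intact. */
    double zero_low_bytes(double v, int k)
    {
        uint64_t bits;
        memcpy(&bits, &v, sizeof bits);
        bits &= ~((UINT64_C(1) << (8 * k)) - 1);   /* mask off the low 8*k bits */
        memcpy(&v, &bits, sizeof v);
        return v;
    }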

3yE
Your observations are on the mark. I have experimented with all of the methods you have suggested. The approach that seems to work best is a custom predictor algorithm and encoding model that uses XOR, as per your suggestion. The reason compression is important is that many such images are being stored. The floating point compression is actually only a small part of a larger strategy which uses palettization and JPEG-LS 16-bit grayscale compression. Currently we are able to achieve around 65% compression, which is quite good IMO. Thx
David Taylor