ansaurus

Question

How do I compress a large number of similar doubles?

Answer 1

+2 A:

If you know a group of doubles has the same exponent, you could store the exponent once, and only store the mantissa for each value.
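A minimal Python sketch of this idea, assuming IEEE 754 doubles (1 sign bit, 11 exponent bits, 52 mantissa bits). The helper names `split_double`, `join_double`, `pack_group` and `unpack_group` are hypothetical, made up for illustration. Note that the sign bit still has to be stored per value alongside the mantissa.

```python
import struct

MANTISSA_MASK = (1 << 52) - 1

def split_double(x):
    """Return (sign, biased_exponent, mantissa) of an IEEE 754 double."""
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    return bits >> 63, (bits >> 52) & 0x7FF, bits & MANTISSA_MASK

def join_double(sign, exponent, mantissa):
    """Rebuild a double from its three bit fields."""
    bits = (sign << 63) | (exponent << 52) | mantissa
    return struct.unpack(">d", struct.pack(">Q", bits))[0]

def pack_group(values):
    """Store the shared exponent once; keep sign + mantissa (53 bits) per value."""
    parts = [split_double(v) for v in values]
    exponents = {e for _, e, _ in parts}
    if len(exponents) != 1:
        raise ValueError("values do not share an exponent")
    return exponents.pop(), [(s << 52) | m for s, _, m in parts]

def unpack_group(exponent, payload):
    """Reverse pack_group: reattach the shared exponent to each value."""
    return [join_double(p >> 52, exponent, p & MANTISSA_MASK) for p in payload]

values = [1.0, 1.25, 1.5, 1.999]   # all in [1, 2), so they share one exponent
exponent, payload = pack_group(values)
restored = unpack_group(exponent, payload)
```

This is lossless, but it only saves 11 of the 64 bits per value, and the 53-bit payloads would still need to be bit-packed to realize that saving on disk.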

recursive 2009-09-09 15:49:00

Answer 2

+4 A:

Rarely are all the bits of a double-precision number meaningful.

If you have billions of values that are the result of some measurement, find the calibration and error of your measurement device. Quantize the values so that you only work with meaningful bits.

Often, you'll find that you only need 16 bits of actual dynamic range. You can probably compress all of this into arrays of "short" that retain all of the meaningful information in the original input.

Use a simple "Z-score technique" where every value is really a signed fraction of the standard deviation.

So a sequence of samples with a mean of m and a standard deviation of s gets transformed into a sequence of Z-scores. Normal Z-score transformations use a double, but you should use a fixed-point version of that double: steps of s/1000 or s/16384 or whatever retains only the actual precision of your data, not the noise bits on the end.

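# Encode: each sample becomes a fixed-point Z-score in units of s/16384.
scaled_samples = [int(16384*(u-m)/s) for u in samples]

# Decode: recover the approximate samples from the scaled Z-scores.
restored = [s*(z/16384.0)+m for z in scaled_samples]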

Your Z-scores retain a pleasant, easy-to-work-with statistical relationship with the original samples.

Let's say you use a signed 16-bit Z-score. You have a range of about +/- 32,768. With a scale factor of 16,384, your Z-scores have an effective resolution of 1/16,384, or about 0.000061 standard deviations, over a range of about +/- 2 standard deviations.

If you use a signed 24-bit Z-score, you have +/- 8 million. Scale this by 4,194,304 and you have a resolution of 0.00000024.
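The 16-bit case can be sketched end to end as follows. This is an illustration under stated assumptions, not the answerer's exact method: it uses `round` rather than `int` (so the worst-case error is half a step), it uses the population standard deviation, and it clamps values that fall outside the 16-bit range, which with a scale of 16,384 means anything beyond roughly 2 standard deviations from the mean. The sample data and the names `encode` and `decode` are made up.

```python
from array import array
from statistics import mean, pstdev

SCALE = 16384            # fixed-point steps per standard deviation
LO, HI = -32768, 32767   # range of a signed 16-bit integer

def encode(samples):
    """Quantize samples to 16-bit fixed-point Z-scores."""
    m = mean(samples)
    s = pstdev(samples)  # assumes the samples are not all identical (s > 0)
    codes = array("h")   # "h" = signed short
    for u in samples:
        z = round(SCALE * (u - m) / s)
        codes.append(max(LO, min(HI, z)))  # clamp outliers to the 16-bit range
    return m, s, codes

def decode(m, s, codes):
    """Recover approximate samples from the fixed-point Z-scores."""
    return [s * (z / SCALE) + m for z in codes]

samples = [20.1, 20.4, 19.8, 20.0, 20.3, 19.9]  # made-up measurements
m, s, codes = encode(samples)
restored = decode(m, s, codes)
worst = max(abs(a - b) for a, b in zip(samples, restored))
```

For values that are not clamped, `worst` stays within `s / (2 * SCALE)`. If a noticeable share of the data lies further than 2 standard deviations out, a smaller scale factor or a wider integer type would be needed.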

I seriously doubt you have measuring devices this accurate. Further, any arithmetic done as part of filtering, calibration, or noise reduction may reduce the effective range because of noise bits introduced during the arithmetic. A badly thought-out division operation could turn a great many of your decimal places into nothing more than noise.

S.Lott 2009-09-09 15:52:23

Thanks for pointing me to Z-scores as a generalization of what I was trying to do with my first approach. Unfortunately, I cannot make any assumptions about what is noise in the incoming data, because the actual physical sources vary widely in precision, and the demands on the data's precision depend on the application that's using my software.

Hanno Fietz 2009-09-09 16:11:50

Answer 3

+3 A:

Whatever compression scheme you pick, you can decouple it from the need to perform arbitrary seeks: compress into fixed-size blocks and prepend to each block a header containing all the data required to decompress it. For a delta encoding scheme, for example, the header would hold a single uncompressed sample, and the block would contain deltas encoded in some fashion that takes advantage of their small magnitude to make them take less space (fewer bits for exponent/mantissa, conversion to a fixed-point value, Huffman encoding, etc.). Seeking then becomes a matter of cheaply selecting the appropriate block, then decompressing it.
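One way to sketch the fixed-size block layout in Python, assuming the deltas are converted to fixed-point with a known step and fit in 16 bits. `STEP`, `PER_BLOCK`, `encode` and `read_sample` are hypothetical names and values chosen for illustration, and the scheme is lossy to within half a step.

```python
import struct

STEP = 0.001        # assumed quantization step (hypothetical)
PER_BLOCK = 4       # samples per block, kept tiny for illustration
BLOCK_FMT = "<d%dh" % (PER_BLOCK - 1)    # header sample + 16-bit deltas
BLOCK_BYTES = struct.calcsize(BLOCK_FMT)

def encode(samples):
    """Pack samples into fixed-size blocks: one raw double, then deltas."""
    out = bytearray()
    for start in range(0, len(samples), PER_BLOCK):
        chunk = samples[start:start + PER_BLOCK]
        base = chunk[0]
        # Quantize relative to the header sample so errors do not accumulate.
        steps = [round((x - base) / STEP) for x in chunk]
        deltas = [b - a for a, b in zip(steps, steps[1:])]
        deltas += [0] * (PER_BLOCK - 1 - len(deltas))   # pad the last block
        # struct.pack raises struct.error if a delta does not fit in 16 bits.
        out += struct.pack(BLOCK_FMT, base, *deltas)
    return bytes(out)

def read_sample(data, index):
    """Seek straight to the block holding `index` and decode only that block."""
    block, within = divmod(index, PER_BLOCK)
    fields = struct.unpack_from(BLOCK_FMT, data, block * BLOCK_BYTES)
    base, deltas = fields[0], fields[1:]
    return base + STEP * sum(deltas[:within])

samples = [10.000, 10.002, 10.001, 10.005, 10.004, 10.007]
data = encode(samples)
```

Because every block has the same byte length, the block for sample `i` starts at `(i // PER_BLOCK) * BLOCK_BYTES`. The total sample count would have to be stored separately so that the padding in the last block is not read back as data.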

If the compression ratio is so variable that much space is wasted padding the compressed data to produce fixed-size blocks, a directory of offsets into the compressed data could be built instead, with the state required to decompress each block recorded in it.
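A rough sketch of the directory variant, using `zlib` purely as a stand-in for whatever compression scheme is chosen. Here each block is an independent zlib stream, so the directory needs only byte offsets; a scheme with shared state would record that state in the directory as well. `PER_BLOCK`, `encode` and `read_sample` are hypothetical names, and the sample data is made up.

```python
import struct
import zlib

PER_BLOCK = 256   # samples per block (hypothetical)

def encode(samples):
    """Compress variable-size blocks and build a directory of byte offsets."""
    blobs, offsets, pos = [], [], 0
    for start in range(0, len(samples), PER_BLOCK):
        chunk = samples[start:start + PER_BLOCK]
        blob = zlib.compress(struct.pack("<%dd" % len(chunk), *chunk))
        offsets.append(pos)      # where this block starts in the output
        blobs.append(blob)
        pos += len(blob)
    offsets.append(pos)          # sentinel: end of the last block
    return offsets, b"".join(blobs)

def read_sample(offsets, data, index):
    """Look up the block in the directory, then decompress only that block."""
    block, within = divmod(index, PER_BLOCK)
    raw = zlib.decompress(data[offsets[block]:offsets[block + 1]])
    return struct.unpack_from("<d", raw, within * 8)[0]

samples = [100.0 + 0.25 * (i % 7) for i in range(1000)]  # made-up data
offsets, data = encode(samples)
value = read_sample(offsets, data, 777)
```

The directory costs one offset per block, which is usually small next to the padding it avoids.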

moonshadow 2009-09-09 15:58:43

That's a good point, thanks!

Hanno Fietz 2009-09-09 16:13:44
