tags:

views:

100

answers:

3

I want to store billions (10^9) of double precision floating point numbers in memory and save space. These values are grouped in thousands of ordered sets (they are time series), and within a set, I know that the difference between values is usually not large (compared to their absolute value). Also, the closer to each other, the higher the probability of the difference being relatively small.

A perfect fit would be a delta encoding that stores only the difference of each value to its predecessor. However, I want random access to subsets of the data, so I can't depend on going through a complete set in sequence. I'm therefore using deltas to a set-wide baseline that yields deltas which I expect to be within 10 to 50 percent of the absolute value (most of the time).

I have considered the following approaches:

  • divide the smaller value by the larger one, yielding a value between 0 and 1 that could be stored as an integer of some fixed precision plus one bit for remembering which number was divided by which. This is fairly straightforward and yields satisfactory compression, but is not a lossless method and thus only a secondary choice.
  • XOR the IEEE 754 binary64 encoded representations of both values and store the length of the long stretches of zeroes at the beginning of the exponent and mantissa plus the remaining bits which were different. Here I'm quite unsure how to judge the compression, although I think it should be good in most cases.

Are there standard ways to do this? What might be problems about my approaches above? What other solutions have you seen or used yourself?

+2  A: 

If you know a group of doubles has the same exponent, you could store the exponent once, and only store the mantissa for each value.

recursive
+4  A: 

Rarely are all the bits of a double-precision number meaningful.

If you have billions of values that are the result of some measurement, find the calibration and error of your measurement device. Quantize the values so that you only work with meaningful bits.

Often, you'll find that you only need 16 bits of actual dynamic range. You can probably compress all of this into arrays of "short" that retain all of the original input.

Use a simple "Z-score technique" where every value is really a signed fraction of the standard deviation.

So a sequence of samples with a mean of m and a standard deviation of s gets transformed into a bunch of Z score. Normal Z-score transformations use a double, but you should use a fixed-point version of that double. s/1000 or s/16384 or something that retains only the actual precision of your data, not the noise bits on the end.

for u in samples:
    z = int( 16384*(u-m)/s )

for z in scaled_samples:
    u = s*(z/16384.0)+m

Your Z-scores retain a pleasant easy-to-work with statistical relationship with the original samples.


Let's say you use a signed 16-bit Z-score. You have +/- 32,768. Scale this by 16,384 and your Z-scores have an effective resolution of 0.000061 decimal.

If you use a signed 24-but Z-score, you have +/- 8 million. Scale this by 4,194,304 and you have a resolution of 0.00000024.

I seriously doubt you have measuring devices this accurate. Further, any arithmetic done as part of filter, calibration or noise reduction may reduce the effective range because of noise bits introduced during the arithmetic. A badly thought-out division operator could make a great many of your decimal places nothing more than noise.

S.Lott
Thanks for pointing me to Z-scores as a generalization of what I was trying to do with my first approach. Unfortunately, I can not make any assumptions on what is noise in the incoming data, because the actual, physical, sources vary widely in precision and also the demands on the data's precision depend on the application that's using my my software.
Hanno Fietz
+3  A: 

Whatever compression scheme you pick, you can decouple that from the problem of needing to be able to perform arbitrary seeks by compressing into fixed-size blocks and prepending to each block a header containing all the data required to decompress it (e.g. for a delta encoding scheme, the block would contain deltas enconded in some fashion that takes advantage of their small magnitude to make them take less space, e.g. fewer bits for exponent/mantissa, conversion to fixed-point value, Huffman encoding etc; and the header a single uncompressed sample); seeking then becomes a matter of cheaply selecting the appropriate block, then decompressing it.

If the compression ratio is so variable that much space is being wasted padding the compressed data to produce fixed size blocks, a directory of offsets into the compressed data could be built instead and the state required to decompress recorded in that.

moonshadow
That's a good point, thanks!
Hanno Fietz