I want to store billions (10^9) of double precision floating point numbers in memory and save space. These values are grouped in thousands of ordered sets (they are time series), and within a set, I know that the difference between values is usually not large (compared to their absolute value). Also, the closer to each other, the higher the probability of the difference being relatively small.
A perfect fit would be a delta encoding that stores only the difference of each value to its predecessor. However, I want random access to subsets of the data, so I can't depend on going through a complete set in sequence. I'm therefore using deltas to a set-wide baseline that yields deltas which I expect to be within 10 to 50 percent of the absolute value (most of the time).
I have considered the following approaches:
- divide the smaller value by the larger one, yielding a value between 0 and 1 that could be stored as an integer of some fixed precision plus one bit for remembering which number was divided by which. This is fairly straightforward and yields satisfactory compression, but is not a lossless method and thus only a secondary choice.
- XOR the IEEE 754 binary64 encoded representations of both values and store the length of the long stretches of zeroes at the beginning of the exponent and mantissa plus the remaining bits which were different. Here I'm quite unsure how to judge the compression, although I think it should be good in most cases.
Are there standard ways to do this? What might be problems about my approaches above? What other solutions have you seen or used yourself?