I want to store a list of the following tuples in a compressed format and I was wondering which algorithm gives me

  • smallest compressed size
  • fastest de/compression
  • tradeoff optimum ("knee" of the tradeoff curve)

My data looks like this:

(<int>, <int>, <double>), 
(<int>, <int>, <double>), 
...
(<int>, <int>, <double>)

One of the two ints refers to a point in time, and it's very likely that the numbers ending up in one list are close to each other. The other int represents an abstract id, and its values are less likely to be close, although they aren't going to be completely random, either. The double represents a sensor reading, and while there is some correlation between the values, it's probably not of much use.

+2  A: 

Since the "time" ints can be close to each other, try to only store the first and after that save the difference to the int before (delta-coding). You can try the same for the second int, too.

Another thing you can try is to reorganize the data from [int1, int2, double], [int1, int2, double]... to [int1, int1...], [int2, int2...], [double, double...].
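
A rough sketch of both ideas in C# (a sketch only; the type and method names are made up for illustration, and it assumes the tuples are already in memory):

using System.Collections.Generic;

static class TuplePreprocessor
{
    // Split (time, id, value) tuples into three parallel arrays and replace
    // the time column with deltas to the previous entry; the first entry
    // keeps its absolute value because previousTime starts at 0.
    public static void Split(
        IList<(int Time, int Id, double Value)> tuples,
        out int[] timeDeltas, out int[] ids, out double[] values)
    {
        timeDeltas = new int[tuples.Count];
        ids = new int[tuples.Count];
        values = new double[tuples.Count];

        int previousTime = 0;
        for (int i = 0; i < tuples.Count; i++)
        {
            timeDeltas[i] = tuples[i].Time - previousTime;
            previousTime = tuples[i].Time;
            ids[i] = tuples[i].Id;
            values[i] = tuples[i].Value;
        }
    }
}

The three arrays can then be written out one after the other and fed to the compressor.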

To find out the compression range your result will be in, you can write your data into a file and download the compressor CCM from Christian Martelock here. I found out that it performs very well for such data collections. It uses a quite fast context-mixing algorithm. You can also compare it to other compressors like WinZip, or use a compression library like zlib to see if it is worth the effort.

schnaader
+2  A: 

If I'm reading the question correctly, you simply want to store the data efficiently. Options like compressed XML are simple, but there are more direct binary serialization methods. One that leaps to mind is Google's protocol buffers.

For example, in C# with protobuf-net, you can simply create a class to hold the data:

[ProtoContract]
public class Foo {
    [ProtoMember(1)]
    public int Value1 {get;set;}
    [ProtoMember(2)]
    public int Value2 {get;set;}
    [ProtoMember(3)]
    public double Value3 {get;set;}
}

Then just [de]serialize a List<Foo> or Foo[] etc., via the ProtoBuf.Serializer class.
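
For example, a minimal sketch of those calls with protobuf-net (the file path and the helper class are placeholders, not part of the library):

using System.Collections.Generic;
using System.IO;
using ProtoBuf;

static class FooStorage
{
    // Write the whole list in one go with protobuf-net.
    public static void Save(List<Foo> items, string path)
    {
        using (var stream = File.Create(path))
            Serializer.Serialize(stream, items);
    }

    // Read it back the same way.
    public static List<Foo> Load(string path)
    {
        using (var stream = File.OpenRead(path))
            return Serializer.Deserialize<List<Foo>>(stream);
    }
}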

I'm not claiming it will be quite as space-efficient as rolling your own, but it'll be pretty darned close. The protocol buffer spec makes fairly good use of space (for example, using base-128 for integers, such that small numbers take less space). But it would be simple to try it out, without having to write all the serialization code yourself.

This approach, as well as being simple to implement, also has the advantage of being simple to use from other architectures, since there are protocol buffers implementations for various languages. It also uses much less CPU than regular [de]compression (GZip/DEFLATE/etc), and/or xml-based serialization.

Marc Gravell
Thanks for pointing that out, I'm serializing stuff with pb anyway, so it's a natural choice in my context. Would you know if they compress repeated patterns with shorter sequences? I can RTF spec too, if not. ;-)
Hanno Fietz
No, it doesn't do that. However, if you had a specific need, a `bytes` member could be made to hold data compressed with GZip or something. This is outside the spec, so the client/server would need to agree this purely as a detail.
Marc Gravell
OK, so that means rearranging the data to get three sorted lists for each tuple member instead of one list of 3-tuples won't buy me anything?
Hanno Fietz
+1  A: 

Most compression algorithms will perform equally badly on such data. However, there are a few things ("preprocessing") you can do to increase the compressibility of the data before feeding it to a gzip- or deflate-like algorithm. Try the following:

First, if possible, sort the tuples in ascending order. Use the abstract ID first, then the timestamp. Assuming you have many readings from the same sensor, similar ids will be placed close together.

Next, if the measurements are taken at regular intervals, replace the timestamp with the difference from the previous timestamp (except for the very first tuple for a sensor, of course). For example, if all measurements are taken at 5-minute intervals, the delta between two timestamps will usually be close to 300 seconds. The timestamp field will therefore be much more compressible, as most values are equal.

Then, assuming that the measured values are stable in time, replace all readings with a delta from the previous reading for the same sensor. Again, most values will be close to zero, and thus more compressible.

Also, floating point values are very bad candidates for compression, due to their internal representation. Try to convert them to an integer. For example, temperature readings most likely do not require more than two decimal digits. Multiply values by 100 and round to the nearest integer.
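
A rough sketch of the last two steps in C# (the names are made up, and the factor of 100 is only an assumption about the precision the readings need):

using System;
using System.Collections.Generic;

static class ReadingPreprocessor
{
    // Convert a double reading to a fixed-point integer, assuming two
    // decimal digits are enough.
    public static int Quantize(double reading)
    {
        return (int)Math.Round(reading * 100.0);
    }

    // Replace each quantized reading with the delta to the previous reading
    // from the same sensor; the first reading per sensor stays absolute.
    public static int[] DeltaPerSensor(int[] sensorIds, int[] quantized)
    {
        var deltas = new int[quantized.Length];
        var lastBySensor = new Dictionary<int, int>();
        for (int i = 0; i < quantized.Length; i++)
        {
            int previous;
            deltas[i] = lastBySensor.TryGetValue(sensorIds[i], out previous)
                ? quantized[i] - previous
                : quantized[i];
            lastBySensor[sensorIds[i]] = quantized[i];
        }
        return deltas;
    }
}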

Stephan Leclercq
+1  A: 

Here is a common scheme used in most search engines: store deltas of values and encode the delta using a variable byte encoding scheme, i.e. if the delta is less than 128, it can be encoded with only 1 byte. See vint in Lucene and Protocol buffers for details.

This will not give you the best compression ratio, but it is usually the fastest in terms of encoding/decoding throughput.
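
A sketch of such a variable-byte ("varint") encoder/decoder for non-negative deltas, mirroring the base-128 layout used by Protocol Buffers (7 data bits per byte, high bit means "more bytes follow"):

using System.Collections.Generic;

static class VarInt
{
    // Values below 128 take a single byte; larger values spill into
    // additional bytes, 7 bits at a time, least significant group first.
    public static void Write(List<byte> output, uint value)
    {
        while (value >= 0x80)
        {
            output.Add((byte)(value | 0x80));
            value >>= 7;
        }
        output.Add((byte)value);
    }

    // Decode one value starting at 'offset' and advance offset past it.
    public static uint Read(IList<byte> input, ref int offset)
    {
        uint result = 0;
        int shift = 0;
        byte b;
        do
        {
            b = input[offset++];
            result |= (uint)(b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);
        return result;
    }
}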

ididak
+2  A: 

Sort as already proposed, then store

(first ints) (second ints) (doubles)

i.e. the transposed matrix, and then compress.

Gilles
A: 

Great answers, for the record, I'm going to merge those I upvoted into the approach I'm finally using:

  1. Sort and reorganize the data so that similar numbers are next to each other, i.e. sort by id first, then by timestamp, and rearrange from (<int1>, <int2>, <double>), ... to ([<int1>, <int1> ...], [<int2>, <int2> ...], [<double>, <double> ...]) (as suggested by schnaader and Stephan Leclercq).

  2. Use delta-encoding on the timestamps (and maybe on the other values) as suggested by schnaader and ididak

  3. Use protocol buffers to serialize (I'm going to use them anyway in the application, so that's not going to add dependencies or anything). Thanks to Marc Gravell for pointing me to it.

Hanno Fietz