
Hi, I need to compress location data (latitude, longitude, date, time). All the numbers are in a fixed format. Two of them (latitude, longitude) are decimals; the other two are integers.

Right now these numbers are stored as fixed-format strings.

What algorithms are there for compressing numbers in a fixed format? Is number-only compression (if such a thing exists) better than string compression? Should I compress the strings directly, or convert them to numbers first and then compress?

Thanks in advance.

+5  A: 

Compression typically works on a byte stream. When a stream has a non-uniform distribution of byte values (for instance text, or numbers stored as text), the compression ratio you can achieve will be higher, since fewer bits are used to store the bytes which appear more frequently (in Huffman compression).

Typically, the data you are talking about will simply be stored as binary numbers (not text), and that's usually space and retrieval efficient.
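For illustration, here is a minimal sketch (in Python, using the standard struct module) of storing such a record as packed binary instead of text. The field layout is an assumption, not something given in the question:

    import struct

    # Hypothetical record layout: latitude and longitude as 4-byte floats,
    # date as YYYYMMDD and time as HHMMSS, each a 4-byte unsigned integer.
    RECORD = struct.Struct("<ffII")   # 16 bytes per record vs. ~30+ as text

    def pack_fix(lat, lon, date, time):
        return RECORD.pack(lat, lon, date, time)

    def unpack_fix(buf):
        return RECORD.unpack(buf)

    packed = pack_fix(23.7104, 90.4074, 20100217, 183045)
    print(len(packed))            # 16
    print(unpack_fix(packed))     # (~23.7104, ~90.4074, 20100217, 183045)

You can still run a general-purpose compressor over the packed records afterwards; the binary form is simply a smaller, more uniform starting point.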

I recommend you have a look at The Data Compression Book

Cade Roux
+2  A: 

What kind of data are you compressing? How is it distributed? Is it ordered in any way? All of these things can affect how well it compresses, and perhaps allow you to convert the data into something more easily compressed, or simply smaller right out of the gate.

Data compression works poorly on "random" data. If your data falls within a smaller range, you may well be able to leverage that.

In truth, you should simply try running any of the common algorithms and see if the data is "compressed enough". If not, and you know more about the data than can be "intuited" by the compression algorithms, you should leverage that information.

For example, say your data is not just arbitrary latitudes and longitudes, but points that are assumed to be "close" to each other. Then you could store an "origin" latitude and longitude and encode the rest as differences from it. Perhaps those differences are small enough to fit into a single signed byte.

That's just a simple example of what you can do with knowledge of the data that a generic algorithm may not be able to figure out on its own.
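As a rough sketch of that delta idea, assuming coordinates are scaled to integers at a hypothetical precision of 1e-4 degree and that consecutive points never differ by more than 127 units (neither assumption comes from the answer):

    import struct

    SCALE = 10_000  # hypothetical precision: 1e-4 degree (~11 m)

    def encode_track(points):
        """points: list of (lat, lon) floats -> (origin_ints, delta_bytes)."""
        scaled = [(round(lat * SCALE), round(lon * SCALE)) for lat, lon in points]
        origin = scaled[0]
        out = bytearray()
        prev = origin
        for cur in scaled[1:]:
            dlat, dlon = cur[0] - prev[0], cur[1] - prev[1]
            # struct raises if a delta does not fit in one signed byte
            out += struct.pack("<bb", dlat, dlon)
            prev = cur
        return origin, bytes(out)

    def decode_track(origin, deltas):
        lat, lon = origin
        points = [(lat / SCALE, lon / SCALE)]
        for dlat, dlon in struct.iter_unpack("<bb", deltas):
            lat, lon = lat + dlat, lon + dlon
            points.append((lat / SCALE, lon / SCALE))
        return points

Each additional point then costs 2 bytes instead of a dozen or more characters of text, and a general-purpose compressor run over the delta stream will usually shrink it further.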

Will Hartung
+4  A: 
Charlie Martin
Thanks for your excellent suggestion. I was expecting solutions like that. Here, the data format is fixed and there are thousands of sequential records, so I guess the delta solution is more efficient. The problem is that it has no indexing, so the data always has to be decompressed before it can be read. Can you suggest a better solution with indexing? Thanks a lot.
fireball003
We start getting down to what the data is really like, and whether losses are acceptable. A couple of thoughts, though: store a full value every k steps, so you never have to run the sum more than k/2 steps; or store the first derivative and a varying step width (see "adaptive differential encoding"). The second choice loses some information, although you can set a lower bound on how much.
Charlie Martin
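To make the "full value every k steps" idea concrete, here is a rough sketch; the block size and record layout are purely illustrative, and it only walks forward from the nearest preceding keyframe rather than in both directions:

    import struct

    K = 64  # hypothetical block size: one absolute "keyframe" every K records

    def encode(values):
        """values: list of ints (e.g. latitude in 1e-4 degrees).
        Every K-th value is stored in full (4 bytes), the rest as 2-byte deltas."""
        out = bytearray()
        for i, v in enumerate(values):
            if i % K == 0:
                out += struct.pack("<i", v)                   # keyframe: absolute value
            else:
                out += struct.pack("<h", v - values[i - 1])   # delta from previous
        return bytes(out)

    def read_at(buf, index):
        """Random access: decode only the block containing `index`."""
        block, offset = divmod(index, K)
        block_size = 4 + (K - 1) * 2                          # bytes per block
        pos = block * block_size
        (value,) = struct.unpack_from("<i", buf, pos)
        pos += 4
        for _ in range(offset):
            (delta,) = struct.unpack_from("<h", buf, pos)
            value += delta
            pos += 2
        return value

Looking up record i then touches one keyframe and at most K-1 small deltas, rather than decompressing the whole stream.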
+1  A: 

It depends on what you are going to do with the data, and how much precision you need.

Lat/long is traditionally given in degrees, minutes, and seconds, with 60 seconds to the minute, 60 minutes to the degree, and 1 degree of latitude nominally equal to 60 nautical miles (nmi). 1 minute of latitude is then 1 nmi, and 1 second is just over 100 ft.

Latitude goes from -90 to +90 degrees. Representing latitude as integer seconds gives you a range of -324000..+324000, or about 20 bits. Longitude goes -180 to +180, so representing longitude the same way requires 1 more bit.

So you can represent a complete lat/long position, to +/- 50 ft, in 41 bits.

Obviously, if you don't need that much precision, you can back down your bit count.

Observe that a traditional single-precision 32-bit float uses about 24 bits of mantissa, so you are down to about +/- 6 feet if you just convert your lat/long in seconds to float. It is kind of hard to beat two single-precision floats for this kind of thing.
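As an illustration of that arithmetic (the packing scheme below is a sketch, not part of the answer), converting degrees to integer seconds and packing latitude and longitude into a single 41-bit value could look like:

    def to_seconds(degrees):
        """Convert decimal degrees to a whole number of arc-seconds."""
        return round(degrees * 3600)

    def pack_position(lat_deg, lon_deg):
        """Pack latitude (20 bits) and longitude (21 bits) into one integer."""
        lat = to_seconds(lat_deg) + 324_000    # shift -324000..+324000 to 0..648000
        lon = to_seconds(lon_deg) + 648_000    # shift -648000..+648000 to 0..1296000
        return (lat << 21) | lon               # 41 bits total

    def unpack_position(packed):
        lon = (packed & ((1 << 21) - 1)) - 648_000
        lat = (packed >> 21) - 324_000
        return lat / 3600.0, lon / 3600.0

The round trip loses at most half an arc-second, i.e. roughly the +/- 50 ft figure quoted above.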