I need to compress some spatially correlated data records. Currently I am getting 1.2x-1.5x compression with zlib, but I figure it should be possible to get more like 2x. The data records have various fields, but for example, zlib seems to have trouble compressing lists of points.

The points represent a road network. They are pairs of fixed-point 4-byte integers of the form XXXXYYYY. Typically, if a single data block has 100 points, there will only be a few combinations of the top two bytes of X and Y (spatial correlation). But the bottom bytes are always changing and must look like random data to zlib.

Similarly, the records have 4-byte IDs which tend to have constant high bytes and variable low bytes.

Is there another algorithm that would be able to compress this kind of data better? I'm using C++.

Edit: Please no more suggestions to change the data itself. My question is about automatic compression algorithms. If somebody has a link to an overview of all popular compression algorithms I'll just accept that as answer.

A: 

Sort the points by some kind of proximity measure such that the average distance between adjacent points is small. Then store the difference between adjacent points.

You might do even better if you manage to sort the points so that most differences are positive in both the x and y axes, but I can't say for sure.

As an alternative to zlib, universal codes are a family of compression techniques that works well when the probability distribution is skewed towards small numbers. They would have to be tweaked for signed numbers (encode abs(x)<<1 + (x < 0 ? 1 : 0)).
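
A rough sketch of the delta-plus-sign-folding idea in C++ (the Point layout and function names are just illustrative, and it assumes the deltas stay small enough not to overflow):

    #include <cstdint>
    #include <vector>

    struct Point { int32_t x, y; };  // fixed-point coordinates as in the question

    // The signed-to-unsigned mapping from above: abs(x)<<1 + (x < 0 ? 1 : 0),
    // so small magnitudes of either sign become small unsigned numbers.
    static uint32_t fold_sign(int32_t v) {
        uint32_t mag = v < 0 ? static_cast<uint32_t>(-static_cast<int64_t>(v))
                             : static_cast<uint32_t>(v);
        return (mag << 1) + (v < 0 ? 1u : 0u);
    }

    // Replace each point (after sorting by proximity) with the folded
    // difference from its predecessor.
    std::vector<uint32_t> delta_encode(const std::vector<Point>& pts) {
        std::vector<uint32_t> out;
        Point prev{0, 0};
        for (const Point& p : pts) {
            out.push_back(fold_sign(p.x - prev.x));
            out.push_back(fold_sign(p.y - prev.y));
            prev = p;
        }
        return out;
    }

The folded values could then be fed to a universal code (Elias gamma or Golomb/Rice, for instance), or even straight back through zlib, which will now see a stream dominated by small numbers.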

Marcelo Cantos
The points are ordered. If I changed their order then I would have to add an index to reconstruct the original order, probably increasing the size overall. The records (representing roads) that contain the points are unordered, but rearranging them is a PITA. I suspect I could get better compression by separating the high and low bytes, but as my data storage system is highly generic (used for all records in the system regardless of their nature) I thought it would be best to avoid polluting it with special-purpose code.
Qwertie
Plus, I don't think arranging them spatially would help zlib that much. Just from what I know about the deflate algorithm, I don't think it can recognize numeric correlation - if you give it a byte sequence 1, 2, 3, ... 100, it compresses as badly as random bytes.
Qwertie
@Qwertie: Consider two sequences with absolute values: 100,101,102 and 200,201,202. Now look at them as relative values: 0,1,2 and 0,1,2. Going relative allows any dictionary-based compressor to combine these two.
Steven Sudit
Good point, though in the case of geographic points, the differences between pairs/series of points are almost never exactly the same for different pairs/series - but true, they would be more similar than the original points.
Qwertie
They wouldn't have to be exactly the same. Because the probability distribution of deltas is skewed heavily towards small numbers, the huffman tree would favour these by requiring fewer bits to encode them. See my amended answer for an alternative, though.
Marcelo Cantos
+4  A: 

You'll likely get much better results if you try to compress the data yourself based on your knowledge of its structure.

General-purpose compression algorithms just treat your data as a bitstream. They look for commonly used sequences of bits and replace them with shorter dictionary indices.

But the duplicate data doesn't go away. The duplicated sequence gets shorter, but it's still duplicated just as often as it was before.

As I understand it, you have a large number of data points of the form

XXxxYYyy, where the upper-case letters are very uniform. So factor them out.

Rewrite the list as something similar to this:

XXYY // a header describing the common first and third byte for all the subsequent entries
xxyy // the remaining bytes, which vary
xxyy
xxyy
xxyy
...
XXYY // next unique combination of 1st and 3rd byte
xxyy
xxyy
...

Now, each combination of the rarely varying bytes is listed only once, rather than duplicated for every entry it occurs in. That adds up to a significant space saving.

Basically, try to remove duplicate data yourself, before running it through zlib. You can do a better job of it because you have additional knowledge about the data.
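
A rough sketch of that factoring in C++, assuming the 4-byte X / 4-byte Y layout from the question and ignoring, for the moment, the ordering concern raised in the comments (type and function names are invented):

    #include <cstdint>
    #include <map>
    #include <utility>
    #include <vector>

    struct Point { uint32_t x, y; };

    // Key = the rarely varying high 16 bits of X and Y ("XXYY");
    // value = the low 16 bits of every point sharing that key ("xxyy").
    using Grouped = std::map<std::pair<uint16_t, uint16_t>, std::vector<uint32_t>>;

    Grouped factor_high_bytes(const std::vector<Point>& pts) {
        Grouped groups;
        for (const Point& p : pts) {
            std::pair<uint16_t, uint16_t> key{static_cast<uint16_t>(p.x >> 16),
                                              static_cast<uint16_t>(p.y >> 16)};
            uint32_t low = ((p.x & 0xFFFFu) << 16) | (p.y & 0xFFFFu);  // "xxyy"
            groups[key].push_back(low);
        }
        return groups;
    }

Serialization would then emit, per group: the XXYY header once, a count, and the packed xxyy values, i.e. the layout sketched above.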

Another approach might be, instead of storing these coordinates as absolute numbers, to write them as deltas relative to some location chosen to be as close as possible to all the entries. The deltas will be smaller numbers, which can be stored using fewer bits.
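
A minimal sketch of that variant (picking the component-wise minimum as the reference is just one plausible choice):

    #include <algorithm>
    #include <cstdint>
    #include <vector>

    struct Point { int32_t x, y; };

    // Store every point as an offset from a reference chosen close to the
    // data (here the component-wise minimum). For spatially clustered points
    // the offsets are small and often fit in 16 bits instead of 32.
    std::vector<int32_t> encode_relative(const std::vector<Point>& pts, Point& ref_out) {
        if (pts.empty()) { ref_out = {0, 0}; return {}; }
        Point ref = pts.front();
        for (const Point& p : pts) {
            ref.x = std::min(ref.x, p.x);
            ref.y = std::min(ref.y, p.y);
        }
        ref_out = ref;  // written once, e.g. in a block header
        std::vector<int32_t> offsets;
        for (const Point& p : pts) {
            offsets.push_back(p.x - ref.x);
            offsets.push_back(p.y - ref.y);
        }
        return offsets;
    }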

jalf
It's a good idea, I was just hoping an algorithm existed to handle this kind of data automatically. My system design isn't well suited to this kind of rearrangement.
Qwertie
If your system design doesn't support that kind of rearrangement, but does support running these things through zlib, then just write your own filter, specific to your data format, to use in place of zlib.
Novelocrat
+1 for interesting thoughts.
Qwertie
A: 

Without seeing the data and its exact distribution, I can't say for certain what the best method is, but I would suggest that you start each group of 1-4 records with a byte whose 8 bits indicate the following:

0-1  Number of bytes of ID that should be borrowed from previous record
2-4  Format of position record
6-7  Number of succeeding records that use the same 'mode' byte

Each position record may be stored in one of eight ways; all formats other than 000 use signed displacements. The number after the bit code is the size of the position record in bytes.

000 - 8 - Two full four-byte positions
001 - 3 - Twelve bits for X and Y
010 - 2 - Ten-bit X and six-bit Y
011 - 2 - Six-bit X and ten-bit Y
100 - 4 - Two sixteen-bit signed displacements
101 - 3 - Sixteen-bit X and 8-bit Y signed displacement
110 - 3 - Eight-bit signed displacement for X; 16-bit for Y
111 - 2 - Two eight-bit signed displacements

A mode byte of zero will store all the information applicable to a point without reference to any previous point, using a total of 13 bytes to store 12 bytes of useful information. Other mode bytes will allow records to be compacted based upon similarity to previous records. If four consecutive records differ only in the last byte of the ID, and either have both X and Y within +/- 127 of the previous record, or have X within +/- 31 and Y within +/- 511, or X within +/- 511 and Y within +/- 31, then all four records may be stored in 13 bytes (an average of 3.25 bytes each, a 73% reduction in space).

A "greedy" algorithm may be used for compression: examine a record to see what size ID and XY it will have to use in the output, and then grab up to three more records until one is found that either can't "fit" with the previous records using the chosen sizes, or could be written smaller (note that if e.g. the first record has X and Y displacements both equal to 12, the XY would be written with two bytes, but until one reads following records one wouldn't know which of the three two-byte formats to use).
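
For illustration, the position-format selection step of that greedy pass might look something like this in C++ (the codes follow the table above; the helper itself is just a sketch):

    #include <cstdint>

    // Pick the smallest position format (3-bit code from the table above)
    // whose signed ranges can hold the displacements dx, dy from the
    // previous point. Falls back to 000 (two full four-byte positions) when
    // nothing else fits. A real greedy encoder would also track which
    // formats remain viable across the whole 1-4 record group, since the
    // group shares a single mode byte.
    uint8_t pick_position_format(int32_t dx, int32_t dy) {
        auto fits = [](int32_t v, int bits) {
            int32_t lim = 1 << (bits - 1);       // signed range is [-lim, lim - 1]
            return v >= -lim && v < lim;
        };
        if (fits(dx, 8)  && fits(dy, 8))  return 0b111;  // 2 bytes
        if (fits(dx, 10) && fits(dy, 6))  return 0b010;  // 2 bytes
        if (fits(dx, 6)  && fits(dy, 10)) return 0b011;  // 2 bytes
        if (fits(dx, 12) && fits(dy, 12)) return 0b001;  // 3 bytes
        if (fits(dx, 16) && fits(dy, 8))  return 0b101;  // 3 bytes
        if (fits(dx, 8)  && fits(dy, 16)) return 0b110;  // 3 bytes
        if (fits(dx, 16) && fits(dy, 16)) return 0b100;  // 4 bytes
        return 0b000;                                    // 8 bytes, absolute
    }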

Before setting your format in stone, I'd suggest running your data through it. It may be that a small adjustment (e.g. using 7+9 or 5+11 bit formats instead of 6+10) would let much of the data pack better. The only real way to know, though, is to see what happens with your real data.

supercat
+2  A: 

Not specific to your data, but I would recommend checking out 7zip instead of zlib if you can. I've seen ridiculously good compression ratios using this.

http://www.7-zip.org/

bshields
+1 for the idea (assuming 7-zip can be used in source form), though I neglected to mention it's running on WinCE and must be fast; 7-zip is probably too big and slow.
Qwertie
A: 

It looks like the Burrows–Wheeler transform might be useful for this problem. It has a peculiar tendency to put runs of repeating bytes together, which might make zlib compress better. This article suggests combining BWT with algorithms other than zlib, though.

Intuitively it sounds expensive, but a look at some source code shows that reverse BWT is O(N) with 3 passes over the data and a moderate space overhead, likely making it fast enough on my target platform (WinCE). The forward transform is roughly O(N log N) or slightly over, assuming an ordinary sort algorithm.
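
For reference, a naive forward transform in C++ looks like the following; it sorts whole rotations, so it's only a sketch of the idea rather than the suffix-sorting approach a real implementation such as bzip2 uses:

    #include <algorithm>
    #include <cstddef>
    #include <cstdint>
    #include <numeric>
    #include <utility>
    #include <vector>

    // Naive forward BWT: sort all rotations of the block and emit the last
    // column plus the index of the row holding the original string (needed
    // for the inverse). Comparing whole rotations makes this O(n^2 log n);
    // real implementations use suffix sorting, but the output is the same.
    std::pair<std::vector<uint8_t>, std::size_t>
    bwt_forward(const std::vector<uint8_t>& in) {
        const std::size_t n = in.size();
        std::vector<std::size_t> rot(n);
        std::iota(rot.begin(), rot.end(), 0);
        std::sort(rot.begin(), rot.end(), [&](std::size_t a, std::size_t b) {
            for (std::size_t i = 0; i < n; ++i) {   // compare rotations byte by byte
                uint8_t ca = in[(a + i) % n], cb = in[(b + i) % n];
                if (ca != cb) return ca < cb;
            }
            return false;                           // identical rotations
        });
        std::vector<uint8_t> last(n);
        std::size_t primary = 0;
        for (std::size_t i = 0; i < n; ++i) {
            if (rot[i] == 0) primary = i;
            last[i] = in[(rot[i] + n - 1) % n];     // last column of the sorted matrix
        }
        return {last, primary};
    }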

Qwertie
That would be my suggestion. Specifically, I'd suggest you look at BZip2, which is the only popular implementation of BWT that I'm aware of. Its set of compressors might not be perfect for your dataset, but it has the advantage of being pretty well debugged and specified.
Jason
A: 

You might want to write two lists to the compressed file: a NodeList and a LinkList. Each node would have an ID, x, y. Each link would have a FromNode and a ToNode, along with a list of intermediate xy values. You might be able to have a header record with a false origin and have node xy values relative to that.

This would provide the most benefit if your streets follow an urban grid network, by eliminating duplicate coordinates at intersections.
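
A rough C++ sketch of the node/link split being described (all field and type names are invented):

    #include <cstdint>
    #include <utility>
    #include <vector>

    // Nodes (mostly intersections) are stored once and referenced by index,
    // so shared endpoint coordinates are not repeated for every road.
    struct Node {
        uint32_t id;
        int32_t  x, y;       // could be stored relative to the false origin below
    };

    // A link references its endpoints by node index and keeps only the
    // intermediate shape points, which are the candidates for delta coding.
    struct Link {
        uint32_t from_node;  // index into the node list
        uint32_t to_node;
        std::vector<std::pair<int32_t, int32_t>> shape;  // intermediate xy values
    };

    struct CompressedBlock {
        int32_t false_origin_x = 0, false_origin_y = 0;  // header record
        std::vector<Node> nodes;
        std::vector<Link> links;
    };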

If the compression is not required to be lossless, you could use truncated deltas for intermediate coordinates. While someone above mentioned deltas, keep in mind that a loss of connectivity would likely cause more problems than a loss of shape, and that is what would happen if you used truncated deltas to represent the last coordinate of a road (which is often an intersection).

Again, if your roads aren't on an urban grid, this probably wouldn't buy you much.

Kirk Kuykendall
I just noticed this in a comment above: "and other information such as the address range, ID, speed limit, and a pointer to a street name record stored elsewhere". If you have a lot of attributes, you might see if there's a linear referencing system (LRS) available (e.g. roadID, milepost). That would allow one attribute record to span multiple road segments instead of being repeated for each segment. http://en.wikipedia.org/wiki/Linear_Reference_System
Kirk Kuykendall
Thanks but I'm just wondering about compression algorithms. What you're suggesting is a complete reorganization of the data which I may consider eventually (but I already have my own ideas about how to do that). Btw, come to think of it, the duplicate points may not be a big problem provided that the duplicates are part of the same data "chunk", as zlib should notice the duplication and compress it out.
Qwertie
@Kirk, come to think of it, a LRS might well come in handy... eventually. Thanks for the tip! I'd need to completely rewrite a lot of stuff but we might just have the budget for it ;)
Qwertie