views:

188

answers:

2

This came out being incomprehensible. I will rephrase

Is there an algorithm or approach that will allow sorting an array in such a way that it minimizes the differences between successive elements?

struct element
{
uint32 positions[8];
}

These records are order-insensitive.
The output file format is defined to be:

byte  present;  // each bit indicating whether position[i] is present
uint32 position0;
-- (only bits set in Present are actually written in the file).  
uint32 positionN;  // N is the bitcount of "present"
byte  nextpresent;

All records are guaranteed to be unique, so a 'present' byte of 0 represents EOF. The file is parsed by updating a "current" structure with the present fields, and the result is added to the list.

Eg: { 1, 2, 3}, { 2, 3, 2}, { 4, 2, 3}
Would be: 111b 1 2 3 001b 4 111b 2 3 2
Saving 2 numbers off the unsorted approach.

My goal is to to minimize the output file size.

A: 

You're looking at a pair of subproblems, defining the difference between structures, then the sort.

I'm not terribly clear on your description of the structure, nor on the precedence of differences, but I'll assume you can work that out and compute a difference score between two instances. For files, there are known algorithms for discussing these things, like the one used in diff.

For your ordering, you're looking at a classic travelling salesman problem. If you're sorting a few of these things, its easy. If you are sorting a lot of them, you'll have to settle for a 'good enough' sort, unless you're ready to apply domain knowledge and many little tricks from TSP to the effort.

This should be simpler than TSP - the distances between states are bounded, and a -> b is the same as b -> a
+3  A: 

Your problem

I think this question should really be tagged with 'compression'.

As I understand it, you have unordered records which consist of eight 4-byte integers: 32 bytes in total. You want to store these records with a minimum file size, and have decided to use some form of delta encoding based on a Hamming distance. You're asking how to best sort your data for the compression scheme you've constructed.

Your assumptions

From what you've told us, I don't see any real reason for you to split up your 32 bytes in the way you've described (apart from the fact that word boundaries are convenient)! If you get the same data back, do you really care if it's encoded as eight lots of 4 bytes, or sixteen lots of 2 bytes, or as one huge 32-byte integer?

Furthermore, unless there's something about the problem domain which makes your method the favourite, your best bet is probably to use a tried-and-tested compression scheme. You should be able to find code that's already written, and you'll get good performance on typical data.

Your question

Back to your original question, if you really do want to take this route. It's easy to imagine picking a starting record (I don't think it will make much difference which, but it probably makes sense to pick the 'smallest' or 'largest'), and computing the Hamming distance to all other records. You could then pick the one with the minimum distance to store next, and repeat. Obviously this is O(n^2) in the number of records. Unfortunately, this paper (which I haven't read or understood in detail) makes it look like computing the minimum Hamming distance from one string to a set of others is intrinsically hard, and doesn't have very good approximations.

You could obviously get better complexity by sorting your records based on Hamming weight (which comes down to the population count of that 32-byte integer), which is O(n log(n)) in the number of records. Then use some difference coding on the result. But I don't think this will make a terribly good compression scheme: the integers from 0 to 7 might end up as something like:

000, 100, 010, 001, 101, 011, 110, 111

0, 4, 2, 1, 5, 3, 6, 7

Which brings us back to the question I asked before: are you sure your compression scheme is better than something more standard for your particular data?

Tom
Nice answer Tom (I found your profile!!)
Matt Warren