ansaurus

Question

Sorting structures in order of least change

Answer 1

A:

You're looking at a pair of subproblems: defining the difference between structures, and then the sort.

I'm not terribly clear on your description of the structure, nor on the precedence of differences, but I'll assume you can work that out and compute a difference score between two instances. For files, there are known algorithms for computing such differences, like the one used in diff.

For your ordering, you're looking at a classic travelling salesman problem. If you're sorting a few of these things, it's easy. If you are sorting a lot of them, you'll have to settle for a 'good enough' sort, unless you're ready to apply domain knowledge and many little tricks from TSP to the effort.
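To make the "few of these things" case concrete, here is a minimal sketch that tries every ordering and keeps the cheapest one. The names `path_cost`, `best_order` and `bit_difference` are made up for illustration, and the bit-count score is only a stand-in for whatever difference score you settle on. It minimises an open path (no return to the start), and the factorial running time makes it usable only for a handful of items.

```python
from itertools import permutations


def path_cost(order, distance):
    """Sum of the distances between consecutive items in an ordering."""
    return sum(distance(a, b) for a, b in zip(order, order[1:]))


def best_order(items, distance):
    """Try every ordering and keep the cheapest. Only feasible for a few items."""
    return min(permutations(items), key=lambda order: path_cost(order, distance))


def bit_difference(a, b):
    """Placeholder difference score (an assumption): number of differing bits."""
    return bin(a ^ b).count("1")


records = [0b111, 0b000, 0b011, 0b001]
order = best_order(records, bit_difference)
print(order, path_cost(order, bit_difference))
```

Swapping in your own `distance` function is the only change needed; the search itself does not care how the score is defined.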

2008-11-26 04:04:38

This should be simpler than TSP - the distances between states are bounded, and a -> b is the same as b -> a

2008-11-26 07:49:02

Answer 2

+3 A:

Your problem

I think this question should really be tagged with 'compression'.

As I understand it, you have unordered records which consist of eight 4-byte integers: 32 bytes in total. You want to store these records with a minimum file size, and have decided to use some form of delta encoding based on a Hamming distance. You're asking how to best sort your data for the compression scheme you've constructed.

Your assumptions

From what you've told us, I don't see any real reason for you to split up your 32 bytes in the way you've described (apart from the fact that word boundaries are convenient)! If you get the same data back, do you really care if it's encoded as eight lots of 4 bytes, or sixteen lots of 2 bytes, or as one huge 32-byte integer?
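As a small illustration of that point, the sketch below views the same 32 bytes three ways using Python's `struct` module. The record values and the big-endian byte order are assumptions made purely for the example.

```python
import struct

# Eight made-up 4-byte values standing in for one record.
record = (1, 2, 3, 4, 5, 6, 7, 8)

# Pack as eight big-endian unsigned 32-bit integers: 32 bytes in total.
raw = struct.pack(">8I", *record)

as_words = struct.unpack(">8I", raw)        # eight lots of 4 bytes
as_halfwords = struct.unpack(">16H", raw)   # sixteen lots of 2 bytes
as_one_int = int.from_bytes(raw, "big")     # one 256-bit integer

print(len(raw), as_words, as_halfwords, hex(as_one_int))
```

All three views round-trip to the same bytes, so the choice of split only matters if the compression scheme exploits it.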

Furthermore, unless there's something about the problem domain which makes your method the favourite, your best bet is probably to use a tried-and-tested compression scheme. You should be able to find code that's already written, and you'll get good performance on typical data.
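As a baseline to compare against, here is a minimal sketch using Python's standard `zlib` module. The records are synthetic (I made them up so the example runs), so the ratio it prints says nothing about your real data.

```python
import struct
import zlib

# Synthetic records: eight small integers each. Purely illustrative data.
records = [tuple((i * k) % 1000 for k in range(1, 9)) for i in range(1000)]

# Serialise each record as eight little-endian unsigned 32-bit integers.
raw = b"".join(struct.pack("<8I", *r) for r in records)

compressed = zlib.compress(raw, 9)  # 9 = highest compression level

# Lossless round trip: we get the same bytes back.
restored = zlib.decompress(compressed)

print(len(raw), len(compressed), restored == raw)
```

Running your own scheme and something like this over the same real data is the quickest way to find out which one actually wins.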

Your question

Back to your original question, if you really do want to take this route. It's easy to imagine picking a starting record (I don't think it will make much difference which, but it probably makes sense to pick the 'smallest' or 'largest'), and computing the Hamming distance to all other records. You could then pick the one with the minimum distance to store next, and repeat. Obviously this is O(n^2) in the number of records. Unfortunately, this paper (which I haven't read or understood in detail) makes it look like computing the minimum Hamming distance from one string to a set of others is intrinsically hard, and doesn't have very good approximations.
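The greedy approach described above could be sketched roughly as follows. It assumes each record is a `bytes` object of equal length, and the function names are mine. It is a heuristic, so nothing guarantees the resulting order minimises the total distance.

```python
def hamming_distance(a: bytes, b: bytes) -> int:
    """Number of bit positions in which two equal-length records differ."""
    x = int.from_bytes(a, "big") ^ int.from_bytes(b, "big")
    return bin(x).count("1")


def greedy_order(records):
    """Start from the smallest record, then keep appending the closest one left."""
    remaining = sorted(records)
    ordered = [remaining.pop(0)]
    while remaining:
        last = ordered[-1]
        # Linear scan over what is left, which makes the whole thing O(n^2).
        nearest = min(
            range(len(remaining)),
            key=lambda i: hamming_distance(last, remaining[i]),
        )
        ordered.append(remaining.pop(nearest))
    return ordered


# Four made-up 32-byte records.
records = [n.to_bytes(32, "big") for n in (7, 0, 3, 1)]
for rec in greedy_order(records):
    print(int.from_bytes(rec, "big"))
```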

You could obviously get better complexity by sorting your records based on Hamming weight (which comes down to the population count of that 32-byte integer), which is O(n log(n)) in the number of records. Then use some difference coding on the result. But I don't think this will make a terribly good compression scheme: the integers from 0 to 7 might end up as something like:

000, 100, 010, 001, 101, 011, 110, 111

0, 4, 2, 1, 5, 3, 6, 7

Which brings us back to the question I asked before: are you sure your compression scheme is better than something more standard for your particular data?
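For what it's worth, here is a quick sketch of the Hamming-weight sort on the integers 0 to 7, counting how many bits change between neighbours. Python's sort is stable, so ties come out in a different order from the one I wrote above, but the effect is the same: neighbours with similar weights can still differ in every bit.

```python
def hamming_weight(n: int) -> int:
    """Population count: the number of set bits in n."""
    return bin(n).count("1")


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which a and b differ."""
    return hamming_weight(a ^ b)


# Sort by weight; ties keep their original (numeric) order.
by_weight = sorted(range(8), key=hamming_weight)

# Bits that change between each pair of consecutive values.
steps = [hamming_distance(a, b) for a, b in zip(by_weight, by_weight[1:])]

print([format(n, "03b") for n in by_weight])
print(steps, sum(steps))
```

Equal or adjacent weights say very little about how many bits two values share, which is why the deltas do not shrink much.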

Tom 2008-12-01 16:37:42

Nice answer Tom (I found your profile!!)

Matt Warren 2009-07-16 19:39:54
