views: 973
answers: 9

What is the most efficient algorithm for grouping identical items together in an array, given the following:

  1. Almost all items are duplicated several times.
  2. The items are not necessarily integers or anything else that's similarly simple. The range of the keys is not even well-defined, let alone small. In fact, the keys can be arbitrary structs. This rules out the most simple forms of counting sort.
  3. We care about both asymptotic and non-asymptotic properties, and n may be small sometimes. However, when n is small, performance is still important because this function may be called several million times in a loop on millions of small datasets. This rules out expensive hash functions and complex data structures that need to perform lots of memory allocations.
  4. The data may be sorted in arbitrary order as long as all identical items are grouped together.

If this is confusing, here's an example, assuming such a function is named groupIdentical:

uint[] foo = [1,2,3,2,1,5,4,5];
uint[] bar = groupIdentical(foo);
// One possible correct value for bar:
// bar == [2,2,1,1,3,4,5,5].
// Another possible correct answer:
// bar == [1,1,2,2,5,5,4,3].

However, as a reminder, we cannot assume that the data is composed of integers.

Edit: Thank you for the answers. My main problem with hashing was that hash tables perform memory allocations too frequently. What I ended up doing was writing my own hash table that uses a region allocator I already had around to get around this problem. It works well.
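
For illustration only, here is a hypothetical sketch of that general idea in D (the question's language), not the actual implementation: an open-addressing table whose storage lives in caller-owned scratch buffers, so repeated calls reuse memory instead of allocating. The names (groupWithScratch, Slot), the uint element type, and the linear probing are all assumptions made for the sketch.

struct Slot { uint key; uint count; bool used; }

// Hypothetical sketch, not the asker's real code: count occurrences in a reusable
// open-addressing table, then emit each key as many times as it was seen.
uint[] groupWithScratch(uint[] data, ref Slot[] scratch, ref uint[] outBuf)
{
    // Grow the reusable buffers only when the input outgrows them.
    if (scratch.length < data.length * 2) scratch.length = data.length * 2;
    if (outBuf.length < data.length) outBuf.length = data.length;
    scratch[] = Slot.init;                 // cheap reset, no reallocation

    foreach (x; data)
    {
        size_t i = x % scratch.length;     // trivial hash; fine for a sketch
        while (scratch[i].used && scratch[i].key != x)
            i = (i + 1) % scratch.length;  // linear probing; at least half the slots stay empty
        scratch[i].key = x;
        scratch[i].used = true;
        ++scratch[i].count;
    }

    size_t n = 0;
    foreach (ref s; scratch)
        foreach (j; 0 .. s.count)
            outBuf[n++] = s.key;
    return outBuf[0 .. n];
}

The caller keeps scratch and outBuf alive across all the calls, so once the buffers reach a steady-state size no per-call allocation happens at all.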

+8  A: 

I think you could just hash the objects, since real order doesn't matter, only grouping. Identical objects will end up grouped in the same bucket. This is assuming that every type you're interested in has its own hash function, or you can define your own and overload it (taking each type as a parameter to a different hashCode function definition).

To avoid collisions across data types (so strings don't end up in the same bucket as doubles, for one example), you'd need to encode the data type into the hash. So, for example, if you have a 32-bit hash, maybe the first 5 bits could encode the data type, so you can have 32 different types in the same hash map.

EDIT: Let me just add that the reason that I'm suggesting a custom hash map is because I don't know of one that exposes enough of its internal implementation for you to get the values out of each bucket. There might be such an implementation that I don't know of. There are a lot of things I don't know. :)
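
As a minimal sketch of the bucket idea, here is one way it could look in D (the question's language), using the built-in associative array rather than a custom map; the uint element type is an assumption made for brevity, and the function name is borrowed from the question. For real structs, the associative-array key would be the struct itself, which in D means giving it suitable toHash and opEquals members.

uint[] groupIdentical(uint[] data)
{
    uint[][uint] buckets;              // key -> all items equal to that key
    foreach (item; data)
    {
        if (auto p = item in buckets)
            *p ~= item;                // append to the existing bucket
        else
            buckets[item] = [item];    // start a new bucket
    }

    uint[] result;
    foreach (key, items; buckets)      // bucket order is arbitrary, which is allowed here
        result ~= items;
    return result;
}

// groupIdentical([1u, 2, 3, 2, 1, 5, 4, 5]) might return [3, 1, 1, 4, 2, 2, 5, 5]:
// any order is fine as long as equal items are adjacent.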

Bill the Lizard
That hash would have to be very small and therefore you'd have to compare lots of datasets. How is it possible to determine the size of the hash in advance so that the amount of work can be minimized?
Georg
I'm not exactly sure what you mean. Are you talking about the speed of the hash function or the actual size of the return type?
Bill the Lizard
You wouldn't do what you described in your second paragraph. Hash tables don't just put things with identical keys in the same bucket. They fall back on a comparison function when the keys are the same.
Jules
Some implementations of hash tables put things with the same hashes (not keys) in the same bucket. Besides that, I'm suggesting a custom hash table.
Bill the Lizard
+2  A: 

A galloping mergesort, such as Python's built-in sort (cf. timsort), has good expected performance when there are large runs of already-sorted data (like, in your example, identical objects); you'll skip O(log N) work per merge. You can also distribute a mergesort across multiple CPUs and disks if your dataset is extremely large (this is called an "external" sort). However, it will be O(N log N) in the worst case.

The only sorts that are faster than O(N log N) are counting-style sorts, which exploit some common property of the keys. To use a linear-time sort (hash table or radix/bucket sort), you'll have to hash the structs to generate some kind of numerical key.

Radix sort will make multiple passes through the keys, so its expected time will be longer than that of a hash-table approach; and since you don't care about lexicographic order, the hash-table solution sounds better for you, if you can afford to hash the keys.
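
Since any comparison sort leaves equal keys adjacent, the sorting approach can be sketched in D as nothing more than a library sort with whatever ordering the struct admits; MyStruct and its key field below are placeholders, not anything from the question. A stable merge-based variant is also available via SwapStrategy.stable if the order within a group matters.

import std.algorithm : sort;

struct MyStruct { int key; /* arbitrary payload */ }

// Equal keys end up contiguous after any comparison sort; the particular
// algorithm (timsort, introsort, ...) only changes the constant factors.
void groupBySorting(MyStruct[] items)
{
    sort!((a, b) => a.key < b.key)(items);
}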

+1  A: 

3-way quicksort performs very well when there are large numbers of duplicates.
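
A hedged sketch of what that looks like, using Dutch-national-flag partitioning on uints; a real version would take a comparison predicate so it works for arbitrary structs.

import std.algorithm : swap;

// Partition around a pivot into <, ==, > regions, then recurse only on the
// < and > parts; every group of equal keys is finished in a single pass.
void quickSort3(uint[] a)
{
    if (a.length < 2) return;
    immutable pivot = a[a.length / 2];
    size_t lt = 0, i = 0, gt = a.length;   // invariant: a[0..lt] < pivot, a[lt..i] == pivot, a[gt..$] > pivot
    while (i < gt)
    {
        if (a[i] < pivot)      { swap(a[i], a[lt]); ++i; ++lt; }
        else if (a[i] > pivot) { --gt; swap(a[i], a[gt]); }
        else                   { ++i; }
    }
    quickSort3(a[0 .. lt]);
    quickSort3(a[gt .. $]);
}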

CMS
A: 

If you know the range of the possible values, and it's small, you could do: (pseudo-ish code)

uint[] bucket = new uint[10];          // one counter per possible value (here 0..9)
foreach (uint val; foo) {
    ++bucket[val];
}

uint bar_i = 0;
uint[] bar = new uint[foo.length];
for (uint val = 0; val < 10; val++) {
    uint occurrences = bucket[val];
    for (uint i = 0; i < occurrences; i++) {
        bar[bar_i++] = val;            // emit each value as many times as it was counted
    }
}
recursive
A: 

I think that hashing into buckets would be the best solution, assuming that there is a hash that is consistent with operator== (0.0 might not hash to the same thing as -0.0, even though they compare "equal"). If you only have equality and less-than operators, you could implement a rudimentary quicksort: pick the first element as the pivot, put the elements less than the pivot in one group and those greater than it in another, and then repeat the process on each group.

FryGuy
A: 

I think that since you have arbitrary objects that you do not want to copy around too much, you could just use references or pointers for the sort, and, if needed, copy the objects in order afterwards.
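
A small sketch of that in D, with a made-up Big struct standing in for the arbitrary objects: sort pointers by key so only word-sized pointers move, then copy the structs out in the new order afterwards if a contiguous grouped array is really needed.

import std.algorithm : sort;

struct Big { uint key; ubyte[252] payload; }    // stand-in for an expensive-to-copy struct

Big*[] groupViaPointers(Big[] items)
{
    Big*[] ptrs;
    ptrs.reserve(items.length);
    foreach (ref item; items)
        ptrs ~= &item;                          // collect pointers, not copies
    sort!((a, b) => a.key < b.key)(ptrs);       // equal keys become adjacent among the pointers
    return ptrs;
}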

Svante
A: 

Maybe a red-black or AVL tree? Then again, it would still ultimately be O(N log N). You might as well use heapsort: it won't be any worse, and it needs no extra memory...

Vilx-
+4  A: 

The magic word you're looking for here is multiset (or bag). It's not really a sort at all, since you don't care about the order as long as all the elements with equal keys are grouped together. There are several canned implementations available, depending on the language you're using, but in general the hashed version above is asymptotically optimal, I believe: insert() is constant time, since you can compute a hash in O(1) and append colliding inserts to a list in O(1). You can retrieve one element from a bin in O(1) by just grabbing the first one in the bin, and you can therefore collect all of them in O(n), since you retrieve n elements at O(1) each.
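
A minimal sketch of the bag idea in D, under the assumption that equal items are fully interchangeable, so it is enough to remember each key once plus a count; if "equal" items can still differ in fields you care about, keep per-key buckets (as in the hashing sketch further up) instead of counts.

uint[] groupViaMultiset(uint[] data)
{
    size_t[uint] counts;               // key -> number of occurrences (the "bag")
    foreach (x; data)
    {
        if (auto p = x in counts)
            ++*p;                      // expected O(1) insert
        else
            counts[x] = 1;
    }

    uint[] result;
    result.reserve(data.length);
    foreach (key, n; counts)           // emit each key n times: O(n) overall
        foreach (i; 0 .. n)
            result ~= key;
    return result;
}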

Charlie Martin