views: 973
answers: 9

What is the most efficient algorithm for grouping identical items together in an array, given the following:

  1. Almost all items are duplicated several times.
  2. The items are not necessarily integers or anything else that's similarly simple. The range of the keys is not even well-defined, let alone small. In fact, the keys can be arbitrary structs. This rules out the most simple forms of counting sort.
  3. We care about both asymptotic and non-asymptotic properties, and n may be small sometimes. However, when n is small, performance is still important because this function may be called several million times in a loop on millions of small datasets. This rules out expensive hash functions and complex data structures that need to perform lots of memory allocations.
  4. The data may be sorted in arbitrary order as long as all identical items are grouped together.

If this is confusing, here's an example, assuming such a function is named groupIdentical:

uint[] foo = [1,2,3,2,1,5,4,5];
uint[] bar = groupIdentical(foo);
// One possible correct value for bar:
// bar == [2,2,1,1,3,4,5,5].
// Another possible correct answer:
// bar == [1,1,2,2,5,5,4,3].

However, as a reminder, we cannot assume that the data is composed of integers.

Edit: Thank you for the answers. My main problem with hashing was that hash tables perform memory allocations too frequently. What I ended up doing was writing my own hash table that uses a region allocator I already had around to get around this problem. It works well.
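
For illustration only, here is a hypothetical sketch of that general idea in D (the question's language), not the actual implementation: an open-addressing table whose storage lives in caller-owned scratch buffers, so repeated calls reuse memory instead of allocating. The names (groupWithScratch, Slot), the uint element type, and the linear probing are all assumptions made for the sketch.

struct Slot { uint key; uint count; bool used; }

// Hypothetical sketch, not the asker's real code: count occurrences in a reusable
// open-addressing table, then emit each key as many times as it was seen.
uint[] groupWithScratch(uint[] data, ref Slot[] scratch, ref uint[] outBuf)
{
    // Grow the reusable buffers only when the input outgrows them.
    if (scratch.length < data.length * 2) scratch.length = data.length * 2;
    if (outBuf.length < data.length) outBuf.length = data.length;
    scratch[] = Slot.init;                 // cheap reset, no reallocation

    foreach (x; data)
    {
        size_t i = x % scratch.length;     // trivial hash; fine for a sketch
        while (scratch[i].used && scratch[i].key != x)
            i = (i + 1) % scratch.length;  // linear probing; at least half the slots stay empty
        scratch[i].key = x;
        scratch[i].used = true;
        ++scratch[i].count;
    }

    size_t n = 0;
    foreach (ref s; scratch)
        foreach (j; 0 .. s.count)
            outBuf[n++] = s.key;
    return outBuf[0 .. n];
}

The caller keeps scratch and outBuf alive across all the calls, so once the buffers reach a steady-state size no per-call allocation happens at all.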

+8  A: 

I think you could just hash the objects, since real order doesn't matter, only grouping. Identical objects will end up grouped in the same bucket. This is assuming that every type you're interested in has its own hash function, or you can define your own and overload it (taking each type as a parameter to a different hashCode function definition).

To avoid collisions across data types (so strings don't end up in the same bucket as doubles, for one example), you'd need to encode the data type into the hash. So, for example, if you have a 32-bit hash, maybe the first 5 bits could encode the data type, so you can have 32 different types in the same hash map.

EDIT: Let me just add that the reason that I'm suggesting a custom hash map is because I don't know of one that exposes enough of its internal implementation for you to get the values out of each bucket. There might be such an implementation that I don't know of. There are a lot of things I don't know. :)
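
As a minimal sketch of the bucket idea, here is one way it could look in D (the question's language), using the built-in associative array rather than a custom map; the uint element type is an assumption made for brevity, and the function name is borrowed from the question. For real structs, the associative-array key would be the struct itself, which in D means giving it suitable toHash and opEquals members.

uint[] groupIdentical(uint[] data)
{
    uint[][uint] buckets;              // key -> all items equal to that key
    foreach (item; data)
    {
        if (auto p = item in buckets)
            *p ~= item;                // append to the existing bucket
        else
            buckets[item] = [item];    // start a new bucket
    }

    uint[] result;
    foreach (key, items; buckets)      // bucket order is arbitrary, which is allowed here
        result ~= items;
    return result;
}

// groupIdentical([1u, 2, 3, 2, 1, 5, 4, 5]) might return [3, 1, 1, 4, 2, 2, 5, 5]:
// any order is fine as long as equal items are adjacent.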

Bill the Lizard
That hash would have to be very small and therefore you'd have to compare lots of datasets. How is it possible to determine the size of the hash in advance so that the amount of work can be minimized?
Georg
I'm not exactly sure what you mean. Are you talking about the speed of the hash function or the actual size of the return type?
Bill the Lizard
You wouldn't do what you described in your second paragraph. Hash tables don't just put things with identical keys in the same bucket. They fall back on a comparison function when the keys are the same.
Jules
Some implementations of hash tables put things with the same hashes (not keys) in the same bucket. Besides that, I'm suggesting a custom hash table.
Bill the Lizard
+2  A: 

A galloping mergesort, such as Python's built-in sort (cf. timsort), has good expected performance when there are large runs of already-sorted data (like, in your example, identical objects); you'll skip O(log N) work per merge. You can also distribute a mergesort across multiple CPUs and disks if your dataset is extremely large (this is called an "external" sort). However, it will be O(N log N) in the worst case.

The only sorts that are faster than O(N log N) are counting-style sorts, which exploit some common property of the keys. To use a linear-time sort (hash table or radix/bucket sort), you'll have to hash the structs to generate some kind of numerical key.

Radix sort will make multiple passes through the keys, so its expected time will be longer than that of a hash-table approach; and since you don't care about lexicographic order, the hash-table solution sounds better for you, if you can afford to hash the keys.
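
Since any comparison sort leaves equal keys adjacent, the sorting approach can be sketched in D as nothing more than a library sort with whatever ordering the struct admits; MyStruct and its key field below are placeholders, not anything from the question. A stable merge-based variant is also available via SwapStrategy.stable if the order within a group matters.

import std.algorithm : sort;

struct MyStruct { int key; /* arbitrary payload */ }

// Equal keys end up contiguous after any comparison sort; the particular
// algorithm (timsort, introsort, ...) only changes the constant factors.
void groupBySorting(MyStruct[] items)
{
    sort!((a, b) => a.key < b.key)(items);
}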

+1  A: 

3-way quicksort performs very well when there are large numbers of duplicates.
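
A hedged sketch of what that looks like, using Dutch-national-flag partitioning on uints; a real version would take a comparison predicate so it works for arbitrary structs.

import std.algorithm : swap;

// Partition around a pivot into <, ==, > regions, then recurse only on the
// < and > parts; every group of equal keys is finished in a single pass.
void quickSort3(uint[] a)
{
    if (a.length < 2) return;
    immutable pivot = a[a.length / 2];
    size_t lt = 0, i = 0, gt = a.length;   // invariant: a[0..lt] < pivot, a[lt..i] == pivot, a[gt..$] > pivot
    while (i < gt)
    {
        if (a[i] < pivot)      { swap(a[i], a[lt]); ++i; ++lt; }
        else if (a[i] > pivot) { --gt; swap(a[i], a[gt]); }
        else                   { ++i; }
    }
    quickSort3(a[0 .. lt]);
    quickSort3(a[gt .. $]);
}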

CMS
A: 

If you know the range of the possible values, and it's small, you could do: (pseudo-ish code)

uint[] bucket = new uint[10];          // one counter per possible value (here 0..9)
foreach (uint val; foo) {
    ++bucket[val];
}

uint bar_i = 0;
uint[] bar = new uint[foo.length];
for (uint val = 0; val < 10; val++) {
    uint occurrences = bucket[val];
    for (uint i = 0; i < occurrences; i++) {
        bar[bar_i++] = val;            // emit each value as many times as it was counted
    }
}
recursive
A: 

I think that hashing into buckets would be the best solution, assuming that there is a hash that is consistent with operator== (0.0 might not hash to the same thing as -0.0, even though they compare "equal"). If you only have equality and less-than operators, you could implement a rudimentary quicksort: pick the first element as the pivot, put the elements less than the pivot in one group and those greater than it in another, and then repeat the process on each group.

FryGuy
A: 

I think that since you have arbitrary objects that you do not want to copy around too much, you could just use references or pointers for the sort, and, if needed, copy the objects in order afterwards.
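
A small sketch of that in D, with a made-up Big struct standing in for the arbitrary objects: sort pointers by key so only word-sized pointers move, then copy the structs out in the new order afterwards if a contiguous grouped array is really needed.

import std.algorithm : sort;

struct Big { uint key; ubyte[252] payload; }    // stand-in for an expensive-to-copy struct

Big*[] groupViaPointers(Big[] items)
{
    Big*[] ptrs;
    ptrs.reserve(items.length);
    foreach (ref item; items)
        ptrs ~= &item;                          // collect pointers, not copies
    sort!((a, b) => a.key < b.key)(ptrs);       // equal keys become adjacent among the pointers
    return ptrs;
}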

Svante
A: 

Maybe a red-black or AVL tree? Then again, it would still ultimately be O(N log N). You might as well use heapsort: it won't be any worse, and it needs no extra memory...

Vilx-
+4  A: 

The magic word you're looking for here is multiset (or bag). It's not really a sort at all, since you don't care about the order as long as all the elements with equal keys are grouped together. There are several canned implementations available, depending on the language you're using, but in general the hashed version above is asymptotically optimal, I believe: insert() is constant time, since you can compute a hash in O(1) and append colliding inserts to a list in O(1). You can retrieve one element from a bin in O(1) by just grabbing the first one in the bin, and you can therefore collect all of them in O(n), since you retrieve n elements at O(1) each.
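
A minimal sketch of the bag idea in D, under the assumption that equal items are fully interchangeable, so it is enough to remember each key once plus a count; if "equal" items can still differ in fields you care about, keep per-key buckets (as in the hashing sketch further up) instead of counts.

uint[] groupViaMultiset(uint[] data)
{
    size_t[uint] counts;               // key -> number of occurrences (the "bag")
    foreach (x; data)
    {
        if (auto p = x in counts)
            ++*p;                      // expected O(1) insert
        else
            counts[x] = 1;
    }

    uint[] result;
    result.reserve(data.length);
    foreach (key, n; counts)           // emit each key n times: O(n) overall
        foreach (i; 0 .. n)
            result ~= key;
    return result;
}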

Charlie Martin