What is the most efficient algorithm for grouping identical items together in an array, given the following:
- Almost all items are duplicated several times.
- The items are not necessarily integers or anything similarly simple. The range of the keys is not even well-defined, let alone small; in fact, the keys can be arbitrary structs. This rules out the simplest forms of counting sort.
- We care about both asymptotic and non-asymptotic performance, and n may sometimes be small. Even then, speed matters, because this function may be called millions of times in a loop over millions of small datasets. This rules out any expensive hash function, as well as complex data structures that require many memory allocations.
- The data may end up in any order, as long as all identical items are grouped together.
If this is confusing, here's an example, assuming such a function is named groupIdentical:
uint[] foo = [1,2,3,2,1,5,4,5];
uint[] bar = groupIdentical(foo);
// One possible correct value for bar:
// bar == [2,2,1,1,3,4,5,5].
// Another possible correct answer:
// bar == [1,1,2,2,5,5,4,3].
However, as a reminder, we cannot assume that the data is composed of integers.
Edit: Thank you for the answers. My main problem with hashing was that hash tables allocate memory too frequently. What I ended up doing was writing my own hash table that uses a region allocator I already had, which works around this problem. It works well.