I have a large list of integers (thousands), and I want to extract the first N (on the order of 10-20) unique elements from it. Each integer in the list occurs roughly three times.
Writing an algorithm to do this is trivial, but I wonder what's the most speed- and memory-efficient way to do it.
There are some additional constraints and information in my case:
In my use-case I extract the uniques multiple times from the array, each time skipping some elements from the beginning. The number of elements that I skip is not known during unique-extraction. I don't even have an upper bound. Therefore sorting is not speed-efficient (I have to preserve the order of the array).
The integers are all over the place, so a bit-array as a lookup solution is not feasible.
I want to avoid temporary allocations during the search at all costs.
My current solution looks roughly like this:
int num_uniques = 0;
int uniques[16];
int startpos = 0;

while ((num_uniques != N) && (startpos < array_length))
{
    // A temporary used later.
    int insert_position;

    // Get the next element.
    int element = array[startpos++];

    // Check if the element already exists. If the element is not found,
    // binary_search returns the position where it could be inserted
    // while keeping the array sorted.
    if (!binary_search (uniques, element, num_uniques, &insert_position))
    {
        // Insert the new unique element while preserving
        // the order of the array.
        insert_into_array (uniques, element, insert_position);
        num_uniques++;
    }
}
The binary_search / insert_into_array combination gets the job done, but the performance is not great. The insert_into_array call moves elements around a lot, and this slows everything down.
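For reference, here's roughly what the two helpers do. This is a minimal sketch, not my exact code; in particular, I pass the current element count to insert_into_array explicitly here, so treat the signatures as approximations:

#include <string.h>  /* memmove */

/* Returns nonzero if 'element' is among the first 'count' entries of the
   sorted array 'a'. Either way, *pos receives the position where 'element'
   could be inserted to keep 'a' sorted (a classic lower-bound search). */
static int binary_search (const int *a, int element, int count, int *pos)
{
    int lo = 0, hi = count;
    while (lo < hi)
    {
        int mid = lo + (hi - lo) / 2;
        if (a[mid] < element)
            lo = mid + 1;
        else
            hi = mid;
    }
    *pos = lo;
    return (lo < count) && (a[lo] == element);
}

/* Shifts everything from 'pos' onwards up by one slot and drops the new
   element in. This memmove is the part that moves elements around. */
static void insert_into_array (int *a, int element, int pos, int count)
{
    memmove (&a[pos + 1], &a[pos], (count - pos) * sizeof (int));
    a[pos] = element;
}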
Any ideas?
EDIT
Great answers, guys! Everyone deserves an accepted answer, but I can give only one. I'll implement a bunch of your ideas and do a performance shootout with some typical data. The one whose idea leads to the quickest implementation gets the accepted answer.
I'll run the code on a modern PC and on an embedded Cortex-A8 CPU and weight the results somehow. I'll post the results as well.
EDIT: Results of the shoot-out
Timings on a Core Duo, 100 iterations over a 160 kB test dataset.
Bruteforce (Pete): 203 ticks
Hash and Bruteforce (Antti): 219 ticks
Inplace Binary Tree (Steven): 390 ticks
Binary-Search (Nils): 438 ticks
http://torus.untergrund.net/code/unique_search_shootout.zip (C source and test data)
Additional remarks:
The Inplace Binary Tree absolutely rocks for truly random distributions (my test data has a tendency to be ascending).
The Binary-Search works very well on my test data for more than 32 uniques. It performs almost linearly.
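For those who don't want to dig through the zip: the brute-force winner is, as far as the idea goes, just a linear scan over the uniques collected so far, which presumably wins because 10-20 uniques fit comfortably in cache. A minimal sketch of that idea (names and details are mine, not Pete's exact code):

/* Brute-force unique extraction: for each array element, linearly scan
   the uniques collected so far and append the element if it's new. */
static int find_uniques_bruteforce (const int *array, int array_length,
                                    int *uniques, int n)
{
    int num_uniques = 0;
    int pos = 0;
    while ((num_uniques != n) && (pos < array_length))
    {
        int element = array[pos++];
        int i;
        // Linear scan over the uniques found so far.
        for (i = 0; i < num_uniques; i++)
            if (uniques[i] == element)
                break;
        // Not found: append it (order of first appearance is preserved).
        if (i == num_uniques)
            uniques[num_uniques++] = element;
    }
    return num_uniques;
}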