I have a large list of integers (thousands), and I want to extract the first N (on the order of 10-20) unique elements from it. Each integer in the list occurs roughly three times.

Writing an algorithm to do this is trivial, but I wonder what's the most speed- and memory-efficient way to do it.

There are some additional constraints and pieces of information in my case:

  • In my use-case I extract the uniques multiple times from the array, each time skipping some elements from the beginning. The number of elements that I skip is not known at unique-extraction time; I don't even have an upper bound. Therefore sorting is not speed-efficient (I have to preserve the order of the array).

  • The integers are all over the place, so a bit-array as a lookup solution is not feasible.

  • I want to avoid temporary allocations during the search at all costs.

My current solution looks roughly like this:

  int num_uniques = 0;
  int uniques[16];
  int startpos = 0;

  while ((num_uniques != N) && (startpos < array_length))
  {
    // A temporary used later.
    int insert_position;

    // Get the next element.
    int element = array[startpos++];

    // Check if the element exists. If the element is not found,
    // binary_search returns false and sets insert_position to the
    // position where the element could be inserted while keeping
    // the array sorted.

    if (!binary_search (uniques, element, num_uniques, &insert_position))
    {

      // Insert the new unique element while preserving
      // the order of the array.

      insert_into_array (uniques, element, insert_position);

      num_uniques++;
    }
  }

The binary_search / insert_into_array approach gets the job done, but the performance is not great. The insert_into_array call moves elements around a lot, and this slows everything down.

Any ideas?


EDIT

Great answers, guys! Everyone deserves an accepted answer, but I can give only one. I'll implement a bunch of your ideas and do a performance shootout with some typical data. The one whose idea leads to the quickest implementation gets the accepted answer.

I'll run the code on a modern PC and an embedded Cortex-A8 CPU and weight the results somehow. I'll post the results as well.


EDIT: Results of the shootout

Timings on a Core Duo, 100 iterations over a 160 KB test dataset.

Bruteforce (Pete):            203 ticks
Hash and Bruteforce (Antti):  219 ticks
Inplace Binary Tree (Steven): 390 ticks
Binary-Search (Nils):         438 ticks

http://torus.untergrund.net/code/unique_search_shootout.zip (C-source and testdata)

Additional remarks:

  • The Inplace Binary Tree absolutely rocks for truly random distributions (my test data has a tendency to be ascending).

  • The Binary-Search works very well on my test data for more than 32 uniques. It performs almost linearly.

+11  A: 

Why not just start inserting your array elements into a std::set and stop when the set has N elements? Sets are guaranteed not to have duplicates. They're also guaranteed to be sorted, so if you traverse a set from begin() to end(), you'll do so in sorted order according to operator<.

Tyler McHenry
The set does allocations as I insert elements and has too high a runtime overhead.
Nils Pipenbrinck
@Nils: The runtime overhead is all down to the allocation of nodes -- otherwise, set<T> should be faster as it uses O(log n) lookup followed by O(1) insertion (your insertion is O(n)). You could write your own allocator to avoid the dynamic memory allocation, though this is a bit of work.
j_random_hacker
I'm not sure what you mean by "does allocations as you insert"... you're only extracting 10-20 elements, so the set only allocates for 10-20 nodes. And I'm not sure how much faster you expect to get compared to a red-black tree.
Tyler McHenry
@Nils "to high runtime overhead" : How slow is it? How fast does it need to be?
Adam Jaskiewicz
Doesn't the question need the elements sorted in their original order, not ascending order?
Steven Huwig
I got the impression that he was talking about sorting the original array to more easily skip dupes, and that he wanted to avoid doing so for performance reasons (completely understandable; that's certainly the wrong way to do this).
Adam Jaskiewicz
Use a hash-table set?
Jason S
+3  A: 

Instead of storing the unique integers in an array, use an actual binary tree. It'll save you from shifting the array elements repeatedly.

Pesto
+4  A: 

I would try storing the uniques in an unbalanced binary tree. That'll save you the cost of rearranging the uniques list, and if the source list is random enough, the insertions into the tree won't unbalance it drastically. (And you can do a search-and-insert-if-not-present all in one go with a binary tree.) If it does become unbalanced, the worst case would be the same as iterating over a 16-element list instead of doing the binary search.

You know the max size of the binary tree, so you can preallocate all the necessary memory ahead of time; that shouldn't be an issue. You could even use the "I've run out of memory for nodes" condition to let you know when you're done.

(EDIT: Apparently folks think I'm advocating using exceptions here. I'm not. I might be advocating actual Common Lisp-style conditions, but not the escape-continuation-style exceptions found in most languages. Besides, it looks like he wants to do this in C.)
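A minimal C sketch of the idea (the function name and fixed pool size are illustrative, not from the answer): nodes live in a preallocated pool, pool indices stand in for pointers, and filling the pool is the termination condition. As a bonus, nodes are appended in insertion order, so pool[i].value also yields the uniques in their original array order.

#define MAX_UNIQUES 16

/* One tree node; children are pool indices, -1 means "no child". */
struct node { int value; int left; int right; };

/* Collect the first MAX_UNIQUES distinct values of array[] into the
   preallocated pool[] and return how many were found. No heap use. */
static int first_uniques (const int *array, int length,
                          struct node pool[MAX_UNIQUES])
{
  int used = 0;

  for (int i = 0; i < length && used < MAX_UNIQUES; i++)
  {
    int v = array[i];

    if (used == 0)    /* first value becomes the root */
    {
      pool[0].value = v;
      pool[0].left = pool[0].right = -1;
      used = 1;
      continue;
    }

    int cur = 0;      /* walk down from the root */
    for (;;)
    {
      if (v == pool[cur].value)
        break;        /* duplicate, skip it */

      int *child = (v < pool[cur].value) ? &pool[cur].left
                                         : &pool[cur].right;
      if (*child == -1)    /* not present: grab the next pool slot */
      {
        *child = used;
        pool[used].value = v;
        pool[used].left = pool[used].right = -1;
        used++;
        break;
      }
      cur = *child;
    }
  }
  return used;
}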

Jay Kominek
Sorry, but using exceptions for flow control is not a good idea. Otherwise, good answer.
Svante
if I understand Jay correctly, it wouldn't be an exception, since the memory is allocated up front. It would just be the tree detecting that it has no more empty slots. You could certainly implement that as throwing an exception, but for this purpose it would be ill-advised, as you said.
rmeador
I figured he'd do this in C, which doesn't have exceptions. if(nextindex>memoryslots) { return(ALL_SIXTEEN_FOUND); } ...or whatever.
Jay Kominek
Yeah, C or C++, that's how it should be done. Do the comparison yourself; don't use an exception. Your answer sounds like that is what you were suggesting; I'll upvote you if you make it clear that you aren't suggesting using an exception for flow control.
Adam Jaskiewicz
+4  A: 

For an array that small (if you want the first 20 elements, on average you have 10 to check equality with), a linear scan often outperforms a binary search, even if you aren't having to insert elements.
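A minimal sketch of the brute-force scan (the function name is illustrative). Note that it also preserves the original order of the uniques for free:

/* Collect the first n_wanted distinct values of array[] into uniques[]
   by checking each element against the uniques found so far with a
   plain linear scan. For 16-20 uniques the table fits in a cache line
   or two, so the scan is cheap despite O(n_wanted) work per element. */
static int collect_uniques (const int *array, int length,
                            int *uniques, int n_wanted)
{
  int count = 0;

  for (int i = 0; i < length && count < n_wanted; i++)
  {
    int element = array[i];
    int k;

    for (k = 0; k < count; k++)
      if (uniques[k] == element)
        break;                       /* already seen */

    if (k == count)
      uniques[count++] = element;    /* new unique, appended in order */
  }
  return count;
}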

Pete Kirkham
Yeah - I thought about that as well, but I shy away from adding an O(n*n) algorithm. The requirements may change one day. Good idea, though; keep it simple.
Nils Pipenbrinck
Inserting N elements into an array is already O(N*N); you're eliminating the binary search, with its many branches.
Pete Kirkham
@Nils Since your array of uniques is small it is an O(n) algorithm where n is the size of the large list.
starblue
+3  A: 

Use an array representation of a binary tree. The array can be of size 3N. Basically

arr[i] = value

arr[i+1] = left child array index

arr[i+2] = right child array index

Walk the "tree" on each insertion of k, and if k is not found, update its parent's [i+1] or [i+2] and add it to the next empty index. When you run out of space in the array, you've got your answer.

e.g.

find the first 3 uniques of 42243123: array size = 3 * 3 = 9.

In the table below, "v" are values, "l" is left child index, "r" is right child index.

 v  l  r  v  l  r  v  l  r
 _________________________
-1 -1 -1 -1 -1 -1 -1 -1 -1
 4 -1 -1 -1 -1 -1 -1 -1 -1
 4  3 -1  2 -1 -1 -1 -1 -1
 4  3 -1  2 -1 -1 -1 -1 -1
 4  3 -1  2 -1 -1 -1 -1 -1
 4  3 -1  2 -1  6  3 -1 -1

and you're out of space.

Array indices 0 mod 3 are your answer.

You can preserve order by using groups of 4:

array[i] = value

array[i+1] = position in original array

array[i+2] = left child index

array[i+3] = right child index
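A minimal C sketch of the stride-3 layout from the worked example above (names are mine): each node occupies three consecutive ints, child links hold the child's base index, and running out of node groups ends the search.

#define N_UNIQUES 3
#define STRIDE    3   /* slots per node: value, left index, right index */

/* Collect the first N_UNIQUES distinct values of array[] into the
   flattened tree[], which must hold N_UNIQUES * STRIDE ints. The
   values end up at tree[0], tree[3], tree[6], ... in insertion order.
   Returns the number of uniques found. */
static int first_uniques_flat (const int *array, int length,
                               int tree[N_UNIQUES * STRIDE])
{
  int used = 0;    /* node groups in use */

  for (int i = 0; i < length && used < N_UNIQUES; i++)
  {
    int v = array[i];

    if (used == 0)    /* plant the root */
    {
      tree[0] = v;
      tree[1] = tree[2] = -1;
      used = 1;
      continue;
    }

    int g = 0;        /* walk from the root group */
    for (;;)
    {
      if (v == tree[g])
        break;        /* duplicate, skip it */

      int link = (v < tree[g]) ? g + 1 : g + 2;
      if (tree[link] == -1)    /* not present: append a new group */
      {
        int ng = used * STRIDE;
        tree[link] = ng;
        tree[ng] = v;
        tree[ng + 1] = tree[ng + 2] = -1;
        used++;
        break;
      }
      g = tree[link];
    }
  }
  return used;
}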

Steven Huwig
the implicit binary tree is a nice idea!
Nils Pipenbrinck
+4  A: 

The fastest time complexity you'll achieve with the restrictions you have imposed is O(n), using a dictionary with O(1) lookup instead of your binary search for the unique integers. Why bother searching for them when you can look them up in constant time?

Since you're only dealing with "thousands of records", anything else is a trivial addition.

John Rasch
This might be the best idea yet. Make an array of length p (prime) up front; each time you see a new int, stick it in array[int % p] if that spot isn't already occupied. When you've successfully added 16, you're done. O(1) for each int you process. You just waste a bit of time if there are collisions.
Jay Kominek
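A minimal sketch of Jay's scheme (the table size and the EMPTY sentinel are illustrative assumptions; the sentinel must be a value that never occurs in the data). One detail glossed over above: two distinct values can map to the same slot, so this sketch falls back to linear probing. With at most 16 entries in a 251-slot table, probe chains stay short and always terminate.

#define N_WANTED 16
#define TABLE_P  251                 /* prime table size, assumed */
#define EMPTY    ((int)0x80000000)   /* assumed absent from the data */

/* Collect the first N_WANTED distinct values of array[] into uniques[]
   using a fixed-size open-addressed table on the stack. Returns the
   number of uniques found. No heap allocations. */
static int first_uniques_hash (const int *array, int length, int *uniques)
{
  int table[TABLE_P];
  int count = 0;

  for (int i = 0; i < TABLE_P; i++)
    table[i] = EMPTY;

  for (int i = 0; i < length && count < N_WANTED; i++)
  {
    int element = array[i];
    unsigned slot = (unsigned)element % TABLE_P;

    /* Probe until we hit element (a duplicate) or a free slot (new). */
    while (table[slot] != EMPTY && table[slot] != element)
      slot = (slot + 1) % TABLE_P;

    if (table[slot] == EMPTY)
    {
      table[slot] = element;
      uniques[count++] = element;    /* original order preserved */
    }
  }
  return count;
}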
@John Rasch: This is a good approach, but please note that O(1) time lookup for dictionaries/hashtables is a half-truth; it depends on your hashtable being "sufficiently large" and on the absence of patterns in the input data that would lead to pathological O(n) slowness.
j_random_hacker
@j_random_hacker - good point, especially considering the small amount of data being looked up. Unless something like Jay's suggestion is used, the hashing function alone could waste enough time to make a plain search more efficient.
John Rasch
+2  A: 

If you have thousands of integers and every one occurs roughly three times, your algorithm should find the set of N unique integers pretty quickly, roughly in N(1+e) steps for small e (assuming the integers are ordered relatively randomly).

This means that your algorithm would insert a random integer into the uniques array N times. Insertion number K would on average shift K/2 elements in the array, yielding (N^2)/4 move operations. Your binary search would take roughly N * (log(N) - 1) steps. This yields a total complexity of (N^2)/4 + N(log(N) - 1) + N(1+e) for your algorithm.

I think you could do better, e.g. with the following:

#include <stdint.h>
#include <string.h>

int num_uniques = 0, startpos = 0, k, element;
int uniques[16];

/* Allocate and clear a bit table of 32 * 32 = 1024 bits. */
uint32_t bit_table[32], hash;
memset(bit_table, 0, sizeof(bit_table));

while (num_uniques < N && startpos < array_length) {
  element = array[startpos++];

  /* Hash the element quickly to a number from 0..1023.
     Unsigned arithmetic avoids sign issues on the shifts. */
  hash = (uint32_t)element;
  hash ^= hash >> 16;
  hash *= 0x19191919;
  hash >>= 22;
  hash &= 1023;

  /* Map the hash value to a bit in the bit table.
     Use the low 5 bits of 'hash' to index bit_table
     and the other 5 bits to get the actual bit. */
  uint32_t slot = hash & 31;
  uint32_t bit = (1u << (hash >> 5));

  /* If the bit is NOT set, this element is guaranteed unique. */
  if (!(bit_table[slot] & bit)) {
    bit_table[slot] |= bit;
    uniques[num_uniques++] = element;
  } else { /* Otherwise it can still be unique, with probability
              roughly num_uniques / 1024. */
    for (k = 0; k < num_uniques; k++) { if (uniques[k] == element) break; }
    if (k == num_uniques) uniques[num_uniques++] = element;
  }
}

This algorithm would run in an expected time of N + N^2 / 128 steps, because the probability of running the inner loop (index variable k) is low.

antti.huima
Very nice. Thank you.
Nils Pipenbrinck
What quantity does e represent?
j_random_hacker
j_random: It's just a small number corresponding to the probability of having duplicate numbers on the list before N uniques have been found.
antti.huima
@antti: Ah thanks.
j_random_hacker
A: 

Given a list of integers of size N named L

Iterate L once to find the largest value and smallest value in the array.

Allocate (1 allocation) an integer array of size (small .. large) named A. Initialize this array to zeros.

Iterate L, using L(i) to subscript into A, and increment the counter found there.

Then do your processing. Pick your starting point in L and walk forward through the list, looking up A(L(i)). Pick whatever set of elements with A(L(i)) > 2 that you want.

When you are done, dispose of A.

If you are really short on space, use 2 bits instead of an integer, with the following interpretation:

00 count = 0
01 count = 1
10 count = 2
11 count > 2
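A minimal sketch of the 2-bit packing (function names are mine): sixteen saturating counters fit in each uint32_t, and a stored count of 3 stands for "more than 2".

#include <stdint.h>

/* counts[] must hold (range + 15) / 16 zero-initialized words, where
   index = value - smallest, as described above. */
static void count2_increment (uint32_t *counts, uint32_t index)
{
  uint32_t word  = index >> 4;          /* 16 counters per word */
  uint32_t shift = (index & 15u) * 2;   /* 2 bits per counter   */
  uint32_t c = (counts[word] >> shift) & 3u;

  if (c < 3u)   /* saturate at 3, i.e. "count > 2" */
    counts[word] = (counts[word] & ~(3u << shift)) | ((c + 1u) << shift);
}

static uint32_t count2_get (const uint32_t *counts, uint32_t index)
{
  return (counts[index >> 4] >> ((index & 15u) * 2)) & 3u;
}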
EvilTeach
+0. A good idea if the space of integers being scanned is not much larger than N, but the asker specifically says that the integers are "all over the place, so a bit-array as a lookup solution is not feasible."
j_random_hacker
OK. Why is it not feasible? Assuming N is 2^32, that number of bits easily fits into virtual memory.
EvilTeach