I have a large list of integers (thousands), and I want to extract the first N (on the order of 10-20) unique elements from it. Each integer in the list occurs roughly three times.

Writing an algorithm to do this is trivial, but I wonder what's the most speed- and memory-efficient way to do it.

There are some additional constraints and pieces of information in my case:

  • In my use-case I extract the uniques multiple times from the array, each time skipping some elements from the beginning. The number of elements that I skip is not known at unique-extraction time; I don't even have an upper bound. Therefore sorting is not speed-efficient (I have to preserve the order of the array).

  • The integers are all over the place, so a bit-array as a lookup solution is not feasible.

  • I want to avoid temporary allocations during the search at all costs.

My current solution looks roughly like this:

  int num_uniques = 0;
  int uniques[16];
  int startpos = 0;

  while ((num_uniques != N) && (startpos < array_length))
  {
    // A temporary used later.
    int insert_position;

    // Get the next element.
    int element = array[startpos++];

    // Check if the element exists. If the element is not found,
    // binary_search returns false and sets insert_position to the
    // position where the element could be inserted while keeping
    // the array sorted.

    if (!binary_search (uniques, element, num_uniques, &insert_position))
    {

      // Insert the new unique element while preserving
      // the order of the array.

      insert_into_array (uniques, element, insert_position);

      num_uniques++;
    }
  }

The binary_search / insert_into_array approach gets the job done, but the performance is not great. The insert_into_array call moves elements around a lot, and this slows everything down.

Any ideas?


EDIT

Great answers, guys! Everyone deserves an accepted answer, but I can give only one. I'll implement a bunch of your ideas and do a performance shootout with some typical data. The one whose idea leads to the quickest implementation gets the accepted answer.

I'll run the code on a modern PC and an embedded Cortex-A8 CPU and weight the results somehow. I'll post the results as well.


EDIT: Results of the shootout

Timings on a Core Duo, 100 iterations over a 160 KB test dataset.

Bruteforce (Pete):            203 ticks
Hash and Bruteforce (Antti):  219 ticks
Inplace Binary Tree (Steven): 390 ticks
Binary-Search (Nils):         438 ticks

http://torus.untergrund.net/code/unique_search_shootout.zip (C-source and testdata)

Additional remarks:

  • The Inplace Binary Tree absolutely rocks for truly random distributions (my test data has a tendency to be ascending).

  • The Binary-Search works very well on my test data for more than 32 uniques. It performs almost linearly.

+11  A: 

Why not just start inserting your array elements into a std::set and stop when the set has N elements? Sets are guaranteed not to have duplicates. They're also guaranteed to be sorted, so if you traverse a set from begin() to end(), you'll do so in sorted order according to operator<.

Tyler McHenry
The set does allocations as I insert elements and has too high a runtime overhead.
Nils Pipenbrinck
@Nils: The runtime overhead is all down to the allocation of nodes -- otherwise, set<T> should be faster as it uses O(log n) lookup followed by O(1) insertion (your insertion is O(n)). You could write your own allocator to avoid the dynamic memory allocation, though this is a bit of work.
j_random_hacker
I'm not sure what you mean by "does allocations as you insert"... you're only extracting 10-20 elements, so the set only allocates for 10-20 nodes. And I'm not sure how much faster you expect to get compared to a red-black tree.
Tyler McHenry
@Nils "to high runtime overhead" : How slow is it? How fast does it need to be?
Adam Jaskiewicz
Doesn't the question need the elements sorted in their original order, not ascending order?
Steven Huwig
I got the impression that he was talking about sorting the original array to more easily skip dupes, and that he wanted to avoid doing so for performance reasons (completely understandable; that's certainly the wrong way to do this).
Adam Jaskiewicz
Use a hash-table set?
Jason S
+3  A: 

Instead of storing the unique integers in an array, use an actual binary tree. It'll save you from shifting the array elements repeatedly.

Pesto
+4  A: 

I would try storing the uniques in an unbalanced binary tree. That'll save you the cost of rearranging the uniques list, and if the source list is random enough, the insertions into the tree won't unbalance it drastically. (And you can do a search-and-insert-if-not-present all in one go with a binary tree.) If it does become unbalanced, the worst case would be the same as iterating over a 16-element list instead of doing the binary search.

You know the max size of the binary tree, so you can preallocate all the necessary memory ahead of time; that shouldn't be an issue. You could even use the "I've run out of memory for nodes" condition to let you know when you're done.

(EDIT: Apparently folks think I'm advocating using exceptions here. I'm not. I might be advocating actual Common Lisp-style conditions, but not the escape-continuation-style exceptions found in most languages. Besides, it looks like he wants to do this in C.)
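A minimal C sketch of the idea (the function name and fixed pool size are illustrative, not from the answer): nodes live in a preallocated pool, pool indices stand in for pointers, and filling the pool is the termination condition. As a bonus, nodes are appended in insertion order, so pool[i].value also yields the uniques in their original array order.

#define MAX_UNIQUES 16

/* One tree node; children are pool indices, -1 means "no child". */
struct node { int value; int left; int right; };

/* Collect the first MAX_UNIQUES distinct values of array[] into the
   preallocated pool[] and return how many were found. No heap use. */
static int first_uniques (const int *array, int length,
                          struct node pool[MAX_UNIQUES])
{
  int used = 0;

  for (int i = 0; i < length && used < MAX_UNIQUES; i++)
  {
    int v = array[i];

    if (used == 0)    /* first value becomes the root */
    {
      pool[0].value = v;
      pool[0].left = pool[0].right = -1;
      used = 1;
      continue;
    }

    int cur = 0;      /* walk down from the root */
    for (;;)
    {
      if (v == pool[cur].value)
        break;        /* duplicate, skip it */

      int *child = (v < pool[cur].value) ? &pool[cur].left
                                         : &pool[cur].right;
      if (*child == -1)    /* not present: grab the next pool slot */
      {
        *child = used;
        pool[used].value = v;
        pool[used].left = pool[used].right = -1;
        used++;
        break;
      }
      cur = *child;
    }
  }
  return used;
}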

Jay Kominek
Sorry, but using exceptions for flow control is not a good idea. Otherwise, good answer.
Svante
if I understand Jay correctly, it wouldn't be an exception, since the memory is allocated up front. It would just be the tree detecting that it has no more empty slots. You could certainly implement that as throwing an exception, but for this purpose it would be ill-advised, as you said.
rmeador
I figured he'd do this in C, which doesn't have exceptions. if(nextindex>memoryslots) { return(ALL_SIXTEEN_FOUND); } ...or whatever.
Jay Kominek
Yeah, C or C++, that's how it should be done. Do the comparison yourself; don't use an exception. Your answer sounds like that is what you were suggesting; I'll upvote you if you make it clear that you aren't suggesting using an exception for flow control.
Adam Jaskiewicz
+4  A: 

For an array that small (if you want the first 20 elements, on average you have 10 to check equality with), a linear scan often outperforms a binary search, even if you aren't having to insert elements.
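A minimal sketch of the brute-force scan (the function name is illustrative). Note that it also preserves the original order of the uniques for free:

/* Collect the first n_wanted distinct values of array[] into uniques[]
   by checking each element against the uniques found so far with a
   plain linear scan. For 16-20 uniques the table fits in a cache line
   or two, so the scan is cheap despite O(n_wanted) work per element. */
static int collect_uniques (const int *array, int length,
                            int *uniques, int n_wanted)
{
  int count = 0;

  for (int i = 0; i < length && count < n_wanted; i++)
  {
    int element = array[i];
    int k;

    for (k = 0; k < count; k++)
      if (uniques[k] == element)
        break;                       /* already seen */

    if (k == count)
      uniques[count++] = element;    /* new unique, appended in order */
  }
  return count;
}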

Pete Kirkham
Yeah - I thought about that as well, but I shy away from adding an O(n*n) algorithm. The requirements may change one day. Good idea, though; keep it simple.
Nils Pipenbrinck
Inserting N elements into an array is already O(N*N); you're eliminating the binary search, with its many branches.
Pete Kirkham
@Nils Since your array of uniques is small it is an O(n) algorithm where n is the size of the large list.
starblue
+3  A: 

Use an array representation of a binary tree. The array can be of size 3N. Basically

arr[i] = value

arr[i+1] = left child array index

arr[i+2] = right child array index

Walk the "tree" on each insertion of k, and if k is not found, update its parent's [i+1] or [i+2] and add it to the next empty index. When you run out of space in the array, you've got your answer.

e.g.

find the first 3 uniques of 42243123: array size = 3 * 3 = 9.

In the table below, "v" are values, "l" is left child index, "r" is right child index.

 v  l  r  v  l  r  v  l  r
 _________________________
-1 -1 -1 -1 -1 -1 -1 -1 -1
 4 -1 -1 -1 -1 -1 -1 -1 -1
 4  3 -1  2 -1 -1 -1 -1 -1
 4  3 -1  2 -1 -1 -1 -1 -1
 4  3 -1  2 -1 -1 -1 -1 -1
 4  3 -1  2 -1  6  3 -1 -1

and you're out of space.

Array indices 0 mod 3 are your answer.

You can preserve order by using groups of 4:

array[i] = value

array[i+1] = position in original array

array[i+2] = left child index

array[i+3] = right child index
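A minimal C sketch of the stride-3 layout from the worked example above (names are mine): each node occupies three consecutive ints, child links hold the child's base index, and running out of node groups ends the search.

#define N_UNIQUES 3
#define STRIDE    3   /* slots per node: value, left index, right index */

/* Collect the first N_UNIQUES distinct values of array[] into the
   flattened tree[], which must hold N_UNIQUES * STRIDE ints. The
   values end up at tree[0], tree[3], tree[6], ... in insertion order.
   Returns the number of uniques found. */
static int first_uniques_flat (const int *array, int length,
                               int tree[N_UNIQUES * STRIDE])
{
  int used = 0;    /* node groups in use */

  for (int i = 0; i < length && used < N_UNIQUES; i++)
  {
    int v = array[i];

    if (used == 0)    /* plant the root */
    {
      tree[0] = v;
      tree[1] = tree[2] = -1;
      used = 1;
      continue;
    }

    int g = 0;        /* walk from the root group */
    for (;;)
    {
      if (v == tree[g])
        break;        /* duplicate, skip it */

      int link = (v < tree[g]) ? g + 1 : g + 2;
      if (tree[link] == -1)    /* not present: append a new group */
      {
        int ng = used * STRIDE;
        tree[link] = ng;
        tree[ng] = v;
        tree[ng + 1] = tree[ng + 2] = -1;
        used++;
        break;
      }
      g = tree[link];
    }
  }
  return used;
}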

Steven Huwig
the implicit binary tree is a nice idea!
Nils Pipenbrinck
+4  A: 

The fastest time complexity you'll achieve with the restrictions you have imposed is O(n), using a dictionary with O(1) lookup instead of your binary search for the unique integers. Why bother searching for them when you can look them up in constant time?

Since you're only dealing with "thousands of records", anything else is a trivial addition.

John Rasch
This might be the best idea yet. Make an array of length p (prime) up front; each time you see a new int, stick it in array[int % p] if that spot isn't already occupied. When you've successfully added 16, you're done. O(1) for each int you process. You just waste a bit of time if there are collisions.
Jay Kominek
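A minimal sketch of Jay's scheme (the table size and the EMPTY sentinel are illustrative assumptions; the sentinel must be a value that never occurs in the data). One detail glossed over above: two distinct values can map to the same slot, so this sketch falls back to linear probing. With at most 16 entries in a 251-slot table, probe chains stay short and always terminate.

#define N_WANTED 16
#define TABLE_P  251                 /* prime table size, assumed */
#define EMPTY    ((int)0x80000000)   /* assumed absent from the data */

/* Collect the first N_WANTED distinct values of array[] into uniques[]
   using a fixed-size open-addressed table on the stack. Returns the
   number of uniques found. No heap allocations. */
static int first_uniques_hash (const int *array, int length, int *uniques)
{
  int table[TABLE_P];
  int count = 0;

  for (int i = 0; i < TABLE_P; i++)
    table[i] = EMPTY;

  for (int i = 0; i < length && count < N_WANTED; i++)
  {
    int element = array[i];
    unsigned slot = (unsigned)element % TABLE_P;

    /* Probe until we hit element (a duplicate) or a free slot (new). */
    while (table[slot] != EMPTY && table[slot] != element)
      slot = (slot + 1) % TABLE_P;

    if (table[slot] == EMPTY)
    {
      table[slot] = element;
      uniques[count++] = element;    /* original order preserved */
    }
  }
  return count;
}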
@John Rasch: This is a good approach, but please note that O(1) time lookup for dictionaries/hashtables is a half-truth; it depends on your hashtable being "sufficiently large" and on the absence of patterns in the input data that would lead to pathological O(n) slowness.
j_random_hacker
@j_random_hacker - good point, especially considering the small amount of data being looked up. Unless something like Jay's suggestion is used, the hashing function alone could waste enough time to make a plain search more efficient.
John Rasch
+2  A: 

If you have thousands of integers and every one occurs roughly three times, your algorithm should find the set of N unique integers pretty quickly, roughly in N(1+e) steps for small e (assuming the integers are ordered relatively randomly).

This means that your algorithm would insert a random integer into the uniques array N times. Insertion number K would on average shift K/2 elements in the array, yielding (N^2)/4 move operations. Your binary search would take roughly N * (log(N) - 1) steps. This yields a total complexity of (N^2)/4 + N(log(N) - 1) + N(1+e) for your algorithm.

I think you could do better, e.g. with the following:

#include <stdint.h>
#include <string.h>

int num_uniques = 0, startpos = 0, k, element;
int uniques[16];

/* Allocate and clear a bit table of 32 * 32 = 1024 bits. */
uint32_t bit_table[32], hash;
memset(bit_table, 0, sizeof(bit_table));

while (num_uniques < N && startpos < array_length) {
  element = array[startpos++];

  /* Hash the element quickly to a number from 0..1023.
     Unsigned arithmetic avoids sign issues on the shifts. */
  hash = (uint32_t)element;
  hash ^= hash >> 16;
  hash *= 0x19191919;
  hash >>= 22;
  hash &= 1023;

  /* Map the hash value to a bit in the bit table.
     Use the low 5 bits of 'hash' to index bit_table
     and the other 5 bits to get the actual bit. */
  uint32_t slot = hash & 31;
  uint32_t bit = (1u << (hash >> 5));

  /* If the bit is NOT set, this element is guaranteed unique. */
  if (!(bit_table[slot] & bit)) {
    bit_table[slot] |= bit;
    uniques[num_uniques++] = element;
  } else { /* Otherwise it can still be unique, with probability
              roughly num_uniques / 1024. */
    for (k = 0; k < num_uniques; k++) { if (uniques[k] == element) break; }
    if (k == num_uniques) uniques[num_uniques++] = element;
  }
}

This algorithm would run in an expected time of N + N^2 / 128 steps, because the probability of running the inner loop (index variable k) is low.

antti.huima
Very nice. Thank you.
Nils Pipenbrinck
What quantity does e represent?
j_random_hacker
j_random: It's just a small number corresponding to the probability of having duplicate numbers on the list before N uniques have been found.
antti.huima
@antti: Ah thanks.
j_random_hacker
A: 

Given a list of integers of size N named L

Iterate L once to find the largest value and smallest value in the array.

Allocate (1 allocation) an integer array of size (small .. large) named A. Initialize this array to zeros.

Iterate L, using L(i) to subscript into A, and increment the counter found there.

Then do your processing. Pick your starting point in L and walk forward through the list, looking up A(L(i)). Pick whatever set of elements with A(L(i)) > 2 that you want.

When you are done, dispose of A.

If you are really short on space, use 2 bits instead of an integer, with the following interpretation:

00 count = 0
01 count = 1
10 count = 2
11 count > 2
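A minimal sketch of the 2-bit packing (function names are mine): sixteen saturating counters fit in each uint32_t, and a stored count of 3 stands for "more than 2".

#include <stdint.h>

/* counts[] must hold (range + 15) / 16 zero-initialized words, where
   index = value - smallest, as described above. */
static void count2_increment (uint32_t *counts, uint32_t index)
{
  uint32_t word  = index >> 4;          /* 16 counters per word */
  uint32_t shift = (index & 15u) * 2;   /* 2 bits per counter   */
  uint32_t c = (counts[word] >> shift) & 3u;

  if (c < 3u)   /* saturate at 3, i.e. "count > 2" */
    counts[word] = (counts[word] & ~(3u << shift)) | ((c + 1u) << shift);
}

static uint32_t count2_get (const uint32_t *counts, uint32_t index)
{
  return (counts[index >> 4] >> ((index & 15u) * 2)) & 3u;
}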
EvilTeach
+0. A good idea if the space of integers being scanned is not much larger than N, but the asker specifically says that the integers are "all over the place, so a bit-array as a lookup solution is not feasible."
j_random_hacker
OK. Why is it not feasible? Assuming N is 2^32, that number of bits easily fits into virtual memory.
EvilTeach