+20  A: 

The question is a little ambiguous: when the request is "which one," does it mean return the value that is duplicated, or its position in the sequence? If the former, any of the following three solutions will work; if the latter, the first is the only one that will help.

Solution #1: assumes array is immutable

Build a bitmap; set the nth bit as you iterate through the array. If the bit is already set, you've found a duplicate. This runs in linear time and works for an array of any size.

The bitmap would be created with as many bits as there are possible values in the array. As you iterate through the array, check the nth bit in the bitmap. If it is set, you've found your duplicate; if it isn't, set it. (Logic for doing this can be seen in the pseudo-code in the Wikipedia entry on bit arrays, or you can use the System.Collections.BitArray class.)
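
A minimal Python sketch of this idea (the function name and the bytearray representation are my own; the C# BitArray class mentioned above plays the same role):

def find_duplicate_bitmap(arr, max_val=1000000):
    # one bit per possible value: about 122 kB for 1,000,000 values
    bits = bytearray(max_val // 8 + 1)
    for x in arr:
        byte, mask = x // 8, 1 << (x % 8)
        if bits[byte] & mask:    # bit already set: x has been seen before
            return x
        bits[byte] |= mask       # mark x as seen
    return None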

Solution #2: assumes array is mutable

Sort the array, and then do a linear search until the current value equals the previous value. Uses the least memory of all. Bonus points for altering the sort algorithm to detect the duplicate during a comparison operation and terminating early.

Solution #3: (assumes array length = 1,000,001)

  1. Sum all of the integers in the array.
  2. From that, subtract the sum of the integers 1 through 1,000,000 inclusive.
  3. What's left will be your duplicated value.

This takes almost no extra memory and can be done in one pass if you calculate both sums at the same time.

The disadvantage is that you need to do the entire loop to find the answer.

The advantages are simplicity, and a high probability it will in fact run faster than the other solutions.
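
A quick Python sketch of this solution, assuming (as the comments below point out is required) that the array holds every integer from 1 to 1,000,000 plus one extra copy:

def find_duplicate_sum(arr):
    n = len(arr) - 1               # 1,000,000 in the question
    expected = n * (n + 1) // 2    # sum of 1..n
    return sum(arr) - expected     # the surplus is the duplicated value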

lavinio
That'll only work if the array contains ALL integers between 1 and 1,000,000 plus one double (so 1,000,001 elements), won't it?
Toon Van Acker
Good point; I had assumed the array contained all the integers. Too bad; it was a nice answer :)
lavinio
Okay, how about the bitmap solution?
lavinio
@lavinio: Can you shed some light on the bitmap solution?
Learner
This is the same as a hash table; you have used a bitmap instead of a hash table. It reduces the size a little, but still uses O(n) extra memory.
Learner
Well, except there is no hash. It's direct addressing, and there are no buckets and no hash collisions.
lavinio
That bitmap would require 1,000,000 bits or 122.07 kB, regardless of array size (which would be at most 1,000,001 elements).
Toon Van Acker
@Learner: This does NOT use O(n) extra memory. It uses one million bits independent of the length of the array.
Jason
I'm surprised at the number of votes my response got. It's simple to implement, but somewhat clumsy. Personally I like the last Python example that just sorts the array and loops until it finds a duplicate.
lavinio
@lavinio: Yours is O(n) (best possible) and uses O(1) extra space. What more is there to want?
Jason
The bitmap solution could be made to work multi-threaded. Each thread has its own bitmap. If, at the end of the run, none of the threads has found a collision, you can combine the per-thread bitmaps with XOR/AND/OR to figure out where the collision is.
Brad Gilbert
Hey, any idea how I can implement a bitmap in C#?
Learner
`bool[] bitmap = new bool[1000000];` Access via `bitmap[i]`, where `i >= 0` and `i < 1000000`.
Greg
A `bool` array would use a byte (8 bits) for each element, so it would use nearly 1 MB rather than the 122 kB quoted above (although it would likely be faster). If you really want to use a bitmap then C# has a BitArray class, see: http://msdn.microsoft.com/en-us/library/system.collections.bitarray.aspx
jtb
@jtb: I didn't know that, thanks for correcting my mistake.
Greg
@jtb: why would a bool array be faster than a bit array?
Sonic Soul
+7  A: 

Assuming all the numbers from 1 to 1,000,000 are in the array, the sum of all numbers from 1 to 1,000,000 is (1,000,000)*(1,000,000 + 1)/2 = 500,000 * 1,000,001 = 500,000,500,000.

So just add up all the numbers in the array, subtract 500,000,500,000, and you'll be left with the number that occurred twice.

O(n) time, and O(1) memory.

If the assumption isn't true, you could try using a Bloom filter - they can be stored much more compactly than a hash table (since they only store the fact of presence), but they do run the risk of false positives. This risk can be bounded, though, by our choice of how much memory to spend on the Bloom filter.

We can then use the Bloom filter to detect potential duplicates in O(n) time and check each candidate in O(n) time.
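
A rough sketch of the two-pass Bloom-filter idea in Python; the table size m, the hash count k, and the hashing scheme are illustrative choices, not anything prescribed above:

import hashlib
from collections import Counter

def _bloom_indices(x, m, k):
    # derive k bit positions for value x (illustrative hashing scheme)
    return [int(hashlib.md5(f"{i}:{x}".encode()).hexdigest(), 16) % m
            for i in range(k)]

def find_duplicate_bloom(arr, m=2**20, k=3):
    bits = bytearray(m // 8)
    candidates = set()
    for x in arr:                    # pass 1: O(n), collect candidates
        idxs = _bloom_indices(x, m, k)
        if all(bits[h // 8] & (1 << (h % 8)) for h in idxs):
            candidates.add(x)        # possible duplicate (or false positive)
        for h in idxs:
            bits[h // 8] |= 1 << (h % 8)
    counts = Counter(x for x in arr if x in candidates)   # pass 2: O(n) check
    return next((x for x, c in counts.items() if c > 1), None)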

rampion
This only applies when all the integers in the range are present. At least from what I remember from the story about Gauss back in school...
raoulsson
yep, that's true. I assumed as much was true from my reading of the question. But it looks like my reading was wrong.
rampion
What are the advantages of doing it probabilistically?
Jason
The memory is still O(n), but with a much smaller coefficient (depending on what you choose your false positive rate to be).
rampion
@rampion: But sometimes you get the answer wrong.
Jason
@Jason - a Bloom Filter only has false positives, never false negatives. So you won't miss the correct answer. You do run the risk of false positives, but that can be controlled by the size of the table, and then you can do an O(n) check on the few that remain.
rampion
+2  A: 

Rather than sorting the array and then checking, I would suggest writing an implementation of a comparison sort that exits as soon as the duplicate is found. This requires no extra memory (depending on the algorithm you choose, obviously) and gives a worst case of O(n log n) time (again, depending on the algorithm), rather than paying the full O(n log n) in the best (and average) case as well.

E.g. an implementation of in-place merge sort:

http://en.wikipedia.org/wiki/Merge_sort
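
A Python sketch of the early-exit idea (using a plain top-down merge sort rather than a true in-place variant, for brevity): any pair of elements is compared across exactly one merge, so testing for equality there is guaranteed to catch the duplicate.

def find_dup_while_sorting(arr):
    found = []

    def msort(a):
        if found or len(a) <= 1:           # stop recursing once found
            return a
        mid = len(a) // 2
        left, right = msort(a[:mid]), msort(a[mid:])
        if found:                          # abandon merging once detected
            return a
        merged, i, j = [], 0, 0
        while i < len(left) and j < len(right):
            if left[i] == right[j]:        # duplicate caught mid-comparison
                found.append(left[i])
                break
            if left[i] < right[j]:
                merged.append(left[i])
                i += 1
            else:
                merged.append(right[j])
                j += 1
        return merged + left[i:] + right[j:]

    msort(arr)
    return found[0] if found else None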

Toon Van Acker
+6  A: 

This Python code is a modification of quicksort:

def findDuplicate(arr):
    orig_len = len(arr)
    if orig_len <= 1:
        return None
    pivot = arr.pop(0)
    greater = [i for i in arr if i > pivot]
    lesser = [i for i in arr if i < pivot]
    if len(greater) + len(lesser) != orig_len - 1:
        return pivot
    else:
        return findDuplicate(lesser) or findDuplicate(greater)

It finds a duplicate in O(n log n), I think. It uses extra memory on the stack, but it can be rewritten to use only one copy of the original data, I believe:

def findDuplicate(arr):
    orig_len = len(arr)
    if orig_len <= 1:
        return None
    pivot = arr.pop(0)
    greater = [arr.pop(i) for i in reversed(range(len(arr))) if arr[i] > pivot]
    lesser = [arr.pop(i) for i in reversed(range(len(arr))) if arr[i] < pivot]
    if len(arr):
        return pivot
    else:
        return findDuplicate(lesser) or findDuplicate(greater)

The list comprehensions that produce greater and lesser destroy the original with calls to pop(). If arr is not empty after removing greater and lesser from it, then there must be a duplicate and it must be pivot.

The code suffers from the usual stack overflow problems on sorted data, so either a random pivot or an iterative solution which queues the data is necessary:

def findDuplicate(full):
    import copy
    q = [full]
    while len(q):
        arr = copy.copy(q.pop(0))
        orig_len = len(arr)
        if orig_len > 1:
            pivot = arr.pop(0)
            greater = [arr.pop(i) for i in reversed(range(len(arr))) if arr[i] > pivot]
            lesser = [arr.pop(i) for i in reversed(range(len(arr))) if arr[i] < pivot]
            if len(arr):
                return pivot
            else:
                q.append(greater)
                q.append(lesser)
    return None

However, now the code needs to take a copy of the data at the top of the loop, changing the memory requirements.

So much for computer science. The naive algorithm clobbers my code in Python, probably because Python's built-in sort is implemented in C:

def findDuplicate(arr):
    arr = sorted(arr)
    prev = arr.pop(0)
    for element in arr:
        if element == prev:
            return prev
        else:
            prev = element
    return None
hughdbrown
A: 

As a variant of your solution (2), you can use radix sort. No extra memory, and it will run in linear time. You can argue that the time is also affected by the size of the number representation, but you have already given bounds for that: radix sort runs in time O(kn), where k is the number of digits in the keys. That makes the whole algorithm O(7n) for sorting plus O(n) for checking the duplicated number, which is O(8n) = O(n).

Pros:

  • No extra memory
  • O(n)

Cons:

  • Need eight O(n) passes.
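
A bucket-based LSD radix sort sketch in Python; note that this version trades the no-extra-memory property for clarity, and the seven passes come from the seven decimal digits of 1,000,000:

def find_duplicate_radix(arr, base=10, passes=7):
    for p in range(passes):                # one stable pass per digit
        buckets = [[] for _ in range(base)]
        div = base ** p
        for x in arr:
            buckets[(x // div) % base].append(x)
        arr = [x for b in buckets for x in b]
    for prev, cur in zip(arr, arr[1:]):    # final O(n) scan of sorted data
        if prev == cur:
            return cur
    return None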
Jay
I'm not sure I agree that radix sort is O(n). In this particular case, the distribution of keys is such that no two are equal, except that there is one duplicate pair. For any arbitrary maximum in the list to be checked for duplicates, radix sort grows exactly the same as other efficient sorts: it must sort at most n elements in up to log(n) passes, so radix is O(n log(n)). If the key distribution were otherwise, such as a fixed-size hash value with the number of keys significantly larger than the domain of the key, then radix sort would have the given O(n) complexity.
TokenMacGuy
+2  A: 

Hint: Use the property that A XOR A == 0, and 0 XOR A == A.
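
A sketch of one way to use the hint, assuming (as a comment below spells out) that the array holds all of 1 to 1,000,000 plus one duplicate:

def find_duplicate_xor(arr):
    n = len(arr) - 1           # values are 1..n, one of them twice
    acc = 0
    for x in arr:
        acc ^= x               # each unique value contributes once here...
    for v in range(1, n + 1):
        acc ^= v               # ...and cancels here; the duplicate survives
    return acc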

Mike Mu
How does this help? If you XOR all the numbers together, say, without any idea of what the result ought to be, there is no way this can be diagnostic.
hughdbrown
If the array contains 1,000,001 elements, then this solution is better than computing the sum (it needs less memory, since the running XOR never exceeds the value range). Compute a[1] xor 1 xor a[2] xor 2...
sdcvvc
A: 

And how about the problem of finding ALL duplicates? Can this be done in less than O(n ln n) time? (Sort & scan) (If you want to restore the original array, carry along the original index and reorder after the end, which can be done in O(n) time)

With extra memory using counting sort, yes. Or, with no extra memory, using radix sort (but then, you could argue that using radix sort is "sort of" cheating because it's actually O(kn), where k is the maximum number of digits, and k is proportional to log n; however, the bounds were already given for this problem, so k is fixed).
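
A counting-based Python sketch of the extra-memory route (the function name and hard-coded bound are mine):

def all_duplicates(arr, max_val=1000000):
    counts = [0] * (max_val + 1)      # O(max_val) extra space
    for x in arr:
        counts[x] += 1                # one O(n) counting pass
    return [v for v, c in enumerate(counts) if c > 1]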
Jay
A: 
from functools import reduce   # needed in Python 3

def singleton(array):
    return reduce(lambda x, y: x ^ y, array)
Andrew Johnson
Works for some inputs (for example, range(1000) + [101]), but not for others (range(1001) + [101]).
ojrac
A: 

Sort the integers by putting each one in the place where it should be (value v goes to index v). If you get a "collision", you have found the duplicated number.

Space complexity is O(1) (the same space is just overwritten); time complexity is less than a full O(n) pass on average, because statistically you will find the collision before reaching the end.
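
A Python sketch of the place-by-value idea, assuming an (n+1)-element array with values 1..n; since no value is 0, index 0 never settles, so we keep routing arr[0] to its home slot until it collides:

def find_duplicate_cyclic(arr):
    while True:
        v = arr[0]
        if arr[v] == v:
            return v                     # slot v already holds v: the duplicate
        arr[0], arr[v] = arr[v], arr[0]  # move v to index v and try again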

ralu