views: 1302

answers: 18

Hi All,

Here is one of my interview questions. Given an array of N elements where one element appears exactly N/2 times and the remaining N/2 elements are unique, how would you find that element with a better run time?

Remember the elements are not sorted and you can assume N is even. For example,

input array [] = { 10, 2, 3, 10, 1, 4, 10, 5, 10, 10 }

Here 10 appears exactly 5 times, which is N/2.

I know a solution with O(n) run time, but I am still hoping for a better solution, ideally O(log n).

+2  A: 

To do it in less than O(n) you would have to avoid reading all the numbers.
If you know there is a value that satisfies the relationship, then you could just sample a small subset and show that only one number appears enough times to meet the relationship. You would have to assume the values are reasonably uniformly distributed.

Edit: you would have to read n/2 elements to prove that such a number exists, but if you knew the number existed and only wanted to find it, you could read sqrt(n) samples.
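
For illustration, a minimal Python sketch of that sampling idea (assuming the array really does contain a value repeated N/2 times; the sqrt(N) sample size is just the suggestion above):

import math
import random
from collections import Counter

def find_repeated_by_sampling(arr):
    # Pick about sqrt(len(arr)) distinct positions at random; because half
    # the positions hold the repeated value, the sample almost certainly
    # contains it at least twice, and only the repeated value can occur
    # twice among distinct positions. Returns None if we got unlucky.
    k = min(len(arr), int(math.ceil(math.sqrt(len(arr)))) + 1)
    values = [arr[i] for i in random.sample(range(len(arr)), k)]
    value, count = Counter(values).most_common(1)[0]
    return value if count >= 2 else None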

Martin Beckett
You **don't** have to read all the numbers. Ever. Just look at a sample of N/2 + 1 elements and the repeated one is the one you need. That sample will always have the repeated element because of pigeon-hole goodness.
Jeremy Powell
Aren't you guaranteed not to read all the numbers, except in the degenerate case of n = 2?
ennuikiller
N/2 + 1 is still O(n)
Doug Currie
@Doug Hence the reason this is a comment and not an answer :P
Jeremy Powell
You would have to read a small fraction of the numbers, e.g. sqrt(n), to get log n.
Martin Beckett
How can you find it with a sample size of sqrt(n)? You have to read numbers until you find a repeated value - best case is 2 elements, worst case is n/2 + 2.
lc
@mgb Could you elaborate on the sqrt answer? What's special about sqrt?
Jeremy Powell
That last comment was meant to be a question - "How can you find it with a sample size of sqrt(n)?"
lc
Nothing special about sqrt - just a suggestion that you could check a subset based on a power, rather than a fraction.
Martin Beckett
A: 

If you are told that the element you are looking for is the non-unique one, surely the quickest way is to iterate along the array until you find two the same, then return that element and stop looking. At most you have to search just over half the array.

I think this is O(n) so I guess it doesn't really help.

It seems too simple so I think I don't understand the problem correctly.
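
A minimal Python sketch of that scan, using a set for constant-time duplicate checks as the comments below suggest (so the whole thing stays O(n) rather than O(n^2)), assuming the elements are hashable:

def first_repeated(arr):
    # Scan left to right, remembering values seen so far; the first value
    # seen a second time must be the repeated one, since all other values
    # are unique. Returns None if there is no duplicate at all.
    seen = set()
    for x in arr:
        if x in seen:
            return x
        seen.add(x)
    return None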

iamdudley
This solution is O(n^2) because as you read each number you have to compare it to all the previous numbers.
Tom Leys
@Tom Leys - Not if you store the numbers in a hash table as you go.
lc
@Tom good point. But maybe it's just O(n log n) since you can create a sorted tree to find duplicates.
Jeremy Powell
@lc Ah, the hash would definitely have to have no collisions; otherwise it degenerates.
Jeremy Powell
Oh yeah, thanks.
iamdudley
@Tom: you can use a bitset to do this in O(n).
cletus
@cletus oooo but if the size of the elements are arbitrary... that's a big bitset. :)
Jeremy Powell
A bitset with 2^31 elements can be stored in ~67 million 32 bit ints or ~270M of storage. That's entirely doable.
cletus
@cletus Doable? Yes.
Jeremy Powell
+6  A: 

You can't do this in sublinear time because you need to read the array. Processing an array of a million records in logarithmic time would require reading only ~20 (log2 of a million) elements - clearly impossible. Even if you assume the first duplicate found is the element repeated N/2 times, it's still O(n), because you may need to look at 500,001 elements before you find a duplicate.

You can do this in O(n) if you assume the integers are nonnegative. It goes like this (Java, using java.util.BitSet):

int repeatedNumber = -1; // sentinel value
int count = 0;
BitSet bits = new BitSet(); // needs up to 2^31 bits, roughly 2.1 billion, if values span the full positive int range
boolean duplicate = false;
for (int i : elements) {
  if (bits.get(i)) {
    if (repeatedNumber == -1) {
      repeatedNumber = i;
      count = 2; // this value has now been seen twice
    } else if (i == repeatedNumber) {
      count++;
    } else {
      System.out.println("Array has more than one repeated element");
      duplicate = true;
      break;
    }
  } else {
    bits.set(i);
  }
}
if (!duplicate && repeatedNumber != -1 && count == elements.length / 2) {
  System.out.println(repeatedNumber + " occurred " + count + " times. The rest of the elements are unique");
} else {
  System.out.println("Not true");
}

A similar method is used to sort an array of unique integers in O(n) (radix sort).

cletus
+1 for good explanation of why this is a bogus question. Perhaps the correct answer is that "You CANNOT have such a solution, here's why"
Tom Leys
If we assume the problem is as described, then we can do without the bitset. 1. Examine the first three elements. If two are the same, you are done. Otherwise, iterate over the remaining list comparing each with the previous. When you find two the same, output and exit.
John Fouhy
+1  A: 

If I'm understanding the problem correctly: all we know about the array is its length and that it has (N/2)+1 distinct elements, where 1 element is repeated N/2 times (in no specific order).

I think this suffers a hard limit of O(N) for the solution, as you can't really assert (for a generic array) that you've found the number without finding at least 2 of the same number. I don't think there exists a search for an unordered array that can detect a duplicate in O(log N) (please correct me if I'm wrong). You will always need to read at least N/2 + 1 elements in the worst case.

Falaina
You won't always need to read at least N/2 + 1 elements. The best case is the first two elements are duplicate, which means you read 2 elements. The worst case is the first N/2 + 1 elements are all unique, meaning the (N/2 + 2)th element is the answer, so you'll have to read N/2 + 2 at worst.
lc
Oops, you're right, I meant in the worst case. You're right that if the first 2 elements were duplicates we'd be done.
Falaina
+14  A: 

There is a constant-time solution if you are ready to accept a small probability of error. Randomly sample two values from the array; if they are the same, you have found the value you were looking for. At each step, you have roughly a 0.75 probability of not finishing. And because for every epsilon there exists an n such that (3/4)^n < eps, we can sample at most n times and return an error if we did not find a matching pair.

Also note that, if we keep sampling until we find a pair, the expected running time is constant, but the worst-case running time is not bounded.
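
A minimal Python sketch of this, with the trial cap derived from a chosen error bound eps (the names and the eps value are illustrative):

import math
import random

def find_repeated_probabilistic(arr, eps=1e-9):
    # Each trial picks two distinct positions at random; if they hold the
    # same value it must be the repeated one. A trial succeeds with
    # probability about 1/4, so after n trials with (3/4)**n < eps we
    # give up and signal an error by returning None.
    max_trials = int(math.ceil(math.log(eps) / math.log(0.75)))
    for _ in range(max_trials):
        i, j = random.randrange(len(arr)), random.randrange(len(arr))
        if i != j and arr[i] == arr[j]:
            return arr[i]
    return None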

Eolmar
Usually doesn't big-O imply worst-case? Otherwise this wouldn't be so hard.
Jeremy Powell
I like this solution a lot. Even if you go through the list sequentially picking pairs of 2 numbers, the probability of finding the pair would rise pretty rapidly (as per the Birthday Problem). It might even reach O(log N) in the average case, though it still would have a worst case of O(N).
Falaina
You can avoid the unbounded running time by sampling the sets in groups of 3 in array order. See Ganesh M's solution and my refinement.
Doug Currie
if we are ignoring worst case time, then my algorithm, which assumes the first element is the duplicate, finishes in constant time!
Peter Recore
@Peter Well, it is 50% likely to be right. :)
cletus
isn't there a saying that you can't get more than 2 out of the 3 variables - fast, cheap, correct? I'm not charging for this, and constant time is pretty fast. so i guess correctness is going to have to suffer :)
Peter Recore
@Jeremy: no big-O does not imply worst case. "big-O worst case" implies worst case. The fact that big-O provides an upper bound sometimes makes people think it must be describing the worst case, but this is not so. You can have an asymptotic upper bound on the average case, without that also being an asymptotic upper bound on the worst case. That would then be "big-O average case".
Steve Jessop
I think Jeremy's point is that worst case is the only case for this problem that actually makes it an interesting question. It is trivial to get a best case or average case algorithm that works in log(n) time.
Peter Recore
+4  A: 

My Answer was,

  1. Divide the N elements into N/3 parts, i.e. each part will have 3 elements.
  2. Now compare the 3 elements within each part against each other - 3 comparisons.
  3. At least one of the parts will have two copies of the same element. That gives the number.

Runtime - O(N)

Ganesh M
Now extrapolate that to a million records and you should see your logic is flawed. The problem with dealing with small numbers (like 10) is that you can make mistakes based on anomalies due to size. Things are often a lot clearer with larger sets.
cletus
Starting from one end of the array, the probability of finding 2 elements in common in each set of 3 is 50%. So, the asymptotic complexity is smaller than O(n).
Doug Currie
It depends on whether we're looking at expected case or worst case. A robust solution can be done in O(n). If you assume the array conforms and any duplicate is the N/2 element, then you're basically making assumptions and restating the problem, as well as looking at the probabilistic expected case.
cletus
@cletus - what flaw? Some set of three must have a duplicate, otherwise there aren't N/2 of the same value since there are only N/3 sets.
Doug Currie
Now restate the solution with an array of a million elements.
cletus
It is the same: for (i=0; i<N; i+=3) { if (a[i] == a[i+1] || a[i] == a[i+2]) return a[i]; if (a[i+1] == a[i+2]) return a[i+1]; } return -1; // impossible per problem spec
Doug Currie
Also your answer is wrong for 10 elements. Divide them into 3s and you have sets of 3, 3 and 4. You have to test the set of 4 and one of the sets of 3 because the repeated number could be distributed 1,1,3 or 2,2,1. So you're checking a set of 4 and a set of 3 at a minimum. Point is, it's still O(n).
cletus
No, it's 1/3,1/3,2/3,1/1 or 1/3,1/3,3/3,0/1 not 1/3,1/3,3/4; i.e., you don't have to look at the last element.
Doug Currie
That's correct. You don't have to look at the last element.
Ganesh M
Ok, so you don't have to look at the last. I'll give you that one. But the solution is still O(n).
cletus
Comparing 3 elements among each other might require 3 comparisons. Suppose you've done two comparisons and without loss of generality the item in common between them is a. Then if a != b and a != c, you still need to check whether b == c.
Steve Jessop
@onebyone, yes you are right. My mistake.
Ganesh M
@cletus, yes my solution is O(N).
Ganesh M
+11  A: 

Here is my attempt at a proof of why this cannot be done in less than O(n) array accesses (for worst case, which surely is the only interesting case in this example):

Assume a worst case log(n) algorithm exists. This algorithm accesses the array at most log(n) times. Since it can make no assumptions about which elements are where, let me choose which log(n) elements it sees. I will choose to give it the first log(n) unique elements. It has not found the duplicate yet, and there still exist n/2 - log(n) unique elements for me to feed it if need be. In fact, I cannot be forced to feed it a duplicated number until it has read n/2 elements. Therefore such an algorithm cannot exist.

From a purely intuitive standpoint, this just seems impossible. Log(4 billion) is 32. So with an array of 4 billion numbers, 2 billion of which are unique, in no particular order, there is a way to find the duplicated element by only checking 32 elements?
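
To make the adversary argument concrete, here is a minimal Python sketch of such an adversary (the class and names are just illustrative): it answers every probe of a new index with a fresh unique value until n/2 distinct indexes have been probed, and that answer pattern is always consistent with the problem statement.

class Adversary(object):
    # Reveals array values lazily. The first n/2 distinct indexes probed
    # all get fresh unique values; the remaining n/2 positions can then be
    # filled with a single repeated value, so the input stays legal and the
    # algorithm cannot have seen the duplicate yet.
    def __init__(self, n):
        self.n = n
        self.revealed = {}      # index -> value shown to the algorithm
        self.next_unique = 1

    def probe(self, index):
        if index not in self.revealed:
            if len(self.revealed) < self.n // 2:
                self.revealed[index] = self.next_unique
                self.next_unique += 1
            else:
                self.revealed[index] = 0   # now forced to start showing the duplicate
        return self.revealed[index]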

Peter Recore
Remember that O(log n) doesn't automatically imply that exactly log_2(n) array accesses are required. You can have a*log(n) + b accesses, for constants a and b. [Not that I think it's possible either.]
John Fouhy
True, my formal proof skills aren't up to the task of proving that if a and b are big enough to actually let you find the duplicated numbers, then they are necessarily proportional to n.
Peter Recore
+1 This is classic proof by feeding worst case numbers.
Kirill V. Lyadvinsky
The problem as stated need not be O(log(n)). If the claim is that it is better than O(n), and you have proven that, by the arrangement of the supplied list, it must perform at least n/2 lookups, then you have shown that the algorithm is linear or worse.
TokenMacGuy
I think I was using log(n) because that's what the OP specified as a desired target. You are correct that I could have made a stronger statement. In the end it doesn't matter, because Ganesh was apparently (based on his accepted answer) looking for average case performance, not worst case like most of us thought.
Peter Recore
A: 

Restating my solution from a comment to Ganesh's version so I can format it:

for (i=0; i<N-2; i+=3) {
   if (a[i] == a[i+1] || a[i] == a[i+2]) return a[i];
   if (a[i+1] == a[i+2]) return a[i+1];
}
return a[N-1]; // for very small N

Probability of winning after 1 iteration: 50%

Probability of winning after 2 iterations: 75%

Etc.

Worst case, O(n) time O(1) space.

Note that after N/4 iterations you've used up all the N/2 unique numbers, so this loop will never iterate through more than 3/4 of the array if it is as specified.

Doug Currie
Doesn't work on an array of 4 elements if the second match appears in the last position. Why look at three at once -- why not two?
hughdbrown
Thanks, changed last line to return a[N-1] -- that fixes the N=4 case and a few others for N<12. I think N=5 is the only special case now. Re 3 vs 2: looking at pairs rather than triples doesn't work if the values are perfectly distributed (even values random, odd values X, or vice versa).
Doug Currie
+6  A: 

For worst-case deterministic behavior, O(N) is correct (I've already seen more than one proof in the previous answers).

However, modern algorithmic theory isn't concerned just with worst-case behavior (that's why there are so many other big-somethings besides big-O, even though lazy programmers in a hurry often use big-O even when what they have in mind is closer to big-theta or big-omega;-), nor just with determinism (witness the Miller-Rabin primality test...;).

Any random sample of K < N items will show no duplicates with a probability that falls off exponentially in K -- each sampled item is the repeated value with probability about 1/2, so the chance of drawing it fewer than twice is roughly (K+1)/2**K -- easily and rapidly reduced to essentially as low as you wish no matter what N is (e.g. you could reduce it to less than the probability that a random cosmic ray will accidentally and undetectably flip a bit in your memory;-) -- this observation hardly requires the creativity Rabin and Miller needed to find their probabilistic prime testing approach;-).

This would make a pretty lousy interview question. Similar less-lousy questions are often posed, often mis-answered, and often mis-remembered by unsuccessful candidates. For example, a typical question might be, given an array of N items, not knowing whether there is a majority item, to determine whether there is one, and which one it is, in O(N) time and O(1) auxiliary space (so you can't just set up a hash table or something to count occurrences of different values). "Moore's Voting Approach" is a good solution (probably the best one) to that worthy interviewing question.
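
For reference, a minimal Python sketch of that voting approach (usually attributed to Boyer and Moore), using O(1) auxiliary space and two passes over the array:

def majority_element(arr):
    # Pass 1: keep a candidate and a counter; matching values increment,
    # differing values decrement, and a zero counter adopts a new candidate.
    candidate, count = None, 0
    for x in arr:
        if count == 0:
            candidate, count = x, 1
        elif x == candidate:
            count += 1
        else:
            count -= 1
    # Pass 2: verify the candidate really is a strict majority.
    if candidate is not None and arr.count(candidate) > len(arr) // 2:
        return candidate
    return None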

Another interesting variation: what if you have 10**12 64-bit numbers (8 Terabytes' worth of data overall, say on a bigtable or clone thereof), and as many machines as you want, each with about 4GB of RAM on a pretty fast LAN, say one that's substantially better than GB ethernet -- how do you shard the problem under those conditions? What if you have to use mapreduce / hadoop? What if you're free to design your own dedicated framework just for this one problem -- could you get better performance than with mapreduce? How much better, at the granularity of back-of-envelope estimation? I know of no published algorithm for THIS variant, so it may be a great test if you want to check a candidate's general facility with highly-distributed approaches to tera-scale computation...

Alex Martelli
A: 

First off, it's past my bed time and I should know better than to post code in public without trying it first, yada, yada. I hope the criticism I'll get will at least be educational. :-)

I believe the problem can be restated as: "Find the number that occurs more than once."

In the absolute worst case, we would need to iterate through a little more than half the list (1 + N/2) before we found the 2nd instance of a non-unique number.

Worst case example: array [] = { 1, 2, 3, 4, 5, 10, 10, 10, 10, 10 }

On average, though, we'd only need to iterate through 3 or 4 elements, since half of the elements will contain the non-unique number, i.e. roughly every other number.

Perfectly even distribution examples:

  • array [] = { 1, 10, 2, 10, 3, 10, 4, 10, 5, 10 }
  • array [] = { 10, 1, 10, 2, 10, 3, 10, 4, 10, 5 }

In other words, even if N = 1 million you would still, on average, only need to search the first 3 or 4 elements before you discovered a duplicate.

What's the big O notation for a fixed/constant runtime that doesn't increase with N?

Code:

int foundAt = -1;

for (int i=0; (i<N) && (foundAt==-1); i++)
{
    for (int j=i+1; j<N; j++)
    {
        if (array[i] == array[j])
        {
             foundAt = i;
             break;
        }
     }
}

int repeatedNumber = array[foundAt]; // the value that occurs more than once
Chris Bennet
This is an O(n*n) algorithm, right? Change your inner loop to an exchange and you have an insertion sort, a known O(n*n) sorting algorithm.
hughdbrown
Constant runtime is O(1). But big-O is about _worst-case_. That's why everyone's saying this is O(n). You can't make statements like "on average, the first 3 or 4 elements" without knowing something about the distribution of numbers. [Incidentally, your solution is O(n^2).]
John Fouhy
+2  A: 
Crashworks
why downvote this but not explain why?
Peter Recore
I like the approach. In the "Assume there is..." paragraph, I'm not sure your last sentence is 100% sound. In the least, the "a is an element of r" should be more like "a occurs more than once in r".
Jeremy Powell
Also, to nitpick, you should use multisets, as 'sets' connote unique elements.
Jeremy Powell
Make q not contain a.
Jeremy Powell
Ok, yeah. Your "Assume" paragraph makes a leap in logic from the statement of the algorithm being O(log n) and there being no subset blah blah blah. If this were sound, this would also work if the problem assumed the elements were sorted (which then DOES make it sublinear). Justify that step and I think you've got a great proof.
Jeremy Powell
+1 by the way. :)
Jeremy Powell
A: 

This is a poor interview question.

  1. You don't know the answer yourself.
  2. It doesn't have any business case behind it, so you will have difficulty explaining it to the candidate.

Mostly because of the first one. What are you looking for? That the candidate should come up with this O(log n) solution you don't know exists? If you have to ask StackOverflow, is this something you can reasonably expect a candidate to come up with in an interview?

Kevin Peterson
This is an interview question he was asked, not one he gave.
Jeremy Stein
Sorry Kevin, I didn't ask this question, I was asked by the interviewers ..
Ganesh M
This is a poor SO answer. I think the purpose of SO is to ask questions one **doesn't** know the answer to. While it's fun to ask everyone a neat question and then post the right answer, that's definitely not required. I imagine most of the SO community would have asked this question if they were in Ganesh's shoes.
Jeremy Powell
Jeremy, it's a perfectly fine answer to the question I thought was being asked. Sorry for providing an answer before the comments clarified what question it was. In an interview, it's important to know what you are looking for in an answer.
Kevin Peterson
+6  A: 

I think you simply need to scan through the array keeping a backlog of two elements. As N/2 elements are equal and the rest are guaranteed to be distinct, there must be one place i in your array where

a[i] == a[i-1] OR a[i] == a[i-2]

Iterate once through your array and you have roughly 2*N comparisons, which should be well inside O(N).

This answer is somewhat similar to the answers by Ganesh M and Dougie, but I think a little simpler.

Don Johe
Just make sure to initialize it correctly, check for the case of fewer than 4 elements, and start the loop at index 2. :-)
Nosredna
Well, indeed, but we are not here because we are surprised by the existence of border cases and loop initialization, are we?
Don Johe
+1  A: 

Suppose you have a Python algorithm like this:

import math
import random

def find_duplicate(arr, gap):
    cost, reps = 0, 0
    while True:
        indexes = sorted((random.randint(0,len(arr)-i-1) for i in xrange(gap)), reverse=True)
        selection = [arr.pop(i) for i in indexes]
        selection_set = set(selection)
        cost += len(selection)
        reps += 1
        if len(selection) > len(selection_set):
            return cost, reps

The idea is that arr is your set of values and gap is the log base-2 of the size. Each time you select gap elements and see if there are duplicated values. If so, return your cost (in count of elements examined) and the number of iterations (where you examine log2(size) elements per iteration). Otherwise, look at another gap-sized set.

The problem with benchmarking this algorithm is that the creation of the data each time through the loop and alteration of the data is expensive, assuming a large amount of data. (Initially, I was doing 1 000 000 elements with 10 000 000 iterations.)

So let's reduce to an equivalent problem. The data is passed in as n/2 unique elements and n/2 repeated elements. The algorithm picks the random indexes of log2(n) elements and checks for duplicates. Now we don't even have to create the data and to remove elements examined: we can just check if we have two or more indexes over the halfway point. Select gap indexes, check for 2 or more over the halfway point: return if found, otherwise repeat.

import math
import random

def find_duplicate(total, half, gap):
    cost, reps = 0, 0
    while True:
        indexes = [random.randint(0,total-i-1) for i in range(gap)]
        cost += gap
        reps += 1
        above_half = [i for i in indexes if i >= half]
        if len(above_half) >= 2:
            return cost, reps
        else:
            total -= len(indexes)
            half -= (len(indexes) - len(above_half))

Now drive the code like this:

if __name__ == '__main__':
    import sys
    import collections
    import datetime
    for total in [2**i for i in range(5, 21)]:
        half = total // 2
        gap = int(math.ceil(math.log10(total) / math.log10(2)))
        d = collections.defaultdict(int)
        total_cost, total_reps = 0, 1000*1000*10
        s = datetime.datetime.now()
        for _ in xrange(total_reps):
            cost, reps = find_duplicate(total, half, gap)
            d[reps] += 1
            total_cost += cost
        e = datetime.datetime.now()
        print "Elapsed: ", (e - s)
        print "%d elements" % total
        print "block size %d (log of # elements)" % gap
        for k in sorted(d.keys()):
            print k, d[k]
        average_cost = float(total_cost) / float(total_reps)
        average_logs = average_cost / gap
        print "Total cost: ", total_cost
        print "Average cost in accesses: %f" % average_cost
        print "Average cost in logs: %f" % average_logs
        print

If you try this test, you'll find that the number of times the algorithm has to do multiple selections declines with the number of elements in the data. That is, your average cost in logs asymptotically approaches 1.

elements    accesses    log-accesses
32          6.362279    1.272456
64          6.858437    1.143073
128         7.524225    1.074889
256         8.317139    1.039642
512         9.189112    1.021012
1024        10.112867   1.011287
2048        11.066819   1.006075
4096        12.038827   1.003236
8192        13.022343   1.001719
16384       14.013163   1.000940
32768       15.007320   1.000488
65536       16.004213   1.000263
131072      17.002441   1.000144
262144      18.001348   1.000075
524288      19.000775   1.000041
1048576     20.000428   1.000021

Now is this an argument for the ideal algorithm being log2(n) in the average case? Perhaps. It certainly is not so in the worst case.

Also, you don't have to pick log2(n) elements at once. You can pick 2 and check for equality (but in the degenerate case, you will not find the duplication at all), or pick any larger number and check it for duplication. At this point, all the algorithms that select elements and check for duplication are identical, varying only in how many they pick and how they identify duplication.

hughdbrown
A: 

By the way, it looks like a very similar question was asked today:

To find value of n repeated element in array of 2n element

Perhaps someone will find ideas posted there useful.

mloskot
A: 

Here is Don Johe's answer in Ruby:

#!/usr/bin/ruby1.8

def find_repeated_number(a)
  return nil unless a.size >= 3
  (0..a.size - 3).each do |i|
    [
      [0, 1],
      [0, 2],
      [1, 2],
    ].each do |j1, j2|
      return a[i + j1] if a[i + j1] == a[i + j2]
    end
  end
  nil
end

p find_repeated_number([1, 1, 2])   # => 1
p find_repeated_number([2, 3, 2])   # => 2
p find_repeated_number([4, 3, 3])   # => 3

O(n)

Wayne Conrad
A: 

Looks like an O(log n) solution to this problem is not possible. Please provide your thoughts.

Ravi
A: 

Perhaps the point of asking for an O(log(n)) solution at an interview is to see whether the candidate can provide the proof that no such solution exists.

How does the candidate cope with requirements that are clearly impossible?

Christian