views:

10654

answers:

17

I had an interesting job interview experience a while back. The question started really easy:

Q1: We have a bag containing numbers 1, 2, 3, …, 100. Each number appears exactly once, so there are 100 numbers. Now one number is randomly picked out of the bag. Find the missing number.

I've heard this interview question before, of course, so I very quickly answered along the lines of:

A1: Well, the sum of the numbers 1 + 2 + 3 + … + N is N(N+1)/2 (see Wikipedia: sum of arithmetic series). For N = 100, the sum is 5050.

Thus, if all numbers are present in the bag, the sum will be exactly 5050. Since one number is missing, the sum will be less than this, and the difference is that number. So we can find that missing number in O(N) time and O(1) space.
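A1 as a quick Python sketch (the bag is modeled as a plain list; the names are mine, not from the interview):

```python
def find_missing(bag, n):
    """Find the single missing number from 1..n in O(n) time, O(1) space."""
    expected = n * (n + 1) // 2   # sum of 1..n
    return expected - sum(bag)

# Example: 1..100 with 42 removed
bag = [i for i in range(1, 101) if i != 42]
print(find_missing(bag, 100))  # -> 42
```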

At this point I thought I had done well, but all of a sudden the question took an unexpected turn:

Q2: That is correct, but now how would you do this if TWO numbers are missing?

I had never seen/heard/considered this variation before, so I panicked and couldn't answer the question. The interviewer insisted on knowing my thought process, so I mentioned that perhaps we can get more information by comparing against the expected product, or perhaps doing a second pass after having gathered some information from the first pass, etc, but I really was just shooting in the dark rather than actually having a clear path to the solution.

The interviewer did try to encourage me by saying that having a second equation is indeed one way to solve the problem. At this point I was kind of upset (for not knowing the answer beforehand), and asked if this is a general (read: "useful") programming technique, or just a trick/gotcha answer.

The interviewer's answer surprised me: you can generalize the technique to find 3 missing numbers. In fact, you can generalize it to find k missing numbers.

Qk: If exactly k numbers are missing from the bag, how would you find them efficiently?

This was a few months ago, and I still can't figure out what this technique is. Obviously there's an Ω(N) time lower bound, since we must scan all the numbers at least once, but the interviewer insisted that the TIME and SPACE complexity of the solving technique (excluding the O(N) input scan) is defined in terms of k, not N.

So the question here is simple:

  • How would you solve Q2?
  • How would you solve Q3?
  • How would you solve Qk?

Clarifications

  • Generally there are N numbers from 1..N, not just 1..100.
  • I'm not looking for the obvious set-based solution, e.g. using a bit set, encoding the presence/absence of each number by the value of a designated bit, and therefore using O(N) bits of additional space. We can't afford any additional space proportional to N.
  • I'm also not looking for the obvious sort-first approach. This and the set-based approach are worth mentioning in an interview (they are easy to implement, and depending on N, can be very practical). I'm looking for the Holy Grail solution (which may or may not be practical to implement, but has the desired asymptotic characteristics nevertheless).

So again, of course you must scan the input in O(N), but you can only capture a small amount of information (defined in terms of k, not N), and must then find the k missing numbers somehow.

+32  A: 

We can solve Q2 by summing both the numbers themselves, and the squares of the numbers.

We can then reduce the problem to

k1 + k2 = x
k1^2 + k2^2 = y

Where x and y are how far the sums are below the expected values.

Substituting gives us:

(x-k2)^2 + k2^2 = y

Which we can then solve to determine our missing numbers.
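As a sketch of the above in Python (variable names match the equations; this assumes the input really is 1..n minus two values, so the discriminant is a perfect square):

```python
import math

def find_two_missing(bag, n):
    """Recover the two missing numbers from 1..n using the sum
    and the sum of squares, as described above."""
    x = n * (n + 1) // 2 - sum(bag)                # k1 + k2
    y = n * (n + 1) * (2 * n + 1) // 6 - sum(v * v for v in bag)  # k1^2 + k2^2
    # (x - k2)^2 + k2^2 = y  =>  2*k2^2 - 2*x*k2 + (x^2 - y) = 0
    d = math.isqrt(2 * y - x * x)                  # |k2 - k1|
    k2 = (x + d) // 2                              # the larger missing number
    k1 = x - k2
    return k1, k2

bag = [i for i in range(1, 101) if i not in (13, 77)]
print(find_two_missing(bag, 100))  # -> (13, 77)
```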

Anon.
+1; I've tried the formula in Maple for select numbers and it works. I still couldn't convince myself WHY it works, though.
polygenelubricants
@polygenelubricants: Rearranging the first equation gives `k1 = x - k2`, and we can substitute that into the second equation. You can also generalize this to higher k - though it becomes computationally expensive quite rapidly.
Anon.
@Anon: let's restrict discussion to Q2 for now. Can you show that this will always result in a solution? (i.e. no ambiguity, no unsolvable, always unique and correct solution).
polygenelubricants
@polygenelubricants: If you wanted to prove correctness, you would first show that it always provides *a* correct solution (that is, it always produces a pair of numbers which, when removing them from the set, would result in the remainder of the set having the observed sum and sum-of-squares). From there, proving uniqueness is as simple as showing that it only produces one such pair of numbers.
Anon.
The nature of the equations means that you will get two values of k2 from the quadratic. However, from the first equation that you use to generate k1, you can see that the other value of k2 simply gives you k1, so the two solutions are the same pair of numbers the opposite way around. If you arbitrarily declared that k1 > k2, you'd only have one solution to the quadratic equation and thus one solution overall. And clearly by the nature of the question an answer always exists, so it always works.
Chris
For a given sum k1+k2, there are many pairs. We can write these pairs as k1 = a+b and k2 = a-b, where a = (k1+k2)/2. a is fixed for a given sum. The sum of the squares is (a+b)^2 + (a-b)^2 = 2(a^2 + b^2). For a given sum k1+k2, the a^2 term is fixed, and the sum of squares is then determined by the b^2 term (with b >= 0). Therefore, the values x and y uniquely determine the pair of integers.
phkahler
+1  A: 

Can you check if every number exists? If yes you may try this:

S = sum of all numbers in the bag (S < 5050)
Z = sum of the missing numbers = 5050 - S

If the missing numbers are x and y, then x = Z - y and max(x) = Z - 1. So you check the range from 1 to max(x) and find the number.

Ilian Iliev
+5  A: 

Not sure if it's the most efficient solution, but I would loop over all entries, use a bitset to remember which numbers are present, and then test for 0 bits.

I like simple solutions, and I even believe that it might be faster than calculating the sum, or the sum of squares, etc.

Chris Lercher
I did propose this obvious answer, but this is not what the interviewer wanted. I explicitly said in the question that this is not the answer I'm looking for. Another obvious answer: sort first. Neither the `O(N)` counting sort nor `O(N log N)` comparison sort is what I'm looking for, although they are both very simple solutions.
polygenelubricants
@polygenelubricants: I can't find where you said that in your question. If you consider the bitset to be the result, then there is no second pass. The complexity is O(1) (if we consider N to be constant, as the interviewer suggests by saying that the complexity is "defined in *k* not N"), and if you need to construct a more "clean" result, you get O(k), which is the best you can get, because you always need O(k) to create the clean result.
Chris Lercher
@chris_l: "Note that I'm not looking for the obvious set-based solution (e.g. using a bit set,". The second last paragraph from the original question.
hrnt
@hrnt: Yes, the question was edited a few minutes ago. I'm just giving the answer that I would expect from an interviewee... Artificially constructing a sub-optimal solution (you can't beat O(n) + O(k) time, no matter what you do) doesn't make sense to me - except if you can't afford O(n) additional space, but the question isn't explicit on that.
Chris Lercher
@chris: I've edited the question again to further clarify. I do appreciate the feedback/answer.
polygenelubricants
+6  A: 

I haven't checked the maths, but I suspect that computing Σ(n^2) in the same pass as we compute Σ(n) would provide enough info to get two missing numbers. Do Σ(n^3) as well if there are three, and so on.

AakashM
A: 

For different values of k, the approach will be different, so there won't be a generic answer in terms of k. E.g. for k=1 one can take advantage of the sum of the natural numbers, but for k = n/2 one has to use some kind of bitset. In the same way, for k = n-1 one can simply compare the only number in the bag with the rest.

bhups
A: 

Very nice problem. I'd go for using a set difference for Qk. A lot of programming languages even have support for it, like in Ruby:

missing = (1..100).to_a - bag

It's probably not the most efficient solution, but it's one I would use in real life if I were faced with such a task in this case (known boundaries, low boundaries). If the set of numbers were very large, then I would consider a more efficient algorithm, of course, but until then the simple solution would be enough for me.

DarkDust
A: 

Have I missed something? Can't you just put the missing numbers in a list?

List missingList = {};
For i=1 to 100 do
    if (i is missing) missingList.add(i);
EndFor

missingList will have space complexity O(k), and if List is a Vector, then adding a number is O(1) in time.

Paxinum
What's the implementation of your "is missing" function?
JeremyP
Your algorithm's complexity is O(N^2): `is missing` is O(N) and the for loop is O(N).
ArsenMkrt
@ArsenMkrt: as many proposed, you could use a set. Transforming whatever-a-bag-is to a set is O(N), and the second pass (the loop) is O(N), yielding O(N).
back2dos
+3  A: 

Wait a minute. As the question is stated, there are 100 numbers in the bag. No matter how big k is, the problem can be solved in constant time because you can use a set and remove numbers from the set in at most 100 - k iterations of a loop. 100 is constant. The set of remaining numbers is your answer.

If we generalise the solution to the numbers from 1 to N, nothing changes except N is not a constant, so we are in O(N - k) = O(N) time. For instance, if we use a bit set, we set the bits to 1 in O(N) time, iterate through the numbers, setting the bits to 0 as we go (O(N-k) = O(N)) and then we have the answer.

It seems to me that the interviewer was asking you how to print out the contents of the final set in O(k) time rather than O(N) time. Clearly, with a bit set, you have to iterate through all N bits to determine whether you should print the number or not. However, if you change the way the set is implemented you can print out the numbers in k iterations. This is done by putting the numbers into an object to be stored in both a hash set and a doubly linked list. When you remove an object from the hash set, you also remove it from the list. The answers will be left in the list which is now of length k.
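The hash-set-plus-doubly-linked-list combination described above can be sketched in Python with an insertion-ordered dict, which gives the same O(1) deletion and O(k) enumeration of what remains (a sketch of the idea, not the answerer's code):

```python
def missing_via_ordered_set(bag, n):
    """Delete each seen number from an ordered set; what remains
    are the k missing numbers, enumerable in O(k)."""
    # An insertion-ordered dict stands in for hash set + linked list.
    remaining = dict.fromkeys(range(1, n + 1))  # O(n) setup
    for v in bag:                               # O(n - k) deletions, O(1) each
        del remaining[v]
    return list(remaining)                      # O(k) to list the answer

bag = [i for i in range(1, 101) if i not in (5, 50, 95)]
print(missing_via_ordered_set(bag, 100))  # -> [5, 50, 95]
```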

JeremyP
This answer is too simple, and we all know that simple answers don't work! ;) Seriously though, original question should probably emphasize O(k) space requirement.
DK
+49  A: 

You will find it by reading the first couple of pages of this: Muthukrishnan - Data Stream Algorithms: Puzzle 1: Finding Missing Numbers. It shows exactly the generalization you are looking for. Probably this is what your interviewer read before posing these questions.

Now, if only people would start deleting the answers that are subsumed or superseded by Muthukrishnan's treatment, and make this text easier to find. :)

Also see sdcvvc's directly related answer, which also includes pseudocode (hurray! no need to read those tricky math formulations :)) (thanks, great work!).

Dimitris Andreou
How do you translate *that* into code?!?
Ubersoldat
Oooh... That's interesting. I have to admit I got a bit confused by the maths, but I was just skimming it. Might leave it open to look at more later. :) And +1 to make this link more findable. ;-)
Chris
The Google Books link doesn't work for me. Here's a [better version](http://www.cs.rutgers.edu/~muthu/stream-1-1.ps) [PostScript file].
Heinrich Apfelmus
Wow. I didn't expect this to get upvoted! Last time I posted a reference to the solution (Knuth's, in that case) instead of trying to solve it myself, it was actually downvoted: http://stackoverflow.com/questions/3060104/how-to-implement-three-stacks-using-a-single-array/3077753#3077753 The librarian inside me rejoices, thanks :)
Dimitris Andreou
@Apfelmus, note that this is a draft. (I don't blame you of course, I confused the draft for the real thing for almost a year before finding the book). Btw if the link doesn't work, you can go to http://books.google.com/ and search for "Muthukrishnan data stream algorithms" (without quotes); it's the first result.
Dimitris Andreou
Yeah, but Google only shows me the TOC, not the full text. :-/
Heinrich Apfelmus
+69  A: 

Here's a summary of Dimitris Andreou's link.

Remember the sums of i-th powers, where i = 1, 2, ..., k. This reduces the problem to solving the system of equations

a1 + a2 + ... + ak = b1

a1^2 + a2^2 + ... + ak^2 = b2

...

a1^k + a2^k + ... + ak^k = bk

Using Newton's identities, knowing the bi allows one to compute

c1 = a1 + a2 + ... + ak

c2 = a1*a2 + a1*a3 + ... + a(k-1)*ak

...

ck = a1*a2*...*ak

If you expand the polynomial (x-a1)...(x-ak), the coefficients will be exactly c1, ..., ck, up to sign (see Viète's formulas). Since every polynomial factors uniquely (the ring of polynomials is a Euclidean domain), this means the ai are uniquely determined, up to permutation.

This ends a proof that remembering powers is enough to recover the numbers. For constant k, this is a good approach.

However, when k is varying, the direct approach of computing c1,...,ck is prohibitively expensive, since e.g. ck is the product of all missing numbers, of magnitude n!/(n-k)!. To overcome this, perform the computations in the field Z_q, where q is a prime such that n <= q < 2n (one exists by Bertrand's postulate). The proof doesn't need to be changed, since the formulas still hold and factorization of polynomials is still unique. You also need an algorithm for factorization over finite fields, for example Berlekamp's or Cantor-Zassenhaus.

High level pseudocode for constant k:

  • Compute i-th powers of given numbers
  • Subtract to get sums of i-th powers of unknown numbers. Call the sums bi.
  • Use Newton's identities to compute coefficients from bi; call them ci. Basically, c1 = b1; c2 = (c1b1 - b2)/2; see Wikipedia for exact formulas
  • Factor the polynomial x^k - c1*x^(k-1) + c2*x^(k-2) - ... + (-1)^k*ck.
  • The roots of the polynomial are the needed numbers a1, ..., ak.

For varying k, find a prime n <= q < 2n using e.g. Miller-Rabin, and perform the steps with all numbers reduced modulo q.

As Heinrich Apfelmus commented, instead of a prime q you can use q = 2^⌈log n⌉ and perform the arithmetic in the finite field with q elements.
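A sketch of the constant-k recipe in Python for k = 3 (a naive O(n) integer root search stands in for a real polynomial-factoring step, and the function name is mine):

```python
def find_three_missing(bag, n):
    """Recover three missing numbers from 1..n via power sums
    and Newton's identities."""
    # Power sums b1, b2, b3 of the missing numbers.
    b1, b2, b3 = (sum(i**j for i in range(1, n + 1)) - sum(v**j for v in bag)
                  for j in (1, 2, 3))
    # Newton's identities give the elementary symmetric polynomials.
    c1 = b1
    c2 = (c1 * b1 - b2) // 2
    c3 = (c2 * b1 - c1 * b2 + b3) // 3
    # The roots of x^3 - c1 x^2 + c2 x - c3 are the missing numbers;
    # a trial search over 1..n suffices for this demonstration.
    return [x for x in range(1, n + 1)
            if x**3 - c1 * x**2 + c2 * x - c3 == 0]

bag = [i for i in range(1, 101) if i not in (4, 31, 59)]
print(find_three_missing(bag, 100))  # -> [4, 31, 59]
```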

sdcvvc
Thanks for doing this!
Dimitris Andreou
You don't have to use a prime field, you can also use `q = 2^(log n)`. (How did you make the super- and subscripts?!)
Heinrich Apfelmus
Also, you can calculate the c_k on the fly, without using the power sums, thanks to the formula $c^{k+1}_m = c^k_{m+1} + c^k_m x_{k+1}$ where the superscript $k$ denotes the number of variables and $m$ the degree of the symmetric polynomial.
Heinrich Apfelmus
@Heinrich Apfelmus Thanks! I'm not sure about the power of two approach - in general factorization would be much tricker, since there are divisors of zero and no uniqueness. To make sub/superscripts in answers, HTML tags can be used, in comments I think it is not available.
sdcvvc
+1 This is really, really clever. At the same time, it's questionable whether it's really worth the effort, or whether (parts of) this solution to a quite artificial problem can be reused in another way. And even if this were a real-world problem, on many platforms the most trivial `O(N^2)` solution will probably outperform this beauty for even reasonably high `N`. Makes me think of this: http://tinyurl.com/c8fwgw Nonetheless, great work! I wouldn't have had the patience to crawl through all the math :)
back2dos
@sdcvvc Ah, I mean the [finite field](http://en.wikipedia.org/wiki/Finite_field) with `q=2^(log n)` elements, not the ring of bit vectors with `log n` bits. Being a field, the former doesn't have any zero divisors; of course, multiplication is a bit more complicated than a simple bit-wise AND.
Heinrich Apfelmus
@Heinrich Apfelmus: Thanks, added. @back2dos: The book in Dimitris's answer mentions applications of these techniques in communication complexity (set reconcilation).
sdcvvc
A: 

Try to find the product of the numbers from 1 to 100: let P1 = 1 x 2 x 3 x ... x 100. As you take the numbers out one by one, multiply them together so that you get the product P2. Since two numbers are missing, P2 < P1, and the product of the two missing terms is a x b = P1 / P2. You already know the sum a + b = S1.

From the above 2 equations, solve for a and b through the quadratic formula; a and b are your missing numbers :-)

Manjunath
+3  A: 

This might sound stupid, but in the first problem presented to you, you would have to see all the remaining numbers in the bag to actually add them up and find the missing number using that equation. So, since you get to see all the numbers, why don't you just look for the number that's missing? The same goes for when two numbers are missing. Pretty simple, I think. There's no point in using an equation when you get to see the numbers remaining in the bag.

Stephan M
I think the benefit of summing them up is that you don't have to remember which numbers you've already seen (e.g., there's no extra memory requirement). Otherwise the only option is to retain a set of all the values seen and then iterate over that set again to find the one that's missing.
Dan Tao
This question is usually asked with the stipulation of O(1) space complexity.
Beh Tou Cheh
+3  A: 

The problem with solutions based on sums of numbers is that they don't take into account the cost of storing and working with numbers with large exponents... in practice, for it to work for very large n, a big-number library would be used. With that in mind, we can analyse the time and space complexity of sdcvvc's and Dimitris Andreou's algorithms.

Storage:

l_j = ceil (log_2 (sum_{i=1}^n i^j))
l_j > log_2 n^j  (assuming n >= 0, k >= 0)
l_j > j log_2 n \in \Omega(j log n)

l_j < log_2 ((sum_{i=1}^n i)^j) + 1
l_j < j log_2 (n) + j log_2 (n + 1) - j log_2 (2) + 1
l_j < j log_2 n + j + c \in O(j log n)

So l_j \in \Theta(j log n)

Total storage used: \sum_{j=1}^k l_j \in \Theta(k^2 log n)

Time used: assuming that computing a^j takes ceil(log_2 j) time, the total time is:

t = k ceil(\sum_i=1^n log_2 (i)) = k ceil(log_2 (\prod_i=1^n (i)))
t > k log_2 (n^n + O(n^(n-1)))
t > k log_2 (n^n) = kn log_2 (n)  \in \Omega(kn log n)
t < k log_2 (\prod_i=1^n i^i) + 1
t < kn log_2 (n) + 1 \in O(kn log n)

Total time used: \Theta(kn log n)

If this time and space is satisfactory, you can use a simple recursive algorithm. Let b!i be the ith entry in the bag, n the number of numbers before removals, and k the number of removals. In Haskell syntax...

let
  -- O(1)
  isInRange low high v = (v >= low) && (v <= high)
  -- O(n - k): count how many bag entries fall in [low, high]
  countInRange low high = sum $ map (fromEnum . isInRange low high . (b !)) [1..(n-k)]
  findMissing l low high krange
    -- O(1) if there is nothing to find.
    | krange == 0 = l
    -- O(1) if there is only one possibility.
    | low == high = low:l
    -- Otherwise a total of O(kn log n) time.
    | otherwise =
       let
         mid = (low + high) `div` 2
         -- number of values missing from [low, mid]
         klow = (mid - low + 1) - countInRange low mid
         khigh = krange - klow
       in
         findMissing (findMissing l low mid klow) (mid + 1) high khigh
in
  findMissing [] 1 n k

Storage used: O(k) for the list, O(log(n)) for the stack: O(k + log(n)) total. This algorithm is more intuitive, has the same time complexity, and uses less space.
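The same divide-and-count recursion, sketched in Python for readers who prefer it to Haskell (a direct translation, not tuned):

```python
def find_missing_by_counting(bag, n, k):
    """Binary search on value ranges: count bag elements in each half,
    recurse only into halves that are missing something.
    O(k + log n) extra space beyond the input."""
    def go(low, high, krange, acc):
        if krange == 0:                  # nothing missing in this range
            return acc
        if low == high:                  # exactly one candidate left
            acc.append(low)
            return acc
        mid = (low + high) // 2
        present = sum(1 for v in bag if low <= v <= mid)
        klow = (mid - low + 1) - present  # how many are missing in [low, mid]
        go(low, mid, klow, acc)
        go(mid + 1, high, krange - klow, acc)
        return acc
    return go(1, n, k, [])

bag = [i for i in range(1, 101) if i not in (2, 71)]
print(find_missing_by_counting(bag, 100, 2))  # -> [2, 71]
```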

a1kmm
+1, looks nice but you lost me going from line 4 to line 5 in snippet #1 -- could you explain that further? Thanks!
j_random_hacker
A: 

But you said: "We have a bag containing numbers 1, 2, 3, …, 100. Each number appears exactly once". If you know the sum of the two missing numbers, like you did in the first case, then you can find the two numbers by looking them up in a pre-prepared table. The key point is that the sum uniquely identifies the two numbers. If the sum is 3, the two numbers have to be 1 and 2; if it is 17, the two numbers can only be 10 and 7, etc...

victor
If the sum of the two missing numbers is 17, can't the two numbers also be 11 and 6 ? or 9 and 8 ?
Fanfan
Wow, what a bold statement.
Dan Tao
A: 

You could try using a Bloom filter. Insert each number in the bag into the filter, then iterate over the complete 1..N set, reporting each number the filter says is absent. This may not find the answer in all scenarios (false positives can hide a missing number), but it might be a good enough solution.
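A minimal sketch of the Bloom-filter idea (the filter is hand-rolled here with illustrative parameters; a real use would size the filter from N and the acceptable false-positive rate). Because Bloom filters have no false negatives, everything reported is genuinely missing; false positives may merely cause a missing number to go unreported:

```python
import hashlib

def _hashes(value, size, count=3):
    """Derive `count` bit positions for `value` via double hashing
    (names and parameters are illustrative, not from the answer)."""
    digest = hashlib.sha256(str(value).encode()).digest()
    h1 = int.from_bytes(digest[:8], "big")
    h2 = int.from_bytes(digest[8:16], "big")
    return [(h1 + i * h2) % size for i in range(count)]

def probably_missing(bag, n, size=4096):
    bits = [False] * size
    for v in bag:                                   # insert every number seen
        for pos in _hashes(v, size):
            bits[pos] = True
    # A number is reported only if at least one of its bits is unset.
    return [x for x in range(1, n + 1)
            if not all(bits[pos] for pos in _hashes(x, size))]

bag = [i for i in range(1, 101) if i not in (8, 80)]
print(probably_missing(bag, 100))
```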

jdizzle
A: 

I'd take a different approach to that question and probe the interviewer for more details about the larger problem he's trying to solve. Depending on the problem and the requirements surrounding it, the obvious set-based solution might be the right thing and the generate-a-list-and-pick-through-it-afterward approach might not.

For example, it might be that the interviewer is going to dispatch n messages and needs to know the k that didn't result in a reply and needs to know it in as little wall clock time as possible after the n-kth reply arrives. Let's also say that the message channel's nature is such that even running at full bore, there's enough time to do some processing between messages without having any impact on how long it takes to produce the end result after the last reply arrives. That time can be put to use inserting some identifying facet of each sent message into a set and deleting it as each corresponding reply arrives. Once the last reply has arrived, the only thing to be done is to remove its identifier from the set, which in typical implementations takes O(log(k+1)). After that, the set contains the list of k missing elements and there's no additional processing to be done.

This certainly isn't the fastest approach for batch processing pre-generated bags of numbers, because the whole thing runs in O((log 1 + log 2 + ... + log n) + (log n + log(n-1) + ... + log k)). But it does work for any value of k (even if it's not known ahead of time), and in the example above it was applied in a way that minimizes the most critical interval.

Blrfl
A: 

I believe I have an O(k) time and O(log(k)) space algorithm, given that you have the floor(x) and log2(x) functions for arbitrarily big integers available:

You have an N-bit integer used as a bitset: start with bits 1..N all set, s = 2^1 + 2^2 + ... + 2^N, and clear bit x (s -= 2^x) for each number x you find in the bag. This takes O(N) time (which is not a problem for the interviewer). At the end, j = floor(log2(s)) is the biggest missing number. Then set s = s - 2^j and repeat:

for (i = 0; i < k; i++) { j = floor(log2(s)); missing[i] = j; s -= 2^j; }

Now, you usually don't have floor and log2 functions for 2756-bit integers, but instead for doubles. So? Simply, for each 2 bytes (or 1, or 3, or 4) you can use these functions to get the desired numbers, but this adds an O(N) factor to the time complexity.
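This bitset-in-a-big-integer idea is directly expressible in Python, whose ints are arbitrary precision and whose `bit_length()` plays the role of floor(log2(s)) + 1. A sketch (names are mine):

```python
def missing_via_bigint(bag, n):
    """Track 1..n as bits of one big integer; peel the missing
    numbers off from the most significant bit down."""
    s = (1 << (n + 1)) - 2          # bits 1..n set: 2^1 + 2^2 + ... + 2^n
    for v in bag:
        s -= 1 << v                 # clear the bit of each number seen
    missing = []
    while s:
        j = s.bit_length() - 1      # index of the highest remaining bit
        missing.append(j)
        s -= 1 << j
    return missing

bag = [i for i in range(1, 101) if i not in (3, 97)]
print(missing_via_bigint(bag, 100))  # -> [97, 3]
```

Note that the big integer itself still occupies O(N) bits, so this is really the bitset solution in disguise; only the extraction loop is O(k).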

CostasGR43