views:

10654

answers:

17

I had an interesting job interview experience a while back. The question started really easy:

Q1: We have a bag containing numbers 1, 2, 3, …, 100. Each number appears exactly once, so there are 100 numbers. Now one number is randomly picked out of the bag. Find the missing number.

I've heard this interview question before, of course, so I very quickly answered along the lines of:

A1: Well, the sum of the numbers 1 + 2 + 3 + … + N is N(N+1)/2 (see Wikipedia: sum of arithmetic series). For N = 100, the sum is 5050.

Thus, if all numbers are present in the bag, the sum will be exactly 5050. Since one number is missing, the sum will be less than this, and the difference is that number. So we can find that missing number in O(N) time and O(1) space.
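A1 as a quick Python sketch (the bag is modeled as a plain list; the names are mine, not from the interview):

```python
def find_missing(bag, n):
    """Find the single missing number from 1..n in O(n) time, O(1) space."""
    expected = n * (n + 1) // 2   # sum of 1..n
    return expected - sum(bag)

# Example: 1..100 with 42 removed
bag = [i for i in range(1, 101) if i != 42]
print(find_missing(bag, 100))  # -> 42
```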

At this point I thought I had done well, but all of a sudden the question took an unexpected turn:

Q2: That is correct, but now how would you do this if TWO numbers are missing?

I had never seen/heard/considered this variation before, so I panicked and couldn't answer the question. The interviewer insisted on knowing my thought process, so I mentioned that perhaps we can get more information by comparing against the expected product, or perhaps doing a second pass after having gathered some information from the first pass, etc, but I really was just shooting in the dark rather than actually having a clear path to the solution.

The interviewer did try to encourage me by saying that having a second equation is indeed one way to solve the problem. At this point I was kind of upset (for not knowing the answer beforehand), and asked if this is a general (read: "useful") programming technique, or just a trick/gotcha answer.

The interviewer's answer surprised me: you can generalize the technique to find 3 missing numbers. In fact, you can generalize it to find k missing numbers.

Qk: If exactly k numbers are missing from the bag, how would you find them efficiently?

This was a few months ago, and I still can't figure out what this technique is. Obviously there's an Ω(N) time lower bound, since we must scan all the numbers at least once, but the interviewer insisted that the TIME and SPACE complexity of the solving technique (excluding the O(N) input scan) is defined in terms of k, not N.

So the question here is simple:

  • How would you solve Q2?
  • How would you solve Q3?
  • How would you solve Qk?

Clarifications

  • Generally there are N numbers from 1..N, not just 1..100.
  • I'm not looking for the obvious set-based solution, e.g. using a bit set, encoding the presence/absence of each number by the value of a designated bit, and therefore using O(N) bits of additional space. We can't afford any additional space proportional to N.
  • I'm also not looking for the obvious sort-first approach. This and the set-based approach are worth mentioning in an interview (they are easy to implement, and depending on N, can be very practical). I'm looking for the Holy Grail solution (which may or may not be practical to implement, but has the desired asymptotic characteristics nevertheless).

So again, of course you must scan the input in O(N), but you can only capture a small amount of information (defined in terms of k, not N), and must then find the k missing numbers somehow.

+32  A: 

We can solve Q2 by summing both the numbers themselves, and the squares of the numbers.

We can then reduce the problem to

k1 + k2 = x
k1^2 + k2^2 = y

Where x and y are how far the sums are below the expected values.

Substituting gives us:

(x-k2)^2 + k2^2 = y

Which we can then solve to determine our missing numbers.
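As a sketch of the above in Python (variable names match the equations; this assumes the input really is 1..n minus two values, so the discriminant is a perfect square):

```python
import math

def find_two_missing(bag, n):
    """Recover the two missing numbers from 1..n using the sum
    and the sum of squares, as described above."""
    x = n * (n + 1) // 2 - sum(bag)                # k1 + k2
    y = n * (n + 1) * (2 * n + 1) // 6 - sum(v * v for v in bag)  # k1^2 + k2^2
    # (x - k2)^2 + k2^2 = y  =>  2*k2^2 - 2*x*k2 + (x^2 - y) = 0
    d = math.isqrt(2 * y - x * x)                  # |k2 - k1|
    k2 = (x + d) // 2                              # the larger missing number
    k1 = x - k2
    return k1, k2

bag = [i for i in range(1, 101) if i not in (13, 77)]
print(find_two_missing(bag, 100))  # -> (13, 77)
```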

Anon.
+1; I've tried the formula in Maple for select numbers and it works. I still couldn't convince myself WHY it works, though.
polygenelubricants
@polygenelubricants: Rearranging the first equation gives `k1 = x - k2`, and we can substitute that into the second equation. You can also generalize this to higher k - though it becomes computationally expensive quite rapidly.
Anon.
@Anon: let's restrict discussion to Q2 for now. Can you show that this will always result in a solution? (i.e. no ambiguity, no unsolvable, always unique and correct solution).
polygenelubricants
@polygenelubricants: If you wanted to prove correctness, you would first show that it always provides *a* correct solution (that is, it always produces a pair of numbers which, when removing them from the set, would result in the remainder of the set having the observed sum and sum-of-squares). From there, proving uniqueness is as simple as showing that it only produces one such pair of numbers.
Anon.
The nature of the equations means that you will get two values of k2 from the quadratic. However, from the first equation that you use to generate k1, you can see that the other value of k2 simply gives you k1, so the two solutions are the same pair of numbers the opposite way around. If you arbitrarily declared that k1 > k2, you'd only have one solution to the quadratic equation and thus one solution overall. And clearly by the nature of the question an answer always exists, so it always works.
Chris
For a given sum k1+k2, there are many pairs. We can write these pairs as k1 = a+b and k2 = a-b, where a = (k1+k2)/2. a is fixed for a given sum. The sum of the squares is (a+b)^2 + (a-b)^2 = 2(a^2 + b^2). For a given sum k1+k2, the a^2 term is fixed, and the sum of squares is then determined by the b^2 term (with b >= 0). Therefore, the values x and y uniquely determine the pair of integers.
phkahler
+1  A: 

Can you check if every number exists? If yes you may try this:

S = sum of all numbers in the bag (S < 5050)
Z = sum of the missing numbers = 5050 - S

If the missing numbers are x and y, then x = Z - y and max(x) = Z - 1. So you check the range from 1 to max(x) and find the number.

Ilian Iliev
+5  A: 

Not sure if it's the most efficient solution, but I would loop over all entries, use a bitset to remember which numbers are present, and then test for 0 bits.

I like simple solutions, and I even believe that it might be faster than calculating the sum, or the sum of squares, etc.

Chris Lercher
I did propose this obvious answer, but this is not what the interviewer wanted. I explicitly said in the question that this is not the answer I'm looking for. Another obvious answer: sort first. Neither the `O(N)` counting sort nor `O(N log N)` comparison sort is what I'm looking for, although they are both very simple solutions.
polygenelubricants
@polygenelubricants: I can't find where you said that in your question. If you consider the bitset to be the result, then there is no second pass. The complexity is O(1) (if we consider N to be constant, as the interviewer suggests by saying that the complexity is "defined in *k* not N"), and if you need to construct a more "clean" result, you get O(k), which is the best you can get, because you always need O(k) to create the clean result.
Chris Lercher
@chris_l: "Note that I'm not looking for the obvious set-based solution (e.g. using a bit set,". The second last paragraph from the original question.
hrnt
@hrnt: Yes, the question was edited a few minutes ago. I'm just giving the answer that I would expect from an interviewee... Artificially constructing a sub-optimal solution (you can't beat O(n) + O(k) time, no matter what you do) doesn't make sense to me - except if you can't afford O(n) additional space, but the question isn't explicit on that.
Chris Lercher
@chris: I've edited the question again to further clarify. I do appreciate the feedback/answer.
polygenelubricants
+6  A: 

I haven't checked the maths, but I suspect that computing Σ(n^2) in the same pass as we compute Σ(n) would provide enough info to get two missing numbers. Do Σ(n^3) as well if there are three, and so on.

AakashM
A: 

For different values of k, the approach will be different, so there won't be a generic answer in terms of k. E.g. for k=1 one can take advantage of the sum of the natural numbers, but for k = n/2 one has to use some kind of bitset. In the same way, for k = n-1 one can simply compare the only number in the bag with the rest.

bhups
A: 

Very nice problem. I'd go for using a set difference for Qk. A lot of programming languages even have support for it, like in Ruby:

missing = (1..100).to_a - bag

It's probably not the most efficient solution, but it's one I would use in real life if I were faced with such a task in this case (known boundaries, low boundaries). If the set of numbers were very large, then I would consider a more efficient algorithm, of course, but until then the simple solution would be enough for me.

DarkDust
A: 

Have I missed something? Can't you just put the missing numbers in a list?

List missingList = {};
For i=1 to 100 do
    if (i is missing) missingList.add(i);
EndFor

missingList will have space complexity O(k), and if List is a Vector, then adding a number is O(1) in time.

Paxinum
What's the implementation of your "is missing" function?
JeremyP
Your algorithm's complexity is O(N^2): `is missing` is O(N) and the for loop is O(N).
ArsenMkrt
@ArsenMkrt: as many proposed, you could use a set. Transforming whatever-a-bag-is to a set is O(N), and the second pass (the loop) is O(N), yielding O(N).
back2dos
+3  A: 

Wait a minute. As the question is stated, there are 100 numbers in the bag. No matter how big k is, the problem can be solved in constant time because you can use a set and remove numbers from the set in at most 100 - k iterations of a loop. 100 is constant. The set of remaining numbers is your answer.

If we generalise the solution to the numbers from 1 to N, nothing changes except N is not a constant, so we are in O(N - k) = O(N) time. For instance, if we use a bit set, we set the bits to 1 in O(N) time, iterate through the numbers, setting the bits to 0 as we go (O(N-k) = O(N)) and then we have the answer.

It seems to me that the interviewer was asking you how to print out the contents of the final set in O(k) time rather than O(N) time. Clearly, with a bit set, you have to iterate through all N bits to determine whether you should print the number or not. However, if you change the way the set is implemented you can print out the numbers in k iterations. This is done by putting the numbers into an object to be stored in both a hash set and a doubly linked list. When you remove an object from the hash set, you also remove it from the list. The answers will be left in the list which is now of length k.
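The hash-set-plus-doubly-linked-list combination described above can be sketched in Python with an insertion-ordered dict, which gives the same O(1) deletion and O(k) enumeration of what remains (a sketch of the idea, not the answerer's code):

```python
def missing_via_ordered_set(bag, n):
    """Delete each seen number from an ordered set; what remains
    are the k missing numbers, enumerable in O(k)."""
    # An insertion-ordered dict stands in for hash set + linked list.
    remaining = dict.fromkeys(range(1, n + 1))  # O(n) setup
    for v in bag:                               # O(n - k) deletions, O(1) each
        del remaining[v]
    return list(remaining)                      # O(k) to list the answer

bag = [i for i in range(1, 101) if i not in (5, 50, 95)]
print(missing_via_ordered_set(bag, 100))  # -> [5, 50, 95]
```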

JeremyP
This answer is too simple, and we all know that simple answers don't work! ;) Seriously though, original question should probably emphasize O(k) space requirement.
DK
+49  A: 

You will find it by reading the first couple of pages of this: Muthukrishnan - Data Stream Algorithms: Puzzle 1: Finding Missing Numbers. It shows exactly the generalization you are looking for. Probably this is what your interviewer read before posing these questions.

Now, if only people would start deleting the answers that are subsumed or superseded by Muthukrishnan's treatment, and make this text easier to find. :)

Also see sdcvvc's directly related answer, which also includes pseudocode (hurray! no need to read those tricky math formulations :)) (thanks, great work!).

Dimitris Andreou
How do you translate *that* into code?!?
Ubersoldat
Oooh... That's interesting. I have to admit I got a bit confused by the maths, but I was just skimming it. Might leave it open to look at more later. :) And +1 to make this link more findable. ;-)
Chris
The Google Books link doesn't work for me. Here's a [better version](http://www.cs.rutgers.edu/~muthu/stream-1-1.ps) [PostScript file].
Heinrich Apfelmus
Wow. I didn't expect this to get upvoted! Last time I posted a reference to the solution (Knuth's, in that case) instead of trying to solve it myself, it was actually downvoted: http://stackoverflow.com/questions/3060104/how-to-implement-three-stacks-using-a-single-array/3077753#3077753 The librarian inside me rejoices, thanks :)
Dimitris Andreou
@Apfelmus, note that this is a draft. (I don't blame you of course, I confused the draft for the real thing for almost a year before finding the book). Btw if the link doesn't work, you can go to http://books.google.com/ and search for "Muthukrishnan data stream algorithms" (without quotes); it's the first result.
Dimitris Andreou
Yeah, but Google only shows me the TOC, not the full text. :-/
Heinrich Apfelmus
+69  A: 

Here's a summary of Dimitris Andreou's link.

Remember the sums of i-th powers, where i = 1, 2, ..., k. This reduces the problem to solving the system of equations

a1 + a2 + ... + ak = b1

a1^2 + a2^2 + ... + ak^2 = b2

...

a1^k + a2^k + ... + ak^k = bk

Using Newton's identities, knowing the bi allows one to compute

c1 = a1 + a2 + ... + ak

c2 = a1*a2 + a1*a3 + ... + a(k-1)*ak

...

ck = a1*a2*...*ak

If you expand the polynomial (x-a1)...(x-ak), the coefficients will be exactly c1, ..., ck, up to sign (see Viète's formulas). Since every polynomial factors uniquely (the ring of polynomials is a Euclidean domain), this means the ai are uniquely determined, up to permutation.

This ends a proof that remembering powers is enough to recover the numbers. For constant k, this is a good approach.

However, when k is varying, the direct approach of computing c1,...,ck is prohibitively expensive, since e.g. ck is the product of all missing numbers, of magnitude n!/(n-k)!. To overcome this, perform the computations in the field Z_q, where q is a prime such that n <= q < 2n (one exists by Bertrand's postulate). The proof doesn't need to be changed, since the formulas still hold and factorization of polynomials is still unique. You also need an algorithm for factorization over finite fields, for example Berlekamp's or Cantor-Zassenhaus.

High level pseudocode for constant k:

  • Compute i-th powers of given numbers
  • Subtract to get sums of i-th powers of unknown numbers. Call the sums bi.
  • Use Newton's identities to compute coefficients from bi; call them ci. Basically, c1 = b1; c2 = (c1b1 - b2)/2; see Wikipedia for exact formulas
  • Factor the polynomial x^k - c1*x^(k-1) + c2*x^(k-2) - ... + (-1)^k*ck.
  • The roots of the polynomial are the needed numbers a1, ..., ak.

For varying k, find a prime n <= q < 2n using e.g. Miller-Rabin, and perform the steps with all numbers reduced modulo q.

As Heinrich Apfelmus commented, instead of a prime q you can use q = 2^⌈log n⌉ and perform the arithmetic in the finite field with q elements.
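A sketch of the constant-k recipe in Python for k = 3 (a naive O(n) integer root search stands in for a real polynomial-factoring step, and the function name is mine):

```python
def find_three_missing(bag, n):
    """Recover three missing numbers from 1..n via power sums
    and Newton's identities."""
    # Power sums b1, b2, b3 of the missing numbers.
    b1, b2, b3 = (sum(i**j for i in range(1, n + 1)) - sum(v**j for v in bag)
                  for j in (1, 2, 3))
    # Newton's identities give the elementary symmetric polynomials.
    c1 = b1
    c2 = (c1 * b1 - b2) // 2
    c3 = (c2 * b1 - c1 * b2 + b3) // 3
    # The roots of x^3 - c1 x^2 + c2 x - c3 are the missing numbers;
    # a trial search over 1..n suffices for this demonstration.
    return [x for x in range(1, n + 1)
            if x**3 - c1 * x**2 + c2 * x - c3 == 0]

bag = [i for i in range(1, 101) if i not in (4, 31, 59)]
print(find_three_missing(bag, 100))  # -> [4, 31, 59]
```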

sdcvvc
Thanks for doing this!
Dimitris Andreou
You don't have to use a prime field, you can also use `q = 2^(log n)`. (How did you make the super- and subscripts?!)
Heinrich Apfelmus
Also, you can calculate the c_k on the fly, without using the power sums, thanks to the formula $c^{k+1}_m = c^k_{m+1} + c^k_m x_{k+1}$ where the superscript $k$ denotes the number of variables and $m$ the degree of the symmetric polynomial.
Heinrich Apfelmus
@Heinrich Apfelmus Thanks! I'm not sure about the power of two approach - in general factorization would be much tricker, since there are divisors of zero and no uniqueness. To make sub/superscripts in answers, HTML tags can be used, in comments I think it is not available.
sdcvvc
+1 This is really, really clever. At the same time, it's questionable whether it's really worth the effort, or whether (parts of) this solution to a quite artificial problem can be reused in another way. And even if this were a real-world problem, on many platforms the most trivial `O(N^2)` solution will probably outperform this beauty for even reasonably high `N`. Makes me think of this: http://tinyurl.com/c8fwgw Nonetheless, great work! I wouldn't have had the patience to crawl through all the math :)
back2dos
@sdcvvc Ah, I mean the [finite field](http://en.wikipedia.org/wiki/Finite_field) with `q=2^(log n)` elements, not the ring of bit vectors with `log n` bits. Being a field, the former doesn't have any zero divisors; of course, multiplication is a bit more complicated than a simple bit-wise AND.
Heinrich Apfelmus
@Heinrich Apfelmus: Thanks, added. @back2dos: The book in Dimitris's answer mentions applications of these techniques in communication complexity (set reconcilation).
sdcvvc
A: 

Try to find the product of the numbers from 1 to 100: let P1 = 1 x 2 x 3 x ... x 100. As you take the numbers out one by one, multiply them together so that you get the product P2. Since two numbers are missing, P2 < P1, and the product of the two missing terms is a x b = P1 / P2. You already know the sum a + b = S1.

From the above 2 equations, solve for a and b through the quadratic formula; a and b are your missing numbers :-)

Manjunath
+3  A: 

This might sound stupid, but in the first problem presented to you, you would have to see all the remaining numbers in the bag to actually add them up and find the missing number using that equation. So, since you get to see all the numbers, why don't you just look for the number that's missing? The same goes for when two numbers are missing. Pretty simple, I think. There's no point in using an equation when you get to see the numbers remaining in the bag.

Stephan M
I think the benefit of summing them up is that you don't have to remember which numbers you've already seen (e.g., there's no extra memory requirement). Otherwise the only option is to retain a set of all the values seen and then iterate over that set again to find the one that's missing.
Dan Tao
This question is usually asked with the stipulation of O(1) space complexity.
Beh Tou Cheh
+3  A: 

The problem with solutions based on sums of numbers is that they don't take into account the cost of storing and working with numbers with large exponents... in practice, for it to work for very large n, a big-number library would be used. With that in mind, we can analyse the time and space complexity of sdcvvc's and Dimitris Andreou's algorithms.

Storage:

l_j = ceil (log_2 (sum_{i=1}^n i^j))
l_j > log_2 n^j  (assuming n >= 0, k >= 0)
l_j > j log_2 n \in \Omega(j log n)

l_j < log_2 ((sum_{i=1}^n i)^j) + 1
l_j < j log_2 (n) + j log_2 (n + 1) - j log_2 (2) + 1
l_j < j log_2 n + j + c \in O(j log n)

So l_j \in \Theta(j log n)

Total storage used: \sum_{j=1}^k l_j \in \Theta(k^2 log n)

Time used: assuming that computing a^j takes ceil(log_2 j) time, the total time is:

t = k ceil(\sum_i=1^n log_2 (i)) = k ceil(log_2 (\prod_i=1^n (i)))
t > k log_2 (n^n + O(n^(n-1)))
t > k log_2 (n^n) = kn log_2 (n)  \in \Omega(kn log n)
t < k log_2 (\prod_i=1^n i^i) + 1
t < kn log_2 (n) + 1 \in O(kn log n)

Total time used: \Theta(kn log n)

If this time and space is satisfactory, you can use a simple recursive algorithm. Let b!i be the ith entry in the bag, n the number of numbers before removals, and k the number of removals. In Haskell syntax...

let
  -- O(1)
  isInRange low high v = (v >= low) && (v <= high)
  -- O(n - k): count how many bag entries fall in [low, high]
  countInRange low high = sum $ map (fromEnum . isInRange low high . (b !)) [1..(n-k)]
  findMissing l low high krange
    -- O(1) if there is nothing to find.
    | krange == 0 = l
    -- O(1) if there is only one possibility.
    | low == high = low:l
    -- Otherwise a total of O(kn log n) time.
    | otherwise =
       let
         mid = (low + high) `div` 2
         -- number of values missing from [low, mid]
         klow = (mid - low + 1) - countInRange low mid
         khigh = krange - klow
       in
         findMissing (findMissing l low mid klow) (mid + 1) high khigh
in
  findMissing [] 1 n k

Storage used: O(k) for the list, O(log(n)) for the stack: O(k + log(n)) total. This algorithm is more intuitive, has the same time complexity, and uses less space.
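The same divide-and-count recursion, sketched in Python for readers who prefer it to Haskell (a direct translation, not tuned):

```python
def find_missing_by_counting(bag, n, k):
    """Binary search on value ranges: count bag elements in each half,
    recurse only into halves that are missing something.
    O(k + log n) extra space beyond the input."""
    def go(low, high, krange, acc):
        if krange == 0:                  # nothing missing in this range
            return acc
        if low == high:                  # exactly one candidate left
            acc.append(low)
            return acc
        mid = (low + high) // 2
        present = sum(1 for v in bag if low <= v <= mid)
        klow = (mid - low + 1) - present  # how many are missing in [low, mid]
        go(low, mid, klow, acc)
        go(mid + 1, high, krange - klow, acc)
        return acc
    return go(1, n, k, [])

bag = [i for i in range(1, 101) if i not in (2, 71)]
print(find_missing_by_counting(bag, 100, 2))  # -> [2, 71]
```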

a1kmm
+1, looks nice but you lost me going from line 4 to line 5 in snippet #1 -- could you explain that further? Thanks!
j_random_hacker
A: 

But you said: "We have a bag containing numbers 1, 2, 3, …, 100. Each number appears exactly once". If you know the sum of the two missing numbers, like you did in the first case, then you can find the two numbers by looking them up in a pre-prepared table. The key point is that the sum uniquely identifies the two numbers. If the sum is 3, the two numbers have to be 1 and 2; if it is 17, the two numbers can only be 10 and 7, etc...

victor
If the sum of the two missing numbers is 17, can't the two numbers also be 11 and 6 ? or 9 and 8 ?
Fanfan
Wow, what a bold statement.
Dan Tao
A: 

You could try using a Bloom filter. Insert each number in the bag into the filter, then iterate over the complete 1..N set, reporting each number the filter says is absent. This may not find the answer in all scenarios (false positives can hide a missing number), but it might be a good enough solution.
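A minimal sketch of the Bloom-filter idea (the filter is hand-rolled here with illustrative parameters; a real use would size the filter from N and the acceptable false-positive rate). Because Bloom filters have no false negatives, everything reported is genuinely missing; false positives may merely cause a missing number to go unreported:

```python
import hashlib

def _hashes(value, size, count=3):
    """Derive `count` bit positions for `value` via double hashing
    (names and parameters are illustrative, not from the answer)."""
    digest = hashlib.sha256(str(value).encode()).digest()
    h1 = int.from_bytes(digest[:8], "big")
    h2 = int.from_bytes(digest[8:16], "big")
    return [(h1 + i * h2) % size for i in range(count)]

def probably_missing(bag, n, size=4096):
    bits = [False] * size
    for v in bag:                                   # insert every number seen
        for pos in _hashes(v, size):
            bits[pos] = True
    # A number is reported only if at least one of its bits is unset.
    return [x for x in range(1, n + 1)
            if not all(bits[pos] for pos in _hashes(x, size))]

bag = [i for i in range(1, 101) if i not in (8, 80)]
print(probably_missing(bag, 100))
```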

jdizzle
A: 

I'd take a different approach to that question and probe the interviewer for more details about the larger problem he's trying to solve. Depending on the problem and the requirements surrounding it, the obvious set-based solution might be the right thing and the generate-a-list-and-pick-through-it-afterward approach might not.

For example, it might be that the interviewer is going to dispatch n messages and needs to know the k that didn't result in a reply and needs to know it in as little wall clock time as possible after the n-kth reply arrives. Let's also say that the message channel's nature is such that even running at full bore, there's enough time to do some processing between messages without having any impact on how long it takes to produce the end result after the last reply arrives. That time can be put to use inserting some identifying facet of each sent message into a set and deleting it as each corresponding reply arrives. Once the last reply has arrived, the only thing to be done is to remove its identifier from the set, which in typical implementations takes O(log(k+1)). After that, the set contains the list of k missing elements and there's no additional processing to be done.

This certainly isn't the fastest approach for batch processing pre-generated bags of numbers, because the whole thing runs in O((log 1 + log 2 + ... + log n) + (log n + log(n-1) + ... + log k)). But it does work for any value of k (even if it's not known ahead of time), and in the example above it was applied in a way that minimizes the most critical interval.

Blrfl
A: 

I believe I have an O(k) time and O(log(k)) space algorithm, given that you have the floor(x) and log2(x) functions for arbitrarily big integers available:

You have an N-bit integer used as a bitset: start with bits 1..N all set, s = 2^1 + 2^2 + ... + 2^N, and clear bit x (s -= 2^x) for each number x you find in the bag. This takes O(N) time (which is not a problem for the interviewer). At the end, j = floor(log2(s)) is the biggest missing number. Then set s = s - 2^j and repeat:

for (i = 0; i < k; i++) { j = floor(log2(s)); missing[i] = j; s -= 2^j; }

Now, you usually don't have floor and log2 functions for 2756-bit integers, but instead for doubles. So? Simply, for each 2 bytes (or 1, or 3, or 4) you can use these functions to get the desired numbers, but this adds an O(N) factor to the time complexity.
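This bitset-in-a-big-integer idea is directly expressible in Python, whose ints are arbitrary precision and whose `bit_length()` plays the role of floor(log2(s)) + 1. A sketch (names are mine):

```python
def missing_via_bigint(bag, n):
    """Track 1..n as bits of one big integer; peel the missing
    numbers off from the most significant bit down."""
    s = (1 << (n + 1)) - 2          # bits 1..n set: 2^1 + 2^2 + ... + 2^n
    for v in bag:
        s -= 1 << v                 # clear the bit of each number seen
    missing = []
    while s:
        j = s.bit_length() - 1      # index of the highest remaining bit
        missing.append(j)
        s -= 1 << j
    return missing

bag = [i for i in range(1, 101) if i not in (3, 97)]
print(missing_via_bigint(bag, 100))  # -> [97, 3]
```

Note that the big integer itself still occupies O(N) bits, so this is really the bitset solution in disguise; only the extraction loop is O(k).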

CostasGR43