views: 764
answers: 8
A question I was given at a job interview. I was close to the solution but unfortunately did not solve it.

Assume we have a sequence of N numbers of type long, and we know for sure that every number in this sequence occurs exactly n times, except for one number that occurs exactly m times (0 < m < n). How do we find that number with O(N) operations and O(1) additional memory?

For the simplest case (when n = 2 and m = 1) all we have to do is perform an accumulative xor over every number in the sequence; the result will equal the desired number. But I'm stuck trying to deal with arbitrary m and n.
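Here is that simple case as a small C++ sketch, just for reference (function and variable names are mine):

#include <vector>

// n = 2, m = 1: every value appears twice except one that appears once.
// Xor-ing everything together cancels the pairs and leaves the unique value.
long find_single(const std::vector<long>& seq) {
    long acc = 0;
    for (long x : seq)
        acc ^= x;
    return acc;
}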

I would appreciate an actual C++ solution.


EDIT: We know the actual values of m and n a priori.

Example. We know that n = 3 and m = 2. The sequence (N = 8) is

5 11 5 2 11 5 2 11

And the right answer is 2 in this particular case because it occurs only twice.

+2  A: 

Your simplest case can be made more general: you can use the same technique you described for any odd m and even n.

grokus
Yes, you're right. And I had discovered this too.
Keynslug
But the problem as stated can have m and n both even or both odd, or m even and n odd. You've only covered 1 of 4 possibilities.
Mark Ransom
A: 

If the list is sorted then this becomes quite simple: you just need to examine each run of equal values in turn to see whether its length is m.

If the list is not sorted then I don't think it is possible with O(1) additional memory.

Patrick
The interviewer swore it is possible. :)
Keynslug
+9  A: 

(edit) Okay, this method isn't as sound as I initially thought. eBusiness's solution is much simpler and works correctly for cases such as n=4, m=2.

We can generalise the xor method to work with arbitrary m and n. We first need to pick a base b such that gcd(n, b) = b and gcd(m, b) < b, i.e. b divides n but not m. As even n/odd m pairs satisfy this for base 2, the standard binary xor works for those pairs.

Firstly we define (a^^n) to mean (a^a^...^a) for n a's, with generalised xor of base b. For example, with standard binary xor, a^^2 = 0.

We need to define our generalised xor. The properties we desire are basically the same as those of addition (commutativity, associativity), and we need a^^b = 0. The obvious solution is (x^y) = (x+y)%b for each digit in the base b representation (convince yourself this works, and that it is the same as binary xor for base 2). Then we simply apply this to all numbers in the sequence, and end up with result = s^^(m%b), where s is the special number.
Lastly, we need to revert our 'xor'ed base b number to the expected number. We can simply compute i^^(m%b) for i=0..b-1 and then, for each digit of the result, look up which i produced it; those values are the digits of s.

This algorithm is O(N). For each number in the list we do a constant number of operations, because we have at most 64 digits. Reverting at the end is at worst O(N) for large b. We can do this last step in constant space by computing i^^(m%b) over all i separately for each digit of the result (again, we have a constant number of digits).


Example:

n = 3, m = 2. list = [5 11 5 2 11 5 2 11]

First we choose a base b. Obviously we have to choose base 3.

The xor table for reference:

  0|1|2
0|0|1|2
1|1|2|0
2|2|0|1

The computation:

  5     11      5      2     11      5      2     11
0^0=0. 0^1=1. 1^0=1. 1^0=1. 1^1=2. 2^0=2. 2^0=2. 2^1=0.
0^1=1. 1^0=1. 1^1=2. 2^0=2. 2^0=2. 2^1=0. 0^0=0. 0^0=0.
0^2=2. 2^2=1. 1^2=0. 0^2=2. 2^2=1. 1^2=0. 0^2=2. 2^2=1.

m % b = 2.

Thus we have s^^2 = [001]. We generate a table of i^^2 for each digit i, and then do a reverse lookup.

   i | 0 | 1 | 2 |
i^^2 | 0 | 2 | 1 |

0 -> 0
0 -> 0
1 -> 2

We lastly convert our result back into binary/decimal. [002] = 2.
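Putting the above together, a rough C++ sketch (names are mine, not a tested implementation; it assumes a base b with b dividing n and gcd(m % b, b) = 1 has already been chosen, and as the edit above notes, no such base exists for cases like n=4, m=2):

#include <cstdint>
#include <vector>

// Find the number occurring m times when every other number occurs n times,
// using digit-wise addition mod b as the generalised xor.
uint64_t find_special(const std::vector<uint64_t>& seq, unsigned n, unsigned m, unsigned b) {
    const int DIGITS = 64;            // enough base-b digit slots for any 64-bit value (b >= 2)
    unsigned digit[DIGITS] = {0};     // per-digit accumulator

    for (uint64_t x : seq)
        for (int d = 0; d < DIGITS && x > 0; ++d, x /= b)
            digit[d] = (digit[d] + x % b) % b;   // generalised xor, one digit at a time
    // digit[] now holds s^^(m%b): each digit of s multiplied by (m % b), mod b.

    unsigned k = m % b;
    std::vector<unsigned> rev(b);     // reverse lookup for i -> i^^k == (i*k) % b
    for (unsigned i = 0; i < b; ++i)
        rev[(i * k) % b] = i;

    uint64_t s = 0;                   // rebuild s from its base-b digits, high to low
    for (int d = DIGITS - 1; d >= 0; --d)
        s = s * b + rev[digit[d]];
    return s;
}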

Nabb
It is the correct solution, but I don't think I would have been able to understand it had I not figured it out myself before reading your reply.
eBusiness
That is really **great!** At the interview I finally hit upon the thought that a non-binary base representation might be very helpful, but I was totally confused after all.
Keynslug
However, when the torture came to an end, the interviewer told me that there exists a much easier solution, as it avoids the non-binary-base-representation magic. This fact completely blows my mind.
Keynslug
This may have been what the interviewer was looking for, but this isn't strictly O(N) time and O(1) memory. The word size is logarithmic in N. Even though the standard transdichotomous model of computation gives us O(1) time operations on words of size O(lg N), it doesn't follow that we can keep one word for each bit and still consider this to be O(1) memory. Ditto on running time.
jonderry
Sorry, it is not the same as my solution, though it looks a lot like it. This solution fails when n and m are not relatively prime; try it with n=4 and m=2 and you will get an all-zeros result.
eBusiness
@jonderry: Since the values in the sequence are longs, we have at most `64 log(2) / log(b)` digits, so we have O(1) digits. Operations on each digit are O(1), so we have O(1) total.
Nabb
OK, I don't understand the solution then. I thought you were essentially saying to keep a count of the number of occurrences of each bit, modulo some (possibly large) number. Doesn't this require roughly 64^2 memory, where 64 ~ log N?
jonderry
BTW, I posted code for another solution below that uses only a constant number of additional words.
jonderry
+2  A: 
  • If you have a one-to-one hash on the set of integers from 0 to (N/n) + 1 then you can solve it with N iterations + N/n iterations and N memory usage. However, there isn't such a one-to-one mapping.

  • If there is no constraint on memory other than that it must be constant, you can define an array the size of the domain of longs; then you can solve the problem in 2N operations with constant (gargantuan) memory usage. For every x in the sequence you simply add to BIGARRY[x], then loop through BIGARRY looking for the count m (see the toy sketch below). It's terrible and not implementable, but it meets the requirements, and most interview questions are thought experiments anyway.
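A toy C++ sketch of that idea, shrunk to a 16-bit value domain so the "gargantuan" array actually fits in memory (names are mine; for real 64-bit longs the table would need 2^64 entries):

#include <cstdint>
#include <vector>

// Counting-array approach, illustrated on a 16-bit value domain.
uint16_t find_m_times(const std::vector<uint16_t>& seq, unsigned m) {
    std::vector<unsigned> counts(1u << 16, 0);   // the "BIGARRY" for a 16-bit domain
    for (uint16_t x : seq)
        ++counts[x];                             // first pass: count every value
    for (uint32_t v = 0; v < counts.size(); ++v)
        if (counts[v] == m)                      // second pass: find the count m
            return (uint16_t)v;
    return 0;                                    // unreachable if the input matches the problem statement
}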

rerun
Sure it does! Great consideration and vote up from me.
Keynslug
A: 

I believe that you can't do it using only O(1) additional space.

Here is my justification: you are given:

  • m
  • n
  • x_1 x_2 .. x_N

Since there are duplicates among the x_i values, let's define U as the set of unique values. All the elements in U appear n times, and one of them appears m times in the x_i series. Let us label the less often appearing element as u_0, and U_1 the set U - { u_0 }.

Let S be the sum of all x_i. S can be written as:

sum(x_i) = n * sum(U_1) + m * u_0 = n * sum(U) + (m - n) * u_0

Solving this problem is equivalent to finding the sum of the unique elements in a series, and you can't do that in O(1) additional space, because you need an array or a hash table with linked elements; the space is actually O(N).

florin
+23  A: 

You do 64 sums, one for each bit. For each of the sums you calculate sum mod n; this calculation returns m for each bit that should be set in the result, and 0 for each bit that should not be set.

Example:
n = 3, m = 2. list = [5 11 5 2 11 5 2 11]

              5  11   5   2  11   5   2  11
sum of bit 0: 1 + 1 + 1 + 0 + 1 + 1 + 0 + 1 = 6   6 % 3 = 0
sum of bit 1: 0 + 1 + 0 + 1 + 1 + 0 + 1 + 1 = 5   5 % 3 = 2
sum of bit 2: 1 + 0 + 1 + 0 + 0 + 1 + 0 + 0 = 3   3 % 3 = 0
sum of bit 3: 0 + 1 + 0 + 0 + 1 + 0 + 0 + 1 = 3   3 % 3 = 0

So only bit 1 is set, which means the result is 2.
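A straightforward C++ rendering of this, as a sketch (variable names are mine):

#include <cstdint>
#include <vector>

// One sum per bit position, taken modulo n at the end.  Bits of values that
// occur n times cancel out; the surviving bits (count % n == m) are exactly
// the bits of the number that occurs m times.
long find_special(const std::vector<long>& seq, long n) {
    uint64_t count[64] = {0};
    for (long x : seq)
        for (int bit = 0; bit < 64; ++bit)
            count[bit] += ((uint64_t)x >> bit) & 1;

    uint64_t result = 0;
    for (int bit = 0; bit < 64; ++bit)
        if (count[bit] % n != 0)                 // equals m (mod n) for bits of the answer
            result |= (uint64_t)1 << bit;
    return (long)result;
}

On the example list [5 11 5 2 11 5 2 11] with n = 3 this returns 2.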

Optimizing implementation:
(Tricks and considerations that are also useful for real problems)
It is worth noting that when iterating over an array, execution speed will to some extent be limited by memory access. If one needs to perform multiple operations with each element, it is usually fastest to perform them all on one element at a time, so that the processor only needs to load each element from memory once. Interesting blog post on memory and cache.

It is possible to sum multiple bits in a single integer. Rather than applying 64 different bitmasks to get each bit on its own, one could for instance use only 4 bitmasks that each extract 16 bits, with 3 bits of space between them. As long as no overflow occurs, a normal addition operation will work exactly as if dealing with 16 4-bit integers, so this method works for up to 15 numbers. After 15 numbers have been processed this way, the results must be added to storage capable of holding bigger integers (this could be 8 64-bit integers, each holding 8 8-bit integers; they must of course in turn be emptied into bigger integers, etc.).
The result is that, rather than doing 64 bitmasks, 63 bitshifts and 64 additions per value, one need only do 4 bitmasks, 3 bitshifts and 4 additions, plus for every 15 values 8 bitmasks, 4 bitshifts and 8 additions, plus for every 255 values 16 bitmasks, 8 bitshifts and 16 additions, etc.

Visualization:
(Summing 4x4-bit integers using 16-bit integers)

1000 1000 1000 1000 +
1000 0000 0000 1000 +
0000 0000 0000 1000 +
1000 1000 0000 0000 +
1000 0000 1000 0000 +
0000 0000 1000 1000 =
0010 0100 1100 0010

The result is the same whether you consider this to be 4 columns of 4-bit integers or 1 column of 16-bit integers; this is only true as long as the 4-bit integers do not overflow.
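A hedged C++ sketch of the innermost stage of this trick (it packs up to 15 values into four 64-bit words of sixteen 4-bit counters, then flushes those into plain per-bit counters rather than cascading further as described above; constants and names are mine):

#include <cstdint>
#include <vector>

long find_special(const std::vector<long>& seq, long n) {
    const uint64_t MASK = 0x1111111111111111ULL; // bits 0, 4, 8, ..., 60
    uint64_t wide[64] = {0};                     // one full-width sum per bit
    uint64_t packed[4] = {0, 0, 0, 0};           // 4 words of sixteen 4-bit counters
    int pending = 0;

    auto flush = [&]() {                         // empty the 4-bit counters into wide[]
        for (int k = 0; k < 4; ++k) {
            for (int slot = 0; slot < 16; ++slot)
                wide[4 * slot + k] += (packed[k] >> (4 * slot)) & 0xF;
            packed[k] = 0;
        }
        pending = 0;
    };

    for (long x : seq) {
        uint64_t u = (uint64_t)x;
        for (int k = 0; k < 4; ++k)
            packed[k] += (u >> k) & MASK;        // 16 independent 4-bit additions per mask
        if (++pending == 15) flush();            // flush before any 4-bit counter can overflow
    }
    flush();

    uint64_t result = 0;
    for (int bit = 0; bit < 64; ++bit)
        if (wide[bit] % n != 0)
            result |= (uint64_t)1 << bit;
    return (long)result;
}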

eBusiness
Wow, it is the **easiest, clearest** and most **elegant** solution! And it [works](http://ideone.com/9H2q2) perfectly.
Keynslug
Elegant. This can also be thought of as generalising xor, just with modulo n instead of modulo 2.
Nabb
And _that's_ why they told him the exact size of integers. Great solution.
atomizer
Good solution. Strictly speaking, this uses O(N log N) time and O(log N) words of memory (as with Nabb's). The standard assumption is that O(log n) bits of memory is O(1) words, and operations on these words each take O(1) time. I posted a solution below that uses O(1) words of memory, and has the same running time as this one.
jonderry
@jonderry Strictly speaking it is O([int size] * N) time and O([int size]) memory. Of course one can cut a few bytes from memory use and get it down to O(1) by calculating the bit values one at a time, though that means iterating through the list once for each bit.
eBusiness
Yes, that's what my solution does. What I'm curious about is if you can get a true O(N) time solution that doesn't use much memory. Hashing gets you O(N) time, but requires O(N) memory.
jonderry
+2  A: 

Here's a solution that has the same running time as eBusiness's (which I consider to actually be O(N log N)), but truly uses O(1) words of memory. It assumes m is not a multiple of n. It also assumes two helper functions that count the number of elements strictly above and below their arguments.

unsigned long long divider = 0;

for (int i = 63; i >= 0; i--) {
  divider |= 1ULL << i;                          // tentatively set bit i
  int a = countAbove(divider);                   // elements strictly greater than divider
  int b = countBelow(divider);                   // elements strictly less than divider
  if (a % n == 0 && b % n == 0) return divider;  // neither side hides it, so divider is the answer
  else if (a % n == 0) divider ^= 1ULL << i;     // the m-times value lies below, clear bit i again
}
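A possible implementation of the two assumed helpers, each a single O(N) pass (hypothetical; shown here taking the sequence explicitly and treating values as unsigned 64-bit, to match the bit-by-bit search above):

#include <vector>

int countAbove(const std::vector<unsigned long long>& seq, unsigned long long value) {
    int count = 0;
    for (unsigned long long x : seq)
        if (x > value) ++count;                  // strictly greater than value
    return count;
}

int countBelow(const std::vector<unsigned long long>& seq, unsigned long long value) {
    int count = 0;
    for (unsigned long long x : seq)
        if (x < value) ++count;                  // strictly less than value
    return count;
}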
jonderry
It seems like your helper functions will make this O(N log N) (off the top of my not-so-sharp head).
Lee-Man
Exactly. The other solutions, including eBusiness', are O(N log N) as well. Unlike the other solutions, however, the memory used by this solution is only a constant number of words.
jonderry
A: 

A solution is something like the process of finding the k-th order statistic:

divide the sequence into 2 sub-sequences
(calculating the size of each sub-sequence during the procedure)
while (size of sub-sequence mod n != 0)
  repeat the same dividing process on that sub-sequence

This takes O(N) operations, like finding the k-th order statistic.
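A hedged C++ sketch of that idea (names are mine; like quickselect it is expected rather than worst-case O(N), it reorders the array in place, and the recursion stack means it is not strictly O(1) memory):

#include <algorithm>
#include <cstdlib>
#include <vector>

// Three-way partition around a random pivot, then recurse into whichever part
// has a size that is not a multiple of n.  If the pivot's own count is not a
// multiple of n, the pivot is the value that occurs m times.
long find_special(std::vector<long>& a, size_t lo, size_t hi, long n) {
    long pivot = a[lo + std::rand() % (hi - lo)];

    auto mid1 = std::partition(a.begin() + lo, a.begin() + hi,
                               [&](long x) { return x < pivot; });
    auto mid2 = std::partition(mid1, a.begin() + hi,
                               [&](long x) { return x == pivot; });

    size_t less  = mid1 - (a.begin() + lo);
    size_t equal = mid2 - mid1;

    if (equal % n != 0) return pivot;                          // pivot occurs m times
    if (less  % n != 0) return find_special(a, lo, lo + less, n);
    return find_special(a, lo + less + equal, hi, n);          // otherwise it is in the "greater" part
}

Called as find_special(seq, 0, seq.size(), n).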

喂过猪