views:

506

answers:

6

In a sequence of length n, where n=2k+3, that is there are k unique numbers appeared twice and three numbers appeared only once.

The question is: how to find the three unique numbers that appeared only once?

for example, in sequence 1 1 2 6 3 6 5 7 7 the three unique numbers are 2 3 5.

Note: 3<=n<1e6 and the number will range from 1 to 2e9

Memory limits: 1000KB , this implies that we can't store the whole sequence.

Method I have tried(Memory limit exceed):

I initialize a tree, and when read in one number I try to remove it from the tree, if the remove returns false(not found), I add it to the tree. Finally, the tree has the three numbers. It works, but is Memory limit exceed.

I know how to find one or two such number(s) using bit manipulation. So I wonder if

we can find three using the same method(or some method similar)?

Method to find one/two number(s) appeared only once:

If there is one number appeared only once, we can apply XOR to the sequence to find it.

If there are two, we can first apply XOR to the sequence, then separate the sequence into 2 parts by one bit of the result that is 1, and again apply XOR to the 2 parts, and we will find the answer.

+5  A: 

This is a classic question - I was actually just asked it a few weeks ago. To solve it, you take the number of possible distinct numbers that could appear, and allocate that many bits.

For example, if numbers in the list must be between 1-20, you allocate 20 bits - one for each number, and you initialize each bit as 0.

Then you traverse the list. Every time you see a number, flip the corresponding bit.

For example: With your example list of 2 6 3 6 5 7 7, we could allocate 7 bits (for 1 2 3 4 5 6 7). Then as we traverse the list, we would do the following:

  • flip 2nd bit
  • flip 6th bit
  • flip 3rd bit
  • flip 6th bit
  • etc

Then once you're done traversing the list, you can read through the bits to find the three unique numbers. They will all be represented by '1' bits, and the other numbers will be represented by 0s.

You read through the list twice, which takes 2n time, which is O(n).


Edit: It is possible that the bounds will not be given. One solution, then, is to simply read through the list first to determine the bounds yourself - then it's still O(n).

However one problem which could occur is that the list could be very small, but some very large numbers - effectively making the range too big. For example:

1, 99999999999999999, 1, 99999999999999999, 2, 3, 4

Solving that problem would require a lot of memory due to the large number present in the list, because even though there are very few numbers, the range is very large and we are allocating bits according to the range.

The solution could then be adjusted to give a new solution as follows using a hashtable (although I'm not sure if this is permitted given the problem's "bit manipulation only" stipulation):

  1. Let L denote the original list, and C denote a copy of it.
  2. Remove all duplicates from C (there are numerous ways of doing this efficiently).
  3. Create a hashtable H, and for each element in C, insert a key/value pair <number,pos> into H where number is the current element in C, and pos is its position in C. So, given a number that appears in L, we can now use H to find that number's position in C.
  4. Allocate a number of bits equal to the size of C, and initialize those bits to 0.
  5. Traverse L. Each time we run accross a number, get its value from H, and flip that bit in our bit list.
  6. Traverse the bit list - for each '1' bit, get the number from C which is at that position - that is one of the unique numbers.
Cam
Did the original question mention bounded numbers?
Akusete
No, but you can determine the bounds for a given list by iterating through the list once and finding the lowest/highest values in it. I will update my answer accordingly.
Cam
Thank you for your reply. I have thought of this method earlier. But the range of numbers is 1-2e9, and n is 3-1e6. So this methodwill not work.
shilk
@shilk: I've edited my answer accordingly. You can get around that using a hashtable.
Cam
Jonathan Sternberg
@Jonathan Sternberg: With memory limit 1000kb.
shilk
@shilk: 1mb? You can't even store the list in that. And above you mentioned you can only traverse the list once... are you sure you have the bounds/rules right? This is getting pretty ridiculous.
Cam
@Cam: Yeah, I'm fairly sure about the bounds. As we can not store the whole list, we can only read one and manipulate one.
shilk
If you can fit all the values in a hashtable, you can use it directly to solve the problem, no need to add the bit set.
Dave L.
+4  A: 

If a probabilistic solution will suffice then you could use a Bloom Filter.

Create two Bloom filters. The first (A) contains numbers that have been found at least one, and the second (B) contains numbers that have been found twice.

Pseudocode:

A = empty
B = empty

foreach x in the list
  if x in A
    add x to B
  else
    add x to A

foreach x in the list
  if x in A
    if !(x in B)
      print x

If you use the full 1000KB then the probability of error would be ridiculously low.

Peter Alexander
@Peter Alexander:How can you traverse the list twice as we don't have enough memory to store the whole list? I don't think Bloom Filter will work in this situation.
shilk
@shilk: a bloom filter is a glorified bit array, so its extremely compact. You "add" items to bloom filter by setting bits at index `hashcode % array.length` to 1 for several different hash functions, and you test for set membership in a similar way. This is a perfectly adequate, probabilistic solution to your question.
Juliet
@Juliet, he's right about the second traversal though. You can't use the Bloom filter to re-traverse the elements, and at the same time we can't store the elements :-/ -- I missed that bit.
Peter Alexander
+1  A: 

The problem gets harder and harder as you add more unique values, mainly because you can choose A,B,C such that A xor B xor C = 0. It gets harder and harder to detect if a subset of the values has the same checksum because it contains all the unique values, or because it omitted values which happened to xor to 0.

You can do 3 values in constant space and O(n*k) time, where k is the number of bits in the largest integer. (So O(n) time for your typical case: 32-bit integers.)

It would be interesting to find out if the time bound becomes non-linear in N as the number of unique values increases and you continue to require constant space.

//Special check for 0, because otherwise we don't know A xor B xor C != A xor B
if items unique-contains 0 then
    return 0 ++ SubProblem2Unique(items - 0)
//Compute A xor B xor C
val x = fold xor items
//Try to find a split which separates A and B from C.
for i in 0..WORD_SIZE
    //see if the checksum splits
    val x1 = fold xor [e in items where e & (1<<i) == 0]
    val x2 = x xor x1
    if x1 == x or x2 == x then continue //ith bit was the same for A and B and C
    //C is either x1 or x2
    val C = if items unique-contains x1 then x1 else x2
    return C ++ SubProblem2Unique(items - C)

throw InvalidInput
Strilanc
+4  A: 

You can do this in a similar way to the simpler cases of one and two different values.

We need two integers for each bit of the numbers (e.g. 32 bits). For each number, if that bit is zero, XOR the first integer with it. If it isn't, XOR the second integer with it.

Also, keep count of how many times you find a 1 or 0 in each position (we only need to check if this is even or odd, so keep a boolean).

After iterating through, our pairs of integers will be one of the following. The first number here represents an even count, the second an odd.

0, a^b^c
a^b, c
a^c, b
b^c, a

For each pair, check the even count integer. If it is zero, then we know the other integer is a^b^c, since no two of our results will be equal. Otherwise, we've found a value at the odd count integer.

public static int[] find3(int[] list) {
    int[][] xors = new int[32][2];
    boolean[] counts = new boolean[32];
    for (int curr : list) {
        for (int i = 0; i < 32; i++) {
            xors[i][(curr & (1 << i)) >> i] ^= curr;
            counts[i] ^= ((curr & (1 << i)) == (1 << i));
        }
    }

    // this really shouldn't take so many lines
    int[] ret = new int[3];
    int found = 0;
    for (int i = 0; i < 32; i++) {
        int oddCount = xors[i][counts[i] ? 1 : 0];
        int evenCount = xors[i][counts[i] ? 0 : 1];
        if (evenCount != 0) { // avoid the 0, a^b^c case.
            if (found == 0) {
                ret[0] = oddCount;// a
                ret[2] = evenCount;// b^c for now
                found++;
            } else if (found == 1 && ret[0] != oddCount) {
                ret[1] = oddCount;// b
                ret[2] ^= oddCount;// (b^c)^b == c
                break;
            }
        }
    }
    return ret;
}
Nabb
Awesome! You solve it perfectly. Thank you for your anwser.
shilk
@shilk. I agree. Good solution for the stated bounds.
Moron
+2  A: 

For a more general version of this problem (without those silly limits):

You can do this in O(n) time and O(1) space without assuming any bounds, or iterating over all the bits, and using only O(1) time bit manipulation tricks like the XOR trick which worked for 2 missing numbers.

Here is (pseudo)code to find just one of the numbers:

// Given an array arr with 2k+3 numbers, k of which are repeated twice
// and the remaining three are distinct: a,b,c.
// returns one of a,b,c.
int FindUnique(int []arr) {

    int s = 0; // This will ultimately hold a ^ b ^ c (bitwise XOR)

    for (int i = 0; i < arr.Length; i++) {
        s ^= arr[i];
    }

    int d = 0; // this holds diff(a,s) ^ diff(b,s) ^ diff(c,s)

    for (int i = 0; i < arr.Length; i++) {
        d ^= diff(arr[i],s);
    }

    int e = lowestBit(d); // This gives the position where one of a,b,c differs 
                          // from the others.

    int bucket1 = 0;
    int bucket2 = 0;

    for (int i = 0; i < arr.Length; i++) {
        if (arr[i] & e) {
            bucket1 ^= arr[i];
        } else {
            bucket2 ^= arr[i];
        }
    }

    int count1 = 0;
    int count2 = 0;

    for (int i = 0; i < arr.Length; i++) {
        if (arr[i] == bucket1) {
            count1++;
        }

        if (arr[i] == bucket2) {
            count2++;
        }
    }

    if (count1 == 1) return bucket1;

    return bucket2;
}

// return a number with the lowest bit of x ^ s set to 1 and rest 0.
// i.e. the lowest bit position where x and s differ.
int diff(int x, int s) {
    return lowestBit(x ^ s);
}

// Returns a number with only the lowest bit of y set.
int lowestBit(int y) {
    return y & ~(y-1);
}

The idea is as follows:

Say the numbers which appear once are a,b,c.

Now run the XOR through the array to get s = a XOR b XOR c.

Since the numbers are distinct, notice that s cannot be either a or b or c (as the other two will be equal then), thus there is at least one bit (not necessarily at the same position), where each of a,b,c differs from s.

In the two number case, we could see that s is non-zero and pick a bit which differentiated a & b and work with that.

We run into difficulties with that when we have three numbers, but we can still find a bit to differentiate one of the numbers out.

For each number x, find the lowest bit which differs from s. Consider the binary number in which only that bit is set to one and the rest are zero. Call this number diff(x).

Now if we compute diff(x) for each number and XOR them together, we get d = diff(a) XOR diff(b) XOR diff(c).

Notice that d cannot be zero.

Now find the lowest set bit of d. This bit position can be used to bucket out one of a,b,c, as not all of a,b,c can have the same bit at that position: if they did, then that bit of s which is the XOR of those three must be the same, but we ensured that we picked that bit of s to differ from at least one of the corresponding bits in a,b,c.

So we XOR again, differentiating on this bit, and check which of the two resulting numbers appears exactly once in the array. Once we find one number, we know how to deal with two numbers.

To find the diff just use the bithack: x & ~(x-1), which is a standard bit-hack and can be considered O(1) (instead of O(number of bits)).

Moron
A: 

Why not use a a hashset? - If a number already exists, delete from hashset - if a number does not exist, put into hashset The final hashset contains only unique numbers. Time: O(n) Memory:o(k) where k is number of distinct elements.

With hashset approach, the solution is scalable and can be used to determine any number of unique elements in any given sequence.

Amit
Because you can't fit half a million 32-bit values into a hashset in 1000 KB.
Dave L.