views: 186
answers: 8

My requirement is to find a duplicate number in an array of integers of length 10^15, in one pass. I know the method (logic) for finding a duplicate number in an array, but how can I handle such a large size?

A: 

You can declare it in the normal way: `new int[1000000000000000]`. However, this will only work on a 64-bit machine; the largest you can expect to store on a 32-bit machine is a little over 2 GB.

Realistically you won't be able to store the whole array in memory. You'll need to come up with a way of generating it in smaller chunks, and checking those chunks individually.

What's the data in the array? Maybe you don't need to generate it all in one go. Or maybe you can store the data in a file.
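For illustration only (not part of the original answer), here is a minimal sketch of processing the numbers from a binary file in fixed-size chunks instead of holding everything in memory. The file name `data.bin` and the chunk size are assumptions.

using System;
using System.IO;

class ChunkedScan
{
    static void Main()
    {
        const int ChunkSize = 1 << 20;                 // 1M ints (4 MB) per chunk
        int[] buffer = new int[ChunkSize];

        using (var reader = new BinaryReader(File.OpenRead("data.bin")))
        {
            long remaining = reader.BaseStream.Length / sizeof(int);
            while (remaining > 0)
            {
                int count = (int)Math.Min(ChunkSize, remaining);
                for (int i = 0; i < count; i++)
                    buffer[i] = reader.ReadInt32();

                ProcessChunk(buffer, count);           // duplicate check for this chunk
                remaining -= count;
            }
        }
    }

    static void ProcessChunk(int[] chunk, int count)
    {
        // e.g. update a shared bitmap or hash set with chunk[0..count-1]
    }
}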

Tim Robinson
+1  A: 

You'll need a different data structure. I suspect the requirement isn't really to use an array - I'd hope not, as arrays can only hold up to Int32.MaxValue elements, i.e. 2,147,483,647... much less than 10^15. Even on a 64-bit machine, I believe the CLR requires that arrays have at most that many elements. (See the docs for Array.CreateInstance for example - even though you can specify the bounds as 64-bit integers, it will throw an exception if they're actually that big.)

Now, if you can explain what the real requirement is, we may well be able to suggest alternative data structures.

If this is a theoretical problem rather than a practical one, it would be helpful if you could tell us those constraints, too.

For example, if you've got enough memory for the array itself, then asking for another 2^29 bytes (512 MB) to store which numbers you've seen already (one bit per possible value) isn't asking for much. This is assuming the values themselves are 32-bit integers, of course. Start with an empty bit array, and set the relevant bit for each number you find. If you find you're about to set one that's already set, then you've found the first duplicate.
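For illustration only (not part of the original answer), a sketch of this one-bit-per-value idea using a plain ulong[] as the bitmap; it assumes the values are Int32 and that roughly 512 MB is available for the bitmap.

using System;
using System.Collections.Generic;

static class BitmapDuplicate
{
    // Returns the first duplicate in the sequence, or null if there is none.
    public static int? Find(IEnumerable<int> numbers)
    {
        // 2^32 possible values, 64 bits per ulong => 2^26 ulongs = 512 MB.
        ulong[] seen = new ulong[1 << 26];

        foreach (int n in numbers)
        {
            uint u = (uint)n;                      // map the full Int32 range onto 0..2^32-1
            int index = (int)(u >> 6);             // which ulong holds this value's bit
            ulong mask = 1UL << (int)(u & 63);     // which bit within that ulong

            if ((seen[index] & mask) != 0)
                return n;                          // bit already set: first duplicate
            seen[index] |= mask;
        }
        return null;
    }
}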

Jon Skeet
A: 

You cannot declare an array of size greater than Int32.MaxValue (2^31 − 1, or approx. 2×10^9), so you will have to either chain arrays together or use a List<int> to hold all of the values.

murgatroid99
I don't believe a list will work either. `List<>` uses `System.Int32` to store its indexes / counts / capacity / etc...
Aren
I thought you could only declare 2 GB of data in .NET (including 64-bit)? I can't try it on my machine.
lasseespeholt
A: 

Your algorithm should really be the same regardless of the array size. The best time complexity you can hope for is O(n), of course.

Consider the following pseudo-code for the algorithm:

  1. Create a HashSet<int> of capacity equal to the range of numbers in your array.
  2. Loop over each number in the array and check if it already exists in the hashset.
    • if no, add it to the hashset now.
    • if yes, you've found a duplicate.

Memory usage here is far from trivial, but if you want speed, it will do the job.
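For illustration only (not part of the original answer), a minimal C# sketch of the pseudo-code above, assuming the numbers can be streamed as an IEnumerable<int>:

using System.Collections.Generic;

static class HashSetDuplicate
{
    // Returns the first duplicate in the sequence, or null if there is none.
    public static int? Find(IEnumerable<int> numbers)
    {
        var seen = new HashSet<int>();
        foreach (int n in numbers)
        {
            // HashSet<T>.Add returns false when the value is already present.
            if (!seen.Add(n))
                return n;
        }
        return null;
    }
}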

Noldorin
A better trade-off between time and memory usage perhaps (giving you `O(n log n)` for time) would be to sort the array (quicksort) and look for two consecutive numbers that are identical. May be prohibitive in terms of time though, I suspect...
Noldorin
@Noldorin: Yes, that's probably the best speed/memory tradeoff. By the way, `HashSet` cannot exceed a 32-bit integer's size in capacity. It's typed to `Int32`.
Aren
@Aren: We don't know the range of numbers though... The `HashSet` only gets as big as the range of numbers (possibly the entire set of 32-bit integers). Sounds like the OP is taking the wrong approach from the start though, hmm.
Noldorin
@Noldorin: Worst case scenario is that each number is different, which means there's `n` unique numbers. But you're right, something tells me the general approach is wrong here (from the OP).
Aren
+7  A: 

An array of 10^15 integers would require more than a petabyte to store. You said it can be done in a single pass, so there's no need to store all the data. But even reading this amount of data would take a lot of time.

But wait: if the numbers are integers, they fall into a certain range, let's say N = 2^32. So by the pigeonhole principle you only need to examine at most N + 1 numbers to find a duplicate. Now that's feasible.
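For illustration only (not the answerer's code), the arithmetic behind that bound: reading 2^32 + 1 values must produce a repeat, so a single pass can stop long before 10^15 elements.

using System;

class PigeonholeBound
{
    static void Main()
    {
        const long distinctInt32Values = 1L << 32;        // 4,294,967,296 possible values
        const long arrayLength = 1_000_000_000_000_000;   // 10^15 elements

        // Reading distinctInt32Values + 1 elements guarantees a duplicate.
        long valuesToExamine = Math.Min(arrayLength, distinctInt32Values + 1);
        Console.WriteLine($"Examine at most {valuesToExamine:N0} of {arrayLength:N0} values.");
    }
}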

Asker
+4  A: 

You can use a BitVector32 array with length = 2^(32-5) = 2^27 = 0x8000000.

This has a bit for each possible Int32 value.

Note: the easier solution (BitArray) doesn't provide an adequate constructor (its length is limited to Int32.MaxValue bits, which is not enough to cover all 2^32 values).

// Requires: using System.Collections.Specialized; (for BitVector32)
BitVector32[] bv = new BitVector32[0x8000000];   // 2^27 x 32 bits = one bit per possible Int32
int[] ARR = ....;   // Your array
foreach (int I in ARR)
{
    int Element = (int)((uint)I >> 5);   // which BitVector32 holds this value's bit (unsigned shift)
    int Bit = 1 << (I & 0x1f);           // BitVector32's indexer takes a bit mask, not a bit index
    if (bv[Element][Bit])
    {
        // Option for First Duplicate Found
    }
    else
    {
        bv[Element][Bit] = true;
    }
}
x77
Could you put some code snippets here?
Lalchand
You'll get much more out of the learning experience, Lalchand, if you try writing it yourself without looking too carefully at the code snippet.
Owen S.
Much better answer than mine was. I don't know how I got stuck on the idea of hashtables. I was on crack.
Owen S.
Thanks. A little bug: I used a signed shift. It must be an unsigned shift: (uint)I >> 5;
x77
A: 

Two questions:

1) Is the number of items in the array 10^15?

2) Or can the value of the items be 10^15?

If it's #1:

Where are you pulling the numbers from? If it's a file, you can step through it.

Are there more than 2,147,483,647 unique numbers?

If it's #2:

An Int64 can handle the number.

If it's both #1 and #2:

Are there more than 2,147,483,647 unique numbers?

If there are fewer than 2,147,483,647 unique numbers, you can use a List<BigInteger>.

Aaron
A: 

You don't need to do anything. By definition there will be a duplicate, because 2^32 < 10^15: there aren't enough distinct 32-bit values to fill a 10^15-element array uniquely. :)

Now if there is an additional requirement that you know where the duplicates are... that's another story, but it wasn't in the original problem.

cyberconte