tags:

views:

197

answers:

15

Hello, Say, i have 10 billions of numbers stored in a file. How would i find the number that has already appeared once previously?

Well i can't just populate billions of number at a stretch in array and then keep a simple nested loop to check if the number has appeared previously.

How would you approach this problem?

Thanks in advance :)

A: 

You need to implement some sort of looping construct to read the numbers one at a time since you can't have them in memory all at once.

How? Oh, what language are you using?

Rice Flour Cookies
@Rice Flour Cookies :- Yeah that's what i mentioned in my question that i can't just populate array will those 10 billions integers. I am using C language
Ankit Rathod
+7  A: 

I had this as an interview question once.

Here is an algorithm that is O(N)

Use a hash table. Sequentially store pointers to the numbers, where the hash key is computed from the number value. Once you have a collision, you have found your duplicate.

Author Edit:

Below, @Phimuemue makes the excellent point that 4-byte integers have a fixed bound before a collision is guaranteed; that is 2^32, or approx. 4 GB. When considered in the conversation accompanying this answer, worst-case memory consumption by this algorithm is dramatically reduced.

Furthermore, using the bit array as described below can reduce memory consumption to 1/8th, 512mb. On many machines, this computation is now possible without considering either a persistent hash, or the less-performant sort-first strategy.

Now, longer numbers or double-precision numbers are less-effective scenarios for the bit array strategy.

Phimuemue Edit:

Of course one needs to take a bit "special" hash table:

Take a hashtable consisting of 2^32 bits. Since the question asks about 4-byte-integers, there are at most 2^32 different of them, i.e. one bit for each number. 2^32 bit = 512mb.

So now one has just to determine the location of the corresponding bit in the hashmap and set it. If one encounters a bit which already is set, the number occured in the sequence already.

kbrimington
Have you ever tried putting 10 billion numbers in a hash table?
fearofawhackplanet
@kbrimington :- Your answer is correct and simplest. But we can't possible keep a hastable of 10 billions integers. Can we?
Ankit Rathod
@fearofawhackplanet: Use a persistent hash. It is the most efficient data structure for the scenario.
kbrimington
@Nitesh-Panchal: If you don't want to use the storage, you can go N log N, and sort the collection in-place, then iterate across looking for the duplicate.
kbrimington
@kbrimington :- I don't mind using the storage but i mean i have say 512 mb of ram. Will 10 billion integers fill in hashtable in my memory?
Ankit Rathod
What planet do you people live on?! 10 billion numbers @ 4 bytes each is 40GB of data! And you are seriously trying to argue that you can just stick it in a hash table?
fearofawhackplanet
@fearofawhackplanet : :d. Thanks for joining in the discussion. Voted!
Ankit Rathod
@fearofawhackplanet: It's like I said earlier, a hash table is the only O(N) solution to the problem; however, if storage is an issue, go O(N log N) and sort it in place. The tradeoff is storage vs. performance.
kbrimington
@fearofawhackplanet - we're looking for 10 billion records, containing 1 32-bit integer each. Since we're only interested in "dupe, not dupe", we can get away with 2^29 bytes (2^32 bits) of storage, using the integer value to index on the bit we're interested in.
Vatine
Ok fair enough, I misunderstood the original post. The edit has clarified.
fearofawhackplanet
What you describe in the edit isn't a hashtable, since no hashing is involved; it's a bit array.
Nick Johnson
@Nick-Johnson: Thanks; the edit wasn't my doing, but as the point is well intended, I'll incorporate it.
kbrimington
A: 

You have to read each number and store it into a hashmap, so that if a number occurs again, it will automatically get discarded.

Anindya Chatterjee
A: 

If possible range of numbers in file is not too large then you can use some bit array to indicate if some of the number in range appeared.

DixonD
A: 

If the range of the numbers is small enough, you can use a bit field to store if it is in there - initialize that with a single scan through the file. Takes one bit per possible number.

With large range (like int) you need to read through the file every time. File layout may allow for more efficient lookups (i.e. binary search in case of sorted array).

Eiko
A: 

If time is not an issue and RAM is, you could read each number and then compare it to each subsequent number by reading from the file without storing it in RAM. It will take an incredible amount of time but you will not run out of memory.

murgatroid99
+2  A: 

Read the file once, create a hashtable storing the number of times you encounter each item. But wait! Instead of using the item itself as a key, you use a hash of the item iself, for example the least significant digits, let's say 20 digits (1M items).

After the first pass, all items that have counter > 1 may point to a duplicated item, or be a false positive. Rescan the file, consider only items that may lead to a duplicate (looking up each item in table one), build a new hashtable using real values as keys now and storing the count again.

After the second pass, items with count > 1 in the second table are your duplicates.

This is still O(n), just twice as slow as a single pass.

Mau
That all looks great but its success depends critically on the distribution of the 10 billion numbers. There are actually only just over 4 billion four byte integers so potentially every single one of them might be duplicated. Certainly, if you are hashing based on the bottom 20 bits, you might find every item has a counter > 1 after the first pass.
JeremyP
A: 

I have to agree with kbrimington and his idea of a hash table, but first of all, I would like to know the range of the numbers that you're looking for. Basically, if you're looking for 32-bit numbers, you would need a single array of 4.294.967.296 bits. You start by setting all bits to 0 and every number in the file will set a specific bit. If the bit is already set then you've found a number that has occurred before. Do you also need to know how often they occur?
Still, it would need 536.870.912 bytes at least. (512 MB.) It's a lot and would require some crafty programming skills. Depending on your programming language and personal experience, there would be hundreds of solutions to solve it this way.

Workshop Alex
Crafty programming skills? This is literally a 5-line `main()`...
R..
+4  A: 

The important question is whether you want to solve this problem efficiently, or whether you want accurately.

If you truly have 10 billion numbers and just one single duplicate, then you are in a "needle in the haystack" type of situation. Intuitively, short of very grimy and unstable solution, there is no hope of solving this without storing a significant amount of the numbers.

Instead, turn to probabilistic solutions, which have been used in most any practical application of this problem (in network analysis, what you are trying to do is look for mice, i.e., elements which appear very infrequently in a large data set).

A possible solution, which can be made to find exact results: use a sufficiently high-resolution Bloom filter. Either use the filter to determine if an element has already been seen, or, if you want perfect accuracy, use (as kbrimington suggested you use a standard hash table) the filter to, eh, filter out elements which you can't possibly have seen and, on a second pass, determine the elements you actually see twice.

And if your problem is slightly different---for instance, you know that you have at least 0.001% elements which repeat themselves twice, and you would like to find out how many there are approximately, or you would like to get a random sample of such elements---then a whole score of probabilistic streaming algorithms, in the vein of Flajolet & Martin, Alon et al., exist and are very interesting (not to mention highly efficient).

Jérémie
+1 for Bloom filters
Gabe Moothart
A: 

Had to do this a long time ago. What i did... i sorted the numbers as much as i could (had a time-constraint limit) and arranged them like this while sorting:

1 to 10, 12, 16, 20 to 50, 52 would become..

[1,10], 12, 16, [20,50], 52, ...

Since in my case i had hundreds of numbers that were very "close" ($a-$b=1), from a few million sets i had a very low memory useage

p.s. another way to store them

1, -9, 12, 16, 20, -30, 52,

when i had no numbers lower than zero

After that i applied various algorithms (described by other posters) here on the reduced data set

vlad b.
+1  A: 

How about:

  1. Sort input by using some algorith which allows only portion of input to be in RAM. Examples are there
  2. Seek duplicates in output of 1st step -- you'll need space for just 2 elements of input in RAM at a time to detect repetitions.
Victor Sorokin
+1  A: 

Finding duplicates

Noting that its a 32bit integer means that you're going to have a large number of duplicates, since a 32 bit int can only represent 4.3ish billion different numbers and you have "10 billions".

If you were to use a tightly packed set you could represent whether all the possibilities are in 512 MB, which can easily fit into current RAM values. This as a start pretty easily allows you to recognise the fact if a number is duplicated or not.

Counting Duplicates

If you need to know how many times a number is duplicated you're getting into having a hashmap that contains only duplicates (using the first 500MB of the ram to tell efficiently IF it should be in the map or not). At a worst case scenario with a large spread you're not going to be able fit that into ram.

Another approach if the numbers will have an even amount of duplicates is to use a tightly packed array with 2-8 bits per value, taking about 1-4GB of RAM allowing you to count up to 255 occurrances of each number.

Its going to be a hack, but its doable.

Fish
A: 
#include <stdio.h>
#include <stdlib.h>
/* Macro is overly general but I left it 'cos it's convenient */
#define BITOP(a,b,op) \
 ((a)[(size_t)(b)/(8*sizeof *(a))] op (size_t)1<<((size_t)(b)%(8*sizeof *(a))))
int main(void)
{
    unsigned x=0;
    size_t *seen = malloc(1<<8*sizeof(unsigned)-3);
    while (scanf("%u", &x)>0 && !BITOP(seen,x,&)) BITOP(seen,x,|=);
    if (BITOP(seen,x,&)) printf("duplicate is %u\n", x);
    else printf("no duplicate\n");
    return 0;
}
R..
A: 

This is a simple problem that can be solved very easily (several lines of code)
and very fast (several minutes of execution) with the right tools
my personal approach would be in using MapReduce
MapReduce: Simplified Data Processing on Large Clusters

i'm sorry for not going into more details but once getting familiar with the concept of MapReduce it is going to be very clear on how to target the solution
basicly we are going to implement two simple functions

  1. Map(key, value)
  2. Reduce(key, values[])

so all in all:

  • open file and iterate through the data
  • for each number -> Map(number, line_index)
  • in the reduce we will get the number as the key and the total occurrences as the number of values (including their positions in the file)
  • so in Reduce(key, values[]) if number of values > 1 than its a duplicate number
  • print the duplicates : number, line_index1, line_index2,...

    again this approach can result in a very fast execution depending on how your MapReduce framework is set, highly scalable and very reliable, there are many diffrent implementations for MapReduce in many languages

    there are several top companies presenting already built up cloud computing environments like Google, Microsoft azure, Amazon AWS, ...
    or you can build your own and set a cluster with any providers offering virtual computing environments paying very low costs by the hour

    good luck :)

    • Another more simple approach could be in using bloom filters

      AdamT
Adam
A: 

Implement a BitArray such that ith index of this array will correspond to the numbers 8*i +1 to 8*(i+1) -1. ie first bit of ith number is 1 if we already had seen 8*i+1. Second bit of ith number is 1 if we already have seen 8*i + 2 and so on.

Initialize this bit array with size Integer.Max/8 and whenever you saw a number k, Set the k%8 bit of k/8 index as 1 if this bit is already 1 means you have seen this number already.

ankushb