views:

191

answers:

7

You are given two sets of integers, sizes M and N with M < N. Perform inner equal join on these two sets (i.e., find intersection of the two lists). How to perform it if both the lists are in files and the available memory is of size K < M < N

A: 

A nested loop join will take minimal memory. For each row in file 1 you load each row in file 2 and compare it. Then I guess you mark the hits in file 1.

Depending on the sizes of the files it might be more efficient to sort one of the files first (which can be done with minimal memory). A lot depends on the amount of memory you have to play with.

Yellowfog
It's terribly inefficient.If it's for a job-interview, you have to at least try to be clever
Scharron
Ah, I figure that with your comment that you're trying to be clever. Not entirely successfully, though. A nested loop is the simplest answer to the problem, and will be in certain circumstance the most efficient. That'd be the one that I'd start with in the interview. More complicated solutions will depend upon how much memory you have to play with. I'm sure that I mentioned that.
Yellowfog
Sorry I was so rude; I made supposition based on past experience. From my point of view, your answer is simply not enough. But you're right, as a first answer it's ok.
Scharron
A: 

What do you mean by "the available memory is of size K < M < N"? Do by that you mean it's impossible to read a whole file into the memory? As Steve pointed out we need to know more about how much memory you actually have and what language you want to use on what system. The easiest answer would be to read only one value from file nr1 and then compare it to all the numbers from file nr2 one by one. Then take the next value from file nr1 and so on. Not the most elegant or fast appoach but sure it would get the job done.

Or just say that you're a Java programmer.

Zenzen
it's an algorithmic question, not about an actual implementation
Scharron
A: 
  1. Split N into n parts L (L < K)
  2. Read M number at a time, compare that number with all the numbers in current part L
  3. If you find the match write it to file
arjan
I'm not really sure how would splitting the larger file into chunks help or be more efficient than a simple nested loop? You still need to read all the numbers and compare them with the other file - then you need some memory for the numbers from M, so you have to divide N into chunks smaller than N/n < K more like N/n < (K+the_number_of_numbers_from_M_to_be_compared).
Zenzen
for each part you have to load numbers from M just once, you read M number at a time and compare it with all the numbers in the current part
arjan
changed my answer to make it clearer
arjan
+1  A: 

take K/2 elements of M and K/2 elements of N. Sort M-subset and N-subset. Now the intersection is trivial. Write the intersection, drop elements of the intersection, write back the left elements. Continue with all other K/2 subparts of M and N. You have now some common elements in a third file, and some partially sorted lists. Now for each K / 2 (minus removed elements) of M and N lists, compare them to find intersection efficiently.

(You also can add extra rounds of merging of 2 M-subsets and N-subsets to speed up intersection).

Hurrah !


Example of execution:

Ok let's take 2 lists of 9 integers with values between 0 and 9.
These lists will be
4 2 5 1 9 2 9 0 2
and
4 7 4 0 8 6 2 6 5
(generated by rand())

So we take two sublists : 
4 2 5 and 4 7 4

Let's sort them using quicksort
4 2 5
i   j <- two pointers
pivot = [0] = 4

increase i until it's greater than pivot = 4
i and j now points to 5
decrease j until it's smaller than pivot = 4
j points to 2
i is greater than j, thus except for pivot, the list is sorted.
Swap [0] and [j] to finish sorting
2 4 5.

Except for pivot + i + j, no extra allocation was needed (For 3 elements it's hard to believe, see any quicksort applet to have a better understanding).
Thus, from an algorithmic point of view, pivot + i + j are constant and can be neglected (in fact you would do memory - algorithm size to have the real K).

Do the same for the second set
(4 4 7)

Now with two pointers to the lists initialized to the start of the sublists, compare the pointed values.
2 - 4
Increase first sublist's pointers
4 - 4
Equal -> a first common element, remove them from list [2 5] and [4 7]
5 - 4
Increase second pointer
5 - 7
Increase first pointer
End of this sublists.

Dump this to original lists.
Do this for any other sublists
[1 9 2] and [0 8 6] -> [1 2 9] [0 6 8] -> no intersection
[9 0 2] and [2 6 5] -> [0 9 E] [5 6 E] -> [2]

Now we have : 
[2 5 E 1 2 9 0 9 E] and [4 7 E 0 6 8 5 6 E] with E for empty elements.
Now compare subsets with other subsets (sorted) (excepted for already processed sets (same index))
[2 5 E] with [0 6 8] and [6 5 E] -> becomes [2 E E] and [0 6 8] [6 E E]  (intersection [5])
Then
[1 2 9] with [4 7 E] and [6 E E] -> [1 2 9] and [4 7 E] [6 E E] (no intersection)
Then
[0 9 E] with [4 7 E] and [0 6 8] -> [9 E E] and [4 7 E] [6 8 E] (intersection [0])
End : 
[2 E E 1 2 9 9 E E] [4 7 E 6 8 E 6 E E] with common elements [4 2 5 0]
Scharron
why not using a Hash in order to check intersection? isn't it better than sorting? (complexity wise)
@javasuperuser Hash will be optimal if space wasn't a constraint, but it is.
Anurag
correct, thanks
How do you perform the sorting if you have no memory left? There's K memory, and you've taken K/2 elements from M and K/2 elements from N.
Yellowfog
Doing in-place sorting, you don't need more memory.
Scharron
I'm really trying to understand this but I can't. And I'd like to. Is there any chance that you could give a worked example, say with 9 integers each in M and N and memory limited to 6 integers?
Yellowfog
I edited to add an example of execution. Hope that helps. Since I was not sure if it was about in-place sorting or the whole algorithm, the example is about everyting :-)
Scharron
+3  A: 

Using an in-place sorting algorithm to sort both the lists first.

Once both M an N are sorted, then calculating the intersection is like a race. Place two markers a and b at the beginning of each list. Do these steps until any marker reaches the end of the list. In pseudocode:

until a > M.size or b > N.size
    if M[a] < N[b]
        a++
        continue
    if N[b] < M[a]
        b++
        continue
    print M[a] // common element

Given a decent in-place sorting algorithm, the complexity of this will be O(nlogn).

Anurag
how would you sort in-place while the memory can't fit the arrays?
Heapsort is a O(nlogn) in-place sorting algorithm for example.
Scharron
Also see http://en.wikipedia.org/wiki/External_sorting
sdcvvc
quicksort is also an in-place sort , yet you didn't describe how you deal with the fact that you can't load the entire array, I guess external sorting is the answer
@sdcvvc - the elements have to be split into halves of K/2 as @Scharron suggested until we run out of elements from the smaller list - M.
Anurag
A: 

for each k/2 subpart of M

 for each k/2 subpart of N 
        sort the M-subpart & N-subpart
        find intersection of both the subparts 
        search each element of the result and store it if it already not

present

vijayk
A: 

I think you can use bit set for this purpose.BitSet only consumes one bit per integer. Hope this helps.


BitSet b=new BitSet();//set 1 with {10,20,30}
BitSet c=new BitSet();//set 2 with {20,30,1,2}
b.set(10);
b.set(20);
b.set(30);
c.set(20);
c.set(30);
c.set(1);
c.set(2);
b.and(c);
System.out.println(b);//{20, 30}
Emil