You are given two sets of integers, sizes M and N with M < N. Perform inner equal join on these two sets (i.e., find intersection of the two lists). How to perform it if both the lists are in files and the available memory is of size K < M < N
A nested loop join will take minimal memory. For each row in file 1 you load each row in file 2 and compare it. Then I guess you mark the hits in file 1.
Depending on the sizes of the files it might be more efficient to sort one of the files first (which can be done with minimal memory). A lot depends on the amount of memory you have to play with.
What do you mean by "the available memory is of size K < M < N"? Do by that you mean it's impossible to read a whole file into the memory? As Steve pointed out we need to know more about how much memory you actually have and what language you want to use on what system. The easiest answer would be to read only one value from file nr1 and then compare it to all the numbers from file nr2 one by one. Then take the next value from file nr1 and so on. Not the most elegant or fast appoach but sure it would get the job done.
Or just say that you're a Java programmer.
- Split N into n parts L (L < K)
- Read M number at a time, compare that number with all the numbers in current part L
- If you find the match write it to file
take K/2 elements of M and K/2 elements of N. Sort M-subset and N-subset. Now the intersection is trivial. Write the intersection, drop elements of the intersection, write back the left elements. Continue with all other K/2 subparts of M and N. You have now some common elements in a third file, and some partially sorted lists. Now for each K / 2 (minus removed elements) of M and N lists, compare them to find intersection efficiently.
(You also can add extra rounds of merging of 2 M-subsets and N-subsets to speed up intersection).
Hurrah !
Example of execution:
Ok let's take 2 lists of 9 integers with values between 0 and 9.
These lists will be
4 2 5 1 9 2 9 0 2
and
4 7 4 0 8 6 2 6 5
(generated by rand())
So we take two sublists :
4 2 5 and 4 7 4
Let's sort them using quicksort
4 2 5
i j <- two pointers
pivot = [0] = 4
increase i until it's greater than pivot = 4
i and j now points to 5
decrease j until it's smaller than pivot = 4
j points to 2
i is greater than j, thus except for pivot, the list is sorted.
Swap [0] and [j] to finish sorting
2 4 5.
Except for pivot + i + j, no extra allocation was needed (For 3 elements it's hard to believe, see any quicksort applet to have a better understanding).
Thus, from an algorithmic point of view, pivot + i + j are constant and can be neglected (in fact you would do memory - algorithm size to have the real K).
Do the same for the second set
(4 4 7)
Now with two pointers to the lists initialized to the start of the sublists, compare the pointed values.
2 - 4
Increase first sublist's pointers
4 - 4
Equal -> a first common element, remove them from list [2 5] and [4 7]
5 - 4
Increase second pointer
5 - 7
Increase first pointer
End of this sublists.
Dump this to original lists.
Do this for any other sublists
[1 9 2] and [0 8 6] -> [1 2 9] [0 6 8] -> no intersection
[9 0 2] and [2 6 5] -> [0 9 E] [5 6 E] -> [2]
Now we have :
[2 5 E 1 2 9 0 9 E] and [4 7 E 0 6 8 5 6 E] with E for empty elements.
Now compare subsets with other subsets (sorted) (excepted for already processed sets (same index))
[2 5 E] with [0 6 8] and [6 5 E] -> becomes [2 E E] and [0 6 8] [6 E E] (intersection [5])
Then
[1 2 9] with [4 7 E] and [6 E E] -> [1 2 9] and [4 7 E] [6 E E] (no intersection)
Then
[0 9 E] with [4 7 E] and [0 6 8] -> [9 E E] and [4 7 E] [6 8 E] (intersection [0])
End :
[2 E E 1 2 9 9 E E] [4 7 E 6 8 E 6 E E] with common elements [4 2 5 0]
Using an in-place sorting algorithm to sort both the lists first.
Once both M an N are sorted, then calculating the intersection is like a race. Place two markers a
and b
at the beginning of each list. Do these steps until any marker reaches the end of the list. In pseudocode:
until a > M.size or b > N.size
if M[a] < N[b]
a++
continue
if N[b] < M[a]
b++
continue
print M[a] // common element
Given a decent in-place sorting algorithm, the complexity of this will be O(nlogn)
.
for each k/2 subpart of M
for each k/2 subpart of N sort the M-subpart & N-subpart find intersection of both the subparts search each element of the result and store it if it already not
present
I think you can use bit set for this purpose.BitSet only consumes one bit per integer. Hope this helps.
BitSet b=new BitSet();//set 1 with {10,20,30}
BitSet c=new BitSet();//set 2 with {20,30,1,2}
b.set(10);
b.set(20);
b.set(30);
c.set(20);
c.set(30);
c.set(1);
c.set(2);
b.and(c);
System.out.println(b);//{20, 30}