ansaurus

Question

Interview: lists intersection with limited memory

Answer 1

A:

A nested loop join will take minimal memory. For each row in file 1 you load each row in file 2 and compare it. Then I guess you mark the hits in file 1.

Depending on the sizes of the files it might be more efficient to sort one of the files first (which can be done with minimal memory). A lot depends on the amount of memory you have to play with.

Yellowfog 2010-07-24 21:46:23

It's terribly inefficient.If it's for a job-interview, you have to at least try to be clever

Scharron 2010-07-24 22:05:55

Ah, I figure that with your comment that you're trying to be clever. Not entirely successfully, though. A nested loop is the simplest answer to the problem, and will be in certain circumstance the most efficient. That'd be the one that I'd start with in the interview. More complicated solutions will depend upon how much memory you have to play with. I'm sure that I mentioned that.

Yellowfog 2010-07-25 08:37:34

Sorry I was so rude; I made supposition based on past experience. From my point of view, your answer is simply not enough. But you're right, as a first answer it's ok.

Scharron 2010-07-25 21:20:09

Answer 2

A:

What do you mean by "the available memory is of size K < M < N"? Do by that you mean it's impossible to read a whole file into the memory? As Steve pointed out we need to know more about how much memory you actually have and what language you want to use on what system. The easiest answer would be to read only one value from file nr1 and then compare it to all the numbers from file nr2 one by one. Then take the next value from file nr1 and so on. Not the most elegant or fast appoach but sure it would get the job done.

Or just say that you're a Java programmer.

Zenzen 2010-07-24 21:49:45

it's an algorithmic question, not about an actual implementation

Scharron 2010-07-24 22:05:31

Answer 3

A:

Split N into n parts L (L < K)
Read M number at a time, compare that number with all the numbers in current part L
If you find the match write it to file

arjan 2010-07-24 21:52:57

I'm not really sure how would splitting the larger file into chunks help or be more efficient than a simple nested loop? You still need to read all the numbers and compare them with the other file - then you need some memory for the numbers from M, so you have to divide N into chunks smaller than N/n < K more like N/n < (K+the_number_of_numbers_from_M_to_be_compared).

Zenzen 2010-07-24 22:01:48

for each part you have to load numbers from M just once, you read M number at a time and compare it with all the numbers in the current part

arjan 2010-07-25 09:00:56

changed my answer to make it clearer

arjan 2010-07-25 09:06:39

Answer 4

+1 A:

take K/2 elements of M and K/2 elements of N. Sort M-subset and N-subset. Now the intersection is trivial. Write the intersection, drop elements of the intersection, write back the left elements. Continue with all other K/2 subparts of M and N. You have now some common elements in a third file, and some partially sorted lists. Now for each K / 2 (minus removed elements) of M and N lists, compare them to find intersection efficiently.

(You also can add extra rounds of merging of 2 M-subsets and N-subsets to speed up intersection).

Hurrah !

Example of execution:

Ok let's take 2 lists of 9 integers with values between 0 and 9.
These lists will be
4 2 5 1 9 2 9 0 2
and
4 7 4 0 8 6 2 6 5
(generated by rand())

So we take two sublists : 
4 2 5 and 4 7 4

Let's sort them using quicksort
4 2 5
i   j <- two pointers
pivot = [0] = 4

increase i until it's greater than pivot = 4
i and j now points to 5
decrease j until it's smaller than pivot = 4
j points to 2
i is greater than j, thus except for pivot, the list is sorted.
Swap [0] and [j] to finish sorting
2 4 5.

Except for pivot + i + j, no extra allocation was needed (For 3 elements it's hard to believe, see any quicksort applet to have a better understanding).
Thus, from an algorithmic point of view, pivot + i + j are constant and can be neglected (in fact you would do memory - algorithm size to have the real K).

Do the same for the second set
(4 4 7)

Now with two pointers to the lists initialized to the start of the sublists, compare the pointed values.
2 - 4
Increase first sublist's pointers
4 - 4
Equal -> a first common element, remove them from list [2 5] and [4 7]
5 - 4
Increase second pointer
5 - 7
Increase first pointer
End of this sublists.

Dump this to original lists.
Do this for any other sublists
[1 9 2] and [0 8 6] -> [1 2 9] [0 6 8] -> no intersection
[9 0 2] and [2 6 5] -> [0 9 E] [5 6 E] -> [2]

Now we have : 
[2 5 E 1 2 9 0 9 E] and [4 7 E 0 6 8 5 6 E] with E for empty elements.
Now compare subsets with other subsets (sorted) (excepted for already processed sets (same index))
[2 5 E] with [0 6 8] and [6 5 E] -> becomes [2 E E] and [0 6 8] [6 E E]  (intersection [5])
Then
[1 2 9] with [4 7 E] and [6 E E] -> [1 2 9] and [4 7 E] [6 E E] (no intersection)
Then
[0 9 E] with [4 7 E] and [0 6 8] -> [9 E E] and [4 7 E] [6 8 E] (intersection [0])
End : 
[2 E E 1 2 9 9 E E] [4 7 E 6 8 E 6 E E] with common elements [4 2 5 0]

Scharron 2010-07-24 21:58:51

why not using a Hash in order to check intersection? isn't it better than sorting? (complexity wise)

2010-07-24 22:13:51

@javasuperuser Hash will be optimal if space wasn't a constraint, but it is.

Anurag 2010-07-24 22:17:11

correct, thanks

2010-07-24 22:18:11

How do you perform the sorting if you have no memory left? There's K memory, and you've taken K/2 elements from M and K/2 elements from N.

Yellowfog 2010-07-25 11:47:27

Doing in-place sorting, you don't need more memory.

Scharron 2010-07-25 11:51:52

I'm really trying to understand this but I can't. And I'd like to. Is there any chance that you could give a worked example, say with 9 integers each in M and N and memory limited to 6 integers?

Yellowfog 2010-07-25 22:03:17

I edited to add an example of execution. Hope that helps. Since I was not sure if it was about in-place sorting or the whole algorithm, the example is about everyting :-)

Scharron 2010-07-25 22:48:10

Answer 5

+3 A:

Using an in-place sorting algorithm to sort both the lists first.

Once both M an N are sorted, then calculating the intersection is like a race. Place two markers a and b at the beginning of each list. Do these steps until any marker reaches the end of the list. In pseudocode:

until a > M.size or b > N.size
    if M[a] < N[b]
        a++
        continue
    if N[b] < M[a]
        b++
        continue
    print M[a] // common element

Given a decent in-place sorting algorithm, the complexity of this will be O(nlogn).

Anurag 2010-07-24 22:13:34

how would you sort in-place while the memory can't fit the arrays?

2010-07-24 22:19:12

Heapsort is a O(nlogn) in-place sorting algorithm for example.

Scharron 2010-07-24 22:23:07

Also see http://en.wikipedia.org/wiki/External_sorting

sdcvvc 2010-07-24 22:23:09

quicksort is also an in-place sort , yet you didn't describe how you deal with the fact that you can't load the entire array, I guess external sorting is the answer

2010-07-24 22:25:13

@sdcvvc - the elements have to be split into halves of K/2 as @Scharron suggested until we run out of elements from the smaller list - M.

Anurag 2010-07-24 22:25:45

Answer 6

A:

for each k/2 subpart of M

 for each k/2 subpart of N 
        sort the M-subpart & N-subpart
        find intersection of both the subparts 
        search each element of the result and store it if it already not

present

vijayk 2010-07-25 08:54:50

Answer 7

A:

I think you can use bit set for this purpose.BitSet only consumes one bit per integer. Hope this helps.


BitSet b=new BitSet();//set 1 with {10,20,30}
BitSet c=new BitSet();//set 2 with {20,30,1,2}
b.set(10);
b.set(20);
b.set(30);
c.set(20);
c.set(30);
c.set(1);
c.set(2);
b.and(c);
System.out.println(b);//{20, 30}

Emil 2010-07-26 11:20:45

ansaurus

tags:

views:

answers:

Interview: lists intersection with limited memory

related questions