I have n sorted lists (5 < n < 300). These lists are quite long (300000+ tuples). Selecting the top k of the individual lists is of course trivial - they are right at the head of the lists.

Example for k = 2:

top2 (L1: [ 'a': 10, 'b': 4, 'c': 3 ]) = ['a':10, 'b':4]
top2 (L2: [ 'c': 5, 'b': 2, 'a': 0 ]) = ['c':5, 'b':2]

Where it gets more interesting is when I want the combined top k across all the sorted lists.

top2(L1+L2) = ['a':10, 'c':8]

Just combining the top k of the individual lists would not necessarily give the correct result:

top2(top2(L1)+top2(L2)) = ['a':10, 'b':6]

The goal is to reduce the required space and keep the sorted lists small.

top2(topX(L1)+topX(L2)) = ['a':10, 'c':8]

The question is whether there is an algorithm to calculate the combined top k, with the correct order, while cutting off the long tail of the lists at a certain position. And if there is: how does one find the limit X where it is safe to cut?

Note: Correct counts are not important. Only the order is.

top2(magic([L1,L2])) = ['a', 'c']
+1  A: 

If I understand your question correctly, the correct output is the top 10 items, irrespective of the list from which each came. If that's correct, then starting with the first 10 items in each list will allow you to generate the correct output (if you only want unique items in the output, but the inputs might contain duplicates, then you need 10 unique items in each list).

In the most extreme case, all the top items come from one list, and all items from the other lists are ignored. In this case, having 10 items in the one list will be sufficient to produce the correct result.

Jerry Coffin
In the extreme case, you are looking at a lot of elements that do not need to be inspected!
Leftium
A: 
  1. Associate an index with each of your n lists. Set it to point to the first element in each case.
  2. Create a list-of-lists, and sort it by the indexed elements.
  3. The indexed item on the top list in your list-of-lists is your first element.
  4. Increment the index for the topmost list, remove that list from the list-of-lists, and re-insert it based on the new value of its indexed element.
  5. The indexed item on the top list in your list-of-lists is your next element.
  6. Go to step 4 and repeat until done.

You didn't specify how many lists you have. If n is small, then step 4 can be done very simply (just re-sort the lists). As n grows you may want to think about more efficient ways to re-sort an almost-sorted list-of-lists.
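A rough Python sketch of these steps (the names and the (key, count) tuple representation are my assumptions). Note that this is a plain merge of the pre-sorted order; it does not sum the counts of a key that appears in several lists, which is what the question ultimately needs:

    def merge_sorted_lists(lists, k):
        """Sketch of the list-of-lists merge described above.
        lists: (key, count) tuples, each sorted by descending count."""
        # Step 1: pair every list with an index pointing at its current element
        entries = [[lst, 0] for lst in lists if lst]
        result = []
        while entries and len(result) < k:
            # Step 2: sort the list-of-lists by the indexed element's count
            entries.sort(key=lambda e: e[0][e[1]][1], reverse=True)
            top = entries[0]
            # Steps 3/5: the indexed item of the topmost list is the next element
            result.append(top[0][top[1]])
            # Step 4: advance that list's index; drop the list once exhausted
            top[1] += 1
            if top[1] >= len(top[0]):
                entries.pop(0)
        return result

    L1 = [('a', 10), ('b', 4), ('c', 3)]
    L2 = [('c', 5), ('b', 2), ('a', 0)]
    print(merge_sorted_lists([L1, L2], 4))
    # [('a', 10), ('c', 5), ('b', 4), ('c', 3)]  (duplicate keys are not combined)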

swestrup
A: 

In general, I think you are in trouble. Imagine the following lists:

['a':100, 'b':99, ...]
['c':90, 'd':89, ..., 'b':2]

and you have k=1 (i.e. you want only the top one). 'b' is the right answer, but you need to look all the way down to the end of the second list to realize that 'b' beats 'a'.

Edit:

If you have the right distribution (long, low-count tails), you might be able to do better. Let's stick with k = 1 for now to make our lives easier.

The basic algorithm is to keep a hash map of the keys you've seen so far and their associated totals. Walk down the lists processing elements and updating your map.

The key observation is that a key can gain in count by at most the sum of the counts at the current processing point of each list (call that sum S). So on each step, you can prune from your hash map any keys whose total is more than S below your current maximum count element. (I'm not sure what data structure you would need to prune as you need to look up keys given a range of counts - maybe a priority queue?)

When your hash map has only one element in it, and its count is at least S, then you can stop processing the lists and return that element as the answer. If your count distribution plays nice, this early exit may actually trigger so you don't have to process all of the lists.
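A sketch of this idea for k = 1, assuming (key, count) tuples sorted by descending count and non-negative counts; a plain dict rebuild stands in for the range-lookup structure, and all names are mine:

    def top_one_with_pruning(lists):
        """Sketch of the pruning/early-exit idea for k = 1."""
        totals = {}                   # keys seen so far -> accumulated total
        pos = [0] * len(lists)        # current processing point in each list
        while True:
            # S: the most any key could still gain from the unread tails
            S = sum(lst[p][1] for lst, p in zip(lists, pos) if p < len(lst))
            if totals:
                leader = max(totals, key=totals.get)
                best = totals[leader]
                # Prune keys whose total is more than S below the current maximum
                totals = {key: c for key, c in totals.items() if best - c <= S}
                # Early exit: one candidate left and no unseen key can catch up
                if len(totals) == 1 and best >= S:
                    return leader
            # Walk one step down every list, updating the running totals
            advanced = False
            for i, lst in enumerate(lists):
                if pos[i] < len(lst):
                    key, count = lst[pos[i]]
                    totals[key] = totals.get(key, 0) + count
                    pos[i] += 1
                    advanced = True
            if not advanced:          # all lists exhausted without an early exit
                return max(totals, key=totals.get) if totals else None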

Keith Randall
Actually, 'a' beats 'b' because tcurdt is looking for the highest value. Also, only the first element of each list needs inspection because they are already sorted.
Leftium
@wonsungi: The claim that 'a' beats 'b' is wrong. The lists are sorted by the number. Think of it as the frequency of certain queries (say, from different files). Now the OP wants to merge the files together to find the most frequent queries. I saw your answer and you are solving the wrong problem.
Moron
That's not how I see the OP's question. Look at the first example where he gets 'c':8. That comes from 'c':3 in the first list and 'c':5 in the second list.
Keith Randall
The two lists you gave as an example are too simplified because they don't exhibit keys across both lists. Nevertheless, the naive approach is to sum up the lists and then sort the result by the summed values. So we'd get ['a':100, 'b':99, 'c':90, 'd':89, 'b':2] for your two lists. Since you picked k=1 the result should be [ 'a': 100 ]
tcurdt
Ummm... 'b' is the same in both lists.
Keith Randall
Oops... shouldn't post that late at night :) ['b':101, 'a':100, 'c':90, 'd':89] --k=1--> ['b':101]
tcurdt
A: 

I did not realize that if an 'a' appears in two lists, its counts must be combined. Here is a new, memory-efficient algorithm:

(New) Algorithm:

  1. (Re-)sort each list by ID (not by count). To release memory, the list can be written back to disk. Only enough memory for the longest list is required.
  2. Get the next lowest unprocessed ID and find the total count across all lists.
  3. Insert the ID into a priority queue of k nodes. Use the total count as the node's priority (not the ID). This priority queue drops the lowest node if more than k nodes are inserted.
  4. Go to step 2 until all IDs have been exhausted (see the sketch below).
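A compact Python sketch of steps 2-4, assuming the lists have already been re-sorted by ID and are given as sequences of (id, count) tuples. heapq.merge holds only one tuple per list at a time and internally uses the small min-heap mentioned in point 2 of the analysis below:

    import heapq
    from itertools import groupby
    from operator import itemgetter

    def top_k_by_id_merge(lists, k):
        """lists: (id, count) tuples, each sorted by ascending id."""
        merged = heapq.merge(*lists, key=itemgetter(0))   # step 2: ids in order
        heap = []                                         # step 3: min-heap of at most k nodes
        for key, group in groupby(merged, key=itemgetter(0)):
            total = sum(count for _, count in group)      # total count across all lists
            if len(heap) < k:
                heapq.heappush(heap, (total, key))
            else:
                heapq.heappushpop(heap, (total, key))     # drops the lowest of the k+1
        return sorted(heap, reverse=True)                 # highest totals first

    L1 = [('a', 10), ('b', 4), ('c', 3)]
    L2 = [('a', 0), ('b', 2), ('c', 5)]
    print(top_k_by_id_merge([L1, L2], 2))   # [(10, 'a'), (8, 'c')]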

Analysis: This algorithm can be implemented using only O(k) additional memory to store the min-heap. It makes several trade-offs to accomplish this:

  1. The lists are sorted by ID in place; the original orderings by counts are lost. Otherwise O(U) additional memory is required to make a master list of ID: total_count tuples, where U is the number of unique IDs.
  2. The next lowest ID is found in O(n) time by checking the first tuple of each list. This is repeated U times, where U is the number of unique IDs. This might be improved by using a min-heap to track the next lowest ID, but that would require O(n) additional memory (and may not be faster in all cases).

Note: This algorithm assumes IDs can be compared quickly. String comparisons are not trivial, so I suggest hashing string IDs to integers. The hashes do not have to be unique, but collisions must be checked so that all IDs are properly sorted/compared. Of course, this would add to the memory/time complexity.

Leftium
Indeed, the values need to be summed up. Your algorithm sounds good, but the actual problem is not just finding the top k. For your algorithm I need to keep the data of all lists stored, IIUC. That's what I would like to avoid: "The goal is to reduce the required space".
tcurdt
@tcurdt: I incorrectly assumed all the lists were already in memory. You can write each list back to disk after it has been sorted by key. Then the most memory that is required is enough to hold the longest list. See my new answer for a simpler algorithm that has about the same memory efficiency.
Leftium
I am even concerned about disk space.
tcurdt
+1  A: 

This algorithm uses O(U) memory, where U is the number of unique keys. I doubt a lower memory bound can be achieved, because it is impossible to tell which keys can be discarded until all the keys have been summed.

  1. Make a master list of (key: total_count) tuples. Simply run through each list one item at a time, keeping a running total of the counts for each key.
  2. Use any top-k selection algorithm on the master list that does not use additional memory. One simple solution is to sort the list in place (see the sketch below).
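In Python this is only a few lines (a sketch; a Counter serves as the master list, and heapq.nlargest stands in for the in-place sort):

    import heapq
    from collections import Counter

    def top_k_exact(lists, k):
        """Master-list approach: O(U) memory, where U = number of unique keys."""
        totals = Counter()                        # step 1: key -> summed count
        for lst in lists:
            for key, count in lst:
                totals[key] += count
        # Step 2: top-k selection over the master list
        return heapq.nlargest(k, totals.items(), key=lambda kv: kv[1])

    L1 = [('a', 10), ('b', 4), ('c', 3)]
    L2 = [('c', 5), ('b', 2), ('a', 0)]
    print(top_k_exact([L1, L2], 2))   # [('a', 10), ('c', 8)]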
Leftium
"it is impossible to tell which keys can be discarded until all the keys have been summed" that is also what I fear. I've already played with the top-k selection algorithm to see how deep into the lists it steps into. But this seems to heavily depend on the distribution of the data.
tcurdt
Given that I don't need full accuracy of the counts, I was hoping to be able to reduce the lists' long-tail data into some kind of summary information so that some of the original keys can be discarded. But it looks like this is not an easy CS problem to solve.
tcurdt
I am not sure this really is the final answer, but given that no one came up with a better answer and it matches what I thought, I'll accept it as is.
tcurdt
A: 

The perfect solution requires all tuples to be inspected at least once.

However, it is possible to get close to the perfect solution without inspecting every tuple. Discarding the "long tail" introduces a margin of error. You can use some type of heuristic to calculate when the margin of error is acceptable.

For example, if there are n = 100 sorted lists and you have inspected down each list until the count is 2, the most the total count for any key could still increase is 100 × 2 = 200.

I suggest taking an iterative approach:

  1. Tally each list until a certain lower count threshold L is reached.
  2. Lower L to include more tuples.
  3. Add the new tuples to the counts tallied so far.
  4. Go to step 2 until lowering L does not change the top k counts by more than a certain percentage.

This algorithm assumes the counts for the top k keys will approach a certain value the further the long tail is traversed. You can use other heuristics instead of a fixed percentage, such as the number of new keys in the top k, how much the top k keys were shuffled, etc.
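A sketch of this loop under stated assumptions: the halving schedule for L, the relative-change stopping rule, and all names are placeholders I chose, not part of the answer:

    def iterative_top_k(lists, k, start_threshold, shrink=0.5, tolerance=0.01):
        """lists: (key, count) tuples sorted by descending count (counts > 0)."""
        totals = {}                       # running tallies for keys seen so far
        positions = [0] * len(lists)      # how far down each list we have read
        threshold = start_threshold       # the lower count threshold L
        prev_topk = None
        while True:
            # Steps 1/3: tally every tuple whose count is still >= L
            for i, lst in enumerate(lists):
                while positions[i] < len(lst) and lst[positions[i]][1] >= threshold:
                    key, count = lst[positions[i]]
                    totals[key] = totals.get(key, 0) + count
                    positions[i] += 1
            topk = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:k]
            if all(p == len(lst) for p, lst in zip(positions, lists)):
                return topk               # everything inspected: exact answer
            # Step 4: stop once lowering L barely changes the top k counts
            if prev_topk and [key for key, _ in topk] == [key for key, _ in prev_topk]:
                changes = [abs(new - old) / old
                           for (_, new), (_, old) in zip(topk, prev_topk)]
                if max(changes) <= tolerance:
                    return topk
            prev_topk = topk
            threshold *= shrink           # step 2: lower L to include more tuples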

Leftium