Here's a description:

It operates like a regular map with get, put, and remove methods, but has a getTopKEntries(int k) method to get the top-K elements, sorted by the key:
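Something along these lines (a rough sketch; the exact signatures are just what I imagine, not a fixed API):

```
import java.util.List;
import java.util.Map;

// Sketch of the interface I have in mind; keys and values are 4-byte ints.
public interface TopKMap {
    Integer get(int key);
    void put(int key, int value);
    Integer remove(int key);

    // Entries for the top K keys, sorted by key.
    List<Map.Entry<Integer, Integer>> getTopKEntries(int k);
}
```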

For my specific use case, I'm adding, removing, and adjusting a lot of values in the structure, but at any one time there's approximately 500-1000 elements; I want to return the entries for the top 10 keys efficiently.

  • I call the put and remove methods many times.
  • I call the getTopKEntries method.
  • I call the put and remove methods some more times.
  • I call the getTopKEntries method.
  • ...

I'm hoping for O(1) get, put, and remove operations, and for getTopKEntries to be dependent only on K, not on the size of the map.

So what's a data structure for efficiently returning the top-K elements of a map?

My other question is similar, but is for the case of returning all elements of a map, sorted by the key.

If it helps, both the keys and values are 4-byte integers.

+1  A: 

A binary search tree (e.g. std::map in C++) sounds like the perfect structure: it already keeps its elements ordered by key, so a simple in-order traversal yields them in ascending order. Hence, iterating over the first k elements yields the top k elements directly.

Additionally, since you foresee a lot of “remove” operations, a hash table won’t be well-suited anyway: remove operations destroy the load factor characteristics of hash tables which leads to a rapid deterioration of the runtime.
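To make this concrete, here is a minimal sketch using Java's TreeMap as the binary-search-tree stand-in (the question's API looks Java-flavored; the class and variable names are just for the example):

```
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: a TreeMap (a red-black tree, analogous to std::map)
// keeps its entries sorted by key, so the first k entries of an ordered
// iteration are exactly the k smallest keys -- no separate sort needed.
public class FirstKDemo {
    public static void main(String[] args) {
        TreeMap<Integer, Integer> map = new TreeMap<>();
        map.put(42, 1); map.put(7, 2); map.put(99, 3); map.put(13, 4);

        int k = 2;
        for (Map.Entry<Integer, Integer> e : map.entrySet()) { // ascending by key
            if (k-- == 0) break;
            System.out.println(e.getKey() + " -> " + e.getValue()); // 7 -> 2, then 13 -> 4
        }
        // If "top" means the largest keys instead, iterate map.descendingMap().
    }
}
```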

Konrad Rudolph
A binary tree has O(log n) time in the average case, but needs O(n) time in the worst case. Not really O(1). I think in this case O(1) is impossible without using a TreeMap (binary tree + hash map).
Chris Kannon
However, the problem with maintaining a binary search tree is that the operations become `O(log n)`... and I'm glad you picked up on the `remove` operations; I have a specialized hash map implementation with good real-time behavior for smooth capacity increase and compaction.
Rudiger
`@Chris Kannon:` I certainly hope it's possible to get the top-K entries without sorting the whole thing. Looking at the answers to my other question, I could maintain the indexes and then only get the top-K elements instead of sorting the whole thing. But I feel there's a better way...
Rudiger
Although I agree with Chris on the choice of TreeMap, isn't adding to a binary tree still O(log n)? I'm not sure you can achieve O(1) for adding to or removing from anything sorted.
Bill K
It's not going to be possible to get the top-K entries without sorting the whole thing. Why not, you ask? Because if you happen to remove one of the top-K entries, you need to already know which entry is the K+1th, and so on all the way to the end. The alternative is something like a selection-sort on the remaining entries to determine which is now in the top-K, which absolutely tubes your remove performance.
Anon.
`@Bill K:` Correct; any comparison-based structure that maintains the _whole_ thing in sorted order is going to hit an `Ω(log n)` lower bound per operation. `@Anon.:` But I think this can be solved with `O(1)` adding and removing, and a fast top-K retrieval, or at least faster than sorting the whole thing...
Rudiger
So what happens if you remove an object in the top K, and then try to get the top K entries?
Anon.
@Chris: well, O(1) is just too much to hope for. And O(log n) is actually very, *very* good – much better than most people intuitively assume. And with just 500 to 1000 elements, that’s not even worth talking about.
Konrad Rudolph
`@Konrad Rudolph:` I don't think `O(1)` is too much to hope for (a man can dream, damn it!)... But you're right; I probably need to profile the simpler solutions before I whip out the direct-address-tables...
Rudiger
@Rudiger, pre-optimizing can sometimes be the worst thing to do. Make it work, THEN make it fast :)
Chris Kannon
Won't you get better real-life performance with a quad-tree?
Stephan Eggermont
`@Stephan Eggermont:` What's a quad-tree? Perhaps this should go in a new answer?
Rudiger
A: 

I would recommend a Fibonacci heap.

BlueRaja - Danny Pflughoeft
Interesting, but I have just as many `remove` operations as `put` operations (asymptotically); a Fibonacci heap probably isn't so great in this case... And I might need to reevaluate the use of amortized data structures, since a few operations that take a long time to complete might kill the real-time performance...
Rudiger
A: 

You might want a heap (although deletion may be a problem).
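One common workaround for the deletion problem (not spelled out above, so treat this as an illustrative assumption) is lazy deletion: keep the live entries in a hash map next to the heap and simply skip stale heap entries when they surface at the top. A rough sketch, with hypothetical names:

```
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;

// Sketch of "lazy deletion": removed keys stay in the heap and are
// discarded only when they reach the top.
class LazyDeleteHeap {
    private final PriorityQueue<Integer> maxHeap =
            new PriorityQueue<>(Collections.reverseOrder());
    private final Map<Integer, Integer> live = new HashMap<>(); // key -> value

    void put(int key, int value) {
        live.put(key, value);
        maxHeap.add(key);      // duplicates are harmless; stale copies get skipped
    }

    void remove(int key) {
        live.remove(key);      // O(1); the heap entry simply becomes stale
    }

    Integer peekTopKey() {
        while (!maxHeap.isEmpty() && !live.containsKey(maxHeap.peek())) {
            maxHeap.poll();    // drop stale entries lazily
        }
        return maxHeap.isEmpty() ? null : maxHeap.peek();
    }
}
```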

oefe
+1  A: 

I'm not sure I fully accept Konrad's view that lots of remove operations would destroy the structure of a hash table.

Without remove operations, you could keep all the objects in a hash table and keep the top K in a priority heap that is incrementally updated. This would make insert O(1 + log K), i.e. constant time in N, assuming K is constant and not dependent on N (N = number of objects in the table). However, this doesn't work once the remove operation is available. The proposed Fibonacci heap has an O(log N) amortized delete operation, so it doesn't give a good solution either: all the objects would need to be kept in the heap, and if you eventually remove every object that you insert, you get O(log N) behavior per insert+delete pair in general.

I would maybe try the following approach:

Store the objects in a hash table, assuming you need the whole table for purposes other than returning the top objects, and maintain a priority heap (a standard binary heap will do) that contains K * C objects, where C is a factor whose value you need to determine experimentally. Then:

  • Whenever you add a new object, try to insert it into the heap. If it fits in the K*C space (the heap is not at capacity yet, or it pushes another object out), insert it and set a bit in the hash table to denote that the object is in the heap; when you push an object out of the heap, clear the bit.
  • When you remove an object, check the bit. If the bit is 1, i.e. the object was in the heap, remove it from there as well (you need to search for it unless you keep a pointer to it from the hash table; it's best to maintain the pointer). The heap now shrinks. The key thing is that as long as the heap still has at least K objects, it is guaranteed to contain all the top K objects; this is where the factor C comes in, as it provides the "leeway" for the heap.
  • When the heap size drops below K, run a linear scan over the whole hash table and fill the heap back up to K*C capacity.

Setting C is empirical because it depends on how your objects come and go, but tuning it should be easy, since you can adjust it based on runtime profiling alone.

Complexity: Insert is O(1 + log(KC)). Remove is O(1 + p log(KC) + qN), where p is the probability that the removed object was in the heap and q is the probability that the heap needs to be rebuilt. p depends on the characteristics of how objects come and go; for a simple analysis we can set p = KC/N, i.e. assume uniform probability. q is even more sensitive to the "flow" of the objects. For example, if new objects generally increase in value over time and you always delete older objects, q tends towards zero.

Note that funnily enough p is inversely proportional to N so actually this part speeds up when N grows :)
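Putting the above together, a rough sketch (illustrative only: I've used a TreeSet in place of the binary heap so that evicting the smallest candidate and removing an arbitrary key are one-liners, and the class and method names are made up):

```
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// All entries live in a hash map; a bounded candidate set holds the
// (up to) K*C largest keys and is rebuilt from the table when it
// shrinks below K.
class TopKCandidateMap {
    private final int k, capacity;                        // capacity = K * C
    private final Map<Integer, Integer> table = new HashMap<>();
    private final TreeSet<Integer> candidates = new TreeSet<>();

    TopKCandidateMap(int k, int c) { this.k = k; this.capacity = k * c; }

    void put(int key, int value) {
        table.put(key, value);
        candidates.add(key);
        if (candidates.size() > capacity) candidates.pollFirst(); // evict smallest candidate
    }

    void remove(int key) {
        table.remove(key);
        candidates.remove(key);          // the "bit was set" case
        if (candidates.size() < k) {     // leeway used up: rebuild by scanning the table
            candidates.clear();
            for (Integer liveKey : table.keySet()) {
                candidates.add(liveKey);
                if (candidates.size() > capacity) candidates.pollFirst();
            }
        }
    }

    // Entries for the 'count' largest keys, in descending key order.
    List<Map.Entry<Integer, Integer>> getTopKEntries(int count) {
        List<Map.Entry<Integer, Integer>> out = new ArrayList<>(count);
        for (Integer key : candidates.descendingSet()) {
            if (out.size() == count) break;
            out.add(Map.entry(key, table.get(key)));
        }
        return out;
    }
}
```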

antti.huima
That remove operations destroy hash table performance characteristics is not my “view” – it’s an inherent fact of open addressing for collision resolution, which is what many advanced general-purpose hash table implementations use. Even sophisticated open addressing schemes suffer from this limitation. (Granted, separate chaining does *not* have this problem.)
Konrad Rudolph
@Konrad Yes, with open addressing you have this problem, but it's not a general property of the hash table construction; it's a byproduct of the open addressing scheme. The "view" part was that your claim was stated as if it applied to hash tables in general... But hey, thanks for providing valuable comments!!
antti.huima
A: 

Unless I'm being severely uncreative today, you just can't do it all in O(1).

If you are maintaining a sort order, then adds and deletes will probably be O(log n). If you are not, then your search will have to be O(n).

Hash tables just don't do sorting. I suggest you live with O(log n) for inserts and deletes and use one of the suggested data structures (a heap is probably the best). If you need O(1) lookups, you could combine it with a hash, but then you are maintaining two data structures in parallel and might as well use a TreeMap.
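For what it's worth, a minimal sketch of that "two data structures in parallel" combination (illustrative only; the names are made up):

```
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Maintain a hash map for O(1) lookups and an ordered map for the top-K
// queries; every mutation goes to both.
class HashPlusTree {
    private final Map<Integer, Integer> hash = new HashMap<>();        // O(1) get
    private final TreeMap<Integer, Integer> ordered = new TreeMap<>(); // sorted by key

    Integer get(int key)          { return hash.get(key); }                    // O(1)
    void put(int key, int value)  { hash.put(key, value); ordered.put(key, value); }  // O(log n)
    void remove(int key)          { hash.remove(key); ordered.remove(key); }          // O(log n)

    // Entries for the k largest keys, in descending key order.
    List<Map.Entry<Integer, Integer>> getTopKEntries(int k) {
        List<Map.Entry<Integer, Integer>> out = new ArrayList<>(k);
        for (Map.Entry<Integer, Integer> e : ordered.descendingMap().entrySet()) {
            if (out.size() == k) break;
            out.add(e);
        }
        return out;
    }
}
```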

Bill K
+1  A: 

An alternative would be just to sort the items.

In your usage scenario there are only 500 to 1000 items – sorting them is just incredibly fast (keep in mind that log2 1000 ≈ 10, so the log factor is negligible), and it doesn't seem to happen very often.

You can even adapt the selection algorithm to return the K smallest items. Unfortunately, this will still depend on n, not only on k as you’d hoped for: O(*n* + k log k).

(I’ve added this as a new answer because it’s actually completely unrelated to my first entry.)
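For concreteness, a rough sketch of the plain copy-and-sort variant (illustrative only; the names are made up). Swapping the full sort for a selection step is what would get this down to the O(n + k log k) bound mentioned above:

```
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Copy the entries out and sort them by key; for ~1000 entries this is
// on the order of 10,000 comparisons, i.e. negligible.
class SortEverything {
    static List<Map.Entry<Integer, Integer>> smallestK(Map<Integer, Integer> map, int k) {
        List<Map.Entry<Integer, Integer>> all = new ArrayList<>(map.entrySet());
        all.sort(Map.Entry.comparingByKey());             // O(n log n)
        return all.subList(0, Math.min(k, all.size()));   // first k = k smallest keys
    }
}
```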

Konrad Rudolph
But that moves a lot of memory around on insert and delete. A small B-tree, perhaps?
Stephan Eggermont
@Stephan: I should have been clearer, I guess. I was referring to a hash table solution (with a large load factor so that traversal is efficient), and subsequently copying the elements into an array – or even not copying them and performing the selection in-place. This is just a wild guess but I would definitely be interested in a performance comparison of some of the proposed methods.
Konrad Rudolph
A: 

If the sort key is a simple integer or decimal number, a trie will be quite fast. It will use up memory, and technically finding an element in a trie is O(log n). But in practice it'll be something like log256 n, so the constant factor is very small (log256 of 2 billion = 4).
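A minimal sketch of such a trie, 256-way over the four key bytes (illustrative only; it assumes non-negative keys so that byte order matches numeric order, and omits removal). A lookup walks at most four levels, which is where the log256 figure comes from:

```
// Minimal 256-way trie keyed on the 4 bytes of a non-negative int key.
class IntTrie {
    private static final class Node {
        final Node[] child = new Node[256];
        Integer value;                      // set only at depth 4
    }

    private final Node root = new Node();

    void put(int key, int value) {
        Node n = root;
        for (int shift = 24; shift >= 0; shift -= 8) {   // most significant byte first
            int b = (key >>> shift) & 0xFF;
            if (n.child[b] == null) n.child[b] = new Node();
            n = n.child[b];
        }
        n.value = value;
    }

    Integer get(int key) {
        Node n = root;
        for (int shift = 24; shift >= 0; shift -= 8) {
            n = n.child[(key >>> shift) & 0xFF];
            if (n == null) return null;
        }
        return n.value;                     // at most 4 steps, hence "log256 n"
    }
}
```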

Jason Orendorff
A: 

I feel a heap is the best data structure for this problem: put and remove take O(log N), and the top K elements can be returned in O(K log N) time. Use a max-heap if you want the maximum elements.

Here, I am assuming that "top K elements" means the K elements having the maximum values.
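As a rough illustration of the O(K log N) retrieval only (a sketch under the assumption that the heap is keyed on the map keys): poll K times and push the polled keys back.

```
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

// k polls plus k re-inserts, each O(log N), i.e. O(k log N) overall.
class HeapTopK {
    static List<Integer> topKKeys(PriorityQueue<Integer> maxHeap, int k) {
        List<Integer> top = new ArrayList<>(k);
        while (top.size() < k && !maxHeap.isEmpty()) {
            top.add(maxHeap.poll());          // largest remaining key
        }
        maxHeap.addAll(top);                  // restore the heap
        return top;
    }

    public static void main(String[] args) {
        PriorityQueue<Integer> heap =
                new PriorityQueue<>(Collections.reverseOrder()); // max-heap
        for (int key : new int[]{5, 42, 17, 99, 3}) heap.add(key);
        System.out.println(topKKeys(heap, 2));                   // [99, 42]
    }
}
```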

Algorist