views:

191

answers:

7

I need a collection data-structure that can do the following:

  • Be sorted
  • Allow me to quickly pop values off the front and back of the list O(log n)
  • Remain sorted after I insert a new value
  • Allow a user-specified comparison function, as I will be storing tuples and want to sort on a particular value
  • Thread-safety is not required
  • Optionally allow efficient haskey() lookups (I'm happy to maintain a separate hash-table for this though)

My thoughts at this stage are that I need a priority queue and a hash table, although I don't know if I can quickly pop values off both ends of a priority queue. Another possibility is simply maintaining an OrderedDictionary and doing an insertion sort it every-time I add more data to it.

Because I'm interested in performance for a moderate number of items (I would estimate less than 200,000), I am unsure about what asymptotic performance I require for these operations. n will not grow infinitely, so a low constant performance k in k * O(n) may be as important O(n). That said, I would prefer that both the insert and pop operations take O(log n) time.

Furthermore, are there any particular implementations in Python? I would really like to avoid writing this code myself.

A: 

If this were Java I'd use a TreeSet with the NavigableSet interface.

This is implemented as a Red-Black-Tree.

starblue
+1  A: 

You might get good performance for these kinds of operations using blist or a database (such as the sqlite which is in the stdlib).

Mike Graham
The blist package looks very promising, with O(log n) pop() operations for `blist.blist()`. What is the asymptotic performance of insertions?
fmark
It seems blist.sortedlist() is not a possibility, as it doesn't allow popping from the end of the list, only the beginning.
fmark
@fmark, using `bisect.insort`, insertion into a sorted blist should be O(log**2 n).
Mike Graham
+1  A: 

I suggest some sort of balanced binary tree such as a red-black tree.

A search on PyPi throws up a couple of implementations. Searching on google will give you more.

bintrees on PyPi looks very complete and has both Python and C/Cython implementations. I have not used it though, so caveat emptor.

A red-black tree is kept sorted and most operations (insert, delete, find) are O(log2(N)), so finding an element in a tree of 200,000 entries will take on average 17-18 comparisons.

Dave Kirby
+1  A: 

Sounds like a skip list will fulfill all your requirements. It's basically a dynamically-sized sorted linked list, with O(log n) insertions and removals.

I don't really know Python, but this link seems to be relevant:

http://infohost.nmt.edu/tcc/help/lang/python/examples/pyskip/

Oak
+1  A: 

I presume you need it sorted because you access element by rank in the sorted order?

You can use any implementation of any balanced binary tree, with the additional information at each node which tells you the numbers of descendants of that node (usually called the Order Statistic Binary Tree).

With this structure, given the rank of an element (even min/max), you can access/delete it in O(log n) time.

This makes all operations (access/insert/delete by rank, pop front/back, insert/delete/search by value) O(logn) time, while allowing custom sort methods.

Also, apparently python has an AVL tree (one of the first balanced tree structures) implementation which supports order statistics: http://www.python.org/ftp/python/contrib-09-Dec-1999/DataStructures/avl.README

So you won't need a custom implementation.

Moron
+1  A: 

Except for the hashing, what you're looking for is a double-ended priority queue, aka a priority deque.

If your need for sorting doesn't extend beyond managing the min and max of your data, another structure for you to look at might be an interval heap, which has the advantage of O(1) lookup of both min and max if you need to peek at values (though deleteMin and deleteMax are still O(log(N)) ). Unfortunately, I'm not aware of any implementations in Python, so I think you'd have to roll your own.

Here's an addendum to an algorithms textbook that describes interval heaps if you're interested:

http://www.mhhe.com/engcs/compsci/sahni/enrich/c9/interval.pdf

Owen S.
+1  A: 

If you can really allow O(log n) for pop, dequeue, and insert, then a simple balanced search tree like red-black tree is definitely sufficient.

You can optimize this of course by maintaining a direct pointer to the smallest and largest element in the tree, and then updating it when you (1) insert elements into the tree or (2) pop or dequeue, which of course invalidate the resp. pointer. But because the tree is balanced, there's some shuffling going out anyway, and you can update the corr. pointer at the same time.

There is also something called min-max heap (see the Wikipedia entry for Binary Heap), which implements exactly a "double-ended priority queue", i.e. a queue where you can pop both from front end and the rear end. However there you can't access the whole list of objects in order, whereas a search tree can be iterated efficiently through in O(n) time.

The benefit of a min-max heap however is that the current min and max objects can be read in O(1) time, a search tree requires O(log(n)) just to read the min or max object unless you have the cached pointers as I mentioned above.

antti.huima
It sounds like a min-max heap, coupled with a hash table, is what I want.
fmark
min-max heaps are cool! I actually didn't know they exist before I found them on stackoverflow earlier :)
antti.huima