views:

229

answers:

4

I'm porting a C++ program to Python. There are some places where it uses std::set to store objects that define their own comparison operators. Since the Python standard library has no equivalent of std::set (a sorted key-value mapping data structure) I tried using a normal dictionary and then sorting it when iterating, like this:

def __iter__(self):
    items = self._data.items()
    items.sort()
    return iter(items)

However, profiling has shown that all the calls from .sort() to __cmp__ are a serious bottleneck. I need a better data structure - essentially a sorted dictionary. Does anyone know of an existing implementation? Failing that, any recommendations on how I should implement this? Read performance is more important than write performance and time is more important than memory.

Bonus points if it supports multiple values per key, like the C++ std::multimap.

Note that the OrderedDict class doesn't fit my needs, because it returns items in the order of insertion, whereas I need them sorted using their __cmp__ methods.

+1  A: 

Python does not have built-in data-structures for this, though the bisect module provides functionality for keeping a sorted list with appropriately efficient algorithms.

If you have a list of sorted keys, you can couple it with a collections.defaultdict(list) to provide multimap-like functionality.

Mike Graham
A: 

In his book "Programming in Python 3", Mark Summerfield introduces a sorted dictionary class. The source code is available in this zip archive - look for SortedDict.py. The SortedDict class is described in detail in the book (which I recommend very much). It supports arbitrary keys for comparison and multiple values per key (which any dictionary in Python does, so that's not that big a deal, I think).

Tim Pietzcker
+3  A: 

You should use sort(key=...).
The key function you use will be related to the cmp you are already using. The advantage is that the key function is called n times whereas the cmp is called nlog n times, and typically key does half the work that cmp does

If you can include your __cmp__() we can probably show you how to convert it to a key function

If you are doing lots of iterations between modifications, you should cache the value of the sorted items.

gnibbler
While not directly answering the question about data structures this has certainly helped to improve performance. +1
Evgeny
+1  A: 

For the sorted dictionary, you can (ab)use the stable nature of python's timsort: basically, keep the items partially sorted, append items at the end when needed, switching a "dirty" flag, and sort the remaining before iterating. See this entry for details and implementation (A Martelli's answer): http://stackoverflow.com/questions/1319763/key-ordered-dict-in-python

LeMiz