views:

429

answers:

5

Sometimes it makes sense to have a key-ordered dictionary. In C++, this is often implemented with a Red-black tree. But any self-balancing binary search tree will do (fwiw, Knuth is particularly clear on this subject). The best solution I've been able to come up with so far is to take R. McGraw's AVL-tree type and create a wrapper class that basically implements the STL map interface (also counting on the handy ordering of pairs (two element tuples) in Python). Such a tuple basically corresponds to std::map::value_type.

Yes, there's Python's bisect module, and while that is logarithmic at insertion time in the same way that self-balancing binary trees are logarithmic at insertion time (right?), frankly I just want an object. Called OrderedDict or something (and no, the Python 3.1 OrderedDict doesn't qualify -- that's for 'insertion-time' ordering -- and frankly what insertion-time ordering has to do with ordering is not quite obvious).

Note, a key-ordered dictionary is highly useful in many industries (in finance for instance, it's ordinary to keep track of price books of data, which are basically ordered dictionaries of price -> quantity, aggregated order information, etc.).

If anyone has any other ideas, that's great. All I know is I just got five million times smarter by Alex Martelli's 'answers' here. So I thought I'd ask.

+1  A: 

I got exactly the same need, and Alex Martelli's answer has totally convinced me: best is keeping a dictionary and a list of partially sorted keys, then sort when needed. This is efficient because of the very particular behaviour of python's sort algorithm (AKA Timsort). http://stackoverflow.com/questions/1319763/key-ordered-dict-in-python

I tested his implementation and mine, and his was best (because he does not insert in the middle of the list)

(I strongly advise you to read the paper linked in AM's comment about the timsort, which is a pearl).

LeMiz
Thanks. I finally read the full post. Very helpful. With respect to Alex Martelli and everyone's comments, I do think I might try to py-avl wrapper approach (as mentioned first in your post). It's a somewhat unusual case: but I basically need to find a sub-sequence (something like lower_bound and upper_bound in std::map). Even though timsort seems great (and particularly good for avoiding comparisons, while being more aggressive with swapping), something like this is I think more naturally resolved with a tree. O(log n) insert/delete is also nice..
Final comment, and then I'll let it rest for a while. But so as to clarify what I was saying in the above. It's not so much that I want to find subsequences with lower_bound and upper_bound in log(n) time (bisect can do that too). It's that I may subsequently want to insert an element somewhere in the subsequence (e.g., a sorted a collection of widgets in a layout, sorted by distance away from a reference widget). If you have an iterator to a tree (or a linked list for that matter -- but lower_bound/upper_bound in a linked list is of course O(n)), you're done.
fwiw, here's one solution using that py-avl (with a little patch: http://pastebin.com/d80a3296): http://pastebin.com/f66ca441c (note, this solution is probably incorrect in all ways that are important. the iter_range method is an attempt to emulate lower_bound / upper_bound, though this is the first time i've explored custom iterators. so like i say, it's probably wrong.)
(in my defense though, assuming this isn't an entirely wrong-headed idea, a more correct emulation of lower_bound / upper_bound would probably involve some serious changes to py-avl. like comparison of iterators, etc....)
A: 

Maybe this will help you.

http://pypi.python.org/pypi/rbtree/0.8.0

Unknown
+1  A: 

As you said, you can roll your own implementation with bisect:

class OrderedDict:
  def __init__(self, keyvalues_iter):
    self.__srtlst__ = sorted(keyvalues_iter)
  def __len__(self):
    return len(self.__srtlst__)
  def __contains__(self, key):
    index = bisect(self.__srtlst__, key)
    if index<len(self.__srtlst__) and self.__srtlst__[index][0] == key:
      return True
    else:
      return False    
  def __getitem__(self, key):
    index = bisect(self.__srtlst__, key)
    if index<len(self.__srtlst__) and self.__srtlst__[index][0] == key:
      return self.__srtlst__[index][1]
    else:
      raise KeyError(key)
  def __setitem__(sekf, key, value):
    index = bisect(self.__srtlst__, key)
    if index<len(self.__srtlst__) and self.__srtlst__[index][0] == key:
      self.__srtlst__[index][1] = value
    else:
      self.__srtlst__[index]=(key, value)
  def __delitem__(sekf, key, value):
    index = bisect(self.__srtlst__, key)
    if index<len(self.__srtlst__) and self.__srtlst__[index][0] == key:
      del __srtlst__[index]
    else:
      raise KeyError(key)
   def __iter__(self):
     return (v for k,v in self.__srtlst__)
   def clear(self):
     self.__srtlst__ = []
   def get(self, key, default=None):
    index = bisect(self.__srtlst__, key)
    if index<len(self.__srtlst__) and self.__srtlst__[index][0] == key:
      return self.__srtlst__[index][1]
    else:
      return default
   def items(self):
     return self.__srtlst__[:]
  def iteritems(self):
    return iter(self.__srtlst__)
  def iterkeys(self):
    return (k for k,v in self.__srtlst__)
  def itervalues(self):
    return (v for k,v in self.__srtlst__)
  def keys(self):
    return [k for k,v in self.__srtlst__]
  def values(self):
    return [v for k,v in self.__srtlst__]
  def setdefault(self, key, default):
    index = bisect(self.__srtlst__, key)
    if index<len(self.__srtlst__) and self.__srtlst__[index][0] == key:
      return self.__srtlst__[index][1]
    else:
      self.__srtlst__[index]=(key, default)
      return default
  def update(self, other):
    #a more efficient implementation could be done merging the sorted lists
    for k, v in other.iteritems():
      self[k] = v

hmmm... it seems that I already did that for you, d'oh!

fortran
The only disadvantage of this implementation, afaik, is that insertion and deletion is potentially O(n) (even though you can find the index in O(log n), if you have to move elements to make way for a new element or remove an existing element, you may need to shift the underlying sorted list in O(n) time). slightly modified version available here: http://pastebin.com/f1e721bdb (but still no insertion method)
yes, of course growing the list can take time, but it should be asymptotically amortized... about the shifting, some tricks could be done to alleviate it (like for example marking empty positions and having an smaller auxiliary list for the costly insertions at the beginning and merge and compact from time to time), but in the end you should optimize for your needs: iterating (list wins), mutating (tree wins) or direct access by key (hashmap wins).
fortran
The "only disadvantage" is O(n) modifications instead of O(log n) or O(1)? That's not a minor disadvantage, that's losing a fundamental benefit of a tree. If all you happen to need at the moment is what you can get from this, great, but this is no std::map.
Glenn Maynard
As I said, everything is about trade-offs, it's the user who has to measure the impact of the different factors and choose the most appropriate structure. If the op is going to iterate sequentially through the map much more than mutating it or accessing by key, then this is a good option.
fortran
A: 

Lists are a miserable substitute for a tree.

Insertions need to move the whole list around to make space; deletions need to move the list back down. Adding or deleting stuff in batch is fine when it's possible, but it's very often not, or takes unnatural contortions to arrange it. A fundamental attribute of a tree is that insertions and deletions are O(log n); no amount of handwaving will turn O(n) into O(log n).

Inserting an item into a tree when you already know where it's going to go is O(1). Equivalently, deleting an item from a tree based on its node is also O(1). std::map supports both of these. These are both O(n) with a list.

Another fundamental property of a tree is that iterating over a range of values is O(1) per iteration. Combining list and dict loses this, because each iteration needs to do a dict lookup. (The list-of-tuples approach doesn't have this problem.)

Trees are among the most basic of data types. Python's lack of a tree container type is a wart. Maybe there's a third-party library implementing one (eg. the one linked by Mr. "Unknown", which I havn't tried so I can't vouch for), but there's no standard Python type for it.

Glenn Maynard
I must admit I find the lack of a (standardised) hash table in C++ more troublesome in my day to day work than the lack of a tree-based container in Python. But then I've never come across a use case like this where the structure wasn't small enough to just sort it.
Kylotan
O(1) insertion of item x: `thelist.append(x)`; O(1) deletion from index i: `thelist[i] = thelist[-1]; del thelist[-1]`. These perturb ordering, but for slight perturbations thelist.sort() is O(N) (thanks, timsort!-) and that only needs to be performed when actually needed (so for most access patterns you get good amortized performance). So, like e.g. Python's lack of tail recursion optimization, the lack of tree containers is very rarely a problem in practice (informal scan of Google's C++ sources: std::hashmap _many_ orders of magnitude more frequent than std::map!-).
Alex Martelli
They are for solving different kind of problems. In fact, you mean the vector/growable array implementation of a list, not the linked list. Also, a tree is just a list of lists, and you already have those in Python.
fortran
Which means insertion and deletion is O(n), exactly as I said, because resorting the list is intrinsic to the modification. Yes, fortran, of course you can implement a tree yourself; that's completely irrelevant.
Glenn Maynard
A: 

For a list that stays sorted, you can try module heapq.

peufeu
not fully sorted, a heap just imposes ordering between levels of a tree (father with children) but not in the same level (siblings), so for example in a valid heap from lower to higher, h[0]<h[1] and h[0]<[2] but h[1] could be greater than h[2] (or not)
fortran