Hello,

I am looking for a solid implementation of an ordered associative array (ordered in terms of keys, not insertion order).

More precisely, I am looking for a space-efficient implementation of an int-to-float (or string-to-float, for another use case) mapping structure for which:

  • Ordered iteration is O(n)
  • Random access is O(1)

The best I came up with was gluing a dict and a list of keys together, keeping the latter ordered with bisect and insert.

Any better idea?

+3  A: 

An ordered tree is usually better for these cases, but then random access is O(log n). You should also take insertion and removal costs into account...
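For illustration only (my aside, not from this thread): the third-party sortedcontainers package provides exactly this kind of ordered mapping, with roughly O(log n) lookups and O(n) ordered iteration. A minimal sketch, assuming sortedcontainers is installed:

from sortedcontainers import SortedDict

sd = SortedDict()
sd[3] = 3.0
sd[1] = 1.0
sd[2] = 2.0
print(list(sd.items()))   # [(1, 1.0), (2, 2.0), (3, 3.0)] -- iterates in key order
print(sd[2])              # 2.0 -- lookup is roughly O(log n), not O(1)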

fortran
Do you have a proposed implementation (preferably with good documentation, especially for corner cases)?
LeMiz
This seems like an interesting implementation of a sorted AVL tree: http://pyavl.sourceforge.net/
fortran
Thanks for the link. My first thought is that it would take a lot of modification to use it as an associative array (even if I gave up the requirement of O(1) random access time).
LeMiz
+1  A: 

You could build a dict that allows traversal by storing a pair (value, next_key) in each position.

Random access:

my_dict[k][0]   # for a key k

Traversal:

k = start_key   # stored somewhere
while k is not None:     # next_key is None at the end of the list
    v, k = my_dict[k]
    yield v

Keep pointers to the start and end keys and you'll have efficient updates for those cases where you just need to add onto the end of the list.

Inserting in the middle is obviously O(n), since you must walk the chain to find the predecessor. Possibly you could build a skip list on top of it if you need more speed; a minimal sketch of the basic idea follows.
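Pulling the snippets above together into a runnable sketch (class and method names are mine, not from the answer):

class LinkedDict(object):
    # Dict threaded as a singly linked list: each slot holds (value, next_key).
    def __init__(self):
        self.d = {}
        self.start = None   # first key in traversal order
        self.end = None     # last key, kept for O(1) appends

    def append(self, k, v):
        # O(1) append at the end of the chain
        self.d[k] = (v, None)
        if self.end is None:
            self.start = k
        else:
            prev_v, _ = self.d[self.end]
            self.d[self.end] = (prev_v, k)
        self.end = k

    def __getitem__(self, k):
        return self.d[k][0]   # O(1) random access

    def itervalues(self):
        k = self.start
        while k is not None:   # next_key is None at the end of the list
            v, k = self.d[k]
            yield v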

John Fouhy
Obviously, you can store a pointer to the previous element also, which would make insertion/deletion from the middle easier, especially if you do layer a skip list on top.
John Fouhy
+1  A: 

I'm not sure which Python version you are working with, but in case you'd like to experiment, Python 3.1 includes an official implementation of ordered dictionaries: http://www.python.org/dev/peps/pep-0372/ http://docs.python.org/3.1/whatsnew/3.1.html#pep-372-ordered-dictionaries
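A quick illustration (mine, not from the answer) of what these ordered dictionaries actually preserve:

from collections import OrderedDict

od = OrderedDict()
od[3] = 'c'
od[1] = 'a'
od[2] = 'b'
print(list(od))   # [3, 1, 2] -- insertion order is preserved, not key order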

Santi
Ordered dictionaries are ordered with respect to key insertion, not key sort order, so I don't think LeMiz will be able to use this solution.
TokenMacGuy
Oops, sorry. I got lost in the question...
Santi
A: 

Here's a pastie: I had a need for something similar. Note, however, that this specific implementation is immutable; there are no inserts once the instance is created. The exact performance doesn't quite match what you're asking for: lookup is O(log n) and a full scan is O(n). It works by using the bisect module on a tuple of (key, value) pairs. Even if you can't use this precisely, you might have some success adapting it to your needs.

import bisect

class dictuple(object):
    """
        >>> h0 = dictuple()
        >>> h1 = dictuple({"apples": 1, "bananas":2})
        >>> h2 = dictuple({"bananas": 3, "mangoes": 5})
        >>> h1+h2
        ('apples':1, 'bananas':3, 'mangoes':5)
        >>> h1 > h2
        False
        >>> h1 > 6
        False
        >>> 'apples' in h1
        True
        >>> 'apples' in h2
        False
        >>> d1 = {}
        >>> d1[h1] = "salad"
        >>> d1[h1]
        'salad'
        >>> d1[h2]
        Traceback (most recent call last):
        ...
        KeyError: ('bananas':3, 'mangoes':5)
   """


    def __new__(cls, *args, **kwargs):
        # accept any number of mappings plus keyword pairs, like dict()
        initial = {}
        for arg in args:
            initial.update(arg)
        initial.update(kwargs)

        instance = object.__new__(cls)
        # store items as a tuple sorted by key; every lookup bisects this tuple
        instance.__items = tuple(sorted(initial.items(), key=lambda i: i[0]))
        return instance

    def __init__(self,*args, **kwargs):
        pass

    def __find(self, key):
        # (key,) sorts just before (key, value), so bisect returns the index
        # where a matching pair would sit
        return bisect.bisect(self.__items, (key,))

    def __getitem__(self, key):
        ind = self.__find(key)
        if ind < len(self.__items) and self.__items[ind][0] == key:
            return self.__items[ind][1]
        raise KeyError(key)
    def __repr__(self):
        return "({0})".format(", ".join(
                        "{0}:{1}".format(repr(item[0]),repr(item[1]))
                          for item in self.__items))
    def __contains__(self, key):
        ind = self.__find(key)
        return ind < len(self.__items) and self.__items[ind][0] == key
    def __cmp__(self, other):
        # order by class name first, then by the sorted item tuples
        return cmp(self.__class__.__name__, other.__class__.__name__
                  ) or cmp(self.__items, other.__items)

    def __eq__(self, other):
        return self.__items == other.__items

    __seed = hash("dictuple")
    def __hash__(self):
        return dictuple.__seed^hash(self.__items)
    def __iter__(self):
        return self.iterkeys()
    def __len__(self):
        return len(self.__items)

    @classmethod
    def fromkeys(cls, keys, v=None):
        # build a dictuple mapping every key in keys to v
        return cls(dict.fromkeys(keys, v))

    def get(self, key, default=None):
        ind = self.__find(key)
        if ind < len(self.__items) and self.__items[ind][0] == key:
            return self.__items[ind][1]
        return default

    def has_key(self, key):
        return key in self

    def items(self):
        return list(self.iteritems())

    def iteritems(self):
        return iter(self.__items)

    def iterkeys(self):
        return (i[0] for i in self.__items)

    def itervalues(self):
        return (i[1] for i in self.__items)

    def keys(self):
        return list(self.iterkeys())

    def values(self):
        return list(self.itervalues())
    def __add__(self, other):
        _sum = dict(self.__items)
        _sum.update(other.__items)
        return self.__class__(_sum)

if __name__ == "__main__":
    import doctest
    doctest.testmod()
TokenMacGuy
+13  A: 

"Random access O(1)" is an extremely exacting requirement which basically imposes an underlying hash table -- and I hope you do mean random READS only, because I think it can be mathematically proven than it's impossible in the general case to have O(1) writes as well as O(N) ordered iteration.

I don't think you will find a pre-packaged container suited to your needs because they are so extreme -- O(log N) access would of course make all the difference in the world. To get the big-O behavior you want for reads and iterations you'll need to glue two data structures, essentially a dict and a heap (or sorted list or tree), and keep them in sync. Although you don't specify, I think you'll only get amortized behavior of the kind you want - unless you're truly willing to pay any performance hits for inserts and deletes, which is the literal implication of the specs you express but does seem a pretty unlikely real-life requirement.

For O(1) read and amortized O(N) ordered iteration, just keep a list of all keys on the side of a dict. E.g.:

class Crazy(object):
  def __init__(self):
    self.d = {}           # key -> value: O(1) random reads
    self.L = []           # every key, sorted lazily
    self.sorted = True
  def __getitem__(self, k):
    return self.d[k]
  def __setitem__(self, k, v):
    if k not in self.d:
      self.L.append(k)    # defer sorting until the next iteration
      self.sorted = False
    self.d[k] = v
  def __delitem__(self, k):
    del self.d[k]
    self.L.remove(k)      # O(N) removal from the key list
  def __iter__(self):
    if not self.sorted:
      self.L.sort()       # timsort: near O(N) when mostly sorted already
      self.sorted = True
    return iter(self.L)

If you don't like the "amortized O(N) order" you can remove self.sorted and just repeat self.L.sort() in __setitem__ itself. That makes writes O(N log N), of course (whereas the code above keeps writes at O(1)). Either approach is viable and it's hard to think of one as intrinsically superior to the other. If you tend to do a bunch of writes then a bunch of iterations, the approach in the code above is best; if it's typically one write, one iteration, another write, another iteration, then it's just about a wash.
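For reference, a sketch of that always-sorted variant (same class as above, only __setitem__ changes):

  def __setitem__(self, k, v):
    if k not in self.d:
      self.L.append(k)
      self.L.sort()   # timsort makes this ~O(N) when L is already mostly sorted
    self.d[k] = v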

BTW, this takes shameless advantage of the unusual (and wonderful;-) performance characteristics of Python's sort (aka "timsort"): among them, sorting a list that's mostly sorted but with a few extra items tacked on at the end is basically O(N) (if the tacked-on items are few enough compared to the sorted prefix part). I hear Java's gaining this sort soon, as Josh Bloch was so impressed by a tech talk on Python's sort that he started coding it for the JVM on his laptop then and there. Most systems (including, I believe, Jython as of today and IronPython too) basically have sorting as an O(N log N) operation, not taking advantage of "mostly ordered" inputs; "natural mergesort", which Tim Peters fashioned into Python's timsort of today, is a wonder in this respect.
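A quick, hypothetical micro-benchmark of that "mostly sorted" claim (my code, not from the answer):

import random
import timeit

base = sorted(random.random() for _ in range(10**6))
tail = [random.random() for _ in range(100)]
nearly = base + tail          # sorted prefix, few items tacked on at the end
shuffled = list(nearly)
random.shuffle(shuffled)

print(timeit.timeit(lambda: sorted(nearly), number=5))    # fast: exploits the sorted run
print(timeit.timeit(lambda: sorted(shuffled), number=5))  # full O(N log N) sort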

Alex Martelli
I have almost exactly the same thing, but with:

def __setitem__(self, k, v):
    if k not in self.d:
        idx = bisect.bisect(self.l, k)   # log(n)
        self.l.insert(idx, k)            # keeps the list sorted
    self.d[k] = v

That prevents the "dirty" flag thing. Is there a paper describing python's sort in detail? (If not, I will read the source code.) But my question was really about a trick to prevent the duplication of the key list (for example, if hashes are short, it may be preferable to duplicate only the hashes instead of the longs). Thanks for the insight on python's sort, though.
LeMiz
How about if (1) you eliminate self.sorted, (2) self.L is None when not sorted, (3) self.L is regenerated from the keys in __iter__(), and (4) __setitem__ invalidates self.L, setting it to None? Sketch below.
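A minimal sketch of that variant (my code, not hughdbrown's; same Crazy class otherwise):

  def __setitem__(self, k, v):
    if k not in self.d:
      self.L = None            # invalidate the cached key list
    self.d[k] = v
  def __iter__(self):
    if self.L is None:
      self.L = sorted(self.d)  # rebuild from scratch: always O(N log N)
    return iter(self.L)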
hughdbrown
Nice idea! If I have a probability p of doing an insertion between two iterations, that would reduce the constant before the iteration (which will still be O(n log n)) by the same factor. As an added benefit, if I can prove that my updates are sufficiently uniformly distributed amongst my series (and given that the garbage collector collects often enough), the average additional memory load will only be (1-p) * the load of one series (but I fear the garbage-collection cost a bit). Need to study the patterns more closely. Thanks!
LeMiz
There's a .txt file in the Python sources that's a very nice essay by Tim Peters about his sorting algorithm; sorry, can't find the URL right now as I'm on my phone, no wifi. Key point here is that appending one item to a sorted list and sorting again is O(n), like bisect then insert, and asymptotically better if m << n items are appended between sorts, which I hope explains why the code I propose should be better than the alternatives mentioned in the comments.
Alex Martelli
Found a wifi spot -- python.org is not responding right now, but there's a copy of the essay in question at http://webpages.cs.luc.edu/~anh/363/handouts/listsort.txt .
Alex Martelli
@hughdbrown, problem w/your approach is that it ensures O(N log N) cost for any sort -- while with mine one can expect a few O(N) sorts if the insertions are few between "ordered iterations".
Alex Martelli
@LeMiz, don't worry about duplication of large keys: self.L and self.d are just bunches of references to the same key objects, there is no copy of those objects.
Alex Martelli
@LeMiz, bisect+insert is O(N) and likely a lower multiplier than append+sort; OTOH you're paying that at every insertion, while my approach pays the O(N) price once per _transition_ between insertions (of m << N items) and ordered walks. So depending on patterns of sequences of insertions and sequences of ordered walks either might win, benchmarking needed. As I mentioned in previous comment, worries about "duplicating keys" are unneeded in either case.
Alex Martelli
I am convinced :). Reading the paper about timsort was of great value, and I would have been happy to have asked the question if only for it. BTW, if I may abuse your time: how much space does a reference to an object in a list cost? 16 bytes? (Maybe I should ask this as another question.) Thanks for your very precise answers!
LeMiz
@LeMiz, you're welcome! A reference, per se, takes up 4 bytes on a 32-bit build of Python, 8 on a 64-bit build; beyond this I do believe a new question would be advisable (e.g. on how you measure things, how you squeeze them when VERY tight, and so on;-).
Alex Martelli
+2  A: 

Here is my own implementation:

import bisect

class KeyOrderedDict(object):
    __slots__ = ['d', 'l']

    def __init__(self, *args, **kwargs):
        self.l = sorted(kwargs)   # keys, kept in sorted order
        self.d = kwargs

    def __setitem__(self, k, v):
        if k not in self.d:
            idx = bisect.bisect(self.l, k)   # O(log n) search...
            self.l.insert(idx, k)            # ...but O(n) insert
        self.d[k] = v

    def __getitem__(self, k):
        return self.d[k]

    def __delitem__(self, k):
        idx = bisect.bisect_left(self.l, k)
        del self.l[idx]
        del self.d[k]

    def __iter__(self):
        return iter(self.l)

    def __contains__(self, k):
        return k in self.d
The use of bisect keeps self.l ordered, and insertion is O(n) (because of the insert; not a killer in my case, because I append far more often than I truly insert, so the usual case is amortized O(1)). Access is O(1), and iteration is O(n). But maybe someone has invented (in C) something with a cleverer structure?
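Example usage (illustrative):

d = KeyOrderedDict()
for k in (5, 1, 3):
    d[k] = float(k)
print(list(d))   # [1, 3, 5] -- keys iterate in sorted order
print(d[3])      # 3.0 -- plain O(1) dict lookup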

LeMiz