ansaurus

Question

Python: Zope's BTree OOSet, IISet, etc... Effective for this requirement?

Answer 1

+1 A:

I don't think BTrees or other traditional sorted data structures (red-black trees, etc) will help you, because they keep order by key, not by corresponding value -- in other words, the field they guarantee as unique is the same one they order by. Your requirements are different, because you want uniqueness along one field, but sortedness by the other.

What are your performance requirements? With a rather simple pure Python implementation based on Python dicts for uniqueness and Python sorts, on a not-blazingly-fast laptop, I get 5 seconds for the original construction (essentially a sort over the million elements, starting with them as a dict), and about 9 seconds for the "update" with 20,000 new id/value pairs of which half "overlap" (thus overwrite) an existing id and half are new (I can implement the update in a faster way, about 6.5 seconds, but that implementation has an anomaly: if one of the "new" pairs is exactly identical to one of the "old" ones, both id and value, it's duplicated -- warding against such "duplication of identicals" is what pushes me from 6.5 seconds to 9, and I imagine you would need the same kind of precaution).

How far are these 5-and-9 seconds times from your requirements (taking into account the actual speed of the machine you'll be running on vs the 2.4 GHz Core Duo, 2GB of RAM, and typical laptop performance issues of this laptop I'm using)? IOW, is it close enough to "striking distance" to be worth tinkering and trying to squeeze a last few cycles out of, or do you need orders of magnitude faster performance?

I've tried several other approaches (with a SQL DB, with C++ and its std::sort &c, ...) but they're all slower, so if you need much higher performance I'm not sure what you could do.

Edit: since the OP says this performance would be fine but he can't achieve anywhere near it, I guess I'd best show the script I used to measure these times...:

import gc
import operator
import random
import time


nk = 1000

def popcon(d):
  for x in xrange(nk*1000):
    d['id%s' % x] = random.randrange(100*1000)

def sorted_container():
  ctr = dict()
  popcon(ctr)
  start = time.time()
  ctr_sorted = ctr.items()
  ctr_sorted.sort(key=operator.itemgetter(1))
  stend = time.time()
  return stend-start, ctr_sorted

def do_update(ctr, newones):
  start = time.time()
  dicol = dict(ctr)
  ctr.extend((k,v) for (k,v) in newones if v!=dicol.get(k,None))
  dicnu = dict(newones)
  ctr.sort(key=operator.itemgetter(1))
  newctr = [(k,v) for (k,v) in ctr if v==dicnu.get(k,v)]
  stend = time.time()
  return stend-start, newctr

def main():
  random.seed(12345)
  for x in range(3):
    duration, ctr = sorted_container()
    print 'dict-to-sorted, %d: %.2f sec, len=%d' % (x, duration, len(ctr))
    newones = [('id%s' % y, random.randrange(nk*100))
                for y in xrange(nk*990,nk*1010)]
    duration, ctr = do_update(ctr, newones)
    print 'updt-to-sorted, %d: %.2f sec, len=%d' % (x, duration, len(ctr))
    del ctr
    gc.collect()

main()

and this is a typical run:

$ time python som.py
dict-to-sorted, 0: 5.01 sec, len=1000000
updt-to-sorted, 0: 9.78 sec, len=1010000
dict-to-sorted, 1: 5.02 sec, len=1000000
updt-to-sorted, 1: 9.12 sec, len=1010000
dict-to-sorted, 2: 5.03 sec, len=1000000
updt-to-sorted, 2: 9.12 sec, len=1010000

real    0m54.073s
user    0m52.464s
sys 0m1.258s

the overall elapsed time being a few seconds more than the totals I'm measuring, obviously, because it includes the time needed to populate the container with random numbers, generate the "new data" also randomly, destroy and garbage-collect things at the end of each run, and so forth.

This is with the system-supplied Python 2.5.2 on a Macbook with Mac OS X 10.5.7, 2.4 GHz Intel Core Duo, and 2GB of RAM (times don't change much when I use different versions of Python).

Alex Martelli 2009-07-26 02:26:44

For a set of new data, resorting can take up to a couple minutes, so 9 seconds is more than ok. I feel like there must be some underlying problem that I am not addressing because my sort times are far, far longer.This is currently being tested/developed on an IBM xServe Dual 2.2 Ghz Core II Duo, 6 gb RAM. I am keeping all of this data in memory. I am running a ThreadingTCPServer that creates the original data and sorts it. Then there is a method by which I can add additional data. I have some benchmarking output outputing times to do the sorts and they take upwards of 5 minutes!

sberry2A 2009-07-26 02:36:17

Let me edit my answer to show what I used to measure these times!

Alex Martelli 2009-07-26 04:43:46

Answer 2

A:

It is perfectly possible to solve your problem. For this you should just note that the container types in Python always compare objects by calling their methods. Therefore you should do something like:

class Record:
    'Combination of unique part and sort part.'
    def __init__(self, unique, sort):
        self.unique = unique
        self.sort = sort

    def __hash__(self):
        # Hash should be implemented if __eq__ is implemented.
        return hash(self.unique)

    def __eq__(self, other):
        return self.unique == other.unique

    def __lt__(self, other):
        return self.sort < other.sort

 records = btree((Record(u, s) for u, s in zip(unique_data, sort_data)))

 print(records.pop())

Notes:

depending on how your favorite container type is implemented, you might need to add methods for !=, <=, >, >= as well
this will not break the relationship between == and <= as long as x.unique == y.unique ==> x.sort == y.sort

ilya n. 2009-08-09 22:40:32

ansaurus

tags:

views:

answers:

Python: Zope's BTree OOSet, IISet, etc... Effective for this requirement?

related questions