ansaurus

Question

Is there code out there to subclass set in Python for big xranges?

Answer 1

A:

If you are using python 3.0, you can subclass collections.Set

AJ 2009-11-10 14:34:29

I'm using 2.6, where collections.Set exists. (I'm trying to be able to interface with some other Py2 code, so I'd rather stick with 2.6 if I can.) As I said, if I have to go get my hands dirty subclassing it, I will, but I want to be sure nobody's done it before first.

Alistair Bell 2009-11-10 14:44:51

Answer 2

A:

This sounds like it might overlap with linear programming. In linear programming you are trying to find some optimal case where you add constraints to a set of values (typically integers) which initially van be very large. There are various libraries listed at http://wiki.python.org/moin/NumericAndScientific/Libraries that mention integer and linear programming, but nothing jumps out as being obviously what you want.

andrew cooke 2009-11-10 14:54:07

Answer 3

+1 A:

You are trying to make a set containing all the integer values in from 0 to 4,294,967,295. A byte is 8 bits, which gets you to 255. 99.9999940628% of your values are over one byte in size. A crude minimum size for your set, even if you are able to overcome the syntactic issues, is 4 billion bytes, or 4 GB.

You are never going to be able to hold an instance of that set in less than a GB of memory. Even with compression, it's likely to be a tough squeeze. You are going to have to get much more clever with your math. You may be able to take advantage of some properties of the set. After all, it's a very special set. What you are trying to do?

Ewan Todd 2009-11-10 14:57:34

Well yes, the point is to see if there's code out there that knows about properties of sets and still manages to do efficient union/intersection. Clearly the worst-case billion-entry set is going to be gigabytes in size, but I'm trying to get something better for 99% of cases.(What I'm trying to implement is essentially a constraint solver, btw, but one that is expected to have so many solutions that they can't reasonably be enumerated -- rather I want to be able to pick a few solutions and know that they _are_ solutions.)

Alistair Bell 2009-11-10 15:13:18

OK, so why not use a constraint solver (as I suggested...)?

andrew cooke 2009-11-10 16:06:56

If you know of any constraint solvers in Python that can handle unbounded numbers of solutions and picking random solutions from among them, I'm all ears. I looked quite hard and couldn't find one -- which is why I'm trying to write one.

Alistair Bell 2009-11-10 16:55:30

Answer 4

+2 A:

You say:

For the most part, the set contents can be expressed algorithmically.

How about writing a class which presents the entire set API, but determines set inclusion algorithmically. Then with a number of classes which wrap around other sets to perform the union and intersection algorithmically.

For example, if you had a set a and set b which are instances of these pseudo sets:

>>> u = Union(a, b)

And then you use u with the full set API, which will turn around and query a and b using the correct logic. All the set methods could be designed to return these pseudo unions/intersections automatically so the whole process is transparent.

Edit: Quick example with a very limited API:

class Base(object):

    def union(self, other):
        return Union(self, other)

    def intersection(self, other):
        return Intersection(self, other)

class RangeSet(Base):

    def __init__(self, low, high):
        self.low = low
        self.high = high

    def __contains__(self, value):
        return value >= self.low and value < self.high

class Union(Base):
    def __init__(self, *sets):
        self.sets = sets

    def __contains__(self, value):
        return any(value in x for x in self.sets)

class Intersection(Base):

    def __init__(self, *sets):
        self.sets = sets

    def __contains__(self, value):
        return all(value in x for x in self.sets)


a = RangeSet(0, 10)
b = RangeSet(5, 15)

u = a.union(b)
i = a.intersection(b)

print 3 in u
print 7 in u
print 12 in u

print 3 in i
print 7 in i
print 12 in i

Running gives you:

True
True
True
False
True
False

Mike Boers 2009-11-10 15:19:41

If you can express your trial sets as unions of a modest number of ranges, then this approach looks promising.

Ewan Todd 2009-11-10 17:00:49

Answer 5

A:

I would avoid subclassing set, since clearly you can usefully reuse no part of set's implementation. I would even avoid subclassing collections.Set, since the latter requires you to supply a __len__ -- a functionality which you appear not to need otherwise, and just can't be done effectively in the general case (it's going to be O(N), with, which the kind of size you're talking about, is far too slow). You're unlikely to find some existing implementation that matches your use case well enough to be worth reusing, because your requirements are very specific and even peculiar -- the concept of "random iterating and an occasional duplicate is OK", for example, is a really unusual one.

If your specs are complete (you only need union, intersection, and random iteration, plus occasional additions and removals of single items), implementing a special purpose class that fills those specs is not a crazy undertaking. If you have more specs that you have not explicitly mentioned, it will be trickier, but it's hard to guess without hearing all the specs. So for example, something like:

import random

class AbSet(object):
  def __init__(self, predicate, maxitem=1<<32):
    # set of all ints, >=0 and <maxitem, satisfying the predicate
    self.maxitem = maxitem
    self.predicate = predicate
    self.added = set()
    self.removed = set()

  def copy(self):
    x = type(self)(self.predicate, self.maxitem)
    x.added = set(self.added)
    x.removed = set(self.removed)
    return x

  def __contains__(self, item):
    if item in self.removed: return False
    if item in self.added: return True
    return (0 <= item < self.maxitem) and self.predicate(item)

  def __iter__(self):
    # random endless iteration
    while True:
      x = random.randrange(self.maxitem)
      if x in self: yield x

  def add(self, item):
    if item<0 or item>=self.maxitem: raise ValueError
    if item not in self:
      self.removed.discard(item)
      self.added.add(item)

  def discard(self, item):
    if item<0 or item>=self.maxitem: raise ValueError
    if item in self:
      self.removed.add(item)
      self.added.discard(item)

  def union(self, o):
    pred = lambda v: self.predicate(v) or o.predicate(v),
    x = type(self)(pred, max(self.maxitem, o.maxitem))
    toadd = [v for v in (self.added|o.added) if not pred(v)]
    torem = [v for v in (self.removed|o.removed) if pred(v)]
    x.added = set(toadd)
    x.removed = set(torem)

  def intersection(self, o):
    pred = lambda v: self.predicate(v) and o.predicate(v),
    x = type(self)(pred, min(self.maxitem, o.maxitem))
    toadd = [v for v in (self.added&o.added) if not pred(v)]
    torem = [v for v in (self.removed&o.removed) if pred(v)]
    x.added = set(toadd)
    x.removed = set(torem)

I'm not entirely certain about the logic determining added and removed upon union and intersection, but I hope this is a good base for you to work from.

Alex Martelli 2009-11-10 17:52:09

ansaurus

tags:

views:

answers:

Is there code out there to subclass set in Python for big xranges?

related questions