tags:

views:

91

answers:

5

I'm trying to write some Python code that includes union/intersection of sets that potentially can be very large. Much of the time, these sets will be essentially set(xrange(1<<32)) or something of the kind, but often there will be ranges of values that do not belong in the set (say, 'bit 5 cannot be clear'), or extra values thrown in. For the most part, the set contents can be expressed algorithmically.

I can go in and do the dirty work to subclass set and create something, but I feel like this must be something that's been done before, and I don't want to spend days on wheel reinvention.

Oh, and just to make it harder, once I've created the set, I need to be able to iterate over it in random order. Quickly. Even if the set has a billion entries. (And that billion-entry set had better not actually take up gigabytes, because I'm going to have a lot of them.)

Is there code out there? Anyone have neat tricks? Am I asking for the moon?

A: 

If you are using python 3.0, you can subclass collections.Set

AJ
I'm using 2.6, where collections.Set exists. (I'm trying to be able to interface with some other Py2 code, so I'd rather stick with 2.6 if I can.) As I said, if I have to go get my hands dirty subclassing it, I will, but I want to be sure nobody's done it before first.
Alistair Bell
A: 

This sounds like it might overlap with linear programming. In linear programming you are trying to find some optimal case where you add constraints to a set of values (typically integers) which initially van be very large. There are various libraries listed at http://wiki.python.org/moin/NumericAndScientific/Libraries that mention integer and linear programming, but nothing jumps out as being obviously what you want.

andrew cooke
+1  A: 

You are trying to make a set containing all the integer values in from 0 to 4,294,967,295. A byte is 8 bits, which gets you to 255. 99.9999940628% of your values are over one byte in size. A crude minimum size for your set, even if you are able to overcome the syntactic issues, is 4 billion bytes, or 4 GB.

You are never going to be able to hold an instance of that set in less than a GB of memory. Even with compression, it's likely to be a tough squeeze. You are going to have to get much more clever with your math. You may be able to take advantage of some properties of the set. After all, it's a very special set. What you are trying to do?

Ewan Todd
Well yes, the point is to see if there's code out there that knows about properties of sets and still manages to do efficient union/intersection. Clearly the worst-case billion-entry set is going to be gigabytes in size, but I'm trying to get something better for 99% of cases.(What I'm trying to implement is essentially a constraint solver, btw, but one that is expected to have so many solutions that they can't reasonably be enumerated -- rather I want to be able to pick a few solutions and know that they _are_ solutions.)
Alistair Bell
OK, so why not use a constraint solver (as I suggested...)?
andrew cooke
If you know of any constraint solvers in Python that can handle unbounded numbers of solutions and picking random solutions from among them, I'm all ears. I looked quite hard and couldn't find one -- which is why I'm trying to write one.
Alistair Bell
+2  A: 

You say:

For the most part, the set contents can be expressed algorithmically.

How about writing a class which presents the entire set API, but determines set inclusion algorithmically. Then with a number of classes which wrap around other sets to perform the union and intersection algorithmically.

For example, if you had a set a and set b which are instances of these pseudo sets:

>>> u = Union(a, b)

And then you use u with the full set API, which will turn around and query a and b using the correct logic. All the set methods could be designed to return these pseudo unions/intersections automatically so the whole process is transparent.

Edit: Quick example with a very limited API:

class Base(object):

    def union(self, other):
        return Union(self, other)

    def intersection(self, other):
        return Intersection(self, other)

class RangeSet(Base):

    def __init__(self, low, high):
        self.low = low
        self.high = high

    def __contains__(self, value):
        return value >= self.low and value < self.high

class Union(Base):
    def __init__(self, *sets):
        self.sets = sets

    def __contains__(self, value):
        return any(value in x for x in self.sets)

class Intersection(Base):

    def __init__(self, *sets):
        self.sets = sets

    def __contains__(self, value):
        return all(value in x for x in self.sets)


a = RangeSet(0, 10)
b = RangeSet(5, 15)

u = a.union(b)
i = a.intersection(b)

print 3 in u
print 7 in u
print 12 in u

print 3 in i
print 7 in i
print 12 in i

Running gives you:

True
True
True
False
True
False
Mike Boers
If you can express your trial sets as unions of a modest number of ranges, then this approach looks promising.
Ewan Todd
A: 

I would avoid subclassing set, since clearly you can usefully reuse no part of set's implementation. I would even avoid subclassing collections.Set, since the latter requires you to supply a __len__ -- a functionality which you appear not to need otherwise, and just can't be done effectively in the general case (it's going to be O(N), with, which the kind of size you're talking about, is far too slow). You're unlikely to find some existing implementation that matches your use case well enough to be worth reusing, because your requirements are very specific and even peculiar -- the concept of "random iterating and an occasional duplicate is OK", for example, is a really unusual one.

If your specs are complete (you only need union, intersection, and random iteration, plus occasional additions and removals of single items), implementing a special purpose class that fills those specs is not a crazy undertaking. If you have more specs that you have not explicitly mentioned, it will be trickier, but it's hard to guess without hearing all the specs. So for example, something like:

import random

class AbSet(object):
  def __init__(self, predicate, maxitem=1<<32):
    # set of all ints, >=0 and <maxitem, satisfying the predicate
    self.maxitem = maxitem
    self.predicate = predicate
    self.added = set()
    self.removed = set()

  def copy(self):
    x = type(self)(self.predicate, self.maxitem)
    x.added = set(self.added)
    x.removed = set(self.removed)
    return x

  def __contains__(self, item):
    if item in self.removed: return False
    if item in self.added: return True
    return (0 <= item < self.maxitem) and self.predicate(item)

  def __iter__(self):
    # random endless iteration
    while True:
      x = random.randrange(self.maxitem)
      if x in self: yield x

  def add(self, item):
    if item<0 or item>=self.maxitem: raise ValueError
    if item not in self:
      self.removed.discard(item)
      self.added.add(item)

  def discard(self, item):
    if item<0 or item>=self.maxitem: raise ValueError
    if item in self:
      self.removed.add(item)
      self.added.discard(item)

  def union(self, o):
    pred = lambda v: self.predicate(v) or o.predicate(v),
    x = type(self)(pred, max(self.maxitem, o.maxitem))
    toadd = [v for v in (self.added|o.added) if not pred(v)]
    torem = [v for v in (self.removed|o.removed) if pred(v)]
    x.added = set(toadd)
    x.removed = set(torem)

  def intersection(self, o):
    pred = lambda v: self.predicate(v) and o.predicate(v),
    x = type(self)(pred, min(self.maxitem, o.maxitem))
    toadd = [v for v in (self.added&o.added) if not pred(v)]
    torem = [v for v in (self.removed&o.removed) if pred(v)]
    x.added = set(toadd)
    x.removed = set(torem)

I'm not entirely certain about the logic determining added and removed upon union and intersection, but I hope this is a good base for you to work from.

Alex Martelli