There's this script called svnmerge.py that I'm trying to tweak and optimize a bit. I'm completely new to Python though, so it's not easy.

The current problem seems to be related to a class called RevisionSet in the script. In essence what it does is create a large hashtable(?) of integer-keyed boolean values. In the worst case - one for each revision in our SVN repository, which is near 75,000 now.

After that it performs set operations on such huge arrays - addition, subtraction, intersection, and so forth. The implementation is the simplest O(n) implementation, which, naturally, gets pretty slow on such large sets. The whole data structure could be optimized because there are long spans of continuous values. For example, all keys from 1 to 74,000 might contain true. Also the script is written for Python 2.2, which is a pretty old version and we're using 2.6 anyway, so there could be something to gain there too.

I could try to cobble this together myself, but it would be difficult and take a lot of time - not to mention that it might be already implemented somewhere. Although I'd like the learning experience, the result is more important right now. What would you suggest I do?

+6  A: 

You could try doing it with numpy instead of plain python. I found it to be very fast for operations like these.

For example:

import numpy

# Create 1,000,000 random integers between 0 and 999, takes 21ms
x = numpy.random.randint(0, 1000, 1000000)

# Get all items that are larger than 500, takes 2.58ms
y = x > 500

# Add 10 to those items, takes 26.1ms
x[y] += 10

Since that's with far more elements than you have, 75,000 should not be a problem either :)
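For the set-style operations the question actually needs, one option is to model each revision set as a boolean mask indexed by revision number; union, intersection, and difference then become elementwise operations. A rough sketch (the repository size and ranges below are made up, and this isn't wired into svnmerge.py):

import numpy

NUM_REVS = 75000  # hypothetical repository size

# Two revision sets as boolean masks indexed by revision number
a = numpy.zeros(NUM_REVS, dtype=bool)
b = numpy.zeros(NUM_REVS, dtype=bool)
a[1:74001] = True        # revisions 1..74000
b[73000:75000] = True    # revisions 73000..74999

union        = a | b
intersection = a & b
difference   = a & ~b

# Back to plain revision numbers when needed
revs = numpy.nonzero(intersection)[0]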

WoLpH
OK, I'll check it out. I'll accept your answer if I end up using it.
Vilx-
Personally, I don't think that numpy is really called for here. Python's built-in sets are quite sufficient for this I think.
Justin Peel
You could probably tell numpy to use 8 bit integers as well if you want to reduce your memory footprint. However, I'm not sure if you can do this with the randint function. http://docs.scipy.org/doc/numpy/user/basics.types.html
GWW
@GWW: not with randint directly, but you can even tell numpy to convert it to a `bool` like this: `numpy.array(numpy.random.randint(0, 2, 1000000), dtype='bool')`
WoLpH
@Justin Peel: Sufficient, yes, but I'm guessing that numpy will be faster for most operations. It's specifically designed for mass operations like these.
WoLpH
@WoLpH oh cool thanks for the tip!
GWW
@WoLpH I love numpy too, but I guess what I'm saying is that while numpy is very fast, using numpy here would add another dependency. Also, it will take a lot more rewriting to use it as opposed to using sets. I would try the sets method first and see if it is fast enough before adding numpy to the mix, but that's just my opinion. Using numpy for this means either using numpy arrays as sets, which I've tested isn't necessarily faster than using Python's sets, or it means using binary arrays along with np.where. Having to use np.where could slow things down a bit. Just some things to consider.
Justin Peel
@Justin Peel: Indeed, having another dependency might not be the best choice. But if that is not a problem in this case, then it will provide both a valuable learning experience and a clean alternative. Also, Python's sets are not always that fast in my experience. If you're using them to get a unique list of items, then using a dictionary seems to be faster.
WoLpH
A: 

For example, all keys from 1 to 74,000 contain true

Why not work on a subset? Just 74001 to the end.

Pruning 74/75th of your data is far easier than trying to write an algorithm more clever than O(n).
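A hypothetical sketch of the pruning idea, assuming the data is the revision-to-boolean dictionary the question describes and that everything up to some cutoff is already known to be merged (the cutoff value is made up):

CUTOFF = 74000  # hypothetical: revisions up to here are known to be merged

def prune(revisions):
    """Keep only the revisions that can still differ between branches."""
    return dict((rev, flag) for rev, flag in revisions.items() if rev > CUTOFF)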

S.Lott
Of course, but then I'd have to rewrite the whole script.
Vilx-
@Vilx: How so? You only need to subset things.
S.Lott
I think you might have misunderstood me. These are not real numbers, it's just something I made up on the spot. All I mean to say is that there are large intervals of the same boolean value.
Vilx-
@Vilx: I think you may have misunderstood me. If you have ranges that don't matter, find a way to filter them out. Also, try to stick to facts in your question -- misleading numeric values give you misleading answers.
S.Lott
OK, updated the question with the word "might". Anyway, that's just what I'm trying to do - change this class (and only this class, I don't want to change how it's being used in code) so that the ranges would be optimized.
Vilx-
@Vilx: isn't removing unused values an "optimization"? What's wrong with adding a filter and nothing more?
S.Lott
A: 

You should rewrite RevisionSet to have a set of revisions. I think the internal representation for a revision should be an integer and revision ranges should be created as needed.

There is no compelling reason to use code that supports python 2.3 and earlier.
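A small sketch of that idea, assuming revisions live in a plain set of ints and (start, end) ranges are only materialized when output is needed (the helper name is made up):

from itertools import groupby

def to_ranges(revisions):
    """Collapse a set of ints into inclusive (start, end) ranges on demand."""
    ranges = []
    # Consecutive revisions share the same value of rev - index
    for _, run in groupby(enumerate(sorted(revisions)), lambda pair: pair[1] - pair[0]):
        run = [rev for _, rev in run]
        ranges.append((run[0], run[-1]))
    return ranges

print(to_ranges(set([1, 2, 3, 7, 9, 10])))  # -> [(1, 3), (7, 7), (9, 10)]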

hughdbrown
A: 

Just a thought. I used to do this kind of thing using run-length coding in binary image manipulation. That is, store each set as a series of numbers: number of bits off, number of bits on, number of bits off, etc.

Then you can do all sorts of boolean operations on them as decorations on a simple merge algorithm.
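A minimal sketch of the run-length idea, using inclusive (start, end) ranges rather than raw off/on counts; each boolean operation becomes a single merge pass over two sorted range lists (the names here are made up, nothing is taken from svnmerge.py):

def range_intersection(a, b):
    """Intersect two sorted lists of disjoint, inclusive (start, end) ranges."""
    result = []
    i = j = 0
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start <= end:
            result.append((start, end))
        # Advance whichever range ends first
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return result

print(range_intersection([(1, 74000)], [(50, 60), (73999, 75000)]))
# -> [(50, 60), (73999, 74000)]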

Mike Dunlavey
+1  A: 

Here's a quick replacement for RevisionSet that makes it into a set. It should be much faster. I didn't fully test it, but it worked with all of the tests that I did. There are undoubtedly other ways to speed things up, but I think that this will really help because it actually harnesses the fast implementation of sets rather than doing loops in Python which the original code was doing in functions like __sub__ and __and__. The only problem with it is that the iterator isn't sorted. You might have to change a little bit of the code to account for this. I'm sure there are other ways to improve this, but hopefully it will give you a good start.

import re  # needed for the regex parsing below if this class is used standalone

class RevisionSet(set):
    """
    A set of revisions, backed by the built-in set type for fast set
    operations.  As this class does not include branch information, it's
    assumed that one instance will be used per branch.
    """
    def __init__(self, parm):
        """Constructs a RevisionSet from a string in property form, or from
        a dictionary whose keys are the revisions. Raises ValueError if the
        input string is invalid."""


        revision_range_split_re = re.compile('[-:]')

        if isinstance(parm, set):
            print "1"
            self.update(parm.copy())
        elif isinstance(parm, list):
            self.update(set(parm))
        else:
            parm = parm.strip()
            if parm:
                for R in parm.split(","):
                    rev_or_revs = re.split(revision_range_split_re, R)
                    if len(rev_or_revs) == 1:
                        self.add(int(rev_or_revs[0]))
                    elif len(rev_or_revs) == 2:
                        self.update(set(range(int(rev_or_revs[0]),
                                         int(rev_or_revs[1])+1)))
                    else:
                        raise ValueError, 'Ill formatted revision range: ' + R

    def sorted(self):
        return sorted(self)

    def normalized(self):
        """Returns a normalized version of the revision set, which is an
        ordered list of couples (start,end), with the minimum number of
        intervals."""
        revnums = sorted(self)
        revnums.reverse()
        ret = []
        while revnums:
            s = e = revnums.pop()
            while revnums and revnums[-1] in (e, e+1):
                e = revnums.pop()
            ret.append((s, e))
        return ret

    def __str__(self):
        """Convert the revision set to a string, using its normalized form."""
        L = []
        for s,e in self.normalized():
            if s == e:
                L.append(str(s))
            else:
                L.append(str(s) + "-" + str(e))
        return ",".join(L)

Addition: By the way, I compared doing unions, intersections and subtractions of the original RevisionSet and my RevisionSet above, and the above code is from 3x to 7x faster for those operations when operating on two RevisionSets that have 75000 elements. I know that other people are saying that numpy is the way to go, but if you aren't very experienced with Python, as your comment indicates, then you might not want to go that route because it will involve a lot more changes. I'd recommend trying my code, seeing if it works and if it does, then see if it is fast enough for you. If it isn't, then I would try profiling to see what needs to be improved. Only then would I consider using numpy (which is a great package that I use quite frequently).

Justin Peel
`def sorted(self): return sorted(self)` - this seems ominous to me...
Vilx-
@Vilx, you can remove that if you just replace the 3 or so spots in the file where the sorted method is called with plain `sorted(theRevSet)`
Justin Peel
So it won't cause a recursive stack overflow then? I started reading the Python tutorial today, but didn't get to classes yet. :P
Vilx-
No, because the `sorted` method of the RevisionSet class is different from the built-in `sorted` function. In other words, the built-in function doesn't call the `sorted` method. If you look at a plain set, it doesn't have any method called `sorted`. You can keep the old `sorted` method if you'd like, but I wouldn't.
Justin Peel
OK, thank you! :)
Vilx-