There's this script called svnmerge.py that I'm trying to tweak and optimize a bit. I'm completely new to Python though, so it's not easy.

The current problem seems to be related to a class called RevisionSet in the script. In essence what it does is create a large hashtable(?) of integer-keyed boolean values. In the worst case - one for each revision in our SVN repository, which is near 75,000 now.

After that it performs set operations on such huge arrays - addition, subtraction, intersection, and so forth. The implementation is the simplest O(n) implementation, which, naturally, gets pretty slow on such large sets. The whole data structure could be optimized because there are long spans of continuous values. For example, all keys from 1 to 74,000 might contain true. Also the script is written for Python 2.2, which is a pretty old version and we're using 2.6 anyway, so there could be something to gain there too.

I could try to cobble this together myself, but it would be difficult and take a lot of time - not to mention that it might be already implemented somewhere. Although I'd like the learning experience, the result is more important right now. What would you suggest I do?

+6  A: 

You could try doing it with numpy instead of plain python. I found it to be very fast for operations like these.

For example:

import numpy

# Create 1,000,000 random integers between 0 and 999, takes 21ms
x = numpy.random.randint(0, 1000, 1000000)

# Get all items that are larger than 500, takes 2.58ms
y = x > 500

# Add 10 to those items, takes 26.1ms
x[y] += 10

Since that's with far more elements than you have, 75,000 should not be a problem either :)
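For the set-style operations the question actually needs, one option is to model each revision set as a boolean mask indexed by revision number; union, intersection, and difference then become elementwise operations. A rough sketch (the repository size and ranges below are made up, and this isn't wired into svnmerge.py):

import numpy

NUM_REVS = 75000  # hypothetical repository size

# Two revision sets as boolean masks indexed by revision number
a = numpy.zeros(NUM_REVS, dtype=bool)
b = numpy.zeros(NUM_REVS, dtype=bool)
a[1:74001] = True        # revisions 1..74000
b[73000:75000] = True    # revisions 73000..74999

union        = a | b
intersection = a & b
difference   = a & ~b

# Back to plain revision numbers when needed
revs = numpy.nonzero(intersection)[0]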

WoLpH
OK, I'll check it out. I'll accept your answer if I end up using it.
Vilx-
Personally, I don't think that numpy is really called for here. Python's built-in sets are quite sufficient for this I think.
Justin Peel
You could probably tell numpy to use 8 bit integers as well if you want to reduce your memory footprint. However, I'm not sure if you can do this with the randint function. http://docs.scipy.org/doc/numpy/user/basics.types.html
GWW
@GWW: not with randint directly, but you can even tell numpy to convert it to a `bool` like this: `numpy.array(numpy.random.randint(0, 2, 1000000), dtype='bool')`
WoLpH
@Justin Peel: Sufficient, yes, but I'm guessing that numpy will be faster for most operations. It's specifically designed for mass operations like these.
WoLpH
@WoLpH oh cool thanks for the tip!
GWW
@WoLpH I love numpy too, but I guess what I'm saying is that while numpy is very fast, using numpy here would add another dependency. Also, it will take a lot more rewriting to use it as opposed to using sets. I would try the sets method first and see if it is fast enough before adding numpy to the mix, but that's just my opinion. Using numpy for this means either using numpy arrays as sets, which I've tested isn't necessarily faster than using Python's sets, or it means using binary arrays along with np.where. Having to use np.where could slow things down a bit. Just some things to consider.
Justin Peel
@Justin Peel: Indeed, having another dependency might not be the best choice. But if that is not a problem in this case, then it will provide both a valuable learning experience and a clean alternative. Also, Python's sets are not always that fast in my experience. If you're using them to get a unique list of items, then using a dictionary seems to be faster.
WoLpH
A: 

For example, all keys from 1 to 74,000 contain true

Why not work on a subset? Just 74001 to the end.

Pruning 74/75th of your data is far easier than trying to write an algorithm more clever than O(n).
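A hypothetical sketch of the pruning idea, assuming the data is the revision-to-boolean dictionary the question describes and that everything up to some cutoff is already known to be merged (the cutoff value is made up):

CUTOFF = 74000  # hypothetical: revisions up to here are known to be merged

def prune(revisions):
    """Keep only the revisions that can still differ between branches."""
    return dict((rev, flag) for rev, flag in revisions.items() if rev > CUTOFF)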

S.Lott
Of course, but then I'd have to rewrite the whole script.
Vilx-
@Vilx: How so? You only need to subset things.
S.Lott
I think you might have misunderstood me. These are not real numbers, it's just something I made up on the spot. All I mean to say is that there are large intervals of the same boolean value.
Vilx-
@Vilx: I think you may have misunderstood me. If you have ranges that don't matter, find a way to filter them out. Also, try to stick to facts in your question -- misleading numeric values give you misleading answers.
S.Lott
OK, updated the question with the word "might". Anyway, that's just what I'm trying to do - change this class (and only this class, I don't want to change how it's being used in code) so that the ranges would be optimized.
Vilx-
@Vilx: isn't removing unused values an "optimization"? What's wrong with adding a filter and nothing more?
S.Lott
A: 

You should rewrite RevisionSet to have a set of revisions. I think the internal representation for a revision should be an integer and revision ranges should be created as needed.

There is no compelling reason to use code that supports python 2.3 and earlier.
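A small sketch of that idea, assuming revisions live in a plain set of ints and (start, end) ranges are only materialized when output is needed (the helper name is made up):

from itertools import groupby

def to_ranges(revisions):
    """Collapse a set of ints into inclusive (start, end) ranges on demand."""
    ranges = []
    # Consecutive revisions share the same value of rev - index
    for _, run in groupby(enumerate(sorted(revisions)), lambda pair: pair[1] - pair[0]):
        run = [rev for _, rev in run]
        ranges.append((run[0], run[-1]))
    return ranges

print(to_ranges(set([1, 2, 3, 7, 9, 10])))  # -> [(1, 3), (7, 7), (9, 10)]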

hughdbrown
A: 

Just a thought. I used to do this kind of thing using run-length coding in binary image manipulation. That is, store each set as a series of numbers: number of bits off, number of bits on, number of bits off, etc.

Then you can do all sorts of boolean operations on them as decorations on a simple merge algorithm.
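A minimal sketch of the run-length idea, using inclusive (start, end) ranges rather than raw off/on counts; each boolean operation becomes a single merge pass over two sorted range lists (the names here are made up, nothing is taken from svnmerge.py):

def range_intersection(a, b):
    """Intersect two sorted lists of disjoint, inclusive (start, end) ranges."""
    result = []
    i = j = 0
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start <= end:
            result.append((start, end))
        # Advance whichever range ends first
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return result

print(range_intersection([(1, 74000)], [(50, 60), (73999, 75000)]))
# -> [(50, 60), (73999, 74000)]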

Mike Dunlavey
+1  A: 

Here's a quick replacement for RevisionSet that makes it into a set. It should be much faster. I didn't fully test it, but it worked with all of the tests that I did. There are undoubtedly other ways to speed things up, but I think that this will really help because it actually harnesses the fast implementation of sets rather than doing loops in Python which the original code was doing in functions like __sub__ and __and__. The only problem with it is that the iterator isn't sorted. You might have to change a little bit of the code to account for this. I'm sure there are other ways to improve this, but hopefully it will give you a good start.

import re  # needed for the regex parsing below if this class is used standalone

class RevisionSet(set):
    """
    A set of revisions, backed by the built-in set type for fast set
    operations.  As this class does not include branch information, it's
    assumed that one instance will be used per branch.
    """
    def __init__(self, parm):
        """Constructs a RevisionSet from a string in property form, or from
        a dictionary whose keys are the revisions. Raises ValueError if the
        input string is invalid."""


        revision_range_split_re = re.compile('[-:]')

        if isinstance(parm, set):
            print "1"
            self.update(parm.copy())
        elif isinstance(parm, list):
            self.update(set(parm))
        else:
            parm = parm.strip()
            if parm:
                for R in parm.split(","):
                    rev_or_revs = re.split(revision_range_split_re, R)
                    if len(rev_or_revs) == 1:
                        self.add(int(rev_or_revs[0]))
                    elif len(rev_or_revs) == 2:
                        self.update(set(range(int(rev_or_revs[0]),
                                         int(rev_or_revs[1])+1)))
                    else:
                        raise ValueError, 'Ill formatted revision range: ' + R

    def sorted(self):
        return sorted(self)

    def normalized(self):
        """Returns a normalized version of the revision set, which is an
        ordered list of couples (start,end), with the minimum number of
        intervals."""
        revnums = sorted(self)
        revnums.reverse()
        ret = []
        while revnums:
            s = e = revnums.pop()
            while revnums and revnums[-1] in (e, e+1):
                e = revnums.pop()
            ret.append((s, e))
        return ret

    def __str__(self):
        """Convert the revision set to a string, using its normalized form."""
        L = []
        for s,e in self.normalized():
            if s == e:
                L.append(str(s))
            else:
                L.append(str(s) + "-" + str(e))
        return ",".join(L)

Addition: By the way, I compared doing unions, intersections and subtractions of the original RevisionSet and my RevisionSet above, and the above code is from 3x to 7x faster for those operations when operating on two RevisionSets that have 75000 elements. I know that other people are saying that numpy is the way to go, but if you aren't very experienced with Python, as your comment indicates, then you might not want to go that route because it will involve a lot more changes. I'd recommend trying my code, seeing if it works and if it does, then see if it is fast enough for you. If it isn't, then I would try profiling to see what needs to be improved. Only then would I consider using numpy (which is a great package that I use quite frequently).

Justin Peel
`def sorted(self): return sorted(self)` - this seems ominous to me...
Vilx-
@Vilx, you can remove that if you just replace the 3 or so spots in the file where the sorted method is called with plain `sorted(theRevSet)`
Justin Peel
So it won't cause a recursive stack overflow then? I started reading the Python tutorial today, but didn't get to classes yet. :P
Vilx-
No, because the `sorted` method of the RevisionSet class is different from the built-in `sorted` function. In other words, the built-in function doesn't call the `sorted` method. If you look at a plain set, it doesn't have any method called `sorted`. You can keep the old `sorted` method if you'd like, but I wouldn't.
Justin Peel
OK, thank you! :)
Vilx-