Let's say I have a list of objects. (All together now: "I have a list of objects.") In the web application I'm writing, each time a request comes in, I pick out up to one of these objects according to unspecified criteria and use it to handle the request. Basically like this:

def handle_request(req):
    for h in handlers:
        if h.handles(req):
            return h
    return None

Assuming the order of the objects in the list is unimportant, I can cut down on unnecessary iterations by keeping the list sorted such that the most frequently used (or perhaps most recently used) objects are at the front. I know this isn't something to be concerned about - it'll make only a minuscule, undetectable difference in the app's execution time - but debugging the rest of the code is driving me crazy and I need a distraction :) so I'm asking out of curiosity: what is the most efficient way to maintain the list in sorted order, descending, by the number of times each handler is chosen?

The obvious solution is to make handlers a list of [count, handler] pairs (lists rather than tuples, so the count can be incremented in place), and each time a handler is chosen, increment the count and re-sort the list.

    def handle_request(req):
        for h in handlers:
            if h[1].handles(req):
                h[0] += 1
                # sort on the count alone; comparing whole pairs would
                # fall back to comparing handler objects on tied counts
                handlers.sort(key=lambda p: p[0], reverse=True)
                return h[1]
        return None

But since there's only ever going to be at most one element out of order, and I know which one it is, it seems like some sort of optimization should be possible. Is there something in the standard library, perhaps, that is especially well-suited to this task? Or some other data structure? (Even if it's not implemented in Python) Or should/could I be doing something completely different?

+3  A: 

Python's sort algorithm, timsort, is pretty magical: if your list is sorted except for one element, it will intrinsically (discover and) use that fact, sorting in O(N) time. (Josh Bloch, the Java guru, was so impressed by a presentation about timsort's performance characteristics that he started coding it for Java on his laptop -- it's supposed to become Java's standard sort pretty soon). I'd just do a sort after each locate-and-increment-count, and very much doubt that other approaches can beat timsort.
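A quick way to see that adaptivity for yourself (a sketch I'm adding for illustration, not part of the original answer) is to count comparisons while sorting a list that is already in order -- the `Counted` wrapper class here is invented purely for the demonstration:

```python
# Count how many comparisons list.sort() performs on an
# already-sorted 1000-element list (a proxy for the "at most one
# element out of place" case discussed above).
comparisons = 0

class Counted:
    def __init__(self, value):
        self.value = value

    def __lt__(self, other):  # the only comparison sort() needs
        global comparisons
        comparisons += 1
        return self.value < other.value

# 1000 elements already in descending order
data = [Counted(n) for n in reversed(range(1000))]
data.sort(reverse=True)
# comparisons is now close to N-1: timsort detects the single
# sorted run in one pass and has nothing left to merge
```

A naive O(N log N) sort would need roughly 10,000 comparisons here; timsort gets away with about a thousand.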

Edit: the first alternative that comes to mind, of course, is to "shift up" just the item whose count you've just incremented. But first, a little optimization to avoid copying handlers:

def handle_request(req):
    for h in handlers:
        if h[1].handles(req):
            h[0] += 1
            # key on the count so tied pairs never compare handler objects
            handlers.sort(key=lambda p: p[0], reverse=True)
            break
    else:
        return None
    return h[1]

Now, the "shift up" variant:

def handle_request(req):
    for i, h in enumerate(handlers):
        if h[1].handles(req):
            h[0] += 1
            # walk back past entries whose count is now smaller than h's
            j = i
            while j > 0 and handlers[j - 1][0] < h[0]:
                j -= 1
            if j < i:
                # shift the passed-over entries right and drop h in at j
                handlers[j + 1:i + 1] = handlers[j:i]
                handlers[j] = h
            break
    else:
        return None
    return h[1]

I can imagine patterns of access where this approach might save a little time -- for example, if the distribution was so skewed that most hits were in handlers[0], this would do little work beyond one comparison (while sort needs about N of them even in the best case). Without representative samples of your access patterns, I can't confirm or disprove this!-)

Alex Martelli
Cool, I didn't know that! Someday I'll have to look up the implementation (thus diverting more time from my real work).
David Zaslavsky
Regarding your edit: I think that would be effective since in my tests, different handlers correspond to different URLs, and most of the hits do tend to be for just a few distinct URLs (like images and stylesheets). But I'm not sure if that pattern holds generally.
David Zaslavsky
@David, nope, it doesn't hold generally -- access patterns for different applications and use cases are all over the place (and sometimes shift with time, so that counting long-ago hits as much as fresh ones can damage performance in such cases -- hit counts may then need to be gradually decreased as time passes, with _recent_ hits dominating the scene). Tuning such a structure for optimal performance requires access to a specific application's traces/logs that are deemed to be well representative of real-life traffic.
Alex Martelli
@Alex: I know, I meant "generally" in the context of my website, which is the first place I'll be deploying this. Unfortunately something's wrong with my log analyzer so I can't immediately access the statistics on the relative frequency of access of various URLs.
David Zaslavsky
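The comment thread's idea of letting recent hits dominate can be sketched with exponentially decayed counts. This is an illustrative sketch, not code from the thread: the `DECAY` factor and the `record_hit` helper are assumptions, and the pairs are mutable [count, handler] lists as in the question.

```python
DECAY = 0.99  # illustrative decay factor: older hits fade gradually

def record_hit(handlers, i):
    """Decay every count a little, then credit handler i and bubble
    it toward the front to keep the list in descending count order."""
    for pair in handlers:
        pair[0] *= DECAY  # uniform decay preserves the existing order
    handlers[i][0] += 1   # only the hit entry can now be out of place
    while i > 0 and handlers[i][0] > handlers[i - 1][0]:
        handlers[i - 1], handlers[i] = handlers[i], handlers[i - 1]
        i -= 1
```

With this scheme a handler that was hot long ago (large stale count) is eventually overtaken by one receiving steady recent traffic, which a plain monotonic counter never allows.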
A: 

I'm guessing that all those extra calls to sort() will slow you down more than they speed you up. My suggestion would be to memoize handle_request() using a wrapper such as this (taken from here):

class Memoize:
    """Memoize(fn) - an instance which acts like fn but memoizes its arguments.
    Will only work on functions with non-mutable (hashable) arguments.
    """
    def __init__(self, fn):
        self.fn = fn
        self.memo = {}
    def __call__(self, *args):
        if args not in self.memo:
            self.memo[args] = self.fn(*args)
        return self.memo[args]

You can use it like this:

handle_request = Memoize(handle_request)

That will cause the various return values of handle_request to be cached and could actually provide a noticeable speedup. I would suggest experimenting with when and where you wrap various functions with Memoize() in your app to see just how much memory it takes up and how much it speeds up (or doesn't) various functions. You could also memoize your .handles() method using a similar approach (for example, there's a memoizing decorator here).

Dan McDougall
I actually already have several functions memoized in this app, but not `handle_request`, because it never gets called with the same argument twice. And `handles` isn't a good candidate for memoization either because those functions are very simple (and again, never the same argument twice).
David Zaslavsky
+1  A: 

Sounds like a job for a priority queue (a.k.a. a heap). Python has an implementation of a priority queue as heapq in the standard library. Basically, you keep a tree/heap with the most-frequently-used or most-recently-used item on the top.
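Since `heapq` is a min-heap, keeping the most-used handler on top means storing negated counts; here is a hedged sketch of that idea (the entry layout and `record_hit` helper are my own illustration, and note that updating an arbitrary entry's priority is not a cheap heap operation):

```python
import heapq

# Min-heap of [negated_count, insertion_order, handler]: the handler
# with the highest count sits at heap[0].  The insertion_order field
# breaks ties so handler objects are never compared to each other.
heap = []
for order, name in enumerate(["images", "css", "pages"]):
    heapq.heappush(heap, [0, order, name])

def record_hit(entry):
    """Naive priority update: bump the negated count, then rebuild
    the heap invariant.  heapify is O(N) -- one reason a heap is
    awkward for this update-in-place access pattern."""
    entry[0] -= 1
    heapq.heapify(heap)

entry = next(e for e in heap if e[2] == "pages")
record_hit(entry)
# heap[0] is now the "pages" entry, the most-frequently-hit handler
```

The catch, as the next answer points out, is that a heap only makes the *top* element cheap to find; scanning the handlers in frequency order, which this application does on every request, is not efficient.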

Lie Ryan
+1  A: 

Even though timsort is magical, using list.sort() is not a good idea because (at a minimum) it requires each adjacent pair of entries to be compared each time to ensure that the list is in sorted order.

Using a priority queue (aka Python's heapq module) is a good solution for many problems like this, but is not ideal for your application because it is expensive to traverse a heap in sorted order.

Surprisingly, the best approach for your situation is to use something like the much-maligned bubble sort. Since all entries are in order except for the one whose counter you just adjusted, all that can happen is that the one entry moves up a bit in the list. And since you are only incrementing by one, it shouldn't move far. So just compare it to the previous entry and if they are out of order swap them. Something like:

def handle_request(req):
    for (i, h) in enumerate(handlers):
        if h[1].handles(req):
            h[0] += 1
            while i > 0 and handlers[i][0] > handlers[i-1][0]:
                handlers[i-1], handlers[i] = handlers[i], handlers[i-1]
                i -= 1
            return h[1]
    return None

(Of course if multiple threads are accessing the handlers array, you have to do some kind of synchronization.)
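A minimal sketch of that synchronization, assuming the standard threading module and a single lock guarding both the scan and the swaps (the `Handler` class and prefix matching are placeholders I've invented for a runnable example):

```python
import threading

class Handler:
    """Toy handler: matches requests by URL prefix (illustration only)."""
    def __init__(self, prefix):
        self.prefix = prefix
    def handles(self, req):
        return req.startswith(self.prefix)

# [count, handler] pairs kept in descending count order, as above
handlers = [[0, Handler("/img")], [0, Handler("/css")]]
handlers_lock = threading.Lock()

def handle_request(req):
    with handlers_lock:  # guard the scan and the swaps as one unit
        for i, h in enumerate(handlers):
            if h[1].handles(req):
                h[0] += 1
                while i > 0 and handlers[i][0] > handlers[i - 1][0]:
                    handlers[i - 1], handlers[i] = handlers[i], handlers[i - 1]
                    i -= 1
                return h[1]
        return None
```

Holding one lock across the whole operation keeps the increment and the swap atomic with respect to other threads; finer-grained schemes are possible but hardly worth it for a list this small.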

mhagger
I was thinking along the same lines when I wrote the question... +1 for the implementation.
David Zaslavsky