Let's say I have a list of objects. (All together now: "I have a list of objects.") In the web application I'm writing, each time a request comes in, I pick out up to one of these objects according to unspecified criteria and use it to handle the request. Basically like this:

def handle_request(req):
    for h in handlers:
        if h.handles(req):
            return h
    return None

Assuming the order of the objects in the list is unimportant, I can cut down on unnecessary iterations by keeping the list sorted such that the most frequently used (or perhaps most recently used) objects are at the front. I know this isn't something to be concerned about - it'll make only a minuscule, undetectable difference in the app's execution time - but debugging the rest of the code is driving me crazy and I need a distraction :) so I'm asking out of curiosity: what is the most efficient way to maintain the list in sorted order, descending, by the number of times each handler is chosen?

The obvious solution is to make handlers a list of [count, handler] pairs (lists rather than tuples, so the count can be incremented in place), and each time a handler is chosen, increment the count and re-sort the list.

    def handle_request(req):
        for h in handlers:
            if h[1].handles(req):
                h[0] += 1
                # sort on the count alone; comparing whole pairs would
                # fall back to comparing handler objects on tied counts
                handlers.sort(key=lambda p: p[0], reverse=True)
                return h[1]
        return None

But since there's only ever going to be at most one element out of order, and I know which one it is, it seems like some sort of optimization should be possible. Is there something in the standard library, perhaps, that is especially well-suited to this task? Or some other data structure? (Even if it's not implemented in Python) Or should/could I be doing something completely different?

+3  A: 

Python's sort algorithm, timsort, is pretty magical: if your list is sorted except for one element, it will intrinsically (discover and) use that fact, sorting in O(N) time. (Josh Bloch, the Java guru, was so impressed by a presentation about timsort's performance characteristics that he started coding it for Java on his laptop -- it's supposed to become Java's standard sort pretty soon). I'd just do a sort after each locate-and-increment-count, and very much doubt that other approaches can beat timsort.
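A quick way to see that adaptivity for yourself (a sketch I'm adding for illustration, not part of the original answer) is to count comparisons while sorting a list that is already in order -- the `Counted` wrapper class here is invented purely for the demonstration:

```python
# Count how many comparisons list.sort() performs on an
# already-sorted 1000-element list (a proxy for the "at most one
# element out of place" case discussed above).
comparisons = 0

class Counted:
    def __init__(self, value):
        self.value = value

    def __lt__(self, other):  # the only comparison sort() needs
        global comparisons
        comparisons += 1
        return self.value < other.value

# 1000 elements already in descending order
data = [Counted(n) for n in reversed(range(1000))]
data.sort(reverse=True)
# comparisons is now close to N-1: timsort detects the single
# sorted run in one pass and has nothing left to merge
```

A naive O(N log N) sort would need roughly 10,000 comparisons here; timsort gets away with about a thousand.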

Edit: the first alternative that comes to mind, of course, is to "shift up" just the item whose count you've just incremented. But first, a little optimization to avoid copying handlers:

def handle_request(req):
    for h in handlers:
        if h[1].handles(req):
            h[0] += 1
            # key on the count so tied pairs never compare handler objects
            handlers.sort(key=lambda p: p[0], reverse=True)
            break
    else:
        return None
    return h[1]

Now, the "shift up" variant:

def handle_request(req):
    for i, h in enumerate(handlers):
        if h[1].handles(req):
            h[0] += 1
            # walk back past entries whose count is now smaller than h's
            j = i
            while j > 0 and handlers[j - 1][0] < h[0]:
                j -= 1
            if j < i:
                # shift the passed-over entries right and drop h in at j
                handlers[j + 1:i + 1] = handlers[j:i]
                handlers[j] = h
            break
    else:
        return None
    return h[1]

I can imagine patterns of access where this approach might save a little time -- for example, if the distribution was so skewed that most hits were in handlers[0], this would do little work beyond one comparison (while sort needs about N of them even in the best case). Without representative samples of your access patterns, I can't confirm or disprove this!-)

Alex Martelli
Cool, I didn't know that! Someday I'll have to look up the implementation (thus diverting more time from my real work).
David Zaslavsky
Regarding your edit: I think that would be effective since in my tests, different handlers correspond to different URLs, and most of the hits do tend to be for just a few distinct URLs (like images and stylesheets). But I'm not sure if that pattern holds generally.
David Zaslavsky
@David, nope, it doesn't hold generally -- access patterns for different applications and use cases are all over the place (and sometimes shift with time, so that counting long-ago hits as much as fresh ones can damage performance in such cases -- hit counts may then need to be gradually decreased as time passes, with _recent_ hits dominating the scene). Tuning such a structure for optimal performance requires access to a specific application's traces/logs that are deemed to be well representative of real-life traffic.
Alex Martelli
@Alex: I know, I meant "generally" in the context of my website, which is the first place I'll be deploying this. Unfortunately something's wrong with my log analyzer so I can't immediately access the statistics on the relative frequency of access of various URLs.
David Zaslavsky
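The comment thread's idea of letting recent hits dominate can be sketched with exponentially decayed counts. This is an illustrative sketch, not code from the thread: the `DECAY` factor and the `record_hit` helper are assumptions, and the pairs are mutable [count, handler] lists as in the question.

```python
DECAY = 0.99  # illustrative decay factor: older hits fade gradually

def record_hit(handlers, i):
    """Decay every count a little, then credit handler i and bubble
    it toward the front to keep the list in descending count order."""
    for pair in handlers:
        pair[0] *= DECAY  # uniform decay preserves the existing order
    handlers[i][0] += 1   # only the hit entry can now be out of place
    while i > 0 and handlers[i][0] > handlers[i - 1][0]:
        handlers[i - 1], handlers[i] = handlers[i], handlers[i - 1]
        i -= 1
```

With this scheme a handler that was hot long ago (large stale count) is eventually overtaken by one receiving steady recent traffic, which a plain monotonic counter never allows.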
A: 

I'm guessing that all those extra calls to sort() will slow you down more than they speed you up. My suggestion would be to memoize handle_request() using a wrapper such as this (taken from here):

class Memoize:
    """Memoize(fn) - an instance which acts like fn but memoizes its arguments.
    Will only work on functions with non-mutable (hashable) arguments.
    """
    def __init__(self, fn):
        self.fn = fn
        self.memo = {}
    def __call__(self, *args):
        if args not in self.memo:
            self.memo[args] = self.fn(*args)
        return self.memo[args]

You can use it like this:

handle_request = Memoize(handle_request)

That will cause the various return values of handle_request to be cached and could actually provide a noticeable speedup. I would suggest experimenting with when and where you wrap various functions with Memoize() in your app to see just how much memory it takes up and how much it speeds up (or doesn't) various functions. You could also memoize your .handles() method using a similar approach (for example, there's a memoizing decorator here).

Dan McDougall
I actually already have several functions memoized in this app, but not `handle_request`, because it never gets called with the same argument twice. And `handles` isn't a good candidate for memoization either because those functions are very simple (and again, never the same argument twice).
David Zaslavsky
+1  A: 

Sounds like a job for a priority queue (a.k.a. a heap). Python has an implementation of a priority queue as heapq in the standard library. Basically, you keep a tree/heap with the most-frequently-used or most-recently-used item on the top.
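Since `heapq` is a min-heap, keeping the most-used handler on top means storing negated counts; here is a hedged sketch of that idea (the entry layout and `record_hit` helper are my own illustration, and note that updating an arbitrary entry's priority is not a cheap heap operation):

```python
import heapq

# Min-heap of [negated_count, insertion_order, handler]: the handler
# with the highest count sits at heap[0].  The insertion_order field
# breaks ties so handler objects are never compared to each other.
heap = []
for order, name in enumerate(["images", "css", "pages"]):
    heapq.heappush(heap, [0, order, name])

def record_hit(entry):
    """Naive priority update: bump the negated count, then rebuild
    the heap invariant.  heapify is O(N) -- one reason a heap is
    awkward for this update-in-place access pattern."""
    entry[0] -= 1
    heapq.heapify(heap)

entry = next(e for e in heap if e[2] == "pages")
record_hit(entry)
# heap[0] is now the "pages" entry, the most-frequently-hit handler
```

The catch, as the next answer points out, is that a heap only makes the *top* element cheap to find; scanning the handlers in frequency order, which this application does on every request, is not efficient.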

Lie Ryan
+1  A: 

Even though timsort is magical, using list.sort() is not a good idea because (at a minimum) it requires each adjacent pair of entries to be compared each time to ensure that the list is in sorted order.

Using a priority queue (aka Python's heapq module) is a good solution for many problems like this, but is not ideal for your application because it is expensive to traverse a heap in sorted order.

Surprisingly, the best approach for your situation is to use something like the much-maligned bubble sort. Since all entries are in order except for the one whose counter you just adjusted, all that can happen is that the one entry moves up a bit in the list. And since you are only incrementing by one, it shouldn't move far. So just compare it to the previous entry and if they are out of order swap them. Something like:

def handle_request(req):
    for (i, h) in enumerate(handlers):
        if h[1].handles(req):
            h[0] += 1
            while i > 0 and handlers[i][0] > handlers[i-1][0]:
                handlers[i-1], handlers[i] = handlers[i], handlers[i-1]
                i -= 1
            return h[1]
    return None

(Of course if multiple threads are accessing the handlers array, you have to do some kind of synchronization.)
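A minimal sketch of that synchronization, assuming the standard threading module and a single lock guarding both the scan and the swaps (the `Handler` class and prefix matching are placeholders I've invented for a runnable example):

```python
import threading

class Handler:
    """Toy handler: matches requests by URL prefix (illustration only)."""
    def __init__(self, prefix):
        self.prefix = prefix
    def handles(self, req):
        return req.startswith(self.prefix)

# [count, handler] pairs kept in descending count order, as above
handlers = [[0, Handler("/img")], [0, Handler("/css")]]
handlers_lock = threading.Lock()

def handle_request(req):
    with handlers_lock:  # guard the scan and the swaps as one unit
        for i, h in enumerate(handlers):
            if h[1].handles(req):
                h[0] += 1
                while i > 0 and handlers[i][0] > handlers[i - 1][0]:
                    handlers[i - 1], handlers[i] = handlers[i], handlers[i - 1]
                    i -= 1
                return h[1]
        return None
```

Holding one lock across the whole operation keeps the increment and the swap atomic with respect to other threads; finer-grained schemes are possible but hardly worth it for a list this small.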

mhagger
I was thinking along the same lines when I wrote the question... +1 for the implementation.
David Zaslavsky