views:

710

answers:

5

I need to get the n smallest numbers of a list in Python. I need this to be really fast, because it's in a performance-critical part of my code and it's repeated many times.

n is usually no greater than 10, and the list usually has around 20000 elements. The list is different each time I call the function, and sorting can't be done in place.

Initially, I have written this function:

def mins(items, n):
    mins = [float('inf')]*n
    for item in items:
        for i, m in enumerate(mins):
            if item < m:
                mins.insert(i, item)  # insert in sorted position
                mins.pop()            # drop the largest to keep n elements
                break
    return mins

But this function can't beat a simple sorted(items)[:n], which sorts the entire list. Here is my test:

from random import randint, random
import time

test_data = [randint(10, 50) + random() for i in range(20000)]

init = time.time()
result = mins(test_data, 8)  # avoid rebinding the name 'mins' over the function
print 'mins(items, n):', time.time() - init

init = time.time()
mins = sorted(test_data)[:8]
print 'sorted(items)[:n]:', time.time() - init

Results:

mins(items, n): 0.0632939338684
sorted(items)[:n]: 0.0231449604034

sorted()[:n] is three times faster. I believe this is because:

  1. The insert() operation is costly, because Python lists are arrays, not linked lists.
  2. sorted() is an optimized C function, while mine is pure Python.

Is there any way to beat sorted()[:n]? Should I use a C extension, or Pyrex, or Psyco, or something like that?

Thanks in advance for your answers.

+10  A: 

You actually want a sorted sequence of mins.

mins = items[:n]
mins.sort()
for i in items[n:]:
    if i < mins[-1]: 
        mins.append(i)
        mins.sort()
        mins = mins[:n]

This runs much faster because you don't even touch mins unless the given item is provably smaller than its largest value. It takes about 1/10th the time of the original algorithm.

This ran in zero time on my Dell. I had to run it 10 times to get a measurable run time.

mins(items, n): 0.297000169754
sorted(items)[:n]: 0.109999895096
mins2(items)[:n]: 0.0309998989105

Using bisect.insort instead of append and sort may speed this up a hair further.
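For example (a sketch of that variant; the name mins_insort is just for illustration, and it keeps the strict < comparison from the code above):

```python
from bisect import insort

def mins_insort(items, n):
    # Seed with the first n items, sorted.
    mins = sorted(items[:n])
    for i in items[n:]:
        if i < mins[-1]:
            insort(mins, i)  # insert i, keeping mins sorted
            mins.pop()       # drop the largest to keep len(mins) == n
    return mins
```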

S.Lott
This is super fast!
Manuel Ceron
A heap would be better; no need to fully sort the whole list for each insert, just a cheaper reheap.
erickson
@erickson: Just edited to add that bisect.insort may have the same effect.
S.Lott
You're right, using bisect speeds up the algorithm. It's the fastest at the moment. And it's pure Python.
Manuel Ceron
Cool; sorry I don't know python but I figured it would have something along these lines.
erickson
Bisect still isn't optimal. Its worst case is proportional to n * len(items). Try timing it on a strictly decreasing sequence of items. A heap can get that down to O(log(n) * len(items)). Not that there is much difference with such a small n.
Rafał Dowgird
No, bisect is O(log(len(items))). With a strictly decreasing sequence, this algorithm is still faster than heapq.nsmallest(). I just tested it.
Manuel Ceron
mins([1, 1, 0, 2], 3) returns [0, 1]. Is it intentional? For example, heapq.nsmallest(3, [1,1,0,2]) returns [0,1,1] i.e., it preserves duplicates and returns exactly n items.
J.F. Sebastian
Duplicate preservation is an easy change.
S.Lott
@S.Lott: nsmallest_slott_list(3, [0,1,2]) returns [0]. It is wrong. I've posted corrected version of you algorithm.
J.F. Sebastian
@J.F. Sebastian: nsmallest_slott_list? Not sure what this code is. But it looks like you've reversed items and n. If so, all bets are off.
S.Lott
s/you/your/ in my previous comment
J.F. Sebastian
@S.Lott: Compare your algorithm and my implementation.
J.F. Sebastian
If the first item is the smallest and n > 1, then your algorithm doesn't work.
J.F. Sebastian
@J.F. Sebastian: Thanks for the correction. Also, you've reversed the arguments from the original question.
S.Lott
I've used the arguments order from stdlib's heapq.nsmallest(n, items).
J.F. Sebastian
@S.Lott: please update your answer with the corrected version to mark your answer as selected
Manuel Ceron
+2  A: 

A possibility is to use the bisect module:

import bisect

def mins(items, n):
    mins = [float('inf')]*n
    for item in items:
        bisect.insort(mins, item)
        mins.pop()
    return mins

However, it's just a bit faster for me:

mins(items, n): 0.0892250537872
sorted(items)[:n]: 0.0990262031555

Using psyco does speed it up a bit more:

import bisect
import psyco
psyco.full()

def mins(items, n):
    mins = [float('inf')]*n
    for item in items:
        bisect.insort(mins, item)
        mins.pop()
    return mins

Result:

mins(items, n): 0.0431621074677
sorted(items)[:n]: 0.0859830379486
fredreichbier
+2  A: 

If speed is of utmost concern, the fastest method is going to be C. Psyco has an upfront cost, but may prove to be pretty fast. I would recommend Cython for Python-to-C compilation (a more up-to-date fork of Pyrex).

Hand-coding it in C would be best, and would allow you to use data structures specific to your problem domain.

But note:

"Compiling the wrong algorithm in C may not be any faster than the right algorithm in Python" @S.Lott

I wanted to add S.Lott's comment so it gets noticed. Python makes an excellent prototyping language, in which you can iron out an algorithm that you intend to later translate to a lower-level language.

JimB
Compiling the wrong algorithm in C may not be any faster than the right algorithm in Python.
S.Lott
@S.Lott, I absolutely agree :) - Since you had a better algorithm, all I could do was to offer up a language alternative, (plus I wanted to mention Cython, as opposed to Pyrex)
JimB
+3  A: 

I like erickson's heap idea. I don't know Python either, but there appears to be a canned solution here: heapq — Heap queue algorithm
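For illustration, the bounded-heap idea can also be hand-rolled in a few lines (a rough sketch; the name nsmallest_heap is made up, and values are negated because heapq only provides a min-heap, so the negated heap acts as a max-heap of the n smallest values seen so far):

```python
import heapq

def nsmallest_heap(n, items):
    # Max-heap of the n smallest values seen so far, stored negated
    # because heapq implements a min-heap.
    heap = [-x for x in items[:n]]
    heapq.heapify(heap)
    for x in items[n:]:
        if x < -heap[0]:                 # smaller than the current worst kept value
            heapq.heapreplace(heap, -x)  # pop worst, push x: one O(log n) reheap
    return sorted(-x for x in heap)
```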

I have tried heapq.nsmallest, but even though it's a bit faster than sorted(items)[:n], it's not faster than S.Lott's algorithm
Manuel Ceron
+5  A: 
import heapq

nlesser_items = heapq.nsmallest(n, items)

Here's a correct version of S.Lott's algorithm:

from bisect    import insort
from itertools import islice

def nsmallest_slott_bisect(n, iterable, insort=insort):
    it   = iter(iterable)
    mins = sorted(islice(it, n))
    for el in it:
        if el <= mins[-1]: #NOTE: equal sign is to preserve duplicates
            insort(mins, el)
            mins.pop()

    return mins

Performance:

$ python -mtimeit -s "import marshal; from nsmallest import nsmallest$label as nsmallest; items = marshal.load(open('items.marshal','rb')); n = 10"\
 "nsmallest(n, items)"
nsmallest_heapq
100 loops, best of 3: 12.9 msec per loop
nsmallest_slott_list
100 loops, best of 3: 4.37 msec per loop
nsmallest_slott_bisect
100 loops, best of 3: 3.95 msec per loop

nsmallest_slott_bisect is 3 times faster than heapq's nsmallest (for n=10, len(items)=20000). nsmallest_slott_list is only marginally slower. It is unclear why heapq's nsmallest is so slow; its algorithm is almost identical to the one presented above (for small n).

J.F. Sebastian
Yes, this is the fastest one. Thanks for the corrections, and thanks to S.Lott too. This answer is the new chosen one :)
Manuel Ceron
@Manuel: I think the main credit should go to S.Lott and his answer should be accepted when he corrects his version (it is still incorrect at the time of this comment).
J.F. Sebastian
I agree. I'm going to give him back the selection when he updates the algorithm
Manuel Ceron