views:

710

answers:

5

I need to get the n smallest numbers of a list in Python. I need this to be really fast, because it's in a performance-critical part of my code and it's repeated many times.

n is usually no greater than 10, and the list usually has around 20000 elements. The list is different each time I call the function, and sorting can't be done in place.

Initially, I have written this function:

def mins(items, n):
    mins = [float('inf')]*n
    for item in items:
        for i, m in enumerate(mins):
            if item < m:
                mins.insert(i, item)  # insert in sorted position
                mins.pop()            # drop the largest to keep n elements
                break
    return mins

But this function can't beat a simple sorted(items)[:n], which sorts the entire list. Here is my test:

from random import randint, random
import time

test_data = [randint(10, 50) + random() for i in range(20000)]

init = time.time()
result = mins(test_data, 8)  # avoid rebinding the name 'mins' over the function
print 'mins(items, n):', time.time() - init

init = time.time()
mins = sorted(test_data)[:8]
print 'sorted(items)[:n]:', time.time() - init

Results:

mins(items, n): 0.0632939338684
sorted(items)[:n]: 0.0231449604034

sorted()[:n] is three times faster. I believe this is because:

  1. The insert() operation is costly, because Python lists are arrays, not linked lists.
  2. sorted() is an optimized C function, while mine is pure Python.

Is there any way to beat sorted()[:n]? Should I use a C extension, or Pyrex, or Psyco, or something like that?

Thanks in advance for your answers.

+10  A: 

You actually want a sorted sequence of mins.

mins = items[:n]
mins.sort()
for i in items[n:]:
    if i < mins[-1]: 
        mins.append(i)
        mins.sort()
        mins = mins[:n]

This runs much faster because you don't even touch mins unless the given item is provably smaller than its largest value. It takes about 1/10th the time of the original algorithm.

This ran in zero time on my Dell. I had to run it 10 times to get a measurable run time.

mins(items, n): 0.297000169754
sorted(items)[:n]: 0.109999895096
mins2(items)[:n]: 0.0309998989105

Using bisect.insort instead of append and sort may speed this up a hair further.
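For example (a sketch of that variant; the name mins_insort is just for illustration, and it keeps the strict < comparison from the code above):

```python
from bisect import insort

def mins_insort(items, n):
    # Seed with the first n items, sorted.
    mins = sorted(items[:n])
    for i in items[n:]:
        if i < mins[-1]:
            insort(mins, i)  # insert i, keeping mins sorted
            mins.pop()       # drop the largest to keep len(mins) == n
    return mins
```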

S.Lott
This is super fast!
Manuel Ceron
A heap would be better; no need to fully sort the whole list for each insert, just a cheaper reheap.
erickson
@erickson: Just edited to add that bisect.insort may have the same effect.
S.Lott
You're right, using bisect speeds up the algorithm. It's the fastest at the moment. And it's pure Python.
Manuel Ceron
Cool; sorry I don't know python but I figured it would have something along these lines.
erickson
Bisect still isn't optimal. Its worst case is proportional to n * len(items). Try timing it on a strictly decreasing sequence of items. A heap can get that down to O(log(n) * len(items)). Not that there is much difference with such a small n.
Rafał Dowgird
No, bisect is O(log(len(items))). With a strictly decreasing sequence, this algorithm is still faster than heapq.nsmallest(). I just tested it.
Manuel Ceron
mins([1, 1, 0, 2], 3) returns [0, 1]. Is it intentional? For example, heapq.nsmallest(3, [1,1,0,2]) returns [0,1,1] i.e., it preserves duplicates and returns exactly n items.
J.F. Sebastian
Duplicate preservation is an easy change.
S.Lott
@S.Lott: nsmallest_slott_list(3, [0,1,2]) returns [0]. It is wrong. I've posted corrected version of you algorithm.
J.F. Sebastian
@J.F. Sebastian: nsmallest_slott_list? Not sure what this code is. But it looks like you've reversed items and n. If so, all bets are off.
S.Lott
s/you/your/ in my previous comment
J.F. Sebastian
@S.Lott: Compare your algorithm and my implementation.
J.F. Sebastian
If the first item is the smallest and n > 1, then your algorithm doesn't work.
J.F. Sebastian
@J.F. Sebastian: Thanks for the correction. Also, you've reversed the arguments from the original question.
S.Lott
I've used the arguments order from stdlib's heapq.nsmallest(n, items).
J.F. Sebastian
@S.Lott: please update your answer with the corrected version to mark your answer as selected
Manuel Ceron
+2  A: 

A possibility is to use the bisect module:

import bisect

def mins(items, n):
    mins = [float('inf')]*n
    for item in items:
        bisect.insort(mins, item)
        mins.pop()
    return mins

However, it's just a bit faster for me:

mins(items, n): 0.0892250537872
sorted(items)[:n]: 0.0990262031555

Using psyco does speed it up a bit more:

import bisect
import psyco
psyco.full()

def mins(items, n):
    mins = [float('inf')]*n
    for item in items:
        bisect.insort(mins, item)
        mins.pop()
    return mins

Result:

mins(items, n): 0.0431621074677
sorted(items)[:n]: 0.0859830379486
fredreichbier
+2  A: 

If speed is of utmost concern, the fastest method is going to be C. Psyco has an upfront cost, but may prove to be pretty fast. I would recommend Cython for Python-to-C compilation (a more up-to-date fork of Pyrex).

Hand-coding it in C would be best, and would allow you to use data structures specific to your problem domain.

But note:

"Compiling the wrong algorithm in C may not be any faster than the right algorithm in Python" @S.Lott

I wanted to add S.Lott's comment so it gets noticed. Python makes an excellent prototyping language, in which you can iron out an algorithm that you intend to later translate to a lower-level language.

JimB
Compiling the wrong algorithm in C may not be any faster than the right algorithm in Python.
S.Lott
@S.Lott, I absolutely agree :) - Since you had a better algorithm, all I could do was to offer up a language alternative, (plus I wanted to mention Cython, as opposed to Pyrex)
JimB
+3  A: 

I like erickson's heap idea. I don't know Python either, but there appears to be a canned solution here: heapq — Heap queue algorithm
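For illustration, the bounded-heap idea can also be hand-rolled in a few lines (a rough sketch; the name nsmallest_heap is made up, and values are negated because heapq only provides a min-heap, so the negated heap acts as a max-heap of the n smallest values seen so far):

```python
import heapq

def nsmallest_heap(n, items):
    # Max-heap of the n smallest values seen so far, stored negated
    # because heapq implements a min-heap.
    heap = [-x for x in items[:n]]
    heapq.heapify(heap)
    for x in items[n:]:
        if x < -heap[0]:                 # smaller than the current worst kept value
            heapq.heapreplace(heap, -x)  # pop worst, push x: one O(log n) reheap
    return sorted(-x for x in heap)
```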

I have tried heapq.nsmallest, but even though it's a bit faster than sorted(items)[:n], it's not faster than S.Lott's algorithm
Manuel Ceron
+5  A: 
import heapq

nlesser_items = heapq.nsmallest(n, items)

Here's a correct version of S.Lott's algorithm:

from bisect    import insort
from itertools import islice

def nsmallest_slott_bisect(n, iterable, insort=insort):
    it   = iter(iterable)
    mins = sorted(islice(it, n))
    for el in it:
        if el <= mins[-1]: #NOTE: equal sign is to preserve duplicates
            insort(mins, el)
            mins.pop()

    return mins

Performance:

$ python -mtimeit -s "import marshal; from nsmallest import nsmallest$label as nsmallest; items = marshal.load(open('items.marshal','rb')); n = 10"\
 "nsmallest(n, items)"
nsmallest_heapq
100 loops, best of 3: 12.9 msec per loop
nsmallest_slott_list
100 loops, best of 3: 4.37 msec per loop
nsmallest_slott_bisect
100 loops, best of 3: 3.95 msec per loop

nsmallest_slott_bisect is 3 times faster than heapq's nsmallest (for n=10, len(items)=20000). nsmallest_slott_list is only marginally slower. It is unclear why heapq's nsmallest is so slow; its algorithm is almost identical to the one presented above (for small n).

J.F. Sebastian
Yes, this is the fastest one. Thanks for the corrections, and thanks to S.Lott too. This answer is the new chosen one :)
Manuel Ceron
@Manuel: I think the main credit should go to S.Lott and his answer should be accepted when he corrects his version (it is still incorrect at the time of this comment).
J.F. Sebastian
I agree. I'm going to give him back the selection when he updates the algorithm
Manuel Ceron