ansaurus

Question

Python performance: iteration and operations on nested lists

Answer 1

A:

You can create your own Python module in C, and control the performance as you want: http://docs.python.org/extending/

mkotechno 2010-03-21 22:07:19

Yep, but this has the same distribution challenges as third-party modules. I'd much prefer to stay all python.

J.J. 2010-03-21 22:28:02

Answer 2

+2 A:

1. A (smaller) speedup could definitely be the initialization of your rows...

Replace

rows = []
for i in range(x):
    rows.append([0 for i in xrange(y)])

with

rows = [[0] * y for i in xrange(x)]

2. You can also avoid some lookups by moving random.random out of the loops (saves a little).

3. EDIT: after corrections -- you could arrive at something like this:

import random

def f(x,y,n,z):
    rows = [[0] * y for i in xrange(x)]
    rn = random.random
    for i in xrange(n):
        topleft = (int(x*rn()) - z, int(y*rn()) - z)
        l = max(0, topleft[1])
        r = min(topleft[1]+(z*2), y)
        for u in xrange(max(0, topleft[0]), min(topleft[0]+(z*2), x)):
            rows[u][l:r] = [j+(j<255) for j in rows[u][l:r]]
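A minimal sketch of point 1, written for Python 3 (so `range` stands in for `xrange`); the helper name `make_rows` is hypothetical. It illustrates why `[0] * y` is safe for the inner rows while multiplying the outer list would not be: the comprehension builds a fresh row on every pass, whereas `[[0] * y] * x` repeats a single row object.

```python
def make_rows(x, y):
    # Build x independent rows of y zeros; [0] * y is safe because ints are immutable.
    return [[0] * y for _ in range(x)]

rows = make_rows(3, 4)
rows[0][0] = 1            # only the first row changes

aliased = [[0] * 4] * 3   # three references to the same row object
aliased[0][0] = 1         # the change shows up in every row
```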

EDIT: some new timings with timeit (10 runs) -- seems this provides only minor speedups:

import timeit
print timeit.Timer("f1(1024,1024,400,75)", "from __main__ import f1").timeit(10)
print timeit.Timer("f2(1024,1024,400,75)", "from __main__ import f2").timeit(10)
print timeit.Timer("f(1024,1024,400,75)", "from __main__ import f").timeit(10)

f1 21.1669280529
f2 12.9376120567
f  11.1249599457
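For readers on Python 3, a hedged sketch of the same kind of measurement. It assumes a Python 3 port of `f` from point 3 (with `range` instead of `xrange` and an added `return rows`), passes a callable to `timeit.timeit` instead of a setup string, and uses a much smaller grid so it finishes quickly. The numbers it prints are not comparable to the timings above.

```python
import random
import timeit

def f(x, y, n, z):
    # Python 3 port of the function above; returning rows is an addition.
    rows = [[0] * y for _ in range(x)]
    rn = random.random
    for _ in range(n):
        top, left = int(x * rn()) - z, int(y * rn()) - z
        l = max(0, left)
        r = min(left + z * 2, y)
        for u in range(max(0, top), min(top + z * 2, x)):
            rows[u][l:r] = [j + (j < 255) for j in rows[u][l:r]]
    return rows

# Total wall-clock time for 3 calls on a small 64x64 grid.
elapsed = timeit.timeit(lambda: f(64, 64, 20, 5), number=3)
print("f %.4f s" % elapsed)
```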

ChristopheD 2010-03-21 22:33:20

Chris, absolutely. it was sloppy of me to ever do anything else. updated, saved about 0.1s across the board. post updated.

J.J. 2010-03-21 22:40:05

Ah, yessir. Unnecessary slicing. I'm a punk. 0.89s on my box with the test case parameters.

J.J. 2010-03-21 22:49:53

This `for item in rows[i][max(0, topleft[1]):min(topleft[1]+(z*2), y)]:` does not do what you want. It copies that area of the list, modifies some elements, but does nothing to the original `rows` list.

Wallacoloo 2010-03-21 23:08:40
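A small Python 3 sketch of the pitfall described in the comment above, using a made-up 2x4 grid. The slice is a new list and `item += 1` only rebinds the loop variable, so nothing is written back; slice assignment is one way to store the new values in the original row.

```python
rows = [[0, 0, 0, 0] for _ in range(2)]

# Looping over a slice: the slice is a copy, and item += 1 rebinds a local name.
for item in rows[0][1:3]:
    item += 1

# Slice assignment writes the incremented values back into the original row.
rows[1][1:3] = [item + 1 for item in rows[1][1:3]]
```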

@wallacoloo: very true (obviously), I'll update my post in a minute.

ChristopheD 2010-03-21 23:15:56

`j+(0,1)[j<255]` can be simplified to `j+(j<255)`

Wallacoloo 2010-03-22 00:15:25

@wallacoloo: good point, thanks (it's a tiny bit faster also)

ChristopheD 2010-03-22 00:19:55

They're both clever. I didn't know about the `j+(0,1)[j<255]` syntax -- and your simplification is almost *evil*.

J.J. 2010-03-22 00:23:55

It's code you won't often see in production (you'll see it in code golf ;-). It's useful in this case, though: it keeps every value from the slice inside the list comprehension and increments the values at the same time. With the alternative, an if clause at the end, you'd silently drop the values that fail the `j<255` test and end up with a shorter slice.

ChristopheD 2010-03-22 00:30:32
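To make the difference concrete, here is a short Python 3 sketch with an invented five-element row. It compares the filtering comprehension with the `j+(j<255)` form and shows what each does to the row when assigned back to the slice.

```python
row = [0, 254, 255, 255, 7]

# Filtering with an if clause drops the saturated cells, so the result is shorter.
filtered = [j + 1 for j in row[1:4] if j < 255]

# Adding the boolean keeps one output per input: True counts as 1, False as 0.
saturating = [j + (j < 255) for j in row[1:4]]

shrunk = list(row)
shrunk[1:4] = filtered     # three cells replaced by one, so the row shrinks

kept = list(row)
kept[1:4] = saturating     # same length in, same length out
```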

As for point #1, this code is faster by 0.2 *milliseconds* on my computer: `tc = [0]*y; rows = [tc[:] for i in xrange(x-1)]; rows.append(tc)`. Yay micro-optimizations!

Wallacoloo 2010-03-22 03:17:37

Answer 3

+1 A:

In your f3 rewrite, g can be simplified. (The same change can be applied to f4.)

You have the following code inside a for loop.

l = max(0, topleft[1])
r = min(topleft[1]+(75*2), 1024)

However, it appears that those values never change inside the for loop. So calculate them once, outside the loop instead.

Wallacoloo 2010-03-21 22:54:18
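A sketch of the hoisting described above, in Python 3. The function names `bump_inner` and `bump_hoisted` and their parameters are hypothetical; the original `g` is not shown in this thread. Both versions produce the same grid, but the second computes the column bounds once per point instead of once per row.

```python
def bump_inner(rows, top, left, z):
    # Column bounds recomputed on every row, although they never change.
    x, y = len(rows), len(rows[0])
    for u in range(max(0, top), min(top + z * 2, x)):
        l = max(0, left)
        r = min(left + z * 2, y)
        rows[u][l:r] = [j + (j < 255) for j in rows[u][l:r]]

def bump_hoisted(rows, top, left, z):
    # Same work, with the loop-invariant bounds computed once up front.
    x, y = len(rows), len(rows[0])
    l = max(0, left)
    r = min(left + z * 2, y)
    for u in range(max(0, top), min(top + z * 2, x)):
        rows[u][l:r] = [j + (j < 255) for j in rows[u][l:r]]

a = [[0] * 6 for _ in range(6)]
b = [[0] * 6 for _ in range(6)]
bump_inner(a, -1, 2, 2)
bump_hoisted(b, -1, 2, 2)
```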

@walla, of course. thanks. shaved off 0.1s.

J.J. 2010-03-21 23:08:15

Answer 4

+1 A:

Based on your f3 version, I played with the code. As l and r are constant for a given point, you can avoid computing them inside the loop in g1. Using the new conditional expression (`a if cond else b`) instead of min and max also seems to be consistently faster. I also simplified the expressions involving topleft. On my system the code below appears to be about 20% faster.

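import random

def f3b(x,y,n,z):
    rows = [[0] * y for i in xrange(x)]
    for i in xrange(n):
        # pick a random centre point, then increment the square around it
        g1(rows, int(x*random.random()), int(y*random.random()), z)
    return rows

def g1(rows, x, y, z):
    # l and r do not depend on the row, so compute them once per point
    # (1024 is the grid size from the test case)
    l = y - z if y - z > 0 else 0
    r = y + z if y + z < 1024 else 1024
    for i in xrange(x - z if x - z > 0 else 0, x + z if x + z < 1024 else 1024):
        rows[i][l:r] = [j+(j<255) for j in rows[i][l:r]]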

kriss 2010-03-21 23:24:53
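As a sanity check on the answer above, this Python 3 sketch compares the min/max clamping with the conditional-expression form. The helper names `clamp_minmax` and `clamp_ternary` are hypothetical. It only checks that both give the same bounds; the speed difference is the answer's claim and would need its own `timeit` run to confirm on a given machine.

```python
def clamp_minmax(center, z, size):
    # Bounds of the square around center, clipped to [0, size].
    return max(0, center - z), min(center + z, size)

def clamp_ternary(center, z, size):
    # Same bounds written with conditional expressions instead of calls.
    lo = center - z if center - z > 0 else 0
    hi = center + z if center + z < size else size
    return lo, hi

# Compare both forms for every centre on a 1024-wide axis with z = 75.
mismatches = [c for c in range(1024)
              if clamp_minmax(c, 75, 1024) != clamp_ternary(c, 75, 1024)]
```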

kriss, well done! my head is still stuck in python from 2002; I can't manage to think in terms of the new constructs available. 1.7s on my box with the test case parameters.

J.J. 2010-03-21 23:30:49

What if for example half of the values in `rows[i][l:r]` are > 255: you'll be assigning the wrong slice...

ChristopheD 2010-03-22 00:08:10

@ChristopheD: You are right. I just started working on f3 without checking the initial version. (O, my god, again a reminder of why I so love unit tests).

kriss 2010-03-22 01:07:12

Answer 5

+2 A:

On my (slow-ish;-) first-day Macbook Air, 1.6GHz Core 2 Duo, system Python 2.5 on MacOSX 10.5, after saving your code in op.py I see the following timings:

$ python -mtimeit -s'import op' 'op.f1()'
10 loops, best of 3: 5.58 sec per loop
$ python -mtimeit -s'import op' 'op.f2()'
10 loops, best of 3: 3.15 sec per loop

So, my machine is slower than yours by a factor of a bit more than 1.9.

The fastest code I have for this task is:

def f3(x=x,y=y,n=n,z=z):
    rows = [[0]*y for i in range(x)]
    rr = random.randrange
    inc = (1).__add__
    sat = (0xff).__and__

    for i in range(n):
        inputX, inputY = rr(x), rr(y)
        b = max(0, inputX - z)
        t = min(inputX + z, x)
        l = max(0, inputY - z)
        r = min(inputY + z, y)
        for i in range(b, t):
            rows[i][l:r] = map(inc, rows[i][l:r])
    for i in range(x):
        rows[i] = map(sat, rows[i])

which times as:

$ python -mtimeit -s'import op' 'op.f3()'
10 loops, best of 3: 3 sec per loop

so, a very modest speedup, projecting to more than 1.5 seconds on your machine - well above the 1.0 you're aiming for:-(.

With a simple C-coded extension, exte.c:

#include "Python.h"

static PyObject*
dopoint(PyObject* self, PyObject* args)
{
    int x, y, z, px, py;
    int b, t, l, r;
    int i, j;
    PyObject* rows;

    if(!PyArg_ParseTuple(args, "iiiiiO",
                         &x, &y, &z, &px, &py, &rows
        ))
        return 0;

    b = px - z;
    if (b < 0) b = 0;
    t = px + z;
    if (t > x) t = x;
    l = py - z;
    if (l < 0) l = 0;
    r = py + z;
    if (r > y) r = y;

    for(i = b; i < t; ++i) {
        PyObject* row = PyList_GetItem(rows, i);
        for(j = l; j < r; ++j) {
            PyObject* pyitem = PyList_GetItem(row, j);
            long item = PyInt_AsLong(pyitem);
            if (item < 255) {
                PyObject* newitem = PyInt_FromLong(item + 1);
                PyList_SetItem(row, j, newitem);
            }
        }
    }

    Py_RETURN_NONE;
}

static PyMethodDef exteMethods[] = {
    {"dopoint", dopoint, METH_VARARGS, "process a point"},
    {0}
};

void
initexte()
{
    Py_InitModule("exte", exteMethods);
}

(note: I haven't checked it carefully -- I think it doesn't leak memory due to the correct interplay of reference stealing and borrowing, but it should be code inspected very carefully before being put in production;-), we could do

import exte
def f4(x=x,y=y,n=n,z=z):
    rows = [[0]*y for i in range(x)]
    rr = random.randrange

    for i in range(n):
        inputX, inputY = rr(x), rr(y)
        exte.dopoint(x, y, z, inputX, inputY, rows)

and the timing

$ python -mtimeit -s'import op' 'op.f4()'
10 loops, best of 3: 345 msec per loop

shows an acceleration of 8-9 times, which should put you in the ballpark you desire. I've seen a comment saying you don't want any third-party extension, but, well, this tiny extension you could make entirely your own;-). ((Not sure what licensing conditions apply to code on Stack Overflow, but I'll be glad to re-release this under the Apache 2 license or the like, if you need that;-)).

Alex Martelli 2010-03-22 00:32:58

Alex, your python-only implementation clocks in at 0.979s on my machine. Nicely done, sir. The two `map()` operations, combined with moving *everything* to locals were the clever insights I couldn't see. // My concern with the c extension (and the various third-party libraries that will speed up the calculations) is the complications with the distribution. I'm not certain the cross-platform/compilation concerns are worth the speedup in the long run.

J.J. 2010-03-22 00:46:26

@J.J, wow, 3+ times faster than my Macbook Air, you sure do have a neat machine. Yes, I split the two `map` calls so I wouldn't have to worry about "saturating to 255" multiple times, just once and for all in the second `map` (but do note it's _not_ perfect: it "wraps around" to the low 8 bits instead -- for your specific purpose you need a different `sat` which I suspect may not be quite as fast). I do understand the deployment problems with having to compile stuff...

Alex Martelli 2010-03-22 01:36:26
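A short Python 3 sketch of the caveat in the comment above, with invented sample values. Masking with `0xff` keeps the low 8 bits, so counts above 255 wrap around rather than sticking at 255. Clamping with `min` is one possible replacement for `sat`, assumed here for illustration and not timed. In Python 3, `map` returns an iterator, hence the `list(...)` call.

```python
inc = (1).__add__       # bound method: inc(v) == 1 + v
sat = (0xff).__and__    # bound method: sat(v) == 0xff & v

counts = [254, 255, 256, 300]

masked = list(map(sat, counts))            # keeps only the low 8 bits
clamped = [min(v, 255) for v in counts]    # true saturation at 255
```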

Nice trick with inc and sat!

Wallacoloo 2010-03-22 03:19:08
