ansaurus

Question

Python performance characteristics

Answer 1

+4 A:

Here's what's interesting.

Data Structure
Algorithm

Those will yield dramatic improvements.

Your list is good for -- at best -- a few single-digit performance improvements.

You need to fundamentally rethink your data structures if you want to see real speed improvements.

S.Lott 2009-12-16 11:20:20

I'm aware that algorithmic improvements are often a greater contributor, but in my current case there isn't much scope for that. Given that the existing algorithm in C would be several orders of magnitude faster, I'm inclined to believe there is considerable scope for performance from tuning.My case, of course, is probably not typical. The 'slow bits' are concentrated in two routines and there's not much that can be done algorithmically to reduce the use of either of them. I need to make them faster.

Adam Luchjenbroers 2009-12-16 11:34:56

@Adam Luchjenbroers: "there's not much that can be done algorithmically" This is often false. Indeed, I've never seen the case yet where this was true, but I've only been doing this kind of thing for 30 years. Yours may be the first. If there actually is nothing to be done algorithmically, you will spend a lot of your time for few real benefits. All the things you've listed are minor tweaks. 2% kind of things. And they don't add up to much even when you address every single one of them.

S.Lott 2009-12-16 14:04:03

The pet project in this case is a ray-tracer (or, at the time of writing, the very humble beginnings of a ray-tracer). It's a bit of a pathological case. What its doing is very very simple, and yet its taking alot longer than I believe it should.

Adam Luchjenbroers 2009-12-17 11:36:54

@Adam Luchjenbroers: Step 1 is to prove that your algorithm does what you think it should do and is actually optimal. In all cases, folks misuse `reduce` or do seomthing like write some function that they forget is O( n**2 ) and use it in an O( n**2 ) loop. *Combinatoric explosion* is the most common problem. And optimization doesn't reduce the number of combinations.

S.Lott 2009-12-17 12:47:30

Answer 2

+5 A:

S.Lott is right: the big effects are data structures and algorithms. Also, if you are doing a lot of I/O, how you manage it will make a big difference.

But if you are curious about the compiler internals: it will fold constants, but it will not inline functions or unroll loops. Inlining functions is a hard problem in a dynamic language.

You can see what the compiler does by disassembling some compiled code. Put some sample code in my_file.py, then use:

python -m dis my_file.py

This source:

def foo():
    return "BAR!"

for i in [1,2,3]:
    print i, foo()

produces:

  1           0 LOAD_CONST               0 (<code object foo at 01A0B380, file "\foo\bar.py", line 1>)
              3 MAKE_FUNCTION            0
              6 STORE_NAME               0 (foo)

  4           9 SETUP_LOOP              35 (to 47)
             12 LOAD_CONST               1 (1)
             15 LOAD_CONST               2 (2)
             18 LOAD_CONST               3 (3)
             21 BUILD_LIST               3
             24 GET_ITER
        >>   25 FOR_ITER                18 (to 46)
             28 STORE_NAME               1 (i)

  5          31 LOAD_NAME                1 (i)
             34 PRINT_ITEM
             35 LOAD_NAME                0 (foo)
             38 CALL_FUNCTION            0
             41 PRINT_ITEM
             42 PRINT_NEWLINE
             43 JUMP_ABSOLUTE           25
        >>   46 POP_BLOCK
        >>   47 LOAD_CONST               4 (None)
             50 RETURN_VALUE

Notice that only the top-level code in the module is disassembled, you need to write a little more code yourself to recurse through the nested code objects if you want to see the function definitions disassembled as well.

Ned Batchelder 2009-12-16 11:40:45

Answer 3

+3 A:

If you already know that your algorithm is as fast as possible, and you know that C would be much faster, then you may want to implement the core of your code in C as a C extension to Python. You can pragmatically decide which part of the code goes in C and which is in Python, using each language to its full potential.

Unlike some other languages, calling between C and Python is very fast, so there's no penalty for crossing the border often.

Ned Batchelder 2009-12-16 11:44:51

It's an option I'm considering. I'm also considering NumPy, which covers alot of the mathematical functionality I'm looking for.I'm mainly looking for knowledge that will enable me to reason better about performance in my code.

Adam Luchjenbroers 2009-12-16 11:48:53

Answer 4

+5 A:

The speed of your code might be automatically improved by using the Psyco module.

As for Numpy, it generally speeds things up by a significant factor. I consider it a must when manipulating numerical arrays.

You might also want to speed up the critical parts of your code with Cython or Pyrex, which allow you to create faster extension modules without having to write a full-fledged extension module in C (which would be more cumbersome).

EOL 2009-12-16 11:50:37

Answer 5

+11 A:

Python's compiler is deliberately dirt-simple -- this makes it fast and highly predictable. Apart from some constant folding, it basically generates bytecode that faithfully mimics your sources. Somebody else already suggested dis, and it's indeed a good way to look at the bytecode you're getting -- for example, how for i in [1, 2, 3]: isn't actually doing constant folding but generating the literal list on the fly, while for i in (1, 2, 3): (looping on a literal tuple instead of a literal list) is able to constant-fold (reason: a list is a mutable object, and to keep to the "dirt-simple" mission statement the compiler doesn't bother to check that this specific list is never modified so it could be optimized into a tuple).

So there's space for ample manual micro-optimization -- hoisting, in particular. I.e., rewrite

for x in whatever():
    anobj.amethod(x)

as

f = anobj.amethod
for x in whatever():
    f(x)

to save the repeated lookups (the compiler doesn't check whether a run of anobj.amethod can actually change anobj's bindings &c so that a fresh lookup is needed next time -- it just does the dirt-simple thing, i.e., no hoisting, which guarantees correctness but definitely doesn't guarantee blazing speed;-).

The timeit module (best used at a shell prompt IMHO) makes it very simple to measure the overall effects of compilation + bytecode interpretation (just ensure the snippet you're measuring has no side effects that would affect the timing, since timeit does run it over and over in a loop;-). For example:

$ python -mtimeit 'for x in (1, 2, 3): pass'
1000000 loops, best of 3: 0.219 usec per loop
$ python -mtimeit 'for x in [1, 2, 3]: pass'
1000000 loops, best of 3: 0.512 usec per loop

you can see the costs of the repeated list construction -- and confirm that is indeed what we're observing by trying a minor tweak:

$ python -mtimeit -s'Xs=[1,2,3]' 'for x in Xs: pass'
1000000 loops, best of 3: 0.236 usec per loop
$ python -mtimeit -s'Xs=(1,2,3)' 'for x in Xs: pass'
1000000 loops, best of 3: 0.213 usec per loop

moving the iterable's construction to the -s setup (which is run only once and not timed) shows that the looping proper is slightly faster on tuples (maybe 10%), but the big issue with the first pair (list slower than tuple by over 100%) is mostly with the construction.

Armed with timeit and the knowledge that the compiler's deliberately very simple minded in its optimizations, we can easily answer other questions of yours:

How fast are the following operations (comparatively)
* Function calls
* Class instantiation
* Arithmetic
* 'Heavier' math operations such as sqrt()

$ python -mtimeit -s'def f(): pass' 'f()'
10000000 loops, best of 3: 0.192 usec per loop
$ python -mtimeit -s'class o: pass' 'o()'
1000000 loops, best of 3: 0.315 usec per loop
$ python -mtimeit -s'class n(object): pass' 'n()'
10000000 loops, best of 3: 0.18 usec per loop

so we see: instantiating a new-style class and calling a function (both empty) are about the same speed, with instantiations possibly having a tiny speed margin, maybe 5%; instantiating an old-style class is slowest (by about 50%). Tiny differences such as 5% or less of course could be noise, so repeating each try a few times is advisable; but differences like 50% are definitely well beyond noise.

$ python -mtimeit -s'from math import sqrt' 'sqrt(1.2)'
1000000 loops, best of 3: 0.22 usec per loop
$ python -mtimeit '1.2**0.5'
10000000 loops, best of 3: 0.0363 usec per loop
$ python -mtimeit '1.2*0.5'
10000000 loops, best of 3: 0.0407 usec per loop

and here we see: calling sqrt is slower than doing the same computation by operator (using the ** raise-to-power operator) by roughly the cost of calling an empty function; all arithmetic operators are roughly the same speed to within noise (the tiny difference of 3 or 4 nanoseconds is definitely noise;-). Checking whether constant folding might interfere:

$ python -mtimeit '1.2*0.5'
10000000 loops, best of 3: 0.0407 usec per loop
$ python -mtimeit -s'a=1.2; b=0.5' 'a*b'
10000000 loops, best of 3: 0.0965 usec per loop
$ python -mtimeit -s'a=1.2; b=0.5' 'a*0.5'
10000000 loops, best of 3: 0.0957 usec per loop
$ python -mtimeit -s'a=1.2; b=0.5' '1.2*b'
10000000 loops, best of 3: 0.0932 usec per loop

...we see that this is indeed the case: if either or both numbers are being looked up as variables (which blocks constant folding), we're paying the "realistic" cost. Variable lookup has its own cost:

$ python -mtimeit -s'a=1.2; b=0.5' 'a'
10000000 loops, best of 3: 0.039 usec per loop

and that's far from negligible when we're trying to measure such tiny times anyway. Indeed constant lookup isn't free either:

$ python -mtimeit -s'a=1.2; b=0.5' '1.2'
10000000 loops, best of 3: 0.0225 usec per loop

as you see, while smaller than variable lookup it's quite comparable -- about half.

If and when (armed with careful profiling and measurement) you decide some nucleus of your computations desperately need optimization, I recommend trying cython -- it's a C / Python merge which tries to be as neat as Python and as fast as C, and while it can't get there 100% it surely makes a good fist of it (in particular, it makes binary code that's quite a bit faster than you can get with its predecessor language, pyrex, as well as being a bit richer than it). For the last few %'s of performance you probably still want to go down to C (or assembly / machine code in some exceptional cases), but that would be really, really rare.

Alex Martelli 2009-12-16 16:40:18

Answer 6

+4 A:

Hi, I'm the author of Arauna. I know nothing about Python, but I do know that Arauna is extremely optimized, both high level (data structures & algorithms) and low-level (cache friendly code, SIMD, multithreading). It's a hard target to go for...

Jacco Bikker 2009-12-29 23:30:39

I'm not trying to reach it as a target, I'm just using it to explain why I think there's alot of room for improvement in my code. 25% of its performance would be more than acceptable to me.

Adam Luchjenbroers 2009-12-30 00:43:03

ansaurus

tags:

views:

answers:

Python performance characteristics

How smart is its optimizer?

How fast are the following operations (comparatively)

How are numbers handled internally?

NumPy

Anything else interesting

Some background...

related questions