I was optimising some Python code, and tried the following experiment:

import time

start = time.clock()
x = 0
for i in range(10000000):
    x += 1
end = time.clock()

print '+=',end-start

start = time.clock()
x = 0
for i in range(10000000):
    x -= -1
end = time.clock()

print '-=',end-start

The second loop is reliably faster, anywhere from a whisker to 10%, depending on the system I run it on. I've tried varying the order of the loops, the number of executions, etc., and the effect persists.

Stranger still,

for i in range(10000000, 0, -1):

(i.e. running the loop backwards) is faster than

for i in range(10000000):

even when loop contents are identical.
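
For anyone who wants to reproduce just this part, here's a minimal way to time the two directions in isolation (Python 2.x; the loop bodies are empty, so only the iteration itself is measured):

import timeit

forward = timeit.Timer('for i in range(10000000): pass')
backward = timeit.Timer('for i in range(10000000, 0, -1): pass')

# One execution per repeat; min() takes the least-disturbed of three runs.
print 'forward ', min(forward.repeat(repeat=3, number=1))
print 'backward', min(backward.repeat(repeat=3, number=1))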

What gives, and is there a more general programming lesson here?

A: 

Have you tried reversing the order of the loops?

dtmunir
'I've tried varying the order of the loops'
Statto
As in running the subtraction test first?
too much php
+11  A: 
$ python -m timeit -s "x=0" "x+=1"
10000000 loops, best of 3: 0.151 usec per loop
$ python -m timeit -s "x=0" "x-=-1"
10000000 loops, best of 3: 0.154 usec per loop

Looks like you have some measurement bias.

pixelbeat
Very interesting article.
drhirsch
Surely the point of measurement bias is that your one counter-example doesn't prove it one way or the other. So I _may_ have some measurement bias!
Statto
Yours came out slightly slower though... ;)
Pod
Testing, like pixelbeat, using python -m timeit from the command line, with Python 2.5.2 on Windows Vista, I found no significant difference in either of the OP's comparisons. (The minuscule differences I got had addition faster, but I'm sure I would have seen the reverse too with multiple runs.)
Anon
I think in this case, addition is faster since it *has one less character to interpret*. I know it may seem useless, but that's my take on it.
LiraNuna
If the difference in his timings is related to processor cache size, using a more efficient timing system--one which intrinsically hits less native code--is going to eliminate the difference. (I don't have a decent instruction/cache profiler handy to compare, but that's what the OP really needs to compare.) Of course, you could also have a system with more L1 cache, a compiler that's making better guesses, etc.
Glenn Maynard
A: 

Running the loop backwards is faster because the computer has an easier time testing whether a number is equal to zero.

Nican
While this may be true in optimised assembly language, it is generally *not* true for high level languages such as Python, and in any case is unlikely to make as significant a difference as 10%.
Greg Hewgill
It's not doing any comparison on the x value here?
Pod
Neither loop is running backwards. They're both going upwards, one adding 1 and one subtracting -1.
Glenn Maynard
@Glenn Maynard: I think this answer is referring to the `range(10000000, 0, -1)` part.
Greg Hewgill
+5  A: 

I think the "general programming lesson" is that it is really hard to predict, solely by looking at the source code, which sequence of statements will be the fastest. Programmers at all levels frequently get caught up by this sort of "intuitive" optimisation. What you think you know may not necessarily be true.

There is simply no substitute for actually measuring your program performance. Kudos for doing so; answering why undoubtedly requires delving deep into the implementation of Python, in this case.

With byte-compiled languages such as Java, Python, and .NET, it is not even sufficient to measure performance on just one machine. Differences between VM versions, native code translation implementations, CPU-specific optimisations, and so on will make this sort of question ever more tricky to answer.
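
For instance, the standard library's timeit module can drive this kind of measurement from inside Python; a minimal sketch (the statement strings mirror the OP's loop bodies, and the minimum of several repeats is the least-disturbed figure):

import timeit

add = timeit.Timer('x += 1', setup='x = 0')
sub = timeit.Timer('x -= -1', setup='x = 0')

# Each repeat executes the statement ten million times.
print '+=', min(add.repeat(repeat=3, number=10000000))
print '-=', min(sub.repeat(repeat=3, number=10000000))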

Greg Hewgill
A: 

With Python 2.5 the biggest problem here is using range, which allocates a list that big just to iterate over it. When using xrange, whichever test runs second is a tiny bit faster for me. (I'm not sure whether range has become a generator in Python 3.)

UncleBens
I tested with xrange (in 2.6.2), and x -= -1 is still consistently faster for me than x += 1, regardless of which one is done first.
Glenn Maynard
+3  A: 

It's always a good idea when asking a question to say what platform and what version of Python you are using. Sometimes it doesn't matter. This is NOT one of those times:

  1. time.clock() is appropriate only on Windows. Throw away your own measuring code and use -m timeit as demonstrated in pixelbeat's answer.

  2. Python 2.X's range() builds a list. If you are using Python 2.x, replace range with xrange and see what happens (a sketch follows this list).

  3. Python 3.X's int is Python 2.X's long.
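
A minimal sketch of point 2, assuming Python 2.x: the only change to the OP's loop is swapping range for xrange, which avoids allocating a ten-million-element list before the loop even starts:

import time

start = time.clock()
x = 0
for i in xrange(10000000):   # xrange yields values lazily: no big list
    x += 1
print '+= with xrange', time.clock() - start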

John Machin
+26  A: 

I can reproduce this on my Q6600 (Python 2.6.2); increasing the range to 100000000:

('+=', 11.370000000000001)
('-=', 10.769999999999998)

First, some observations:

  • This is 5% for a trivial operation. That's significant.
  • The speed of the native addition and subtraction opcodes is irrelevant; it's in the noise floor, completely dwarfed by the bytecode evaluation. We're talking about one or two native instructions amid thousands.
  • The two loops compile to exactly the same number of bytecode instructions; the only difference is INPLACE_ADD vs. INPLACE_SUBTRACT and the constant +1 vs. -1 (see the disassembly sketch below).
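
A quick way to confirm this, assuming Python 2.x, is the dis module; the two disassemblies line up instruction for instruction, differing only in the opcode and the constant (the comments indicate the expected shape, not exact output):

import dis

def add(x):
    x += 1
    return x

def sub(x):
    x -= -1
    return x

dis.dis(add)   # ... LOAD_CONST (1)  INPLACE_ADD      STORE_FAST ...
dis.dis(sub)   # ... LOAD_CONST (-1) INPLACE_SUBTRACT STORE_FAST ...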

Looking at the Python source, I can make a guess. This is handled in ceval.c, in PyEval_EvalFrameEx. INPLACE_ADD has a significant extra block of code, to handle string concatenation. That block doesn't exist in INPLACE_SUBTRACT, since you can't subtract strings. That means INPLACE_ADD contains more native code. Depending (heavily!) on how the code is being generated by the compiler, this extra code may be inline with the rest of the INPLACE_ADD code, which means additions can hit the instruction cache harder than subtraction. This could be causing extra L2 cache hits, which could cause a significant performance difference.

This is heavily dependent on the system you're on (different processors have different amounts of cache and cache architectures), the compiler in use, including the particular version and compilation options (different compilers will decide differently which bits of code are on the critical path, which determines how assembly code is lumped together), and so on.

Also, the difference is reversed in Python 3.0.1 (+: 15.66, -: 16.71); no doubt this critical function has changed a lot.

Glenn Maynard
Quite enlightening. +1
Triptych
A: 

Your experiment is faulty. The way this experiment should be designed is to write two different programs, one for addition and one for subtraction. They should otherwise be exactly the same and run under the same conditions, with the data written to a file. Then you need to average the runs (at least several thousand), though you'd need a statistician to tell you an appropriate number.

If you wanted to analyze different methods of addition, subtraction, and looping, again each of those should be a separate program.

Experimental error might arise from processor heat and other activity going on in the CPU, so I'd execute the runs in a variety of patterns; a sketch of one such design follows.
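
A minimal sketch of one such pattern, assuming Python 2.x (the helper names are illustrative): interleave the two tests in a random order each round, so that warm-up, heat, and background load are spread across both:

import random
import time

def run_add():
    # Time one pass of the OP's addition loop.
    start = time.clock()
    x = 0
    for i in range(10000000):
        x += 1
    return time.clock() - start

def run_sub():
    # Time one pass of the OP's subtraction loop.
    start = time.clock()
    x = 0
    for i in range(10000000):
        x -= -1
    return time.clock() - start

tests = {'+=': run_add, '-=': run_sub}
totals = {'+=': 0.0, '-=': 0.0}
rounds = 20   # a statistician could suggest a better number

for _ in range(rounds):
    order = list(tests)
    random.shuffle(order)   # randomise which test runs first each round
    for name in order:
        totals[name] += tests[name]()

print '+=', totals['+='] / rounds
print '-=', totals['-='] / rounds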

mson
He should, but it's not the problem. I've tested it both ways, and I get the same results either way; the subtraction case is consistently faster for me in 2.6.
Glenn Maynard
+5  A: 

"The second loop is reliably faster ..."

That's your explanation right there. Re-order your script so the subtraction test is timed first, then the addition, and suddenly addition becomes the faster operation again:

-= 3.05
+= 2.84

Obviously something happens to the second half of the script that makes it faster. My guess is that the first call to range() is slower because Python needs to allocate enough memory for such a long list, but it is able to re-use that memory for the second call to range():

import time
start = time.clock()
x = range(10000000)
end = time.clock()
del x
print 'first range()',end-start
start = time.clock()
x = range(10000000)
end = time.clock()
print 'second range()',end-start

A few runs of this script show that the extra time needed for the first range() accounts for nearly all of the time difference between '+=' and '-=' seen above:

first range() 0.4
second range() 0.23
too much php
Dupe of dtmunir's answer; OP already said that he gets the same results regardless of the order. I've reproduced this, as well as when the two tests are run independently, and when xrange is used instead of range.
Glenn Maynard
@Glenn: I interpreted 'tried varying the order of the loops' as 'used range(10000000, 0, -1) instead of range(10000000)' because he immediately goes into more detail about using reversed range(). He doesn't specifically say that he tried to put the subtraction test first. dtmunir's answer carries the same ambiguity which is probably why it was downvoted.
too much php
I don't 'immediately' go into detail about using reversed range! I did try putting the subtraction loop first, and also running both tests several times in a variety of orders.
Statto
A: 

That would be remarkable, so I have thoroughly evaluated your code and also set up the experiment the way I consider more correct (all declarations and function calls outside the loop). I ran both versions five times.

  • Running your code validated your claims: -= consistently takes less time, 3.6% on average.
  • Running my code, though, contradicts the outcome of your experiment: += takes on average (though not always) 0.5% less time.

To show all results I have put plots online:

So, I conclude that your experiment has a bias, and it is significant.

Finally here is my code:

import time

addtimes = [0.] * 100
subtracttimes = [0.] * 100

# Pre-build the ranges so list allocation is not part of the timing.
range100 = range(100)
range10000000 = range(10000000)

# Pre-declare everything used inside the timed loops.
j = 0
i = 0
x = 0
start = 0.

for j in range100:
    start = time.clock()
    x = 0
    for i in range10000000:
        x += 1
    addtimes[j] = time.clock() - start

for j in range100:
    start = time.clock()
    x = 0
    for i in range10000000:
        x -= -1
    subtracttimes[j] = time.clock() - start

print '+=', sum(addtimes)
print '-=', sum(subtracttimes)
Paul