views: 456

answers: 5

I was curious about the performance of insertion sort in C and Python, but the results I got make me wonder if I've done something wrong. I suspected that C would be faster, but not by that much.

I've profiled both programs, and the insertion-sort function is where most of the time is spent.

Here is the C function:

void
insert_sort (vec_t * vec)
{
    int j;
    for (j = 1 ; j < vec->n ; j++){
        int key = vec->v[j];
        int i = j - 1;
        while (i >= 0 && vec->v[i] > key){
            vec->v[i+1] = vec->v[i];
            i--;
        }
        vec->v[i+1] = key;
    }
}

Here is the python function:

def insert_sort (ln):
    for j in range(1, len(ln)):
        key = ln[j]
        i = j-1
        while i >= 0 and ln[i] > key:
            ln[i+1] = ln[i]
            i-=1
        ln[i+1] = key

The test was made with 10000 integers, each one randomly generated between 0 and 10000.

The results for the time spent in each function were:

  • C time: 0.13 seconds
  • python time: 8.104 seconds

Am I doing something wrong here? Like I said, I expected the C code to be faster, but not that much faster.

I don't want to use built-in functions or anything like that. I would like to implement the algorithm. Is there a Pythonic way of doing things that I could use in the insertion sort?

A: 

What's wrong with:

ln.sort()
alex tingle
That uses a tim-sort. The goal is obviously to implement/learn about the algorithm, not to use the result.
Jeff Ober
"The goal is obviously to implement/learn about the algorithm, not to use the result." – Jeff Ober: Exactly.
didi
It's not obvious at all. Python is a framework built around lots of native-coded functions. Trying to re-implement basic stuff like this in Python, and then comparing it with the C, is ridiculous.
alex tingle
He specifically stated he didn't want to use built-in functions. I wasn't going to vote you down, but -1 for calling his experiment ridiculous.
rlbond
@rlbond - I disagree. Trying to compare inefficient algorithms in C to inefficient algorithms in Python (or any other dynamic language) is going to yield results like this. That's the nature of dynamic languages.
Chris Lutz
"I don't want to use built-in functions or whatsoever. I would like to implement the algorithm."
Peter Recore
@Peter - It's commendable that he wants to implement (and has implemented) the algorithm, but he's still comparing apples to oranges and saying "Why are they different?"
Chris Lutz
He's comparing lines of c code with lines of essentially equivalent python code and finding the c code ~62x faster. Is that apples to oranges? Or is the better comparison: I want to do X. How do the best implementations of X in python and c compare?
foosion
@foosion - C = apples, Python = oranges. You can compare performance in the two languages based on "functionally identical" code, but the code isn't really identical, because the C variables are statically typed and stack allocated while the Python variables are dynamically typed and carry a lot of excess baggage and have to be allocated on the heap and have to get garbage collected and... In the end, the Python code is very different from the C code, even though the _source_ code doesn't look that way. And that's the _entire point_ of Python. The OP is basically saying "Why is this code...
Chris Lutz
...written in Python, a language that does a lot of under-the-hood work to make my life easier, slower than this code written in C, a language that does almost no work for you and makes you spell everything out quite explicitly?" And the answer should be quite obvious.
Chris Lutz
+12  A: 

Python is a dynamic language and the standard implementation uses an interpreter to evaluate code. This means that where the compiled C code can get away with a single machine instruction, for instance when assigning to vec->v[i+1], Python's interpreter has to look up the sequence variable in the local scope, look up its class, find the item-setting method on that class, and call that method -- and similarly for the comparison and the addition. Not to mention that executing almost every bytecode results in an indirect branch misprediction in the CPU, causing a pipeline bubble.
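You can see some of this with the dis module: even the single inner-loop assignment from the question expands into a handful of bytecodes, each of which goes through the interpreter's dispatch loop (the exact opcodes vary between Python versions). A minimal sketch:

import dis

def shift(ln, i):
    # the single statement from the inner loop of the insertion sort
    ln[i + 1] = ln[i]

# prints the bytecode: loads of ln and i, an addition, an indexed load
# and an indexed store, each dispatched and type-checked at run time
dis.dis(shift)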

This is the sort of code that would benefit a lot from JIT compilation to native code and runtime type specialization, like unladen-swallow and PyPy are starting to do.

Otherwise the code is pretty much Pythonic, in the sense that if one needs to implement insertion sort, this is how one would do it in Python. It's also very unpythonic, in that the right thing to do is to use the very efficient built-in sort.

Ants Aasma
+1  A: 

What method did you use to measure the time?
Doing this sort of thing, I find Python is at least 30 times slower than C.
The C compiler may be able to use some optimisations that Python doesn't even attempt.

It might be interesting to try psyco; that type of code is well suited to it.

Building on Alex's answer, I tried Cython. In this case Cython turns the for loop and everything in it into pure C, so now I can compare C, Python and psyco.

Now I have this inso.py:


import psyco
import random
li = [random.randrange(10000) for _ in xrange(10000)]

def insort (ln):
    for j in range(1, len(ln)):
        key = ln[j]
        i = j-1
        while i >= 0 and ln[i] > key:
            ln[i+1] = ln[i]
            i-=1
        ln[i+1] = key

# psyco.bind(insort)   # uncomment this line for the psyco timings below

import pyximport; pyximport.install()
import pyxinsort

def pyx_setup():
    pyxinsort.setup(li)

def pyx_insort():
    pyxinsort.insort(li)

and this pyxinsort.pyx:

cdef int ln[10000]

def insort(li):
    cdef int i,j,key
    for j in range(1, len(li)):
        key = ln[j]
        i = j-1
        while i >= 0 and ln[i] > key:
            ln[i+1] = ln[i]
            i-=1
        ln[i+1] = key

def setup(li):
    # copy the whole Python list into the C array
    cdef int i
    for i in range(len(li)):
        ln[i] = li[i]

The code for insort is virtually identical to the Python version. li is passed in only for its length; ln is the C array that actually gets sorted, prepopulated by setup, so I can isolate building the list from the sort itself.

python

$ python2.5 -mtimeit -s'import inso' 'list(inso.li)'
10000 loops, best of 3: 84.5 usec per loop
$ python2.5 -mtimeit -s'import inso' 'inso.insort(list(inso.li))'
10 loops, best of 3: 21.9 sec per loop

psyco

$ python2.5 -mtimeit -s'import inso' 'list(inso.li)'
10000 loops, best of 3: 85.6 usec per loop
$ python2.5 -mtimeit -s'import inso' 'inso.insort(list(inso.li))'
10 loops, best of 3: 578 msec per loop

cython (this is running exactly the same algorithm converted to C and compiled)

$ python2.5 -mtimeit -s'import inso' 'inso.pyx_setup()'
10000 loops, best of 3: 141 usec per loop
$ python2.5 -mtimeit -s'import inso' 'inso.pyx_setup();inso.pyx_insort()'
10 loops, best of 3: 46.6 msec per loop

cython beats psyco by a factor of 16 and Python by a factor of 470!

For completeness, I've included the corresponding piece of C code generated by Cython:


  for (__pyx_v_j = 1; __pyx_v_j < __pyx_1; __pyx_v_j+=1) {
    __pyx_v_key = (__pyx_v_9pyxinsort_ln[__pyx_v_j]);
    __pyx_v_i = (__pyx_v_j - 1);
    while (1) {
      __pyx_2 = (__pyx_v_i >= 0);
      if (__pyx_2) {
        __pyx_2 = ((__pyx_v_9pyxinsort_ln[__pyx_v_i]) > __pyx_v_key);
      }
      if (!__pyx_2) break;
      (__pyx_v_9pyxinsort_ln[(__pyx_v_i + 1)]) = (__pyx_v_9pyxinsort_ln[__pyx_v_i]);
      __pyx_v_i -= 1;
    }
    (__pyx_v_9pyxinsort_ln[(__pyx_v_i + 1)]) = __pyx_v_key;
  }
gnibbler
+4  A: 

So, here are some lessons you should take away from this:

  • Interpreted Python is on the slow side. Don't try to write your own FFT, MPEG encoder, etc. in Python.

  • Even slow interpreted Python is probably fast enough for small problems. An 8 second run time is not horrible, and it would take you much longer to write and debug the C than the Python, so if you are writing something to run once, Python wins.

  • For speed in Python, try to rely on built-in features and C modules. Let someone else's C code do the heavy lifting. I worked on an embedded device where we did our work in Python; despite the slow embedded processor, the performance was decent, because C library modules were doing most of the work.
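For example, here is a sketch of an insertion sort that leans on the standard-library bisect module (C-accelerated in CPython): the binary search and the element shifting both happen in C, and only the outer loop stays in Python.

import bisect
import random

def insert_sort_bisect(ln):
    # insertion sort where the insertion-point search (bisect) and the
    # element shifting (list.insert) are both done by C code in CPython
    out = []
    for x in ln:
        bisect.insort(out, x)
    return out

li = [random.randrange(10000) for _ in range(10000)]
assert insert_sort_bisect(li) == sorted(li)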

For fun and education, please repeat your Python test, this time using the built-in .sort() method on a list; it probably won't be quite as fast as the C, but it will be close. (Although for really large data sets, it will beat the C, because insertion sort sucks. If you rewrote the C to use the C library qsort() function, that would be the speed champ.)
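Something along these lines (assuming a 10,000-element random list like the one in the question) will do for that timing:

import random
import timeit

li = [random.randrange(10000) for _ in range(10000)]

# sort a fresh copy on every run, so we never time an already-sorted list
t = timeit.timeit(lambda: list(li).sort(), number=100) / 100
print("built-in sort: %f seconds per run" % t)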

A common Python design "pattern" is: first, write your app in Python. If it is fast enough, stop; you are done. Second, try to rewrite to improve speed; see if there is a C module you can use, for example. If it is still not fast enough, consider writing your own C module; or, write a C program, using the Python prototype code as the basis for your design.

steveha
Insertion sort can be done with O(n log n) comparisons (binary insertion), but this one is O(n^2), so at 10000 items I expect timsort to beat it.
gnibbler
I said for "really large data sets" that Python would win, because Timsort is awesome; but for only 10,000 items, I expected the startup time to dominate. However, I just tossed off a quick test on my computer, and found that my trivial Python program to allocate a list of 10,000 random integers and sort it with `.sort()` is in fact faster than my trivial C program to initialize an array of 10,000 integers and sort it with the above insertion sort. (I had to modify the insertion sort to work in bare C... I'm not sure what that "vec" type is. It doesn't look like std::vector<int>...)
steveha
It looks like vec is just the array `v` bundled with its length `n`.
gnibbler
Sure, but I just wanted to toss off a quick test, so I just converted it to boring C. I didn't want to try to write a vec-compatible class to make it work, or convert it to std::vector<int>, just for a quick test.
steveha
+5  A: 

My first thought was that the laptop I have at hand right now, a Macbook Pro, must be comparable to but slightly better than your machine -- I don't have enough of your surrounding code to try your C example (what's a vec_t, etc, etc), but running the Python you coded gives me:

$ python -mtimeit -s'import inso' 'inso.insort(inso.li)'
10 loops, best of 3: 7.21 msec per loop

vs your 8.1 seconds. That's with your code put in inso.py, preceded by:

import random
li = [random.randrange(10000) for _ in xrange(10000)]

The array module doesn't help -- it actually slows things down a bit. Then I installed psyco, the Python JIT helper (x86-only, 32-bit only), and further added:

import psyco
psyco.full()

and got:

$ python -mtimeit -s'import inso' 'inso.insort(inso.li)'
10 loops, best of 3: 207 usec per loop

so, comparing naively with your 8.1 seconds, a speedup of about 8.1 / 0.000207 ≈ 39,000 times -- vs the 8.104 / 0.13 ≈ 62 times that surprised you so much;-).

Of course, the problem is that after the first timing run the list is already sorted, so insort becomes much faster. You didn't give us enough of the surrounding test harness to know exactly what you measured. A more realistic example (where the actual list isn't touched, so it stays disordered, and only a copy is sorted each time), without psyco:

$ python -mtimeit -s'import inso' 'inso.insort(list(inso.li))'
10 loops, best of 3: 13.8 sec per loop

Oops -- so your machine's WAY faster than a MacBook Pro (remember, cores don't count: we're only using one here;-) -- wow... or else, you're mismeasuring. Anyway, WITH psyco:

$ python -mtimeit -s'import inso' 'inso.insort(list(inso.li))'
10 loops, best of 3: 456 msec per loop

So psyco's speedup is only 13.8 / 0.456 ≈ 30 times -- about half of the 60+ times you get with pure-C coding. IOW, you'd expect Python + psyco to be about twice as slow as pure C. That's a more realistic and typical assessment.

If you were writing reasonably high-level code, psyco's speedup would degrade from (say) 30 times down to much less -- but so would C's advantage over Python. For example,

$ python -mtimeit -s'import inso' 'sorted(inso.li)'
100 loops, best of 3: 8.72 msec per loop

without psyco (in this case, psyco actually -- marginally -- slows down the execution;-), so that's another factor of 52 over psyco, 1582 overall over non-psyco insort.

But, when for some reason or other you have to write extremely low-level algorithms in python, rather than using the wealth of support from the builtins and stdlib, psyco can help reduce the pain.

Another point: when you benchmark, please post ALL the code, so others can see exactly what you're doing (and possibly spot gotchas) -- your "scaffolding" is as tricky, and as likely to hide traps, as the code you think you're measuring!-)

Alex Martelli
Can you run timeit with just the list copy and subtract the result per loop from the list copy+insort? I realise the copy is quick compared to the sort of course.
gnibbler
@gnibbler, 179 usec -- in the noise compared to most of the other measurements, not just "quick";-)
Alex Martelli
Alex I have run with cython and psyco and saw a factor of 16 difference. Does cython run on your mac? http://stackoverflow.com/questions/1561596/performance-difference-on-insert-sort-in-c-and-python/1562714#1562714
gnibbler
@gnibbler, sure, Cython runs great on the mac, and it's a very useful language, but it's NOT Python -- it's an "extended subset" (isn't everything an extended subset of everything else?-), which is what the OP was asking about!-)
Alex Martelli
Alex, sorry I wasn't clearer. My intent was to make sure the inside of the function in cython is compiled to pure C, so that cython gives a way to benchmark the C in the same framework you used for the Python. Python is taking ~19 sec on my notebook, and C is over 400 times faster according to this experiment. I was just wondering if you could confirm the result for cython; 400 times seems like a lot.
gnibbler
No, with the right "scaffolding" (using identical source but as cinso.pyx), `python -mtimeit -s'import pyximport; pyximport.install(); import cinso' 'cinso.insort(list(cinso.li))'` is 8.8 seconds, vs 13.8 for psycoless Python and 0.46 for psyco. Maybe you were sorting the same list over and over (so it's nearly always already sorted, distorting the timing as I showed at the start of this answer)?
Alex Martelli
I've just tried putting a counter in the inner while loop and printing it at the end of each sort; each pass prints a number around 25 million, where for a presorted list it would print 0. 25e6 / 0.047 is about 500 million inner-loop iterations per second, which doesn't seem unreasonable since each iteration is just a few integer ops.
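The counter version was something along these lines:

import random

def insort_counting(ln):
    # insertion sort with a counter on the inner while loop, to see how
    # many element shifts one pass over random data performs
    shifts = 0
    for j in range(1, len(ln)):
        key = ln[j]
        i = j - 1
        while i >= 0 and ln[i] > key:
            ln[i + 1] = ln[i]
            i -= 1
            shifts += 1
        ln[i + 1] = key
    return shifts

li = [random.randrange(10000) for _ in range(10000)]
print(insort_counting(list(li)))   # roughly 25 million for random input, 0 if presorted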
gnibbler