I wrote a method to calculate the cosine distance between two arrays:

from math import sqrt

def cosine_distance(a, b):
    if len(a) != len(b):
        return False
    numerator = 0
    denoma = 0
    denomb = 0
    for i in range(len(a)):
        numerator += a[i]*b[i]
        denoma += abs(a[i])**2
        denomb += abs(b[i])**2
    result = 1 - numerator / (sqrt(denoma)*sqrt(denomb))
    return result

Running it can be very slow on a large array. Is there an optimized version of this method that would run faster?

Update: I've tried all the suggestions to date, including scipy. Here's the version to beat, incorporating suggestions from Mike and Steve:

from math import sqrt

def cosine_distance(a, b):
    if len(a) != len(b):
        raise ValueError("a and b must be same length")  # Steve
    numerator = 0
    denoma = 0
    denomb = 0
    for i in range(len(a)):   # Mike's optimizations:
        ai = a[i]             # only calculate once
        bi = b[i]
        numerator += ai*bi    # faster than exponent (barely)
        denoma += ai*ai       # strip abs() since it's squaring
        denomb += bi*bi
    result = 1 - numerator / (sqrt(denoma)*sqrt(denomb))
    return result
+4  A: 

If you can use SciPy, you can use cosine from spatial.distance:

http://docs.scipy.org/doc/scipy/reference/spatial.distance.html
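
A minimal usage sketch (the input vectors here are made up; cosine(u, v) returns the same 1 - similarity quantity the code in the question computes):

from scipy.spatial.distance import cosine

a = [1.0, 2.0, 3.0]
b = [4.0, 5.0, 6.0]
print cosine(a, b)  # cosine distance between a and b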

If you can't use SciPy, you could try to obtain a small speedup by rewriting your Python (EDIT: but it didn't work out like I thought it would, see below).

from itertools import izip
from math import sqrt

def cosine_distance(a, b):
    if len(a) != len(b):
        raise ValueError("a and b must be same length")
    numerator = sum(tup[0] * tup[1] for tup in izip(a,b))
    denoma = sum(avalue ** 2 for avalue in a)
    denomb = sum(bvalue ** 2 for bvalue in b)
    result = 1 - numerator / (sqrt(denoma)*sqrt(denomb))
    return result

It is better to raise an exception when the lengths of a and b are mismatched.

By using generator expressions inside of calls to sum() you can calculate your values with most of the work being done by the C code inside of Python. This should be faster than using a for loop.

I haven't timed this so I can't guess how much faster it might be. But the SciPy code is almost certainly written in C or C++ and it should be about as fast as you can get.

If you are doing bioinformatics in Python, you really should be using SciPy anyway.

EDIT: Darius Bacon timed my code and found it slower. So I timed my code and... yes, it is slower. The lesson for all: when you are trying to speed things up, don't guess, measure.

I am baffled as to why my attempt to put more work on the C internals of Python is slower. I tried it for lists of length 1000 and it was still slower.

I can't spend any more time on trying to hack the Python cleverly. If you need more speed, I suggest you try SciPy.

EDIT: I just tested by hand, without timeit. I find that for short a and b, the old code is faster; for long a and b, the new code is faster; in both cases the difference is not large. (I'm now wondering if I can trust timeit on my Windows computer; I want to try this test again on Linux.) I wouldn't change working code to try to get it faster. And one more time I urge you to try SciPy. :-)
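
For anyone repeating the measurement, here is roughly what a timeit run looks like (a sketch; the module name cosine and the array size are made up):

from timeit import Timer

setup = """
import random
from cosine import cosine_distance  # whichever version is under test
a = [random.random() for _ in xrange(1000)]
b = [random.random() for _ in xrange(1000)]
"""
timer = Timer("cosine_distance(a, b)", setup)
print min(timer.repeat(repeat=3, number=10000))  # best of 3 runs, seconds per 10K calls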

steveha
The numerator line is incorrect: it does a nested loop instead of a parallel one.
Darius Bacon
@Darius Bacon, you are correct and I need to fix that. Ugh.
steveha
Also, when I fixed that line to get the right answer, it's still slower than the original code. Agreed about SciPy anyway! (numerator = sum(avalue * bvalue for avalue, bvalue in zip(a, b)))
Darius Bacon
Good call with SciPy. Unfortunately your non-SciPy rewrite returns the wrong values. Replacing the numerator line with gnibbler's results in the right answer, but it's actually much slower than my original code.
Dan
Guess I should hit refresh before posting my comment...
Dan
Interestingly, scipy was actually much slower. For testing I ran a couple of small arrays through 100K iterations. Original code ran in ~1.3 seconds and scipy in ~7.5 seconds. I wonder if the tables would turn on larger arrays?
Dan
Out of curiosity, since I don't have SciPy installed but am perpetually intrigued by the project, have you got any timing for this particular case using cosine from spatial.distance?
Jarret Hardie
Thanks Steve—I'll give it a shot on the production code, which will have much larger arrays.
Dan
+1  A: 

This is faster for arrays of around 1000+ elements.

from numpy import array, sqrt

def cosine_distance(a, b):
    a = array(a)
    b = array(b)
    numerator = (a*b).sum()
    denoma = (a*a).sum()
    denomb = (b*b).sum()
    result = 1 - numerator / sqrt(denoma*denomb)
    return result
gnibbler
+4  A: 

(I originally thought) you're not going to speed it up a lot without breaking out to C (like numpy or scipy) or changing what you compute. But here's how I'd try that, anyway:

from itertools import imap
from math import sqrt
from operator import mul

def cosine_distance(a, b):
    assert len(a) == len(b)
    return 1 - (sum(imap(mul, a, b))
                / sqrt(sum(imap(mul, a, a))
                       * sum(imap(mul, b, b))))

It's roughly twice as fast in Python 2.6 with 500k-element arrays. (After changing map to imap, following Jarret Hardie.)

Here's a tweaked version of the original poster's revised code:

from itertools import izip
from math import sqrt

def cosine_distance(a, b):
    assert len(a) == len(b)
    ab_sum, a_sum, b_sum = 0, 0, 0
    for ai, bi in izip(a, b):
        ab_sum += ai * bi
        a_sum += ai * ai
        b_sum += bi * bi
    return 1 - ab_sum / sqrt(a_sum * b_sum)

It's ugly, but it does come out faster...

Edit: And try Psyco! It speeds up the final version by another factor of 4. How could I forget?
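
Using Psyco is only a couple of lines (a sketch, assuming Psyco is installed, which requires a 32-bit CPython 2.x):

import psyco
psyco.bind(cosine_distance)  # JIT-compile just this one function
# or psyco.full() to compile every function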

Darius Bacon
nice addition - glad to hear that using imap with mul offers an advantage over ** 2
Jarret Hardie
I don't think it's that ugly :p
gnibbler
I was just a bit chagrined to see the imperative code beating out the purely functional code that more directly expresses the problem.
Darius Bacon
+1  A: 

Similar to Darius Bacon's answer, I've been toying with operator and itertools to produce a faster version. The following seems to be about 1/3 faster on a 500-item array, according to timeit:

from math import sqrt
from itertools import imap
from operator import mul

def op_cosine(a, b):
    dot_prod = sum(imap(mul, a, b))
    a_veclen = sqrt(sum(i ** 2 for i in a))
    b_veclen = sqrt(sum(i ** 2 for i in b))

    return 1 - dot_prod / (a_veclen * b_veclen)
Jarret Hardie
+1  A: 

No need to take abs() of a[i] and b[i] if you're squaring it.

Store a[i] and b[i] in temporary variables, to avoid doing the indexing more than once. Maybe the compiler can optimize this, but maybe not.

Check into the **2 operator: is it simplified into a multiply, or does it call a general power function (log, multiply by 2, antilog)? One way to check is shown below.

Don't do sqrt twice (though the cost of that is small). Do sqrt(denoma * denomb).
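
On the **2 question, a sketch using the dis module to inspect the bytecode: in CPython 2.x, x ** 2 compiles to a BINARY_POWER opcode that goes through the general power routine, while x * x is a plain BINARY_MULTIPLY.

import dis

def square_pow(x):
    return x ** 2

def square_mul(x):
    return x * x

dis.dis(square_pow)  # shows BINARY_POWER
dis.dis(square_mul)  # shows BINARY_MULTIPLY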

Mike Dunlavey
Good call... each of these shaved off a bit of time.
Dan
@Dan: Welcome. Next I'd see if some unrolling would help, in case the iterator is costing you (they tend to do that). Next I would do some stack sampling to see if the function is being called any more than necessary (or if there is any other unnoticed time-tumor).
Mike Dunlavey
A: 
from math import sqrt

def cd(a, b):
    if len(a) != len(b):
        raise ValueError("a and b must be the same length")
    rn = range(len(a))
    adb = sum([a[k]*b[k] for k in rn])
    nma = sqrt(sum([a[k]*a[k] for k in rn]))
    nmb = sqrt(sum([b[k]*b[k] for k in rn]))

    result = 1 - adb / (nma*nmb)
    return result
Arrieta
You are using list comprehensions inside of calls to `sum()`. This will create a list, then `sum()` will use the list once, and then the list will be garbage-collected. Python has a nifty feature called "generator expressions" where you can use the same syntax as a list comprehension, but it will create an iterator. If you simply delete the `[` and `]` from inside your calls to `sum()`, you will now be using generator expressions. Read more about this here: http://docs.python.org/howto/functional.html#generator-expressions-and-list-comprehensions
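For example, the first sum in the answer above, written both ways:

adb = sum([a[k]*b[k] for k in rn])  # list comprehension: builds the whole list, then sums it
adb = sum(a[k]*b[k] for k in rn)    # generator expression: sums lazily, no intermediate list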
steveha
@steveha: depends on input length and the function. I don't know here, but str.join(..) is faster with list comprehensions than genexps for short input (len ~100).
kaizer.se
@kaizer.se: `str.join` is a special case: when given a list, it first sums the lengths, then creates a string of the total size and fills it with the parts; otherwise, it builds the string the obvious way (for part in iterable: result += part)
ΤΖΩΤΖΙΟΥ
A: 

Using the C code inside of SciPy wins big for long input arrays. Using simple and direct Python wins for short input arrays; Darius Bacon's izip()-based code benchmarked out best. Thus, the ultimate solution is to decide which one to use at runtime, based on the length of the input arrays:

from scipy.spatial.distance import cosine as scipy_cos_dist

from itertools import izip
from math import sqrt

def cosine_distance(a, b):
    len_a = len(a)
    assert len_a == len(b)
    if len_a > 200:  # 200 is a magic value found by benchmark
        return scipy_cos_dist(a, b)
    # function below is basically just Darius Bacon's code
    ab_sum = a_sum = b_sum = 0
    for ai, bi in izip(a, b):
        ab_sum += ai * bi
        a_sum += ai * ai
        b_sum += bi * bi
    return 1 - ab_sum / sqrt(a_sum * b_sum)

I made a test harness that tested the functions with different length inputs, and found that around length 200 the SciPy function started to win. The bigger the input arrays, the bigger it wins. For very short length arrays, say length 3, the simpler code wins. This function adds a tiny amount of overhead to decide which way to do it, then does it the best way.

In case you are interested, here is the test harness:

from darius2 import cosine_distance as fn_darius2
fn_darius2.__name__ = "fn_darius2"

from ult import cosine_distance as fn_ult
fn_ult.__name__ = "fn_ult"

from scipy.spatial.distance import cosine as fn_scipy
fn_scipy.__name__ = "fn_scipy"

import random
import time

lst_fn = [fn_darius2, fn_scipy, fn_ult]

def run_test(fn, lst0, lst1, test_len):
    start = time.time()
    for _ in xrange(test_len):
        fn(lst0, lst1)
    end = time.time()
    return end - start

for data_len in range(50, 500, 10):
    a = [random.random() for _ in xrange(data_len)]
    b = [random.random() for _ in xrange(data_len)]
    print "len(a) ==", len(a)
    test_len = 10**3
    for fn in lst_fn:
        n = fn.__name__
        r = fn(a, b)
        t = run_test(fn, a, b, test_len)
        print "%s:\t%f seconds, result %f" % (n, t, r)
steveha