ansaurus

Question

Pearson Similarity Score, how can I optimise this further?

Answer 1

A:

If the inputs to any of your math functions are fairly constrained, you can use a lookup table instead of the math function. This can earn you some performance (speed) at the cost of extra memory to store the table.

Chris Judge 2009-08-20 15:47:28

Answer 2

A:

I'm not sure if this holds in Python. But calculating the sqrt is a processor intensive calculation.

You might go for a fast approximation newton

Toad 2009-08-20 15:49:11

Answer 3

+3 A:

The real speed increase would be gained by moving to numpy or scipy. Short of that, there are microoptimizations: e.g. x*x is faster than pow(x,2); you could extract the values at the same time as the keys by doing, instead of:

si = [val for val in v1 if val in v2]

something like

vs = [ (v1[val],v2[val]) for val in v1 if val in v2]

and then

sum1 = sum(x for x, y in vs)

and so on; whether each of these brings time advantage needs microbenchmarking. Depending on how you're using these coefficients returning the square would save you a sqrt (that's a similar idea to using squares of distances between points, in geometry, rather than the distances themselves, and for the same reason -- saves you a sqrt; which makes sense because the coefficient IS a distance, kinda...;-).

Alex Martelli 2009-08-20 15:57:29

Answer 4

+1 A:

I'd suggest changing:

[val for val in v1 if val in v2]

to

set(v1) & set(v2)

do

if not n: return 0.0    # and similar for den

instead of

if n == 0: return 0.0

and it's worth replacing last 6 lines with:

try:
    return num / sqrt(abs(temp))
except ZeroDivisionError:
    return 1.0

SilentGhost 2009-08-20 16:08:11

Answer 5

+1 A:

Since it looks like you're doing quite a bit of numeric computation you should give Psyco a shot. It's a JIT compiler that analyzes running code and optimizes certain operations. Install it, then at the top of your file put:

try:
    import psyco
    psyco.full()
except ImportError:
    pass

This will enable Psyco's JIT and should speed up your code somewhat, for free :) (actually not, it takes up more memory)

lost-theory 2009-08-21 03:24:43

Answer 6

A:

I'll post what I've got so far as an answer to differentiate it from the question. This is a combination of some techniques described above that seem to have given the best improvement s far.

def pearson(v1,v2):
    vs = [(v1[val],v2[val]) for val in v1 if val in v2]

    n = len(vs)

    if n==0: return 0.0

    sum1,sum2,sum1_sq,sum2_sq,p_sum = 0.0, 0.0, 0.0, 0.0, 0.0

    for v1,v2 in vs:
        sum1+=v1
        sum2+=v2
        sum1_sq+=v1*v1
        sum2_sq+=v2*v2
        p_sum+=v1*v2

    # Calculate Pearson score
    num = p_sum-(sum1*sum2/n)
    temp = max((sum1_sq-pow(sum1,2)/n) * (sum2_sq-pow(sum2,2)/n),0)
    if temp:
        return num / sqrt(temp)
    return 1.0

Edit: It looks like psyco gives a 15% improvment for this version which isn't massive but is enough to justify its use.

Andrew Ingram 2009-08-21 14:23:36

last line is not needed

SilentGhost 2009-08-21 14:59:15

thanks, I overlooked that, *edits*

Andrew Ingram 2009-08-21 16:15:41

did you look at the scipy implementation of pearson, if this is more efficient or not. I am always looking for algorithm efficiencies in libraries, it might be useful to submit this to scipy lists.

whatnick 2009-09-23 01:24:03

Answer 7

+2 A:

If you can use scipy, you could use the pearson function: http://www.scipy.org/doc/api%5Fdocs/SciPy.stats.stats.html#pearsonr

Or you could copy/paste the code (it has a liberal license) from http://svn.scipy.org/svn/scipy/trunk/scipy/stats/stats.py (search for def pearson()). In the code np is just numpy (the code does import numpy as np).

tonfa 2009-08-21 19:19:26

Answer 8

+1 A:

Scipy is the fastest!

I have don some tests with the code above and also with a version I found on my comp, see below for results and the code:

pearson 14.7597990757
sim_pearson 15.6806837987
scipy:pearsonr 0.451986019188

try:
    import psyco
    psyco.full()
except ImportError:
    pass

from math import sqrt

def sim_pearson(set1, set2):
    si={}
    for item in set1:
        if item in set2:
            si[item] = 1

    #number of elements
    n = len(si)

    #if none common, return 0 similarity
    if n == 0: return 0

    #add up all the preferences
    sum1 = sum([set1[item] for item in si])
    sum2 = sum([set2[item] for item in si])

    #sum up the squares
    sum_sq1 = sum([pow(set1[item], 2) for item in si])
    sum_sq2 = sum([pow(set2[item], 2) for item in si])

    #sum up the products
    sum_p = sum([set1[item] * set2[item] for item in si])

    nom = sum_p - ((sum1 * sum2) / n )
    den = sqrt( (sum_sq1 - (sum1)**2 / n) * (sum_sq2 - (sum2)**2 / n) )

    if den==0: return 0
    return nom/den



# from http://stackoverflow.com/questions/1307016/pearson-similarity-score-how-can-i-optimise-this-further
def pearson(v1, v2):
    vs = [(v1[val],v2[val]) for val in v1 if val in v2]

    n = len(vs)

    if n==0: return 0.0

    sum1,sum2,sum1_sq,sum2_sq,p_sum = 0.0, 0.0, 0.0, 0.0, 0.0

    for v1,v2 in vs:
        sum1+=v1
        sum2+=v2
        sum1_sq+=v1*v1
        sum2_sq+=v2*v2
        p_sum+=v1*v2

    # Calculate Pearson score
    num = p_sum-(sum1*sum2/n)
    temp = max((sum1_sq-pow(sum1,2)/n) * (sum2_sq-pow(sum2,2)/n),0)
    if temp:
        return num / sqrt(temp)
    return 1.0






if __name__ == "__main__":
    import timeit

    tsetup = """
from random import randrange
from __main__ import pearson, sim_pearson
from scipy.stats import pearsonr
v1 = [randrange(0,1000) for x in range(1000)]
v2 = [randrange(0,1000) for x in range(1000)]
#gc.enable()
"""
    t1 = timeit.Timer(stmt="pearson(v1,v2)", setup=tsetup)
    t2 = timeit.Timer(stmt="sim_pearson(v1,v2)", setup=tsetup)
    t3 = timeit.Timer(stmt="pearsonr(v1,v2)", setup=tsetup)

    tt = 1000

    print 'pearson', t1.timeit(tt)
    print 'sim_pearson', t2.timeit(tt)
    print 'scipy:pearsonr', t3.timeit(tt)

roman 2010-02-17 12:22:13

ansaurus

tags:

views:

answers:

Pearson Similarity Score, how can I optimise this further?

related questions