Hi,

I have a matrix which is fairly large (around 50K rows), and I want to print the correlation coefficient between each pair of rows in the matrix. I have written Python code like this:

import scipy.stats

for i in xrange(rows):  # rows is the number of rows in the matrix
    for j in xrange(i, rows):
        r = scipy.stats.pearsonr(data[i, :], data[j, :])
        print r

Please note that I am making use of the pearsonr function available from the scipy module (http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.pearsonr.html).

My question is: Is there a quicker way of doing this? Is there some matrix partition technique that I can use?

Thanks!

A: 

You can use the Python multiprocessing module: chunk your rows into, say, 10 sets, buffer your results in each worker, and then print them out at the end (this would only speed things up on a multicore machine, though).

http://docs.python.org/library/multiprocessing.html

By the way, you'd also have to turn your snippet into a function and consider how to reassemble the results. Having each subprocess work from a list like [startcoord, stopcoord, buff] might work nicely:

def myfunc(thelist):
    start, stop, buff = thelist    # thelist is [startcoord, stopcoord, buff]
    for i in xrange(start, stop):
        pass                       # ... compute the correlations for row i into buff ...
    thelist[2] = buff              # stash the buffered results back in the list
    return thelist
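
A slightly fuller (untested) sketch of that idea, in Python 2 style to match the snippets above; the function name corr_chunk and the small random array are just illustrative stand-ins for the real 50K-row matrix:

import multiprocessing
import numpy as np
import scipy.stats

# Stand-ins for the real inputs; in practice `data` would be the 50K-row matrix.
data = np.random.random((1000, 10))
rows = data.shape[0]

def corr_chunk(bounds):
    # Compute Pearson r for every pair (i, j) with start <= i < stop and j >= i.
    start, stop = bounds
    out = []
    for i in xrange(start, stop):
        for j in xrange(i, rows):
            r, p = scipy.stats.pearsonr(data[i, :], data[j, :])
            out.append((i, j, r))
    return out

if __name__ == '__main__':
    nchunks = 10
    edges = np.linspace(0, rows, nchunks + 1).astype(int)
    pool = multiprocessing.Pool()
    results = pool.map(corr_chunk, zip(edges[:-1], edges[1:]))
    pool.close()
    pool.join()
    # `results` is a list of per-chunk lists of (i, j, r) tuples.

Note that equal-sized row chunks do unequal amounts of work (earlier rows pair with more partners), so splitting more finely than the number of cores helps balance the load.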
pyInTheSky
+4  A: 

New Solution

After looking at Joe Kington's answer, I decided to look into the corrcoef() code and was inspired by it to write the following implementation.

import numpy as np
import scipy.stats

ms = data.mean(axis=1)[:, None]                  # row means, kept 2-D for broadcasting
datam = data - ms                                # center each row
datass = np.sqrt(scipy.stats.ss(datam, axis=1))  # per-row norms (sqrt of sum of squares)
for i in xrange(rows):
    temp = np.dot(datam[i:], datam[i].T)         # row i dotted with rows i..end
    rs = temp / (datass[i:] * datass[i])         # Pearson r between row i and rows i..end

Each pass through the loop generates the Pearson coefficients between row i and rows i through the last row. It is very fast: at least 1.5x as fast as using corrcoef() alone, because it doesn't redundantly calculate the coefficients, among other things. It also avoids the memory problems of a 50,000-row matrix, because you can choose to either store each set of r's or process them before generating the next set. Without storing any of the r's long term, I was able to run the above code on a 50,000 x 10 set of randomly generated data in under a minute on my fairly new laptop.
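
For example, here is a minimal sketch of processing each set of r's as it is generated rather than storing them all, in this case just tracking the single strongest off-diagonal correlation. The small random array is only a stand-in for the real matrix; the variable names mirror the snippet above:

import numpy as np
import scipy.stats

data = np.random.random((1000, 10))              # stand-in for the real 50K x 10 matrix
rows = data.shape[0]

ms = data.mean(axis=1)[:, None]                  # row means, kept 2-D for broadcasting
datam = data - ms                                # centered rows
datass = np.sqrt(scipy.stats.ss(datam, axis=1))  # per-row norms

best = (-2.0, None, None)                        # (r, i, j) of the strongest pair seen
for i in xrange(rows - 1):
    rs = np.dot(datam[i+1:], datam[i]) / (datass[i+1:] * datass[i])
    j = rs.argmax()
    if rs[j] > best[0]:
        best = (rs[j], i, i + 1 + j)
print best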

Old Solution

First, I wouldn't recommend printing the r's to the screen. With your code on 100 rows (10 columns), that's the difference between 19.79 seconds with printing and 0.301 seconds without. Just store the r's and use them later if you like, or process them as you go, for example by looking for some of the largest r's.

Second, you can get some savings by not redundantly calculating some quantities. The Pearson coefficient is calculated in scipy using quantities that you can precalculate once rather than recomputing every time a row is used. Also, you aren't using the p-value (which pearsonr() also returns), so let's scratch that too. Using the code below:

import numpy as np
import scipy.stats

r = np.zeros((rows, rows))
ms = data.mean(axis=1)                       # row means

datam = np.zeros_like(data)
for i in xrange(rows):
    datam[i] = data[i] - ms[i]               # center each row
datass = scipy.stats.ss(datam, axis=1)       # per-row sums of squares
for i in xrange(rows):
    for j in xrange(i, rows):
        r_num = np.add.reduce(datam[i]*datam[j])
        r_den = np.sqrt(datass[i]*datass[j])
        r[i,j] = min((r_num / r_den), 1.0)   # clamp tiny floating-point overshoots

I get a speed-up of about 4.8x over the straight scipy code with the p-value calculation removed, and 8.8x if the p-value calculation is left in (I used 10 columns with hundreds of rows). I also checked that it gives the same results. This isn't a really huge improvement, but it might help.

Ultimately, you are stuck with the problem that you are computing (50000)*(50001)/2 = 1,250,025,000 Pearson coefficients (if I'm counting correctly). That's a lot. By the way, there's really no need to compute each row's Pearson coefficient with itself (it will equal 1), but that only saves you from computing 50,000 of them. With the above code, and based on my results on smaller datasets, I expect the computation to take about 4 1/4 hours if your data has 10 columns.
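
As a rough back-of-the-envelope check on that estimate, scaling the 0.301 s / 100-row timing quoted above by the ~4.8x speed-up and the ratio of pair counts lands in the same ballpark:

pairs_small = 100 * 101 / 2          # pairs in the 100-row timing test (5,050)
pairs_full = 50000 * 50001 / 2       # pairs for the full matrix (1,250,025,000)
secs_small = 0.301 / 4.8             # ~0.063 s for 100 rows with the faster code
secs_full = secs_small * pairs_full / pairs_small
print secs_full / 3600.0             # roughly 4.3 hours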

You can get some improvement by porting the above code to Cython or something similar. I expect that you'll maybe get up to a 10x improvement over straight SciPy if you're lucky. Also, as suggested by pyInTheSky, you can do some multiprocessing.

Justin Peel
+2  A: 

Have you tried just using numpy.corrcoef? Seeing as you're not using the p-values, it should do exactly what you want, with as little fuss as possible. (Unless I'm misremembering exactly what Pearson's r is, which is quite possible.)

Just quickly checking the results on random data, it returns exactly the same thing as @Justin Peel's code above and runs ~100x faster.

For example, testing things with 1000 rows and 10 columns of random data...:

import numpy as np
import scipy as sp
import scipy.stats

def main():
    data = np.random.random((1000, 10))
    x = corrcoef_test(data)
    y = justin_peel_test(data)
    print 'Maximum difference between the two results:', np.abs((x-y)).max()
    return data

def corrcoef_test(data):
    """Just using numpy's built-in function"""
    return np.corrcoef(data)

def justin_peel_test(data):
    """Justin Peel's suggestion above"""
    rows = data.shape[0]

    r = np.zeros((rows,rows))
    ms = data.mean(axis=1)

    datam = np.zeros_like(data)
    for i in xrange(rows):
        datam[i] = data[i] - ms[i]
    datass = sp.stats.ss(datam,axis=1)
    for i in xrange(rows):
        for j in xrange(i,rows):
            r_num = np.add.reduce(datam[i]*datam[j])
            r_den = np.sqrt(datass[i]*datass[j])
            r[i,j] = min((r_num / r_den), 1.0)
            r[j,i] = r[i,j]
    return r

data = main()

Yields a maximum absolute difference of ~3.3e-16 between the two results

And timings:

In [44]: %timeit corrcoef_test(data)
10 loops, best of 3: 71.7 ms per loop

In [45]: %timeit justin_peel_test(data)
1 loops, best of 3: 6.5 s per loop

numpy.corrcoef should do just what you want, and it's a lot faster.

Joe Kington
You're quite right. I thought of `corrcoef` at first, but for some reason I remembered it being slower. Feeling a bit sheepish that I trusted my bad memory rather than trying it out. It is faster because it uses matrix multiplications to eliminate Python loops. +1 from me.
Justin Peel
The problem with corrcoef, though, is that it uses about twice as much memory as needed. It also calculates almost all of the coefficients twice. The bigger issue, however, is memory, and the OP will have to break the data up to avoid memory problems. It essentially becomes a combinatorics mess.
Justin Peel
@Justin Peel - True, corrcoef is creating an extra temporary copy of the input array. It's a tradeoff between speed and amount of memory used. Your solution is much better if memory is the main constraint, and with 50,000 rows, it's likely to be.
Joe Kington
@Joe Kington Actually, I was thinking more of how it calculates each coefficient twice and stores them, though you are right that it also makes an extra temporary copy of the input. I think that this (corrcoef) may be the best way to do it, but you'd have to split the data up cleverly and put it back together carefully to get all of the combinations.
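
A minimal sketch of what such a split might look like (untested; the function name, block size, and the process callback are placeholders): normalize the rows once, then take block-by-block dot products so only a bounded slab of coefficients is in memory at a time.

import numpy as np

def blocked_corr(data, block=1000, process=None):
    # Normalize the rows once: centered, unit-norm rows make the block
    # dot products below equal to the Pearson coefficients.
    datam = data - data.mean(axis=1)[:, None]
    datam /= np.sqrt((datam * datam).sum(axis=1))[:, None]
    rows = data.shape[0]
    for i0 in xrange(0, rows, block):
        for j0 in xrange(i0, rows, block):
            block_r = np.dot(datam[i0:i0+block], datam[j0:j0+block].T)
            if process is not None:
                process(block_r, i0, j0)  # placeholder: store/write/filter the slab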
Justin Peel