Hi,

Let's say I have a set of vectors (readings from sensor 1, readings from sensor 2, readings from sensor 3 -- indexed first by timestamp and then by sensor id) that I'd like to correlate to a separate set of vectors (temperature, humidity, etc -- also all indexed first by timestamp and secondly by type).

What is the cleanest way in numpy to do this? It seems like it should be a rather simple function...

In other words, I'd like to see:

> a.shape 
(365,20)

> b.shape
(365, 5)

> correlations = magic_correlation_function(a,b)

> correlations.shape
(20, 5)

Cheers, /YGA

P.S. I've been asked to add an example.

Here's what I would like to see:

In [27]: x
Out[27]: 
array([[ 0,  0,  0],
       [-1,  0, -1],
       [-2,  0, -2],
       [-3,  0, -3],
       [-4,  0.1, -4]])

In [28]: y
Out[28]: 
array([[0, 0],
       [1, 0],
       [2, 0],
       [3, 0],
       [4, 0.1]])

In [29]: magic_correlation_function(x, y)
Out[29]: 
array([[-1.        ,  0.70710678,  1.        ],
       [-0.70710678,  1.        ,  0.70710678]])

P.P.S. Whoops, I mis-transcribed my example. Sorry all; fixed now.

A: 

As David said, you should define the correlation you're using. I don't know of any definition of correlation that gives sensible numbers when correlating empty and non-empty signals.

Tim Lin
+2  A: 

The simplest thing I could find was to use the scipy.stats package:

In [8]: x
Out[8]: 
array([[ 0. ,  0. ,  0. ],
       [-1. ,  0. , -1. ],
       [-2. ,  0. , -2. ],
       [-3. ,  0. , -3. ],
       [-4. ,  0.1, -4. ]])
In [9]: y
Out[9]: 
array([[0. , 0. ],
       [1. , 0. ],
       [2. , 0. ],
       [3. , 0. ],
       [4. , 0.1]])

In [10]: import numpy, scipy.stats

In [27]: (scipy.stats.cov(y,x)
          /(numpy.sqrt(scipy.stats.var(y,axis=0)[:,numpy.newaxis]))
          /(numpy.sqrt(scipy.stats.var(x,axis=0))))
Out[27]: 
array([[-1.        ,  0.70710678, -1.        ],
       [-0.70710678,  1.        , -0.70710678]])

These aren't quite the numbers in your example: the third column has to equal the first, because x[:,2] is identical to x[:,0]. (Element [0,2] should be -1.)
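
As far as I know, scipy.stats.cov and scipy.stats.var were dropped from later SciPy releases, so the expression above may not run on a current install. Here is a minimal NumPy-only sketch of the same formula (cross-covariance divided by both standard deviations), assuming Pearson correlation is what's wanted. The names cov_yx and corr are just for illustration:

```python
import numpy as np

x = np.array([[0, 0, 0], [-1, 0, -1], [-2, 0, -2], [-3, 0, -3], [-4, 0.1, -4]])
y = np.array([[0, 0], [1, 0], [2, 0], [3, 0], [4, 0.1]])

n = x.shape[0]
# Cross-covariance of every column of y with every column of x, shape (2, 3).
cov_yx = (y - y.mean(axis=0)).T @ (x - x.mean(axis=0)) / (n - 1)
# Divide by the standard deviations; newaxis turns y's into a column so the
# division broadcasts over the whole (2, 3) grid.
corr = (cov_yx
        / np.sqrt(y.var(axis=0, ddof=1))[:, np.newaxis]
        / np.sqrt(x.var(axis=0, ddof=1)))
print(corr)
```

A constant column has zero variance, so its entries would come out as nan.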

A more complicated, but purely numpy solution is

In [40]: numpy.corrcoef(x.T,y.T)[numpy.arange(x.shape[1])[numpy.newaxis,:]
                                 ,x.shape[1]+numpy.arange(y.shape[1])[:,numpy.newaxis]]
Out[40]: 
array([[-1.        ,  0.70710678, -1.        ],
       [-0.70710678,  1.        , -0.70710678]])

This will be slower because it also computes the correlation of each column of x with every other column of x (and likewise for y), which you don't want. Also, the advanced indexing used to pull out the block you want can make your head hurt.
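
If the fancy indexing is the part that hurts, plain slicing can pick out the y-versus-x block instead. This is a sketch using the same x and y as above. corrcoef puts the variables of x first and those of y after them, so the block of interest sits in the lower-left corner:

```python
import numpy as np

x = np.array([[0, 0, 0], [-1, 0, -1], [-2, 0, -2], [-3, 0, -3], [-4, 0.1, -4]])
y = np.array([[0, 0], [1, 0], [2, 0], [3, 0], [4, 0.1]])

nx = x.shape[1]
full = np.corrcoef(x.T, y.T)   # (5, 5); variables ordered x0, x1, x2, y0, y1
block = full[nx:, :nx]         # rows: columns of y, columns: columns of x
print(block.shape)
print(block)
```

It still computes the full matrix, so it is no faster, only easier to read.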

If you're going to use numpy heavily, get familiar with the rules on broadcasting and indexing. They will help you push as much work down to the C level as possible.
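
For instance, the division in the scipy.stats example above relies on broadcasting: a column of shape (2, 1) combined with a row of shape (3,) gives a (2, 3) grid. A small illustration with made-up numbers:

```python
import numpy as np

sy = np.array([2.0, 4.0])        # one value per column of y (made-up numbers)
sx = np.array([1.0, 2.0, 5.0])   # one value per column of x (made-up numbers)

col = sy[:, np.newaxis]          # shape (2, 1)
grid = col * sx                  # (2, 1) with (3,) broadcasts to (2, 3)
print(grid.shape)
print(grid)
```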

AFoglia
I've updated the question with the "right" inputs -- prob. makes sense to update the response just so as not to confuse people :-)
YGA
Done. I've also added links to some helpful documentation.
AFoglia
A: 

Will this do what you want?

correlations = numpy.dot(numpy.transpose(a), b)
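
Note that a bare dot product gives unnormalized sums of products, not values between -1 and 1. It becomes a Pearson correlation only if each column is centered and scaled first. Here is a sketch under that assumption, with random data standing in for a and b:

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=(365, 20))   # stand-in data with the shapes from the question
b = rng.normal(size=(365, 5))

# Standardize each column: zero mean, unit (population) standard deviation.
za = (a - a.mean(axis=0)) / a.std(axis=0)
zb = (b - b.mean(axis=0)) / b.std(axis=0)

# Averaged products of standardized columns are Pearson correlations.
correlations = np.dot(za.T, zb) / a.shape[0]
print(correlations.shape)
```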
Mr Fooz