views:

2207

answers:

2

Hi,

I need to do auto-correlation of a set of numbers, which as I understand it is just the correlation of the set with itself.

I've tried it using numpy's correlate function, but I don't believe the result, as it almost always gives a vector where the first number is not the largest, as it ought to be.

So, this question is really two questions:

1) What exactly is numpy.correlate doing?

2) How can I use it (or something else) to do auto-correlation?

Thanks,

Ben

A: 

1) Here is the documentation for numpy.correlate. The code inside the file looks like this:

mode = _mode_from_name(mode)
return multiarray.correlate(a,v,mode)

multiarray.correlate points to a .pyd file (i.e. a DLL file), so to get the inner workings you should probably ask the numpy developers.

2) If you don't believe the numpy results, you might try SciPy's correlate function.

tgray
The scipy and numpy correlate functions are both in C. numpy multiarray source code is somewhere in here: http://svn.scipy.org/svn/numpy/trunk/numpy/core/src/multiarray/ and scipy correlate source code is here: http://svn.scipy.org/svn/scipy/trunk/scipy/signal/correlate_nd.c.src
endolith
+6  A: 

To answer your first question, Numpy.correlate(a, v, mode) is performing the convolution of a with the reverse of v and giving the results clipped by the specified mode. Because of the definition of convolution, the correlation C(t) = Sum for -inf < i < inf of (a[i] * v[t + i]), where -inf < t < inf. Even though this definition of the correlation would allow for results from -infinity to infinity, you obviously can't store an infinitely long array. So it has to be clipped, and that is where the mode comes in. There are 3 different modes: full, same, & valid. 'full' mode returns results for every t where both a and v have some overlap. 'same' mode returns a result with the same length as the shortest vector (a or v). 'valid' mode returns results only when a and v completely overlap each other. The documentation for Numpy.convolve gives more detail on the modes.

For your second question, I think Numpy.correlate is giving you the autocorrelation, it is just giving you a little more as well. The autocorrelation is used to find how similar a signal, or function, is to itself at a certain time difference. At a time difference of 0, the auto-correlation should be the highest because the signal is identical to itself, so you expected that the first element in the auto-correlation result array would be the greatest. However, the correlation is not starting at a time difference of 0. It starts at a negative time difference, closes to 0, and then goes positive. That is, you were expecting:

Autocorrelation(a) = Sum for -inf < i < inf (a[i] * v[t + i]), where 0 <= t < inf

But what you got was:

Autocorrelation(a) = Sum for -inf < i < inf (a[i] * v[t + i]), where -inf < t < inf

What you need to do is take the last half of your correlation result, and that should be the auto-correlation you are looking for. A simple python function to do that would be:

def autocorr(x):
    result = numpy.correlate(x, x, mode='full')
    return result[result.size/2:]

You will, of course, need error checking to make sure that x is actually a 1-d array. Also, this explanation probably isn't the most mathematically rigorous. I've been throwing around infinities because the definition of convolution uses them, but that doesn't necessarily apply for auto-correlation. So, the theoretical portion of this explanation may be slightly wonky, but hopefully the practical results are helpful. These pages on auto-correlation are pretty helpful, and can give you a much better theoretical background if you don't mind wading through the notation and heavy concepts.

A. Levy