I apologize in advance for being a bit verbose: if you want to skip all the background mumbo jumbo, you can find my question further down.

This is pretty much a follow-up to a question I previously posted on how to compare two 1D (time-dependent) signals. One of the answers I got was to use the cross-correlation function (xcorr in MATLAB), which I did.

Background information

Perhaps a little background information will be useful: I'm trying to implement an Independent Component Analysis algorithm. One of my informal tests is to (1) create the test case by (a) generating 2 random vectors (1x1000), (b) combining the vectors into a 2x1000 matrix (called "S"), and (c) multiplying this by a 2x2 mixing matrix (called "A"), to give me a new matrix (let's call it "T").

In summary: T = A * S

(2) I then run the ICA algorithm to estimate the inverse of the mixing matrix (called "W"), and (3) multiply "T" by "W" to (hopefully) give me a reconstruction of the original signal matrix (called "X").

In summary: X = W * T

(4) I now want to compare "S" and "X". Although "S" and "X" are 2x1000, I simply compare S(1,:) to X(1,:) and S(2,:) to X(2,:), each of which is 1x1000, making them 1D signals. (I have another step which makes sure that these are the proper vectors to compare to each other, and I also normalize the signals.)

So my current quandary is how to 'grade' how closely S(1,:) matches X(1,:), and likewise S(2,:) and X(2,:).

So far I have used something like: r1 = max(abs(xcorr(S(1,:), X(1,:))))

My question

Assuming that using the cross correlation function is a valid way to go about comparing the similarity of two signals, what would be considered a good R value to grade the similarity of the signals? Wikipedia states that this is a very subjective area, and so I defer to the better judgment of those who might have experience in this field.

As you might realize, I'm not coming from an EE/DSP/statistical background at all (I'm a medical student), so I'm going through a sort of "baptism by fire" right now, and I appreciate all the help I can get. Thanks!

A: 

Since they should be equal, the correlation coefficient should be high, between 0.99 and 1. I would take the max and abs functions out of your calculation, too.

EDIT: I spoke too soon. I confused cross-correlation with correlation coefficient, which is completely different. My answer might not be worth much.

Adam Crume
Does this mean that I then make the autocorrelation of the original signals the "gold standard"? Therefore, given:
`r1 = xcorr(S(1,:), X(1,:))`
`r2 = xcorr(S(2,:), X(2,:))`
`a1 = xcorr(S(1,:), S(1,:))`
`a2 = xcorr(S(2,:), S(2,:))`
then the 'score' would be something like r1/a1 and r2/a2? Is this what you mean?
oort
Did you mean to comment on tom10's answer?
Adam Crume
@oort: see my answer below
Jason S
+1  A: 

A good starting point is to get a sense of what a perfect match will look like by calculating the auto-correlations for each signal (i.e. do the "cross-correlation" of each signal with itself).
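
One way to turn the auto-correlation idea into a single score is to divide the cross-correlation peak by the geometric mean of the two zero-lag auto-correlations, which bounds the result between 0 and 1. The sketch below is plain Python rather than MATLAB so that it is self-contained; the helper names (`xcorr`, `remove_mean`, `normalized_peak`) are made up for illustration, the mean removal is an assumption about how the signals are normalized, and the signals are assumed to be equal-length and not constant. As far as I know, MATLAB's xcorr offers a 'coeff' scaling option for the same kind of normalization.

```python
import math
import random

def xcorr(x, y):
    """Zero-padded cross-correlation of two equal-length sequences.

    Returns values for lags -(n-1) .. (n-1); the value at a given lag
    is the sum over i of x[i + lag] * y[i].
    """
    n = len(x)
    result = []
    for lag in range(-(n - 1), n):
        total = 0.0
        for i in range(n):
            if 0 <= i + lag < n:
                total += x[i + lag] * y[i]
        result.append(total)
    return result

def remove_mean(x):
    """Subtract the average so a constant offset does not inflate the score."""
    m = sum(x) / len(x)
    return [v - m for v in x]

def normalized_peak(x, y):
    """Peak |cross-correlation| divided by sqrt of the zero-lag auto-correlations."""
    x, y = remove_mean(x), remove_mean(y)
    # The zero-lag auto-correlation of a signal is the sum of its squares.
    scale = math.sqrt(sum(v * v for v in x) * sum(v * v for v in y))
    return max(abs(v) for v in xcorr(x, y)) / scale

# Illustrative data only: a random signal, a flipped and rescaled copy, and an unrelated signal.
rng = random.Random(0)
s = [rng.uniform(-1, 1) for _ in range(200)]
flipped = [-2.0 * v for v in s]
unrelated = [rng.uniform(-1, 1) for _ in range(200)]
print(normalized_peak(s, flipped))
print(normalized_peak(s, unrelated))
```

Because of the normalization, a sign flip or rescaling of the reconstructed signal (both common with ICA) still scores 1, while an unrelated signal scores well below that.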

tom10
A: 

I would agree that the result would be subjective. Something involving the sum of the squares of the element-by-element differences would have some value: two identical arrays would give a value of 0 in that form. You would then have to decide what value counts as "bad". Make up 2 different vectors that "aren't too bad" and use their cross-correlation coefficient as a guide.

(Parenthetically: if you were doing a correlation coefficient, where 1 or -1 would be great and 0 would be awful, I've been told by bio-statisticians that a real-life value of 0.7 is extremely good. I understand that this is not exactly what you are doing, but the comment on correlation coefficient came up earlier.)
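
The two measures mentioned in this answer can be sketched as follows, in plain Python so the example is self-contained. The function names are hypothetical, and normalizing each signal to zero mean and unit variance first is an assumption (the question only says the signals are normalized). With that normalization the two measures carry the same information: the mean squared difference equals 2 - 2*r, where r is the correlation coefficient.

```python
import math

def normalize(x):
    """Shift to zero mean and scale to unit variance (population standard deviation)."""
    n = len(x)
    mean = sum(x) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in x) / n)
    return [(v - mean) / std for v in x]

def mean_squared_difference(x, y):
    """Mean of the element-by-element squared differences of the normalized signals."""
    xn, yn = normalize(x), normalize(y)
    return sum((a - b) ** 2 for a, b in zip(xn, yn)) / len(xn)

def correlation_coefficient(x, y):
    """Pearson correlation coefficient: mean product of the normalized signals."""
    xn, yn = normalize(x), normalize(y)
    return sum(a * b for a, b in zip(xn, yn)) / len(xn)
```

Identical signals (up to offset and positive scale) give a squared difference of 0 and a coefficient of 1; a sign-flipped copy gives 4 and -1, so with ICA output it may be worth comparing against both the signal and its negation.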

+1  A: 

THIS IS A COMPLETE GUESS - but I'm guessing max(abs(xcorr(S(1,:),X(1,:)))) > 0.8 implies success. Just out of curiosity, what kind of values do you get for max(abs(xcorr(S(1,:),X(2,:))))?

Another approach to validate your algorithm might be to compare A and W. If W is calculated correctly, it should be A^-1, so can you calculate a measure like |A*W - I|? Maybe you have to normalize by the trace of A*W.
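
A minimal sketch of that matrix check, in plain Python with matrices stored as lists of rows. The names `matmul` and `identity_error` are made up for illustration, and the Frobenius norm is just one reasonable choice for the |...| above. It uses W*A rather than A*W because X = W*T = W*A*S; for an exact inverse the two products are the same. One assumption to state loudly: ICA generally recovers sources only up to reordering and rescaling, so this comparison is meaningful only after the rows of W have been matched and rescaled, as the alignment step in the question presumably does.

```python
import math

def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def identity_error(a, w):
    """Frobenius norm of (W*A - I); zero when W is exactly the inverse of A."""
    p = matmul(w, a)
    n = len(p)
    return math.sqrt(sum((p[i][j] - (1.0 if i == j else 0.0)) ** 2
                         for i in range(n) for j in range(n)))

# Illustrative 2x2 mixing matrix and its exact inverse.
A = [[1.0, 2.0], [3.0, 4.0]]
W = [[-2.0, 1.0], [1.5, -0.5]]
print(identity_error(A, W))
```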

Getting back to your original question, I come from a DSP background, so I get to deal with fairly noise-free signals. I understand that's not a luxury you get in biology :) so my 0.8 guess might be very optimistic. Perhaps looking at some literature in your field, even if they aren't using cross-correlation exactly, might be useful.

mtrw
Actually, right now, since I'm only using random vectors of uniformly distributed random numbers as my test cases, I get R values of greater than 0.9 (when the reconstructed data matches the original data fairly well by eye), and less than 0.7 when the data is clearly not properly aligned. My problem is trying to figure out if there's a more formal way to describe 'looks good' and 'looks bad' without saying "I eyeballed it when comparing the graphs" (because I'm convinced that I can't get away with that kind of argument on the project status paper I have to submit).
oort
I'm sorry I can't give you a straight numeric answer. This seems like one of those cases where there's no substitute for experience.
mtrw
+1  A: 

Usually in such cases people talk about the "false acceptance rate" (FAR) and the "false rejection rate" (FRR). The first describes how often the algorithm says "similar" for non-similar signals; the second is the opposite, i.e. how often it says "not similar" for similar signals.

Selecting a threshold thus becomes a trade-off between these criteria. To make FAR=0, the threshold should be 1; to make FRR=0, the threshold should be -1.

So probably, you will need to decide which trade-off between FAR and FRR is acceptable in your situation and this will give the right value for threshold.

Mathematically, this can be expressed in different ways. Just a couple of examples:
1. fix one of the rates at an acceptable value and minimize the other
2. minimize max(FRR, FAR)
3. minimize a*FRR + b*FAR
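
As a rough sketch of the weighted criterion a*FRR + b*FAR, in plain Python: the function names are hypothetical, the convention that a score at or above the threshold counts as "similar" is an assumption, and the scores are made-up numbers for illustration only, not measured data.

```python
def error_rates(match_scores, mismatch_scores, threshold):
    """Return (FAR, FRR) when a score >= threshold is declared 'similar'."""
    false_accepts = sum(1 for s in mismatch_scores if s >= threshold)
    false_rejects = sum(1 for s in match_scores if s < threshold)
    return (false_accepts / len(mismatch_scores),
            false_rejects / len(match_scores))

def best_threshold(match_scores, mismatch_scores, candidates, a=1.0, b=1.0):
    """Candidate threshold minimizing a*FRR + b*FAR (the first one wins ties)."""
    def cost(t):
        far, frr = error_rates(match_scores, mismatch_scores, t)
        return a * frr + b * far
    return min(candidates, key=cost)

# Made-up similarity scores: pairs known to match, and pairs known not to.
match_scores = [0.95, 0.92, 0.97, 0.85, 0.91]
mismatch_scores = [0.40, 0.65, 0.55, 0.72, 0.30]
candidates = [0.5, 0.6, 0.7, 0.8, 0.9]
print(best_threshold(match_scores, mismatch_scores, candidates))
```

In practice the two score lists would come from running the comparison on many test cases where the right answer is known, for example correctly paired rows such as S(1,:) with X(1,:) versus deliberately mismatched rows such as S(1,:) with X(2,:).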

maxim1000
+6  A: 
Jason S
+1: Nicely done.
gnovice
Jason S, thanks so much, this definitely is about as thorough an answer as I could possibly hope to get!
oort
By the way, a caveat on the timeshift. First, what I have works for negative timeshifts, not positive ones. Second, I think xcorr() uses wraparound on the time index (e.g. array indices are taken modulo the array length, so that a vector of samples with a pulse in the middle correlates perfectly with another vector that has the front half of a pulse at the end and the back half at the beginning). This may not be what you want when you are comparing signals, so you may need to pad your vectors with zeros at the beginning/end to avoid wraparound.
Jason S
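
To illustrate the wraparound effect described in the comment above, here is a small plain-Python sketch. The names `circular_xcorr` and `zero_pad` are made up, and the code says nothing about what any particular xcorr implementation does internally (that is worth checking in its documentation); it only shows how zero-padding changes a correlation that does wrap.

```python
def circular_xcorr(x, y):
    """Cross-correlation with wraparound: indices are taken modulo len(x)."""
    n = len(x)
    return [sum(x[(i + lag) % n] * y[i] for i in range(n)) for lag in range(n)]

def zero_pad(x):
    """Append len(x) zeros so shifted samples wrap onto zeros, not onto data."""
    return list(x) + [0.0] * len(x)

pulse = [0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0]  # two-sample pulse inside the vector
split = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0]  # same pulse split across the two ends

wrapped_peak = max(circular_xcorr(pulse, split))
padded_peak = max(circular_xcorr(zero_pad(pulse), zero_pad(split)))
print(wrapped_peak, padded_peak)
```

With wraparound the split pulse lines up fully with the intact one (peak 2.0, the full pulse energy), whereas after padding only one sample can overlap at a time (peak 1.0).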