ansaurus

Question

Finding a reasonable (noise-free) maximum element in a vector

Answer 1

+1 A:

Do you know the bounds of your noise-free elements? For example, do you know that they lie between -10 and 10?

In that case, you could remove the noise first and then find the max:

max( v( find(v<=10 & v>=-10) ) )

ThibThib 2009-07-21 23:39:32

Actually, you don't even need the call to FIND... you can index `v` with a logical vector: `max(v(abs(v) <= 10))`
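As a small illustration of that point, the sketch below applies both forms to a made-up vector. The values in `v` and the ±10 bounds are assumptions for the example only, not data from the question.

```matlab
% Hypothetical data: a few plausible readings plus two obvious outliers
v = [3 -7 250 9 -1200 4];

% Original form: FIND converts the logical mask to linear indices first
maxWithFind = max(v(find(v <= 10 & v >= -10)));

% Logical indexing: the mask is used directly to select elements
maxWithMask = max(v(abs(v) <= 10));

% Both select the same elements, so both maxima are the same
fprintf('%g %g\n', maxWithFind, maxWithMask);
```

Both expressions pick out the elements 3, -7, 9 and 4, so the two results agree.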

gnovice 2009-07-21 23:54:39

No, I do not know where the noise-free elements are.

Jacob 2009-07-22 16:51:06

Answer 2

+3 A:

I would not sort and then difference. If you have some reason to expect continuity or bounded change (e.g., the vector holds consecutive sensor readings), then sorting will destroy the time information (or whatever the vector index represents). Filtering by detecting large spikes isn't a bad idea, but you would want to compare each spike to a larger neighborhood (a 2nd difference effectively has you looking within a window of ±2).

You need to describe formally the expected information in the vector, and the type of noise.

You need to know the frequency and distribution of errors and non-errors. In the simplest model, the elements in your vector are independent and identically distributed, and errors are all or none (you randomly choose to store the true value, or an error). You should be able to figure out for each element the chance that it's accurate, vs. the chance that it's noise. This could be very easy (error data values are always in a certain range which doesn't overlap with non-error values), or very hard.

To simplify: don't make any assumptions about what kind of data an error produces (the worst case is that you can't rule out any of the error data points as ridiculous, but they're all at or above the maximum among the non-error measurements). Then, if the probability of error is p and your vector has n elements, the chance that the kth highest element in the vector is less than or equal to the true maximum is given by the cumulative binomial distribution: http://en.wikipedia.org/wiki/Binomial_distribution
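One way to read the paragraph above: under the worst-case assumption, the kth highest element is at or below the true maximum exactly when at most k-1 of the n elements are errors, which is the binomial CDF evaluated at k-1. A minimal sketch of that calculation follows. The values of `n`, `p` and `k` are hypothetical, and the sum is written out with `gammaln` so that no toolbox function is needed.

```matlab
n = 1000;   % Vector length (hypothetical)
p = 0.01;   % Probability that any one element is an error (hypothetical)
k = 20;     % Rank from the top that we intend to trust

% P(number of errors <= k-1) = sum_{j=0}^{k-1} nchoosek(n,j) * p^j * (1-p)^(n-j)
% Work in log space to avoid overflow in the binomial coefficients.
j = 0:(k-1);
logTerms = gammaln(n+1) - gammaln(j+1) - gammaln(n-j+1) ...
           + j.*log(p) + (n-j).*log(1-p);
probKthIsClean = sum(exp(logTerms));

fprintf('P(kth highest <= true max) = %.4f\n', probKthIsClean);
```

Raising k makes the chosen element more likely to be a genuine reading, at the cost of possibly underestimating the true maximum.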

wrang-wrang 2009-07-21 23:46:19

I didn't intend on sorting the difference: I proposed finding the difference of the sorted values to find spikes.

Jacob 2009-07-22 16:50:21

I said sort *THEN* difference, not sort *THE* difference :) My reasons against doing that still stand. Taking adjacent differences between things that weren't originally adjacent is questionable.

wrang-wrang 2009-07-22 19:40:53

Your sample graph makes it very clear that you can just use rank statistics (which do require sorting or binning). Since your bad values occur much less than half the time, you should throw out everything that is farther than a certain distance from the median, i.e. drop each i for which abs(v[i]-v_median)>t.
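In MATLAB, the median rule described above might look like the following sketch. The data in `V` and the tolerance `t` are made up for illustration; a suitable `t` depends on the spread of the real signal.

```matlab
% Hypothetical data: readings near 5 with two large corrupted values
V = [3 5 4 6 5 1e6 4 5 2e6 6];
t = 10;                            % Allowed distance from the median (assumed)

vMedian = median(V);               % Robust as long as fewer than half the values are bad
keep = abs(V - vMedian) <= t;      % Logical mask of elements close to the median
maxValue = max(V(keep));           % Maximum over the retained elements only

fprintf('median = %g, max of kept values = %g\n', vMedian, maxValue);
```

Here the two corrupted values are dropped and the reported maximum is 6.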

wrang-wrang 2009-07-22 19:48:20

Answer 3

+3 A:

First, pick your favorite method for identifying outliers...

Ken 2009-07-21 23:49:06

Answer 4

+2 A:

If you expect the numbers to come from a normal distribution, you can use, say, 2 standard deviations (2×sd) above the mean to determine your max.

Draemon 2009-07-22 00:09:47

That's interesting but still not quite correct. Some linear combination of the mean and sd works, but how do I determine it?

Jacob 2009-07-22 16:57:21

Answer 5

+4 A:

NEW ANSWER:

Based on your plot of the sorted amplitudes, your diff(sort(V)) algorithm would probably work well. You would simply have to pick a threshold for what constitutes "too large" a difference between the sorted values. The first point in your diff(sort(V)) vector that exceeds that threshold is then used to get the threshold to use for V. For example:

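diffThreshold = 2e5;                 % Largest gap allowed between sorted neighbors
sortedVector = sort(V);              % Sort amplitudes in ascending order
index = find(diff(sortedVector) > diffThreshold,1,'first');  % First gap that is too large
signalThreshold = sortedVector(index);  % Last value before that gap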
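To go from that threshold to the maximum itself, one possible end-to-end sketch is shown below. The sample vector is invented, and the fallback for the case where no gap exceeds `diffThreshold` is an added assumption, not part of the snippet above.

```matlab
% Hypothetical data: amplitudes around 1e4 with two corrupted values
V = [12000 15000 9000 14000 5e6 11000 7e6 13000];

diffThreshold = 2e5;
sortedVector = sort(V);
index = find(diff(sortedVector) > diffThreshold,1,'first');

if isempty(index)
    % No suspicious gap found: treat every element as valid (assumed fallback)
    signalThreshold = sortedVector(end);
else
    signalThreshold = sortedVector(index);
end

% Largest element at or below the threshold
reasonableMax = max(V(V <= signalThreshold));
fprintf('threshold = %g, reasonable max = %g\n', signalThreshold, reasonableMax);
```

With this data the first large gap sits between 15000 and 5e6, so both the threshold and the reasonable maximum come out as 15000.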

Another alternative, if you're interested in toying with it, is to bin your data using HISTC. You would end up with groups of highly-populated bins at both low and high amplitudes, with sparsely-populated bins in between. It would then be a matter of deciding which bins you count as part of the low-amplitude group (such as the first group of bins that contain at least X counts). For example:

binEdges = min(V):1e7:max(V);  % Create vector of bin edges
n = histc(V,binEdges);         % Bin amplitude data
binThreshold = 100;            % Pick threshold for number of elements in bin
index = find(n < binThreshold,1,'first');  % Find first bin whose count is low
signalThreshold = binEdges(index);

OLD ANSWER (for posterity):

Finding a "reasonable maximum element" is wholly dependent upon your definition of reasonable. There are many ways you could define a point as an outlier, such as simply picking a set of thresholds and ignoring everything outside of what you define as "reasonable". Assuming your data has a normal-ish distribution, you could probably use a simple data-driven thresholding approach for removing outliers from a vector V using the functions MEAN and STD:

nDevs = 2;    % The number of standard deviations to use as a threshold
index = abs(V-mean(V)) <= nDevs*std(V);  % Index of "reasonable" values
maxValue = max(V(index));              % Maximum of "reasonable" values

gnovice 2009-07-22 00:19:13

This is true if you don't have too many corrupted elements. Otherwise your mean will be corrupted, and even the noise-free elements will fall outside the bound `nDevs*std(V)`.
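A small made-up case shows how contamination can defeat the mean/std threshold. In the sketch below, three of the ten values in `V` are corrupted (the numbers are assumptions chosen for illustration); that is enough to inflate both the mean and the standard deviation until nothing is rejected.

```matlab
% Hypothetical data: 30% of the elements are corrupted
V = [5 6 4 5 6 1e6 1e6 1e6 5 4];

nDevs = 2;
index = abs(V - mean(V)) <= nDevs*std(V);  % Same test as in the answer above
maxValue = max(V(index));
numKept = nnz(index);

fprintf('kept %d of %d elements, max = %g\n', numKept, numel(V), maxValue);
```

Here the mean is about 3e5 and the standard deviation about 4.8e5, so every element, including the corrupted ones, lies within two standard deviations and the reported maximum is still 1e6.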

ThibThib 2009-07-22 08:04:22

@ThibThib: You are correct. I was just giving an example for dealing with data that has a near-normal distribution. It's difficult to give anything other than general examples since the author doesn't give much detail about the types of signals he's working with.

gnovice 2009-07-22 13:56:31

My apologies for the vague OP. I hope the updated version is more informative.

Jacob 2009-07-22 16:58:10
