I have a MATLAB function that finds characteristic points in a sample. Unfortunately it only works about 90% of the time. But when I know at which places in the sample I am supposed to look, I can increase this to almost 100%. So I would like to know if there is a function in MATLAB that would allow me to find the range where most of my results are, so I can then recalculate my characteristic points. I have a vector which stores all the results, and the right results should lie inside a range of 3% between -24.000 and 24.000, whereas wrong results are always lower than the correct range. Unfortunately my background in statistics is very rusty, so I am not sure what this would be called. Can somebody give me a hint what I should be looking for? Is there a function built into MATLAB that would give me the smallest possible range where e.g. 90% of the results lie?

EDIT: I am sorry if I didn't make my question clear. Everything in my vector can only range between -24.000 and 24.000. About 90% of my results will be in a range which spans approximately 1.44 ([24-(-24)]*3% = 1.44). These are very likely to be the correct results. The remaining 10% are outside of that range and always lower (which is why I am not sure that taking the mean value is a good idea). These 10% are false and result from blips in my input data. To find the remaining 10% I want to repeat my calculations, but now I only want to check the small range. So, my goal is to identify where my correct range lies, delete the values I have found outside of that range, and then recalculate my values, not on the range between -24.000 and 24.000, but rather on the small range where I already found 90% of my values.

+1  A: 

Maybe you should try the mean value (in MATLAB: mean) and the standard deviation (in MATLAB: std)?

What is the statistical distribution of your data?

See also this wiki page, section "Interpretation and application". In general, Chebyshev's inequality holds for almost every distribution and is very useful here.

In most cases this should work:

meanVal = mean(data)
stDev = std(data)

and probably most (at least 75%) of your values will lie in the range:

[meanVal - 2*stDev, meanVal + 2*stDev]
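
As a minimal sketch (my addition, not part of the original answer), keeping only the values inside that range could look like:

meanVal = mean(data);
stDev = std(data);
% keep the values within two standard deviations of the mean
inRange = data(abs(data - meanVal) < 2*stDev);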
Gacek
Typo: that should be 95% of the values in the +/- 2*stDev range, assuming it's normally distributed data...
Harlan
Yes, you are right. But I'm not assuming it is normally distributed. Chebyshev's inequality holds in the general case, and it guarantees at least 75% within 2*stDev (since 1 - 1/2^2 = 0.75).
Gacek
Thanks for your answer. I added a little bit to my post to make it clearer. Maybe your approach is sufficient. Unfortunately I won't have time to work on this today anymore.
Lucas
And after the edit, what I wrote still applies. Play with it, I'm sure it will work for you.
Gacek
+2  A: 

The relevant points you're looking for are the percentiles:

% generate sample data
data = [randn(900,1) ; randn(50,1)*3 + 5 ; randn(50,1)*3 - 5];
subplot(121), hist(data)
subplot(122), boxplot(data)

% find 5th, 95th percentiles (range that contains 90% of the data)
limits = prctile(data, [5 95])

% find data in that range
reducedData = data(limits(1)<data & data < limits(2));

Other approaches exist to detect outliers, such as the IQR outlier test and the three-standard-deviation rule, among many others:

%% three standard deviation rule
z = 3;
bounds = z * std(data)
reducedData = data( abs(data-mean(data)) < bounds );

and

%% IQR outlier test
Q = prctile(data, [25 75]);
IQ = Q(2)-Q(1);
%a = 1.5;   % mild outlier
a = 3.0;    % extreme outlier
bounds = [Q(1)-a*IQ , Q(2)+a*IQ]   % fences at Q1 - a*IQR and Q3 + a*IQR
reducedData = data(bounds(1)<data & data<bounds(2));


BTW, if you want to get the z value (|X| < z) that corresponds to 90% of the area under the standard normal curve, use:

area = 0.9;                 % two-tailed probability
z = norminv(1-(1-area)/2)
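
As a usage sketch (my addition, and only appropriate if the data are roughly normally distributed), that z can be combined with the mean and standard deviation to get an interval expected to cover about 90% of the values:

area = 0.9;                       % desired two-tailed coverage
z = norminv(1-(1-area)/2);        % ~1.645 for 90% under a normal curve
bounds = mean(data) + z*std(data)*[-1 1]
reducedData = data(bounds(1) < data & data < bounds(2));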
Amro
Thank you. That seems to do the trick!
Lucas
A: 

It seems like you want to find the number x in [-24,24] that maximizes the number of sample points in [x, x+1.44]. Probably the fastest way to do this involves a sort of the sample points, which is ultimately O(n log n) time. A cheesy approximation would be as follows:

n_brkpoints = 200; %choose n_brkpoints big, but < # of sample points
brkpoints = linspace(-24,24-1.44,n_brkpoints);
n_count = histc(data,[brkpoints,inf]); %count # data points between breakpoints;
accbins = round(1.44 / (brkpoints(2) - brkpoints(1))); %# of bins spanning the 1.44-wide window;
cscount = cumsum(n_count(:)); %half of the boxcar sum computation;
boxsum  = cscount - [zeros(accbins,1);cscount(1:end-accbins)]; %2nd half: counts per window;
[dum,maxi] = max(boxsum); %which window has the maximal # counts?
lorange = brkpoints(max(maxi-accbins+1,1));   %the lower edge of that window;
hirange = lorange + 1.44

This solution does fudge some of the corner cases around the bottom and top bins, etc.
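
For reference, a sketch of the exact sort-based version mentioned above (my addition, not from the original answer, assuming the fixed window width of 1.44) could look like this:

% sort once, then slide a 1.44-wide window over the sorted samples
% with two indices and keep the position that covers the most points
w = 1.44;
s = sort(data(:));
n = numel(s);
best = 0; lorange = s(1);
j = 1;
for i = 1:n
    while j < n && s(j+1) <= s(i) + w   % grow the right edge of the window
        j = j + 1;
    end
    if j - i + 1 > best                 % more points than any window so far?
        best = j - i + 1;
        lorange = s(i);
    end
end
hirange = lorange + w

This only considers windows that start at a sample point, which is enough to maximize the count for a fixed window width.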

Note that if you're going to go the Chebyshev inequality route, Petunin's inequality (i.e., the Vysochanskij-Petunin inequality) is probably applicable and will give a slight improvement.

shabbychef