I have a MATLAB function that finds characteristic points in a sample. Unfortunately it only works about 90% of the time. But when I know at which places in the sample I am supposed to look, I can increase this to almost 100%. So I would like to know if there is a function in MATLAB that would allow me to find the range where most of my results are, so I can then recalculate my characteristic points. I have a vector which stores all the results, and the right results should lie inside a range of 3% between -24.000 and 24.000, whereas wrong results are always lower than the correct range. Unfortunately my background in statistics is very rusty, so I am not sure what this would be called. Can somebody give me a hint what I should be looking for? Is there a function built into MATLAB that would give me the smallest possible range where e.g. 90% of the results lie?

EDIT: I am sorry if I didn't make my question clear. Everything in my vector can only range between -24.000 and 24.000. About 90% of my results will be in a range which spans approximately 1.44 ([24-(-24)]*3% = 1.44). These are very likely to be the correct results. The remaining 10% are outside of that range and always lower (which is why I am not sure that taking the mean value is a good idea). These 10% are false and result from blips in my input data. To find the remaining 10% I want to repeat my calculations, but now I only want to check the small range. So, my goal is to identify where my correct range lies, delete the values I have found outside of that range, and then recalculate my values, not on the range between -24.000 and 24.000, but rather on the small range where I already found 90% of my values.

+1  A: 

Maybe you should try the mean value (in MATLAB: mean) and the standard deviation (in MATLAB: std)?

What is the statistical distribution of your data?

See also this wiki page, section "Interpretation and application". In general, Chebyshev's inequality holds for almost every distribution and is very useful here.

In most cases this should work:

meanVal = mean(data)
stDev = std(data)

and probably most (at least 75%) of your values will lie in the range:

[meanVal - 2*stDev, meanVal + 2*stDev]
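
As a minimal sketch (my addition, not part of the original answer), keeping only the values inside that range could look like:

meanVal = mean(data);
stDev = std(data);
% keep the values within two standard deviations of the mean
inRange = data(abs(data - meanVal) < 2*stDev);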
Gacek
Typo: that should be 95% of the values in the +/- 2*stDev range, assuming it's normally distributed data...
Harlan
Yes, you are right. But I'm not assuming it is normally distributed. Chebyshev's inequality holds in the general case, and it guarantees at least 75% within 2*stDev (since 1 - 1/2^2 = 0.75).
Gacek
Thanks for your answer. I added a little bit to my post to make it clearer. Maybe your approach is sufficient. Unfortunately I won't have time to work on this today anymore.
Lucas
And after the edit, what I wrote still applies. Play with it, I'm sure it will work for you.
Gacek
+2  A: 

The relevant points you're looking for are the percentiles:

% generate sample data
data = [randn(900,1) ; randn(50,1)*3 + 5 ; randn(50,1)*3 - 5];
subplot(121), hist(data)
subplot(122), boxplot(data)

% find 5th, 95th percentiles (range that contains 90% of the data)
limits = prctile(data, [5 95])

% find data in that range
reducedData = data(limits(1)<data & data < limits(2));

Other approaches exist to detect outliers, such as the IQR outlier test and the three-standard-deviation rule, among many others:

%% three standard deviation rule
z = 3;
bounds = z * std(data)
reducedData = data( abs(data-mean(data)) < bounds );

and

%% IQR outlier test
Q = prctile(data, [25 75]);
IQ = Q(2)-Q(1);
%a = 1.5;   % mild outlier
a = 3.0;    % extreme outlier
bounds = [Q(1)-a*IQ , Q(2)+a*IQ]   % fences at Q1 - a*IQR and Q3 + a*IQR
reducedData = data(bounds(1)<data & data<bounds(2));


BTW, if you want to get the z value (|X| < z) that corresponds to 90% of the area under the standard normal curve, use:

area = 0.9;                 % two-tailed probability
z = norminv(1-(1-area)/2)
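
As a usage sketch (my addition, and only appropriate if the data are roughly normally distributed), that z can be combined with the mean and standard deviation to get an interval expected to cover about 90% of the values:

area = 0.9;                       % desired two-tailed coverage
z = norminv(1-(1-area)/2);        % ~1.645 for 90% under a normal curve
bounds = mean(data) + z*std(data)*[-1 1]
reducedData = data(bounds(1) < data & data < bounds(2));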
Amro
Thank you. That seems to do the trick!
Lucas
A: 

It seems like you want to find the number x in [-24,24] that maximizes the number of sample points in [x, x+1.44]. Probably the fastest way to do this involves a sort of the sample points, which is ultimately O(n log n) time. A cheesy approximation would be as follows:

n_brkpoints = 200; %choose n_brkpoints big, but < # of sample points
brkpoints = linspace(-24,24-1.44,n_brkpoints);
n_count = histc(data,[brkpoints,inf]); %count # data points between breakpoints;
accbins = round(1.44 / (brkpoints(2) - brkpoints(1))); %# of bins spanning the 1.44-wide window;
cscount = cumsum(n_count(:)); %half of the boxcar sum computation;
boxsum  = cscount - [zeros(accbins,1);cscount(1:end-accbins)]; %2nd half: counts per window;
[dum,maxi] = max(boxsum); %which window has the maximal # counts?
lorange = brkpoints(max(maxi-accbins+1,1));   %the lower edge of that window;
hirange = lorange + 1.44

This solution does fudge some of the corner cases around the bottom and top bins, etc.
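
For reference, a sketch of the exact sort-based version mentioned above (my addition, not from the original answer, assuming the fixed window width of 1.44) could look like this:

% sort once, then slide a 1.44-wide window over the sorted samples
% with two indices and keep the position that covers the most points
w = 1.44;
s = sort(data(:));
n = numel(s);
best = 0; lorange = s(1);
j = 1;
for i = 1:n
    while j < n && s(j+1) <= s(i) + w   % grow the right edge of the window
        j = j + 1;
    end
    if j - i + 1 > best                 % more points than any window so far?
        best = j - i + 1;
        lorange = s(i);
    end
end
hirange = lorange + w

This only considers windows that start at a sample point, which is enough to maximize the count for a fixed window width.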

Note that if you're going to go the Chebyshev inequality route, Petunin's inequality (i.e., the Vysochanskij-Petunin inequality) is probably applicable and will give a slight improvement.

shabbychef