views:

89

answers:

2

I've got a large (4000 values) set of unsorted, normally distributed points. I'm taking each of these data points into bins whose limits are in the BinLimit array. Then I'm tabulating the number of values in each bin.

X is the array of raw data, and N is the number of data points. The number of bins desired (TotalBins) is specified by the user.

Method #1

for i=1:TotalBins
    Freq(i) = length(find(X >= BinLimit(j) & X <= BinLimit(j+1)));
    j = j + 1;
end

Method #2:

sort(X);

for i=1:N
    if (X(i) >= BinLimit(j) && X(i) <= BinLimit(j+1))
        Freq(j) = freq(j) + 1;
    elseif (j < TotalBins)
        Freq(j+1) = freq(j+1) + 1;
        j = j + 1;
    end
end

Now, I know that the first solution is slower - for normal values of Total Bins (25-50) it's about 8 times slower but I'm wondering if there's a faster, more efficient solution than what I'm doing in Method #2.

Also, this is homework, but this is really just out of curiosity as either of the above solutions do the job - it's a chem class.

+4  A: 

Use histc.

N = HISTC(X,EDGES), for vector X, counts the number of values in X that fall between the elements in the EDGES vector (which must contain monotonically non-decreasing values). N is a LENGTH(EDGES) vector containing these counts.

N(k) will count the value X(i) if EDGES(k) <= X(i) < EDGES(k+1). The last bin will count any values of X that match EDGES(end). Values outside the values in EDGES are not counted. Use -inf and inf in EDGES to include all non-NaN values.

E.g.

BinLimit = sort(rand(50,1));
X = rand(4000,1);
Freq = histc(X,BinLimit);

For fun:

%%# Generating data
X = rand(1000000,1);
BinLimit = sort(rand(50,1));

%%# OP's method
disp('Method #1');
disp('=========');
tic;
j =1;
TotalBins = numel(BinLimit)-1;
for i=1:TotalBins
    Freq(i) = length(find(X >= BinLimit(j) & X <= BinLimit(j+1)));
    j = j + 1;
end
toc

%%# histc
disp('histc');
disp('=====');
tic;
histc(X,BinLimit);
toc

%%# My method
disp('Alternative');
disp('===========');
tic;
TotalBins = numel(BinLimit)-1;
Freq = zeros(TotalBins,1);
for i = 1:TotalBins
    Freq(i) = sum(X >= BinLimit(i) & X <= BinLimit(i+1));
end
toc

After a few runs to warm it up:

Method #1
=========
Elapsed time is 0.803120 seconds.
histc
=====
Elapsed time is 0.030633 seconds.
Alternative
===========
Elapsed time is 0.704808 seconds.
Jacob
Well that's just no fun!
Radu
@Radu: Welcome to MATLAB :)
Jacob
:(Tested histc, and it's 1.4 times faster than what I did.
Radu
I'd still like to see a solution that bests mine in terms of code logic.
Radu
@Radu: Reinvent the wheel :) ? MATLAB's a lot about knowing the right routines and learning how to vectorize (and both are related), so ....
Jacob
@Jacob, that's awesome. Faster and the code is a lot simpler! Thanks.
Radu
Picked as answer because it beats the other options although I do like Amro's solution. Thanks!
Radu
+2  A: 

Apart from using HISTC, here's a vectorized one-line solution:

X = randn(4000,1);                %# column vector
BinLimits = linspace(-4,4,10);    %# row vector

Freq = sum( bsxfun(@ge, X, BinLimits(1:end-1)) & bsxfun(@le, X, BinLimits(2:end)) )

Note that its not space-efficient though, as it create a matrix of size: length(X) by length(BinLimits)-1

Amro
@Amro: Nice idea!
Jacob
@Jacob: Note that `BinLimits` is a row vector, while `X` is a column vector
Amro
@Amro: Fixed the real error!
Jacob
@Jacob: oh I just saw your edit, it was a typo after all :P
Amro
@Amro: Tested with my numbers and it seems like it's missing out the largest and smallest values. For one repetition it's about twice as slow as my solution but I can't argue with 1 liners.
Radu
@Radu: like I mentioned in the end, you need to test for the equality on one of the two sides, I'm just too lazy to figure out which ;)
Amro
@Amro: Ah, sorry, I didn't understand on first read.
Radu
@Radu: It turns out you need to do both sides. I edited the answer above
Amro