I have a bunch of time series, each described by two components: a timestamp vector (in seconds) and a vector of measured values. The time vector is non-uniform (i.e. sampled at irregular intervals).

I am trying to compute the mean/SD of the values in each 1-minute interval (take an X-minute interval, compute its mean, move to the next interval, ...).

My current implementation uses loops. This is a sample of what I have so far:

t = (100:999)' + rand(900,1);       % non-uniform time
x = 5*rand(900,1) + 10;             % x(i) is the value at time t(i)

interval = 1;         % 1-min interval
tt = ( floor(t(1)):interval*60:ceil(t(end)) )';  % stopping points of each interval
N = length(tt)-1;

mu = zeros(N,1);
sd = zeros(N,1);

for i=1:N
    indices = ( tt(i) <= t & t < tt(i+1) ); % find t between tt(i) and tt(i+1)
    mu(i) = mean( x(indices) );
    sd(i) = std( x(indices) );
end

I am wondering if there is a faster vectorized solution. This is important because I have a large number of time series to process, each much longer than the sample shown above.

Any help is welcome.
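For comparison outside MATLAB, the loop-based binning above can be sketched in NumPy (a rough translation, not the asker's code; the seeded random data is my stand-in, and `ddof=1` is used to match MATLAB's sample standard deviation):

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.sort(np.arange(100, 1000) + rng.random(900))  # non-uniform, sorted times in seconds
x = 5 * rng.random(900) + 10                         # x[i] is the value at time t[i]

interval = 1                                         # interval length in minutes
edges = np.arange(np.floor(t[0]), np.ceil(t[-1]) + 1, interval * 60)
n = len(edges) - 1

mu = np.empty(n)
sd = np.empty(n)
for i in range(n):
    in_bin = (edges[i] <= t) & (t < edges[i + 1])    # samples falling in interval i
    mu[i] = x[in_bin].mean()
    sd[i] = x[in_bin].std(ddof=1)                    # ddof=1 matches MATLAB's std
```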


Thank you all for the feedback.

I corrected the way t is generated so that it is always monotonically increasing (sorted); this was not really an issue.

Also, I may not have stated this clearly, but my intention was to have a solution for any interval length in minutes (1 min was just an example).

+4  A: 

You could try and create a cell array and apply mean and std via cellfun. It's ~10% slower than your solution for 900 entries, but ~10x faster for 90000 entries.

[t,sortIdx]=sort(t); %# we only need to sort in case t is not monotonically increasing
x = x(sortIdx);

tIdx = floor(t/60); %# convert seconds to minutes - divide by 300 instead for 5-min intervals
tIdx = tIdx - min(tIdx) + 1; %# tIdx is now a vector of interval indices, starting at 1 like the iteration variable in your loop

%# the next few commands count how many 1's, 2's, 3's etc. are in tIdx
dt = [diff(tIdx);1];
stepIdx = [0;find(dt>0)];
nIdx = stepIdx(2:end) - stepIdx(1:end-1); %# number of times each index appears

%# convert to cell array
xCell = mat2cell(x,nIdx,1);

%# use cellfun to calculate the mean and sd
mu(tIdx(stepIdx+1)) = cellfun(@mean,xCell); %# the indexing is like that since there may be missing steps
sd(tIdx(stepIdx+1)) = cellfun(@std,xCell);

Note: my solution does not give the exact same results as yours, since you skip a few time values at the end (1:60:90 is [1,61]), and since the start of the interval is not exactly the same.
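For readers following along outside MATLAB, the same grouping idea (bin index per sample, change points, then split into per-bin chunks) can be sketched in NumPy; this is a loose translation with made-up data, not part of the original answer:

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.sort(np.arange(100, 1000) + rng.random(900))
x = 5 * rng.random(900) + 10

order = np.argsort(t)              # only needed if t is not already sorted
t, x = t[order], x[order]

idx = (t // 60).astype(int)        # seconds -> minute index (use // 300 for 5-min bins)
idx -= idx.min()                   # shift so the first occupied bin is 0

steps = np.flatnonzero(np.diff(idx)) + 1   # positions where the bin index changes
chunks = np.split(x, steps)                # one chunk per occupied bin

mu = np.array([c.mean() for c in chunks])
sd = np.array([c.std(ddof=1) for c in chunks])
```

As in the MATLAB version, empty bins are simply skipped rather than producing NaN entries.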

Jonas
Thanks! I have a couple of points: [1] you're right about the way I generated `t` - it may not always be monotonically increasing; that was not intended! [2] Even though I'm still deciphering the code, I really need the interval length to be parameterized (5 min is what I'm working on now, but that should be easily changeable)...
merv
[3] the truth is, after you computed `stepIdx` I got a little lost :) could you explain what `nIdx` represents? I get the part where you compute the minute part of each timestamp and then take the differences to find where it changes, indicating the next 1-min interval, but I couldn't follow after that..
merv
nIdx is the number of times each index appears. I need this to be able to use mat2cell, which distributes the first n values into the first cell, the second n values into the second cell, etc., thus grouping the indices that belong to each time interval. I hope the additional comments help make it clearer. Sorry for writing hard-to-read code. I should (have been) working on something different, so I answered this in a hurry :)
Jonas
thank you, I appreciate the help.. What if I wanted a different interval length (not just 1 minute)? The consecutive-differences trick won't work - any idea how to change this for any X minutes?
merv
See the comments in the function. Simply create tIdx by dividing by s seconds, where s is the number of seconds in the interval.
Jonas
+2  A: 

You can compute indices all at once using bsxfun:

indices = ( bsxfun(@ge, t, tt(1:end-1)') & bsxfun(@lt, t, tt(2:end)') );

This is faster than looping, but requires storing all the indices at once (a time vs. space tradeoff).
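The bsxfun trick maps directly onto NumPy broadcasting; here is a sketch under the same setup (the data generation is my stand-in):

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.sort(np.arange(100, 1000) + rng.random(900))
x = 5 * rng.random(900) + 10
edges = np.arange(np.floor(t[0]), np.ceil(t[-1]) + 1, 60)

# one boolean column per interval; costs O(len(t) * n_intervals) memory at once
mask = (t[:, None] >= edges[None, :-1]) & (t[:, None] < edges[None, 1:])

# as in MATLAB, the columns still have to be consumed one at a time
mu = np.array([x[mask[:, j]].mean() for j in range(mask.shape[1])])
```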

Amro
I like this one. The only problem is that I can't use the indices directly without a for loop: doing `x(indices)` didn't work; instead I have to do `for i=1:N, x(indices(:,i)), end`
merv
+2  A: 

Here's a way that uses binary search. It is 6-10x faster for 9900 elements and about 64x faster for 99900 elements. It was hard to get reliable times using only 900 elements, so I'm not sure which is faster at that size. It uses almost no extra memory if you consider making tx directly from the generated data. Other than that, it just has four extra scalar variables (prevind, first, mid, and last).

% Sort the data so that we can use binary search (sorting takes O(N log N) time).
tx = sortrows([t x]);

prevind = 1;

for i=1:N
    % First do a binary search to find the end of this section
    first = prevind;
    last = length(tx);
    while first ~= last
        mid = floor((first+last)/2);
        if tt(i+1) > tx(mid,1)
            first = mid+1;
        else
            last = mid;
        end;
    end;
    mu(i) = mean( tx(prevind:last-1,2) );
    sd(i) = std( tx(prevind:last-1,2) );
    prevind = last;
end;

It uses all of the variables that you had originally. I hope that it suits your needs. It is faster because it takes O(log N) to find the indices with binary search, but O(N) to find them the way you were doing it.
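In NumPy, the hand-rolled binary search collapses into np.searchsorted, which locates all interval boundaries at once; a sketch with made-up data (not the answerer's code):

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.sort(np.arange(100, 1000) + rng.random(900))  # sorted, as binary search requires
x = 5 * rng.random(900) + 10
edges = np.arange(np.floor(t[0]), np.ceil(t[-1]) + 1, 60)

# first index with t >= each edge: O(n_intervals * log N) comparisons in total
bounds = np.searchsorted(t, edges)

mu = np.array([x[a:b].mean() for a, b in zip(bounds[:-1], bounds[1:])])
sd = np.array([x[a:b].std(ddof=1) for a, b in zip(bounds[:-1], bounds[1:])])
```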

Justin Peel
This should get even faster if you preassign mu and sd first instead of growing them inside the loop.
Jonas
@Jonas I thought that would be implied since it was in the asker's code. This is just to replace the last 5 lines of the asker's code. I thought that the last 5 lines were the slow ones.
Justin Peel
Is a binary search (with loops) faster than the vectorized vector comparison I started with?
merv
@merv Yes, the comparisons of timings I did were compared to the version you posted in your question. Vectorizing only gets you so far; it makes it faster at doing the same operation on every element than looping, but it is still doing the operation on every element. That means that your method does 14*900 = 12600 comparisons (for 900 elements) to find the indices to take the mean or std of while this binary search method does less than 14*log_2(900) comparisons = 140 comparisons to find them. The disparity only gets worse with increasing array size and increasing number of intervals.
Justin Peel
+1  A: 

Disclaimer: I worked this out on paper, but haven't yet had the opportunity to check it "in silico"...

You may be able to avoid loops or using cell arrays by doing some tricky cumulative sums, indexing, and calculating the means and standard deviations yourself. Here's some code that I believe will work, although I am unsure how it stacks up speed-wise to the other solutions:

[t,sortIndex] = sort(t);  %# Sort the time points
x = x(sortIndex);         %# Sort the data values
interval = 60;            %# Interval size, in seconds

intervalIndex = floor((t-t(1))./interval)+1;  %# Collect t into intervals
nIntervals = max(intervalIndex);              %# The number of intervals
mu = zeros(nIntervals,1);                     %# Preallocate mu
sd = zeros(nIntervals,1);                     %# Preallocate sd

sumIndex = [find(diff(intervalIndex)); ...
            numel(intervalIndex)];  %# Find indices of the interval ends
n = diff([0; sumIndex]);            %# Number of samples per interval
xSum = cumsum(x);                   %# Cumulative sum of x
xSum = diff([0; xSum(sumIndex)]);   %# Sum per interval
xxSum = cumsum(x.^2);               %# Cumulative sum of x^2
xxSum = diff([0; xxSum(sumIndex)]); %# Squared sum per interval

intervalIndex = intervalIndex(sumIndex);  %# Find index into mu and sd
mu(intervalIndex) = xSum./n;                             %# Compute mean
sd(intervalIndex) = sqrt((xxSum-xSum.*xSum./n)./(n-1));  %# Compute std dev

The above computes the standard deviation using the one-pass simplification sd = sqrt((sum(x.^2) - (sum(x)).^2./n)./(n-1)) of the usual formula (described on the Wikipedia page for standard deviation).
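The same cumulative-sum bookkeeping translates to NumPy almost line for line; this is my sketch (seeded random data assumed, sample standard deviation via `ddof`-style n-1 denominator):

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.sort(np.arange(100, 1000) + rng.random(900))
x = 5 * rng.random(900) + 10

idx = ((t - t[0]) // 60).astype(int)                          # interval index per sample

ends = np.append(np.flatnonzero(np.diff(idx)), len(idx) - 1)  # last sample of each interval
counts = np.diff(np.append(-1, ends))                         # samples per occupied interval
s  = np.diff(np.append(0.0, np.cumsum(x)[ends]))              # sum per interval
ss = np.diff(np.append(0.0, np.cumsum(x**2)[ends]))           # sum of squares per interval

mu = s / counts
sd = np.sqrt((ss - s * s / counts) / (counts - 1))            # shortcut variance formula
```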

gnovice
Thanks for the response, I guess it would be interesting to compare the timing against the other solutions.
merv
+7  A: 

The only logical solution seems to be...

Ok. I find it funny that to me there is only one logical solution, but many others find other solutions. Regardless, the solution does seem simple. Given the vectors x and t, and a set of equally spaced break points tt,

t = sort((100:999)' + 3*rand(900,1));     % non-uniform time
x = 5*rand(900,1) + 10;             % x(i) is the value at time t(i)

tt = ( floor(t(1)):1*60:ceil(t(end)) )';

(Note that I sorted t above.)

I would do this in three fully vectorized lines of code. First, if the breaks were arbitrary and potentially unequal in spacing, I would use histc to determine which intervals the data series falls in. Given they are uniform, just do this:

int = 1 + floor((t - t(1))/60);

Again, if the elements of t were not known to be sorted, I would have used min(t) instead of t(1). Having done that, use accumarray to reduce the results into a mean and standard deviation.

mu = accumarray(int,x,[],@mean);
sd = accumarray(int,x,[],@std);
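NumPy has no accumarray, but np.bincount with weights plays the same role; a sketch of the equivalent (my translation, assuming as here that every interval contains at least two samples so n-1 > 0):

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.sort(np.arange(100, 1000) + rng.random(900))
x = 5 * rng.random(900) + 10

idx = ((t - t[0]) // 60).astype(int)   # same role as `int` above (0-based here)

counts = np.bincount(idx)              # samples per interval
mu = np.bincount(idx, weights=x) / counts
ss = np.bincount(idx, weights=x**2)
sd = np.sqrt((ss - counts * mu**2) / (counts - 1))   # sample standard deviation
```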
woodchips
+1: For some reason, I completely overlooked ACCUMARRAY.
gnovice
thanks, this is both concise and easy to read
merv
I didn't even know about accumarray. Thanks for demonstrating how useful it can be!
Jonas
