I've got some code that works but is a bit of a bottleneck, and I'm stuck trying to figure out how to speed it up. It runs in a loop, and I can't figure out how to vectorize it.

I've got a 2D array, vals, that represents timeseries data. Rows are dates, columns are different series. I'm trying to bucket the data by months to perform various operations on it (sum, mean, etc). Here is my current code:

allDts; %Dates/times for vals.  Size is [size(vals, 1), 1]
vals;
[Y M] = datevec(allDts);
fomDates = unique(datenum(Y, M, 1)); %first of the month dates

[Y M] = datevec(fomDates);
nextFomDates = datenum(Y, M, DateUtil.monthLength(Y, M)+1);

newVals = nan(length(fomDates), size(vals, 2)); %preallocate for speed

for k = 1:length(fomDates)
    % This next line is the bottleneck, because it runs on every pass through the loop.
    idx = (allDts >= fomDates(k)) & (allDts < nextFomDates(k));
    bucketed = vals(idx, :);
    newVals(k, :) = nansum(bucketed);
end %for

Any ideas? Thanks in advance.

+2  A: 

That's a difficult problem to vectorize. I can suggest a way to do it using CELLFUN, but I can't guarantee that it will be faster for your problem (you would have to time it yourself on the specific data sets you are using). As discussed in this other SO question, vectorized code doesn't always run faster than a for loop; which option is best can be very problem-specific. With that disclaimer, here are two solutions for you to try: a CELLFUN version, and a modification of your for-loop version that may run faster.

CELLFUN SOLUTION:

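[Y,M] = datevec(allDts);
monthStart = datenum(Y,M,1);  % Start date of each month
[monthStart,sortIndex] = sort(monthStart);  % Sort the start dates
% Get unique start dates; 'last' makes uniqueIndex point at the final row of
% each month (newer MATLAB releases return the first occurrence by default)
[uniqueStarts,uniqueIndex] = unique(monthStart,'last');

% uniqueIndex is a column vector here, so concatenate vertically
valCell = mat2cell(vals(sortIndex,:),diff([0; uniqueIndex(:)]));
% Sum along dimension 1 so that a month with a single row is not collapsed to a scalar
newVals = cellfun(@(x) nansum(x,1),valCell,'UniformOutput',false);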

The call to MAT2CELL groups the rows of vals that have the same start date together into cells of a cell array valCell. The variable newVals will be a cell array of length numel(uniqueStarts), where each cell will contain the result of performing nansum on the corresponding cell of valCell.
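As a small illustration of the grouping described above, here is a sketch on made-up data (five daily rows spanning two months, two series). The data and the names lastIndex and sumCell are hypothetical, and `sum(x, 1, 'omitnan')` is used as a stand-in for `nansum(x, 1)` so that the snippet does not need the Statistics Toolbox; that assumes a release that supports the 'omitnan' flag. The last line shows one way to turn the cell array back into an ordinary matrix.

```matlab
% Hypothetical data: Jan 30 .. Feb 3 of 2009, already sorted by date
allDts = datenum(2009, 1, 30) + (0:4)';
vals   = [1 10; 2 NaN; 3 30; 4 40; 5 50];

[Y, M] = datevec(allDts);
monthStart = datenum(Y, M, 1);                      % first-of-month date for every row
[uniqueStarts, lastIndex] = unique(monthStart, 'last');  % last row of each month

% Split vals into one cell per month, then sum each cell down its columns
valCell = mat2cell(vals, diff([0; lastIndex]));
sumCell = cellfun(@(x) sum(x, 1, 'omitnan'), valCell, 'UniformOutput', false);

% Stack the per-month row vectors into a months-by-series matrix
newVals = vertcat(sumCell{:});
```

With this input, January contributes two rows and February three, so newVals has two rows and the NaN in the second series is ignored.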

FOR-LOOP SOLUTION:

[Y,M] = datevec(allDts);
monthStart = datenum(Y,M,1);  % Start date of each month
[monthStart,sortIndex] = sort(monthStart);  % Sort the start dates
% Get unique start dates; 'last' makes uniqueIndex point at the final row of
% each month (newer MATLAB releases return the first occurrence by default)
[uniqueStarts,uniqueIndex] = unique(monthStart,'last');

vals = vals(sortIndex,:);  % Sort the values according to start date
nMonths = numel(uniqueStarts);
uniqueIndex = [0; uniqueIndex(:)];  % Column vector, so concatenate vertically
newVals = nan(nMonths,size(vals,2));  % Preallocate
for iMonth = 1:nMonths
  index = (uniqueIndex(iMonth)+1):uniqueIndex(iMonth+1);  % Rows belonging to this month
  newVals(iMonth,:) = nansum(vals(index,:),1);  % Sum along dimension 1, even for a single row
end
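Before timing either version on real data, it may be worth checking that the index-range loop gives the same answer as a plain logical mask per month. The following is only a sketch on synthetic data: the dates, the random values and the names edges, fast, ref and maxDiff are all made up for the check, and `sum(..., 1, 'omitnan')` stands in for `nansum(..., 1)` on the assumption that the release supports that flag.

```matlab
% Synthetic, already-sorted daily data (hypothetical): 2008-2009, 4 series
allDts = datenum(2008, 1, 1) + (0:730)';
rng(0);                                    % reproducible random numbers
vals = rand(numel(allDts), 4);
vals(rand(size(vals)) < 0.1) = NaN;        % sprinkle in missing values

[Y, M] = datevec(allDts);
monthStart = datenum(Y, M, 1);
[uniqueStarts, lastIndex] = unique(monthStart, 'last');
edges = [0; lastIndex];                    % edges(k)+1 : edges(k+1) are the rows of month k

% Index-range loop, following the for-loop solution
fast = nan(numel(uniqueStarts), size(vals, 2));
for iMonth = 1:numel(uniqueStarts)
    rows = (edges(iMonth)+1):edges(iMonth+1);
    fast(iMonth, :) = sum(vals(rows, :), 1, 'omitnan');
end

% Reference result: one logical mask over all rows per month
ref = nan(size(fast));
for iMonth = 1:numel(uniqueStarts)
    mask = (monthStart == uniqueStarts(iMonth));
    ref(iMonth, :) = sum(vals(mask, :), 1, 'omitnan');
end

maxDiff = max(abs(fast(:) - ref(:)));      % largest disagreement between the two
```

If maxDiff is zero (or within rounding error), wrapping each loop in tic/toc on your own data should then give a fair comparison.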
gnovice
Thanks. This speeds it up by about 50%!! If I understand the code correctly, this line: valCell = mat2cell(vals,diff([0; uniqueIndex])); is the key - it breaks up the values into cells, that are the length of the second arg long. (Didn't need the sort - the dates and their associated values are guaranteed to be sort
Marc
Yup, it sounds like you've got it. The second argument to MAT2CELL is a vector of sizes that the rows of the first argument will be broken into. For example, if the first argument is a 6x3 matrix (called A), and the second argument is [1 2 3], then MAT2CELL will return a 3-element cell array (called B) equal to the following: B = {A(1,:); A(2:3,:); A(4:6,:)}
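The 6x3 example can be checked directly with a short sketch; the contents of A are arbitrary, and the names sameAsManual and blockHeights are just for illustration.

```matlab
A = reshape(1:18, 6, 3);                  % arbitrary 6x3 matrix
B = mat2cell(A, [1 2 3]);                 % row blocks of height 1, 2 and 3; columns kept whole

% Compare against the blocks written out by hand
sameAsManual = isequal(B, {A(1,:); A(2:3,:); A(4:6,:)});

% Number of rows in each cell of B
blockHeights = cellfun(@(x) size(x, 1), B);
```

Note that the sizes in the second argument have to add up to the number of rows of A, otherwise MAT2CELL raises an error.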
gnovice
A: 

If all you need to do is form the sum or mean over rows of a matrix, where the rows are grouped according to another variable (here, the date), then use my consolidator function. It is designed to do exactly this operation: reducing data based on the values of an indicator series. (Actually, consolidator can also work on n-d data, and with a tolerance, but all you need to do is pass it the month and year information.)

You can find consolidator on the MATLAB Central File Exchange.
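For readers who want to see the general idea of reducing rows by an indicator series without downloading anything, here is a rough sketch using the built-in ACCUMARRAY. This is not consolidator and says nothing about its interface, which isn't shown here; the data is made up, the variable names are hypothetical, and NaNs are replaced by zero to mimic nansum.

```matlab
% Hypothetical data: 90 consecutive days (Jan-Mar 2009), 2 series
allDts = datenum(2009, 1, 1) + (0:89)';
vals   = [ones(90, 1), 2*ones(90, 1)];
vals(5, 1) = NaN;                          % one missing value in January

[Y, M] = datevec(allDts);
% group(i) is the month number (1..nMonths) that row i belongs to
[uniqueStarts, ~, group] = unique(datenum(Y, M, 1));

nCols = size(vals, 2);
[rowGroup, col] = ndgrid(group, 1:nCols);  % output subscripts for every element of vals

v = vals;
v(isnan(v)) = 0;                           % treat NaN as zero, as nansum would

% Sum every element of v into its (month, series) slot
newVals = accumarray([rowGroup(:), col(:)], v(:), [numel(uniqueStarts), nCols]);
% newVals is [30 62; 28 56; 31 62]
```

Unlike the MAT2CELL and for-loop versions, this does not require the rows to be sorted by date, since every row carries its own group number.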

woodchips
