ansaurus

Question

Answer 1

+2 A:

I have previously worked with MATLAB's dataset arrays for large data; unfortunately, it's true that they suffer from performance issues. One thing I found that helps speed things up is to clear the observation names (ObsNames) property.

Try the following fix:

%# I assume you have a 'dataset' object
ds = dataset(...);

%# clear the observation names property (it is simply a label for each record)
ds.Properties.ObsNames = [];

Amro 2010-09-29 02:16:44

Thanks Amro - will give that a try. More generally, any recommendations or advice on alternative structures for better performance?

Vahid 2010-09-29 02:27:29

In theory you should be able to do everything using matrices and cell arrays, just a bit more awkwardly.

Amro 2010-09-29 02:37:56
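
As a rough illustration of the matrices-and-cell-arrays suggestion above (a minimal sketch, not code from the thread; the variable names nums, names and flags are made up for the example), each column type can live in its own plain array and be subset with the same row index:

```matlab
% Hypothetical column-wise layout: one plain array per column type,
% with one row per observation in each of them.
nRows = 10;
nums  = [(1:nRows)', (1:nRows)'.^2];    % numeric columns in a matrix
names = arrayfun(@(k) sprintf('obs%d', k), (1:nRows)', ...
                 'UniformOutput', false); % string column in a cell array
flags = mod((1:nRows)', 2) == 0;        % boolean column in a logical vector

% Extract the first four observations by applying one row index everywhere.
idx      = 1:4;
subNums  = nums(idx, :);
subNames = names(idx);
subFlags = flags(idx);
```

The awkward part is that nothing ties the arrays together, so every subset or reorder has to be applied to each of them by hand.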

Answer 2

A:

Amro suggested clearing the observation names:

ds.Properties.ObsNames = [];

This gives a massive performance benefit: the time for subset assignment changes from quadratic in the number of rows (linear when plotted against rows^2) to linear in the number of rows, at the minor cost of losing the ObsNames.

Copying a dataset is near instantaneous, so copying the whole dataset and then clearing the unneeded rows also gives a massive performance improvement. It is a slightly less optimal solution, but there is no loss of ObsNames. Performance is about 2x slower than dropping ObsNames, and it improves by only 2% when ObsNames are also dropped.

Supporting data

A small script that assigns a subset of the rows of a 150,000 x 25 mixed string/integer/boolean dataset generated the following time measurements (in seconds).

The memory heap size made no significant difference in performance and was left at 128 MB.

subref means the standard function for subset assignment was used.
ObsNames=[] means the ObsNames were dropped.
Delete means the dataset was copied and the unneeded rows were cleared.

Rows      subref    subref & ObsNames=[]   Delete   Delete & ObsNames=[]
8000        4.19    2.06                   4.81     4.72
32000      57.61    2.49                   5.26     5.21
72000     390.72    3.21                   6.09     6.03
128000    ?(*)      4.21                   7.25     7.19

(*) I gave up on evaluating this value. Based on linear extrapolation against rows^2 I would guess 2000 sec, or half an hour.
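
For readers who want to redo this kind of estimate, here is a minimal sketch of one way to extrapolate linearly against rows^2, using a straight line through the last two measured subref timings. Choosing those two points is an assumption made for this example, and the result need not match the rough guess above.

```matlab
% Measured subref timings taken from the table above.
rows = [32000 72000];      % the two largest measured subset sizes
t    = [57.61 390.72];     % elapsed time in seconds

% Fit a straight line of time against rows^2 through these two points.
x     = rows.^2;
slope = (t(2) - t(1)) / (x(2) - x(1));

% Extrapolate the line to the 128000-row case.
targetRows = 128000;
tEst = t(2) + slope * (targetRows^2 - x(2));
disp(tEst);                % estimated time in seconds
```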

Script

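clear
load('data'); % load the 'alldata' dataset
% alldata.Properties.ObsNames = []; % uncomment to drop the ObsNames

x = (1:4).^2 .* 8000; % subset sizes: 8000, 32000, 72000, 128000 rows
t = zeros(size(x));   % elapsed time per subset size (seconds)

for h = 1:length(x)
    tStart = tic;
    somedata = alldata(1:x(h),:);    % "subref": standard subset assignment
%     somedata = alldata;            % "Delete": copy the whole dataset...
%     somedata(x(h)+1:end,:) = [];   % ...then drop the unrequired obs
    t(h) = toc(tStart);
    clear somedata
    disp([x(h), t(h)]);
end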

Vahid 2010-09-29 13:46:09


Extract large Matlab dataset subsets
