views:

131

answers:

2

I am trying to deal with a very large dataset. I have k = ~4200 matrices (varying sizes) which must be compared combinatorially, skipping non-unique and self comparisons. Each of k(k-1)/2 comparisons produces a matrix, which must be indexed against its parents (i.e. can find out where it came from). The convenient way to do this is to (triangularly) fill a k-by-k cell array with the result of each comparison. These are ~100 X ~100 matrices, on average. Using single precision floats, it works out to 400 GB overall.
I need to 1) generate the cell array or pieces of it without trying to place the whole thing in memory and 2) access its elements (and their elements) in like fashion. My attempts have been inefficient due to reliance on MATLAB's eval() as well as save and clear occurring in loops.

for i=1:k
    [~,m] = size(data{i});
    cur_var = ['H' int2str(i)];
    %# if i == 1; save('FileName'); end; %# If using a single MAT file and need to create it.
    eval([cur_var ' = cell(1,k-i);']);
    for j=i+1:k
        [~,n] = size(data{j});
        eval([cur_var '{i,j} = zeros(m,n,''single'');']);
        eval([cur_var '{i,j} = compare(data{i},data{j});']);
    end
    save(cur_var,cur_var); %# Add '-append' when using a single MAT file.
    clear(cur_var);
end

The other thing I have done is to perform the split when mod((i+j-1)/2,max(factor(k(k-1)/2))) == 0. This divides the result into the largest number of same-size pieces, which seems logical. The indexing is a little more complicated, but not too bad because a linear index could be used.

Does anyone know/see a better way?

+1  A: 

You can get rid of the eval and clear calls by assigning the filename separately.

for i=1:k
    [~,m] = size(data{i});
    file_name = ['H' int2str(i)];    
    cur_var = cell(1, k-i);
    for j=i+1:k
        [~,n] = size(data{j});
        cur_var{i,j} = zeros(m, n, 'single');
        cur_var{i,j} = compare(data{i}, data{j});
    end
    save(file_name, cur_var); 
end

If you need the saved variables to take different names, use the -struct option to save.

str.(file_name);
save(file_name, '-struct', str); 
Richie Cotton
+1  A: 

Here's a version that combines going fast with using minimal memory.

I use fwrite/fread so that you still can use parfor (and this time, I made sure it works :) )

%# assume data is loaded an k is known

%# find the index pairs for comparisons. This could be done more elegantly, I guess.
%# I'm constructing a lower triangular array, i.e. an array that has ones wherever
%# we want to compare i (row) and j (col). Then I use find to get i and j
[iIdx,jIdx] = find(tril(ones(k,k),-1));

%# create a directory to store the comparisons
mkdir('H_matrix_elements')
savePath = fullfile(pwd,'H_matrix_elements');

%# loop through all comparisons in parallel. This way there may be a bit more overhead from
%# the individual function calls. However, parfor is most efficient if there are 
%# a lot of relatively similarly fast iterations.
parfor ct = 1:length(iIdx)

   %# make the comparison - do double b/c there shouldn't be a memory issue 
   currentComparison = compare(data{iIdx(ct)},data{jIdx{ct});

   %# create save-name as H_i_j, e.g. H_104_23
   saveName = fullfile(savePath,sprintf('H_%i_%i',iIdx(ct),jIdx(ct)));

   %# save. Since 'save' is not allowed, use fwrite to write the data to disk
   fid = fopen(saveName,'w');

   %# for simplicity: save data as vector, add two elements to the beginning
   %# to store the size of the array
   fwrite(fid,[size(currentComparison)';currentComparison(:)]);  % ' #SO formatting

   %# close file
   fclose(fid)
end



%# to read e.g. comparison H_104_23
fid = fopen(fullfile(savePath,'H_104_23'),'r');
tmp = fread(fid);
fclose(fid);

%# reshape into 2D array.
data = reshape(tmp(3:end),tmp(1),tmp(2));
Jonas
Unfortunately, `save()` can't be called in a `parfor`.
reve_etrange
@reve_etrange: Oops. Ah well, you have to use e.g. fwrite, then.
Jonas