views: 74
answers: 4
Possible Duplicate:
Growable data structure in MATLAB

So in my current MATLAB script, I have a very large array of indeterminate, growing size. There is currently nothing I can do about preallocating it: if I preallocated for the worst case, it would take many times more memory than it actually needs (the maximum possible number of values is 640 per pixel, but it's usually something along the lines of 2-5).

Usually in a case like this, I'd be using a vector in C++ or something similar, which grows its capacity geometrically. But I suspect matrices in MATLAB start fragmenting memory much faster than purpose-built C++ vectors do.

What do you guys think is the best alternative to something like this? Or should I just stick with normal arrays and hope that adding around 100k elements sequentially will work?

Thanks in advance.

+3  A: 

You could try what std::vector does when it runs out of room: double its capacity every time it fills up, which gives an amortized cost of O(1) per insertion.

Test:

function test_MAT_vector

    LIM = 5e4;
    %%# Normal: grow the array by one element per iteration
    tic;
    A = zeros(1,1);
    for i = 1:LIM
        A(end+1) = i;
    end
    toc

    %%# std::vector clone: double the capacity whenever it fills up
    tic;
    B = zeros(1,1);
    for i = 1:LIM
        if(i > numel(B))
            B = [B;zeros(numel(B),1)];   %# double the allocated size
        end
        B(i) = i;
    end
    B = B(1:LIM);                        %# drop the unused trailing zeros
    toc

end

Output:

Elapsed time is 3.489820 seconds.

Elapsed time is 0.006710 seconds.

Using cell

%%# cell
tic;
A = cell(1,1);
for i = 1:LIM
    A(end+1) = {i};
end
toc

Elapsed time is 2.740792 seconds.

Jacob
It kind of makes me sad that I'm going to do this, but that looks like a valid solution. Thanks
Xzhsh
Or maybe I should just wrap a std::vector with a mexfile... <_< But then I'd have to do some kind of persistence. Ahh forget it.
Xzhsh
A: 

You can grow a cell array, which is much cheaper than growing a numeric array, and then use cell2mat to convert it to an array at the end (this requires having twice the memory temporarily, but if you have it, it can be the easiest way).

If you know the number of pixels, you can preallocate a cell array of the right size, fill each cell with the array of actual values (up to 640, but usually 2-5), then use cell2mat to convert it to a contiguous array.

If you are worried about fragmentation, you can run pack after filling the cell array, which should defragment your memory (it writes everything to disk and loads it back contiguously) before you do the cell2mat conversion.
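A minimal sketch of that approach (the pixel count and the random per-pixel data here are just placeholders):

nPixels = 1000;                          %# illustrative number of pixels
vals = cell(nPixels,1);                  %# one cell per pixel, preallocated
for p = 1:nPixels
    n = randi(5);                        %# usually 2-5 values, up to 640
    vals{p} = rand(n,1);                 %# variable-length column of values for this pixel
end
%# pack                                  %# optional defragmentation (run from the command line)
allVals = cell2mat(vals);                %# one contiguous column vector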

thrope
@thrope: True, but it's slow (I've profiled it in my answer).
Jacob
@Jacob: its slow for single elements like your example but I generally use that method when wanting to concatenate larger arrays where the array expansion really kills you... Ie in your test instead of adding single elements add a random array of length 100000 - 1000000 for example. Also you can preallocate the cell array. I think you'd find this significantly faster than the array expansion, but I'm sure still not nearly as quick as your std::vector clone (although thats a bit more complicated in this case)
thrope
+7  A: 

Growing an array is generally a bad thing to do, at least if you will do it many times. There are a few tricks one can use. (I've tried them all, and written tools that do it for you automatically if you want.)

The problem is that repeatedly growing an array is an O(N^2) operation: each time the array is reallocated, all of its existing elements must be copied, so appending N elements one at a time copies on the order of N^2/2 elements in total. It also tends to fragment your memory, leaving you with potential problems if you later need to create a large array.

One approach is to preallocate your array to some fixed size, then append large blocks of zeros whenever the array size would be exceeded. Now use matrix indexing to insert the new values into the existing array. The idea is to double the size of the array whenever it needs to be reallocated. This causes only a few reallocations, but it may end up appending a lot of zeros that you don't need to fill. At the end, you dump the extraneous and unused array elements.
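A rough sketch of that doubling-and-trimming pattern (the initial size and the generated values are just placeholders):

buf = zeros(1000,1);                     %# preallocate some initial capacity
count = 0;                               %# number of elements actually stored
for k = 1:1e5
    newVal = rand;                       %# stand-in for whatever value you generate
    if count == numel(buf)
        buf(2*numel(buf)) = 0;           %# double the capacity in one reallocation
    end
    count = count + 1;
    buf(count) = newVal;                 %# matrix indexing into existing storage
end
buf = buf(1:count);                      %# dump the unused trailing zeros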

The second trick is to use a cell array. Every time you want to append a new element or block thereof, you simply append a new cell with those elements. Then at the end, you do a cell2mat operation. A problem with the cell trick is it is still an O(N^2) operation, because the cell array itself is essentially an array of pointers, and that list of pointers must be reallocated when you append new elements.

There is a combination trick you can do that I like, but it needs a tool to implement the operation. The idea is to create cells that contain blocks of say 10000 elements. Fill the first block with zeros. Then use matrix indexing to insert new elements as they are generated. The matrix indexing is fast of course. When you run out of room in that cell, you append a new large cell. Then at the end, you catenate all the cells together, carefully dropping the elements that were unused.

This last scheme involves relatively few new cell elements to be appended, and so can be most efficient overall if there may be millions of append steps.
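That combined scheme might look roughly like this (the block size and variable names are illustrative; the File Exchange tools mentioned below do this bookkeeping for you):

blockSize = 10000;
blocks = {zeros(blockSize,1)};           %# cell array holding preallocated blocks
blk = 1;                                 %# current block
pos = 0;                                 %# fill position within the current block
for k = 1:1e5
    newVal = rand;                       %# stand-in for the generated value
    if pos == blockSize                  %# current block is full
        blk = blk + 1;
        blocks{blk} = zeros(blockSize,1);%# appending a new cell happens rarely
        pos = 0;
    end
    pos = pos + 1;
    blocks{blk}(pos) = newVal;           %# fast indexed assignment within the block
end
blocks{blk} = blocks{blk}(1:pos);        %# drop the unused tail of the last block
result = vertcat(blocks{:});             %# catenate all blocks into one vector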

You can find each of these approaches embodied in several File Exchange submissions I've posted over the years. The first can be found as grow_array, which internally manages the append steps, worrying about the matrix indexing for you.

Later on, I built the last scheme using either function handles or persistent variables to maintain the stored information. The tools that contain those implementations are found as growdata and growdata2.

woodchips
+1 for a detailed answer
Amro
+1  A: 

In a previous question, I had posted a similar solution to that proposed by @Jacob.

I later compared the performance of most of the options available (which are very well summarized by @woodchips above). Here are the results I got on my machine:

NUM = 50000;

%%# ========== MINE: ~0.07sec ==========
tic
BLOCK_SIZE = 2000;                           %# initial & increment size
listSize = BLOCK_SIZE;                       %# current list capacity
list = zeros(listSize, 2);                   %# actual list
listPtr = 1;                                 %# pointer to the next free position
for i=1:NUM
    %# push items on list
    list(listPtr,:) = [rand rand];           %# store new item
    listPtr = listPtr + 1;                   %# increment position pointer
    %# add new block of memory if needed
    if( listPtr+(BLOCK_SIZE/10) > listSize ) %# less than 10%*BLOCK_SIZE free
        listSize = listSize + BLOCK_SIZE;    %# add new BLOCK_SIZE slots
        list(listPtr+1:listSize,:) = 0;
    end
end
list(listPtr:end,:) = [];                    %# remove unused slots
toc

%%# ========== PREALLOCATION (matrix): ~0.05sec ==========
tic
list = zeros(NUM,2);
for i=1:NUM
    list(i,:) = [rand rand];
end
toc

%%# ========== PREALLOCATION (cell): ~1.1sec ==========
tic
list = cell(NUM,1);
for i=1:NUM
    list{i} = [rand rand];
end
list = vertcat( list{:} );
toc

%%# ============ NO-PREALLOCATION (grow cell): ~5sec ========
tic
list = {};
for i=1:NUM
    list{i} = [rand rand];
end
list = vertcat( list{:} );
toc

%%# ============ NO-PREALLOCATION (grow matrix): ~24sec ========
tic
list = [];
for i=1:NUM
    list(i,:) = [rand rand];
end
toc

%%# ========== GROWDATA (by John D'Errico): ~3.3sec =========
tic
growdata                             %# The initialization call
for i = 1:NUM
    growdata( [rand rand] )          %# push items
end
list = growdata;                     %# unpacking step
toc
Amro
I will add that the simple cell solution is efficient for relatively small numbers of cells. This IS a good way to go for a few thousand cells. Even for 50k cells this is still fast. It is when you start moving into millions of appends that cell arrays with millions of cells can start to get more expensive. You still need to grow that array of pointers. (That story was different in past versions of matlab as I recall, where I found the break even point was lower, so I imagine they have improved things over time.)
woodchips
@woodchips: it seems you're right; adding a couple more zeros to the number of iterations, the growing-cell solution gets much slower. By the way, I probably should have used `cell2mat(list)` instead of `vertcat(list{:})`, as it's an order of magnitude faster.
Amro