views:

102

answers:

2

Hi, the code I'm dealing with has loops like the following:

bistar = zeros(numdims,numcases); 
parfor hh=1:nt       
  bistar = bistar +  A(:,:,hh)*data(:,:,hh+1)' ;
end   

for small nt (10).

After timing it, it is actually 100 times slower than using the regular loop!!! I know that parfor can do parallel sums, so I'm not sure why this isn't working.

I run

matlabpool

with the out-of-the-box configurations before running my code.

I'm relatively new to matlab, and just started to use the parallel features, so please don't assume that I'm am not doing something stupid.

Thanks!

PS: I'm running the code on a quad core so I would expect to see some improvements.

+5  A: 

Making the partitioning and grouping the results (overhead in dividing the work and gathering results from the several threads/cores) is high for small values of nt. This is normal, you would not partition data for easy tasks that can be performed quickly in a simple loop.

Always perform something challenging inside the loop that is worth the partitioning overhead. Here is a nice introduction to parallel programming.

The threads come from a thread pool so the overhead of creating the threads should not be there. But in order to create the partial results n matrices from the bistar size must be created, all the partial results computed and then all these partial results have to be added (recombining). In a straight loop, this is with a high probability done in-place, no allocations take place.

The complete statement in the help (thanks for your link hereunder) is:

If the time to compute f, g, and h is large, parfor will be significantly faster than the corresponding for statement, even if n is relatively small.

So you see they mean exactly the same as what I mean, the overhead for small n values is only worth the effort if what you do in the loop is complex/time consuming enough.

jdehaan
Thanks for the response, however in http://www.mathworks.com/access/helpdesk/help/toolbox/distcomp/parfor.html it states "parfor will be significantly faster than the corresponding for statement, even if n is relatively small." (Ofcourse idk what relatively small means.)I'm confused though, what do you mean by overhead in dividing the work and gathering results from the several threads/cores? The vars A and data are global and should be shared across all threads already. All matlab has to do is becareful with adding to bistar.
Junier
I added a precision, thanks for the link I had a look at the help and it states the same as what I try to make clear. I am not so good at explaining thing :-) The "if" part of the sentence is quite important. Hope it helps! This is not only true for matlab but for all kinds of parallel computing. Partitioning the problem adequately is essential.
jdehaan
+1 for pointing out the necessity of reading all of the help instead of just the part that appears to say what you want.
Donnie
Thx again jdehaan :) It's funny I think that MATLAB actually creates 4 processes and I'm guessing it is outsourcing work to these processes so the overhead you talk about is felt (since there is no shared memory among processes). However, in a perfect world the work would be outsourced to threads that share the variables A, data, and bistar so partitioning would be a matter of passing indices (negligible overhead) and since bistar is shared recombining is just a matter of adding to the correct indices in parallel (negligible overhead). Would this not be faster? Am I missing something?
Junier
@Donnie I didn't overlook it on purpose, I was tired from working and was literally about to go to bed...
Junier
@Junier: Putting the workers on threads instead of processes would make memory access faster, but PCT is designed to scale to distribute across workers on multiple servers in a compute farm, which won't have shared memory, or even the same architecture. Also, it would be complicated: Matlab (the language, not the VM) is single-threaded, so some code you might call, including C-based MEX files or libraries, Java, or Matlab internals, is not re-entrant, and could fail when called concurrently.
Andrew Janke
That said, newer Matlab versions automatically use multithreading in some builtin functions, like sum() and * (matrix multiply). Your code may already be taking advantage of multiple cores, and the more it's vectorized, the more Matlab may be able to multithread it. This is how Matlab usually does multicore for simpler operations on data sets that can fit in one process. See http://www.mathworks.com/support/solutions/en/data/1-4PG4AN/?solution=1-4PG4AN
Andrew Janke
+4  A: 

Parforcomes with a bit of overhead. Thus, if nt is really small, and if the computation in the loop is done very quickly (like an addition), the parfor solution is slower. Furthermore, if you run parforon a quad-core, speed gain will be close to linear for 1-3 cores, but less if you use 4 cores, since the last core also needs to run system processes.

For example, if parfor comes with 100ms of overhead, and the computation in the loop takes 5ms, and if we assume that speed gain is linear up to 4 cores with a coefficient of 1 (i.e. using 4 cores makes the computation 4 times faster), nt needs to be about 30 for you to achieve a speed gain with parfor (150ms with for, 132ms with parfor). If you were to run only 10 iterations, parfor would be slower (50ms with for, 112ms with parfor).

You can calculate the overhead on your machine by comparing execution time with 1 worker vs 0 workers, and you can estimate speed gain by making a liner fit through the execution times with 1 to 4 workers. Then you'll know when it's useful to use parfor.

Jonas