the code I'm dealing with has loops like the following:
bistar = zeros(numdims,numcases);
parfor hh=1:nt
bistar = bistar + A(:,:,hh)*data(:,:,hh+1)' ;
end
for small nt (10).
After timing it, it is actually 100 times slower than using the regular loop!!! I know that parfor can do parallel sums, so I'm not sure why this isn't working.
I run
matlabpool
with the out-of-the-box configurations before running my code.
I'm relatively new to matlab, and just started to use the parallel features, so please don't assume that I'm am not doing something stupid.
Thanks!
PS: I'm running the code on a quad core so I would expect to see some improvements.
Making the partitioning and grouping the results (overhead in dividing the work and gathering results from the several threads/cores) is high for small values of nt
. This is normal, you would not partition data for easy tasks that can be performed quickly in a simple loop.
Always perform something challenging inside the loop that is worth the partitioning overhead. Here is a nice introduction to parallel programming.
The threads come from a thread pool so the overhead of creating the threads should not be there. But in order to create the partial results n
matrices from the bistar
size must be created, all the partial results computed and then all these partial results have to be added (recombining). In a straight loop, this is with a high probability done in-place, no allocations take place.
The complete statement in the help (thanks for your link hereunder) is:
If the time to compute f, g, and h is large, parfor will be significantly faster than the corresponding for statement, even if n is relatively small.
So you see they mean exactly the same as what I mean, the overhead for small n values is only worth the effort if what you do in the loop is complex/time consuming enough.