This is my code:
double res1[NNN];
#pragma omp parallel for collapse(3) schedule(dynamic)
for (int i = 0; i < NNN; i++)
{
    for (int j = 0; j < NNN; j++)
    {
        for (int k = 0; k < NNN; k++)
        {
            res1[i] = log(fabs(i * j * k));
        }
    }
}
std::cout << res1[10] << std::endl;
When I use collapse(3) it takes ~50 seconds; without collapse(3) it takes only ~6-7 seconds. I am very puzzled by this behavior, since I would have expected better performance with "collapse" than without. Am I missing something?
I did some experiments and played with different configurations (NNN = 2500 and 24 cores):

schedule(static) && collapse(3) -> ~54 sec
schedule(static) && collapse(2) -> ~8 sec
schedule(static) && collapse(1) -> ~8 sec

I also tried the dynamic schedule, but it takes an enormous amount of time (several minutes).
In my original problem, I have four nested "for" loops (a 4D array of size 51x23x51x23).
What is the best way to use OpenMP/MPI to minimize the running time? I have ~300 CPU cores in total. What is the best way to spread my array over these cores? The array length is flexible (I can fit it to the number of CPUs in some way).
Any suggestions?
You are missing the impact that dynamic scheduling has on the OpenMP overhead.
Dynamic scheduling should be used to tackle load-balance problems, where each loop iteration can take a different amount of time and a static iteration distribution would most likely create work imbalance between the threads. Work imbalance leads to wasted CPU time, as the threads that finish early simply wait for the others to finish. Dynamic scheduling overcomes this by distributing loop chunks on a first come, first served basis, but this adds overhead: the OpenMP runtime has to keep track of which iterations have been handed out and which have not, and it has to implement some form of synchronisation. Moreover, each thread must make at least one call into the OpenMP runtime every time it finishes its iteration block and goes looking for another one. With static scheduling, all iteration blocks are precomputed up front and each thread then runs over its part without any interaction with the OpenMP runtime.
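For intuition, here is a minimal sketch (not your code; the triangular inner loop and the names out and N are made up for illustration) of the kind of loop where dynamic scheduling pays off despite its overhead:

    #include <cmath>

    // Iterations get progressively more expensive as i grows, so a static
    // split would leave the threads that own small i idle near the end,
    // while schedule(dynamic) lets free threads keep grabbing new chunks.
    void imbalanced_loop(double *out, int N)
    {
        #pragma omp parallel for schedule(dynamic)
        for (int i = 0; i < N; i++)
        {
            double s = 0.0;
            for (int j = 0; j <= i; j++)       // work grows with i
                s += std::log(2.0 + i + j);
            out[i] = s;                        // each thread owns its own i
        }
    }

In your loop, by contrast, every iteration costs about the same, so dynamic scheduling only adds overhead without buying anything.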
The most crucial difference between static and dynamic scheduling is the iteration chunk size, i.e. the number of consecutive loop iterations that each thread does before seeking work in another part of the iteration space. If omitted, the chunk size with static scheduling defaults to #_of_iterations/#_of_threads, while the default chunk size for dynamic scheduling is 1, i.e. each thread has to ask the OpenMP runtime for every single iteration of the distributed loop.
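Both defaults can be overridden with an explicit chunk size. A sketch reusing your NNN (the chunk values 4 and 64 are illustrative, not tuned for your machine):

    // static,4: blocks of 4 consecutive iterations are dealt out round-robin
    // once, up front; no runtime calls happen while the loop runs.
    #pragma omp parallel for schedule(static, 4)
    for (int i = 0; i < NNN; i++) { /* ... */ }

    // dynamic,64: each request to the runtime now returns 64 iterations,
    // so the bookkeeping cost is paid 64 times less often than with the
    // default chunk size of 1.
    #pragma omp parallel for schedule(dynamic, 64)
    for (int i = 0; i < NNN; i++) { /* ... */ }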
What happens in your case is that, without collapse(3), you have NNN iteration chunks of the outer loop, and each thread runs NNN*NNN iterations (of the inner loops) before asking the OpenMP runtime for another chunk. When you collapse the loops, the number of iteration chunks grows to NNN*NNN*NNN, i.e. there are far more chunks and each thread has to ask the OpenMP runtime for a new chunk after every single iteration.
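To put numbers on it: with NNN = 2500 and the default dynamic chunk size of 1, the non-collapsed loop issues roughly 2500 chunk requests (one per outer iteration, each covering 2500*2500 inner iterations), while the collapsed loop issues about 2500*2500*2500 ≈ 1.6*10^10 requests, one per innermost iteration.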
Collapsing the inner loops with the outermost one brings another problem: many threads will get iterations that share the same value of i, which breaks the computation, since the order of execution is not guaranteed and the last thread to write to res1[i] might not be the one that executes the last iteration of both inner loops.
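A minimal sketch of a race-free variant, assuming the intent is that res1[i] should end up with the value the serial loop would leave there (i.e. from the last (j, k) pair): parallelise only the outer loop so each thread owns its res1[i] exclusively. The (double) cast is an added guard, since the int product i*j*k in the original overflows for NNN = 2500.

    #pragma omp parallel for schedule(static)
    for (int i = 0; i < NNN; i++)
    {
        double v = 0.0;
        for (int j = 0; j < NNN; j++)
            for (int k = 0; k < NNN; k++)
                v = log(fabs((double)i * j * k));  // cast avoids int overflow
        res1[i] = v;  // single write per i, no race between threads
    }

With only the outer loop shared, each thread again gets NNN*NNN iterations of work per chunk, which is exactly the regime in which your static, non-collapsed runs were fast.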