I have here what I understand to be a relatively simple OpenMP construct. The issue is that the program runs about 100-300x faster with 1 thread than with 2 threads: 87% of the program's time is spent in gomp_send_wait() and another 9.5% in gomp_send_post.
The program gives correct results, but I wonder if there is a flaw in the code that is causing some resource conflict, or if it is simply that the thread-creation overhead is drastically not worth it for a loop of chunk size 4. p ranges from 17 to 1000, depending on the size of the molecule we're simulating.
My numbers are for the worst case, when p is 17 and the chunk size is 4. The performance is the same whether I'm using static, dynamic, or guided scheduling. With p=150 and a chunk size of 75, the program is still 75x-100x slower than serial.
...
double e_t_sum = 0.0;
double e_in_sum = 0.0;
int nthreads, tid;

#pragma omp parallel for schedule(static, 4) reduction(+ : e_t_sum, e_in_sum) \
        shared(ee_t) private(tid, i, d_x, d_y, d_z, rr) \
        firstprivate(V_in, t_x, t_y, t_z) lastprivate(nthreads)
for (i = 0; i < p; i++){
    if (i != c){
        nthreads = omp_get_num_threads();
        tid = omp_get_thread_num();
        d_x = V_in[i].x - t_x;
        d_y = V_in[i].y - t_y;
        d_z = V_in[i].z - t_z;
        rr = d_x * d_x + d_y * d_y + d_z * d_z;
        /* write each (i, c) pair once, into the upper triangle (row < column) */
        if (i < c){
            ee_t[i][c] = energy(rr, V_in[i].q, V_in[c].q, V_in[i].s, V_in[c].s);
            e_t_sum += ee_t[i][c];
            e_in_sum += ee_in[i][c];
        }
        else{
            ee_t[c][i] = energy(rr, V_in[i].q, V_in[c].q, V_in[i].s, V_in[c].s);
            e_t_sum += ee_t[c][i];
            e_in_sum += ee_in[c][i];
        }
        // if(pid==0){printf("e_t_sum[%d]: %f\n", tid, e_t_sum[tid]);}
    }
} // end parallel for
e_t += e_t_sum;
e_t -= e_in_sum;
...
First, I don't think optimizing your serial code will help answer your OpenMP dilemma in this case. Don't worry about it.
IMO there are three possible explanations for the slowdown:
1. This one can explain a slowdown easily: the elements of the array ee_t are causing false sharing within cache lines. False sharing happens when cores end up writing to the same cache line not because they are actually sharing data, but because what they are writing just happens to fall in the same cache line (which is why it's called false sharing; a web search will turn up plenty more detail). Making the ee_t elements cache-line aligned may help a lot; see the sketch below.
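Here is a minimal sketch of that idea, assuming ee_t holds plain doubles and a 64-byte cache line; padded_double and alloc_padded are hypothetical names for illustration:

#include <stdlib.h>

#define CACHE_LINE 64

/* One matrix element per cache line, so writes to neighbouring
   elements (e.g. ee_t[c][i] and ee_t[c][i+1]) never contend. */
typedef struct {
    double v;
    char pad[CACHE_LINE - sizeof(double)];
} padded_double;

/* Allocate a p x p matrix of cache-line-aligned, padded elements. */
padded_double **alloc_padded(int p){
    padded_double **m = malloc(p * sizeof *m);
    for (int r = 0; r < p; r++)
        m[r] = aligned_alloc(CACHE_LINE, p * sizeof(padded_double)); /* C11 */
    return m;
}

Accesses then become ee_t[i][c].v. The cost is 8x the memory per element, so measure before committing to it.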
2. The overhead of spawning the work is higher than the benefit of the parallelism. Have you tried fewer than 8 cores? How is performance at 2 cores? One cheap hedge is to fall back to serial execution for small trip counts, as sketched below.
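For example, OpenMP's if() clause runs the loop serially whenever the condition is false; THRESHOLD here is a hypothetical cutoff you would pick by timing serial against parallel runs:

#define THRESHOLD 200 /* hypothetical cutoff; tune by measurement */

/* Run in parallel only when the loop is long enough to amortize
   the fork/join cost; otherwise execute on a single thread. */
#pragma omp parallel for if(p > THRESHOLD) reduction(+ : e_t_sum, e_in_sum)
for (i = 0; i < p; i++){
    /* ... same loop body as above ... */
}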
3. The total number of iterations is small; take 17 as an example. If you split it across 8 cores, the loop will suffer load-imbalance problems, especially since some of your iterations do practically no work (when i == c). At least one core will have to do 3 iterations while all the others do 2. This does not explain a slowdown, but it is surely one reason why speedup is not as high as you might expect. Since your iterations are of varying lengths, I would use a dynamic schedule with a chunk size of 1, or use OpenMP's guided schedule. Experiment with the chunk size; too small a chunk will also lead to slowdown. Both variants are sketched below.
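The only change is the schedule clause; both of these are standard OpenMP:

/* Hand out one iteration at a time: best when iteration costs vary widely,
   at the price of one scheduling transaction per iteration. */
#pragma omp parallel for schedule(dynamic, 1) reduction(+ : e_t_sum, e_in_sum)
for (i = 0; i < p; i++){ /* ... loop body ... */ }

/* Start with large chunks and shrink them as the loop drains,
   which balances load with fewer scheduling transactions. */
#pragma omp parallel for schedule(guided) reduction(+ : e_t_sum, e_in_sum)
for (i = 0; i < p; i++){ /* ... loop body ... */ }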
Let me know how it goes.