Avoiding False Sharing in OpenMP with arrays

Michael Aquilina · Oct 9, 2013 · Viewed 13.4k times

I have started learning how to use OpenMP as part of a university course. As a lab exercise, we have been given a serial program which we need to parallelize.

One of the first things we were made aware of was the danger of False Sharing, especially when it comes to updating arrays in parallel for loops.

However, I am finding it hard to transform the following snippet of code into a parallelizable task without causing False Sharing:

int ii,kk;

double *uk = malloc(sizeof(double) * NX);
double *ukp1 = malloc(sizeof(double) * NX);
double *temp;

double dx = 1.0/(double)NX;
double dt = 0.5*dx*dx;

// Initialise both arrays with values
init(uk, ukp1);

for(kk=0; kk<NSTEPS; kk++) {
   for(ii=1; ii<NX-1; ii++) {
      ukp1[ii] = uk[ii] + (dt/(dx*dx))*(uk[ii+1]-2*uk[ii]+uk[ii-1]);
   }

   temp = ukp1;
   ukp1 = uk;
   uk = temp;
   printValues(uk,kk);
}

My first reaction was to try out sharing ukp1:

for(kk=0; kk<NSTEPS; kk++) {
   #pragma omp parallel for shared(ukp1)
   for(ii=1; ii<NX-1; ii++) {
      ukp1[ii] = uk[ii] + (dt/(dx*dx))*(uk[ii+1]-2*uk[ii]+uk[ii-1]);
    }

   temp = ukp1;
   ukp1 = uk;
   uk = temp;
   printValues(uk,kk);
}

But this clearly shows a significant slowdown compared to the serial version. The obvious reason is that false sharing is occurring during some write operations to ukp1.

I was under the impression that maybe I could use the reduction clause; however, I soon found out that this cannot be used on arrays.

Is there anything I can use to parallelize this code to improve the runtime? Is there a clause I can use which I have not heard of? Or is this the kind of task where I need to restructure the code to allow proper parallelization?

All forms of input would be greatly appreciated!

EDIT: It was pointed out to me that there was a mistake in my code. The code I have locally is correct; I just edited it wrongly here (which changed the structure of the code). Sorry for the confusion!

EDIT2:

Some information pointed out to me by @Sergey which I feel is useful:

  • Setting uk or ukp1 to private will essentially have the same effect as setting them to shared, because the sharing attribute applies only to the pointers themselves and the memory they point to stays shared

  • Using static scheduling should help in theory, but I am still experiencing the same slowdown. Also, I feel that static scheduling is not the most portable way of fixing this problem.

Answer

Sergey L. · Oct 9, 2013

Since we are talking about optimisation, first things first:

Define constants as macros; this allows for better optimisation by the compiler.

#define dx (1.0/(double)NX)
#define dt (0.5*dx*dx)

When working with OpenMP, the default sharing rule for variables is shared, but it is good practice to set it to none and declare the sharing of every variable you need inside the parallel section manually. This way you can be certain that you avoid conflicts.

#pragma omp parallel for default(none) shared(ukp1, uk)

Also, since you declared ukp1 and uk as pointers, any sharing attribute you set only applies to the pointer itself that is passed into your parallel section; the memory they point to is still shared.
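
For illustration, here is a minimal standalone sketch (not from the original code; the array size 4 and the name p are made up) showing that even firstprivate only copies the pointer value, so every thread still writes to the same underlying buffer:

#include <stdio.h>
#include <stdlib.h>

int main(void) {
    double *p = malloc(4 * sizeof(double));
    // firstprivate(p) gives each thread its own copy of the pointer,
    // but all copies point at the same shared buffer
    #pragma omp parallel for default(none) firstprivate(p)
    for (int i = 0; i < 4; i++)
        p[i] = (double)i;   // writes land in the one shared array
    for (int i = 0; i < 4; i++)
        printf("p[%d] = %f\n", i, p[i]);
    free(p);
    return 0;
}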

Lastly, to avoid false sharing you want to make sure that as few cache lines as possible are shared among threads. Read-only cache lines (uk in your case) are irrelevant; they can exist in every thread's cache. But cache lines that are written to (ukp1) should each be touched by only one thread. Today cache lines are normally 64 bytes long, so one cache line fits 8 doubles - therefore you want to assign chunks of at least 8 iterations per thread:

#pragma omp parallel for default(none) shared(ukp1, uk) schedule(static,8)

This distributes the iterations in chunks of 8, and the pragma should go on the inner loop:

for(kk=0; kk<NSTEPS; kk++) {
   #pragma omp parallel for default(none) shared(ukp1, uk) schedule(static,8)
   for(ii=1; ii<NX-1; ii++) {
      ukp1[ii] = uk[ii] + (dt/(dx*dx))*(uk[ii+1]-2*uk[ii]+uk[ii-1]);
   }
   // Swap pointers for the next time step
   temp = ukp1;
   ukp1 = uk;
   uk   = temp;
}

In practice, depending on the size of your data, you may want to assign even larger chunk sizes. I tend to use 0x1000, which on most systems will even fill whole pages (assuming you are page-aligned).
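
For example, the same loop with that larger chunk size (illustrative only; tune the value to your data):

#pragma omp parallel for default(none) shared(ukp1, uk) schedule(static, 0x1000)
for(ii=1; ii<NX-1; ii++) {
   ukp1[ii] = uk[ii] + (dt/(dx*dx))*(uk[ii+1]-2*uk[ii]+uk[ii-1]);
}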

Edit:

In order for this to actually have an effect, you need your memory to be correctly aligned. You are starting at index 1, so:

// memalign is declared in <malloc.h>
double *uk = memalign(0x40, sizeof(double) * (NX + 8));
double *ukp1 = memalign(0x40, sizeof(double) * (NX + 8));
// Shift by 7 elements (56 bytes) so that element 1 lands on a 64-byte boundary
uk += 7;
ukp1 += 7;

Now ukp1[1] is cache-line aligned. Increasing your chunk size may help, but unless you plan on having NX > 100000 there isn't much point in parallelising in the first place.
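
If you want to convince yourself of the alignment, a quick check along these lines should do (a hypothetical standalone snippet, not part of the original code; the NX value is assumed for the test):

#include <malloc.h>
#include <stdint.h>
#include <stdio.h>

#define NX 1024   /* assumed problem size for the check */

int main(void) {
    double *uk = memalign(0x40, sizeof(double) * (NX + 8));
    double *base = uk;   /* keep the original pointer for free() */
    uk += 7;             /* shift by 56 bytes so uk[1] starts a new cache line */
    printf("uk[1] is 64-byte aligned: %s\n",
           ((uintptr_t)&uk[1] % 64 == 0) ? "yes" : "no");
    free(base);
    return 0;
}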

You need to keep in mind that you are getting quite a lot of overhead from restarting the parallel section in each iteration of the time loop. To get that under control you would need to rework your scheduling beyond simple OpenMP.
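
Short of that, one partial mitigation that stays within plain OpenMP is to open the parallel region once, outside the time loop, and keep only the work-sharing construct inside it. A sketch, reusing uk, ukp1 and temp from above and assuming NX, NSTEPS, dx and dt are compile-time macros:

#pragma omp parallel default(none) shared(uk, ukp1, temp)
{
   for (int kk = 0; kk < NSTEPS; kk++) {
      #pragma omp for schedule(static, 8)
      for (int ii = 1; ii < NX-1; ii++) {
         ukp1[ii] = uk[ii] + (dt/(dx*dx))*(uk[ii+1]-2*uk[ii]+uk[ii-1]);
      }
      // implicit barrier at the end of the omp for

      #pragma omp single
      {
         // one thread swaps the pointers; the implicit barrier of
         // single keeps the other threads from racing ahead
         temp = ukp1;
         ukp1 = uk;
         uk   = temp;
      }
   }
}

The threads are then created only once, although the per-step barriers remain.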