OpenMP - Nested for-loop becomes faster when having parallel before outer loop. Why?

phineliner picture phineliner · Jul 9, 2015 · Viewed 8.3k times · Source

I'm currently implementing an dynamic programming algorithm for solving knapsack problems. Therefore my code has two for-loops, an outer and an inner loop.

From the logical point of view I can parallelize the inner for loop since the calculations there are independant from each other. The outer for loop can not be parallelized because of dependancies.

So this was my first approach:

for(int i=1; i < itemRows; i++){
        int itemsIndex = i-1;
        int itemWeight = integerItems[itemsIndex].weight;
        int itemWorth = integerItems[itemsIndex].worth;

        #pragma omp parallel for if(weightColumns > THRESHOLD)
        for(int c=1; c < weightColumns; c++){
            if(c < itemWeight){
                table[i][c] = table[i-1][c];
            }else{
                int worthOfNotUsingItem = table[i-1][c];
                int worthOfUsingItem = itemWorth + table[i-1][c-itemWeight];
                table[i][c] = worthOfNotUsingItem < worthOfUsingItem ? worthOfUsingItem : worthOfNotUsingItem;
            }
        }
}

The code works well, the algorithm solves the problems correctly. Then I was thinking about optimizing this, since I was not sure how OpenMP's thread management works. I wanted to prevent unnecessary initialization of the threads during each iteration, thus I put an outer parallel block around the outer loop.

Second approach:

#pragma omp parallel if(weightColumns > THRESHOLD)
{
    for(int i=1; i < itemRows; i++){
        int itemsIndex = i-1;
        int itemWeight = integerItems[itemsIndex].weight;
        int itemWorth = integerItems[itemsIndex].worth;

        #pragma omp for
        for(int c=1; c < weightColumns; c++){
            if(c < itemWeight){
                table[i][c] = table[i-1][c];
            }else{
                int worthOfNotUsingItem = table[i-1][c];
                int worthOfUsingItem = itemWorth + table[i-1][c-itemWeight];
                table[i][c] = worthOfNotUsingItem < worthOfUsingItem ? worthOfUsingItem : worthOfNotUsingItem;
            }
        }
     }
}

This has an unwanted side effect: Everything inside the parallel block will now be executed n-times, where n is the number of available cores. I already tried to work with pragmas single and critical to force the outer for-loop to be executed in one thread, but then I can not calculate the inner loop by multiple threads unless I open a new parallel block (but then there would be no gain in speed). But nevermind, because the good thing is: this does not affect the result. The problems are still solved correctly.

NOW THE STRANGE THING: The second approach is FASTER than the first one!

How can this be? I mean, although the outer for-loop is calculated n times (in parallel) and the inner for-loop is distributed n times among n cores it is faster than the first approach which only calculates the outer loop one time and distributes the workload of the inner for-loop evenly.

At first I was thinking: "well yeah, it's probably because of thread management" but then I read that OpenMP pools the instantiated threads which would speak against my assumption. Then I disabled compiler optimization (compiler flag -O0) to check if it has something to do with. But this did not affect the measurement.

Can anybody of you shed more light into this please?

The measured times for solving the knapsack-problem containing 7500 items with a max capacity of 45000 (creating a matrix of 7500x45000, which is way over the used THRESHOLD variable in code):

  • Approach 1: ~0.88s
  • Approach 2: ~0.52s

Thanks in advance,

phineliner

EDIT:

Measurement of a more complex problem: Added 2500 items to the problem (from 7500 to 10000) (more complex problems can't currently be handled because of memory reasons).

  • Approach 1: ~1.19s
  • Approach 2: ~0.71s

EDIT2: I was mistaken about the compiler optimization. This does not affect measurement. At least I can not reproduce the difference I measured before. I edited the question text according to this.

Answer

Z boson picture Z boson · Jul 13, 2015

Let's first consider what your code is doing. Essentially your code is transforming a matrix (2D array) where the values of the rows depend on the previous row but the values of the columns are independent of other columns. Let me choose a simpler example of this

for(int i=1; i<n; i++) {
    for(int j=0; j<n; j++) {
        a[i*n+j] += a[(i-1)*n+j];
    }
}

One way to parallelize this is to swap the loops like this

Method 1:

#pragma omp parallel for
for(int j=0; j<n; j++) {
    for(int i=1; i<n; i++) {
        a[i*n+j] += a[(i-1)*n+j];
    }
}

With this method each thread runs all n-1 iterations of i of the inner loop but only n/nthreads iterations of j. This effectively processes strips of columns in parallel. However, this method is highly cache unfriendly.

Another possibility is to only parallelize the inner loop.

Method 2:

for(int i=1; i<n; i++) {
    #pragma omp parallel for 
    for(int j=0; j<n; j++) {
        a[i*n+j] += a[(i-1)*n+j];
    }
}

This essentially processes the columns in a single row in parallel but each row sequentially. The values of i are only run by the master thread.

Another way to process the columns in parallel but each row sequentially is:

Method 3:

#pragma omp parallel
for(int i=1; i<n; i++) {
    #pragma omp for
    for(int j=0; j<n; j++) {
        a[i*n+j] += a[(i-1)*n+j];
    }
}

In this method, like method 1, each thread runs over all n-1 iteration over i. However, this method has an implicit barrier after the inner loop which causes each thread to pause until all threads have finished a row making this method sequential for each row like method 2.

The best solution is one which processes strips of columns in parallel like method 1 but is still cache friendly. This can be achieved using the nowait clause.

Method 4:

#pragma omp parallel
for(int i=1; i<n; i++) {
    #pragma omp for nowait
    for(int j=0; j<n; j++) {
        a[i*n+j] += a[(i-1)*n+j];
    }
}

In my tests the nowait clause does not make much difference. This is probably because the load is even (which is why static scheduling is ideal in this case). If the load was less even nowait would probably make more of a difference.

Here are the times in seconds for n=3000 on my my four cores IVB system GCC 4.9.2:

method 1: 3.00
method 2: 0.26 
method 3: 0.21
method 4: 0.21

This test is probably memory bandwidth bound so I could have chosen a better case using more computation but nevertheless the differences are significant enough. In order to remove a bias due to creating the thread pool I ran one of the methods without timing it first.

It's clear from the timing how un-cache friendly method 1 is. It's also clear method 3 is faster than method 2 and that nowait has little effect in this case.

Since method 2 and method 3 both processes columns in a row in parallel but rows sequentially one might expect their timing to be the same. So why do they differ? Let me make some observations:

  1. Due to a thread pool the threads are not created and destroyed for each iteration of the outer loop of method 2 so it's not clear to me what the extra overhead is. Note that OpenMP says nothing about a thread pool. This is something that each compiler implements.

  2. The only other difference between method 3 and method 2 is that in method 2 only the master thread processes i whereas in method 3 each thread processes a private i. But this seems too trivially to me to explain the significant difference between the methods because the implicit barrier in method 3 causes them to sync anyway and processing i is a matter of an increment and a conditional test.

  3. The fact that method 3 is no slower than method 4 which processes whole strips of columns in parallel says the extra overhead in method 2 is all in leaving and entering a parallel region for each iteration of i

So my conclusion is that to explain why method 2 is so much slower than method 3 requires looking into the implementation of the thread pool. For GCC which uses pthreads, this could probably be explained by creating a toy model of a thread pool but I don't have enough experience with that yet.