I've been trying to optimize some extremely performance-critical code (a quick sort algorithm that's being called millions and millions of times inside a monte carlo simulation) by loop unrolling. Here's the inner loop I'm trying to speed up:
// Search for elements to swap.
while(myArray[++index1] < pivot) {}
while(pivot < myArray[--index2]) {}
I tried unrolling to something like:
while(true) {
if(myArray[++index1] < pivot) break;
if(myArray[++index1] < pivot) break;
// More unrolling
}
while(true) {
if(pivot < myArray[--index2]) break;
if(pivot < myArray[--index2]) break;
// More unrolling
}
This made absolutely no difference so I changed it back to the more readable form. I've had similar experiences other times I've tried loop unrolling. Given the quality of branch predictors on modern hardware, when, if ever, is loop unrolling still a useful optimization?
Loop unrolling makes sense if you can break dependency chains. This gives a out of order or super-scalar CPU the possibility to schedule things better and thus run faster.
A simple example:
for (int i=0; i<n; i++)
{
sum += data[i];
}
Here the dependency chain of the arguments is very short. If you get a stall because you have a cache-miss on the data-array the cpu cannot do anything but to wait.
On the other hand this code:
for (int i=0; i<n; i+=4)
{
sum1 += data[i+0];
sum2 += data[i+1];
sum3 += data[i+2];
sum4 += data[i+3];
}
sum = sum1 + sum2 + sum3 + sum4;
could run faster. If you get a cache miss or other stall in one calculation there are still three other dependency chains that don't depend on the stall. A out of order CPU can execute these.