I am trying to learn about loop optimization. I found that loop tiling helps make array traversal faster, so I tried the two blocks of code given below, with and without loop blocking, and measured the time taken for each. I did not find a significant difference most of the time. I tried varying the block size, but I am not sure how to choose it. Please help me if my direction is wrong. In fact, I found the loop without blocking works faster many times.
a. With blocking
int max = 1000000;
int B = 100;
for (i = 0; i < max; i += B)
{
    for (j = i; j < (i + B); j++)
    {
        array[j] = 0;
    }
}
b. Without blocking
for (i = 0; i < max; i++)
{
    array[i] = 0;
}
Time taken:
With blocking: elapsed time - 6997000 nanoseconds
Without blocking: elapsed time - 6097000 nanoseconds
As pointed out here, tiling is a technique meant to keep your working set inside the caches while you work on it, so that you enjoy cache latency instead of memory latency. If you stream over your data only once, you'll see no benefit since nothing is ever reused from the cache, so tiling won't be useful.
Your example loops do exactly the same work in the same order, except for adding another branch and making the branch patterns slightly more complicated (most predictors can cope with that; it's just not helpful in any way).
Consider the following example -
#include <stdlib.h>
#include <stdio.h>
#include <time.h>

#define MAX (1024*1024*32)
#define REP 100
#define B   (16*1024)

static char array[MAX]; // static: a 32 MB local array would overflow the stack

int main() {
    int i, j, r;
    for (i = 0; i < MAX; i++) { // warmup to make things equal if array happens to fit in your L3
        array[i] = 0;
    }
    clock_t t1 = clock();

    // Tiled loop
    for (i = 0; i < MAX; i += B) {
        for (r = 0; r < REP; r++) {
            for (j = i; j < (i + B); j += 64) {
                array[j] = r;
            }
        }
    }
    clock_t t2 = clock();

    // Untiled loop
    for (r = 0; r < REP; r++) {
        for (i = 0; i < MAX; i += 64) {
            array[i] = r;
        }
    }
    clock_t t3 = clock();

    printf("Tiled: %f sec\n", (double)(t2 - t1) / CLOCKS_PER_SEC);
    printf("Untiled: %f sec\n", (double)(t3 - t2) / CLOCKS_PER_SEC);
    printf("array[0] = %d\n", array[0]); // to prevent optimizing out all the writes
}
Both loops perform the same accesses (the 64-byte jumps are there to stress the caches by touching each cache line exactly once, and to prevent IPC and instruction dispatch from becoming the bottleneck).
The tiled version reorders these accesses into blocks so that repeating a single block repeatedly hits the cache. Since the block size is set to 16k, it will probably fit in most L1 caches and enjoy a really good latency. For each iteration of the outer loop, you'll have 1 iteration where you miss all caches and go to memory (if your L3 is bigger than 32M, just dial MAX even higher to make sure), and REP-1 iterations that fly from the L1.
The untiled version also repeats itself REP times in total, but every repetition thrashes all the data out of the caches, making every access go to memory and accumulating a much higher overall latency.
Compiling with gcc 4.8.1 (-O3) on a Xeon 5670 @ 2.9GHz gives me -
Tiled: 0.110000 sec
Untiled: 0.450000 sec
array[0] = 99
Over 4x speedup :)
Note that the untiled version still has one benefit: there's a single sequential stream, so a HW prefetcher can run ahead fetching data for you effectively, somewhat mitigating the memory-latency effect. However, you can help the CPU do something similar in the tiled version, if you add the following:
for (i = 0; i < MAX; i += B) {
    for (r = 0; r < REP; r++) {
        for (j = i; j < (i + B); j += 64) {
            array[j] = r;
            if (r == REP - 2) // SW prefetching
                __builtin_prefetch(&array[j + B], 0);
        }
    }
}
This tells the CPU to bring in the next block slightly before finishing the current one. For the price of a conditional branch (with a few mispredictions per block), you save the execution time of the first iterations over the next block - I get a further reduction from that to:
Tiled: 0.070000 sec