I have matrix-multiplication code that looks like this:
for (i = 0; i < dimension; i++)
    for (j = 0; j < dimension; j++)
        for (k = 0; k < dimension; k++)
            C[dimension*i+j] += A[dimension*i+k] * B[dimension*k+j];
Here, the size of the matrix is represented by dimension.
Now, if the size of the matrices is 2000, it takes 147 seconds to run this piece of code, whereas if the size is 2048, it takes 447 seconds. So while the ratio of multiplication counts is (2048*2048*2048)/(2000*2000*2000) = 1.073, the ratio of timings is 447/147 ≈ 3. Can someone explain why this happens? I expected the time to scale linearly with the number of multiplications, which does not happen. I am not trying to write the fastest matrix multiply, simply trying to understand why this happens.
Specs: AMD Opteron dual-core node (2.2 GHz), 2 GB RAM, gcc 4.5.0
The program was compiled with gcc -O3 simple.c.
I have also compiled it with Intel's icc and seen similar results.
EDIT:
As suggested in the comments/answers, I ran the code with dimension=2060 and it takes 145 seconds.
Here's the complete program:
#include <stdlib.h>
#include <stdio.h>
#include <sys/time.h>

/* change dimension size as needed */
const int dimension = 2048;
struct timeval tv;

/* wall-clock time in seconds */
double timestamp()
{
    double t;
    gettimeofday(&tv, NULL);
    t = tv.tv_sec + (tv.tv_usec / 1000000.0);
    return t;
}

int main(int argc, char *argv[])
{
    int i, j, k;
    double *A, *B, *C, start, end;

    A = (double*)malloc(dimension*dimension*sizeof(double));
    B = (double*)malloc(dimension*dimension*sizeof(double));
    C = (double*)malloc(dimension*dimension*sizeof(double));

    /* fill A and B with values in [0, 1), zero C */
    srand(292);
    for (i = 0; i < dimension; i++)
        for (j = 0; j < dimension; j++) {
            A[dimension*i+j] = (rand()/(RAND_MAX + 1.0));
            B[dimension*i+j] = (rand()/(RAND_MAX + 1.0));
            C[dimension*i+j] = 0.0;
        }

    start = timestamp();
    for (i = 0; i < dimension; i++)
        for (j = 0; j < dimension; j++)
            for (k = 0; k < dimension; k++)
                C[dimension*i+j] += A[dimension*i+k] * B[dimension*k+j];
    end = timestamp();

    printf("\nsecs:%f\n", end - start);

    free(A);
    free(B);
    free(C);
    return 0;
}
Here's my wild guess: cache
It could be that you can fit two rows of 2000 doubles into the cache: 2 × 2000 × 8 bytes = 32,000 bytes, which is slightly less than the 32 KB L1 cache (while leaving room for other necessary things).
But when you bump it up to 2048, two rows take the entire cache (2 × 2048 × 8 bytes = 32,768 bytes), and you spill some of it because you still need room for other things.
Assuming the cache policy is LRU, spilling the cache by just a tiny bit will cause the entire row to be repeatedly flushed from and reloaded into the L1 cache.
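(A cheap way to test whether cache behavior is the culprit, not something from the question: interchange the j and k loops so that all three arrays are walked sequentially. The sketch below is meant as a drop-in replacement for the triple loop in the program above; if the 2048 anomaly shrinks, the strided access to B was the problem.)

/* i-k-j ordering: A[i][k] is hoisted out of the inner loop, and both
   B and C are now accessed with unit stride. */
for (i = 0; i < dimension; i++)
    for (k = 0; k < dimension; k++) {
        double aik = A[dimension*i + k];  /* invariant across the j loop */
        for (j = 0; j < dimension; j++)
            C[dimension*i + j] += aik * B[dimension*k + j];
    }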
The other possibility is cache associativity, due to the power-of-two stride. Though I think that processor's L1 is 2-way associative, so I don't think it matters in this case. (But I'll throw the idea out there anyway.)
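(If you want to sanity-check the associativity idea, you can count how many distinct cache sets a column walk of B actually touches. Below is a tiny standalone sketch; the parameters are assumptions, not measured values: 64-byte lines and 256 sets, which is what a 32 KB two-way cache like the one guessed above would have. With those numbers it reports 128 distinct sets for 2000 but only 1 for 2048.)

#include <stdio.h>

int main(void)
{
    const int line = 64;    /* assumed cache line size in bytes */
    const int sets = 256;   /* assumed set count: 32 KB, 2-way, 64 B lines */
    int dims[] = { 2000, 2048 };
    int d;
    for (d = 0; d < 2; d++) {
        int dim = dims[d];
        char used[256] = { 0 };
        int distinct = 0;
        long k;
        for (k = 0; k < dim; k++) {       /* walk one column of B */
            long addr = k * dim * 8;      /* byte offset of B[k][0]; 8 = sizeof(double);
                                             a fixed j only shifts every set equally */
            int set = (int)((addr / line) % sets);
            if (!used[set]) { used[set] = 1; distinct++; }
        }
        printf("dimension %d: column walk touches %d distinct sets\n",
               dim, distinct);
    }
    return 0;
}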
Possible Explanation 2: Conflict cache misses due to super-alignment on the L2 cache.
Your B array is iterated on the column, so the access is strided. The total data size is 2k × 2k doubles, which comes to about 32 MB per matrix. That's much larger than your L2 cache.
When the data is not aligned perfectly, you will have decent spatial locality on B. Although you are hopping across rows and using only one element per cache line, the cache line stays in the L2 cache to be reused by the next iteration of the middle loop.
However, when the data is aligned perfectly (2048), these hops will all land on the same "cache way" and will far exceed your L2 cache associativity. Therefore, the accessed cache lines of B will not stay in cache for the next iteration; instead, they will need to be pulled in all the way from RAM.
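(That would also be consistent with the dimension = 2060 result in the question: padding the stride past the power of two breaks the super-alignment. Here's a hedged sketch along those lines, not code from the question: the logical size stays 2048, but each row is stored with a padded stride. The pad of 8 doubles, one 64-byte line, is a guess, and the timing harness from the question can be reused around the triple loop.)

#include <stdlib.h>
#include <stdio.h>

#define DIM    2048
#define PAD    8            /* guessed pad: one 64-byte cache line */
#define STRIDE (DIM + PAD)  /* rows no longer a power of two apart */

int main(void)
{
    int i, j, k;
    double *A = (double*)malloc(STRIDE*DIM*sizeof(double));
    double *B = (double*)malloc(STRIDE*DIM*sizeof(double));
    double *C = (double*)malloc(STRIDE*DIM*sizeof(double));
    if (!A || !B || !C) return 1;

    srand(292);
    for (i = 0; i < DIM; i++)
        for (j = 0; j < DIM; j++) {
            A[STRIDE*i + j] = rand()/(RAND_MAX + 1.0);
            B[STRIDE*i + j] = rand()/(RAND_MAX + 1.0);
            C[STRIDE*i + j] = 0.0;
        }

    /* same i-j-k loop as the question, indexed with the padded stride */
    for (i = 0; i < DIM; i++)
        for (j = 0; j < DIM; j++)
            for (k = 0; k < DIM; k++)
                C[STRIDE*i + j] += A[STRIDE*i + k] * B[STRIDE*k + j];

    printf("C[0] = %f\n", C[0]);
    free(A); free(B); free(C);
    return 0;
}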