How to speed up matrix multiplication in C++?

multiholle picture multiholle · Nov 29, 2010 · Viewed 21.5k times · Source

I'm performing matrix multiplication with this simple algorithm. To be more flexible I used objects for the matricies which contain dynamicly created arrays.

Comparing this solution to my first one with static arrays it is 4 times slower. What can I do to speed up the data access? I don't want to change the algorithm.

 matrix mult_std(matrix a, matrix b) {
 matrix c(a.dim(), false, false);
 for (int i = 0; i < a.dim(); i++)
  for (int j = 0; j < a.dim(); j++) {
   int sum = 0;
   for (int k = 0; k < a.dim(); k++)
    sum += a(i,k) * b(k,j);
   c(i,j) = sum;
  }

 return c;
}


EDIT
I corrected my Question avove! I added the full source code below and tried some of your advices:

  • swapped k and j loop iterations -> performance improvement
  • declared dim() and operator()() as inline -> performance improvement
  • passing arguments by const reference -> performance loss! why? so I don't use it.

The performance is now nearly the same as it was in the old porgram. Maybe there should be a bit more improvement.

But I have another problem: I get a memory error in the function mult_strassen(...). Why?
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc


OLD PROGRAM
main.c http://pastebin.com/qPgDWGpW

c99 main.c -o matrix -O3


NEW PROGRAM
matrix.h http://pastebin.com/TYFYCTY7
matrix.cpp http://pastebin.com/wYADLJ8Y
main.cpp http://pastebin.com/48BSqGJr

g++ main.cpp matrix.cpp -o matrix -O3.


EDIT
Here are some results. Comparison between standard algorithm (std), swapped order of j and k loop (swap) and blocked algortihm with block size 13 (block). alt text

Answer

chrisaycock picture chrisaycock · Nov 29, 2010

Speaking of speed-up, your function will be more cache-friendly if you swap the order of the k and j loop iterations:

matrix mult_std(matrix a, matrix b) {
 matrix c(a.dim(), false, false);
 for (int i = 0; i < a.dim(); i++)
  for (int k = 0; k < a.dim(); k++)
   for (int j = 0; j < a.dim(); j++)  // swapped order
    c(i,j) += a(i,k) * b(k,j);

 return c;
}

That's because a k index on the inner-most loop will cause a cache miss in b on every iteration. With j as the inner-most index, both c and b are accessed contiguously, while a stays put.