I'm studying simple multiplication of two big matrices using the Eigen library. This multiplication appears to be noticeably slower than both Matlab and Python for the same size matrices.
Is there anything to be done to make the Eigen operation faster?
Problem Details
X : random 1000 x 50000 matrix
Y : random 50000 x 300 matrix
Timing experiments (on my late 2011 MacBook Pro)
Using Matlab: X*Y takes ~1.3 sec
Using Enthought Python: numpy.dot(X, Y) takes ~2.2 sec
Using Eigen: X*Y takes ~2.7 sec
Eigen Details
You can get my Eigen code (as a MEX function): https://gist.github.com/michaelchughes/4742878
This MEX function reads in two matrices from Matlab, and returns their product.
Running this MEX function without the matrix product operation (i.e. just doing the I/O) adds negligible overhead, so the I/O between the function and Matlab doesn't explain the big difference in performance. It's clearly the actual matrix product operation.
I'm compiling with g++, with these optimization flags: "-O3 -DNDEBUG"
I'm using the latest stable Eigen header files (3.1.2).
Any suggestions on how to improve Eigen's performance? Can anybody replicate the gap I'm seeing?
UPDATE: The compiler really seems to matter. The original Eigen timing was done using Apple Xcode's version of g++: llvm-g++-4.2.
When I use g++-4.7 downloaded via MacPorts (same CXXOPTIMFLAGS), I get 2.4 sec instead of 2.7.
Any other suggestions of how to compile better would be much appreciated.
You can also get raw C++ code for this experiment: https://gist.github.com/michaelchughes/4747789
./MatProdEigen 1000 50000 300
reports 2.4 seconds under g++-4.7
First of all, when doing performance comparisons, make sure you have disabled turbo-boost (TB). On my system, using gcc 4.5 from MacPorts and without turbo-boost, I get 3.5 sec, which corresponds to 8.4 GFLOPS while the theoretical peak of my 2.3 GHz Core i7 is 9.2 GFLOPS, so not too bad.
Matlab is based on the Intel MKL, and given the reported performance, it is clearly using a multithreaded version. It is unlikely that a small library like Eigen can beat Intel on its own CPU!
Numpy can use any BLAS library: ATLAS, MKL, OpenBLAS, eigen-blas, etc. I guess that in your case it was using ATLAS, which is fast too.
Finally, here is how you can get better performance: enable multi-threading in Eigen by compiling with -fopenmp. By default, Eigen uses as many threads as OpenMP's default. Unfortunately that number corresponds to the number of logical cores, not physical cores, so make sure hyper-threading is disabled, or set the OMP_NUM_THREADS environment variable to the number of physical cores. On my machine I get 1.25 sec (without TB), and 0.95 sec with TB.