Parallel linear algebra for multicore systems

Patrik · Apr 5, 2012 · Viewed 9.3k times

I'm developing a program that needs to do heavy linear algebra calculations.

Right now I'm using LAPACK/BLAS routines, but I need to exploit my machine (a 24-core Xeon X5690 server).

I've found projects like PBLAS and ScaLAPACK, but they all seem to focus on distributed computing using MPI.

I have no cluster available; all computations will be done on a single server, so using MPI looks like overkill.

Does anyone have any suggestions?

Answer

Jonathan Dursi · Apr 5, 2012

As @larsmans mentioned (with, say, MKL), you still use the LAPACK + BLAS interfaces; you just link against a tuned, multithreaded implementation for your platform. MKL is great, but expensive. Other, open-source options include:

  • OpenBLAS / GotoBLAS: the Nehalem support should work fine, but there's no tuned support for Westmere yet. Handles multithreading very well.
  • ATLAS: automatically tunes itself to your architecture at installation time. It's probably slower for "typical" matrices (e.g., square SGEMM) but can be faster for odd cases, and for Westmere it may even beat OpenBLAS/GotoBLAS; I haven't tested this myself. It's mostly optimized for the serial case, but it does include parallel multithreaded routines.
  • PLASMA: a LAPACK implementation designed specifically for multicore.
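The key point is that all of these are drop-in replacements behind the same BLAS/LAPACK interface, so the calling code doesn't change; only the library you link against (and a thread-count environment variable) does. A minimal sketch of this idea, using NumPy as a stand-in for the calling program (NumPy dispatches matrix multiplication to the dgemm routine of whichever BLAS it was built against; the environment variable names here are assumptions that depend on that BLAS):

```python
import os

# The thread count must be set BEFORE the BLAS library is loaded.
# Which variable applies depends on the BLAS in use: OpenBLAS reads
# OPENBLAS_NUM_THREADS, MKL reads MKL_NUM_THREADS, and OpenMP-based
# builds read OMP_NUM_THREADS. Setting all three is a common hedge.
os.environ.setdefault("OPENBLAS_NUM_THREADS", "4")
os.environ.setdefault("MKL_NUM_THREADS", "4")
os.environ.setdefault("OMP_NUM_THREADS", "4")

import numpy as np

n = 500
rng = np.random.default_rng(0)
a = rng.random((n, n))
b = rng.random((n, n))

# This call goes through the standard BLAS dgemm interface; a tuned
# multithreaded BLAS parallelizes it with no change to this code.
c = a @ b
```

Swapping MKL for OpenBLAS (or ATLAS) means relinking, not rewriting, which makes it cheap to benchmark the candidates against each other on your actual workload.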

I'd also agree with Mark's comment: depending on which LAPACK routines you're using, the distributed-memory approach with MPI might actually be faster than the multithreaded one. That's unlikely for BLAS routines, but for something more complicated (say, the eigenvalue/eigenvector routines in LAPACK) it's worth testing. While it's true that MPI function calls add overhead, working in a distributed-memory mode means you don't have to worry as much about false sharing, synchronizing access to shared variables, and so on.
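Since the eigensolvers are the routines most worth benchmarking, here is a sketch of a timing harness for one of them, again using NumPy's wrapper around LAPACK's symmetric eigensolver as a stand-in for a direct LAPACK call (the matrix size and the idea of rerunning under different thread-count settings are assumptions for illustration):

```python
import time
import numpy as np

n = 400
rng = np.random.default_rng(0)
a = rng.standard_normal((n, n))
a = (a + a.T) / 2  # symmetrize so the symmetric eigensolver applies

t0 = time.perf_counter()
w, v = np.linalg.eigh(a)  # wraps LAPACK's symmetric eigensolver
elapsed = time.perf_counter() - t0

# Sanity check: for a symmetric eigendecomposition, A v = v diag(w).
assert np.allclose(a @ v, v * w)
print(f"eigh on a {n}x{n} symmetric matrix: {elapsed:.3f} s")
```

Running the same script with the BLAS thread count set to 1, 6, 12, and 24 gives a quick picture of how well the threaded library scales on your routines before you commit to one approach.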