The NVIDIA CUDA Basic Linear Algebra Subroutines (cuBLAS) library is a GPU-accelerated implementation of the complete standard BLAS library for use with CUDA-capable GPUs.
I'm updating my question with some new benchmarking results (I also reformulated the question to be more specific and I …