How to properly use anaconda accelerate for GPU

budmitr · Jun 14, 2015 · Viewed 12.1k times

I am trying to speed up matrix computations with Anaconda Accelerate. I started with a very basic example: multiplying two matrices.

My goal is to get GPU matrix multiplication that is faster than the usual numpy.dot.

Here is my basic example, based on this documentation.

import time

import numpy as np
from numbapro import guvectorize

# Generalized ufunc: the core function multiplies one (m,n) matrix by one (n,p) matrix.
@guvectorize(['void(float32[:,:], float32[:,:], float32[:,:])'], '(m,n),(n,p)->(m,p)', target='gpu')
def matmul(A, B, C):
    m, n = A.shape
    n, p = B.shape
    for i in range(m):
        for j in range(p):
            C[i, j] = 0
            for k in range(n):
                C[i, j] += A[i, k] * B[k, j]

for dim in [50, 100, 200]:
    rnd = np.random.RandomState(0)
    a = rnd.rand(dim, dim).astype(np.float32)
    b = rnd.rand(dim, dim).astype(np.float32)
    resgpu = np.zeros_like(a)

    start = time.time()
    rescpu = np.dot(a, b)
    print('CPU:', time.time() - start)

    start = time.time()
    resgpu = matmul(a, b)
    print('GPU:', time.time() - start)

    print(np.allclose(rescpu, resgpu))
    print(np.allclose(resgpu, rescpu))

The results are disappointing: the GPU version is far slower than the CPU one.

CPU: 0.00011801719665527344
GPU: 0.05677294731140137
True
True
CPU: 0.00011205673217773438
GPU: 0.3881375789642334
True
True
CPU: 0.00038933753967285156
GPU: 3.018171787261963
True
True
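
A note on why this example may scale so badly: in a generalized ufunc, the signature '(m,n),(n,p)->(m,p)' describes the core function, and parallelism comes from the extra "loop" dimensions in front of the core ones. As far as I understand it, the GPU target runs one core call per loop element, so a single pair of 2-D matrices means a single core call doing the whole triple loop serially. The sketch below shows the same call pattern on the CPU with np.vectorize; matmul_core and gu_matmul are illustrative names, and this is not the numbapro implementation.

```python
import numpy as np

calls = 0  # how many times the core function has been invoked


def matmul_core(A, B):
    """Core function: multiplies ONE (m,n) matrix by ONE (n,p) matrix."""
    global calls
    calls += 1
    return np.dot(A, B)


# np.vectorize with a signature builds a CPU-side generalized ufunc.
gu_matmul = np.vectorize(matmul_core, signature='(m,n),(n,p)->(m,p)')

# A single pair of matrices: no loop dimensions, so one core call.
a = np.ones((4, 3), dtype=np.float32)
b = np.ones((3, 2), dtype=np.float32)
calls = 0
gu_matmul(a, b)
single_calls = calls

# A batch of 8 pairs: the leading axis is the loop dimension.
batch_a = np.ones((8, 4, 3), dtype=np.float32)
batch_b = np.ones((8, 3, 2), dtype=np.float32)
calls = 0
out = gu_matmul(batch_a, batch_b)
batch_calls = calls

print('core calls for one pair:', single_calls)
print('core calls for a batch of 8:', batch_calls)
```

If that reading is right, this kind of gufunc only pays off on a GPU when there are many independent small products to batch, not for one large product.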

Of course, I understand that numpy's internal implementation is well optimized, but I expected the official Anaconda example to perform well. I am using Python 3.4.3 and got errors when trying these two helper libraries: http://www.cs.toronto.edu/~tijmen/gnumpy.html and https://github.com/rctn/gpupy

I should say that with gpupy I did get a speedup on Python 2.7.

So my question is: how can I get matrix multiplication on the GPU that is faster than numpy on the CPU? What is wrong with the official Anaconda example, and is there a working library for Python 3 that allows using the GPU in a numpy-like way?

===

RESULTS

Unfortunately, there is no simple, good way to do this on Python 3; use 2.7 instead.

Thanks to @rth for recommending the awesome library scikits.cuda.

Available functions

A benchmark (run with Anaconda's MKL build, so numpy is fast too):

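import time

import numpy as np
import pycuda.autoinit  # creates a CUDA context on import
import pycuda.gpuarray as gpuarray
import scikits.cuda.linalg as culinalg  # named skcuda.linalg in newer releases

culinalg.init()  # set up the cuBLAS handle used by culinalg.dot

dim = 10000
rnd = np.random.RandomState(0)
a = rnd.rand(dim, dim).astype(np.float32)
b = rnd.rand(dim, dim).astype(np.float32)
# Copy both operands to the GPU before timing
a_gpu = gpuarray.to_gpu(a)
b_gpu = gpuarray.to_gpu(b)

start = time.time()
rescpu = np.dot(a, b)
print 'CPU:', time.time() - start

start = time.time()
resgpu = culinalg.dot(a_gpu, b_gpu)
print 'GPU:', time.time() - start

# Copy the result back to the host to compare it with the CPU result
resgpu = resgpu.get()
print np.allclose(rescpu, resgpu)
print np.allclose(resgpu, rescpu)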

And the results:

CPU: 16.4765479565
GPU: 0.000520944595337
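
A caveat on the GPU figure above: a 10000 x 10000 product needs about 2·10^12 floating-point operations, so 0.0005 s is not plausible as compute time. GPU library calls usually return as soon as the work is queued, so the timer most likely stops before the multiplication has finished, and the host-to-device copies are not counted either. A fairer measurement does a warm-up call and waits for completion before reading the clock. Below is a minimal sketch; bench is a made-up helper, and the sync hook is an assumption (with PyCUDA it would presumably be pycuda.driver.Context.synchronize). It is left as None here so the example runs on the CPU only.

```python
import time

import numpy as np


def bench(fn, *args, repeats=5, sync=None):
    """Return the best wall-clock time of fn(*args) over `repeats` runs.

    `sync` is an optional callable that blocks until queued device work
    is done; pass None for plain CPU code.
    """
    fn(*args)  # warm-up run: excludes one-time setup costs
    if sync is not None:
        sync()
    best = float('inf')
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        if sync is not None:
            sync()  # wait for the result before stopping the clock
        best = min(best, time.perf_counter() - start)
    return best


rnd = np.random.RandomState(0)
a = rnd.rand(200, 200).astype(np.float32)
b = rnd.rand(200, 200).astype(np.float32)

cpu_time = bench(np.dot, a, b)
print('CPU best of 5:', cpu_time)
```

To include the transfer cost, wrap to_gpu, the product, and get() in one function and pass that function to the helper.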

Answer

rth · Jun 15, 2015

You should have a look at BLAS implementations that provide highly optimized routines for classical linear algebra operations. The multiplication of dense matrices is performed with the gemm function.

  • For instance, matrix multiplication in numpy is significantly faster if numpy is compiled against an optimized BLAS implementation (OpenBLAS, ATLAS, MKL, etc.).
  • For the GPU, NVIDIA provides the cuBLAS implementation. According to this answer, it can be called with numpy arrays using the scikits.cuda module. Anaconda Accelerate, which you are using, also provides direct bindings to cuBLAS.
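
For reference, gemm computes C ← alpha·A·B + beta·C. Here is a minimal NumPy sketch of that contract; gemm_reference is a made-up name for illustration, not a BLAS or cuBLAS binding.

```python
import numpy as np


def gemm_reference(alpha, A, B, beta, C):
    """Hypothetical helper: returns alpha * (A x B) + beta * C, like BLAS gemm."""
    return alpha * np.dot(A, B) + beta * C


rnd = np.random.RandomState(0)
A = rnd.rand(3, 4).astype(np.float32)
B = rnd.rand(4, 2).astype(np.float32)
C = np.zeros((3, 2), dtype=np.float32)

# With alpha=1 and beta=0, gemm reduces to a plain matrix product.
result = gemm_reference(1.0, A, B, 0.0, C)
print(np.allclose(result, np.dot(A, B)))
```

The optimized libraries listed above implement this same operation with blocking, vectorization, and multithreading, which is where the speed comes from.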

BTW, if you want to benchmark CPU vs GPU performance for matrix multiplication, you should also specify the BLAS used by Numpy for the CPU calculations, since the results could differ by an order of magnitude (see this benchmark).
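
To check which BLAS your numpy build is linked against, you can print its build configuration. A small sketch (the exact output format varies between numpy versions):

```python
import numpy as np

# Prints the build configuration, including the BLAS/LAPACK libraries
# (for example MKL or OpenBLAS) this numpy build was linked against.
np.show_config()
print('numpy version:', np.__version__)
```

Quoting this output next to the timings makes CPU numbers from different machines easier to compare.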