Why is matrix multiplication faster with numpy than with ctypes in Python?

Charles Menguy picture Charles Menguy · May 4, 2012 · Viewed 28.4k times · Source

I was trying to figure out the fastest way to do matrix multiplication and tried 3 different ways:

  • Pure python implementation: no surprises here.
  • Numpy implementation using numpy.dot(a, b)
  • Interfacing with C using ctypes module in Python.

This is the C code that is transformed into a shared library:

#include <stdio.h>
#include <stdlib.h>

void matmult(float* a, float* b, float* c, int n) {
    int i = 0;
    int j = 0;
    int k = 0;

    /*float* c = malloc(nay * sizeof(float));*/

    for (i = 0; i < n; i++) {
        for (j = 0; j < n; j++) {
            int sub = 0;
            for (k = 0; k < n; k++) {
                sub = sub + a[i * n + k] * b[k * n + j];
            }
            c[i * n + j] = sub;
        }
    }
    return ;
}

And the Python code that calls it:

def C_mat_mult(a, b):
    libmatmult = ctypes.CDLL("./matmult.so")

    dima = len(a) * len(a)
    dimb = len(b) * len(b)

    array_a = ctypes.c_float * dima
    array_b = ctypes.c_float * dimb
    array_c = ctypes.c_float * dima

    suma = array_a()
    sumb = array_b()
    sumc = array_c()

    inda = 0
    for i in range(0, len(a)):
        for j in range(0, len(a[i])):
            suma[inda] = a[i][j]
            inda = inda + 1
        indb = 0
    for i in range(0, len(b)):
        for j in range(0, len(b[i])):
            sumb[indb] = b[i][j]
            indb = indb + 1

    libmatmult.matmult(ctypes.byref(suma), ctypes.byref(sumb), ctypes.byref(sumc), 2);

    res = numpy.zeros([len(a), len(a)])
    indc = 0
    for i in range(0, len(sumc)):
        res[indc][i % len(a)] = sumc[i]
        if i % len(a) == len(a) - 1:
            indc = indc + 1

    return res

I would have bet that the version using C would have been faster ... and I'd have lost ! Below is my benchmark which seems to show that I either did it incorrectly, or that numpy is stupidly fast:

benchmark

I'd like to understand why the numpy version is faster than the ctypes version, I'm not even talking about the pure Python implementation since it is kind of obvious.

Answer

Juno Woods picture Juno Woods · Aug 20, 2012

NumPy uses a highly-optimized, carefully-tuned BLAS method for matrix multiplication (see also: ATLAS). The specific function in this case is GEMM (for generic matrix multiplication). You can look up the original by searching for dgemm.f (it's in Netlib).

The optimization, by the way, goes beyond compiler optimizations. Above, Philip mentioned Coppersmith–Winograd. If I remember correctly, this is the algorithm which is used for most cases of matrix multiplication in ATLAS (though a commenter notes it could be Strassen's algorithm).

In other words, your matmult algorithm is the trivial implementation. There are faster ways to do the same thing.