I was trying to figure out the fastest way to do matrix multiplication and tried 3 different ways:

- a pure Python implementation
- numpy.dot(a, b)
- interfacing with C through the ctypes module in Python

This is the C code that is compiled into a shared library:
/* matmult.c -- built into a shared library with something like:
 *   gcc -shared -fPIC -O2 -o matmult.so matmult.c
 */
void matmult(float *a, float *b, float *c, int n) {
    int i, j, k;
    for (i = 0; i < n; i++) {
        for (j = 0; j < n; j++) {
            float sub = 0.0f;  /* accumulate in float, not int */
            for (k = 0; k < n; k++) {
                sub = sub + a[i * n + k] * b[k * n + j];
            }
            c[i * n + j] = sub;
        }
    }
}
And the Python code that calls it:
import ctypes
import numpy

def C_mat_mult(a, b):
    libmatmult = ctypes.CDLL("./matmult.so")

    dima = len(a) * len(a)
    dimb = len(b) * len(b)

    array_a = ctypes.c_float * dima
    array_b = ctypes.c_float * dimb
    array_c = ctypes.c_float * dima

    suma = array_a()
    sumb = array_b()
    sumc = array_c()

    # flatten the two input matrices into the ctypes arrays
    inda = 0
    for i in range(0, len(a)):
        for j in range(0, len(a[i])):
            suma[inda] = a[i][j]
            inda = inda + 1
    indb = 0
    for i in range(0, len(b)):
        for j in range(0, len(b[i])):
            sumb[indb] = b[i][j]
            indb = indb + 1

    # pass the real dimension rather than a hard-coded 2
    libmatmult.matmult(ctypes.byref(suma), ctypes.byref(sumb), ctypes.byref(sumc), len(a))

    # unflatten the result back into an n x n numpy array
    res = numpy.zeros([len(a), len(a)])
    indc = 0
    for i in range(0, len(sumc)):
        res[indc][i % len(a)] = sumc[i]
        if i % len(a) == len(a) - 1:
            indc = indc + 1
    return res
I would have bet that the version using C would be faster ... and I'd have lost! My benchmark seems to show that I either did it incorrectly, or that numpy is stupidly fast.

I'd like to understand why the numpy version is faster than the ctypes version; I'm not even talking about the pure Python implementation, since that one is kind of obvious.
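For reference, here is a minimal sketch of the kind of timing harness I mean, assuming the C_mat_mult function above and matmult.so are available (the matrix size here is an illustrative assumption, not my actual benchmark setup):

import timeit
import numpy

n = 128                                   # illustrative size only
a = numpy.random.rand(n, n).tolist()      # C_mat_mult above expects lists of lists
b = numpy.random.rand(n, n).tolist()

print("ctypes   :", timeit.timeit(lambda: C_mat_mult(a, b), number=1))
print("numpy.dot:", timeit.timeit(lambda: numpy.dot(a, b), number=1))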
NumPy uses a highly optimized, carefully tuned BLAS routine for matrix multiplication (see also: ATLAS). The specific function in this case is GEMM (for general matrix multiplication). You can look up the original by searching for dgemm.f (it's in Netlib).
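If you want to see which BLAS your NumPy build links against, or call that GEMM routine directly, something along these lines works (the direct call goes through SciPy's BLAS wrappers, which I'm assuming you have installed; SciPy isn't part of the question itself):

import numpy
from scipy.linalg import blas

numpy.show_config()        # prints the BLAS/LAPACK libraries NumPy was built with

n = 512
a = numpy.random.rand(n, n)
b = numpy.random.rand(n, n)

c_dot  = numpy.dot(a, b)              # dispatches to the linked BLAS GEMM
c_gemm = blas.dgemm(1.0, a, b)        # the same dgemm routine, called explicitly

print(numpy.allclose(c_dot, c_gemm))  # both paths should agree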
The optimization, by the way, goes beyond compiler optimizations. Above, Philip mentioned Coppersmith–Winograd. If I remember correctly, this is the algorithm which is used for most cases of matrix multiplication in ATLAS (though a commenter notes it could be Strassen's algorithm).
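To give a flavour of what such a sub-cubic algorithm looks like, here is a minimal sketch of Strassen's scheme in NumPy, assuming square matrices whose size is a power of two; it illustrates the idea of trading eight recursive products for seven, and is not meant to compete with a tuned BLAS:

import numpy

def strassen(a, b, leaf=64):
    n = a.shape[0]
    if n <= leaf:                      # small blocks: fall back to ordinary multiplication
        return a @ b
    h = n // 2
    a11, a12, a21, a22 = a[:h, :h], a[:h, h:], a[h:, :h], a[h:, h:]
    b11, b12, b21, b22 = b[:h, :h], b[:h, h:], b[h:, :h], b[h:, h:]
    # seven recursive products instead of eight
    m1 = strassen(a11 + a22, b11 + b22, leaf)
    m2 = strassen(a21 + a22, b11, leaf)
    m3 = strassen(a11, b12 - b22, leaf)
    m4 = strassen(a22, b21 - b11, leaf)
    m5 = strassen(a11 + a12, b22, leaf)
    m6 = strassen(a21 - a11, b11 + b12, leaf)
    m7 = strassen(a12 - a22, b21 + b22, leaf)
    c = numpy.empty_like(a)
    c[:h, :h] = m1 + m4 - m5 + m7
    c[:h, h:] = m3 + m5
    c[h:, :h] = m2 + m4
    c[h:, h:] = m1 - m2 + m3 + m6
    return c

For power-of-two sizes, strassen(a, b) agrees with numpy.dot(a, b) up to floating-point rounding.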
In other words, your matmult algorithm is the trivial implementation. There are faster ways to do the same thing.