Consider code like this:
import numpy as np
cimport numpy as np
cdef inline inc(np.ndarray[np.int32_t] arr, int i):
arr[i]+= 1
def test1(np.ndarray[np.int32_t] arr):
cdef int i
for i in xrange(len(arr)):
inc(arr, i)
def test2(np.ndarray[np.int32_t] arr):
cdef int i
for i in xrange(len(arr)):
arr[i] += 1
I used ipython to measure speed of test1 and test2:
In [7]: timeit ttt.test1(arr)
100 loops, best of 3: 6.13 ms per loop
In [8]: timeit ttt.test2(arr)
100000 loops, best of 3: 9.79 us per loop
Is there a way to optimize test1? Why doesn't cython inline this function as told?
UPDATE: Actually what I need is multidimension code like this:
# cython: infer_types=True
# cython: boundscheck=False
# cython: wraparound=False
import numpy as np
cimport numpy as np
cdef inline inc(np.ndarray[np.int32_t, ndim=2] arr, int i, int j):
arr[i, j] += 1
def test1(np.ndarray[np.int32_t, ndim=2] arr):
cdef int i,j
for i in xrange(arr.shape[0]):
for j in xrange(arr.shape[1]):
inc(arr, i, j)
def test2(np.ndarray[np.int32_t, ndim=2] arr):
cdef int i,j
for i in xrange(arr.shape[0]):
for j in xrange(arr.shape[1]):
arr[i,j] += 1
Timing for it:
In [7]: timeit ttt.test1(arr)
1 loops, best of 3: 647 ms per loop
In [8]: timeit ttt.test2(arr)
100 loops, best of 3: 2.07 ms per loop
Explicit inlining gives 300x speedup. And my real function is quite big so inlining it makes code maintainability much worse
UPDATE2:
# cython: infer_types=True
# cython: boundscheck=False
# cython: wraparound=False
import numpy as np
cimport numpy as np
cdef inline inc(np.ndarray[np.float32_t, ndim=2] arr, int i, int j):
arr[i, j]+= 1
def test1(np.ndarray[np.float32_t, ndim=2] arr):
cdef int i,j
for i in xrange(arr.shape[0]):
for j in xrange(arr.shape[1]):
inc(arr, i, j)
def test2(np.ndarray[np.float32_t, ndim=2] arr):
cdef int i,j
for i in xrange(arr.shape[0]):
for j in xrange(arr.shape[1]):
arr[i,j] += 1
cdef class FastPassingFloat2DArray(object):
cdef float* data
cdef int stride0, stride1
def __init__(self, np.ndarray[np.float32_t, ndim=2] arr):
self.data = <float*>arr.data
self.stride0 = arr.strides[0]/arr.dtype.itemsize
self.stride1 = arr.strides[1]/arr.dtype.itemsize
def __getitem__(self, tuple tp):
cdef int i, j
cdef float *pr, r
i, j = tp
pr = (self.data + self.stride0*i + self.stride1*j)
r = pr[0]
return r
def __setitem__(self, tuple tp, float value):
cdef int i, j
cdef float *pr, r
i, j = tp
pr = (self.data + self.stride0*i + self.stride1*j)
pr[0] = value
cdef inline inc2(FastPassingFloat2DArray arr, int i, int j):
arr[i, j]+= 1
def test3(np.ndarray[np.float32_t, ndim=2] arr):
cdef int i,j
cdef FastPassingFloat2DArray tmparr = FastPassingFloat2DArray(arr)
for i in xrange(arr.shape[0]):
for j in xrange(arr.shape[1]):
inc2(tmparr, i,j)
Timings:
In [4]: timeit ttt.test1(arr)
1 loops, best of 3: 623 ms per loop
In [5]: timeit ttt.test2(arr)
100 loops, best of 3: 2.29 ms per loop
In [6]: timeit ttt.test3(arr)
1 loops, best of 3: 201 ms per loop
More than 3 years have passed since the question was posted and there have been great progress in the meantime. On this code (Update 2 of the question):
# cython: infer_types=True
# cython: boundscheck=False
# cython: wraparound=False
import numpy as np
cimport numpy as np
cdef inline inc(np.ndarray[np.int32_t, ndim=2] arr, int i, int j):
arr[i, j]+= 1
def test1(np.ndarray[np.int32_t, ndim=2] arr):
cdef int i,j
for i in xrange(arr.shape[0]):
for j in xrange(arr.shape[1]):
inc(arr, i, j)
def test2(np.ndarray[np.int32_t, ndim=2] arr):
cdef int i,j
for i in xrange(arr.shape[0]):
for j in xrange(arr.shape[1]):
arr[i,j] += 1
I get the following timings:
arr = np.zeros((1000,1000), dtype=np.int32)
%timeit test1(arr)
%timeit test2(arr)
1 loops, best of 3: 354 ms per loop
1000 loops, best of 3: 1.02 ms per loop
So the problem is reproducible even after more than 3 years. Cython now has typed memoryviews, AFAIK it was introduced in Cython 0.16, so not available at the time the question was posted. With this:
# cython: infer_types=True
# cython: boundscheck=False
# cython: wraparound=False
import numpy as np
cimport numpy as np
cdef inline inc(int[:, ::1] tmv, int i, int j):
tmv[i, j]+= 1
def test3(np.ndarray[np.int32_t, ndim=2] arr):
cdef int i,j
cdef int[:, ::1] tmv = arr
for i in xrange(tmv.shape[0]):
for j in xrange(tmv.shape[1]):
inc(tmv, i, j)
def test4(np.ndarray[np.int32_t, ndim=2] arr):
cdef int i,j
cdef int[:, ::1] tmv = arr
for i in xrange(tmv.shape[0]):
for j in xrange(tmv.shape[1]):
tmv[i,j] += 1
With this I get:
arr = np.zeros((1000,1000), dtype=np.int32)
%timeit test3(arr)
%timeit test4(arr)
1000 loops, best of 3: 977 µs per loop
1000 loops, best of 3: 838 µs per loop
We are almost there and already faster than the old-fashioned way! Now, the inc()
function is eligible to be declared nogil
, so let's declare it so! But oops:
Error compiling Cython file:
[...]
cdef inline inc(int[:, ::1] tmv, int i, int j) nogil:
^
[...]
Function with Python return type cannot be declared nogil
Aaah, I totally missed that the void
return type was missing! Once again but now with void
:
cdef inline void inc(int[:, ::1] tmv, int i, int j) nogil:
tmv[i, j]+= 1
And finally I get:
%timeit test3(arr)
%timeit test4(arr)
1000 loops, best of 3: 843 µs per loop
1000 loops, best of 3: 853 µs per loop
As fast as manual inlining!
Now, just for fun, I tried Numba on this code:
import numpy as np
from numba import autojit, jit
@autojit
def inc(arr, i, j):
arr[i, j] += 1
@autojit
def test5(arr):
for i in xrange(arr.shape[0]):
for j in xrange(arr.shape[1]):
inc(arr, i, j)
I get:
arr = np.zeros((1000,1000), dtype=np.int32)
%timeit test5(arr)
100 loops, best of 3: 4.03 ms per loop
Even though it's 4.7x slower than Cython, most likely because the JIT compiler failed to inline inc()
, I think it is AWESOME! All I needed to do is to add @autojit
and didn't have to mess up the code with clumsy type declarations; 88x speedup for next to nothing!
I have tried other things with Numba, such as
@jit('void(i4[:],i4,i4)')
def inc(arr, i, j):
arr[i, j] += 1
or nopython=True
but failed to improve it any further.
Improving inlining is on the Numba developers' list, we only need to file more requests to make it have higher priority. ;)