Each Answer to this Q is separated by one/two green lines.
I have an analysis code that does some heavy numerical operations using numpy. Just for curiosity, tried to compile it with cython with little changes and then I rewrote it using loops for the numpy part.
To my surprise, the code based on loops was much faster (8x). I cannot post the complete code, but I put together a very simple unrelated computation that shows similar behavior (albeit the timing difference is not so big):
Version 1 (without cython)
import numpy as np
def _process(array):
rows = array.shape[0]
cols = array.shape[1]
out = np.zeros((rows, cols))
for row in range(0, rows):
out[row, :] = np.sum(array  array[row, :], axis=0)
return out
def main():
data = np.load('data.npy')
out = _process(data)
np.save('vianumpy.npy', out)
Version 2 (building a module with cython)
import cython
cimport cython
import numpy as np
cimport numpy as np
DTYPE = np.float64
ctypedef np.float64_t DTYPE_t
@cython.boundscheck(False)
@cython.wraparound(False)
@cython.nonecheck(False)
cdef _process(np.ndarray[DTYPE_t, ndim=2] array):
cdef unsigned int rows = array.shape[0]
cdef unsigned int cols = array.shape[1]
cdef unsigned int row
cdef np.ndarray[DTYPE_t, ndim=2] out = np.zeros((rows, cols))
for row in range(0, rows):
out[row, :] = np.sum(array  array[row, :], axis=0)
return out
def main():
cdef np.ndarray[DTYPE_t, ndim=2] data
cdef np.ndarray[DTYPE_t, ndim=2] out
data = np.load('data.npy')
out = _process(data)
np.save('viacynpy.npy', out)
Version 3 (building a module with cython)
import cython
cimport cython
import numpy as np
cimport numpy as np
DTYPE = np.float64
ctypedef np.float64_t DTYPE_t
@cython.boundscheck(False)
@cython.wraparound(False)
@cython.nonecheck(False)
cdef _process(np.ndarray[DTYPE_t, ndim=2] array):
cdef unsigned int rows = array.shape[0]
cdef unsigned int cols = array.shape[1]
cdef unsigned int row
cdef np.ndarray[DTYPE_t, ndim=2] out = np.zeros((rows, cols))
for row in range(0, rows):
for col in range(0, cols):
for row2 in range(0, rows):
out[row, col] += array[row2, col]  array[row, col]
return out
def main():
cdef np.ndarray[DTYPE_t, ndim=2] data
cdef np.ndarray[DTYPE_t, ndim=2] out
data = np.load('data.npy')
out = _process(data)
np.save('vialoop.npy', out)
With a 10000×10 matrix saved in data.npy, the times are:
$ python m timeit c "from version1 import main;main()"
10 loops, best of 3: 4.56 sec per loop
$ python m timeit c "from version2 import main;main()"
10 loops, best of 3: 4.57 sec per loop
$ python m timeit c "from version3 import main;main()"
10 loops, best of 3: 2.96 sec per loop
Is this expected or is there an optimization that I am missing? The fact that version 1 and 2 gives the same result is somehow expected, but why version 3 is faster?
Ps. This is NOT the calculation that I need to make, just a simple example that shows the same thing.
With slight modification, version 3 becomes twice as fast:
@cython.boundscheck(False)
@cython.wraparound(False)
@cython.nonecheck(False)
def process2(np.ndarray[DTYPE_t, ndim=2] array):
cdef unsigned int rows = array.shape[0]
cdef unsigned int cols = array.shape[1]
cdef unsigned int row, col, row2
cdef np.ndarray[DTYPE_t, ndim=2] out = np.empty((rows, cols))
for row in range(rows):
for row2 in range(rows):
for col in range(cols):
out[row, col] += array[row2, col]  array[row, col]
return out
The bottleneck in your calculation is memory access. Your input array is C ordered, which means that moving along the last axis makes the smallest jump in memory. Therefore your inner loop should be along axis 1, not axis 0. Making this change cuts the run time in half.
If you need to use this function on small input arrays then you can reduce the overhead by using np.empty
instead of np.ones
. To reduce the overhead further use PyArray_EMPTY
from the numpy C API.
If you use this function on very large input arrays (2**31) then the integers used for indexing (and in the range
function) will overflow. To be safe use:
cdef Py_ssize_t rows = array.shape[0]
cdef Py_ssize_t cols = array.shape[1]
cdef Py_ssize_t row, col, row2
instead of
cdef unsigned int rows = array.shape[0]
cdef unsigned int cols = array.shape[1]
cdef unsigned int row, col, row2
Timing:
In [2]: a = np.random.rand(10000, 10)
In [3]: timeit process(a)
1 loops, best of 3: 3.53 s per loop
In [4]: timeit process2(a)
1 loops, best of 3: 1.84 s per loop
where process
is your version 3.
As mentioned in the other answers, version 2 is essentially the same as version 1 since cython is unable to dig into the array access operator in order to optimise it. There are 2 reasons for this

First, there is a certain amount of overhead in each call to a numpy function, as compared to optimised C code. However this overhead will become less significant if each operation deals with large arrays

Second, there is the creation of intermediate arrays. This is clearer if you consider a more complex operation such as
out[row, :] = A[row, :] + B[row, :]*C[row, :]
. In this case a whole arrayB*C
must be created in memory, then added toA
. This means that the CPU cache is being thrashed, as data is being read from and written to memory rather than being kept in the CPU and used straight away. Importantly, this problem becomes worse if you are dealing with large arrays.
Particularly since you state that your real code is more complex than your example, and it shows a much greater speedup, I suspect that the second reason is likely to be the main factor in your case.
As an aside, if your calculations are sufficiently simple, you can overcome this effect by using numexpr, although of course cython is useful in many more situations so it may be the better approach for you.
I would recommend using the a flag to have cython generate the html file that shows what is being translated into pure c vs calling the python API:
http://docs.cython.org/src/quickstart/cythonize.html
Version 2 gives nearly the same result as Version 1, because all of the heavy lifting is being done by the Python API (via numpy) and cython isn’t doing anything for you. In fact on my machine, numpy is built against MKL, so when I compile the cython generated c code using gcc, Version 3 is actually a little slower than the other two.
Cython shines when you are doing an array manipulation that numpy can’t do in a ‘vectorized’ way, or when you are doing something memory intensive that it allows you to avoid creating a large temporary array. I’ve gotten 115x speedups using cython vs numpy for some of my own code:
https://github.com/synapticarbors/pylangevinintegrator
Part of that was calling randomkit directory at the level of the c code instead of calling it through numpy.random
, but most of that was cython translating the computationally intensive for loops into pure c without calls to python.
The difference may be due to version 1 and 2 doing a Pythonlevel call to np.sum()
for each row, while version 3 likely compiles to a tight, pure C loop.
Studying the difference between version 2 and 3’s Cythongenerated C source should be enlightening.
I’d guess the main overhead you are saving is the temporary arrays created. You create a great big array array  array[row, :]
, then reduce it into a smaller array using sum
. But building that big temporary array won’t be free, especially if you need to allocate memory.