quickly populate sparse matrices element by element:
import numpy as np
cimport numpy as np
import scipy.sparse
...
# nnz (the number of non-zeros) and N (the matrix
# dimension) are defined in the elided code above.
cdef np.ndarray[np.int32_t] rows, cols
cdef np.ndarray[double] values
rows = np.zeros(nnz, dtype=np.int32)
cols = np.zeros(nnz, dtype=np.int32)
values = np.zeros(nnz, dtype=np.double)
cdef int idx
for idx in range(nnz):
    # Compute the next non-zero matrix element
    # (row, col, value)
    ...
    rows[idx] = row; cols[idx] = col
    values[idx] = value
# Finally, we construct a regular
# SciPy sparse matrix:
return scipy.sparse.coo_matrix(
    (values, (rows, cols)), shape=(N, N))
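For reference, the same COO construction can be exercised from pure Python; a minimal sketch, in which the triplet data and the size N are made up for illustration:

import numpy as np
import scipy.sparse

# Illustrative triplet data (made up for this sketch):
rows = np.array([0, 1, 3], dtype=np.int32)
cols = np.array([2, 0, 3], dtype=np.int32)
values = np.array([1.5, -2.0, 4.25])
N = 4
A = scipy.sparse.coo_matrix(
    (values, (rows, cols)), shape=(N, N))
# COO is cheap to build; convert to CSR for
# efficient arithmetic and row slicing:
A_csr = A.tocsr()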
Data transformation and reduction
Consider computing a simple expression for a large number of different input values, e.g.:

v = np.sqrt(x**2 + y**2 + z**2)

where the variables are arrays holding the three vectors x, y and z. This is a case where one typically does not need Cython: the computation is easily expressed in pure NumPy operations, which are already optimized and usually fast enough.
The exceptions are either extremely small or extremely large amounts of data. For small data sets that are evaluated many, many times, the Python overhead of the NumPy expression will dominate, and writing a loop in Cython removes this overhead. For large amounts of data, NumPy has two problems: it requires large amounts of temporary memory, and it repeatedly moves temporary results over the memory bus. In most scientific settings the memory bus can easily become the main bottleneck, not the CPU (for a detailed explanation see [Alted]). In the example above, NumPy will first square x in a temporary buffer, then square y in another temporary buffer, then add them together using a third temporary buffer, and so on.
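To make this concrete, the temporaries can be reduced (though not eliminated) in pure NumPy by reusing buffers through the ufunc out= argument; a sketch, with illustrative array sizes:

import numpy as np

n = 10**7
x = np.random.rand(n)
y = np.random.rand(n)
z = np.random.rand(n)

# Naive form: roughly five full-size temporaries.
v = np.sqrt(x**2 + y**2 + z**2)

# Buffer reuse via out=: two allocations in total.
buf = np.square(x)           # buf = x**2
scratch = np.square(y)       # scratch = y**2
buf += scratch               # buf = x**2 + y**2
np.square(z, out=scratch)    # reuse scratch for z**2
buf += scratch               # buf = x**2 + y**2 + z**2
v2 = np.sqrt(buf, out=buf)   # in-place square root

Even so, each operation is still a separate pass over the arrays, so the data crosses the memory bus several times; removing that traffic requires fusing the whole expression into a single loop.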
In Cython, it is possible to write such a loop manually, running at native speed:

from libc.math cimport sqrt
...
cdef np.ndarray[double] x, y, z, v
cdef Py_ssize_t i  # typed index keeps the loop in C
x = ...; y = ...; z = ...
v = np.zeros_like(x)
...
for i in range(x.shape[0]):
    v[i] = sqrt(
        x[i]**2 + y[i]**2 + z[i]**2)
which avoids these problems, as no temporary buffers are required. The speedup is on the order of a factor of ten for large arrays.
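Packaged as a callable function, the loop might look as follows; a minimal sketch, in which the module layout and the name vector_norm are our own:

# norms.pyx
import numpy as np
cimport numpy as np
from libc.math cimport sqrt

def vector_norm(np.ndarray[double] x,
                np.ndarray[double] y,
                np.ndarray[double] z):
    # Fused loop: one pass over the data,
    # no full-size temporaries.
    cdef np.ndarray[double] v = np.zeros_like(x)
    cdef Py_ssize_t i
    for i in range(x.shape[0]):
        v[i] = sqrt(x[i]**2 + y[i]**2 + z[i]**2)
    return v

After compiling the module (for instance via cythonize), vector_norm imports and is called like any other Python function.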
If one is doing a lot of such transformations, one should also evaluate numexpr and Theano, which are dedicated to this task. Theano is able to reformulate the expression for optimal numerical stability, and can compute it on a highly parallel GPU.
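As an illustration of the numexpr approach, its evaluate function compiles the expression string and computes it in cache-sized blocks, avoiding full-size temporaries; a minimal sketch with illustrative data:

import numpy as np
import numexpr as ne

n = 10**7
x = np.random.rand(n)
y = np.random.rand(n)
z = np.random.rand(n)

# Arrays are picked up by name from the calling scope;
# the expression runs in one blocked pass:
v = ne.evaluate("sqrt(x**2 + y**2 + z**2)")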