14

I wish to implement the following expression in Python: $$ x_i = \sum_{j=1}^{i-1}k_{i-j,j}a_{i-j}a_j, $$ where $x$ and $a$ are numpy arrays of size $n$, and $k$ is a numpy array of size $n\times n$. The size $n$ might be up to about 10000, and the function is part of an inner loop that will be evaluated many times, so speed is important.

Ideally I'd like to avoid a for loop altogether, though I guess it's not the end of the world if there is one. The problem is that I'm having trouble seeing how to do it without having a couple of nested loops, and that's likely to make it rather slow.

Can anybody see how to express the above equation using numpy in a way that's efficient, and preferably also readable? More generally, what is the best way to approach this sort of thing?
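To be concrete, the naive double loop I have in mind (just a sketch, with `naive` as a placeholder name) is a direct transcription of the sum; note the inner sum is empty for $i \le 1$, so $x_0 = x_1 = 0$:

```python
import numpy as np

def naive(k, a):
    # x_i = sum_{j=1}^{i-1} k[i-j, j] * a[i-j] * a[j]
    n = a.size
    x = np.zeros_like(a)
    for i in range(n):
        for j in range(1, i):  # empty range for i <= 1, so x[0] = x[1] = 0
            x[i] += k[i - j, j] * a[i - j] * a[j]
    return x

k = np.arange(16.0).reshape(4, 4)
a = np.array([1.0, 2.0, 3.0, 4.0])
print(naive(k, a))  # [ 0.  0. 20. 90.]
```

This is $O(n^2)$ in the inner loop, which is what worries me for $n$ around 10000.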

N. Virgo
  • 1,223
  • 1
  • 10
  • 23
  • I had a similar question a couple of days ago. I asked it over at stackoverflow. Check out this post. I use scipy.weave instead of cython. Does anybody know if this makes any (considerable) performance difference? – seb Mar 09 '13 at 06:27

3 Answers

17

Here is the Numba solution. On my machine the Numba version is >1000x faster than the Python version without the decorator (for a 200x200 matrix 'k' and a length-200 vector 'a'). You can also use the @autojit decorator, which adds about 10 microseconds per call, so that the same code will work with multiple types.

import numpy as np
from numba import jit, autojit

@jit('f8[:](f8[:,:],f8[:])')
#@autojit
def looped_ver(k, a):
    x = np.empty_like(a)
    for i in range(x.size):
        sm = 0.0
        for j in range(0, i+1):
            sm += k[i-j,j] * a[i-j] * a[j]
        x[i] = sm
    return x
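As a quick sanity check of the loop itself (run here without the decorator, so it only needs NumPy; the @jit line above is what adds the speed):

```python
import numpy as np

def looped_ver(k, a):
    x = np.empty_like(a)
    for i in range(x.size):
        sm = 0.0
        for j in range(0, i + 1):
            sm += k[i - j, j] * a[i - j] * a[j]
        x[i] = sm
    return x

k = np.arange(9.0).reshape(3, 3)
a = np.array([1.0, 2.0, 3.0])
# x[0] = k[0,0]*a[0]*a[0]
# x[1] = k[1,0]*a[1]*a[0] + k[0,1]*a[0]*a[1]
# x[2] = k[2,0]*a[2]*a[0] + k[1,1]*a[1]*a[1] + k[0,2]*a[0]*a[2]
print(looped_ver(k, a))  # [ 0.  8. 40.]
```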

Disclosure: I am one of the Numba developers.

Geoff Oxberry
  • 30,394
  • 9
  • 64
  • 127
Travis Oliphant
  • 271
  • 1
  • 2
  • Thanks, that looks pretty straightforward. I didn't even know about numba! Cython, PyPy, Numba ... it's a confusing world. – N. Virgo Mar 09 '13 at 09:21
  • 3
    Travis, very cool, do you mind adding a disclosure to the bottom of your answer that you are one of the numba developers? – Aron Ahmadia Mar 09 '13 at 14:59
  • 1
    With $n=200$, the Cython version is also much faster relative to the looped Python (~700x, for me). I would be curious about how this performance changes with larger matrices, and whether they experience the same (memory?) bottlenecks. – Nat Wilson Mar 11 '13 at 18:06
  • @NatWilson - if you ask this as a question on scicomp, I'd be happy to try and tackle it for you :) – Aron Ahmadia Mar 12 '13 at 13:22
4

Here's a start. First, my apologies for any mistakes.

I experimented with a couple of different approaches. I was a bit confused by the limits on the summation - should the upper limit be $i$, rather than $i-1$?

Edit: No, the upper limit was correct as provided in the question. I have left it as is here because another answer now uses the same code, but the fix is simple.

First a looped version:

import numpy as np

def looped_ver(k, a):
    x = np.empty_like(a)
    for i in range(x.size):
        sm = 0
        for j in range(0, i+1):
            sm += k[i-j,j] * a[i-j] * a[j]
        x[i] = sm
    return x

I made it a single loop with numpy slices:

def vectorized_ver(k, a):
    ktr = np.zeros_like(k)
    ar = np.zeros_like(k)
    sz = len(a)
    for i in range(sz):
        ktr[i,:i+1] = k[::-1].diagonal(-sz+i+1)
        a_ = a[:i+1]
        ar[i,:i+1] = a_[::-1] * a_
    return np.sum(ktr * ar, 1)

The numpy version with one explicit loop is about 25x faster on my computer when $n=5000$.
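To check that the diagonal slicing really reproduces the loop, here is a small self-contained equivalence test (both functions restated so the snippet runs standalone; the seed and sizes are arbitrary):

```python
import numpy as np

def looped_ver(k, a):
    x = np.empty_like(a)
    for i in range(x.size):
        sm = 0.0
        for j in range(0, i + 1):
            sm += k[i - j, j] * a[i - j] * a[j]
        x[i] = sm
    return x

def vectorized_ver(k, a):
    ktr = np.zeros_like(k)
    ar = np.zeros_like(k)
    sz = len(a)
    for i in range(sz):
        # k[::-1].diagonal(-sz+i+1) picks out k[i-j, j] for j = 0..i
        ktr[i, :i + 1] = k[::-1].diagonal(-sz + i + 1)
        a_ = a[:i + 1]
        ar[i, :i + 1] = a_[::-1] * a_
    return np.sum(ktr * ar, 1)

rng = np.random.default_rng(0)
k = rng.standard_normal((50, 50))
a = rng.standard_normal(50)
print(np.allclose(looped_ver(k, a), vectorized_ver(k, a)))  # True
```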

Then I wrote a Cython version of the (more-readable) looped code.

import numpy as np
import cython
cimport numpy as np

@cython.boundscheck(False)
@cython.wraparound(False)
def cyth_ver(double [:, ::1] k not None,
              double [:] a not None):
    cdef double[:] x = np.empty_like(a)
    cdef double sm
    cdef int i, j

    for i in range(len(a)):
        sm = 0.0
        for j in range(i+1):
            sm = sm + k[i-j,j] * a[i-j] * a[j]
        x[i] = sm
    return x

On my laptop, this one is about 200x faster than the looped version (and 8x faster than the 1-loop vectorized version). I'm sure others can do better.

I played with a Julia version, and it seemed (if I timed it properly) comparable to the Cython code.

Nat Wilson
  • 438
  • 3
  • 8
  • Many thanks. This has convinced me that I need to make the effort to learn Cython :) I should have mentioned that $x_0$ should always be zero, because the index going to $i-1$ means there are no terms to sum over - but this is a minor detail. – N. Virgo Mar 09 '13 at 02:20
  • Ah, I see. I gathered that from the original summation, but wasn't sure that was the intent. – Nat Wilson Mar 09 '13 at 06:13
1

What you want seems to be a convolution; I think the quickest way to achieve it would be the numpy.convolve function.

You may have to fix the indices according to your exact needs, but I think you would like to try something like:

import numpy as np

a = np.array([1, 2, 3, 4, 5], dtype=float)
k = np.array([2, 4, 6, 8, 10], dtype=float)

result = np.convolve(a, k * a[::-1])