We observe that "2D along 1D" is equivalent to: first do 1D horizontally (each row independently), then sum vertically. The complete operation for an output point, up to a shift, is `sum(a * b)` - a 2D product and 2D sum.
1. 1D convolution for row `m` does `sum(a[m, :] * b[m, :])` for every shift of `b` by `i`, denoted `b_i`.
2. Summing vertically for a given `i` is hence summing `sum(a[m, :] * b_i[m, :])` over all `m`.
3. (2) is the same as `sum(a * b_i)`, i.e. `sum(a[:, :] * b_i[:, :])`.
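As a quick numeric sanity check of steps (1)-(3) - a minimal sketch of my own, not from the original derivation:

```python
import numpy as np

# check: per-row products summed over rows == full 2D sum, for a fixed shift `i`
rng = np.random.default_rng(0)
a = rng.standard_normal((4, 16))
b = rng.standard_normal((4, 16))

i = 3
b_i = np.roll(b, i, axis=1)  # `b` circularly shifted by `i` along rows

per_row_then_vertical = sum(np.sum(a[m] * b_i[m]) for m in range(a.shape[0]))
full_2d = np.sum(a * b_i)
assert np.allclose(per_row_then_vertical, full_2d)
```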
So, if we let `hf = np.conj(fft(ifftshift(h, axes=1), axis=1))` and `prod = fft(x, axis=1) * hf`, then the result is just `sum(ifft(prod, axis=1), axis=0)`. But by linearity, we can move the sum inside the `ifft` for a great speedup. All together,
$$
\texttt{CC}_{2d1d}(x, h) =
\texttt{iFFT}_{1d}
\left(
\sum_{m=0}^{M - 1}
\left(
\texttt{FFT}_{1d}\big(x\big) \cdot
\overline{\texttt{FFT}_{1d}\big(\texttt{iFFTSHIFT}_{1d}(h)\big)}
\right)[m]
\right)
$$
where 2D indexing is $x[m, n]$, and $\texttt{op}_{1d}$ denotes a 1D operation along the $n$ axis.
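To see the linearity step concretely - $M$ row-wise iFFTs followed by a vertical sum equal one iFFT of the vertically summed spectrum - here's a minimal check (my sketch, not part of the post's code):

```python
import numpy as np
from numpy.fft import fft, ifft, ifftshift

rng = np.random.default_rng(0)
x = rng.standard_normal((64, 256))
h = rng.standard_normal((64, 256))

prod = fft(x) * np.conj(fft(ifftshift(h, axes=1)))
slow = ifft(prod).sum(axis=0)   # M iFFTs, then vertical sum
fast = ifft(prod.sum(axis=0))   # one iFFT of the summed spectrum
assert np.allclose(slow, fast)
```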
Thanks to @CrisLuengo and @Royi for pointers.
### Example in question

Applying in code (extending the code at the bottom):
```python
import matplotlib.pyplot as plt
from PIL import Image

# load images as greyscale
x = np.array(Image.open("cim0.png").convert("L")) / 255.
h = np.array(Image.open("cim1.png").convert("L")) / 255.
# blank regions default to `1`, undo that
x[x==1] = 0
h[h==1] = 0

# compute
out = cc2d1d(x, h).real
# plot
plt.plot(out); plt.show()
```
The peak is near the center, as expected:
### Applications
I used it to identify abrupt changes in audio, by cross-correlating the CWT's impulse response with a non-linearly filtered version of the SSQ_CWT. So one major use is 2D template matching on underlying 1D structures. Surely there are plenty of others.
(Note for those curious about the linked post.) But! I by no means did this with images as in this post. An "image" involves up to three major modifications - compression, color-mapping, and clipping (the `vmin`, `vmax` args of `plt.imshow`) - which change the numeric representation once the image is loaded into an array. Instead I operated on the original arrays; the worse results in this post make the difference clear.
### Convolution? Vertical?
- Convolution: remove `np.conj`.
- Vertical: `ifftshift(, axes=1)` -> `ifftshift(, axes=0)`, `sum(axis=0)` -> `sum(axis=1)`, and the FFTs run along `axis=0` instead of the default last axis. (A sketch of both variants follows this list.)
- Boundary effects / "time aliasing": pad and unpad exactly the same as with 1D convolutions. But note, if $h$ isn't reusable, it's faster to adjust the unpad indices instead of doing `ifftshift`, as shown in Royi's answer on `conv2` (ignore the vertical unpad).
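A sketch of the two variants, assuming the `cc2d1d` from the Code section below (my code, illustrative only):

```python
import numpy as np
from numpy.fft import fft, ifft, ifftshift

def conv2d1d(x, h):
    # convolution variant: `cc2d1d` without `np.conj`
    prod = fft(x) * fft(ifftshift(h, axes=1))
    return ifft(prod.sum(axis=0))

def cc2d1d_vertical(x, h):
    # vertical variant: 1D transforms along axis 0; shift and sum swap axes
    prod = fft(x, axis=0) * np.conj(fft(ifftshift(h, axes=0), axis=0))
    return ifft(prod.sum(axis=1))
```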
### Benchmarks (CPU)
For reusable $h$:
```python
def cc2d1d_hf(x, hf):
    return ifft((fft(x) * hf).sum(axis=0))

shapes = [(8192, 8192), (256, 262144), (262144, 256)]
for shape in shapes:
    x = np.random.randn(*shape)
    hf = np.conj(fft(ifftshift(np.random.randn(*shape), axes=1)))
    %timeit cc2d1d_hf(x, hf)
```
```
3.01 s ± 122 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
4.21 s ± 138 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
2.3 s ± 78.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```
### Code
```python
import numpy as np
from numpy.fft import fft, ifft, ifftshift

def cc2d1d(x, h):
    prod = fft(x) * np.conj(fft(ifftshift(h, axes=1)))
    return ifft(prod.sum(axis=0))

def cc2d1d_brute(x, h):
    out = np.zeros(x.shape[-1], dtype=x.dtype)
    h = ifftshift(np.conj(h), axes=1)
    for i in range(len(out)):
        out[i] = np.sum(x * np.roll(h, i, axis=1))
    return out

# test against brute force, for even and odd dimensions
for M in (128, 129):
    for N in (128, 129):
        x = np.random.randn(M, N) + 1j*np.random.randn(M, N)
        h = np.random.randn(M, N) + 1j*np.random.randn(M, N)
        out0 = cc2d1d(x, h)
        out1 = cc2d1d_brute(x, h)
        assert np.allclose(out0, out1)
```
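And a hypothetical usage sketch (my addition, not from the original post): recover a horizontal shift between two arrays whose rows share the same 1D structure.

```python
# rows of `h` are rows of `x` circularly shifted left by `shift` samples
M, N = 64, 512
x = np.random.randn(M, N)
shift = 37
h = np.roll(x, -shift, axis=1)

out = cc2d1d(x, h).real
peak = int(np.argmax(out))
# zero lag sits at N//2 due to the `ifftshift` centering (even N)
assert peak - N // 2 == shift
```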