3

The authors of this paper optimize the size of a Gaussian window (the σ parameter of the bell curve) via gradient descent, together with the other parameters of a neural network.

I don't use a Gaussian window, but a Hann window instead. How can I optimize the STFT window size via gradient descent when using a Hann/Hamming window?

The problem is that, unlike the Gaussian window, the Hann window does not have a continuous parameter like σ to serve as a proxy for gradient descent. Is there a way to rewrite the Hann window, or are there parameters that control the window size and are differentiable? Currently the window length $N$ is a positive integer and is not differentiable.

torch.hann_window uses:

$$w[n] = \begin{cases} \tfrac{1}{2} \left( 1 - \cos \left(\tfrac{2\pi}{N-1}n\right) \right) \qquad & 0 \le n \le N-1 \\ 0 & \text{otherwise} \\ \end{cases}$$

I scratched my head for quite some time but could not figure out how to differentiate it.

Any hints from you are highly appreciated.

OverLordGoldDragon
JXuan
  • In the optimization effort, what parameter are they trying to maximize or minimize by varying the window width? – robert bristow-johnson Feb 19 '22 at 21:19
  • And with the Gaussian window, there are two width parameters to think about. 1. the $\sigma$ parameter of the bell curve. 2. the actual cutoff point where the window goes to zero. – robert bristow-johnson Feb 19 '22 at 21:21
  • Yes, the σ parameter of the bell curve. – JXuan Feb 21 '22 at 06:55
  • Well, if you have an optimum $\sigma$ for the optimum Gaussian, then you can fit a Hann or Hamming to it so that the second-derivative of the main lobe in the middle is the same for both the Hann and the optimal Gaussian. – robert bristow-johnson Feb 21 '22 at 08:00
  • I am not planning to find the optimum σ for Gaussian (this is what was done in the paper) but would like to optimize Hann. Are you suggesting to use the optimum σ of Gaussian to optimize Hann indirectly? – JXuan Feb 22 '22 at 09:27
  • Yes. Because I do not have the foggiest idea what specific parameter you're trying to minimize or maximize in choosing the optimal $\sigma$ for a Gaussian window. So, without that knowledge, if your problem is solved for Gaussian, I am suggesting that you could consider that whatever optimal Gaussian window you get, you choose the Hann (or Hamming) window that has the same height and same curvature for the main lobe. That means the values of the zeroth, first, and second derivatives are the same in the center of both windows. They might have approximately the same performance. – robert bristow-johnson Feb 23 '22 at 05:36
  • @robertbristow-johnson many thanks for completing the formula and sharing the idea! I edited my question to be more precise. Sorry for the confusion. – JXuan Feb 23 '22 at 08:26
  • @robertbristow-johnson the paper used the Gaussian window as an example to show that the window size could be optimized via gradient descent, but I don't use a Gaussian window at all. Do you know if there are other parameters that control the window width of the Hann window and are differentiable? If directly optimizing the Hann window is possible, it would be better than taking a detour via Gaussian window optimization. Many thanks in advance! – JXuan Feb 23 '22 at 08:37
  • There is no other parameter to the Hann window other than its width. – robert bristow-johnson Feb 24 '22 at 04:13

2 Answers

2

The easiest way is to take the STFT using operators with autodiff support, e.g. via PyTorch, then simply set the window as an updatable parameter, initialized as Hann, etc.:

import torch
import torch.nn as nn
from scipy.signal import windows

win = torch.from_numpy(windows.hann(128))
win_t = nn.Parameter(win, requires_grad=True)

Then we'd do e.g.

def forward(self, x):
    x = STFT(x, win_t)
    x = self.conv(x)
    ...
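
A concrete (hypothetical) choice of STFT for the snippet above could use torch.stft, which accepts a window tensor and is differentiable with respect to it; the n_fft and hop_length values here are my own arbitrary picks:

import torch

def STFT(x, win_t, n_fft=128, hop_length=32):
    # magnitude STFT; gradients flow back into win_t through torch.stft
    return torch.stft(x, n_fft=n_fft, hop_length=hop_length,
                      window=win_t.to(x.dtype), return_complex=True).abs()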

It might help to apply a symmetry constraint, e.g. by defining the trainable weights as only one half of the window, then mirroring them to form the full window that's actually applied.
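
For instance, a minimal sketch of such a constraint (the even-length, mirror-half scheme here is just one possible choice, not taken from any existing code):

import torch
import torch.nn as nn
from scipy.signal import windows

# trainable half of the window; the full, symmetric window is rebuilt from it
half = nn.Parameter(torch.from_numpy(windows.hann(128)[:64]))

def symmetric_window():
    # mirror the trainable half so the window stays symmetric as it's updated
    return torch.cat([half, half.flip(0)])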

See this post for an example of a fully differentiable CWT applied to inversion. That network could be embedded in a conv net and its filter weights updated if we wrap them in nn.Parameter.

Optimizing for width

If the goal's to optimize a parameter of a predefined window function, then generate the window through the differentiable parameter:

def gauss(N, sigma):
    t = torch.linspace(-.5, .5, N)
    return torch.exp(-(t / sigma)**2)

N = x.shape[-1]
sigma = nn.Parameter(torch.tensor(0.1))
win_t = gauss(N, sigma)
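
One practical detail worth noting (my addition): the window has to be rebuilt from the current sigma inside each forward pass so that gradients actually reach it. A hypothetical module using the gauss helper above:

import torch
import torch.nn as nn

class GaussSTFT(nn.Module):
    def __init__(self, n_fft=128):
        super().__init__()
        self.n_fft = n_fft
        self.sigma = nn.Parameter(torch.tensor(0.1))

    def forward(self, x):
        win_t = gauss(self.n_fft, self.sigma)   # regenerated every call from the current sigma
        return torch.stft(x, n_fft=self.n_fft, window=win_t.to(x.dtype),
                          return_complex=True).abs()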

Optimizing for length

To optimize for the number of samples of the window, we must ensure the sampling is differentiable.

  • Hann generation does arange(N) / N, but for gradient descent N must be continuous, which would require rounding to get an integer sample count; torch.round isn't differentiable, but torch.clamp is.
  • To ensure the process works, we define an optimization objective via normalized cross-correlation (NCC): the optimal window length will be that which matches a reference. We implement NCC via conv1d.
  • Optimize via plain gradient descent; be sure to zero the gradients after each step so they don't accumulate (see the sketch after this list).
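
Putting the pieces together, here's a minimal sketch of one way to make the length trainable (an illustration of the idea rather than the exact code behind the Github link; hann_cont, N_max=256, the MSE loss and the step size are my own choices):

import math
import torch
import torch.nn as nn

def hann_cont(N_max, length):
    # Hann shape sampled on a fixed grid of N_max points; 'length' is a continuous
    # scalar. clamp (differentiable, unlike round) zeroes samples past the support
    # while still letting gradients flow into 'length'.
    n = torch.arange(N_max, dtype=torch.float32)
    frac = torch.clamp(n / (length - 1), 0., 1.)
    return 0.5 * (1 - torch.cos(2 * math.pi * frac))

length = nn.Parameter(torch.tensor(129.))
ref = hann_cont(256, torch.tensor(160.))               # reference window of length 160

for _ in range(1000):
    loss = ((hann_cont(256, length) - ref)**2).mean()  # MSE for brevity; the answer uses NCC
    loss.backward()
    with torch.no_grad():
        length -= 1e3 * length.grad                    # plain gradient descent step
    length.grad.zero_()                                # zero grads so they don't accumulate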

Suppose we start with N=129 and the optimum is N_ref=160. Results:

(figure: results of the window-length optimization)

Code at Github.

OverLordGoldDragon
  • Great!! I did not know it's already implemented in pytorch. Many many thanks! – JXuan Feb 21 '22 at 06:56
  • @JXuan Glad it helped, also added $\sigma$ case. If the problem's solved, consider voting & accepting. – OverLordGoldDragon Feb 21 '22 at 20:11
  • Thanks for adding the σ case. For the STFT, the window size is supposed to be a power of 2, not an arbitrary real number, isn't it? If so, one could not just autodiff the win of a Hann/Hamming window directly (because it is not differentiable)? – JXuan Feb 22 '22 at 09:24
  • I got a reply from the contributors who implemented the torch stft and window functions in the Pytorch forum. They said 'The window size is not immediately trainable, although you could implement your own method of updating it.' – JXuan Feb 22 '22 at 10:49
  • @JXuan Window can be of any length, power of 2 is typically for computational reasons. Yes, tuning something like width of Hanning will be harder, but is certainly doable. ssqueezepy has a torch implementation, but a key step isn't differentiable (buffer via CuPy) so it must be implemented via PyTorch - alternatively, code here can be ported easily to torch (though needs optimizing) – OverLordGoldDragon Feb 23 '22 at 01:56
  • I just looked into buffer in your repo. I think I need to find a continuous proxy for the window length there, or some other parameter through which backpropagation could take place. Thanks for sharing your implementations. I will wait a bit to see if others have any ideas to share. – JXuan Feb 23 '22 at 09:50
  • @JXuan Added example for $N$. – OverLordGoldDragon Feb 24 '22 at 08:33
1

Consider $N$ as a positive real number and use your formula for the Hann window, $w[n]$. Then the partial derivative of $w[n]$ with respect to $N$ will be:

$$\frac{\partial\,w[n]}{\partial N} = \begin{cases} - \dfrac{\pi\,n\,\sin\left(\frac{2\pi}{N - 1}n\right)}{{\left(N - 1\right)}^2} \qquad & 0 \le n \le N-1 \\ 0 & \text{otherwise.} \\ \end{cases}$$

At integer $N$, $w[n]$ has a very simple, sparse length-$N$ discrete Fourier transform (DFT). With non-integer $N$ that quality is lost, but I also don't know that you would need that quality in your application.
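
If you want to sanity-check this formula numerically (my own snippet, not part of the answer), autograd gives the same result when $N$ is treated as a real-valued tensor:

import math
import torch

N = torch.tensor(64.0, requires_grad=True)
n = torch.arange(64.0)
w = 0.5 * (1 - torch.cos(2 * math.pi * n / (N - 1)))

grad_auto = torch.autograd.grad(w.sum(), N)[0]   # sum over n of dw[n]/dN via autograd

with torch.no_grad():
    grad_formula = (-(math.pi * n * torch.sin(2 * math.pi * n / (N - 1))
                      / (N - 1)**2)).sum()

print(torch.allclose(grad_auto, grad_formula))   # expect True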

Olli Niemitalo