Why not overlap save for inverse stft

Question

Since the STFT takes the DFT of overlapping sections of the input signal, it resembles the overlap-save method used for block convolution. The overlap-add method for block convolution, by contrast, uses non-overlapping sections of the input signal.

However, the following reconstruction process confuses me. What is the reason for not using overlap-save method for reconstructing input signal from STFT signal?

Images courtesy: Introduction to Signal Processing, Sophocles J. Orfanidis

This is a good question and similar to one that I had pondered for 40 years. — robert bristow-johnson, Nov 28 '23 at 20:49

score 1 · Answer 1 · answered Nov 28 '23 at 13:55

What is the reason for not using overlap-save method for reconstructing input signal from STFT signal?

The main reason here is presumably the block called "Modify", which directly manipulates the signal in the frequency domain. This is often done for time-variant processing, where you derive the desired manipulation directly from the input spectrum. In this case there is no impulse response (or it would be cumbersome to calculate one), and you need to mitigate time-domain aliasing. This can be dealt with by a suitable choice of frame size, FFT size, hop size, analysis window and reconstruction window.

For LTI filters you can show that the methods are equivalent: overlap-add can be expressed as an STFT with rectangular windows and appropriate FFT, frame and hop sizes.
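As a rough illustration of that equivalence, here is a minimal NumPy sketch of overlap-add written as "rectangular-window STFT, multiply by $H[k]$, inverse FFT, add the tails". The function name `overlap_add_convolve`, the frame length of 64 and the random test signals are my own assumptions, not anything from the book.

```python
import numpy as np

def overlap_add_convolve(x, h, frame_len=64):
    """Fast convolution via overlap-add with rectangular, non-overlapping frames."""
    L = len(h)
    nfft = 1
    while nfft < frame_len + L - 1:      # FFT must hold one full linear convolution
        nfft *= 2
    H = np.fft.rfft(h, nfft)             # zero-padded spectrum of the filter
    y = np.zeros(len(x) + L - 1)
    for start in range(0, len(x), frame_len):
        frame = x[start:start + frame_len]              # rectangular window, hop == frame_len
        seg = np.fft.irfft(np.fft.rfft(frame, nfft) * H, nfft)
        stop = min(start + nfft, len(y))
        y[start:stop] += seg[:stop - start]             # tails of adjacent frames overlap and add
    return y

rng = np.random.default_rng(0)
x = rng.standard_normal(1000)
h = rng.standard_normal(33)
ola_error = np.max(np.abs(overlap_add_convolve(x, h) - np.convolve(x, h)))
```

With these settings `ola_error` should sit at rounding-noise level, since each zero-padded frame produces an exact linear convolution and linearity takes care of the rest.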

robert bristow-johnson · Answer 2 · 2023-11-28T21:43:03.963

When we're using STFT to do some kinda non-linear process (such as phase-vocoder applications), the consequences of the non-linear process are unpredictable enough that it's usually impossible to avoid clicks if your window is the rectangular window and, in the output, one segment of sound is butt-spliced to the following segment. There will be a discontinuity at that splice point.

So some windowing that has a gradual fade-in and fade-out is necessary to keep the process smooth. The windowing of the input data need not be the same as the windowing of the output. When I was doing this STFT, my preference for an input window was the Gaussian window, because the Fourier Transform of a Gaussian function in the time domain is a Gaussian function in the frequency domain. A single sinusoid will slide this Gaussian up in the frequency spectrum. Theoretically, there are no ripples or bumps in the result from a single sinusoid. That makes it easier to identify individual frequency components without confusing them with bumps from other frequency components. Also, there is a lot of mathematical commonality between a Gaussian function and a linear-swept chirp.
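For reference, the Gaussian pair being relied on can be written out as follows. This assumes the continuous-time convention $X(f) = \int x(t)\, e^{-j 2\pi f t}\, dt$ and a width parameter $\alpha > 0$ that I am introducing only for illustration:

$$ e^{-\alpha t^2} \;\longleftrightarrow\; \sqrt{\frac{\pi}{\alpha}}\; e^{-\pi^2 f^2/\alpha} $$

and, for a windowed sinusoid at frequency $f_0$,

$$ e^{-\alpha t^2}\, e^{j 2\pi f_0 t} \;\longleftrightarrow\; \sqrt{\frac{\pi}{\alpha}}\; e^{-\pi^2 (f - f_0)^2/\alpha} $$

so the same smooth bump simply moves to $f_0$. In practice the window has to be truncated to a finite frame, so the "no ripples" property holds only approximately.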

Normally, you want the effective window at the output to be what we call complementary. That is, the falling slope of the previous frame and the rising slope of the current frame add to a constant. This all depends on the overlap, but there are also weird special cases.
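A quick numerical check of what "complementary" means, as a sketch under my own assumptions: a periodic Hann analysis window, 50% overlap, no synthesis window, and arbitrary values for the frame length and frame count. It also runs an unmodified STFT and inverse STFT to show that overlap-adding the windowed frames gives back the input away from the edges.

```python
import numpy as np

N, hop = 512, 256                            # frame length and 50% overlap (assumed values)
n = np.arange(N)
w = 0.5 - 0.5 * np.cos(2 * np.pi * n / N)    # periodic Hann window

# 1) Sum of the shifted windows: should be constant where two frames overlap.
num_frames = 20
wsum = np.zeros(hop * (num_frames - 1) + N)
for m in range(num_frames):
    wsum[m * hop : m * hop + N] += w
cola_ripple = np.max(np.abs(wsum[N:-N] - 1.0))

# 2) Unmodified round trip: window, FFT, inverse FFT, overlap-add.
rng = np.random.default_rng(1)
x = rng.standard_normal(len(wsum))
y = np.zeros_like(x)
for m in range(num_frames):
    sl = slice(m * hop, m * hop + N)
    X = np.fft.rfft(w * x[sl])               # analysis: window, then DFT
    y[sl] += np.fft.irfft(X, N)              # synthesis: inverse DFT, then overlap-add
roundtrip_error = np.max(np.abs(y[N:-N] - x[N:-N]))
```

Both `cola_ripple` and `roundtrip_error` should be at rounding-noise level. The first and last half-frames are excluded because only one window covers them.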

But with fast-convolution, that is using a form of the STFT and the FFT to perform a long convolution (like $h[n]$ might be several hundred thousand samples long), then the process is linear and you can predict what the algorithm will do about the discontinuities at the edges of each segment. Overlap-save (I think a better name is "Overlap-scrap") recognizes that a certain number of contiguous output samples are crap (because the impulse response straddles the wrap-around boundary between $x[N-1]$ and $x[0]$ in the circular convolution) and simply discards them and "saves" the good samples.
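Here is a minimal NumPy sketch of that "scrap" step, under my own assumptions (the function name `overlap_save_convolve`, an FFT size of 128 and random test data). Each block overlaps the previous one by $L-1$ input samples, and the first $L-1$ outputs of every circular convolution are thrown away.

```python
import numpy as np

def overlap_save_convolve(x, h, nfft=128):
    """Fast convolution via overlap-save; requires nfft > len(h) - 1."""
    L = len(h)
    hop = nfft - (L - 1)                     # good output samples per block
    H = np.fft.rfft(h, nfft)
    n_out = len(x) + L - 1                   # length of the full linear convolution
    n_blocks = -(-n_out // hop)              # ceiling division
    # L-1 zeros of "history" in front, zeros at the end to fill the last block
    xp = np.concatenate([np.zeros(L - 1), x, np.zeros(n_blocks * hop - len(x))])
    y = np.empty(n_blocks * hop)
    for k in range(n_blocks):
        block = xp[k * hop : k * hop + nfft]             # overlaps previous block by L-1 samples
        circ = np.fft.irfft(np.fft.rfft(block) * H, nfft)
        y[k * hop : (k + 1) * hop] = circ[L - 1:]        # scrap the wrapped-around samples
    return y[:n_out]

rng = np.random.default_rng(2)
x = rng.standard_normal(1000)
h = rng.standard_normal(33)
ols_error = np.max(np.abs(overlap_save_convolve(x, h) - np.convolve(x, h)))
```

Nothing is added between blocks here: the kept samples are simply concatenated, which works only because the processing is a fixed, known convolution.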

Overlap-add zero-pads the input and uses linearity to determine that the falling tail of the output of the previous frame can add to the rising tail of the current frame to create the correct output.

If you have MIPS to burn, you can do very high-quality Overlap-Add convolution with the STFT and a Hann window, but you still must zero-pad the input to the FFT and your frame hop length will be half what it would be with a rectangular window. But doing that and getting it right helps set you up to use STFT to do other, more sophisticated, operations such as the phase vocoder.
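A sketch of that Hann-windowed variant, again with assumed names and sizes (`hann_ola_convolve`, frame length 128, periodic Hann, hop of half a frame). The input is padded with half a frame of zeros at the front so that every real sample is covered by two windows that sum to one.

```python
import numpy as np

def hann_ola_convolve(x, h, frame_len=128):
    """Overlap-add convolution with Hann-windowed, 50%-overlapped, zero-padded frames."""
    L = len(h)
    hop = frame_len // 2                     # half the hop of the rectangular-window case
    n = np.arange(frame_len)
    w = 0.5 - 0.5 * np.cos(2 * np.pi * n / frame_len)    # periodic Hann
    nfft = 1
    while nfft < frame_len + L - 1:          # zero-pad so circular == linear convolution
        nfft *= 2
    H = np.fft.rfft(h, nfft)
    n_frames = -(-len(x) // hop) + 1
    xp = np.concatenate([np.zeros(hop), x, np.zeros(n_frames * hop - len(x))])
    yp = np.zeros(len(xp) + L - 1)
    for m in range(n_frames):
        start = m * hop
        frame = w * xp[start:start + frame_len]          # windowed input frame
        seg = np.fft.irfft(np.fft.rfft(frame, nfft) * H, nfft)
        yp[start:start + frame_len + L - 1] += seg[:frame_len + L - 1]
    return yp[hop:hop + len(x) + L - 1]      # drop the leading half-frame of padding

rng = np.random.default_rng(3)
x = rng.standard_normal(1000)
h = rng.standard_normal(33)
hann_error = np.max(np.abs(hann_ola_convolve(x, h) - np.convolve(x, h)))
```

For a fixed $h[n]$ this buys nothing over rectangular frames and costs twice the FFTs, but the same loop keeps working smoothly once the multiplication by `H` is replaced with something frame-dependent.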

Thank you for the reply. I can understand the need for windowing, but I still do not understand the reconstruction process. Assume there is no modification after the STFT: the ISTFT recreates the input blocks x0, x1, x2, ... (Fig. 18.5.1) with the windows applied. When you add them as shown in the figure, I still think the recreated signal gets distorted by these additions. — Ras, Nov 30 '23 at 10:57
The reconstruction process is exactly overlapping the segments and adding them together. It's mathematically described in an answer to another question. — robert bristow-johnson, Nov 30 '23 at 11:15
