Filterbank Interpretation of Inverse STFT With Hop

Question

The slide below, borrowed from CMU, shows the usual "filterbank interpretation" of signal reconstruction from a Short-Time Fourier Transform (STFT). As far as I can tell, the hop size here is 1, i.e., there is no frame decimation of the kind a realistic frame hop would introduce.

In that context, it makes sense: your inputs for any given sample $n$ are the outputs of lowpass-and-downconvert operations, so to reconstruct, you just upconvert-and-add them together. Do this at each timestep.
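To make the hop-1 case concrete, here is a minimal NumPy sketch. The slide isn't reproduced here, so I'm assuming the convention $X[n,k] = \sum_m x[m]\,w[n-m]\,e^{-j2\pi km/N}$, with an odd-length window centered at lag 0 and no longer than the DFT size $N$. The function names `stft_hop1` and `filterbank_sum` are made up for illustration.

```python
import numpy as np

def stft_hop1(x, w, N):
    """Hop-1 STFT: X[n, k] = sum_m x[m] * w[n - m] * exp(-2j*pi*k*m/N).

    Assumes w has odd length M <= N and is centered, so w[M // 2] is the lag-0 tap.
    """
    M = len(w)
    half = M // 2
    lags = np.arange(M) - half            # the values taken by (n - m)
    k = np.arange(N)
    X = np.zeros((len(x), N), dtype=complex)
    for n in range(len(x)):
        m = n - lags
        valid = (m >= 0) & (m < len(x))   # drop taps that fall outside the signal
        mv = m[valid]
        # Downconvert each input sample for every channel k, then lowpass with w.
        X[n] = (x[mv] * w[valid]) @ np.exp(-2j * np.pi * np.outer(mv, k) / N)
    return X

def filterbank_sum(X, w0):
    """Upconvert every channel by exp(+2j*pi*k*n/N) and add, at each time step n."""
    L, N = X.shape
    n = np.arange(L)[:, None]
    k = np.arange(N)[None, :]
    return (X * np.exp(2j * np.pi * k * n / N)).sum(axis=1) / (N * w0)

rng = np.random.default_rng(0)
x = rng.standard_normal(64)
w = np.hanning(15)                        # odd length, peak at the center tap
X = stft_hop1(x, w, N=16)
y = filterbank_sum(X, w[len(w) // 2])     # normalize by N * w[0]
```

Under these assumptions the sum over $k$ collapses to $N\,w[0]\,x[n]$, so `y.real` should match `x` up to rounding error.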

But usually, we have a substantial hop, which means that the $X[n,k]$ elements only exist at hopped values of $n$. I.e., if our hop is 16, we have $X[0,k], X[16,k], X[32,k], \ldots$

Consequently, we are only calculating $y[0], y[16], \ldots$

How exactly do we get the missing values? Interpolate? If interpolation, where does that happen? In the branches after the upconversion? At the $y[n]$ output?

I have scoured multiple articles, slidesets, and books, and none of them seem to explain this.

Or am I misinterpreting what's happening here? (If it matters, I'm interested in audio, where window sizes would be a few hundred samples, and the hops would be 25% of that.)

Jdip · Accepted Answer · 2022-08-11T06:32:32.037


Here is (hopefully) what you're looking for: https://ccrma.stanford.edu/~jos/sasp/Downsampled_STFT_Filter_Banks.html.

Specifically, this section: https://ccrma.stanford.edu/~jos/sasp/Filter_Bank_Reconstruction.html
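As a rough illustration of the idea (a sketch under my own assumptions, not code taken from the linked pages): with a hop of $R$ samples, each frame still inverse-transforms to a full window's worth of time samples, and overlap-adding those frames fills in every $y[n]$ between the hopped positions. So the interpolation happens implicitly in the overlap-add. The function names, the periodic Hann window, and the sizes $M = 64$, $R = 16$ are assumptions chosen to match the 25% hop mentioned in the question.

```python
import numpy as np

def stft(x, w, R):
    """Frame-based STFT: one length-M DFT every R samples (frames start at 0, R, 2R, ...)."""
    M = len(w)
    starts = range(0, len(x) - M + 1, R)
    return np.array([np.fft.fft(x[s:s + M] * w) for s in starts])

def istft_ola(X, w, R, length):
    """Overlap-add reconstruction: every frame contributes M samples, not just one."""
    M = len(w)
    y = np.zeros(length)
    norm = np.zeros(length)               # running sum of the shifted analysis windows
    for i, frame in enumerate(X):
        s = i * R
        y[s:s + M] += np.fft.ifft(frame).real
        norm[s:s + M] += w
    covered = norm > 1e-8                 # samples where the window sum is nonzero
    y[covered] /= norm[covered]
    return y, covered

M, R = 64, 16                             # hop is 25% of the window length
w = 0.5 - 0.5 * np.cos(2 * np.pi * np.arange(M) / M)   # periodic Hann
rng = np.random.default_rng(0)
x = rng.standard_normal(256)
X = stft(x, w, R)
y, covered = istft_ola(X, w, R, len(x))
```

With this window and hop, the shifted windows sum to a constant in the interior, and `y` should match `x` wherever `covered` is true. Here only the very first sample is left uncovered, because the Hann window is zero at its start.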

edited Aug 11 '22 at 06:32

answered Aug 11 '22 at 06:28


It definitely points me in the right direction and partly confirms my intuition. I need to ponder that. – Novak Aug 11 '22 at 06:48
Glad I could help. Let me know if anything is unclear ;) – Jdip Aug 11 '22 at 07:22
