
The slide below, borrowed from CMU, shows the typical "filterbank interpretation" of signal reconstruction from a Short-Time Fourier Transform (STFT). As far as I can tell, the hop size here is 1, i.e., there is no frame decimation of the kind a realistic frame hop would introduce.

In that context, it makes sense: your inputs for any given sample $n$ are the outputs of lowpass-and-downconvert operations, so to reconstruct, you just upconvert-and-add them together. Do this at each timestep.
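To make the hop-1 case concrete, here is a short worked equation. It assumes the analysis convention $X[n,k] = \sum_m x[m]\, w[n-m]\, e^{-j\omega_k m}$ with $\omega_k = 2\pi k/N$, a window $w$ no longer than $N$ samples, and $w[0] \neq 0$; the slide may use a different but equivalent convention.

$$
\sum_{k=0}^{N-1} X[n,k]\, e^{j\omega_k n}
= \sum_m x[m]\, w[n-m] \sum_{k=0}^{N-1} e^{j\omega_k (n-m)}
= N\, w[0]\, x[n]
$$

The inner sum over $k$ equals $N$ when $n-m$ is a multiple of $N$ and $0$ otherwise, and the window is too short to reach any multiple other than $m = n$. So under these assumptions, upconverting and adding the channels recovers $x[n]$ up to the constant $N\,w[0]$.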

But usually we have a substantial hop, which means that the $X[n,k]$ elements only exist at hopped values of $n$. E.g., if our hop is 16, we have $X[0,k], X[16,k], \ldots$

Consequently, we are only calculating $y[0], y[16], \ldots$

How exactly do we get the missing values? Interpolate? If interpolation, where does that happen? In the branches after the upconversion? At the $y[n]$ output?

I have scoured multiple articles, slidesets, and books, and none of them seem to explain this.

Or am I misinterpreting what's happening here? (If it matters, I'm interested in audio, where window sizes would be a few hundred samples, and the hops would be 25% of that.)

[Image: CMU slide showing the filterbank interpretation of STFT reconstruction]

Jdip
Novak

1 Answer


Here is (hopefully) what you're looking for: https://ccrma.stanford.edu/~jos/sasp/Downsampled_STFT_Filter_Banks.html.

Specifically, this section: https://ccrma.stanford.edu/~jos/sasp/Filter_Bank_Reconstruction.html
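As a rough illustration of the idea in those sections, here is a minimal NumPy sketch (not taken from the linked pages). It assumes a periodic Hann window of 256 samples for both analysis and synthesis, a hop of 64 (25%), and frame-local DFT phase; the function names `stft` and `filterbank_synthesis` are made up for this example. Each channel is zero-stuffed back to the full rate and filtered with a modulated copy of the synthesis window, so the interpolation happens inside each branch, before the channels are summed.

```python
import numpy as np

def stft(x, w, R):
    """Windowed DFT frames taken every R samples; returns shape (n_frames, N)."""
    N = len(w)
    n_frames = 1 + (len(x) - N) // R
    frames = np.stack([x[m * R : m * R + N] * w for m in range(n_frames)])
    return np.fft.fft(frames, axis=1)

def filterbank_synthesis(X, f, R):
    """Per channel: zero-stuff by R, filter with the modulated synthesis window, sum."""
    n_frames, N = X.shape
    n = np.arange(N)
    y = np.zeros((n_frames - 1) * R + N, dtype=complex)
    for k in range(N):
        # Upsample channel k: keep its samples at multiples of R, zeros elsewhere.
        u = np.zeros((n_frames - 1) * R + 1, dtype=complex)
        u[::R] = X[:, k]
        # Bandpass synthesis filter: the window modulated to bin k. In this
        # convention the upconversion is folded into the filter itself.
        g = f * np.exp(2j * np.pi * k * n / N) / N
        # Filtering the zero-stuffed channel is the interpolation step.
        y += np.convolve(u, g)
    return y.real

N, R = 256, 64                      # assumed sizes: window 256, hop 25%
n = np.arange(N)
w = 0.5 - 0.5 * np.cos(2 * np.pi * n / N)   # periodic Hann analysis window
f = w.copy()                                # synthesis window (assumed equal)

# Overlap sum of w*f across shifted frames; it must be constant for
# perfect reconstruction, and we normalize it to 1.
gain = (w * f).reshape(N // R, R).sum(axis=0)
f = f / gain[0]

rng = np.random.default_rng(0)
x = rng.standard_normal(2048)
X = stft(x, w, R)
y = filterbank_synthesis(X, f, R)

# Only samples covered by a full set of overlapping frames are reconstructed.
interior = slice(N - R, len(x) - (N - R))
print("max interior error:", np.max(np.abs(y[interior] - x[interior])))
```

Written out, the per-channel upsample-and-filter is the same arithmetic as weighted overlap-add of inverse-DFT frames, which is how it is usually implemented in practice. The first and last `N - R` samples are not fully covered by frames in this sketch, so they are excluded from the comparison.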

Jdip