The slide below, borrowed from CMU, shows the usual "filterbank interpretation" of signal reconstruction from a Short Time Fourier Transform (STFT). As far as I can tell, the hop size here is 1, i.e., the frames are not decimated the way they would be with a realistic frame hop.
In that context, it makes sense: your inputs for any given sample $n$ are the outputs of lowpass-and-downconvert operations, so to reconstruct, you just upconvert-and-add them together. Do this at each timestep.
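For concreteness, here is a minimal NumPy sketch of that hop-1 case as I understand it. It is not taken from the slide: the definition $X[n,k] = \sum_m x[m]\,w[n-m]\,e^{-j 2\pi k m / N}$, the causal window of length at most $N$ with $w[0] \neq 0$, and the names `stft_hop1` and `reconstruct_hop1` are all my own assumptions.

```python
import numpy as np

def stft_hop1(x, w, N):
    """Hop-1 analysis: X[n, k] = sum_m x[m] w[n-m] exp(-j 2 pi k m / N).

    Assumes w is causal (support 0..len(w)-1) and len(w) <= N.
    """
    n = np.arange(len(x))
    k = np.arange(N)
    # Downconvert: shift band k to DC. Shape (len(x), N).
    down = x[:, None] * np.exp(-2j * np.pi * n[:, None] * k[None, :] / N)
    # Lowpass: convolve each channel with the window, keep outputs n = 0..len(x)-1.
    return np.stack(
        [np.convolve(down[:, kk], w)[: len(x)] for kk in k], axis=1
    )

def reconstruct_hop1(X, w, N):
    """Hop-1 synthesis: upconvert every channel and add them at each n."""
    n = np.arange(X.shape[0])
    k = np.arange(N)
    up = X * np.exp(2j * np.pi * n[:, None] * k[None, :] / N)
    # The sum over k collapses to N * w[0] * x[n], hence the normalisation.
    return up.sum(axis=1) / (N * w[0])

# Illustrative values only (not from the slide).
N = 8
w = np.hamming(N)            # w[0] = 0.08, so the division above is safe
rng = np.random.default_rng(0)
x = rng.standard_normal(64)

X = stft_hop1(x, w, N)
y = reconstruct_hop1(X, w, N)
```

Under these assumptions the sum over $k$ at each $n$ leaves only the $m = n$ term, so `y` matches `x` sample for sample, which is why the hop-1 picture seems unproblematic to me.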
But usually, we have a substantial hop, which means that the $X[n,k]$ elements only exist at hopped values of $n$. I.e., if our hop is 16, we have $X[0,k], X[16,k], \ldots$
Consequently, we are only calculating $y[0], y[16], \ldots$
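To make the gap explicit, here is a small NumPy sketch of what I mean. It assumes the analysis $X[n,k] = \sum_m x[m]\,w[n-m]\,e^{-j 2\pi k m / N}$ with a causal window of length $N$ and $w[0] \neq 0$, evaluated only at multiples of a hop `R`; the values of `R`, `N`, and the test signal are hypothetical, chosen just for illustration.

```python
import numpy as np

R, N = 16, 64                      # hypothetical hop and number of channels
w = np.hamming(N)                  # causal window, w[0] = 0.08
rng = np.random.default_rng(0)
x = rng.standard_normal(256)

k = np.arange(N)
frame_times = np.arange(0, len(x), R)   # n = 0, 16, 32, ...

# Analysis at the hopped times only.
X = np.empty((len(frame_times), N), dtype=complex)
for i, n in enumerate(frame_times):
    m = np.arange(max(0, n - N + 1), n + 1)
    X[i] = (x[m] * w[n - m]) @ np.exp(-2j * np.pi * np.outer(m, k) / N)

# Per-sample "upconvert and add", possible only where X[n, k] exists.
y_known = (X * np.exp(2j * np.pi * np.outer(frame_times, k) / N)).sum(axis=1)
y_known /= N * w[0]

missing = len(x) - len(frame_times)     # samples with no X[n, k] to sum
```

With these numbers, `y_known` reproduces `x` at the 16 frame times, and the remaining 240 samples have no $X[n,k]$ from which to form the sum.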
How exactly do we get the missing values? Interpolate? If interpolation, where does that happen? In the branches after the upconversion? At the $y[n]$ output?
I have scoured multiple articles, slidesets, and books, and none of them seem to explain this.
Or am I misinterpreting what's happening here? (If it matters, I'm interested in audio, where window sizes would be a few hundred samples, and the hop would be 25% of the window size.)
