Choice of relationship between n_fft and window_length in STFT

Question

I'm not a veteran in signal processing, so I would greatly appreciate help understanding the idea/heuristic behind the STFT requirement that

$$\text{nfft} \ge \text{window length}$$

At least from a statistical perspective, I believe this overfits the estimate of each coefficient in each STFT frame, for the following reason:

For $X_1, X_2, \dots, X_n \sim D$ supported on $[-1, 1]$, we can use a basis expansion to estimate the probability density function: $$p(x)=\sum\limits_{i=0}^{\infty}\theta_i \phi_i(x)$$ where the $\phi_i(x)$ are orthonormal functions, which satisfy: $$\int_{-1}^{1}\phi_i(x)\phi_j(x)\,dx = \begin{cases} 0 & i\neq j\\ 1 & \text{otherwise} \end{cases}$$

In essence, each coefficient of one STFT frame is an estimate ($\hat{\theta}_i$) of the coefficient of one of the first $\text{nfft}$ basis functions, given $\text{window length}$ audio samples.

Let $N = \text{window length}$ and $\hat{p}(x)=\sum\limits_{i=0}^{\text{nfft}}\hat{\theta}_i\phi_i(x)$, where each $\hat{\theta}_j$ is estimated by $$\hat{\theta}_j = \frac{1}{N}\sum\limits_{i=1}^{N}\phi_j(X_i)$$ It can be shown that this is an unbiased estimator of $\theta_j$ and has a nice property: if we choose $\text{nfft}=C_0 N^{1/5}$, then the mean square error between $p(x)$ and $\hat{p}(x)$ converges at the optimal rate $O(N^{-4/5})$.

And surely, when

$$\text{nfft} \ge \text{window length}$$

either $C_0$ is quite large (so that $\text{nfft}=\text{window length}$ actually makes sense), or we are not following the best relationship between $\text{nfft}$ and $\text{window length}$ ($\text{nfft}=C_0(\text{window length})^{1/5}$) for the optimal estimation of $p(x)$.

Tip, $\text{nfft}$, more here. Also relevant. Note that nfft >= win_len isn't a preference but a necessity. — OverLordGoldDragon, Nov 04 '22 at 04:59
@OverLordGoldDragon Nice catch on the formatting, many thanks :) However, I am still having difficulty relating the relevant post directly to this question: any further clarification would be appreciated. — LambdaDelta34, Nov 04 '22 at 07:27

score 1 · Answer 1 · answered Nov 05 '22 at 19:31

nfft >= win_length is a forced implementation detail of the columnwise implementation of STFT: each column of the STFT is just x multiplied by a shifted copy of window, followed by an FFT, and of course we can't have len(fft(x * window)) < len(window). Columnwise, however, isn't the only implementation.
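To make the columnwise picture concrete, here is a minimal sketch for a single frame using NumPy (the random test signal, the Hann window, and all variable names are illustrative assumptions, not taken from any particular STFT library). It shows that nfft > win_length only zero-pads the windowed frame, which samples the same spectrum on a finer frequency grid, whereas nfft < win_length would make `np.fft.fft` crop the frame and discard samples.

```python
import numpy as np

# Illustrative setup (assumed values): one frame of win_length samples.
rng = np.random.default_rng(0)
win_length = 8
x = rng.standard_normal(win_length)
window = np.hanning(win_length)
frame = x * window  # one STFT column before the FFT

# nfft == win_length: plain DFT of the windowed frame.
spec_n = np.fft.fft(frame, n=win_length)

# nfft > win_length: np.fft.fft zero-pads the frame to length n.
spec_2n = np.fft.fft(frame, n=2 * win_length)

# nfft < win_length: np.fft.fft crops the frame, so samples are thrown away.
spec_short = np.fft.fft(frame, n=win_length // 2)

# Every second bin of the zero-padded spectrum equals the unpadded spectrum,
# i.e. zero-padding interpolates in frequency without adding information.
same_bins = np.allclose(spec_2n[::2], spec_n)

# The short FFT only sees the first half of the windowed frame.
cropped = np.allclose(spec_short, np.fft.fft(frame[: win_length // 2]))

print(same_bins, cropped)
```

The extra bins obtained with a larger nfft are therefore not additional independently estimated coefficients; they are denser samples of the same windowed spectrum.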

I don't know where your equations come from, but they're wrong: there is no relationship between win_length and quality of estimation. There is a relationship with window, but window is completely determined by win_length only for a few common windows. What matters is the window's width and its Heisenberg (time-frequency) resolution.

Lastly, note that STFT is not an orthogonal transform, and the "windowed Fourier transform" is but one of its interpretations. Truer to time-frequency analysis, STFT is convolution with windowed complex sinusoids, i.e. bandpass filters, where hop_length is stride. It's the top visual here, except widths are fixed in time and frequency, and frequencies are linearly spaced.
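The convolution interpretation can be checked numerically. The sketch below is a minimal illustration under stated assumptions: the function names `stft_columnwise` and `stft_filterbank` are hypothetical, no padding or centering is applied, and the phase convention is referenced to the start of each frame. It computes the STFT once column by column (window, then FFT) and once row by row (convolve with a windowed complex sinusoid, then keep every hop_length-th sample), and compares the two.

```python
import numpy as np

def stft_columnwise(x, window, n_fft, hop_length):
    """Each column is the FFT of one windowed (and zero-padded) frame."""
    L = len(window)
    n_frames = (len(x) - L) // hop_length + 1
    cols = [
        np.fft.fft(x[m * hop_length : m * hop_length + L] * window, n=n_fft)
        for m in range(n_frames)
    ]
    return np.stack(cols, axis=1)  # shape (n_fft, n_frames)

def stft_filterbank(x, window, n_fft, hop_length):
    """Each row is x filtered by one windowed complex sinusoid, then strided."""
    L = len(window)
    n = np.arange(L)
    rows = []
    for k in range(n_fft):
        # Analysis kernel for bin k: the window times a complex sinusoid.
        h = window * np.exp(-2j * np.pi * k * n / n_fft)
        # Convolving with the time-reversed kernel gives sum_n x[t + n] * h[n].
        full_rate = np.convolve(x, h[::-1], mode="valid")
        # hop_length acts as the stride of the convolution.
        rows.append(full_rate[::hop_length])
    return np.stack(rows, axis=0)  # shape (n_fft, n_frames)

# Illustrative parameters (assumed values).
rng = np.random.default_rng(0)
x = rng.standard_normal(64)
window = np.hanning(8)
n_fft, hop_length = 16, 4

S_col = stft_columnwise(x, window, n_fft, hop_length)
S_fb = stft_filterbank(x, window, n_fft, hop_length)
print(S_col.shape, np.allclose(S_col, S_fb))
```

In the filter-bank form, n_fft only sets how many bandpass filters there are (the frequency spacing), while the window sets each filter's width, so nothing in this form forces n_fft >= win_length.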
