
I have $\text{FFT}(\text{signal})$, and want to find $\text{operation}$ such that $$\text{IFFT}(\text{operation}(\text{FFT}(\text{signal}), 2)) = \text{timestretch}(\text{signal}, 2)$$

Where $\text{timestretch}$ stretches the signal in the time domain - maintaining the frequencies, unlike basic subsampling-based stretching, which leads to a pitch change.

One option I know of is to convert the signal to a 2D spectrogram, stretch this "image", and then invert the spectrogram, but this seems both inaccurate and expensive.

Tobi Akinyemi
  • Maybe look into the Chirp-Z transform – johnnymopo Aug 02 '20 at 18:40
  • here is how do it in frequency domain phase vocoder – ederwander Aug 02 '20 at 21:40
  • @ederwander Is the vocoder technique basically the 2D technique? From what I read it is (they even mention STFT and spectrograms). I don't understand how you do the actual stretching of the spectrogram; they use a pvresample function that's unexplained, and that's the whole meat of the time-stretching algorithm (stretching the spectrogram). Is it fine to just do basic stretching of the spectrogram (as if it were an image)? – Tobi Akinyemi Aug 02 '20 at 22:21
  • in my code there is no pvresample, you must be looking in the wrong place lol. I've never heard of this 2D technique; the phase vocoder is the basis for time stretching in the frequency domain... all other techniques for doing this in the frequency domain are derived from the phase vocoder – ederwander Aug 02 '20 at 22:39
  • I think you need to define better what you mean by "timestretch". If you apply this operation to a voice recording, what would you hear? Same for someone playing a guitar? What do you want to preserve and what can change? A LOT of work has gone into time-stretching algorithms in audio, but they are complicated, since it's quite hard to make them sound good and "good" depends a lot on the specific requirements of your application. – Hilmar Aug 03 '20 at 12:13
  • @Hilmar I don't care about how it sounds, I just want to stretch it. E.g. I don't care about preserving consonant sounds - which usually get smeared – Tobi Akinyemi Aug 03 '20 at 18:35

3 Answers

4

Are you talking about taking an audio recording where someone was speaking too fast, and you want to slow down the speech without changing the pitch? If so, then your original formulation of the problem is incorrect; you can't do what you want by taking one long FFT of the entire recording and then processing it. The usual way this is solved is by using the phase vocoder, where you break the signal into overlapping blocks and transform each block. This is a deep subject with lots of tricks to avoid artifacts.
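
To make the block-wise idea concrete, here is a minimal phase-vocoder sketch in NumPy (the function name, window length, hop sizes, and the omission of amplitude normalization and other anti-artifact tricks are assumptions made for illustration, not a reference implementation):

import numpy as np

def time_stretch(x, stretch, n_fft=1024, hop_out=256):
    # x: 1-D NumPy array longer than n_fft; stretch > 1 slows the signal down
    hop_in = max(1, int(round(hop_out / stretch)))     # analysis hop < synthesis hop when stretching
    win = np.hanning(n_fft)
    bin_freq = 2 * np.pi * np.arange(n_fft) / n_fft    # expected phase advance per sample, per bin

    # analysis: overlapping windowed FFT blocks
    frames = [np.fft.fft(win * x[i:i + n_fft])
              for i in range(0, len(x) - n_fft, hop_in)]

    # synthesis: re-space the blocks at hop_out, keeping each bin's phase consistent
    y = np.zeros(len(frames) * hop_out + n_fft)
    phase = np.angle(frames[0])
    for k, cur in enumerate(frames):
        if k > 0:
            # deviation of the observed phase advance from the bin-centre expectation
            dphi = np.angle(cur) - np.angle(frames[k - 1]) - bin_freq * hop_in
            dphi -= 2 * np.pi * np.round(dphi / (2 * np.pi))    # wrap to [-pi, pi]
            phase += (bin_freq + dphi / hop_in) * hop_out       # advance at the estimated true frequency
        block = np.real(np.fft.ifft(np.abs(cur) * np.exp(1j * phase)))
        y[k * hop_out:k * hop_out + n_fft] += win * block       # overlap-add, no normalization
    return y

Called as time_stretch(x, 2.0), this returns a signal roughly twice as long at the original pitch; real implementations add further refinements (phase locking, normalization, transient handling) to reduce the artifacts mentioned above.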

Bob
2

Your "$\text{timestretch}(\text{signal}, k)$" is what we call "interpolation by $k$", usually. (if you don't believe it: try for yourself!)

Let us adopt a sensible notation (and not call functions by English words, which typically leads to confusion).

  • $s\in \mathbb C^N$: discrete time input signal of length $N$
  • $m\in\mathbb N$: interpolation ratio
  • $r\in \mathbb C^{mN}$: the output signal, $m$ times as long as the input, subject to:
    • $r[mn] = s[n]$ for all $n$ (i.e. the stretched signal still goes through the same points as the original: the "interpolation criterion")
    • $|\text{DFT}\{r\}[f]| < \epsilon$ for $|f| > N/2$, for some small $\epsilon > 0$ (image suppression to at most a small $\epsilon$)
So, there are many resampling / interpolation algorithms that you could employ.

Probably the simplest is zero padding: you take $\text{DFT}\{s\}$, which is inherently $N$ values long, extend it with $(m-1)N$ zeros (inserted between the positive- and negative-frequency halves, so the negative frequencies stay at the top of the array), making it $mN$ values long, and do the inverse transform.
That amounts to sinc interpolation: the zero-padded spectrum is exactly the $N$-periodic spectrum you would get by stuffing $m-1$ zeros between the time-domain samples, multiplied by an $N$-bin-wide rectangular window, and by the convolution theorem of the DFT that multiplication corresponds to convolving the zero-stuffed signal with a periodic sinc (Dirichlet kernel).

So, your $\text{operation}$ is just plain boring "padding with zeros until you hit the target length". Often, DSP is that easy!
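
A minimal NumPy sketch of that operation for a real-valued input (the function name, the use of rfft/irfft, and the scaling by $m$ are illustrative choices, not something prescribed above):

import numpy as np

def interpolate_by_zero_padding(s, m):
    # sinc-interpolate the real 1-D signal s by the integer factor m
    N = len(s)
    S = np.fft.rfft(s)                    # one-sided spectrum, N//2 + 1 bins
    R = np.zeros(m * N // 2 + 1, dtype=complex)
    R[:len(S)] = S                        # original bins stay, everything above N/2 stays zero
    if N % 2 == 0:
        R[N // 2] *= 0.5                  # the old Nyquist bin becomes an interior bin: halve it
    return m * np.fft.irfft(R, n=m * N)   # scale by m so that r[m*n] == s[n]

For a real signal x, interpolate_by_zero_padding(x, 2) returns a signal twice as long whose even-indexed samples equal the samples of x, i.e. it satisfies the interpolation criterion above.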

Marcus Müller
  • This isn't the time-stretching I was talking about. This is naive stretching using subsampling/interpolation - the pitch isn't preserved like I desired – Tobi Akinyemi Aug 03 '20 at 18:31
1

Say the FFT array is X[0:N-1]. What you want to do is something like this:

import numpy as np

Y = np.zeros(S * N, dtype=complex)   # stretched spectrum, S times as many bins
for n in range(S * N):
    if n % S == 0:
        Y[n] = S * X[n // S]         # keep every S-th bin, scaled to preserve amplitude
    # all other bins stay zero
y = np.fft.ifft(Y)

Effectively you are keeping the frequency of each bin the same (bin $k$ of the $N$-point FFT becomes bin $Sk$ of the $SN$-point FFT, so its normalized frequency $k/N$ is unchanged) while scaling the number of time-domain samples by $S$.
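
A quick sanity check of that claim (a small NumPy sketch; the test values S = 2 and N = 8 are arbitrary choices, not from the answer):

import numpy as np

S, N = 2, 8
x = np.cos(2 * np.pi * 2 * np.arange(N) / N)   # 2 cycles in 8 samples
X = np.fft.fft(x)
Y = np.zeros(S * N, dtype=complex)
Y[::S] = S * X                                  # same construction as the loop above
y = np.fft.ifft(Y).real                         # 4 cycles in 16 samples

Played back at the same sample rate, y keeps the original per-sample frequency (and hence pitch) but lasts S times as long; as the comment below notes, for a real recording you would apply this to short time blocks rather than to one long FFT.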

Keegs
  • What @Bob says in his answer is correct: this solution would need to be performed on time blocks of the signal. There is probably a way of using a Gaussian or other window and going along the original signal sample by sample doing this, but it's beyond my knowledge. – Keegs Aug 04 '20 at 20:42