Using a spectrogram to speed up a signal - Time Scaling/Phase Vocoder

Question

Background

About half a year ago, while learning about spectrograms as part of an Image Processing course I took, I was told you can speed up audio using spectrograms as follows:

Calculate the spectrogram of the signal (using the short-time Fourier transform).
Get rid of every 2nd column in the spectrogram (to double the speed, for example; otherwise use a different ratio).
Calculate the inverse transform to turn the spectrogram back into a signal.

Since we are deleting every 2nd column, we delete values only in the time domain, and not the frequency domain, and therefore the pitch won't change (this is in contrast to methods such as deleting every second sample from the original audio, which would make the pitch higher, since the frequency is increased).

Even with that, we were told this method is not complete since we also have to correct (fix) the phase.
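A minimal sketch of the three steps above, to show what goes wrong without a phase fix. Everything here is an assumption made for illustration and not the course code: a hand-rolled `stft`/`istft` pair, a periodic Hann window of 256 samples with a hop of 128, and a test tone chosen to sit exactly on one frequency bin.

```python
import numpy as np

N, H = 256, 128                                            # window length and hop (assumed values)
window = 0.5 - 0.5 * np.cos(2 * np.pi * np.arange(N) / N)  # periodic Hann; sums to 1 at 50% overlap

def stft(x):
    """Columns are windowed FFT frames taken every H samples."""
    n_frames = 1 + (len(x) - N) // H
    frames = np.stack([window * x[i * H : i * H + N] for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1).T                   # shape: (bins, frames)

def istft(spec):
    """Overlap-add the inverse FFT of each column, again every H samples."""
    frames = np.fft.irfft(spec.T, n=N, axis=1)
    y = np.zeros(N + (spec.shape[1] - 1) * H)
    for i, frame in enumerate(frames):
        y[i * H : i * H + N] += frame
    return y

def rms(x):
    return np.sqrt(np.mean(x ** 2))

fs = 8000
n = np.arange(N + 62 * H)                                  # 8192 samples -> 63 frames
x = np.cos(2 * np.pi * 468.75 * n / fs)                    # tone centred on bin 15

spec = stft(x)
round_trip = istft(spec)                                   # unmodified spectrogram
naive = istft(spec[:, ::2])                                # every 2nd column dropped

print(len(naive) / len(x))                                 # length ratio, roughly one half
print(rms(x[H:-H]), rms(round_trip[H:-H]), rms(naive[H:-H]))
```

Under these assumptions the round trip reproduces the input away from the edges, while the naive output is about half as long but noticeably weaker: the kept frames overlap with phases that no longer agree, so they partly cancel.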

The Question

Why do you need to correct the phase, and how do you do that (algorithm with mathematical explanation)? I tried looking online but couldn't find anything about this. Also, below is the code from my course, which may be helpful; I'd like to know how it works.

If I remember correctly, we were given this image, along with the following explanation:

The top signal is the original signal
The middle signal is after deleting columns from the spectrogram
The last signal is after fixing the phase. (I believe this is called phase vocoding or something).

Code Example

I found this piece of code we were given, which speeds a signal up or slows it down.

spec is the spectrogram of the signal.
ratio is the ratio by which we speed up/slow down the sound.

import numpy as np
from scipy.ndimage import map_coordinates

def phase_vocoder(spec, ratio):
    # Fractional input-frame position for every output frame
    num_timesteps = int(spec.shape[1] / ratio)
    time_steps = np.arange(num_timesteps) * ratio

    # Interpolate magnitude linearly (order=1) along the time axis
    yy = np.meshgrid(np.arange(time_steps.size), np.arange(spec.shape[0]))[1]
    xx = np.zeros_like(yy)
    coordinates = [yy, time_steps + xx]
    warped_spec = map_coordinates(np.abs(spec), coordinates, mode='reflect', order=1).astype(complex)

    # Phase vocoder
    # Pad one zero column so that step + 1 is valid for the last frame
    spec_angle = np.pad(np.angle(spec), [(0, 0), (0, 1)], mode='constant')
    # Phase accumulator; initialize to the first frame
    # (copy, so that accumulating does not overwrite spec_angle[:, 0])
    phase_acc = spec_angle[:, 0].copy()

    for (t, step) in enumerate(np.floor(time_steps).astype(int)):
        # Store to output array
        warped_spec[:, t] *= np.exp(1j * phase_acc)

        # Compute phase advance
        dphase = (spec_angle[:, step + 1] - spec_angle[:, step])

        # Wrap to -pi:pi range
        dphase = np.mod(dphase - np.pi, 2 * np.pi) - np.pi

        # Accumulate phase
        phase_acc += dphase

    return warped_spec
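
Written as equations, the loop above appears to do the following. The notation is introduced here for illustration and is not from the course material: $X[k, n]$ is the spectrogram value at frequency bin $k$ and frame $n$, $r$ is ratio, $s_t = \lfloor t r \rfloor$ is the input frame used for output frame $t$, $\phi[k, t]$ is phase_acc, $M[k, t]$ is the linearly interpolated magnitude, and $\operatorname{wrap}$ maps an angle into $[-\pi, \pi)$.

$$ \phi[k, 0] = \angle X[k, 0], \qquad \phi[k, t+1] = \phi[k, t] + \operatorname{wrap}\big(\angle X[k, s_t + 1] - \angle X[k, s_t]\big) $$

$$ Y[k, t] = M[k, t] \, e^{j \phi[k, t]} $$

In words: each output frame does not copy the input phase at its time position. Instead it advances the previous output phase by the phase change the input showed over one hop, which keeps consecutive output frames consistent with the unchanged hop size used for the inverse transform.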

I gotta sorta answer here. And a paper here. Dunno if that helps, but I don't have the juice to write a specific answer for you. Even for bounty. — robert bristow-johnson, May 13 '22 at 06:33

score -1 · Answer 1 · answered May 05 '22 at 01:01

If you delete a segment of a nonzero-amplitude sine wave, and that deleted segment isn't a multiple of a wavelength or period in length, then when you abut the remaining portions of the sine wave together there will be a discontinuity, which usually sounds bad.

If you analyse or estimate the phase of each remaining sine wave segment at their new abutment, then you know how much you need to change the phase of one or more portions of sine wave to reduce or eliminate their abutment discontinuities, thus making the result sound better. That’s what a phase vocoder algorithm tries to do.

Yet another option is to only delete segments which are an exact integer multiple of the periodicity in length. But this will rarely be the same as one column of an STFT array, and the original signal may not be perfectly periodic.
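
A small numerical illustration of the three cases described above, assuming a pure 5 Hz sine sampled at 1000 Hz (values picked only for the example). For a single known sinusoid the phase correction is simply the phase that the deleted segment would have advanced; a real phase vocoder has to estimate this per frequency bin.

```python
import numpy as np

fs, f = 1000, 5.0                      # sample rate and tone frequency (assumed values)
t = np.arange(1000) / fs
x = np.sin(2 * np.pi * f * t)

def max_jump(y):
    """Largest sample-to-sample step; a clean sine stays near 2*pi*f/fs."""
    return np.max(np.abs(np.diff(y)))

start, cut = 300, 130                  # delete 130 samples = 0.65 of a period
head, tail_t = x[:start], t[start + cut:]

# Case 1: simply abut the two remaining pieces
naive = np.concatenate([head, np.sin(2 * np.pi * f * tail_t)])

# Case 2: rotate the tail's phase by the amount the deleted segment advanced it
phase_fix = -2 * np.pi * f * cut / fs
fixed = np.concatenate([head, np.sin(2 * np.pi * f * tail_t + phase_fix)])

# Case 3: delete exactly one period (fs / f = 200 samples), no phase fix needed
whole_period = np.concatenate([x[:start], x[start + 200:]])

print(max_jump(x), max_jump(naive), max_jump(fixed), max_jump(whole_period))
```

Only the naive join shows a jump much larger than the normal sample-to-sample step of the sine.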

Why is deleting parts of the STFT equivalent to deleting a segment of the sine wave? What is the formula for correcting the phase? (Even if you could reference me to the phase vocoder algorithm that would be great) — snatchysquid, May 05 '22 at 07:15

orchi_d · Answer 2 · 2022-05-09T20:40:38.590

-1

Speeding up or slowing down sounds is called time-scale modification, and it is not as simple as throwing away segments. You start with the spectrogram but use different analysis and synthesis hop sizes depending on the desired speed-up/slow-down factor.
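
To make the role of the two hop sizes concrete, here is a rough time-domain sketch. It is plain overlap-add with no phase handling, so it only shows how the hop ratio sets the output duration; it is not the linked implementation, and the function name `ola_time_scale` and the parameters `n_win` and `hop_out` are made up for this example.

```python
import numpy as np

def ola_time_scale(x, ratio, n_win=256, hop_out=64):
    """Overlap-add time scaling: read frames every hop_in, write them every hop_out.

    ratio > 1 speeds up, ratio < 1 slows down. No phase correction is done.
    """
    hop_in = max(1, int(round(hop_out * ratio)))       # analysis hop
    window = 0.5 - 0.5 * np.cos(2 * np.pi * np.arange(n_win) / n_win)
    n_frames = 1 + (len(x) - n_win) // hop_in
    y = np.zeros(n_win + (n_frames - 1) * hop_out)
    weight = np.zeros_like(y)                          # summed window, for normalisation
    for i in range(n_frames):
        src = x[i * hop_in : i * hop_in + n_win]
        dst = slice(i * hop_out, i * hop_out + n_win)
        y[dst] += window * src
        weight[dst] += window
    return y / np.maximum(weight, 1e-8)

x = np.random.default_rng(0).standard_normal(16000)
fast = ola_time_scale(x, ratio=2.0)    # analysis hop 128, synthesis hop 64
slow = ola_time_scale(x, ratio=0.5)    # analysis hop 32, synthesis hop 64
print(len(x), len(fast), len(slow))
```

A ratio below 1 needs no special case in this sketch: the analysis hop just becomes smaller than the synthesis hop, so frames are read more densely than they are written and the output gets longer.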

Weighted overlap-add (WOLA) is one of the simpler algorithms to do this. In WOLA you do linear interpolation of the magnitude spectrum between frames. Here is a MATLAB implementation.
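
The linear interpolation of the magnitude spectrum between frames could look like the following sketch. The function name `interp_magnitude` and the choice to stop before the last frame (so that a right-hand neighbour always exists) are assumptions for illustration, not taken from the linked implementation.

```python
import numpy as np

def interp_magnitude(mag, ratio):
    """Resample a (bins, frames) magnitude array at frame positions 0, ratio, 2*ratio, ..."""
    n_frames = mag.shape[1]
    pos = np.arange(0, n_frames - 1, ratio)    # fractional frame positions
    left = np.floor(pos).astype(int)           # frame just before each position
    frac = pos - left                          # distance past that frame, in [0, 1)
    return (1 - frac) * mag[:, left] + frac * mag[:, left + 1]

mag = np.array([[0.0, 10.0, 20.0, 30.0]])      # one frequency bin, four frames
slower = interp_magnitude(mag, 0.5)            # half speed: frames at 0, 0.5, 1, ...
faster = interp_magnitude(mag, 2.0)            # double speed: frames at 0, 2
print(slower, faster)
```

This handles the magnitude only. The phase of each output frame still has to be chosen separately, which is the part a phase vocoder addresses.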

edited May 09 '22 at 20:40

answered May 09 '22 at 20:27


I actually don't know Matlab, but I did look at the first link you sent. I did not understand a few things: What do you do if $\alpha < 1$? Why is the linear interpolation done this way, and why does it guarantee that the phase will be right? Also, I added the code from my course to the question, is it doing the same thing as said in your link (I am unable to understand)? – snatchysquid May 10 '22 at 06:55
Comparing the WOLA site and the code, I do think the two codes do the same thing, except a phase_acc (accumulator) is used instead of simply the angle. Why is that? – snatchysquid May 10 '22 at 07:15

score -1 · Answer 3 · answered May 12 '22 at 19:52

The description seems to match that of a phase vocoder, which isn't a simple algorithm but there's a detailed explanation here. Some points:

1. "Toss every 2nd column then invert" isn't it. Tossing every other column is exactly the same as doubling the hop size. Standard inversion takes two parameters as input: the spectrogram and its hop size. If we specify the original hop size with the halved spectrogram, all that changes is a windowed rescaling (see unbuffer called by istft, also this article on inversion). Additional manipulation of the complex STFT coefficients is required.

2. "in contrast to methods such as deleting every second sample from the original audio, which would make the pitch higher, since the frequency is increased"

"Doubled speed" is doubled frequency:

$$ \cos(\omega (2t)) = \cos ((2\omega) t) $$

For a general signal, the behavior of $x(t) \rightarrow x(2t)$ is more complicated, but always results in higher frequency (except if aliased).

3. "Why is deleting parts of the STFT equivalent to deleting a segment of the sine wave?"

It's not, see 1. I don't think it's what hotpaw meant. (However, it can be if we violate NOLA.)

4. "I found this piece of code"

map_coordinates is undefined, so we can't run it. It's ideal to show the most relevant parts of the code in the question, and link to the full runnable code. Regardless, StackExchange isn't for inspecting lengthy code.

Finally, we must pay attention to aliasing: any algorithm that lowers the output size must have this step, explicitly or implicitly. It must also answer "what if the doubling itself aliases". Again, only throwing away columns addresses neither.
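
The "doubled speed is doubled frequency" point, and the aliasing caveat, can be checked numerically. This is a small sketch with made-up values (1000 Hz sample rate, tones at 100 Hz and 300 Hz); `peak_frequency` is a helper defined only for this example.

```python
import numpy as np

def peak_frequency(x, fs):
    """Frequency (Hz) of the largest FFT magnitude bin."""
    spectrum = np.abs(np.fft.rfft(x))
    freqs = np.fft.rfftfreq(len(x), d=1 / fs)
    return freqs[np.argmax(spectrum)]

fs = 1000
n = np.arange(1000)
tone_100 = np.cos(2 * np.pi * 100 * n / fs)
tone_300 = np.cos(2 * np.pi * 300 * n / fs)

# Keep every second sample and play back at the same sample rate
peak_100_fast = peak_frequency(tone_100[::2], fs)   # 2 * 100 Hz is below Nyquist (500 Hz)
peak_300_fast = peak_frequency(tone_300[::2], fs)   # 2 * 300 Hz is above Nyquist, so it aliases

print(peak_frequency(tone_100, fs), peak_100_fast)
print(peak_frequency(tone_300, fs), peak_300_fast)
```

The 100 Hz tone moves to 200 Hz as the identity predicts. The 300 Hz tone would need 600 Hz, which exceeds the 500 Hz Nyquist limit, so it folds back to 400 Hz instead.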

about 4: map_coordinates is the scipy function: scipy.ndimage.map_coordinates, sorry for the confusion. I'll read the article you mentioned and see if I understand, thank you for the answer. — snatchysquid, May 13 '22 at 12:06
@snatchysquid Consider voting on an answer you found most helpful so the bounty isn't voided. — OverLordGoldDragon, May 14 '22 at 00:23
