Phase vocoder, pitch shift in MATLAB and C

Question

I'm trying to make a phase vocoder in MATLAB, but I'm stuck on the calculations for changing the pitch. I browsed loads of websites and I simply do not understand what to do.

Currently I have in my MATLAB code a working flow where input = output

$${\xrightarrow{\rm Input}}\boxed{\text{sampling}}{\rightarrow}\boxed{\text{Hanning window}}{\rightarrow}\boxed{\text{STFT}}{\rightarrow} \cdots {\rightarrow}\boxed{\text{ISTFT}}{\rightarrow}\boxed{\text{Add}}{\xrightarrow{\rm Output}}$$

What do I have to do in the $"\cdots"$ to get the pitch shift? With the STFT I get the FFT of 1024 sample windows with 512 overlap.

here is my MATLAB code, this can be used to pitch shift: first you need to time stretch and then apply some kind of interpolation at the end — ederwander, Nov 22 '17 at 01:25
i was about to suggest to the OP that he/she use the phase vocoder first to time-scale (which changes tempo without changing pitch) and then resample (which changes pitch and tempo together). that way the Hann windows stay the same width in the reconstruction overlap-add of the phase vocoder. — robert bristow-johnson, Nov 22 '17 at 04:24
i have some MATLAB phase vocoder code, too. it was done back in 2001 when i did this paper on time-scaling each sinusoidal component inside each frame or block within the phase vocoder. — robert bristow-johnson, Nov 22 '17 at 04:25
@ederwander Problem is that i have to put the algorithm in a real time dsp afterwards. Using your algorithm it is not possible as i can see. The dsp reads 1024 bytes while playing 1024 bytes in an other buffer. — Kristof, Nov 22 '17 at 17:48
what's the target DSP or CPU? are you coding this in C? if the DSP is a SHArC, i recommend staying away from C and doing this in asm. still, in my opinion, you'll get better real-time results with a time-domain pitch shifter. — robert bristow-johnson, Nov 22 '17 at 20:14
@robertbristow-johnson i will be using the C5515 eZDSP USB Stick. This is a school project, we have to use an fft, and i'll be using the hardware fft inside the board. The code from ederwander time streches the complete signal. After that i can just resample to get the signal again. In real time this method is not possible, or is it?. — Kristof, Nov 22 '17 at 20:23
@Kristof keep a couple frames in memory and only process about half way through the data in memory during each frame. That way you can see data samples both before and after the samples that you are currently processing. This does add some additional delay in the output... — ederwander, Nov 22 '17 at 21:57
@ederwander yes i get that, but take your matlab code, i'll have to resample in the for loop i guess to directly get the pitched signal. Been trying to do it the whole evening. Had no success doing it :/. If any of u would have an idea? Guess i don't completely understand the matlab code. — Kristof, Nov 22 '17 at 22:29
@Kristof just apply resample in time stretched frame and stream it in the loop I do it in real-time all the time, phase vocoder is not the best way to use in real-time, but if you have processing and sufficient memory to storage some frames will work with some delay... do you understand how overlap-add works? — ederwander, Nov 22 '17 at 22:47
@Kristof here my demo code working in real-time while one song are playing, you can use one microphone too, this use the same principle that i said above. — ederwander, Nov 22 '17 at 22:58
@ederwander I got it working in matlab, but dont rly understand how the time streching works. I get all the steps of the process, fft, ifft, overlap-add, ... except the fact that the signal is time stretched. The amount of points you get back from the ifft is the same as the ones you put in. What does change in the frequency domain to get the time strech? — Kristof, Nov 22 '17 at 23:15
@ederwander could u give me your python code from the youtube vid — Kristof, Nov 23 '17 at 21:57

robert bristow-johnson · Answer 1 · 2017-11-24T23:44:56.797

alright, conceptually, for one frame (which is centered at $n=0$ and has non-zero width of $N-1$ samples):

$$\begin{align} X[k] &= \mathcal{DFT}\Big\{ x[n] w(n) \Big\} \\ \\ &= \sum\limits_{n=-\tfrac{N}{2}}^{\tfrac{N}{2}-1} \big( x[n] w(n) \big) \, e^{-j \frac{2 \pi}{N} nk} \qquad |k| \le \tfrac{N}2 \\ \end{align}$$

here, our frame hop is $\tfrac{N}2$ samples, which is half of the window width.

$w(t)$ is a continuous-time complementary window with non-zero width of $N$ samples and centered at $t=0$, like a Hann window:

$$ w(t) = \begin{cases} \tfrac12 + \tfrac12 \cos\left(\tfrac{2\pi}{N}t \right) \qquad & |t| < \tfrac{N}2 \\ 0 \qquad & |t| \ge \tfrac{N}2 \\ \end{cases} $$
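a minimal Python sketch (not from the original thread) of what "complementary" means here: with a hop of $\tfrac{N}{2}$, the Hann window of the current frame and the window of the previous frame should add up to 1 everywhere in the crossfade. the function name `hann` and the value `N = 1024` (taken from the question) are illustrative choices.

```python
import math

def hann(t, N):
    """Hann window of width N centered at t = 0, zero for |t| >= N/2."""
    if abs(t) < N / 2:
        return 0.5 + 0.5 * math.cos(2 * math.pi * t / N)
    return 0.0

N = 1024          # frame length used in the question
hop = N // 2      # frame hop of N/2 samples (50% overlap)

# In the current frame's time index the previous frame is centered at n = -hop,
# so its window is hann(n + hop, N). Check the sum over the whole crossfade.
sums = [hann(n, N) + hann(n + hop, N) for n in range(-hop, 1)]
max_dev = max(abs(s - 1.0) for s in sums)
print(max_dev)
```

if the printed deviation is at rounding-error level, overlap-adding unmodified frames reproduces the input, which is the "input = output" flow described in the question.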

the output spectrum is a frequency-scaled copy of the input spectrum (stretched or compressed along the $k$ axis). this is what shifts the pitch.

$$ Y[k] = \begin{cases} X(\alpha^{-1} k) \qquad & |k| \le \tfrac{N \alpha}{2} \\ 0 \qquad & |k| > \tfrac{N \alpha}{2} \\ \end{cases} $$

where $\alpha$ is the output-to-input frequency ratio. $X(f)$ is a continuous-frequency function that is interpolated from the discrete-frequency spectrum $X[k]$ in such a way that equality exists when the argument is an integer.

$$ X(f) \bigg|_{f=k} = X[k] \qquad k \in \mathbb{Z} $$
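the answer does not say which interpolation to use for $X(f)$. the Python sketch below assumes plain linear interpolation between neighbouring bins, purely for illustration (it satisfies $X(f)=X[k]$ at integer $f$, but a real implementation would likely want something better). it only handles the non-negative bins $0 \le k \le \tfrac{N}{2}$; for real input the negative bins follow by conjugate symmetry. the helper names, `N = 64`, the test tone at bin 8 and `alpha = 1.5` are all hypothetical.

```python
import cmath
import math

def centered_dft(frame, N):
    """DFT of a frame indexed n = -N/2 .. N/2-1, returned for bins k = 0 .. N/2."""
    half = N // 2
    return [sum(frame[n + half] * cmath.exp(-2j * math.pi * n * k / N)
                for n in range(-half, half))
            for k in range(half + 1)]

def interp_spectrum(X, f):
    """X(f) by linear interpolation (an assumption); equals X[k] at integer f."""
    if f < 0 or f > len(X) - 1:
        return 0j                      # outside the known spectrum
    k = int(math.floor(f))
    if k == len(X) - 1:
        return X[k]
    frac = f - k
    return (1 - frac) * X[k] + frac * X[k + 1]

def scale_spectrum(X, alpha):
    """Y[k] = X(k / alpha); zero wherever k / alpha falls past bin N/2."""
    return [interp_spectrum(X, k / alpha) for k in range(len(X))]

N = 64
half = N // 2
k0 = 8            # test tone sits exactly on bin 8
alpha = 1.5       # output-to-input frequency ratio

# Hann-windowed cosine, frame centered at n = 0
frame = [(0.5 + 0.5 * math.cos(2 * math.pi * n / N))
         * math.cos(2 * math.pi * k0 * n / N)
         for n in range(-half, half)]

X = centered_dft(frame, N)
Y = scale_spectrum(X, alpha)

peak_in = max(range(half + 1), key=lambda k: abs(X[k]))
peak_out = max(range(half + 1), key=lambda k: abs(Y[k]))
print(peak_in, peak_out)
```

the spectral peak should move from bin `k0` to bin `alpha * k0`, which is the frequency scaling the formula for $Y[k]$ describes.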

the discrete-time-domain output for this frame (before adjusting for stretching or scrunching the window) is:

$$\begin{align} y[n] &= \mathcal{iDFT}\Big\{ Y[k] e^{j \phi[k]} \Big\} \\ \\ &= \tfrac{1}{N} \sum\limits_{k=-\tfrac{N}{2}}^{\tfrac{N}{2}-1} \big( Y[k] e^{j \phi[k]} \big) \, e^{j \frac{2 \pi}{N} nk} \qquad |n| \le \tfrac{N}2 \\ \end{align}$$

and the frame of output (centered at time $n=0$) is

$$ y[n] \frac{w(n)}{w(\alpha \, n)} $$

the $\frac{w(n)}{w(\alpha \, n)}$ factor is undoing the stretched or scrunched input window and re-applying the original window at the same scale (since we're not time-scaling and the frame hop length remains unchanged from input to output).
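a small Python sketch of that correction factor. one practical detail the formula leaves open: for $\alpha > 1$ the denominator $w(\alpha n)$ reaches zero at $|n| = \tfrac{N}{2\alpha}$, which is still inside the frame, so a literal division blows up. the clamp `max_gain` below is an assumption added only to keep the sketch finite, it is not part of the answer, and the function names are hypothetical.

```python
import math

def hann(t, N):
    """Hann window of width N centered at t = 0, zero for |t| >= N/2."""
    if abs(t) < N / 2:
        return 0.5 + 0.5 * math.cos(2 * math.pi * t / N)
    return 0.0

def window_correction(n, alpha, N, max_gain=8.0):
    """w(n) / w(alpha * n), clamped to max_gain (the clamp is an assumption)."""
    num = hann(n, N)
    den = hann(alpha * n, N)
    if num == 0.0:
        return 0.0                 # outside the output window
    if den * max_gain <= num:
        return max_gain            # denominator too small: clamp instead of dividing
    return num / den

N = 1024
for n in (0, 128, 256, 400):
    print(n, window_correction(n, 1.5, N), window_correction(n, 0.8, N))
```

with `alpha = 1` the factor is 1 wherever the window is non-zero, so the frame passes through unchanged.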

$\phi[k]$ is a phase adjustment (with odd symmetry in $k$) that is constant for each sinusoidal component (a la Miller Puckette) rather than a changing phase adjustment for each FFT bin (a la Portnoff). that phase adjustment for each frequency component is what is required to make each sinusoid component continuous because of the frequency shift. that's how the phase vocoder does glitch-free pitch shifting, even if the input is not periodic or a single harmonic note.

this is why this phase adjustment is needed. consider a single sinusoid with normalized angular frequency $\omega_0$.

$$\begin{align} x[n] &= \cos(\omega_0 n + \theta_0) \\ &= \tfrac12\big(e^{j (\omega_0 n + \theta_0)} + e^{-j (\omega_0 n + \theta_0)} \big) \\ &= \tfrac12\big(e^{j\theta_0}e^{j \omega_0 n} + e^{-j\theta_0}e^{-j \omega_0 n} \big) \\ \end{align}$$

now let's consider only the positive frequency complex component.

$$ \hat{x}[n] = e^{j\theta_0}e^{j \omega_0 n} $$

this has an instantaneous angle of $\theta_0$ at time $n=0$. now the instantaneous value of this sinusoid in the center of the previous frame is

$$ \hat{x}[-\tfrac{N}{2}] = e^{j\theta_0}e^{-j \omega_0 \frac{N}{2}} $$

and the instantaneous angle is $\theta_0 - \tfrac{N}{2} \omega_0$. and it shouldn't surprise us that halfway between the two frame centers (which is exactly in the middle of the crossfade from the previous frame to the current frame) the instantaneous angle is $\theta_0 - \tfrac{N}{4} \omega_0$.

now if the output spectrum is mapped from the interpolated input spectrum as

$$ Y[k] = X(\alpha^{-1} k) $$

then this component frequency at the input, $\omega_0$ gets mapped to $\alpha \omega_0$. and the current frame output (before the phase adjustment) is

$$\begin{align} y[n] &= \cos(\alpha\omega_0 n + \theta_0) \\ &= \tfrac12\big(e^{j (\alpha\omega_0 n + \theta_0)} + e^{-j (\alpha\omega_0 n + \theta_0)} \big) \\ &= \tfrac12\big(e^{j\theta_0}e^{j \alpha\omega_0 n} + e^{-j\theta_0}e^{-j \alpha\omega_0 n} \big) \\ \end{align}$$

and the positive frequency component is

$$ \hat{y}[n] = e^{j\theta_0}e^{j \alpha \omega_0 n} $$

and its value at $n=-\tfrac{N}{4}$, the middle of the crossfade where the current frame is joined to the previous frame, is

$$ \hat{y}[-\tfrac{N}{4}] = e^{j\theta_0}e^{-j \alpha \omega_0 \frac{N}{4}} $$

now the instantaneous angle of the output sinusoid at the center of the previous frame is the same as the instantaneous angle of the input sinusoid at the center of the previous frame, which is $\theta_0 - \tfrac{N}{2} \omega_0$. this makes the output sinusoid (the positive frequency component) of the previous frame, when "justified" or expressed in terms of the time index of the current frame

$$ \hat{y}[n] = e^{j(\theta_0 - \frac{N}{2}\omega_0)}e^{j \alpha \omega_0 (n + \frac{N}{2})} $$

(note that when $n=-\tfrac{N}{2}$, which is the center of the previous frame, the instantaneous angle is $\theta_0 - \tfrac{N}{2} \omega_0$.) now at the output sample midway between the previous frame and the current frame (which is $n=-\tfrac{N}{4}$), the sinusoid of the previous frame is at angle

$$\theta_0 - \tfrac{N}{2}\omega_0 + \alpha \omega_0 (-\tfrac{N}{4} + \tfrac{N}{2}) = \theta_0 + (\alpha - 2)\omega_0\tfrac{N}{4}$$

but, at that same sample, the angle of the sinusoid of the current frame is

$$ \theta_0 - \alpha \omega_0 \tfrac{N}{4} $$

note that when there is no pitch shift and $\alpha = 1$, then the two angles are the same and the splice is seamless. but when $\alpha \ne 1$, the phase of one of the frames must be adjusted to make the splice seamless. the previous frame is already a "done deal" and will not be modified, so it's the phase of the sinusoid of the current frame that must be adjusted. that phase is modified for all DFT bins associated with this particular sinusoid as

$$\begin{align} \theta_0 - \alpha \omega_0 \tfrac{N}{4} + \phi[k] &= \theta_0 + (\alpha - 2)\omega_0\tfrac{N}{4} \\ -\alpha \omega_0 \tfrac{N}{4} + \phi[k] &= (\alpha - 2)\omega_0\tfrac{N}{4} \\ \phi[k] &= (\alpha - 1)\omega_0\tfrac{N}{2} \\ \end{align}$$

so the phase adjustment for the current frame necessary to align it with the previous frame is the normalized angular frequency of the sinusoid, $\omega_0$, times the frame hop displacement $\tfrac{N}{2}$ times the factor $(\alpha - 1)$. you should apply the same phase adjustment to all DFT bins of $Y[k]$ that are associated with this particular sinusoidal component at frequency $\omega_0$. the values of $k$ will be around $\tfrac{\omega_0}{2 \pi} N$.
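the result $\phi = (\alpha - 1)\,\omega_0 \tfrac{N}{2}$ can be checked numerically. the Python sketch below builds the positive-frequency output component of the previous frame and of the current frame exactly as in the expressions above and compares them at the crossfade midpoint $n=-\tfrac{N}{4}$, with and without the adjustment. the values of `alpha`, `omega0` and `theta0` are arbitrary test values, not from the thread.

```python
import cmath
import math

N = 1024
alpha = 1.25                        # hypothetical pitch ratio
omega0 = 2 * math.pi * 37.3 / N     # hypothetical sinusoid, between bins 37 and 38
theta0 = 0.7                        # hypothetical phase at n = 0

def prev_frame(n):
    """Previous frame's output component, in the current frame's time index."""
    return (cmath.exp(1j * (theta0 - omega0 * N / 2))
            * cmath.exp(1j * alpha * omega0 * (n + N / 2)))

def cur_frame(n, phi=0.0):
    """Current frame's output component with an optional phase adjustment phi."""
    return cmath.exp(1j * (theta0 + phi)) * cmath.exp(1j * alpha * omega0 * n)

phi = (alpha - 1) * omega0 * N / 2  # adjustment derived above
n_mid = -N // 4                     # midpoint of the crossfade

mismatch_before = abs(prev_frame(n_mid) - cur_frame(n_mid))
mismatch_after = abs(prev_frame(n_mid) - cur_frame(n_mid, phi))
print(mismatch_before, mismatch_after)
```

because both frames rotate at the same output frequency $\alpha\omega_0$, matching them at one sample matches them across the whole overlap, not only at the midpoint.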

so Miller Puckette says

$$\phi[k] = (\alpha - 1) \omega_0 \tfrac{N}{2} \qquad \text{for } k \approx \tfrac{\omega_0}{2 \pi} N$$

and Portnoff would say that the adjustment would be

$$\phi[k] = (\alpha - 1) \tfrac{2 \pi k}{N} \tfrac{N}{2} = (\alpha - 1) \pi k$$
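to make the difference between the two rules concrete, here is a short Python sketch that evaluates both for a sinusoid lying between bin centers. the list `bins` stands in for "the bins associated with this sinusoid" and is a hypothetical choice, as are `alpha` and `omega0`.

```python
import math

def phi_puckette(omega0, alpha, N):
    """One adjustment per sinusoidal component, from its actual frequency."""
    return (alpha - 1) * omega0 * N / 2

def phi_portnoff(k, alpha):
    """One adjustment per DFT bin, from the bin-center frequency 2*pi*k/N."""
    return (alpha - 1) * math.pi * k

N = 1024
alpha = 1.25
omega0 = 2 * math.pi * 37.3 / N     # hypothetical component between bins 37 and 38
bins = [36, 37, 38, 39]             # hypothetical bins carrying this component

puckette = [phi_puckette(omega0, alpha, N) for _ in bins]
portnoff = [phi_portnoff(k, alpha) for k in bins]

for k, a, b in zip(bins, puckette, portnoff):
    print(k, round(a, 4), round(b, 4))
```

the per-bin rule changes by $(\alpha-1)\pi$ from one bin to the next, so neighbouring bins of the same sinusoid get rotated by different amounts. the two rules agree only when $\omega_0$ falls exactly on a bin center.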

but, in my opinion, Puckette is correct and Portnoff is wrong.

Thx :D How is the constant phase adjustment calculated? And u got any pseudo code? — Kristof, Nov 23 '17 at 01:03
Tried it in matlab, keeps echoing itself. https://pastebin.com/tjcZykCk not sure if it is bcuz of the phase, but i dont think it is. Seems like the sound is just time stretching — Kristof, Nov 23 '17 at 02:12
If i do it with windows of 1024 without overlap i get an amazingly high noise — Kristof, Nov 23 '17 at 03:05
i will try to spell out how that interframe phase adjustment is done. i've only done this for time-scaling. i did not do a phase vocoder to pitch shift directly, rather i time-scaled it and then did resampling on the result (or input) of the time-scaling phase vocoder. — robert bristow-johnson, Nov 23 '17 at 05:13
