What happens when I try to resample a speech recording from 8kHz to 16kHz?

Question

I have a question regarding resampling in the context of speech. Given a speech recording sampled at 16 kHz, downsampling to 8 kHz will basically remove half of the samples (every 16000 samples become 8000). Now I'm wondering about the inverse scenario: given a telephony-quality speech recording (300 Hz to 3400 Hz) sampled at 8 kHz, what happens if I try to resample the signal to 16 kHz? How will the 8000 additional samples for each second be computed?

I tried this using sox, and a new recording was generated with no complaints or error messages. So my question is: how is this done? Is there some kind of standard procedure, such as interpolation, used to build the missing samples?

People usually use low pass filter. You can find more info here https://dsp.stackexchange.com/questions/38901/is-interpolation-of-an-audio-signal-to-increase-frequency-resolution-possible/38902#38902, and code here https://dsp.stackexchange.com/questions/28202/resampling-a-digital-sound-signal?rq=1 Just curious, there are many resampling questions today. — AlexTP, May 04 '17 at 13:33

Jason R · Accepted Answer · 2017-05-04T13:58:53.390

The new samples are generated by interpolating between the original ones. Exactly how this is done will vary by implementation, but the most typical way would be to use a linear filter as the interpolator. With this technique, you would interpolate by a factor of 2 by first inserting a zero after each of your input samples. Assuming your input signal is $x[n]$, your expanded signal would look like:

$$ x_e[n] = [ x[0], 0, x[1], 0, x[2], 0, \ldots ] $$

Due to the properties of the discrete-time Fourier transform, this has the effect of compressing and duplicating the spectrum of the input signal: all of the spectral content that was present between DC and the Nyquist frequency of the original signal is compressed into a region half as wide (relative to the new sample rate), and a mirrored copy of it, called an image, appears in the other half.
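To make that step concrete, here is a short derivation of the spectrum of the zero-stuffed signal. The symbols $X(e^{j\omega})$ and $X_e(e^{j\omega})$ for the DTFTs of $x[n]$ and $x_e[n]$ are notation introduced here for illustration, not from the original answer. Only the even-indexed samples of $x_e[n]$ are nonzero, so:

$$ X_e(e^{j\omega}) = \sum_{n} x_e[n] \, e^{-j\omega n} = \sum_{m} x_e[2m] \, e^{-j 2\omega m} = \sum_{m} x[m] \, e^{-j 2\omega m} = X(e^{j 2\omega}) $$

Since $X(e^{j\omega})$ is periodic with period $2\pi$, $X(e^{j2\omega})$ is periodic with period $\pi$. The original baseband content therefore occupies $|\omega| < \pi/2$, and an image of it sits centered at $\omega = \pi$. For an 8 kHz input expanded to 16 kHz, the speech stays in the 0 to 4 kHz band and the image fills 4 to 8 kHz.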

Finally, you apply a lowpass filter to remove the extra copy of the spectrum above $f_s/4$, where $f_s$ is the sample rate of the expanded signal (so the cutoff is 4 kHz when going from 8 kHz to 16 kHz). The result is an interpolated-by-2 version of your input signal that is bandlimited to the same region of frequencies as the original.
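The two conceptual steps above (zero-stuffing, then lowpass filtering) can be sketched in a few lines of NumPy. This is only a minimal illustration under assumptions of my own: the function name `upsample_by_2`, the `num_taps` parameter, and the Hamming-windowed sinc design are hypothetical choices, and nothing here is claimed to match what sox does internally. The filter is scaled to a DC gain of 2 because inserting zeros halves the average amplitude.

```python
import numpy as np

def upsample_by_2(x, num_taps=63):
    """Interpolate x by 2: zero-stuff, then lowpass at a quarter of the new rate.

    Hypothetical helper for illustration; num_taps should be odd and the
    zero-stuffed signal should be at least num_taps samples long.
    """
    x = np.asarray(x, dtype=float)

    # Step 1: insert a zero after every input sample.
    x_e = np.zeros(2 * len(x))
    x_e[::2] = x

    # Step 2: Hamming-windowed sinc lowpass, cutoff 0.25 cycles/sample,
    # i.e. f_s/4 where f_s is the NEW sample rate.
    n = np.arange(num_taps) - (num_taps - 1) / 2
    h = 0.5 * np.sinc(0.5 * n) * np.hamming(num_taps)
    h *= 2.0 / h.sum()  # DC gain of 2 compensates for the inserted zeros

    # mode="same" keeps the output aligned with x_e (filter delay removed).
    return np.convolve(x_e, h, mode="same")

# Example: a 440 Hz tone sampled at 8 kHz, interpolated to 16 kHz.
fs_in = 8000
t_in = np.arange(fs_in) / fs_in
tone_8k = np.sin(2 * np.pi * 440 * t_in)
tone_16k = upsample_by_2(tone_8k)
```

Away from the edges of the signal, `tone_16k` should closely follow the same 440 Hz sine evaluated on a 16 kHz time grid, which is the sense in which the process adds samples without adding information.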

Doing this for an audio signal isn't likely to do much for you, unless you have some equipment or algorithm that requires a particular input sample rate. The interpolated audio, when played back at the new rate, should sound the same as the original signal; the interpolation process doesn't create any new information.

Edit: I should also note that the above is just a conceptual description of the process. In practice, you wouldn't use the explicit step of zero-stuffing followed by linear filtering. Instead, you can use multirate techniques like a polyphase realization of the interpolation filter that allow you to achieve the same effect with fewer computations.
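As a rough sketch of the polyphase idea for the factor-of-2 case: the filter `h` is split into its even-indexed and odd-indexed taps, and each half filters the original (not zero-stuffed) signal to produce the even and odd output samples respectively. The function names `interp2_direct` and `interp2_polyphase` are hypothetical, the filters are causal (the output includes the filter delay), and `h` is assumed to have at least two taps. Library routines such as `scipy.signal.resample_poly` provide a polyphase resampler, but the code below uses only NumPy so the equivalence is visible.

```python
import numpy as np

def interp2_direct(x, h):
    """Conceptual version: zero-stuff, then filter with all taps of h."""
    x_e = np.zeros(2 * len(x))
    x_e[::2] = x
    return np.convolve(x_e, h)[: 2 * len(x)]

def interp2_polyphase(x, h):
    """Polyphase version: never multiplies by the inserted zeros."""
    y = np.empty(2 * len(x))
    # Even outputs use the even-indexed taps, odd outputs the odd-indexed taps.
    y[0::2] = np.convolve(x, h[0::2])[: len(x)]
    y[1::2] = np.convolve(x, h[1::2])[: len(x)]
    return y

# Small example with a 3-tap kernel that performs straight-line interpolation
# (chosen only because the result is easy to check by hand).
x = np.array([1.0, 2.0, 3.0, 4.0])
h = np.array([0.5, 1.0, 0.5])
y_direct = interp2_direct(x, h)
y_poly = interp2_polyphase(x, h)
# Both give [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
```

Each branch runs at the input rate with half of the taps, so the polyphase form does about half the multiplications of the direct form for the same output.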
