Reconstructing Audio From Spectrogram (Using it as a filter)

Question

Given a spectrogram calculated using the following code:

import matplotlib.pyplot as plt
import numpy as np
import scipy
from scipy import signal, fft
from scipy.io import wavfile
from skimage import util
import librosa
import pylab
# from scipy.fftpack import fft
def stft(x, fs, framesz, hop):
    framesamp = int(frameszfs)
    hopsamp = int(hopfs)
    w = scipy.hanning(framesamp)
    X = scipy.array([scipy.fft(w*x[i:i+framesamp]) for i in range(0, len(x)-framesamp, hopsamp)])
    return X
def istft(X, fs, T, hop):
    x = scipy.zeros(Tfs)
    framesamp = X.shape[1]
    hopsamp = int(hopfs)
    for n,i in enumerate(range(0, len(x)-framesamp, hopsamp)): x[i:i+framesamp] += scipy.real(scipy.ifft(X[n]))
    return x
audio_data = f'../recordings/SDRuno_20220309_140138Z_1kHz.wav' # y movement as test...
fs, data = wavfile.read(audio_data)
data = np.mean(data, axis=1) # convert to mono by avg l+r channels
N = data.shape[0]
L = N / fs
amp = 2 / N * np.abs(data)
print(f'Audio length: {L:.2f} seconds')
f, ax = plt.subplots()
ax.plot(np.arange(N) / fs, data)
ax.set_xlabel('Time [s]')
ax.set_ylabel('Amplitude [unknown]');
plt.show()
M = int(fs * 0.001 * 20) # originally we had 1024, however now we use this for 20ms window resolution
amp = 2 * np.sqrt(2)
stft_sig_f, stft_sig_t, stft_sig_Zxx = signal.stft(data, fs=fs, window='hann', nperseg=M, detrend=False)
plt.pcolormesh(stft_sig_t, stft_sig_f, np.abs(stft_sig_Zxx), vmin=0, vmax=amp, shading='gouraud')
plt.title('STFT Magnitude')
plt.ylabel('Frequency [Hz]')
plt.xlabel('Time [sec]')
plt.show()
freqs, times, Sx = signal.spectrogram(data, fs=fs, window='hanning', nperseg=M, detrend=False, scaling='spectrum')
f, ax = plt.subplots(figsize=(10,5))
ax.pcolormesh(times, freqs / 1000, 10 * np.log10(Sx), cmap='inferno')
ax.set_ylabel('Frequency [kHz]')
ax.set_xlabel('Time [s]');
plt.show()

How can one use the spectrogram's "cleaner signal" to filter the audio. I did read somewhere that the Sx is effectively the Zxx**2 for which the Zxx can be used for an ISTFT to reconstruct audio, however my knowledge is pretty lacking. Any help would be appreciated!

EDIT: The current code for plotting the STFT just shows a black window.

The IQ WAV file used in this example can be found here: https://www.dropbox.com/s/at31myl5oufvfa3/SDRuno_20220309_140138Z_1kHz.wav?dl=0

OverLordGoldDragon · Accepted Answer · 2022-03-24T21:44:50.570

2

"Spectrogram isn't invertible" isn't entirely true. Simply Google "invert spectrogram python" - among the results should be the most well-known algorithm, Griffin-Lim. One can hence tweak the spectrogram and then invert it, as a form of non-linear filtering. However, the best we can do is recover the signal within a global phase shift, $e^{j\phi_0}$ (which still preserves instantaneous frequency and amplitude perfectly).

Better results can be obtained by implementing STFT via differentiable operators and minimizing reconstruction error via gradient descent (scalogram example); torch.stft might help here.

edited Mar 24 '22 at 21:44

answered Mar 11 '22 at 13:51

OverLordGoldDragon

8,912
5
23
74

1

Ah right! It's not a power spectral density at all, simply blocks of FFT's. My mistake. – Dan Boschen Mar 11 '22 at 13:56
So the suggestion, if I'm correct, is to instead of doing the spectogram for that particular visualisation, simply instead perform the STFT and then perhaps perform ISTFT and then analyse this new signal? – rshah Mar 11 '22 at 13:58
1

To bring the comment forward from an earlier answer the idea is as follows. I am recording the RF signal for a device which is in operation - i.e. it moves every 2s interval and RF is emitted. The goal is to distinguish the operation it is doing by cleaning the signal and build a dataset of signal features (i.e. spectral/mfcc coefficients etc.) for classification. The spectrogram I have shows these individual operations as blocky however I am not sure if it is possible to get classificaiton features from the plot, but instead filter the wav in this way to then get the features from the "audio" – rshah Mar 11 '22 at 13:59
1

@rshah I don't really follow. STFT by itself doesn't "clean" or modify the signal in any way, but one can modify it to achieve cleaning. It's also possible to extract "cleaning indices" from spectrogram but operate directly on STFT and invert the original STFT, which is perfect inversion. And yes, spectrogram can be used for direct feature extraction. I'd also recommend scattering as alternative to MFCC. – OverLordGoldDragon Mar 11 '22 at 14:03
@OverLordGoldDragon Okay I'll look into scattering and looks like the best approach then is to do the STFT and then invert the original STFT and extract the signal features from there. I have updated the OP to show the current approach I am taking and also providing the IQ WAV file I am operating on as an example, in case you wanted to see for yourself what is going on. The sampling rate is 2MHz in this case. – rshah Mar 11 '22 at 14:07
@rshah Your inversion code is missing un-windowing; why not just use scipy.istft? Also 2MHz for stuff like STFT will require splitting the input into chunks then joining together to be computationally tractable. – OverLordGoldDragon Mar 11 '22 at 14:18
I will stick to scipy.stft/istft to keep it simple - no point re-inventing the wheel heh. So correct me if I am wrong, but for say a file of length n seconds (i.e. 120s), I should split this up into say 5s chunks (or even my operation intervals of 2s) and then do the STFT + ISTFT on each of these "audio"/signal chunks (i.e. via PyDub) and then extract the features from the result of the ISTFT? – rshah Mar 11 '22 at 14:21
@rshah There's no point if you're only doing "STFT + ISTFT"; something must come in-between, else it's just a glorified x + 1 - 1. Otherwise yes. But it depends what "features" mean; if you have a large enough dataset, then a raw waveform (ISTFT) may work; abs(STFT) (spectrogram) is a "feature", and a better one with limited data. – OverLordGoldDragon Mar 11 '22 at 14:53
@OverLordGoldDragon by features I essentially mean things that can be used for classification to distinguish between operations I.e. for a deep neural network – rshah Mar 11 '22 at 15:00
@rshah Unsure what "operations" means in context but yes, my previous comment likely applies: abs(STFT) is a more robust and discriminative alternative to a raw waveform if data's limited. – OverLordGoldDragon Mar 11 '22 at 16:07

Knut Inge · Answer 2 · 2022-03-11T18:05:09.067

The magnitude only contains half the required information, as @dan mentions.

You may try to set the complex phase angle to random values [0…2*pi]. That gives a «vocoder-ish» sound. For speech signals it sounds like «noise speaking».

In some cases phase is assumed to be minimum phase but I dont know if that makes sense in a spectrogram?

edit:

Perhaps I do not use the proper terminology. The below piece of MATLAB code performs an fft-based filterbank with 50% overlap, power-complimentary (dual) windowing and complex representation of fft coefficients. It is invertible (within numerical precision)

%% generate window
win_pow = 2;%power complementary
Nh = 256;
N = 2*Nh;
win = hann(N+1);
win = win(1:N);
win = win / sum(win([1 end/2+1]));
win = win.^(1/win_pow);
assert(max(abs(win.^win_pow + circshift(win, Nh).^win_pow - 1)) < 1e-14);
%% test 50% overlapping STFT
x = 2rand(3N,1)-1;
X = buffer(x, N, Nh, "nodelay");
X = X .* win;
X = fft(X);
% <- this is where you would do something useful
Y = ifft(X);
Y = Y .* win;
x_hat = debuffer(Y, Nh);
err = x(1:length(x_hat)) - x_hat;
assert(max(abs(err((Nh+1):2*N))) < 1e-15);

This filterbank is non-overlapping, rectangular windowed and also invertible. I thought that this, too was a "STFT" if somewhat poor due to being "critically sampled":

%% test critically sampled STFT
x = 2*rand(3*N,1)-1;
X = buffer(x, N, 0, "nodelay");
X = X;
X = fft(X);
Y = ifft(X);
Y = Y;
x_hat = debuffer(Y, 0);
err = x(1:length(x_hat)) - x_hat;
assert(max(abs(err((Nh+1):2*N))) < 1e-15);

Whenever I see abs(X) (and the phase is thrown away), I think of that as a spectrogram, and using the critically sampled filterbank from the previous snippet, I cannot see how it would be invertible without some significant assumptions about signal statistics? Is this what you guys call a STFT? A Spectrogram? I am confused by the responses.

%% test critically sampled spectrogram
x = 2*rand(3*N,1)-1;
X = buffer(x, N, 0, "nodelay");
X = X;
X = fft(X);
X = abs(X);
Y = ifft(X);
Y = Y;
x_hat = debuffer(Y, 0);
err = x(1:length(x_hat)) - x_hat;

x is represented as a 1536x1 vector, 12288 bytes on my machine.

X is represented as a 512x3 array, 12288 bytes on my machine. However, as it is mirrored, it only contains 257x3x8 = 6168 bytes of unique information. Thus it cannot possibly convey all possibilities of the x vector?

I see how an oversampled filterbank would contain "additional information" that (even if you throw away phase information) might be used to aid reconstruction. It sounds like an exercise I would rather not venture into.

"Half" is true for DFT but not STFT; majority of phase information is encoded in the redundancy of the transform. — OverLordGoldDragon, Mar 11 '22 at 13:53
@Knut I was wrong-- I was incorrectly thinking "power spectral density" but all phase and amplitude information is preserved. — Dan Boschen, Mar 11 '22 at 14:50

Reconstructing Audio From Spectrogram (Using it as a filter)

f, ax = plt.subplots()

ax.plot(np.arange(N) / fs, data)

ax.set_xlabel('Time [s]')

ax.set_ylabel('Amplitude [unknown]');

plt.show()

amp = 2 * np.sqrt(2)

plt.pcolormesh(stft_sig_t, stft_sig_f, np.abs(stft_sig_Zxx), vmin=0, vmax=amp, shading='gouraud')

plt.title('STFT Magnitude')

plt.ylabel('Frequency [Hz]')

plt.xlabel('Time [sec]')

plt.show()

2 Answers2