I'm working on an audio dereverberation deep learning model based on the U-Net architecture. The idea for my project comes from image denoising with autoencoders: I feed the reverberated spectrogram to the network, and the network should output the cleaned version. I train the network with pairs of spectrograms, the clean version and the reverberated version.
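For context, here is a minimal sketch of how such a clean/reverberant spectrogram pair can be built (librosa is assumed here; the file names, sample rate, and STFT parameters are placeholders, not my actual values):

```python
import numpy as np
import librosa

def magnitude_spectrogram(path, sr=16000, n_fft=512, hop_length=128):
    # Load mono audio and compute the STFT magnitude (the phase is discarded).
    y, _ = librosa.load(path, sr=sr)
    stft = librosa.stft(y, n_fft=n_fft, hop_length=hop_length)
    return np.abs(stft)

# Hypothetical file names: one clean recording and its reverberated version.
clean_mag = magnitude_spectrogram("clean.wav")
reverb_mag = magnitude_spectrogram("reverb.wav")
```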
This is one of the papers I'm following for this project: Speech Dereverberation Using Fully Convolutional Networks.
My problem is how to save the spectrograms of the audio data for training. I have tried two approaches:
- I saved the spectrograms as RGB images, so each one is a 3D tensor, which is exactly what a convolutional network expects as input for training. The trained model is then able to output a reconstructed version of the input spectrogram with less reverb. The problem with this solution is that I can't recover the audio from the cleaned spectrogram, since it is an RGB image.
- I saved the spectrogram matrix directly with numpy.save() and then reloaded it with numpy.load(). With this solution I obtain the dereverberated spectrogram matrix directly as output, which can be fed to the Griffin-Lim algorithm to recover the audio (this works because I only consider the magnitude of the spectrogram). The problem with this solution is that I don't know whether I can feed this 2D numpy array (the STFT magnitude matrix) directly to the convolutional network, or whether I need some kind of preprocessing; a sketch of what I have in mind is shown below.
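This is roughly what I mean by feeding the 2D matrix to the network and then recovering audio. It is only a sketch, assuming a Keras/TensorFlow-style channels-last layout and librosa's Griffin-Lim implementation; the file name and STFT parameters are placeholders and must match whatever was used to create the spectrograms:

```python
import numpy as np
import librosa

# Reload the magnitude spectrogram that was saved earlier with numpy.save().
mag = np.load("reverb_mag.npy")          # shape (freq_bins, time_frames)

# A 2D convolutional layer expects (batch, height, width, channels) in
# Keras/TensorFlow (or (batch, channels, height, width) in PyTorch),
# so the 2D matrix needs explicit batch and channel axes.
x = mag[np.newaxis, ..., np.newaxis]     # shape (1, freq_bins, time_frames, 1)

# ... x would go through the trained U-Net, which returns a cleaned magnitude ...
cleaned_mag = x[0, ..., 0]               # drop the batch/channel axes again

# Recover a waveform from the magnitude-only spectrogram with Griffin-Lim.
audio = librosa.griffinlim(cleaned_mag, n_iter=64,
                           hop_length=128, win_length=512)
```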