I'm working on an audio dereverberation deep learning model based on the U-Net architecture. The idea for my project comes from image denoising with autoencoders: I feed the reverberated spectrogram to the network, and the network should output the cleaned version. I train the network with pairs of spectrograms, the clean version and the reverberated version.
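For context, here is a minimal sketch of how such a clean/reverberant spectrogram pair can be built (librosa is assumed here; the file names, sample rate, and STFT parameters are placeholders, not my actual values):

```python
import numpy as np
import librosa

def magnitude_spectrogram(path, sr=16000, n_fft=512, hop_length=128):
    # Load mono audio and compute the STFT magnitude (the phase is discarded).
    y, _ = librosa.load(path, sr=sr)
    stft = librosa.stft(y, n_fft=n_fft, hop_length=hop_length)
    return np.abs(stft)

# Hypothetical file names: one clean recording and its reverberated version.
clean_mag = magnitude_spectrogram("clean.wav")
reverb_mag = magnitude_spectrogram("reverb.wav")
```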
This is one of the papers I'm following for this project: Speech Dereverberation Using Fully Convolutional Networks.
My problem is how to save the spectrograms of the audio data for training. I have tried two approaches:
- I saved the spectrograms as RGB images, so each one is a 3D tensor, which is exactly what a convolutional network expects as input for training. The trained model is then able to output a reconstructed version of the input spectrogram with less reverb. The problem with this solution is that I can't recover the audio from the cleaned spectrogram, since it is an RGB image.
- I saved the spectrogram matrix directly with numpy.save() and then reloaded it with numpy.load(). With this solution I obtain the dereverberated spectrogram matrix directly as output, which can be fed to the Griffin-Lim algorithm to recover the audio (this works because I only consider the magnitude of the spectrogram). The problem with this solution is that I don't know whether I can feed this 2D numpy array (the STFT magnitude matrix) directly to the convolutional network, or whether I need some kind of preprocessing; a sketch of what I have in mind is shown below.
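This is roughly what I mean by feeding the 2D matrix to the network and then recovering audio. It is only a sketch, assuming a Keras/TensorFlow-style channels-last layout and librosa's Griffin-Lim implementation; the file name and STFT parameters are placeholders and must match whatever was used to create the spectrograms:

```python
import numpy as np
import librosa

# Reload the magnitude spectrogram that was saved earlier with numpy.save().
mag = np.load("reverb_mag.npy")          # shape (freq_bins, time_frames)

# A 2D convolutional layer expects (batch, height, width, channels) in
# Keras/TensorFlow (or (batch, channels, height, width) in PyTorch),
# so the 2D matrix needs explicit batch and channel axes.
x = mag[np.newaxis, ..., np.newaxis]     # shape (1, freq_bins, time_frames, 1)

# ... x would go through the trained U-Net, which returns a cleaned magnitude ...
cleaned_mag = x[0, ..., 0]               # drop the batch/channel axes again

# Recover a waveform from the magnitude-only spectrogram with Griffin-Lim.
audio = librosa.griffinlim(cleaned_mag, n_iter=64,
                           hop_length=128, win_length=512)
```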