
I am doing my final project at university: pitch estimation from a song recording using a convolutional neural network (CNN). I want to retrieve the pitches present in a song recording. For the CNN input, I am using a spectrogram.

I am using the MIR-QBSH dataset with pitch vectors as data labels. Before feeding the audio to the CNN (each clip is 8 seconds long, stored as an 8 kHz, 8-bit, mono .wav file), I need to pre-process the audio into a spectrogram representation.
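For reference, a minimal spectrogram of one such clip can be computed with scipy alone. This is only a sketch: it uses a synthetic 8-second signal in place of a real MIR-QBSH file, and the 1024-sample window with 50% overlap is an assumed parameterization, not one prescribed by the dataset.

```python
import numpy as np
from scipy import signal

fs = 8000                 # MIR-QBSH sample rate
x = np.zeros(fs * 8)      # stand-in for one 8-second clip
# 440 Hz tone in the first second, so the spectrogram is not all zeros
x[:fs] = np.sin(2 * np.pi * 440 * np.arange(fs) / fs)

# One common parameterization: 1024-sample window, 50% overlap
f, t, Sxx = signal.spectrogram(x, fs=fs, nperseg=1024, noverlap=512)

print(Sxx.shape)  # (frequency bins, time frames) = (513, 124)
```

The resulting array (frequency bins x time frames) is the kind of 2-D input a CNN consumes; the dB conversion and plotting steps in the pipelines below only change how that array is scaled and displayed.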

I have found 3 ways to generate a spectrogram; the code is listed below. The audio example I am using in this code is available here.

Imports:

import librosa
import numpy as np
import matplotlib.pyplot as plt
import librosa.display
from numpy.fft import *
import math
import wave
import struct
from scipy.io import wavfile

Spectrogram A

x, sr = librosa.load('audio/00020_2003_person1.wav', sr=None)

window_size = 1024
hop_length = 512
n_mels = 128
time_steps = 384

window = np.hanning(window_size)
stft = librosa.stft(x, n_fft=window_size, hop_length=hop_length, window=window)
out = 2 * np.abs(stft) / np.sum(window)

plt.figure(figsize=(12, 4))
ax = plt.axes()
plt.set_cmap('hot')
librosa.display.specshow(librosa.amplitude_to_db(out, ref=np.max),
                         y_axis='log', x_axis='time', sr=sr)
plt.savefig('spectrogramA.png', bbox_inches='tight', transparent=True, pad_inches=0.0)

Spectrogram B

x, sr = librosa.load('audio/00020_2003_person1.wav', sr=None)
X = librosa.stft(x)
Xdb = librosa.amplitude_to_db(abs(X))
# plt.figure(figsize=(14, 5))
librosa.display.specshow(Xdb, sr=sr, x_axis='time', y_axis='hz')

Spectrogram C

# Read the wav file (mono)
samplingFrequency, signalData = wavfile.read('audio/00020_2003_person1.wav')
print(samplingFrequency)
print(signalData)

Plot the signal read from the wav file:

plt.subplot(111)
plt.specgram(signalData, Fs=samplingFrequency)
plt.xlabel('Time')
plt.ylabel('Frequency')

Spectrogram results are displayed below:

[Image: Spectrogram Results]

My question is: of the 3 spectrograms listed above, which is best to use as input to a CNN, and why? I am currently having difficulty finding their differences, as well as their pros and cons.
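For what it's worth, all three pipelines compute essentially the same magnitude STFT and differ mainly in FFT size, normalization, and the dB mapping. A numpy-only sketch (using random noise in place of the real audio, and re-implementing the windowed STFT by hand) shows that A's `2 / sum(window)` rescaling only shifts the dB values by a constant offset, which input normalization before the CNN would absorb anyway:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(8000)  # stand-in for an 8 kHz audio signal

n_fft, hop = 1024, 512
window = np.hanning(n_fft)

# Frame the signal and take the windowed FFT (the core of every pipeline)
frames = np.stack([x[i:i + n_fft] * window
                   for i in range(0, len(x) - n_fft + 1, hop)], axis=1)
mag = np.abs(np.fft.rfft(frames, axis=0))

# Pipeline A rescales by 2 / sum(window); B and C do not
mag_a = 2 * mag / window.sum()

# After conversion to dB, the two differ only by a constant offset
db = 20 * np.log10(mag + 1e-10)
db_a = 20 * np.log10(mag_a + 1e-10)
offset = db - db_a

print(offset.mean())  # ~48.2 dB, constant across all bins
```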

  • They're the same, just with different colors (and in one case, a different axis scale). CNN doesn't care what color your graphs are. – hobbs Dec 16 '20 at 19:25
  • Your method may work but a CNN + spectrograms may not be the best approach for audio processing. https://towardsdatascience.com/whats-wrong-with-spectrograms-and-cnns-for-audio-processing-311377d7ccd – alessandro Dec 16 '20 at 19:56
  • Ah, I see. I thought that it's gonna bring a big difference for CNN if I use type A or type B or type C. Now I think as long as I am consistent with one type, the processing in CNN should be working well. Thank you, @hobbs – Dionisius Pratama Dec 21 '20 at 15:47
  • I am an undergraduate computer science student, and at my uni, this is the one of a few research focusing in audio processing, so my lecturer does not really take 'is this the best method' into a big consideration. But, I want to say thank you for the information you have given. I can bring this matter to be discussed with my research supervisor. @alessandro – Dionisius Pratama Dec 21 '20 at 15:51
  • Late comment but I'd like to hear your conclusion after trying all three. I think replacing features for different trials (training CNN) is not that time consuming once you built your pipeline. – Long Jul 02 '23 at 06:04
  • 1
    @Long I finally decided to go with spectrogram A. Training CNN was not as long as I had thought, but I needed so many filters at each convolution (I used 4 layers of 1D CNN convolution) to reach a good performance. – Dionisius Pratama Jul 10 '23 at 14:56

3 Answers


I recommend a Synchrosqueezed Continuous Wavelet Transform representation, available in ssqueezepy. Synchrosqueezing arose in the context of audio processing (namely speaker identification), and there is much literature on applying the CWT to audio tasks.

The advantage over the STFT is the inherently logarithmic nature of the feature extractor, which matches how audio is structured. That is, we perceive sound differences in relative terms: 6 kHz is to 3 kHz what 24 kHz is to 12 kHz. The STFT cannot map such features without being very wasteful, as it lacks an adaptive window (your Spectrogram A is not logarithmic; it is a logarithmic display of linear data). The analytic CWT (CWT with an analytic wavelet) further enjoys superior instantaneous frequency and amplitude representations (see the references below).
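The "wasteful" point can be made concrete with a small numpy sketch: on a linear frequency axis, one perceptual octave near the singing range gets far fewer bins than one octave near Nyquist. The FFT size and band edges below are illustrative choices for an 8 kHz signal, not values from the question's code:

```python
import numpy as np

fs = 8000
freqs = np.fft.rfftfreq(8192, d=1 / fs)  # linear STFT bin centers

# One octave low (125-250 Hz, where singing pitch lives) vs. one octave
# high (2000-4000 Hz): equal in perceptual terms, very unequal in bins
low = np.sum((freqs >= 125) & (freqs < 250))
high = np.sum((freqs >= 2000) & (freqs < 4000))

print(low, high)  # 128 bins vs. 2048 bins for the same musical interval
```

A logarithmic feature extractor (CWT, or a constant-Q/mel front end) allocates resolution per octave instead, which is what a pitch task actually needs.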

Synchrosqueezing further enhances these representations, and can be thought of as an attention mechanism. Experiments have shown it to enhance EEG classifier performance, for example. Currently only CWT is supported, but I'll be adding STFT support within a week or so.


Examples on EEG data:

[Images: synchrosqueezed CWT examples on EEG data]

OverLordGoldDragon

From the information you've given, I'd use spectrogram A, as its frequency data is spaced logarithmically. This is advantageous for pitch-detection applications, as it gives you greater resolution around the fundamental frequencies.

tobassist
  • Spectrogram A does indeed use a log axis when plotted, but it appears that the actual underlying data is still denominated in a linear frequency scale, meaning that there's not actually any more resolution around the fundamental frequencies (except perhaps it's more visible to the human eye in that spectrogram) – nanofarad Dec 16 '20 at 17:26
  • Thank you for the insights! – Dionisius Pratama Dec 21 '20 at 15:44
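The comment's point is easy to check directly: assuming `n_fft=1024` at the dataset's 8 kHz rate, the STFT bin centers are spaced by a constant 7.8125 Hz from DC to Nyquist, so `y_axis='log'` in `specshow` only warps the picture, not the underlying data:

```python
import numpy as np

# Bin centers of a 1024-point STFT at an 8 kHz sample rate
freqs = np.fft.rfftfreq(1024, d=1 / 8000)

# The spacing is constant (8000 / 1024 = 7.8125 Hz) everywhere:
spacing = np.diff(freqs)
print(spacing[0], spacing[-1])  # 7.8125 7.8125
```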

It is preferable to plot the spectrogram using Praat. It is a very trusted, long-established tool for calculating important speech features, and it can also be used from Python; check out this link: Parselmouth. Hand-coding a spectrogram can be error-prone, so what I suggest is to use a Praat script to cross-verify which one is correct. Hope it helps.

lennon310
  • This doesn't appear to answer the question. The question is not asking how to plot spectrograms; it is asking which to use (i.e., how to generate them) with a neural network. – D.W. Dec 16 '20 at 19:55
  • Thank you for your answer, Abhishek. Although I am still not sure which spectrogram I am gonna use by reading your answer, I will take your answer as a suggestion. – Dionisius Pratama Dec 21 '20 at 15:44