I'm relatively new to DSP, so excuse my ignorance, but I was hoping to get an audio-related question answered. If we're able to decompose audio into frequencies (e.g. with a mel spectrogram), why is there a whole machine learning field dedicated to 'pitch detection'? Wouldn't the pitch just be the mean/mode frequency in a melspec?

What is the difference between the true pitch of say a piece of music over time vs the average of the frequencies you can get from a melspec?

Mellow

2 Answers

Mostly because of frequency resolution and latency. The mel spectrogram is just a Short-Time Fourier Transform (STFT) with some remapping of the frequency axis.

The resolution of the Discrete Fourier Transform (DFT) is a function of the transform length. The more resolution you need, the longer the time domain window needs to be.

If you want to build a guitar tuner with an accuracy of 1 cent for the low E string (about 82.4 Hz), you would need a resolution of around 0.05 Hz. This would require a DFT window that's about 20 seconds long. That's simply not practical. There are ways of interpolating with shorter time domain windows, but it's fairly expensive in terms of memory and CPU cycles.
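
For anyone who wants to check that arithmetic, here is a rough sketch. The 82.41 Hz value for the low E string (E2 with A4 = 440 Hz) and the helper names `cents_to_hz` and `window_seconds_for_resolution` are my own assumptions for illustration. It treats the usable resolution as one DFT bin, which ignores window shape and interpolation.

```python
import math

def cents_to_hz(f0_hz, cents):
    """Frequency change in Hz for a pitch offset of `cents` above f0_hz."""
    return f0_hz * (2.0 ** (cents / 1200.0) - 1.0)

def window_seconds_for_resolution(resolution_hz):
    """Window duration whose DFT bin spacing equals resolution_hz.

    Bin spacing is fs / N and the window lasts N / fs seconds, so the
    duration is 1 / resolution_hz regardless of the sample rate.
    """
    return 1.0 / resolution_hz

LOW_E_HZ = 82.41  # assumed: E2 in standard tuning with A4 = 440 Hz

one_cent_hz = cents_to_hz(LOW_E_HZ, 1.0)
duration_s = window_seconds_for_resolution(one_cent_hz)

print(f"1 cent at {LOW_E_HZ} Hz is about {one_cent_hz:.4f} Hz")
print(f"window needed: about {duration_s:.1f} s")
```

The result lands close to the 0.05 Hz and 20 second figures quoted above. Note that the sample rate drops out entirely: a higher sample rate gives you more samples, not a shorter window.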

For high-quality pitch detection, I prefer phase-locked loops. With some intelligent time constant management you can find a very good trade-off between response time and accuracy.

Of course, it all depends on what your specific application requirements are. For some applications a DFT-based algorithm is "good enough", especially if you have to do a DFT anyway.

Hilmar
  • I would add that in general pitch detection is associated with real-time processing, but some robust offline algorithms exist as well that don't have to deal with real-time limitations (processing delays, for example). – Jdip Nov 12 '22 at 14:02
  • As a side note, @Hilmar, I've seen you repeatedly mention PLLs on this forum. Any chance you'd have the time (and the will!) to write a small tutorial on the subject in the future? – Jdip Nov 12 '22 at 14:03
  • By "PLL" do you mean "pitch tracking"? Because I cannot imagine a PLL designed the way they are for communications systems working well for a pitch detector in cases of a missing first harmonic. – robert bristow-johnson Nov 12 '22 at 16:43
  • Pitch tracking is a good term. It's basically a state machine: once you have a rough idea what the pitch is (there are various ways of doing that), you switch to tracking mode with a local oscillator and a control loop that minimizes the phase error. – Hilmar Nov 13 '22 at 14:03

wouldn't the pitch just be the mean/mode frequency in a melspec?

Uhm, no?

Depending on the units used for pitch and where you define zero to be, pitch, as perceived as a musical parameter, is the logarithm of the fundamental frequency. The fundamental frequency is the reciprocal of the shortest period of the quasiperiodic signal that is the musical note.

For pitch measured in octaves, it's the base-2 logarithm.
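
To make the distinction from the question concrete, here is a small sketch with invented numbers: a note whose fundamental is 110 Hz but whose second harmonic is the loudest partial. The partial amplitudes are hypothetical, and "mean" and "mode" are my reading of what the question meant (amplitude-weighted average and strongest partial).

```python
import math

# Hypothetical partials of one note: (frequency in Hz, linear amplitude).
# The fundamental is 110 Hz; the amplitudes are made up for illustration.
partials = [(110.0, 0.3), (220.0, 1.0), (330.0, 0.6), (440.0, 0.4)]

f0 = 110.0  # the fundamental, i.e. what we perceive as the pitch

# "Mean frequency": amplitude-weighted average of the partial frequencies.
total_amp = sum(a for _, a in partials)
mean_freq = sum(f * a for f, a in partials) / total_amp

# "Mode frequency": the single strongest partial.
mode_freq = max(partials, key=lambda p: p[1])[0]

def octaves_above(f_hz, ref_hz):
    """Pitch distance in octaves (base-2 logarithm of the frequency ratio)."""
    return math.log2(f_hz / ref_hz)

print(f"fundamental: {f0:.1f} Hz")
print(f"mean: {mean_freq:.1f} Hz, {octaves_above(mean_freq, f0):.2f} octaves above f0")
print(f"mode: {mode_freq:.1f} Hz, {octaves_above(mode_freq, f0):.2f} octaves above f0")
```

With these amplitudes the strongest partial sits a full octave above the fundamental, and the weighted mean is higher still, so neither one gives you the pitch.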

For MIDI they define the unit of pitch to be the semitone, $\frac{1}{12}$ octave. And they define middle C as note #60. That makes A-440 note #69.

Then, for MIDI, pitch is

$$ 12 \log_2 \left( \frac{f_0}{440 \text{Hz}} \right) + 69 $$

$$ f_0 = \frac{1}{P} $$

And $P$ is the smallest period of $x(t)$ such that

$$ x(t+P) \approx x(t) $$
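
As a rough sketch of those definitions in code, here is one common way to look for $P$: pick the first strong peak of the autocorrelation. This is only an illustration and not necessarily the method anyone here would use in practice. The function names, the 0.9 tolerance, the lag range, and the test signal are all my own assumptions.

```python
import math

def midi_pitch(f0_hz):
    """MIDI pitch (fractional note number) for a fundamental frequency in Hz."""
    return 12.0 * math.log2(f0_hz / 440.0) + 69.0

def estimate_period(x, min_lag, max_lag, tolerance=0.9):
    """Smallest lag (in samples) at which x(t + lag) is close to x(t).

    Computes the autocorrelation for each lag in [min_lag, max_lag], finds the
    first lag within `tolerance` of the best value, then climbs to the top of
    that peak. min_lag must be large enough to skip the peak around lag zero.
    """
    n = len(x) - max_lag  # use the same number of terms for every lag
    corr = [sum(x[i] * x[i + lag] for i in range(n))
            for lag in range(min_lag, max_lag + 1)]
    threshold = tolerance * max(corr)
    k = next(i for i, c in enumerate(corr) if c >= threshold)
    while k + 1 < len(corr) and corr[k + 1] > corr[k]:
        k += 1
    return min_lag + k

# Assumed test signal: 200 Hz note with a weak fundamental and a strong
# second harmonic, sampled at 8 kHz for 0.2 s.
fs = 8000
x = [0.2 * math.sin(2 * math.pi * 200 * i / fs)
     + 1.0 * math.sin(2 * math.pi * 400 * i / fs)
     + 0.5 * math.sin(2 * math.pi * 600 * i / fs)
     for i in range(1600)]

period = estimate_period(x, min_lag=8, max_lag=200)  # search 40 Hz to 1 kHz
f0_est = fs / period
print(f"P = {period} samples, f0 = {f0_est:.1f} Hz, MIDI pitch = {midi_pitch(f0_est):.2f}")
```

The period search operates on the waveform itself, so it still finds the 200 Hz fundamental even though the 400 Hz partial carries most of the energy. The resolution here is limited to whole samples; a real detector would interpolate around the peak.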

robert bristow-johnson