I've been looking for info on this topic for a while and I came across several algorithms that may be suitable for this purpose.

Specifically, I'm interested in getting a frequency representation like this in real time:

[Figure: frequency spectrum from MELODIA (I think), from www.justinsalamon.com]

from which I can extract multiple pitches (chords), if any. But the frequency must have an exponential resolution, since that's how notes are perceived.

I've read about wavelets and tried the Morlet wavelet, but didn't get good results (poor accuracy). I also read about the Constant-Q Transform, and recently came across YIN, pYIN and MELODIA. I'm currently struggling with technical issues getting MELODIA to work.

It takes a while to test each of these algorithms, so I'm asking: are any of the algorithms I mentioned a no-go for this (too slow, outdated, poor performance, or superseded by better options)? Are there any other relevant algorithms that I may be missing?

Thanks!

Isaac
  • Every major CWT implementation is flawed in some way, sometimes tremendously. It's worth giving it another shot with ssqueezepy, which also has flaws at very low frequencies; those will be addressed eventually, and it will be made faster. – OverLordGoldDragon Mar 24 '23 at 15:10
  • Though if it's to be paired with ML, that's a separate issue that's solvable, and a standard CWT is likely to perform poorly per excess temporal redundancy (hop_size=1). – OverLordGoldDragon Mar 24 '23 at 15:14
  • "But the frequency must have an exponential resolution, since that's how notes are perceived." This is rather misguided. Perceived pitch distance is (approximately) logarithmic, but that doesn't mean that a pitch detection algorithm must work in a transform domain with translation invariance in logarithmic frequency. In fact, you almost certainly don't want that. – Jazzmaniac Mar 24 '23 at 17:39
  • I'll take a look at your post @OverLordGoldDragon, looks promising, thanks! – Isaac Mar 25 '23 at 09:43
  • Why do you say that, @Jazzmaniac? If I have the same bin width everywhere, as in the case of the FFT, higher notes would have a bigger error than lower notes. What am I missing here? – Isaac Mar 25 '23 at 09:46
  • @Isaac, the human auditory system is optimised for this task, and it doesn't have a just noticeable difference of pitch sensation that is proportional to the frequency. We perceive pitch relationships as logarithmic, because harmonic relationships induce such a scale. But the pitch resolution is independent. If you tried to have a logarithmic resolution, you would run into all sorts of problems. Low frequencies would allocate too little bandwidth and synchronisation would force the latency through the roof. Be smart, learn from your brain. – Jazzmaniac Mar 25 '23 at 16:09

2 Answers

ColorChord implements a real-time DFT whose bins are spaced logarithmically in quarter-tones of the chromatic scale, i.e. adjacent bins differ by a factor of $\sqrt[24]{2}$. It's super useful for music applications and O(n) fast. The theory behind it is also accessible [1] [2]. Shout-out to its creators, Will Murnane and Charles Lohr!

Essentially, each quarter-tone step multiplier $2^{n/24}$, $n = 0 \ldots 23$, is combined with a configurable base frequency to set the centre frequency of each bin.
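As a rough illustration of the bin layout only, here is a minimal sketch that places bins at quarter-tone steps above a base frequency and correlates the signal with a complex sinusoid at each one. This is a naive direct evaluation, not ColorChord's actual algorithm; the function names, the base frequency of 55 Hz, the sample rate and the frame length are all assumptions chosen for the example.

```python
import numpy as np

def quarter_tone_bins(base_freq=55.0, octaves=5, bins_per_octave=24):
    """Centre frequencies spaced by a factor of 2**(1/24), starting at base_freq."""
    n = np.arange(octaves * bins_per_octave)
    return base_freq * 2.0 ** (n / bins_per_octave)

def log_spaced_dft(x, sample_rate, freqs):
    """Magnitude of the correlation of x with a complex sinusoid at each frequency.

    Naive O(len(x) * len(freqs)) evaluation with an implicit rectangular window.
    """
    t = np.arange(len(x)) / sample_rate
    basis = np.exp(-2j * np.pi * np.outer(freqs, t))
    return np.abs(basis @ x) / len(x)

# Hypothetical test signal: a 440 Hz sine, which falls exactly on bin 72 (55 * 2**3).
sample_rate = 16000
freqs = quarter_tone_bins()
t = np.arange(4096) / sample_rate
x = np.sin(2 * np.pi * 440.0 * t)

mags = log_spaced_dft(x, sample_rate, freqs)
peak_freq = freqs[np.argmax(mags)]
print(f"strongest bin: {peak_freq:.2f} Hz")
```

Every bin here uses the same frame length, so the low bins are resolved less well (in cents) than the high ones; a practical implementation would have to deal with that trade-off.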

References

ColorChord DFT Theory

[1] https://github.com/cnlohr/colorchord/blob/master/docs/TheoryOfCCDFT.md

[2] https://www.colorchord.net/blog/colorchords-music-optimized-dft-algorithm/

bazz
  • I read the second reference you gave. There are some serious problems, including neighboring bin leakage. – robert bristow-johnson Mar 25 '23 at 20:06
  • @robertbristow-johnson To address bin leakage a simple proportion-based peak bin interpolation is included. I have resolved this dft's spectral leakage down to the ballpark of half a cent using certain peak bin interpolation techniques. Whether or not that precision is satisfactory is project dependent. I invite you to elaborate on your other observations in the DSP chat room. https://chat.stackexchange.com/rooms/1090/post-processing – bazz Mar 26 '23 at 05:35
  • But those summations in the mathematical description are a finite sum, which is essentially applying the rectangular window on your data. The worst kind of window. Just the use of the rectangular window on the time-domain data will cause spectral leakage. – robert bristow-johnson Mar 26 '23 at 06:14
  • Do you suppose this is an opportunity to improve the algorithm without sacrificing performance? The CC DFT has an embedded implementation, so runtime speed is important. – bazz Mar 26 '23 at 15:04
  • Perhaps. We could get rid of the DFT and have 24 staggered band-pass filters per octave, equally spaced in log frequency. The output of each band-pass filter would be squared and low-pass filtered (to make the envelope smoother). Then it could be passed through a $\log(\cdot)$ function, and you would have dB envelopes on each quarter-tone bin. It might be kind of expensive. Perhaps a sliding Goertzel algorithm whose input is shaped by a sliding Hann window. – robert bristow-johnson Mar 27 '23 at 02:29
  • it's funny you mention Goertzel, I saw a recent tweet between Charles and LixieLabs who seems to have implemented Goertzel with inspiration from the CC dft. https://twitter.com/cnlohr/status/1614826371808722944?s=20 – bazz Mar 27 '23 at 05:21
  • I understand that this spectral leakage that you are mentioning comes from the multiplications of pure sine and cosine functions, right? Would it make sense to apply a Hann (or any other type of window) to these sine and cosine functions and then, instead of chopping them on each lower scale, just downscale them? – Isaac Apr 02 '23 at 18:35
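The Hann-windowed Goertzel idea raised in the comments above can be sketched as follows. This is a simplified per-frame version, not a true sliding implementation and not code from ColorChord or from any commenter; the function names, the 220 Hz base frequency, the two-octave range and the frame length are assumptions made for the example.

```python
import numpy as np

def goertzel_power(frame, sample_rate, freq):
    """Squared DFT magnitude of `frame` at one arbitrary frequency (Goertzel recursion)."""
    coeff = 2.0 * np.cos(2.0 * np.pi * freq / sample_rate)
    s1 = s2 = 0.0
    for sample in frame:
        s0 = sample + coeff * s1 - s2
        s2, s1 = s1, s0
    return s1 * s1 + s2 * s2 - coeff * s1 * s2

def quarter_tone_levels_db(frame, sample_rate, freqs):
    """Apply a Hann window to the frame, then return the level in dB at each frequency."""
    windowed = frame * np.hanning(len(frame))
    power = np.array([goertzel_power(windowed, sample_rate, f) for f in freqs])
    return 10.0 * np.log10(power + 1e-12)  # small offset avoids log(0)

# Hypothetical setup: 48 quarter-tone bins over two octaves starting at 220 Hz.
sample_rate = 16000
freqs = 220.0 * 2.0 ** (np.arange(48) / 24)
t = np.arange(4096) / sample_rate
frame = np.sin(2 * np.pi * 440.0 * t)

levels_db = quarter_tone_levels_db(frame, sample_rate, freqs)
peak_freq = freqs[np.argmax(levels_db)]
print(f"strongest bin: {peak_freq:.2f} Hz")
```

The Hann window trades a wider main lobe for much lower sidelobes than the rectangular window, which is the leakage concern discussed in the comments.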

Most instruments have a timbre unlike that of a sine: typically a complex waveform whose partials are more or less harmonically related. This suggests that something like cepstrum analysis would be better suited to the task.

But when doing polyphonic single-instrument analysis, some sine-like tone might be one harmonic of one pitch or another harmonic of another pitch. Careful analysis of onset times might reduce the ambiguity but probably not 100%.

An interesting case in point: the Hammond organ contains 96 sine-like mechanical sound generators (additive synthesis). Each key press will mix in up to 9 of those sines. Interestingly, this means that whenever one or more pressed keys include e.g. a 440 Hz harmonic, it will come from a single source (and at the same phase). That is, the phase relationship of the partials in a single tone is somewhat arbitrary (or at least not locked, as it typically would be in a sampled waveform or wavetable generator).
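For the simpler monophonic case, the cepstrum suggestion above can be sketched roughly as follows. The function name, the 50 to 500 Hz search range and the synthetic test tone are assumptions for illustration; this picks a single fundamental and does not resolve the polyphonic ambiguity described above.

```python
import numpy as np

def cepstrum_pitch(x, sample_rate, fmin=50.0, fmax=500.0):
    """Estimate one fundamental frequency from the peak of the real cepstrum."""
    windowed = x * np.hanning(len(x))
    log_mag = np.log(np.abs(np.fft.rfft(windowed)) + 1e-12)  # offset avoids log(0)
    cepstrum = np.fft.irfft(log_mag)
    # Search only quefrencies (lags in samples) matching the allowed pitch range.
    q_lo = int(sample_rate / fmax)
    q_hi = int(sample_rate / fmin)
    peak_q = q_lo + np.argmax(cepstrum[q_lo:q_hi + 1])
    return sample_rate / peak_q

# Hypothetical test tone: 200 Hz fundamental with harmonics falling off as 1/k.
sample_rate = 16000
t = np.arange(4096) / sample_rate
tone = sum(np.sin(2 * np.pi * 200.0 * k * t) / k for k in range(1, 40))

f0 = cepstrum_pitch(tone, sample_rate)
print(f"estimated fundamental: {f0:.1f} Hz")
```

The estimate is quantised to whole-sample lags, so the resolution gets coarser at higher pitches unless the peak is interpolated.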

Knut Inge