
When I compute a spectrogram of (say) a piece of music, there is a lot of frequency "smearing." Often we can reasonably expect that the "true" generating process is much sparser in frequency (i.e. maybe 10-20 active frequencies instead of hundreds).

We also know that plain old sine waves become smeared in spectrograms.

So, can't we model the corrupting "frequency leakage" at a specific timestep as a linear equation $$ Ax = b $$ where

  • $b$ is the known vector representing the STFT-computed spectral densities at the various frequencies
  • $x$ is the unknown, sparse vector of generating frequencies
  • $A$ is the (computable) "smearing matrix" which maps power from sparse generating frequencies to the "smeared" set of frequencies which the STFT produces (we can just compute this by taking the STFT of various sine wave snippets)
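As a rough sketch of how such an $A$ might be computed for a single frame, the NumPy snippet below takes the FFT of windowed sinusoid snippets, one per candidate frequency. It makes two assumptions that are not in the question: it works with complex STFT coefficients rather than power (the STFT is linear in the signal, whereas magnitudes and powers are not, so $Ax = b$ only holds exactly for the complex values), and it uses complex exponentials on a hypothetical candidate grid four times finer than the FFT bin spacing. The names `smearing_matrix`, `candidate_freqs` and `k_true` are purely illustrative.

```python
import numpy as np

def smearing_matrix(n_fft, candidate_freqs, window):
    """Column k is the FFT of one windowed complex sinusoid at candidate_freqs[k].

    Frequencies are in cycles per sample; the result is complex with shape
    (n_fft, len(candidate_freqs)).
    """
    t = np.arange(n_fft)
    atoms = np.exp(2j * np.pi * np.outer(t, candidate_freqs))
    return np.fft.fft(window[:, None] * atoms, axis=0)

n_fft = 256
window = np.hanning(n_fft)
# Hypothetical candidate grid: 4x finer than the FFT bin spacing, up to Nyquist.
candidate_freqs = np.arange(2 * n_fft) / (4 * n_fft)
A = smearing_matrix(n_fft, candidate_freqs, window)

# One frame containing a single tone that falls between two FFT bins.
k_true = 41  # 41/1024 cycles per sample, i.e. FFT bin 10.25
frame = np.exp(2j * np.pi * candidate_freqs[k_true] * np.arange(n_fft))
b = np.fft.fft(window * frame)

# The sparse "generating" vector: a single active frequency.
x = np.zeros(len(candidate_freqs))
x[k_true] = 1.0

# Check that A x reproduces the smeared spectrum of the frame.
print(np.allclose(A @ x, b))
# Count the FFT bins that carry more than 1% of the peak magnitude.
print(np.sum(np.abs(b) > 0.01 * np.abs(b).max()))
```

For real-valued audio, each candidate frequency would need both a cosine and a sine column (or a conjugate pair of complex columns), since the phase of each tone is unknown.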

And naturally one could extend this to explicitly encourage sparsity, e.g. $$ \min_{x} \|Ax - b\| + \lambda \|x\|_1 $$
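A minimal sketch of solving a problem of this kind is given below. It is an assumption-laden illustration rather than a recommended method: it uses the squared-residual (LASSO) variant $\tfrac{1}{2}\|Ax - b\|_2^2 + \lambda \|x\|_1$ instead of the unsquared norm written above, it solves it with plain ISTA (iterative soft-thresholding, one of several possible solvers), and it picks $\lambda$ as a fraction of $\|A^H b\|_\infty$, which is the smallest value for which $x = 0$ is optimal. The dictionary, the two-tone test frame and all function names are hypothetical.

```python
import numpy as np

def soft_threshold(z, tau):
    """Complex soft-thresholding: shrink each magnitude by tau, keep the phase."""
    mag = np.abs(z)
    return z * np.maximum(1.0 - tau / np.maximum(mag, 1e-12), 0.0)

def objective(A, b, x, lam):
    """0.5 * ||A x - b||_2^2 + lam * ||x||_1"""
    r = A @ x - b
    return 0.5 * np.real(np.vdot(r, r)) + lam * np.sum(np.abs(x))

def ista(A, b, lam, n_iter=300):
    """Proximal gradient descent with fixed step 1 / ||A||_2^2."""
    step = 1.0 / np.linalg.norm(A, 2) ** 2
    x = np.zeros(A.shape[1], dtype=complex)
    for _ in range(n_iter):
        grad = A.conj().T @ (A @ x - b)
        x = soft_threshold(x - step * grad, lam * step)
    return x

# Hypothetical dictionary: FFTs of Hann-windowed complex sinusoids on a
# frequency grid 4x finer than the FFT bin spacing.
n = 256
t = np.arange(n)
window = np.hanning(n)
freqs = np.arange(2 * n) / (4 * n)
A = np.fft.fft(window[:, None] * np.exp(2j * np.pi * np.outer(t, freqs)), axis=0)

# Synthetic frame made of two tones (assumed test data, no noise).
x_true = np.zeros(len(freqs), dtype=complex)
x_true[[41, 120]] = [1.0, 0.5]
b = A @ x_true

lam = 0.05 * np.max(np.abs(A.conj().T @ b))
x_hat = ista(A, b, lam)

# Objective at the all-zero start versus after the iterations.
print(objective(A, b, np.zeros_like(x_hat), lam), objective(A, b, x_hat, lam))
# Indices of the five largest recovered coefficients.
print(np.argsort(np.abs(x_hat))[-5:])
```

Because neighbouring columns of an oversampled dictionary are highly correlated, the recovered energy may spread over a few adjacent grid points rather than land on a single index, and ISTA can converge slowly on such problems.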

I have not seen this approach discussed, but I am very new to DSP. Is this a known technique, or is there some reason why it would not work in "sharpening" a spectrogram?

  • FFT spectral leakage shouldn't be conceptually confused with STFT boundary effects, though they're sometimes related. Synchrosqueezing may be of interest. – OverLordGoldDragon Jun 28 '22 at 17:47
  • @OverLordGoldDragon thank you for the pointer, this is a beautiful exposition (have only read a bit of it thus far). You are right that I am conflating FFT leakage and boundary effects, will have to look into this more. I had come across reassignment methods but this seems a clearer intro than others I've seen. – Michael Kayser Jun 28 '22 at 19:34
  • @MichaelKayser The assumption that music is spectrally "sparse" is highly questionable. Can you clarify whether you want to work with actual music, or whether you know for sure that you just have a sum of sine waves? – Hilmar Jun 28 '22 at 21:03
  • @Hilmar fair point. I've been looking at simple stuff like solo instruments (e.g. a short piano piece or cello suite). Even this may have weird parts e.g. when notes are being struck, but for now I'm seeing how much mileage I can get with a sparsity assumption which I expect does explain a fair percentage of such recordings. Very open to feedback based on your experience though. – Michael Kayser Jun 28 '22 at 23:54
  • I'm doubtful. IMO the sparsity assumption will discard a lot of important details. This may be ok if you are ONLY interested in pitch detection, but if you want to maintain the difference between, say, a clarinet and a violin, this is unlikely to work. – Hilmar Jun 30 '22 at 17:15
  • @Hilmar thanks for your thoughts -- just out of curiosity, what aspects of these sounds (for example, these two instruments) are you thinking would get lost? For example, I was under the impression that macro-characteristics like "clarinet versus violin" would be primarily represented in the relative presence and decay of all the different harmonics of a particular note. On the other hand, there are sounds that I could imagine getting lost in a sinusoid-based analysis, e.g. clicks, fingers on strings, the "breathy" sound of some instruments. Is the latter what you mean? – Michael Kayser Jun 30 '22 at 19:18
  • Most instruments have a combination of tonal and atonal (noise-like or impulsive) components. Atonal components can be breathing, bowing or fretting noise, anything percussive, attacks, etc. The same bass guitar sounds very different depending on whether you play it with a pick, with your fingers, or slap it. Some of the things like decay rates, vibrato, tremolo, glides, bends, etc. will result in a broadening of the spectral lines. These broadenings are based on the actual signal, not the analysis method, so they are actually meaningful. It's not always easy to distinguish the two. – Hilmar Jul 01 '22 at 18:17
  • @Hilmar Thanks for your feedback, this makes a lot of sense. I can definitely see how something like a glissando or tremolo, coupled with a bit of ambient reverb, could lead to an intrinsically spread out spectrum. And the "noisy" or impulsive sounds are definitely complex. At the end of the day I think the reassignment methods are more principled for distinguishing these sorts of signal variations. Do you have any suggested pointers (books or papers) which go into more detail on instrument sound characteristics? Thanks again. – Michael Kayser Jul 02 '22 at 04:07

1 Answer


This looks sort of like the standard approach to the maximum likelihood estimator for tones.

Kay (1) starts with the signal $$ x[t] = A \cos(\omega_0 t + \phi) + \epsilon[t] $$ and assumes $\epsilon[t]$ is an independent, identically distributed Gaussian noise source with variance $\sigma_\epsilon^2$.

The likelihood function for this is: \begin{align} \newcommand{\bx}{\mathbf{x}} p(\bx ; \mathbf{\theta}) = \frac{1}{(2\pi \sigma_\epsilon^2)^{T/2} } \exp\Big( -\frac{1}{2\sigma_\epsilon^2} \sum_{t=0}^{T-1} \big(x[t] - \tilde{A} \cos(\tilde{\omega}_0(t-\nu) + \tilde{\phi}) \big)^2 \Big) \end{align} where our parameters of interest are $\mathbf{\theta} = \Big [ \tilde{A}, \tilde{\omega}_0, \tilde{\phi} \Big ]^T$ and $\nu = (T-1)/2$ centers the time index, so that $t - \nu$ runs from $-\frac{T-1}{2}$ to $\frac{T-1}{2}$.

Taking the negative log-likelihood and writing it in vector form gives: $$\newcommand{\bc}{\mathbf{c}} \newcommand{\bs}{\mathbf{s}} \newcommand{\bH}{\mathbf{H}} \begin{align} L(\theta') &= \frac{T}{2}\log_e(2\pi \sigma_\epsilon^2) + \frac{1}{2\sigma_\epsilon^2} (\bx -\tilde{\alpha}_c \bc - \tilde{\alpha}_s \bs )^T (\bx -\tilde{\alpha}_c \bc - \tilde{\alpha}_s \bs )\\ &= \frac{T}{2}\log_e(2\pi \sigma_\epsilon^2) + \frac{1}{2\sigma_\epsilon^2} (\bx - \bH\underline{\alpha} )^T (\bx - \bH \underline{\alpha} ) \end{align}$$ where \begin{align} \bx &= \left [ x[0], x[1], \ldots, x[T-1] \right]^T\\ \bc &= \left [ \cos\left(-\tilde{\omega}_0 \frac{T-1}{2}\right), \ldots , \cos\left(-\frac{\tilde{\omega}_0}{2}\right), \cos\left(\frac{\tilde{\omega}_0}{2}\right), \ldots, \cos\left(\tilde{\omega}_0 \frac{T-1}{2}\right) \right]^T\\ \bs &= \left [ \sin\left(-\tilde{\omega}_0 \frac{T-1}{2}\right), \ldots , \sin\left(-\frac{\tilde{\omega}_0}{2}\right), \sin\left(\frac{\tilde{\omega}_0}{2}\right), \ldots, \sin\left(\tilde{\omega}_0 \frac{T-1}{2}\right) \right]^T\\ \bH &= \left [ \bc \ \bs \right ]\\ \underline{\alpha} &= \left[ \tilde{\alpha}_c\ \tilde{\alpha}_s\right ]^T\\ \tilde{\alpha}_c &= \tilde{A} \cos \tilde{\phi}\\ \tilde{\alpha}_s &= -\tilde{A} \sin \tilde{\phi} \end{align} assuming $T$ is even.

Like I said, sort of. The $\underline{\alpha}$ here is analogous to your $x$, but $\underline{\alpha}$ is in no way sparse.
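To make the structure concrete, here is a small NumPy sketch of how the vector form can be used. For a fixed candidate $\tilde{\omega}_0$, minimizing $(\mathbf{x} - \mathbf{H}\underline{\alpha})^T(\mathbf{x} - \mathbf{H}\underline{\alpha})$ over $\underline{\alpha}$ is ordinary linear least squares, and the frequency can then be chosen as the candidate with the smallest residual. The grid search over `omegas`, the function names `fit_tone` and `estimate_tone`, and the synthetic test signal are assumptions made for illustration; they are not taken from Kay's text.

```python
import numpy as np

def fit_tone(x, omega):
    """Least-squares fit of alpha = [alpha_c, alpha_s] for one candidate omega."""
    T = len(x)
    t = np.arange(T) - (T - 1) / 2.0           # centered time index
    H = np.column_stack([np.cos(omega * t), np.sin(omega * t)])
    alpha, *_ = np.linalg.lstsq(H, x, rcond=None)
    resid = x - H @ alpha
    return alpha, float(resid @ resid)

def estimate_tone(x, omegas):
    """Pick the candidate frequency with the smallest squared residual."""
    fits = [fit_tone(x, w) for w in omegas]
    best = int(np.argmin([r for _, r in fits]))
    alpha_c, alpha_s = fits[best][0]
    amp = np.hypot(alpha_c, alpha_s)
    # alpha_c = A cos(phi), alpha_s = -A sin(phi)  =>  phi = atan2(-alpha_s, alpha_c)
    phase = np.arctan2(-alpha_s, alpha_c)
    return omegas[best], amp, phase

# Synthetic test signal (assumed values): one tone plus white Gaussian noise.
rng = np.random.default_rng(0)
T = 64
t = np.arange(T) - (T - 1) / 2.0
x = 1.5 * np.cos(0.7 * t + 0.3) + 0.05 * rng.standard_normal(T)

omegas = np.linspace(0.1, 3.0, 291)            # candidate grid, step 0.01 rad/sample
omega_hat, amp_hat, phase_hat = estimate_tone(x, omegas)
print(omega_hat, amp_hat, phase_hat)
```

Note that the phase returned here is measured relative to the centered time index, matching the definition of $\mathbf{c}$ and $\mathbf{s}$ above.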

Interesting question. I'll think about it some more.

(1) S. M. Kay, Fundamentals of Statistical Signal Processing: Estimation Theory. Prentice Hall, 1997.

Peter K.