I am analyzing a short .wav audio file of a drum and piano for source separation. (I am not using interpreted environments or libraries such as Librosa, Scikit, MATLAB, or R, but rather a compiled .NET language.)

When considering the hierarchical clustering approach taken in this report (see Fig. 4), I am wondering whether, within a frequency row of the mixed STFT, it would be better to cluster the phase-adjusted real and imaginary parts or the unadjusted values.

By phase-adjusted I mean (MATLAB):

phi = angle(STFT);
xhat = STFT.*exp(1i*phi);
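
As a sanity check on the expression above, here is a minimal sketch that applies it to a single coefficient. Python's standard `cmath` module is used only for illustration (the actual work is in .NET), the coefficient value is made up, and the opposite-sign variant is included purely for comparison, not because the report uses it.

```python
import cmath

# Hypothetical single STFT coefficient (value chosen only for illustration):
# magnitude 2, phase 30 degrees.
c = cmath.rect(2.0, cmath.pi / 6)

phi = cmath.phase(c)                  # plays the role of angle(STFT)
adjusted = c * cmath.exp(1j * phi)    # the expression from the question
removed = c * cmath.exp(-1j * phi)    # opposite sign, shown for comparison only

print(abs(adjusted), cmath.phase(adjusted))  # magnitude kept, phase doubled
print(abs(removed), cmath.phase(removed))    # magnitude kept, phase close to 0
```

With the `+` sign the magnitude is kept and the phase is doubled; with the `-` sign the phase is removed, so the coefficient lands on the positive real axis and only the magnitude is left.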

Also, once the coefficients are agglomerated within rows, would it be advantageous to use a Wiener mask or a binary mask? I am currently experimenting, but wanted to ask the community in case there are any caveats.

UPDATE

The goal is sound source separation, i.e., separating percussion from piano.

The type of clustering is hierarchical.

What is being clustered: the real and imaginary values in the same frequency row (bin) of a mixed STFT, obtained from DFTs of contiguous Hann-windowed segments of the audio signal.
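
To make the setup concrete, here is a minimal sketch of agglomerative clustering on the (real, imaginary) pairs of one frequency row. It is an illustration under assumptions, not the report's method: average linkage, Euclidean distance, a target of two clusters, and the six coefficient values are all made up, and Python is used only because it is compact (the real implementation is in .NET).

```python
import math

def average_linkage(points, n_clusters):
    """Naive agglomerative clustering; returns clusters as lists of point indices."""
    clusters = [[i] for i in range(len(points))]

    def dist(a, b):
        # Average pairwise Euclidean distance between two clusters.
        total = sum(math.dist(points[i], points[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > n_clusters:
        best = None
        for p in range(len(clusters)):
            for q in range(p + 1, len(clusters)):
                d = dist(clusters[p], clusters[q])
                if best is None or d < best[0]:
                    best = (d, p, q)
        _, p, q = best
        clusters[p] = clusters[p] + clusters[q]  # merge the closest pair
        del clusters[q]
    return clusters

# Hypothetical coefficients of ONE frequency row across six time blocks.
row = [1.0 + 0.1j, 0.9 + 0.2j, 1.1 + 0.0j, -2.0 + 3.0j, -2.1 + 2.8j, -1.9 + 3.1j]
features = [(c.real, c.imag) for c in row]  # unadjusted real/imaginary pairs

clusters = average_linkage(features, n_clusters=2)
labels = [0] * len(row)
for k, members in enumerate(clusters):
    for t in members:
        labels[t] = k
print(labels)  # one cluster label per time block
```

Clustering the phase-adjusted values instead would only change how `features` is built; the clustering step itself stays the same.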

An example of a Wiener mask involves representing an $F \times T$ STFT matrix, $\mathbf{V}$, with a basis matrix and a gain matrix, i.e. $\underset{F \times T}{\mathbf{V}} \approx \underset{F \times K}{\mathbf{W}} \underset{K \times T}{\mathbf{H}}$, derived using NMF (non-negative matrix factorization), an unsupervised clustering method for which the assumed number of clusters is $K$. The $K$ columns of $\mathbf{W}$ represent $K$ clusters of frequencies, and the $K$ rows of $\mathbf{H}$ represent the respective $K$ clusters of activations (over time). Assuming $K$ total clusters, the modified STFT (MSTFT) for the $k$th cluster, $\mathbf{M}_k$, based on a Wiener mask has elements

$m_{ftk} = \frac{\hat{v}_{ftk}}{\sum_l \hat{v}_{ftl}}$,

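where $\hat{v}_{ftk}$ is an element of the estimated STFT for cluster $k$, based on the outer-product matrix $\mathbf{\hat{V}}_k = \mathbf{w}_k \mathbf{h}_k$, with $\mathbf{w}_k$ the $k$th column of $\mathbf{W}$ and $\mathbf{h}_k$ the $k$th row of $\mathbf{H}$.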
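
The two mask types can be compared with a minimal sketch. The factor values below are hypothetical, the small `EPS` guard in the denominator is my own addition, and the binary mask is taken here to mean "the component with the largest estimate wins the whole bin", which is one common convention rather than the only one. Python is used only for illustration.

```python
# Hypothetical NMF factors: F = 2 frequency bins, T = 3 frames, K = 2 components.
W = [[1.0, 0.5],
     [0.2, 2.0]]            # F x K, columns are the w_k
H = [[3.0, 0.0, 1.0],
     [1.0, 2.0, 1.0]]       # K x T, rows are the h_k
F, K, T = len(W), len(H), len(H[0])
EPS = 1e-12  # assumption: guards against division by zero in silent bins

# v_hat[k][f][t] = w_fk * h_kt, i.e. the outer product of column k of W and row k of H.
v_hat = [[[W[f][k] * H[k][t] for t in range(T)] for f in range(F)]
         for k in range(K)]

def wiener_masks(v_hat):
    """Soft masks: m_ftk = v_hat_ftk / sum_l v_hat_ftl."""
    return [[[v_hat[k][f][t] / (sum(v_hat[l][f][t] for l in range(K)) + EPS)
              for t in range(T)] for f in range(F)] for k in range(K)]

def binary_masks(v_hat):
    """Hard masks: 1 for the component with the largest estimate in a bin, else 0."""
    masks = [[[0.0] * T for _ in range(F)] for _ in range(K)]
    for f in range(F):
        for t in range(T):
            winner = max(range(K), key=lambda k: v_hat[k][f][t])
            masks[winner][f][t] = 1.0
    return masks

soft = wiener_masks(v_hat)
hard = binary_masks(v_hat)
```

In both cases the masks for a given bin sum to one across components; the soft mask shares the bin's energy between components, while the binary mask assigns it entirely to one of them.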

  • i don't know what a Weiner or binary mask is. what do you mean by "clustering"? do you mean grouping together adjacent DFT bins to be associated with a single sinusoidal component? – robert bristow-johnson Dec 01 '19 at 20:00
  • No, not collapsing anything (..."together"). An STFT matrix consists of columns representing time blocks, and rows representing bin-specific DFT frequencies for a short signal span in a given time block (i.e., a DFT is run for each signal in each time block). Within, say, the 1000-1040 Hz frequency bin (row), I would be identifying similar vectors (real, imaginary) from many time blocks. By clustering I meant clusters whose time blocks have the most similar real/imaginary values. –  Dec 01 '19 at 20:25
  • i understand what the STFT is. but, even after your clarification, i don't know what you mean by "clustering". nor what a "Weiner or binary mask" is, particularly regarding the STFT. – robert bristow-johnson Dec 02 '19 at 02:48
  • same here! JoleT, I think you're assuming that we know exactly what you're planning to do, but honestly, I've no clue (and rbj probably only slightly more). Could you edit your question to explain what you mean by clustering (what exactly do you want to cluster? Into which clusters? For what purpose? Using what?), and also define (or at least link to a definition of) Weiner mask and binary mask? – Marcus Müller Dec 02 '19 at 07:00
  • See the updated OP –  Dec 03 '19 at 18:53
