I am analyzing a short .wav audio file of drums and piano for source separation. (I am not using interpreted tools such as Librosa, Scikit, MATLAB, or R, but rather a compiled .NET language.)
Considering the hierarchical clustering approach taken in this report (see Fig. 4), I am wondering whether, within a frequency row of the mixed STFT, it would be better to cluster the phase-adjusted real and imaginary parts or the unadjusted values.
By phase-adjusted I mean (MATLAB):
phi = angle(STFT)
xhat = STFT.*exp(1i*phi);
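To make the effect of that expression concrete, here is a minimal sketch for a single coefficient, written in plain Python as a stand-in for the MATLAB line. The helper `phase_adjust` and its `sign` parameter are hypothetical names introduced only for this illustration. Multiplying by `exp(1i*phi)` as written rotates each coefficient by its own phase, so the phase is doubled, whereas the opposite sign would cancel the phase and leave only the magnitude.

```python
import cmath


def phase_adjust(z, sign=+1):
    """Return z * exp(sign * 1j * angle(z)) for one complex coefficient."""
    phi = cmath.phase(z)  # same role as MATLAB's angle()
    return z * cmath.exp(sign * 1j * phi)


z = 3 + 4j                      # magnitude 5
doubled = phase_adjust(z, +1)   # as written above: |z| * exp(2j * phi)
removed = phase_adjust(z, -1)   # opposite sign: |z| * exp(0) = |z|

print(doubled)
print(removed)
```

If the intent is to remove the phase before clustering, the `sign=-1` variant is the one that does so, but it also makes the imaginary part zero (up to rounding), so only the magnitudes would be left to cluster.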
Also, once the coefficients are agglomerated within rows, would it be advantageous to use a Wiener mask or a binary mask? I am currently experimenting, but wanted to ask the community in case there are any caveats.
UPDATE
The goal is sound source separation, i.e., separating percussion from piano.
The type of clustering is hierarchical.
What is being clustered: the real and imaginary values in the same frequency row (bin) of a mixed STFT, obtained from the DFTs of contiguous Hann-windowed segments of the audio signal.
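For reference, the kind of STFT described here can be sketched as follows, assuming a periodic Hann window and a naive DFT; this is an illustration, not the actual .NET code. The function names `hann`, `dft`, and `stft` and the parameters `win_len` and `hop` are hypothetical. Setting `hop` equal to `win_len` gives contiguous, non-overlapping segments, while a smaller `hop` gives overlapping ones.

```python
import cmath
import math


def hann(n):
    """Periodic Hann window of length n."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(n)]


def dft(frame):
    """Naive O(n^2) DFT of one frame."""
    n = len(frame)
    return [
        sum(frame[t] * cmath.exp(-2j * math.pi * f * t / n) for t in range(n))
        for f in range(n)
    ]


def stft(signal, win_len, hop):
    """Return the STFT as a list of rows: one row per frequency bin,
    one column per time frame (non-negative frequencies only)."""
    window = hann(win_len)
    columns = []
    for start in range(0, len(signal) - win_len + 1, hop):
        frame = [signal[start + i] * window[i] for i in range(win_len)]
        columns.append(dft(frame)[: win_len // 2 + 1])
    # Transpose so that each row is one frequency bin across time.
    return [list(row) for row in zip(*columns)]
```

With this layout, `stft(x, win_len, hop)[f]` is the row of complex coefficients for bin `f`, which is the set of values being clustered.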
An example of a Wiener mask involves representing an $F \times T$ STFT matrix, $\mathbf{V}$, with a basis matrix and a gain matrix, i.e. $\underset{F \times T}{\mathbf{V}} \approx \underset{F \times K}{\mathbf{W}} \underset{K \times T}{\mathbf{H}}$, derived using NMF (non-negative matrix factorization), an unsupervised clustering method for which the assumed number of clusters is $K$. The $K$ columns of $\mathbf{W}$ represent $K$ clusters of frequencies, and the $K$ rows of $\mathbf{H}$ represent the respective $K$ clusters of activations (over time). Assuming $K$ total clusters, the modified STFT (MSTFT) for the $k$th cluster, $\mathbf{M}_k$, based on a Wiener mask has elements
$m_{ftk} = \frac{\hat{v}_{ftk}}{\sum_l \hat{v}_{ftl}}$,
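where $\hat{v}_{ftk}$ is an element of the estimated STFT for cluster $k$, based on the outer-product matrix $\mathbf{\hat{V}}_k = \mathbf{w}_k \mathbf{h}_k$, with $\mathbf{w}_k$ the $k$th column of $\mathbf{W}$ and $\mathbf{h}_k$ the $k$th row of $\mathbf{H}$.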
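The mask formula above, and the binary alternative, can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions: $\mathbf{W}$ and $\mathbf{H}$ are given as non-negative nested lists, the names `outer`, `wiener_masks`, and `binary_masks` are hypothetical, and the small `eps` is an added guard against division by zero that is not part of the formula.

```python
def outer(w_k, h_k):
    """Outer product of a column w_k (length F) and a row h_k (length T)."""
    return [[wf * ht for ht in h_k] for wf in w_k]


def wiener_masks(W, H, eps=1e-12):
    """W is F x K, H is K x T. Returns K soft masks, each F x T, with
    m[k][f][t] = v_hat[k][f][t] / sum_l v_hat[l][f][t]."""
    F, K, T = len(W), len(H), len(H[0])
    v_hat = [outer([W[f][k] for f in range(F)], H[k]) for k in range(K)]
    return [
        [
            [
                v_hat[k][f][t] / (sum(v_hat[l][f][t] for l in range(K)) + eps)
                for t in range(T)
            ]
            for f in range(F)
        ]
        for k in range(K)
    ]


def binary_masks(masks):
    """Hard version: each (f, t) cell goes entirely to the cluster
    with the largest soft-mask value."""
    K, F, T = len(masks), len(masks[0]), len(masks[0][0])
    hard = [[[0.0] * T for _ in range(F)] for _ in range(K)]
    for f in range(F):
        for t in range(T):
            best = max(range(K), key=lambda k: masks[k][f][t])
            hard[best][f][t] = 1.0
    return hard
```

In either case the estimate for cluster $k$ would typically be obtained by multiplying the mask element-wise with the mixed complex STFT before inverting. The soft masks sum to one across clusters in every cell, so the estimates add back up to the mixture, while the binary masks assign each cell to exactly one cluster.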