
I have a collection of signals (IQ wav) split into ~2 s samples at a sampling rate of 2 MHz, and I collect the STFT information from these samples with the following code:

    # The following runs in a for loop over the directories which hold the samples
    fs, x = scipy.io.wavfile.read(f'../category/signal_sample_{i}.wav')
    # Once the recording is in memory, we normalise it to +1/-1
    x = x / np.max(np.abs(x))
    # We convert to mono by averaging the left and right channels
    x = np.mean(x, axis=1)
    x = np.asarray(x, dtype=np.float32)
    # 10 ms Hann window length (nperseg), in samples
    M = int(fs * 0.001 * 10)
    N = x.shape[0]  # number of samples
    L = N / fs      # audio length in seconds
    f, t, Zxx = signal.stft(x, fs=fs, window='hann', nperseg=M)

From what I understand, the STFT info is found in Zxx, which in my case typically has shape (10001, 401). Unfortunately, while a subset of my sample set for each category can be held in memory, the collection as a whole is too big for this!

I've looked into using CVNNs etc. for classifying the complex ndarrays (Zxx), which is fine, but I'm still struggling to figure out the approach to take for training (and ultimately setting some aside for testing/validation).

rshah

1 Answer


Feature extraction under compute constraints means trading feature quality for transform speed and size. An advantage of the STFT is that this trade is easy to make.

Three parameters are crucial: hop_size (stride), n_fft, and the choice of window. When optimizing for compute, the window is closely tied to the first two. Some guidelines:

  1. out.size < in.size breaks invertibility. However, our goal is to extract analysis info, so out.size == in.size is only a "reference" to keep in mind; it's fine to go below, but the more we do so, the more we lose. out.size < in.size corresponds to hop_size > n_fft.
  2. Large hop_size loses analysis info. The loss is greater with a narrower window. If the window isn't wide enough, we also lose synthesis info (invertibility) - that is, we skip some parts of the input entirely, which is never desired. Also, a small hop_size doesn't necessarily mean greater time resolution; the window must be small enough.
  3. A large n_fft doesn't necessarily mean greater frequency resolution; the window must be wide enough. Like hop_size, too small an n_fft with a wide window (narrow in frequency) will break invertibility, as we're not tiling some of the frequency axis at all.
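As a concrete illustration of points 1-3, here's a minimal sketch (assuming a dummy 2 s, 2 MHz signal with the question's parameters) of how `nperseg` and hop size set the output shape:

```python
import numpy as np
from scipy import signal

fs = 2_000_000                                   # 2 MHz, as in the question
x = np.random.randn(fs * 2).astype(np.float32)   # dummy 2 s signal

# A larger hop shrinks the time axis; a smaller nperseg shrinks the
# frequency axis. Each shrink trades analysis info for compute/storage.
for nperseg, hop in [(20000, 10000), (20000, 20000), (4096, 4096)]:
    f, t, Zxx = signal.stft(x, fs=fs, window='hann',
                            nperseg=nperseg, noverlap=nperseg - hop)
    print(f"nperseg={nperseg:6d} hop={hop:6d} -> Zxx.shape={Zxx.shape}")
```

The first setting reproduces the (10001, 401) shape from the question; the others show the time and frequency axes shrinking as the hop grows and `nperseg` shrinks.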

After STFT,

  1. I don't necessarily recommend complex-valued input. It's a higher-entropy, less stable counterpart of the modulus, and will take more training instances than the modulus to train on - though peak performance with enough data might be better, since more info is preserved. Further,

    • Strided complex STFT is heavily aliased, which compromises spatial operators relying on ordinality and uniformity of data (e.g. convolutions). Modulus smoothens the representation and enables a much greater safe stride. Lowering the stride to accommodate increases learning-space dimensionality and model variance - a sort of lose-lose. This might not bother non-spatial layers (e.g. Dense), though.
    • All else kept the same, complex-valued doubles the data size relative to the modulus.
  2. If doing complex-valued, or any post-processing on the STFT directly (except the modulus), I recommend against standard implementations: they're higher-entropy and more numerically error-prone. ssqueezepy addresses this.
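A quick sanity check of the size point above (a sketch, reusing a dummy signal with the question's parameters):

```python
import numpy as np
from scipy import signal

fs = 2_000_000
x = np.random.randn(fs * 2).astype(np.float32)   # dummy 2 s signal
f, t, Zxx = signal.stft(x, fs=fs, window='hann', nperseg=20000)

S = np.abs(Zxx)   # modulus: real-valued and smoother

# complex64 stores two float32s per bin, so the complex STFT is exactly
# twice the bytes of its modulus
print(Zxx.dtype, Zxx.nbytes)
print(S.dtype, S.nbytes)
```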

Ultimately, it helps to frame it as "what do I gain/lose by shrinking/expanding my input size?" If something gets you 100% accuracy but you need real-time and it takes an hour per minute of signal, that won't do. The more analysis info your transform can squeeze out per unit of compression, the better.

Preprocessing comments

I strongly discourage x /= np.max(np.abs(x)) for signals; it's sensitive to outliers, and the rescaling isn't meaningfully consistent across different x. Prefer x /= x.std(). I'm also not sure this step is meaningful:
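To see the outlier sensitivity concretely, a sketch with a synthetic spike:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(100_000).astype(np.float32)
x_spiked = x.copy()
x_spiked[0] = 100.0                  # a single outlier sample

# Peak normalisation: the one outlier rescales the entire signal
peak = x_spiked / np.max(np.abs(x_spiked))
# Std normalisation: barely affected by the one outlier
std = x_spiked / x_spiked.std()

print(np.abs(peak[1:]).max())        # tiny: the real content got crushed
print(np.abs(std[1:]).max())         # near the clean signal's own scale
```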

# We convert to mono by averaging the left and right channels.

If the channels function as independent sources of information, a mean crudely discards information; it's better to unroll them along the batch dimension for the STFT and along the channels dimension for the neural net.
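A sketch of the unrolling, assuming scipy's `stft`, which transforms along the last axis by default, so stereo channels can be batched with a transpose:

```python
import numpy as np
from scipy import signal

fs = 2_000_000
# dummy short stereo recording in wavfile layout: (samples, channels)
x_stereo = np.random.randn(fs // 100, 2).astype(np.float32)

# Instead of averaging channels, treat them as a batch: move channels
# to the leading axis so stft runs once per channel along the last axis
x_batch = x_stereo.T                 # (2, samples)
f, t, Zxx = signal.stft(x_batch, fs=fs, window='hann', nperseg=2000)
print(Zxx.shape)                     # (channels, freq_bins, time_frames)
```

The resulting leading axis can then feed the neural net's channel (or batch) dimension directly.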

OverLordGoldDragon