
I want to create a feature space that combines MFCCs, MFCC deltas, and MFCC delta-deltas, stacked along the coefficient axis, which I will then feed into a CNN for speech emotion recognition.

After extracting the MFCCs from the mel-spectrogram, I scaled the coefficients using standard scaling. Then I computed the deltas and delta-deltas and concatenated all the features along the coefficient axis, essentially stacking the deltas on top of the MFCC-gram. But, of course, the deltas will generally have smaller values than the MFCCs, because they are computed as differences between frames; thus, the deltas are less prominent in the final feature space. Should I scale the features again, or should I instead hold back from scaling the MFCCs and only scale once the final feature space has been assembled? Does it even matter?
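For reference, here is a minimal NumPy sketch of the stacking step I mean. The `mfcc` matrix is a random stand-in (in practice it would come from a feature extractor such as `librosa.feature.mfcc`), and the deltas are approximated as a gradient along the time-frame axis:

```python
import numpy as np

# Hypothetical MFCC matrix: 13 coefficients x 100 time frames
# (a stand-in for real MFCCs extracted from a mel-spectrogram)
rng = np.random.default_rng(0)
mfcc = rng.normal(scale=10.0, size=(13, 100))

# Deltas approximated as the gradient along the time-frame axis;
# delta-deltas are the gradient of the deltas
delta = np.gradient(mfcc, axis=1)
delta2 = np.gradient(delta, axis=1)

# Stack along the coefficient axis so every time frame keeps all three views
features = np.concatenate([mfcc, delta, delta2], axis=0)
print(features.shape)  # (39, 100)
```

Because the deltas are frame-to-frame differences, their magnitudes come out smaller than the raw MFCCs, which is exactly the imbalance I am asking about.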

Thanks!

1 Answer


If you actually want the delta and delta-delta features to be standardized, that is, to have zero mean and unit variance, then you have to perform that scaling after computing them.

Whether it matters in practice, you'll just have to test. Deep learning in particular tends to be quite robust to different feature scalings (as long as the scaling is consistent between training and inference).
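A minimal sketch of the scale-once-after-stacking approach, assuming the features sit in a `(coefficients, frames)` matrix; `features` here is a random stand-in for the stacked MFCC/delta/delta-delta matrix, and each row is standardized across time:

```python
import numpy as np

# Stand-in for a stacked MFCC / delta / delta-delta matrix:
# 39 rows (3 x 13 coefficients) x 100 time frames
rng = np.random.default_rng(1)
features = rng.normal(scale=5.0, size=(39, 100))

# Standardize each row (coefficient or delta channel) across time,
# so the deltas end up with the same zero mean / unit variance
# as the raw MFCCs
mean = features.mean(axis=1, keepdims=True)
std = features.std(axis=1, keepdims=True)
standardized = (features - mean) / std
```

In a real pipeline you would fit the mean and standard deviation on the training set only and reuse those statistics at inference time, to keep the scaling consistent.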

Jon Nordby