I've noticed in some literature that the authors chose to use the mean and standard deviation of the extracted MFCC features.

"ANALYSIS AND VOICE RECOGNITION IN INDONESIAN LANGUAGE USING MFCC AND SVM METHOD", Harvianto

Can you offer some insights as to why this approach is considered after the MFCC extraction?

I have seen the mean and standard deviation stacked horizontally like this:

mfccs = np.hstack((np.mean(mfccs, axis=1), np.std(mfccs, axis=1)))
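
To make the shapes concrete, here is a minimal sketch of what that line does. It assumes `mfccs` has shape `(n_mfcc, n_frames)`, which is the layout `librosa.feature.mfcc` returns. The random arrays and the helper name `summarize_mfcc` are placeholders of mine, not taken from the paper. One thing the sketch shows is that clips of different lengths end up as vectors of the same length, which a classifier such as an SVM needs.

```python
import numpy as np

def summarize_mfcc(mfccs):
    """Collapse an (n_mfcc, n_frames) matrix into one vector:
    per-coefficient mean followed by per-coefficient std."""
    return np.hstack((np.mean(mfccs, axis=1), np.std(mfccs, axis=1)))

# Placeholder data standing in for real MFCC matrices (assumption: 13 coefficients).
rng = np.random.default_rng(0)
short_clip = rng.normal(size=(13, 50))    # 50 frames
long_clip = rng.normal(size=(13, 400))    # 400 frames

feat_short = summarize_mfcc(short_clip)
feat_long = summarize_mfcc(long_clip)

# Both vectors have length 2 * n_mfcc, regardless of the number of frames.
print(feat_short.shape, feat_long.shape)
```

The first 13 entries are the means and the last 13 are the standard deviations, so the frame axis has been summarized away.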
Joe
  • 2
    https://dsp.stackexchange.com/q/19564/8202 This link might be relevant for you. You can use the mean and standard deviation of MFCCs to remove channel effects that change over time, i.e. convolutional effects such as the response of the vocal tract or a recording device (for channel robustness). – jojeck Sep 01 '22 at 07:43
  • 1
    Are you asking why the mean and std are taken? That's because it adds additional features. You can also take the mean and std of the diff for example.

    Or are you asking the REASONING behind taking the mean and std? As in what's the physical meaning of taking the mean and std?

    – Jdip Sep 02 '22 at 17:14
  • @Jdip, your first point answers my question. I just wanted to know why the mean and std are taken. Since you brought up the second point, is there a physical meaning to the mean and std? I had not thought about that. It makes me think of the delta features: does taking the difference of the MFCCs between frames measure how fast the audio signal changes (a velocity)? – Joe Sep 02 '22 at 18:07
  • 1
    Mean and std are invariants. Related. – OverLordGoldDragon Sep 03 '22 at 10:00
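
The two ideas raised in the comments can be sketched as follows. This is an illustration under stated assumptions, not code from the linked posts. It assumes the channel stays roughly constant over the clip, so that its convolutional effect shows up as an additive per-coefficient offset in the cepstral domain. The helper names `cmvn` and `delta_stats` and the random data are placeholders.

```python
import numpy as np

def cmvn(mfccs, eps=1e-8):
    """Cepstral mean and variance normalization along the frame axis."""
    mean = mfccs.mean(axis=1, keepdims=True)
    std = mfccs.std(axis=1, keepdims=True)
    return (mfccs - mean) / (std + eps)

def delta_stats(mfccs):
    """Mean and std of the frame-to-frame difference (a simple delta)."""
    delta = np.diff(mfccs, axis=1)
    return np.hstack((delta.mean(axis=1), delta.std(axis=1)))

# Placeholder MFCC matrix, shape (n_mfcc, n_frames).
rng = np.random.default_rng(0)
clean = rng.normal(size=(13, 200))

# Assumption: a constant channel adds a fixed offset to each coefficient.
channel_offset = rng.normal(size=(13, 1))
shifted = clean + channel_offset

# Subtracting the per-coefficient mean removes the offset.
same_after_cmvn = np.allclose(cmvn(clean), cmvn(shifted))

# Frame differences cancel the offset too, so delta statistics are unaffected.
same_delta_stats = np.allclose(delta_stats(clean), delta_stats(shifted))
print(same_after_cmvn, same_delta_stats)
```

Here the mean is used for normalization, while the delta statistics describe how quickly the coefficients change from frame to frame.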

0 Answers