
I am currently trying to train a neural network that maps an input of 400 features to an output vector of 13 elements.

The input is a fixed-length audio sample, and the output is a feature vector extracted from it.

I made a simple network consisting of one layer:

model.add(Dense(output_dim=13, input_dim=400, init="normal",activation="relu"))

*(image: model summary)*
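For reference, a hedged sketch of the same one-layer model in the current Keras API (the old `output_dim`/`init` keywords are now `units`/`kernel_initializer`); the training data below is random placeholder data standing in for the real audio samples and targets:

```python
import numpy as np
from tensorflow import keras

# One Dense layer mapping 400 inputs to 13 outputs, as in the question.
model = keras.Sequential([
    keras.layers.Input(shape=(400,)),
    keras.layers.Dense(13, kernel_initializer="random_normal",
                       activation="relu"),
])
model.compile(optimizer="adam", loss="mse")

# Placeholder data standing in for the real audio windows and MFCC targets.
X = np.random.randn(32, 400).astype("float32")
y = np.random.randn(32, 13).astype("float32")
model.fit(X, y, epochs=2, verbose=0)
```

Note that ReLU on the output layer clips all negative predictions to zero; since MFCC targets can be negative, a linear output activation may be worth trying.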

Training it for 10 epochs gives me these results:

*(image: training output after 10 epochs)*

The training shows that it does get better, but how do I improve it?

Updated model:

*(image: updated model)*
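Based on the description given later in the comments (a 4-layer network with PReLU on the first three layers and ELU on the last), the updated model might look roughly like the sketch below; the hidden-layer widths are made-up placeholders, since the actual values from the screenshot are not available:

```python
from tensorflow import keras

model = keras.Sequential([
    keras.layers.Input(shape=(400,)),
    keras.layers.Dense(256),   # width is a placeholder, not the original value
    keras.layers.PReLU(),
    keras.layers.Dense(128),   # placeholder width
    keras.layers.PReLU(),
    keras.layers.Dense(64),    # placeholder width
    keras.layers.PReLU(),
    keras.layers.Dense(13, activation="elu"),  # ELU on the last layer
])
model.compile(optimizer="adam", loss="mse")
```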

Training and testing: *(image: training and testing results)*

Plotting `(predicted_output - actual_output)[0]`, showing the difference between the predicted output and the actual output for the first feature only:

*(image: error plot for the first feature)*

Overlap plot: *(image: predicted and actual outputs overlaid)*
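A sketch of how the two plots above could be produced with matplotlib; `predicted_output` and `actual_output` here are random placeholder arrays standing in for the real model predictions and MFCC targets:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs headless
import matplotlib.pyplot as plt

# Placeholder arrays: 100 frames of 13 MFCC coefficients each.
predicted_output = np.random.randn(100, 13)
actual_output = np.random.randn(100, 13)

diff = predicted_output - actual_output  # data for the error plot

fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(8, 6))
ax1.plot(diff[:, 0])
ax1.set_title("predicted - actual (first MFCC coefficient)")

ax2.plot(actual_output[:, 0], label="actual")
ax2.plot(predicted_output[:, 0], color="green", label="predicted")
ax2.set_title("overlap plot (green = predicted)")
ax2.legend()

fig.tight_layout()
fig.savefig("mfcc_comparison.png")
```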

Carlton Banks
    Related: http://datascience.stackexchange.com/questions/14664/neural-network-with-flexible-number-of-inputs If your goal is to calculate MFCC from audio input, then this is a well-defined transformation which is going to be much easier and faster to do directly without using any deep learning. It's a bit like creating a deep NN to calculate Sine or Cosine (but way harder) . . . So it may help to give some background on why you specifically want to train a NN for this task. – Neil Slater Nov 19 '16 at 20:39
  • So... yes, doing it using the math would be easier. The reason I want to do it the other way is to get a better understanding of neural networks and how different network models can be used. This post differs a bit from the related post, as the input length is fixed (400), so the reason for using RNN+LSTM isn't valid here. – Carlton Banks Nov 20 '16 at 12:38
  • I guess something is wrong with the accuracy value... But I am not sure how I should interpret the loss. Is it a lot or just some minimum? – Carlton Banks Nov 20 '16 at 13:46
  • Are your input variables just short audio samples? What sample rate, and how are you getting the true MFCC for training? I think the problem is tractable, but your current loss values are very high, and you will need a much deeper network, maybe a convolutional network, if you want good accuracy. You could probably choose a simpler problem for learning NNs, but I suppose the good thing is you can generate a lot of training data if you want. – Neil Slater Nov 20 '16 at 18:16
  • The sample rate is 16000 Hz, and the true MFCC has been precalculated with the same settings. My current model is a 4-layer network with PReLU (first 3 layers) and ELU at the last layer. The loss is around 80 when testing and 60 when training. Why are you suggesting a CNN for a regression problem? – Carlton Banks Nov 20 '16 at 19:26
  • CNN has nothing to do with the regression/classification split (you can do either). It is good for signal processing, especially where you have translation invariance, which you very much do for MFCC. – Neil Slater Nov 20 '16 at 19:58
  • I am not sure how I should interpret the performance...? I can't say whether the result I am getting is good or bad. I decreased the loss, but the accuracy?... – Carlton Banks Nov 20 '16 at 20:00
  • Accuracy is not relevant for regression tasks. Ignore it. The loss is your mean squared error - I would interpret it w.r.t. typical values seen in your data. Take a look at a few of the predictions compared to actual values, to get a sense of what that level of loss means. Ideally the loss is a lot smaller in absolute terms than square of typical desired output value. – Neil Slater Nov 20 '16 at 20:05
  • Makes sense, one would look at it with respect to the values. I made a plot, and it shows that the data usually deviates between -20 and 60... – Carlton Banks Nov 20 '16 at 20:10
  • You want a comparison with actual data, not the difference, in order to understand how good or bad the predictions are. If the true values of the MFCC coefficients in your examples are between 0 and 10, then your results are terrible. If they are between 1000 and 2000, then they are pretty good. – Neil Slater Nov 20 '16 at 20:12
  • The plot with one color shows the error. And the other plot shows the overlap. The green one being the one predicted. – Carlton Banks Nov 20 '16 at 20:42
  • My gut feeling is that you could get much better results (maybe loss < 10 or even < 1) if you looked at CNNs, perhaps add some dropout, variations in hyper-params etc. However, I could easily be wrong. You are probably the first person to test deep learning against calculating MFCC, so may have set the performance benchmark. In that case, you can look at the learning challenge as successively beating your own results. – Neil Slater Nov 20 '16 at 22:38

0 Answers