How do I construct the input to a neural network from audio signals?

Question

Input: Microphone recordings of digits from 0 to 9 from different speakers.

Output: The digit from 0 to 9.

I am doing this for fun. So first I will train my neural network using some samples and then use it to classify digits.

The problem is that every person takes a different amount of time to say a digit, and the same person takes a different amount of time to say different digits. This gives me sound signals of different durations, but the input to the neural network must have a fixed size. How can I make all my input signals the same length so that I can feed them to the neural network? Is there a standard technique? The Fourier transform?

Well what's the longest sound sample, just use that as the input length — , Aug 29 '15 at 10:36

gsmafra · Answer 1 · 2015-09-05T20:25:46.907

So the problem is that you want to classify audio samples, but the samples have different lengths. As far as I know, there is no special way of doing this with (simple) neural networks.

For general classifiers I am sure there are lots of ways, but here are two very simple ones so you can begin playing with the problem:

Divide your signal into frames/segments of equal size and use each frame as if it were a training example. At prediction time, classify every frame of a recording and combine the results with a voting strategy. Since you are using a neural network, you can use the probabilistic outputs of the last layer instead of the hard class labels to weight this vote.

Use the average of the features across all frames/segments in your signal, which gives one fixed-size vector per recording. This is a very lightweight approach, but it can sometimes give good results.
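Both strategies can be sketched in a few lines of NumPy. This is only a minimal illustration, not a reference implementation: the function names (`split_into_frames`, `soft_vote`, `mean_pool`), the frame length, and the hop size are all made up for this example, and the random arrays stand in for real recordings.

```python
import numpy as np

def split_into_frames(signal, frame_len, hop):
    """Cut a 1-D signal into frames of equal length.

    A tail that does not fill a whole frame is dropped; a signal shorter
    than frame_len is zero-padded to exactly one frame.
    """
    signal = np.asarray(signal, dtype=float)
    if len(signal) < frame_len:
        signal = np.pad(signal, (0, frame_len - len(signal)))
    n_frames = 1 + (len(signal) - frame_len) // hop
    return np.stack([signal[i * hop : i * hop + frame_len] for i in range(n_frames)])

def soft_vote(frame_probs):
    """Strategy 1: average the per-frame class probabilities, then pick the best class."""
    frame_probs = np.asarray(frame_probs, dtype=float)
    return int(np.argmax(frame_probs.mean(axis=0)))

def mean_pool(frame_features):
    """Strategy 2: collapse all frames into one fixed-size vector by averaging."""
    return np.asarray(frame_features, dtype=float).mean(axis=0)

# Two placeholder "recordings" of different length give different frame counts...
short = split_into_frames(np.random.randn(4000), frame_len=400, hop=200)
long_ = split_into_frames(np.random.randn(9000), frame_len=400, hop=200)
print(short.shape, long_.shape)

# ...but after pooling, both are vectors of the same size.
# (Raw frames are pooled here only to show the shapes; in practice you
# would pool per-frame feature vectors.)
print(mean_pool(short).shape, mean_pool(long_).shape)
```

With the first strategy, the network sees one frame at a time and `soft_vote` is applied to the stacked outputs for all frames of one recording. With the second, the network sees a single pooled vector per recording.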

Either way, in the "feature extraction" part you will want to use some spectral transformation. I recommend either the logarithm of the spectrum or MFCC vectors. You can also train your network on raw audio, but this is a bit unusual.
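As a rough sketch of the first option, the log-magnitude spectrum of each frame can be computed with NumPy's FFT. The function name `log_spectrum`, the Hann window, the `eps` constant, and the frame size are choices made for this example only. MFCCs need more steps (a mel filterbank and a DCT), so an existing library routine such as `librosa.feature.mfcc` is probably a better starting point for those.

```python
import numpy as np

def log_spectrum(frames, eps=1e-10):
    """Log-magnitude spectrum of each frame (one frame per row)."""
    frames = np.asarray(frames, dtype=float)
    window = np.hanning(frames.shape[1])           # taper to reduce spectral leakage
    spectrum = np.fft.rfft(frames * window, axis=1)
    return np.log(np.abs(spectrum) + eps)          # eps avoids log(0) on silent frames

# Placeholder input: 5 frames of 400 samples each
# (25 ms per frame if the sampling rate were 16 kHz, which is an assumption).
frames = np.random.randn(5, 400)
features = log_spectrum(frames)
print(features.shape)  # one row per frame, frame_len // 2 + 1 frequency bins per row
```

Every frame yields a feature vector of the same size, regardless of how long the recording was.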

Note that both methods throw away the dynamics of the signal across frames, because each frame is looked at in isolation and the order of the frames is ignored. Try varying the frame size, or aggregating neighboring frames, to make the granularity coarser or finer.
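One possible way to aggregate neighboring frames is to concatenate each frame's features with those of a few frames on either side. The sketch below assumes a `(n_frames, n_features)` array; the name `stack_context` and the edge-padding choice are illustrative, not something prescribed above.

```python
import numpy as np

def stack_context(features, context):
    """Concatenate each frame with `context` neighbors on each side.

    Edges are padded by repeating the first/last frame, so the number of
    frames is unchanged while each row grows to (2 * context + 1) * n_features.
    """
    features = np.asarray(features, dtype=float)
    padded = np.pad(features, ((context, context), (0, 0)), mode="edge")
    n_frames = features.shape[0]
    # Slice i holds, for every frame t, the features of frame t - context + i.
    return np.hstack([padded[i : i + n_frames] for i in range(2 * context + 1)])

# Placeholder features: 10 frames with 201 values each.
feats = np.random.randn(10, 201)
print(stack_context(feats, context=2).shape)
```

A larger `context` lets each input cover more of the utterance, at the cost of a larger input layer.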
