Input: Microphone recordings of digits from 0 to 9 from different speakers.
Output: The digit from 0 to 9.
I am doing this for fun. First I will train my neural network on some samples, and then use it to classify digits.
The problem is that every person takes a different amount of time to say a digit, and the same person also takes different amounts of time to say different digits. This gives me sound signals of different durations, but the input to the neural network must have a fixed size. How can I make all my input signals the same length so that I can feed them to the neural network? Is there a standard technique? Would a Fourier transform help?
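To make the problem concrete, here is a minimal sketch of the two naive options I can think of: padding or cutting every recording to a fixed number of samples, or stretching every recording onto a fixed number of points. The function names `fix_length` and `resample_to_length` and the target length are my own placeholders, and I am assuming each recording is already loaded as a 1-D NumPy array. I do not know whether either of these is the standard technique, which is part of what I am asking.

```python
import numpy as np

def fix_length(signal, target_len):
    """Zero-pad at the end, or truncate, so the result has exactly target_len samples."""
    signal = np.asarray(signal, dtype=np.float64)
    if len(signal) >= target_len:
        return signal[:target_len]          # too long: drop the tail
    pad = target_len - len(signal)
    return np.pad(signal, (0, pad), mode="constant")  # too short: append zeros

def resample_to_length(signal, target_len):
    """Linearly interpolate the signal onto target_len evenly spaced points."""
    signal = np.asarray(signal, dtype=np.float64)
    old_t = np.linspace(0.0, 1.0, num=len(signal))   # original sample positions
    new_t = np.linspace(0.0, 1.0, num=target_len)    # desired sample positions
    return np.interp(new_t, old_t, signal)

# Hypothetical example: two "recordings" of different lengths
short = np.array([1.0, 2.0, 3.0])
long_ = np.arange(10, dtype=np.float64)
print(fix_length(short, 5), fix_length(long_, 5))
print(resample_to_length(short, 5), resample_to_length(long_, 5))
```

With padding, the timing of the sound is preserved but short recordings end in silence. With interpolation, every recording fills the whole input but fast and slow speech get stretched by different amounts, which also changes the pitch.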

