I have a collection of wave files, each containing a few well-separated spoken words. My goal is to split each recording into one file per word. I want the program to work on all the files, not require tuning on each one.
So I started by looking at the waveforms. The words are well separated, but the recording environment varies from noisy to quiet.
To suppress the noise, I applied a simple non-linear transformation that suppresses samples with small amplitude and expands the rest, as follows:
$y[t] = e^{x[t]^2 / M}$, where $M = \max_t x[t]^2$ is the maximum of the squared samples, so the exponent is at most 1. That leads me to this:
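For concreteness, here is a minimal sketch of the transform, assuming NumPy and SciPy and a mono recording (the file name is a placeholder):

```python
import numpy as np
from scipy.io import wavfile

# "words.wav" is a placeholder; any mono recording works.
rate, x = wavfile.read("words.wav")
x = x.astype(np.float64)

sq = x * x
M = sq.max()          # maximum of the squared samples
y = np.exp(sq / M)    # exponent is in [0, 1], so y is in [1, e]
```

Quiet samples all map to values near 1 while the loudest samples map toward $e$, which squashes the noise floor relative to the speech peaks.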
With that, it is fairly easy to segment the audio. I used a Hidden Markov Model for that purpose; a rough sketch follows.
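This is a minimal sketch of the segmentation step, assuming hmmlearn's two-state `GaussianHMM` over the transformed envelope; the library, the feature choice, and the parameters are all illustrative:

```python
import numpy as np
from hmmlearn import hmm

def train_segmenter(y_train):
    """Fit a two-state (silence vs. speech) HMM once, to reuse on every file."""
    model = hmm.GaussianHMM(n_components=2, covariance_type="diag", n_iter=50)
    model.fit(y_train.reshape(-1, 1))   # one feature: the transformed amplitude
    return model

def segment(model, y):
    """Label each sample of a file's envelope; 1 = speech, 0 = silence."""
    states = model.predict(y.reshape(-1, 1))     # Viterbi path per sample
    if model.means_[0, 0] > model.means_[1, 0]:  # make the louder state label 1
        states = 1 - states
    return states
```

Word boundaries then fall out of the runs of 1s.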
But we have a problem here. In this case, the audio has a pronounced peak at the second word, and everything else is scaled down because $M$ is taken from that peak. Note that it is still pretty easy to eyeball the separation, but it would be hard to find a single threshold (or, in my case, a single HMM) that fits all the files.
I think I will need some adaptive technique. I was thinking of something like adaptive thresholding in image processing, but I have no control over how long the silences are, so there is no good window size for estimating the threshold.
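To make the window problem concrete, the kind of sliding-window threshold I was considering looks roughly like this; the window length and offset are arbitrary guesses:

```python
import numpy as np

def adaptive_mask(y, win=8000, offset=0.05):
    """Flag samples that exceed the local mean over a sliding window."""
    kernel = np.ones(win) / win
    local_mean = np.convolve(y, kernel, mode="same")
    return y > local_mean + offset
```

With no bound on how long a silence can run, any fixed `win` can land entirely inside a silent stretch, where the local mean adapts to the noise floor and noise alone starts to clear the threshold.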
Any idea how to deal with the variations? I could train a different HMM for each file on the fly, but that is really slow.

