I have a collection of wave files, each containing a few well-separated spoken words. My goal is to split each recording into one file per word. I want the program to work on all the files, not require tuning on each one.
So I started by looking at the waveforms. The words are well separated, but the recording environment varies from noisy to quiet.
To suppress the noise, I applied a simple non-linear transformation that suppresses samples with small amplitude and expands the rest, as follows:
$y[t] = e^{x[t]^2 / M}$, where $M = \max_t x[t]^2$ is the maximum of the squared samples, so the exponent is at most 1. That leads me to this:
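For concreteness, here is a minimal sketch of the transform, assuming NumPy and SciPy and a mono recording (the file name is a placeholder):

```python
import numpy as np
from scipy.io import wavfile

# "words.wav" is a placeholder; any mono recording works.
rate, x = wavfile.read("words.wav")
x = x.astype(np.float64)

sq = x * x
M = sq.max()          # maximum of the squared samples
y = np.exp(sq / M)    # exponent is in [0, 1], so y is in [1, e]
```

Quiet samples all map to values near 1 while the loudest samples map toward $e$, which squashes the noise floor relative to the speech peaks.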
With that, it is fairly easy to segment the audio. I used a Hidden Markov Model for that purpose; a rough sketch follows.
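This is a minimal sketch of the segmentation step, assuming hmmlearn's two-state `GaussianHMM` over the transformed envelope; the library, the feature choice, and the parameters are all illustrative:

```python
import numpy as np
from hmmlearn import hmm

def train_segmenter(y_train):
    """Fit a two-state (silence vs. speech) HMM once, to reuse on every file."""
    model = hmm.GaussianHMM(n_components=2, covariance_type="diag", n_iter=50)
    model.fit(y_train.reshape(-1, 1))   # one feature: the transformed amplitude
    return model

def segment(model, y):
    """Label each sample of a file's envelope; 1 = speech, 0 = silence."""
    states = model.predict(y.reshape(-1, 1))     # Viterbi path per sample
    if model.means_[0, 0] > model.means_[1, 0]:  # make the louder state label 1
        states = 1 - states
    return states
```

Word boundaries then fall out of the runs of 1s.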
But we have a problem here. In this case, the audio has a pronounced peak at the second word, and everything else is scaled down because $M$ is taken from that peak. Note that it is still pretty easy to eyeball the separation, but it would be hard to find a single threshold (or, in my case, a single HMM) that fits all the files.
I think I will need some adaptive technique. I was thinking of something like adaptive thresholding in image processing, but I have no control over how long the silences are, so there is no good window size for estimating the threshold.
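To make the window problem concrete, the kind of sliding-window threshold I was considering looks roughly like this; the window length and offset are arbitrary guesses:

```python
import numpy as np

def adaptive_mask(y, win=8000, offset=0.05):
    """Flag samples that exceed the local mean over a sliding window."""
    kernel = np.ones(win) / win
    local_mean = np.convolve(y, kernel, mode="same")
    return y > local_mean + offset
```

With no bound on how long a silence can run, any fixed `win` can land entirely inside a silent stretch, where the local mean adapts to the noise floor and noise alone starts to clear the threshold.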
Any idea how to deal with the variations? I could train a different HMM for each file on the fly, but that is really slow.

