How to determine whether a speech segment is voiced/unvoiced?

Question

I want to determine whether a speech frame is voiced or unvoiced. Of the many methods I found while searching, one computes the energy of the frame and marks the frame as voiced if the energy is above a certain threshold. My question is: how should I determine this 'threshold value'? Is it by trial and error, or is there a set of rules?

In my attempts, I resorted to a simple idea of looking at the energy plots and setting a threshold value accordingly. It served me well, but I want to know whether it was just beginner's luck.

You might benefit from these responses as well. Can I please ask what sort of setting you are dealing with? (i.e. is this a talk-show kind of environment or open-air recordings?) – A_A Dec 06 '18 at 13:04
We are recording in a normal room, not sound-proofed or anything, just like an everyday situation. – Anand Mohan Dec 06 '18 at 16:08

score 2 · Answer 1 · answered Dec 06 '18 at 17:43


The reason I asked about the recording environment is that, in a well-controlled situation (e.g. a talk-show, an interview in a studio, a discussion in a relatively quiet room and similar situations), you can take the power estimate of every frame, build a histogram of those values and use it to find the threshold value that separates your frames into voiced and unvoiced.

This is similar to Otsu's method and it is important to have a well controlled situation for the assumption of "...a bimodal histogram..." to be valid.
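As a rough illustration of the histogram idea, here is a minimal sketch in plain Python. The function names (`frame_energies_db`, `otsu_threshold`), the frame length, the bin count and the synthetic test signal are all assumptions made for this example, not something taken from the question. It only separates high-energy frames from low-energy ones, and it relies on the bimodal-histogram assumption mentioned above.

```python
import math
import random

def frame_energies_db(samples, frame_len):
    """Mean power of each non-overlapping frame, in dB."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        power = sum(x * x for x in frame) / frame_len
        energies.append(10.0 * math.log10(power + 1e-12))  # offset avoids log(0)
    return energies

def otsu_threshold(values, n_bins=64):
    """Histogram bin edge that maximizes the between-class variance."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return lo
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        counts[min(int((v - lo) / width), n_bins - 1)] += 1
    centers = [lo + (i + 0.5) * width for i in range(n_bins)]
    total = len(values)
    total_sum = sum(c * m for c, m in zip(counts, centers))
    best_score, best_edge = -1.0, lo
    w0, sum0 = 0, 0.0
    for i in range(n_bins - 1):
        w0 += counts[i]              # frames at or below this bin
        sum0 += counts[i] * centers[i]
        w1 = total - w0              # frames above this bin
        if w0 == 0 or w1 == 0:
            continue
        mu0 = sum0 / w0
        mu1 = (total_sum - sum0) / w1
        score = w0 * w1 * (mu0 - mu1) ** 2  # proportional to between-class variance
        if score > best_score:
            best_score, best_edge = score, lo + (i + 1) * width
    return best_edge

# Synthetic stand-in for a recording: quiet and loud frames alternate.
random.seed(0)
frame_len = 160  # assumed: 20 ms at 8 kHz
signal = []
for k in range(40):
    amplitude = 1.0 if k % 2 else 0.01
    signal.extend(random.gauss(0.0, amplitude) for _ in range(frame_len))

energies = frame_energies_db(signal, frame_len)
threshold = otsu_threshold(energies)
labels = [e > threshold for e in energies]  # True = high-energy frame
```

With real recordings it is worth plotting the histogram first: if it does not show two clear modes, the threshold returned here is not meaningful.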

Hope this helps.


A_A


But isn't it that unvoiced frames might have more power than voiced frames? I'm not sure energy-based methods are optimum for this task. However, it's always a good idea to try it... – applesoup Dec 06 '18 at 19:53
@applesoup The original question was stated as: "...how should I determine this 'threshold value'? Is it by trial and error or are there any set of rules?". This answer addresses that part. What you are asking in this comment requires a classifier. It would be useful to amend your question to include all relevant information for more accurate answers. – A_A Dec 07 '18 at 09:35
@a-a I see - I missed the OP's original question, which is perfectly addressed by your answer. However, I believe the process of thresholding the frame energies to discriminate between voiced and unvoiced frames can also be seen as a classifier, isn't it? – applesoup Dec 07 '18 at 11:49
@applesoup I am sorry, I seem to have taken your comment as coming from the OP. Yes, that would be a primitive VAD under the assumption that the voiced part is the dialogue going on. What I think you may be suggesting is a way of understanding if a more coherent signal is being voiced or it could be any random sound. I think that what you are suggesting would still need a bit of "classification" because Autocorrelation would tend to latch on the fundamental which is different between genders. – A_A Dec 07 '18 at 12:08
I'm beginning to have a slight feeling there might be some confusion regarding the term "voiced". My interpretation is related to phonetic properties of a speech sound and maybe in the context here it is interpreted as the presence of speech in general (containing voiced and unvoiced sounds)... Is my feeling right? – applesoup Dec 07 '18 at 12:19
@applesoup As long as we agree on what a VAD does, we are on the same page. There are various degrees of performance you can achieve with various degrees of solution complexity. It might be helpful to have a look at these as well (?). – A_A Dec 07 '18 at 12:24

score 1 · Accepted Answer · answered Dec 06 '18 at 12:45

Well, that is energy detection for you; setting the threshold is a long-discussed (and not too abstract) problem when receiving OOK (on-off-keying).

The threshold you choose will be a tradeoff between missed detections and false alarms.

Hence, you will need a probability density function of your signal of interest (voice energy per frame) and one of your noise (noise energy per frame).

You could then either set a constant false alarm rate and set your threshold as low as possible to attain that, or a constant detection probability and live with the number of false alarms that you get.
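The constant-false-alarm-rate option can be sketched as follows, assuming you have a set of frame energies known to contain only noise. Instead of a fitted density, this sketch uses the empirical distribution of those energies. The function names and the synthetic exponential "energies" are hypothetical and only serve to make the example runnable.

```python
import random

def threshold_for_false_alarm_rate(noise_energies, false_alarm_rate):
    """Lowest observed noise energy that at most the requested
    fraction of the noise-only frames exceeds."""
    ordered = sorted(noise_energies)
    n = len(ordered)
    allowed = int(false_alarm_rate * n)  # noise frames allowed above threshold
    return ordered[max(n - allowed - 1, 0)]

def error_rates(noise_energies, speech_energies, threshold):
    """Empirical false-alarm and missed-detection rates for a threshold."""
    false_alarm = sum(e > threshold for e in noise_energies) / len(noise_energies)
    miss = sum(e <= threshold for e in speech_energies) / len(speech_energies)
    return false_alarm, miss

# Synthetic stand-ins for measured frame energies (assumed distributions).
random.seed(1)
noise = [random.expovariate(1.0) for _ in range(1000)]
speech = [2.0 + random.expovariate(0.5) for _ in range(1000)]

threshold = threshold_for_false_alarm_rate(noise, 0.05)
false_alarm, miss = error_rates(noise, speech, threshold)
```

Lowering the allowed false-alarm rate pushes the threshold up and increases the miss rate, which is the tradeoff described above.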

applesoup · Answer 3 · 2018-12-06T19:54:38.280

0

An alternative to a pure energy-based detector is to analyze the height of the peak of the normalized autocorrelation function that lies closest to zero lag (not counting the peak at zero lag itself).

This answer explains a way to compute a simple periodicity coefficient which should indicate whether a segment is voiced or unvoiced.
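A minimal sketch of such a periodicity measure is given below. It is not the method from the linked answer: for simplicity it takes the largest normalized autocorrelation value within an assumed pitch-lag range instead of locating the first peak. The function name `periodicity`, the 8 kHz sampling rate, the 50 to 400 Hz pitch range and the test signals are all assumptions.

```python
import math
import random

def periodicity(frame, min_lag, max_lag):
    """Largest value of r(lag) / r(0) for min_lag <= lag <= max_lag,
    together with the lag at which it occurs."""
    mean = sum(frame) / len(frame)
    x = [v - mean for v in frame]
    energy = sum(v * v for v in x)  # r(0)
    if energy == 0.0:
        return 0.0, min_lag
    best_value, best_lag = -1.0, min_lag
    for lag in range(min_lag, max_lag + 1):
        r = sum(x[n] * x[n + lag] for n in range(len(x) - lag)) / energy
        if r > best_value:
            best_value, best_lag = r, lag
    return best_value, best_lag

fs = 8000                 # assumed sampling rate in Hz
n_samples = 240           # 30 ms frame
min_lag = fs // 400       # shortest pitch period considered (400 Hz)
max_lag = fs // 50        # longest pitch period considered (50 Hz)

# Voiced-like frame: 100 Hz fundamental plus one harmonic.
voiced = [math.sin(2 * math.pi * 100 * n / fs)
          + 0.5 * math.sin(2 * math.pi * 200 * n / fs)
          for n in range(n_samples)]

# Unvoiced-like frame: white Gaussian noise.
random.seed(2)
unvoiced = [random.gauss(0.0, 1.0) for _ in range(n_samples)]

voiced_score, voiced_lag = periodicity(voiced, min_lag, max_lag)
unvoiced_score, _ = periodicity(unvoiced, min_lag, max_lag)
```

Because the sum runs over fewer terms at larger lags, this estimate is tapered towards zero, so even a perfectly periodic frame scores below 1. The decision threshold on the score still has to be chosen, for example with one of the approaches from the other answers.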

edited Dec 06 '18 at 19:54

answered Dec 06 '18 at 16:30

applesoup
