Rough estimate of how close a human speaker is to a microphone?

Question

Is there a known technique for estimating how close a human speaker is to a single microphone? I'm hoping that there is a known technique that can analyze an incoming audio signal that contains a human voice with typical ambient room noise, and give a rough estimate of how close the speaker is to the microphone. Is there an algorithm/formula that can analyze such a signal and look for early reflections/reverb and then give a quality measurement back? Note, I am not looking for an estimate in units of distance (3 feet, etc.), just something that would tell me if the speaker is right on top of the mic, near the mic, far from the mic, or really far from the mic. Also, if it simplifies things, I can safely make the assumption that the speaker's voice will be significantly louder than any other human voice in the room.

Any papers or especially open source libraries would be fantastic.

PREREQUISITE: I would assume that an algorithm/formula that can detect if there is a human voice of significant volume present in an audio source would be a prerequisite for making the above quality measurement. Any information or resources along those lines would also be greatly appreciated.

Supervised or unsupervised? How do you want to delineate near from far? — Emre, Apr 28 '12 at 18:27
@Emre - I know what supervised/unsupervised means from a machine learning context, but I'm not sure how you are asking it here. Can you elaborate? As far as delineating near from far I want to categorize the distance using the 4 category labels given above: right on top of the mic, near the mic, far from the mic, or really far from the mic. — Robert Oschler, Apr 28 '12 at 19:58
Are you going to provide any labeled examples to demonstrate the various classes (near/far/above etc.)? — Emre, Apr 28 '12 at 20:02
@Emre - I didn't think they would help because any classification technique that would take advantage of the acoustic characteristics peculiar to the room for a particular recording would not work for me. I need a generalized technique that will work across most "common" home/office environments. — Robert Oschler, Apr 28 '12 at 20:05

score 5 · Answer 1 · answered Apr 28 '12 at 23:41

Here is one possible suggestion: use a pressure-gradient (directional) microphone, e.g. a cardioid microphone. The human voice is approximately a point source and has a strong reactive velocity near-field component, which velocity-sensitive mics pick up. In plain English, this means you get substantially more relative bass when the speaker is close by (the proximity effect). You could leverage this by comparing the energy in the 500 Hz-1 kHz band to the energy in the 100-200 Hz band. With some reasonable calibration you ought to be able to tell whether the speaker is close or far.
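
A minimal sketch of that band comparison, assuming NumPy and a mono frame of samples. The function names and the Hann window are illustrative choices, not something the answer prescribes, and the resulting ratio only means something after calibrating against the actual microphone.

```python
import numpy as np

def band_energy(signal, sample_rate, f_lo, f_hi):
    """Sum of squared FFT magnitudes between f_lo and f_hi (in Hz)."""
    windowed = signal * np.hanning(len(signal))  # window choice is an assumption
    spectrum = np.fft.rfft(windowed)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    mask = (freqs >= f_lo) & (freqs < f_hi)
    return float(np.sum(np.abs(spectrum[mask]) ** 2))

def bass_ratio_db(signal, sample_rate):
    """Energy in 100-200 Hz relative to 500-1000 Hz, in dB."""
    low = band_energy(signal, sample_rate, 100.0, 200.0)
    mid = band_energy(signal, sample_rate, 500.0, 1000.0)
    eps = 1e-12  # guards against log(0) on silent frames
    return 10.0 * np.log10((low + eps) / (mid + eps))
```

With a directional capsule, a higher `bass_ratio_db` would suggest a closer speaker. Where to draw the line between "close" and "far" has to come from calibration recordings.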

Hilmar


That's a good idea but the microphone in this case is a fixed design robot that only has a single microphone. I can't rely on the PC microphone to act as the 2nd because they might be in another room of the house where the PC isn't. – Robert Oschler Apr 29 '12 at 03:08
You don't need a second mic. Use a cardioid capsule, not an omni capsule. Most mics come in both versions. – Hilmar Apr 29 '12 at 04:37
Ok, thanks. I understand now. Still can't change the mic in the robot though. – Robert Oschler Apr 29 '12 at 12:01
You can take a look at the data sheet of the microphone. Chances are there is already one in there. – Hilmar Apr 29 '12 at 21:56
Or give the microphone a pinna? :) – endolith Apr 30 '12 at 16:31

score 5 · Answer 2 · answered Apr 30 '12 at 10:02

In general, this would be an Automatic Gain Control problem. For example, an AGC stage is found in many RF receivers to try and maintain a constant level of reception.

The AGC is a simple feedback system that monitors the amplitude (or power) of the signal at its input, compares it with a reference level, and uses the resulting "error" signal to drive an amplifier that compensates for variations in the input signal.
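
As a rough illustration of that feedback loop, here is a sample-by-sample sketch assuming NumPy. The `reference` and `rate` values are arbitrary placeholders, and real AGC designs differ in how they measure level and update the gain.

```python
import numpy as np

def agc(signal, reference=0.1, rate=0.01):
    """Simple feedback AGC: nudge the gain so the output level tracks `reference`.

    Returns the gain-corrected signal and the gain used over time.
    `reference` and `rate` are illustrative values, not tuned ones.
    """
    gain = 1.0
    out = np.empty(len(signal))
    gains = np.empty(len(signal))
    for i, x in enumerate(signal):
        y = gain * x                           # amplifier stage
        error = reference - abs(y)             # compare output level with reference
        gain = max(gain + rate * error, 0.0)   # feedback: the error drives the gain
        out[i] = y
        gains[i] = gain
    return out, gains
```

The gain settles high for weak inputs and low for strong ones, so the gain (or the error driving it) is the quantity that tracks input level.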

Therefore, in very simple terms and as far as this question is concerned, the "error" signal contains information about how far away a source is from the microphone that records it. Actually, in the audio world, there is already a device that deals with this problem.

A variation of the AGC is the "compressor". The compressor monitors the signal at its input and passes it, unchanged, to the output as long as its level is below a certain threshold. Once the signal goes above that threshold, the compressor reduces the amplification by a set amount. The key difference is that this reduction in amplification does not kick in and out instantly. Rather, the so-called "Attack" and "Decay" (often called "release") times determine how "softly" the compressor limits the signal or returns to the original amplification over time.
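
A feed-forward sketch of such a compressor, assuming NumPy. The threshold, the 4:1 `ratio`, and the attack/decay times are hypothetical defaults chosen for illustration; the `ratio` parameter in particular is one common way to express the "set amount" of reduction.

```python
import numpy as np

def compress(signal, sample_rate, threshold_db=-20.0, ratio=4.0,
             attack_s=0.005, decay_s=0.100):
    """Feed-forward compressor with attack/decay smoothing of the level.

    Returns the compressed signal and the gain reduction (in dB) over time.
    All default parameter values are illustrative assumptions.
    """
    attack = np.exp(-1.0 / (attack_s * sample_rate))
    decay = np.exp(-1.0 / (decay_s * sample_rate))
    envelope = 0.0
    out = np.empty(len(signal))
    reduction_db = np.empty(len(signal))
    for i, x in enumerate(signal):
        level = abs(x)
        # Follow rising levels quickly (attack) and falling levels slowly (decay).
        coeff = attack if level > envelope else decay
        envelope = coeff * envelope + (1.0 - coeff) * level
        level_db = 20.0 * np.log10(max(envelope, 1e-9))
        over_db = max(level_db - threshold_db, 0.0)   # amount above threshold
        gain_db = -over_db * (1.0 - 1.0 / ratio)      # unity gain below threshold
        out[i] = x * 10.0 ** (gain_db / 20.0)
        reduction_db[i] = -gain_db
    return out, reduction_db
```

For the distance question, the interesting output is `reduction_db` rather than the compressed audio: it stays at zero for quiet input and grows as the input level rises above the threshold.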

A properly tuned compressor can be used to compensate for various performance styles or speakers (for example in the studio room of a radio station). People speaking passionately might come close to the microphone and also raise their voice. That's when the compressor politely kicks in and limits the signal from that microphone to avoid distortion and allow other speakers to be heard as well (otherwise you would have to lower the gain of that microphone manually, and that's not practical with many microphones).

Therefore, a compressor with just one threshold would answer the simple question "Speaker is close or away" from the microphone. More thresholds would give more fine grained control.
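
To make the multi-threshold idea concrete, here is a small sketch that maps a measured level in dB onto the four categories from the question. The threshold values are purely hypothetical and would need calibration for a given microphone and gain setting.

```python
from bisect import bisect_right

# Hypothetical thresholds in dB (relative to full scale); calibrate before use.
THRESHOLDS_DB = [-40.0, -25.0, -10.0]
LABELS = ["really far from the mic", "far from the mic",
          "near the mic", "right on top of the mic"]

def distance_label(level_db):
    """Return the category whose threshold range contains level_db."""
    return LABELS[bisect_right(THRESHOLDS_DB, level_db)]
```

For example, `distance_label(-30.0)` falls between the first two thresholds and yields "far from the mic".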

In terms of libraries or toolboxes to achieve compression you could perhaps use CSound's relevant module.

However, for this application you might want to focus on the frequency band of the fundamental frequency of the voice (as a better indicator of what the speaker is doing, how many speakers there are, etc.).

Therefore, you can take your input signal, filter it with a simple band-pass filter (with a passband of roughly 85 Hz to 255 Hz), rectify the filtered signal (to get a measure of its level) and then pass it through an integrator with a high enough time constant (something like 100 ms to 300 ms, for example). This will return a slowly varying, smooth signal that you can use as an indicator of how far a "human voice" is from the microphone, based on the amplitude of the signal. The integrator will smooth the signal enough to avoid sudden "jumps" in amplitude (in any case, the speaker is not going to go from "far" to "close" in 2 ms).
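
The chain described above (band-pass, rectify, integrate) could be sketched as follows, assuming NumPy. The band-pass here is a single biquad using the widely used "Audio EQ Cookbook" band-pass coefficients, and the integrator is a one-pole leaky integrator; both are stand-ins, since the answer does not specify a filter design.

```python
import numpy as np

def bandpass_biquad(signal, sample_rate, f_lo=85.0, f_hi=255.0):
    """Second-order band-pass (unity gain at the center frequency)."""
    f0 = np.sqrt(f_lo * f_hi)                  # geometric center frequency
    q = f0 / (f_hi - f_lo)
    w0 = 2.0 * np.pi * f0 / sample_rate
    alpha = np.sin(w0) / (2.0 * q)
    b0, b2 = alpha, -alpha                     # b1 is zero for this filter
    a0, a1, a2 = 1.0 + alpha, -2.0 * np.cos(w0), 1.0 - alpha
    out = np.zeros(len(signal))
    x1 = x2 = y1 = y2 = 0.0
    for i, x in enumerate(signal):
        y = (b0 * x + b2 * x2 - a1 * y1 - a2 * y2) / a0
        x2, x1 = x1, x
        y2, y1 = y1, y
        out[i] = y
    return out

def voice_level(signal, sample_rate, time_constant_s=0.2):
    """Band-pass, full-wave rectify, then smooth with a leaky integrator."""
    rectified = np.abs(bandpass_biquad(signal, sample_rate))
    coeff = np.exp(-1.0 / (time_constant_s * sample_rate))
    level = np.empty(len(signal))
    acc = 0.0
    for i, r in enumerate(rectified):
        acc = coeff * acc + (1.0 - coeff) * r  # one-pole smoothing
        level[i] = acc
    return level
```

The returned `level` is the slowly varying indicator; comparing it against calibrated thresholds gives the near/far decision. A steeper filter may be needed if low-frequency room noise leaks into the band.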

There is a lot of relevant research, elements of which you can use in your application, if you search for "Gender Identification" (from audio recordings) or "Audio scene analysis". (These mostly fall into the field of classification, of course.)

This wouldn't help if the goal is to measure how loud someone is speaking, like distinguishing between a loud person far from the mic and a quiet person close to it. — endolith, Apr 30 '12 at 16:34
It also doesn't help if the speaker is outside the critical distance and the mic in the diffuse sound field. Overall energy is dominated by reflected energy which is not a function of distance — Hilmar, May 01 '12 at 09:32
@endolith Yes, this is true but there is "so much" you can do with 1 (possibly calibrated) microphone. "Hilmar": I suppose that what you mean is a way of discriminating if the signal spectrum has some "comb" structure (http://en.wikipedia.org/wiki/Comb_filter) which would imply reflections. However, perhaps this characteristic becomes intense enough to be detected when the speaker is very far from the mic and at specific positions inside a room. Otherwise, the direct "ray" will hit the mic first. In any case, directional effects are diminished when focusing at lower frequencies. — A_A, May 01 '12 at 12:41

score 1 · Answer 3 · answered Sep 21 '12 at 19:57

One possible approach would be to use an echo cancellation algorithm to do the heavy lifting of telling you where the echoes are coming in. Just make sure the code you use is able to give you that information, since it's intermediate. This assumes some reflective surfaces and so on, but since you are just looking for a rough estimate, it may work.

Off the top of my head, I don't know how these algorithms find the delay-time. If you want to do it yourself, you could try autocorrelation, but given the time delays you are likely to be dealing with, I suspect this won't be as efficient as whatever the echo cancellation experts have figured out.
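
For the do-it-yourself route, a minimal autocorrelation sketch might look like this, assuming NumPy. It computes the autocorrelation via the FFT and picks the strongest peak within a lag range. The function name and the lag bounds are hypothetical, and note that voiced speech has strong autocorrelation peaks at the pitch period, so on real recordings the search range would need to exclude those lags.

```python
import numpy as np

def echo_delay_samples(signal, min_lag, max_lag):
    """Lag (in samples) of the strongest autocorrelation peak in [min_lag, max_lag]."""
    x = signal - np.mean(signal)
    n = len(x)
    size = 1 << (2 * n - 1).bit_length()   # zero-pad to avoid circular wrap-around
    spectrum = np.fft.rfft(x, size)
    acf = np.fft.irfft(np.abs(spectrum) ** 2, size)[:n]
    return min_lag + int(np.argmax(acf[min_lag:max_lag + 1]))
```

Dividing the returned lag by the sample rate gives the delay in seconds between the direct sound and the dominant reflection.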
