In general, this would be an Automatic Gain Control problem. For example, an AGC stage is found in many RF receivers to try and maintain a constant level of reception.
The AGC is a simple feedback system that monitors the amplitude (or power) of a signal at its input, compares it with a reference level, and uses the resulting "error" signal to drive an amplifier that compensates for variations in the input signal.
Therefore, in very simple terms and as far as this question is concerned, the "error" signal carries information about how far a source is from the microphone that records it. In fact, the audio world already has a device that deals with this problem.
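To make the feedback loop concrete, here is a minimal sketch in plain Python. It is only one possible arrangement: it measures the level after the gain stage, and the function name `agc` and the parameters `reference`, `rate` and `alpha` are illustrative assumptions rather than part of any particular library.

```python
import math

def agc(samples, reference=0.5, rate=0.01, alpha=0.01):
    """Feedback AGC sketch: returns (output samples, gain history).

    reference, rate and alpha are illustrative values, not tuned ones.
    """
    gain = 1.0
    level = 0.0
    out, gains = [], []
    for x in samples:
        y = gain * x
        # Track the output level with a one-pole smoother on |y|.
        level += alpha * (abs(y) - level)
        # "Error" between the reference level and the measured level.
        error = reference - level
        # The error slowly drives the gain up or down (never below zero).
        gain = max(0.0, gain + rate * error)
        out.append(y)
        gains.append(gain)
    return out, gains

def tone(amplitude, n=20000, freq=200.0, fs=8000.0):
    """Test tone standing in for a quiet (far) or loud (close) source."""
    return [amplitude * math.sin(2.0 * math.pi * freq * i / fs) for i in range(n)]

_, gains_quiet = agc(tone(0.1))
_, gains_loud = agc(tone(1.0))
print(gains_quiet[-1], gains_loud[-1])
```

The quiet tone ends up with a much larger gain than the loud one, so the gain (or the error that drives it) is the quantity that reflects how strong the source is at the microphone.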
A variation of the AGC is the "compressor". The compressor monitors the signal at its input and passes it unchanged to the output as long as its level is below a certain threshold. Once the signal rises above that threshold, the compressor reduces the amplification by a set amount (usually specified as a ratio). The key difference is that this reduction in amplification does not kick in and out instantly. Rather, the so-called "Attack" and "Decay" (often labelled "Release") times determine how "softly" the compressor limits the signal and returns to the original amplification over time.
A properly tuned compressor can be used to compensate for various performance styles or speakers (for example in the studio room of a radio station). People speaking passionately might come close to the microphone and also raise their voice. That's when the compressor politely kicks in and limits the signal from that microphone to avoid distortion and allow other speakers to be heard as well (otherwise you would have to lower the gain of that microphone manually, which is not practical with many microphones).
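As a rough illustration of the threshold and the attack/decay behaviour, the following sketch implements a basic compressor in plain Python. The name `compress` and all default values are assumptions made for the example, and `release_ms` plays the role of the "Decay" time mentioned above.

```python
import math

def compress(samples, fs, threshold=0.5, ratio=4.0,
             attack_ms=5.0, release_ms=100.0):
    """Basic compressor sketch; parameter values are illustrative only."""
    # One-pole smoothing coefficients for a rising and a falling envelope.
    att = math.exp(-1.0 / (fs * attack_ms / 1000.0))
    rel = math.exp(-1.0 / (fs * release_ms / 1000.0))
    env = 0.0
    out = []
    for x in samples:
        mag = abs(x)
        # Follow the level quickly when it rises, slowly when it falls.
        coeff = att if mag > env else rel
        env = coeff * env + (1.0 - coeff) * mag
        if env > threshold:
            # Above the threshold the level grows only 1/ratio as fast (in dB).
            gain = (env / threshold) ** (1.0 / ratio - 1.0)
        else:
            gain = 1.0  # below the threshold the signal passes unchanged
        out.append(gain * x)
    return out

fs = 8000
quiet = compress([0.2] * fs, fs)  # stays below the threshold
loud = compress([1.0] * fs, fs)   # ends up above the threshold
print(quiet[-1], loud[0], loud[-1])
```

The first loud sample still passes at full level because the envelope has not yet reached the threshold; the gain reduction only settles in after roughly the attack time.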
Therefore, a compressor with just one threshold would answer the simple question of whether the speaker is "close to" or "away from" the microphone. More thresholds would give finer-grained control.
In terms of libraries or toolboxes to achieve compression, you could perhaps use Csound's relevant module.
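The threshold idea can be sketched as a small lookup. The function name `proximity_class`, the threshold values and the labels below are purely hypothetical; real thresholds would have to be calibrated for the microphone and the room.

```python
from bisect import bisect_right

def proximity_class(level, thresholds=(0.05, 0.2),
                    labels=("far", "mid", "close")):
    """Map a smoothed signal level to a coarse distance label.

    thresholds and labels are made-up example values; there must be
    exactly one more label than there are thresholds.
    """
    # Count how many thresholds the level has reached or exceeded.
    return labels[bisect_right(sorted(thresholds), level)]

print(proximity_class(0.01), proximity_class(0.1), proximity_class(0.5))
```

With a single threshold and two labels this reduces to the "close or away" decision; adding thresholds adds intermediate classes.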
However, for this application you might want to focus on the frequency band of the fundamental frequency of the voice (as a better indicator of what the speaker is doing, how many speakers there are, etc.).
Therefore, you can take your input signal, filter it with a simple band-pass filter (with a passband of roughly 85 Hz to 255 Hz), rectify the filtered signal (to get a measure of its level) and then pass it through an integrator with a high enough time constant (something like 100 ms to 300 ms, for example). This will return a slowly varying, smooth signal that you can use as an indicator of how far a "human voice" is from a microphone, based on the amplitude of the signal. The integrator will smooth your signal enough to avoid sudden "jumps" in amplitude (in any case, the speaker is not going to go from "far" to "close" in 2 ms).
There is a lot of relevant research, elements of which you can use in your application, if you search for "Gender Identification" (from audio recordings) or "Audio scene analysis". (These mostly fall within the field of classification, of course.)
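The chain described above (band-pass, rectify, integrate) could look roughly like this in plain Python. This is a sketch under assumptions: the band-pass is a single second-order section using the well-known audio EQ cookbook formulas, the integrator is a leaky one-pole smoother, and the name `voice_level` and its defaults are invented for the example.

```python
import math

def voice_level(samples, fs, f_lo=85.0, f_hi=255.0, tau=0.2):
    """Return a slowly varying level of the f_lo..f_hi band (sketch)."""
    # Second-order band-pass centred on the geometric mean of the band.
    f0 = math.sqrt(f_lo * f_hi)
    q = f0 / (f_hi - f_lo)
    w0 = 2.0 * math.pi * f0 / fs
    alpha = math.sin(w0) / (2.0 * q)
    a0 = 1.0 + alpha
    b0, b2 = alpha / a0, -alpha / a0  # b1 is zero for this filter
    a1, a2 = -2.0 * math.cos(w0) / a0, (1.0 - alpha) / a0
    # Leaky integrator coefficient for a time constant of tau seconds.
    k = 1.0 - math.exp(-1.0 / (fs * tau))
    x1 = x2 = y1 = y2 = 0.0
    level = 0.0
    out = []
    for x in samples:
        y = b0 * x + b2 * x2 - a1 * y1 - a2 * y2  # band-pass filter
        x2, x1 = x1, x
        y2, y1 = y1, y
        level += k * (abs(y) - level)  # rectify, then smooth
        out.append(level)
    return out

def tone(amplitude, freq, fs=8000, seconds=2.0):
    """Synthetic test tone standing in for a recorded voice."""
    n = int(fs * seconds)
    return [amplitude * math.sin(2.0 * math.pi * freq * i / fs) for i in range(n)]

near = voice_level(tone(1.0, 150.0), 8000)   # loud tone inside the band
far = voice_level(tone(0.25, 150.0), 8000)   # same tone, four times weaker
print(near[-1], far[-1])
```

The output scales with the amplitude of the in-band signal and largely ignores content outside the band, which is the property the distance indicator relies on. A sharper filter (for example a higher-order design) may be preferable in practice.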