
I read in some places that music is mostly sampled at 44.1 kHz, whereas we can only hear up to 20 kHz. Why is that?

Soham De
  • Younger people can hear higher frequencies. Other recording techniques use up to 48 kHz. – Thorbjørn Ravn Andersen Mar 06 '17 at 13:42
  • Nyquist theorem: you need at least two samples per cycle to tell the frequency of a wave. – mathreadler Mar 06 '17 at 20:43
  • Because processors are faster and memory is cheap, but good analog filters are still tricky, even higher sample rates (96 or 192 kHz) can make sense as well. – Nick T Mar 06 '17 at 22:25
  • @ThorbjørnRavnAndersen I think 48 kHz is common because it's divisible by the 24, 25, and 30 fps rates used in video production; 24 doesn't go evenly into 44100. That's what Wikipedia mentions. – Nick T Mar 06 '17 at 22:28
  • @NickT At the time the CD came out, none of that was true. In fact, 44.1 was reduced from the "normal" 48 to save bits. – JDługosz Mar 07 '17 at 00:14
  • @SohamDe This is because if you sample a 20 kHz audio signal at exactly 20 kHz, you would hear nothing at all. Picture it: a sine wave that peaks every 1/20,000 second. If you sample it at that exact rate, you only ever sample the peaks (or the nodes, or whatever level you happen to land on), so when you recreate the signal from the digital samples, all you get is a flat line; see the sketch after these comments. This concept is called aliasing, and it means you must sample at at least twice the maximum frequency you want to be able to hear. 44,100 Hz is also convenient because it is divisible by a power of 2. – MichaelK Mar 08 '17 at 10:44
  • @MichaelK If you make that into an answer, you have my +1 guaranteed. That's a lot clearer for someone like me without a ton of experience in this. Add some pictures and you've got a top-tier answer in the works. – Cullub Mar 08 '17 at 20:38
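As a quick illustration of MichaelK's flat-line point, here is a minimal numpy sketch (my own, not from the thread; the phase offset is an arbitrary choice): sampling a 20 kHz sine at exactly 20 kHz lands on the same point of every cycle, so the samples come out constant and the tone aliases to DC.

```python
import numpy as np

fs = 20_000        # sampling rate equal to the tone frequency (Hz)
f = 20_000         # 20 kHz sine wave
n = np.arange(10)  # ten consecutive sample indices

# Every sample lands on the same point of the sine's cycle,
# so the sampled "signal" is a flat line (the tone aliases to DC).
samples = np.sin(2 * np.pi * f * n / fs + 0.3)  # arbitrary phase offset
print(samples)     # ten identical values
```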

6 Answers

  1. The sampling rate of a real signal needs to be greater than twice the signal bandwidth. Audio practically starts at 0 Hz, so the highest frequency that can be represented in audio sampled at 44.1 kHz is 22.05 kHz (a 22.05 kHz bandwidth).
  2. Perfect brickwall filters are mathematically impossible, so we can't just perfectly cut off frequencies above 20 kHz. The extra 2 kHz is for the roll-off of the filters; it's "wiggle room" in which the audio can alias due to imperfect filters, but we can't hear it.
  3. The specific value of 44.1 kHz was compatible with both PAL and NTSC video frame rates used at the time.

Note that the rationale is published in many places: Wikipedia: Why 44.1 kHz?
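To make point 2 concrete, here is a minimal numpy sketch (my own illustration with example frequencies of my choosing, not part of the original answer): a 24 kHz tone that leaks past an imperfect anti-aliasing filter and is sampled at 44.1 kHz yields exactly the same samples as a 20.1 kHz tone, folding back around $f_s/2 = 22.05$ kHz into the audible range.

```python
import numpy as np

fs = 44_100            # CD sampling rate (Hz)
n = np.arange(64)      # sample indices

f_high = 24_000        # tone above fs/2 = 22.05 kHz
f_alias = fs - f_high  # folds back to 20.1 kHz

x_high = np.cos(2 * np.pi * f_high * n / fs)
x_alias = np.cos(2 * np.pi * f_alias * n / fs)

# The two sampled sequences are indistinguishable:
print(np.allclose(x_high, x_alias))  # True
```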

endolith
  • Hi, I really agree with your answer, but the "twice the highest frequency" thing bites beginners very soon, because Nyquist is about bandwidth, not highest frequency; I went ahead and slightly modified your answer. Please check if it's OK with you. – Marcus Müller Mar 05 '17 at 06:28
  • But what is the extra 50 Hz for? – Ruslan Mar 05 '17 at 09:19
  • @Ruslan: Wikipedia is quite good about it. – jojeck Mar 05 '17 at 11:58
  • From the Wiki article, "In the early 1980s, a 32 kHz sampling rate was used in broadcast" ... should read "from the late 1960s" ... – user_1818839 Mar 05 '17 at 13:27
  • @BrianDrummond So edit it? – endolith Mar 05 '17 at 17:20
  • @MarcusMüller The beginner who's bitten by "Nyquist is the highest allowed frequency" will get bitten anyway, by aliasing artifacts... After that, they'll also understand how any range of frequencies of bandwidth $\Delta f$ is demodulated to one between $0$ and $\Delta f = f_s/2$. – leftaroundabout Mar 05 '17 at 22:12
  • Not only are brick-wall filters impossible, they wouldn't always be desirable even if they were possible. If one had a brick-wall filter at precisely 20,000 Hz and passed through it a sawtooth that was frequency-modulated with a center frequency of 2,000 Hz, the signal would repeatedly switch between having nine harmonics and having ten, with the tenth harmonic, when present, being about 5 dB down from the fundamental. Even if one couldn't hear the 20,000 Hz harmonic directly, any harmonic distortion present in the output could create some aliasing at lower frequencies. – supercat Mar 05 '17 at 22:23
  • One might not be able to tell the difference between a 19,999.9 Hz sound with ten harmonics and a 20,000.1 Hz sound with nine if one heard them separately, but that doesn't mean a transition between the two would not be audible. Having a filter with a more gradual cut-off would avoid such issues. – supercat Mar 05 '17 at 22:26
  • @supercat Even in those cases, if it were possible, the brick-wall filter would still be desirable as the sampling filter. Its job is just to avoid aliasing; what you're talking about is the job of an equalization filter. – Steve Cox Mar 06 '17 at 00:57
  • Yes, Steve is right: aliasing artifacts depend on the sampling. If we have a low-pass filter built into it, then maybe we can get away without aliasing artifacts. – mathreadler Mar 06 '17 at 20:44
  • @Ruslan Re "But what is the extra 50 Hz for?": find the prime factors of 44100, and you will get a fun surprise. – Solomon Slow Mar 08 '17 at 20:35
  • Nothing above made sense to me. So does this mean that, in July 2021, I won't be able to tell apart Apple Music's existing quality (AAC at 256 kbps) and Apple Music Lossless (ALAC up to 24-bit/44.1 kHz)? – Naveen Reddy Marthala Jul 28 '21 at 15:42
  • @NaveenReddyMarthala That's not related to sampling rate. ALAC is lossless compression, AAC is lossy compression. – endolith Jul 28 '21 at 16:14
  • OK @endolith, leaving that aside: I have been trying to understand whether it would make sense to invest in better audio equipment, and stumbled upon this thread. So, as the question states, would I (a 24-year-old) be able to hear a difference? – Naveen Reddy Marthala Jul 28 '21 at 17:18
  • @NaveenReddyMarthala Probably not, but you can do tests like http://abx.digitalfeed.net/ – endolith Jul 28 '21 at 19:52
  • @NaveenReddyMarthala This is all off-topic, but the difference between lossless and lossy compression isn't going to be heard in the frequency response; it's going to be heard in wideband noise like cymbals and distorted guitar not having all of their frequency content and sounding warbly. – endolith Jul 29 '21 at 15:34

44,100 was chosen by Sony because it is the product of the squares of the first four prime numbers. This makes it evenly divisible by many other whole numbers, which is a useful property in digital sampling.

$44100 = 2^2 \times 3^2 \times 5^2 \times 7^2$

As you've noticed, 44,100 is also just above twice the limit of human hearing. The "just above" part gives the anti-aliasing filters some leeway, making them less expensive (fewer chips rejected).

As Russell points out in the comments, the "divisible by many other whole numbers" aspect had an immediate benefit at the time the sample rate was chosen. Early digital audio was recorded on existing analog video recording media which supported, depending on region, either the NTSC or PAL video spec. NTSC and PAL have different lines-per-field and fields-per-second rates, yet combined with the number of samples stored per line, both formats work out to exactly 44,100 samples per second.
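A quick sanity check of both claims in a short Python sketch of mine (the lines-per-field and samples-per-line figures are taken from the Wikipedia article Russell links below):

```python
from math import prod

# 44,100 is the product of the squares of the first four primes.
assert prod(p * p for p in (2, 3, 5, 7)) == 44_100

# Samples stored per second on video equipment:
# 3 samples per line x usable lines per field x fields per second.
ntsc = 3 * 245 * 60  # NTSC: 245 usable lines/field, 60 fields/s
pal = 3 * 294 * 50   # PAL: 294 usable lines/field, 50 fields/s
print(ntsc, pal)     # 44100 44100
```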

dotancohen
  • The choice wasn't simply about getting many prime factors, but specifically to make good use of NTSC and PAL video recording equipment to store digital masters. https://en.wikipedia.org/wiki/44,100_Hz#Recording_on_video_equipment – Russell Borogove Mar 06 '17 at 00:59
  • @RussellBorogove: Thank you. As per the Wiki link, 44100 follows from the line and field rates of both the NTSC and PAL video formats. That is quite a direct consequence of being a number with so many factors, and I do believe you are right that the horse led the cart on this spec. – dotancohen Mar 06 '17 at 06:32
  • Divisible by many numbers, but not by 8 :) – Bogdan Alexandru Mar 06 '17 at 16:02
  • (Wikipedia says a variety of rates from 40.5 to 46.8 kHz would have met these criteria, and 44.1 kHz was chosen to provide a transition band for the anti-aliasing filter.) – endolith Mar 07 '17 at 19:12
  • @BogdanAlexandru Also not divisible by 1 ms USB frames :D – endolith Mar 08 '17 at 20:15
  • @endolith Good point – Bogdan Alexandru Mar 08 '17 at 20:53

The Nyquist rate, twice the bandlimit of a baseband signal, is the rate you must exceed when sampling if you want to capture the signal without ambiguity (i.e. aliasing).

Sample at a rate lower than twice 20 kHz and, due to aliasing, you won't be able to tell the difference between very high and very low frequencies just from looking at the samples.

Added: Note that any finite-length signal has infinite support in the frequency domain and thus is not strictly bandlimited. This is yet another reason why sampling any non-infinite audio source a bit above twice its highest frequency component (in a baseband signal) is required to avoid significant aliasing, beyond just the finite transition roll-off of real filters.
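A minimal numpy sketch of that ambiguity (my own, with arbitrary example frequencies): sampled at 30 kHz, below twice the 20 kHz bandlimit, a 19 kHz tone and an 11 kHz tone produce identical samples, so nothing downstream of the sampler can tell them apart.

```python
import numpy as np

fs = 30_000  # too low for 20 kHz audio (Nyquist demands > 40 kHz)
n = np.arange(100)

x_hi = np.cos(2 * np.pi * 19_000 * n / fs)  # tone above fs/2 = 15 kHz
x_lo = np.cos(2 * np.pi * 11_000 * n / fs)  # its alias: fs - 19 kHz

print(np.allclose(x_hi, x_lo))  # True: identical samples
```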

hotpaw2
  • Hi, I really agree with your answer, but the "twice the highest frequency" thing bites beginners very soon, because Nyquist is about bandwidth, not highest frequency; I went ahead and slightly modified your answer. Please check if it's OK with you. – Marcus Müller Mar 05 '17 at 06:29
  • @MarcusMüller Because "beginners" to sampling start with sampling baseband signals, not passband signals, it really is about the highest frequency (sometimes called the "bandlimit") and not bandwidth (which has an additional ambiguity regarding one-sided versus two-sided bandwidth). – robert bristow-johnson Mar 05 '17 at 06:43
  • @robertbristow-johnson I hadn't looked at that ambiguity. Hm; I like the bandlimit approach! – Marcus Müller Mar 05 '17 at 06:46
  • In the Wikipedia article we call it "$B$" and, although Shannon said $f_\text{s} \ge 2B$ is sufficient, he was assuming finite energy, so no sinusoids (which have infinite energy and put Dirac deltas at $\pm B$). If you allow a sinusoid right at frequency $B$, then it's the more oft-stated $f_\text{s} > 2B$. – robert bristow-johnson Mar 05 '17 at 06:54

Basically, twice the bandwidth is a common requirement for signal sampling, thus $2\times 20 = 40$ kHz is a minimum. Then, a little more is useful to cope with imperfect filtering and quantization. Details follow.

What you need in theory is not what is required in practice. This goes along with the quote (attributed to many):

In theory there is no difference between theory and practice. In practice there is.

I am not an expert in audio, but I have been trained by high-quality audio sampling/compression people. My knowledge might be rusty, take it with caution.

First, standard sampling theory works under some assumptions: linear systems and time invariance. A continuous bandlimited phenomenon can then, in theory, be sampled without loss at twice its bandwidth (or twice the maximum frequency for baseband signals). The "Nyquist rate" is often defined as:

the minimum rate at which a signal can be sampled without introducing errors

This is the analysis part of the "sampling theorem", and the "can be" is important. There is also a synthesis part: the continuous signal "can be reconstructed" from its samples using cardinal sines, as sketched below. This is not the only technique, and it does not take into account low-pass prefiltering, non-linear effects (such as quantization or saturation) or other time-variant factors.
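A minimal numpy sketch of that cardinal-sine (Whittaker–Shannon) reconstruction, under idealized assumptions; the sample rate, window and test tone are arbitrary choices of mine:

```python
import numpy as np

fs = 8.0                         # sample rate (arbitrary demo units)
T = 1 / fs
n = np.arange(-32, 33)           # finite window of sample indices
x_n = np.sin(2 * np.pi * n * T)  # samples of a 1 Hz sine

# Reconstruct on a dense time grid as a sum of shifted cardinal sines:
#   x(t) = sum_n x[n] * sinc((t - n*T) / T)
t = np.linspace(-1.0, 1.0, 500)
x_t = np.sum(x_n[:, None] * np.sinc((t[None, :] - n[:, None] * T) / T),
             axis=0)

# Away from the window edges, the reconstruction matches the original tone.
print(np.max(np.abs(x_t - np.sin(2 * np.pi * t))))  # small error
```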

Human hearing is not a simple topic. It is accepted that humans hear frequencies from 20 Hz up to 20,000 Hz, but such precise bounds in Hertz are not a trait of nature shared by all humans. A gradual loss of sensitivity to higher frequencies is common with age. On the other side:

Under ideal laboratory conditions, humans can hear sound as low as 12 Hz and as high as 28 kHz, though the threshold increases sharply at 15 kHz in adults

Hearing is not linear: there are hearing and pain thresholds. It is not time-invariant: there are masking effects in both time and frequency.

If 20 Hz to 20,000 Hz is a common range, and 40,000 Hz should theoretically suffice, a little extra is still needed to cope with additional distortion. A rule of thumb says that 10% more is enough ($2.2\times$ the signal bandwidth), and 44,100 Hz just clears that bar ($2.2 \times 20{,}000 = 44{,}000$). This goes back to the late 1970s. Why not 44,000 Hz? Mainly because of standards, set by the popularity of CDs, whose technology is, as always, based on trade-offs. In addition, 44,100 is the product of the squares of the first four prime numbers ($2^2 \times 3^2 \times 5^2 \times 7^2$), hence has small factors, which is beneficial for computations (like the FFT).

So from $2 \times 20$ kHz to $44.1$ kHz (and its multiples), we have a balance of safety, quantization, usability, computation and standards.

Other options exist: the DAT format, for instance, was released with 48 kHz sampling, which initially made conversion between the two rates difficult. 96 kHz is discussed with respect to quantization (or bit depth) in What sample rate and bit depth should I use? This is a controversial subject; see 24 bit 48 kHz versus 24 bit 96 kHz. You can check Audacity's sample rates, for instance.

Laurent Duval
  • 1. The answer to the question is that the Nyquist theorem dictates > 40 kHz, not > 20 kHz. 2. Neither human hearing nor the CD format is limited to 20 Hz at the low end. Any large enough pipe organ can produce a 16 Hz tone, and CD can reproduce it easily. Some organs go down to 8 Hz, which starts to be perceived as individual vibrations, but which again CD can reproduce. – user207421 Mar 05 '17 at 22:20
  • I do agree with your comment, except for "dictates" (this is an "if" condition). Could you point out where I have deviated from it? – Laurent Duval Mar 06 '17 at 07:40
  • I have just one supplement to @LaurentDuval's answer. Speech, music, and sound in general are non-stationary signals. Although these are effectively bandlimited, we do not yet know how the human ear transduces the continuous-time signal into the nerve firings that facilitate our perception of sound. It is often argued that some people have "golden ears" and can tell the difference between 44.1 kHz and 96 kHz recordings. Also, though I have yet to confirm the following, it seems higher sampling rates benefit the perception of additional cues, such as localization in binaural recordings. – Neeks Mar 07 '17 at 11:49