Bandwidth of an entire song

Question

My question has to do with the difference between the frequencies of a single note, and the frequencies of an entire song.

If I have a 5 second signal of the form: $x(t)=\sin(8\pi t)$, here is the frequency response with zero-padding:

For a signal of the form $x(t)=\sin(8\pi t)\sin^2(\pi t/5)$, here is how this looks:

Here is the frequency response with zero-padding:

My intuition about the second signal is that it is a 4Hz tone that gets louder and then quieter again. In both cases, the highest frequency contained in the signal is higher than 4Hz, even though it was a 4Hz tone. This indicates to me that the frequencies that we hear are not the same as the frequencies in the Fourier transform. This further indicates that just because the highest note of a song may be at 16kHz does not mean that the bandwidth of the song is at -16kHz to +16kHz and that 32kHz sampling rate is sufficient. Music is typically recorded at 44.1kHz, but is the bandwidth of a song really -22.05kHz to +22.05kHz, even if every individual note is? I took the FFT of Bad Habits by Ed Sheeran out of curiosity. The highest frequency component actually appears to be at 22.05 kHz. Doesn't this indicate that if we had a higher sampling rate than 44.1kHz, we would have seen higher frequency components of the music? In other words, the FFT looks like higher frequency components got "cut off" by a low sampling rate.

My second question is about understanding how a note played during a song will affect the FT of the song. Without zero-padding, the first signal is purely 4Hz and the second signal has nonzero components at 4Hz and the two adjacent bins. With zero-padding, it appears there are a large number of nonzero bins in each, and the second signal actually appears as the first with sidelobe suppression. This seems significant to me because if an 8kHz tone, played for one second, appears in a 3 minute song, it would not affect the FT of the song by adding a pure 8kHz tone in. I think it would appear as an 8kHz tone of one second duration, zero-padded to a 3 minute duration (and therefore including the sidelobes), since the sidelobes are important to destructively interfere with the note outside of the timeframe it is supposed to be played at. Is this correct?

Edit: I just remembered something probably critical. Any signal of finite time necessarily has infinite bandwidth. If the highest tone in a song is 16kHz, then the highest frequency component of the whole song would be a "smeared" 16kHz, and some sidelobes will be cut off when sampling at 44.1kHz. Therefore the DFT is lossy. Part of my confusion is probably because I read elsewhere on the internet that the DFT is lossless, but I am thinking now that must be wrong since all real signals are of finite time/infinite bandwidth, therefore all real signals must have an infinite sampling rate to be truly lossless. Is this correct?

Edit #2: Envidia pointed out that I had forgotten to fftshift Bad Habits. It definitely looks better now.

Did you make sure to fftshift() your signal so that the DC component is at the center? — Envidia, Jan 08 '24 at 21:19
Rather than apply to unfiltered audio, the sampling is often applied after a filter. It would be common to first apply a low-pass filter near 20kHz, and only then sample at 44kHz. In that case, there are no higher frequencies to lose. — BowlOfRed, Jan 09 '24 at 16:18
"In other words, the FFT looks like higher frequency components got "cut off" by a low sampling rate." Technically, the higher frequencies get aliased by the low sampling rate. They are cut off by the FT. If you had a signal that consisted solely of frequencies between 50 kHz and 60kHz, then you'd be able to recover the signal from a properly configured FT of a 44.1kHz sample. — Acccumulation, Jan 10 '24 at 00:39

Jdip · Accepted Answer · 2024-01-22T18:39:48.037

First of all, kudos to you: I appreciate the effort and thinking you've managed to articulate in your question.

The DFT is a mathematical tool. As such, the parameters used to compute it can hide or reveal information that is there or not there. For example, zero-padding will reveal side-lobes for a single tone but that is just an artifact of the DFT of a finite-length sequence. You don't "hear" these side lobes when listening to a single tone. You only hear that single tone (a perfect frequency peak if the parameters of the DFT are chosen appropriately). As another example, if the parameters aren't set correctly, giving poor frequency resolution, the DFT will hide information from you.

That being said, case-in-point:

This indicates to me that the frequencies that we hear are not the same as the frequencies in the Fourier transform.

$5$ seconds is too short to see the expected peaks in the frequency domain your signal has: recall that $$\sin(A)\sin(B) = \frac{1}{2}\big(\cos(A-B)-\cos(A+B)\big)$$ In your case, the FFT should therefore show peaks at $3.8\,\tt{Hz}$, $4\,\tt{Hz}$ and $4.2\,\tt{Hz}$.
Set your parameters appropriately: increase the duration to $10$ seconds and you should see:

Music is typically recorded at 44.1kHz, but is the bandwidth of a song really -22.05kHz to +22.05kHz?

I think you already know the answer: in the analog world, no. Some details:
Harmonics are integer multiples of a fundamental frequency. For example, if a pianist bangs on a middle $A$, which has fundamental frequency of $440 \,\tt{Hz}$, its harmonics would be at $880\,\tt{Hz}$ (2nd harmonic), $1320\,\tt{Hz}$ (3rd harmonic), $1760\,\tt{Hz}$ (4th harmonic), and so on. There's technically no upper limit to how high harmonics can go in terms of frequency. However, the amplitude (~loudness) of these harmonics generally decreases as the frequency increases, making them less significant in the overall sound, especially as they move out of the range of human hearing.

Which brings us to:

The highest frequency component actually appears to be at 22.05 kHz. Doesn't this indicate that if we had a higher sampling rate than 44.1kHz, we would have seen higher frequency components of the music?

Correct! And some audio systems claim to use $96\,\tt{kHz}$ for that exact purpose (although I have a hard time believing anyone that claims they can hear the difference with standard $44.1\,\tt{kHz}$ or $48\,\tt{kHz}$, unless they'd want their dogs to be able to?) - edit: there are other purposes to higher sampling rates such as post-processing considerations and less stringent anti-aliasing/reconstruction filter requirements, but that's outside the scope of your question, see comments -. As far as quality of playback is concerned, there's no point in going higher because: 1. Humans can not hear past $\approx 20\,\tt{kHz}$, and 2. as previously mentioned, the harmonics will generally be well below our hearing threshold anyways.

if an 8kHz tone, played for one second, appears in a 3 minute song, [...] it would appear as an 8kHz tone of one second duration, zero-padded to a 3 minute duration (and therefore including the sidelobes), since the sidelobes are important to destructively interfere with the note outside of the timeframe it is supposed to be played at. Is this correct?

You are correct that it would be interpolated because of the length of your DFT. But again, this is only the result of a mathematical operation. The side-lobes would interfere with other frequencies side-lobes, but this is not typically referred to as "destructive interference" in the context of Fourier analysis. Instead, it's an aspect of the spectral leakage caused by the finite duration of the signal.

Finally:

Part of my confusion is probably because I read elsewhere on the internet that the DFT is lossless, but I am thinking now that must be wrong since all real signals are of finite time/infinite bandwidth, therefore all real signals must have an infinite sampling rate to be truly lossless. Is this correct?

Yes and no. The term "lossless" in the context of DFT refers to the fact that, theoretically, the DFT of a signal and its inverse can reconstruct the original signal exactly, without any information loss. But like I said earlier, in real-world scenarios, signals are band-limited, and practical sampling rates are chosen to capture the essential frequency content of these signals. So while it's theoretically true that an infinite sampling rate is required to capture all frequency components of a finite-duration signal, in practice, a sufficiently high sampling rate (like $44.1 \, \tt{kHz}$ for audio) is adequate to capture all the significant components within the human hearing range.

Thanks Jdip, this helps a lot. I am still confused about one part though. My question will not fit in a comment so I will edit my original question to include it. — Levi, Jan 09 '24 at 20:06
Hi @Levi, I hope you don’t mind, but I think your edit should indeed be asked instead in a comment, or rather a chat room, because it’s not really adding anything more to your original question. I’ve edited your question and opened a chat room here: https://chat.stackexchange.com/rooms/150771/bandwidth-of-a-song — Jdip, Jan 09 '24 at 22:35
(The reason that professional production often uses 96–192kHz and 24–32 bits is not so you can hear any improvement over 48/16 in the final result, but because of all the processing along the way — amplification, EQ, mixing, and the battery of other sound-processing modules/plugins, all of which can greatly magnify the limitations of 48/16. Maintaining much higher quality right up to the end avoids that, giving the best end result, even if it's then downsampled. It also lets you control how that downsampling happens, e.g. using dithering to make the most of those bits.) — gidds, Jan 09 '24 at 23:01
(Feel free to add my comment, or something along those lines, into the answer itself. Comments are ephemeral — answers are forever!) — gidds, Jan 09 '24 at 23:22
@gidds Just extending what is said, it makes sense to record and play the audio at maybe 24-bit 96 or 192 kHz even if your analog bandwidth is about 20 kHz, because it makes cheaper analog electronics to move the most difficult parts of anti-aliasing and reconstruction filters into digital domain. It also may make sense to store these intermediate results in a multi-track audio editor when making a song. For storing and distributing the final song, anything beyond human hearing is useless, so 44.1 or 48 kHz at 16 bits should be enough. Just like modern videos don't contain infrared or UV. — Justme, Jan 09 '24 at 23:35
fair point as well @Justme, I've edited my answer accordingly, although I think both your comments are leading us a little away from the OP's question's scope which is only concerned with the effect of cutting higher frequencies from the playback bandwidth. Does cutting frequencies above 20Khz affect the listening experience? I would argue it does not. — Jdip, Jan 09 '24 at 23:50
@Jdip If the sampling rate is 44.1 kHz, as every sample can differ from previous, it will have energy at 22.05 kHz. If we changed to any sampling rate like 192 kHz, each sample still differs a bit from previous so has energy at 96 kHz. So capable of storing info humans are not generally able to hear. Should not affect the listening experience in general, typically meaning, not able to hear sounds above 20 kHz. Which is a good assumption for an average person. — Justme, Jan 10 '24 at 00:16
Right, that’s just my point. As far as the listening experience is concerned, higher sampling rates are unnecessary. — Jdip, Jan 10 '24 at 00:20
@Justme A/D and D/A converters can use digital processing to simplify the analog filtering stages without needing to store or transmit a redundantly high sampling rate signal. I.e. the converters can be "black boxes" that accept or produce 44.1kHz digital audio on one end, and analog audio on the other end, and do whatever resampling, delta-sigma, digital filtering is required internally, in addition to a simple analog filtering stage. — Knut Inge, Jan 10 '24 at 05:05

Bandwidth of an entire song

1 Answers1