New to audio processing, why is my data so lopsided when I try to break into frequency ranges?

Question

I have audio data from a WAV file. I split it into 1024-sample windows (no overlap) and ran an FFT on each one.

If I visualize this data it actually looks pretty good, but the problem is that the data is really lopsided. It seems like buckets 1-3 get good activity, but they're generally much larger than buckets 4-8, so when I visualize the data I have to have a weird conditional multiplier on the higher frequency buckets so I see some activity.

So then, what is the proper way to break my fft into frequency buckets? A simple explanation would be best. Thank you!

i think you would be well-advised to window and overlap your data. if you want your frame hop to remain 1024 samples, i would suggest each frame to be 2048 samples and to window it, perhaps, with a Hann window (if you want your overlapping frames to add to the original). just a friendly suggestion. — robert bristow-johnson, Mar 16 '18 at 02:08
@robertbristow-johnson how do I apply a windowing function to my data? Is it the fft*window_function, or something else? I've not been able to find good information about what general equation to use. — Morgan Usley, Mar 17 '18 at 05:23
i'm poking around. take a look at this. (but it's about non-overlapping rectangular windowing, which appears what you're presently doing) here's something else. here is something incomplete. — robert bristow-johnson, Mar 17 '18 at 08:11
perhaps it exists, but we need a good comprehensive resource here on windowing. — robert bristow-johnson, Mar 17 '18 at 08:13
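
On the windowing question in the comments above: the window is normally applied in the time domain, sample by sample, before the FFT is taken, rather than multiplied into the FFT output. A minimal sketch in plain Python, assuming a periodic Hann window, 2048-sample frames and a 1024-sample hop as suggested above; the helper names `hann` and `windowed_frames` are made up for illustration:

```python
import math

def hann(n_points):
    """Periodic Hann window; copies shifted by n_points/2 add up to 1."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * n / n_points)
            for n in range(n_points)]

def windowed_frames(samples, frame_len=2048, hop=1024):
    """Yield overlapping frames, each multiplied by the window in the time domain."""
    window = hann(frame_len)
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        yield [s * w for s, w in zip(frame, window)]

# Each windowed frame is what would then be passed to the FFT.
# Toy input (an assumption): 4096 samples of constant value 1.0.
frames = list(windowed_frames([1.0] * 4096))
```

With this choice of window and hop, the overlapping parts of neighbouring frames sum back to the original samples, which is the property the comment refers to.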

dsp_user · Answer 1 · 2018-03-15T14:36:23.490


Most of the energy in an audio or speech signal is almost always found in the lower bands (roughly below 1 kHz), so the lopsided shape that you're observing is not surprising.

Let me just add some more information regarding fft frequency bins. These bins can be organized into larger buckets if you need this kind of representation.

For an FFT length of N = 1024, you should end up with 513 frequency bins (N/2+1), indexed 0 to 512. The first bin (index 0) corresponds to DC (0 Hz) and is usually ignored; the last bin (index N/2 = 512) sits at the theoretical Nyquist frequency and is also ignored.

The bandwidth of a frequency bin is defined as

BW_bin =  Sampling_rate/FFT length;

Note that while the sampling rate isn't necessary to compute the FFT, it is needed to calculate the bandwidth (frequency resolution).
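
As a small worked example of the bin bandwidth formula above, the following sketch maps bin indices to frequencies. The 44100 Hz sampling rate is an assumption (the question does not state it), and the function names are hypothetical:

```python
def bin_width_hz(sample_rate, fft_len):
    """Bandwidth (frequency resolution) of one FFT bin."""
    return sample_rate / fft_len

def bin_frequency_hz(k, sample_rate, fft_len):
    """Centre frequency of bin k."""
    return k * sample_rate / fft_len

sample_rate = 44100  # assumed; use the rate stored in the WAV header
fft_len = 1024

print(bin_width_hz(sample_rate, fft_len))                    # 43.06640625
print(bin_frequency_hz(fft_len // 2, sample_rate, fft_len))  # 22050.0
```

So under that assumption each bin covers roughly 43 Hz, and the last bin lands on half the sampling rate.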

To get the activity (magnitudes in dB), you can use the following equation (one way to compute the magnitudes)

Mag[i] = 20*log10(sqrt(2*(Real[i]^2+Img[i]^2))/fftNorm); //iterate through bins [1-511]

fftNorm depends on the kind of window function used (https://en.wikipedia.org/wiki/Window_function) and is simply N (the FFT length) in the case of a rectangular window (no window). The factor 2 in the equation accounts for the upper (discarded) half of the FFT.
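
One way the dB computation could look in plain Python is sketched below. It assumes a rectangular window (so fftNorm = N), works on the power 2*(Re^2+Im^2)/fftNorm^2 with 10*log10 (equivalent to 20*log10 of the magnitude), and adds a floor so that exactly-zero bins do not break the logarithm. The floor value, the direct DFT and the function names are illustrative assumptions; in practice a library FFT such as `numpy.fft.rfft` would replace `dft_bin`:

```python
import cmath
import math

def dft_bin(frame, k):
    """Single DFT bin computed directly (slow, but dependency-free)."""
    n_points = len(frame)
    return sum(x * cmath.exp(-2j * math.pi * k * n / n_points)
               for n, x in enumerate(frame))

def magnitude_db(bin_value, fft_norm, floor_db=-120.0):
    """dB level of one bin; the factor 2 accounts for the discarded upper half."""
    power = 2.0 * (bin_value.real ** 2 + bin_value.imag ** 2) / fft_norm ** 2
    if power <= 0.0:
        return floor_db  # avoid log10(0) for exactly-zero bins
    return max(floor_db, 10.0 * math.log10(power))

# Test tone: unit-amplitude sine landing exactly on bin 5 of a 64-point frame.
n_points = 64
frame = [math.sin(2.0 * math.pi * 5 * n / n_points) for n in range(n_points)]
level = magnitude_db(dft_bin(frame, 5), fft_norm=n_points)
```

For this test tone the level comes out at about -3 dB, the RMS level of a unit-amplitude sine.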

For visualization purposes, you can now easily combine several bins into a group of bins corresponding to a particular frequency range.
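
A possible way to do that grouping is sketched below: each bucket is defined by a pair of frequency edges, and every bin whose centre frequency falls inside the pair contributes to that bucket's mean power. The band edges, the toy data and the name `bucket_levels` are assumptions for illustration only:

```python
def bucket_levels(bin_powers, sample_rate, fft_len, band_edges_hz):
    """Mean power of the bins whose centre frequency falls in each band."""
    levels = []
    for low, high in zip(band_edges_hz[:-1], band_edges_hz[1:]):
        members = [p for k, p in enumerate(bin_powers)
                   if low <= k * sample_rate / fft_len < high]
        levels.append(sum(members) / len(members) if members else 0.0)
    return levels

# Toy input: 9 bins of an assumed 16-point FFT at 16 Hz, i.e. 1 Hz per bin.
powers = [0.0, 8.0, 4.0, 2.0, 2.0, 1.0, 1.0, 1.0, 1.0]
levels = bucket_levels(powers, sample_rate=16, fft_len=16,
                       band_edges_hz=[1, 2, 4, 8])
# levels is [8.0, 3.0, 1.25]: one value per octave-wide band.
```

Octave-spaced edges such as these give each bucket twice as many bins as the previous one, which tends to suit audio better than equal-width buckets.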

edited Mar 15 '18 at 14:36

answered Mar 15 '18 at 07:32

dsp_user


As asked by OP, does your answer clarify why the data is lopsided ? from answer one can understand the method of visualizing data, but reason for it being lopsided is still missing. – Arpit Jain Mar 15 '18 at 11:34

I said that most energy will still be in the lower bands but you're right, I'll edit my answer to emphasize that. – dsp_user Mar 15 '18 at 12:01
@Morgan Usley I thought you were already doing that. For instance, you can take magnitudes from 10 consecutive bins and put them in a separate bucket (sum all the mags and divide by 10). This will give you a coarser representation of your signal but admittedly may be misleading. If you want to retain most of the information of the original spectra, then creating a spectral envelope might help (the spectral envelope sort of normalizes the FFT). – dsp_user Mar 15 '18 at 14:28
You need to be careful when taking the log values of DFT bins as zero values are possible (not like with "real world" signals). Also, because the way logs work, you can get rid of the sqrt by multiplying by 1/2 on the outside. It is also common to use "power" instead of magnitude. Since power is magnitude squared, this means a multiplier of 2 on the outside. – Cedron Dawg Mar 15 '18 at 14:30
Yes, that's a good point. – dsp_user Mar 15 '18 at 14:32
A related question: What is the purpose of the following code? – Olli Niemitalo Mar 16 '18 at 15:27

Olli Niemitalo · Answer 2 · 2018-03-15T12:38:49.733

Sound frequency spectra are rarely flat. In my experience, a downward spectral slope of about 6 dB/octave (exactly 20 dB/decade) is typical. A saw wave, for example, has that kind of spectral slope. A saw wave can be composed from its harmonics by (adapted from Wikipedia's formula):

$$x_\mathrm{sawtooth}(t) = A\sum_{k=1}^{\infty} {(-1)}^{k} \frac {\sin (2\pi kft)}{k} $$

If the frequency bins $k \ge 1$ correspond exactly to the harmonics $k \ge 1,$ then, for a certain normalization of the saw wave amplitude $A$, the squared absolute value of frequency bin $k$ is $1/k^2$. If we collect the bins into larger buckets using your normalization scheme (second last column below) and a proposed scheme where the normalization takes place inside the sum using a factor $k$ (last column):

$$\begin{array}{l|l|l|l} n&k\text{ range}&\displaystyle\frac{\displaystyle{\sum_{k=2^n}^{2^{n+1}-1}\frac{1}{k^2}}}{2^{n+1}-2^n}&\displaystyle{\sum_{k=2^n}^{2^{n+1}-1}\frac{1}{k^2}k}\\ \hline 0&1\ldots1&1&1\\ 1&2\ldots3&0.1805555555&0.8333333333\\ 2&4\ldots7&0.03767148526&0.7595238095\\ 3&8\ldots15&0.008580403911&0.7253718503\\ 4&16\ldots31&0.002046901055&0.7090162022\\ 5&32\ldots63&0.0004998643892&0.7010207082\\ 6&64\ldots127&0.0001235095158&0.6970686888\\ \infty&&0&0.6931471805 = \ln(2)\\ \end{array}$$

$n$ is the bucket number. The proposed scheme gives quite a flat result that may be useful for visualization.
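
The two columns of the table can be reproduced with a few lines of Python. This is only a sketch of the arithmetic, assuming the idealized $1/k^2$ bin powers described above; `octave_bucket` is a hypothetical helper name:

```python
def octave_bucket(n):
    """Return (mean of 1/k^2, sum of k * 1/k^2) over k = 2^n .. 2^(n+1)-1."""
    ks = range(2 ** n, 2 ** (n + 1))
    mean_power = sum(1.0 / k ** 2 for k in ks) / len(ks)
    weighted_sum = sum(k * (1.0 / k ** 2) for k in ks)
    return mean_power, weighted_sum

for n in range(7):
    mean_power, weighted_sum = octave_bucket(n)
    print(n, f"{mean_power:.10g}", f"{weighted_sum:.10g}")
```

The first value drops by roughly a factor of four per bucket, while the second stays between 1 and $\ln(2)$, which is the flatness referred to above.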

Another possibility is to use a logarithmic magnitude scale like dB, which shows values close to zero at greater resolution. That is less misleading than arbitrary frequency-dependent normalization schemes.

score 0 · Answer 3 · answered Mar 15 '18 at 12:38


Are you summing the complex FFT values or their magnitudes to combine your bins? The latter is better. You can also sum the squares, divide by the count, then take the square root. This is known as RMS (root mean square).

If you are summing the complex values, the more bins that are included the more likely different phases in the bins will cancel each other out.
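
A tiny illustration of that cancellation, using four made-up bin values of equal magnitude but opposing phases (the values and the name `combine_bins` are assumptions chosen to make the effect obvious):

```python
import math

def combine_bins(bins):
    """Combine complex bins three ways: complex sum, magnitude sum, RMS."""
    complex_sum = abs(sum(bins))                 # phases can cancel
    magnitude_sum = sum(abs(b) for b in bins)    # phases ignored
    rms = math.sqrt(sum(abs(b) ** 2 for b in bins) / len(bins))
    return complex_sum, magnitude_sum, rms

bins = [1 + 0j, -1 + 0j, 0 + 1j, 0 - 1j]
complex_sum, magnitude_sum, rms = combine_bins(bins)
# complex_sum is 0.0, magnitude_sum is 4.0, rms is 1.0
```

The complex sum reports no activity at all here, even though every bin has magnitude 1.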

Hope this helps,

Ced

answered Mar 15 '18 at 12:38

Cedron Dawg


Yes, your question is valid, but I guess OP is using magnitudes only. and for audio signals it is expected that most of the signal energy is present in initial few bins/buckets. – Arpit Jain Mar 15 '18 at 12:47

@arpit jain, Newbie+assumption=trouble. The OP may be summing first, then taking magnitude. BTW, I've written my own spectrogram in my own audio recording program and I dispute your contention as overly broad. Sometimes baby ain't got no base. – Cedron Dawg Mar 15 '18 at 13:12
yup, agreed with you @Cedran Dawg . I like this " Newbie+assumption=trouble" :) :) – Arpit Jain Mar 15 '18 at 13:30
@Morgan Usley, Thanks for confirming arpit jain's assumption was correct. Olli's explanation is very good, I was just trying to rule out a possible "newbie mistake". Carry on. – Cedron Dawg Mar 15 '18 at 14:23
