
I have audio data from a WAV file. I broke it up into windows of 1024 samples (no overlap) and performed an FFT on each one.

When I visualize this data it actually looks pretty good, but the data is really lopsided. Buckets 1-3 show plenty of activity, but their values are generally much larger than those of buckets 4-8, so I have to apply a weird conditional multiplier to the higher-frequency buckets to see any activity there.

So what is the proper way to break my FFT into frequency buckets? A simple explanation would be best. Thank you!

3 Answers


Most of the energy in an audio/speech signal is almost always found in the lower bands (roughly below 1 kHz), so the lopsided shape you're observing is not surprising.

Let me add some more information about FFT frequency bins. These bins can be grouped into larger buckets if you need that kind of representation.

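For an FFT length of N = 1024 on real-valued input such as audio, you end up with 513 unique frequency bins (N/2+1), indexed 0 to 512. Bin 0 corresponds to DC (0 Hz) and is usually ignored; the last bin, index N/2 = 512, corresponds to the theoretical Nyquist frequency (half the sampling rate) and is also usually ignored.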

The bandwidth of a frequency bin is defined as

BW_bin = Sampling_rate / FFT_length

Note that while the sampling rate isn't necessary to compute the FFT, it is needed to calculate the bandwidth (frequency resolution).
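To make the bin arithmetic concrete, here is a small Python sketch. The 44.1 kHz sampling rate is an assumption (the question doesn't give one), and `bin_frequencies` is a hypothetical helper name.

```python
def bin_frequencies(sample_rate: float, fft_length: int) -> list[float]:
    """Center frequency (Hz) of each one-sided FFT bin 0 .. fft_length/2."""
    bw_bin = sample_rate / fft_length  # BW_bin = Sampling_rate / FFT_length
    return [k * bw_bin for k in range(fft_length // 2 + 1)]


# 44.1 kHz is an assumed sampling rate; the question does not state one.
freqs = bin_frequencies(44100.0, 1024)
print(len(freqs))  # 513 bins: DC (bin 0) .. Nyquist (bin 512)
print(freqs[1])    # bin bandwidth: 43.06640625 Hz
print(freqs[-1])   # Nyquist: 22050.0 Hz

# Non-DC bins below 1 kHz, where most audio energy tends to sit
below_1k = sum(1 for f in freqs[1:] if f < 1000.0)
print(below_1k)    # 23
```

At that rate the sub-1 kHz region occupies only 23 of the 511 usable bins, so the part of the spectrum holding most of the energy is a small slice of the bin range.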

To get the activity (magnitudes in dB), you can use the following equation (one way to compute the magnitudes):

Mag[i] = 20*log10(sqrt(2*(Real[i]^2 + Img[i]^2)/fftNorm)); // iterate through bins 1 to 511

fftNorm depends on the window function used (https://en.wikipedia.org/wiki/Window_function); for a rectangular window (i.e., no window) it is simply N, the FFT length. The factor 2 in the equation accounts for the upper (discarded) half of the FFT.
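A minimal NumPy sketch of the framing and the dB formula above, assuming mono floating-point samples and a rectangular window (so fftNorm = N). The helper names are hypothetical, and the -200 dB floor is an arbitrary choice that keeps silent frames from producing log10(0).

```python
import numpy as np


def split_frames(samples: np.ndarray, n: int = 1024) -> np.ndarray:
    """Non-overlapping length-n frames; a trailing partial frame is dropped."""
    count = len(samples) // n
    return samples[: count * n].reshape(count, n)


def frame_magnitudes_db(frame: np.ndarray, floor_db: float = -200.0) -> np.ndarray:
    """dB level of bins 1 .. N/2-1 of one frame, assuming a rectangular window (fftNorm = N)."""
    n = len(frame)
    spectrum = np.fft.rfft(frame)              # N/2 + 1 complex bins
    power = 2.0 * np.abs(spectrum) ** 2 / n    # 2*(Real^2 + Img^2)/fftNorm
    power = power[1 : n // 2]                  # drop DC (bin 0) and Nyquist (bin N/2)
    # 20*log10(sqrt(p)) == 10*log10(p); the floor avoids log10(0) on silent frames
    return 10.0 * np.log10(np.maximum(power, 10.0 ** (floor_db / 10.0)))
```

With `samples` read from the WAV file (for example via `scipy.io.wavfile.read`, an assumption about your tooling), `np.array([frame_magnitudes_db(f) for f in split_frames(samples)])` gives one row of 511 dB values per window.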

For visualization purposes, you can now easily combine several bins into a group corresponding to a particular frequency range.
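One way to do the grouping, sketched in NumPy: average the linear power of the bins in each bucket and convert to dB only at the end. The octave-wide bucket edges are an assumption (the question doesn't say how its eight buckets are defined); any increasing list of bin edges works with the same function.

```python
import numpy as np

# Octave-wide buckets for N = 1024: bins [1, 2), [2, 4), ..., [256, 512).
# Each bucket spans twice the frequency range of the previous one.
OCTAVE_EDGES = [2 ** i for i in range(10)]  # 1, 2, 4, ..., 512


def bucket_power(power: np.ndarray, edges: list[int]) -> np.ndarray:
    """Mean linear power of bins edges[i] .. edges[i+1]-1, one value per bucket.

    `power` is indexed by bin number (bin 0 = DC), e.g. 2*|X[k]|**2/N.
    """
    return np.array([power[lo:hi].mean() for lo, hi in zip(edges[:-1], edges[1:])])


def bucket_db(power: np.ndarray, edges: list[int]) -> np.ndarray:
    """Bucket levels in dB, ready for display."""
    return 10.0 * np.log10(bucket_power(power, edges))
```

Passing `list(range(1, 502, 10))` as `edges` instead gives 50 equal-width groups of ten bins covering bins 1 to 500.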

dsp_user
  • As asked by the OP, does your answer clarify why the data is lopsided? From your answer one can understand the method of visualizing the data, but the reason for it being lopsided is still missing. – Arpit Jain Mar 15 '18 at 11:34
  • I said that most energy will still be in the lower bands, but you're right, I'll edit my answer to emphasize that. – dsp_user Mar 15 '18 at 12:01
  • @Morgan Usley I thought you were already doing that. For instance, you can take magnitudes from 10 consecutive bins and put them in a separate bucket (sum all the mags and divide by 10). This will give you a coarser representation of your signal but admittedly may be misleading. If you want to retain most of the information of the original spectra, then creating a spectral envelope might help (the spectral envelope sort of normalizes the FFT). – dsp_user Mar 15 '18 at 14:28
  • You need to be careful when taking the log of DFT bin values, as zero values are possible (though unlikely with "real world" signals). Also, because of the way logs work, you can get rid of the sqrt by multiplying by 1/2 on the outside. It is also common to use "power" instead of magnitude. Since power is magnitude squared, this means a multiplier of 2 on the outside. – Cedron Dawg Mar 15 '18 at 14:30
  • Yes, that's a good point. – dsp_user Mar 15 '18 at 14:32

Sound frequency spectra are rarely flat. In my experience, a downward spectral slope of about 6 dB/octave (more precisely, 20 dB/decade) is typical. For example, a sawtooth wave has that kind of spectral slope. A sawtooth wave can be composed from its harmonics as follows (adapted from Wikipedia's formula):

$$x_\mathrm{sawtooth}(t) = A\sum_{k=1}^{\infty} {(-1)}^{k} \frac {\sin (2\pi kft)}{k} $$
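A quick numerical check of the slope claim, as a NumPy sketch: synthesize the series above with A = 1, keeping the first 100 harmonics (an arbitrary band limit) and choosing the fundamental so that harmonic k completes exactly k cycles per 1024-sample window and therefore lands exactly on bin k.

```python
import numpy as np

N = 1024  # FFT length (one window)
K = 100   # number of harmonics kept (assumed band limit, well below N/2)
n = np.arange(N)

# Harmonic k completes exactly k cycles per window, so it lands exactly on bin k
x = sum((-1) ** k * np.sin(2 * np.pi * k * n / N) / k for k in range(1, K + 1))

mag = np.abs(np.fft.rfft(x))            # |X[k]| = N / (2k) for 1 <= k <= K
level_db = 20 * np.log10(mag[1:K + 1])  # level of harmonic k, k = 1..K
print(level_db[1] - level_db[0])  # bin 2 vs bin 1: one octave
print(level_db[9] - level_db[0])  # bin 10 vs bin 1: one decade
```

The harmonic magnitudes fall as 1/k, so each doubling of frequency costs 20·log10(2) ≈ 6.02 dB and each factor of ten costs exactly 20 dB.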

If the frequency bins $k \ge 1$ correspond exactly to the harmonics $k \ge 1,$ then, for a certain normalization of the sawtooth amplitude $A$, the squared absolute value of frequency bin $k$ is $1/k^2$. Now collect the bins into larger buckets, either using your normalization scheme (dividing the bucket sum by the number of bins, second-to-last column below) or using a proposed scheme where the normalization takes place inside the sum via a factor $k$ (last column):

$$\begin{array}{l|l|l|l} n&k\text{ range}&\displaystyle\frac{\displaystyle{\sum_{k=2^n}^{2^{n+1}-1}\frac{1}{k^2}}}{2^{n+1}-2^n}&\displaystyle{\sum_{k=2^n}^{2^{n+1}-1}\frac{1}{k^2}k}\\ \hline 0&1\ldots1&1&1\\ 1&2\ldots3&0.1805555555&0.8333333333\\ 2&4\ldots7&0.03767148526&0.7595238095\\ 3&8\ldots15&0.008580403911&0.7253718503\\ 4&16\ldots31&0.002046901055&0.7090162022\\ 5&32\ldots63&0.0004998643892&0.7010207082\\ 6&64\ldots127&0.0001235095158&0.6970686888\\ \infty&&0&0.6931471805 = \ln(2)\\ \end{array}$$

Here $n$ is the bucket number. The proposed scheme gives a nearly flat result (tending to $\ln 2$ for large $n$), which may be useful for visualization.
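The two table columns can be reproduced with a few lines of Python (`bucket_sums` is just an illustrative name):

```python
import math


def bucket_sums(n: int) -> tuple[float, float]:
    """Both table columns for bucket n, which covers bins k = 2**n .. 2**(n+1) - 1."""
    ks = range(2 ** n, 2 ** (n + 1))
    # Existing scheme: mean of |X_k|^2 = 1/k^2 over the bucket
    averaged = sum(1 / k**2 for k in ks) / (2 ** (n + 1) - 2 ** n)
    # Proposed scheme: weight each bin's 1/k^2 by k inside the sum
    weighted = sum(1 / k**2 * k for k in ks)
    return averaged, weighted


for n in range(7):
    a, w = bucket_sums(n)
    print(n, f"{a:.10g}", f"{w:.10g}")
print("limit of last column:", math.log(2))
```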

Another possibility is to use a logarithmic magnitude scale such as dB, which shows values close to zero with greater resolution. That is less misleading than an arbitrary frequency-dependent normalization scheme.
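To illustrate with the same $1/k^2$ bins: on a linear scale the averaged bucket 6 is roughly 1/8000 of bucket 0 and practically invisible, while in dB the averaged buckets fall in nearly equal steps. A small Python sketch (`bucket_level_db` is a hypothetical helper):

```python
import math


def bucket_level_db(n: int) -> float:
    """Mean of 1/k^2 over bins 2**n .. 2**(n+1) - 1, expressed in dB."""
    ks = range(2 ** n, 2 ** (n + 1))
    mean_power = sum(1 / k**2 for k in ks) / len(ks)
    return 10 * math.log10(mean_power)


levels = [bucket_level_db(n) for n in range(7)]
steps = [b - a for a, b in zip(levels, levels[1:])]  # dB change per bucket
print([round(s, 2) for s in steps])
```

The steps settle toward 10·log10(1/4) ≈ -6.02 dB per bucket, i.e. a straight line on a dB axis, so every bucket stays visible without any frequency-dependent gain.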

Olli Niemitalo

Are you summing the complex FFT values or their magnitudes to combine your bins? The latter is better. You can also sum the squared magnitudes, divide by the count, then take the square root. This is known as RMS (root mean square).

If you are summing the complex values, the more bins you include, the more likely it is that the different phases of the bins will cancel each other out.
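A small Python illustration of the three options (hypothetical helper name and bin values): two bins with equal magnitude but opposite phase cancel completely when summed as complex numbers, but not when their magnitudes are summed or RMS-averaged.

```python
import math


def combine_bins(bins: list[complex]) -> dict[str, float]:
    """Three ways to collapse a group of complex FFT bins into one bucket value."""
    complex_sum = abs(sum(bins))               # phases can cancel each other
    magnitude_sum = sum(abs(b) for b in bins)  # phase-independent
    rms = math.sqrt(sum(abs(b) ** 2 for b in bins) / len(bins))  # root mean square
    return {"complex_sum": complex_sum, "magnitude_sum": magnitude_sum, "rms": rms}


# Hypothetical bins: equal magnitude, opposite phase
print(combine_bins([1 + 0j, -1 + 0j]))
# {'complex_sum': 0.0, 'magnitude_sum': 2.0, 'rms': 1.0}
```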

Hope this helps,

Ced

Cedron Dawg
  • Yes, your question is valid, but I guess the OP is using magnitudes only. And for audio signals it is expected that most of the signal energy is present in the first few bins/buckets. – Arpit Jain Mar 15 '18 at 12:47
  • @arpit jain, Newbie+assumption=trouble. The OP may be summing first, then taking the magnitude. BTW, I've written my own spectrogram in my own audio recording program, and I dispute your contention as overly broad. Sometimes baby ain't got no bass. – Cedron Dawg Mar 15 '18 at 13:12
  • yup, agreed with you @Cedron Dawg. I like this "Newbie+assumption=trouble" :) :) – Arpit Jain Mar 15 '18 at 13:30
  • @Morgan Usley, Thanks for confirming arpit jain's assumption was correct. Olli's explanation is very good, I was just trying to rule out a possible "newbie mistake". Carry on. – Cedron Dawg Mar 15 '18 at 14:23