In an audio/speech signal, most of the energy is almost always found in the lower bands (roughly below 1 kHz), so the lopsided shape that you're observing is not surprising.
Let me just add some more information regarding FFT frequency bins. These bins can be organized into larger buckets if you need that kind of representation.
For an FFT length of N = 1024, you should end up with 513 frequency bins (N/2+1), indexed 0 to 512. The first bin (index 0) corresponds to DC (0 Hz) and is usually ignored. The last bin (index N/2 = 512) corresponds to the Nyquist frequency and is usually ignored as well.
The bandwidth of a frequency bin is defined as:
BW_bin = Sampling_rate / FFT_length;
Note that while the sampling rate isn't necessary to compute the FFT, it is needed to calculate the bandwidth (frequency resolution).
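As a quick illustration of the bin count and bin bandwidth described above, here is a minimal Python sketch. The function name bin_layout and the 44100 Hz sampling rate are assumptions made for the example; substitute the rate of your own recording.

```python
def bin_layout(sample_rate_hz, fft_length):
    """Return the bin bandwidth and the start frequency of each one-sided bin."""
    bin_width = sample_rate_hz / fft_length      # BW_bin = Sampling_rate / FFT_length
    n_bins = fft_length // 2 + 1                 # N/2 + 1 bins, DC through Nyquist
    freqs = [i * bin_width for i in range(n_bins)]
    return bin_width, freqs

# Assumed example values: 44.1 kHz audio, FFT length 1024.
bin_width, freqs = bin_layout(44100, 1024)
print(len(freqs))            # 513
print(bin_width)             # 43.06640625
print(freqs[0], freqs[-1])   # 0.0 22050.0
```

Bin 0 sits at DC and bin 512 at the Nyquist frequency (half the sampling rate), which matches the indexing used in the rest of this answer.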
To get the activity (magnitudes in dB), you can use the following equation (this is only one way to compute the magnitudes):
Mag[i] = 20*log10(sqrt(2*(Real[i]^2 + Img[i]^2))/fftNorm); // iterate through bins [1-511]
fftNorm depends on the window function used (https://en.wikipedia.org/wiki/Window_function). For a rectangular window (i.e. no window) it is simply N, the FFT length.
The factor 2 in the equation accounts for the upper (discarded) half of the FFT.
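The magnitude computation can be sketched in Python as follows. This is one reading of the formula, not a definitive implementation: it takes the level as 20*log10 of the RMS amplitude sqrt(2*(Re^2 + Im^2))/fftNorm, uses a rectangular window (so fftNorm = N), and relies on a naive DFT so the example needs no external library. The helper names, the 1e-12 floor that avoids log10(0), and the test signal are all assumptions for illustration.

```python
import cmath
import math

def dft(x):
    """Naive O(N^2) DFT, used here only to keep the sketch dependency-free."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def magnitudes_db(spectrum, fft_norm):
    """Level in dB for bins 1 .. N/2-1 (DC and Nyquist are skipped)."""
    n = len(spectrum)
    mags = {}
    for i in range(1, n // 2):
        re, im = spectrum[i].real, spectrum[i].imag
        # The factor 2 folds in the energy of the discarded upper half.
        rms = math.sqrt(2.0 * (re * re + im * im)) / fft_norm
        mags[i] = 20.0 * math.log10(max(rms, 1e-12))  # assumed floor, avoids log10(0)
    return mags

# Assumed test signal: unit-amplitude sine landing exactly on bin 5.
N = 64
k0 = 5
signal = [math.sin(2 * math.pi * k0 * t / N) for t in range(N)]

mags = magnitudes_db(dft(signal), fft_norm=N)   # rectangular window: fftNorm = N
peak_bin = max(mags, key=mags.get)
print(peak_bin, round(mags[peak_bin], 2))       # 5 -3.01
```

A unit-amplitude sine has an RMS value of 1/sqrt(2), i.e. about -3.01 dB, so the peak value is a handy sanity check on the normalization. In practice you would replace dft with a real FFT routine.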
For visualization purposes, you can now easily combine several bins into a group of bins corresponding to a particular frequency range.
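Grouping bins into bands could look like the sketch below. The function name group_bins, the band edges, and the choice to average in the power domain (rather than averaging dB values directly) are assumptions for illustration; summing the powers instead would give the total level per band.

```python
import math

def group_bins(mags_db, bin_width_hz, band_edges_hz):
    """Average the power of all bins falling in each [lo, hi) band; return dB per band.

    mags_db[i] is the level of bin i in dB. Bands with no bins yield None.
    """
    bands = []
    for lo, hi in zip(band_edges_hz[:-1], band_edges_hz[1:]):
        powers = [10.0 ** (m / 10.0)
                  for i, m in enumerate(mags_db)
                  if lo <= i * bin_width_hz < hi]
        if powers:
            bands.append(10.0 * math.log10(sum(powers) / len(powers)))
        else:
            bands.append(None)
    return bands

# Assumed toy data: 6 bins, 100 Hz apart, grouped into three bands.
mags_db = [0.0, -10.0, -10.0, -20.0, -20.0, -30.0]
bands = group_bins(mags_db, 100.0, [100, 300, 500, 600])
print([round(b, 2) for b in bands])   # [-10.0, -20.0, -30.0]
```

For audio, band edges are often spaced logarithmically (for example octave or third-octave bands), so that the low-frequency region where most of the energy sits is not squeezed into a single bucket.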