Fixed point saturation and round

Question

Another question related to fixed-point saturation and rounding. I use the notation S<WL, FW>, where S means the number is signed, WL is the full bit width, and FW is the fractional bit width. So S<16, 15> means a 16-bit signed number with 15 fractional bits.

Normally, in an N-tap filter, denote the input as S<x_WL, x_FW> and the coefficients as S<c_WL, c_FW>. Then each multiplication result would be S<x_WL+c_WL, x_FW+c_FW>, and the N-tap output y would grow from S<x_WL+c_WL, x_FW+c_FW> by log2(N) bits (rounded up to an integer) to S<x_WL+c_WL+log2(N), x_FW+c_FW>. Finally, y's format S<x_WL+c_WL+log2(N), x_FW+c_FW> would be saturated and rounded to S<y_WL, y_FW>, where y_FW = x_FW+c_FW-round_off, and the number of bits to the left of the binary point of y would be y_WL-y_FW.
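For concreteness, the bookkeeping above can be sketched in a few lines of Python. The function name `fir_accumulator_format` and the example widths are assumptions chosen for illustration, not part of the original question.

```python
def fir_accumulator_format(x_wl, x_fw, c_wl, c_fw, n_taps):
    """Return the (WL, FW) formats of one product and of the full-precision sum.

    Follows the routine rule from the question: the product widths add,
    and summing n_taps products adds ceil(log2(n_taps)) integer bits.
    """
    product = (x_wl + c_wl, x_fw + c_fw)
    # (n - 1).bit_length() equals ceil(log2(n)) for n >= 1, using integers only.
    growth = (n_taps - 1).bit_length()
    accumulator = (product[0] + growth, product[1])
    return product, accumulator


# Example: S<16,15> input, S<16,15> coefficients, 64 taps.
product, accumulator = fir_accumulator_format(16, 15, 16, 15, 64)
print("product     S<%d,%d>" % product)      # S<32,30>
print("accumulator S<%d,%d>" % accumulator)  # S<38,30>
```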

Besides this routine approach, what other ways are there to limit internal bit growth, for example in the individual multiplication outputs S<x_WL+c_WL, x_FW+c_FW> and in the final output? There are some papers on designing the coefficients so that each multiplication is guaranteed to grow by only 1 bit. But what else can be done?

Thanks

I have to admit that I didn't work through each of the equations you gave, but I see that your general question concerns the bits and rounding in an FIR filter. It is very important for noise considerations to let the filter grow the signal and then truncate it afterwards. I detail this further in these posts, which may interest you: https://dsp.stackexchange.com/questions/31577/inter-filter-bit-width/31588#31588 https://dsp.stackexchange.com/questions/38620/what-is-the-suitable-design-method-to-the-filter/38621#38621 — Dan Boschen, May 12 '21 at 01:30
thanks for the links. I would add that besides noise considerations in the first link, more important to me is power consumption. Adding more bits is not desired in my case. So my design flow is somewhat different than the flow shown in the second link — hbf, May 12 '21 at 01:59
I would respectfully argue it is the same: you want to minimize the total number of bits within some SNR or other requirement, so this tells you the fewest bits you can add within that constraint. Further, you would then be very interested in multi-rate signal processing, as power is a major motivator for it (minimizing the sampling rate wherever possible, since dynamic power goes as $C V^2 f$). — Dan Boschen, May 12 '21 at 02:05
Most important is not to naively think you can just scale the coefficients to prevent overflow without considering the total quantization noise growth. That is my point. — Dan Boschen, May 12 '21 at 02:06
yeah, I understand your point of reaching optimal SNR when selecting coefficient bitwidth. That is a different starting point than my question though: I meant the coefficients are already given, can't change them. So what else can be done other than the simulation wave shown below to decide intermediate results' bitwidth? — hbf, May 12 '21 at 02:08
If the coefficients are given, you can still scale them to reduce the bit width, which comes down to how low you can go, and ultimately back to the quantization constraint. — Dan Boschen, May 12 '21 at 02:10
I may have an additional answer in addition to Hilmar’s good answer below; specifically if your waveform is specific to any modern communications modulation? — Dan Boschen, May 12 '21 at 02:12
yeah, you can assume it is related to modern comm modulations like QAM and QPSK etc — hbf, May 12 '21 at 02:16
See this post and scroll down to the graphic "Maximum ADC Input Signal" which is applicable to precision in an FIR filter or any datapath. I worked out this chart analytically for a Gaussian distributed waveform (as most modern comms will be) showing the trade between clipping noise and quantization noise. https://dsp.stackexchange.com/questions/60035/how-to-adjust-receiver-gains-to-avoid-saturation-and-quantization-noise-to-optim/63086#63086 This is just the analytical approach to what Hilmar describes via simulation, but either way I think his answer is spot on to what you are asking. — Dan Boschen, May 12 '21 at 02:23
Thanks. Your approach is Rx-specific, and there is normally a backoff. But anyway, that's a way to go. — hbf, May 12 '21 at 17:19
Yes good point - Rx specific as it doesn’t take out of band emissions into consideration — Dan Boschen, May 12 '21 at 17:32

Hilmar · Answer 1 · 2021-05-12T17:33:36.177


I think this is overly conservative. If you applied it to an IIR filter, you would conclude that the bit growth is infinite, which isn't the case.

For a filter with impulse response $h[n]$, the "worst case" gain the filter can apply to the input is the sum of the absolute values of the coefficients, i.e.

$$G_{max} = \sum |h[n]| $$

This means that if $|x[n]| \le x_{max}$ then $|y[n]| \le x_{max} \cdot G_{max}$ for all $n$.

You are guaranteed not to overflow as long as you accommodate this maximum gain.
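As a rough illustration of this bound, the sketch below computes $G_{max}$ for a made-up set of coefficients and converts it into extra integer bits. The coefficient values and the function name `worst_case_growth_bits` are assumptions for illustration only, and $\lceil \log_2 G_{max} \rceil$ ignores the corner case where $G_{max}$ is an exact power of two and the input sits at its most negative value.

```python
import math


def worst_case_growth_bits(h):
    """Return (G_max, extra integer bits) for impulse response h.

    G_max is the sum of absolute coefficient values; the bit count is
    ceil(log2(G_max)), clamped at zero when the filter cannot add gain.
    """
    g_max = sum(abs(c) for c in h)
    if g_max <= 1.0:
        return g_max, 0
    return g_max, math.ceil(math.log2(g_max))


# Hypothetical 5-tap low-pass coefficients (not from the original post).
h = [0.1, 0.25, 0.5, 0.25, 0.1]
g_max, bits = worst_case_growth_bits(h)
print("G_max =", g_max, "-> extra integer bits:", bits)

# The routine rule from the question would budget ceil(log2(N)) bits instead.
print("ceil(log2(N)) =", (len(h) - 1).bit_length())
```

For these example coefficients the worst-case bound asks for one extra integer bit, while the log2(N) rule budgets three.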

Even that's too conservative for most applications. The input signal that produces this maximum gain is basically $x[n] = x_{max} \cdot \operatorname{sign}(h[-n])$, which is extremely unlikely to occur: it matches the sign of every input sample to the coefficient it multiplies, so that all products add constructively. So in practice a statistical approach works better: run a sizable number of representative input signals through the filter and look at the probability distribution function of the output.
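A minimal sketch of both ideas, under stated assumptions: the coefficients are made up, the "representative" input is simply uniform noise in $[-1, 1]$ (a real design would use the actual waveform), and the helper `fir` is a plain direct-form convolution written for this example. The sign-matched input reaches $G_{max}$ exactly, while the random input is then compared against the same bound.

```python
import random


def fir(h, x):
    """Direct-form FIR, y[n] = sum_k h[k] * x[n-k], with zero initial state."""
    return [sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
            for n in range(len(x))]


# Hypothetical coefficients, for illustration only.
h = [-0.05, 0.2, 0.6, 0.2, -0.05]
g_max = sum(abs(c) for c in h)

# Worst-case input: full-scale samples whose signs follow the time-reversed h.
x_worst = [1.0 if c >= 0 else -1.0 for c in reversed(h)]
peak_worst = max(abs(v) for v in fir(h, x_worst))

# Statistical approach: push a long random input through the same filter.
rng = random.Random(0)
x_rand = [rng.uniform(-1.0, 1.0) for _ in range(20000)]
y_rand = fir(h, x_rand)
peak_rand = max(abs(v) for v in y_rand)

# Empirical probability that |y| exceeds a candidate saturation level.
level = 0.8 * g_max
p_exceed = sum(abs(v) > level for v in y_rand) / len(y_rand)

print("G_max              :", g_max)
print("peak, worst input  :", peak_worst)
print("peak, random input :", peak_rand)
print("P(|y| > 0.8 G_max) :", p_exceed)
```

Repeating the last step for several candidate levels gives an empirical tail distribution from which a saturation point can be chosen.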

At the end of the day this is a trade off between the noise from occasionally clipping an output sample vs the quantization/rounding noise you create on every output sample.
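The trade-off can be seen in a small simulation. This is only a sketch under assumptions not in the original answer: a zero-mean Gaussian signal, an 8-bit signed quantizer spanning $[-1, 1)$, and three arbitrarily chosen rms levels. The helper names `quantize_saturate` and `snr_db` are hypothetical.

```python
import math
import random


def quantize_saturate(v, bits):
    """Round v to a signed grid of `bits` total bits on [-1, 1), then saturate."""
    step = 2.0 ** (1 - bits)
    q = round(v / step) * step          # Python's round() is round-half-to-even
    return min(max(q, -1.0), 1.0 - step)


def snr_db(sigma, bits, n=20000, seed=0):
    """Measured SNR (dB) of a Gaussian signal with rms sigma after quantization."""
    rng = random.Random(seed)
    sig_pow = err_pow = 0.0
    for _ in range(n):
        v = rng.gauss(0.0, sigma)
        e = quantize_saturate(v, bits) - v   # rounding plus clipping error
        sig_pow += v * v
        err_pow += e * e
    return 10.0 * math.log10(sig_pow / err_pow)


# Low level: rounding noise dominates. High level: clipping noise dominates.
snr_low = snr_db(0.05, 8)
snr_mid = snr_db(0.25, 8)
snr_high = snr_db(1.0, 8)
for sigma, snr in ((0.05, snr_low), (0.25, snr_mid), (1.0, snr_high)):
    print("rms = %.2f of full scale -> SNR = %.1f dB" % (sigma, snr))
```

Under these assumptions the middle level should give the best SNR of the three: too much backoff wastes bits on every sample, too little lets clipping dominate.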

edited May 12 '21 at 17:33

answered May 12 '21 at 02:03

Hilmar


Correct. But what else can be done other than running these simulations? I mean analytical ways? – hbf May 12 '21 at 02:10
By the way, why did you say "maximum gain is basically sign(h[-n])"? – hbf May 12 '21 at 02:15
I was going to add an answer, but it is basically what Hilmar describes, only from an analytical approach: given the known distribution of the waveform (for example complex Gaussian), the noise due to clipping can be quantified analytically and traded against the noise due to quantization, so that for any number of bits we can optimize the rms level relative to full scale and maximize the dynamic range within that number of bits. – Dan Boschen May 12 '21 at 02:18
@hbf. I added a clarification to the answer – Hilmar May 12 '21 at 17:33
@Hilmar OK, I see it. I think you meant that x_max referred to sign(h[-n]) in the previous reply? But I am not sure I understand why h[-n]. In general, x_max has no relationship to h[n] or h[-n], or am I missing some logic? – hbf May 12 '21 at 18:28
