
I have implemented a moving average, similar to the Hogenauer Filter, with a reduced number of computation operations. I expect the error to behave as a random walk, with a standard deviation of order $\sqrt{n}\,\varepsilon$, where $n$ is the length of the filtered vector. Somehow I get one order of magnitude less. What am I missing, and is there a way to reduce the error?

len = 1000;
windLen = 11;
normCoeff = 1 / windLen;
q = nan(len, 1);

for a = 1:len
x = randn(10^5, 1);

xRef = movmean(x, windLen);
xRef(1:(windLen - 1) / 2 + 1) = [];

varState = 0;
xTest = nan(size(x));
x = [zeros(windLen, 1); x];
for ind=1:length(x) - windLen
    varState = varState + x(windLen + ind) - x(ind);
    xTest(ind) = varState * normCoeff;
end

xTest(1:windLen) = [];
xRef(length(xTest) + 1:end) = [];
q(a) = xTest(end) - xRef(end);

end
disp(std(q))

Edit:

Following the suggestion by @Dan Boschen, I attached the comparison of this method vs. the Hogenauer Filter, and for some reason the attached method is faster (by a factor of two). Note: the external loop is there only to improve the timing comparison and is not actually required for the computation.

clc
clear
windLen = 11;
testLen = 10^4;
normCoeff = 1 / windLen;
xBuff = zeros(windLen, 1);
x = randn(10^4, 1);

tic
for a = 1:testLen
    varState = 0;
    y = nan(size(x));
    xBuff(windLen + 1:windLen + length(x)) = x;
    for ind = 1:length(x)
        varState = varState + xBuff(windLen + ind) - xBuff(ind);
        y(ind) = varState * normCoeff;
    end
end
toc

tic
for a = 1:testLen
    y2 = filter([1 0 0 0 0 0 0 0 0 0 0 -1], [11 -11], x);
end
toc
plot(y - y2)

The error accumulation appears here too (plot of y - y2).

Gideon Genadi Kogan
  • You should not have the second code in the loop; without the loop it will run significantly faster in MATLAB. It should simply be: tic; y2 = filter([1 0 0 0 0 0 0 0 0 0 0 -1], [11 -11], x); toc – Dan Boschen Oct 02 '20 at 03:29
  • may i ask why you wouldn't use fixed-point arithmetic with a moving sum filter? so you know what you subtract from the end of the delay line is exactly what you added earlier. – robert bristow-johnson Oct 03 '20 at 21:13
  • @robertbristow-johnson, the general implementation is in floating-point. Is it common to move from floating point to fixed point and back to floating point? It does seem like an interesting suggestion and I will consider it – Gideon Genadi Kogan Oct 04 '20 at 05:46
  • it's not common, but in my experience, when doing a sliding sum or sliding average, it is best to convert to fixed-point, add-and-subtract-the-same-quantity (resulting in *zero* accumulated error in the accumulator), and convert back to floating point. if your accumulator is double-precision, perhaps you don't need to. but i have implemented sliding mean algorithms and doing this in single-precision float is fraught with *accumulating* rounding error. and since the accumulator is an integrator having a pole on the unit circle (marginally stable) that accumulated error never goes away. – robert bristow-johnson Oct 04 '20 at 05:53
  • That would seem to explain exactly what we see in the plot of the final results for 'base-CIC' below, where the resulting error is a random walk that I haven't fully explained yet, but this error should still be bounded since the subsequent difference is subtracting the signal at the accumulator output from a delayed version of itself, resulting in the 'base-OP' result where the error is larger but still bounded. This is important as it would justify why we could use floating or fixed point; as long as the accumulator is extended there will be no issue. – Dan Boschen Oct 04 '20 at 12:18
  • This means we don't really care about error that has accumulated at the accumulator output but only the difference in error over the N sample average. So we minimize that difference by going to extended precision. The "Adding and subtracting the same quantity" in fixed point doesn't really make sense: We are subtracting the signal from itself N samples later which will also have accumulated fixed point quantization error. – Dan Boschen Oct 04 '20 at 12:20
  • So I don't think this is a fixed vs floating point issue per se but specific to getting extended precision; this can be done by converting to fixed point and back, but be aware that it then limits the quantization noise floor to be that of that fixed point system with quantization noise introduced in that process, so if achieving the numerical precision of single (25 bit) or double precision (54 bit) floating point you would just need to keep the fixed point implementation above those respective precisions with the further extension at the accumulator output. – Dan Boschen Oct 04 '20 at 12:29
  • @robertbristow-johnson I think I am missing a critical architectural detail in your optional approach to get zero accumulated error in the accumulator; how do you add and subtract the same number when the accumulator is only adding? Is this a different moving average implementation than the Hogenauer? In the Hogenauer the subtraction is of the accumulator output where the error is already introduced, from a different output later of the accumulator where there is different error. What am I missing? – Dan Boschen Oct 04 '20 at 12:35
  • it's not just adding. – robert bristow-johnson Oct 04 '20 at 13:07
  • @Robert At the accumulator it is just adding, the subtraction is operating on the accumulator output (in the Hogenauer). But I do see what might be a fundamental reason why it should never be done in floating point: In fixed point we can roll-over/under on overflow such that the same result is obtained after the subtraction regardless of over/under flow. With floating we will get an over/underflow error and with the random walk error output of the accumulator this can eventually happen (although with variance growing at rate $N$ maybe even that isn't realistically feasible) – Dan Boschen Oct 04 '20 at 13:53
  • no @DanBoschen it's not. a moving sum filter, that is an FIR filter of length $L$ implemented as a truncated IIR with pole-zero cancellation:

    $$ y[n] = y[n-1] + x[n] - x[n-L] $$

    this is really a simple concept. the $x[n]$ that was added $L$ samples ago is now subtracted as the $x[n-L]$ term. In fixed-point, exactly what was added $L$ samples ago is being subtracted now. In floating-point, what was added $L$ samples ago and what is subtracted now both depend on the exponent of the floating-point value in the accumulator, which may be different, resulting in a rounding error.

    – robert bristow-johnson Oct 04 '20 at 19:05
  • @robertbristow-johnson Shouldn't it be $y[n] = x[n-1]+x[n]- x[n-L] - x[n-L-1]$? Review the block diagram in my post, this seems to match that but your equation doesn't....or can it be somehow rearranged to be equivalent (It doesn't appear so to me)? Oh I see, we can swap the order! Makes sense – Dan Boschen Oct 04 '20 at 19:11
  • no, not a simple moving sum. these are two different ways of implementing the same FIR. one has recursion: $$ y[n]=y[n−1]+x[n]−x[n−L] $$ and the other does not: $$ y[n] = \sum_{\ell=0}^{L-1} x[n-\ell] $$ One will accumulate round-off error when using floating-point. The latter never does. – robert bristow-johnson Oct 04 '20 at 19:15
  • @robertbristow-johnson Three ways since we can also swap the order of the integrator and comb (your formula has comb then integrator, while mine is integrator then comb). So does this mean there is a difference in the round-off error between those approaches or does the cancellation still occur either way? (The latter does accumulate round off error as I showed in my answer if the accumulator doesn't extend the precision, floating or fixed point: it is just a bound growth of L). – Dan Boschen Oct 04 '20 at 19:17
  • (But that is interesting about the order and if that changes anything, might give further insight into why my "base-CIC" case had a random walk--- when I have time again I will go back and try that as well unless the reason is obvious-- And understood about the significant difference when the exponent changes which you avoid in fixed point ) – Dan Boschen Oct 04 '20 at 19:20
  • mine is just a moving sum. your FIR filter is a differentiator followed by a non-recursive comb. i am discussing the accumulation of roundoff error in a recursive moving sum filter, implemented as a truncated IIR filter in which there is a pole-zero cancellation. but the cancelled pole sits right on $z=1$. so, if there is any roundoff error, it's not gonna decay. – robert bristow-johnson Oct 04 '20 at 19:21
  • I did the fixed sum (base-base-SP) and confirmed the predicted growth by not having extended precision accumulator with single precision output and single precision accumulation over the fixed number of samples. For floating point this is saying you will get N x the minimum possible error unless you do the sum in double precision and then truncate the output back to single precision (same thing with fixed point and not allowing the accumulator to grow) – Dan Boschen Oct 04 '20 at 19:24
  • But I don't follow your last comment-- my CIC filter is a recursive accumulator (not differentiator) followed by the non recursive comb. So this is the recursive moving sum filter you are talking about---- although yours is specifically comb first then recursive accumulator (mathematically identical but perhaps a difference in error results, I want to look at that closer). – Dan Boschen Oct 04 '20 at 20:04
  • i shouldn't have said "differentiator" for the difference done with the comb filter. it's this difference that makes the accumulator (as the second block) maintain a value of the length $L$ sum. otherwise a DC value in the input (which a moving average or moving sum should be able to do) will result in the accumulator growing without bound. so if you put accumulator (or the discrete-time "integrator") first, you can have only zero DC component in the input. at least if you expect this moving average filter to run indefinitely. you cannot always swap the order. – robert bristow-johnson Oct 06 '20 at 05:54

1 Answer


The OP is implementing the Hogenauer Filter (thank you Eugene! http://read.pudn.com/downloads163/ebook/744947/123.pdf), also called a CIC Filter, as an efficient equivalent of the moving average filter, and is getting a noise error result that is 10x more than expected.

Short Answer

The reason for the additional error in the OP's case is due to not having an extended precision accumulator.

Long Answer

We will show what the predicted noise is, for both the properly designed Moving Average and CIC filters, and then simulation results of the various structures as confirmation.

Both structures are shown below with the optional scaling for normalization, properly located at the output. The upper drawing, the Moving Average Filter, is a moving average over 11 samples, and the lower drawing is the mathematically equivalent Hogenauer or Cascaded Integrator-Comb (CIC) Filter. (For details on why these are equivalent, see CIC Cascaded Integrator-Comb spectrum.)

Moving Average and CIC Filter
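
As a quick sanity check, here is a minimal sketch confirming the equivalence numerically in MATLAB/Octave (an 11-sample window is assumed, matching the diagram; in double precision the two outputs agree to within rounding):

% Sketch: the direct moving average and the Hogenauer/CIC form give the
% same output up to double precision rounding (11-sample window assumed).
x    = randn(1000, 1);
yMA  = filter(ones(11, 1) / 11, 1, x);             % direct moving average
yCIC = filter([1 zeros(1, 10) -1], [11 -11], x);   % integrator-comb (CIC) form
disp(max(abs(yMA - yCIC)))                         % on the order of 1e-15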

What Is the Expected Noise?

We will first detail what noise due to numerical precision we should expect in a properly designed moving average filter. Both fixed and floating point systems are limited by the finite quantization levels given by the precision of the number. The difference is that with fixed point the designer (or a good compiler) needs to be extra careful about overflow and underflow conditions at every node in the design, and scale the nodes accordingly, such as with bit-shifting, to prevent such things from happening. With floating point, this scaling happens automatically in the floating point processor, with the overhead stored in each number. (If time to market is important, floating point is the way to go; but if cost and power are the primary metrics, then fixed point should be strongly considered.)

The diagram below details the single precision floating point representation to illustrate this. The exponent of the number is equivalent to a left or right shift, scaling the number to the ranges shown on the left side of the diagram. So even though floating point can handle an extremely large numerical range, for any given instance the closest number we can get to a target value will always be within the precision set by the mantissa. As the exponent increases, the range of the numbers available for that exponent increases, but we still only have the precision of the mantissa and sign bit for the quantity of numbers we can choose from.

Floating point precision
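
This scaling of the quantization step with the exponent can be illustrated directly (a minimal sketch; eps() reports the step size at a given value):

% Sketch: the absolute quantization step of single precision grows with the
% magnitude (exponent) of the value, while relative precision stays ~2^-23.
disp(eps(single(1)))       % 2^-23 = 1.1921e-07
disp(eps(single(1024)))    % 1024 * 2^-23 = 1.2207e-04
disp(eps(single(1e6)))     % step is now 0.0625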

Single precision floating point has 25 bits of precision: the 23-bit mantissa, plus the sign bit, plus the "hidden 1" bit that Robert B-J points out. Double precision floating point equivalently has 54 bits of precision.

For the implementation of CIC filters, a key feature that fixed point brings is modulo arithmetic on overflow, such that the subtraction in the comb is still numerically accurate regardless of the overflow. This is not how floating point arithmetic behaves; there, an overflow (or underflow) error will ultimately occur given the expected random walk behavior of the accumulator output. (So further implementation details would be required to reset the accumulator as its output approaches overflow or underflow conditions.) Further, the quantization error in a floating point CIC implementation will be proportional to the level at the accumulator output (which is visible in the more bursty patterns of noise in the plots below for the floating point implementations).
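
To make the modulo arithmetic point concrete, here is a minimal sketch (my own illustration, with a hypothetical 16-bit accumulator word simulated using mod()): the comb subtraction still recovers the correct windowed sum even though the accumulator wrapped in between.

% Sketch: wrap-around (modulo) arithmetic keeps the comb subtraction correct
% after an accumulator overflow, provided at most one wrap occurred between
% the two samples being differenced. A 16-bit word size is assumed.
wordLen = 2^16;
accOld  = 65000;                           % accumulator value N samples ago
accNew  = mod(accOld + 1200, wordLen);     % accumulator has since wrapped past 2^16
combOut = mod(accNew - accOld, wordLen);   % modulo subtraction in the comb
disp(combOut)                              % 1200: the true sum over the window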

Related is this post on the dynamic range of floating point systems: More Simultaneous Dynamic Range with Fixed Point or Floating Point? and this excellent presentation @RBJ has made at the 2008 AES Conference https://www.aes.org/events/125/tutorials/session.cfm?code=T19 which I am not sure is available anywhere online (Robert can comment). At that other post RBJ educated me about the additional hidden bit in the dynamic range result that I had confirmed with the results in my answer there.

Quantization Noise in an Accumulator

Regardless of fixed or floating point, the noise due to accumulation is present in both structures (Moving Average Filter and CIC Filter) and is common to any accumulator, so it is worthwhile to detail that operation fully.

For the case of the Moving Average Filter where the accumulation is done over a fixed number of iterations, the resulting noise due to precision is stationary, ergodic, band-limited and will approach a Gaussian distribution.

In contrast, the output of the accumulator in the CIC Filter (not the final output but the internal node) is a non-stationary, non-ergodic random walk process, with otherwise similar qualities to what we will detail below for accumulator noise.

Noise due to quantization is reasonably approximated as a white noise process with a uniform distribution. The variance of a uniform distribution is $r^2/12$, where $r$ is the range; thus resulting in the $q^2/12$ variance for quantization noise with $q$ being a quantization level. What occurs as this noise is accumulated is demonstrated in the diagram below, where for any addition, the distribution at the output of the adder would be the convolution of the distributions for the noise samples being summed. For example, after one accumulation the uniform distribution at the input would convolve with the uniform distribution of the previous sample resulting in a triangular distribution also with a well known variance of $q^2/6$. We see through successive convolutions after each iteration of the accumulator that the variance grows according to:

$$\sigma_N^2 = \frac{Nq^2}{12}$$

This is the predicted variance both at the output just prior to the scaling of the Moving Average Filter, where $N$ is fixed (11 in the OP's example), and at the output of the accumulator ("Integrator") in the CIC filter, where $N$ is effectively a counter that increases with every sample of operation. Consistent with the Central Limit Theorem, the distribution after a fixed number of counts $N$ quickly approaches a Gaussian and, due to the obvious dependence between samples introduced in the operation, will no longer be white (and the structures themselves are low-pass filters). The scaling by dividing by $N$, appropriately placed at the Moving Average Filter output, returns the variance to $\sigma^2 = q^2/12$, thus having the same variance as the input but now with a band-limited, nearly Gaussian distribution. Here we see how critical it is to allow filters to grow the signal (extended precision accumulators) and, if we must scale, to reserve scaling for the output of the filter. Never scale the input, and never scale the coefficients! Scaling in these alternate approaches will result in increased noise at the output.

Quantization Noise in Accumulator
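
A quick numerical check of this variance growth (a minimal sketch assuming $q = 1$ and $N = 11$):

% Sketch: the variance of a sum of N independent uniform quantization errors
% grows as N*q^2/12 (q = 1 and N = 11 assumed here).
q = 1; N = 11; trials = 1e5;
e = (rand(trials, N) - 0.5) * q;   % uniform errors in [-q/2, q/2]
disp(var(sum(e, 2)))               % measured: approximately 0.92
disp(N * q^2 / 12)                 % predicted: 0.9167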

Thus we see that the predicted noise variance due to precision at the output of the Moving Average Filter is $q^2/12$, and is a Gaussian, band-limited, ergodic and stationary noise process.

Noise at Output of CIC Filter

The noise at the output of the accumulator in the CIC implementation has a variance that increases with every sample, so it is a non-stationary, non-ergodic random walk process. The structure is itself a low-pass filter, creating dependence between the samples so that they are no longer independent. We would almost at this point declare it to be unusable, but then in the following differencing structure we see where the magic happens: similar to using the 2-sample variance to measure random systems with divergent properties, the output of the delay and subtraction as done in the "Comb" is a stationary, ergodic, nearly Gaussian random process!

Specifically, given the difference of the two random walk signals, namely the signal and the same random walk signal as it was $N$ samples prior, we see that the result of this difference is the same as we achieved for the Moving Average Filter output; specifically, before scaling:

$$\sigma_N^2 = \frac{Nq^2}{12}$$

And the final scaling operation results in the same $q^2/12$ result for the CIC Filter as was obtained for the Moving Average Filter, with all the same properties regarding stationarity, ergodicity and band-limiting.

Also to be noted here is that the accumulator output noise, as a random walk noise process, grows in variance without bound at rate $N$; this means that inevitably the accumulator output will over/under flow due to error alone. For a fixed point system this is of no consequence as long as the operation rolls over on such an overflow or underflow condition; the subsequent subtraction, as long as only one such over/under flow occurred between the signals being subtracted, gives the same result (modulo arithmetic). However in floating point, an over/under flow error will occur. There is a very low likelihood for this to occur, given the error growth rate of $N\sigma^2$, unless our signal itself is operating continuously with a minimum or maximum exponent scale. For example, with single-precision floating point, and considering a probability of occurrence bounded by as large as $5\sigma$ to say "unlikely", it would take $12 \times 2^{25}/5$, which is approximately 80.5M samples, for the error to traverse every exponent and then reach over/underflow. This would be a good justification to only do the CIC filter in fixed point implementations, unless it is known that both the signal magnitude and total processing duration would prohibit this condition from occurring.

Simulation Results

The first simulation is to confirm the noise characteristics and variance of the accumulator output. This was done with uniform white noise with $q = 1$, accumulated and differenced over 11 samples following the CIC structure (no output scaling was done). The upper plot below shows the noise at the output of the accumulator as well as the delayed version of this same signal from within the comb structure prior to being differenced. We see the unbounded wandering result of this random walk signal, but we also see that, due to the correlation / dependency introduced in the accumulator, the difference between these two signals is stationary and bounded, as shown in the middle plot. The histogram over a longer sequence confirms the Gaussian shape, and the variance of this result with $q=1$ in the simulation was measured to be 0.907, close to the value predicted by $Nq^2/12$ with $N = 11$. (This is the predicted variance of the output of the CIC prior to the final divide by $11$ that is shown in the first diagram.)

Accum noise
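
A minimal sketch of how this first simulation can be reproduced (my own reconstruction with assumed parameters, not the exact code used for the plots):

% Sketch: accumulate white uniform quantization noise (q = 1) and difference
% with an 11-sample delay, as in the CIC structure (no output scaling).
q = 1; N = 11; nSamp = 1e5;
e = (rand(nSamp, 1) - 0.5) * q;          % white uniform quantization noise
acc = cumsum(e);                         % accumulator output: a random walk
combOut = acc(N+1:end) - acc(1:end-N);   % comb: difference with N-sample delay
disp(var(combOut))                       % approximately N*q^2/12 = 0.917
hist(combOut, 50)                        % histogram is approximately Gaussian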

An FFT of the differenced signal that was in the histogram above confirms the expected band-limited result:

Spectrum of noise at accum output

Finally, the various implementations were compared using single precision floating point, so that we could use a double precision reference model as representative of "truth" for the desired moving average computation, and allow for the ability to extend the precision appropriately in the fixed point result to confirm best practice for implementation.

For this simulation, the following models were compared with names used and descriptions below:

base: Baseline double precision moving average filter used as a reference: I compared using filter and conv with identical results, and ultimately used:

base = filter(ones(11,1),11,x);

I also confirmed that the scaling of 11 shown is effectively done at the end per the diagram.

base SP: Moving average filter same as baseline with single precision floating point, which will confirm the noise variance growth by a factor of $N$ due to not having an extended precision accumulator:

y_filt_sp = filter(ones(windLen, 1, "single"), single(windLen), single(x));

OP: Single precision implementation of the Hogenauer done as a for loop like the OP had done, but significantly faster than the OP's actual approach. I confirmed that the result is cycle and bit accurate to his by using a double precision variant of this. I also confirmed that what is shown below is functionally identical to scaling after the loop. The issue is that the accumulator is not extended precision.

y_mod_sp = nan(testLen, 1);
xBuff = zeros(windLen + 1, 1, "single");
accum = single(0);
for a = 1:testLen
  # accumulate
  accum += single(x(a));
  # shift into buffer
  xBuff = shift(xBuff, 1);
  xBuff(1) = accum;

  # comb and scale (works the same if the scale is moved to after the loop)
  y_mod_sp(a) = (xBuff(1) - xBuff(windLen + 1)) / single(windLen);
endfor

CIC: Single Precision Floating Point CIC Implementation without extended precision accumulator:

# hogenauer with filter command
y_hog_sp = filter(single([1 0 0 0 0 0 0 0 0 0 0 -1]), single([windLen -windLen]), single(x));

CIC_ext: Single Precision Floating Point CIC with extended precision Accumulator:

# hogenauer with filter command, extended precision (demonstrating
# the benefit of scaling only at the output)
y_hog_sp2 = single(filter([1 0 0 0 0 0 0 0 0 0 0 -1], [windLen -windLen], x));

With the following results as presented in the plot below, showing the differences from baseline in each case (given as "base - ....").

In summary, we expect the error signal from baseline at the output of the single precision CIC filter to have a standard deviation of $\sigma = q/\sqrt{12}$ where $q = 1/2^{25}$, resulting in $\sigma = 8.6e-9$.
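
For reference, the arithmetic behind that prediction (a quick check assuming $q = 2^{-25}$ for single precision):

% Quick check of the predicted noise floor for the single precision CIC output.
q = 1 / 2^25;          % single precision step: 23-bit mantissa + sign + hidden 1
disp(q / sqrt(12))     % approximately 8.6e-9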

From the simulation, the actual results for standard deviations were (for the stationary cases):

base - OP: $\sigma = 2.1e-7$

base - CIC: (not stationary)

base - base SP: $\sigma = 2.5e-8$

base - CIC ext: $\sigma = 7.8e-9$

Comparative errors Single Precision

I do not yet understand why the precision limitation in the CIC approach using the filter command results in a random walk error, and this requires further investigation. However we do see, by using an extended precision accumulator as shown in the "base-CIC ext" case, that the best possible performance is achieved for numerical error. Extending the precision in the OP's method would certainly result in similar performance (at a much larger run time in MATLAB, but it may illuminate approaches on other platforms, which I suspect was the motivation for coding it in a loop).
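
As a concrete sketch of that fix (my own variant, assuming the structure otherwise stays the same as the OP model above), the loop can keep the buffer and accumulator in double precision and quantize to single only at the output:

% Sketch: the OP-style loop with an extended (double) precision accumulator,
% casting to single precision only at the output.
windLen = 11; testLen = 10^4;
x = single(randn(testLen, 1));
xBuff = zeros(windLen + 1, 1);      % double precision buffer
accum = 0;                          % double precision accumulator
y_ext = nan(testLen, 1, 'single');
for a = 1:testLen
    accum = accum + double(x(a));             % accumulate in double
    xBuff = [accum; xBuff(1:end-1)];          % shift accumulator output into buffer
    % comb and scale, quantizing to single only at the output
    y_ext(a) = single((xBuff(1) - xBuff(windLen + 1)) / windLen);
end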

The 'base-base SP' result demonstrates how the standard deviation will grow by a factor of $\sqrt{N}$ if an extended precision accumulator is not used in the standard Moving Average Filter, where the result of $\sigma = 2.5e-8$ is in close agreement with the prediction given by $\sigma = \sqrt{11/12}/2^{25} = 2.85e-8$.

The OP's result is an order of magnitude larger than expected and is quite bursty, although it does appear to be stationary. The explanation for the "burstiness" of the errors in the OP model is clearer after observing the plot of the actual signal (not the difference signal) at the accumulator output, shown below. The floating point error is proportional to this signal, depending on which exponent we are in; for single precision floating point the associated error, or minimum quantization level, is $2^{25}$ times smaller than the magnitude at that exponent. We see from the plot of the simulation result above that the error magnitude in the output for the OP case is generally proportional to the absolute magnitude of the accumulator output, which is an unbounded random walk! It is for this reason imperative that the precision at the accumulator be extended, such that the maximum deviation of the random walk between the resulting signal and its delayed copy in the comb does not exceed the final precision desired. This is the reason the OP is seeing 10x more noise in that implementation!

Signal at Accumulator output for OP


CODE COMPARISON IN OP'S QUESTION:

The OP's comparative code for the option using filter() should not be inside a loop! (Observe that the exact same y2 result, which is itself $10^4$ samples long, is simply being computed $10^4$ times.)

Below is the correct comparison, showing the Hogenauer filter (CIC) structure simulated with the filter command (y2) and compared to the OP's code for the same function (y). The filter line for y2 executes the entire $10^4$ sample data set in 0.854 seconds on my machine, while the other code took as long as it took me to write this and was still crunching, so I canceled it and reduced testLen to 3000 samples to get a quicker comparison (97.08 seconds vs 0.039 seconds):

clc
clear
windLen = 11;
testLen = 10^4;
normCoeff = 1 / windLen;
xBuff = zeros(windLen, 1); 
x = randn(testLen, 1);

tic
for a = 1:testLen
    varState = 0;
    y = nan(size(x));
    xBuff(windLen + 1:windLen + length(x)) = x;
    for ind = 1:length(x)
        varState = varState + xBuff(windLen + ind) - xBuff(ind);
        y(ind) = varState * normCoeff;
    end
end
toc

tic
y2 = filter([1 0 0 0 0 0 0 0 0 0 0 -1], [11 -11], x);
toc

And the resulting error difference y-y2:

Error difference

A quicker implementation in MATLAB of the Hogenauer in a loop form (in case that was really needed to be consistent with a C implementation for example) but without yet addressing the "mysterious" error contribution, would be as follows:

tic
y = nan(testLen, 1);
xBuff = zeros(windLen + 1, 1);
accum = 0;
for a = 1:testLen
    # accumulate
    accum += x(a);
    # shift into buffer
    xBuff = shift(xBuff, 1);
    xBuff(1) = accum;

    # comb and scale
    y(a) = (xBuff(1) - xBuff(windLen + 1)) / windLen;
endfor
toc

tic
y2 = filter([1 0 0 0 0 0 0 0 0 0 0 -1], [11 -11], x);
toc

For this case I was able to quickly process the full $10^4$ samples resulting in comparative elapsed time of 0.038 seconds for the filter() approach (y2) vs 2.385 seconds for the loop approach (y). The difference between the two results y-y2 is plotted below:

difference error

Dan Boschen
  • I probably tried to implement something similar to the Hogenauer Filter. It seems that the Hogenauer Filter, as you implemented it, requires 12 multiplications and 1 division (which is probably more expensive than multiplication, right?). Besides, there is a similar number of additions. I have tried to implement it manually, eliminating the multiplications with zeros and ones and the corresponding additions. Also, what I am interested in is the std of the expected error. Both my implementation and your Hogenauer Filter provided an std(E) smaller than sqrt(n)*eps, which is not clear to me. – Gideon Genadi Kogan Sep 30 '20 at 09:59
  • Also, I am trying to implement it to rolling RMS but if the error is too large, I can face complex output. Welford's method seems to be related to the objective but it handles moving variance rather than moving sum of squares. – Gideon Genadi Kogan Sep 30 '20 at 15:34
  • @GideonGenadiKogan The 0's don't require any multipliers and further it is "multiply" by 1 so there are no multipliers required. We typically won't divide and take the scaling for an efficient multiplication, resulting in a delay and subtraction at the output of an accumulator (simple!). (So only one addition in the accumulator, and one subtraction after. The hogenauer as I demonstrated has a std that is the expected $\sqrt{n}$ factor. If you don't delay enough you will get less so perhaps that is your issue? For a moving average of 11 the delay is $z^{-12}$. – Dan Boschen Sep 30 '20 at 21:13
  • But how does the implementation you give use fewer operations than a Hogenauer (which is just two additions)? – Dan Boschen Sep 30 '20 at 21:18
  • Assuming that multiplication with 0 and 1 do not take time, the CIC implementation is better than mine. Do you have any reference that indicates that those multiplications are "free"? I prefer references for C/arm... – Gideon Genadi Kogan Oct 01 '20 at 05:50
  • I am not sure what you mean by "usually we do not divide and then multiply by the normalization factor". I have made a minor edit of the code. Can you elaborate a little bit more on why it is not good practice? – Gideon Genadi Kogan Oct 01 '20 at 06:07
  • Please see the addition above – Gideon Genadi Kogan Oct 01 '20 at 06:52
  • @Gideon In the second code there is no need for a loop. It is simply y2 = filter([1 0 0 0 0 0 0 0 0 0 0 -1], [11 -11], x); – Dan Boschen Oct 02 '20 at 03:28
  • @Gideon I don't understand your question about multiplications, when you multiply by 1, there is no multiplier. And if you were going to multiply by zero into a subsequent summation, you ignore that path to the summation -- so in both cases there is no additional operation and there is no multiplier. Your code appears to be a Hogenauer filter to me unless I am missing a detail? – Dan Boschen Oct 02 '20 at 03:40
  • @GideonGenadiKogan I added additional cases following what I think you were trying to do and your concern about the error behavior. Read the additions as I think it answers your core question now. Basically, don't scale until the end. – Dan Boschen Oct 02 '20 at 06:19
  • @robertbristow-johnson not quite as long as it appears, there is a whole earlier version I am deleting once I have it right--patience, don't read yet. I accidentally erased major portions because I hit esc and a key stroke so learned my lesson and saving as I go – Dan Boschen Oct 03 '20 at 21:14
  • @robertbristow-johnson ok I'm done. It's still a treatise. – Dan Boschen Oct 03 '20 at 21:47
  • @robertbristow-johnson (I mentioned one of your presentations deep in there... is it available online?) – Dan Boschen Oct 03 '20 at 21:55
  • it was never a paper, but was a powerpoint presentation. it was done on a Mac that i still have and can power up, but it's a very old version of power point. if i find the files, would you want to host the PPT somewhere? essentially i shown that if you don't need 40 dB of headroom, you're better off with 32-bit fixed than with 32-bit float. i also did some sound demonstrations down to 7-bit words to demonstrate the effect of a few different fixed and float formats with different dithers and different noise shaping. – robert bristow-johnson Oct 03 '20 at 23:22
  • were you at that AES workshop in 2008, Dan? – robert bristow-johnson Oct 03 '20 at 23:33
  • @robertbristow-johnson I have the presentation as you emailed it to me. Sure I would be happy to if I have your permission to post it. But yes that was my whole point of the linked "DSP-puzzle"; in most cases unless you are dealing with an ill-conditioned matrix fixed point is usually the better choice with regards to performance, cost, size.... you just have to be a lot more skilled to use it or be prepared to run into issues, and have a lot more time available to do it right. For NRE vs high volume production that is usually the right trade. – Dan Boschen Oct 03 '20 at 23:40
  • so i sent you the PPT? i didn't remember. you can post it, i don't care. but it's not a paper. it doesn't have the verbiage of a presentation. just the visuals. – robert bristow-johnson Oct 04 '20 at 01:26
  • @robertbristow-johnson Did I say paper? It says "presentation" where I referenced it. I found your email but looks like I have to unzip / organize it as the content is a companion to the ppt rather than embedded in it. On my to-do list, but wasted too much other time today on this answer ;) – Dan Boschen Oct 04 '20 at 01:50
  • i think the ppt is self-contained except for the audio files. but i think they were in a subdirectory or something. the old MATLAB code that generated the audio files is there also. i hope you can detangle it. – robert bristow-johnson Oct 04 '20 at 01:56
  • @robertbristow-johnson I should be able to; probably not until after this month but I’ll be setting up a new server where I can put it – Dan Boschen Oct 04 '20 at 02:00
  • whatever is fine with me. – robert bristow-johnson Oct 04 '20 at 02:01