Parallel decimating FIR

Question

I'm trying to implement a parallel decimating FIR targeting an FPGA. The data rate is 4x the FPGA clock speed so I am receiving 4 separate streams of the demultiplexed data stream.

I need to decimate x2 and filter this data such that I am left with 2 parallel data streams (data rate of 2x the FPGA clock speed).

I'm using a half-band FIR filter in a polyphase architecture and have split the filter taps into 4 phases. One phase is all zeros, and another is all zeros apart from the centre tap, so there are some efficiencies I can exploit there.

The textbook I have hasn't helped me with this particular problem, and I can't find much on the web either. I guess it might be obvious if I understood the maths a bit better, but at the moment it is not.

The following is a block diagram of what I'm trying to achieve. It shows the data in and out at the overall rate rather than what is physically implemented in the FPGA. Hopefully it makes sense. The part I'm unsure of is how the filter channels are supposed to be combined at the output (hence the big box with a '?' in it!). Any advice on this would be appreciated. Thanks in advance.

You can also find some useful links in this forum entry: Dsprelated-forum — Irreducible, Mar 28 '22 at 09:21
Note that I am teaching an online course starting at the end of this week that covers this and other cool DSP tricks in a LOT more detail. More information here: https://ieeeboston.org/2022-courses/ (It is the DSP for Wireless Communications Course). — Dan Boschen, Mar 28 '22 at 12:10
@DanBoschen do you have plans to offer this course at more convenient times for Europeans as well? 6PM PST is like midnight here. — Irreducible, Mar 29 '22 at 05:42
@Irreducible I open up 2 sessions when there are enough sign-ups but the workshops are more "office hours/ Q&A" sessions, and I'm available anytime via email regardless. Most of the important content is through pre-recorded videos and workshops are also recorded. Email me for more info: boschen @ loglin dot com — Dan Boschen, Mar 29 '22 at 10:46

Dan Boschen · Accepted Answer · 2022-06-09T13:10:43.110

The OP wants to implement a decimate by two on input samples that have been parallelized into four streams; taken together, the four streams carry the higher-rate waveform that is to be decimated by two. This necessitates a parallel decomposition of the high-rate filter so that the equivalent structure can run in parallel at a lower rate. The following will first show the resulting parallel form for a decimate by two, and then detail the process for creating such parallel structures.

Regardless of the creation of four streams, we would create a decimate by two by first filtering the input with a halfband filter to eliminate possible aliasing and then selecting every other sample at the output to provide the samples at half the input rate, as depicted in the diagram below. A halfband filter specifically offers the convenience that nearly every other coefficient is zero, leading to further reductions in processing.

Direct Form Decimate by Two (Input at Rate R, Output at Rate R/2)
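As a reference point for the parallel forms, the direct form can be sketched in a few lines of NumPy. This is only an illustration: the function name `decimate_by_two_direct` and the 3-tap toy filter (standing in for the halfband) are assumptions, not part of the original answer.

```python
import numpy as np

def decimate_by_two_direct(x, h):
    """Filter at the full rate R, then keep every other output (rate R/2)."""
    y = np.convolve(x, h)[:len(x)]  # causal FIR output, one sample per input
    return y[::2]                   # discard every other sample

# Toy example: a 3-tap smoothing filter stands in for the halfband here.
x = np.arange(8, dtype=float)
h = np.array([0.25, 0.5, 0.25])
v = decimate_by_two_direct(x, h)
print(v)
```

Half of the filter outputs are computed and then thrown away, which is the waste the parallel structure avoids.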

With the four input streams created by commutating the input waveform, each running at one-fourth the input rate ($R/4$), we can process with the same filter by decimating the filter coefficients across four filters as shown below, with each filter also running at rate $R/4$. The coefficients from the original halfband filter are mapped in row-to-column form across the four parallel filters. The diagram below shows the output combining for a generic decimate by 2 with any arbitrary FIR filter:

Parallel Form Decimate by 2: Four inputs at rate R/4, Two outputs at rate R/4
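A minimal NumPy sketch of this generic parallel form follows, checked against the direct form. Which stream feeds which sub-filter, and where the unit delays sit, are worked out here from the convolution sum, so the labelling may differ from the diagram. The helper names (`fir`, `delay`, `parallel_decimate_by_two`) and the random test data are assumptions for illustration.

```python
import numpy as np

def fir(stream, taps):
    """Causal FIR at the stream's own rate; one output per input sample."""
    return np.convolve(stream, taps)[:len(stream)]

def delay(stream, d):
    """Delay by d samples at the stream's own rate, zero initial state."""
    return np.concatenate([np.zeros(d), stream])[:len(stream)]

def parallel_decimate_by_two(x, h):
    """Four R/4 inputs -> two R/4 outputs (even and odd decimated samples)."""
    x0, x1, x2, x3 = (x[k::4] for k in range(4))   # commutated input streams
    h0, h1, h2, h3 = (h[p::4] for p in range(4))   # row-to-column tap mapping
    # Even output samples: three of the four terms need one R/4 sample delay.
    even = fir(x0, h0) + delay(fir(x3, h1) + fir(x2, h2) + fir(x1, h3), 1)
    # Odd output samples: only the FIR3 term needs the delay.
    odd = fir(x2, h0) + fir(x1, h1) + fir(x0, h2) + delay(fir(x3, h3), 1)
    return even, odd

rng = np.random.default_rng(0)
x = rng.standard_normal(64)      # input length is a multiple of 4
h = rng.standard_normal(40)      # arbitrary FIR, length a multiple of 4
even, odd = parallel_decimate_by_two(x, h)

y = np.convolve(x, h)[:len(x)]   # direct full-rate filter for comparison
print(np.allclose(even, y[0::4]), np.allclose(odd, y[2::4]))
```

Every operation inside `parallel_decimate_by_two` runs at rate $R/4$; interleaving `even` and `odd` gives the $R/2$ decimated waveform.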

For the case of a halfband filter specifically, the row-to-column mapping leaves one of the filters with all-zero coefficients (so it is eliminated), and another contains only the centre tap and is replaced by a delay. With a 40 tap FIR halfband mapped to FIR0, FIR1, FIR2, and FIR3, FIR1 was zeroed out and FIR3 resulted in a fixed 4 sample delay ($z^{-4}$). In this case the block diagram above simplifies to the following:

Reduced Form When Using Halfband: Four inputs at rate R/4, Two outputs at rate R/4
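The reduction can be illustrated with a stand-in halfband. The answer's actual coefficients are not given, so the sketch below assumes a 39-tap Hamming-windowed sinc padded with one zero to 40 taps; the helper names are also assumptions. In this sketch the centre tap `c` scales the delayed stream, and FIR3's own 4 sample delay combines with the structure's unit delay into a 5 sample delay at rate $R/4$.

```python
import numpy as np

def make_halfband_40():
    """39-tap windowed-sinc halfband, zero-padded to 40 taps (assumed design)."""
    k = np.arange(39) - 19                     # tap index relative to the centre
    h = 0.5 * np.sinc(k / 2) * np.hamming(39)
    h[(k % 2 == 0) & (k != 0)] = 0.0           # force the structural zeros exactly
    return np.append(h, 0.0)

def fir(stream, taps):
    """Causal FIR at the stream's own rate; one output per input sample."""
    return np.convolve(stream, taps)[:len(stream)]

def delay(stream, d):
    """Delay by d samples at the stream's own rate, zero initial state."""
    return np.concatenate([np.zeros(d), stream])[:len(stream)]

h = make_halfband_40()
h0, h1, h2, h3 = (h[p::4] for p in range(4))   # FIR0..FIR3, 10 taps each
c = h3[4]                                      # the only non-zero tap of FIR3

def halfband_parallel_decimate(x):
    """Reduced form: FIR1 dropped, FIR3 replaced by a scaled delay."""
    x0, x1, x2, x3 = (x[k::4] for k in range(4))
    even = fir(x0, h0) + delay(fir(x2, h2), 1) + c * delay(x1, 5)
    odd = fir(x2, h0) + fir(x0, h2) + c * delay(x3, 5)
    return even, odd

rng = np.random.default_rng(3)
x = rng.standard_normal(64)
even, odd = halfband_parallel_decimate(x)
y = np.convolve(x, h)[:len(x)]                 # direct full-rate reference
print(np.count_nonzero(h1), np.count_nonzero(h3))
print(np.allclose(even, y[0::4]), np.allclose(odd, y[2::4]))
```

Only FIR0 and FIR2 remain as real filters, each instantiated twice, plus two scaled delay lines.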

The complete answer is above. Derivation on Parallel Decomposition below:

The parallel decomposition was determined by first considering the four $R/4$ input streams as they would be processed through the originating halfband filter at the full rate $R$. To do this, we would zero-fill each of the streams offset by one sample and sum the resulting output as each stream is convolved with the halfband filter coefficients as depicted in the diagram below:
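This starting point relies only on linearity, and can be checked numerically with a short sketch. The `zero_fill` helper and the random data below are illustrative assumptions.

```python
import numpy as np

def zero_fill(stream, offset, n=4):
    """Place stream samples in every n-th slot at rate R, starting at `offset`."""
    out = np.zeros(n * len(stream))
    out[offset::n] = stream
    return out

rng = np.random.default_rng(1)
x = rng.standard_normal(32)   # full-rate waveform
h = rng.standard_normal(12)   # any FIR filter works for this check

# Four zero-filled datapaths, each offset by one sample from the previous one.
paths = [zero_fill(x[k::4], k) for k in range(4)]

y_sum = sum(np.convolve(p, h) for p in paths)  # filter each path, then sum
y_ref = np.convolve(x, h)                      # filter the composite directly
print(np.allclose(y_sum, y_ref))
```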

Given the zero-fills, each of the four datapaths can be filtered with a parallel filter bank running at rate $R/4$. This is of similar form to a polyphase interpolator (but as noted at the end of this post, these filters are NOT polyphase due to aliasing). For example the parallel decomposition for the output of Datapath 0 is shown below, where the full rate output at rate $R$ is reconstructed by commutating the filter outputs. Here the Datapath 0 enters the filter at rate $R/4$ without zero-filling: on each $R/4$ cycle, the next input is shifted into all four filters in parallel, then the outputs of all four filters are computed. Before the next $R/4$ cycle, the filter outputs are read sequentially at rate $R$, and then the next input is shifted into all four filters in parallel and the process repeats. Thus the input and all the filters are running at rate $R/4$ yet producing an identical result to the form previously given above (for each of the Datapaths).
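The equivalence for a single datapath can be sketched as below: four sub-filters run at rate $R/4$ on the un-stuffed stream, and commutating their outputs reproduces the full-rate filtering of the zero-filled stream. The function name `commutated_bank` and the test data are assumptions for illustration.

```python
import numpy as np

def commutated_bank(stream, h, n=4):
    """Run n sub-filters at the low rate and interleave (commutate) the outputs."""
    m = len(stream)
    out = np.zeros(n * m)
    for p in range(n):
        # Sub-filter p holds taps h[p], h[p+n], ... and fills output slot p.
        out[p::n] = np.convolve(stream, h[p::n])[:m]
    return out

rng = np.random.default_rng(2)
d0 = rng.standard_normal(16)   # Datapath 0 at rate R/4
h = rng.standard_normal(40)    # full-rate filter coefficients

zero_filled = np.zeros(4 * len(d0))
zero_filled[0::4] = d0         # zero-filled Datapath 0 at rate R
y_full_rate = np.convolve(zero_filled, h)[:len(zero_filled)]
y_bank = commutated_bank(d0, h)
print(np.allclose(y_bank, y_full_rate))
```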

To develop the proper combining, we will first show how to combine the outputs to get the composite output at the full rate $R$, and then later we will show the reduction for this specific case of a decimate by two with two outputs each running at rate $R/4$. For this, we do the same processing for the other three datapaths, but before combining we must introduce the additional sample delay as visualized in the graphic "Parallel Datapath at Rate R":

We can eliminate the unit sample delays by modifying the start position of the commutators. Here in the drawing I denote that the next input sample is not loaded until the commutator reaches FIR0. So on the first cycle at rate $R$, the filterbank for Datapath 0 is loaded with the first sample from that Datapath, but the other three commutators simply advance one sample (with all zeros in the filters or whatever initial condition exists). The filterbank for Datapath 1 loads on sample 2, the filterbank for Datapath 2 loads on sample 3, and the filterbank for Datapath 3 loads on sample 4. The filter outputs are recomputed after each new load, which therefore occurs at rate $R/4$, yet the outputs as commutated are summed at rate $R$ here in this diagram.

From this form we can then see the Decimate by Two structure: for the decimate by two, we select only every other output, giving rate $R/2$. Therefore we never use filter banks FIR1 and FIR3 for Datapath 0, and two banks are likewise eliminated for each of the four datapaths, as given in the simplified diagram below:

From this form we can convert it back to having the commutators all start at the top with an appropriate output delay to be equivalent, such that all filter banks are loaded in parallel from the four Datapaths at the same time:

This shows the implementation with four $R/4$ inputs and one $R/2$ output. Consider the structure of the decimated by two output for each of the four Datapaths as tabulated below for each $R/4$ output sample $n$:

This table is encoded as dk[m]fp, where $m$ refers to the $R/4$ input sample number, $k$ refers to the datapath from 0 to 3, and $p$ refers to the filter bank from 0 to 3. For example, the entry d3[1]f3 represents the output of filter FIR3 with input from Datapath 3 after the second sample ($m=1$, the first sample is $m=0$) from Datapath 3. Reviewing the above table should confirm the implementation for parallel even/odd output of the decimated by two waveform as depicted below:

This is equivalent to the solution first introduced above by moving the three unit delays in the Even Sample output to after the summer (sum then delay rather than delay then sum; the two operations are equivalent).
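Written out from the convolution sum, the even/odd combining can be summarized as follows. The notation is introduced here for illustration and is not the answer's own: $x_k[m] = x[4m+k]$ is Datapath $k$, $h_p[i] = h[4i+p]$ is the coefficient set of FIR$p$, $y[n] = \sum_j h[j]\,x[n-j]$ is the full-rate filter output, $q$ is the $R/4$ output sample index, and $*$ denotes convolution at rate $R/4$.

$$
\begin{aligned}
y[4q] &= (h_0 * x_0)[q] + (h_1 * x_3)[q-1] + (h_2 * x_2)[q-1] + (h_3 * x_1)[q-1] \\
y[4q+2] &= (h_0 * x_2)[q] + (h_1 * x_1)[q] + (h_2 * x_0)[q] + (h_3 * x_3)[q-1]
\end{aligned}
$$

Both lines follow from substituting $j = 4i + p$ and noting that $x[4(q-i) - p] = x_{4-p}[q-i-1]$ for $p = 1, 2, 3$. The three delayed terms in the even output share a single $R/4$ sample delay, which is why that delay can be applied after the summer. With a halfband, $h_1 = 0$ and $h_3$ has a single non-zero tap, so the $h_1$ terms drop out and the $h_3$ terms become delayed copies of $x_1$ and $x_3$.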

The further reduction due to the actual halfband coefficients is detailed below. The first diagram shows the originating coefficients for the halfband filter and the mapping to the four parallel filters; in this case a 40 tap halfband filter parallelized to four 10 tap filters:

Each of the four filters is shown separately below where we see FIR1 eliminated as all zeros and FIR3 as a four sample delay:

Note that the creation of the four parallel filters is identical to polyphase decomposition explained at this post. To be clear the structure is identical but the resulting filters themselves in this case are NOT polyphase due to aliasing within the filters (they do not each have an all pass response with varying phases as given by a polyphase filter bank): here we have decimated halfband filters by 4, where with a polyphase implementation we would require a quarterband filter to avoid aliasing. Regardless of the aliasing that does indeed occur on each of the four datapaths, the proper combining cancels out the aliased components and reconstructs the expected decimate by two output with no further distortion. The form produces an identical result as the decimate by two structure first introduced.

Confirmation of Implementation

The following waveform was used as a test, occupying 80% of the available decimate by two output bandwidth with intentional "jammer" tones introduced in the aliasing locations.

The test waveform together with the halfband filter overlaid is shown below (Note that 30 dB was subtracted from the halfband frequency response to align visually with the spectrum).

The resulting output spectrum for a direct decimate by 2 using the combined input at rate R and output at rate R/2 is compared to the spectrum with four parallel inputs at rate R/4 and two parallel outputs at rate R/4 (the outputs were commutated to create the final waveform for this spectrum).

Thanks @dan-boschen, I have further processing to do at the R/2 rate hence needing to get to that rate. I had felt there may be a way of combining the 4 polyphase outputs to get to the rate R/2 without the interpolate but maybe not. I'll give your architecture a go and see how I get on. — Adi, Mar 29 '22 at 14:34
@Adi yes I see your thought process- since we don’t need the Dec/4 result there is an alternate approach that is the Dec/2 desired. Let me show how that would compare. — Dan Boschen, Mar 29 '22 at 14:42
@Adi do you still need to run the filters at the lowest R/4 rate? If so I don’t immediately see a simpler approach. — Dan Boschen, Mar 29 '22 at 14:58
I believe so unless there are tricks to get around that. A bit of background may help; the ADC producing the sample data is running at 1400MSPS (700MHz of BW). The FPGA core clock is 350MHz hence why I have 4 parallel data streams (350 * 4 = 1400). I have to decimate this data by 2 (1400MSPS -> 700MSPS, 4 parallel data streams to 2 parallel data streams), do some further processing at the lower 700MSPS rate and then decimate by 2 again (350MSPS, to give a single data stream in the FPGA) and further processing again. Hope that makes sense. — Adi, Mar 29 '22 at 15:16
