Chapter 3 - Implementation
1 Filter Structures
1.1 Canonical Form
In the previous chapter, we gave the following difference equation as a general
form for LTI filters which can be implemented causally with a finite amount of
computation and memory:

$$y[n] = \sum_{k=0}^{M} a_k x[n-k] + \sum_{k=1}^{N} b_k y[n-k] \tag{1}$$

The corresponding transfer function is

$$H(z) = \frac{a_0 + a_1 z^{-1} + \cdots + a_M z^{-M}}{1 - b_1 z^{-1} - \cdots - b_N z^{-N}}
       = \frac{a_M}{b_N} \cdot \frac{(z-z_1)(z-z_2)\cdots(z-z_M)}{z^M} \cdot \frac{z^N}{(z-p_1)(z-p_2)\cdots(z-p_N)}$$
One way to realize this system is as a cascade $H(z) = H_1(z)\,H_2(z)$ of the all-zero part,

$$H_1(z) = a_0 + a_1 z^{-1} + \cdots + a_M z^{-M}$$

and the all-pole part, $H_2(z) = 1/(1 - b_1 z^{-1} - \cdots - b_N z^{-N})$, leading to the direct form structure shown in Figure 1.
Figure 1: Direct form implementation structure for a filter with M zeros and
N poles. [Diagram: $H_1(z)$ followed by $H_2(z)$; feed-forward taps $a_0, a_1, \ldots, a_M$ and feedback taps $b_1, b_2, \ldots, b_N$, with unit delays $z^{-1}$ between successive taps.]
Figure 2: Canonical form, obtained by reversing the order of the cascade so that $H_2(z)$ precedes $H_1(z)$; the two delay lines then carry identical signals and may be merged into a single shared delay line.
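To make the computation concrete, here is a minimal Python sketch of this canonical (shared delay line) structure, implementing equation (1); the function name and list-based state are our own illustration, not part of the original notes.

```python
def canonical_filter(x, a, b):
    """Canonical form realization of equation (1).

    a = [a0, a1, ..., aM] are the feed-forward (zero) coefficients.
    b = [b1, ..., bN] are the feedback (pole) coefficients, with the
    sign convention of equation (1).
    """
    K = max(len(a), len(b))
    w = [0.0] * K                 # shared delay line: w[j] holds w[n-1-j]
    y = []
    for xn in x:
        wn = xn + sum(bk * wj for bk, wj in zip(b, w))   # feedback adder
        w = [wn] + w[:-1]         # advance the delay line: w[j] = w[n-j]
        y.append(sum(ak * wj for ak, wj in zip(a, w)))   # feed-forward taps
    return y

# Impulse response of H(z) = (1 + 0.5 z^-1) / (1 - 0.9 z^-1):
print(canonical_filter([1, 0, 0, 0], [1, 0.5], [0.9]))   # [1, 1.4, 1.26, 1.134]
```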
Figure 3: Parallel filter structure, realizing

$$H(z) = \frac{(z - \frac{1}{2})\, z^2}{(z - \frac{1}{4})(z^2 + z + \frac{1}{2})}$$

[Diagram: a first order section and a second order section operating on $x[n]$ in parallel, their outputs summed to form $y[n]$.]
Parallel structures are not so popular as cascade structures, in part due to the
added complexity of expanding the filter in this form and in part because they
often have higher implementation complexity. Pole positions have exactly the
same sensitivity to coefficient quantization as they do in the cascade structure;
however, zero locations are generally more sensitive to coefficient quantization
than they are in the cascade structure.
One advantage of parallel implementations over cascade implementations
concerns their sensitivity to numerical round-off errors. The round-off noise sources
associated with a parallel implementation are injected at the input and output
of each parallel filter section, while the round-off noise sources in a cascade
structure are injected at the input and output of each successive filter stage.
If a filter has multiple poles close together and near the unit circle, round-off
noise injected early in a cascade may be strongly amplified by the high-gain
stages which follow, whereas each parallel section amplifies only its own noise.
Figure 4: Third order FIR lattice filter implementation. The shaded box identifies the regular structure which is repeated throughout the lattice. [Diagram: $y_0[n] = w_0[n] = x[n]$; stage $m$ forms $y_m[n]$ and $w_m[n]$ from $y_{m-1}[n]$ and a delayed $w_{m-1}[n]$ via the cross coefficients $k_m$; the filter output is $y[n] = y_3[n]$.]
Each lattice stage implements $y_m[n] = y_{m-1}[n] + k_m w_{m-1}[n-1]$ and
$w_m[n] = k_m y_{m-1}[n] + w_{m-1}[n-1]$, which may be expressed in matrix form:

$$\begin{bmatrix} Y_m(z) \\ W_m(z) \end{bmatrix}
= \begin{bmatrix} 1 & k_m \\ k_m & 1 \end{bmatrix}
  \begin{bmatrix} 1 & 0 \\ 0 & z^{-1} \end{bmatrix}
  \begin{bmatrix} Y_{m-1}(z) \\ W_{m-1}(z) \end{bmatrix}
= \begin{bmatrix} 1 & k_m z^{-1} \\ k_m & z^{-1} \end{bmatrix}
  \begin{bmatrix} Y_{m-1}(z) \\ W_{m-1}(z) \end{bmatrix}$$

$$= \begin{bmatrix} 1 & k_m z^{-1} \\ k_m & z^{-1} \end{bmatrix}
   \begin{bmatrix} 1 & k_{m-1} z^{-1} \\ k_{m-1} & z^{-1} \end{bmatrix}
   \begin{bmatrix} Y_{m-2}(z) \\ W_{m-2}(z) \end{bmatrix}
= \cdots
= \left( \prod_{i=1}^{m} \begin{bmatrix} 1 & k_i z^{-1} \\ k_i & z^{-1} \end{bmatrix} \right)
  \begin{bmatrix} X(z) \\ X(z) \end{bmatrix}$$
This transfer matrix expansion provides us with one method to determine the
filter transfer function, $Y_m(z)/X(z)$.

To understand the lattice much better, however, let $A_m(z)$ and $B_m(z)$ denote
the transfer functions

$$A_m(z) = \frac{Y_m(z)}{X(z)}, \qquad B_m(z) = \frac{W_m(z)}{X(z)}$$

It turns out that these transfer functions are mirror images of each other for all
$m$. Specifically, the corresponding impulse responses satisfy

$$b_m[n] = a_m[m-n]$$

or, equivalently, $B_m(z) = z^{-m} A_m(z^{-1})$.
We may use the mirror image relationship between the transfer functions of
the lower and upper lattice branches to work back from a desired all-zero transfer
function, $H(z) = Y(z)/X(z)$, to find the corresponding lattice coefficients.
The key is to observe that the forward relationship

$$\begin{bmatrix} A_m(z) \\ B_m(z) \end{bmatrix}
= \begin{bmatrix} 1 & k_m z^{-1} \\ k_m & z^{-1} \end{bmatrix}
  \begin{bmatrix} A_{m-1}(z) \\ B_{m-1}(z) \end{bmatrix}$$

may be inverted as

$$\begin{bmatrix} A_{m-1}(z) \\ B_{m-1}(z) \end{bmatrix}
= \begin{bmatrix} 1 & k_m z^{-1} \\ k_m & z^{-1} \end{bmatrix}^{-1}
  \begin{bmatrix} A_m(z) \\ B_m(z) \end{bmatrix}
= \frac{1}{z^{-1}(1 - k_m^2)}
  \begin{bmatrix} z^{-1} & -k_m z^{-1} \\ -k_m & 1 \end{bmatrix}
  \begin{bmatrix} A_m(z) \\ B_m(z) \end{bmatrix}$$

from which, substituting $B_m(z) = z^{-m} A_m(z^{-1})$, we get

$$A_{m-1}(z) = \frac{A_m(z) - k_m z^{-m} A_m(z^{-1})}{1 - k_m^2} \tag{2}$$
Working backward through the lattice, we may determine each coefficient, $k_m$,
in turn by recognizing that $A_{m-1}(z)$ is an order $m-1$ filter, so that the
coefficient of $z^{-m}$ in the numerator of equation (2), $a_m[m] - k_m a_m[0]$,
must be 0. That is,

$$k_m = \frac{a_m[m]}{a_m[0]} \tag{3}$$

The lattice coefficients are generally known as "reflection coefficients."
Example 2 Consider the third order FIR filter,

$$H(z) = 1 + \frac{1}{4} z^{-1} - \frac{1}{4} z^{-2} + \frac{1}{2} z^{-3}$$

We know that $A_3(z) = H(z)$, so equation (3) yields

$$k_3 = \frac{a_3[3]}{a_3[0]} = \frac{h[3]}{h[0]} = \frac{1}{2}$$

Next, we use equation (2) to find

$$A_2(z) = \frac{\left(1 + \frac{1}{4} z^{-1} - \frac{1}{4} z^{-2} + \frac{1}{2} z^{-3}\right)
               - \frac{1}{2}\left(\frac{1}{2} - \frac{1}{4} z^{-1} + \frac{1}{4} z^{-2} + z^{-3}\right)}{1 - \frac{1}{4}}
        = \frac{4}{3}\left(\frac{3}{4} + \frac{3}{8} z^{-1} - \frac{3}{8} z^{-2}\right)
        = 1 + \frac{1}{2} z^{-1} - \frac{1}{2} z^{-2}$$

We can now use equation (3) again to find

$$k_2 = \frac{a_2[2]}{a_2[0]} = -\frac{1}{2}$$
Repeating the procedure with equation (2) gives $A_1(z) = 1 + z^{-1}$. Then

$$k_1 = \frac{a_1[1]}{a_1[0]} = 1$$

and we have found all of the reflection coefficients.
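The backward recursion of equations (2) and (3) is easy to mechanize. The following Python sketch (our own illustration; the function name is hypothetical) recovers the reflection coefficients from an FIR transfer function and reproduces the values found in Example 2.

```python
def reflection_coefficients(a):
    """Backward lattice recursion, equations (2) and (3).

    a = [a[0], a[1], ..., a[m]] are the coefficients of A_m(z).
    Returns [k_1, ..., k_m].
    """
    a = list(a)
    ks = []
    for m in range(len(a) - 1, 0, -1):
        k = a[m] / a[0]                  # equation (3)
        ks.append(k)
        if m == 1:
            break                        # A_0(z) is not needed
        # Equation (2): A_{m-1}(z) = (A_m(z) - k z^-m A_m(1/z)) / (1 - k^2)
        a = [(a[n] - k * a[m - n]) / (1 - k * k) for n in range(m)]
    return ks[::-1]                      # return k_1 first

print(reflection_coefficients([1, 0.25, -0.25, 0.5]))  # [1.0, -0.5, 0.5]
```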
An FIR filter has minimum phase (all zeros inside the unit circle)
if and only if all of the reflection coefficients are less than 1 in magnitude, i.e.,
$|k_m| < 1$.
Recall that minimum phase filters are the only filters which we can invert
causally. In fact, the inverse filter may be implemented using the all-pole lattice
structure described next, which is guaranteed to remain stable so long as the
reflection coefficients are all less than 1 in magnitude. The coefficients for such
filters may be efficiently represented as fixed-point quantities of the form

$$k_m = k_m' \cdot 2^{-(W-1)}$$

where $k_m'$ is a $W$-bit two's complement integer.
Figure 5: Third order all-pole lattice filter whose transfer function is the reciprocal of that of the FIR lattice filter in Figure 4. We have deliberately drawn the filter backwards, with the input on the right and the output on the left, to emphasize the connection between the FIR and IIR lattice structures and their internal state variables. [Diagram: the stages use coefficients $k_m$ and $-k_m$, with branch signals $y_0[n], \ldots, y_3[n]$ and $w_0[n], \ldots, w_3[n]$.]
Compare the regular structures identified by shaded boxes in the figures: these
structures clearly enforce exactly the same relationship between the vectors

$$\begin{bmatrix} y_{m-1}[n] \\ w_{m-1}[n] \end{bmatrix}
\quad\text{and}\quad
\begin{bmatrix} y_m[n] \\ w_m[n] \end{bmatrix}$$
Figure 6: Coupled implementation of a second order all-pole filter, built from two first order sections with coefficients $a$ and cross-coupling coefficients $b$ and $-b$.

Writing $G(z) = 1/(z-a)$ for the transfer function of each first order section, the overall transfer function is

$$H(z) = \frac{G(z) \cdot bG(z)}{1 + bG(z) \cdot bG(z)}
      = \frac{\frac{b}{(z-a)^2}}{1 + \frac{b^2}{(z-a)^2}}
      = \frac{b}{(z-a)^2 + b^2}$$
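As a quick numerical check (our own, not from the notes), the poles of $H(z) = b/((z-a)^2 + b^2)$ are the roots of $(z-a)^2 + b^2$ and should equal $a \pm jb$:

```python
import numpy as np

a, b = 0.6, 0.3
# Denominator (z - a)^2 + b^2 = z^2 - 2a z + (a^2 + b^2)
poles = np.roots([1.0, -2 * a, a * a + b * b])
print(poles)   # [0.6+0.3j, 0.6-0.3j], i.e. a +/- jb
```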
Example 3 Consider the canonical second order filter structure shown in Figure 7. The implementation involves two memories (or states), $s_1$ and $s_2$, corresponding to the outputs of the first and second delay elements, respectively. As each new input sample $x[n]$ arrives, the following operations are performed to derive a new output sample, $y[n]$; a code sketch follows the list.

• $s_0 \leftarrow x[n]$
• $s_0 \leftarrow s_0 + b_1 \cdot s_1$
Figure 7: Canonical second order filter structure, with feed-forward coefficients $1$, $a_1$, $a_2$ and feedback coefficients $b_1$, $b_2$ sharing a single delay line.
• $s_0 \leftarrow s_0 + b_2 \cdot s_2$
• $y \leftarrow s_0$ (use $y$ to accumulate the output of the second adder)
• $y \leftarrow y + a_1 \cdot s_1$
• $y \leftarrow y + a_2 \cdot s_2$
• $y[n] \leftarrow y$
• $s_2 \leftarrow s_1$, $s_1 \leftarrow s_0$ (operate the shift register to prepare for the next sample)
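The operation sequence above translates directly into code. A minimal Python sketch (our own illustration) of one sample period of this canonical second order section:

```python
def second_order_step(x_n, state, a1, a2, b1, b2):
    """One sample period of the canonical second order section of Figure 7.

    state = [s1, s2] holds the outputs of the two delay elements;
    returns (y_n, new_state).
    """
    s1, s2 = state
    s0 = x_n + b1 * s1 + b2 * s2   # accumulate the feedback adder
    y = s0 + a1 * s1 + a2 * s2     # feed-forward taps (a0 = 1)
    return y, [s0, s1]             # shift register: s2 <- s1, s1 <- s0
```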
Fixed-point DSP processors also typically provide dedicated address generation logic which is well adapted to certain typical DSP operations.
For full custom hardware solutions, similar considerations apply, and fixed-point
implementations generally result in smaller, less expensive chips with higher
throughput. We briefly consider floating point implementations in Section 2.7.

Fixed-point processors treat all quantities as integers. It is up to the im-
plementor to establish a relationship between these integers and the real-valued
quantities which would ideally be processed. For example, suppose that the
input sequence satisfies $|x[n]| < A$ for some bound, $A$, and we wish to rep-
resent the samples as two's complement integers, each having $W$ bits. This
could be achieved by scaling the input samples by $2^{W-1}/A$ and rounding to
the nearest integer. The implementor must then bear in mind that the integer
values are scaled versions of the original input samples, scaling the output back
accordingly.
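For instance, a minimal sketch of this input scaling convention (the function names and the choices $A = 1.0$, $W = 16$ are our own illustration):

```python
import numpy as np

def quantize_input(x, A=1.0, W=16):
    """Map real samples |x[n]| < A to W-bit two's complement integers."""
    scale = 2 ** (W - 1) / A
    q = np.round(np.asarray(x) * scale).astype(np.int64)
    # guard against the one value that rounds just out of range
    return np.clip(q, -(2 ** (W - 1)), 2 ** (W - 1) - 1)

def dequantize_output(y, A=1.0, W=16):
    """Scale integer results back to real values."""
    return np.asarray(y) * (A / 2 ** (W - 1))
```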
To minimize the impact of integer rounding effects, it is desirable to scale
the results produced by each processing step so that they utilize as many of
the available bits as possible. This can lead to a large number of different
scale factors, which may require additional multiplications². For this reason,
it is common to adopt a convention which restricts the scale factors to exact
powers of 2. This leads to a description of fixed-point quantities in terms of two
quantities: the number, $I$, of integer bits; and the number, $F$, of fraction bits.

We say that a real-valued quantity $x$ has an $I.F$ fixed point representation
as a $W = I + F$ bit integer $x'$, if

$$x = 2^{-F} x' \tag{4}$$

For example, the 2.3 representation of $x = -0.875$ is

$$x = 11.001_2$$

The 3 fraction bits are separated from the 2 integer bits by the binary point. The
integer $x'$, used to represent $x$, is given by

$$x' = 11001_2 = -7$$

The representable range of an $I.F$ quantity is

$$-2^{I+F-1} \le 2^F x \le 2^{I+F-1} - 1$$

² You should always implement scaling through multiplication, rather than division, since
multiplication is much faster, especially when working with high precision representations.
or, equivalently,

$$-2^{I-1} \le x \le 2^{I-1} - 2^{-F}$$

It is often convenient to approximate this range by $|x| < 2^{I-1}$. For example,
the 4.4 representation of $x = -2.625$ is

$$x' = 2^4 x = -42 = 11010110_2$$
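A small Python sketch of equation (4) and these examples (our own illustration):

```python
def to_fixed(x, I, F):
    """Round x to its I.F two's complement representation x' (equation 4)."""
    xp = round(x * 2 ** F)
    lo, hi = -2 ** (I + F - 1), 2 ** (I + F - 1) - 1
    if not lo <= xp <= hi:
        raise OverflowError(f"{x} does not fit an {I}.{F} representation")
    return xp

print(to_fixed(-0.875, 2, 3))   # -7   (binary 11001)
print(to_fixed(-2.625, 4, 4))   # -42  (binary 11010110)
```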
Multiplying two $W$-bit signed integers, $x_1'$ and $x_2'$, yields a $(2W-1)$-bit
signed integer, $y'$, in the range $-2^{2W-2} < y' \le 2^{2W-2}$. If the multiplicands
correspond to $I_1.F_1$ and $I_2.F_2$ fixed-point quantities, respectively, $y'$ is an
$(I_1 + I_2 - 1).(F_1 + F_2)$ fixed-point representation of $y = x_1 \cdot x_2$. Note that
multiplication itself does not introduce any round-off errors. Round-off errors
arise when we need to reduce the precision of the result back to a $W$-bit quantity
by discarding fraction bits.
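A sketch of this in Python (our own; ties round upward, matching the rounding convention used in footnote 4 later in these notes):

```python
def fixed_multiply(x1p, F1, x2p, F2, F_out):
    """Multiply two fixed-point integers and reduce to F_out fraction bits.

    The raw product has F1 + F2 fraction bits and is exact; round-off
    error is introduced only by the normalizing downshift.
    """
    y = x1p * x2p                              # exact, (I1+I2-1).(F1+F2)
    shift = F1 + F2 - F_out                    # assumed > 0
    return (y + (1 << (shift - 1))) >> shift   # round to nearest, ties up

# 0.75 (2.3, x'=6) times -0.875 (2.3, x'=-7), reduced back to 3 fraction bits:
print(fixed_multiply(6, 3, -7, 3, 3))          # -5, i.e. -0.625 ~ -0.65625
```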
Figure 8: Typical ripple-adder circuit constructed from 1-bit full adder cells,
each of which sums three input bits, $a$, $b$, $c_i$, and outputs the sum $r + 2c_o = a + b + c_i$ as a 2-bit quantity, $(r, c_o)$. [Diagram: cells are chained from the LSB upward, each carry output $c_o$ feeding the next cell's carry input, with inputs $a_2 a_1 a_0$ and $b_2 b_1 b_0$ and output $r_2 r_1 r_0$.]
Figure 9: Array multiplier constructed from rows of 1-bit adder cells. Each row adds the shifted partial product $a \cdot b_j$ into a running total, producing the product bits $r_6 r_5 r_4 r_3 r_2 r_1 r_0$ for 4-bit inputs $a_3 a_2 a_1 a_0$ and $b_3 b_2 b_1 b_0$.
3. Identify the transfer function, $H_i(z)$, from the filter input to the output
of the $i$-th accumulator block and find the associated BIBO gain:

$$G_i = \sum_{n=0}^{\infty} |h_i[n]|$$

The integer and fraction bits at the accumulator output are then

$$I_i^{out} = I_0 + \lceil \log_2 G_i \rceil, \qquad F_i^{out} = F_0 - \lceil \log_2 G_i \rceil$$
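In Python, the BIBO gain and the resulting output format might be estimated as follows (a sketch under our own assumptions: the impulse response is truncated at n_max terms, and scipy.signal.lfilter generates it):

```python
import math
import numpy as np
from scipy.signal import lfilter

def bibo_gain(b, a, n_max=10000):
    """Approximate G = sum |h[n]| from a truncated impulse response."""
    impulse = np.zeros(n_max)
    impulse[0] = 1.0
    h = lfilter(b, a, impulse)
    return np.sum(np.abs(h))

def output_format(I0, F0, G):
    """Integer/fraction bits needed at an accumulator output."""
    shift = math.ceil(math.log2(G))
    return I0 + shift, F0 - shift

# H(z) = 1/(1 - 0.5 z^-1): h[n] = 0.5^n, so G = 2 and a 1.15 input needs 2.14 out
print(output_format(1, 15, bibo_gain([1], [1, -0.5])))   # (2, 14)
```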
Figure 10: Fixed-point cascade implementation example. Three sections, $H_1(z)$, $H_2(z)$ and $H_3(z)$, realize the coefficients 0.4, $-0.5$, $-0.8$ and 0.7; the $I.F$ formats of the input $x[n]$ (1.15), the section outputs $y_1[n]$ (2.14), $y_2[n]$ (4.12) and $y_3[n]$ (2.14), the accumulators (2.30 and 5.27) and the coefficients (0.16, 3.13, 1.15, 1.15) are annotated on the diagram.
6. It can happen that the above steps lead to some coefficients with insuffi-
cient integer bits in their representation. For example, assuming that all
coefficients must be represented with the same word size, $W_\times$, the coeffi-
cient $a$ has only $I_a = W_\times - F_a$ integer bits, requiring the coefficient to lie
in the range

$$|a| < 2^{I_a - 1}$$

If this is too small, we must reduce the number of fraction bits, $F_a$. This,
in turn, means reducing the number of fraction bits (increasing the number
of integer bits) at the input to the relevant accumulator block, taking $I_i^{in}$
above the minimum value given by equation (5). Reducing the number
of fraction bits at the input to an accumulator is undesirable in that it
reduces the accuracy of the representation. For this reason, this final
adjustment step is best saved until last.
• Multiplication by 0.4 involves a 2.14 input and a 2.30 output. The coeffi-
cient must have a 0.16 representation, which is fine.
• Multiplication by $-0.5$ involves a 2.14 input and a 4.28 output. The coef-
ficient must therefore have a 2.14 representation, which is also fine.
• Multiplication by $-0.8$ involves a 4.12 input and a 4.28 output. The coeffi-
cient must have a 0.16 representation, which is not possible. We will need
to reduce the number of fraction bits at the input to the second accumulator
block by at least 1 in the next step.
• Multiplication by 0.7 involves a 4.12 input and a 2.30 output, requiring
a $-2.18$ representation for the coefficient. This is also not possible. We
will need to reduce the number of fraction bits at the input to the third
accumulator block by at least 3 in the next step.
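These format calculations can be scripted. A sketch (our own, assuming the 16-bit coefficient word size used in this example):

```python
def coefficient_format(in_fmt, out_fmt, W=16):
    """Return the (I, F) format a coefficient must use so that an
    I_in.F_in input multiplied by it lands in an I_out.F_out accumulator."""
    (I_in, F_in), (I_out, F_out) = in_fmt, out_fmt
    F = F_out - F_in          # fraction bits forced by the output format
    I = W - F                 # remaining bits of the W-bit word
    return I, F

def representable(c, I):
    """Check |c| < 2^(I-1), the approximate I.F coefficient range."""
    return abs(c) < 2.0 ** (I - 1)

for c, in_fmt, out_fmt in [(0.4, (2, 14), (2, 30)),
                           (-0.5, (2, 14), (4, 28)),
                           (-0.8, (4, 12), (4, 28)),
                           (0.7, (4, 12), (2, 30))]:
    I, F = coefficient_format(in_fmt, out_fmt)
    print(c, f"{I}.{F}", representable(c, I))
# 0.4 -> 0.16 True; -0.5 -> 2.14 True; -0.8 -> 0.16 False; 0.7 -> -2.18 False
```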
where $\Omega$ is the set of frequencies which are of interest. In this case, $G_i$ is the
maximum gain in the amplitude of any pure sinusoid in the frequency range of
interest. It can be substantially smaller than the BIBO gain, leading to more frac-
tion bits in the fixed-point representations and hence less round-off error. There
is, however, some risk that the representations may have insufficient dynamic
range and hence overflow.

³ $-2$ integer bits simply means that the first two bit positions after the binary point are
never used.
In simple two’s complement arithmetic, overow is a serious concern. Con-
sider, for example, what happens if a 3.5 accumulator representation overows
with the value 4, representing it as 100.00000, which is equal to 4. Over-
ow in two’s complement arithmetic causes the numerical representation to
“wrap-around” leading to massive errors. For this reason, if the BIBO gains
are not used, the implementation should employ saturating arithmetic. Satu-
rating arithmetic checks for overow and selects the representable value which
is closest to the true out-of-range value.
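The difference is easy to demonstrate in Python (our own sketch, using the 3.5 format, i.e. 8-bit words):

```python
def wrap_add(x, y, W=8):
    """Two's complement addition with wrap-around."""
    s = (x + y) & ((1 << W) - 1)
    return s - (1 << W) if s >= (1 << (W - 1)) else s

def saturating_add(x, y, W=8):
    """Addition which clips to the nearest representable value."""
    lo, hi = -(1 << (W - 1)), (1 << (W - 1)) - 1
    return max(lo, min(hi, x + y))

# 3.5 format: integers scaled by 2^-5, so the value 3 is 96 and 1 is 32
print(wrap_add(96, 32) / 32.0)        # -4.0: 3 + 1 wraps around to -4
print(saturating_add(96, 32) / 32.0)  # 3.96875: clipped to 127/32
```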
Saturating arithmetic comes with some of its own drawbacks which are worth
noting. Firstly, it is somewhat more expensive to implement a saturating ac-
cumulator than a simple two's complement accumulator. Secondly, and more
significantly, a saturating accumulator typically saturates the results of each
incremental addition. Consider, for example, the central accumulator in Fig-
ure 10, which has 3 inputs. Depending on the order in which the inputs are
added, the results produced by a saturating accumulator may be different.

Moreover, when designing the fixed-point representations for a saturating
accumulator, one must be careful to ensure that each incremental output from the
accumulator (after adding each new quantity into the total) can be represented
without overflow, under the conditions for which the filter is being designed.
By contrast, with regular two's complement addition, it is sufficient to ensure
that the accumulator has sufficient precision to accommodate the final result,
in which case intermediate overflow bits are guaranteed to cancel. This can
detract from some of the savings achieved by selecting gains, $G_i$, which are
less conservative than the BIBO gains.
Figure 11: Direct form implementation of a second order all-pole filter, with feedback coefficients $-b_1$ and $-b_2$.

The two coefficients in the implementation are thus directly related to the radius,
$r$, and the real part, $r \cos \theta$, of the poles. Evidently, for stability we require
$0 \le b_2 < 1$ and $|b_1| < 2$. Now suppose both quantities are to be represented
using 3-bit integers: $b_1$ as a signed 2.1 quantity, and $b_2$ as an unsigned 1.2
quantity. Since $b_2$ must be positive to get complex conjugate poles, the available
pole positions, $p$, which can be described under these conditions have

$$|p| \in \left\{ \sqrt{\tfrac{1}{4}}, \sqrt{\tfrac{2}{4}}, \sqrt{\tfrac{3}{4}} \right\}$$

and

$$\Re(p) \in \left\{ 0, \pm\tfrac{1}{4}, \pm\tfrac{2}{4}, \pm\tfrac{3}{4} \right\}$$

In fact, the only stable complex conjugate pole positions which can be achieved
are those indicated in Figure 12. Note that only 15 different conjugate pole pairs
can be realized and they are quite irregularly spaced.
Example 9 Consider the coupled implementation of a second order all-pole
filter, as shown in Figure 6. We showed previously that the pole positions asso-
ciated with this realization are given by

$$p = a \pm jb$$

so that quantizing the coefficients is exactly equivalent to quantizing the real and
imaginary parts of the pole positions. If we again assume a 3-bit signed fixed-
point representation for the filter coefficients, the complex conjugate pole pairs
which can be achieved lie on the regular rectangular grid shown in Figure 13. In
this case, there are 19 possible conjugate pole pairs.
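Both counts are easy to verify by brute force. A Python sketch (our own; it assumes 1.2 formats for $a$ and $b$ in the coupled form, and the 2.1/1.2 formats given above for the direct form):

```python
import numpy as np

# Direct form: b1 is a signed 2.1 quantity, b2 an unsigned 1.2 quantity.
# Complex conjugate poles of z^2 + b1 z + b2 require b1^2 < 4 b2,
# and stability requires b2 = r^2 < 1.
direct = {(b1, b2)
          for b1 in np.arange(-4, 4) / 2     # -2.0, -1.5, ..., 1.5
          for b2 in np.arange(1, 4) / 4      # 0.25, 0.5, 0.75
          if b1 * b1 < 4 * b2}
print(len(direct))                           # 15 conjugate pole pairs

# Coupled form: p = a + jb with a, b signed 1.2 quantities.
# Stability requires |p| < 1; taking b > 0 picks one pole per conjugate pair.
coupled = {(a, b)
           for a in np.arange(-4, 4) / 4     # -1.0, -0.75, ..., 0.75
           for b in np.arange(1, 4) / 4      # 0.25, 0.5, 0.75
           if a * a + b * b < 1}
print(len(coupled))                          # 19 conjugate pole pairs
```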
Unfortunately, implementation structure affects many different properties
of the filter, whose dependencies can be difficult to track down and optimize.
These properties include memory consumption, coefficient quantization, round-
off errors and limit cycle behaviour, some aspects of which are discussed further
below. There is no general method to select the best structure for implementing
a particular filter. Instead, several different structures can be examined, sepa-
rately optimizing their parameters and representations, to determine the most
appropriate structure for a given application.
Figure 12: Stable complex conjugate pole positions realizable by the direct form second order structure with 3-bit coefficients ($\Im(p)$ against $\Re(p)$, with the unit circle shown).

Figure 13: Stable complex conjugate pole positions realizable by the coupled structure with 3-bit coefficients ($\Im(p)$ against $\Re(p)$, with the unit circle shown).
where $P_i$ is a power gain factor, computed from the transfer function $T_i(z)$,
between the input to the $i$-th accumulator block and the output of the entire
filter. The power gain may be expressed in any of the following ways:

$$P_i = \sum_{n=0}^{\infty} (t_i[n])^2
     = \frac{1}{2\pi} \int_{-\pi}^{\pi} \left| \hat{t}_i(\omega) \right|^2 d\omega
     = \frac{1}{2\pi} \int_{-\pi}^{\pi} T_i(e^{j\omega})\, T_i(e^{-j\omega})\, d\omega$$

The power spectral density of the output noise process is given by

$$S(\omega) = \sum_i \sigma_i^2 \left| T_i(e^{j\omega}) \right|^2$$
⁴ Actually, the round-off error process is very slightly asymmetrical, so there is a very small
mean offset. If, for example, the normalizing downshift discards 2 fraction bits to form a 1.3
result, rounding to the nearest value and upward where there is an ambiguity, the possible
round-off errors are $+2 \times 2^{-5}$, $+1 \times 2^{-5}$, $0$ and $-1 \times 2^{-5}$. In most cases, a large number of
fraction bits are discarded and the round-off error is distributed almost symmetrically about
0.
Figure 14: Round-off noise model for the filter implementation in Figure 10. [Diagram: the cascade of Figure 10 with additive round-off noise sources injected at the three accumulator blocks, and noise transfer functions $T_1(z)$ and $T_2(z)$ from the first two injection points to the output.]
Example 10 Consider the cascade system shown in Figure 10. There are three
accumulator blocks, each with its own source of round-off noise, as shown in
Figure 14. These noise processes have variances

$$\sigma_1^2 = \sigma_3^2 = \frac{2^{-28}}{12} \approx -95\,\text{dB}
\quad\text{and}\quad
\sigma_2^2 = \frac{2^{-24}}{12} \approx -83\,\text{dB}$$

The power gain factors and the noise power spectral density may be computed
from the three noise transfer functions,

$$T_1(z) = \frac{z - 0.5}{z - 0.4} \cdot \frac{z + 0.7}{z + 0.8}, \qquad
T_2(z) = \frac{z + 0.7}{z + 0.8}, \qquad
T_3(z) = 1$$
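The power gains $P_i$ can be approximated numerically. A Python sketch (our own; it truncates the impulse responses of $T_1$ and $T_2$ and uses scipy.signal.lfilter):

```python
import numpy as np
from scipy.signal import lfilter

def power_gain(b, a, n_max=10000):
    """Approximate P = sum t[n]^2 from a truncated impulse response."""
    impulse = np.zeros(n_max)
    impulse[0] = 1.0
    t = lfilter(b, a, impulse)
    return np.sum(t ** 2)

# T1(z) = (z-0.5)(z+0.7) / ((z-0.4)(z+0.8)), T2(z) = (z+0.7)/(z+0.8)
P1 = power_gain(np.convolve([1, -0.5], [1, 0.7]),
                np.convolve([1, -0.4], [1, 0.8]))
P2 = power_gain([1, 0.7], [1, 0.8])
P3 = 1.0
sigma2 = [2**-28 / 12, 2**-24 / 12, 2**-28 / 12]   # variances from Example 10
total_noise = sigma2[0] * P1 + sigma2[1] * P2 + sigma2[2] * P3
print(P1, P2, total_noise)
```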
Suppose, for example, that the filter output is formed as $y'[n] = 7 - \lfloor (y'[n-1] + 2)/4 \rfloor$, where the division by 4 is implemented as a rounding downshift (add 2, then discard the 2 least significant bits). Iterating from the first output,

$$n = 0 \;\rightarrow\; y' = 7$$
$$n = 1 \;\rightarrow\; y' = 7 - \left\lfloor \tfrac{7+2}{4} \right\rfloor = 5$$
$$n = 2 \;\rightarrow\; y' = 7 - \left\lfloor \tfrac{5+2}{4} \right\rfloor = 6$$
$$n = 3 \;\rightarrow\; y' = 7 - \left\lfloor \tfrac{6+2}{4} \right\rfloor = 5$$
$$n = 4 \;\rightarrow\; y' = 7 - \left\lfloor \tfrac{5+2}{4} \right\rfloor = 6$$

Evidently, the filter output will continue to oscillate between 5 and 6. These
oscillations are dependent on the DC level of the supplied input and they rep-
resent tones in the filter output which are unrelated to the frequency content of
the input signal. If insufficient bits are used in the numerical implementation,
these tones can become audible or otherwise perceptible in the filter output.
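This limit cycle can be reproduced directly (our own sketch):

```python
y = 0                          # zero initial state
for n in range(6):
    y = 7 - ((y + 2) >> 2)     # subtract y/4 using the rounding downshift
    print(n, y)                # 7, 5, 6, 5, 6, 5 ... oscillates forever
```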
The single precision representation uses 32 bits in total: one sign bit; $M = 23$ mantissa bits; and 8
exponent bits. This allows exponents in the range $-128$ to $+127$ and a relative
accuracy of

$$\frac{|\Delta x|}{|x|} \le 2^{-(M+1)}$$

The double precision representation uses $M = 52$ mantissa bits and 11 exponent
bits, with a 64-bit word size.
When floating point representations are employed, round-off errors are often
much smaller than in the fixed-point case, since the number of fraction bits
(determined by the exponent, $e$) is dynamically adjusted after each computa-
tion is performed, so as to maximize the accuracy of the representation. This
adjustment process can add significant complexity, particularly to additions.
Consider, for example, the addition of the two quantities 3.708 and $-3.707$. Each
quantity is represented with an exponent of $e = 1$, but their sum, 0.001, requires an
exponent of $e = -10$. In general, additions require extensive shifting of the
mantissa bits, both before and after the binary addition of their contents. This
shifting is either performed sequentially (which can require many clock cycles) or in
parallel, using a "barrel shifting network" (which can occupy a lot of silicon area on
the chip).
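The round-off consequences of such cancellations are easy to observe (our own sketch, using NumPy's 32-bit floats):

```python
import numpy as np

x = np.float32(3.708)
y = np.float32(3.707)
diff = x - y                      # catastrophic cancellation of mantissa bits
print(diff)                       # ~0.00099993, not exactly 0.001
print(abs(diff - 0.001) / 0.001)  # relative error ~7e-5, far above 2**-24
```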
In either case, the lesson to be learned is that floating point additions are
much more complex than fixed-point additions, and they can actually take signi-
ficantly longer to execute than floating point multiplications. Moreover, round-off
error may be introduced whenever two numbers are added, and whenever two
numbers are multiplied, making it much more difficult to trace the effects of
round-off error through a system for modeling purposes.