
ELEC4621:

Advanced Digital Signal Processing


Chapter 3: Filter Implementation Techniques
Dr. D. S. Taubman
March 11, 2011

1 Filter Structures
1.1 Canonical Form
In the previous chapter, we gave the following difference equation as a general
form for LTI filters which can be implemented causally with a finite amount of
computation and memory:

    y[n] = \sum_{k=0}^{M} a_k x[n-k] + \sum_{k=1}^{N} b_k y[n-k]    (1)

Here, N is the number of non-trivial poles and M is the number of non-trivial
zeros in the filter's transfer function,

    H(z) = \frac{a_0 + a_1 z^{-1} + \cdots + a_M z^{-M}}{1 - b_1 z^{-1} - \cdots - b_N z^{-N}}
         = \frac{a_M}{b_N} \cdot \frac{(z - z_1)(z - z_2) \cdots (z - z_M)}{(z - p_1)(z - p_2) \cdots (z - p_N)} \cdot \frac{z^N}{z^M}

The most direct implementation of equation (1) is that shown in Figure 1.
This structure is a direct transcription of the difference equation and is known
as the "direct form." We have split the accumulation process into two adders
primarily to distinguish between contributions from the first and second
summations on the right hand side of equation (1).
An alternative implementation structure may be derived by noting that
H(z) may be regarded as the product of two transfer functions: an all-zero
transfer function,

    H_1(z) = a_0 + a_1 z^{-1} + \cdots + a_M z^{-M}

[Figure 1: Direct form implementation structure for a filter with M zeros and N poles.]

and an all-pole transfer function,

    H_2(z) = \frac{1}{1 - b_1 z^{-1} - \cdots - b_N z^{-N}}

The left hand section of Figure 1 implements H_1(z), feeding its result to the
right hand section which implements H_2(z). But H(z) = H_1(z) H_2(z) =
H_2(z) H_1(z) can be realized by implementing the all-zero and all-pole sections
in the opposite order, as illustrated in Figure 2.
The structure shown in Figure 2 is known as the "canonical form." It has one
obvious advantage over the direct form structure: it requires less memory.
This is because the all-pole filter requires delayed copies of its output and the
all-zero filter requires delayed copies of its input. By cascading them in the order
of Figure 2, the delay operators may be shared. Each delay requires memory to
store a single sample.
You may think of each string of delay elements, identified by z^{-1} in the
figures, as a shift register. The canonical form requires a single shift register,
with storage for max{M, N} samples, while the direct form structure requires
two shift registers, having combined storage for M + N samples. Filters are linear
state machines and the canonical form exposes its fundamental states. The
direct implementation involves additional redundant states. There are specific
types of filters which we shall encounter, whose state information can be reduced
further, without affecting the order of the filter, but these are not generic.
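To make the shared delay line concrete, here is a small C sketch of the canonical
form for general M and N, using the sign conventions of equation (1). The
structure, the function names and the MAX_ORDER bound are our own illustrative
choices, not part of the original notes.

    #define MAX_ORDER 16

    typedef struct {
        int M, N;                    /* numbers of non-trivial zeros and poles */
        double a[MAX_ORDER + 1];     /* feed-forward coefficients a_0 .. a_M */
        double b[MAX_ORDER + 1];     /* feedback coefficients b_1 .. b_N */
        double w[MAX_ORDER + 1];     /* one shared delay line, w[1 .. max(M,N)] */
    } canonical_filter;

    /* Process one input sample according to equation (1), canonical form. */
    double canonical_step(canonical_filter *f, double x)
    {
        int K = (f->M > f->N) ? f->M : f->N;

        /* All-pole section first: w_0 = x[n] + sum_k b_k w_k */
        double w0 = x;
        for (int k = 1; k <= f->N; k++)
            w0 += f->b[k] * f->w[k];

        /* All-zero section reads the same delayed values: y = sum_k a_k w_k */
        double y = f->a[0] * w0;
        for (int k = 1; k <= f->M; k++)
            y += f->a[k] * f->w[k];

        /* Operate the single shift register of max{M, N} samples. */
        for (int k = K; k > 1; k--)
            f->w[k] = f->w[k - 1];
        f->w[1] = w0;
        return y;
    }

A direct form implementation of the same filter would need two such delay lines,
one holding past inputs and one holding past outputs.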
Apart from memory, the two implementation structures depicted in Figures 1
and 2 have very different implications for the propagation of numerical
round-off errors to the filter output.

[Figure 2: Canonical form implementation structure for a filter with M zeros and N poles.]

As we shall see in Section 2, round-off errors
may be modeled as independent white noise sources injected into each of the
accumulator blocks in the filter structure. In the direct form structure of
Figure 1, there is only one accumulator block and its round-off errors are processed
by the all-pole filter, H_2(z). This tends to amplify the round-off noise power
at frequencies (points on the unit circle) near the poles. The canonical form
shown in Figure 2 has noise sources injected at the filter input and at the filter
output. If the pole-zero structure is such that zeros nearly cancel the poles, the
direct form will result in substantially higher peaks in the round-off noise power
spectrum than the canonical structure.
The lesson to be learned from the above discussion is that a filter may be
implemented in various quite different ways, each of which can have very
different implications for memory, round-off noise and other important attributes
which we shall examine here. Armed with an arsenal of different implementation
structures, and tools for analyzing the implications of each structure, you
will be able to try a variety of strategies and select the one which is most suited
to your application. In the sections which follow, we investigate a number of
other useful implementation structures.

1.2 Cascade Structures


Since we are interested in filters with real-valued coefficients, H(z) can always
be factored into a cascade of first and second order sections,

    H(z) = H_1(z) \cdot H_2(z) \cdots

where rst order sections have the form


z  zi
Hi (z) =
z  pi
and second order sections have the form
z 2 + a1,i z + a2,i
Hi (z) =
z 2 + b1,i z + b2,i
If M = N , all zeros and poles in these segments are non-trivial. If M > N ,
there will be some trivial poles, while if M < N there will be some trivial zeros.
Each lter section may, itself be implemented in direct form, or in canonical
form.
One advantage of cascade structures over the direct or canonical forms shown
in Figures 1 and 2 is that the pole and zero locations are much less sensitive
to coefficient quantization effects. Most filter designs yield coefficients with
irrational values which cannot be represented exactly using finite precision
arithmetic. Since the coefficients of the implemented filter are not identical to the
designed values, the poles and zeros of the implemented filter will also be
different. If the numerator or denominator polynomials have high order, the pole
and zero locations can be very sensitive to small changes in the coefficients, a_i
and b_i. If care is not taken, a filter with poles close to the unit circle can even
become unstable as a result of coefficient quantization effects. The situation is
substantially better for cascade structures, since each section has its own poles
and zeros, whose locations are affected by only one or two coefficients.
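As an illustration, a cascade is just a chain of independent sections, each owning
its own state. The sketch below is our own; it follows the second order section
convention above, with denominator z^2 + b_1 z + b_2, and realizes each section
in canonical form.

    typedef struct {
        double a1, a2;   /* numerator coefficients of z^2 + a1 z + a2 */
        double b1, b2;   /* denominator coefficients of z^2 + b1 z + b2 */
        double w1, w2;   /* this section's own two state variables */
    } biquad;

    /* One sample through one canonical-form second order section. */
    static double biquad_step(biquad *s, double x)
    {
        /* Denominator z^2 + b1 z + b2 means feedback taps -b1 and -b2. */
        double w0 = x - s->b1 * s->w1 - s->b2 * s->w2;
        double y  = w0 + s->a1 * s->w1 + s->a2 * s->w2;
        s->w2 = s->w1;
        s->w1 = w0;
        return y;
    }

    /* A cascade simply feeds each section's output to the next. */
    double cascade_step(biquad *sections, int num_sections, double x)
    {
        for (int i = 0; i < num_sections; i++)
            x = biquad_step(&sections[i], x);
        return x;
    }

Because each biquad owns only its own four coefficients and two states,
quantizing one section's coefficients perturbs only that section's pole and zero
pair.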

1.3 Parallel Structures


Rather than expanding H(z) as a product of first and second order factors, a
partial fraction expansion may be used to express H(z) as a sum of first and
second order systems. The following example illustrates the general principle.

Example 1  Consider a filter with the following transfer function

    H(z) = \frac{\left(z - \frac{1}{2}\right) z^2}{\left(z - \frac{1}{4}\right)\left(z^2 + z + \frac{1}{2}\right)}

The lter has a non-trivial zero at z = 12 , a simple pole at z = 1


4 and a pair
of complex conjugate poles at z = re±j where r = s12 and  = 
4. A partial
fraction expansion is obtained by expressing H (z) as
1 z +  1 z 2 + z + 
H (z) = +
2 z  14 2 z 2 + z + 12
and solving for the coe!cients, ,  and . The coe!cients are found by equat-
ing the numerator of H (z) with
   
1 2 1 1 2  1
(z + ) z + z + + z + z +  z 
2 2 2 4

[Figure 3: Parallel filter structure, realizing H(z) = (z - 1/2) z^2 / ((z - 1/4)(z^2 + z + 1/2)).]

Multiplying both polynomials by 2 yields

    2z^3 - z^2 = 2z^3 + \left(\frac{3}{4} + \alpha + \beta\right) z^2
               + \left(\frac{1}{2} + \alpha - \frac{\beta}{4} + \gamma\right) z
               + \left(\frac{\alpha}{2} - \frac{\gamma}{4}\right)

So

    \alpha + \beta = -\frac{7}{4}
    \alpha - \frac{\beta}{4} + \gamma = -\frac{1}{2}
    \gamma = 2\alpha

Solving these equations yields \alpha = -\frac{15}{52}, \beta = -\frac{19}{13} and \gamma = -\frac{15}{26}. The
corresponding parallel filter structure is shown in Figure 3.
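The expansion is easy to sanity check numerically. The short program below is
our own check, not part of the original notes; it evaluates H(z) in both its
direct and partial fraction forms at a few points on the unit circle and confirms
that they agree.

    #include <complex.h>
    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        const double PI = acos(-1.0);
        const double alpha = -15.0 / 52, beta = -19.0 / 13, gam = -15.0 / 26;

        for (int n = 1; n <= 4; n++) {
            double complex z = cexp(I * (PI * n / 5.0)); /* unit circle points */

            /* Direct form of the transfer function in Example 1 */
            double complex direct =
                (z - 0.5) * z * z / ((z - 0.25) * (z * z + z + 0.5));

            /* Partial fraction form, using the coefficients computed above */
            double complex parallel =
                0.5 * (z + alpha) / (z - 0.25) +
                0.5 * (z * z + beta * z + gam) / (z * z + z + 0.5);

            printf("omega = %3.1f*pi: |difference| = %.2e\n",
                   n / 5.0, cabs(direct - parallel));
        }
        return 0;
    }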

Parallel structures are not as popular as cascade structures, in part due to the
added complexity of expanding the filter in this form and in part because they
often have higher implementation complexity. Pole positions have exactly the
same sensitivity to coefficient quantization as they do in the cascade structure;
however, zero locations are generally more sensitive to coefficient quantization
than they are in the cascade structure.
One advantage of parallel implementations over cascade implementations
is their reduced sensitivity to numerical round-off errors. The round-off noise
sources associated with a parallel implementation are injected at the input and
output of each parallel filter section, while the round-off noise sources in a
cascade structure are injected at the input and output of each successive filter
stage.

[Figure 4: Third order FIR lattice filter implementation. The shaded box identifies the regular structure which is repeated throughout the lattice.]

If a filter has multiple poles close together and near the unit circle, round-off
errors introduced at the beginning of a cascade will be shaped by the transfer
function associated with the remaining elements in the cascade, with high gain
at frequencies close to the poles. By contrast, round-off errors introduced into
a parallel structure are only shaped by the transfer function of a single section,
having at most two poles.

1.4 Lattice Structures


In this section, we consider an entirely different and much less obvious
implementation structure to those described above. Figure 4 shows a lattice structure
which can be used for implementing FIR filters. We shall consider a
complementary all-pole lattice structure shortly.
The characteristic feature of lattices is that they are composed of regular
repeating sections, each of which has two inputs and two outputs (unlike the
single input, single output sections we have considered previously). The final
filter is obtained by joining the two front-end inputs and selecting one of the
two back-end outputs. For ease of discussion, we will use the symbols y_m[n]
and w_m[n] to describe the outputs from the m'th lattice section. The inputs
to that lattice section are then y_{m-1}[n] and w_{m-1}[n]. These are illustrated in
Figure 4.
The relationship between these various quantities may be described by the
following equations:

    y_m[n] = y_{m-1}[n] + k_m w_{m-1}[n-1]
    w_m[n] = w_{m-1}[n-1] + k_m y_{m-1}[n]

In the Z-transform domain, we may write these in the following convenient
matrix form:

    \begin{bmatrix} Y_m(z) \\ W_m(z) \end{bmatrix}
      = \begin{bmatrix} 1 & k_m \\ k_m & 1 \end{bmatrix}
        \begin{bmatrix} Y_{m-1}(z) \\ z^{-1} W_{m-1}(z) \end{bmatrix}
      = \begin{bmatrix} 1 & k_m z^{-1} \\ k_m & z^{-1} \end{bmatrix}
        \begin{bmatrix} Y_{m-1}(z) \\ W_{m-1}(z) \end{bmatrix}
      = \begin{bmatrix} 1 & k_m z^{-1} \\ k_m & z^{-1} \end{bmatrix}
        \begin{bmatrix} 1 & k_{m-1} z^{-1} \\ k_{m-1} & z^{-1} \end{bmatrix}
        \begin{bmatrix} Y_{m-2}(z) \\ W_{m-2}(z) \end{bmatrix}
      \vdots
      = \left\{ \prod_{i=1}^{m} \begin{bmatrix} 1 & k_i z^{-1} \\ k_i & z^{-1} \end{bmatrix} \right\}
        \begin{bmatrix} X(z) \\ X(z) \end{bmatrix}

(with the i = m factor leftmost in the product). This transfer matrix expansion
provides us with one method to determine the filter transfer function, Y_m(z)/X(z).
To understand the lattice much better, however, let A_m(z) and B_m(z) denote
the transfer functions,

    A_m(z) = \frac{Y_m(z)}{X(z)},    B_m(z) = \frac{W_m(z)}{X(z)}

It turns out that these transfer functions are mirror images of each other for all
m. Specifically, the corresponding impulse responses satisfy

    b_m[n] = a_m[m-n]

Note that both impulse responses have support 0 \le n \le m. In the Z-transform
domain, time reversal is equivalent to replacing z with z^{-1}, so this relationship
may be written

    B_m(z) = z^{-m} A_m(z^{-1})

and, of course, we also have

    A_m(z) = z^{-m} B_m(z^{-1})
The mirror image relationship between A_m(z) and B_m(z) may be shown by
a simple recursion. When m = 0, we trivially have A_0(z) = B_0(z) = 1. Now
supposing the symmetry to hold for some m, we may recursively verify that it
holds at m + 1. We get

    B_{m+1}(z) = z^{-1} B_m(z) + k_{m+1} A_m(z)
               = z^{-(m+1)} A_m(z^{-1}) + k_{m+1} z^{-m} B_m(z^{-1})
               = z^{-(m+1)} \left[ A_m(z^{-1}) + k_{m+1} z B_m(z^{-1}) \right]
               = z^{-(m+1)} \left[ A_m(u) + k_{m+1} u^{-1} B_m(u) \right]_{u = z^{-1}}
               = z^{-(m+1)} A_{m+1}(z^{-1})

We may use the mirror image relationship between the transfer functions of
the lower and upper lattice branches to work back from a desired all-zero transfer
function, H(z) = Y(z)/X(z), to find the corresponding lattice coefficients.
The key is to observe that the forward relationship

    \begin{bmatrix} A_m(z) \\ B_m(z) \end{bmatrix}
      = \begin{bmatrix} 1 & k_m z^{-1} \\ k_m & z^{-1} \end{bmatrix}
        \begin{bmatrix} A_{m-1}(z) \\ B_{m-1}(z) \end{bmatrix}

may be inverted as

    \begin{bmatrix} A_{m-1}(z) \\ B_{m-1}(z) \end{bmatrix}
      = \begin{bmatrix} 1 & k_m z^{-1} \\ k_m & z^{-1} \end{bmatrix}^{-1}
        \begin{bmatrix} A_m(z) \\ B_m(z) \end{bmatrix}
      = \frac{1}{z^{-1} - k_m^2 z^{-1}}
        \begin{bmatrix} z^{-1} & -k_m z^{-1} \\ -k_m & 1 \end{bmatrix}
        \begin{bmatrix} A_m(z) \\ z^{-m} A_m(z^{-1}) \end{bmatrix}

from which we get

    A_{m-1}(z) = \frac{A_m(z) - k_m z^{-m} A_m(z^{-1})}{1 - k_m^2}    (2)

Working backward through the lattice, we may determine each coefficient, k_m,
in turn by recognizing that A_{m-1}(z) is an order m-1 filter, so that a_m[m] -
k_m a_m[0] must be 0. That is

    k_m = \frac{a_m[m]}{a_m[0]}    (3)

The lattice coefficients are generally known as "reflection coefficients."
Example 2  Consider the third order FIR filter,

    H(z) = 1 + \frac{1}{4} z^{-1} - \frac{1}{4} z^{-2} + \frac{1}{2} z^{-3}

We know that A_3(z) = H(z), so equation (3) yields

    k_3 = \frac{a_3[3]}{a_3[0]} = \frac{h[3]}{h[0]} = \frac{1}{2}

Next, we use equation (2) to find

    A_2(z) = \frac{\left(1 + \frac{1}{4} z^{-1} - \frac{1}{4} z^{-2} + \frac{1}{2} z^{-3}\right) - \frac{1}{2}\left(\frac{1}{2} - \frac{1}{4} z^{-1} + \frac{1}{4} z^{-2} + z^{-3}\right)}{1 - \frac{1}{4}}
           = \frac{4}{3}\left(\frac{3}{4} + \frac{3}{8} z^{-1} - \frac{3}{8} z^{-2}\right)
           = 1 + \frac{1}{2} z^{-1} - \frac{1}{2} z^{-2}

We can now use equation (3) again to find

    k_2 = \frac{a_2[2]}{a_2[0]} = -\frac{1}{2}

from which we get

    A_1(z) = \frac{\left(1 + \frac{1}{2} z^{-1} - \frac{1}{2} z^{-2}\right) + \frac{1}{2}\left(-\frac{1}{2} + \frac{1}{2} z^{-1} + z^{-2}\right)}{1 - \frac{1}{4}}
           = \frac{4}{3}\left(\frac{3}{4} + \frac{3}{4} z^{-1}\right)
           = 1 + z^{-1}

Then

    k_1 = \frac{a_1[1]}{a_1[0]} = 1

and we have found all of the reflection coefficients.
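The backward recursion of equations (2) and (3) is mechanical, so it is worth
seeing in code. The sketch below is our own; applied to the taps of Example 2
it prints k3 = 0.5, k2 = -0.5 and k1 = 1.

    #include <stdio.h>

    /* Recover reflection coefficients k[1..order] from FIR taps a[0..order]
     * using equations (3) and (2). The taps array is destroyed in the
     * process. Returns 0 on success, -1 if some |k_m| = 1 blocks the
     * step-down. Supports order < 64. */
    int reflection_coefficients(double *a, int order, double *k)
    {
        for (int m = order; m >= 1; m--) {
            k[m] = a[m] / a[0];                 /* equation (3) */
            if (m == 1)
                break;                          /* A_0(z) is not needed */
            double denom = 1.0 - k[m] * k[m];
            if (denom == 0.0)
                return -1;                      /* |k_m| = 1: undefined */
            double next[64];                    /* scratch for A_{m-1}(z) */
            for (int n = 0; n < m; n++)         /* equation (2) */
                next[n] = (a[n] - k[m] * a[m - n]) / denom;
            for (int n = 0; n < m; n++)
                a[n] = next[n];
        }
        return 0;
    }

    int main(void)
    {
        double h[4] = { 1.0, 0.25, -0.25, 0.5 };   /* Example 2 */
        double k[4];
        if (reflection_coefficients(h, 3, k) == 0)
            printf("k3 = %g, k2 = %g, k1 = %g\n", k[3], k[2], k[1]);
        return 0;
    }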

1.4.1 Properties of Lattice Filters

Since the lattice structure involves more multiplications than the cascade, direct
or canonical filter structures, one might reasonably ask why we would want to go
to the trouble. Lattice filters actually arise naturally in the context of adaptive
filtering systems, where they have excellent properties which we shall consider
later in the subject. For now, however, we point out the following very useful
property of lattice filters:

An FIR filter has minimum phase (all zeros inside the unit circle)
if and only if all of the reflection coefficients are less than 1 in
magnitude, i.e., |k_m| < 1.

Recall that minimum phase filters are the only filters which we can invert
causally. In fact, the inverse filter may be implemented using the all-pole lattice
structure described next, which is guaranteed to remain stable so long as the
reflection coefficients are all less than 1 in magnitude. The coefficients for such
filters may be efficiently represented as fixed-point quantities of the form

    k_m = k_m' \cdot 2^{-(W-1)}

where k_m' is a W-bit two's complement integer.

1.4.2 All-Pole Lattice Structure

Figure 5 shows an all-pole lattice structure. Note the close similarity between
this structure and that in Figure 4. In fact, if k_1, k_2, ..., k_M are the reflection
coefficients of an FIR lattice with filter transfer function H(z), then they are
also the reflection coefficients of the IIR lattice with filter transfer function
1/H(z). Of course, we require H(z) to have minimum phase, which is simply
arranged by ensuring that |k_m| < 1 for all m.
The reciprocal relationship between the FIR and IIR lattice structures in
Figures 4 and 5 is easily explained by considering the repeating structures,

[Figure 5: Third order all-pole lattice filter whose transfer function is the reciprocal of that of the FIR lattice filter in Figure 4. We have deliberately drawn the filter backwards, with the input on the right and the output on the left, to emphasize the connection between the FIR and IIR lattice structures and their internal state variables.]

identied by shaded boxes in the gures. These structures clearly enforce exactly
the same relationship between the vectors,
   
ym1 [n] ym [n]
and
wm1 [n] wm [n]

The signal ow direction is reversed in the upper branch, so addition


of km wm1 [n  1] to ym1 [n] to get ym [n] is replaced by subtraction of
km wm1 [n  1] from ym [n] to get ym1 [n]. There is no other dierence be-
tween the FIR and IIR lattices. It follows that the relationship between the
signals (y0 [n] , w0 [n]) and (yM [n] , wM [n]) must also be preserved. In the IIR
lattice, however, yM [n] is the input and y0 [n] is the output, so the IIR lattice
must invert the FIR lattice.
If we pass a signal, x [n], into the input of an FIR lattice and then pass the
output signal, y [n] into the input (at yM [n]) of the corresponding IIR lattice,
all intermediate quantities and all states in the IIR lattice will assume identical
values to their counterparts in the FIR lattice.
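The per-sample updates of both lattices are compact enough to transcribe
directly. The following sketch is our own rendering, with d[j] holding the delayed
lower-branch value w_j[n-1] and k[1..M] the reflection coefficients. Feeding the
FIR function's output into the IIR function, with both state arrays starting
from zero, reproduces the original input sample for sample, as claimed above.

    /* One sample of the order-M FIR lattice of Figure 4.  k[1..M] are the
     * reflection coefficients; d[j] holds w_j[n-1] for j = 0..M-1. */
    double fir_lattice_step(const double *k, double *d, int M, double x)
    {
        double y = x, w = x;                /* y_0[n] = w_0[n] = x[n] */
        for (int m = 1; m <= M; m++) {
            double dprev = d[m - 1];        /* w_{m-1}[n-1] */
            d[m - 1] = w;                   /* delayed value for next sample */
            double ynew = y + k[m] * dprev; /* y_m[n] */
            w = dprev + k[m] * y;           /* w_m[n] */
            y = ynew;
        }
        return y;                           /* y_M[n] */
    }

    /* One sample of the inverse (all-pole) lattice of Figure 5.  The input
     * is y_M[n], the return value y_0[n]; d[] has the same meaning. */
    double iir_lattice_step(const double *k, double *d, int M, double x)
    {
        double y = x;
        for (int m = M; m >= 1; m--) {
            y -= k[m] * d[m - 1];           /* y_{m-1} = y_m - k_m w_{m-1}[n-1] */
            if (m < M)
                d[m] = d[m - 1] + k[m] * y; /* w_m[n], delayed for next sample */
        }
        d[0] = y;                           /* w_0[n] = y_0[n] */
        return y;
    }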

1.5 Other Structures

While the implementation structures described above are most widely used,
they are far from exhaustive. As an example, Figure 6 provides an alternate
implementation for a second order all-pole filter (or filter section). This example
is modeled on the structure of coupled oscillators found in nature and in
analog circuits. Two single pole sections have their inputs and outputs partially
coupled, resulting in an all-pole filter with complex conjugate poles.
The transfer function for this filter is easily found by recognizing that each
of the single-pole building blocks has transfer function G(z) = \frac{z^{-1}}{1 - a z^{-1}} = \frac{1}{z - a},

[Figure 6: Second order all-pole filter based on coupled single-pole sections.]

and hence the complete coupled structure has transfer function

    H(z) = \frac{G(z) \cdot b G(z)}{1 + b G(z) \cdot b G(z)}
         = \frac{\frac{b}{(z-a)^2}}{1 + \frac{b^2}{(z-a)^2}}
         = \frac{b}{(z-a)^2 + b^2}

The poles are thus at locations z = a \pm jb.
The coupled implementation requires twice as many multiplications as a
direct, canonical or cascade implementation. On the other hand, its pole positions
have low sensitivity to quantization. Regardless of how much error we make in
the approximation of a and b, the filter is guaranteed to have complex conjugate
pole pairs.
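In state variable terms the coupled section is a scaled rotation, which is why a
conjugate pair survives any quantization of a and b. A minimal sketch of the
update (our own naming; it realizes the transfer function derived above, with
the output taken from the second section):

    typedef struct { double a, b, v1, v2; } coupled_allpole;

    /* One sample of the coupled all-pole section, H(z) = b/((z-a)^2 + b^2).
     * However coarsely a and b are quantized, the poles stay at a +/- jb. */
    double coupled_step(coupled_allpole *s, double x)
    {
        double v1 = s->a * s->v1 - s->b * s->v2 + x;  /* section 1 + coupling */
        double v2 = s->b * s->v1 + s->a * s->v2;      /* section 2 + coupling */
        s->v1 = v1;
        s->v2 = v2;
        return v2;
    }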

2 Finite Word Length Effects

2.1 Multiply-Accumulate Operations

Linear filters are invariably implemented through a sequence of multiplications
and additions. In each case, a signal-dependent quantity, x, is multiplied by
some coefficient, a, and the result is added to a previously computed
signal-dependent quantity, s. We may summarize this "multiply-accumulate"
operation as

    s ← s + a · x

Example 3  Consider the canonical second order filter structure shown in
Figure 7. The implementation involves two memories (or states), s_1 and s_2,
corresponding to the outputs of the first and second delay elements, respectively. As
each new input sample x[n] arrives, the following operations are performed to
derive a new output sample, y[n] (a runnable transcription appears at the end
of this example):

• s_0 ← x[n]    (use s_0 to accumulate the output of the first adder)

• s_0 ← s_0 + b_1 · s_1

[Figure 7: Second order canonical filter section.]

• s_0 ← s_0 + b_2 · s_2

• y ← s_0    (use y to accumulate the output of the second adder)

• y ← y + a_1 · s_1

• y ← y + a_2 · s_2

• y[n] ← y

• s_2 ← s_1; s_1 ← s_0    (operate the shift register to prepare for the next sample)

The complete implementation thus involves four multiply-accumulate
operations per sample. Two temporary accumulator registers, y and s_0, are also
used, together with two state variables, s_1 and s_2. To see why state variables
are fundamentally different to the temporary accumulator variables, consider the
following. If the complete filter involves multiple cascaded sections, the
implementation of each section may share the same temporary accumulator variables,
but each section must have its own independent set of state variables.
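A direct C transcription of Example 3's operation sequence might read as
follows (a sketch; the struct and names are ours, and a_0 = 1 as in Figure 7):

    typedef struct { double s1, s2; } section_state;  /* per-section states */

    /* One sample of the second order canonical section of Figure 7. */
    double section_step(section_state *st, double x,
                        double a1, double a2, double b1, double b2)
    {
        double s0, y;
        s0 = x;                  /* accumulate the output of the first adder */
        s0 += b1 * st->s1;
        s0 += b2 * st->s2;
        y = s0;                  /* accumulate the output of the second adder */
        y += a1 * st->s1;
        y += a2 * st->s2;
        st->s2 = st->s1;         /* operate the shift register */
        st->s1 = s0;
        return y;
    }

In a cascade, the locals s0 and y may be shared by every section, but each
section needs its own section_state.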

2.2 Fixed-Point Arithmetic

By and large, DSP chips are general purpose micro-processors which have
been specially designed to process multiply-accumulate operations with a high
throughput.^1 DSP chips are designed either for fixed-point or floating-point
processing, but in this section we focus on fixed-point processing. Fixed-point
arithmetic can be performed more efficiently, with lower latency and less
circuitry, but it places a greater demand on the implementor to customize the
dynamic range of the data which is being processed to the available word lengths.

^1 Other DSP-specific features typically include multiple data/address buses and special
address generation logic which is well adapted to certain typical DSP operations.

For full custom hardware solutions, similar considerations apply and fixed-point
implementations generally result in smaller, less expensive chips with higher
throughput. We briefly consider floating point implementations in Section 2.7.
Fixed-point processors treat all quantities as integers. It is up to the
implementor to establish a relationship between these integers and the real-valued
quantities which would ideally be processed. For example, suppose that the
input sequence satisfies |x[n]| < A for some bound, A, and we wish to represent
the samples as two's complement integers, each having W bits. This
could be achieved by scaling the input samples by 2^{W-1}/A and rounding to
the nearest integer. The implementor must then bear in mind that the integer
values are scaled versions of the original input samples, scaling the output back
accordingly.
To minimize the impact of integer rounding effects, it is desirable to scale
the results produced by each processing step so that they utilize as many of
the available bits as possible. This can lead to a large number of different
scale factors, which may require additional multiplications.^2 For this reason,
it is common to adopt a convention which restricts the scale factors to exact
powers of 2. This leads to a description of fixed-point quantities in terms of two
quantities: the number, I, of integer bits; and the number, F, of fraction bits.

^2 You should always implement scaling through multiplication, rather than division, since
multiplication is much faster, especially when working with high precision representations.

We say that a real-valued quantity x has an I.F fixed point representation
as a W = I + F bit integer x', if

    x = 2^{-F} x'    (4)

We can also express this fixed point representation graphically by inserting a
"binary point" in front of the F least significant bits.

Example 4  The real-valued quantity, x = 3.125, may be represented exactly as
an unsigned 2.3 fixed point quantity,

    x = 11.001

The 3 fraction bits are separated from the 2 integer bits by the binary point. The
integer x', used to represent x, is given by

    x' = 11001

which has a value of 25 = 2^3 x.

We will usually be working with two's complement signed integer
representations. This means that the minimum and maximum integers which can be
represented using W bits are -2^{W-1} and 2^{W-1} - 1, respectively. Consequently,
the range of real-valued quantities, x, which can be represented without overflow
in an I.F fixed-point representation is given by

    -2^{I+F-1} \le 2^F x \le 2^{I+F-1} - 1

or, equivalently,

    -2^{I-1} \le x \le 2^{I-1} - 2^{-F}

It is often convenient to approximate this range by

    |x| < 2^{I-1}

Example 5  The real-valued quantity, x = -2.625, may be expressed in a signed
4.4 representation by the 8-bit integer

    x' = 2^4 x = -42
       = 11010110

The maximum value which can be represented as a 4.4 quantity is

    0111.1111 = \frac{127}{16} = 7.9375

The smallest quantity is

    1000.0000 = -\frac{128}{16} = -8
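The conversions in Examples 4 and 5 can be reproduced with two one-line
helpers (our own illustration):

    #include <math.h>
    #include <stdio.h>

    /* Round a real value to the nearest integer of an I.F representation. */
    static long to_fixed(double x, int F) { return lround(ldexp(x, F)); }

    /* Recover the real value represented by an I.F integer. */
    static double from_fixed(long xp, int F) { return ldexp((double)xp, -F); }

    int main(void)
    {
        printf("3.125 as 2.3  -> %ld\n", to_fixed(3.125, 3));   /* 25  */
        printf("-2.625 as 4.4 -> %ld\n", to_fixed(-2.625, 4));  /* -42 */
        printf("25 as 2.3     -> %g\n", from_fixed(25, 3));     /* 3.125 */
        return 0;
    }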
If two xed-point quantities have the same number of fraction bits, they can
be added directly as integers, without any round-o errors. If they have dierent
numbers of fraction bits, one or both of the quantities must have its represen-
tation changed by multiplying or dividing the integer values by an appropriate
power of 2. We refer to this as a normalization shift, since multiplication or divi-
sion by a power of 2 amounts to shifting the two’’s complement integer quantities
to the left or right, respectively.

Example 6  Suppose we need to add a 4.4 quantity, x_1, to a 6.2 quantity, x_2.
Both quantities are represented as 8-bit integers, but with different numbers of
fraction bits. If x_2 really requires 6 integer bits, we can expect the result to also
require at least 6 integer bits, meaning that we should drop 2 fraction bits from
x_1 before adding the numbers. We can implement this as follows (C-language
syntax used for convenience):

• x_1' ← (x_1' + 2) >> 2    — obtains the nearest (by rounding) 6.2 representation of x_1

• y' ← x_1' + x_2'    — y' has 2 fraction bits

Multiplying two W-bit signed integers, x_1' and x_2', yields a (2W-1)-bit
signed integer, y', in the range -2^{2W-2} < y' \le 2^{2W-2}. If the multiplicands
correspond to I_1.F_1 and I_2.F_2 fixed-point quantities, respectively, y' is an
(I_1 + I_2 - 1).(F_1 + F_2) fixed-point representation of y = x_1 · x_2. Note that
multiplication itself does not introduce any round-off errors. Round-off errors
arise when we need to reduce the precision of the result back to a W-bit quantity
by discarding fraction bits.
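For instance, a 16-bit by 16-bit multiply followed by a rounded precision
reduction might look like this (a sketch; it assumes 1 <= shift <= 30 and an
arithmetic right shift, which holds on virtually all DSP and CPU targets):

    #include <stdint.h>

    /* Multiply two 16-bit fixed-point quantities exactly, then discard
     * `shift` fraction bits from the 31-bit product, with rounding. */
    int16_t mul_round(int16_t x1, int16_t x2, int shift)
    {
        int32_t product = (int32_t)x1 * (int32_t)x2;  /* exact, no round-off */
        int32_t rounded = (product + (1 << (shift - 1))) >> shift;
        return (int16_t)rounded;  /* caller must ensure the result fits */
    }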

[Figure 8: Typical ripple-adder circuit constructed from 1-bit full adder cells, each of which sums three input bits, a, b, c_i, and outputs the sum as a 2-bit quantity, r, c_o.]

2.2.1 Relative Complexity of Operations

Since fixed-point arithmetic is actually integer arithmetic, with a superimposed
convention for the interpretation of the integers, the complexity of the relevant
operations depends only upon the word size, W. Addition of two W_+-bit
operands is almost invariably implemented as a logic circuit consisting of a
cascade of W_+ full adder cells, as shown in Figure 8. Each one-bit full adder takes
three inputs: an input bit from each of the two words being added; and a carry
bit from the next least significant bit position. The full adder cell generates
two outputs: a result bit; and a carry bit, to be propagated to the next more
significant bit position.
Multiplication can be implemented in various ways, but the most common
architecture found in DSP applications (and also most modern CPUs) is a
"parallel" multiplication circuit consisting of a W_× × W_× array of one-bit adder
cells, coupled by AND gates, as shown in Figure 9. The total number of gates
required to implement a multiplier is proportional to the product of the word
sizes of the two multiplicands. Thus, for example, a 16 × 16 multiplier unit
requires 256 one-bit adder cells. By contrast, a 16-bit adder requires only 16
one-bit adder cells.
Although the complexity of a parallel multiplier grows with the square of the
input word size, its propagation delay does not. The critical signal propagation
path in Figure 9 skirts the upper and left hand boundaries of the array of
adder cells, involving 8 adder cells. In general, the propagation delay for a
parallel multiplier circuit is proportional to the sum of the word sizes of its inputs.
Thus, the delay associated with a 16 × 16 parallel multiplier circuit is 32 times
the propagation delay associated with a single adder cell, only twice the delay
associated with a 16-bit adder.

[Figure 9: 4-bit by 4-bit parallel multiplication circuit involving 16 one-bit adder cells and producing 7 output bits. For simplicity, the circuit shown here works only with an unsigned multiplicand, b; a can be signed or unsigned.]

If the parallel multiplier unit in Figure 9 is coupled directly to the adder
unit in Figure 8 to implement a multiply-accumulate unit, the total propagation
delay is identical to that of the multiplier by itself, plus a single adder cell delay.
In other words, both the complexity and the propagation delay of a
multiply-accumulate unit are almost identical to those of the multiplier by itself.
The lesson to be learned from the above is that additions are comparatively
cheap and multiplications dominate the cost of hardware DSP solutions. DSP
chips generally provide accumulators with significantly larger bit-depths than
the multiplier, since the cost of implementing the accumulator is negligible in
comparison with that of the multiplier. The accumulator bit-depth, W_+, is
typically more than twice as large as the multiplicand bit-depth, W_×.
For those of you who are wondering about division, you can forget it
immediately. Division cannot be effectively implemented using a parallel logic unit,
so dividers are typically a great deal slower than multipliers. You should never
use division in any algorithm unless you absolutely cannot avoid it. Division by
a fixed coefficient, a, is much better implemented as a multiplication by a^{-1}.

2.3 Dynamic Range Selection

In this section we consider the problem of determining appropriate fixed-point
representations for each point in the implementation of a filter. To avoid
frequent renormalizations (adjustments in the number of fraction bits by bit
shifting, as in Example 6), it is convenient to select the same representation for all
quantities entering each accumulator block.
Since adders are much cheaper to implement than multipliers, accumulators
can normally work with higher bit-depth than the multiplicands entering
multiplier units. For example, a typical multiplier unit might accept 16-bit inputs,
producing a 32-bit result, while all additions might be performed using 32-bit
arithmetic. Since the output of each accumulator is usually an input to a
subsequent multiplier, the outputs must be renormalized by discarding extra fraction
bits.
To summarize the above statements, a good implementation should
determine a suitable fixed-point representation for the input to and the output from
each accumulator block. An appropriate design procedure is as follows.

1. Pick a fixed-point representation for the input samples, x[n]. The input
to a digital filter generally arrives as integers already, with some implied
fixed-point representation. For example, audio samples often arise from
sampling and quantizing an analog waveform in the range -1V to +1V,
so it is natural to think of the W-bit integers as having a 1.(W-1)
fixed-point representation. Thus, 16-bit audio may be understood as having
a 1.15 representation and 24-bit audio might be understood as having a
1.23 representation, with 8 extra fraction bits. In any event, let I_0 and F_0
denote the number of integer and fraction bits associated with the input
representation.

2. Identify the accumulator blocks in the filter structure.

3. Identify the transfer function, H_i(z), from the filter input to the output
of the i'th accumulator block and find the associated BIBO gain:

    G_i = \sum_{n=0}^{\infty} |h_i[n]|

To avoid any possibility of overflow, the output representation must be
able to hold samples which are G_i times larger than the filter input
samples. Accordingly, the number of integer bits required for the accumulator
output representation is

    I_i^{out} = I_0 + \lceil \log_2 G_i \rceil

where \lceil a \rceil means "round a up to the nearest integer." Assuming a
consistent word size, W_× = I_0 + F_0, for all multiplier inputs, we must have

    F_i^{out} = F_0 - \lceil \log_2 G_i \rceil

fraction bits at the output of the i'th accumulator block.


4. Next, we need to select the representation at the input to each
accumulator block. Overflow in the accumulator is of no concern so long as the
final result fits into the selected representation. This is because overflow
bits from the outputs of multipliers or at intermediate points in the
accumulation of multiplier outputs are guaranteed to cancel each other out,
so long as we use a two's complement representation. It follows that the
number of integer bits used during accumulation need only satisfy

    I_i^{in} \ge I_i^{out}    (5)

The number of fraction bits used during accumulation will generally be
much larger than the number of fraction bits available for multiplier inputs,
since W_+ is typically at least twice as large as W_×.
At this step, we select the minimum value, I_i^{in} = I_i^{out}, which maximizes
the number of accumulator fraction bits and hence maximizes the accuracy
with which we can represent the filter coefficients. We may find later,
however, that I_i^{in} needs to be increased to satisfy other constraints (see
Step 6).
5. Next, we assign multiplier coefficients in such a way as to reconcile the
differences between input and output representations. For example, an
input to the i'th accumulator block may be formed by multiplying some
coefficient, a, by the output from the (i-1)'st accumulator block. It follows
that the representation of a should have F_a fraction bits, where F_a
is given by

    F_a = F_i^{in} - F_{i-1}^{out}

[Figure 10: Fixed-point representation selections for a cascade filter consisting of two first order sections (input 1.15; accumulator inputs 2.30, 5.27 and 5.27; normalized accumulator outputs 2.14, 4.12 and 2.14; coefficient representations 0.16, 3.13, 1.15 and 1.15). The filter has two real-valued zeros and two real-valued poles.]

6. It can happen that the above steps lead to some coefficients with
insufficient integer bits in their representation. For example, assuming that all
coefficients must be represented with the same word size, W_×, the
coefficient a has only I_a = W_× - F_a integer bits, requiring the coefficient to lie
in the range

    |a| < 2^{I_a - 1}

If this is too small, we must reduce the number of fraction bits, F_a. This,
in turn, means reducing the number of fraction bits (increasing the number
of integer bits) at the input to the relevant accumulator block, taking I_i^{in}
above the minimum value given by equation (5). Reducing the number
of fraction bits at the input to an accumulator is undesirable in that it
reduces the accuracy of the representation. For this reason, this final
adjustment step is best saved until last.

Example 7  Consider the following transfer function.

    H(z) = \frac{z - 0.5}{z - 0.4} \cdot \frac{z + 0.7}{z + 0.8}

The filter may be implemented as a cascade of two first order sections, as shown
in Figure 10. Evidently, there are three accumulator blocks, whose outputs we
label y_1[n], y_2[n] and y_3[n]. The output of the entire filter is y[n] = y_3[n]. The
transfer functions from x[n] to each of these accumulator outputs are labeled

H_1(z), H_2(z) and H_3(z). They are easily found to be

    H_1(z) = \frac{z}{z - 0.4}

    H_2(z) = \frac{z - 0.5}{z - 0.4} \cdot \frac{z}{z + 0.8}
           = \frac{1}{2} \cdot \frac{z - \frac{7}{15}}{z - 0.4} + \frac{1}{2} \cdot \frac{z - \frac{14}{15}}{z + 0.8}

    H_3(z) = H(z) = \frac{z - 0.5}{z - 0.4} \cdot \frac{z + 0.7}{z + 0.8}
           = \frac{1}{2} \cdot \frac{z - \frac{7}{12}}{z - 0.4} + \frac{1}{2} \cdot \frac{z + \frac{7}{12}}{z + 0.8}

The corresponding impulse responses are

    h_1[n] = (0.4)^n u[n]

    h_2[n] = \frac{1}{2} (0.4)^n u[n] - \frac{7}{30} (0.4)^{n-1} u[n-1]
           + \frac{1}{2} (-0.8)^n u[n] - \frac{14}{30} (-0.8)^{n-1} u[n-1]

    h_3[n] = \frac{1}{2} (0.4)^n u[n] - \frac{7}{24} (0.4)^{n-1} u[n-1]
           + \frac{1}{2} (-0.8)^n u[n] + \frac{7}{24} (-0.8)^{n-1} u[n-1]

Expanding these impulse responses, we evaluate the BIBO gains numerically.
Note that we need only expand an initial prefix, since all impulse responses have
exponential decay. We find BIBO gains of

    G_1 = \frac{1}{1 - 0.4} = 1\frac{2}{3}
    G_2 = 5.3571
    G_3 = 1.6071

Assuming that the input has a 1.15 fixed-point representation of real-valued
quantities in the range -1 to 1, we find that the outputs of the three accumulator
blocks should have 2.14, 4.12 and 2.14 fixed-point representations, respectively.
If the accumulator has W_+ = 32 bits of precision, Step 4 in the design
procedure selects optimal input representations of 2.30, 4.28 and 2.30 for the
three accumulator blocks, noting that these may need to be adjusted in Step 6.
Step 5 then yields the following filter coefficient representations.

• Multiplication by 0.4 involves a 2.14 input and a 2.30 output. The
coefficient must have a 0.16 representation, which is fine.

• Multiplication by 0.5 involves a 2.14 input and a 4.28 output. The
coefficient must therefore have a 2.14 representation, which is also fine.

• Multiplication by 0.8 involves a 4.12 input and a 4.28 output. The
coefficient must have a 0.16 representation, which is not possible. We will need
to reduce the number of fraction bits at the input to the second accumulator
block by at least 1 in the next step.

• Multiplication by 0.7 involves a 4.12 input and a 2.30 output, requiring
a -2.18 representation for the coefficient. This is also not possible. We
will need to reduce the number of fraction bits at the input to the third
accumulator block by at least 3 in the next step.

In Step 6 we make the accumulator input representation adjustments
mentioned above. The final fixed-point representations for the accumulator inputs
and outputs and for the coefficients themselves are all shown in Figure 10.
Note that the two branches with coefficients of 1 are implemented using
simple upshifts to convert between their input and output representations. If we
had selected scale factors for our fixed-point representation which were not
powers of 2, we would have been able to extract a little more accuracy from the
representations, but these branches would have required full multipliers, which
is more costly than simply shifting bit positions. In hardware implementations,
converting between different fixed-point representations in the power-of-2
framework adopted here is essentially free.
We also note that the outputs from the three accumulators require
normalization shifts (right shifts, with rounding) by 16, 15 and 13, respectively. These
normalization shifts are the sole source of round-off error, which we shall
investigate further in Section 2.5.
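The gains quoted in Example 7 are easy to reproduce by expanding a prefix of
each impulse response, as suggested above. A minimal check (ours), driving
each H_i as a difference equation:

    #include <math.h>
    #include <stdio.h>

    /* BIBO gain of H(z) = (1 + n1 z^-1 + n2 z^-2)/(1 + d1 z^-1 + d2 z^-2),
     * estimated over a 200-term prefix; every impulse response above
     * decays at least as fast as 0.8^n, so this is ample. */
    static double bibo_gain(double n1, double n2, double d1, double d2)
    {
        double y1 = 0, y2 = 0, gain = 0;
        for (int n = 0; n < 200; n++) {
            double y = (n == 0) + n1 * (n == 1) + n2 * (n == 2)
                     - d1 * y1 - d2 * y2;
            gain += fabs(y);
            y2 = y1;
            y1 = y;
        }
        return gain;
    }

    int main(void)
    {
        printf("G1 = %.4f\n", bibo_gain( 0.0,  0.00, -0.4,  0.00)); /* 1.6667 */
        printf("G2 = %.4f\n", bibo_gain(-0.5,  0.00,  0.4, -0.32)); /* 5.3571 */
        printf("G3 = %.4f\n", bibo_gain( 0.2, -0.35,  0.4, -0.32)); /* 1.6071 */
        return 0;
    }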

2.3.1 Alternate Implementation Environments

In the above, we have considered a conventional DSP implementation
environment, in which the word sizes available for multiplication and accumulation
operators are fixed and generally different. When working with general purpose
CPUs, there is very little difference except that the accumulation bit-depth,
W_+, is usually identical to the multiplicand bit-depth, W_×. This affects the
number of fraction bits available for accumulation, which requires the designer
to carefully balance the number of fraction bits dedicated to representing
intermediate sample values with the number of fraction bits used to represent
coefficients.
In hardware implementations, normalizing downshifts are almost free and
up-shifts are completely free: adding extra fraction bits to a word is simply
a matter of rewiring the bits. Consider Step 6 in the design process described
above, where we had to make adjustments to ensure that filter coefficient
representations have sufficient integer bits. In Example 7, we found that the ideal
representation for the output of the multiplier with coefficient 0.8 had 0
integer bits and 16 fraction bits. Rather than adjusting the number of fraction bits
used by the accumulator, and hence all multipliers feeding into that
accumulator, one could instead change the coefficient to 0.4 and shift the output of the
multiplier one bit position to the left (multiplication by 2). The shift is free in

hardware, and the coefficient of 0.4 can be represented as a 0.16 fixed-point
number. In the same way, we could divide the coefficient 0.7 by 8, allowing
it to be represented as a -2.18 fixed-point number,^3 shifting the result 3 bit
positions to the left. It follows that hardware implementations can and should
always have the same number of integer bits at the accumulator input and its
normalized (downshifted) output, i.e., I_i^{in} = I_i^{out}.

^3 -2 integer bits simply means that the first two bit positions after the binary point are
never used.
Another implementation environment which is becoming increasingly
important is that created by the multi-word instruction sets supported by most
modern general purpose CPUs and by some DSP chips. The Intel MMX
instruction set is typical of these multi-word operations. As an example, MMX
instructions allow four 16-bit by 16-bit multiplications to be performed
simultaneously, by leveraging the high performance parallel multiplier matrix which
is required for double-precision floating point multiplication. The snag is that
the output from each 16 × 16 multiplier has only 16 bits, rather than the 31
required to effect true multiplication. The most useful variant of the operation
effectively forms

    y = \left\lfloor \frac{x \times a}{2^{16}} \right\rfloor

That is, the two multiplicands, x and a, are effectively multiplied to form a
31-bit result, whose 16 least significant bits are immediately discarded. In this
operation, there is no opportunity to implement accumulators with higher
precision than the multiplicands, nor is the output y rounded to the integer nearest
to x \times a / 2^{16}. As a result, y is both a scaled and a biased version of the true
product. All of these effects can be appropriately compensated by careful
implementation, but we do not have space here to discuss such matters further.
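One lane of such a high-half multiply can be emulated in portable C as follows
(our own sketch, mimicking the behaviour of the MMX pmulhw instruction;
it assumes an arithmetic right shift):

    #include <stdint.h>

    /* The 32-bit product's low 16 bits are simply discarded, so the result
     * is scaled by 2^-16 and carries a small negative bias relative to a
     * properly rounded result. */
    int16_t mulhi16(int16_t x, int16_t a)
    {
        int32_t product = (int32_t)x * (int32_t)a;
        return (int16_t)(product >> 16);
    }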

2.3.2 Rethinking the Gain Factors

In the above design procedure, we computed BIBO gains, G_i, between the filter
input and each of the accumulator outputs. These gains represent worst-case
expansions in the dynamic range of an arbitrary input sequence, x[n]. The worst
case expansions are invariably obtained when x[n] alternates between the two
extremes of its own dynamic range. Specifically, x[n] = A sign(h[K-n]), where
A is the maximum possible value for |x[n]|, sign(h) is 1 if h is positive and -1
if h is negative, and K is an arbitrary integer.
These worst-case gains might not actually be observed in reality, especially
if we know that not all input signals, x[n], are actually possible. When
implementing filters for narrow-band signals, x[n], it is common to assume that the
worst-case expansion is given by

    G_i = \max_{\omega \in \Omega} |\hat{h}_i(\omega)| = \max_{\omega \in \Omega} \left| H_i(z) \right|_{z = e^{j\omega}}

where \Omega is the set of frequencies which are of interest. In this case, G_i is the
maximum gain in the amplitude of any pure sinusoid in the frequency range of

interest. It can be sufficiently smaller than the BIBO gain, leading to more
fraction bits in the fixed-point representations and hence less round-off error. There
is, however, some risk that the representations may have insufficient dynamic
range and hence overflow.
In simple two's complement arithmetic, overflow is a serious concern.
Consider, for example, what happens if a 3.5 accumulator representation overflows
with the value 4, representing it as 100.00000, which is equal to -4. Overflow
in two's complement arithmetic causes the numerical representation to
"wrap around," leading to massive errors. For this reason, if the BIBO gains
are not used, the implementation should employ saturating arithmetic.
Saturating arithmetic checks for overflow and selects the representable value which
is closest to the true out-of-range value.
Saturating arithmetic comes with some of its own drawbacks which are worth
noting. Firstly, it is somewhat more expensive to implement a saturating
accumulator than a simple two's complement accumulator. Secondly and more
significantly, a saturating accumulator typically saturates the results of each
incremental addition. Consider, for example, the central accumulator in
Figure 10, which has 3 inputs. Depending on the order in which the inputs are
added, the results produced by a saturating accumulator may be different.
Moreover, when designing the fixed-point representations for a saturating
accumulator one must be careful to ensure that each incremental output from the
accumulator (after adding each new quantity into the total) can be represented
without overflow, under the conditions for which the filter is being designed.
By contrast, with regular two's complement addition, it is sufficient to ensure
that the accumulator has sufficient precision to accommodate the final result,
in which case intermediate overflow bits will be guaranteed to cancel. This can
detract from some of the savings achieved by selecting gains, G_i, which are
less conservative than the BIBO gains.
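A saturating accumulator can be sketched as follows (our own 32-bit
illustration); note that applying it after every incremental addition is exactly what
makes the final result depend on the order of the additions:

    #include <stdint.h>

    /* Add b into a 32-bit accumulator with saturation instead of wrap-around. */
    int32_t sat_add32(int32_t acc, int32_t b)
    {
        int64_t sum = (int64_t)acc + b;       /* exact 33-bit result */
        if (sum > INT32_MAX) return INT32_MAX;
        if (sum < INT32_MIN) return INT32_MIN;
        return (int32_t)sum;
    }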

2.4 Coefficient Quantization

The design procedures outlined above are intended to assign as many fraction
bits as possible to the representation of filter coefficients. Nevertheless, most
filter designs yield irrational filter coefficients which cannot be represented
exactly with any finite number of bits. We refer to the process of coercing filter
coefficients to implementable values as "coefficient quantization." In general,
the filter which is implemented will have slightly different poles and zeros to
those which were originally intended. As we have already mentioned, the
impact of coefficient quantization can be very dependent on the implementation
structure. The following examples are intended to clarify this point.

Example 8  Consider a second order all-pole filter, implemented as in
Figure 11. The transfer function is

    H(z) = \frac{z^2}{z^2 + b_1 z + b_2}

The design calls for complex conjugate poles at z = r e^{\pm j\theta}, where

    b_2 = r^2  and  b_1 = -2r\cos\theta

[Figure 11: Direct implementation of a second order all-pole filter.]

The two coefficients in the implementation are thus directly related to the radius,
r, and the real part, r\cos\theta, of the poles. Evidently, for stability we require
0 \le b_2 < 1 and |b_1| < 2. Now suppose both quantities are to be represented
using signed 3-bit integers; b_1 as a 2.1 quantity, and b_2 as an unsigned 1.2
quantity. Since b_2 must be positive to get complex conjugate poles, the available
pole positions, p, which can be described under these conditions have

    |p| \in \left\{ \sqrt{\tfrac{1}{4}}, \sqrt{\tfrac{2}{4}}, \sqrt{\tfrac{3}{4}} \right\}

and

    \Re(p) \in \left\{ 0, \pm\tfrac{1}{4}, \pm\tfrac{1}{2}, \pm\tfrac{3}{4} \right\}

In fact, the only stable complex conjugate pole positions which can be achieved
are those indicated in Figure 12. Note that only 15 different conjugate pole pairs
can be realized and they are quite irregularly spaced.
Example 9  Consider the coupled implementation of a second order all-pole
filter, as shown in Figure 6. We showed previously that the pole positions
associated with this realization are given by

    p = a \pm jb

so that quantizing the coefficients is exactly equivalent to quantizing the real and
imaginary parts of the pole positions. If we again assume a 3-bit signed
fixed-point representation for the filter coefficients, the complex conjugate pole pairs
which can be achieved lie on the regular rectangular grid shown in Figure 13. In
this case, there are 19 possible conjugate pole pairs.

Unfortunately, implementation structure affects many different properties
of the filter, whose dependencies can be difficult to track down and optimize.
These properties include memory consumption, coefficient quantization,
round-off errors and limit cycle behaviour, some aspects of which are discussed further
below. There is no general method to select the best structure for implementing
a particular filter. Instead, several different structures can be examined,
separately optimizing their parameters and representations, to determine the most
appropriate structure for a given application.
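The pole pair counts in Examples 8 and 9 can be reproduced by brute force
enumeration. The program below is our own check; it assumes the coupled
form's a and b are signed 1.2 quantities, which Example 9 leaves implicit. It
prints 15 and 19.

    #include <stdio.h>

    int main(void)
    {
        int direct = 0, coupled = 0;

        /* Example 8: b1 is a signed 2.1 quantity, b2 an unsigned 1.2
         * quantity with b2 < 1. Complex poles require b1^2 < 4 b2. */
        for (int i1 = -4; i1 <= 3; i1++)         /* b1 = i1 / 2 */
            for (int i2 = 0; i2 <= 3; i2++) {    /* b2 = i2 / 4 */
                double b1 = i1 / 2.0, b2 = i2 / 4.0;
                if (b1 * b1 < 4.0 * b2)
                    direct++;
            }

        /* Example 9: a and b are signed 1.2 quantities, poles at a +/- jb.
         * Count each pair once (b > 0); stability needs a^2 + b^2 < 1. */
        for (int ia = -4; ia <= 3; ia++)
            for (int ib = 1; ib <= 3; ib++) {
                double a = ia / 4.0, b = ib / 4.0;
                if (a * a + b * b < 1.0)
                    coupled++;
            }

        printf("direct form pairs:  %d\n", direct);   /* 15 */
        printf("coupled form pairs: %d\n", coupled);  /* 19 */
        return 0;
    }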
c
?Taubman, 2003 ELEC4621: Implementation Techniques Page 25

[Figure 12: Stable complex conjugate pole positions achievable by the direct implementation of Figure 11 with 3-bit coefficients, plotted against the unit circle.]

[Figure 13: Stable complex conjugate pole positions achievable by the coupled implementation of Figure 6 with 3-bit coefficients; the positions lie on a regular rectangular grid inside the unit circle.]
c
?Taubman, 2003 ELEC4621: Implementation Techniques Page 26

2.5 Round-Off Effects

Following the implementation strategies outlined above, it should be clear that
the only sources of numerical error are the normalizing downshifts which are
applied at the output of each accumulator block to decrease the precision from
W_+ (the accumulator bit-depth) to W_× (the multiplicand bit-depth). If the
normalized representation at the output of the i'th accumulator block has F_i^{out}
fraction bits, the round-off error, \delta_i, is in the range

    -2^{-(F_i^{out}+1)} \le \delta_i \le 2^{-(F_i^{out}+1)}

Moreover, under reasonable conditions this error is generally uncorrelated with
the input, uncorrelated with other sources of round-off error and uncorrelated
from sample to sample.
In summary, each accumulator may be modeled by a true error-free
accumulator, plus an independent white noise process. Assuming that the
normalizing downshift discards a significant number of fraction bits, the round-off
error process for accumulator block i may be modeled as having zero mean^4
and a variance of

    \sigma_i^2 = \frac{1}{2^{-F_i^{out}}} \int_{-2^{-(F_i^{out}+1)}}^{2^{-(F_i^{out}+1)}} x^2 \, dx = \frac{2^{-2 F_i^{out}}}{12}

^4 Actually, the round-off error process is very slightly asymmetrical, so there is a very small
mean offset. If, for example, the normalizing downshift discards 2 fraction bits to form a 1.3
result, rounding to the nearest value and upward where there is an ambiguity, the possible
round-off errors are +2 × 2^{-5}, +1 × 2^{-5}, 0 and -1 × 2^{-5}. In most cases, a large number of
fraction bits are discarded and the round-off error is distributed almost symmetrically about 0.
The total output noise power (variance) is given by

    \sigma^2 = \sum_i \sigma_i^2 P_i

where P_i is a power gain factor, computed from the transfer function, T_i(z),
between the input to the i'th accumulator block and the output of the entire
filter. The power gain may be expressed in any of the following ways:

    P_i = \sum_{n=0}^{\infty} (t_i[n])^2
        = \frac{1}{2\pi} \int_{-\pi}^{\pi} |\hat{t}_i(\omega)|^2 \, d\omega
        = \frac{1}{2\pi} \int_{-\pi}^{\pi} T_i(e^{j\omega}) T_i(e^{-j\omega}) \, d\omega

The power spectral density of the output noise process is given by

    S(\omega) = \sum_i \sigma_i^2 |T_i(e^{j\omega})|^2

[Figure 14: Round-off noise model for the filter implementation in Figure 10.]

Example 10  Consider the cascade system shown in Figure 10. There are three
accumulator blocks, each with its own source of round-off noise, as shown in
Figure 14. These noise processes have variances,

    \sigma_1^2 = \sigma_3^2 = \frac{2^{-28}}{12} = -95 dB  and  \sigma_2^2 = \frac{2^{-24}}{12} = -83 dB

The power gain factors and the noise power spectral density may be computed
from the three noise transfer functions,

    T_1(z) = \frac{z - 0.5}{z - 0.4} \cdot \frac{z + 0.7}{z + 0.8}

    T_2(z) = \frac{z + 0.7}{z + 0.8}

    T_3(z) = 1
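The power gains P_i can be evaluated from a prefix of each t_i[n], just as the
BIBO gains were. A minimal sketch (ours), summing squared impulse response
samples with the same biquad recursion as before:

    #include <stdio.h>

    /* Power gain P = sum of t[n]^2 for
     * T(z) = (1 + n1 z^-1 + n2 z^-2) / (1 + d1 z^-1 + d2 z^-2),
     * summed over a prefix long enough for the tail to die away. */
    static double power_gain(double n1, double n2, double d1, double d2)
    {
        double y1 = 0, y2 = 0, P = 0;
        for (int n = 0; n < 200; n++) {
            double y = (n == 0) + n1 * (n == 1) + n2 * (n == 2)
                     - d1 * y1 - d2 * y2;
            P += y * y;
            y2 = y1;
            y1 = y;
        }
        return P;
    }

    int main(void)
    {
        printf("P1 = %.4f\n", power_gain(0.2, -0.35, 0.4, -0.32)); /* T1 */
        printf("P2 = %.4f\n", power_gain(0.7,  0.00, 0.8,  0.00)); /* T2 */
        printf("P3 = %.4f\n", 1.0);                                /* T3 = 1 */
        return 0;
    }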

2.6 Limit Cycles

An interesting and potentially annoying consequence of round-off error is that
the output from a recursive filter can exhibit small harmonic fluctuations even
when the input signal is constant. The phenomenon is illustrated by the
following simple example.

Example 11  Consider a simple single-pole filter, with difference equation

    y[n] = x[n] + a y[n-1]

The filter's transfer function is

    H(z) = \frac{z}{z - a}

and its DC gain is \frac{1}{1-a}. To make the example as simple as possible, suppose
that a = -\frac{1}{4}, so that the filter coefficient can be implemented exactly without any

quantization at all. Multiplication by a is equivalent to a right-shift operation.
Specifically, each new output sample is computed using the expression

    y'[n] = x'[n] - \left\lfloor \frac{y'[n-1] + 2}{4} \right\rfloor

where the primed quantities are the integers used in the fixed-point representation
of x and y, and the floor operation, \lfloor \cdot \rfloor, serves to round y'[n-1]/4 to the
nearest integer. It may be implemented by adding 2 to y'[n-1] and then shifting
the result to the right by two bit positions.
Now suppose we supply a constant (DC) input of x'[n] = 7. The DC gain
of the filter is \frac{4}{5}, so the output should ideally converge to a constant value
of 5\frac{3}{5}, which cannot be represented exactly. The actual behaviour of the filter
output is traced below:

    n = 0:  y' = 7
    n = 1:  y' = 7 - \lfloor (7+2)/4 \rfloor = 5
    n = 2:  y' = 7 - \lfloor (5+2)/4 \rfloor = 6
    n = 3:  y' = 7 - \lfloor (6+2)/4 \rfloor = 5
    n = 4:  y' = 7 - \lfloor (5+2)/4 \rfloor = 6

Evidently, the filter output will continue to oscillate between 5 and 6. These
oscillations are dependent on the DC level of the supplied input and they
represent tones in the filter output which are unrelated to the frequency content of
the input signal. If insufficient bits are used in the numerical implementation,
these tones can become audible or otherwise perceptible in the filter output.
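The limit cycle is easy to reproduce (our own transcription of the recursion
above, starting from y'[-1] = 0):

    #include <stdio.h>

    int main(void)
    {
        int y = 0;
        for (int n = 0; n <= 8; n++) {
            y = 7 - ((y + 2) >> 2);     /* y'[n] = 7 - round(y'[n-1]/4) */
            printf("n = %d: y' = %d\n", n, y);
        }
        return 0;                       /* settles into 5, 6, 5, 6, ... */
    }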

2.7 Floating-Point Arithmetic

Up until this point, we have considered only fixed-point arithmetic. Floating
point implementations are conceptually much simpler, since the floating point
processor automatically optimizes the numerical representation of every
arithmetic result so as to minimize round-off error. Of course, the price paid for this
is that floating point arithmetic is significantly more expensive to implement.
A floating point representation for the real-valued quantity, x, involves two
integer valued quantities, e (a signed exponent) and m (an unsigned M-bit
mantissa). The relationship is given by

    |x| = 2^e \left(1 + 2^{-M} m\right)

Note that the sign of x is represented by an extra binary digit.
Most floating point processors employ either the IEEE single precision
representation or the IEEE double precision representation. The single precision

representation uses 32 bits in total: one sign bit; M = 23 mantissa bits; and 8
exponent bits. This allows exponents in the range -128 to +127 and a relative
accuracy of

    \frac{\Delta x}{|x|} \le 2^{-(M+1)}

The double precision representation uses M = 52 mantissa bits and 11 exponent
bits, with a 64-bit word size.
When oating point representations are employed, round-o errors are often
much smaller than in the xed-point case, since the number of fraction bits
(determined by the exponent, e) is dynamically adjusted after each computa-
tion is performed, so as to maximize the accuracy of the representation. This
adjustment process can add signicant complexity, particularly to additions.
Consider, for example, the addition of two quantities, 3.708 and 3.707. Each
quantity is represented with an exponent of e = 1, but their sum requires an
exponent of e = 11. In general, additions require extensive shifting of the
mantissa bits both before and after the binary addition of their contents. This
shifting is either performed sequentially (can require many clock cycles) or in
parallel, using a ““barrel shifting network”” (can occupy a lot of silicon area on
the chip).
In either case, the lesson to be learned is that oating point additions are
much more complex than xed-point additions and they can actually take signi-
cantly longer to execute than oating point multiplications. Moreover, round-o
error may be introduced whenever two numbers are added, and whenever two
numbers are multiplied, making it much more di!cult to trace the eects of
round-o error through a system for modeling purposes.
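The exponent adjustment in the example above can be observed with the
standard C function frexp (our own demonstration; frexp normalizes to a fraction
in [0.5, 1), so its reported exponent is one larger than e in the 1 + mantissa
convention used here):

    #include <math.h>
    #include <stdio.h>

    static void show(double x)
    {
        int e2;
        double frac = frexp(x, &e2);   /* x = frac * 2^e2, 0.5 <= frac < 1 */
        printf("%8.4f = %.6f * 2^%d\n", x, frac, e2);
    }

    int main(void)
    {
        show(3.708);            /* frexp exponent 2, i.e. e = 1 above   */
        show(3.707);
        show(3.708 - 3.707);    /* 0.001: the exponent drops sharply    */
        return 0;
    }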
