This manuscript has been reproduced from the microfilm master. UMI
films the text directly from the original or copy submitted. Thus, some
thesis and dissertation copies are in typewriter face, while others may
be from any type of computer printer.
The quality of this reproduction is dependent upon the quality of the
copy submitted. Broken or indistinct print, colored or poor quality
illustrations and photographs, print bleedthrough, substandard margins,
and improper alignment can adversely affect reproduction.
In the unlikely event that the author did not send UMI a complete
manuscript and there are missing pages, these will be noted. Also, if
unauthorized copyright material had to be removed, a note will indicate
the deletion.
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
Order Number 9126812
UMI
300 N. Zeeb Rd.
Ann Arbor, MI 48106
Performance-Oriented Technology Mapping
By
DISSERTATION
DOCTOR OF PHILOSOPHY
in
COMPUTER SCIENCE
in the
GRADUATE DIVISION
of the
Performance-Oriented Technology Mapping
Copyright © 1990
Herve J. Touati
Performance-Oriented Technology Mapping
Herve J. Touati
Ph.D. Thesis
University of California, Berkeley, California
Department of Electrical Engineering and Computer Science, Computer Science Division
Abstract
This thesis presents a variety of techniques to minimize circuit delay during the translation
of a set of Boolean equations into a list of connected logic gates that can be used for the
manufacturing of combinational digital circuits. This translation process is called technology
mapping. The first contribution of this work is an optimal algorithm to implement Boolean
circuits that can be represented as trees, using an extension of known tree covering
algorithms. The second and more important contribution of this work is an
in-depth analysis of fanout optimization. The fanout problem is the problem of distributing
a signal to several destinations, where the signal may be required at different times, in order
to minimize the overall delay. This work presents the most detailed theoretical study of the
complexity of fanout optimization published so far, and a spectrum of heuristics to solve
the fanout problem under realistic delay models. This thesis also introduces a simple
algorithm that can be used to apply fanout optimization to an entire network. This algorithm
yields an optimal application of fanout optimization in terms of delay, while keeping the area
increase of the circuit to a low value. To study the integration of tree covering and fanout
optimization, this work introduces a technology independent delay model that characterizes
precisely suboptimalities due to imbalances in a network. This is the first technology
independent delay model that models the delay through a node as a function of the arrival time
distribution at a node. This delay model can be used to derive analytically optimal solutions
in simple cases, which can be used to measure the suboptimality of heuristics. An extension
to tree covering is then suggested, and shown to provide significant delay reductions for a
relatively heavy cost in area. Finally this work investigates technology independent delay
optimization techniques based on partial or total collapsing of logic, and shows that further
delay reductions can be achieved with these techniques, possibly at a higher cost in area.
Acknowledgements
First I would like to thank my advisor, Pr. Robert Brayton, for his undivided support and
encouragement. Much of this work would not have been possible without his help. I am
also very grateful to Pr. Sangiovanni-Vincentelli for accepting the burden of being a second
reader of this thesis and for many exciting discussions. I am also indebted to Pr. Kobayashi,
of the Mathematics Department, who was kind enough to accept to be a member of my
thesis committee. I owe much to Pr. Alan Smith and Pr. Alvin Despain for teaching me
many of the fundamentals of computer science research. I also would like to thank Dr.
Kurshan of AT&T Bell Laboratories for many interesting discussions, Pr. Susan Graham
and Dr. Kurokawa of IBM Japan for their support and advice early in my graduate student
life. Finally I would like to acknowledge the French Ministry of Transportation, DARPA,
IBM and DEC who granted me the privilege of their support during my graduate studies.
Above all, I would like to thank my wife, Atsuko, and my daughter, Marianne
Ayaka, for their love and support, and for coping with the hardships of graduate student
life. I owe to Marianne the habit of waking up early every day and to Atsuko a strong desire
to graduate. These would be major contributions in any field of research.
I would like to express my thanks to the CAD group in general, for providing such
a stimulating and exciting research environment.
Of course, many friends deserve special thanks. I was told that it is more polite to
list one’s friends in alphabetic order, so that only those whose names start with ’A’ do not get
offended. So many thanks to Pranav Ashar; Mary and Wendell Baker, for tolerating me as
their office mate; Andrea Casotto; Fred Douglis; Paul Gutwin; Bruce Holmer; Kurt Keutzer;
Luciano Lavagno; Bill Lin; Sharad Malik; Rick McGeer; Cho Moon; Rajeev Murgai; Antony
Ng; Terry Regier; Rick Rudell; Rafael Saavedra-Barrera; Alex Saldanha; Hamid Savoj;
Luigi Semenzato; Ellen Sentovitch; Narendra Shenoy; K. J. Singh; Ashok Singhal; Greg
Sorkin; Peter Van Roy, for sharing Belgian chocolates with me; Steve Viavant; Albert
Wang; Yosinori Watanabe; and Greg Whitcomb.
Contents
Table of Contents
List of Figures
1 Introduction
1.1 Overview
1.2 Terminology and Notation
1.2.1 Mathematical Notation
1.2.2 Combinational Logic
1.2.3 Physical Modeling
6 Conclusion
Bibliography
List of Figures
List of Tables
4.1 Effect of Taking Polarities Into Account in Fanout Delay Heuristic
4.2 Effect of Using a Logarithmic vs. Linear Delay Estimate
4.3 Effect of Iterative Improvement
4.4 Iterative Improvement vs. Optimal Assignment
4.5 Effect of Tree Duplication
4.6 Effect of Allowing Tree Overlaps
4.7 Effect of Limiting Tree Overlaps
Chapter 1
Introduction
1.1 Overview
are only a topic of secondary priority in this research. Much work remains to be done in
this area.
The technology mapping algorithms presented in this work rely on two main
techniques: tree covering and fanout optimization. Tree covering algorithms were first
studied by Aho et al. [2, 3, 1] in the context of code generation for expressions, and were later
adapted to technology mapping by Keutzer [24, 27] and Rudell [14, 36]. If the objective
is to minimize circuit area, tree covering algorithms produce good quality results and are
straightforward to use. However, if the objective is to minimize circuit delay, tree covering
needs to be extended even to generate optimal solutions for trees. This extension is discussed
in chapter 2.
Tree covering alone tends to generate poor quality implementations in terms of
delay, because most circuits are not trees but directed acyclic graphs. The signal available
at the output of a tree needs to be distributed to several destinations. Such a signal is
called a fanout signal. With tree covering alone, the circuitry used to distribute a fanout
signal to its destinations is implemented by default with a wire. In first approximation, if
n is the number of destinations of a fanout signal, the delay through this wire is of order
$O(n)$. Using a simple buffer tree, this delay can be reduced to $O(\log n)$. It is thus very
important, to minimize delay, to be able to insert buffer trees to reduce the delay incurred
by the distribution of fanout signals. This optimization, called fanout optimization [5], is
the main focus of chapter 3.
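The gap between the two orders of growth can be illustrated with a minimal sketch. It assumes a load-proportional delay of the form alpha + beta * load, a balanced tree in which every stage drives exactly `branching` loads, and buffers with the same coefficients and input load as the destinations. The function names and all numeric values are hypothetical and chosen only for illustration.

```python
def direct_drive_delay(n, alpha=1.0, beta=0.2, pin_load=1.0):
    """Delay when one gate drives all n destinations through a wire: grows like n."""
    return alpha + beta * pin_load * n


def tree_levels(n, branching):
    """Number of stages needed so that branching**levels >= n (at least one stage)."""
    levels, reach = 1, branching
    while reach < n:
        reach *= branching
        levels += 1
    return levels


def buffer_tree_delay(n, branching=4, alpha=1.0, beta=0.2, pin_load=1.0):
    """Delay through a balanced buffer tree: one stage delay per level, so grows like log n."""
    stage_delay = alpha + beta * pin_load * branching
    return tree_levels(n, branching) * stage_delay


if __name__ == "__main__":
    for n in (4, 16, 64, 256):
        print(n, direct_drive_delay(n), buffer_tree_delay(n))
```

With these assumed coefficients the buffer tree only pays off for larger fanouts; for small n the extra stages cost more than they save, which is one reason the choice of tree shape is treated as an optimization problem.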
Tree covering can be formulated as the problem of minimizing delay through a
fanin node, i.e. a node with several inputs but only one output. In a very similar way,
fanout optimization can be formulated as the problem of minimizing delay through a fanout
node, i.e. a node with one input and many outputs. To minimize delay through an entire
circuit, we need to coordinate the use of tree covering and fanout optimization on the fanin
and fanout nodes that compose the circuit. This global optimization problem is the focus
of chapter 4.
These techniques provide a solid set of technology mapping algorithms on which we
can rely to measure the effect of technology independent optimizations. Chapter 5 contains
the results of some empirical studies of the effect of technology independent optimizations
on the quality of the final, technology mapped implementation of a circuit. Finally chapter 6
throughout this thesis, and a description of the abstraction we use to model the physical
behavior of circuit components.
By convention we use the letters n and m to denote integer valued constants, and
the letters i, j and k to denote integer valued variables, usually indices. We use the letters
a, b, c to designate real valued constants, and x, y, z, t, r and p to designate real valued
variables. Occasionally, we also use upper case letters.
We use other letters to designate certain quantities in specific contexts. In
particular b, g and s are used as indices in a library of gates; b usually denotes a buffer and
s a source, i.e. a gate used at the root of a tree. We use $\alpha$, $\beta$ and $\gamma$ as constants that
characterize the delay through a gate of a gate library. We also commonly use d to denote
the number of buffers in a library, and n to denote the number of destinations of a fanout
signal.
We use $\Sigma_n$ to denote the group of permutations of a set of n elements. In that
context, $\sigma$ is used to denote an element of $\Sigma_n$, i.e. a permutation. We use $\pi$ to denote the
natural projection from a Cartesian product to one of its components. For example,
$\pi_A : A \times B \to A$, where $\pi_A((a, b)) = a$. $\mathcal{R}$ denotes the field of real numbers, and
$\mathcal{R}^+$ the subset of $\mathcal{R}$ formed of the nonnegative real numbers.
We also use the letters x, y, z to denote Boolean variables. We denote the Boolean
and with a “.” or with a space if the meaning is clear. We denote the Boolean or with a
“+” and the negation with a bar over the expression to be negated. For example we would
write $\overline{x + y} = \overline{x}.\overline{y}$.
A combinational logic circuit can be specified as a set of Boolean equations, with
no cyclic dependencies, of the form $y_j = f_j(x_1, \ldots, x_n)$, where $x_1, \ldots, x_n$ and $y_j$ denote
Boolean variables and $f_j$ a Boolean function. There is a one-to-one mapping between a set
of Boolean equations with no cyclic dependencies and a Boolean network, i.e. a directed
acyclic graph $G = (V, E)$, in which a logic equation is associated to each node.
[Figure 1.1 (labels: PRIMARY OUTPUTS, fanout node, fanin node, PRIMARY INPUTS)]
Some of the nodes of a Boolean network are distinguished as primary inputs and
primary outputs, and represent respectively the inputs and outputs of the corresponding
circuit. The fanin of a node v of a Boolean network is the set of nodes u whose output is
directly connected to an input of v. Similarly, the fanout of a node v is the set of nodes
u that have an input directly connected to the output of v. A node that has a fanout
containing only one element is called a fanin node; a node that has a fanin containing only
one element is called a fanout node. A Boolean network can always be decomposed into a
network of fanin and fanout nodes, in such a way that a fanin node is only connected to
fanout nodes, primary inputs or primary outputs, and a fanout node is only connected to
fanin nodes, primary inputs or primary outputs, as illustrated in Figure 1.1.
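The definitions of fanin, fanout, fanin node and fanout node can be made concrete with a short sketch. It assumes the network is given as a list of directed edges (u, v), meaning that the output of u is connected to an input of v. The function names and the example network are hypothetical.

```python
from collections import defaultdict


def fanin_fanout(edges):
    """Build the fanin and fanout sets of every node from a list of (u, v) edges."""
    fanin, fanout = defaultdict(set), defaultdict(set)
    for u, v in edges:
        fanout[u].add(v)   # v has an input connected to the output of u
        fanin[v].add(u)    # the output of u is connected to an input of v
    return fanin, fanout


def classify(node, fanin, fanout):
    """Apply the two definitions from the text; a node may satisfy none, one or both."""
    kinds = []
    if len(fanout.get(node, ())) == 1:
        kinds.append("fanin node")
    if len(fanin.get(node, ())) == 1:
        kinds.append("fanout node")
    return kinds


# Hypothetical network: gate g combines a and b; s distributes g to o1 and o2.
edges = [("a", "g"), ("b", "g"), ("g", "s"), ("s", "o1"), ("s", "o2")]
fanin, fanout = fanin_fanout(edges)
print(classify("g", fanin, fanout))  # several inputs, one output
print(classify("s", fanin, fanout))  # one input, several outputs
```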
1.2.3 Physical Modeling
Gate Area  In standard cell technology, we use grid counts to measure the area of a gate
or cell. The grid count is the width of a cell relative to a standard design rule. Grid count is
directly proportional to standard cell area before routing [31]. If a well characterized router
is used for layout, grid count is a good estimate of final chip area.
Gate Delay  To model gate delay, we use a simple linear delay model. This model
characterizes the delay from an input pin i to the output pin of a gate g using a linear equation
of the form $\alpha_{i,g} + \beta_{i,g}\gamma$. The coefficient $\alpha_{i,g}$ models the intrinsic delay through the gate; the
coefficient $\beta_{i,g}$ models the load dependent delay; $\gamma$ is the capacitive load at the output of
the gate, which is estimated by looking at the output connections of the gate. The model
also distinguishes between rise and fall delays. If capacitive loads are restricted to be within
a small range of values, and if delay coefficients are determined for that range of values,
this model is a reasonable first approximation, within 10% of actual delays [29]. Gate level
delay models used in the industry are usually more complex than this linear, 0-order model.
Industrial models usually rely on a simple first-order non-linear delay model, that computes
the delay and its slope, i.e. a discretized version of the first derivative of the delay. A
well-tuned version of this model can estimate physical delays within a few percent.
To estimate the capacitive load at the output of a gate g, we add the capacitive
loads of the input pins of the gates driven by g to an additional term representing the delay
through the wires.
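As a minimal sketch of this model, the two functions below compute the load at a gate output and the resulting pin-to-output delay. The function names, the single (not rise/fall) coefficient pair and the numeric values are assumptions made for illustration.

```python
def output_load(driven_pin_loads, wire_term=0.0):
    """Load at a gate output: sum of the driven input pin loads plus a wiring term."""
    return sum(driven_pin_loads) + wire_term


def pin_to_output_delay(alpha, beta, load):
    """Linear delay model: intrinsic delay alpha plus load dependent delay beta * load."""
    return alpha + beta * load


# Hypothetical gate driving three input pins, with a small wiring term.
load = output_load([1.0, 1.0, 2.0], wire_term=0.5)
delay = pin_to_output_delay(alpha=1.0, beta=0.5, load=load)
print(load, delay)
```

A fuller version would keep separate coefficient pairs for rise and fall transitions, as the text notes.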
Arrival Times, Required Times and Slack  Given arrival times at the primary inputs
of the network, the arrival time at the output of any gate in the network is defined as the
latest possible moment a transition may occur at the output of that gate. We use static
timing analysis to compute arrival times, i.e. we assume that all paths through the logic
can be activated. Techniques to detect false paths exist [33] but are too computationally
expensive to be of practical use during technology mapping. In addition, one of the goals
of technology mapping for delay is to make a large number of paths critical, reducing the
need for more sophisticated delay estimators.
Given required times at the primary outputs of a network, we can also compute
the required time at the outputs of any gate of the network by propagating required times
throughout the network. The slack at the output of a gate is defined as the difference
between the required time and the arrival time at that output.
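The three quantities can be computed with one forward and one backward pass over the network. The sketch below assumes the nodes are supplied in topological order, that each connection (u, v) carries a fixed delay, and that every node reaches a primary output. The data layout, names and numbers are hypothetical.

```python
def arrival_times(order, fanin, delay, primary_arrival):
    """Forward pass: latest arrival at each node, assuming every path can be activated."""
    arrival = dict(primary_arrival)
    for v in order:
        if v not in arrival:
            arrival[v] = max(arrival[u] + delay[(u, v)] for u in fanin[v])
    return arrival


def required_times(order, fanin, delay, primary_required):
    """Backward pass: earliest required time imposed on each node by its fanouts."""
    required = dict(primary_required)
    for v in reversed(order):
        for u in fanin.get(v, ()):
            candidate = required[v] - delay[(u, v)]
            required[u] = min(required.get(u, candidate), candidate)
    return required


# Hypothetical network: g1 = f(a, b), g2 = f(g1, a); g2 is the primary output.
order = ["a", "b", "g1", "g2"]
fanin = {"g1": ["a", "b"], "g2": ["g1", "a"]}
delay = {("a", "g1"): 1, ("b", "g1"): 1, ("g1", "g2"): 2, ("a", "g2"): 1}
arrival = arrival_times(order, fanin, delay, {"a": 0, "b": 2})
required = required_times(order, fanin, delay, {"g2": 6})
slack = {v: required[v] - arrival[v] for v in order}
print(arrival, required, slack)
```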
Chapter 2
Delay Optimization with Tree Covering
2.1 Introduction
[Figure 2.1: examples of rewrite rules]
minimum cost equivalent to a given Boolean network. The simplifying assumption we make
about the technology mapper is that the structure of the network of gates it computes is
derived from the structure of the Boolean network it takes as input. The technology mapper
is not expected to modify the structure of the network drastically: such transformations are
best left to the technology independent phase, not because doing so yields better results,
but because it corresponds to a natural separation of concerns, and a simpler overall design.
The first technology mappers were rule-based (LSS [13], SOCRATES [19]), i.e.
based on local transformations called rules. Examples of rules are given in Figure 2.1.
Rules can be used to implement unimplemented logic, or simply to improve the quality of
an already implemented set of gates. Rule-based systems are quite flexible and can
generate circuits of good quality in terms of either area or delay. They can also be used in
a postprocessing phase to improve the quality of a circuit generated by other algorithms.
Unfortunately, rule-based systems suffer from two severe limitations: rules are library
dependent and only local optimizations are possible in polynomial time.
An attractive alternative to rule-based technology mappers is the use of a
divide-and-conquer strategy, where the network is partitioned into the largest pieces that can be
handled with efficient, optimum, library independent algorithms. The most important of
these algorithms is tree covering. Given a tree, expressed with simple primitives (e.g.
2-input NAND gates and inverters), tree covering can find a minimum cost covering of the
tree by gates of a library. This covering can then be extracted to form an implementation of
the tree. This implementation is not in general a minimum cost implementation of the tree
because it is a function of the decomposition of the tree into primitives, and also because
in some cases the minimum cost solution cannot be expressed as a tree cover (e.g. when a
minimum delay solution requires the introduction of buffers between gates, or when the use
of Boolean identities is required to identify a match). However, tree covering usually yields
good quality implementations, and has the merit of being very fast.
Tree covering was originally developed for code generation in retargetable
compilers [1]. It was first introduced in the context of technology mapping by Keutzer [24] for
producing minimum area implementations. There have been some earlier attempts to extend
tree covering to produce minimum delay implementations, but these techniques were either
ad hoc [26] or were not implemented [36]. In this chapter we introduce an elegant solution
to the problem of optimal tree covering for delay, and present and discuss experimental
results comparing this solution with simpler but suboptimal techniques.
Optimal tree covering can be solved in time linear in the number of nodes in the
tree. The complexity of tree covering as a function of the number and size of the gates in
the library depends on which tree pattern matching algorithm is used. The fastest pattern
matching algorithms build an automaton that summarizes all possible patterns matched
by gates in the library. At the cost of some preprocessing time, these algorithms avoid
having to traverse the same substructures several times. A good overview of tree pattern
matching algorithms can be found in [20]. The fastest algorithms are based on a bottom-up
traversal of the trees, but they have the largest space requirements. An interesting
way to reduce this space overhead can be found in [11]. No technology mapper has been
based on bottom-up tree covering. The second fastest algorithms are based on top-down
string matching techniques [1]. These algorithms were used in Dagon [24]. In contrast, the
misII technology mapper [14] uses a more conventional but slower matching algorithm,
that simply enumerates all patterns and requires little preprocessing. For simplicity, we use
the slower matching algorithm used in misII. The choice of the matching algorithm has no
influence on the quality of the final result and thus has no influence on our experimental
results.
The rest of this chapter is organized as follows. In section 2.2 we review the basic
tree covering algorithm for minimum area and show how it can be directly extended to
minimize delay if load values are supposed to be constant. Then in section 2.3 we show
how to extend this algorithm to take exact load values into account. This extension was
originally suggested by Rudell [36] and implemented by us [42] using load discretization.
We present a new method that relies on a functional representation of achievable arrival
times at a node v as a function of the load at the output of v. In section 2.4 we review
three main factors that limit the optimality of tree covering for delay, and discuss how these
limitations could be overcome. In section 2.5 we present several techniques that can be used
to reduce the area of a tree cover under a delay constraint. One of these techniques consists
in extending tree covering further to directly minimize area under a delay constraint, at the
expense of more computation time. We present an adaptive time discretization algorithm
to perform this task. This algorithm was already described in [36, 42]. Finally we present
and discuss our experimental results in section 2.6 and summarize the main results of this
chapter in section 2.7.
• Given a tree T and a set of tree patterns P, representing the gates of a library, and
a cost associated with each gate,
To allow the use of tree pattern matching algorithms, we need to represent both the tree
and the logic functions associated with the gates in a common set of primitives or base
functions. In misII and DAGON, this set of primitives is composed of only two types
of elements: 2-input NAND gates and inverters. misII also adds extraneous inverters to
allow the pattern matching algorithm to choose in which phase subfunctions should be
implemented. An example of such a decomposition for a NOR gate and an AOI21 gate is
given in figure 2.2. The network is decomposed in a similar fashion into primitive gates.
[Figure 2.2: Example of Decomposition into Primitive Gates]
We suppose that we have at our disposal an algorithm to enumerate all matching
tree patterns m at a given node v of tree T. We call this set match(v, P). Associated with a
given pattern m in match(v, P), we have a gate g(m) and a list of nodes $v_i \in \mathrm{inputs}(m)$, which
correspond to the nodes of T matching the inputs of m. An example of such a matching is
given in figure 2.3.
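A simple enumerating matcher in the spirit of match(v, P) can be sketched as follows. This is not the misII or DAGON implementation: the tuple encoding of trees, the three-gate library and the function names are assumptions made for illustration. Unlike the optimization discussed later in this chapter, the sketch does not prune equivalent pin assignments, so a symmetric gate is reported once per input ordering.

```python
def match_pattern(pattern, node):
    """Return every way `pattern` matches at `node`, as dicts {pattern input: subtree}.

    Trees and patterns are nested tuples ("NAND", left, right) or ("INV", child);
    a string is a leaf of the tree or an input variable of the pattern.
    """
    if isinstance(pattern, str):                 # a pattern input matches any subtree
        return [{pattern: node}]
    if isinstance(node, str) or pattern[0] != node[0]:
        return []
    if pattern[0] == "INV":
        return match_pattern(pattern[1], node[1])
    bindings = []
    for left, right in ((node[1], node[2]), (node[2], node[1])):  # NAND is commutative
        for a in match_pattern(pattern[1], left):
            for b in match_pattern(pattern[2], right):
                bindings.append({**a, **b})
    return bindings


def match(node, library):
    """All (gate name, binding) pairs matching at `node` for a dict of named patterns."""
    return [(name, b) for name, pat in library.items() for b in match_pattern(pat, node)]


# Hypothetical library expressed in NAND2/INV primitives.
LIBRARY = {
    "INV": ("INV", "a"),
    "NAND2": ("NAND", "a", "b"),
    "AND2": ("INV", ("NAND", "a", "b")),
}
tree = ("INV", ("NAND", "x", ("INV", "y")))
for gate, binding in match(tree, LIBRARY):
    print(gate, binding)
```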
In general, the cost of a match depends on some contextual information that is
a function of both the children and the parent nodes of a given node v of the tree. While
for area minimization the cost of a match only depends on the children nodes, for delay
minimization, or area minimization under a delay constraint, the cost function also depends
on information provided by the parent node (load values, required times). We present here
a general formulation of the cost function that covers all three types of optimization.
To formulate this cost function, we use a generic variable I to denote the contextual
information derived from the parent nodes. The minimum cost tree cover at a node v,
cost(v, I), is the minimum cost of a cover of the subtree rooted at v for a given I.
To evaluate the cost of a pattern m in match(v, P) we need a cost function
cost(m, I) that depends on the cost of the gate g(m) associated with the pattern and the
costs $\mathrm{cost}(v_{m,i}, I_{m,i})$, where the nodes $v_{m,i} \in \mathrm{inputs}(m)$ are the nodes in the tree matching
the inputs of m, and $I_{m,i}$ is the contextual information derived from I that is propagated
through m to node $v_{m,i}$. For example, $I_{m,i}$ may be the load of gate g(m) at input pin i, in
which case it is independent of I; or it can be the required time at $v_{m,i}$ derived from the
required time at v.
[Figure 2.3: example of a match at a node of the tree]
procedure min_cost(v, P)
    for w in children(v) do min_cost(w, P) od;
    for m in match(v, P) do
        cost(m, I) := cost(g(m), [cost(v_{m,i}, I_{m,i}), v_{m,i} ∈ inputs(m)]);
    od;
    cost(v, I) := min_{m ∈ match(v, P)} cost(m, I);
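For the simplest case, where the cost of a match is the gate cost plus the costs of the subtrees at its inputs and no contextual information I is needed, the procedure can be sketched in a few lines. The sketch assumes the matches at each node are already enumerated and stored in a dictionary, and it recurses through the inputs of each match with memoization rather than visiting children(v) first; both choices, the names and the gate costs are assumptions made for illustration.

```python
def min_cost(v, matches, gate_cost, best):
    """Minimum cost of a cover of the subtree rooted at v.

    matches[v] lists (gate, input_nodes) pairs; nodes without an entry are leaves.
    best[v] records (cost, gate, input_nodes) for the cheapest match found at v.
    """
    if v not in matches:          # leaf of the tree: nothing to implement
        return 0
    if v not in best:
        candidates = []
        for gate, inputs in matches[v]:
            total = gate_cost[gate] + sum(
                min_cost(w, matches, gate_cost, best) for w in inputs
            )
            candidates.append((total, gate, inputs))
        best[v] = min(candidates, key=lambda c: c[0])
    return best[v][0]


# Hypothetical tree r = INV(t), t = NAND(x, y), with hypothetical gate areas.
matches = {
    "t": [("NAND2", ["x", "y"])],
    "r": [("INV", ["t"]), ("AND2", ["x", "y"])],
}
gate_cost = {"INV": 2, "NAND2": 3, "AND2": 4}
best = {}
print(min_cost("r", matches, gate_cost, best), best["r"][1])
```

The cover itself is recovered by following best[v] from the root through the inputs of each selected match.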
In the case of delay minimization, if we ignore the differences between loads and use a
nominal load value $\gamma_0$ at the output of each gate, there is no contextual information to
propagate either. In that case the cost function arrival(m) = cost(m, I) becomes:
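Under the linear delay model of section 1.2.3, a constant-load recurrence consistent with the definitions above can be sketched as follows. The indexing of the coefficients by pin i and gate g(m), and the convention that arrival times at the leaves are given, are assumptions.

```latex
\mathrm{arrival}(m) \;=\; \max_{v_{m,i} \,\in\, \mathrm{inputs}(m)}
  \Bigl( \mathrm{arrival}(v_{m,i}) + \alpha_{i,g(m)} + \beta_{i,g(m)}\,\gamma_0 \Bigr),
\qquad
\mathrm{arrival}(v) \;=\; \min_{m \,\in\, \mathrm{match}(v,P)} \mathrm{arrival}(m)
```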
However, if we want to take actual load values into account, cost(m, I) and cost(v, I) are
non-constant functions of I. We will see next how they can be computed efficiently.
To compute an exact minimum delay cover we need to take load values into
account. Load information is propagated top down, from the root to the leaves of the tree.
Therefore load values have to be represented as contextual information. In that case the
cost function depends on the load $\gamma$ at the output of a node:
Cost functions are now dependent on the load $\gamma$. We need to find a representation
of these functions that allows us to compute their minimum as in Figure 2.4. We propose here
three different representations. The first two use staircase functions, the third piece-wise
linear functions.
The simplest non-constant representation of the cost functions is to use staircase
functions, where the limits of the intervals are fixed and independent of the nodes. This is
simple to implement but relatively inaccurate unless the discretization intervals are made
fairly small, which is computationally expensive. This method was originally suggested by
Rudell [36].
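The loss of accuracy of a fixed discretization is easy to see on a small sketch. It assumes the cost function is tabulated at a fixed list of interval limits and that a load is rounded up to the next limit, which is conservative for an increasing function; the rounding rule, the names and the numbers are assumptions, not a description of the method of [36].

```python
from bisect import bisect_left


def staircase(f, limits):
    """Tabulate f at fixed interval limits and return a lookup that rounds the load up.

    Loads beyond the last limit are clamped to it, so the table must cover the
    range of loads that can actually occur.
    """
    table = [f(x) for x in limits]

    def lookup(load):
        k = min(bisect_left(limits, load), len(limits) - 1)
        return table[k]

    return lookup


def exact(load):
    """Hypothetical exact cost function: a linear delay 1.0 + 0.5 * load."""
    return 1.0 + 0.5 * load


coarse = staircase(exact, [1.0, 2.0, 4.0, 8.0])
print(exact(3.0), coarse(3.0))   # the staircase overestimates between limits
```

Halving the interval width halves the worst-case error but doubles the number of values stored and minimized at every node.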
Adaptive discretization has two drawbacks: it requires the precomputation of all
possible load values at each node, which is roughly as time consuming as performing the
tree covering itself; it does not work at the root of the tree, where the number of all possible
load values grows exponentially and the range grows linearly with the number of fanouts of
the root node.
A more general and more elegant method consists in going up one level of
abstraction, by storing at each node a function that gives, for any given load value, the optimum
choice of a gate at this node. The main difficulty is to find a data structure adapted to the
representation of such a function. We first list which operations are needed on this data
structure, and then show these operations can be efficiently implemented using piece-wise
linear functions.
We have to be able to perform efficiently the following operations:
• representation of gate delays: given a gate g, matching at node v, and arrival times
$a_i$ at the inputs of the gate, we should be able to compute a function $f(\gamma)$, where $\gamma$ is
the load at the output of node v, that gives the arrival time at the output of g when
g has to drive a load of $\gamma$.
• computation of the minimum of two functions: given two functions f and g of one
variable $\gamma$ representing the load at the output of a node v, we need to be able to
compute the function $h(\gamma) = \min(f(\gamma), g(\gamma))$.
Given these three operations, we can obtain a minimum delay tree cover by applying directly
the algorithm of Figure 2.4. In the remainder of this section, we show how these three
operations can be implemented efficiently using piece-wise linear functions within our delay
model. For more complex delay models, piece-wise linear functions may not suffice. In that
case, a more complex data structure would need to be used.
for $1 \le i \le n - 1$, $y_{i+1} = y_i + (x_{i+1} - x_i) s_i$. All segments but the last one are finite: the
right bound of the $i$th segment for $i < n$ is given by $x_{i+1}$.
Finding th e O ptim um Solution By computing the m inimum of all the functions rep
resenting the arrival times for all gates matching at a node v, we obtain a piece-wise linear
function / representing the best arrival times realizable at v for any load value. For a
given load value 7 , we simply perform a binary search on the array of tuples (X i,y i,S i,p i)
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
C H A P TE R 2. D E L A Y O P TIM IZATIO N W IT H T R E E CO VERING 18
representing / to identify the segment to which 7 belongs. Once this segment is identified,
we use the pointer pi to retrieve the gate th a t realizes the minimum.
The actual implementation is slightly more complex than what we have just described,
because a gate may match at a node in several different ways, and we need to keep
track with p_i not only of a choice of a gate but also of a choice of a matching.
The initial decomposition of a tree into simple primitives (i.e. 2-input NAND gates
and inverters) is required before tree covering can be used. The principle of tree covering
is to decompose the target tree and gates into the same set of primitives so that functional
matching is replaced by the simpler problem of pattern matching. Since gates represent
small logic functions, it is feasible to enumerate all patterns that represent a gate. This
number can be reduced further by exploiting the symmetry of some inputs.
Unfortunately, for the tree itself, it is not practical in general to enumerate all
possible decompositions into simple primitives. We choose one decomposition, and the result
depends in general on the quality of this decomposition. Ideally, an adequate choice should
be made as a function of the arrival times at the leaves of the tree. In the absence of this
information, we simply generate a balanced decomposition of the nodes of the tree. This
is not in general the best choice. Before we can discuss better decompositions, we need to
cover the material of chapter 4. We will come back to this problem in chapter 5.
During tree matching, the symmetry of gate inputs is exploited to reduce the computational
overhead of the algorithm. All patterns that a gate can match are enumerated,
but all possible assignments of gate inputs to pattern inputs are not. When the inputs
are equivalent, only one assignment is tried. For area minimization, logically equivalent pin
assignments all have the same cost, so there is no need to consider more than one. However,
for delay minimization, this is not the case, since the delay through a gate is pin dependent
in general.
For example, a 3-input NAND gate is represented by only one pattern, but has
3! = 6 possible pin assignments. Only one out of the six possible pin assignments is tried by
the tree pattern matching algorithm. This can lead to suboptimal choices. In the example
of Figure 2.5, the critical input is assigned the slowest pin v_1. For any preassigned choice
of input ordering, there is always a simple circuit configuration for which this choice is
suboptimal. We propose in this section a simple algorithm to correct this problem.
[Figure 2.5: decompose and match steps, with the critical input marked. Gate delay (pin to
output) of the pattern: v1 → f = 1.2 ns, v2 → f = 1.1 ns, v3 → f = 1.0 ns.]
To optimize pin assignment during tree covering, we can proceed as follows. First,
in a preprocessing step that can be done once and for all on the library, we identify the
sets of pins that are functionally equivalent and therefore interchangeable. Then, for each
gate during pattern matching, we consider each of the equivalent sets of pins separately and
reorder them in order to decrease delay.
If we use constant load values, the problem of optimal pin assignment can be solved
in time O(n log n) where n is the number of pins in the set. In that case the problem can
be reduced to the following discrete optimization problem: if a_i is the arrival time at node
v_i and d_j the delay through the gate from pin j, the problem is to find an assignment of
nodes to pins that minimizes the total delay. Let Σ_n be the set of permutations of a set of n
elements, and for σ ∈ Σ_n, let σ(i) be the image by the permutation σ of element i. The pin
assignment problem can then be formulated as the following discrete optimization problem:
$$\min_{\sigma \in \Sigma_n} \; \max_{1 \le i \le n} \left( a_i + d_{\sigma(i)} \right)$$
An optimal solution can be computed by ordering the a_i and the d_j by increasing size and
using the permutation σ(i) = n − i + 1. However, if we take pin load values into account,
the problem has the following form, where a_i denotes the arrival time at node v_i and α_j
and γ_i are delay coefficients derived from the delay model:
This problem can still be solved optimally by dynamic programming in time O(2^n). Though
exponential, this algorithm is still practical for most libraries since the number of equivalent
pins of any given gate usually does not exceed eight in CMOS technology.
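As an illustration of the constant-load case, the sketch below pairs the earliest arrival with the slowest pin and the latest arrival with the fastest pin, then checks the result against exhaustive enumeration. The function names and the sample arrival times are assumptions made for the example; the pin delays are those shown in Figure 2.5.

```python
from itertools import permutations

def assign_pins(arrivals, pin_delays):
    """Constant-load pin assignment in O(n log n).

    Returns (pin_of, worst) where pin_of[i] is the pin given to node i
    and worst is the resulting arrival time at the gate output.
    """
    n = len(arrivals)
    nodes = sorted(range(n), key=lambda i: arrivals[i])                 # earliest first
    pins = sorted(range(n), key=lambda j: pin_delays[j], reverse=True)  # slowest first
    pin_of = [0] * n
    for i, j in zip(nodes, pins):
        pin_of[i] = j
    worst = max(arrivals[i] + pin_delays[pin_of[i]] for i in range(n))
    return pin_of, worst

def brute_force(arrivals, pin_delays):
    """Reference answer: try all n! assignments (only sensible for small n)."""
    n = len(arrivals)
    return min(
        max(arrivals[i] + pin_delays[perm[i]] for i in range(n))
        for perm in permutations(range(n))
    )

# Pin delays of Figure 2.5 (ns); the arrival times are made up for the example.
pin_delays = [1.2, 1.1, 1.0]
arrivals = [5.0, 1.0, 3.0]
pin_of, worst = assign_pins(arrivals, pin_delays)
```

Here the critical input (arrival 5.0) receives the fastest pin, which is the opposite of the fixed assignment criticized in the example of Figure 2.5.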
So far we have ignored the fact that in our delay model we distinguish between rise
and fall delays. This means that arrival times are characterized by a pair of real numbers
(a_r, a_f) instead of a single number. To decide which of two solutions is better, we need to
decide which of two pairs of arrival times is "faster". We use the following criterion to make
this decision:
This selection is not guaranteed to be optimal in general but works well in practice. It
outperforms the other obvious choice:
We can also use the dynamic programming algorithm of Figure 2.4 to find a minimum
area cover under a delay constraint. The delay constraint is expressed as a required
time at the root of the tree, and is propagated down the tree as a contextual value, together
with load values. In that case the cost function is of the form cost(m, γ, r) where γ
represents the load and r the required time at the output of a node:
$$\mathrm{arrival}(m, \gamma, r) = \max_{v_{i,m} \in \mathrm{inputs}(m)} \left( \alpha_{i,g(m)} + \beta_{i,g(m)}\,\gamma + \ldots \right)$$
where r_{i,m} is the required time at input node v_{i,m} propagated from required time r at node
v through gate g(m). To minimize area under a delay constraint, we take into account in
the cost function both the area and the slack (the slack is the difference between required
time and arrival time). At each intermediate node the minimum area solution is selected if
it meets the delay requirement (that is, if its slack is non-negative); otherwise, the minimum
delay solution is chosen.
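The selection rule at an intermediate node can be sketched as below. The dictionary fields `area` and `arrival`, the function name `select_match` and the two sample candidates are assumptions made for the example; only the rule itself comes from the text.

```python
def select_match(candidates, required):
    """Pick the cheapest candidate with non-negative slack, else the fastest one.

    Each candidate is a dict with an 'area' and an 'arrival' entry;
    slack is defined as required time minus arrival time.
    """
    feasible = [c for c in candidates if required - c["arrival"] >= 0]
    if feasible:
        return min(feasible, key=lambda c: c["area"])
    return min(candidates, key=lambda c: c["arrival"])

# Two hypothetical matches at the same node.
candidates = [
    {"name": "small", "area": 2, "arrival": 5.0},
    {"name": "fast", "area": 6, "arrival": 3.0},
]
relaxed = select_match(candidates, required=5.5)["name"]
tight = select_match(candidates, required=4.0)["name"]
infeasible = select_match(candidates, required=2.0)["name"]
```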
To implement this algorithm, we need to discretize the required times. Enumerating
all possible required times at a node is not feasible in this case, because required times
depend not only on the match just above a node but on all the possible combinations of
matches above the node up to the root of the tree.
To control the run time of the algorithm, we enforce a limit on the number of
discretization intervals at each node, which is an integer-valued parameter r specified by
the user. The discretization intervals are obtained by first computing a range of interesting
values for the required time at each node and then dividing this range into equal intervals.
The range of interesting values for the required time at a node v is determined as
follows. Let γ be a possible load value at node v. Let a_γ^delay(v) be the minimum arrival
time achievable at node v with a load of γ at its output, and a_γ^area(v) the arrival time at v
of a minimum area cover of the subtree rooted at v for that same load value. Any required
time outside the interval [a_γ^delay(v), a_γ^area(v)] does not need to be considered. Indeed, if
the required time at node v is less than a_γ^delay(v), no cover of the subtree rooted at v can
meet the timing requirement; in that case, the minimum cost cover is the minimum delay
cover for this subtree. If the required time is greater than a_γ^area(v), we can just choose the
minimum area cover for this subtree.
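A possible way to build the discretization is sketched below. The function names are hypothetical, and rounding an incoming required time down to the nearest grid point is an assumption of this sketch (it errs on the side of a stricter constraint); the text only states that the range is divided into equal intervals.

```python
def required_time_grid(a_delay, a_area, r):
    """Boundaries of r equal intervals covering [a_delay, a_area]."""
    if a_area <= a_delay:
        # The minimum area cover is already the fastest one: a single point suffices.
        return [a_delay]
    step = (a_area - a_delay) / r
    return [a_delay + i * step for i in range(r + 1)]

def snap(required, grid):
    """Map a required time to a grid point.

    Values outside the range are clamped, since they behave like the
    nearest end of the range; values inside are rounded down
    (an assumption of this sketch).
    """
    if required <= grid[0]:
        return grid[0]
    if required >= grid[-1]:
        return grid[-1]
    return max(t for t in grid if t <= required)

grid = required_time_grid(a_delay=2.0, a_area=4.0, r=4)
```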
at that node, the smallest such inverter is used in place of the current one. This algorithm
is greedy in the sense that it does not necessarily make the best use of the available slack
to recover area. However, it is guaranteed never to worsen delay and, as we will see in the
results section, is quite effective in practice.
There is no reason to perform optimal gate selection on inverters only. The inverter
selection algorithm can be extended to all gates that come in different sizes and strengths
in the library. This provides a cheap and simple way to recover area after tree covering
without hampering the full power of tree covering for delay minimization. This optimization
is likely to be very effective for large libraries.
Circuits are usually not trees: they have several outputs, inputs are shared among
several functions, and there may be several paths from one node to another. To assess the
amount of improvement we can expect from the algorithms proposed in this chapter, we
need to measure their effect in isolation on trees. To that effect, we generated three Boolean
functions, nor32, balanced and unbalanced, that can be represented as trees:
• balanced, a balanced binary tree with 64 inputs. Internal nodes alternately
compute a Boolean AND or a Boolean OR, with inverters inserted randomly.
• unbalanced, an unbalanced binary tree with 32 inputs, where every node has at most
one child that is not a leaf. Internal nodes alternately compute a Boolean
AND or a Boolean OR, with inverters inserted randomly.
The results of these experiments are also sensitive to the gate libraries being used. To take
this effect into account, we performed our experiments with four different CMOS standard
cell libraries of various origins:
• MCNC, a public domain library available from MCNC. It is distributed with the IWLS'89
benchmark suite [32] under the name lib2. It is composed of 29 gates. Inverters appear
in 3 different strengths; all the other gates in one strength only. Gate delay
information is pin dependent.
• CMOS12 is a library from AT&T. It is composed of 189 gates, mostly AOI and OAI
gates. Only a few gates appear in different strengths. Gate delays are only available for
the slowest pins; all pins are assigned the worst case delay, which is too conservative.
In the following sections, we report and discuss results by library. We provide detailed
experimental results for the first two libraries, and only summary information for the
remaining two. For clarity, we use the following acronyms to refer to the various tree covering
algorithms studied in these experiments:
• MDCL refers to minimum delay tree covering using a constant load value.
• MDCLIS refers to minimum delay tree covering using a constant load value, followed
by optimum inverter selection. Inverter selection is done using the algorithm of
section 2.5.
• MDEL refers to minimum delay tree covering using exact load values.
• MDELIS refers to minimum delay tree covering using exact load values, followed by
optimum inverter selection.
Since MDCL is not optimum for delay, MDCLIS can outperform MDCL both in terms of area
and delay. On the other hand, MDEL being optimum for delay, MDELIS can only improve
over MDEL in terms of area.
2.6.1 Results with the MCNC Library
In table 2.1 we show the results obtained by using minimum area tree covering,
minimum delay tree covering, minimum delay tree covering with constant loads and tree
covering for minimum area under a delay constraint, where the delay constraint is the
minimum delay achievable by tree covering.
The results indicate that MDEL tree covering does not lead to a significant delay
reduction over MDCL for this library: only 1%. However, the reduction in area is more
substantial: 15%. This can be explained by the fact that using constant load values in MDCL
underestimates the cost of stronger and larger gates, which have higher input loads. The
penalty of using gates stronger than the optimum has a first order effect in terms of area.
In terms of delay, a higher input load is partially compensated by a stronger drive capability,
leading to a second order effect on delay. MADC tree covering does not lead to consistent
results. This is due to the fact that MADC relies on an older implementation than MDEL, one that
discretizes load values instead of using piece-wise linear functions. Discretization of arrival
times is another source of inaccuracy. Overall MADC reduces area by 6% and increases delay
by 1% relative to MDEL.
Table 2.2 shows the effect of optimal inverter selection when used after tree covering.
Inverter selection has no effect on trees built with MDEL. However, it noticeably improves
the quality of the covers obtained with MDCL. The benefit of MDELIS over MDCLIS is only
7% in area for no gain in delay, down from 15% and 1% respectively for MDEL vs. MDCL.
[Table legend. MDCLIS: Minimum Delay with Constant Loads, with optimum Inverter Selection;
MDELIS: Minimum Delay with Exact Loads, with optimum Inverter Selection;
area: total cell area; delay: measured in nanoseconds.]
For one example, MDCLIS actually outperforms MDELIS in terms of delay. This anomaly
can be explained by the difficulty of handling rise and fall delays optimally, as explained
in section 2.4.3. The anomaly disappears if we modify the library to make all rise and fall
delays equal, or if we modify the comparison function used to compare pairs of rise and fall
arrival times.
We repeated the previous experiments, using the CMOS12 library. The results are
reported in table 2.3 and table 2.4. The results are essentially similar to those reported in
the previous section. MDCL tree covering increased area by 355% for a decrease in delay of
30% over MA. On average, MDEL tree covering outperformed MDCL by 13% in area and 4% in
delay. Again, using exact load values during delay minimization had some effect on area but
only a second order effect on delay. MADC tree covering obtained more satisfactory results
with this library. Compared with MDEL tree covering, it achieved a reduction in area of 17%
for an increase in delay of 4%.
[Table legend. MDCLIS: Minimum Delay with Constant Loads, with optimum Inverter Selection;
MDELIS: Minimum Delay with Exact Loads, with optimum Inverter Selection;
area: total cell area; delay: measured in nanoseconds.]
For this library, inverter selection also improves the quality of MDEL tree covers in
terms of area, by 8%. The effect of inverter selection on MDCL tree covers is more significant,
to the point that MDCLIS tree covering actually outperforms MDELIS in terms of area by 8%
for a cost in delay of only 1%.
We repeated the same experiments with library LIBRARY3. MDEL tree covering
outperformed MDCL by 10% in area and 1% in delay. After inverter selection, the advantage
was only 4% in area and 1% in delay for MDELIS over MDCLIS. Inverter selection reduced
the area of MDEL covers by 7%.
We repeated the same experiments with library LIBRARY4. For that library, which
is much richer than LIBRARY3 in terms of number of inverters, MDEL tree covering outperformed
MDCL by 56% in area and 24% in delay. After inverter selection, the advantage was
reduced to 21% in area and 8% in delay for MDELIS over MDCLIS. Inverter selection improved
2.7 Conclusion
The main conclusion from these experiments is somewhat disappointing: taking load
values into account during minimum delay tree covering (method MDEL) does not lead to a
very significant decrease in delay over the simpler tree covering method that uses constant
loads (method MDCL). The main advantage of MDEL over MDCL is actually more in terms of
area. By ignoring the effect of larger input loads, MDCL tends to favor gates that are larger
than necessary. Choosing larger gates than necessary has a direct effect on area but only a
second order effect on delay, since the cost of higher input loads is partially compensated
by an increase in drive capability.
Another interesting experimental result is that most of the advantage of using
MDEL over MDCL can be obtained by the use of optimal inverter selection after tree covering.
Overall, MDEL remains the best tree covering algorithm for delay, but using MDCL followed
by optimal inverter selection is a very attractive choice if simplicity of implementation and
CPU time are an issue. It will be interesting to see the effect of extending optimal inverter
selection to optimal gate sizing as a postprocessing phase after tree covering.
Minimizing area under a delay constraint is not very effective and is computationally
expensive. We strongly recommend a divide and conquer approach to this problem:
first, use a fast tree covering algorithm that minimizes delay only, to obtain a minimum
delay solution. Then, in a postprocessing phase, use either an inverter selection algorithm
or a gate sizing algorithm to recover area at no delay cost. This approach to area recovery
is suboptimal, but leads rapidly to good quality circuits. If area is more of a concern, it
is always possible to use minimum area tree covering, possibly as a postprocessing phase
after delay optimization, on parts of the tree that are not time critical. Trying to find a
good cover under an area and a delay constraint is too complex and time consuming: it
is more efficient to start with a minimum delay cover, and modify it in a postprocessing
optimization phase to recover area in a greedy fashion whenever possible.
Chapter 3
Fanout Optimization
3.1 Introduction
this level.
The basic optimization techniques on which fanout optimization relies (buffering,
gate resizing, critical signal isolation) are not new. There is a vast literature on timing
optimization techniques that covers these optimizations [34, 17, 21, 4, 13]. What is original
in the fanout optimization approach originally due to Berman, Carter and Day [5] is the
idea of combining these techniques into a single algorithm. The main limitation of Berman's
work is that it did not propose a very practical approach to apply a fanout algorithm to an
entire circuit. It turns out that we can use a simple technique due to Hoover, Klawe and
Pippenger [22] to solve this problem, as suggested by Fishburn [15]. We have extended this
technique to recover area after delay minimization.
Fanout optimization can also be used to enforce fanout constraints imposed by
a technology. Though in this work we ignore fanout constraints or load limitations, they
can be handled by a simple modification of our algorithms, for example by modifying the
cost functions to make gates infinitely slow as soon as their load constraints are violated.
Another reason for using fanout optimization to enforce load limitations is to control the
accuracy of gate-level delay models. The main source of inaccuracy of these models is the
presence of large capacitive loads at gate outputs.
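The modification mentioned above, making a gate infinitely slow once its load limit is exceeded, could look like the following sketch. It assumes a linear delay model (intrinsic delay plus a load-dependent term); the function name and the `max_load` parameter are hypothetical.

```python
import math

def gate_delay(intrinsic, drive, load, max_load=math.inf):
    """Delay through a gate under an assumed linear model: intrinsic + drive * load.

    A gate whose load limit is violated is reported as infinitely slow,
    so that any delay-minimizing algorithm discards it without special casing.
    """
    if load > max_load:
        return math.inf
    return intrinsic + drive * load

within_limit = gate_delay(1.0, 0.5, 4.0, max_load=8.0)
over_limit = gate_delay(1.0, 0.5, 4.0, max_load=3.0)
```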
In some situations, fanout circuits with reconvergence can yield faster circuits than
fanout trees, e.g. when the loads at the sinks are very high, such as for the output pads
of a chip. These situations can be handled by tree-based fanout algorithms if we replace
sinks with large loads by a number of virtual sinks with smaller loads before applying
the algorithm. In this work we suppose that large loads are split among several virtual
destinations or handled by special purpose circuitry, and we only consider fanout circuits
that have no reconvergence, i.e. that are trees.
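Splitting a heavily loaded sink into virtual sinks can be sketched as follows. The equal split, the function name and the `max_load` threshold are assumptions made for the example; the text does not say how the load is divided.

```python
import math

def split_sink(load, required, max_load):
    """Replace one sink by virtual sinks whose loads do not exceed max_load.

    The load is divided equally (an assumption of this sketch) and every
    virtual sink inherits the required time of the original sink.
    """
    parts = max(1, math.ceil(load / max_load))
    return [(load / parts, required)] * parts

pad = split_sink(load=12.0, required=5.0, max_load=4.0)
small = split_sink(load=1.0, required=2.0, max_load=4.0)
```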
There is an interesting duality between tree covering and fanout optimization, as
can be seen in figure 3.1. For delay minimization, tree covering aims at minimizing the
arrival time at the root of a tree given arrival times at its leaves, while fanout optimization
aims at maximizing the required time at the root of a fanout tree given required times at
its leaves. In terms of complexity, fanout optimization is, for all but the simplest delay
models, NP-complete. But we can make, from an optimum fanout algorithm, a one pass
algorithm that optimizes the fanout problems of a circuit and yields a minimum delay
implementation, in the sense that there is no way to modify one or more of the fanout trees
of this implementation in order to decrease delay. In contrast, tree covering itself is of linear
[Figure 3.1: tree covering and fanout optimization, drawn between the PRIMARY INPUTS and
the PRIMARY OUTPUTS of a circuit.]
area, and can be applied to any fanout tree, independent of its origin. In section 3.7 we
show how we can apply a fanout algorithm to an entire circuit and explain in what sense
the resulting circuit implementation is optimal with respect to fanout optimization. This
technique can also be used in a postprocessing phase to recover wasted area in parts of the
circuit that are not critical for delay. Finally, in section 3.8 we present our experimental
results, and in section 3.9 we summarize the main results of this chapter.
Fanout Problem for Minimum Delay
• Given a library C of buffers and inverters, and for each b ∈ C its input load γ_b, its
load dependent delay β_b and its intrinsic delay α_b
• Find a tree of buffers and inverters that distributes the signal X to all the sinks and
maximizes the required time at the source.
The complexity of the fanout problem depends on the delay model. For a very
simple delay model, under which the delay through a buffer is constant and equal to 1 and the
fanout is constrained to be less than some constant value k, the fanout problem for delay can
be solved in linear time using a technique called combinational merging [18]. Unfortunately,
as soon as the delay model takes load values into account, even if all required times are equal
and only one type of buffer is used, the fanout problem for delay becomes NP-complete.
There is thus little hope of finding a polynomial time algorithm to solve the fanout problem
optimally with delay models of the level of complexity of the ones commonly used for CMOS
standard cells.
Berman et al. proved that the fanout problem for minimum area under a delay
constraint is NP-complete under a simple delay model where gates are represented by a
finite number of virtual gates with fixed fanout and constant delay. Unfortunately, the
proof of this result is not very satisfactory since it requires the existence of buffers in an
unrealistically wide range of sizes and delays.
In this section we present a few complexity results for the fanout problem under
various delay models. To keep things simple, we ignore the issue of phase assignment: we
suppose that all sinks require the signal with the same polarity as produced by the source and
that only buffers are available in the library. For a simple delay model, the fanout problem
can be solved optimally in time O(n log n) using combinational merging. We present this
algorithm in section 3.3.1. We use combinational merging as a heuristic in one of our fanout
optimization algorithms, but it is not optimum in general for more complex delay models.
In section 3.3.2 we study the fanout problem for a slightly more complex delay model, for
which we only have partial results. In section 3.3.3, we show that if we allow non-constant
load values at the sinks, the fanout problem for delay becomes NP-complete. This is the
first complexity result for the fanout problem for minimum delay without the addition of
an area constraint. Finally, in section 3.3.4 we briefly review Berman's complexity result.
This overview is unfortunately not complete. Many complexity issues are left
unresolved. However, our main purpose is to provide solid evidence that the fanout problem is
difficult to solve exactly for complex delay models, in order to justify our use of approximate
algorithms. In that sense, our goal has been achieved.
In the constant delay model, the library contains only one buffer. The delay
through this buffer is constant and equal to one delay unit when the fanout does not exceed
some threshold k, and is infinite otherwise. For that delay model, the combinational merging
algorithm, due to Golumbic [18], finds a minimum delay fanout tree in time O(n log n) where
n is the number of sinks. Moreover, this tree is guaranteed to be of minimum area among
all trees of finite delay. The algorithm is outlined in figure 3.2.
[Figure 3.2]
Let S be a set of nodes with an individual required time associated with each node.
Initially S contains all the sinks with their required times.
Sort S by decreasing required times.
While |S| > 1 {
    At the first iteration, r = |S| mod (k − 1). If r < 2, r = r + (k − 1)
}
The only remaining node in S is the root of an optimal fanout tree.
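The loop body of the outline did not survive in this copy, so the sketch below fills it in with the usual Huffman-like merge: the r nodes with the latest required times are placed under a new buffer whose required time is the earliest of theirs minus one delay unit. That merge rule, the choice r = k after the first iteration, and the function name are assumptions of this sketch, not a transcription of figure 3.2.

```python
import heapq

def combinational_merging(required_times, k):
    """Required time at the root of the fanout tree built by repeated merging.

    Constant delay model: every buffer has delay 1 and drives at most k fanouts.
    """
    if k < 2:
        raise ValueError("a buffer must be allowed at least two fanouts")
    heap = [-t for t in required_times]  # max-heap on required times via negation
    heapq.heapify(heap)
    first = True
    while len(heap) > 1:
        if first:
            # First merge is sized so that every later merge uses exactly k nodes.
            r = len(heap) % (k - 1)
            while r < 2:  # a loop rather than a single step, so that k = 2 also works
                r += k - 1
            first = False
        else:
            r = k
        group = [-heapq.heappop(heap) for _ in range(r)]  # the r latest required times
        heapq.heappush(heap, -(min(group) - 1))           # new buffer, one unit earlier
    return -heap[0]

root_required = combinational_merging([0] * 9, k=3)
```

With nine sinks of equal required time and k = 3, the result is a balanced two-level tree, so the required time at the root is two delay units earlier than at the sinks.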
It is possible to prove that if (r_1, ..., r_n) are the required times at the sinks, the
required time at the root of an optimal fanout tree is given by the following formula [22]:
In this section we introduce a slightly more realistic delay model. The library still
only contains one buffer, but this time the delay through the buffer is equal to the number
of its fanouts. This means that the load dependent delay coefficient of the buffer is 1, its
intrinsic delay is 0, and the loads of all the gates are 1. This model is similar to but simpler
than the unit fanout delay model used in misII, which assumes a delay of 1 + 0.2 × n where
n is the number of fanouts of the buffer, i.e. it assumes an intrinsic delay of 1 and a load
Equal Required Times. We can solve the fanout problem for minimum delay exactly
for this delay model if all the required times at the sinks are equal. As we will see shortly,
even this simple case is not completely straightforward. We will make use of the following
definitions:
Definition 1 A 2-3 tree is a tree T such that any intermediate node of T has a fanout of
2 or 3.
Definition 2 Let P(T) be the set of paths from the root to a leaf in a 2-3 tree T, and let
p be such a path. Let x_p and y_p be the number of nodes of fanout 2 and 3 respectively on
that path. The weight of path p is defined as follows:
$$w(p) = 2^{x_p} \, 3^{y_p} \qquad (3.2)$$
The weight of a 2-3 tree is the maximum weight on any of its paths:
$$w(T) = \max_{p \in P(T)} w(p) \qquad (3.3)$$
The delay of a path and the delay through a tree are defined similarly:
$$d(p) = 2 x_p + 3 y_p \qquad (3.4)$$
$$d(T) = \max_{p \in P(T)} d(p)$$
Since we suppose that all required times are equal, maximizing the required time at the
source of the fanout tree is equivalent to minimizing the worst path delay through the tree,
i.e. the quantity d(T).
Definition 3 A 2-3 tree is a simple 2-3 tree if all nodes at the same level have the same
fanout. In particular, in a simple 2-3 tree, all the leaves are at the same distance from the
root.
It is easy to see that all paths of a simple 2-3 tree have equal weight, and that the number
of leaves of a simple 2-3 tree T is equal to w(T).
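The quantities defined above are easy to compute recursively, which gives a quick way to check the two observations on small examples. In this sketch a tree is represented as a nested list of children and a leaf as the empty list; that encoding and the function names are assumptions made for the illustration.

```python
# A tree is the list of its child subtrees; a leaf is the empty list [].

def leaves(t):
    """Number of leaves l(T)."""
    return 1 if not t else sum(leaves(c) for c in t)

def weight(t):
    """w(T): largest product of fanouts along any root-to-leaf path."""
    return 1 if not t else len(t) * max(weight(c) for c in t)

def delay(t):
    """d(T): largest sum of fanouts along any root-to-leaf path."""
    return 0 if not t else len(t) + max(delay(c) for c in t)

def simple_tree(fanouts):
    """Simple 2-3 tree whose level i (root first) has fanout fanouts[i]."""
    t = []
    for k in reversed(fanouts):
        t = [t] * k  # children share one subtree object; fine for read-only use
    return t

simple = simple_tree([2, 3])       # root of fanout 2 over nodes of fanout 3
lopsided = [[], [[], [], []]]      # root of fanout 2: one leaf, one node of fanout 3
```

In the simple tree the number of leaves equals the weight, while the lopsided tree has the same weight and delay but fewer leaves.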
[Figure 3.3: splitting a node of fanout f into nodes v, w1 and w2; required times r, r − 2
and r − f + 2.]
Theorem 3.3.1 Let (l_2, l_3) be a pair of integers that realizes the minimum of the quantity
2x + 3y subject to the constraint 2^x 3^y ≥ n. Let T be a simple 2-3 tree with l_2 levels of nodes
of fanout 2 and l_3 levels of nodes of fanout 3. Then T has minimum delay among all fanout
trees with n leaves or more. Its delay is given by the following expression:
Proof Let T be an optimal fanout tree. Its nodes with fanout equal to 1 can be eliminated
without modifying the fanout of other nodes and without increasing the delay through the
tree. We will show that its nodes with fanout greater than 3 can be split into nodes with
strictly smaller fanouts without increasing the delay through the tree, with the transformation
shown in figure 3.3. A node u with a fanout f of 4 or more can be replaced by
three nodes v, w_1 and w_2, such that v is directly connected to the parent of u, and has w_1
and w_2 as children. Two of the children of u are made children of w_1, while the
remaining ones are made children of w_2. If r is the earliest required time of any child of u,
the required time at u is r − f. In the worst case the earliest required time of a child of w_1
is r and the required time of w_1 is r − 2. Similarly, in the worst case the earliest required
time of a child of w_2 is r and the required time of w_2 is r − (f − 2). Since v has a fanout of 2,
the required time at v is no worse than min(r − 2, r − f + 2) − 2 = r − f, since f ≥ 4. ■
Lemma 3.3.2 shows that we can find a 2-3 tree that is an optimal fanout tree. The following
lemma shows that we can restrict our attention further to simple 2-3 trees.
Lemma 3.3.3 Let T be a 2-3 tree. There is a simple 2-3 tree T' that is at least as fast as
T and has at least as many leaves as T.
Proof Let T be a 2-3 tree and let l(T) be the number of leaves of T. The proof proceeds
in two steps. We first show that l(T) ≤ w(T) for any 2-3 tree, and then we show that for
any 2-3 tree T there is a simple 2-3 tree T' such that d(T') ≤ d(T) and w(T') = w(T). This
proves the lemma, since for any simple 2-3 tree T', l(T') = w(T').
To prove that l(T) ≤ w(T), we proceed by induction on the height of T. The
result is obviously true for trees of height zero, since the number of leaves and the weight
are both equal to 1 in that case. Let us suppose that the result is true for all 2-3 trees of
height h − 1 ≥ 0. Let (T_1, ..., T_k), with k = 2 or k = 3, be the subtrees of T that are the
fanouts of the root node of T. By induction hypothesis, l(T_i) ≤ w(T_i). Moreover we have
l(T) = Σ_{1≤i≤k} l(T_i) and w(T) = k × max_{1≤i≤k} w(T_i). Thus l(T) ≤ w(T).
We can build a simple 2-3 tree T' such that d(T') ≤ d(T) and w(T') = w(T) as
follows. Let p be a path of T such that w(p) = w(T), and let T' be a simple 2-3 tree with x_p
levels of nodes of fanout 2 and y_p levels of nodes of fanout 3. We have w(T') = w(p) = w(T)
and d(T') = d(p) ≤ d(T). ■
Proof [of theorem 3.3.1] According to the previous two lemmas, we only need to consider
simple 2-3 trees. The optimal simple 2-3 tree is the simple 2-3 tree T that minimizes d(T)
subject to the constraint that l(T) ≥ n, where l(T) is the number of leaves of tree T. For
a given simple 2-3 tree T, with x levels of nodes with fanout 2 and y levels of nodes with
fanout 3, we have: d(T) = 2x + 3y and l(T) = w(T) = 2^x × 3^y. Thus the problem of finding
an optimal simple 2-3 tree is reduced to the problem of finding a pair of integers (x, y) that
is a solution of the following discrete optimization problem:
$$\min_{x, y \ge 0} \; 2x + 3y \quad \text{subject to} \quad 2^x \, 3^y \ge n \qquad (3.6)$$
We will first show that there is always a solution of 3.6 such that $0 \le x \le 2$. For
any pair $(x, y)$ of integers, let $d(x, y) = 2x + 3y$ and $l(x, y) = 2^x \times 3^y$. If $x \ge 3$, we have:
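The displayed problem (3.6) and the step that follows "we have:" are not legible in this copy. Read from the definitions above, the problem is presumably to minimize $2x + 3y$ subject to $2^x \times 3^y \ge n$ over non-negative integers, and one exchange consistent with the claim is that replacing three fanout-2 levels by two fanout-3 levels keeps the delay and increases the leaf count: $d(x - 3, y + 2) = d(x, y)$ and $l(x - 3, y + 2) = \frac{9}{8} l(x, y)$. The Python sketch below (the function name and the `max_x` parameter are illustrative, not taken from the text) enumerates candidate pairs under that reading, so that the restriction to $x \le 2$ can be checked numerically.

```python
def best_simple_23_tree(n, max_x=None):
    """Return (delay, x, y) minimizing 2x + 3y subject to 2**x * 3**y >= n.

    x counts levels of fanout 2, y counts levels of fanout 3.
    If max_x is given, only pairs with x <= max_x are considered.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    best = None
    x = 0
    while True:
        # For a fixed x, the smallest feasible y is the best one.
        y = 0
        while 2 ** x * 3 ** y < n:
            y += 1
        candidate = (2 * x + 3 * y, x, y)
        if best is None or candidate < best:
            best = candidate
        # Beyond this point y is already 0, so larger x only adds delay.
        if 2 ** x >= n or (max_x is not None and x >= max_x):
            break
        x += 1
    return best
```

For example, `best_simple_23_tree(9)` selects two levels of fanout 3, and calling the function with `max_x=2` yields the same minimum delay as the unrestricted search on the small ranges one can enumerate.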
Arbitrary Required Times We are not aware of any polynomial time algorithm to solve
the minimum delay fanout problem for arbitrary required times under the unit fanout delay
model. We conjecture that this problem is not NP-complete and that such an algorithm
exists.
The difference between the unit fanout model of the previous section and the unit
fanout model with varying sink loads used in this section is that we now allow sink loads to
take any positive rational value. There is still only one buffer in the library, and its drive
capability is 1, its intrinsic delay 0 and its input load 1. Under this delay model, we will
prove that the fanout problem for minimum delay is NP-complete even if we restrict the
sink required times to be all equal. We will also prove that the fanout problem for minimum
area under a delay constraint is NP-complete even if we restrict sink loads to be integers.
Theorem 3.3.4 Given a fanout problem for the unit fanout model with rational sink loads
and a constant D, the following decision problem is NP-complete: is there a fanout tree
such that the delay through the fanout tree is less than or equal to D? The problem remains
NP-complete even if the required times at the sinks are all equal.
The decision problem is clearly in NP. To prove it is NP-complete, we will exhibit a
polynomial time reduction of 3-partition to it. For clarity, we restate here the 3-partition
problem [16]:
The nature of the constraints is such that if a solution exists, the sets $S_i$ contain exactly 3
elements.
A decision problem is NP-complete in the strong sense if, unless P = NP, there
is no polynomial algorithm to solve the decision problem even if we restrict the problem
to instances where the numbers appearing in an instance are bounded by a polynomial
function of the size of the instance. Equivalently, a decision problem is NP-complete in the
strong sense if it cannot be solved by a pseudo-polynomial algorithm, i.e. an algorithm that
is polynomial in the size of the instance and the magnitude of the numbers appearing in
the instance. NP-complete problems that are not strongly NP-complete derive their
NP-completeness from the presence of exponentially large numbers in the formulation of their
instances. PARTITION [16] is an example of an NP-complete problem that is not strongly
NP-complete: it can be solved in pseudo-polynomial time by dynamic programming. When
the numbers appearing in the instances of a problem are derived from finite precision
physical parameters, as is the case for fanout optimization, it is not realistic to suppose the
presence of exponentially large numbers in problem instances. In other words, to be
relevant, proofs of NP-completeness of fanout problems need to show NP-completeness in the
strong sense, by derivation from a strongly NP-complete problem. Our NP-completeness
proofs are based on 3-partition, which is one of the simplest strongly NP-complete problems
[16].
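To make the notion of a pseudo-polynomial algorithm concrete, here is a minimal Python sketch of the textbook dynamic program for PARTITION. It is an illustration only and does not come from the dissertation; the function name is hypothetical. Its running time is proportional to the number of elements times half the sum of the sizes, which is polynomial in the magnitude of the numbers but not in the length of their binary encoding.

```python
def can_partition(sizes):
    """Decide whether positive integers `sizes` split into two equal-sum halves.

    Runs in O(len(sizes) * sum(sizes)) time: pseudo-polynomial, since the
    bound depends on the magnitude of the numbers, not only on their count.
    """
    total = sum(sizes)
    if total % 2 == 1:
        return False
    target = total // 2
    # reachable[t] is True when some subset of the sizes seen so far sums to t.
    reachable = [True] + [False] * target
    for s in sizes:
        # Scan downwards so that each element is used at most once.
        for t in range(target, s - 1, -1):
            if reachable[t - s]:
                reachable[t] = True
    return reachable[target]
```

No comparable table-filling approach is expected to work for 3-partition, which is why the proofs in this section start from it.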
The proof of theorem 3.3.4 relies on the following lemma:
Proof [of theorem 3.3.4] We only need to exhibit a polynomial time reduction of
3-partition to the fanout problem for instances of 3-partition such that $3m = 3^h$. Let A
be such an instance of 3-partition. We create $3^h = 3m$ sinks, one per element of A, and
assign to the sink corresponding to element a of A the load $1 + s(a)/K$ where K is an
arbitrary integer such that $K > 3B/2$. All required times are taken to be equal and D is
set to be equal to $3h + B/K$. Clearly, this specifies a fanout problem for our delay model,
and the construction can be done in time polynomial in the size of the instance A. We need
now to prove that decision problem A is equivalent to this fanout problem.
Suppose that A has a solution. Then we can group the elements of A in triplets
$S_1, \ldots, S_m$, such that $\sum_{a \in S_i} s(a) = B$ for $1 \le i \le m$. We can then build a 3-tree with h
levels and $3^h$ leaves, such that the sinks corresponding to the elements of $S_i$ are siblings
of each other. Any node of the tree at level $h - 1$ has a fanout of 3, and the load it
drives is equal to $3 + B/K$ since $\sum_{a \in S_i} s(a) = B$. Thus the total delay through the tree is
$3(h - 1) + 3 + B/K = D$, which proves that the fanout decision problem has a solution.
Conversely, suppose that the fanout decision problem has a solution. There is then
a fanout tree whose delay is no greater than D. From lemma 3.3.2, which still holds if the
loads at the sinks are allowed to be larger than 1, we can assume without loss of generality
that this fanout tree is a 2-3 tree. Let T be the 3-tree with $3^h$ leaves and a depth of h, with
sinks allocated to its leaves in some arbitrary way. Since the load of any sink is less than
$1 + B/(2K)$, the load of buffers at level $h - 1$ does not exceed $3(1 + B/(2K)) < 3 + 1$. Thus,
the delay through T is smaller than $3h + 1$. Since no other 2-3 tree can drive $3^h$ outputs of
load 1 or more in less than $3h + 1$, T is the only possible 2-3 tree that realizes the minimum
delay fanout tree. Thus there is an assignment of sinks to leaves of T such that the delay
through T is no larger than $D = 3h + B/K$. This means that the sinks can be 3-partitioned
into triplets whose aggregate load is equal to $3 + B/K$. This is equivalent to saying that A
can be 3-partitioned. ■
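The construction used in this proof can be sketched in a few lines of Python. The function names are hypothetical, and the choice of K as the smallest integer above $3B/2$ is one arbitrary admissible value. Exact rationals are used so that the comparison with D is not blurred by rounding. The second function evaluates the delay of the complete 3-tree under the unit fanout model when consecutive groups of three sinks are made siblings.

```python
from fractions import Fraction

def fanout_instance(sizes, B):
    """Build sink loads and the delay bound D from a 3-partition instance.

    `sizes` holds the 3m element sizes s(a); the proof assumes 3m = 3**h.
    Returns (loads, D, h).
    """
    n = len(sizes)
    h = 1
    while 3 ** h < n:
        h += 1
    if 3 ** h != n:
        raise ValueError("the number of elements must be a power of 3")
    K = (3 * B) // 2 + 1          # smallest integer with K > 3B/2 (one valid choice)
    loads = [1 + Fraction(s, K) for s in sizes]
    D = 3 * h + Fraction(B, K)
    return loads, D, h

def three_tree_delay(loads, h):
    """Delay through the complete 3-tree whose leaves are `loads`, in order.

    Each of the h - 1 upper levels drives three unit-load buffers (delay 3);
    the last level of buffers drives the sinks, three at a time.
    """
    worst_triplet = max(sum(loads[i:i + 3]) for i in range(0, len(loads), 3))
    return 3 * (h - 1) + worst_triplet
```

With sizes listed triplet by triplet for a solvable instance, `three_tree_delay` returns exactly D; any grouping whose triplets do not all sum to B gives a strictly larger value.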
Theorem 3.3.7 Given a fanout problem for the unit fanout model with integer sink loads,
a delay constraint D and a constant A, the following decision problem is NP-complete: is
there a fanout tree such that the delay through the tree is less than or equal to D and its
area is less than A? The problem remains NP-complete even if the required times at the
sinks are all equal.
Proof This decision problem is clearly in NP. To prove it is NP-complete we will again
exhibit a polynomial time reduction of 3-partition to it. From a given instance A of 3-partition,
we build an instance of the fanout problem as follows: we create 3m sinks, one for each
element of A, having for loads the values $(m + 1)s(a)$. The area constraint is $A = m$ and
the delay constraint is $D = m + (m + 1)B$. All required times are taken to be the same.
To show that both problems are equivalent, we first note that if a fanout tree is such that
fewer than m gates are directly connected to the sinks, there must be at least one gate which
fans out to 4 or more sinks. Given the constraint that $s(a) > B/4$, the load at this gate
must be at least $(m + 1)(B + 1) > m + (m + 1)B$. Thus to be able to meet the delay
constraint, all m gates that the fanout tree is allowed to contain under the area constraint
must be used to drive sinks. Moreover, each of them has to drive exactly 3 sinks. In
that case, the delay constraint will be met if and only if the loads are equilibrated, that is
if and only if A can be 3-partitioned. ■
3.3.4 Berman's Delay Model
Berman et al. used a different delay model in their work [5]. In their model, gates
have a fixed fanout and delay. This is equivalent to saying that the load at the output of a
gate is equal to the number of fanouts, and the delay through a gate is a piecewise linear
function of the load, with a threshold value above which the delay through a gate becomes
infinite. Under this delay model they show that the fanout problem for minimum area
under a delay constraint is NP-complete, but their proof relies on unrealistic assumptions
on gate size and gate delay parameters, which weakens their result. More specifically, they
suppose a library containing gates with the following area and delay characteristics, where
N and K are given integer-valued parameters: a gate with delay 1, fanout limit of 1 + y
and area $(NK)^3$, and, for $i = 1, \ldots, K$, a gate with delay $2NK + i$, fanout limit 2 and area
$(NK)^2 - 3i$.
3.4.1 Notation
The buffer selection algorithm is a very rudimentary fanout optimization algorithm.
It does not build any fanout tree; it simply sizes existing buffers optimally. If a
fanout tree is implemented as a wire, this algorithm has no effect. Buffer selection can also
be used inside trees as was done in chapter 2.
For a given buffer selection b, the algorithm computes the required time at the
input of the buffer. If $r_o$ is the required time at the output of the buffer, and $\gamma_{1,n}$ is the load
at the output of the buffer, the delay through the buffer is $\alpha_b + \beta_b \gamma_{1,n}$ and the required time
at the input of the buffer is $r_o - \alpha_b - \beta_b \gamma_{1,n}$. Unfortunately this quantity does not take into
account the delay due to the input load of the buffer, which should be taken into account
to make an optimal buffer selection. So we subtract from the required time at the input of
the buffer the load dependent delay $\beta_g \gamma_b$ corresponding to the time it takes the source gate
to drive the buffer input. The algorithm selects a buffer that maximizes this quantity as
shown in Figure 3.4. The algorithm works in time $O(d + n)$ where d is the number of buffers
in the library and n the number of sinks. The computation of $r_o$ is not actually necessary
and is given in Figure 3.4 for clarity.
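Figure 3.4 is not reproduced in this copy, so the following Python sketch only illustrates the selection rule described in the paragraph above. The `Buffer` record and the function name are assumptions made for the example; `alpha`, `beta` and `gamma` stand for the intrinsic delay, the load-dependent delay coefficient and the input load of a buffer, and `beta_g` for the load-dependent coefficient of the source gate.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Buffer:
    name: str
    alpha: float   # intrinsic delay
    beta: float    # delay per unit of driven load
    gamma: float   # input load presented to the source gate

def select_buffer(library, sinks, beta_g):
    """Pick the buffer maximizing r_o - alpha_b - beta_b*gamma_1n - beta_g*gamma_b.

    `sinks` is a list of (required_time, load) pairs. Runs in O(d + n).
    """
    r_o = min(r for r, _ in sinks)                 # required time at buffer output
    total_load = sum(load for _, load in sinks)    # gamma_{1,n}

    def score(b):
        return r_o - b.alpha - b.beta * total_load - beta_g * b.gamma

    best = max(library, key=score)
    return best, score(best)
```

Because $r_o$ is the same for every candidate, it does not influence which buffer wins, which matches the remark that its computation is not actually necessary.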
One of the main reasons why fanout optimization is necessary is to reduce large
loads created by large fanouts. The simplest way to do so is to insert a two-level tree of
buffers at multiple fanout points in the circuit, where a two-level tree is defined as follows:
Definition 4 A tree is a two-level tree if any leaf of the tree is separated from the root of
the tree by exactly one intermediate node.
The two-level fanout tree algorithm which ignores required times is given in Figure 3.5 and
explained in detail in the rest of this section.
Buffer Selection A two-level fanout tree is composed of one level of intermediate buffers.
For simplicity, we enforce the restriction that all intermediate buffers are of the same
strength (or buffer type). This allows us to compute in constant time a good approximation
of the number of intermediate buffers required. This computation is done once for
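$k^b_{opt} \in \left\{ \left\lfloor \sqrt{\frac{\beta_b \gamma_{1,n}}{\beta_g \gamma_b}} \right\rfloor, \left\lceil \sqrt{\frac{\beta_b \gamma_{1,n}}{\beta_g \gamma_b}} \right\rceil \right\}$ (3.8)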
Sink Assignment The number of intermediate buffers to be used is computed by supposing
that the loads are equally divisible among all intermediate buffers. This is not the case in
general. Unfortunately, even if the number of intermediate buffers is given, assigning sinks
of varying loads to these buffers is a difficult problem. It is equivalent to multiprocessor
scheduling, which is known to be NP-complete (see [16] page 238).
To perform sink assignment, we use a simple greedy algorithm that allocates the
next sink to the intermediate buffer that has been assigned the least amount of load so
far. The best results are obtained if the sinks are sorted in order of decreasing loads. This
assignment is made for each buffer type b, but only for $k^b_{opt}$ intermediate buffers. The best
solution is then retained.
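The greedy rule for a single buffer type can be sketched as follows in Python. The function name is hypothetical, and the binary heap is a substitution made here for convenience: the text's complexity analysis assumes a linear scan over the intermediate buffers for each sink.

```python
import heapq

def greedy_assign(loads, k):
    """Assign sinks to k intermediate buffers, heaviest sink first.

    Each sink goes to the buffer carrying the least load so far.
    Returns (assignment, totals): assignment[i] is the buffer index of
    sink i, totals[j] is the final load driven by buffer j.
    """
    heap = [(0.0, j) for j in range(k)]          # (current load, buffer index)
    heapq.heapify(heap)
    assignment = [None] * len(loads)
    # Visit sinks in order of decreasing load (stable for equal loads).
    for i in sorted(range(len(loads)), key=lambda i: -loads[i]):
        load, j = heapq.heappop(heap)            # least-loaded buffer
        assignment[i] = j
        heapq.heappush(heap, (load + loads[i], j))
    totals = [0.0] * k
    for load, j in heap:
        totals[j] = load
    return assignment, totals
```

On the loads 7, 5, 4, 3, 3, 2 with two buffers this produces two groups of total load 12 each; in general the result is a heuristic and need not be balanced.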
Complexity The number of intermediate buffers is of the order of $\sqrt{n}$. The best fit
algorithm spends $\sqrt{n}$ time to determine where to assign each sink. Moreover this
computation is done once for each buffer type. Thus, in total, the complexity of this
algorithm is $O(d\, n^{1.5})$.
The previous algorithm ignores sink required times. Though it can handle nonconstant
load values, it does not perform very well in the presence of wide variations in
required times. It is possible to compensate for this deficiency by taking required times
into account during sink assignment. To do so we still use a best fit, greedy algorithm but
instead of assigning a sink to the intermediate buffer that has so far the least amount of
load to drive, we assign a sink to an intermediate buffer in such a way that the required
time at the source of the fanout tree is decreased the least by this assignment. If all required
times are equal, these two greedy algorithms produce the same result. In the preprocessing
phase, we sort the sinks in order of increasing required times, and, in case of ties, in order
of decreasing loads. The complexity of this algorithm is also $O(d\, n^{1.5})$. The algorithm is
sketched in Figure 3.6.
algorithm two_level_with_required_times
  sort the sinks by increasing required times $(r_1, \ldots, r_n)$
  compute $\gamma_{1,n} = \sum_{1 \le i \le n} \gamma_i$
  foreach $b \in \mathcal{L}$ {
    $k^b_{opt} = \sqrt{\beta_b \gamma_{1,n} / (\beta_g \gamma_b)}$
    foreach $1 \le i \le k^b_{opt}$ {
      required[i] = 0.0;
      load[i] = 0.0;
    }
    foreach $1 \le i \le n$ {
      foreach $1 \le j \le k^b_{opt}$ {
        required[i, j] = min($r_i - \beta_b$ load[j], required[j]) $- \beta_b \gamma_i$
      }
      $j_i$ = $\arg\max_{1 \le j \le k^b_{opt}}$ required[i, j]
      required[$j_i$] = required[i, $j_i$]
      load[$j_i$] = load[$j_i$] + $\gamma_i$
      assign[i] = $j_i$
    }
  }
end two_level_with_required_times
Figure 3.6: Two-Level Fanout Tree Taking Required Times into Account
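A runnable Python rendering of the inner loops of Figure 3.6, for one buffer type and a given number k of intermediate buffers, might look as follows. The function name is hypothetical. One deliberate deviation is labeled in the code: the per-buffer required time starts at plus infinity rather than 0.0, on the assumption that an empty buffer should impose no constraint; the figure's own initial value may rely on a convention not visible in this copy.

```python
import math

def assign_with_required_times(sinks, k, beta_b):
    """Greedy sink assignment that tracks required times (one buffer type).

    `sinks` is a list of (required_time, load). Returns (assign, required)
    where required[j] is the required time at the output side of buffer j,
    i.e. before subtracting the buffer's intrinsic delay.
    """
    # Increasing required time; ties broken by decreasing load.
    order = sorted(range(len(sinks)), key=lambda i: (sinks[i][0], -sinks[i][1]))
    required = [math.inf] * k      # assumption: empty buffers are unconstrained
    load = [0.0] * k
    assign = [None] * len(sinks)
    for i in order:
        r_i, g_i = sinks[i]
        # Required time of each buffer if sink i were added to it.
        cand = [min(r_i - beta_b * load[j], required[j]) - beta_b * g_i
                for j in range(k)]
        j_best = max(range(k), key=lambda j: cand[j])
        required[j_best] = cand[j_best]
        load[j_best] += g_i
        assign[i] = j_best
    return assign, required
```

The update works because sinks arrive in order of increasing required time, so adding a sink of load $\gamma_i$ lowers every earlier constraint on that buffer by exactly $\beta_b \gamma_i$.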
Optimality
Lemma 3.4.1 The greedy sink assignment algorithm is not optimal. This is still true even
if all sink loads are equal and all required times are integer valued.
given required time at the source of the fanout tree. An $O(n \log n)$ optimal algorithm can
be derived from this decision procedure by using binary search.
Lemma 3.4.2 Given k identical buffers, with drive capacity equal to 1, n sinks, with loads
equal to 1 and integer required times, and an integer constant D, the following decision
problem can be solved in linear time: is there a sink assignment such that the required time
at any of the buffers is no less than D?
If such a consecutive assignment exists, it assigns to the first buffer the sinks $(s_1, \ldots, s_{i_1 - 1})$
where $i_1$ is the first index for which the required time at the input of the first buffer becomes
less than D. It assigns the remaining sinks to the remaining buffers recursively using the
same principle. More precisely, it assigns to the b-th buffer the sinks $(s_{i_{b-1}}, \ldots, s_{i_b - 1})$ where
the finite sequence $(i_b)_{0 \le b \le k}$ is determined by the following recurrence equations:
$i_0 = 1$
The required time at the input of buffer b is given by the expression $r_{i_{b-1}} - (i_b - i_{b-1})$ since
all sink loads are equal to 1 and the drives of the buffers are all equal to 1. The answer to
the decision problem is "yes" if and only if some $i_b$ exceeds n. ■
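The consecutive-assignment test and the binary search built on top of it can be sketched in Python as below. Both function names are hypothetical. The sketch takes for granted, as the surrounding argument appears to, that it suffices to consider assignments of consecutive sinks once the sinks are sorted by required time; it sorts its input itself, so the linear-time bound of the lemma applies only to the loop over buffers.

```python
def consecutive_assignment_feasible(required_times, k, D):
    """Unit loads, unit drive: can k buffers all keep a required time >= D?

    A buffer whose earliest sink has required time r and which drives c
    sinks has required time r - c, so it can take at most r - D sinks.
    """
    r = sorted(required_times)
    n = len(r)
    start = 0                       # index of the first unassigned sink
    for _ in range(k):
        if start >= n:
            break
        capacity = r[start] - D     # largest group size this buffer can drive
        if capacity < 1:
            return False
        start += capacity
    return start >= n

def best_required_time(required_times, k):
    """Largest D for which the decision procedure answers yes (binary search)."""
    lo = min(required_times) - len(required_times)   # one buffer takes all sinks
    hi = min(required_times) - 1                     # first buffer drives >= 1 sink
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if consecutive_assignment_feasible(required_times, k, mid):
            lo = mid
        else:
            hi = mid - 1
    return lo
```

For required times 5, 5, 6, 8 and two buffers, the search settles on D = 3, obtained by grouping the two earliest sinks together.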
3.4.5 Combinational Merging
The previous two algorithms are very limited in the kind of fanout trees they can
produce. These structures are sufficient for most practical fanout sizes only if the required
times at the sinks are close to each other. This is not often the case. To be able to obtain
faster fanout trees, it is desirable to explore a larger set of fanout structures than just
two-level trees.
Combinational merging is a simple, $O(n \log n)$ algorithm which has the ability to
generate a rich set of fanout tree structures. Unfortunately, it relies on the characteristics of
a simple delay model, and needs to be adapted heuristically to a more complex delay model.
The basic step of this algorithm, illustrated in Figure 3.2, is simple. We first suppose that
the sinks have been sorted in order of increasing required times $(r_1, \ldots, r_n)$. We take a
group of sinks with the largest required times $(r_k, \ldots, r_n)$, make them the children of a new
buffer node and remove them from the list of sinks. We then compute the required time
at the new buffer node, and merge it with the sorted list of sinks. This transformation is
applied until k becomes equal to 1. In that case the source is used to drive the remaining
sinks directly, unless inserting a buffer between the source and the sinks yields a faster
circuit.
With our delay model, there are two questions to be answered before combinational
merging can be used: how k should be chosen, and which type b of buffer should be used.
We use a heuristic that computes $k = k_b$ as a function of b, and we select the best b using
algorithm bottom_up_fanout_tree_construction
  sort the sinks $s_i$ by increasing required times $(r_1, \ldots, r_n)$
  while n > 1 {
    foreach $1 \le i \le n$ compute $\gamma_{i,n} = \sum_{i \le j \le n} \gamma_j$
    foreach $b \in \mathcal{L}$ {
      $k^b_{opt} = \sqrt{\beta_b \gamma_{1,n} / (\beta_g \gamma_b)}$
      $k_b = \max \{ i : 1 \le i \le n,\ \gamma_{i,n} \ge \gamma_{1,n} / k^b_{opt} \}$
      required[b] = $r_{k_b} - \beta_b \gamma_{k_b,n} - \alpha_b - \beta_g \gamma_b$
    }
    b = $\arg\max_{g \in \mathcal{L}}$ required[g]
    k = $k_b$
    create a new buffer node v of type b
    attach the sinks $(s_k, \ldots, s_n)$ to v
    remove the sinks $(s_k, \ldots, s_n)$
some cost function. For each buffer b, we compute $k^b_{opt}$ as in equation 3.7. We take $k_b$ such
that the sum of the loads of the sinks of index $k_b$ to n just exceeds the quantity
$\gamma_{1,n} / k^b_{opt}$. We
then create a new buffer of type b, connect it directly to a new copy of the source, and
make it drive the sinks $(s_{k_b}, \ldots, s_n)$ with largest required times. We compute the required time at
this new source, and select the buffer type b that maximizes this required time, and use $k_b$
for k. This algorithm is given in Figure 3.8.
The choice of this heuristic can be motivated as follows. For a given b, we need
to determine an adequate value for k. A choice based on taking a $1 / k^b_{opt}$ fraction of the total
remaining sink load appears reasonable, since, in the case where all required times are equal,
it leads to a two-level tree that is close to the optimum fanout tree. To compute $k^b_{opt}$, we
suppose that the buffers are driven by the source gate; this is too restrictive in general and
is likely to lead to suboptimal results. Yoshikawa et al. [44] recently proposed an extension
of this algorithm, based on branch and bound techniques, that does not suffer from this
limitation.
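The merging heuristic described in this section can be sketched in Python as follows. This is an illustration under stated assumptions, not a transcription of Figure 3.8, whose end is missing from this copy: the `Buffer` record and the function name are hypothetical; every merged group is forced to contain at least two nodes so that the loop is guaranteed to terminate; and the final option of inserting a single buffer between the source and the remaining sinks is left out.

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class Buffer:
    alpha: float   # intrinsic delay
    beta: float    # delay per unit of driven load
    gamma: float   # input load

def combinational_merge(sinks, library, beta_g):
    """Return (required time at the source output, number of buffers inserted).

    `sinks` is a list of (required_time, load); `beta_g` is the source drive.
    """
    nodes = sorted(sinks)                         # increasing required time
    buffers_used = 0
    while len(nodes) > 1:
        n = len(nodes)
        suffix = [0.0] * (n + 1)                  # suffix[i] = load of nodes[i:]
        for i in range(n - 1, -1, -1):
            suffix[i] = suffix[i + 1] + nodes[i][1]
        total = suffix[0]
        best = None
        for b in library:
            k_opt = math.sqrt(b.beta * total / (beta_g * b.gamma))
            # Last start index whose suffix load still reaches total / k_opt,
            # capped so that the group holds at least two nodes (assumption).
            k = max((i for i in range(n - 1) if suffix[i] >= total / k_opt),
                    default=0)
            req = nodes[k][0] - b.beta * suffix[k] - b.alpha - beta_g * b.gamma
            if best is None or req > best[0]:
                best = (req, k, b)
        _, k, b = best
        if k == 0:
            break                                 # group would hold every node
        # Required time at the new buffer's input; it now loads its driver by gamma.
        merged = (nodes[k][0] - b.beta * suffix[k] - b.alpha, b.gamma)
        nodes = sorted(nodes[:k] + [merged])
        buffers_used += 1
    source_required = min(r for r, _ in nodes) - beta_g * sum(l for _, l in nodes)
    return source_required, buffers_used
```

With sixteen identical unit-load sinks and a single unit buffer type, the sketch builds a small tree whose source required time is better than that of driving all sixteen sinks directly.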
[Equations (3.9) and (3.10) are not legible in the source scan.]
[Equation (3.12) is not legible in the source scan.]
Using the inequality $(1 - 1/x)^x < e^{-1}$, valid for all $x > 1$, we can derive, for any given real
number $\epsilon > 0$, the following inequality:
[Equations (3.13) to (3.15) are not legible in the source scan.]
This inequality is also valid for $0 < n < a^2$, since in that case [...] $> 1$, and thus
$g(n) = 0 <$ [...]. By applying this inequality to [...] we obtain:
[Equation (3.16) is not legible in the source scan.]
Since g is monotonic nondecreasing, we deduce from the previous two inequalities:
[Equations (3.17) to (3.19) are not legible in the source scan.]
We can deduce from this inequality that:
[Equation (3.20) is not legible in the source scan.]
We obtain finally:
[Equation (3.25) is not legible in the source scan.]
LT-Trees LT trees are designed to be just complex enough to allow for critical signal
isolation and buffering of large capacitive loads. A recursive definition of the set of LT
trees is given below. Figure 3.9 illustrates this definition. When an LT tree is used as a
fanout circuit, its root corresponds to the source of the signal, and its leaves to the sinks.
If the tree is composed of a single leaf, the source and the sink are not distinguished here.
In the actual circuit, they would be distinguished, and connected by a wire.
(3) Let T be a tree rooted at r such that one child of r is an LT tree and all the other
children of r are leaves. Then T is an LT tree.
In an LT-tree, there is at most one intermediate node that has more than one
intermediate node as a child. If there is no such node, the LT-tree is terminated according
to case (1) of the definition, and is called an LT-tree of type 1. If there is such a node, the
node is the root of a two-level tree which terminates the LT-tree. In that case, the LT-tree
is of type 2. Examples of type 1 and type 2 LT-trees are given in Figure 3.10. In the
example of type 2, the intermediate node with more than one intermediate node as a child
is highlighted.
Proof An LT-tree of type 1 is entirely determined by the number k of intermediate nodes of
the tree, the number of leaves attached to each intermediate node, and the buffer selected at
each intermediate node. For a given k, we can represent the topological structure of an
LT-tree by a unique k-tuple of integers $(x_1, \ldots, x_k)$ satisfying $1 \le x_1 < x_2 < \ldots < x_k \le n - 2$,
where leaves $x_j + 1$ to $x_{j+1}$ are the leaves connected to the j-th intermediate node (by
convention we set $x_{k+1}$ to be equal to n). In particular, leaves 1 to $x_1$ are connected to the
root node, and leaves $x_k + 1$ to n are connected to the last intermediate node of the tree.
The inequality $x_k \le n - 2$ is there to guarantee that the last intermediate node is connected
to at least 2 sinks, as required by the definition of type 1 LT-trees. The number of such
k-tuples of integers is equal to the number of distinct ways of choosing k elements among
$n - 2$, i.e. $\binom{n-2}{k}$. In addition, for a given topological structure with k intermediate
nodes, there are exactly $d^k$ possible assignments of buffers to intermediate nodes of the tree.
In total, the number of distinct LT-trees of type 1 is given by the formula:
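The displayed formula is not legible in this copy. Assuming that k ranges over all values from 0 to $n - 2$, summing the two factors counted above and applying the binomial theorem gives the following closed form, which agrees with the search-space size quoted a few lines later in the text:

```latex
\sum_{k=0}^{n-2} \binom{n-2}{k} \, d^{k} \;=\; (d + 1)^{n-2}
```

The range of k is an assumption made here; the identity itself is the standard binomial expansion of $(d + 1)^{n-2}$.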
The LT-tree based algorithm explores implicitly, using dynamic programming, all
LT-trees of type 1. For LT-trees of type 2, the LT-tree based algorithm only considers those
trees whose two-level subtrees are derived from the two-level algorithm of section 3.4.4. In
other words, for every LT-tree of type 1, the algorithm only considers one LT-tree of type 2,
for a total search space of size $2(d + 1)^{n-2}$. If we ignore tree patterns, then the size of
the search space is $2^{n-1}$. This is only a small fraction of the total number of rooted trees
(the Catalan numbers give the number of ways to fully parenthesize a string of n symbols).
Using Stirling's formula, we can easily deduce that $T(n)$ is asymptotically equivalent to
$2^{2n} / n^{1.5}$.
The LT-tree based algorithm only considers assignments of sinks to leaves of the
LT-trees that are such that sinks with larger required times are placed further from the
root of the tree. This is partially justified by the following fact:
Lemma 3.4.5 If sink loads are all equal, there is an optimal LT-tree such that the sinks
with larger required times are placed further from the root.
Proof When loads are equal, exchanging two sinks that are out of order in an LT-tree can
only increase the required time at the source of the tree. ■
Unfortunately, for arbitrary load values, the optimal LT-tree may require that sinks are
placed out of order, as can be checked by inspection in the example of Figure 3.11.
(1) for some sink index $l > k$ and buffer type b, the root of $T_{(k,g)}$ is directly connected to
sinks $(s_k, \ldots, s_{l-1})$ and to a buffer of type b. The subtree connected to the buffer b is
an optimal LT-tree $T_{(l,b)}$ for the subproblem $(l, b)$. Since the algorithm proceeds by
induction from n to 1, $T_{(l,b)}$ has already been computed and is available.
The algorithm is detailed in Figure 3.12. The best required time achievable for a pair $(k, g)$
is stored in the table required[k, g]. To keep track of the optimal configuration selected for
$(k, g)$, we simply need to store a flag, use_two_level[k, g], to decide whether a two-level tree
is used or not; and, in the case a two-level tree is not used, an index, next[k, g], which is
algorithm lt_tree_computation
  sort the sinks $s_i$ by increasing required times $(r_1, \ldots, r_n)$
  foreach $1 \le i \le n$ compute $\gamma_{i,n} = \sum_{i \le j \le n} \gamma_j$
  required[n + 1, $\emptyset$] = $+\infty$;
  for k = n to 1 {
    foreach g such that $(k, g) \in \{(1, s)\} \cup ([2, \ldots, n] \times \mathcal{L})$ {
      required[k, g] = required_two_level[k, g];
      use_two_level[k, g] = true;
      foreach $(l, b) \in \{(n + 1, \emptyset)\} \cup ([2, \ldots, n] \times \mathcal{L})$ such that $l > k$ {
        required = min($r_k$, required[l, b] $- \alpha_b$) $- \beta_g(\gamma_b + \gamma_{k,n} - \gamma_{l,n})$;
        if (required > required[k, g]) {
          required[k, g] = required;
          use_two_level[k, g] = false;
          next[k, g] = l;
          gate[k, g] = b;
        }
      }
    }
  }
end lt_tree_computation
the index of the first sink not directly attached to the root of $T_{(k,g)}$, and, if next[k, g] $\le$ n, a
buffer type, gate[k, g], which is the type of the unique buffer attached to the root of $T_{(k,g)}$.
An example of computation of these entries is given in Figure 3.13. To compute the required
time of a configuration other than a precomputed two-level tree, we start with the required
time of the selected subproblem $(l, b)$, required[l, b]. This required time is not exactly the
required time at the input of $T_{(l,b)}$: it does not take into account the intrinsic delay of buffer
b. To obtain the required time at the input of $T_{(l,b)}$, we need to subtract from required[l, b]
the intrinsic delay $\alpha_b$ of buffer b. The required time at the output of the root of $T_{(k,g)}$ is then
the minimum between the earliest required time of a sink connected to the root of $T_{(k,g)}$,
which is $r_k$, and the required time at the input of $T_{(l,b)}$, which is required[l, b] $- \alpha_b$. To
obtain the required time required[k, g], we need to subtract from min($r_k$, required[l, b] $- \alpha_b$)
the load dependent delay $\beta_g(\gamma_b + \gamma_{k,n} - \gamma_{l,n})$ but we do not include the intrinsic delay of gate
g. $\gamma_b$ is the load of buffer b while $\gamma_{k,n} - \gamma_{l,n}$ is the sum of the loads of the sinks attached
to the root of $T_{(k,g)}$. The required time required[k, g] is actually the best required time
achieved for any possible choice of $(l, b)$, with $l > k$.
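The recurrence just described can be sketched in Python as follows. This is a simplified illustration, not the algorithm of Figure 3.12 itself: it omits the precomputed two-level option, so every table entry starts from the "no subtree" case instead of required_two_level, and it returns only the best required time rather than the next and gate tables. The `Gate` record and the function name are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Gate:
    alpha: float   # intrinsic delay
    beta: float    # delay per unit of driven load
    gamma: float   # input load

def lt_tree_required_time(sinks, library, source):
    """Best required time at the source output over chain-shaped LT-trees.

    `sinks` is a list of (required_time, load); sinks are sorted so that
    larger required times sit further from the root. O(n^2 d^2) time.
    """
    sinks = sorted(sinks)
    n = len(sinks)
    r = [s[0] for s in sinks]
    suffix = [0.0] * (n + 1)             # suffix[k] = total load of sinks k..n-1
    for k in range(n - 1, -1, -1):
        suffix[k] = suffix[k + 1] + sinks[k][1]
    gates = [source] + list(library)     # index 0 is the source gate
    # required[k][gi]: best required time at the output of gate gi when it
    # roots a tree covering sinks k..n-1 (intrinsic delay of gi excluded).
    required = [[float("-inf")] * len(gates) for _ in range(n)]
    for k in range(n - 1, -1, -1):
        for gi, g in enumerate(gates):
            if (gi == 0) != (k == 0):
                continue                 # source only at k = 0, buffers only at k >= 1
            # Case (n + 1, empty): g drives sinks k..n-1 directly.
            best = r[k] - g.beta * suffix[k]
            for l in range(k + 1, n):
                for bi in range(1, len(gates)):
                    b = gates[bi]
                    cand = (min(r[k], required[l][bi] - b.alpha)
                            - g.beta * (b.gamma + suffix[k] - suffix[l]))
                    best = max(best, cand)
            required[k][gi] = best
    return required[0][0]
```

For one critical sink with required time 10 and fifteen sinks with required time 100, all of unit load, a single unit buffer isolates the critical sink: the sketch returns 8 where direct drive would give -6.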
Allowing Nodes with No Leaves It can be helpful in practice to allow some intermediate
nodes in an LT-tree to bear no leaves. For example, this gives the algorithm the freedom
to generate a sequence of buffers of increasing sizes to drive large loads when needed, as for
example at the root of a two-level tree. This can be implemented as a direct extension of
the dynamic programming algorithm of Figure 3.12, by computing, for each triplet $(k, g, l)$,
with $0 \le l < L$, the optimal LT-tree $T_{k,g,l}$ for sinks k to n with source g and l or fewer
intermediate nodes with no leaves connected to them. $T_{k,g,l}$ can only be composed of a
buffer attached to the tree $T_{k,g,l-1}$ or a buffer attached to the tree $T_{k',g,l}$ for some $k' > k$. If
we allow up to $L - 1$ intermediate nodes to bear no leaves, the complexity of the algorithm
becomes $O(L\, d^2\, n^{2.5})$.
3.4.7 Other Fanout Algorithms
Other fanout optimizations have been proposed, by Berman et al. in [5] and Singh
et al. in [40]. None of these algorithms is optimal, since they all rely, as the LT-tree
algorithm does for LT-trees of type 1, on ordering sinks by required times. All of these
algorithms produce trees that have the following property: there exists a depth-first search
order of the nodes of the tree that visits the leaves in order of increasing required times.
We give in Figure 3.14 an example of a fanout problem that cannot be solved optimally
with such trees, even with the simple unit-fanout delay model of section 3.3.2. It is to be
noted that the suboptimality of these algorithms has nothing to do with the fact that the
fanout problem is NP-complete for some delay models. It is actually not known whether
the fanout problem is NP-complete for the unit-delay model.
B e r m a n 's A lg o rith m s Berman et al. presented two algorithms for the fanout problem.
One complex algorithm, also based on dynamic programming, which can be seen as a
generalisation of our I T - tree based algorithm, and one much simpler algorithm, called the
tWQ=greup algorithm,
The LT-tree algorithm only allows trees that have at most one buffer in the fanout
of any buffer. Since this condition is too restrictive to produce good quality results, we have
relaxed it somewhat by allowing fanout trees to contain one balanced fanout tree as a subtree
if it helps reduce delay. Berman's algorithm relaxes the restriction of having at most one
buffer in the fanout of any buffer using a different technique. The algorithm allows any
number of buffers in the fanout of a buffer, from 1 to some limit $k$, and uses dynamic
programming to select this number optimally. By restricting the sinks to be ordered by
increasing required times, the dynamic programming algorithm can be made polynomial,
though with a large exponent: $O(kn^3)$, if we ignore buffer selection. If we want to take
buffer selection into account with our delay model, a direct modification of this algorithm
yields a complexity of $O(n^2 d^k (n + d))$. The $d^k$ term comes from the fact that we have $d^k$
possible ways of assigning buffers to $k$ inputs. If $k = 1$, we obtain the bound $O(n^2 d^2)$
because the term $n^2 d^k n$ disappears in that case. This is the complexity of the LT-tree
based algorithm if we do not use two-level trees.
The two-group algorithm was introduced by Berman et al. as a more
practical alternative to fanout optimization. This very simple algorithm tries all possible
decompositions of the sinks into two groups of sinks with consecutive required times and,
for each group, assumes that all required times and all loads are equal in order to select a
balanced tree out of a set of precomputed trees. The precomputation can be done once and for all
for a given library, so the run time of the two-group algorithm is essentially $O(n)$. This
simple algorithm is fast but unlikely to generate results of the same quality as the algorithms
proposed earlier.
Handling Inverting Buffers  In some technologies, like CMOS standard cells, noninverting
buffers are made of a juxtaposition of two inverting buffers. For delay optimization,
it is thus always preferable to use inverting buffers. Inverting buffers offer more possibilities
for optimization, and in the worst case can always be combined to reproduce the delay
characteristics of noninverting buffers. There is no major difficulty in handling inverting
buffers. We simply need to keep track of the polarity of the signal at the output of a buffer
and accept a connection with a sink only if the polarities match.
distribute the signal to the other group. When inverting buffers are used, we do not specify
which group of sinks is directly connected to the source, since connecting the source to the
sinks with inverted polarities may yield a faster solution. We actually try both assignments
and keep the best solution.
• the local optimizations only have to be implemented once, instead of once per fanout
algorithm.
• the fanout algorithms can be made simpler. This is particularly true for area
minimization under a delay constraint, which can be done quite effectively by local
optimization. That way, the fanout algorithms can simply be implemented for delay
minimization only.
• good quality results can be obtained efficiently by the combined effect of a simple and
fast fanout algorithm and a simple and fast local optimization algorithm.
Fanout algorithms do not guarantee in general that the buffers in the fanout trees
they produce are selected optimally. For example, the two-level algorithm forces all
intermediate buffers to be of the same type, which is not necessarily the best possible solution;
the combinational merging algorithm selects buffers heuristically, without complete
knowledge of the local structure of the tree in which the buffer is inserted. By performing optimal
buffer selection on the fanout trees resulting from these algorithms, we can decrease delay, at
times significantly. In addition, the computational cost of performing optimal gate selection
is very low: $O(d^2 m)$, where $m$ is the number of edges in the tree, and $d$ the number of
buffers in the library. The cost of performing optimal gate selection is thus only a fraction
of the cost of building a fanout tree in the first place. Thus there is no reason not to perform
optimal gate selection on every fanout tree.
Optimal buffer selection can be implemented with a simple algorithm that proceeds
from the sinks to the root of a fanout tree, and selects, at each intermediate node, a
buffer that maximizes the required time at the parent of that node. In contrast with tree
covering, it is not enough to simply select a buffer that maximizes the required time at a
node, because a later required time for a subtree usually means a higher load to drive for
that subtree. Selecting the largest possible required time for a subtree can slow down the
signal going to the other subtrees sharing the same parent. It is possible, using dynamic
programming, to compute for each subtree the required time required[b] achievable by
using a buffer of type b at the root of the subtree. To do so, we proceed as follows.
Let $v$ be an intermediate node of the tree, with nodes $(v_1, \ldots, v_n)$ as children. For each
buffer $b$ in the library, we have already computed by induction the best required times required[$v_i$, $b$]
achievable for the subtrees rooted at $v_i$, $1 \le i \le n$, if $b$ is selected at node $v_i$. To compute
required[$v$, $b$], we simply need to find an assignment of buffers to nodes $(v_1, \ldots, v_n)$, i.e. a
function $f : \{1, \ldots, n\} \to \{1, \ldots, d\}$, that maximizes the required time at $v$, given by the
following expression:
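Under the delay model used elsewhere in this chapter (required time at a gate input equals required time at its output, minus an intrinsic delay, minus a load-dependent delay times the output load), this expression can be sketched as follows. The symbols $\alpha_b$, $\beta_b$ for the intrinsic and load-dependent delay of buffer $b$, and $\gamma_{f(i)}$ for the input load of the buffer assigned to child $v_i$, are assumptions made here for illustration:

```latex
\mathit{required}[v, b] \;=\; \min_{1 \le i \le n} \mathit{required}[v_i, b_{f(i)}]
  \;-\; \alpha_b \;-\; \beta_b \sum_{1 \le i \le n} \gamma_{f(i)}
```

The first term rewards late required times in the subtrees, while the sum penalizes the load that those choices place on $b$. Since $\alpha_b$ is a constant for a given $b$, only the minimum and the load sum depend on $f$.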
The assignment problem we are attempting to solve can be reformulated as follows:
given two $n \times d$ matrices of numbers $r_{i,j}$ and $l_{i,j}$ such that, for any given $i$, $r_{i,j}$ and $l_{i,j}$ are
monotonically nondecreasing in $j$, find an assignment $f : \{1, \ldots, n\} \to \{1, \ldots, d\}$ that
maximizes the quantity $gain(f) = r_f - l_f$, where:
$$r_f = \min_{1 \le i \le n} r_{i,f(i)} \qquad (3.29)$$
$$l_f = \sum_{1 \le i \le n} l_{i,f(i)} \qquad (3.30)$$
The $r_{i,j}$ represent the required times required[$v_i$, $b_j$] and the $l_{i,j}$ the load values $\gamma_j$. The
load values are actually independent of $i$, and, without loss of generality, we can suppose
that they are sorted in increasing order. Thus the $l_{i,j}$ are monotonically nondecreasing
in $j$. If the $r_{i,j}$ were not monotonically nondecreasing in $j$, there would be a pair $(j_1, j_2)$
of indices such that, for some $i$, $j_1 < j_2$ and $r_{i,j_1} > r_{i,j_2}$. For $i$, $j_1$ would always be
a better choice than $j_2$. We can therefore replace $r_{i,j_2}$ by $r_{i,j_1}$ and $l_{i,j_2}$ by $l_{i,j_1}$ without
affecting at least one optimal assignment to the problem. We can iterate this replacement
until the monotonicity condition is satisfied by the $r_{i,j}$.
We define a distance between two assignments $f$ and $g$ as follows:
$$d(f, g) = \sum_{1 \le i \le n} |f(i) - g(i)| \qquad (3.31)$$
In particular, $d(f, g) = 1$ if and only if $f$ and $g$ differ on only one index, and the difference
of value on this index is only one.
The optimum assignment algorithm is outlined in Figure 3.15. The algorithm
computes a sequence of assignments $(f_k)_{0 \le k \le K}$, for some $K \le dn$, such that $d(f_{k+1}, f_k) = 1$,
and such that there is a $k_0$, $0 \le k_0 \le K$, for which $f_{k_0}$ is optimal. $f_0$ is initialized to be such
that $f_0(i) = 1$ for all $1 \le i \le n$. Given $f_k$, the computation of $f_{k+1}$ only takes constant time.
To compute $f_{k+1}$ from $f_k$, the algorithm finds an index $i_k$ that is critical for the current
assignment, i.e. an index that minimizes the quantity $r_{i,f_k(i)}$ for $1 \le i \le n$. $f_{k+1}$ is then
defined as being equal to $f_k$ for all indices different from $i_k$ and equal to $f_k(i_k) + 1$ on $i_k$. If
$f_k(i_k) = d$ and thus cannot be incremented, the algorithm terminates the computation of
the sequence $(f_k)$. The total number of incrementing steps cannot exceed $dn$.
To find the optimal assignment $f_{k_0}$, the algorithm works backwards, from $f_K$ to
$f_0$, in order to exploit the fact that the cost of assignment $f_k$, $gain(f_k)$, can be
computed in constant time from the cost of $f_{k+1}$, but not conversely. To compute $gain(f_k)$
in constant time from $gain(f_{k+1})$, the algorithm keeps track of the intermediate quantities
$r_{f_k} = \min_{1 \le i \le n} r_{i,f_k(i)}$ and $l_{f_k} = \sum_{1 \le i \le n} l_{i,f_k(i)}$ and uses the fact that $r_{f_k} = r_{i_k,f_k(i_k)}$ and
$l_{f_k} = l_{f_{k+1}} - l_{i_k,f_k(i_k)+1} + l_{i_k,f_k(i_k)}$. This guarantees that $k_0$ can be computed in $O(nd)$ time.
Finally, constructing $f_{k_0}$ from $f_0$ given $k_0$ takes $O(nd)$ time.
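The incrementing scheme described above can be illustrated with a minimal Python sketch. It is not the procedure of Figure 3.15: it scans the sequence forward with a heap instead of backwards with constant-time updates, uses 0-based buffer indices, and assumes both matrices are already nondecreasing in $j$. The function names and the brute-force reference are hypothetical and introduced only for illustration.

```python
import heapq
from itertools import product


def optimal_buffer_assignment(r, l):
    """Maximize min_i r[i][f[i]] - sum_i l[i][f[i]] over assignments f.

    r[i][j], l[i][j]: required time and load when child i gets buffer j.
    Assumes every row of r and l is nondecreasing in j (0-based j).
    Returns (best_gain, best_assignment).
    """
    n, d = len(r), len(r[0])
    f = [0] * n                               # f_0: smallest buffer everywhere
    load = sum(l[i][0] for i in range(n))     # running value of l_f
    heap = [(r[i][0], i) for i in range(n)]   # smallest entry is the critical child
    heapq.heapify(heap)
    steps = []                                # indices incremented so far
    best_gain, best_step = float("-inf"), 0
    while True:
        r_min, i0 = heap[0]                   # r_f and a critical index
        gain = r_min - load
        if gain > best_gain:
            best_gain, best_step = gain, len(steps)
        if f[i0] == d - 1:                    # critical index cannot be incremented
            break
        heapq.heapreplace(heap, (r[i0][f[i0] + 1], i0))
        load += l[i0][f[i0] + 1] - l[i0][f[i0]]
        f[i0] += 1
        steps.append(i0)
    best = [0] * n                            # replay the first best_step increments
    for i in steps[:best_step]:
        best[i] += 1
    return best_gain, best


def brute_force_gain(r, l):
    """Reference: try all d**n assignments (only usable on tiny instances)."""
    n, d = len(r), len(r[0])
    return max(
        min(r[i][f[i]] for i in range(n)) - sum(l[i][f[i]] for i in range(n))
        for f in product(range(d), repeat=n)
    )


# Two children, three buffer sizes; loads do not depend on the child.
r_example = [[0, 5, 6], [1, 7, 8]]
l_example = [[1, 2, 4], [1, 2, 4]]
result = optimal_buffer_assignment(r_example, l_example)
```

On this small instance the sketch selects the middle buffer for both children. The heap makes each step cost $O(\log n)$, so this variant runs in $O(nd \log n)$ rather than the $O(nd)$ bound stated in the text.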
algorithm optimal_buffer_assignment
  set f to be such that: f(i) = 1 for all 1 ≤ i ≤ n.
  k = 1;
  do {
    i_0 = argmin_{1 ≤ i ≤ n} r_{i,f(i)}
    if f(i_0) = d break
Proof  By the monotonicity hypothesis on the arrays $r_{i,j}$ and $l_{i,j}$, (3.32) implies that
$r_f \le r_g$ and $l_f \le l_g$, and (3.33) implies that $r_f = r_{i_0,f(i_0)} = r_{i_0,g(i_0)} \ge r_g$. Thus $r_f = r_g$ and
$l_f \le l_g$, which implies that $gain(f) \ge gain(g)$. ∎
If the fanout problem is given with a delay constraint, our objective is to find a
fanout tree with minimum area that meets the delay constraint. The fanout algorithms we
have presented so far can only minimize delay. We have suggested in section 3.4.6 a way
to extend the LT-tree based algorithm to minimize area under a delay constraint. Other
fanout algorithms could also be extended to support this optimization, but each algorithm
will have to be modified separately (e.g. the LT-tree algorithm is the only one to be based
on dynamic programming). In addition, it is likely to be difficult to extend the fanout
algorithms to minimize area under a delay constraint in an efficient and accurate way.
Fortunately, there are simpler ways to recover area. Though not optimal, the techniques
we present in the following paragraphs are very effective in practice and straightforward to
implement.
Partial Collapse  The second heuristic we use, which is also very effective at recovering
unnecessary area, consists in partially collapsing minimum delay fanout trees. This
technique is particularly useful on LT-trees or trees generated by combinational merging. The
algorithm performs partial collapses as follows. Given a fanout tree and a delay constraint,
i.e. an arrival time at the root of the fanout tree, the tree is visited from the root to the
sinks in order to compute the arrival times at every intermediate node. Then the tree is
visited in reverse order, from the sinks to the root. At each intermediate node $v$, the
algorithm computes the required time at $v$ that would be obtained if $v$ were connected directly
to all the sinks driven by the subtree $T_v$ rooted at $v$. If this required time is larger than the
arrival time at $v$, $T_v$ is collapsed into $v$: all buffers contained in $T_v$ are eliminated and the
sinks of $T_v$ are directly connected to $v$.
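The two traversals described above can be sketched in Python as a single recursion that carries arrival times down and tests the collapse condition on the way back up. The `Sink` and `Buffer` classes, the linear delay model (intrinsic delay `alpha`, delay per unit load `beta`, input load `gamma`) and the treatment of the root as a buffer are assumptions made for this illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class Sink:
    required: float   # required time at the sink
    load: float       # load presented by the sink


@dataclass
class Buffer:
    alpha: float      # intrinsic delay (assumed linear delay model)
    beta: float       # delay per unit of driven load
    gamma: float      # input load of the buffer
    children: list = field(default_factory=list)   # Buffers and Sinks


def load_of(node):
    return node.load if isinstance(node, Sink) else node.gamma


def sinks_of(node):
    if isinstance(node, Sink):
        return [node]
    return [s for child in node.children for s in sinks_of(child)]


def partial_collapse(node, arrival):
    """Collapse subtrees in place; `arrival` is the arrival time at node's input."""
    if isinstance(node, Sink):
        return
    # Root-to-sinks direction: arrival time at the output of this buffer.
    driven = sum(load_of(child) for child in node.children)
    out_arrival = arrival + node.alpha + node.beta * driven
    for child in node.children:
        partial_collapse(child, out_arrival)
    # Sinks-to-root direction: required time if node drove all its sinks directly.
    sinks = sinks_of(node)
    direct_required = (min(s.required for s in sinks)
                       - node.alpha - node.beta * sum(s.load for s in sinks))
    if direct_required > arrival:
        node.children = sinks     # every buffer below node is eliminated


# Example: root drives sink A and buffer B; B drives buffer E; E drives C and D.
A = Sink(required=10.0, load=1.0)
C = Sink(required=20.0, load=5.0)
D = Sink(required=20.0, load=5.0)
E = Buffer(1.0, 1.0, 1.0, [C, D])
B = Buffer(1.0, 1.0, 1.0, [E])
root = Buffer(1.0, 1.0, 1.0, [A, B])
partial_collapse(root, arrival=0.0)
```

In this example buffer `E` is removed because `B` can drive `C` and `D` directly with slack to spare, while the root keeps `B` because driving all three sinks directly would violate the constraint on `A`. Collapsing a subtree leaves the input load of its root unchanged, which is why the arrival times computed on the way down remain valid.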
Buffer Selection  We also use a modified version of the optimal buffer selection algorithm
of section 3.6.2. This algorithm is not optimal: it does not find the buffer assignment that
would minimize area under a delay constraint, but it is fast, simple and easy to implement.
The modification is done as follows: given an arrival time at the root of the tree, for each
intermediate node $v$ and for each buffer type $b$, we want to compute an achievable arrival
time at $v$ if node $v$ is assigned a buffer of type $b$. This arrival time is taken to be the
arrival time obtained on the fanout tree after the buffer at node $v$ has been replaced by
$b$, without changing the rest of the tree; it is clearly an achievable, but not necessarily an
optimal arrival time at node $v$ with buffer $b$. The suboptimality comes from the fact that
we do not know what the optimal buffer assignment is for the siblings of $v$ given that $v$ is
assigned $b$, and it would be too time consuming to compute it for all values of $b$. Given
this achievable arrival time at $v$ for a given choice of buffer $b$, to perform buffer selection
we use the algorithm of section 3.6.2, modified to select the minimum index $k$ for which $f_k$
guarantees a nonnegative slack at $v$. This computation is done for each value of $b$ and the
result is stored at node $v$. The buffer selection for node $v$ is done when the parent node of
$v$ is visited.
algorithm one_pass_global_fanout_optimization
  foreach node v visited in topological order from outputs to inputs {
    if v is the root of a tree {
      apply fanout optimization to the fanout problem rooted at v
    } else {
      propagate the required time at the output of v to the inputs of v
    }
  }
end one_pass_global_fanout_optimization
circuit. In section 3.7.1 we present such a technique and show that it is optimal with respect
to delay minimization. In section 3.7.2 we extend this technique to recover area under a
delay constraint. For area minimization under a delay constraint, this technique is not
optimal but is very efficient in practice.
• otherwise, we simply propagate the required times of the outputs of $v$ to the inputs
of $v$. More specifically, if $(v_1, \ldots, v_n)$ are the fanouts of $v$, the required time $r$ at the
output of $v$ is the minimum of the required times of $(v_1, \ldots, v_n)$; the load $\gamma$ at the
output of $v$ is the sum of the loads of $(v_1, \ldots, v_n)$. The required time $r_i$ at input pin
$i$ of node $v$ is then given by the following equation: $r_i = r - \alpha_i - \beta_i \gamma$, where $\alpha_i$ is the
intrinsic delay and $\beta_i$ the load dependent delay of the gate at node $v$ for input pin $i$.
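The propagation rule $r_i = r - \alpha_i - \beta_i \gamma$ is small enough to state directly in code. The following Python sketch is only an illustration; the function name and the list-based interface are assumptions, not part of the original implementation.

```python
def propagate_required_times(fanout_required, fanout_loads, alpha, beta):
    """Required time at each input pin of a gate, from its fanouts.

    fanout_required, fanout_loads: required time and load of each fanout.
    alpha[i], beta[i]: intrinsic and load-dependent delay for input pin i.
    """
    r = min(fanout_required)      # required time at the gate output
    gamma = sum(fanout_loads)     # total load at the gate output
    return [r - a_i - b_i * gamma for a_i, b_i in zip(alpha, beta)]


# A two-input gate driving two fanouts.
pins = propagate_required_times(
    fanout_required=[10.0, 8.0],
    fanout_loads=[1.0, 2.0],
    alpha=[1.0, 1.5],
    beta=[0.5, 0.25],
)
print(pins)
```

Here the output required time is 8.0 and the output load is 3.0, so the two input pins receive 8.0 - 1.0 - 0.5 × 3.0 and 8.0 - 1.5 - 0.25 × 3.0 respectively.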
The algorithm of Figure 3.16 is optimal in the following sense. Let $N$ be a
combinational network, $L = (v_1, \ldots, v_m)$ be an arbitrary list of possibly repeated tree roots of
$N$ and $F$ a fanout algorithm. The pair $(F, L)$ defines a global fanout algorithm as follows:
first, compute the required times at all nodes in $N$. Then, for $k = 1$ to $k = m$, apply the
algorithm $F$ to the fanout problem rooted at $v_k$ and recompute the required times at all
nodes in $N$ wherever necessary. We define the cost of $(F, L)$ as an $n$-dimensional vector:
$$cost(F, L) = (r(p_1), \ldots, r(p_n))$$
where the nodes $p_i$, $1 \le i \le n$, are the primary inputs of $N$, and $r(v)$ designates the required
time at node $v$. We say that the $n$-dimensional vectors $x$ and $y$ satisfy the inequality $x \le y$
if and only if $x_i \le y_i$ for all $1 \le i \le n$.
Figure 3.17: Applying Fanout Algorithms in One Pass From Outputs to Inputs
2. the fanout trees it produces are such that the required times at the root of the trees are
a nondecreasing function of the sink required times.
Let $L_0 = (v_1, \ldots, v_m)$ be a topological order of the tree roots of $N$ starting from primary
outputs, and let $L$ be an arbitrary list of tree roots. Then $cost(F, L_0) \ge cost(F, L)$. Moreover
if $F$ is an optimal fanout algorithm, the result of $(F, L_0)$ is optimal with respect to fanout
optimization in the sense that no modification of the individual fanout trees can reduce the
delay through the network.
hypothesis, the required times at the sinks of tree root $v_{k+1}$ are no worse in $T_p$ than in $T_{opt}$.
Since $F$ is optimal, $r_{T_{opt}}(v_{k+1}) \le r_{T_p}(v_{k+1})$. ∎
The importance of theorem 3.7.1 is to prove that any lack of optimality in fanout
optimization is due to the lack of optimality of the fanout optimization algorithms, not to
the procedure we use to apply the fanout algorithms to the entire circuit. In particular,
there is no need to develop fast, incremental techniques to extract critical path information
that would be used to guide the fanout optimization. This simple algorithm based on one
topological traversal does as well in terms of delay as any other more complicated technique.
The main weakness of the optimal procedure described in the previous section is
that it can be too wasteful in area. It optimizes all tree roots for minimum delay, with no
consideration for the necessity of such an optimization. We now introduce a technique to
recover area at no cost in delay. This technique works with arbitrary required times at the
primary outputs and arbitrary arrival times at the primary inputs of a network.
To recover area at no delay cost after fanout optimization, we proceed as follows.
We first save at each tree root the arrival time achievable after fanout optimization. We
then reapply the fanout optimizer to each fanout problem, visited in topological order. This
time we call the fanout optimizer to minimize fanout tree area under the constraint that
the required time at the root of the tree is no less than the arrival time at the root of the
tree. To perform this optimization, we use the techniques described in section 3.6.3.
With this simple algorithm, we can recover, at no delay cost, most of the area
wasted by the first phase of fanout optimization. Unfortunately this algorithm is not
optimal, for two reasons: first because it relies on a fanout algorithm that itself is not optimal;
second because by visiting nodes in topological order, it uses the slack available on any given
path as early as possible. A more balanced use of the slack can lead, at least in some
cases, to a smaller circuit with the same delay. Despite these limitations, this technique is
very effective at recovering area, as can be seen from the results of section 3.8.3.
Our set of benchmark circuits comes from four sources: MCNC, ISCAS, Intel and
AT&T. The MCNC benchmarks were put together during the International Logic Synthesis
Workshop 1989 [32] and are publicly available from MCNC. They are themselves from
several origins, though complete information was not always available on each of them. The
ISCAS benchmarks were originally testing benchmarks. The Intel and AT&T circuits come
from these companies. No details concerning their functionality were provided. Table 3.2
contains some general information on the 25 benchmark circuits, including the number
of primary inputs and primary outputs, the number of literals in factored form, and the
number of gates needed to implement the circuits when technology mapped for minimum
area using the MCNC library lib2. We also indicate briefly the function of the circuit if
known. If it is not known, we simply use the word logic to characterize the circuit.
circuit function: simple description of the logic function of the circuit, if available
origin: origin of the circuit
# pis: number of primary inputs
# pos: number of primary outputs
# lits: number of literals in factored form as computed by misII
# gates: number of gates to implement the circuit using the misII technology
         mapper in minimum area mode with the MCNC library lib2
circuit         min area          fanout opt           gain
               area    delay     area    delay     area   delay
C1355           990    27.16     1119    24.25     1.13    0.89
C1908          1086    35.04     1236    29.55     1.14    0.84
C2670          1420    28.42     1478    22.14     1.04    0.78
C3540          2201    45.24     2421    34.27     1.10    0.76
C5315          3171    39.94     3352    30.78     1.06    0.77
C6288          6777   120.37     7340   101.08     1.08    0.84
C7552          4660    69.47     5084    28.55     1.09    0.41
alu4           1486    47.17     1724    31.84     1.16    0.68
ampbpreg       2741    59.85     2891    39.40     1.05    0.66
ampbsm         1478    25.78     1615    18.05     1.09    0.70
amppint2       1021    22.45     1136    12.31     1.11    0.55
ampxhdl         751    24.73      865    13.38     1.15    0.54
apex6          1505    17.74     1565    13.40     1.04    0.76
des            6452   107.12     7358    17.84     1.14    0.17
dflgrcbl        615    12.83      630    11.01     1.02    0.86
fconrcbl        467    15.18      481    13.82     1.03    0.91
frg2           1738    37.91     1893    14.95     1.09    0.39
k2             1972    32.80     2189    20.56     1.11    0.63
kcctlcb3        449    11.50      469    10.20     1.04    0.89
pair           3200    25.85     3482    17.80     1.09    0.69
rot            1387    30.18     1444    21.49     1.04    0.71
sbiucbl         471    22.18      514    18.89     1.09    0.85
tfaultcbl       375     9.91      400     8.41     1.07    0.85
vda            1118    23.76     1324    16.75     1.18    0.70
x3             1653    22.40     1754    11.45     1.06    0.51
aver              -        -        -        -     1.09    0.66
Effect on Optimized Circuits  Before technology mapping, we optimized each circuit
using misII technology independent simplification and factorization algorithms, as is
normally done in logic synthesis. We then compared the area and delay of the circuits after
technology mapping for minimum area, with and without fanout optimization. The results
are reported in Table 3.3. We can observe a wide variation in delay reductions, ranging
from 9% to a factor of 6, while the area increases ranged from 3% to 18%. On average,
fanout optimization reduces delay by 34% for an area increase of 9%.
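The text does not say how the "aver" row of Table 3.3 is computed. As a hedged observation from the tabulated numbers alone, the reported 0.66 delay ratio (the 34% average reduction) is consistent with a geometric mean of the per-circuit ratios rather than an arithmetic mean. The short Python check below uses only the delay gain column of Table 3.3; the helper names are introduced here for illustration.

```python
import math

# Delay gain column of Table 3.3 (fanout-optimized delay / minimum-area delay),
# in table order from C1355 to x3.
delay_gain = [0.89, 0.84, 0.78, 0.76, 0.77, 0.84, 0.41, 0.68, 0.66, 0.70,
              0.55, 0.54, 0.76, 0.17, 0.86, 0.91, 0.39, 0.63, 0.89, 0.69,
              0.71, 0.85, 0.85, 0.70, 0.51]


def arithmetic_mean(values):
    return sum(values) / len(values)


def geometric_mean(values):
    return math.exp(sum(math.log(v) for v in values) / len(values))


print(f"arithmetic mean of delay ratios: {arithmetic_mean(delay_gain):.2f}")
print(f"geometric mean of delay ratios:  {geometric_mean(delay_gain):.2f}")
# Extremes quoted in the text: fconrcbl (0.91, a 9% reduction) and des.
print(f"speedup on des: {107.12 / 17.84:.1f}x")
```

The geometric mean is less sensitive to which configuration is taken as the baseline, which makes it a common choice for averaging ratios. Whether that was the author's intent is an assumption.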
Effect on Unoptimized Circuits  Logic simplification is usually beneficial both in terms
of area and delay, while logic factorization can decrease circuit area by sharing common
subexpressions at the cost of larger fanouts and extra levels of logic. Because we ran our
experiments on circuits that were optimized by misII, there is the possibility that fanout
optimization was effective on these circuits simply because it corrects the fanouts introduced
by factorization. To check for this possibility, we ran the same experiments on the same
circuits, but this time before the circuits were optimized by misII. The results are reported
in Table 3.4. Though for some circuits, such as C1355 and C6288, fanout optimization was
more effective on optimized circuits, on average fanout optimization was more effective on
unoptimized circuits, reducing delay by 44% for an area increase of 10% instead of 34%
and 9% respectively. From this experiment we can conclude that factorization is not the
main factor contributing to large fanouts. Actually, technology independent optimization
can have the overall effect of reducing the impact of fanout optimization.
optimally for delay. In the area recovery phase, if the inverter is not critical for delay,
it is replaced by a smaller inverter. The effect of this simple optimization is reported in
Table 3.5. Inverter optimization can achieve a 5% decrease in delay at negligible cost in
area, which is 15% of the total delay decrease obtainable with
fanout optimization.
Fanout Optimization with No Critical Signal Isolation  The two main fanout
optimizations combined in our algorithms are buffering and critical signal isolation. To
determine the relative importance of these two optimizations, we measured the effect of
fanout optimization limited to buffering. For the buffering algorithm, we use the algorithm
of section 3.4.3, which restricts itself to two-level fanout trees. This algorithm ignores
required times but takes load values into account. The results obtained with this algorithm
are reported in Table 3.6. Using this simple algorithm, we obtained a delay reduction of
23% for a cost in area of only 3% on average. Buffering alone accounts for 68% of the total
delay reduction we can obtain with fanout optimization, for only a third of the cost in area.
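The shares quoted in this section follow from the average figures reported in the text. The following Python lines are a simple arithmetic check under the assumption that each share is the ratio of the corresponding average reductions; the variable names are illustrative.

```python
# Average figures quoted in the text (fractions of the minimum-area values).
full_delay_reduction = 0.34       # complete fanout optimization
full_area_cost = 0.09
buffering_delay_reduction = 0.23  # two-level buffering only
buffering_area_cost = 0.03
inverter_delay_reduction = 0.05   # inverter optimization only

buffering_share = buffering_delay_reduction / full_delay_reduction
buffering_area_share = buffering_area_cost / full_area_cost
inverter_share = inverter_delay_reduction / full_delay_reduction

print(f"buffering share of delay reduction: {buffering_share:.0%}")
print(f"buffering share of area cost:       {buffering_area_share:.0%}")
print(f"inverter share of delay reduction:  {inverter_share:.0%}")
```

The three ratios come out at roughly 68%, one third and 15%, matching the figures given for buffering and for inverter optimization.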
circuit       fanout opt   lower bound   ratio
                 delay        delay      delay
C1355            24.25        20.53      0.85
C1908            29.55        24.56      0.83
C2670            22.14        19.74      0.89
C3540            34.27        28.86      0.84
C5315            30.78        27.01      0.88
C6288           101.08        83.74      0.83
C7552            28.55        24.48      0.86
alu4             31.84        27.08      0.85
ampbpreg         39.40        34.24      0.87
ampbsm           18.05        15.97      0.88
amppint2         12.31        11.00      0.89
ampxhdl          13.38        11.43      0.85
apex6            13.40        12.55      0.94
des              17.84        15.71      0.88
dflgrcbl         11.01         9.71      0.88
fconrcbl         13.82        13.54      0.98
frg2             14.95        12.06      0.81
k2               20.56        18.46      0.90
kcctlcb3         10.20         8.75      0.86
pair             17.80        16.00      0.90
rot              21.49        18.22      0.85
sbiucbl          18.89        17.77      0.94
tfaultcbl         8.41         7.75      0.92
vda              16.75        15.45      0.92
x3               11.45        10.40      0.91
aver                 -            -      0.88
optimizations, we ran the fanout optimizer after having deactivated them. The results of
this experiment are reported in Table 3.8. In terms of delay, the impact of peephole
optimization is negligible. This was to be expected, since our most powerful fanout optimization
algorithm does buffer selection optimally. However, in terms of area, peephole
optimization reduces the cost of fanout optimization from 15% down to 9%, which is a valuable
contribution to the overall performance of the fanout optimizer.
3.9 Conclusion
Instead of trying to find one fanout algorithm that would be applicable in all contexts, we
recommend the approach we have been following, of developing a spectrum of simple but
efficient fanout optimization algorithms based on different approaches: balanced trees,
LT-trees, combinational merging, top-down traversal. These algorithms are efficient enough to
make it practical to try them all on every fanout problem and retain the best solution in
all cases. In addition, some optimizations can be shared by all fanout algorithms if done
during a postprocessing optimization phase on fanout trees. These optimizations do not
affect delay, but can contribute significantly to area reduction.
The most important contribution of this chapter is to have demonstrated that,
at least in the context of fanout optimization, there is a simple way of applying a fanout
optimization algorithm to an entire network that is optimum in terms of delay and very
efficient in terms of area. This is a significant improvement over past methods which rely on
the identification and incremental improvement of critical paths. Our method is guaranteed
to produce the best delay achievable with a given fanout optimization algorithm and requires
only two passes on the network. In addition it achieves significant delay reduction for a very
moderate cost in area, observed in our experiments to be no more than 10% on average.
To minimize area under a delay constraint, we did not modify any of the fanout
algorithms. Rather, we simply applied every fanout algorithm to each fanout problem
and selected, among the fanout trees so obtained that met the delay constraint, one with
minimum area. Area reduction is achieved simply by using a spectrum of fanout algorithms,
including some that can only produce fairly simple fanout trees.
Chapter 4
Combining Tree Covering and Fanout Optimization
4.1 Introduction
In the previous two chapters, we covered separately two important delay
optimization techniques: tree covering and fanout optimization. The purpose of this chapter
is to study the problem of integrating these two optimizations. Figure 4.1 illustrates how
we have applied these optimizations so far. In gray are fanin trees, the parts of a circuit
where we apply tree covering to perform gate selection. In white are fanout trees, where we
apply fanout optimization. We can always combine tree covering and fanout optimization
as follows: we first use tree covering to implement the fanin trees, in one pass from primary
inputs to primary outputs. We then apply fanout optimization, in one pass from primary
outputs to primary inputs. We can run an additional fanout optimization pass to recover
area, but we will ignore area recovery for the moment. The question we would like to answer
in this chapter is: are there better ways to apply tree covering and fanout optimization to
a network that would lead to significant speed improvements?
[Figure 4.1: diagram labeled PRIMARY OUTPUTS, fanout optimization, tree covering, PRIMARY INPUTS]
We first introduce some definitions and terminology used in the rest of this chapter.
We partition a Boolean network into fanin trees and fanout trees. We group each tree into
a node. To each fanin tree corresponds a fanin node, and to each fanout tree corresponds
a fanout node. Fanin nodes are specified by a Boolean function that can be implemented
as a tree. Fanout nodes are simply specified by a source and a set of sinks. The polarity of
these sinks is usually not known before the fanin trees are implemented. Fanin nodes can be
implemented by tree covering, or by some form of restructuring followed by tree covering.
In this work, we only use tree covering, but the theoretical part of this chapter remains
valid if we use restructuring in addition to tree covering. Fanout nodes are implemented
using fanout optimization. In each case, we suppose that the implementation attempts
to minimize delay for a given set of arrival times for fanin nodes or required times for
fanout nodes. An implementation in which all delays from the leaves of the tree
to the root of the tree are equal is called a balanced implementation. We use unbalanced
implementations to compensate for the imbalance in arrival times or required times. The
problem we study in this chapter is the problem of finding a good order of application
of tree covering (possibly with restructuring) and fanout optimization to minimize delay.
We call this problem the tree-based delay optimization problem, since it is the problem of
minimizing delay while respecting the tree boundaries of a Boolean network.
In section 4.2 we formulate the tree-based delay optimization problem as a
convex optimization problem. This formulation is only valid for a continuous version of the
constant delay model of section 3.3.1. By abstracting away the discrete nature of delay
optimization in our setting, this formulation allows us to compute analytically the
minimum delay implementation of a few simple circuits. We can then use these examples to
detect potential sources of suboptimality in tree-based delay optimization algorithms. In
this section we exhibit a circuit that has the following property: for any constant $a > 0$,
there is an implementation of this circuit whose delay is $a$ delay units worse than an optimal
implementation and is such that all paths are critical and all fanin nodes and all fanout
nodes are implemented optimally. Such an implementation cannot be optimized by any
greedy application of tree covering or fanout optimization in any order. Due to physical
constraints, this example is only realistic for a limited range of values of $a$. Nevertheless it
indicates clearly the limitations of greedy delay improvement strategies.
This example is based on an initial implementation where arbitrarily unbalanced
implementations of fanin and fanout nodes compensate each other exactly. We can easily
avoid this problem by starting with a balanced initial implementation. To do so, we
implement all fanout trees using a balanced configuration prior to the first application of tree
covering. This technique is described in section 4.3. The main difficulty in building these
balanced fanout trees is to handle sink polarities properly. Since the fanout trees are built
before an implementation of fanin trees is available, sink polarities are not known. This
phase assignment problem has an important impact on the quality of the final
implementation.
Once we have built an initial implementation, we can iterate tree covering and
fanout optimization until we reach a local minimum. In section 4.4, we present a simple
iterative scheme that we can use to perform this iteration. This scheme consists of iterating
tree covering and fanout optimization passes on the network, using at the i-th iteration the
delay information computed at the (i-1)-th iteration. Our experimental results indicate that
there is almost no advantage in performing more than one iteration with this method. To
estimate the optimality of the final result, we applied this optimization scheme to another
simple circuit for which we can compute the optimal solution for a simple delay model.
On this circuit also the iterative algorithm converges almost immediately, and the result
obtained after one iteration is within a few percent of the optimum solution. This example
also suggests an improvement of the iterative algorithm, which converges more slowly but
reaches a solution that is within a fraction of a percent of the optimum. We conclude
section 4.4 with a brief overview of a new approach to tree-based minimization proposed by
Yoshikawa et al. [44].
The results of section 4.4 are a strong indication that we are reaching the limit
of what can be achieved with tree-based delay minimization. The purpose of section 4.5
is to propose two techniques to minimize delay that do not preserve tree boundaries but
are nevertheless simple variations of the tree covering and fanout optimization algorithms
proposed so far. The first of these techniques, tree duplication, allows the duplication of
a fanin node to implement the node both in positive and negative phases. In tree covering,
one phase is selected, and the other phase is provided by an inverter. With tree duplication,
both phases are implemented as separate trees, possibly with partial overlap. The rationale
behind this heuristic is to make available to the fanout optimizer the earliest possible source
for each polarity, possibly reducing by one the number of buffers on a critical path. Another
factor makes this optimization attractive: the possibility of avoiding unnecessary duplication
during the area recovery phase of fanout optimization. The second of these techniques, tree
overlap, is more radical. It allows tree covering to ignore tree boundaries. The example of
Figure 4.2 illustrates why allowing overlaps between trees can help reduce circuit delay.
The circuit shown in this example can be implemented as shown in solid lines, with three
2-input NAND gates and one inverter. Or it can be implemented as suggested by the
dotted lines, with two 3-input NAND gates. The second implementation is significantly
faster, and in that case happens to use less area. In general, however, allowing overlaps
tends to increase area. If the example of Figure 4.2 is modified to have a fanout of 10,
the implementation with ten 3-input NAND gates will still be faster, but would use more
area than an implementation with eleven 2-input NAND gates and one inverter. This
optimization has the effect of moving logic across fanout points in a manner reminiscent of
retiming [30].
The results obtained in this chapter indicate that we are reaching a limit to what
we can expect from tree-based technology mapping algorithms in terms of delay optimization.
This opens the way to the next chapter, where we examine the effect of
technology-independent optimizations that can modify the global structure of a network.
To model the effect of tree covering and restructuring on a fanin node, we use a
function $f(a_1, \ldots, a_n)$ that represents the arrival time achieved by an optimal
implementation of a fanin node v at the output of v given arrival times $(a_1, \ldots, a_n)$ at the inputs of
v. If the Boolean function computed by fanin node v is a Boolean AND of n inputs, and
if the library is composed of AND gates of constant delay equal to 1 and fanin k, where k
lies between 2 and some limit t, the optimal implementation of v can be obtained using
combinational merging. Moreover, the function $f$ can be computed exactly in that case,
and is given by the following formula [18, 22]:
This approximation ignores the discrete nature of tree covering and restructuring, and the
added irregularity of fanin nodes with asymmetric logic functions. However it captures the
essence of fanin tree balancing. It produces a delay of log n for balanced arrival times, and,
for unbalanced arrival times, allows late signals to traverse a fanin node for less delay at an
extra cost for the other signals. Moreover, the model has not been chosen arbitrarily, but
derived directly from the optimal solution of a fanin problem for a simple delay model. In
addition, this model has the following important property:
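We deduce from these equations, and from the fact that $\alpha_{ij} = \alpha_{ji}$, the following inequality,
which proves our result:
$$
\begin{aligned}
\sum_{i,j} \frac{\partial^2 f}{\partial a_i \, \partial a_j} \, y_i y_j
&= -\sum_{i \neq j} \alpha_{ij} \, y_i y_j + \sum_{i \neq j} \alpha_{ij} \, y_i^2 \\
&= \sum_{i<j} \alpha_{ij} \left( y_i^2 + y_j^2 - 2 y_i y_j \right) \\
&= \sum_{i<j} \alpha_{ij} \left( y_i - y_j \right)^2 \ge 0
\end{aligned}
$$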
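As a numerical illustration of this convexity result, the following sketch assumes that the continuous fanin model is the log-sum-exp function $f(a_1, \ldots, a_n) = \log \sum_i e^{a_i}$, which is the form the fanin nodes take in the two-output example later in this chapter. The function names are hypothetical, and the check samples random directions rather than proving anything.

```python
import math
import random

def fanin_arrival(arrivals):
    """Continuous fanin model (assumed): log of the sum of exp(arrival times)."""
    m = max(arrivals)  # shift by the maximum for numerical stability
    return m + math.log(sum(math.exp(a - m) for a in arrivals))

def hessian_quadratic_form(arrivals, y):
    """y^T H y for the Hessian H of fanin_arrival, using
    d2f/da_i da_j = delta_ij * p_i - p_i * p_j, p_i = exp(a_i) / sum_k exp(a_k)."""
    m = max(arrivals)
    w = [math.exp(a - m) for a in arrivals]
    s = sum(w)
    p = [wi / s for wi in w]
    mean = sum(pi * yi for pi, yi in zip(p, y))
    return sum(pi * yi * yi for pi, yi in zip(p, y)) - mean * mean

# Balanced arrival times: the delay through the node is log(n).
n = 8
balanced_delay = fanin_arrival([0.0] * n)

# A late signal traverses the node with less added delay than log(n).
late_delay = fanin_arrival([3.0] + [0.0] * (n - 1)) - 3.0

# The quadratic form stays non-negative for random points and directions.
rng = random.Random(0)
forms = [
    hessian_quadratic_form(
        [rng.uniform(-2, 2) for _ in range(5)],
        [rng.uniform(-1, 1) for _ in range(5)],
    )
    for _ in range(1000)
]
```

The quadratic form computed here is a weighted variance of the components of y, which is another way to see why it cannot be negative.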
The continuous approximation of $g$ is obtained in the same fashion as the continuous
approximation of $f$ in the previous section:
$$g(r_1, \ldots, r_n) = -\log \sum_{i=1}^{n} e^{-r_i}$$
Given a continuous model of the delay through fanin nodes and fanout nodes as the
result of tree covering and fanout optimization, we can proceed to formulate the tree-based
delay optimization problem as a global optimization problem. We assume that the Boolean
network to be optimized is decomposed into fanin nodes and fanout nodes. Without loss of
generality, we can suppose that the fanin nodes and fanout nodes are maximal, in the sense
that a fanin node is never connected to another fanin node, nor a fanout node to another
fanout node. If this were not the case, for example if two fanin nodes were connected, the
input node could always be collapsed into the output node.
We divide the edges or wires of the network into four sets: PI, PO, LEAF and
ROOT. Each wire only connects two nodes. Multiple fanouts are handled exclusively
through fanout nodes. The PI wires are wires directly connected to primary inputs.
Associated with each of the PI wires is an arrival time. Arrival times are represented as a
vector a of dimension |PI|. Similarly, the PO wires are the wires directly
connected to primary outputs. Associated with each PO wire is a required time. Required
times are represented as a vector r of dimension |PO|. The LEAF wires are the wires that
connect an output of a fanout node to an input of a fanin node. Arrival times on these
wires are represented by a vector x of dimension |LEAF|. Finally the ROOT wires are
the wires connecting the output of a fanin node to the input of a fanout node. Figure 4.3
illustrates these definitions.
For each fanin node, we suppose that we have at our disposal a function that
computes the best achievable arrival time at the output of the node, for any feasible
implementation of the node. This function only depends on x and a. To simplify the notation,
we designate by $f$ the vector of such functions, and we keep implicit the dependency on the
vector of arrival times a, considered constant. Thus
$f : \mathbb{R}^{|LEAF|} \to \mathbb{R}^{|PO|} \times \mathbb{R}^{|ROOT|}$. All
the components of $f$ are convex functions.
We denote by $\pi_{PO}$ the orthogonal projection of a vector of
$\mathbb{R}^{|PO|} \times \mathbb{R}^{|ROOT|}$ onto $\mathbb{R}^{|PO|}$,
and similarly $\pi_{ROOT}$ designates the projection onto $\mathbb{R}^{|ROOT|}$. For a given assignment of
arrival times x at the LEAF wires, $\pi_{PO}(f(x))$ designates the best achievable arrival times
at the primary outputs of the network. For $1 \le p < +\infty$, $|x|_p$ designates the quantity
$\left( \sum_{k=1}^{n} |x_k|^p \right)^{1/p}$, $|x|_\infty$ designates $\max_k |x_k|$, and $x_+$ designates
the vector of components $(\max(x_k, 0))_{1 \le k \le n}$. The
total amount by which an implementation fails to meet its timing requirements is given
by $\left| (\pi_{PO}(f(x)) - r)_+ \right|_1$ and the maximum amount by which a requirement fails is given
by $\left| (\pi_{PO}(f(x)) - r)_+ \right|_\infty$. In both cases, the convexity of the components of $f$ implies the
convexity of the cost function.
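The two cost functions can be sketched in a few lines. This is only an illustration of the definitions above: the function names and the sample numbers are hypothetical, and the arrival times at the primary outputs are taken as given rather than computed from a network.

```python
import math

def positive_part(v):
    """Component-wise max(v_k, 0)."""
    return [max(vk, 0.0) for vk in v]

def p_norm(v, p):
    """|v|_p for 1 <= p < inf, and max_k |v_k| for p = math.inf."""
    if p == math.inf:
        return max((abs(vk) for vk in v), default=0.0)
    return sum(abs(vk) ** p for vk in v) ** (1.0 / p)

def timing_violation(po_arrivals, required, p):
    """|(arrivals - required)_+|_p: how far the outputs miss their required times."""
    deficit = positive_part([a - r for a, r in zip(po_arrivals, required)])
    return p_norm(deficit, p)

# Hypothetical numbers: three primary outputs, two of them late.
arrivals = [5.0, 3.0, 7.5]
required = [4.0, 4.0, 6.0]
total_violation = timing_violation(arrivals, required, 1)         # 1.0 + 0.0 + 1.5
worst_violation = timing_violation(arrivals, required, math.inf)  # max(1.0, 0.0, 1.5)
```

An output that meets its required time contributes nothing to either cost, which is the role of the positive part.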
Similarly, for each fanout node, we suppose that we have at our disposal a function
that computes the best achievable required time at the input of the node, for a given set of
required times at the outputs. This function only depends on x and the constant vector r.
As for primary input arrival times, to simplify notation, we keep implicit the dependency
on the primary output required times. We designate by $g$ the vector of these functions:
$g : \mathbb{R}^{|LEAF|} \to \mathbb{R}^{|PI|} \times \mathbb{R}^{|ROOT|}$. A LEAF wire assignment x is realizable if it satisfies the
following inequalities: $\pi_{PI}(g(x)) \ge a$ and $\pi_{ROOT}(g(x)) \ge \pi_{ROOT}(f(x))$.
[Figure 4.3: a network decomposed into fanin nodes and fanout nodes between primary inputs
and primary outputs; the legend distinguishes fanout nodes, fanin nodes, PI or PO wires, and
ROOT wires.]
$$
\begin{aligned}
\min_{x \in \mathbb{R}^{|LEAF|}} \quad & \left| \left( \pi_{PO}(f(x)) - r \right)_+ \right|_\infty \\
\text{subject to} \quad & \pi_{PI}(g(x)) \ge a \\
& \pi_{ROOT}(g(x)) \ge \pi_{ROOT}(f(x))
\end{aligned}
$$
Each constraint in the problem is of the form $g \ge f$, where $g$ is a concave function and
$f$ a convex function. The set of points satisfying the constraints is thus convex, and the
problem has been expressed as the problem of minimizing a convex function over a convex
set.
[Figure 4.4: example circuit with input arrival times a1, a2, leaf wires x1 to x4, and
outputs r1, r2.]
$$
\begin{aligned}
r_1 &= \log(e^{x_1} + e^{x_2}) \\
r_2 &= \log(e^{x_3} + e^{x_4}) \\
a_1 &= -\log(e^{-x_1} + e^{-x_3}) \\
a_2 &= -\log(e^{-x_2} + e^{-x_4})
\end{aligned}
$$
where $(a_1, a_2)$ are given arrival times at the inputs of the network. To simplify the
computation, we use the following change of variables:
$$R_i = e^{r_i}, \qquad A_i = e^{a_i}, \qquad X_i = e^{x_i}$$
The problem is then reduced to finding the minimal pairs $(R_1, R_2)$ satisfying the following
two equations:
$$
\begin{aligned}
R_1 &= X_1 + X_2 \\
R_2 &= \frac{A_1 X_1}{X_1 - A_1} + \frac{A_2 X_2}{X_2 - A_2}
\end{aligned}
$$
The minimal points must be such that
$\left( \frac{\partial R_1}{\partial X_1}, \frac{\partial R_1}{\partial X_2} \right)$ and
$\left( \frac{\partial R_2}{\partial X_1}, \frac{\partial R_2}{\partial X_2} \right)$ are colinear, otherwise
it would be possible to decrease both $R_1$ and $R_2$ and still stay in the feasibility region.
The first vector is equal to $(1, 1)$ and the second vector to
$\left( -\frac{A_1^2}{(X_1 - A_1)^2}, -\frac{A_2^2}{(X_2 - A_2)^2} \right)$. Thus the
minimal points are characterized by the equation:
$$\frac{X_1}{A_1} = \frac{X_2}{A_2}$$
A parametric representation of the set of minimum points is then readily available, using
$T = X_1 / A_1$ as parameter. The range of T is $1 < T < \infty$.
$$
\begin{aligned}
R_1 &= T \, (A_1 + A_2) \\
R_2 &= \frac{T}{T - 1} \, (A_1 + A_2)
\end{aligned}
$$
This closed form indicates that the behavior of this circuit is equivalent to the behavior of
a single fanout node with an input arrival time of $\log(e^{a_1} + e^{a_2})$.
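The closed form can be checked numerically. The sketch below works in the exponential variables $X_i = e^{x_i}$, $A_i = e^{a_i}$, $R_i = e^{r_i}$, in which the fanin nodes read $R_1 = X_1 + X_2$, $R_2 = X_3 + X_4$ and the fanout nodes read $1/A_1 = 1/X_1 + 1/X_3$, $1/A_2 = 1/X_2 + 1/X_4$. The function names and the sample arrival times are hypothetical.

```python
import math

def leaf_assignment(T, A1, A2):
    """Leaf wire values on the minimal curve, with T = X1/A1 = X2/A2 and T > 1."""
    X1, X2 = T * A1, T * A2
    # Fanout constraints 1/A1 = 1/X1 + 1/X3 and 1/A2 = 1/X2 + 1/X4 solved for X3, X4.
    X3 = A1 * X1 / (X1 - A1)
    X4 = A2 * X2 / (X2 - A2)
    return X1, X2, X3, X4

def outputs(T, A1, A2):
    X1, X2, X3, X4 = leaf_assignment(T, A1, A2)
    return X1 + X2, X3 + X4  # R1, R2

A1, A2 = math.exp(0.5), math.exp(2.0)  # arbitrary input arrival times a1, a2
checks = []
for T in (1.25, 2.0, 5.0):
    R1, R2 = outputs(T, A1, A2)
    checks.append((
        abs(R1 - T * (A1 + A2)),
        abs(R2 - T / (T - 1) * (A1 + A2)),
        # Same relation as a single fanout node with input arrival log(A1 + A2):
        abs(1 / R1 + 1 / R2 - 1 / (A1 + A2)),
    ))

# Equal output arrival times are obtained at T = 2, where R1 = R2 = 2 (A1 + A2).
R1_bal, R2_bal = outputs(2.0, A1, A2)
```

The third residual is the single-fanout-node relation $1/R_1 + 1/R_2 = 1/(A_1 + A_2)$, which is what the equivalence stated above amounts to in these variables.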
With the example of the previous section, we can exhibit a family of circuit
implementations for the same network that are arbitrarily far from the optimum solution, but yet
have the property that every node, taken in isolation, is configured optimally with respect
to the rest of the network. In other words, any algorithm that works greedily by changing
the circuit implementation one node at a time or a group composed of a fanin node and its
fanout node at a time, can enter a global implementation where every single node or group
appears optimal with respect to the rest of the circuit, but the overall delay through the
circuit is arbitrarily far from the optimal solution.
This family of implementations is parametrized by one parameter, designated as
$\alpha$. Given $\alpha$, we consider the following wire assignment to the circuit of Figure 4.4, where
we now suppose that the arrival times and the required times are all equal to 0:
$$
\begin{aligned}
x_1 &= \log(1 + e^{\alpha}) && (4.6) \\
x_2 &= \log(1 + e^{-\alpha}) && (4.7) \\
x_3 &= \log(1 + e^{-\alpha}) && (4.8) \\
x_4 &= \log(1 + e^{\alpha}) && (4.9)
\end{aligned}
$$
Since we have:
$$
-\log(e^{-x_1} + e^{-x_3}) = -\log\left( \frac{1}{1 + e^{\alpha}} + \frac{1}{1 + e^{-\alpha}} \right) = 0
$$
this leaf wire assignment is optimal with respect to the two fanout nodes. The best
primary output arrival times achievable with this wire assignment are equal, for both primary
outputs, to the value:
$$
\log(e^{x_1} + e^{x_2}) = \log(1 + e^{\alpha} + 1 + e^{-\alpha}) = \log 2 + \log(1 + \cosh(\alpha)) \qquad (4.10)
$$
which can be made arbitrarily far from an optimal solution. Solutions arbitrarily far from
the optimum are not realized in practice due to physical limitations. This result simply
indicates that greedy tree-based delay optimization algorithms are unable to recover from
certain initial implementations, no matter how suboptimal these implementations are.
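The following sketch evaluates this family numerically under the continuous model, with all arrival and required times equal to 0. It assumes the leaf assignment $x_1 = x_4 = \log(1 + e^{\alpha})$ and $x_2 = x_3 = \log(1 + e^{-\alpha})$, where the value of $x_3$ is the one forced by the fanout constraint on the first input. The optimum $\log 4$ comes from the closed form of the previous section with $A_1 = A_2 = 1$ and $T = 2$. Function names are hypothetical.

```python
import math

def stuck_assignment(alpha):
    """Leaf wire assignment of the family (x3 obtained from the fanout constraint)."""
    hi = math.log1p(math.exp(alpha))    # log(1 + e^alpha)
    lo = math.log1p(math.exp(-alpha))   # log(1 + e^-alpha)
    return hi, lo, lo, hi               # x1, x2, x3, x4

def fanout_required(xa, xb):
    """Continuous fanout model: required time at the source for sinks xa, xb."""
    return -math.log(math.exp(-xa) + math.exp(-xb))

def fanin_arrival(xa, xb):
    """Continuous fanin model: arrival time at the output for inputs xa, xb."""
    return math.log(math.exp(xa) + math.exp(xb))

OPTIMUM = math.log(4.0)  # balanced solution, T = 2 with A1 = A2 = 1

rows = []
for alpha in (0.0, 1.0, 3.0, 6.0):
    x1, x2, x3, x4 = stuck_assignment(alpha)
    slack1 = fanout_required(x1, x3)   # equals the input arrival time 0
    slack2 = fanout_required(x2, x4)
    r1 = fanin_arrival(x1, x2)
    r2 = fanin_arrival(x3, x4)
    rows.append((alpha, slack1, slack2, r1, r2, r1 - OPTIMUM))
```

Each row holds the parameter, the two fanout residuals, the two output arrival times, and the gap to the optimum. The gap is zero only for the balanced case and grows roughly linearly in the parameter for large values.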
In addition, if we take as initial implementation an implementation where the
fanout nodes are implemented optimally relative to the wire values given by equations 4.6
to 4.9, and the fanin trees are implemented arbitrarily, a one-pass application of tree covering
to this implementation will produce a configuration that cannot be optimized any further
and whose distance to the optimal solution is given by equation 4.10. Only if the initial
implementation is balanced would the result be optimal.
These questions arise from the fact that the heuristic is to be used during tree covering.
Since the implementation of fanin nodes is not known at this point, neither the polarity
nor the load at the outputs of fanout nodes is known when the delay estimation heuristic
is called.
inputs to primary outputs. Then, if n is the number of fanouts of v, we build a balanced
fanout tree with n sinks to implement v. The fanout tree is selected by the balanced tree
algorithm presented in section 3.4.3. The loads of the sinks are taken to be all equal to
some generic value (we use the input load of a minimum size 2-input NAND gate).
The computation of this fanout tree is done twice, once by supposing that all sinks
are of positive polarity, once by supposing that all sinks are of negative polarity. Among
these two trees, the fanout tree with the earlier output arrival time is stored at node v, and
the other tree is discarded. The fanout tree stored at node v in this fashion is called $T_v$.
We use $T_v$ to compute the arrival time at any output of v.
When tree covering is applied to a fanin node in the fanout of v, we need to
compute the arrival time at gate inputs p connected to v. To do so, we compute the delay
through the fanout tree $T_v$ stored at v, adding to the load driven by the buffer connected
to p the difference between the load at p and the generic sink load value we used to build
$T_v$. This computation is done irrespective of the polarity of p.
Table 4.1: Effect of Taking Polarities Into Account in Fanout Delay Heuristic
Conclusion. In summary, we found that the best delay heuristic estimates the delay
through a fanout node using a balanced fanout tree and ignores sink polarities. This
heuristic also takes into account variations in sink loads, though we have not assessed
the effectiveness of this technique, on the grounds that it is straightforward to implement
and we expect it to have only a second order effect on delay. In the results presented in the
remainder of this thesis, we apply this heuristic during the initial tree covering phase.
In this section we present two iterative delay optimization algorithms. The first
algorithm, presented in section 4.4.1, simply iterates tree covering and fanout optimization.
We call it the simple iterative improvement algorithm. Experimentally this algorithm
converges very rapidly, and the result obtained after one iteration is almost as good as the
final result. Unfortunately these experiments do not provide any information about the
optimality of the final result. To gain some insight into possible sources of suboptimality,
we introduce in section 4.4.2 a simple network for which we can compute an optimal
implementation under the continuous delay model of section 4.2. We simulate the effect of the
simple iterative improvement algorithm on this network under the continuous delay model.
The solution obtained by this algorithm is suboptimal, but only within a few percent of the
optimum solution. Finally, in section 4.4.3 we discuss briefly a new technique proposed by
Yoshikawa et al. [44] to perform iterative improvements that uses a different approach.
Table 4.2: Effect of Using a Logarithmic vs. Linear Delay Estimate
logarithmic: use a logarithmic model to estimate delay through fanout node
linear: use a linear delay to estimate delay through fanout node
gain: increase in area or in delay obtained by using a linear delay estimate
area: area of the circuit (MCNC lib2 data divided by common divisor 464)
delay: delay of the circuit (MCNC lib2 data in nanoseconds)
aver: geometric average of the gains
algorithm iterative_improvement
    foreach node v visited in topological order from inputs to outputs do {
        if v is a fanin node {
            apply tree covering
        } else {
            apply fanout optimization, taking all required times to be equal
        }
    }
    do {
        foreach fanout node v visited in topological order from outputs to inputs do {
            apply fanout optimization
        }
        foreach fanin node v visited in topological order from inputs to outputs do {
            apply tree covering
        }
    } until network delay does not decrease
    foreach fanout node v visited in topological order from outputs to inputs do {
        apply fanout optimization in area recovery mode
    }
end iterative_improvement
The simple iterative improvement algorithm is sketched in Figure 4.5. After an initial
implementation has been built with tree covering, using the heuristic of section 4.3 to
estimate delay through fanout nodes, the algorithm iterates fanout optimization and tree
covering. As long as fanout optimization is done in topological order from outputs to inputs,
fanout problems do not interact with each other, and the optimal solution with respect to
fanout optimization can be achieved. During tree covering however, we need to evaluate
the delay through fanout nodes. For that purpose, we use the fanout trees built at the
previous fanout optimization pass. In this case, we need to take into account the polarity
at which a signal is needed at a gate input, otherwise we obtain worse results. The results
obtained by this algorithm after three iterations are reported in Table 4.3 and compared
with the results obtained with only one iteration. The advantage of iterating is negligible
on average, with a decrease in delay of 1% for no cost in area. In some examples, the delay
actually increases. This can be explained by the imprecision introduced by different rise
and fall delays. There is no observable decrease in delay after the third iteration.
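The control flow of the iteration can be sketched as follows. This is not the implementation used for the experiments: tree covering, fanout optimization and delay evaluation are passed in as hypothetical callbacks, the initial implementation is assumed to exist already, and the `max_iterations` guard is an addition for safety.

```python
from typing import Callable, List

def iterative_improvement(
    fanin_nodes: List[str],                        # topological order, inputs to outputs
    fanout_nodes: List[str],                       # topological order, inputs to outputs
    cover: Callable[[str], None],                  # tree covering of one fanin node
    optimize_fanout: Callable[[str, bool], None],  # (node, area_recovery_mode)
    network_delay: Callable[[], float],
    max_iterations: int = 10,
) -> int:
    """Alternate fanout optimization and tree covering until the network delay
    stops decreasing, then run one area recovery pass. Returns the iteration count."""
    best = network_delay()
    iterations = 0
    while iterations < max_iterations:
        for v in reversed(fanout_nodes):   # outputs to inputs
            optimize_fanout(v, False)
        for v in fanin_nodes:              # inputs to outputs
            cover(v)
        iterations += 1
        delay = network_delay()
        if delay >= best:
            break
        best = delay
    for v in reversed(fanout_nodes):       # final pass in area recovery mode
        optimize_fanout(v, True)
    return iterations

# Toy stand-ins so the control flow can be exercised without a real mapper.
delays = iter([10.0, 9.0, 8.9, 8.9])
log = []
n_iter = iterative_improvement(
    fanin_nodes=["f1", "f2"],
    fanout_nodes=["o1"],
    cover=lambda v: log.append(("cover", v)),
    optimize_fanout=lambda v, area: log.append(("fanout", v, area)),
    network_delay=lambda: next(delays),
)
```

With the toy delay sequence, the loop stops at the third iteration, the first one that brings no decrease.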
$$R_i = e^{r_i}, \qquad A_i = e^{a_i}, \qquad X_i = e^{x_i}$$
$$
\begin{aligned}
R_1 &= X_1 + X_2 \\
R_2 &= X_5 + X_6 \\
R_3 &= X_7 + X_8 \\
C_1 &= \frac{1}{X_1} + \frac{1}{X_3} + \frac{1}{X_5} - \frac{1}{A_1} = 0 \\
C_2 &= \frac{1}{X_2} + \frac{1}{X_4} + \frac{1}{X_8} - \frac{1}{A_2} = 0 \\
C_3 &= \frac{1}{X_6} + \frac{1}{X_7} - \frac{1}{X_3 + X_4} = 0
\end{aligned}
$$
Since the optimization problem we are trying to solve is convex, the Pareto optimal points
have a simple characterization. For each Pareto optimal point p, there is a hyperplane H
containing p such that all points satisfying the constraints $(C_1, C_2, C_3)$ are on the same side
of the hyperplane H. That is, for each Pareto optimal point p there is a triplet $(a, b, c)$ such
that p is a solution of the following convex optimization problem:
$$
\begin{aligned}
\min_{X} \quad & a R_1(X) + b R_2(X) + c R_3(X) \\
\text{subject to} \quad & C_i(X) = 0, \quad 1 \le i \le 3
\end{aligned}
$$
Eliminating the Lagrange multipliers from the stationarity conditions of this problem gives:
$$
\begin{aligned}
X_1 X_6 X_8 &= X_2 X_5 X_7 \\
X_3 X_6 &= X_5 (X_3 + X_4) \\
X_4 X_7 &= X_8 (X_3 + X_4)
\end{aligned}
$$
Substituting the last two equations into the first, this system is equivalent to:
$$
\begin{aligned}
X_1 X_4 &= X_2 X_3 \\
X_3 X_6 &= X_5 (X_3 + X_4) \\
X_4 X_7 &= X_8 (X_3 + X_4)
\end{aligned}
$$
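We can derive from these equations a parametric representation of the set of Pareto points
for this example. We take as parameters the following quantities:
$$T = \frac{X_1}{X_2} = \frac{X_3}{X_4}, \qquad U = \frac{1}{X_2}$$
$$
\begin{aligned}
R_1 &= \frac{T + 1}{U} \\
R_2 &= \frac{3 (2T + 1)}{\frac{2T}{A_1} - \frac{1}{A_2} - U} \\
R_3 &= \frac{3 (T + 2)}{\frac{2}{A_2} - \frac{T}{A_1} - U}
\end{aligned}
$$
The values of the intermediate variables can be rederived from the following equations:
$$
\begin{aligned}
\frac{1}{X_2} &= U \\
\frac{1}{X_4} &= \frac{1}{3} \left( \frac{T}{A_1} + \frac{1}{A_2} - 2U \right) \\
\frac{1}{X_5} &= \frac{1}{3T} \left( \frac{2T}{A_1} - \frac{1}{A_2} - U \right) \\
\frac{1}{X_8} &= \frac{1}{3} \left( \frac{2}{A_2} - \frac{T}{A_1} - U \right) \\
X_1 &= T X_2 \\
X_3 &= T X_4 \\
X_6 &= \left( 1 + \frac{1}{T} \right) X_5 \\
X_7 &= (1 + T) X_8
\end{aligned}
$$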
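The parametrization can be checked numerically. The sketch below rests on assumptions that the source text does not spell out legibly: the fanout node of input 1 drives $X_1$, $X_3$, $X_5$, the fanout node of input 2 drives $X_2$, $X_4$, $X_8$, the inner fanin node computes $X_3 + X_4$ and drives $X_6$, $X_7$, and the parameters are $T = X_1/X_2$ and $U = 1/X_2$. Under these assumptions the constraints are satisfied, and for $a_1 = a_2 = 0$ the balanced point reproduces the optimum arrival time 2.398 listed in the first row of the results table for this example. Function names are hypothetical.

```python
import math

def pareto_point(T, U, A1, A2):
    """Leaf values X1..X8 and outputs (R1, R2, R3) for parameters T and U."""
    X2 = 1.0 / U
    X4 = 3.0 / (T / A1 + 1.0 / A2 - 2.0 * U)
    X5 = 3.0 * T / (2.0 * T / A1 - 1.0 / A2 - U)
    X8 = 3.0 / (2.0 / A2 - T / A1 - U)
    X1, X3 = T * X2, T * X4
    X6, X7 = (1.0 + 1.0 / T) * X5, (1.0 + T) * X8
    X = (X1, X2, X3, X4, X5, X6, X7, X8)
    return X, (X1 + X2, X5 + X6, X7 + X8)

def constraint_residuals(X, A1, A2):
    X1, X2, X3, X4, X5, X6, X7, X8 = X
    return (
        1 / X1 + 1 / X3 + 1 / X5 - 1 / A1,   # fanout node driven by input 1 (assumed)
        1 / X2 + 1 / X4 + 1 / X8 - 1 / A2,   # fanout node driven by input 2 (assumed)
        1 / X6 + 1 / X7 - 1 / (X3 + X4),     # fanout node driven by the inner fanin node
    )

# a1 = a2 = 0, so A1 = A2 = 1. By symmetry take T = 1 and choose U so that R1 = R2.
A1 = A2 = 1.0
T = 1.0
U = (T + 1) / (7 * T + 4) * (2 * T / A1 - 1 / A2)
X, (R1, R2, R3) = pareto_point(T, U, A1, A2)
residuals = constraint_residuals(X, A1, A2)
arrival = math.log(max(R1, R2, R3))

# A generic, unbalanced point also satisfies the three constraints.
X_gen, _ = pareto_point(1.5, 0.1, 2.0, 1.0)
residuals_gen = constraint_residuals(X_gen, 2.0, 1.0)
```

At the balanced point all three outputs equal 11, so the arrival time is $\log 11$.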
   init       optimum       p = 0.0            p = 0.5            p = 0.9
 a1     a2    arrival    arrival  # iter    arrival  # iter    arrival  # iter
 0.0    0.0   2.398      2.485    1         2.402    12        2.398    79
 2.0    0.0   3.885      3.968    3         3.899    13        3.892    84
 4.0    0.0   5.805      5.885    2         5.821    17        5.813    79
$$U = \frac{T + 1}{7T + 4} \left( \frac{2T}{A_1} - \frac{1}{A_2} \right)$$
by the algorithm. We modified the iterative algorithm to use a rate of $r_i = (1 - p^i)$ at the
i-th iteration, where $0 \le p < 1$. The original algorithm corresponds to the special case
of p = 0. By decreasing r towards 0, the iterative algorithm is given more opportunity
to distribute imbalance corrections properly throughout the network, at the cost of more
iterations. The results for p = 0.0, p = 0.5 and p = 0.9 are given in Table 4.4.
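To make the rate schedule concrete, the short sketch below tabulates $r_i = 1 - p^i$ for the three values of p used in the experiments. It only illustrates the formula; how the rate enters each update is described in the text and is not modeled here, and the function name is hypothetical.

```python
def rate(i, p):
    """Rate used at the i-th iteration (i >= 1): r_i = 1 - p**i, with 0 <= p < 1."""
    return 1.0 - p ** i

# First five rates for each setting of p.
schedules = {p: [rate(i, p) for i in range(1, 6)] for p in (0.0, 0.5, 0.9)}
```

For p = 0 the rate is 1 from the first iteration on, which is the original algorithm. For p = 0.9 the rate starts near 0.1 and approaches 1 only slowly, which matches the larger iteration counts reported for that setting.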
The results obtained with p = 0.0 confirm our earlier experimental results that the
simple iterative improvement algorithm converges very rapidly. In all cases, the results for
p = 0.0 after only one iteration (our default iterative improvement algorithm) are within
2.5% of the optimum. By increasing p, we were able to improve further the quality of the
final result at the cost of more iterations to reach a good quality result. For p = 0.9, the
results after one iteration are only within 27% of the optimum, while the results after 100
iterations are within 0.2% of the optimum in all three examples.
These results provide, in a limited way, some solid evidence of the effectiveness of
our simple iterative improvement algorithm, and confirmation that one iteration is sufficient
to obtain good quality results.
The second of these optimizations, presented in section 4.5.2, allows the overlap
of the implementation of fanin trees as shown in Figure 4.2 in order to reduce delay. This
optimization can decrease delay significantly, but may lead to substantial area increases.
We showed in section 4.3 the role of phase assignment on the quality of tree-based
delay optimization. By biasing tree covering in favor of a given polarity, we could slow down
a circuit by an average 11%. By using a heuristic that attributes the same arrival times
to signals of different polarities, we essentially make sure that the best phase is used for
any given tree in the absence of any accurate information concerning the arrival times of
signals. If all signals at the output of a fanout node are required with the same phase, there
is no problem. This is not the case in general. If a signal is required under both polarities,
at least one critical sink will receive the signal delayed by one inverter.
The problem occurs because we limit ourselves to one covering per fanin node. If
we allow tree duplication, we can cover each fanin node twice, each cover producing the
signal in a different polarity. These two trees may overlap and share some logic if there is
no advantage in duplicating them further. This optimization has two advantages:
• in the case of small fanouts with signals needed in different polarities, it can remove
one inverter delay if both trees can produce the signal with similar arrival times.
• for large fanouts, it decreases the need for deeper fanout trees by providing an
additional source that can provide signals to one half of the sinks.
On the other hand, this optimization has two potential drawbacks: it may be wasteful in
area and it does not preserve testability [38]. Unnecessary logic duplication can be controlled
easily using the same technique that we use during fanout optimization. In the first pass
of fanout optimization, we select the best solution at each node, which may require the use
of tree duplication on the source node. In the second pass of fanout optimization, the area
recovery pass, we can eliminate one cover of the source node if this transformation does
not slow down the circuit. Removing redundancies is a more time consuming operation,
but can only decrease delay and area, provided that there are no false paths in the circuit
[33, 25].
We have implemented a simple version of this optimization. Our implementation
has the following limitations: it does not take into account the cost of tree duplication in
terms of extra fanout loads at the inputs of a fanin node; it uses a simple-minded allocation
of sinks to the two sources made available by tree duplication; and it does not perform
redundancy removal after tree duplication.
Given two sources, $S$ and $\bar{S}$, providing the same signal under differing polarities,
for a fanout node, we perform fanout optimization as follows. We partition the sinks into
two sets: the set P of sinks with positive polarity, and the set N of sinks with negative
polarity. We try all 4 possible assignments of P and N to $S$ and $\bar{S}$ and perform fanout
optimization in all 4 cases; i.e. we consider implementing the problem with $S$ alone, $\bar{S}$ alone,
or both $S$ and $\bar{S}$. The solution with the smallest delay is retained. In case of equality, the
solution which uses only one source is chosen and one tree is discarded.
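The selection rule can be sketched as follows. The delay of each assignment would come from running fanout optimization on it. Here it is supplied by a hypothetical callback, and the source labels `"S"` and `"S_bar"` are placeholders.

```python
from itertools import product
from typing import Callable, Tuple

# (source feeding the positive sinks P, source feeding the negative sinks N)
Assignment = Tuple[str, str]

def best_source_assignment(
    delay_of: Callable[[Assignment], float],
) -> Tuple[Assignment, float]:
    """Try the 4 assignments of P and N to the two sources and keep the fastest.
    On equal delay, prefer an assignment that uses a single source."""
    candidates = list(product(("S", "S_bar"), repeat=2))

    def key(assignment: Assignment):
        uses_two_sources = assignment[0] != assignment[1]
        return (delay_of(assignment), uses_two_sources)

    best = min(candidates, key=key)
    return best, delay_of(best)

# Toy delays standing in for the result of fanout optimization on each case.
toy_delays = {
    ("S", "S"): 5.0,
    ("S_bar", "S_bar"): 5.5,
    ("S", "S_bar"): 5.0,   # ties with the single-source solution
    ("S_bar", "S"): 6.0,
}
choice, delay = best_source_assignment(toy_delays.__getitem__)
```

In the toy data the two-source solution ties with a single-source one, so the single-source assignment is kept and the second tree can be discarded.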
We have implemented this optimization, and report the results of our experiments
in Table 4.5. We achieved an average delay reduction of 3% for an average area increase of
6%. Additional delay reductions should be achievable by using a better sink assignment
algorithm, and a more flexible tree duplication policy, allowing in particular the duplication
of sources of the same polarity.
tree-based delay optimization, but more work is required to control the penalty in area.
Combining tree overlap with the tree duplication algorithm of section 4.5.1 has no effect,
since allowing overlaps allows in particular the duplication of trees in different phases.
Limiting Overlaps. It is possible to reduce the increase in area by limiting the overlap
between trees. A simple way to enforce this limit is to allow overlaps to take place only
over nodes that have a fanout of K or less, for some constant K. This simple technique
is an efficient way to reclaim area because most of the delay reduction can be obtained by
allowing overlaps over nodes with small fanouts, and a large fraction of the area increase is
due to overlaps allowed over nodes with large fanouts. The results of allowing tree overlaps
only on nodes with five or fewer fanouts are reported in Table 4.7. By limiting overlaps, we
achieved an average delay reduction of 8% for a cost in area of 28%.
4.6 Conclusion
Table 4.7: Effect of allowing tree overlaps only on nodes with five or fewer fanouts

circuit        no overlap         overlap-5            gain
              area    delay      area    delay     area   delay
C1355         1192    24.21      1967    22.20     1.65    0.92
C1908         1358    28.24      1890    26.59     1.39    0.94
C2670         1796    20.86      2400    20.53     1.34    0.98
C3540         2857    32.88      3799    31.51     1.33    0.96
C5315         3693    29.31      5494    26.56     1.49    0.91
C6288         8272    95.47     14534    87.28     1.76    0.91
C7552         5322    27.83      8254    26.98     1.55    0.97
alu4          1977    29.24      2542    28.24     1.29    0.97
ampbpreg      3289    35.56      4255    30.45     1.29    0.86
ampbsm        1875    16.55      2229    14.24     1.19    0.86
amppint2      1353    11.29      1473     9.92     1.09    0.88
ampxhdl       1059    12.90      1244    11.82     1.17    0.92
apex6         1912    11.29      2213    10.23     1.16    0.91
des           8632    16.08      9869    16.12     1.14    1.00
dflgrcb1       730    10.39       879     9.41     1.20    0.91
fconrcb1       537    11.00       667     9.55     1.24    0.87
frg2          2367    13.17      2437    11.95     1.03    0.91
k2            2755    16.87      3891    15.10     1.41    0.90
kcctlcb3       557     9.26       684     8.09     1.23    0.87
pair          3956    16.40      4565    15.30     1.15    0.93
rot           1651    19.17      1997    17.17     1.21    0.90
sbiucb1        599    15.66       820    15.00     1.37    0.96
tfaultcb1      439     6.80       527     6.57     1.20    0.97
vda           1522    13.31      2057    12.20     1.35    0.92
x3            2033    10.97      2312    10.62     1.14    0.97
aver             -        -         -        -     1.28    0.92
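The gain columns appear to be the ratio of the overlap-5 value to the no-overlap value. A short check on three rows copied from the table, under that assumption (the helper name `gains` is hypothetical):

```python
# (area, delay) without overlap, then (area, delay) with overlap-5.
rows = {
    "C1355": (1192, 24.21, 1967, 22.20),
    "C6288": (8272, 95.47, 14534, 87.28),
    "frg2": (2367, 13.17, 2437, 11.95),
}


def gains(area0, delay0, area1, delay1):
    """Return (area ratio, delay ratio), rounded to two decimals."""
    return round(area1 / area0, 2), round(delay1 / delay0, 2)


for name, values in rows.items():
    print(name, gains(*values))
```

A ratio above 1.0 in the area column is an area increase, and a ratio below 1.0 in the delay column is a delay reduction.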
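[Figure 4.7: area-delay tradeoff. Axes: delay and area (normalized, with ticks at 1.0 and 2.0 on the area axis); labeled point: minimum area.]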
have also proposed to use logic duplication to provide a signal in both phases at a multiple
fanout point and shown that this technique can lead to additional delay reductions at little
cost in area. More work needs to be done to preserve the testability of the circuit during
this transformation.
In total, we have provided several methods to perform technology mapping, which
provide a wide tradeoff between delay and area:
• minimum delay tree covering with limited overlaps and fanout optimization.
The average effect of these four methods is indicated in Figure 4.7. In area, all data are
relative to the minimum area mapping. In delay, the data are relative to the minimum delay
tree covering with overlaps and fanout optimization.
Chapter 5

Technology Independent Delay Optimizations
5.1 Introduction
In the previous chapter, we presented several techniques for the efficient integration
of tree covering and fanout optimization and provided empirical evidence of the efficiency
of some of these methods. The purpose of this chapter is to examine the effect of technology
independent logic transformations on circuit delay. We do not introduce any new technology
independent algorithms to reduce delay. The originality and interest of this study come
from the fact that we now have at our disposal an efficient technology mapper on which
we can rely to estimate delay. Similar data previously reported in the literature are usually
of limited accuracy because they do not take into account the corrective effect of powerful
optimization techniques such as fanout optimization.
The first step of this empirical study is to measure the effect of literal count
minimization on circuit delay and area. Literal count minimization has been used as the main
objective of technology independent optimization in logic synthesis because it correlates
well with final circuit area [31]. The effect of literal count minimization is to simplify and
factorize the logic. By eliminating logic, simplification helps both in terms of delay and in
terms of area. However, factorization usually trades off delay for better area. We present in
section 5.2 empirical data that confirm that literal count minimization has unpredictable
effects on delay.
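A small worked example may help to see why factorization trades delay for area. The sketch below is not taken from the experiments: the function f = abc + abd + e, the tuple encoding of expressions, and the helper names are all illustrative assumptions, and levels are counted with unbounded fanin gates.

```python
from itertools import product


def literals(expr):
    """Count literal occurrences in a nested ('and'|'or', ...) expression."""
    if isinstance(expr, str):
        return 1
    return sum(literals(arg) for arg in expr[1:])


def levels(expr):
    """Count gate levels, treating each 'and'/'or' as one unbounded-fanin gate."""
    if isinstance(expr, str):
        return 0
    return 1 + max(levels(arg) for arg in expr[1:])


def evaluate(expr, env):
    """Evaluate the expression under a variable assignment."""
    if isinstance(expr, str):
        return env[expr]
    values = [evaluate(arg, env) for arg in expr[1:]]
    return all(values) if expr[0] == "and" else any(values)


# f = abc + abd + e in sum-of-products form, and factored as ab(c + d) + e.
sop = ("or", ("and", "a", "b", "c"), ("and", "a", "b", "d"), "e")
factored = ("or", ("and", "a", "b", ("or", "c", "d")), "e")

# Both forms compute the same function on all 32 input assignments.
names = "abcde"
equivalent = all(
    evaluate(sop, dict(zip(names, bits))) == evaluate(factored, dict(zip(names, bits)))
    for bits in product((False, True), repeat=len(names))
)
print(literals(sop), levels(sop))            # 7 literals, 2 levels
print(literals(factored), levels(factored))  # 5 literals, 3 levels
```

Factoring saves two literals here but adds one level of logic on the paths from c and d.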
Since our objective is to minimize delay more than area, we should control the
use of factorization, and concentrate our efforts on logic simplification instead. We have
not explored fully the modification of the many technology independent transformations
available in a logic synthesis tool such as misII, but we examine in section 5.3 the effect
of a controlled use of a few of these transformations that could lead to a substantial area
reduction at a much smaller cost in delay than uncontrolled literal count minimization.
However, if delay is our principal objective, we should be able to produce fast
circuits simply by flattening the logic. To flatten the logic, we collapse the Boolean network
into a graph with only one level of nodes. Each of the nodes has a function associated with
it that can be represented in sum-of-products form. In other words, collapsing to one level
of nodes can be seen as collapsing to two levels of logic if no fanin or fanout limitation is
enforced. Collapsing only helps in reducing delay for a certain set of circuits. For many
circuits, collapsing introduces such a large amount of logic duplication that even delay
increases. Nevertheless, when it applies, collapsing is a simple and very efficient technology
independent delay optimization technique. We discuss the effect of collapsing in more detail
in section 5.4.
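To see how quickly collapsing can blow up, consider the n-input parity function, which is a textbook case and not one of the benchmarks used here. The sketch below enumerates its on-set and checks that no two minterms differ in exactly one variable, so no two minterms can be merged into a larger cube and the two-level form needs one product term per minterm. The function name `parity_sop_size` is an assumption.

```python
from itertools import product


def parity_sop_size(n):
    """Return (cubes, literals) of the sum-of-products form of n-input parity."""
    minterms = [bits for bits in product((0, 1), repeat=n) if sum(bits) % 2 == 1]
    # Two minterms merge into a larger cube only if they differ in exactly
    # one variable; for parity this never happens.
    for p in minterms:
        for q in minterms:
            assert sum(x != y for x, y in zip(p, q)) != 1
    return len(minterms), n * len(minterms)


for n in (2, 4, 8):
    cubes, literal_count = parity_sop_size(n)
    print(n, cubes, literal_count)
```

The number of product terms doubles with every added input, whereas a multi-level implementation needs only n - 1 two-input XOR gates.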
Since network collapsing is such an effective technique at reducing circuit delay, it
is worth investigating whether partial collapsing can be used when total collapsing is not
practical. To decide which parts of a network should be collapsed, we use an algorithm
developed by Lawler et al. [28]. Lawler’s algorithm can be viewed as the technology
independent analog of the extension of tree covering allowing overlaps between trees that
we introduced in section 4.5.2. The main drawback of this algorithm is that it tends to
increase area, but some of this area can be recovered by using the controlled literal count
reduction techniques presented in section 5.3.
There has been some previous work in the area of logic restructuring for delay. The
most notable effort in this direction was speedup, by Singh et al. [41], which performs local
collapsing and factorization in order to reduce the number of levels of logic a signal has to
traverse while controlling the increase in area incurred by collapsing. The work by Fishburn
[15] is based on a similar idea, though the restructuring is performed differently. Another
technique, that is more efficient in terms of area but more computationally intensive, was
proposed by Chen and Muroga [12]. This technique consists in exploiting the observability
don’t care set at a node to remove connections on the critical path. Berman et al. propose
similar methods [6]. We present our adaptation of Lawler’s algorithm in section 5.5 and
compare it to speedup.
In the remainder of this section, we use the technology mapper to measure the area
and delay of a circuit. For technology mapping, we use tree covering in delay minimization
mode, followed by fanout optimization. We use the heuristic of section 4.3 but we do not
allow overlaps between tree covers.
We measured the effect of literal count minimization on circuit area and circuit
delay after technology mapping. The results are reported in Table 5.1. All circuits were
optimized using the misII standard algebraic script [9] except C2670, which was optimized
manually. As is apparent in the table, the circuits obtained from Intel (dflgrcb1, fconrcb1,
kcctlcb3, sbiucb1, tfaultcb1) were already optimized, and little gain was achieved for
these circuits. On average, minimizing literal count decreased area by 28% for no cost in
delay. However, a more careful inspection of the data indicates that the effect on delay of
literal count minimization is unpredictable, varying from a decrease of 22% to an increase
of 26%, though the larger increases in delay correspond to significant decreases in area.
Obviously the techniques used in misII to reduce literal count are quite powerful.
Unfortunately, if they are used without discrimination, they may at times lead to substantial
delay increases. This unpredictable behavior is undesirable and more work needs to be done
to control the effect on delay of these optimizations.
A simple way to reduce circuit area without having to pay for an increase in delay
is to reduce the literal count by using simplification only. A better way would be to allow,
in addition to simplification, factorization along non critical paths. However, to perform
this optimization reliably, we need a good technology independent delay estimator, and
none is available at present. We have experimented with a simple misII script, shown in
Figure 5.1, that uses the following commands:
• sweep: this command eliminates nodes with no fanin or no fanout. It simply removes
unnecessary nodes from a network.
• decomp -q: this command factorizes nodes in a simple way with no concern for
criticality. It is only used to break large nodes into smaller ones so that the other
commands can run in a reasonable amount of cpu time.
• eliminate -l 100 -1: this command collapses a node into its fanout. The collapsing
is only done if the node has a single fanout, and if the size of the resulting node does
not exceed 100 cubes. The role of this command is to ensure that nodes are large
enough for simplification to have some effect, but not so large that simplification
takes an unreasonable amount of cpu time.
• simplify -1: this command runs espresso [7, 37], a two-level logic minimizer. The
minimizer is given some information about the structure of the network, that allows it
to simplify the logic function at a node in the context of the other nodes in the network.
In particular, when simplifying a node v, the minimizer is allowed to change the inputs
of v if it simplifies the logic function at v. The -1 option limits the minimizer to using
as inputs of v only nodes that are closer to the primary inputs than v. This restriction
guarantees that the number of levels of logic in the network is not increased by the
minimizer.
We measured the effect of the simple simplification script of Figure 5.1 on circuit area and
delay after technology mapping. The results are reported in Table 5.2. The average effect of
the simple simplification script is an area reduction of 9% and a delay reduction of 2%. The
fluctuations in delay are less severe than with the standard script, varying from a reduction
of 22% to an increase of 14%. The average reduction in area obtained by simplification
alone is roughly a third of the reduction in area obtainable with the standard script. This
is a heavy price to pay for a more controlled effect on delay.
Some circuits cannot be collapsed into two levels of logic without a large area
penalty. However, it is often possible to collapse these networks partially in order to reduce
delay at a more moderate cost in area. To perform partial collapsing, we need an algorithm
that determines which groups of nodes are to be collapsed into single nodes in order to
decrease the delay through the network.
Unfortunately we do not have at our disposal a reasonably accurate technology
independent delay model, such as the one proposed by Wallace et al. [43]. As a rough
measure of delay, we use the number of logic levels a signal has to cross. To limit the
inaccuracy of this delay model, we only apply it after having decomposed a network into
simple gates. These simple gates are any of the four 2-input gates that can be represented
as a 2-input NAND gate with possibly inverters at the inputs, representing one of the four
following Boolean functions: $a + b$, $\bar{a} + b$, $a + \bar{b}$ or $\bar{a} + \bar{b}$.
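The correspondence between the four simple gates and a NAND gate with optional input inverters follows from De Morgan's law. A minimal exhaustive check, with hypothetical helper names `nand` and `gate`:

```python
from itertools import product


def nand(x, y):
    return not (x and y)


def gate(a, b, invert_a, invert_b):
    """2-input NAND with an optional inverter on each input."""
    return nand(a != invert_a, b != invert_b)


# (inverter on a, inverter on b) -> the OR form realized by the gate.
expected = {
    (True, True): lambda a, b: a or b,
    (False, True): lambda a, b: (not a) or b,
    (True, False): lambda a, b: a or (not b),
    (False, False): lambda a, b: (not a) or (not b),
}

matches = all(
    gate(a, b, inv_a, inv_b) == func(a, b)
    for (inv_a, inv_b), func in expected.items()
    for a, b in product((False, True), repeat=2)
)
print(matches)
```

Each placement of inverters yields exactly one of the four OR forms, so every simple gate counts as one logic level in the delay measure.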
To form the groups, we use a clustering algorithm due to Lawler that minimizes
the number of levels of logic in the network after collapsing of the groups subject to the
constraint that each group is formed of at most K nodes. This algorithm generates possibly
algorithm Lawler_clustering_algorithm
{ /* labeling step */
    foreach node v visited in topological order from inputs to outputs {
        if fanin(v) = ∅ then L = 0 else L = max_{u ∈ fanin(v)} label(u)
        k = |{u, u ∈ transitive_fanin(v), label(u) = L}|
        if k > K label(v) = L + 1 else label(v) = L
    }
}
{ /* clustering step */
    foreach node v visited in topological order from outputs to inputs {
        if fanout(v) = ∅ then L = ∞ else L = min_{u ∈ fanout(v)} label(u)
        if label(v) < L {
            create a new cluster c
            c = {v} ∪ {u ∈ transitive_fanin(v), label(u) = label(v)}
        }
    }
}
{ /* collapsing step */
    foreach cluster c {
        root(c) = {v ∈ c, label(v) < max_{u ∈ fanout(v)} label(u)}
        foreach v ∈ root(c) {
            collapse into v all nodes in c ∩ transitive_fanin(v)
        }
    }
}
end Lawler_clustering_algorithm
overlapping clusters, and clusters that have more than one output, requiring extra logic
duplication during collapsing. For these reasons, this clustering algorithm usually increases
circuit area. In the next subsection, we describe Lawler’s algorithm in more detail.
Lawler’s algorithm determines a minimum delay clustering of a network under the
constraint that no cluster exceeds a global capacity K. The delay
through a cluster is assumed to be the same for all clusters, and the size of a cluster is
the number of nodes it contains. Lawler’s algorithm can handle more general clustering
problems, but the present formulation is sufficient for our purpose. Lawler’s algorithm
proceeds in two steps: a labeling step and a clustering step. We have added a third step to
do the collapsing of the clusters. The algorithm is described in Figure 5.2.
The labeling step proceeds as follows. We visit the nodes in topological order, from
inputs to outputs. For each node v, we compute the largest label L of any of its fanins. If
v does not have any fanin, L is taken to be equal to 0. We then compute the number k of
nodes in the transitive fanin of v that are of label L. If k exceeds K, the label of v is set
to be L + 1, otherwise it is set to be L. In the clustering step the nodes are visited in the
reverse order, from outputs to inputs. If the label of a node v is less than the labels of all the
nodes in its fanout, a new cluster is created, containing v and all the nodes in the transitive
fanin of v with the same label as v. The collapsing step collapses the nodes of a cluster
together. Some duplication is introduced within a cluster if a cluster has several outputs.
One node is created per output, and every node of the cluster contained in the transitive
fanin of two or more output nodes of the cluster is duplicated. Duplication is also introduced
across clusters, since a node may belong to several clusters as shown in Figure 5.3.
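The labeling and clustering steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used for the experiments: the function names, the dictionary representation of the network, and the example network are hypothetical, the collapsing step is omitted, and the sketch counts v itself toward k so that no cluster exceeds K nodes (the text only mentions the transitive fanin of v).

```python
def topological_order(fanins):
    """Return the nodes so that every node appears after all of its fanins."""
    order, visited = [], set()

    def visit(v):
        if v in visited:
            return
        visited.add(v)
        for u in fanins[v]:
            visit(u)
        order.append(v)

    for v in fanins:
        visit(v)
    return order


def transitive_fanin(fanins, v):
    """Return every node that reaches v, excluding v itself."""
    seen, stack = set(), list(fanins[v])
    while stack:
        u = stack.pop()
        if u not in seen:
            seen.add(u)
            stack.extend(fanins[u])
    return seen


def label_nodes(fanins, K):
    """Labeling step: visit the nodes from inputs to outputs."""
    label = {}
    for v in topological_order(fanins):
        L = max((label[u] for u in fanins[v]), default=0)
        # Assumption: v is counted together with its transitive fanin.
        k = 1 + sum(1 for u in transitive_fanin(fanins, v) if label[u] == L)
        label[v] = L + 1 if k > K else L
    return label


def form_clusters(fanins, label):
    """Clustering step: visit the nodes from outputs to inputs."""
    fanouts = {v: [] for v in fanins}
    for v, inputs in fanins.items():
        for u in inputs:
            fanouts[u].append(v)
    clusters = {}
    for v in reversed(topological_order(fanins)):
        L = min((label[w] for w in fanouts[v]), default=float("inf"))
        if label[v] < L:
            same = {u for u in transitive_fanin(fanins, v) if label[u] == label[v]}
            clusters[v] = {v} | same
    return clusters


# Hypothetical network: node -> list of fanins.
fanins = {"a": [], "c": ["a"], "d": ["a"], "x": ["c"], "y": ["d"]}
labels = label_nodes(fanins, K=2)
clusters = form_clusters(fanins, labels)
print(labels)
print(clusters)
```

In this small network node a ends up in both the cluster rooted at c and the cluster rooted at d, which is the kind of cross-cluster duplication mentioned above.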
An example (from [28]) shows the application of the algorithm in Figure 5.3. In
that example K, the cluster size limit, is set to 3. The labels are indicated inside the nodes
and the clustering is shown by encirclings. As can be seen in this example, the algorithm
may replicate some nodes. The labeling and clustering parts of the algorithm operate in
$O(KN^2)$ time, where N is the total number of nodes in the network and K the maximum size
of a cluster. The time complexity of the collapsing part is dependent on the logic function
obtained at each node.
As can be observed in Figure 5.2 and Figure 5.3, Lawler’s algorithm bears a strong
similarity to tree covering with overlaps. The labeling step is the analog of the forward
dynamic programming pass of tree covering. The clustering step is the analog of the gate
selection pass of tree covering. In both algorithms, the nodes of the network are visited
in the same order. In other words, Lawler’s clustering algorithm can be thought of as a
technology independent tree covering algorithm allowing overlaps, and in that sense is a
natural extension of the algorithms we presented in the previous chapter. It suffers from
the same problem as the tree covering algorithm allowing overlaps, as it often causes large
area increases. More work needs to be done in this area to determine whether these area
increases are necessary to reduce delay.
In this section we examine the effect of the clustering algorithm on our set of
benchmarks. We apply the clustering algorithm using the script of Figure 5.4. Most of the
commands in this script have been introduced earlier. The new commands are:
• tech_decomp -o 2: this command decomposes the network into 2-input NAND gates
possibly with inverters at one or both of the inputs.
• resub -a -d: applied to a network decomposed into 2-input NAND gates, this
command detects if two copies of the same node are present in the network. If it is the
clustering script {
    sweep
    decomp -q
    tech_decomp -o 2
    resub -a -d
    sweep
    reduce_depth -S 8
    eliminate -1
    simplify -1
}

speed_up script {
    sweep
    decomp -q
    speed_up -d 6 -m unit
}
case, one copy is removed and the fanout of the remaining node is increased by the
fanout of the removed node.
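The effect described for duplicate nodes can be illustrated with a small sketch. It is not the misII implementation of resub: the function name, the gate encoding (each gate maps to two inputs, each a pair of signal name and inversion flag), and the example are assumptions made for illustration.

```python
def merge_duplicate_gates(gates, order):
    """Merge 2-input NAND gates that have the same set of inputs.

    gates maps a gate name to its two inputs, each a (signal, inverted) pair.
    order lists the gate names from inputs to outputs.
    """
    representative = {}  # gate name -> name of the surviving copy
    seen = {}            # canonical input set -> name of the surviving copy
    kept = {}
    for name in order:
        # Redirect the inputs to surviving copies before comparing gates.
        key = frozenset(
            (representative.get(signal, signal), inverted)
            for signal, inverted in gates[name]
        )
        if key in seen:
            representative[name] = seen[key]
        else:
            seen[key] = name
            representative[name] = name
            kept[name] = key
    return kept, representative


# g2 is a copy of g1; once g2 is redirected to g1, g4 becomes a copy of g3.
gates = {
    "g1": (("a", False), ("b", True)),
    "g2": (("b", True), ("a", False)),
    "g3": (("g1", False), ("c", False)),
    "g4": (("g2", False), ("c", False)),
}
kept, representative = merge_duplicate_gates(gates, ["g1", "g2", "g3", "g4"])
print(sorted(kept), representative)
```

Every fanout of a removed copy is redirected to the surviving gate, whose fanout grows accordingly.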
The results of clustering on area and delay of circuits optimized for minimum
literal count are given in Table 5.4. Clustering achieves an average delay reduction of 13%
for an average area increase of 39%. This technique performs better than tree covering
with overlaps in both area and delay. For comparison, we also give in Table 5.5 the results
obtained using the misII speed_up routine. The script we used to run the speed_up routine is
indicated in Figure 5.5. There is no need to use most of the commands in the previous script
because the speed_up command performs its own decomposition into simple gates and area
recovery. The speed_up command decreased delay by only 6% on average, for a moderate
increase of 12% in area. In addition, speed_up does not perform very consistently: it
actually increases the delay of 7 out of our set of 25 benchmarks. In contrast, the clustering
algorithm increased delay in only one of the examples, though at a higher cost in area.
In this section we present two techniques to reduce the area increase due to
clustering. The first technique is a modification of the labeling step of the clustering algorithm.
The second technique is based on redundancy removal.
primary output, this value is simply the largest label value computed in the labeling step.
If v is not a primary output, this value is guaranteed to be available when v is visited by
the topological ordering of the nodes. The label of each node is given this maximum value.
Then the largest label value the inputs of v could have without forcing an increase of the
label of v is computed. This value is propagated to the inputs of v.
The effect of the relabeling heuristic is illustrated in Table 5.6. Relabeling reduces
the average increase in area caused by clustering from 39% to 31%, and actually reduces
delay by an additional percentage point, yielding an average delay reduction of 14%.
algorithm relabeling_heuristic
    max_label = max_{v ∈ PO} label(v)
    foreach node v {
        max_label(v) = max_label
    }
    foreach node v visited in topological order from outputs to inputs {
        foreach node u ∈ FANIN(v) {
            if (label(v) < max_label(v) and label(u) == label(v)) {
                increment = max_label(v) - label(u)
            } else {
                increment = max(0, max_label(v) - label(u) - 1)
            }
            max_label(u) = min(max_label(u), label(u) + increment)
        }
        label(v) = max_label(v)
    }
end relabeling_heuristic
active critical path. This hypothesis is satisfied by most circuits. Moreover, all circuits can
be made to satisfy this hypothesis by using Keutzer’s algorithm.
our results slightly in favor of area. Redundancy removal was effective at reducing the area
penalty incurred by clustering. On average, clustering followed by redundancy removal
increased area by 23% for a reduction in delay of 17%.
5.6 Conclusion
Chapter 6

Conclusion
The main results of this work are as follows. We provided an exact solution to the minimum
delay tree covering problem based on piece-wise linear functions. We performed an extensive
study of fanout optimization heuristics, presented new complexity results, and introduced
a spectrum of fanout optimization algorithms. We developed a simple algorithm to apply
fanout optimization throughout an entire network that reduces delay at a very moderate cost
in area. To study the integration of tree covering and fanout optimization, we introduced
a technology independent delay model that characterizes precisely suboptimalities due to
imbalances in a network. This is the first technology independent delay model that models
the delay through a node as a function of the arrival time distribution at a node. In addition,
this delay model can be used to derive analytically optimal solutions in simple cases which
can be used to assess the optimality of algorithms. We showed the importance of the
technique used to evaluate the arrival times at the input of trees before fanout optimization,
and presented an efficient heuristic to solve this problem. We also experimented with
allowing tree covers to overlap, and showed significant delay reductions with this technique.
Finally we investigated technology independent delay optimization techniques based on
partial or total collapsing of logic, and showed that further delay reductions can be achieved
with these techniques possibly at a higher cost in area.
A surprising conclusion of our work is that it is important to ignore critical paths
when performing delay optimization during logic synthesis. As confirmed by the
experiments of Yoshikawa et al. [44], delay reduction on non-critical paths can create additional
slacks on those paths that can be exploited to reduce delay through critical paths. By
concentrating on critical paths only, delay optimization algorithms condemn themselves to
suboptimal solutions.
We now have at our disposal a spectrum of delay optimization techniques. Fanout
optimization is the cheapest technique in terms of area consumption, and should be given
top priority. Tree covering for delay comes in second. The area-delay tradeoff potential
of tree covering depends more heavily on the quality of the library used by the technology
mapper. In some cases tree covering can outperform fanout optimization, though we were
not able to demonstrate this fact in this thesis due to the confidentiality of some of our
libraries. Allowing overlaps between tree covers as well as technology independent
collapsing algorithms followed by redundancy removal add to the arsenal of delay minimization
techniques.
More work needs to be done in technology independent delay optimization
techniques. There are three main avenues of research: the development of more accurate
technology independent delay models than the ones currently in use; the improvement
of collapsing algorithms in terms of area; the improvement of techniques based on kernel
extraction (speed_up) or observability don’t-care sets in terms of cpu speed. In particular,
it would be interesting to investigate the use of the technology independent delay model
introduced in chapter 4 or a model derived on similar ideas to drive technology independent
delay reduction algorithms. This delay model is the first to propose a way to take into
account imbalances in arrival times as they occur in networks. A more fundamental issue
would be to understand when logic duplication is needed for delay reduction (we know that
redundancy is not needed, even in the presence of false paths [25]) in order to find more
economical ways to perform partial collapsing or tree overlapping.
Bibliography
[1] A. Aho, M. Ganapathi, and S. Tjiang. Code Generation Using Tree Matching and
Dynamic Programming. ACM Transactions on Programming Languages and Systems,
11(4):491-516, October 1989.
[2] A. Aho, S. C. Johnson, and J. Ullman. Code generation for expressions with common
subexpressions. Journal of the Association for Computing Machinery, 24(1):146-160,
1977.
[3] A. V. Aho and M. Ganapathi. Efficient Tree Pattern Matching: an Aid to Code
Generation. In Twelfth Annual ACM Symposium on Principles of Programming Languages,
pages 334-340, January 1985.
[5] C. L. Berman, J. L. Carter, and K. F. Day. The Fanout Problem: From Theory to
Practice. In C. L. Seitz, editor, Advanced Research in VLSI: Proceedings of the 1989
Decennial Caltech Conference, pages 69-99. MIT Press, March 1989.
[10] R. E. Bryant. Graph Based Algorithms for Boolean Function Manipulation. IEEE
Transactions on Computers, C-35(8):677-691, August 1986.
[12] K. C. Chen and S. Muroga. Timing Optimization for Multi-Level Combinational
Networks. In Proceedings of the 27th ACM/IEEE Design Automation Conference, pages
339-344, 1990.
[16] M. R. Garey and D. S. Johnson. Computers and Intractability: A Guide to the Theory
of NP-Completeness. Mathematical Sciences Series. Freeman, 1979.
[17] L. A. Glasser and L. P. J. Hoyte. Delay and Power Optimization in VLSI Circuits. In
21st ACM/IEEE Design Automation Conference, pages 529-535, 1984.
[21] M. Hofmann and J. K. Kim. Delay Optimization of Combinational Static CMOS Logic.
In 24th ACM/IEEE Design Automation Conference, pages 125-132, 1987.
[23] R. Jacoby, P. Moceynas, H. Cho, and G. Hachtel. New ATPG Techniques for Logic
Optimization. In ICCAD, pages 548-551, 1989.
[24] K. Keutzer. DAGON: Technology Binding and Local Optimization by DAG Matching.
In Proceedings of the 24th Design Automation Conference, pages 341-347. ACM/IEEE,
June 1987.
[27] K. Keutzer and W. Wolf. Anatomy of a Hardware Compiler. In Proceedings of the
SIGPLAN ’88 Conference on Programming Language Design and Implementation, pages
95-104. ACM, June 1988.
[28] E. L. Lawler, K. L. Levitt, and J. Turner. Module clustering to minimize delay in digital
networks. IEEE Transactions on Computers, C-18(1):47-57, January 1969.
[29] Mario Lega. Private communication. AT&T Bell Laboratories, October 1990.
[32] R. Lisanke. Logic Synthesis and Optimization Benchmarks User Guide Version 2.0.
Technical report, MCNC, P.O. Box 12889, Research Triangle Park, NC 27709,
December 1988.
[34] F. W. Obermeier and R. H. Katz. Combining Circuit Level Changes with Electrical
Optimization. In ICCAD-88, pages 218-221. IEEE, 1988.
[35] P. G. Paulin and F. Poirot. Logic Decomposition Algorithms for the Timing
Optimization of Multi-Level Logic. In International Conference on Computer Design,
pages 329-333. IEEE, October 1989.
[36] R. Rudell. Logic Synthesis for VLSI Design. PhD thesis, UC Berkeley, April 1989.
UCB/ERL M89/49.
[39] M. Schulz, E. Trischler, and T. Sarfert. SOCRATES: a Highly Efficient ATPG System.
IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems,
7(1):126-137, January 1988.
[43] D. Wallace. High-Level Delay Estimation for Technology-Independent Logic Equations.
In ICCAD, pages 188-191, 1990.