INFORMATION TO USERS

This manuscript has been reproduced from the microfilm master. UMI
films the text directly from the original or copy submitted. Thus, some
thesis and dissertation copies are in typewriter face, while others may
be from any type of computer printer.

The quality of this reproduction is dependent upon the quality of the
copy submitted. Broken or indistinct print, colored or poor quality
illustrations and photographs, print bleedthrough, substandard margins,
and improper alignment can adversely affect reproduction.

In the unlikely event that the author did not send UMI a complete
manuscript and there are missing pages, these will be noted. Also, if
unauthorized copyright material had to be removed, a note will indicate
the deletion.

Oversize materials (e.g., maps, drawings, charts) are reproduced by
sectioning the original, beginning at the upper left-hand corner and
continuing from left to right in equal sections with small overlaps. Each
original is also photographed in one exposure and is included in
reduced form at the back of the book.

Photographs included in the original manuscript have been reproduced
xerographically in this copy. Higher quality 6" x 9" black and white
photographic prints are available for any photographs or illustrations
appearing in this copy for an additional charge. Contact UMI directly
to order.

University Microfilms International
A Bell & Howell Information Company
300 North Zeeb Road, Ann Arbor, MI 48106-1346 USA
313/761-4700 800/521-0600

Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
Order Number 9126812

Performance-oriented technology mapping

Touati, Hervé Jacques, Ph.D.

University of California, Berkeley, 1990

Copyright © 1990 by Touati, Hervé Jacques. All rights reserved.

UMI
300 N. Zeeb Rd.
Ann Arbor, MI 48106

NOTE TO USERS

THE ORIGINAL DOCUMENT RECEIVED BY U.M.I. CONTAINED PAGES
WITH POOR PRINT. PAGES WERE FILMED AS RECEIVED.

THIS REPRODUCTION IS THE BEST AVAILABLE COPY.

Performance-Oriented Technology Mapping

By

Herve Jacques Touati

B.S. (University of Paris VII) 1980

DISSERTATION

Submitted in partial satisfaction of the requirements for the degree of

DOCTOR OF PHILOSOPHY

in

COMPUTER SCIENCE

in the

GRADUATE DIVISION

of the

UNIVERSITY OF CALIFORNIA at BERKELEY

Approved:
Chair: . . . . . . . . . .  Date: . . . . . . . . . .

Performance-Oriented Technology Mapping

Copyright © 1990

Herve J. Touati

Performance-Oriented Technology Mapping

Herve J. Touati
Ph.D. Thesis
University of California, Berkeley, California
Department of Electrical Engineering and Computer Science
Computer Science Division

Abstract

This thesis presents a variety of techniques to minimize circuit delay during the translation
of a set of Boolean equations into a list of connected logic gates that can be used for the
manufacturing of combinational digital circuits. This translation process is called technology
mapping. The first contribution of this work is to present an optimal algorithm to
implement Boolean circuits that can be represented as trees using an extension of known
tree covering algorithms. The second and more important contribution of this work is an
in-depth analysis of fanout optimization. The fanout problem is the problem of distributing
a signal to several destinations, where the signal may be required at different times, in order
to minimize the overall delay. This work presents the most detailed theoretical study of the
complexity of fanout optimization published so far, and a spectrum of heuristics to solve
the fanout problem under realistic delay models. This thesis also introduces a simple algorithm
that can be used to apply fanout optimization to an entire network. This algorithm
yields an optimal application of fanout optimization in terms of delay, while keeping the area
increase of the circuit to a low value. To study the integration of tree covering and fanout
optimization, this work introduces a technology independent delay model that characterizes
precisely suboptimalities due to imbalances in a network. This is the first technology independent
delay model that models the delay through a node as a function of the arrival time
distribution at a node. This delay model can be used to derive analytically optimal solutions
in simple cases, which can be used to measure the suboptimality of heuristics. An extension
to tree covering is then suggested, and shown to provide significant delay reductions for a
relatively heavy cost in area. Finally, this work investigates technology independent delay
optimization techniques based on partial or total collapsing of logic, and shows that further
delay reductions can be achieved with these techniques, possibly at a higher cost in area.

Prof. Robert K. Brayton
Thesis Committee Chairman

Acknowledgements
First I would like to thank my advisor, Pr. Robert Brayton, for his undivided support and
encouragement. Much of this work would not have been possible without his help. I am
also very grateful to Pr. Sangiovanni-Vincentelli for accepting the burden of being a second
reader of this thesis and for many exciting discussions. I am also indebted to Pr. Kobayashi,
of the Mathematics Department, who was kind enough to accept to be a member of my
thesis committee. I owe much to Pr. Alan Smith and Pr. Alvin Despain for teaching me
many of the fundamentals of computer science research. I also would like to thank Dr.
Kurshan of AT&T Bell Laboratories for many interesting discussions, Pr. Susan Graham
and Dr. Kurokawa of IBM Japan for their support and advice early in my graduate student
life. Finally I would like to acknowledge the French Ministry of Transportation, DARPA,
IBM and DEC who granted me the privilege of their support during my graduate studies.

Above all, I would like to thank my wife, Atsuko, and my daughter, Marianne
Ayaka, for their love and support, and for coping with the hardships of graduate student
life. I owe to Marianne the habit of waking up early every day and to Atsuko a strong desire
to graduate. These would be major contributions in any field of research.

I would like to express my thanks to the CAD group in general, for providing such
a stimulating and exciting research environment.

Of course, many friends deserve special thanks. I was told that it is more polite to
list one’s friends in alphabetical order, so that only those whose names start with ’A’ do not get
offended. So many thanks to Pranav Ashar; Mary and Wendell Baker, for tolerating me as
their office mate; Andrea Casotto; Fred Douglis; Paul Gutwin; Bruce Holmer; Kurt Keutzer;
Luciano Lavagno; Bill Lin; Sharad Malik; Rick McGeer; Cho Moon; Rajeev Murgai; Antony
Ng; Terry Regier; Rick Rudell; Rafael Saavedra-Barrera; Alex Saldanha; Hamid Savoj;
Luigi Semenzato; Ellen Sentovitch; Narendra Shenoy; K. J. Singh; Ashok Singhal; Greg
Sorkin; Peter Van Roy, for sharing Belgian chocolates with me; Steve Viavant; Albert
Wang; Yosinori Watanabe; and Greg Whitcomb.

Contents

Table of Contents

List of Figures

List of Tables

1 Introduction
1.1 Overview
1.2 Terminology and Notation
1.2.1 Mathematical Notation
1.2.2 Combinational Logic
1.2.3 Physical Modeling

2 Delay Optimization with Tree Covering
2.1 Introduction
2.2 Tree Covering
2.3 Handling of Load Values
2.3.1 Uniform Discretization
2.3.2 Adaptive Discretization
2.3.3 Functional Abstraction
2.4 Limits to the Optimality of Tree Covering for Delay
2.4.1 Initial Decomposition
2.4.2 Suboptimality of Pin Assignment
2.4.3 Rise and Fall Delays
2.5 Minimizing Area under a Delay Constraint
2.5.1 Adaptive Discretization of Required Times
2.5.2 A Greedy Approach to Area Recovery
2.5.3 Area Recovery by Optimal Inverter Selection
2.5.4 Area Recovery by Optimal Gate Selection
2.6 Experimental Results
2.6.1 Results with the MCNC Library
2.6.2 Results with the CMOS12 Library
2.6.3 Results with the LIBRARY3 Library
2.6.4 Results with the LIBRARY4 Library
2.7 Conclusion

3 Fanout Optimization
3.1 Introduction
3.2 Definition of the Fanout Problem
3.3 Complexity of the Fanout Problem
3.3.1 Constant Delay Model
3.3.2 Unit Fanout Delay Model
3.3.3 Unit Fanout Model with Varying Sink Loads
3.3.4 Berman’s Delay Model
3.4 A Spectrum of Fanout Optimization Algorithms
3.4.1 Notation
3.4.2 Buffer Selection
3.4.3 Two-Level Fanout Trees Ignoring Required Times
3.4.4 Two-Level Fanout Trees Taking Required Times into Account
3.4.5 Combinational Merging
3.4.6 Fanout Optimization based on LT-Trees
3.4.7 Other Fanout Algorithms
3.5 Handling Differing Polarities
3.6 Peephole Optimizations for Area and Delay
3.6.1 Motivation
3.6.2 Optimal Buffer Selection
3.6.3 Area Recovery under a Delay Constraint
3.7 Global Fanout Optimization
3.7.1 A One Pass Approach
3.7.2 Global Area Recovery under a Delay Constraint
3.8 Experimental Results
3.8.1 Circuit Descriptions
3.8.2 Overall Performance of Fanout Optimization
3.8.3 Detailed Performance Analysis
3.9 Conclusion

4 Combining Tree Covering and Fanout Optimization
4.1 Introduction
4.2 Theoretical Analysis of Tree-Based Delay Minimization
4.2.1 Modeling Tree Covering
4.2.2 Modeling Fanout Optimization
4.2.3 Formulation as a Convex Optimization Problem
4.2.4 A Simple Example
4.2.5 Suboptimal Local Minima
4.3 Selecting the Initial Implementation
4.4 Global Optimization Schemes
4.4.1 Iterative Improvement
4.4.2 Optimality of Iterative Improvement
4.4.3 Criticality Based Iteration
4.5 Beyond Tree-Based Optimization
4.5.1 Phase Optimization by Tree Duplication
4.5.2 Allowing Overlaps between Trees
4.6 Conclusion

5 Technology Independent Delay Optimizations
5.1 Introduction
5.2 Effect on Delay of Minimizing Literal Count
5.3 Performance-Oriented Logic Simplification
5.4 Effect of Collapsing to Two Levels of Logic
5.5 Partial Collapsing for Delay Minimization
5.5.1 Lawler’s Algorithm
5.5.2 Effect of Clustering on Delay
5.5.3 Area Recovery and Clustering
5.6 Conclusion

6 Conclusion

Bibliography

List of Figures

1.1 A Boolean Network

2.1 Example of Rules
2.2 Example of Decomposition into Primitive Gates
2.3 Example of Pattern Matching
2.4 Tree Covering for Minimum Cost
2.5 Example of Suboptimal Pin Assignment

3.1 Duality of Tree Covering and Fanout Optimization
3.2 Combinational Merging
3.3 Splitting a Node Does not Increase Delay
3.4 Optimal Buffer Selection
3.5 Two-Level Fanout Tree Ignoring Required Times
3.6 Two-Level Fanout Tree Taking Required Times into Account
3.7 The Greedy Sink Assignment Algorithm is Not Optimal
3.8 Combinational Merging as Fanout Algorithm
3.9 Definition of LT-trees
3.10 Examples of LT-trees
3.11 Suboptimality of LT-Trees with Consecutive Sink Ordering
3.12 Fanout Optimization with LT-Trees
3.13 Illustration of LT-Tree Algorithm
3.14 Fanout Problem Unsolvable with Consecutive Sink Ordering
3.15 An Optimum Assignment Algorithm
3.16 A One-Pass Optimal Global Fanout Algorithm
3.17 Applying Fanout Algorithms in One Pass From Outputs to Inputs

4.1 Combining Tree Covering and Fanout Optimization
4.2 Suboptimality of Tree-Based Optimization
4.3 Partition of Edges
4.4 A Simple Example
4.5 A Simple Iterative Improvement Algorithm
4.6 A More Complex Example
4.7 Area / Delay Tradeoff

5.1 misII Logic Simplification Script
5.2 Lawler’s Algorithm
5.3 Example of Clustering
5.4 misII Clustering Script
5.5 misII Speed-up Script
5.6 Example of Relabeling for Area
5.7 Relabeling Procedure for Reducing Logic Duplication

List of Tables

2.1 Comparison of Tree Covering Algorithms with library MCNC
2.2 Effect of Optimal Inverter Selection with library MCNC
2.3 Comparison of Tree Mapping Algorithms with library CMOS12
2.4 Effect of Optimal Inverter Selection with library CMOS12

3.1 Fanout Trees: Two-Level vs. Multi-Level
3.2 General Information on the Benchmark Set
3.3 Effect of Fanout Optimization on Circuits Optimized by misII
3.4 Effect of Fanout Optimization on Unoptimized Circuits
3.5 Effect of Inverter Optimization on Circuits Optimized by misII
3.6 Effect of Buffering on Circuits Optimized by misII
3.7 A Lower Bound on the Effect of Fanout Optimization
3.8 Effect of Fanout Optimization without Peephole Optimization
3.9 Effect of Fanout Optimization without Area Recovery
3.10 Comparison with Singh's Algorithm

4.1 Effect of Taking Polarities Into Account in Fanout Delay Heuristic
4.2 Effect of Using a Logarithmic vs. Linear Delay Estimate
4.3 Effect of Iterative Improvement
4.4 Iterative Improvement vs. Optimal Assignment
4.5 Effect of Tree Duplication
4.6 Effect of Allowing Tree Overlaps
4.7 Effect of Limiting Tree Overlaps

5.1 Effect of Literal Count Minimization
5.2 Effect of Simplification without Factorization
5.3 Effect of Collapsing to Two Levels of Logic
5.4 Effect of Clustering with a Maximum Cluster Size of 8
5.5 Effect of the speed_up Command
5.6 Effect of Relabeling Heuristic
5.7 Effect of Redundancy Removal after Clustering

Chapter 1

Introduction

Books are not made to be believed,
but to be submitted to examination.
-- UMBERTO ECO, The Name of the Rose (1980)

1.1 Overview

Logic synthesis is the process of transforming a set of Boolean equations into
a network of gates that realizes the logic and minimizes some cost function. The cost
function can be area, delay, testability or power consumption. More often than not, it
is a combination of these factors. For simplicity and efficiency, logic synthesis is usually
decomposed into two phases: a technology independent phase, whose main objective is to
simplify the logic; and a technology dependent phase, also called technology mapping, whose
main role is to implement the logic using well characterized logic gates realized in a given
technology [8].

The main focus of this thesis is to develop techniques to reduce circuit delay during
the technology dependent phase of logic synthesis. Despite this focus on delay optimization,
other costs are not ignored. In particular, some care has been given to limit the increase in
circuit area whenever possible.

Technology independent delay optimizations are just as important as technology
dependent delay optimizations. However, since they are performed at a higher level, they
cannot be used with great accuracy until technology independent delay models and technology
mapping algorithms of reliable performance are developed. Since high level delay
models are still the subject of current research, technology independent delay optimizations
are only a topic of secondary priority in this research. Much work remains to be done in
this area.

The technology mapping algorithms presented in this work rely on two main techniques:
tree covering and fanout optimization. Tree covering algorithms were first studied
by Aho et al. [2, 3, 1] in the context of code generation for expressions, and were later
adapted to technology mapping by Keutzer [24, 27] and Rudell [14, 36]. If the objective
is to minimize circuit area, tree covering algorithms produce good quality results and are
straightforward to use. However, if the objective is to minimize circuit delay, tree covering
needs to be extended even to generate optimal solutions for trees. This extension is discussed
in chapter 2.

Tree covering alone tends to generate poor quality implementations in terms of
delay, because most circuits are not trees but directed acyclic graphs. The signal available
at the output of a tree needs to be distributed to several destinations. Such a signal is
called a fanout signal. With tree covering alone, the circuitry used to distribute a fanout
signal to its destinations is implemented by default with a wire. To a first approximation, if
n is the number of destinations of a fanout signal, the delay through this wire is of order
O(n). Using a simple buffer tree, this delay can be reduced to O(log n). To minimize delay,
it is thus very important to be able to insert buffer trees to reduce the delay incurred
by the distribution of fanout signals. This optimization, called fanout optimization [5], is
the main focus of chapter 3.
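
The gap between O(n) and O(log n) can be illustrated with a small Python sketch. It
assumes a simplified unit fanout delay model, in which the delay through a stage equals
the number of loads it drives, and a balanced buffer tree with a fixed branching factor.
The function names and the branching factor of 3 are illustrative assumptions, not taken
from this thesis, and the buffer tree figure is an upper bound that ignores the intrinsic
delay of the buffers.

```python
def direct_delay(n: int) -> int:
    """Delay when the source drives all n sinks through a single wire.

    Assumed model: delay is proportional to the number of loads driven.
    """
    return n


def buffer_tree_delay(n: int, branching: int = 3) -> int:
    """Upper bound on the delay of a balanced buffer tree reaching n sinks.

    Assumed model: no stage drives more than `branching` loads, and each
    stage on the source-to-sink path contributes at most `branching`.
    """
    if n <= branching:
        return n  # the source can drive all sinks directly
    levels = 0
    reach = 1
    while reach < n:
        reach *= branching  # each extra level multiplies the reachable sinks
        levels += 1
    return levels * branching


if __name__ == "__main__":
    for n in (4, 16, 64, 256, 1024):
        print(n, direct_delay(n), buffer_tree_delay(n))
```

Under these assumptions the direct connection is faster for a handful of sinks, while
the buffer tree is faster as n grows, which matches the asymptotic behavior described above.
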

Tree covering can be formulated as the problem of minimizing delay through a
fanin node, i.e. a node with several inputs but only one output. In a very similar way,
fanout optimization can be formulated as the problem of minimizing delay through a fanout
node, i.e. a node with one input and many outputs. To minimize delay through an entire
circuit, we need to coordinate the use of tree covering and fanout optimization on the fanin
and fanout nodes that compose the circuit. This global optimization problem is the focus
of chapter 4.

These techniques provide a solid set of technology mapping algorithms on which we
can rely to measure the effect of technology independent optimizations. Chapter 5 contains
the results of some empirical studies of the effect of technology independent optimizations
on the quality of the final, technology mapped implementation of a circuit. Finally chapter 6
summarizes the main results of this work.

In the remainder of this chapter we give the main definitions and notation used
throughout this thesis, and a description of the abstraction we use to model the physical
behavior of circuit components.

1.2 Terminology and Notation

1.2.1 Mathematical Notation

By convention we use the letters n and m to denote integer valued constants, and
the letters i, j and k to denote integer valued variables, usually indices. We use the letters
a, b, c to designate real valued constants, and x, y, z, t, r and p to designate real valued
variables. Occasionally, we also use upper case letters.

We use other letters to designate certain quantities in specific contexts. In particular
b, g and s are used as indices in a library of gates; b usually denotes a buffer and
s a source, i.e. a gate used at the root of a tree. We use $\alpha$, $\beta$ and $\gamma$ as constants that
characterize the delay through a gate of a gate library. We also commonly use d to denote
the number of buffers in a library, and n to denote the number of destinations of a fanout
signal.

We use $\Sigma_n$ to denote the group of permutations of a set of n elements. In that
context, $\sigma$ is used to denote an element of $\Sigma_n$, i.e. a permutation. We use $\pi$ to denote the
natural projection from a Cartesian product to one of its components. For example,
$\pi_A : A \times B \to A$, where $\pi_A((a, b)) = a$. $\mathcal{R}$ denotes the field of real numbers, and
$\mathcal{R}^+$ the subset of $\mathcal{R}$ formed of the nonnegative real numbers.

1.2.2 Combinational Logic

We also use the letters x, y, z to denote Boolean variables. We denote the Boolean
and with a “.” or with a space if the meaning is clear. We denote the Boolean or with a
“+” and the negation with a bar over the expression to be negated. For example we would
write $\overline{x + y} = \overline{x}.\overline{y}$.

A combinational logic circuit can be specified as a set of Boolean equations, with
no cyclic dependencies, of the form $y_j = f_j(x_1, \ldots, x_n)$, where $x_1, \ldots, x_n$ and $y_j$ denote
Boolean variables and $f_j$ a Boolean function. There is a one-to-one mapping between a set
of Boolean equations with no cyclic dependencies and a Boolean network, i.e. a directed
acyclic graph $G = (V, E)$, in which a logic equation is associated with each node.


Figure 1.1: A Boolean Network (labels: primary inputs, primary outputs, fanin node, fanout node)

Boolean networks form a natural, very general, multi-level representation of a piece
of combinational logic. Other representations of combinational logic, such as two-level
sum-of-products [7, 37] or binary decision diagrams [10] are more useful in some contexts, but
are more limited in terms of the class of functions they can express in a reasonable amount
of space.

Some of the nodes of a Boolean network are distinguished as primary inputs and
primary outputs, and represent respectively the inputs and outputs of the corresponding
circuit. The fanin of a node v of a Boolean network is the set of nodes u whose output is
directly connected to an input of v. Similarly, the fanout of a node v is the set of nodes
u that have an input directly connected to the output of v. A node that has a fanout
containing only one element is called a fanin node; a node that has a fanin containing only
one element is called a fanout node. A Boolean network can always be decomposed into a
network of fanin and fanout nodes, in such a way that a fanin node is only connected to
fanout nodes, primary inputs or primary outputs, and a fanout node is only connected to
fanin nodes, primary inputs or primary outputs, as illustrated in Figure 1.1.
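
As an illustration of these definitions, the following Python sketch computes the fanin and fanout sets of every node of a network given as a list of directed edges. The edge-list encoding, the function name `fanin_fanout` and the small example network are assumptions made for this sketch only.

```python
from collections import defaultdict

def fanin_fanout(edges):
    """Return (fanin, fanout) dictionaries for a DAG given as (u, v) edges,
    where an edge (u, v) means that the output of u drives an input of v."""
    fanin, fanout = defaultdict(set), defaultdict(set)
    for u, v in edges:
        fanout[u].add(v)
        fanin[v].add(u)
    return fanin, fanout

# Hypothetical network: primary inputs a, b, c; n1 = f(a, b); n1 drives n2 and n3.
edges = [("a", "n1"), ("b", "n1"), ("n1", "n2"), ("n1", "n3"), ("c", "n2")]
fanin, fanout = fanin_fanout(edges)
print(sorted(fanin["n1"]), sorted(fanout["n1"]))
```

In this example n1 has two fanins and two fanouts, so it is neither a fanin node nor a fanout node in the sense above and would have to be split during the decomposition.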

1.2.3 Physical Modeling

We model a technology as a reasonably small library of gates implementable in
that technology. These gates are the only circuit primitives we claim to use. In addition, we
suppose that these gates are fully characterized by a combinational logic function and a set
of delay equations. This excludes latches and CMOS transmission gates. We also suppose
that every interconnection of gates yields a valid circuit. This fails for ECL (different
voltage levels; dot products) or in the presence of fanout limits. Taking care of latches or
taking fanout limits into account can easily be done by a simple extension of the algorithms
presented here.
Our algorithms can handle gate libraries composed of up to a few hundred gates,
which is good enough to handle industrial CMOS standard cell libraries. Some of the
most popular gates in standard cell CMOS technologies are 2-input NAND gates, which
implement the function $out = \overline{a \cdot b}$; inverters, which implement the function $out = \bar{a}$; XOR
gates, which implement the function $out = a\bar{b} + \bar{a}b$; and and-or-invert and or-and-invert gates,
for example AOI22, which implements the function $out = \overline{a_1 a_2 + b_1 b_2}$. A given logic function
can be implemented by several gates of different sizes and delay characteristics. The role of
the technology mapper is to select not only a logic function that corresponds to a gate in
the target library, but also to select the appropriate gate for a given logic function.

Gate Area. In standard cell technology, we use grid counts to measure the area of a gate
or cell. The grid count is the width of a cell relative to a standard design rule. Grid count is
directly proportional to standard cell area before routing [31]. If a well characterized router
is used for layout, grid count is a good estimate of final chip area.

Gate Delay. To model gate delay, we use a simple linear delay model. This model
characterizes the delay from an input pin i to the output pin of a gate g using a linear equation
of the form $\alpha_{i,g} + \beta_{i,g}\gamma$. The coefficient $\alpha_{i,g}$ models the intrinsic delay through the gate; the
coefficient $\beta_{i,g}$ models the load dependent delay; $\gamma$ is the capacitive load at the output of
the gate, which is estimated by looking at the output connections of the gate. The model
also distinguishes between rise and fall delays. If capacitive loads are restricted to be within
a small range of values, and if delay coefficients are determined for that range of values,
this model is a reasonable first approximation, within 10% of actual delays [29]. Gate level
delay models used in the industry are usually more complex than this linear, zero-order model.
Industrial models usually rely on a simple first-order non-linear delay model, which computes
the delay and its slope, i.e. a discretized version of the first derivative of the delay. A
well-tuned version of this model can estimate physical delays within a few percent.
To estimate the capacitive load at the output of a gate g, we add the capacitive
loads of the input pins of the gates driven by g to an additional term representing the delay
through the wires.
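
The linear model can be made concrete with a few lines of Python. This is only a sketch under simplifying assumptions: rise and fall delays are not distinguished, the wire contribution is a single additive number, and the names `Pin`, `output_load` and `pin_delay` as well as all numeric values are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Pin:
    alpha: float  # intrinsic delay from this pin to the gate output
    beta: float   # load dependent delay per unit of capacitive load
    cap: float    # capacitive load this pin presents to its driver

def output_load(driven_pins, wire_load=0.0):
    """Load at a gate output: input pin loads of the driven gates plus a wire term."""
    return sum(p.cap for p in driven_pins) + wire_load

def pin_delay(pin, load):
    """Pin-to-output delay alpha + beta * gamma for a given load gamma."""
    return pin.alpha + pin.beta * load

# Hypothetical coefficients: a NAND input pin driving two inverter inputs.
nand_a = Pin(alpha=1.0, beta=0.5, cap=1.0)
inv_in = Pin(alpha=0.5, beta=0.25, cap=2.0)
load = output_load([inv_in, inv_in], wire_load=0.5)
print(load, pin_delay(nand_a, load))
```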

Arrival Times, Required Times and Slack. Given arrival times at the primary inputs
of the network, the arrival time at the output of any gate in the network is defined as the
latest possible moment a transition may occur at the output of that gate. We use static
timing analysis to compute arrival times, i.e. we assume that all paths through the logic
can be activated. Techniques to detect false paths exist [33] but are too computationally
expensive to be of practical use during technology mapping. In addition, one of the goals
of technology mapping for delay is to make a large number of paths critical, reducing the
need for more sophisticated delay estimators.
Given required times at the primary outputs of a network, we can also compute
the required time at the outputs of any gate of the network by propagating required times
throughout the network. The slack at the output of a gate is defined as the difference
between the required time and the arrival time at that output.
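
A minimal sketch of static timing analysis on a small network is given below. It assumes that each pin-to-output delay has already been evaluated at its load, so that every connection carries a single fixed delay; the dictionary encoding and the function name `static_timing` are hypothetical. Arrival times are propagated from inputs to outputs, required times from outputs to inputs.

```python
from graphlib import TopologicalSorter  # Python 3.9+

def static_timing(fanins, input_arrival, output_required):
    """fanins maps each gate to a list of (driver, pin_delay) pairs."""
    graph = {g: [u for u, _ in ins] for g, ins in fanins.items()}
    order = list(TopologicalSorter(graph).static_order())  # drivers come first
    arrival = dict(input_arrival)
    for g in order:
        if g in fanins:  # latest transition over all input pins
            arrival[g] = max(arrival[u] + d for u, d in fanins[g])
    required = {v: float("inf") for v in order}
    required.update(output_required)
    for g in reversed(order):  # propagate required times backwards
        for u, d in fanins.get(g, []):
            required[u] = min(required[u], required[g] - d)
    slack = {v: required[v] - arrival[v] for v in order}
    return arrival, required, slack

# Hypothetical two-gate network with primary inputs a, b, c and output g2.
fanins = {"g1": [("a", 1.0), ("b", 2.0)], "g2": [("g1", 1.0), ("c", 1.0)]}
arrival, required, slack = static_timing(
    fanins, {"a": 0.0, "b": 0.0, "c": 0.0}, {"g2": 4.0})
print(arrival["g2"], slack["b"], slack["c"])
```

Here the path through b and g1 is the critical one: it has the smallest slack of all inputs.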

Technology Independent Models. The literal count is a good technology independent
estimator of circuit area. The literal count of a Boolean network is the sum of the
literal counts of each of its nodes. The literal count of a node is the number of literals
appearing in a factored form representation of the logic function of the node. A literal is
a term of the form $x$ or $\bar{x}$, where $x$ is a variable and $\bar{x}$ is the Boolean negation of $x$. For
example, if the logic function of a node is $x_1 (x_2 + x_3) + \bar{x}_1 x_4$, the literal count of this node
is 5. Literal count correlates well with circuit area [31].
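
Counting literals is straightforward once a factored form is available as text. The sketch below assumes a hypothetical textual syntax in which variables are identifiers, `!` marks negation and juxtaposition denotes the Boolean and; each occurrence of a variable, complemented or not, counts as one literal.

```python
import re

def literal_count(factored_form):
    """Number of variable occurrences in a factored form expression."""
    return len(re.findall(r"[A-Za-z_]\w*", factored_form))

def network_literal_count(node_functions):
    """Literal count of a network: sum of the literal counts of its nodes."""
    return sum(literal_count(f) for f in node_functions.values())

# Hypothetical node functions.
nodes = {"n1": "a (b + c) + !a d", "n2": "!n1 e"}
print(literal_count(nodes["n1"]), network_literal_count(nodes))
```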
Finding reliable technology independent estimators of circuit delay is still the
subject of current research, despite recent advances [43]. Simple models based on the
number of levels of logic with or without a corrective term for multiple fanouts are usually
very unreliable. Good delay estimators are crucial to the accuracy and consistency of
technology independent delay optimization algorithms.

Chapter 2

Delay Optimization with Tree Covering

The most constant difficulty in contriving the engine
has arisen from the desire to reduce the time
in which the calculations were executed
to the shortest which is possible.
— CHARLES BABBAGE (1837)

2.1 Introduction

The problem of delay minimization of an arbitrary Boolean network is difficult in
general. Ideally, we would like to find a network of gates that is functionally equivalent to a
Boolean network and has minimum delay. For a given Boolean network, the number of such
implementations is theoretically infinite. Even if we restrict ourselves to implementations of
bounded size, the number of such implementations is still very large and there is no known
method to explore the solution space efficiently. All the practical methods developed so far
consist of modifying an initial representation of a Boolean network using transformations
that preserve the behavior of the network and reduce the cost of its implementation.
Since the problem is so complex, it is in practice divided into two phases: a
technology independent phase and a technology dependent phase, also called technology
mapping. The purpose of the technology independent phase is to provide a Boolean network
equivalent to the original circuit which can be implemented efficiently by the technology
mapping algorithms. The role of the technology mapper is to compute a network of gates of
minimum cost equivalent to a given Boolean network. The simplifying assumption we make
about the technology mapper is that the structure of the network of gates it computes is
derived from the structure of the Boolean network it takes as input. The technology mapper
is not expected to modify the structure of the network drastically: such transformations are
best left to the technology independent phase, not because doing so yields better results,
but because it corresponds to a natural separation of concerns, and a simpler overall design.

Figure 2.1: Example of Rules (two rewrite rules)
The first technology mappers were rule-based (LSS [13], SOCRATES [19]), i.e.
based on local transformations called rules. Examples of rules are given in Figure 2.1.
Rules can be used to implement unimplemented logic, or simply to improve the quality of
an already implemented set of gates. Rule-based systems are quite flexible and can
generate circuits of good quality in terms of either area or delay. They can also be used in
a postprocessing phase to improve the quality of a circuit generated by other algorithms.
Unfortunately, rule-based systems suffer from two severe limitations: rules are library
dependent and only local optimizations are possible in polynomial time.
An attractive alternative to rule-based technology mappers is the use of a
divide-and-conquer strategy, where the network is partitioned into the largest pieces that can be
handled with efficient, optimum, library independent algorithms. The most important of
these algorithms is tree covering. Given a tree, expressed with simple primitives (e.g.
2-input NAND gates and inverters), tree covering can find a minimum cost covering of the
tree by gates of a library. This covering can then be extracted to form an implementation of
the tree. This implementation is not in general a minimum cost implementation of the tree
because it is a function of the decomposition of the tree into primitives, and also because
in some cases the minimum cost solution cannot be expressed as a tree cover (i.e. when a
minimum delay solution requires the introduction of buffers between gates, or when the use
of Boolean identities is required to identify a match). However, tree covering usually yields
good quality implementations, and has the merit of being very fast.
Tree covering was originally developed for code generation in retargetable
compilers [1]. It was first introduced in the context of technology mapping by Keutzer [24] for
producing minimum area implementations. There have been some earlier attempts to extend
tree covering to produce minimum delay implementations, but these techniques were either
ad hoc [26] or were not implemented [36]. In this chapter we introduce an elegant solution
to the problem of optimal tree covering for delay, and present and discuss experimental
results comparing this solution with simpler but suboptimal techniques.
Optimal tree covering can be solved in time linear in the number of nodes in the
tree. The complexity of tree covering as a function of the number and size of the gates in
the library depends on which tree pattern matching algorithm is used. The fastest pattern
matching algorithms build an automaton that summarizes all possible patterns matched
by gates in the library. At the cost of some preprocessing time, these algorithms avoid
having to traverse the same substructures several times. A good overview of tree pattern
matching algorithms can be found in [20]. The fastest algorithms are based on a bottom-up
traversal of the trees, but they have the largest space requirements. An interesting
way to reduce this space overhead can be found in [11]. No technology mapper has been
based on bottom-up tree covering. The second fastest algorithms are based on top-down
string matching techniques [1]. These algorithms were used in DAGON [24]. In contrast,
the misII technology mapper [14] uses a more conventional but slower matching algorithm,
which simply enumerates all patterns and requires little preprocessing. For simplicity, we use
the slower matching algorithm used in misII. The choice of the matching algorithm has no
influence on the quality of the final result and thus has no influence on our experimental
results.
The rest of this chapter is organized as follows. In section 2.2 we review the basic
tree covering algorithm for minimum area and show how it can be directly extended to
minimize delay if load values are supposed to be constant. Then in section 2.3 we show
how to extend this algorithm to take exact load values into account. This extension was
originally suggested by Rudell [36] and implemented by us [42] using load discretization.
We present a new method that relies on a functional representation of achievable arrival
times at a node v as a function of the load at the output of v. In section 2.4 we review
three main factors that limit the optimality of tree covering for delay, and discuss how these
limitations could be overcome. In section 2.5 we present several techniques that can be used
to reduce the area of a tree cover under a delay constraint. One of these techniques consists
in extending tree covering further to directly minimize area under a delay constraint, at the
expense of more computation time. We present an adaptive time discretization algorithm
to perform this task. This algorithm was already described in [36, 42]. Finally we present
and discuss our experimental results in section 2.6 and summarize the main results of this
chapter in section 2.7.

2.2 Tree Covering


The use of tree covering algorithms for technology mapping was originally
proposed by Keutzer [24]. The basis of this technique is to decompose the circuit into trees,
and use tree covering on each of the trees separately. Tree covering algorithms were
initially developed for code generation [2, 1]. They are based on fast tree pattern matching
techniques, and use dynamic programming to implicitly enumerate all solutions efficiently.
Tree covering works very well for additive cost functions, and more generally whenever the
minimum cost cover of a tree rooted at a node is a function only of the cost of the matches
at that node and the cost of the subtrees. This is the case in code generation and logic
synthesis for minimum area when we ignore non-linear effects (pipelining, caching in code
generation; placement and routing in logic synthesis).
The tree covering problem can be described as follows:

• Given a tree T and a set of tree patterns P, representing the gates of a library, and
a cost associated with each gate,

• Find a cover of the tree T of minimum cost.

To allow the use of tree pattern matching algorithms, we need to represent both the tree
and the logic functions associated with the gates in a common set of primitives or base
functions. In misII and DAGON, this set of primitives is composed of only two types
of elements: 2-input NAND gates and inverters. misII also adds extraneous inverters to
allow the pattern matching algorithm to choose in which phase subfunctions should be
implemented. An example of such a decomposition for a NOR gate and an AOI21 gate is
given in Figure 2.2. The network is decomposed in a similar fashion into primitive gates.

Figure 2.2: Example of Decomposition into Primitive Gates (panel labels: NOR2, OAI21)
We suppose that we have at our disposal an algorithm to enumerate all matching
tree patterns m at a given node v of tree T. We call this set match(v, P). Associated with a
given pattern m in match(v, P), we have a gate g(m) and a list of nodes $v_i \in inputs(m)$, which
correspond to the nodes of T matching the inputs of m. An example of such a matching is
given in Figure 2.3.
In general, the cost of a match depends on some contextual information that is
a function of both the children and the parent nodes of a given node v of the tree. While
for area minimization the cost of a match only depends on the children nodes, for delay
minimization, or area minimization under a delay constraint, the cost function also depends
on information provided by the parent node (load values, required times). We present here
a general formulation of the cost function that covers all three types of optimization.
To formulate this cost function, we use a generic variable I to denote the contextual
information derived from the parent nodes. The minimum cost tree cover at a node v,
cost(v, I), is the minimum cost of a cover of the subtree rooted at v for a given I.
To evaluate the cost of a pattern m in match(v, P) we need a cost function
cost(m, I) that depends on the cost of the gate g(m) associated with the pattern and the
costs $cost(v_{i,m}, I_{i,m})$, where the nodes $v_{i,m} \in inputs(m)$ are the nodes in the tree matching
the inputs of m, and $I_{i,m}$ is the contextual information derived from I that is propagated
through m to node $v_{i,m}$. For example, $I_{i,m}$ may be the load of gate g(m) at input pin i, in
which case it is independent of I; or it can be the required time at $v_{i,m}$ derived from the
required time at v.

Figure 2.3: Example of Pattern Matching (labels: gate g(m), pattern m, decompose, represent, match, nodes v1, v2, v3)

To be able to use a dynamic programming formulation of the minimum cost covering problem
we suppose in addition that the function
$cost(m, I) = cost(g(m), [cost(v_{i,m}, I_{i,m}),\ v_{i,m} \in inputs(m)])$
is monotonic non-decreasing in the costs of the subtrees. We can thus use the principle of
optimality to assert that there exists a minimum cost cover at a node v which is made of a gate g
matching at node v and minimum cost covers of the subtrees rooted at the input pins of g.
It is easy to see by induction that in one bottom-up traversal of the tree, from the leaves
to the root, an algorithm can obtain a minimum cost cover. A sketch of this algorithm is
given in Figure 2.4, where the function select(v, I) records the best pattern matching at
node v as a function of the contextual parameter I.

    procedure min_cost(v, P)
        for w in children(v) do min_cost(w, P) od;
        for m in match(v, P) do
            cost(m, I) := cost(g(m), [cost(v_{i,m}, I_{i,m}), v_{i,m} ∈ inputs(m)]);
        od;
        cost(v, I) := min_{m ∈ match(v, P)} cost(m, I);
        select(v, I) := argmin_{m ∈ match(v, P)} cost(m, I)
    end min_cost

Figure 2.4: Tree Covering for Minimum Cost

This formulation relies on our ability to manipulate functions of I as data and in
particular to compute the minimum of two such functions. In the case of area minimization,
there is no contextual information to propagate: area(m) = cost(m, I) and area(v) =
cost(v, I) are simply numbers. More specifically:

$$area(m) = area(g(m)) + \sum_{v_{i,m} \in inputs(m)} area(v_{i,m}) \tag{2.1}$$

In the case of delay minimization, if we ignore the differences between loads and use a
nominal load value $\gamma_0$ at the output of each gate, there is no contextual information to
propagate either. In that case the cost function arrival(m) = cost(m, I) becomes:

$$arrival(m) = \max_{v_{i,m} \in inputs(m)} \left( \alpha_{i,g(m)} + \beta_{i,g(m)} \gamma_0 + arrival(v_{i,m}) \right) \tag{2.2}$$

However if we want to take actual load values into account, cost(m, I) and cost(v, I) are
non-constant functions of I. We will see next how they can be computed efficiently.
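
The context-free case can be sketched in a few lines of Python. The sketch below instantiates the bottom-up dynamic programming of Figure 2.4 for area, following equation (2.1). It is only an illustration under several assumptions: trees and patterns are encoded as nested tuples over NAND and INV, the library and its areas are hypothetical, and the matcher is a naive recursive one that keeps a single match per gate (it is not the matcher of misII or DAGON).

```python
def match(pattern, node):
    """Return the tree nodes bound to the pattern inputs, or None if no match."""
    if isinstance(pattern, str):      # a pattern input matches any subtree
        return [node]
    if isinstance(node, str) or pattern[0] != node[0]:
        return None
    if pattern[0] == "INV":
        return match(pattern[1], node[1])
    for l, r in ((node[1], node[2]), (node[2], node[1])):  # NAND is commutative
        ml, mr = match(pattern[1], l), match(pattern[2], r)
        if ml is not None and mr is not None:
            return ml + mr
    return None

def min_area(node, library, memo=None):
    """Return (area, gate) of a minimum area cover of the subtree rooted at node."""
    memo = {} if memo is None else memo
    if isinstance(node, str):         # leaf of the tree: nothing to cover
        return 0.0, None
    if node not in memo:
        best = (float("inf"), None)
        for gate, area, pattern in library:
            inputs = match(pattern, node)
            if inputs is None:
                continue
            # Equation (2.1): gate area plus best areas of the input subtrees.
            total = area + sum(min_area(v, library, memo)[0] for v in inputs)
            if total < best[0]:
                best = (total, gate)
        memo[node] = best
    return memo[node]

# Hypothetical library: (gate name, area, pattern over NAND/INV primitives).
LIBRARY = [
    ("INV", 2.0, ("INV", "a")),
    ("NAND2", 3.0, ("NAND", "a", "b")),
    ("NOR2", 3.0, ("INV", ("NAND", ("INV", "a"), ("INV", "b")))),
    ("AOI21", 4.0, ("INV", ("NAND", ("NAND", "a", "b"), ("INV", "c")))),
]
tree = ("INV", ("NAND", ("NAND", "x", "y"), ("INV", "z")))
print(min_area(tree, LIBRARY))
```

For this tree a single AOI21 gate (area 4) beats the cover made of two inverters and two NAND2 gates (area 10). Replacing the sum by the maximum of equation (2.2) gives the constant-load delay variant.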

2.3 Handling of Load Values

To compute an exact minimum delay cover we need to take load values into
account. Load information is propagated top down, from the root to the leaves of the tree.
Therefore load values have to be represented as contextual information. In that case the
cost function depends on the load $\gamma$ at the output of a node:

$$arrival(m, \gamma) = \max_{v_{i,m} \in inputs(m)} \left( \alpha_{i,g(m)} + \beta_{i,g(m)} \gamma + arrival(v_{i,m}, \gamma_{i,m}) \right) \tag{2.3}$$

where $\gamma_{i,m}$ denotes the load on pin i of gate g(m).

Cost functions are now dependent on the load $\gamma$. We need to find a representation
of these functions that allows us to compute their minimum as in Figure 2.4. We propose here
three different representations. The first two use staircase functions, the third piece-wise
linear functions.

2.3.1 Uniform Discretization

The simplest non-constant representation of the cost functions is to use staircase
functions, where the limits of the intervals are fixed and independent of the nodes. This is
simple to implement but relatively inaccurate unless the discretization intervals are made
fairly small, which is computationally expensive. This method was originally suggested by
Rudell [36].

2.3.2 Adaptive Discretization

A better approach is to adapt the discretization intervals to each node. In one
precomputation phase, one can determine all the possible load values at a node by examining
all matches at every node. These values can then be used to determine the boundaries of
discretization intervals. Since it is guaranteed that no other value is possible at an internal
node, this method is exact. If the number of intervals grows too large, it can be easily
reduced by merging the smaller intervals with their neighbors, for a small expected reduction
in precision. This method was used in [42].

2.3.3 Functional Abstraction

Adaptive discretization has two drawbacks: it requires the precomputation of all
possible load values at each node, which is roughly as time consuming as performing the
tree covering itself; it does not work at the root of the tree, where the number of all possible
load values grows exponentially and the range grows linearly with the number of fanouts of
the root node.
A more general and more elegant method consists in going up one level of
abstraction, by storing at each node a function that gives, for any given load value, the optimum
choice of a gate at this node. The main difficulty is to find a data structure adapted to the
representation of such a function. We first list which operations are needed on this data
structure, and then show these operations can be efficiently implemented using piece-wise
linear functions.
We have to be able to perform efficiently the following operations:

• representation of gate delays: given a gate g, matching at node v, and arrival times
$a_i$ at the inputs of the gate, we should be able to compute a function $f(\gamma)$, where $\gamma$ is
the load at the output of node v, that gives the arrival time at the output of g when
g has to drive a load of $\gamma$.

• computation of the minimum of two functions: given two functions f and g of one
variable $\gamma$ representing the load at the output of a node v, we need to be able to
compute the function $h(\gamma) = \min(f(\gamma), g(\gamma))$.

• finding the minimum solution: in addition, $h = \min(f, g)$ should be represented in
such a way that we can tell, for a given value of $\gamma$, whether the minimum is realized
by f or g. If h represents the minimum of all the functions representing the arrival
times for all gates matching at a node v, we should be able to determine from h not
only the best arrival time, but also a gate that realizes the best arrival time.

Given these three operations, we can obtain a minimum delay tree cover by applying directly
the algorithm of Figure 2.4. In the remainder of this section, we show how these three
operations can be implemented efficiently using piece-wise linear functions within our delay
model. For more complex delay models, piece-wise linear functions may not suffice. In that
case, a more complex data structure would need to be used.

Piece-Wise Linear Functions. A piece-wise linear function is a continuous function,
formed of a finite sequence of connected, linear segments. We only consider piece-wise
linear functions defined on $\mathcal{R}^+$, so we can always suppose that the support of each segment
is of the form $(p_1, p_2)$, with $p_1 \in \mathcal{R}^+$ and $p_2 \in \mathcal{R}^+ \cup \{+\infty\}$. We represent each segment by
a tuple of the form $(x_i, y_i, s_i, p_i)$, where $(x_i, y_i)$ are the coordinates of the leftmost point
of the segment, $s_i$ is the slope of the segment, and $p_i$ is a pointer to an object that will
be specified later. Strictly speaking, this representation is incomplete, because it does not
specify the right bound of the segment, but it is not intended to be used in isolation.
We represent a piece-wise linear function f as an array of tuples representing the
connected segments forming f: $(x_i, y_i, s_i, p_i)_{1 \le i \le n}$, such that $0 = x_1 < x_2 < \ldots < x_n$ and,
for $1 \le i \le n - 1$, $y_{i+1} = y_i + (x_{i+1} - x_i) s_i$. All segments but the last one are finite: the
right bound of the i-th segment for $i < n$ is given by $x_{i+1}$.

Representation of the Arrival Time Function at the Output of a Gate. Given
arrival times $(a_j)_{1 \le j \le n}$ at the input pins of a gate g, the arrival time at the output of g
as a function of the load at the output of g is given, in our delay model, as the maximum
of n linear functions, one per input pin. This maximum is a piece-wise linear function,
which is represented, as indicated in the previous paragraph, by a finite sequence of tuples:
$(x_k, y_k, s_k, p)_{1 \le k \le N}$. The pointer p is independent of k and points to a record containing
information pertaining to gate g. This information is needed to retrieve a description of
gate g in case the selection of g yields the earliest arrival time.
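
The maximum of the n per-pin linear functions can be built as an upper envelope of lines. The Python sketch below is one simple way to do it, not necessarily the one used in our implementation: it is quadratic in the number of pins, which is acceptable for the small n of library gates. The tuple layout `(arrival, alpha, beta)` for a pin, the function name and the numeric values are assumptions of this sketch.

```python
def gate_arrival_function(pins, gate):
    """pins: list of (arrival, alpha, beta), one per input pin.
    Returns the segments (x, y, slope, gate) of
    max_i(arrival_i + alpha_i + beta_i * load) for load >= 0."""
    lines = [(a + alpha, beta) for a, alpha, beta in pins]  # (intercept, slope)
    x = 0.0
    cur = max(lines, key=lambda l: (l[0], l[1]))            # highest line at load 0
    segments = [(x, cur[0], cur[1], gate)]
    while True:
        # Only steeper lines can overtake the current one; locate the crossings.
        crossings = [((cur[0] - c) / (s - cur[1]), s, c)
                     for c, s in lines if s > cur[1]]
        crossings = [(cx, s, c) for cx, s, c in crossings if cx >= x]
        if not crossings:
            return segments
        nx = min(cx for cx, _, _ in crossings)
        # Among the lines crossing at nx, the steepest dominates afterwards.
        s, c = max((s, c) for cx, s, c in crossings if cx == nx)
        x, cur = nx, (c, s)
        segments.append((x, c + s * x, s, gate))

# Hypothetical 2-input gate: pin 1 is late but weakly load dependent.
f = gate_arrival_function([(0.0, 1.0, 0.5), (2.0, 0.5, 0.25)], "NAND2")
print(f)
```

Each iteration moves to a strictly steeper line, so the loop terminates after at most n segments, and consecutive segments meet at the crossing points, which keeps the function continuous.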

Computation of the Minimum of Two Piece-Wise Linear Functions. The
minimum of two piece-wise linear functions $f_1$ and $f_2$ can be computed in time $O(n_1 + n_2)$, where
$n_i$ is the number of segments contained in $f_i$. The algorithm bears some similarity with a
well-known linear time algorithm used for merging two sorted lists of data.
The algorithm scans implicitly all values of $\mathcal{R}^+$ from 0 to $+\infty$, and keeps track of
the best segment at each point. In practice the number of points that need to be visited
does not exceed the total number of segments, $n_1 + n_2$. The algorithm maintains two indices
$i_1$ and $i_2$ pointing to the currently active segments, and a scan point x, initially set to 0. As
an invariant of the algorithm, x is guaranteed to lie within segment $i_1$ of $f_1$ and segment $i_2$
of $f_2$. The next value of x is the leftmost of the following three points: the rightmost limit
of segment $i_1$ of $f_1$, the rightmost limit of segment $i_2$ of $f_2$ and the intersection of $i_1$ and
$i_2$, provided that this intersection lies to the right of the current value of x. The current
value of x and the next value of x determine a segment. This segment is a copy of $i_1$ of $f_1$
or $i_2$ of $f_2$ on that range, whichever is the lower on that range of values. x is then updated
to its new value, and $i_1$ or $i_2$ is incremented as necessary. The algorithm terminates when
all segments have been visited.
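
The scan described above can be sketched as follows in Python. Segments are tuples `(x, y, slope, p)` as defined earlier; the function name `pwl_min` and the example functions are hypothetical, and the sketch uses exact floating point comparisons, which a robust implementation would replace by tolerance-based ones.

```python
import math

def pwl_min(f1, f2):
    """Pointwise minimum of two piece-wise linear functions on [0, +inf).
    Each function is a list of segments (x, y, slope, p), with x == 0 first."""
    def right(f, i):                      # right bound of segment i
        return f[i + 1][0] if i + 1 < len(f) else math.inf
    def value(f, i, x):                   # value of segment i at abscissa x
        x0, y0, s, _ = f[i]
        return y0 + s * (x - x0)

    out, i1, i2, x = [], 0, 0, 0.0
    while True:
        v1, v2 = value(f1, i1, x), value(f2, i2, x)
        s1, s2 = f1[i1][2], f2[i2][2]
        nxt = min(right(f1, i1), right(f2, i2))
        if s1 != s2:
            cross = x + (v2 - v1) / (s1 - s2)  # intersection of the active segments
            if x < cross < nxt:
                nxt = cross
        # No crossing inside (x, nxt): lower value wins, ties go to the lower slope.
        lower = f1[i1] if (v1, s1) <= (v2, s2) else f2[i2]
        if not out or out[-1][2:] != (lower[2], lower[3]):
            out.append((x, min(v1, v2), lower[2], lower[3]))
        if nxt == math.inf:
            return out
        x = nxt
        if x >= right(f1, i1):
            i1 += 1
        if x >= right(f2, i2):
            i2 += 1

# Hypothetical arrival functions of two gates A and B matching at the same node.
fa = [(0.0, 0.0, 2.0, "A"), (2.0, 4.0, 0.0, "A")]
fb = [(0.0, 1.0, 1.0, "B")]
print(pwl_min(fa, fb))
```

In the example gate A is better for small loads, gate B takes over at load 1, and A becomes better again from load 3 on; the pointer field of each output segment records which gate realizes the minimum.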

Finding the Optimum Solution. By computing the minimum of all the functions
representing the arrival times for all gates matching at a node v, we obtain a piece-wise linear
function f representing the best arrival times realizable at v for any load value. For a
given load value $\gamma$, we simply perform a binary search on the array of tuples $(x_i, y_i, s_i, p_i)$
representing f to identify the segment to which $\gamma$ belongs. Once this segment is identified,
we use the pointer $p_i$ to retrieve the gate that realizes the minimum.
The actual implementation is slightly more complex than what we have just
described, because a gate may match at a node in several different ways, and we need to keep
track with $p_i$ not only of a choice of a gate but also of a choice of a matching.
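
The lookup step can be sketched with Python's standard `bisect` module. The function name `lookup` and the example array are hypothetical; the list of abscissas is rebuilt on every call for brevity, whereas an implementation would keep it alongside the segments.

```python
from bisect import bisect_right

def lookup(segments, load):
    """Return (arrival_time, pointer) of the best solution for a load >= 0.
    segments is a list of (x, y, slope, p) sorted by x, starting at x == 0."""
    xs = [x for x, _, _, _ in segments]
    i = bisect_right(xs, load) - 1     # last segment starting at or before load
    x, y, s, p = segments[i]
    return y + s * (load - x), p

# Hypothetical best-arrival function at a node, with gates A and B.
best = [(0.0, 0.0, 2.0, "A"), (1.0, 2.0, 1.0, "B"), (3.0, 4.0, 0.0, "A")]
print(lookup(best, 2.0), lookup(best, 5.0))
```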

2.4 Limits to the Optimality of Tree Covering for Delay

2.4.1 Initial Decomposition

The initial decomposition of a tree into simple primitives (i.e. 2-input NAND gates
and inverters) is required before tree covering can be used. The principle of tree covering
is to decompose the target tree and gates into the same set of primitives so that functional
matching is replaced by the simpler problem of pattern matching. Since gates represent
small logic functions, it is feasible to enumerate all patterns that represent a gate. This
number can be reduced further by exploiting the symmetry of some inputs.
Unfortunately, for the tree itself, it is not practical in general to enumerate all
possible decompositions into simple primitives. We choose one decomposition, and the result
depends in general on the quality of this decomposition. Ideally, an adequate choice should
be made as a function of the arrival times at the leaves of the tree. In the absence of this
information, we simply generate a balanced decomposition of the nodes of the tree. This
is not in general the best choice. Before we can discuss better decompositions, we need to
cover the material of chapter 4. We will come back to this problem in chapter 5.

2.4.2 Suboptimality of Pin Assignment

During tree matching, the symmetry of gate inputs is exploited to reduce the
computational overhead of the algorithm. All patterns that a gate can match are enumerated,
but all possible assignments of gate inputs to pattern inputs are not. When the inputs
are equivalent, only one assignment is tried. For area minimization, logically equivalent pin
assignments have all the same cost, so there is no need to consider more than one. However,
for delay minimization, this is not the case, since the delay through a gate is pin dependent
in general.
For example, a 3-input NAND gate is represented by only one pattern, but has

Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
CHAPTER 2. DELAY OPTIMIZATION WITH TREE COVERING

Figure 2.5: Example of Suboptimal Pin Assignment
(figure labels: critical, decompose, match, pattern; gate delay, pin to output:
v1 -> f = 1.2 ns, v2 -> f = 1.1 ns, v3 -> f = 1.0 ns)

3! = 6 possible pin assignments. Only one of these six assignments is tried by
the tree pattern matching algorithm. This can lead to suboptimal choices. In the example
of Figure 2.5, the critical input is assigned the slowest pin v1. For any preassigned choice
of input ordering, there is always a simple circuit configuration for which this choice is
suboptimal. We propose in this section a simple algorithm to correct this problem.

To optimize pin assignment during tree covering, we can proceed as follows. First,
in a preprocessing step that can be done once and for all on the library, we identify the
sets of pins that are functionally equivalent and therefore interchangeable. Then, for each
gate during pattern matching, we consider each of the equivalent sets of pins separately and
reorder them in order to decrease delay.

If we use constant load values, the problem of optimal pin assignment can be solved
in time $O(n \log n)$, where $n$ is the number of pins in the set. In that case the problem can
be reduced to the following discrete optimization problem: if $a_i$ is the arrival time at node
$v_i$ and $d_j$ the delay through the gate from pin $j$, the problem is to find an assignment of
nodes to pins that minimizes the arrival time at the gate output. Let $\Sigma_n$ be the set of
permutations of a set of $n$ elements, and for $\sigma \in \Sigma_n$, let $\sigma(i)$ be the image of
element $i$ under the permutation $\sigma$. The pin assignment problem can then be formulated as
the following discrete optimization problem:

\[ \min_{\sigma \in \Sigma_n} \; \max_{1 \le i \le n} \; \bigl( a_i + d_{\sigma(i)} \bigr) \]

An optimal solution can be computed by ordering the $a_i$ and the $d_j$ by increasing size and
using the permutation $\sigma(i) = n - i + 1$. However, if we take pin load values into account,
the problem has the following form, where $a_i$ denotes the arrival time at node $v_i$ and $\alpha_j$,
$\beta_i$ and $\gamma_j$ are delay coefficients derived from the delay model:

\[ \min_{\sigma \in \Sigma_n} \; \max_{1 \le i \le n} \; \bigl( a_i + \beta_i \gamma_{\sigma(i)} + \alpha_{\sigma(i)} \bigr) \]

This problem can still be solved optimally by dynamic programming in time $O(2^n)$. Though
exponential, this algorithm is still practical for most libraries since the number of equivalent
pins of any given gate usually does not exceed eight in CMOS technology.
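
As a minimal sketch of the constant-load case described above, the following Python fragment pairs the latest-arriving node with the fastest pin and provides an exhaustive reference for checking the result. It also includes a subset-based dynamic program for the load-dependent cost; this particular recurrence runs in O(n 2^n) time and is only one possible way to organize the search, not necessarily the one used in this work. The function names and the arrival times are hypothetical; the pin delays are those shown in Figure 2.5.

```python
from itertools import permutations


def assign_pins_constant_load(arrivals, pin_delays):
    """Sort-based rule: the k-th earliest node gets the k-th slowest pin.

    Returns (assignment, worst), where assignment[i] is the pin given to
    node i and worst = max_i(arrivals[i] + pin_delays[assignment[i]]).
    """
    n = len(arrivals)
    nodes = sorted(range(n), key=lambda i: arrivals[i])    # increasing a_i
    pins = sorted(range(n), key=lambda j: pin_delays[j])   # increasing d_j
    assignment = [0] * n
    for rank, node in enumerate(nodes):
        assignment[node] = pins[n - 1 - rank]              # sigma(i) = n - i + 1
    worst = max(arrivals[i] + pin_delays[assignment[i]] for i in range(n))
    return assignment, worst


def assign_pins_with_loads(a, beta, alpha, gamma):
    """Subset DP for min over sigma of max_i(a[i] + beta[i]*gamma[s(i)] + alpha[s(i)]).

    best[mask] is the optimum when nodes 0..popcount(mask)-1 occupy exactly
    the pins in mask. Illustrative recurrence, O(n * 2**n) time.
    """
    n = len(a)
    best = [float("inf")] * (1 << n)
    best[0] = float("-inf")
    for mask in range(1, 1 << n):
        i = bin(mask).count("1") - 1          # node placed last
        for j in range(n):
            if mask & (1 << j):
                cost = a[i] + beta[i] * gamma[j] + alpha[j]
                best[mask] = min(best[mask], max(best[mask ^ (1 << j)], cost))
    return best[-1]


def brute_force(cost, n):
    """Reference answer: try every permutation (only viable for tiny n)."""
    return min(max(cost(i, p[i]) for i in range(n))
               for p in permutations(range(n)))


arrivals = [0.3, 2.0, 0.9]       # hypothetical arrival times at three nodes
pin_delays = [1.2, 1.1, 1.0]     # pin-to-output delays from Figure 2.5
assignment, worst = assign_pins_constant_load(arrivals, pin_delays)
print(assignment, worst)
```

In this example the node with the latest arrival time (2.0) receives the 1.0 ns pin, which is the behavior the sorting rule is meant to guarantee.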

2.4.3 Rise and Fall Delays

So far we have ignored the fact that in our delay model we distinguish between rise
and fall delays. This means that arrival times are characterized by a pair of real numbers
$(a_r, a_f)$ instead of a single number. To decide which of two solutions is better, we need to
decide which of two pairs of arrival times is "faster". We use the following criterion to make
this decision:

\[ (a_r, a_f) < (b_r, b_f) \quad \text{if} \quad \max(a_r, a_f) < \max(b_r, b_f) \]

This selection is not guaranteed to be optimal in general but works well in practice. It
outperforms the other obvious choice:

\[ (a_r, a_f) < (b_r, b_f) \quad \text{if} \quad \frac{a_r + a_f}{2} < \frac{b_r + b_f}{2} \]

2.5 Minimizing Area under a Delay Constraint

We can also use the dynamic programming algorithm of Figure 2.4 to find a minimum
area cover under a delay constraint. The delay constraint is expressed as a required
time at the root of the tree, and is propagated down the tree as a contextual value, together
with load values. In that case the cost function is of the form $cost(m, \gamma, r)$, where $\gamma$
represents the load and $r$ the required time at the output of a node:

\[
\begin{aligned}
cost(m, \gamma, r) &= \{\, area(m),\; slack(m, \gamma, r) \,\} \\
area(m) &= area(g(m)) + \sum_{v_{i,m} \in inputs(m)} area(v_{i,m}) \\
slack(m, \gamma, r) &= r - arrival(m, \gamma, r) \\
arrival(m, \gamma, r) &= \max_{v_{i,m} \in inputs(m)} \bigl( \alpha_{i,g(m)} + \beta_{i,g(m)} \gamma + \ldots \bigr)
\end{aligned}
\]

where $r_{i,m}$ is the required time at input node $v_{i,m}$ propagated from required time $r$ at node
$v$ through gate $g(m)$. To minimize area under a delay constraint, we take into account in
the cost function both the area and the slack (the slack is the difference between required
time and arrival time). At each intermediate node the minimum area solution is selected if
it meets the delay requirement (that is, if its slack is non-negative); otherwise, the minimum
delay solution is chosen.
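
The selection rule applied at each intermediate node can be sketched as follows. The tuple layout and the candidate values are hypothetical, and arrival times are assumed to have been computed already for the load and required time under consideration.

```python
def select_match(candidates, required):
    """Pick a match at one node for a given required time.

    candidates: list of (name, area, arrival) tuples (hypothetical layout).
    Returns the minimum area candidate whose slack (required - arrival) is
    non-negative; if none meets the requirement, returns the fastest one.
    """
    feasible = [c for c in candidates if required - c[2] >= 0]
    if feasible:
        return min(feasible, key=lambda c: c[1])
    return min(candidates, key=lambda c: c[2])


candidates = [("small", 3.0, 6.0), ("medium", 5.0, 4.5), ("fast", 9.0, 4.0)]
print(select_match(candidates, required=5.0)[0])   # two candidates meet the requirement
print(select_match(candidates, required=3.0)[0])   # no candidate meets the requirement
```

With a required time of 5.0 the cheaper of the two feasible candidates is kept; with a required time of 3.0 the rule falls back to the fastest candidate.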

2.5.1 Adaptive Discretization of Required Times

To implement this algorithm, we need to discretize the required times. Enumerating
all possible required times at a node is not feasible in this case, because required times
depend not only on the match just above a node but on all the possible combinations of
matches above the node up to the root of the tree.
To control the run time of the algorithm, we enforce a limit on the number of
discretization intervals at each node, which is an integer-valued parameter r specified by
the user. The discretization intervals are obtained by first computing a range of interesting
values for the required time at each node and then dividing this range into equal intervals.
The range of interesting values for the required time at a node $v$ is determined as
follows. Let $\gamma$ be a possible load value at node $v$. Let $a_{delay}(v)$ be the minimum arrival
time achievable at node $v$ with a load of $\gamma$ at its output, and $a_{area}(v)$ the arrival time at $v$
of a minimum area cover of the subtree rooted at $v$ for that same load value. Any required
time outside the interval $[a_{delay}(v), a_{area}(v)]$ does not need to be considered. Indeed, if
the required time at node $v$ is less than $a_{delay}(v)$, no cover of the subtree rooted at $v$ can
meet the timing requirement; in that case, the minimum cost cover is the minimum delay
cover for this subtree. If the required time is greater than $a_{area}(v)$, we can just choose the
minimum area cover for this subtree.
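
A minimal sketch of this discretization, under stated assumptions: the function names are hypothetical, and rounding a required time down to the nearest grid point is a choice made here for illustration (a cover that meets the tighter time also meets the original one); the text does not specify the rounding direction.

```python
def required_time_grid(a_delay, a_area, intervals):
    """Equally spaced required times spanning [a_delay, a_area]."""
    if intervals < 1:
        raise ValueError("need at least one interval")
    if a_area <= a_delay:              # the minimum area cover is already the fastest
        return [a_delay]
    step = (a_area - a_delay) / intervals
    return [a_delay + i * step for i in range(intervals + 1)]


def snap_required_time(required, grid):
    """Map a required time onto the grid (rounding down is an assumption).

    Values below the range behave like the minimum delay end point and values
    above it like the minimum area end point, as argued in the text.
    """
    if required <= grid[0]:
        return grid[0]
    if required >= grid[-1]:
        return grid[-1]
    return max(t for t in grid if t <= required)


grid = required_time_grid(a_delay=4.0, a_area=6.0, intervals=4)
print(grid)
print(snap_required_time(5.3, grid))
```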


2.5.2 A Greedy Approach to Area Recovery

If we neglect the inaccuracies introduced by the discretization of required times,
the previous algorithm is optimum, though relatively complex. A simpler approach to area
recovery consists in computing at each node the minimum delay and minimum area solutions.
To select the cover, we can then use a simple top-down traversal of the tree starting from
the root. Each time a gate is selected, we propagate the required time through its pins.
The required time at the root of the tree is simply the minimum achievable arrival time.
To select a cover for a subtree, we choose the minimum delay solution unless the minimum
area solution is fast enough. This approach knows no compromise: it never chooses a
solution of intermediate area that would be acceptable in terms of delay. It is, however,
fast and simple.

2.5.3 Area Recovery by Optimal Inverter Selection

Another effective approach to reducing the complexity of area recovery is to
concentrate on special cases. The most obvious candidates are inverters. Inverters are the
most frequently used gates in circuits, and selecting the best inverter to minimize delay at
any given point in the network can be done very simply by enumerating all choices and
selecting the best. There is a simple and optimal algorithm that applies inverter selection
to minimize delay through a mapped network. A mapped network is a Boolean network
where each node represents a gate.
The algorithm proceeds as follows: it visits the nodes of the network in topological
order from the root to the leaves. Each time an inverter is encountered, it is replaced by an
inverter that has minimum delay in this context. The optimal choice depends on the drive
at the input and the load at the output of the inverter. If applied to a network obtained
from a minimum delay cover, this optimization has no effect. However, it can be used to
speed up networks obtained from the minimum delay tree covering algorithm that assumes
that all load values are identical.
More importantly, inverter selection can be used for area recovery. Given a mapped
network, with a required time at the root (either specified by the user, or taken as being
equal to the arrival time at the root of the tree), we can traverse the network from the root
to the leaves and apply the following optimization each time an inverter is encountered. If
there is a smaller inverter that could be used at the node without making the slack negative
at that node, the smallest such inverter is used in place of the current one. This algorithm
is greedy in the sense that it does not necessarily make the best use of the available slack
to recover area. However, it is guaranteed never to worsen delay, and, as we will see in the
results section, is quite effective in practice.
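
The local step of this area recovery can be sketched as below. This is an illustration under simplifying assumptions, not the implementation used in this work: the delay model is linear (intrinsic delay plus a load-dependent term), the driver of the inverter is summarized by a single drive coefficient, and the `Inverter` fields, the three-cell library and all numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Inverter:
    name: str
    area: float
    alpha: float     # intrinsic delay
    beta: float      # delay per unit of output load
    in_load: float   # load presented to the driving gate


def output_arrival(inv, arrival_unloaded, driver_beta, out_load):
    """Arrival time at the inverter output for a given choice of inverter."""
    driver_penalty = driver_beta * inv.in_load     # a larger inverter loads its driver more
    return arrival_unloaded + driver_penalty + inv.alpha + inv.beta * out_load


def recover_area(current, library, arrival_unloaded, driver_beta, out_load, required):
    """Return the smallest inverter that keeps the slack non-negative.

    If no strictly smaller inverter qualifies, the current one is kept, so
    the step never makes timing worse than it already is.
    """
    smaller_ok = [
        inv for inv in library
        if inv.area < current.area
        and required - output_arrival(inv, arrival_unloaded, driver_beta, out_load) >= 0
    ]
    return min(smaller_ok, key=lambda inv: inv.area) if smaller_ok else current


library = [
    Inverter("INV1", 1.0, 0.4, 1.00, 1.0),
    Inverter("INV2", 2.0, 0.4, 0.50, 2.0),
    Inverter("INV4", 4.0, 0.4, 0.25, 4.0),
]
chosen = recover_area(library[2], library, arrival_unloaded=1.0,
                      driver_beta=0.1, out_load=4.0, required=4.0)
print(chosen.name)
```

In this setting the largest inverter is replaced by the medium one, because the smallest one would make the slack negative.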

2.5.4 Area Recovery by Optimal Gate Selection

There is no reason to perform optimal gate selection on inverters only. The inverter
selection algorithm can be extended to all gates that come in different sizes and strengths
in the library. This provides a cheap and simple way to recover area after tree covering
without hampering the full power of tree covering for delay minimization. This optimization
is likely to be very effective for large libraries.

2.6 Experimental Results

Circuits are usually not trees: they have several outputs, inputs are shared among
several functions, and there may be several paths from one node to another. To assess the
amount of improvement we can expect from the algorithms proposed in this chapter, we
need to measure their effect in isolation on trees. To that effect, we generated three Boolean
functions, nor32, balanced and unbalanced, that can be represented as trees:

• nor32, a 32-input NOR gate.

• balanced, a balanced binary tree with 64 inputs. Internal nodes alternately
compute a Boolean AND or a Boolean OR, with inverters inserted randomly.

• unbalanced, an unbalanced binary tree with 32 inputs, where every node has at most
one child that is not a leaf. Internal nodes alternately compute a Boolean
AND or a Boolean OR, with inverters inserted randomly.

The results of these experiments are also sensitive to the gate libraries being used. To take
this effect into account, we performed our experiments with four different CMOS standard
cell libraries of various origins:

• MCNC, a public domain library available from MCNC. It is distributed with the IWLS'89
benchmark suite [32] under the name lib2. It is composed of 29 gates. Inverters
appear in 3 different strengths; all the other gates in one strength only. Gate delay
information is pin dependent.

• CMOS12 is a library from AT&T. It is composed of 189 gates, mostly AOI and OAI
gates. Only a few gates appear in different strengths. Gate delays are only available for
the slowest pins; all pins are assigned the worst case delay, which is too conservative.

• LIBRARY3 is an industrial library. It is composed of 80 gates. Most gates come in
different strengths, and delay information is pin dependent. The library contains 4
different sizes of inverters.

• LIBRARY4 is another industrial library. It is composed of 99 gates. Like the previous
library, most gates come in different strengths and delay information is pin dependent.
It provides a wider selection of strengths for commonly used gates than LIBRARY3.
For example, it contains 9 different sizes of inverters instead of 4, and 4 different sizes
of 2-input NAND gates instead of 2.

In the following sections, we report and discuss results by library. We provide detailed
experimental results for the first two libraries, and only summary information for the
remaining two. For clarity, we use the following acronyms to refer to the various tree covering
algorithms studied in these experiments:

• MA refers to minimum area tree covering.

• MDCL refers to minimum delay tree covering using a constant load value.

• MDCLIS refers to minimum delay tree covering using a constant load value, followed
by optimum inverter selection. Inverter selection is done using the algorithm of
section 2.5.

• MDEL refers to minimum delay tree covering using exact load values.

• MDELIS refers to minimum delay tree covering using exact load values, followed by
optimum inverter selection.

• MADC refers to minimum area tree covering under a delay constraint.

Since MDCL is not optimum for delay, MDCLIS can outperform MDCL both in terms of area
and delay. On the other hand, MDEL being optimum for delay, MDELIS can only improve
over MDEL in terms of area.


circuit       MA             MDCL           MDEL           MADC
              area   delay   area   delay   area   delay   area   delay
nor32           53    5.14     93    5.31     73    5.12     53    5.14
balanced       133    6.15    177    4.71    161    4.69    167    4.69
unbalanced      75   20.70    111   16.32     95   16.31    106   16.61

Table 2.1: Comparison of Tree Covering Algorithms with library MCNC

MA: Minimum Area
MDCL: Minimum Delay with Constant Loads
MDEL: Minimum Delay with Exact Loads
MADC: Minimum Area under a Delay Constraint
area: total cell area
delay: measured in nanoseconds

2.6.1 Results with the MCNC Library

In table 2.1 we show the results obtained with minimum area tree covering (MA),
minimum delay tree covering with constant loads (MDCL), minimum delay tree covering
with exact loads (MDEL), and tree covering for minimum area under a delay constraint
(MADC), where the delay constraint is the minimum delay achievable by tree covering.
The results indicate that MDEL tree covering does not lead to a significant delay
reduction over MDCL for this library: only 1%. However, the reduction in area is more
substantial: 15%. This can be explained by the fact that using constant load values in MDCL
underestimates the cost of stronger and larger gates, which have higher input loads. The
penalty of using gates stronger than the optimum has a first order effect in terms of area.
In terms of delay, a higher input load is partially compensated by a stronger drive capability,
leading to only a second order effect on delay. MADC tree covering does not lead to consistent
results. This is due to the fact that MADC relies on an older implementation than MDEL, one that
discretizes load values instead of using piecewise linear functions. Discretization of arrival
times is another source of inaccuracy. Overall MADC reduces area by 6% and increases delay
by 1% relative to MDEL.
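
As a worked check of the MDEL versus MDCL figures, the short computation below takes the MDCL and MDEL columns of table 2.1 and averages the per-circuit relative reductions. The averaging method is an assumption made here (the text does not say how the percentages were aggregated); under that assumption the rounded values agree with the 15% and 1% quoted above. The function name is hypothetical.

```python
# (area, delay) pairs copied from the MDCL and MDEL columns of table 2.1
table_2_1 = {
    "nor32":      {"MDCL": (93, 5.31),   "MDEL": (73, 5.12)},
    "balanced":   {"MDCL": (177, 4.71),  "MDEL": (161, 4.69)},
    "unbalanced": {"MDCL": (111, 16.32), "MDEL": (95, 16.31)},
}


def mean_reduction(table, base, new, column):
    """Average over circuits of (base - new) / base for one column (0 = area, 1 = delay)."""
    ratios = [
        (row[base][column] - row[new][column]) / row[base][column]
        for row in table.values()
    ]
    return sum(ratios) / len(ratios)


area_gain = mean_reduction(table_2_1, "MDCL", "MDEL", column=0)
delay_gain = mean_reduction(table_2_1, "MDCL", "MDEL", column=1)
print(f"area: {100 * area_gain:.1f}%  delay: {100 * delay_gain:.1f}%")
```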
Table 2.2 shows the effect of optimal inverter selection when used after tree covering.
Inverter selection has no effect on trees built with MDEL. However, it noticeably improves
the quality of the covers obtained with MDCL. The benefit of MDELIS over MDCLIS is only
7% in area for no gain in delay, down from 15% and 1% respectively for MDEL vs. MDCL.


circuit       MDCLIS         MDELIS
              area   delay   area   delay
nor32           93    5.31     73    5.12
balanced       153    4.52    161    4.69
unbalanced      96   16.32     95   16.31

Table 2.2: Effect of Optimal Inverter Selection with library MCNC

MDCLIS: Minimum Delay with Constant Loads, with optimum Inverter Selection
MDELIS: Minimum Delay with Exact Loads, with optimum Inverter Selection
area: total cell area
delay: measured in nanoseconds

circuit       MA             MDCL           MDEL           MADC
              area   delay   area   delay   area   delay   area   delay
nor32           53    4.57    189    3.79    234    3.70    174    4.00
balanced       142    6.92    574    4.72    446    4.46    478    4.46
unbalanced      75   26.10    351   17.90    241   17.36    175   17.89

Table 2.3: Comparison of Tree Mapping Algorithms with library CMOS12

MA: Minimum Area
MDCL: Minimum Delay with Constant Loads
MDEL: Minimum Delay with Exact Loads
MADC: Minimum Area under a Delay Constraint
area: total cell area
delay: measured in nanoseconds

For one example, MDCLIS actually outperforms MDELIS in terms of delay. This anomaly
can be explained by the difficulty of handling rise and fall delays optimally, as explained
in section 2.4.3. The anomaly disappears if we modify the library to make all rise and fall
delays equal, or if we modify the comparison function used to compare pairs of rise and fall
arrival times.

2.6.2 Results with the CMOS12 Library

We repeated the previous experiments using the CMOS12 library. The results are
reported in table 2.3 and table 2.4. The results are essentially similar to those reported in
the previous section. MDCL tree covering increased area by 355% for a decrease in delay of
30% over MA. On average, MDEL tree covering outperformed MDCL by 13% in area and 4% in
delay. Again, using exact load values during delay minimization had some effect on area but
only a second order effect on delay. MADC tree covering obtained more satisfactory results
with this library. Compared with MDEL tree covering, it achieved a reduction in area of 17%
for an increase in delay of 4%.
For this library, inverter selection also improves the quality of MDEL tree covers in
terms of area, by 8%. The effect of inverter selection on MDCL tree covers is more significant,
to the point that MDCLIS tree covering actually outperforms MDELIS in terms of area by 8%
for a cost in delay of only 1%.

circuit       MDCLIS         MDELIS
              area   delay   area   delay
nor32          189    3.79    234    3.70
balanced       430    4.46    430    4.46
unbalanced     189   17.70    193   17.36

Table 2.4: Effect of Optimal Inverter Selection with library CMOS12

MDCLIS: Minimum Delay with Constant Loads, with optimum Inverter Selection
MDELIS: Minimum Delay with Exact Loads, with optimum Inverter Selection
area: total cell area
delay: measured in nanoseconds

2.6.3 Results with the LIBRARY3 Library

We repeated the same experiments with library LIBRARY3. MDEL tree covering
outperformed MDCL by 10% in area and 1% in delay. After inverter selection, the advantage
of MDELIS over MDCLIS was only 4% in area and 1% in delay. Inverter selection reduced
the area of MDEL covers by 7%.

2.6.4 Results with the LIBRARY4 Library

We repeated the same experiments with library LIBRARY4. For that library, which
is much richer than LIBRARY3 in terms of number of inverters, MDEL tree covering
outperformed MDCL by 56% in area and 24% in delay. After inverter selection, the advantage was
reduced to 21% in area and 8% in delay for MDELIS over MDCLIS. Inverter selection improved
the area of MDEL covers by only 3%.

2.7 Conclusion
The main conclusion from these experiments is somewhat disappointing: taking load
values into account during minimum delay tree covering (method MDEL) does not lead to a
very significant decrease in delay over the simpler tree covering method that uses constant
loads (method MDCL). The main advantage of MDEL over MDCL is actually more in terms of
area. By ignoring the effect of larger input loads, MDCL tends to favor gates that are larger
than necessary. Choosing larger gates than necessary has a direct effect on area but only a
second order effect on delay, since the cost of higher input loads is partially compensated
by an increase in drive capability.
Another interesting experimental result is that most of the advantage of using
MDEL over MDCL can be obtained by the use of optimal inverter selection after tree covering.
Overall, MDEL remains the best tree covering algorithm for delay, but using MDCL followed
by optimal inverter selection is a very attractive choice if simplicity of implementation and
CPU time are an issue. It will be interesting to see the effect of extending optimal inverter
selection to optimal gate sizing as a postprocessing phase after tree covering.
Minimizing area under a delay constraint is not very effective and is computationally
expensive. We strongly recommend a divide and conquer approach to this problem:
first, use a fast tree covering algorithm that minimizes delay only, to obtain a minimum
delay solution. Then, in a postprocessing phase, use either an inverter selection algorithm
or a gate sizing algorithm to recover area at no delay cost. This approach to area recovery
is suboptimal, but leads rapidly to good quality circuits. If area is more of a concern, it
is always possible to use minimum area tree covering, possibly as a postprocessing phase
after delay optimization, on parts of the tree that are not time critical. Trying to find a
good cover under an area and a delay constraint is too complex and time consuming: it
is more efficient to start with a minimum delay cover, and modify it in a postprocessing
optimization phase to recover area in a greedy fashion whenever possible.

Chapter 3

Fanout Optimization

Mathematicians are like Frenchmen:
whatever you say to them, they translate into their own language
and forthwith it is something entirely different.
— GOETHE

3.1 Introduction

The objective of fanout optimization is to build circuits that do not compute any
function but simply distribute a signal to one or more destinations at a minimum cost. If
the cost to minimize is area, the problem is of very little interest at the logic design level,
since a wire is the best we can do. The problem is only interesting if we are to minimize
delay, or area under a delay constraint. The focus of this chapter is to find methods to
perform this optimization efficiently.
Fanout optimization is important for several reasons. First, it can reduce delay,
often quite dramatically. If the output of a gate is connected to n fanout stems, to a first
approximation the delay through the gate is of order O(n). By building a simple buffer tree
at the output of the gate, we can reduce this delay to O(log n). In addition, if we know
the individual times at which the signals are required at each destination, we can build a
buffer tree that delivers the early required signals earlier, which has the potential of further
decreasing the delay through the entire circuit. Ideally, a fanout algorithm should be able
to take advantage of the slack available at some outputs to increase the slack at the initially
more critical outputs, to achieve an equilibrium point where all outputs are equally critical.
This is usually not achieved in practice due to the discrete nature of delay optimization at
this level.
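
The O(n) versus O(log n) behavior can be illustrated with a small numerical sketch. It assumes a linear delay model (intrinsic delay plus a term proportional to the load), identical coefficients for the gate and the buffers, and equal loads for sinks and buffer inputs; all of these, as well as the function names and coefficient values, are simplifying assumptions made here for illustration.

```python
def direct_delay(n, alpha=1.0, beta=0.2, load=1.0):
    """Delay of one gate driving n sinks directly: alpha + beta * (n * load)."""
    return alpha + beta * n * load


def buffer_tree_delay(n, k=4, alpha=1.0, beta=0.2, load=1.0):
    """Delay from the gate to the sinks through a balanced buffer tree.

    Every stage (the gate and each buffer) drives at most k loads. For
    simplicity the buffers share the gate's alpha and beta, and a buffer
    input presents the same load as a sink.
    """
    groups, levels = n, 0
    while groups > k:                 # add buffer levels until the gate sees <= k loads
        groups = -(-groups // k)      # ceiling division
        levels += 1
    gate = alpha + beta * groups * load
    per_level = alpha + beta * k * load     # worst case: a buffer with k fanouts
    return gate + levels * per_level


for n in (4, 64, 1024):
    print(n, direct_delay(n), buffer_tree_delay(n))
```

Under these assumptions the direct delay grows linearly with n, while each multiplication of n by k adds only one buffer stage to the tree. For small n the tree offers no advantage, which is consistent with buffering being useful mainly for large fanouts.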
The basic optimization techniques on which fanout optimization relies (buffering,
gate resizing, critical signal isolation) are not new. There is a vast literature on timing
optimization techniques that covers these optimizations [34, 17, 21, 4, 13]. What is original
in the fanout optimization approach, originally due to Berman, Carter and Day [5], is the
idea of combining these techniques into a single algorithm. The main limitation of Berman's
work is that it did not propose a very practical approach to apply a fanout algorithm to an
entire circuit. It turns out that we can use a simple technique due to Hoover, Klawe and
Pippenger [22] to solve this problem, as suggested by Fishburn [15]. We have extended this
technique to recover area after delay minimization.
Fanout optimization can also be used to enforce fanout constraints imposed by
a technology. Though in this work we ignore fanout constraints or load limitations, they
can be handled by a simple modification of our algorithms, for example by modifying the
cost functions to make gates infinitely slow as soon as their load constraints are violated.
Another reason for using fanout optimization to enforce load limitations is to control the
accuracy of gate-level delay models. The main source of inaccuracy of these models is the
presence of large capacitive loads at gate outputs.
In some situations, fanout circuits with reconvergence can yield faster circuits than
fanout trees, e.g. when the loads at the sinks are very high, such as for the output pads
of a chip. These situations can be handled by tree-based fanout algorithms if we replace
sinks with large loads by a number of virtual sinks with smaller loads before applying
the algorithm. In this work we suppose that large loads are split among several virtual
destinations or handled by special purpose circuitry, and we only consider fanout circuits
that have no reconvergence, i.e. that are trees.
There is an interesting duality between tree covering and fanout optimization, as
can be seen in figure 3.1. For delay minimization, tree covering aims at minimizing the
arrival time at the root of a tree given arrival times at its leaves, while fanout optimization
aims at maximizing the required time at the root of a fanout tree given required times at
its leaves. In terms of complexity, fanout optimization is, for all but the simplest delay
models, NP-complete. But we can make, from an optimum fanout algorithm, a one-pass
algorithm that optimizes the fanout problems of a circuit and yields a minimum delay
implementation, in the sense that there is no way to modify one or more of the fanout trees
of this implementation in order to decrease delay. In contrast, tree covering itself is of linear
complexity, but extending it to a globally optimum algorithm is a difficult problem since it
subsumes optimum DAG covering, which is NP-complete.

Figure 3.1: Duality of Tree Covering and Fanout Optimization
(figure labels, top to bottom: primary outputs, fanout optimization, tree covering, primary inputs)
In section 3.2 we give a precise definition of the fanout problem and fix the
terminology and notation for the rest of this chapter. In section 3.3 we give an overview of
what is currently known about the complexity of the fanout problem, and how the delay
model influences the complexity. In section 3.4 we present a spectrum of delay optimization
algorithms for the fanout problem, of increasing complexity and accuracy. In section 3.5
we describe succinctly what needs to be done to handle fanout problems where sinks are
of different polarities. In section 3.6 we present a set of simple delay optimizations that
can be used to improve the quality of an existing fanout tree. These optimizations take a
narrow view of the problem to be optimized. By analogy to software compilation, we call
them peephole optimizations. They are fast, simple to implement, effective at recovering
area, and can be applied to any fanout tree, independent of its origin. In section 3.7 we
show how we can apply a fanout algorithm to an entire circuit and explain in what sense
the resulting circuit implementation is optimal with respect to fanout optimization. This
technique can also be used in a postprocessing phase to recover wasted area in parts of the
circuit that are not critical for delay. Finally, in section 3.8 we present our experimental
results, and in section 3.9 we summarize the main results of this chapter.

3.2 Definition of the Fanout Problem


We give here the most general definition of the fanout problem that we consider
in our work. Possible extensions to this definition include the use of several sources, the use
of reconvergent fanout circuits and the presence of load limitations.

Fanout Problem for Minimum Delay

• Given a library $C$ of buffers and inverters, and for each $b \in C$ its input load $\gamma_b$, its
load dependent delay $\beta_b$ and its intrinsic delay $\alpha_b$;

• Given the source $s$ of a signal $X$, with its drive capability $\beta_s$;

• Given $n$ destinations or sinks, with separate required times $r_i$, loads $l_i$ and polarities $p_i$;

• Find a tree of buffers and inverters that distributes the signal $X$ to all the sinks and
maximizes the required time at the source.

Fanout Problem for Minimum Area under a Delay Constraint. We can extend this
definition to the problem of finding a fanout tree of minimum area under a delay constraint.
The delay constraint is specified as an arrival time $a$ at the source. The constraint
is that the required time at the source should not be less than the specified arrival time.
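
To make the quantities in these definitions concrete, the following sketch evaluates the required time at the source for a given fanout tree. It assumes the linear delay model suggested by the coefficients above (delay of a buffer = intrinsic delay + load dependent delay times the load it drives), ignores polarities, and charges the source only its load dependent term; the class names, field names and numerical values are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Sink:
    required: float   # r_i
    load: float       # load of the sink


@dataclass
class Buffer:
    alpha: float      # intrinsic delay
    beta: float       # load dependent delay coefficient
    in_load: float    # input load (gamma_b)


@dataclass
class BufferNode:
    buffer: Buffer
    children: List[Union["BufferNode", Sink]]


def required_and_load(node):
    """Return (required time at the node input, load the node presents)."""
    if isinstance(node, Sink):
        return node.required, node.load
    reqs, loads = zip(*(required_and_load(c) for c in node.children))
    b = node.buffer
    return min(reqs) - (b.alpha + b.beta * sum(loads)), b.in_load


def required_at_source(children, source_beta):
    """Required time at the source driving the given subtrees and sinks."""
    reqs, loads = zip(*(required_and_load(c) for c in children))
    return min(reqs) - source_beta * sum(loads)


critical = Sink(required=10.0, load=1.0)
others = [Sink(required=14.0, load=3.0) for _ in range(3)]
buf = Buffer(alpha=1.0, beta=0.2, in_load=1.0)

direct = required_at_source([critical] + others, source_beta=0.5)
isolated = required_at_source([critical, BufferNode(buf, others)], source_beta=0.5)
print(direct, isolated)
```

In this example, placing the three non-critical sinks behind a buffer reduces the load seen by the source and increases the required time at the source, which is the critical signal isolation effect mentioned in the introduction.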

3.3 Complexity of the Fanout Problem

The complexity of the fanout problem depends on the delay model. For a very
simple delay model, under which the delay through a buffer is constant and equal to 1 and the
fanout is constrained to be less than some constant value k, the fanout problem for delay can
be solved in linear time using a technique called combinational merging [18]. Unfortunately,
as soon as the delay model takes load values into account, even if all required times are equal
and only one type of buffer is used, the fanout problem for delay becomes NP-complete.
There is thus little hope of finding a polynomial time algorithm to solve the fanout problem
optimally with delay models of the level of complexity of the ones commonly used for CMOS
standard cells.
Berman et al. proved that the fanout problem for minimum area under a delay
constraint is NP-complete under a simple delay model where gates are represented by a
finite number of virtual gates with fixed fanout and constant delay. Unfortunately, the
proof of this result is not very satisfactory since it requires the existence of buffers in an
unrealistically wide range of sizes and delays.
In this section we present a few complexity results for the fanout problem under
various delay models. To keep things simple, we ignore the issue of phase assignment: we
suppose that all sinks require the signal with the same polarity as produced by the source and
that only buffers are available in the library. For a simple delay model, the fanout problem
can be solved optimally in time O(n log n) using combinational merging. We present this
algorithm in section 3.3.1. We use combinational merging as a heuristic in one of our fanout
optimization algorithms, but it is not optimum in general for more complex delay models.
In section 3.3.2 we study the fanout problem for a slightly more complex delay model, for
which we only have partial results. In section 3.3.3, we show that if we allow non-constant
load values at the sinks, the fanout problem for delay becomes NP-complete. This is the
first complexity result for the fanout problem for minimum delay without the addition of
an area constraint. Finally, in section 3.3.4 we briefly review Berman's complexity result.
Unfortunately, this overview is not complete: many complexity issues are left unresolved.
However, our main purpose is to provide solid evidence that the fanout problem is
difficult to solve exactly for complex delay models, in order to justify our use of approximate
algorithms. In that sense, our goal has been achieved.

3.3.1 Constant Delay Model

In the constant delay model, the library contains only one buffer. The delay
through this buffer is constant, equal to one delay unit, when the fanout does not exceed
some threshold k, and is infinite otherwise. For that delay model, the combinational merging

Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.

Let S be a set of nodes with an individual required time associated with each node.
Initially S contains all the sinks with their required times.
Sort S by decreasing required times.
While |S| > 1 {
    At the first iteration, r = |S| mod (k - 1). If r < 2, r = r + (k - 1).
    At all remaining iterations, r = k.
    Remove the first r nodes of S, (v_1, ..., v_r), with the largest required times.
    Make the r nodes (v_1, ..., v_r) the fanouts of a new node v.
    Set the required time of v to be min_{1 <= i <= r} required(v_i) - 1.
    Add v to S.
}
The only remaining node in S is the root of an optimal fanout tree.

Figure 3.2: Combinational Merging

algorithm, due to Golumbic [18], finds a minimum delay fanout tree in time O(n log n), where
n is the number of sinks. Moreover, this tree is guaranteed to be of minimum area among
all trees of finite delay. The algorithm is outlined in figure 3.2.
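The steps of figure 3.2 can be sketched in Python as follows. This is only an illustration, not the implementation used in this work: the function name `combinational_merge` is hypothetical, a binary heap stands in for the sorted set S, and the `while r < 2` loop (instead of a single test) is an added guard so that the case k = 2 also starts with a group of two nodes. Only the required time at the root is returned; the tree itself is not built.

```python
import heapq


def combinational_merge(required_times, k):
    """Required time at the root of the tree built by combinational merging.

    required_times: required times at the sinks (at least one sink).
    k: maximum fanout of the single buffer of the constant delay model (k >= 2).
    """
    if k < 2 or not required_times:
        raise ValueError("need k >= 2 and at least one sink")
    # heapq is a min-heap, so negated times are stored to pop the largest first.
    heap = [-t for t in required_times]
    heapq.heapify(heap)
    # Size of the first group, as in figure 3.2.
    r = len(heap) % (k - 1)
    while r < 2:  # added guard: a single 'if' would leave r = 1 when k = 2
        r += k - 1
    while len(heap) > 1:
        group = [-heapq.heappop(heap) for _ in range(r)]
        # The new node is one delay unit earlier than its earliest fanout.
        heapq.heappush(heap, -(min(group) - 1))
        r = k  # all remaining merges use the full fanout
    return -heap[0]


print(combinational_merge([0, 0, 0, 0], 3))
```

With four sinks of required time 0 and k = 3, the first merge groups two sinks and the second merge groups the remaining three nodes, so two levels of buffers are needed.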
It is possible to prove that if $(r_1, \ldots, r_n)$ are the required times at the sinks, the
required time at the root of an optimal fanout tree is given by the following formula [22]:

$$ T_{root} = -\log_k \Bigl( \sum_{1 \le i \le n} k^{-r_i} \Bigr) \qquad (3.1) $$

3.3.2 Unit Fanout Delay Model

In this section we introduce a slightly more realistic delay model. The library still
only contains one buffer, but this time the delay through the buffer is equal to the number
of its fanouts. This means that the load-dependent delay coefficient of the buffer is 1, its
intrinsic delay is 0, and the load of every gate is 1. This model is similar to but simpler
than the unit fanout delay model used in misII, which assumes a delay of 1 + 0.2 * n where
n is the number of fanouts of the buffer, i.e. it assumes an intrinsic delay of 1 and a
load-dependent delay of 0.2 for a buffer.

Equal Required Times  We can solve the fanout problem for minimum delay exactly
for this delay model if all the required times at the sinks are equal. As we will see shortly,
even this simple case is not completely straightforward. We will make use of the following
definitions:

Definition 1 A 2-3 tree is a tree T such that any intermediate node of T has a fanout of
2 or 3.

Definition 2 Let $\mathcal{P}(T)$ be the set of paths from the root to a leaf in a 2-3 tree T, and let
p be such a path. Let $x_p$ and $y_p$ be the number of nodes of fanout 2 and 3 respectively on
that path. The weight of path p is defined as follows:

$$ w(p) = 2^{x_p} \times 3^{y_p} \qquad (3.2) $$

The weight of a 2-3 tree is the maximum weight on any of its paths:

$$ w(T) = \max_{p \in \mathcal{P}(T)} w(p) \qquad (3.3) $$

The delay of a path and the delay through a tree are defined similarly:

$$ d(p) = 2 x_p + 3 y_p \qquad (3.4) $$

$$ d(T) = \max_{p \in \mathcal{P}(T)} d(p) \qquad (3.5) $$

Since we suppose that all required times are equal, maximizing the required time at the
source of the fanout tree is equivalent to minimizing the worst path delay through the tree,
i.e. the quantity d(T).

Definition 3 A 2-3 tree is a simple 2-3 tree if all nodes at the same level have the same
fanout. In particular, in a simple 2-3 tree, all the leaves are at the same distance from the
root.

It is easy to see that all paths of a simple 2-3 tree have equal weight, and that the number
of leaves of a simple 2-3 tree T is equal to w(T).
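The definitions above can be checked on small examples with the following sketch. The representation is an assumption made for illustration only: an intermediate node is a Python tuple of its fanouts and anything else is a leaf; the function names are hypothetical.

```python
def leaves(tree):
    """Number of leaves l(T)."""
    if not isinstance(tree, tuple):
        return 1
    return sum(leaves(child) for child in tree)


def weight(tree):
    """w(T): maximum over root-to-leaf paths of the product of the fanouts."""
    if not isinstance(tree, tuple):
        return 1
    return len(tree) * max(weight(child) for child in tree)


def delay(tree):
    """d(T): maximum over root-to-leaf paths of the sum of the fanouts."""
    if not isinstance(tree, tuple):
        return 0
    return len(tree) + max(delay(child) for child in tree)


def simple_tree(fanouts):
    """Simple 2-3 tree in which every node at level i has fanout fanouts[i]."""
    tree = "sink"
    for fanout in reversed(fanouts):
        tree = (tree,) * fanout
    return tree


simple = simple_tree([2, 3])
print(leaves(simple), weight(simple), delay(simple))

# A 2-3 tree that is not simple: the root has fanout 3 and unequal subtrees.
uneven = (("s", "s"), ("s", "s", "s"), "s")
print(leaves(uneven), weight(uneven), delay(uneven))
```

For the simple tree the number of leaves equals the weight, while the uneven tree has fewer leaves than its weight.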


Figure 3.3: Splitting a Node Does not Increase Delay

Theorem 3.3.1 Let $(l_2, l_3)$ be a pair of integers that realizes the minimum of the quantity
$2x + 3y$ subject to the constraint $2^x 3^y \ge n$. Let T be a simple 2-3 tree with $l_2$ levels of nodes
of fanout 2 and $l_3$ levels of nodes of fanout 3. Then T has minimum delay among all fanout
trees with n leaves or more. Its delay is given by the following expression:

$$ d(T) = 3h + 1 \quad \text{if } 3^h < n \le \tfrac{4}{3} \times 3^h $$
$$ d(T) = 3h + 2 \quad \text{if } \tfrac{4}{3} \times 3^h < n \le 2 \times 3^h $$
$$ d(T) = 3h + 3 \quad \text{if } 2 \times 3^h < n \le 3^{h+1} $$
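The closed form of theorem 3.3.1 can be compared numerically with a direct search over the pairs (x, y). The sketch below is only a sanity check under the unit fanout model; the function names are hypothetical, and the case n = 1 (a single sink, no buffer) is handled separately as an assumption since the theorem does not cover it.

```python
def min_delay_brute_force(n):
    """Minimum of 2x + 3y over integers x, y >= 0 such that 2**x * 3**y >= n."""
    best = None
    x = 0
    while True:
        y = 0
        while 2**x * 3**y < n:
            y += 1
        delay = 2 * x + 3 * y
        if best is None or delay < best:
            best = delay
        if 2**x >= n:  # larger x can only increase the delay from here on
            return best
        x += 1


def min_delay_closed_form(n):
    """Delay given by theorem 3.3.1 for n >= 2 leaves."""
    if n <= 1:
        return 0  # assumption: a single sink is driven directly
    h = 0
    while 3 ** (h + 1) < n:
        h += 1
    # Here 3**h < n <= 3**(h + 1).
    if 3 * n <= 4 * 3**h:
        return 3 * h + 1
    if n <= 2 * 3**h:
        return 3 * h + 2
    return 3 * h + 3


for n in (4, 10, 13, 19):
    print(n, min_delay_brute_force(n), min_delay_closed_form(n))
```

For instance, n = 10 falls in the range $3^2 < n \le \tfrac{4}{3} \times 3^2$, so the delay is 7, realized by two levels of fanout 2 and one level of fanout 3 (12 leaves).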

The proof of theorem 3.3.1 relies on the following two lemmas.

Lemma 3.3.2 There is an optimal fanout tree that is a 2-3 tree.

Proof Let T be an optimal fanout tree. Its nodes with fanout equal to 1 can be eliminated
without modifying the fanout of other nodes and without increasing the delay through the
tree. We will show that its nodes with fanout greater than 3 can be split into nodes with
strictly smaller fanouts without increasing the delay through the tree, with the
transformation shown in figure 3.3. A node u with a fanout f of 4 or more can be replaced by
three nodes v, $w_1$ and $w_2$, such that v is directly connected to the parent of u, and has $w_1$
and $w_2$ as children. Two of the children of u are made children of $w_1$, while the
remaining ones are made children of $w_2$. If r is the earliest required time of any child of u,
the required time at u is $r - f$. In the worst case the earliest required time of a child of $w_1$
is r and the required time of $w_1$ is $r - 2$. Similarly, in the worst case the earliest required
time of a child of $w_2$ is r and the required time of $w_2$ is $r - (f - 2)$. Thus the required time
at v is no worse than $\min(r - 2, r - f + 2) - 2 = r - f$, since $f \ge 4$. ■
Lemma 3.3.2 shows that we can find a 2-3 tree that is an optimal fanout tree. The following
lemma shows that we can restrict our attention further to simple 2-3 trees.

Lemma 3.3.3 Let T be a 2-3 tree. There is a simple 2-3 tree T' that is at least as fast as
T and has at least as many leaves as T.

Proof Let T be a 2-3 tree and let l(T) be the number of leaves of T. The proof proceeds
in two steps. We first show that $l(T) \le w(T)$ for any 2-3 tree, and then we show that for
any 2-3 tree T there is a simple 2-3 tree T' such that $d(T') \le d(T)$ and $w(T') = w(T)$. This
proves the lemma, since for any simple 2-3 tree T', $l(T') = w(T')$.
To prove that $l(T) \le w(T)$, we proceed by induction on the height of T. The
result is obviously true for trees of height zero, since the number of leaves and the weight
are both equal to 1 in that case. Let us suppose that the result is true for all 2-3 trees of
height $h - 1 \ge 0$. Let $(T_1, \ldots, T_k)$, with k = 2 or k = 3, be the subtrees of T that are the
fanouts of the root node of T. By induction hypothesis, $l(T_i) \le w(T_i)$. Moreover we have
$l(T) = \sum_{1 \le i \le k} l(T_i)$ and $w(T) = k \times \max_{1 \le i \le k} w(T_i)$. Thus $l(T) \le w(T)$.
We can build a simple 2-3 tree T' such that $d(T') \le d(T)$ and $w(T') = w(T)$ as
follows. Let p be a path of T such that $w(p) = w(T)$, and let T' be a simple 2-3 tree with $x_p$
levels of nodes of fanout 2 and $y_p$ levels of nodes of fanout 3. We have $w(T') = w(p) = w(T)$
and $d(T') = d(p) \le d(T)$. ■

Proof [of theorem 3.3.1] According to the previous two lemmas, we only need to consider
simple 2-3 trees. The optimal simple 2-3 tree is the simple 2-3 tree T that minimizes d(T)
subject to the constraint that $l(T) \ge n$, where l(T) is the number of leaves of tree T. For
a given simple 2-3 tree T, with x levels of nodes with fanout 2 and y levels of nodes with
fanout 3, we have: $d(T) = 2x + 3y$ and $l(T) = w(T) = 2^x \times 3^y$. Thus the problem of finding
an optimal simple 2-3 tree is reduced to the problem of finding a pair of integers (x, y) that
is solution of the following discrete optimization problem:

$$ \min \; 2x + 3y \quad \text{with} \quad 2^x \times 3^y \ge n \qquad (3.6) $$


We will first show that there is always a solution of (3.6) such that $0 \le x \le 2$. For
any pair (x, y) of integers, let $d(x, y) = 2x + 3y$ and $l(x, y) = 2^x \times 3^y$. If $x \ge 3$, we have:

$$ d(x - 3, y + 2) = 2(x - 3) + 3(y + 2) = 2x + 3y = d(x, y) $$

$$ l(x - 3, y + 2) = 2^{x-3} \, 3^{y+2} > 2^x \, 3^y = l(x, y) $$

In other words, if $x \ge 3$ and (x, y) is an optimum solution, we can replace (x, y) by
$(x - 3, y + 2)$ without loss of optimality. Let us suppose that $n > 3^h$, for some integer h. Then
we have $2^x 3^y > 3^h$. Since $x \le 2$, we must have $y \ge h - 1$. In addition, if $y = h - 1$,
then $x = 2$ and $d(2, h - 1) = 3h + 1$. If $y = h$, then $x \ge 1$, and $d(x, h) \ge 3h + 2$. If
$y > h$, then $d(x, y) \ge 3h + 3$. Thus $(2, h - 1)$ is an optimum solution for n in the range:
$3^h < n \le l(2, h - 1) = \tfrac{4}{3} \times 3^h$. If $n > \tfrac{4}{3} \times 3^h$, then $y \ge h$ and $x \ge 1$. The minimum
delay solution satisfying these constraints is $(1, h)$, and $d(1, h) = 3h + 2$. This solution is
optimum for n in the range: $\tfrac{4}{3} \times 3^h < n \le l(1, h) = 2 \times 3^h$. If $n > 2 \times 3^h$, then the next
minimum delay solution is $(0, h + 1)$, which has a delay of $d(0, h + 1) = 3h + 3$. This solution
is optimum for n in the range $2 \times 3^h < n \le l(0, h + 1) = 3^{h+1}$. ■

Arbitrary Required Times  We are not aware of any polynomial time algorithm to solve
the minimum delay fanout problem for arbitrary required times under the unit fanout delay
model. We conjecture that this problem is not NP-complete and that such an algorithm
exists.

Area Minimization under a Delay Constraint  We conjecture that the problem of
finding a minimum area fanout tree under a delay constraint for arbitrary required times and
the unit fanout delay model is NP-complete.

3.3.3 Unit Fanout Model with Varying Sink Loads

The difference between the unit fanout model of the previous section and the unit
fanout model with varying sink loads used in this section is that we now allow sink loads to
take any positive rational value. There is still only one buffer in the library, and its drive
capability is 1, its intrinsic delay 0 and its input load 1. Under this delay model, we will
prove that the fanout problem for minimum delay is NP-complete even if we restrict the
sink required times to be all equal. We will also prove that the fanout problem for minimum
area under a delay constraint is NP-complete even if we restrict sink loads to be integers.


Fanout Problem for Minimum Delay

Theorem 3.3.4 Given a fanout problem for the unit fanout model with rational sink loads
and a constant D, the following decision problem is NP-complete: is there a fanout tree
such that the delay through the fanout tree is less than or equal to D? The problem remains
NP-complete even if the required times at the sinks are all equal.

The decision problem is clearly in NP. To prove it is NP-complete, we will exhibit a
polynomial time reduction of 3-partition to it. For clarity, we restate here the 3-partition
problem [16]:

Theorem 3.3.5 (3-Partition) Given a finite set A of 3m elements, an integer valued
bound B, and an integer valued size s(a) for each element a of A, such that s(a) satisfies
$B/4 < s(a) < B/2$ and such that $\sum_{a \in A} s(a) = mB$, the following decision problem is
NP-complete: can A be partitioned into m disjoint sets $S_1, \ldots, S_m$ such that for $1 \le i \le m$,
$\sum_{a \in S_i} s(a) = B$?

The nature of the constraints is such that if a solution exists, the sets $S_i$ contain exactly 3
elements.
A decision problem is NP-complete in the strong sense if, unless P = NP, there
is no polynomial algorithm to solve the decision problem even if we restrict the problem
to instances where the numbers appearing in an instance are bounded by a polynomial
function of the size of the instance. Equivalently, a decision problem is NP-complete in the
strong sense if it cannot be solved by a pseudo-polynomial algorithm, i.e. an algorithm that
is polynomial in the size of the instance and the magnitude of the numbers appearing in
the instance. NP-complete problems that are not strongly NP-complete derive their
NP-completeness from the presence of exponentially large numbers in the formulation of their
instances. PARTITION [16] is an example of an NP-complete problem that is not strongly
NP-complete: it can be solved in pseudo-polynomial time by dynamic programming. When
the numbers appearing in the instances of a problem are derived from finite precision
physical parameters, as is the case for fanout optimization, it is not realistic to suppose the
presence of exponentially large numbers in problem instances. In other words, to be
relevant, proofs of NP-completeness of fanout problems need to show NP-completeness in the
strong sense, by derivation from a strongly NP-complete problem. Our NP-completeness
proofs are based on 3-partition, which is one of the simplest strongly NP-complete problems
[16].
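To make the notion of a pseudo-polynomial algorithm concrete, here is a minimal sketch of the standard dynamic program for PARTITION. It is given for illustration only; the function name `can_partition` is hypothetical, and the sizes are assumed to be positive integers. The running time is proportional to the number of elements times the sum of the sizes, which is polynomial in the magnitude of the numbers but not in the length of their binary encoding.

```python
def can_partition(sizes):
    """Decide whether the sizes can be split into two subsets of equal sum."""
    total = sum(sizes)
    if total % 2 != 0:
        return False
    target = total // 2
    # reachable[t] is True when some subset of the sizes seen so far sums to t.
    reachable = [True] + [False] * target
    for size in sizes:
        # Scan downwards so that each element is used at most once.
        for t in range(target, size - 1, -1):
            if reachable[t - size]:
                reachable[t] = True
    return reachable[target]


print(can_partition([3, 1, 1, 2, 2, 1]))
print(can_partition([1, 5]))
```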
The proof of theorem 3.3.4 relies on the following lemma:

Lemma 3.3.6 (Restricted 3-partition) 3-partition remains an NP-complete problem even
if we restrict the number of elements of an instance to be a power of 3, i.e. of the form
$3m = 3^h$.

Proof Let A be an instance of 3-partition. We will exhibit an instance A' of restricted
3-partition that is equivalent to A. A' is constructed as follows. It has $3^h$ elements, where
$h = \lceil \log_3(3m) \rceil$. The first 3m elements of A' are a copy of the elements of A, with their size
multiplied by 9. The remaining $3^h - 3m$ elements of A' are grouped in triplets, of respective
sizes $(3B + 1, 3B + 1, 3B - 2)$. And B' is taken to be equal to 9B. Suppose that A has a
3-partition. Then the first 3m elements of A' can be grouped together in triplets of total
size 9B. The remaining $3^h - 3m$ elements can be kept together as they were created, in
triplets of total size 9B. Thus A' has a 3-partition. Conversely, if A' has a 3-partition,
the sum of the sizes of any triplet $(s_1, s_2, s_3)$ in this 3-partition is equal to 9B, and is thus
divisible by 9. If any of these elements comes from an element of A, then all of them do,
otherwise the sum of their sizes would not be 0 modulo 9. Thus a 3-partition of A' yields
a 3-partition of A. ■
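The construction of A' used in this proof can be sketched as follows. The function name `restrict_instance` and the list representation of an instance are assumptions made for illustration; the sizes and the bound follow the construction in the proof.

```python
def restrict_instance(sizes, bound):
    """Pad a 3-partition instance so that its number of elements is a power of 3.

    sizes: the 3m element sizes s(a); bound: the bound B.
    Returns (new_sizes, new_bound, h) with len(new_sizes) == 3**h.
    """
    count = len(sizes)
    if count == 0 or count % 3 != 0:
        raise ValueError("a 3-partition instance has 3m elements, m >= 1")
    # Smallest h such that 3**h >= 3m, i.e. h = ceil(log3(3m)), computed exactly.
    h, padded = 1, 3
    while padded < count:
        h, padded = h + 1, padded * 3
    new_sizes = [9 * s for s in sizes]
    for _ in range((padded - count) // 3):
        # Filler triplet of total size 9B.
        new_sizes += [3 * bound + 1, 3 * bound + 1, 3 * bound - 2]
    return new_sizes, 9 * bound, h


new_sizes, new_bound, h = restrict_instance([4, 5, 6, 4, 5, 6], 15)
print(new_sizes, new_bound, h)
```

In this example m = 2 and B = 15, so the instance is padded from 6 to 9 elements with one filler triplet (46, 46, 43), and the new bound is 135.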

Proof [of theorem 3.3.4] We only need to exhibit a polynomial time reduction of
3-partition to the fanout problem for instances of 3-partition such that $3m = 3^h$. Let A
be such an instance of 3-partition. We create $3^h = 3m$ sinks, one per element of A, and
assign to the sink corresponding to element a of A the load $1 + s(a)/K$, where K is an
arbitrary integer such that $K > 3B/2$. All required times are taken to be equal and D is
set to be equal to $3h + B/K$. Clearly, this specifies a fanout problem for our delay model,
and the construction can be done in time polynomial in the size of the instance A. We need
now to prove that decision problem A is equivalent to this fanout problem.
Suppose that A has a solution. Then we can group the elements of A in triplets
$S_1, \ldots, S_m$, such that $\sum_{a \in S_i} s(a) = B$ for $1 \le i \le m$. We can then build a 3-tree with h
levels and $3^h$ leaves, such that the sinks corresponding to the elements of $S_i$ are siblings
of each other. Any node of the tree at level $h - 1$ has a fanout of 3, and the load it
drives is equal to $3 + B/K$ since $\sum_{a \in S_i} s(a) = B$. Thus the total delay through the tree is
$3(h - 1) + 3 + B/K = D$, which proves that the fanout decision problem has a solution.
Conversely, suppose that the fanout decision problem has a solution. There is then
a fanout tree whose delay is no greater than D. From lemma 3.3.2, which still holds if the
loads at the sinks are allowed to be larger than 1, we can assume without loss of generality
that this fanout tree is a 2-3 tree. Let T be the 3-tree with $3^h$ leaves and a depth of h, with
sinks allocated to its leaves in some arbitrary way. Since the load of any sink is less than
$1 + B/2K$, the load of buffers at level $h - 1$ does not exceed $3(1 + B/2K) < 3 + 1$. Thus,
the delay through T is smaller than $3h + 1$. Since no other 2-3 tree can drive $3^h$ outputs of
load 1 or more in less than $3h + 1$, T is the only possible 2-3 tree that realizes the minimum
delay fanout tree. Thus there is an assignment of sinks to leaves of T such that the delay
through T is no larger than $D = 3h + B/K$. This means that the sinks can be 3-partitioned
into triplets whose aggregate load is equal to $3 + B/K$. This is equivalent to saying that A
can be 3-partitioned. ■
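The reduction can be exercised on a small instance with the sketch below. The names `fanout_instance` and `ternary_tree_delay` are hypothetical, K is simply chosen as the smallest integer above 3B/2, and exact rational arithmetic is used so that the comparison with D is not affected by rounding. The second function evaluates the complete 3-tree of the proof, with the sinks placed on the leaves in the given order, three per buffer of the last level.

```python
from fractions import Fraction


def fanout_instance(sizes, bound, h):
    """Sink loads 1 + s(a)/K and delay bound D = 3h + B/K for len(sizes) == 3**h."""
    K = (3 * bound) // 2 + 1  # an integer K > 3B/2 (any such K would do)
    loads = [1 + Fraction(s, K) for s in sizes]
    return loads, 3 * h + Fraction(bound, K)


def ternary_tree_delay(loads, h):
    """Delay through the complete 3-tree of depth h under the unit fanout model.

    Buffers of the upper h - 1 levels each drive three buffers of input load 1;
    a buffer of the last level drives three consecutive sinks of the list.
    """
    worst_last_level = max(sum(loads[i:i + 3]) for i in range(0, len(loads), 3))
    return 3 * (h - 1) + worst_last_level


sizes = [4, 5, 6, 4, 5, 6, 4, 5, 6]  # m = 3, B = 15, already grouped in triplets
loads, D = fanout_instance(sizes, 15, 2)
print(ternary_tree_delay(loads, 2) == D)          # balanced triplets meet D
print(ternary_tree_delay(sorted(loads), 2) <= D)  # an unbalanced grouping does not
```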

Fanout Problem for Minimum Area under a Delay Constraint

Theorem 3.3.7 Given a fanout problem for the unit fanout model with integer sink loads,
a delay constraint D and a constant A, the following decision problem is NP-complete: is
there a fanout tree such that the delay through the tree is less than or equal to D and its
area is no greater than A? The problem remains NP-complete even if the required times at the
sinks are all equal.

Proof This decision problem is clearly in NP. To prove it is NP-complete we will again
exhibit a polynomial time reduction of 3-partition to it. From a given instance A of 3-partition,
we build an instance of the fanout problem as follows: we create 3m sinks, one for each
element a of A, having for loads the values $(m + 1) s(a)$. The area constraint is A = m and
the delay constraint is $D = m + (m + 1)B$. All required times are taken to be the same.
To show that both problems are equivalent, we first note that if a fanout tree is such that
fewer than m gates are directly connected to the sinks, there must be at least one gate which
fans out to 4 or more sinks. Given the constraint that $s(a) > B/4$, the load at this gate
must be at least $(m + 1)(B + 1) > m + (m + 1)B$. Thus to be able to meet the delay
constraint, all m gates that the fanout tree is allowed to contain under the area constraint
must be used to drive sinks. Moreover, each of them has to drive exactly 3 sinks. In
that case, the delay constraint will be met if and only if the loads are equilibrated, that is
if and only if A can be 3-partitioned. ■

3.3.4 Berman's Delay Model

Berman et al. used a different delay model in their work [5]. In their model, gates
have a fixed fanout and delay. This is equivalent to saying that the load at the output of a
gate is equal to the number of fanouts, and the delay through a gate is a piece-wise linear
function of the load, with a threshold value above which the delay through a gate becomes
infinite. Under this delay model they show that the fanout problem for minimum area
under a delay constraint is NP-complete, but their proof relies on unrealistic assumptions
on gate size and gate delay parameters, which weakens their result. More specifically, they
suppose a library containing gates with the following area and delay characteristics, where
N and K are given integer-valued parameters: a gate with delay 1, fanout limit of 1 + y
and area $(NK)^3$, and, for $i = 1, \ldots, K$, a gate with delay $2NK + i$, fanout limit 2 and area
$(NK)^2 - 3i$.

3.4 A Spectrum of Fanout Optimization Algorithms

In what follows, we present a list of fanout optimization algorithms, sorted in order
of increasing complexity. Each of the algorithms is analyzed in terms of its computational
complexity and optimization ability. All of these algorithms have been implemented and an
empirical analysis of their efficiency is given in section 3.8. These algorithms can introduce
buffers, inverters, or a combination of both. However, to simplify the presentation, we
suppose that only buffers are used, and the source and the sinks are all of the same polarity.
We postpone the problem of correct phase selection until section 3.5. We first introduce
some notation that is used in the rest of this section.

3.4.1 Notation

We use $r$, possibly with indices, to denote required times; $\alpha$ to denote intrinsic
delay, $\beta$ drive capabilities and $\gamma$ loads. The number of sinks is n, and the number of buffers
or inverters in the library $\mathcal{L}$ is d. The letter b designates a buffer. We use $\beta_s$ to designate
the drive capability of the source, and $\beta_b$ the drive capability of the buffer b. The input
load of a buffer is $\gamma_b$.


We suppose that, in a preprocessing step, the sinks have been sorted in order of
increasing required times, and their required times are denoted $(r_1, \ldots, r_n)$. In particular,
the following equation holds: $r_i = \min_{j \ge i} r_j$. Similarly, their loads are denoted $(\gamma_1, \ldots, \gamma_n)$,
in the same order. We also precompute the quantities $\gamma_{i,n} = \sum_{i \le k \le n} \gamma_k$. The quantities
$\gamma_{i,j} = \sum_{i \le k \le j} \gamma_k$ are then available in constant time as $\gamma_{i,n} - \gamma_{j+1,n}$. The entire preprocessing
can be done in time O(n log n) and is necessary in all of our algorithms, except the first two
(buffer selection and two-level fanout tree selection ignoring required times).
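The preprocessing step can be sketched as follows. The function names and the representation of a sink as a (required time, load) pair are assumptions made for illustration, and indices are 0-based here, whereas the text numbers sinks from 1. An extra entry equal to 0 at the end of the suffix sums plays the role of $\gamma_{n+1,n}$.

```python
def preprocess(sinks):
    """Sort sinks by increasing required time and precompute suffix sums of loads.

    sinks: list of (required_time, load) pairs, in any order.
    Returns (required, loads, suffix) where suffix[i] = loads[i] + ... + loads[n-1]
    and suffix[n] = 0.
    """
    ordered = sorted(sinks, key=lambda sink: sink[0])
    required = [r for r, _ in ordered]
    loads = [g for _, g in ordered]
    suffix = [0] * (len(loads) + 1)
    for i in range(len(loads) - 1, -1, -1):
        suffix[i] = suffix[i + 1] + loads[i]
    return required, loads, suffix


def range_load(suffix, i, j):
    """Total load of sinks i..j (inclusive, 0-based), in constant time."""
    return suffix[i] - suffix[j + 1]


required, loads, suffix = preprocess([(5, 2), (1, 3), (3, 1)])
print(required, loads, suffix, range_load(suffix, 0, 1))
```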

3.4.2 Buffer Selection

The buffer selection algorithm is a very rudimentary fanout optimization algorithm.
It does not build any fanout tree; it simply sizes existing buffers optimally. If a
fanout tree is implemented as a wire, this algorithm has no effect. Buffer selection can also
be used inside trees as was done in chapter 2.
For a given buffer selection b, the algorithm computes the required time at the
input of the buffer. If $r_0$ is the required time at the output of the buffer, and $\gamma_{1,n}$ the load
at the output of the buffer, the delay through the buffer is $\alpha_b + \beta_b \gamma_{1,n}$ and the required time
at the input of the buffer is $r_0 - \alpha_b - \beta_b \gamma_{1,n}$. Unfortunately this quantity does not take into
account the delay due to the input load of the buffer, which should be taken into account
to make an optimal buffer selection. So we subtract from the required time at the input of
the buffer the load-dependent delay $\beta_s \gamma_b$ corresponding to the time it takes the source gate
to drive the buffer input. The algorithm selects a buffer that maximizes this quantity as
shown in Figure 3.4. The algorithm works in time O(d + n) where d is the number of buffers
in the library and n the number of sinks. The computation of $r_0$ is not actually necessary
and is given in Figure 3.4 for clarity.
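A direct transcription of this selection rule is sketched below. The `Buffer` record, the function name and the numeric values of the two-buffer library are made up for illustration and do not come from any real cell library.

```python
from collections import namedtuple

# alpha: intrinsic delay, beta: drive capability, gamma: input load.
Buffer = namedtuple("Buffer", "name alpha beta gamma")


def select_buffer(library, source_drive, sink_required, sink_loads):
    """Buffer maximizing r_0 - alpha_b - beta_b * gamma_{1,n} - beta_s * gamma_b."""
    r0 = min(sink_required)
    total_load = sum(sink_loads)

    def required_at_source(b):
        return r0 - b.alpha - b.beta * total_load - source_drive * b.gamma

    return max(library, key=required_at_source)


library = [
    Buffer("small", alpha=0.3, beta=2.0, gamma=0.1),  # hypothetical values
    Buffer("big", alpha=0.4, beta=1.0, gamma=0.3),    # hypothetical values
]
print(select_buffer(library, 4.0, [0.0] * 20, [0.1] * 20).name)
print(select_buffer(library, 4.0, [0.0] * 2, [0.1] * 2).name)
```

With these values the stronger buffer is preferred for 20 sinks, while the weaker one is preferred for 2 sinks because its smaller input load costs the source less.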

3.4.3 Two-Level Fanout Trees Ignoring Required Times

One of the main reasons why fanout optimization is necessary is to reduce large
loads created by large fanouts. The simplest way to do so is to insert a two-level tree of
buffers at multiple fanout points in the circuit, where a two-level tree is defined as follows:

Definition 4 A tree is a two-level tree if any leaf of the tree is separated from the root of
the tree by exactly one intermediate node.


algorithm buffer_selection
    r_0 = r_1 = min_{1 <= i <= n} r_i
    b_opt = argmax_{b in L} (r_0 - alpha_b - beta_b gamma_{1,n} - beta_s gamma_b)
end buffer_selection

Figure 3.4: Optimal Buffer Selection

The two-level fanout tree algorithm which ignores required times is given in Figure 3.5 and
explained in detail in the rest of this section.

Two-Level vs. Multi-Level  In general, the optimal number of levels of such a tree is
a logarithmic function of the ratio between the load $\gamma_{1,n}$ and the drive capability $\beta_s$ of the
source of the signal. In practice, however, this ratio never grows large enough to justify the
use of more than one level of buffers, i.e. of anything deeper than a two-level buffer tree.
This can be checked by a simple back-of-the-envelope computation. To make things concrete,
we use delay values from the MCNC library lib2 [32] and round to the nearest number with
one significant digit. We suppose for a buffer a drive capability of 2.0, an intrinsic delay of
0.3, and a load of 0.1, a drive capability of 4.0 for the source and a load of 0.1 for each of the
sinks. The delay values obtained with no buffer tree, with a two-level buffer tree, and the best
multi-level buffer tree are reported in table 3.1 for a varying number of sinks. A number of
fanouts in the range of 10 to 20 is typical; above 50 is rare. The largest number we have
observed was 198.
As the data indicates, the gains obtained by using more than one level of buffers
are only substantial for very large fanouts, and in any case are negligible compared to the
gains obtained by introducing one level of buffers. In addition, in practice, libraries contain
buffers of several strengths, which makes the two-level fanout trees competitive in an even
larger range of fanouts.
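The back-of-the-envelope computation can be reproduced with the sketch below. It relies on assumptions that are not spelled out in the text: the delay of a gate is its intrinsic delay plus its drive capability times the load it drives, the intrinsic delay of the source is ignored, the sinks are spread as evenly as possible over k identical buffers, and k is chosen by exhaustive search. Delays are computed in tenths of a delay unit so that the arithmetic stays exact. Under these assumptions the results agree with the "no opt" and "two-level" rows of table 3.1; the multi-level row is not reproduced here.

```python
def no_buffer_delay_tenths(n):
    """Source (drive 4.0) driving n sinks of load 0.1 each: 0.4 per sink."""
    return 4 * n


def two_level_delay_tenths(n):
    """Best delay with one level of k identical buffers, in tenths of a unit."""
    best = None
    for k in range(1, n + 1):
        sinks_per_buffer = -(-n // k)  # ceil(n / k): the most loaded buffer
        source_delay = 4 * k           # 4.0 * (k buffers * input load 0.1)
        buffer_delay = 3 + 2 * sinks_per_buffer  # 0.3 + 2.0 * 0.1 per sink
        delay = source_delay + buffer_delay
        if best is None or delay < best:
            best = delay
    return best


for n in (10, 15, 20, 25, 30, 40, 50, 100, 200):
    print(n, no_buffer_delay_tenths(n) / 10, two_level_delay_tenths(n) / 10)
```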

Buffer Selection  A two-level fanout tree is composed of one level of intermediate buffers.
For simplicity, we enforce the restriction that all intermediate buffers are of the same
strength (or buffer type). This allows us to compute in constant time a good approximation
of the number of intermediate buffers required. This computation is done once for
each buffer type b in the library.

# sinks       10    15    20    25    30    40    50   100   200
no opt       4.0   6.0   8.0  10.0  12.0  16.0  20.0  40.0  80.0
two-level    2.1   2.5   2.9   3.3   3.5   3.9   4.3   6.1   8.3
multi-level  2.1   2.5   2.8   3.0   3.0   3.2   3.4   4.1   4.5

Table 3.1: Fanout Trees: Two-Level vs. Multi-Level

# sinks: number of sinks; all required times are equal
no opt: delay with no fanout optimization
two-level: delay with one level of buffers (i.e. two-level fanout trees)
multi-level: delay with the best multi-level fanout tree


We first compute the total load of all the sinks, $\gamma_{1,n}$, and then we determine the
optimum number $k^b_{opt}$ of intermediate buffers needed for a two-level fanout tree, supposing
that all buffers are of type b and the load is equally divided among the intermediate buffers.
This minimization problem can be formulated as a quadratic minimization problem and
can be solved exactly in constant time, using the following two formulas:

$$ k^b_{opt} = \arg\min_k \Bigl( \beta_s \gamma_b k + \beta_b \frac{\gamma_{1,n}}{k} \Bigr) \qquad (3.7) $$

$$ k^b_{opt} \in \Bigl\{ \Bigl\lfloor \sqrt{\frac{\beta_b \gamma_{1,n}}{\beta_s \gamma_b}} \Bigr\rfloor , \Bigl\lceil \sqrt{\frac{\beta_b \gamma_{1,n}}{\beta_s \gamma_b}} \Bigr\rceil \Bigr\} \qquad (3.8) $$
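The two formulas can be evaluated as in the sketch below. Setting the derivative of the cost $\beta_s \gamma_b k + \beta_b \gamma_{1,n} / k$ with respect to k to zero gives the real-valued minimizer $\sqrt{\beta_b \gamma_{1,n} / (\beta_s \gamma_b)}$, and since the cost is convex in k the integer optimum is one of its two neighbours. The function name is hypothetical, and clamping the result to at least one buffer is an added assumption.

```python
import math


def optimal_buffer_count(beta_s, gamma_b, beta_b, total_load):
    """Integer k minimizing beta_s * gamma_b * k + beta_b * total_load / k."""

    def cost(k):
        return beta_s * gamma_b * k + beta_b * total_load / k

    k_real = math.sqrt(beta_b * total_load / (beta_s * gamma_b))
    # Only the floor and the ceiling of the real minimizer need to be compared.
    candidates = sorted({max(1, math.floor(k_real)), max(1, math.ceil(k_real))})
    return min(candidates, key=cost)


# Values of the back-of-the-envelope example: source drive 4.0, buffer input
# load 0.1, buffer drive 2.0; total sink load 5.0 (50 sinks) and 2.0 (20 sinks).
print(optimal_buffer_count(4.0, 0.1, 2.0, 5.0))
print(optimal_buffer_count(4.0, 0.1, 2.0, 2.0))
```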

Sink Assignment  The number of intermediate buffers to be used is computed by supposing
that the loads are equally divisible among all intermediate buffers. This is not the case in
general. Unfortunately, even if the number of intermediate buffers is given, assigning sinks
of varying loads to these buffers is a difficult problem. It is equivalent to multiprocessor
scheduling, which is known to be NP-complete (see [16] page 238).
To perform sink assignment, we use a simple greedy algorithm that allocates the
next sink to the intermediate buffer that has been assigned the least amount of load so
far. The best results are obtained if the sinks are sorted in order of decreasing loads. This
assignment is made for each buffer type b, but only for $k^b_{opt}$ intermediate buffers. The best
solution is then retained.
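The greedy assignment for one buffer type can be sketched as follows. The function name is hypothetical, and a binary heap is used here to find the least loaded buffer; this is a substitution made for the sketch and not necessarily how the search is done in the implementation described in this chapter.

```python
import heapq


def assign_sinks_by_load(loads, k):
    """Greedy assignment of sinks to k buffers, largest loads first.

    Returns (assign, buffer_loads): assign[i] is the buffer of sink i and
    buffer_loads[j] the total load driven by buffer j.
    """
    order = sorted(range(len(loads)), key=lambda i: -loads[i])
    heap = [(0, j) for j in range(k)]  # (load so far, buffer index)
    assign = [None] * len(loads)
    for i in order:
        load, j = heapq.heappop(heap)  # least loaded buffer so far
        assign[i] = j
        heapq.heappush(heap, (load + loads[i], j))
    buffer_loads = [0] * k
    for load, j in heap:
        buffer_loads[j] = load
    return assign, buffer_loads


print(assign_sinks_by_load([7, 5, 4, 3, 1], 2))
```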

Complexity  The number of intermediate buffers is of the order of $\sqrt{n}$. The best fit
algorithm spends $\sqrt{n}$ time to determine where to assign each sink. Moreover this
computation is done once for each buffer type. Thus, in total, the complexity of this
algorithm is $O(d\,n^{1.5})$.

algorithm two_level_no_required_times
    sort the sinks by decreasing load values (gamma_1, ..., gamma_n)
    compute gamma_{1,n} = sum_{1 <= i <= n} gamma_i
    foreach b in L {
        k^b_opt = sqrt(beta_b gamma_{1,n} / (beta_s gamma_b))
        foreach 1 <= i <= k^b_opt  load[i] = 0.0
        foreach 1 <= i <= n {
            j_i = argmin_{1 <= j <= k^b_opt} load[j]
            load[j_i] = load[j_i] + gamma_i
            assign[i] = j_i
        }
    }
end two_level_no_required_times

Figure 3.5: Two-Level Fanout Tree Ignoring Required Times

3.4.4 Two-Level Fanout Trees Taking Required Times into Account

The previous algorithm ignores sink required times. Though it can handle
non-constant load values, it does not perform very well in the presence of wide variations in
required times. It is possible to compensate for this deficiency by taking required times
into account during sink assignment. To do so we still use a best fit, greedy algorithm but
instead of assigning a sink to the intermediate buffer that has so far the least amount of
load to drive, we assign a sink to an intermediate buffer in such a way that the required
time at the source of the fanout tree is decreased the least by this assignment. If all required
times are equal, these two greedy algorithms produce the same result. In the preprocessing
phase, we sort the sinks in order of increasing required times, and, in case of ties, in order
of decreasing loads. The complexity of this algorithm is also $O(d\,n^{1.5})$. The algorithm is
sketched in Figure 3.6.
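The sink assignment loop of this algorithm, for a single buffer type and a given number k of intermediate buffers, can be sketched as follows. The function name and the (required time, load) representation of a sink are assumptions. One deliberate difference from Figure 3.6 is that the required time of an empty buffer is initialized to infinity rather than 0.0, so that positive required times are handled as well; this is a choice made for the sketch.

```python
import math


def assign_sinks_by_required_time(sinks, k, beta_b):
    """Greedy assignment taking required times into account.

    sinks: list of (required_time, load) pairs; beta_b: drive capability of the
    intermediate buffers. Returns (assign, required) where required[j] is the
    required time at the output of buffer j minus its load-dependent delay.
    """
    # Increasing required times; ties broken by decreasing loads.
    order = sorted(range(len(sinks)), key=lambda i: (sinks[i][0], -sinks[i][1]))
    required = [math.inf] * k  # assumption: empty buffers impose no constraint
    load = [0] * k
    assign = [None] * len(sinks)
    for i in order:
        r_i, gamma_i = sinks[i]

        def after(j):
            # Required time of buffer j if sink i were added to it.
            return min(r_i - beta_b * load[j], required[j]) - beta_b * gamma_i

        j_best = max(range(k), key=after)
        required[j_best] = after(j_best)
        load[j_best] += gamma_i
        assign[i] = j_best
    return assign, required


print(assign_sinks_by_required_time([(10, 1), (10, 1), (3, 1), (3, 1)], 2, 1))
# With equal required times the result matches the least-loaded greedy rule.
print(assign_sinks_by_required_time([(0, 7), (0, 5), (0, 4), (0, 3), (0, 1)], 2, 1))
```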


algorithm two_level_with_required_times
    sort the sinks by increasing required times (r_1, ..., r_n)
    compute gamma_{1,n} = sum_{1 <= i <= n} gamma_i
    foreach b in L {
        k^b_opt = sqrt(beta_b gamma_{1,n} / (beta_s gamma_b))
        foreach 1 <= i <= k^b_opt {
            required[i] = 0.0;
            load[i] = 0.0;
        }
        foreach 1 <= i <= n {
            foreach 1 <= j <= k^b_opt {
                required[i, j] = min(r_i - beta_b load[j], required[j]) - beta_b gamma_i
            }
            j_i = argmax_{1 <= j <= k^b_opt} required[i, j]
            required[j_i] = required[i, j_i]
            load[j_i] = load[j_i] + gamma_i
            assign[i] = j_i
        }
    }
end two_level_with_required_times

Figure 3.6: Two-Level Fanout Tree Taking Required Times into Account

Optimality

Lemma 3.4.1 The greedy sink assignment algorithm is not optimal. This is still true even
if all sink loads are equal and all required times are integer valued.

Proof A counterexample is given in figure 3.7. ■


Figure 3.7: The Greedy Sink Assignment Algorithm is Not Optimal
(panels: optimal assignment, with sink required times 3 3 4 4 4; greedy assignment, with
sink required times 3 4 4 3 4)

We use the greedy sink assignment algorithm because it can handle non-constant required
times as well as non-constant loads. However, if all loads and all buffer drives are equal, and
if we assume for simplicity that all required times are integer valued, we can solve the sink
assignment problem in time O(n log n). This result is achieved by using a decision procedure
that can decide in linear time whether there exists a sink assignment that can produce a
given required time at the source of the fanout tree. An O(n log n) optimal algorithm can
be derived from this decision procedure by using binary search.
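The combination of a decision procedure and binary search can be sketched as follows, for unit loads and unit drive capabilities. This is one reading of the approach, under the assumption that each buffer is filled with consecutive sinks, taken in order of increasing required time, for as long as its required time stays at least D; the function names are hypothetical. The required time of a buffer is the smallest required time among its sinks minus the number of sinks it drives.

```python
def feasible(required, k, D):
    """Can k unit buffers drive all sinks with every buffer required time >= D?

    required: integer required times of the sinks, sorted in increasing order.
    """
    i, n = 0, len(required)
    for _ in range(k):
        if i >= n:
            return True
        capacity = required[i] - D  # number of sinks this buffer may drive
        if capacity <= 0:
            return False  # sink i cannot be driven by any buffer
        i += capacity  # later sinks have larger required times
    return i >= n


def best_required_time(required, k):
    """Largest D for which a feasible assignment to k buffers exists."""
    required = sorted(required)
    low = required[0] - len(required)  # one buffer driving every sink
    high = required[0] - 1             # the earliest sink alone on a buffer
    while low < high:
        mid = (low + high + 1) // 2
        if feasible(required, k, mid):
            low = mid
        else:
            high = mid - 1
    return low


print(best_required_time([3, 3, 4, 4, 4], 2))
```

On this example the best achievable value is 1, obtained by driving the two sinks of required time 3 with one buffer and the three sinks of required time 4 with the other.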

Lemma 3.4.2 Given k identical buffers, with drive capacity equal to 1, n sinks, with loads
equal to 1 and integer required times, and an integer constant D, the following decision
problem can be solved in linear time: is there a sink assignment such that the required time
at any of the buffers is no less than D?

Proof. We assume that the sinks are sorted in order of increasing required times. By
convention we will use the letters $i$ and $j$ as sink indices, and $b$ as a buffer index. We have:
$1 \le b \le k$. We first prove that there is an optimal sink assignment that assigns consecutive
sinks to each buffer. Let $\sigma$ be an optimal sink assignment. Let $R_b = \min_{i \mid \sigma(i) = b} r_i$. The
required time of buffer $b$ is equal to $R_b - |\{i : \sigma(i) = b\}|$. Without loss of generality, we can
always suppose that the buffers are sorted in order of increasing $R_b$, i.e. if $b < b'$ then $R_b \le R_{b'}$.
If sinks are not assigned consecutively by $\sigma$, there exists a pair of sinks $(s_i, s_j)$ such that
$\sigma(i) < \sigma(j)$ and $r_i > r_j$. Interchanging the sinks $s_i$ and $s_j$ does not change the loads of the
buffers, does not decrease the value of $R_{\sigma(i)}$ since $\sigma(i) < \sigma(j)$ and thus $R_{\sigma(i)} \le R_{\sigma(j)} \le r_j$,
and does not decrease the value of $R_{\sigma(j)}$ since $r_i > r_j \ge R_{\sigma(j)}$. By interchanging pairs of
sinks that are assigned out of order, we can therefore obtain a sink assignment that is
no worse than the original optimal assignment and is such that each buffer is assigned
consecutive sinks in required time order. We can thus limit our attention to consecutive
assignments.
With the following algorithm, we can decide in linear time whether there exists a
consecutive assignment such that the required time at the input of each buffer is at least $D$.


If such a consecutive assignment exists, it assigns to the first buffer the sinks $(s_1, \ldots, s_{i_1 - 1})$,
where $i_1$ is the first sink index whose addition would make the required time at the input of
the first buffer less than $D$. It assigns the remaining sinks to the remaining buffers recursively
using the same principle. More precisely, it assigns to the $b$th buffer the sinks
$(s_{i_{b-1}}, \ldots, s_{i_b - 1})$, where the finite sequence $(i_b)_{0 \le b \le k}$ is determined by the following
recurrence equations (a buffer driving sinks $s_{i_b}, \ldots, s_i$ carries a load of $i + 1 - i_b$):

$$i_0 = 1$$

$$i_{b+1} = \min \{\, i : r_{i_b} - (i + 1 - i_b) < D \,\}$$

The required time at the input of buffer $b$ is given by the expression $r_{i_{b-1}} - (i_b - i_{b-1})$, since
all sink loads are equal to 1 and the drives of the buffers are all equal to 1. The answer to
the decision problem is “yes” if and only if some $i_b$ exceeds $n$. ■
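
The decision procedure of Lemma 3.4.2 and the binary search built on top of it can be sketched in Python as follows. This is a minimal sketch under the assumptions of the lemma (unit loads, unit drives, integer required times); the names `feasible` and `best_required_time` and the choice of search bounds are assumptions of this example, not part of the original text.

```python
def feasible(r, k, D):
    """Decide whether k buffers can drive all sinks with required time >= D.

    r: integer required times sorted in increasing order (unit loads).
    A buffer whose earliest sink has required time r[i] can take at most
    r[i] - D consecutive sinks before its own required time drops below D.
    """
    n, i = len(r), 0
    for _ in range(k):
        if i >= n:
            return True
        room = r[i] - D
        if room <= 0:          # even a single sink violates the bound
            return False
        i += room              # give this buffer as many sinks as allowed
    return i >= n

def best_required_time(required_times, k):
    """Largest D for which a feasible assignment exists (binary search on D)."""
    r = sorted(required_times)
    lo = r[0] - len(r)         # always feasible: one buffer drives every sink
    hi = r[0] - 1              # the buffer holding the earliest sink has load >= 1
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if feasible(r, k, mid):
            lo = mid
        else:
            hi = mid - 1
    return lo
```

For example, with two buffers and required times 3, 3, 4, 4, 4, the search returns 1, which corresponds to the consecutive grouping {3, 3} and {4, 4, 4}.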

3.4.5 Combinational Merging

The previous two algorithms are very limited in the kind of fanout trees they can
produce. These structures are sufficient for most practical fanout sizes only if the required
times at the sinks are close to each other. This is not often the case. To be able to obtain
faster fanout trees, it is desirable to explore a larger set of fanout structures than just
two-level trees.
Combinational merging is a simple, $O(n \log n)$ algorithm which has the ability to
generate a rich set of fanout tree structures. Unfortunately, it relies on the characteristics of
a simple delay model, and needs to be adapted heuristically to a more complex delay model.
The basic step of this algorithm, illustrated in Figure 3.2, is simple. We first suppose that
the sinks have been sorted in order of increasing required times $(r_1, \ldots, r_n)$. We take a
group of sinks with the largest required times $(r_k, \ldots, r_n)$, make them the children of a new
buffer node and remove them from the list of sinks. We then compute the required time
at the new buffer node, and merge it with the sorted list of sinks. This transformation is
applied until $k$ becomes equal to 1. In that case the source is used to drive the remaining
sinks directly, unless inserting a buffer between the source and the sinks yields a faster
circuit.
With our delay model, there are two questions to be answered before combinational
merging can be used: how $k$ should be chosen, and which type $b$ of buffer should be used.
We use a heuristic that computes $k = k_b$ as a function of $b$, and we select the best $b$ using
some cost function. For each buffer $b$, we compute $k^b_{opt}$ as in equation 3.7. We take $k_b$ such
that the sum of the loads of the sinks of index $k_b$ to $n$ just exceeds the quantity $\gamma_{1,n} / k^b_{opt}$. We
then create a new buffer of type $b$, connect it directly to a new copy of the source, and
make it drive the sinks of index $k_b$ to $n$, which have the largest required times. We compute
the required time at this new source, and select the buffer type $b$ that maximizes this
required time, and use $k_b$ for $k$. This algorithm is given in Figure 3.8.

algorithm bottom_up_fanout_tree_construction
    sort the sinks s_i by increasing required times (r_1, ..., r_n)
    while n > 1 {
        foreach 1 ≤ i ≤ n compute γ_{i,n} = Σ_{i≤j≤n} γ_j
        foreach b ∈ ℒ {
            k_opt^b = sqrt(β_b γ_{1,n} / (β_s γ_b))
            k_b = max{ i : 1 ≤ i ≤ n, γ_{i,n} ≥ γ_{1,n} / k_opt^b }
            required[b] = r_{k_b} - β_b γ_{k_b,n} - α_b - β_s γ_b
        }
        b = argmax_{g ∈ ℒ} required[g]
        k = k_b
        create a new buffer node v of type b
        attach the sinks (s_k, ..., s_n) to v
        remove the sinks (s_k, ..., s_n)
        compute the required time of v: r_v = min_{k≤j≤n} r_j - β_b γ_{k,n} - α_b
        add v in order to the list of remaining sinks (s_1, ..., s_{k-1})
        set n = k
    }
    attach the only remaining sink to the source node
end bottom_up_fanout_tree_construction

Figure 3.8: Combinational Merging as Fanout Algorithm
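
The bottom-up construction of Figure 3.8 can be sketched in Python as shown below. This is a hedged illustration rather than the original implementation: the `Gate` record, the nested-tuple tree representation, and the guard that forces every merge to group at least two nodes (so that the loop always makes progress) are assumptions introduced for this example. Like the figure, the sketch always inserts buffers and omits the final comparison with driving the remaining sinks directly.

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class Gate:
    alpha: float  # intrinsic delay
    beta: float   # load-dependent delay coefficient
    gamma: float  # input load

def combinational_merging(sinks, library, source):
    """sinks: list of (required_time, load). Returns (required, tree).

    Leaves of the tree are sink indices; an internal node is (gate, children).
    The returned required time is taken at the input of the source gate.
    """
    nodes = sorted((r, g, i) for i, (r, g) in enumerate(sinks))
    while len(nodes) > 1:
        n = len(nodes)
        suffix = [0.0] * (n + 1)              # suffix[i] = load of nodes i..n-1
        for i in range(n - 1, -1, -1):
            suffix[i] = suffix[i + 1] + nodes[i][1]
        best = None
        for b in library:
            k_opt = math.sqrt(b.beta * suffix[0] / (source.beta * b.gamma))
            target = suffix[0] / k_opt
            k = max((i for i in range(n) if suffix[i] >= target), default=0)
            k = min(k, n - 2)                 # assumption: merge at least two nodes
            req = nodes[k][0] - b.beta * suffix[k] - b.alpha - source.beta * b.gamma
            if best is None or req > best[0]:
                best = (req, k, b)
        _, k, b = best
        r_v = nodes[k][0] - b.beta * suffix[k] - b.alpha
        v = (r_v, b.gamma, (b, [t for _, _, t in nodes[k:]]))
        nodes = sorted(nodes[:k] + [v], key=lambda t: t[0])
    r, g, tree = nodes[0]
    return r - source.alpha - source.beta * g, tree
```

Re-sorting the node list after each merge keeps the sketch short; an implementation aiming at the stated complexity would insert the new node into the sorted list in logarithmic time instead.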

The choice of this heuristic can be motivated as follows. For a given $b$, we need
to determine an adequate value for $k$. A choice based on taking a $1/k^b_{opt}$ fraction of the total
remaining sink load appears reasonable, since, in the case where all required times are equal,
it leads to a two-level tree that is close to the optimum fanout tree. To compute $k^b_{opt}$, we
suppose that the buffers are driven by the source gate; this is too restrictive in general and
is likely to lead to suboptimal results. Yoshikawa et al. [44] recently proposed an extension
of this algorithm, based on branch and bound techniques, that does not suffer from this
limitation.

Complexity. The complexity of this algorithm can be analyzed, provided that we make
the following simplifying assumption: at each step of the algorithm, the number of sinks is
decreased by $a\sqrt{n}$ for some constant $a$. If we suppose that all sinks have the same load,
and all loads and all drive capabilities are equal, we have $k_{opt} = \sqrt{n}$ and $k = n + 1 - \lceil \sqrt{n} \rceil$.
If we simply suppose that all sink loads are equal to some nominal value $\gamma$, we have:
$k = n + 1 - \lceil \sqrt{n}\,\sqrt{\beta_s \gamma_b / (\beta_b \gamma)} \rceil$. To make things more concrete, we computed the quantity
$\sqrt{\beta_s \gamma_b / (\beta_b \gamma)}$ for all three inverters of the MCNC library, using as source a 2-input NAND gate
and as sink the input of a 2-input NAND gate. The actual values of this coefficient were 0.76, 1.64 and
3.04. The larger the buffer is, the larger this coefficient is, since increasing the size of the
buffer has the effect of decreasing the load-dependent delay coefficient $\beta_b$ and increasing the
input load $\gamma_b$.
Thus each step of the algorithm is guaranteed to reduce the number of sinks by
$a\sqrt{n}$, where $a$ is a constant depending on the library, the drive of the source and the loads of
the sinks. Inserting a new sink takes $O(\log n)$ time and recomputing the quantities $\gamma_{i,n}$ takes
$O(n)$. Each step is done $d$ times, once per buffer. In total, the complexity of the algorithm is
$O(d\,n\,f(n))$ where $f(x) = \min\{k : g^k(x) \le 1\}$ with $g(x) = x - a\sqrt{x}$ and $g^k(x) = g^{k-1}(g(x))$.
Using the result of the following lemma, we deduce that the complexity of this algorithm is
$O(d\,n^{1.5})$.

Lemma 3.4.3 Asymptotically, $f(n)$ does not exceed $\frac{2}{a}\sqrt{n}$:

$$\limsup_{n \to +\infty} \frac{f(n)}{\sqrt{n}} \le \frac{2}{a} \tag{3.9}$$

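Proof. First we note that $f$ is unchanged if we modify $g$ so that $g(x) = 0$ for all values of $x$
satisfying $x \le a^2$. For any given $n > a^2$ we have, since the sequence $g^i(n)$ is monotonically
nonincreasing (and as long as the iterates remain above $a^2$; otherwise $g^k(n) = 0$ and the bound is trivial):

$$g^k(n) = n \prod_{1 \le i \le k} \left(1 - \frac{a}{\sqrt{g^{i-1}(n)}}\right) \tag{3.10}$$

$$\le n \prod_{1 \le i \le k} \left(1 - \frac{a}{\sqrt{n}}\right) \tag{3.11}$$

$$\le n \left(1 - \frac{a}{\sqrt{n}}\right)^{k} \tag{3.12}$$

Using the inequality $(1 - \frac{1}{x})^x \le e^{-1}$, valid for all $x > 1$, we can derive, for any given real
number $\varepsilon > 0$, the following inequality:

$$g^{\lceil \varepsilon\sqrt{n} \rceil}(n) \le n \left(1 - \frac{a}{\sqrt{n}}\right)^{\varepsilon\sqrt{n}} \tag{3.13}$$

$$\le n \left(\left(1 - \frac{a}{\sqrt{n}}\right)^{\sqrt{n}/a}\right)^{\varepsilon a} \tag{3.14}$$

$$\le \frac{n}{e^{\varepsilon a}} \tag{3.15}$$

This inequality is also valid for $0 < n \le a^2$, since in that case $\lceil \varepsilon\sqrt{n} \rceil \ge 1$, and thus
$g^{\lceil \varepsilon\sqrt{n} \rceil}(n) = 0 \le \frac{n}{e^{\varepsilon a}}$. By applying this inequality to $\frac{n}{e^{\varepsilon a}}$ we obtain:

$$g^{\lceil \varepsilon\sqrt{n / e^{\varepsilon a}} \rceil}\!\left(\frac{n}{e^{\varepsilon a}}\right) \le \frac{n}{e^{2\varepsilon a}} \tag{3.16}$$

Since $g$ is monotonic nondecreasing, we deduce from the previous two inequalities:

$$g^{\lceil \varepsilon\sqrt{n} \rceil + \lceil \varepsilon\sqrt{n / e^{\varepsilon a}} \rceil}(n) \le \frac{n}{e^{2\varepsilon a}} \tag{3.17}$$

By induction we obtain, for any $k \ge 1$:

$$g^{\sum_{i=0}^{k-1} \lceil \varepsilon\sqrt{n / e^{i\varepsilon a}} \rceil}(n) \le \frac{n}{e^{k\varepsilon a}} \tag{3.18}$$

In particular, if $k = \lceil \frac{\log n}{\varepsilon a} \rceil$, we have $\frac{n}{e^{k\varepsilon a}} \le 1$, which proves that:

$$f(n) \le \sum_{i=0}^{\lceil \frac{\log n}{\varepsilon a} \rceil - 1} \left\lceil \varepsilon\sqrt{\frac{n}{e^{i\varepsilon a}}} \right\rceil \tag{3.19}$$

We can deduce from this inequality that:

$$f(n) \le \sum_{i=0}^{\lceil \frac{\log n}{\varepsilon a} \rceil - 1} \left(1 + \varepsilon\sqrt{n}\, e^{-i\varepsilon a / 2}\right) \tag{3.20}$$

$$\le 1 + \frac{\log n}{\varepsilon a} + \varepsilon\sqrt{n} \sum_{i=0}^{+\infty} e^{-i\varepsilon a / 2} \tag{3.21}$$

$$\le 1 + \frac{\log n}{\varepsilon a} + \frac{\varepsilon\sqrt{n}}{1 - e^{-\varepsilon a / 2}} \tag{3.22}$$

Since $\varepsilon$ is arbitrary, we can select it to be equal to $\frac{1}{\sqrt{\log n}}$. We then obtain:

$$f(n) \le 1 + \frac{(\log n)^{3/2}}{a} + \frac{\sqrt{n}}{\sqrt{\log n}\,\left(1 - e^{-a / (2\sqrt{\log n})}\right)} \tag{3.23}$$

We obtain finally:

$$\limsup_{n \to +\infty} \frac{f(n)}{\sqrt{n}} \le \lim_{n \to +\infty} \left( n^{-1/2} + \frac{(\log n)^{3/2}}{a\sqrt{n}} + \frac{1}{\sqrt{\log n}\,\left(1 - e^{-a / (2\sqrt{\log n})}\right)} \right) \tag{3.24}$$

$$\le \frac{2}{a} \tag{3.25}$$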
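
The bound of Lemma 3.4.3 can be checked numerically with the short Python sketch below, which simply iterates $g(x) = x - a\sqrt{x}$ (with $g(x) = 0$ for $x \le a^2$, as in the proof) and counts the steps needed to reach 1 or less. The function name `merge_steps` is an assumption of this example; the sketch illustrates the statement and is not a substitute for the proof.

```python
import math

def merge_steps(n, a):
    """Number of iterations of g(x) = x - a*sqrt(x) needed to bring n down to <= 1.

    g is taken to be 0 for x <= a*a, as in the proof of the lemma.
    """
    x, k = float(n), 0
    while x > 1.0:
        x = x - a * math.sqrt(x) if x > a * a else 0.0
        k += 1
    return k

if __name__ == "__main__":
    for n in (10**2, 10**4, 10**6):
        k = merge_steps(n, 1.0)
        # The ratio k / sqrt(n) should stay below 2 / a = 2.
        print(n, k, k / math.sqrt(n))
```

Each iteration decreases $\sqrt{x}$ by at least $a/2$, which is why the step count never exceeds $2\sqrt{n}/a$ in these runs.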

3.4.6 Fanout Optimization based on LT-Trees

The main weakness of the combinational merging algorithm is that it relies on a
simple-minded heuristic to determine which type of buffer to use and how many sinks among
those with the largest required times should be grouped under one buffer. In this section,
we propose a new fanout algorithm that realizes a compromise between the two-level fanout
tree algorithms and the combinational merging algorithm. This new algorithm is based on
LT-trees, a restricted class of fanout trees that is described below.
Like the two-level algorithms, the LT-tree algorithm only considers a subset of the
set of all possible fanout trees. This subset is small enough for the algorithm to be practical,
but large enough to allow the algorithm to perform not only buffering of large capacitive
loads, but also, like the combinational merging algorithm, critical signal isolation. The
LT-tree algorithm has the additional advantage of using dynamic programming both to select
the shape of the tree and the types of buffers to be used at intermediate nodes: it gets much
of its ability to generate fast fanout trees from the tight connection between gate selection
and pattern selection. In that sense, it is a direct analog of tree covering algorithms.
In the rest of this section, we give a definition of LT-trees, describe and analyze
the fanout algorithm based on LT-trees for delay minimization, and show how it can be
extended to perform area optimization under a delay constraint, using the same technique
as the one we used with tree covering.

LT-Trees. LT-trees are designed to be just complex enough to allow for critical signal
isolation and buffering of large capacitive loads. A recursive definition of the set of
LT-trees is given below. Figure 3.9 illustrates this definition. When an LT-tree is used as a
fanout circuit, its root corresponds to the source of the signal, and its leaves to the sinks.
If the tree is composed of a single leaf, the source and the sink are not distinguished here.
In the actual circuit, they would be distinguished, and connected by a wire.

[Figure 3.9: Definition of LT-trees. Three panels: a leaf is an LT-tree; a two-level tree is an LT-tree; a tree whose root has one child being an LT-tree, the others being leaves, is an LT-tree.]

Definition 5 (1) A leaf is an LT-tree.

(2) A two-level tree is an LT-tree.

(3) Let $T$ be a tree rooted at $r$ such that one child of $r$ is an LT-tree and all the other
children of $r$ are leaves. Then $T$ is an LT-tree.

In an LT-tree, there is at most one intermediate node that has more than one
intermediate node as a child. If there is no such node, the LT-tree is terminated according
to case (1) of the definition, and is called an LT-tree of type 1. If there is such a node, the
node is the root of a two-level tree which terminates the LT-tree. In that case, the LT-tree
is of type 2. Examples of type 1 and type 2 LT-trees are given in Figure 3.10. In the
example of type 2, the intermediate node with more than one intermediate node as a child
is highlighted.

Theorem 3.4.4 The number of LT-trees of type 1 is equal to $(d + 1)^{n-2}$, where $n$ is the
number of sinks and $d$ the number of buffers in the library.

[Figure 3.10: Examples of LT-trees. Two panels: tree of type 1, tree of type 2.]

Proof. An LT-tree of type 1 is entirely determined by the number $k$ of intermediate nodes of
the tree, the number of leaves attached to each intermediate node, and the buffer selected at
each intermediate node. For a given $k$, we can represent the topological structure of an
LT-tree by a unique $k$-tuple of integers $(x_1, \ldots, x_k)$ satisfying $1 \le x_1 < x_2 < \ldots < x_k \le n - 2$,
where leaves $x_j + 1$ to $x_{j+1}$ are the leaves connected to the $j$th intermediate node (by
convention we set $x_{k+1}$ to be equal to $n$). In particular, leaves 1 to $x_1$ are connected to the
root node, and leaves $x_k + 1$ to $n$ are connected to the last intermediate node of the tree.
The inequality $x_k \le n - 2$ is there to guarantee that the last intermediate node is connected
to at least 2 sinks, as required by the definition of type 1 LT-trees. The number of such
$k$-tuples of integers is equal to the number of distinct ways of choosing $k$ elements among
$n - 2$, i.e. $\binom{n-2}{k}$. In addition, for a given topological structure with $k$ intermediate
nodes, there are exactly $d^k$ possible assignments of buffers to intermediate nodes of the tree.
In total, the number of distinct LT-trees of type 1 is given by the formula:

$$\sum_{0 \le k \le n-2} \binom{n-2}{k} d^k = (d + 1)^{n-2} \tag{3.26}$$
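
The counting argument behind equation 3.26 can be checked by brute-force enumeration for small values of $n$ and $d$. The Python sketch below enumerates the cut points $(x_1, \ldots, x_k)$ and the buffer choices explicitly; the function name `count_type1_lt_trees` is an assumption of this example.

```python
from itertools import combinations, product

def count_type1_lt_trees(n, d):
    """Count type 1 LT-trees with n sinks and d buffer types by enumeration."""
    count = 0
    for k in range(0, n - 1):                           # k = 0 .. n-2 intermediate nodes
        # Cut points 1 <= x_1 < ... < x_k <= n-2 fix the shape of the tree.
        for cuts in combinations(range(1, n - 1), k):
            # One buffer type per intermediate node.
            for buffers in product(range(d), repeat=k):
                count += 1
    return count

if __name__ == "__main__":
    for n, d in [(4, 1), (5, 2), (6, 3)]:
        print(n, d, count_type1_lt_trees(n, d), (d + 1) ** (n - 2))
```

The two printed counts agree in each case, as the binomial theorem predicts.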

The LT-tree based algorithm explores implicitly, using dynamic programming, all
LT-trees of type 1. For LT-trees of type 2, the LT-tree based algorithm only considers those
trees whose two-level subtrees are derived from the two-level algorithm of section 3.4.4. In
other words, for every LT-tree of type 1, the algorithm only considers one LT-tree of type 2,
for a total search space of size $2(d + 1)^{n-2}$. If we ignore tree patterns (i.e., taking $d = 1$), then the size of
the search space is $2^{n-1}$. This is only a small fraction of the total number of rooted trees
with $n$ leaves, which is at least of the order of the Catalan numbers $T(n) = \frac{1}{n}\binom{2n-2}{n-1}$
(the Catalan numbers give the number of ways to fully parenthesize a string of $n$ symbols).
Using Stirling's formula, we can easily deduce that $T(n)$ is asymptotically equivalent, up to a
constant factor, to $\frac{2^{2n}}{n^{1.5}}$.

[Figure 3.11: Suboptimality of LT-Trees with Consecutive Sink Ordering. Two panels: Sorted Required Times, Unsorted Required Times. Legend: required times are shown inside nodes, loads over edges; delay = load × β.]

The LT-tree based algorithm only considers assignments of sinks to leaves of the
LT-trees that are such that sinks with larger required times are placed further from the
root of the tree. This is partially justified by the following fact:

Lemma 3.4.5 If sink loads are all equal, there is an optimal LT-tree such that the sinks
with larger required times are placed further from the root.

Proof. When loads are equal, exchanging two sinks that are out of order in an LT-tree can
only increase the required time at the source of the tree. ■

Unfortunately, for arbitrary load values, the optimal LT-tree may require that sinks are
placed out of order, as can be checked by inspection in the example of Figure 3.11.


Selection of LT-Trees with Dynamic Programming. The LT-tree based algorithm
selects the best LT-tree for a fanout problem by dynamic programming, under the restrictions
presented in the previous paragraph, namely that the two-level subtrees of LT-trees of
type 2 are restricted to be trees generated by the two-level fanout algorithm of section 3.4.4
and the sinks appear in the tree in order of increasing required times. From now on, we
suppose that these restrictions are enforced, and we will only speak about LT-trees without
additional qualifications.
The LT-tree based algorithm works as follows. As with previous algorithms, we
first preprocess the sinks by sorting them in order of increasing required times and compute
the quantities $\gamma_{i,n} = \sum_{i \le j \le n} \gamma_j$. Then we precompute all the two-level trees that will be
considered as subtrees of candidate LT-trees. This precomputation is done by calling the
two-level fanout algorithm of section 3.4.4 on a fanout problem composed of a source and
$n - k + 1$ sinks. The source is a buffer $b$ from the library, or, if $k = 1$, the source $s$ of
the original fanout problem. The sinks are the $n - k + 1$ sinks with largest required times,
$(s_k, \ldots, s_n)$. This computation is done for all values of $k$ between 1 and $n$, and, for $k > 1$,
all buffers in the library. For each pair $(k, g)$ in $\{(1, s)\} \cup ([2, \ldots, n] \times \mathcal{L})$, we keep in a table
the required time required_two-level[k, g] achievable by the two-level fanout algorithm. The
tree itself does not need to be stored at this point: it can be recomputed if needed.
Each pair $(k, g)$ in $\{(1, s)\} \cup ([2, \ldots, n] \times \mathcal{L})$ specifies a fanout subproblem, of
source $g$ and sinks $(s_k, \ldots, s_n)$. The algorithm relies on dynamic programming to compute
by induction on $k$, $k$ varying from $n$ to 1, an LT-tree that achieves the maximum
required time for each fanout subproblem $(k, g)$. For a given pair $(k, g)$, an optimal LT-tree
$T_{(k,g)}$ can be obtained by selecting the best of the following $(n - k)d + 1$ configurations:

(1) for some sink index $l > k$ and buffer type $b$, the root of $T_{(k,g)}$ is directly connected to
sinks $(s_k, \ldots, s_{l-1})$ and to a buffer of type $b$. The subtree connected to the buffer $b$ is
an optimal LT-tree $T_{(l,b)}$ for the subproblem $(l, b)$. Since the algorithm proceeds by
induction from $n$ to 1, $T_{(l,b)}$ has already been computed and is available.

(2) $T_{(k,g)}$ is the two-level tree precomputed for pair $(k, g)$.

The algorithm is detailed in Figure 3.12. The best required time achievable for a pair $(k, g)$
is stored in the table required[k, g]. To keep track of the optimal configuration selected for
$(k, g)$, we simply need to store a flag, use_two_level[k, g], to decide whether a two-level tree
is used or not; and, in the case a two-level tree is not used, an index, next[k, g], which is
the index of the first sink not directly attached to the root of $T_{(k,g)}$, and, if next[k, g] $\le n$, a
buffer type, gate[k, g], which is the type of the unique buffer attached to the root of $T_{(k,g)}$.
An example of computation of these entries is given in Figure 3.13.

To compute the required time of a configuration other than a precomputed two-level tree,
we start with the required time of the selected subproblem $(l, b)$, required[l, b]. This required
time is not exactly the required time at the input of $T_{(l,b)}$: it does not take into account the
intrinsic delay of buffer $b$. To obtain the required time at the input of $T_{(l,b)}$, we need to
subtract from required[l, b] the intrinsic delay $\alpha_b$ of buffer $b$. The required time at the output
of the root of $T_{(k,g)}$ is then the minimum between the earliest required time of a sink
connected to the root of $T_{(k,g)}$, which is $r_k$, and the required time at the input of $T_{(l,b)}$,
which is required[l, b] $- \alpha_b$. To obtain the required time required[k, g], we need to subtract
from $\min(r_k, \text{required}[l, b] - \alpha_b)$ the load-dependent delay $\beta_g(\gamma_b + \gamma_{k,n} - \gamma_{l,n})$, but we do not
include the intrinsic delay of gate $g$. $\gamma_b$ is the load of buffer $b$ while $\gamma_{k,n} - \gamma_{l,n}$ is the sum of
the loads of the sinks attached to the root of $T_{(k,g)}$. The required time required[k, g] is
actually the best required time achieved for any possible choice of $(l, b)$, with $l > k$.

algorithm lt_tree_computation
    sort the sinks s_i by increasing required times (r_1, ..., r_n)
    foreach 1 ≤ i ≤ n compute γ_{i,n} = Σ_{i≤j≤n} γ_j
    required[n + 1, ∅] = +∞;
    for k = n to 1 {
        foreach g such that (k, g) ∈ {(1, s)} ∪ ([2, ..., n] × ℒ) {
            required[k, g] = required_two-level[k, g];
            use_two_level[k, g] = true;
            foreach (l, b) ∈ {(n + 1, ∅)} ∪ ([2, ..., n] × ℒ) such that l > k {
                required = min(r_k, required[l, b] - α_b) - β_g (γ_b + γ_{k,n} - γ_{l,n});
                if (required > required[k, g]) {
                    required[k, g] = required;
                    use_two_level[k, g] = false;
                    next[k, g] = l;
                    gate[k, g] = b;
                }
            }
        }
    }
end lt_tree_computation

Figure 3.12: Fanout Optimization with LT-Trees

[Figure 3.13: Illustration of LT-Tree Algorithm]
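
The dynamic program of Figure 3.12 can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the original implementation: the `Gate` record, the `lt_tree` function, and the `two_level` callback are hypothetical names, and the default callback returns minus infinity, which amounts to exploring LT-trees of type 1 only. A real implementation would pass in the table precomputed by the two-level algorithm of section 3.4.4.

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class Gate:
    name: str
    alpha: float  # intrinsic delay
    beta: float   # load-dependent delay coefficient
    gamma: float  # input load

def lt_tree(sinks, library, source, two_level=lambda k, g: -math.inf):
    """sinks: list of (required_time, load). Returns (required[1, source], choice).

    choice[k, g] is ("two_level",) or ("split", l, b): the root g drives sinks
    k..l-1 directly plus one buffer b that roots the subtree for sinks l..n
    (l = n + 1 and b = None mean that g drives all remaining sinks directly).
    """
    sinks = sorted(sinks)
    n = len(sinks)
    r = [None] + [req for req, _ in sinks]       # 1-based required times
    suffix = [0.0] * (n + 2)                     # suffix[i] = gamma_{i,n}
    for i in range(n, 0, -1):
        suffix[i] = suffix[i + 1] + sinks[i - 1][1]
    required, choice = {}, {}
    for k in range(n, 0, -1):
        for g in ([source] if k == 1 else library):
            best, how = two_level(k, g), ("two_level",)
            cands = [(n + 1, None)] + [(l, b) for l in range(k + 1, n + 1) for b in library]
            for l, b in cands:
                sub = math.inf if b is None else required[l, b] - b.alpha
                load = (0.0 if b is None else b.gamma) + suffix[k] - suffix[l]
                cand = min(r[k], sub) - g.beta * load
                if cand > best:
                    best, how = cand, ("split", l, b)
            required[k, g] = best
            choice[k, g] = how
    return required[1, source], choice
```

With one unit buffer, a unit-drive source with zero intrinsic delay, and sinks with required times 2, 10, 10, 10 and unit loads, the sketch keeps the critical sink on the source and places the other three behind a buffer, which illustrates critical signal isolation.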

Complexity of the Algorithm. The precomputation of two-level trees requires no more
than $d \times n$ calls to the two-level fanout algorithm, for a total cost of $O(d^2 n^{2.5})$. The
computation, in the main algorithm, of an entry for a pair $(k, g)$ requires $O(d \times n)$ operations.
Since there are $O(d \times n)$ such pairs, the total cost of the main part of the algorithm is
$O(d^2 n^2)$. Overall, the complexity of the LT-tree based algorithm is thus $O(d^2 n^{2.5})$.

Allowing Nodes with No Leaves. It can be helpful in practice to allow some intermediate
nodes in an LT-tree to bear no leaves. For example, this gives the algorithm the freedom
to generate a sequence of buffers of increasing sizes to drive large loads when needed, as for
example at the root of a two-level tree. This can be implemented as a direct extension of
the dynamic programming algorithm of Figure 3.12, by computing, for each triplet $(k, g, l)$,
with $0 \le l < L$, the optimal LT-tree $T_{k,g,l}$ for sinks $k$ to $n$ with source $g$ and $l$ or fewer
intermediate nodes with no leaves connected to them. $T_{k,g,l}$ can only be composed of a
buffer attached to the tree $T_{k,g,l-1}$ or a buffer attached to the tree $T_{k',g,l}$ for some $k' > k$. If
we allow up to $L - 1$ intermediate nodes to bear no leaves, the complexity of the algorithm
becomes $O(L\,d^2 n^{2.5})$.

Minimization of Area under a Delay Constraint. The LT-tree based algorithm can
be modified to minimize area under a delay constraint. To do so, we use the same technique
as the one we used on tree covering in chapter 2. The only important difference between
LT-tree based fanout optimization and tree covering is that the role of required times and
arrival times is reversed. The modified algorithm works as follows. Instead of selecting a
minimum delay LT-tree for every pair $(k, g)$, we select a minimum cost LT-tree $T_{k,g,a}$ for
every triplet $(k, g, a)$, where $a$ is an arrival time. The cost of each tree evaluated by the
algorithm is of the form $(A, r)$, where $A$ is the area of the tree and $r$ the required time at
the source of the tree. When two trees are compared, the tree with smaller area is selected,
unless its required time is smaller than the arrival time $a$, in which case the tree with larger
required time is selected.
To make this algorithm practical, arrival times need to be discretized. If $\tau$ is the
number of discretization intervals used for arrival times, the complexity of the algorithm
becomes $O(\tau\,d^2 n^{2.5})$. Better results can be obtained by discretizing the values of $a$ in
$(k, g, a)$ within bounds dependent on the value of $k$ and $g$. That technique was also described
in chapter 2 and can be applied here without modification.
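
The comparison rule between two candidate trees described above can be written as a small Python function. This is only a sketch of the selection rule as stated in the text; the function name `select_tree` and the representation of a cost as an (area, required time) pair are assumptions of this example.

```python
def select_tree(cost1, cost2, arrival):
    """Pick between two tree costs (area, required_time) for arrival time `arrival`.

    The tree with smaller area wins, unless its required time is smaller than
    the arrival time, in which case the tree with larger required time wins.
    """
    small, other = (cost1, cost2) if cost1[0] <= cost2[0] else (cost2, cost1)
    if small[1] >= arrival:
        return small                     # cheapest tree meets the constraint
    return small if small[1] >= other[1] else other

if __name__ == "__main__":
    print(select_tree((5, 3.0), (8, 4.0), 2.0))  # smaller tree is fast enough
    print(select_tree((5, 1.0), (8, 4.0), 2.0))  # smaller tree is too slow
```

In the first call the smaller tree already meets the arrival time and is kept; in the second it does not, so the tree with the larger required time is chosen.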

3.4.7 Other Fanout Algorithms

Other fanout optimizations have been proposed, by Berman et al. in [5] and Singh
et al. in [40]. None of these algorithms are optimal, since they all rely, as the LT-tree
algorithm does for LT-trees of type 1, on ordering sinks by required times. All of these
algorithms produce trees that have the following property: there exists a depth-first search
order of the nodes of the tree that visits the leaves in order of increasing required times.
We give in Figure 3.14 an example of a fanout problem that cannot be solved optimally
with such trees, even with the simple unit-fanout delay model of section 3.3.2. It is to be
noted that the suboptimality of these algorithms has nothing to do with the fact that the
fanout problem is NP-complete for some delay models. It is actually not known whether
the fanout problem is NP-complete for the unit-delay model.

[Figure 3.14: Fanout Problem Unsolvable with Consecutive Sink Ordering. Panel: Optimal Buffering (Unit-Delay Model).]

Berman's Algorithms. Berman et al. presented two algorithms for the fanout problem.
The first is a complex algorithm, also based on dynamic programming, which can be seen as a
generalization of our LT-tree based algorithm; the second is a much simpler algorithm, called the
two-group algorithm.
The LT-tree algorithm only allows trees that have at most one buffer in the fanout
of any buffer. Since this condition is too restrictive to produce good quality results, we have
relaxed it somewhat by allowing fanout trees to contain one balanced fanout tree as a subtree
if it helps reduce delay. Berman's algorithm relaxes the restriction of having at most one
buffer in the fanout of any buffer using a different technique. The algorithm allows any
number of buffers in the fanout of a buffer, from 1 to some limit $k$, and uses dynamic
programming to select this number optimally. By restricting the sinks to be ordered by
increasing required times, the dynamic programming algorithm can be made polynomial,
though with a large exponent: $O(k n^3)$, if we ignore buffer selection. If we want to take
buffer selection into account with our delay model, a direct modification of this algorithm
yields a complexity $O(n^2 d^k (n + d))$. The $d^k$ term comes from the fact that we have $d^k$
possible ways of assigning buffers to $k$ inputs. If $k = 1$, we obtain the bound $O(n^2 d^2)$
because the term $n^2 d^k n$ disappears in that case. This is the complexity of the LT-tree
based algorithm if we do not use two-level trees.
The two-group algorithm was introduced by Berman et al. as a more
practical alternative to fanout optimization. This very simple algorithm tries all possible
decompositions of the sinks into two groups of sinks with consecutive required times, and,
for each group, supposes that all required times and all loads are equal to select a balanced
tree out of a set of precomputed trees. The precomputation can be done once and for all
for a given library, so the run time of the two-group algorithm is essentially $O(n)$. This
simple algorithm is fast but unlikely to generate results of the same quality as the algorithms
proposed earlier.

Singh's Algorithm. Singh's algorithm [40] uses a divide-and-conquer strategy in which
the set of fanout signals is partitioned into subsets, and the process is recursively applied
on the smaller problems. It can be seen as the recursive application of the two-group
algorithm of Berman et al., and bears some similarity with the technique used by Paulin
et al. to decompose gates with large fanins [35]. Singh's algorithm builds a fanout tree in
a top-down fashion, in contrast with combinational merging, which works bottom-up. The
partitioning of the fanout signals is based on a greedy procedure that determines the kind
of redistribution of the fanout signals (based on the required times and load distributions)
that results in the greatest saving at the current step. The algorithm is able to generate
a balanced fanout tree if the signals are required at similar times and a skewed tree if the
signals are required at widely different times. The complexity of this algorithm is $O(d^2 n^2)$.

3.5 Handling Differing Polarities

Handling Inverting Buffers. In some technologies, like CMOS standard cells, noninverting
buffers are made of a juxtaposition of two inverting buffers. For delay optimization,
it is thus always preferable to use inverting buffers. Inverting buffers offer more possibilities
for optimization, and in the worst case can always be combined to reproduce the delay
characteristics of noninverting buffers. There is no major difficulty in handling inverting
buffers. We simply need to keep track of the polarity of the signal at the output of a buffer
and accept a connection with a sink only if the polarities match.

Sinks Required under Both Polarities. In real circuits, a signal is often needed under
both polarities. It is possible to extend all previous algorithms to handle this situation
simply by separating the sinks into two groups, one for each polarity, and using the source
of the signal to distribute the signal to one group, and an inverter connected to the source to
distribute the signal to the other group. When inverting buffers are used, we do not specify
which group of sinks is directly connected to the source, since connecting the source to the
sinks with inverted polarities may yield a faster solution. We actually try both assignments
and keep the best solution.

Treating Both Polarities Simultaneously. The LT-tree based algorithm and Singh's
fanout algorithm can handle signals of differing polarities directly, at an increased
computational cost. This approach yields in general better results than the technique suggested
in the previous section but is too expensive to be applied to large fanout problems.
The LT-tree algorithm only needs an extra index variable to keep track of positive
and negative sinks independently. Since the two-level algorithms cannot handle sinks of
different polarities, the precomputation of two-level trees is done independently, for a total
cost of $O(d^2 \max(n, p)^{2.5})$, where $p$ is the number of sinks of positive phase and $n$ the number
of sinks of negative phase. The cost of the rest of the computation is $O(d^2\, n\, p \max(n, p))$,
for a total cost of $O(d^2 \max(n, p) \max(np, \max(n, p)^{1.5}))$. When treating sinks of different
polarities simultaneously, Singh's algorithm becomes $O(d^3 n^2 p^2)$ [40].

3.6 Peephole Optimizations for Area and Delay

3.6.1 Motivation

Many local optimizations on fanout trees can be performed independently of the
fanout algorithm used to produce them. Implementing these optimizations as a postprocessing
phase on fanout trees has several advantages:

• the local optimizations only have to be implemented once, instead of once per fanout
algorithm.

• the fanout algorithms can be made simpler. This is particularly true for area minimization
under a delay constraint, which can be done quite effectively by local optimization.
That way, the fanout algorithms can simply be implemented for delay
minimization only.

• good quality results can be obtained efficiently by the combined effect of a simple and
fast fanout algorithm and a simple and fast local optimization algorithm.

This corresponds to a general approach for solving NP-complete problems approximately
that has often been found effective in practice: first compute an initial solution with a
global greedy algorithm, then improve this solution with local optimizations. This
organization is commonly used in optimizing compilers. It is also the basis of the best
known heuristics for combinatorial problems such as the Traveling Salesman Problem.
In the rest of this section we first present an optimal algorithm to perform buffer
selection on a fanout tree, under the constraint that the topological structure of the tree
remains unchanged. This optimization is useful after all our fanout optimization algorithms,
since none guarantees optimal gate selection. Then we present several local optimizations to
decrease the area of a fanout tree under a delay constraint.

3.6.2 Optimal Buffer Selection

Fanout algorithms do not guarantee in general that the buffers in the fanout trees
they produce are selected optimally. For example, the two-level algorithm forces all
intermediate buffers to be of the same type, which is not necessarily the best possible solution;
the combinational merging algorithm selects buffers heuristically, without complete
knowledge of the local structure of the tree in which the buffer is inserted. By performing optimal
buffer selection on the fanout trees resulting from these algorithms, we can at times decrease
delay significantly. In addition, the computational cost of performing optimal gate selection
is very low: $O(d^2 m)$, where $m$ is the number of edges in the tree, and $d$ the number of
buffers in the library. The cost of performing optimal gate selection is thus only a fraction
of the cost of building a fanout tree in the first place. Thus there is no reason not to perform
optimal gate selection on every fanout tree.
Optimal buffer selection can be implemented with a simple algorithm that proceeds
from the sinks to the root of a fanout tree, and selects, at each intermediate node, a
buffer that maximizes the required time at the parent of that node. In contrast with tree
covering, it is not enough to simply select a buffer that maximizes the required time at a
node, because a later required time for a subtree usually means a higher load to drive for
that subtree. Selecting the largest possible required time for a subtree can slow down the
signal going to the other subtrees sharing the same parent. It is possible, using dynamic
programming, to compute for each subtree the required time required[b] achievable by
using a buffer of type b at the root of the subtree. To do so, we proceed as follows.

Let $v$ be an intermediate node of the tree, with nodes $(v_1, \ldots, v_n)$ as children. For each
buffer $b$ in the library, we have already computed by induction the best required times
$\mathrm{required}[v_i, b]$ achievable for the subtrees rooted at $v_i$, $1 \le i \le n$, if $b$ is selected
at node $v_i$. To compute $\mathrm{required}[v, b]$, we simply need to find an assignment of buffers to
nodes $(v_1, \ldots, v_n)$, i.e. a function $f : \{1, \ldots, n\} \to \{1, \ldots, d\}$, that maximizes the
required time at $v$, given by the following expression:

    $\mathrm{required}[v, b] = \min_{1 \le i \le n} \mathrm{required}[v_i, b_{f(i)}] - \beta_b \sum_{1 \le i \le n} \gamma_{b_{f(i)}} - \alpha_b$    (3.27)

To compute $f$ by exhaustive enumeration would require $O(d^n)$ operations. It is
not uncommon to have libraries with $d$ of the order of 10, so this brute force approach is not
practical. In the next subsection, we present an $O(dn)$ algorithm to compute the optimal
solution of equation 3.27. To compute gate selection for an entire tree, we need to apply
this algorithm $d$ times to each intermediate node, for a total cost of $O(d^2 m)$, where $m$ is the
number of edges in the fanout tree. The number of edges in a tree is equal to the number
of nodes of the tree minus one.
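
For reference, the exhaustive enumeration of equation 3.27 can be sketched as follows. This is a minimal illustrative sketch in Python, not code from this work: the function name `required_by_enumeration`, the 0-based buffer indices and the list-based data layout are assumptions made for the example. It is usable only for very small n and d, which is why a faster algorithm is needed.

```python
from itertools import product


def required_by_enumeration(alpha_b, beta_b, child_required, gamma):
    """Evaluate equation 3.27 by brute force for one node v and one buffer b.

    child_required[i][j] -- required[v_i, b_j] for child i and buffer j (0-based)
    gamma[j]             -- input load of buffer j
    alpha_b, beta_b      -- intrinsic and load-dependent delay of buffer b at v
    Returns (best required time at v, best assignment f as a tuple).
    """
    n, d = len(child_required), len(gamma)
    best_req, best_f = None, None
    for f in product(range(d), repeat=n):  # d**n candidate assignments
        earliest = min(child_required[i][f[i]] for i in range(n))
        load = sum(gamma[j] for j in f)
        req = earliest - beta_b * load - alpha_b
        if best_req is None or req > best_req:
            best_req, best_f = req, f
    return best_req, best_f
```

With two children and three buffer types (hypothetical values), `required_by_enumeration(1, 1, [[1, 4, 6], [2, 3, 9]], [0, 1, 2])` examines all nine assignments before settling on the largest buffer for both children.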

A Fast Buffer Assignment Algorithm

The assignment problem we are attempting to solve can be reformulated as follows:
given two $n \times d$ matrices of numbers $r_{i,j}$ and $l_{i,j}$ such that, for any given $i$, $r_{i,j}$ and
$l_{i,j}$ are monotonically non decreasing in $j$, find an assignment
$f : \{1, \ldots, n\} \to \{1, \ldots, d\}$ that maximizes the quantity:

    $\mathrm{gain}(f) = r_f - l_f$, where    (3.28)

    $r_f = \min_{1 \le i \le n} r_{i,f(i)}$    (3.29)

    $l_f = \sum_{1 \le i \le n} l_{i,f(i)}$    (3.30)

The $r_{i,j}$ represent the required times $\mathrm{required}[v_i, b_j]$ and the $l_{i,j}$ the load values
$\gamma_{b_j}$. The load values are actually independent of $i$, and, without loss of generality, we can
suppose that they are sorted in increasing order. Thus the $l_{i,j}$ are monotonically non
decreasing in $j$. If the $r_{i,j}$ were not monotonically non decreasing in $j$, there would be a pair
$(j_1, j_2)$ of indices such that, for some $i$, $j_1 < j_2$ and $r_{i,j_1} > r_{i,j_2}$. For $i$, $j_1$ would
always be a better choice than $j_2$. We can therefore replace $r_{i,j_2}$ by $r_{i,j_1}$ and $l_{i,j_2}$ by
$l_{i,j_1}$ while preserving at least one optimal assignment to the problem. We can iterate this
replacement until the monotonicity condition is satisfied by the $r_{i,j}$.
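
The replacement step above amounts to a running maximum over each row. The following Python sketch illustrates it for one row; the function name `enforce_monotonic` and the extra `choice` list (which records the buffer actually used for each column) are assumptions added for the example and do not appear in the original formulation.

```python
def enforce_monotonic(r_row, l_row):
    """Make one row of required times non decreasing in j.

    r_row[j], l_row[j] -- required time and load for buffer j (loads sorted).
    Whenever an earlier buffer gives a strictly larger required time than
    buffer j, entry j is replaced by that earlier buffer's entry.
    Returns (new r row, new l row, choice), where choice[j] is the index of
    the buffer whose values now occupy column j.
    """
    r_out, l_out, choice = [], [], []
    best = 0  # index of the best buffer seen so far
    for j in range(len(r_row)):
        if r_row[j] >= r_row[best]:
            best = j  # buffer j is at least as good: keep its own values
        r_out.append(r_row[best])
        l_out.append(l_row[best])
        choice.append(best)
    return r_out, l_out, choice
```

Because the loads are sorted and `choice` never decreases, the rewritten load row stays non decreasing as well.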
We define a distance between two assignments $f$ and $g$ as follows:

    $d(f, g) = \sum_{1 \le i \le n} |f(i) - g(i)|$    (3.31)

In particular, $d(f, g) = 1$ if and only if $f$ and $g$ differ on only one index, and the difference
of value on this index is only one.
The optimum assignment algorithm is outlined in Figure 3.15. The algorithm
computes a sequence of assignments $(f_k)_{0 \le k \le K}$, for some $K < dn$, such that
$d(f_{k+1}, f_k) = 1$, and such that there is a $k_0$, $0 \le k_0 \le K$, for which $f_{k_0}$ is optimal.
$f_0$ is initialized to be such that $f_0(i) = 1$ for all $1 \le i \le n$. Given $f_k$, the computation
of $f_{k+1}$ only takes constant time.
To compute $f_{k+1}$ from $f_k$, the algorithm finds an index $i_k$ that is critical for the current
assignment, i.e. an index that minimizes the quantity $r_{i,f_k(i)}$ for $1 \le i \le n$. $f_{k+1}$ is then
defined as being equal to $f_k$ for all indices different from $i_k$ and equal to $f_k(i_k) + 1$ on $i_k$. If
$f_k(i_k) = d$ and thus cannot be incremented, the algorithm terminates the computation of
the sequence $(f_k)$. The total number of incrementing steps cannot exceed $dn$.
To find the optimal assignment $f_{k_0}$, the algorithm works backwards, from $f_K$ to
$f_0$, in order to exploit the fact that the cost of assignment $f_k$, $\mathrm{gain}(f_k)$, can be
computed in constant time from the cost of $f_{k+1}$, but not conversely. To compute $\mathrm{gain}(f_k)$
in constant time from $\mathrm{gain}(f_{k+1})$, the algorithm keeps track of the intermediate quantities
$r_{f_k} = \min_{1 \le i \le n} r_{i,f_k(i)}$ and $l_{f_k} = \sum_{1 \le i \le n} l_{i,f_k(i)}$, and uses the fact that
$r_{f_k} = r_{i_k,f_k(i_k)}$ and $l_{f_k} = l_{f_{k+1}} - l_{i_k,f_k(i_k)+1} + l_{i_k,f_k(i_k)}$. This guarantees that
$k_0$ can be computed in $O(nd)$ time. Finally, constructing $f_{k_0}$ from $f_0$ given $k_0$ takes
$O(nd)$ time.

Theorem 3.6.1 The algorithm of Figure 3.15 is optimum.

The proof of the theorem relies on the following lemma:

Lemma 3.6.2 Let $f$ and $g$ be two assignments, such that:

    $f(i) \le g(i)$ for all $1 \le i \le n$    (3.32)

    $f(i_0) = g(i_0)$ for $i_0 = \arg\min_{1 \le i \le n} r_{i,f(i)}$    (3.33)

Then $\mathrm{gain}(f) \ge \mathrm{gain}(g)$.

algorithm optimal_buffer_assignment
    set f to be such that: f(i) = 1 for all 1 <= i <= n.
    k = 0;
    do {
        i0 = argmin_{1 <= i <= n} r[i, f(i)];
        if f(i0) = d break;
        f(i0) = f(i0) + 1; i[k] = i0; k = k + 1;
    }
    K = k; k0 = K;
    r = min_{1 <= i <= n} r[i, f(i)];  l = sum_{1 <= i <= n} l[i, f(i)];
    gain = r - l;
    for k = K - 1 to 0 {
        f(i[k]) = f(i[k]) - 1;
        r = r[i[k], f(i[k])];  l = l - l[i[k], f(i[k]) + 1] + l[i[k], f(i[k])];
        if (r - l > gain) {
            k0 = k; gain = r - l;
        }
    }
    for k = 0 to k0 - 1  f(i[k]) = f(i[k]) + 1;
end optimal_buffer_assignment

Figure 3.15: An Optimum Assignment Algorithm
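
The two phases of Figure 3.15 can be sketched in Python as follows. This is an illustrative sketch under stated assumptions rather than the implementation used in this work: the names `optimal_assignment` and `brute_force_gain` are hypothetical, buffer indices are 0-based, and the critical index is found by a linear scan for simplicity, so each forward step costs $O(n)$ here instead of the constant time assumed in the text. The brute-force helper is included only to cross-check small instances.

```python
from itertools import product


def optimal_assignment(r, l):
    """r, l: n x d lists whose rows are non decreasing. Returns (f, gain)."""
    n, d = len(r), len(r[0])
    f = [0] * n
    steps = []  # steps[k] = critical index incremented to go from f_k to f_k+1
    while True:
        i0 = min(range(n), key=lambda i: r[i][f[i]])  # critical index
        if f[i0] == d - 1:
            break  # the critical index cannot be incremented any further
        f[i0] += 1
        steps.append(i0)
    K = len(steps)
    cur_r = min(r[i][f[i]] for i in range(n))
    cur_l = sum(l[i][f[i]] for i in range(n))
    best_gain, best_k = cur_r - cur_l, K
    # Walk back from f_K to f_0, updating r and l incrementally.
    for k in range(K - 1, -1, -1):
        i = steps[k]
        f[i] -= 1
        cur_r = r[i][f[i]]  # i is critical for f_k, so this is the minimum
        cur_l += l[i][f[i]] - l[i][f[i] + 1]
        if cur_r - cur_l > best_gain:
            best_gain, best_k = cur_r - cur_l, k
    # f is f_0 again; replay the first best_k increments to rebuild f_k0.
    for k in range(best_k):
        f[steps[k]] += 1
    return f, best_gain


def brute_force_gain(r, l):
    """Reference value: maximum gain over all d**n assignments."""
    n, d = len(r), len(r[0])
    return max(
        min(r[i][a[i]] for i in range(n)) - sum(l[i][a[i]] for i in range(n))
        for a in product(range(d), repeat=n)
    )
```

On small hypothetical instances the result of `optimal_assignment` can be compared against `brute_force_gain`, which is how the sketch was checked.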

Proof  By the monotonicity hypothesis on the arrays $r_{i,j}$ and $l_{i,j}$, (3.32) implies that
$r_f \le r_g$ and $l_f \le l_g$, and (3.33) implies that $r_f = r_{i_0,f(i_0)} = r_{i_0,g(i_0)} \ge r_g$. Thus
$r_f = r_g$ and $l_f \le l_g$, which implies that $\mathrm{gain}(f) \ge \mathrm{gain}(g)$. ■

Proof [of theorem 3.6.1]

We only have to prove that one of the assignments $(f_k)_{0 \le k \le K}$ produced by the
algorithm is optimal. Let $g$ be an optimal assignment, and let
$I_k = \{ i \mid 1 \le i \le n, g(i) < f_k(i) \}$. Let $k_0$ be the largest index for which
$I_k = \emptyset$. We will use the previous lemma to prove that $\mathrm{gain}(f_{k_0}) \ge \mathrm{gain}(g)$.
Since $I_{k_0} = \emptyset$, $f_{k_0}(i) \le g(i)$ for all $1 \le i \le n$, thus condition (3.32) is satisfied. If
$k_0 = K$ we know that there is an index $i_0$ such that $f_K(i_0) = d$ and $r_{i_0,d} = r_{f_K}$, since that is
the termination condition of the first phase of the algorithm. Thus
$i_0 = \arg\min_{1 \le i \le n} r_{i,f_K(i)}$. Moreover, since $f_K(i_0) \le g(i_0)$ and $f_K(i_0) = d$, we have
$f_K(i_0) = g(i_0)$, which proves that condition (3.33) is satisfied. If $k_0 < K$, we know that
there is an index $i_0$ such that $i_0 = \arg\min_{1 \le i \le n} r_{i,f_{k_0}(i)}$ and such that
$f_{k_0+1}(i) = f_{k_0}(i)$ if $i \ne i_0$ and $f_{k_0+1}(i_0) = f_{k_0}(i_0) + 1$. Since $I_{k_0}$ is empty,
$I_{k_0+1}$ contains at most one element: $i_0$. Since $k_0$ is the largest index for which $I_k$ is
empty, we have $I_{k_0+1} \ne \emptyset$. Thus $I_{k_0+1} = \{i_0\}$. Consequently
$f_{k_0}(i_0) + 1 = f_{k_0+1}(i_0) > g(i_0) \ge f_{k_0}(i_0)$. This proves that $f_{k_0}(i_0) = g(i_0)$, and
thus that condition (3.33) of the lemma is satisfied. ■

3.6.3 Area Recovery under a Delay Constraint

If the fanout problem is given with a delay constraint, our objective is to find a
fanout tree with minimum area that meets the delay constraint. The fanout algorithms we
have presented so far can only minimize delay. We have suggested in section 3.4.6 a way
to extend the LT-tree based algorithm to minimize area under a delay constraint. Other
fanout algorithms could also be extended to support this optimization, but each algorithm
will have to be modified separately (e.g. the LT-tree algorithm is the only one to be based
on dynamic programming). In addition, it is likely to be difficult to extend the fanout
algorithms to minimize area under a delay constraint in an efficient and accurate way.
Fortunately, there are simpler ways to recover area. Though not optimal, the techniques
we present in the following paragraphs are very effective in practice and straightforward to
implement.

Selecting the Best Fanout Tree  We have at our disposal several fanout optimization
algorithms, based on inverter selection, two-level trees and LT-trees. Even if used for
minimizing delays, these algorithms are going to produce fanout trees of differing area
and delay. Our first area recovery technique simply consists in computing the fanout trees
generated by all fanout algorithms at our disposal, and selecting the minimum area tree that
meets the delay requirement. This technique is effective because the two-level algorithm
often generates trees that are fast enough and usually smaller than the trees produced by
the LT-tree algorithm. In addition, this technique also detects the case where the delay
constraint can be met with no buffering at all.
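
As a small illustration of this selection step, the sketch below picks the minimum area candidate among those that meet the delay requirement. The function name `select_best_tree`, the tuple layout and the numbers in the usage example are hypothetical; the feasibility test (required time at the root no less than the arrival time at the root) follows the formulation used later for area recovery.

```python
def select_best_tree(candidates, arrival_at_root):
    """Pick the smallest fanout tree that meets the delay requirement.

    candidates      -- list of (name, area, required_time_at_root) tuples,
                       one per fanout algorithm (including "no buffering")
    arrival_at_root -- arrival time of the signal at the root
    Returns the chosen tuple, or None if no candidate is fast enough.
    """
    feasible = [c for c in candidates if c[2] >= arrival_at_root]
    if not feasible:
        return None
    return min(feasible, key=lambda c: c[1])


# Hypothetical candidates: (algorithm, area, required time at the root).
trees = [("LT-tree", 120, 5.0), ("two-level", 90, 3.5), ("unbuffered", 60, 1.0)]
```

With these made-up numbers, an arrival time of 3.0 at the root rules out the unbuffered tree and selects the two-level tree, the smaller of the two remaining candidates.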

Partial Collapse  The second heuristic we use, which is also very effective at recovering
unnecessary area, consists in partially collapsing minimum delay fanout trees. This
technique is particularly useful on LT-trees or trees generated by combinational merging. The
algorithm performs partial collapses as follows. Given a fanout tree and a delay constraint,
i.e. an arrival time at the root of the fanout tree, the tree is visited from the root to the
sinks in order to compute the arrival times at every intermediate node. Then the tree is
visited in reverse order, from the sinks to the root. At each intermediate node $v$, the
algorithm computes the required time at $v$ that would be obtained if $v$ were connected directly
to all the sinks driven by the subtree $T_v$ rooted at $v$. If this required time is larger than the
arrival time at $v$, $T_v$ is collapsed into $v$: all buffers contained in $T_v$ are eliminated and the
sinks of $T_v$ are directly connected to $v$.
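
The partial collapse heuristic can be sketched as follows. This is a minimal Python sketch under stated assumptions, not the implementation used in this work: the `Node` class, its field names and the linear delay model (delay through a gate equals alpha plus beta times the load it drives) are hypothetical choices made for illustration, and arrival and required times are both taken at the input of each node. A single recursion computes arrival times on the way down and attempts the collapse on the way back up, which visits nodes in the same order as the two passes described above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    alpha: float = 0.0     # intrinsic delay of the gate (unused for sinks)
    beta: float = 0.0      # load-dependent delay coefficient (unused for sinks)
    load: float = 0.0      # input load presented to the driver of this node
    required: float = 0.0  # required time (meaningful for sinks only)
    children: List["Node"] = field(default_factory=list)

    @property
    def is_sink(self) -> bool:
        return not self.children


def sinks_of(v: Node) -> List[Node]:
    """All sinks driven, directly or through buffers, by the subtree at v."""
    if v.is_sink:
        return [v]
    return [s for c in v.children for s in sinks_of(c)]


def partial_collapse(v: Node, arrival: float) -> None:
    """Collapse subtrees in place; `arrival` is the arrival time at v's input."""
    if v.is_sink:
        return
    # Arrival time at the inputs of v's children on the current tree.
    out = arrival + v.alpha + v.beta * sum(c.load for c in v.children)
    for c in v.children:
        partial_collapse(c, out)  # children are processed before v
    sinks = sinks_of(v)
    # Required time at v if v drove every sink of its subtree directly.
    req = (min(s.required for s in sinks)
           - v.alpha - v.beta * sum(s.load for s in sinks))
    if req > arrival:
        v.children = sinks  # eliminate all buffers below v
```

Collapsing a subtree does not change the load that its root presents to its own driver, so the arrival times computed on the way down remain valid for the nodes above it.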

Buffer Selection  We also use a modified version of the optimal buffer selection algorithm
of section 3.6.2. This algorithm is not optimal: it does not find the buffer assignment that
would minimize area under a delay constraint, but it is fast, simple and easy to implement.
The modification is done as follows: given an arrival time at the root of the tree, for each
intermediate node $v$ and for each buffer type $b$, we want to compute an achievable arrival
time at $v$ if node $v$ is assigned a buffer of type $b$. This arrival time is taken to be the
arrival time obtained on the fanout tree after the buffer at node $v$ has been replaced by
$b$, without changing the rest of the tree; it is clearly an achievable, but not necessarily an
optimal, arrival time at node $v$ with buffer $b$. The suboptimality comes from the fact that
we do not know what the optimal buffer assignment is for the siblings of $v$ given that $v$ is
assigned $b$, and it would be too time consuming to compute it for all values of $b$. Given
this achievable arrival time at $v$ for a given choice of buffer $b$, to perform buffer selection
we use the algorithm of section 3.6.2, modified to select the minimum index $k$ for which $f_k$
guarantees a nonnegative slack at $v$. This computation is done for each value of $b$ and the
result is stored at node $v$. The buffer selection for node $v$ is done when the parent node of
$v$ is visited.

3.7 Global Fanout Optimization

We have only discussed so far algorithms to optimize a given fanout problem. It
is time to introduce a technique that can be used to apply a fanout algorithm to an entire
circuit. In section 3.7.1 we present such a technique and show that it is optimal with respect
to delay minimization. In section 3.7.2 we extend this technique to recover area under a
delay constraint. For area minimization under a delay constraint, this technique is not
optimal but is very efficient in practice.

algorithm one_pass_global_fanout_optimization
    foreach node v visited in topological order from outputs to inputs {
        if v is the root of a tree {
            apply fanout optimization to the fanout problem rooted at v
        } else {
            propagate the required time at the output of v to the inputs of v
        }
    }
end one_pass_global_fanout_optimization

Figure 3.16: A One-Pass Optimal Global Fanout Algorithm

3.7.1 A One Pass Approach

To apply a fanout algorithm to an entire circuit, we use a simple procedure
introduced by Hoover et al. [22]. This procedure, described in Figure 3.16 and illustrated in
Figure 3.17, consists in visiting the nodes of a circuit in topological order, and at each node
$v$ doing the following:

• if $v$ is the root of a tree, $v$ is also the root of a fanout problem. In that case, we
replace the existing fanout tree rooted at $v$ by the result of the fanout optimization
algorithm applied to the fanout problem rooted at $v$.

• otherwise, we simply propagate the required time at the output of $v$ to the inputs
of $v$. More specifically, if $(v_1, \ldots, v_n)$ are the fanouts of $v$, the required time $r$ at the
output of $v$ is the minimum of the required times of $(v_1, \ldots, v_n)$; the load $\gamma$ at the
output of $v$ is the sum of the loads of $(v_1, \ldots, v_n)$. The required time $r_i$ at input pin
$i$ of node $v$ is then given by the following equation: $r_i = r - \alpha_i - \beta_i \gamma$, where $\alpha_i$ is the
intrinsic delay and $\beta_i$ the load dependent delay of the gate at node $v$ for input pin $i$.
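
The propagation step of the second case can be written in a few lines. The Python sketch below is illustrative only; the function name `propagate_required` and the tuple-based representation of fanouts and input pins are assumptions made for the example.

```python
def propagate_required(fanouts, pins):
    """Propagate required times from the output of a gate to its input pins.

    fanouts -- list of (required_time, load) pairs, one per fanout of v
    pins    -- list of (alpha_i, beta_i) pairs, one per input pin of v
    Returns the list of required times r_i = r - alpha_i - beta_i * gamma.
    """
    r = min(req for req, _ in fanouts)        # required time at the output
    gamma = sum(load for _, load in fanouts)  # total load at the output
    return [r - alpha - beta * gamma for alpha, beta in pins]
```

For example, with two fanouts `(10, 1)` and `(8, 2)`, the output required time is 8 and the output load is 3, so an input pin with intrinsic delay 1 and load-dependent delay 0.5 gets a required time of 8 - 1 - 1.5 = 5.5.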

Figure 3.17: Applying Fanout Algorithms in One Pass From Outputs to Inputs
(figure labels: primary outputs, fanout optimization, primary inputs)

The algorithm of Figure 3.16 is optimal in the following sense. Let $N$ be a
combinational network, $L = (v_1, \ldots, v_m)$ be an arbitrary list of possibly repeated tree roots of
$N$ and $F$ a fanout algorithm. The pair $(F, L)$ defines a global fanout algorithm as follows:
first, compute the required times at all nodes in $N$. Then, for $k = 1$ to $k = m$, apply the
algorithm $F$ to the fanout problem rooted at $v_k$ and recompute the required times at all
nodes in $N$ wherever necessary. We define the cost of $(F, L)$ as an $n$-dimensional vector:

    $\mathrm{cost}(F, L) = \{ r(p_i), 1 \le i \le n \}$    (3.34)

where the nodes $p_i$, $1 \le i \le n$, are the primary inputs of $N$, and $r(v)$ designates the required
time at node $v$. We say that the $n$-dimensional vectors $x$ and $y$ satisfy the inequality $x \le y$
if and only if $x_i \le y_i$ for all $1 \le i \le n$.

Theorem 3.7.1 Let $F$ be a fanout algorithm with the following two properties:

1. it never replaces a fanout tree by one with worse delay.

2. the fanout trees it produces are such that the required times at the root of the trees are
a non decreasing function of the sink required times.

Let $L_0 = (v_1, \ldots, v_m)$ be a topological order of the tree roots of $N$ starting from primary
outputs, and let $L$ be an arbitrary list of tree roots. Then $\mathrm{cost}(F, L_0) \ge \mathrm{cost}(F, L)$. Moreover
if $F$ is an optimal fanout algorithm, the result of $(F, L_0)$ is optimal with respect to fanout
optimization in the sense that no modification of the individual fanout trees can reduce the
delay through the network.

Proof  We will prove by induction that the required time $r_{F,L_0}(v_k)$ obtained at $v_k$ with
$(F, L_0)$ is no less than the required time $r_{F,L}(v_k)$ obtained at $v_k$ with $(F, L)$. Let $k = 1$.
Since the nodes in $L_0$ are sorted in topological order, there is no tree root in the transitive
fanout of $v_1$. The required times at the sinks of the fanout problem rooted at $v_1$ are entirely
determined by the boundary conditions on the network. Thus $r(v_1)$, after application of
$(F, L)$ or $(F, L_0)$, can only take two values: the value $r_0(v_1)$ of the required time at $v_1$
before $F$ is applied to the fanout problem rooted at $v_1$, and the value $r_1(v_1)$ after $F$ is applied.
Property (1) guarantees that $r_1(v_1) \ge r_0(v_1)$. Since $r_{F,L_0}(v_1) = r_1(v_1)$ we have
$r_{F,L_0}(v_1) \ge r_{F,L}(v_1)$. Now let us suppose that we have proved that
$r_{F,L_0}(v_i) \ge r_{F,L}(v_i)$ for all $1 \le i \le k$. Since $F$ can only increase the required time at a
node, $r_{F,L_0}(v_i)$ is the largest required time at node $v_i$, $1 \le i \le k$, observed at any
intermediate step of computation of $(F, L)$ or $(F, L_0)$. Since the required times at the sinks of
node $v_{k+1}$ are monotonic non decreasing functions of the required times $r(v_i)$ for
$1 \le i \le k$, and do not depend on the required times at nodes $v_i$ for $i > k + 1$, the fanout
problem rooted at node $v_{k+1}$ is given with the largest required times observed during the
execution of $(F, L)$ or $(F, L_0)$ when $r(v_i) = r_{F,L_0}(v_i)$ for $1 \le i \le k$. From this fact and
property (2), we deduce that $r_{F,L_0}(v_{k+1}) \ge r_{F,L}(v_{k+1})$, which proves the first part of
the theorem.
The second part of the theorem is proved in a similar fashion. Let $T$ be a circuit,
$T_{opt}$ a version of $T$ that is optimal with respect to fanout optimization, and let $T_F$ be the
result of applying to $T$ an optimal fanout algorithm $F$ to all tree roots in topological order.
Since only fanout optimization has been performed in both cases, $T_{opt}$ and $T_F$ have the
same tree roots. We will prove by induction that $r_{T_{opt}}(v_k) \le r_{T_F}(v_k)$ for all
$1 \le k \le m$, where $(v_1, \ldots, v_m)$ is a topological ordering of the tree roots. The result is true
for $k = 1$, since the required times at the sinks of $v_1$ are the same in both trees, and $F$ is an
optimal fanout algorithm. Let us suppose that the result holds for $1 \le i \le k$. By induction
hypothesis, the required times at the sinks of tree root $v_{k+1}$ are no worse in $T_F$ than in
$T_{opt}$. Since $F$ is optimal, $r_{T_{opt}}(v_{k+1}) \le r_{T_F}(v_{k+1})$. ■
The importance of theorem 3.7.1 is to prove that any lack of optimality in fanout
optimization is due to the lack of optimality of the fanout optimization algorithms, not to
the procedure we use to apply the fanout algorithms to the entire circuit. In particular,
there is no need to develop fast, incremental techniques to extract critical path information
that would be used to guide the fanout optimization. This simple algorithm based on one
topological traversal does as well in terms of delay as any other more complicated technique.

Duality with Tree Covering  An interesting question to ask at this point is why a
similar algorithm of applying tree covering in topological order, this time from inputs to
outputs, is not optimal for tree covering. The reason is that tree covers for trees that are
on disjoint paths from primary inputs to primary outputs do interact, which is not the case
for fanout trees. For example, in Figure 3.17, the choice of a cover for tree A influences the
required times at the sinks of fanout tree S, and thus the delay through fanout tree S, and
thus the arrival times at tree B. There is no such coupling between the solutions of fanout
problems that are not on a common path from inputs to outputs. This is why theorem 3.7.1
holds for fanout optimization and the equivalent theorem does not hold for tree covering.

3.7.2 Global Area Recovery under a Delay Constraint

The main weakness of the optimal procedure described in the previous section is
that it can be too wasteful in area. It optimizes all tree roots for minimum delay, with no
consideration for the necessity of such an optimization. We now introduce a technique to
recover area at no cost in delay. This technique works with arbitrary required times at the
primary outputs and arbitrary arrival times at the primary inputs of a network.
To recover area at no delay cost after fanout optimization, we proceed as follows.
We first save at each tree root the arrival time achievable after fanout optimization. We
then reapply the fanout optimizer to each fanout problem, visited in topological order. This
time we call the fanout optimizer to minimize fanout tree area under the constraint that
the required time at the root of the tree is no less than the arrival time at the root of the
tree. To perform this optimization, we use the techniques described in section 3.6.3.
With this simple algorithm, we can recover, at no delay cost, most of the area
wasted by the first phase of fanout optimization. Unfortunately this algorithm is not
optimal, for two reasons: first because it relies on a fanout algorithm that itself is not optimal;
second because by visiting nodes in topological order, it uses the slack available on any given
path as early as possible. A more balanced use of the slack can lead, at least in some
cases, to a smaller circuit with the same delay. Despite these limitations, this technique is
very effective at recovering area, as can be seen from the results of section 3.8.3.

3.8 Experimental Results

To provide some experimental evidence of the efficiency of fanout optimization, we
gathered a set of 25 benchmark circuits from several origins. These circuits are relatively
large, ranging from 119 gates to 2557 gates, and are described in more detail in section 3.8.1.
We present overall performance results in section 3.8.2 and a more detailed analysis of the
effect of various optimizations in section 3.8.3.

3.8.1 Circuit Descriptions

Our set of benchmark circuits comes from four sources: MCNC, ISCAS, Intel and
AT&T. The MCNC benchmarks were put together during the International Logic Synthesis
Workshop 1989 [32] and are publicly available from MCNC. They are themselves from
several origins, though complete information was not always available on each of them. The
ISCAS benchmarks were originally testing benchmarks. The Intel and AT&T circuits come
from these companies. No details concerning their functionality were provided. Table 3.2
contains some general information on the 25 benchmark circuits, including the number
of primary inputs and primary outputs, the number of literals in factored form, and the
number of gates needed to implement the circuits when technology mapped for minimum
area using the MCNC library lib2. We also indicate briefly the function of the circuit if
known. If it is not known, we simply use the word logic to characterize the circuit.

3.8.2 Overall Performance of Fanout Optimization

We measured the effect of applying fanout optimization to our set of benchmark


circuits after the circuits were mapped for minimum area. The effect of combining minimum
delay tree covering with fanout optimization will be discussed in the next chapter.

circuit    circuit function   origin # PIs # POs # lits # gates

C1355      error correcting   ISCAS     41    32   1064     510
C1908      error correcting   ISCAS     33    25   1497     349
C2670      ALU and control    ISCAS    233   140   2075     505
C3540      ALU and control    ISCAS     50    22   2936     740
C5315      ALU and selector   ISCAS    178   123   4386    1080
C6288      16-bit multiplier  ISCAS     32    32   4800    2371
C7552      ALU and control    ISCAS    207   108   6144    1688
alu4       logic              MCNC      14     8   1278     418
ampbpreg   logic              AT&T     117    88   3318    1137
ampbsm     logic              AT&T      75    66   2578     795
amppint2   logic              AT&T      85    66   3372     513
ampxhdl    logic              AT&T      62    40   3742     365
apex6      logic              MCNC     135    99    904     438
des        data encryption    MCNC     256   245   6346    2557
dflgrcbl   logic              Intel    108    65    623     179
fconrcbl   logic              Intel     62    35    459     129
frg2       logic              MCNC     143   139   2014     727
k2         logic              MCNC      45    45   2930    1172
kcctlcb3   logic              Intel     81    44    415     137
pair       logic              MCNC     173   137   2426     944
rot        logic              MCNC     135   107    869     403
sbiucbl    logic              Intel     40    35    591     144
tfaultcbl  logic              Intel     77    35    659     119
vda        logic              MCNC      17    39   1423     586
x3         logic              MCNC     135    99   1345     486

Table 3.2: General Information on the Benchmark Set

circuit function: simple description of the logic function of the circuit, if available
origin: origin of the circuit
# PIs: number of primary inputs
# POs: number of primary outputs
# lits: number of literals in factored form as computed by misII
# gates: number of gates to implement the circuit using the misII technology
mapper in minimum area mode with the MCNC library lib2

               min area       fanout opt         gain
circuit        area   delay    area   delay    area   delay
C1355           990   27.16    1119   24.25    1.13    0.89
C1908          1086   35.04    1236   29.55    1.14    0.84
C2670          1420   28.42    1478   22.14    1.04    0.78
C3540          2201   45.24    2421   34.27    1.10    0.76
C5315          3171   39.94    3352   30.78    1.06    0.77
C6288          6777  120.37    7340  101.08    1.08    0.84
C7552          4660   69.47    5084   28.55    1.09    0.41
alu4           1486   47.17    1724   31.84    1.16    0.68
ampbpreg       2741   59.85    2891   39.40    1.05    0.66
ampbsm         1478   25.78    1615   18.05    1.09    0.70
amppint2       1021   22.45    1136   12.31    1.11    0.55
ampxhdl         751   24.73     865   13.38    1.15    0.54
apex6          1505   17.74    1565   13.40    1.04    0.76
des            6452  107.12    7358   17.84    1.14    0.17
dflgrcbl        615   12.83     630   11.01    1.02    0.86
fconrcbl        467   15.18     481   13.82    1.03    0.91
frg2           1738   37.91    1893   14.95    1.09    0.39
k2             1972   32.80    2189   20.56    1.11    0.63
kcctlcb3        449   11.50     469   10.20    1.04    0.89
pair           3200   25.85    3482   17.80    1.09    0.69
rot            1387   30.18    1444   21.49    1.04    0.71
sbiucbl         471   22.18     514   18.89    1.09    0.85
tfaultcbl       375    9.91     400    8.41    1.07    0.85
vda            1118   23.76    1324   16.75    1.18    0.70
x3             1653   22.40    1754   11.45    1.06    0.51
aver              -       -       -       -    1.09    0.66

Table 3.3: Effect of Fanout Optimization on Circuits Optimized by misII

min area: minimum area technology mapping with MCNC library lib2
fanout opt: minimum area technology mapping followed by fanout optimization
gain: gain in area or in delay obtained by using fanout optimization
area: area of the circuit (MCNC lib2 data divided by common divisor 464)
delay: delay of the circuit (MCNC lib2 data in nanoseconds)
aver: geometric average of the gains
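
As the table values show, each gain is the ratio of the value after fanout optimization to the value before it (for C1355, 1119 / 990 ≈ 1.13 in area and 24.25 / 27.16 ≈ 0.89 in delay). The Python sketch below shows how such gains and their geometric average can be computed; the function names are hypothetical and the snippet is an illustration of the definitions in the legend, not the script used to produce the table.

```python
import math


def gain(after, before):
    """Ratio of a quantity after fanout optimization to its value before."""
    return after / before


def geometric_mean(values):
    """Geometric average, computed through logarithms for numerical stability."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Two rows of Table 3.3: (min-area area, min-area delay, fanout-opt area, fanout-opt delay)
rows = {"C1355": (990, 27.16, 1119, 24.25), "des": (6452, 107.12, 7358, 17.84)}
area_gains = {name: round(gain(a2, a1), 2) for name, (a1, _, a2, _) in rows.items()}
delay_gains = {name: round(gain(d2, d1), 2) for name, (_, d1, _, d2) in rows.items()}
```

A geometric rather than arithmetic average is the natural choice for ratios, since a gain of 2 and a gain of 0.5 then average to 1.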

Effect on Optimized Circuits  Before technology mapping, we optimized each circuit
using misII technology independent simplification and factorization algorithms, as is
normally done in logic synthesis. We then compared the area and delay of the circuits after
technology mapping for minimum area, with and without fanout optimization. The results
are reported in Table 3.3. We can observe a wide variation in delay reductions, ranging
from 9% to a factor of 6, while the area increases ranged from 3% to 18%. On average,
fanout optimization reduces delay by 34% for an area increase of 9%.

Effect on Unoptimized Circuits  Logic simplification is usually beneficial both in terms
of area and delay, while logic factorization can decrease circuit area by sharing common
subexpressions at the cost of larger fanouts and extra levels of logic. Because we ran our
experiments on circuits that were optimized by misII, there is the possibility that fanout
optimization was effective on these circuits simply because it corrects the fanouts introduced
by factorization. To check for this possibility, we ran the same experiments on the same
circuits, but this time before the circuits were optimized by misII. The results are reported
in Table 3.4. Though for some circuits, such as C1355 and C6288, fanout optimization was
more effective on optimized circuits, on average fanout optimization was more effective on
unoptimized circuits, reducing delay by 44% for an area increase of 10% instead of 34%
and 9% respectively. From this experiment we can conclude that factorization is not the
main factor contributing to large fanouts. Actually, technology independent optimization
can have the overall effect of reducing the impact of fanout optimization.

3.8.3 Detailed Performance Analysis

In this section, we perform a more detailed analysis of our fanout optimization
algorithms. We first describe the effect of inverter sizing, then the effect of buffering with
no critical signal isolation, the effect of peephole optimization and the effect of area
recovery. Finally we compare our fanout optimization algorithm to Singh's algorithm [40].
All experiments use the optimized version of our benchmark circuits and minimum area
technology mapping with the MCNC library lib2.

The Effect of Inverter Optimization  Inverter optimization is a simple optimization
that consists in solving a fanout problem by introducing one inverter if there is a sink that
needs the signal in a different polarity than provided by the source, and sizing this inverter

circuit   min area      fanout opt    gain
          area delay    area delay    area delay
C1355 1662 30.59 1757 27.65 1.06 0.90
C1908 1405 35.76 1633 27.10 1.16 0.76
C2670 1928 43.78 2052 25.19 1.06 0.58
C3540 2719 51.69 3106 36.85 1.14 0.71
C5315 4369 40.60 4587 32.05 1.05 0.79
C6288 7094 123.87 7801 113.28 1.10 0.91
C7552 6386 72.45 6816 28.16 1.07 0.39
alu4 1543 47.17 1807 30.79 1.17 0.65
ampbpreg 3901 101.72 4112 47.11 1.05 0.46
ampbsm 3017 43.40 3504 20.29 1.16 0.47
amppint2 2232 31.23 2562 10.86 1.15 0.35
ampxhdl 1471 32.95 1698 13.02 1.15 0.40
apex6 1548 15.55 1607 12.95 1.04 0.83
des 11023 114.56 12288 16.70 1.11 0.15
dflgrcbl 604 13.05 634 9.89 1.05 0.76
fconrcbl 467 15.18 481 13.82 1.03 0.91
frg2 2949 52.10 3246 12.27 1.10 0.24
k2 5055 48.25 6264 14.28 1.24 0.30
kcctlcb3 457 11.50 477 10.20 1.04 0.89
pair 3636 30.93 3982 17.81 1.10 0.58
rot 1438 30.17 1495 21.90 1.04 0.73
sbiucbl 505 22.46 576 18.90 1.14 0.84
tfaultcbl 367 9.94 394 8.41 1.07 0.85
vda 2474 29.79 3185 11.95 1.29 0.40
x3 2056 20.76 2138 11.48 1.04 0.55
aver - - - - 1.10 0.56

Table 3.4: Effect of Fanout Optimization on Unoptimized Circuits

min area: minimum area technology mapping with MCNC library lib2
fanout opt: minimum area technology mapping followed by fanout optimization
gain: gain in area or in delay obtained by using fanout optimization
area: area of the circuit (MCNC lib2 data divided by common divisor 464)
delay: delay of the circuit (MCNC lib2 data in nanoseconds)
aver: geometric average of the gains


circuit   min area      inv opt       gain
          area delay    area delay    area delay
C1355 990.00 27.16 992.00 26.87 1.00 0.99
C1908 1086.00 35.04 1090.00 34.51 1.00 0.98
C2670 1420.00 28.42 1422.00 27.30 1.00 0.96
C3540 2201.00 45.24 2205.00 42.81 1.00 0.95
C5315 3171.00 39.94 3177.00 38.45 1.00 0.96
C6288 6777.00 120.37 6780.00 118.02 1.00 0.98
C7552 4660.00 69.47 4661.00 62.83 1.00 0.90
alu4 1486.00 47.17 1491.00 44.19 1.00 0.94
ampbpreg 2741.00 59.85 2743.00 58.88 1.00 0.98
ampbsm 1478.00 25.78 1482.00 23.69 1.00 0.92
amppint2 1021.00 22.45 1023.00 21.69 1.00 0.97
ampxhdl 751.00 24.73 754.00 23.59 1.00 0.95
apex6 1505.00 17.74 1507.00 15.79 1.00 0.89
des 6452.00 107.12 6453.00 94.97 1.00 0.89
dflgrcbl 615.00 12.83 616.00 12.15 1.00 0.95
fconrcbl 467.00 15.18 467.00 14.79 1.00 0.97
frg2 1738.00 37.91 1741.00 36.83 1.00 0.97
k2 1972.00 32.80 1973.00 31.88 1.00 0.97
kcctlcb3 449.00 11.50 449.00 11.04 1.00 0.96
pair 3200.00 25.85 3207.00 23.58 1.00 0.91
rot 1387.00 30.18 1390.00 29.66 1.00 0.98
sbiucbl 471.00 22.18 471.00 21.99 1.00 0.99
tfaultcbl 375.00 9.91 376.00 9.70 1.00 0.98
vda 1118.00 23.76 1121.00 21.86 1.00 0.92
x3 1653.00 22.40 1654.00 22.02 1.00 0.98
aver - - - - 1.00 0.95

Table 3.5: Effect of Inverter Optimization on Circuits Optimized by misII

min area: minimum area technology mapping with MCNC library lib2
inv opt: minimum area technology mapping followed by inverter optimization
gain: gain in area or in delay obtained by using inverter optimization
area: area of the circuit (MCNC lib2 data divided by common divisor 464)
delay: delay of the circuit (MCNC lib2 data in nanoseconds)
aver: geometric average of the gains


optimally for delay. In the area recovery phase, if the inverter is not critical for delay,
it is replaced by a smaller inverter. The effect of this simple optimization is reported in
Table 3.5. Inverter optimization achieves a 5% decrease in delay at negligible cost in
area, which is about 15% of the total delay decrease obtainable with fanout
optimization.

Fanout Optimization with No Critical Signal Isolation. The two main fanout
optimizations combined in our algorithms are buffering and critical signal isolation. To
determine the relative importance of these two optimizations, we measured the effect of
fanout optimization limited to buffering. For the buffering algorithm, we use the algorithm
of section 3.4.3, which restricts itself to two-level fanout trees. This algorithm ignores
required times but takes load values into account. The results obtained with this algorithm
are reported in Table 3.6. Using this simple algorithm, we obtained a delay reduction of
23% for a cost in area of only 3% on average. Buffering alone accounts for 68% of the total
delay reduction we can obtain with fanout optimization, for only a third of the cost in area.
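
To make the idea of a two-level fanout tree concrete, here is a minimal sketch. It is not
the algorithm of section 3.4.3, which is not reproduced here; it is a simplified stand-in
that assumes a linear load-dependent delay model (intrinsic delay plus drive times load),
a single buffer type, identical sinks, and made-up parameter values. Like the algorithm
described above, it ignores required times and only looks at loads.

```python
import math

def gate_delay(intrinsic, drive, load):
    """Linear load-dependent delay: intrinsic delay plus drive resistance times load."""
    return intrinsic + drive * load

def two_level_delay(n_sinks, k, sink_load, lib):
    """Source-to-sink delay when k identical buffers each drive about n_sinks / k sinks."""
    sinks_per_buffer = math.ceil(n_sinks / k)
    source_stage = gate_delay(lib["src_intrinsic"], lib["src_drive"], k * lib["buf_input_load"])
    buffer_stage = gate_delay(lib["buf_intrinsic"], lib["buf_drive"], sinks_per_buffer * sink_load)
    return source_stage + buffer_stage

def best_two_level_tree(n_sinks, sink_load, lib):
    """Try every buffer count and keep the fastest; k = 0 means the sinks are driven directly."""
    direct = gate_delay(lib["src_intrinsic"], lib["src_drive"], n_sinks * sink_load)
    best_k, best_delay = 0, direct
    for k in range(1, n_sinks + 1):
        delay = two_level_delay(n_sinks, k, sink_load, lib)
        if delay < best_delay:
            best_k, best_delay = k, delay
    return best_k, best_delay, direct

# Hypothetical library parameters (not lib2 data).
toy_lib = {
    "src_intrinsic": 1.0, "src_drive": 0.5,
    "buf_intrinsic": 1.0, "buf_drive": 0.2, "buf_input_load": 1.0,
}
best_k, best_delay, direct_delay = best_two_level_tree(n_sinks=16, sink_load=1.0, lib=toy_lib)
print(f"direct: {direct_delay:.2f}, best two-level tree: {best_delay:.2f} with {best_k} buffers")
```

The sketch shows the basic trade-off: more buffers lighten the load on each buffer but
increase the load on the source, so the best buffer count lies somewhere in between.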

A Lower Bound on the Effect of Fanout Optimization. Fanout optimization can
reduce delay only at multiple fanout points. Using critical signal isolation, it is possible
to deliver a signal at a multiple fanout point with no overhead, provided that the other
signals are required sufficiently late, which is not always the case. Thus a lower bound on
the effect of fanout optimization can be determined by computing the arrival times of all
signals of a network, ignoring the effect of multiple fanouts. To perform this computation at
a multiple fanout point, we simply replace the sum of the sink loads by their average. The
results of this computation are reported in Table 3.7 and compared with the delay values
obtained with fanout optimization. Only delay values are reported. The data indicate that
on average, fanout optimization operates within 12% of this lower bound. This result also
provides a lower bound on the effect of source duplication, since source duplication can
only decrease the delay through fanout nodes. Source duplication can still be a helpful
technique, but its overall effect can only be secondary in comparison to the effect of fanout
optimization.
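
The load substitution used for this lower bound can be sketched as follows. The linear
delay model and all numeric values are assumptions made for illustration; only the rule
"replace the sum of the sink loads by their average" comes from the text.

```python
def driver_delay(intrinsic, drive, load):
    """Assumed linear delay model: intrinsic delay plus drive resistance times load."""
    return intrinsic + drive * load

def actual_load(sink_loads):
    """Load seen by a source that drives all of its sinks directly."""
    return sum(sink_loads)

def lower_bound_load(sink_loads):
    """Load used for the bound: the average sink load, as if there were a single sink."""
    return sum(sink_loads) / len(sink_loads)

# Hypothetical multiple fanout point with four sinks.
sink_loads = [1.0, 1.0, 2.0, 4.0]
intrinsic, drive = 1.0, 0.5

delay_with_fanout = driver_delay(intrinsic, drive, actual_load(sink_loads))
delay_lower_bound = driver_delay(intrinsic, drive, lower_bound_load(sink_loads))
print(delay_with_fanout, delay_lower_bound)
```

Propagating arrival times through the whole network with lower_bound_load in place of
actual_load gives delays of the kind reported in the "lower bound" column of Table 3.7.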

Peephole Optimization. We presented in section 3.6 several algorithms to improve the
quality of a fanout tree after it has been built. To determine the effect of these peephole


circuit   min area      buffering     gain
          area delay    area delay    area delay
C1355 990 27.16 1054 26.29 1.06 0.97
C1908 1086 35.04 1119 33.34 1.03 0.95
C2670 1420 28.42 1441 25.15 1.01 0.88
C3540 2201 45.24 2247 39.60 1.02 0.88
C5315 3171 39.94 3220 36.26 1.02 0.91
C6288 6777 120.37 6870 111.65 1.01 0.93
C7552 4660 69.47 4758 35.64 1.02 0.51
alu4 1486 47.17 1565 40.53 1.05 0.86
ampbpreg 2741 59.85 2815 49.92 1.03 0.83
ampbsm 1478 25.78 1524 20.62 1.03 0.80
amppint2 1021 22.45 1052 15.62 1.03 0.70
ampxhdl 751 24.73 806 15.33 1.07 0.62
apex6 1505 17.74 1519 15.72 1.01 0.89
des 6452 107.12 6818 21.27 1.06 0.20
dflgrcbl 615 12.83 621 11.50 1.01 0.90
fconrcbl 467 15.18 477 14.52 1.02 0.96
frg2 1738 37.91 1834 21.05 1.06 0.56
k2 1972 32.80 2017 26.75 1.02 0.82
kcctlcb3 449 11.50 449 11.04 1.00 0.96
pair 3200 25.85 3341 20.48 1.04 0.79
rot 1387 30.18 1427 24.98 1.03 0.83
sbiucbl 471 22.18 482 21.31 1.02 0.96
tfaultcbl 375 9.91 381 9.33 1.02 0.94
vda 1118 23.76 1159 19.59 1.04 0.82
x3 1653 22.40 1690 13.23 1.02 0.59
aver 1.00 1.00 1.03 0.77 1.03 0.77

Table 3.6: Effect of Buffering on Circuits Optimized by misII

min area: minimum area technology mapping with MCNC library lib2
buffering: minimum area technology mapping followed by buffering alone
gain: gain in area or in delay obtained by using buffering
area: area of the circuit (MCNC lib2 data divided by common divisor 464)
delay: delay of the circuit (MCNC lib2 data in nanoseconds)
aver: geometric average of the gains


circuit   fanout opt   lower bound   ratio
          delay        delay         delay
C1355 24.25 20.53 0.85
C1908 29.55 24.56 0.83
C2670 22.14 19.74 0.89
C3540 34.27 28.86 0.84
C5315 30.78 27.01 0.88
C6288 101.08 83.74 0.83
C7552 28.55 24.48 0.86
alu4 31.84 27.08 0.85
ampbpreg 39.40 34.24 0.87
ampbsm 18.05 15.97 0.88
amppint2 12.31 11.00 0.89
ampxhdl 13.38 11.43 0.85
apex6 13.40 12.55 0.94
des 17.84 15.71 0.88
dflgrcbl 11.01 9.71 0.88
fconrcbl 13.82 13.54 0.98
frg2 14.95 12.06 0.81
k2 20.56 18.46 0.90
kcctlcb3 10.20 8.75 0.86
pair 17.80 16.00 0.90
rot 21.49 18.22 0.85
sbiucbl 18.89 17.77 0.94
tfaultcbl 8.41 7.75 0.92
vda 16.75 15.45 0.92
x3 11.45 10.40 0.91
aver - - 0.88

Table 3.7: A Lower Bound on the Effect of Fanout Optimization

fanout opt: minimum area technology mapping with fanout optimization
lower bound: minimum area technology mapping ignoring fanout to compute delay
ratio: ratio between lower bound and delay realized with fanout optimization
delay: delay of the circuit (MCNC lib2 data in nanoseconds)
aver: geometric average of the gains


circuit   min area      no peephole   gain
          area delay    area delay    area delay
C1355 990 27.16 1209 24.25 1.22 0.89
C1908 1086 35.04 1324 29.64 1.22 0.85
C2670 1420 28.42 1500 22.14 1.06 0.78
C3540 2201 45.24 2587 33.97 1.18 0.75
C5315 3171 39.94 3565 30.78 1.12 0.77
C6288 6777 120.37 7512 101.72 1.11 0.85
C7552 4660 69.47 5388 29.13 1.16 0.42
alu4 1486 47.17 2010 31.90 1.35 0.68
ampbpreg 2741 59.85 3003 39.44 1.10 0.66
ampbsm 1478 25.78 1694 18.18 1.15 0.71
amppint2 1021 22.45 1233 12.31 1.21 0.55
ampxhdl 751 24.73 921 13.53 1.23 0.55
apex6 1505 17.74 1610 13.44 1.07 0.76
des 6452 107.12 8013 17.84 1.24 0.17
dflgrcbl 615 12.83 647 11.01 1.05 0.86
fconrcbl 467 15.18 489 13.82 1.05 0.91
frg2 1738 37.91 1973 15.03 1.14 0.40
k2 1972 32.80 2325 20.56 1.18 0.63
kcctlcb3 449 11.50 479 10.20 1.07 0.89
pair 3200 25.85 3638 17.86 1.14 0.69
rot 1387 30.18 1484 21.47 1.07 0.71
sbiucbl 471 22.18 543 18.98 1.15 0.86
tfaultcbl 375 9.91 405 8.41 1.08 0.85
vda 1118 23.76 1466 16.77 1.31 0.71
x3 1653 22.40 1815 11.51 1.10 0.51
aver - - - - 1.15 0.66

Table 3.8: Effect of Fanout Optimization without Peephole Optimization

min area: minimum area technology mapping with MCNC library lib2
no peephole: minimum area technology mapping followed by fanout optimization
with no peephole optimization
gain: gain in area or in delay obtained when using no peephole optimization
area: area of the circuit (MCNC lib2 data divided by common divisor 464)
delay: delay of the circuit (MCNC lib2 data in nanoseconds)
aver: geometric average of the gains


optimizations, we ran the fanout optimizer after having deactivated them. The results of
this experiment are reported in Table 3.8. In terms of delay, the impact of peephole
optimization is negligible. This was to be expected, since our most powerful fanout optimization
algorithm does buffer selection optimally. However, in terms of area, peephole
optimization reduces the cost of fanout optimization from 15% down to 9%, which is a valuable
contribution to the overall performance of the fanout optimizer.

Effect of Area Recovery. After using fanout optimization for delay, we use a second
pass on the network to recover area in fanout trees that are not critical for delay. The
overall result is an average increase in area of only 9%, but it is not clear how much of this
limited increase in area is due to area recovery and how much to the fact that fanout
optimization for delay itself does not increase area significantly. To evaluate the effect of
area recovery, we ran the fanout optimizer without area recovery. The results are reported in
Table 3.9. Area recovery reduces the average cost in area of fanout optimization
dramatically, from 51% to 9%. In addition, as predicted, area recovery does not increase
circuit delay.

Comparison with Singh’s Algorithm. We compared the results obtained with our
fanout optimizer to the results obtained with Singh’s algorithm. To perform a fair
comparison, we interfaced Singh’s algorithm to our fanout optimizer, in such a way that it is
called in the same order and on the same data as our fanout algorithms. The results
are reported in Table 3.10. On most examples, the results are quite similar. Our fanout
algorithm does consistently better than his in terms of delay, with an average reduction of
12%; but it does consistently worse than his in terms of area, with an average increase of
4%.

3.9 Conclusion

Fanout optimization is an important delay optimization technique that can often
reduce delay quite dramatically for a very moderate cost in area. It is an essential component
of any logic synthesis system that claims to optimize delay. It is an important optimization
for other reasons as well. In particular it can be directly adapted to make sure that fanout
constraints imposed by a technology are satisfied.
circuit   min area      no area recov   gain
          area delay    area delay      area delay
C1355 990 27.16 1362 24.25 1.38 0.89
C1908 1086 35.04 1647 29.59 1.52 0.84
C2670 1420 28.42 1995 22.08 1.40 0.78
C3540 2201 45.24 3572 34.27 1.62 0.76
C5315 3171 39.94 4929 30.78 1.55 0.77
C6288 6777 120.37 10887 101.09 1.61 0.84
C7552 4660 69.47 6959 28.68 1.49 0.41
alu4 1486 47.17 2573 31.85 1.73 0.68
ampbpreg 2741 59.85 4209 39.46 1.54 0.66
ampbsm 1478 25.78 2219 18.05 1.50 0.70
amppint2 1021 22.45 1483 12.31 1.45 0.55
ampxhdl 751 24.73 1116 13.45 1.49 0.54
apex6 1505 17.74 2347 13.40 1.56 0.76
des 6452 107.12 9360 17.75 1.45 0.17
dflgrcbl 615 12.83 815 11.01 1.33 0.86
fconrcbl 467 15.18 653 13.82 1.40 0.91
frg2 1738 37.91 2465 14.95 1.42 0.39
k2 1972 32.80 3412 20.56 1.73 0.63
kcctlcb3 449 11.50 605 10.20 1.35 0.89
pair 3200 25.85 5002 17.80 1.56 0.69
rot 1387 30.18 2151 21.52 1.55 0.71
sbiucbl 471 22.18 806 18.89 1.71 0.85
tfaultcbl 375 9.91 508 8.41 1.35 0.85
vda 1118 23.76 1846 16.75 1.65 0.70
x3 1653 22.40 2344 11.54 1.42 0.52
aver 1.00 1.00 1.51 0.66 1.51 0.66

Table 3.9: Effect of Fanout Optimization without Area Recovery

min area: minimum area technology mapping with MCNC library lib2
no area recov: minimum area technology mapping followed by fanout optimization
without area recovery
gain: gain in area or in delay obtained when using no area recovery
area: area of the circuit (MCNC lib2 data divided by common divisor 464)
delay: delay of the circuit (MCNC lib2 data in nanoseconds)
aver: geometric average of the gains

circuit   fanout opt    Singh’s alg   gain
          area delay    area delay    area delay
C1355 1119 24.25 1012 26.37 0.90 1.09
C1908 1236 29.55 1182 31.48 0.96 1.07
C2670 1478 22.14 1460 23.05 0.99 1.04
C3540 2421 34.27 2301 37.95 0.95 1.11
C5315 3352 30.78 3251 33.12 0.97 1.08
C6288 7340 101.08 7003 110.29 0.95 1.09
C7552 5084 28.55 4713 57.88 0.93 2.03
alu4 1724 31.84 1592 35.82 0.92 1.12
ampbpreg 2891 39.40 2824 44.71 0.98 1.13
ampbsm 1615 18.05 1539 20.18 0.95 1.12
amppint2 1136 12.31 1083 14.12 0.95 1.15
ampxhdl 865 13.38 818 14.09 0.95 1.05
apex6 1565 13.40 1537 14.27 0.98 1.06
des 7358 17.84 6784 20.01 0.92 1.12
dflgrcbl 630 11.01 623 11.33 0.99 1.03
fconrcbl 481 13.82 472 13.89 0.98 1.01
frg2 1893 14.95 1841 16.75 0.97 1.12
k2 2189 20.56 2029 24.20 0.93 1.18
kcctlcb3 469 10.20 456 10.30 0.97 1.01
pair 3482 17.80 3357 19.94 0.96 1.12
rot 1444 21.49 1427 22.11 0.99 1.03
sbiucbl 514 18.89 487 20.59 0.95 1.09
tfaultcbl 400 8.41 385 8.90 0.96 1.06
vda 1324 16.75 1217 17.80 0.92 1.06
x3 1754 11.45 1699 14.86 0.97 1.30
aver - - - - 0.96 1.12

Table 3.10: Comparison with Singh’s Algorithm

fanout opt: minimum area technology mapping followed by our fanout algorithm
Singh’s alg: minimum area technology mapping followed by Singh’s algorithm
without area recovery
gain: gain or loss in area or in delay obtained by using Singh’s algorithm
area: area of the circuit (MCNC lib2 data divided by common divisor 464)
delay: delay of the circuit (MCNC lib2 data in nanoseconds)
aver: geometric average of the gains

Optimizing a fanout problem is, for most delay models, a difficult problem.
Instead of trying to find one fanout algorithm that would be applicable in all contexts, we
recommend the approach we have been following, of developing a spectrum of simple but
efficient fanout optimization algorithms based on different approaches: balanced trees,
LT-trees, combinational merging, top-down traversal. These algorithms are efficient enough to
make it practical to try them all on every fanout problem and retain the best solution in
all cases. In addition, some optimizations can be shared by all fanout algorithms if done
during a postprocessing optimization phase on fanout trees. These optimizations do not
affect delay, but can contribute significantly to area reduction.
The most important contribution of this chapter is to have demonstrated that,
at least in the context of fanout optimization, there is a simple way of applying a fanout
optimization algorithm to an entire network that is optimum in terms of delay and very
efficient in terms of area. This is a significant improvement over past methods which rely on
the identification and incremental improvement of critical paths. Our method is guaranteed
to produce the best delay achievable with a given fanout optimization algorithm and requires
only two passes on the network. In addition it achieves significant delay reduction for a very
moderate cost in area, observed in our experiments to be no more than 10% on average.
To minimize area under a delay constraint, we did not modify any of the fanout
algorithms. Rather, we simply applied every fanout algorithm to each fanout problem,
and selected, among the fanout trees so obtained that met the delay constraint, one with
minimum area. Area reduction is achieved simply by using a spectrum of fanout algorithms,
including some that can only produce fairly simple fanout trees.
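
The selection step described above can be sketched as follows. The FanoutTree record, the
candidate numbers, and the fallback to the fastest tree when no candidate meets the
constraint are assumptions made for this illustration; only the rule "among the trees that
meet the delay constraint, keep one with minimum area" comes from the text. The sketch
also assumes that a tree is summarized by the required time it achieves at the source, so
that a larger value is better.

```python
from typing import NamedTuple, Optional, Sequence

class FanoutTree(NamedTuple):
    algorithm: str        # which fanout algorithm produced this candidate
    area: float           # area of the buffers in the tree
    required_time: float  # required time achieved at the source (larger is better)

def select_tree(candidates: Sequence[FanoutTree],
                needed_required_time: Optional[float] = None) -> FanoutTree:
    """Without a constraint, keep the fastest tree; with one, the smallest tree that meets it."""
    if not candidates:
        raise ValueError("at least one candidate tree is needed")
    if needed_required_time is not None:
        feasible = [t for t in candidates if t.required_time >= needed_required_time]
        if feasible:
            return min(feasible, key=lambda t: (t.area, -t.required_time))
    # No constraint, or nothing meets it (assumed fallback): return the fastest tree.
    return max(candidates, key=lambda t: (t.required_time, -t.area))

# Hypothetical candidates, one per algorithm family named in the text.
candidates = [
    FanoutTree("balanced tree", area=6, required_time=10.0),
    FanoutTree("LT-tree", area=9, required_time=12.5),
    FanoutTree("combinational merging", area=8, required_time=12.0),
    FanoutTree("top-down traversal", area=4, required_time=9.0),
]
fastest = select_tree(candidates)
smallest_feasible = select_tree(candidates, needed_required_time=11.5)
print(fastest.algorithm, smallest_feasible.algorithm)
```

Because the simple algorithms tend to produce small trees, keeping them in the pool is
what makes area recovery effective on non-critical fanout problems.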

Chapter 4

Combining Tree Covering and Fanout Optimization

It is not necessary to hope to undertake
nor to succeed to persevere.
— GUILLAUME d’ORANGE

4.1 Introduction

In the previous two chapters, we covered separately two important delay
optimization techniques: tree covering and fanout optimization. The purpose of this chapter
is to study the problem of integrating these two optimizations. Figure 4.1 illustrates how
we have applied these optimizations so far. In gray are fanin trees, the parts of a circuit
where we apply tree covering to perform gate selection. In white are fanout trees, where we
apply fanout optimization. We can always combine tree covering and fanout optimization
as follows: we first use tree covering to implement the fanin trees, in one pass from primary
inputs to primary outputs. We then apply fanout optimization, in one pass from primary
outputs to primary inputs. We can run an additional fanout optimization pass to recover
area, but we will ignore area recovery for the moment. The question we would like to answer
in this chapter is: are there better ways to apply tree covering and fanout optimization to
a network that would lead to significant speed improvements?
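
The two-pass flow described above can be sketched on a toy network. The network, the node
names and the two callbacks are hypothetical; in a real flow the first callback would run
tree covering on fanin trees and the second would run fanout optimization on fanout trees,
each ignoring nodes of the other kind. The sketch only illustrates the traversal order.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9 or later

def two_pass_mapping(fanins, cover_tree, optimize_fanout):
    """Run one forward pass and one backward pass over a network.

    fanins maps each node to the list of nodes that feed it.
    """
    # Topological order: every node appears after all the nodes that feed it.
    order = list(TopologicalSorter(fanins).static_order())
    for node in order:            # pass 1: primary inputs towards primary outputs
        cover_tree(node)
    for node in reversed(order):  # pass 2: primary outputs towards primary inputs
        optimize_fanout(node)
    return order

# Toy network: a, b, c are primary inputs; "out" is a primary output.
toy_fanins = {"n1": ["a", "b"], "n2": ["n1", "c"], "out": ["n2", "n1"]}

covered, buffered = [], []
order = two_pass_mapping(toy_fanins, covered.append, buffered.append)
print("tree covering order:      ", covered)
print("fanout optimization order:", buffered)
```

The forward pass sees final arrival times at its inputs, and the backward pass sees final
required times at its sinks, which is why a single pass in each direction is enough.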
[Figure 4.1: Combining Tree Covering and Fanout Optimization. Labels in the figure:
primary outputs, fanout optimization, tree covering, primary inputs.]

We first introduce some definitions and terminology used in the rest of this chapter.
We partition a Boolean network into fanin trees and fanout trees. We group each tree into
a node. To each fanin tree corresponds a fanin node, and to each fanout tree corresponds
a fanout node. Fanin nodes are specified by a Boolean function that can be implemented
as a tree. Fanout nodes are simply specified by a source and a set of sinks. The polarity of
these sinks is usually not known before the fanin trees are implemented. Fanin nodes can be
implemented by tree covering, or by some form of restructuring followed by tree covering.
In this work, we only use tree covering, but the theoretical part of this chapter remains
valid if we use restructuring in addition to tree covering. Fanout nodes are implemented
using fanout optimization. In each case, we suppose that the implementation attempts
to minimize delay for a given set of arrival times for fanin nodes or required times for
fanout nodes. An implementation that is such that all delays from the leaves of the tree
to the root of the tree are equal is called a balanced implementation. We use unbalanced
implementations to compensate for the imbalance in arrival times or required times. The
problem we study in this chapter is the problem of finding a good order of application
of tree covering (possibly with restructuring) and fanout optimization to minimize delay.
We call this problem the tree-based delay optimization problem, since it is the problem of
minimizing delay while respecting the tree boundaries of a Boolean network.
In section 4.2 we formulate the tree-based delay optimization problem as a
convex optimization problem. This formulation is only valid for a continuous version of the
constant delay model of section 3.3.1. By abstracting away the discrete nature of delay
optimization in our setting, this formulation allows us to compute analytically the
minimum delay implementation of a few simple circuits. We can then use these examples to
detect potential sources of suboptimality in tree-based delay optimization algorithms. In
that section we exhibit a circuit that has the following property: for any constant a > 0,
there is an implementation of this circuit whose delay is a units worse than that of an optimal
implementation and is such that all paths are critical and all fanin nodes and all fanout
nodes are implemented optimally. Such an implementation cannot be improved by any
greedy application of tree covering or fanout optimization in any order. Due to physical
constraints, this example is only realistic for a limited range of values of a. Nevertheless it
indicates clearly the limitations of greedy delay improvement strategies.
This example is based on an initial implementation where arbitrarily unbalanced
implementations of fanin and fanout nodes compensate each other exactly. We can easily
avoid this problem by starting with a balanced initial implementation. To do so, we
implement all fanout trees using a balanced configuration prior to the first application of tree
covering. This technique is described in section 4.3. The main difficulty in building these
balanced fanout trees is to handle sink polarities properly. Since the fanout trees are built
before an implementation of fanin trees is available, sink polarities are not known. This
phase assignment problem has an important impact on the quality of the final
implementation.
Once we have built an initial implementation, we can iterate tree covering and
fanout optimization until we reach a local minimum. In section 4.4, we present a simple
iterative scheme that we can use to perform this iteration. This scheme consists in iterating
tree covering and fanout optimization passes on the network, using at the i-th iteration the
delay information computed at the (i-1)-th iteration. Our experimental results indicate that
there is almost no advantage in performing more than one iteration with this method. To
estimate the optimality of the final result, we applied this optimization scheme to another
simple circuit for which we can compute the optimal solution for a simple delay model.
On this circuit also the iterative algorithm converges almost immediately, and the result
obtained after one iteration is within a few percent of the optimum solution. This example
also suggests an improvement of the iterative algorithm, which converges more slowly but
reaches a solution that is within a fraction of a percent of the optimum. We conclude
section 4.4 by a brief overview of a new approach to tree-based minimization proposed by
Yoshikawa et al. [44].

Figure 4.2: Suboptimality of Tree-Based Optimization.
The results of section 4.4 are a strong indication that we are reaching the limit
of what can be achieved with tree-based delay minimization. The purpose of section 4.5
is to propose two techniques to minimize delay that do not preserve tree boundaries but
are nevertheless simple variations of the tree covering and fanout optimization algorithms
proposed so far. The first of these techniques, tree duplication, allows the duplication of
a fanin node to implement the node both in positive and negative phases. In tree covering,
one phase is selected, and the other phase is provided by an inverter. With tree duplication,
both phases are implemented as separate trees, possibly with partial overlap. The rationale
behind this heuristic is to make available to the fanout optimizer the earliest possible source
for each polarity, possibly reducing by one the number of buffers on a critical path. Another
factor makes this optimization attractive: the possibility of avoiding unnecessary duplication
during the area recovery phase of fanout optimization. The second of these techniques, tree
overlap, is more radical. It allows tree covering to ignore tree boundaries. The example of
Figure 4.2 illustrates why allowing overlaps between trees can help reduce circuit delay.
The circuit shown in this example can be implemented as shown in solid lines, with three
2-input NAND gates and one inverter. Or it can be implemented as suggested by the
dotted lines, with two 3-input NAND gates. The second implementation is significantly
faster, and in that case happens to use less area. In general, however, allowing overlaps
tends to increase area. If the example of Figure 4.2 is modified to have a fanout of 10,
the implementation with ten 3-input NAND gates will still be faster, but would use more
area than an implementation with eleven 2-input NAND gates and one inverter. This
optimization has the effect of moving logic across fanout points in a manner reminiscent of
retiming [30].
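
The area trade-off in this example can be checked with a few lines of arithmetic. The gate
counts are taken from the text (one more 2-input NAND than the fanout plus one inverter,
versus one 3-input NAND per fanout branch); the gate areas below are hypothetical values
chosen only for illustration and are not lib2 data.

```python
# Hypothetical gate areas, in arbitrary units (not taken from lib2).
AREA = {"inv": 1, "nand2": 2, "nand3": 3}

def area_tree_based(fanout):
    """Implementation that respects tree boundaries: fanout + 1 NAND2 gates and one inverter."""
    return (fanout + 1) * AREA["nand2"] + AREA["inv"]

def area_with_overlap(fanout):
    """Implementation with tree overlap: one NAND3 gate per fanout branch."""
    return fanout * AREA["nand3"]

for fanout in (2, 10):
    print(fanout, area_tree_based(fanout), area_with_overlap(fanout))
```

With these assumed areas the overlapped implementation is smaller at a fanout of 2 and
larger at a fanout of 10, because the logic shared before the fanout point is duplicated
once per branch.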
The results obtained in this chapter indicate that we are reaching a limit to what
we can expect from tree-based technology mapping algorithms in terms of delay
optimization. This opens the way to the next chapter, where we examine the effect of
technology-independent optimizations that can modify the global structure of a network.

4.2 Theoretical Analysis of Tree-Based Delay Minimization

In this section, we present an abstract formulation of the tree-based delay
minimization problem, as a convex optimization problem. This formulation is obtained by
abstracting away the discrete nature of our algorithms, but uses delay equations directly
derived from the exact solution of fanout problems using the constant delay model of
section 3.3.1 and in that sense represents a reasonable continuous approximation of the discrete
problem we are seeking to solve.
We first show how we model the effect of tree covering (with restructuring) using
a convex function, derived from combinational merging [18]. Then using symmetry we
apply this model to cover fanout optimization. Using these models, we can formulate the
tree-based covering problem as a convex optimization problem. We use this formulation
on a simple circuit to compute the optimum implementation for that circuit, and exhibit
a class of implementations of that circuit that cannot be improved using greedy tree-based
optimization algorithms.

4.2.1 Modeling Tree Covering

To model the effect of tree covering and restructuring on a fanin node, we use a
function $f(a_1, \ldots, a_n)$ that represents the arrival time achieved by an optimal
implementation of a fanin node v at the output of v given arrival times $(a_1, \ldots, a_n)$ at
the inputs of v. If the Boolean function computed by fanin node v is a Boolean AND of n
inputs, and if the library is composed of AND gates of constant delay equal to 1 and fanin k,
where k lies between 2 and some limit t, the optimal implementation of v can be obtained using
combinational merging. Moreover, the function f can be computed exactly in that case,
and is given by the following formula [18, 22]:

$$f(a_1, \ldots, a_n) = \left\lceil \log_t \sum_{1 \le i \le n} t^{a_i} \right\rceil \qquad (4.1)$$
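
Equation 4.1 can be checked numerically in the simplest setting. The sketch below assumes
t = 2 (only 2-input gates of delay 1) and non-negative integer arrival times, and it uses a
plain "merge the two earliest signals" loop as a stand-in for combinational merging; the
function names are ours.

```python
import heapq

def merged_arrival(arrivals):
    """Output arrival time when the two earliest signals are merged repeatedly.

    Each merge models one 2-input gate of delay 1: its output is ready one time
    unit after the later of its two inputs.
    """
    heap = list(arrivals)
    heapq.heapify(heap)
    while len(heap) > 1:
        first = heapq.heappop(heap)
        second = heapq.heappop(heap)
        heapq.heappush(heap, max(first, second) + 1)
    return heap[0]

def formula_arrival(arrivals):
    """Equation 4.1 with t = 2: the ceiling of log2 of the sum of 2**a_i.

    Uses exact integer arithmetic: for an integer S >= 1, ceil(log2(S)) is
    (S - 1).bit_length().
    """
    total = sum(1 << a for a in arrivals)
    return (total - 1).bit_length()

for arrivals in ([0, 0, 0, 0], [0, 0, 1, 3], [2, 2, 2, 2, 5], [0, 0, 0], [7]):
    print(arrivals, merged_arrival(arrivals), formula_arrival(arrivals))
```

On these inputs the merging loop and the closed form agree, which illustrates how a late
input (such as the 5 in the third example) is placed close to the root of the tree.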

To be able to use optimization techniques from real analysis, we need to approximate f by
a differentiable function. We use the following approximation, obtained by dropping the
ceiling function and setting for convenience t = e. A different choice of t would only have
the effect of scaling the delay numbers by a multiplicative constant.

$$f(a_1, \ldots, a_n) = \log \sum_{1 \le i \le n} e^{a_i} \qquad (4.2)$$

This approximation ignores the discrete nature of tree covering and restructuring, and the
added irregularity of fanin nodes with asymmetric logic functions. However it captures the
essence of fanin tree balancing. It produces a delay of log n for balanced arrival times, and,
for unbalanced arrival times, allows late signals to traverse a fanin node for less delay at an
extra cost for the other signals. Moreover, the model has not been chosen arbitrarily, but
derived directly from the optimal solution of a fanin problem for a simple delay model. In
addition, this model has the following important property:

Lemma 4.2.1 The function f given by equation 4.2 is convex.

Proof. The function f is obtained by composition of infinitely differentiable functions, and
is thus infinitely differentiable over $\mathbb{R}^n$. To prove that it is convex, we show
that its Hessian is positive semidefinite at every point of the space. Writing
$(x_1, \ldots, x_n)$ for the arguments of f, we first define the following coefficients:

$$a_{ij} = \frac{e^{x_i + x_j}}{\left(\sum_{k=1}^{n} e^{x_k}\right)^2} \qquad (4.3)$$

A simple computation yields the following equalities:

$$\frac{\partial^2 f}{\partial x_i \partial x_j} = -a_{ij} \quad \text{if } i \neq j$$

$$\frac{\partial^2 f}{\partial x_i^2} = -a_{ii} + \sum_{k=1}^{n} a_{ik}$$

We deduce from these equations, and from the fact that $a_{ij} = a_{ji}$, the following
inequality, valid for any vector $(y_1, \ldots, y_n)$, which proves our result:

$$\begin{aligned}
\sum_{i,j} y_i y_j \frac{\partial^2 f}{\partial x_i \partial x_j}
  &= -\sum_{i,j} a_{ij} y_i y_j + \sum_{i,j} a_{ij} y_i^2 \\
  &= \sum_{i<j} a_{ij} \left(y_i^2 + y_j^2 - 2 y_i y_j\right) \\
  &= \sum_{i<j} a_{ij} (y_i - y_j)^2 \ge 0
\end{aligned}$$
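
The properties of the continuous fanin model of equation 4.2 can also be observed
numerically. The sketch below is only an illustration: the function names, the sample
points and the random test ranges are ours, and a random midpoint test is evidence of
convexity rather than a proof.

```python
import math
import random

def fanin_model(arrivals):
    """Continuous fanin model of equation 4.2: log of the sum of exp(a_i)."""
    peak = max(arrivals)  # factor out the largest term for numerical stability
    return peak + math.log(sum(math.exp(a - peak) for a in arrivals))

def midpoint_gap(x, y):
    """(f(x) + f(y)) / 2 - f((x + y) / 2), which is non-negative when f is convex."""
    middle = [(u + v) / 2 for u, v in zip(x, y)]
    return (fanin_model(x) + fanin_model(y)) / 2 - fanin_model(middle)

# Balanced arrival times: n inputs arriving at time 0 give a delay of log n.
balanced = fanin_model([0.0] * 8)

# One late input: the output arrives only slightly after the late signal.
late = fanin_model([0.0, 0.0, 0.0, 5.0])

# Random midpoint test of convexity.
rng = random.Random(0)
smallest_gap = min(
    midpoint_gap([rng.uniform(-5, 5) for _ in range(4)],
                 [rng.uniform(-5, 5) for _ in range(4)])
    for _ in range(1000)
)
print(balanced, math.log(8))
print(late)
print(smallest_gap)
```

Note that shifting all arrival times by the same constant shifts the output by that
constant, so the function is linear along the direction (1, ..., 1).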

4.2.2 Modeling Fanout Optimization

To model the effect of fanout optimization, we use a function $g(r_1, \ldots, r_n)$ that
represents the required time achieved by an optimal implementation of a fanout node v
at the input of v given required times $(r_1, \ldots, r_n)$ at the outputs of v. If the fanout node
v has n outputs, and the library is composed of buffers of constant delay equal to 1 and
fanout k, where k lies between 2 and some limit t, the fanout problem is the exact analog
to the simplified fanin problem used in the previous section. One problem can be deduced
from the other by changing the direction of propagation of signals. As a consequence, the
optimum solution can be computed for this problem as follows:

$$g(r_1, \ldots, r_n) = -\left\lceil \log_t \sum_{1 \le i \le n} t^{-r_i} \right\rceil \qquad (4.4)$$

The continuous approximation of g is obtained in the same fashion as the continuous
approximation of f in the previous section:

$$g(r_1, \ldots, r_n) = -\log \sum_{1 \le i \le n} e^{-r_i} \qquad (4.5)$$

It is easy to check that g is concave, since f is convex and g(x) = -f(-x).
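The two continuous models are short enough to write down directly. The sketch below is illustrative only: the function names are assumptions, and the max-shift is a standard numerical precaution rather than something the text prescribes. It evaluates the fanin model on balanced and unbalanced arrival times and obtains the fanout model from the relation g(x) = -f(-x).

```python
import math

def fanin_delay(arrivals):
    """Continuous fanin model: log of the sum of exp(a_i) (equation 4.2)."""
    m = max(arrivals)                            # shift to avoid overflow
    return m + math.log(sum(math.exp(a - m) for a in arrivals))

def fanout_required(required):
    """Continuous fanout model (equation 4.5), computed as g(r) = -f(-r)."""
    return -fanin_delay([-r for r in required])

# Eight balanced inputs arriving at time 0: the output arrives at log 8.
balanced = fanin_delay([0.0] * 8)

# One late input (time 5) among seven early ones: the late signal traverses the
# node for much less than log 8, at an extra cost for the seven early signals.
late = fanin_delay([5.0] + [0.0] * 7)

print(balanced, late - 5.0, fanout_required([0.0] * 8))
```

With eight sinks all required at time 0, the fanout model returns a required time of -log 8 at the input of the node, mirroring the balanced fanin case.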

4.2.3 Formulation as a Convex Optimization Problem

Given a continuous model of the delay through fanin nodes and fanout nodes as the
result of tree covering and fanout optimization, we can proceed to formulate the tree-based
delay optimization problem as a global optimization problem. We assume that the Boolean
network to be optimized is decomposed into fanin nodes and fanout nodes. Without loss of
generality, we can suppose that the fanin nodes and fanout nodes are maximal, in the sense
that a fanin node is never connected to another fanin node, nor a fanout node to another
fanout node. If it were not the case, for example if two fanin nodes were connected, the
input node could always be collapsed into the output node.
We divide the edges or wires of the network into four sets: PI, PO, LEAF and
ROOT. Each wire only connects two nodes. Multiple fanouts are handled exclusively
through fanout nodes. The PI wires are wires directly connected to primary inputs.
Associated with each of the PI wires is an arrival time. Arrival times are represented as a
vector a of dimension |PI|. Similarly, the PO wires are the wires directly
connected to primary outputs. Associated with each PO wire is a required time. Required
times are represented as a vector r of dimension |PO|. The LEAF wires are the wires that
connect an output of a fanout node to an input of a fanin node. Arrival times on these
wires are represented by a vector x of dimension |LEAF|. Finally the ROOT wires are
the wires connecting the output of a fanin node to the input of a fanout node. Figure 4.3
illustrates these definitions.
For each fanin node, we suppose that we have at our disposal a function that
computes the best achievable arrival time at the output of the node, for any feasible
implementation of the node. This function only depends on x and a. To simplify the notation,
we designate by f the vector of such functions, and we keep implicit the dependency on the
vector of arrival times a, considered constant. Thus
$f : \mathbb{R}^{|LEAF|} \to \mathbb{R}^{|PO|} \times \mathbb{R}^{|ROOT|}$. All
the components of f are convex functions.
We note $\pi_{PO}$ the orthogonal projection of a vector of
$\mathbb{R}^{|PO|} \times \mathbb{R}^{|ROOT|}$ onto $\mathbb{R}^{|PO|}$,
and similarly $\pi_{ROOT}$ designates the projection onto $\mathbb{R}^{|ROOT|}$. For a given assignment of
arrival times x at the LEAF wires, $\pi_{PO}(f(x))$ designates the best achievable arrival times
at the primary outputs of the network. For $1 \le p \le +\infty$, $|x|_p$ designates the quantity
$\left(\sum_{k=1}^{n} |x_k|^p\right)^{1/p}$, and $x^+$ designates the vector of components
$(\max(x_k, 0))_{1 \le k \le n}$. The
total amount by which an implementation fails to meet its timing requirements is given
by $|(\pi_{PO}(f(x)) - r)^+|_1$ and the maximum amount by which a requirement fails is given
by $|(\pi_{PO}(f(x)) - r)^+|_\infty$. In both cases, the convexity of the components of f implies the
convexity of the cost function.
Similarly, for each fanout node, we suppose that we have at our disposal a function
that computes the best achievable required time at the input of the node, for a given set of
required times at the outputs. This function only depends on x and the constant vector r.


[Figure 4.3: Partition of Edges. Labels: PRIMARY OUTPUTS, PRIMARY INPUTS. Legend: fanout node, fanin node, PI or PO wire, ROOT wire, LEAF wire.]

As for primary input arrival times, to simplify notation, we keep implicit the dependency
on the primary output required times. We designate by g the vector of these functions:
$g : \mathbb{R}^{|LEAF|} \to \mathbb{R}^{|PI|} \times \mathbb{R}^{|ROOT|}$. A LEAF wire assignment x is realizable if it satisfies the
following inequalities: $\pi_{PI}(g(x)) \ge a$ and $\pi_{ROOT}(g(x)) \ge \pi_{ROOT}(f(x))$.

In total, we have formulated the tree-based delay minimization problem as the
following optimization problem:

$$\begin{aligned}
\min_{x \in \mathbb{R}^{|LEAF|}} \quad & \left| (\pi_{PO}(f(x)) - r)^+ \right|_\infty \\
\text{subject to} \quad & \pi_{PI}(g(x)) \ge a \\
& \pi_{ROOT}(g(x)) \ge \pi_{ROOT}(f(x))
\end{aligned}$$

Each constraint in the problem is of the form $g \ge f$, where g is a concave function and
f a convex function. The set of points satisfying the constraints is thus convex, and the
problem has been expressed as the problem of minimizing a convex function over a convex
set.
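To make the formulation concrete, the sketch below evaluates the cost function and the realizability constraints for one small network under the continuous delay model. The topology is chosen purely for illustration (two fanout nodes driven by the primary inputs, two fanin nodes driving the primary outputs, four LEAF wires, no ROOT wires), and every name in the code is an assumption.

```python
import math

def lse(values):
    """log of the sum of exponentials, with a max-shift for stability."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def positive_part_norm(v, r, p):
    """|(v - r)^+|_p for finite p >= 1 or p = math.inf."""
    excess = [max(vk - rk, 0.0) for vk, rk in zip(v, r)]
    if p == math.inf:
        return max(excess)
    return sum(e ** p for e in excess) ** (1.0 / p)

def evaluate(x, a, r):
    """Return (realizable, cost) for a LEAF assignment x = (x1, x2, x3, x4).

    Illustrative topology: fanout node 1 drives x1 and x3, fanout node 2 drives
    x2 and x4; fanin node 1 reads x1 and x2, fanin node 2 reads x3 and x4.
    """
    x1, x2, x3, x4 = x
    outputs = [lse([x1, x2]), lse([x3, x4])]          # arrival times at the POs
    inputs = [-lse([-x1, -x3]), -lse([-x2, -x4])]     # required times at the PIs
    realizable = all(gi >= ai - 1e-12 for gi, ai in zip(inputs, a))
    return realizable, positive_part_norm(outputs, r, math.inf)

realizable, cost = evaluate([math.log(2.0)] * 4, a=[0.0, 0.0], r=[1.0, 1.0])
print(realizable, cost)
```

The tolerance of 1e-12 in the constraint check only absorbs floating-point rounding; it is not part of the formulation.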


[Figure 4.4: A Simple Example. Labels recovered from the scan: primary outputs r1, r2; LEAF wire x4; primary inputs a1, a2.]

4.2.4 A Simple Example

We now proceed to use this formulation in order to compute an analytical solution
to the tree-based delay minimization problem for the simple circuit of Figure 4.4. The delay
function associated with each fanin node is given by equation 4.2, and the required time
function associated with each of the two fanout nodes by equation 4.5.
We are interested in computing the best pairs of arrival times realizable at the
outputs of this circuit, i.e. in finding all the points $(r_1, r_2)$ that are realizable arrival
times for the circuit of Figure 4.4, and minimal for the partial ordering on vectors defined
by: $x \le y$ iff $x_i \le y_i$ for all $1 \le i \le n$. In other words, we are interested in the pairs
of realizable arrival times $(r_1, r_2)$ such that $r_1$ and $r_2$ cannot be improved simultaneously.
Such sets of optimal points are often studied in microeconomics and called Pareto optimal
points. These points must satisfy the following equations:

$$r_1 = \log(e^{x_1} + e^{x_2})$$

$$r_2 = \log(e^{x_3} + e^{x_4})$$

$$a_1 = -\log(e^{-x_1} + e^{-x_3})$$

$$a_2 = -\log(e^{-x_2} + e^{-x_4})$$

where $(a_1, a_2)$ are given arrival times at the inputs of the network. To simplify the
computation, we perform the following substitution of variables:

$$R_i = e^{r_i}$$

$$A_i = e^{a_i}$$

$$X_i = e^{x_i}$$

The problem is then reduced to finding the minimal pairs $(R_1, R_2)$ satisfying the following
two equations:

$$R_1 = X_1 + X_2$$

$$R_2 = \frac{A_1 X_1}{X_1 - A_1} + \frac{A_2 X_2}{X_2 - A_2}$$

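The minimal points must be such that
$\left(\frac{\partial R_1}{\partial X_1}, \frac{\partial R_1}{\partial X_2}\right)$ and
$\left(\frac{\partial R_2}{\partial X_1}, \frac{\partial R_2}{\partial X_2}\right)$ are colinear, otherwise
it would be possible to decrease both $R_1$ and $R_2$ and still stay in the feasibility region.
The first vector is equal to (1,1) and the second vector to
$\left(\frac{-A_1^2}{(X_1 - A_1)^2}, \frac{-A_2^2}{(X_2 - A_2)^2}\right)$. Thus the
minimal points are characterized by the equation:

$$\frac{X_1}{A_1} = \frac{X_2}{A_2}$$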

A parametric representation of the set of minimum points is then readily available, using
$T = \frac{X_1}{A_1}$ as parameter. The range of T is $1 < T < \infty$.

$$R_1 = T (A_1 + A_2)$$

$$R_2 = \frac{T}{T - 1} (A_1 + A_2)$$

An equivalent closed form is given by:

$$-\log(e^{-r_1} + e^{-r_2}) = \log(e^{a_1} + e^{a_2})$$

This closed form indicates that the behavior of this circuit is equivalent to the behavior of
a single fanout node with an input arrival time of $\log(e^{a_1} + e^{a_2})$.
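The closed form can be checked against the parametric representation. The sketch below is an illustration with assumed names: it takes the parameter T as the common ratio X1/A1 = X2/A2, so that R1 = T (A1 + A2) and R2 = T/(T - 1) (A1 + A2), and verifies that every generated point satisfies the closed form.

```python
import math

def pareto_point(a1, a2, t):
    """Return (r1, r2) on the Pareto curve for a parameter value t > 1."""
    A1, A2 = math.exp(a1), math.exp(a2)
    R1 = t * (A1 + A2)
    R2 = t / (t - 1.0) * (A1 + A2)
    return math.log(R1), math.log(R2)

a1, a2 = 0.0, 1.0                      # illustrative input arrival times
residuals = []
for t in (1.5, 2.0, 4.0):
    r1, r2 = pareto_point(a1, a2, t)
    lhs = -math.log(math.exp(-r1) + math.exp(-r2))
    rhs = math.log(math.exp(a1) + math.exp(a2))
    residuals.append(lhs - rhs)
    print(t, r1, r2)
print(residuals)
```

Values of T close to 1 favor the first output at the expense of the second, large values do the opposite, and T = 2 gives equal arrival times at both outputs.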

4.2.5 Suboptimal Local Minima

With the example of the previous section, we can exhibit a family of circuit
implementations for the same network that are arbitrarily far from the optimum solution, but yet
have the property that every node, taken in isolation, is configured optimally with respect
to the rest of the network. In other words, any algorithm that works greedily by changing
the circuit implementation one node at a time, or one group composed of a fanin node and its
fanout node at a time, can enter a global implementation where every single node or group
appears optimal with respect to the rest of the circuit, but the overall delay through the
circuit is arbitrarily far from the optimal solution.
This family of implementations is parametrized by one parameter, designated as
$\alpha$. Given $\alpha$, we consider the following wire assignment to the circuit of Figure 4.4, where
we now suppose that the arrival times and the required times are all equal to 0:

$$x_1 = \log(1 + e^{\alpha}) \qquad (4.6)$$

$$x_2 = \log(1 + e^{-\alpha}) \qquad (4.7)$$

$$x_3 = \log(1 + e^{-\alpha}) \qquad (4.8)$$

$$x_4 = \log(1 + e^{\alpha}) \qquad (4.9)$$

Since we have:

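$$\begin{aligned}
-\log(e^{-x_1} + e^{-x_3}) &= -\log(e^{-x_2} + e^{-x_4}) \\
  &= -\log\left(\frac{1}{1 + e^{\alpha}} + \frac{1}{1 + e^{-\alpha}}\right) \\
  &= 0
\end{aligned}$$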

this leaf wire assignment is optimal with respect to the two fanout nodes. The best
primary output arrival times achievable with this wire assignment are equal, for both primary
outputs, to the value:

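$$\begin{aligned}
\log(e^{x_1} + e^{x_2}) &= \log(e^{x_3} + e^{x_4}) \\
  &= \log(1 + e^{\alpha} + 1 + e^{-\alpha}) \\
  &= \log 2 + \log(1 + \cosh(\alpha)) \qquad (4.10)
\end{aligned}$$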

which can be made arbitrarily far from an optimal solution. Solutions arbitrarily far from
the optimum are not realized in practice due to physical limitations. This result simply
indicates that greedy tree-based delay optimization algorithms are unable to recover from
certain initial implementations, no matter how suboptimal these implementations are.
In addition, if we take as initial implementation an implementation where the
fanout nodes are implemented optimally relative to the wire values given by equations 4.6
to 4.9, and the fanin trees are implemented arbitrarily, a one pass application of tree covering
to this implementation will produce a configuration that cannot be optimized any further
and whose distance to the optimal solution is given by equation 4.10. Only if the initial
implementation is balanced would the result be optimal.
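The family of assignments above is easy to evaluate numerically. The sketch below is illustrative, with assumed names: for several values of the parameter it checks that both fanout constraints hold with equality at 0, and prints how far the output arrival time, log(1 + e^alpha + 1 + e^-alpha), lies above log 4, the value reached by the balanced assignment (alpha = 0).

```python
import math

def lse(values):
    """log of the sum of exponentials, with a max-shift for stability."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def family(alpha):
    """Wire assignment parametrized by alpha, and the resulting node values."""
    x1 = x4 = math.log1p(math.exp(alpha))
    x2 = x3 = math.log1p(math.exp(-alpha))
    at_inputs = (-lse([-x1, -x3]), -lse([-x2, -x4]))   # fanout nodes
    at_outputs = (lse([x1, x2]), lse([x3, x4]))        # fanin nodes
    return at_inputs, at_outputs

for alpha in (0.0, 2.0, 5.0, 10.0):
    ins, outs = family(alpha)
    gap = outs[0] - math.log(4.0)      # distance to the balanced assignment
    print(alpha, ins, outs[0], gap)
```

The gap grows roughly linearly with alpha once alpha is large, while each fanout node, taken in isolation, still meets its input arrival time exactly.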

4.3 Selecting the Initial Implementation

Before improving a network implementation, we need to generate one. Since our
incremental improvement algorithms are not very powerful, the quality of the final result is
very sensitive to the quality of the initial point. To compute an initial implementation, we
perform tree covering on the entire network, using a heuristic to estimate the arrival times
at the output of fanout nodes. The quality of the initial implementation is very sensitive to
the quality of this heuristic. From the previous section, we know that it is important not
to introduce artificial imbalances in the initial implementation, because they are a source
of suboptimality that cannot always be eliminated by incremental improvement algorithms.
Therefore a heuristic to estimate arrival times at the output of fanout nodes should estimate
these arrival times to be equal for all the outputs of a given fanout node.
There are still many ways to estimate balanced arrival times, and the following
questions remain to be answered:

• what should be the delay through a fanout node?

• should the delay be sensitive to signal polarity?

• should the delay be sensitive to variations in sink loads?

These questions arise from the fact that the heuristic is to be used during tree covering.
Since the implementation of fanin nodes is not known at this point, neither the polarity
nor the load at the outputs of fanout nodes is known when the delay estimation heuristic
is called.

A Simple Fanout Delay Estimation Heuristic We present in this paragraph a simple
delay estimation heuristic that performs consistently better than its variants. With this
heuristic, we estimate the arrival time at the output of a fanout node v as follows. We
first compute the best arrival time achievable at the input of v by tree covering. This
information is available because tree covering proceeds in topological order from primary
inputs to primary outputs. Then, if n is the number of fanouts of v, we build a balanced
fanout tree with n sinks to implement v. The fanout tree is selected by the balanced tree
algorithm presented in section 3.4.3. The loads of the sinks are taken to be all equal to
some generic value (we use the input load of a minimum size 2-input NAND gate).
The computation of this fanout tree is done twice, once by supposing that all sinks
are of positive polarity, once by supposing that all sinks are of negative polarity. Among
these two trees, the fanout tree with the earlier output arrival time is stored at node v, and
the other tree is discarded. The fanout tree stored at node v in this fashion is called $T_v$.
We use $T_v$ to compute the arrival time at any output of v.
When tree covering is applied to a fanin node in the fanout of v, we need to
compute the arrival time at gate inputs p connected to v. To do so, we compute the delay
through the fanout tree $T_v$ stored at v, adding to the load driven by the buffer connected
to p the difference between the load at p and the generic sink load value we used to build
$T_v$. This computation is done irrespective of the polarity of p.

Taking Sink Polarity into Account We modified the heuristic described in the previous
paragraph to take sink polarities into account. This modification was done as follows: we
stored with fanout tree $T_v$ at each fanout node v the polarity of the signals available at its
outputs. Since $T_v$ is a balanced tree, all outputs have the same polarity. When the input
pin p of a gate requires the signal under the polarity provided by $T_v$, we compute the arrival
time at p as before. When the signal is required under the opposite polarity, we add an
inverter delay to the arrival time obtained from $T_v$. We measured the effect of this modified
fanout delay heuristic on the quality of circuit implementations obtained after one pass of
tree covering followed by two passes of fanout optimization, one for delay and one for area
recovery. As shown in Table 4.1, taking sink polarity into account yields circuits that are on
average 11% slower and 2% larger.
By taking polarities into account at this stage, we introduce an artificial bias in
favor of polarity $X$ over polarity $\overline{X}$. In doing so we tend to eliminate tree covers that
would have chosen polarity $\overline{X}$ if $X$ and $\overline{X}$ were available with the same arrival time, before
we can determine whether later passes of fanout optimization could provide a signal of $\overline{X}$
at less cost than $X$ plus an extra inverter delay. In the absence of such information, it
is more important not to introduce any bias than to try to be consistent with a feasible
implementation of a fanout tree.


circuit area(no polarity) delay(no polarity) area(polarity) delay(polarity) gain(area) gain(delay)
C1355 1192 24.21 1204 25.99 1.01 1.07
C1908 1358 28.24 1381 29.62 1.02 1.05
C2670 1796 20.86 1809 21.66 1.01 1.04
C3540 2857 32.88 2906 37.28 1.02 1.13
C5315 3693 29.31 3789 32.08 1.03 1.09
C6288 8272 95.47 8932 109.69 1.08 1.15
C7552 5322 27.83 5303 29.98 1.00 1.08
alu4 1977 29.24 1984 31.31 1.00 1.07
ampbpreg 3289 35.56 3422 42.43 1.04 1.19
ampbsm 1875 16.55 1930 19.51 1.03 1.18
amppint2 1353 11.29 1390 11.92 1.03 1.06
ampxhdl 1059 12.90 1022 13.85 0.97 1.07
apex6 1912 11.29 1905 12.41 1.00 1.10
des 8632 16.08 8598 17.40 1.00 1.08
dflgrcbl 730 10.39 725 11.03 0.99 1.06
fconrcbl 537 11.00 578 13.88 1.08 1.26
frg2 2367 13.17 2204 14.85 0.93 1.13
k2 2755 16.87 2887 19.69 1.05 1.17
kcctlcb3 557 9.26 597 9.75 1.07 1.05
pair 3956 16.40 3978 19.41 1.01 1.18
rot 1651 19.17 1724 20.67 1.04 1.08
sbiucbi 599 15.66 641 17.47 1.07 1.12
tfaultcbl 439 6.80 483 7.92 1.10 1.16
vda 1522 13.31 1620 15.06 1.06 1.13
x3 2033 10.97 2075 12.65 1.02 1.15
aver - - - - 1.02 1.11

Table 4.1: Effect of Taking Polarities Into Account in Fanout Delay Heuristic

no polarity: fanout delay heuristic ignores polarities
polarity: fanout delay heuristic takes polarities into account
gain: increase in area or in delay obtained by taking polarities into account
area: area of the circuit (MCNC lib2 data divided by common divisor 464)
delay: delay of the circuit (MCNC lib2 data in nanoseconds)
aver: geometric average of the gains
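The "aver" row can be reproduced approximately from the table itself. The sketch below recomputes the geometric average of the delay gains of Table 4.1 from the rounded values printed in the last column; because the inputs are already rounded to two decimals, the result is only an approximation of an average taken over the unrounded ratios. The variable and function names are assumptions.

```python
import math

# Delay gains transcribed from the last column of Table 4.1, in row order.
delay_gains = [
    1.07, 1.05, 1.04, 1.13, 1.09, 1.15, 1.08, 1.07, 1.19, 1.18,
    1.06, 1.07, 1.10, 1.08, 1.06, 1.26, 1.13, 1.17, 1.05, 1.18,
    1.08, 1.12, 1.16, 1.13, 1.15,
]

def geometric_mean(values):
    """Geometric mean computed through the mean of the logarithms."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

mean_delay_gain = geometric_mean(delay_gains)
print(round(mean_delay_gain, 2))
```

A geometric rather than arithmetic average is the natural choice here because the gains are ratios: the average is then unaffected by which of the two implementations is taken as the reference.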


Using a Wire as a Balanced Fanout Tree A wire is the simplest balanced fanout tree
we could use to estimate the arrival time at the output of fanout nodes. Unfortunately,
the use of a wire as a delay estimator at a fanout node yields implementations of poorer
quality, as shown in Table 4.2, with a degradation of 9% in delay and 3% in area. As in the
previous paragraph, these results were obtained by using the fanout delay heuristic during
an initial pass of tree covering, and completing the implementation of the circuits by two
passes of fanout optimization, one for delay and one for area. By using a linear delay model
at a fanout point, we introduce artificial imbalances in the implementation, in particular in
the presence of nodes with large fanouts.

Conclusion In summary, we found that the best delay heuristic estimates the delay
through a fanout node using a balanced fanout tree and ignores sink polarities. This
heuristic also takes into account variations in sink loads, though we have not assessed
the effectiveness of this technique, on the ground that it is straightforward to implement
and we expect it to have only a second order effect on delay. In the results presented in the
remainder of this thesis, we apply this heuristic during the initial tree covering phase.

4.4 Global Optimization Schemes

In this section we present two iterative delay optimization algorithms. The first
algorithm, presented in section 4.4.1, simply iterates tree covering and fanout optimization.
We call it the simple iterative improvement algorithm. Experimentally this algorithm
converges very rapidly, and the result obtained after one iteration is almost as good as the
final result. Unfortunately these experiments do not provide any information about the
optimality of the final result. To gain some insight into possible sources of suboptimality,
we introduce in section 4.4.2 a simple network for which we can compute an optimal
implementation under the continuous delay model of section 4.2. We simulate the effect of the
simple iterative improvement algorithm on this network under the continuous delay model.
The solution obtained by this algorithm is suboptimal, but only within a few percent of the
optimum solution. Finally, in section 4.4.3 we discuss briefly a new technique proposed by
Yoshikawa et al. [44] to perform iterative improvements that uses a different approach.

4.4.1 Iterative Improvement


circuit area(logarithmic) delay(logarithmic) area(linear) delay(linear) gain(area) gain(delay)
C1355 1192 24.21 1283 26.55 1.08 1.10
C1908 1358 28.24 1374 30.84 1.01 1.09
C2670 1796 20.86 1810 21.28 1.01 1.02
C3540 2857 32.88 2824 37.97 0.99 1.15
C5315 3693 29.31 3907 30.85 1.06 1.05
C6288 8272 95.47 8723 107.51 1.05 1.13
C7552 5322 27.83 5754 29.85 1.08 1.07
alu4 1977 29.24 2017 30.93 1.02 1.06
ampbpreg 3289 35.56 3330 38.52 1.01 1.08
ampbsm 1875 16.55 1995 17.52 1.06 1.06
amppint2 1353 11.29 1388 12.16 1.03 1.08
ampxhdl 1059 12.90 1064 13.46 1.00 1.04
apex6 1912 11.29 2029 13.46 1.06 1.19
des 8632 16.08 8659 17.92 1.00 1.11
dflgrcbl 730 10.39 725 10.39 0.99 1.00
fconrcbi 537 11.00 564 12.38 1.05 1.13
frg2 2367 13.17 2440 13.42 1.03 1.02
k2 2755 16.87 2775 18.13 1.01 1.07
kcctlcb3 557 9.26 567 10.77 1.02 1.16
pair 3956 16.40 3923 20.46 0.99 1.25
rot 1651 19.17 1722 19.26 1.04 1.00
sbiucbl 599 15.66 616 16.90 1.03 1.08
tfaultcbl 439 6.80 443 8.24 1.01 1.21
vda 1522 13.31 1571 14.66 1.03 1.10
x3 2033 10.97 2207 11.42 1.09 1.04
aver - - - - 1.03 1.09

Table 4.2: Effect of Using a Logarithmic vs. Linear Delay Estimate

logarithmic: use a logarithmic model to estimate delay through fanout node
linear: use a linear delay to estimate delay through fanout node
gain: increase in area or in delay obtained by using a linear delay estimate
area: area of the circuit (MCNC lib2 data divided by common divisor 464)
delay: delay of the circuit (MCNC lib2 data in nanoseconds)
aver: geometric average of the gains


circuit area(one iter) delay(one iter) area(three iters) delay(three iters) gain(area) gain(delay)
C1355 1192 24.21 1182 24.61 0.99 1.02
C1908 1358 28.24 1342 27.81 0.99 0.98
C2670 1796 20.86 1768 20.92 0.98 1.00
C3540 2857 32.88 2835 32.49 0.99 0.99
C5315 3693 29.31 3760 28.77 1.02 0.98
C6288 8272 95.47 8413 94.20 1.02 0.99
C7552 5322 27.83 5135 26.89 0.96 0.97
alu4 1977 29.24 1974 29.06 1.00 0.99
ampbpreg 3289 35.56 3283 35.23 1.00 0.99
ampbsm 1875 16.55 1917 16.15 1.02 0.98
amppint2 1353 11.29 1329 11.00 0.98 0.97
ampxhdl 1059 12.90 1069 12.54 1.01 0.97
apex6 1912 11.29 1800 11.38 0.94 1.01
des 8632 16.08 8407 16.09 0.97 1.00
dflgrcbl 730 10.39 727 10.39 1.00 1.00
fconrcbl 537 11.00 531 11.00 0.99 1.00
frg2 2367 13.17 2363 13.27 1.00 1.01
k2 2755 16.87 2778 16.82 1.01 1.00
kcctlcb3 557 9.26 553 9.28 0.99 1.00
pair 3956 16.40 3931 16.29 0.99 0.99
rot 1651 19.17 1671 18.79 1.01 0.98
sbiucbl 599 15.66 634 14.93 1.06 0.95
tfaultcbl 439 6.80 440 6.52 1.00 0.96
vda 1522 13.31 1542 13.50 1.01 1.01
x3 2033 10.97 2011 10.99 0.99 1.00
aver - - - - 1.00 0.99

Table 4.3: Effect of Iterative Improvement

one iter: one iteration of tree covering and fanout optimization
three iters: three iterations of tree covering and fanout optimization
gain: gain in area or in delay obtained by using iterative improvement
area: area of the circuit (MCNC lib2 data divided by common divisor 464)
delay: delay of the circuit (MCNC lib2 data in nanoseconds)
aver: geometric average of the gains


algorithm iterative_improvement
    foreach node v visited in topological order from inputs to outputs do {
        if v is a fanin node {
            apply tree covering
        } else {
            apply fanout optimization, taking all required times to be equal
        }
    }
    do {
        foreach fanout node v visited in topological order from outputs to inputs do {
            apply fanout optimization
        }
        foreach fanin node v visited in topological order from inputs to outputs do {
            apply tree covering
        }
    } until network delay does not decrease
    foreach fanout node v visited in topological order from outputs to inputs do {
        apply fanout optimization in area recovery mode
    }
end iterative_improvement

Figure 4.5: A Simple Iterative Improvement Algorithm
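The control flow of Figure 4.5 can be sketched as a small driver that receives the individual passes as callbacks. This is only a skeleton under stated assumptions: the callback names, the `max_iters` safety bound and the returned delay history are all hypothetical additions, and the passes themselves (tree covering, fanout optimization) are left abstract.

```python
from typing import Callable, List

def iterative_improvement(
    initial_cover: Callable[[], None],       # first loop of Figure 4.5
    fanout_pass: Callable[[], None],         # fanout nodes, outputs to inputs
    tree_cover_pass: Callable[[], None],     # fanin nodes, inputs to outputs
    area_recovery_pass: Callable[[], None],  # final fanout pass, area mode
    network_delay: Callable[[], float],
    max_iters: int = 10,                     # hypothetical safety bound
) -> List[float]:
    """Run the passes in the order of Figure 4.5 and return the delay history."""
    initial_cover()
    history = [network_delay()]
    for _ in range(max_iters):
        fanout_pass()
        tree_cover_pass()
        history.append(network_delay())
        if history[-1] >= history[-2]:       # delay did not decrease: stop
            break
    area_recovery_pass()
    return history

# Toy usage with scripted delays in place of a real network.
scripted = iter([24.21, 23.90, 23.90])
trace: List[str] = []
history = iterative_improvement(
    initial_cover=lambda: trace.append("cover"),
    fanout_pass=lambda: trace.append("fanout"),
    tree_cover_pass=lambda: trace.append("tree"),
    area_recovery_pass=lambda: trace.append("area"),
    network_delay=lambda: next(scripted),
)
print(history, trace)
```

Passing the passes as callables keeps the loop structure separate from the mapping machinery, which is convenient for experimenting with stopping rules on a simulated delay model.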

The simple iterative improvement algorithm is sketched in Figure 4.5. After an initial
implementation has been built with tree covering, using the heuristic of section 4.3 to
estimate delay through fanout nodes, the algorithm iterates fanout optimization and tree
covering. As long as fanout optimization is done in topological order from outputs to inputs,
fanout problems do not interact with each other, and the optimal solution with respect to
fanout optimization can be achieved. During tree covering however, we need to evaluate
the delay through fanout nodes. For that purpose, we use the fanout trees built at the
previous fanout optimization pass. In this case, we need to take into account the polarity
at which a signal is needed at a gate input, otherwise we obtain worse results. The results
obtained by this algorithm after three iterations are reported in Table 4.3 and compared
with the results obtained with only one iteration. The advantage of iterating is negligible
on average, with a decrease in delay of 1% for no cost in area. In some examples, the delay
actually increases. This can be explained by the imprecision introduced by different rise
and fall delays. There is no observable decrease in delay after the third iteration.

[Figure 4.6: A More Complex Example. Labels recovered from the scan: primary outputs r2, r3; primary inputs a1, a2.]

4.4.2 Optimality of Iterative Improvement

The example of section 4.2.5 shows a suboptimal implementation that cannot
be improved by our iterative improvement algorithm. However, the heuristic we use to
compute the initial implementation finds the optimal solution after the first iteration for
this example. In other words, this example does not show any evidence that our approach is
suboptimal. In this section, we study a similar but more complex example, and detail how
the iterative improvement algorithm works on this example. We use the same delay model
and hypotheses as in section 4.2.5, and solve the minimization problem analytically.

A More Complex Example As with the previous example, we are interested in
computing the set of points $(r_1, r_2, r_3)$ that are realizable for given values $(a_1, a_2)$ of the arrival
times, and dominated by no other solutions, i.e. the Pareto optimal points of this delay
minimization problem. We use the circuit in Figure 4.6. To simplify the algebraic
manipulations, we perform the following substitution of variables:

$$R_i = e^{r_i}$$

$$A_i = e^{a_i}$$

$$X_i = e^{x_i}$$

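The Pareto optimum points satisfy the following equations:

$$R_1 = X_1 + X_2$$

$$R_2 = X_5 + X_6$$

$$R_3 = X_7 + X_8$$

$$C_1 = \frac{1}{X_1} + \frac{1}{X_3} + \frac{1}{X_5} - \frac{1}{A_1} = 0$$

$$C_2 = \frac{1}{X_2} + \frac{1}{X_4} + \frac{1}{X_8} - \frac{1}{A_2} = 0$$

$$C_3 = \frac{1}{X_6} + \frac{1}{X_7} - \frac{1}{X_3 + X_4} = 0$$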
Since the optimization problem we are trying to solve is convex, the Pareto optimal points
have a simple characterization. For each Pareto optimal point p, there is a hyperplane H
containing p such that all points satisfying the constraints $(C_1, C_2, C_3)$ are on the same side
of the hyperplane H. That is, for each Pareto optimal point p there is a triplet (a, b, c) such
that p is a solution of the following convex optimization problem:

$$\begin{aligned}
\min_{X} \quad & a R_1(X) + b R_2(X) + c R_3(X) \\
\text{subject to} \quad & C_i(X) = 0, \quad 1 \le i \le 3
\end{aligned}$$

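Thus at a Pareto optimal point, there is a linear combination of $(dR_1, dR_2, dR_3)$ that is a
linear combination of $(dC_1, dC_2, dC_3)$. In our example, this condition is equivalent to saying
that the following matrix is of rank 2:

$$\begin{pmatrix}
\frac{1}{X_1^2} & \frac{1}{X_5^2} & 0 & \frac{1}{X_3^2} & 0 \\
-\frac{1}{X_2^2} & 0 & \frac{1}{X_8^2} & 0 & \frac{1}{X_4^2} \\
0 & -\frac{1}{X_6^2} & -\frac{1}{X_7^2} & -\frac{1}{(X_3 + X_4)^2} & -\frac{1}{(X_3 + X_4)^2}
\end{pmatrix}$$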
In turn, since the $X_i$ have a range limited to the interval $(0, +\infty)$, this condition is equivalent
to the following set of equations:

$$X_1 X_6 X_8 = X_2 X_5 X_7$$

$$X_3 X_6 = X_5 (X_3 + X_4)$$

$$X_4 X_7 = X_8 (X_3 + X_4)$$

which is equivalent to:

$$X_1 X_4 = X_2 X_3$$

$$X_3 X_6 = X_5 (X_3 + X_4)$$

$$X_4 X_7 = X_8 (X_3 + X_4)$$

We can derive from these equations a parametric representation of the set of Pareto points
for this example. We take as parameters the following quantities:

$$T = \frac{X_1}{X_2} = \frac{X_3}{X_4}, \qquad U = \frac{1}{X_2}$$

The Pareto points are characterized by the following parametrized representation:

$$R_1 = \frac{T + 1}{U}$$

$$R_2 = \frac{3 (2T + 1)}{\frac{2T}{A_1} - \frac{1}{A_2} - U}$$

$$R_3 = \frac{3 (T + 2)}{\frac{2}{A_2} - \frac{T}{A_1} - U}$$

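The values of the intermediate variables can be rederived from the following equations:

$$\frac{1}{X_2} = U$$

$$\frac{1}{X_4} = \frac{1}{3} \left(\frac{T}{A_1} + \frac{1}{A_2} - 2U\right)$$

$$\frac{1}{X_5} = \frac{1}{3T} \left(\frac{2T}{A_1} - \frac{1}{A_2} - U\right)$$

$$\frac{1}{X_8} = \frac{1}{3} \left(\frac{2}{A_2} - \frac{T}{A_1} - U\right)$$

$$X_1 = T X_2$$

$$X_3 = T X_4$$

$$X_6 = \left(1 + \frac{1}{T}\right) X_5$$

$$X_7 = (1 + T) X_8$$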


init init optimum p = 0.0 p = 0.0 p = 0.5 p = 0.5 p = 0.9 p = 0.9
a1 a2 arrival arrival # iter arrival # iter arrival # iter
0.0 0.0 2.398 2.485 1 2.402 12 2.398 79
2.0 0.0 3.885 3.968 3 3.899 13 3.892 84
4.0 0.0 5.805 5.885 2 5.821 17 5.813 79

Table 4.4: Iterative Improvement vs. Optimal Assignment

init: arrival times at the primary inputs
optimum: optimum solution (derived analytically)
p = x: iterative algorithm with rate $r_i = 1 - p^i$
arrival: worst case arrival time at the primary outputs
# iter: number of iterations to reach convergence within $10^{-4}$

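In particular, the Pareto optimal point satisfying $R_1 = R_2 = R_3$ is characterized by the
following two equations:

$$U = \frac{T + 1}{7T + 4} \left(\frac{2T}{A_1} - \frac{1}{A_2}\right)$$

$$U = \frac{T + 1}{4T + 7} \left(\frac{2}{A_2} - \frac{T}{A_1}\right)$$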
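The "optimum" column of Table 4.4 can be recomputed from the equal-output condition. The sketch below is a reconstruction under stated assumptions rather than the author's own procedure: it takes the parametrization R1 = (T + 1)/U, R2 = 3(2T + 1)/(2T/A1 - 1/A2 - U), R3 = 3(T + 2)/(2/A2 - T/A1 - U), which follows from the constraints of this example, and uses the fact that eliminating U from R1 = R2 = R3 leaves the quadratic 5 T^2/A1 + 6 (1/A1 - 1/A2) T - 5/A2 = 0 in T. All names are illustrative.

```python
import math

def balanced_optimum(a1, a2):
    """Return (T, U, arrivals) for the Pareto point with R1 = R2 = R3."""
    c1, c2 = math.exp(-a1), math.exp(-a2)        # 1/A1 and 1/A2
    d = c1 - c2
    # Positive root of 5*c1*T^2 + 6*(c1 - c2)*T - 5*c2 = 0.
    t = (-6.0 * d + math.sqrt(36.0 * d * d + 100.0 * c1 * c2)) / (10.0 * c1)
    # U from the condition R1 = R2.
    u = (t + 1.0) / (7.0 * t + 4.0) * (2.0 * t * c1 - c2)
    r1 = (t + 1.0) / u
    r2 = 3.0 * (2.0 * t + 1.0) / (2.0 * t * c1 - c2 - u)
    r3 = 3.0 * (t + 2.0) / (2.0 * c2 - t * c1 - u)
    return t, u, (math.log(r1), math.log(r2), math.log(r3))

for a1, a2 in ((0.0, 0.0), (2.0, 0.0), (4.0, 0.0)):
    t, u, arrivals = balanced_optimum(a1, a2)
    print(a1, a2, t, u, arrivals)
```

For equal input arrival times the root is T = 1, giving U = 2/11 and a common output value R = 11, whose logarithm agrees with the first row of the table.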

Application of the Iterative Algorithm We applied our iterative algorithm to this
example with three different pairs of values for the arrival times: (0,0), (2,0) and (4,0).
The results are reported in Table 4.4, under the column p = 0. In all cases, the iterative
algorithm converges very rapidly, but the solution is not optimal.
Closer inspection of the solution produced by the algorithm reveals a reason why
the algorithm does not converge towards an optimal solution. When arrival times at the
primary inputs are equal, the iterative algorithm compensates the imbalances introduced
by the long path through nodes 3 and 5 entirely at node 5, while the optimal solution
distributes the compensation of the imbalance between node 0 and node 5. A similar
phenomenon occurs with unbalanced arrival times at the primary inputs. The iterative
algorithm attempts to correct imbalances in a greedy fashion, which is suboptimal in general.
To determine whether the greedy compensation of imbalances is the only source
of suboptimality in the iterative algorithm, we performed the following experiment. We
modified the iterative algorithm to limit the rate r at which imbalances are corrected during
a given iteration. A rate of r = 50% would mean that only half of the imbalance is corrected
by the algorithm. We modified the iterative algorithm to use a rate of $r_i = 1 - p^i$ at the
i-th iteration, where $0 \le p < 1$. The original algorithm corresponds to the special case
of p = 0. By decreasing r towards 0, the iterative algorithm is given more opportunity
to distribute imbalance corrections properly throughout the network, at the cost of more
iterations. The results for p = 0.0, p = 0.5 and p = 0.9 are given in Table 4.4.
The results obtained with p = 0.0 confirm our earlier experimental results that the
simple iterative improvement algorithm converges very rapidly. In all cases, the results for
p = 0.0 after only one iteration (our default iterative improvement algorithm) are within
2.5% of the optimum. By increasing p, we were able to improve further the quality of the
final result at the cost of more iterations to reach a good quality result. For p = 0.9, the
results after one iteration are only within 27% of the optimum, while the results after 100
iterations are within 0.2% of the optimum in all three examples.
These results provide, in a limited way, some solid evidence of the effectiveness of
our simple iterative improvement algorithm, and confirmation th at one iteration is sufficient
to obtain good quality results.

4.4.3 Criticality Based Iteration

Yoshikawa et al. [44] have suggested a different technique to perform iterative
improvement, based on the criticality of nodes. Their approach borrows the notion of
ε-critical subnetwork from Singh et al. [41]. The ε-critical subnetwork of a Boolean network,
for a given value of ε ≥ 0, is the subnetwork composed of the nodes and edges whose slack is
within ε of the slack on a critical path. If ε = 0, the ε-critical subnetwork is simply composed
of the set of critical paths of the network. If all paths of the ε-critical subnetwork are sped
up by some constant δ, we can only guarantee that the circuit is sped up by min(ε, δ). The
choice of ε is a tradeoff between the amount of computation to be done to optimize the
network and the amount of improvement to be expected at each iteration.
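The definition of the ε-critical subnetwork can be illustrated on a small network. The sketch below is an assumption-laden illustration, not the procedure of Yoshikawa et al. or Singh et al.: it uses a fixed delay per node, requires every output at the latest arrival time, and defines slack as required time minus arrival time. The names `epsilon_critical_nodes`, `fanins` and `delay` are hypothetical.

```python
from graphlib import TopologicalSorter  # Python 3.9+


def epsilon_critical_nodes(fanins, delay, eps):
    """Return the nodes whose slack is within eps of the critical-path slack.

    fanins maps a node to the list of nodes driving it; delay maps every
    node to its own delay. Both are hypothetical inputs for this sketch.
    """
    order = list(TopologicalSorter(fanins).static_order())  # fanins first
    arrival = {}
    for v in order:
        arrival[v] = delay[v] + max((arrival[u] for u in fanins.get(v, ())), default=0.0)
    latest = max(arrival.values())  # required time assumed at every output
    fanouts = {v: [] for v in order}
    for v in order:
        for u in fanins.get(v, ()):
            fanouts[u].append(v)
    required = {}
    for v in reversed(order):  # fanouts are visited before their fanins
        required[v] = min((required[w] - delay[w] for w in fanouts[v]), default=latest)
    slack = {v: required[v] - arrival[v] for v in order}
    critical = min(slack.values())
    return {v for v in order if slack[v] <= critical + eps}


fanins = {"c": ["a", "b"], "d": ["c"], "e": ["b"], "out": ["d", "e"]}
delay = {"a": 1, "b": 1, "c": 1, "d": 1, "e": 1.5, "out": 1}
print(sorted(epsilon_critical_nodes(fanins, delay, 0.0)))
print(sorted(epsilon_critical_nodes(fanins, delay, 0.5)))
```

In this toy network node e has a slack of 0.5, so it is excluded for ε = 0 and included for ε = 0.5. A larger ε gives the optimizer a larger region to work on, at the cost of more computation.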
Their algorithm alternates between two optimizations. The first optimization computes
a minimum weight node cutset through an ε-critical subnetwork, where the weights
are used to direct the cutset towards nodes with higher potential for delay reduction. The
second optimization computes a minimum weight cutset across a region outside the ε-critical
subnetwork. In their approach, a node groups together a fanin node and the fanout node
connected to its outputs.

Their main result is that by alternating between cutsets chosen within an ε-critical
subnetwork and cutsets taken outside an ε-critical subnetwork, they obtain better results
than by optimizing cutsets only chosen within an ε-critical subnetwork. In their
experiments, they obtained a 36% improvement by alternating cutsets vs. 28% by using only
critical cutsets. This phenomenon can be explained as follows in
the case of fanout optimization. A non-critical path may have a slack of 0.5 units of delay,
which is enough to make it non-critical, but may be insufficient to allow a different selection
of a fanout tree that would make the critical path faster. By optimizing the non-critical
path, we may be able to increase the slack to 1 unit of delay, which may be sufficient to
allow the selection of a different fanout tree that would make the critical path faster. A
similar phenomenon can also occur in the case of fanin optimization.
Their results provide an additional justification of our global approach to iterative
improvement. By isolating critical subnetworks and concentrating the effort of the local
optimizers on critical nodes, global optimization algorithms produce lower quality results
because they do not fully exploit the slack that could be made available on non-critical parts
of the circuit. Our approach, which does not distinguish at all between critical and
non-critical paths, has, in addition to its simplicity, the advantage of avoiding this problem. Its
only drawback is that it may optimize more than necessary; but we have provided enough
evidence in this work that area reclamation after delay optimization can be effective in
keeping area increases within reasonable limits.

4.5 Beyond Tree-Based Optimization

In the previous section, we provided some evidence that we have reached the
limits of the reductions we can obtain with tree-based delay optimization algorithms. In
this section, we propose two additional delay optimization techniques that do not respect
tree boundaries but are nevertheless simple modifications of tree-based delay optimization
algorithms.
The first of these optimizations allows the duplication of fanin trees in order to
provide the output of a fanin node in both polarities. This is achieved by implementing
a fanin node with the fastest tree independently for each polarity. This optimization is
discussed in section 4.5.1. We have implemented a simple version of this optimization and
we provide experimental results.

The second of these optimizations, presented in section 4.5.2, allows the overlap
of the implementation of fanin trees as shown in Figure 4.2 in order to reduce delay. This
optimization can decrease delay significantly, but may lead to substantial area increases.

4.5.1 Phase Optimization by Tree Duplication

We showed in section 4.3 the role of phase assignment on the quality of tree-based
delay optimization. By biasing tree covering in favor of a given polarity, we could slow down
a circuit by an average of 11%. By using a heuristic that attributes the same arrival times
to signals of different polarities, we essentially make sure that the best phase is used for
any given tree in the absence of any accurate information concerning the arrival times of
signals. If all signals at the output of a fanout node are required with the same phase, there
is no problem. This is not the case in general. If a signal is required under both polarities,
at least one critical sink will receive the signal delayed by one inverter.
The problem occurs because we limit ourselves to one covering per fanin node. If
we allow tree duplication, we can cover each fanin node twice, each cover producing the
signal in a different polarity. These two trees may overlap and share some logic if there is
no advantage in duplicating them further. This optimization has two advantages:

• in the case of small fanouts with signals needed in different polarities, it can remove
one inverter delay if both trees can produce the signal with similar arrival times.

• for large fanouts, it decreases the need for deeper fanout trees by providing an
additional source that can provide signals to one half of the sinks.

On the other hand, this optimization has two potential drawbacks: it may be wasteful in
area and it does not preserve testability [38]. Unnecessary logic duplication can be controlled
easily using the same technique that we use during fanout optimization. In the first pass
of fanout optimization, we select the best solution at each node, which may require the use
of tree duplication on the source node. In the second pass of fanout optimization, the area
recovery pass, we can eliminate one cover of the source node if this transformation does
not slow down the circuit. Removing redundancies is a more time-consuming operation,
but can only decrease delay and area, provided that there are no false paths in the circuit
[33, 25].
We have implemented a simple version of this optimization. Our implementation
has the following limitations: it does not take into account the cost of tree duplication in
terms of extra fanout loads at the inputs of a fanin node; it uses a simple-minded allocation
of sinks to the two sources made available by tree duplication; and it does not perform
redundancy removal after tree duplication.

circuit         nodup            dup            gain
             area   delay    area   delay    area  delay
C1355        1192   24.21    1555   22.54    1.30   0.93
C1908        1358   28.24    1553   26.40    1.14   0.93
C2670        1796   20.86    1885   20.38    1.05   0.98
C3540        2857   32.88    3024   33.22    1.06   1.01
C5315        3693   29.31    4086   28.69    1.11   0.98
C6288        8272   95.47    9485   92.66    1.15   0.97
C7552        5322   27.83    5894   27.02    1.11   0.97
alu4         1977   29.24    2052   30.55    1.04   1.04
ampbpreg     3289   35.56    3333   34.90    1.01   0.98
ampbsm       1875   16.55    2004   16.05    1.07   0.97
amppint2     1353   11.29    1357   11.48    1.00   1.02
ampxhdl      1059   12.90    1128   12.35    1.07   0.96
apex6        1912   11.29    1951   10.73    1.02   0.95
des          8632   16.08    9047   16.15    1.05   1.00
dflgrcbl      730   10.39     755    9.19    1.03   0.88
fconrcbl      537   11.00     587   10.62    1.09   0.97
frg2         2367   13.17    2445   13.24    1.03   1.01
k2           2755   16.87    2958   16.12    1.07   0.96
kcctlcb3      557    9.26     563    8.47    1.01   0.91
pair         3956   16.40    4047   16.24    1.02   0.99
rot          1651   19.17    1651   18.73    1.00   0.98
sbiucbl       599   15.66     609   15.19    1.02   0.97
tfaultcbl     439    6.80     439    6.80    1.00   1.00
vda          1522   13.31    1618   13.28    1.06   1.00
x3           2033   10.97    2035   10.57    1.00   0.96
aver            -       -       -       -    1.06   0.97

Table 4.5: Effect of Tree Duplication

nodup: tree covering and fanout optimization without tree duplication
dup: tree covering and fanout optimization with tree duplication
gain: gain in area or in delay obtained by using tree duplication (ratio dup / nodup)
area: area of the circuit (MCNC lib2 data divided by common divisor 464)
delay: delay of the circuit (MCNC lib2 data in nanoseconds)
aver: geometric average of the gains

Given two sources, $S$ and $\overline{S}$, providing the same signal under differing polarities,
for a fanout node, we perform fanout optimization as follows. We partition the sinks into
two sets: the set P of sinks with positive polarity, and the set N of sinks with negative
polarity. We try all 4 possible assignments of P and N to $S$ and $\overline{S}$ and perform fanout
optimization in all 4 cases; i.e. we consider implementing the problem with $S$ alone, $\overline{S}$ alone,
or both $S$ and $\overline{S}$. The solution with the smallest delay is retained. In case of equality, the
solution which uses only one source is chosen and one tree is discarded.
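The selection rule just described can be sketched as follows. This is a minimal illustration, not the implementation used for the experiments: `delay_of` is a hypothetical callback standing in for a run of fanout optimization, and `toy_delay` is an invented cost model that assumes the source S provides the positive phase.

```python
from itertools import product


def best_sink_assignment(delay_of):
    """Try the 4 assignments of the sink sets P and N to the sources S and S_bar.

    delay_of(p_src, n_src) is a hypothetical callback returning the delay
    obtained when P is driven by p_src and N is driven by n_src.
    """
    candidates = []
    for p_src, n_src in product(("S", "S_bar"), repeat=2):
        sources_used = len({p_src, n_src})  # 1 when a single tree is enough
        candidates.append((delay_of(p_src, n_src), sources_used, p_src, n_src))
    # Smallest delay wins; on equal delay the single-source solution is preferred.
    delay, _, p_src, n_src = min(candidates)
    return p_src, n_src, delay


def toy_delay(p_src, n_src):
    # Invented model: one inverter delay when a sink set is fed by the wrong
    # phase, plus a load penalty when both sets share the same source.
    inv_p = 0.0 if p_src == "S" else 1.0
    inv_n = 0.0 if n_src == "S_bar" else 1.0
    shared_load = 0.4 if p_src == n_src else 0.0
    return max(inv_p, inv_n) + shared_load


print(best_sink_assignment(toy_delay))
```

Sorting on the pair (delay, number of sources used) encodes the tie-breaking rule directly: a two-source solution is kept only when it is strictly faster.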
We have implemented this optimization, and report the results of our experiments
in Table 4.5. We achieved an average delay reduction of 3% for an average area increase of
6%. Additional delay reductions should be achievable by using a better sink assignment
algorithm, and a more flexible tree duplication policy, allowing in particular the duplication
of sources of the same polarity.
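The averages quoted here are geometric averages of the per-circuit gains. As a rough check, the following sketch recomputes the average delay gain from the rounded values printed in the delay gain column of Table 4.5. Because it starts from rounded gains rather than from the raw delays, it only approximates the computation behind the table.

```python
import math


def geometric_mean(values):
    """Geometric mean of strictly positive values."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Delay gains (dup / nodup) as printed in Table 4.5, one entry per circuit.
delay_gains = [0.93, 0.93, 0.98, 1.01, 0.98, 0.97, 0.97, 1.04, 0.98, 0.97,
               1.02, 0.96, 0.95, 1.00, 0.88, 0.97, 1.01, 0.96, 0.91, 0.99,
               0.98, 0.97, 1.00, 1.00, 0.96]

print(round(geometric_mean(delay_gains), 2))  # prints 0.97, i.e. a 3% reduction
```

A geometric average is the natural choice for ratios: a gain of 2.0 on one circuit and 0.5 on another average to 1.0, which an arithmetic mean would not give.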

4.5.2 Allowing Overlaps between Trees

We gave in Figure 4.1 an example of a circuit that could be mapped more efficiently
if overlaps between trees were allowed. However, it is not clear how much delay improvement
could be obtained in general by allowing fanin trees to overlap. We performed an experiment
to answer this question.
Since the misII tree covering algorithm is based on direct pattern matching, it is a
simple matter to modify it to allow overlaps between trees. If overlaps between trees are
allowed, the number and position of multiple fanout points can be modified arbitrarily
by the covering algorithm. If a point p is originally a multiple fanout point, we predict
the arrival time at p using the heuristics of section 4.3. Otherwise, the arrival time at p is
predicted as if p had a fanout of 1, even if its fanout may increase due to an overlap between
trees.
The effect of allowing tree overlaps is reported in Table 4.6. The reduction in delay
obtained by allowing tree overlaps is significant: an average of 9%. However, as predicted,
this reduction in delay comes with a heavy price in area: an average increase of 44%. These
results indicate that better delays can be achieved by relaxing the constraints imposed by
tree-based delay optimization, but more work is required to control the penalty in area.
Combining tree overlap with the tree duplication algorithm of section 4.5.1 has no effect,
since allowing overlaps allows in particular the duplication of trees in different phases.

circuit       no overlap       overlap          gain
             area   delay    area   delay    area  delay
C1355        1192   24.21    1743   21.96    1.46   0.91
C1908        1358   28.24    2042   25.69    1.50   0.91
C2670        1796   20.86    2693   17.88    1.50   0.86
C3540        2857   32.88    4461   31.04    1.56   0.94
C5315        3693   29.31    6108   26.50    1.65   0.90
C6288        8272   95.47   15083   86.18    1.82   0.90
C7552        5322   27.83    8959   26.19    1.68   0.94
alu4         1977   29.24    3289   29.10    1.66   1.00
ampbpreg     3289   35.56    5026   30.10    1.53   0.85
ampbsm       1875   16.55    2702   14.62    1.44   0.88
amppint2     1353   11.29    1710    9.49    1.26   0.84
ampxhdl      1059   12.90    1489   11.46    1.41   0.89
apex6        1912   11.29    2288   10.53    1.20   0.93
des          8632   16.08   14254   15.88    1.65   0.99
dflgrcbl      730   10.39     901    9.53    1.23   0.92
fconrcbl      537   11.00     669    8.77    1.25   0.80
frg2         2367   13.17    3183   11.51    1.34   0.87
k2           2755   16.87    4146   15.01    1.50   0.89
kcctlcb3      557    9.26     740    8.17    1.33   0.88
pair         3956   16.40    5214   15.58    1.32   0.95
rot          1651   19.17    2168   17.30    1.31   0.90
sbiucbl       599   15.66     873   15.35    1.46   0.98
tfaultcbl     439    6.80     527    6.57    1.20   0.97
vda          1522   13.31    2390   11.96    1.57   0.90
x3           2033   10.97    2650   10.72    1.30   0.98
aver            -       -       -       -    1.44   0.91

Table 4.6: Effect of Allowing Tree Overlaps

no overlap: tree covering and fanout optimization without tree overlaps
overlap: tree covering and fanout optimization allowing tree overlaps
gain: gain in area or in delay obtained by allowing tree overlap (ratio overlap / no overlap)
area: area of the circuit (MCNC lib2 data divided by common divisor 464)
delay: delay of the circuit (MCNC lib2 data in nanoseconds)
aver: geometric average of the gains

Limiting Overlaps  It is possible to reduce the increase in area by limiting the overlap
between trees. A simple way to enforce this limit is to allow overlaps to take place only
over nodes that have a fanout of K or less, for some constant K. This simple technique
is an efficient way to reclaim area because most of the delay reduction can be obtained by
allowing overlaps over nodes with small fanouts, and a large fraction of the area increase is
due to overlaps allowed over nodes with large fanouts. The results of allowing tree overlaps
only on nodes with five or fewer fanouts are reported in Table 4.7. By limiting overlaps, we
achieved an average delay reduction of 8% for a cost in area of 28%.
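The fanout limit amounts to a simple filter on the multiple fanout points that a tree cover is allowed to cross. The sketch below is illustrative only: `overlap_candidates` and the `fanout_count` dictionary are hypothetical names, and the way such a filter would be wired into the covering algorithm is not shown.

```python
def overlap_candidates(fanout_count, k=5):
    """Return the multiple fanout nodes that covers may overlap when the
    overlap is limited to nodes with a fanout of k or less."""
    return {node for node, n in fanout_count.items() if 1 < n <= k}


# Hypothetical fanout counts for five internal nodes of a network.
fanout_count = {"a": 1, "b": 2, "c": 5, "d": 6, "e": 12}
print(sorted(overlap_candidates(fanout_count)))
```

Nodes with a single fanout already lie inside a tree, so only nodes b and c are affected by the limit K = 5 in this example; d and e remain tree boundaries.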

4.6 Conclusion

We have provided an abstract framework for understanding tree-based delay
optimization, and formulated the tree-based delay optimization problem as a convex
optimization problem. Unfortunately, we do not have an analytic formula for the functions to be
optimized. Also, with the number of variables in the problem being usually large even for
medium size circuits, it seems impractical to use this formulation directly. Instead we
proposed a simple iterative technique that is guaranteed to produce solutions that cannot be
improved by local transformations of the circuit. Unfortunately, we were able to exhibit a
class of circuits showing that locally optimum implementations with respect to local circuit
transformations can be arbitrarily far removed from the optimum solution. Even though
the physical limitations of our model make this impossible in practice outside a fixed range
of values, these examples strongly suggest that algorithms based on iterative improvement
by local transformations are very limited in their optimization power. Finding an efficient
optimization technique that would directly exploit the convexity of the search space remains
an open problem.
On the practical side, we have shown experimentally the effect of heuristics to
estimate the arrival time at a multiple fanout point; in particular, we have shown that
these heuristics should give, at the first application of tree covering, the same delay for both
signal polarities, while the actual delay value is less critical to the quality of the result. We
have also proposed to use logic duplication to provide a signal in both phases at a multiple
fanout point and shown that this technique can lead to additional delay reductions at little
cost in area. More work needs to be done to preserve the testability of the circuit during
this transformation.

circuit       no overlap      overlap-5         gain
             area   delay    area   delay    area  delay
C1355        1192   24.21    1967   22.20    1.65   0.92
C1908        1358   28.24    1890   26.59    1.39   0.94
C2670        1796   20.86    2400   20.53    1.34   0.98
C3540        2857   32.88    3799   31.51    1.33   0.96
C5315        3693   29.31    5494   26.56    1.49   0.91
C6288        8272   95.47   14534   87.28    1.76   0.91
C7552        5322   27.83    8254   26.98    1.55   0.97
alu4         1977   29.24    2542   28.24    1.29   0.97
ampbpreg     3289   35.56    4255   30.45    1.29   0.86
ampbsm       1875   16.55    2229   14.24    1.19   0.86
amppint2     1353   11.29    1473    9.92    1.09   0.88
ampxhdl      1059   12.90    1244   11.82    1.17   0.92
apex6        1912   11.29    2213   10.23    1.16   0.91
des          8632   16.08    9869   16.12    1.14   1.00
dflgrcbl      730   10.39     879    9.41    1.20   0.91
fconrcbl      537   11.00     667    9.55    1.24   0.87
frg2         2367   13.17    2437   11.95    1.03   0.91
k2           2755   16.87    3891   15.10    1.41   0.90
kcctlcb3      557    9.26     684    8.09    1.23   0.87
pair         3956   16.40    4565   15.30    1.15   0.93
rot          1651   19.17    1997   17.17    1.21   0.90
sbiucbl       599   15.66     820   15.00    1.37   0.96
tfaultcbl     439    6.80     527    6.57    1.20   0.97
vda          1522   13.31    2057   12.20    1.35   0.92
x3           2033   10.97    2312   10.62    1.14   0.97
aver            -       -       -       -    1.28   0.92

Table 4.7: Effect of Limiting Tree Overlaps

no overlap: tree covering and fanout optimization without tree overlaps
overlap-5: tree overlaps over nodes with five or fewer fanouts
gain: gain in area or in delay obtained by allowing limited tree overlaps (ratio overlap-5 / no overlap)
area: area of the circuit (MCNC lib2 data divided by common divisor 464)
delay: delay of the circuit (MCNC lib2 data in nanoseconds)
aver: geometric average of the gains

[Figure 4.7 plot. Horizontal axis: area; vertical axis: delay; both axes marked at 1.0 and 2.0.
Plotted points: minimum area; minimum area + fanout; minimum delay + fanout;
minimum delay + overlap + fanout.]

Figure 4.7: Area / Delay Tradeoff

In total, we have provided several methods to perform technology mapping, which
provide a wide tradeoff between delay and area:

• minimum area tree covering with no fanout optimization.

• minimum area tree covering with fanout optimization.

• minimum delay tree covering with fanout optimization.

• minimum delay tree covering with limited overlaps and fanout optimization.

The average effect of these four methods is indicated in Figure 4.7. In area, all data are
relative to the minimum area mapping. In delay, the data are relative to the minimum delay tree
covering with overlaps and fanout optimization.

Chapter 5

Technology Independent Delay Optimizations

Monde nouveau, tu m’appartiens!
Sois donc à moi, ô beau pays!
Monde nouveau, tu m’appartiens!
Sois donc à moi!
— GIACOMO MEYERBEER, L’Africaine

5.1 Introduction

In the previous chapter, we presented several techniques for the efficient integration
of tree covering and fanout optimization and provided empirical evidence of the efficiency
of some of these methods. The purpose of this chapter is to examine the effect of technology
independent logic transformations on circuit delay. We do not introduce any new technology
independent algorithms to reduce delay. The originality and interest of this study come
from the fact that we now have at our disposal an efficient technology mapper on which
we can rely to estimate delay. Similar data previously reported in the literature are usually
of limited accuracy because they do not take into account the corrective effect of powerful
optimization techniques such as fanout optimization.
The first step of this empirical study is to measure the effect of literal count
minimization on circuit delay and area. Literal count minimization has been used as the main
objective of technology independent optimization in logic synthesis because it correlates
well with final circuit area [31]. The effect of literal count minimization is to simplify and
factorize the logic. By eliminating logic, simplification helps both in terms of delay and in
terms of area. However, factorization usually trades off delay for better area. We present in
section 5.2 empirical data that confirm that literal count minimization has unpredictable
effects on delay.
Since our objective is to minimize delay more than area, we should control the
use of factorization, and concentrate our efforts on logic simplification instead. We have
not fully explored the modification of the many technology independent transformations
available in a logic synthesis tool such as misII, but we examine in section 5.3 the effect
of a controlled use of a few of these transformations that could lead to a substantial area
reduction at a much smaller cost in delay than uncontrolled literal count minimization.
However, if delay is our principal objective, we should be able to produce fast
circuits simply by flattening the logic. To flatten the logic, we collapse the Boolean network
into a graph with only one level of nodes. Each of the nodes has a function associated with
it that can be represented in sum-of-products form. In other words, collapsing to one level
of nodes can be seen as collapsing to two levels of logic if no fanin or fanout limitation is
enforced. Collapsing only helps in reducing delay for a certain set of circuits. For many
circuits, collapsing introduces such a large amount of logic duplication that even delay
increases. Nevertheless, when it applies, collapsing is a simple and very efficient technology
independent delay optimization technique. We discuss the effect of collapsing in more detail
in section 5.4.
Since network collapsing is such an effective technique at reducing circuit delay, it
is worth investigating whether partial collapsing can be used when total collapsing is not
practical. To decide which parts of a network should be collapsed, we use an algorithm
developed by Lawler et al. [28]. Lawler’s algorithm can be viewed as the technology
independent analog of the extension of tree covering allowing overlaps between trees that
we introduced in section 4.5.2. The main drawback of this algorithm is that it tends to
increase area, but some of this area can be recovered by using the controlled literal count
reduction techniques presented in section 5.3.
There has been some previous work in the area of logic restructuring for delay. The
most notable effort in this direction was speedup, by Singh et al. [41], which performs local
collapsing and factorization in order to reduce the number of levels of logic a signal has to
traverse while controlling the increase in area incurred by collapsing. The work by Fishburn
[15] is based on a similar idea, though the restructuring is performed differently. Another
technique, which is more efficient in terms of area but more computationally intensive, was
proposed by Chen and Muroga [12]. This technique consists in exploiting the observability
don’t care set at a node to remove connections on the critical path. Berman et al. propose
similar methods [6]. We present our adaptation of Lawler’s algorithm in section 5.5 and
compare it to speedup.
In the remainder of this section, we use the technology mapper to measure the area
and delay of a circuit. For technology mapping, we use tree covering in delay minimization
mode, followed by fanout optimization. We use the heuristic of section 4.3 but we do not
allow overlaps between tree covers.

5.2 Effect on Delay of Minimizing Literal Count

We measured the effect of literal count minimization on circuit area and circuit
delay after technology mapping. The results are reported in Table 5.1. All circuits were
optimized using the misII standard algebraic script [9] except C2670, which was optimized
manually. As is apparent in the table, the circuits obtained from Intel (dflgrcbl, fconrcbl,
kcctlcb3, sbiucbl, tfaultcbl) were already optimized, and little gain was achieved for
these circuits. On average, minimizing literal count decreased area by 28% for no cost in
delay. However, a more careful inspection of the data indicates that the effect on delay of
literal count minimization is unpredictable, varying from a decrease of 22% to an increase
of 26%, though the larger increases in delay correspond to significant decreases in area.
Obviously the techniques used in misII to reduce literal count are quite powerful.
Unfortunately, if they are used without discrimination, they may at times lead to substantial
delay increases. This unpredictable behavior is undesirable and more work needs to be done
to control the effect on delay of these optimizations.

5.3 Performance-Oriented Logic Simplification

A simple way to reduce circuit area without having to pay for an increase in delay
is to reduce the literal count by using simplification only. A better way would be to allow,
in addition to simplification, factorization along non-critical paths. However, to perform
this optimization reliably, we need a good technology independent delay estimator, and
none is available at present. We have experimented with a simple misII script, shown in
Figure 5.1.

circuit          raw             opt            gain
             area   delay    area   delay    area  delay
C1355        1764   26.14    1192   24.21    0.68   0.93
C1908        1835   26.71    1358   28.24    0.74   1.06
C2670        2399   24.76    1796   20.86    0.75   0.84
C3540        3513   35.87    2857   32.88    0.81   0.92
C5315        5285   31.13    3693   29.31    0.70   0.94
C6288        8178  114.62    8272   95.47    1.01   0.83
C7552        7326   28.75    5322   27.83    0.73   0.97
alu4         2107   27.76    1977   29.24    0.94   1.05
ampbpreg     4607   45.36    3289   35.56    0.71   0.78
ampbsm       3840   19.32    1875   16.55    0.49   0.86
amppint2     3347    9.48    1353   11.29    0.40   1.19
ampxhdl      1990   11.42    1059   12.90    0.53   1.13
apex6        1963   10.98    1912   11.29    0.97   1.03
des         15166   16.15    8632   16.08    0.57   1.00
dflgrcbl      732    9.08     730   10.39    1.00   1.14
fconrcbl      537   11.00     537   11.00    1.00   1.00
frg2         4265   10.90    2367   13.17    0.55   1.21
k2           7616   13.34    2755   16.87    0.36   1.26
kcctlcb3      555    9.26     557    9.26    1.00   1.00
pair         4347   18.03    3956   16.40    0.91   0.91
rot          1758   19.44    1651   19.17    0.94   0.99
sbiucbl       634   15.97     599   15.66    0.94   0.98
tfaultcbl     433    7.04     439    6.80    1.01   0.97
vda          3799   11.58    1522   13.31    0.40   1.15
x3           2627    9.30    2033   10.97    0.77   1.18
aver            -       -       -       -    0.72   1.00

Table 5.1: Effect of Literal Count Minimization

raw: minimum delay technology mapping of unoptimized circuits
opt: minimum delay technology mapping of circuits optimized by misII
gain: gain in area or in delay obtained by literal count minimization (ratio opt / raw)
area: area of the circuit (MCNC lib2 data divided by common divisor 464)
delay: delay of the circuit (MCNC lib2 data in nanoseconds)
aver: geometric average of the gains

    simplification script

    sweep
    decomp -q
    eliminate -l 100 -1
    simplify -l

Figure 5.1: misII Logic Simplification Script

This script applies several commands [9, 8]:

• sweep: this command eliminates nodes with no fanin or no fanout. It simply removes
unnecessary nodes from a network.

• decomp -q: this command factorizes nodes in a simple way with no concern for
criticality. It is only used to break large nodes into smaller ones so that the other
commands can run in a reasonable amount of cpu time.

• eliminate -l 100 -1: this command collapses a node into its fanout. The collapsing
is only done if the node has a single fanout, and if the size of the resulting node does
not exceed 100 cubes. The role of this command is to ensure that nodes are large
enough for simplification to have some effect, but not so large that simplification
takes an unreasonable amount of cpu time.

• simplify -l: this command runs espresso [7, 37], a two-level logic minimizer. The
minimizer is given some information about the structure of the network, which allows it
to simplify the logic function at a node in the context of the other nodes in the network.
In particular, when simplifying a node v, the minimizer is allowed to change the inputs
of v if it simplifies the logic function at v. The -l option limits the minimizer to using
as inputs of v only nodes that are closer to the primary inputs than v. This restriction
guarantees that the number of levels of logic in the network is not increased by the
minimizer.

We measured the effect of the simplification script of Figure 5.1 on circuit area and
delay after technology mapping. The results are reported in Table 5.2. The average effect of
the simplification script is an area reduction of 9% and a delay reduction of 2%. The
fluctuations in delay are less severe than with the standard script, varying from a reduction
of 22% to an increase of 14%. The average reduction in area obtained by simplification
alone is roughly a third of the reduction in area obtainable with the standard script. This
is a heavy price to pay for a more controlled effect on delay.

circuit          raw          simplified        gain
             area   delay    area   delay    area  delay
C1355        1764   26.14    1274   20.45    0.72   0.78
C1908        1835   26.71    1796   26.13    0.98   0.98
C2670        2399   24.76    2079   23.78    0.87   0.96
C3540        3513   35.87    3339   33.25    0.95   0.93
C5315        5285   31.13    4471   30.65    0.85   0.98
C6288        8178  114.62    7016  108.04    0.86   0.94
C7552        7326   28.75    6294   27.06    0.86   0.94
alu4         2107   27.76    1992   28.46    0.95   1.03
ampbpreg     4607   45.36    4412   43.89    0.96   0.97
ampbsm       3840   19.32    2945   16.97    0.77   0.88
amppint2     3347    9.48    3101   10.68    0.93   1.13
ampxhdl      1990   11.42    1542   11.78    0.77   1.03
apex6        1963   10.98    1956   11.25    1.00   1.02
des         15166   16.15   14430   16.19    0.95   1.00
dflgrcbl      732    9.08     740    9.08    1.01   1.00
fconrcbl      537   11.00     534   11.00    0.99   1.00
frg2         4265   10.90    3388   12.40    0.79   1.14
k2           7616   13.34    7620   13.62    1.00   1.02
kcctlcb3      555    9.26     555    9.26    1.00   1.00
pair         4347   18.03    4002   16.91    0.92   0.94
rot          1758   19.44    1671   19.67    0.95   1.01
sbiucbl       634   15.97     626   15.67    0.99   0.98
tfaultcbl     433    7.04     433    7.04    1.00   1.00
vda          3799   11.58    3511   11.55    0.92   1.00
x3           2627    9.30    2433    9.51    0.93   1.02
aver            -       -       -       -    0.91   0.98

Table 5.2: Effect of Simplification without Factorization

raw: minimum delay technology mapping of unoptimized circuits
simplified: minimum delay technology mapping of simplified circuits
gain: gain in area or in delay obtained by simplifying circuits (ratio simplified / raw)
area: area of the circuit (MCNC lib2 data divided by common divisor 464)
delay: delay of the circuit (MCNC lib2 data in nanoseconds)
aver: geometric average of the gains

5.4 Effect of Collapsing to Two Levels of Logic

A simple way to improve the performance of a circuit is to collapse it to two
levels of logic. Unfortunately this technique has its limitations: for a large class of circuits,
the area penalty is too large for collapsing to be practical. Nevertheless, in some cases
collapsing yields significant delay reductions for an acceptable increase in area. We were
able to collapse 16 out of our 25 benchmark circuits. On each of the collapsed circuits we
ran the misII simplify -l command to simplify the logic at each node after collapsing. The
results after technology mapping are reported in Table 5.3.

5.5 Partial Collapsing for Delay Minimization

5.5.1 Lawler’s Algorithm

Some circuits cannot be collapsed into two levels of logic without a large area
penalty. However, it is often possible to collapse these networks partially in order to reduce
delay at a more moderate cost in area. To perform partial collapsing, we need an algorithm
that determines which groups of nodes are to be collapsed into single nodes in order to
decrease the delay through the network.
Unfortunately we do not have at our disposal a reasonably accurate technology
independent delay model, such as the one proposed by Wallace et al. [43]. As a rough
measure of delay, we use the number of logic levels a signal has to cross. To limit the
inaccuracy of this delay model, we only apply it after having decomposed a network into
simple gates. These simple gates are any of the four 2-input gates that can be represented
as a 2-input NAND gate with possibly inverters at the inputs, representing one of the four
following Boolean functions: $a + b$, $\bar{a} + b$, $a + \bar{b}$ or $\bar{a} + \bar{b}$.
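As an illustration only, the four simple gates can be checked by enumerating truth tables. The Python sketch below is not part of the original experiments; the helper names `simple_gate`, `or_form` and `truth_table` are hypothetical. It builds a 2-input NAND with an optional inverter on each input and confirms that each variant equals an OR of two literals, and that the four variants are distinct functions.

```python
from itertools import product

def nand(x: bool, y: bool) -> bool:
    return not (x and y)

def simple_gate(invert_a: bool, invert_b: bool):
    """2-input NAND with an optional inverter on each input (hypothetical helper)."""
    def gate(a: bool, b: bool) -> bool:
        # x != True is the complement of x, x != False is x itself
        return nand(a != invert_a, b != invert_b)
    return gate

def or_form(invert_a: bool, invert_b: bool):
    """Equivalent OR of literals: an inverted NAND input yields a positive literal."""
    def f(a: bool, b: bool) -> bool:
        lit_a = a if invert_a else (not a)
        lit_b = b if invert_b else (not b)
        return lit_a or lit_b
    return f

def truth_table(f):
    return tuple(f(a, b) for a, b in product((False, True), repeat=2))

# One truth table per choice of input inverters
tables = {
    (ia, ib): truth_table(simple_gate(ia, ib))
    for ia, ib in product((False, True), repeat=2)
}
```

With no inverters the gate computes the OR of both complemented inputs; with both inverters it computes a plain OR of `a` and `b`.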
To form the groups, we use a clustering algorithm due to Lawler that minimizes
the number of levels of logic in the network after collapsing of the groups, subject to the
constraint that each group is formed of at most K nodes. This algorithm generates possibly


circuit opt collapsed gain


area delay area delay area delay
C1355 1192 24.21 * * - -
C1908 1358 28.24 * * - -
C2670 1796 20.86 * * - -
C3540 2857 32.88 * * - -
C5315 3693 29.31 * * - -
C6288 8272 95.47 * * - -
C7552 5322 27.83 * * - -
alu4 1977 29.24 3595 14.22 1.82 0.49
ampbpreg 3289 35.56 4055 10.55 1.23 0.30
ampbsm 1875 16.55 3352 9.49 1.79 0.57
amppint2 1353 11.29 3263 9.56 2.41 0.85
ampxhdl 1059 12.90 3770 11.39 3.56 0.88
apex6 1912 11.29 3011 9.56 1.57 0.85
des 8632 16.08 * * - -
dflgrcbl 730 10.39 915 7.23 1.25 0.70
fconrcbl 537 11.00 875 7.73 1.63 0.70
frg2 2367 13.17 7128 10.64 3.01 0.81
k2 2755 16.87 7902 12.08 2.87 0.72
kcctlcb3 557 9.26 1126 6.39 2.02 0.69
pair 3956 16.40 19143 14.22 4.84 0.87
rot 1651 19.17 * * - -
sbiucbl 599 15.66 1830 11.26 3.06 0.72
tfaultcbl 439 6.80 713 5.53 1.62 0.81
vda 1522 13.31 4544 9.95 2.99 0.75
x3 2033 10.97 3027 9.56 1.49 0.87
aver - - - - 2.15 0.70

Table 5.3: Effect of Collapsing to Two Levels of Logic

opt: minimum delay technology mapping of circuits optimized by misII
collapsed: minimum delay technology mapping of the collapsed circuits
gain: gain in area or in delay obtained by collapsing circuits
area: area of the circuit (MCNC lib2 data divided by common divisor 464)
delay: delay of the circuit (MCNC lib2 data in nanoseconds)
aver: geometric average of the gains
*: circuit not collapsible to two levels of logic
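The gain columns and the geometric average can be reproduced with a few lines of code. The following Python sketch is an illustration, not the tooling used for the thesis; it assumes that a gain is the ratio of the value after the transformation to the value before it, which agrees with the rows shown. The three rows are copied from Table 5.3, and the function names are hypothetical.

```python
import math

# (circuit, opt area, opt delay, collapsed area, collapsed delay), from Table 5.3
rows = [
    ("alu4", 1977, 29.24, 3595, 14.22),
    ("ampbpreg", 3289, 35.56, 4055, 10.55),
    ("pair", 3956, 16.40, 19143, 14.22),
]

def gain(before: float, after: float) -> float:
    """Ratio after/before: above 1 is an increase, below 1 a reduction."""
    return after / before

def geometric_mean(values) -> float:
    """Geometric average, computed through logarithms."""
    values = list(values)
    return math.exp(sum(math.log(v) for v in values) / len(values))

area_gains = [gain(r[1], r[3]) for r in rows]
delay_gains = [gain(r[2], r[4]) for r in rows]

for (name, *_), ga, gd in zip(rows, area_gains, delay_gains):
    print(f"{name:10s} area gain {ga:.2f}  delay gain {gd:.2f}")

# Average over these three rows only, not over the whole table
print(f"subset average: area {geometric_mean(area_gains):.2f}, "
      f"delay {geometric_mean(delay_gains):.2f}")
```

A geometric average is the natural choice for ratios: a gain of 2 and a gain of 0.5 average to 1, that is, to no net change.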


algorithm Lawler_clustering_algorithm
{ /* labeling step */
    foreach node v visited in topological order from inputs to outputs {
        if fanin(v) = ∅ then L = 0 else L = max_{u ∈ fanin(v)} label(u)
        k = |{u : u ∈ transitive_fanin(v), label(u) = L}|
        if k > K then label(v) = L + 1 else label(v) = L
    }
}
{ /* clustering step */
    foreach node v visited in topological order from outputs to inputs {
        if fanout(v) = ∅ then L = ∞ else L = min_{u ∈ fanout(v)} label(u)
        if label(v) < L {
            create a new cluster c
            c = {v} ∪ {u ∈ transitive_fanin(v), label(u) = label(v)}
        }
    }
}
{ /* collapsing step */
    foreach cluster c {
        root(c) = {v ∈ c, label(v) < max_{u ∈ fanout(v)} label(u)}
        foreach v ∈ root(c) {
            collapse into v all nodes in c ∩ transitive_fanin(v)
        }
    }
}
end Lawler_clustering_algorithm

Figure 5.2: Lawler’s Algorithm


Figure 5.3: Example of Clustering

overlapping clusters, and clusters that have more than one output, requiring extra logic
duplication during collapsing. For these reasons, this clustering algorithm usually increases
circuit area. We now describe Lawler’s algorithm in more detail.
Lawler’s algorithm determines a minimum delay clustering of a network under the
constraint that no cluster exceeds a global capacity K. The delay
through a cluster is assumed to be the same for all clusters, and the size of a cluster is
the number of nodes it contains. Lawler’s algorithm can handle more general clustering
problems, but the present formulation is sufficient for our purpose. Lawler’s algorithm
proceeds in two steps: a labeling step and a clustering step. We have added a third step to
do the collapsing of the clusters. The algorithm is described in Figure 5.2.
The labeling step proceeds as follows. We visit the nodes in topological order, from
inputs to outputs. For each node v, we compute the largest label L of any of its fanins. If
v does not have any fanin, L is taken to be equal to 0. We then compute the number k of
nodes in the transitive fanin of v that are of label L. If k exceeds K, the label of v is set
to be L + 1, otherwise it is set to be L. In the clustering step the nodes are visited in the
reverse order, from outputs to inputs. If the label of a node v is less than the labels of all the


nodes in its fanout, a new cluster is created, containing v and all the nodes in the transitive
fanin of v with the same label as v. The collapsing step collapses the nodes of a cluster
together. Some duplication is introduced within a cluster if a cluster has several outputs.
One node is created per output, and every node of the cluster contained in the transitive fanin
of two or more output nodes of the cluster is duplicated. Duplication is also introduced across
clusters, since a node may belong to several clusters as shown in Figure 5.3.
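The labeling and clustering steps can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the misII implementation: the network is a dictionary `fanins` mapping every node to the list of its fanins, the function names are hypothetical, and node v itself is counted in k so that a cluster never holds more than K nodes (Figure 5.2 leaves this detail implicit). The collapsing step is omitted.

```python
from graphlib import TopologicalSorter

def transitive_fanin(fanins, v):
    """All nodes with a path to v, v itself excluded."""
    seen, stack = set(), list(fanins[v])
    while stack:
        u = stack.pop()
        if u not in seen:
            seen.add(u)
            stack.extend(fanins[u])
    return seen

def lawler_labels(fanins, K):
    """Labeling step: visit nodes from inputs to outputs."""
    labels = {}
    # static_order() yields each node after all of its fanins
    for v in TopologicalSorter(fanins).static_order():
        L = max((labels[u] for u in fanins[v]), default=0)
        same = sum(1 for u in transitive_fanin(fanins, v) if labels[u] == L)
        k = same + 1  # assumption: v itself counts towards the cluster size
        labels[v] = L + 1 if k > K else L
    return labels

def lawler_clusters(fanins, labels):
    """Clustering step: one cluster per node whose label is below all fanout labels."""
    fanouts = {v: [] for v in fanins}
    for v, ins in fanins.items():
        for u in ins:
            fanouts[u].append(v)
    clusters = {}
    # The test below depends only on the labels, so the visiting order
    # does not change the result in this sketch.
    for v in fanins:
        L = min((labels[w] for w in fanouts[v]), default=float("inf"))
        if labels[v] < L:
            tfi = transitive_fanin(fanins, v)
            clusters[v] = {v} | {u for u in tfi if labels[u] == labels[v]}
    return clusters

# A chain a -> b -> c -> d -> e with a cluster size limit of 2
chain = {"a": [], "b": ["a"], "c": ["b"], "d": ["c"], "e": ["d"]}
chain_labels = lawler_labels(chain, K=2)
chain_clusters = lawler_clusters(chain, chain_labels)
```

On this chain the labels are 0, 0, 1, 1, 2, giving the clusters {a, b}, {c, d} and {e}, so the five levels of the original network collapse to three.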
Figure 5.3 shows an application of the algorithm on an example from [28]. In
that example K, the cluster size limit, is set to 3. The labels are indicated inside the nodes
and the clustering is shown by encirclings. As can be seen in this example, the algorithm
may replicate some nodes. The labeling and clustering parts of the algorithm operate in
$O(KN^2)$ time, where N is the total number of nodes in the network and K the maximum size
of a cluster. The time complexity of the collapsing part depends on the logic function
obtained at each node.
As can be observed in Figure 5.2 and Figure 5.3, Lawler’s algorithm bears a strong
similarity to tree covering with overlaps. The labeling step is the analog of the forward
dynamic programming pass of tree covering. The clustering step is the analog of the gate
selection pass of tree covering. In both algorithms, the nodes of the network are visited
in the same order. In other words, Lawler’s clustering algorithm can be thought of as a
technology independent tree covering algorithm allowing overlaps, and in that sense is a
natural extension of the algorithms we presented in the previous chapter. It suffers from
the same problem as the tree covering algorithm allowing overlaps, as it often causes large
area increases. More work needs to be done in this area to determine whether these area
increases are necessary to reduce delay.

5.5.2 Effect of Clustering on Delay

In this section we examine the effect of the clustering algorithm on our set of
benchmarks. We apply the clustering algorithm using the script of Figure 5.4. Most of the
commands in this script have been introduced earlier. The new commands are:

• tech_decomp -o 2: this command decomposes the network into 2-input NAND gates,
possibly with inverters at one or both of the inputs.

• resub -a -d: applied to a network decomposed into 2-input NAND gates, this command
detects if two copies of the same node are present in the network. If it is the


clustering script {
sweep
decomp -q
tech_decomp -o 2
resub -a -d
sweep
reduce_depth -S 8
eliminate -1
simplify -1
}

Figure 5.4: misII Clustering Script

speed-up script {
sweep
decomp -q
speed_up -d 6 -m unit
}

Figure 5.5: misII Speed-up Script

case, one copy is removed and the fanout of the remaining node is increased by the
fanout of the removed node.

• reduce_depth -S 8: this command performs a clustering and a collapse of the clusters
using Lawler’s algorithm. The maximum cluster size is set to 8. The actual cluster size
used is the smallest cluster size with which the algorithm can get the same number of
logic levels as with a cluster size limit of 8. Since the number of logic levels can only
decrease as the cluster size limit is increased, we can find the smallest cluster size for
a given number of logic levels by binary search. In addition, the search only needs to
execute the labeling step of the clustering algorithm and is thus very fast.
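The binary search over the cluster size limit can be sketched as follows. This Python fragment illustrates the idea only and is not the code of reduce_depth: the names `max_label` and `smallest_cluster_size` are hypothetical, the network is assumed to be a non-empty dictionary mapping every node to its fanins, and, as an assumption, each node counts itself towards the cluster size. Only the labeling step is run for each candidate size.

```python
from graphlib import TopologicalSorter

def max_label(fanins, K):
    """Run the labeling step only and return the largest label."""
    labels, tfi = {}, {}
    for v in TopologicalSorter(fanins).static_order():
        # Transitive fanin of v, built from the fanins already visited
        tfi[v] = set(fanins[v]).union(*(tfi[u] for u in fanins[v]))
        L = max((labels[u] for u in fanins[v]), default=0)
        k = 1 + sum(1 for u in tfi[v] if labels[u] == L)  # assumption: v counts
        labels[v] = L + 1 if k > K else L
    return max(labels.values())

def smallest_cluster_size(fanins, K_max):
    """Smallest K giving the same largest label as K_max.

    Relies on the property stated in the text: the number of levels
    can only decrease as the cluster size limit increases.
    """
    target = max_label(fanins, K_max)
    lo, hi = 1, K_max
    while lo < hi:
        mid = (lo + hi) // 2
        if max_label(fanins, mid) <= target:
            hi = mid        # mid is large enough, try smaller sizes
        else:
            lo = mid + 1    # mid produces too many levels
    return lo

# A chain of five nodes fits in one cluster once K reaches 5
chain = {"a": [], "b": ["a"], "c": ["b"], "d": ["c"], "e": ["d"]}
best_K = smallest_cluster_size(chain, K_max=8)
```

For the five-node chain, a limit of 8 yields a single level, and the search returns 5 as the smallest limit that still does so.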

The results of clustering on area and delay of circuits optimized for minimum


circuit opt clustered gain


area delay area delay area delay
C1355 1192 24.21 1597 20.19 1.34 0.83
C1908 1358 28.24 1743 25.97 1.28 0.92
C2670 1796 20.86 2517 19.40 1.40 0.93
C3540 2857 32.88 4856 30.29 1.70 0.92
C5315 3693 29.31 5703 28.02 1.54 0.96
C6288 8272 95.47 10811 77.43 1.31 0.81
C7552 5322 27.83 8000 24.19 1.50 0.87
alu4 1977 29.24 2929 27.66 1.48 0.95
ampbpreg 3289 35.56 4703 20.67 1.43 0.58
ampbsm 1875 16.55 2693 13.25 1.44 0.80
amppint2 1353 11.29 1704 9.48 1.26 0.84
ampxhdl 1059 12.90 1389 11.65 1.31 0.90
apex6 1912 11.29 2559 10.44 1.34 0.92
des 8632 16.08 11992 17.39 1.39 1.08
dflgrcbl 730 10.39 879 10.18 1.20 0.98
fconrcbl 537 11.00 669 9.32 1.25 0.85
frg2 2367 13.17 2955 10.56 1.25 0.80
k2 2755 16.87 4857 14.48 1.76 0.86
kcctlcb3 557 9.26 853 6.57 1.53 0.71
pair 3956 16.40 5396 15.17 1.36 0.92
rot 1651 19.17 2169 16.05 1.31 0.84
sbiucbl 599 15.66 917 14.41 1.53 0.92
tfaultcbl 439 6.80 562 5.65 1.28 0.83
vda 1522 13.31 2301 12.32 1.51 0.93
x3 2033 10.97 2510 10.54 1.23 0.96
aver - - - - 1.39 0.87

Table 5.4: Effect of Clustering with a Maximum Cluster Size of 8

opt: minimum delay technology mapping of circuits optimized by misII
clustered: minimum delay technology mapping of the clustered circuits
gain: gain in area or in delay obtained by clustering circuits
area: area of the circuit (MCNC lib2 data divided by common divisor 464)
delay: delay of the circuit (MCNC lib2 data in nanoseconds)
aver: geometric average of the gains


circuit opt speedup gain


area delay area delay area delay
C1355 1192 24.21 2203 21.21 1.85 0.88
C1908 1358 28.24 1895 26.49 1.40 0.94
C2670 1796 20.86 1836 19.98 1.02 0.96
C3540 2857 32.88 3309 33.92 1.16 1.03
C5315 3693 29.31 4712 23.52 1.28 0.80
C6288 8272 95.47 9247 95.36 1.12 1.00
C7552 5322 27.83 5849 26.19 1.10 0.94
alu4 1977 29.24 2009 27.03 1.02 0.92
ampbpreg 3289 35.56 4338 22.46 1.32 0.63
ampbsm 1875 16.55 2041 15.29 1.09 0.92
amppint2 1353 11.29 1341 11.37 0.99 1.01
ampxhdl 1059 12.90 1023 12.55 0.97 0.97
apex6 1912 11.29 2068 9.83 1.08 0.87
des 8632 16.08 8767 16.40 1.02 1.02
dflgrcbl 730 10.39 716 10.39 0.98 1.00
fconrcbl 537 11.00 507 12.20 0.94 1.11
frg2 2367 13.17 2555 11.83 1.08 0.90
k2 2755 16.87 2696 16.30 0.98 0.97
kcctlcb3 557 9.26 564 9.19 1.01 0.99
pair 3956 16.40 5655 16.94 1.43 1.03
rot 1651 19.17 2100 16.11 1.27 0.84
sbiucbl 599 15.66 677 13.94 1.13 0.89
tfaultcbl 439 6.80 453 6.97 1.03 1.03
vda 1522 13.31 1625 12.83 1.07 0.96
x3 2033 10.97 2226 11.17 1.09 1.02
aver - - - - 1.12 0.94

Table 5.5: Effect of the speed-up Command

opt: minimum delay technology mapping of circuits optimized by misII
speedup: minimum delay technology mapping after speed_up
gain: gain in area or in delay obtained using speed_up
area: area of the circuit (MCNC lib2 data divided by common divisor 464)
delay: delay of the circuit (MCNC lib2 data in nanoseconds)
aver: geometric average of the gains


literal count are given in Table 5.4. Clustering achieves an average delay reduction of 13%
for an average area increase of 39%. This technique performs better than tree covering
with overlaps in both area and delay. For comparison, we also give in Table 5.5 the results
obtained using the misII speed_up routine. The script we used to run the speed_up routine is
indicated in Figure 5.5. There is no need to use most of the commands in the previous script
because the speed_up command performs its own decomposition into simple gates and area
recovery. The speed_up command decreased delay by only 6% on average, for a moderate
increase of 12% in area. In addition, speed_up does not perform very consistently: it
actually increases the delay of 7 out of our set of 25 benchmarks. In contrast, the clustering
algorithm increased delay in only one of the examples, at a higher cost in area.

5.5.3 Area Recovery and Clustering

In this section we present two techniques to reduce the area increase due to clustering.
The first technique is a modification of the labeling step of the clustering algorithm.
The second technique is based on redundancy removal.

Area Efficient Labeling Procedure. Lawler’s clustering algorithm forces the duplication
of nodes in two cases. First, when a node belongs to more than one cluster, it is
duplicated in order to provide one copy per cluster. Secondly, when a cluster has several
output nodes $(v_1, \ldots, v_k)$, all nodes in the cluster belonging to the transitive fanin of two
or more of the nodes $(v_1, \ldots, v_k)$ need to be duplicated. Equivalently, we can consider that
a cluster that has several output nodes is itself duplicated, one copy per output node. The
duplicated nodes that, in a given copy, do not have any fanout can be removed.
Lawler’s algorithm has the property of attributing to each node the smallest possible
label that respects the cluster size constraint. In some cases a node can be attributed a
larger label without violating the cluster size constraint and without causing the maximum
node label to increase. An example of a situation where relabeling can occur is given in
Figure 5.6. The effect of increasing the label of a node may be to remove an output node
from a multiple output cluster, and in so doing to reduce logic duplication.
We use a simple greedy heuristic to increase node labels. This heuristic is outlined
in Figure 5.7. It is used after the labeling step and before the clustering step of the algorithm
of Figure 5.2. The heuristic visits the nodes of the network in topological order from outputs
to inputs. At each node v, the maximum label value is supposed to be available. If v is a


Figure 5.6: Example of Relabeling for Area

primary output, this value is simply the largest label value computed in the labeling step.
If v is not a primary output, this value is guaranteed to be available when v is visited by
the topological ordering of the nodes. The label of each node is given this maximum value.
Then the largest label value the inputs of v could have without forcing an increase of the
label of v is computed. This value is propagated to the inputs of v.
The effect of the relabeling heuristic is illustrated in Table 5.6. Relabeling reduces
the average increase in area caused by clustering from 39% to 31%, and actually reduces
delay by an additional percentage point, yielding an average delay reduction of 14%.
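The relabeling heuristic of Figure 5.7 can be transcribed into Python as a rough sketch. It follows the figure line by line but is not the original implementation: the function name `relabel` is hypothetical, the network is assumed to be a dictionary mapping every node to its fanins, and primary outputs are assumed to be the nodes without fanout.

```python
from graphlib import TopologicalSorter

def relabel(fanins, labels):
    """Greedy label increase, visiting nodes from outputs to inputs."""
    has_fanout = {u for ins in fanins.values() for u in ins}
    primary_outputs = [v for v in fanins if v not in has_fanout]  # assumption
    top = max(labels[v] for v in primary_outputs)

    max_label = {v: top for v in fanins}
    new_labels = dict(labels)
    order = list(TopologicalSorter(fanins).static_order())
    for v in reversed(order):  # every node is visited before its fanins
        for u in fanins[v]:
            if new_labels[v] < max_label[v] and new_labels[u] == new_labels[v]:
                # u shares the label of v and can move up together with v
                increment = max_label[v] - new_labels[u]
            else:
                # u must stay at least one label below the new label of v
                increment = max(0, max_label[v] - new_labels[u] - 1)
            max_label[u] = min(max_label[u], new_labels[u] + increment)
        new_labels[v] = max_label[v]
    return new_labels

# A deep chain a..e (labels as obtained with K = 2) next to a short branch s -> t
fanins = {"a": [], "b": ["a"], "c": ["b"], "d": ["c"], "e": ["d"],
          "s": [], "t": ["s"]}
labels = {"a": 0, "b": 0, "c": 1, "d": 1, "e": 2, "s": 0, "t": 0}
relabeled = relabel(fanins, labels)
```

In this small example the labels along the deep chain are unchanged, while the two nodes of the short branch are raised from label 0 to the maximum label 2, which does not increase the largest label in the network.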

Redundancy Removal. Clustering and collapsing may introduce a large number of
redundancies. By removing these redundancies we can only reduce area and delay, assuming
a static delay model. In the presence of false paths, i.e. paths in the circuit that cannot
propagate any signal under any circumstances due to cancellation effects from side paths,
redundancy removal may actually slow down the circuit by making a slow false path become
active. The best known example of a circuit where this problem occurs is the carry-bypass
adder. Removing logical redundancies from a carry-bypass adder has the effect of removing
the bypass circuitry, transforming the fast carry-bypass adder into a slow ripple-carry adder.
Redundancies can still be eliminated from circuits with false paths without slowing down
the circuit, possibly at the cost of some logic duplication, using the algorithm of
Keutzer et al. [25]. In our experiments, we simply assume that all circuits have at least one


circuit opt relabel gain


area delay area delay area delay
C1355 1192 24.21 1597 20.19 1.34 0.83
C1908 1358 28.24 1617 25.09 1.19 0.89
C2670 1796 20.86 2243 19.91 1.25 0.95
C3540 2857 32.88 4241 30.33 1.48 0.92
C5315 3693 29.31 5079 27.89 1.38 0.95
C6288 8272 95.47 10704 77.28 1.29 0.81
C7552 5322 27.83 8302 24.34 1.56 0.87
alu4 1977 29.24 2777 27.05 1.40 0.93
ampbpreg 3289 35.56 4162 20.66 1.27 0.58
ampbsm 1875 16.55 2390 12.69 1.27 0.77
amppint2 1353 11.29 1731 9.29 1.28 0.82
ampxhdl 1059 12.90 1259 11.62 1.19 0.90
apex6 1912 11.29 2429 10.12 1.27 0.90
des 8632 16.08 11540 17.40 1.34 1.08
dflgrcbl 730 10.39 803 10.18 1.10 0.98
fconrcbl 537 11.00 645 9.32 1.20 0.85
frg2 2367 13.17 2825 10.57 1.19 0.80
k2 2755 16.87 4784 14.36 1.74 0.85
kcctlcb3 557 9.26 810 6.39 1.45 0.69
pair 3956 16.40 5274 14.95 1.33 0.91
rot 1651 19.17 1966 15.69 1.19 0.82
sbiucbl 599 15.66 837 14.41 1.40 0.92
tfaultcbl 439 6.80 513 5.65 1.17 0.83
vda 1522 13.31 2269 12.32 1.49 0.93
x3 2033 10.97 2440 9.82 1.20 0.90
aver - - - - 1.31 0.86

Table 5.6: Effect of Relabeling Heuristic

opt: minimum delay technology mapping of circuits optimized by misII
relabel: minimum delay technology mapping of the relabeled, clustered circuits
gain: gain in area or in delay obtained by relabeled clustering
area: area of the circuit (MCNC lib2 data divided by common divisor 464)
delay: delay of the circuit (MCNC lib2 data in nanoseconds)
aver: geometric average of the gains


algorithm relabeling_heuristic
    max_label = max_{v ∈ PO} label(v)
    foreach node v {
        max_label(v) = max_label
    }
    foreach node v visited in topological order from outputs to inputs {
        foreach node u ∈ FANIN(v) {
            if (label(v) < max_label(v) and label(u) == label(v)) {
                increment = max_label(v) - label(u)
            } else {
                increment = max(0, max_label(v) - label(u) - 1)
            }
            max_label(u) = min(max_label(u), label(u) + increment)
        }
        label(v) = max_label(v)
    }
end relabeling_heuristic

Figure 5.7: Relabeling Procedure for Reducing Logic Duplication

active critical path. This hypothesis is satisfied by most circuits. Moreover, all circuits can
be made to satisfy this hypothesis by using Keutzer’s algorithm.

We removed all redundancies from circuits that were partially collapsed using
Lawler’s algorithm. We applied the relabeling heuristic described in the previous section.
The results are reported in Table 5.7. To identify redundancies, we used an improved
version of the automatic test pattern generation program Socrates [39] developed by Jacoby
et al. [23]. After redundancy removal, we ran the following misII commands: sweep;
eliminate -1; simplify -1, except on C3540. For C3540, eliminate -1 causes a large
increase in the sum of product representation of the circuit which makes the use of Jacoby’s
redundancy removal program impractical. We were unable to complete the removal of all
redundancies on one circuit, C6288, after 72 hours of cpu time on a DEC-3100. Jacoby’s
redundancy removal program performs some limited form of factorization, which may bias


circuit opt rr gain


area delay area delay area delay
C1355 1192 24.21 1417 19.98 1.19 0.83
C1908 1358 28.24 1570 25.94 1.16 0.92
C2670 1796 20.86 2216 17.36 1.23 0.83
C3540† 2857 32.88 3836 28.68 1.34 0.87
C5315 3693 29.31 4909 26.91 1.33 0.92
C6288 8272 95.47 * * - -
C7552 5322 27.83 9550 25.34 1.79 0.91
alu4 1977 29.24 2209 26.55 1.12 0.91
ampbpreg 3289 35.56 2541 13.92 0.77 0.39
ampbsm 1875 16.55 2191 11.06 1.17 0.67
amppint2 1353 11.29 1761 9.30 1.30 0.82
ampxhdl 1059 12.90 1007 10.93 0.95 0.85
apex6 1912 11.29 2397 10.03 1.25 0.89
des 8632 16.08 11313 17.38 1.31 1.08
dflgrcbl 730 10.39 786 10.18 1.08 0.98
fconrcbl 537 11.00 680 8.80 1.27 0.80
frg2 2367 13.17 2420 10.30 1.02 0.78
k2 2755 16.87 4505 13.99 1.64 0.83
kcctlcb3 557 9.26 793 6.88 1.42 0.74
pair 3956 16.40 5145 14.10 1.30 0.86
rot 1651 19.17 1911 16.20 1.16 0.85
sbiucbl 599 15.66 782 12.01 1.31 0.77
tfaultcbl 439 6.80 486 6.38 1.11 0.94
vda 1522 13.31 2224 12.54 1.46 0.94
x3 2033 10.97 2465 9.51 1.21 0.87
aver - - - - 1.23 0.83

Table 5.7: Effect of Redundancy Removal after Clustering

opt: minimum delay technology mapping of circuits optimized by misII
rr: minimum delay technology mapping of the relabeled, clustered circuits
after redundancy removal
gain: gain in area or in delay obtained by clustering and redundancy removal
area: area of the circuit (MCNC lib2 data divided by common divisor 464)
delay: delay of the circuit (MCNC lib2 data in nanoseconds)
aver: geometric average of the gains
*: timeout after 72 hours of cpu time on a DEC 3100
†: eliminate -1 was not applied before redundancy removal


our results slightly in favor of area. Redundancy removal was effective at reducing the area
penalty incurred by clustering. On average, clustering followed by redundancy removal
increased area by 23% for a reduction in delay of 17%.

5.6 Conclusion

We presented in this chapter several technology independent techniques to reduce
circuit delay. The simplest of these techniques, which consists in collapsing a circuit to
two levels of logic, is only applicable to a restricted class of circuits. For these circuits,
collapsing may yield impressive delay reductions. However, for most circuits, the cost in area
is too large for collapsing to be practical. As an alternative to full collapsing we introduced a
simple clustering technique that allows partial collapsing, and realizes a compromise between
area increase and delay reduction. This clustering technique fits naturally in this thesis, as
it can be seen as a technology independent version of a tree covering algorithm allowing
overlaps. Comparing clustering with speed_up, we saw that clustering was more wasteful in
area, but gave better delays and was more consistent. Clustering can be rendered less costly
in area and even more efficient in delay by reducing overlaps between clusters whenever it
does not increase the number of levels of logic in the clustered network, and by using
redundancy removal. Using clustering and redundancy removal, we were able to obtain, in
some cases, circuits that were faster than their collapsed versions for a fraction of the area.
We demonstrated that there is still a lot of potential for delay minimization beyond
technology mapping. More work needs to be done to exploit this potential at a more
moderate cost in area and cpu time than the small set of techniques presented in this
chapter.

Chapter 6

Conclusion

Bornons ici cette carrière.
Les longs ouvrages me font peur.
Loin d’épuiser une matière,
On n’en doit prendre que la fleur.
— LA FONTAINE

The main results of this work are as follows. We provided an exact solution to the minimum
delay tree covering problem based on piece-wise linear functions. We performed an extensive
study of fanout optimization heuristics, presented new complexity results, and introduced
a spectrum of fanout optimization algorithms. We developed a simple algorithm to apply
fanout optimization throughout an entire network that reduces delay at a very moderate cost
in area. To study the integration of tree covering and fanout optimization, we introduced
a technology independent delay model that characterizes precisely suboptimalities due to
imbalances in a network. This is the first technology independent delay model that models
the delay through a node as a function of the arrival time distribution at a node. In addition,
this delay model can be used to derive analytically optimal solutions in simple cases which
can be used to assess the optimality of algorithms. We showed the importance of the
technique used to evaluate the arrival times at the input of trees before fanout optimization,
and presented an efficient heuristic to solve this problem. We also experimented with
allowing tree covers to overlap, and showed significant delay reductions with this technique.
Finally we investigated technology independent delay optimization techniques based on
partial or total collapsing of logic, and showed that further delay reductions can be achieved
with these techniques possibly at a higher cost in area.
A surprising conclusion of our work is that it is important to ignore critical paths


when performing delay optimization during logic synthesis. As confirmed by the experiments
of Yoshikawa et al. [44], delay reduction on non-critical paths can create additional
slacks on those paths that can be exploited to reduce delay through critical paths. By
concentrating on critical paths only, delay optimization algorithms condemn themselves to
suboptimal solutions.
We now have at our disposal a spectrum of delay optimization techniques. Fanout
optimization is the cheapest technique in terms of area consumption, and should be given
top priority. Tree covering for delay comes in second. The area-delay tradeoff potential
of tree covering depends more heavily on the quality of the library used by the technology
mapper. In some cases tree covering can outperform fanout optimization, though we were
not able to demonstrate this fact in this thesis due to the confidentiality of some of our
libraries. Allowing overlaps between tree covers as well as technology independent collapsing
algorithms followed by redundancy removal add to the arsenal of delay minimization
techniques.
More work needs to be done in technology independent delay optimization techniques.
There are three main avenues of research: the development of more accurate
technology independent delay models than the ones currently in use; the improvement
of collapsing algorithms in terms of area; the improvement of techniques based on kernel
extraction (speed_up) or observability don’t-care sets in terms of cpu speed. In particular,
it would be interesting to investigate the use of the technology independent delay model
introduced in chapter 4, or a model derived from similar ideas, to drive technology independent
delay reduction algorithms. This delay model is the first to propose a way to take into
account imbalances in arrival times as they occur in networks. A more fundamental issue
would be to understand when logic duplication is needed for delay reduction (we know that
redundancy is not needed, even in the presence of false paths [25]) in order to find more
economical ways to perform partial collapsing or tree overlapping.

Bibliography

[1] A. Aho, M. Ganapathi, and S. Tjiang. Code Generation Using Tree Matching and
Dynamic Programming. ACM Transactions on Programming Languages and Systems,
11(4):491-516, October 1989.

[2] A. Aho, S. C. Johnson, and J. Ullman. Code generation for expressions with common
subexpressions. Journal of the Association for Computing Machinery, 24(1):146-160,
1977.

[3] A. V. Aho and M. Ganapathi. Efficient Tree Pattern Matching: an Aid to Code Generation.
In Twelfth Annual ACM Symposium on Principles of Programming Languages,
pages 334-340, January 1985.

[4] K. Bartlett, W. Cohen, A. De Geus, and G. Hachtel. Synthesis and Optimization of
Multilevel Logic under Timing Constraints. IEEE Transactions on CAD, 5(4):582-596,
October 1986.

[5] C. L. Berman, J. L. Carter, and K. F. Day. The Fanout Problem: From Theory to
Practice. In C. L. Seitz, editor, Advanced Research in VLSI: Proceedings of the 1989
Decennial Caltech Conference, pages 69-99. MIT Press, March 1989.

[6] C. L. Berman, D. J. Hathaway, A. S. LaPaugh, and L. H. Trevillyan. Efficient techniques
for timing correction. In ISCAS, pages 415-419, 1990.

[7] R. K. Brayton, G. D. Hachtel, C. T. McMullen, and A. Sangiovanni-Vincentelli. Logic
Minimization Algorithms for VLSI Synthesis. Kluwer Academic Publishers, 1984.

[8] R. K. Brayton, G. D. Hachtel, and A. L. Sangiovanni-Vincentelli. Multilevel Logic
Synthesis. Proceedings of the IEEE, 78(2):264-300, February 1990.


[9] R. K. Brayton, R. Rudell, A. Sangiovanni-Vincentelli, and A. R. Wang. MIS: A


multiple-level logic optimization system. IE E E Transactions on Computer-Aided De­
sign, CAD-6(6):1062-1081, November 1987.

[10] R. E. Bryant. Graph Based Algorithms for Boolean Function M anipulation. IEEE
Transactions on Computers, C-35(8):677-691, August 1986.

[11] D. R. Chase. An Improvement to Bottom-up Tree P attern M atching. In ACM, editor,


Symposium on Principles o f Programming Languages, pages 168-177, January 1987.

[12] K. C. Chen and S. Muroga. Timing Optimization for Multi-Level Combinational Net­
works. In Proceedings o f the 27th A C M /IE E E Design Automation Conference, pages
339-344,1990.

[13] J. A. Darringer, D. Brand, W . H. Joyner, and L. Treviilyan. LSS: A system for


production logic synthesis. IB M Journal o f Research and Development, 28(5):537-545,
September 1984.

[14] E. Detjens, G. Gannot, R. Rudell, A. Sangiovanni-Vincentelli, and A. Wang. Technol­


ogy Mapping in MIS. In Proc. o f the ICCAD-87, pages 116-119, November 1987.

[15] J. P. Fishburn. A Depth-Decreasing Heuristic for Combinational Logic: or How to
Convert a Ripple-Carry Adder into a Carry-Lookahead Adder or Anything In-Between.
In Proceedings of the 27th ACM/IEEE Design Automation Conference, pages 361-364,
1990.

[16] M. R. Garey and D. S. Johnson. Computers and Intractability: A Guide to the Theory
of NP-Completeness. Mathematical Sciences Series. Freeman, 1979.

[17] L. A. Glasser and L. P. J. Hoyte. Delay and Power Optimization in VLSI Circuits. In
21st ACM/IEEE Design Automation Conference, pages 529-535, 1984.

[18] M. C. Golumbic. Combinatorial Merging. IEEE Transactions on Computers,
25(11):1164-1167, November 1976.

[19] D. Gregory, K. Bartlett, A. de Geus, and G. Hachtel. SOCRATES: A System for
Automatically Synthesizing and Optimizing Combinational Logic. In 23rd ACM/IEEE
Design Automation Conference, pages 79-85, 1986.

[20] C. M. Hoffmann and M. J. O’Donnell. Pattern Matching in Trees. Journal of the
Association for Computing Machinery, 29(1):68-95, January 1982.


[21] M. Hofmann and J. K. Kim. Delay Optimization of Combinational Static CMOS Logic.
In 24th ACM/IEEE Design Automation Conference, pages 125-132, 1987.

[22] H. J. Hoover, M. M. Klawe, and N. J. Pippenger. Bounding Fan-out in Logical
Networks. Journal of the Association for Computing Machinery, 31(1):13-18, January
1984.

[23] R. Jacoby, P. Moceyunas, H. Cho, and G. Hachtel. New ATPG Techniques for Logic
Optimization. In ICCAD, pages 548-551, 1989.

[24] K. Keutzer. DAGON: Technology Binding and Local Optimization by DAG Matching.
In Proceedings of the 24th Design Automation Conference, pages 341-347. ACM/IEEE,
June 1987.

[25] K. Keutzer, S. Malik, and A. Saldanha. Is redundancy necessary to reduce delay? In
Proceedings of the Design Automation Conference, pages 228-234, June 1990. Accepted
for publication, IEEE Transactions on Computer Aided Design.

[26] K. Keutzer and M. Vancura. Timing Optimization in a Logic Synthesis System. In
G. Saucier, editor, Proceedings of International Workshop on Logic and Arch. Synthesis
for Silicon Compilers, pages 1-13, Grenoble, France, May 1988. Inst. Nat.
Polytechnique.

[27] K. Keutzer and W. Wolf. Anatomy of a Hardware Compiler. In Proceedings of the
SIGPLAN ’88 Conference on Programming Language Design and Implementation, pages
95-104. ACM, June 1988.

[28] E. L. Lawler, K. L. Levitt, and J. Turner. Module clustering to minimize delay in digital
networks. IEEE Transactions on Computers, C-18(1):47-57, January 1969.

[29] Mario Lega. Private communication. AT&T Bell Laboratories, October 1990.

[30] C. E. Leiserson, F. M. Rose, and J. B. Saxe. Optimizing synchronous circuitry by
retiming. In R. Bryant, editor, 3rd Caltech Conference on Very Large Scale Integration,
pages 87-116, 1983.

[31] M. Lightner and W. Wolf. Experiments in Logic Optimization. In ICCAD, pages
286-289, 1990.


[32] R. Lisanke. Logic Synthesis and Optimization Benchmarks User Guide Version 2.0.
Technical report, MCNC, P.O. Box 12889, Research Triangle Park, NC 27709,
December 1988.

[33] P. McGeer. On the Interaction of Functional and Timing Behavior of Combinational
Logic Circuits. PhD thesis, U.C. Berkeley, November 1989.

[34] F. W. Obermeier and R. H. Katz. Combining Circuit Level Changes with Electrical
Optimization. In ICCAD-88, pages 218-221. IEEE, 1988.

[35] P. G. Paulin and F. Poirot. Logic Decomposition Algorithms for the Timing
Optimization of Multi-Level Logic. In International Conference on Computer Design,
pages 329-333. IEEE, October 1989.

[36] R. Rudell. Logic Synthesis for VLSI Design. PhD thesis, UC Berkeley, April 1989.
UCB/ERL M89/49.

[37] R. L. Rudell and A. Sangiovanni-Vincentelli. Multiple-valued minimization for PLA
optimization. IEEE Transactions on Computer-Aided Design, CAD-6(5):727-750,
September 1987.

[38] A. Saldanha, R. K. Brayton, A. L. Sangiovanni-Vincentelli, and K. T. Cheng. Timing
Optimization with Testability Considerations. In ICCAD, pages 460-463, 1990.

[39] M. Schulz, E. Trischler, and T. Sarfert. SOCRATES: a Highly Efficient ATPG System.
IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems,
7(1):126-137, January 1988.

[40] K. J. Singh and A. Sangiovanni-Vincentelli. A Heuristic Algorithm for the Fanout
Problem. In DAC, pages 357-360, June 1990.

[41] K. J. Singh, A. R. Wang, R. K. Brayton, and A. Sangiovanni-Vincentelli. Timing
Optimization of Combinational Logic. In ICCAD-88, pages 282-285. IEEE, 1988.

[42] H. Touati, C. Moon, R. K. Brayton, and A. Wang. Performance-Oriented Technology
Mapping. In MIT Press, editor, Proceedings of the sixth MIT VLSI Conference, pages
79-97, April 1990.

[43] D. Wallace. High-Level Delay Estimation for Technology-Independent Logic Equations.
In ICCAD, pages 188-191, 1990.


[44] K. Yoshikawa and H. Ichiryu. Timing Optimization by Technology Mapping and
Fanout Adjustment. Submitted to DAC, 1991.
