Chortle CRF
1 Introduction
Field Programmable Gate Arrays (FPGAs) are a recent innovation in Application Specific Integrated Circuits (ASICs) that provide both large scale integration and user-programmability [Hsie88] [Ahre90]. The user-programmability of FPGAs can dramatically reduce ASIC turn-around time and manufacturing costs. An FPGA consists of an array of programmable logic blocks and a programmable routing network. An important class of FPGAs consists of those that use logic [...]

1 This work was supported by NSERC Operating Grants #URFO043298 and #OGPOO05280, a research grant from Bell-Northern Research, and a research grant from the ITRC of Ontario.

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission.

2 Background
Technology mapping produces a circuit that implements a combinational network using a restricted set of circuit elements. Early work in technology mapping, such as SOCRATES [Greg86] and the work by Kahrs [Kahr86], focused on circuits created from standard cell libraries. An important advance in library-based technology mapping was the introduction of dynamic programming by Keutzer [Keut87]. Other library-based technology mappers include misII [Detj87] and McMAP [Lisa87].
A lookup table of K inputs can implement 2^(2^K) different Boolean functions of K variables. For values of K greater than 3 the library required to describe a K-input lookup table becomes impractically large and therefore technology mapping algorithms that deal specifically with lookup tables are required [Fran90]. Two previously reported lookup table technology mappers are Chortle [Fran90] and mis-pga [Murg90a].
The Chortle technology mapper presented in [Fran90] uses an exhaustive search to find the optimal gate-level decomposition of every node in a fanout-free tree. However, the partitioning of the original network into [...]
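To make the library-size argument in the Background section concrete, the short sketch below tabulates the number of Boolean functions of K variables, 2^(2^K), for small K. The function name `lut_function_count` is an illustrative choice, not something taken from the paper, and the count is of functions rather than of library cells.

```python
def lut_function_count(k: int) -> int:
    """Number of distinct Boolean functions of k variables: 2 ** (2 ** k).

    A k-input truth table has 2 ** k rows, and each row can independently
    be 0 or 1, which gives 2 ** (2 ** k) possible tables.
    """
    if k < 0:
        raise ValueError("k must be non-negative")
    return 2 ** (2 ** k)


if __name__ == "__main__":
    # Doubly exponential growth: the count squares each time k grows by one.
    for k in range(2, 6):
        print(f"K={k}: {lut_function_count(k)} functions")
```

The count squares with each additional input, which is consistent with the text's point that a library-based description stops being practical beyond very small K.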
[Figure 1: b) circuit of 5-input lookup tables]
[Figure 2: b) with gate decomposition]
Paper 15.1
228
[...] fanin lookup tables of Figure 3a.

3.1.1 Two-Level Decomposition
The two-level decomposition consists of a single first-level node and several second-level nodes. In Figure 3b the 3-input OR node is the first-level node and its three inputs are the second-level nodes. Each second-level node implements the operation of the node being decomposed over a subset of one, some, or all of the fanin lookup tables. In Figure 3b there are three second-level nodes each of which is implemented by a lookup table. The first-level node is not yet implemented by any lookup tables, however, it will be implemented when the two-level decomposition is converted into a multi-level decomposition.
The two-level decomposition is constructed using a bin packing algorithm. In general, the goal of bin packing is to find the minimum number of bins into which a set of boxes can be packed [Gare79]. In this case, the [...] is K, and the size of each box (fanin lookup table) is its number of used inputs. In Figure 3a the boxes have sizes 3, 2, 2, 2, and 2. In Figure 3b the final contents [...] Figure 4 [Gare79].

FirstFitDecreasing
{
    start with an empty bin list
    ...
}

[...] two-level decomposition. Any second-level lookup table with unused inputs can be used to implement a portion of the first-level tree, thereby reducing the total number of lookup tables in the decomposition tree. Figure 3c illustrates the multi-level decomposition constructed from the two-level decomposition of Figure 3b. The detailed procedure for converting the two-level decomposition into a multi-level decomposition is outlined in Figure 5.
The final multi-level decomposition can be shown to be optimal if the network is a fanout-free tree and the value of K is less than or equal to 5 [Fran91]. For networks partitioned into fanout-free trees the bin packing approach is up to 28 times faster than the previous exhaustive search approach [Fran90], yet it produces circuits with the same number of lookup tables. This improvement in speed makes it practical to consider optimization exploiting reconvergent paths and replication of logic at fanout nodes, as discussed in the following sections.
MultiLevel
{
    while there is more than one unconnected bin
    {
        if there are no free inputs among the
           remaining unconnected bins
        {
            create an empty bin and
            add it to the end of the bin list
        }
        ...
    }
}

[Figure: a) fanin lookup tables with shared input]
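The MultiLevel listing above is only partly legible, so the following is a hedged sketch of one way such a conversion could work, not the paper's procedure. It assumes that each step feeds the output of the fullest unconnected bin into the first other unconnected bin that has a free input, and that a new bin is created when none has room. That selection rule, the function name `multi_level_lut_count`, and the representation of a bin as a used-input count are all assumptions made for illustration.

```python
from typing import List


def multi_level_lut_count(used_inputs: List[int], k: int) -> int:
    """Count lookup tables after connecting second-level bins into a tree.

    used_inputs: number of used inputs in each second-level bin (each <= k).
    Assumed rule: repeatedly connect the fullest unconnected bin to the
    first remaining unconnected bin with a free input, creating an empty
    bin when no remaining bin has a free input.
    """
    if k < 2:
        raise ValueError("k must be at least 2")
    if any(u > k for u in used_inputs):
        raise ValueError("a bin uses more than k inputs")
    unconnected = list(used_inputs)
    total = len(unconnected)
    while len(unconnected) > 1:
        # Take the fullest unconnected bin; its output must go somewhere.
        source = unconnected.index(max(unconnected))
        unconnected.pop(source)
        for i, used in enumerate(unconnected):
            if used < k:
                unconnected[i] = used + 1  # one free input now carries the output
                break
        else:
            # No free inputs among the remaining bins: add a new bin that
            # receives the output on its first input.
            unconnected.append(1)
            total += 1
    return total


if __name__ == "__main__":
    # Bin fills from packing boxes 3, 2, 2, 2, 2 into bins of capacity 5.
    print(multi_level_lut_count([5, 4, 2], k=5))
```

Under these assumptions the three bins with 5, 4, and 2 used inputs are chained without adding any lookup table, while two completely full bins would need one extra lookup table to combine their outputs.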
[Figure 7: a) no replicated logic; b) with replicated logic]

Table 1
Network   Chortle-crf                              mis-pga
          -c        -cr       -cf       -crf
          lookups   lookups   lookups   lookups   lookups
z4ml         9         9         9         6         8
misex1      20        20        19        19        11
vg2         24        24        23        21        30
5xp1        34        31        34        27        31
count       47        45        40        31        31
9symml      63        59        62        55        56
9sym        69        65        67        59        72
apex7       72        71        71        64        64
rd84        76        76        74        73        40
e64         95        95        80        80        82
C880       115       110       112        86       103
apex2      123       123       121       120        80
alu2       131       121       127       116       129
duke2      138       136       126       120       128
C499       166       164       158        74        66
rot        219       207       208       189       200
apex6      232       219       230       212       243
alu4       238       219       227       195       235
apex4      603       600       579       558       765
des       1073      1060      1050       952      1016
[...] output of the network. These subsequent fanout nodes and primary outputs will be referred to as the visible nodes.
To determine if the replication is worthwhile Chortle-crf solves a series of subproblems. For each visible node the Best Circuit implementing the visible node is constructed twice; once with the replication and once without the replication. Each subproblem is itself solved using Chortle-crf with the assumption that any remaining fanout nodes encountered in these subproblems are explicitly implemented and can therefore be treated like primary inputs. The bin packing approach is fast enough to make solving these subproblems practical.
After the subproblems have been solved the total number of lookup tables required to implement the visible nodes both with and without the replication are known. If the total number of lookup tables is reduced by the replication, then the replication is retained. The replication of logic is considered at every fanout node as it is encountered by the dynamic programming traversal of the network.

[...]
-cf using the replication optimization
-crf using both reconvergent and replication [...]
The first step in the experimental procedure was technology independent logic optimization using the misII logic optimizer with the standard script [Bray86]. Chortle-crf was then used to implement the networks as circuits of 5-input lookup tables. Note that Chortle-crf is capable of implementing networks as circuits of K-input lookup tables for values of K from 2 to 10.
Table 1 records the number of 5-input lookup tables required to implement the networks in each of the four experiments. The reconvergent optimization reduced the total number of lookup tables required to implement the networks by 2.7 %, and the replication optimization reduced the total number of lookup tables by 3.7 %. Combining both optimizations reduced the total number of lookup tables by 14 %.
The reduction achieved when using both optimizations together often exceeds the sum of the individual reductions. This occurs when reconvergent paths that cross fanout nodes are found and realized within a single [...]
Network   Chortle-crf
          -c      -cr     -cf     -crf
          CLBs    CLBs    CLBs    CLBs
z4ml         5       5       7       3
misex1      14      14      14      14
vg2         20      19      21      18
5xp1        23      20      23      20
count       32      31      32      27
9symml      50      42      50      41
9sym        52      44      56      42
apex7       48      45      49      42
rd84        52      52      53      53
e64         48      48      54      54
C880        75      70      94      69
apex2       94      90      97      93
alu2        94      86      98      83
duke2       88      87      91      89
C499        84      84      96      50
rot        134     129     144     131
apex6      169     161     169     161
alu4       165     144     174     138
apex4      457     451     463     448
des        714     695     797     743
total     2418    2317    2582    2319
Table 2: CLB Results

Network    Chortle-crf        mis-pga            XNFOPT
           CLBs    sec. 1     CLBs    sec. 2     CLBs    sec. 1
z4ml          3      0.8         7                  6    296.5
misex1       14      0.7        10                 12    298.2
vg2          18      0.6        21                 20    299.7
5xp1         20      3.2        23                 19    301.1
count        27      2.0        28                 32    301.9
9symml       41     59.1        43                 56    901.2
9sym         42     62.9        59                 52    305.1
apex7        42      2.9        50    117.3        51    304.6
rd84         53     15.4        32     65.1        38    303.2
e64          54      1.9        61                 65    901.5
C880         69     12.6        82                101   1809.4
apex2        93     34.9        70                102    909.7
alu2         83     56.3       102                 91    907.8
duke2        89      9.1       105    357.1        99    903.6
C499         50     15.9        50    137.5       121   1847.0
rot         131     14.0       153    844.8       166   1811.4
apex6       161     25.3       191   1376.8       198   1822.6
alu4        138    178.1       189                232   1849.4
subtotal   1128
apex4       448
des         743
total      2319
[Two further mis-pga execution times, 25.6 and 45.5 sec., belong to two of the networks misex1 through 9symml; their rows are not recoverable.]
1 execution times on a Sun 3/60
2 execution times on a VAX 8800
[...] estimate a VAX 8800 is twice as fast as a Sun 3/60. Taking into account the relative speed of the Sun 3/60 and the VAX 8800, Chortle-crf is an average of 68 times faster than mis-pga and 30 times faster than XNFOPT.

5 Conclusions
The bin packing approach to gate decomposition described in this paper is up to 28 times faster than a previous exhaustive search approach. The improved speed of gate decomposition makes it practical to consider local optimizations that exploit both reconvergent paths and replication of logic at fanout nodes.
Using both of these optimizations, Chortle-crf required 14 % fewer 5-input lookup tables than Chortle [Fran90] and 10 % fewer lookup tables than mis-pga [Murg90a] to implement a set of benchmark networks.
Chortle-crf is also capable of implementing networks as circuits of Xilinx 3000 series CLBs. To implement the benchmark networks as circuits of CLBs, Chortle-crf required 12 % fewer CLBs than mis-pga and 22 % fewer CLBs than XNFOPT. On average, Chortle-crf was 68 times faster than mis-pga and 30 times faster than XNFOPT.
[...] local search. As well, realizing a pair of reconvergent paths within a single lookup table may depend upon the replication of logic at multiple fanout nodes.
There are cases where the optimizations requiring replication of logic at different fanout nodes may be mutually exclusive. A computationally tractable method of determining which set of replications at fanout nodes will result in the minimum number of lookup tables for the entire network is needed.

References
[Ahre90] M. Ahrens, et al., "An FPGA Family Optimized for High Densities and Reduced Routing Delay," Proc. 1990 CICC, May 1990, pp. 31.5.1-31.5.4.
[Fran90] R. J. Francis, J. Rose, K. Chung, "Chortle: A Technology Mapping Program for Lookup Table-Based Field Programmable Gate Arrays," Proc. 27th DAC, June 1990, pp. 613-619.
[Fran91] R. J. Francis, "Technology Mapping for Lookup Table-Based FPGAs," Ph.D. Thesis in preparation, University of Toronto, Department of Electrical Engineering.
[Gare79] M. R. Garey, D. S. Johnson, "Computers and Intractability, A Guide to the Theory of NP-Completeness," W. H. Freeman and Co., 1979, pp. 124-129.
[Gibb85] A. Gibbons, "Algorithmic Graph Theory," Cambridge University Press, 1985, pp. 125-133.
[Greg86] D. Gregory, et al., "Socrates: a system for automatically synthesizing and optimizing combinational logic," Proc. 23rd DAC, June 1986, pp. 79-85.
[Hsie88] H. Hsieh, et al., "A 9000-Gate User-Programmable Gate Array," Proc. 1988 CICC, May 1988, pp. 15.3.1-15.3.7.
[Kahr86] M. Kahrs, "Matching a parts library in a silicon compiler," IEEE ICCAD, 1986, pp. 169-172.
[Keut87] K. Keutzer, "DAGON: Technology Binding and Local Optimization by DAG Matching," Proc. 24th DAC, June 1987, pp. 341-347.
[Rose90] J. Rose, R. J. Francis, D. Lewis, P. Chow, "Architectures of Field-Programmable Gate Arrays: The Effect of Logic Block Functionality on Area Efficiency," IEEE Journal of Solid-State Circuits, Vol. 25, No. 5, Oct. 1990, pp. 1217-1225.
[Xili89] XACT LCA Development System, Vol. II, Xilinx Inc., 1989.