Tvlsi Fpga Arithmetic
Abstract: Hardened adder and carry logic is widely used in commercial FPGAs to improve the efficiency of arithmetic functions. There are many design choices and complexities associated with such hardening, including circuit design, FPGA architectural choices, and the CAD flow. However, these choices have seen little study, and hence we explore a number of possibilities. We also highlight front-end elaboration optimizations that help ameliorate the restrictions placed on logic synthesis by hardened arithmetic. We show that hard adders and carry chains increase performance of simple adders by a factor of four or more, but on larger benchmark designs that contain arithmetic they improve overall performance by 15%. Our results also show that for complete application circuits simple hardened ripple-carry adders perform as well as more complex carry-lookahead adders. Our best non-fracturable LUT architecture with hardened arithmetic yields 12% better area-delay product than architectures without hardened arithmetic. We also investigate the impact of fracturable LUTs and their interaction with hardened arithmetic. We find that fracturable LUTs offer significant (12-15%) area reductions, which are complementary to the delay reductions of hardened arithmetic. Therefore, our best fracturable LUT architectures, which use two bits of hardened arithmetic, achieve 25% better area-delay product than non-fracturable LUT architectures without hardened arithmetic.
Index Terms: Field programmable gate arrays, Digital arithmetic, Logic design, Design automation
I. INTRODUCTION
A key FPGA architecture question is which functions should be hardened and which should be left for implementation in the soft logic [1]. Hardening a function makes an FPGA more efficient if the function occurs often in applications, and there is a large advantage when it is implemented in hard, rather than soft, logic. As adder-type arithmetic functions appear often and hardened adders are much faster than soft adders, commercial devices commonly have hardened adder and/or carry logic and routing [2]–[5].
There are many degrees of freedom in the electrical and architectural design of hard adder logic, and in the software used to map a complete application to such structures. While commercial devices use a wide variety of hardened adder circuits and architectures (indicating there is no general agreement on the best options), there has been little published work that explores the trade-offs of different hardening choices, or on the software flow used to map arithmetic to these structures. We study a number of these choices and determine their impact on the performance and area of both micro-benchmarks and complete designs. Some examples include: First, how should adders and LUTs interact? For instance, should there be fast (but less flexible) adder inputs, or are flexible (but slower) inputs coming from LUTs preferable? Second, what are the trade-offs in terms of performance and area between large, fast, multi-bit adders, and smaller, slower, but more flexible, single-bit adders? Third, should adjacent hard adder units use dedicated links for carry signals crossing soft logic block boundaries (which constrains the placement problem) or use the more flexible regular routing fabric? Fourth, how should hard adders be integrated with a fracturable LUT (a large LUT that can be split into two smaller LUTs)? Does this affect how many bits of arithmetic should be associated with each LUT? These are important questions an architect must answer when embedding hard adders with soft logic, and we present quantitative measurements of the impact of each of these decisions.
Previous work in this area began in the early 1990s, when Hsieh et al. [6] described the Xilinx 4000 FPGA that had soft logic blocks that were capable of implementing two independent adder bits per block. They employed dedicated carry logic and routing from adjacent logic blocks for the carry signals. Woo [7] proposed adding additional flexibility to the fast carry links between logic blocks to enable flexible tree-based mappings of addition/subtraction/comparison functions. Both Hsieh and Woo targeted older FPGAs that had relatively fewer and smaller lookup tables in the logic block compared to the latest FPGAs.
Xing proposed implementing carry lookahead adders (in an FPGA architecture that contains just ripple adders) by using soft logic to do the carry lookahead operation [8]. His case study on the Xilinx 4000 series FPGAs shows that this approach is limiting because of the large area and delay penalty that results when soft logic is involved in carry lookahead computations. Hauck [9] evaluated different implementations for FPGA adders including ripple carry, carry-skip, and tree-based adders. He showed that a Brent-Kung adder achieves a 3.8 times speedup vs. the basic ripple carry adder for 32-bit addition, at the expense of between 5 and 9.5 times more area for the adder. Parandeh-Afshar has studied the implementation of compressor trees in commercial architectures [10], and proposed adding hardened compressors to soft logic blocks to speed up multi-input addition with a focus on DSP and video applications [11]. The benchmarks used in this study appear to be on the order of a few hundred 6-LUTs [12].
FPGA vendors currently choose different hard arithmetic architectures inside their soft logic blocks. The Xilinx Ultrascale
IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS. VOL. XX, NO. YY, 2019 2
TABLE II
DESIGN CHOICES EXPLORED
Architecture Name     Adder Prim.  Balanced  Chain Flex.  LUT Style  Bits per LUT
Soft                  N/A          N/A       N/A          LUT        0
Ripple No CLB Carry   Ripple       Yes       Soft         LUT        1
CLA No CLB Carry      CLA          Yes       Soft         LUT        1
Ripple                Ripple       Yes       Hard         LUT        1
CLA                   CLA          Yes       Hard         LUT        1
UCLA                  CLA          No        Hard         LUT        1
URipple               Ripple       No        Hard         LUT        1
frac soft             N/A          N/A       N/A          fLUT       0
frac ripple           Ripple       Yes       Hard         fLUT       1
frac 2ripple          Ripple       Yes       Hard         fLUT       2
frac uripple          Ripple       No        Hard         fLUT       1
frac 2uripple         Ripple       No        Hard         fLUT       2
Fig. 2. 5-LUTs and larger allow for more flexibility in technology-mapping addition.
TABLE V
EFFECT OF ODIN II OPTIMIZATIONS. ALL VALUES ARE NORMALIZED TO THE BASE CASE WITH NO OPTIMIZATIONS.
Circuit          DHR CLB   ULR CLB   Both CLB   Both Delay
arm core         0.97      0.95      0.94       0.92
bgm              1.00      0.80      0.80       0.87
blob merge       0.91      0.99      0.91       0.55
boundtop         0.92      0.99      0.90       1.00
LU8PEEng         0.84      0.99      0.83       1.01
LU32PEEng        0.83      0.99      0.82       0.98
mcml             1.00      0.91      0.91       0.94
mkSMAdapter4B    1.00      0.89      0.89       0.92
or1200           0.90      0.92      0.86       1.09
raygentop        0.93      0.94      0.87       0.92
sha              1.00      0.99      0.99       1.03
stereovision0    1.00      0.80      0.80       0.95
stereovision1    0.99      0.79      0.79       0.98
stereovision2    1.00      0.97      0.97       1.01
geomean          0.95      0.92      0.88       0.93
stdev            0.06      0.08      0.06       0.13
Fig. 8. fLUT with adder 1-bit vs. 2-bit balanced.
[Figure: CAD flow. Verilog circuits; elaboration (Odin II); synthesis and technology mapping (ABC); back-end packing, placement and routing (VPR), which also takes the FPGA architecture description file.]
Fig. 12. Example of transitive connections.
function are not logically equivalent. VPR 7.0 does not allow us to selectively switch off output pin logical equivalence in cases when the carry links are used by the BLEs. Hence, for correctness, we do not allow any BLE swaps, thus removing all output logical equivalence. Turning off logical equivalence for all outputs will lead to a slight pessimism on the routability of the soft logic only architecture vs. that of the hard adder architectures. To quantify this we evaluated the soft logic only architecture both with and without output equivalence enabled, and found the impact on delay and area was < 3% on the VTR benchmarks. Since the effect is small (compared to the impact of architectural modifications) this restriction will not change the architecture conclusions. Furthermore, to compensate for the restriction, each output pin can directly access two sides of the logic block, and hence both a vertical and a horizontal channel, ensuring the pins still have access to a diverse set of routing wires.
Fig. 13. Delay vs. adder length for various non-fracturable architectures. [Plot: Delay (ns) vs. Size of Addition (# bits); curves: Soft, Ripple No CLB Carry, CLA No CLB Carry, Ripple, CLA.]
V. MICROBENCHMARK RESULTS
In this section we explore the impact of the different ways of supporting arithmetic in an FPGA architecture when evaluated on simple adder microbenchmarks. Here, each circuit is an adder of N bits, where N ranges from one to 127. Both the inputs and outputs of the adder are registered, so the critical path delay measured is a direct function of the adder combinational logic delay. These registered adders are implemented using the flow described in Section IV. In this section the hard adder threshold (Section IV-B) is disabled in order to measure the impact of the different architectures at small bit widths.
Figure 13 shows the impact on critical path delay vs. width of addition, for the Soft, Ripple and CLA architectures, where the critical path delay is averaged over three placement seeds. In addition, two variants of the Ripple and CLA architectures are included, labelled no CLB carry, in which the general-purpose interconnect is used to implement carry links between soft logic blocks, rather than using dedicated carry links. Figure 14 shows the results for the fracturable LUT architectures. The unbalanced architectures are not included here as their performance difference vs. balanced is negligible on the microbenchmarks.
Fig. 14. Delay vs. adder length for various architectures with fracturable LUTs.
These results show trends that we generally expect, in that delay grows linearly with adder size, and that the more hardened architectures are faster.
In the extreme case, for 127-bit addition, it is interesting to note that a pure soft adder is ten times slower than the fastest (CLA) adder. For 32-bit addition, the hard adders provide a 3.4-fold speedup over the soft adder. The no CLB carry architectures have delay values in between fully hard and fully soft adder architectures. While the CLA architecture is the fastest of all, ripple carry is only 19% slower for 32-bit adders, and 42% slower even for 127-bit addition. A ripple architecture can sustain 400 MHz operation for even a 96-bit addition.
When adders are implemented in soft logic, CAD noise can have a significant impact on delay. The effect of this noise is evident in the figure when observing the delay of additions ranging from 17 bits to 25 bits for the soft logic architecture, where delays for additions of similar size can vary significantly as a result of CAD (in this case packer) noise. For hard adders, the lack of CAD flexibility forces a predictable physical design, thus greatly reducing CAD noise for these microbenchmarks. The combination of higher and more predictable performance provided by hard adders, especially those with hard inter-CLB links, is very desirable.
The data from this experiment also shows that a 3-bit addition implemented in soft logic is actually slightly faster than any of the hard-logic adders, further motivating the CAD hard adder thresholds described in Section IV-B.
Table VI shows the tile area for a logic block (including both inter and intra-block routing) in each architecture. Here we observe that the inclusion of hard adders increases tile area by < 2.5% over their respective baseline architectures. The delay, logic block count and area for implementing a 32-bit adder are also shown in Table VI. The architectures with hard adders are all substantially faster than the soft architectures. They also use fewer CLBs than the baseline Soft architecture and hence (with the exception of frac ripple and frac uripple) are more area efficient.
TABLE VI
32-BIT ADDER DELAY & AREA.
(Delay, CLBs and Total Area are for a 32-bit addition.)
Architecture    Tile Area (10^3 MTWA)   Delay (ns)   CLBs   Total Area (10^3 MTWA)
soft            19.84                   3.94         18     357.06
cla             20.25                   0.93         16     324.05
ucla            19.92                   0.83         16     318.77
ripple          20.14                   1.14         16     322.25
uripple         19.90                   1.03         16     318.34
frac soft       23.06                   3.74         14     322.77
frac ripple     23.06                   1.06         16     368.92
frac uripple    23.37                   1.15         16     373.86
frac 2ripple    23.40                   1.08         14     327.64
frac 2uripple   23.62                   1.01         14     339.71
Tile Area is the geomean across application benchmarks (Section VII) and so includes both logic and realistic routing area.
TABLE VII
GEOMETRIC MEAN OF DELAY AND NUMBER OF CLBS FOR CONSTANT MULTIPLICATION (128 RANDOMLY SELECTED CONSTANTS).
Fig. 17. Delays across the architectures on non-pipelined FIR filters. [Plot: x-axis Number of Taps; curves: soft, ripple, cla.]
Fig. 18. Delays across the architectures on FIR filters with no pipelining.
Fig. 21. Logic block counts across the architectures on non-pipelined FIR filters. [Plot: Num CLB vs. Number of Taps; curves: soft, ripple, cla.]
Fig. 22. Logic block counts across the architectures on FIR filters with no pipelining.
However, the gap between soft and hard adders is not as large for the fracturable architectures. With the pipelined filters, hardened carry chains do not improve speed over soft arithmetic as the critical path has moved into the multipliers.
Figures 21 and 22 show the effect on logic block count as we increase the number of taps of non-pipelined FIR filters for the non-fracturable and fracturable architectures respectively. Figures 23 and 24 show the same effect with pipelining. Architectures with hard adders show lower CLB counts than soft adders, but the absolute CLB counts are quite different between the fracturable and non-fracturable architectures. Interestingly, both frac soft and frac ripple are quite close in CLB count, but both are substantially lower than the non-fracturable architectures. This underscores our earlier observation that fracturable LUTs are much more area efficient at implementing soft adders than non-fracturable LUTs, thus lowering the area curve. Notably, using 2 bits of addition per fLUT is again the most efficient. This indicates that two bits of arithmetic achieves better utilization of the fLUT in front of the adders than the one-bit case.
VII. APPLICATION CIRCUIT RESULTS
We now focus on evaluating the different hard adder variants using full application benchmarks. These allow us to evaluate the overall quality of the different implementation approaches, considering overall delay and area. Architectures which minimize the area-delay product are the most efficient. The complete design benchmarks we use are from the VTR 7.0
Fig. 19. Delays across the architectures on FIR filters with pipelined adder trees. [Plot: Critical Path Delay (ns) vs. Number of Taps; curves: soft, ripple, cla.]
Fig. 20. Delays across the architectures on FIR filters with pipelining.
Fig. 23. Logic block counts across the architectures on pipelined FIR filters. [Plot: Num CLB vs. Number of Taps; curves: soft, ripple, cla.]
Fig. 24. Logic block counts across the architectures on FIR filters with pipelining.
TABLE XI
CIRCUIT-BY-CIRCUIT BREAKDOWN COMPARING THE U-CLA ARCHITECTURE TO THE SOFT ARCHITECTURE.
Circuit      Delay   Area   LUTs on Crit Path   CLA cout on Crit Path
LU32PEEng    0.76    1.05   0.62                26
mcml         0.55    1.05   0.25                144
or1200       0.65    1.11   0.15                19
sha          0.57    1.04   0.25                10
[Figure: y-axis Count (10^0 to 10^3); legend: Equal, Non-Adder Path, Adder Path.]
E. Adder Input Balancing
We now turn to consider how best to integrate the LUT and arithmetic circuitry. The balanced approach of splitting the 6-LUT into two 5-LUTs, where each 5-LUT drives a different adder input, has good symmetry. The unbalanced approach of using the 6-LUT to drive one adder input and a small mux to select BLE input pins for the other adder input offers richer LUT functionality feeding the adder input (six pins compared to five for the balanced case) but worse symmetry. It is thus unclear which of these two approaches is better. Note also that commercial FPGAs differ in their approach: Intel’s Stratix V [14] and Stratix 10 [15] FPGAs support a balanced style, while Xilinx’s Virtex 7 [5] and Ultrascale [13] FPGAs allow both unbalanced and balanced styles.
The second column of Table IX shows the normalized delay values for each of the different architectures on the application circuits. The delays for all architectures are similar, achieving an approximately 15% delay reduction. We therefore conclude that both balanced and unbalanced architectures achieve approximately the same overall delay.
Table X shows that the balanced and unbalanced architectures require virtually the same CLB count, indicating that the packer can fill both architectures with roughly the same amount of logic per CLB, despite the fact that the balanced architectures can use a LUT on each input of an adder instead of only one input. Interestingly, the unbalanced architectures require a channel width that is 4% lower, on average. This is due to the fact that the unbalanced architecture can use all 6 inputs of a BLE when in adder mode, while the balanced architectures can use only 5; the packer has more freedom on what to pack with the adder in the unbalanced architecture and reduces the number of signals to route between clusters. The net impact is that while the unbalanced architectures require slightly more logic area due to their extra 2:1 mux per BLE, they reduce overall area by 1% by reducing the required amount of inter-cluster routing.
F. Utility of Inter-CLB Carry
Dedicated carry links between logic blocks improve the speed of long adders significantly, as shown in Figure 13, but their use constrains the placement engine to keep long adders in a fixed relative placement, which may lengthen the wiring between other blocks. Table XII compares the QoR of architectures with soft inter-CLB carry links (i.e., routed using the general-purpose interconnect) normalized to their corresponding architectures with hard inter-CLB carry links. The first column is the architecture. The second column shows normalized delays for the 32-bit addition microbenchmark. The next three columns show the normalized geometric mean of delay, area, and area-delay product over the VTR+ benchmarks. Using soft inter-CLB links increases the delay of a 32-bit adder by 2.3x on average across the hard adder architectures, but increases the delay of the VTR+ designs by only 5%. The area cost of hard inter-CLB carry is negligible, as little hardware needs to be added to support them, and as their use does not significantly increase the required inter-CLB channel width, despite the constraint they create on the placement engine.
We expect that the impact of hard inter-CLB carry links is a strong function of the number of adder bits per logic block. Fewer bits per block means more inter-CLB links are required for an addition of a given size, which in turn may have a bigger impact on delay. Therefore, we believe that architectures with 8 adder bits per logic block (e.g. Ultrascale [13]) will benefit more from hard inter-CLB links than architectures with 20 bits per block (e.g. Stratix 10 [15]).
G. Bits of Addition per LUT
Table XIII compares the different fracturable LUT carry-chain architectures normalized to the fracturable soft architecture to study the impact of the number of addition bits per LUT. We see that critical path delay is fairly consistent across architectures with a reduction of 15% to 18% vs. the soft fracturable LUT architecture. The area numbers are quite different, with 2 bits of addition per fracturable LUT exhibiting a much lower area overhead (2% to 4% more area than soft), while 1 bit of addition per fracturable LUT uses 10% to 13% more area vs. soft. We therefore conclude that 2 bits of addition per fracturable LUT is a better architecture than 1 bit because it provides both slightly better delay and significantly better area, reducing area-delay product by 15% vs. soft.
H. Per Circuit Breakdown for frac 2uripple
Table XIV shows a circuit-by-circuit breakdown of our results for frac 2uripple, our best architecture. By providing absolute values, this table serves as a comparison point for future studies. The columns from left to right are: circuit name, minimum channel width, critical path delay in ns, number of soft logic blocks, and total soft logic area including routing in number of minimum width transistors.
I. Summary
Table XV compares our best hard adder architectures with the soft LUT and fracturable LUT architectures. For the soft adder architectures critical path delay is similar, while both hard adder architectures reduce critical path delay by 15-16%. However, the fracturable LUTs significantly reduce the number of CLBs required, by 12-15%. This reduces area but is slightly tempered by a higher minimum channel width. As a result, the combination of fracturable LUTs with 2 bits of hardened addition per LUT improves area-delay product by 25% compared to a soft (non-fracturable) LUT architecture.
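Two pieces of arithmetic behind the discussion of inter-CLB carry links and of bits of addition per LUT can be sketched in a few lines of Python. This is a back-of-the-envelope illustration, not the authors' methodology: the function names are hypothetical, the link count assumes an adder starts at the first adder bit of a logic block, and the area-delay bounds simply multiply the extremes of the percentage ranges quoted in the text (the actual per-architecture pairings are in Table XIII).

```python
import math


def inter_clb_carry_links(adder_bits: int, bits_per_block: int) -> int:
    """Carry hops between logic blocks needed by one adder.

    Assumption (for illustration): the carry chain starts at the first
    adder bit of a logic block, so it spans ceil(bits / bits_per_block) blocks.
    """
    blocks = math.ceil(adder_bits / bits_per_block)
    return max(blocks - 1, 0)


def area_delay_product(norm_area: float, norm_delay: float) -> float:
    """Area-delay product of normalized area and delay (baseline = 1.0)."""
    return norm_area * norm_delay


# Normalized ranges quoted in the text, relative to the fracturable soft architecture:
# delay is reduced by 15% to 18%; area grows 2% to 4% (2 bits per fLUT)
# or 10% to 13% (1 bit per fLUT).
two_bit_bounds = (area_delay_product(1.02, 0.82), area_delay_product(1.04, 0.85))
one_bit_bounds = (area_delay_product(1.10, 0.82), area_delay_product(1.13, 0.85))

if __name__ == "__main__":
    for bits_per_block in (8, 20):
        links = {n: inter_clb_carry_links(n, bits_per_block) for n in (32, 96, 127)}
        print(f"{bits_per_block:2d} adder bits per block -> inter-CLB links: {links}")
    print(f"2 bits per fLUT: area-delay product between "
          f"{two_bit_bounds[0]:.3f} and {two_bit_bounds[1]:.3f}")
    print(f"1 bit per fLUT:  area-delay product between "
          f"{one_bit_bounds[0]:.3f} and {one_bit_bounds[1]:.3f}")
```

Under these assumptions a 32-bit adder crosses three block boundaries with 8 adder bits per block but only one with 20 bits per block, which is the intuition behind the expectation that smaller chains per block benefit more from hard inter-CLB links.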
[22] J. Luu, J. Goeders et al., “VTR 7.0: Next Generation Architecture and CAD System for FPGAs,” ACM Trans. Reconfigurable Technol. Syst., vol. 7, no. 2, pp. 6:1–6:30, Jul. 2014.
[23] Berkeley Logic Synthesis and Verification Group, “ABC: A System for Sequential Synthesis and Verification,” http://www.eecs.berkeley.edu/~alanmi/abc, 2009.
[24] S. Jang et al., “SmartOpt: An Industrial Strength Framework for Logic Synthesis,” in ACM FPGA, 2009, pp. 237–240.
[25] J. Luu, J. Rose, and J. Anderson, “Towards Interconnect-Adaptive Packing for FPGAs,” in ACM FPGA, 2014.
[26] K. E. Murray, S. Whitty et al., “Titan: Enabling large and complex benchmarks in academic CAD,” in Int. Conf. on Field Programmable Logic and Applications, 2013, pp. 1–8.
Kevin E. Murray (S’12) is a PhD candidate at the University of Toronto, where he received his BASc and MASc in 2012 and 2015. He has previously been a visiting Research Assistant at Imperial College London, and worked on digital design flows at Advanced Micro Devices (AMD). His research interests include FPGA CAD and Architecture, timing analysis, and modular design flows. He has contributed extensively to the VTR project since 2012.
Jason Luu is a senior software engineer at Altera. He received the BASc degree in CE from the University of Waterloo in 2007 and the MASc and PhD degrees at the University of Toronto in 2010 and 2014 respectively. He has contributed 8 years to the open source VTR project for FPGA CAD and architecture research.
Matthew J. P. Walker (S’16) received his BASc in Computer Engineering from the University of Toronto in 2017, and is presently a MASc candidate in Computer Engineering at the same university, where he is researching CGRA mapping. He has previously worked at Altera and Intel PSG, and worked as a summer researcher in 2014 on the VTR project.
Conor McCullough (S’14) received the B.Eng. degree in computer engineering from the University of New Brunswick, Fredericton, NB, Canada, in 2014. He worked as a summer researcher in 2014 on the VTR project under the supervision of Kenneth Kent.
Bo Yan (S’14) received the B.Eng. degree in electronic and information engineering from Hubei University of Technology, Wuhan, Hubei, China, in 2010, and the M.Sc. degree in computer science, Fredericton, New Brunswick, Canada, in 2015.
Charles Chiasson (S’08–M’15) received the B.Eng. degree in electrical engineering from Université de Moncton in 2011 and the M.A.Sc. degree in electrical and computer engineering from the University of Toronto in 2013. He is currently with the Altera Toronto Technology Center, Altera Corporation, Toronto, ON, Canada.
Kenneth B. Kent (S’96–M’02–SM’13) received his BSc degree from Memorial University of Newfoundland, Canada, and MSc and PhD degrees from the University of Victoria, Canada. He is a professor in the Faculty of Computer Science at the University of New Brunswick, and Director of the IBM Centre for Advanced Studies-Atlantic, Canada. His research interests are Hardware/Software Co-Design, Virtual Machines, Reconfigurable Computing, and Embedded Systems. His research groups are key contributors to widely used software such as the IBM J9 Java virtual machine and the VTR (Verilog-To-Routing) FPGA CAD flow.
Jason Anderson (S’96–M’05) received the B.Sc. degree in computer engineering from the University of Manitoba, Winnipeg, MB, Canada, and the Ph.D. and M.A.Sc. degrees in electrical and computer engineering (ECE) from the University of Toronto (U of T), Toronto, ON, Canada. In 1997, he joined the FPGA Implementation Tools Group, Xilinx, Inc., San Jose, CA, working on placement, routing, and synthesis. He is currently an Associate Professor with the ECE Department at U of T and holds the Jeffrey Skoll Chair in Software Engineering. He has authored over 60 papers published in refereed conference proceedings and journals, and is an inventor on 26 issued U.S. patents. His current research interests include computer-aided design, architecture and circuits for FPGAs.
Jonathan Rose (F’09) is a Professor with the Department of Electrical and Computer Engineering, University of Toronto, Toronto, ON, Canada. He has worked in the area of FPGA CAD and architecture for over 20 years, including stints at the two major vendors, Xilinx and Altera, as well as a startup. Prof. Rose is a Fellow of the ACM and a Foreign Member of the American National Academy of Engineering.