IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. XX, NO. YY, 2019

Optimizing FPGA Logic Block Architectures for Arithmetic
Kevin E. Murray∗ , Jason Luu∗ , Matthew J. P. Walker∗ , Conor McCullough† , Sen Wang† , Safeen Huda∗ , Bo Yan† ,
Charles Chiasson∗ , Kenneth B. Kent† , Jason Anderson∗ , Jonathan Rose∗ , Vaughn Betz∗
∗ Dept. of Electrical and Computer Engineering, University of Toronto
† Faculty of Computer Science, University of New Brunswick

Abstract: Hardened adder and carry logic is widely used in commercial FPGAs to improve the efficiency of arithmetic functions. There are many design choices and complexities associated with such hardening, including circuit design, FPGA architectural choices, and the CAD flow. However, these choices have seen little study, and hence we explore a number of possibilities. We also highlight front-end elaboration optimizations that help ameliorate the restrictions placed on logic synthesis by hardened arithmetic. We show that hard adders and carry chains increase performance of simple adders by a factor of four or more, but on larger benchmark designs that contain arithmetic they improve overall performance by 15%. Our results also show that for complete application circuits simple hardened ripple-carry adders perform as well as more complex carry-lookahead adders. Our best non-fracturable LUT architecture with hardened arithmetic yields 12% better area-delay product than architectures without hardened arithmetic. We also investigate the impact of fracturable LUTs and their interaction with hardened arithmetic. We find that fracturable LUTs offer significant (12-15%) area reductions, which are complementary to the delay reductions of hardened arithmetic. Therefore our best fracturable LUT architectures, which use two bits of hardened arithmetic, achieve 25% better area-delay product than non-fracturable LUT architectures without hardened arithmetic.

Index Terms: Field programmable gate arrays, Digital arithmetic, Logic design, Design automation

I. INTRODUCTION

A key FPGA architecture question is which functions should be hardened and which should be left for implementation in the soft logic [1]. Hardening a function makes an FPGA more efficient if the function occurs often in applications, and there is a large advantage when it is implemented in hard, rather than soft, logic. As adder-type arithmetic functions appear often and hardened adders are much faster than soft adders, commercial devices commonly have hardened adder and/or carry logic and routing [2]–[5].

There are many degrees of freedom in the electrical and architectural design of hard adder logic, and in the software used to map a complete application to such structures. While commercial devices use a wide variety of hardened adder circuits and architectures (indicating there is no general agreement on the best options), there has been little published work that explores the trade-offs of different hardening choices, or on the software flow used to map arithmetic to these structures. We study a number of these choices and determine their impact on the performance and area of both microbenchmarks and complete designs. Some examples include: First, how should adders and LUTs interact? For instance, should there be fast (but less flexible) adder inputs, or are flexible (but slower) inputs coming from LUTs preferable? Second, what are the trade-offs in terms of performance and area between large, fast, multi-bit adders, and smaller, slower, but more flexible, single-bit adders? Third, should adjacent hard adder units use dedicated links for carry signals crossing soft logic block boundaries (which constrains the placement problem) or use the more flexible regular routing fabric? Fourth, how should hard adders be integrated with a fracturable LUT (a large LUT that can be split into two smaller LUTs)? Does this affect how many bits of arithmetic should be associated with each LUT? These are important questions an architect must answer when embedding hard adders with soft logic, and we present quantitative measurements of the impact of each of these decisions.

Previous work in this area began in the early 1990s, when Hsieh et al. [6] described the Xilinx 4000 FPGA that had soft logic blocks that were capable of implementing two independent adder bits per block. They employed dedicated carry logic and routing from adjacent logic blocks for the carry signals. Woo [7] proposed adding additional flexibility to the fast carry links between logic blocks to enable flexible tree-based mappings of addition/subtraction/comparison functions. Both Hsieh and Woo targeted older FPGAs that had relatively fewer and smaller lookup tables in the logic block compared to the latest FPGAs.

Xing proposed implementing carry lookahead adders (in an FPGA architecture that contains just ripple adders) by using soft logic to do the carry lookahead operation [8]. His case study on the Xilinx 4000 series FPGAs shows that this approach is limiting because of the large area and delay penalty that results when soft logic is involved in carry lookahead computations. Hauck [9] evaluated different implementations for FPGA adders including ripple carry, carry-skip, and tree-based adders. He showed that a Brent-Kung adder achieves a 3.8 times speedup vs. the basic ripple carry adder for 32-bit addition, at the expense of between 5 and 9.5 times more area for the adder. Parandeh-Afshar has studied the implementation of compressor trees in commercial architectures [10], and proposed adding hardened compressors to soft logic blocks to speed up multi-input addition with a focus on DSP and video applications [11]. The benchmarks used in this study appear to be on the order of a few hundred 6-LUTs [12].

FPGA vendors currently choose different hard arithmetic architectures inside their soft logic blocks. The Xilinx Ultrascale
FPGA family [13] contains a basic ripple carry architecture where addition can only start on every 8th adder bit (up from every 4th bit in Virtex 7 [5]). The interaction between the soft logic and the adder is flexible; the adder can either be driven by a 6-LUT and a logic block input pin or be driven by two 5-LUTs (fractured from the 6-LUT) with shared inputs. Each fracturable 6-LUT drives one bit of arithmetic. The Intel Stratix V architecture uses a two-level carry-skip adder architecture [2]. Each soft logic block contains ten 2-bit carry-skip adders that can be cascaded with dedicated links. Between two logic blocks, there is an additional carry-skip stage that can skip 20 bits of addition. Lewis claims that this adder results in both a delay improvement and an area reduction compared to the basic ripple carry adder, as the increase in logic gates necessary for the carry-skip feature is more than offset by the area reduction made possible via transistor size optimization. Each fracturable LUT in Stratix V drives two bits of arithmetic, with each adder input driven by a 4-LUT with input sharing constraints [14]. The recent Stratix 10 family has a similar arithmetic structure but has removed the 20-bit carry-skip hardware [15]. Outside of microbenchmarks, neither vendor has published, in depth, the impact of the major design decisions for their hard adder and carry chain architectures.

Prior published work on hardened arithmetic focused on the implementation of arithmetic structures, and evaluated results on microbenchmarks like adders and adder trees or very small designs. A full design, on the other hand, imposes many other demands on the FPGA and its CAD flow. We seek to measure the impact of different hard adder choices not only on microbenchmarks, but also on complete designs with a full CAD flow.

We published an earlier version of this work in [16]. This paper extends that work in two important ways by including 1) a new study on fracturable LUT architectures and 2) a new investigation of arithmetic-heavy benchmarks that exhibit characteristics in between adder microbenchmarks and general benchmarks. We also perform additional analysis of how hard adders change a circuit's timing path delay distribution.

This paper begins with a description of the base FPGA architecture (Section II), hard arithmetic structures (Section III), and CAD (Section IV) to handle the unique properties of carry chains. Afterwards, we discuss the effects of hardened arithmetic starting with the pure-adder microbenchmarks (Section V), arithmetic-heavy kernels (Section VI), and then full application benchmarks (Section VII).

II. BASELINE ARCHITECTURES

The base FPGA architecture used in this study is designed in a 22nm CMOS process, and is a heterogeneous architecture with soft logic blocks, simple I/Os, configurable memories and fracturable multipliers.

The internal connectivity of the blocks is provided by a 50% depopulated crossbar that connects block inputs and BLE outputs to the BLE inputs. We have chosen a depopulated crossbar as this is common in most commercial devices [2], [5]. The depopulated crossbar is composed of four smaller, fully populated crossbars as designed by Chiasson [17]; this depopulation results in the soft logic block inputs being divided into four groups of ten logically equivalent pins. The input pins are evenly distributed on the bottom and the right sides of the logic block, as this simplifies the layout of the FPGA.

Table I gives the routing architecture parameters of the base architecture. In addition to logic blocks the architecture includes hard 32K-bit RAM blocks (with configurable width/depth) and DSP blocks (36x36 bit multipliers which can be fractured down to two 18x18 or four 9x9 multipliers). These values are chosen to be in line with the recommendations of prior research [17], [18].

TABLE I
ROUTING ARCHITECTURE PARAMETERS.
Parameter                             Value
Cluster input flexibility (Fcin)      0.2
Cluster output flexibility (Fcout)    0.1
Switch block flexibility (Fs)         3
Wire segment length (L)               4
Switch Block Type                     Wilton
Interconnect Style                    Single-driver

Fig. 1. The base soft logic block consists of 8 BLEs connected by a 50% depopulated crossbar. Each BLE consists of a LUT and a flip-flop with fast feedforward and feedback paths reflecting what is commonly found in state-of-the-art FPGAs.

A. Non-Fracturable LUTs

Figure 1 illustrates the baseline non-fracturable soft logic block used in this study, which contains eight Basic Logic Elements (BLEs), 40 general inputs, eight general outputs, one cin pin and one cout pin. The BLE consists of a non-fracturable 6-input LUT with an optionally registered output pin. There are cin and cout pins into and out of the BLE, respectively, to drive a hard adder. The specific details are described in Section III below. There is also a fast path from the flip-flop output to the LUT input. We also consider architectures that do not contain hardened arithmetic, and hence have neither cin nor cout pins.

Our choice of following industry trends on using larger LUTs has interesting implications in terms of the efficiency of addition. When implementing arithmetic using only 4-LUTs, every bit of addition requires one LUT for the sum and another LUT for the carry. With 5-LUTs and larger, a soft implementation of arithmetic can be more efficient. Figure 2 shows how three LUTs can implement 2 bits of addition. With fracturable 6-LUTs, this benefit grows even larger, as fracturing into the two 5-LUTs mode allows implementation of both the 2-bit carry and a sum operation in a single fracturable 6-LUT.
Fig. 2. 5-LUTs and larger allow for more flexibility in technology-mapping addition.

B. Fracturable LUTs

FPGAs have traditionally used non-fracturable LUTs as described above. However, many modern commercial FPGA soft logic blocks now employ fracturable LUTs (fLUTs) to obtain the performance advantages of 6-LUTs with the area advantages of 4-LUTs [19]. Some academic work has questioned whether the additional flexibility of fLUTs is worth their cost [20]. It is notable, however, that [20] did not consider the impact of hardened arithmetic, which we find to be significant. Fracturable LUTs change the interaction of soft logic with hard adders and carry chains, so it is important to evaluate their combined effect.

As much as possible, our fLUT architectures reuse the same architecture as the non-fracturable case so that we may compare between these architectures. Instead of BLEs, our baseline fLUT architecture uses fracturable Basic Logic Elements (fBLEs) which, as shown in Figure 3, contain one fracturable LUT (fLUT) with optionally registered outputs. Unlike the baseline non-fracturable architecture (Figure 1), which had only one output per BLE, each fBLE has two independent outputs. Therefore the internal crossbar inside the soft logic block has an additional eight inputs (local feedbacks) compared to the non-fracturable case, and the soft logic block has 16 outputs instead of 8.

Fig. 3. A baseline fracturable Basic Logic Element (fBLE) which contains one fracturable LUT (fLUT) with optionally registered outputs.

Figure 4 shows the baseline fLUT. This fLUT can operate either as one 6-LUT or two 5-LUTs with four shared inputs. This choice of shared inputs is between that of a Virtex-style fracturable LUT [13], where all 5 inputs of the 5-LUTs are shared, and an Intel-style fracturable LUT [14], where 2 inputs of the two 5-LUTs are shared.

Fig. 4. Baseline fLUT which operates as either one 6-LUT or two 5-LUTs with four shared inputs.

C. Area and Delay Models

Transistor-level design of the base soft logic blocks and routing architecture is performed with the COFFE tool [17] and a 22nm CMOS technology. The architecture uses pass gates; statically controlled pass gates are gate-boosted by 0.2 V [21]. The architecture, area, and delay models for the memories and multipliers are scaled to 22nm from the comprehensive 40nm architecture in the VTR 7.0 release.

III. HARD ADDER AND CARRY CHAIN ARCHITECTURES

To evaluate the impact of including hard adders in FPGA logic blocks we explore different design choices relating to adder implementation and interaction with the rest of the logic block. Table II summarizes the different design choices, which are described in detail below.

TABLE II
DESIGN CHOICES EXPLORED
Architecture Name      Adder Prim.   Balanced   Chain Flex.   LUT Style   Bits per LUT
Soft                   N/A           N/A        N/A           LUT         0
Ripple No CLB Carry    Ripple        Yes        Soft          LUT         1
CLA No CLB Carry       CLA           Yes        Soft          LUT         1
Ripple                 Ripple        Yes        Hard          LUT         1
CLA                    CLA           Yes        Hard          LUT         1
UCLA                   CLA           No         Hard          LUT         1
URipple                Ripple        No         Hard          LUT         1
frac soft              N/A           N/A        N/A           fLUT        0
frac ripple            Ripple        Yes        Hard          fLUT        1
frac 2ripple           Ripple        Yes        Hard          fLUT        2
frac uripple           Ripple        No         Hard          fLUT        1
frac 2uripple          Ripple        No         Hard          fLUT        2

A. Adder Primitive

To ensure we fairly compare various hard adder and carry chain architectures, we carefully electrically designed two hard adder primitives and hand optimized them at the transistor level. The first adder primitive is a basic 1-bit full adder. In a soft logic block, eight of these full adders are linearly chained together to form a ripple carry chain. Table III shows the properties of the 1-bit hard full adder used in this study. Area is measured as minimum width transistor areas (MWTAs), using the transistor drive to area conversion equations from Chiasson [17]. The adder circuitry, LUTs and routing are all designed with a similar goal of minimizing the area-delay product of the FPGA, and the cin to cout path of the adder is particularly optimized for delay as it occurs n-1 times on an n-bit adder.

The second adder primitive is a 4-bit carry-lookahead adder (CLA). Each logic block contains two of these 4-bit adders chained in a ripple carry fashion. Table IV shows the properties of the 4-bit carry lookahead adder used in this study. The carry lookahead optimization allows for a faster carry path (20 ps) compared to a ripple of four 1-bit adders (44 ps) when performing a 4-bit addition. The CLA design trades off flexibility (as some bits are wasted if the desired adder length is not divisible by 4) and area in exchange for speed.
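
The carry-path comparison above can be reproduced with a few lines of arithmetic. The sketch below is a simplified model and not the timing analysis used in this work: it counts only the cin-to-cout delay of each primitive (11 ps per 1-bit adder, which is the quoted 44 ps divided by four, and 20 ps per 4-bit CLA), and it ignores the sumin-to-cout entry delay, the cin-to-sumout exit delay, and any inter-block carry link. The function names are hypothetical.

```python
RIPPLE_CIN_TO_COUT_PS = 11  # one 1-bit full adder (4 x 11 ps = 44 ps)
CLA4_CIN_TO_COUT_PS = 20    # one 4-bit carry-lookahead adder
CLA_WIDTH_BITS = 4


def ripple_carry_path_ps(n_bits):
    """Carry delay when the carry ripples through n_bits 1-bit adders."""
    return n_bits * RIPPLE_CIN_TO_COUT_PS


def cla_blocks_needed(n_bits):
    """Number of 4-bit CLA primitives needed to cover n_bits (rounded up)."""
    return -(-n_bits // CLA_WIDTH_BITS)


def cla_carry_path_ps(n_bits):
    """Carry delay when the carry ripples between chained 4-bit CLAs."""
    return cla_blocks_needed(n_bits) * CLA4_CIN_TO_COUT_PS


def cla_wasted_bits(n_bits):
    """Adder bits left unused when n_bits is not a multiple of four."""
    return cla_blocks_needed(n_bits) * CLA_WIDTH_BITS - n_bits
```

For a 10-bit addition this model gives 110 ps of ripple carry delay against 60 ps for three chained CLAs, at the cost of two unused CLA bits, which is the flexibility and area versus speed trade-off described above.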
B. Adder Input Balancing

Figure 5 shows one way the LUT and adder (within a BLE) may interact. Here, we make use of the observation that a 6-LUT is constructed with two 5-LUTs and a mux. If that mux is dropped, then the adder can be driven by two 5-LUTs, where the LUTs share inputs. If the adder is not used, then another mux can be used to produce the 6-LUT output. We call this the balanced LUT interaction, and its underlying rationale is that a symmetric amount of prior logic for each adder input may be the most appropriate architecture. Example circuits that may benefit from this architecture would be applications where multiplexers select the inputs to an adder. Similar interaction for the CLA is shown in Figure 7.

Fig. 5. A LUT with balanced adder interaction where both adder inputs are driven by 5-LUTs.

Figure 6 shows another LUT-adder interaction architecture that we will explore. Here, the 6-LUT output drives one of the adder inputs and the other adder input is driven by one of the 6-LUT inputs. As with the previous case, if the adder is not used, then another mux can be used to select the 6-LUT output. We call this the unbalanced LUT interaction. We model each additional SRAM-controlled 2-to-1 mux (one per BLE for the balanced LUT interaction, two per BLE for the unbalanced LUT interaction) as having 22 ps of delay and occupying 15 minimum width transistor areas (including the SRAM configuration bit). The underlying rationale for this architecture is that there might be an advantage to allowing a faster input into one side of the adder, which would be appropriate when speed was an issue.

Fig. 6. A LUT with unbalanced adder interaction where the 6-LUT drives only one adder input.
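
To make the two LUT-adder interactions concrete, the following Python sketch gives a purely functional model of a BLE in adder mode. It is an illustration under stated assumptions and not the circuit of Figures 5 and 6: LUTs are modeled as truth-table lists, the output-select muxes and flip-flops are omitted, and the names `balanced_ble`, `unbalanced_ble` and `extra_mux_area_mwta` are hypothetical. The 15 MWTA mux area and the one-versus-two muxes per BLE come from the text above.

```python
from itertools import product

MUX_AREA_MWTA = 15  # area modeled for each extra SRAM-controlled 2-to-1 mux


def lut_eval(table, bits):
    """Look up a truth table; bits[0] is the most significant index bit."""
    index = 0
    for bit in bits:
        index = (index << 1) | bit
    return table[index]


def full_adder(a, b, cin):
    """Return (sumout, cout) of a 1-bit hard adder."""
    total = a + b + cin
    return total & 1, total >> 1


def balanced_ble(table_a, table_b, inputs5, cin):
    """Balanced: two 5-LUTs sharing the same five inputs drive both addends."""
    return full_adder(lut_eval(table_a, inputs5),
                      lut_eval(table_b, inputs5), cin)


def unbalanced_ble(table6, inputs6, direct_index, cin):
    """Unbalanced: the 6-LUT drives one addend, a raw LUT input the other."""
    return full_adder(lut_eval(table6, inputs6), inputs6[direct_index], cin)


def extra_mux_area_mwta(n_bles, muxes_per_ble):
    """Added mux area per logic block (1 mux/BLE balanced, 2 unbalanced)."""
    return n_bles * muxes_per_ble * MUX_AREA_MWTA


# Example of operands selected by multiplexers: inputs are (sel, x0, x1, y0, y1).
MUX_X = [x1 if sel else x0 for sel, x0, x1, y0, y1 in product((0, 1), repeat=5)]
MUX_Y = [y1 if sel else y0 for sel, x0, x1, y0, y1 in product((0, 1), repeat=5)]
```

In the balanced model, `balanced_ble(MUX_X, MUX_Y, (sel, x0, x1, y0, y1), cin)` adds the two selected operands in a single BLE, which is the kind of mux-fed adder the balanced interaction is meant to serve. For an eight-BLE logic block, the mux overhead under the stated model is 120 MWTAs for the balanced interaction and 240 MWTAs for the unbalanced one.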

C. Carry Chain Flexibility

Another class of interesting architectures are those with hardened adders but no dedicated carry link between logic blocks. Here, both the cin and cout pins are treated as though they are regular input and output pins, respectively, in the inter-block routing architecture. Within the logic block, the carry signals maintain the same restricted connections. For architectures that have a dedicated carry link, the carry link has a delay of 20 ps. For those without a dedicated cin/cout we add the usual circuitry to allow them to access the right and bottom side channels of the logic block.

There are a few different ways to implement the starting location of a multi-bit addition. One can place a mux at every carry link that can select from logic-0, logic-1, or a carry signal of a previous stage, but this can incur a significant delay penalty because every carry link must now go through a mux. Alternatively, one can place these muxes only on selected carry links, thus minimizing the overhead of excessive muxing, but at the cost of having fewer locations where an addition may begin. This latter approach is typical in commercial devices. Alternatively, the responsibility for starting an addition can be implemented in a front-end CAD tool: the tool can pad the addition with a dummy adder before the LSB (whose addends are fed by constants) which generates a 0 or a 1 cin for addition and subtraction, respectively. We employ this approach in this work.

Fig. 7. 4-bit CLA with balanced LUT interaction.

TABLE III
PROPERTIES OF THE 1-BIT HARD ADDER.
Property                    Value
Area                        47.7 MWTAs
Delay cin to cout           11 ps
Delay sumin to cout         56 ps
Delay cin to sumout         30 ps
Delay sumin to sumout       83 ps

TABLE IV
PROPERTIES OF THE 4-BIT CARRY LOOKAHEAD ADDER.
Property                        Value
Area                            257 MWTAs
Delay cin to cout               20 ps
Delay sumin to cout             80 ps
Delay cin to sumout LSB         25 ps
Delay cin to sumout MSB         30 ps
Delay sumin to sumout LSB       65 ps
Delay sumin to sumout MSB       82 ps

D. Bits of Addition per LUT

For fLUT architectures each fLUT can produce two independent outputs. This makes it possible to include a second bit of addition in the fLUT with minimal cost. Figure 8
contrasts the balanced 1-bit adder fLUT on the left with the balanced 2-bit adder fLUT on the right. The balanced 2-bit adder requires that each 5-LUT further fracture down to two 4-LUTs, with most inputs shared, in order to supply the requisite four addends to the adder.

Fig. 8. fLUT with adder 1-bit vs. 2-bit balanced.

IV. CAD

In this section, we describe the CAD tools we use and the significant enhancements they required to explore the architectures described in the previous sections. We cannot use commercial FPGA CAD tools to evaluate our proposed architectures, as they are closed-source (and hence cannot be modified), and do not support re-targeting to proposed FPGA architectures. Instead, we employ the VTR 7.0 [22] CAD flow, which can target a wide range of user-described FPGA architectures. The VTR CAD flow is open-source, which allows us to modify and enhance it to ensure it optimizes well for the architectures we will evaluate.

The VTR CAD flow is illustrated in Figure 9. The two key inputs are a circuit described in Verilog and a description of the FPGA architecture in a human-readable text file. The circuit is elaborated by Odin II and ABC [23] performs logic synthesis to produce a technology-mapped netlist of device atoms such as LUTs, FFs and basic multipliers. VPR then packs these atoms into logic, RAM and DSP blocks, places those blocks, and routes connections between them. Finally, VPR computes the area and delay of that final, physical mapping. Below we detail the modifications and enhancements required to enable hardened adders and carry chains.

Fig. 9. The VTR CAD flow. (Verilog circuits and an FPGA architecture description file are the inputs; elaboration is performed by Odin II, synthesis and technology mapping by ABC, and the back-end steps of packing, placement, routing, and timing and area estimation by VPR, producing the quality of results.)

A. Elaboration and Logic Optimization

In our initial experiments targeting hardened adders, we discovered a surprising and unexpected downside: when front-end elaboration inserts hardened adders into the circuit, it creates a boundary in the elaborated circuit that cannot be crossed by ABC's logic synthesis. Furthermore, the hardened logic is a “black box” and hence invisible to ABC and cannot be optimized. This boundary reduces the effectiveness of basic logic synthesis optimizations such as common sub-expression elimination. We observed that ABC was able to reduce the number of soft adders when these boundaries were not in place, and that multiple copies of adders with the same inputs were left intact when hardened adders were used. We also attempted to use the “white box” feature of ABC [24]; while this made the functionality of the hard logic visible, it also led to ABC converting it into regular soft logic and hence was not suitable.

To compensate for reduced down-stream optimization, we implemented two new optimizations in Odin II: the removal of duplicate hard adders and unused logic removal. Both these optimizations are generalized to all hard blocks and are not exclusive to hard adders. Duplicate hard block reduction is a simplified version of common sub-expression elimination. If all of the input pins of any two hard blocks anywhere in the circuit are found to be the same, the duplicate hard block can be removed and its fanout attached to the other hard block.

In a typical CAD flow, logic synthesis is responsible for sweeping away unused logic because synthesis optimizations can sometimes reveal unused logic. ABC is unable to do this for hard blocks as it optimizes exclusively based on logic expressions, and views hard blocks as black boxes. Hence we augmented Odin II to sweep away unused hard and soft logic based purely on circuit connectivity.

We quantified the impact of the new optimizations, using the experimental methodology described subsequently in Section VII but with adders always hardened. These experiments covered four cases for the optimization settings in Odin II: None, DHR (duplicate hard logic removal), ULR (unused logic removal), and All (both DHR and ULR enabled).

Table V shows the impact of these optimizations on the benchmark circuits described in Section VII; most circuits benefit from both optimizations. In this work we use the geometric average to summarize results, as it equally weights each benchmark. On average, duplicate hard logic removal and unused logic removal reduce the logic blocks required by a circuit by 5% and 8%, respectively, while their combination reduces logic blocks required by 12% and the critical path delay by 7%.

TABLE V
EFFECT OF ODIN II OPTIMIZATIONS. ALL VALUES ARE NORMALIZED TO THE BASE CASE WITH NO OPTIMIZATIONS.
Circuit          DHR CLB   ULR CLB   Both CLB   Both Delay
arm core         0.97      0.95      0.94       0.92
bgm              1.00      0.80      0.80       0.87
blob merge       0.91      0.99      0.91       0.55
boundtop         0.92      0.99      0.90       1.00
LU8PEEng         0.84      0.99      0.83       1.01
LU32PEEng        0.83      0.99      0.82       0.98
mcml             1.00      0.91      0.91       0.94
mkSMAdapter4B    1.00      0.89      0.89       0.92
or1200           0.90      0.92      0.86       1.09
raygentop        0.93      0.94      0.87       0.92
sha              1.00      0.99      0.99       1.03
stereovision0    1.00      0.80      0.80       0.95
stereovision1    0.99      0.79      0.79       0.98
stereovision2    1.00      0.97      0.97       1.01
geomean          0.95      0.92      0.88       0.93
stdev            0.06      0.08      0.06       0.13
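
The two elaboration-time optimizations described in this subsection can be sketched on a toy netlist. The Python below is a minimal illustration under stated assumptions and is not the Odin II implementation: each node drives a single net named after the node, a node is stored as `(kind, input_nets)`, hard blocks are marked with a `"hard:"` prefix, and any net that is not a node is a primary input. All of these conventions and function names are hypothetical.

```python
def remove_duplicate_hard_blocks(nodes, primary_outputs):
    """Merge hard blocks whose type and ordered input nets are identical."""
    nodes = dict(nodes)
    outputs = list(primary_outputs)
    changed = True
    while changed:  # repeat, since one merge can expose another duplicate
        changed = False
        seen = {}
        for name in sorted(nodes):
            kind, inputs = nodes[name]
            if not kind.startswith("hard:"):
                continue
            key = (kind, inputs)
            if key not in seen:
                seen[key] = name
                continue
            keep = seen[key]
            del nodes[name]
            # Attach the fanout of the removed block to the surviving block.
            for other, (k, ins) in list(nodes.items()):
                nodes[other] = (k, tuple(keep if i == name else i for i in ins))
            outputs = [keep if o == name else o for o in outputs]
            changed = True
            break
    return nodes, outputs


def remove_unused_logic(nodes, primary_outputs):
    """Keep only nodes reachable backwards from a primary output."""
    live = set()
    stack = list(primary_outputs)
    while stack:
        net = stack.pop()
        if net in live or net not in nodes:  # visited, or a primary input
            continue
        live.add(net)
        stack.extend(nodes[net][1])
    return {name: node for name, node in nodes.items() if name in live}


EXAMPLE_NODES = {
    "add0": ("hard:adder", ("a", "b")),
    "add1": ("hard:adder", ("a", "b")),   # duplicate of add0
    "dead": ("hard:adder", ("c", "d")),   # drives nothing
    "f": ("lut", ("add0", "c")),
    "g": ("lut", ("add1", "d")),
}
```

Running duplicate removal and then the connectivity sweep on `EXAMPLE_NODES` with primary outputs `["f", "g"]` leaves only `add0`, `f` and `g`, with `g` rewired to read `add0`. Both passes rely only on connectivity and block type, never on the logic function inside a hard block, which matches the black-box constraint discussed above.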
B. Threshold of When to Use Hard Adders

While using hardened adder and carry logic is clearly beneficial for wide arithmetic structures, for small adders the flexibility provided by soft logic might actually prove superior, as hard adders impose a boundary across which it is difficult for logic synthesis to optimize. We define the hard adder threshold as the size, in bits, of addition/subtraction above which the CAD flow will implement it with hard adders and below/equal to which the function is implemented in soft logic.

Figure 10 shows the impact on delay of different hard adder thresholds when we target the ripple carry architecture. The x-axis shows the hard adder threshold in bits. The y-axis shows the geometric mean of the delay over the 14 circuits of Table V. There is a general trend towards achieving a minimum mean delay at a threshold of around 12 bits.

Fig. 10. Circuit speed vs. hard adder threshold. Results are the average across 14 benchmarks and normalized to the soft implementation.

Figure 11 shows the area impact of different hard adder thresholds. The x-axis is again the hard adder threshold, while the y-axis shows the geometric mean of the total area for all benchmarks. The area consumed using an architecture with hard adders is on average more than that of an equivalent architecture without carry chains. We see a gradual drop in area with an increasing hard adder threshold; area drops from 10% above the soft adder architecture with a hard adder threshold of 0, to 3% above with a threshold of 12. Interestingly, preliminary measurements we made on commercial FPGAs showed that using carry chains in the CAD flow reduced area; we therefore suspect that with further improvements in logic synthesis the remaining 3% area penalty could be eliminated. Considering area and delay, the best hard adder threshold is approximately 12 bits. This threshold is used for all subsequent results unless otherwise noted.

Fig. 11. Average total area of different hard adder thresholds normalized to the soft architecture.

C. Packing

The packing stage of the CAD flow is responsible for grouping technology-mapped atoms such as LUTs, hard adder bits, flip-flops, and memory slices, into complex logic blocks. When a logic block contains carry chains, adder atoms must be placed inside the logic block in an order that respects the restrictive carry links. The packer should also make use of the architecture-specific features that allow the LUTs and flip-flops to interact with the adder.

AAPack, the packer inside VPR 7.0, is an interconnect-aware packing algorithm [25] that recognizes and respects the various pin constraints that arise with different LUT and adder interactions. The carry chain itself is specified using the “molecule” feature in AAPack that allows the architect to specify how certain atoms must be packed together.

In our initial experiments with microbenchmarks (mostly pure adders fed by, and feeding into, registers), we discovered that the packing algorithm in VPR 7.0 was imperfect in a number of cases. Figure 12 shows an example of the simple input circuits that caused a problem. The adders in this figure form a carry chain so they will be packed together into a logic block. If the flip-flops cannot be packed into the same logic block as the adders, then the packer will see these flip-flops as completely unrelated to each other because they do not share common nets. These flip-flops may then be separated and packed with other logic, which is undesirable as it makes it impossible to place all the logic clusters containing these registers close to the adder. We modified the packer to consider atoms that have transitive connectivity with the current logic block being packed. In this particular example, the flip-flops that drive the adder are transitively connected via the carry chain so the packer gives them higher attraction to each other than to other unconnected logic. With this modification, circuits such as that illustrated in Figure 12 were packed well.

Fig. 12. Example of transitive connections.

D. Placement and Routing

VPR 7.0 has place-and-route functionality for carry chain exploration. The architecture description file allows any logic block pin to connect to the general inter-block routing and/or to use a special dedicated connection to a specific pin on a specific other logic block. In this work, when we explore dedicated carry chain links they run vertically: the cout pin of one logic block can only connect to the cin pin of the logic block immediately below it. When such dedicated connections are specified, VPR 7.0 automatically not only generates the appropriate edges in the routing resource graph to represent this direct routing possibility, but also constrains the placement algorithm to keep any blocks that are part of a hardened carry chain in the correct relative position (in our case vertically adjacent) throughout placement.

For logic blocks without hard carry chains, it is possible to change which BLE performs which function during the routing stage of the CAD flow, making the outputs logically equivalent. However, if hard carry chains are used the order of BLEs is fixed, and the outputs of BLEs using their carry
function are not logically equivalent. VPR 7.0 does not allow us to selectively switch off output pin logical equivalence in cases when the carry links are used by the BLEs. Hence, for correctness, we do not allow any BLE swaps, thus removing all output logical equivalence. Turning off logical equivalence for all outputs will lead to a slight pessimism on the routability of the soft logic only architecture vs. that of the hard adder architectures. To quantify this we evaluated the soft logic only architecture both with and without output equivalence enabled, and found the impact on delay and area was < 3% on the VTR benchmarks. Since the effect is small (compared to the impact of architectural modifications) this restriction will not change the architecture conclusions. Furthermore, to compensate for the restriction, each output pin can directly access two sides of the logic block, and hence both a vertical and a horizontal channel, ensuring the pins still have access to a diverse set of routing wires.

Fig. 13. Delay vs. adder length for various non-fracturable architectures. (Plot of Delay (ns) vs. Size of Addition (# bits); series: Soft, Ripple No CLB Carry, CLA No CLB Carry, Ripple, CLA.)
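
The placement constraint described in this section (the cout pin of one logic block reaches only the cin pin of the block immediately below it, so chained blocks must stay vertically adjacent) can be illustrated with a minimal sketch. The names `chain_is_legal` and `move_chain`, the dictionary-based placement, and the convention that y grows upward are all assumptions made for illustration; they are not VPR data structures or APIs.

```python
from typing import Dict, List, Tuple

Coord = Tuple[int, int]  # (x, y) grid location of a logic block (hypothetical model)


def chain_is_legal(chain: List[str], placement: Dict[str, Coord]) -> bool:
    """Return True if every cout can reach the next block's cin.

    Assumption: y grows upward, so "immediately below" means
    the same x coordinate and y - 1.
    """
    for upper, lower in zip(chain, chain[1:]):
        ux, uy = placement[upper]
        lx, ly = placement[lower]
        if lx != ux or ly != uy - 1:
            return False
    return True


def move_chain(chain: List[str], placement: Dict[str, Coord],
               new_head: Coord) -> Dict[str, Coord]:
    """Move a whole carry chain as one unit, keeping relative positions."""
    hx, hy = new_head
    moved = dict(placement)
    for offset, block in enumerate(chain):
        moved[block] = (hx, hy - offset)  # each block sits one row lower
    return moved


if __name__ == "__main__":
    chain = ["clb_a", "clb_b", "clb_c"]
    placement = {"clb_a": (3, 7), "clb_b": (3, 6), "clb_c": (3, 5)}
    print(chain_is_legal(chain, placement))
    print(chain_is_legal(chain, move_chain(chain, placement, (10, 20))))
```

Moving the chain as a single unit is what keeps the check true after every move; moving any one block independently would break the dedicated cout-to-cin link.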
V. MICROBENCHMARK RESULTS
In this section we explore the impact of the different ways of supporting arithmetic in an FPGA architecture when evaluated on simple adder microbenchmarks. Here, each circuit is an adder of N bits, where N ranges from one to 127. Both the inputs and outputs of the adder are registered, so the critical path delay measured is a direct function of the adder combinational logic delay. These registered adders are implemented using the flow described in Section IV. In this section the hard adder threshold (Section IV-B) is disabled in order to measure the impact of the different architectures at small bit widths.

Figure 13 shows the impact on critical path delay vs. width of addition, for the Soft, Ripple and CLA architectures, where the critical path delay is averaged over three placement seeds. In addition, two variants of the Ripple and CLA architectures are included, labelled no CLB carry, in which the general-purpose interconnect is used to implement carry links between soft logic blocks, rather than using dedicated carry links. Figure 14 shows the results for the fracturable LUT architectures. The unbalanced architectures are not included here as their performance difference vs. balanced is negligible on the microbenchmarks.

Fig. 14. Delay vs. adder length for various architectures with fracturable LUTs.

These results show trends that we generally expect, in that delay grows linearly with adder size, and that the more hardened architectures are faster.

In the extreme case, for 127-bit addition, it is interesting to note that a pure soft adder is ten times slower than the fastest (CLA) adder. For 32-bit addition, the hard adders provide a 3.4-fold speedup over the soft adder. The no CLB carry architectures have delay values in between fully hard and fully soft adder architectures. While the CLA architecture is the fastest of all, ripple carry is only 19% slower for 32-bit adders, and 42% slower even for 127-bit addition. A ripple architecture can sustain 400 MHz operation for even a 96-bit addition.

When adders are implemented in soft logic, CAD noise can have a significant impact on delay. The effect of this noise is evident in the figure when observing the delay of additions ranging from 17 bits to 25 bits for the soft logic architecture, where delays for additions of similar size can vary significantly as a result of CAD (in this case packer) noise. For hard adders, the lack of CAD flexibility forces a predictable physical design, thus greatly reducing CAD noise for these microbenchmarks. The combination of higher and more predictable performance provided by hard adders, especially those with hard inter-CLB links, is very desirable.

The data from this experiment also shows that a 3-bit addition implemented in soft logic is actually slightly faster than any of the hard-logic adders, further motivating the CAD hard adder thresholds described in Section IV-B.

Table VI shows the tile area for a logic block (including both inter and intra-block routing) in each architecture. Here we observe that the inclusion of hard adders increases tile area by < 2.5% over their respective baseline architectures. The delay, logic block count and area for implementing a 32-bit adder are also shown in Table VI. The architectures with hard adders are all substantially faster than the soft architectures; they also use fewer CLBs than the baseline Soft architecture and hence (with the exception of frac ripple and frac uripple) are more area efficient.

TABLE VI
32-BIT ADDER DELAY & AREA.
                                 ------- 32-bit Addition -------
Architecture     Tile Area       Delay    CLBs    Total Area
                 (10^3 MTWA)     (ns)             (10^3 MTWA)
soft             19.84           3.94     18      357.06
cla              20.25           0.93     16      324.05
ucla             19.92           0.83     16      318.77
ripple           20.14           1.14     16      322.25
uripple          19.90           1.03     16      318.34
frac soft        23.06           3.74     14      322.77
frac ripple      23.06           1.06     16      368.92
frac uripple     23.37           1.15     16      373.86
frac 2ripple     23.40           1.08     14      327.64
frac 2uripple    23.62           1.01     14      339.71
Tile Area is the geomean across application benchmarks (Section VII) and so includes both logic and realistic routing area.
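
As a reading aid for Table VI, the following sketch recomputes the total area of the 32-bit adder as tile area times CLB count, and the delay speedup relative to the soft architecture. The numbers are copied from the table; the helper names `total_area` and `speedup_over` are hypothetical, and the tile-area-times-CLBs relation follows the area definition given in Section VII rather than anything stated in the table itself.

```python
# Table VI rows: (tile area [10^3 MTWA], delay [ns], CLBs, printed total area [10^3 MTWA])
TABLE_VI = {
    "soft":          (19.84, 3.94, 18, 357.06),
    "cla":           (20.25, 0.93, 16, 324.05),
    "ucla":          (19.92, 0.83, 16, 318.77),
    "ripple":        (20.14, 1.14, 16, 322.25),
    "uripple":       (19.90, 1.03, 16, 318.34),
    "frac soft":     (23.06, 3.74, 14, 322.77),
    "frac ripple":   (23.06, 1.06, 16, 368.92),
    "frac uripple":  (23.37, 1.15, 16, 373.86),
    "frac 2ripple":  (23.40, 1.08, 14, 327.64),
    "frac 2uripple": (23.62, 1.01, 14, 339.71),
}


def total_area(arch: str) -> float:
    """Area of the 32-bit adder: tile area multiplied by the number of CLBs."""
    tile, _delay, clbs, _printed = TABLE_VI[arch]
    return tile * clbs


def speedup_over(arch: str, baseline: str = "soft") -> float:
    """Ratio of baseline delay to this architecture's delay."""
    return TABLE_VI[baseline][1] / TABLE_VI[arch][1]


if __name__ == "__main__":
    for arch, (_tile, _delay, _clbs, printed) in TABLE_VI.items():
        computed = total_area(arch)
        gap = 100.0 * (computed - printed) / printed
        print(f"{arch:14s} computed {computed:7.2f}  printed {printed:7.2f}  "
              f"gap {gap:+.2f}%  speedup vs soft {speedup_over(arch):.2f}x")
```

Most rows agree with the printed totals to within the rounding of the tile area. The frac 2uripple row differs by roughly 3%, which the text alone does not explain.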
Fig. 15. Delays across the architectures on depth 3 adder trees. (Plot of Critical Path Delay (ns) vs. Size of Addition (# Bits); series: soft, ripple, cla.)
Fig. 16. Delays across the architectures on depth 3 adder trees.

VI. ARITHMETIC KERNEL RESULTS
This section explores the impact of hardened adders and carry chains on arithmetic kernels that are more complex than a pure adder, but would still only be part of a full user application. Intuitively, we expect that when a circuit becomes more complex, with interactions between different types of logic, then the differences between the different architectures may be reduced. The hard adder threshold (Section IV-B) is disabled in this section to enable evaluation of the different architectures at small bit widths.

A. Adder Trees
Adder trees are frequently used to implement functions like filters and dot products. Figures 15 and 16 show the performance of a depth three adder tree (which can add 8 numbers) for non-fracturable and fracturable architectures respectively. The critical path now consists of the concatenation of several ripple carry stages and 2 or 3 input to sumout stages. Each LUT before an adder input must be configured as a wire, so the LUT is wasted. The performance of the balanced and unbalanced architectures is the same; hence only balanced is shown. The hard carry lookahead adder is no faster than the ripple adder for these adder trees, as now the sum-generation logic has a larger impact. All the hard implementations are faster than the soft implementation for adders of 4 or more bits, with the gain widening to 4x at 48 bits for depth-3 adder trees. The soft implementation uses essentially the same number of logic blocks as the hard versions for adder trees with 8-bit or smaller additions, as the adder trees provide more opportunities for a synthesis tool to make use of the 6-LUTs than a simple adder does. Again this highlights that using soft logic for small adders is a very reasonable choice. However, for large additions of 48 bits, soft logic requires 67% more logic blocks than the hard architectures.

B. Multiplication by a Constant
Another important arithmetic kernel is multiplication by a constant, which can be efficiently implemented using addition, subtraction and wired-shifts. These microbenchmarks implement 16-bit multiplication using different constants; the inputs and outputs to the combinational multiplier are registered.

Table VII shows the geometric average of performance and number of CLBs for different architectures across 128 multiply-by-constant circuits. The soft adder architecture is over twice as slow as the hard adder architectures, the same ratio as that of the 16-bit pure adder benchmark. The carry-lookahead architectures are 7% faster than the single-bit adder ones, again matching the 16-bit pure adder benchmark. The unbalanced hard adder architectures are slightly faster than the balanced architectures, as the constant multiplier circuits have some critical paths that are longer than others and the unbalanced architecture can give slightly faster resources to these paths. Hard adders reduce CLB count versus soft adders by approximately 24%.

TABLE VII
GEOMETRIC MEAN OF DELAY AND NUMBER OF CLBS FOR CONSTANT MULTIPLICATION (128 RANDOMLY SELECTED CONSTANTS).
Arch             Crit Delay (ns)    Num CLB
soft             6.68               21.9
cla              3.07               17.8
ucla             2.95               17.8
ripple           3.28               17.5
uripple          3.14               17.5
frac soft        6.62               15.4
frac ripple      3.39               18.3
frac uripple     3.70               20.0
frac 2ripple     3.54               10.1
frac 2uripple    3.79               12.5

Both the fracturable and non-fracturable architectures with hard adders improve critical path delay. However the results for logic block counts are quite different. The soft fracturable architecture uses fewer logic blocks than the non-fracturable soft architecture and fewer than the non-fracturable architectures with hardened adders. Interestingly, the 1-bit fracturable LUT architectures use more logic blocks. This is due to the complexity of the logic block causing more registered input opportunities to be missed by the CAD tool. However, the 2-bit fracturable LUT architectures are substantially more efficient, using significantly fewer logic blocks than all other architectures (up to 42% compared to non-fracturable hardened architectures and 34% compared to the baseline fracturable architecture) while achieving similar delay.

C. FIR Filters
FIR filters are widely used in DSP applications, are rich with arithmetic, and are well suited for implementation on FPGAs. We study the impact of hardened adders and carry chains on direct-form FIR filters with 18-bit word widths that consist of shift registers of input data feeding DSP blocks that in turn feed adder trees. We vary the number of taps on the filter and compare the results of pipelining vs. no pipelining in the adder tree.

Figures 17 and 18 show the effect on critical path delay as we increase the number of taps of a combinational FIR filter for non-fracturable and fracturable architectures respectively. Only the balanced hard architectures are shown as the unbalanced variants have essentially the same performance. Figures 19 and 20 show the same effect with FIR filters with deeply pipelined adder trees. For the non-pipelined filters, hard adders are substantially faster than soft adders, but the various hard adder architectures show almost identical performance.
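
To make the shift-and-add idea behind Section VI-B concrete, the sketch below multiplies by a constant using only shifts, additions and subtractions. It uses canonical signed-digit (CSD) recoding, which is one common way to pick the shifted terms; the text does not say how the benchmark circuits or the synthesis flow decompose their constants, so the choice of CSD and the function names are assumptions for illustration.

```python
from typing import List, Tuple


def csd_digits(c: int) -> List[int]:
    """Canonical signed-digit recoding of a non-negative constant.

    Returns digits in {-1, 0, 1}, least significant first, such that
    sum(d << i for i, d in enumerate(digits)) == c and no two adjacent
    digits are non-zero.
    """
    if c < 0:
        raise ValueError("constant must be non-negative")
    digits = []
    while c != 0:
        if c & 1:
            d = 2 - (c & 3)  # +1 when c % 4 == 1, -1 when c % 4 == 3
            c -= d           # leaves c divisible by 4, so the next digit is 0
        else:
            d = 0
        digits.append(d)
        c >>= 1
    return digits


def multiply_by_constant(x: int, c: int) -> Tuple[int, int]:
    """Compute x * c from shifted copies of x.

    Returns (product, number of add/subtract operations). Shifts by a
    constant are free wiring in hardware, so only adders are counted.
    """
    acc = 0
    terms = 0
    for i, d in enumerate(csd_digits(c)):
        if d == 1:
            acc += x << i
            terms += 1
        elif d == -1:
            acc -= x << i
            terms += 1
    return acc, max(terms - 1, 0)  # combining k terms takes k - 1 adders


if __name__ == "__main__":
    print(csd_digits(7))                # 7 = 8 - 1
    print(multiply_by_constant(5, 7))   # one subtraction: (5 << 3) - 5
```

Each non-zero digit contributes one shifted copy of the input, so the number of adders or subtractors in the resulting circuit is one less than the number of non-zero digits.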
Fig. 17. Delays across the architectures on non-pipelined FIR filters. (Plot of Critical Path Delay (ns) vs. Number of Taps; series: soft, ripple, cla.)
Fig. 18. Delays across the architectures on FIR filters with no pipelining.
Fig. 19. Delays across the architectures on FIR filters with pipelined adder trees. (Plot of Critical Path Delay (ns) vs. Number of Taps; series: soft, ripple, cla.)
Fig. 20. Delays across the architectures on FIR filters with pipelining.
Fig. 21. Logic block counts across the architectures on non-pipelined FIR filters. (Plot of Num CLB vs. Number of Taps; series: soft, ripple, cla.)
Fig. 22. Logic block counts across the architectures on FIR filters with no pipelining.
Fig. 23. Logic block counts across the architectures on pipelined FIR filters. (Plot of Num CLB vs. Number of Taps; series: soft, ripple, cla.)
Fig. 24. Logic block counts across the architectures on FIR filters with pipelining.

However the gap between soft and hard adders is not as large for the fracturable architectures. With the pipelined filters, hardened carry chains do not improve speed over soft arithmetic as the critical path has moved into the multipliers.

Figures 21 and 22 show the effect on logic block count as we increase the number of taps of non-pipelined FIR filters for the non-fracturable and fracturable architectures respectively. Figures 23 and 24 show the same effect with pipelining. Architectures with hard adders show lower CLB counts than soft adders, but the absolute CLB counts are quite different between the fracturable and non-fracturable architectures. Interestingly both frac soft and frac ripple are quite close in CLB count, but both are substantially lower than the non-fracturable architectures. This underscores our earlier observation that fracturable LUTs are much more area efficient at implementing soft adders than non-fracturable, thus lowering the area curve. Notably, using 2 bits of addition per fLUT is again the most efficient. This indicates that two bits of arithmetic achieves better utilization of the fLUT in front of the adders than the one bit case.

VII. APPLICATION CIRCUIT RESULTS
We now focus on evaluating the different hard adder variants using full application benchmarks. These allow us to evaluate the overall quality of the different implementation approaches, considering overall delay and area. Architectures which minimize the area-delay product are the most efficient. The complete design benchmarks we use are from the VTR 7.0
release, specifically, all circuits larger than 1000 6-LUTs¹. We will refer to these as the VTR+ benchmarks. The geometric average primitive count across the 13 circuits is 11,700. The benchmarks are full application circuits, which include a variety of arithmetic operations including addition, subtraction, and multiplication which can make use of hardened adders.

¹ A large ARM processor core is also included, and the mkDelayWorker32B benchmark is excluded as it caused ABC to crash.

Table VIII provides statistics on these benchmarks, and includes the number of addition/subtraction functions found in the benchmarks on the Ripple architecture. The table columns list the number of 6-LUTs, the length of the longest adder chain in bits, the average adder chain length, and the ratio of adder bits to LUTs. The benchmarks exhibit a wide range in the number and length of addition/subtraction functions. The geometric mean of the ratio of adder bits divided by the number of 6-LUTs is 0.25, indicating arithmetic is plentiful and hence it is reasonable to include hard adder circuitry in every CLB. The widest addition/subtraction generated in these benchmarks is 65 bits, which corresponds to a 64-bit operation (as the first bit must always be used to generate the carry-in signal). The geometric mean of the longest addition/subtraction lengths is 31 bits. The most adder-intensive circuit is stereovision2 with 1.26 adders per LUT. These measurements correspond well with other modern benchmarks. For the Titan benchmarks (with the SPARC cores excluded because they have almost no adders at all) [26], the geometric average of the fraction of LUTs in arithmetic mode and the maximum length of addition/subtraction are 0.22 and 35.8, respectively.

TABLE VIII
BENCHMARK STATISTICS WHEN MAPPED TO RIPPLE ARCHITECTURE.
Circuit          Num 6LUTs    Max Add Len    Avg Add Len    Add/LUT Ratio
bgm              5438         25             9.3            0.17
blob merge       3754         13             12.0           0.48
boundtop         309          19             7.2            0.11
LU8PEEng         3241         47             11.1           0.15
LU32PEEng        8235         47             11.9           0.11
mcml             24302        65             47.5           0.26
mkSMAdapter4B    431          33             6.9            0.24
or1200           534          65             23.9           0.19
raygentop        580          32             11.8           0.33
sha              309          33             24.0           0.15
stereovision0    2920         18             11.2           0.35
stereovision1    2388         19             6.4            0.30
stereovision2    13843        32             23.9           1.26
geomean          2109         31             13             0.25

We use the standard VTR 7.0 CAD flow, augmented as described in Section IV, to determine the minimum routable channel width (Wmin) for each circuit. The router is then invoked again with a channel width of 1.3 × Wmin to measure critical path delay and area. Area measurements are in minimum-width-transistor-area units. Area is computed as the total number of soft logic blocks (CLBs) multiplied by the area of a soft logic tile, where this tile includes both the logic cluster and inter-cluster interconnect area. The hard adder threshold is set to 12, as this yields the best area-delay results in Section IV-B. Each of the circuits was mapped to the soft logic and hard carry chain architectures described in Table II.

A. Contrasts between Microbenchmarks, Kernels & Applications
An interesting first comparison is to assess the impact of hard adders on application circuits vs. microbenchmarks. We use a 32-bit adder as a representative microbenchmark, as this is close to the average size of the longest adders in the application circuits. Table IX shows the geometric average critical path delay for each of the non-fracturable architectures normalized to the non-fracturable soft logic architecture. An isolated 32-bit adder sees a compelling delay reduction of 71% to 79% with hard carry architectures, while application circuits see much smaller (but still very significant) delay reductions of 13% to 15%, depending on the hard carry architecture. The arithmetic kernels mostly had delay reductions between these extremes: from no delay reduction for deeply pipelined FIR filters to 50% for constant multiplication and non-pipelined FIR filters, to as much as 80% for adder trees. This is a common outcome in the hardening of any kind of circuit – the final impact on critical path delay in more complex applications is limited because other paths in the design quickly become more critical than the adder. On the application circuits, the best delay improvement achieved by hardening adders is 15%, for the U-CLA architecture. Observe, however, that the other simpler hardened adder architectures benefit circuit speed almost as much.

TABLE IX
DELAY FOR DIFFERENT HARD ADDER ARCHITECTURES, NORMALIZED TO THE SOFT LOGIC ARCHITECTURE.
Arch        32-bit Add Delay    Application Circuits Delay
Ripple      0.29                0.85
U-Ripple    0.26                0.87
CLA         0.24                0.85
U-CLA       0.21                0.85

Table X shows the quality of results (QoR) of each architecture normalized to the soft logic architecture. All values are the geometric averages across all application benchmarks, normalized to the soft adder architecture. The columns from left to right are the architecture, the total soft logic area including routing, minimum channel width, number of used soft logic blocks, critical path delay, and area-delay product. The CLA architecture increases area slightly (by between 1% and 2%) compared to ripple, but cuts delay by roughly the same amount, leading to an area-delay product that is very close to that of the ripple architectures.

TABLE X
QOR OF THE VTR+ BENCHMARKS ON DIFFERENT CARRY CHAIN ARCHITECTURES. VALUES ARE THE GEOMETRIC MEAN OF VTR+ CIRCUITS NORMALIZED TO THE SOFT ADDER ARCHITECTURE.
Arch       Area    Min W    Num CLB    Delay    Area-Delay Product
Ripple     1.04    0.97     1.03       0.85     0.88
URipple    1.03    0.92     1.03       0.87     0.90
CLA        1.06    0.96     1.04       0.85     0.90
UCLA       1.04    0.91     1.03       0.85     0.88

B. Circuit-by-Circuit Analysis
Table XI provides a circuit-by-circuit breakdown comparing the U-CLA and Soft architectures. The columns from left to right are the benchmark name followed by the ratio of the U-CLA/Soft values for critical path delay, the total soft logic area including routing, and the number of LUTs on the critical path.
The last column is the number of hard adders on the critical path for the U-CLA architecture. On average, the delay of the circuits is reduced by 15% and the critical path LUT depth is cut by more than 50%, but there are 3 distinct classes of circuits that show markedly different behaviour.

TABLE XI
CIRCUIT-BY-CIRCUIT BREAKDOWN COMPARING THE U-CLA ARCHITECTURE TO THE SOFT ARCHITECTURE.
Circuit          Delay    Area    LUTs on Crit Path    CLA cout on Crit Path
LU32PEEng        0.76     1.05    0.62                 26
mcml             0.55     1.05    0.25                 144
or1200           0.65     1.11    0.15                 19
sha              0.57     1.04    0.25                 10
stereovision2    0.85     0.77    0.07                 1
blob merge       0.97     1.04    0.89                 0
boundtop         0.95     1.02    0.89                 0
LU8PEEng         0.82     1.04    0.64                 0
mkSMAdapter4B    1.01     1.11    0.86                 0
bgm              1.17     1.13    1.00                 0
raygentop        0.97     1.11    N/A                  0
stereovision0    0.94     1.02    1.00                 0
stereovision1    1.07     1.10    N/A                  0
geomean          0.85     1.04    0.46                 –
stdev            0.19     0.09    0.36                 –

For the top 5 circuits hard adders are on the critical path, and we obtain a large delay reduction of 33%. It is interesting to note that despite having the lowest adder to LUT ratio (11% in Table VIII), the LU32PEEng design still obtains a significant 24% delay reduction. Clearly, even when adders are a smaller portion of the design logic, addition operations can still be timing-critical. This illustrates that even circuits with relatively low arithmetic intensity still benefit from hard adders.

The next 4 circuits (blob merge, boundtop, LU8PEEng, and mkSMAdapter4B) have reductions in the critical path LUT depth of 19% when targeting the U-CLA architecture, even though no hard adders occur on their critical paths. This indicates that adder logic was likely timing critical in the Soft architecture², but has sped up enough to move off the critical path in the U-CLA architecture. Interestingly, while these 4 circuits have an average LUT depth that is 19% lower when targeting U-CLA vs. Soft, the average delay reduction across these 4 designs is only 7%. We believe this illustrates an interesting trade-off when hard carry chains are added to an FPGA: by limiting the flexibility of the packer and placer, the carry chains have increased the average routing delay per LUT level on non-adder paths, and this costs some of the speed gain one would expect from reducing the logic on the critical path with hard adders.

² Ideally we would examine the Soft implementation of a design to directly determine if its critical path included addition, but as ABC does not preserve node names, we cannot trace LUTs back to specific HDL operations.

Finally, there are four circuits where the LUT depth is not significantly reduced and where there is not a significant delay reduction, indicating adders were not very timing-critical in even the Soft architecture. Two of these circuits (raygentop and stereovision1) have hard multipliers as their critical paths so they show less variation in speed vs. carry architecture, as one would expect.

C. Impact of Hard Adders on Path Delay Distribution
As noted above, the introduction of fast hard adders can change the character and distribution of a circuit’s timing path delays. While the foremost factor is the movement of addition related timing paths out of the slower soft logic and into the faster hard adders (which typically reduces LUT depth on the critical path), several other factors can also induce changes including: technology mapping (due to limited optimization across the adders), packing (due to differences in logic block structure and flexibility), and placement (due to restrictions keeping long carry chains in a fixed relative position).

Figure 25 shows how the timing path delay characteristics of the or1200 design vary between the Soft and Ripple architectures. Here we see the path delays of Ripple have been shifted down compared to the Soft architecture. It is interesting to note that many timing paths not involving adders also see improvement. Another factor beyond those listed above is the impact of timing driven optimizations, which re-focus optimization effort on these non-adder paths as they are now more timing critical (i.e. no longer dominated by addition related timing paths). It is also worth noting that while the critical timing paths on the Ripple architecture still include adders, the path delays are dominated by other non-adder primitives (e.g. LUTs) and routing. Finally, there are a number of other near critical timing paths not involving adders, which indicate the CAD flow has successfully balanced the competing requirements of the various timing paths.

Fig. 25. Endpoint path delay comparison for or1200 benchmark on the Soft and Ripple architectures. (Axes: Soft Delay (ns), Ripple Delay (ns), Count; legend: Equal, Non-Adder Path, Adder Path.)

D. Simple vs. High Performance Adder Architectures
An FPGA architect must choose between smaller, more flexible, slower adders vs. larger, less flexible, faster adders. On these complete circuits, the results reaffirm the importance of hard adders but show that different hard adder granularities (1-bit ripple or a more expensive 4-bit CLA) remain reasonable architectural choices. This is an unexpected result, as Table III and Table IV show markedly different area and delay characteristics between 1-bit and 4-bit hard adders, respectively – the ripple adder has 19% more delay for 32-bit addition. One would normally expect that architectures with 1-bit adders would result in smaller circuits that are also slower, yet the area and delay results on complete circuits exhibit this trend only very weakly. Clearly the benefit of a very fast adder for long word-length additions is greatly diluted by the presence of all the logic surrounding adders in complete designs.
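
The per-class delay reductions quoted in Section VII-B can be reproduced from the U-CLA/Soft delay ratios in Table XI. The sketch below assumes the averages are geometric means of the per-circuit ratios, which matches the geomean row of the table. The grouping into three classes follows the text; the variable and function names are hypothetical.

```python
import math
from typing import Iterable

# U-CLA / Soft critical path delay ratios, copied from Table XI.
DELAY_RATIO = {
    "LU32PEEng": 0.76, "mcml": 0.55, "or1200": 0.65, "sha": 0.57,
    "stereovision2": 0.85,
    "blob merge": 0.97, "boundtop": 0.95, "LU8PEEng": 0.82,
    "mkSMAdapter4B": 1.01,
    "bgm": 1.17, "raygentop": 0.97, "stereovision0": 0.94,
    "stereovision1": 1.07,
}

# Circuit classes as described in the text.
ADDERS_ON_CRIT_PATH = ["LU32PEEng", "mcml", "or1200", "sha", "stereovision2"]
ADDERS_MOVED_OFF = ["blob merge", "boundtop", "LU8PEEng", "mkSMAdapter4B"]
ADDERS_NOT_CRITICAL = ["bgm", "raygentop", "stereovision0", "stereovision1"]


def geomean(values: Iterable[float]) -> float:
    """Geometric mean, computed in the log domain."""
    vals = list(values)
    return math.exp(sum(math.log(v) for v in vals) / len(vals))


def reduction_percent(circuits: Iterable[str]) -> float:
    """Delay reduction (in %) implied by the geomean ratio of a class."""
    return 100.0 * (1.0 - geomean(DELAY_RATIO[c] for c in circuits))


if __name__ == "__main__":
    print(f"overall geomean ratio: {geomean(DELAY_RATIO.values()):.2f}")
    for name, group in [("adders on critical path", ADDERS_ON_CRIT_PATH),
                        ("adders moved off critical path", ADDERS_MOVED_OFF),
                        ("adders not critical", ADDERS_NOT_CRITICAL)]:
        print(f"{name}: {reduction_percent(group):.1f}% delay reduction")
```

A negative result for the third class means those circuits are, on geometric average, slightly slower on U-CLA than on Soft.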
E. Adder Input Balancing
We now turn to consider how best to integrate the LUT and arithmetic circuitry. The balanced approach of splitting the 6-LUT into two 5-LUTs, where each 5-LUT drives a different adder input, has good symmetry. The unbalanced approach of using the 6-LUT to drive one adder input and a small mux to select BLE input pins for the other adder input offers richer LUT functionality feeding the adder input (six pins compared to five for the balanced case) but worse symmetry. It is thus unclear which of these two approaches is better. Note also that commercial FPGAs differ in their approach: Intel’s Stratix V [14] and Stratix 10 [15] FPGAs support a balanced style, while Xilinx’s Virtex 7 [5] and Ultrascale [13] FPGAs allow both unbalanced and balanced styles.

The second column of Table IX shows the normalized delay values for each of the different architectures on the application circuits. The delays for all architectures are similar, achieving an approximately 15% delay reduction. We therefore conclude that both balanced and unbalanced architectures achieve approximately the same overall delay.

Table X shows the balanced and unbalanced architectures require virtually the same CLB count, indicating that the packer can fill both architectures with roughly the same amount of logic per CLB, despite the fact that the balanced architectures can use a LUT on each input of an adder instead of only one input. Interestingly, the unbalanced architectures require a channel width that is 4% lower, on average. This is due to the fact that the unbalanced architecture can use all 6 inputs of a BLE when in adder mode, while the balanced architectures can use only 5 – the packer has more freedom on what to pack with the adder in the unbalanced architecture and reduces the number of signals to route between clusters. The net impact is that while the unbalanced architectures require slightly more logic area due to their extra 2:1 mux per BLE, they reduce overall area by 1% by reducing the required amount of inter-cluster routing.

F. Utility of Inter-CLB Carry
Dedicated carry links between logic blocks improve the speed of long adders significantly, as shown in Figure 13, but their use constrains the placement engine to keep long adders in a fixed relative placement, which may lengthen the wiring between other blocks. Table XII compares the QoR of architectures with soft inter-CLB carry links (i.e., routed using the general-purpose interconnect) normalized to their corresponding architectures with hard inter-CLB carry links. The first column is the architecture. The second column shows normalized delays for the 32-bit addition microbenchmark. The next three columns show the normalized geometric mean of delay, area, and area-delay product over the VTR+ benchmarks. Using soft inter-CLB links increases the delay of a 32-bit adder by 2.3x on average across the hard adder architectures, but increases the delay of the VTR+ designs by only 5%. The area cost of hard inter-CLB carry is negligible, as little hardware needs to be added to support them, and as their use does not significantly increase the required inter-CLB channel width, despite the constraint they create on the placement engine.

TABLE XII
QOR FOR ARCHITECTURES WITH SOFT INTER-CLB LINKS. VALUES ARE THE GEOMETRIC MEAN OF VTR+ CIRCUITS NORMALIZED TO THE EQUIVALENT ARCHITECTURE WITH DEDICATED INTER-CLB LINKS.
Arch        32-bit Add Delay    VTR+ Delay    VTR+ Area    VTR+ Area-Delay Product
Ripple      2.41                1.07          0.99         1.06
U-Ripple    2.31                1.05          0.99         1.04
CLA         2.49                1.04          0.99         1.03
U-CLA       2.12                1.04          0.99         1.03

We expect that the impact of hard inter-CLB carry links is a strong function of the number of adder bits per logic block. Fewer bits per block means more inter-CLB links are required for an addition of a given size, which in turn may have a bigger impact on delay. Therefore, we believe that architectures with 8 adder bits per logic block (e.g. Ultrascale [13]) will benefit more from hard inter-CLB links than architectures with 20 bits per block (e.g. Stratix 10 [15]).

G. Bits of Addition per LUT
Table XIII compares the different fracturable LUT carry-chain architectures normalized to the fracturable soft architecture to study the impact of the number of addition bits per LUT. We see that critical path delay is fairly consistent across architectures with a reduction of 15% to 18% vs. the soft fracturable LUT architecture. The area numbers are quite different, with 2 bits of addition per fracturable LUT exhibiting a much lower area overhead (2% to 4% more area than soft), while 1 bit of addition per fracturable LUT uses 10% to 13% more area vs. soft. We therefore conclude that 2 bits of addition per fracturable LUT is a better architecture than 1 bit because it provides both slightly better delay and significantly better area, reducing area-delay product by 15% vs. soft.

TABLE XIII
FRACTURABLE LUT APPLICATION BENCHMARK RESULTS. VALUES ARE THE GEOMEAN OVER BENCHMARKS, NORMALIZED TO THE FRACTURABLE SOFT ARCHITECTURE.
Arch             Crit Path Delay    Total Used Soft Area    Num CLB    Area Delay Product
frac ripple      0.85               1.10                    1.10       0.94
frac uripple     0.85               1.13                    1.12       0.97
frac 2ripple     0.84               1.02                    1.01       0.85
frac 2uripple    0.82               1.04                    1.02       0.85

H. Per Circuit Breakdown for frac 2uripple
Table XIV shows a circuit-by-circuit breakdown of our results for frac 2uripple, our best architecture. By providing absolute values, this table serves as a comparison point for future studies. The columns from left to right are: circuit name, minimum channel width, critical path delay in ns, number of soft logic blocks, and total soft logic area including routing in number of minimum width transistors.

I. Summary
Table XV compares our best hard adder architectures with the soft LUT and fracturable LUT architectures. For the soft adder architectures critical path delay is similar, while both hard adder architectures reduce critical path delay by 15-16%. However the fracturable LUTs significantly reduce the number of CLBs required by 12-15%. This reduces area but is slightly tempered by a higher minimum channel width. As a result the combination of fracturable LUTs with 2 bits of hardened addition per LUT improves area-delay product by 25% compared to a soft (non-fracturable) LUT architecture.
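
The area-delay products summarized in Table XV can be sanity-checked by multiplying the normalized delay and area columns. This is a minimal sketch that assumes the product column is simply delay times area; because the geometric mean of a product equals the product of the geometric means, any gap to the printed values should come only from rounding of the two-decimal inputs. The values are copied from Table XV and the names are hypothetical.

```python
# Table XV rows: (normalized crit path delay, normalized total used soft area,
#                 printed area-delay product)
TABLE_XV = {
    "Soft":          (1.00, 1.00, 1.00),
    "Ripple":        (0.85, 1.04, 0.88),
    "frac soft":     (1.02, 0.85, 0.86),
    "frac 2uripple": (0.84, 0.88, 0.75),
}


def area_delay(arch: str) -> float:
    """Area-delay product recomputed from the rounded table columns."""
    delay, area, _printed = TABLE_XV[arch]
    return delay * area


def improvement_percent(arch: str, baseline: str = "Soft") -> float:
    """Reduction in area-delay product relative to the baseline, in %."""
    return 100.0 * (1.0 - area_delay(arch) / area_delay(baseline))


if __name__ == "__main__":
    for arch, (_delay, _area, printed) in TABLE_XV.items():
        print(f"{arch:14s} computed {area_delay(arch):.3f}  printed {printed:.2f}  "
              f"improvement vs Soft {improvement_percent(arch):.1f}%")
```

The recomputed products agree with the printed column to within about 0.01, which is consistent with the roughly 25% improvement the text reports for frac 2uripple.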
TABLE XIV
CIRCUIT-BY-CIRCUIT BREAKDOWN OF THE BEST (FRAC_2URIPPLE) ARCHITECTURE.

Circuit          Min W   Crit Path Delay (ns)   Num CLB   Soft Area
bgm              104     22.89                  3744      8.84 · 10^7
blob_merge       96      8.35                   620       1.42 · 10^7
boundtop         68      5.51                   262       5.51 · 10^6
LU8PEEng         122     66.12                  2440      6.08 · 10^7
LU32PEEng        172     66.72                  8225      2.33 · 10^8
mcml             142     37.36                  8211      2.14 · 10^8
mkSMAdapter4B    78      4.89                   192       4.19 · 10^6
or1200           98      7.49                   295       6.92 · 10^6
raygentop        90      4.52                   198       4.56 · 10^6
sha              70      6.64                   241       5.15 · 10^6
stereovision0    62      3.57                   1051      2.14 · 10^7
stereovision1    120     5.27                   1033      2.57 · 10^7
stereovision2    146     11.28                  2063      5.51 · 10^7

TABLE XV
FRACTURABLE-LUT VS. NON-FRACTURABLE LUT ARCHITECTURE RESULTS. VALUES ARE THE GEOMEAN OVER THE APPLICATION CIRCUITS.

Arch             Crit Path Delay   Total Used Soft Area   Num CLB   Area Delay Product
Soft             1.00              1.00                   1.00      1.00
Ripple           0.85              1.04                   1.02      0.88
frac_soft        1.02              0.85                   0.73      0.86
frac_2uripple    0.84              0.88                   0.74      0.75

There is an important caveat that this comparison is not completely fair, because the number of inputs to the CLBs is kept constant when in reality the ideal number of inputs to a CLB with non-fracturable LUTs should be lower than that for fracturable LUTs, even if the underlying logic elements themselves have the same number of inputs. Hence our area results for the non-fracturable LUT architectures could likely be somewhat improved with cluster input count retuning. Overall, however, the case for a fracturable LUT architecture with 2 bits of hard arithmetic per LUT is compelling.

VIII. CONCLUSIONS AND FUTURE WORK
This study covered a broad range of different implementations of hard adders, carry chains and fracturable LUTs within an FPGA soft logic block. Our results show that hardening adders and carry chains significantly improves the performance of arithmetic operations (69-79% delay reduction on microbenchmarks), and improves average application circuit delay by 13-16%. While more complex architectures hardening a carry-lookahead adder improve standalone adder speed, we found that simpler hardened ripple-carry adders perform just as well on complete application circuits. We found the additional flexibility of fracturable LUT based architectures enabled them to be more area efficient (12-15% compared to non-fracturable). Fracturable LUTs and hardened adders and carry chains are complementary and the combination can improve both area and delay. We found 2 bits of addition to be particularly well matched to fracturable LUT architectures, yielding an overall area-delay product improvement of 25% compared to a non-fracturable architecture without hardened arithmetic.

Our results show that hardened arithmetic is an important consideration when evaluating soft logic block architectures, and in particular fracturable LUTs. It is therefore important for future work studying these areas to consider the impact of hardened arithmetic. Fracturable LUTs in particular have a large design space, so it would be interesting future work to consider how different fracturable LUT architectures would interact with arithmetic. It would also be interesting to consider the impact of re-tuning the number of inputs to the logic block for these architectures. Another direction for future work is to focus on improving logic synthesis when hardened adders are used; ideally ABC would be upgraded to understand the logic within hard adders and optimize across their boundaries. Finally, we expect that using hardened adders will reduce power consumption for arithmetic operations; however, it would be useful to quantify this impact, and determine whether any of the proposed hard adder architectures would be better suited to low power FPGAs.

ACKNOWLEDGMENT
We gratefully acknowledge the funding support of NSERC, Intel, the New Brunswick Innovation Foundation, CMC Microsystems, the Semiconductor Research Corporation, and Texas Instruments.

REFERENCES
[1] J. Rose, “Hard vs. Soft: The Central Question of Pre-Fabricated Silicon,” IEEE ISMVL, pp. 2–5, 2004.
[2] D. Lewis et al., “Architectural Enhancements in Stratix V,” in ACM FPGA, 2013, pp. 147–156.
[3] J. Greene et al., “A 65nm Flash-Based FPGA Fabric Optimized for Low Cost and Power,” in ACM FPGA, 2011, pp. 87–96.
[4] Lattice Semiconductor, “LatticeECP3 Family Handbook,” http://d12lxohwf1zsq3.cloudfront.net/documents/HB1009.pdf, 2013.
[5] Xilinx Inc., “7 Series FPGAs Configurable Logic Block User Guide,” http://www.xilinx.com/support/documentation/user_guides/ug474_7Series_CLB.pdf, 2013.
[6] H.-C. Hsieh et al., “Third-Generation Architecture Boosts Speed and Density of Field-Programmable Gate Arrays,” in IEEE CICC, 1990, pp. 31–2.
[7] N.-S. Woo, “Revisiting the Cascade Circuit in Logic Cells of Lookup Table Based FPGAs,” in ACM FPGA, 1995, pp. 90–96.
[8] S. Xing and W. W. Yu, “FPGA Adders: Performance Evaluation and Optimal Design,” IEEE Design & Test of Computers, vol. 15, no. 1, pp. 24–29, 1998.
[9] S. Hauck, M. Hosler, and T. Fry, “High-Performance Carry Chains for FPGA’s,” IEEE TVLSI, vol. 8, no. 2, pp. 138–147, 2000.
[10] H. Parandeh-Afshar, A. Neogy et al., “Compressor tree synthesis on commercial high-performance FPGAs,” ACM TRETS, vol. 4, no. 4, pp. 39:1–39:19, Dec. 2011.
[11] H. Parandeh-Afshar, P. Brisk, and P. Ienne, “A Novel FPGA Logic Block for Improved Arithmetic Performance,” in ACM FPGA, 2008, pp. 171–180.
[12] H. Parandeh-Afshar, P. Brisk, and P. Ienne, “Efficient Synthesis of Compressor Trees on FPGAs,” in IEEE ASP-DAC, 2008, pp. 138–143.
[13] Xilinx Inc., “UltraScale Architecture Configurable Logic Block,” https://www.xilinx.com/support/documentation/user_guides/ug574-ultrascale-clb.pdf, 2017.
[14] Altera Corp., “Logic Array Blocks and Adaptive Logic Modules in Stratix V Devices,” http://www.altera.com/literature/hb/stratix-v/stx5_51002.pdf, 2013.
[15] Intel Corp., “Intel Stratix 10 Logic Array Blocks and Adaptive Logic Modules User Guide,” https://www.intel.com/content/dam/www/programmable/us/en/pdfs/literature/hb/stratix-10/ug-s10-lab.pdf, 2018.
[16] J. Luu, C. McCullough et al., “On hard adders and carry chains in FPGAs,” in IEEE FCCM, 2014, pp. 52–59.
[17] C. Chiasson and V. Betz, “COFFE: Fully-Automated Transistor Sizing for FPGAs,” in IEEE FPT, 2013, pp. 34–41.
[18] I. Kuon and J. Rose, “Area and Delay Trade-Offs in the Circuit and Architecture Design of FPGAs,” in ACM FPGA, 2008, pp. 149–158.
[19] D. Lewis, E. Ahmed et al., “The Stratix II logic and routing architecture,” in ACM FPGA, 2005, pp. 14–20.
[20] G. Zgheib and P. Ienne, “Evaluating FPGA clusters under wide ranges of design parameters,” in FPL, Sep. 2017, pp. 1–8.
[21] C. Chiasson and V. Betz, “Should FPGAs abandon the pass-gate?” in Int. Conf. on Field Programmable Logic and Applications, 2013, pp. 1–8.
[22] J. Luu, J. Goeders et al., “VTR 7.0: Next Generation Architecture and CAD System for FPGAs,” ACM Trans. Reconfigurable Technol. Syst., vol. 7, no. 2, pp. 6:1–6:30, Jul. 2014.
[23] Berkeley Logic Synthesis and Verification Group, “ABC: A System for Sequential Synthesis and Verification,” http://www.eecs.berkeley.edu/~alanmi/abc, 2009.
[24] S. Jang et al., “SmartOpt: An Industrial Strength Framework for Logic Synthesis,” in ACM FPGA, 2009, pp. 237–240.
[25] J. Luu, J. Rose, and J. Anderson, “Towards Interconnect-Adaptive Packing for FPGAs,” in ACM FPGA, 2014.
[26] K. E. Murray, S. Whitty et al., “Titan: Enabling large and complex benchmarks in academic CAD,” in Int. Conf. on Field Programmable Logic and Applications, 2013, pp. 1–8.

Kevin E. Murray (S’12) is a PhD candidate at the University of Toronto, where he received his BASc and MASc in 2012 and 2015. He has previously been a visiting Research Assistant at Imperial College London, and worked on digital design flows at Advanced Micro Devices (AMD). His research interests include FPGA CAD and Architecture, timing analysis, and modular design flows. He has contributed extensively to the VTR project since 2012.

Jason Luu is a senior software engineer at Altera. He received the BASc degree in CE from the University of Waterloo in 2007 and the MASc and PhD degrees at the University of Toronto in 2010 and 2014 respectively. He has contributed 8 years to the open source VTR project for FPGA CAD and architecture research.

Matthew J. P. Walker (S’16) received his BASc in Computer Engineering from the University of Toronto in 2017, and is presently an MASc candidate in Computer Engineering at the same university, where he is researching CGRA mapping. He has previously worked at Altera and Intel PSG, and worked as a summer researcher in 2014 on the VTR project.

Conor McCullough (S’14) received the B.Eng. degree in computer engineering from the University of New Brunswick, Fredericton, NB, Canada, in 2014. He worked as a summer researcher in 2014 on the VTR project under the supervision of Kenneth Kent.

Sen Wang (S’14) received the B.Sc. degree in computer science from Nanjing University of Science & Technology, Nanjing, Jiangsu, China, in 2009, and the M.Sc. degree in computer science, Fredericton, New Brunswick, Canada, in 2014.

Safeen Huda (S’13) received the B.A.Sc. and M.A.Sc. degrees in electrical engineering from the University of Toronto, Toronto, ON, Canada in 2009 and 2012, respectively, and the Ph.D. degree in computer engineering in 2017. His research interests include all aspects of digital circuit and system design. Mr. Huda has held the Natural Sciences and Engineering Research Council of Canada Post-Graduate Scholarship and the University of Toronto Fellowship.

Bo Yan (S’14) received the B.Eng. degree in electronic and information engineering from Hubei University of Technology, Wuhan, Hubei, China, in 2010, and the M.Sc. degree in computer science, Fredericton, New Brunswick, Canada, in 2015.

Charles Chiasson (S’08–M’15) received the B.Eng. degree in electrical engineering from Université de Moncton in 2011 and the M.A.Sc. degree in electrical and computer engineering from the University of Toronto in 2013. He is currently with the Altera Toronto Technology Center, Altera Corporation, Toronto, ON, Canada.

Kenneth B. Kent (S’96–M’02–SM’13) received his BSc degree from Memorial University of Newfoundland, Canada, and MSc and PhD degrees from the University of Victoria, Canada. He is a professor in the Faculty of Computer Science at the University of New Brunswick, and Director of the IBM Centre for Advanced Studies-Atlantic, Canada. His research interests are Hardware/Software Co-Design, Virtual Machines, Reconfigurable Computing, and Embedded Systems. His research groups are key contributors to widely used software such as the IBM J9 Java virtual machine and the VTR (Verilog-To-Routing) FPGA CAD flow.

Jason Anderson (S’96–M’05) received the B.Sc. degree in computer engineering from the University of Manitoba, Winnipeg, MB, Canada, and the Ph.D. and M.A.Sc. degrees in electrical and computer engineering (ECE) from the University of Toronto (U of T), Toronto, ON, Canada. In 1997, he joined the FPGA Implementation Tools Group, Xilinx, Inc., San Jose, CA, working on placement, routing, and synthesis. He is currently an Associate Professor with the ECE Department at U of T and holds the Jeffrey Skoll Chair in Software Engineering. He has authored over 60 papers published in refereed conference proceedings and journals, and is an inventor on 26 issued U.S. patents. His current research interests include computer-aided design, architecture and circuits for FPGAs.

Jonathan Rose (F’09) is a Professor with the Department of Electrical and Computer Engineering, University of Toronto, Toronto, ON, Canada. He has worked in the area of FPGA CAD and architecture for over 20 years, including stints at the two major vendors, Xilinx and Altera, as well as a startup. Prof. Rose is a Fellow of the ACM and a Foreign Member of the American National Academy of Engineering.

Vaughn Betz (S’88–M’91–SM’17) Dr. Betz co-founded Right Track CAD to develop new FPGA architectures and CAD tools and was its VP of Engineering until its acquisition by Altera in 2000. He was at Altera from 2000 to 2011, ultimately as Senior Director of Software Engineering, and is one of the architects of both the Quartus II CAD system and the Stratix I - V and Cyclone I - V FPGAs. He is a Professor and the NSERC/Intel Industrial Research Chair in Programmable Silicon at the University of Toronto, where his research covers FPGA architecture, CAD, and acceleration of computation using FPGAs.
