High-Performance, Cost-Effective Heterogeneous 3D FPGA Architectures
Roto Le
Division of Engineering
Brown University
Providence, RI 02912
[email protected]
Sherief Reda
Division of Engineering
Brown University
Providence, RI 02912
[email protected]
ABSTRACT
side the traditional reconfigurable fabric, heterogeneous FPGAs include dedicated full-custom design components such as digital signal processors (DSP), multipliers, on-chip memory blocks, and entire processors. Examples of such heterogeneous FPGAs include
Xilinx Spartan 3, Virtex 4, 5 and Altera Cyclone II, Stratix II, III
and Lattice ECP2 family.
To provide the required reconfigurable functionality, FPGAs provide a large amount of programmable interconnect resources in the
form of wire segments, switches, and signal repeaters. These programmable interconnect resources typically consume a large portion of the FPGA silicon die area. A number of recent studies show
that programmable interconnect fabric consumes about 70-80% of
the total FPGA area [8, 6]. Since die area is one of the main factors
that determine manufacturing costs, reducing the silicon footprint
of the programmable fabric can lead to significant improvements in
the manufacturing costs of FPGAs. Reducing the length of interconnects will also bring performance improvements to the typical
interconnect-delay dominated FPGAs.
Three-dimensional (3D) Integrated Circuits (ICs) with through-silicon vias are a new technology that will increase the functionality, scale of integration, and performance of integrated systems [1,
2]. Increasing the scale of integration is particularly attractive considering that optical lithography is approaching its natural limits.
In 3D integration, multiple die or layers are integrated and interconnected with through-silicon vias (TSVs). Three-dimensional
integration can lead to significant reduction in wire length and interconnect delay through the use of TSVs. A number of recent
publications propose novel 3D architectures and physical design
techniques that lead to FPGAs with better performance than existing planar FPGAs [3, 8, 9, 10, 6, 11]. For example, Alexander et
al. [3] developed 3D island-style based FPGAs that extend four directional 2D switch boxes to six directional 3D switch boxes. This
3D switch architecture allows logic blocks to have six immediate
neighbors including four on the die or plane where the switch box
is placed, and two others above and below the die. In another work,
Lin et al. propose a 3D FPGA architecture that partitions homogeneous FPGA components such that configuration SRAM memory
cells and switch transistors can be moved to other 3D layers [6]. In
addition to devising new 3D FPGA architectures, a number of recent studies develop placement and routing models to support and
assess 3D FPGA architectures (e.g., [7, 8, 9, 10, 11]).
In this paper our objective is to develop novel 3D FPGA architectures and designs that improve performance with lower costs than
planar FPGAs. Our cost savings arise from significant reductions
in total die area enabled by our methodology. We summarize the
contributions of this paper as follows.
1.
R. Iris Bahar
Division of Engineering
Brown University
Providence, RI 02912
[email protected]
INTRODUCTION
Field Programmable Gate Arrays (FPGAs) have become a viable alternative to custom Integrated Circuits (ICs) by providing
flexible computing platforms with improved costs and shorter time-to-market. In an FPGA-based system, a design is mapped onto
an array of reconfigurable logic blocks that communicate through re-programmable interconnections composed of wire segments and
switch boxes. While the re-programmable capability provides flexibility, it also leads to area and performance overheads in comparison to custom chips. Thus, to benefit from advantages of both
FPGAs and custom chips, heterogeneous FPGAs have emerged
as an attractive choice for system-on-a-chip implementations. Be-
GLSVLSI'09, May 10-12, 2009, Boston, Massachusetts, USA.
Copyright 2009 ACM 978-1-60558-522-2/09/05 ...$5.00.
We formulate the problem of resource partitioning for heterogeneous FPGAs into a number of die for 3D ICs to minimize
the total die area and the fabrication costs of 3D FPGAs.
In this paper, our objective is to tackle these challenges and develop a realistic design methodology for 3D FPGAs that delivers the expected 3D performance benefits while minimizing any incurred costs. Our overarching goals can be summarized with the following problem formulation.
Given: A planar FPGA that has a total area A and contains a set of
heterogeneous computational resources R = {r_1, ..., r_N}.
Output: Find the optimal number of die, m, and a partition of
R into the m die such that the total die area of the 3D FPGA is
minimized compared to A and performance is maximized.
As an example, a set of heterogeneous computational resources
R for an FPGA might have 4000 logic blocks, 1000 4K memory
blocks, 200 DSPs, and 2 processors for a total of N = 5202 computational components. We seek to find the optimal number of die m
and a partition that maps each computational resource into exactly
one die.
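The requirement that every resource maps to exactly one die can be illustrated with a minimal Python sketch. The resource counts come from the example above; the round-robin assignment is only a hypothetical placeholder policy (it is not the partitioning method of this paper), and all function names are illustrative assumptions.

```python
from collections import Counter

# Resource counts taken from the example in the text.
RESOURCE_COUNTS = {"logic_block": 4000, "mem_4k": 1000, "dsp": 200, "processor": 2}


def build_resources(counts):
    """Expand per-type counts into a flat list of unique (type, index) ids."""
    return [(kind, i) for kind, n in counts.items() for i in range(n)]


def round_robin_partition(resources, m):
    """Hypothetical placeholder policy: resource number k goes to die k mod m."""
    return {res: k % m for k, res in enumerate(resources)}


def is_valid_partition(resources, assignment, m):
    """Check that each resource is mapped to exactly one die index in [0, m)."""
    return len(assignment) == len(resources) and all(
        res in assignment and 0 <= assignment[res] < m for res in resources
    )


resources = build_resources(RESOURCE_COUNTS)
assignment = round_robin_partition(resources, m=3)
per_die = Counter(assignment.values())  # number of resources placed on each die
```

A real partitioner would replace `round_robin_partition` with a policy that accounts for die area and inter-die connectivity; the validity check stays the same.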
We proceed by first proposing a novel approach to estimate the
total die area of 3D FPGAs and determine the optimal number of
die (Section 3). Our area estimation includes total logic area, total
TSV area, and total programmable interconnect area.
The rest of this paper is organized as follows. Section 2 introduces our motivation and formulation for the problem of transforming a planar heterogeneous FPGA design into a 3D FPGA design.
In Section 3, we propose how to calculate the die areas allocated for
computation, TSVs, and wiring in 3D FPGAs. In Section 4, we discuss how to calculate the improvement in performance attained by
using 3D FPGAs. Section 5 presents the results and observations
from our experimental evaluation. Finally, Section 6 summarizes
the main conclusions of this work.
2.
3.
3.1
In this section we present area estimation models for computational components in a generic heterogeneous island-style SRAM-based FPGA. Such an FPGA will contain soft computational resources such as logic clusters, and hardcore computational resources
such as embedded memory blocks, DSP blocks, and processors.
We next describe the area estimation approach and assumptions for
each of these components.
Logic Cluster: The reconfigurable logic cluster or block executes
logic operations and is considered the main component in FPGAs.
A generic logic cluster contains a number of Look-Up Tables (LUTs)
and associated registers, I/O, multiplexers, clock, and reset units.
The total area of a logic cluster may be computed by summing the
area of all these components. In our study we use a cluster architecture and area model from [12]. The cluster consists of eight 4-input
LUTs, 20 logic input pins, 8 output pins, 1 clock and 1 reset signal.
Embedded Memory and DSP Blocks: In contrast to homogeneous FPGAs, heterogeneous FPGAs contain dedicated hard memory and DSP blocks to obtain higher performance and power saving. A typical heterogeneous FPGA often contains SRAM memory
blocks having different sizes to provide high flexibility in configuration and utilization. We use two different sized memory blocks,
Mem1 and Mem2 sized similar to memory blocks in a realistic
FPGA; i.e., 64×16 bits and 128×32 bits, respectively. We assume
that these memory blocks are SRAM blocks and estimate their area
by using the CACTI memory models [14, 15]. After getting the
area estimation for memory blocks we estimate the area of DSP
blocks based on the relative ratio between DSP and Mem2 blocks
from the Altera Stratix II handbook documentation ([16] p. 2-41).
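As a rough illustration of how these per-component estimates combine, the following Python sketch sums the normalized areas from Table 1 for a given resource mix. The mix reuses the earlier example (4000 logic clusters, 1000 memory blocks, 200 DSPs) under the assumption that each 4K memory block corresponds to a Mem2 block (128×32 = 4096 bits) and each logic block to a size-8 cluster; processors are left out because Table 1 gives no area for them.

```python
# Areas normalized to one 4-input LUT, copied from Table 1.
NORMALIZED_AREA = {
    "lut4": 1.0,
    "cluster8": 28.5,
    "cluster16": 76.7,
    "mem1": 65.3,
    "mem2": 365.3,
    "dsp": 1461.2,
}


def total_computational_area(counts, areas=NORMALIZED_AREA):
    """Sum count * normalized area over all component types."""
    return sum(areas[kind] * n for kind, n in counts.items())


# Assumed mapping of the earlier example onto Table 1 component types.
example_counts = {"cluster8": 4000, "mem2": 1000, "dsp": 200}
example_area = total_computational_area(example_counts)

# Relative DSP-to-Mem2 area ratio implied by the Table 1 values.
dsp_to_mem2_ratio = NORMALIZED_AREA["dsp"] / NORMALIZED_AREA["mem2"]
```

The result is in units of one 4-input LUT area; multiplying by the LUT area quoted in the Table 1 caption would convert it to an absolute area.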
3. The number of die in the 3D stack should be chosen to maximize the performance and minimize the costs. Once the number of die in the 3D stack is determined, partitioning the computational resources of the FPGA among the die should be carried out in a way to minimize the total demand on the interconnect resources and the required number of TSVs.
Components        Capacity                     Area
4-input LUT       1 LUT                        1
Cluster size 8    8 LUTs                       28.5
Cluster size 16   16 LUTs                      76.7
Mem1 block        32x16 bits                   65.3
Mem2 block        128x32 bits                  365.3
DSP block         four 16x16-bit multipliers   1461.2
[Figure labels: Cut1 = 4, Cut2 = 6, Cut3 = 4, Cut4 = 3]

$$\sum_{i=1}^{l} T_i \qquad \sum_{i=1}^{l} \sum_{j=1,\, j \neq i}^{l} T_{ij}, \tag{1}$$

$$\frac{\max\{E_1, E_2, \ldots, E_{m-1}\}}{\text{Number of switch boxes}}. \tag{3}$$
Assuming that the pitch width of the TSVs is p, the area allocated for TSVs per switch box is equal to p^2 W_V.
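The per-switch-box TSV area can be sketched numerically as follows. This is a minimal Python illustration, not the paper's implementation: it assumes the pitch is given in micrometers, it reads the surviving fraction in Eq. (3) as the vertical channel width W_V with E_i taken to be the number of connections crossing between adjacent die (an assumption, since the definition of E_i is not recoverable here), and all function names and example numbers are hypothetical.

```python
import math

UM2_PER_CM2 = 1e8  # 1 cm^2 = 10^8 um^2


def vertical_channel_width(cut_sizes, num_switch_boxes):
    """Assumed reading of Eq. (3): largest inter-die cut spread over the switch boxes."""
    return math.ceil(max(cut_sizes) / num_switch_boxes)


def tsv_area_per_switch_box(pitch_um, w_v):
    """TSV area per switch box, p^2 * W_V, in um^2."""
    return pitch_um ** 2 * w_v


def total_tsv_area_cm2(pitch_um, w_v, num_switch_boxes):
    """Total TSV area of one die in cm^2, assuming every switch box carries W_V TSVs."""
    return tsv_area_per_switch_box(pitch_um, w_v) * num_switch_boxes / UM2_PER_CM2


# Hypothetical example: two inter-die cuts, 500 switch boxes per die, 20 um pitch.
w_v = vertical_channel_width([4000, 6000], num_switch_boxes=500)
area_per_box = tsv_area_per_switch_box(20.0, w_v)
```

Because the area grows with the square of the pitch, halving the pitch would cut the TSV area per switch box by a factor of four for the same W_V.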
In this section our objective is to compute the silicon area required by through-silicon vias. A typical TSV can occupy a remarkably large silicon area (e.g., 44 μm²) with a pitch of 20 μm [13],
and thus it is important to calculate the expected area utilized by
the TSVs in 3D FPGA designs. Figure 1 demonstrates a switch
box for 3D ICs that is architecturally formed by extending a regular 2D switch box to include two vertical channels of TSVs in
addition to the traditional lateral wiring channels. One key aspect
in the design of a 3D switch box is determining the size of the lateral wiring channel, W_W, and the size of the vertical TSV channel,
W_V. The sizes of these channels will play key roles in determining
the routability, performance, and die area of the 3D FPGAs.
The size of the vertical TSV channel is determined by a number
of factors, including: (1) the number of die in the 3D stack; (2) the
allocation of computational resources across the different die; and
(3) the expected inter-die communication which depends on the application circuit programmed in the FPGA as well as the placement
and routing tool. The exact size of the vertical TSV channel is determined by using a graph-theoretic approach that we describe with
the help of Figure 2. The figure shows a possible partitioning of a
heterogeneous system into five parts, where each partition should
Table 1: Area estimation of computational components normalized to the area of a 4-input LUT (26598 2 ).
3.2
3.3
To estimate the total chip area, in this section we present the estimation model derived from [12, 8] for the reconfigurable routing
components, which includes the connection blocks and the switch
boxes. The area occupied by these routing components depends on
their architecture and the width of interconnect channels (i.e., both
the size of the within-die lateral wiring channels and the size of the
vertical TSV channels). The areas for these routing components
can be determined as follows:
Connection Blocks: The connection blocks consist of programmable
switches that connect I/O pins of logic clusters to lateral channels,
as shown in Figure 4. The size of the I/O connection blocks is
determined by the fan-in connection factor, Fci , and the fan-out
connection factor, Fco , which gives the fraction of wiring tracks in
a lateral channel to which each input pin and output pin can connect to, respectively. The area of a connection block can be calculated by first counting the number of buffers, pass transistors, and
multiplexors required for it, and then summing the areas of these
elements as outlined by [8].
Ld 2(.d + 1)
[dp + N (1 p )],
N
Id = keq N
$$S_{2D} = \frac{W_W F_s (F_s + 1)}{2}, \tag{4}$$
where F_s is the maximum allowable fanout for an incoming wire
segment into the switch box [12, 8]. In contrast to a 2D switch
box, a 3D switch box accommodates four lateral wiring channels
(each consisting of W_W wiring tracks) and two vertical channels
(each consisting of W_V TSVs), as shown in Figure 1. Since W_V
is not necessarily equal to W_W, there could be only W_V tracks
among the W_W tracks in each lateral wiring channel that can be
connected to the W_V tracks of the TSVs. Furthermore, the maximum fanout of an incoming TSV, F_sv, could be different from
the maximum allowable fanout of an incoming wire segment, F_sw.
Thus, the number of switch points in a 3D switch box, S_3D, can be
generally computed as
3.4
$$\sum_{d=1}^{q} \bar{L}_d I_d, \tag{7}$$
where q is the maximum fan-out of the netlist, \bar{L}_d is the average
length of a net having fanout of d, and I_d is the number of nets
[Figure: lateral channel width W_W and vertical channel width W_V versus the number of dies.]
Figure 4: A typical SRAM-based Island Style FPGA
[Figure 6: (a) 2D interconnect path; (b) 3D interconnect path.]
4.
One of the important advantages of 3D technology is the general reduction in the average distance between the components of
the computational system. Three-dimensional technology can substitute long interconnect paths by short ones that are stitched together using TSVs. This reduction in interconnect length improves
the signal propagation delay between the computational resources
improving the overall FPGA performance. The reductions in wire
capacitance and resistance achieved from replacing long wires with
TSVs are significant. The objective of this section is to estimate the
improvement in signal propagation delay using our 3D FPGA design model.
To estimate the average interconnect path delay in 3D ICs, we
first consider every pair of locations across all die and calculate the
delay between the two locations and then calculate average delay as
average of these point-to-point delays. For every pair of locations,
we calculate the distance between them and then estimate the number of L4 and L16 wire segments that would be used to create an
interconnect path between the two locations. If the two locations
end up on the same die, then the delay of the path between them
is calculated using a distributed RC delay model of its path constituents (i.e., the L4/L16 wire segments and the pass transistors in
the intermediate switch boxes (SB), as shown in Figure 6(a)). If the
two locations end up on different die, the delay is computed for the
path shown in Figure 6(b) with TSV delay taken into account. The
estimation result will be shown in Section 5.
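The averaging procedure described above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the delay model used in the paper: it replaces the distributed RC model with a fixed lumped delay per L4 segment, L16 segment, switch box, and TSV; it measures distance as Manhattan distance in logic-block units; and it covers each distance greedily with L16 segments first. All function names, parameters, and delay values are hypothetical.

```python
from itertools import combinations


def segment_counts(distance):
    """Greedy cover of a lateral distance: L16 segments first, then L4 segments."""
    n16, rest = divmod(distance, 16)
    n4 = -(-rest // 4)  # ceiling division for the remainder
    return n4, n16


def path_delay(a, b, d4, d16, d_sb, d_tsv):
    """Lumped delay between locations a and b, each given as (x, y, die)."""
    (x1, y1, z1), (x2, y2, z2) = a, b
    n4, n16 = segment_counts(abs(x1 - x2) + abs(y1 - y2))
    n_switch = max(n4 + n16 - 1, 0)  # assumed: one switch box between consecutive segments
    lateral = n4 * d4 + n16 * d16 + n_switch * d_sb
    vertical = abs(z1 - z2) * d_tsv  # one TSV hop per die crossed
    return lateral + vertical


def average_delay(locations, d4, d16, d_sb, d_tsv):
    """Average of the point-to-point delays over every pair of locations."""
    pairs = list(combinations(locations, 2))
    return sum(path_delay(a, b, d4, d16, d_sb, d_tsv) for a, b in pairs) / len(pairs)


# Hypothetical example with three locations spread over two die.
locations = [(0, 0, 0), (20, 0, 0), (0, 0, 1)]
avg = average_delay(locations, d4=1.0, d16=3.0, d_sb=0.5, d_tsv=0.2)
```

Swapping `path_delay` for a distributed RC computation over the same segment counts would bring the sketch closer to the model described in the text while leaving the pairwise averaging unchanged.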
To further improve the performance of 3D FPGAs, we propose
incorporating bypass TSVs into the switch boxes. Bypass TSVs will
be used to connect non-adjacent dies directly by passing through a
switch box without any interaction with any intermediate switches.
A bypass TSV will not eliminate the silicon area required for the inseries TSVs in the intermediate die, but it will eliminate the delay
and area that would have been introduced by intermediate switches.
For 3D FPGAs, our experiments in Section 5 show that using bypass TSVs can reduce the average interconnect path delay and the
die area by significant amounts.
5.
# of Dies   Die   Resource            Die Area (cm^2)   Total Area (cm^2)
1           1     logic + hardcores   0.700             0.70
2           1     logic               0.450             0.63
            2     hardcores           0.180
2           1     logic + hardcores   0.290             0.59
            2     logic + hardcores   0.290
3           1     logic               0.170             0.52
            2     hardcores           0.180
            3     logic               0.170
3           1     logic + hardcores   0.176             0.53
            2     logic + hardcores   0.176
            3     logic + hardcores   0.176
EXPERIMENTAL RESULTS
as the number of die in a 3D stack increases, the total interconnect area reduces and the total TSV area increases. We have investigated the optimal number of die that gives the greatest savings
in die area. We have estimated the improvement in performance
that will be attained by switching to 3D technology, and we have
analyzed the performance benefits of using heterogeneous FPGAs
with regular TSVs and bypass TSVs. Using Rent-based statistical
analysis, we have shown that 3D FPGAs can reduce die area by
about 27% while simultaneously improving performance by up to
58%. Though statistical-based estimation might cause variations
compared with realistic benchmark designs, the experimental results are consistent with theoretical analyses.
Finally, for future work, we would like to develop a 3D heterogeneous placement and routing tool and use it to run experiments on benchmark designs that evaluate our statistical estimation model. We also plan to analyze the impact of 3D stacking on the thermal distribution of 3D heterogeneous FPGAs.
[Figure: total die area (cm^2) and average delay versus the number of dies, with Regions I, II, and III marked.]
7.
6.
REFERENCES
[1] K. Banerjee, et al., 3-D ICs: A Novel Chip Design for
Deep-Submicrometer Interconnect Performance and
Systems-on-Chip Integration, Proc. of the IEEE, vol. 89(5), pp.
602-633, 2001.
[2] A. W. Topol, et al., Three-dimensional Integrated Circuits, IBM
Journal of Res. and Dev., vol. 50(4-5), pp. 491-506, 2006.
[3] M. Alexander, et al., Three-dimensional field-programmable gate
arrays, ASIC Conference and Exhibit, 1995., Proc. of the Eighth
Annual IEEE International, pp. 253-256, Sep 1995.
[4] W. Meleis, et al., Architectural design of a three dimensional
FPGA, Advanced Research in VLSI, 1997. Proc., Seventeenth
Conference on, pp. 256-268, Sep 1997.
[5] G. Borriello, et al., The triptych FPGA architecture, VLSI Systems,
IEEE Transactions on, vol. 3, no. 4, pp. 491-501, Dec 1995.
[6] M. Lin, et al., Performance benefits of monolithically stacked
3D-FPGA, in Proc. of the ACM/SIGDA 14th ISFPGA. New York,
NY, USA: ACM, 2006, pp. 113-122.
[7] A. J. Alexander, et al., Placement and routing for three-dimensional
FPGAs, in Fourth Canadian Workshop on Field-Programmable
Devices, 1996, pp. 11-18.
[8] A. Rahman, et al., Wiring requirement and three-dimensional
integration technology for field programmable gate arrays, VLSI
Systems, IEEE Transactions on, vol. 11, no. 1, pp. 44-54, Feb 2003.
[9] C. Ababei, et al., Three-dimensional place and route for FPGAs,
Computer-Aided Design of Integrated Circuits and Systems, IEEE
Transactions on, vol. 25, no. 6, pp. 1132-1140, June 2006.
[10] Y.-S. Kwon, et al., A 3-D FPGA wire resource prediction model
validated using a 3-D placement and routing tool, in Proc. of SLIP
'05. New York, NY, USA: ACM, 2005, pp. 65-72.
[11] M. Lin, et al., A routing fabric for monolithically stacked
3D-FPGA, in Proc. of the ACM/SIGDA 15th ISFPGA. New York,
NY, USA: ACM, 2007, pp. 3-12.
[12] V. Betz, et al., Architecture and CAD for Deep-Submicron FPGAs.
Norwell, MA, USA: Kluwer Academic Publishers, 1999.
[13] V. F. Pavlidis, et al., Three Dimensional Integrated Circuit
Design. Morgan Kaufmann Publishers, 2008.
[14] S. Wilton and N. Jouppi, Cacti: an enhanced cache access and cycle
time model, Solid-State Circuits, IEEE Journal of, vol. 31, no. 5, pp.
677-688, May 1996.
[15] Cacti 5.3, Online, available at:
http://quid.hpl.hp.com:9081/cacti/index.y?new.
[16] Altera Stratix II device handbook, volume 1,
http://www.altera.com/literature/hb/stx2/stratix2_handbook.pdf.
[17] B. Landman and R. Russo, On a pin versus block relationship for
partitions of logic graphs, Computers, IEEE Transactions on, vol.
C-20, no. 12, pp. 1469-1479, Dec. 1971.
[18] P. Zarkesh-Ha, et al., Prediction of net-length distribution for global
interconnects in a heterogeneous system-on-a-chip, VLSI Systems,
IEEE Transactions on, vol. 8, no. 6, pp. 649-659, 2000.
We also estimate the impact of using bypass TSVs between non-adjacent dies, as presented in Section 4. The results show that using
bypass TSVs improves the reductions in die area and average delay
by a further 4.63% and 9.78%, respectively.
The common trends between the tested designs lead to an intuitive explanation for the impact of transforming planar FPGA designs to use 3D technology. If we denote the optimal number of die
from a pure area savings perspective as m_a and the optimal number
of die from a pure delay (or performance) perspective as m_p, then
from our results we can identify three regions for 3D FPGA design.