Reconfigurable Computing - What, Why & How
Reconfigurable Computing is emerging as an important new organizational structure for implementing computations. It combines the post-fabrication programmability of processors with the spatial computational style most commonly employed in hardware designs. The result changes traditional "hardware" and "software" boundaries, providing an opportunity for greater computational capacity and density within a programmable medium. Reconfigurable Computing must leverage traditional CAD technology for building spatial designs. Beyond that, however, reprogrammability introduces new challenges and opportunities for automation, including binding-time and specialization optimizations, regularity extraction and exploitation, and temporal partitioning and scheduling.

Introduction
Traditionally, we either implemented computations in hardware (e.g. custom VLSI, ASICs, gate-arrays) or we implemented them in software running on processors (e.g. DSPs, microcontrollers, embedded or general-purpose microprocessors). More recently, however, Field-Programmable Gate Arrays (FPGAs) introduced a new alternative which mixes and matches properties of the traditional hardware and software alternatives. Machines based on these FPGAs have achieved impressive performance [1] [11] [4]—often achieving 100× the performance of processor alternatives and 10-100× the performance per unit of silicon area.

Using FPGAs for computing led the way to a general class of computer organizations which we now call reconfigurable computing architectures. The key characteristics distinguishing these machines are that they both:

- can be customized to solve any problem after device fabrication

- exploit a large degree of spatially customized computation in order to perform their computation

This class of architectures is important because it allows the computational capacity of the machine to be highly customized to the instantaneous needs of an application while also allowing the computational capacity to be reused in time at a variety of time scales. As single-chip silicon die capacity grows, this class of architectures becomes increasingly viable, since more tasks can be profitably implemented spatially, and increasingly important, since post-fabrication customization is necessary to differentiate products, adapt to standards, and provide broad applicability for monolithic IC designs.

In this tutorial we introduce the organizational aspects of reconfigurable computing architectures, and we relate these reconfigurable architectures to more traditional alternatives (Section 2). Section 3 distinguishes these different approaches in terms of instruction binding timing. We emphasize an intuitive appreciation for the benefits and tradeoffs implied by reconfigurable design (Section 4), and comment on its relevance to the design of future computing systems (Section 6). We end with a roundup of CAD opportunities arising in the exploitation of reconfigurable systems (Section 7).

Hardware and Software

When implementing a computation, we have traditionally decided between custom hardware implementations and software implementations. In some systems, we make this decision on a subtask-by-subtask basis, placing some subtasks in custom hardware and some in software on more general-purpose processing engines. Hardware designs offer high performance because they are:

- customized to the problem—no extra overhead for interpretation or extra circuitry capable of solving a more general problem

- relatively fast—due to highly parallel, spatial execution

Software implementations exploit a "general-purpose" execution engine which interprets a designated data stream as instructions telling the engine what operations to perform. As a result, software is:

- flexible—task can be changed simply by changing the instruction stream in rewriteable memory

- relatively slow—due to mostly temporal execution

- relatively inefficient—since operators can be poorly matched to the computational task
Figure 1 depicts the distinction between spatial and temporal
computing. In spatial implementations, each operator exists at a
different point in space, allowing the computation to exploit paral-
lelism to achieve high throughput and low computational latencies.
In temporal implementations, a small number of more general com-
pute resources are reused in time, allowing the computation to be
implemented compactly. Figure 3 shows that when we have only
Figure 1: Spatial versus Temporal Computation for the expression y = Ax^2 + Bx + C
Figure 2: Spatially Configurable Implementation of the expression y = Ax^2 + Bx + C
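To make the contrast in Figure 1 concrete, here is a small C sketch (illustrative only; neither the code nor the instruction encoding comes from the paper) that evaluates an expression of the same flavor as Figures 1 and 2, y = Ax^2 + Bx + C, in both styles: a "spatial" version in which each operator is a distinct unit that could be active simultaneously, and a "temporal" version in which a single general ALU is reused over several cycles under a small instruction sequence it must interpret.

#include <stdio.h>

/* "Spatial" evaluation of y = A*x^2 + B*x + C: each operator below would be a
 * separate piece of hardware (three multipliers and an adder chain), all of
 * which can be active at once and pipelined for high throughput. */
static int eval_spatial(int A, int B, int C, int x)
{
    int x_sq = x * x;      /* multiplier 1 */
    int t1   = A * x_sq;   /* multiplier 2 */
    int t2   = B * x;      /* multiplier 3 */
    return t1 + t2 + C;    /* adder chain  */
}

/* "Temporal" evaluation: one general ALU reused over several cycles, driven by
 * an instruction stream it must fetch and interpret (made-up encoding). */
typedef enum { OP_MUL, OP_ADD } opcode_t;
typedef struct { opcode_t op; int src_a, src_b, dst; } instr_t;

static int eval_temporal(int A, int B, int C, int x)
{
    int reg[8] = { A, B, C, x, 0, 0, 0, 0 };   /* small register file */
    const instr_t program[] = {
        { OP_MUL, 3, 3, 4 },   /* r4 = x * x          (cycle 1) */
        { OP_MUL, 0, 4, 4 },   /* r4 = A * x^2        (cycle 2) */
        { OP_MUL, 1, 3, 5 },   /* r5 = B * x          (cycle 3) */
        { OP_ADD, 4, 5, 4 },   /* r4 = A*x^2 + B*x    (cycle 4) */
        { OP_ADD, 4, 2, 4 },   /* r4 = r4 + C         (cycle 5) */
    };
    for (size_t i = 0; i < sizeof program / sizeof program[0]; i++) {
        const instr_t *in = &program[i];
        reg[in->dst] = (in->op == OP_MUL) ? reg[in->src_a] * reg[in->src_b]
                                          : reg[in->src_a] + reg[in->src_b];
    }
    return reg[4];
}

int main(void)
{
    printf("spatial:  %d\n", eval_spatial(2, 3, 4, 5));   /* 2*25 + 3*5 + 4 = 69 */
    printf("temporal: %d\n", eval_temporal(2, 3, 4, 5));
    return 0;
}

In the spatial version all of the operators can work simultaneously on different data; in the temporal version the single ALU needs five issue slots plus instruction fetch and decode, which is exactly the compactness-for-time trade described above.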
A custom, hardwired design needs no memory for instructions or circuitry to perform operations not needed by a particular task. A gate-array implementation must use only pre-patterned gates; it need only include the wire segments needed for a task, but must make do with the existing transistors and transistor arrangement regardless of task needs. In the spatial extreme, an FPGA or reconfigurable design needs to hold a single instruction; this adds overhead for that instruction and for the more general structures which handle all possible instructions. The processor needs to rebind its operation on every cycle, so it must pay a large price in instruction distribution mechanism, instruction storage, and limited instruction semantics in order to support this rebinding.
On the flip side, late operation binding implies an opportunity to more closely specialize the design to the instantaneous needs of a given application. That is, if part of the data set used by an operator is bound later than the operation is bound, the design may have to be much more general than the actual application requires. For a common example, consider digital filtering. Often the filter shape and coefficients are not known until the device is deployed into a specific system. A custom device must allocate general-purpose multipliers and allocate them in the most general manner to support all possible filter configurations. An FPGA or processor design can wait until the actual application requirements are known. Since the particular problem will always be simpler, and require fewer operations, than the general case, these post-fabrication architectures can exploit their late binding to provide a more optimized implementation. In the case of the FPGA, the filter coefficients can be built into the FPGA multipliers, reducing area [2]. Specialized multipliers can be one-fourth the area of a general multiplier, and particular specialized multipliers can be even smaller, depending on the constant. Processors without hardwired multipliers can also use this trick [6] to reduce execution cycles.

If the computational requirements change very frequently during operation, then the processor can use its branching ability to perform only the computation needed at each point in time. Modern FPGAs, which lack support to quickly change configurations, can only use their reconfiguration ability to track run-time requirement changes when the time scale of change is relatively large compared to their reconfiguration time. For conventional FPGAs, this reconfiguration time scale is milliseconds, but many experimental reconfigurable architectures can reduce that time to microseconds.
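The specialization argument above can be seen in miniature with a constant-coefficient multiply. The C sketch below is only an illustration of the idea (it is not the construction used by the cited FPGA multipliers [2] or the processor technique [6]): a general multiplier pays for every partial product, while a multiplier specialized to an early-bound coefficient reduces to the few shifts and adds implied by that coefficient's nonzero bits.

#include <stdint.h>
#include <stdio.h>

/* General multiplier: must support any coefficient, so it pays for the full
 * shift-and-add array regardless of the value actually used. */
static uint32_t mul_general(uint32_t coeff, uint32_t x)
{
    uint32_t acc = 0;
    for (int bit = 0; bit < 32; bit++)          /* 32 partial products, always */
        if (coeff & (1u << bit))
            acc += x << bit;
    return acc;
}

/* Specialized multiplier: the coefficient is bound early (here at compile
 * time), so only the partial products for its set bits remain.  For
 * COEFF = 10 = 0b1010 that is two shifts and one add. */
#define COEFF 10u
static uint32_t mul_by_coeff(uint32_t x)
{
    return (x << 3) + (x << 1);                 /* 8*x + 2*x = 10*x */
}

int main(void)
{
    uint32_t x = 7;
    printf("general:     %u\n", mul_general(COEFF, x));
    printf("specialized: %u\n", mul_by_coeff(x));
    return 0;
}

A post-fabrication device can generate the specialized form once the coefficients are known and regenerate it if they change; a pre-fabrication design committed to the general form keeps paying its full cost.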
Instructions

As we have established, instructions are the distinguishing features of our post-fabrication device organizations. In mapping out this design space, instruction organization plays a large role in defining device density and device efficiency. Two important parameters for characterizing designs in this space are datapath width and instruction depth.

Datapath Width (w): How many compute operators at the bit level are controlled with a single instruction in SIMD form? In processors, this shows up as the ALU datapath width (e.g. w = 32 or 64), since all bits in the word must essentially perform the same operation on each cycle and are routed in the same manner to and from memory or register files. For FPGAs, the datapath width is one (w = 1), since routing is controlled at the bit level and each FPGA operator, typically a single-output Lookup-Table (LUT), can be controlled independently.

Sharing instructions across operators has two effects which reduce the area per bit operator:

- amortizes instruction storage area across several operators

- limits interconnect requirements to the word level

However, when the SIMD sharing width is greater than the native operation width, the device is not able to fully exploit all of its potential bit operators. Since a group of w bits must all do the same thing and be routed in the same direction, smaller operations will still consume w bit operators even though some of the datapath bits are performing no useful work. Note that segmented datapaths, as found in modern multimedia instructions (e.g. MMX [7]) or multigauge architectures [10], still require that the bits in a wide-word datapath perform the same instruction in SIMD manner.
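A quick way to see the width-mismatch penalty is to compute the fraction of bit operators doing useful work when a single operation of the task's native width runs on a w-bit SIMD datapath. The helper below uses invented example widths and assumes no sub-word segmentation; it is a back-of-the-envelope illustration, not data from the paper.

#include <stdio.h>

/* Fraction of a w-bit SIMD datapath doing useful work when the task's native
 * operation width is w_task, assuming one narrow operation per word-wide
 * instruction and no sub-word (MMX-style) segmentation. */
static double width_utilization(int w_arch, int w_task)
{
    if (w_task >= w_arch)
        return 1.0;                              /* wide data fills the datapath */
    return (double)w_task / (double)w_arch;
}

int main(void)
{
    const int arch_widths[] = { 1, 8, 32, 64 };  /* FPGA (w=1) ... wide processor */
    const int task_width = 4;                    /* e.g. a 4-bit-wide subtask     */

    for (size_t i = 0; i < sizeof arch_widths / sizeof arch_widths[0]; i++)
        printf("w = %2d : utilization of bit operators = %.2f\n",
               arch_widths[i], width_utilization(arch_widths[i], task_width));
    return 0;
}

The w = 1 row corresponds to the FPGA case: narrow data never strands bit operators, which is the flip side of paying instruction overhead on every bit operator.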
Instruction Depth (c): How many device-wide instructions do we store locally on the chip and allow to change on each operating cycle? As noted, FPGAs store a single instruction per bit operator (c = 1) on chip, allowing them to keep configuration overhead to a minimum. Processors typically store a large number of instructions on chip (c = 1000–100,000) in the form of a large instruction cache. Increasing the number of on-chip instructions allows the device capacity to be used instantaneously for different operations, at the cost of diluting the area used for active computation and hence decreasing device computational density.

Figure 5 shows where both traditional and reconfigurable organizations lie in this slice of the post-fabrication design space. Figure 6 shows the relative density of computational bit operators based on architectural parameters (see [3] for model details and further discussion). Of course, even the densest point in this post-fabrication design space is less dense than a custom, pre-fabrication implementation of a particular task, due to the overhead for generality and instruction configuration.

The peak operator density shown in Figure 6 is obtainable only if the stylistic restrictions implied by the architecture are obeyed. As noted, if the actual data width is smaller than the architected data width, some bit operators cannot be used. Similarly, if a task has more cycle-by-cycle operation variance than supported by the architecture, operators can sit idle during operational cycles, contributing to a net reduction in usable operator density. Figure 8 captures these effects.
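The qualitative shape of the density and efficiency tradeoffs captured in Figures 6 and 8 can be sketched with a toy model. This is not the calibrated area model of [3]: the instruction-area ratio, the task-variance figure, and the idleness rule below (active fraction roughly c/V when the variance V of operations needed over time exceeds the instruction depth c) are all invented for illustration. The idea is simply that locally stored instructions dilute compute density as c grows, while too few on-chip instructions leave spatially instantiated operators idle on tasks with high cycle-by-cycle operation variance.

#include <stdio.h>

/* Toy model (illustrative constants, not the model of [3]):
 *  - raw density of bit operators falls as instruction depth c grows, because
 *    locally stored instructions dilute the area available for computation;
 *  - usable efficiency falls when the task's cycle-by-cycle operation
 *    variance V exceeds the c instructions the device can re-bind in time,
 *    since spatially instantiated operators then sit idle on most cycles. */

#define INSTR_AREA_RATIO 0.05   /* assumed area of one stored instruction,     */
                                /* relative to one bit operator's compute area */

static double raw_density(int c)
{
    return 1.0 / (1.0 + INSTR_AREA_RATIO * c);       /* normalized to c = 0 */
}

static double variance_efficiency(int c, int task_variance)
{
    if (task_variance <= c)
        return 1.0;                                  /* enough on-chip instructions */
    return (double)c / (double)task_variance;        /* idle operators otherwise    */
}

int main(void)
{
    const int depths[] = { 1, 16, 256, 4096 };       /* FPGA-like ... processor-like */
    const int task_variance = 64;                    /* distinct ops needed over time */

    printf("%8s %12s %12s %12s\n", "c", "raw dens.", "var. eff.", "usable");
    for (size_t i = 0; i < sizeof depths / sizeof depths[0]; i++) {
        int c = depths[i];
        double d = raw_density(c);
        double e = variance_efficiency(c, task_variance);
        printf("%8d %12.3f %12.3f %12.3f\n", c, d, e, d * e);
    }
    return 0;
}

With these made-up constants, a shallow instruction store wins on regular tasks and loses as operation variance grows, while a deep instruction store pays a constant density penalty; this is the tension Figures 6 and 8 are meant to convey.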
Figure 5: Datapath Width × Instruction Depth Architectural Design Space

Figure 6: Relative density of computational bit operators across the Datapath Width (w) / Instruction Depth (c) design space

Figure 7: Heterogeneous Post-Fabrication Computing Device including Processor, Reconfigurable Array, and Memory
sequencing [8]. Mixing custom hardware and temporal processor
Figure 8: Efficiency versus cycle-by-cycle operation variation and design datapath width (w)
References
[1] Duncan Buell, Jeffrey Arnold, and Walter Kleinfelder. Splash
2: FPGAs in a Custom Computing Machine. IEEE Computer
Society Press, 10662 Los Vasqueros Circle, PO Box 3014,
Los Alamitos, CA 90720-1264, 1996.
[2] Kenneth David Chapman. Fast Integer Multipliers fit in FP-
GAs. EDN, 39(10):80, May 12 1993. Anonymous FTP
www.ednmag.com:EDN/di_sig/DI1223Z.ZIP .
[3] André DeHon. Reconfigurable Architectures for General-
Purpose Computing. AI Technical Report 1586, MIT Artifi-
cial Intelligence Laboratory, 545 Technology Sq., Cambridge,
MA 02139, October 1996. <ftp://publications.ai.
mit.edu/ai-publications/1500-1999/
AITR-1586.ps.Z>.
[4] André DeHon. Comparing Computing Machines. In Con-
figurable Computing: Technology and Applications, vol-
ume 3526 of Proceedings of SPIE. SPIE, November 1998.
<http://www.cs.berkeley.edu/projects/
brass/documents/ccmpare_spie98.ps.gz>.
[5] John R. Hauser and John Wawrzynek. Garp: A MIPS Pro-
cessor with a Reconfigurable Coprocessor. In Proceedings
of the IEEE Symposium on Field-Programmable Gate Arrays
for Custom Computing Machines, pages 12–21. IEEE, April
1997. <http://ww.cs.berkeley.edu/projects/
brass/documents/GarpProcessors.html>.