
Reconfigurable Computing:

What, Why, and Implications for Design Automation


André DeHon and John Wawrzynek
Berkeley Reconfigurable Architectures, Software, and Systems
Computer Science Division
University of California at Berkeley
Berkeley, CA 94720-1776
contact: <[email protected]>

 
Abstract

Reconfigurable Computing is emerging as an important new organizational structure for implementing computations. It combines the post-fabrication programmability of processors with the spatial computational style most commonly employed in hardware designs. The result changes traditional "hardware" and "software" boundaries, providing an opportunity for greater computational capacity and density within a programmable media. Reconfigurable Computing must leverage traditional CAD technology for building spatial designs. Beyond that, however, reprogrammability introduces new challenges and opportunities for automation, including binding-time and specialization optimizations, regularity extraction and exploitation, and temporal partitioning and scheduling.
1 Introduction

Traditionally, we either implemented computations in hardware (e.g. custom VLSI, ASICs, gate-arrays) or we implemented them in software running on processors (e.g. DSPs, microcontrollers, embedded or general-purpose microprocessors). More recently, however, Field-Programmable Gate Arrays (FPGAs) introduced a new alternative which mixes and matches properties of the traditional hardware and software alternatives. Machines based on these FPGAs have achieved impressive performance [1] [11] [4]—often achieving 100× the performance of processor alternatives and 10–100× the performance per unit of silicon area.

Using FPGAs for computing led the way to a general class of computer organizations which we now call reconfigurable computing architectures. The key characteristics distinguishing these machines are that they both:

- can be customized to solve any problem after device fabrication
- exploit a large degree of spatially customized computation in order to perform their computation

This class of architectures is important because it allows the computational capacity of the machine to be highly customized to the instantaneous needs of an application while also allowing the computational capacity to be reused in time at a variety of time scales. As single-chip silicon die capacity grows, this class of architectures becomes increasingly viable, since more tasks can be profitably implemented spatially, and increasingly important, since post-fabrication customization is necessary to differentiate products, adapt to standards, and provide broad applicability for monolithic IC designs.

In this tutorial we introduce the organizational aspects of reconfigurable computing architectures, and we relate these reconfigurable architectures to more traditional alternatives (Section 2). Section 3 distinguishes these different approaches in terms of instruction binding timing. We emphasize an intuitive appreciation for the benefits and tradeoffs implied by reconfigurable design (Section 4), and comment on its relevance to the design of future computing systems (Section 6). We end with a roundup of CAD opportunities arising in the exploitation of reconfigurable systems (Section 7).

2 Spatial versus Temporal Computing

When implementing a computation, we have traditionally decided between custom hardware implementations and software implementations. In some systems, we make this decision on a subtask by subtask basis, placing some subtasks in custom hardware and some in software on more general-purpose processing engines. Hardware designs offer high performance because they are:

- customized to the problem—no extra overhead for interpretation or extra circuitry capable of solving a more general problem
- relatively fast—due to highly parallel, spatial execution

Software implementations exploit a "general-purpose" execution engine which interprets a designated data stream as instructions telling the engine what operations to perform. As a result, software is:

- flexible—the task can be changed simply by changing the instruction stream in rewriteable memory
- relatively slow—due to mostly temporal execution
- relatively inefficient—since operators can be poorly matched to the computational task

Figure 1 depicts the distinction between spatial and temporal computing. In spatial implementations, each operator exists at a different point in space, allowing the computation to exploit parallelism to achieve high throughput and low computational latencies. In temporal implementations, a small number of more general compute resources are reused in time, allowing the computation to be implemented compactly. Figure 3 shows that when we have only these two options, we implicitly connect spatial processing with hardware computation and temporal processing with software.

The key benefit of FPGAs, and more broadly reconfigurable devices, is that they introduce a class of post-fabrication configurable devices which support spatial computations, thus giving us a new organizational point in this space (Figure 3). Figure 2 shows a spatially configurable computation for comparison with Figure 1. Reconfigurable devices have the obvious benefit of spatial parallelism, allowing them to perform more operations per cycle. As we will see in Section 4, the organization has inherent density advantages over traditional processor designs. As a result, reconfigurables can often pack this greater parallelism into the same die area as a modern processor.
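To make the spatial/temporal distinction concrete in code, the following C sketch evaluates the same expression, Ax^2 + Bx + C, once in a fully spatial style, where every operator is its own piece of logic, and once in a temporal style, where one general ALU is reused each cycle under the control of a stored instruction sequence. The three-operand instruction encoding, the register file, and the ALU loop are our own illustrative assumptions, not structures defined by the paper or by any particular device.

    #include <stdio.h>

    /* "Spatial" style: every operator is a separate piece of logic; in hardware
     * all three multiplies and both adds could exist side by side and run
     * concurrently on each new input. */
    static int eval_spatial(int A, int B, int C, int x) {
        int x2 = x * x;        /* multiplier 1 */
        int t1 = A * x2;       /* multiplier 2 */
        int t2 = B * x;        /* multiplier 3 */
        return t1 + t2 + C;    /* two adders   */
    }

    /* "Temporal" style: one general ALU reused in time, driven by an
     * instruction stream fetched from memory (a hypothetical encoding). */
    enum op { MUL, ADD };
    struct insn { enum op op; int src1, src2, dst; };

    static int eval_temporal(int A, int B, int C, int x) {
        int reg[8] = { A, B, C, x, 0, 0, 0, 0 };   /* r0=A r1=B r2=C r3=x */
        const struct insn prog[] = {
            { MUL, 3, 3, 4 },   /* r4 = x*x        */
            { MUL, 0, 4, 5 },   /* r5 = A*x^2      */
            { MUL, 1, 3, 6 },   /* r6 = B*x        */
            { ADD, 5, 6, 7 },   /* r7 = r5 + r6    */
            { ADD, 7, 2, 7 },   /* r7 = r7 + C     */
        };
        for (unsigned i = 0; i < sizeof prog / sizeof prog[0]; i++) {
            const struct insn *in = &prog[i];       /* one instruction per cycle */
            reg[in->dst] = (in->op == MUL) ? reg[in->src1] * reg[in->src2]
                                           : reg[in->src1] + reg[in->src2];
        }
        return reg[7];
    }

    int main(void) {
        printf("%d %d\n", eval_spatial(2, 3, 4, 5), eval_temporal(2, 3, 4, 5));
        return 0;
    }

Both routines produce the same result; the difference is that the spatial version needs no instruction fetch at all, while the temporal one pays for instruction storage and interpretation on every operation.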
[Figure 1: Spatial versus Temporal Computation for the expression Ax^2 + Bx + C. The spatial version instantiates three multipliers and two adders wired as a dataflow graph over inputs A, B, C, and x; the temporal version reuses a single ALU over several cycles (t1, t2, ...) under control of a stored instruction sequence.]

[Figure 2: Spatially Configurable Implementation of the expression Ax^2 + Bx + C. A row of configurable operators (mul, mul, mul, add, add) is wired to compute the expression, each operator holding its single configured instruction and producing intermediate results O1–O4 and the final result O5.]

[Figure 3: Coarse Design Space for Computing Implementations. Axes: when the computation is defined (pre-fabrication, "hardware", versus post-fabrication, "software") against where the computation is distributed (space versus time). ASICs and gate-arrays occupy the spatial, pre-fabrication quadrant; processors occupy the temporal, post-fabrication quadrant; reconfigurable devices occupy the spatial, post-fabrication quadrant.]

3 Instruction Binding Time

Instruction binding time is an important distinction amongst these three broad classes of computing media which helps us understand their relative merits. That is, in every case we must tell the computational media how to behave, what operation to perform and how to connect operators. In the pre-fabrication hardware case, we do this by patterning the devices and interconnect, that is, by programming the device during the fabrication process. In the "software" case, after fabrication we select the proper function from those supported by the silicon. This is done with a set of configuration bits, an instruction, which tells each operator how to perform and where to get its input. In purely spatial software architectures, the bits for each operator can be defined once and will then be used for a long processing epoch (Figure 2). This allows the operators to store only a single instruction local to the compute and interconnect operators. In temporal software architectures, the operator must change with each cycle amongst a large number of instructions in order to implement the computation as a sequence of operations on a small number of active compute operators. As a result, the temporal designs must have high bandwidth to a large amount of instruction memory for each operator (as shown in Figure 1). Figure 4 shows this continuum from pre-fabrication operation binding time to cycle-by-cycle operation binding.
[Figure 4: Binding Time Continuum. Media, ordered from "hardware" to "software": Custom VLSI, Gate Array, One-Time Programmable, FPGA, Processors. Corresponding operation binding times: mask patterning, metal masks, fuse programming, configuration load, every cycle; the mask-level options are bound at fabrication time.]
Early operation binding time generally corresponds to less implementation overhead. A fully custom design can implement only the circuits and gates needed for the task; it requires no extra memory for instructions or circuitry to perform operations not needed by a particular task. A gate-array implementation must use only pre-patterned gates; it need only see the wire segments needed for a task, but must make do with the existing transistors and transistor arrangement regardless of task needs. In the spatial extreme, an FPGA or reconfigurable design needs to hold a single instruction; this adds overhead for that instruction and for the more general structures which handle all possible instructions. The processor needs to rebind its operation on every cycle, so it must pay a large price in instruction distribution mechanism, instruction storage, and limited instruction semantics in order to support this rebinding.

On the flip side, late operation binding implies an opportunity to more closely specialize the design to the instantaneous needs of a given application. That is, if part of the data set used by an operator is bound later than the operation is bound, the design may have to be much more general than the actual application requires. For a common example, consider digital filtering. Often the filter shape and coefficients are not known until the device is deployed into a specific system. A custom device must allocate general-purpose multipliers and allocate them in the most general manner to support all possible filter configurations. An FPGA or processor design can wait until the actual application requirements are known. Since the particular problem will always be simpler, and require fewer operations, than the general case, these post-fabrication architectures can exploit their late binding to provide a more optimized implementation. In the case of the FPGA, the filter coefficients can be built into the FPGA multipliers, reducing area [2]. Specialized multipliers can be one-fourth the area of a general multiplier, and particular specialized multipliers can be even smaller, depending on the constant. Processors without hardwired multipliers can also use this trick [6] to reduce execution cycles. If the computational requirements change very frequently during operation, then the processor can use its branching ability to perform only the computation needed at each time. Modern FPGAs, which lack support to quickly change configurations, can only use their reconfiguration ability to track run-time requirement changes when the time scale of change is relatively large compared to their reconfiguration time. For conventional FPGAs, this reconfiguration time scale is milliseconds, but many experimental reconfigurable architectures can reduce that time to microseconds.
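The payoff of early data binding can be sketched directly in C. The tap count and the constant coefficients (5, -3, 2) below are invented for illustration; the point is that once coefficients are bound, each general multiply can collapse into a few shifts and adds sized to its particular constant, which is the same effect that lets constant-coefficient FPGA multipliers [2] and multiplier-free processor code [6] shrink so dramatically.

    #include <stdio.h>

    /* General filter: coefficients are late-bound data, so every tap needs a
     * full multiplier (or multiply instruction). */
    static int fir_general(const int *x, const int *coeff, int taps) {
        int acc = 0;
        for (int i = 0; i < taps; i++)
            acc += coeff[i] * x[i];
        return acc;
    }

    /* Specialized filter: the coefficients (5, -3, 2 here, chosen only for
     * illustration) are bound before mapping, so each constant multiply
     * collapses into shifts and adds sized to that particular constant. */
    static int fir_specialized(const int *x) {
        int t0 = (x[0] << 2) + x[0];        /*  5*x[0] =  4*x[0] + x[0]   */
        int t1 = -((x[1] << 1) + x[1]);     /* -3*x[1] = -(2*x[1] + x[1]) */
        int t2 = x[2] << 1;                 /*  2*x[2]                    */
        return t0 + t1 + t2;
    }

    int main(void) {
        int x[3] = { 1, 2, 3 };
        int coeff[3] = { 5, -3, 2 };
        printf("%d %d\n", fir_general(x, coeff, 3), fir_specialized(x));
        return 0;
    }

In an FPGA mapping, the same specialization shrinks each constant multiplier to roughly the logic its particular constant requires; in a processor mapping, it removes multiply instructions entirely.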
4 Instructions

As we have established, instructions are the distinguishing feature of our post-fabrication device organizations. Mapping out this design space, instruction organization plays a large role in defining device density and device efficiency. Two important parameters for characterizing designs in this space are datapath width and instruction depth.

Datapath Width (w): How many compute operators at the bit level are controlled with a single instruction in SIMD form? In processors, this shows up as the ALU datapath width (e.g. w = 32 and 64), since all bits in the word must essentially perform the same operation on each cycle and are routed in the same manner to and from memory or register files. For FPGAs, the datapath width is one (w = 1), since routing is controlled at the bit level and each FPGA operator, typically a single-output Lookup-Table (LUT), can be controlled independently.

Sharing instructions across operators has two effects which reduce the area per bit operator:

- it amortizes instruction storage area across several operators
- it limits interconnect requirements to the word level

However, when the SIMD sharing width is greater than the native operation width, the device is not able to fully exploit all of its potential bit operators. Since a group of w bits must all do the same thing and be routed in the same direction, smaller operations will still consume w bit operators even though some of the datapath bits are performing no useful work. Note that segmented datapaths, as found in modern multimedia instructions (e.g. MMX [7]) or multigauge architectures [10], still require that the bits in a wide-word datapath perform the same instruction in SIMD manner.

Instruction Depth (c): How many device-wide instructions do we store locally on the chip and allow to change on each operating cycle? As noted, FPGAs store a single instruction per bit operator (c = 1) on chip, allowing them to keep configuration overhead to a minimum. Processors typically store a large number of instructions on chip (c = 1000–100,000) in the form of a large instruction cache. Increasing the number of on-chip instructions allows the device capacity to be used instantaneously for different operations at the cost of diluting the area used for active computation and hence decreasing device computational density.

[Figure 5: Datapath Width × Instruction Depth Architectural Design Space. Axes: datapath width w (1–1024) against instruction depth c (1–2048). FPGAs appear at w = 1, c = 1; reconfigurable architectures occupy the shallow-instruction-depth region; processors and SIMD/Vector processors occupy the wide-datapath, deep-instruction-store region.]

Figure 5 shows where both traditional and reconfigurable organizations lie in this slice of the post-fabrication design space. Figure 6 shows the relative density of computational bit operators based on architectural parameters (see [3] for model details and further discussion). Of course, even the densest point in this post-fabrication design space is less dense than a custom pre-fabrication implementation of a particular task, due to the overhead for generality and instruction configuration.

The peak operator density shown in Figure 6 is obtainable only if the stylistic restrictions implied by the architecture are obeyed. As noted, if the actual data width is smaller than the architected data width, some bit operators cannot be used. Similarly, if a task has more cycle-by-cycle operation variance than supported by the architecture, operators can sit idle during operational cycles, contributing to a net reduction in usable operator density. Figure 8 captures these effects in caricature by looking at the efficiency of processor and FPGA architectures across different task requirements.

In summary, the benefits of reconfigurable architectures are:

1. greater computational density than temporal processors
2. greater semantic power, fine-grained control over bit operators, for narrow word machines
3. reuse of silicon area on coarse-grain time scales
4. the ability to specialize the implementation to instantaneous computational requirements, minimizing the resources actually required to perform a computation
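A toy model makes the joint effect of w and c easy to see. The functional form and constants below are illustrative assumptions only (the calibrated density model is in [3]): each bit operator pays a fixed active area plus its amortized share of c stored instructions, and only the task's native width out of every architected w-bit group does useful work.

    #include <stdio.h>

    /* Toy yielded-density model, loosely in the spirit of Section 4 / Figure 8.
     * Area per bit operator = active area + (instruction storage for depth c,
     * amortized over the w operators sharing it).  Utilization falls off when
     * the task word width is narrower than the architected width w.  All
     * constants are illustrative assumptions, not the model of [3]. */
    static double yielded_density(int w, int c, int task_width) {
        const double active_area = 1.0;          /* normalized compute area    */
        const double instr_area_per_bit = 0.1;   /* area of one stored instr.  */
        double area_per_bitop = active_area + (double)c * instr_area_per_bit / w;
        double utilization = (task_width < w) ? (double)task_width / w : 1.0;
        return utilization / area_per_bitop;     /* useful bit-ops per unit area */
    }

    int main(void) {
        /* FPGA-like point (w=1, c=1) versus processor-like point (w=64, c=1024),
         * for a task that only needs 8-bit operations. */
        printf("FPGA-like:      %.3f\n", yielded_density(1, 1, 8));
        printf("Processor-like: %.3f\n", yielded_density(64, 1024, 8));
        return 0;
    }

Run as-is, the FPGA-like point yields far more useful bit operations per unit area than the processor-like point for this narrow, regular task, echoing the qualitative shape of Figures 6 and 8.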
[Figure 6: Peak Density of Bit Operators in the w × c Architectural Design Space. Surface plot of relative density (1 down to 1/128) against datapath width w (1–128) and instruction depth c (1–1024); density falls as instruction depth grows.]

[Figure 7: Heterogeneous Post-Fabrication Computing Device including Processor, Reconfigurable Array, and Memory. Block diagram of a single die combining a microprocessor (uP) with I-Cache, D-Cache, MMU, and L2 Cache, a reconfigurable array block, and memory blocks.]
5 Heterogeneous Mixes

Large computing tasks are often composed of subtasks with different stylistic requirements. As we see in Figure 8, both purely FPGA and purely processor architectures can be very inefficient when running tasks poorly matched to their architectural assumptions. By a similar consideration, a post-fabrication programmable device, spatial or temporal, can be much less efficient than a pre-fabrication device which exactly solves a required computing task. Conversely, a pre-fabrication device which does not solve the required computation can be less efficient than a post-fabrication device. Consider, for example, a custom floating-point multiplier unit. While this can be 20× the performance density of an FPGA implementation when performing floating-point multiplies, the floating-point multiplier, by itself, is useless for motion estimation.

These mixed processing requirements drive interest in heterogeneous "general-purpose" and "application-specific" processing components which incorporate subcomponents from all the categories shown in Figure 3. In terms of binding time, these components recognize that a given application or application set has a range of data and operation binding times. Consequently, these mixed devices provide a collection of different processing resources, each optimized for handling data bound at a different operational time scale. Figure 7 shows an architecture mixing spatial and temporal computing elements; example components exhibiting this mix include Triscend's E5, National Semiconductor's NAPA [9], and Berkeley's GARP [5]. Berkeley's Pleiades architecture combines custom functional units in a reconfigurable network with a conventional processor for configuration management and operation sequencing [8]. Mixing custom hardware and a temporal processor on ASICs is moderately common these days. Since reconfigurable architectures offer complementary characteristics, there are advantages to adding this class of architectures to the mix, as well.
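One simple way to picture how subtasks could be divided across such a mixed device is a greedy hardware/software partitioning pass. Everything below (the subtask names, the area and speedup numbers, and the speedup-per-area policy) is invented for illustration; it is not an algorithm from this paper or from any of the cited systems.

    #include <stdio.h>

    /* Toy partitioning for a heterogeneous device: each subtask has an
     * (invented) area cost on the reconfigurable array and a speedup over
     * running on the processor.  Greedily keep the subtasks with the best
     * speedup per unit of array area until the array capacity is exhausted;
     * everything else falls back to the processor. */
    struct subtask { const char *name; int array_area; double speedup; };

    int main(void) {
        struct subtask tasks[] = {
            { "filter",     40, 25.0 },
            { "motion-est", 80, 12.0 },
            { "control",    10,  1.2 },
            { "parsing",    30,  2.0 },
        };
        int capacity = 100;                 /* available array area (arbitrary units) */
        int n = sizeof tasks / sizeof tasks[0];

        /* Sort by speedup density (speedup per unit array area), descending. */
        for (int i = 0; i < n; i++)
            for (int j = i + 1; j < n; j++)
                if (tasks[j].speedup / tasks[j].array_area >
                    tasks[i].speedup / tasks[i].array_area) {
                    struct subtask t = tasks[i]; tasks[i] = tasks[j]; tasks[j] = t;
                }

        for (int i = 0; i < n; i++) {
            if (tasks[i].array_area <= capacity) {
                capacity -= tasks[i].array_area;
                printf("%-10s -> reconfigurable array\n", tasks[i].name);
            } else {
                printf("%-10s -> processor\n", tasks[i].name);
            }
        }
        return 0;
    }

Real partitioners use far richer cost models, but the sketch captures the basic tradeoff: spend scarce spatial capacity where it buys the most acceleration, and let the dense temporal processor absorb the rest.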
6 Demand for Post-Fabrication Programmability

Two major trends demand increased post-fabrication programmability:

1. greater single-chip capacity
2. shrinking product lifetimes and short time-to-market windows

Single-Chip Systems: Large available single-chip silicon capacity drives us towards greater integration, yielding System-on-a-Chip designs. This integration makes sense to reduce system production and component costs. At the same time, however, system designers lose the traditional ability to add value and differentiate their systems by post-fabrication selection of components and integration. As a result, monolithic System-on-a-Chip designs will require some level of post-fabrication customization to make up for "configuration" which was traditionally done at the board composition level.

Further, the larger device capacity now makes it feasible to implement a greater variety of tasks in a programmable media. That is, many tasks, such as video processing, which traditionally required custom hardware to meet their demands can now be supported on single-chip, post-fabrication media. The density benefit of reconfigurable architectures helps expand this capacity and hence the range of problems for which post-fabrication solutions are viable.

Time-to-Market and Product Lifetimes: Post-fabrication customizable parts allow system designers to get new ideas into the market faster. They eliminate the custom silicon design time, fabrication time, and manufacturing verification time. The deferred operation binding time also reduces the risk inherent in custom design; the customization is now in software and can be upgraded late in the product development life cycle. In fact, the late binding time leaves open the possibility of "firmware" upgrades once the product is already in the customer's hands. As markets and standards evolve, the behavior and feature set needed to maintain a competitive advantage over competitors changes. With post-fabrication devices, much of this adaptation can be done continually, decoupling product evolution to please market requirements from silicon design spins.
[Figure 8: Yielded Efficiency across Task–Architecture Mismatches. Two surface plots of efficiency (0–1.0) against the task's design datapath width (1–128) and its cycle-by-cycle operation variation (1–1024): one for an FPGA-like architecture (w = 1, c = 1) and one for a processor-like architecture (w = 64, c = 1024).]

7 CAD Opportunities

Reconfigurable architectures allow us to achieve high performance and high computational density while retaining the benefits of a post-fabrication programmable architecture. To fully exploit this opportunity, we need to automate the discovery and mapping process for reconfigurable and heterogeneous devices.

Since reconfigurable designs are characteristically spatial, traditional CAD techniques for logical and physical design (e.g. logic optimization, retiming, partitioning, placement, and routing) are essential to reconfigurable design mapping. Three things differ:

1. Some problems take on greater importance–e.g. since programmable interconnect involves programmable switches, the delay for distant interconnect is a larger contribution to path delay than in custom designs; also, since latencies are larger than in custom designs and registers are relatively cheap, there is a greater benefit for pipelining and retiming.

2. Fixed resource constraints and fixed resource ratios–in custom designs, the goal is typically to minimize area usage; with a programmable structure, wires and gates have been pre-allocated, so the goal is to fit the design into the available resources.

3. Increased demand for short tool runtimes–while hardware CAD tools which run for hours or days are often acceptable, software tools more typically run for seconds or minutes. As these devices are increasingly used to develop, test, and tune new ideas, long tool turn-around is less acceptable. This motivates a better understanding of the tool run-time versus design quality tradeoff space.

The raw density advantage of reconfigurable components results from the elimination of overhead area for local instructions. This comes at the cost of making it relatively expensive to change the instructions. To take advantage of this density, we need to:

1. Discover regularity in the problem–the regularity allows us to reuse the computational definition for a large number of cycles to amortize out any overhead time for instruction reconfiguration.

2. Transform the problem to expose greater commonality–transformations which create regularity will increase our opportunities to exploit this advantage.

3. Schedule to exploit regularity–when we exploit the opportunity to reuse the substrate in time to perform different computations, we want to schedule the operations to maximize the use of each configuration.

As noted in Section 3, a programmable computation needs only support its instantaneous processing requirements. This gives us the opportunity to highly specialize the implemented computation to the current processing needs, reducing resource requirements, execution time, and power. To exploit this class of optimization, we need to:

1. Discover binding times–early-bound and slowly changing data become candidates for data to be specialized into the computation.

2. Specialize implementations–fold this early-bound data into the computation to minimize processing requirements.

3. Fast, online algorithms to exploit run-time specialization–for data bound at runtime, this specialization needs to occur efficiently during execution; this creates a new premium for lightweight optimization algorithms.

The fixed capacity of pre-fabricated devices requires that we map our arbitrarily large problem down to a fixed resource set. When our device's physical resources are less than the problem requirements, we need to exploit the device's capacity for temporal reuse. We can accomplish this "fit" using a mix of several techniques:

1. Temporal/spatial assignment–on heterogeneous devices with both processor and reconfigurable resources, we can use techniques like hardware-software partitioning to utilize the available array capacity to best accelerate the application, while falling back on the density of the temporal processor to fit the entire design onto the device.

2. Area-time tradeoff–most compute tasks do not have a single spatial implementation, but rather a whole range of area-time implementations. These can be exploited to get the best performance out of a fixed resource capacity.

3. Time-slice schedule–since the reconfigurable resources can be reused, for many applications we can break the task into spatial slices and process these serially on the array; this effectively virtualizes the physical resources much like physical memory and other limited resources are virtualized in modern processing systems (a minimal sketch of this slice-by-slice execution appears at the end of this section).

In general, to handle compute tasks with dynamic processing requirements, we need to perform run-time resource management, making a run-time or operating system an integral part of the computational substrate. This, too, motivates fast online algorithms for scheduling, placement, and, perhaps, routing, to keep scheduler overhead costs reasonably small.
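As a concrete illustration of the time-slice scheduling item above, the following C sketch serially executes a task that has been broken into spatial slices, reloading the array configuration before each slice runs. The slice structure and the reconfigure/run calls are hypothetical stand-ins, not an API of any real device or tool.

    #include <stdio.h>

    /* One spatial slice of a larger task: a configuration plus the work it
     * performs while that configuration is loaded.  Types are hypothetical. */
    struct slice {
        const char *config_name;        /* stands in for a configuration bitstream */
        void (*run)(void);              /* compute performed while configured      */
    };

    static void load_configuration(const char *name) {
        /* On real hardware this is the expensive step (milliseconds on
         * conventional FPGAs, microseconds on some experimental arrays), so a
         * schedule should maximize the work done per configuration.  Here it
         * is only modeled as a message. */
        printf("reconfigure array: %s\n", name);
    }

    static void slice_a(void) { printf("  run slice A\n"); }
    static void slice_b(void) { printf("  run slice B\n"); }
    static void slice_c(void) { printf("  run slice C\n"); }

    int main(void) {
        /* Task broken into spatial slices executed serially on one array,
         * virtualizing the physical resources much as virtual memory does. */
        struct slice schedule[] = {
            { "cfg_A", slice_a }, { "cfg_B", slice_b }, { "cfg_C", slice_c },
        };
        for (unsigned i = 0; i < sizeof schedule / sizeof schedule[0]; i++) {
            load_configuration(schedule[i].config_name);
            schedule[i].run();
        }
        return 0;
    }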
8 Summary

Reconfigurable computing architectures complement our existing alternatives of temporal processors and spatial custom hardware. They offer increased performance and density over processors while remaining post-fabrication configurable. As such, they are an important new alternative and building block for all kinds of computational systems.

Acknowledgements

The Berkeley Reconfigurable Architectures, Software, and Systems effort is supported by the Defense Advanced Research Projects Agency under contract numbers F30602-94-C-0252 and DABT63-C-0048.

References

[1] Duncan Buell, Jeffrey Arnold, and Walter Kleinfelder. Splash 2: FPGAs in a Custom Computing Machine. IEEE Computer Society Press, 10662 Los Vasqueros Circle, PO Box 3014, Los Alamitos, CA 90720-1264, 1996.

[2] Kenneth David Chapman. Fast Integer Multipliers fit in FPGAs. EDN, 39(10):80, May 12 1993. Anonymous FTP www.ednmag.com:EDN/di_sig/DI1223Z.ZIP.

[3] André DeHon. Reconfigurable Architectures for General-Purpose Computing. AI Technical Report 1586, MIT Artificial Intelligence Laboratory, 545 Technology Sq., Cambridge, MA 02139, October 1996. <ftp://publications.ai.mit.edu/ai-publications/1500-1999/AITR-1586.ps.Z>.

[4] André DeHon. Comparing Computing Machines. In Configurable Computing: Technology and Applications, volume 3526 of Proceedings of SPIE. SPIE, November 1998. <http://www.cs.berkeley.edu/projects/brass/documents/ccmpare_spie98.ps.gz>.

[5] John R. Hauser and John Wawrzynek. Garp: A MIPS Processor with a Reconfigurable Coprocessor. In Proceedings of the IEEE Symposium on Field-Programmable Gate Arrays for Custom Computing Machines, pages 12–21. IEEE, April 1997. <http://ww.cs.berkeley.edu/projects/brass/documents/GarpProcessors.html>.

[6] Daniel J. Magenheimer, Liz Peters, Karl Pettis, and Dan Zuras. Integer Multiplication and Division on the HP Precision Architecture. In Proceedings of the Second International Conference on Architectural Support for Programming Languages and Operating Systems, pages 90–99. IEEE, 1987.

[7] A. Peleg, S. Wilkie, and U. Weiser. Intel MMX for Multimedia PCs. Communications of the ACM, 40(1):24–38, January 1997.

[8] Jan Rabaey. Reconfigurable Computing: The Solution to Low Power Programmable DSP. In Proceedings of the 1997 IEEE International Conference on Acoustics, Speech, and Signal Processing, April 1997.

[9] Charlé Rupp, Mark Landguth, Tim Garverick, Edson Gomersall, Harry Holt, Jeffrey Arnold, and Maya Gokhale. The NAPA Adaptive Processing Architecture. In Proceedings of the IEEE Symposium on FPGAs for Custom Computing Machines, pages 28–37, April 1998.

[10] Lawrence Snyder. An Inquiry into the Benefits of Multigauge Parallel Computation. In Proceedings of the 1985 International Conference on Parallel Processing, pages 488–492. IEEE, August 1985.

[11] Jean E. Vuillemin, Patrice Bertin, Didier Roncin, Mark Shand, Hervé Touati, and Philippe Boucard. Programmable Active Memories: Reconfigurable Systems Come of Age. IEEE Transactions on VLSI Systems, 4(1):56–69, March 1996. Anonymous FTP pam.devinci.fr:pub/doc/To-Be-Published/PAMieee.ps.Z.
