Plasma: An FPGA For Million Gate Systems: 1. Abstract
2.1 Teramac
Teramac (Figure 1) is a test vehicle for computer architects
at Hewlett-Packard’s Research Laboratories. An engineer can
quickly synthesize a custom architecture for a specialized problem
and test the design at high speed using Teramac. The name
derives from tera (10^12) and “Multiple Architecture Computer.”
With this tool the architect can create a machine with a million
boolean functions of two variables being simultaneously evaluated
at one megahertz: a trillion very small operations per second.
[Figure 3: 1/2 PALE Logic. Schematic labels: LUT, MUX, L, Q.]
[Chart: PALEs as % Total Chip Area; series: 1 Output, 2 Outputs, 3 Outputs; vertical axis from 0% to 180%.]
4.2 Interconnect
Plasma contains 256 PALEs, organized into sixteen groups,
called hextants, of sixteen PALEs each. Eight PALEs bound
each side of the hextant, and eight hextants bound each side of a
central crossbar (Figure 4). These two levels of crossbars allow
(1) interconnections of PALEs within the same set;
(2) interconnections of PALEs in different sets; and
(3) interconnections of PALEs and I/O signal pins.
I/O signal pins are connected to the top and bottom of the
central crossbar.
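The two-level crossbar hierarchy described above can be sketched as a small model. This is an illustration only: the PALE numbering (0 to 255, with sixteen consecutive PALEs per hextant), the `IO` marker, and the function names are assumptions made for the example and do not come from the chip documentation. The sketch also assumes that a connection inside one hextant uses only that hextant's crossbar.

```python
PALES_PER_HEXTANT = 16   # from the text: sixteen PALEs per hextant
NUM_HEXTANTS = 16        # from the text: sixteen hextants per chip
IO = "io"                # hypothetical marker for an I/O signal pin


def hextant_of(pale):
    """Map a PALE index (hypothetical 0..255 numbering) to its hextant."""
    if not 0 <= pale < PALES_PER_HEXTANT * NUM_HEXTANTS:
        raise ValueError(f"PALE index out of range: {pale}")
    return pale // PALES_PER_HEXTANT


def crossbars_traversed(src, dst):
    """List the crossbars a connection passes through, in order.

    src and dst are PALE indices or the IO marker.
    """
    if src == IO and dst == IO:
        raise ValueError("I/O-to-I/O paths are not described in the text")
    path = []
    if src != IO:
        path.append(f"hextant {hextant_of(src)}")
    # The central crossbar is needed for I/O pins and for PALEs in
    # different hextants; same-hextant connections stay local.
    if src == IO or dst == IO or hextant_of(src) != hextant_of(dst):
        path.append("central")
        if dst != IO:
            path.append(f"hextant {hextant_of(dst)}")
    return path


print(crossbars_traversed(0, 15))    # same hextant
print(crossbars_traversed(0, 16))    # different hextants
print(crossbars_traversed(255, IO))  # PALE to I/O pin
```

The three calls correspond to the three kinds of interconnection listed in the text.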
[Figure 4: Plasma Block Diagram. Labels: Signal Pins.]
[Figure 5: Plasma Register File. Labels: Register File (64 bits, 64 bits); 8 Read Ports, each with a 6-bit Address and a 1-bit Data-out.]
The number of configuration memory bits could be greatly
reduced from the 1000 per PALE that Plasma uses. Each of
the PALE inputs connects to only a single line of the 100 in the
hextant crossbar at any one time, but 100 bits of configuration
memory are used to express this connection. Obviously, only
seven bits are required, a 93% reduction. But using only seven
configuration bits would require a 100-to-1 multiplexer to select the
correct line. Such large multiplexers are quite slow and require
significant chip area. The number of transistors used would not
decrease.
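The bit counts in this paragraph can be checked with a few lines of arithmetic. The sketch below only reproduces the numbers quoted in the text; the function names are illustrative, and it says nothing about how the configuration memory is actually laid out on the chip.

```python
LINES_PER_INPUT = 100  # from the text: each PALE input selects one of 100 lines


def one_hot_bits(lines):
    """One configuration bit per selectable line (the scheme Plasma uses)."""
    return lines


def encoded_bits(lines):
    """Minimum bits needed to name one line out of `lines` in binary."""
    return (lines - 1).bit_length()


one_hot = one_hot_bits(LINES_PER_INPUT)
encoded = encoded_bits(LINES_PER_INPUT)
reduction = 1 - encoded / one_hot

print(f"one-hot: {one_hot} bits, encoded: {encoded} bits, "
      f"reduction: {reduction:.0%}")
```

The encoded form saves memory bits only; as the paragraph notes, the 100-to-1 multiplexer needed to decode it would offset the saving in transistors.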
4.4 Register Files
Custom computing designs frequently need large numbers of
registers. Building these from scratch using gates (viz. LUTs) is
expensive. The large decoders used to select the register accessed
by each port require a minimum of one output per register per
port. Building a 64-deep register file with two read ports and one
write port would require 256 LUTs to drive the row lines. A
64-wide OR gate is required for each bit (column) of each read
port. The chip does not support wire-OR or tri-state logic. Using
the six-input PALEs of Plasma, a 64-deep, 32-bit wide register
file requires approximately 1200 PALEs.

Plasma takes a novel approach to reducing this count. Because
the PALE LUTs are large trees of pass transistors, they can be
connected “backwards” to build large decoders. The six-input
PALE conveniently converts into a 6-to-64 decoder. By taking a
slice of PALEs from each of eight hextants, a register file
primitive is constructed (Figure 5). Each Plasma contains four
2-bit wide, 64-deep, 16-port register files [8]. Eight ports are
configured as read ports and eight as write ports. A write port has
both an address and a write enable; a read port has only an address.

4.4.1 Average gate count
For designs using this feature, extremely high gate efficiency
can be achieved. One test design, a sort engine for 64 elements,
achieved an equivalent gate count of 60,000 for a single Plasma,
nearly 240 gates per PALE. A more typical number for random
logic is eight to ten gates per PALE, approximately 2000 gates per
chip. These numbers are achieved by 100% automatic place and
route.

4.5 Pinouts
A CCM requires greater connectivity than is provided by
commercially available FPGAs. Plasma’s 336 signal I/O pins are
roughly double the number that can be obtained off the shelf. In
addition, placement and routing within Plasma is insensitive to
signal pin assignments because all I/O pins have equal
connectivity into the chip’s central crossbar.

4.6 Scan
The Plasma chip is designed to make the debugging of a
user’s custom design easy and natural. All internal state is
available on a user scan chain. This includes all flip-flops and
register files as well as all the I/O pins of the chip. Both the D
and the Q of every flip-flop are simultaneously observable. The
outputs and inputs of all LUTs are observable, and the user
interface recreates signals that exist on the original schematic but
were subsumed into lookup tables.

The clock is stopped before examining the scan chain, and
the user may change the state of the machine before resuming
clocking. The Plasma architecture allows the user full peek and
poke capability while debugging. This scan capability uses
different commands to the chip than the read and write
configuration commands, allowing fast debugging.

4.7 Programming
The chip has a second scan chain used for configuring. The
scan chains are separated for fast reading and writing of the user
state, allowing simple algorithms for examining and setting the
state of all user-visible signals in the chip to provide excellent
debug capability.

The configuration scan chain writes and reads the underlying
configuration bits. The read allows verification that a
configuration was received correctly. Configuration requires first
halting the clocks and placing the chip in configure mode. A
separate high-frequency clock is used for configuration. A
complete configuration consists of almost 400,000 bits. The
configuration bits are stored in a large SRAM structure underlying
the crossbars and PALEs. Each row may be read and written
independently, allowing minor modifications to a design without
completely downloading a new configuration.

Plasma uses a parent-child scheme on the scan chain signals.
A header record tells the chip whether this configuration data
was intended for it. If not, the remaining data is passed to the
child pin. This allows the chips to be connected serially for
system configuration, reducing the signals that must be connected
to the master system. A broadcast command can address all chips
on a given chain.

4.8 Simultaneous Switching Outputs (SSO)
FPGAs used for custom computing have a high probability
of simultaneous switching problems. A user may design a massive
data path with most of the signals switching from 1 to 0 (or
0 to 1) at the same clock edge. The resulting collapse of the
internal power supply can cause configuration bits to change,
leaving a random design in the chip. Not only would the resulting
design be incorrect, but the chip outputs could short, damaging
it.

We took significant precautions to reduce the danger of
simultaneous switching problems. Great care was taken in the
design of the buffer arrays: central crossbar buffers, peripheral
crossbar buffers, and PALE buffers. The buffers are designed
with small devices to keep switching currents small.
Break-before-make design minimizes crossover currents in the
buffers during logic state switches. Peripheral crossbar buffers are
inverting, so alternate buffers in a path switch to the opposite
state. Logic states are positive true in the peripheral crossbars and
negative true in the central crossbar.

The power busing is designed to meet the instantaneous
current requirements when switching all flip-flops using the lowest
resistivity metal layer. Eight large VDD/GND connections, each
with three bonding wires, allow better current carrying capability
and lower inductance. Dirty VDD/GND power pins separately
supply all the off-chip pad drivers. Plasma is unusual because
the on-chip core SSO is much more severe than I/O pad SSO.
Wide internal VDD/GND power rings between the core logic
and the pads provide power connections from core to power pads.
Power buses are routed directly over high current buffers to
minimize voltage drop.

To keep the configuration bits from changing logic states,
the configuration cell bit lines are precharged high at all times
except when accessing. This is done to take advantage of the
difficulty of writing a “1” into the cells. Many conventional
designs precharge the bit lines only immediately before accessing
the cells.

To prevent all 512 flip-flops from switching at the same
clock edge, the logic clocks are deliberately skewed up to seven
nanoseconds by the clock distribution tree. The inner eight sets
of hextants are skewed to clock later than the outermost eight
hextants to deal with the worst case hold time as seen from
outside the Plasma chip. All hold times are kept negative because
the best case buffer switching times are much less than the clock
skew. To account for skewing, additional time is added to the
non-overlap time of the clock. SPICE shows that all hold times
can be met. Because the Plasma compiler computes the clock
period required by each individual design, the skewing of clocks
in the chip causes no setup time problems.

A potential problem arises with “Global Drive Enable,” the
signal that enables all buffers out of tristate after a configuration
has been loaded. When all buffers on the chip are enabled at
once, a significant current surge can occur. Our solution created
five individual “Global Drive Enable” signals, each separated by
one hundred nanoseconds, routed so that four tied to one hextant
in each quadrant and the fifth enabled the pad drivers. The
timing is controlled by the chip controller.
Figure 6: Plasma Chip Photograph
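The staggered “Global Drive Enable” scheme of Section 4.8 can be illustrated with a small schedule generator. This is a sketch under stated assumptions, not the chip controller's actual logic: the signal names (`GDE0` to `GDE4`), the hextant numbering (quadrant q holding hextants 4q to 4q+3), and the firing order with the pad drivers last are all hypothetical. Only the count of five signals, the 100 ns separation, and the one-hextant-per-quadrant routing come from the text.

```python
STAGGER_NS = 100           # from the text: signals separated by 100 ns
QUADRANTS = 4              # assumption: 16 hextants form 4 quadrants
HEXTANTS_PER_QUADRANT = 4  # assumption: 4 hextants in each quadrant


def enable_schedule():
    """Return (time_ns, signal_name, targets) tuples in firing order."""
    schedule = []
    # Four signals, each enabling one hextant in every quadrant.
    for k in range(HEXTANTS_PER_QUADRANT):
        targets = [q * HEXTANTS_PER_QUADRANT + k for q in range(QUADRANTS)]
        schedule.append((k * STAGGER_NS, f"GDE{k}", targets))
    # Fifth signal enables the pad drivers (assumed here to fire last).
    schedule.append(
        (HEXTANTS_PER_QUADRANT * STAGGER_NS, "GDE4", ["pad drivers"])
    )
    return schedule


for time_ns, name, targets in enable_schedule():
    print(f"{time_ns:4d} ns  {name}  ->  {targets}")
```

Under these assumptions, at most four of the sixteen hextants come out of tristate at any one step, and the last group is enabled 400 ns after the first.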
References
[1] R. Amerson, R. Carter, W. Culbertson, P. Kuekes, G. Snider,
"Teramac -- Configurable Custom Computing," Proceedings of the
1995 IEEE Symposium on FPGA's for Custom Computing Machines.
[2] Azam Barkatullah, Wern-Yan Koe, Harish Nayak, Nazar Zaidi,
"Pre-Silicon Validation of Pentium CPU," 1993 Hot Chips
Symposium.
[3] J. Hadley, B. Hutchings, "Design Methodologies for Partially
Reconfigured Systems," Proceedings of the 1995 IEEE Symposium
on FPGA's for Custom Computing Machines.
[4] B. Landman, R. Russo, "On a Pin vs. Block Relationship for
Partitions of Logic Graphs," IEEE Transactions on Computers,
December 1971.
[5] R. Amerson, P. Kuekes, "The Design of an Extremely Large
MCM-C -- A Case Study," International Journal of Microcircuits
and Electronic Packaging, Vol. 17, No. 4.
[6] Jonathan Rose, Robert J. Francis, David Lewis, Paul Chow,
"Architecture of Field Programmable Gate Arrays: The Effect of
Logic Block Functionality on Area Efficiency," IEEE Journal of
Solid-State Circuits, Vol. 25, No. 5, October 1990.
[7] Dwight Hill, Nam-Sung Woo, "The Benefits of Flexibility in
Lookup Table-Based FPGA's," IEEE Transactions on