Fpga Orca
AT&T Microelectronics
1247 S. Cedar Crest Blvd.
Allentown, PA 18103-6209
Most digital systems require some form of memory, ranging from small registers for state machine designs, to fast FIFOs for communication circuits, to large RAMs for general data storage. To accommodate these different requirements, the logic units of some SRAM-based FPGAs can be converted into distributed asynchronous memories and combined appropriately to satisfy various memory requirements. With the increasing clock speed requirements of applications such as ATM, the asynchronous-memory approach becomes more difficult to use because of all the address, data and control setup/hold time requirements.
Fig. 3. PFU Synchronous 16x4 Memory
PFUs are enhanced in the 'A' version to support 16x4 synchronous memory operations, as shown in Figure 3. All the incoming address, data and write enable signals are now [...] become transparent for the next half clock cycle to sample the incoming signals. This design simplifies memory timing requirements: all setup/hold times are now referred to one clock edge. The output data can be latched by the [...]
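As an illustration of the single-edge timing described above, the following is a minimal behavioral sketch of a 16x4 synchronous memory. It is not the PFU circuit: the class name `SyncMemory16x4`, the method `clock_edge`, and the choice to return the newly written word on a write cycle are assumptions made for this example only, and the half-cycle latch transparency is not modeled.

```python
class SyncMemory16x4:
    """Behavioral sketch of a 16-word by 4-bit synchronous memory.

    Hypothetical model for illustration; names and read-during-write
    behavior are assumptions, not taken from the ORCA data sheet.
    """

    DEPTH = 16  # number of words
    WIDTH = 4   # bits per word

    def __init__(self):
        # All words start at zero in this sketch.
        self.words = [0] * self.DEPTH

    def clock_edge(self, address, data_in, write_enable):
        """Sample address, data and write enable together on one clock edge.

        Writes data_in to the addressed word when write_enable is true,
        then returns the word stored at that address.
        """
        if not 0 <= address < self.DEPTH:
            raise ValueError("address must be in 0..15")
        if not 0 <= data_in < (1 << self.WIDTH):
            raise ValueError("data_in must fit in 4 bits")
        if write_enable:
            self.words[address] = data_in
        return self.words[address]
```

Because every input is sampled in the same call, the model has a single timing reference, which mirrors the point that all setup/hold times are referred to one clock edge.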
Fig. 4. PFU Synchronous 16x2 Dual-Port Memory
[...] this mode, the 16x4 single-port memory is divided into two banks. The write address and data buses of the two banks are connected together so that any arriving data is written into both banks. The same address bus is also connected to the read address of the first bank, as in the single-port mode. However, another address bus is connected to the read address of the second bank so that it can be accessed independently. As a result, each PFU can be constructed as a 16x2 dual-port memory, with one port for both reading and writing and the other port for reading only.
The synchronous memory and the dual-port memory features make each PFU memory more flexible and easier to implement. The port enable function described next makes it easier to create larger memories from the individual PFU memories.
2.4 Port Enable
When a memory deeper than 16 words is required, multiple PFU memories can be grouped together to form it [Xili95]. For example, when a 64x4 memory is needed, four 16x4 PFU memories are needed, as shown in Figure 5a (to simplify the figure, only the read enable and write strobe signals of the PFU memories are illustrated). The upper two bits of address are decoded externally to control the read enable inputs, which in turn control the read-data tristate buffers of each PFU. The same decoded signals are logic-ANDed with the write enable line to turn on one of the PFU write strobes for writing. Since each of these write strobes requires one AND gate (one look-up table), PFUs are consumed for this function.
In the 'A' version, a port enable pin is added to absorb this AND gate into the PFU memory to simplify the external decoding requirement. Figure 5b shows the simplified decoding with the port enable pin. The port enable and write enable inputs are internally ANDed together to replace the original write strobe function. This enhancement also speeds up the write enable path and reduces the number of routing nodes required.
[Fig. 5. Signal labels: Upper 2-Bit Address, Read Enable, Write Strobe, Write Enable, PFU Memory]
2.5 5V Tolerant I/O Buffers
Since not all the components in a typical digital design are readily available in 3.3V versions yet, one of the design objectives for the 3.3V 'A' version was to make sure that the I/O buffers can communicate with 5V devices. To achieve this, additional 5V power supply pins are required to ensure proper biasing of the transistors in the proximity of the I/O pads. All other programmable features such as selection for input levels, input speed, input float value, output drive, output speed, output sense and 3-state sense are maintained. Care was taken to ensure that the I/O characteristics for both the 5V and 3.3V devices remain compatible with the PCI specification [ATT95b].
3. Process Technology and Result
Until now, not many 3.3V SRAM-based FPGAs have been available. Those that are available are usually designed with the same process technology as the 5V devices. However, this compromises the performance of the 3.3V devices, as shown in Table 1.
5V 0.35 µm    5V 0.35 µm operated at 3.3V    3.3V 0.35 µm
75 ps         106 ps                         59 ps
[...] counterpart, as shown in Table 1. However, this speed advantage does not necessarily apply in FPGA designs. In fact, it was found that the 3.3V 'A' version is only slightly faster than its 5V counterpart on average. The main reason is that the driving capability of the N-channel devices (used as the MUXes and programmable switches) is greatly reduced by the body effect under 3.3V operating conditions. Table 3 shows some PFU and routing timing numbers for the four different processes. The new 2CA series (0.35µ/5V) is 18-31% faster than the 2C series (0.5µ/5V) and the 2TA series (0.35µ/3.3V) is 29-47% faster than the 2T series (0.5µ/3.3V). More importantly, the new 2TA 3.3V series is as fast as the 2CA 5V series, so that the same application can achieve equal or better speed in 3.3V FPGAs than in 5V FPGAs, with greatly reduced power consumption.
LUT4 Delay (ns)    1.9    1.1    1.0
                   1.4    0.9    0.9
Table 3. Worst Case Timing Numbers of the different ORCA Series
The above speed improvement is the result of circuit optimization in the advanced 0.35µ processes. Table 4 illustrates how the architecture innovation in the 'A' version improves performance when implementing an asynchronous parallel 8x8 multiplier and a 128x8 synchronous RAM. The first column shows the delay numbers of these two circuits in the current 2C/0.5µ technology. If the 'A' version architecture were implemented in the same 2C/0.5µ technology, these two circuits would be 32-62% faster, as shown in the second column. With the advanced 0.35µ technology, the 'A' architecture becomes even faster.
                                       2C in     If 2CA in    2CA in
                                       0.5 µm    0.5 µm       0.35 µm
Async. 8x8 Parallel Multiplier (ns)    75        51           40
128x8 Synchronous RAM (a) (ns)         50        19           16
Table 4. Worst Case Timing Numbers of the different ORCA Series
a. Clock period; reading and writing are possible in each clock cycle.
All timing numbers shown in this paper are based on worst-case operating conditions. Other benchmark and performance results will be available during the conference. Figure 6 shows a chip microphotograph of the 15K-usable-gate device in 0.35µ process technology.
Fig. 6. Chip Microphotograph of the 15K-usable Gates Array in 0.35-Micron Process Technology
4. Conclusions
In this paper, we have described the new features and advantages of the AT&T ORCA 2CA/2TA 0.35µ 5V/3.3V series of FPGAs. By optimizing the 2C architecture in the advanced 0.35µ processes with enhanced functionality, the new high-capacity ORCA series is now capable of implementing a yet wider range of very high performance applications in both 5V and 3.3V.
Acknowledgments
The authors would like to thank C.T. Chen for his leadership and technical advice throughout the design, J. Hoff for his contribution in writing this paper and all the test, layout, and product engineers for their excellent support.
References
[ATT95a] AT&T Field-Programmable Gate Arrays Data Book, April 1995.
[ATT95b] AT&T Preliminary Data Sheet, ORCA 2T15, Sept. 1995.
[BRIT94] Barry K. Britton, Yaw T. Oh, William Oswald, Ho T. Nguyen, Satwant Singh, Chong Lee, Wai-Bor Leung, Carolyn Spivak, Jim Steward and C. T. Chen, "Second Generation ORCA Architecture Utilizing 0.5µ Process Enhances the Speed and Usable Gate Capacity of FPGAs", IEEE International ASIC Conference and Exhibit, Sept. 1994, pp. 474-478.
[Xili95] The Programmable Logic Data Book, Xilinx Co., San Jose, CA, 1995.