Fpga Orca

This document describes a new generation of ORCA FPGA products from AT&T with enhanced features and performance. The new ORCA 2CA and 2TA series are based on the previous ORCA 2C/T architecture but implemented using newer 0.35 micron CMOS processes optimized for 3.3V and 5V operations. New features of the ORCA 2CA/2TA include an efficient 4x1 parallel multiplier, synchronous single-port and dual-port memory support, and 5V-tolerant I/O buffers on the 3.3V devices. These enhancements provide significantly smaller size and 15-30% faster performance compared to the previous 0.5 micron ORCA 2C/T series

A New Generation of ORCA FPGA with Enhanced Features and Performance

T. Ngai, S. Singh, B. K. Britton, W-B. Leung, H. Nguyen,
G. P. Powell, R. Albu, W. B. Andrews, J. He, C. W. Spivak

AT&T Microelectronics
1247 S. Cedar Crest Blvd.
Allentown, PA 18103-6209

IEEE 1996 Custom Integrated Circuits Conference
0-7803-3177-6 $5.00 © 1996 IEEE

Abstract

This paper describes the new AT&T Optimized Reconfigurable Cell Array (ORCA) 2CA and 2TA series of Field-Programmable Gate Arrays (FPGAs). Both series are based on the ORCA 2C/T architecture, but migrated to the advanced AT&T 0.35 µm CMOS processes. These two processes are individually optimized for 5V and 3.3V operations. In addition, architectural innovations are incorporated into the new series to enhance both performance and functionality. These include:

- Efficient support of parallel multiplier
- Efficient support of synchronous single-port and dual-port memories
- Streamlined creation of large memory using port-enable control and internal tristate buffers
- Easy system integration with the 5V-tolerant input/output buffers, found on the 3.3V 2TA series.

As a result, the ORCA 2CA/2TA arrays are significantly smaller and 15%/30% faster (PFU speed) than the corresponding 0.5 µm 2C/T arrays.

1. Introduction

The speed and gate density of FPGAs have grown rapidly for the past few years, but the demands of today's FPGA applications have evolved as well. On top of demanding faster, cheaper and higher density FPGAs, new requirements are posed by today's applications, such as better memory support (synchronous memories and multi-port memories such as register files), more arithmetical functions such as multipliers, and a fast 3.3V FPGA series for low-power applications. In order to fulfill these requirements, AT&T has re-engineered the second generation ORCA 2C/T series [ATT95a] [BRIT94] with new functionality, then individually optimized this architecture for two advanced processes: 0.35 µm/3.3V and 0.35 µm/5V CMOS. The resulting new series, called ORCA 2CA and 2TA, are 100% bitstream upward-compatible from the ORCA 2C/T series, even with the added functionality.

In this paper we present the details of implementing the ORCA 2CA/2TA series. In the next section we describe the new features for the new series. Section 3 describes the 0.35 µm/3.3V and 0.35 µm/5V processes used to implement the ORCA 2CA/2TA series and the advantages of these processes versus the 0.5 µm process used for the original ORCA 2C/T series.

2. New Features

In this section we describe the new features of the ORCA 2CA/2TA series. They include a 4x1 multiplier, synchronous single- and dual-port memories, memory write-enable/port-enable control, and the 5V-tolerant input/output buffers for the 3.3V devices. These new 2CA/2TA series are referred to as the 'A' version for the rest of the paper.

2.1 Multiplier

Multipliers are very common components in DSP applications. Because of their complexity, any reasonably sized multiplier has tended in the past to consume a large number of FPGA logic cells and the resulting performance was usually poor.

[Fig. 1. PFU 4x1 Multiplier. Labels recovered from the figure: Cin, Pout3, Pout2, Pout1, Pout0.]

The 'A' version incorporates an innovative solution to implement fast parallel multipliers very efficiently. Figure 1 shows how the ORCA logic block, called a Programmable Functional Unit (PFU), can be used as a 4x1 multiplier. X0-3 and Y are the unsigned 4-bit and 1-bit data. Pin0-3 and Pout0-3 are the input and output partial products. Cin and Cout are the carry-in and carry-out signals. As shown in the diagram, the four AND gates and the 4-bit adder that are required for a 4x1 multiplier are found in one PFU, which translates to compactness and speed. Since the fast carry routing is used
for much of the critical path, this also translates to increased speed.

[Fig. 2. Cascading 4 PFUs to Form a 4x4 Multiplier. The figure marks the critical path.]

Each PFU multiplier can be easily stacked together to form deeper and/or wider multipliers. Figure 2 demonstrates how a 4x4 multiplier is implemented using four PFU multipliers. With the abundant 2CA/2TA routing resources, this multiplier can easily be packed into four neighboring PFUs. As a result, the critical path goes through only three segments of PFU to PFU direct routing resources, and the resulting 4x4 multiplier is fast. Speed results are summarized in Section 3. If an even higher throughput rate is required, this multiplier can be pipelined using the four PFU FFs that are available.

2.2 Distributed Synchronous Memory

Most digital systems require some form of memory. They range from small registers for state machine designs, to fast FIFOs for communication circuits, to large RAMs for general data storage. In order to accommodate these different requirements, the logic units of some SRAM-based FPGAs can be converted into distributed asynchronous memories and combined appropriately to satisfy various memory requirements. With the increasing clock speed requirements of applications such as ATM, the asynchronous-memory approach becomes more difficult to use because of all the address, data and control setup/hold time requirements.

PFUs are enhanced in the 'A' version to support 16x4 synchronous memory operations, as shown in Figure 3. All the incoming address, data and write enable signals are now latched by the clock while memory read operations remain asynchronous. Then, if the write enable is active, an internal write strobe is generated from the same clock edge to write the latched data into the latched address location. All latches become transparent for the next half clock cycle to sample the incoming signals. This design simplifies memory timing requirements: all setup/hold times are now referred to one clock edge. The output data can be latched by the configurable PFU FFs to make it easier to meet the timing requirement of the next stage.

In order to boost performance for applications that require simultaneous reading and writing, the read and write addresses are individually decoded so that a different read address can be supplied to read memories during the half clock cycle when the write address is latched and memories are written. This approach requires an external address multiplexer to multiplex the read/write addresses between the first and second half clock cycles, but it allows the memory to be written and read from different addresses during every clock cycle.

The next section describes a dual-port synchronous memory mode which can be simultaneously read and written using two clocks that are asynchronous to each other.

[Fig. 3. PFU Synchronous 16x4 Memory. Labels recovered from the figure: Write Enable, Address (4 bits), Data In (4 bits), Data Out (4 bits).]

2.3 Dual-Port Synchronous Memory

[Fig. 4. PFU Synchronous 16x2 Dual-Port Memory. Labels recovered from the figure: Write Enable, Data In (2 bits), Address (4 bits), Write Addr, Write Data, Read Addr, Read-Only Port Data Out (2 bits), Clock, and two PFU Memory (16x2) blocks.]

Figure 4 shows how a 16x2 dual-port memory can be constructed from the available 16x4 single-port memory. In

this mode, the 16x4 single-port memory is divided into two banks. The write address and data buses of the two banks are connected together so that any arriving data is written into both banks. The same address bus is also connected to the read address of the first bank, as in the single-port mode. However, another address bus is connected to the read address of the second bank so that it can be accessed independently. As a result, each PFU can be constructed as a 16x2 dual-port memory, with one port for both reading and writing and the other port for reading only.

The synchronous memory and the dual-port memory features make each PFU memory more flexible and easier to implement. The port enable function to be described next is used to make it easier to create larger memories from the individual PFU memories.

2.4 Port Enable

When a memory deeper than 16 words is required, multiple PFU memories can be grouped together to form it [Xili95]. For example, when a 64x4 memory is needed, four 16x4 PFU memories are needed, as shown in Figure 5a (to simplify the figure, only the read enable and write strobe signals of the PFU memories are illustrated). The upper two bits of address are decoded externally to control the read enable inputs, which in turn control the read-data tristate buffers of each PFU. The same decoded signals are logic-ANDed with the write enable line to turn on one of the PFU write strobes for writing. Since each of these write strobes requires one AND gate (one look-up table), PFUs are consumed for this function.

In the 'A' version, a port enable pin is added to absorb this AND gate into the PFU memory to simplify the external decoding requirement. Figure 5b shows the simplified decoding with the port enable pin. The port enable and write enable inputs are internally ANDed together to replace the original write strobe function. This enhancement also speeds up the write enable path and reduces the number of routing nodes required.

2.5 5V Tolerant I/O buffers

Since not all the components in a typical digital design are readily available in 3.3V versions yet, one of the design objectives for the 3.3V 'A' version was to make sure that the I/O buffers can communicate with 5V devices. To achieve this, additional 5V power supply pins are required to ensure proper biasing of the transistors in the proximity of the I/O pads. All other programmable features such as selection for input levels, input speed, input float value, output drive, output speed, output sense and 3-state sense are maintained. Care was taken to ensure that the I/O characteristics for both the 5V and 3.3V devices remain compatible with the PCI specification [ATT95b].

3. Process Technology and Result

Until now, not many 3.3V SRAM-based FPGAs have been available. The ones that are available are usually designed with the same process technology as the 5V devices. However, this compromises the performance of the 3.3V devices as shown in Table 1.

Table 1.
  5V 0.35 µm    5V 0.35 µm operated at 3.3 V    3.3V 0.35 µm
  75 ps         106 ps                          59 ps

Table 2.
                        3.3V/0.35 µm    5V/0.35 µm
  N-Channel L' (µm)     0.32            0.46
  Tox (Å)               65              115
  Metal 1 Pitch (µm)    1.04            1.04
  Metal 2 Pitch (µm)    1.20            1.20
  Metal 3 Pitch (µm)    1.36            1.36

[Fig. 5. Forming Deeper Memory With/Without Port Enable. (a) Decoding required to construct a 64x4 memory without port enable: the upper 2-bit address and Write Enable drive the Read Enable and Write Strobe inputs of four 16x4 PFU memories. (b) Decoding required to construct a 64x4 memory with port enable: the upper 2-bit address drives Read Enable and Port Enable, and Write Enable goes to each 16x4 PFU memory.]
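The decoding contrasted in Figure 5 can be illustrated with a small behavioral sketch. It is not taken from the paper: the class and method names (`PfuMemory16x4`, `Memory64x4`, `clock_edge`) are hypothetical, clock-phase timing is ignored, and it assumes that the same decoded upper-address signal drives both the read enable and the port enable of each PFU memory. What it does show is the 'A' version behavior described in Section 2.4, where the AND of write enable and port enable happens inside each PFU memory instead of in an external look-up table.

```python
class PfuMemory16x4:
    """Behavioral stand-in for one 16x4 PFU memory (hypothetical model)."""

    def __init__(self):
        self.cells = [0] * 16

    def clock_edge(self, addr, data_in, write_enable, port_enable):
        # 'A' version: write enable and port enable are ANDed inside the PFU,
        # replacing the externally generated write strobe.
        if write_enable and port_enable:
            self.cells[addr & 0xF] = data_in & 0xF

    def read(self, addr, read_enable):
        # None stands for a disabled (high-impedance) read-data tristate buffer.
        return self.cells[addr & 0xF] if read_enable else None


class Memory64x4:
    """Four 16x4 PFU memories selected by the upper two address bits."""

    def __init__(self):
        self.banks = [PfuMemory16x4() for _ in range(4)]

    @staticmethod
    def decode(addr):
        # External 2-to-4 decode of the upper two address bits.
        upper = (addr >> 4) & 0x3
        return [bank == upper for bank in range(4)]

    def clock_edge(self, addr, data_in, write_enable):
        # The shared write enable goes to every bank; only the decoded
        # port enable differs, so no external AND gates are modeled.
        for bank, select in zip(self.banks, self.decode(addr)):
            bank.clock_edge(addr, data_in, write_enable, port_enable=select)

    def read(self, addr):
        outputs = [bank.read(addr, read_enable=select)
                   for bank, select in zip(self.banks, self.decode(addr))]
        # Exactly one tristate buffer drives the shared read-data bus.
        driven = [value for value in outputs if value is not None]
        return driven[0]


memory = Memory64x4()
memory.clock_edge(addr=0x25, data_in=0xA, write_enable=True)   # lands in bank 2
memory.clock_edge(addr=0x05, data_in=0x3, write_enable=False)  # ignored
print(memory.read(0x25), memory.read(0x05))
```

In the Figure 5a arrangement, the `write_enable and select` term would be computed outside the memories, one look-up table per bank; moving it into `clock_edge` mirrors how the port enable pin absorbs that gate.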

counterpart, as shown in Table 1. However, this speed advantage does not necessarily apply in FPGA designs. In fact, it was found that the 3.3V 'A' version is only slightly faster than the 5V counterpart on the average. The main reason is that the driving capability of the N-channel devices (used as the MUXes and programmable switches) is greatly reduced by the body effect under 3.3V operating conditions. Table 3 shows some PFU and routing timing numbers for the four different processes. The new 2CA series (0.35 µm/5V) is 18-31% faster than the 2C series (0.5 µm/5V) and the 2TA series (0.35 µm/3.3V) is 29-47% faster than the 2T series (0.5 µm/3.3V). More importantly, the new 2TA 3.3V series is as fast as the 2CA 5V series, so that the same application can achieve equal or better speed in 3.3V FPGAs than in 5V FPGAs, with greatly reduced power consumption.

  LUT4 Delay (ns)    1.9    1.1    1.0
  (label lost)       1.4    0.9    0.9
Table 3. Worst Case Timing Numbers of the different ORCA Series (column headings and remaining entries not recovered)

The above speed improvement is the result of circuit optimization in the advanced 0.35 µm processes. Table 4 illustrates how the architecture innovation in the 'A' version improves performance when implementing an asynchronous parallel 8x8 multiplier and a 128x8 synchronous RAM. The first column shows the delay numbers of these two circuits in the current 2C/0.5 µm technology. If the 'A' version architecture were implemented in the same 2C/0.5 µm technology, these two circuits would be 32-62% faster, as shown in the second column. With the advanced 0.35 µm technology, the 'A' architecture becomes even faster.

                                          2C in     If 2CA in   2CA in
                                          0.5 µm    0.5 µm      0.35 µm
  Async. 8x8 Parallel Multiplier (ns)     75        51          40
  128x8 Synchronous RAM (a) (ns)          50        19          16
Table 4. Worst Case Timing Numbers of the different ORCA Series
a. Clock period; reading and writing are possible in each clock cycle.

All timing numbers shown in this paper are based on worst-case operating conditions. Other benchmark and performance results will be available during the conference. Shown in Figure 6 is a chip microphotograph of the 15K-usable gates device in 0.35 µm process technology.

[Fig. 6. Chip Microphotograph of the 15K-usable Gates Array in 0.35-Micron Process Technology]

4. Conclusions

In this paper, we have described the new features and advantages of the AT&T ORCA 2CA/2TA 0.35 µm 5V/3.3V series of FPGAs. By optimizing the 2C architecture in the advanced 0.35 µm processes with enhanced functionality, the new high-capacity ORCA series is now capable of implementing a yet wider range of very high performance applications in both 5V and 3.3V.

Acknowledgments

The authors would like to thank C.T. Chen for his leadership and technical advice throughout the design, J. Hoff for his contribution in writing this paper and all the test, layout, and product engineers for their excellent support.

References

[ATT95a] AT&T Field-Programmable Gate Arrays Data Book, April 1995.
[ATT95b] AT&T Preliminary Data Sheet, ORCA 2T15, Sept. 1995.
[BRIT94] Barry K. Britton, Yaw T. Oh, William Oswald, Ho T. Nguyen, Satwant Singh, Chong Lee, Wai-Bor Leung, Carolyn Spivak, Jim Steward and C. T. Chen, "Second Generation ORCA Architecture Utilizing 0.5 µm Process Enhances the Speed and Usable Gate Capacity of FPGAs", IEEE International ASIC Conference and Exhibit, Sept. 1994, pp. 474-478.
[Xili95] The Programmable Logic Data Book, Xilinx Co., San Jose, CA, 1995.
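As a supplementary check on the Table 4 discussion, the short sketch below recomputes the quoted 32-62% range from the table entries. It assumes that "faster" means the reduction in delay relative to the 2C/0.5 µm figure; the paper does not state the formula, but this reading reproduces both ends of the range. The function and dictionary names are illustrative only.

```python
def percent_faster(old_ns, new_ns):
    """Delay reduction relative to the old delay, in percent (assumed definition)."""
    return 100.0 * (old_ns - new_ns) / old_ns


# Table 4 entries in ns: (2C in 0.5 um, 'A' architecture in 0.5 um, 2CA in 0.35 um)
TABLE4_NS = {
    "Async. 8x8 parallel multiplier": (75, 51, 40),
    "128x8 synchronous RAM (clock period)": (50, 19, 16),
}

for circuit, (t_2c, t_a_half_micron, t_a_035) in TABLE4_NS.items():
    architecture_gain = percent_faster(t_2c, t_a_half_micron)
    total_gain = percent_faster(t_2c, t_a_035)
    print(f"{circuit}: architecture alone {architecture_gain:.0f}%, "
          f"architecture plus 0.35 um process {total_gain:.0f}%")
```

The architecture-only column gives 32% for the multiplier and 62% for the RAM, matching the range in the text; the remaining gain in the last column comes from the process migration.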
