1. The document describes a 32-bit integer divider chip that sustains a throughput of 25 million operations per second without prenormalizing its operands. 2. The chip is a one-dimensional systolic array of 16 identical two-block arithmetic cells separated by pipeline registers. 3. The division algorithm is a continuous non-restoring procedure in which each cell conditionally adds or subtracts the divisor and generates the next two bits of the quotient.

A 25 MOPS SYSTOLIC INTEGER DIVIDER IC

A. Roberto Criado

TRW LSI PRODUCTS INC.
P.O. Box 2412
La Jolla, California 92038

ABSTRACT

The divide function in signal processing systems is, in many instances,
unavoidable, particularly when implementing range scaling, matrix
operations, or perspective transforms in system applications such as
workstations, radar systems, and image processors. Traditionally this
need has been fulfilled at the expense of reduced system speed and
efficiency by relying on converging recursive techniques to carry out
the division operation. This paper reports the design and
implementation of a 32-bit fixed point integer divider. The chip
performs two's-complement integer division of 32-bit dividends and
16-bit divisors without prenormalization, to produce a 32-bit output
quotient.

ALGORITHM AND DEVICE ARCHITECTURE

The device is architected as a one-dimensional systolic array of
sixteen identical two-block arithmetic cells. The pipelined systolic
architecture operates at a throughput of 25 million operations per
second (25 MOPS), with a latency of 19 clock cycles. The algorithm
chosen for this design is a continuous non-restoring procedure [1],
implemented as a series of conditional adder/subtractor modules. These
arithmetic modules operate on the previous module's output (remainder),
along with the original divisor. Each arithmetic module accepts the
next two bits of the dividend and generates the next two bits of the
quotient. This process is repeated N times until the remainder is zero,
or a positive number. A slight complication occurs at times when the
division of two positive numbers yields a negative remainder. In these
cases one extra processing step is needed; the divisor is added to the
emanating negative remainder to correct it. The chip's architecture
parallels the algorithmic flow (Fig. 1). The device comprises 16
identical systolic functional cells separated by synchronous pipeline
registers (Fig. 2). The number of systolic functional cells (one-bit
quotient generators) per inter-register pipeline stage sets the balance
between throughput and number of cycles of pipeline latency. The core
of each block consists of two N-bit wide adder/subtractor circuits,
which can increase or decrease the incoming remainder by the divisor
(Fig. 3). Subtraction is performed if the sign bits of the divisor and
the incoming remainder match; addition is done if they differ. With two
cells per stage, the TMC3211 is able to meet the design goals of
throughput and chip size.

FUNCTIONAL DESCRIPTION

Both the 32-bit dividend and the 16-bit divisor inputs are accepted by
the chip on separate 25 MHz busses. Each input port is enabled
individually to allow continuous division by a constant. The integer
divider supports all fixed-point input formats, although the user must
keep track of the binary point to interpret the quotient properly.

The integer data formats are as follows:

Dividend Input and Output Bus

Bit:     D31    D30   ...  D18   ...  D01  D00
Weight:  -2^31  2^30  ...  2^18  ...  2^1  2^0

Divisor Input Bus

Bit:     X15    X14   ...  X01  X00
Weight:  -2^15  2^14  ...  2^1  2^0

For fractional data the formats are as follows:

Dividend Input and Output Bus

Bit:     D31   D30   D29   D28   ...  D18    ...  D01    D00
Weight:  -2^0  2^-1  2^-2  2^-3  ...  2^-13  ...  2^-30  2^-31

Divisor Input Bus

Bit:     X15   X14   X13   ...  X01    X00
Weight:  -2^0  2^-1  2^-2  ...  2^-14  2^-15

(note: the minus sign represents the sign bit).

Having latched the incoming operands, the divisor in full and a 16-bit
sign-extension of the dividend sign bit are operated on by the first of
the 16 arithmetic modules which comprise the chip's core. This
procedure is repeated throughout the core, with each module accepting
the next two bits of the dividend appropriately delayed to compensate
for the pipeline delay within the core. The carries emanating at each
stage of the arithmetic core are collected and assembled into a
provisional quotient. Before outputting the final result, conditioning
of the quotient is performed to correct the result for the possibility
of having reached a zero remainder prior to the last arithmetic process
in the core, or as stated above, to correct a possible remainder
anomaly; and lastly, to complement the quotient in case of a negative
divisor, or dividend. The design offers two error flags, DZ (divisor
equals zero), and REM (inexact result, non-zero remainder), which
accompany each quotient output. When DZ is high, indicating a divide by
zero operation, the quotient is meaningless. Due to a finite data word
width, a two's complement overflow error occurs under the following
unique conditions:

Dividend : Y = 80000000 [hex] (neg. full scale)

Divisor : X = FFFF [hex] (-1)

Result

Quotient : Q = 80000000 [hex] (neg. full scale)

CH1234-5/89/0000-P7-4.1 $01.00 © 1989 IEEE


As stated earlier, this condition occurs due to a limitation in the
number of bits available to indicate a positive full scale quotient.
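
The overflow case can be checked with ordinary integer arithmetic. The sketch below is not a model of the chip: it assumes the exact quotient is truncated toward zero and then kept to 32 bits (the rounding mode is not stated in the text), and the helper names `to_signed` and `divide_32_by_16` are made up for illustration.

```python
def to_signed(value, bits):
    """Interpret the low `bits` bits of `value` as a two's-complement integer."""
    value &= (1 << bits) - 1
    if value & (1 << (bits - 1)):
        value -= 1 << bits
    return value


def divide_32_by_16(y_word, x_word):
    """Return (32-bit quotient word, overflow flag) for raw bus words.

    Assumption: the exact quotient is truncated toward zero and then
    wrapped to 32 bits.
    """
    y = to_signed(y_word, 32)
    x = to_signed(x_word, 16)
    if x == 0:
        raise ZeroDivisionError("divisor is zero (the DZ case)")
    magnitude = abs(y) // abs(x)
    exact = -magnitude if (y < 0) != (x < 0) else magnitude
    # The quotient overflows when it does not fit in 32-bit two's complement.
    overflow = not (-(1 << 31) <= exact <= (1 << 31) - 1)
    return exact & 0xFFFFFFFF, overflow


q_word, overflow = divide_32_by_16(0x80000000, 0xFFFF)
print(f"{q_word:08X}", overflow)
```

The exact result of negative full scale divided by -1 is +2^31, one more than the largest positive 32-bit value, so the wrapped word reads back as negative full scale.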

CHIP DESIGN AND IMPLEMENTATION

One of the main goals in this effort was to achieve the desired
throughput of 25 MOPS, and also to maximize the effective use of active
area real estate. The systolic architecture implemented as a direct
result of the chosen division algorithm pointed to a "long and narrow"
aspect ratio. Furthermore, the dividend and quotient holding circuits
required shift registers that in the former case increase in length by
one stage as the process flows towards the final result, and in the
latter case decrease in length by one stage (Fig. 4). These
requirements on the holding circuits, as well as on the arithmetic
core, imposed special care on the chip's floor plan. The device floor
plan follows a U-shaped flow. The data flow is from upper right, to
upper left along the U-path. The dividend holding registers supply the
delayed Y-bits to the modules from the module's outer boundary (away
from the center); while the quotient bits are received by the quotient
holding circuits located in the center of the chip (Fig. 5). The
increasing, and decreasing registers of the quotient and dividend
holding circuits, respectively, were implemented as S-shape clusters of
D-type flip-flops in an attempt to maximize active area usage. A very
desirable chip aspect ratio has been accomplished; the die size is
(7.00 X 7.18) mm. sq. (Fig. 6). The transistor count is over 50 K. The
logic implementation of the chip was done exclusively with cells from
our CMOS cell library, and fabricated in TRW's OMICRON-C one-micron
CMOS process.

FIGURE 1
DIVIDER ALGORITHMIC FLOW
(Each step forms $2^{16} Q_M + R' = 2R - X + Y_M$ if $X_{15} = R_{15}$,
or $2^{16} Q_M + R' = 2R + X + Y_M$ if $X_{15} \neq R_{15}$, with the
quotient bit $Q_M$ taken from the carry-out.)
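
The staggered holding registers can be pictured with a small delay table. This is a sketch under stated assumptions, not the chip's actual register map: it assumes one arithmetic module per pipeline stage, with module k sitting k stages into the core, and the names `NUM_MODULES`, `BITS_PER_MODULE`, and `holding_delays` are hypothetical.

```python
NUM_MODULES = 16       # arithmetic modules in the core (from the text)
BITS_PER_MODULE = 2    # dividend bits consumed / quotient bits produced per module


def holding_delays(num_modules=NUM_MODULES):
    """Return per-module (dividend_delay, quotient_delay) lists in clock stages.

    Assumption: module k is reached k stages after module 0, so its dividend
    bit pair waits k stages on the way in, and its quotient bit pair waits
    the remaining num_modules - 1 - k stages on the way out.
    """
    dividend = list(range(num_modules))
    quotient = [num_modules - 1 - k for k in range(num_modules)]
    return dividend, quotient


dividend, quotient = holding_delays()
# Every bit pair sees the same total delay, so the quotient emerges aligned.
totals = {d + q for d, q in zip(dividend, quotient)}
# One flip-flop per bit per stage of delay in the dividend holding circuit.
flip_flops = BITS_PER_MODULE * sum(dividend)
print(totals, flip_flops)
```

Under these assumptions each holding circuit forms a triangle of flip-flops, which is one way to read the stage-by-stage growth and shrinkage of the shift registers.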

The entire project, from architecture definition to first prototypes,
was accomplished in 8 calendar months; and the design proved successful
on the first iteration. Although somewhat more cumbersome than
multiplication, division can be executed cost-effectively in an
algorithm-specific integrated circuit. The size of the chip is roughly
proportional to the precision of the divisor, which determines the
width of each adder/subtractor stage. Likewise, it is also proportional
to the desired precision of the quotient, which requires one
adder/subtractor per bit. The TMC3211 offers a cost effective, low
power, high speed solution for performing fixed point division in a
variety of digital signal processing applications.
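
The conditional add/subtract recurrence behind the adder/subtractor stages can be sketched in a few lines. This is a textbook non-restoring loop restricted to a non-negative dividend and a positive divisor, not a model of the TMC3211's two's-complement datapath or its quotient conditioning; the function name `nonrestoring_divide` and the grouping of two steps per module are illustrative assumptions.

```python
def nonrestoring_divide(y, x, modules=16, bits_per_module=2):
    """Divide y by x with non-restoring steps; return (quotient, remainder).

    Restricted to 0 <= y < 2**(modules * bits_per_module) and x > 0.
    """
    n = modules * bits_per_module
    if not (0 <= y < (1 << n)) or x <= 0:
        raise ValueError("sketch handles non-negative y and positive x only")
    r = 0        # partial remainder, allowed to go negative
    q = 0
    shift = n
    for _ in range(modules):              # one pass per arithmetic module
        for _ in range(bits_per_module):  # two one-bit quotient generators
            shift -= 1
            bit = (y >> shift) & 1        # next dividend bit, MSB first
            if r >= 0:
                r = 2 * r + bit - x       # signs match: subtract the divisor
            else:
                r = 2 * r + bit + x       # signs differ: add the divisor
            q = (q << 1) | (1 if r >= 0 else 0)
    if r < 0:
        r += x                            # extra step: correct a negative remainder
    return q, r


print(nonrestoring_divide(1_000_000, 37))
```

Each loop iteration costs exactly one add or subtract regardless of the outcome, which is why the procedure maps onto one adder/subtractor per quotient bit; a restoring scheme would need a second operation whenever a trial subtraction fails.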
FIGURE 2
TMC3211 BLOCK DIAGRAM

REFERENCES

[1] Oberman, R.M.M., Digital Circuits for Binary Arithmetic,
McGraw-Hill, New York (1979)

FIGURE 3
ARITHMETIC CELL FUNCTIONAL BLOCK DIAGRAM

FIGURE 4
DIVIDER PRELIMINARY FLOOR PLAN

FIGURE 5
DIVIDER ACTUAL FLOOR PLAN

FIGURE 6
CHIP MICROPHOTOGRAPH
