
IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 26, NO. 10, OCTOBER 1991

A 30-b Integrated Logarithmic Number System Processor

Lawrence K. Yu, Member, IEEE, and David M. Lewis, Member, IEEE

Abstract - This paper describes an integrated processor that performs addition and subtraction of 30-b numbers in the logarithmic number system (LNS). This processor offers 5-MOPS performance in 3-μm CMOS technology, and is implemented in a two-chip set comprising 170K transistors. A new algorithm for linear approximation using different-sized approximation intervals in each of a number of segments is used. A second technique, nonlinear compression, further reduces table space by storing the difference between the exact value of the function and a linear approximation. This allows the implementation of logarithmic arithmetic using much less ROM than previously required, making high-speed logarithmic arithmetic possible in an area comparable to single-precision floating-point processors. Compared to previous techniques for logarithmic arithmetic, a factor of 285 reduction in table space is realized.

I. INTRODUCTION

CALCULATIONS requiring high precision and range can be performed with several different numeric representations, including floating-point (FP) or logarithmic number system (LNS) representations. FP representation is the most common number representation, while LNS representations have rarely been used. The scarcity of LNS processors is due to the difficulty of performing LNS addition and subtraction. While the LNS offers better accuracy than FP [1] and simple multiplication and division, addition and subtraction circuits have area that is exponential in numeric precision. Most applications require both additive and multiplicative operations, making LNS arithmetic impractical. As a result, the highest precision processor previously described offered only 12-b precision in a 3-μm I²L technology [2]. No algorithms for higher precision LNS arithmetic have previously been described, making LNS arithmetic impractical for most applications.

This paper describes a new algorithm and architecture for performing LNS addition and subtraction, and its prototype implementation in an integrated processor. This processor offers 5-MOPS performance using a 30-b number representation, and is implemented in a two-chip set in 3-μm CMOS. Although the prototype is slightly less accurate than a single-precision FP processor, due to the limited circuit area in 3-μm technology, it offers higher performance than FP arithmetic in the same technology. A denser technology would allow LNS arithmetic to offer better speed, accuracy, and performance than single-precision FP in the same technology.

The central difficulty in implementing addition and subtraction operations in LNS is the need to approximate two nonlinear functions, which has typically been performed using lookup tables. In a straightforward implementation with F bits of fractional precision, roughly 2F × 2^F words are required [3]. For this reason, published implementations have been restricted to 8 to 12 b of fractional precision [2], [4].

Efficient approximation of a nonlinear function using small lookup tables suggests the use of a Taylor series approach. Linear approximation [5], quadratic approximation [6], and linear approximation with a nonlinear difference function in a PLA [7] have been used to advantage in the approximation of some functions, such as log(x). This is possible for log(x) because of its smooth nature over a small range. In contrast, one of the functions that must be approximated in LNS arithmetic has a singularity that makes straightforward Taylor series approximations difficult. A previous attempt [8] at using linear approximation in LNS arithmetic achieved better precision for addition only by using a modified linear approximation, but is limited to about 3.85 additional bits of precision.

This paper describes an integrated LNS arithmetic processor using a new method for linear approximation of the LNS arithmetic functions. Using 3-μm CMOS technology, the prototype offers 20-b precision, considerably greater than previous designs. Two techniques are used to increase the precision possible for a given amount of ROM. First, a new segmented technique for linear approximation is used to reduce the amount of table storage required to 561 kb, a factor of 127 reduction compared to the most efficient previous method [2]. A second table compression technique, linear approximation with difference coding, is used for further reduction, to 251 kb, a factor of 285 reduction.

The remainder of the paper is organized as follows. Section II introduces the LNS representation and the algorithms used. The chip optimization and design are described in Section III, followed by a comparison to floating-point processors in Section IV and conclusions in Section V.

Manuscript received January 17, 1991; revised May 13, 1991. This work was supported by the Natural Sciences and Engineering Research Council of Canada. The authors are with the Department of Electrical Engineering, University of Toronto, Toronto, Ont., Canada M5S 1A4. IEEE Log Number 9101852.




II. NUMBER REPRESENTATION AND ALGORITHM DESIGN

The logarithmic number system represents a number x by its sign and logarithm in some base b, together with some distinct representation for zero. In this paper we will only consider b = 2, and use a distinct bit to represent zero for simplicity. Formally, the number x is represented by the triple (z_x, s_x, e_x), with z_x being a zero flag, s_x the sign, e_x = log2(|x|), and x = (1 - z_x) × (-1)^{s_x} × 2^{e_x}. e_x is an N-bit fixed-point number with I integer bits and F fractional bits of precision, and N = F + I. An F fractional bit LNS representation has precision comparable to an (F + 0.53)-b precision FP representation. We also assume an excess-n representation of e_x, with n = 2^{I-1} - 1. The processor described in this paper uses a 30-b LNS format, with I = 8 and F = 20, and two extra bits for z_x and s_x. This is slightly inferior to single-precision FP, but is the most accurate format that could be implemented in the available process technology.

The central algorithmic technique used for high-precision LNS addition and subtraction is a new method for segmented linear approximation, which is fully described in [9]. A brief summary and subsequent enhancements are presented here for exposition of the architecture.

Let a and b be two numbers represented in LNS format by (z_a, s_a, e_a) and (z_b, s_b, e_b), and without loss of generality assume e_a ≥ e_b. Ignoring signs, algorithms for addition and subtraction using auxiliary functions f_a and f_s can be derived as follows:

Addition: c = a + b
  log(c) = log(a + b)
  log(c) = log(a × (1 + b/a))
  log(c) = log(a) + log(1 + b/a)
  log(c) = log(a) + log(1 + 2^{log(b) - log(a)})
  e_c = e_a + f_a(e_b - e_a), where f_a(r) = log(1 + 2^r)

Subtraction: c = a - b
  e_c = e_a + f_s(e_b - e_a), where f_s(r) = log(1 - 2^r).

In this paper log(x) means the logarithm to base 2, and exp(x) means 2^x.

The central difficulty in LNS arithmetic is the implementation of the functions f_a(r) and f_s(r), for r < 0. Lookup tables are the most common method used in previous processors. A graph of these functions in Fig. 1 illustrates the singularity in f_s(r) that makes linear approximation difficult. While f_a(r) is well behaved near 0, f_s(r) → -∞ as r → 0.

[Fig. 1: Plot of the f_a(r) and f_s(r) functions.]

Both functions are smooth for more negative r. LNS arithmetic often exploits the essential zeros of f_a(r) and f_s(r), which are thresholds such that for r smaller than the essential zero, f_a(r) and f_s(r) are zero to within the accuracy of the representation. The essential zero has the value -F - 2, at which both |f_a(r)| < 2^{-F-1} and |f_s(r)| < 2^{-F-1}, so lookup tables are required only for -F - 2 < r < 0. Previous implementations of LNS arithmetic used lookup tables with all bits of r as input, detecting essential zeros to eliminate the associated ROM space [1]. This leads to 2 × (F + 2) × 2^F words of lookup table.

The varying nonlinearity of the functions makes a straightforward linear approximation using equal intervals across the entire domain impractical. Instead, the algorithms described in this paper use a new segmented approach, with linear approximation across smaller regions where the function is more nonlinear. Furthermore, study of f_a(r) and f_s(r) will reveal properties of these functions that simplify linear approximation if logarithmic arithmetic is used in the approximation.
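As an illustration of the derivation above, the following Python sketch (not part of the original design; all names are illustrative) evaluates f_a and f_s directly instead of from lookup tables. For the prototype's F = 20, a straightforward tabulation of both functions would require on the order of 2 × (F + 2) × 2^F ≈ 46 million words, which is the cost the segmented approach described next avoids.

```python
import math

def f_a(r):
    """f_a(r) = log2(1 + 2^r), used for LNS addition (r = e_b - e_a <= 0)."""
    return math.log2(1.0 + 2.0 ** r)

def f_s(r):
    """f_s(r) = log2(1 - 2^r), used for LNS subtraction; singular as r -> 0."""
    return math.log2(1.0 - 2.0 ** r)

def lns_addsub(e_a, e_b, subtract=False, F=20):
    """Combine two positive LNS operands given their base-2 logarithms.

    Signs and the zero flag are ignored, as in the derivation above.  The
    operands are swapped so that r <= 0, and below the essential zero
    (r < -F - 2) the correction is negligible and e_a is returned directly.
    """
    if e_b > e_a:
        e_a, e_b = e_b, e_a
    r = e_b - e_a
    if r < -F - 2:
        return e_a
    return e_a + (f_s(r) if subtract else f_a(r))

# 6 + 2 and 6 - 2, computed entirely in the log domain.
print(2.0 ** lns_addsub(math.log2(6.0), math.log2(2.0)))          # ~8.0
print(2.0 ** lns_addsub(math.log2(6.0), math.log2(2.0), True))    # ~4.0
```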

A. Linear Approximation of f_a and f_s

A linear approximation of some function f(x) in the neighborhood of x is defined by

  f(x + Δx) = f(x) + (df(x)/dx) × Δx.    (1)

This formulation appears to require a ROM storing f(x) at some set of points, a ROM for df(x)/dx, and a multiplier, which is potentially expensive. Further consideration of f_a(r) and f_s(r) will show that the df(x)/dx ROM and the multiplier can be eliminated at the cost of a small ROM and additional logic. First, the multiplication can be eliminated by using logarithmic arithmetic, so (1) can be replaced by (2) if df(x)/dx > 0 and (3) if df(x)/dx < 0:

  f(x + Δx) = f(x) + sgn(Δx) × exp(log(|Δx|) + log(df(x)/dx))    (2)

  f(x + Δx) = f(x) - sgn(Δx) × exp(log(|Δx|) + log(-df(x)/dx)).    (3)

The function sgn(·) in (2) and (3) is the sign function. The function f_a(r) has a positive derivative and is approximated using (2), while f_s(r) always has a negative derivative and is approximated using (3). These calculations eliminate the multiplication, but appear to increase the complexity of the circuitry. This complexity can be eliminated by noticing that f_a(r) and f_s(r) have properties (4) and (5). As a result, the calculations can be performed using logarithmic arithmetic, as shown in (6) and (7). The multiplication and the calculation of df(x)/dx have been eliminated, at the cost of adding lookup tables for log(·) and exp(·). These tables can be shared for both addition and subtraction, and will be seen to be small.

  df_a(x)/dx = exp(x - f_a(x))    (4)

  df_s(x)/dx = -exp(x - f_s(x))    (5)

  f_a(x + Δx) = f_a(x) + sgn(Δx) × exp(log(|Δx|) + x - f_a(x))    (6)

  f_s(x + Δx) = f_s(x) - sgn(Δx) × exp(log(|Δx|) + x - f_s(x)).    (7)

Previous linear approximations have generally used positive Δx, extracted as an unsigned bit field from the input parameter. In contrast, Δx is a signed number in this paper. This doubles the range that can be used for linear approximation with some fixed maximum error, and halves the size of the lookup tables.
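A small numerical check of (6) and (7) can be written as follows. This Python sketch (an illustration, not the paper's hardware) uses properties (4) and (5), i.e. log(df_a(x)/dx) = x - f_a(x) and log(-df_s(x)/dx) = x - f_s(x), so the Taylor-series multiply becomes an addition feeding the exp(·) table; log_tbl and exp_tbl stand in for the small log and exp ROMs and are evaluated exactly here.

```python
import math

def f_a(x):            # f_a(x) = log2(1 + 2^x)
    return math.log2(1.0 + 2.0 ** x)

def f_s(x):            # f_s(x) = log2(1 - 2^x), x < 0
    return math.log2(1.0 - 2.0 ** x)

def log_tbl(v):        # stands in for the small log(.) ROM
    return math.log2(v)

def exp_tbl(v):        # stands in for the small exp(.) ROM
    return 2.0 ** v

def approx_f_a(x, dx):
    """Equation (6): f_a(x + dx) ~ f_a(x) + sgn(dx)*exp(log|dx| + x - f_a(x)).

    The term x - f_a(x) is the log of the derivative of f_a at x
    (property (4)), so the multiply of a plain Taylor step becomes an
    add inside exp_tbl(): no hardware multiplier is needed.
    """
    base = f_a(x)
    if dx == 0.0:
        return base
    return base + math.copysign(exp_tbl(log_tbl(abs(dx)) + x - base), dx)

def approx_f_s(x, dx):
    """Equation (7): same idea, with a negative derivative (property (5))."""
    base = f_s(x)
    if dx == 0.0:
        return base
    return base - math.copysign(exp_tbl(log_tbl(abs(dx)) + x - base), dx)

# The approximation error shrinks as |dx| is reduced:
print(abs(approx_f_a(-2.0, 0.05) - f_a(-1.95)))   # small
print(abs(approx_f_s(-2.0, -0.05) - f_s(-2.05)))  # small
```

The printed differences shrink roughly quadratically as |Δx| is reduced, which is the behavior the per-segment choice of Δx_max below exploits.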
The remaining problem is how to choose x and Δx such that r = x + Δx. The error in linear approximation with some maximum value of |Δx|, called Δx_max, is approximately (d²f(x)/dx²) × Δx_max²/2, so the choice of the points x and the corresponding Δx_max must be made to meet the required error tolerance. For a given maximum error, the value of Δx_max is proportional to |(1/2) × d²f(x)/dx²|^{-1/2}.

This choice of Δx_max forms the basis of segmented linear approximation. The domain of x is divided into a number of segments, and a worst-case value of Δx_max is used in each segment. Within a segment, the values of f(x) are stored at a set of points 2 × Δx_max apart (the factor of 2 arises from the fact that Δx is signed). Thus, segments are chosen in a manner that makes it easy to compute the correct segment and Δx_max for the given segment, while not wasting excessive table space.

A simple technique for choosing Δx_max is to partition the binary representation of r into several parts, specifically r_i, r_h, r_l, and r_e, such that r = r_i + r_h + r_l + r_e. Also define r_t = r_i + r_h. The linear approximation will be performed with x = r_t, Δx = r_l. The value r_e is ignored, as it is chosen to be too small to affect the result. Although r_e is not directly used in this calculation, it is used later to select a segment.

The partitioning of r into bit fields can be described by two integers p_l and p_e, with p_e < p_l and p_l < 0. Using segmented linear approximation, p_l and p_e are functions of r, but are constant within each segment. The corresponding value of Δx_max is 2^{p_l}. Given some p_l and p_e, the values of r_i, r_h, r_l, and r_e are defined by (8)-(11). The notation r_m ... r_n means the value of the binary representation of bits m through n inclusive of r:

  r_e = r_{p_e-1} ... r_{-F}    (8)

  r_l = r_{p_l} ... r_{p_e} - 2^{p_l}    (9)

  r_h = r_{-1} ... r_{p_l+1} + 2^{p_l}    (10)

  r_i = r_{I-1} ... r_0.    (11)

Using this partitioning, r_i is the integer part of r, r_h is a positive quantity with -1 - p_l bits dependent upon r, and r_l is a signed quantity with p_l - p_e + 1 significant bits and |r_l| ≤ 2^{p_l}. Finally, 0 ≤ r_e < 2^{p_e}, and r_i ≤ r < r_i + 1.

Combining this partitioning of r with the approximations (6) and (7) leads to the formulas (12) and (13) as approximations for f_a(r) and f_s(r):

  f_a(r) = f_a(r_t) + sgn(r_l) × exp(log(|r_l|) + r_t - f_a(r_t))    (12)

  f_s(r) = f_s(r_t) - sgn(r_l) × exp(log(|r_l|) + r_t - f_s(r_t)).    (13)

The choice of segments and Δx_max is made to meet error tolerance requirements as well as to result in a simple implementation. The values of p_l and p_e are chosen to meet the accuracy constraint that the error due to linear approximation should be smaller than half a least significant bit. The function f_a(r) is smooth everywhere, allowing a relatively simple choice of segments based on intervals [r_i, r_i + 1). The value of Δx_max increases with more negative r_i. For r < -1, f_s(r) allows a similar treatment, but the singularity at 0 requires a different approach for r ∈ [-1, 0). In this region the interval is divided into segments [-2^{-i}, -2^{-i-1}). The segment size and Δx_max both decrease as r → 0. Table I shows the sizes of intervals and segments for the functions, to within a constant factor. The effect of segmented linear approximation for F = 4 is shown in Fig. 2. The crosses mark the points stored in lookup tables, and the lines represent the range of linear approximation about each point. The arithmetic performed by the remainder of the processor and the corresponding data paths are shown in Figs. 3 and 4, respectively. Each step in the algorithm and corresponding hardware block is labeled with an identifying number.

[Table I: Segment choice. The table contents are not legible in this scan.]

[Fig. 2: Linear approximation intervals for F = 4.]

[Table II: Table sizes for linear approximation. Most entries are illegible in this scan; the exp table line reads 4096 words of 23 b, and the running text gives the total as 22K words and 574,464 b.]

Fig. 3. Addition algorithm:
  1. generate p_l, p_e
  2. partition r into r_i, r_h, r_l, r_e
  3. r_t = r_i + r_h
  4. frt = fa_tbl(r_t)
  5. lrl = log_tbl(|r_l|)
  6. fd = r_t - frt
  7. lcor = lrl + fd
  8. cor = exp_tbl(lcor)
  9. ff = frt + cor if r_l > 0; ff = frt - cor if r_l < 0
  10. e_c = e_a + ff
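Read as software, the stages of Fig. 3 correspond to the following Python sketch (illustrative only). The ROMs fa_tbl, log_tbl, and exp_tbl are modeled as exact functions, and the segment selection of stages 1-3 is simplified to a fixed point spacing; the real data path derives r_t and r_l from the bit fields of (8)-(11) with per-segment p_l and p_e.

```python
import math

def fa_tbl(r_t):                  # f_a ROM: f_a(r_t) = log2(1 + 2^r_t)
    return math.log2(1.0 + 2.0 ** r_t)

def log_tbl(v):                   # log ROM
    return math.log2(v)

def exp_tbl(v):                   # exp ROM
    return 2.0 ** v

def lns_add(e_a, e_b, p_l=-6):
    """LNS addition e_c = e_a + f_a(e_b - e_a); stage numbers follow Fig. 3."""
    if e_b > e_a:
        e_a, e_b = e_b, e_a
    r = e_b - e_a
    # 1-3: choose the approximation point r_t near r and the signed offset r_l
    #      (simplified: points spaced 2 * 2^p_l apart, so |r_l| <= 2^p_l)
    spacing = 2.0 * 2.0 ** p_l
    r_t = round(r / spacing) * spacing
    r_l = r - r_t
    # 4: frt = fa_tbl(r_t)
    frt = fa_tbl(r_t)
    if r_l == 0.0:
        return e_a + frt
    # 5: lrl = log_tbl(|r_l|)
    lrl = log_tbl(abs(r_l))
    # 6: fd = r_t - frt    (the log of the derivative of f_a at r_t)
    fd = r_t - frt
    # 7, 8: cor = exp_tbl(lcor), with lcor = lrl + fd
    cor = exp_tbl(lrl + fd)
    # 9: apply the correction with the sign of r_l
    ff = frt + cor if r_l > 0.0 else frt - cor
    # 10: e_c = e_a + ff
    return e_a + ff

print(2.0 ** lns_add(math.log2(6.0), math.log2(2.0)))   # ~8.0
```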

[Fig. 4: Data path for segmented linear approximation.]

The data path operates on two numbers x and y, and chooses a and b to meet the constraint e_a ≥ e_b. The dotted line is related to the circuit partitioning into two chips, and will be described later.

B. Range Reduction

The size of the log(·) and exp(·) lookup tables can be reduced by using algebraic identities. Consider the log ROM, which looks up log(x) using a fixed-point representation of x in the range [0, 1), and has the identity log(2^i × x) = i + log(x). (Log(0) is chosen to be a negative value of large enough magnitude to guarantee a correct result.) This identity can be used to reduce table space by half. A priority encoder generates i corresponding to the most significant one in x; x is left shifted by i, resulting in a value in the range [0.5, 1), which is used to perform the table lookup. The table output has i subtracted, producing the result. A similar technique is used for exp(·).

The total lookup table sizes after applying this optimization to the log and exp tables are shown in Table II. A total of 22K words with 574,464 b is required. Further reduction is required to fit in the available technology.
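A sketch of the log-table range reduction just described, in Python. The table covering [0.5, 1), the fraction width W, and the integer encoding of x are illustrative choices, not those of the chip; the point is the identity log(2^i × x) = i + log(x), with the priority encoder modeled by bit_length().

```python
import math

W = 16                                   # fraction width of the table input (illustrative)

# Log ROM covering only [0.5, 1): indexed by the W-1 bits below the leading one.
LOG_TBL = [math.log2(0.5 + i / 2 ** W) for i in range(2 ** (W - 1))]

def log2_range_reduced(x_bits):
    """log2 of a W-bit fraction x in (0, 1), given as an integer 0 < x_bits < 2^W.

    A priority encoder finds i such that 2^i * x lies in [0.5, 1); the shifted
    value addresses the half-sized table, and i is subtracted from the output.
    """
    assert 0 < x_bits < 2 ** W
    i = W - x_bits.bit_length()          # number of leading zeros = shift count
    shifted = x_bits << i                # now 2^(W-1) <= shifted < 2^W, i.e. in [0.5, 1)
    return LOG_TBL[shifted - 2 ** (W - 1)] - i

x = 0.0375
print(log2_range_reduced(round(x * 2 ** W)), math.log2(x))   # close agreement
```

The table has 2^{W-1} entries instead of 2^W, which is the halving of table space mentioned above.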

C. Nonlinear Table Compression

A second technique, called nonlinear table compression, is used to reduce the size of each of the lookup tables. Nonlinear compression uses the observation that a linear approximation provides a close, but inexact, approximation of the function. Since the approximation is close, a table storing the difference between the linear approximation and the exact function can use a few bits to represent the difference. This is expressed by (14) through (16). It is necessary to represent the value of f(x_b) and df(x_b)/dx for each possible value of x_b, and the value of f_d(x) for each possible value of x.

[Equations (14)-(16) are not legible in this scan. From the surrounding text they express the split of x into x_b and x_e, the reconstruction f(x) = f(x_b) + (df(x_b)/dx) × x_e + f_d(x), and the definition of the difference term f_d(x) as the exact function minus the linear approximation.]

Equation (15) corresponds to the hardware implementation shown in Fig. 5, so each ROM of Fig. 4 is replaced by an instance of the logic in Fig. 5.

[Fig. 5: Nonlinear function compression.]

Given some f(x) with N_x bits of precision in x, nonlinear table compression splits x into two parts: x_e, containing the N_e least significant bits of x, used for linear approximation, and x_b, containing the N_b most significant bits, used to look up the function value and derivative, with N_x = N_e + N_b. The value of x_b is used to address a ROM. Each word contains the value of f(x_b), the value of df(x_b)/dx, and 2^{N_e} - 1 correction words f_d(x) (note that f_d(x_b) = 0 and is not stored). The widths of these quantities are W_f, W_d, and W_c, respectively, with the total width of the bits used for f_d(x) being W_fd = (2^{N_e} - 1) × W_c. A multiplexer uses x_e to select a field from the ROM, producing the value of f_d(x). A multiplier and two adders produce f(x).

In fact, the value stored for df(x_b)/dx is chosen to optimize the accuracy of the linear approximation and thus minimize the number of bits of ROM, and is not necessarily the exact value of df(x_b)/dx (although the result must still be exact). The total ROM width is W = W_f + W_d + W_fd, but the number of words is reduced by a factor of 2^{N_e}. As N_e increases, the number of words decreases, and W_fd and W increase. The value of N_e is chosen to minimize total ROM space.
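The difference-coded organization of Section II-C can be sketched as follows (illustrative Python, not the chip's encoding): each ROM word serves 2^{N_e} consecutive input codes and stores a base value, a slope, and 2^{N_e} - 1 small corrections. Here the slope is simply the exact derivative and the corrections are kept as floats, whereas the chip quantizes them to W_d and W_c bits and tunes the slope so that the rounded result is still exact.

```python
import math

F = 12               # fractional bits of this toy example (illustrative)
N_E = 3              # low bits x_e handled by linear approximation + correction
STEP = 2.0 ** -F     # weight of one least-significant input bit

def f(x):            # function being tabulated, here f_a(x) = log2(1 + 2^x)
    return math.log2(1.0 + 2.0 ** x)

def build_word(x_b):
    """One ROM word: f(x_b), a slope, and 2^N_E - 1 correction values f_d."""
    base = f(x_b)
    slope = (2.0 ** x_b) / (1.0 + 2.0 ** x_b)            # df/dx at x_b
    corr = [f(x_b + k * STEP) - (base + slope * k * STEP)
            for k in range(1, 2 ** N_E)]                 # f_d; f_d(x_b) = 0
    return base, slope, corr

def lookup(table, x_index):
    """Reconstruct f(x) from a compressed word: one multiply and two adds."""
    x_b_index, x_e = divmod(x_index, 2 ** N_E)
    base, slope, corr = table[x_b_index]
    f_d = 0.0 if x_e == 0 else corr[x_e - 1]
    return base + slope * x_e * STEP + f_d

# Toy table for x in [-2, -1): one word per 2^N_E input codes.
x0 = -2.0
table = [build_word(x0 + j * 2 ** N_E * STEP) for j in range(2 ** (F - N_E))]
x_index = 1234
print(lookup(table, x_index))
print(f(x0 + x_index * STEP))     # agrees up to floating-point rounding
```

Because the corrections are small, W_c can be much narrower than W_f, which is how the reduction of the word count by 2^{N_e} turns into a net saving.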
III. CHIP AREA OPTIMIZATION AND DESIGN

A chip set that implements LNS addition using the above algorithm has been implemented. The chip set is designed to demonstrate the feasibility of high-precision LNS processors, so only the arithmetic section of the processor has been designed. This is implemented as a pipelined processor, accepting two operands and delivering a result per clock cycle. The processor is deeply pipelined in order to maximize throughput, at the cost of slightly increased latency. In a pipelined architecture, combinational circuits can be pipelined to use a cycle time as small as a gate delay plus a pipeline register, but an architecture that uses ROM's cannot easily be pipelined using a cycle shorter than a ROM access. A goal of 140 ns was set for the ROM access time and cycle time of the chip.

The algorithms described above define an architecture for LNS arithmetic, but detailed design decisions must be made to optimize circuit area. There are two principal decisions, related to nonlinear compression and to the allocation of lookup tables to ROM's.

A. Nonlinear Compression Optimization

The nonlinear compression algorithm places no constraints on N_e. Because the number of words and the size of each word vary with N_e, it is important to choose an N_e that minimizes chip area. The first step in this optimization was to write a computer program that accepts the contents of a lookup table and evaluates the total table size for any given value of N_e. This can be used to determine the optimum value of N_e. This technique can be applied to each segment of the f_a(r) and f_s(r) lookup tables, and to the log and exp tables as a whole. This optimization is relatively straightforward, and becomes complex only when the possible allocation of multiple segments or tables to a single ROM is considered.

The log and exp tables are both accessed once per operation, and so require a distinct ROM for each table. The processor also requires one access to either f_a(r) or f_s(r). This potentially allows several segments of f_a(r) and f_s(r) to be implemented in a single ROM, if this will save space. This is only directly possible if the ROM is wide enough to accommodate the widest of each of the fields, since otherwise some data steering logic is required. As a first step, the segments for f_a(r) and f_s(r) were independently optimized and grouped into four collections that could fit into identical word widths. About 5% of the table space is not used, but the corresponding area is small compared to the cost of separate tables. These collections of segments and corresponding word widths are shown in Table III(a). Total table space is reduced to 234,290 b, a reduction of 57%.

TABLE III: COMPRESSED TABLE SIZES
(a) Optimal values of W

  Name | N_e | W_f | W_d | W_fd | W   | Words
  fa   |  3  | 23  | 13  |  49  |  85 |  258
  fsb  |  3  | 23  | 13  |  49  |  85 |  386
  fss1 |  4  | 27  | 14  | 135  | 176 |  640
  fss2 |  3  | 28  | 22  |  11  | 127 |  162
  log  |  4  | 24  | 13  | 105  | 142 |  128
  exp  |  4  | 23  | 12  |  75  | 100 |  256

[Some entries in the fss2 and exp rows appear garbled in the scan, since W should equal W_f + W_d + W_fd. Table III(b), the actual values of W after merging, is not legible.]

It is desirable to perform further sharing by merging the various f_a(r) and f_s(r) tables into fewer ROM's. Although the merged ROM might contain more bits, it will eliminate some wiring and overhead circuits such as word-line drivers and sense amplifiers. Merging cannot be done efficiently using the tables of Table III(a) because of the large variations in word widths. However, the variation of ROM size with respect to N_e is relatively flat near the minimum. This allows ROM's to be merged by choosing a suboptimal N_e that results in comparable values of W_f, W_d, and W_fd for the different segments. While the value of N_e for a given collection of segments is suboptimal, the overall chip area is minimized.

We wrote computer programs to study the area trade-off between sharing tables and implementing each table separately. The programs use a detailed model of ROM area, including core area and overhead area for the technology available to us. This was used to explore the possibilities of sharing tables, and to find the optimal implementation of the tables.

TABLE IV: ROM SIZES

  ROM | Words | Width | Total Bits
  log |  128  |  110  |  14,080
  exp |  256  |  142  |  36,352
  [remaining rows not legible in this scan]
  Total |     |       | 249,131

[Fig. 6: Clock timing. The legible tick marks read 0, 35, 110, 125, and 140 ns.]

It was found that it was most economical to merge all sections of f_a and f_s into two ROM's, as shown in Table IV. A large ROM (the f-ROM) with W = 90 stores most of the data using the values W_f = 27, W_d = 14, and W_fd = 49, which can be interpreted as either three fields of 14 b or seven fields of 7 b. No data steering logic is required except for f_d(x). This ROM can store any table entry except the fss2 segments, where 1 b of the W_f bits and 8 b of the W_d bits are stored in a separate 265 × 9 excess ROM. Although the values of N_e for each table are slightly suboptimal, the overall circuit area is minimized. This increases the number of bits of ROM to 245,731 b, as shown in Table III(b), a 5% increase, but overall ROM area is reduced by 28%, and wiring area is reduced as well.

The optimal ROM sizes presented above require complicated logic to map r into an f-ROM address. The complexity of the logic can be reduced by placing each table at an address in the ROM that simplifies the mapping function. This leaves "holes" in the address space of the ROM, but no additional area is consumed, since the corresponding words in the ROM are not implemented. The word-line decoder must decode only those addresses that are implemented. The resulting structure is effectively a PLA implemented with the area efficiency of a ROM, due to the density of the terms.

The use of a multilevel decoder tree requiring multiples of eight words in each segment, plus the need for reasonable aspect ratios, creates a further small increase in ROM space. A total of 251 kb of ROM is actually present in the chip.

An unfortunate side effect of the complicated algorithms used by this processor is the large delay through the processor. A total of ten pipeline stages is used in the present implementation. One of these stages is used for interchip communication, since multiplexing of the interchip signals is required. Four of these stages are required to perform the range reduction operations for the log and exp lookup tables. In retrospect, the area saved by range reduction was small, so these stages should be eliminated at the expense of slightly greater area.

B. Chip Implementation

A two-chip set that performs LNS addition and subtraction has been implemented in 3-μm DLM p-well CMOS. The circuits were estimated as requiring a 10-mm × 10-mm die, but the multiproject silicon foundry service available to us had a maximum die size of 8.2 mm × 8.2 mm, necessitating partitioning the design into a two-chip set. Partitioning into a two-chip set increased area considerably due to the large number of pads required for communication. The total area of the processor is still well within the amount that can be fabricated on a single chip with acceptable yield, so this decision is purely due to the constraints of the chip manufacturer.

Two-phase clocking with the timing illustrated in Fig. 6 is used. Two-stage dynamic latches are used throughout, with all state changes on φ1, which is also used for precharging dynamic circuits. The timing of φ1 is primarily dictated by ROM requirements. φ2 is only used for the first latch stage in pipeline registers. This constrains the processor to a single dynamic circuit per pipe stage, but does not cause any extra pipe stages.

The chip design uses simple static circuits wherever possible in order to keep design complexity low. A small number of commonly used cells, such as adders and multiplexers, were designed to keep layout time moderate. A fully static adder cell with block carry lookahead was designed. Variable-sized blocks are used to reduce carry propagation time, and the resulting circuit can perform a 30-b addition in 55 ns. Dynamic circuits were used only when a static circuit was too slow or too large: only the ROM's, shifters, PLA's, and pipeline registers are dynamic. The shifters and PLA's are constructed using domino logic, precharged on φ1 and evaluated on φ2. The pipeline registers are dynamic latches.

The pipeline partitioning of the chip is chosen to make the ROM access time the only limit on the clock cycle. The use of no more than one dynamic circuit per pipe stage sets a lower bound of five clock cycles on the latency of the circuit. Although consecutive dynamic stages with no intervening logic can be placed in consecutive clock cycles, most of the dynamic circuits have some intervening static logic. The static logic requires the use of an extra pipeline stage if it cannot fit into the time slack of the preceding dynamic stage. A total of nine stages results from these constraints, and an extra stage to communicate between chips brings the total to ten stages.

The f-ROM is the limiting factor in processor speed, so a simple but reasonably fast design was used. The address is latched and predecoded during one clock phase while the ROM precharges; the access is performed during the other phase, and the output data are latched on the phase that follows.

The ROM core uses straps to ground every 32 b and second-level-metal word-line straps every 64 b. Two-level decoders are used for fast word-line drive. A simple sense amplifier, a ratioed inverter with a high switching threshold, is used to provide reasonably fast sensing with good noise immunity. By choosing the inverter ratio appropriately, fast switching can be achieved. Since a ratioed inverter has a long pull-down time, an additional NMOS FET is used to precharge the data-out line low. Simulations predicted a 35-ns precharge and a 90-ns access time for a 125-ns cycle. An additional 15-ns dead time is allowed for a total cycle of 140 ns.

The processor is partitioned into two chips, with the functions allocated to each chip as shown in Fig. 4. Fig. 7 shows photomicrographs of the two chips, with labels corresponding to Fig. 4. Both chips have been fabricated and tested successfully. One circuit design error, a floating address line to a ROM, was found. This was not detected during simulation due to the limited number of vectors that could be simulated using the entire chip. At VDD = 5.0 V and 25°C, chip 1 operated at a speed of 4.5 MHz, and chip 2 operated at 5.6 MHz. The lower than expected clock speed was the result of an inadvertent omission of metal straps for the predecoded address lines. SPICE predicts a 175-ns cycle time for chip 1 with this omission, closer to the measurements. Our test setup limited control of the clock widths to multiples of 20 ns, so the actual maximum clock rates could be higher than these rates, and closer to the SPICE simulations.

[Fig. 7: Photomicrographs of the processor chips.]

IV. COMPARISON TO FLOATING-POINT PROCESSORS

The use of 3-μm technology makes the performance of the prototype uncompetitive with modern FP processors. The processor is not only slower than recent FP processors, but this technology only allows 30-b words with 20-b precision, which has less accuracy than a single-precision FP processor. A 2-μm or higher density technology is required to implement 32-b words with 23-b precision (deleting the zero bit, and using a distinct value to represent zero), offering about 0.53 b more precision than FP.

As mentioned earlier, the LNS has not been widely used due to the difficulty of implementing addition and subtraction. The methods described here could make LNS arithmetic competitive with FP processors, and feasible for general-purpose computations. To compare LNS arithmetic to FP, we consider general-purpose computations, which require circuits for multiplication as well as addition. In the LNS, a multiplication or division is performed with an addition or subtraction, so the silicon area required is insignificant, and the time required using our adder is 55 ns in 3-μm CMOS. Using improved ROM's, the processor presented here can achieve 140-ns addition with 1400-ns latency. A FP processor designed using similar technology [10] offers lower performance than the LNS processor, offering 1.1-μs addition, 1.6-μs multiplication, and 2.7-μs division.

LNS arithmetic should offer excellent performance in a higher density technology. The core area of the LNS chips totals about 100 mm², of which about 30% is ROM. Increasing the word size to 32 b to achieve better accuracy than a single-precision FP processor would increase ROM area to 74 mm², for a total of 144 mm². Scaling to 1.2-μm technology suggests an area of 23 mm², smaller than current FP processors. Eliminating the interchip communication and the range reduction logic for the log and exp ROM's would eliminate five pipe stages while increasing area by about 5%. The predicted cycle time in 1.2-μm CMOS from SPICE simulations is 25 ns, with a latency of 125 ns. Multiplication and division could be accomplished with insignificant area and 25-ns latency. A recent FP processor [11] in 1.2-μm CMOS offers 25-ns cycle times for addition and multiplication, and can complete two operations every three cycles, but has a circuit area of 110 mm². This processor can perform either single- or double-precision FP arithmetic, and also integrates a register file and interface logic. Another single-precision FP processor in the same technology [12] has an area of 58 mm² and performs an addition and a multiplication every 67 ns. A LNS processor in this technology would offer higher performance and smaller circuit area.

LNS processors using the algorithms described here cannot match double-precision FP processors in reasonable area due to the O(2^{F/2}) dependency of circuit area on precision. Implementation of high-precision LNS addition will require the development of different algorithms.

V. CONCLUSIONS

This paper has described the architecture of an integrated processor for 30-b LNS arithmetic. Two techniques are used to achieve this precision in moderate circuit area. Linear approximation of the LNS arithmetic functions using logarithmic arithmetic is shown to be simple due to the properties of the particular functions involved. A segmented approach to linear approximation minimizes the amount of table space required. Subsequent nonlinear compression of each lookup table leads to a further reduction in table size. The result is that a factor of 285 reduction in table size is achieved, compared to previous techniques.

The circuit area of the implementation is minimized by optimizing the table parameters, using a computer program that accurately models ROM area. The implementation is highly pipelined, and produces one result per clock cycle using a ten-stage pipeline. The architecture could be scaled using modern technology to higher precision and performance, as well as reduced latency. As a result, LNS arithmetic can offer higher speed and better accuracy than a single-precision FP processor in smaller circuit area.

ACKNOWLEDGMENT

Circuit fabrication was provided by the Canadian Microelectronics Corporation.

REFERENCES

[1] J. N. Mitchell, Jr., "Computer multiplication and division using binary logarithms," IRE Trans. Electron. Comput., vol. EC-11, pp. 512-517, Aug. 1962.
[2] F. J. Taylor, R. Gill, J. Joseph, and J. Radke, "A 20 bit logarithmic number system processor," IEEE Trans. Comput., vol. 37, pp. 190-200, Feb. 1988.
[3] E. E. Swartzlander and A. G. Alexopoulos, "The sign/logarithm number system," IEEE Trans. Comput., vol. C-24, pp. 1238-1242, Dec. 1975.
[4] J. H. Lang, C. A. Zukowski, R. O. LaMaire, and C. H. An, "Integrated-circuit logarithmic arithmetic units," IEEE Trans. Comput., vol. C-34, pp. 475-483, May 1985.
[5] M. Combet, H. Van Zonneveld, and L. Verbeek, "Computation of the base two logarithm of binary numbers," IEEE Trans. Electron. Comput., vol. EC-14, pp. 863-867, Dec. 1965.
[6] D. Marino, "New algorithm for the approximate evaluation in hardware of binary logarithms and elementary functions," IEEE Trans. Comput., vol. C-21, pp. 1416-1421, Dec. 1972.
[7] H.-Y. Lo and Y. Aoki, "Generation of a precise binary logarithm with difference grouping programmable logic array," IEEE Trans. Comput., vol. C-34, pp. 681-691, Aug. 1985.
[8] F. J. Taylor, "An extended precision logarithmic number system," IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-31, pp. 232-234, Jan. 1983.
[9] D. M. Lewis, "An architecture for addition and subtraction of long word length numbers in the logarithmic number system," IEEE Trans. Comput., vol. 39, pp. 1326-1336, Nov. 1990.
[10] G. Wolrich et al., "A high performance floating point coprocessor," IEEE J. Solid-State Circuits, vol. SC-19, pp. 690-696, Oct. 1984.
[11] K. Molnar et al., "A 40 MHz 64-b floating point co-processor," in ISSCC Dig. Tech. Papers, 1989, pp. 48-49.
[12] D. A. Staver et al., "A 30-MFLOPS CMOS single-precision floating point multiply-accumulate chip," in ISSCC Dig. Tech. Papers, 1987, pp. 274-275.

Lawrence K. Yu (S'85-M'90) received the B.A.Sc. and M.A.Sc. degrees in electrical engineering from the University of Toronto, Toronto, Ont., Canada, in 1986 and 1990, respectively. He is presently employed at the University of Toronto as a Research Associate on the Hubnet project. His research interests include computer arithmetic and VLSI systems design.

David M. Lewis (M'88) received the B.A.Sc. degree with honors in engineering science from the University of Toronto, Toronto, Ont., Canada, in 1977, and the Ph.D. degree in electrical engineering in 1985. From 1982 to 1985 he was employed as a Research Associate on the Hubnet project, and developed custom integrated circuits for a 50-Mb/s local area network. He has been an Assistant Professor at the University of Toronto since 1985. His research interests include logic and circuit simulation, computer architecture for high-level languages, logarithmic arithmetic, and VLSI architecture. Dr. Lewis is a member of the ACM.
