A 30-b Integrated Logarithmic Number System Processor
Abstract - This paper describes an integrated processor that performs addition and subtraction of 30-b numbers in the logarithmic number system (LNS). This processor offers 5-MOPS performance in 3-um CMOS technology, and is implemented in a two-chip set comprising 170K transistors. A new algorithm for linear approximation using different-sized approximation intervals in each of a number of segments is used. A second technique, nonlinear compression, further reduces table space by storing the difference between the exact value of the function and a linear approximation. This allows the implementation of logarithmic arithmetic using much less ROM than previously required, making high-speed logarithmic arithmetic possible in area comparable to single-precision floating-point processors. Compared to previous techniques for logarithmic arithmetic, a factor of 285 reduction in table space is realized.

Manuscript received January 17, 1991; revised May 13, 1991. This work was supported by the Natural Sciences and Engineering Research Council of Canada.
The authors are with the Department of Electrical Engineering, University of Toronto, Toronto, Ont., Canada M5S 1A4.
IEEE Log Number 9101852.

I. INTRODUCTION

Calculations requiring high precision and range can be performed with several different numeric representations, including floating-point (FP) or logarithmic number system (LNS) representations. FP representation is the most common number representation, while LNS representations have rarely been used. The scarcity of LNS processors is due to the difficulty of performing LNS addition and subtraction. While the LNS offers better accuracy than FP [1] and simple multiplication and division, addition and subtraction circuits have area that is exponential in numeric precision. Most applications require both additive and multiplicative operations, making LNS arithmetic impractical. As a result, the highest precision processor previously described offered only 12-b precision in a 3-um I2L technology [2]. No algorithms for higher precision LNS arithmetic have previously been described, making LNS arithmetic impractical for most applications.

This paper describes a new algorithm and architecture for performing LNS addition and subtraction, and its prototype implementation in an integrated processor. This processor offers 5-MOPS performance using a 30-b number representation, and is implemented in a two-chip set in 3-um CMOS. Although the prototype is slightly less accurate than a single-precision FP processor, due to the limited circuit area in 3-um technology, it offers higher performance than FP arithmetic in the same technology. A denser technology would allow LNS arithmetic to offer better speed, accuracy, and performance than single-precision FP in the same technology.

The central difficulty in implementing addition and subtraction operations in LNS is the need to approximate two nonlinear functions, which has typically been performed using lookup tables. In a straightforward implementation with F bits of fractional precision, roughly 2^F x 2^F words are required [3]. For this reason, published implementations have been restricted to 8 to 12 b of fractional precision [2], [4].

Efficient approximation of a nonlinear function using small lookup tables suggests the use of a Taylor series approach. Linear approximation [5], quadratic approximation [6], and linear approximation with a nonlinear difference function in a PLA [7] have been used to advantage in the approximation of some functions, such as log(x). This is possible for log(x) because of its smooth nature over a small range. In contrast, one of the functions that must be approximated in LNS arithmetic has a singularity that makes straightforward Taylor series approximations difficult. A previous attempt [8] at using linear approximation in LNS arithmetic achieved better precision for addition only by using a modified linear approximation, but is limited to about 3.85 additional bits of precision.

This paper describes an integrated LNS arithmetic processor using a new method for linear approximation of the LNS arithmetic functions. Using 3-um CMOS technology, the prototype offers 20-b precision, considerably greater than previous designs. Two techniques are used to increase the precision possible for a given amount of ROM. First, a new segmented technique for linear approximation is used to reduce the amount of table storage required to 561 kb, a factor of 127 reduction compared to the most efficient previous method [2]. A second table compression technique, linear approximation with difference coding, is used for further reduction, to 251 kb, a factor of 285 reduction.

The remainder of the paper is organized as follows. Section II introduces the LNS representation and the algorithms used. The chip optimization and design is described in Section III, followed by conclusions in Section IV.
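The abstract describes the second compression technique only as storing the difference between the exact value of the function and a linear approximation. The sketch below illustrates that idea under stated assumptions: f_a(x) = log2(1 + 2^x) as the tabulated function, an arbitrary block size and spacing, and illustrative identifier names. It is not the processor's actual table organization.

```python
import math

F = 20                       # fractional bits of precision (the prototype offers 20-b precision)
SCALE = 1 << F

def f_a(x):
    # Standard LNS addition function log2(1 + 2**x), assumed here as the tabulated function.
    return math.log2(1.0 + 2.0 ** x)

def difference_code(xs):
    """Difference-code one block of table entries against the straight line through the
    block's end points; return the end points plus the narrow per-entry differences."""
    y = [round(f_a(x) * SCALE) for x in xs]                  # exact fixed-point entries
    n = len(y) - 1
    deltas = [yk - (y[0] + (y[-1] - y[0]) * k // n) for k, yk in enumerate(y)]
    return y[0], y[-1], deltas

# A block of 33 entries spaced 2**-9 apart near x = -2 (block size and spacing are arbitrary).
xs = [-2.0 + k * 2.0 ** -9 for k in range(33)]
lo, hi, deltas = difference_code(xs)
print("bits per full-width entry:", max(abs(lo), abs(hi)).bit_length() + 1)
print("bits per stored difference:", max(1, max(abs(d) for d in deltas)).bit_length() + 1)
```

In this toy block the differences need far fewer bits than the full-width entries, because f_a is nearly linear over a short interval; this is the effect exploited to reach the 251-kb figure quoted above.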
II. NUMBER REPRESENTATION AND ALGORITHM DESIGN

The logarithmic number system represents a number x by its sign and its logarithm in some base b, together with a flag to indicate a zero value, which has no finite logarithm.
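As background for the algorithms that follow, the sketch below is a minimal reference model of LNS addition and subtraction, assuming the standard formulation used throughout the LNS literature: with the operand logarithms ordered so that e_x >= e_y and d = e_y - e_x <= 0, the sum has logarithm e_x + f_a(d) with f_a(d) = log_b(1 + b^d), and the difference has logarithm e_x + f_s(d) with f_s(d) = log_b(1 - b^d). The base b = 2, the zero encoding, and the identifier names are assumptions, and library math stands in for the processor's tables.

```python
import math

B = 2.0   # logarithm base, assumed to be 2

def f_a(d):
    # Addition function log_b(1 + b**d); smooth for all d <= 0.
    return math.log(1.0 + B ** d, B)

def f_s(d):
    # Subtraction function log_b(1 - b**d); singular as d -> 0.
    return math.log(1.0 - B ** d, B)

def lns_add(sx, ex, sy, ey):
    """Add two nonzero LNS numbers given as (sign, log_b of magnitude)."""
    if ey > ex:                      # order operands so the first has the larger magnitude
        sx, ex, sy, ey = sy, ey, sx, ex
    d = ey - ex                      # d <= 0
    if sx == sy:                     # same signs: true addition
        return sx, ex + f_a(d)
    if d == 0:                       # equal magnitudes, opposite signs: the result is zero
        return 0, float("-inf")      # zero needs its own encoding (flag assumed above)
    return sx, ex + f_s(d)           # opposite signs: true subtraction

# Example: 6.0 + (-2.5), each operand represented as (sign, log2 of magnitude).
s, e = lns_add(+1, math.log2(6.0), -1, math.log2(2.5))
print(s * B ** e)                    # prints approximately 3.5
```

The difficulty addressed in the rest of this section is that f_a, and especially f_s near its singularity, must be evaluated from tables in hardware, and a direct table over all 2^F fractional values of the argument is far too large at F = 20.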
... multiplication and calculation of df(x)/dx have been eliminated, at the cost of adding lookup tables for log(.) and exp(.). These tables can be shared for both addition and subtraction, and will be seen to be small. The resulting approximations are

    f_a(x + \Delta x) \approx f_a(x) + \operatorname{sgn}(\Delta x)\,\exp(\log(|\Delta x|) + x - f_a(x))        (6)

    f_s(x + \Delta x) \approx f_s(x) - \operatorname{sgn}(\Delta x)\,\exp(\log(|\Delta x|) + x - f_s(x)).       (7)

Previous linear approximations have generally used positive Delta-x, extracted as an unsigned bit field from the input parameter. In contrast, Delta-x is a signed number in this paper. This doubles the range that can be used for linear approximation with some fixed maximum error, and halves the size of the lookup tables.
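The step from the ordinary first-order Taylor expansion to (6) and (7) follows from a property of the LNS arithmetic functions. Assuming the standard definitions f_a(x) = log_b(1 + b^x) and f_s(x) = log_b(1 - b^x), the derivatives can themselves be written in logarithmic form:

    f_a'(x) = \frac{b^x}{1 + b^x} = b^{\,x - f_a(x)},
    \qquad
    f_s'(x) = \frac{-b^x}{1 - b^x} = -b^{\,x - f_s(x)}.

The first-order term Delta-x times f_a'(x) therefore equals sgn(Delta-x) times b^(log_b(|Delta-x|) + x - f_a(x)), which is the correction term in (6), and the minus sign in f_s' produces the minus sign in (7). This is why the derivative term costs only one log lookup and one exp lookup instead of a multiplier and a table of derivatives.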
The remaining problem is how to choose x and Delta-x such that r = x + Delta-x. The error in linear approximation with some maximum value of |Delta-x|, called Delta-x_max, is approximately (d^2 f(x)/dx^2) * Delta-x_max^2 / 2, so the choice of the points x and corresponding Delta-x_max must be made to meet the required error tolerance. For a given maximum error, the value of Delta-x_max is proportional to |(1/2) d^2 f(x)/dx^2|^(-1/2).

This choice of Delta-x_max forms the basis of segmented linear approximation. The domain of x is divided into a number of segments, and a worst-case value of Delta-x_max is used in each segment. Within a segment, the values of f(x) are stored at a set of points 2 * Delta-x_max apart (the factor of 2 arises from the fact that Delta-x is signed). Thus, segments are chosen in a manner that makes it easy to compute the correct segment and Delta-x_max for the given segment, while not wasting excessive table space.
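As a concrete illustration of how the segment sizes follow from the error bound, the sketch below assumes f_a(x) = log2(1 + 2^x) as the addition function, one segment per unit interval of x, and an error budget of half a least significant bit at F = 20 (the accuracy constraint applied below when choosing p_e and p_l). The processor itself rounds Delta-x_max down to a power of two, 2^(p_l); the sketch reports the raw bound and the resulting number of stored points per segment.

```python
import math

F = 20                     # fractional bits; the error budget is half a least significant bit
EPS = 2.0 ** -(F + 1)

def f_a(x):
    # Standard LNS addition function log2(1 + 2**x), assumed for illustration.
    return math.log2(1.0 + 2.0 ** x)

def d2f_a(x):
    # Second derivative of f_a, which determines the linear-approximation error.
    t = 2.0 ** x
    return math.log(2.0) * t / (1.0 + t) ** 2

def dx_max(x):
    # Largest |dx| keeping the error f''(x) * dx**2 / 2 below EPS.
    return math.sqrt(2.0 * EPS / d2f_a(x))

# One segment per unit interval [r_i, r_i + 1); for f_a on a negative interval the worst
# case (largest second derivative) is at the right-hand end of the segment.
total = 0
for r_i in range(-16, 0):
    worst = dx_max(r_i + 1.0)
    points = math.ceil(1.0 / (2.0 * worst))      # stored points are 2 * dx_max apart
    total += points
    print(f"segment [{r_i:3d}, {r_i + 1:3d}): dx_max = 2**{math.log2(worst):6.2f}, {points:5d} points")
print("stored points for f_a over [-16, 0):", total)
```

The trend matches the text: the allowable Delta-x_max grows, and the per-segment point count shrinks, as the segment moves toward more negative arguments.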
A simple technique for choosing Delta-x_max is to partition the binary representation of r into several parts, specifically r_i, r_h, r_l, and r_e, such that r = r_i + r_h + r_l + r_e. Also define r_f = r_i + r_h. The linear approximation will be performed with x = r_f and Delta-x = r_l. The value r_e is ignored, as it is chosen to be too small to affect the result. Although r_i is not directly used in this calculation, it is used later to select a segment.

The partitioning of r into bit fields can be described by two integers p_e and p_l, with p_e < p_l and p_l < 0. Using segmented linear approximation, p_e and p_l are functions of r, but are constant within each segment. The corresponding value of Delta-x_max is 2^(p_l). Given some p_e and p_l, the values of r_i, r_h, r_l, and r_e are defined by (8)-(11). The notation r_m ... r_n means the value of the binary representation of bits m through n inclusive of r (bit I - 1 being the most significant integer bit):

    r_e = r_{p_e - 1} \cdots r_{-F}                    (8)
    r_l = r_{p_l} \cdots r_{p_e} - 2^{p_l}             (9)
    r_h = r_{-1} \cdots r_{p_l + 1} + 2^{p_l}          (10)
    r_i = r_{I - 1} \cdots r_{0}.                      (11)

Using this partitioning, r_i is the integer part of r, r_h is a positive quantity with -1 - p_l bits that depend upon r, and r_l is a signed quantity with p_l - p_e + 1 significant bits and |r_l| <= 2^(p_l). Finally, 0 <= r_e < 2^(p_e), and r_i <= r < r_i + 1.

Combining this partitioning of r with the approximations (6) and (7) leads to the formulas (12) and (13) as approximations for f_a(r) and f_s(r):

    f_a(r) \approx f_a(r_f) + \operatorname{sgn}(r_l)\,\exp(\log(|r_l|) + r_f - f_a(r_f))        (12)

    f_s(r) \approx f_s(r_f) - \operatorname{sgn}(r_l)\,\exp(\log(|r_l|) + r_f - f_s(r_f)).       (13)

The choice of segments and Delta-x_max is made to meet error tolerance requirements as well as result in a simple implementation. The values of p_e and p_l are chosen to meet the accuracy constraint that the error due to linear approximation should be smaller than half a least significant bit. The function f_a(r) is smooth everywhere, allowing a relatively simple choice of segments based on intervals [r_i, r_i + 1). The value of Delta-x_max increases with more negative r_i. For r < -1, f_s(r) allows a similar treatment, but the singularity at 0 requires a different approach for r in [-1, 0). In this region the interval is divided into segments [-2^(-i), -2^(-i-1)). The segment size and Delta-x_max both decrease as r approaches 0. Table I shows the sizes of intervals and segments for the functions, to within a constant factor. The effect of segmented linear approximation for F = 4 is shown in Fig. 2. The crosses mark the points stored in lookup tables, and the lines represent the range of linear approximation about each point. The arithmetic performed by the remainder of the processor and the corresponding data paths are shown in Figs. 3 and 4, respectively. Each step in the algorithm and the corresponding ...

TABLE I
SEGMENT CHOICE

Fig. 2. Linear approximation intervals for F = 4.
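The following is a short sketch of the bit-field partitioning (8)-(11) and the interpolation formula (12). The pair (p_e, p_l) is picked by hand here rather than taken from Table I, the function f_a(x) = log2(1 + 2^x) is the standard formulation assumed for illustration, floating-point log/exp stand in for the processor's log and exp tables, and the identifiers are illustrative only.

```python
import math

F = 20                                            # fractional bits in the representation

def f_a(x):
    # Standard LNS addition function log2(1 + 2**x), assumed for illustration.
    return math.log2(1.0 + 2.0 ** x)

def partition(r, p_e, p_l):
    """Split r (negative, F fractional bits) into r_i + r_h + r_l + r_e per (8)-(11)."""
    n = round(r * (1 << F))                       # the bits of r, scaled by 2**F
    def field(hi, lo):
        # Value of bits hi .. lo of r (bit k has weight 2**k).
        return ((n >> (lo + F)) & ((1 << (hi - lo + 1)) - 1)) * 2.0 ** lo
    r_e = field(p_e - 1, -F)                      # (8)
    r_l = field(p_l, p_e) - 2.0 ** p_l            # (9): signed, |r_l| <= 2**p_l
    r_h = field(-1, p_l + 1) + 2.0 ** p_l         # (10)
    r_i = math.floor(r)                           # (11): the integer part of r
    return r_i, r_h, r_l, r_e

def f_a_approx(r, p_e, p_l):
    """Evaluate f_a(r) by the interpolation formula (12), with floating-point
    log/exp standing in for the processor's log and exp tables."""
    r_i, r_h, r_l, r_e = partition(r, p_e, p_l)
    r_f = r_i + r_h
    if r_l == 0.0:
        return f_a(r_f)
    corr = 2.0 ** (math.log2(abs(r_l)) + r_f - f_a(r_f))
    return f_a(r_f) + math.copysign(corr, r_l)

# One hand-picked pair (p_e, p_l); Table I's actual per-segment values are not used here.
r = -2.37109375
r_i, r_h, r_l, r_e = partition(r, p_e=-19, p_l=-11)
assert abs((r_i + r_h + r_l + r_e) - r) < 1e-12        # the four fields reassemble r
print(abs(f_a_approx(r, p_e=-19, p_l=-11) - f_a(r)))   # linear-approximation error only
```

Summing the four fields reproduces r exactly, and the printed error is the linear-approximation error for the chosen Delta-x_max = 2^(p_l).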
TABLE II
TABLE SIZES FOR LINEAR EXTRAPOLATION

    Table    Words    Width
    exp      4096     23
    ...

Fig. 3. Addition algorithm.

    stage    operation
    1        ...
    2        generate p_e, p_l
    3        partition r into r_i, r_h, r_l, r_e; r_f = r_i + r_h
    4        frf = fn_tbl(r_f)
    5        lrl = log_tbl(|r_l|)
    6        fd = r_f - frf
    7        lcor = lrl + fd
    8        cor = exp_tbl(lcor)
    9        ff = frf + cor if r_l > 0; ff = frf - cor if r_l < 0
    10       e_z = e_x + ff
TABLE IV
ROM SIZES

    ROM    Words    Width    Total Bits
    ...
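The sketch below follows the recoverable stages of Fig. 3 for a same-sign addition. Stage 1 is taken here to form r = e_y - e_x with e_x the larger logarithm (an assumption), p_e and p_l are fixed by hand instead of being generated per segment in stage 2, and quantized library math stands in for the fn, log, and exp ROMs of Tables II and IV. All identifiers not shown in Fig. 3 are assumptions.

```python
import math

F = 20
def q(x):
    # Quantize to F fractional bits; stands in for reading a ROM entry.
    return round(x * (1 << F)) / (1 << F)

# ROM stand-ins (the chip stores these in segmented ROMs; see Tables II and IV).
def fn_tbl(r_f):  return q(math.log2(1.0 + 2.0 ** r_f))   # f_a at the tabulated point r_f
def log_tbl(v):   return q(math.log2(v))
def exp_tbl(v):   return q(2.0 ** v)

def lns_add_pipeline(e_x, e_y, p_e=-14, p_l=-6):
    """Follow the stages of Fig. 3 for addition of two same-signed LNS operands."""
    if e_y > e_x:                                # order so that e_x is the larger logarithm
        e_x, e_y = e_y, e_x
    r = e_y - e_x                                # stage 1 (assumed): r <= 0
    # stage 2 (generate p_e, p_l for r's segment) is replaced by the fixed arguments above
    n = round(r * (1 << F))                      # stage 3: partition r per (8)-(11)
    def field(hi, lo):
        return ((n >> (lo + F)) & ((1 << (hi - lo + 1)) - 1)) * 2.0 ** lo
    r_l = field(p_l, p_e) - 2.0 ** p_l
    r_h = field(-1, p_l + 1) + 2.0 ** p_l
    r_f = math.floor(r) + r_h                    # r_f = r_i + r_h
    frf = fn_tbl(r_f)                            # stage 4
    if r_l == 0.0:
        return e_x + frf                         # no correction term needed
    lrl = log_tbl(abs(r_l))                      # stage 5
    fd = r_f - frf                               # stage 6
    lcor = lrl + fd                              # stage 7
    cor = exp_tbl(lcor)                          # stage 8
    ff = frf + cor if r_l > 0 else frf - cor     # stage 9
    return e_x + ff                              # stage 10: e_z = e_x + ff

print(2.0 ** lns_add_pipeline(math.log2(6.0), math.log2(2.5)))   # approximately 8.5
```

On the chip each stage corresponds to one pipeline step, so one result is produced per clock cycle once the ten-stage pipeline is full.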
LNS processors using the algorithms described here cannot implement double-precision arithmetic in reasonable area due to the O(2^(F/2)) dependency of circuit area on precision. Implementation of high-precision LNS addition will require the development of different algorithms.
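To put the O(2^(F/2)) dependency in perspective, the short calculation below compares the table growth implied by that scaling at the prototype's F = 20 with F = 52, the fraction width of IEEE double precision; the choice of 52 is an assumption made for illustration, not a figure from the text.

```python
# Table growth implied by the O(2**(F/2)) circuit-area dependency quoted above.
for F in (12, 20, 52):     # earlier designs, this prototype, double-precision-FP-like accuracy
    print(f"F = {F:2d}: relative table size ~ 2**{F // 2} = {2 ** (F // 2)}")
```

Under this scaling, moving from F = 20 to F = 52 costs roughly a factor of 2^16 in table size, which is why different algorithms would be needed for double precision.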
V. CONCLUSIONS

This paper has described the architecture of an integrated processor for 30-b LNS arithmetic. Two techniques are used to achieve this precision in moderate circuit area. Linear approximation of the LNS arithmetic functions using logarithmic arithmetic is shown to be simple due to the particular functions involved. A segmented approach to linear approximation minimizes the amount of table space required. Subsequent nonlinear compression of each lookup table leads to a further reduction in table size. The result is that a factor of 285 reduction in table size is achieved, compared to previous techniques.

The circuit area of the implementation is minimized by optimizing the table parameters, using a computer program that accurately models ROM area. The implementation is highly pipelined, and produces one result per clock cycle using a ten-stage pipeline. The architecture could be scaled using modern technology to higher precision and performance, as well as reduced latency. As a result, LNS arithmetic can offer higher speed and better accuracy than a single-precision FP processor in smaller circuit area.

REFERENCES

[3] E. E. Swartzlander and A. G. Alexopoulos, "The sign/logarithm number system," IEEE Trans. Comput., vol. C-24, pp. 1238-1242, Dec. 1975.
[4] J. H. Lang, C. A. Zukowski, R. O. LaMaire, and C. H. An, "Integrated-circuit logarithmic units," IEEE Trans. Comput., vol. C-34, pp. 475-483, May 1985.
[5] M. Combet, H. Van Zonneveld, and L. Verbeek, "Computation of the base two logarithm of binary numbers," IEEE Trans. Electron. Comput., vol. EC-14, pp. 863-867, Dec. 1965.
[6] D. Marino, "New algorithm for the approximate evaluation in hardware of binary logarithms and elementary functions," IEEE Trans. Comput., vol. C-21, pp. 1416-1421, Dec. 1972.
[7] H.-Y. Lo and Y. Aoki, "Generation of a precise binary logarithm with difference grouping programmable logic array," IEEE Trans. Comput., vol. C-34, pp. 681-691, Aug. 1985.
[8] F. J. Taylor, "An extended precision logarithmic number system," IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-31, pp. 232-234, Jan. 1983.
[9] D. M. Lewis, "An architecture for addition and subtraction of long word length numbers in the logarithmic number system," IEEE Trans. Comput., vol. 39, pp. 1326-1336, Nov. 1990.
[10] G. Wolrich et al., "A high performance floating point coprocessor," IEEE J. Solid-State Circuits, vol. SC-19, pp. 690-696, Oct. 1984.
[11] K. Molnar et al., "A 40-MHz 64-b floating point co-processor," in ISSCC Dig. Tech. Papers, 1989, pp. 48-49.
[12] D. A. Staver et al., "A 30-MFLOPS CMOS single-precision floating-point multiply-accumulate chip," in ISSCC Dig. Tech. Papers, 1987, pp. 274-275.

Lawrence K. Yu (S'85-M'90) received the B.A.Sc. and M.A.Sc. degrees in electrical engineering from the University of Toronto, Toronto, Ont., Canada, in 1986 and 1990, respectively.
He is presently employed at the University of Toronto as a Research Associate on the Hubnet project. His research interests include computer arithmetic and VLSI systems design.