Low Latency Floating-Point Division and Square Root Unit
IEEE TRANSACTIONS ON COMPUTERS, VOL. 69, NO. 2, FEBRUARY 2020
Abstract—Digit-recurrence algorithms are widely used in actual microprocessors to compute floating-point division and square root.
These iterative algorithms present a good trade-off in terms of performance, area and power. We present a floating-point division and
square root unit, which implements a radix-64 floating-point division and a radix-16 floating-point square root. To have an affordable
implementation, each radix-64 division iteration and radix-16 square root iteration are made of simpler radix-4 iterations: 3 radix-4
iterations in division and 2 in square root. Speculation is used between consecutive radix-4 iterations to get a reduced timing. There are
three different parts in digit-recurrence implementations: initialization, digit iterations, and rounding. The digit iteration is the iterative part
and it uses the same logic for several cycles. Division and square root partially share the initialization and rounding stages, whereas each
one has different logic for the digit iterations. The result is a low-latency floating-point divider and square root, requiring 11, 6, and 4 cycles
for double, single and half-precision division with normalized operands and result, and 15, 8 and 5 cycles for square root. One or two
additional cycles are needed in case of subnormal operand(s) or result.
The author is with ARM Ltd, CB1 9NJ Cambridge, U.K. E-mail: [email protected].
Manuscript received 1 Apr. 2019; revised 4 Oct. 2019; accepted 11 Oct. 2019. Date of publication 16 Oct. 2019; date of current version 13 Jan. 2020.
(Corresponding author: Javier D. Bruguera.)
Recommended for acceptance by N. Takagi.
Digital Object Identifier no. 10.1109/TC.2019.2947899
0018-9340 © 2019 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See ht_tp://www.ieee.org/publications_standards/publications/rights/index.html for more information.

1 INTRODUCTION
Division and square root are some of the most representative floating-point functions in modern processors. Although they are less frequent than the two basic arithmetic operations, addition and multiplication, poor performance when computing these operations can impact the overall processor performance.

For a low precision computation of these functions, it is possible to employ direct table look-up, bipartite tables [3], [24] (table look-up and addition), or low-degree polynomial or rational approximations [12], [15], [27], [28], but the area requirements become prohibitive for table-based methods when performing medium or high precision computations (such as the single and double-precision floating-point formats). More efficient alternatives are iterative algorithms [7]. On one hand, digit-recurrence methods [5], [7] have linear convergence and are based on subtraction; but their linear convergence sometimes leads to long latencies and makes them inadequate methods for these computations. High-radix digit-recurrence methods result in faster but bigger designs. On the other hand, multiplicative-based methods [4], [8], [20], [21], such as the Newton-Raphson and Goldschmidt algorithms, have quadratic convergence at the expense of greater hardware requirements.

The energy efficiency of both approaches has been recently analyzed and the conclusion is that the digit-recurrence approach is much more energy efficient than the multiplicative methods [17]. In addition, for the floating-point precisions of interest, double, single and half-precision, digit-recurrence methods are faster. Multiplicative methods rely on several iterations of a fused multiply-add (FMA) operation, and the latency of a single FMA is between 3 and 6 cycles [13], [22], [25]. In some cases, this is the latency of our proposed divider for single-precision.

In this paper the architecture of a floating-point unit implementing a radix-64 digit-recurrence divider and a radix-16 square root is described. A radix-r digit-recurrence algorithm, r being a power of 2, is an iterative algorithm where a radix-r digit, $\log_2(r)$ bits, of the result quotient or root, is obtained every iteration. To get an energy and timing efficient implementation both the division and the square root iteration are obtained by overlapping simpler radix-4 iterations. Hence, three radix-4 division iterations are overlapped in a single cycle providing 6 bits of the quotient per cycle, which is equivalent to a radix-64 iteration. Similarly, two radix-4 square root iterations are overlapped in a single cycle providing 4 bits of the root per cycle, which is equivalent to a radix-16 iteration.

A digit-recurrence division or square root algorithm with floating-point operands has three parts: pre-processing, digit iterations, and post-processing. The pre-processing includes operand unpacking, operand normalization (if required) and initialization. The digit iterations are the iterative part of the digit-recurrence algorithm. Post-processing consists of the rounding logic and right-shift in case of a subnormal result (in division only). The pre-processing and post-processing logic is mostly shared between division and square root, whereas the iterative part, the digit iterations, is specific to each operation.

To improve the timing of the division iteration we have used the divisor pre-scaling technique [6]. This is a well-known technique to simplify the quotient-digit selection function, probably the most timing critical step in a digit-recurrence algorithm. The selection function is simplified by
scaling the divisor to a value close enough to 1. This way, the radix-4 selection function is independent of the divisor. This scaling is carried out before the digit iterations.

Additionally, some other techniques have been used to improve the timing or latency of division and/or square root:

• To reduce the timing, speculation is used between consecutive radix-4 iterations in the cycle.
• The first iteration is not carried out in the iterative part but in the pre-processing stage; this way, latency is reduced by 1 cycle for given precisions.
  – Division: The first iteration, which gives the integer digit of the result, is carried out in parallel with the operands pre-scaling, contributing to save one cycle in single-precision.
  – Square root: The first iteration is skipped and integrated in the initialization stage. The partial root and the remainder for the second iteration are easily obtained from the input value. This way, the latency is reduced by 1 cycle in double and single-precision.

The result is a low-latency unit with 11, 6, and 4 cycles latency for double-precision, single-precision and half-precision division, respectively, and 15, 8, and 5 cycles latency for double-precision, single-precision and half-precision square root. These latencies include the pre-processing and post-processing cycles and correspond to a division or square root with normalized operands and result. In case of subnormal operands, one or two additional normalization cycles are needed. Similarly, in case of a subnormal division result, an additional cycle after the rounding cycle is added to right shift the result and adjust the exponent.

This unit has been implemented in a processor with a frequency of 3 GHz.

The floating-point division has been already described in [1]. In this paper, the square root implementation is discussed and some more details about the division implementation are provided. Only the operations on the mantissas are shown; the sign and exponent are obtained separately.

The rest of the paper is organized as follows. Section 2 is a brief description of the foundations of digit-recurrence division and square root. In Section 3 the general architecture is presented and the main features of the proposed unit are outlined. In Sections 4 and 5 the implementation of both the divider and the square root is described. Some considerations about how subnormal operands and results are processed are given in Section 6. Finally, in Section 7 the unit is compared with other implementations in current processors, and in Section 8 the main conclusions are presented.

The partial result before iteration i is defined as

$$R[i] = \sum_{j=0}^{i} p_j\, r^{-j}, \tag{1}$$

and each algorithm iteration is described by the following equations:

$$p_{i+1} = SEL(\widehat{rem}[i], \widehat{T}[i]), \tag{2}$$

$$rem[i+1] = r \cdot rem[i] - p_{i+1}\, f[i+1], \tag{3}$$

$\widehat{rem}[i]$ and $\widehat{T}[i]$ being estimations with a few bits of the remainder $rem[i]$ and the divisor (in case of division) or the partial result $R[i]$ (in case of square root), respectively. The number of bits in the estimation needed for the selection function SEL depends on the radix and the operation. Term $f[i+1]$ is different for each operation.

For a fast iteration, the remainder is kept in carry-save or signed-digit redundant representation. In our implementation, we have chosen a radix-2 signed-digit representation for the remainder, with a positive and a negative word.

On the other hand, note that because of the algorithm convergence conditions and the multiplication by r in Equation (3), the remainder will have several bits in the integer part; the number of integer bits depends on the radix, the digit set, and the operation.

Then, every iteration a radix-r digit of the result is obtained from the current remainder, and a new remainder is computed for the next iteration.

Thus, the number of iterations is

$$it = \lceil n / \log_2(r) \rceil, \tag{4}$$

n being the number of bits of the result, including the bits required for rounding.

The number of cycles is directly related to the number of iterations and to the number of iterations performed per cycle. Then, considering m iterations per cycle, the number of cycles is

$$cycles = \lceil it / m \rceil. \tag{5}$$

Equations (1) to (4) can be particularized to radix-4, $r = 4$, division and square root.

2.1 Radix-4 Division
The floating-point division of a dividend x and a divisor d produces a quotient $q = x/d$. The partial quotient before iteration i and the digit obtained at iteration i are called $Q[i]$ and $q_{i+1}$ respectively; then Equation (1) is rewritten as

$$Q[i] = \sum_{j=1}^{i} q_j\, 4^{-j}. \tag{6}$$
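A partial quotient built from signed digits, as in Equation (6), can be held as two non-negative words and assimilated by a subtraction. The following Python sketch illustrates the idea under assumptions that are not taken from the paper: each radix-4 digit occupies a 2-bit field, positive magnitudes go into one word and negative magnitudes into the other, and the helper names `pack_digits` and `partial_quotient` are hypothetical.

```python
from fractions import Fraction

def pack_digits(digits):
    """Pack signed radix-4 digits q_1..q_i (most significant first) into two words."""
    pos = neg = 0
    for q in digits:
        if not -2 <= q <= 2:
            raise ValueError("digit outside {-2, ..., +2}")
        pos = (pos << 2) | max(q, 0)    # magnitude of a positive digit
        neg = (neg << 2) | max(-q, 0)   # magnitude of a negative digit
    return pos, neg

def partial_quotient(digits):
    """Q[i] = sum of q_j * 4**-j for j = 1..i, obtained by assimilation (pos - neg)."""
    pos, neg = pack_digits(digits)
    return Fraction(pos - neg, 4 ** len(digits))
```

Because every digit is split into a non-negative part in each word, the subtraction `pos - neg` recovers the same value as the digit-by-digit sum in Equation (6).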
276 IEEE TRANSACTIONS ON COMPUTERS, VOL. 69, NO. 2, FEBRUARY 2020
Note that $f[i+1] = d$, and the initial value for the remainder is $rem[0] = x$.
For this implementation, it has been determined that
only the 6 most-significant bits (MSB) of the remainder,
three integer bits and three fractional bits, are required to
select the next quotient digit with Equation (7) [7].
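The radix-4 division recurrence can be exercised in software. The Python sketch below is only illustrative and relies on assumptions that are not taken from the unit described here: it works on exact fractions, it starts from x/4 so that the first digit produced is the integer digit, it selects each digit by rounding the full-precision ratio 4w/d instead of using a 6-bit estimate and a selection table, and the names `radix4_divide`, `w` and `digits` are hypothetical.

```python
from fractions import Fraction

def radix4_divide(x, d, iterations):
    """Radix-4 digit recurrence with digit set {-2..+2} and an idealised selection."""
    d = Fraction(d)
    w = Fraction(x) / 4          # assumed scaling: first digit is the integer digit
    q = Fraction(0)
    digits = []
    for j in range(iterations):
        # Idealised selection: nearest integer to 4*w/d, kept inside the digit set.
        digit = max(-2, min(2, round(4 * w / d)))
        w = 4 * w - digit * d    # same shape as Equation (3) with f = d
        q += Fraction(digit, 4 ** j)
        digits.append(digit)
    return q, digits, w
```

With x/d in [1, 2) the first digit comes out as +1 or +2, and after k digits the residual satisfies x - q*d = w / 4**(k-1), so the quotient error shrinks by a factor of 4 per iteration.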
$$S[i] = \sum_{j=0}^{i} s_j\, 4^{-j}, \qquad s_0 = 1. \tag{9}$$

$$s_{i+1} = SEL(\widehat{rem}[i], \widehat{S}[i]), \tag{10}$$
then
root remainder update; consequently, three radix-4 iterations wouldn't fit in one cycle and the performance would be degraded. Note that the area would be smaller because part of the carry-save adders could be removed.

The post-processing stage rounds the quotient/root. The final remainder and the unrounded result need to be assimilated, that is, the negative word is subtracted from the positive word, to get the remainder's sign and the non-redundant unrounded result, and the result is rounded to get the final quotient or root. In a floating-point division, the final quotient can be a subnormal number. In this case the mantissa needs to be right-shifted to have an IEEE standard compliant result [11]. The post-processing stage is not discussed further because it is a traditional floating-point rounding stage.

The algorithm used for the division and square root is the radix-4 digit-recurrence algorithm with three (division) or two (square root) iterations per cycle, and with a signed-digit representation of the quotient with digit set $\{-2, -1, 0, +1, +2\}$; that is, $r = 4$ and $a = 2$ are the radix and the digit set, respectively.

With the radix-4 algorithm, 2 bits of the quotient or root are obtained every iteration. As three radix-4 iterations are performed per clock cycle in division, 6 bits of the quotient are obtained every cycle, which is equivalent to a radix-64 divider. Similarly, two radix-4 iterations are performed per clock cycle in square root, and 4 bits of the root are obtained every cycle, which is equivalent to a radix-16 square root.

Let's now describe the main features of the division and square root implementations.

3.1 Floating-Point Division
The divider performs the floating-point division of a dividend, x, and a divisor, d, to obtain a quotient, $q = x/d$. The two operands need to be normalized, $x, d \in [1, 2)$, although subnormal operands are accepted; in this case, the subnormal operands are normalized before the digit iterations.

If the two operands are normalized in [1,2), the result is in [0.5,2); this way two bits to the right of the least-significant bit (LSB) of the quotient are needed for rounding, the guard and the round bits. The guard bit is used for rounding when the result is normalized, $q \in [1, 2)$, whereas the round bit is used for rounding when the result is not normalized, $q \in [0.5, 1)$. In this latter case, the result is left-shifted by 1 bit, and the guard and round bits become the LSB and the guard bit, respectively, of the normalized result.

However, to simplify the rounding, the final quotient is forced to be in $q \in [1, 2)$. Note that $q < 1$ only if $x < d$. This situation is detected in pre-processing and the dividend is left-shifted by 1 bit in such a way that $q = 2 \cdot x/d$ and $q \in [1, 2)$. Of course, the mantissa is the same as in $x/d$ but the exponent needs to be decremented.

Each iteration, a digit of the quotient is obtained by means of a selection function. In order to have a quotient-digit selection function independent of the divisor, the divisor has been scaled close to 1. To preserve the result the dividend needs to be scaled by the same amount as the divisor.

In addition, note that the first quotient digit, which is the integer digit of the result, can take only values $\{+1, +2\}$, and its calculation is much simpler than the calculation of the remaining digits. Then, it is obtained in parallel with the operand pre-scaling, saving one cycle in single-precision.

3.2 Floating-Point Square-Root
The unit computes the floating-point square-root of the operand, x, to obtain the root, $S = \sqrt{x}$. The operand needs to be normalized, $x \in [1, 2)$, although a subnormal operand is accepted; in this case, the subnormal operand is normalized before the digit iterations.

The normalized operand needs to be right shifted by 1 or 2 bits before the iterations. This is because:

1) The algorithm requires the operand to be in [0.5,1),
2) In case of an operand with an even exponent, after the shift to have the operand in [0.5,1) the exponent becomes odd; then the mantissa is divided by 2 and the exponent is incremented to have again an even exponent; otherwise, the square-root cannot be calculated.

Then, the input to the iterations is

$$x' = \begin{cases} x/2 & \text{if exponent is odd} \\ x/4 & \text{if exponent is even} \end{cases} \tag{14}$$

Consequently, $x' \in [0.25, 1)$ and the root will be in [0.5,1).

The first iteration is skipped and integrated into the initialization of the remainder and partial root, reducing the number of iterations by 1 without affecting the timing. The latency is related to the number of iterations and to the number of iterations per cycle, and skipping the first iteration results in 1-cycle latency reduction in double- and single-precision.

3.3 Early Termination Mode
There is an early-termination mode for exceptional operands. The early termination occurs when any of the operands, or the only square root operand, is NaN, infinity, or zero, or a power of 2 with normalized operands. In the latter case, three cases are differentiated:

1) Division by a power of 2. The result is obtained by merely incrementing or decrementing the exponent of the dividend.
2) Square root with an even power of 2. The operand is a power of 4. The result is a power of 2 and the calculation of the square root is carried out by merely halving the input exponent.
3) Square root with an odd power of 2. The mantissa of the result is $\sqrt{2}$ and the exponent is obtained by halving the input exponent minus 1.

3.4 Latency
For normal operands and result, the latency is composed of 1 cycle for pre-processing, several cycles for digit iterations, and 1 cycle for post-processing. There are no cycles for operand normalization, nor for right-shifting the result. Then, the latency is the number of cycles of the iterative part plus 2. Therefore, for practical floating-point formats, double, single and half-precision, DP, SP and HP, respectively, the number of bits n, the number of iterations, the number of cycles (see Equations (4) and (5)), and the latency are:

Division:
DP: n = 53, it = 27, cycles = 9, latency = 11
SP: n = 24, it = 12, cycles = 4, latency = 6
HP: n = 11, it = 6, cycles = 2, latency = 4
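The division figures above follow directly from Equations (4) and (5) plus the two pre- and post-processing cycles. A minimal Python sketch of that arithmetic is given below; the function name `division_latency` and its parameter names are hypothetical, and the defaults (radix 4, three iterations per cycle, two overhead cycles) are the values stated in the text for division with normalized operands and result.

```python
def division_latency(n, radix=4, iters_per_cycle=3, overhead_cycles=2):
    """Return (iterations, cycles, latency) for an n-bit result."""
    bits_per_iteration = radix.bit_length() - 1        # log2(radix) for a power of 2
    iterations = -(-n // bits_per_iteration)           # ceiling, Equation (4)
    cycles = -(-iterations // iters_per_cycle)         # ceiling, Equation (5)
    return iterations, cycles, cycles + overhead_cycles

for name, n in (("DP", 53), ("SP", 24), ("HP", 11)):
    print(name, division_latency(n))
```

Integer ceiling division is used instead of floating-point `ceil` so that the counts are exact for any n.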
Fig. 2. Pre-processing stage for division: pre-scaling and integer quotient digit calculation.
5) The next remainder, $rem[1]$, (see Equation (3)) is obtained from the non-redundant scaled dividend (positive word of the remainder) and the non-redundant scaled divisor (negative word of the remainder). The operands comparison is used in the selection of the scaled dividend as well, in such a way that the scaled dividend is 1-bit left-shifted if the divisor is larger than the dividend. On the other hand, the scaled divisor is 1-bit left-shifted if the quotient digit is +2 and is not shifted if the quotient digit is +1.

4.2 Digit Iteration
The actual implementation of the floating-point divider performs three radix-4 iterations per cycle. So, the logic has been optimized taking this fact into account. Fig. 3 shows the block diagram of a digit iteration cycle; that is, the computation of three radix-4 iterations. The block diagram in Fig. 3 is split into two parts, (1) quotient digit selection and (2) remainder calculation.

The remainders $rem[i+1]$, $rem[i+2]$ and $rem[i+3]$ are computed speculatively according to Equation (8). So, five remainders are computed every iteration, one remainder for each possible value of the quotient digit; the correct remainder is selected when the digit is obtained. Note that the remainder has to be left-shifted by two bits as part of the computation of the next remainder.

The quotient-digit selection uses an estimation of the remainder to obtain the next quotient digit (Equation (7)). As said in Section 2.1, it has been determined that only the 6 most-significant bits (MSB) of the remainder are required, three integer bits and three fractional bits. The quotient digit selection function is shown in Table 2 a. The intervals in $4 \cdot rem[i]$ for the selection of every digit have been obtained following the methodology described in [7].

For the selection of $q_{i+1}$ the 6-bit remainder estimate is computed with the 6-bit adder in front of the first SELECT block.

To select digits $q_{i+2}$ and $q_{i+3}$ the 9 MSBs of the speculative $rem[i+1]$ are assimilated. Note that the 9 MSBs are assimilated because although the selection function for $q_{i+2}$ only needs the 6 MSBs, 2 additional bits are required for the selection of $q_{i+3}$ because of the 2-bit left-shift of $rem[i+2]$, and the other additional bit is used to catch the carry into the least-significant position of the 8 bits.

Digit $q_{i+1}$ selects the 9 MSBs, among the 5 speculatively calculated MSBs, that are going to be used in the selection of $q_{i+2}$. Note that only 6 bits are used in the selection. The 6 MSBs so obtained may be different from the 6 MSBs obtained directly from $rem[i+1]$, because the +1 to complete the 2's complement of the sum word in the assimilation of $rem[i+1]$ is added at a different position. In the actual implementation in Fig. 3, it is added at the position of the 8th MSB whereas, in case of being obtained directly from $rem[i+1]$, it would be added at the position of the 6th MSB. Consequently, the carry into the 6th MSB can be different. This difference makes the end-points of the intervals in Table 2 a get a wrong selection when the carry into the 6th MSB is zero. This is corrected with the selection function shown in Table 2 b. Note that the selection of the interval end-points depends on the carry into the 6th MSB (carry column in the table).

Therefore, it is clear that the carry into the 6th bit is required for the selection of the $q_{i+2}$ digit. Hence, the 9-bit adder has to be split into a 6-bit adder and a 3-bit adder to get access to this carry.

In parallel with the selection of $q_{i+2}$, the 6 MSBs to be used in the selection of digit $q_{i+3}$ are computed speculatively for every value of $q_{i+2}$. Thus, the non-redundant estimation of $rem[i+2]$ is obtained in the five 7-bit adders, by adding the
shifted 7 MSBs of $rem[i+1]$ plus the 7 MSBs of $-q_{i+2} \cdot d$. Then, digit $q_{i+2}$ is used to select the correct adder output, and $q_{i+3}$ is selected according to Table 2 a.

This way, the delay of the logic in the cycle has been reduced with respect to a plain implementation of the three quotient-digit selection functions.

In the quotient-digit selection, SELECT block in the figure, the quotient digit is coded as a 1-hot 5-bit code $\{qp2, qp1, qz, qn1, qn2\}$, so that, for example, $qp2 = 1$, $qp1 = qz = qn1 = qn2 = 0$ if $q_{j+1} = +2$. The logic function to get every bit in the 1-hot 5-bit code is relatively simple, a 3-level 2-input gate logic function. Each time a new digit is obtained it is concatenated to the partial quotient.

5 SQUARE ROOT MICROARCHITECTURE
In this section the square root implementation is described. The focus is on skipping the first iteration in the pre-processing stage, and on the digit iterations stage.

5.1 Latency Reduction by Skipping the First Iteration
As explained in Section 2, the calculation of the square root is composed of several iterations, the number of iterations depending on the precision (number of bits) of the final result. The larger the number of iterations the larger the latency.

By skipping the first iteration the number of iterations is reduced by 1 and, for some floating-point precisions and organization, the latency is also reduced by 1 cycle. The first iteration is skipped and integrated in the initialization stage. The partial root and the remainder for the second iteration are easily obtained from the input value $x'$ in Equation (14).

5.1.1 New Initial Values for the Partial Root and the Remainder
As the first iteration is being skipped, the initial values for the partial root and the remainder are those for the second iteration in case the first iteration is not skipped. To get these new initial values it has to be taken into account that the first digit can take only values $\{-2, -1, 0\}$, because the integer digit is 1, $s_0 = 1$, and the result is in [0.5,1). This reduced set of possible values for the first digit is still further reduced if odd and even input exponents are considered separately.

Let us consider the square root operand being

$$x = 1.x_0 x_1 x_2 x_3 x_4 \ldots x_{n-1}.$$

As stated in Section 3.2, this operand will be right-shifted by 1 or 2 bits, depending on whether the operand's exponent is odd or even, to get $x'$, the input to the square root digit-recurrence algorithm (see Equation (14)).

In the following discussion, the initial values for remainder and partial root are $rem[0] = x' - 1$ and $S[0] = 1$, respectively (see Section 2.2)
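The operand pre-shift of Equation (14) and the initial values $rem[0] = x' - 1$, $S[0] = 1$ can be tried out with a small software model. The Python sketch below is an illustration under stated assumptions, not the implementation described in this paper: it uses exact fractions, the remainder update is the textbook radix-4 square root recurrence (its term $f[i+1]$ is not quoted in this section), digit selection is idealised by comparing $x'$ with squared midpoints between candidate roots, and the names `prescale`, `select_root_digit` and `radix4_sqrt` are hypothetical.

```python
from fractions import Fraction

def prescale(mantissa, exponent):
    """Equation (14): x' = x/2 for an odd exponent, x/4 for an even one."""
    shift = 1 if exponent % 2 else 2
    return Fraction(mantissa) / 2 ** shift, exponent + shift  # exponent is now even

def select_root_digit(x, S, u):
    """Idealised selection: digit s whose candidate S + s*u is nearest to sqrt(x)."""
    s = -2
    for cand in (-1, 0, 1, 2):
        mid = S + (cand - Fraction(1, 2)) * u   # midpoint below candidate `cand`
        if mid * mid <= x:                      # four comparisons, no square root
            s = cand
    return s

def radix4_sqrt(x_prime, iterations):
    x = Fraction(x_prime)
    S, rem = Fraction(1), x - 1                 # S[0] = 1, rem[0] = x' - 1
    digits = []
    for i in range(iterations):
        u = Fraction(1, 4 ** (i + 1))           # weight of the digit being selected
        s = select_root_digit(x, S, u)
        rem = 4 * rem - s * (2 * S + s * u)     # keeps rem = 4**(i+1) * (x - S**2)
        S += s * u
        digits.append(s)
    return S, rem, digits
```

For any $x' \in [0.25, 1)$ the first digit returned by this model lies in $\{-2, -1, 0\}$, which agrees with the reduced digit set discussed above.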
Therefore
Both the remainder and the partial root are kept in redundant representation. It has been determined that the 9 MSBs of the remainder (4 integer bits and 5 fractional bits) and the 5 MSBs of the partial root (the integer bit and 4 fractional bits) are enough for the root digit selection.

The assimilation of the 5 MSBs of the partial root gives the partial root interval. Taking into account that the partial root is in $[0.5, 1)$, that is, the lower half [0,0.5) of the range is not used, only 8 intervals, $I_0, \ldots, I_7$, are possible. Moreover, an additional interval $I_8$ is considered for the partial root being $S[i] = 1.0 \ldots 0$.

Once the interval is known, four selection constants are obtained; these selection constants correspond to the lower points of the remainder intervals for a root-digit equal to -1, 0, +1, and +2 [5], [7]. The selection constants are 7 bits wide; so only the 7 most-significant bits of the remainder estimate have to be compared against these selection constants to determine the next root digit. Then, using a set of four 7-bit comparators, the root-digit is obtained.

To speed up the digit selection in the second iteration in the cycle and to balance the delay of the remainder calculation and the digit selection in the first iteration in the cycle, the 9 most-significant bits (MSB) of the remainder $rem[i+1]$ are speculatively assimilated to their non-redundant representation. This allows the 9-bit adder in the selection of $s_{i+2}$ to be eliminated.
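The two steps of this selection, finding the partial root interval and comparing the remainder estimate against four constants, can be sketched as follows. This Python fragment is a hedged illustration only: the mapping of $I_0, \ldots, I_8$ to consecutive steps of 1/16 is an assumption derived from the bit counts given above, the actual selection constants are not reproduced here (the constants in the usage lines are arbitrary placeholders), and the names `partial_root_interval` and `compare_select` are hypothetical.

```python
from fractions import Fraction

def partial_root_interval(S):
    """Index k of interval I_k from the 5 MSBs (1 integer + 4 fractional bits) of S."""
    msbs = int(Fraction(S) * 16)        # truncate to 4 fractional bits
    if not 8 <= msbs <= 16:
        raise ValueError("partial root expected in [0.5, 1]")
    return msbs - 8                     # 0.1000b -> I0, ..., 0.1111b -> I7, 1.0000b -> I8

def compare_select(rem_estimate, constants):
    """Root digit from four ascending lower bounds for digits -1, 0, +1, +2."""
    # Each comparator that fires moves the digit one step up from -2.
    return -2 + sum(rem_estimate >= m for m in constants)

placeholder_constants = (-3, -1, 1, 3)  # NOT values from the paper, for illustration only
print(partial_root_interval(Fraction(3, 4)), compare_select(0, placeholder_constants))
```

Counting how many comparators fire mirrors the role of the four 7-bit comparators: the digit is the largest one whose lower bound does not exceed the remainder estimate.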
TABLE 3. Division and Square Root Latencies
TABLE 4. Latency Comparison
zSeries [10], IBM z13 [14], HAL Sparc [16], AMD K7 [19], AMD Jaguar [23]. The latencies shown in the table include the iteration cycles and the pre- and post-processing cycles, such as unpacking, pre-scaling and rounding. Note that no cycles for normalization are included here because it has been assumed that the operands are already normal; although, as stated previously, the proposed divider can handle subnormal inputs and output.

It has to be pointed out that the latency in Table 4 is in cycles, and the comparison is done only in terms of the latency without taking into account that different processors might run at different frequencies.

Most of the designs in the table use a multiplicative division algorithm, and one of them uses a radix-4 digit-recurrence implementation.

As shown in the table, our proposal gets much lower latencies. The multiplicative implementations are limited by the latency of the multiplier or multiply-and-accumulate units that, as stated in the introduction, can be very significant. On the other hand, the implementation in [10] uses a very low radix, which implies a high number of iterations, although its implementation is quite simple.

The Intel Penryn processor [2] implements a radix-16 combined division/square root unit by cascading two radix-4 iterations every cycle. Consequently, the latency is almost halved with respect to that of the radix-4 unit. As it is a combined unit, division and square root have the same latency.

Finally, the IBM z13 processor [14] has a divide and square root unit supporting single, double and quad precision, and all the hexadecimal floating-point data types. The underlying algorithm is a radix-8 division and radix-4 square root, generating 3 bits per cycle and 2 bits per cycle respectively. The major challenge was to perform a radix-8 divide or radix-4 square root step on a wide quad precision mantissa, 113 bits plus some extra rounding bits, and fit it in a single cycle.

In our implementation we have been able to put in a single cycle three and two radix-4 iterations, for division and square root respectively, by using speculation between iterations in the same cycle. In addition, there is only one pre-processing cycle before the iterations, unpacking of operands and pre-scaling, and one post-processing cycle for rounding after the iterations.

7.2 Area
The area is strongly dependent on the algorithm, digit-recurrence or multiplicative, being used for division and square root. In general, multiplicative algorithms have smaller area requirements if the multipliers or FMA units are shared with other floating-point instructions. In case of digit-recurrence algorithms, the radix has also a great impact: the larger the radix, the larger the area.

Therefore, the area of our division/square root unit is much larger than the area of the other units in the table; however, our focus was on obtaining a low latency division/square root unit.

The area of the rounding stage is not included in the discussion because it should be roughly the same for all the units.

7.2.1 Division/Square Root Unit in this Paper
Our unit uses a large number of 3-to-2 carry-save adders (CSA) and carry-propagate adders (CPA) in the iterative part. As for the divider, in the digit iteration stage there are five 58-bit CSAs per iteration for a total of fifteen 58-bit CSAs, five 9-bit CPAs, and five 7-bit CPAs, plus the logic for the selection of three quotient digits and the multiplexers, twelve 58-bit 4:1 muxes and two small-width 5:1 muxes. In addition, in the pre-scaling logic three 58-bit adders and some additional logic, two CSAs, multiplexers, and a reduced selection logic, are needed.

The digit iteration stage in the square root has ten 4-to-2 carry-save adders³, six 8-bit adders and eight 7-bit comparators in the digit selection logic, plus some additional logic.

3. Roughly, each 4-to-2 carry-save adder is equivalent to two CSAs.

7.2.2 Other Division/Square Root or Division Only Units
The area of the radix-4 divider [10] is much smaller than the area in our proposal. The redundant partial remainder consists of a sum part of 116 bits and a carry part of 28 bits (only 1 out of 4 carries is flopped); the 6 most-significant bits must be in non-redundant format because they are used for the quotient digit selection. The iteration is implemented with one stage of 116-bit 3-to-2 CSA and one stage of a 4-bit CPA; an additional 6-bit CPA is needed to deliver the 6 most-significant bits to the digit selection table.

The radix-16 division/square root unit in [2] has been obtained by concatenating two radix-4 iterations in the same cycle. Although the paper doesn't provide any area breakdown, some area information is provided in a later publication [18] by different authors. Replication is used in the remainder calculation at each radix-4 iteration; five partial remainders are computed speculatively and then one of them is selected once the digit is determined. Consequently, eight full-length 3-to-2 CSAs, four per radix-4 iteration, are used. The inputs to the 3-to-2 CSAs are the redundant partial remainder and the f-vector, which is different for division and square root; consequently, eight f-vectors are speculatively generated. Each of these full-length f-vectors is obtained with a sequencer and a set of muxes.

No area data is provided for the divide and square root unit of the IBM z13 processor [14]. However, we can guess that the wider quad-precision datapath adds a large area overhead.

Finally, multiplicative division implementations [16], [19], [23] involve only modest additional cost because the existing FP multipliers are reused to perform each algorithm iteration. Only a look-up table for the initial seed and some additional logic is needed to implement the divider. The total storage required for reciprocal and reciprocal square root initial approximations in [19], [23] is 69 Kbits. In [16] the table is even smaller, 3.5 Kbits.

7.3 Timing
For the critical path delay estimation the Logical Effort model [26] is used in this section. Table 5 summarizes the delay of the basic gates (upper part) and of the main modules in Figs. 2 and 3 (middle and lower parts respectively) in terms of FO4 and its equivalent in picoseconds. We have considered a FO4 delay of 6 ps. The load of every signal has been taken into account, so that a fanout of n adds a delay equivalent to $\log_4 n$ FO4.
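The fanout rule and the FO4-to-picosecond conversion stated above amount to a few lines of arithmetic. The Python sketch below shows one way to apply them; the helper names are hypothetical, and the stage delays passed to `path_delay_fo4` are placeholders chosen for illustration, since the gate and module delays of Table 5 are not reproduced in this text.

```python
import math

FO4_PS = 6.0   # FO4 delay considered in the text, in picoseconds

def fanout_penalty_fo4(fanout):
    """Extra delay in FO4 units for a signal driving `fanout` loads: log4(fanout)."""
    return math.log2(fanout) / 2.0

def to_picoseconds(delay_fo4):
    """Convert a delay expressed in FO4 units to picoseconds."""
    return delay_fo4 * FO4_PS

def path_delay_fo4(stages):
    """Sum intrinsic delay plus fanout penalty over (delay_fo4, fanout) pairs."""
    return sum(delay + fanout_penalty_fo4(fanout) for delay, fanout in stages)

# Placeholder path: a 1.0 FO4 gate driving 4 loads, then a 2.0 FO4 gate driving 1 load.
print(to_picoseconds(path_delay_fo4([(1.0, 4), (2.0, 1)])))
```

Under this rule a fanout of 4 costs exactly one FO4 and a fanout of 1 costs nothing, so heavily loaded select signals, such as those driving wide multiplexers, are charged accordingly.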