Floating Point Elsevier
Abstract
The high integration density of current nanometer technologies allows the implementation of complex floating-point applications in a single FPGA. In this work the intrinsic complexity of floating-point operators is addressed targeting configurable devices, making design decisions that provide the most suitable performance/standard-compliance trade-offs. A set of floating-point libraries composed of adder/subtracter, multiplier, divider, square root, exponential, logarithm and power function operators is presented. Each library has been designed taking into account the special characteristics of current FPGAs, and with this purpose we have adapted the (software-oriented) IEEE floating-point standard to a custom FPGA-oriented format. Extended experimental results validate the design decisions made and prove the usefulness of reducing the format complexity. © 2011 Elsevier B.V. All rights reserved.
Article history: Available online 7 May 2011
Keywords: Floating-point arithmetic; FPGAs; Library of operators; High performance
1. Introduction

Current deep sub-micron technologies allow manufacturing of FPGAs with extraordinary logic density and speed. The initial challenges related to FPGA programmability and large interconnection capacitances (poor performance, low logic density and high power dissipation) have been overcome while providing attractive low cost and flexibility [1]. Subsequently, the use of FPGAs in the implementation of complex applications is increasingly common, but relatively new when dealing with floating-point applications ranging from scientific computing to financial or physics simulations [2-4]. This is a field of increasing research activity due to the performance and efficiency that FPGAs can achieve. The peak FPGA floating-point performance is growing significantly faster than its CPU counterpart [5], while FPGA energy efficiency outperforms that of CPUs or GPUs [6]. Additionally, FPGA flexibility and inherent fine-grain parallelism make these devices ideal candidates for hardware acceleration, improving on GPU capabilities for a particular set of problems with complex datapaths or control and data inter-dependencies [7]. FPGA flexibility also allows the use of tailored precision, which can significantly improve certain applications. Furthermore, new FPGA architectures have embedded resources which can simplify the implementation of floating-point operators. However, the large and deeply pipelined floating-point units require careful design to take advantage of the specific FPGA features. Designing this kind of application from scratch is almost
Corresponding author.
E-mail addresses: [email protected] (P. Echeverría), [email protected] (M. López-Vallejo).
doi:10.1016/j.micpro.2011.04.004
impossible or makes the design cycle extremely long. Thus, the availability of complete and fully characterized floating-point libraries targeting FPGAs has become a must. The IEEE standard for binary floating-point arithmetic was conceived to be implemented through custom VLSI units developed for microprocessors. However, if the target hardware is an FPGA, the internal architecture of these operators must be highly optimized to take advantage of the FPGA architecture [8]. Furthermore, implementations with slight deviations from the standard can be of great interest, since many applications can afford some accuracy reduction [3,9], given the important savings that can be achieved: reduced hardware resources and increased performance. Several approaches have addressed the hardware implementation of a set of floating-point operators [10-12], but none of them includes the wide and general analysis carried out here. Some works only include the basic operators (adder/subtracter, multiplier, divider and square root) [10]. Other works focus on particular floating-point operator implementations [11-13]. Regarding advanced operators (exponential, logarithm and power functions) few works can be found, [14-16] standing out, with implementations of the exponential and logarithm functions. In [8], the potential of FPGAs for floating-point implementations is exploited focusing on the use of internal fixed formats and error analysis, which requires specific analysis for every application as well as supporting tools. Therefore, floating-point operators are mainly replicated in hardware without tailoring the format to the applications. This is the scenario we have focused on, improving the performance of floating-point units by taking advantage of FPGA flexibility. We have tuned the architecture of floating-point operators to get the best performance-cost trade-off with slight deviations from the
standard. This approach was also discussed in [17], but it is extended here in several ways:

- We have included advanced operators.
- A more complete set of deviations is studied.
- We perform an in-depth analysis of the implications of the deviations.
- We study the replicability of the operators.
- We provide a set of recommendations to achieve the resolution and accuracy of the standard with high performance.

Our proposed libraries include both conventional and advanced operators. Starting from an almost fully-compliant library1 (Std), we have made several design decisions that allow clear improvements in terms of area and performance. These design decisions include the substitution of denormalized numbers by zero, the use of truncation rounding and the definition of specific hardware flags that allow the use of extended bit widths internally. The interest of this work is focused on the impact of those decisions on floating-point operators and not on presenting new architectures. Thus, the main contributions of this work are the following:

- A thorough analysis of the implications of the proposed design decisions has been carried out, focusing on the performance-accuracy trade-offs. This is the base of a set of recommendations that can be considered when a complex floating-point application is implemented in configurable hardware.
- A complete set of varying-accuracy floating-point libraries has been developed, including both conventional (adder/subtracter, multiplier, divider and square root) and advanced (exponential, logarithm and power functions) operators.
- A systematic approach based on specific interfaces has been adopted, allowing the use of extended bit widths. It simplifies the implementation of complex applications and reduces the resources needed for chained operators, fitting in a single FPGA with better performance.
- Two final FPGA-oriented libraries with significant hardware improvements have been implemented, taking advantage of the proposed design decisions.
The paper structure is as follows: Section 2 summarizes the floating-point format. Section 3 presents the key design decisions that are proposed, while Section 4 describes the particular architecture of each operator. Experimental results are thoroughly discussed in Section 5, paying special attention to the influence of the proposed design decisions. Finally, Section 6 introduces a hardware library specially designed for FPGAs and Section 7 draws some conclusions.

2. Floating-point format IEEE 754

The IEEE Standard [18] is mainly designed for software architectures, usually using 32 bit words (single precision) or 64 bit words (double precision). Each word (Fig. 1) is composed of a sign (s, 1 bit), a mantissa (mnt, mb bits) and an exponent (exp, eb bits), the value of a number being:

value = (-1)^s · h.mnt · 2^(exp - bias)    (1)
where h is an implicit bit known as the hidden bit and the bias is a constant that depends on eb, its value being 2^(eb-1) - 1. With this number representation the floating-point format can represent zeros, infinities, exceptions (Not a Number, NaN) and two number types: normal numbers (normalized) and numbers very close to zero (denormalized). The differentiation among these five types is based on the exponent and mantissa values. Table 1 depicts all possible combinations of exponent and mantissa values. The standard is specifically designed to handle these five number types sharing a common format while maximizing the total set of numbers that are represented. Combining these two facts increases the complexity of the arithmetic units because, in addition to the calculation unit itself, a preprocessing (also known as prenormalization) of the input numbers and a postprocessing (also known as postnormalization) of the output numbers are needed, see Fig. 2. Therefore, when implementing a floating-point operator, the hardware required is not only devoted to the calculation unit itself; additional logic is needed just to handle the complexity of the format. This logic represents a significant fraction of the area of a floating-point unit, 48% of the logic on average for the studied operators, as will be shown in Section 5. In a general way, preprocessing logic includes:

- Analysis of the number type of the inputs, which includes exponent and mantissa analysis.
- Determination of operation exceptions due to the number type or sign of the inputs (square root, logarithm).
- Normalization of inputs.
- Conversion of inputs to the format required by the calculation unit.

Table 1. Types of floating-point numbers.

Type           Exponent          Mantissa       Value
Zero           0                 0              ±0
Denormalized   0                 ≠0 (h = 0)     Eq. (1)
Normalized     1 to 2^eb - 2     any (h = 1)    Eq. (1)
Infinities     2^eb - 1          0              ±∞
NaN            2^eb - 1          ≠0             NaN

Fig. 2. Structure of a floating-point operator: prenormalization, calculation and postnormalization stages producing the result (z).

1 Only some software issues, such as exception handling with additional flags and signaling NaNs (not a number), are not implemented.
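The field layout of Table 1 and Eq. (1) can be checked with a short Python sketch (the helper names below are ours, not from the paper; eb = 8 and mb = 23 correspond to single precision):

```python
import struct

def decode_float32(x):
    """Split an IEEE 754 single-precision number into its (s, exp, mnt)
    fields and classify it into the five number types of Table 1."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    s = bits >> 31
    exp = (bits >> 23) & 0xFF        # eb = 8 exponent bits
    mnt = bits & 0x7FFFFF            # mb = 23 mantissa bits
    if exp == 0:
        kind = "zero" if mnt == 0 else "denormalized"   # hidden bit h = 0
    elif exp == 0xFF:
        kind = "infinity" if mnt == 0 else "NaN"
    else:
        kind = "normalized"                             # hidden bit h = 1
    return s, exp, mnt, kind

def value(s, exp, mnt, eb=8, mb=23):
    """Eq. (1): (-1)^s * h.mnt * 2^(exp - bias), with denormalized numbers
    using h = 0 and an effective exponent of 1 - bias."""
    bias = 2 ** (eb - 1) - 1
    if exp == 0:                     # zero or denormalized
        return (-1) ** s * (mnt / 2 ** mb) * 2 ** (1 - bias)
    return (-1) ** s * (1 + mnt / 2 ** mb) * 2 ** (exp - bias)

print(decode_float32(0.15625))   # (0, 124, 2097152, 'normalized')
```

Reconstructing the value from the decoded fields with `value(0, 124, 2097152)` returns the original 0.15625, i.e. 1.25 · 2^-3.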
In the same way, postprocessing logic includes:

- Rounding of the result to adjust it to the format precision (mantissa and exponent correction).
- Determination of result exceptions.
- Formatting of the output to fit the result number type.

3. Floating-point units for FPGAs: adapting the format and standard compliance

The implementation of a floating-point application in hardware using FPGAs requires the adaptation of the software-oriented floating-point format to this particular context to obtain the best performance. This can be accomplished by simplifying the complexity of the format so that its associated processing is reduced. Following this idea, we have developed four different floating-point custom formats where three key design decisions have been gradually introduced:

1. Simplification of denormalized numbers.
2. Limitation of rounding to just truncation towards zero.
3. Introduction of a specific hardware representation for the number type.

These decisions affect three features of the format that are responsible for most of the preprocessing and postprocessing overhead: handling denormalized numbers (1), rounding (2) and handling the five number types with a common format (3). Next, the implications of these decisions are studied.

3.1. Simplification of denormalized numbers

The use of denormalized numbers is responsible for most of the logic needed during preprocessing and postprocessing. However, most floating-point arithmetic algorithms work with a normalized mantissa (leading bit equal to 1) in the calculation unit. Thus, denormalized numbers have to be converted to normalized ones during preprocessing. This requires the detection of the leading one bit of the mantissa, a left shift of the mantissa and an adjustment of the exponent value depending on the position of the leading one. During postprocessing the result of the calculation unit is analyzed to determine the number type it corresponds to and to make some adjustments to its exponent and mantissa.
Again, most logic related to these two tasks is required for handling the case of results that are denormalized numbers. Consequently, the use of denormalized numbers requires significant resources, negatively affecting the performance of the arithmetic units, as the slowest paths are usually related to the handling of those denormalized numbers. However, their use does not contribute substantially to most applications because denormalized numbers (i.e. single precision, eb = 8 and mb = 23):

- Represent a small and infrequent part of the format: most of the floating-point format is reserved for normalized numbers, with 2^mb · (2^eb - 2) different normalized numbers to only 2^mb denormalized ones (i.e. 2^23 · 254 normalized to 2^23 denormalized for single precision). Furthermore, their value, < 2^(2 - 2^(eb-1)) (i.e. < 2^-126), also makes their use very infrequent except for some special applications.
- Compromise the accuracy of results: while normalized numbers have a precision of mb + 1 bits, the denormalized precision varies from mb bits down to 1 depending on the leading one position. This compromises the accuracy of the result of any operation where denormalized numbers are involved.
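The magnitudes involved can be reproduced in software; a minimal Python sketch (helper names are ours):

```python
import struct

def bits_to_float32(u):
    """Reinterpret a 32-bit pattern as an IEEE 754 single-precision value."""
    return struct.unpack(">f", struct.pack(">I", u))[0]

min_denorm = bits_to_float32(0x00000001)  # smallest denormalized: 2**-149
max_denorm = bits_to_float32(0x007FFFFF)  # largest denormalized, just below...
min_norm   = bits_to_float32(0x00800000)  # ...the smallest normalized: 2**-126

print(min_denorm == 2.0 ** -149)  # True
print(max_denorm < min_norm)      # True: all denormals lie below 2**-126
# The smallest denormal has a single significant bit, so a rounding error
# inherited from a previous operation can reach 100% of its value.
```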
Therefore, one way to simplify the format and reduce logic is handling denormalized numbers as zeros, so that all the related logic is eliminated. Actually, all commercial FPGA floating-point libraries [19,20] follow this scheme, and previous works such as [17,12] also deviate this way from the standard.

Deviation from the standard. The cost of this solution is related to resolution and accuracy. First, when denormalized numbers are replaced by zeros we lose resolution around zero, as the 2^mb (i.e. 2^23) numbers with values between 2^(2 - 2^(eb-1)) - 2^(2 - 2^(eb-1) - mb) and 2^(2 - 2^(eb-1) - mb) are removed (i.e. between 2^-126 - 2^-149 and 2^-149 for single precision). Second, we lose accuracy because we operate with zeros at the inputs or obtain a zero result when we have a denormalized input or output. However, this loss of accuracy is relative, as the resolution of a denormalized number depends on the position of its leading 1. Due to the lack of resolution or due to a rounding from a previous operation, the maximum relative error a denormalized number can reach is up to 100% of its value,2 much bigger than the maximum relative error for a normalized number, 2^-mb.

3.2. Truncation rounding

The exact result of any operation can need more mantissa bits than the bits provided by the format. In this case the result needs to be rounded to fit the format, as its value will lie between two consecutive representable floating-point numbers. In the IEEE standard we can find four different rounding methods:

- Nearest: rounding to the nearest value.
- Up: towards +∞.
- Down: towards -∞.
- Zero: towards 0.
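The difference between round to nearest and truncation towards zero can be sketched in a few lines of Python (a simplified model for positive inputs only; the function name is ours):

```python
import math

def round_mantissa(x, mb=23, mode="nearest"):
    """Round positive x to a value with an mb-bit mantissa.  Only two of
    the four IEEE modes are modeled: 'nearest' and 'zero' (truncation)."""
    if x == 0.0:
        return 0.0
    e = math.floor(math.log2(x))      # exponent of the leading 1
    scaled = x / 2 ** (e - mb)        # mantissa as integer part + discarded bits
    if mode == "zero":
        m = math.floor(scaled)        # drop the round and sticky bits
    else:
        m = round(scaled)             # use the discarded bits to round
    return m * 2 ** (e - mb)

x = 1.0 + 3 * 2 ** -25                    # 0.75 ulp above 1.0
print(round_mantissa(x, mode="zero"))     # 1.0 (error of 0.75 ulp, below 1 ulp)
print(round_mantissa(x, mode="nearest"))  # 1.0 + 2**-23 (error of 0.25 ulp)
```

Truncation simply discards the extra bits, which is why no round or sticky bit generation is needed in hardware, at the price of up to 1 ulp of error and a systematic bias towards zero.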
To implement these methods the result is generated with additional mantissa bits. Then, in the postprocessing stage, the extra bits of the result (guard bit when necessary, round bit and sticky bit) and the sign of the result (for the rounding methods towards up and down) are analyzed to carry out the rounding. Rounding to nearest provides the most accurate rounding, as it ensures a maximum error of 1/2 ulp (unit in the last place, that is, the least significant mantissa bit, lsb). The other three methods have a maximum error of 1 ulp. Meanwhile, rounding to zero is the method that needs the least logic because it is equivalent to truncating the result without taking into account the round bit and the sticky bit. This feature is useful for hardware floating-point units, especially for units that require iterative algorithms like division or square root. If only rounding towards zero is implemented, these units can be reduced as they do not need to generate the round bit and the sticky bit, consuming fewer cycles and resources and, in the case of pipelined architectures, fewer stages.3

Deviation from the standard. Truncation rounding slightly affects the accuracy of a single operation, and only when compared with round to nearest, which ensures a rounding error of up to half an ulp while truncation rounding ensures up to 1 ulp. If we just take into account one operation, this small loss of accuracy can be considered negligible as we are in the range of a relative error of 2^-mb. However, for applications with a large number of chained operators, the error propagation must be considered, as the errors
2 When the leading 1 of the mantissa corresponds to the lsb and the rounding mode is up or down.
3 Always for algorithms computing one bit per cycle, and depending on the precision and the number of bits obtained for other algorithms.
introduced in the first operations spread along the chain while the following operators also introduce error in the partial results. In these cases the error in the final result can be higher when using truncation rounding instead of round to nearest. Additionally, with truncation rounding the results are biased, as truncation always rounds in the same direction, diminishing the absolute value of the result for each operation.

3.3. Hardware representation

In any operation using a floating-point number, the first task is to analyze the values of the exponent and the mantissa of the operands to determine the number type. Meanwhile, the last task of the operation is to compose the exponent and mantissa of the result taking into account the number type of the result. These two tasks are necessary to reconcile the software architecture requirement of a fixed word length with the use of the floating-point standard and its different number types. However, in an FPGA architecture the word length used is flexible and configurable by the designer, and can be extended with some flags to indicate the number type. To represent the five number types three flag bits would be necessary. However, if the previously mentioned simplification of denormalized numbers is applied, two flag bits are enough. This scheme follows the internal use of flags presented in previous works [14,15] of the FPLibrary [21], which we have extended with a systematic approach using interface blocks. Two interface blocks are required: first, the input interface, which calculates the value of the flags from a standard floating-point number; second, the output interface, which composes a standard floating-point number from our custom floating-point number and its corresponding flags. The two interfaces carry out some of the preprocessing and postprocessing tasks, so the logic needed in the arithmetic unit can be reduced.
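The two interfaces can be sketched in software as follows (a behavioral model only; the flag encoding and function names are our assumptions, not the paper's actual ones):

```python
# With denormalized numbers mapped to zero, four number types remain,
# so two flag bits suffice.  Illustrative encoding:
ZERO, NORMAL, INF, NAN = 0b00, 0b01, 0b10, 0b11

def input_interface(s, exp, mnt, eb=8):
    """Input interface: derive the 2-bit type flag from the fields of a
    standard floating-point number (denormalized inputs become zero)."""
    if exp == 0:                     # zero and denormalized map to zero
        return ZERO
    if exp == 2 ** eb - 1:
        return INF if mnt == 0 else NAN
    return NORMAL

def output_interface(flag, s, exp, mnt, eb=8):
    """Output interface: rebuild standard (s, exp, mnt) fields from the
    flagged internal number."""
    if flag == ZERO:
        return s, 0, 0
    if flag == INF:
        return s, 2 ** eb - 1, 0
    if flag == NAN:
        return s, 2 ** eb - 1, 1     # a quiet NaN payload
    return s, exp, mnt

print(input_interface(0, 0, 5))      # 0: a denormalized input is flagged as zero
```

With the flags available, the calculation units no longer need to inspect exponent and mantissa patterns to detect special cases.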
When a datapath involves chained operations, the advantages of this scheme are clear, as an input interface is only needed for each input number while the output interface is found only at the final result. All intermediate operations need no interfaces, so all the operators involved are reduced.

Deviation from the standard. This design decision has no cost in terms of resolution or accuracy. Meanwhile, the use of interfaces makes the transformation between the standard format and the extended FPGA format transparent.

3.4. Global approach analysis

The three design decisions just explained insist on the same point: the simplification of the floating-point format complexity. Handling complexity requires logic resources; thus, as the complexity is reduced, so are the logic resources needed for an operator. Additionally, as fewer resources are used, operator implementations become more efficient, and the speed or the number of pipeline stages is improved. Therefore our approach has focused on determining those features of floating-point arithmetic that require a heavy processing overhead while their relevance in the format is minimal.
Table 2. Logic reduction due to design simplifications.

Denormalized numbers simplification:
  Preprocessing:  Leading one detection (-, /, √, ln, x^y); Mantissa left shifting (-, /, √, ln, x^y); Exponent correction (+, -, /, √, ln, x^y)
  Postprocessing: Mantissa shifting (+, -, /, e^x, x^y); Exponent correction (+, -, /, e^x, x^y)
Truncation rounding:
  Postprocessing: Rounding bits evaluation (All); Rounding (All)
Hardware flags:
  Preprocessing:  Mantissa analysis (All); Exponent analysis (All)
  Postprocessing: Format exponent (All); Format mantissa (All)

However, while the use of dedicated flags does not have any effect on standard compliance, the other design decisions affect compliance in terms of accuracy and resolution around zero. Finally, and regarding floating-point standard compliance, one last issue should be addressed when using FPGAs: the non-associativity of floating-point operations. FPGA datapaths are commonly designed taking advantage of FPGAs' intrinsic parallelism, placing parallel operators in the datapath. However, the results obtained with parallel operations may differ from the ones obtained with an equivalent sequential implementation [22]. Therefore, two issues affect standard compliance: how the floating-point operators are implemented and how the datapath is designed. The second issue is architecturally dependent and cannot be analyzed in a general way, being out of the scope of this work.

4. Arithmetic units

To study the impact of the three design decisions on floating-point operators we have implemented a set of libraries where those decisions have been gradually applied, from standard operators to operators including the three simplifications. Table 2 reviews the tasks that are eliminated by each design decision and for each operator. The libraries are composed of seven operators: addition/subtraction, multiplication, division, square root, and the exponential, logarithm and power functions, while the precision selected has been single precision. Although double precision presents higher accuracy, the accuracy required by many applications can be achieved with single precision or a tailored precision which is much closer to single than to double precision [23]. Furthermore, the conclusions of the study we perform here for single precision can be easily generalized to double precision. In the following, we briefly analyze how we have implemented the calculation stage of each operator.

4.1. Adder/subtracter

The calculation stage of a floating-point adder/subtracter unit does not present any particular complexity. It is just composed of a fixed-point adder/subtracter (preprocessing aligns the mantissas) that calculates the mantissa of the result, and the sign calculator, which takes into account the input signs, whether it is an addition or a subtraction, and which operand is bigger. The exponent of the result is considered equal to the exponent of the biggest operand, and is then adjusted during postprocessing if the mantissa result presents a carry (addition) or a cancelation of its most significant bits (subtraction).

4.2. Multiplication

Nowadays FPGAs include embedded multipliers that can be directly used to multiply the input mantissas. Since current embedded multipliers (18 × 18 multipliers for Virtex 4 and Stratix III and
IV, 25 × 18 multipliers for Virtex 5 and 6) have at least one of their two operand inputs with a bit width smaller than the 24-bit mantissas of single floating-point precision, our operator takes advantage of the distributive property of multiplication:

(a + b) · (c + d) = a·c + a·d + b·c + b·d
and splits the input mantissas into two parts, one corresponding to the most significant bits (upper parts, x_u and y_u) and the other corresponding to the least significant bits (lower parts, x_l and y_l), multiplying these subparts in parallel. Afterwards, the results of the partial multiplications are added taking into account the necessary alignment between operands4:

(x_u · 2^12 + x_l) · (y_u · 2^12 + y_l) = x_u·y_u · 2^24 + (x_u·y_l + x_l·y_u) · 2^12 + x_l·y_l

The sign of the result is calculated with an XOR gate, while the exponent is obtained by adding the input exponents (we will not go into the details of handling the exponent bias; the same applies in the rest of the units). As in the adder/subtracter, a carry may be obtained in the mantissa result, so an additional exponent adjustment is needed in postprocessing.

4 For Virtex 5 and 6, this scheme corresponds to the equation a × (b + c), where only one input mantissa needs to be split.

4.3. Division

Divider and multiplier architectures are similar, as division is the inverse function of multiplication (the exponent of the result is now calculated by subtraction). However, since there are no embedded dividers in current FPGAs, the mantissa division has to be implemented with logic. Exact binary division can be implemented by several algorithms [24], like digit recurrence methods such as the restoring, non-restoring or SRT algorithms [25]. Among those algorithms, the most common implementations are the digit recurrence methods, where one (restoring, non-restoring and radix-2 SRT algorithms) or more bits (radix-4 SRT, radix-8 SRT, etc.) of the mantissa result are calculated per division step. Other implementations rely on some kind of approximation, such as algorithms based on Taylor series [26], the Goldschmidt algorithm [27] or the Newton-Raphson method, where the mantissa result is computed with several extra bits to minimize the error introduced by the approximation. For our libraries we have selected the non-restoring algorithm due to its better performance when compared to other digit recurrence methods, and due to its logic requirements when compared to approximation algorithms, as these methods require multiplications and even look-up tables in the case of the Taylor series [12]. This selection implies a division with the highest number of division steps needed to obtain the result, and therefore more clock cycles than with SRT algorithms with a radix bigger than two [28]. However, it presents the highest performance per division step (mainly an XOR for a conditional negation, and an adder), being the algorithm best suited for high clock rates.

4.4. Square root

The square root can be computed very similarly to division when a digit recurrence algorithm is implemented, and thus a non-redundant algorithm has been selected again. The calculation unit follows the non-restoring algorithm presented in [29]. This algorithm is especially well suited for FPGAs as it works with bit-width-reduced operands. As in the divider, each step is composed of a conditional negation and an addition, but now the bit width of each step (and of its operations) is determined by the bit width of the partial result calculated before that step. The exponent of the result is obtained by just dividing the input exponent by two (shifting right one position), with a preprocessing of the mantissa in case the input exponent was odd (one shift left). No sign calculation is needed, as calculated square roots are always positive.

4.5. Exponential, logarithm and power functions

The exponential and logarithm operators are based on the previous work of Detrey and Dinechin [14,15]. The technique used in both works is to reduce the input range and then use table-driven methods to calculate the function in the reduced range. This type of algorithm has been selected instead of iterative algorithms [16] due to better performance in terms of frequency. For our operators we have redesigned the exponential and logarithm datapaths, introducing several changes to improve performance while reducing resources [30]. The basis for our power unit can also be found in our previous work [30]. The power unit derives from the exponential and logarithm operators, as x^y can be reduced to a combination of other operations and calculated straightforwardly with the transformation:

z = x^y = e^(y · ln x)
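The transformation can be checked numerically; a minimal Python analogue (math.exp and math.log standing in for the hardware exponential and logarithm units, positive x assumed):

```python
import math

def power(x, y):
    """x**y computed as e^(y * ln x), mirroring the power unit's
    decomposition (positive x only; signs and special cases would be
    resolved separately, e.g. through the number-type flags)."""
    return math.exp(y * math.log(x))

print(power(2.0, 10.0))   # close to 1024.0, up to a small rounding error
```

The chained exponential and logarithm introduce their own rounding errors, which is one reason the power unit's accuracy is analyzed separately from the basic operators.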
5. Libraries evaluation and comparison
The evaluation of our libraries has been carried out on a Xilinx Virtex-4 XC4VF140-11 FPGA with the ISE 10.1 environment [20]. Results obtained are post place & route with balanced mapping, and place & route with high effort. To make a fair study and comparison of the impact of each design decision, four libraries have been developed:

- Std: operators without any significant change with respect to the floating-point standard (supporting the four rounding methods and denormalized numbers). Only software issues such as exception handling and signaling NaNs are not implemented, just providing quiet NaNs.
- Norm: operators handling denormalized numbers as zeros.
- 0_Rnd: operators with only the truncation rounding mode, also handling denormalized numbers as zeros.
- HP (High Performance): operators designed including all three design decisions: denormalized numbers as zeros, truncation rounding and use of flags.

If performance is the design goal, floating-point operators require deeply pipelined implementations. The criterion followed to determine the number of pipeline stages of the operators has been achieving a high clock frequency with a reasonable number of stages. Thus, we have determined which basic calculation or iteration in any of the operators of the HP library is in the critical path, its delay being the maximum delay for all other operators of HP. For the Std and Norm libraries the same design criterion has been followed, while for 0_Rnd we have chosen the same number of pipeline stages as the operators of the HP library, to study the benefits of the use of internal flags without any other changes. The experimental results obtained are summarized in Table 3, where each operator is characterized by the number of logic resources (Slc, slices), the number of pipeline stages (Stg, stages) and the clock frequency (MHz), and in Table 4, which details the operators' embedded resources, common to all the libraries: Block RAMs (BRAM) and embedded DSPs (DSP).
Table 3. Operators results (Slc = slices, Stg = pipeline stages).

        Std               Norm              0_Rnd             HP
        Slc  Stg  MHz     Slc  Stg  MHz     Slc  Stg  MHz     Slc  Stg  MHz
+/-     414   9   250.9   402   8   252.9   369   6   267.5   344   6   286.4
×       471   9   250.9   148   6   253.5   109   5   242.7   102   5   250.0
/      1044  29   253.4   802  26   276.9   753  24   274.3   733  24   280.5
√       515  20   250.0   411  18   250.2   342  16   250.4   328  16   256.8
e^x     554  17   250.0   482  16   244.4   463  15   258.6   449  15   253.9
ln      878  18   234.0   777  16   250.6   737  14   250.3   732  14   250.3
x^y    1734  37   210.0  1472  34   209.4  1457  33   220.2  1433  33   214.8
Table 5. Operators results. Commercial library.

       Xilinx [31]         Norm           NormXlx
       Stg  Slc  MHz       Slc  MHz       Slc  MHz
+/-     8   418  256.1     402  252.9     395  257.9
×       6   150  246.9     148  253.5     144  255.0
/      26   668  175.7     802  276.9     777  276.9
/      27   868  290.9     821  293.7     781  296.8
√      18   418  190.7     411  250.2     389  250.6
Our evaluation has been carried out in three main parts:

1. Comparison of the operators with a commercial library, to verify the quality of our libraries.
2. Study and comparison of the results for each component and for each of the libraries, analyzing the impact of each design decision on preprocessing and postprocessing logic. As x^y is composed of a chain of other operators, we have not included this unit in this study, as it would distort the global analysis.
3. Analysis of the capabilities of current FPGAs to implement floating-point operators.

5.1. Comparison with respect to a commercial library

As the reference commercial library we have selected the Xilinx floating-point operators (logic core Floating-Point Operator v4.0 [31]). This library is parameterizable (variable exponent and mantissa bit widths, number of pipeline stages); denormalized numbers are not supported and rounding is restricted to round to nearest. Consequently, the Xilinx operators are almost equivalent to our Norm ones, only differing in the calculation of the rounding, as Xilinx does not support the four floating-point standard rounding modes, see Section 3.2. Therefore, to make an exact comparison we have tuned our Norm operators restricting rounding to round to nearest: the NormXlx operators. Finally, we have configured the Xilinx and NormXlx operators with the same number of stages as those of Norm. For the four basic operators some common features can be observed in Table 5. Our operators achieve better frequencies, with a remarkable increase in both the divider and the square root. The 100 MHz increase of the divider is due to a design improvement, as one stage is removed by making a dual first division step5 so that the calculation of
5 Dividend - Divisor for the cases where the dividend mantissa is greater than or equal to the divisor one, and (Dividend × 2) - Divisor for the cases where the divisor mantissa is greater.
the guard bit is unnecessary. As a consequence, the Xilinx operator is configured with one stage fewer than its optimum. However, if the dividers are configured with the Xilinx optimum number of stages, 27, our operator is still faster. Regarding the square root, the 60 MHz increase is due to the algorithm selected or to how it is implemented, as the Xilinx operator stays below 200 MHz until it is configured with 26 stages. With respect to the resources used, all our Norm operators are slightly smaller although implementing the four rounding methods (except the division with 26 stages, which is not comparable as ours is 100 MHz faster). When comparing the Xilinx operators with the NormXlx ones, the improvements increase, as we are removing logic from our operators. The biggest impacts can be observed in the divider and the square root due to the removal of part of the sticky bit calculation logic. This is possible because in both operators the calculation of the sticky bit differs between the round to nearest and the round up or down methods, and requires independent logic. For the three advanced operators no comparison is possible, as the Xilinx library only provides the four basic operators.

5.2. Operators evaluation

In Fig. 3 the results of Table 3 (clock frequency, number of pipeline stages and number of logic resources) are graphically depicted. Regarding all figures of merit, the final library, HP, outperforms the standard library Std in all metrics, each operator being faster while requiring fewer resources and fewer pipeline stages. And as each design decision is introduced, each library outperforms or equals the previous one. Only one exception can be found: the clock frequency. For each design decision we are removing logic (completely or partially), and therefore the clock frequency should increase if the pipelined architectures are not changed, as is our case. However, there are several exceptions to this general trend, mainly in operators using DSPs.
Analyzing these exceptions, we have found they are due to the placement and routing heuristic algorithms, which do not guarantee optimum implementations. In the exceptions found, the stages determining the clock frequency were exactly the same as (or even had less logic than) those in previous, faster operators, the lower frequency being due to a different placement or routing. The importance of all the improvements can be analyzed together by comparing HP operators with Std ones. The reduction of logic resources ranges from 78.3% (multiplier) to 16.6% (logarithm), while pipeline stages are reduced by between 5 (divisor) and 2 (exponential) stages. To analyze the impact of the preprocessing and postprocessing overheads on the different libraries we have analyzed each stage separately: preprocessing (pre), operator calculation (cal) and postprocessing (pst). Two metrics have been analyzed: the number of resources needed, see Table 6, and the number of pipeline stages required, see Table 7.
Table 6
Split slice comparison.

         Std                Norm               0_Rnd              HP
      Pre. Cal. Pst.    Pre. Cal. Pst.    Pre. Cal. Pst.    Pre. Cal. Pst.
+/-   146  128  198     146  128  192     146  124  155     138  124  134
x     262   75  186      43   75  113      43   63   51       4   63   50
/     252  729  122      47  729   64      47  695   46       4  695   12
sqrt  150  355   49      39  355   39      39  311   16      29  311    0
e^x   124  301  169     124  301  105     124  301   74     119  301   38
ln    130  581  231      25  581  231      25  581  178       3  581  173
Table 7
Split stages comparison.

         Std                Norm               0_Rnd              HP
      Pre. Cal. Pst.    Pre. Cal. Pst.    Pre. Cal. Pst.    Pre. Cal. Pst.
+/-    2    2    5       2    2    4       2    2    2       2    2    2
x      2    4    3       1    4    3       1    4    1       1    4    1
/      2   25    2       1   25    2       1   24    1       1   24    1
sqrt   2   17    1       1   17    1       1   16    1       1   16    0
e^x    1   13    4       1   13    3       1   13    2       1   13    2
ln     2   12    4       1   12    4       1   12    2       1   12    2
5.2.1. Denormalized numbers

Comparing Std operators with Norm ones, we can observe a reduction in the number of slices required ranging from 68.5% (multiplier) to 11.5% (logarithm). The adder can be considered a special case, as almost all the extra logic needed in Std is subsumed in the logic of Norm, and only the exponent correction due to a denormalized number is simplified, see Table 2. Std overheads can mainly be found in the preprocessing stage, where almost every operator has several tasks related only to denormalized numbers, as these numbers need to be converted into normalized numbers, see Table 2. Results in Table 6 show that these tasks have a major impact on the resources required, mainly in the multiplier and the divider, as for both of them the resources for these tasks have to be replicated for both inputs. On the other hand, the impact on the postprocessing stage is much smaller, due to two facts. Firstly, there is only one output to handle, so the logic is not replicated. Secondly, part of the logic of the tasks needed for handling denormalized numbers is shared with the logic required for handling the result. With respect to pipeline stages, not handling denormalized numbers reduces the required pipeline stages (by between 1 and 3) in two ways:
- Elimination of stages, as some tasks with dedicated logic in the datapath are removed.
- Overlap of stages: when denormalized numbers are no longer handled, the calculation stage of some operators can work directly with the input numbers. In parallel, the preprocessing stage analyzes the input number types and processes the exceptions due to non-normalized inputs.

Furthermore, it can also be observed that the handling of denormalized numbers affects the clock frequency of the operators. Although there are specific pipeline stages for this handling, these stages become the slowest ones, harming the global speed of most units (see Fig. 3a) and requiring even more additional pipeline stages.

5.2.2. Rounding

Limiting the rounding methods to just rounding towards zero, 0_Rnd, implies an additional reduction of slices, up to 69 for the square root, and of between 1 and 2 stages when comparing 0_Rnd operators with Norm ones. Rounding mainly affects the postprocessing stage and the calculation stage, as the rounding bits need to be calculated. With respect to postprocessing, the resources are needed to evaluate the rounding bits, to perform the rounding itself, which implies two adders (exponent and mantissa), and to handle the cases where a carry can be obtained after rounding the mantissa (a multiplexor). Regarding the calculation stage, the major impact is in the operators with digit recurrence methods, as extra iterations are needed for computing those bits.

5.2.3. Number types

Again, as with denormalized numbers, when applying this design decision the major impact can be found in the operators with two inputs. The logic needed (comparators) to determine the number type by analyzing the mantissa and exponent values has to be replicated. Meanwhile, in the postprocessing stage the result is formatted (using a multiplexor). When this logic is removed, the saved resources are equivalent to the resources needed by the interface units.
Therefore, the reduction of logic resources with respect to 0_Rnd (up to 6.8% in the multiplier) will only be effective when implementing chained operators.

5.3. Replicability

Replicability is a key issue to address when analyzing the suitability and capacity of modern FPGAs to implement floating-point applications. We can define replicability simply as the number of operators that can be implemented in an FPGA, considering 100% of the resources available and taking as base reference the resources used by one operator. The results obtained following this definition are depicted in Fig. 4 for HP operators and for four different
FPGAs which present a good mix of elements: two Virtex-4 (SX55, FX140) and two Virtex-5 (FX200, SX240). From Fig. 4, the great capacity of modern FPGAs to implement complex single-precision floating-point algorithms, even those involving hundreds of operators, can be deduced. However, in a real scenario other facts have to be taken into account:

- Routing stress: for large implementations, routing congestion can seriously affect performance or make it impossible to route already mapped operators.
- Use of logic resources instead of embedded elements: when no more embedded elements are available, they will be substituted by slices.
- The datapath: for operators sharing the same inputs, there could be replicated logic removed by the implementation tools. Additionally, for chained operators the input or output registers should be removed.

Taking all these features into account, we have designed a synthetic datapath, Fig. 5, to analyze how many times an operator can be replicated. The datapath is composed of n levels of 10 operators each, while another 9 operators provide the final output. For coarse-grain configurability, n can be increased by adding more levels, while for fine-grain configurability, operators can be added at the output. The datapath has been designed to prevent the implementation tools from removing duplicated logic: there are no two equal inputs, using Zij (the output of each operator) and Z'ij (the output but with the bits reordered), while the operators are registered only at their outputs. The operators under study are the adder, among the operators not using embedded elements, and the logarithm, among the ones using embedded elements (in this case, we have chained operators as there is only one input per operand). The reference FPGA has been the Virtex-4 XC4VFX140 and the design strategy has been balanced implementation. In Fig. 6 the results obtained for the adder operator are shown. The expected number of operators, obtained by extrapolating the results for one operator, is widely exceeded.
Instead of the 183 operators expected (first vertical line from the left in Fig. 6), it is possible to implement up to 241 adders. The first reason for this increase is that only the outputs are registered in these tests. The second reason is the way the implementation tools work: when reaching the usage limits of the FPGA, the implementation tools focus their work on area optimization (although results are obtained with the balanced goal). Therefore, for operators without input registers, two more theoretical limits have been calculated: the first one with balanced implementation (203, second vertical
Fig. 4. Operators replicability.

Fig. 5. Synthetic datapath used in the replicability tests (n levels of 10 operators each, plus 9 output operators).
Fig. 6. Adder replicability results.
Fig. 7. Logarithm replicability results.
line) and the second with area-oriented implementation (250, third vertical line). For the adder, this area-oriented limit has proved to be the closest to the results obtained in the experimental tests, where we have obtained just 3.6% fewer operators than this limit. In the Slices graph (results are normalized, 1 being the total number of slices of the FPGA) we can see that 100% of the resources of the FPGA are used before reaching the 241 operators. The number of slices used grows linearly with the operators until 100% is reached. Then, the implementation tools reduce the slices needed per operator, as the resources limit has been reached. At this point, the speed of the datapath (MHz graph) is seriously affected, as routing becomes more difficult with each new operator introduced, harming the performance. When the embedded elements are the limiting components the situation changes, as can be seen in Fig. 7 for the logarithm. When the limit is about to be reached (between 38 and 39 logarithms), the implementation tools replace one 18x18 multiplier of each logarithm by logic. Therefore, for 39 logarithms all the operators have four 18x18 multipliers instead of the previous five, when it would be strictly necessary to have only 36 logarithms with five 18x18 multipliers and three with four. Consequently, there is a big increase in the number of slices used, while a big decrease is observed in the number of 18x18 multipliers. The same circumstance happens at 48 operators. However, the next frontier (64 operators) becomes the final limit, as with 64 logarithms we have reached 100% of slices used, making it impossible to replace more 18x18 multipliers with logic.
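The replicability figure of merit defined at the beginning of this section can be sketched as a minimum over resource types. The model below is deliberately naive: as discussed above, it ignores routing stress and the tools' substitution of embedded elements by slices. The device figures are taken from Virtex-4 FX140 documentation; the per-operator figures are illustrative:

```python
def replicability(fpga_resources, operator_usage):
    """Naive replicability estimate (Section 5.3 definition): how many
    copies of one operator fit in the FPGA, assuming 100% of each
    resource is usable. The binding resource type (slices, embedded
    multipliers, BRAMs, ...) sets the limit."""
    return min(
        fpga_resources[r] // used
        for r, used in operator_usage.items()
        if used > 0
    )

# Virtex-4 XC4VFX140 totals and a logarithm-like operator
# using five 18x18 multipliers (illustrative usage figures):
fpga = {"slices": 63168, "mult18x18": 192, "bram": 552}
log_op = {"slices": 760, "mult18x18": 5, "bram": 1}
# The 18x18 multipliers bind first: 192 // 5 = 38 operators,
# matching the 38-39 logarithm frontier observed experimentally.
```

When slices bind instead (as for the adder), the same formula reproduces the extrapolated limit that the experimental tests then exceed, since the tools trade embedded elements and slices near full utilization.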
6. Towards standard compliance and high performance

As seen in Section 5, the performance of FPGA floating-point operators can be considerably improved by introducing a few modifications to obtain sub-standard operators. However, how can we obtain a standard-compliant library while trying to preserve the improvements of the sub-standard libraries? Since using dedicated flags for the number type has no cost in terms of standard compliance, we can always apply it by using interfaces. The two other decisions cannot be directly applied if compliance with the standard is a must, so trying to fulfill the standard will require additional techniques.

6.1. Simplification of denormalized numbers: one bit exponent extension

Whether a number is denormalized for a given floating-point precision depends on the exponent bitwidth of that precision, eb. A number is denormalized if the value of its leading one corresponds to 2^x with x in [2 - 2^(eb-1) - mb, 1 - 2^(eb-1)], while it is normalized if x is in [2 - 2^(eb-1), 2^(eb-1) - 1]. Consequently, a denormalized number for a given precision will be a normalized number for a precision with one extra exponent bit, e'b = eb + 1, as the x corresponding to the leading one will now be in the new normalized range [2 - 2^(e'b-1), 2^(e'b-1) - 1] = [2 - 2^eb, 2^eb - 1]. Therefore, the operators' handling of denormalized numbers as zeros can be applied in combination with an extension of the number precision by one exponent bit, in order to preserve the resolution of the denormalized numbers of the non-extended format.

Table 8
Operators results with the final proposed features.

        Slices          Speed (MHz)       Stages
        HW     HW+1     HW      HW+1
+/-     378    385      270.0   273.9      8
x       138    145      250.2   252.3      6
/       742    742      286.2   284.4     26
sqrt    366    368      250.1   250.6     17
e^x     470    505      246.7   248.3     16
ln      760    805      258.5   258.5     16
x^y     1455   1508     213.1   210.2     34

Now, as in the use of flags for the number type, the interfaces between standard numbers and hardware numbers will be in charge of transforming the numbers, while the operators will need to handle numbers with one extra bit in the exponent. However, handling exponents has a small hardware cost in the operators, much smaller than handling denormalized numbers, see Section 6.3. Regarding standard compliance, this solution ensures that we are not losing accuracy or resolution, as the denormalized numbers of the new format will always be zero in the standard precision. Furthermore, with the extended precision more accurate results can be obtained. In datapaths involving several operations we can
find that partial results that were zeros or infinities with the standard precision are normalized numbers with the new precision. Therefore, final results that were a zero, an infinity or NaN with the standard precision can have a numeric value with the new precision. Additionally, partial results that were a standard denormalized number, with reduced precision, can now be normalized numbers with full precision, so the subsequent operations are more accurate.

6.2. Truncation rounding: mantissa extension

To try to obtain the same accuracy with truncation rounding as with round to nearest, an extension of the mantissa by one bit can be applied. The ulp of the new precision with extended mantissa, ulp', has a value of half the standard ulp. So the maximum error of the extended precision with truncation rounding, 1 ulp', will equal the maximum error obtained with the standard precision and round to nearest, 0.5 ulp. This solution requires that each operator compute the input mantissas with one extra bit and also generate outputs with the extra bit. Part of the advantage of truncation rounding, not having to compute the rounding and sticky bits, is lost, while the calculation unit has to work with input mantissas of one more bit, requiring more resources.
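The ulp argument can be checked numerically. The following is a toy model (a generic quantizer over an mb-bit mantissa grid, not the operators themselves) showing that truncation with one extra mantissa bit satisfies the same 0.5 ulp error bound as round to nearest at the standard width:

```python
import math

def quantize(value, frac_bits, mode):
    """Quantize a positive real to frac_bits fractional mantissa bits.
    mode: 'trunc' (round towards zero) or 'nearest' (round to nearest)."""
    scale = 2 ** frac_bits
    if mode == "trunc":
        return math.floor(value * scale) / scale
    return round(value * scale) / scale

mb = 8                       # toy standard mantissa width
ulp = 2.0 ** -mb             # ulp of the standard format
for v in [1.0 + k / 997.0 for k in range(997)]:   # mantissas in [1, 2)
    err_nearest = abs(v - quantize(v, mb, "nearest"))
    err_trunc_ext = abs(v - quantize(v, mb + 1, "trunc"))
    assert err_nearest <= 0.5 * ulp       # round to nearest, mb bits
    assert err_trunc_ext <= 0.5 * ulp     # truncation, mb+1 bits: 1 ulp' = 0.5 ulp
```

Note that only the worst-case error matches: as Section 6.2 goes on to discuss, truncation remains biased towards zero, which this bound does not capture.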
Table 9
Required interfaces.

              Slices   Speed (MHz)   Stages
SW-HW  HW       42      434.2          1
       HW+1    124      323.5          2
HW-SW  HW       35      700.8          1
       HW+1    120      295.6          2
Concerning the bias introduced by truncation rounding, this source of non-compliance with the standard cannot be corrected; we can only try to reduce the introduced bias by extending the precision with more mantissa bits, making the relative error per operation, and thus the bias, smaller. Nevertheless, the bias can have an important impact on the accuracy of results, for example in the statistical analysis of large quantities of data or in applications requiring a high degree of accuracy. Therefore, as the bias cannot be corrected, round to nearest should be kept for standard compliance and to avoid the impact of the bias in such applications.

6.3. FPGA-oriented floating-point library

From the previous analysis and discussions, we consider the following features good options for almost-standard compliance while still taking advantage of FPGA flexibility:

- Use of dedicated flags for the number type.
- Handling of denormalized numbers as zeros, while the exponent is extended by one bit.
- Round to nearest should be kept.

Following these recommendations we have developed two final libraries, taking into account two possible scenarios: one with the one-bit exponent extension, HW + 1 (Hardware library, where +1 indicates the exponent extension), and another without the exponent extension, HW. Table 8 summarizes the results for the developed operators with those features, while Table 9 shows the results for the required interfaces. As can be seen in both tables, the exponent extension has an impact mainly on the required interfaces, as they now have to handle the conversion between denormalized and normalized numbers (one more stage, more resources and a lower frequency), and on the complex operators (up to 7.4% more slices in the exponential). Thereby, if it is known that in a given application denormalized numbers are not in the range of partial or final results, it is better to use the operators with no bit extension.
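As an illustration of what a SW-to-HW interface has to do under these recommendations, the sketch below unpacks an IEEE-754 single and re-encodes it with dedicated type flags and a one-bit-extended exponent, so that every nonzero finite input, denormals included, becomes a normalized hardware number. The flag encoding and field layout are our own assumptions for the sketch, not the library's actual format:

```python
import struct

# Hypothetical number-type flags for the hardware format.
ZERO, NORMAL, INF, NAN = range(4)

def sw_to_hw(x):
    """Sketch of the SW->HW interface: unpack an IEEE-754 single and
    re-encode it as (type flag, sign, 9-bit exponent, 23-bit mantissa),
    normalizing denormals into the one-bit-extended exponent range."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31
    exp = (bits >> 23) & 0xFF
    man = bits & 0x7FFFFF
    if exp == 0xFF:                   # infinities and NaNs: flag only
        return (NAN if man else INF), sign, 0, 0
    if exp == 0 and man == 0:
        return ZERO, sign, 0, 0
    if exp == 0:                      # denormal: normalize into eb+1 bits
        e = -126                      # unbiased exponent of the denormal range
        while man < (1 << 23):        # shift leading one to the hidden position
            man <<= 1
            e -= 1
        man &= 0x7FFFFF
        exp = e + 255                 # re-bias with the extended bias 2^8 - 1
    else:
        exp = (exp - 127) + 255       # re-bias normal exponents
    return NORMAL, sign, exp, man
```

With the extended bias of 255, the smallest single-precision denormal (2^-149) maps to the strictly positive stored exponent 106, so no input is flushed to zero and the HW-to-SW interface can recover the original resolution.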
When comparing the results of these final libraries, HW and HW + 1, with the other libraries, we can focus on Fig. 8, which shows the results of the HW and HW + 1 operators compared to the results of the initial standard library, Std, and to the results of the sub-standard high performance library, HP. Firstly, it can be observed that the HW and HW + 1 operators partially preserve the clock frequency improvements achieved with the sub-standard operators, except for the exponential operator. If we focus on the graphics depicting the number of pipeline stages and the use of logic resources, it can be seen that for both features, although not all the improvements achieved with the sub-standard operators can be preserved, the HW and HW + 1 results are closer to the HP results than to the Std ones. HW and HW + 1 operators present a reduction of between one and three pipeline stages with respect to Std operators, while the increase with respect to HP operators is between one and two stages. Finally, considering the use of slices, the improvements achieved with respect to Std operators are between 12% (exponential) and 67.2% (multiplier) for HW operators and between 10% and 66.1% for HW + 1 operators.

7. Conclusions

Current nanometer technologies allow the implementation of complex floating-point applications in a single FPGA device. Nevertheless, the complexity of this kind of operators (deep pipelines, computationally intensive algorithms and format overhead) makes
their design especially long and complicated. The availability of complete, FPGA-oriented libraries significantly simplifies the design of those complex floating-point applications. In this work we have presented several design decisions that can be made to improve the performance of floating-point operators implemented in FPGA architectures. An in-depth analysis of the performance-accuracy trade-offs of those decisions has been carried out through the development and comparison of complete floating-point libraries of operators. These libraries include both conventional (adder/subtracter, multiplier, divider and square root) and advanced (exponential, logarithm and power) operators. In fact, the power operator is included in a floating-point library for the first time. Three design decisions targeting the simplification of floating-point complexity have been thoroughly analyzed. Handling complexity requires logic resources, so as the complexity is reduced, so are the logic resources needed by an operator. Additionally, as fewer resources are used, operator implementations become more efficient, and speed is also improved or the number of pipeline stages is reduced. Our approach has focused on determining those features of floating-point arithmetic that require a heavy processing overhead while their relevance in the format is minimal or can be neglected. The extended experimental results validate the different decisions made to improve the performance and area requirements of floating-point units, and can be taken as a guide by hardware designers when implementing this kind of application. Finally, a set of features that implies improved performance and reduced resources has been chosen to design two almost standard-compliant libraries whose tailored implementation adapts the standard format to improve performance, taking advantage of FPGA flexibility.
Acknowledgements

This work has been partly funded by BBVA under Contract P060920579 and by the Spanish Ministry of Science and Innovation through Project TEC2009-08589.

References
[1] I. Kuon, J. Rose, Measuring the gap between FPGAs and ASICs, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 26 (2) (2007) 203-215.
[2] R. Scrofano, M. Gokhale, F. Trouw, V. Prasanna, Accelerating molecular dynamics simulations with reconfigurable computers, IEEE Transactions on Parallel and Distributed Systems 19 (6) (2008) 764-778.
[3] G. Zhang, P. Leong, C. Ho, K. Tsoi, C. Cheung, D.-U. Lee, R. Cheung, W. Luk, Reconfigurable acceleration for Monte Carlo based financial simulation, 2005, pp. 215-222, doi:10.1109/FPT.2005.1568549.
[4] L. Zhuo, V. Prasanna, High-performance designs for linear algebra operations on reconfigurable hardware, IEEE Transactions on Computers 57 (8) (2008) 1057-1071.
[5] K. Underwood, FPGAs vs. CPUs: trends in peak floating-point performance, in: FPGA '04: Proceedings of the 2004 ACM/SIGDA 12th International Symposium on Field Programmable Gate Arrays, ACM, New York, NY, USA, 2004, pp. 171-180, doi:10.1145/968280.968305.
[6] A. George, H. Lam, G. Stitt, Novo-G: at the forefront of scalable reconfigurable supercomputing, Computing in Science & Engineering 13 (1) (2011) 82-86, doi:10.1109/MCSE.2011.11.
[7] S. Che, J. Li, J.W. Sheaffer, K. Skadron, J. Lach, Accelerating compute-intensive applications with GPUs and FPGAs, in: Symposium on Application Specific Processors, 2008, pp. 101-107, doi:10.1109/SASP.2008.4570793.
[8] F. de Dinechin, J. Detrey, O. Cret, R. Tudoran, When FPGAs are better at floating-point than microprocessors, in: FPGA '08: Proceedings of the 16th International ACM/SIGDA Symposium on Field Programmable Gate Arrays, ACM, New York, NY, USA, 2008, p. 260, doi:10.1145/1344671.1344717.
[9] M. Chiu, M.C. Herbordt, M. Langhammer, Performance potential of molecular dynamics simulations on high performance reconfigurable computing systems, in: Second International Workshop on High-Performance Reconfigurable Computing Technology and Applications, 2008, pp. 1-10.
P. Echeverría, M. López-Vallejo / Microprocessors and Microsystems 35 (2011) 535-546

[27] R. Goldschmidt, Applications of Division by Convergence, Master's thesis, Massachusetts Institute of Technology, 1964.
[28] K.S. Hemmert, K.D. Underwood, Floating-point divider design for FPGAs, IEEE Transactions on Very Large Scale Integration (VLSI) Systems 15 (1) (2007) 115-118.
[29] Y. Li, W. Chu, Implementation of single precision floating point square root on FPGAs, in: IEEE Symposium on FPGA-Based Custom Computing Machines, 1997, p. 226.
[30] P. Echeverría, M. López-Vallejo, An FPGA implementation of the powering function with single floating-point arithmetic, in: Conference on Real Numbers and Computers, 2008.
[31] Xilinx, <http://www.xilinx.com/products/ipcenter/floating_pt.htm>.
[10] B. Lee, N. Burgess, Parameterisable floating-point operations on FPGA, in: Conference Record of the Thirty-Sixth Asilomar Conference on Signals, Systems and Computers, vol. 2, 2002, pp. 1064-1068.
[11] G. Govindu, L. Zhuo, S. Choi, V. Prasanna, Analysis of high-performance floating-point arithmetic on FPGAs, in: International Parallel and Distributed Processing Symposium, 2004, pp. 149-156.
[12] X. Wang, S. Braganza, M. Leeser, Advanced components in the variable precision floating-point library, in: IEEE Field-Programmable Custom Computing Machines, 2006, pp. 249-258.
[13] J. Liang, R. Tessier, O. Mencer, Floating point unit generation and evaluation for FPGAs, in: IEEE Field-Programmable Custom Computing Machines, 2003, pp. 185-194.
[14] J. Detrey, F. de Dinechin, A parameterized floating-point exponential function for FPGAs, in: IEEE International Conference on Field-Programmable Technology, 2005, pp. 27-34.
[15] J. Detrey, F. de Dinechin, A parameterized floating-point logarithm operator for FPGAs, in: Conference Record of the Thirty-Ninth Asilomar Conference on Signals, Systems and Computers, 2005, pp. 1186-1190.
[16] J. Detrey, F. de Dinechin, X. Pujol, Return of the hardware floating-point elementary function, in: Symposium on Computer Arithmetic, 2007, pp. 161-168.
[17] G. Govindu, R. Scrofano, V. Prasanna, A library of parameterizable floating-point cores for FPGAs and their application to scientific computing, in: International Conference on Engineering of Reconfigurable Systems and Algorithms, 2005.
[18] IEEE Standards Board, IEEE standard for binary floating-point arithmetic, The Institute of Electrical and Electronics Engineers, 1985.
[19] Altera Corporation, <http://www.altera.com/products/ip/dsp/arithmetic/m-alt-float-point.html>.
[20] Xilinx, <http://www.xilinx.com/products/index.htm>.
[21] J. Detrey, F. de Dinechin, Flopoco project (floating-point cores), <http://www.ens-lyon.fr/LIP/Arenaire/Ware/FPLibrary/>.
[22] N. Kapre, A. DeHon, Optimistic parallelization of floating-point accumulation, in: 18th IEEE Symposium on Computer Arithmetic, 2007, pp. 205-216.
[23] J. Langou, P. Luszczek, J. Kurzak, A. Buttari, J. Dongarra, Exploiting the performance of 32 bit floating point arithmetic in obtaining 64 bit accuracy (revisiting iterative refinement for linear systems), LAPACK Working Note, July 2006.
[24] S.F. Oberman, M.J. Flynn, Division algorithms and implementations, IEEE Transactions on Computers 46 (8) (1997) 833-854.
[25] C. Freiman, Statistical analysis of certain binary division algorithms, Proceedings of the IRE 49 (1) (1961) 91-103, doi:10.1109/JRPROC.1961.287780.
[26] P. Hung, H. Fahmy, O. Mencer, M.J. Flynn, Fast division algorithm with a small lookup table, in: 33rd Asilomar Conference on Signals, Systems and Computers, 1999, pp. 24-27.
Pedro Echeverría received the M.S. degree in telecommunications with a major in electronics from the Universidad Politécnica de Madrid, Spain, in 2005. He is currently working towards the Ph.D. degree at the Department of Electronic Engineering, Universidad Politécnica de Madrid. His research interests include high-performance computing with FPGAs, application-specific high-performance programmable architectures, computer arithmetic and random number generation.
Marisa López-Vallejo received the M.S. and Ph.D. degrees from the Universidad Politécnica de Madrid, Spain, in 1993 and 1999, respectively. She is an Associate Professor in the Department of Electronic Engineering, Universidad Politécnica de Madrid. She was with Bell Laboratories, Lucent Technologies, as a member of the technical staff. Her research activity is currently focused on low-power and temperature-aware design, CAD for hardware/software codesign of embedded systems, and application-specific high-performance programmable architectures.