Floating-Point Hardware Designs For Multimedia Processing

This thesis document describes Metin Mete Ozbilen's Ph.D. thesis on floating-point hardware designs for multimedia processing. The thesis was completed at Cukurova University in Adana, Turkey in 2009 under the supervision of Assoc. Prof. Dr. Mustafa Gok. The thesis includes designs for floating-point addition, multiplication, multiply-add, and division circuits to process multimedia instructions. It also includes designs that perform the operations on packed data to increase the speed of floating-point multimedia instructions. The thesis presents hardware implementations for multiplying, adding, subtracting and calculating reciprocals using packed floating-point numbers.

INSTITUTE OF NATURAL AND APPLIED SCIENCES UNIVERSITY OF CUKUROVA

Ph.D. THESIS

Metin Mete OZBILEN

FLOATING-POINT HARDWARE DESIGNS FOR MULTIMEDIA PROCESSING

DEPARTMENT OF ELECTRICAL AND ELECTRONICS ENGINEERING

ADANA, 2009

CUKUROVA UNIVERSITY
INSTITUTE OF NATURAL AND APPLIED SCIENCES

FLOATING-POINT HARDWARE DESIGNS FOR MULTIMEDIA PROCESSING

Metin Mete OZBILEN
Ph.D. THESIS
DEPARTMENT OF ELECTRICAL AND ELECTRONICS ENGINEERING

This thesis was accepted unanimously on 08.07.2009 by the jury members listed below.

Signature ............................. Assoc. Prof. Dr. Mustafa GOK (Supervisor)
Signature ............................. Assist. Prof. Dr. Ulus CEVIK (Member)
Signature ............................. Prof. Dr. Mehmet TUMAY (Member)
Signature ............................. Assist. Prof. Dr. Mutlu AVCI (Member)
Signature ............................. Assist. Prof. Dr. Suleyman TOSUN (Member)

This thesis was prepared in the Department of Electrical and Electronics Engineering of our Institute.
Code No:

Prof. Dr. Aziz ERTUNC
Director of the Institute
Signature and Seal

Note: The use of original and cited statements, tables, figures and photographs in this thesis without citing their sources is subject to the provisions of Law No. 5846 on Intellectual and Artistic Works.

To my dear family,

ABSTRACT Ph.D. THESIS

FLOATING-POINT HARDWARE DESIGNS FOR MULTIMEDIA PROCESSING

Metin Mete OZBILEN

DEPARTMENT OF ELECTRICAL AND ELECTRONICS ENGINEERING
INSTITUTE OF NATURAL AND APPLIED SCIENCES
UNIVERSITY OF CUKUROVA
Supervisor: Assoc. Prof. Dr. Mustafa GOK
Year: 2009, Pages: 120
Jury: Assoc. Prof. Dr. Mustafa GOK, Prof. Dr. Mehmet TUMAY, Assist. Prof. Dr. Mutlu AVCI, Assist. Prof. Dr. Ulus CEVIK, Assist. Prof. Dr. Suleyman TOSUN

In this dissertation, floating-point arithmetic circuits for multimedia processing are designed. The arithmetic operations floating-point add, floating-point multiply, floating-point multiply-add and floating-point division are researched, and specific hardware designs for them are implemented. Multimedia instructions are single instruction multiple data (SIMD) type instructions, and hardware designs that perform operations on packed data increase the speed of the execution of floating-point multimedia instructions. In this dissertation, multiplication, addition, subtraction and reciprocal operations are sped up, and additional functionality is added, using packed floating-point numbers.

Key Words: multimedia, hardware, design, floating-point, SIMD.

OZ (TURKISH ABSTRACT)
Ph.D. THESIS

FLOATING-POINT HARDWARE DESIGNS FOR MULTIMEDIA PROCESSING

Metin Mete OZBILEN
CUKUROVA UNIVERSITY, INSTITUTE OF NATURAL AND APPLIED SCIENCES
DEPARTMENT OF ELECTRICAL AND ELECTRONICS ENGINEERING
Supervisor: Assoc. Prof. Dr. Mustafa GOK
Year: 2009, Pages: 120
Jury: Assoc. Prof. Dr. Mustafa GOK, Prof. Dr. Mehmet TUMAY, Assist. Prof. Dr. Mutlu AVCI, Assist. Prof. Dr. Ulus CEVIK, Assist. Prof. Dr. Suleyman TOSUN

In this thesis, floating-point arithmetic circuit designs for multimedia processing are developed. For this purpose, the floating-point addition, multiplication, multiply-add and division arithmetic operations are investigated and dedicated hardware designs are implemented. Multimedia instructions are single instruction multiple data (SIMD) instructions; hardware that operates on packed data increases the execution speed of floating-point multimedia instructions. In this thesis, multiplication, addition, subtraction and reciprocal operations on packed floating-point numbers are accelerated, and functional improvements are provided alongside.

Key Words: multimedia, floating-point, hardware, design, SIMD.


TABLE OF CONTENTS

ABSTRACT
OZ
TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES
1 INTRODUCTION
2 PREVIOUS RESEARCH
   2.1 Floating Point Description
   2.2 Floating Point Rounding
      2.2.1 Round to Nearest Mode
      2.2.2 Round to Positive-Infinity
      2.2.3 Round to Negative-Infinity
      2.2.4 Round to Zero
   2.3 Floating Point Special Cases
   2.4 Floating Point Operations
      2.4.1 Floating Point Addition and Subtraction
      2.4.2 Floating Point Multiplication
      2.4.3 Floating-Point Multiply-Add Fused (FPMAF)
      2.4.4 Floating-Point Division
   2.5 Floating-Point Packed Data
      2.5.1 Packed Floating Point Addition and Subtraction
      2.5.2 Packed Floating Point Multiplication
      2.5.3 Packed Floating Point Division and Reciprocal
      2.5.4 Packed Floating Point Multiply Add Fused (MAF)
   2.6 Floating Point Packed Instruction Extensions
   2.7 Benchmarking SIMD
   2.8 Previous Packed Floating Point Designs
      2.8.1 Packed Floating Point Multiplication Designs
      2.8.2 Packed Floating Point Multiplier Add Fused Designs
   2.9 Previous Patented Packed Floating Point Designs
      2.9.1 Multiple-Precision MAF Algorithm
      2.9.2 Shared Floating Point and SIMD 3D Multiplier
   2.10 Method and Apparatus for Performing Multiply-Add Operation on Packed Data
   2.11 Multiplier Structure Supporting Different Precision Multiplication Operations
   2.12 Method and Apparatus for Calculating Reciprocals and Reciprocal Square Roots
3 THE PROPOSED FLOATING POINT UNITS
   3.1 The Multi-Precision Floating-Point Adder
   3.2 The Single/Double Precision Floating-Point Multiplier Design
   3.3 The Multi-Functional Double-Precision FPMAF Design
      3.3.1 The Mantissa Preparation Step
      3.3.2 The Implementation Details for the Multi-Functional Double-Precision FPMAF Design
   3.4 Multi-Functional Quadruple-Precision FPMAF
      3.4.1 The Preparation of Mantissas
      3.4.2 The Implementation Details for the Multi-Functional Quadruple-Precision FPMAF Design
   3.5 Multi-Precision Floating-Point Reciprocal Unit
      3.5.1 Derivation of Initial Values
      3.5.2 Newton-Raphson Iteration
      3.5.3 The Implementation Details for the Double/Single Precision Floating-Point Reciprocal Unit
4 RESULTS
   4.1 The Results for the Multi-Precision Floating-Point Adder Design
   4.2 The Results for the Single/Double Precision Floating-Point Multiplier Design
   4.3 The Results for the Multi-Functional Double-Precision FPMAF Design
   4.4 The Results for the Multi-Functional Quadruple-Precision FPMAF
   4.5 The Multi-Precision Floating-Point Reciprocal Unit
5 CONCLUSIONS
BIBLIOGRAPHY
CURRICULUM VITAE

LIST OF TABLES

Table 2.1  Rounding Modes Examples
Table 2.2  Effective Operation
Table 2.3  Operations of Packed MAF
Table 2.4  Word-lengths in Single/Double Precision MAF
Table 2.5  Multiply-Accumulate Patent
Table 2.6  Packed Multiply-Add Patent
Table 2.7  Packed Multiply-Subtract Patent
Table 3.1  The Execution Modes
Table 3.2  The Logic Equations for the Generation of the Modified Mantissas
Table 3.3  Quadruple Precision Execution Modes
Table 4.1  Area and Delay Estimates for Multi-Precision Floating Point Adder
Table 4.2  Additional Components in Multi-Precision Adder Design
Table 4.3  Area and Delay Estimates for Single/Double-Precision Multiplier Design
Table 4.4  Additional Components in Single/Double-Precision Multiplier Design
Table 4.5  Area Estimates for Double-Precision FPMAF Design
Table 4.6  Delay Estimates for Double-Precision FPMAF Design
Table 4.7  Additional Components in Multi-Functional Double-Precision FPMAF Design
Table 4.8  Area Estimates for Quadruple-Precision FPMAF Design
Table 4.9  Delay Estimates for Quadruple-Precision FPMAF Design
Table 4.10 Additional Components in Multi-Functional Quadruple-Precision FPMAF Design
Table 4.11 The Comparison of the Standard and Proposed Reciprocal Design
Table 4.12 Additional Components in Multi-Precision Reciprocal Design

LIST OF FIGURES

Figure 1.1  SISD vs SIMD Structure
Figure 2.1  Floating Point Number Parts
Figure 2.2  Single and Double Precision Formats
Figure 2.3  Single Precision Floating Point Representation
Figure 2.4  Additional Bits Used for Rounding
Figure 2.5  Floating Point Adder/Subtracter
Figure 2.6  Floating Point Multiplier
Figure 2.7  Floating-Point Multiply Add Fused
Figure 2.8  Newton-Raphson Iteration
Figure 2.9  Floating-Point Divider
Figure 2.10 SIMD Type Data Alignment
Figure 2.11 SIMD Type Data Alignment Example
Figure 2.12 SIMD Addition Alignment Example
Figure 2.13 SIMD Addition Numerical Example
Figure 2.14 SIMD Multiplication Alignment Example
Figure 2.15 SIMD Multiplication Numerical Example
Figure 2.16 SIMD Division Alignment Example
Figure 2.17 SIMD Reciprocal Numerical Example
Figure 2.18 SIMD Division Numerical Example
Figure 2.19 Packed Single Precision Floating Point Dot Product Results
Figure 2.20 3DNow! Technology Floating-Point Data Type
Figure 2.21 SIMD Extensions, Register Layouts, and Data Types
Figure 2.22 Motorola Altivec Vector Register
Figure 2.23 Benchmark Results with and without SIMD
Figure 2.24 Dual Mode Quadruple Precision Multiplier
Figure 2.25 The Divide-and-Conquer Technique
Figure 2.26 Two Single-Precision Numbers Packed in One Double-Precision Register
Figure 2.27 General Structure of Multiple-Precision MAF Unit
Figure 2.28 Shared Floating Point and SIMD 3D Multiplier
Figure 2.29 Multiply-Add Design for Packed Data
Figure 2.30 Multiplier Structure Supporting Different Precision Multiplication Operations
Figure 2.31 Reciprocal and Reciprocal Square Root Apparatus
Figure 3.1  The Alignments of Floating-Point Numbers in Multi-Precision Adder
Figure 3.2  The Block Diagram of Multi-Precision Floating-Point Adder
Figure 3.3  The Alignments for Double and Single Precision Numbers
Figure 3.4  The Multiplication Matrix for Single and Double Precision Mantissas
Figure 3.5  The Block Diagram for the Proposed Floating Point Multiplier
Figure 3.6  The Alignments of Double and Single Precision Floating-Point Operands in 64-bit Registers
Figure 3.7  The Partial Product Matrices Generated for (DPM) and (SPM)
Figure 3.8  The Matrix Generated for (DOP) Mode
Figure 3.9  The Mantissa Modifier Unit in the Double Precision FPMAF
Figure 3.10 The Block Diagram for Multi-Functional Double Precision FPMAF Design
Figure 3.11 The Alignments of Operands in 128-bit Registers
Figure 3.12 The Partial Product Matrices Generated for SPM Mode
Figure 3.13 The Matrix Generated for Single Precision Dot Product (SDOP) Mode
Figure 3.14 The Block Diagram for the Proposed Quadruple Precision FPMAF Design
Figure 3.15 Simple Reciprocal Unit that Uses Newton-Raphson Method
Figure 3.16 Alignment of Double Precision and Single Precision Mantissas
Figure 3.17 Multiplication Matrix for Single and Double Precision Mantissas
Figure 3.18 Alignment of Double and Single Precision Floating Point Numbers
Figure 3.19 The Proposed Single/Double Precision Reciprocal Unit

1. INTRODUCTION

Multimedia can be defined as multiples of media integrated together (Buford, 2007). The term media can mean text, graphics, audio, animation, video or data. Beyond media integration, multimedia is sometimes used for interactive types of media such as video games. Multimedia has become important in industry, education and entertainment. The information from televisions, magazines and web pages to movies can be thought of as multimedia streams. Advertising may be one of the largest industries using multimedia to convey messages to people (Buford, 2007). Another popular use of multimedia is interactive education. Human beings learn with their senses, especially sight and hearing, and a lecture that uses pictures and videos can help an individual learn and retain information much more effectively. Online learning applications replace the physical presence of the teacher with multimedia content and offer a more accessible learning environment.

One of the most popular multimedia application areas is graphics. In the beginning, 2D graphics applications were considered quite satisfying, but new applications raised the bar to 3D graphics (Hillman, 1997). Engineering CAD (Computer Aided Design)/CAM (Computer Aided Manufacturing), scientific visualization and 3D animation have become important aspects of multimedia. Graphics processing requires large computations that can be performed via specialized hardware in general purpose microprocessor extensions. These extensions consist of instructions that operate on packets of data. Instructions of this type perform a single operation on all the data in a packet, which is known as SIMD. SIMD instructions entered the personal computing world with Intel's MMX (Multimedia Extension) instructions added to the x86 instruction set (Lempel, Peleg, Weiser, 1997). Motorola introduced the Altivec instructions with the PowerPC G3 and later an improved version with the PowerPC G4 processor (Diefendorff, Dubey, Hochsprung, Scale, 2000).

The term SIMD (Single Instruction Multiple Data) denotes a processor structure in which a single instruction manipulates multiple data elements. As can be seen from Figure 1.1, a SIMD processor exploits a property of the data stream called data parallelism: a large amount of uniform data that needs the same operation performed on every element. For example, an application which fits the SIMD model is applying a filter to an image. When a raster-based image has to be filtered, the same filter has to be applied to all pixels of the image, and the computation of the filter equations for each pixel is the same. That means there is a single operation to be performed on multiple data.
Figure 1.1 SISD vs SIMD Structure
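The one-instruction-many-elements idea that Figure 1.1 illustrates can be mimicked in plain software. The following SWAR-style sketch (an illustration, not from the thesis) packs four 16-bit values into one 64-bit word so that a single add updates all four lanes at once:

```python
# SWAR-style sketch: apply one "add" to four packed 16-bit lanes of a
# 64-bit word, the way a SIMD unit applies one instruction to packed data.
def pack4(lanes):
    # pack four 16-bit values into one 64-bit word (lane 0 = low bits)
    word = 0
    for i, v in enumerate(lanes):
        word |= (v & 0xFFFF) << (16 * i)
    return word

def unpack4(word):
    return [(word >> (16 * i)) & 0xFFFF for i in range(4)]

def add4(a, b):
    # lane-wise 16-bit addition: mask off each lane's top bit first so a
    # carry cannot spill into the next lane, then restore the top bits
    low = 0x7FFF7FFF7FFF7FFF
    high = 0x8000800080008000
    return ((a & low) + (b & low)) ^ ((a ^ b) & high)

pixels = pack4([100, 200, 300, 400])
delta = pack4([5, 5, 5, 5])
print(unpack4(add4(pixels, delta)))  # all four lanes updated by one add
```

The masking trick is what the dedicated lane boundaries of a packed hardware adder provide for free.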

Today, many general-purpose processors have multimedia extensions that increase performance for 3D applications. Processors from AMD (Advanced Micro Devices) support 3DNow! and 3DNow!+ (AMD, 2000); these extensions add 21 instructions supporting packed floating-point arithmetic and packed floating-point comparison (Oberman, Favor, Weber, 1999). Intel has implemented SSE (Streaming SIMD Extension) since the Pentium 3 processor, with support for SIMD single precision floating-point operations and 64-bit integer SIMD operations, as well as cacheability control, prefetch and instruction ordering operations. SSE2 and SSE3 were introduced with the Pentium 4 processor (Singhal, 2004), adding support for packed double-precision floating-point operations and packed byte, word, doubleword and quadword operations, and SSE4 was introduced with the Core platform (Varghese, 2007), adding support for packed doubleword multiplies, floating-point dot products, simplified packed blending, packed integer operations and integer format conversions (Intel, 2007). Another trend for increasing the performance of graphics processing is the use of the computational power of graphics processing units (GPUs) (Macedonia, 2003). With the introduction of the GeForce 256 processor from NVIDIA in 1999, the graphics card processor can be used as a co-processor in graphics calculations. Since these cards are designed to execute fast graphics operations, they have high performance parallel processing units. The GeForce 3 had the first programmable vertex processor executing vertex shaders, along

with a configurable 32-bit floating-point fragment pipeline, programmed with Microsoft DirectX 8 and OpenGL. The Radeon 9700, introduced in 2002, featured a programmable 24-bit floating-point pixel-fragment processor programmed with Microsoft DirectX 9 (Charles, 2007) and OpenGL (Open Graphics Library) (Cole, 2005). The GeForce FX added 32-bit floating-point pixel-fragment processors. These GPUs have a register-based instruction set including floating-point, integer, bit, conversion, transcendental, flow control, memory load/store, and texture operations. Floating-point and integer operations include add, multiply, multiply-add, minimum, maximum, compare, set predicate, and conversions between integer and floating-point numbers (Lindholm, Nickolls, Oberman, Montrym, 2008). Recently, NVIDIA has introduced CUDA (Compute Unified Device Architecture), a general purpose parallel computing architecture that leverages the parallel compute engine in NVIDIA graphics processing units (GPUs) to solve many complex computational problems in a fraction of the time required on a CPU (Garland, Le Grand, Nickolls, Anderson, Hardwick, Morton, Phillips, Yao, Volkov, 2008).

This dissertation presents multi-precision and multi-functional floating-point units that can be used efficiently in graphics processing. The cited previous work shows that there is a considerable research effort on increasing the performance of multimedia applications; leading chip manufacturers introduce a new extension almost every year. The presented units also support dot product modes, which have never been implemented in any FPMAF (Floating-Point Multiply Add Fused) design. The quadruple-precision FPMAF has two dot product modes: one performs two double-precision floating-point multiplications and adds their products to another double-precision floating-point operand; the other performs four single-precision floating-point multiplications and adds their products to another single-precision floating-point operand. The double-precision FPMAF has only one dot product mode, which performs two single-precision floating-point multiplications and adds their products to another single-precision floating-point operand. The proposed designs achieve significant hardware savings by supporting these functions in one unit instead of using a separate circuit for each mode.

The dot product is also called the scalar product. It takes two real-valued vectors and generates a real scalar value; it is the inner product of an orthonormal Euclidean space (Arfken, 1985). By definition, the dot product is very useful in geometric and physics calculations, and two- and three-dimensional computer graphics deal with both. Our design simplifies and also speeds up this type of calculation. Instructions performing similar calculations exist in the multimedia extensions of today's popular processors: the Intel Pentium 4 has a single precision dot product instruction beginning with SSE4 (Intel, 2007), and AMD processors have an accumulate multiplication in the 3DNow! multimedia extension which performs a similar computation (AMD, 2007).

A multi-precision floating-point adder design is presented that overcomes the performance degradation caused by format conversion operations. The proposed multi-precision floating-point adder can perform four half-precision (in NVIDIA format) (NVidia, 2007) floating-point additions, or two single-precision floating-point additions, or a single double-precision floating-point addition. In low-precision operation modes, the results are generated in parallel. A floating-point adder with the proposed functionality has not been reported in the literature. Floating-point addition is one of the most common operations, and packed floating-point arithmetic can speed up image filtering by accessing multiple data at once. Both popular general purpose processors have packed single precision floating-point addition instructions in their multimedia extension instruction sets (AMD and Intel, 2007).

The following contributions are made by this dissertation:

A multi-precision floating-point adder/subtracter is designed that supports half, single and double precision floating-point additions (Ozbilen, Gok, 2008). Compared to a single-precision floating-point adder, the proposed multi-precision design can compute four half precision or two single precision additions simultaneously. Therefore, with the proposed design, the performance of single-precision addition can be doubled and that of half precision addition quadrupled.
In addition to these advantages, to the best of our knowledge, the proposed adder is the only multi-precision adder supporting half precision addition reported in the literature.

A floating-point multiplier design method that supports single and double precision multiplication is presented (Gok and Ozbilen, 2009b). Besides double precision multiplication, the proposed multiplier can simultaneously perform two single precision multiplications within the delay of a standard double precision multiplication.

One of the main advantages of the proposed design method is that it is applicable to all kinds of floating-point multipliers.

A multi-precision floating-point multiply-add fused design method is introduced, and using this method a double precision and a quadruple precision multiply-add design are implemented (Gok and Ozbilen, 2008). The proposed double precision multiply-add fused unit supports single and double precision multiply-add operations and a single precision dot-product operation. The proposed quadruple precision multiply-add fused unit supports single, double and quadruple precision multiply-add fused operations and single and double precision dot product operations. Compared to the previous state-of-the-art double-precision multiply-add fused designs presented in (Huang, Shen, Dai, and Wang, 2007) and (Jessani and Putrino, 1998), the proposed double-precision design has the following advantages: the dot product operation mode may double the performance of a matrix multiplication, and in dot product mode the rounding error is decreased, since only one rounding operation is performed, whereas a dot product computed with an ordinary multiply-add design requires as many roundings as the number of iterations. Quadruple precision multiply-add fused designs are very rare in academic research, though recent designs by major chip manufacturers exist. The design is therefore compared with the quadruple precision multiplier presented in (Akkas, Schulte, 2006): the proposed quad-MAF has 3% more area and approximately the same delay as the reference design, but its functionality far exceeds it.

A floating-point reciprocal unit design method based on the previous design methods is presented (Ozbilen, Gok, 2008). The double precision reciprocal unit designed with this method supports two single precision reciprocal operations with nearly the same delay.
This unit can also be enhanced by coupling it with the proposed double precision multiply-add fused unit to support division, divide-and-add or divide-and-subtract operations. The design is compared with the design presented in (Kucukkabak, Akkas, 2004); compared to the reference design, the proposed design can perform two reciprocal operations within the same critical delay.
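The reciprocal units above are built on Newton-Raphson iteration (detailed in Section 3.5.2). As a minimal software sketch of the method, each step computes x_(i+1) = x_i * (2 - d * x_i) and roughly doubles the number of correct bits of 1/d; the seed value here is an arbitrary guess standing in for the hardware's lookup-table initial value:

```python
# Newton-Raphson reciprocal iteration: x_{i+1} = x_i * (2 - d * x_i).
# Each step roughly doubles the number of correct bits of 1/d.
def reciprocal(d, x0, iterations=5):
    x = x0
    for _ in range(iterations):
        x = x * (2.0 - d * x)
    return x

# x0 is a crude guess here; the hardware unit derives its initial value
# from a lookup table instead (Section 3.5.1).
approx = reciprocal(3.0, x0=0.3)
print(approx, 1.0 / 3.0)
```

Because the iteration uses only multiplications and subtractions, a multiply-add unit like the proposed FPMAF can evaluate it directly, which is why coupling the two units also yields division.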


In general, all the proposed designs overcome the additional delay due to format conversion. Format conversion adds extra delay to a computation when a smaller precision operation is performed using a larger precision unit: the smaller precision operands are converted to the larger precision and, after the operation, the larger precision result is converted back to the smaller precision format.
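The conversion pair described above can be sketched in software; besides costing extra steps, the down-conversion also discards mantissa bits. A small illustration (not from the thesis) using Python's struct module, whose 'e' format is the 1-5-10 half-precision layout; note the thesis uses NVIDIA's half format, which this only approximates:

```python
import struct

# Down-convert a value to half precision and back: struct's 'e' format
# is the 1-5-10 half-precision layout. The round trip is extra work and
# the down-conversion discards mantissa bits.
x = 0.1
half_trip = struct.unpack('<e', struct.pack('<e', x))[0]
single_trip = struct.unpack('<f', struct.pack('<f', x))[0]

print(single_trip)  # ~7 decimal digits survive
print(half_trip)    # only ~3 decimal digits survive
```

A multi-precision unit that operates on the narrow formats natively avoids both the conversion steps and this widening/narrowing round trip.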

2. PREVIOUS RESEARCH


This section explains floating-point number formats; floating-point addition, subtraction, multiplication, multiply-add fused, division and reciprocal operations; and basic implementation methods for those operations. It also presents some of the significant previous work on floating-point circuits for multimedia operations, based on patents and/or research papers.

2.1 Floating Point Description

The floating-point format is used to represent very large or very small real numbers in computers and calculators. A floating-point number consists of three parts: a sign bit that shows whether the number is positive or negative, an exponent which represents the position of the radix point, and a mantissa which represents the digits of the number's magnitude. The sign, exponent and mantissa are placed as shown in Figure 2.1, where the sign is the most significant bit. This placement makes comparison of the numbers easier.
Figure 2.1 Floating Point Number Parts: | Sign | Exponent | Mantissa |

Since the acceptance of the IEEE standard in the late 80s, floating-point hardware in modern processors abides by the rules dictated by the IEEE-754 standard (IEEE, 1985). This has increased the portability of floating-point applications. Due to general demand, the standard is undergoing modifications (Microprocessor Standards Committee, 2006). The current draft of the standard can be accessed as ANSI (American National Standards Institute)-IEEE Standard 754. The main differences between the current draft and the IEEE-754 standard are the inclusion of decimal floating-point number formats and the quadruple precision format and the exclusion of the extended precision formats. The single and double precision formats are kept unchanged. The advantage of this notation is that the point can be placed so that long strings of leading or trailing zeros are avoided. The specific place for the point is typically just after the leftmost nonzero digit; because of this, the leftmost digit of the significand cannot be zero. This is called normalization, so there is no need to express the point explicitly: it is hidden. Popular general purpose processors, such as the Intel Pentium and the Motorola 68000 series, provide an 80-bit extended precision format, which has a 15-bit exponent and a 64-bit mantissa, with no hidden bit. The IEEE-754 standard has two different precision types: single, which has a 32-bit data width with an 8-bit exponent and a 23-bit mantissa, and double, which has a 64-bit data width with an 11-bit exponent and a 52-bit mantissa. The single and double formats are shown in Figure 2.2.
Figure 2.2 Single and Double Precision Formats: (a) single precision, bit 31 (s), bits 30-23 (e), bits 22-0 (m); (b) double precision, bit 63 (s), bits 62-52 (e), bits 51-0 (m).

The exponent is biased by 2^(8-1) - 1 = 127, so that the exponent range is -126 to +127. A normalized number has the value

V = (-1)^s × 2^(e-127) × 1.m    (2.1)

where
s = the sign bit, 0 for positive numbers and 1 for negative numbers,
e = the stored exponent, which holds the true exponent with 127 added to it (biased by 127),
m = the mantissa with a hidden leading one, where 1 ≤ 1.m < 2.

Since both formats have a finite number of bits for representing real numbers, numbers have to be approximated while they are converted to floating-point representation. Throughout the text, IEEE-754 format floating point numbers are referred to as floating-point. The single-precision representation of 0.15625 is shown in Figure 2.3.

Figure 2.3 Single Precision Floating Point Representation of the Real Number 0.15625: sign 0, exponent 01111100, mantissa 01000000000000000000000.
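The bit layout above can be reproduced in software. The sketch below (Python, using only the standard struct module; the helper name is illustrative, not from the thesis) encodes 0.15625 and splits out the three fields:

```python
import struct

def f32_fields(value):
    """Return (sign, biased_exponent, mantissa_bits) of an IEEE-754 single."""
    bits = int.from_bytes(struct.pack('>f', value), 'big')
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF      # 8-bit biased exponent
    mantissa = bits & 0x7FFFFF          # 23-bit fraction (hidden 1 not stored)
    return sign, exponent, mantissa

s, e, m = f32_fields(0.15625)
# 0.15625 = 1.25 x 2^-3, so the biased exponent is 127 - 3 = 124.
print(s, format(e, '08b'), format(m, '023b'))
```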

2.2 Floating Point Rounding

Floating point numbers are used to represent real numbers, but sometimes these numbers cannot be represented exactly; in this case the floating point number is rounded. For example, the real number 0.1 cannot be represented exactly in IEEE-754 format (IEEE, 1985):

0.1 = 0.000110011001100110011001100...    (2.2)

When it is rounded to the single precision format it is represented as

s = 0,  e = 01111011 (-4),  m = 10011001100110011001100    (2.3)

The exact decimal value after this conversion is

0.09999994    (2.4)

The difference between two consecutive floating-point numbers which have the same exponent is called a unit in the last place (ulp). For numbers with an exponent of 0, an ulp is exactly 2^-23, or about 10^-7, in single precision, and about 10^-16 in double precision. The IEEE-754 standard has four rounding modes: round to nearest even, round up (toward positive infinity), round down (toward negative infinity) and round toward zero. The IEEE-754 standard accepts round-to-nearest even as the default rounding for all fundamental algebraic operations (IEEE, 1985). Consider a floating point number x that lies between two real numbers R1 and R2, that is R1 ≤ x ≤ R2, and has to be rounded.

2.2.1 Round to Nearest Mode

In this mode, the inexact result is rounded to the nearer of the two adjacent values. If the result is in the middle, then the even alternative is chosen. This rounding is also known as round to even. It can be formulated as

Rnd(x) = R1            if |x - R1| < |x - R2|
         R2            if |x - R1| > |x - R2|
         Even(R1, R2)  if |x - R1| = |x - R2|    (2.5)

For example, 0.016 is rounded to 0.02 because the next digit 6 is 6 or more; 0.013 is rounded to 0.01 because the next digit 3 is 4 or less; 0.015 is rounded to 0.02 because the next digit is 5 and the hundredths digit 1 is odd; 0.045 is rounded to 0.04 because the next digit is 5 and the hundredths digit 4 is even; 0.04501 is rounded to 0.05 because the next digit is 5 but it is followed by non-zero digits.

2.2.2 Round to Positive-Infinity

This mode rounds inexact results to the possible value closer to positive infinity. It can be formulated as

Rnd(x) = R2    (2.6)

For example, 0.016 rounded to hundredths is 0.02; 0.013 rounded to hundredths is 0.02.

2.2.3 Round to Negative-Infinity

This mode rounds inexact results to the possible value closer to negative infinity. It can be formulated as

Rnd(x) = R1    (2.7)

For example, 0.016 rounded to hundredths is 0.01; 0.013 rounded to hundredths is 0.01.

2.2.4 Round to Zero

This mode rounds inexact results to the possible value closer to zero; in other words, the result is truncated. It can be formulated as

Rnd(x) = R1  if x ≥ 0
         R2  if x < 0    (2.8)

For example, 0.016 rounded to hundredths is 0.01.
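The decimal examples in the four modes above can be reproduced with Python's standard decimal module, whose rounding constants correspond to the IEEE modes (a sketch; ROUND_HALF_EVEN is round-to-nearest-even, ROUND_CEILING and ROUND_FLOOR are the two directed modes, ROUND_DOWN is round-toward-zero):

```python
from decimal import Decimal, ROUND_HALF_EVEN, ROUND_CEILING, ROUND_FLOOR, ROUND_DOWN

def to_hundredths(x, mode):
    """Round the decimal string x to two fractional digits under the given mode."""
    return str(Decimal(x).quantize(Decimal('0.01'), rounding=mode))

# Round-to-nearest-even: ties go to the even hundredths digit.
print(to_hundredths('0.015', ROUND_HALF_EVEN))  # '0.02' (1 is odd, round up)
print(to_hundredths('0.045', ROUND_HALF_EVEN))  # '0.04' (4 is even, stays)
# Directed modes on the same value:
print(to_hundredths('0.013', ROUND_CEILING))    # '0.02'
print(to_hundredths('0.013', ROUND_FLOOR))      # '0.01'
print(to_hundredths('0.016', ROUND_DOWN))       # '0.01'
```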

Examples for the rounding modes are summarized in Table 2.1. The real number 1.0016 is digitized to 40 bits, and its versions rounded to the 23-bit single precision mantissa are given in binary and decimal.

Table 2.1 Rounding Modes Examples

No Round                    100000000011010001101101110001011101011    1.0016
Round-to-Nearest            100000000011010001101110                   1.0016
Round-to-Positive Infinity  100000000011010001101110                   1.0016
Round-to-Negative Infinity  100000000011010001101101                   1.0015999
Round-to-Zero               100000000011010001101101                   1.0015999
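In hardware, rounding a long binary significand as in Table 2.1 is decided from the first few discarded bits, commonly called guard, round and sticky. The sketch below (Python; the function name and argument layout are mine, not from the thesis) implements round-to-nearest-even on an integer significand given those three bits:

```python
def round_nearest_even(kept, g, r, t):
    """Round the integer significand 'kept' using guard g, round r, sticky t."""
    if g and (r or t):          # more than half an ulp was discarded: round up
        return kept + 1
    if g and not r and not t:   # exactly half an ulp: round to even
        return kept + (kept & 1)
    return kept                 # less than half an ulp: truncate

print(round_nearest_even(0b1010, 1, 1, 0))  # round up
print(round_nearest_even(0b1011, 1, 0, 0))  # tie, lsb odd: round to even
print(round_nearest_even(0b1010, 1, 0, 0))  # tie, lsb even: unchanged
```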

2.3 Floating Point Special Cases

The following special cases are usually indicated by flags for floating-point operations:

Overflow: the exponent is incremented during the normalization and rounding step and reaches E ≥ 255; the overflow flag is set and the result is set to ∞.

Underflow: the exponent is decremented during normalization and reaches E = 0; the underflow flag is set and the fraction is left unnormalized.

Zero: when the mantissa is zero, E = 0 and F = 0, the zero flag is set.

Inexact: when the guard bits are not all zero, the inexact flag is set.

Not a number (NaN): when one operand or both are NaN, the result is set to NaN.

2.4 Floating Point Operations

2.4.1 Floating Point Addition and Subtraction

The most popular floating point operation is floating point addition. The addition of two floating point numbers X = (-1)^Sx × 2^Ex × Mx and Y = (-1)^Sy × 2^Ey × My can be formulated as

Mz = (-1)^Sx × Mx + (-1)^Sy × My × 2^(Ey-Ex)   if Ex ≥ Ey
     (-1)^Sx × Mx × 2^(Ex-Ey) + (-1)^Sy × My   if Ex < Ey    (2.9)

Ez = max(Ex, Ey)    (2.10)

where Z = (-1)^Sz × 2^Ez × Mz is the result. The floating point addition operation begins with the equalization of the exponents of the operands. The number with the smaller exponent is equalized by right shifting its mantissa while increasing the exponent by one with each shift. This operation is known as alignment. After the alignment of the mantissas, the effective operation takes place; the effective operation based on the signs is shown in Table 2.2.

Table 2.2 Effective Operation

Floating-Point Operation   Signs of Operands   Effective Operation (EOP)
Add                        equal               add
Add                        different           subtract
Subtract                   equal               subtract
Subtract                   different           add

The exponent of the result is chosen from one of the equalized exponents. The sign of the result is determined by selecting the larger operand. After the operation, the result might require normalization. The result can be in one of three forms:

1. The result is already normalized.

2. When the effective operation is addition, there might be an overflow in the mantissa.

3. When the effective operation is subtraction, there might be leading zeros.

For the second and third forms the result has to be normalized, and the exponent has to be updated according to the normalization shift amount. A Leading One Detector (LOD) determines the position of the leading one in the result. After normalization and exponent update, rounding of the result takes place. The alignment of the mantissa may increase the operand size of the result, but to obtain a correctly rounded result only three additional fractional bits are sufficient. These bits are called guard bits: guard (G), round (R), and sticky (T). They are shown in Figure 2.4, where F denotes the fractional part of the mantissa and L its least significant bit.

Figure 2.4 Additional Bits Used for Rounding: 1.F followed by the bits L, G, R, T.

In round-to-nearest-even mode the result is rounded up if G = 1 and R and T are not both 0, and rounded to even if G = 1 and R = T = 0. In round-towards-zero mode the result is truncated. In round-toward-positive-infinity mode the result is rounded up if G, R, and T are not all zero. In round-toward-negative-infinity mode a positive result is truncated. The basic floating point adder is shown as a block diagram in Figure 2.5. The function of each block is explained as follows. The Exponent Difference unit computes the difference of the exponents; the sign bit of the difference is used to select the greater exponent, which realizes Equation 2.10, and is also used by the Swap unit to decide which number has to be aligned. The EOP unit determines the effective operation given in Table 2.2. The Alignment unit right shifts by d digits. The Add/Sub unit performs the effective operation. The Normalization unit performs normalization based on the value generated by the LZA unit, which anticipates the number of leading zeros. The normalized result is rounded by the Round unit, and the mantissa of the result is generated. Based on the ovf signal, the Exponent Update unit increments the exponent value, and the exponent of the result is generated. The Sign unit determines the sign of the operation depending on the input signs and the result of the effective operation.

2.4.2 Floating Point Multiplication

Floating point multiplication is another popular floating point operation. The multiplication of floating point numbers X and Y producing the product Z is performed as

Mz = 1.Mx × 1.My    (2.11)
Ez = Ex + Ey    (2.12)
Sz = Sx ⊕ Sy    (2.13)

where Mx, My, and Mz are the mantissas, Ex, Ey, and Ez are the exponents and Sx, Sy, and Sz are the signs of the operands X, Y, and the result Z, respectively.

Figure 2.5 Floating Point Adder/Subtracter.
The computations in Equations 2.11-2.13 can be performed in parallel. The addition of the exponents in biased representation is performed by adding the exponents and subtracting the extra bias that comes from the second operand. The operation is expressed as

EB,z = EB,x + EB,y - B    (2.14)

where B is the bias value. Exponent addition can be performed using a fast carry propagate adder (CPA) (Koren, 2002). The sign of the result is evaluated with an XOR gate. The mantissa multiplication is usually performed by a fast parallel multiplier. Some of the popular methods used in mantissa multiplication are unsigned radix-2, signed Baugh-Wooley (Baugh and Wooley, 1973) and signed Booth (Booth, 1951). These methods are used for generating the multiplication matrix. The matrix is then reduced to carry-save vectors using reduction methods such as Wallace (Wallace, 1964) or Dadda (Dadda, 1965) reduction, and the final result is obtained using a final CPA. The multiplication of n-bit mantissas generates a 2n-bit product, P, but only n bits are needed in the result; the others are used in the generation of the guard bits. The sticky-bit is computed in parallel with the multiplication. The n - 2 least significant bits of P are not returned as a part of the rounded P, but for rounding it is important to know whether any of the discarded bits is a one; the sticky-bit represents this situation (Gok and Ozbilen, 2008). The trivial method for generating the sticky-bit simply ORs the n - 2 least significant bits of P. The sticky-bit can also be determined from the second half of the carry-save representation of the product (Bewick, 1994; Yu and Zyner, 1995). In Bewick's design a 1 is added into the partial product tree and is later corrected during the addition of the sum and carry vectors by setting the carry-in input of the CPA to one (Bewick, 1994). Yu and Zyner presented a method that determines whether the sum of the sum and carry vectors is zero without performing a carry-propagate addition (Yu and Zyner, 1995). After the multiplication step, normalization of the mantissa is performed. When 1 ≤ Mx, My < 2, the result is in the range [1, 4), so a normalization by shifting right one position might be needed; no left-shift normalization is needed in floating point multiplication. The mantissa is rounded as in floating point addition. The block diagram of a simple floating point multiplier can be seen in Figure 2.6. In the figure, the Exponent Addition unit computes Equation 2.14. The Multiplier unit

Figure 2.6 Floating Point Multiplier.
generates the product of the mantissas in carry-save format. The sign of the result is computed by an XOR gate in the Sign unit. The final carry propagate addition can be implemented with fast adder structures (Gurkaynak, Leblebici, Chaouati and McGuinness, 2000; Beaumont-Smith and Lim, 2001), such as carry-lookahead adders (Yu-Ting and Yu-Kumg, 2004; Fu-Chiung, Unger and Theobald, 2000; Wang, Jullien, Miller and Wang, 1993) or carry-skip adders (Min and Swartzlander, 2000; Chirca, Schulte, Glossner, Horan, Mamidi, Balzola and Vassiliadis, 2004). At the same time the carry and sum vectors are used by the Sticky unit for the sticky bit computation. After the unnormalized result is normalized in the Normalization unit, the Round unit performs rounding. The Exponent Update unit updates the exponent depending on the normalization and rounding operations (Even, Mueller and Seidel, 1997; Gok, 2007; Even and Seidel, 2000; Quach, Takagi and Flynn, 2004).
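The mantissa datapath just described, a 24 × 24-bit product, a possible one-position right shift, and a sticky bit ORed from the discarded low bits, can be sketched with plain integers. The following is a behavioral model only (Python; the helper name is mine, and it ORs all discarded bits into the sticky, whereas a real unit would keep the guard and round bits separately):

```python
def mul_significands(mx, my):
    """Multiply two 24-bit significands (hidden bit included).
    Return (top 24 bits, normalize shift, sticky) before rounding."""
    p = mx * my                          # 48-bit product, in [2^46, 2^48)
    shift = 24 if p >= 1 << 47 else 23   # extra right shift when product >= 2.0
    top = p >> shift
    sticky = (p & ((1 << shift) - 1)) != 0   # OR of all discarded bits
    return top, shift, sticky

# 1.5 * 1.25 = 1.875: exact, so sticky is 0.
print(mul_significands(0xC00000, 0xA00000))
# (1 + 2^-23)^2 = 1 + 2^-22 + 2^-46: the 2^-46 term is discarded, sticky = 1.
print(mul_significands(0x800001, 0x800001))
```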

2.4.3 Floating-Point Multiply-Add Fused (FPMAF)

The FPMAF unit calculates

Z = (X × Y) + W    (2.15)

where the operands X, Y, W are represented with (Mx, Ex), (My, Ey) and (Mw, Ew) respectively, and the result Z is represented with (Mz, Ez). All the mantissas are signed and normalized. Fusing the operations reduces the number of interconnections between units and provides more accuracy than separate multiply and add units; the accuracy comes from a single normalization and rounding step instead of two. The FPMAF can also be used to perform addition or multiplication by setting Y = 1.0 or W = 0.0, respectively (Ercegovac and Lang, 2004). The floating-point multiply-add fused operation is defined as

Mz = (-1)^(Sx ⊕ Sy) × 1.Mx × 1.My + (-1)^Sw × 1.Mw × 2^(Ew-(Ex+Ey-B))    (2.16)
Ez = max(Ex + Ey - B, Ew)    (2.17)

where the operands are X = (-1)^Sx × 2^Ex × Mx, Y = (-1)^Sy × 2^Ey × My and W = (-1)^Sw × 2^Ew × Mw, and B is the bias value. The mantissa multiplication of Mx and My is performed by a fast parallel multiplier, as in floating-point multiplication. The addition of the exponents Ex and Ey and the determination of the alignment shift for the operand Mw with biased exponents can be expressed as

d = Ex + Ey - Ew - B + m + 3    (2.18)

where d is the shift distance, B is the bias value, m is 1 + the length of the fractional part, and the 3 accounts for the extra guard bits. The main part of the FPMAF is the mantissa multiplier. After the generation of the multiplication matrix and its reduction to carry and sum vectors, the final adder can be modified to add a third floating point number (W). This addition can be realized with a Carry-Save Adder (CSA) and a Carry-Propagate Adder (CPA) (Harris and Sutherland, 2003). The alignment of W can be performed in parallel with the multiplication of the mantissas. The size of the shifter is 3m + 2 bits: 2m bits come from the result of the multiplication and m bits from the third floating point number; there are 2 more bits that can be used as

guard-bits. To avoid a bidirectional shift operation, the addend is positioned m + 3 bits to the left of the product in the shifter, so only right shifting is performed when necessary. A (3m + 2)-bit 3-2 Carry-Save Adder (CSA) is used for the addition of the 2m-bit carry and save vectors produced by the multiplier with the aligned Mw. The unnormalized resultant mantissa is obtained after a 2-1 carry propagate adder (CPA). Since the leftmost m + 2 bits of the adder input are always 0, the adder can be divided into an adder and an incrementer. The normalization in the FPMAF is performed as in floating-point addition. The leading one detector locates the position of the leading one. The left shifter can shift up to 2m positions; the additional m positions come from the initial position of the adder operands. The exponent is updated based on the shift amount. Rounding of the mantissa is performed after normalization, exactly as in floating-point addition. The determination of special values in floating-point addition is likewise applicable to the FPMAF design without any change. FPMAFs are usually pipelined to increase throughput. A typical pipelined FPMAF design with 3 stages is shown in Figure 2.7. The functional blocks are described as follows. The multiplication matrix unit generates the partial products in parallel with the alignment of W. The Distance unit computes the right shift amount d; the greater value between the sum of the exponents Ex and Ey, and Ew, is also selected in this unit. Then the aligned addend and the carry and sum vectors are added in the CSA unit. The resultant sum is obtained after the CPA unit; during this operation the sticky bit and the leading zeros are generated by the Sticky and LZA (Leading Zero Anticipator) units, respectively. The resultant sum is normalized with the value taken from the LZA unit, then rounded in the Round unit to its final value. The exponent is also adjusted to its final value with the values from the LZA and Round units.
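The accuracy benefit of the single rounding step can be demonstrated numerically. In the sketch below (Python; single precision is emulated through a struct round trip, an assumption that is safe here because every intermediate value fits exactly in a double), a fused multiply-add keeps a 2^-24 term that two separately rounded operations lose completely:

```python
import struct

def f32(v):
    """Round a Python float (double) to the nearest IEEE-754 single."""
    return struct.unpack('>f', struct.pack('>f', v))[0]

x = f32(1 + 2**-12)
y = f32(1 + 2**-12)
w = f32(-(1 + 2**-11))

# x*y = 1 + 2^-11 + 2^-24 is exact in double precision.
fused = f32(x * y + w)          # one rounding at the very end
separate = f32(f32(x * y) + w)  # round after the multiply, then after the add

print(fused)     # the tiny 2^-24 term survives
print(separate)  # 0.0: the multiply rounded it away (tie, round to even)
```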
The sign bit is determined in the Sign unit from the sum generated by the CPA unit.

2.4.4 Floating-Point Division

Though floating point division is not as popular as floating point multiplication or floating point addition, this operation is also supported in hardware in modern processors. The operation is expressed as

Q = X / D    (2.19)

Figure 2.7 Floating-Point Multiply Add Fused.

where the operand X = (-1)^Sx × 2^Ex × Mx is the dividend, D = (-1)^Sd × 2^Ed × Md is the divisor and Q = (-1)^Sq × 2^Eq × Mq is the quotient. All the mantissas are signed and normalized. The division of the mantissas and the exponent subtraction are performed as

Mq = 1.Mx / 1.Md    (2.20)
Eq = Ex - Ed    (2.21)

The division of the mantissas is realized either with a radix-2 or radix-4 digit recurrence method, or by multiplying the dividend x by the reciprocal of the divisor d. In the digit recurrence method, increasing the radix makes quotient-digit selection more complicated, but it reduces the number of iterations needed for the exact quotient. For simplicity, the radix-2 division algorithm is demonstrated below (Ercegovac and Lang, 2004).

1. Initialize: WS[0] ← x/2; WC[0] ← 0; Q[-1] = 0; q0 = 0.

2. Recurrence: for j = 0 to n + 1 (n + 2 iterations because of the initialization and the guard bit):
   q(j+1) ← SEL(y);
   (WC[j+1], WS[j+1]) ← CSA(2WC[j], 2WS[j], -q(j+1)·d);
   Q[j] ← CONVERT(Q[j-1], qj);
   end for.

3. Terminate: if w[n+2] < 0 then q = 2·(CONVERT(Q[n+1], q(n+2) - 1)) else q = 2·(CONVERT(Q[n+1], q(n+2))).

Here WS and WC represent the sum and carry vectors of the residual in redundant form, i.e. w[j] = (WC[j], WS[j]), where w is the residual (partial remainder), n is the precision in bits, qj ∈ {-1, 0, 1} is the jth quotient digit, and SEL is the quotient-digit selection function given in Equation 2.22, with y the value of the truncated carry-save shifted residual (2w[j]) with four bits (three integer bits and one fractional bit). Because of the range of y, 2w[j] also requires three integer bits and, therefore, w[j] has two integer bits. CSA is a carry-save adder,

-q(j+1)·d is in two's complement form, and CONVERT is the on-the-fly conversion function producing the accumulated quotient in conventional representation.

q(j+1) = SEL(y) =  1  if 0 ≤ y ≤ 3/2
                   0  if y = -1/2
                  -1  if -5/2 ≤ y ≤ -1    (2.22)
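A bit-serial flavor of the recurrence can be sketched with exact rational arithmetic. The version below (Python) is a simplified restoring radix-2 division with quotient digits {0, 1} rather than the carry-save SRT datapath with digits {-1, 0, 1} above, but it likewise produces one quotient bit per iteration:

```python
from fractions import Fraction

def restoring_div(x, d, nbits):
    """Divide x by d (both in [1, 2)) producing nbits quotient bits."""
    r = Fraction(x)
    d = Fraction(d)
    q = Fraction(0)
    weight = Fraction(1)        # weight of the next quotient bit
    for _ in range(nbits):
        if r >= d:
            q += weight         # quotient bit 1: subtract the divisor
            r -= d
        r *= 2                  # shift the partial remainder left
        weight /= 2
    return q

print(restoring_div(Fraction(15, 8), Fraction(3, 2), 24))  # 1.875/1.5 = 5/4
```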

In the latter method, Newton-Raphson iteration is used for the computation of the divisor reciprocal. The main idea of this method is to find a zero of a function; the derivation can be carried out with a Taylor series. It is illustrated in Figure 2.8. The Newton-Raphson

f(x)

f(xi )

f(x ) i

xi

xi+1

Figure 2.8

Newton-Raphson Iteration.

formula is

f(x(i+1)) = f(xi) + f'(xi) × (x(i+1) - xi)    (2.23)

If f(x(i+1)) is approximately 0, then

x(i+1) = xi - f(xi) / f'(xi)    (2.24)

where xi is the value at the ith iteration, f(xi) is the value of the function at xi and f'(xi) is the derivative of the function at xi. A lookup table is used to approximate the initial value of the iteration, and fast multipliers are used for getting closer to the result (Chen, Wang, Zhang and Hou, 2006). The division operation is formulated with this method as

q = x / d = x × (1/d)    (2.25)

The reciprocal 1/d is computed with the Newton-Raphson method as

f(q) = 1/q - d    (2.26)
q(i+1) = qi × (2 - qi × d)    (2.27)
q0 ≈ 1/d    (2.28)
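Equation 2.27 converges quadratically: the number of correct bits roughly doubles with each iteration, so a small lookup table plus a few multiply-subtract steps suffices for single precision. A sketch (Python; the seed value here is an arbitrary stand-in for a table lookup):

```python
def reciprocal(d, q0, iterations):
    """Approximate 1/d with the Newton-Raphson recurrence q = q*(2 - q*d)."""
    q = q0
    for _ in range(iterations):
        q = q * (2 - q * d)   # Equation 2.27
    return q

# Seed 0.6 for d = 1.5 (true reciprocal 0.666...): watch the error collapse.
for i in range(1, 5):
    print(i, abs(reciprocal(1.5, 0.6, i) - 2 / 3))
```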

The subtraction of the exponents in biased representation is performed by subtracting the exponents and adding the missing bias. The operation is expressed as

EB,q = EB,x - EB,d + B    (2.29)

where B is the bias value. The second step is the normalization of Mq and the update of the exponent. After the division the quotient is in the range (1/2, 2); for the IEEE-754 standard the range is [1, 2), so a normalization might be required when the result is less than 1, that is, a left shift and a decrement of the exponent. In the third step, rounding of the quotient is done; for the digit recurrence method the rounding takes place with the on-the-fly conversion (Ercegovac and Lang, 1987). The last step is the determination of special values; the treatment used in floating-point multiplication is applicable to floating point division without any change. A floating point divider can be seen in Figure 2.9.

2.5 Floating-Point Packed Data

Floating-point operations applied to multimedia data are of SIMD type. This type of instruction uses multiple data in packed form. For example, two single precision floating point numbers can be packed as shown in Figure 2.10. In this figure R1 holds A and C, R2 holds B and D, and R3 holds E and F. Multimedia applications perform the same operation on multiple data: for example, while processing a 3D scene of a movie, the same lighting transformation is applied to every pixel of the image, or while processing voice, the same filtering is applied to every sample. Generally multimedia data are packed in a low precision format, which means two or more of them can be stored in one higher precision word. Using this advantage, the number of loop iterations used for processing multimedia data might be reduced by

Figure 2.9 Floating-Point Divider.
Figure 2.10 SIMD Type Data Alignment: (a) a double precision floating-point number X, Y, Z in each of the registers R1, R2, R3; (b) two packed single precision floating-point numbers A|C, B|D, E|F in the registers R1, R2, R3.

using vector structures. Using these vectors, multiple additions, subtractions, multiplications or divisions can be performed at once.
Figure 2.11 SIMD Type Data Alignment Example: R1 holds A = 0 10000000 11000000000000000000000 (3.5) and C = 0 01111111 01000000000000000000000 (1.25); R2 holds B = 0 10000000 00000000000000000000000 (2.0) and D = 0 10000001 00000000000000000000000 (4.0).

2.5.1 Packed Floating Point Addition and Subtraction

Figure 2.12 SIMD Addition Alignment Example.

Figure 2.12 demonstrates the packed floating point addition operation on single precision operands: A is added to B and C to D, producing E and F respectively. It is formulated as

Sx = Sa if Ea > Ec, Sc if Ea < Ec;    Sy = Sb if Eb > Ed, Sd if Eb < Ed    (2.30)
Ex = max(Ea, Ec);    Ey = max(Eb, Ed)    (2.31)
Mx = 1.Ma + 1.Mc;    My = 1.Mb + 1.Md    (2.32)

Each member of the packed addition is added using the standard floating point addition algorithm shown in Equations (2.9) and (2.10). The mantissas of each addition are aligned in pairs simultaneously; then the effective operation is performed on the aligned mantissas at once. The exponents are also handled in pairs, the greater exponent being selected from


each pair. Both additions are normalized, rounded, and each exponent is updated simultaneously. Then the two results are packed in the order sign, exponent and mantissa of the first addition, then the second addition, as in Figure 2.10 (Gok and Ozbilen, 2008). The computed results and their layout in the resultant register can be seen in Figure 2.13, where the value in part E is 5.5 and in part F is 5.25.
Figure 2.13 SIMD Addition Numerical Example.
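The packed addition can be modeled in software by treating a 64-bit register as two 32-bit lanes. The sketch below (Python; struct-based lane emulation with illustrative names) applies the values of Figure 2.11, A = 3.5 and C = 1.25 in R1, B = 2.0 and D = 4.0 in R2, and computes both lane sums at once:

```python
import struct

def packed_add(r1, r2):
    """Lane-wise single-precision addition of two 64-bit packed registers."""
    a, c = struct.unpack('>2f', r1)           # high lane, low lane
    b, d = struct.unpack('>2f', r2)
    return struct.pack('>2f', a + b, c + d)   # each sum rounded back to single

r1 = struct.pack('>2f', 3.5, 1.25)   # A | C
r2 = struct.pack('>2f', 2.0, 4.0)    # B | D
e, f = struct.unpack('>2f', packed_add(r1, r2))
print(e, f)
```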

2.5.2 Packed Floating Point Multiplication

Figure 2.14 SIMD Multiplication Alignment Example.

Figure 2.14 demonstrates the packed floating point multiplication operation on data packets that contain two single-precision floating point numbers. Each corresponding member of the packets is multiplied independently as

Sx = Sa ⊕ Sc;    Sy = Sb ⊕ Sd    (2.33)
Ex = Ea + Ec - B;    Ey = Eb + Ed - B    (2.34)
Mx = 1.Ma × 1.Mc;    My = 1.Mb × 1.Md    (2.35)

Packed multiplication uses the double precision multiplication matrix for the multiplication of both mantissas. The reduction of the multiplication matrix is done by the double precision

matrix reduction tree. The sums of the exponents are also handled in the extended exponent adder of the double precision multiplier, in the same way as subword integer addition. The signs are computed simultaneously. The datapath of packed multiplication is otherwise the same as in the original floating point multiplication; normalization and rounding are done simultaneously for both lanes. Then the results are packed into one double-precision word, as in the packed floating point addition. The results of the multiplication and their alignment in 64 bits can be seen in Figure 2.15, where the value of part X is 7.0 and the value of part Y is 5.0.
Figure 2.15 SIMD Multiplication Numerical Example.
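Equations 2.33-2.35 can be checked at the field level for one lane. The sketch below (Python; a behavioral model with illustrative names, which simply truncates the low product bits because they are all zero for these operands) multiplies two singles by XORing the signs, adding the biased exponents minus 127, and multiplying the 24-bit significands:

```python
import struct

def f32_mul_fields(x, y):
    """Multiply two singles by operating on their sign/exponent/significand fields."""
    xb = int.from_bytes(struct.pack('>f', x), 'big')
    yb = int.from_bytes(struct.pack('>f', y), 'big')
    sign = (xb >> 31) ^ (yb >> 31)                           # Eq. 2.33
    exp = ((xb >> 23) & 0xFF) + ((yb >> 23) & 0xFF) - 127    # Eq. 2.34
    sig = ((xb & 0x7FFFFF) | 1 << 23) * ((yb & 0x7FFFFF) | 1 << 23)  # Eq. 2.35
    if sig >= 1 << 47:    # product in [2, 4): shift right, bump the exponent
        sig >>= 1
        exp += 1
    # Drop the 23 low product bits (all zero here, so no rounding is needed).
    out = sign << 31 | exp << 23 | (sig >> 23) & 0x7FFFFF
    return struct.unpack('>f', out.to_bytes(4, 'big'))[0]

print(f32_mul_fields(3.5, 2.0), f32_mul_fields(1.25, 4.0))
```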

2.5.3 Packed Floating Point Division and Reciprocal

In modern processors the packed division operation is performed using a multiplicative division method. In this method, the reciprocal of the packed divisors is multiplied with the packed dividends using the packed multiplication operation. In the packed reciprocal, the reciprocal of the floating point number in location B is computed using the Newton-Raphson method explained before, and the result is duplicated to location D.
Figure 2.16 SIMD Division Alignment Example.

Figure 2.16 demonstrates the operation of packed floating point division on packets that contain two single-precision floating point numbers. Each corresponding member
of the packets is multiplied with the reciprocal of the divisor independently as

Sx = Sa ⊕ Sc;    Sy = Sb ⊕ Sd    (2.36)
Ex = Ea - Ec + B;    Ey = Eb - Ed + B    (2.37)
Mx = 1.Ma × (1/1.Mc);    My = 1.Mb × (1/1.Md)    (2.38)

For example, in Figure 2.18, the floating point numbers in locations A and C on R1 are divided by 2.0. The floating point number 2.0 is put in B on R2, then the packed reciprocal operation is executed on register R2. The results of the reciprocal operation can be seen in Figure 2.17.
Figure 2.17 SIMD Reciprocal Numerical Example: B = D = 0 01111110 00000000000000000000000 (0.5).

Then the packed multiplication operation is executed between R1 and R2, completing the division operation. The results of the divisions are in locations X and Y on R3, with values 1.75 and 0.625 respectively, as can be seen in Figure 2.18.
Figure 2.18 SIMD Division Numerical Example.
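The two-step sequence above, a packed reciprocal followed by a packed multiplication, can be sketched end to end (Python; struct lane emulation as before with illustrative names, and the reciprocal is exact here since the divisor is 2.0):

```python
import struct

def f32(v):
    return struct.unpack('>f', struct.pack('>f', v))[0]

def packed_div_by_reciprocal(r1, divisor):
    """Divide both lanes of r1 by 'divisor' via a duplicated packed reciprocal."""
    rec = f32(1.0 / divisor)                  # packed reciprocal, duplicated
    r2 = struct.pack('>2f', rec, rec)         # B and D both hold 1/divisor
    a, c = struct.unpack('>2f', r1)
    b, d = struct.unpack('>2f', r2)
    return struct.pack('>2f', a * b, c * d)   # packed multiplication

r1 = struct.pack('>2f', 3.5, 1.25)            # A | C from Figure 2.11
x, y = struct.unpack('>2f', packed_div_by_reciprocal(r1, 2.0))
print(x, y)
```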

2.5.4 Packed Floating Point Multiply Add Fused (MAF)

As mentioned before, the multiplication and addition operations can be joined and replaced by a MAF circuit. A double precision FPMAF can be modified to work on two packed single precision numbers. The packed form of the FPMAF uses the main functions of


the standard FPMAF. The exponent units are slightly modified to handle the exponent addition and update operations of both multiplications. The rounding and normalization units are modified for both single/double precision and multiple data operations. The multiplication matrix is used to carry out the two multiplications of the packed data. The packed form of the MAF can have an additional function: the dot product. With the dot product operation, two pairs of single precision numbers can be multiplied and summed with a third single precision number, which might be a previously computed product. The multiplication matrix and the adders must be modified to handle this operation. A summary of the operations a packed MAF can perform is listed in Table 2.3, using the inputs in Figure 2.11.

Table 2.3 Operations of Packed MAF

Operation        Description
A·B + C·D + F    Dot product
A·B + C·D        Sum of products, by setting F = 0.0
A + C + F        Triple addition, by setting D and B to 1.0
A·B || C·D       Dual multiplication, by setting F = 0.0
A·B + F          Single MAF, by setting D or B to 0.0
A·B              Single multiplication, by setting D or B and F to 0.0
A + F            Single addition, by setting D or B to 0.0 and C to 1.0

As in packed multiplication, all other parts of the standard MAF are shared. As an example, the single precision dot product operation and its result are demonstrated in Figure 2.19. Here, the single precision floating point numbers in locations A and C, and B and D, are multiplied in pairs and added to the floating point number in location F, with value 3.75. The result of the dot product operation, with value 15.75, is shown in part E.
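The dot-product row of Table 2.3 can be checked with the operands of Figures 2.11 and 2.19. The sketch below (Python) models the fused behavior by accumulating A·B + C·D + F in double precision and rounding to single only once at the end, which is exact for these values:

```python
import struct

def f32(v):
    return struct.unpack('>f', struct.pack('>f', v))[0]

def dot_product_maf(a, b, c, d, f):
    """A*B + C*D + F with a single final rounding, as in the packed MAF."""
    return f32(a * b + c * d + f)

# A = 3.5, B = 2.0, C = 1.25, D = 4.0, F = 3.75: 7.0 + 5.0 + 3.75
print(dot_product_maf(3.5, 2.0, 1.25, 4.0, 3.75))
```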
Figure 2.19 Packed Single Precision Floating Point Dot Product Results.


2.6 Floating Point Packed Instruction Extensions

Today, many general purpose processors have multimedia extensions which include SIMD type instructions. AMD has the 3DNow! extension. 3DNow! technology is a set of instructions providing single-precision floating-point packed data operations to x86 programs. The 3DNow! architecture is an extension of the x86 MMX architecture; it uses the same registers and the same basic instruction formats, supporting register-to-register and memory-to-register instructions. 3DNow! technology introduces a single-precision floating point format to the existing MMX register set, compatible with the IEEE-754 single-precision format, as shown in Figure 2.20. 3DNow! instructions support two-way packed single-precision floating point addition, subtraction, multiplication and reciprocal operations.

Figure 2.20 3DNow! technology floating-point data type: two packed IEEE single-precision floating-point doublewords (32 bits × 2) (AMD, 2000).

The Intel Corporation introduced the SSE extensions with the Pentium III processor family. The SSE instructions operate on packed single-precision floating-point values contained in the XMM registers and on packed integers contained in the MMX registers. The SSE SIMD integer instructions are an extension of the MMX technology instruction set. Several additional SSE instructions provide state management, cache control, and memory ordering operations. The SSE instructions are targeted at applications that operate on arrays of single-precision floating-point data elements, including 3-D geometry, 3-D rendering, and video encoding and decoding applications. The packed floating point operations that SSE supports are addition, subtraction, multiplication, division and reciprocal, with two packed operands. The SSE2 extensions were introduced in the Pentium 4 processors. The SSE2 instructions operate on packed double-precision floating-point values contained in the XMM registers and on packed integers contained in the MMX and the XMM registers. Figure 2.21 shows a summary of the various SIMD extensions, the data types they operate on, and how the data types are packed into MMX and XMM registers (Intel, 2007). With the Core architecture, Intel introduced SSE4 and SSE4.1;

2. PREVIOUS RESEARCH
Metin Mete ÖZBİLEN

SSE4.1 also adds support for a packed floating-point dot product in both double and single precision data types.
Figure 2.21 SIMD Extensions, Register Layouts, and Data Types (Intel, 2007). SSE holds four packed single-precision floating-point values in an XMM register; SSE2 holds two packed double-precision floating-point values in an XMM register.

The instruction set of the PowerPC processor from Motorola is extended by AltiVec technology. AltiVec is based on SIMD-style parallel execution units that operate on 128-bit vectors. The AltiVec technology supports 16-way parallelism for 8-bit signed and unsigned integers, 8-way parallelism for 16-bit signed and unsigned integers, and 4-way parallelism for 32-bit signed and unsigned integers and IEEE-754 floating-point numbers. The AltiVec data elements can be seen in Figure 2.22. The AltiVec ISA (instruction set architecture) includes floating-point arithmetic, rounding and conversion, and compare and estimate operations. The set supports the packed single-precision floating-point operations addition, subtraction, multiply-add, multiply-subtract, and reciprocal on 4-way packed single-precision floating-point numbers. The target applications for the AltiVec technology are IP (Internet Protocol) telephony gateways, multi-channel modems, speech processing systems, echo cancelers, image and video processing systems, and scientific array processing systems, as well as network infrastructure such as Internet routers and virtual private network servers (Freescale, 2006).
Figure 2.22 Motorola AltiVec Vector Register (Motorola, 2000). The 128-bit quadword can be viewed as four 32-bit words (Word 0-3), eight 16-bit halfwords (HalfWord 0-7), or sixteen bytes (Byte 0-15).


2.7 Benchmarking SIMD

A benchmark is a test designed to measure the performance of one particular part of a computer. For example, one benchmark might test how fast your CPU (Central Processing Unit) is at floating-point calculations by performing billions of arithmetic operations and timing how long it takes to complete them all. There are very few benchmarking suites focused especially on SIMD architectures; some of them are DARPA, ALPBench, and MultiBench 1 and 2. DARPA (Defense Advanced Research Projects Agency) is an image-understanding benchmark and a widely accepted platform for the evaluation of parallel systems (Weems, Riseman, Hanson and Rosenfeld, 1991). MediaBench is a benchmark suite, introduced in 1997, that provides a set of full application-level benchmarks for studying video processing characteristics (Lee, Potkonjak and Mangione-Smith, 1997). ALPBench (All Levels of Parallelism for Multimedia) is a suite that includes five complex media applications from various sources: speech recognition, face recognition, ray tracing, and MPEG-2 (Moving Pictures Experts Group) encode/decode. Below are some benchmark results obtained with the MediaBench suite tools.

JPEG (Joint Photographic Experts Group): This package contains C software to implement JPEG image compression and decompression. Shade analyzer output:

#instruction count:  13905129
#alu ops:             8171845    %alu ops:    0.59
#immed ops:           5219031    %immed ops:  0.64

Stores
======
Total:  709615
st08:   139912  (0.20)
st16:    54861  (0.08)
st32:   514841  (0.73)
stxx:        1  (0.00)


Alu ops
=======
Total:  2208348
op08:    490216  (0.22)
op16:    255747  (0.12)
op32:   1462385  (0.66)
opxx:         1  (0.00)

#ops used for output:  2208348
%ops used for output:  0.27

Analyzer: /u/gs3/leec/leec/Projects/MediaBench/SPIX/SHADE/src/alu
Version: 1.0 (10/Mar/97) (shade version: 5.25 V8 SPARC ELF32 (14/Feb/95))
Uname: panther sun4u SunOS 5.5.1 Generic_103640-08
Start: Mon Jun 16 19:31:32 1997
Application: ./cjpeg -dct int -progressive -opt -outfile testout.jpg testimg.ppm
Application Instructions: 13905129
Stop: Mon Jun 16 19:32:07 1997
Instructions: 13905129
Time: 14.580 usr 0.010 sys 35.169 real 41.485%

Speed: 953.059 KIPS

MPEG: mpeg2play is a player for MPEG-1 and MPEG-2 video bitstreams. It is based on mpeg2decode by the MPEG Software Simulation Group. Shade analyzer output:

#instruction count:  175505114
#alu ops:             78655559    %alu ops:    0.45
#immed ops:           59915131    %immed ops:  0.76

Stores
======
Total:  11126484
st08:    1544167  (0.14)
st16:    1057402  (0.10)
st32:    7003691  (0.63)
stxx:    1521224  (0.14)

Alu ops
=======
Total:  16247622
op08:    1998403  (0.12)
op16:     362264  (0.02)
op32:   13886546  (0.85)
opxx:    1521224  (0.00)

#ops used for output:  16247622
%ops used for output:  0.21

Analyzer: /u/gs3/leec/leec/Projects/MediaBench/SPIX/SHADE/src/alu
Version: 1.0 (10/Mar/97) (shade version: 5.25 V8 SPARC ELF32 (14/Feb/95))
Uname: cheetah sun4u SunOS 5.5.1 Generic_103640-08
Start: Tue Jun 17 02:21:22 1997
Application: ../src/mpeg2dec/mpeg2decode -b mei16v2.m2v -r -f -o0 tmp%d
Application Instructions: 175505114
Stop: Tue Jun 17 02:24:15 1997
Instructions: 175505114
Time: 122.930 usr 0.120 sys 173.355 real 70.982%

Speed: 1426.291 KIPS

Testing the performance effects of SIMD instructions in practice requires special benchmarking suites. To learn how efficiently SIMD instructions work, a program suitable for SIMD operations must be written. An ideal program for showing SIMD performance must be repetitive in its method. An image or video processing application is a


good candidate, which the benchmark suites simulate. An investigation of SIMD instruction sets from the University of Ballarat uses a program to compute the approximate value of pi. They use the series given in Equation 2.39 for calculating pi:

1 - 1/3 + 1/5 - 1/7 + 1/9 - 1/11 + ... ≈ π/4   (2.39)
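The series lends itself to both scalar and vector evaluation. A rough sketch of the two styles in Python, with NumPy's array operations standing in for the SIMD lanes used in the study:

```python
import numpy as np

def pi_scalar(n):
    # SISD style: accumulate the alternating series one term at a time
    s = 0.0
    for k in range(n):
        s += (-1.0) ** k / (2 * k + 1)
    return 4.0 * s

def pi_vector(n):
    # SIMD style: evaluate all terms at once with vector operations
    k = np.arange(n)
    return 4.0 * np.sum((-1.0) ** k / (2 * k + 1))

n = 128_000  # iteration count taken from the study's main loop
print(abs(pi_scalar(n) - np.pi) < 1e-4, abs(pi_vector(n) - np.pi) < 1e-4)  # True True
```

The truncation error of the series after n terms is bounded by 4/(2n + 1), so 128,000 terms comfortably reach single-precision-level accuracy; the vector version does the same arithmetic without the per-term loop overhead.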

This is an inefficient algorithm; however, the large number of iterations makes it an ideal candidate. To show the effectiveness of SIMD, the main loop of the program performs 128,000 iterations 1000 times, which gives an accurate pi value with single-precision floating-point numbers. The algorithm is executed five times on:

1. A version that uses the CPU alone in a SISD manner.
2. A version optimized for AltiVec on the PowerPC chip.
3. A version optimized for SSE2 on the Intel (x86) chip.

In this study, 8 different configurations were used: a Pentium 4 with SSE3 at 2.80 GHz on Ubuntu Linux, a Pentium 4 with SSE3 at 2.80 GHz on OS X (Dev), a Pentium 4 with SSE3 at 1.40 GHz on Ubuntu Linux, a Pentium 4 with SSE3 at 1.40 GHz on OS X (Dev), a Pentium 4 with SSE2 at 2.00 GHz on Ubuntu Linux, a quad Xeon with SSE3 at 3.10 GHz on Gentoo Linux, a dual PowerPC G5 with AltiVec at 2.7 GHz on OS X version 10.4.3, and a PowerPC G5 with AltiVec at 1.4 GHz on OS X version 10.4. Figure 2.23 shows the scores obtained while the CPUs work with bare instructions and while they work with SIMD-type instructions. These figures show that SIMD-type instructions have a great impact on performance when they are usable. It is also seen that clock speed is highly effective on overall performance.

2.8 Previous Packed Floating Point Designs

2.8.1 Packed Floating Point Multiplication Designs

A recent work in (Akkas and Schulte, 2006) presents a quadruple-precision floating-point multiplier that supports two double-precision floating-point multiplications in parallel. The design is shown in Figure 2.24.


Figure 2.23 Benchmark results without SIMD and with SIMD (execution times in seconds for the eight configurations).


Figure 2.24 Dual-Mode Quadruple Precision Multiplier (Akkas and Schulte, 2006).


The same technique is also used for a dual-mode double-precision floating-point multiplier that performs two single-precision multiplications in parallel. The divide-and-conquer technique (Beuchat and Tisserand, 2002) is used to multiply the mantissas of high-precision floating-point numbers. This technique uses smaller multiplications and additions to compute a high-precision multiplication. Two n-bit numbers X and Y can each be divided into two parts:

X = X1 · k + X0   (2.40)
Y = Y1 · k + Y0   (2.41)

where k = 2^(n/2). The product X · Y is then computed as

X · Y = (X1 · k + X0) · (Y1 · k + Y0) = X1·Y1·k² + (X1·Y0 + X0·Y1)·k + X0·Y0   (2.42)

Figure 2.25 illustrates the technique given in Equation 2.42: the partial products X0·Y0, X0·Y1·k, X1·Y0·k, and X1·Y1·k² are aligned and summed.

Figure 2.25 The Divide-and-Conquer Technique (Akkas and Schulte, 2006).
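As a quick check of Equation 2.42, the recombination can be sketched in a few lines; this is a plain software sketch of the arithmetic, not the hardware datapath:

```python
def dc_mul(x, y, n):
    """Divide-and-conquer n-bit multiply: split each operand at
    k = 2^(n/2) and combine four half-size products (Equation 2.42)."""
    k = 1 << (n // 2)
    x1, x0 = x // k, x % k        # X = X1*k + X0
    y1, y0 = y // k, y % k        # Y = Y1*k + Y0
    return x1 * y1 * k * k + (x1 * y0 + x0 * y1) * k + x0 * y0

# 8-bit example: the recombined result matches the direct product
assert dc_mul(0xAB, 0xCD, 8) == 0xAB * 0xCD
```

In the dual-mode multiplier, the same four sub-products either combine into one wide product or serve as two independent narrower products, which is what enables the packed mode.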

2.8.2 Packed Floating Point Multiplier Add Fused Designs

One of the few multi-functional MAF designs is presented in (Heikes and Colon-Boneti, 1996). That study describes two floating-point multiply-add units capable of performing IEEE-754 compliant single and double precision floating-point operations. It is, of course, possible to use a larger-precision floating-point unit to operate on smaller-precision operands; however, this requires the conversion of the smaller-precision operands to the larger-precision format and then conversion of the result back to the smaller-precision format. The conversion operations might significantly reduce performance. Another MAF design is presented in (Huang, Shen, Dai, and Wang, 2007). That study proposes a new architecture for the MAF unit that supports multiple-precision IEEE multiply-add operations with a Single Instruction Multiple Data (SIMD) feature. The proposed MAF unit can perform either one double-precision or two parallel single-precision operations using about 18% more hardware than a conventional double-precision MAF unit, with a 9% increase in delay. The simultaneous computation of two single-precision MAF operations is obtained by redesigning several basic modules of the double-precision MAF unit; the adaptations are either segmentation by precision-mode-dependent multiplexers or duplication of hardware. The proposed MAF unit can be fully pipelined, and the experimental results show that it is suitable for processors with a floating-point unit (FPU). Figure 2.26.a shows the 64-bit double-precision register used to store two single-precision numbers, and Figure 2.26.b shows the results generated when performing two single-precision MAF operations.
Figure 2.26 Two Single-Precision Numbers Packed in One Double-Precision Register (Huang, Shen, Dai, and Wang, 2007). (a) Each 64-bit register holds {S2, E2, F2} in bits 63-32 and {S1, E1, F1} in bits 31-0. (b) The two results are R1 = A1 × B1 + C1 and R2 = A2 × B2 + C2.
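The packing of Figure 2.26(a) is easy to model in software: each single-precision operand keeps its own sign, exponent, and fraction fields inside the 64-bit register. A small sketch (field positions follow the standard IEEE single-precision layout, replicated in each half):

```python
import struct

def pack_two_singles(f1, f2):
    """Place f1 in bits [31:0] and f2 in bits [63:32] of a 64-bit
    register, matching the S/E/F field layout of Figure 2.26(a)."""
    (b1,) = struct.unpack('<I', struct.pack('<f', f1))
    (b2,) = struct.unpack('<I', struct.pack('<f', f2))
    return (b2 << 32) | b1

w = pack_two_singles(1.0, -2.0)
# the two sign bits land at positions 31 and 63
print((w >> 31) & 1, (w >> 63) & 1)  # 0 1
```

Because the fields never straddle the 32-bit boundary, a datapath can be split at that boundary by multiplexers, which is exactly the segmentation approach Huang et al. use.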

The MAF unit is considered as an exponent unit and a mantissa unit. From Table 2.4, it is seen that for exponent processing the word length of the 13-bit double-precision exponent would have to be extended to 20 bits for two single-precision computations; for speed, two separate single-precision exponent datapaths are used in this design instead.

Table 2.4 Word-lengths in Single/Double Precision MAF

Module                          Single  Double
Multiply Array                      24      53
3-2 CSA                             48     106
Alignment-Adder-Normalization       74     161
Exponent Processing                 10      13

The algorithm below shows the mantissa datapath of the simplified multiple-precision MAF unit. In the algorithm, sa, ea, and fa denote the sign, exponent, and mantissa of operand A, respectively; the same rule applies to operands B and C. The control signal double is set for double-precision operation. The notation x[m : n] denotes the portion of x from bit n to bit m. In Step 3, s.sub denotes the sign of the effective mantissa addition for the double-precision operation, and s.sub1 and s.sub2 denote the signs for the two single-precision operations. The MAF unit derived from the algorithm is shown in Figure 2.27.

2.9 Previous Patented Packed Floating Point Designs

2.9.1 Multiple-Precision MAF Algorithm

The algorithm requires A, B, and C to be normalized numbers (Huang, Shen, Dai, and Wang, 2007). The exponent difference d determines the alignment shift.

Step 1: Exponent Difference: d[19 : 0]
if double = 1 then
    d[12 : 0] = ea[12 : 0] + eb[12 : 0] - ec[12 : 0] - 967
else
    d[9 : 0] = ea[9 : 0] + eb[9 : 0] - ec[9 : 0] - 100
    d[19 : 10] = ea[19 : 10] + eb[19 : 10] - ec[19 : 10] - 100
end if

Step 2: Mantissa Product: fprod[105 : 0]


Figure 2.27 General structure of the multiple-precision MAF unit (Huang, Shen, Dai, and Wang, 2007).

if double = 1 then
    fprod[105 : 0] = fa[52 : 0] × fb[52 : 0]
else
    fprod[47 : 0] = fa[23 : 0] × fb[23 : 0]
    fprod[96 : 49] = fa[48 : 25] × fb[47 : 24]
end if

Step 3: Alignment and negation: fca[160 : 0]
if double = 1 then
    fca[160 : 0] = (-1)^s.sub × fc[52 : 0] × 2^(-d[12 : 0])
else
    fca[73 : 0] = (-1)^s.sub1 × fc[23 : 0] × 2^(-d[9 : 0])
    fca[148 : 75] = (-1)^s.sub2 × fc[47 : 24] × 2^(-d[19 : 10])
end if

Step 4: Mantissa Addition: facc[160 : 0]
facc[160 : 0] = fprod[105 : 0] + fca[160 : 0]

Step 5: Complementation: faccabs[160 : 0]
if double = 1 then
    faccabs[160 : 0] = |facc[160 : 0]|
else
    faccabs[73 : 0] = |facc[73 : 0]|
    faccabs[148 : 75] = |facc[148 : 75]|
end if

Step 6: Normalization: faccn[160 : 0]
if double = 1 then
    faccn[160 : 0] = normshift(faccabs[160 : 0])
else
    faccn[73 : 0] = normshift(faccabs[73 : 0])
    faccn[148 : 75] = normshift(faccabs[148 : 75])
end if

Step 7: Rounding: fres[51 : 0]
if double = 1 then


    fres[51 : 0] = round(faccn[160 : 0])
else
    fres[22 : 0] = round(faccn[73 : 0])
    fres[45 : 23] = round(faccn[148 : 75])
end if
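The net effect of Steps 2 through 7 is that the product and the addend are combined exactly and rounded only once. A numerical sketch of that single-rounding property, using exact rational arithmetic in place of the wide internal datapath:

```python
from fractions import Fraction

def maf(a, b, c):
    """A*B + C with one final rounding: the exact value is formed
    first (Steps 2-4), then rounded once to double (Steps 6-7)."""
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)  # float() rounds the exact rational correctly

a = b = 1.0 + 2.0 ** -30
c = -1.0
# separate rounding of a*b discards the low product bits; the MAF keeps them
print(maf(a, b, c) == (a * b) + c)  # False
```

Here the exact product is 1 + 2^-29 + 2^-60; a standalone multiply rounds the 2^-60 term away before the subtraction cancels the leading 1, while the fused operation preserves it.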

2.9.2 Shared Floating Point and SIMD 3D Multiplier

This is a multiplier that can perform multiplications of scalar floating-point values (X × Y) and packed floating-point values (X1 × Y1 and X2 × Y2). The multiplier can also be configured to compute X × Y × Z, and it can compute two versions of the result: with or without an overflow exception. The main functional units of the design are shown in Figure 2.28. In Figure 2.28, the multiplexers at the input select the multiplier and multiplicand according to a state-machine control signal. The selected inputs are routed to the Booth encoders and the adder. The outputs of the Booth encoders are routed to Booth multiplexers for generating partial products. The selected partial products are reduced to carry and sum vectors in the adder tree. The pre-rounded results are generated at carry-save adders by adding a rounding constant to the carry-save vectors; the addition is calculated twice in parallel, for the overflow and no-overflow conditions, to reduce processing time. The outputs of the carry-save adders are passed to carry-propagate adders and to the sticky unit for the rounding operation. The normalization units perform corrections, and the rounded-result selection unit then decides which result will be used. The multiplier can operate on operands of up to 76 bits. It can be configured to perform all AMD 3DNow! (AMD, 2007) SIMD floating-point multiplications. The adder tree can multiply 76-by-76-bit operands or 24- to 32-bit packed floating-point operands. It is pipelined to increase instruction throughput. In the first stage, an adder generates the 3X multiple of the multiplicand, and the Booth encoders generate the signals that control the Booth multiplexers for generating signed multiples of the multiplicand. In the second stage, the partial products are reduced to two vectors in the adder tree. The first portion of the multiplier's rounding, which involves the addition of rounding constants with


Figure 2.28 Shared Floating Point and SIMD 3D Multiplier (Oberman, 2002). The four pipeline stages comprise the input multiplexers, 3X adder, and Booth encoders (Stage 1); the Booth multiplexers and binary tree with rounding-constant CSAs for the overflow and no-overflow cases (Stage 2); the carry-propagate adders and sticky logic (Stage 3); and normalization and rounded-result selection (Stage 4).


CSAs, is done in this stage. Because the result is not yet known, the addition is performed twice, once for each overflow condition. The carry-save adders can also be configured to perform a back-multiply-and-subtract operation, which is used to compute the remainder required for division and square-root operations. In the third stage of the pipeline, three versions of the carry-assimilated results are computed, and the sticky bit is generated in parallel from the carry and sum vectors. In the fourth stage, normalization is done and rounding is completed. The most significant bit of the unrounded result determines which rounded result will be used. For division and square-root iterations, a result Ri is also computed; Ri is the one's complement of the unrounded multiplication result.

2.10 Method and Apparatus For Performing Multiply-Add Operation on Packed Data

This is a design from the Intel Corporation that primarily performs multiply-add operations on packed data as part of a processor system. The design performs various operations on first and second packed data to generate a third packed data. The main functional blocks of the design can be seen in Figure 2.29. The design can perform the operations given in Table 2.5, Table 2.6, and Table 2.7. The packed data can be in three forms: packed byte, packed word, and packed doubleword. A packed byte storage is 64 or 128 bits long and contains 8 or 16 elements. A packed word storage is 64 or 128 bits long and contains 4 or 8 elements, each 16 bits long. A packed doubleword can be 64 or 128 bits long and contains 2 or 4 elements, each 32 bits long. The design also supports packed single and packed double formats, which contain floating-point elements. A packed single can be 64 or 128 bits long and contains 2 or 4 single data elements, each of 32 bits. A packed double can also be 64 or 128 bits long and contains 1 or 2 double data elements, each of 64 bits. The multiply-add and multiply-subtract instructions can be executed on multiple data elements at the same time as a single multiplication operation on unpacked data. Parallelism may be used to process


Figure 2.29 Multiply-Add Design for Packed Data (Debes, Macy, Tyler, Peleg, Mittal, Mennemeier, Eitan, Dulong, Kowashi, and Witt, 2008). Under operation control, each source pair feeds Booth encoders and partial-product generation, a compression array, and a full adder; saturation detection selects between the adder output and saturation constants before the result register.


Table 2.5 Multiply-Accumulate

Multiply-Accumulate Source1, Source2:
  Source1:  A1
  Source2:  B1
  Result1 = A1 × B1 + accumulated value

Table 2.6 Packed Multiply-Add

Packed Multiply-Add Source1, Source2:
  Source1:  A1 | A2 | A3 | A4
  Source2:  B1 | B2 | B3 | B4
  Result1 = A1 × B1 + A2 × B2 | A3 × B3 + A4 × B4

Table 2.7 Packed Multiply-Subtract

Packed Multiply-Subtract Source1, Source2:
  Source1:  A1 | A2 | A3 | A4
  Source2:  B1 | B2 | B3 | B4
  Result1 = A1 × B1 - A2 × B2 | A3 × B3 - A4 × B4

data at the same time. Figure 2.29 shows the details of the packed multiply-add/subtract operation. The operation control unit enables the circuit. The packed multiply-add/subtract circuit contains 16-by-16 multiplier circuits and 32-bit adders. The first 16-by-16 multiplier contains a Booth encoder whose inputs are Source1[63:48] and Source2[63:48]; the Booth encoder selects partial products depending on its inputs. The second 16-by-16 multiplier also contains a Booth encoder, with inputs Source1[47:32] and Source2[47:32], which likewise selects partial products depending on its inputs. For example, the selected partial product is zero if Source1[47:45] is 000 or 111; Source2[47:32] if Source1[47:45] is 001 or 010; 2 times Source2[47:32] if Source1[47:45] is 011; negative 2 times Source2[47:32] if Source1[47:45] is 100; or negative 1 times Source2[47:32] if Source1[47:45] is 101


or 110. Similarly, Source1[45:43], Source1[43:41], Source1[41:39], and so on are used to select the respective partial products. The partial products are routed to the compression array, where they are aligned according to Source1. The compression array may be implemented as a Wallace-tree structure of carry-save adders or as a signed-digit adder structure. The results are then routed to the adder. Depending on the operation, the compression array and adders perform addition or subtraction. The results are routed to the result register for formatting the output.

2.11 Multiplier Structure Supporting Different Precision Multiplication Operations

This is a multiplier design that can operate on both integer and floating-point operands. The multiplier is designed in sub-tree form, so it can be configured as a single-tree structure for non-SIMD operation or partitioned into 2 or 4 trees for SIMD operation. The design can be seen in Figure 2.30, which also shows the various ways the multiplier can be partitioned. When the multiplier is configured for 4 partitions, 4 multiplications are executed simultaneously on independent data. When the multiplier is configured for 2 partitions, two similar 32-bit structures (Tree AB in Figure 2.30) multiply in parallel. When the multiplier is not partitioned, the combined 64-bit structure (Tree ABCD in Figure 2.30) performs the multiplication. Various partitioned tree structures can be formed in order to support different multiplier configurations. The data flow can be summarized as follows: first, a partial product is generated for each bit of the multiplier, then the partial products are summed in a Wallace tree of carry-save adders (CSA). In the binary number system, each multiplier bit is either one or zero, so each partial product is either 1 times the multiplicand or 0. The number of partial products that have to be added is related to the number of non-zero bits in the multiplier. Booth encoding is used to reduce the number of partial products.
Booth encoding uses two side-by-side bits, as well as the MSB (most significant bit) of the previous pair of bits, to determine each partial product.
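The recoding rule just described can be sketched as a table lookup over overlapping bit triplets (radix-4 Booth). Each recoded digit selects 0, ±1x, or ±2x the multiplicand, halving the number of partial products:

```python
def booth_digits(multiplier, bits):
    """Radix-4 Booth recoding: scan overlapping triplets (two new bits
    plus the MSB of the previous pair) and map each to a digit in
    {-2, -1, 0, +1, +2}. The multiplier is read as two's complement."""
    m = multiplier & ((1 << bits) - 1)
    digits = []
    prev = 0  # implicit zero to the right of the LSB
    for i in range(0, bits, 2):
        triple = (((m >> i) & 0b11) << 1) | prev   # b[i+1] b[i] b[i-1]
        digit = {0b000: 0, 0b001: 1, 0b010: 1, 0b011: 2,
                 0b100: -2, 0b101: -1, 0b110: -1, 0b111: 0}[triple]
        digits.append(digit)
        prev = (m >> (i + 1)) & 1
    return digits  # digits[j] carries weight 4**j

def booth_value(digits):
    return sum(d * 4 ** j for j, d in enumerate(digits))

# 27 = 0b011011 recodes into three signed digits instead of six bits
assert booth_value(booth_digits(0b011011, 6)) == 27
```

The ±2x multiples come from a simple shift, and the ±1x/±2x negations are handled with complement-and-increment in the CSA tree, so no extra multiplier hardware is needed beyond the 3X adder in designs that use radix-8.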


Figure 2.30 Multiplier Structure Supporting Different Precision Multiplication Operations (Jagodik, Brooks, and Olson, 2008). Booth-encoded operands pass through format multiplexers and Booth multiplexers into levels of 4:2 CSAs and pipeline registers; the sub-trees can be combined into Tree AB or Tree ABCD, and a 128-bit adder produces the final result.


2.12 Method and Apparatus for Calculating Reciprocals and Reciprocal Square Roots

This design is part of a microprocessor design from AMD Inc. It gives the processor the capability of evaluating the reciprocal and the reciprocal square root of an operand. The processor has a multiplier that can be used to perform the required iteration operations. The design uses two paths: one assumes that overflow has occurred, the other that no overflow has occurred. The intermediate results are stored for the next iteration. The general form of the design is shown in Figure 2.31. The design realizes division through the reciprocal and a multiplication: the operation is formulated as A × B⁻¹, where A is the dividend and B is the divisor. The reciprocal of the divisor is computed using a version of the Newton-Raphson iteration. The iteration equation used for calculating the reciprocal of B is

X1 = X0 × (2 - X0 × B)   (2.43)

The iteration needs an initial estimate X0, which is determined from a ROM (read-only memory). Once X0 is determined, it is multiplied by B. After the multiplication, the term (2 - X0 × B) is formed by inverting the term (X0 × B); the one's complement is used to speed up the calculation. The corresponding sign and exponent bits are computed alongside the mantissa computation. The approximations for (2 - X0 × B) are performed in parallel by the two paths; using the double path may save time in normalization by not needing the normalization bits. After this step, the result is passed back to the multiplier to complete the iteration by multiplying with X0. If the desired accuracy is reached, the results are output; if not, the iteration is repeated, and the results of the multiplication are again passed down the two paths in parallel. The accuracy depends on the initial guess X0.
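The convergence behavior of Equation 2.43 is easy to demonstrate in software. In the sketch below, a linear minimax seed for b in [1, 2) stands in for the hardware's ROM lookup; the seed coefficients are an illustrative choice, not AMD's table contents:

```python
def reciprocal(b, iterations=4):
    """Newton-Raphson reciprocal, Equation 2.43: X1 = X0 * (2 - X0 * B)."""
    assert 1.0 <= b < 2.0
    x = (24.0 - 8.0 * b) / 17.0   # seed: |1 - x*b| <= 1/17 on [1, 2)
    for _ in range(iterations):
        x = x * (2.0 - x * b)     # the error roughly squares each step
    return x

print(abs(reciprocal(1.6) - 0.625) < 1e-12)  # True
```

Because the relative error squares on every pass, the number of correct bits doubles per iteration; this is why the accuracy of the final result is governed by the width of the initial-estimate ROM and the iteration count.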


Figure 2.31: Reciprocal and Reciprocal Square Root Apparatus (Oberman, Juffa, and Weber, 2000). An initial-estimate generator feeds the partial-product generator and selection logic; parallel overflow and no-overflow paths, each with CSA, CPA, sticky-bit logic, and normalization, feed a final multiplexer that outputs the rounded and normalized result.


3. THE PROPOSED FLOATING POINT UNITS

This section presents the floating-point designs for multimedia processing. The following designs are discussed in detail: the multi-precision floating-point adder, the double/single floating-point multiplier, the multi-functional double-precision floating-point MAF, the multi-functional quadruple-precision floating-point MAF, and the multi-precision floating-point reciprocal unit.

3.1 The Multi-Precision Floating-Point Adder

The proposed multi-precision adder can operate on double, single, and half precision numbers. In single-precision addition mode, two simultaneous floating-point additions are performed; in half-precision addition mode, four simultaneous floating-point additions are performed. The input operands for the multi-precision adder are packed based on the operation mode. Figure 3.1 presents the alignments of double, single, and half precision floating-point numbers and their sums in three 64-bit registers R1, R2, and R3. The registers are used for demonstration purposes; they are not a part of the actual implementation. In Figure 3.1.a, the double-precision floating-point numbers X and Y and their sum Z are shown. In Figure 3.1.b, four single-precision floating-point numbers A, B, C, D and their sums E and F are shown. In Figure 3.1.c, eight half-precision floating-point numbers K, L, M, N, P, R, S, T and their sums I, O, Q, and V are shown in the NVIDIA half-precision format (NVIDIA, 2007). The half-precision format described by NVIDIA is not included in the IEEE-754 standard; however, it is widely used in graphics processing applications.

Figure 3.2 presents the block diagram of the proposed multi-precision floating-point adder. The design of this adder is based on a modified version of the single-path floating-point adder presented in (Ercegovac and Lang, 2004). The mode of operation is selected by a control signal M. When M = 01 (Mode 1), a double-precision floating-point addition is performed. When M = 10 (Mode 2), two parallel single-precision floating-point additions are performed. When M = 11 (Mode 3), four parallel half-precision floating-point additions are performed. EOP represents the effective operation. To reduce


Figure 3.1: The Alignments of Double, Single, and Half Precision Floating-Point Numbers. (a) Double precision: X, Y, and their sum Z, each with sign S (bit 63), exponent E (bits 62-52), and mantissa M (bits 51-0). (b) Single precision: A and C in R1, B and D in R2, and the sums E and F in R3, two values per register. (c) Half precision: K, M, P, S in R1, L, N, R, T in R2, and the sums I, O, Q, V in R3, four values per register, in the NVIDIA half-precision format (1 sign, 5 exponent, and 10 mantissa bits).


the complexity of the figure, the inputs of the units in Figure 3.2 are plainly designated as R1 and R2. In the actual implementation, only the parts of the vectors that are used in a unit are connected; the location of these parts can be observed in Figure 3.1. The functionality of the main units and the data flow are explained as follows. The exponent subtracter unit computes the differences of the operands' exponents in all modes. These differences are used to align the operands, and their signs are used in the Swap unit to decide which operand is smaller. The Swap unit exchanges the mantissas if the sign of the difference is negative; this way, only the mantissa with the smaller exponent is right-shifted. Based on the operation mode, the Swap unit operates on different operands. The Compare unit compares the magnitudes of the operands when the difference (or differences) between the exponents is zero, and informs the Swap unit which operand is smaller. The Bit Invert unit inverts the mantissa (or mantissas) with the smallest exponent so that the result (or results) is always positive; the addition of 1 ulp required for the two's complement conversion is performed in the mantissa adder. The Mantissa Generator unit prepares the mantissa bits for operation in all modes: the mantissas are converted into two's complement format and shifted for alignment. The mantissa adder is a two's complement adder that can perform one addition on 53-bit operands, two parallel additions on 24-bit operands, or four parallel additions on 10-bit operands. The signs of the results are generated in the mantissa adder. The Leading One Detector (LOD) units compute the number of shifts needed to normalize the result when the EOP is a subtraction. LOD 1 operates in all modes, LOD 2 operates in Modes 2 and 3, and LOD 3 operates only in Mode 3, where it handles two half-precision operands. The Normalize units are normalizing shifters: the mantissas are either left-shifted by the amount determined in the LOD units or right-shifted by one digit when an addition overflow occurs. The Flag units determine the rounding flags with respect to the selected rounding mode; since all IEEE-754 rounding modes are supported, a flag is generated for each rounding mode. The Rounding units perform the addition of 1 ulp when rounding requires it, as indicated by the flags generated in the Flag units; overflow due to this addition is also checked here, and an adjustment shift is performed when necessary. The Exponent Update units update the exponent strings prepared in the exponent generator unit. The Sign unit generates


3. THE PROPOSED FLOATING POINT UNITS

Metin Mete ÖZBİLEN

[Figure 3.2: The Block Diagram of Multi-Precision Floating-Point Adder. Inputs R1, R2, and EOP feed the Exponent Subtractor, Swap, Compare, Mantissa Alignment, and Conditional Bit Invert units under the Control and Exponent Update logic; the Mantissa Adder output passes through the LOD 1-3, Normalize 1-3, Flag 1-3, and Rounding 1-3 units and the Sign unit to produce the results M1, M2, and M3.]


the sign of the result (or results) based on the signs of the operands with the greater magnitude. The sign, exponent, and mantissa of the result (or results) are represented as S, E, and M, respectively.
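The exponent-subtract / swap / align / add / normalize flow described above can be sketched in software. This is a simplified, hypothetical single-mode model: operands are positive, and shifted-out bits are simply truncated instead of being folded into the flag-based rounding.

```python
def fp_add(e1, m1, e2, m2, mant_bits=10):
    """Add two positive floating-point numbers given as (exponent, integer
    mantissa with hidden bit, i.e. in [2**mant_bits, 2**(mant_bits+1))).
    The comments name the corresponding units in Figure 3.2."""
    d = e1 - e2                               # Exponent Subtracter
    if d < 0 or (d == 0 and m1 < m2):         # Swap (Compare decides when d == 0)
        e1, m1, e2, m2 = e2, m2, e1, m1
        d = -d
    m2 >>= d                                  # alignment right-shift of the smaller mantissa
    s = m1 + m2                               # Mantissa Adder
    if s >> (mant_bits + 1):                  # addition overflow:
        s >>= 1                               # one-digit normalization right-shift
        e1 += 1                               # Exponent Update
    return e1, s
```

The same skeleton is replicated per lane for the two- and four-way packed modes.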

3.2 The Single/Double Precision Floating-Point Multiplier Design

This section presents a new floating-point multiplier which can perform one double-precision floating-point multiplication or two simultaneous single-precision floating-point multiplications. Since in single-precision mode two results are generated in parallel, the multiplier's performance is almost doubled compared to a conventional floating-point multiplier. Figure 3.3.a shows the alignments of two double-precision floating-point numbers X and Y and their product Z, placed in three 64-bit registers. Figure 3.3.b shows the alignments of four single-precision floating-point numbers A, B, C, and D, the product E of A and B, and the product F of C and D, placed in three 64-bit registers. The multiplication of X and Y is performed as

Ez = Ex + Ey (3.1)
Mz = Mx × My (3.2)
Sz = Sx ⊕ Sy (3.3)

The multiplication of A and B, and the multiplication of C and D, are performed as

Ee = Ea + Eb (3.4), Ef = Ec + Ed (3.5)
Me = Ma × Mb (3.6), Mf = Mc × Md (3.7)
Se = Sa ⊕ Sb (3.8), Sf = Sc ⊕ Sd (3.9)
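In software terms, Equations 3.1-3.3 amount to the following per-product computation. This is a simplified sketch with biased exponents and truncation; the single-precision widths and bias are used only for illustration.

```python
def fp_mul(s1, e1, m1, s2, e2, m2, mant_bits=23, bias=127):
    """One floating-point product: m1, m2 are integer mantissas with the
    hidden bit (in [2**mant_bits, 2**(mant_bits+1))), e1, e2 biased exponents."""
    s = s1 ^ s2                        # sign: XOR of the operand signs
    e = e1 + e2 - bias                 # exponent sum; one extra bias is removed
    m = m1 * m2                        # 48-bit product of two 24-bit mantissas
    if m >> (2 * mant_bits + 1):       # product of [1,2) x [1,2) lies in [1,4):
        m >>= 1                        # one-digit normalization shift
        e += 1
    m >>= mant_bits                    # truncate back to 24 bits (no rounding)
    return s, e, m
```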

The proposed design performs these two floating-point multiplications in parallel. In (Gok, Krithivasan and Schulte, 2004), a design method for the multiplication of two unsigned integer operands is presented. Figure 3.4 presents the adaptation of that technique


[Figure 3.3: The Alignments for Double and Single Precision Numbers. (a) one double-precision number per 64-bit register (sign bit 63, exponent bits 62:52, mantissa bits 51:0); (b) two single-precision numbers per 64-bit register (sign, 8-bit exponent, 23-bit mantissa in each half).]


to implement the proposed method. In this figure, the matrices generated for the two single-precision floating-point multiplications are placed inside the matrix generated for a double-precision floating-point multiplication. All the bits are generated in double-precision mode; the shaded region Z is not generated when single-precision multiplication is performed, and the non-shaded regions designate the generated bits.
[Figure 3.4: The Multiplication Matrix for Single and Double Precision Mantissas. Two 24 by 24 single-precision sub-matrices occupy opposite corners of the 53 by 53 matrix M; the remaining region Z is generated only in double-precision mode.]

The partial products within the regions Z are generated using

b′j = s · bj, pij = ai · b′j (3.10)

and the rest of the partial products are generated with

pij = ai · bj (3.11)

Here s is used as a control signal, and i and j are the matrix indexes. When s = 0, only the bits in the non-shaded regions are generated; otherwise, all bits are generated. High-speed multipliers reduce the partial product matrix to two vectors using a reduction method; these two vectors are then added with a carry-propagate adder to produce the result. The reduction method and the type of the carry-propagate adder are not important for the proposed design, since it only modifies the generation of the partial products. This


also means that the reduction algorithm and the carry-propagate adder are not modified for the implementation of the proposed method. The standard floating-point multiplier mentioned in Section 2.3 implements Equation 3.1 to Equation 3.3. Figure 3.5 presents the proposed single/dual floating-point multiplier, which is designed by slightly modifying the standard floating-point multiplier. The modifications can be applied to every type of double-precision floating-point multiplier.
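A bit-level simulation of this partial-product gating is shown below. It is a sketch: the corner placement follows Figure 3.4, with s = 1 generating the full matrix and s = 0 leaving only the two single-precision corner sub-matrices, per Equations 3.10-3.11 as reconstructed here.

```python
def gated_mult(m1, m2, s, n=53, k=24):
    """Sum the partial-product matrix p_ij = a_i * b_j, with the bits outside
    the two k-by-k corner regions (the Z regions) gated off when s = 0."""
    total = 0
    for i in range(n):
        for j in range(n):
            in_corner = (i < k and j < k) or (i >= n - k and j >= n - k)
            if s or in_corner:                    # Z bits forced to 0 when s = 0
                total += (((m1 >> i) & 1) & ((m2 >> j) & 1)) << (i + j)
    return total
```

With s = 1 this equals the full product m1 × m2; with s = 0 and two operand pairs packed at bits 0 and 29 of each input, the two independent products appear at bits 0 and 58 of the result.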

The data flow and the functionality of each unit in the proposed design are explained as follows: The control signal determines the mode of execution; when s = 1, a double-precision floating-point multiplication is performed, otherwise two single-precision multiplications are performed. An 11-bit adder is used for the double-precision exponent addition, and two 8-bit adders are used for the single-precision exponent additions. The Exponent Updaters remove the extra bias values from the exponent sums. The Mantissa Modifier selects the appropriate mantissas to be sent to the mantissa multiplier. The Mantissa Multiplier generates carry-save vectors. The Add, Normalize and Round unit generates the normalized and rounded result or results. The signs of the products are obtained by XOR gates.

3.3 The Multi-Functional Double-Precision FPMAF Design

The multi-functional double-precision FPMAF design supports three modes, named double-precision multiplication (DPM), single-precision multiplication (SPM), and dot-product (DOP).

1. In DPM mode, the design works as a double-precision FPMAF unit. It computes XD × YD + ZD, where XD, YD, and ZD are double-precision floating-point operands.

2. In SPM mode, the design works as a single-precision floating-point multiplier and computes AS × BS and CS × DS in parallel, where AS, BS, CS, and DS are single-precision floating-point operands. This mode has two advantages: first, the latency for performing two single-precision multiplications is approximately the same as the latency for performing one double-precision multiplication; second, there is no need to convert operands from single to double precision and back.


[Figure 3.5: The Block Diagram for the Proposed Floating-Point Multiplier. The single and double exponent adders, sign XOR gates, Exponent Update units, Mantissa Modifier, Mantissa Multiplier with carry-save output, Carry Net, sticky logic, and the Add/Normalize/Round unit produce either Sab/Eab and Scd/Ecd or Sz/Ez, together with Mz.]


3. In DOP mode, the design works as a dot-product unit: it performs two single-precision floating-point multiplications in parallel and then adds the products of these multiplications with a single-precision operand. This operation can be expressed as AS × BS + CS × DS + US. By setting appropriate operands to 0 and 1, a two-operand or a three-operand single-precision floating-point addition, or a single-precision floating-point multiply-add, can be performed.

3.3.1 The Mantissa Preparation Step

Figure 3.6 shows the alignments of the three double-precision and five single-precision IEEE-754 floating-point operands in the 64-bit registers R1, R2, and R3. These registers are used for demonstration purposes; they are not a part of the actual design. The double-precision format is used in DPM mode, and the single-precision format is used in SPM and DOP modes. Based on the execution mode, the initial mantissas are modified before they are input to the mantissa multiplier. The modified mantissas (named M1 and M2) are generated differently for each mode. In DPM mode, the inputs for the mantissa multiplier are produced as

DPM(M1) = 1 & R1[51:0]
DPM(M2) = 1 & R2[51:0] (3.12)

where the 1s are the concatenated hidden bits described by the IEEE-754 standard (IEEE, 1985), & represents the concatenation operator, R1[51:0] = Mx, and R2[51:0] = My. Figure 3.7 shows the 53 by 53 mantissa multiplication matrix generated for DPM mode. All the partial product bits in this matrix contribute to the generation of the product. In SPM mode, two versions of M1 and one version of M2 are produced. The first version of M1 is designated M1_UH. The least-significant 26 bits of M2 and M1_UH are used to generate the upper half of the 53 by 53 multiplication matrix. These vectors are


[Figure 3.6: The Alignments of Double and Single Precision Floating-Point Operands in 64-bit Registers. (a) double-precision operands X, Y, Z in R1-R3; (b) single-precision operands A and C in R1, B and D in R2, and the result F in R3, packed two per register.]


produced as

SPM(M1[52:0])_UH = {0}^29 & 1 & R1[22:0]
SPM(M2[25:0]) = 001 & R2[22:0] (3.13)

where {0}^29 represents 29 instances of 0, R1[22:0] = Mc, and R2[22:0] = Md. The second version of M1 is designated M1_LH. The most-significant 27 bits of M2 and M1_LH are used to generate the lower half of the 53 by 53 matrix. These vectors are produced as

SPM(M1[52:0])_LH = 1 & R1[54:32] & {0}^29
SPM(M2[52:26]) = 1 & R2[54:32] & 000 (3.14)

where R1[54:32] = Ma and R2[54:32] = Mb. Figure 3.7.b shows the multiplication matrix generated for SPM mode. In this figure, the partial product bits located inside the regions designated by Z are set to zeros. The unshaded regions contain the matrices generated for the multiplications (1 & Ma) × (1 & Mb) and (1 & Mc) × (1 & Md). The main idea of the DOP implementation is to perform the addition of the products using only the adders in the partial product reduction tree. The application of this idea requires slightly more complex modifications than those for the previous modes. In DOP mode, the upper half of the matrix is generated using

DOP(M1[52:0])_UH = {R1[31] ⊕ R2[31]}^d & 1 & R1[22:0] & {0}^(29−d)
DOP(M2[25:0]) = 001 & R2[22:0] (3.15)

where d = |Eab − Ecd|, Eab = Ea + Eb − 127, and Ecd = Ec + Ed − 127.
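The effect of this zero padding can be imitated in software with a single wide integer multiplication. This is an analogy only: the spacing trick below keeps the cross terms away from both products, whereas the hardware instead zeroes the Z regions of the partial-product matrix.

```python
def dual_mult_packed(a, b, c, d, width=24):
    """Two width-bit multiplications via one wide multiplication. A packing
    gap of 2*width + 1 bits guarantees that the cross terms a*d and c*b can
    neither reach the low product c*d nor carry into the high product a*b."""
    k = 2 * width + 1                    # 49-bit spacing for 24-bit operands
    p = ((a << k) | c) * ((b << k) | d)  # one (2k + width)-bit multiplication
    lo = p & ((1 << (2 * width)) - 1)    # c*d sits in the low 48 bits
    hi = p >> (2 * k)                    # a*b sits above the cross terms
    return hi, lo
```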


[Figure 3.7: The Partial Product Matrices Generated for DPM and SPM Modes. (a) the full 53 by 53 matrix for (1 & M) × (1 & M); (b) two 24 by 24 sub-matrices in the upper and lower halves, with the Z regions zeroed.]


[Figure 3.8: The Matrix Generated for DOP Mode. Two 25 by 25 matrices for (1 & Mc) × (1 & Md) and (1 & Ma) × (1 & Mb) are placed in the 53 by 53 matrix, with the injected vectors MP1 and MP2, the complemented regions N1 and N2, the zeroed region Z, and the upper matrix right-shifted together with MP1 by d + 1 columns.]

Without loss of generality, in Equation 3.16 it is assumed that Ecd ≤ Eab. The lower half of the multiplication matrix is generated using

DOP(M1[52:0])_LH = {0}^29 & 1 & R1[54:32]
DOP(M2[52:26]) = 01 & R2[54:32] & 00 (3.16)

Figure 3.8 presents the multiplication matrix generated for DOP mode. In addition to the mantissa modifications described by Equation 3.15 and Equation 3.16, the following adjustments are made. The operands are extended by one bit and converted into two's complement format when their sign bits are different. In this way, the addition of the partial products can be performed without considering the signs of the operands (i.e., there is no need to consider the effective operation). To prevent a performance decrease due to the two's complement conversion, the mantissa with the negative sign is selected as the multiplicand; its bits are inverted, and a copy of the positive mantissa (the multiplier) is inserted into the


multiplication matrix. These operations can be expressed as

(~MN + 1) × MP = (~MN × MP) + MP (3.17)

where MN and MP represent the negative and positive mantissas, respectively, and ~ denotes bitwise inversion. In Figure 3.8, the MP1 and MP2 vectors are injected into the matrix to perform the addition of the positive mantissas; MP1 and the upper 25 by 25 matrix are shifted together. The two's complement multiplication algorithm presented in (Baugh and Wooley, 1973) is used to prevent the sign extension of the partial products. This algorithm requires 2n − 2 bits to be complemented. The complemented bits are located inside the dark gray shaded areas N1 and N2 in Figure 3.8; the bits in the N1 and N2 regions are not shifted. The 25 by 25 matrix with the smaller exponent is moved into the upper half and right-shifted by d columns. The region S is filled with zeros if the signs of the operands are the same; otherwise, it is filled with ones, so that the addition of the bits in S does not affect the result.

3.3.2 The Implementation Details for the Multi-Functional Double-Precision FPMAF Design

The proposed design is implemented mainly using the hardware of the standard double-precision floating-point multiplier. Naturally, some extra hardware is used to support the additional operation modes; however, this extra hardware is significantly less than the hardware required to design a separate unit for each mode. The block diagram for the proposed multi-functional FPMAF design is shown in Figure 3.10. Although some of the units in the design could be combined, this approach is not preferred for the double-precision implementation, to keep the organization simple. The design is divided into four pipeline stages. Except for the first stage, the stages are similar to those of the basic double-precision FPMAF design. The function of each block and the data flow between stages are explained as follows: The mantissa bits are modified in the first stage. The control signals T1 and T0 are used to select the operation mode, as given in Table 3.1.
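The two's-complement trick of Equation 3.17 (invert the negative mantissa's bits and inject one extra copy of the positive mantissa, rather than adding 1 ulp with a carry chain) can be checked numerically; the widths and values below are arbitrary illustrations.

```python
def neg_times_pos(mn, mp, n):
    """Multiply the two's complement of an n-bit mantissa MN by MP without a
    separate increment: (~MN + 1)*MP = ~MN*MP + MP (Equation 3.17)."""
    mask = (1 << n) - 1
    inverted = mn ^ mask           # bitwise inversion of the n-bit MN
    return inverted * mp + mp      # "+ mp" is the extra row injected into the matrix
```

The result equals MP multiplied by the two's-complement magnitude 2^n − MN, which is what the hardware needs without waiting for a carry-propagate increment.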


[Figure 3.9: The Mantissa Modifier Unit in the Double-Precision FPMAF. Multiplexers controlled by T0 and T1 select fields of R1 and R2 (sign bits 63 and 31, mantissa fields 54:32 and 22:0), and a right-shifter of up to 29 digits produces M1_UH, M1_LH, and M2.]


[Figure 3.10: The Block Diagram for the Multi-Functional Double-Precision FPMAF Design. Stage 1: sign XOR gates XOR1-XOR3, exponent adders ADD1-ADD3, Difference & Maximum Generator, Mantissa Modifier, and 2's Comp. & Negate unit; Stage 2: the 53 by 53 Mantissa Multiplier, Distance & Maximum Generator, and Right-Shifter; Stage 3: CSA, INC, CPA, LZA, and Sticky1; Stage 4: Complement, Normalize 1-3, Exp Upd 1-3, Rounding 1-3, and Sticky2, producing the results for the AB, CD, or double-precision paths and the sign SR.]


Table 3.1: The Execution Modes

T1 T0 | Operation
 0  0 | DPM
 1  0 | SPM
 0  1 | DOP
 1  1 | NAN

The function of each unit in this stage is explained as follows: The XOR1 and XOR2 gates compare the signs of the operands in SPM mode. The XOR3 gate compares the signs of the operands in DPM mode; the output of this gate is sent to the 2's Comp. & Negation Unit. There is no need to compare the signs of the operands in DOP mode, since the operands are in two's complement format in this mode. The 11-bit adder (ADD1) computes Exy = Ex + Ey − 1023, where R1[62:52] = Ex and R2[62:52] = Ey. The first 8-bit adder (ADD2) computes Eab = Ea + Eb − 127, where R1[62:55] = Ea and R2[62:55] = Eb. The second 8-bit adder (ADD3) computes Ecd = Ec + Ed − 127, where R1[30:23] = Ec and R2[30:23] = Ed. The Difference and Maximum Generator Unit computes d = |Eab − Ecd| and max(Eab, Ecd); d is sent to the Mantissa Modifier Unit. Two 2-input multiplexers select the correct inputs to the Distance and Maximum Generator Unit (located in the second stage). The Mantissa Modifier Unit shown in Figure 3.9 generates the modified mantissas using Equations (3.14)-(3.17) for all modes. This unit consists of a 32-bit right-shifter (that can shift up to 29 digits) and several multiplexers and glue logic. The inputs to the Mantissa Modifier Unit are R1[63], R1[54:0], R2[63], and R2[54:0]. Based on the multiplication mode, these vectors contain the mantissas and sign bits as follows: Mx = R1[51:0] and My = R2[51:0], or Ma = R1[54:32], Mb = R2[54:32], Mc = R1[22:0], and Md = R2[22:0]; and the sign bits Sx = R1[63] and Sy = R2[63], or Sa = R1[63], Sb = R2[63], Sc = R1[31], and Sd = R2[31]. The 2's Comp. & Negation Unit negates the addend Mz or Mu based on the multiplication mode and the sign comparison of the operands. In DPM mode, if Sz differs from Sx ⊕ Sy, Mz is negated; in this case, the correct sign of the result is determined later by comparing the signs of the operands and the sign of the output of the CPA. In DOP mode, Mu is converted into two's complement format. The functions of the units located in the second stage are explained as follows:


The modified mantissas are multiplied by the Mantissa Multiplier. The generation of the partial products in the multiplier is slightly modified to implement the insertion of the MP1 and MP2 vectors and to perform the inversion of the bits in regions N1 and N2 in DOP mode; the rest of the multiplier hardware is not modified. The Mantissa Multiplier generates sum and carry vectors. The Distance Computation and Maximum Generation Unit computes |Ez − Exy + 56| or |Eu − max(Eab, Ecd) + 28|. Since the biases are subtracted during the computation of Exy and max(Eab, Ecd), the constants used to calculate sa are 56 and 28. The selected difference, sa, is the shift amount sent to the Right-Shifter Unit when the multiplier operates in DPM or DOP mode. This unit also computes max(Ez, Exy) or max(Eu, Eab, Ecd), based on the multiplication mode. The Right-Shifter Unit can perform up to a 161-digit right-shift. This unit right-shifts either (1 & Mz) by (sa + 55) digits in DPM mode or (1 & Mu) by (sa + 85) digits in DOP mode. The functions of the units located in the third stage are explained as follows: The aligned mantissa (Mz or Mu) is split into two parts, low and high; the low part consists of the least-significant 106 bits and the high part consists of the most-significant 55 bits. The low part is added with the sum and carry vectors in the 106-bit CSA, and the high part is incremented by the INC unit. The incremented value of the high part is selected if the 106-bit CPA generates a carry-out. The CPA generates one or two sums based on the multiplication mode: in DPM mode, a 106-bit sum is generated; in SPM mode, two 48-bit sums are generated; in DOP mode, a 50-bit sum is generated. The last stage performs the normalization, exponent update, and rounding as follows: The Complement Unit generates the complement of a negative result and updates the sign of the result (Sr) in DPM and DOP modes. The LZA computes the shift amount required to normalize the sum generated by the CPA. The LZA unit is designed using the method presented by (Schmookler and Mikan, 1996); note that this unit determines the shift amount exactly, because there is no carry input to the CPA. The Sticky1 Unit is designed by adapting the method presented in (Yu and Zyner, 1995); this unit computes a preliminary sticky bit using the carry and save vectors.
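The low/high split with a parallel increment is a carry-select arrangement; the selection logic can be sketched as follows (bit widths from the text, with the carry-save pair folded into plain integers for clarity, and assuming the product fits in the low 106 bits):

```python
def add_with_inc_select(aligned_addend, sum_vec, carry_vec, low_bits=106):
    """Add a wide aligned addend to a product kept as sum/carry vectors: the
    low 106 bits pass through the CSA and CPA, while the high part is
    incremented in parallel and selected on the CPA carry-out (INC + MUX)."""
    low = aligned_addend & ((1 << low_bits) - 1)
    high = aligned_addend >> low_bits
    cpa = low + sum_vec + carry_vec               # CSA + 106-bit CPA, combined here
    carry_out = cpa >> low_bits                   # 0 or 1 under the stated assumption
    result_low = cpa & ((1 << low_bits) - 1)
    result_high = high + 1 if carry_out else high # select the incremented high part
    return (result_high << low_bits) | result_low
```

This avoids a full-width carry-propagate addition across all 161 bits.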



The Normalize 1 and Normalize 2 units generate the normalized products in SPM mode; these units can perform a 1-digit right-shift. The Normalize 3 unit performs the normalization for DPM and DOP modes and is capable of performing up to a 108-digit left-shift. The Sticky2 Unit generates the sticky bits based on the preliminary sticky bits and the shifted-out bits. The Exp Upd 1 and Exp Upd 2 units increment their inputs by one if a normalization right-shift is performed. Exp Upd 3 can decrement the exponent by up to 53; this unit is only used in DPM and DOP modes. The signals Sr, Er, and Mr represent the sign, exponent, and mantissa of the result in DPM and DOP modes.
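The interplay between the sticky bits and the 1-ulp increment in the Rounding units can be illustrated for round-to-nearest-even, one of the IEEE-754 modes the text mentions. This is a simplified model in which the bits below the ulp are still attached to the mantissa.

```python
def round_to_nearest_even(mant, extra_bits):
    """Decide the 1-ulp increment from the round and sticky information
    (round-to-nearest-even). `mant` carries `extra_bits` bits below the ulp."""
    round_bit = (mant >> (extra_bits - 1)) & 1
    sticky = (mant & ((1 << (extra_bits - 1)) - 1)) != 0   # OR of the lower shifted-out bits
    kept = mant >> extra_bits
    if round_bit and (sticky or (kept & 1)):               # ties round to even
        kept += 1                                          # the 1-ulp addition
    return kept
```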

3.4 Multi-Functional Quadruple-Precision FPMAF

This section presents a multi-functional quadruple-precision FPMAF designed by extending the techniques presented in the previous sections. The quadruple-precision FPMAF design executes parallel double-precision and single-precision multiplications, and dot-product operations (Gok and Ozbilen, 2008). Also, the number of single-precision operands that can be operated on is increased from two to four. Brief descriptions of the supported modes of operation are given as follows:

1. In QPM mode, the design works as a quadruple-precision FPMAF unit. It computes X × Y + Z, where X, Y, and Z are quadruple-precision floating-point numbers.

2. In DPM mode, the design works as a double-precision floating-point multiplier and computes K × L and R × T, where K, L, R, and T are double-precision floating-point numbers.

3. In SPM mode, the design works as a single-precision floating-point multiplier and computes A × B, C × D, E × F, and G × H in parallel, where all operands are single-precision floating-point numbers.

4. In DDOP mode, the design works as a double-precision dot-product unit: it performs two double-precision floating-point multiplications in parallel and then adds the products of these multiplications with a double-precision operand, UD. This operation can be expressed as

K × L + R × T + U (3.18)


5. In SDOP mode, the design works as a single-precision dot-product unit: it performs four single-precision floating-point multiplications in parallel and then adds the products of these multiplications with a single-precision operand, NS. This operation can be expressed as

A × B + C × D + E × F + G × H + N (3.19)

3.4.1 The Preparation of Mantissas

Figure 3.11 shows the alignments of the three quadruple-precision, five double-precision, and nine single-precision floating-point operands in the 128-bit registers R1, R2, and R3. The proposed design method modifies the operands based on the execution mode. Table 3.2 shows the logic equations used to generate the modified mantissas for all modes in the quadruple-precision FPMAF. Without loss of generality, the equations in this table are derived based on the following assumptions for the exponents: Ert ≤ Ekl, Eab ≤ Ecd, Eef ≥ Egh, Ecd ≤ Egh.
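At the arithmetic level, the two dot-product modes are plain multiply-add trees, and the operand-selection trick described for the double-precision DOP mode carries over. A hypothetical helper, ignoring formats and rounding:

```python
def sdop(a, b, c, d, e, f, g, h, n):
    """SDOP mode result: A*B + C*D + E*F + G*H + N (arithmetic level only)."""
    return a * b + c * d + e * f + g * h + n

# Choosing 0/1 operands recovers the simpler operations:
def ddop(k, l, r, t, u):   return sdop(k, l, r, t, 0.0, 0.0, 0.0, 0.0, u)  # Eq. 3.18 shape
def add3(x, y, z):         return sdop(x, 1.0, y, 1.0, 0.0, 0.0, 0.0, 0.0, z)
def fmadd(a, b, u):        return sdop(a, b, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, u)
```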


[Figure 3.11: The Alignments of Quadruple, Double and Single Precision Floating-Point Operands in 128-bit Registers. (a) quadruple-precision X, Y, Z (sign bit 127, exponent bits 126:112, mantissa bits 111:0); (b) double-precision K and L in the upper halves of R1 and R2, with R, T, and U in the lower halves; (c) single-precision A, C, E, G in R1, B, D, F, H in R2, and N in R3, packed four per register.]


Table 3.2: The Logic Equations for the Generation of the Modified Mantissas for All Modes.

QPM:
M1 = 1 & R1[111:0]
M2 = 1 & R2[111:0]

DPM:
M1_UH = {0}^60 & 1 & R1[51:0]
M1_LH = 1 & R1[115:64] & {0}^60
M2 = 0001 & R2[51:0] & 1 & R2[115:64] & 000

DDOP:
M1_UH = {R1[63] ⊕ R2[63]}^d4 & 1 & R1[51:0] & {0}^(60−d4)
M1_LH = {0}^60 & 1 & R1[115:64]
M2 = 0001 & R2[51:0] & 1 & R2[115:64] & 000

SPM:
M1_1 = {0}^89 & 1 & R1[22:0]
M1_2 = {0}^60 & 1 & R1[54:32] & {0}^29
M1_3 = {0}^29 & 1 & R1[86:64] & {0}^60
M1_4 = 1 & R1[118:96] & {0}^89
M2 = 001 & R2[22:0] & {0}^7 & 1 & R2[54:32] & {0}^6 & 1 & R2[86:64] & 1 & R2[118:96] & 00

SDOP:
M1_1 = {R1[31] ⊕ R2[31]}^(d1+d3) & 1 & R1[22:0] & {0}^(89−(d1+d3))
M1_2 = {0}^29 & {R1[63] ⊕ R2[63]}^d3 & 1 & R1[54:32] & {0}^(60−d3)
M1_3 = {0}^60 & {R1[95] ⊕ R2[95]}^d2 & 1 & R1[86:64] & {0}^(29−d2)
M1_4 = {0}^89 & 1 & R1[118:96]
M2 = 001 & R2[22:0] & {0}^7 & 1 & R2[54:32] & {0}^6 & 1 & R2[86:64] & 1 & R2[118:96] & 00


The modifications of the mantissas in the QPM, DPM, and DDOP modes of the quadruple-precision FPMAF are similar to the modifications of the mantissas in the DPM, SPM, and DOP modes of the proposed double-precision FPMAF. In QPM mode, one version each of M1 and M2 is produced. In DPM and DDOP modes, two versions of M1 and one version of M2 are produced; the two versions of M1 (M1_UH and M1_LH) are used for the generation of the upper and lower halves of the 113 by 113 matrix, similar to the previous implementation. In SPM and SDOP modes, four versions of M1 and one version of M2 are generated. In these modes, the 113 by 113 matrix is divided into four regions, generated by the multiplications M1_1 × M2, M1_2 × M2, M1_3 × M2, and M1_4 × M2. The implementations for SPM and SDOP modes will be explained in detail, since they are slightly different from the implementations described before. Figure 3.12 shows the 113 by 113 multiplication matrix generated for SPM mode in the quadruple-precision implementation. In this figure, the shaded regions labeled Z are set to zeros, and the four unshaded regions contain the 24 by 24 sub-matrices generated for the following multiplications:

(1 & Ma) × (1 & Mb), (1 & Mc) × (1 & Md) (3.20)
(1 & Me) × (1 & Mf), (1 & Mg) × (1 & Mh) (3.21)

Figure 3.13 presents the 113 by 113 multiplication matrix generated for SDOP mode. In this figure, four 25 by 25 matrices are placed into the 113 by 113 matrix based on the assumptions for the exponents given above. In SDOP mode, the matrices are aligned according to the differences between their exponents. To do that, the four 25 by 25 matrices are grouped into two pairs. One pair consists of the matrices generated by the multiplications

(1 & Ma) × (1 & Mb) and (1 & Mc) × (1 & Md) (3.22)

and the other pair consists of the matrices generated by the multiplications

(1 & Me) × (1 & Mf) and (1 & Mg) × (1 & Mh) (3.23)

The distances used for the alignment of the matrices are computed as follows:

d1 = |Eab − Ecd| if max(Eab, Ecd) ≤ max(Eef, Egh), and |Eef − Egh| otherwise (3.24)

[Figure 3.12: The Partial Product Matrices Generated for SPM Mode in the Quadruple Precision FPMAF. Four 24 by 24 sub-matrices for (1 & Ma) × (1 & Mb), (1 & Mc) × (1 & Md), (1 & Me) × (1 & Mf), and (1 & Mg) × (1 & Mh) lie on the diagonal of the 113 by 113 matrix; the remaining regions Z are zeroed.]

[Figure 3.13: The Matrix Generated for Single Precision Dot Product (SDOP) Mode in the Quadruple Precision FPMAF. Four 25 by 25 matrices with the injected vectors MP1-MP4, the complemented regions N1-N4, the sign-fill regions S, and the zeroed region Z; the upper-half matrices are right-shifted by d1 + d3 and d3 columns, and the second lower-half matrix by d2 columns.]


d2 = |Eef − Egh| if max(Eef, Egh) ≥ max(Eab, Ecd), and |Eab − Ecd| otherwise (3.25)

d3 = |max(Eab, Ecd) − max(Eef, Egh)| (3.26)
The pair that contains the matrix with the maximum exponent is placed into the lower half of the 113 by 113 matrix, in which the matrix with the maximum exponent is located at the bottom and the other one is placed above it, right-shifted by d2 columns. The other pair is moved into the upper half of the 113 by 113 matrix, in which the matrix with the minimum exponent is located at the top and right-shifted by (d1 + d3) columns; the second matrix in this pair is located under the top matrix and right-shifted by d3 digits. Similar to the double-precision implementation, the additional adjustments, such as the conversion of the operands into two's complement format when the signs are different and the application of the two's complement word correction algorithm, are also used in this implementation. The vectors MP1 to MP4 represent the positive multiplicands inserted into the multiplication matrix.

3.4.2 The Implementation Details for the Multi-Functional Quadruple-Precision FPMAF Design

The block diagram for the proposed quadruple-precision FPMAF design is shown in Figure 3.14. This design is quite similar to the proposed double-precision FPMAF design, except that the sizes of the components are increased and some of the units are modified to be used in different precisions. The design is divided into four pipeline stages. The function of each block and the data flow between stages are explained as follows: The first stage is mainly dedicated to the preparation of the mantissa vectors. The control signals T2:0 are used to select the operation mode given in Table 3.3.

Table 3.3: Quadruple-Precision Execution Modes

T1 T0 | Operation
 0  0 | DPM
 1  0 | SPM
 0  1 | DOP
 1  1 | QPM

The function of each unit in this stage is explained as follows: The Sign Generator Unit consists of XOR gates that compare the signs of the operands for all modes. This unit generates the following signals:

Skl = Sk ⊕ Sl, Srt = Sr ⊕ St (3.27)
Sab = Sa ⊕ Sb, Scd = Sc ⊕ Sd (3.28)
Sef = Se ⊕ Sf, Sgh = Sg ⊕ Sh (3.29)
S1 = Sx ⊕ Sy ⊕ Sz (3.30)

There is no need to compare the signs of the operands in SDOP and DDOP modes, because the operands are in two's complement format in those modes. In QPM mode, the 2's Comp. & Negate Unit computes the negative of its input when the S1 signal is set to one; in the other modes, it generates the two's complement representation of the addend based on its sign. The Exponent Adder Unit consists of two 17-bit adders. For space reasons, in Figure 3.14, the 15-bit, 11-bit, and 8-bit exponents are grouped and represented as EQ, ED, and ES, respectively. The 17-bit adders operate on three different exponent sizes as follows: In QPM mode, one 17-bit adder computes

Exy = Ex + Ey − 16383 (3.31)

In DPM mode, two 17-bit adders compute in parallel

Ekl = Ek + El − 1023 (3.32)
Ert = Er + Et − 1023 (3.33)

In SPM mode, one 17-bit adder computes

Eab = Ea + Eb − 127 (3.34)
Ecd = Ec + Ed − 127 (3.35)


and the other one computes

Eef = Ee + Ef − 127 (3.36)
Egh = Eg + Eh − 127 (3.37)

The Difference and Maximum Generator Unit consists of one 11-bit subtracter, three 8-bit subtracters, and several multiplexers. In DPM mode, this unit computes

d4 = |Ert − Ekl| and max(Ert, Ekl) (3.38)

In SDOP mode, the unit computes d1, d2, and d3. These values and the signs of the differences before the absolute-value conversions (sd1, sd2, sd3) are sent to Mantissa Modifier Unit 1. The Mantissa Modifier Unit is split into two parts to balance the delay between Stage 1 and Stage 2. Mantissa Modifier Unit 1 and Mantissa Modifier Unit 2 generate the modified mantissas using the equations presented in Table 3.2 for all modes. Mantissa Modifier Unit 1 consists of multiplexers, and Mantissa Modifier Unit 2 (located in Stage 2) consists of three 113-bit right-shifters that can shift up to 89, 60, and 29 digits, respectively. The functions of the units in the second stage are explained as follows: The modified mantissas are multiplied by the Mantissa Multiplier. The generation of the partial products in the multiplier is slightly modified to implement the insertion of MP1 to MP4 in SDOP mode or MP1 and MP2 in DDOP mode (MP1 and MP2 are generated differently in SDOP and DDOP modes) and to perform the inversion of the bits in regions N1, N2, N3, and N4 (the N1 and N2 regions are different in SDOP and DDOP modes). The rest of the hardware that handles the partial product reduction is not modified. The Mantissa Multiplier generates sum and carry vectors. The Distance and Maximum Generation Unit computes

|Ez − Exy + 116| (3.39)
or |Eu − max(Ekl, Ert) + 57| (3.40)
or |En − max(Eab, Ecd, Eef, Egh) + 28| (3.41)

This difference, sa, is sent to the Right-Shifter Unit when the multiplier operates in QPM, DDOP, or SDOP mode. Based on the multiplication mode, this unit also generates

max(Ez, Exy) (3.42)
or max(Eu, Ekl, Ert) (3.43)
or max(En, Eab, Ecd, Eef, Egh) (3.44)

The Right-Shifter Unit can perform up to a 200-digit right-shift. This unit right-shifts (1 & Mz) by (sa + 116) digits in QPM mode, (1 & Mu) by (sa + 172) digits in DDOP mode, or (1 & Mn) by (sa + 200) digits in SDOP mode. The functions of the units in the third stage are explained as follows: The CSA adds the sum and carry outputs of the Mantissa Multiplier and the aligned mantissa (Mz, Mu, or Mn) and generates carry and save vectors. The high part of the aligned mantissa is sent to the INC unit, and the low part of the aligned addend is sent to the 226-bit CPA. The incremented high part is selected if the carry-out bit of the CPA is 1. The 226-bit CPA generates different sums based on the multiplication mode: in QPM mode, a 226-bit sum is generated; in DPM mode, two 106-bit sums; in SPM mode, four 48-bit sums; in DDOP mode, a 108-bit sum; and in SDOP mode, a 51-bit sum. The Sticky1 Unit is designed by adapting the method presented in (Yu and Zyner, 1995); it computes the preliminary sticky bit(s) for all modes. The LZA computes the shift amount required to normalize the sum generated by the CPA; this unit is designed based on the method presented in (Schmookler and Mikan, 1996).
All rounding units can increment the normalized products by 1 ulp based on the rounding mode. The Exp Upd 1 unit consists of two 17-bit incrementers; it increments four 8-bit operands in SPM mode or two 11-bit operands in DPM mode. The Exp Upd 2 unit increments the 15-bit operand by up to 113; it is only used in QPM, DDOP, and SDOP modes. Sr, Er, and Mr represent the sign, exponent, and mantissa of the result in QPM, DDOP, and SDOP modes, respectively.

3. THE PROPOSED FLOATING POINT UNITS

Metin Mete ÖZBİLEN

3.5 Multi-Precision Floating-Point Reciprocal Unit

3.5.1 Derivation of Initial Values

Let the n-bit mantissa M be represented as 1.m_1 m_2 m_3 ... m_{n-1}, where m_i is in {0, 1} for i = 1, ..., n-1. M is divided into two parts, M1 and M2, as

M1 = 1.m_1 m_2 m_3 ... m_m   (3.45)

M2 = 0.0...0 m_{m+1} m_{m+2} m_{m+3} ... m_{n-1}  (with m leading fractional zeros)   (3.46)

The first-order Taylor expansion of M^p, where the number M lies between M1 and M1 + 2^{-m}, is expressed as (Takagi, 1997)

M^p ≈ (M1 + 2^{-m-1})^{p-1} (M1 + 2^{-m-1} + p(M2 - 2^{-m-1}))   (3.47)

Equation 3.47 can be expressed as C · M', where

C = (M1 + 2^{-m-1})^{p-1}   (3.48)

and

M' = M1 + 2^{-m-1} + p(M2 - 2^{-m-1})   (3.49)

C can be read from a look-up table addressed by M1 without its leading one. The look-up table contains the 2^m values of C for the chosen value of p; p = -1 for the reciprocal of M. The size of the ROM required for the look-up table is about 2^m × 2m bits. The initial approximation of the reciprocal of the floating-point number, M^{-1}, is computed by multiplying the term C with the modified operand M'. The modified form of M is obtained by simply complementing the M2 part bitwise; the small residual term left by this approximation can be ignored.
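As a numerical check of Equations 3.45 to 3.49, the sketch below computes the initial reciprocal approximation in software. The parameters n = 24 and m = 10, and the direct computation of C instead of a 2^m-entry ROM read, are illustrative assumptions rather than the thesis's exact configuration:

```python
# Initial reciprocal approximation per Equations 3.45-3.49 with p = -1.
# Assumed toy parameters: n = 24 mantissa bits, m = 10 table-address bits.
def initial_reciprocal(M, n=24, m=10):
    frac = round((M - 1.0) * 2 ** (n - 1))        # fraction bits of 1.f
    low_width = n - 1 - m
    hi = frac >> low_width                        # m_1 ... m_m
    lo = frac & ((1 << low_width) - 1)            # m_(m+1) ... m_(n-1)
    M1 = 1.0 + hi * 2.0 ** -m
    # Table entry, Equation 3.48 with p = -1: C = (M1 + 2^(-m-1))^(-2).
    C = (M1 + 2.0 ** (-m - 1)) ** -2
    # Modified operand, Equation 3.49: M' = M1 + 2^-m - M2, realized by
    # bitwise complementing the M2 bits (the 2^-(n-1) residue is ignored).
    lo_comp = lo ^ ((1 << low_width) - 1)
    M_mod = M1 + lo_comp * 2.0 ** -(n - 1)
    return C * M_mod

assert abs(initial_reciprocal(1.640625) - 1 / 1.640625) < 2 ** -18
```

With m = 10 the approximation is accurate to roughly 2m ≈ 20 bits, consistent with the per-iteration accuracy quoted for the reciprocal unit later in the thesis.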

Figure 3.14 The Block Diagram for the Proposed Quadruple Precision FPMAF Design. (Four pipeline stages: Stage 1, sign and exponent processing with mantissa modification; Stage 2, the 113 by 113 Mantissa Multiplier and the Right-Shifter; Stage 3, the CSA, 226-bit CPA, INC, LZA, and Sticky 1 units; Stage 4, the Complement, Normalize 1/2, Exp Upd 1/2, Rounding 1/2, and Sticky 2 units.)


3.5.2 Newton-Raphson Iteration

The Newton-Raphson iteration was discussed in Previous Work. The general iteration formula is rewritten here (Ercegovac and Lang, 2004):

x_{i+1} = x_i - f(x_i) / f'(x_i)   (3.50)

An initial look-up table is used to obtain an approximate value of the root. The derivation of the Newton-Raphson algorithm for computing the reciprocal is as follows:

x = 1/X   (3.51)

f(x) = 1/x - X   (3.52)

f'(x) = -1/x^2   (3.53)

When Equations 3.52 and 3.53 are substituted into Equation 3.50, the iteration equation becomes

x_{i+1} = x_i (2 - X x_i)   (3.54)
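Equation 3.54 can be exercised directly in software; the following minimal sketch (function name and seed values are illustrative) shows the quadratic convergence that the hardware iteration exploits:

```python
# Newton-Raphson reciprocal, Equation 3.54: each pass costs two
# multiplications and one subtraction and roughly doubles the number of
# correct bits in the approximation.
def reciprocal_nr(X, x0, iterations=3):
    x = x0
    for _ in range(iterations):
        x = x * (2.0 - X * x)   # x_{i+1} = x_i (2 - X x_i)
    return x

assert abs(reciprocal_nr(1.5, 0.66) - 2 / 3) < 1e-12
```

Starting from a table seed accurate to about 20 bits, one pass reaches single-precision accuracy and a second pass reaches double-precision accuracy.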

Equation 3.54 can be implemented in hardware. The implementation requires two multiplications and one subtraction. The block diagram of this implementation is shown in Figure 3.15; the circuit can be pipelined.

The basic multiplicative reciprocal unit is shown in Figure 3.15. The mantissa modify unit processes the most significant part of M and generates M' according to Equation 3.49, and the initial approximation factor C is obtained from the look-up table. In the first cycle, the first multiplexer selects the modified M value and the second multiplexer selects the output of the first multiplexer; the third multiplexer selects the output of the look-up table and the fourth selects the output of the third multiplexer. In the second cycle, the multiplier generates a result in carry-save format. In the third cycle, the carry-save vectors are summed by a fast carry-propagate adder; at the end of this cycle the initial value x_i is obtained. In the fourth cycle, the first and second multiplexers select the initial value generated in the previous cycle, and the third and fourth multiplexers select M. In the fifth cycle, these values are multiplied, and in the sixth cycle the vectors generated by the multiplication are added. In the seventh cycle, the two's complement of the result is selected together with the initial value stored in the first Newton-Raphson iteration. In the seventh and eighth cycles, these values are multiplied and the vectors are summed to obtain the final result of the iteration. In the ninth cycle, the final result is routed to normalization to suit the IEEE mantissa format. Rounding is not handled here, because this circuit can be coupled with a floating-point multiplier to realize the floating-point division operation; rounding can then be handled after multiplication by the multiplication circuitry, which also minimizes the rounding error.

Figure 3.15 Simple Reciprocal Unit that uses the Newton-Raphson Method.

A packed multiplier design that performs the mantissa multiplications for the Newton-Raphson method was discussed in Double/Single Precision Multiplier and is rearranged here. Figure 3.16.a shows the alignment of one double-precision floating-point mantissa and Figure 3.16.b shows the alignment of two single-precision mantissas (Gok, Schulte, and Krithivasan, 2004).
Figure 3.16 Alignment of Double Precision and Single Precision Mantissas: (a) one double-precision mantissa (1 & Mx) occupies bits 52:0; (b) two single-precision mantissas, (1 & Ma) in bits 52:29 and (1 & Mc) in bits 28:5, with zeros in bits 4:0.

Figure 3.17 presents the adaptation of the techniques given in (Gok, Schulte, and Krithivasan, 2004) to implement the proposed design. In this figure, the matrices generated for two single-precision mantissa multiplications are placed inside the matrix generated for a double-precision mantissa multiplication. All the bits are generated in double-precision multiplication; the shaded areas labeled Z1, Z2, and Z3 are not generated in single-precision multiplication, while the un-shaded areas are. The partial products within the regions Z1, Z2, and Z3 are generated using the equations

b'_j = s̄ b_j   (3.55)

p_ij = a_i b'_j   (3.56)

where s̄ denotes the complement of s. The rest of the partial products are produced with

p_ij = a_i b_j   (3.57)


The signal s is used as a control: when s = 1, only the bits in the un-shaded regions are generated; when s = 0, all bits are generated. The indexes i and j select the appropriate partial product in the multiplication matrix (Gok and Ozbilen, 2009).
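The gating of Equations 3.55 to 3.57 can be modeled bit by bit. The sketch below is a scaled-down illustration (an 8 by 8 matrix with two 4-bit subwords; the actual design packs two 24-bit mantissas into a 53 by 53 matrix with zero padding):

```python
# Partial-product gating per Equations 3.55-3.57, toy 8x8 version: the Z
# regions are the cross terms that mix the low and high subwords; they are
# suppressed when s = 1 (packed mode) and generated when s = 0 (wide mode).
def packed_multiply(a, b, s):
    n, h = 8, 4
    total = 0
    for i in range(n):
        ai = (a >> i) & 1
        for j in range(n):
            bj = (b >> j) & 1
            if (i < h) != (j < h):         # partial product lies in Z1/Z2/Z3
                bj &= 1 - s                # b'_j = s-bar AND b_j  (Eq. 3.55)
            total += (ai & bj) << (i + j)  # p_ij = a_i b_j  (Eqs. 3.56, 3.57)
    return total

assert packed_multiply(0x53, 0x24, s=0) == 0x53 * 0x24                     # one 8-bit product
assert packed_multiply(0x53, 0x24, s=1) == (0x5 * 0x2 << 8) | (0x3 * 0x4)  # two 4-bit products
```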

Figure 3.17 Multiplication Matrix for Single and Double Precision Mantissas (the 53 by 53 matrix with the shaded regions Z1, Z2, and Z3 and the two 24-bit single-precision sub-matrices).

3.5.3 The Implementation Details for the Double/Single Precision Floating-Point Reciprocal Unit

This unit uses the previously described reciprocal computation methods and generates reciprocals in different precisions as follows:

1. In double-precision mode, the unit generates a double-precision reciprocal.

2. In the first single-precision mode, the unit generates a single-precision reciprocal and a copy of it.

3. In the second single-precision mode, the unit generates two different single-precision reciprocals in parallel.

The input format of the modified design is shown in Figure 3.18. Figure 3.18.a shows the input and output format in double-precision mode, and Figure 3.18.b shows the same input and output in single-precision mode. An input signal, S, selects the operating mode.

Figure 3.18 Alignment of Double and Single Precision Floating Point Numbers: (a) double precision: Sx (bit 63), Ex (bits 62:52), Mx (bits 51:0); (b) packed single precision: Sa (bit 63), Ea (bits 62:55), Ma (bits 54:32), Sc (bit 31), Ec (bits 30:23), Mc (bits 22:0).

The block diagram for the proposed design is shown in Figure 3.19. The explanations of the main units are as follows:

Exponent Unit generates the exponents of one double-precision or two single-precision results. In single-precision mode, the exponents are obtained with Equation 3.58; two circuits compute the two exponents in parallel. In double-precision mode, the circuits are cascade-connected.

Ez = 11111110_2 - Ex   (3.58)

Mantissa Modifier generates modified mantissas based on the operation mode in order to prepare the inputs for the packed multiplier, as in Figure 3.16.

Lookup Table contains the look-up tables needed for the initial approximation required by the Newton-Raphson method. These are the C values of Equation 3.48; they are pre-computed values generated by computer software such as Maple, MatLab, etc.

Operand Modifier modifies the operands required for the initial value calculation. The value evaluated here is M' of Equation 3.49; it is evaluated by inverting the digits starting from the 10th digit in this design. The modification of the operand(s) depends on the selected operation mode.

State Counter drives the multiplexers to select the correct inputs to the packed multiplier during the computation of the Newton-Raphson iteration. The computation of Equation 3.54 requires three multiplications. Depending on the selected operation mode, the inputs of the multiplexers are in double-precision or packed single-precision format, as shown in Figure 3.16. In the second cycle of the circuit, the multiplexers are arranged for the multiplication of the look-up value(s) and the modified mantissas. In the third cycle, the multiplexers are arranged for the multiplication of the computed initial approximation value(s) and the input mantissa(s) of Equation 3.54. In the fourth cycle, the multiplexers are arranged for the multiplication of the stored initial value(s) and the computed value(s) of the expression inside the parentheses of Equation 3.54.

Packed Multiplier is a 53 by 53 multiplier slightly modified to handle two single-precision numbers or one double-precision number, as described. The input format of the multiplier is shown in Figure 3.18. The multiplication output depends on the selected operation mode.

Packed Product Generator processes the output of the packed multiplier and generates the output used in the next stages of the iteration. The output of this unit is stored in a register. The output format is one truncated 53-bit double-precision mantissa or two 24-bit single-precision mantissas, depending on the selected mode. The mantissas are arranged as in Figure 3.16.

I.A. Store unit stores the initial approximation value(s) computed in the second cycle of the circuit. These are the x_i values of Equation 3.54, which are needed in the fourth cycle.

Inverter inverts the stored multiplication result for the third stage of the stage controller to compute the expression in the parentheses of Equation 3.54. The inversion depends on the selected mode.

Single Normalizer(s) normalize the result in single-precision mode, and Double Normalizer normalizes the result in double-precision mode. The normalization is a one-digit left shift, if required.

Exponent Updater updates the exponents depending on the normalization results. Two decrementers are used separately to update the 8-bit exponents in single-precision mode; in double-precision mode, these decrementers are cascade-connected to update the 11-bit exponent.
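As an illustrative software model of the single-precision path (not the hardware itself), the sketch below combines the Equation 3.58 exponent constant with the one-digit left-shift normalization and exponent decrement described above. The direct division stands in for the iterated mantissa result, and sign, subnormals, and special values are deliberately ignored:

```python
import struct

# Software model of the single-precision reciprocal path: exponent from the
# Equation 3.58 constant, then a conditional left shift of the mantissa with
# a matching exponent decrement (the Exponent Updater's job).
def fp32_reciprocal(x):
    bits = struct.unpack('>I', struct.pack('>f', x))[0]
    ex = (bits >> 23) & 0xFF
    man = 1.0 + (bits & 0x7FFFFF) * 2.0 ** -23    # 1.f
    ez = 0b11111110 - ex                          # Equation 3.58 constant
    r = 1.0 / man          # stands in for the Newton-Raphson mantissa result
    if r < 1.0:                                   # normalize: shift left once
        r *= 2.0
        ez -= 1                                   # ... and decrement exponent
    out = (ez << 23) | (round((r - 1.0) * 2 ** 23) & 0x7FFFFF)
    return struct.unpack('>f', struct.pack('>I', out))[0]

assert fp32_reciprocal(2.0) == 0.5
```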

Figure 3.19 The Proposed Single/Double Precision Reciprocal Unit.


4. RESULTS

This chapter presents synthesis results for the proposed and reference designs whose detailed implementation descriptions are given in Chapter 3. All designs are modeled in VHDL (Very High Speed Integrated Circuit Hardware Description Language). Syntheses are done using the TSMC (Taiwan Semiconductor Manufacturing Company) 0.18-micron standard ASIC (Application-Specific Integrated Circuit) library and the Leonardo Spectrum program. The syntheses are tuned for delay optimization with maximum effort.

4.1 The Results for the Multi-Precision Floating-Point Adder Design

This section presents the synthesis results obtained for the proposed multi-precision floating-point adder and the single-path floating-point adders. In addition to the double-precision floating-point adders, single-precision floating-point adders are also designed. The second multi-precision design performs a single-precision floating-point addition or two half-precision floating-point additions in parallel. The area and delay estimates are presented in Table 4.1. In this table, the unit for area is the number of gates and the unit for delay is nanoseconds (ns).

Table 4.1

Area and Delay Estimates for Multi-Precision Floating Point Adder

Adder Design        Area (Gates)   Delay (ns)
Double-Precision    4868           14.65
Multi-Precision 1   8195           17.33
Single-Precision    2056            9.33
Multi-Precision 2   2854            9.51

According to the given estimates, the first multi-precision design has approximately 68% more area and less than 3 nanoseconds more delay than the reference double-precision design, and the second multi-precision design has approximately 38% more gates and less than half a nanosecond more delay than the reference single-precision floating-point adder. The delay differences between the proposed designs and the reference designs are expected to decrease if the designs are pipelined.

A question that can be raised is why not use one double-precision, two single-precision, and four half-precision floating-point adders instead of one multi-precision floating-point adder that is capable of handling all the mentioned formats. The proposed unit is expected to use approximately 20% fewer gates than the total gate count required to design all the separate units (assuming a half-precision floating-point adder can be designed using approximately 500 gates). Also, the dedicated bus requirement for all the separate units can be a serious design problem, since wire delay becomes significant as transistor sizes decrease. The additional components used to provide single/double-precision support can be seen in Table 4.2.

Table 4.2

Additional Components in Multi-Precision Adder Design

Unit Name          Width    Number
Adder/Subtractor   8-bit    6
Decoder/Encoder    3-bit    3
Left Shifter       24-bit   1
Left Shifter       10-bit   2

The proposed design eliminates the type-conversion requirement and generates multiple results in parallel. The presented design is especially expected to increase performance for 2D and 3D applications, since these applications perform intensive floating-point additions on low-precision floating-point operands.

4.2 The Results for Single/Double Precision Floating-Point Multiplier Design

In this section we present the synthesis results for the proposed single/double-precision floating-point multiplier and the standard double-precision floating-point multiplier. Both circuits are optimized for delay. The values in Table 4.3 are in nanoseconds for time and in number of gates for area.

The single/double-precision multiplier has approximately 9.49% more area and about 34% more critical-path delay. The floating-point multipliers used in modern processors are usually pipelined designs. If the proposed method is applied to a pipelined multiplier,


Table 4.3

Area and Delay Estimates for Single/Double-Precision Multiplier Design

Multiplier Design   Area (Gates)   Delay (ns)
Double-Precision    25175          4.10
Multi-Precision     27566          5.49

the area increase is expected to fall below 5%, and the critical-delay increase will be absorbed by the pipeline stages. One of the important aspects of the presented design method is that it is applicable to all kinds of floating-point multipliers. The presented design was compared with a standard floating-point multiplier via synthesis. The synthesis results showed that the proposed design is 10% larger than the conventional multiplier, and the critical-path increase is only one or two gate delays. Since modern floating-point multiplier designs have significantly larger area than the standard floating-point multiplier, the percentage of extra hardware will be smaller for those units. The additional components used to provide single/double precision can be seen in Table 4.4. The methods presented in this design are used in the design of the floating-point multiplier-adder circuits.

Table 4.4

Additional Components in Single/Double-Precision Multiplier Design

Unit Name          Width    Number
Adder/Subtractor   8-bit    2
Incrementer        8-bit    2
Left Shifter       24-bit   1

4.3 The Results for the Multi-Functional Double-Precision FPMAF Design

The major additional components used to convert the basic double-precision FPMAF to the multi-functional double-precision FPMAF are placed in the following stages:

The first stage: two 8-bit adders, one 11-bit adder, and one 8-bit subtracter (in the Difference and Maximum Generator), and one 53-bit right-shifter that can shift up to 29 digits (in the Mantissa Modifier).

The fourth stage: two 8-bit incrementers (Exp Upd 1 and Exp Upd 2), two 24-bit incrementers (Rounding 1 and Rounding 2), and two 48-bit 1-digit right-shifters.

The Right-Shifter in Stage 2 and the Mantissa Multiplier, LZA, and Sticky1 in Stage 3 are also slightly modified to handle multiple-precision operands, but the amount of extra hardware for these modifications is negligible. The proposed double-precision design can be optimized by combining the Normalize 1 and Normalize 2, Rounding 1 and Rounding 2, and Exp Upd 1 and Exp Upd 2 units; however, the hardware gained by this optimization is not significant.

The proposed multi-functional FPMAF design is compared with the standard double-precision FPMAF by synthesis. All circuits are modeled using structural VHDL code. The adders, subtracters, and incrementers in these designs are implemented using parallel-prefix adders. The correctness of the proposed designs is verified with extensive simulation. Syntheses are done using the TSMC 0.18-micron standard ASIC library and the Leonardo Spectrum program. Both syntheses are tuned for delay optimization with maximum effort. Table 4.5 presents area estimates for the conventional and the proposed designs. In this table, the number of gates for each pipeline stage is presented. The proposed double-precision FPMAF design has approximately 8% more area than the standard double-precision design.

Table 4.5

Area Estimates for Double-Precision FPMAF Design

Pipeline Stage     Basic MAF   Multi-Functional
Mantissa Prepare   -           2805
Multiplication     23771       24184
Add                6450        6570
Round              5428        4950
Total Area         35649       38509

Table 4.6 presents delay estimates for the conventional and the proposed design in nanoseconds. The critical delay for the proposed double-precision FPMAF design is approximately 2.2% more than the critical delay for the standard double-precision design. The delay of the extra pipeline stage is less than the delay of the stage with the longest delay.


Table 4.6

Delay Estimates for Double-Precision FPMAF Design

Pipeline Stage     Basic MAF   Multi-Functional
Mantissa Prepare   -           3.36
Multiplication     3.42        3.34
Add                3.53        3.61
Round              2.98        2.27

The previous double-precision designs presented in (Jessani and Putrino, 1998) and (Huang, Shen, Dai and Wang, 2007) and the proposed double-precision design are structurally very similar. The dual-precision design in (Huang, Shen, Dai and Wang, 2007) and the proposed design in this study are synthesized using the 0.18-micron TSMC standard library. The extra hardware required to provide multi-precision execution for the proposed designs is less than 9%, whereas for the design of Huang, Shen, Dai and Wang (2007) it is 18%. Note that the unit of the area estimate for the proposed designs is the number of gates, while for Huang et al.'s design it is square micrometers. Even though the synthesis tools, mantissa multiplier designs, and adder types are different, the estimated clock delays for the proposed design and Huang et al.'s design are very close. The delay estimate for Jessani and Putrino's design in (Jessani and Putrino, 1998) could also be very close to those two estimates if it were synthesized with the same ASIC library, so it can be assumed that the clock delays for all the designs are equal. On the other hand, the latencies for the designs in (Jessani and Putrino, 1998), (Huang, Shen, Dai and Wang, 2007), and the proposed design are 3, 3, and 4 cycles, respectively.

Table 4.7

Additional Components in Multi-Functional Double-Precision FPMAF Design

Unit Name          Width     Number
Adder/Subtractor   8-bit     3
Incrementer        8-bit     2
Incrementer        24-bit    2
Left Shifter       48-bit    1
Right Shifter      53-bit    1
Right Shifter      108-bit   1


The design is implemented by extending the hardware of conventional FPMAF units. The additional components used to provide multifunctionality can be seen in Table 4.7. The presented design methods can also be tailored to provide the same functions in other high-performance FPMAF designs. The extra hardware used to modify the standard designs is not significant compared to the overall hardware; in fact, most of it fits into an additional pipeline stage. The proposed designs are expected to increase performance for applications that perform many independent floating-point multiplications. However, for applications that are data dependent, the extra pipeline stage may reduce performance compared to standard FPMAF designs.

4.4 The Results for the Multi-Functional Quadruple-Precision FPMAF

The additional components used to convert the basic quadruple-precision FPMAF to the multi-functional quadruple-precision FPMAF are placed in the following stages: The first stage: two 17-bit adders and four 8-bit subtracters (in the Exponent Adder and the Difference and Maximum Generator). The second stage: three 103-bit right shifters (in the Mantissa Modifier 2). The fourth stage: two 17-bit incrementers (in Exp Upd 1), two 53-bit incrementers (in Rounding 1), and two 106-bit 1-digit right-shifters.

The multi-functional FPMAF design is compared with the standard quadruple-precision FPMAF by synthesis. All circuits are modeled using structural VHDL code. The adders, subtracters, and incrementers in these designs are implemented using parallel-prefix adders. The correctness of the proposed designs is verified with extensive simulation. Table 4.8 presents area estimates for the conventional and the proposed designs. In this table, the number of gates for each pipeline stage is presented. The multi-functional quadruple-precision FPMAF design has approximately 12.5% more area than the standard quadruple-precision design.
The percentage increase in area is larger than that of the double-precision design, since the number of supported modes is increased in the quadruple-precision design. Table 4.9 presents delay estimates for the conventional and the proposed design in nanoseconds. The critical delay for the proposed quadruple-precision FPMAF design is approximately 5% more than the critical delay for the standard quadruple-precision design. The delay of the extra pipeline stage is less than the delay of the stage with the longest delay.


Table 4.8

Area Estimates for Quadruple-Precision FPMAF Design

Pipeline Stage     Basic MAF   Multi-Functional
Mantissa Prepare   -           3494
Multiplication     106224      119684
Add                13518       13940
Round              11663       10720
Total Area         131405      147838

Table 4.9

Delay Estimates for Quadruple-Precision FPMAF Design

Pipeline Stage     Basic MAF   Multi-Functional
Mantissa Prepare   -           4.63
Multiplication     4.43        4.71
Add                4.51        4.74
Round              4.26        4.65

The design is implemented by extending the hardware of conventional FPMAF units. The presented design methods can also be tailored to provide the same functions in other high-performance FPMAF designs. The extra hardware used to modify the standard designs is not significant compared to the overall hardware. The additional components used to provide multifunctionality can be seen in Table 4.10. The single-precision operation modes supported in all the designs can be especially useful in 3D multimedia applications, which do not require high-precision floating-point operands. The proposed design also supports dot products with low-precision operands. The presented dot-product mode reduces the rounding error, since only one rounding is performed in each pass. The proposed design is expected to increase performance for applications that perform many independent floating-point multiplications. Another advantage of the proposed design over the previous designs is that it can support more than two precisions, whereas the previous designs support only two different precisions. The proposed quadruple-precision multiplier can perform double- and single-precision operations.


Table 4.10

Additional Components in Multi-Functional Quadruple-Precision FPMAF Design

Unit Name          Width     Number
Adder/Subtractor   17-bit    4
Incrementer        17-bit    2
Incrementer        53-bit    2
Left Shifter       106-bit   1
Right Shifter      113-bit   3
Right Shifter      168-bit   1

4.5 The Multi-Precision Floating-Point Reciprocal Unit

The synthesis results for the proposed single/double-precision floating-point reciprocal unit are presented in this section. The design in (Kucukkabak and Akkas, 2004) was used as the reference standard double-precision floating-point reciprocal unit, with some estimations; these include the design of an unsigned radix-2 multiplier, carry-propagate adders, and the control logic for the multiplexers. The clock delays and area estimates for both designs are given in Table 4.11; the values are in nanoseconds for time and in number of gates for area.

Table 4.11

The Comparison of the Standard Double-Precision and the Proposed Floating-Point Reciprocal Designs

Design                       Number of Gates   Delay (ns)
Reference Double Precision   31979             3.86
Single/Double Precision      33997             3.94

The single/double-precision reciprocal unit has approximately 6% more area and about 3% more critical-path delay. The most critical delay occurs in the multiplier; because the multiplier used here is only slightly modified, the resulting difference in delay is negligible. The additional circuits also cause a negligible growth in area. The floating-point reciprocal units used in modern processors are usually pipelined designs. The design computes two single-precision reciprocals with about the same latency, which is absorbed by the pipeline stages.


The presented reciprocal unit is designed for multimedia applications and operates on SIMD-type data input. The accuracy of the result is 20 bits for each iteration. Compared to the previous reference designs, less than a 1% increase in area and delay is reported based on the synthesis results, while the functionality of the reciprocal unit is improved to support three operation modes. The mode that generates two different reciprocals simultaneously is expected to double the performance of single-precision division operations. The extra hardware used to modify the standard design is not significant compared to the overall hardware. The additional components used to provide multi-precision operation can be seen in Table 4.12. The proposed unit can be expanded to support the reciprocal-square-root operation with additional circuitry and modifications.

Table 4.12

Additional Components in Multi-Precision Reciprocal Design

Unit Name          Width     Number
Adder/Subtractor   8-bit     1
Incrementer        8-bit     1
Left Shifter       24-bit    1
Right Shifter      168-bit   1



5. CONCLUSIONS

This dissertation presents novel floating-point hardware designs for multimedia applications. The main goal of the dissertation is to add functionality to, and accelerate, the basic arithmetic operations used in multimedia applications. Although multimedia applications require a great deal of computational power, the computation is usually repetitive over the multimedia data. SIMD extensions were developed to perform the same operation on the pieces of packed data in parallel. SIMD instruction-set extensions are very popular among the major processor manufacturers; SSE, SSE2, SSE3, and SSE4 from Intel Corp. and 3DNow! from AMD are well-known examples. The designs presented in this thesis offer efficient implementations of the main SIMD instructions offered in those popular multimedia instruction-set extensions. More precisely, implementations for the following instructions are presented: packed floating-point add, packed floating-point multiply, packed floating-point multiply-add, dot product, and packed reciprocal operations.

The proposed multi-precision adder can be used for the addition or subtraction of two single-precision or four half-precision operands. When matrix data has to be added or subtracted, the proposed design can decrease the delay of the calculation by about 70%. The proposed floating-point adder has about 40% more area with nearly the same delay, together with the additional precision capabilities.

The proposed multi-functional MAF design can decrease the delay of matrix multiplication with its dot-product function, and it decreases the delay of parallel low-precision floating-point multiplications. The proposed design has about 2% more area and the same delay as a basic double- or quadruple-precision multiplier, with additional functions such as dot product and simultaneous multiplication of two or four single-precision numbers.

Similar gains are achieved by the multi-precision reciprocal design. The proposed design has about 6% more area than the reference design. It has about 3% more delay but is capable of taking the reciprocals of two single-precision floating-point numbers besides a double-precision one. When this design is coupled with a multi-functional MAF design, the combination can perform a division, a divide-and-sum, or a divide-and-subtract operation.

The major general-purpose processor manufacturers and graphics processing unit manufacturers are adding new features to their designs to handle the multimedia load, because the


demand of the digital world increases day by day. Every new feature requires greater computational power. The proposed designs support more computation with the same delay. These designs can be implemented directly in a microprocessor as an extension, or as a separate co-processor on a daughter board. When implemented as an add-on, they can be used by either the graphics processing unit or the central processing unit. With some modifications, they can be fitted on an FPGA (Field Programmable Gate Array) and used as extra computing power for microcontrollers or analog/digital processing units. Although there exists an abundance of multimedia applications, most of the operations required to execute them are uniform. For example, some image-manipulation operations, some 3D transformations such as rotation, scaling, and translation, and some audio manipulations such as amplification, equalization, or echo addition/cancellation require similar types of operations. All of these applications may benefit from the designs developed in this dissertation.


BIBLIOGRAPHY

Akkas, A., Schulte, M.J., 2006. Dual-mode floating-point multiplier architectures with parallel operations. Journal of Systems Architecture, 549-562.
AltiVec Technology Programming Environments Manual, Motorola, Online (2006). http://www.freescale.com/files/32bit/doc/ref_manual/ALTIVECPEM.pdf
AMD-3DNow! Technology Manual, Online (2000). http://www.amd.com
AMD, 2007. ATI FireGL Technical Specifications. Online. http://ati.amd.com/products/workstation/techspecs2.html
ANSI/IEEE Standard 754, 1985. IEEE Standard for Binary Floating-Point Arithmetic.
Arfken, G., 1985. Mathematical Methods for Physicists, 3rd ed., Academic Press, Orlando, pp. 13-18.
Baugh, C.R., Wooley, B.A., 1973. A Two's Complement Parallel Array Multiplication Algorithm. IEEE Transactions on Computers, C-22(12):1045-1047.
Beaumont-Smith, A., Lim, C.C., 2001. Parallel prefix adder design. Computer Arithmetic, Proceedings of the 15th IEEE Symposium on, 218-225.
Beuchat, J.L., Tisserand, A., September 2002. Small multiplier-based multiplication and division operators for Virtex-II devices. Proceedings of the 12th International Conference on Field-Programmable Logic and Applications, 513-522.
Booth, A., 1951. A Signed Binary Multiplication Technique. Quarterly Journal of Mechanics and Applied Mathematics, 4:236-240.
Buford, J.F.K., 1994. Multimedia Systems. Addison-Wesley Pub. Co.
Charles, P., 2007. 3D Programming for Windows. Microsoft Press, 448p.
Chen, S., Wang, D., Zhang, T., Hou, C., 2006. Design and Implementation of a 64/32-bit Floating-point Division, Reciprocal, Square Root, and Inverse Square Root Unit. Solid-State and Integrated Circuit Technology, ICSICT'06, 8th International Conference on, Shanghai, 1976-1979.
Chirca, K., Schulte, M., Glossner, J., Horan, W., Mamidi, B., Balzola, P., Vassiliadis, S., 2004. A static low-power, high-performance 32-bit carry skip adder. Digital System Design, DSD Euromicro Symposium on, 615-619.

Cole, P., Oct/Nov 2005. OpenGL ES SC - open standard embedded graphics API for safety critical applications. DASC 2005, 2:8.
Dadda, L., 1965. Some Schemes for Parallel Multipliers. Alta Frequenza, 34:349-356.
Debes, E., Macy, W.W., Tyler, J.J., Peleg, A.D., Mittal, M., Mennemeier, L.M., Eitan, B., Dulong, C., Kowashi, E., Witt, W., 2008. Method and Apparatus for Performing Multiply-Add Operations on Packed Data. Intel Corporation, Patent Number 7,395,298 B2.
Diefendorff, K., Dubey, P.K., Hochsprung, R., Scale, H., Mar/Apr 2000. AltiVec extension to PowerPC accelerates media processing. IEEE Micro, 20(2):85-95.
Ercegovac, M.D., Lang, T., 2004. Digital Arithmetic. Morgan Kaufmann.
Ercegovac, M.D., Lang, T., 1987. On-the-fly conversion of redundant into conventional representations. IEEE Transactions on Computers, 895-897.
Even, G., Mueller, S., Seidel, P., 1997. A dual mode IEEE multiplier. Proceedings of the 2nd Annual IEEE Int. Conf. on Innovative Systems in Silicon, Austin, TX, USA, 282-289.
Even, G., Seidel, P.M., 2000. A comparison of three rounding algorithms for IEEE floating-point multiplication. IEEE Transactions on Computers, 49:638-650.
Fossum, T., Grundmann, R.W., Hag, M.S., 1991. Pipelined Floating Point Adder for Digital Computer. Digital Equipment Corporation, Patent Number 4,994,996.
Fu-Chiung, C., Unger, S.H., Theobald, M., Jul 2000. Self-timed carry-lookahead adders. IEEE Transactions on Computers, 49(7):659-672.
Garland, M., Le Grand, S., Nickolls, J., Anderson, J., Hardwick, J., Morton, S., Phillips, E., Yao, Z., Volkov, V., Jul/Aug 2008. Parallel Computing Experiences with CUDA. IEEE Micro, 28(4):13-27.
Gok, M., Ozbilen, M.M., 2008. Multi-functional floating-point MAF designs with dot product support. Journal of Microelectronics, 39:30-43.
Gok, M., Ozbilen, M.M., 2009a. Evaluation of Sticky-Bit Generation Methods for Floating-Point Multipliers. Journal of Signal Processing Systems, 56:51.
Gok, M., 2007. A novel IEEE rounding algorithm for high-speed floating-point multipliers. Integration, the VLSI Journal, 40:549-560.
Gok, M., Schulte, M.J., Krithivasan, S., 2004. Designs for subword-parallel multiplications and dot product operations. WASP'04, Third Workshop on Application Specific Processors, Stockholm, Sweden, 27-31.

Gok, M., Ozbilen, M.M., 2009b. A Single or Double Precision Floating-Point Multiplier Design for Multimedia Applications. Istanbul University Journal of Electrical and Electronics Engineering, 9:827-831.
Gok, M., Ozbilen, M.M., 2009c. A Single or Double Precision Floating-Point Reciprocal Unit for Multimedia Applications. In Review.
Gurkaynak, F.K., Leblebici, Y., Chaouati, L., McGuinness, P.J., 2000. Higher radix Kogge-Stone parallel prefix adder architectures. Circuits and Systems, Proceedings ISCAS 2000 Geneva, 5:609-612.
Harris, D., Sutherland, I., Nov 2003. Logical effort of carry propagate adders. Conference Record of the 37th Asilomar Conference on, 1:873-878.
Heikes, C., Colon-Bonet, G., Feb 1996. A Dual Floating Point Coprocessor with an FMAC Architecture. ISSCC Dig. Tech. Papers, 354-355.
Hillman, D., 1997. Multimedia Technology and Applications. Delmar Pub., 274p.
Hokenek, E., Montoye, R., Cook, P., 1990. Second-generation RISC floating point with multiply-add fused. IEEE Journal of Solid-State Circuits, 25(10):1207-1213.
Huang, L., Shen, L., Dai, K., Wang, Z., 2007. A new architecture for multiple-precision floating-point multiply-add fused unit design. Proceedings of the 18th IEEE Symposium on Computer Arithmetic, IEEE Computer Society, Washington, DC, USA, 69-76.
Intel 64 and IA-32 Architectures Software Developer's Manual, Online (2007). http://www.intel.com/design/processor/manuals/253667.pdf
Intel SSE4 Programming Reference, Online (2007). softwarecommunity.intel.com
Jagodik, P.J., Brooks, J.S., Olson, C., 2008. Multiplier Structure Supporting Different Precision Multiplication Operations. Sun Microsystems Inc., Patent Number 7,433,912 B1.
Jessani, R.M., Putrino, M., 1998. Comparison of single- and dual-pass multiply-add fused floating-point units. IEEE Trans. Comput., 47(9):927-937.
Koren, I., 2002. Computer Arithmetic Algorithms. A.K. Peters Ltd., Canada, 281p.
Kucukkabak, U., Akkas, A., 2004. Design and implementation of reciprocal unit using table look-up and Newton-Raphson iteration. Digital System Design, 2004 Euromicro Symposium on, 249-253.

Lee, C., Potkonjak, M., Mangione-Smith, W.H., 1997. MediaBench: a tool for evaluating and synthesizing multimedia and communications systems. Proceedings of the 30th Annual ACM/IEEE International Symposium on Microarchitecture, IEEE Computer Society, 330-335.
Lempel, O., Peleg, A., Weiser, U., Feb 1997. Intel's MMX technology - a new instruction set extension. Compcon '97 Proceedings, IEEE, 255-259.
Lindholm, E., Nickolls, J., Oberman, S., Montrym, J., Mar/Apr 2008. NVIDIA Tesla: A Unified Graphics and Computing Architecture. IEEE Micro, 28(2):39-55.
Macedonia, M., Oct 2003. The GPU enters computing's mainstream. IEEE Computer, 36(10):106-108.
Microprocessor Standards Committee, 2006. DRAFT Standard for Floating-Point Arithmetic P754. IEEE.
Min, C., Swartzlander, E.E., 2000. Modified carry skip adder for reducing first block delay. Circuits and Systems, Proceedings of the 43rd IEEE Midwest Symposium on, 1:346-348.
Nvidia, 2007. GeForce Family. Online. http://www.nvidia.com/object/geforce_family.html
Oberman, S., Favor, G., Weber, F., Mar/Apr 1999. AMD 3DNow! technology: architecture and implementations. IEEE Micro, 19(2):37-48.
Oberman, S.F., Juffa, N., Weber, F., 2000. Method and Apparatus for Calculating Reciprocals and Reciprocal Square Roots. Advanced Micro Devices Inc., Patent Number 6,115,773.
Oberman, S.F., 2002. Shared FP and SIMD 3D Multiplier. Advanced Micro Devices Inc., Patent Number 6,490,607 B1.
Singhal, R., Aug 2004. Intel Pentium 4 Processor on 90nm Technology. Hot Chips 16.
O'Connell, F.P., White, S.W., 2000. POWER3: the next generation of PowerPC processors. IBM Journal of Research and Development, 44(6):873-884.
Ozbilen, M.M., Gok, M., 2008. A Multi-Precision Floating-Point Adder. 4th International Conference on Ph.D. Research in Electrical and Electronics Engineering, PRIME 2008, 117-120.
Quach, N., Takagi, N., Flynn, M., 2004. Systematic IEEE rounding on high-speed floating-point multipliers. IEEE Transactions on VLSI Systems, 12:511-519.

Takagi, N., 1997. Generating a power of an operand by a table look-up and a multiplication. Proceedings of the 13th Symposium on Computer Arithmetic, Asilomar, 126-131.
Schmookler, M.S., Mikan, D.G., 1996. Two state leading zero/one anticipator (LZA). Patent Number 5,493,520.
Varghese, G., Sanjeev, J., Chao, T., Smits, K., Satish, D., Siers, S., Ves, N., Tanveer, K., Sanjib, S., Puneet, S., Nov 2007. Penryn: 45-nm next generation Intel Core 2 processor. Solid-State Circuits Conference, IEEE Asian, 14-17.
Wallace, C.S., 1964. A Suggestion for a Fast Multiplier. IEEE Transactions on Electronic Computers, EC-13:14-17.
Wang, Z., Jullien, G.A., Miller, W.C., Wang, J., May 1993. New concepts for the design of carry lookahead adders. Circuits and Systems, ISCAS '93, 3:1837-1840.
Weems, C., Riseman, E., Hanson, A., Rosenfeld, A., 1991. The DARPA image understanding benchmark for parallel computers. Journal of Parallel and Distributed Computing, 11:1-24.
Yang, X., Lee, R.B., 2004. PLX FP: An efficient floating-point instruction set for 3D graphics. ICME'04, IEEE International Conference on Multimedia and Expo, Taipei, 1:137-140.
Yang, C.L., Sano, B., Lebeck, A.R., 2000. Exploiting parallelism in geometry processing with general purpose processors and floating-point SIMD instructions. IEEE Trans. Comput., 49(9):934-946.
Yu, R.K., Zyner, G.B., 1995. 167 MHz radix-4 floating point multiplier. ARITH'95: Proceedings of the 12th Symposium on Computer Arithmetic, IEEE Computer Society, Washington, 149.
Yu-Ting, P., Yu-Kumg, C., Jan 2004. The fastest carry lookahead adder. Electronic Design, Test and Applications, DELTA 2004, Second IEEE International Workshop on, 434-436.

CURRICULUM VITAE

Metin Mete Ozbilen was born in Tarsus in 1974. He completed his elementary education at Kayseri Ahmet Pasa Primary School in 1984 and attended Kayseri Nuh Mehmet Küçükçalık Anatolian High School. He graduated from the Department of Electrical and Electronics Engineering at Gaziantep University in 1996. He worked as an electrical and electronics engineer at a company in Gaziantep from 1996 to 1998, and as an information technology instructor at Gaziantep Vocational High School from 1999 to 2001, where he taught Database Management, Computer Hardware, Microprocessors and Operating Systems courses. He received his M.Sc. degree from the Department of Electrical and Electronics Engineering at Cukurova University in 2002. Since 2001, he has been working as a research assistant at Mersin University. He is married and the father of a son and a daughter. His areas of interest are computer architecture, digital design, microprocessors, operating systems and system programming.

