0% found this document useful (0 votes)
25 views21 pages

Slide n2 Appendix Posted

This document discusses floating point arithmetic and representation of real numbers in computers. It describes how real numbers are represented using a mantissa and exponent in a given number base. It also discusses the IEEE floating point standard and issues like rounding, rounding error, and machine precision that arise from representing real numbers with a finite number of bits.
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
25 views21 pages

Slide n2 Appendix Posted

This document discusses floating point arithmetic and representation of real numbers in computers. It describes how real numbers are represented using a mantissa and exponent in a given number base. It also discusses the IEEE floating point standard and issues like rounding, rounding error, and machine precision that arise from representing real numbers with a finite number of bits.
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 21

FloatingPointArithmetic(Appendix)

Computer Architecture

This material is for educational uses only. Some contents are based on the materialprovided by other paper/book authors and may
be copyrighted by them.
Representation of Real Numbers

In everyday life we use decimal representation of numbers. For example

1234.567

for us means

1 ∗ 104 + 2 ∗ 103 + 3 ∗ 102 + 4 ∗ 100 + 5 ∗ 10−1 + 6 ∗ 10−2 + 7 ∗ 10−3 .

More generally
. . . dj . . . d1 d0 .d−1 . . . d−i . . .
represents

· · · dj ∗ 10j + · · · + d1 ∗ 101 + d0 ∗ 100 + d−1 ∗ 10−1 + · · · + d−i ∗ 10−i + · · · .

Floating Point Arithmetic – 2


Representation of Real Numbers
Let β ≥ 2 be an integer. For every x ∈ IR there exist integers e and
di ∈ {0, . . . , β − 1}, i = 0, 1, . . . , such that
∞ 

x = sign(x) di β −i β e . (1)
i=0

The representation is unique if one requires that d0 > 0 when x = 0.

Example

11
= 5 ∗ 100 + 5 ∗ 10−1 = (5.5)10 ,
2
11
= 1 ∗ 22 + 0 ∗ 21 + 1 ∗ 20 + 1 ∗ 2−1
2
= (1 ∗ 20 + 0 ∗ 2−1 + 1 ∗ 2−2 + 1 ∗ 2−3 ) ∗ 22 = (1.011)2 ∗ 22 .

Floating Point Arithmetic – 3


Floating Point Numbers
In a computer only a finite subset of all real numbers can be represented.
These are the so–called floating point numbers and they are of the form
m−1 

x̄ = (−1)s di β −i β e
i=0

with di ∈ {0, . . . , β − 1}, i = 0, 1, . . . , m − 1, and e ∈ {emin , . . . , emax }.


 β is called the base,
m−1 −i

i=0 di β is the significant or mantissa, m is the mantissa length,
 e is the exponent, and {emin , . . . , emax } is the exponent range.
 If β = 2, then we say the floating point number system is a binary
system. In this case the di ’s are called bits.
 If β = 10, then we say the floating point number system is a decimal
system. In this case the di ’s are called digits.
 A floating point number x̄ = 0 is said to be normalized if d0 > 0.

Floating Point Arithmetic – 4


A Toy Floating Point Number System
Consider the floating point number system
β = 2, m = 3, emin = −1, emax = 2.
The normalized floating point numbers x̄ = 0 are of the form

x̄ = ±1.d1 d2 × 2e

since the normalization condition implies that d0 ∈ {1, . . . , β − 1} = {1}.

-
1537 5 3 7 5 7
2848
1 4 2 4
2 2
3 2
4 5 6 7
0

Positive numbers with exponent e = 0 , 1 , 2 ,−1

Floating Point Arithmetic – 5


Consider the floating point number system
m−1 
s
 −i
x̄ = (−1) di β βe
i=0

with di ∈ {0, . . . , β − 1}, i = 0, 1, . . . , m − 1, and e ∈ {emin , . . . , emax }.


 The mantissa satisfies

m−1 
m−1
di β −i ≤ (β − 1)β −i = β(1 − β −m ) < β.
i=0 i=0

 The mantissa of a normalized floating point number is always ≥ 1.


 The largest floating point number is
m−1 
 −i
x̄max = (β − 1)β β emax = (1 − β −m )β emax +1 .
i=0

 The smallest positive normalized floating pt. number is x̄min = β emin .


 The distance between 1 and the next largest floating pt. number is β 1−m .
Half this number, mach = 12 β 1−m , is called machine precision or unit
roundoff. (We will see later why).
The spacing between the floating pt. numbers in [1, β] is β −(m−1) .
The spacing between the floating pt. numbers in [β e , ββ e ] is β −(m−1) β e .
Floating Point Arithmetic – 6
IEEE Floating Point Numbers
 Almost all every modern computer implements the IEEE binary (β = 2)
floating point standard.
 IEEE single precision floating point numbers are stored in 32 bits.
 IEEE double precision floating point numbers are stored in 64 bits.
 How these numbers are stored is quite interesting (clever), but a little too
involved to get into here. See the references [G91,O01,SUN] given at the
end of this lecture.
 Here are some important numbers.

Common Name (Approximate) Equivalent Value


Single Precision Double Precision

Unit roundoff 2−24 ≈ 6.e − 8 2−53 ≈ 1.1e − 16


Maximum normal number 3.4e + 38 1.7e + 308
Minimum positive normal number 1.2e − 38 2.3e − 308
Maximum subnormal number 1.1e − 38 2.2e − 308
Minimum positive subnormal number 1.5e − 45 5.0e − 324

Floating Point Arithmetic – 7


Rounding
Given a real number x we define
fl(x) = normalized floating point number closest to x.
A floating point number x̄ closest to x is obtained by rounding. If
∞ 

x = sign(x) di β −i β e ,
i=0

then
⎧ m−1
e
−i
⎨ sign(x) i=0 di β β , if dm < 12 β,
fl(x) = 
⎩ sign(x) m−1
di β −i
+ β −(m−1)
βe, if dm ≥ 12 β.
i=0

Example Let β = 10, m = 3. Then


fl(1.234 ∗ 10−1 ) = 1.23 ∗ 10−1 ,
fl(1.235 ∗ 10−1 ) = 1.24 ∗ 10−1 ,
fl(1.295 ∗ 10−1 ) = 1.30 ∗ 10−1 .

Note, there may be two floating point numbers closest to x. fl(x) picks one of
them. For example, let β = 10, m = 3. Then 1.235 − 1.24 = 0.005, but also
1.235 − 1.23 = 0.005. See [G91,O01,SUN] for more details on ’breaking’ ties.
Floating Point Arithmetic – 8
Rounding Error

Theorem
If x is a number within the range of floating point numbers and
|x| ∈ [β e , β e+1 ), then the absolute error between x and the floating point
number fl(x) closest to x is given by
1 e (1−m)
|fl(x) − x| ≤ β
2
and, provided x = 0, the relative error is given by

|fl(x) − x| 1
≤ β 1−m . (2)
|x| 2

The number
1 1−m
mach =
def
β
2
is called machine precision or unit roundoff.

Floating Point Arithmetic – 9


Proof of the theorem:
If x = 0, then fl(x) = x and the assertion follows immediately.
Consider x > 0. (The case x < 0 can be treated in the same manner.)
Recall that the spacing between the floating point numbers
m−1 

x̄ = di β −i β e ∈ [β e , β e+1 )
i=0

is β −(m−1) β e . Hence if x ∈ [β e , β e+1 ), then the floating point number x̄


closest to x satisfies |x̄ − x| ≤ 12 β −(m−1) β e . Since x ≥ β e ,

|x̄ − x| 1
≤ β −(m−1) .
|x| 2


fl(x) is a floating point number closest to x = i=0 di β −i βe,d0>0

Floating Point Arithmetic – 10


Examples
Let β = 10, m = 3, thus mach = 5 ∗ 10−3 .

|fl(1.234 ∗ 10−1 ) − 1.234 ∗ 10−1 | = 0.0004,


|fl(1.234 ∗ 10−1 ) − 1.234 ∗ 10−1 | 0.0004
= ≈ 3.2 ∗ 10−3 ,
1.234 ∗ 10−1 1.234 ∗ 10 −1

|fl(1.295 ∗ 10−1 ) − 1.295 ∗ 10−1 | = 0.0005,


|fl(1.295 ∗ 10−1 ) − 1.295 ∗ 10−1 | 0.0005
= ≈ 3.9 ∗ 10−3 .
1.295 ∗ 10−1 1.295 ∗ 10 −1

Floating Point Arithmetic – 11


Floating Point Arithmetic
 Let  represent one of the elementary operations +, −, ∗, /. If x̄ and
ȳ are floating point numbers, then x̄ȳ may not be a floating point
number.
Example: β = 10, m = 4: 1.234 + 2.751 ∗ 10−1 = 1.5091.
What is the computed value for x̄ȳ?
 In IEEE floating point arithmetic the result of the computation x̄ȳ
is equal to the floating point number that is nearest to the exact
result x̄ȳ. Therefore we use fl(x̄ȳ) to denote the result of the
computation x̄ȳ
 Model for the computation of x̄ȳ, where  is one of the
elementary operations +, −, ∗, /.
1. Given floating point numbers x̄ and ȳ.
2. Compute x̄ȳ exactly.
3. Round the exact result x̄ȳ to the nearest floating point number
and normalize the result.
Example cont.: 1.234 + 2.751 ∗ 10−1 = 1.5091. Comp. result: 1.509
The actual implementation of the elementary operations is more
sophisticated. For more details see [DG91,O01].
Floating Point Arithmetic – 12
Floating Point Arithmetic (Cont.)
Given two numbers x̄, ȳ in floating point format, the computed result satisfies
|fl(x̄ȳ) − (x̄ȳ)|
≤ mach .
x̄ȳ

Examples
Consider the floating point system β = 10 and m = 4.
i. x̄ = 2.552 ∗ 103 and ȳ = 2.551 ∗ 103 .
x̄ − ȳ = 0.001 ∗ 103 = 1.000 ∗ 100 . In this case x̄ − ȳ is a floating point
number and nothing needs to done; no error occurs in the subtraction of
x̄, ȳ.
ii. x̄ = 2.552 ∗ 103 and ȳ = 2.551 ∗ 102 .
x̄ − ȳ = 2.2969 ∗ 103 . This is not a floating point number. The floating
point number nearest to x̄ − ȳ is fl(x̄ − ȳ) = 2.297 ∗ 103 .

|fl(x̄ − ȳ) − (x̄ − ȳ)| |2.297 ∗ 103 − 2.2969 ∗ 103 |


= ≈ 4.4 ∗ 10−5
|x̄ − ȳ| 2.2969 ∗ 103

< mach = 5 ∗ 10−4 .

Floating Point Arithmetic – 13


Floating Point Arithmetic: Cancellation
For the previous result on the error between x̄ȳ and the computed fl(x̄ȳ)
only holds if x̄, ȳ in floating point format. What happens when we operate with
numbers that are not in floating point format?
Example
Consider the floating point system β = 10 and m = 4.
Subtract the numbers x = 2.5515052 ∗ 103 and y = 2.5514911 ∗ 103 .
1. Compute the floating point numbers x̄ and ȳ nearest to x and y,
respectively: x̄ = 2.552 ∗ 103 and ȳ = 2.551 ∗ 103 .
2. Compute x̄ − ȳ exactly: x̄ − ȳ = 0.001 ∗ 103 .
3. Round the exact result x̄ − ȳ to the nearest floating point number:
fl(0.001 ∗ 103 ) = 0.001 ∗ 103 . Normalize the number:
fl(0.001 ∗ 103 ) = 1.000. The last digits are filled with (spurious) zeros.
The exact result is 2.5515052 ∗ 103 − 2.5514911 ∗ 103 = 1.410 ∗ 10−2 . The
relative error between exact and computed solution is

|1.000 − 1.410 ∗ 10−2 |


≈ 70  mach = 5 ∗ 10−4 .
1.410 ∗ 10−2
NotethatthislargeerrorisnotdueƚŽthecomputationoffl(x̄−ȳ).Thelarge
error is caused by the rounding of x and y at the beginning.
Floating Point Arithmetic – 14
Floating Point Arithmetic: Cancellation (cont.)
 To analyze the error incurred by the subtraction of two
numbers, the following representation is useful:
For every x ∈ IR, there exists  with || ≤ mach such that
fl(x) = x(1 + ).
Note that if x = 0, then the previous identity is satisfied for
def
 = (fl(x) − x)/x. The bound || ≤ mach follows from (2).
 For x, y ∈ IR we have 1 , 2 with |1 |, |2 | ≤ mach such that
fl(x) = x(1 + 1 ), fl(y) = y(1 + 2 ).
Moreover fl(fl(x) − fl(y)) = (fl(x) − fl(y))(1 + 3 ), with |3 | ≤ mach .
 Thus,
fl(fl(x) − fl(y)) = (fl(x) − fl(y))(1 + 3 ) = [x(1 + 1 ) − y(1 + 2 )](1 + 3 )
= (x − y)(1 + 3 ) + (x1 − y2 )(1 + 3 )
and, if x − y = 0, then the relative error is given by
|fl(fl(x) − fl(y)) − (x − y)| x1 − y2
= 3 + (1 + 3 ) (3)
|x − y| x−y
If 1 2 = 0 and x − y is small, the quantity on the rhs could be  mach .
Floating Point Arithmetic – 15
FloatingPointArithmetic: Cancelůation(cont.)

 Similar analysis can be carried out for +, −, ∗, /.


 Catastrophic cancelation can only occur with +, −.
 Catastrophic cancelation can only occur if one subtracts two
numbers which are not both in floating point format and which have
the same sign and are of approximately the same size, see (3), or if
one adds two numbers which are not both in floating point format,
which have opposite sign and their absolute values of approximately
the same size.

Floating Point Arithmetic – 16


FloatingPointArithmetic: CancelůationExample 1
x 1 − cos
0.500000 0.122417E + 00
0.125000 0.780231E − 02
0.312500E − 01 0.488222E − 03
Evaluation of 1 − cos(x) near x = 0.
0.781250E − 02 0.305176E − 04
(All computations were done using
0.195312E − 02 0.190735E − 05
single precision Fortran.)
0.488281E − 03 0.119209E − 06
0.122070E − 03 0.
0.305176E − 04 0.
0.762939E − 05 0.
0.190735E − 05 0.

Since cos(0) = 1 we expect catastrophic cancelation. If x = 0.122070E − 03,


then
1 − cos(x) ≈ 1.0000000000 − 0.99999999254......
= 0.00000000745..... = 7.45054.....e − 09
−1
   9254...... ∗ 10 )
1 − fl(cos(x)) ≈ 1.000000 − fl(9.999999
7 digits
= 1.000000 − 1.000000 = 0.
Floating Point Arithmetic – 17
FloatingPointArithmetic: CancelůationExample 1(cont.)
Two alternatives for small |x|.
 Since cos2 (x) + sin2 (x) = 1 it holds that
1 − cos(x) = sin2 (x)/(1 + cos(x)).
The formula sin2 (x)/(1 + cos(x)) avoids subtraction of two number that
are not in floating point format and are almost the same (recall that we
consider the case |x| small).

 The Leibnitz criterion
says that if the
series S = ∞ i
i=1 (−1) ci , ci ≥ 0,
converges, then S − i=1 (−1) ci < cn+1 .
n i

If we apply this to the Taylor expansion of cos(x),


x2 x4 x6 x8
cos(x) = 1 − + − + ± ...,
2 4! 6! 8!
then  
x2 x4 x6 x8
cos(x) − 1 − + − < .
2 4! 6! 8!
After some rearrangements we can use the approximation
 
x2 x2 x4
1 − cos(x) ≈ 1− +
2 12 360
and we know that the difference is less than x8 /(8!) which allows us to
control the error of the approximation.
Floating Point Arithmetic – 18
FloatingPointArithmetic: CancelůationExample 1(cont.)

x 1 − cos sin2 /(1 + cos) Taylor


0.500000 0.122417 0.122417 0.122418
0.125000 0.780231E − 02 0.780233E − 02 0.780233E − 02
0.312500E − 01 0.488222E − 03 0.488241E − 03 0.488242E − 03
0.781250E − 02 0.305176E − 04 0.305174E − 04 0.305174E − 04
0.195312E − 02 0.190735E − 05 0.190735E − 05 0.190735E − 05
0.488281E − 03 0.119209E − 06 0.119209E − 06 0.119209E − 06
0.122070E − 03 0. 0.745058E − 08 0.745058E − 08
0.305176E − 04 0. 0.465661E − 09 0.465661E − 09
0.762939E − 05 0. 0.291038E − 10 0.291038E − 10
0.190735E − 05 0. 0.181899E − 11 0.181899E − 11
0.476837E − 06 0. 0.113687E − 12 0.113687E − 12
0.119209E − 06 0. 0.710543E − 14 0.710543E − 14
0.298023E − 07 0. 0.444089E − 15 0.444089E − 15

Computations were performed using single precision Fortran.

Floating Point Arithmetic – 19


Summary

 Introduced how numbers are represented on a computer.


 Only a small set of numbers can be represented on the computer.
 The relative error between x = 0 and its nearest floating point
number fl(x) is

|fl(x) − x| def 1
≤ mach = β 1−m .
|x| 2
 Introduced basic properties of floating point arithmetic.
 Catastrophic cancellation can occur if one subtracts [adds] two
numbers which are not both in floating point format and which have
the same [opposite] sign and [their absolute values] are of
approximately the same size.

Floating Point Arithmetic – 21


Additional Reading

G91 David Goldberg. What every computer scientist should know about
floating-point arithmetic, ACM Comput. Surv., Vol. 23 (1), 1991,
pp. 5 - 48.
http://docs.sun.com/source/806-3568/ncg goldberg.html
O01 Michael L. Overton. Numerical Computing with IEEE Floating Point
Arithmetic, SIAM, Philadelphia, 2001.
SUN SUN Microsystems Numerical Computation Guide
http://docs.sun.com/source/806-3568/

Floating Point Arithmetic – 22

You might also like