Slide n2 Appendix Posted

This document discusses floating point arithmetic and representation of real numbers in computers. It describes how real numbers are represented using a mantissa and exponent in a given number base. It also discusses the IEEE floating point standard and issues like rounding, rounding error, and machine precision that arise from representing real numbers with a finite number of bits.

Uploaded by 이예진 [student] (Graduate School, Department of Electronics and Information Convergence Engineering)

Floating Point Arithmetic (Appendix)

Computer Architecture

This material is for educational use only. Some content is based on material provided by the authors of other papers and books and may be copyrighted by them.
Representation of Real Numbers

In everyday life we use decimal representation of numbers. For example

1234.567

for us means

$$1 \cdot 10^{3} + 2 \cdot 10^{2} + 3 \cdot 10^{1} + 4 \cdot 10^{0} + 5 \cdot 10^{-1} + 6 \cdot 10^{-2} + 7 \cdot 10^{-3}.$$

More generally,

$$\ldots d_j \ldots d_1 d_0 \,.\, d_{-1} \ldots d_{-i} \ldots$$

represents

$$\cdots + d_j \cdot 10^{j} + \cdots + d_1 \cdot 10^{1} + d_0 \cdot 10^{0} + d_{-1} \cdot 10^{-1} + \cdots + d_{-i} \cdot 10^{-i} + \cdots.$$

Floating Point Arithmetic – 2

Representation of Real Numbers
Let $\beta \ge 2$ be an integer. For every $x \in \mathbb{R}$ there exist integers $e$ and $d_i \in \{0, \ldots, \beta - 1\}$, $i = 0, 1, \ldots$, such that

$$x = \operatorname{sign}(x) \left( \sum_{i=0}^{\infty} d_i \beta^{-i} \right) \beta^{e}. \tag{1}$$

The representation is unique if one requires that $d_0 > 0$ when $x \neq 0$.

Example

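$$\frac{11}{2} = 5 \cdot 10^{0} + 5 \cdot 10^{-1} = (5.5)_{10},$$

$$\frac{11}{2} = 1 \cdot 2^{2} + 0 \cdot 2^{1} + 1 \cdot 2^{0} + 1 \cdot 2^{-1} = \left( 1 \cdot 2^{0} + 0 \cdot 2^{-1} + 1 \cdot 2^{-2} + 1 \cdot 2^{-3} \right) \cdot 2^{2} = (1.011)_{2} \cdot 2^{2}.$$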
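The normalized representation (1) can be computed digit by digit. The following is a minimal sketch, not part of the original slides; the function name `normalized_digits` is a hypothetical helper, and exact rational arithmetic is used so that the digits are not disturbed by rounding.

```python
from fractions import Fraction

def normalized_digits(x, beta, n_digits):
    """Return (sign, digits, e) with |x| = (d0.d1 d2 ...)_beta * beta**e and d0 > 0.

    Only the first n_digits digits are returned; x must be nonzero.
    """
    x = Fraction(x)
    if x == 0:
        raise ValueError("x must be nonzero")
    sign = 1 if x > 0 else -1
    r = abs(x)
    e = 0
    # Scale r into [1, beta) so that the leading digit d0 is nonzero.
    while r >= beta:
        r /= beta
        e += 1
    while r < 1:
        r *= beta
        e -= 1
    digits = []
    for _ in range(n_digits):
        d = int(r)          # next digit, in {0, ..., beta - 1}
        digits.append(d)
        r = (r - d) * beta  # shift the remainder one place to the left
    return sign, digits, e

print(normalized_digits(Fraction(11, 2), 10, 2))  # digits of (5.5)_10
print(normalized_digits(Fraction(11, 2), 2, 4))   # digits of (1.011)_2 * 2^2
```

For 11/2 the sketch returns the digits 5, 5 with exponent 0 in base 10 and the digits 1, 0, 1, 1 with exponent 2 in base 2, matching the example.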


Floating Point Numbers
In a computer only a finite subset of all real numbers can be represented. These are the so-called floating point numbers, and they are of the form

$$\bar{x} = (-1)^{s} \left( \sum_{i=0}^{m-1} d_i \beta^{-i} \right) \beta^{e}$$

with $d_i \in \{0, \ldots, \beta - 1\}$, $i = 0, 1, \ldots, m - 1$, and $e \in \{e_{\min}, \ldots, e_{\max}\}$.

$\beta$ is called the base,
$\sum_{i=0}^{m-1} d_i \beta^{-i}$ is the significand or mantissa, $m$ is the mantissa length,
$e$ is the exponent, and $\{e_{\min}, \ldots, e_{\max}\}$ is the exponent range.
If $\beta = 2$, then we say the floating point number system is a binary system. In this case the $d_i$'s are called bits.
If $\beta = 10$, then we say the floating point number system is a decimal system. In this case the $d_i$'s are called digits.
A floating point number $\bar{x} \neq 0$ is said to be normalized if $d_0 > 0$.


A Toy Floating Point Number System
Consider the floating point number system
$\beta = 2$, $m = 3$, $e_{\min} = -1$, $e_{\max} = 2$.
The normalized floating point numbers $\bar{x} \neq 0$ are of the form

$$\bar{x} = \pm 1.d_1 d_2 \times 2^{e}$$

since the normalization condition implies that $d_0 \in \{1, \ldots, \beta - 1\} = \{1\}$.

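[Figure: number line from 0 to 7 marking the positive floating point numbers 1/2, 5/8, 3/4, 7/8, 1, 5/4, 3/2, 7/4, 2, 5/2, 3, 7/2, 4, 5, 6, 7, that is, the positive numbers with exponent e = −1, 0, 1, 2.]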
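The positive numbers of the toy system can be listed by brute force. This is a small sketch, not taken from the slides; `positive_normalized` is a hypothetical helper name, and exact fractions are used so the values print as they would be drawn on the number line.

```python
from fractions import Fraction
from itertools import product

def positive_normalized(beta, m, e_min, e_max):
    """List all positive normalized numbers of the system (beta, m, e_min, e_max)."""
    values = []
    for e in range(e_min, e_max + 1):
        for d0 in range(1, beta):  # normalization: d0 > 0
            for tail in product(range(beta), repeat=m - 1):
                digits = (d0,) + tail
                mantissa = sum(Fraction(d, beta**i) for i, d in enumerate(digits))
                values.append(mantissa * Fraction(beta) ** e)
    return sorted(values)

toy = positive_normalized(beta=2, m=3, e_min=-1, e_max=2)
print(len(toy), [str(v) for v in toy])
```

The list has 16 entries, from 1/2 up to 7, and the gap between neighbours doubles each time the exponent increases by one.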


Consider the floating point number system

$$\bar{x} = (-1)^{s} \left( \sum_{i=0}^{m-1} d_i \beta^{-i} \right) \beta^{e}$$

with $d_i \in \{0, \ldots, \beta - 1\}$, $i = 0, 1, \ldots, m - 1$, and $e \in \{e_{\min}, \ldots, e_{\max}\}$.

The mantissa satisfies

$$\sum_{i=0}^{m-1} d_i \beta^{-i} \le \sum_{i=0}^{m-1} (\beta - 1) \beta^{-i} = \beta (1 - \beta^{-m}) < \beta.$$

The mantissa of a normalized floating point number is always $\ge 1$.

The largest floating point number is

$$\bar{x}_{\max} = \left( \sum_{i=0}^{m-1} (\beta - 1) \beta^{-i} \right) \beta^{e_{\max}} = (1 - \beta^{-m}) \beta^{e_{\max} + 1}.$$

The smallest positive normalized floating point number is $\bar{x}_{\min} = \beta^{e_{\min}}$.

The distance between 1 and the next larger floating point number is $\beta^{1-m}$.
Half this number, $\varepsilon_{\mathrm{mach}} = \frac{1}{2} \beta^{1-m}$, is called machine precision or unit roundoff. (We will see later why.)
The spacing between the floating point numbers in $[1, \beta]$ is $\beta^{-(m-1)}$.
The spacing between the floating point numbers in $[\beta^{e}, \beta^{e+1}]$ is $\beta^{-(m-1)} \beta^{e}$.
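The formulas above are easy to evaluate for a concrete system. The sketch below is an illustration that is not part of the slides; the helper name `system_constants` and the dictionary keys are assumptions. It evaluates the formulas exactly for the toy system with β = 2, m = 3, e_min = −1, e_max = 2.

```python
from fractions import Fraction

def system_constants(beta, m, e_min, e_max):
    """Evaluate the formulas for x_max, x_min, eps_mach and the spacing at 1."""
    b = Fraction(beta)
    return {
        "x_max": (1 - b ** (-m)) * b ** (e_max + 1),
        "x_min": b ** e_min,
        "eps_mach": b ** (1 - m) / 2,
        "spacing_at_1": b ** (1 - m),
    }

toy = system_constants(beta=2, m=3, e_min=-1, e_max=2)
for name, value in toy.items():
    print(name, "=", value)
```

For the toy system this gives the largest number 7, the smallest positive normalized number 1/2, spacing 1/4 next to 1, and unit roundoff 1/8.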
IEEE Floating Point Numbers
Almost every modern computer implements the IEEE binary (β = 2) floating point standard.
IEEE single precision floating point numbers are stored in 32 bits.
IEEE double precision floating point numbers are stored in 64 bits.
How these numbers are stored is quite interesting (and clever), but a little too involved to get into here. See the references [G91,O01,SUN] given at the end of this lecture.
Here are some important numbers.

Common Name                           (Approximate) Equivalent Value
                                      Single Precision     Double Precision
Unit roundoff                         2^-24 ≈ 6.0e-8       2^-53 ≈ 1.1e-16
Maximum normal number                 3.4e+38              1.7e+308
Minimum positive normal number        1.2e-38              2.3e-308
Maximum subnormal number              1.1e-38              2.2e-308
Minimum positive subnormal number     1.5e-45              5.0e-324
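These quantities can be inspected directly in a language whose floats are IEEE doubles. The following sketch is not from the slides. It assumes Python 3.9 or later (for `math.ulp`) on a platform with IEEE double precision floats, and it computes the single precision column from the formulas with m = 24, e_min = −126, e_max = 127 rather than from a native single precision type.

```python
import math
import sys

fi = sys.float_info  # parameters of the platform's double precision type

double = {
    "unit roundoff": fi.epsilon / 2,             # 2**-53
    "max normal": fi.max,
    "min positive normal": fi.min,               # 2**-1022
    "max subnormal": fi.min - math.ulp(0.0),
    "min positive subnormal": math.ulp(0.0),     # 2**-1074
}

# Single precision values from the formulas (m = 24, e_min = -126, e_max = 127),
# evaluated in double precision, where they are exactly representable.
single = {
    "unit roundoff": 2.0 ** -24,
    "max normal": (1 - 2.0 ** -24) * 2.0 ** 128,
    "min positive normal": 2.0 ** -126,
    "max subnormal": 2.0 ** -126 - 2.0 ** -149,
    "min positive subnormal": 2.0 ** -149,
}

for name in double:
    print(f"{name:24s} {single[name]:.8e}   {double[name]:.16e}")
```

The table entries are approximations, so the printed values may differ from them in the second significant digit.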


Rounding
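Given a real number $x$ we define
fl(x) = normalized floating point number closest to x.
A floating point number $\bar{x}$ closest to $x$ is obtained by rounding. If

$$x = \operatorname{sign}(x) \left( \sum_{i=0}^{\infty} d_i \beta^{-i} \right) \beta^{e},$$

then

$$\operatorname{fl}(x) = \begin{cases} \operatorname{sign}(x) \left( \sum_{i=0}^{m-1} d_i \beta^{-i} \right) \beta^{e}, & \text{if } d_m < \tfrac{1}{2} \beta, \\ \operatorname{sign}(x) \left( \sum_{i=0}^{m-1} d_i \beta^{-i} + \beta^{-(m-1)} \right) \beta^{e}, & \text{if } d_m \ge \tfrac{1}{2} \beta. \end{cases}$$

Example. Let $\beta = 10$, $m = 3$. Then

$$\operatorname{fl}(1.234 \cdot 10^{-1}) = 1.23 \cdot 10^{-1},$$
$$\operatorname{fl}(1.235 \cdot 10^{-1}) = 1.24 \cdot 10^{-1},$$
$$\operatorname{fl}(1.295 \cdot 10^{-1}) = 1.30 \cdot 10^{-1}.$$

Note that there may be two floating point numbers closest to $x$; fl(x) picks one of them. For example, let $\beta = 10$, $m = 3$. Then $|1.235 - 1.24| = 0.005$, but also $|1.235 - 1.23| = 0.005$. See [G91,O01,SUN] for more details on breaking ties.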
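For β = 10 the rounding rule can be tried out with Python's standard `decimal` module. This is a sketch, not the slides' method: the wrapper name `fl` is chosen here for illustration, and `ROUND_HALF_UP` is used because it matches the rule above (round up in magnitude when d_m ≥ β/2). IEEE arithmetic breaks ties differently by default (to even), which the last two lines illustrate.

```python
from decimal import Context, ROUND_HALF_EVEN, ROUND_HALF_UP

def fl(x, m, rounding=ROUND_HALF_UP):
    """Round the decimal string x to m significant digits (beta = 10)."""
    return Context(prec=m, rounding=rounding).create_decimal(x)

for s in ("1.234E-1", "1.235E-1", "1.295E-1"):
    print(s, "->", fl(s, 3))

# A tie: 0.1245 is equally far from 0.124 and 0.125.
print(fl("1.245E-1", 3, ROUND_HALF_UP))
print(fl("1.245E-1", 3, ROUND_HALF_EVEN))
```

The first three lines reproduce the example (0.123, 0.124, 0.130); the tie 0.1245 goes to 0.125 under round-half-up and to 0.124 under round-half-even.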
Rounding Error

Theorem
If $x$ is a number within the range of floating point numbers and $|x| \in [\beta^{e}, \beta^{e+1})$, then the absolute error between $x$ and the floating point number fl(x) closest to $x$ satisfies

$$|\operatorname{fl}(x) - x| \le \frac{1}{2} \beta^{e + (1 - m)}$$

and, provided $x \neq 0$, the relative error satisfies

$$\frac{|\operatorname{fl}(x) - x|}{|x|} \le \frac{1}{2} \beta^{1-m}. \tag{2}$$

The number

$$\varepsilon_{\mathrm{mach}} \overset{\mathrm{def}}{=} \frac{1}{2} \beta^{1-m}$$

is called machine precision or unit roundoff.


Proof of the theorem:
If $x = 0$, then fl(x) = x and the assertion follows immediately.
Consider $x > 0$. (The case $x < 0$ can be treated in the same manner.)
Recall that the spacing between the floating point numbers

$$\bar{x} = \left( \sum_{i=0}^{m-1} d_i \beta^{-i} \right) \beta^{e} \in [\beta^{e}, \beta^{e+1})$$

is $\beta^{-(m-1)} \beta^{e}$. Hence if $x \in [\beta^{e}, \beta^{e+1})$, then the floating point number $\bar{x}$ closest to $x$ satisfies $|\bar{x} - x| \le \frac{1}{2} \beta^{-(m-1)} \beta^{e}$. Since $x \ge \beta^{e}$,

$$\frac{|\bar{x} - x|}{|x|} \le \frac{1}{2} \beta^{-(m-1)}.$$

Here fl(x) is a floating point number closest to $x = \left( \sum_{i=0}^{\infty} d_i \beta^{-i} \right) \beta^{e}$, $d_0 > 0$.


Examples
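Let $\beta = 10$, $m = 3$, thus $\varepsilon_{\mathrm{mach}} = 5 \cdot 10^{-3}$.

$$|\operatorname{fl}(1.234 \cdot 10^{-1}) - 1.234 \cdot 10^{-1}| = 0.0004,$$

$$\frac{|\operatorname{fl}(1.234 \cdot 10^{-1}) - 1.234 \cdot 10^{-1}|}{1.234 \cdot 10^{-1}} = \frac{0.0004}{1.234 \cdot 10^{-1}} \approx 3.2 \cdot 10^{-3},$$

$$|\operatorname{fl}(1.295 \cdot 10^{-1}) - 1.295 \cdot 10^{-1}| = 0.0005,$$

$$\frac{|\operatorname{fl}(1.295 \cdot 10^{-1}) - 1.295 \cdot 10^{-1}|}{1.295 \cdot 10^{-1}} = \frac{0.0005}{1.295 \cdot 10^{-1}} \approx 3.9 \cdot 10^{-3}.$$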
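The bound (2) can also be checked for binary double precision. The sketch below is an illustration under the assumption that Python floats are IEEE doubles (β = 2, m = 53) and that converting a `Fraction` to `float` rounds to the nearest double; `relative_rounding_error` is a name chosen here, not something from the slides.

```python
from fractions import Fraction

EPS_MACH = Fraction(1, 2**53)  # (1/2) * beta**(1 - m) with beta = 2, m = 53

def relative_rounding_error(x):
    """Exact relative error |fl(x) - x| / |x| for a nonzero Fraction x."""
    fl_x = Fraction(float(x))  # float(x) rounds x to the nearest double
    return abs(fl_x - x) / abs(x)

for x in (Fraction(1, 10), Fraction(1, 3), Fraction(22, 7), Fraction(1, 2)):
    err = relative_rounding_error(x)
    print(x, float(err), err <= EPS_MACH)
```

Numbers such as 1/2 that are already floating point numbers give a relative error of exactly zero; the others give a small nonzero error that stays below the unit roundoff.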


Floating Point Arithmetic
Let $\odot$ represent one of the elementary operations $+, -, *, /$. If $\bar{x}$ and $\bar{y}$ are floating point numbers, then $\bar{x} \odot \bar{y}$ may not be a floating point number.
Example: $\beta = 10$, $m = 4$: $1.234 + 2.751 \cdot 10^{-1} = 1.5091$.
What is the computed value for $\bar{x} \odot \bar{y}$?
In IEEE floating point arithmetic the result of the computation $\bar{x} \odot \bar{y}$ is equal to the floating point number that is nearest to the exact result $\bar{x} \odot \bar{y}$. Therefore we use $\operatorname{fl}(\bar{x} \odot \bar{y})$ to denote the result of the computation $\bar{x} \odot \bar{y}$.
Model for the computation of $\bar{x} \odot \bar{y}$, where $\odot$ is one of the elementary operations $+, -, *, /$:
1. Given floating point numbers $\bar{x}$ and $\bar{y}$.
2. Compute $\bar{x} \odot \bar{y}$ exactly.
3. Round the exact result $\bar{x} \odot \bar{y}$ to the nearest floating point number and normalize the result.
Example cont.: $1.234 + 2.751 \cdot 10^{-1} = 1.5091$. Computed result: $1.509$.
The actual implementation of the elementary operations is more sophisticated. For more details see [G91,O01].
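The three-step model can be imitated for β = 10 with Python's `decimal` module, whose context operations return the exact result rounded to the context precision. This is a sketch under that assumption, not the slides' implementation; the helper name `fl_op` and the choice of round-half-even for ties are illustrative.

```python
from decimal import Context, Decimal, ROUND_HALF_EVEN

def fl_op(op, x, y, m=4):
    """Model fl(x op y) for beta = 10: exact result rounded to m digits."""
    ctx = Context(prec=m, rounding=ROUND_HALF_EVEN)
    ops = {"+": ctx.add, "-": ctx.subtract, "*": ctx.multiply, "/": ctx.divide}
    return ops[op](x, y)

x, y = Decimal("1.234"), Decimal("2.751E-1")
print(x + y)             # the sum under the default 28-digit context: 1.5091
print(fl_op("+", x, y))  # rounded to m = 4 digits: 1.509
```

The operands are assumed to be floating point numbers of the system already, that is, to have at most m significant digits.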
Floating Point Arithmetic (Cont.)
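Given two numbers $\bar{x}$, $\bar{y}$ in floating point format, the computed result satisfies

$$\frac{|\operatorname{fl}(\bar{x} \odot \bar{y}) - (\bar{x} \odot \bar{y})|}{|\bar{x} \odot \bar{y}|} \le \varepsilon_{\mathrm{mach}}.$$

Examples
Consider the floating point system $\beta = 10$ and $m = 4$.
i. $\bar{x} = 2.552 \cdot 10^{3}$ and $\bar{y} = 2.551 \cdot 10^{3}$.
$\bar{x} - \bar{y} = 0.001 \cdot 10^{3} = 1.000 \cdot 10^{0}$. In this case $\bar{x} - \bar{y}$ is a floating point number and nothing needs to be done; no error occurs in the subtraction of $\bar{x}$, $\bar{y}$.
ii. $\bar{x} = 2.552 \cdot 10^{3}$ and $\bar{y} = 2.551 \cdot 10^{2}$.
$\bar{x} - \bar{y} = 2.2969 \cdot 10^{3}$. This is not a floating point number. The floating point number nearest to $\bar{x} - \bar{y}$ is $\operatorname{fl}(\bar{x} - \bar{y}) = 2.297 \cdot 10^{3}$.

$$\frac{|\operatorname{fl}(\bar{x} - \bar{y}) - (\bar{x} - \bar{y})|}{|\bar{x} - \bar{y}|} = \frac{|2.297 \cdot 10^{3} - 2.2969 \cdot 10^{3}|}{2.2969 \cdot 10^{3}} \approx 4.4 \cdot 10^{-5} < \varepsilon_{\mathrm{mach}} = 5 \cdot 10^{-4}.$$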


Floating Point Arithmetic: Cancellation
The previous result on the error between $\bar{x} \odot \bar{y}$ and the computed $\operatorname{fl}(\bar{x} \odot \bar{y})$ holds only if $\bar{x}$ and $\bar{y}$ are in floating point format. What happens when we operate with numbers that are not in floating point format?
Example
Consider the floating point system $\beta = 10$ and $m = 4$.
Subtract the numbers $x = 2.5515052 \cdot 10^{3}$ and $y = 2.5514911 \cdot 10^{3}$.
1. Compute the floating point numbers $\bar{x}$ and $\bar{y}$ nearest to $x$ and $y$, respectively: $\bar{x} = 2.552 \cdot 10^{3}$ and $\bar{y} = 2.551 \cdot 10^{3}$.
2. Compute $\bar{x} - \bar{y}$ exactly: $\bar{x} - \bar{y} = 0.001 \cdot 10^{3}$.
3. Round the exact result $\bar{x} - \bar{y}$ to the nearest floating point number: $\operatorname{fl}(0.001 \cdot 10^{3}) = 0.001 \cdot 10^{3}$. Normalize the number: $\operatorname{fl}(0.001 \cdot 10^{3}) = 1.000$. The last digits are filled with (spurious) zeros.
The exact result is $2.5515052 \cdot 10^{3} - 2.5514911 \cdot 10^{3} = 1.410 \cdot 10^{-2}$. The relative error between the exact and the computed solution is

$$\frac{|1.000 - 1.410 \cdot 10^{-2}|}{1.410 \cdot 10^{-2}} \approx 70 \gg \varepsilon_{\mathrm{mach}} = 5 \cdot 10^{-4}.$$

Note that this large error is not due to the computation of $\operatorname{fl}(\bar{x} - \bar{y})$. The large error is caused by the rounding of $x$ and $y$ at the beginning.
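The example can be replayed step by step with Python's `decimal` module. This is a sketch for illustration only; it assumes round-half-even for the four-digit context (no ties occur here, so the tie rule does not matter) and uses the default 28-digit context for the reference value, which is exact for these operands.

```python
from decimal import Context, Decimal, ROUND_HALF_EVEN

ctx = Context(prec=4, rounding=ROUND_HALF_EVEN)  # beta = 10, m = 4

x, y = Decimal("2.5515052E3"), Decimal("2.5514911E3")
x_bar, y_bar = ctx.plus(x), ctx.plus(y)   # step 1: round the inputs to 4 digits
computed = ctx.subtract(x_bar, y_bar)     # steps 2 and 3: exact difference, rounded
exact = x - y                             # reference value, exact for these operands
rel_err = abs(computed - exact) / abs(exact)

print(x_bar, y_bar, computed, exact)
print(float(rel_err))
```

The rounded inputs are 2552 and 2551, the computed difference is 1, the exact difference is 0.0141, and the relative error is about 70.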
Floating Point Arithmetic: Cancellation (cont.)
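To analyze the error incurred by the subtraction of two numbers, the following representation is useful:
For every $x \in \mathbb{R}$, there exists $\varepsilon$ with $|\varepsilon| \le \varepsilon_{\mathrm{mach}}$ such that

$$\operatorname{fl}(x) = x (1 + \varepsilon).$$

Note that if $x \neq 0$, then the previous identity is satisfied for $\varepsilon \overset{\mathrm{def}}{=} (\operatorname{fl}(x) - x)/x$. The bound $|\varepsilon| \le \varepsilon_{\mathrm{mach}}$ follows from (2).
For $x, y \in \mathbb{R}$ we have $\varepsilon_1, \varepsilon_2$ with $|\varepsilon_1|, |\varepsilon_2| \le \varepsilon_{\mathrm{mach}}$ such that

$$\operatorname{fl}(x) = x (1 + \varepsilon_1), \qquad \operatorname{fl}(y) = y (1 + \varepsilon_2).$$

Moreover $\operatorname{fl}(\operatorname{fl}(x) - \operatorname{fl}(y)) = (\operatorname{fl}(x) - \operatorname{fl}(y))(1 + \varepsilon_3)$, with $|\varepsilon_3| \le \varepsilon_{\mathrm{mach}}$.
Thus,

$$\begin{aligned} \operatorname{fl}(\operatorname{fl}(x) - \operatorname{fl}(y)) &= (\operatorname{fl}(x) - \operatorname{fl}(y))(1 + \varepsilon_3) = [x (1 + \varepsilon_1) - y (1 + \varepsilon_2)](1 + \varepsilon_3) \\ &= (x - y)(1 + \varepsilon_3) + (x \varepsilon_1 - y \varepsilon_2)(1 + \varepsilon_3) \end{aligned}$$

and, if $x - y \neq 0$, then the relative error is given by

$$\frac{|\operatorname{fl}(\operatorname{fl}(x) - \operatorname{fl}(y)) - (x - y)|}{|x - y|} = \left| \varepsilon_3 + \frac{x \varepsilon_1 - y \varepsilon_2}{x - y} (1 + \varepsilon_3) \right|. \tag{3}$$

If $\varepsilon_1 \varepsilon_2 \neq 0$ and $x - y$ is small, the quantity on the right-hand side could be $\gg \varepsilon_{\mathrm{mach}}$.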
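Identity (3) can be checked on the numbers of the preceding subtraction example (β = 10, m = 4). The sketch below is an illustration, not part of the slides; the rounded values 2552 and 2551 are entered by hand, and exact rational arithmetic is used so both sides of (3) can be compared for equality.

```python
from fractions import Fraction

x, y = Fraction("2551.5052"), Fraction("2551.4911")
fl_x, fl_y = Fraction(2552), Fraction(2551)  # x and y rounded to 4 digits
fl_diff = Fraction(1)                        # fl(fl(x) - fl(y)); the difference 1 needs no rounding

eps1 = (fl_x - x) / x
eps2 = (fl_y - y) / y
eps3 = (fl_diff - (fl_x - fl_y)) / (fl_x - fl_y)

lhs = abs(fl_diff - (x - y)) / abs(x - y)
rhs = abs(eps3 + (x * eps1 - y * eps2) / (x - y) * (1 + eps3))

print(float(eps1), float(eps2), float(eps3))
print(float(lhs), lhs == rhs)
```

Each of the three individual errors is below the unit roundoff 5·10^-4, yet the relative error of the difference is about 70, because the small errors are divided by the small quantity x − y.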
Floating Point Arithmetic: Cancellation (cont.)

Similar analysis can be carried out for +, −, ∗, /.

Catastrophic cancellation can only occur with +, −.
Catastrophic cancellation can only occur if one subtracts two numbers which are not both in floating point format and which have the same sign and are of approximately the same size, see (3), or if one adds two numbers which are not both in floating point format, which have opposite signs, and whose absolute values are of approximately the same size.


Floating Point Arithmetic: Cancellation Example 1
Evaluation of 1 − cos(x) near x = 0. (All computations were done using single precision Fortran.)

x                1 − cos(x)
0.500000         0.122417E+00
0.125000         0.780231E-02
0.312500E-01     0.488222E-03
0.781250E-02     0.305176E-04
0.195312E-02     0.190735E-05
0.488281E-03     0.119209E-06
0.122070E-03     0.
0.305176E-04     0.
0.762939E-05     0.
0.190735E-05     0.

Since cos(0) = 1 we expect catastrophic cancellation. If x = 0.122070E-03, then

$$\begin{aligned} 1 - \cos(x) &\approx 1.0000000000 - 0.99999999254\ldots \\ &= 0.00000000745\ldots = 7.45054\ldots\mathrm{e}{-09}, \\ 1 - \operatorname{fl}(\cos(x)) &\approx 1.000000 - \operatorname{fl}(\underbrace{9.999999}_{7\ \text{digits}}9254\ldots \cdot 10^{-1}) \\ &= 1.000000 - 1.000000 = 0. \end{aligned}$$
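The effect can be imitated without Fortran by rounding intermediate results to IEEE single precision. The following is a sketch under stated assumptions: `to_single` is a hypothetical helper that rounds a Python float (a double) to the nearest single via the standard `struct` module, and the cosine is computed in double precision before rounding, so individual digits may differ from those of a native single precision cosine.

```python
import math
import struct

def to_single(v):
    """Round a Python float (double) to the nearest IEEE single precision value."""
    return struct.unpack("f", struct.pack("f", v))[0]

def one_minus_cos_single(x):
    """Evaluate 1 - cos(x), rounding every intermediate result to single precision."""
    c = to_single(math.cos(to_single(x)))
    return to_single(1.0 - c)

for x in (0.5, 0.125, 0.122070e-3):
    print(x, one_minus_cos_single(x))
```

For x = 0.122070E-03 the cosine rounds to exactly 1 in single precision, so the computed difference is 0 although the true value is about 7.45e-09.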
Floating Point Arithmetic: Cancellation Example 1 (cont.)
Two alternatives for small $|x|$.
Since $\cos^{2}(x) + \sin^{2}(x) = 1$ it holds that

$$1 - \cos(x) = \sin^{2}(x) / (1 + \cos(x)).$$

The formula $\sin^{2}(x)/(1 + \cos(x))$ avoids the subtraction of two numbers that are not in floating point format and are almost the same (recall that we consider the case $|x|$ small).

The Leibniz criterion says that if the series $S = \sum_{i=1}^{\infty} (-1)^{i} c_i$, with $c_i \ge 0$ decreasing monotonically to zero, converges, then

$$\left| S - \sum_{i=1}^{n} (-1)^{i} c_i \right| \le c_{n+1}.$$

If we apply this to the Taylor expansion of cos(x),

$$\cos(x) = 1 - \frac{x^{2}}{2} + \frac{x^{4}}{4!} - \frac{x^{6}}{6!} + \frac{x^{8}}{8!} \mp \ldots,$$

then

$$\left| \cos(x) - \left( 1 - \frac{x^{2}}{2} + \frac{x^{4}}{4!} - \frac{x^{6}}{6!} \right) \right| \le \frac{x^{8}}{8!}.$$

After some rearrangements we can use the approximation

$$1 - \cos(x) \approx \frac{x^{2}}{2} \left( 1 - \frac{x^{2}}{12} + \frac{x^{4}}{360} \right)$$

and we know that the difference is at most $x^{8}/(8!)$, which allows us to control the error of the approximation.
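The three ways of evaluating 1 − cos(x) can be compared side by side. The sketch below is an illustration that works in double precision (Python floats) rather than the single precision used on the slides, so cancellation sets in at smaller x; the function names are chosen here for illustration.

```python
import math

def naive(x):
    """Direct evaluation; suffers from cancellation for small |x|."""
    return 1.0 - math.cos(x)

def via_sin(x):
    """Rewritten form sin^2(x) / (1 + cos(x)); no subtraction of nearby numbers."""
    return math.sin(x) ** 2 / (1.0 + math.cos(x))

def via_taylor(x):
    """Truncated Taylor series (x^2 / 2) * (1 - x^2 / 12 + x^4 / 360)."""
    x2 = x * x
    return x2 / 2.0 * (1.0 - x2 / 12.0 + x2 * x2 / 360.0)

for x in (0.5, 1e-3, 1e-9):
    print(x, naive(x), via_sin(x), via_taylor(x))
```

At x = 1e-9 the direct formula returns 0, while both alternatives return about 5e-19, which agrees with the leading term x²/2. The Taylor form is only meant for small |x|; at x = 0.5 its truncation error is bounded by x⁸/8!, roughly 1e-7.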
Floating Point Arithmetic: Cancellation Example 1 (cont.)

x                1 − cos(x)       sin²(x)/(1 + cos(x))   Taylor

0.500000         0.122417         0.122417               0.122418
0.125000         0.780231E-02     0.780233E-02           0.780233E-02
0.312500E-01     0.488222E-03     0.488241E-03           0.488242E-03
0.781250E-02     0.305176E-04     0.305174E-04           0.305174E-04
0.195312E-02     0.190735E-05     0.190735E-05           0.190735E-05
0.488281E-03     0.119209E-06     0.119209E-06           0.119209E-06
0.122070E-03     0.               0.745058E-08           0.745058E-08
0.305176E-04     0.               0.465661E-09           0.465661E-09
0.762939E-05     0.               0.291038E-10           0.291038E-10
0.190735E-05     0.               0.181899E-11           0.181899E-11
0.476837E-06     0.               0.113687E-12           0.113687E-12
0.119209E-06     0.               0.710543E-14           0.710543E-14
0.298023E-07     0.               0.444089E-15           0.444089E-15

Computations were performed using single precision Fortran.


Summary

Introduced how numbers are represented on a computer.

Only a small set of numbers can be represented on the computer.
The relative error between $x \neq 0$ and its nearest floating point number fl(x) satisfies

$$\frac{|\operatorname{fl}(x) - x|}{|x|} \le \varepsilon_{\mathrm{mach}} \overset{\mathrm{def}}{=} \frac{1}{2} \beta^{1-m}.$$
Introduced basic properties of floating point arithmetic.
Catastrophic cancellation can occur if one subtracts [adds] two
numbers which are not both in floating point format and which have
the same [opposite] sign and [their absolute values] are of
approximately the same size.


Additional Reading

[G91] David Goldberg. What every computer scientist should know about floating-point arithmetic. ACM Comput. Surv., Vol. 23 (1), 1991, pp. 5-48.
http://docs.sun.com/source/806-3568/ncg_goldberg.html
[O01] Michael L. Overton. Numerical Computing with IEEE Floating Point Arithmetic. SIAM, Philadelphia, 2001.
[SUN] Sun Microsystems. Numerical Computation Guide.
http://docs.sun.com/source/806-3568/
