
FLOATING POINT ARITHMETIC - ERROR ANALYSIS

• Brief review of floating point arithmetic

• Model of floating point arithmetic

• Notation, backward and forward errors

Roundoff errors and floating-point arithmetic

• The basic problem: The set A of all representable numbers on a given machine is finite, but we would like to use this set to perform the standard arithmetic operations (+, ∗, −, /) on the infinite set of real numbers. The usual algebra rules are no longer satisfied, since the results of operations are rounded.

• Basic algebra breaks down in floating point arithmetic.


Example: In floating point arithmetic,

a + (b + c) ≠ (a + b) + c

- Matlab experiment: For 10,000 random triples of numbers, count the number of instances in which the above equality holds. Do the same for multiplication.
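The experiment can be sketched as follows (in Python rather than Matlab; the seed and the trial count are arbitrary choices):

```python
import random

random.seed(0)                      # reproducible trials
TRIALS = 10_000

add_ok = mul_ok = 0
for _ in range(TRIALS):
    a, b, c = (random.random() for _ in range(3))
    add_ok += (a + (b + c)) == ((a + b) + c)
    mul_ok += (a * (b * c)) == ((a * b) * c)

print(f"addition associative:       {add_ok}/{TRIALS}")
print(f"multiplication associative: {mul_ok}/{TRIALS}")
```

Both counts come out strictly between 0 and 10,000: associativity holds in many instances but fails in a substantial fraction of them.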
3-2 TB: 13-15; GvL 2.7; Ort 9.2; AB: 1.4.1–.2 – Float
Floating point representation:

Real numbers are represented in two parts: a mantissa (significand) and an exponent. If the representation is in base β then:

x = ±(.d1d2 · · · dt)β × β^e

• .d1d2 · · · dt is a fraction in the base-β representation (generally the form is normalized in that d1 ≠ 0), and e is an integer.

• It is often more convenient to rewrite the above as:

x = ±(m/β^t) × β^e ≡ ±m × β^(e−t)

• The mantissa m is an integer with 0 ≤ m ≤ β^t − 1.

Machine precision - machine epsilon

• Notation: fl(x) = closest floating point representation of the real number x ('rounding').

• When a number x is very small, there is a point at which 1 + x == 1 in the machine sense: the computer no longer distinguishes between 1 and 1 + x.

Machine epsilon: The smallest number ε such that 1 + ε is a float different from one is called machine epsilon. Denoted by macheps or eps, it represents the distance from 1 to the next larger floating point number.

• With the previous representation, eps equals β^(−(t−1)).

Example: In IEEE standard double precision, β = 2 and t = 53 (this includes the 'hidden bit'). Therefore eps = 2^(−52).

Unit round-off: A real number x can be approximated by a floating point number fl(x) with relative error no larger than u = (1/2) β^(−(t−1)).

• u is called the unit round-off.

• In fact one can easily show:

fl(x) = x(1 + δ) with |δ| < u

- Matlab experiment: find the machine epsilon on your computer.
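One way to sketch this experiment (in Python rather than Matlab): repeatedly halve a candidate until adding half of it to 1 no longer changes the result.

```python
import sys

# Halve eps until 1 + eps/2 is no longer distinguishable from 1;
# the final eps is then the distance from 1 to the next larger float.
eps = 1.0
while 1.0 + eps / 2 > 1.0:
    eps /= 2

print(eps)                            # 2.220446049250313e-16, i.e. 2**-52
print(eps == sys.float_info.epsilon)  # True
```

On IEEE double precision hardware this agrees with β^(−(t−1)) = 2^(−52).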

• There have been many discussions on what conditions/rules should be satisfied by floating point arithmetic. The IEEE standard is a set of standards adopted by many CPU manufacturers.

Rule 1.

fl(x) = x(1 + ε), where |ε| ≤ u

Rule 2. For all operations ⊙ (one of +, −, ∗, /):

fl(x ⊙ y) = (x ⊙ y)(1 + ε⊙), where |ε⊙| ≤ u

Rule 3. For the +, ∗ operations:

fl(a ⊙ b) = fl(b ⊙ a)

- Matlab experiment: Verify Rule 3 experimentally with 10,000 randomly generated pairs of numbers ai, bi.
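A sketch of this experiment in Python (the range of the random numbers is an arbitrary choice). Since IEEE arithmetic rounds the exact result, and x + y and y + x have the same exact result, Rule 3 should hold in every instance:

```python
import random

random.seed(1)
pairs = [(random.uniform(-1e6, 1e6), random.uniform(-1e6, 1e6))
         for _ in range(10_000)]

add_commutes = all(a + b == b + a for a, b in pairs)
mul_commutes = all(a * b == b * a for a, b in pairs)
print(add_commutes, mul_commutes)   # True True
```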

Example: Consider the sum of 3 numbers: y = a + b + c.
• Done as fl(fl(a + b) + c):

η = fl(a + b) = (a + b)(1 + ε₁)
y₁ = fl(η + c) = (η + c)(1 + ε₂)
   = [(a + b)(1 + ε₁) + c](1 + ε₂)
   = [(a + b + c) + (a + b)ε₁](1 + ε₂)
   = (a + b + c) [1 + (a + b)/(a + b + c) · ε₁(1 + ε₂) + ε₂]

So, disregarding the high order term ε₁ε₂:

fl(fl(a + b) + c) = (a + b + c)(1 + ε₃),   ε₃ ≈ (a + b)/(a + b + c) · ε₁ + ε₂
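The amplification factor (a + b)/(a + b + c) in ε₃ can be made visible with a hypothetical extreme choice of inputs, where the intermediate sum a + b is enormous compared to the exact sum y:

```python
# Exact sum y = 1.0, but a + b = 1e16 + 1, so the factor (a+b)/y is ~1e16.
a, b, c = 1.0, 1e16, -1e16

left = (a + b) + c     # fl(fl(a+b)+c): the 1.0 is absorbed when added to 1e16
right = a + (b + c)    # fl(a+fl(b+c)): b + c cancels exactly, then 1.0 is added

print(left, right)     # 0.0 1.0
```

Each individual operation is correctly rounded, yet the first ordering loses all accuracy while the second is exact.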

• If we redid the computation as y₂ = fl(a + fl(b + c)) we would find:

fl(a + fl(b + c)) = (a + b + c)(1 + ε₄),   ε₄ ≈ (b + c)/(a + b + c) · ε₁ + ε₂

• The error is amplified by the factor (a + b)/y in the first case and (b + c)/y in the second case.
• In order to sum n numbers accurately, it is better to start with the small numbers first. [However, sorting before adding is usually not worth the cost.]
• But watch out if the numbers have mixed signs!
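The effect of the summation order can be sketched with the harmonic series as an illustrative example (the choice of series and of n are arbitrary; math.fsum serves as a high-accuracy reference):

```python
import math

n = 1_000_000
terms = [1.0 / k for k in range(1, n + 1)]     # harmonic series terms

exact = math.fsum(terms)                       # correctly rounded reference
large_first = sum(terms)                       # 1 + 1/2 + ... + 1/n
small_first = sum(reversed(terms))             # 1/n + ... + 1/2 + 1

print(abs(large_first - exact))                # error, large terms first
print(abs(small_first - exact))                # error, small terms first: typically smaller
```

Adding the small terms first lets them accumulate before they are absorbed into a large partial sum, which is why that ordering is typically more accurate.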

The absolute value notation

• For a given vector x, |x| is the vector with components |xᵢ|, i.e., |x| is the component-wise absolute value of x.
• Similarly for matrices:

|A| = {|aᵢⱼ|}, i = 1, . . . , m; j = 1, . . . , n

• An obvious result: the basic inequality

|fl(aᵢⱼ) − aᵢⱼ| ≤ u |aᵢⱼ|

translates into

fl(A) = A + E with |E| ≤ u |A|

• A ≤ B means aᵢⱼ ≤ bᵢⱼ for all 1 ≤ i ≤ m, 1 ≤ j ≤ n.


Error Analysis: Inner product

• Inner products appear in the innermost parts of many calculations, so their analysis is important.

Lemma: If |δᵢ| ≤ u and nu < 1 then

Πᵢ₌₁ⁿ (1 + δᵢ) = 1 + θₙ where |θₙ| ≤ nu/(1 − nu)

• Common notation: γₙ ≡ nu/(1 − nu)

- Prove the lemma [Hint: use induction]

• One can use the following simpler result:

Lemma: If |δᵢ| ≤ u and nu < .01 then

Πᵢ₌₁ⁿ (1 + δᵢ) = 1 + θₙ where |θₙ| ≤ 1.01 nu

Example: The previous sum of numbers can be written

fl(a + b + c) = a(1 + ε₁)(1 + ε₂) + b(1 + ε₁)(1 + ε₂) + c(1 + ε₂)
             = a(1 + θ₁) + b(1 + θ₂) + c(1 + θ₃)
             = exact sum of slightly perturbed inputs,

where all the θᵢ's satisfy |θᵢ| ≤ 1.01 nu (here n = 2).
• Alternatively, one can write the 'forward' bound:

|fl(a + b + c) − (a + b + c)| ≤ |aθ₁| + |bθ₂| + |cθ₃|.
Backward and forward errors

• Assume the approximation ŷ to y = alg(x) is computed by some algorithm with arithmetic precision ε. One possible analysis: find an upper bound for the forward error

|Δy| = |y − ŷ|

• This is not always easy.

Alternative question: find the equivalent perturbation of the initial data (x) that produces the result ŷ. In other words, find Δx so that:

alg(x + Δx) = ŷ

• The value of |Δx| is called the backward error. An analysis to find an upper bound for |Δx| is called backward error analysis.
Example:

A = [ a  b ]     B = [ d  e ]
    [ 0  c ]         [ 0  f ]

Consider the product:

fl(A·B) = [ (ad)(1 + ε₁)   [ae(1 + ε₂) + bf(1 + ε₃)](1 + ε₄) ]
          [      0                    cf(1 + ε₅)             ]

with |εᵢ| ≤ u for i = 1, . . . , 5. The result can be written as:

[ a   b(1 + ε₃)(1 + ε₄) ]  [ d(1 + ε₁)   e(1 + ε₂)(1 + ε₄) ]
[ 0   c(1 + ε₅)         ]  [     0               f         ]

• So fl(A·B) = (A + E_A)(B + E_B).
• The backward errors E_A, E_B satisfy:

|E_A| ≤ 2u |A| + O(u²) ;   |E_B| ≤ 2u |B| + O(u²)
• When solving Ax = b by Gaussian elimination, we will see that a bound on ‖eₓ‖ such that the following holds exactly:

A(x_computed + eₓ) = b

is much harder to find than bounds on ‖E_A‖, ‖e_b‖ such that this holds exactly:

(A + E_A) x_computed = b + e_b.

Note: In many instances backward errors are more meaningful than forward errors: if the initial data is accurate only to, say, 4 digits, then an algorithm for computing x need not guarantee a backward error of less than 10⁻¹⁰, for example. A backward error of order 10⁻⁴ is acceptable.

Main result on inner products:

• Backward error expression:

fl(xᵀy) = [x .∗ (1 + dₓ)]ᵀ [y .∗ (1 + d_y)]

where ‖d_∗‖∞ ≤ 1.01 nu for ∗ = x, y.

• One can show that the equality is valid even if one of dₓ, d_y is absent.

• Forward error expression: |fl(xᵀy) − xᵀy| ≤ γₙ |x|ᵀ|y|, with 0 ≤ γₙ ≤ 1.01 nu.

• Here |x| is the elementwise absolute value and .∗ denotes elementwise multiplication.

• The above assumes nu ≤ .01. For u = 2.0 × 10⁻¹⁶, this holds for n ≤ 4.5 × 10¹³.
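The forward bound can be checked numerically (a Python sketch; the seed, the length n, and the uniform distribution are arbitrary choices, and math.fsum serves as a high-accuracy reference for the exact inner product):

```python
import math
import random

random.seed(2)
n = 1000
u = 2.0 ** -53                        # unit roundoff, IEEE double precision

x = [random.uniform(-1, 1) for _ in range(n)]
y = [random.uniform(-1, 1) for _ in range(n)]

s = 0.0                               # plain recursive summation: fl(x^T y)
for xi, yi in zip(x, y):
    s += xi * yi

ref = math.fsum(xi * yi for xi, yi in zip(x, y))          # reference value
gamma = n * u / (1 - n * u)                               # gamma_n
bound = gamma * math.fsum(abs(xi) * abs(yi) for xi, yi in zip(x, y))

print(abs(s - ref) <= bound)          # True: the forward bound holds
```

The observed error is typically far below the bound, which is a worst-case estimate.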
• Consequence of the lemma:

|fl(A ∗ B) − A ∗ B| ≤ γₙ |A| ∗ |B|

• Another (less precise) way to write the result is:

|fl(xᵀy) − xᵀy| ≤ n u |x|ᵀ|y| + O(u²)

- Assume you use single precision, for which u = 2 × 10⁻⁶. What is the largest n for which nu ≤ 0.01 holds? Any conclusions for the use of single precision arithmetic?
- What does the main result on inner products imply for the case y = x? [Contrast the relative accuracy you get in this case with the general case y ≠ x.]

- Show that for any x, y there exist Δx, Δy such that

fl(xᵀy) = (x + Δx)ᵀy, with |Δx| ≤ γₙ|x|
fl(xᵀy) = xᵀ(y + Δy), with |Δy| ≤ γₙ|y|

- (Continuation) Let A be an m × n matrix, x an n-vector, and y = Ax. Show that there exists a matrix ΔA such that

fl(y) = (A + ΔA)x, with |ΔA| ≤ γₙ|A|

- (Continuation) From the above, derive a result about a column of the product of two matrices A and B. Does a similar result hold for the product AB as a whole?

Supplemental notes: Floating Point Arithmetic

In most computing systems, real numbers are represented in two parts: a mantissa and an exponent. If the representation is in base β then:

x = ±(.d1d2 · · · dm)β × β^e

• .d1d2 · · · dm is a fraction in the base-β representation.
• e is an integer; it can be negative, positive or zero.
• Generally the form is normalized in that d1 ≠ 0.

3-19 TB: 13; GvL 2.7; Ort 9.2; AB: 1.4.1–. – FloatSuppl

Example: In base 10 (for illustration):

1. 1000.12345 can be written as 0.100012345 × 10⁴

2. 0.000812345 can be written as 0.812345 × 10⁻³

• The problem with floating point arithmetic: we have to live with limited precision.

Example: Assume that we have only 5 digits of accuracy in the mantissa and 2 digits for the exponent (excluding the sign):

.d1 d2 d3 d4 d5 | e1 e2
Try to add 1000.2 = .10002e+04 and 1.07 = .10700e+01:

1000.2 = .1 0 0 0 2 | 0 4 ;   1.07 = .1 0 7 0 0 | 0 1

First task: align the decimal points. The number with the smallest exponent is (internally) rewritten so its exponent matches the largest one:

1.07 = 0.000107 × 10⁴

Second task: add the mantissas:

  0. 1 0 0 0 2
+ 0. 0 0 0 1 0 7
= 0. 1 0 0 1 2 7

Third task: round the result. The result has 6 digits but we can use only 5, so we can:
• Chop the result: .1 0 0 1 2 ;
• Round the result: .1 0 0 1 3 ;
Fourth task: normalize the result if needed (not needed here).
Result with rounding: .1 0 0 1 3 | 0 4 ;
- Redo the same thing with 7000.2 + 4000.3 or 6999.2 + 4000.3.
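Python's decimal module can emulate this 5-digit decimal arithmetic (a sketch; fl5 and ctx are hypothetical helper names, and round-half-even is chosen to match the round-to-even tie rule used later in these slides):

```python
from decimal import Context, ROUND_HALF_EVEN

ctx = Context(prec=5, rounding=ROUND_HALF_EVEN)   # 5 significant digits

def fl5(value):
    """Round a decimal string to 5 significant digits."""
    return ctx.create_decimal(value)

# The worked example: 1000.2 + 1.07 -> exact 1001.27 -> rounded .10013e+04
s = ctx.add(fl5("1000.2"), fl5("1.07"))
print(s)                                          # 1001.3

# The two exercises from the slide:
print(ctx.add(fl5("7000.2"), fl5("4000.3")))
print(ctx.add(fl5("6999.2"), fl5("4000.3")))
```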

Some More Examples

• Each operation fl(x ⊙ y) proceeds in 4 steps:

1. Line up the exponents (for addition & subtraction).
2. Compute a temporary exact answer.
3. Normalize the temporary result.
4. Round to the nearest representable number (round-to-even in case of a tie).

             .40015e+02     .40010e+02        .41015e-98
           + .60010e+02   + .50001e-04      − .41010e-98
temporary   1.00025e+02     .4001050001e+02   .00005e-98
normalize    .100025e+03    .400105⊕e+02      .00050e-99
round        .10002e+03     .40011e+02        .00050e-99

note: first case: round to even (exactly halfway between two values); second case: round to nearest (⊕ = remaining digits not all 0, so the result is closer to the upper value); third case: too small, the result is left unnormalized since the exponent is at its minimum.
The IEEE standard

32 bit (Single precision):

[ sign (±) | exponent (8 bits) | mantissa (23 bits) ]

• In binary, the leading one of the mantissa does not need to be represented: one bit is gained. This is the 'hidden bit'.
• Largest exponent: 2⁷ − 1 = 127; smallest: −126. ['bias' of 127]

64 bit (Double precision):

[ sign (±) | exponent (11 bits) | mantissa (52 bits) ]

• Bias of 1023: if c is the content of the exponent field, the actual scale factor is 2^(c−1023).
• e + bias = 2047 (all ones) is reserved for special use.
• Largest exponent: 1023; smallest: −1022.
• Including the hidden bit, the mantissa has a total of 53 bits (52 bits represented, one hidden).

- Take the number 1.0 and see what happens if you add 1/2, 1/4, . . . , 2⁻ⁱ. Do not forget the hidden bit!

         Hidden bit (not represented)
Expon.   ↓   ← 52 bits →
e        1   1 0 0 0 · · · 0 0 0
e        1   0 1 0 0 · · · 0 0 0
e        1   0 0 1 0 · · · 0 0 0
         .......
e        1   0 0 0 0 · · · 0 0 1
e        1   0 0 0 0 · · · 0 0 0

(Note: the 'e' part has 12 bits and includes the sign.)

• Conclusion:

fl(1 + 2⁻⁵²) ≠ 1 but: fl(1 + 2⁻⁵³) == 1 !!
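The conclusion is easy to check directly in Python:

```python
import sys

# 2**-52 is one ulp at 1.0: adding it changes the last stored mantissa bit.
print(1.0 + 2.0 ** -52 == 1.0)                # False
# 2**-53 falls exactly halfway; round-to-even sends the sum back to 1.0.
print(1.0 + 2.0 ** -53 == 1.0)                # True

print(sys.float_info.epsilon == 2.0 ** -52)   # True
```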
Special Values

• Exponent field = 00000000000 (smallest possible value): there is no hidden bit. All bits == 0 means exactly zero.
• This allows for unnormalized numbers, leading to gradual underflow.
• Exponent field = 11111111111 (largest possible value): the number represented is "Inf", "-Inf" or "NaN".
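These special values can be observed in Python (5e-324 is the smallest subnormal double, an assumption valid on IEEE 754 hardware):

```python
import math
import sys

inf = float("inf")                        # exponent field all ones, mantissa 0
nan = float("nan")                        # exponent field all ones, mantissa != 0

print(inf > sys.float_info.max)           # True: Inf exceeds every finite double
print(nan == nan)                         # False: NaN compares unequal to itself

tiny = 5e-324                             # smallest subnormal (exponent field all zeros)
print(0.0 < tiny < sys.float_info.min)    # True: gradual underflow region
print(tiny / 2)                           # 0.0: below the subnormals, underflow to zero
```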

Appendix to set 3: Analysis of inner products

Consider sₙ = fl(x₁ ∗ y₁ + x₂ ∗ y₂ + · · · + xₙ ∗ yₙ)

• In what follows, the ηᵢ's come from ∗ and the εᵢ's come from +.
• They satisfy |ηᵢ| ≤ u and |εᵢ| ≤ u.
• The inner product sₙ is computed as:

1. s₁ = fl(x₁y₁) = (x₁y₁)(1 + η₁)
2. s₂ = fl(s₁ + fl(x₂y₂)) = fl(s₁ + x₂y₂(1 + η₂))
      = (x₁y₁(1 + η₁) + x₂y₂(1 + η₂))(1 + ε₂)
      = x₁y₁(1 + η₁)(1 + ε₂) + x₂y₂(1 + η₂)(1 + ε₂)
3. s₃ = fl(s₂ + fl(x₃y₃)) = fl(s₂ + x₃y₃(1 + η₃))
      = (s₂ + x₃y₃(1 + η₃))(1 + ε₃)
3-28 TB: 13-15; GvL 2.7; Ort 9.2; AB: 1.4.1–.2 – Float2

Expanding: s₃ = x₁y₁(1 + η₁)(1 + ε₂)(1 + ε₃)
             + x₂y₂(1 + η₂)(1 + ε₂)(1 + ε₃)
             + x₃y₃(1 + η₃)(1 + ε₃)

• Induction shows that [with the convention that ε₁ ≡ 0]:

sₙ = Σᵢ₌₁ⁿ xᵢyᵢ(1 + ηᵢ) Πⱼ₌ᵢⁿ (1 + εⱼ)

Q: How many factors appear in the coefficient of xᵢyᵢ?
A: • When i > 1: 1 + (n − i + 1) = n − i + 2
   • When i = 1: n (since ε₁ = 0 does not count)
• Bottom line: always ≤ n.
• For each of these products:

(1 + ηᵢ) Πⱼ₌ᵢⁿ (1 + εⱼ) = 1 + θᵢ, with |θᵢ| ≤ γₙ, so:

sₙ = Σᵢ₌₁ⁿ xᵢyᵢ(1 + θᵢ) with |θᵢ| ≤ γₙ, or:

fl(Σᵢ₌₁ⁿ xᵢyᵢ) = Σᵢ₌₁ⁿ xᵢyᵢ + Σᵢ₌₁ⁿ xᵢyᵢθᵢ with |θᵢ| ≤ γₙ

• This leads to the final result (forward form):

|fl(Σᵢ₌₁ⁿ xᵢyᵢ) − Σᵢ₌₁ⁿ xᵢyᵢ| ≤ γₙ Σᵢ₌₁ⁿ |xᵢ||yᵢ|

• or (backward form):

fl(Σᵢ₌₁ⁿ xᵢyᵢ) = Σᵢ₌₁ⁿ xᵢyᵢ(1 + θᵢ) with |θᵢ| ≤ γₙ
