MA571 Numerical Linear Algebra

Lectures 5&6: Floating-point system

Rafikul Alam
Department of Mathematics
IIT Guwahati

R. Alam, IIT Guwahati (January-May 2025) MA571 NLA


Lecture outline

• Floating-point system
• Machine precision and unit roundoff
• Rounding errors
• Floating-point arithmetic


Anomalous arithmetic

Arithmetic operations on computers are inexact and can produce surprising results. For example, MATLAB produces

$$\left(\frac{4}{3} - 1\right) \times 3 - 1 = -2.2204 \times 10^{-16},$$

$$5 \times \frac{(1 + \exp(-50)) - 1}{(1 + \exp(-50)) - 1} = \mathrm{NaN},$$

$$\frac{\log(\exp(750))}{100} = \mathrm{Inf}.$$
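The three anomalies above can be reproduced in any IEEE double precision environment. The following is a minimal sketch in Python (an assumption; the slides use MATLAB). Python's `math` module raises exceptions where MATLAB returns NaN or Inf, so the sketch only inspects the intermediate quantities that cause the anomalies.

```python
import math

# 4/3 is not representable in binary, so its rounding error resurfaces here.
a = (4 / 3 - 1) * 3 - 1
print(a)                      # equals -2**-52 in IEEE double precision

# exp(-50) is far smaller than the gap between 1 and the next double,
# so adding it to 1 changes nothing and the difference is exactly zero.
num = (1 + math.exp(-50)) - 1
print(num == 0.0)             # the quotient 0/0 is what MATLAB reports as NaN

# exp(750) exceeds the largest double; MATLAB returns Inf,
# whereas Python's math.exp signals the overflow with an exception.
try:
    math.exp(750)
    overflowed = False
except OverflowError:
    overflowed = True
print(overflowed)
```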


Example 2
Let $p(x) = x^7 - 7x^6 + 21x^5 - 35x^4 + 35x^3 - 21x^2 + 7x - 1 = (x - 1)^7$.

[Figure: plot of p(x)]

Does the plot look like a polynomial? What explains this behaviour?
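The behaviour can be explored numerically without the plot. The sketch below (Python; the grid of 201 points within $10^{-3}$ of $x = 1$ is an arbitrary choice, not taken from the slides) evaluates both forms of $p$ and counts how often the expanded form changes sign, which the exact polynomial does only once.

```python
def p_expanded(x):
    # Sum of large terms of alternating sign: heavy cancellation near x = 1.
    return x**7 - 7*x**6 + 21*x**5 - 35*x**4 + 35*x**3 - 21*x**2 + 7*x - 1

def p_factored(x):
    # Mathematically the same polynomial, with no cancellation between terms.
    return (x - 1)**7

xs = [1 + k * 1e-5 for k in range(-100, 101)]
expanded = [p_expanded(x) for x in xs]
factored = [p_factored(x) for x in xs]

sign_changes = sum(1 for v, w in zip(expanded, expanded[1:]) if v * w < 0)
worst = max(abs(e - f) for e, f in zip(expanded, factored))
print("sign changes in the expanded form:", sign_changes)
print("largest |expanded - factored|:", worst)
```

On this grid the true values satisfy $|p(x)| \le 10^{-20}$, while each term of the expanded form carries a rounding error of order $10^{-16}$, so the expanded values are dominated by rounding noise.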

Normalized floating-point representation

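Let $x \in \mathbb{R}$ be nonzero and $\beta > 1$ be an integer. Then we have the normalized floating-point representation

$$x = \pm(.d_1 d_2 \cdots d_t \cdots)_\beta \times \beta^e, \quad 0 \le d_i < \beta, \quad d_1 \ne 0, \quad e \in \mathbb{Z}.$$

Here $\beta$ is called the base, $e$ is called the exponent and $f := (.d_1 d_2 \cdots d_t \cdots)_\beta$ is called the mantissa or fraction and is given by

$$f = (.d_1 d_2 \cdots d_t \cdots)_\beta = \sum_{j=1}^{\infty} \frac{d_j}{\beta^j}.$$

Note that $0 < f \le 1$; since $d_1 \ne 0$, in fact $f \ge \beta^{-1}$. Excluding expansions that end in an infinite string of digits $\beta - 1$ (so that $f < 1$), $x = \pm f \times \beta^e$ is a unique representation. The number of digits allowed in $f$ is called the precision of the floating-point representation, which in the present case is infinite.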

Floating-Point System
Computers use a floating-point representation with base $\beta = 2$. The set of numbers represented on a computer is called a floating-point system.

A floating-point system $F(\beta, t, e_{\min}, e_{\max})$ is a finite set given by

$$\{\pm(.d_1 d_2 \cdots d_t)_\beta \times \beta^e : 0 \le d_i < \beta,\ d_1 \ne 0,\ e_{\min} \le e \le e_{\max}\} \cup \{0\}.$$

$\beta$ : base of the floating-point system
$t$ : precision (number of digits in the mantissa) of the floating-point system
$e_{\min}$ : lowest exponent of the floating-point system
$e_{\max}$ : highest exponent of the floating-point system

The fraction satisfies $0 < f := (.d_1 d_2 \cdots d_t)_\beta = \sum_{j=1}^{t} \frac{d_j}{\beta^j} \le 1 - \beta^{-t} < 1$.
Example: $(.101)_2 \times 2^2 = (1 \times 2^{-1} + 0 \times 2^{-2} + 1 \times 2^{-3}) \times 2^2 = 5/2$.
Note that $\#F(\beta, t, e_{\min}, e_{\max}) = 2(\beta - 1)\beta^{t-1} \times (e_{\max} - e_{\min} + 1) + 1$.
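The counting formula can be checked by brute force on a small system. The sketch below enumerates $F(2, 3, -1, 2)$ exactly with rational arithmetic; the parameter values and the helper name `toy_system` are illustrative choices, not part of the lecture.

```python
from fractions import Fraction
from itertools import product

def toy_system(beta, t, emin, emax):
    """Enumerate F(beta, t, emin, emax) exactly as a set of Fractions."""
    numbers = {Fraction(0)}
    for e in range(emin, emax + 1):
        for digits in product(range(beta), repeat=t):
            if digits[0] == 0:          # normalization requires d1 != 0
                continue
            f = sum(Fraction(d, beta ** (j + 1)) for j, d in enumerate(digits))
            x = f * Fraction(beta) ** e
            numbers.update({x, -x})
    return numbers

beta, t, emin, emax = 2, 3, -1, 2
F = toy_system(beta, t, emin, emax)
count = 2 * (beta - 1) * beta ** (t - 1) * (emax - emin + 1) + 1
positives = sorted(x for x in F if x > 0)
print(len(F), count)                    # both counts agree
print(positives[0], positives[-1])      # smallest and largest positive elements
```

Listing `positives` also makes the nonuniform spacing of the system visible: the elements cluster near zero and spread out as the exponent grows.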

Floating-Point System
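Underflow threshold (smallest positive floating-point number):
• $\mathrm{realmin} := (.10 \cdots 0)_\beta \times \beta^{e_{\min}} = \beta^{e_{\min} - 1}$
Overflow threshold (largest positive floating-point number):
• $\mathrm{realmax} := (.(\beta - 1) \cdots (\beta - 1))_\beta \times \beta^{e_{\max}} = (1 - \beta^{-t})\beta^{e_{\max}}$

Machine epsilon/precision: $\mathrm{eps} := \beta^{1-t}$.
• $\mathrm{eps} = (.10 \cdots 01)_\beta \times \beta^{1} - (.10 \cdots 0)_\beta \times \beta^{1} = \beta^{1-t}$, that is, eps is the distance from 1 to the next larger floating-point number.

For MATLAB (IEEE double precision) we have the following:

            Binary                          Decimal
eps         $2^{-52}$                       $2.2204 \times 10^{-16}$
realmin     $2^{-1022}$                     $2.2251 \times 10^{-308}$
realmax     $(2 - \mathrm{eps}) \times 2^{1023}$     $1.7977 \times 10^{308}$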
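The same three constants can be derived from the formulas above. The sketch below uses Python's `sys.float_info` (an assumption: Python on a platform with IEEE double precision floats). Its fields `mant_dig`, `min_exp` and `max_exp` follow the $(.d_1 \cdots d_t)$ normalization used here, so they play the roles of $t$, $e_{\min}$ and $e_{\max}$.

```python
import math
import sys

fi = sys.float_info
beta, t = fi.radix, fi.mant_dig          # 2 and 53 for IEEE double precision
emin, emax = fi.min_exp, fi.max_exp      # -1021 and 1024 in the (.d1...dt) convention

eps = float(beta) ** (1 - t)                            # beta^(1-t)
realmin = float(beta) ** (emin - 1)                     # beta^(emin-1)
# (1 - beta^-t) * beta^emax; ldexp scales by a power of 2 without overflowing
realmax = math.ldexp(1 - float(beta) ** (-t), emax)

print(eps == fi.epsilon, realmin == fi.min, realmax == fi.max)
print(eps, realmin, realmax)
```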

IEEE single precision floating-point representation
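IEEE specifies $F(2, t, e_{\min}, e_{\max})$ as follows.

sign (1 bit) | exponent (8 bits) | mantissa (23 bits)
Table: Single precision (32 bits)

Number of exponents: $2^8 = 256$. Largest exponent: $2^8 - 1 = 255$.

Exponent range: $0 \le b \le 255$ and $-127 \le e \le 128$, where

$$x = (-1)^s \times (1.d_1 \cdots d_{23})_2 \times 2^{b - 127} \quad \text{and} \quad e = b - 127.$$

Effective exponent range: $-126 \le e \le 127$, as $e = -127$ is reserved for $\pm 0$ (and subnormal numbers) and $e = 128$ for $\pm\mathrm{inf}$ and NaN.

$\mathrm{eps} = 2^{-23} \approx 1.192 \times 10^{-7}$, $\mathrm{realmin} = 2^{-126} \approx 1.175 \times 10^{-38}$, $\mathrm{realmax} = (2 - \mathrm{eps}) \times 2^{127} \approx 3.403 \times 10^{38}$.

Since the leading bit 1 is stored implicitly, the mantissa has 24 bits. In the $(.d_1 \cdots d_t)$ normalization used earlier, single precision corresponds to $t = 24$, $e_{\min} = -125$ and $e_{\max} = 128$.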
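The three bit fields can be inspected directly. The following sketch uses Python's `struct` module to pack a number as an IEEE single and split the resulting 32-bit word; the helper name `decode_single` and the test value $-6.5$ are illustrative choices.

```python
import struct

def decode_single(x):
    """Return (s, b, frac): sign bit, biased exponent, 23 fraction bits of x as a single."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    s = bits >> 31                # 1 sign bit
    b = (bits >> 23) & 0xFF       # 8 exponent bits (biased by 127)
    frac = bits & 0x7FFFFF        # 23 stored mantissa bits d1...d23
    return s, b, frac

s, b, frac = decode_single(-6.5)          # -6.5 = -(1.101)_2 x 2^2
value = (-1) ** s * (1 + frac / 2**23) * 2.0 ** (b - 127)
print(s, b, frac, value)
```

Here $b = 129$ gives $e = b - 127 = 2$, and the stored fraction bits $101\,0\cdots0$ reproduce the mantissa $(1.101)_2 = 1.625$.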

IEEE double precision floating-point representation
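IEEE specifies $F(2, t, e_{\min}, e_{\max})$ as follows.

sign (1 bit) | exponent (11 bits) | mantissa (52 bits)
Table: Double precision (64 bits)

Number of exponents: $2^{11} = 2048$. Largest exponent: $2^{11} - 1 = 2047$.

Exponent range: $0 \le b \le 2047$ and $-1023 \le e \le 1024$, where

$$x = (-1)^s \times (1.d_1 \cdots d_{52})_2 \times 2^{b - 1023} \quad \text{and} \quad e = b - 1023.$$

Effective exponent range: $-1022 \le e \le 1023$, as $e = -1023$ is reserved for $\pm 0$ (and subnormal numbers) and $e = 1024$ for $\pm\mathrm{inf}$ and NaN.

$\mathrm{eps} = 2^{-52} \approx 2.2204 \times 10^{-16}$, $\mathrm{realmin} = 2^{-1022} \approx 2.2251 \times 10^{-308}$, $\mathrm{realmax} = (2 - \mathrm{eps}) \times 2^{1023} \approx 1.7977 \times 10^{308}$.

Since the leading bit 1 is stored implicitly, the mantissa has 53 bits. In the $(.d_1 \cdots d_t)$ normalization used earlier, double precision corresponds to $t = 53$, $e_{\min} = -1021$ and $e_{\max} = 1024$.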
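The reserved exponent values can be seen by extracting the 11-bit field $b$ from a few special doubles. This is a small sketch in Python; the helper name `biased_exponent` is hypothetical.

```python
import math
import struct
import sys

def biased_exponent(x):
    """Return the 11-bit biased exponent b of x stored as an IEEE double."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    return (bits >> 52) & 0x7FF

cases = (("0", 0.0), ("realmin", sys.float_info.min), ("1", 1.0),
         ("realmax", sys.float_info.max), ("inf", math.inf), ("nan", math.nan))
for label, x in cases:
    b = biased_exponent(x)
    print(f"{label:8s} b = {b:4d}   e = b - 1023 = {b - 1023}")
```

The output shows $b = 0$ for zero, $b = 2047$ for inf and NaN, and $1 \le b \le 2046$ for realmin through realmax, which is the effective range $-1022 \le e \le 1023$.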

Gap between floating-point numbers
Let $x \in F(\beta, t, e_{\min}, e_{\max})$ be given by $x = (.d_1 d_2 \cdots d_t)_\beta \times \beta^e$.

Define $\mathrm{ulp}(x) := \beta^{e-t}$. Then $\mathrm{next}(x) := x + \mathrm{ulp}(x)$ is the next floating-point number larger than $x$, that is, $\mathrm{next}(x)$ is the smallest floating-point number larger than $x$. Hence $x$ and $\mathrm{next}(x)$ are consecutive floating-point numbers.

This shows that $\mathrm{next}(x) - x = \mathrm{ulp}(x) = \beta^{e-t}$. So, if $x$ is a large number then $e$ is large, which implies a large gap between $x$ and $\mathrm{next}(x)$. Hence the gap between consecutive floating-point numbers is nonuniform.

Note that $\mathrm{eps} = \mathrm{ulp}(1) = \beta^{1-t}$ and $\mathrm{next}(1) = 1 + \mathrm{eps}$.

Example: Consider $x \in F(10, 4, -30, 30)$ given by $x = 10^{-9} = .1000 \times 10^{-8}$. Then $\mathrm{ulp}(x) = 10^{-8-4} = 10^{-12}$ and $\mathrm{next}(x) - x = 10^{-12}$.
Next, consider $y = 10^{15} = .1000 \times 10^{16}$. Then $\mathrm{ulp}(y) = 10^{16-4} = 10^{12}$ and $\mathrm{next}(y) - y = 10^{12}$.
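The same computation can be carried out for IEEE doubles ($\beta = 2$, $t = 53$). In the sketch below, `math.frexp` returns $x = f \times 2^e$ with $1/2 \le f < 1$, which is exactly the normalization used above. The helper `ulp_slides` is a hypothetical name, and `math.nextafter` and `math.ulp` assume Python 3.9 or later.

```python
import math

def ulp_slides(x, t=53):
    """ulp(x) = 2^(e - t), where x = f * 2^e with 1/2 <= f < 1 (x > 0, normal)."""
    _, e = math.frexp(x)
    return math.ldexp(1.0, e - t)

for x in (1.0, 1e-9, 1e15):
    gap = math.nextafter(x, math.inf) - x      # next(x) - x, computed exactly
    print(x, ulp_slides(x), gap)
```

For $x = 1$ the gap is $\mathrm{eps} = 2^{-52}$, while for $x = 10^{15}$ it is already $2^{-3} = 0.125$.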
Relative gap between floating-point numbers
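Let $x \in F(\beta, t, e_{\min}, e_{\max})$ be given by $x = (.d_1 d_2 \cdots d_t)_\beta \times \beta^e$. Then $\mathrm{next}(x) = x + \mathrm{ulp}(x) = x + \beta^{e-t}$ implies that the relative gap satisfies

$$\frac{\mathrm{next}(x) - x}{x} = \frac{\beta^{e-t}}{(.d_1 \cdots d_t)_\beta \times \beta^e} \le \beta^{1-t} = \mathrm{eps},$$

since $(.d_1 \cdots d_t)_\beta \ge \beta^{-1}$. Hence the relative gap between two consecutive floating-point numbers is almost uniform: it is larger than $\beta^{-t}$ and at most eps.

Example: Consider $x \in F(10, 4, -30, 30)$ given by $x = 10^{-9} = .1000 \times 10^{-8}$. Then $\mathrm{ulp}(x) = 10^{-8-4} = 10^{-12}$ and $\mathrm{next}(x) - x = 10^{-12}$. Note, however, that

$$\frac{\mathrm{next}(x) - x}{x} = \frac{10^{-12}}{10^{-9}} = 10^{-3} = \mathrm{eps}.$$

Next, consider $y = 10^{15} = .1000 \times 10^{16}$. Then $\mathrm{ulp}(y) = 10^{16-4} = 10^{12}$ and $\mathrm{next}(y) - y = 10^{12}$. However, we have

$$\frac{\mathrm{next}(y) - y}{y} = \frac{10^{12}}{10^{15}} = 10^{-3} = \mathrm{eps}.$$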
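For IEEE doubles the bound reads $\mathrm{eps}/2 < (\mathrm{next}(x) - x)/x \le \mathrm{eps}$ with $\mathrm{eps} = 2^{-52}$. The sketch below checks this on a few arbitrarily chosen positive normal numbers spanning many orders of magnitude; `math.nextafter` assumes Python 3.9 or later.

```python
import math

eps = 2.0 ** -52
samples = [1.0, 1.5, 1e-9, 3.7e-200, 1e15, 6.02e23]   # arbitrary positive normal doubles

ratios = []
for x in samples:
    rel_gap = (math.nextafter(x, math.inf) - x) / x
    ratios.append(rel_gap / eps)          # relative gap measured in units of eps
    print(x, rel_gap, rel_gap / eps)
```

The absolute gaps of these samples differ by more than 200 orders of magnitude, yet every ratio lies between 0.5 and 1.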

Rounding
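A number $x \in \mathbb{R}$ can be represented by a number in $F(\beta, t, e_{\min}, e_{\max})$ provided $\mathrm{realmin} \le |x| \le \mathrm{realmax}$.

Let $\mathrm{fl}(x)$ denote the floating-point representation of $x$, that is,

$$\mathrm{fl} : \mathbb{R} \to F(\beta, t, e_{\min}, e_{\max}), \quad x \mapsto \mathrm{fl}(x).$$

How to define $\mathrm{fl}(x)$?
Let $x = (.d_1 d_2 \cdots d_t \cdots)_\beta \times \beta^e$. Set $x_L := (.d_1 d_2 \cdots d_t)_\beta \times \beta^e$ and $x_R := x_L + (.0 \cdots 01)_\beta \times \beta^e = x_L + \beta^{e-t} = x_L + \mathrm{ulp}(x_L)$. Then $x_R$ is the next floating-point number larger than $x_L$ and $x_L \le x \le x_R$.

Define

$$\mathrm{fl}(x) := \begin{cases} x_L, & \text{round down}, \\ x_R, & \text{round up}, \\ x_L \text{ or } x_R, \text{ whichever is closer to } x, & \text{round to nearest}. \end{cases}$$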
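The three rounding modes can be tried out in a decimal toy system with $t = 4$ digits. The sketch below uses Python's `decimal` module, where the context precision plays the role of $t$. The value 3.14159 is an arbitrary positive example, and breaking ties to even is an assumption, since the definition above does not say how ties are resolved.

```python
from decimal import Context, Decimal, ROUND_CEILING, ROUND_DOWN, ROUND_HALF_EVEN

t = 4                        # precision of the toy system with beta = 10
x = Decimal("3.14159")       # = (.314159)_10 x 10^1, needs more than t digits

# For positive x, ROUND_DOWN truncates to x_L and ROUND_CEILING gives x_R.
x_L = Context(prec=t, rounding=ROUND_DOWN).create_decimal(x)
x_R = Context(prec=t, rounding=ROUND_CEILING).create_decimal(x)
x_N = Context(prec=t, rounding=ROUND_HALF_EVEN).create_decimal(x)

print(x_L, x_R, x_N)         # round down, round up, round to nearest
print(x_R - x_L)             # ulp(x_L) = 10^(e - t) with e = 1
```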

Rounding error

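$$\text{Unit roundoff: } u := \begin{cases} \beta^{1-t}, & \text{round down}, \\ \tfrac{1}{2}\beta^{1-t}, & \text{round to nearest}. \end{cases}$$

Theorem: Let $\mathrm{fl} : \mathbb{R} \to F(\beta, t, e_{\min}, e_{\max})$, $x \mapsto \mathrm{fl}(x)$, and let $\mathrm{realmin} \le |x| \le \mathrm{realmax}$. Then

$$\mathrm{fl}(x) = x(1 + \delta) \quad \text{with } |\delta| \le u.$$

Proof: Let $x = (.d_1 d_2 \cdots d_t \cdots)_\beta \times \beta^e$. Set $\delta := \dfrac{\mathrm{fl}(x) - x}{x}$. Then $\mathrm{fl}(x) = x(1 + \delta)$.
Note that $|\mathrm{fl}(x) - x| \le \beta^{e-t}$ for round down, and $|\mathrm{fl}(x) - x| \le \beta^{e-t}/2$ for round to nearest. Also $|x| \ge \beta^{e-1}$ since $d_1 \ge 1$. Hence

$$|\delta| = \frac{|\mathrm{fl}(x) - x|}{|x|} \le \begin{cases} \beta^{1-t}, & \text{round down}, \\ \tfrac{1}{2}\beta^{1-t}, & \text{round to nearest}. \end{cases}$$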
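The bound can be checked exactly for IEEE doubles, where $u = 2^{-53}$ under round to nearest. The sketch below represents $x$ as an exact rational, converts it to a double, and computes $\delta$ in rational arithmetic. It assumes that CPython's conversion `float(Fraction)` is correctly rounded; the sample values and the helper name `delta` are arbitrary choices.

```python
from fractions import Fraction

u = Fraction(1, 2**53)       # unit roundoff for beta = 2, t = 53, round to nearest

def delta(x):
    """Exact delta such that fl(x) = x(1 + delta), for an exact rational x."""
    fl_x = Fraction(float(x))        # float(x) rounds; Fraction(...) recovers it exactly
    return (fl_x - x) / x

samples = (Fraction(1, 3), Fraction(1, 10), Fraction(22, 7), Fraction(10**30 + 1, 7))
for x in samples:
    d = delta(x)
    print(x, float(d / u), abs(d) <= u)   # delta in units of u, and the bound check
```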

Rounding error
Let $f(t) := \sin(2\pi t)$ for $t \in [0, 1]$. Consider $F(10, 5, -10, 10)$. Then the MATLAB commands

    t = 0:.001:1; tt = sin(2*pi*t); rt = round(tt, 5);
    roundoff = tt-rt; plot(tt, roundoff, 'LineWidth',1.5)

produce the following plot.

[Figure: roundoff error tt - rt plotted against tt]

Floating-point arithmetic
Arithmetic model: If $x, y \in F(\beta, t, e_{\min}, e_{\max})$, $\oplus \in \{+, -, \times, /\}$ and $u$ is the unit roundoff, then

$$\mathrm{fl}(x \oplus y) = (x \oplus y)(1 + \delta), \quad \text{where } |\delta| \le u.$$
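The arithmetic model can be verified exactly for a pair of doubles by redoing each operation in rational arithmetic. The following is a sketch in Python, assuming IEEE double precision with round to nearest (so $u = 2^{-53}$) and no overflow or underflow; the operands 0.1 and 0.3 are arbitrary.

```python
import operator
from fractions import Fraction

u = Fraction(1, 2**53)       # unit roundoff, IEEE double, round to nearest
x, y = 0.1, 0.3              # two doubles (the stored values, not 1/10 and 3/10)

ops = (("+", operator.add), ("-", operator.sub),
       ("*", operator.mul), ("/", operator.truediv))
deltas = {}
for name, op in ops:
    computed = Fraction(op(x, y))              # fl(x op y), converted exactly
    exact = op(Fraction(x), Fraction(y))       # x op y in exact rational arithmetic
    deltas[name] = (computed - exact) / exact
    print(name, float(deltas[name] / u))       # delta in units of u
```

Every printed value lies in $[-1, 1]$, as the model predicts.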

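Loss of significance (or cancellation): If $x$ and $y$ are not floating-point numbers then we have to start with $\hat{x} := \mathrm{fl}(x) = x(1 + \delta_1)$ and $\hat{y} := \mathrm{fl}(y) = y(1 + \delta_2)$. Then

$$\hat{x} \pm \hat{y} = x(1 + \delta_1) \pm y(1 + \delta_2) = (x \pm y)\left(1 + \frac{x\delta_1 \pm y\delta_2}{x \pm y}\right).$$

Now, if $x$ and $y$ have the same sign then the relative error of the sum,

$$\left|\frac{x\delta_1 + y\delta_2}{x + y}\right| \le |\delta_1| + |\delta_2| \le 2u,$$

is small, but the relative error of the difference,

$$\left|\frac{x\delta_1 - y\delta_2}{x - y}\right|,$$

can be arbitrarily large when $|x - y|$ is very small compared with $|x|$ and $|y|$.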
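A concrete instance makes the contrast visible. In the sketch below, $x = 1/3 + 10^{-12}$ and $y = 1/3$ are arbitrary choices of nearby numbers that are not representable as doubles. As in the derivation above, only the rounding of the inputs is modelled; the sum and difference of $\hat{x}$ and $\hat{y}$ are formed exactly.

```python
from fractions import Fraction

u = Fraction(1, 2**53)                       # unit roundoff, round to nearest
x = Fraction(1, 3) + Fraction(1, 10**12)     # exact inputs, not doubles
y = Fraction(1, 3)
x_hat, y_hat = Fraction(float(x)), Fraction(float(y))   # fl(x) and fl(y), held exactly

rel_sum = abs((x_hat + y_hat) - (x + y)) / (x + y)
rel_diff = abs((x_hat - y_hat) - (x - y)) / (x - y)
print(float(rel_sum / u))     # relative error of the sum, in units of u
print(float(rel_diff / u))    # relative error of the difference, in units of u
```

The sum is accurate to within $2u$, whereas the difference loses roughly eleven of the sixteen significant digits, because $|x - y| = 10^{-12}$ is tiny compared with $|x|$ and $|y|$.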

Example
Problem: Solve $ax^2 + bx + c = 0$, where $a, b, c \in \mathbb{R}$.

Classical method: Naive use of the formula

$$x_1 = \frac{-b - \sqrt{b^2 - 4ac}}{2a}, \quad x_2 = \frac{-b + \sqrt{b^2 - 4ac}}{2a}$$

does not avoid subtraction of nearly equal numbers in the numerator, whereas the modified formula

$$x_1 = -\frac{b + \mathrm{sign}(b)\sqrt{b^2 - 4ac}}{2a}, \quad x_2 = \frac{c}{a x_1}$$

avoids this subtraction and hence avoids loss of significance.

For $10^{-3}x^2 + 10^7 x + 3 = 0$, the classical method in MATLAB yields $x_1 = -10^{10}$ and $x_2 = 0$, whereas the modified method yields the accurate answer $x_1 = -10^{10}$ and $x_2 = -3.0 \times 10^{-7}$.
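Both formulas are short enough to compare directly. The sketch below is written in Python rather than MATLAB (an assumption; both use IEEE doubles) and assumes real roots, $a \ne 0$ and $x_1 \ne 0$. The function names are illustrative.

```python
import math

def roots_classical(a, b, c):
    s = math.sqrt(b * b - 4 * a * c)
    return (-b - s) / (2 * a), (-b + s) / (2 * a)

def roots_modified(a, b, c):
    s = math.sqrt(b * b - 4 * a * c)
    # copysign(s, b) = sign(b) * s, so b and the added term never cancel
    x1 = -(b + math.copysign(s, b)) / (2 * a)
    x2 = c / (a * x1)            # from the product of the roots, x1 * x2 = c / a
    return x1, x2

a, b, c = 1e-3, 1e7, 3.0
print(roots_classical(a, b, c))  # the small root is lost completely
print(roots_modified(a, b, c))   # both roots are accurate
```

In the classical formula $\sqrt{b^2 - 4ac}$ rounds to exactly $b = 10^7$, so $-b + \sqrt{b^2 - 4ac}$ evaluates to zero and all information about $x_2$ is lost.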
Cancellation and remedy

Let $f(x) := \dfrac{1 - \cos(x)}{\sin(x)}$ for $x \ne 0$. If $a$ is close to 0 then evaluation of $f(x)$ at $a$ causes cancellation as $\cos(a) \approx 1$. The remedy is to rewrite $f(x)$ as

$$f(x) = \frac{1 - \cos(x)}{\sin(x)} = \frac{\sin(x)}{1 + \cos(x)},$$

which avoids cancellation at $a \approx 0$.
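The effect of the rewrite can be seen at a small argument. The sketch below evaluates both forms at $a = 10^{-9}$ (an arbitrary choice) and compares them with $a/2$, which approximates $f(a) = \tan(a/2)$ to well beyond double precision for such small $a$.

```python
import math

def f_naive(x):
    return (1 - math.cos(x)) / math.sin(x)     # 1 - cos(x) cancels for small x

def f_stable(x):
    return math.sin(x) / (1 + math.cos(x))     # no subtraction of nearby numbers

a = 1e-9
reference = a / 2                              # f(a) = tan(a/2), close to a/2 here
print(f_naive(a), f_stable(a), reference)
```

The naive form has no correct digits at this argument, since $\cos(a)$ is indistinguishable from 1 in double precision, while the rewritten form agrees with the reference value to nearly full precision.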
