Lect 05&06(Floating Point) 2025
Rafikul Alam
Department of Mathematics
IIT Guwahati
• Floating-point system
• Machine precision and unit roundoff
• Rounding errors
• Floating-point arithmetic
Does the plot look like a polynomial? What explains this behaviour?
R. Alam, IIT Guwahati (January-May 2025) MA571 NLA
Normalized floating-point representation
Let $x \in \mathbb{R}$ be nonzero and $\beta > 1$ be an integer. Then we have the normalized floating-point representation
$$x = \pm (.d_1 d_2 \cdots d_t \cdots)_\beta \times \beta^{e}, \qquad 0 \le d_i < \beta, \quad d_1 \ne 0, \quad e \in \mathbb{Z}.$$
Here $\beta$ is called the base, $e$ is called the exponent, and $f := (.d_1 d_2 \cdots d_t \cdots)_\beta$ is called the mantissa or fraction and is given by
$$f = (.d_1 d_2 \cdots d_t \cdots)_\beta = \sum_{j=1}^{\infty} \frac{d_j}{\beta^{j}}.$$
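As an illustration of the normalized representation, the following minimal sketch extracts the sign, the first t digits of the mantissa, and the exponent of a nonzero rational number in base β. The function name `normalized_digits` and its parameters are hypothetical, chosen only for this example; exact rational arithmetic is used so that the digits are not contaminated by rounding.

```python
from fractions import Fraction


def normalized_digits(x, beta=10, t=8):
    """Return (sign, [d1, ..., dt], e) with |x| = (.d1 d2 ... dt ...)_beta * beta**e.

    Hypothetical helper for illustration: x is any nonzero value accepted by
    Fraction, and the digit expansion is truncated (not rounded) after t digits.
    """
    x = Fraction(x)
    if x == 0:
        raise ValueError("x must be nonzero")
    sign = 1 if x > 0 else -1
    f = abs(x)
    e = 0
    # Scale f into [1/beta, 1); this is what forces d1 != 0.
    while f >= 1:
        f /= beta
        e += 1
    while f < Fraction(1, beta):
        f *= beta
        e -= 1
    digits = []
    for _ in range(t):
        f *= beta
        d = int(f)  # integer part of a nonnegative fraction, so 0 <= d < beta
        digits.append(d)
        f -= d
    return sign, digits, e


# 12.5 in base 10 and 5/8 in base 2
print(normalized_digits(Fraction(25, 2), beta=10, t=4))
print(normalized_digits(Fraction(5, 8), beta=2, t=4))
```

For numbers with a terminating expansion the sketch returns the terminating digit string rather than the equivalent expansion ending in repeated β − 1 digits.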
Define $\mathrm{ulp}(x) := \beta^{e-t}$. For a floating-point number $x > 0$, $\mathrm{next}(x) := x + \mathrm{ulp}(x)$ is the smallest floating-point number larger than $x$. Hence $x$ and $\mathrm{next}(x)$ are consecutive floating-point numbers.
This shows that $\mathrm{next}(x) - x = \mathrm{ulp}(x) = \beta^{e-t}$. So if $x$ is large then $e$ is large, which implies a large gap between $x$ and $\mathrm{next}(x)$. Hence the gap between consecutive floating-point numbers is nonuniform.
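The nonuniform gap can be checked numerically. The sketch below assumes IEEE double precision (β = 2, t = 53) and Python 3.9 or later, where `math.nextafter` and `math.ulp` are available; the helper name `ulp_from_definition` is hypothetical. It relies on the fact that `math.frexp` uses the same normalization as above, returning a fraction in [1/2, 1) together with the exponent e.

```python
import math

T = 53  # mantissa digits of an IEEE double (assumption: beta = 2, binary64)


def ulp_from_definition(x):
    """Compute ulp(x) = 2**(e - T), where x = f * 2**e with 1/2 <= |f| < 1."""
    _, e = math.frexp(x)           # exponent in the normalization used above
    return math.ldexp(1.0, e - T)  # 2.0 ** (e - T), formed exactly


for x in (1.0, 1000.0, 1e16):
    gap = math.nextafter(x, math.inf) - x  # distance to the next larger double
    print(f"x = {x:g}  ulp(x) = {ulp_from_definition(x):g}  gap = {gap:g}")
```

The printed gap grows with x, in line with next(x) − x = ulp(x).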
Now, if $x$ and $y$ have the same sign then the relative error
$$\left| \frac{x\delta_1 + y\delta_2}{x + y} \right| \le |\delta_1| + |\delta_2| \le 2u$$
is small, but
$$\left| \frac{x\delta_1 - y\delta_2}{x - y} \right|$$
can be arbitrarily large when $x - y$ is very small.
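A small experiment makes the contrast concrete. The sketch below assumes that δ1 and δ2 are the relative errors committed when x and y are rounded to IEEE doubles, so that u = 2^-53; the function name `relative_errors` and the particular choice of x and y are illustrative only. The sum and difference of the rounded values are formed in exact rational arithmetic, so only the representation errors are measured.

```python
from fractions import Fraction

U = Fraction(1, 2**53)  # unit roundoff (assumption: IEEE double, round to nearest)


def relative_errors(x, y):
    """Exact relative errors of fl(x)+fl(y) and fl(x)-fl(y) against x+y and x-y."""
    fx = Fraction(float(x))  # fl(x) as an exact rational
    fy = Fraction(float(y))  # fl(y) as an exact rational
    rel_add = abs((fx + fy) - (x + y)) / abs(x + y)
    rel_sub = abs((fx - fy) - (x - y)) / abs(x - y)
    return rel_add, rel_sub


x = Fraction(1, 3)
y = x - Fraction(1, 10**17)  # same sign as x and very close to it
rel_add, rel_sub = relative_errors(x, y)
print("relative error of the sum:       ", float(rel_add))
print("relative error of the difference:", float(rel_sub))
print("bound 2u:                        ", float(2 * U))
```

The error of the sum stays below 2u, while the error of the difference is many orders of magnitude larger because x − y is far smaller than the rounding errors in x and y.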
Let $f(x) := \dfrac{1 - \cos(x)}{\sin(x)}$ for $x \ne 0$. If $a$ is close to $0$ then evaluation of $f(x)$ at $a$ causes cancellation as $\cos(a) \approx 1$. The remedy is to rewrite $f(x)$ as
$$f(x) = \frac{1 - \cos(x)}{\sin(x)} = \frac{\sin(x)}{1 + \cos(x)},$$
which avoids cancellation at $a \approx 0$.
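The effect of the rewrite can be seen in double precision. In the sketch below the names `f_naive` and `f_stable` are hypothetical, and tan(a/2) is used as a reference value because (1 − cos x)/sin x = tan(x/2); the evaluation point a = 1e-8 is an arbitrary choice close to 0.

```python
import math


def f_naive(x):
    """Direct evaluation; 1 - cos(x) cancels badly for x near 0."""
    return (1.0 - math.cos(x)) / math.sin(x)


def f_stable(x):
    """Rewritten form; 1 + cos(x) is close to 2 for x near 0, so nothing cancels."""
    return math.sin(x) / (1.0 + math.cos(x))


a = 1e-8
reference = math.tan(a / 2)  # identity: (1 - cos x) / sin x = tan(x / 2)
for name, value in (("naive", f_naive(a)), ("stable", f_stable(a))):
    rel_err = abs(value - reference) / abs(reference)
    print(f"{name:6s} value = {value:.17g}  relative error = {rel_err:.3g}")
```

Away from 0, for example at x = 1, the two forms agree to roughly machine precision, so the rewrite costs nothing where cancellation is not an issue.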