Lect 05&06(Floating Point) 2025

The document discusses the floating-point system used in numerical linear algebra, focusing on machine precision, rounding errors, and floating-point arithmetic. It explains the structure of floating-point representation, including normalized forms, underflow and overflow thresholds, and machine epsilon. Additionally, it outlines IEEE standards for single and double precision floating-point representations, detailing their components and effective exponent ranges.

MA571 Numerical Linear Algebra

Lectures 5&6: Floating-point system

Rafikul Alam
Department of Mathematics
IIT Guwahati

R. Alam, IIT Guwahati (January-May 2025) MA571 NLA

Lecture outline

• Floating-point system
• Machine precision and unit roundoff
• Rounding errors
• Floating-point arithmetic

Anomalous arithmetic

Arithmetic operations on computers are inexact and can produce surprising results. For example, MATLAB produces

\[ \left(\frac{4}{3} - 1\right) * 3 - 1 = -2.2204 \times 10^{-16}, \]

\[ 5 \times \frac{(1 + \exp(-50)) - 1}{(1 + \exp(-50)) - 1} = \mathrm{NaN}, \]

\[ \frac{\log(\exp(750))}{100} = \mathrm{Inf}. \]
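The three results can be reproduced outside MATLAB. The sketch below uses Python, which also works in IEEE double precision but raises exceptions where MATLAB returns NaN or Inf. The helpers `ieee_div` and `ieee_exp` are hypothetical names introduced here only to mimic the IEEE default behaviour; they are not part of any library.

```python
import math

def ieee_div(a: float, b: float) -> float:
    """Division with IEEE default results (NaN/Inf) instead of Python's exception."""
    try:
        return a / b
    except ZeroDivisionError:
        if a == 0.0 or math.isnan(a):
            return math.nan                      # 0/0 is NaN
        # nonzero/0 is an infinity whose sign combines both operands
        return math.copysign(math.inf, a) * math.copysign(1.0, b)

def ieee_exp(x: float) -> float:
    """exp with overflow mapped to +Inf instead of Python's OverflowError."""
    try:
        return math.exp(x)
    except OverflowError:
        return math.inf

# (4/3 - 1)*3 - 1: 4/3 is rounded, and the rounding error survives to the end.
r1 = (4 / 3 - 1) * 3 - 1

# exp(-50) is far below the spacing of doubles near 1, so 1 + exp(-50) rounds to 1.
d = (1 + math.exp(-50)) - 1
r2 = ieee_div(5 * d, d)                          # 0/0

# exp(750) exceeds the largest double, so it overflows before log is applied.
r3 = math.log(ieee_exp(750)) / 100

print(r1, d, r2, r3)
```

In exact arithmetic the three expressions equal 0, 5 and 7.5. The first discrepancy comes from rounding 4/3, the second from 1 + exp(−50) rounding to exactly 1, and the third from overflow in exp(750).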

Example 2
Let $p(x) = x^7 - 7x^6 + 21x^5 - 35x^4 + 35x^3 - 21x^2 + 7x - 1 = (x - 1)^7$.

Does the plot look like a polynomial? What explains this behaviour?
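
One way to explore the question is to evaluate both forms of p on a fine grid around x = 1 and compare them. The sketch below is only an illustration: the interval [0.988, 1.012], the grid size, and the function names `p_expanded` and `p_factored` are assumptions made here, not taken from the slide.

```python
def p_expanded(x: float) -> float:
    """Evaluate p from its expanded coefficients (large terms that nearly cancel)."""
    return (x**7 - 7 * x**6 + 21 * x**5 - 35 * x**4
            + 35 * x**3 - 21 * x**2 + 7 * x - 1)

def p_factored(x: float) -> float:
    """Evaluate p as (x - 1)^7, which involves no cancellation of large terms."""
    return (x - 1) ** 7

# 201 equally spaced points in [0.988, 1.012] (assumed plotting window).
xs = [0.988 + k * 0.024 / 200 for k in range(201)]
expanded = [p_expanded(x) for x in xs]
factored = [p_factored(x) for x in xs]

# Largest disagreement between the two evaluations on the grid.
max_gap = max(abs(a - b) for a, b in zip(expanded, factored))

# Grid points where the expanded form has the opposite sign to (x - 1)^7.
wrong_sign = sum(1 for a, b in zip(expanded, factored) if a * b < 0)

print(f"max |expanded - factored| = {max_gap:.3e}")
print(f"largest |(x - 1)^7| on grid = {max(abs(v) for v in factored):.3e}")
print(f"points with wrong sign      = {wrong_sign}")
```

On this window |(x − 1)^7| is at most about 3.6 × 10^-14, while each term of the expanded form has magnitude between 1 and about 40. Rounding errors in those terms are therefore comparable in size to the true value of p, which is what makes a plot of the expanded form look noisy rather than smooth.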

Normalized floating-point representation

Let $x \in \mathbb{R}$ be nonzero and $\beta > 1$ be an integer. Then we have the normalized floating-point representation

\[ x = \pm (.d_1 d_2 \cdots d_t \cdots)_\beta \times \beta^e, \qquad 0 \le d_i < \beta, \quad d_1 \ne 0, \quad e \in \mathbb{Z}. \]

Here $\beta$ is called the base, $e$ is called the exponent, and $f := (.d_1 d_2 \cdots d_t \cdots)_\beta$ is called the mantissa or fraction and is given by

\[ f = (.d_1 d_2 \cdots d_t \cdots)_\beta = \sum_{j=1}^{\infty} \frac{d_j}{\beta^j}. \]

Note that $0 < f \le 1$ (more precisely, $1/\beta \le f \le 1$, since $d_1 \ne 0$) and that $x = \pm f \times \beta^e$ is a unique representation. The number of digits allowed in $f$ is called the precision of the floating-point representation, which in the present case is infinite.
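
The normalized representation can be computed digit by digit. The sketch below is a minimal illustration using exact rational arithmetic. The function name `normalize` and the truncation parameter `t` are assumptions introduced here, and the code uses the terminating-expansion convention, so it always produces a fraction f with 1/β ≤ f < 1.

```python
from fractions import Fraction

def normalize(x, beta: int = 2, t: int = 10):
    """Return (sign, [d1, ..., dt], e) with x = sign * (.d1 d2 ... )_beta * beta**e.

    Only the first t digits of the (possibly infinite) fraction are returned.
    """
    x = Fraction(x)
    if x == 0:
        raise ValueError("zero has no normalized representation")
    sign = -1 if x < 0 else 1
    f = abs(x)
    e = 0
    # Scale f into [1/beta, 1) and record the exponent, so that d1 != 0.
    while f >= 1:
        f /= beta
        e += 1
    while f < Fraction(1, beta):
        f *= beta
        e -= 1
    # Peel off digits: d_j is the integer part of beta * (remaining fraction).
    digits = []
    for _ in range(t):
        f *= beta
        d = int(f)
        digits.append(d)
        f -= d
    return sign, digits, e

# 1/10 in base 2: the digit pattern 1100 repeats forever.
print(normalize(Fraction(1, 10), beta=2, t=8))
# -12.5 in base 10: the fraction terminates after three digits.
print(normalize(Fraction(-25, 2), beta=10, t=4))
```

The first call shows why 1/10 cannot be stored exactly with finitely many binary digits: its base-2 fraction never terminates, so any finite precision t forces a rounding error.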