
FLOATING POINT ARITHMETIC - ERROR ANALYSIS

• Brief review of floating point arithmetic

• Model of floating point arithmetic

• Notation, backward and forward errors

Roundoff errors and floating-point arithmetic

• The basic problem: The set A of all representable numbers on a given machine is finite, but we would like to use this set to perform the standard arithmetic operations (+, ∗, −, /) on the infinite set of real numbers. The usual algebra rules are no longer satisfied, since the results of operations are rounded.

• Basic algebra breaks down in floating point arithmetic.


Example: In floating point arithmetic,

a + (b + c) ≠ (a + b) + c

- Matlab experiment: For 10,000 random triples of numbers, count the number of instances in which the above equality holds. Do the same for multiplication.
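The experiment can be sketched as follows (in Python rather than Matlab; the seed and the trial count are arbitrary choices):

```python
import random

random.seed(0)                      # reproducible trials
TRIALS = 10_000

add_ok = mul_ok = 0
for _ in range(TRIALS):
    a, b, c = (random.random() for _ in range(3))
    add_ok += (a + (b + c)) == ((a + b) + c)
    mul_ok += (a * (b * c)) == ((a * b) * c)

print(f"addition associative:       {add_ok}/{TRIALS}")
print(f"multiplication associative: {mul_ok}/{TRIALS}")
```

Both counts come out strictly between 0 and 10,000: associativity holds in many instances but fails in a substantial fraction of them.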
3-2 TB: 13-15; GvL 2.7; Ort 9.2; AB: 1.4.1–.2 – Float
Floating point representation:

Real numbers are represented in two parts: a mantissa (significand) and an exponent. If the representation is in base β then:

x = ±(.d1d2 · · · dt)β × β^e

• .d1d2 · · · dt is a fraction in the base-β representation (generally the form is normalized in that d1 ≠ 0), and e is an integer.

• It is often more convenient to rewrite the above as:

x = ±(m/β^t) × β^e ≡ ±m × β^(e−t)

• The mantissa m is an integer with 0 ≤ m ≤ β^t − 1.

Machine precision - machine epsilon

• Notation: fl(x) = closest floating point representation of the real number x ('rounding').

• When a number x is very small, there is a point at which 1 + x == 1 in the machine sense: the computer no longer distinguishes between 1 and 1 + x.

Machine epsilon: The smallest number ε such that 1 + ε is a float different from one is called machine epsilon. Denoted by macheps or eps, it represents the distance from 1 to the next larger floating point number.

• With the previous representation, eps equals β^(−(t−1)).

Example: In IEEE standard double precision, β = 2 and t = 53 (this includes the 'hidden bit'). Therefore eps = 2^(−52).

Unit round-off: A real number x can be approximated by a floating point number fl(x) with relative error no larger than u = (1/2) β^(−(t−1)).

• u is called the unit round-off.

• In fact one can easily show:

fl(x) = x(1 + δ) with |δ| < u

- Matlab experiment: find the machine epsilon on your computer.
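One way to sketch this experiment (in Python rather than Matlab): repeatedly halve a candidate until adding half of it to 1 no longer changes the result.

```python
import sys

# Halve eps until 1 + eps/2 is no longer distinguishable from 1;
# the final eps is then the distance from 1 to the next larger float.
eps = 1.0
while 1.0 + eps / 2 > 1.0:
    eps /= 2

print(eps)                            # 2.220446049250313e-16, i.e. 2**-52
print(eps == sys.float_info.epsilon)  # True
```

On IEEE double precision hardware this agrees with β^(−(t−1)) = 2^(−52).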

• There have been many discussions on what conditions/rules should be satisfied by floating point arithmetic. The IEEE standard is a set of standards adopted by many CPU manufacturers.

Rule 1.

fl(x) = x(1 + ε), where |ε| ≤ u

Rule 2. For all operations ⊙ (one of +, −, ∗, /):

fl(x ⊙ y) = (x ⊙ y)(1 + ε⊙), where |ε⊙| ≤ u

Rule 3. For the +, ∗ operations:

fl(a ⊙ b) = fl(b ⊙ a)

- Matlab experiment: Verify Rule 3 experimentally with 10,000 randomly generated pairs of numbers ai, bi.
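A sketch of this experiment in Python (the range of the random numbers is an arbitrary choice). Since IEEE arithmetic rounds the exact result, and x + y and y + x have the same exact result, Rule 3 should hold in every instance:

```python
import random

random.seed(1)
pairs = [(random.uniform(-1e6, 1e6), random.uniform(-1e6, 1e6))
         for _ in range(10_000)]

add_commutes = all(a + b == b + a for a, b in pairs)
mul_commutes = all(a * b == b * a for a, b in pairs)
print(add_commutes, mul_commutes)   # True True
```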

Example: Consider the sum of 3 numbers: y = a + b + c.
• Done as fl(fl(a + b) + c):

η = fl(a + b) = (a + b)(1 + ε₁)
y₁ = fl(η + c) = (η + c)(1 + ε₂)
   = [(a + b)(1 + ε₁) + c](1 + ε₂)
   = [(a + b + c) + (a + b)ε₁](1 + ε₂)
   = (a + b + c) [1 + (a + b)/(a + b + c) · ε₁(1 + ε₂) + ε₂]

So, disregarding the high order term ε₁ε₂:

fl(fl(a + b) + c) = (a + b + c)(1 + ε₃),   ε₃ ≈ (a + b)/(a + b + c) · ε₁ + ε₂
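The amplification factor (a + b)/(a + b + c) in ε₃ can be made visible with a hypothetical extreme choice of inputs, where the intermediate sum a + b is enormous compared to the exact sum y:

```python
# Exact sum y = 1.0, but a + b = 1e16 + 1, so the factor (a+b)/y is ~1e16.
a, b, c = 1.0, 1e16, -1e16

left = (a + b) + c     # fl(fl(a+b)+c): the 1.0 is absorbed when added to 1e16
right = a + (b + c)    # fl(a+fl(b+c)): b + c cancels exactly, then 1.0 is added

print(left, right)     # 0.0 1.0
```

Each individual operation is correctly rounded, yet the first ordering loses all accuracy while the second is exact.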

• If we redid the computation as y₂ = fl(a + fl(b + c)) we would find:

fl(a + fl(b + c)) = (a + b + c)(1 + ε₄),   ε₄ ≈ (b + c)/(a + b + c) · ε₁ + ε₂

• The error is amplified by the factor (a + b)/y in the first case and (b + c)/y in the second case.
• In order to sum n numbers accurately, it is better to start with the small numbers first. [However, sorting before adding is usually not worth the cost.]
• But watch out if the numbers have mixed signs!
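The effect of the summation order can be sketched with the harmonic series as an illustrative example (the choice of series and of n are arbitrary; math.fsum serves as a high-accuracy reference):

```python
import math

n = 1_000_000
terms = [1.0 / k for k in range(1, n + 1)]     # harmonic series terms

exact = math.fsum(terms)                       # correctly rounded reference
large_first = sum(terms)                       # 1 + 1/2 + ... + 1/n
small_first = sum(reversed(terms))             # 1/n + ... + 1/2 + 1

print(abs(large_first - exact))                # error, large terms first
print(abs(small_first - exact))                # error, small terms first: typically smaller
```

Adding the small terms first lets them accumulate before they are absorbed into a large partial sum, which is why that ordering is typically more accurate.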

The absolute value notation

• For a given vector x, |x| is the vector with components |xᵢ|, i.e., |x| is the component-wise absolute value of x.
• Similarly for matrices:

|A| = {|aᵢⱼ|}, i = 1, . . . , m; j = 1, . . . , n

• An obvious result: the basic inequality

|fl(aᵢⱼ) − aᵢⱼ| ≤ u |aᵢⱼ|

translates into

fl(A) = A + E with |E| ≤ u |A|

• A ≤ B means aᵢⱼ ≤ bᵢⱼ for all 1 ≤ i ≤ m, 1 ≤ j ≤ n.


Error Analysis: Inner product

• Inner products appear in the innermost parts of many calculations, so their analysis is important.

Lemma: If |δᵢ| ≤ u and nu < 1 then

Πᵢ₌₁ⁿ (1 + δᵢ) = 1 + θₙ where |θₙ| ≤ nu/(1 − nu)

• Common notation: γₙ ≡ nu/(1 − nu)

- Prove the lemma [Hint: use induction]

• One can use the following simpler result:

Lemma: If |δᵢ| ≤ u and nu < .01 then

Πᵢ₌₁ⁿ (1 + δᵢ) = 1 + θₙ where |θₙ| ≤ 1.01 nu

Example: The previous sum of numbers can be written

fl(a + b + c) = a(1 + ε₁)(1 + ε₂) + b(1 + ε₁)(1 + ε₂) + c(1 + ε₂)
             = a(1 + θ₁) + b(1 + θ₂) + c(1 + θ₃)
             = exact sum of slightly perturbed inputs,

where all the θᵢ's satisfy |θᵢ| ≤ 1.01 nu (here n = 2).
• Alternatively, one can write the 'forward' bound:

|fl(a + b + c) − (a + b + c)| ≤ |aθ₁| + |bθ₂| + |cθ₃|.
Backward and forward errors

• Assume the approximation ŷ to y = alg(x) is computed by some algorithm with arithmetic precision ε. One possible analysis: find an upper bound for the forward error

|Δy| = |y − ŷ|

• This is not always easy.

Alternative question: find the equivalent perturbation of the initial data (x) that produces the result ŷ. In other words, find Δx so that:

alg(x + Δx) = ŷ

• The value of |Δx| is called the backward error. An analysis to find an upper bound for |Δx| is called backward error analysis.
Example:

A = [ a  b ]     B = [ d  e ]
    [ 0  c ]         [ 0  f ]

Consider the product:

fl(A·B) = [ (ad)(1 + ε₁)   [ae(1 + ε₂) + bf(1 + ε₃)](1 + ε₄) ]
          [      0                    cf(1 + ε₅)             ]

with |εᵢ| ≤ u for i = 1, . . . , 5. The result can be written as:

[ a   b(1 + ε₃)(1 + ε₄) ]  [ d(1 + ε₁)   e(1 + ε₂)(1 + ε₄) ]
[ 0   c(1 + ε₅)         ]  [     0               f         ]

• So fl(A·B) = (A + E_A)(B + E_B).
• The backward errors E_A, E_B satisfy:

|E_A| ≤ 2u |A| + O(u²) ;   |E_B| ≤ 2u |B| + O(u²)
• When solving Ax = b by Gaussian elimination, we will see that a bound on ‖eₓ‖ such that the following holds exactly:

A(x_computed + eₓ) = b

is much harder to find than bounds on ‖E_A‖, ‖e_b‖ such that this holds exactly:

(A + E_A) x_computed = b + e_b.

Note: In many instances backward errors are more meaningful than forward errors: if the initial data is accurate only to, say, 4 digits, then an algorithm for computing x need not guarantee a backward error of less than 10⁻¹⁰, for example. A backward error of order 10⁻⁴ is acceptable.

Main result on inner products:

• Backward error expression:

fl(xᵀy) = [x .∗ (1 + dₓ)]ᵀ [y .∗ (1 + d_y)]

where ‖d_∗‖∞ ≤ 1.01 nu for ∗ = x, y.

• One can show that the equality is valid even if one of dₓ, d_y is absent.

• Forward error expression: |fl(xᵀy) − xᵀy| ≤ γₙ |x|ᵀ|y|, with 0 ≤ γₙ ≤ 1.01 nu.

• Here |x| is the elementwise absolute value and .∗ denotes elementwise multiplication.

• The above assumes nu ≤ .01. For u = 2.0 × 10⁻¹⁶, this holds for n ≤ 4.5 × 10¹³.
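The forward bound can be checked numerically (a Python sketch; the seed, the length n, and the uniform distribution are arbitrary choices, and math.fsum serves as a high-accuracy reference for the exact inner product):

```python
import math
import random

random.seed(2)
n = 1000
u = 2.0 ** -53                        # unit roundoff, IEEE double precision

x = [random.uniform(-1, 1) for _ in range(n)]
y = [random.uniform(-1, 1) for _ in range(n)]

s = 0.0                               # plain recursive summation: fl(x^T y)
for xi, yi in zip(x, y):
    s += xi * yi

ref = math.fsum(xi * yi for xi, yi in zip(x, y))          # reference value
gamma = n * u / (1 - n * u)                               # gamma_n
bound = gamma * math.fsum(abs(xi) * abs(yi) for xi, yi in zip(x, y))

print(abs(s - ref) <= bound)          # True: the forward bound holds
```

The observed error is typically far below the bound, which is a worst-case estimate.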
• Consequence of the lemma:

|fl(A ∗ B) − A ∗ B| ≤ γₙ |A| ∗ |B|

• Another (less precise) way to write the result is:

|fl(xᵀy) − xᵀy| ≤ n u |x|ᵀ|y| + O(u²)

- Assume you use single precision, for which u = 2 × 10⁻⁶. What is the largest n for which nu ≤ 0.01 holds? Any conclusions for the use of single precision arithmetic?
- What does the main result on inner products imply for the case y = x? [Contrast the relative accuracy you get in this case with the general case y ≠ x.]

- Show that for any x, y there exist Δx, Δy such that

fl(xᵀy) = (x + Δx)ᵀy, with |Δx| ≤ γₙ|x|
fl(xᵀy) = xᵀ(y + Δy), with |Δy| ≤ γₙ|y|

- (Continuation) Let A be an m × n matrix, x an n-vector, and y = Ax. Show that there exists a matrix ΔA such that

fl(y) = (A + ΔA)x, with |ΔA| ≤ γₙ|A|

- (Continuation) From the above, derive a result about a column of the product of two matrices A and B. Does a similar result hold for the product AB as a whole?

Supplemental notes: Floating Point Arithmetic

In most computing systems, real numbers are represented in two parts: a mantissa and an exponent. If the representation is in base β then:

x = ±(.d1d2 · · · dm)β × β^e

• .d1d2 · · · dm is a fraction in the base-β representation.
• e is an integer; it can be negative, positive or zero.
• Generally the form is normalized in that d1 ≠ 0.

3-19 TB: 13; GvL 2.7; Ort 9.2; AB: 1.4.1–. – FloatSuppl

Example: In base 10 (for illustration):

1. 1000.12345 can be written as 0.100012345 × 10⁴

2. 0.000812345 can be written as 0.812345 × 10⁻³

• The problem with floating point arithmetic: we have to live with limited precision.

Example: Assume that we have only 5 digits of accuracy in the mantissa and 2 digits for the exponent (excluding the sign):

.d1 d2 d3 d4 d5 | e1 e2
Try to add 1000.2 = .10002e+04 and 1.07 = .10700e+01:

1000.2 = .1 0 0 0 2 | 0 4 ;   1.07 = .1 0 7 0 0 | 0 1

First task: align the decimal points. The number with the smallest exponent is (internally) rewritten so its exponent matches the largest one:

1.07 = 0.000107 × 10⁴

Second task: add the mantissas:

  0. 1 0 0 0 2
+ 0. 0 0 0 1 0 7
= 0. 1 0 0 1 2 7

Third task: round the result. The result has 6 digits but we can use only 5, so we can:
• Chop the result: .1 0 0 1 2 ;
• Round the result: .1 0 0 1 3 ;
Fourth task: normalize the result if needed (not needed here).
Result with rounding: .1 0 0 1 3 | 0 4 ;
- Redo the same thing with 7000.2 + 4000.3 or 6999.2 + 4000.3.
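Python's decimal module can emulate this 5-digit decimal arithmetic (a sketch; fl5 and ctx are hypothetical helper names, and round-half-even is chosen to match the round-to-even tie rule used later in these slides):

```python
from decimal import Context, ROUND_HALF_EVEN

ctx = Context(prec=5, rounding=ROUND_HALF_EVEN)   # 5 significant digits

def fl5(value):
    """Round a decimal string to 5 significant digits."""
    return ctx.create_decimal(value)

# The worked example: 1000.2 + 1.07 -> exact 1001.27 -> rounded .10013e+04
s = ctx.add(fl5("1000.2"), fl5("1.07"))
print(s)                                          # 1001.3

# The two exercises from the slide:
print(ctx.add(fl5("7000.2"), fl5("4000.3")))
print(ctx.add(fl5("6999.2"), fl5("4000.3")))
```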

Some More Examples

• Each operation fl(x ⊙ y) proceeds in 4 steps:

1. Line up the exponents (for addition & subtraction).
2. Compute a temporary exact answer.
3. Normalize the temporary result.
4. Round to the nearest representable number (round-to-even in case of a tie).

             .40015e+02     .40010e+02        .41015e-98
           + .60010e+02   + .50001e-04      − .41010e-98
temporary   1.00025e+02     .4001050001e+02   .00005e-98
normalize    .100025e+03    .400105⊕e+02      .00050e-99
round        .10002e+03     .40011e+02        .00050e-99

note: first case: round to even (exactly halfway between two values); second case: round to nearest (⊕ = remaining digits not all 0, so the result is closer to the upper value); third case: too small, the result is left unnormalized since the exponent is at its minimum.
The IEEE standard

32 bit (Single precision):

[ sign (±) | exponent (8 bits) | mantissa (23 bits) ]

• In binary, the leading one of the mantissa does not need to be represented: one bit is gained. This is the 'hidden bit'.
• Largest exponent: 2⁷ − 1 = 127; smallest: −126. ['bias' of 127]

64 bit (Double precision):

[ sign (±) | exponent (11 bits) | mantissa (52 bits) ]

• Bias of 1023: if c is the content of the exponent field, the actual scale factor is 2^(c−1023).
• e + bias = 2047 (all ones) is reserved for special use.
• Largest exponent: 1023; smallest: −1022.
• Including the hidden bit, the mantissa has a total of 53 bits (52 bits represented, one hidden).

- Take the number 1.0 and see what happens if you add 1/2, 1/4, . . . , 2⁻ⁱ. Do not forget the hidden bit!

         Hidden bit (not represented)
Expon.   ↓   ← 52 bits →
e        1   1 0 0 0 · · · 0 0 0
e        1   0 1 0 0 · · · 0 0 0
e        1   0 0 1 0 · · · 0 0 0
         .......
e        1   0 0 0 0 · · · 0 0 1
e        1   0 0 0 0 · · · 0 0 0

(Note: the 'e' part has 12 bits and includes the sign.)

• Conclusion:

fl(1 + 2⁻⁵²) ≠ 1 but: fl(1 + 2⁻⁵³) == 1 !!
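The conclusion is easy to check directly in Python:

```python
import sys

# 2**-52 is one ulp at 1.0: adding it changes the last stored mantissa bit.
print(1.0 + 2.0 ** -52 == 1.0)                # False
# 2**-53 falls exactly halfway; round-to-even sends the sum back to 1.0.
print(1.0 + 2.0 ** -53 == 1.0)                # True

print(sys.float_info.epsilon == 2.0 ** -52)   # True
```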
Special Values

• Exponent field = 00000000000 (smallest possible value): there is no hidden bit. All bits == 0 means exactly zero.
• This allows for unnormalized numbers, leading to gradual underflow.
• Exponent field = 11111111111 (largest possible value): the number represented is "Inf", "-Inf" or "NaN".
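These special values can be observed in Python (5e-324 is the smallest subnormal double, an assumption valid on IEEE 754 hardware):

```python
import math
import sys

inf = float("inf")                        # exponent field all ones, mantissa 0
nan = float("nan")                        # exponent field all ones, mantissa != 0

print(inf > sys.float_info.max)           # True: Inf exceeds every finite double
print(nan == nan)                         # False: NaN compares unequal to itself

tiny = 5e-324                             # smallest subnormal (exponent field all zeros)
print(0.0 < tiny < sys.float_info.min)    # True: gradual underflow region
print(tiny / 2)                           # 0.0: below the subnormals, underflow to zero
```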

Appendix to set 3: Analysis of inner products

Consider sₙ = fl(x₁ ∗ y₁ + x₂ ∗ y₂ + · · · + xₙ ∗ yₙ)

• In what follows, the ηᵢ's come from ∗ and the εᵢ's come from +.
• They satisfy |ηᵢ| ≤ u and |εᵢ| ≤ u.
• The inner product sₙ is computed as:

1. s₁ = fl(x₁y₁) = (x₁y₁)(1 + η₁)
2. s₂ = fl(s₁ + fl(x₂y₂)) = fl(s₁ + x₂y₂(1 + η₂))
      = (x₁y₁(1 + η₁) + x₂y₂(1 + η₂))(1 + ε₂)
      = x₁y₁(1 + η₁)(1 + ε₂) + x₂y₂(1 + η₂)(1 + ε₂)
3. s₃ = fl(s₂ + fl(x₃y₃)) = fl(s₂ + x₃y₃(1 + η₃))
      = (s₂ + x₃y₃(1 + η₃))(1 + ε₃)
3-28 TB: 13-15; GvL 2.7; Ort 9.2; AB: 1.4.1–.2 – Float2

Expanding: s₃ = x₁y₁(1 + η₁)(1 + ε₂)(1 + ε₃)
             + x₂y₂(1 + η₂)(1 + ε₂)(1 + ε₃)
             + x₃y₃(1 + η₃)(1 + ε₃)

• Induction shows that [with the convention that ε₁ ≡ 0]:

sₙ = Σᵢ₌₁ⁿ xᵢyᵢ(1 + ηᵢ) Πⱼ₌ᵢⁿ (1 + εⱼ)

Q: How many factors appear in the coefficient of xᵢyᵢ?
A: • When i > 1: 1 + (n − i + 1) = n − i + 2
   • When i = 1: n (since ε₁ = 0 does not count)
• Bottom line: always ≤ n.
• For each of these products:

(1 + ηᵢ) Πⱼ₌ᵢⁿ (1 + εⱼ) = 1 + θᵢ, with |θᵢ| ≤ γₙ, so:

sₙ = Σᵢ₌₁ⁿ xᵢyᵢ(1 + θᵢ) with |θᵢ| ≤ γₙ, or:

fl(Σᵢ₌₁ⁿ xᵢyᵢ) = Σᵢ₌₁ⁿ xᵢyᵢ + Σᵢ₌₁ⁿ xᵢyᵢθᵢ with |θᵢ| ≤ γₙ

• This leads to the final result (forward form):

|fl(Σᵢ₌₁ⁿ xᵢyᵢ) − Σᵢ₌₁ⁿ xᵢyᵢ| ≤ γₙ Σᵢ₌₁ⁿ |xᵢ||yᵢ|

• or (backward form):

fl(Σᵢ₌₁ⁿ xᵢyᵢ) = Σᵢ₌₁ⁿ xᵢyᵢ(1 + θᵢ) with |θᵢ| ≤ γₙ
