Floating Point Arithmetic - Error Analysis
[References: TB: 13-15; GvL 2.7; Ort 9.2; AB: 1.4.1–1.4.2]
Roundoff errors and floating-point arithmetic
a + (b + c) ≠ (a + b) + c
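A quick machine check of this (a minimal Python sketch; the particular values 10^16, −10^16 and 1.0 are only illustrative):

# Floating point addition is not associative: grouping changes the result.
a, b, c = 1.0e16, -1.0e16, 1.0
print((a + b) + c)   # 1.0  (a + b cancels exactly, then c is added)
print(a + (b + c))   # 0.0  (c is absorbed when added to b, then a + b cancels)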
Floating point representation:
Machine precision - machine epsilon
Example: In IEEE standard double precision, β = 2 and t = 53 (includes the 'hidden bit'). Therefore eps = 2^{-52}.
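This value can be confirmed directly; a minimal Python sketch, assuming IEEE double precision arithmetic:

import numpy as np

# Halve eps until 1.0 + eps/2 rounds back to 1.0; the final eps is the machine epsilon.
eps = 1.0
while 1.0 + eps / 2 > 1.0:
    eps /= 2

print(eps)                               # 2.220446049250313e-16
print(eps == 2.0**-52)                   # True
print(eps == np.finfo(np.float64).eps)   # True: NumPy reports the same value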
Unit Round-off: A real number x can be approximated by a floating point number fl(x) with relative error no larger than u = (1/2) β^{-(t-1)}.
ä u is called the Unit Round-off.
ä In fact one can easily show:
fl(x) = x(1 + δ) with |δ| < u
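A minimal numerical check of this model (Python sketch: single precision plays the role of the working format so that double precision can serve as the 'exact' reference; the random test values are arbitrary):

import numpy as np

u = 2.0**-24                                      # unit round-off of IEEE single: (1/2)*2**-(24-1)
rng = np.random.default_rng(0)
x = rng.standard_normal(1000) * 10.0**rng.integers(-6, 6, size=1000)

fl_x = x.astype(np.float32).astype(np.float64)    # fl(x): x rounded to single precision
delta = np.abs(fl_x - x) / np.abs(x)              # observed relative errors

print(delta.max() <= u)                           # True: no relative error exceeds u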
Rule 1.
fl(a ⊙ b) = fl(b ⊙ a)
Example: Consider the sum of 3 numbers: y = a + b + c.
ä Done as fl(fl(a + b) + c):
η   = fl(a + b) = (a + b)(1 + ε_1)
y_1 = fl(η + c) = (η + c)(1 + ε_2)
    = [(a + b)(1 + ε_1) + c] (1 + ε_2)
    = [(a + b + c) + (a + b)ε_1] (1 + ε_2)
    = (a + b + c) [ 1 + (a + b)/(a + b + c) · ε_1(1 + ε_2) + ε_2 ]
ä If we redid the computation as y_2 = fl(a + fl(b + c)) we would find
y_2 = (a + b + c) [ 1 + (b + c)/(a + b + c) · ε_1(1 + ε_2) + ε_2 ]
ä The relative error is now amplified by (b + c)/(a + b + c) instead of (a + b)/(a + b + c): the order in which the sum is evaluated matters.
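A minimal sketch of how much the ordering can matter when |a + b + c| is small compared to |a + b|; the numbers below are chosen only so that the amplification factor (a + b)/(a + b + c) is about 10^16:

a, b, c = 1.0, 1.0e-16, -1.0     # in exact arithmetic a + b + c = 1.0e-16
exact = 1.0e-16

y1 = (a + b) + c                 # fl(fl(a + b) + c): b is absorbed into a and lost
y2 = a + (b + c)                 # fl(a + fl(b + c))

print(y1, abs(y1 - exact) / exact)   # 0.0, relative error 1.0 (all accuracy lost)
print(y2, abs(y2 - exact) / exact)   # ~1.1e-16, relative error ~0.11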
The absolute value notation
ä For a matrix A (or vector x), |A| denotes the matrix (vector) with entries |a_{ij}| (resp. |x_i|); inequalities such as |A| ≤ |B| are understood entrywise.
Error Analysis: Inner product
ä Common notation: γ_n ≡ nu / (1 − nu)
ä Can use the following simpler result:
Backward and forward errors
Example: Let
A = [ a  b ]      B = [ d  e ]
    [ 0  c ]          [ 0  f ]
Consider the product:
fl(A·B) = [ ad(1 + ε_1)    [ae(1 + ε_2) + bf(1 + ε_3)](1 + ε_4) ]
          [ 0              cf(1 + ε_5)                          ]
with |ε_i| ≤ u, for i = 1, ..., 5. The result can be written as:
fl(A·B) = [ a   b(1 + ε_3)(1 + ε_4) ] [ d(1 + ε_1)   e(1 + ε_2)(1 + ε_4) ]
          [ 0   c(1 + ε_5)          ] [ 0            f                   ]
ä So fl(A·B) = (A + E_A)(B + E_B).
ä The backward errors E_A, E_B satisfy:
|E_A| ≤ 2u |A| + O(u^2) ;   |E_B| ≤ 2u |B| + O(u^2)
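From the displayed expression for fl(A·B) one also gets the entrywise forward bound |fl(A·B) − A·B| ≤ 2u |A| |B| + O(u^2). A small numerical sanity check (Python sketch: single precision is the working arithmetic, double precision the reference, and the random triangular matrices are arbitrary):

import numpy as np

rng = np.random.default_rng(0)
u = 2.0**-24                      # unit round-off of IEEE single precision

# Random upper triangular A and B, stored exactly as single precision numbers
A = np.triu(rng.standard_normal((2, 2))).astype(np.float32)
B = np.triu(rng.standard_normal((2, 2))).astype(np.float32)

AB_fl = (A @ B).astype(np.float64)                   # product computed (and rounded) in single
AB = A.astype(np.float64) @ B.astype(np.float64)     # reference product in double

# Entrywise bound 2u |A||B|, with a little slack for the O(u^2) term
bound = 2.01 * u * (np.abs(A).astype(np.float64) @ np.abs(B).astype(np.float64))
print(np.all(np.abs(AB_fl - AB) <= bound))           # expect True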
ä When solving Ax = b by Gaussian Elimination, we will see that a bound on ‖e_x‖ such that this holds exactly:
A(x_computed + e_x) = b
is much harder to find than bounds on ‖E_A‖, ‖e_b‖ such that this holds exactly:
(A + E_A) x_computed = b + e_b.
Main result on inner products:
ä Forward error expression: |fl(x^T y) − x^T y| ≤ γ_n |x|^T |y|
with 0 ≤ γ_n ≤ 1.01 nu.
ä Here |x| is the elementwise absolute value of x, so |x|^T |y| is the inner product of the elementwise ('.∗') absolute values.
ä The above assumes nu ≤ .01. For u = 2.0 × 10^{-16}, this holds for n ≤ 4.5 × 10^{13}.
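A minimal check of this bound (Python sketch: single precision is the working arithmetic, double precision the reference; n = 10000 and the random vectors are arbitrary):

import numpy as np

rng = np.random.default_rng(1)
n = 10_000
u = 2.0**-24                              # unit round-off of IEEE single
gamma_n = n * u / (1 - n * u)             # here n*u ~ 6e-4, so nu <= .01 holds

x = rng.standard_normal(n).astype(np.float32)
y = rng.standard_normal(n).astype(np.float32)

ip_fl = float(np.dot(x, y))                                     # fl(x^T y) in single
ip = float(np.dot(x.astype(np.float64), y.astype(np.float64)))  # reference x^T y
bound = gamma_n * float(np.dot(np.abs(x), np.abs(y)))           # gamma_n * |x|^T |y|

print(abs(ip_fl - ip) <= bound)           # expect True (usually with a large margin)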
ä Consequence of lemma:
|fl(A ∗ B) − A ∗ B| ≤ γ_n |A| ∗ |B|
- Assume you use single precision, for which u = 2 × 10^{-6}. What is the largest n for which nu ≤ 0.01 holds? Any conclusions for the use of single precision arithmetic?
- What does the main result on inner products imply for the case when y = x? [Contrast the relative accuracy you get in this case vs. the general case when y ≠ x.]
- Show that for any x, y, there exist ∆x, ∆y such that
fl(x^T y) = (x + ∆x)^T y with |∆x| ≤ γ_n |x|,  and  fl(x^T y) = x^T (y + ∆y) with |∆y| ≤ γ_n |y|.
Supplemental notes: Floating Point Arithmetic
Example: In base 10 (for illustration), a number is stored as a 5-digit mantissa and a 2-digit exponent:
x = ± .d_1 d_2 d_3 d_4 d_5 × 10^{e_1 e_2}
Try to add 1000.2 = .10002e+04 and 1.07 = .10700e+01:
1000.2 → mantissa .10002, exponent 04 ;   1.07 → mantissa .10700, exponent 01
First task: align the decimal points. The number with the smallest exponent is (internally) rewritten so its exponent matches the largest one:
1.07 = 0.000107 × 10^4
Second task: add the aligned mantissas:
  0.100020
+ 0.000107
= 0.100127
Third task: round the result. The result has 6 digits but only 5 can be kept, so we either
ä Chop the result: .10012 ;  or
ä Round the result: .10013 .
Fourth task: normalize the result if needed (not needed here).
Result with rounding: mantissa .10013, exponent 04, i.e., 1001.3.
- Redo the same thing with 7000.2 + 4000.3 or 6999.2 + 4000.3.
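The worked example (and the exercise above) can be replayed with Python's decimal module, which lets us set a 5-digit mantissa and choose the rounding mode; a minimal sketch:

from decimal import Decimal, getcontext, ROUND_HALF_UP, ROUND_DOWN

getcontext().prec = 5                        # toy machine: 5-digit decimal mantissa

getcontext().rounding = ROUND_HALF_UP        # 'round' the result
print(Decimal("1000.2") + Decimal("1.07"))   # 1001.3  (= .10013e+04)

getcontext().rounding = ROUND_DOWN           # 'chop' the result (truncate toward zero)
print(Decimal("1000.2") + Decimal("1.07"))   # 1001.2  (= .10012e+04)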
Some More Examples
The IEEE standard
32-bit (Single precision):
| ± (sign) | 8 bits (exponent) | 23 bits (mantissa) |
ä In binary, the leading one of the mantissa does not need to be represented, so one bit is gained: the 'hidden bit'.
ä Largest exponent: 2^7 − 1 = 127; Smallest: −126. ['bias' of 127]
64-bit (Double precision):
| ± (sign) | 11 bits (exponent) | 52 bits (mantissa) |
ä Bias of 1023: if c is the content of the exponent field, the actual exponent value is c − 1023, i.e., the number is scaled by 2^{c−1023}.
ä e + bias = 2047 (exponent field all ones) is reserved for special use.
ä Largest exponent: 1023; Smallest: −1022.
ä Including the hidden bit, the mantissa has a total of 53 bits (52 bits represented, one hidden).
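These fields can be inspected directly; a minimal Python sketch (the helper double_bits is mine, not part of any standard library):

import struct

def double_bits(x):
    """Split an IEEE double into (sign, exponent field c, 52 stored mantissa bits)."""
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]   # the raw 64 bits as one integer
    sign = bits >> 63
    c = (bits >> 52) & 0x7FF              # 11-bit exponent field, biased by 1023
    mantissa = bits & ((1 << 52) - 1)     # 52 stored bits; the hidden bit is not stored
    return sign, c, mantissa

s, c, m = double_bits(1.0)
print(s, c, c - 1023, hex(m))    # 0 1023 0 0x0              ->  1.0  = +1.0 * 2^0
s, c, m = double_bits(-0.75)
print(s, c, c - 1023, hex(m))    # 1 1022 -1 0x8000000000000 -> -0.75 = -1.5 * 2^(-1)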
- Take the number 1.0 and see what happens if you add 1/2, 1/4, ..., 2^{-i}. Do not forget the hidden bit!
(Note: the sign and exponent fields together occupy the leading 12 bits.)
ä Conclusion:
fl(1 + 2^{-52}) ≠ 1   but:   fl(1 + 2^{-53}) == 1 !!
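This is easy to verify (Python, assuming IEEE double precision arithmetic):

# The gap between 1.0 and the next double is 2**-52, so adding half of that gap
# (2**-53) rounds back to 1.0 under round-to-nearest-even, while 2**-52 does not.
print(1.0 + 2.0**-52 != 1.0)   # True
print(1.0 + 2.0**-53 == 1.0)   # True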
Special Values
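As a sketch of the most common special values (±∞ and NaN, signalled by an all-ones exponent field) in Python:

import math

inf = float("inf")     # exponent field all ones, mantissa zero:    +infinity
nan = float("nan")     # exponent field all ones, mantissa nonzero: Not-a-Number

print(1.0 / inf)                          # 0.0
print(inf - inf)                          # nan: undefined operations produce NaN
print(nan == nan)                         # False: NaN compares unequal even to itself
print(math.isinf(inf), math.isnan(nan))   # True True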
Appendix to set 3: Analysis of inner products
Expand:
s_3 = x_1 y_1 (1 + η_1)(1 + ε_2)(1 + ε_3)
    + x_2 y_2 (1 + η_2)(1 + ε_2)(1 + ε_3)
    + x_3 y_3 (1 + η_3)(1 + ε_3)
ä Induction would show that [with the convention that ε_1 ≡ 0]
s_n = Σ_{i=1}^{n} x_i y_i (1 + η_i) Π_{j=i}^{n} (1 + ε_j)
ä For each of these products:
(1 + η_i) Π_{j=i}^{n} (1 + ε_j) = 1 + θ_i,  with |θ_i| ≤ γ_n, so:
s_n = Σ_{i=1}^{n} x_i y_i (1 + θ_i)  with |θ_i| ≤ γ_n, or:
fl( Σ_{i=1}^{n} x_i y_i ) = Σ_{i=1}^{n} x_i y_i + Σ_{i=1}^{n} x_i y_i θ_i  with |θ_i| ≤ γ_n
ä or (backward form):
fl( Σ_{i=1}^{n} x_i y_i ) = Σ_{i=1}^{n} x_i y_i (1 + θ_i)  with |θ_i| ≤ γ_n