Demystifying Floating Point - John Farrier - CppCon 2015
https://github.com/DigitalInBlue/CPPCon2015
© 2015 John Farrier, All Rights Reserved 1
Goals
• Understand the basics of floats
• Understand the language of floats
• Get a feel for when further investigation is required
• Have a few tools ready to help
• Be prepared to keep learning about IEEE floats
You are in good company…
“Nothing brings fear to my heart more than a floating point number.”
— Gerald Jay Sussman, Professor, MIT
“Modeling error is usually several orders of magnitude greater than floating point
error. People who nonchalantly model the real world and then sneer at floating point
as just an approximation strain at gnats and swallow camels.”
— John D. Cook, Singular Value Consulting
Commonly Committed Floating Point Fallacies
• “It’s floating point error”
• “All floating point involves magical rounding errors”
• “Linux and Windows handle floats differently”
• “Floating point represents an interval value near the actual value”
• “A double holds 15 decimal places and I only need 3, so I have nothing to worry about”
• “My programming language does better math than your programming language”*
*Ada is a bit ambiguous when it comes to math, but most modern languages…
93 pages
“Floating-point arithmetic is considered an esoteric subject by many people.”
IEEE Float Specification
• IEEE 754-1985, IEEE 854-1987, IEEE 754-2008
• Provide for portable, provably consistent math
• Ensure some significant mathematical identities hold true:
• 𝑥 + 𝑦 == 𝑦 + 𝑥
• 𝑥 + 0 == 𝑥
• 𝑖𝑓 𝑥 == 𝑦 , 𝑡ℎ𝑒𝑛 𝑥 − 𝑦 == 0
• $\frac{x}{\sqrt{x^2 + y^2}} \le 1$
Note: Finite real numbers may not have a perfect IEEE Float representation.
• Signed Infinity
• Overflow protection
• Signed Zero
• Underflow protection, preserves sign
• +0 = −0
Simple Example
As float (single precision):
zeroPointOne == 0.100000001490116119384765625000
zeroPointTwo == 0.200000002980232238769531250000
zeroPointThree == 0.300000011920928955078125000000
sum == 0.300000011920928955078125000000
As double (double precision):
zeroPointOne == 0.100000000000000005551115123126
zeroPointTwo == 0.200000000000000011102230246252
zeroPointThree == 0.299999999999999988897769753748
sum == 0.300000000000000044408920985006
1.0000000000000000
$(-1)^{signBit} \times 2^{exponent} \times 1.mantissa$
$(-1)^{0} \times 2^{0} \times 1.0$
The exponent is stored with a bias of 127, so an exponent of 0 is stored as 01111111, not 00000000:
[0][01111111][00000000000000000000000]
The next representable float sets the lowest mantissa bit:
[0][01111111][00000000000000000000001]
1.0000001192092896
Format                  Sign  Exponent  Mantissa  Total bits  Exponent bias  Bits of precision  Significant digits
Half (IEEE 754-2008)    1     5         10        16          15             11                 3-4
Single                  1     8         23        32          127            24                 6-9
Double                  1     11        52        64          1023           53                 15-17
x86 extended precision  1     15        64        80          16383          64                 18-21
Quad                    1     15        112       128         16383          113                33-36
http://www.cs.berkeley.edu/~wkahan/ieee754status/IEEE754.PDF
1.0e-38 (denormal: below the smallest normalized float, about 1.1755e-38)
[0][00000000][11011001110001111101110]
0.09999999e-37
http://blogs.msdn.com/b/dwayneneed/archive/2010/05/07/fun-with-floating-point.aspx
Floating Point Precision
The number of floats in each interval:
• 0.0 to 0.1 = 1,036,831,949
• 0.1 to 0.2 = 8,388,608
• 0.2 to 0.4 = 8,388,608
• 0.4 to 0.8 = 8,388,608
• 0.8 to 1.6 = 8,388,608
• 1.6 to 3.2 = 8,388,608
Storage of π
π = 3.14159265
πf= 3.14159274
Δ = 0.00000009
• Correct Rounding:
• Basic operations (add, subtract, multiply, divide, sqrt) should return the number nearest the
mathematical result.
• If there is a tie, round to the number with an even mantissa
[0][00000000][00000000000000000000000][G][R][S...]
“Guard Bit”, “Round Bit”, “Sticky Bit”
[G][R][S]
[0][-][-] - Round Down (do nothing)
[1][0][0] - Round Up if the mantissa LSB is 1
[1][0][1] - Round Up
[1][1][0] - Round Up
[1][1][1] - Round Up
Significance Error
• Compute the Area of a Triangle
• Heron’s Formula:
• $A = \sqrt{\frac{x+y+z}{2}\left(\frac{x+y+z}{2}-x\right)\left(\frac{x+y+z}{2}-y\right)\left(\frac{x+y+z}{2}-z\right)}$
• Kahan’s Algorithm:
• Sort x, y, z such that 𝑥 ≥ 𝑦 ≥ 𝑧
• If 𝑧 < 𝑥 − 𝑦, then no such triangle exists.
• Else $A = \frac{\sqrt{(x+(y+z)) \times (z-(x-y)) \times (z+(x-y)) \times (x+(y-z))}}{4}$
Heron: 9.999999809638328700000000000000
Kahan: 10.000000077021038000000000000000
delta: 0.000000267382709751018410000000
0.012345000170171261000000000000 !=
0.0148140005767345430
(-0.002469000406563282000000000000)
12.345000267028809000000000000000 !=
11.701118469238281000000000000000
(0.643881797790527340000000000000)
(Store simTime as uint64_t and get microsecond precision for 584555 years.)
Control: 0.000000000015915494616658421000
x*1.0e-10f: 0.000000000015915494616658421000(0.000000000000000000000000000000)
x/1.0e10f: 0.000000000015915492881934945000(0.000000000000000001734723475977)
Relative Error: 0.000000108995891423546710000000
IEEE 754 Exception  Result when traps disabled           Argument to trap handler
overflow            ±∞ or ±$x_{max}$                     round($x \cdot 2^{-\alpha}$)
underflow           0, $2^{e_{min}}$, or denormalized    round($x \cdot 2^{\alpha}$)
divide by zero      ±∞                                   invalid operation
invalid             NaN                                  invalid operation
inexact             round(x)                             round(x)
http://preshing.com/images/float-point-perf.png
Use Your Compiler’s Output
• warning C4244: ‘initializing’ : conversion from ‘double’ to
‘float’, possible loss of data
• warning C4056: overflow in floating point constant arithmetic
• warning C4305: 'identifier' : truncation from 'type1' to
'type2'
• warning: conversion to 'float' from 'int' may alter its value
• warning: floating constant exceeds range of ‘double’
Demystifying Floating Point
John Farrier, Booz Allen Hamilton