Demystifying Floating Point - John Farrier - CppCon 2015

Demystifying Floating Point

John Farrier, Booz Allen Hamilton

https://github.com/DigitalInBlue/CPPCon2015
© 2015 John Farrier, All Rights Reserved 1
Goals
• Understand the basics of floats
• Understand the language of floats
• Get a feel for when further investigation is required
• Have a few tools ready to help
• Be prepared to keep learning about IEEE floats

You are in good company…
“Nothing brings fear to my heart more than a floating point number.”
— Gerald Jay Sussman, Professor, MIT

You are in good company…
“Modeling error is usually several orders of magnitude greater than floating point
error. People who nonchalantly model the real world and then sneer at floating point
as just an approximation strain at gnats and swallow camels.”
— John D. Cook, Singular Value Consulting

Commonly Committed Floating Point Fallacies
• “It’s floating point error”
• “All floating point involves magical rounding errors”
• “Linux and Windows handle floats differently”
• “Floating point represents an interval value near the actual value”
• “A double holds 15 decimal places and I only need 3, so I have nothing to worry
about”
• “My programming language does better math than your programming language”*

*Ada is a bit ambiguous when it comes to math, but most modern languages…


StackOverflow
• All roads lead to:
What Every Computer Scientist Should Know About Floating-Point Arithmetic
David Goldberg, March, 1991 issue of Computing Surveys.

93 pages
“Floating-point arithmetic is considered an esoteric subject by many people.”


Anatomy of IEEE Floats

IEEE Float Specification
• IEEE 754-1985, IEEE 854-1987, IEEE 754-2008
• Provide for portable, provably consistent math
• Ensure some significant mathematical identities hold true:
• 𝑥 + 𝑦 == 𝑦 + 𝑥
• 𝑥 + 0 == 𝑥
• 𝑖𝑓 𝑥 == 𝑦 , 𝑡ℎ𝑒𝑛 𝑥 − 𝑦 == 0
• $\frac{x}{\sqrt{x^2 + y^2}} \le 1$


IEEE Float Specification
• Ensure every floating point number is unique
• Zero is a special case because of this
• Ensure every floating point number has an opposite
• Specifies algorithms for addition, subtraction, multiplication, division, and sqrt


IEEE Float Specification - Layout
• An approximation using scientific notation
• $x = (-1)^{s} \times 2^{e} \times 1.m$
• $x = (-1)^{\text{signBit}} \times 2^{\text{exponent}} \times 1.\text{mantissa}$
• $x = (-1)^{[0]} \times 2^{[10000000]} \times 1.[10010010000111111011011]$
• $x$ = [0][10000000][10010010000111111011011]
• 32-bits = 1 sign bit + 8 exponent bits + 23 mantissa bits
• 64-bits = 1 sign bit + 11 exponent bits + 52 mantissa bits

Note: Finite real numbers may not have a perfect IEEE Float representation.
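
The layout above can be checked directly. This is a minimal sketch, assuming `float` is a 32-bit IEEE 754 value on the target platform; `FloatFields` and `decompose` are illustrative names, not part of the talk's code.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical helper type: the three raw bit fields of a 32-bit float.
struct FloatFields
{
    std::uint32_t sign;     // 1 bit
    std::uint32_t exponent; // 8 bits, still biased by 127
    std::uint32_t mantissa; // 23 bits, implicit leading 1 not included
};

// Copy the bits out with memcpy (no pointer-cast aliasing) and mask each field.
inline FloatFields decompose(float value)
{
    static_assert(sizeof(float) == sizeof(std::uint32_t), "assumes a 32-bit float");
    std::uint32_t bits = 0;
    std::memcpy(&bits, &value, sizeof(bits));
    return FloatFields{bits >> 31, (bits >> 23) & 0xFFu, bits & 0x7FFFFFu};
}
```

Decomposing `3.14159265f` yields the fields shown in the slide: sign 0, exponent 10000000, mantissa 10010010000111111011011.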


IEEE Float Specification - Special Floats
• Divide by Zero
• 1/0
• Not a Number (NaN)
• $\frac{0}{0}$, $0 \times \infty$

• Signed Infinity
• Overflow protection
• Signed Zero
• Underflow protection, preserves sign
• +0 = −0
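
A short sketch of how these special values can be observed from C++, assuming IEEE 754 arithmetic with default (non-trapping) exception handling. `divide` and `isNegativeZero` are illustrative helpers; the `volatile` operands only force the division to happen at run time.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper: divide at run time so the compiler cannot fold
// (or warn about) a constant division by zero.
inline float divide(float numerator, float denominator)
{
    volatile float n = numerator;
    volatile float d = denominator;
    return n / d;
}

// True for negative zero: compares equal to 0 but has the sign bit set.
inline bool isNegativeZero(float value)
{
    return value == 0.0f && std::signbit(value);
}
```

With these helpers, `divide(1.0f, 0.0f)` is positive infinity, `divide(0.0f, 0.0f)` is a NaN (which compares unequal to itself), and `divide(1.0f, -0.0f)` is negative infinity because the zero keeps its sign.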


Now that we are experts…

Simple Example

auto zeroPointOne = 0.1f;

auto zeroPointTwo = 0.2f;
auto zeroPointThree = 0.3f;
auto sum = zeroPointOne + zeroPointTwo;

cppCon() << "zeroPointOne == " << zeroPointOne << "\n";

cppCon() << "zeroPointTwo == " << zeroPointTwo << "\n";
cppCon() << "zeroPointThree == " << zeroPointThree << "\n";
cppCon() << "sum == " << sum << "\n";


Simple Example

Single precision (float):
zeroPointOne == 0.100000001490116119384765625000
zeroPointTwo == 0.200000002980232238769531250000
zeroPointThree == 0.300000011920928955078125000000
sum == 0.300000011920928955078125000000

Double precision (double):
zeroPointOne == 0.100000000000000005551115123126
zeroPointTwo == 0.200000000000000011102230246252
zeroPointThree == 0.299999999999999988897769753748
sum == 0.300000000000000044408920985006
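
The two sets of output can be reproduced with a small sketch like the following. It uses `std::ostringstream` in place of the talk's `cppCon()` helper, and `fullDigits` and `add` are illustrative names. It assumes float arithmetic is evaluated in float precision, as is typical on x86-64.

```cpp
#include <cassert>
#include <iomanip>
#include <sstream>
#include <string>

// Format a value with 30 digits after the decimal point, like the slide output.
template <typename T>
std::string fullDigits(T value)
{
    std::ostringstream out;
    out << std::fixed << std::setprecision(30) << value;
    return out.str();
}

// Add two values in their own type (float or double).
template <typename T>
T add(T a, T b)
{
    return a + b;
}
```

In float the rounded sum happens to land on the same value as `0.3f`, while in double `add(0.1, 0.2)` and `0.3` are neighbors that differ in the last bit.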


Storage of 1.0

1.0000000000000000
$(-1)^{\text{signBit}} \times 2^{\text{exponent}} \times 1.\text{mantissa}$
$(-1)^{0} \times 2^{0} \times 1.0$
[0][00000000][00000000000000000000000] (a literal exponent of 0; this bit pattern is actually +0.0)
[0][01111111][00000000000000000000000] (the stored pattern; the exponent field holds 127)

Storage of 1.0

• The exponent is shift-127 encoded (stored with a bias of 127)

• exponent = [eeeeeeee] - 127, where [eeeeeeee] is the 8-bit exponent field
Storage of 1.0

1.0000000000000000
[0][01111111][00000000000000000000000]
[0][01111111][00000000000000000000001]
1.0000001192092896


“Epsilon”
• The difference between 1.0 and the next available floating point number
• Useful for programmatic “almost equal” computations
• std::numeric_limits<T>::epsilon()
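
One possible shape for an “almost equal” check built on epsilon is sketched below. `almostEqual` and its `maxEpsilons` parameter are assumptions made for illustration, not a standard API, and the right tolerance depends on the computation being tested.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <limits>

// Relative comparison: the allowed difference scales with the larger magnitude.
// maxEpsilons is a hypothetical tuning knob, not a standard value.
template <typename T>
bool almostEqual(T a, T b, T maxEpsilons = T(4))
{
    const T scale = std::max(std::fabs(a), std::fabs(b));
    return std::fabs(a - b) <= maxEpsilons * std::numeric_limits<T>::epsilon() * scale;
}
```

Because the tolerance is relative, this form behaves poorly when both values are very close to zero; an additional absolute tolerance is a common companion in that case.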


Significant Digits
• Significant digits measure precision
• Remember the “exponent” field
• They are not a magnitude
• How close is close enough?
• Define as significant digits, not absolute error

Format                   Sign  Exponent  Mantissa  Total bits  Exponent bias  Bits of precision  Significant digits
Half (IEEE 754-2008)     1     5         10        16          15             11                 3-4
Single                   1     8         23        32          127            24                 6-9
Double                   1     11        52        64          1023           53                 15-17
x86 extended precision   1     15        64        80          16383          64                 18-21
Quad                     1     15        112       128         16383          113                33-36
http://www.cs.berkeley.edu/~wkahan/ieee754status/IEEE754.PDF


Storage of the Very Small
• For 32-bit floats, the minimum normalized base 10 exponent is -37 (the smallest normalized float is about 1.18e-38).
• How is $1.0 \times 10^{-38}$ represented?

1.0e-38
[0][00000000][11011001110001111101110]
0.09999999e-37


“Denormalized Number”
• Numbers whose exponent field is all zeros (and whose mantissa is not zero)
• Required when a value's exponent is below the minimum normalized exponent
• Helps prevent abrupt underflow to zero

1.0e-38
[0][00000000][11011001110001111101110]
0.09999999e-37
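
A quick way to test whether a value is denormalized is `std::fpclassify`. The sketch below assumes a 32-bit IEEE float and no flush-to-zero compiler options; `bitsOf` and `isDenormal` are illustrative names. The bit pattern shown above is the one stored for `1.0e-38f`, whereas `1.0e-37f` is still a normalized float.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>
#include <limits>

// Raw bit pattern of a 32-bit float.
inline std::uint32_t bitsOf(float value)
{
    static_assert(sizeof(float) == sizeof(std::uint32_t), "assumes a 32-bit float");
    std::uint32_t bits = 0;
    std::memcpy(&bits, &value, sizeof(bits));
    return bits;
}

// True when the value is denormalized (subnormal): exponent field of zero
// with a nonzero mantissa.
inline bool isDenormal(float value)
{
    return std::fpclassify(value) == FP_SUBNORMAL;
}
```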


Floating Point Precision
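• Representable values are not uniformly spaced
• Most precision lies between 0.0 and 0.1
• Precision falls away as magnitude grows: the gap between neighboring floats doubles at each power of two
• std::nextafter returns the next representable value in a given direction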


Floating Point Precision

http://blogs.msdn.com/b/dwayneneed/archive/2010/05/07/fun-with-floating-point.aspx
Floating Point Precision
The number of floats in each interval
• 0.0 to 0.1 = 1,036,831,949
• 0.1 to 0.2 = 8,388,608
• 0.2 to 0.4 = 8,388,608
• 0.4 to 0.8 = 8,388,608
• 0.8 to 1.6 = 8,388,608
• 1.6 to 3.2 = 8,388,608
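
These counts can be reproduced by subtracting bit patterns, since non-negative IEEE floats sort in the same order as their bits. The sketch assumes a 32-bit IEEE float; `floatsBetween` and `gapAbove` are illustrative names.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>
#include <limits>

// Number of representable floats in [low, high), for 0 <= low <= high.
inline std::uint32_t floatsBetween(float low, float high)
{
    std::uint32_t lowBits = 0;
    std::uint32_t highBits = 0;
    std::memcpy(&lowBits, &low, sizeof(lowBits));
    std::memcpy(&highBits, &high, sizeof(highBits));
    return highBits - lowBits;
}

// Distance from a finite value to the next representable float above it.
inline float gapAbove(float value)
{
    return std::nextafter(value, std::numeric_limits<float>::infinity()) - value;
}
```

For example, `gapAbove(1.0f)` is 2^-23 (about 1.19e-7), while `gapAbove(1.0e6f)` is already 0.0625.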


Errors in Floating Point

Storage of π
π = 3.14159265
πf= 3.14159274
Δ = 0.00000009


Measuring Error: “Ulps”
• Units in Last Place
• “Harrison” and “Goldberg” definitions
• (6-8 definitions floating around)
• “The gap between the two floating-point numbers nearest to x, even if x is one of them.” – W. Kahan
• https://www.cs.berkeley.edu/~wkahan/LOG10HAF.TXT
• IEEE 754 requires that elementary arithmetic operations are correctly rounded to within 0.5 ulps
• Transcendental functions are generally rounded to between 0.5 and 1.0 ulps

π = 3.14159265
πf = 3.14159274
Δ = 0.00000009
9 ulps (units in the last decimal digit shown)


Measuring Error: “Relative Error”
• The difference between the “real” number and the approximated number, divided by the “real” number.

$\frac{\left|\pi - \pi_f\right|}{\pi} = 2.864789 \times 10^{-8}$

π = 3.14159265
πf = 3.14159274
Δ = 0.00000009
9 ulps
2.864789e-8 β
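
Both error measures can be sketched in a few lines. The ulp count here is the binary one (how many representable floats apart two values are), which is only one of the several definitions mentioned earlier; `ulpDistance` and `relativeError` are illustrative names, and a 32-bit IEEE float is assumed.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Distance between two positive finite floats, counted in representable values.
inline std::uint32_t ulpDistance(float a, float b)
{
    std::uint32_t aBits = 0;
    std::uint32_t bBits = 0;
    std::memcpy(&aBits, &a, sizeof(aBits));
    std::memcpy(&bBits, &b, sizeof(bBits));
    return aBits > bBits ? aBits - bBits : bBits - aBits;
}

// Relative error of an approximation against a reference value.
inline double relativeError(double reference, double approximation)
{
    return std::fabs(reference - approximation) / std::fabs(reference);
}
```

Applied to π stored as a float, the relative error computed from unrounded values comes out near 2.78e-8, slightly below a figure derived from the rounded Δ, and the stored value is within half a binary ulp of π.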


Rounding Error
• Induced by approximating an infinite range of numbers into a finite number of bits
• Math is done exactly, then rounded*
• Towards the nearest
• Towards Zero
• Towards positive infinity (round up)
• Towards negative infinity (round down)

*Look up “exactly rounded” as well


Rounding Error
• What about rounding the half-way case? (i.e. 0.5)
• Round Up vs. Round Even

• Correct Rounding:
• Basic operations (add, subtract, multiply, divide, sqrt) should return the number nearest the
mathematical result.
• If there is a tie, round to the number with an even mantissa
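
The tie rule is easy to observe just above 2^24, where floats are spaced 2 apart and every odd integer is an exact half-way case. A minimal sketch, assuming the default round-to-nearest mode; `addAsFloat` is an illustrative helper.

```cpp
#include <cassert>

// Add in float at run time. The volatile store keeps the compiler from
// evaluating the sum at a wider precision or folding it away.
inline float addAsFloat(float a, float b)
{
    volatile float sum = a + b;
    return sum;
}
```

`addAsFloat(16777216.0f, 1.0f)` stays at 16777216 (tie, rounded down to the even mantissa), while `addAsFloat(16777216.0f, 3.0f)` becomes 16777220 (tie, rounded up to the even mantissa).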


“Guard Bit”, “Round Bit”, “Sticky Bit”
• Only used while doing calculations
• Not stored in the float itself
• The mantissa is shifted in calculations to align the radix point
• The guard bits and round bits are extra precision
• The sticky bit is an OR of anything that shifts through it

[0][00000000][00000000000000000000000][G][R][S...]
“Guard Bit”, “Round Bit”, “Sticky Bit”
[G][R][S]
[0][-][-] - Round Down (do nothing)
[1][0][0] - Round Up if the mantissa LSB is 1
[1][0][1] - Round Up
[1][1][0] - Round Up
[1][1][1] - Round Up
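
The table above reduces to a three-step decision. This sketch only models the decision itself, not a full hardware rounding path, and `shouldRoundUp` is an illustrative name.

```cpp
#include <cassert>

// Round-to-nearest-even decision from the table.
// lsb is the lowest stored mantissa bit; guard, round, and sticky are the
// extra bits that follow it during the calculation.
inline bool shouldRoundUp(bool lsb, bool guard, bool round, bool sticky)
{
    if (!guard)
    {
        return false; // below the half-way point: round down (do nothing)
    }
    if (round || sticky)
    {
        return true; // above the half-way point: round up
    }
    return lsb; // exact tie: round up only if that makes the mantissa even
}
```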

[0][00000000][00000000000000000000000][G][R][S...]
Significance Error
• Compute the Area of a Triangle
• Heron’s Formula:

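• $A = \sqrt{\frac{x+y+z}{2}\left(\frac{x+y+z}{2}-x\right)\left(\frac{x+y+z}{2}-y\right)\left(\frac{x+y+z}{2}-z\right)}$

• Kahan’s Algorithm:
• Sort x, y, z such that $x \ge y \ge z$
• If $z < x - y$, then no such triangle exists.

• Else $A = \frac{\sqrt{(x+(y+z)) \times (z-(x-y)) \times (z+(x-y)) \times (x+(y-z))}}{4}$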
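
The two formulas can be sketched as templates so the same code runs in float and double. These are illustrative implementations, not the `HeronsFormula` and `KahansFormula` functions from the talk's repository, and returning -1 for an impossible triangle is a convention chosen only for this sketch.

```cpp
#include <cassert>
#include <cmath>
#include <utility>

// Textbook Heron's formula: s = (x + y + z) / 2, A = sqrt(s (s-x) (s-y) (s-z)).
template <typename T>
T heronArea(T x, T y, T z)
{
    const T s = (x + y + z) / T(2);
    return std::sqrt(s * (s - x) * (s - y) * (s - z));
}

// Kahan's rearrangement. The parentheses are part of the algorithm.
template <typename T>
T kahanArea(T x, T y, T z)
{
    // Sort so that x >= y >= z.
    if (x < y) { std::swap(x, y); }
    if (y < z) { std::swap(y, z); }
    if (x < y) { std::swap(x, y); }
    if (z < x - y)
    {
        return T(-1); // no such triangle exists
    }
    return std::sqrt((x + (y + z)) * (z - (x - y)) * (z + (x - y)) * (x + (y - z))) / T(4);
}
```

For a needle-like triangle with sides 100000, 99999.99979, and 0.00029, the float version of Heron's formula collapses to 0 because `s - x` cancels completely, while the double versions of both formulas land close to 10.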


Significance Error
/// Area of a triangle
/// From http://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html
/// From https://www.cs.berkeley.edu/~wkahan/Triangle.pdf
TEST(CPPCon2015, AreaOfATriangleFloat)
{
    // In float, b rounds to 100000.0f, so a == b here.
    const auto a = 100000.0f;
    const auto b = 99999.99979f;
    const auto c = 0.00029f;

    // Kahan's algorithm expects the sides sorted: a >= b >= c.
    ASSERT_TRUE(a >= b);
    ASSERT_TRUE(b >= c);

    auto heronsFormula = HeronsFormula(a, b, c);
    auto kahansFormula = KahansFormula(a, b, c);

    EXPECT_NE(kahansFormula, heronsFormula);
    cppCon() << "Kahan: " << kahansFormula;
    cppCon() << "Heron: " << heronsFormula;
    cppCon() << "delta: " << kahansFormula - heronsFormula << "\n";
}


Significance Error
Single precision (float):
Heron: 0.000000000000000000000000000000
Kahan: 14.500000000000000000000000000000
delta: 14.500000000000000000000000000000

Hint: The Answer is 10.0

Double precision (double):
Heron: 9.999999809638328700000000000000
Kahan: 10.000000077021038000000000000000
delta: 0.000000267382709751018410000000

Significance Error - Use Stable Algorithms
• Loss of Significance
• Keep big numbers with big numbers, little numbers with little numbers.
• Parentheses can help
• Analysis of Algorithms is Critical
• The compiler won’t re-arrange your math if it cannot prove it would yield the same
result, even if the computation would be faster
• $x = b - ((a + b) - a)$ should not be replaced with $x = 0$
• See the “Kahan’s Algorithm” example
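
An expression such as b - ((a + b) - a) is zero on paper, but in floating point it recovers whatever part of b was lost when it was added to a much larger a. A minimal sketch, assuming float arithmetic in the default rounding mode; `lostInAddition` is an illustrative name.

```cpp
#include <cassert>

// Evaluates b - ((a + b) - a) in float, one rounded operation at a time.
// The volatile stores force each intermediate to be rounded to float.
inline float lostInAddition(float a, float b)
{
    volatile float sum = a + b;         // rounded to float
    volatile float recovered = sum - a; // the part of b that actually got added
    return b - recovered;
}
```

With `a = 1.0e8f` and `b = 1.0f`, neighboring floats near a are 8 apart, so the sum rounds back to a and the function returns 1.0f rather than 0. This is why a compiler that respects IEEE semantics cannot simplify the expression.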


Significance Error – Simulation Time
• Time is used to compute distance, velocity, acceleration
• High frequency phenomena
• Sensors: Doppler shift, pulse compression, PRF
• These computed values feed into other computed values which may or may not
require a time component
• Thousands of computations per frame of simulation
• Thousands of little compounding errors per frame
• Combine with poor or nonexistent testing


Significance Error – Simulation Time
• Accumulate Time (works for arbitrary time steps)

auto totalFrames = size_t(0);
auto frameLength = 0.01f;
auto simTime = 0.0f;

// Run 120 Frames
auto simTime120Frames = 1.20f;
for(; totalFrames < 120; totalFrames++)
{
    simTime += frameLength;
}

• Delta Time (works for fixed time steps)

auto totalFrames = size_t(0);
auto frameLength = 0.01f;
auto simTime = 0.0f;

// Run 120 Frames
auto simTime120Frames = 1.20f;
totalFrames = 120;
simTime = totalFrames * frameLength;
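
The two approaches can be wrapped in functions and compared side by side. This is a sketch with illustrative names (`accumulatedTime`, `multipliedTime`), assuming float arithmetic is evaluated in float precision.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Accumulate: add the frame length once per frame, so rounding error compounds.
inline float accumulatedTime(std::size_t frames, float frameLength)
{
    float simTime = 0.0f;
    for (std::size_t i = 0; i < frames; ++i)
    {
        simTime += frameLength;
    }
    return simTime;
}

// Delta: derive the time from the frame count with a single multiplication.
inline float multipliedTime(std::size_t frames, float frameLength)
{
    return static_cast<float>(frames) * frameLength;
}
```

Each addition in the loop is rounded, and once the running total is large the rounding error of every step points the same way. The multiplication is rounded exactly once, so its error stays within one ulp of the exact product.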

Significance Error – Simulation Time
Accumulate Time (works for arbitrary time steps)
• 1.199999999999999955591079014994 != 1.199999213218688964843750000000 (0.000000786781310990747329014994)
• 12.000000000000000000000000000000 != 12.000179290771484375000000000000 (-0.000179290771484375000000000000)
• 120.000000000000000000000000000000 != 120.007225036621093750000000000000 (-0.007225036621093750000000000000)
• 600.000000000000000000000000000000 != 600.274414062500000000000000000000 (-0.274414062500000000000000000000)
• 3600.000000000000000000000000000000 != 3603.204101562500000000000000000000 (-3.204101562500000000000000000000)

Delta Time (works for fixed time steps)
• 1.200000047683715820312500000000 != 1.199999928474426269531250000000 (0.000000119209289550781250000000)
• 12.000000000000000000000000000000 == 12.000000000000000000000000000000 (0.000000000000000000000000000000)
• 120.000000000000000000000000000000 == 120.000000000000000000000000000000 (0.000000000000000000000000000000)
• 600.000000000000000000000000000000 == 600.000000000000000000000000000000 (0.000000000000000000000000000000)
• 3600.000000000000000000000000000000 == 3600.000000000000000000000000000000 (0.000000000000000000000000000000)

Significance Error – Simulation Time
• “Sub-Microsecond Precision”
• 1.2345e-6 seconds per frame (810044.5 Hz)

0.012345000170171261000000000000 !=
0.0148140005767345430
(-0.002469000406563282000000000000)

12.345000267028809000000000000000 !=
11.701118469238281000000000000000
(0.643881797790527340000000000000)

Significance Error – Don’t use IEEE Floats?
• Use integers
• Very fast
• Trade more precision for less range
• Only input/output may be impacted by floating point conversions
• Financial applications often represent money as an integer count of cents or tenths of a cent
• Use a math library
• Slower
• Define your own level of accuracy
• MPFR (w/C++ Wrapper), TTMath, Boost, GMP C++
• CRlibm (Correctly Rounded Mathematical Library)

(Store simTime as uint64_t and get microsecond precision for 584555 years.)
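
A sketch of the integer approach for simulation time: count whole microseconds in a `std::uint64_t` and convert to floating point only at the edges. `SimClock` is a hypothetical class written for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical simulation clock that counts whole microseconds.
class SimClock
{
public:
    explicit SimClock(std::uint64_t frameMicros) : frameMicros_(frameMicros) {}

    // Exact integer addition: no rounding, so no error to compound.
    void advanceFrame() { elapsedMicros_ += frameMicros_; }

    std::uint64_t elapsedMicros() const { return elapsedMicros_; }

    // Convert to floating point only where a consumer needs seconds.
    double elapsedSeconds() const { return static_cast<double>(elapsedMicros_) / 1.0e6; }

private:
    std::uint64_t frameMicros_;
    std::uint64_t elapsedMicros_ = 0;
};
```

Advancing a 10,000 microsecond frame 360,000 times gives exactly 3,600,000,000 microseconds, in contrast to the accumulated float result shown earlier.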

Algebraic Assumption Error
• Mathematical Identities
• Traditional identities such as the associative and distributive rules do not always hold (addition and multiplication do remain commutative)
• Distributive Rule does not apply: $x \times y - x \times z \ne x \times (y - z)$
• Associative Rule does not apply: $(x + y) + z \ne x + (y + z)$
• Cannot interchange division and multiplication: $\frac{x}{10.0} \ne x \times 0.1$

Does a naïve compiler make these assumptions too?

https://msdn.microsoft.com/library/aa289157.aspx

Algebraic Assumption Error
const auto oneRadian = 0.15915494309f;
const auto control = 0.000000000015915494309f;

const auto oneRadianMultiplied = oneRadian * 1.0e-10f;

const auto oneRadianDivided = oneRadian / 1.0e10f;

Control: 0.000000000015915494616658421000
x*1.0e-10f: 0.000000000015915494616658421000(0.000000000000000000000000000000)
x/1.0e10f: 0.000000000015915492881934945000(0.000000000000000001734723475977)
Relative Error: 0.000000108995891423546710000000
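
The comparison can be packaged as two small functions. This is a sketch with illustrative names, assuming float arithmetic in the default rounding mode and no fast-math style compiler options.

```cpp
#include <cassert>

// Scale by multiplying with the float nearest to 1.0e-10 (not exactly 1.0e-10).
inline float scaleByMultiply(float x)
{
    return x * 1.0e-10f;
}

// Scale by dividing by 1.0e10f, which is exactly representable as a float.
inline float scaleByDivide(float x)
{
    return x / 1.0e10f;
}
```

For `0.15915494309f` the two results differ by one unit in the last place, which is why a compiler cannot silently turn a division by a constant into a multiplication by its reciprocal.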

Floating Point Exceptions
• Enable floating point exceptions to be alerted when things go awry.

IEEE 754 Exception   Result when traps disabled              Argument to trap handler
overflow             $\pm\infty$ or $\pm x_{max}$            $\mathrm{round}(x 2^{-\alpha})$
underflow            $0$, $2^{e_{min}}$ or denormalized      $\mathrm{round}(x 2^{\alpha})$
divide by zero       $\pm\infty$                             invalid operation
invalid              NaN                                     invalid operation
inexact              $\mathrm{round}(x)$                     $\mathrm{round}(x)$

$x$ is the exact result of the operation

$\alpha = 192$ for single precision, $1536$ for double
$x_{max} = 1.111 \ldots 111 \times 2^{e_{max}}$.
See <cfenv> - Floating Point Environment
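
Standard C++ does not provide a portable way to turn these exceptions into traps, but `<cfenv>` does let a program poll the status flags. The sketch below takes that polling approach; `flagsRaisedByDivide` is an illustrative name, and strictly conforming code would also enable `#pragma STDC FENV_ACCESS ON` where the compiler supports it.

```cpp
#include <cassert>
#include <cfenv>

// Performs n / d at run time and reports which of the requested IEEE
// exception flags the division raised. Polls flags; installs no trap handler.
inline int flagsRaisedByDivide(double n, double d, int flagsOfInterest)
{
    volatile double numerator = n;
    volatile double denominator = d;
    std::feclearexcept(FE_ALL_EXCEPT);
    // The volatile store keeps the division between the two <cfenv> calls.
    volatile double quotient = numerator / denominator;
    (void)quotient;
    return std::fetestexcept(flagsOfInterest);
}
```

For example, `flagsRaisedByDivide(1.0, 0.0, FE_DIVBYZERO)` reports the divide-by-zero flag, and `flagsRaisedByDivide(1.0, 3.0, FE_INEXACT)` reports inexact because the quotient had to be rounded.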

But Wait! There’s More!
• Binary to Decimal Conversion Error
• Summation Error
• Propagation Error
• Underflow, Overflow
• Type Narrowing/Widening Rules

Miscellaneous Notes

http://preshing.com/images/float-point-perf.png
Use Your Compiler’s Output
• warning C4244: ‘initializing’ : conversion from ‘double’ to
‘float’, possible loss of data
• warning C4056: overflow in floating point constant arithmetic
• warning C4305: 'identifier' : truncation from 'type1' to
'type2'
• warning: conversion to 'float' from 'int' may alter its value
• warning: floating constant exceeds range of ‘double’

Fused Multiply-Add (FMA)
• a = a + (b * c); a += b * c;
• Multiplier-accumulator (MAC Unit)
• One rounding
• Compiler Options
• GCC = -mfma
• VC++ = #pragma fp_contract (off)
• Reference: FMA3, FMA4
• http://en.wikipedia.org/wiki/FMA_instruction_set
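
The effect of the single rounding can be seen with `std::fma` from `<cmath>`, which is specified to round once whether or not the hardware has an FMA unit. A minimal sketch with illustrative function names:

```cpp
#include <cassert>
#include <cmath>

// a * b + c with two roundings: the product is rounded before the add.
// The volatile store blocks the compiler from contracting this into an FMA.
inline double multiplyThenAdd(double a, double b, double c)
{
    volatile double product = a * b;
    return product + c;
}

// a * b + c with a single rounding, via the standard library.
inline double fusedMultiplyAdd(double a, double b, double c)
{
    return std::fma(a, b, c);
}
```

With a = 1 + 2^-27 and b = 1 - 2^-27, the exact product is 1 - 2^-54. The separate multiply rounds that to 1.0, so adding -1.0 gives 0, while the fused version returns -2^-54.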

Streaming SIMD Extensions (SSE)
• SSE can provide significant performance gains
• Supports integer, floating point, logical, conversion, shift, and shuffle operations
• C, C++, Fortran do not natively support SSE, but compiler-specific support exists

Float Tricks
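/// Fast reciprocal square root approximation for x > 0.25
/// Quake’s float Q_rsqrt(float number) is much more entertaining.
/// Requires <cstdint> and <cstring>; assumes float is a 32-bit IEEE float.
inline float FastInvSqrt(float x)
{
    std::uint32_t bits;
    std::memcpy(&bits, &x, sizeof(bits)); // read the bits without a pointer cast
    // Unsigned arithmetic: the signed form of this expression overflows int.
    bits = ((0x3f800000u << 1) + 0x3f800000u - bits) >> 1;
    float y;
    std::memcpy(&y, &bits, sizeof(y));
    return y * (1.47f - 0.47f * x * y * y);
}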

Testing
• Design for Numerical Stability
• Perform Meaningful Testing
• Document assumptions
• Track sources of approximation
• Quantify goodness
• Well conditioned algorithms
• Backward error analysis
• Are the outputs identical for slightly modified inputs?

TEST(CPPCon2015, pointOnePlusPointTwo)
{
    auto zeroPointOne = 0.1f;
    auto zeroPointTwo = 0.2f;
    auto zeroPointThree = 0.3f;
    auto sum = zeroPointOne + zeroPointTwo;

    EXPECT_DOUBLE_EQ(0.3f, zeroPointThree);

    EXPECT_EQ(0.3f, zeroPointThree);
    EXPECT_EQ(0.3f, sum);

    EXPECT_EQ(zeroPointThree, sum);
    EXPECT_DOUBLE_EQ(zeroPointThree, sum);
}

http://xkcd.com/217/
http://www.explainxkcd.com/wiki/index.php/217:_e_to_the_pi_Minus_pi

FloatingPoint Handout
No ratings yet
FloatingPoint Handout
122 pages
Floating Point
No ratings yet
Floating Point
3 pages
Rounding Errors: Course Website
No ratings yet
Rounding Errors: Course Website
34 pages
Floating Point
No ratings yet
Floating Point
16 pages
Floating-Point Numbers
No ratings yet
Floating-Point Numbers
23 pages
Floating Point & fixed point Representation_BCA II
No ratings yet
Floating Point & fixed point Representation_BCA II
24 pages
Floating-Point Numbers and Operations Representation
No ratings yet
Floating-Point Numbers and Operations Representation
8 pages
8.3 Floating Point Numbers
No ratings yet
8.3 Floating Point Numbers
19 pages
2.4 Floating Points
No ratings yet
2.4 Floating Points
36 pages
181
No ratings yet
181
11 pages
Floating Point
No ratings yet
Floating Point
26 pages
The IEEE Standard For Floating Point Arithmetic
No ratings yet
The IEEE Standard For Floating Point Arithmetic
9 pages
Floating Point Sept 6, 2006 15-213: "The Course That Gives CMU Its Zip!"
No ratings yet
Floating Point Sept 6, 2006 15-213: "The Course That Gives CMU Its Zip!"
34 pages
#3 - Floating Point
No ratings yet
#3 - Floating Point
38 pages
9-Algorithms For Floating Point Arithmetic Operations-22-01-2024
No ratings yet
9-Algorithms For Floating Point Arithmetic Operations-22-01-2024
49 pages
Ponto Flutuante
No ratings yet
Ponto Flutuante
87 pages
Floating Point
No ratings yet
Floating Point
26 pages
Floating Points
No ratings yet
Floating Points
31 pages
Lecture 10 (Temp)
No ratings yet
Lecture 10 (Temp)
50 pages
Floating Point Representation of Numbers: Wide Range
No ratings yet
Floating Point Representation of Numbers: Wide Range
11 pages
MATH1070 2 Error and Computer Arithmetic PDF
No ratings yet
MATH1070 2 Error and Computer Arithmetic PDF
60 pages
MATH1070 2 Error and Computer Arithmetic
No ratings yet
MATH1070 2 Error and Computer Arithmetic
60 pages
Scientific Computation (Floating Point Numbers)
No ratings yet
Scientific Computation (Floating Point Numbers)
4 pages
"The Course That Gives CMU Its Zip!": Topics
No ratings yet
"The Course That Gives CMU Its Zip!": Topics
30 pages
CEF352 Lect2
No ratings yet
CEF352 Lect2
18 pages
Representing Fractions - Fixed Point: The Problem
No ratings yet
Representing Fractions - Fixed Point: The Problem
35 pages
COA
No ratings yet
COA
14 pages
Floating-Point Numbers and Round-Off Errors by Kusal Kaluarachchi Medium
No ratings yet
Floating-Point Numbers and Round-Off Errors by Kusal Kaluarachchi Medium
2 pages
Module 2 - PART D Floating
No ratings yet
Module 2 - PART D Floating
30 pages
Floating Point: - We Need A Way To Represent
No ratings yet
Floating Point: - We Need A Way To Represent
14 pages
Cosc 2150: Computer Organization: Chapter 9, Part 3 Floating Point Numbers
No ratings yet
Cosc 2150: Computer Organization: Chapter 9, Part 3 Floating Point Numbers
39 pages
Computer Arithmetic Representations
No ratings yet
Computer Arithmetic Representations
24 pages
IEEE 754 Floating Point Formats
No ratings yet
IEEE 754 Floating Point Formats
12 pages
Lecture 2
No ratings yet
Lecture 2
27 pages
COA Module 2
No ratings yet
COA Module 2
65 pages
Implementation of IEEE 754 Compliant Single Precision Floating-Point Adder Unit Supporting Denormal Inputs On Xilinx FPGA
No ratings yet
Implementation of IEEE 754 Compliant Single Precision Floating-Point Adder Unit Supporting Denormal Inputs On Xilinx FPGA
5 pages
An Introduction To Floating Point Arithmetic by Example: Pat Quillen
No ratings yet
An Introduction To Floating Point Arithmetic by Example: Pat Quillen
33 pages
Floating Point Arithmetic: Numbers
No ratings yet
Floating Point Arithmetic: Numbers
14 pages
Ieee Arith
No ratings yet
Ieee Arith
3 pages
This Unit: Arithmetic and ALU Design Floating Point Arithmetic
No ratings yet
This Unit: Arithmetic and ALU Design Floating Point Arithmetic
8 pages
Floating Point Arithmetic Class
No ratings yet
Floating Point Arithmetic Class
24 pages
IEEE 754 Floating Point Notes
No ratings yet
IEEE 754 Floating Point Notes
4 pages
Floating Point Numbers: CS031 September 12, 2011
No ratings yet
Floating Point Numbers: CS031 September 12, 2011
22 pages
10 MIPS Floating Point Arithmetic
No ratings yet
10 MIPS Floating Point Arithmetic
28 pages
Floating Point
No ratings yet
Floating Point
33 pages
Lecture11 Slides 1
No ratings yet
Lecture11 Slides 1
52 pages
Lab 3
No ratings yet
Lab 3
5 pages
Lect4 Floats
No ratings yet
Lect4 Floats
64 pages
CS137Part03 Floats Mathlib Root Finding Post
No ratings yet
CS137Part03 Floats Mathlib Root Finding Post
51 pages
NVIDIA-CUDA-Floating-Point
No ratings yet
NVIDIA-CUDA-Floating-Point
7 pages
"The Course That Gives CMU Its Zip!": Topics
No ratings yet
"The Course That Gives CMU Its Zip!": Topics
30 pages
COMP0068 Lecture10 High Level Data Types
No ratings yet
COMP0068 Lecture10 High Level Data Types
25 pages
Complete Floating Point (Blog)
No ratings yet
Complete Floating Point (Blog)
18 pages
5 Data - Floating - Point v1
No ratings yet
5 Data - Floating - Point v1
25 pages
Floating Point Arithmetic: Numbers
No ratings yet
Floating Point Arithmetic: Numbers
41 pages
Floating-Point Arithmetic Floating-Point Arithmetic Floating-Point Arithmetic Floating-Point Arithmetic Floating-Point Arithmetic 33333
No ratings yet
Floating-Point Arithmetic Floating-Point Arithmetic Floating-Point Arithmetic Floating-Point Arithmetic Floating-Point Arithmetic 33333
18 pages
The World Is Not Just Integers: Programming Languages Support Numbers With Fraction
No ratings yet
The World Is Not Just Integers: Programming Languages Support Numbers With Fraction
51 pages
CHAP 03e
No ratings yet
CHAP 03e
32 pages
Essential Computer Hardware: The Illustrated Guide to Understanding Computer Systems
From Everand
Essential Computer Hardware: The Illustrated Guide to Understanding Computer Systems
Kevin Wilson
No ratings yet
#1 Book on Python Programming
From Everand
#1 Book on Python Programming
Minhaj
No ratings yet
Reactive Stream Processing Rx4DDS - Sumant Tambe - CppCon 2015
No ratings yet
Reactive Stream Processing Rx4DDS - Sumant Tambe - CppCon 2015
51 pages
The Birth of Study Group 14 - Nicolas Guillemot, Sean Middleditch, Michael Wong - CppCon 2015
No ratings yet
The Birth of Study Group 14 - Nicolas Guillemot, Sean Middleditch, Michael Wong - CppCon 2015
44 pages
QT - Modern User Interfaces For C++ - Milian Wolff - CppCon 2015
No ratings yet
QT - Modern User Interfaces For C++ - Milian Wolff - CppCon 2015
43 pages
C++ On The Web - JF Bastien - CppCon 2015
No ratings yet
C++ On The Web - JF Bastien - CppCon 2015
24 pages
Functional Design Explained - David Sankel - CppCon 2015
No ratings yet
Functional Design Explained - David Sankel - CppCon 2015
43 pages
Simple Extensible Pattern Matching With C++14 - John Bandela - CppCon 2015
No ratings yet
Simple Extensible Pattern Matching With C++14 - John Bandela - CppCon 2015
118 pages
C++ Metaprogramming - Fedor Pikus - CppCon 2015
100% (1)
C++ Metaprogramming - Fedor Pikus - CppCon 2015
76 pages
Introducing Brigand - Edouard Alligand and Joel Falcou - CppCon 2015
No ratings yet
Introducing Brigand - Edouard Alligand and Joel Falcou - CppCon 2015
9 pages
Contracts For Dependable C++ - Gabriel Dos Reis - CppCon 2015
No ratings yet
Contracts For Dependable C++ - Gabriel Dos Reis - CppCon 2015
35 pages
RCPP - Seamless R and C++ Integration - Matt P. Dziubinski - CppCon 2015
No ratings yet
RCPP - Seamless R and C++ Integration - Matt P. Dziubinski - CppCon 2015
137 pages
C++ Multi-Dimensional Arrays For Computational Physics and Applied Mathematics - Pramod Gupta - CppCon 2015
No ratings yet
C++ Multi-Dimensional Arrays For Computational Physics and Applied Mathematics - Pramod Gupta - CppCon 2015
43 pages
From Functional To Parallel - Stochastic Modelling in C++ - Kevin Carpenter - CppCon 2015
No ratings yet
From Functional To Parallel - Stochastic Modelling in C++ - Kevin Carpenter - CppCon 2015
64 pages
Easy Compilation From TouchDevelop To ARM Cortex-M0 Using C++11 - Jonathan Protzenko - CppCon 2015
No ratings yet
Easy Compilation From TouchDevelop To ARM Cortex-M0 Using C++11 - Jonathan Protzenko - CppCon 2015
20 pages
Compile-Time Tools For Generic Programming in C++ - Abel Sinkovics - CppCon 2015
No ratings yet
Compile-Time Tools For Generic Programming in C++ - Abel Sinkovics - CppCon 2015
241 pages
C++ in The Telecom Industry - Yani Miguel - CppCon 2015
No ratings yet
C++ in The Telecom Industry - Yani Miguel - CppCon 2015
13 pages
C++11, 14, 17 Atomics - The Deep Dive - Michael Wong - CppCon 2015
No ratings yet
C++11, 14, 17 Atomics - The Deep Dive - Michael Wong - CppCon 2015
69 pages
Benchmarking C++ Code - Bryce Adelstein Lelbach - CppCon 2015
No ratings yet
Benchmarking C++ Code - Bryce Adelstein Lelbach - CppCon 2015
79 pages
The Canonical Class - Michael Caisse - CppCon 2014
No ratings yet
The Canonical Class - Michael Caisse - CppCon 2014
138 pages
Algorithmic Differentiation - C++ and Extremum Estimation - Matt P. Dziubinski - CppCon 2015
No ratings yet
Algorithmic Differentiation - C++ and Extremum Estimation - Matt P. Dziubinski - CppCon 2015
283 pages
Using C++ To Connect To Web Services - Steve Gates - CppCon 2014
No ratings yet
Using C++ To Connect To Web Services - Steve Gates - CppCon 2014
40 pages
Viewing The World Through Array-Shaped Glasses - Łukasz Mendakiewicz - CppCon 2014
No ratings yet
Viewing The World Through Array-Shaped Glasses - Łukasz Mendakiewicz - CppCon 2014
131 pages
Types Don't Know # - Howard Hinnant - CppCon 2014
No ratings yet
Types Don't Know # - Howard Hinnant - CppCon 2014
95 pages
Being Smart About Pointers - Michael VanLoon - CppCon 2015
No ratings yet
Being Smart About Pointers - Michael VanLoon - CppCon 2015
47 pages
Where Did My Performance Go - Fedor Pikus - CppCon 2014
No ratings yet
Where Did My Performance Go - Fedor Pikus - CppCon 2014
66 pages
The Implementation of Value Types - Lawrence Crowl - CppCon 2014
No ratings yet
The Implementation of Value Types - Lawrence Crowl - CppCon 2014
71 pages
Rebuilding Boost Date-Time For C++11 - Jeff Garland - CppCon 2014
No ratings yet
Rebuilding Boost Date-Time For C++11 - Jeff Garland - CppCon 2014
56 pages
STL Features and Implementation Techniques - Stephan T. Lavavej - CppCon 2014
No ratings yet
STL Features and Implementation Techniques - Stephan T. Lavavej - CppCon 2014
47 pages
Lock-Free by Example - Tony Van Eerd - CppCon 2014
No ratings yet
Lock-Free by Example - Tony Van Eerd - CppCon 2014
215 pages



auto zeroPointOne = 0.1f;

cppCon() << "zeroPointOne == " << zeroPointOne << "\n";
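The value stored for `0.1f` is only the nearest representable float, which the default stream precision of six significant digits hides. A minimal sketch of printing it with more digits follows; `formatFloat` is a hypothetical helper introduced here for illustration, not part of the talk's code.

```cpp
#include <cassert>
#include <iomanip>
#include <sstream>
#include <string>

// Hypothetical helper (not from the talk): format a float using the
// requested number of significant digits.
std::string formatFloat(float value, int significantDigits) {
    assert(significantDigits > 0);
    std::ostringstream out;
    out << std::setprecision(significantDigits) << value;
    return out.str();
}
```

With six significant digits `formatFloat(0.1f, 6)` gives `0.1`; with nine, `formatFloat(0.1f, 9)` gives `0.100000001`, exposing the approximation.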


• The exponent is shift-127 encoded: the stored value is the actual exponent plus 127
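The encoding can be checked by pulling the eight exponent bits out of a 32-bit float and subtracting 127. This is a sketch that assumes `float` is an IEEE 754 single-precision value; `unbiasedExponent` is a hypothetical name, and the function only handles normal numbers (zero, subnormals, infinities, and NaN use reserved exponent patterns).

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>
#include <limits>

static_assert(std::numeric_limits<float>::is_iec559 &&
                  sizeof(float) == sizeof(std::uint32_t),
              "sketch assumes a 32-bit IEEE 754 float");

// Hypothetical helper: return the power-of-two exponent of a normal float.
int unbiasedExponent(float value) {
    assert(std::isnormal(value));
    std::uint32_t bits = 0;
    std::memcpy(&bits, &value, sizeof bits);            // well-defined bit copy
    const std::uint32_t stored = (bits >> 23) & 0xFFu;  // bits 23..30
    return static_cast<int>(stored) - 127;              // remove the shift
}
```

For example, `unbiasedExponent(1.0f)` is 0 and `unbiasedExponent(0.1f)` is -4, since 0.1 is roughly 1.6 × 2⁻⁴.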


*Look up “exactly rounded” as well


ASSERT_TRUE(a >= b);

auto heronsFormula = HeronsFormula(a, b, c);
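The ordering check on the side lengths suggests a formulation that depends on sorted inputs. As a sketch only, here is the textbook form of Heron's formula next to Kahan's rearrangement, which requires a ≥ b ≥ c and keeps its parentheses exactly as written to limit cancellation for needle-like triangles. The names `heronNaive` and `heronKahan` are hypothetical, and whether the talk's `HeronsFormula` matches either version is an assumption.

```cpp
#include <cassert>
#include <cmath>

// Textbook form: area = sqrt(s (s-a) (s-b) (s-c)), with s the semi-perimeter.
double heronNaive(double a, double b, double c) {
    const double s = (a + b + c) / 2.0;
    return std::sqrt(s * (s - a) * (s - b) * (s - c));
}

// Kahan's rearrangement. The sides must be sorted so that a >= b >= c,
// and the parentheses must not be reordered.
double heronKahan(double a, double b, double c) {
    assert(a >= b && b >= c);
    return 0.25 * std::sqrt((a + (b + c)) * (c - (a - b)) *
                            (c + (a - b)) * (a + (b - c)));
}
```

Both versions return 6 for the 5-4-3 right triangle; they diverge only when one side is nearly the sum of the other two.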


Hint: The Answer is 10.0


auto totalFrames = size_t(0);

// Run 120 Frames


Does a naïve compiler make these assumptions too?


const auto oneRadianMultiplied = oneRadian * 1.0e-10f;


𝑥 is the exact result of the operation


• Track sources of approximation

EXPECT_DOUBLE_EQ(0.3f, zeroPointThree);
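Comparing a `float` literal against a `double` mixes two different approximations of 0.3, so the comparison tolerance matters. Google Test's `EXPECT_DOUBLE_EQ` accepts values within 4 ULPs of each other; the sketch below swaps in a simpler relative tolerance to show the idea. `nearlyEqual` and its default tolerance are assumptions made for illustration, not the talk's code.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <limits>

// Hypothetical helper: relative-tolerance comparison of two doubles.
// Not a ULP comparison; unsuitable for values very close to zero.
bool nearlyEqual(double lhs, double rhs,
                 double relTol = 4.0 * std::numeric_limits<double>::epsilon()) {
    assert(relTol >= 0.0);
    const double scale = std::max(std::fabs(lhs), std::fabs(rhs));
    return std::fabs(lhs - rhs) <= relTol * scale;
}
```

Here `nearlyEqual(0.1 + 0.2, 0.3)` is true even though `0.1 + 0.2 == 0.3` is false, while `nearlyEqual(0.3f, 0.3)` is false because the float literal carries only about seven decimal digits.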
