04.Basic_Concepts_II

The document covers fundamental concepts of integral and floating-point types in Modern C++. It discusses fixed-width integers, arithmetic operation semantics, and the behavior of signed and unsigned integers, including potential pitfalls and undefined behaviors. Additionally, it addresses floating-point types, their properties, and issues related to floating-point arithmetic.


Modern C++

Programming
4. Basic Concepts II
Integral and Floating-point Types

Federico Busato
2025-01-22
Table of Contents

1 Integral Data Types
   Fixed Width Integers
   size_t
   ptrdiff_t
   uintptr_t
   Arithmetic Operation Semantics
   Promotion, Truncation
   Undefined Behavior
   Saturation Arithmetic ⋆
Table of Contents

2 Floating-point Types and Arithmetic
   IEEE Floating-point Standard and Other Representations
   Normal/Denormal Values
   Infinity (∞)
   Not a Number (NaN)
   Machine Epsilon
   Units at the Last Place (ULP)
   Cheatsheet
   Limits and Useful Functions
Table of Contents

   Arithmetic Properties
   Special Values Behavior
   Floating-Point Undefined Behavior
   Detect Floating-point Errors ⋆

3 Floating-point Issues
   Catastrophic Cancellation
   Floating-point Comparison
Integral Data Types
A Firmware Bug

“Certain SSDs have a firmware bug causing them to irrecoverably fail after
exactly 32,768 hours of operation. SSDs that were put into service at the
same time will fail simultaneously, so RAID won’t help”

HPE SAS Solid State Drives - Critical Firmware Upgrade

Via twitter.com/martinkl/status/1202235877520482306?s=19
Overflow Implementations

Note: Computing the average in the right way is not trivial, see On finding the average
of two unsigned integers without overflow
related operations: ceiling division, rounding division
ai.googleblog.com/2006/06/extra-extra-read-all-about-it-nearly.html
Potentially Catastrophic Failure

51 days = 51 · 24 · 60 · 60 · 1000 = 4 406 400 000 ms

Boeing 787s must be turned off and on every 51 days to prevent "misleading data"
being shown to pilots
C++ Data Model

                                     Number of Bits
Data Model   OS                    short   int   long   long long   pointer
ILP32        Windows/Unix 32-bit    16     32     32       64         32
LLP64        Windows 64-bit         16     32     32       64         64
LP64         Linux 64-bit           16     32     64       64         64

char is always 1 byte

LP32: Windows 16-bit APIs (no longer used)
C++ Fundamental types
Fixed Width Integers 1/3

intN_t — <cstdint>
C++11 provides fixed-width integer types.
They have the same size on any architecture:
int8_t,  uint8_t
int16_t, uint16_t
int32_t, uint32_t
int64_t, uint64_t

Good practice: Prefer fixed-width integers over native types. int and
unsigned can be used directly as they are widely accepted by C++ data models
Fixed Width Integers 2/3

intN_t types are not "real" types; they are merely typedefs to appropriate
fundamental types

The C++ standard does not ensure a one-to-one mapping:

• There are five distinct fundamental types ( char, short, int, long,
  long long )

• There are four intN_t overloads ( int8_t, int16_t, int32_t, and
  int64_t )

ithare.com/c-on-using-int_t-as-overload-and-template-parameters
Fixed Width Integers 3/3

Warning: I/O Stream interprets uint8_t and int8_t as char and not as integer
values
int8_t var;
cin >> var;  // read '2' (the character, value 50)
cout << var; // print '2'
int a = var * 2;
cout << a;   // print '100' !!
size_t

size_t — <cstddef>
size_t is an alias data type capable of storing the size of the largest
representable object on the current architecture

• size_t is an unsigned integer type (of at least 16 bits)

• size_t is the return type of sizeof() and is commonly used to represent
  size measures

• size_t is 4 bytes on 32-bit architectures, and 8 bytes on 64-bit
  architectures

• C++23 adds uz / UZ literals for size_t, e.g. 5uz

ptrdiff_t

ptrdiff_t — <cstddef>
ptrdiff_t is an alias data type used to store the result of the difference
between pointers or iterators

• ptrdiff_t is the signed counterpart of size_t, commonly used for computing
  pointer differences

• ptrdiff_t is 4 bytes on 32-bit architectures, and 8 bytes on 64-bit
  architectures

• C++23 adds z / Z literals for ptrdiff_t, e.g. 5z

uintptr_t

uintptr_t — <cstdint>
uintptr_t (C++11) is an integer type that can be converted from and to a
void pointer

• uintptr_t is an unsigned type

• sizeof(uintptr_t) == sizeof(void*)

• uintptr_t is an optional type in the standard, and compilers may not
  provide it
Arithmetic Operation Semantics

Overflow    The result of an arithmetic operation exceeds the word length,
            namely the largest positive/negative representable value

Wraparound  The result of an arithmetic operation is reduced modulo 2^N,
            where N is the number of bits of the word

Saturation  The result of an arithmetic operation is constrained within a
            fixed range defined by a minimum and maximum value. If the result
            of an operation exceeds this range, it is "clamped" to the
            boundary value
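The wraparound and saturation semantics can be contrasted directly in code; this is a sketch with illustrative names, using uint8_t so the wrap at 2^8 is easy to see:

```cpp
#include <algorithm>
#include <cstdint>

// wraparound: the result is reduced modulo 2^8
uint8_t wrap_add(uint8_t a, uint8_t b) {
    return static_cast<uint8_t>(a + b); // (a + b) mod 256
}

// saturation: the result is clamped to the maximum instead of wrapping
uint8_t sat_add(uint8_t a, uint8_t b) {
    unsigned sum = static_cast<unsigned>(a) + b; // widen first: no overflow
    return static_cast<uint8_t>(std::min(sum, 255u));
}
```

wrap_add(250, 10) yields 4 (260 mod 256), while sat_add(250, 10) clamps to 255.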
Signed/Unsigned Integer Characteristics

Without undefined behavior, signed and unsigned integers use the exact same
hardware, and they are equivalent at the binary level thanks to the two's
complement representation

#include <climits>
int a1 = INT_MAX;
int b1 = a1 + 4;      // 10000000000000000000000000000011

unsigned a2 = INT_MAX;
unsigned b2 = a2 + 4; // 10000000000000000000000000000011

However, signed and unsigned integers have different semantics in C++. The
compiler can exploit undefined behavior to optimize the code even if such
operations are well-defined at the hardware level
Signed Integer

Represents positive, negative, and zero values (ℤ)

Properties: commutative, reflexive, not associative (overflow/underflow)

Represents the human intuition of numbers

All bitwise operations are well-defined, except shift
Signed Integer - Problems

More negative values (2^31) than positive (2^31 - 1)

Even multiplication, division, and modulo by -1 can fail, e.g. INT_MIN * -1

Overflow/underflow semantics → undefined behavior

Possible behavior: overflow: (2^31 - 1) + 1 → min, underflow: -2^31 - 1 → max

Shift could lead to undefined behavior: x << y
• undefined behavior if y is larger than the number of bits of x
• implementation-defined if x is negative (until C++20)
• undefined behavior if y is negative
Unsigned Integer

Represents only non-negative values (ℕ)

Properties: commutative, reflexive, associative

Discontinuity between 0 and 2^32 - 1

Wraparound semantics → well-defined (modulo 2^32)

Bitwise operations are well-defined, except shift

Shift could lead to undefined behavior: x << y
• undefined behavior if y is larger than the number of bits of x
When to Use Signed/Unsigned Integers? 1/2

Google Style Guide

Because of historical accident, the C++ standard also uses unsigned integers
to represent the size of containers - many members of the standards body
believe this to be a mistake, but it is effectively impossible to fix at this
point

Solution: use int64_t

max value: 2^63 - 1 = 9,223,372,036,854,775,807 or
9 quintillion (9 billion billions),
about 292 years in nanoseconds,
9 million terabytes
When to Use Signed/Unsigned Integers? 2/2

When to use signed integers?

• if the value can be mixed with negative values, e.g. subtracting byte sizes
• prefer expressing non-negative values with signed integers and assertions
• optimization purposes, e.g. exploit undefined behavior for overflow or in loops

When to use unsigned integers?

• if the quantity can never be mixed with negative values (?)
• bitmask values
• optimization purposes, e.g. division, modulo
• safety-critical systems, where signed integer overflow could be "non-deterministic"

Subscripts and sizes should be signed, Bjarne Stroustrup
Don't add to the signed/unsigned mess, Bjarne Stroustrup
Integer Type Selection in C++: in Safe, Secure and Correct Code, Robert C. Seacord
Arithmetic Type Limits

Query properties of arithmetic types in C++11:

#include <limits>

std::numeric_limits<int>::max();      // 2^31 - 1
std::numeric_limits<uint16_t>::max(); // 65,535

std::numeric_limits<int>::min();      // -2^31
std::numeric_limits<unsigned>::min(); // 0

* this syntax will be explained in the next lectures
Promotion and Truncation

Promotion to a larger type keeps the sign

int16_t x = -1;
int y = x;   // sign extend
cout << y;   // print -1

Truncation to a smaller type is implemented as a modulo operation with respect
to the number of bits of the smaller type

int x = 65537;  // 2^16 + 1
int16_t y = x;  // x % 2^16
cout << y;      // print 1

int z = 32769;  // 2^15 + 1 (does not fit in an int16_t)
int16_t w = z;  // (int16_t) (z % 2^16) = (int16_t) 32769
cout << w;      // print -32767
Mixing Signed/Unsigned Errors 1/3

unsigned a = 10;          // array is small
int b = -1;
array[10ull + a * b] = 0; // ?

→ Segmentation fault!

int f(int a, unsigned b, int* array) { // array is small
    if (a > b)
        return array[a - b]; // ?
    return 0;
}

→ Segmentation fault for a < 0!

// v.size() returns unsigned
for (size_t i = 0; i < v.size() - 1; i++)
    array[i] = 3; // ?

→ Segmentation fault for v.size() == 0!
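One way to repair the last loop (a sketch of mine, not the slide's official fix) is to keep the subtraction out of the unsigned comparison, so an empty container cannot wrap around:

```cpp
#include <cstddef>
#include <vector>

// safe variant: i + 1 < v.size() never underflows, unlike v.size() - 1
void fill_all_but_last(std::vector<int>& v) {
    for (std::size_t i = 0; i + 1 < v.size(); ++i)
        v[i] = 3;
}
```

With v.size() == 0, the condition 0 + 1 < 0 is simply false and the loop body never runs.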
Mixing Signed/Unsigned Errors 2/3

Easy case:

unsigned x = 32;  // x can also be a pointer
x += 2u - 4;      // 2u - 4 = 2 + (2^32 - 4)
                  //        = 2^32 - 2
                  // (32 + (2^32 - 2)) % 2^32
cout << x;        // print 30 (as expected)

What about the following code?

uint64_t x = 32;  // x can also be a pointer
x += 2u - 4;
cout << x;
Mixing Signed/Unsigned Errors 3/3

A real-world case:
// allocate a zero rtx vector of N elements
//
// sizeof(struct rtvec_def) == 16
// sizeof(rtunion) == 8
rtvec rtvec_alloca(int n) {
rtvec rt;
int i;
rt = (rtvec)obstack_alloc(
rtl_obstack,
sizeof(struct rtvec_def) + ((n - 1) * sizeof(rtunion)));
// ...
return rt;
}

Garbage In, Garbage Out: Arguing about Undefined Behavior with Nasal Daemons,
Chandler Carruth, CppCon 2016
Undefined Behavior 1/7

The C++ standard does not prescribe any specific behavior (undefined behavior)
for several integer/unsigned arithmetic operations
• Signed integer overflow/underflow
int x = std::numeric_limits<int>::max() + 20; // undefined behavior!!

• More negative values than positive

int x = std::numeric_limits<int>::max() * -1; // (2^31 - 1) * -1
cout << x; // -2^31 + 1, ok

int y = std::numeric_limits<int>::min() * -1; // -2^31 * -1
cout << y; // 2^31, overflow!! (hard to see in complex examples)

The Usual Arithmetic Confusions


Undefined Behavior 2/7

• Initialize an integer with a value larger than its range is undefined behavior
int z = 3000000000; // undefined behavior!!

• Bitwise operations on signed integer types with negative value is undefined


behavior
int y = -1 << 12; // undefined behavior!!, until C++20
int z = 1 << -12; // undefined behavior!!

• Shift larger than #bits of the data type is undefined behavior even for unsigned
unsigned y = 1u << 32u; // undefined behavior!!

• Undefined behavior in implicit conversion

uint16_t a = 65535; // 0xFFFF
uint16_t b = 65535; // 0xFFFF
cout << (a * b); // print '-131071', expected: 4'294'836'225
                 // undefined behavior!! (int overflow after promotion)
Undefined Behavior - Signed Overflow Example 1 3/7

# include <climits>
# include <cstdio>

void f(int* ptr, int pos) {


pos++;
if (pos < 0) // <-- the compiler could assume that signed overflow never
return; // happen and "simplify" the condition to check
ptr[pos] = 0;
}

int main() { // the code compiled with optimizations, e.g. -O3


int* tmp = new int[10]; // leads to segmentation faults with clang, while
f(tmp, INT_MAX); // it terminates correctly with gcc
printf("%d\n", tmp[0]);
}

Undefined Behavior - Signed Overflow Example 2 4/7

fs/open.c of the Linux kernel

int do_fallocate(..., loff_t offset, loff_t len) {


inode *inode = ...;
if (offset < 0 || len <= 0)
return -EINVAL;
/* Check for wrap through zero too */
if ((offset + len > inode->i_sb->s_maxbytes) || (offset + len < 0))
return -EFBIG; // the compiler is able to infer that both 'offset' and
... // 'len' are non-negative and can eliminate this check,
} // without verifying integer overflow
Undefined Behavior - Division by Zero Example 5/7

src/backend/utils/adt/int8.c of PostgreSQL

if (arg2 == 0) {
ereport(ERROR, (errcode(ERRCODE_DIVISION_BY_ZERO), // the compiler is not aware
errmsg("division by zero"))); // that this function
} // doesn't return
/* No overflow is possible */
PG_RETURN_INT32((int32) arg1 / arg2); // the compiler assumes that the divisor is
// non-zero and can move this statement on
// the top (always executed)

Undefined Behavior: What Happened to My Code?
Undefined Behavior - Implicit Overflow Example 6/7

Even worse example:


# include <iostream>

int main() {
for (int i = 0; i < 4; ++i)
std::cout << i * 1000000000 << std::endl;
}
// with optimizations, this becomes an infinite loop
// --> i * 1000000000 > INT_MAX for i = 3
// undefined behavior!!

// the compiler translates the multiplication by a constant into an addition
Why does this loop produce undefined behavior?
Undefined Behavior - Common Loops 7/7

Is the following loop safe?

void f(int size) {
    for (int i = 0; i < size; i += 2)
        ...
}

• What happens if size is equal to INT_MAX?

• How to make the previous loop safe?
• i >= 0 && i < size is not the solution because of the undefined behavior of
  signed overflow
• Can we generalize the solution when the increment is i += step?
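One possible answer (a sketch of mine, not the slide's official solution): perform the induction in a 64-bit type, so i += step cannot overflow even when size == INT_MAX:

```cpp
// 64-bit induction variable: the comparison still uses the int bound,
// but the increment can no longer overflow a signed type
long long count_iterations(int size, int step) {
    long long count = 0;
    for (long long i = 0; i < size; i += step)
        ++count;
    return count;
}
```

This also generalizes to any positive step without extra case analysis.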
Overflow / Underflow

Detecting wraparound for unsigned integral types is not trivial


// some examples
bool is_add_overflow(unsigned a, unsigned b) {
return (a + b) < a || (a + b) < b;
}

bool is_mul_overflow(unsigned a, unsigned b) {


unsigned x = a * b;
return a != 0 && (x / a) != b;
}

Detecting overflow/underflow for signed integral types is even harder and must be
checked before performing the operation

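For the signed case, the check must run before the addition; a common sketch (the function name is mine):

```cpp
#include <climits>

// true iff a + b would overflow/underflow an int; no overflowing
// expression is ever evaluated
bool is_add_overflow_signed(int a, int b) {
    if (b > 0) return a > INT_MAX - b; // would exceed INT_MAX
    if (b < 0) return a < INT_MIN - b; // would go below INT_MIN
    return false;
}
```

Compilers such as GCC and Clang also offer the __builtin_add_overflow intrinsic as an alternative.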

Saturation Arithmetic

C++26 adds four main functions to perform saturation arithmetic with integer
types in the <numeric> library. In other words, the undefined behavior or the
wraparound behavior on overflow/underflow is replaced by saturation values,
namely the minimum or maximum values of the result type
• T add_sat(T x, T y)

• T sub_sat(T x, T y)

• T mul_sat(T x, T y)

• T div_sat(T x, T y)

• R saturate_cast<R>(T x)
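Where the C++26 functions are not yet available, add_sat can be sketched portably by widening (illustrative code, not the standard library's implementation):

```cpp
#include <cstdint>
#include <limits>

// sketch of saturating 32-bit addition: compute in 64 bits, then clamp
int32_t add_sat_sketch(int32_t x, int32_t y) {
    int64_t sum = static_cast<int64_t>(x) + y; // cannot overflow int64_t
    if (sum > std::numeric_limits<int32_t>::max())
        return std::numeric_limits<int32_t>::max();
    if (sum < std::numeric_limits<int32_t>::min())
        return std::numeric_limits<int32_t>::min();
    return static_cast<int32_t>(sum);
}
```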
Floating-point Types
and Arithmetic
IEEE Floating-Point Standard

IEEE754 is the technical standard for floating-point arithmetic


The standard defines the binary format, operation behavior, rounding rules,
exception handling, etc.
First Release:  1985
Second Release: 2008. Adds 16-bit, 128-bit, 256-bit floating-point types
Third Release:  2019. Specifies min/max behavior
see The IEEE Standard 754: One for the History Books

IEEE754 technical document:
754-2019 - IEEE Standard for Floating-Point Arithmetic

In general, C/C++ adopts the IEEE754 floating-point standard:
en.cppreference.com/w/cpp/types/numeric_limits/is_iec559
32/64-bit Floating-Point

• IEEE754 Single-precision (32-bit) float
  Sign: 1-bit | Exponent: 8-bit | Mantissa (or significand): 23-bit

• IEEE754 Double-precision (64-bit) double
  Sign: 1-bit | Exponent: 11-bit | Mantissa (or significand): 52-bit
128/256-bit Floating-Point

• IEEE754 Quad-precision (128-bit) std::float128_t (C++23)
  Sign: 1-bit | Exponent: 15-bit | Mantissa (or significand): 112-bit

• IEEE754 Octuple-precision (256-bit) (not standardized in C++)
  Sign: 1-bit | Exponent: 19-bit | Mantissa (or significand): 236-bit
16-bit Floating-Point

• IEEE754 16-bit floating-point ( std::float16_t ) C++23 → GPU, Arm7
  Sign: 1-bit | Exponent: 5-bit | Mantissa: 10-bit

• Google 16-bit floating-point ( std::bfloat16_t ) C++23 → TPU, GPU, Arm8
  Sign: 1-bit | Exponent: 8-bit | Mantissa: 7-bit
half-precision-arithmetic-fp16-versus-bfloat16
8-bit Floating-Point (Non-Standardized in C++/IEEE)

• E4M3
  Sign: 1-bit | Exponent: 4-bit | Mantissa: 3-bit

• E5M2
  Sign: 1-bit | Exponent: 5-bit | Mantissa: 2-bit

• Floating Point Formats for Machine Learning, IEEE draft
• FP8 Formats for Deep Learning, Intel, Nvidia, Arm
Other Real Value Representations (Non-standardized in C++/IEEE) 1/2

• TensorFloat-32 (TF32) Specialized floating-point format for deep learning


applications

• Posit (John Gustafson, 2017), also called unum III (universal number ), represents
floating-point values with variable-width of exponent and mantissa.
It is implemented in experimental platforms

• NVIDIA Hopper Architecture In-Depth
• Beating Floating Point at its Own Game: Posit Arithmetic
• Posits, a New Kind of Number, Improves the Math of AI
• Comparing posit and IEEE-754 hardware cost
Other Real Value Representations (Non-standardized in C++/IEEE) 2/2

• Microscaling Formats (MX) Specification for low-precision floating-point


formats defined by AMD, Arm, Intel, Meta, Microsoft, NVIDIA, and Qualcomm.
It includes FP8, FP6, FP4, (MX)INT8

• Fixed-point representation has a fixed number of digits after the radix point
(decimal point). The gaps between adjacent numbers are always equal. The range
of their values is significantly limited compared to floating-point numbers.
It is widely used on embedded systems

• OCP Microscaling Formats (MX) Specification
Floating-point Representation 1/2

Floating-point number:
• Radix (or base): β
• Precision (or digits): p
• Exponent (magnitude): e
• Mantissa: M

n = M × β^e  (M stored with p digits)  →  IEEE754: n = 1.M × 2^e

float  f1 = 1.3f;   // 1.3
float  f2 = 1.1e2f; // 1.1 * 10^2
float  f3 = 3.7E4f; // 3.7 * 10^4
float  f4 = .3f;    // 0.3
double d1 = 1.3;    // without "f" suffix
double d2 = 5E3;    // 5 * 10^3
Floating-point Representation 2/2

Exponent Bias
In IEEE754 floating-point numbers, the exponent value is offset from its
actual value by the exponent bias

• The exponent is stored as an unsigned value suitable for comparison

• Floating-point values are lexicographically ordered

• For a single-precision number, the exponent is stored in the range [1, 254]
  (0 and 255 have special meanings), and is biased by subtracting 127 to get
  an exponent value in the range [-126, +127]

Example (normal number):
  0 | 10000111 | 11000000000000000000000
  sign:     0 → +
  exponent: 10000111 = 135 → 2^(135-127) = 2^8
  mantissa: 2^-1 + 2^-2 = 0.5 + 0.25 = 0.75 → 1.M = 1.75
  value:    +1.75 * 2^8 = 448.0


Floating-point - Normal/Denormal 1/2

Normal number
A normal number is a floating-point value represented with at least one bit
set in the exponent field; the leading 1 of the mantissa is implicit

Denormal number
Denormal (or subnormal) numbers fill the underflow gap around zero in
floating-point arithmetic. Any non-zero number with magnitude smaller than the
smallest normal number is denormal
A denormal number is a floating point value that can be represented with all 0s in
the exponent, but the mantissa is non-zero

44/83
Floating-point - Normal/Denormal 2/2

Why denormal numbers make sense:
[figure: values representable near zero with normal numbers only — a gap
around zero]
[figure: values representable near zero with denormal numbers — the gap is
filled]
Floating-point representation, by Carl Burch
Infinity ∞ 1/2

Infinity
In the IEEE754 standard, inf (infinity value) is a numeric data type value
that exceeds the maximum (or minimum) representable value

Operations generating inf:

• ±∞ · ±∞
• ±∞ · ±(finite value)
• finite value op finite value, when the result exceeds the largest value
• finite value / ±0
There is a single binary representation each for +inf and -inf
Comparison: (inf == finite value) → false
            (±inf == ±inf)        → true
Infinity 2/2

cout << 5.0 / 0.0;  // print "inf"
cout << -5.0 / 0.0; // print "-inf"

auto inf = std::numeric_limits<float>::infinity();

cout << (-0.0 == 0.0);                       // true, 0 == 0
cout << ((5.0f / inf) == (-5.0f / inf));     // true, 0 == -0
cout << ((10e40f) == (10e40f + 9999999.0f)); // true, inf == inf
cout << ((10e40) == (10e40f + 9999999.0f));  // false, 10e40 != inf
Not a Number (NaN) 1/2

NaN
In the IEEE754 standard, NaN (not a number) is a numeric data type value
representing an undefined or non-representable value

Floating-point operations generating NaN:

• Operations with a NaN as at least one operand
• ±∞ · ∓∞, 0 · ∞
• 0/0, ∞/∞
• √x, log(x) for x < 0
• sin⁻¹(x), cos⁻¹(x) for x < -1 or x > 1

Comparison: (NaN == x)   → false, for every x
            (NaN == NaN) → false
Not a Number (NaN) 2/2

There are many representations for NaN (e.g. 2^24 - 2 for float)
The specific (bitwise) NaN value returned by an operation is
implementation/compiler specific
cout << 0 / 0;     // undefined behavior
cout << 0.0 / 0.0; // print "nan" or "-nan"
quiet NaN
Machine Epsilon

Machine epsilon
Machine epsilon ε (or machine accuracy ) is defined to be the smallest number that
can be added to 1.0 to give a number other than one

IEEE754 Single precision: ε = 2^-23 ≈ 1.19209 * 10^-7

IEEE754 Double precision: ε = 2^-52 ≈ 2.22045 * 10^-16
Units at the Last Place (ULP)

ULP
Units at the Last Place is the gap between consecutive floating-point numbers

ULP(p, e) = β^(e - (p-1))  →  IEEE754: 2^(e - (p-1))

Example:
β = 10, p = 3
π = 3.1415926... → x = 3.14 × 10^0
ULP(3, 0) = 10^-2 = 0.01

Relation with ε:
• ε = ULP(p, 0)
• ULP_x = ε * β^e(x)
Floating-Point Representation of a Real Number

The machine floating-point representation fl(x) of a real number x is
expressed as

    fl(x) = x(1 + δ), where δ is a small constant

The approximation of a real number x has the following properties:

    Absolute Error:  |fl(x) - x| ≤ 1/2 · ULP_x

    Relative Error:  |fl(x) - x| / |x| ≤ 1/2 · ε
Floating-point - Cheatsheet 1/3

• NaN (mantissa ≠ 0)
  * 11111111 ***********************

• ±infinity
  * 11111111 00000000000000000000000

• Lowest/Largest (±3.40282 * 10^38)
  * 11111110 11111111111111111111111

• Minimum (normal) (±1.17549 * 10^-38)
  * 00000001 00000000000000000000000

• Denormal number (< 2^-126) (minimum: 1.4 * 10^-45)
  * 00000000 ***********************

• ±0
  * 00000000 00000000000000000000000
Floating-point - Cheatsheet 2/3

                    E4M3                    E5M2                   float16_t
Exponent            4-bit [0*-14] (no inf)  5-bit [0*-30]          5-bit [0*-30]
Bias                7                       15                     15
Mantissa            3-bit                   2-bit                  10-bit
Largest (±)         1.75*2^8 = 448          1.75*2^15 = 57,344     65,504 (≈ 2^16)
Smallest (±)        2^-6 = 0.015625         2^-14 ≈ 0.00006        2^-14 ≈ 0.00006
Smallest Denormal   2^-9 ≈ 0.00195          2^-16 ≈ 1.5258*10^-5   2^-24 ≈ 6.0*10^-8
Epsilon             2^-3 = 0.125            2^-2 = 0.25            2^-10 ≈ 0.00098
Floating-point - Cheatsheet 3/3

                    bfloat16_t              float                  double
Exponent            8-bit [0*-254]          8-bit [0*-254]         11-bit [0*-2046]
Bias                127                     127                    1023
Mantissa            7-bit                   23-bit                 52-bit
Largest (±)         ≈2^128 = 3.4*10^38      ≈2^128 = 3.4*10^38     ≈2^1024 = 1.8*10^308
Smallest (±)        2^-126 ≈ 1.2*10^-38     2^-126 ≈ 1.2*10^-38    2^-1022 ≈ 2.2*10^-308
Smallest Denormal   —                       2^-149 ≈ 1.4*10^-45    2^-1074 ≈ 4.9*10^-324
Epsilon             2^-7 ≈ 0.0078           2^-23 ≈ 1.2*10^-7      2^-52 ≈ 2.2*10^-16
Floating-point - Limits

#include <limits>
// T: float or double

std::numeric_limits<T>::max();        // largest value
std::numeric_limits<T>::lowest();     // lowest value (C++11)
std::numeric_limits<T>::min();        // smallest (normal) value
std::numeric_limits<T>::denorm_min(); // smallest (denormal) value
std::numeric_limits<T>::epsilon();    // epsilon value
std::numeric_limits<T>::infinity();   // infinity
std::numeric_limits<T>::quiet_NaN();  // NaN
Floating-point - Useful Functions

#include <cmath> // C++11

bool std::isnan(T value);    // check if value is NaN
bool std::isinf(T value);    // check if value is ±infinity
bool std::isfinite(T value); // check if value is not NaN
                             // and not ±infinity
bool std::isnormal(T value); // check if value is normal

T   std::ldexp(T x, int p);  // exponent shift: x * 2^p
int std::ilogb(T value);     // extracts the exponent of value
Floating-point Arithmetic Properties 1/3

Floating-point operations are written


• ⊕ addition
• ⊖ subtraction
• ⊗ multiplication
• ⊘ division

⊙ ∈ {⊕, ⊖, ⊗, ⊘}

op ∈ {+, −, ∗, /} denotes exact precision operations

Floating-point Arithmetic Properties 2/3

(P1) In general, a op b ≠ a ⊙ b

(P2) Not Reflexive: a ≠ a
• reflexive without NaN

(P3) Not Commutative: a ⊙ b ≠ b ⊙ a
• commutative without NaN (NaN ≠ NaN)

(P4) In general, Not Associative: (a ⊙ b) ⊙ c ≠ a ⊙ (b ⊙ c)
• even excluding NaN and inf in intermediate computations

(P5) In general, Not Distributive: (a ⊕ b) ⊗ c ≠ (a ⊗ c) ⊕ (b ⊗ c)
• even excluding NaN and inf in intermediate computations
Floating-point Arithmetic Properties 3/3

(P6) Identity on operations is not ensured
• (a ⊖ b) ⊕ b ≠ a
• (a ⊘ b) ⊗ b ≠ a

(P7) Overflow/Underflow: floating-point has "saturation" values inf, -inf
• as opposed to integer arithmetic with wraparound behavior
Special Values Behavior

Zero behavior
• a ⊘ 0 = inf, for finite a ≠ 0 [IEEE754], undefined behavior in C++
• 0 ⊘ 0, inf ⊘ 0 = NaN [IEEE754], undefined behavior in C++
• 0 ⊗ inf = NaN
• +0 == -0, but they have different binary representations
Inf behavior
• inf ⊙ a = inf, for finite a ≠ 0
• inf ⊕ inf = inf,  inf ⊗ inf = inf
• inf ⊖ inf = NaN,  inf ⊘ inf = NaN
• ±inf ⊙ ∓inf = NaN
• ±inf == ±inf
NaN behavior
• NaN ⊙ a = NaN
• NaN ≠ a, for every a
Floating-Point Undefined Behavior

• Division by zero
  e.g., 10^8 / 0.0

• Conversion to a narrower floating-point type:
  e.g., 0.1 double → float

• Conversion from floating-point to integer:
  e.g., 10^8 float → int

• Operations on signaling NaNs: arithmetic operations that cause an "invalid
  operation" exception to be signaled
  e.g., inf - inf

• Incorrectly assuming IEEE-754 compliance for all platforms:
  e.g., some embedded Linux distributions on ARM

Detect Floating-point Errors 1/2

C++11 allows determining whether a floating-point exceptional condition has
occurred by using the floating-point exception facilities provided in <cfenv>

#include <cfenv>
// MACROS
FE_DIVBYZERO  // division by zero
FE_INEXACT    // rounding error
FE_INVALID    // invalid operation, i.e. NaN
FE_OVERFLOW   // overflow (result reaches the saturation value ±inf)
FE_UNDERFLOW  // underflow (result too small for a normal value)
FE_ALL_EXCEPT // all exceptions

// functions
std::feclearexcept(FE_ALL_EXCEPT); // clear exception status
std::fetestexcept(<macro>);        // returns a value != 0 if an
                                   // exception has been detected

Detect Floating-point Errors 2/2

#include <cfenv> // floating point exceptions


#include <iostream>
#pragma STDC FENV_ACCESS ON // tell the compiler to manipulate the floating-point
// environment (not supported by all compilers)
// gcc: yes, clang: no
int main() {
std::feclearexcept(FE_ALL_EXCEPT); // clear
auto x = 1.0 / 0.0; // all compilers
std::cout << (bool) std::fetestexcept(FE_DIVBYZERO); // print true

std::feclearexcept(FE_ALL_EXCEPT); // clear
auto x2 = 0.0 / 0.0; // all compilers
std::cout << (bool) std::fetestexcept(FE_INVALID); // print true

std::feclearexcept(FE_ALL_EXCEPT); // clear
auto x4 = 1e38f * 10; // gcc: ok
std::cout << std::fetestexcept(FE_OVERFLOW); // print true
}
see What is the difference between quiet NaN and signaling NaN?
Floating-point Issues
Some Examples... 1/4

Ariane 5: data conversion from a 64-bit floating-point value to a 16-bit
signed integer → $137 million

Patriot Missile: small chopping error at each operation, 100 hours of
activity → 28 deaths
Some Examples... 2/4

Integer type is more accurate than floating type for large numbers
cout << 16777217; // print 16777217
cout << (int) 16777217.0f; // print 16777216!!
cout << (int) 16777217.0; // print 16777217, double ok

float numbers are different from double numbers


cout << (1.1 != 1.1f); // print true !!!

Some Examples... 3/4

The floating point precision is finite!


cout << setprecision(20);
cout << 3.33333333f; // print 3.333333254!!
cout << 3.33333333; // print 3.333333333
cout << (0.1 + 0.1 + 0.1 + 0.1 + 0.1 + 0.1); // print 0.59999999999999998

Floating-point arithmetic is not associative

cout << (0.1 + (0.2 + 0.3) == (0.1 + 0.2) + 0.3); // print false

IEEE754 floating-point computation guarantees to produce deterministic
output, namely the exact bitwise value for each run, if and only if the order
of the operations is always the same
→ same result on any machine and for all runs
Some Examples... 4/4

“Using a double-precision floating-point value, we can represent easily the


number of atoms in the universe.
If your software ever produces a number so large that it will not fit in a
double-precision floating-point value, chances are good that you have a bug”
Daniel Lemire, Prof. at the University of Quebec

“ NASA uses just 15 digits of π to calculate interplanetary travel.


With 40 digits, you could calculate the circumference of a circle the size of the
visible universe with an accuracy that would fall by less than the diameter of
a single hydrogen atom”
Latest in space, Twitter

Number of atoms in the universe versus floating-point values


Floating-point Algorithms

• addition algorithm (simplified):


(1) Compare the exponents of the two numbers. Shift the smaller number to the right until its
exponent would match the larger exponent
(2) Add the mantissa
(3) Normalize the sum if needed (shift right/left the exponent by 1)

• multiplication algorithm (simplified):


(1) Multiplication of mantissas. The number of bits of the result is twice the size of the operands
(46 + 2 bits, with +2 for implicit normalization)
(2) Normalize the product if needed (shift right/left the exponent by 1)
(3) Addition of the exponents

• fused multiply-add (fma):
  • Recent architectures (including GPUs) provide fma to compute a
    multiplication and an addition in a single instruction (generated by the
    compiler in most cases)
  • fma(x, y, z) has a smaller rounding error than the separate (x ⊗ y) ⊕ z
Catastrophic Cancellation 1/5

Catastrophic Cancellation
Catastrophic cancellation (or loss of significance) refers to the loss of
relevant information in a floating-point computation that cannot be recovered

Two cases:
(C1) a ± b, where a ≫ b or b ≫ a. The value (or part of the value) of the
smaller number is lost

(C2) a - b, where a, b are approximations of exact values and a ≈ b, namely a
loss of precision in both a and b. a - b cancels most of the relevant part of
the result because a ≈ b. It implies a small absolute error but a large
relative error
Catastrophic Cancellation (case 1) - Granularity 2/5

[figure] Intersection = 16,777,216 = 2^24

Catastrophic Cancellation (case 1) 3/5

How many iterations does the following code perform?

while (x > 0)
    x = x - y;

float:  x = 10,000,000  y = 1     -> 10,000,000
float:  x = 30,000,000  y = 1     -> does not terminate
float:  x = 200,000     y = 0.001 -> does not terminate
bfloat: x = 256         y = 1     -> does not terminate !!
Catastrophic Cancellation (case 1) 4/5

Floating-point increment
float x = 0.0f;
for (int i = 0; i < 20000000; i++)
x += 1.0f;

What is the value of x at the end of the loop?

Ceiling division ⌈a / b⌉:

// std::ceil((float) 101 / 2.0f) -> 50.5f -> 51
float x = std::ceil((float) 20000001 / 2.0f);

What is the value of x?

Catastrophic Cancellation (case 2) 5/5

Let's solve a quadratic equation:

    x1,2 = (-b ± √(b^2 - 4ac)) / (2a)

    x^2 + 5000x + 0.25 = 0

(-5000 + std::sqrt(5000.0f * 5000.0f - 4.0f * 1.0f * 0.25f)) / 2 // x2
(-5000 + std::sqrt(25000000.0f - 1.0f)) / 2 // catastrophic cancellation (C1)
(-5000 + std::sqrt(25000000.0f)) / 2
(-5000 + 5000) / 2 = 0 // catastrophic cancellation (C2)
// correct result: -0.00005!!

relative error: |0 - 0.00005| / 0.00005 = 100%
Floating-point Comparison 1/3

The problem
cout << (0.11f + 0.11f < 0.22f); // print true!!
cout << (0.1f + 0.1f > 0.2f);    // print true!!

Do not use absolute error margins!!

bool areFloatNearlyEqual(float a, float b) {
    return std::abs(a - b) < epsilon; // epsilon is fixed by the user
}

Problems:
• A fixed epsilon "looks small", but it could be too large when the numbers
  being compared are very small
• If the compared numbers are very large, the epsilon could end up being
  smaller than the smallest rounding error, so that the comparison always
  returns false
Floating-point Comparison 2/3

Solution: use the relative error |a - b| / b < ε

bool areFloatNearlyEqual(float a, float b) {
    return std::abs(a - b) / b < epsilon; // epsilon is fixed
}

Problems:
• a=0, b=0: the division is evaluated as 0.0/0.0 and the whole condition is
  (nan < epsilon), which always returns false
• b=0: the division is evaluated as abs(a)/0.0 and the whole condition is
  (+inf < epsilon), which always returns false
• a and b very small: the result should be true, but the division by b may
  produce wrong results
• It is not commutative. We always divide by b
Floating-point Comparison 3/3

Possible solution: |a - b| / max(|a|, |b|) < ε

bool areFloatNearlyEqual(float a, float b) {
    constexpr float normal_min = std::numeric_limits<float>::min();
    constexpr float relative_error = <user_defined>;

    if (!std::isfinite(a) || !std::isfinite(b)) // a = ±∞, NaN or b = ±∞, NaN
        return false;

    float diff = std::abs(a - b);
    // if "a" and "b" are near to zero, the relative error is less effective
    if (diff <= normal_min) // or also: user_epsilon * normal_min
        return true;

    float abs_a = std::abs(a);
    float abs_b = std::abs(b);
    return (diff / std::max(abs_a, abs_b)) <= relative_error;
}
Minimize Error Propagation - Summary

• Prefer multiplication/division over addition/subtraction

• Try to reorganize the computation to keep numbers of the same scale close
  together (e.g. by sorting them)

• Consider flushing to zero very small numbers (under a threshold). Common
  application: iterative algorithms

• Scaling by a power of two is safe

• Switch to log scale: multiplication becomes addition, and division becomes
  subtraction

• Use a compensation algorithm like Kahan summation, Dekker's FastTwoSum,
  Rump's AccSum
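A minimal sketch of the Kahan (compensated) summation mentioned above; note that aggressive floating-point optimizations (e.g. -ffast-math) can optimize the correction term away:

```cpp
// Kahan summation: a running correction term recovers the low-order
// bits lost in each addition
float kahan_sum(const float* data, int n) {
    float sum = 0.0f;
    float c   = 0.0f;          // running compensation (lost low part)
    for (int i = 0; i < n; ++i) {
        float y = data[i] - c; // apply the previous correction
        float t = sum + y;     // low bits of y may be lost here
        c = (t - sum) - y;     // algebraically 0; captures the lost part
        sum = t;
    }
    return sum;
}
```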
References

Suggest readings:
• What Every Computer Scientist Should Know About Floating-Point Arithmetic
• Do Developers Understand IEEE Floating Point?
• Yet another floating point tutorial
• Unavoidable Errors in Computing

Floating-point Comparison readings:


• The Floating-Point Guide - Comparison
• Comparing Floating Point Numbers, 2012 Edition
• Some comments on approximately equal FP comparisons
• Comparing Floating-Point Numbers Is Tricky

Floating point tools:


• IEEE754 visualization/converter
• Find and fix floating-point problems
System/360 Model 44

Ken Shirriff: Want to adjust your computer's floating point precision by
turning a knob? You could do that on the System/360 Model 44
On Floating-Point
