04.Basic_Concepts_II
Programming
4. Basic Concepts II
Integral and Floating-point Types
Federico Busato
2025-01-22
Table of Contents
Arithmetic Properties
Special Values Behavior
Floating-Point Undefined Behavior
Detect Floating-point Errors ⋆
Floating-point Issues
Catastrophic Cancellation
Floating-point Comparison
Integral Data Types
A Firmware Bug
“Certain SSDs have a firmware bug causing them to irrecoverably fail after
exactly 32,768 hours of operation. SSDs that were put into service at the
same time will fail simultaneously, so RAID won’t help”
Via twitter.com/martinkl/status/1202235877520482306?s=19
Overflow Implementations
Note: Computing the average in the right way is not trivial, see On finding the average
of two unsigned integers without overflow
related operations: ceiling division, rounding division
ai.googleblog.com/2006/06/extra-extra-read-all-about-it-nearly.html
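The note on averaging can be illustrated with a minimal sketch. The helper name `average_floor` is hypothetical and not part of the slides; it shows one well-known bit trick that avoids the intermediate sum `a + b`, which may wrap around. C++20 also provides `std::midpoint` in `<numeric>` for the same purpose.

```cpp
#include <cstdint>

// Hypothetical helper: average of two unsigned values, rounded down.
// The naive (a + b) / 2 wraps around when a + b exceeds UINT32_MAX.
constexpr std::uint32_t average_floor(std::uint32_t a, std::uint32_t b) {
    // Bits set in both values contribute fully, bits set in only one
    // of them contribute half: no intermediate value can overflow.
    return (a & b) + ((a ^ b) >> 1);
}
```

For example, `average_floor(4000000000u, 4000000002u)` yields `4000000001`, whereas the naive formula would wrap around in 32 bits.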
Potentially Catastrophic Failure
Boeing 787s must be turned off and on every 51 days to prevent ‘misleading data’ being shown to pilots
C++ Data Model
Number of Bits

C++ Data Model   OS                    short   int   long   long long   pointer
ILP32            Windows/Unix 32-bit   16      32    32     64          32
LLP64            Windows 64-bit        16      32    32     64          64
LP64             Linux 64-bit          16      32    64     64          64
C++ Fundamental types
Fixed Width Integers 1/3
int*_t <cstdint>
C++11 provides fixed-width integer types.
They have the same size on any architecture:
int8_t, uint8_t
int16_t, uint16_t
int32_t, uint32_t
int64_t, uint64_t
Good practice: Prefer fixed-width integers to native types. int and
unsigned can be used directly, as they have the same size in the common C++ data models
Fixed Width Integers 2/3
int*_t types are not “real” types; they are merely typedefs to appropriate
fundamental types
ithare.com/c-on-using-int_t-as-overload-and-template-parameters
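The consequence of fixed-width types being typedefs can be checked at compile time. The following is a minimal sketch: the name `is_alias_of_standard_signed` is hypothetical, and which fundamental type each alias maps to is platform-dependent (the sketch assumes a common platform where no extended integer types are used).

```cpp
#include <cstdint>
#include <type_traits>

// Hypothetical trait: true if T is exactly one of the standard signed
// integer types, i.e. T is not a distinct type of its own.
template <typename T>
constexpr bool is_alias_of_standard_signed() {
    return std::is_same<T, signed char>::value ||
           std::is_same<T, short>::value ||
           std::is_same<T, int>::value ||
           std::is_same<T, long>::value ||
           std::is_same<T, long long>::value;
}

// Overloads on fundamental types are fine...
inline int rank_of(int)       { return 1; }
inline int rank_of(long)      { return 2; }
inline int rank_of(long long) { return 3; }
// ...but adding rank_of(std::int64_t) would redefine one of them
// (which one depends on the data model), so it is left commented out:
// inline int rank_of(std::int64_t) { return 4; }
```

Because of this, overloading or specializing templates on both `int64_t` and `long`/`long long` is not portable.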
Fixed Width Integers 3/3
Warning: I/O streams interpret uint8_t and int8_t as char, not as integer
values
int8_t var;
cin >> var;      // read the character '2' (ASCII code 50), not the number 2
cout << var;     // print '2'
int a = var * 2; // 50 * 2
cout << a;       // print '100' !!
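A common workaround is to convert the value to a wider integer type before streaming it. The sketch below uses string streams so that the effect is easy to observe; the helper names are hypothetical, and it assumes `int8_t` is an alias of `signed char`, as on common platforms.

```cpp
#include <cstdint>
#include <sstream>
#include <string>

// Hypothetical helper: formats an 8-bit integer as a number.
std::string to_decimal_string(std::int8_t value) {
    std::ostringstream out;
    out << static_cast<int>(value);  // the cast selects the integer overload
    return out.str();
}

// Hypothetical helper: shows what the stream does without the cast.
std::string to_raw_stream_string(std::int8_t value) {
    std::ostringstream out;
    out << value;  // treated as a character
    return out.str();
}
```

With the value 50, the first helper produces the text "50", while the second produces the single character "2".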
size_t
size_t <cstddef>
size_t is an alias for an unsigned integer type capable of storing the size of the
largest object representable on the current architecture
• size_t is the return type of sizeof() and is commonly used to represent sizes
ptrdiff_t ⋆
ptrdiff_t <cstddef>
ptrdiff_t is an alias for a signed integer type used to store the result of the difference
between pointers or iterators
uintptr_t ⋆
uintptr_t <cstdint>
uintptr_t (C++11) is an unsigned integer type that can be converted from and to a
void pointer
• sizeof(uintptr_t) == sizeof(void*) on common platforms
• uintptr_t is an optional type of the standard and compilers may not provide it
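A minimal sketch of typical `uintptr_t` usage follows. The helper names are hypothetical. The pointer-to-integer mapping is implementation-defined, so the alignment check assumes a flat address space, as on mainstream platforms.

```cpp
#include <cstdint>

// Round trip: pointer -> integer -> pointer yields the original pointer.
inline int* round_trip(int* ptr) {
    auto address = reinterpret_cast<std::uintptr_t>(ptr);
    return reinterpret_cast<int*>(address);
}

// Hypothetical helper: checks whether a pointer is aligned to `alignment`
// bytes (alignment must be a power of two).
inline bool is_aligned(const void* ptr, std::uintptr_t alignment) {
    auto address = reinterpret_cast<std::uintptr_t>(ptr);
    return (address & (alignment - 1)) == 0;
}
```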
Arithmetic Operation Semantics
Overflow: the result of an arithmetic operation exceeds the word length, namely the
largest positive/negative representable values
Signed/Unsigned Integer Characteristics
Apart from undefined behavior, signed and unsigned integers use exactly the same
hardware, and they are equivalent at the binary level thanks to the two's complement
representation
#include <climits>
int a1 = INT_MAX;
int b1 = a1 + 4;      // 10000000000000000000000000000011 in hardware (undefined behavior in C++)
unsigned a2 = INT_MAX;
unsigned b2 = a2 + 4; // 10000000000000000000000000000011
However, signed and unsigned integers have different semantics in C++. The
compiler can exploit undefined behavior to optimize the code even if such operations
are well-defined at hardware level
Signed Integer
Signed Integer - Problems
Unsigned Integer
Discontinuity at 0 and $2^{32} - 1$
When to Use Signed/Unsigned Integers? 1/2
Because of historical accident, the C++ standard also uses unsigned integers to
represent the size of containers - many members of the standards body believe this
to be a mistake, but it is effectively impossible to fix at this point
When to Use Signed/Unsigned Integers? 2/2
std::numeric_limits<int>::max();       // 2^31 - 1
std::numeric_limits<uint16_t>::max();  // 65,535
std::numeric_limits<int>::min();       // -2^31
std::numeric_limits<unsigned>::min();  // 0
Promotion and Truncation
Mixing Signed/Unsigned Errors 1/3
Easy case:
Mixing Signed/Unsigned Errors 3/3
A real-world case:
// allocate a zero rtx vector of N elements
//
// sizeof(struct rtvec_def) == 16
// sizeof(rtunion) == 8
rtvec rtvec_alloca(int n) {
    rtvec rt;
    int i;
    rt = (rtvec)obstack_alloc(
        rtl_obstack,
        // (n - 1) is an int; the multiplication converts it to size_t
        sizeof(struct rtvec_def) + ((n - 1) * sizeof(rtunion)));
    // ...
    return rt;
}
Garbage In, Garbage Out: Arguing about Undefined Behavior with Nasal Daemons,
Chandler Carruth, CppCon 2016
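The mixed signed/unsigned expression in the allocation above can be isolated in a small sketch. The constants and function names are hypothetical stand-ins, with the sizes taken from the comments of the snippet. The sketch only illustrates the usual arithmetic conversions. It makes no claim about the conclusions drawn in the talk.

```cpp
#include <cstddef>

// Stand-ins for the sizes quoted in the snippet's comments (assumed values).
constexpr std::size_t kHeaderSize  = 16;  // sizeof(struct rtvec_def)
constexpr std::size_t kElementSize = 8;   // sizeof(rtunion)

// Same shape as the original expression: an int mixed with size_t.
inline std::size_t mixed_term(int n) {
    // (n - 1) is computed as int, then converted to size_t for the product.
    return (n - 1) * kElementSize;
}

inline std::size_t allocation_size(int n) {
    return kHeaderSize + mixed_term(n);
}
```

For `n == 0`, `mixed_term` yields a huge unsigned value (the conversion of `-1`), and the final sum wraps around modulo $2^N$ back to 8. The intermediate values are therefore not what a reader of the signed expression would expect.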
Undefined Behavior 1/7
The C++ standard does not prescribe any specific behavior (undefined behavior) for
several signed/unsigned integer arithmetic operations
• Signed integer overflow/underflow
int x = std::numeric_limits<int>::max() + 20; // undefined behavior!!
• Initializing a signed integer with a value outside its range is not undefined behavior, but it is
error-prone: the result is implementation-defined until C++20 and wraps around (modulo 2^32 for a 32-bit int) since C++20
int z = 3000000000; // out-of-range conversion, the value is not preserved
• A shift amount larger than or equal to the #bits of the data type is undefined behavior, even for unsigned
unsigned y = 1u << 32u; // undefined behavior!!
# include <climits>
# include <cstdio>
Undefined Behavior - Signed Overflow Example 2 4/7
Undefined Behavior - Division by Zero Example 5/7
src/backend/utils/adt/int8.c of PostgreSQL
if (arg2 == 0) {
ereport(ERROR, (errcode(ERRCODE_DIVISION_BY_ZERO), // the compiler is not aware
errmsg("division by zero"))); // that this function
} // doesn't return
/* No overflow is possible */
PG_RETURN_INT32((int32) arg1 / arg2); // the compiler assumes that the divisor is
// non-zero and can move this statement on
// the top (always executed)
Undefined Behavior: What Happened to My Code?
Undefined Behavior - Implicit Overflow Example 6/7
#include <iostream>
int main() {
    for (int i = 0; i < 4; ++i)
        std::cout << i * 1000000000 << std::endl;
}
// with optimizations, it can become an infinite loop
// --> i * 1000000000 > INT_MAX for i >= 3
// undefined behavior!!
Why does this loop produce undefined behavior?
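One way to remove the undefined behavior from the loop above is to perform the multiplication in a wider type. The following is a minimal sketch with a hypothetical function name that collects the values instead of printing them.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical rewrite of the loop body: the multiplication is done in
// 64 bits, so i * 1000000000 is always representable.
inline std::vector<std::int64_t> scaled_indices() {
    std::vector<std::int64_t> values;
    for (int i = 0; i < 4; ++i)
        values.push_back(static_cast<std::int64_t>(i) * 1000000000);
    return values;
}
```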
Undefined Behavior - Common Loops 7/7
Overflow / Underflow
Detecting overflow/underflow for signed integral types is even harder and must be
checked before performing the operation
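A minimal sketch of such a check for signed addition follows. The function name is hypothetical. The test is written so that none of its own subexpressions can overflow. GCC and Clang also offer the compiler-specific `__builtin_add_overflow` for the same purpose.

```cpp
#include <limits>

// Hypothetical helper: returns true if a + b would overflow or underflow
// the range of int. It must be called before computing a + b.
constexpr bool add_would_overflow(int a, int b) {
    return (b > 0 && a > std::numeric_limits<int>::max() - b) ||
           (b < 0 && a < std::numeric_limits<int>::min() - b);
}
```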
Saturation Arithmetic ⋆
C++26 adds four main functions, plus a cast, to perform saturation arithmetic with integer types
in the <numeric> library. In other words, the undefined behavior or the wrap-around
behavior for overflow/underflow is replaced by saturation values, namely the minimum
or maximum values of the result type
• T add_sat(T x, T y)
• T sub_sat(T x, T y)
• T mul_sat(T x, T y)
• T div_sat(T x, T y)
• R saturate_cast<R>(T x)
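For compilers without C++26 support, the semantics of saturating addition can be sketched by hand. The name `add_sat_sketch` is hypothetical and the function handles only `int`. It is an illustration of the behavior, not the standard library implementation.

```cpp
#include <limits>

// Hypothetical stand-in for saturating addition, written for int only.
constexpr int add_sat_sketch(int x, int y) {
    return (y > 0 && x > std::numeric_limits<int>::max() - y)
               ? std::numeric_limits<int>::max()   // clamp on overflow
           : (y < 0 && x < std::numeric_limits<int>::min() - y)
               ? std::numeric_limits<int>::min()   // clamp on underflow
               : x + y;                            // representable result
}
```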
Floating-point Types and Arithmetic
IEEE Floating-Point Standard
128/256-bit Floating-Point
16-bit Floating-Point
half-precision-arithmetic-fp16-versus-bfloat16
8-bit Floating-Point (Non-Standardized in C++/IEEE)
• E4M3
• E5M2
• Posit (John Gustafson, 2017), also called unum III (universal number), represents
floating-point values with a variable-width exponent and mantissa.
It is implemented in experimental platforms
• Fixed-point representation has a fixed number of digits after the radix point
(decimal point). The gaps between adjacent numbers are always equal. The range
of their values is significantly limited compared to floating-point numbers.
It is widely used on embedded systems
• OCP Microscaling Formats (MX) Specification
Floating-point Representation 1/2
Floating-point number:
• Radix (or base): β
• Precision (or digits): p
• Exponent (magnitude): e
• Mantissa: M
$n = \underbrace{M}_{p} \times \beta^{e}$ → IEEE754: $1.M \times 2^{e}$
• For a single-precision number, the exponent is stored in the range [1, 254] (0 and 255
have special meanings), and is biased by subtracting 127 to get an exponent value in the
range [−126, +127]
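0 10000111 11000000000000000000000
Sign: 0 → +
Exponent: $2^{135-127} = 2^{8}$
Mantissa: $\frac{1}{2^{1}} + \frac{1}{2^{2}} = 0.5 + 0.25 = 0.75$ → normal number: $1.75$
Value: $+1.75 \times 2^{8} = 448$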
Normal number
A normal number is a floating point value that can be represented with at least one
bit set in the exponent or the mantissa has all 0s
Denormal number
Denormal (or subnormal) numbers fill the underflow gap around zero in
floating-point arithmetic. Any non-zero number with magnitude smaller than the
smallest normal number is denormal
A denormal number is a floating point value that can be represented with all 0s in
the exponent, but the mantissa is non-zero
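The bit-level definitions above can be checked with a small decoder. This is a sketch under the assumption that `float` is the 32-bit IEEE754 single-precision format. The struct and function names are hypothetical.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper struct: the three bit fields of a 32-bit float.
struct FloatFields {
    std::uint32_t sign;      // 1 bit
    std::uint32_t exponent;  // 8 bits, biased by 127
    std::uint32_t mantissa;  // 23 bits
};

inline FloatFields decode(float value) {
    static_assert(sizeof(float) == sizeof(std::uint32_t), "32-bit float expected");
    std::uint32_t bits;
    std::memcpy(&bits, &value, sizeof(bits));  // well-defined way to read the bits
    return {bits >> 31, (bits >> 23) & 0xFFu, bits & 0x7FFFFFu};
}

// Denormal: all 0s in the exponent and a non-zero mantissa.
inline bool is_denormal(float value) {
    FloatFields fields = decode(value);
    return fields.exponent == 0 && fields.mantissa != 0;
}
```

Decoding `448.0f` gives sign 0, exponent 135 and mantissa bits `1100...0`, which matches the bit pattern `0 10000111 11000000000000000000000`.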
Floating-point - Normal/Denormal 2/2
Floating-point representation, by Carl Burch
Infinity ∞ 1/2
Infinity
In the IEEE754 standard, inf (infinity value) is a numeric data type value that
exceeds the maximum (or minimum) representable value
Not a Number (NaN) 1/2
NaN
In the IEEE754 standard, NaN (not a number) is a numeric data type value
representing an undefined or non-representable value
There are many representations for NaN (e.g. $2^{24} - 2$ for float)
The specific (bitwise) NaN value returned by an operation is implementation/compiler
specific
cout << 0 / 0; // undefined behavior
cout << 0.0 / 0.0; // print "nan" or "-nan"
quiet NaN
Machine Epsilon
Machine epsilon
Machine epsilon ε (or machine accuracy) is the difference between 1.0 and the next
larger representable value. It is often described as the smallest number that
can be added to 1.0 to give a number other than one
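Machine epsilon can be found experimentally and compared with the value reported by the standard library. The sketch below uses a hypothetical function name and assumes round-to-nearest mode and `float` arithmetic evaluated without extra precision, as on x86-64 and AArch64.

```cpp
#include <limits>

// Hypothetical helper: halves a candidate until adding half of it
// to 1.0f no longer produces a value different from 1.0f.
inline float find_epsilon() {
    float eps = 1.0f;
    while (true) {
        float sum = 1.0f + eps / 2.0f;  // stored as float on purpose
        if (sum == 1.0f)
            break;
        eps /= 2.0f;
    }
    return eps;
}
```

Under the stated assumptions the result equals `std::numeric_limits<float>::epsilon()`, namely $2^{-23}$.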
Units at the Last Place (ULP)
ULP
Units at the Last Place is the gap between consecutive floating-point numbers
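$\mathrm{ULP}(p, e) = \beta^{e-(p-1)} \rightarrow 2^{e-(p-1)}$
Example:
$\beta = 10$, $p = 3$
$\pi = 3.1415926\ldots \rightarrow x = 3.14 \times 10^{0}$
$\mathrm{ULP}(3, 0) = 10^{-2} = 0.01$
Relation with ε:
• $\varepsilon = \mathrm{ULP}(p, 0)$
• $\mathrm{ULP}_x = \varepsilon \cdot \beta^{e(x)}$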
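The relation between ULP and ε can be observed with `std::nextafter`, which returns the adjacent representable value. The helper name `ulp_of` is hypothetical and the sketch assumes a positive, finite input below the largest float.

```cpp
#include <cmath>
#include <limits>

// Hypothetical helper: gap between x and the next larger float.
inline float ulp_of(float x) {
    return std::nextafter(x, std::numeric_limits<float>::infinity()) - x;
}
```

For instance, `ulp_of(1.0f)` equals ε, while `ulp_of(8.0f)` equals $\varepsilon \cdot 2^{3}$ because the exponent of 8 is 3.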
Floating-Point Representation of a Real Number
Relative Error: $\left| \frac{fl(x) - x}{x} \right| \le \frac{1}{2} \cdot \varepsilon$
Floating-point - Cheatsheet 1/3
• NaN (mantissa ≠ 0)
∗ 11111111 ***********************
• ± infinity
∗ 11111111 00000000000000000000000
• ±0
∗ 00000000 00000000000000000000000
Floating-point - Cheatsheet 2/3
Bias 7 15
                     bfloat16    float            double
Largest (±)                      2^128            2^1024
                                 3.4 · 10^38      1.8 · 10^308
Smallest (±)                     2^-126           2^-1022
                                 1.2 · 10^-38     2.2 · 10^-308
Smallest Denormal                2^-149           2^-1074
                                 1.4 · 10^-45     4.9 · 10^-324
Epsilon              2^-7        2^-23            2^-52
                     0.0078      1.2 · 10^-7      2.2 · 10^-16
Floating-point - Limits
# include <limits>
// T: float or double
std::numeric_limits<T>::infinity() // infinity
Floating-point Arithmetic Properties 1/3
⊙ ∈ {⊕, ⊖, ⊗, ⊘}
Floating-point Arithmetic Properties 2/3
(P1) In general, a op b ≠ a ⊙ b
Special Values Behavior
Zero behavior
• a ⊘ 0 = inf, a ∈ {finite − 0} [IEEE-754], undefined behavior in C++
• 0 ⊘ 0, inf ⊘ 0 = NaN [IEEE-754], undefined behavior in C++
• 0 ⊗ inf = NaN
• +0 = -0 but they have a different binary representation
Inf behavior
• inf ⊙ a = inf, a ∈ {finite − 0}
• inf ⊕⊗ inf = inf
• inf ⊖⊘ inf = NaN
• ± inf ⊙ ∓ inf = NaN
• ± inf = ± inf
NaN behavior
• NaN ⊙ a = NaN
• NaN ≠ a
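Some of the rules listed above can be verified without dividing by zero, by taking the special values from `std::numeric_limits`. This is a sketch with hypothetical function names. It assumes IEEE754 floats (`std::numeric_limits<float>::is_iec559`) and no fast-math compiler options.

```cpp
#include <cmath>
#include <limits>

// NaN != a holds even when a is the very same NaN.
inline bool nan_differs_from_itself() {
    float nan = std::numeric_limits<float>::quiet_NaN();
    return nan != nan;
}

// inf - inf and 0 * inf both produce NaN.
inline bool inf_operations_give_nan() {
    float inf = std::numeric_limits<float>::infinity();
    return std::isnan(inf - inf) && std::isnan(0.0f * inf);
}

// +0 and -0 compare equal but differ in the sign bit.
inline bool zeros_equal_but_distinct() {
    float pos = 0.0f;
    float neg = -0.0f;
    return pos == neg && std::signbit(neg) && !std::signbit(pos);
}
```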
Floating-Point Undefined Behavior
• Division by zero
e.g., $10^{8} / 0.0$
// functions
std::feclearexcept(FE_ALL_EXCEPT); // clear exception status
std::fetestexcept(<macro>); // returns a value != 0 if an
// exception has been detected
⋆
Detect Floating-point Errors 2/2
#include <cfenv>
#include <iostream>
int main() {
    std::feclearexcept(FE_ALL_EXCEPT);                   // clear
    auto x2 = 0.0 / 0.0;                                 // all compilers
    std::cout << (bool) std::fetestexcept(FE_INVALID);   // print true
    std::feclearexcept(FE_ALL_EXCEPT);                   // clear
    auto x4 = 1e38f * 10;                                // gcc: ok
    std::cout << (bool) std::fetestexcept(FE_OVERFLOW);  // print true
}
see What is the difference between quiet NaN and signaling NaN?
Floating-point Issues
Some Examples... 1/4
• Ariane 5: data conversion from a 64-bit floating-point value to a 16-bit signed integer → $137 million
• Patriot Missile: small chopping error at each operation, 100 hours of activity → 28 deaths
Some Examples... 2/4
Integer type is more accurate than floating type for large numbers
cout << 16777217; // print 16777217
cout << (int) 16777217.0f; // print 16777216!!
cout << (int) 16777217.0; // print 16777217, double ok
Some Examples... 3/4
Catastrophic Cancellation
Catastrophic cancellation (or loss of significance) refers to the loss of relevant
information in a floating-point computation that cannot be reverted
Two cases:
(C1) a ± b, where a ≫ b or b ≫ a. The value (or part of the value) of the smaller
number is lost
(C2) a − b, where a ≈ b. Most of the significant digits cancel out, and the result is
dominated by the rounding errors of a and b
Catastrophic Cancellation (case 1) - Granularity 2/5
while (x > 0)
x = x - y;
Catastrophic Cancellation (case 1) 4/5
Floating-point increment
float x = 0.0f;
for (int i = 0; i < 20000000; i++)
x += 1.0f;
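The effect of the increment loop can be made observable by returning the accumulator. The sketch below uses hypothetical function names and assumes round-to-nearest mode and `float` arithmetic without extra precision. A `double` accumulator is shown for comparison, since all the intermediate values fit exactly in its 53-bit significand.

```cpp
// Mirrors the loop above and returns the final value of the accumulator.
inline float count_with_float() {
    float x = 0.0f;
    for (int i = 0; i < 20000000; i++)
        x += 1.0f;  // stops growing once x reaches 2^24 = 16777216
    return x;
}

// Same loop with a double accumulator: every intermediate value is exact.
inline double count_with_double() {
    double x = 0.0;
    for (int i = 0; i < 20000000; i++)
        x += 1.0;
    return x;
}
```

Under these assumptions the float version ends at 16777216 instead of 20000000, because 1.0f is smaller than half of the gap between adjacent floats at that magnitude.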
Ceiling division $\left\lceil \frac{a}{b} \right\rceil$
// std::ceil((float) 101 / 2.0f) -> 50.5f -> 51
float x = std::ceil((float) 20000001 / 2.0f);
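For integer operands, ceiling division can be computed without leaving the integer domain, which avoids the rounding of the `float` conversion. The sketch below uses hypothetical helper names and assumes `a >= 0` and `b > 0`.

```cpp
#include <cmath>

// Hypothetical helper: ceiling division on integers, for a >= 0 and b > 0.
constexpr int ceil_div(int a, int b) {
    return a / b + (a % b != 0 ? 1 : 0);
}

// The floating-point variant from the snippet above, for comparison.
inline int ceil_div_float(int a, int b) {
    return static_cast<int>(
        std::ceil(static_cast<float>(a) / static_cast<float>(b)));
}
```

With `a = 20000001` and `b = 2`, the integer version returns 10000001, while the float version returns 10000000 because 20000001 is not representable as a `float`.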
$x_{1,2} = \frac{-b \pm \sqrt{b^{2} - 4ac}}{2a}$
$x^{2} + 5000x + 0.25$
(-5000 + std::sqrt(5000.0f * 5000.0f - 4.0f * 1.0f * 0.25f)) / 2 // x2
(-5000 + std::sqrt(25000000.0f - 1.0f)) / 2 // catastrophic cancellation (C1)
(-5000 + std::sqrt(25000000.0f)) / 2
(-5000 + 5000) / 2 = 0 // catastrophic cancellation (C2)
// correct result: -0.00005!!
relative error: $\frac{|0 - (-0.00005)|}{|-0.00005|} = 100\%$
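A common way to avoid the cancellation in the quadratic formula is to compute the root of larger magnitude first and derive the other one from the product of the roots, $x_1 x_2 = c / a$. This is a swapped-in textbook technique, not something stated in the slides. The function name is hypothetical, and the sketch assumes $a \ne 0$, a non-negative discriminant, and $b$ and $c$ not both zero.

```cpp
#include <cmath>

// Hypothetical helper: root of smaller magnitude of a*x^2 + b*x + c = 0.
inline float small_root(float a, float b, float c) {
    float sqrt_disc = std::sqrt(b * b - 4.0f * a * c);
    // b and copysign(sqrt_disc, b) have the same sign: the sum cannot cancel.
    float q = -0.5f * (b + std::copysign(sqrt_disc, b));
    return c / q;  // the root of larger magnitude is q / a
}
```

For $x^{2} + 5000x + 0.25$ this yields a value close to $-0.00005$ even in single precision.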
Floating-point Comparison 1/3
The problem
cout << (0.1 + 0.2 == 0.3); // print false!!
cout << (0.1 + 0.2 > 0.3);  // print true!!
Problems:
• A fixed epsilon “looks small”, but it could be too large when the numbers being compared
are very small
• If the compared numbers are very large, the epsilon could end up being smaller than the
smallest rounding error, so that the comparison always returns false
Floating-point Comparison 2/3
Solution: Use the relative error $\frac{|a - b|}{b} < \varepsilon$
bool areFloatNearlyEqual(float a, float b) {
    if (std::abs(a - b) / b < epsilon) // epsilon is fixed
        return true;
    return false;
}
Problems:
• a=0, b=0: the division is evaluated as 0.0/0.0 and the whole if statement is (nan <
epsilon), which always returns false
• b=0: the division is evaluated as abs(a)/0.0 and the whole if statement is (+inf <
epsilon), which always returns false
• a and b very small: the result should be true, but the division by b may produce
wrong results
• It is not commutative: we always divide by b
Floating-point Comparison 3/3
Possible solution: $\frac{|a - b|}{\max(|a|, |b|)} < \varepsilon$
bool areFloatNearlyEqual(float a, float b) {
constexpr float normal_min = std::numeric_limits<float>::min();
constexpr float relative_error = <user_defined>
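The snippet above is incomplete in this text. The following is a self-contained sketch of a comparison based on $\frac{|a - b|}{\max(|a|, |b|)}$, written under assumptions and not necessarily the author's version. The function name, the default tolerance of `1e-5f`, and the treatment of non-finite inputs are all choices made for illustration.

```cpp
#include <algorithm>
#include <cmath>
#include <limits>

// Hypothetical sketch: relative comparison that is symmetric in a and b.
inline bool nearly_equal_sketch(float a, float b,
                                float relative_error = 1e-5f) {  // assumed tolerance
    constexpr float normal_min = std::numeric_limits<float>::min();
    if (!std::isfinite(a) || !std::isfinite(b))  // NaN or infinity
        return false;
    float diff = std::abs(a - b);
    // Near zero the relative error is not meaningful: fall back to an
    // absolute check against the smallest normal number.
    if (diff <= normal_min)
        return true;
    return diff / std::max(std::abs(a), std::abs(b)) <= relative_error;
}
```

Dividing by the larger magnitude makes the result independent of the argument order, and the early returns cover the 0/0 and x/0 cases listed for the previous version.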
• Try to reorganize the computation to keep near numbers with the same scale
(e.g. sorting numbers)
Suggested readings:
• What Every Computer Scientist Should Know About Floating-Point Arithmetic
• Do Developers Understand IEEE Floating Point?
• Yet another floating point tutorial
• Unavoidable Errors in Computing
Ken Shirriff: Want to adjust your computer’s floating point precision by turning
a knob? You could do that on the System/360 Model 44
On Floating-Point