What Is IEEE Standard Arithmetic?

Nicholas J. Higham∗
May 7, 2020

The IEEE Standard 754, published in 1985 and revised in 2008 and 2019, is a standard
for binary and decimal floating-point arithmetic. The standard for decimal arithmetic
(IEEE Standard 854) was separate when it was first published in 1987, but it was included
with the binary standard from 2008. We focus here on the binary part of the standard.
The standard specifies floating-point number formats, the results of the basic floating-
point operations and comparisons, rounding modes, floating-point exceptions and their
handling, and conversion between different arithmetic formats.
A binary floating-point number is represented as
$$y = \pm m \times 2^{e-t},$$
where $t$ is the precision and $e \in [e_{\min}, e_{\max}]$ is the exponent. The significand $m$ is an
integer satisfying $m \le 2^t - 1$. Numbers with $m \ge 2^{t-1}$ are called normalized. Subnormal
numbers, for which $0 < m < 2^{t-1}$ and $e = e_{\min}$, are supported.
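As a rough illustration of this representation, the following Python sketch recovers the sign, the integer significand $m$, and the exponent $e$ of an fp64 number. The function name `decompose` and the constants `T`, `EMIN`, and `EMAX` are illustrative choices, not part of the standard; the exponent limits are the values implied for fp64 by the convention $y = \pm m \times 2^{e-t}$.

```python
import math

T = 53                      # precision t of fp64
EMIN, EMAX = -1021, 1024    # fp64 exponent range in the y = m * 2**(e - t) convention

def decompose(y):
    """Return (sign, m, e) such that y == sign * m * 2**(e - T) for a finite nonzero fp64 y."""
    if y == 0 or math.isinf(y) or math.isnan(y):
        raise ValueError("y must be finite and nonzero")
    f, e = math.frexp(abs(y))    # abs(y) == f * 2**e with 0.5 <= f < 1
    m = int(f * 2**T)            # exact, because scaling by a power of 2 is exact
    if e < EMIN:                 # subnormal: pin the exponent at EMIN and shrink m
        m >>= EMIN - e
        e = EMIN
    sign = -1 if math.copysign(1.0, y) < 0 else 1
    return sign, m, e

for value in (1.0, 0.1, 5e-324):
    sign, m, e = decompose(value)
    kind = "normalized" if m >= 2**(T - 1) else "subnormal"
    print(f"{value!r}: m = {m}, e = {e} ({kind})")
```

For a normalized number the recovered $m$ lies in $[2^{52}, 2^{53} - 1]$; for the smallest subnormal number the sketch yields $m = 1$ with $e = e_{\min}$.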
Four formats are defined, whose key parameters are summarized in the following table.
The second column shows the number of bits allocated to store the significand and the
exponent. I use the prefix “fp” instead of the prefix “binary” used in the standard. The
unit roundoff is $u = 2^{-t}$.

Fp32 (single precision) and fp64 (double precision) were in the 1985 standard; fp16 (half
precision) and fp128 (quadruple precision) were introduced in 2008. Fp16 is defined only
as a storage format, though it is widely used for computation.
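The unit roundoff of each format follows directly from its precision. The sketch below assumes the usual parameter values for the four formats (precision $t$ including the implicit leading bit, and exponent field width), entered by hand rather than taken from the text; the dictionary `FORMATS` and the function `unit_roundoff` are illustrative names.

```python
import sys
from fractions import Fraction

# (precision t, exponent bits) for each format; assumed standard values.
FORMATS = {
    "fp16": (11, 5),
    "fp32": (24, 8),
    "fp64": (53, 11),
    "fp128": (113, 15),
}

def unit_roundoff(t):
    """Unit roundoff u = 2**(-t) as an exact rational number."""
    return Fraction(1, 2**t)

for name, (t, exp_bits) in FORMATS.items():
    u = unit_roundoff(t)
    print(f"{name}: {t}+{exp_bits} bits, u = 2^-{t}, about {float(u):.1e}")

# Python floats are fp64 on typical platforms; sys.float_info.epsilon is the
# spacing of floats just above 1, which is 2u rather than u.
print(Fraction(sys.float_info.epsilon) == 2 * unit_roundoff(53))
```

Note that the value many languages call "machine epsilon" is $2u$ in this notation.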
The size of these different number systems varies greatly. The next table shows the
number of normalized and subnormal numbers in each system.

∗ Department of Mathematics, University of Manchester, Manchester, M13 9PL, UK
([email protected]).

We see that while one can easily carry out a computation on every fp16 number (to check
that the square root function is correctly computed, for example), it is impractical to do
so for every double precision number.
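The sizes of these number systems can be estimated with a short calculation. This sketch assumes the usual encoding, in which the all-zeros and all-ones exponent fields are reserved (for zero and subnormals, and for infinities and NaNs, respectively); `count_numbers` is an illustrative name and the parameter pairs are assumed standard values.

```python
def count_numbers(t, exp_bits):
    """Return (normalized, subnormal) counts for precision t and exp_bits exponent bits."""
    n_exponents = 2**exp_bits - 2              # two exponent field values are reserved
    normalized = 2 * 2**(t - 1) * n_exponents  # sign * significands * exponents
    subnormal = 2 * (2**(t - 1) - 1)           # 0 < m < 2**(t-1), both signs
    return normalized, subnormal

for name, (t, exp_bits) in {"fp16": (11, 5), "fp32": (24, 8),
                            "fp64": (53, 11), "fp128": (113, 15)}.items():
    normalized, subnormal = count_numbers(t, exp_bits)
    print(f"{name}: {normalized:.1e} normalized, {subnormal:.1e} subnormal")
```

With roughly $6 \times 10^4$ normalized fp16 numbers against roughly $2 \times 10^{19}$ for fp64, exhaustive testing is feasible only for the smaller formats.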
A key feature of the standard is that it is a closed system, thanks to the inclusion of
NaN (Not a Number) and ∞ (usually written as inf in programming languages) as
floating-point numbers: every arithmetic operation produces a number in the system. A NaN is
generated by operations such as
$$0/0, \quad 0 \times \infty, \quad \infty/\infty, \quad (+\infty) + (-\infty), \quad \sqrt{-1}.$$

Arithmetic operations involving a NaN return a NaN as the answer. The number ∞
obeys the usual mathematical conventions regarding infinity, such as

∞ + ∞ = ∞, (−1) × ∞ = −∞, (finite)/∞ = 0.

This means, for example, that 1 + 1/x evaluates as 1 when x = ∞.
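These conventions can be observed directly in Python, whose floats are fp64 on typical platforms. One caveat: Python itself raises `ZeroDivisionError` for `0.0 / 0.0` and `ValueError` for `math.sqrt(-1.0)` instead of returning NaN, which is a language design choice rather than IEEE behavior, so the sketch below sticks to operations that Python passes through unchanged. The variable names are illustrative.

```python
import math

inf = math.inf

# Operations with no meaningful value produce NaN.
nan_examples = {
    "0 * inf": 0.0 * inf,
    "inf / inf": inf / inf,
    "inf + (-inf)": inf + (-inf),
}
for label, value in nan_examples.items():
    print(label, "->", value)

# A NaN propagates through arithmetic and compares unequal to itself.
nan = math.nan
propagated = nan + 1.0
print(math.isnan(propagated), nan == nan)

# Infinity follows the usual mathematical conventions.
inf_sum = inf + inf
neg_inf = -1.0 * inf
zero = 42.0 / inf

# 1 + 1/x evaluates to 1 when x is infinite.
x = inf
limit_value = 1 + 1 / x
print(inf_sum, neg_inf, zero, limit_value)
```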

The standard specifies that all arithmetic operations (including square root) are to
be performed as if they were first calculated to infinite precision and then rounded to
the target format. A number is rounded to the next larger or next smaller floating-point
number according to one of four rounding modes:
• round to the nearest floating-point number, with rounding to even (rounding to the
number with a zero least significant bit) in the case of a tie;
• round towards plus infinity and round towards minus infinity (used in interval
arithmetic); and
• round towards zero (truncation, or chopping).
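The four rounding modes listed above can be mimicked on exact rational numbers. The following is a simplified sketch, not an implementation of the standard: it rounds to $t$ significant bits with an unbounded exponent range, so overflow and underflow are ignored. The function name `round_to_precision` and the mode strings are illustrative.

```python
from fractions import Fraction

def round_to_precision(x, t, mode="nearest"):
    """Round the rational number x to t significant bits (unbounded exponent)."""
    x = Fraction(x)
    if x == 0:
        return x
    sign = 1 if x > 0 else -1
    scaled = abs(x)
    shift = 0
    # Scale so that 2**(t-1) <= scaled < 2**t; then x == sign * scaled * 2**shift.
    while scaled >= 2**t:
        scaled /= 2
        shift += 1
    while scaled < 2**(t - 1):
        scaled *= 2
        shift -= 1
    m = scaled.numerator // scaled.denominator   # significand truncated toward zero
    frac = scaled - m                            # discarded part, 0 <= frac < 1
    half = Fraction(1, 2)
    if mode == "nearest":    # ties go to the even significand
        bump = frac > half or (frac == half and m % 2 == 1)
    elif mode == "up":       # toward plus infinity
        bump = frac > 0 and sign > 0
    elif mode == "down":     # toward minus infinity
        bump = frac > 0 and sign < 0
    elif mode == "zero":     # truncation
        bump = False
    else:
        raise ValueError(f"unknown rounding mode: {mode}")
    if bump:
        m += 1
    return sign * m * Fraction(2) ** shift

# 9/8 = 1.001 in binary lies halfway between the 3-bit numbers 1.00 and 1.01.
for mode in ("nearest", "up", "down", "zero"):
    print(mode, round_to_precision(Fraction(9, 8), 3, mode))
```

As a sanity check, rounding the rational number 1/10 to 53 bits in the default mode reproduces the exact value of the Python float `0.1`.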
For round to nearest it follows that
$$\mathrm{fl}(x \mathbin{\mathrm{op}} y) = (x \mathbin{\mathrm{op}} y)(1 + \delta), \quad |\delta| \le u, \quad \mathrm{op} \in \{+, -, *, /, \sqrt{\ }\},$$

where fl denotes the computed result. The standard also includes a fused multiply-add
operation (FMA), x ∗ y + z. The definition requires it to be computed with just one
rounding error, so that fl(x ∗ y + z) is the rounded version of x ∗ y + z, and hence satisfies

fl(x ∗ y + z) = (x ∗ y + z)(1 + δ), |δ| ≤ u.
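The round-to-nearest model fl(x op y) = (x op y)(1 + δ) with |δ| ≤ u can be spot-checked for the four basic operations by comparing a floating-point result with the exact rational result. This sketch assumes Python floats are fp64 with the default rounding mode and that no overflow or underflow occurs; `relative_error` and the test operands are illustrative choices.

```python
import operator
from fractions import Fraction

U = Fraction(1, 2**53)   # unit roundoff of fp64

def relative_error(op, x, y):
    """Return delta such that fl(x op y) == (x op y) * (1 + delta), computed exactly."""
    computed = Fraction(op(x, y))              # exact value of the rounded result
    exact = op(Fraction(x), Fraction(y))       # exact rational result
    return (computed - exact) / exact

OPS = {"+": operator.add, "-": operator.sub,
       "*": operator.mul, "/": operator.truediv}

deltas = {name: relative_error(op, 0.1, 0.3) for name, op in OPS.items()}
for name, delta in deltas.items():
    print(f"op {name}: |delta| / u = {float(abs(delta) / U):.3f}")
```

Each printed ratio should be at most 1; the operands themselves carry rounding errors from decimal conversion, but the model concerns only the error of the operation on the stored values.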

FMAs are supported in some hardware and are usually executed at the same speed as a
single addition or multiplication.
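The effect of the single rounding in an FMA can be illustrated by emulating it with exact rational arithmetic. This is a sketch for illustration, not how hardware implements the operation; `fma_emulated` is a hypothetical helper, and it relies on the conversion from `Fraction` to `float` being correctly rounded in CPython.

```python
from fractions import Fraction

def fma_emulated(x, y, z):
    """Emulate x*y + z with one rounding: compute exactly, then round once to fp64."""
    exact = Fraction(x) * Fraction(y) + Fraction(z)
    return float(exact)

x = 1.0 + 2.0**-30
y = 1.0 - 2.0**-30
z = -1.0

two_roundings = x * y + z            # x*y = 1 - 2**-60 rounds to 1.0, then 1.0 - 1.0
one_rounding = fma_emulated(x, y, z)

print(two_roundings)   # 0.0
print(one_rounding)    # -2**-60, the exact result
```

Here the separately rounded product loses all the information that the subsequent subtraction needs, whereas the singly rounded result is exact.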
The standard recommends the provision of correctly rounded exponentiation ($x^y$) and
transcendental functions (exp, log, sin, acos, etc.) and defines domains and special values
for them, but these functions are not required.
A new feature of the 2019 standard is augmented arithmetic operations, which compute
fl(x op y) along with the error x op y − fl(x op y), for op = +, −, ∗. These operations
are useful for implementing compensated summation and other special high accuracy
algorithms.
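A long-standing software analogue of augmented addition is the TwoSum algorithm usually attributed to Knuth, which returns fl(x + y) together with the rounding error using only ordinary floating-point operations. It is shown here as a sketch of the idea, not as the operation defined in the 2019 standard, whose rounding rule for ties differs; it assumes round to nearest and no overflow, and `two_sum` is an illustrative name.

```python
from fractions import Fraction

def two_sum(a, b):
    """Return (s, err) with s = fl(a + b) and a + b == s + err exactly."""
    s = a + b
    b_virtual = s - a            # the part of b that made it into s
    a_virtual = s - b_virtual    # the part of a that made it into s
    err = (a - a_virtual) + (b - b_virtual)
    return s, err

s, err = two_sum(0.1, 0.2)
print(s, err)
# The pair (s, err) represents the exact sum of the two stored operands.
print(Fraction(s) + Fraction(err) == Fraction(0.1) + Fraction(0.2))
```

Compensated summation applies the same idea repeatedly, carrying the error term forward so that it can be added back into the running sum.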
William (“Velvel”) Kahan of the University of California at Berkeley received the
1989 ACM Turing Award for his contributions to computer architecture and numerical
analysis, and in particular for his work on IEEE floating-point arithmetic standards 754
and 854.

References
This is a minimal set of references, which contain further useful references within.
• D. Goldberg, What Every Computer Scientist Should Know About Floating-Point
Arithmetic, ACM Computing Surveys 23, 5–48, 1991.
• Nicholas J. Higham, Accuracy and Stability of Numerical Algorithms, second edi-
tion, Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 2002.
• IEEE Standard for Floating-Point Arithmetic, IEEE Std 754-2019 (Revision of
IEEE 754-2008), The Institute of Electrical and Electronics Engineers, New York,
2019.
• Jean-Michel Muller, Nicolas Brunie, Florent de Dinechin, Claude-Pierre Jeannerod,
Mioara Joldes, Vincent Lefèvre, Guillaume Melquiond, Nathalie Revol, and Serge
Torres, Handbook of Floating-Point Arithmetic, second edition, Birkhäuser, Boston,
MA, 2018.

Related Blog Posts

• A Multiprecision World (2017)
• Book Review Revisited: Overton’s Numerical Computing with IEEE Floating Point
Arithmetic (2014)
• Floating Point Numbers by Cleve Moler (2014)
• Half Precision Arithmetic: fp16 Versus bfloat16 (2018)
• The Rise of Mixed Precision Arithmetic (2015)
• What Is Rounding? (2020)
• What Is Floating-Point Arithmetic? (2020)

This article is part of the “What Is” series, available from https://nhigham.com/category/what-is
and in PDF form from the GitHub repository https://github.com/higham/what-is.
