Module 2.1

The document explains the IEEE 754 standard for floating-point representation, detailing its components, formats (single and double precision), and examples of converting decimal numbers to binary. It outlines the advantages and disadvantages of using floating-point representation, as well as its real-world applications in fields like scientific computing and machine learning. Additionally, it covers the processes for floating-point addition and subtraction, highlighting the importance of aligning exponents and normalizing results.


Floating point representation with the Institute of Electrical and Electronics Engineers (IEEE) standard

Floating-point representation is used to store real numbers efficiently, allowing a wide
range of values with high precision. There are several ways to represent floating-point
numbers, but IEEE 754 is the most widely used. The IEEE 754 standard defines how
floating-point numbers are stored and manipulated in computers.

IEEE 754 Floating-Point Representation

IEEE 754 defines two commonly used formats:


 Single Precision (32-bit)
 Double Precision (64-bit)
Each format encodes a value from three components, combined as

Value = (−1)^S × M × 2^E

where IEEE 754 has 3 basic components (a short evaluation sketch follows this list):

 S (Sign bit): Determines if the number is positive (0) or negative (1).


 E (Exponent): Stores the exponent using a bias (127 for single-precision, 1023 for
double-precision).
 M (Mantissa or Significand): Stores the significant digits of the number.
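For instance, here is a minimal Python sketch (purely illustrative; the helper name ieee_value is made up) that evaluates this formula for a normalised number, taking the exponent bias into account:

```python
def ieee_value(sign: int, biased_exponent: int, fraction: float, bias: int = 127) -> float:
    """Evaluate (-1)^S * (1 + fraction) * 2^(E - bias) for a normalised number."""
    return (-1) ** sign * (1 + fraction) * 2 ** (biased_exponent - bias)

# Fields of 13.625 (worked out later in this document): S = 0, E = 130,
# fraction bits 101101 -> 0.5 + 0.125 + 0.0625 + 0.015625 = 0.703125
print(ieee_value(0, 130, 0.703125))   # 13.625
```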

IEEE 754 Formats

A. Single-Precision (32-bit) Format

Bits | Field | Description
1 bit | Sign (S) | 0 = Positive, 1 = Negative
8 bits | Exponent (E) | Biased exponent (bias = 127)
23 bits | Mantissa (M) | Fractional part (leading 1 is implicit)

B. Double-Precision (64-bit) Format


Bits | Field | Description
1 bit | Sign (S) | 0 = Positive, 1 = Negative
11 bits | Exponent (E) | Biased exponent (bias = 1023)
52 bits | Mantissa (M) | Fractional part (leading 1 is implicit)
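To make the field layout concrete, here is a minimal Python sketch (using only the standard struct module; the function name decode_single is made up) that unpacks a single-precision value into its sign, biased exponent and fraction bits:

```python
import struct

def decode_single(x: float):
    """Unpack a float into its IEEE 754 single-precision fields."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))   # raw 32-bit word
    sign = bits >> 31                   # 1 bit
    exponent = (bits >> 23) & 0xFF      # 8 bits, biased by 127
    fraction = bits & 0x7FFFFF          # 23 bits, implicit leading 1 not stored
    return sign, exponent, fraction

s, e, f = decode_single(13.625)
print(s, f"{e:08b}", f"{f:023b}")       # 0 10000010 10110100000000000000000
```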

IEEE 754 Example (Single Precision)

Example: Representing 13.625 in IEEE 754 (Single-Precision)

Step 1: Convert to Binary

 Convert 13 to binary: 13₁₀ = 1101₂
 Convert 0.625 to binary:
0.625 × 2 = 1.25 → 1
0.25 × 2 = 0.5 → 0
0.5 × 2 = 1.0 → 1
0.625₁₀ = 0.101₂
 Full binary representation: 13.625₁₀ = 1101.101₂
Step 2: Normalize the Binary Number
Convert to scientific notation:
1.101101 × 2^3
 Mantissa (M): 10110100000000000000000 (drop the leading 1, pad to 23 bits)
 Exponent (E): 3 + 127 = 130₁₀ = 10000010₂
Step 3: IEEE 754 Representation
Sign (1 bit) | Exponent (8 bits) | Mantissa (23 bits)
0 (positive) | 10000010 | 10110100000000000000000
Final IEEE 754 representation (in binary):
0 10000010 10110100000000000000000
Hexadecimal equivalent:
415A0000₁₆
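As a quick cross-check (a sketch using Python's standard struct module, not part of the worked example), packing 13.625 as a single-precision float reproduces the same bit pattern and hexadecimal value:

```python
import struct

bits = struct.unpack(">I", struct.pack(">f", 13.625))[0]
print(f"{bits:032b}")   # 01000001010110100000000000000000
print(f"{bits:08X}")    # 415A0000
```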

Example

85.125
85 = 1010101
0.125 = 001
85.125 = 1010101.001
=1.010101001 x 2^6
sign = 0
1. Single precision:
biased exponent 127+6=133
133 = 10000101
Normalised mantissa = 010101001
we will add 0's to complete the 23 bits

The IEEE 754 Single precision is:


= 0 10000101 01010100100000000000000
This can be written in hexadecimal form 42AA4000

2. Double precision:
biased exponent 1023+6=1029
1029 = 10000000101
Normalised mantissa = 010101001
we will add 0's to complete the 52 bits

The IEEE 754 Double precision is:


= 0 10000000101 0101010010000000000000000000000000000000000000000000
This can be written in hexadecimal form 4055480000000000
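Both encodings of 85.125 can be cross-checked the same way (a sketch using Python's standard struct module):

```python
import struct

single = struct.unpack(">I", struct.pack(">f", 85.125))[0]
double = struct.unpack(">Q", struct.pack(">d", 85.125))[0]
print(f"{single:08X}")    # 42AA4000
print(f"{double:016X}")   # 4055480000000000
```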

Special Cases in IEEE 754

Condition | Sign (S) | Exponent (E) | Mantissa (M) | Value
Zero | 0 or 1 | 00000000 | 00000000000000000000000 | ±0
Infinity | 0 or 1 | 11111111 | 00000000000000000000000 | ±∞
NaN (Not a Number) | 0 or 1 | 11111111 | Nonzero mantissa | NaN
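These special patterns can be inspected directly (a minimal Python sketch using the standard struct and math modules; the NaN payload shown is just one common quiet-NaN pattern):

```python
import math
import struct

def pattern32(x: float) -> str:
    """Format a value's 32-bit IEEE 754 pattern as sign | exponent | mantissa."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    s = f"{bits:032b}"
    return f"{s[0]} {s[1:9]} {s[9:]}"

print(pattern32(0.0))        # 0 00000000 00000000000000000000000
print(pattern32(math.inf))   # 0 11111111 00000000000000000000000
print(pattern32(math.nan))   # 0 11111111 10000000000000000000000 (a quiet NaN)
```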

Advantages

 Supports a wide range of numbers.


 Provides high precision.
 Handles very small and very large values efficiently.

Disadvantages

 Complex hardware implementation.


 Rounding errors can cause precision loss.
 More memory required compared to fixed-point representation.

Real-World Applications

 Scientific computing (e.g., physics simulations)


 Machine learning and AI computations
 Graphics processing (GPU computations)
 Financial calculations (floating-point arithmetic)

Example: Let us consider:

Suppose a number is stored using a 32-bit format: 1 bit for the sign, 8 bits for a signed exponent, and 23 bits for the fractional part. The leading bit 1 is not stored (as it is always 1 for a normalized number) and is referred to as a "hidden bit".
Then −53.5 is normalized as −53.5 = (−110101.1)₂ = (−1.101011)₂ × 2^5, which is represented as follows:

1 | 00000101 | 10101100000000000000000

where 1 is the sign bit (negative), 00000101 is the 8-bit binary value of the exponent value +5, and 10101100000000000000000 is the 23-bit fractional part.
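A one-line check of this normalisation (a sketch; the fraction bits 101011 correspond to 0.671875):

```python
# (1 + 0.671875) * 2^5, negated, reproduces the original value
print(-(1 + 0.671875) * 2 ** 5)   # -53.5
```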


Real numbers
Real numbers are numbers that include fractions/values after the decimal point.
For example, 123.75 is a real number.
Floating point representation
Real numbers are stored in a computer as floating point numbers using a mantissa (m),
a base (b) and an exponent (e) in this format:
m × b^e
Example (in decimal)
This example gives a general idea of the role of the mantissa, base and exponent. It does not
fully reflect the computer's method for storing real numbers.
The number 123.75 can be represented as a floating point number. To do this, move all the
digits so that the most significant digit is to the right of the decimal point:
123.75 → 0.12375
The number after the decimal point is the mantissa (m).
As this number is written in decimal (denary), the base (b) is 10 .
To work out the exponent (e) count how many decimal places you have moved the decimal
point by (in this case three). So we can represent 123.75 in floating point representation as
this:
0.12375 × 10^3
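A small sketch of this decimal shift-and-count idea (the helper normalise_decimal is made up for illustration; Python's decimal module keeps the arithmetic exact):

```python
from decimal import Decimal

def normalise_decimal(x: Decimal):
    """Move the decimal point left until the value is below 1, counting the moves."""
    exponent = 0
    while x >= 1:
        x = x.scaleb(-1)   # shift the decimal point one place to the left
        exponent += 1
    return x, exponent

print(normalise_decimal(Decimal("123.75")))   # (Decimal('0.12375'), 3)
```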
Example (in binary)
In the Higher course, all floating point representation is in binary.
First convert 123.75 to binary:

64 32 16 8 4 2 1 0.5 0.25
1 1 1 1 0 1 1 1 1
64 + 32 + 16 + 8 + 2 + 1 + 0.5 + 0.25 = 123.75
123.75 in binary is 1111011.11
Key fact
The computer will not store the actual decimal point as part of the floating point
number but it is used here for illustrative purposes.
To find the mantissa, move the decimal point to the right of the most significant bit of the
mantissa:
1111011.11 → 0.111101111
To calculate the exponent, count how many places the decimal point moved to give the
mantissa. In this case the decimal point moved seven places to the left:
So the exponent for our number is 7.
4 2 1
1 1 1
In binary, the number 7 is 111 as 4 + 2 + 1 = 7
In order to represent 123.75 the mantissa would be 111101111 and the exponent would be
111. This can be thought of as:
0.111101111 × 2^111 (with the exponent 111 written in binary)
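The same shift-and-count process can be sketched in Python (the helper to_mantissa_exponent is made up; it assumes a positive value that is an exact binary fraction, so every step stays exact):

```python
def to_mantissa_exponent(value: float, mantissa_bits: int = 15):
    """Normalise a positive value to 0.1xxx... in binary; return mantissa bits and exponent."""
    exponent = 0
    while value >= 1:        # move the point left until the value is below 1
        value /= 2
        exponent += 1
    while value < 0.5:       # make sure the first bit after the point is a 1
        value *= 2
        exponent -= 1
    bits = ""
    for _ in range(mantissa_bits):   # read off the mantissa one bit at a time
        value *= 2
        bit = int(value)
        bits += str(bit)
        value -= bit
    return bits, exponent

print(to_mantissa_exponent(123.75))   # ('111101111000000', 7)
```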
Sign bit
As well as the mantissa, base and exponent, we have a digit before the decimal point. This is
used as a sign bit and is represented in binary as a 0 for positive and a 1 for negative.
How many bits?
There will always be a trade-off between accuracy and range when using floating point
notation, as there will always be a set number of bits allocated to storing real numbers:
 increasing the number of bits devoted to the mantissa will improve the accuracy of a
floating point number
 increasing the number of bits devoted to the exponent will increase the range of
numbers that can be held
In the Higher course, floating point numbers are represented as follows (see the short packing sketch after this list):
 1 bit for the sign
 15 bits for the mantissa
 8 bits for the exponent
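As a sketch only (the sign, then mantissa, then exponent field order below is an assumption for illustration, not a statement of the official layout), the 123.75 example can be packed into such a 24-bit word:

```python
sign = "0"                    # positive
mantissa = "111101111000000"  # 15 bits, from the worked example above
exponent = f"{7:08b}"         # 8-bit exponent for 7 -> 00000111
word = sign + mantissa + exponent
print(word, len(word))        # 011110111100000000000111 24
```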
FLOATING POINT ADDITION AND SUBTRACTION

FLOATING POINT ADDITION


To understand floating-point addition, let us first look at the addition of real numbers in decimal, since the same logic applies in both cases.
For example, we have to add 1.1 × 10^3 and 50.
We cannot add these numbers directly. First, we need to align the exponents, and then we can add the significands.
After aligning the exponents, we get 50 = 0.05 × 10^3
Now adding the significands, 0.05 + 1.1 = 1.15
So, finally we get (1.1 × 10^3 + 50) = 1.15 × 10^3
Here, notice that we shifted 50 and made it 0.05 to add these numbers.
Now let us take an example of floating-point addition.
We follow these steps to add two numbers (the code sketch at the end of this example traces them):
1. Align the exponents (shift the significand of the number with the smaller exponent)
2. Add the significands
3. Normalize the result
Let the two numbers be
x = 9.75
y = 0.5625
Converting them into 32-bit floating point representation,
9.75’s representation in 32-bit format = 0 10000010 00111000000000000000000
0.5625’s representation in 32-bit format = 0 01111110 00100000000000000000000
Now we take the difference of the exponents to know how much shifting is required.
(10000010 − 01111110)₂ = (4)₁₀
Now, we shift the mantissa of the smaller number right by 4 bits.
Mantissa of 0.5625 = 1.00100000000000000000000
(note that the 1 before the binary point is implicit in the 32-bit representation)
Shifting right by 4 bits, we get 0.00010010000000000000000
Mantissa of 9.75 = 1.00111000000000000000000
Adding the mantissas of both:
0.00010010000000000000000
+ 1.00111000000000000000000
————————————————
1.01001010000000000000000
In the final answer, we take the exponent of the larger number.
So, the final answer consists of:
Sign bit = 0
Exponent of the larger number = 10000010
Mantissa = 01001010000000000000000
32-bit representation of the answer: x + y = 0 10000010 01001010000000000000000
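The three steps can also be traced in code. The sketch below (the function name add_single is made up; it ignores signs, rounding, overflow and special values) works directly on 32-bit patterns and reproduces the result above:

```python
def add_single(a_bits: int, b_bits: int) -> int:
    """Add two positive single-precision values given as 32-bit patterns."""
    ea, eb = (a_bits >> 23) & 0xFF, (b_bits >> 23) & 0xFF
    ma = (a_bits & 0x7FFFFF) | 0x800000   # restore the implicit leading 1
    mb = (b_bits & 0x7FFFFF) | 0x800000
    if ea < eb:                           # let a be the operand with the larger exponent
        ea, eb, ma, mb = eb, ea, mb, ma
    mb >>= (ea - eb)                      # step 1: align the exponents
    m = ma + mb                           # step 2: add the significands
    e = ea
    while m >= (1 << 24):                 # step 3: normalise the result
        m >>= 1
        e += 1
    return (e << 23) | (m & 0x7FFFFF)

x = 0b0_10000010_00111000000000000000000   # 9.75
y = 0b0_01111110_00100000000000000000000   # 0.5625
print(f"{add_single(x, y):032b}")          # 01000001001001010000000000000000
```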

FLOATING POINT SUBTRACTION


Subtraction is similar to addition, with a few differences: we subtract the mantissas instead of adding them, and the sign bit takes the sign of the larger number.
Let the two numbers be
x = 9.75
y = – 0.5625
Converting them into 32-bit floating point representation
9.75’s representation in 32-bit format = 0 10000010 00111000000000000000000
– 0.5625’s representation in 32-bit format = 1 01111110 00100000000000000000000
Now, we find the difference of the exponents to know how much shifting is required.
(10000010 − 01111110)₂ = (4)₁₀
Now, we shift the mantissa of the smaller number right by 4 bits.
Mantissa of −0.5625 = 1.00100000000000000000000
(note that the 1 before the binary point is implicit in the 32-bit representation)
Shifting right by 4 bits, we get 0.00010010000000000000000
Mantissa of 9.75 = 1.00111000000000000000000
Subtracting the smaller mantissa from the larger:
1.00111000000000000000000
− 0.00010010000000000000000
————————————————
1.00100110000000000000000
Sign bit of the larger number = 0
So, finally the answer is x − y = 0 10000010 00100110000000000000000
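As a quick cross-check of the subtraction result (a sketch using Python's standard struct module):

```python
import struct

bits = struct.unpack(">I", struct.pack(">f", 9.75 - 0.5625))[0]
s = f"{bits:032b}"
print(s[0], s[1:9], s[9:])   # 0 10000010 00100110000000000000000
```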
