Module 2.1

The document explains the IEEE 754 standard for floating-point representation, detailing its components, formats (single and double precision), and examples of converting decimal numbers to binary. It outlines the advantages and disadvantages of using floating-point representation, as well as its real-world applications in fields like scientific computing and machine learning. Additionally, it covers the processes for floating-point addition and subtraction, highlighting the importance of aligning exponents and normalizing results.


Floating point representation with the Institute of Electrical and Electronics Engineers (IEEE) standard

Floating-point representation is used to store real numbers efficiently, allowing a wide
range of values with high precision. There are several ways to represent floating-point
numbers, but IEEE 754 is the most widely used. The IEEE 754 standard defines how
floating-point numbers are stored and manipulated in computers.

IEEE 754 Floating-Point Representation

IEEE 754 defines two commonly used formats:


 Single Precision (32-bit)
 Double Precision (64-bit)
Each format encodes a value from three components, combined as

Value = (−1)^S × M × 2^E

where IEEE 754 has 3 basic components (a short evaluation sketch follows this list):

 S (Sign bit): Determines if the number is positive (0) or negative (1).


 E (Exponent): Stores the exponent using a bias (127 for single-precision, 1023 for
double-precision).
 M (Mantissa or Significand): Stores the significant digits of the number.
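For instance, here is a minimal Python sketch (purely illustrative; the helper name ieee_value is made up) that evaluates this formula for a normalised number, taking the exponent bias into account:

```python
def ieee_value(sign: int, biased_exponent: int, fraction: float, bias: int = 127) -> float:
    """Evaluate (-1)^S * (1 + fraction) * 2^(E - bias) for a normalised number."""
    return (-1) ** sign * (1 + fraction) * 2 ** (biased_exponent - bias)

# Fields of 13.625 (worked out later in this document): S = 0, E = 130,
# fraction bits 101101 -> 0.5 + 0.125 + 0.0625 + 0.015625 = 0.703125
print(ieee_value(0, 130, 0.703125))   # 13.625
```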

IEEE 754 Formats

A. Single-Precision (32-bit) Format

Bits | Field | Description
1 bit | Sign (S) | 0 = Positive, 1 = Negative
8 bits | Exponent (E) | Biased exponent (bias = 127)
23 bits | Mantissa (M) | Fractional part (leading 1 is implicit)

B. Double-Precision (64-bit) Format


Bits | Field | Description
1 bit | Sign (S) | 0 = Positive, 1 = Negative
11 bits | Exponent (E) | Biased exponent (bias = 1023)
52 bits | Mantissa (M) | Fractional part (leading 1 is implicit)
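To make the field layout concrete, here is a minimal Python sketch (using only the standard struct module; the function name decode_single is made up) that unpacks a single-precision value into its sign, biased exponent and fraction bits:

```python
import struct

def decode_single(x: float):
    """Unpack a float into its IEEE 754 single-precision fields."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))   # raw 32-bit word
    sign = bits >> 31                   # 1 bit
    exponent = (bits >> 23) & 0xFF      # 8 bits, biased by 127
    fraction = bits & 0x7FFFFF          # 23 bits, implicit leading 1 not stored
    return sign, exponent, fraction

s, e, f = decode_single(13.625)
print(s, f"{e:08b}", f"{f:023b}")       # 0 10000010 10110100000000000000000
```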

IEEE 754 Example (Single Precision)

Example: Representing 13.625 in IEEE 754 (Single-Precision)

Step 1: Convert to Binary

 Convert 13 to binary: 13₁₀ = 1101₂
 Convert 0.625 to binary:
0.625 × 2 = 1.25 → 1
0.25 × 2 = 0.5 → 0
0.5 × 2 = 1.0 → 1
0.625₁₀ = 0.101₂
 Full binary representation: 13.625₁₀ = 1101.101₂
Step 2: Normalize the Binary Number
Convert to scientific notation:
1.101101 × 2^3
 Mantissa (M): 10110100000000000000000 (drop the leading 1, pad to 23 bits)
 Exponent (E): 3 + 127 = 130₁₀ = 10000010₂
Step 3: IEEE 754 Representation
Sign (1 bit) | Exponent (8 bits) | Mantissa (23 bits)
0 (positive) | 10000010 | 10110100000000000000000
Final IEEE 754 representation (in binary):
0 10000010 10110100000000000000000
Hexadecimal equivalent:
415A0000₁₆
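As a quick cross-check (a sketch using Python's standard struct module, not part of the worked example), packing 13.625 as a single-precision float reproduces the same bit pattern and hexadecimal value:

```python
import struct

bits = struct.unpack(">I", struct.pack(">f", 13.625))[0]
print(f"{bits:032b}")   # 01000001010110100000000000000000
print(f"{bits:08X}")    # 415A0000
```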

Example

85.125
85 = 1010101
0.125 = 001
85.125 = 1010101.001
=1.010101001 x 2^6
sign = 0
1. Single precision:
biased exponent 127+6=133
133 = 10000101
Normalised mantissa = 010101001
we will add 0's to complete the 23 bits

The IEEE 754 Single precision is:


= 0 10000101 01010100100000000000000
This can be written in hexadecimal form 42AA4000

2. Double precision:
biased exponent 1023+6=1029
1029 = 10000000101
Normalised mantissa = 010101001
we will add 0's to complete the 52 bits

The IEEE 754 Double precision is:


= 0 10000000101 0101010010000000000000000000000000000000000000000000
This can be written in hexadecimal form 4055480000000000
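Both encodings of 85.125 can be cross-checked the same way (a sketch using Python's standard struct module):

```python
import struct

single = struct.unpack(">I", struct.pack(">f", 85.125))[0]
double = struct.unpack(">Q", struct.pack(">d", 85.125))[0]
print(f"{single:08X}")    # 42AA4000
print(f"{double:016X}")   # 4055480000000000
```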

Special Cases in IEEE 754

Condition | Sign (S) | Exponent (E) | Mantissa (M) | Value
Zero | 0 or 1 | 00000000 | 00000000000000000000000 | ±0
Infinity | 0 or 1 | 11111111 | 00000000000000000000000 | ±∞
NaN (Not a Number) | 0 or 1 | 11111111 | Nonzero mantissa | NaN
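These special patterns can be inspected directly (a minimal Python sketch using the standard struct and math modules; the NaN payload shown is just one common quiet-NaN pattern):

```python
import math
import struct

def pattern32(x: float) -> str:
    """Format a value's 32-bit IEEE 754 pattern as sign | exponent | mantissa."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    s = f"{bits:032b}"
    return f"{s[0]} {s[1:9]} {s[9:]}"

print(pattern32(0.0))        # 0 00000000 00000000000000000000000
print(pattern32(math.inf))   # 0 11111111 00000000000000000000000
print(pattern32(math.nan))   # 0 11111111 10000000000000000000000 (a quiet NaN)
```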

Advantages

 Supports a wide range of numbers.


 Provides high precision.
 Handles very small and very large values efficiently.

Disadvantages

 Complex hardware implementation.


 Rounding errors can cause precision loss.
 More memory required compared to fixed-point representation.

Real-World Applications

 Scientific computing (e.g., physics simulations)


 Machine learning and AI computations
 Graphics processing (GPU computations)
 Financial calculations (floating-point arithmetic)

Example: Let us consider:

Suppose a number is stored using a 32-bit format: 1 bit for the sign, 8 bits for a signed exponent, and 23 bits for the fractional part. The leading bit 1 is not stored (as it is always 1 for a normalized number) and is referred to as a "hidden bit".
Then −53.5 is normalized as −53.5 = (−110101.1)₂ = (−1.101011)₂ × 2^5, which is represented as follows:

1 | 00000101 | 10101100000000000000000

where 1 is the sign bit (negative), 00000101 is the 8-bit binary value of the exponent value +5, and 10101100000000000000000 is the 23-bit fractional part.
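A one-line check of this normalisation (a sketch; the fraction bits 101011 correspond to 0.671875):

```python
# (1 + 0.671875) * 2^5, negated, reproduces the original value
print(-(1 + 0.671875) * 2 ** 5)   # -53.5
```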


Real numbers
Real numbers are numbers that include fractions/values after the decimal point.
For example, 123.75 is a real number.
Floating point representation
Real numbers are stored in a computer as floating point numbers using a mantissa (m),
a base (b) and an exponent (e) in this format:
m × b^e
Example (in decimal)
This example gives a general idea of the role of the mantissa, base and exponent. It does not
fully reflect the computer's method for storing real numbers.
The number 123.75 can be represented as a floating point number. To do this, move all the
digits so that the most significant digit is to the right of the decimal point:
123.75 → 0.12375
The number after the decimal point is the mantissa (m).
As this number is written in decimal (denary), the base (b) is 10 .
To work out the exponent (e) count how many decimal places you have moved the decimal
point by (in this case three). So we can represent 123.75 in floating point representation as
this:
0.12375 × 10^3
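A small sketch of this decimal shift-and-count idea (the helper normalise_decimal is made up for illustration; Python's decimal module keeps the arithmetic exact):

```python
from decimal import Decimal

def normalise_decimal(x: Decimal):
    """Move the decimal point left until the value is below 1, counting the moves."""
    exponent = 0
    while x >= 1:
        x = x.scaleb(-1)   # shift the decimal point one place to the left
        exponent += 1
    return x, exponent

print(normalise_decimal(Decimal("123.75")))   # (Decimal('0.12375'), 3)
```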
Example (in binary)
In the Higher course, all floating point representation is in binary.
First convert 123.75 to binary:

64 32 16 8 4 2 1 0.5 0.25
1 1 1 1 0 1 1 1 1
64 + 32 + 16 + 8 + 2 + 1 + 0.5 + 0.25 = 123.75
123.75 in binary is 1111011.11
Key fact
The computer will not store the actual decimal point as part of the floating point
number but it is used here for illustrative purposes.
To find the mantissa, move the decimal point to the right of the most significant bit of the
mantissa:
1111011.11 → 0.111101111
To calculate the exponent, count how many places the decimal point moved to give the
mantissa. In this case the decimal point moved seven places to the left:
So the exponent for our number is 7.
4 2 1
1 1 1
In binary, the number 7 is 111 as 4 + 2 + 1 = 7
In order to represent 123.75 the mantissa would be 111101111 and the exponent would be
111. This can be thought of as:
0.111101111 × 2^111 (with the exponent 111 written in binary)
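The same shift-and-count process can be sketched in Python (the helper to_mantissa_exponent is made up; it assumes a positive value that is an exact binary fraction, so every step stays exact):

```python
def to_mantissa_exponent(value: float, mantissa_bits: int = 15):
    """Normalise a positive value to 0.1xxx... in binary; return mantissa bits and exponent."""
    exponent = 0
    while value >= 1:        # move the point left until the value is below 1
        value /= 2
        exponent += 1
    while value < 0.5:       # make sure the first bit after the point is a 1
        value *= 2
        exponent -= 1
    bits = ""
    for _ in range(mantissa_bits):   # read off the mantissa one bit at a time
        value *= 2
        bit = int(value)
        bits += str(bit)
        value -= bit
    return bits, exponent

print(to_mantissa_exponent(123.75))   # ('111101111000000', 7)
```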
Sign bit
As well as the mantissa, base and exponent, we have a digit before the decimal point. This is
used as a sign bit and is represented in binary as a 0 for positive and a 1 for negative.
How many bits?
There will always be a trade-off between accuracy and range when using floating point
notation, as there will always be a set number of bits allocated to storing real numbers:
 increasing the number of bits devoted to the mantissa will improve the accuracy of a
floating point number
 increasing the number of bits devoted to the exponent will increase the range of
numbers that can be held
In the Higher course, floating point numbers are represented as follows (see the short packing sketch after this list):
 1 bit for the sign
 15 bits for the mantissa
 8 bits for the exponent
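As a sketch only (the sign, then mantissa, then exponent field order below is an assumption for illustration, not a statement of the official layout), the 123.75 example can be packed into such a 24-bit word:

```python
sign = "0"                    # positive
mantissa = "111101111000000"  # 15 bits, from the worked example above
exponent = f"{7:08b}"         # 8-bit exponent for 7 -> 00000111
word = sign + mantissa + exponent
print(word, len(word))        # 011110111100000000000111 24
```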
FLOATING POINT ADDITION AND SUBTRACTION

FLOATING POINT ADDITION


To understand floating-point addition, let us first look at the addition of real numbers in decimal, since the same logic applies in both cases.
For example, we have to add 1.1 × 10^3 and 50.
We cannot add these numbers directly. First, we need to align the exponents, and then we can add the significands.
After aligning the exponents, we get 50 = 0.05 × 10^3
Now adding the significands, 0.05 + 1.1 = 1.15
So, finally we get (1.1 × 10^3 + 50) = 1.15 × 10^3
Here, notice that we shifted 50 and made it 0.05 to add these numbers.
Now let us take an example of floating-point addition.
We follow these steps to add two numbers (the code sketch at the end of this example traces them):
1. Align the exponents (shift the significand of the number with the smaller exponent)
2. Add the significands
3. Normalize the result
Let the two numbers be
x = 9.75
y = 0.5625
Converting them into 32-bit floating point representation,
9.75’s representation in 32-bit format = 0 10000010 00111000000000000000000
0.5625’s representation in 32-bit format = 0 01111110 00100000000000000000000
Now we take the difference of the exponents to know how much shifting is required.
(10000010 − 01111110)₂ = (4)₁₀
Now, we shift the mantissa of the smaller number right by 4 bits.
Mantissa of 0.5625 = 1.00100000000000000000000
(note that the 1 before the binary point is implicit in the 32-bit representation)
Shifting right by 4 bits, we get 0.00010010000000000000000
Mantissa of 9.75 = 1.00111000000000000000000
Adding the mantissas of both:
0.00010010000000000000000
+ 1.00111000000000000000000
————————————————
1.01001010000000000000000
In the final answer, we take the exponent of the larger number.
So, the final answer consists of:
Sign bit = 0
Exponent of the larger number = 10000010
Mantissa = 01001010000000000000000
32-bit representation of the answer: x + y = 0 10000010 01001010000000000000000
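The three steps can also be traced in code. The sketch below (the function name add_single is made up; it ignores signs, rounding, overflow and special values) works directly on 32-bit patterns and reproduces the result above:

```python
def add_single(a_bits: int, b_bits: int) -> int:
    """Add two positive single-precision values given as 32-bit patterns."""
    ea, eb = (a_bits >> 23) & 0xFF, (b_bits >> 23) & 0xFF
    ma = (a_bits & 0x7FFFFF) | 0x800000   # restore the implicit leading 1
    mb = (b_bits & 0x7FFFFF) | 0x800000
    if ea < eb:                           # let a be the operand with the larger exponent
        ea, eb, ma, mb = eb, ea, mb, ma
    mb >>= (ea - eb)                      # step 1: align the exponents
    m = ma + mb                           # step 2: add the significands
    e = ea
    while m >= (1 << 24):                 # step 3: normalise the result
        m >>= 1
        e += 1
    return (e << 23) | (m & 0x7FFFFF)

x = 0b0_10000010_00111000000000000000000   # 9.75
y = 0b0_01111110_00100000000000000000000   # 0.5625
print(f"{add_single(x, y):032b}")          # 01000001001001010000000000000000
```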

FLOATING POINT SUBTRACTION


Subtraction is similar to addition, with a few differences: we subtract the mantissas instead of adding them, and the sign bit takes the sign of the larger number.
Let the two numbers be
x = 9.75
y = – 0.5625
Converting them into 32-bit floating point representation
9.75’s representation in 32-bit format = 0 10000010 00111000000000000000000
– 0.5625’s representation in 32-bit format = 1 01111110 00100000000000000000000
Now, we find the difference of the exponents to know how much shifting is required.
(10000010 − 01111110)₂ = (4)₁₀
Now, we shift the mantissa of the smaller number right by 4 bits.
Mantissa of −0.5625 = 1.00100000000000000000000
(note that the 1 before the binary point is implicit in the 32-bit representation)
Shifting right by 4 bits, we get 0.00010010000000000000000
Mantissa of 9.75 = 1.00111000000000000000000
Subtracting the smaller mantissa from the larger:
1.00111000000000000000000
− 0.00010010000000000000000
————————————————
1.00100110000000000000000
Sign bit of the larger number = 0
So, finally the answer is x − y = 0 10000010 00100110000000000000000
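As a quick cross-check of the subtraction result (a sketch using Python's standard struct module):

```python
import struct

bits = struct.unpack(">I", struct.pack(">f", 9.75 - 0.5625))[0]
s = f"{bits:032b}"
print(s[0], s[1:9], s[9:])   # 0 10000010 00100110000000000000000
```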
