
9-Algorithms For Floating Point Arithmetic Operations-22-01-2024

The document discusses floating point representation and arithmetic. It covers: 1) Floating point numbers represent real numbers using a sign bit, biased exponent, and significand (mantissa). The radix point is fixed between the sign and significand. 2) Common floating point formats like IEEE 754 use biased exponent representation, where the exponent value is the stored exponent minus a bias. 3) Floating point arithmetic operations like addition and multiplication require aligning operands, adjusting exponents, performing operations, then normalizing the result. Special cases like overflow and underflow must be handled.

Uploaded by

Baladhithya T
Copyright
© © All Rights Reserved

Floating Point Representation

Real Numbers
• Numbers with fractions
• Could be done in pure binary
• 1001.1010_2 = 2^3 + 2^0 + 2^-1 + 2^-3 = 9.625
• Radix point: Fixed or Moving?
• Fixed radix point: can’t represent very large or very small numbers.
• Dynamically sliding the radix point: a range of very large and very small numbers can be represented.
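
The place-value expansion of a pure binary number with a fractional part can be checked with a short sketch. The function name `binary_to_decimal` is an illustrative choice, not something from the slides.

```python
def binary_to_decimal(bits: str) -> float:
    """Evaluate a binary string such as '1001.1010' by summing place values."""
    integer_part, _, fraction_part = bits.partition(".")
    value = 0.0
    # Integer digits carry weights 2^0, 2^1, ... counted from the radix point leftwards.
    for position, digit in enumerate(reversed(integer_part)):
        value += int(digit) * 2 ** position
    # Fraction digits carry weights 2^-1, 2^-2, ... counted from the radix point rightwards.
    for position, digit in enumerate(fraction_part, start=1):
        value += int(digit) * 2 ** -position
    return value


print(binary_to_decimal("1001.1010"))  # 9.625
```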

In mathematics, radix point refers to the symbol used in numerical representations to separate the integral part of the number (to the
left of the radix) from its fractional part (to the right of the radix). The radix point is usually a small dot, either placed on the baseline
or halfway between the baseline and the top of the numerals. In base 10, the radix point is more commonly called the decimal
point. ... (From en.wikipedia.org/wiki/Radix_point)

Floating Point
[Field layout, left to right: Sign bit | Biased Exponent | Significand or Mantissa]
• +/- significand x 2^exponent
• Point is actually fixed between sign bit and body of mantissa
• Exponent indicates place value (point position)

Signs for Floating Point
• Mantissa is stored in 2's complement.
• Exponent is in excess or biased notation.
• Excess (biased exponent) 128 means
• 8 bit exponent field
• Pure value range 0-255
• Subtract the bias of 128 (2^(k-1) for a k = 8 bit field) to get the correct value
• Range -128 to +127

4
Normalization
• FP numbers are usually normalized
• exponent is adjusted so that leading bit (MSB) of mantissa is 1
• Since it is always 1 there is no need to store it
• (Scientific notation where numbers are normalized to give a single digit before the decimal point, e.g. 3.123 x 10^3)
• A floating-point format does not represent more individual values than a fixed-point format of the same width; it spreads the same number of values over a wider range.

IEEE 754
• Standard for floating point storage
• 32 and 64 bit standards
• 8 and 11 bit exponent respectively
• Extended formats (both mantissa and exponent) for intermediate
results

6
IEEE Floating-point Format
• IEEE has introduced a standard floating-point format for
arithmetic operations in mini and microcomputer, which
is defined in IEEE Standard 754
• In this format, the numbers are normalized so that the significand or mantissa lies in the range 1 ≤ 1.F < 2, which corresponds to an integer part equal to 1
• An IEEE format floating-point number X is formally defined as:
      X = (-1)^S x 2^(E-B) x 1.F
  where S = sign bit [0 → +, 1 → −]
        E = exponent biased by B
        F = fractional mantissa
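
As a hedged illustration of the definition X = (-1)^S x 2^(E-B) x 1.F, the sketch below splits a single-precision word into its three fields and rebuilds the value. It uses Python's standard `struct` module; the function name `decode_single` is an assumption made for this example, and the formula is only applied to normalized numbers.

```python
import struct


def decode_single(x: float):
    """Split x (rounded to single precision) into S, E, F and rebuild its value."""
    (word,) = struct.unpack(">I", struct.pack(">f", x))
    s = word >> 31              # sign bit S
    e = (word >> 23) & 0xFF     # biased exponent E (8 bits)
    f = word & 0x7FFFFF         # stored fraction F (23 bits)
    # Valid for normalized numbers only (E between 1 and 254), with B = 127.
    value = (-1) ** s * 2.0 ** (e - 127) * (1 + f / 2 ** 23)
    return s, e, f, value


print(decode_single(-12.625))  # (1, 130, 4849664, -12.625)
```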
Biased Exponent Representation

How to represent a signed exponent? The choices are:
• Sign + magnitude representation for the exponent
• Two’s complement representation
• Biased representation
IEEE 754 uses biased representation for the exponent (32-bit format):
• Value of exponent = val(E) = E – Bias (Bias is a constant)
• Recall that the exponent field is 8 bits for single precision
• E can be in the range 0 to 255
• E = 0 and E = 255 are reserved for special use
• E = 1 to 254 are used for normalized floating point numbers
• Bias = 127 (half of 254), val(E) = E – 127
• val(E=1) = –126, val(E=127) = 0, val(E=254) = 127
• Two basic formats are defined in the IEEE Standard 754
• These are the 32-bit single and 64-bit double formats, with 8-bit and 11-bit exponents respectively

Sign bit (1 bit) | Biased Exponent (8 bits) | Significand (23 bits)
(a) Single format

Sign bit (1 bit) | Biased Exponent (11 bits) | Significand (52 bits)
(b) Double format

• A sign-magnitude representation has been adopted for the mantissa; the mantissa is negative if S = 1, and positive if S = 0

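
The bias and the usable exponent range for the single and double formats follow from the field width alone. A minimal sketch, with the helper names `bias` and `normalized_exponent_range` chosen here for illustration:

```python
def bias(k: int) -> int:
    """Bias for a k-bit exponent field: 2^(k-1) - 1."""
    return 2 ** (k - 1) - 1


def normalized_exponent_range(k: int):
    """True exponent range for normalized numbers.

    The all-zeros and all-ones field values are reserved for special use,
    so the stored exponent runs from 1 to 2^k - 2.
    """
    max_field = 2 ** k - 1
    return 1 - bias(k), (max_field - 1) - bias(k)


for name, k in (("single", 8), ("double", 11)):
    print(name, bias(k), normalized_exponent_range(k))
# single 127 (-126, 127)
# double 1023 (-1022, 1023)
```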
Floating Point Examples

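[Figure: example bit patterns for negative and normalized values. A true exponent of +20 is stored as 127 + 20 = 147; a true exponent of -20 is stored as 127 - 20 = 107.]

The bias equals 2^(K-1) – 1 = 2^(8-1) – 1 = 127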


Example
Convert these numbers to IEEE single precision format:
(a) 199.953125_10 = 1100 0111.111101_2
                  = 1.100 0111 111101 x 2^7
    Biased exponent: +7 + 127 = 134_10 = 1000 0110_2
    Stored significand [23 bits], leading 1 dropped: 1000 1111 1110 1000 0000 000

    sign   biased exponent   significand
    0      1000 0110         1000 1111 1110 1000 0000 000

(b) -77.7_10 = -100 1101.10110 0110 ..._2
             = -1.00 1101 10110 0110 ... x 2^6
    Integer part: 77_10 = 100 1101_2
    Fractional part, by repeated multiplication by 2:
      0.7 x 2 = 1.4  → 1
      0.4 x 2 = 0.8  → 0
      0.8 x 2 = 1.6  → 1
      0.6 x 2 = 1.2  → 1
      0.2 x 2 = 0.4  → 0
      0.4 x 2 = 0.8  → 0
      0.8 x 2 = 1.6  → 1
      0.6 x 2 = 1.2  → 1
      0.2 x 2 = 0.4  → 0
      ...
    Biased exponent: +6 + 127 = 133_10 = 1000 0101_2
    Stored significand [23 bits], leading 1 dropped: 0011 0110 1100 1100 1100 110

    sign   biased exponent   significand
    1      1000 0101         0011 0110 1100 1100 1100 110

Slides adapted from tan wooi haw’s lecture notes (FOE)
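
The hand conversion above (normalize, then generate fraction bits by repeated multiplication by 2) can be sketched as follows. This is an illustration under stated assumptions rather than a full encoder: the function name `to_single_bits` is invented here, the input must be nonzero and within the normalized range, and leftover bits are truncated instead of rounded.

```python
from fractions import Fraction


def to_single_bits(decimal_text: str) -> str:
    """Encode a nonzero decimal string as 'S EEEEEEEE FFF...F' (truncating)."""
    value = Fraction(decimal_text)
    sign = "1" if value < 0 else "0"
    value = abs(value)
    # Normalize to 1 <= value < 2 while tracking the true exponent.
    exponent = 0
    while value >= 2:
        value /= 2
        exponent += 1
    while value < 1:
        value *= 2
        exponent -= 1
    # Generate 23 fraction bits by repeated multiplication by 2.
    fraction = value - 1
    bits = []
    for _ in range(23):
        fraction *= 2
        bit = int(fraction >= 1)
        bits.append(str(bit))
        fraction -= bit
    # Anything left in `fraction` is discarded here; a real encoder would round.
    return f"{sign} {exponent + 127:08b} {''.join(bits)}"


print(to_single_bits("199.953125"))  # 0 10000110 10001111111010000000000
print(to_single_bits("-77.7"))       # 1 10000101 00110110110011001100110
```

Exact rational arithmetic (`Fraction`) is used so that the decimal input 77.7 is not disturbed by Python's own binary rounding before the bits are generated.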
FP Arithmetic +/-

• Check for zeros


• Align significands (adjusting exponents)
• Add or subtract significands
• Normalize result

13
FP Arithmetic ×/÷

• Check for zero


• Add/subtract exponents
• Multiply/divide significands (watch sign)
• Normalize
• Round
• All intermediate results should be in double length storage

14
Floating-point Arithmetic (cont.)

Some basic floating-point arithmetic operations are shown in the table

15
Floating-point Arithmetic (cont.)

• For addition and subtraction, it is necessary to ensure that both operand exponents have the same value
• This may involve shifting the radix point of one of the operands to achieve alignment

16
Floating-point Arithmetic (cont.)
• Some problems that may arise during arithmetic operations are:
i. Exponent overflow: A positive exponent exceeds the maximum possible exponent value; in some systems this may be designated as +∞ or −∞
ii. Exponent underflow: A negative exponent is less than the minimum possible exponent value (e.g. 2^-200); the number is too small to be represented and may be reported as 0
iii. Significand underflow: In the process of aligning significands, the smaller number may have a significand which is too small to be represented
iv. Significand overflow: The addition of two significands of the same sign may result in a carry out from the most significant bit
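
Three of these conditions can be observed directly in Python, whose `float` type is an IEEE 754 double on common platforms (an assumption of this sketch, so the limits differ from the single-precision figures on the slides). The variable names are illustrative.

```python
import math
import sys

# Exponent overflow: the result exceeds the largest finite value and becomes +infinity.
overflowed = sys.float_info.max * 10

# Exponent underflow: halving the smallest positive double yields 0.0.
underflowed = 5e-324 / 2

# Significand underflow: after alignment, 1.0 is shifted entirely out of the
# 53-bit significand of 2^60, so the sum is unchanged.
absorbed = (2.0 ** 60 + 1.0) == 2.0 ** 60

print(overflowed, underflowed, absorbed)  # inf 0.0 True
```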

17
FP Arithmetic +/-
• Unlike integer and fixed-point number representation,
floating-point numbers cannot be added in one simple
operation
• Consider adding two decimal numbers:
A = 12345
B = 567.89
If these numbers are normalized and added in floating-point format, we will have

    0.12345 x 10^5
  + 0.56789 x 10^3
  = ?.????? x 10^?

Obviously, direct addition cannot take place as the exponents are different
FP Arithmetic +/- (cont.)
• Floating-point addition and subtraction will typically
involve the following steps:
i. Align the significand
ii. Add or subtract the significands
iii. Normalize the result
• Since addition and subtraction are identical except for
a sign change, the process begins by changing the sign
of the subtrahend if it is a subtract operation
• The floating-point numbers can only be added if the
two exponents are equal
• This can be done by aligning the smaller number with the bigger number [increasing its exponent] or vice-versa, so that both numbers have the same exponent
FP Arithmetic +/- (cont.)
• As the aligning operation may result in the loss of digits, it is the smaller number that is shifted, so that any loss will be relatively insignificant
  Example (only 8 bits remain), shifting the larger number left instead:
    1.1001 x 2^9  →  110010000 x 2^1   (the leading 1 x 2^9 is lost)
    1.0111 x 2^1  =  1.0111000 x 2^1
• Hence, the smaller number is shifted right by increasing its exponent until the two exponents are the same
• If both numbers have exponents that differ significantly, the smaller number is lost as a result of shifting
    1.1001001 x 2^9                    1.1001001 x 2^9
    1.0110001 x 2^1  → shift right →   0.0000000 x 2^9

FP Arithmetic +/- (cont.)
• After the numbers have been aligned, they are added together taking into account their signs
• There might be a possibility of significand overflow due to a carry out from the most significant bit
• If this occurs, the significand of the result is shifted right and the exponent is incremented
    1.1101 x 2^4 + 0.0101 x 2^4 = 10.0010 x 2^4  → shift right →  1.0001 x 2^5
• As the exponent is incremented, it might overflow, and the operation will stop
• Lastly, the result is normalized by shifting significand digits left until the most significant digit is non-zero
• Each shift causes a decrement of the exponent and thus could cause an exponent underflow
• Finally, the result is rounded off and reported
Flowchart: floating-point addition and subtraction (X + Y = Z, X – Y = Z)
1. SUBTRACT only: change the sign of Y, then continue as for ADD.
2. X = 0? If yes, Z ← Y and RETURN.
3. Y = 0? If yes, Z ← X and RETURN.
4. Exponents equal? If no, increment the smaller exponent and shift its significand right. If that significand becomes 0, put the other number in Z and RETURN; otherwise repeat step 4.
5. Add the signed significands. If the resulting significand = 0, Z ← 0 and RETURN.
6. Significand overflow? If yes, shift the significand right and increment the exponent. If the exponent overflows, report overflow and RETURN.
7. Result normalized? If no, shift the significand left and decrement the exponent. If the exponent underflows, report underflow and RETURN; otherwise repeat step 7.
8. Round the result and RETURN.

Worked example: X = 1.01101 x 2^7, Y = 1.10101 x 2^6
Align exponents: Y = 0.110101 x 2^7

ADD (X + Y = Z):
    1.01101   x 2^7
  + 0.110101  x 2^7
  = 10.001111 x 2^7   (significand overflow: shift right, increment exponent)
  = 1.0001111 x 2^8

SUBTRACT (X – Y = Z):
    1.01101   x 2^7
  – 0.110101  x 2^7
  = 0.100101  x 2^7   (not normalized: shift left, decrement exponent)
  = 1.00101   x 2^6
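
The add/subtract procedure can be sketched for a toy format. Everything here is an assumption made for illustration: values are `(signed_significand, exponent)` tuples, the significand is an integer holding `P` = 8 bits (1.xxxxxxx scaled by 2^7), exponents are unbounded so overflow and underflow are never reported, and lost bits are truncated rather than rounded.

```python
P = 8  # significand width in bits, including the leading 1 (toy format)


def shift_right(m: int) -> int:
    """Shift the magnitude right by one bit, keeping the sign."""
    return (abs(m) >> 1) * (1 if m >= 0 else -1)


def fp_add(x, y):
    """Add two toy floats given as (signed_significand, exponent)."""
    (mx, ex), (my, ey) = x, y
    if mx == 0:
        return y
    if my == 0:
        return x
    # Align: shift the operand with the smaller exponent right.
    while ex != ey:
        if ex < ey:
            mx, ex = shift_right(mx), ex + 1
            if mx == 0:
                return (my, ey)      # smaller operand was shifted out entirely
        else:
            my, ey = shift_right(my), ey + 1
            if my == 0:
                return (mx, ex)
    m, e = mx + my, ex               # add signed significands
    if m == 0:
        return (0, 0)
    if abs(m) >= 1 << P:             # significand overflow: carry out of the MSB
        m, e = shift_right(m), e + 1
    while abs(m) < 1 << (P - 1):     # normalize: shift left, decrement exponent
        m, e = m * 2, e - 1
    return (m, e)


X = (0b10110100, 7)   # 1.0110100 x 2^7
Y = (0b11010100, 6)   # 1.1010100 x 2^6
print(fp_add(X, Y))               # (143, 8): 1.0001111 x 2^8
print(fp_add(X, (-Y[0], Y[1])))   # (148, 6): 1.0010100 x 2^6
```

Subtraction is handled by negating the second operand before the call, which mirrors the "change sign of Y" step.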
FP Arithmetic +/- (cont.)
• Some floating-point arithmetic operations will lead to an increased number of bits in the mantissa
• For example, consider adding these floating-point numbers with 5 significant bits:
    A = 0.11001 x 2^4
    B = 0.10001 x 2^3
  After alignment:
    A = 0.11001  x 2^4
    B = 0.010001 x 2^4
    A + B = 1.000011 x 2^4  → normalize →  0.1000011 x 2^5
• The result has two extra bits of precision which cannot be fitted into the floating-point format
• For simplicity, the number can be truncated to give 0.10000 x 2^5
FP Arithmetic +/- (cont.)
• Truncation is the simplest method, which involves nothing more than taking away the extra bits
• A much better technique is rounding, in which, if the value of the extra bits is greater than half the least significant bit of the retained bits, 1 is added to the LSB of the remaining digits
• For example, consider rounding these numbers to 4 significant bits:
i. 0.1101101
    retained bits = 0.1101
    extra bits = 0.0000101
    LSB of retained bits = 0.0001
    The extra bits are more than half the LSB, so add 1 to the LSB:
    0.1101 + 0.0001 = 0.1110
FP Arithmetic +/- (cont.)
ii. 0.1101011
    retained bits = 0.1101
    extra bits = 0.0000011
    LSB of retained bits = 0.0001
    The extra bits are less than half the LSB, so they are truncated: 0.1101

• Truncation always undervalues the result, leading to a systematic error, whereas rounding sometimes reduces the result and sometimes increases it
• Rounding is generally preferred to truncation, partly because it is more accurate and partly because it gives rise to an unbiased error
• The major disadvantage of rounding is that it requires a further arithmetic operation on the result
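
The two examples can be reproduced with a small sketch operating on bit strings of the form `0.xxxxxxx`. The function names `truncate_bits` and `round_bits` are hypothetical, and the rounding rule is exactly the one stated on the slide (add 1 only when the extra bits are greater than half the retained LSB); a carry out of the retained bits is not handled.

```python
def truncate_bits(bits: str, keep: int) -> str:
    """Keep the first `keep` fraction bits of a string like '0.1101101'."""
    return bits[: 2 + keep]


def round_bits(bits: str, keep: int) -> str:
    """Add 1 to the retained LSB when the extra bits exceed half of it."""
    fraction = bits[2:]
    retained = int(fraction[:keep], 2)
    extra = fraction[keep:]
    # Half of the retained LSB is a 1 followed by zeros among the extra bits.
    half = "1" + "0" * (len(extra) - 1)
    if extra and int(extra, 2) > int(half, 2):
        retained += 1
    # Assumes the increment does not carry out of the retained bits.
    return "0." + format(retained, f"0{keep}b")


print(truncate_bits("0.1101101", 4), round_bits("0.1101101", 4))  # 0.1101 0.1110
print(truncate_bits("0.1101011", 4), round_bits("0.1101011", 4))  # 0.1101 0.1101
```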
continue ...
ii. 68.3_10 + 12.2_10
68.3_10 = 100 0100.01001 1001 ..._2
        = 1.00 0100 01001 1001 ... x 2^6
  Integer part: 68_10 = 100 0100_2
  Fractional part, by repeated multiplication by 2:
      0.3 x 2 = 0.6  → 0
      0.6 x 2 = 1.2  → 1
      0.2 x 2 = 0.4  → 0
      0.4 x 2 = 0.8  → 0
      0.8 x 2 = 1.6  → 1
      0.6 x 2 = 1.2  → 1
      ...
  Only 24 bits can be stored. Written out in a 32-bit register, the significand is
      1000 1000 1001 1001 1001 1001 1001 1001
  The 8 discarded bits (1001 1001) are more than half of the LSB, so 1 is added to the LSB.
  Biased exponent: +6 + 127 = 133_10 = 1000 0101_2
  Stored significand [23 bits], leading 1 dropped: 0001 0001 0011 0011 0011 010

  sign   biased exponent   significand
  0      1000 0101         0001 0001 0011 0011 0011 010

continue ...
12.2_10 = 1100.0011 0011 ..._2
        = 1.100 0011 0011 ... x 2^3
  Integer part: 12_10 = 1100_2
  Fractional part, by repeated multiplication by 2:
      0.2 x 2 = 0.4  → 0
      0.4 x 2 = 0.8  → 0
      0.8 x 2 = 1.6  → 1
      0.6 x 2 = 1.2  → 1
      0.2 x 2 = 0.4  → 0
      ...
  Only 24 bits can be stored. Written out to 32 bits, the significand is
      1100 0011 0011 0011 0011 0011 0011 0011
  The 8 discarded bits (0011 0011) are less than half of the LSB, so they are truncated.
  Biased exponent: +3 + 127 = 130_10 = 1000 0010_2
  Stored significand [23 bits], leading 1 dropped: 1000 0110 0110 0110 0110 011

  sign   biased exponent   significand
  0      1000 0010         1000 0110 0110 0110 0110 011

continue ...
Align the smaller number with the larger number by shifting it to the right [increasing the exponent]:
      exponent    mantissa
      1000 0010   1.10000110011001100110011
  →   1000 0101   0.00110000110011001100110011

ADD the mantissas:
      1.00010001001100110011010
    + 0.00110000110011001100110011
    = 1.01000010000000000000000011
The three extra bits (011) are less than half of the LSB, so they are truncated.

Store the result in IEEE single-precision format:
  sign   biased exponent   significand
  0      1000 0101         0100 0010 0000 0000 0000 000

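
The worked addition can be cross-checked with Python's standard `struct` module, which rounds a double to the nearest single-precision value when packing with the `f` format. The helper names `to_single` and `single_bits` are assumptions of this sketch.

```python
import struct


def to_single(x: float) -> float:
    """Round a Python float (a double) to the nearest IEEE single-precision value."""
    return struct.unpack(">f", struct.pack(">f", x))[0]


def single_bits(x: float) -> str:
    """Return the single-precision encoding of x as 'S EEEEEEEE FFF...F'."""
    (word,) = struct.unpack(">I", struct.pack(">f", x))
    b = format(word, "032b")
    return f"{b[0]} {b[1:9]} {b[9:]}"


a = to_single(68.3)
b = to_single(12.2)
total = to_single(a + b)   # the double sum of two singles is exact here, then rounded once

print(single_bits(a))      # 0 10000101 00010001001100110011010
print(single_bits(b))      # 0 10000010 10000110011001100110011
print(single_bits(total))  # 0 10000101 01000010000000000000000
print(total)               # 80.5
```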
Floating-point Multiplication
Flowchart: MULTIPLY (X x Y = Z)
1. X = 0? If yes, Z ← 0 and RETURN.
2. Y = 0? If yes, Z ← 0 and RETURN.
3. Add exponents.
4. Subtract bias.
5. Exponent overflow? If yes, report overflow and RETURN.
6. Exponent underflow? If yes, report underflow and RETURN.
7. Multiply significands.
8. Normalize.
9. Round and RETURN.

Worked example:
X = 6.25_10 = 110.01_2 = 1.1001 x 2^2
Y = 12.5_10 = 1100.1_2 = 1.1001 x 2^3
Add exponents:   E1 = 127 + 2 = 129
                 E2 = 127 + 3 = 130
                 E1 + E2 = 259
Subtract bias:   ET = 259 – 127 = 132
Multiply significands:   1.1001_2 x 1.1001_2 = 10.01110001_2
Normalize:   10.01110001 x 2^5 = 1.001110001 x 2^6
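
A minimal sketch of the multiplication steps on biased exponents and integer significands. The tuple layout `(biased_exponent, significand)`, the constants `BIAS` and `FRAC_BITS`, and the function name `fp_multiply` are all assumptions for illustration; signs are ignored (positive operands only) and the product is truncated rather than rounded.

```python
BIAS = 127
FRAC_BITS = 23  # the significand is the integer 1.F scaled by 2^23


def fp_multiply(x, y):
    """Multiply two positive normalized values given as (biased_exponent, significand)."""
    (ex, mx), (ey, my) = x, y
    if mx == 0 or my == 0:
        return (0, 0)
    e = ex + ey - BIAS                 # add exponents, subtract one bias
    m = mx * my                        # product of 1.F values lies in [1, 4)
    if m >= 2 << (2 * FRAC_BITS):      # product >= 2: shift right, increment exponent
        m >>= 1
        e += 1
    m >>= FRAC_BITS                    # drop the extra fraction bits (truncation)
    if e >= 255:
        raise OverflowError("exponent overflow")
    if e <= 0:
        raise ArithmeticError("exponent underflow")
    return (e, m)


X = (129, 0b11001 << 19)   # 1.1001 x 2^2 = 6.25
Y = (130, 0b11001 << 19)   # 1.1001 x 2^3 = 12.5
e, m = fp_multiply(X, Y)
print(e, m / 2 ** FRAC_BITS * 2.0 ** (e - BIAS))  # 133 78.125
```

The exponent check is done after normalization in this sketch because the normalizing shift can itself change the exponent.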
Floating-point Division
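Flowchart: DIVIDE (the flowchart takes X as the dividend and Y as the divisor)
1. X = 0? If yes, Z ← 0 and RETURN.
2. Y = 0? If yes, Z ← ∞ and RETURN.
3. Subtract exponents.
4. Add bias.
5. Exponent overflow? If yes, report overflow and RETURN.
6. Exponent underflow? If yes, report underflow and RETURN.
7. Divide significands.
8. Normalize.
9. Round and RETURN.

Worked example (here the quotient computed is Y ÷ X = Z):
X = 3.75_10 = 11.11_2 = 1.111 x 2^1
Y = 95.625_10 = 101 1111.101_2 = 1.011111101 x 2^6
Subtract exponents:   E1 = 127 + 1 = 128
                      E2 = 127 + 6 = 133
                      E2 – E1 = 5
Add bias:   ET = 127 + 5 = 132
Divide significands:   1.011111101_2 ÷ 1.111_2 = 0.110011_2
Normalize:   0.110011 x 2^5 = 1.10011 x 2^4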
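
The division steps can be sketched in the same style. As before, the tuple layout `(biased_exponent, significand)` and the names `BIAS`, `FRAC_BITS` and `fp_divide` are illustrative assumptions; operands are positive and normalized, the quotient is truncated, and a zero divisor raises an exception instead of returning infinity.

```python
BIAS = 127
FRAC_BITS = 23  # the significand is the integer 1.F scaled by 2^23


def fp_divide(dividend, divisor):
    """Divide two positive normalized values given as (biased_exponent, significand)."""
    (ed, md), (es, ms) = dividend, divisor
    if md == 0:
        return (0, 0)
    if ms == 0:
        raise ZeroDivisionError("result would be infinity")
    e = ed - es + BIAS                       # subtract exponents, add the bias back
    q = (md << (FRAC_BITS + 1)) // ms        # quotient with one extra fraction bit
    if q >= 2 << FRAC_BITS:                  # quotient of significands is >= 1
        m = q >> 1
    else:                                    # quotient < 1: shift left, decrement exponent
        m = q
        e -= 1
    if e >= 255:
        raise OverflowError("exponent overflow")
    if e <= 0:
        raise ArithmeticError("exponent underflow")
    return (e, m)


X = (128, 0b1111 << 20)         # 1.111 x 2^1 = 3.75
Y = (133, 0b1011111101 << 14)   # 1.011111101 x 2^6 = 95.625
e, m = fp_divide(Y, X)
print(e, m / 2 ** FRAC_BITS * 2.0 ** (e - BIAS))  # 131 25.5
```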
Floating Point Multiplication

31
Floating Point Division

32
PROBLEM (1)

• Express the number - (640.5)10 in IEEE 32 bit and 64 bit floating point
format

33
SOLUTION (1)….
• IEEE 32 BIT FLOATING POINT FORMAT

sign (MSB, 1 bit) | Biased Exponent (8 bits) | Mantissa/Significand, normalized (23 bits)

Step 1: Express the given number in binary form
(640.5)_10 = 1010000000.1 x 2^0
Step 2: Normalize the number into the form 1.bbbbbbb
1010000000.1 x 2^0 = 1.0100000001 x 2^9
Once normalized, every number will have 1 at the leftmost bit, so the IEEE notation does not store this bit. Therefore the significand to be stored is 0100 0000 0100 0000 0000 000 in the allotted 23 bits
SOLUTION (1)…….
• Step 3: For the 8-bit biased exponent field, the bias used is
2^(k-1) - 1 = 2^(8-1) - 1 = 127
Add the bias 127 to the exponent 9 and convert it into binary in order to store it in the 8-bit biased exponent field: 127 + 9 = 136 (1000 1000)
• Step 4: Since the given number is negative, put MSB as 1
• Step 5: Pack the result into proper format (IEEE 32 bit)

1 1000 1000 0100 0000 0100 0000 0000 000

IEEE-754 Conversion Example
Represent -12.625_10 in single precision IEEE-754 format.
• Step #1: Convert to target base. -12.625_10 = -1100.101_2
• Step #2: Normalize. -1100.101_2 = -1.100101_2 × 2^3
• Step #3: Fill in bit fields. Sign is negative, so sign bit is 1. Exponent is in excess 127 (not excess 128!), so exponent is represented as the unsigned integer 3 + 127 = 130. Leading 1 of significand is hidden, so final bit pattern is:
1 1000 0010 . 1001 0100 0000 0000 0000 000
SOLUTION (1)…...
• IEEE 64 BIT FLOATING POINT FORMAT

sign (MSB, 1 bit) | Biased Exponent (11 bits) | Mantissa/Significand, normalized (52 bits)

Step 1: Express the given number in binary form
(640.5)_10 = 1010000000.1 x 2^0
Step 2: Normalize the number into the form 1.bbbbbbb
1010000000.1 x 2^0 = 1.0100000001 x 2^9
Once normalized, every number will have 1 at the leftmost bit, so the IEEE notation does not store this bit. Therefore the significand to be stored is 0100 0000 0100 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 in the allotted 52 bits

SOLUTION (1)…
• Step 3: For the 11-bit biased exponent field, the bias used is
2^(k-1) - 1 = 2^(11-1) - 1 = 1023
Add the bias 1023 to the exponent 9 and convert it into binary in order to store it in the 11-bit biased exponent field: 1023 + 9 = 1032 (1000 0001 000)
• Step 4: Since the given number is negative, put MSB as 1
• Step 5: Pack the result into proper format (IEEE 64 bit)

1 1000 0001 000 0100 0000 0100 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
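
Both packed results for -640.5 can be verified with Python's standard `struct` module, which exposes the single (`f`) and double (`d`) encodings. The helper name `ieee_fields` is an assumption of this sketch.

```python
import struct


def ieee_fields(x: float, double: bool = False) -> str:
    """Return 'sign exponent significand' bit fields for the 32- or 64-bit format."""
    if double:
        float_fmt, int_fmt, exp_bits, total = ">d", ">Q", 11, 64
    else:
        float_fmt, int_fmt, exp_bits, total = ">f", ">I", 8, 32
    (word,) = struct.unpack(int_fmt, struct.pack(float_fmt, x))
    bits = format(word, f"0{total}b")
    return f"{bits[0]} {bits[1:1 + exp_bits]} {bits[1 + exp_bits:]}"


print(ieee_fields(-640.5))
# 1 10001000 01000000010000000000000
print(ieee_fields(-640.5, double=True))
# 1 10000001000 0100000001000000000000000000000000000000000000000000
```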
Character Representation ASCII
ASCII (American Standard Code for Information Interchange) Code

                 MSB (3 bits)
LSB (4 bits)  0    1    2    3    4    5    6    7
0             NUL  DLE  SP   0    @    P    `    p
1             SOH  DC1  !    1    A    Q    a    q
2             STX  DC2  "    2    B    R    b    r
3             ETX  DC3  #    3    C    S    c    s
4             EOT  DC4  $    4    D    T    d    t
5             ENQ  NAK  %    5    E    U    e    u
6             ACK  SYN  &    6    F    V    f    v
7             BEL  ETB  '    7    G    W    g    w
8             BS   CAN  (    8    H    X    h    x
9             HT   EM   )    9    I    Y    i    y
A             LF   SUB  *    :    J    Z    j    z
B             VT   ESC  +    ;    K    [    k    {
C             FF   FS   ,    <    L    \    l    |
D             CR   GS   -    =    M    ]    m    }
E             SO   RS   .    >    N    ^    n    ~
F             SI   US   /    ?    O    _    o    DEL
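
Reading the table amounts to splitting a 7-bit code into its upper 3 bits (column) and lower 4 bits (row). A small sketch, with the function name `ascii_position` chosen for illustration:

```python
def ascii_position(ch: str):
    """Return (column, row) of a character in the ASCII table: MSB 3 bits, LSB 4 bits."""
    code = ord(ch)
    if code > 0x7F:
        raise ValueError("not a 7-bit ASCII character")
    return code >> 4, code & 0x0F


print(ascii_position("A"))  # (4, 1)
print(ascii_position("z"))  # (7, 10), i.e. column 7, row A
```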
Control Character Representation
(ASCII)
NUL Null DC1 Device Control 1
SOH Start of Heading (CC) DC2 Device Control 2
STX Start of Text (CC) DC3 Device Control 3
ETX End of Text (CC) DC4 Device Control 4
EOT End of Transmission (CC) NAK Negative Acknowledge (CC)
ENQ Enquiry (CC) SYN Synchronous Idle (CC)
ACK Acknowledge (CC) ETB End of Transmission Block (CC)
BEL Bell CAN Cancel
BS Backspace (FE) EM End of Medium
HT Horizontal Tab. (FE) SUB Substitute
LF Line Feed (FE) ESC Escape
VT Vertical Tab. (FE) FS File Separator (IS)
FF Form Feed (FE) GS Group Separator (IS)
CR Carriage Return (FE) RS Record Separator (IS)
SO Shift Out US Unit Separator (IS)
SI Shift In DEL Delete
DLE Data Link Escape (CC)
(CC) Communication Control
(FE) Format Effector
(IS) Information Separator
[Figure: The EBCDIC character code, shown with hexadecimal indices]
[Figure: The EBCDIC control character representation]
EBCDIC

• EBCDIC (Extended Binary Coded Decimal Interchange Code) is an extended binary code for IBM mainframes, mid-range computers, and peripheral devices that uses 8 bits instead of the original 6-bit format.
• Although EBCDIC is still used today, more modern encoding forms, such as ASCII and Unicode, exist. While IBM mainframe and mid-range systems use EBCDIC as their default encoding format, most IBM devices also include support for modern formats, allowing them to take advantage of newer features that EBCDIC does not provide.

43
• Applications
• EBCDIC is used almost exclusively on IBM machines such as mainframes, midrange computers, and peripheral devices. Since most IBM machines include extensive processing capabilities and some support for modern encodings, they are able to keep up with and even outperform devices from other brands. However, most machines and operating systems depend on ASCII and Unicode as their default encoding format.

44
EBCDIC

• Advantages and Disadvantages


• EBCDIC is advantageous because it is an 8-bit character code rather than the old 6-bit character code found on punch card encoding systems.
• This allows EBCDIC to provide IBM machines with support for a wide variety of functions that punch card encoding systems did not provide

45
ASCII

• ASCII stands for American Standard Code for Information Interchange. It is the standard binary code used to represent alphanumeric characters.
• Alphanumeric characters are used for the transfer of information to and from the I/O devices and the computer. This standard uses seven bits to code 128 characters. However, there is an additional bit on the left that is always assigned 0. Therefore, there are 8 bits in total.
• The ASCII code consists of 94 printable characters and 34 nonprinting characters used for various control operations. The printable characters are the 26 uppercase letters A through Z, the 26 lowercase letters a through z, the numerals from 0 to 9, and 32 special printable characters such as % and *.
• The control characters are used to route the data and arrange the printed text into a prescribed format.
46
• In general, data is stored in a computer in the form of bits (1 or 0). There are various coding schemes available specifying the set of bytes that represents each character.
• ASCII − Stands for American Standard Code for Information Interchange. It was developed by the American Standards Association and is one of the most widely used coding systems. It represents characters using 7 bits and includes 128 characters (upper and lowercase Latin alphabet, the numbers 0-9, and some extra characters).
• Unicode (UTF) − UTF stands for Unicode Transformation Format. Unicode is developed by The Unicode Consortium. If you want to create documents that use characters from multiple character sets, you can do so using the single Unicode character set. It provides 3 types of encodings.
• UTF-8 − It comes in 8-bit units (bytes); a character in UTF-8 can be from 1 to 4 bytes long, making UTF-8 variable width.
• UTF-16 − It comes in 16-bit units (shorts); a character can be 1 or 2 shorts long, making UTF-16 variable width.
• UTF-32 − It comes in 32-bit units (longs). It is a fixed-width format and is always 1 "long" in length.
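
The variable and fixed widths can be observed by encoding a few characters with Python's built-in codecs. The function name `encoded_sizes` is illustrative; the `-le` codec variants are used so that no byte-order mark is added to the byte counts.

```python
def encoded_sizes(ch: str) -> dict:
    """Number of bytes one character occupies in each Unicode encoding form."""
    return {enc: len(ch.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}


for ch in ("A", "\u00e9", "\u20ac", "\U0001F600"):  # A, e-acute, euro sign, an emoji
    print(f"U+{ord(ch):04X}", encoded_sizes(ch))
# U+0041 {'utf-8': 1, 'utf-16-le': 2, 'utf-32-le': 4}
# U+00E9 {'utf-8': 2, 'utf-16-le': 2, 'utf-32-le': 4}
# U+20AC {'utf-8': 3, 'utf-16-le': 2, 'utf-32-le': 4}
# U+1F600 {'utf-8': 4, 'utf-16-le': 4, 'utf-32-le': 4}
```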
