Lecture 14 - Arithmetic Subsystems - Numbering Systems and Floating Point Unit (FPU)

Department of Systems and Computer Engineering,

Carleton University

SYSC 3320 Computer Systems Design


Number Systems and
Floating Point Unit (FPU)

1
Copyright/Source

• Credit for material: Dr. Mohamed Atia

2
What we learn in this lecture
• Number systems
• Integer number representations
• Fractional number representations
• Fixed-point representations
• Floating point representations

3
Number Representation in Digital Systems
• Processing numbers is a basic part of any computing system. In digital hardware, numbers are represented using the base-2 positional weighting system.
• Numbers can be categorized into
• Integer numbers and
• Fractional numbers

Example: decimal 173 in binary

Integer numbers in base 2 (binary)

4
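As a sketch of the base-2 positional weighting idea, the slide's example (decimal 173) can be converted by repeated division by 2. This short Python illustration is not part of the original slides:

```python
def to_binary(n):
    """Convert a non-negative integer to a base-2 string by repeated
    division by 2; each remainder is one positional weight (bit)."""
    bits = []
    while n > 0:
        bits.append(str(n % 2))
        n //= 2
    return "".join(reversed(bits)) or "0"

print(to_binary(173))  # 10101101 = 128 + 32 + 8 + 4 + 1
```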


Number Representation in Digital Systems
• Integer representation is straightforward, and mathematical operations on integers are relatively easy and efficient to implement in hardware.
• Adder, multiplier, and divider circuits exist for integer binary numbers.

4-bit Array Multiplier
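In software, the partial-product structure of the array multiplier shown in the figure can be mimicked by shift-and-add (an illustrative sketch, not from the slides; the function name is hypothetical):

```python
def multiply_4bit(a, b):
    """Shift-and-add multiplication mirroring a 4-bit array multiplier:
    each bit of b gates a shifted copy of a (a partial product), and the
    partial products are summed."""
    product = 0
    for i in range(4):
        if (b >> i) & 1:       # bit i of the multiplier
            product += a << i  # add the partial product a * 2^i
    return product

print(multiply_4bit(13, 11))  # 143
```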

5
Signed Binary Numbers: Sign/Magnitude
• Sign/Magnitude Numbers
➢ 1 sign bit, N-1 magnitude bits
➢ Sign bit is the most significant (left-most) bit
o Positive number: sign bit = 0
o Negative number: sign bit = 1

• Example: 4-bit sign/magnitude representations of ±6:
➢ +6 = 0110
➢ -6 = 1110
• Range of an N-bit sign/magnitude number:
➢ [-(2^(N-1) - 1), 2^(N-1) - 1]

6
Signed Binary Numbers: Sign/Magnitude
• Problems
– Addition doesn't work; for example, -6 + 6:
    1110
  + 0110
  10100 (wrong!)
– Two representations of 0 (±0):
  1000
  0000
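Both problems can be demonstrated in a few lines of Python (an illustration added here, not from the slides; `sm_decode` is a hypothetical helper name):

```python
def sm_decode(bits, n=4):
    """Decode an n-bit sign/magnitude pattern: MSB is the sign bit."""
    magnitude = bits & ((1 << (n - 1)) - 1)
    return -magnitude if (bits >> (n - 1)) & 1 else magnitude

# Plain binary addition of -6 and +6 does not give zero:
raw = (0b1110 + 0b0110) & 0b1111   # keep 4 bits -> 0b0100
assert sm_decode(raw) == 4         # wrong: the sum should be 0

# Zero has two encodings:
assert sm_decode(0b0000) == 0 and sm_decode(0b1000) == 0
```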

7
Signed Binary Numbers: Two’s Complement
• Two's complement numbers don't have the same problems as sign/magnitude numbers:
➢ Addition works
➢ Single representation for 0
• Most positive 4-bit number: 0111
• Most negative 4-bit number: 1000
• The most significant bit still indicates the sign
• (1 = negative, 0 = positive)
• Range of an N-bit two's complement number: [-2^(N-1), 2^(N-1) - 1]
• Taking the two's complement:
1. Invert the bits
2. Add 1
• Example: flip the sign of 3 (decimal) = 0011 (binary): invert → 1100, add 1 → 1101 = -3
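The two-step recipe above can be sketched directly in Python (an added illustration, not part of the slides):

```python
def twos_complement(x, n=4):
    """Flip the sign of an n-bit value: invert the bits, then add 1 (mod 2^n)."""
    mask = (1 << n) - 1
    return ((~x) + 1) & mask

assert twos_complement(0b0011) == 0b1101  # +3 -> -3
assert twos_complement(0b1101) == 0b0011  # negating again recovers +3
```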
8
Sign-Extension: Increasing Number of Bits

• The sign bit is copied into the new most significant bits

• The number's value is unchanged
• Example 1:
– 4-bit representation of 3 = 0011
– 8-bit sign-extended value: 00000011
• Example 2:
– 4-bit representation of -5 = 1011
– 8-bit sign-extended value: 11111011
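The two examples above can be reproduced with a short sketch (added illustration, not from the slides):

```python
def sign_extend(x, from_bits, to_bits):
    """Widen an unsigned bit pattern by copying its sign bit
    into the new high-order bit positions."""
    if (x >> (from_bits - 1)) & 1:  # negative: fill the new bits with 1s
        x |= ((1 << (to_bits - from_bits)) - 1) << from_bits
    return x

assert sign_extend(0b0011, 4, 8) == 0b00000011  # +3 stays +3
assert sign_extend(0b1011, 4, 8) == 0b11111011  # -5 stays -5
```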

9
Number System Comparison

Number System       Range
Unsigned            [0, 2^N - 1]
Sign/Magnitude      [-(2^(N-1) - 1), 2^(N-1) - 1]
Two's Complement    [-2^(N-1), 2^(N-1) - 1]

For example, 4-bit representations:

Unsigned (0 to 15):           0000 = 0, 0001 = 1, ..., 1111 = 15
Two's complement (-8 to 7):   1000 = -8, 1001 = -7, ..., 1111 = -1, 0000 = 0, ..., 0111 = 7
Sign/magnitude (-7 to 7):     1111 = -7, 1110 = -6, ..., 1001 = -1, 1000 = -0, 0000 = +0, ..., 0111 = 7
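The comparison can be made concrete by decoding one bit pattern under each system (an added sketch, not from the slides; the `decode` helper is hypothetical):

```python
def decode(bits, system):
    """Interpret a 4-bit pattern under each of the three number systems."""
    if system == "unsigned":
        return bits
    sign = (bits >> 3) & 1
    if system == "sign/magnitude":
        magnitude = bits & 0b0111
        return -magnitude if sign else magnitude
    return bits - 16 if sign else bits  # two's complement: subtract 2^4 if negative

assert decode(0b1101, "unsigned") == 13
assert decode(0b1101, "sign/magnitude") == -5
assert decode(0b1101, "two's complement") == -3
```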

10
Fractional Numbers
• Fractional numbers are more complex and more challenging to implement in hardware.
• Specialized co-processors and optimized hardware accelerators are usually designed specifically to process fractional numbers.
• Two main types of fractional number representations:
• Fixed point representation
• Floating point representation

11
Fractional Numbers: Fixed-Point representation

• A fixed number of bits is used for the integer part
➢ More bits means a bigger range of numbers
• A fixed number of bits is used for the fractional part
➢ More bits means higher accuracy
• An imaginary fraction point is assumed to separate the integer and fractional digits

Fractional numbers in base 2

12
Fractional Numbers: Fixed-Point representation
It can be represented in 1's complement or 2's complement format, as shown.
In both encodings, the smallest numerical difference between decoded numbers is 2^-R. For example, for integers (R = 0), the smallest increment is 2^0 = 1.
• 2^-R quantifies the "imprecision" of the representation.

2's complement, with L integer bits and R fraction bits:

  x = -b_(L-1)·2^(L-1) + Σ_(i=-R)^(L-2) b_i·2^i,   b_i ∈ {0, 1}

  w = b_(L-1) ... b_0 . b_(-1) ... b_(-R),   b_i ∈ {0, 1}

13
Fractional Numbers: Fixed-Point representation
• Imprecision: the smallest difference between two consecutive representable numbers for a given word length and fraction point position. Imprecision = 2^-R
➢ Integer length L
➢ Fraction length R
➢ Word length B = L + R

  w = b_(L-1) ... b_0 . b_(-1) ... b_(-R),   b_i ∈ {0, 1}

[Figure: representable values on the number line, spaced 2^-R apart: ..., -3·2^-R, -2·2^-R, -2^-R, 0, 2^-R, 2·2^-R, 3·2^-R, ...]

14
Fractional Numbers: Fixed-Point representation
• Because a computer has a limited number of digits to store numbers, some numbers, such as 𝜋, cannot be represented exactly due to this imprecision. The number of digits the computer uses to store a number is called its "significant digits" or "significant figures".

[Figure: the same grid of representable values, spaced 2^-R apart]

15
Fractional Numbers: Fixed-Point representation
• The imprecision can be decreased by increasing R. For fixed word size B,
increasing R means decreasing L. Decreasing L decreases the largest
numbers that can be represented.

• Example: for a 4-digit signed representation

– L = R = 2
– the imprecision is …
– the largest magnitude is …
(Sign bit | two integer digits . two fraction digits)

– If L = 1 and R = 3
– the imprecision is decreased to …
– the largest magnitude is halved to …
(Sign bit | one integer digit . three fraction digits)

16
Fractional Numbers: Fixed-Point representation
• The imprecision can be decreased by increasing R. For fixed word size B,
increasing R means decreasing L. Decreasing L decreases the largest
numbers that can be represented.

• Example: for a 4-digit signed representation

– L = R = 2
– the imprecision is 0.25
– the largest magnitude is 1.75 = 1 + 0.5 + 0.25
(Sign bit | two integer digits . two fraction digits)

– If L = 1 and R = 3
– the imprecision is decreased to 0.125
– the largest magnitude is halved to 0.875 = 0.5 + 0.25 + 0.125
(Sign bit | one integer digit . three fraction digits)
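The two cases in the example follow from the formulas imprecision = 2^-R and largest magnitude = 2^(L-1) - 2^-R (with the sign bit counted in L, as in the slide). A small sketch, added here for checking, not part of the slides:

```python
def fixed_point_stats(L, R):
    """For a signed fixed-point format with L integer digits (sign bit
    included) and R fraction digits: imprecision is 2^-R and the
    largest magnitude is 2^(L-1) - 2^-R."""
    return 2.0 ** -R, 2.0 ** (L - 1) - 2.0 ** -R

assert fixed_point_stats(2, 2) == (0.25, 1.75)    # L = R = 2
assert fixed_point_stats(1, 3) == (0.125, 0.875)  # L = 1, R = 3
```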

17
Fractional Numbers: Fixed-Point representation
• For a fixed word length B, there is a tradeoff between the range (the interval from the largest positive to the largest negative representable number) and the imprecision of the represented numbers, i.e., improving the precision entails decreasing the range.
• This tradeoff causes errors in computer numerical calculations. Since the smallest number that can be represented in fixed point is 2^-R, if a number x has a fractional part smaller than 2^-R, one of the following errors will occur:
➢ Truncation: drop the fraction that is less than 2^-R
➢ Round-off: approximate the fraction that is less than 2^-R to exactly 2^-R

• In either case, the error is bounded by 2^-R. If two numbers x and y that are less than 2^-R apart are subtracted, the result will be truncated to zero (an error).
• To control this error, it is better to have a "floating" fraction point rather than a fixed one. This leads to the floating-point format.
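The truncation and round-off behaviors, and the subtraction error, can be demonstrated with a small quantization sketch (added illustration, not from the slides; `quantize` is a hypothetical helper):

```python
import math

def quantize(x, R, mode="truncate"):
    """Place x on a fixed-point grid with step 2^-R: truncation drops the
    sub-2^-R fraction; round-off pushes it up to the next grid point."""
    scale = 2 ** R
    if mode == "truncate":
        return math.floor(x * scale) / scale
    return math.ceil(x * scale) / scale

# Either way, the error is bounded by 2^-R:
assert abs(math.pi - quantize(math.pi, 3)) < 2 ** -3            # 3.125
assert abs(quantize(math.pi, 3, "round") - math.pi) < 2 ** -3   # 3.25

# Two numbers less than 2^-R apart can land on the same grid point,
# so their difference truncates to zero:
assert quantize(0.30, 3) - quantize(0.27, 3) == 0.0
```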

18
Floating-Point Representation
• Floating-point representation: a more flexible representation that can accommodate both large and small numbers, implemented by allowing the fraction point to float rather than stay fixed. This allows the imprecision to vary with the numeric magnitude.

[Figure: fixed-point vs. floating-point representable values on the number line]

• Floating-point representation makes the imprecision proportional to the magnitude of the number (a good idea!)

19
Floating-Point Representation
• A floating-point binary word format consists of
➢ Sign bit
➢ Signed exponent (usually an integer)
➢ Mantissa (can be any number, usually a fraction in fixed-point format)
➢ And a base b (equal to 2 in binary)

• The value of the number in this form = (-1)^(sign bit) × (mantissa) × (base)^(exponent)

20
Floating-Point Representation
• Note: by varying the exponent, we vary the fraction point
position. This means the fraction point is “floating”

Exponent Mantissa

21
Floating-Point Representation
• Mantissa normalization: consider the number 1/34 = 0.02941176... A possible floating-point representation is 0.0294 × 10^0. However, the leading zero in 0.0294 is useless, and we lose a significant digit. The solution to this limitation is to normalize the mantissa as follows:
➢ 0.2941 × 10^-1 (an additional significant digit is retained)
• A normalized mantissa has a limited range: 1/base ≤ m < 1
➢ Minimum mantissa = 1/base
➢ Maximum mantissa is obviously less than 1 (it is a fraction)

22
IEEE-754 Floating-point Format
• IEEE-754
➢ Single precision (SP): 32-bit word
o 1 mantissa sign bit, 8 exponent bits, 23 mantissa fraction bits

➢ Double precision (DP): 64-bit word
o 1 mantissa sign bit, 11 exponent bits, 52 mantissa bits

• Mantissa =
➢ 1 + (b_22·2^-1 + b_21·2^-2 + ... + b_0·2^-23) for SP
➢ 1 + (b_51·2^-1 + b_50·2^-2 + ... + b_0·2^-52) for DP
• Exponent = the encoded unsigned integer minus a bias
➢ e = e_b - Bias
➢ Bias = 127 for SP, Bias = 1023 for DP
23
IEEE-754 Floating-point Format
• Example: convert the following IEEE-754 SP formatted number to decimal:

0 01000000 01111101110000000000010

➢ 1 bit sign
➢ 8 bits for Exponent (bias is +127)
➢ 23 bits for mantissa
• Exponent e = 2^6 - 127 = 64 - 127 = -63
• Mantissa = 1 + 2^-2 + 2^-3 + 2^-4 + 2^-5 + 2^-6 + 2^-8 + 2^-9 + 2^-10 + 2^-22 = 1.4912
• Number = 1.4912 × 2^-63 ≈ 1.6168 × 10^-19
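The worked example can be checked with Python's standard `struct` module, which reinterprets the 32-bit pattern as an SP float (a verification sketch added here, not part of the slides):

```python
import struct

# Reassemble the slide's bit pattern: sign | exponent | fraction.
bits = "0" + "01000000" + "01111101110000000000010"
value = struct.unpack(">f", struct.pack(">I", int(bits, 2)))[0]

# Decode by hand, following the slide's steps.
exponent = int(bits[1:9], 2) - 127  # 64 - 127 = -63
mantissa = 1 + sum(int(b) * 2.0 ** -(i + 1) for i, b in enumerate(bits[9:]))

assert exponent == -63
assert value == mantissa * 2.0 ** exponent
print(value)  # approximately 1.6168e-19
```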

24
Special Numbers in IEEE-754 Floating-point Format
• IEEE-754 has three special numbers:

Number Sign Exponent Fraction


0 X 00000000 00000000000000000000000
∞ 0 11111111 00000000000000000000000
-∞ 1 11111111 00000000000000000000000
NaN X 11111111 non-zero

• Encoded exponents reserved for the special numbers:

➢ 0...0 (all zeroes) for zero
➢ 1...1 (all ones) for ∞ and NaN
• The permissible exponent values for non-special numbers:
➢ -126 to 127 for SP
➢ -1022 to 1023 for DP

25
Numeric Range of IEEE-754 Floating-point Format

• The IEEE-754 DP numeric range is shown below:

• Multiplying 1.0 × 10^-200 by 5.0 × 10^-200 results in underflow

• Performing the division (6.1 × 10^150) / (4.0 × 10^-200) results in overflow

• Possible causes of NaN: 0/0, or taking the square root of a negative number such as √-10
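The slide's underflow and overflow cases can be reproduced with ordinary Python floats, which are IEEE-754 DP (an added sketch, not from the slides; note that Python raises an exception for 0.0/0.0, so ∞ - ∞ stands in here as the indeterminate operation producing NaN):

```python
import math

tiny = 1.0e-200 * 5.0e-200   # true result 5e-400 is below the DP range: underflows to 0.0
huge = 6.1e150 / 4.0e-200    # true result ~1.5e350 exceeds the DP range: overflows to inf
nan = float("inf") - float("inf")  # an indeterminate operation yields NaN

assert tiny == 0.0
assert math.isinf(huge)
assert math.isnan(nan)
```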

26
Floating-Point Addition
1. Extract exponent and fraction bits
2. Prepend leading 1 to form mantissa
3. Compare exponents
4. Shift smaller mantissa if necessary
5. Add mantissas
6. Normalize mantissa and adjust exponent if necessary
7. Round result if necessary
8. Assemble exponent and fraction back into floating-point format

27
Floating-Point Addition Example
• Extract exponent and fraction bits
1 bit 8 bits 23 bits
0 01111111 100 0000 0000 0000 0000 0000
Sign Exponent Fraction
1 bit 8 bits 23 bits
0 10000000 101 0000 0000 0000 0000 0000
Sign Exponent Fraction

For N1: S1 = 0, BE1 = 127, E1 = 127 - 127 = 0, F1 = .1

For N2: S2 = 0, BE2 = 128, E2 = 128 - 127 = 1, F2 = .101

• Prepend the leading 1 to form the mantissas, then align to the larger exponent:

N1 = 1.1 × 2^0 = 0.110 × 2^1
N2 = 1.101 × 2^1 = 1.101 × 2^1
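Carrying the example through: N1 = 1.1₂ × 2^0 = 1.5 and N2 = 1.101₂ × 2^1 = 3.25; after alignment the mantissas add to 10.011₂ × 2^1, which normalizes to 1.0011₂ × 2^2 = 4.75. A short Python check, added here (the `f32` helper is hypothetical):

```python
import struct

def f32(pattern):
    """Build a float from an IEEE-754 SP bit pattern written as a string."""
    return struct.unpack(">f", struct.pack(">I", int(pattern.replace(" ", ""), 2)))[0]

n1 = f32("0 01111111 10000000000000000000000")  # 1.1   (binary) x 2^0 = 1.5
n2 = f32("0 10000000 10100000000000000000000")  # 1.101 (binary) x 2^1 = 3.25
assert n1 + n2 == 4.75  # 100.11 (binary) = 1.0011 x 2^2
```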

28
Floating-Point Multiplication/Division
• Multiplication
➢ Multiply mantissas and add exponents!
• Division
➢ Divide mantissas and subtract exponents!
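The multiplication rule can be checked with `math.frexp`, which splits a float into a mantissa in [0.5, 1) and an integer exponent (an added sketch, not part of the slides):

```python
import math

# "Multiply mantissas and add exponents":
m1, e1 = math.frexp(1.5)    # 0.75,   1
m2, e2 = math.frexp(3.25)   # 0.8125, 2
product = (m1 * m2) * 2.0 ** (e1 + e2)
assert product == 1.5 * 3.25  # 4.875
```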

29
Floating Point Unit (FPU)
• Floating-point operations can be performed by hardware (circuitry) or by software (program code). The programmer cannot tell which design is used without prior knowledge of the system's hardware design. The software method is approximately 1000 times slower than the hardware method. The hardware unit that performs floating-point arithmetic operations is called the Floating Point Unit (FPU).
• The FPU is also known as a "floating-point co-processor". In SoC design, this co-processor is embedded within the processor core in a separate section.

30
Floating Point Unit (FPU)
• Floating Point Unit (FPU):
• The Zynq SoC platform's dual-core ARM Cortex-A9 processor has an FPU co-processor within each processor core

31
Thank You ☺

32
