Lecture5_Arithmetic for Computers – Part 2

The document discusses floating point representation in computer arithmetic, emphasizing the IEEE 754 standard for single and double precision formats. It explains the significance of normalized and denormalized forms, special numbers like zero, infinity, and NaN, and the range and precision of floating point numbers. Additionally, it provides examples of converting between decimal and floating point representations.


計算機組織

Computer Organization

Arithmetic for Computers – Part 2

Kun-Chih (Jimmy) Chen 陳坤志


[email protected]
Institute of Electronics,
National Yang Ming Chiao Tung University

NYCU EE / IEE
Floating Point: Motivation

❖ What can be represented in n bits?


Unsigned: 0 to 2^n − 1
2's complement: −2^(n−1) to 2^(n−1) − 1
1's complement: −2^(n−1) + 1 to 2^(n−1) − 1
Excess M: −M to 2^n − M − 1
❖ But, what about ...
❖ very large numbers? 1,987,987,987,987,987,987,987,987,987
❖ very small numbers? 0.0000000000000000000000054088
❖ rationals, e.g., 2/3
❖ irrationals, e.g., √2
❖ transcendentals, e.g., e, π
❖ Types float and double in C

P2
Floating Point: Example

❖ Floating Point
❖ A = 31.48
➢ 3 → 3 × 10^1
➢ 1 → 1 × 10^0
➢ 4 → 4 × 10^−1
➢ 8 → 8 × 10^−2

❖ Scientific notation
❖ A = 3.148 × 10^1
➢ 3 → 3 × 10^0 × 10^1
➢ 1 → 1 × 10^−1 × 10^1
➢ 4 → 4 × 10^−2 × 10^1
➢ 8 → 8 × 10^−3 × 10^1

P3
Scientific Notation: Decimal

3.5_ten × 10^−9
(significand: the fraction/mantissa 3.5; radix/base: 10; exponent: −9;
the "." is the decimal point)

❖ Normalized form: no leading 0s
(exactly one nonzero digit to the left of the decimal point)

❖ Alternatives to represent 0.0000000035

❖ Normalized: 3.5 × 10^−9
❖ Not normalized: 0.35 × 10^−8, 35.0 × 10^−10

P4
Scientific Notation: Binary

1.001_two × 2^−9
(significand: the fraction/mantissa 1.001; radix/base: 2; exponent: −9;
the "." is the binary point)

❖ Computer arithmetic that supports it is called floating point, because
the binary point is not fixed, as it is for integers

❖ Normalized form: no leading 0s
(exactly one nonzero digit to the left of the binary point)

❖ Scientific notation
❖ Normalized: 1.001 × 2^−9
❖ Not normalized: 0.1001 × 2^−8, 10.01 × 2^−10

P5
Floating Point Standard

❖ Defined by IEEE Std 754-1985

❖ Developed in response to divergence of representations


❖ Portability issues for scientific code

❖ Now almost universally adopted

❖ Two representations
❖ Single precision (32-bit)
❖ Double precision (64-bit)

P6
FP Representation

❖ Normal format: 1.xxxxxxxxxx_two × 2^yyyy_two

❖ Want to fit it into words: 32 bits for single precision and 64
bits for double precision
❖ A simple single-precision representation:

   31 | 30 … 23  | 22 … 0
    S | Exponent | Fraction
1 bit | 8 bits   | 23 bits

S represents the sign
Exponent represents the y's
Fraction represents the x's

❖ Represents numbers as small as ~2.0 × 10^−38 and as large as
~2.0 × 10^38

P7
Double Precision Representation

❖ Next multiple of the word size (64 bits)

First word:    31 | 30 … 20  | 19 … 0
                S | Exponent | Fraction
            1 bit | 11 bits  | 20 bits

Second word:  Fraction (cont'd)
              32 bits

❖ Double precision (vs. single precision)

❖ Represents numbers almost as small as ~2.0 × 10^−308
and almost as large as ~2.0 × 10^308
❖ But the primary advantage is greater accuracy,
due to the larger fraction

P8
IEEE 754 Standard (1/4)

❖ Shown for single precision; double precision is similar


❖ Sign bit:
1 means negative
0 means positive
❖ Fraction:
❖ To pack more bits, leading 1 implicit for normalized numbers (hidden leading
1 bit)
❖ 1 + 23 bits single, 1 + 52 bits double
❖ always true: 0 ≤ Fraction < 1
(for normalized numbers)
❖ Significand is Fraction with the “1.” restored
❖ Note: 0 has no leading 1, so reserve exponent value 0 just for number 0

P9
IEEE 754 Standard (2/4)

❖ Exponent:
❖ Need to represent positive and negative exponents
❖ Also want to compare FP numbers as if they were integers, to help in
value comparisons
❖ How about using 2's complement to represent it?
Ex: 1.0 × 2^−1 versus 1.0 × 2^+1 (1/2 versus 2)

1/2  0 1111 1111 000 0000 0000 0000 0000 0000   (exponent −1)

2    0 0000 0001 000 0000 0000 0000 0000 0000   (exponent +1)

If we use integer comparison on these two words, we
will conclude that 1/2 > 2!!!

P10
IEEE 754 Standard (3/4)

❖ Instead, let notation 0000 0000 be most negative, and 1111 1111
most positive
❖ Called biased notation, where bias is the number subtracted to get
the real number
❖ IEEE 754 uses bias of 127 for single precision:
Subtract 127 from Exponent field to get actual value for exponent
❖ 1023 is bias for double precision

1/2  0 0111 1110 000 0000 0000 0000 0000 0000   (126 − 127 = −1)
2    0 1000 0000 000 0000 0000 0000 0000 0000   (128 − 127 = +1)

Now we can use integer comparison for floating-point
comparison.

P11
Biased (Excess) Notation
❖ Biased 7 (4 bits): unsigned value / bit pattern / number represented
 0   0000   -7
 1   0001   -6
 2   0010   -5
 3   0011   -4
 4   0100   -3
 5   0101   -2
 6   0110   -1
 7   0111    0
 8   1000    1
 9   1001    2
10   1010    3
11   1011    4
12   1100    5
13   1101    6
14   1110    7
15   1111    8

P12
IEEE 754 Standard (4/4)

❖ Summary (single precision):

   31 | 30 … 23  | 22 … 0
    S | Exponent | Fraction
1 bit | 8 bits   | 23 bits

(−1)^S × (1.Fraction) × 2^(Exponent − 127)

❖ Double precision is the same, except with an exponent bias of 1023

P13
Example 1: FP to Decimal

0 0110 1000 101 0101 0100 0011 0100 0010

❖ Sign: 0 => positive


❖ Exponent:
❖ 0110 1000_two = 104_ten
❖ Bias adjustment: 104 − 127 = −23
❖ Fraction:
❖ 1 + 2^−1 + 2^−3 + 2^−5 + 2^−7 + 2^−9 + 2^−14 + 2^−15 + 2^−17 + 2^−22
= 1.0 + 0.666115
❖ Represents: 1.666115_ten × 2^−23 ≈ 1.986 × 10^−7

P14
Example 2: Decimal to FP

❖ Number = −0.75
= −0.11_two × 2^0 (scientific notation)
= −1.1_two × 2^−1 (normalized scientific notation)

❖ Sign: negative => 1

❖ Exponent:
❖ Bias adjustment: −1 + 127 = 126
❖ 126_ten = 0111 1110_two

1 0111 1110 100 0000 0000 0000 0000 0000

P15
Example 3: Decimal to FP

❖ A more difficult case: representing 1/3?

= 0.33333…_ten = 0.0101010101…_two × 2^0
= 1.0101010101…_two × 2^−2
❖ Sign: 0
❖ Exponent = −2 + 127 = 125_ten = 0111 1101_two
❖ Fraction = 0101 0101 0101 0101 0101 010…

0 0111 1101 0101 0101 0101 0101 0101 010

P16
Double-Precision Range

❖ Exponents 0000…00 and 1111…11 reserved

❖ Smallest value
❖ Exponent: 00000000001
→ actual exponent = 1 − 1023 = −1022
❖ Fraction: 000…00 → significand = 1.0
❖ ±1.0 × 2^−1022 ≈ ±2.2 × 10^−308
❖ Largest value
❖ Exponent: 11111111110
→ actual exponent = 2046 − 1023 = +1023
❖ Fraction: 111…11 → significand ≈ 2.0
❖ ±2.0 × 2^+1023 ≈ ±1.8 × 10^+308

P17
Floating-Point Precision

❖ Relative precision
❖ all fraction bits are significant
❖ Single: approx 2^−23
➢ Equivalent to 23 × log10(2) ≈ 23 × 0.3 ≈ 6 decimal digits of precision
❖ Double: approx 2^−52
➢ Equivalent to 52 × log10(2) ≈ 52 × 0.3 ≈ 16 decimal digits of precision

P18
Zero and Special Numbers

❖ What have we defined so far? (single precision)

Exponent   Fraction   Object

0          0          ???
0          nonzero    ???
1-254      anything   +/- floating-point
255        0          ???
255        nonzero    ???

P19
Representation for 0

❖ Represent 0?
❖ Exponent: all zeroes
❖ Fraction: all zeroes, too
❖ What about sign?
❖ +0: 0 00000000 00000000000000000000000
❖ -0: 1 00000000 00000000000000000000000
❖ Why two zeroes?
❖ Helps in some limit comparisons

P20
Special Numbers

❖ What have we defined so far? (single precision)

Exponent   Fraction   Object

0          0          +/- 0
0          nonzero    ???
1-254      anything   +/- floating-point
255        0          ???
255        nonzero    ???

❖ Range:
Smallest normalized: 1.0 × 2^−126 ≈ 1.2 × 10^−38
What if a result is too small? (>0, < 1.2 × 10^−38 => Underflow!)
Largest: 1.11…1 × 2^127 = (2 − 2^−23) × 2^127 ≈ 3.4 × 10^38
What if a result is too large? (> 3.4 × 10^38 => Overflow!)

P21
Range of Single-Precision Floating-Point Numbers

[Number line: the representable negative range runs from −1.11…11 × 2^127
to −1.0 × 2^−126, then zero, then the positive range from +1.0 × 2^−126
to +1.11…11 × 2^127. Results beyond either end overflow (toward −∞ / +∞);
nonzero results between −1.0 × 2^−126 and +1.0 × 2^−126 underflow.]

P22
Gradual Underflow

❖ Represent denormalized numbers (denorms)

❖ Exponent: all zeroes
❖ Fraction: nonzero
❖ Allow a number to degrade in significance until it becomes 0 (gradual
underflow)

❖ The smallest normalized number

➢ 1.0000 0000 0000 0000 0000 000 × 2^−126
❖ The smallest denormalized number
➢ 0.0000 0000 0000 0000 0000 001 × 2^−126 (= 2^−149)

P23
Special Numbers

❖ What have we defined so far? (single precision)

Exponent   Fraction   Object

0          0          +/- 0
0          nonzero    denorm
1-254      anything   +/- floating-point
255        0          ???
255        nonzero    ???

P24
Representation for +/- Infinity

❖ In FP, divide by zero should produce +/- infinity, not overflow


❖ Why?
❖ OK to do further computations with infinity
Ex: X/0 > Y may be a valid comparison
❖ IEEE 754 represents +/- infinity
❖ Most positive exponent reserved for infinity
❖ Fractions all zeroes

S 1111 1111 0000 0000 0000 0000 0000 000

P25
Special Numbers (cont’d)

❖ What have we defined so far? (single-precision)

Exponent   Fraction   Object

0          0          +/- 0
0          nonzero    denorm
1-254      anything   +/- floating-point
255        0          +/- infinity
255        nonzero    ???

P26
Representation for Not a Number

❖ What do I get if I calculate sqrt(-4.0) or 0/0?


❖ If infinity is not an error, these should not be either
❖ They are called Not a Number (NaN)
❖ Exponent = 255, fraction nonzero

❖ Why is this useful?

❖ NaNs can help with debugging
❖ They contaminate: op(NaN, X) = NaN
❖ OK to compute with a NaN as long as the result is never used

P27
Special Numbers (cont’d)

❖ What have we defined so far? (single-precision)

Exponent   Fraction   Object

0          0          +/- 0
0          nonzero    denorm
1-254      anything   +/- floating-point
255        0          +/- infinity
255        nonzero    NaN

P28
Decimal Addition

❖ A = 3.71345 × 10^2, B = 1.32 × 10^−4, Perform A + B

    3.71345    × 10^2
+ 0.00000132 × 10^2   (B right-shifted 2 − (−4) = 6 digits)
  3.71345132 × 10^2

❖ A = 3.71345 × 10^2
❖ B = 1.32 × 10^−4 = 0.00000132 × 10^2
❖ A + B = (3.71345 + 0.00000132) × 10^2

P29
Floating-Point Addition
Basic addition algorithm:
(1) Align binary points: compute Ye − Xe
❖ right shift the mantissa of the smaller number, say Xm, that many
positions to form Xm × 2^(Xe−Ye)

(2) Add mantissas: compute Xm × 2^(Xe−Ye) + Ym

(3) Normalize & check for over/underflow if necessary:

❖ left shift result, decrement result exponent, or
❖ right shift result, increment result exponent
❖ check for overflow or underflow during the shift

(4) Round the mantissa and renormalize if necessary

P30
Floating-Point Addition Example

❖ Now consider a 4-digit binary example

❖ 1.000_two × 2^−1 + (−1.110_two × 2^−2)   (0.5 + −0.4375)
❖ 1. Align binary points
❖ Shift the number with the smaller exponent
❖ 1.000_two × 2^−1 + (−0.111_two × 2^−1)
❖ 2. Add mantissas
❖ 1.000_two × 2^−1 − 0.111_two × 2^−1 = 0.001_two × 2^−1
❖ 3. Normalize result & check for over/underflow
❖ 1.000_two × 2^−4, with no over/underflow
❖ 4. Round and renormalize if necessary
❖ 1.000_two × 2^−4 (no change) = 0.0625

P31
Floating-Point Addition

P32
[Block diagram of the FP adder datapath (inputs: two sign/exponent/
significand triples):
Step 1 — a small ALU compares the exponents; the exponent difference
controls a shifter that right-shifts the smaller number's significand.
Step 2 — the big ALU adds the significands.
Step 3 — normalize: shift the result left or right and increment or
decrement the exponent accordingly.
Step 4 — rounding hardware rounds the result, feeding back for
renormalization if needed. Output: sign, exponent, significand.]

P33
FP Adder Hardware

❖ Much more complex than integer adder


❖ Doing it in one clock cycle would take too long
❖ Much longer than integer operations
❖ Slower clock would penalize all instructions
❖ FP adder usually takes several cycles
❖ Can be pipelined

P34
Decimal Multiplication

❖ A = 3.12 × 10^2, B = 1.5 × 10^−4, Perform A × B

   3.12 × 10^2
×  1.5 × 10^−4
   4.68 × 10^−2

❖ A = 3.12 × 10^2
❖ B = 1.5 × 10^−4
❖ A × B = (3.12 × 1.5) × 10^(2+(−4))

P35
Floating-Point Multiplication

Basic multiplication algorithm

(1) Add exponents of operands to get exponent of product;
the doubly biased exponent must be corrected:
Excess-8 example:  Xe = 7  →  1111 = 15 = 7 + 8
                   Ye = −3 →  0101 =  5 = −3 + 8
                   sum     → 10100 = 20 = 4 + 8 + 8
need an extra subtraction step of the bias amount
(2) Multiply the operand mantissas
(3) Normalize the product & check for overflow or underflow
during the shift
(4) Round the mantissa and renormalize if necessary
(5) Set the sign of the product

P36
Floating-Point Multiplication

[Flowchart of the multiplication algorithm above.]

P37
Floating-Point Multiplication Example

❖ Now consider a 4-digit binary example

❖ 1.000_two × 2^−1 × (−1.110_two × 2^−2)   (i.e., 0.5 × −0.4375)

1. Add exponents
❖ Unbiased: −1 + −2 = −3
❖ Biased: (−1 + 127) + (−2 + 127) = −3 + 254; subtract 127 → −3 + 127
2. Multiply operand mantissas
❖ 1.000_two × 1.110_two = 1.110_two, i.e., 1.110_two × 2^−3
3. Normalize result & check for over/underflow
❖ 1.110_two × 2^−3 (no change), with no over/underflow
4. Round and renormalize if necessary
❖ 1.110_two × 2^−3 (no change)
5. Determine sign:
❖ −1.110_two × 2^−3 = −0.21875

P38
FP Arithmetic Hardware

❖ FP multiplier is of similar complexity to FP adder


❖ But uses a multiplier for significands instead of an adder
❖ FP arithmetic hardware usually does
❖ Addition, subtraction, multiplication, division, reciprocal, square-root
❖ FP  integer conversion
❖ Operations usually take several cycles
❖ Can be pipelined

P39
P40
FP Instructions in RISC-V
❖ Separate FP registers: f0, …, f31
❖ each register holds a double-precision value
❖ single-precision values are stored in the lower 32 bits

❖ FP instructions operate only on FP registers


❖ Programs generally don’t do integer ops on FP data, or vice versa
❖ More registers with minimal code-size impact

❖ FP load and store instructions


❖ flw, fld
❖ fsw, fsd

P41
FP Instructions in RISC-V
❖ Single-precision arithmetic
❖ fadd.s, fsub.s, fmul.s, fdiv.s, fsqrt.s
➢ e.g., fadd.s f2, f4, f6

❖ Double-precision arithmetic
❖ fadd.d, fsub.d, fmul.d, fdiv.d, fsqrt.d
➢ e.g., fadd.d f2, f4, f6

❖ Single- and double-precision comparison


❖ feq.s, flt.s, fle.s
❖ feq.d, flt.d, fle.d
❖ Result is 0 or 1 in integer destination register
➢ Use beq, bne to branch on comparison result

❖ No branch on FP condition codes

❖ RISC-V has no FP condition codes; branching on an FP condition uses
the integer result of a comparison (unlike ISAs that provide a B.cond
on FP flags)

P42
FP Instructions in RISC-V

2024/10/8 Andy Yu-Guang Chen 43


P43
FP Example: °F to °C

❖ C code:
float f2c (float fahr) {
return ((5.0/9.0)*(fahr - 32.0));
}
❖ fahr in f10, result in f10, literals in global memory space

❖ Compiled RISC-V code:


f2c:
flw    f0,const5(x3)   // f0 = 5.0f
flw    f1,const9(x3)   // f1 = 9.0f
fdiv.s f0,f0,f1        // f0 = 5.0f / 9.0f
flw    f1,const32(x3)  // f1 = 32.0f
fsub.s f10,f10,f1      // f10 = fahr - 32.0f
fmul.s f10,f0,f10      // f10 = (5.0f/9.0f) * (fahr - 32.0f)
jalr   x0,0(x1)        // return

We assume the compiler places the three floating-point constants in
memory within easy reach of register x3

P44
Accurate Arithmetic

❖ IEEE Std 754 specifies additional rounding control


❖ Extra bits of precision (guard, round, sticky)
❖ Choice of rounding modes
❖ Allows programmer to fine-tune numerical behavior of a computation

❖ Not all FP units implement all options


❖ Most programming languages and FP libraries just use defaults

❖ Trade-off between hardware complexity, performance, and market


requirements

P45
Subword Parallelism

❖ Graphics and audio applications can take advantage of performing
simultaneous operations on short vectors
❖ Example: a 128-bit adder can perform:
➢ Sixteen 8-bit adds
➢ Eight 16-bit adds
➢ Four 32-bit adds

❖ Also called data-level parallelism, vector parallelism, or Single
Instruction, Multiple Data (SIMD)

P46
Final 64-bit RISC-V ALU

ALUop   Function
0000    and
0001    or
0010    add
0110    subtract
0111    set-on-less-than
1100    nor

P47
ALU Control and Function

[Diagram of one ALU bit-slice: control inputs Ainvert, Binvert, CarryIn,
and a 2-bit Operation select; inputs a and b pass through optional
inverters, then an AND gate, an OR gate, and a full adder feed a
multiplexer that drives Result; the slice also produces CarryOut and a
"less" input for slt.]

ALU Control (ALUop)   Function
0000                  and
0001                  or
0010                  add
0110                  subtract
0111                  set-on-less-than
1100                  nor

P48
Ripple Carry Adder

❖ Carries ripple from the lower bits to the higher bits

carries   00111111    (c8 … c1; Cin = 1)
          00101010
        + 00010101
        ----------
          01000000

❖ Ripple computation dominates the run time


❖ Higher-bit ALU must wait for carry from lower-bit ALU
❖ Run time complexity: O(n)

P49
Problems with Ripple Carry Adder

❖ The carry bit may have to propagate from LSB to MSB => worst case
delay: N stage delays

[Diagram: four 1-bit ALUs chained; stage i takes Ai, Bi and CarryIn_i,
produces Result_i, and feeds CarryOut_i to the next stage's CarryIn.
Design trick: look for parallelism and throw hardware at it.]

P50
Remove the Dependency

❖ Ripple carry adder

[Diagram: eight full adders for (a7,b7) … (a0,b0) chained from Cin to
Cout, producing s7 … s0; each stage's carry feeds the next.]

❖ Carry lookahead adder

❖ No carry-bit propagation from LSB to MSB

[Diagram: the same eight full adders, but every carry is produced
directly by a carry computation circuit from the inputs and Cin.]

P51
4-bit Carry-Lookahead Adder (CLA)

❖ The ripple carry adder takes a long time to determine the carry bits

❖ The carry-lookahead adder (CLA) is a type of adder that improves speed
by reducing the time required to determine the carry bits

[Diagram: four 1-bit full adders for (A3,B3) … (A0,B0) with carry inputs
C3 … C1 and C0, producing S3 … S0; each adder also outputs its propagate
Pi and generate Gi, which feed a 4-bit CLL (carry look-ahead logic)
block that computes C4 and the group signals PG, GG.]
P52
Carry-Lookahead Adder
Full adder:  S = A ⊕ B ⊕ Cin
             Cout = (A ⊕ B) · Cin + A · B

❖ Ci+1 = (Ai · Bi) + (Ai ⊕ Bi) · Ci
       = Gi + Pi · Ci
❖ Generate:  Gi = Ai · Bi
❖ Propagate: Pi = Ai ⊕ Bi

❖ C1 = G0 + P0 · C0
C2 = G1 + P1 · C1 = G1 + P1 · (G0 + P0 · C0) = G1 + P1 · G0 + P1 · P0 · C0
C3 = G2 + P2 · G1 + P2 · P1 · G0 + P2 · P1 · P0 · C0
C4 = G3 + P3 · G2 + P3 · P2 · G1 + P3 · P2 · P1 · G0 + P3 · P2 · P1 · P0 · C0

❖ Only need A, B and C0 to calculate the carry bit

P53
16-bit CLA

• As before, the p, g's are generated in parallel in 1 gate delay
• Without the input carry, the first-tier CLLs cannot generate the C's.
Instead they generate P, G's (group propagate and group generate) in 2 gate delays:
P => the group will propagate the input carry: P = p0·p1·p2·p3
G => the group will generate an output carry: G = g3 + p3·g2 + p3·p2·g1 + p3·p2·p1·g0
• The second-tier CLL takes the P, G's from the first-tier CLLs and C0 and
generates the "seed C's" for the first-tier CLLs in 2 gate delays (note that
the logic for generating the seed C's from the P, G's is exactly the same as
for generating the C's from the p, g's!)
• With the seed C's as input, the first-tier CLLs use Cin and the p, g's to
generate the C's in 2 gate delays
• With all C's in place, the S's are calculated in 3 gate delays due to the XOR gate
P54
Pi, Gi Generation in a 16-bit CLA
❖ Propagate (P) → 1 gate delay
❖ P0 = p3 · p2 · p1 · p0
❖ P1 = p7 · p6 · p5 · p4
❖ P2 = p11 · p10 · p9 · p8
❖ P3 = p15 · p14 · p13 · p12

❖ Generate (G) → 2 gate delays

❖ G0 = g3 + (p3 · g2) + (p3 · p2 · g1) + (p3 · p2 · p1 · g0)
❖ G1 = g7 + (p7 · g6) + (p7 · p6 · g5) + (p7 · p6 · p5 · g4)
❖ G2 = g11 + (p11 · g10) + (p11 · p10 · g9) + (p11 · p10 · p9 · g8)
❖ G3 = g15 + (p15 · g14) + (p15 · p14 · g13) + (p15 · p14 · p13 · g12)

❖ Carry (C) → 2 gate delays

❖ C1 = G0 + c0 · P0
❖ C2 = G1 + G0 · P1 + c0 · P0 · P1
❖ C3 = G2 + G1 · P2 + G0 · P1 · P2 + c0 · P0 · P1 · P2
❖ C4 = G3 + G2 · P3 + G1 · P2 · P3 + G0 · P1 · P2 · P3 + c0 · P0 · P1 · P2 · P3

Therefore, it takes 1 + 2 + 2 + 3 = 8 gate delays in total to finish the
whole thing!!
P55
16-bit Carry-Lookahead Adder

❖ A 16-bit carry-lookahead adder is composed of four 4-bit carry-lookahead adders

P56
Who Cares About FP Accuracy?

❖ Important for scientific code


❖ But for everyday consumer use?
➢"My bank balance is out by 0.0002¢!" 

❖ The Intel Pentium FDIV (floating-point division) bug, 1994

❖ Recall cost: US$475M
❖ The market expects accuracy
❖ See Colwell, The Pentium Chronicles

[Photo: a 66 MHz Intel Pentium]
P57
