Lecture5_Arithmetic for Computers – Part 2
Computer Organization
NYCU EE / IEE
Floating Point: Motivation
P2
Floating Point: Example
❖ Floating Point
❖ A = 31.48
➢ 3 → 3 × 10^1
➢ 1 → 1 × 10^0
➢ 4 → 4 × 10^-1
➢ 8 → 8 × 10^-2
❖ Scientific notation
❖ A = 3.148 × 10^1
➢ 3 → 3 × 10^0 × 10^1
➢ 1 → 1 × 10^-1 × 10^1
➢ 4 → 4 × 10^-2 × 10^1
➢ 8 → 8 × 10^-3 × 10^1
P3
Scientific Notation: Decimal
radix (base)
“decimal point”
❖ Normalized form: no leading 0s
(exactly one digit to left of decimal point)
P4
Scientific Notation: Binary
radix (base)
“binary point”
❖ Scientific notation
❖ Normalized: 1.001 × 2^-9
❖ Not normalized: 0.1001 × 2^-8, 10.01 × 2^-10
P5
Floating Point Standard
❖ Two representations
❖ Single precision (32-bit)
❖ Double precision (64-bit)
P6
FP Representation
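Bit position:  31      30 … 23     22 … 0
Field:         S       Exponent    Fraction
Width:         1 bit   8 bits      23 bits
S represents the sign
Exponent represents the y's and Fraction represents the x's in the normalized form ±1.xxx…x_two × 2^(yyy…y_two)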
P7
Double Precision Representation
[Figure: double-precision bit layout; the second 32-bit word holds Fraction (cont'd)]
P8
IEEE 754 Standard (1/4)
P9
IEEE 754 Standard (2/4)
❖ Exponent:
❖ Need to represent positive and negative exponents
❖ Also want to compare FP numbers as if they were integers, to help in
value comparisons
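❖ How about using 2's complement to represent the exponent?
Ex: 1.0 × 2^-1 versus 1.0 × 2^+1 (1/2 versus 2)
In 2's complement the exponent -1 is 1111 1111 and +1 is 0000 0001, so 1/2 would look larger than 2 in an integer comparison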
P10
IEEE 754 Standard (3/4)
❖ Instead, let notation 0000 0000 be most negative, and 1111 1111
most positive
❖ Called biased notation, where bias is the number subtracted to get
the real number
❖ IEEE 754 uses bias of 127 for single precision:
Subtract 127 from Exponent field to get actual value for exponent
❖ 1023 is bias for double precision
1/2: 0 0111 1110 000 0000 0000 0000 0000 0000 (exponent: 126 - 127 = -1)
2:   0 1000 0000 000 0000 0000 0000 0000 0000 (exponent: 128 - 127 = +1)
We can use integer comparison for floating point
comparison.
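The two encodings above can be checked with a short C sketch. It assumes that `float` is an IEEE 754 single-precision value on the target machine; the helper names `fp_bits`, `fp_unbiased_exponent`, and `fp_less_nonneg` are made up for this illustration. Integer ordering of the raw bits is only shown for non-negative values, since the sign bit needs separate handling.

```c
#include <stdint.h>
#include <string.h>

/* Reinterpret the 32 bits of a float as an unsigned integer
   (assumption: float is IEEE 754 single precision). */
uint32_t fp_bits(float x) {
    uint32_t u;
    memcpy(&u, &x, sizeof u);
    return u;
}

/* Exponent field (bits 30..23) minus the single-precision bias of 127. */
int fp_unbiased_exponent(float x) {
    return (int)((fp_bits(x) >> 23) & 0xFFu) - 127;
}

/* For two non-negative floats, comparing the bit patterns as integers
   gives the same order as comparing the values. */
int fp_less_nonneg(float a, float b) {
    return fp_bits(a) < fp_bits(b);
}
```

With these helpers, 0.5f has exponent field 126 (actual exponent -1) and 2.0f has exponent field 128 (actual exponent +1), so the bit pattern of 0.5f is the smaller integer.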
P11
Biased (Excess) Notation
❖ Biased 7 (4-bit field; columns: stored value, bit pattern, represented value = stored value - 7)
0 0000 -7
1 0001 -6
2 0010 -5
3 0011 -4
4 0100 -3
5 0101 -2
6 0110 -1
7 0111 0
8 1000 1
9 1001 2
10 1010 3
11 1011 4
12 1100 5
13 1101 6
14 1110 7
15 1111 8
P12
IEEE 754 Standard (4/4)
Bit position:  31      30 … 23     22 … 0
Field:         S       Exponent    Fraction
Width:         1 bit   8 bits      23 bits
P13
Example 1: FP to Decimal
P14
Example 2: Decimal to FP
❖ Number = - 0.75
= -0.11_two × 2^0 (scientific notation)
= -1.1_two × 2^-1 (normalized scientific notation)
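As a sketch of the remaining packing step, the C function below assembles the three single-precision fields from a sign, an actual exponent, and a 23-bit fraction. The name `fp_encode` is hypothetical, and the code assumes that `float` is IEEE 754 single precision and that the exponent is in the normal range -126 to 127.

```c
#include <stdint.h>
#include <string.h>

/* Pack sign, (actual exponent + 127), and the 23 fraction bits into a float.
   Assumptions: float is IEEE 754 single precision; -126 <= exponent <= 127. */
float fp_encode(unsigned sign, int exponent, uint32_t fraction) {
    uint32_t bits = ((uint32_t)(sign & 1u) << 31)               /* S        */
                  | (((uint32_t)(exponent + 127) & 0xFFu) << 23) /* Exponent */
                  | (fraction & 0x7FFFFFu);                      /* Fraction */
    float x;
    memcpy(&x, &bits, sizeof x);
    return x;
}
```

For -1.1_two × 2^-1 the sign is 1, the exponent field is -1 + 127 = 126, and the fraction is 100…0 (0x400000), so `fp_encode(1, -1, 0x400000)` yields the bit pattern 0xBF400000, which is -0.75.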
P15
Example 3: Decimal to FP
P16
Double-Precision Range
P17
Floating-Point Precision
❖ Relative precision
❖ all fraction bits are significant
❖ Single: approx 2^-23
➢ Equivalent to 23 × log10(2) ≈ 23 × 0.3 ≈ 6 decimal digits of precision
❖ Double: approx 2^-52
➢ Equivalent to 52 × log10(2) ≈ 52 × 0.3 ≈ 16 decimal digits of precision
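The relative precision can be measured directly. The sketch below halves a candidate value until adding it to 1.0 no longer changes the result. It assumes IEEE 754 `float` and `double` with the default round-to-nearest mode; the function names are invented for this example, and `volatile` is used only to force each intermediate result to be rounded to the stated type.

```c
#include <float.h>

/* Smallest power of two eps such that 1.0f + eps > 1.0f (expected: 2^-23). */
float single_epsilon(void) {
    volatile float eps = 1.0f, sum;
    do {
        eps = eps / 2.0f;
        sum = 1.0f + eps;
    } while (sum > 1.0f);
    return eps * 2.0f;
}

/* Same measurement for double precision (expected: 2^-52). */
double double_epsilon(void) {
    volatile double eps = 1.0, sum;
    do {
        eps = eps / 2.0;
        sum = 1.0 + eps;
    } while (sum > 1.0);
    return eps * 2.0;
}

/* Compare the measured values with the constants from <float.h>. */
int epsilons_match_float_h(void) {
    return single_epsilon() == FLT_EPSILON && double_epsilon() == DBL_EPSILON;
}
```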
P18
Zero and Special Numbers
P19
Representation for 0
❖ Represent 0?
❖ Exponent: all zeroes
❖ Fraction: all zeroes, too
❖ What about sign?
❖ +0: 0 00000000 00000000000000000000000
❖ -0: 1 00000000 00000000000000000000000
❖ Why two zeroes?
❖ Helps in some limit comparisons
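A small C sketch of the two zeroes, assuming `float` is IEEE 754 single precision (the helper names are hypothetical): the two bit patterns differ only in the sign bit, yet the values compare equal.

```c
#include <stdint.h>
#include <string.h>

/* Raw 32-bit pattern of a float (assumption: IEEE 754 single precision). */
uint32_t zero_bits(float x) {
    uint32_t u;
    memcpy(&u, &x, sizeof u);
    return u;
}

/* True only for -0: sign bit set, exponent and fraction all zeroes. */
int is_negative_zero(float x) {
    return zero_bits(x) == 0x80000000u;
}

/* +0 and -0 are different encodings of the same numeric value. */
int zeroes_compare_equal(void) {
    float pos = 0.0f, neg = -0.0f;
    return pos == neg && zero_bits(pos) != zero_bits(neg);
}
```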
P20
Special Numbers
P21
Range of Single-Precision Floating-Point Numbers
[Figure: number line from –∞ to +∞]
–∞ | Overflow | –1.11…11 × 2^127 … –1.0 × 2^-126 | Underflow | 0 | Underflow | +1.0 × 2^-126 … +1.11…11 × 2^127 | Overflow | +∞
P22
Gradual Underflow
P23
Special Numbers
P24
Representation for +/- Infinity
P25
Special Numbers (cont’d)
P26
Representation for Not a Number
P27
Special Numbers (cont’d)
P28
Decimal Addition
  3.71345    × 10^2
+ 0.00000132 × 10^2
= 3.71345132 × 10^2
B's significand is shifted right by 2 – (–4) = 6 digits
❖ A = 3.71345 × 10^2
❖ B = 1.32 × 10^-4 = 0.00000132 × 10^2
❖ A + B = (3.71345 + 0.00000132) × 10^2
P29
Floating-Point Addition
Basic addition algorithm:
(1) Align binary point: compute Ye – Xe
❖ right shift the smaller number, say Xm, that many positions to form Xm × 2^(Xe–Ye)
P30
Floating-Point Addition Example
P31
Floating-Point Addition
P32
[Figure: FP adder datapath; each operand is split into Sign, Exponent, Significand]
Step 1: Compare exponents (small ALU computes the exponent difference) and shift the smaller number right
Step 2: Add (big ALU)
Step 3: Normalize (shift left or right, increment or decrement the exponent)
Step 4: Round (rounding hardware)
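Steps 1 to 4 can be sketched in C on a simplified format. Everything here is an assumption made for illustration: the `toy_fp` struct, the restriction to positive normalized operands, and truncation in place of real rounding. The significand is stored as a 24-bit integer whose top bit is the leading 1, so the value is sig × 2^(exp – 23).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical toy format: value = sig * 2^(exp - 23), with 2^23 <= sig < 2^24. */
typedef struct {
    int exp;       /* actual (unbiased) exponent */
    uint32_t sig;  /* 1.xxx... significand scaled by 2^23 */
} toy_fp;

/* Add two positive, normalized toy_fp values. */
toy_fp toy_fp_add(toy_fp x, toy_fp y) {
    assert((x.sig >> 23) == 1 && (y.sig >> 23) == 1);

    /* Step 1: compare exponents and shift the smaller number right. */
    if (x.exp < y.exp) { toy_fp t = x; x = y; y = t; }
    int diff = x.exp - y.exp;
    uint32_t aligned = (diff < 32) ? (y.sig >> diff) : 0;

    /* Step 2: add the aligned significands. */
    toy_fp r = { x.exp, x.sig + aligned };

    /* Step 3: normalize; a carry out of the top bit needs one right shift. */
    if (r.sig >= (1u << 24)) {
        r.sig >>= 1;
        r.exp += 1;
    }

    /* Step 4: rounding is omitted; shifted-out bits are simply truncated. */
    return r;
}
```

For example, 1.000_two × 2^-1 plus 1.110_two × 2^-2 gives 1.111_two × 2^-1 (0.5 + 0.4375 = 0.9375), which in this format is exp = -1 and sig = 0xF00000.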
P33
FP Adder Hardware
P34
Decimal Multiplication
  3.12 × 10^2
× 1.5  × 10^-4
= 4.68 × 10^-2
❖ A = 3.12 × 10^2
❖ B = 1.5 × 10^-4
❖ A × B = (3.12 × 1.5) × 10^(2+(-4))
P35
Floating-Point Multiplication
P36
Floating-Point Multiplication
P37
Floating-Point Multiplication Example
1. Add exponents
❖ Unbiased: –1 + –2 = –3
❖ Biased: (–1 + 127) + (–2 + 127) – 127 = –3 + 254 – 127 = –3 + 127
2. Multiply operand significands
❖ 1.000_two × 1.110_two = 1.110_two ⇒ 1.110_two × 2^-3
3. Normalize result & check for over/underflow
❖ 1.110_two × 2^-3 (no change) with no over/underflow
4. Round and renormalize if necessary
❖ 1.110_two × 2^-3 (no change)
5. Determine sign: the operands have opposite signs, so the result is negative
❖ –1.110_two × 2^-3 = –0.21875
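The five steps above can be sketched in C on a simplified format. The `toy_mul_fp` struct and function name are hypothetical, the operands are assumed to be normalized, rounding is replaced by truncation, and the over/underflow check is left out.

```c
#include <stdint.h>

/* Hypothetical toy format: sign bit, biased exponent (bias 127), and a
   24-bit significand with the leading 1 explicit (2^23 <= sig < 2^24). */
typedef struct {
    unsigned sign;
    int biased_exp;
    uint32_t sig;
} toy_mul_fp;

toy_mul_fp toy_fp_mul(toy_mul_fp x, toy_mul_fp y) {
    toy_mul_fp r;

    /* 1. Add the biased exponents and subtract the bias once. */
    r.biased_exp = x.biased_exp + y.biased_exp - 127;

    /* 2. Multiply the significands; the product carries 46 fraction bits. */
    uint64_t product = (uint64_t)x.sig * (uint64_t)y.sig;
    r.sig = (uint32_t)(product >> 23);  /* back to 23 fraction bits */

    /* 3. Normalize: a product in [2, 4) needs one right shift. */
    if (r.sig >= (1u << 24)) {
        r.sig >>= 1;
        r.biased_exp += 1;
    }

    /* 4. Rounding omitted in this sketch (low bits are truncated). */

    /* 5. The sign is negative exactly when the operand signs differ. */
    r.sign = x.sign ^ y.sign;
    return r;
}
```

With x = +1.000_two × 2^-1 (exponent field 126) and y = –1.110_two × 2^-2 (exponent field 125), the result has exponent field 124 (that is, –3 + 127), significand 1.110_two (0xE00000), and sign 1.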
P38
FP Arithmetic Hardware
P39
P40
FP Instructions in RISC-V
❖ Separate FP registers: f0, …, f31
❖ each register is 64 bits wide and holds a double-precision value
❖ single-precision values stored in the lower 32 bits
P41
FP Instructions in RISC-V
❖ Single-precision arithmetic
❖ fadd.s, fsub.s, fmul.s, fdiv.s, fsqrt.s
➢ e.g., fadd.s f2, f4, f6
❖ Double-precision arithmetic
❖ fadd.d, fsub.d, fmul.d, fdiv.d, fsqrt.d
➢ e.g., fadd.d f2, f4, f6
P42
FP Instructions in RISC-V
❖ C code:
float f2c (float fahr) {
return ((5.0/9.0)*(fahr - 32.0));
}
❖ fahr in f10, result in f10, literals in global memory space
P44
Accurate Arithmetic
P45
Subword Parallelism
P46
Final 64-bit RISC-V ALU
ALUop Function
0000 and
0001 or
0010 add
0110 subtract
0111 set-on-less-than
1100 nor
P47
ALU Control and Function
[Figure: 1-bit ALU. Inputs: a, b, Ainvert, Binvert, CarryIn, slt; control: Operation (2 bits); outputs: Result, CarryOut]
ALU Control (ALUop) Function
0000 and
0001 or
0010 add
0110 subtract
0111 set-on-less-than
1100 nor
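The ALUop table can be modeled in C. This sketch assumes the usual reading of the 4-bit code as Ainvert, Binvert, and a 2-bit Operation, with the carry-in tied to Binvert for subtraction; the function name `alu64` is hypothetical. Set-on-less-than is written as a direct signed comparison rather than through the adder's sign bit, so overflow does not affect it.

```c
#include <stdint.h>

/* Behavioral model of the 64-bit ALU.
   Assumed decoding: aluop = {Ainvert, Binvert, Operation[1:0]}. */
uint64_t alu64(unsigned aluop, uint64_t a, uint64_t b) {
    unsigned ainvert = (aluop >> 3) & 1u;
    unsigned binvert = (aluop >> 2) & 1u;
    unsigned operation = aluop & 3u;

    uint64_t x = ainvert ? ~a : a;
    uint64_t y = binvert ? ~b : b;

    switch (operation) {
    case 0:  return x & y;            /* 0000 and; 1100 nor = ~a & ~b      */
    case 1:  return x | y;            /* 0001 or                           */
    case 2:  return x + y + binvert;  /* 0010 add; 0110 sub = a + ~b + 1   */
    default: return (int64_t)a < (int64_t)b;  /* 0111 set-on-less-than     */
    }
}
```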
P48
Ripple Carry Adder
Carries: 00111111, Cin = 1
  00101010
+ 00010101
= 01000000
P49
Problems with Ripple Carry Adder
❖ Carry bit may have to propagate from LSB to MSB => worst-case delay: N stage delays
[Figure: four 1-bit ALUs in a chain; stage i takes Ai, Bi, and CarryIni, produces Resulti and CarryOuti, and CarryOuti feeds CarryIn(i+1)]
❖ Design Trick: look for parallelism and throw hardware at it
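The serial dependency is easy to see in a bit-level C model of a ripple carry adder. This is a sketch with hypothetical names (`full_adder`, `ripple_add8`): each loop iteration needs the carry produced by the previous one, which is the N-stage delay in hardware.

```c
#include <stdint.h>

/* One full adder: returns the sum bit and writes the carry-out. */
unsigned full_adder(unsigned a, unsigned b, unsigned cin, unsigned *cout) {
    *cout = (a & b) | ((a ^ b) & cin);
    return a ^ b ^ cin;
}

/* 8-bit ripple carry adder: stage i cannot finish until stage i-1
   has produced its carry. */
uint8_t ripple_add8(uint8_t a, uint8_t b, unsigned cin, unsigned *cout) {
    uint8_t sum = 0;
    unsigned carry = cin & 1u;
    for (int i = 0; i < 8; i++) {
        unsigned s = full_adder((a >> i) & 1u, (b >> i) & 1u, carry, &carry);
        sum |= (uint8_t)(s << i);
    }
    *cout = carry;
    return sum;
}
```

Adding 00101010 and 00010101 with Cin = 1 makes the carry ripple through the six low-order stages and gives 01000000 with a carry-out of 0.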
P50
Remove the Dependency
[Figure: 8-bit adder built from eight 1-bit adders (+); inputs a7 b7 … a0 b0 and Cin, outputs s7 … s0 and Cout]
P51
4-bit Carry-Lookahead Adder (CLA)
❖ A ripple carry adder takes a long time to determine the carry bits
[Figure: 4-bit CLA block with inputs A3 B3 … A0 B0 and outputs PG, GG]
P52
Carry-Lookahead Adder
Full adder:
S = A ⊕ B ⊕ Cin
Cout = (A ⊕ B) · Cin + A · B
❖ With Gi = Ai · Bi (generate) and Pi = Ai ⊕ Bi (propagate): Ci+1 = Gi + Pi · Ci
❖ C1 = G0 + P0 · C0
C2 = G1 + P1 · C1 = G1 + P1 · (G0 + P0 · C0) = G1 + P1 · G0 + P1 · P0 · C0
C3 = G2 + P2 · G1 + P2 · P1 · G0 + P2 · P1 · P0 · C0
C4 = G3 + P3 · G2 + P3 · P2 · G1 + P3 · P2 · P1 · G0 + P3 · P2 · P1 · P0 · C0
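The expanded carry equations translate directly into C. In this sketch (the function name `cla4` is hypothetical) generate and propagate are taken as Gi = Ai · Bi and Pi = Ai ⊕ Bi, following the full-adder carry equation, and every carry is computed from G, P, and C0 alone rather than from the previous carry.

```c
#include <stdint.h>

/* 4-bit carry-lookahead adder: returns the 4-bit sum, writes C4 to *c4. */
unsigned cla4(unsigned a, unsigned b, unsigned c0, unsigned *c4) {
    unsigned g[4], p[4];
    for (int i = 0; i < 4; i++) {
        unsigned ai = (a >> i) & 1u, bi = (b >> i) & 1u;
        g[i] = ai & bi;  /* generate  */
        p[i] = ai ^ bi;  /* propagate */
    }
    c0 &= 1u;

    /* Each carry depends only on g, p, and c0 (no rippling). */
    unsigned c1 = g[0] | (p[0] & c0);
    unsigned c2 = g[1] | (p[1] & g[0]) | (p[1] & p[0] & c0);
    unsigned c3 = g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0])
                | (p[2] & p[1] & p[0] & c0);
    *c4 = g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1])
        | (p[3] & p[2] & p[1] & g[0])
        | (p[3] & p[2] & p[1] & p[0] & c0);

    /* Sum bits: Si = Pi xor Ci. */
    return (p[0] ^ c0) | ((p[1] ^ c1) << 1) | ((p[2] ^ c2) << 2)
         | ((p[3] ^ c3) << 3);
}
```

Checking the function against ordinary addition for all 512 input combinations is a quick way to confirm that the expanded equations are consistent.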
P53
16-bit CLA
[Figure: 16-bit CLA; signal labels g, p, C, P, G]
P56
Who Cares About FP Accuracy?
P57