Floating Point & fixed point Representation_BCA II
Floating Point & fixed point Representation_BCA II
Representation
Basic Arithmetic and the ALU
• Now
– Floating point & Fixed point representation
– Floating point addition, multiplication
2/22/2024 2
Floating Point
• Want to represent larger range of numbers
– Fixed point (integer): -2n-1 … (2n-1 –1)
• Sacrifice precision for range by providing
exponent to shift relative weight of each bit
position
• Similar to scientific notation: 3.14159 x 1023
• Cannot specify every discrete value in the
range, but can span much larger range
2/22/2024 3
Fractional binary numbers
• What is 1011.101?
2/22/2024 4
Fractional Binary Numbers
2i
2i–1
4
••• 2
. 1
bi bi–1 • • • b2 b1 b0 b–1 b–2 b–3 • • • b–j
1/2
1/4
•••
1/8
2–j
• Representation
–Bits to right of “binary point” represent fractional powers of 2
– Represents rational number:
2/22/2024
5
Fractional Binary Numbers: Examples
• Value Representation
5 and ¾ 101.112
• Observations
– Divide by 2 by shifting right
– Multiply by 2 by shifting left
– Numbers of form 0.111111…2 are just below 1.0
• 1/2 + 1/4 + 1/8 + … + 1/2i + … → 1.0
• Use notation 1.0 – ε
2/22/2024 6
Representable Numbers
• Limitation
– Can only exactly represent numbers of the form x/2k
– Other rational numbers have repeating bit representations
• Value Representation
1/3 0.0101010101[01]…2
1/5 0.001100110011[0011]…2
1/10 0.0001100110011[0011]…2
2/22/2024 7
Floating Point
• Still use a fixed number of bits
– Sign bit S, exponent E, significand F
– Value: (-1)S x F x 2E
• IEEE 754 standard S E F
2/22/2024 8
Floating Point Exponent
• Exponent specified in biased or excess
notation
• Why?
– To simplify sorting
– Sign bit is MSB to ease sorting
– 2’s complement exponent:
• Large numbers have positive exponent
• Small numbers have negative exponent
– Sorting does not follow naturally
2/22/2024 9
Floating Point Normalization
• S, E, F representation allows more than one representation for a
particular value, e.g.
1.0 x 105 = 0.1 x 106 = 10.0 x 104
– This makes comparison operations difficult
– Prefer to have a single representation
• Hence, normalize by convention:
– Only one digit to the left of the floating point
– In binary, that digit must be a 1
• Since leading ‘1’ is implicit, no need to store it
• Hence, obtain one extra bit of precision for free
2/22/2024 10
FP Overflow/Underflow
• FP Overflow
– Analogous to integer overflow
– Result is too big to represent
– Means exponent is too big
• FP Underflow
– Result is too small to represent
– Means exponent is too small (too negative)
• Both can raise an exception under IEEE754.
2/22/2024 11
IEEE754 Special Cases
Single Precision Double Precision Value
Exponent Significand Exponent Significand
0 0 0 0 0
0 nonzero 0 nonzero denormalized
1-254 anything 1-2046 anything fp number
255 0 2047 0 infinity
NaN (Not a
255 nonzero 2047 nonzero
Number)
2/22/2024 12
FP Rounding
• Rounding is important
– Small errors accumulate over billions of ops
• FP rounding hardware helps
– Compute extra guard bit beyond 23/52 bits
– Further, compute additional round bit beyond that
• Multiply may result in leading 0 bit, normalize shifts
guard bit into product, leaving round bit for rounding
– Finally, keep sticky bit that is set whenever ‘1’ bits are
“lost” to the right
• Differentiates between 0.5 and 0.500000000001
2/22/2024 13
Fixed Point Representation
float → 32 bits; double → 64 bits
We might try representing fractional binary numbers by picking a
fixed place for an implied binary point
“fixed point binary numbers”
Let's do that, using 8 bit floating point numbers as an example
#1: the binary point is between bits 2 and 3
b7 b6 b5b4 b3 [.] b2 b1 b0
#2: the binary point is between bits 4 and 5
b7 b6 b5 [.] b4 b3 b2 b1 b0
The position of the binary point affects the range and precision
range: difference between the largest and smallest
representable numbers
precision: smallest possible difference between any two
2/22/2024 numbers 14
Fixed Point Pros and Cons
Pros
It's simple. The same hardware that does integer arithmetic can do
fixed point arithmetic
In fact, the programmer can use ints with an implicit fixed point
E.g., int balance; // number of pennies in the account
ints are just fixed point numbers with the binary point to the right
of b0
Cons
There is no good way to pick where the fixed point should be
Sometimes you need range, sometimes you need precision. The
more you have of one, the less of the other.
2/22/2024 15
IEEE Floating Point
• Fixing fixed point: analogous to scientific notation
– Not 12000000 but 1.2 x 10^7; not 0.0000012 but 1.2 x 10^-6
• IEEE Standard 754
– Established in 1985 as uniform standard for floating point
arithmetic
• Before that, many idiosyncratic formats
– Supported by all major CPUs
• Driven by numerical concerns
– Nice standards for rounding, overflow, underflow
– Hard to make fast in hardware
• Numerical analysts predominated over hardware designers in
defining standard
2/22/2024 16
Floating Point Representation
• Numerical Form:
(–1)s M 2E
• Sign bit s determines whether number is negative or positive
• Significand (mantissa) M , normally a fractional value in range
[1.0,2.0).
• Exponent E weights value by power of two
• Encoding
• MSB s is sign bit s
• frac field encodes M (but is not equal to M)
• exp field encodes E (but is not equal to E)
s exp frac
2/22/2024 17
Precisions
• Single precision: 32 bits
s exp frac
1 8 23
2/22/2024 19
Floating Point Addition
• Just like grade school
– First, align decimal points
– Then, add significands
– Finally, normalize result
• Example
9.997 x 102 9.997000 x 102
2/22/2024 20
Sign Exponent Significand Sign Exponent Significand
FP
Adder Small ALU
Compare
exponents
Exponent
difference
0 1 0 1 0 1
Shift smaller
Control Shift right
number right
Add
Big ALU
0 1 0 1
Increment or
decrement Shift left or right Normalize
2/22/2024 22
FP Multiplication
• Compute sign, exponent, significand
• Normalize
– Shift left, right by 1
• Check for overflow, underflow
• Round
• Normalize again (if necessary)
2/22/2024 23
Summary
• Floating point representation
– Normalization
– Overflow, underflow
– Rounding
• Floating point add
• Floating point multiply
2/22/2024 24