EE 109 Unit 20: IEEE 754 Floating Point Representation Floating Point Arithmetic
EE 109 Unit 20
Floating Point
• Used to represent very small numbers
(fractions) and very large numbers
– Avogadro’s Number: +6.0247 * 10^23
– Planck’s Constant: +6.6254 * 10^-27
– Note: 32 or 64-bit integers can’t represent this
range
• Floating Point representation is used in high-level languages (HLLs)
like C by declaring variables as float or
double
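The range claim above can be checked with a minimal sketch. Python is used here only for illustration (the slides refer to C's float and double); the names MAX_UINT64 and in_uint64_range are hypothetical, and a Python float is assumed to be an IEEE 754 double, as it is on common platforms.

```python
# Largest values that fixed-width unsigned integers can hold.
MAX_UINT32 = 2**32 - 1
MAX_UINT64 = 2**64 - 1

# Constants from the slide, stored as floating point values.
AVOGADRO = 6.0247e23
PLANCK = 6.6254e-27


def in_uint64_range(x):
    """Return True if x lies between 1 and the 64-bit unsigned maximum."""
    return 1 <= x <= MAX_UINT64


# Avogadro's number is too large and Planck's constant is too small
# (it would truncate to 0), yet both fit in a floating point variable.
print(in_uint64_range(AVOGADRO), in_uint64_range(PLANCK))
```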
Fixed Point
• Unsigned and 2’s complement fall under a category of
representations called “Fixed Point”
• The radix point is assumed to be in a fixed location for all numbers
[Note: we could represent fractions by implicitly assuming the
binary point is at the left…A variable just stores bits…you can
assume the binary point is anywhere you like]
– Integers: 10011101. (binary point to right of LSB)
• For 32-bits, unsigned range is 0 to ~4 billion
• Main point: By fixing the radix point, we limit the range of numbers
that can be represented
– Floating point allows the radix point to be in a different location for each
value
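The idea that a variable just stores bits, and the binary point is wherever we assume it to be, can be sketched as follows. This is an illustrative Python sketch; the function name fixed_point_value is hypothetical.

```python
def fixed_point_value(bits, frac_bits):
    """Interpret an unsigned bit pattern that has frac_bits bits
    to the right of the assumed binary point."""
    return bits / (1 << frac_bits)


pattern = 0b10011101

as_integer = fixed_point_value(pattern, 0)   # 10011101.  (point right of LSB)
as_mixed = fixed_point_value(pattern, 4)     # 1001.1101
as_fraction = fixed_point_value(pattern, 8)  # .10011101  (point at the left)

print(as_integer, as_mixed, as_fraction)
```

The stored bits never change; only the assumed position of the binary point does, which is why a fixed position also fixes the range.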
[Figure: floating point format with three fields: S (overall sign of the number), Exp., and fraction]
Normalized FP Numbers
• Decimal Example
– +0.754*10^15 is not correct scientific notation
– Must have exactly one significant digit before decimal point:
+7.54*10^14
• In binary the only significant digit is ‘1’
• Thus normalized FP format is:
±1.bbbbbb * 2^(±exp)
• FP numbers will always be normalized before being
stored in memory or a reg.
– The 1. is actually not stored but assumed since we always will store
normalized numbers
– If HW calculates a result of 0.001101*2^5 it must normalize to
1.101000*2^2 before storing
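The normalization step can be sketched in Python as below. This is only an illustration of the idea, not how hardware does it (hardware shifts bits rather than multiplying), and the function name normalize is hypothetical. It handles positive significands only.

```python
def normalize(significand, exponent):
    """Rescale a positive significand into the range [1, 2),
    adjusting the power-of-2 exponent so the value is unchanged."""
    if significand <= 0:
        raise ValueError("this sketch handles positive significands only")
    while significand >= 2:      # too large: shift right, raise exponent
        significand /= 2
        exponent += 1
    while significand < 1:       # too small: shift left, lower exponent
        significand *= 2
        exponent -= 1
    return significand, exponent


# 0.001101 (binary) = 13/64, so the slide's value is (13/64) * 2^5
sig, exp = normalize(13 / 64, 5)
print(sig, exp)   # 1.625 is 1.101 in binary
```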
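[Figure: field widths in bits (sign, exponent, fraction): single precision = 1, 8, 23; double precision = 1, 11, 52]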
Exponent Representation
• Exponent needs its own sign (+/-)
• Rather than using 2’s comp. system we use
Excess-N representation
– Single-Precision uses Excess-127
– Double-Precision uses Excess-1023
– This representation allows FP numbers to be
easily compared

2’s comp.    E' (stored Exp.)    Excess-127
-1           1111 1111           +128
-2           1111 1110           +127
-128         1000 0000           +1
+127         0111 1111           0
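The two interpretations of the same stored exponent bits can be sketched as follows. This is an illustrative Python sketch with hypothetical helper names; the default bias of 127 is the single-precision value stated above.

```python
def to_excess(exponent, bias=127):
    """Stored exponent E' for a true exponent E in Excess-N."""
    return exponent + bias


def from_excess(stored, bias=127):
    """True exponent E for a stored exponent E' in Excess-N."""
    return stored - bias


def from_twos_complement(stored, width=8):
    """Value of the same bit pattern if read as 2's complement."""
    if stored & (1 << (width - 1)):
        return stored - (1 << width)
    return stored


for pattern in (0b11111111, 0b11111110, 0b10000000, 0b01111111):
    print(f"{pattern:08b}", from_twos_complement(pattern), from_excess(pattern))
```

In Excess-N, a larger true exponent always has a larger unsigned bit pattern, so exponents can be compared as plain unsigned numbers. In 2's complement the pattern for -1 (1111 1111) is larger, as an unsigned number, than the pattern for +1.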
Exponent Representation
• FP formats reserve some exponent values for special values
[Table: E’ (stored exponent, range of 8 bits shown) vs. E (= E’ - 127) and special values]
Single-Precision Examples
Example 1:
-1.1100110 * 2^3
= -1110.011 * 2^0
= -14.375
Example 2:
+0.6875 = +0.1011
= +1.011 * 2^-1
-1 + 127 = 126
0 0111 1110 011 0000 0000 0000 0000 0000
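These two examples can be checked against a real single-precision encoding. The sketch below uses Python's standard struct module to reinterpret the 32 bits of a float; the helper names are hypothetical.

```python
import struct


def float_to_fields(x):
    """Split a single-precision value into (sign, stored exponent, fraction)."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31
    stored_exp = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    return sign, stored_exp, fraction


def fields_to_float(sign, stored_exp, fraction):
    """Rebuild a single-precision value from its three fields."""
    bits = (sign << 31) | (stored_exp << 23) | fraction
    (x,) = struct.unpack(">f", struct.pack(">I", bits))
    return x


sign, stored_exp, fraction = float_to_fields(0.6875)
print(sign, f"{stored_exp:08b}", f"{fraction:023b}")

# Example 1 in reverse: sign 1, exponent 3 + 127 = 130, fraction 1100110 then zeros
print(fields_to_float(1, 130, 0b1100110 << 16))
```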
Examples
Example 1: 1 10100 101101
20-15=5
-1.101101 * 2^5
= -110110.1 * 2^0
= -110110.1 = -54.5
Example 2: +21.75 = +10101.11
= +1.010111 * 2^4
4+15=19
0 10011 010111
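These examples appear to use a small format with a 1-bit sign, a 5-bit Excess-15 exponent, and a 6-bit fraction; that layout is inferred from the bit patterns shown and is an assumption of the sketch below. The function names are hypothetical, and zero and the reserved exponent values are not handled.

```python
import math

EXP_BITS, FRAC_BITS, BIAS = 5, 6, 15   # assumed layout: 1 | 5 | 6 bits


def decode(bits):
    """Convert a 12-bit pattern to its value (normalized numbers only)."""
    sign = (bits >> (EXP_BITS + FRAC_BITS)) & 1
    stored_exp = (bits >> FRAC_BITS) & ((1 << EXP_BITS) - 1)
    fraction = bits & ((1 << FRAC_BITS) - 1)
    value = (1 + fraction / (1 << FRAC_BITS)) * 2.0 ** (stored_exp - BIAS)
    return -value if sign else value


def encode(x):
    """Convert a nonzero value to a 12-bit pattern, chopping extra bits."""
    sign = 1 if x < 0 else 0
    m, e = math.frexp(abs(x))              # abs(x) = m * 2**e, 0.5 <= m < 1
    significand, exponent = m * 2, e - 1   # rescale so 1 <= significand < 2
    fraction = int((significand - 1) * (1 << FRAC_BITS))
    return (sign << (EXP_BITS + FRAC_BITS)) | ((exponent + BIAS) << FRAC_BITS) | fraction


print(decode(0b1_10100_101101))
print(f"{encode(21.75):012b}")
```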
Rounding Methods
• +213.125 = 1.1010101001*2^7 => Can’t keep all fraction bits
• 4 Methods of Rounding (you are only responsible for the first 2)
– Round to Nearest: Normal rounding you learned in grade school. Round to the nearest representable number. If exactly halfway between, round to representable value w/ 0 in LSB.
– Round towards 0 (Chopping): Round to the representable value closest to but not greater in magnitude than the precise value. Equivalent to just dropping the extra bits.
– Round toward +∞ (Round Up): Round to the closest representable value greater than the number
– Round toward -∞ (Round Down): Round to the closest representable value less than the number
[Figure: number lines from -∞ to +∞ showing where example values such as -3.75 and +5.8 move under Round to Nearest, Round to Zero, Round to +Infinity, and Round to -Infinity]
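The four methods can be compared with a minimal Python sketch using the example values -3.75 and +5.8. For illustration only, it assumes the representable values are the integers; the function name round_value and the mode strings are hypothetical.

```python
import math


def round_value(x, mode):
    """Round x to an integer using one of the four rounding methods."""
    if mode == "nearest":
        return round(x)        # Python's round() sends exact ties to the even integer
    if mode == "zero":
        return math.trunc(x)   # chopping: drop the fractional part
    if mode == "up":
        return math.ceil(x)    # toward +infinity
    if mode == "down":
        return math.floor(x)   # toward -infinity
    raise ValueError(f"unknown mode: {mode}")


for mode in ("nearest", "zero", "up", "down"):
    print(mode, round_value(-3.75, mode), round_value(5.8, mode))
```

Note that rounding toward 0 and rounding toward -∞ agree for positive numbers but differ for negative ones.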
Rounding Implementation
• There may be a large number of bits after the fraction
• To implement any of the methods we can keep only a
subset of the extra bits after the fraction [hardware is
finite]
– Guard bits: bits immediately after LSB of fraction (in this class
we will usually keep only 1 guard bit)
– Round bit: bit to the right of the guard bits
– Sticky bit: Logical OR of all other bits after G & R bits
1.01001010010 x 2^4
1.010010 101 x 2^4 (GRS = 101)
Sticky bit = Logical OR of the remaining bits (output is ‘1’ if any input is ‘1’, ‘0’ otherwise)
We can perform rounding to a 6-bit
fraction using just these 3 bits.
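Extracting the G, R, and S bits can be sketched as below, assuming 1 guard bit as in this class. The fraction is passed as a string of bits for readability; the function name split_grs is hypothetical.

```python
def split_grs(fraction_bits, keep):
    """Split the bits after the binary point into the kept fraction
    and the guard, round, and sticky bits.

    fraction_bits: string of '0'/'1' characters after the binary point
    keep: number of fraction bits the format can store
    """
    kept = fraction_bits[:keep]
    extra = fraction_bits[keep:].ljust(2, "0")   # pad if fewer than 2 extra bits
    guard = int(extra[0])
    round_bit = int(extra[1])
    sticky = int("1" in extra[2:])               # logical OR of all remaining bits
    return kept, guard, round_bit, sticky


# 1.01001010010 x 2^4 with a 6-bit fraction
print(split_grs("01001010010", 6))
```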
Round to Nearest
1.11111011010 x 2^4
1.111110 111 x 2^4 (GRS = 111)
Round Up
Round to Nearest
• In all these cases, the numbers are halfway between the 2 possible round
values
• Thus, we round to the value w/ 0 in the LSB
1.001100 100 x 2^4 (GRS = 100: G = ‘1’ and R,S = ‘0’)
=> 0 10011 001100

1.111111 100 x 2^4 (GRS = 100: G = ‘1’ and R,S = ‘0’)
1.111111 x 2^4
+ 0.000001 x 2^4
= 10.000000 x 2^4
= 1.000000 x 2^5 (Requires renormalization)
=> 0 10100 000000

1.001101 100 x 2^4 (GRS = 100: G = ‘1’ and R,S = ‘0’)
=> 0 10011 001110
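Round to Nearest using only the G, R, and S bits, including the renormalization case, can be sketched as follows. This assumes a 6-bit fraction as in the examples; the function name round_to_nearest is hypothetical.

```python
def round_to_nearest(fraction, exponent, g, r, s, frac_bits=6):
    """Round a normalized value 1.fraction x 2^exponent to nearest.

    fraction: integer holding the kept fraction bits (implied 1. not included)
    g, r, s: guard, round, and sticky bits (each 0 or 1)
    Returns the rounded (fraction, exponent).
    """
    halfway = g == 1 and r == 0 and s == 0
    above_half = g == 1 and (r == 1 or s == 1)
    lsb = fraction & 1

    # Round up if past halfway, or if exactly halfway and the LSB is 1
    # (so that the result ends with 0 in the LSB).
    if above_half or (halfway and lsb == 1):
        fraction += 1
        if fraction == 1 << frac_bits:   # 1.111111 + 0.000001 = 10.000000
            fraction = 0                 # renormalize to 1.000000
            exponent += 1
    return fraction, exponent


print(round_to_nearest(0b001100, 4, 1, 0, 0))
print(round_to_nearest(0b111111, 4, 1, 0, 0))
print(round_to_nearest(0b001101, 4, 1, 0, 0))
```

When G = 0 the value is below the halfway point, so the fraction is kept as is, which gives the same result as chopping.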
Round to 0 (Chopping)
• Simply drop the G,R,S bits and take fraction as
is
[Figure: fixed-point addition 12.75 + 6.5 = 19.25, done either with separate Integer and Fraction parts or, after scaling, as just integers: 1275.0 + 650.0 = 1925.0]
Another Example
• Suppose we want to convert from "dozens" to
number of individual items (e.g. 1.25 dozen =
15 items)
– Simple formula: 12*dozens = individual items
– Suppose we only support fractions: ¼, ½, ¾
represented as follows:
Decimal View Binary View
3. 25 11. 01
3. 50 11. 10
3. 75 11. 11
Another Example
• Procedure
– Assemble int + frac pieces into 1 variable
• Shift integer to the left, then add/OR in fraction bits
– Perform desired arithmetic operation
– Disassemble int + frac by shifting back
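The procedure above can be sketched for the dozens example as follows. This is an illustrative Python sketch that assumes a 2-bit fraction field holding quarters (00, 01, 10, 11); the names FRAC_BITS and dozens_to_items are hypothetical.

```python
FRAC_BITS = 2   # assumed: fraction field counts quarters (0, 1/4, 1/2, 3/4)


def dozens_to_items(int_part, frac_part):
    """Convert dozens (integer part + quarter count) to individual items."""
    # Assemble: shift the integer left, then OR in the fraction bits.
    combined = (int_part << FRAC_BITS) | frac_part   # value in quarter-dozens
    # Perform the arithmetic on the combined integer.
    product = 12 * combined                          # value in quarter-items
    # Disassemble: shift back to separate integer and fraction.
    items = product >> FRAC_BITS
    leftover_quarters = product & ((1 << FRAC_BITS) - 1)
    return items, leftover_quarters


print(dozens_to_items(1, 0b01))   # 1.25 dozen
print(dozens_to_items(3, 0b01))   # 3.25 dozen
```

Because 12 is a multiple of 4, the leftover fraction is always 0 here; it is returned only to show the disassembly step.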
Decimal View Binary View
i f i f
3. 25 11. 01