EE 109 Unit 20: IEEE 754 Floating Point Representation Floating Point Arithmetic

1
EE 109 Unit 20
IEEE 754 Floating Point

Representation
Floating Point Arithmetic
2
Floating Point
• Used to represent very small numbers
(fractions) and very large numbers
– Avogadro’s Number: +6.0247 * 1023
– Planck’s Constant: +6.6254 * 10-27
– Note: 32 or 64-bit integers can’t represent this
range
• Floating Point representation is used in HLL’s
like C by declaring variables as float or
double
3
Fixed Point
• Unsigned and 2’s complement fall under a category of
representations called “Fixed Point”
• The radix point is assumed to be in a fixed location for all numbers
[Note: we could represent fractions by implicitly assuming the
binary point is at the left…A variable just stores bits…you can
assume the binary point is anywhere you like]
– Integers: 10011101. (binary point to right of LSB) Bit storage
• For 32-bits, unsigned range is 0 to ~4 billion Fixed point Rep.
– Fractions: .10011101 (binary point to left of MSB)

• Range [0 to 1)
• Main point: By fixing the radix point, we limit the range of numbers
that can be represented
– Floating point allows the radix point to be in a different location for each
value
4
Floating Point Representation

• Similar to scientific notation used with
decimal numbers
– ±D.DDD * 10 ±exp
• Floating Point representation uses the
following form
– ±b.bbbb * 2±exp
– 3 Fields: sign, exponent, fraction (also called
mantissa or significand)
Overall Sign of #
S Exp. fraction
5
Normalized FP Numbers
• Decimal Example
– +0.754*1015 is not correct scientific notation
– Must have exactly one significant digit before decimal point:
+7.54*1014
• In binary the only significant digit is ‘1’
• Thus normalized FP format is:
±1.bbbbbb * 2±exp
• FP numbers will always be normalized before being
stored in memory or a reg.
– The 1. is actually not stored but assumed since we always will store
normalized numbers
– If HW calculates a result of 0.001101*25 it must normalize to
1.101000*22 before storing
6
IEEE Floating Point Formats

• Single Precision • Double Precision
(32-bit format) (64-bit format)
– 1 Sign bit (0=pos/1=neg) – 1 Sign bit (0=pos/1=neg)
– 8 Exponent bits – 11 Exponent bits
• Excess-127 representation • Excess-1023 representation
• More on next slides • More on next slides
– 23 fraction (significand or – 52 fraction (significand or
mantissa) bits mantissa) bits
– Equiv. Decimal Range: – Equiv. Decimal Range:
• 7 digits x 10±38 • 16 digits x 10±308
1 8 23 1 11 52
S Exp. Fraction S Exp. Fraction

7
Exponent Representation
• Exponent needs its own sign (+/-) 2’s E' Excess-
comp. (stored Exp.) 127
• Rather than using 2’s comp. system we use -1 1111 1111 +128
Excess-N representation -2 1111 1110 +127
– Single-Precision uses Excess-127
– Double-Precision uses Excess-1023
-128 1000 0000 1
– This representation allows FP numbers to be
easily compared +127 0111 1111 0
• Let E’ = stored exponent code and +126 0111 1110 -1
E = true exponent value

• For single-precision: E’ = E + 127 +1 0000 0001 -126
– 21 => E = 1, E’ = 12810 = 100000002 0 0000 0000 -127
• For double-precision: E’ = E + 1023 Comparison of

– 2-2 => E = -2, E’ = 102110 = 011111111012 2’s comp. & Excess-N
Q: Why don’t we use Excess-N
more to represent negative #’s
8
Exponent Representation
• FP formats reserved E’ E (=E’-127)
the exponent values (range of 8-bits shown) and special values
of all 1’s and all 0’s for 11111111 Reserved

special purposes 11111110 E’-127=+127
• Thus, for single- …
precision the range of 10000000 E’-127=+1

exponents is 01111111 E’-127=0
-126 to + 127 01111110 E’-127=-1
…
00000001 E’-127=-126
00000000 Reserved
9
IEEE Exponent Special Values

E’ Fraction Meaning
All 0’s All 0’s 0
All 0’s Not all 0’s Denormalized

(any bit = ‘1’) (0.fraction x 2-126)
All 1’s All 0’s Infinity
All 1’s Not all 0’s NaN (Not A Number)

(any bit = ‘1’) - 0/0, 0*∞,SQRT(-x)
10
Single-Precision Examples
1 1 1000 0010 110 0110 0000 0000 0000 0000

130-127=3
-1.1100110 * 23
= -1110.011 * 20
= -14.375
2 +0.6875 = +0.1011
= +1.011 * 2-1
-1 +127 = 126
0 0111 1110 011 0000 0000 0000 0000 0000
11
Floating Point vs. Fixed Point

• Single Precision (32-bits) Equivalent Decimal Range:
– 7 significant decimal digits * 10±38
– Compare that to 32-bit signed integer where we can
represent ±2 billion. How does a 32-bit float allow us to
represent such a greater range?
– FP allows for range but sacrifices precision (can’t represent
all numbers in its range)
• Double Precision (64-bits) Equivalent Decimal Range:
• 16 significant decimal digits * 10±308
12
IEEE Shortened Format

• 12-bit format defined just for this class
(doesn’t really exist)
– 1 Sign Bit
– 5 Exponent bits (using Excess-15)
• Same reserved codes
– 6 Fraction (significand) bits
1 5-bits 6-bits
S E’ F
Sign Bit Exponent Fraction

0=pos. Excess-15 1.bbbbbb
1=neg. E’ = E+15
E = E’ - 15
13
Examples
1 1 10100 101101 2 +21.75 = +10101.11
20-15=5
= +1.010111 * 24
-1.101101 * 25
4+15=19
= -110110.1 * 20
0 10011 010111
= -110110.1 = -54.5
3 1 01101 100000 4 +3.625 = +11.101

13-15=-2
= +1.110100 * 21
-1.100000 * 2-2
1+15=16
= -0.011 * 20
0 10000 110100
= -0.011 = -0.375
14
Rounding Methods
• +213.125 = 1.1010101001*27 => Can’t keep all fraction bits
• 4 Methods of Rounding (you are only responsible for the first 2)
Normal rounding you learned in grade school.
Round to the nearest representable number. If
Round to Nearest
exactly halfway between, round to representable
value w/ 0 in LSB.
Round the representable value closest to but not
Round towards 0
greater in magnitude than the precise value.
(Chopping)
Equivalent to just dropping the extra bits.
Round toward +∞ Round to the closest representable value greater
(Round Up) than the number
Round toward -∞ Round to the closest representable value less
(Round Down) than the number
15
Number Line View Of Rounding Methods

Green lines are numbers that fall between two
representable values (dots) and thus need to be
rounded
Round to
Nearest -∞ -3.75 0 +5.8 +∞
Round to Zero
-∞ 0 +∞
Round to
+Infinity -∞ 0 +∞
Round to -
Infinity -∞ 0 +∞
16
Rounding Implementation
• There may be a large number of bits after the fraction
• To implement any of the methods we can keep only a
subset of the extra bits after the fraction [hardware is
finite]
– Guard bits: bits immediately after LSB of fraction (in this class
we will usually keep only 1 guard bit)
– Round bit: bit to the right of the guard bits
– Sticky bit: Logical OR of all other bits after G & R bits
1.01001010010 x 24
Logical OR (output is ‘1’ if any input is ‘1’,
4 ‘0’ otherwise
1.010010101 x 2
GRS
We can perform rounding to a 6-bit
fraction using just these 3 bits.
17
Rounding to Nearest Method

• Same idea as rounding in decimal
– .51 and up, round up,
– .49 and down, round down,
– .50 exactly we round up in decimal
• In this method we treat it differently…If precise value is
exactly half way between 2 representable values, round
towards the number with 0 in the LSB
18
Round to Nearest Method

• Round to the closest representable value
– If precise value is exactly half way between 2 representable
value, round towards the number with 0 in the LSB
1.11111011010 x 24
1.111110111 x 24
GRS
Round Up
+1.111110 x 24 +1.111110111 x 24 +1.111111 x 24
Precise value will be rounded to one of the

representable value it lies between.
In this case, round up because precise value is closer

to the next higher respresentable values
19
Rounding to Nearest Method

• 3 Cases in binary FP:
– G = ‘1’ & (R,S ≠ 0,0) =>
• round fraction up (add 1 to fraction)
• may require a re-normalization
– G = ‘1’ & (R,S = 0,0) =>
• round to the closest fraction value with a ‘0’ in the LSB
• may require a re-normalization
– G = ‘0’ =>
• leave fraction alone (add 0 to fraction)
20
Round to Nearest
GRS GRS GRS

1.001100110 x 24 1.111111101 x 24 1.001101001 x 24
G = ‘1’ & R,S ≠ 0,0 G = ‘1’ & R,S ≠ 0,0 G = ‘0’
Round up (fraction + 1) Round up (fraction + 1) Leave fraction
0 10011 001101 1.111111 x 24 0 10011 001101

+ 0.000001 x 24
10.000000 x 24
1.000000 x 25
Requires renormalization
0 10100 000000
21
Round to Nearest
• In all these cases, the numbers are halfway between the 2 possible round
values
• Thus, we round to the value w/ 0 in the LSB
GRS GRS GRS
1.001100100 x 24 1.111111100 x 24 1.001101100 x 24
G = ‘1’ and R,S = ‘0’ G = ‘1’ and R,S = ‘0’ G = ‘1’ and R,S = ‘0’
Rounding options are: Rounding options are: Rounding options are:

1.001100 or 1.001101 1.111111 or 10.000000 1.001101 or 1.001110
In this case, round down In this case, round up In this case, round up
1.111111 x 24
0 10011 001100 + 0.000001 x 24 0 10011 001110
10.000000 x 24
1.000000 x 25 Requires renormalization
0 10100 000000
22
Round to 0 (Chopping)
• Simply drop the G,R,S bits and take fraction as
is
GRS GRS GRS

1.001100001 x 24 1.001101101 x 24 1.001100111 x 24
drop G,R,S bits drop G,R,S bits drop G,R,S bits
0 10011 001100 0 10011 001101 0 10011 001100

23
Important Warning For Programmers

• FP addition/subtraction is NOT associative
– Because of rounding / inability to precisely
represent fractions, (a+b)+c ≠ a+(b+c)
(small + LARGE) – LARGE ≠ small + (LARGE – LARGE)
Why? Because of rounding and special values like Inf.

(0.0001 + 98475) – 98474 ≠ 0.0001 + (98475-98474)
98475-98474 ≠ 0.0001 + 1
1 ≠ 1.0001
Another Example:
1 + 1.11…1*2127 – 1.11…1*2127
24
USING INTEGERS TO EMULATE

DECIMAL/FRACTION ARITHMETIC
25
Option 1 for Performing FP

• Problem: Suppose you need to add, subtract
and compare numbers representing FP
numbers
– Ex. amounts of money (dollars and cents.)
float x, y, z;
• Option 1: Use floating point variables if (x > y)
– Pro: Easy to implement and easy to work with z = z + x;
– Con: Some processors like Arduino don't have
HW support of FP and so they would need to Option 1: Just use 'float' or
'double' variables in your
include a lot of code to emulate FP operations code
– Con: Numbers like $12.50 can be represented
exactly but most numbers are approximate due
to rounding of fractions that can be can't be
represented in a finite number of bits.
• Example 0.1 decimal can't be represented exactly in
binary
26

• Option 2: Split the amounts into two
fixed-point variables, one for the dollars
and one for the cents. $12.53 -> 12 and
53
/* add two numbers */
– Pro: Everything done by fixed-point
z_cents = x_cents + y_cents;
operations, no FP. z_dollars = x_dollars + y_dollars;
– Cons: All operations require at least two if (z_cents > 100) {
z_dollars++;
fixed point operations (though this is
z_cents -= 100;
probably still faster than the code the }
compiler would include in Option 1) Option 2: Use 'ints' to
emulate FP
27

• Option 3: Use a single fixed-point
variable by multiplying the
amounts by 100 (e.g. $12.53 -> int x_amount = x_dollars*100+x_cents;
int y_amount = y_dollars*100+y_cents;
1253)
– Pro: All adds, subtracts, and int z_amount;
if (x_amount > y_amount)
compares are done as a single fixed- z_amount += z_amount + x_amount;
point operation
– Con: Must convert to this form on int z_dollars = z_amount / 100;
int z_cents = z_amount % 100;
input and back to dollars and cents
on output. Option 3: Use single 'int' values
that contain the combined values
of integer and decimal
28
Emulating Floating Point

• Let's examine option 3 a bit more
• Suppose we have separate variables storing the integer
and fraction part of a number separately
– If we just added integer portions we'd get 18
– If we account for the fractions into the inputs of our operation
we would get 19.25 (which could then be truncated to 19)
– We would prefer this more precise answer that includes the
fractional parts in our inputs
Desired Decimal Operation

(separate integer / fraction):
12. 75
+ 6. 5
Integer Fraction
29
Example and Procedure

• Imagine we took our numbers and multiplied them each by
100, performed the operation, then divided by 100
– 100*(12.75+6.5) = 1275 + 650 = 1925 => 1925/100 = 19.25 => 19
• Procedure
– Assemble int + frac pieces into 1 variable
• Shift integer to the left, then add/OR in fraction bits
– Peform desired arithmetic operation
– Disassemble int + frac by shifting back
Desired Decimal Operation Shift Integers & Add in Disassemble integer and
(separate integer / fraction): fractions: fraction results (if only
integer is desired, simply
12. 75 1275. 0 shift right to drop fraction)
+ 6. 5 + 650. 0 Integer +
19. 25 Fraction
Integer Fraction
1925. 0 19. Or just integer
30
Another Example
• Suppose we want to convert from "dozens" to
number of individual items (i.e. 1.25 dozen =
15 items)
– Simple formula: 12*dozens = individual items
– Suppose we only support fractions: ¼, ½, ¾
represented as follows:
Decimal View Binary View
Integer Fraction Integer Fraction
3. 25 11. 01
3. 50 11. 10
3. 75 11. 11
31
Another Example
• Procedure
– Assemble int + frac pieces into 1 variable
• Shift integer to the left, then add/OR in fraction bits
– Peform desired arithmetic operation
– Disassemble int + frac by shifting back
Decimal View Binary View
i f i f
3. 25 11. 01
1 i = i << 2 12. 25 1100. 01

i |= f
13. 1101.
2 i *= 12 156. 10011100.
3 i = i >> 2 39. 100111.

EE 109 Unit 20: IEEE 754 Floating Point Representation Floating Point Arithmetic

Uploaded by

Copyright:

Available Formats

EE 109 Unit 20: IEEE 754 Floating Point Representation Floating Point Arithmetic

Uploaded by

Document Information

Original Description:

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

EE 109 Unit 20: IEEE 754 Floating Point Representation Floating Point Arithmetic

Uploaded by

Copyright:

Available Formats

1

IEEE 754 Floating Point

– Fractions: .10011101 (binary point to left of MSB)

Floating Point Representation

IEEE Floating Point Formats

S Exp. Fraction S Exp. Fraction

• Let E’ = stored exponent code and +126 0111 1110 -1

E = true exponent value

– 21 => E = 1, E’ = 12810 = 100000002 0 0000 0000 -127

• For double-precision: E’ = E + 1023 Comparison of

of all 1’s and all 0’s for 11111111 Reserved

• Thus, for single- …

precision the range of 10000000 E’-127=+1

IEEE Exponent Special Values

All 0’s All 0’s 0

All 0’s Not all 0’s Denormalized

All 1’s Not all 0’s NaN (Not A Number)

1 1 1000 0010 110 0110 0000 0000 0000 0000

Floating Point vs. Fixed Point

IEEE Shortened Format

Sign Bit Exponent Fraction

3 1 01101 100000 4 +3.625 = +11.101

Number Line View Of Rounding Methods

Rounding to Nearest Method

Round to Nearest Method

+1.111110 x 24 +1.111110111 x 24 +1.111111 x 24

Precise value will be rounded to one of the

In this case, round up because precise value is closer

Rounding to Nearest Method

GRS GRS GRS

Round up (fraction + 1) Round up (fraction + 1) Leave fraction

0 10011 001101 1.111111 x 24 0 10011 001101

Rounding options are: Rounding options are: Rounding options are:

GRS GRS GRS

0 10011 001100 0 10011 001101 0 10011 001100

Important Warning For Programmers

(small + LARGE) – LARGE ≠ small + (LARGE – LARGE)

Why? Because of rounding and special values like Inf.

USING INTEGERS TO EMULATE

Option 1 for Performing FP

Option 2 for Performing FP

Option 3 for Performing FP

Emulating Floating Point

Desired Decimal Operation

Example and Procedure

Integer Fraction Integer Fraction

1 i = i << 2 12. 25 1100. 01

3 i = i >> 2 39. 100111.

You might also like