IEEE Standard 754 Floating Point Numbers
Steve Hollasch / Last update 2003-Jan-02
IEEE Standard 754 floating point is the most common representation today for real
numbers on computers, including Intel-based PCs, Macintoshes, and most Unix platforms. This
article gives a brief overview of IEEE floating point and its representation. Discussion of
arithmetic implementation may be found in the book mentioned at the bottom of this article.
Storage Layout
IEEE floating point numbers have three basic components: the sign, the exponent, and the
mantissa. The mantissa is composed of the fraction and an implicit leading digit (explained
below). The exponent base (2) is implicit and need not be stored.
The following table shows the layout for single (32-bit) and double (64-bit) precision
floating-point values. The number of bits for each field is shown, with bit ranges in square
brackets:

                Sign      Exponent      Fraction       Bias
    Single      1 [31]    8 [30-23]     23 [22-00]     127
    Double      1 [63]    11 [62-52]    52 [51-00]     1023
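To make the layout concrete, here is a minimal sketch (C, assuming float is the IEEE
single-precision format, as it is on essentially all modern platforms) that pulls the three
fields out of a float with shifts and masks:

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    int main(void)
    {
        float x = -6.25f;                 /* -1.1001 x 2^2 in binary  */
        uint32_t bits;
        memcpy(&bits, &x, sizeof bits);   /* reinterpret the raw bits */

        uint32_t sign     = bits >> 31;            /* bit  [31]    */
        uint32_t exponent = (bits >> 23) & 0xFF;   /* bits [30-23] */
        uint32_t fraction = bits & 0x7FFFFF;       /* bits [22-00] */

        printf("sign=%u exponent=%u fraction=0x%06X\n",
               (unsigned)sign, (unsigned)exponent, (unsigned)fraction);
        /* prints: sign=1 exponent=129 fraction=0x480000 */
        return 0;
    }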
The Sign Bit
The sign bit is as simple as it gets. 0 denotes a positive number; 1 denotes a negative
number. Flipping the value of this bit flips the sign of the number.
The Exponent
The exponent field needs to represent both positive and negative exponents. To do this, a
bias is added to the actual exponent in order to get the stored exponent. For IEEE single-precision
floats, this value is 127. Thus, an exponent of zero means that 127 is stored in the exponent field.
A stored value of 200 indicates an exponent of (200-127), or 73. For reasons discussed later,
exponents of -127 (all 0s) and +128 (all 1s) are reserved for special numbers.
For double precision, the exponent field is 11 bits, and has a bias of 1023.
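As a quick illustration (a C sketch, under the same assumption that float is IEEE single
precision), 1.0 = 1.0 x 2^0 stores the biased value 127, and 2.0 = 1.0 x 2^1 stores 128:

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    /* Return the raw 8-bit biased exponent field of a float */
    static unsigned stored_exponent(float x)
    {
        uint32_t bits;
        memcpy(&bits, &x, sizeof bits);
        return (bits >> 23) & 0xFF;
    }

    int main(void)
    {
        printf("%u %u\n", stored_exponent(1.0f), stored_exponent(2.0f));
        /* prints: 127 128 */
        return 0;
    }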
The Mantissa
The mantissa, also known as the significand, represents the precision bits of the number. It
is composed of an implicit leading bit and the fraction bits.
To find out the value of the implicit leading bit, consider that any number can be expressed
in scientific notation in many different ways. For example, the number five can be represented as
any of these:
    5.00 x 10^0
    0.05 x 10^2
    5000 x 10^-3
In order to maximize the quantity of representable numbers, floating-point numbers are
typically stored in normalized form. This basically puts the radix point after the first non-zero
digit. In normalized form, five is represented as 5.0 x 10^0.
A nice little optimization is available to us in base two, since the only possible non-zero
digit is 1. Thus, we can just assume a leading digit of 1, and don't need to represent it explicitly. As
a result, the mantissa has effectively 24 bits of resolution, by way of 23 fraction bits.
Putting it all together:
1. The sign bit is 0 for positive, 1 for negative.
2. The exponent's base is two.
3. The exponent field contains 127 plus the true exponent for single-precision, or 1023
plus the true exponent for double precision.
4. The first bit of the mantissa is assumed to be 1, so the mantissa is effectively 1.f,
where f is the field of fraction bits.
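Those rules are enough to decode a value by hand. Here is a small sketch (C;
decode_normalized is a hypothetical helper for illustration, valid only for normalized
single-precision values):

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>
    #include <math.h>

    /* value = (-1)^sign x 1.fraction x 2^(exponent - 127) */
    static double decode_normalized(uint32_t bits)
    {
        int    sign     = bits >> 31;
        int    exponent = (int)((bits >> 23) & 0xFF) - 127;    /* remove the bias     */
        double mantissa = 1.0 + (bits & 0x7FFFFF) / 8388608.0; /* 1.f; 2^23 = 8388608 */
        return (sign ? -1.0 : 1.0) * ldexp(mantissa, exponent);
    }

    int main(void)
    {
        printf("%g\n", decode_normalized(0xC0C80000));  /* prints: -6.25 */
        return 0;
    }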
Ranges of Floating-Point Numbers
Let's consider single-precision floats for a second. Note that we're taking essentially a
32-bit number and re-jiggering the fields to cover a much broader range. Something has to give, and
it's precision. For example, regular 32-bit integers, with all precision centered around zero, can
precisely store integers with 32 bits of resolution. Single-precision floating-point, on the other
hand, is unable to match this resolution with its 24 bits. It does, however, approximate such values
by effectively truncating from the lower end.
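For example, here is a minimal sketch (C, assuming a 32-bit int and IEEE single precision)
of an integer that needs more than 24 significant bits and so cannot survive the round trip:

    #include <stdio.h>

    int main(void)
    {
        int   n = 16777217;           /* 2^24 + 1: one bit too many for the mantissa */
        float f = (float)n;           /* the low-order bit is rounded away           */
        printf("%d -> %.1f\n", n, f); /* prints: 16777217 -> 16777216.0              */
        return 0;
    }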
The range of positive floating point numbers can be split into normalized numbers (which
preserve the full precision of the mantissa), and denormalized numbers (discussed later) which use
only a portion of the fraction's precision.
                         Denormalized                     Normalized                      Approximate Decimal
    Single Precision     ±2^-149 to (1-2^-23) x 2^-126    ±2^-126 to (2-2^-23) x 2^127    ±~10^-44.85 to ~10^38.53
    Double Precision     ±2^-1074 to (1-2^-52) x 2^-1022  ±2^-1022 to (2-2^-52) x 2^1023  ±~10^-323.3 to ~10^308.3
Since the sign of floating point numbers is given by a special leading bit, the range for negative
numbers is given by the negation of the above values.
There are five distinct numerical ranges that single-precision floating-point numbers are not able to
represent with this scheme:
1. Negative numbers less than -(2-2^-23) x 2^127 (negative overflow)
2. Negative numbers greater than -2^-149 (negative underflow)
3. Zero
4. Positive numbers less than 2^-149 (positive underflow)
5. Positive numbers greater than (2-2^-23) x 2^127 (positive overflow)
Here's a table of the effective range (excluding infinite values) of IEEE floating-point numbers:

                         Binary                  Decimal
    Single Precision     ±(2-2^-23) x 2^127      ~±10^38.53
    Double Precision     ±(2-2^-52) x 2^1023     ~±10^308.3

Note that the extreme values occur (regardless of sign) when the exponent is at the maximum
value for finite numbers (2^127 for single precision, 2^1023 for double), and the mantissa is filled
with 1s (including the normalizing 1 bit).
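This is easy to check numerically; a sketch (C, using the standard FLT_MAX constant from
<float.h>):

    #include <stdio.h>
    #include <float.h>
    #include <math.h>

    int main(void)
    {
        /* (2 - 2^-23) x 2^127: maximum exponent, mantissa all 1s */
        float max = ldexpf(2.0f - ldexpf(1.0f, -23), 127);
        printf("%d\n", max == FLT_MAX);   /* prints: 1 */
        return 0;
    }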
Special Values
IEEE reserves exponent field values of all 0s and all 1s to denote special values in the floating-point
scheme.
Zero
As mentioned above, zero is not directly representable in the straight format, due to the assumption
of a leading 1 (we'd need to specify a true zero mantissa to yield a value of zero). Zero is a special
value denoted with an exponent field of zero and a fraction field of zero. Note that -0 and +0 are
distinct values, though they both compare as equal.
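A short sketch (C) showing that the two zeros compare equal yet remain distinguishable, and
that the sign still matters in arithmetic:

    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        float pz = +0.0f, nz = -0.0f;
        printf("%d\n", pz == nz);         /* prints 1: they compare equal          */
        printf("%d %d\n", signbit(pz) != 0, signbit(nz) != 0);
                                          /* prints 0 1: distinct bit patterns     */
        printf("%f\n", 1.0f / nz);        /* prints -inf: the sign carries through */
        return 0;
    }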
Denormalized
If the exponent is all 0s, but the fraction is non-zero (else it would be interpreted as zero), then the
value is a denormalized number, which does not have an assumed leading 1 before the binary point.
Thus, this represents a number (-1)^s x 0.f x 2^-126, where s is the sign bit and f is the fraction. For
double precision, denormalized numbers are of the form (-1)^s x 0.f x 2^-1022. From this you can
interpret zero as a special type of denormalized number.
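The smallest positive denormalized single is 2^-149 (just the lowest fraction bit set). A
sketch (C, using the C99 functions nextafterf and ldexpf):

    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        /* The first representable value above zero is the smallest denormal */
        float tiny = nextafterf(0.0f, 1.0f);
        printf("%g\n", tiny);                        /* prints ~1.4013e-45     */
        printf("%d\n", tiny == ldexpf(1.0f, -149));  /* prints 1: it is 2^-149 */
        return 0;
    }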
Infinity
The values +infinity and -infinity are denoted with an exponent of all 1s and a fraction of all 0s. The
sign bit distinguishes between negative infinity and positive infinity. Being able to denote infinity as
a specific value is useful because it allows operations to continue past overflow situations.
Operations with infinite values are well defined in IEEE floating point.
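For example (a C sketch), overflow yields infinity instead of halting the computation, and
later arithmetic remains well defined:

    #include <stdio.h>
    #include <float.h>

    int main(void)
    {
        float big = FLT_MAX * 2.0f;     /* overflow -> +inf, not an error   */
        printf("%f\n", big);            /* prints inf                       */
        printf("%f\n", -1.0f / big);    /* prints -0.000000: still defined  */
        printf("%d\n", big > FLT_MAX);  /* prints 1: infinity orders sanely */
        return 0;
    }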
Not A Number
The value NaN (Not a Number) is used to represent a value that does not represent a real number.
NaNs are represented by a bit pattern with an exponent of all 1s and a non-zero fraction. There are
two categories of NaN: QNaN (Quiet NaN) and SNaN (Signalling NaN).
A QNaN is a NaN with the most significant fraction bit set. QNaNs propagate freely through
most arithmetic operations. These values pop out of an operation when the result is not
mathematically defined.
An SNaN is a NaN with the most significant fraction bit clear. It is used to signal an exception
when used in operations. SNaNs can be handy to assign to uninitialized variables to trap premature
usage.
Semantically, QNaNs denote indeterminate operations, while SNaNs denote invalid operations.
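A sketch of NaN behavior (C, using the C99 functions nanf and isnan; nanf produces a quiet
NaN):

    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        float q = nanf("");               /* a quiet NaN                             */
        printf("%d\n", isnan(q + 1.0f));  /* prints 1: NaN propagates through math   */
        printf("%d\n", q == q);           /* prints 0: NaN is unequal even to itself */
        return 0;
    }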
Special Operations
Operations on special numbers are well-defined by IEEE. In the simplest case, any operation with a
NaN yields a NaN result. Other operations are as follows:
    Operation                  Result
    n / ±Infinity              0
    ±Infinity x ±Infinity      ±Infinity
    ±nonzero / 0               ±Infinity
    Infinity + Infinity        Infinity
    ±0 / ±0                    NaN
    Infinity - Infinity        NaN
    ±Infinity / ±Infinity      NaN
    ±Infinity x 0              NaN
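Several of these rows can be reproduced directly (a C sketch; the volatile qualifier keeps
the compiler from folding the divisions at compile time):

    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        volatile float zero = 0.0f;
        volatile float inf  = INFINITY;

        printf("%f\n", 1.0f / inf);   /* n / +Infinity        ->  0   */
        printf("%f\n", 5.0f / zero);  /* nonzero / 0          ->  inf */
        printf("%f\n", zero / zero);  /* 0 / 0                ->  nan */
        printf("%f\n", inf - inf);    /* Infinity - Infinity  ->  nan */
        printf("%f\n", inf * 0.0f);   /* Infinity x 0         ->  nan */
        return 0;
    }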
Summary
To sum up, the following are the corresponding values for a given representation:

    Exponent            Fraction    Value
    all 0s              0           ±0
    all 0s              non-zero    ±denormalized number
    00..01 to 11..10    anything    ±normalized number
    all 1s              0           ±Infinity
    all 1s              non-zero    NaN (SNaN if the top fraction bit is clear, QNaN if set)
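The same dispatch can be written as a classifier over the raw fields; a sketch (C; classify
is a hypothetical helper for illustration):

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    /* Classify a single-precision value from its exponent and fraction fields */
    static const char *classify(float x)
    {
        uint32_t bits;
        memcpy(&bits, &x, sizeof bits);
        uint32_t exponent = (bits >> 23) & 0xFF;
        uint32_t fraction = bits & 0x7FFFFF;

        if (exponent == 0)    return fraction ? "denormalized" : "zero";
        if (exponent == 0xFF) return fraction ? "NaN" : "infinity";
        return "normalized";
    }

    int main(void)
    {
        printf("%s %s %s\n", classify(0.0f), classify(1.0f), classify(1e-45f));
        /* prints: zero normalized denormalized */
        return 0;
    }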
References
A lot of this stuff was observed from small programs I wrote to go back and forth between
hex and floating point (printf-style), and to examine the results of various operations. The bulk
of this material, however, was lifted from Stallings' book:
Computer Organization and Architecture, William Stallings, Macmillan Publishing Company.