0% found this document useful (0 votes)

26 views30 pages

Chapter2

Uploaded by

zihanliangeddie

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

26 views30 pages

Chapter2

Uploaded by

zihanliangeddie

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 30

Chapter 2

Computing with Floating Point

Numbers

The first part of this chapter gives an elementary introduction to the representation of computer
numbers and computer arithmetic. First, we introduce the computer numbers typically available on
modern computers (and available through Matlab). Next, we discuss the quality of the approxi-
mations produced by numeric computing.
In the second part of the chapter we present a number of simple examples that illustrate some of
the pitfalls of numerical computing. There are several such examples, each of them suitable for short
computer projects. They illustrate a number of pitfalls, with particular emphasis on the e↵ects of
catastrophic cancelation, floating–point overflow and underflow, and the accumulation of roundo↵
errors.

2.1 Numbers
Most scientific programming languages (including Matlab) provide users with at least two types of
numbers: integers and floating–point numbers. In fact, Matlab supports many other types too but
we concentrate on integers and floating–point numbers as those most likely to be used in scientific
computing. The floating–point numbers are a subset of the real numbers, so we begin by discussing
some basics about numbers.

2.1.1 Decimal and Floating Point Numbers

In math and science courses we typically use the decimal (or base-10) system to denote integer and
non-integer numbers, such as 0.1 and 93.75. Note that these specific examples can be written as
rational numbers (fractions involving integers),
1 750
= 0.1 and = 93.75
10 8
Some numbers, such as
⇡ = 3.141592653 · · ·
cannot be written with a finite number of decimal digits. Therefore, when doing hand calculations
or calculations on a computer, we are forced to use an approximation of the number by truncating
the decimal digits (usually using rounding), for example

⇡ ⇡ 3.1415927 (2.1)

Note that this means that our representation of ⇡ is an approximation; the more digits we use, the
better the approximation. This is usually referred to as precision. For example, the 8-digit decimal

23
24 CHAPTER 2. COMPUTING WITH FLOATING POINT NUMBERS

approximation of ⇡ given in (2.1) is more accurate than the 3-digit decimal approximation

⇡ ⇡ 3.14

Precision is a very important topic in scientific computation!

Computer vendors (and users) need to consider how much precision they want to use when
representing numbers on a computer. Before we can discuss this, though, it is essential to have a
unique format for numbers. For example, we could represent the rational number 750/8 in many
ways,
750
= 93.75
8
= 9.375 ⇥ 101
= 0.9375 ⇥ 102
1
= 937.5 ⇥ 10
..
.

As users of computers, this lack of uniqueness may not be especially important, but to understand
numerical precision and how computers do arithmetic, it is important. In this book define the
unique, or normalized base-10 number to have the form

x = ± (d1 .d2 · · · dt dt+1 · · · ) ⇥ 10e (2.2)

where di are integers, 0  di  9, with d1 6= 0, and e is an integer exponent.

On a computer, we would not be able to represent an infinite number of decimal digits, so we
need to terminate (e.g., by rounding) at some point. In addition, we cannot represent an infinitely
large (positive or negative) exponent. Thus, we define a t-digit base-10 (decimal) floating point
approximation of x to be:
fl (x) = ± (d1 .d2 · · · dt ) ⇥ 10e (2.3)
where di are integers, 0  di  9, with d1 6= 0, and e is an integer exponent that also must satisfy
L  e  U , where L is a lower bound (e.g., a large negative number) and U is an upper bound (e.g.,
a large positive number) for e. We will often dispense with the notation fl (·) when it is clear that we
are using floating point numbers. Obviously, we cannot represent all real numbers on a computer,
and thus the set of all floating point numbers is a subset of the real numbers. The number of decimal
digits, t, is often referred to as the precision. Notice that the normalized floating point representation
defined in (2.3) can also be written as
✓ ◆
d2 dt
fl (x) = ± d1 + + · · · + t 1 ⇥ 10e (2.4)
10 10
This latter form is convenient if we want to generalize the concept of floating point numbers to other
bases, as will be done later in this section.
750
Example 2.1.1. Consider the rational number x = 8 .

(a) The normalized base-10 representation of x is

✓ ◆
750 3 7 5
= 9+ + + ⇥ 101
8 10 100 1000
= 9.375 ⇥ 101

(b) A 3-digit base 10 normalized floating point approximation of x is

✓ ◆ ✓ ◆
750 3 8
fl = 9+ + ⇥ 101
8 10 100
= 9.38 ⇥ 101
2.1. NUMBERS 25

2.1.2 Binary Numbers

It is now easy to generalize the base-10 representation of floating point numbers to an arbitrary base
. Specifically, we define a normalized base- floating point number as
✓ ◆
d1 d2 dn
fl ((x) ) = ± 0
+ 1 + ··· + n 1 ⇥ e
e
= ± (d1 .d2 · · · dt ) ⇥

where di are integers, 0  di < , d1 6= 0, and e is an integer exponent, L  e  U , usually also

represented as a base- number. For computers, the most commonly used base is = 2, which
means
✓ ◆
d2 dn
fl ((x)2 ) = ± d1 + + · · · + n 1 ⇥ 2e
2 2
= ± (d1 .d2 · · · dn ) ⇥ 2e

where di are either 0 or 1 (i.e., binary digits, or bits), d1 6= 0, and e is a base-2 integer exponent,
with L  e  U . The value n specifies the number of bits allocated in the computer to store the
fraction part of the number; this will be discussed below in more detail. There is a relation between
this number of bits and the base-10 precision value t.
750
Example 2.1.2. Consider the rational number x = 8 .

(a) Recall that the normalized base-10 representation of x is

✓ ◆ ✓ ◆
750 3 7 5
= 9+ + + ⇥ 101
8 10 10 100 1000
= 9.375 ⇥ 101

(b) The normalized base-2 representation of x is

✓ ◆ ✓ ◆
750 0 1 1 1 0 1 1 1
= 1+ + + + + + + + ⇥ 26
8 2 2 4 8 16 32 64 128 256
= (1.01110111) ⇥ 26

Here we are using the notation (·) to emphasize which basis is being used to represent the number.
Usually it is clear from the context, in which case we will dispense from using this notation.

2.1.3 Computer Representation of Numbers

All computer numbers are stored as a sequence of binary digits, or bits. A bit assumes one of two
values, 0 or 1. The bits allocated to a number are usually ordered from right to left. For example,
if the number v is stored using 8 bits, we may write it

stored(v) = b7 b6 b5 b4 b3 b2 b1 b0

Here, bi represents the ith bit of v; the indices assigned to these bits start at 0 and increase to the
left. Because each bit can assume the value 0 or 1, there are 28 = 256 distinct patterns of bits.
However, the value associated with each pattern depends on what kind of number v that represents.

Integers
The most commonly used type of integers are the signed integers. Most computers store signed
integers using two’s complement notation; in this representation the 8-bit signed integer i is stored
as
stored(i) = b7 b6 b5 b4 b3 b2 b1 b0
26 CHAPTER 2. COMPUTING WITH FLOATING POINT NUMBERS

which is assigned the value

value(i) = b7 ( 27 ) + b6 (26 ) + · · · + b0 (20 )

When we compute with integers we normally use signed integers. So, for example, among the 256
di↵erent 8-bit signed integers, the smallest value is 128 and the largest is 127. This asymmetry
arises because the total number of possible bit patterns, 28 = 256, is even and one bit pattern,
00000000, is assigned to zero.
An 8-bit unsigned integer j stored as

stored(j) = b7 b6 b5 b4 b3 b2 b1 b0

is assigned the value

value(j) = b7 (27 ) + b6 (26 ) + b5 (25 ) + · · · + b0 (20 )
Unsigned integers are used mainly in indexing and in the representation of floating–point numbers.
Among the 256 di↵erent 8-bit unsigned integers, the smallest value, 00000000, is 0 and the largest,
11111111, is 255.

Example 2.1.3. Consider the decimal integer 13. Its representation as an unsigned integer is
00001101, because

13 = 0 · (27 ) + 0 · (26 ) + 0 · (25 ) + 0 · (24 ) + 1 · (23 ) + 1 · (22 ) + 0 · (21 ) + 1 · (20 )

The signed integer representation is also 00001101 because

13 = 0 · ( 27 ) + 0 · (26 ) + 0 · (25 ) + 0 · (24 ) + 1 · (23 ) + 1 · (22 ) + 0 · (21 ) + 1 · (20 )

Now consider the decimal integer -13. It obviously has no representation as an unsigned integer. As
a signed integer we first observe that

13 = 128 + 115 = 1 · ( 27 ) + 115

Now we need to use the remaining seven bits to represent the decimal integer 115:

115 = 1 · (26 ) + 1 · (25 ) + 1 · (24 ) + 0 · (23 ) + 0 · (22 ) + 1 · (21 ) + 1 · (20 )

Thus, the two’s complement binary form of the decimal integer -13 is 11110011, or

13 = 1 · ( 27 ) + 1 · (26 ) + 1 · (25 ) + 1 · (24 ) + 0 · (23 ) + 0 · (22 ) + 1 · (21 ) + 1 · (20 )

Problem 2.1.1. Determine a formula for the smallest and largest n-bit unsigned integers.
What are the numerical values of the smallest and largest n-bit unsigned integers for each
of n = 8, 11, 16, 32, 64?

Problem 2.1.2. Determine a formula for the smallest and largest n-bit signed integers.
What are numerical values of the smallest and largest n-bit signed integers for each of
n = 8, 11, 16, 32, 64?

Problem 2.1.3. List the value of each 8-bit signed integer for which the negative of this
value is not also an 8-bit signed integer.
2.1. NUMBERS 27

2.1.4 Floating-Point Numbers

In early computers the types of floating–point numbers available depended not only on the program-
ming language but also on the computer. The situation changed dramatically after 1985 with the
widespread adoption of the ANSI/IEEE 754-1985 Standard for Binary Floating–Point Arithmetic.
For brevity we’ll call this the Standard. It defines several types of floating–point numbers. The
most widely implemented is its 64-bit double precision (DP) type. In most computer languages
the programmer can also use the 32-bit single precision (SP) type of floating–point numbers.
If a program uses just one type of floating–point number, we refer to that type as the working
precision (WP) floating–point numbers. (WP is usually either DP or SP. In Matlab, WP=DP.)
The important quantity machine epsilon, ✏WP , is the distance from 1 to the next larger number
in the working precision.
An important property of WP numbers is that the computed result of every sum, di↵erence,
product, or quotient of a pair of WP numbers is a valid WP number. For this property to be
possible, the WP numbers in the Standard include representations of finite real numbers as well as
the special numbers ±0, ±1 and NaN (Not-a-Number). For example, the quotient 1/0 is the special
number 1 and the product 1 ⇥ 0, which is mathematically undefined, has the value NaN.
In the Standard, an SP number x is stored using 32 bits, which are partitioned as
s(x) e(x) f (x)
z}|{ z }| {z }| {
stored(x) = b31 b30 b29 · · · b23 b22 b21 · · · b0

and a DP number x is stored using 64 bits, which are partitioned as

s(x) e(x) f (x)
z}|{ z }| {z }| {
stored(y) = b63 b62 b61 · · · b52 b51 b50 ··· b0

The boxes illustrate how the bits are partitioned into 3 fields:
• One bit is an unsigned integer s(x), the sign bit of x; s(x) = 0 indicates a positive number,
and s(x) = 1 indicates a negative number.

• The next partition uses a certain number of bits (8 for single, 11 for double) to represent e(x),
as an unsigned integer, and is called the biased exponent of x.
• The next partition uses the remaining available bits (23 for single, 52 for double) to represent
f (x), as an unsigned integer, and is called the fraction of x.

Note that this definition implies that the machine epsilon, that is the distance from 1 to the next
larger number in the given working precision is
23 7
single precision: ✏SP = 2 ⇡ 1.1921 ⇥ 10
52 16
double precision: ✏DP = 2 ⇡ 2.2204 ⇥ 10

The precise details of how to convert the bits into a decimal value, which are not at first obvious,
are given in Section 2.1.5, but it is important to notice that there are limitations on what numbers
can be represented. Specifically:
Exponent Limitation: This can lead to either overflow or underflow.
Overflow: This occurs if the exponent is a too large positive number to be represented
by the available bits in e(x). In this case, |x| is very large, and the Standard returns a
result of either Inf or Inf.
• Underflow: This occurs if the exponent is a too large negative number to be represented
by the available bits in e(x). In this case, |x| is a tiny number, and the Standard returns
a result of 0.
28 CHAPTER 2. COMPUTING WITH FLOATING POINT NUMBERS

Fraction Limitation: Some fraction parts cannot be represented

p by a finite number of bits.
This is most easily seen for irrational numbers, such as ⇡ or 2. Obviously these cannot be
written with a finite number of decimal digits, and hence cannot be represented with a finite
number of bits. But the situation can occur with relatively simple rational numbers. For
example, the decimal number 1/10 does not have a finite binary representation, but instead
must be written as the repeating (but infinite) series
✓ ◆
1 1 1 0 0 1 1 0 0 1 1 0
= 4 + 5 + 6 + 7 + 8 + 9 + 10 + 11 + 12 + 13 + 14 + · · ·
10 2 2 2 2 2 2 2 2 2 2 2 2

In these situations, the number of bits must be “cuto↵ ” at some point, leading to roundo↵
error.
Section 2.2 discusses in more detail issues of underflow, overflow, and propagation of roundo↵ errors
in scientific computing.

2.1.5 Additional Details on Floating Point Numbers

Double Precision Numbers
As stated in Section 2.1.4 In the Standard, a DP number y is stored as
s(y) e(y) f (y)
z}|{ z }| {z }| {
stored(y) = b63 b62 b61 · · · b52 b51 b50 ··· b0

The boxes illustrate how the 64 bits are partitioned into 3 fields: a 1-bit unsigned integer s(y), the
sign bit of y, an 11-bit unsigned integer e(y), the biased exponent of y, and a 52-bit unsigned
integer f (y), the fraction of y. Because the sign of y is stored explicitly, with s(y) = 0 for positive
and s(y) = 1 for negative y, each DP number can be negated. The 11-bit unsigned integer e(y)
represents values from 0 through 2047:
• If e(y) = 2047, then y is a special number.
• If 0 < e(y) < 2047, then y is a normalized floating–point number with value
n f (y) o
value(y) = ( 1)s(y) · 1 + 52 · 2E(y)
2
where E(y) = e(y) 1023 is the (unbiased) exponent of y.
• If e(y) = 0 and f (y) 6= 0, then y is a denormalized floating–point number with value
n f (y) o
value(y) = ( 1)s(y) · 0 + 52 · 2 1022
2

• If e(y) = 0 and f (y) = 0, then y has value zero.

f (y) f (y)
The significand of y is 1 + 52
for normalized numbers, 0 + 52 for denormalized numbers, and 0
2 2
for zero. The normalized DP numbers y with exponent⇥E(y) = k belong to binade k. For example,
binade 1 consists of the numbers y for which |y| 2 12 , 1 and binade 1 consists of the numbers
y for which |y| 2 [2, 4). In general, binade k consists of the numbers y for which 2k  |y| < 2k+1 .
Each DP binade contains the same number of positive DP numbers. We define the DP machine
epsilon ✏DP as the distance from 1 to the next larger DP number. For a finite nonzero DP number
y, we define ulpDP (y) ⌘ ✏DP 2E(y) as the DP unit-in-the-last-place of y.
2.1. NUMBERS 29

Example 2.1.4. Consider the real number y = 1/6. This number is not directly representable
in the Standard but we will compute the nearest Standard floating–point number to it. Note that
1/6 = (8/6)2 3 so E(y) = 3, s(y) = 1, e(y) = 1020 (hence the number y is a normalized DP
floating–point number), and f (y) is the closest integer to 252 /3 (because 8/6 1 = 1/3). This gives
f (y) = 1501199875790165 as a decimal and 101010101010101010101010101010101010101010101010101
in binary.
The above representation as defined in the Standard is rather too cumbersome for our purposes
in this text. Below, without significant loss of generality, we will simplify the representation. We
use as our scientific notation for floating point numbers that any DP number x is represented
S(x) · 2e(x) where the exponent e(x) is an integer and the mantissa S(x) satisfies 1  |S(x)| < 2.

Problem 2.1.4. Write y = 6.1 as a normalized DP floating–point number

Problem 2.1.5. What are the smallest and largest possible significands for normalized DP
numbers?
Problem 2.1.6. What are the smallest and largest positive normalized DP numbers?

Problem 2.1.7. How many DP numbers are in each binade? Sketch enough of the positive
real number line so that you can place marks at the endpoints of each of the DP binades 3,
2, 1, 0, 1, 2 and 3. What does this sketch suggest about the spacing between successive
positive DP numbers?
52
Problem 2.1.8. Show that ✏DP = 2 .

Problem 2.1.9. Show that y = q · ulpDP (y) where q is an integer with 1  |q|  253 1.

1 1
Problem 2.1.10. Show that neither y = nor y = is a DP number. Hint: If y is a DP
m
3 10
number, then 2 y is a 53-bit unsigned integer for some (positive or negative) integer m.

Problem 2.1.11. What is the largest positive integer n such that 2n 1 is a DP number?

Single Precision Numbers

This section has been included for completeness and may be omitted unless great emphasis is to be
given to the definition of computer arithmetic.
In the Standard, a single precision (SP) number x is stored as
s(x) e(x) f (x)
z}|{ z }| {z }| {
stored(x) = b31 b30 b29 · · · b23 b22 b21 · · · b0

The boxes illustrate how the 32 bits are partitioned into 3 fields: a 1-bit unsigned integer s(x), the
sign bit of x, an 8-bit unsigned integer e(x), the biased exponent of x, and a 23-bit unsigned
integer f (x), the fraction of x. The sign bit s(x) = 0 for positive and s(x) = 1 for negative SP
numbers. The 8-bit unsigned integer e(x) represents values in the range from 0 through 255:
• If e(x) = 255, then x is a special number.
• If 0 < e(x) < 255, then x is a normalized floating–point number with value
n f (x) o
value(x) = ( 1)s(x) · 1 + 23 · 2E(x)
2
where E(x) ⌘ e(x) 127 is the (unbiased) exponent of x.
30 CHAPTER 2. COMPUTING WITH FLOATING POINT NUMBERS

• If e(x) = 0 and f (x) 6= 0, then x is a denormalized floating–point number with value

n f (x) o
value(x) = ( 1)s(x) · 0 + 23 · 2 126
2

• If both e(x) = 0 and f (x) = 0, then x is zero.

f (x) f (x)
The significand of x is 1+ for normalized numbers, 0+ 23 for denormalized numbers, and 0
223 2
for zero. The normalized SP numbers y with exponent⇥E(y) = k belong to binade k. For example,
binade 1 consists of the numbers y for which |y| 2 12 , 1 and binade 1 consists of the numbers
y for which |y| 2 [2, 4). In general, binade k consists of the numbers y for which 2k  |y| < 2k+1 .
Each SP binade contains the same number of positive SP numbers. We define the SP machine
epsilon ✏SP as the distance from 1 to the next larger SP number. For a finite nonzero SP number
x, we define ulpSP (x) ⌘ ✏SP 2E(x) as the SP unit-in-the-last-place of x.

Problem 2.1.12. Write y = 6.1 as a normalized SP floating–point number

Problem 2.1.13. What are the smallest and largest possible significands for normalized SP
numbers?
Problem 2.1.14. What are the smallest and largest positive normalized SP numbers?

Problem 2.1.15. How many SP numbers are in each binade? Sketch enough of the positive
real number line so that you can place marks at the endpoints of each of the SP binades 3,
2, 1, 0, 1, 2 and 3. What does this sketch suggest about the spacing between consecutive
positive SP numbers?

Problem 2.1.16. Compare the sketches produced in Problems 2.1.7 and 2.1.15. How many
DP numbers lie between consecutive positive SP numbers? We say that distinct real numbers
x and y are resolved by SP numbers if the SP number nearest x is not the same as the SP
number nearest y. Write down the analogous statement for DP numbers. Given two distinct
real numbers x and y, are x and y more likely to be resolved by SP numbers or by DP
numbers? Justify your answer.
23
Problem 2.1.17. Show that ✏SP = 2 .

Problem 2.1.18. Show that x = p · ulpSP (x) where p is an integer with 1  |p|  224 1.

Problem 2.1.19. If x is a normalized SP number in binade k, what is the distance to the

next larger SP number? Express your answer in terms of ✏SP .

1 1
Problem 2.1.20. Show that neither x = nor x = is a SP number. Hint: If x is a SP
m
3 10
number, then 2 x is an integer for some (positive or negative) integer m.

Problem 2.1.21. What is the largest positive integer n such that 2n 1 is a SP number?

2.2 Computer Arithmetic

What kind of numbers should a program use? Signed integers may be appropriate if the input data
are integers and the only operations used are addition, subtraction, and multiplication. Floating–
point numbers are appropriate if the data involves fractions, has large or small magnitudes, or if the
operations include division, square root, or computing transcendental functions like the sine.
2.2. COMPUTER ARITHMETIC 31

2.2.1 Integer Arithmetic

In integer computer arithmetic the computed sum, di↵erence, or product of integers is the exact
result, except when integer overflow occurs. This happens when to represent the result of an
integer arithmetic operation more bits are needed than are available for the result. Thus, for the
8-bit signed integers i = 126 and j = 124 the computed value i + j = 250 is a number larger than
the value of any 8-bit signed integer, so it overflows. So, in the absence of integer overflow, integer
arithmetic performed by the computer is exact. That you have encountered integer overflow may
not always be immediately apparent. The computer may simply return an (incorrect) integer result!
For example, in Matlab, the computation:
int8(126) + int8(124)

returns an int8 value of 127. Similarly, the computation

int8(-126) + int8(-124)
returns an int8 value of -128. Fortunately the default number of bits used to store integers in most
programming languages, including Matlab, is more than 8, so this kind of situation is unlikely to
occur.

2.2.2 Floating-Point Arithmetic

The set of all integers is closed under the arithmetic operations of addition, subtraction, and
multiplication. That is the sum, di↵erence, or product of two integers is another integer. Similarly,
the set of all real numbers is closed under the arithmetic operations of addition, subtraction, and
multiplication, and division (except by zero). Unfortunately, the set of all DP numbers is not closed
under any of the operations of add, subtract, multiply, or divide. For example, both x = 252 + 1 and
y = 252 1 are DP numbers and yet their product xy = 2104 1 is not an DP number. To make
the DP numbers closed under the arithmetic operations of addition, subtraction, multiplication, and
division, the Standard modifies slightly the result produced by each operation. Specifically, it defines

x y ⌘ flDP (x + y)
x y ⌘ flDP (x y)
x⌦y ⌘ flDP (x ⇥ y)
x↵y ⌘ flDP (x/y)

Here, flDP (z) is the DP number closest to the real number z; that is, flDP () is the DP rounding
function1 . So, for example, the first equality states that x y, the value assigned to the sum of the
DP numbers x and y, is the DP number flDP (x + y). In a general sense, no approximate arithmetic
on DP numbers can be more accurate than that specified by the Standard.
In summary, floating–point arithmetic is inherently approximate; the computed value of any sum,
di↵erence, product, or quotient of DP numbers is equal to the exact value rounded to the nearest
floating–point DP number. In the next section we’ll discuss how to measure the quality of this
approximate arithmetic.

Problem 2.2.1. Show that 2104 1 is not a DP number. Hint: Recall Problem 2.1.11.

Problem 2.2.2. Show that each of 253 1 and 2 is a DP number, but that their sum is
not a DP number. So, the set of all DP numbers is not closed under addition. Hint: Recall
Problem 2.1.11.
Problem 2.2.3. Show that the set of DP numbers is not closed under subtraction; that is,
find two DP numbers whose di↵erence is not a DP number.
1 When z is midway between two adjacent DP numbers, flDP (z) is the one whose fraction is even.
32 CHAPTER 2. COMPUTING WITH FLOATING POINT NUMBERS

Problem 2.2.4. Show that the set of DP numbers is not closed under division, that is find
two nonzero DP numbers whose quotient is not a DP number. Hint: Consider Problem
2.1.10.
Problem 2.2.5. Assuming floating–point underflow does not occur, why can any DP num-
ber x be divided by 2 exactly? Hint: Consider the representation of x as a DP number.

Problem 2.2.6. Let x be a DP number. Show that flDP (x) = x.

Problem 2.2.7. Let each of x, y and x + y be a DP number. What is the value of the DP
number x y? State the extension of your result to the di↵erence, product and quotient of
DP numbers.
Problem 2.2.8. The real numbers x, y and z satisfy the associative law of addition:

(x + y) + z = x + (y + z)

Consider the DP numbers a = 260 , b = 260 and c = 2 60

. Show that

(a b) c 6= a (b c)

So, in general DP addition is not associative. Hint: Show that b c = b.

Problem 2.2.9. The real numbers x, y and z satisfy the associative law of multiplica-
tion:
(x ⇥ y) ⇥ z = x ⇥ (y ⇥ z),
52 52 52
Consider the DP numbers a = 1 + 2 ,b=1 2 and c = 1.5 + 2 . Show that

(a ⌦ b) ⌦ c 6= a ⌦ (b ⌦ c)

So, in general DP multiplication is not associative. Hint: Show that a ⌦ b = 1 and b ⌦ c =

1.5 2 52 .
Problem 2.2.10. The real numbers x, y and z satisfy the distributive law:

x ⇥ (y + z) = (x ⇥ y) + (x ⇥ z)

Choose values of the DP numbers a, b and c such that

a ⌦ (b c) 6= (a ⌦ b) (a ⌦ c)

So, in general DP arithmetic is not distributive.

Problem 2.2.11. Define the SP rounding function flSP () that maps real numbers into SP
numbers. Define the values of x y, x y, x ⌦ y and x ↵ y for SP arithmetic.

Problem 2.2.12. Show that 224 1 and 2 are SP numbers, but their sum is not a SP
number. So, the set of all SP numbers is not closed under addition. Hint: Recall Problem
2.1.21.

2.2.3 Quality of Approximations

To illustrate the use of approximate arithmetic we employ normalized decimal scientific no-
tation. The Standard represents numbers in binary notation. We use decimal representation to
simplify the development since the reader will be more familiar with decimal than binary arithmetic.
Using the notation in equations (2.2) and (2.3), a nonzero real number T is represented as

T = ± m(T ) · 10e(T )
2.2. COMPUTER ARITHMETIC 33

where 1  m(T ) < 10 is a real number and e(T ) is a positive, negative or zero integer. Here, m(T )
is the decimal significand and e(T ) is the decimal exponent of T . For example,
120. = (1.20) · 102
⇡ = (3.14159 . . .) · 100
2
0.01026 = ( 1.026) · 10
For the real number T = 0, we define m(T ) = 1 and e(T ) = 1.
For any integer k, decade k is the set of real numbers whose exponents e(T ) = k. So, the decade
k is the set of real numbers with values whose magnitudes are in the half open interval [10k , 10k+1 ).
For a nonzero number T , its ith significant digit is the ith digit of m(T ), counting to the right
starting with the units digit. So, the units digit is the 1st significant digit, the tenths digit is the 2nd
significant digit, the hundredths digit is the 3rd significant digit, etc. For the value ⇡ listed above,
the 1st significant digit is 3, the 2nd significant digit is 1, the 3rd significant digit is 4, etc.
Frequently used measures of the error A T in A as an approximation to the true value T are
absolute error = |A T|
A T
absolute relative error =
T
with the relative error being defined only when T 6= 0. The approximation A to T is said to be
q–digits accurate if the absolute error is less than 12 of one unit in the q th significant digit of T .
Since 1  m(T ) < 10, A is a q–digits approximation to T if
1 10e(T ) q+1
|A T|  |m(T )|10e(T ) 10 q

2 2
If T 6= 0, then dividing the above inequality by T = m(T )10e(T ) , we obtain the following statement
for relative errors:
If the absolute relative error
|A T |
r
|T |
then A is a q–digits accurate approximation to T provided that q  log10 (2r).
Returning now to binary representation, whenever flDP (z) is a normalized DP number,
✏DP
flDP (z) = z(1 + µ) where |µ| 
2
The value |µ|, the absolute relative error in flDP (z), depends on the value of z. Then
x y = (x + y)(1 + µa )
x y = (x y)(1 + µs )
x⌦y = (x ⇥ y)(1 + µm )
x↵y = (x/y)(1 + µd )
✏DP
where the absolute relative errors |µa |, |µs |, |µm | and |µd | are each no larger than .
2

Problem 2.2.13. Show that if A is an approximation to T with an absolute relative error

less than 0.5 · 10 16 , then A is a 16–digit accurate approximation to T .
Problem 2.2.14. Let r be an upper bound on the absolute relative error in the approx-
imation A to T , and let q be an integer that satisfies q  log10 (2r). Show that A
e(T )+1 10 q
is a q–digit approximation to T . Hint: Show that |T |  10 , r  , and
2
q e(T ) q+1
10 |T | 10
|A T |   . This last inequality demonstrates that A is q–digits
2 2
accurate.
34 CHAPTER 2. COMPUTING WITH FLOATING POINT NUMBERS

Problem 2.2.15. Let z be a real number for which flDP (z) is a normalized DP number.
Show that |flDP (z) z| is at most half a DP ulp times the value of flDP (z). Then show that
fl (z) z ✏DP
µ ⌘ DP satisfies the bound |µ|  .
z 2
Problem 2.2.16. Let x and y be DP numbers. The absolute relative error in x y as an
✏DP
approximation to x + y is no larger than . Show that x y is about 15 digits accurate
2
as an approximation to x + y. How does the accuracy change if you replace the addition
operation by any one of subtraction, multiplication, or division (assuming y 6= 0)?

2.2.4 Propagation of Errors

There are two types of errors in any computed sum, di↵erence, product, or quotient. The first is
the error that is inherent in the numbers, and the second is the error introduced by the arithmetic.
Let x0 and y 0 be DP numbers and consider computing x0 ⇥ y 0 . Commonly, x0 and y 0 are the
result of a previous computation. Let x and y be the values that x0 and y 0 would have been if they
had been computed exactly; that is,

x0 = x(1 + µx ), y 0 = y(1 + µy )

where µx and µy are the relative errors in the approximations of x0 to x and of y 0 to y, respectively.
Now, x0 ⇥ y 0 is computed as x0 ⌦ y 0 , so

x0 ⌦ y 0 = flDP (x0 ⇥ y 0 ) = (x0 ⇥ y 0 )(1 + µm )

where µm is the relative error in the approximation x0 ⌦ y 0 of x0 ⇥ y 0 . How well does x0 ⌦ y 0

approximate the exact result x ⇥ y? We find that

x0 ⌦ y 0 = (x ⇥ y)(1 + µ)

where µ = (1 + µx )(1 + µy )(1 + µm ) 1. Expanding the products µ = µx + µy + µm + µx µy +

µx µm + µy µm + µx µy µm . Generally, we can drop the terms involving products of the µ’s because
the magnitude of each value µ is small relative to 1, so the magnitudes of products of the µ’s are
smaller still. Consequently
µ ⇡ µx + µy + µm
So the relative error in x0 ⌦ y 0 is (approximately) equal to the sum of
a. the relative errors inherent in each of the values x0 and y 0
b. the relative error introduced when the values x0 and y 0 are multiplied
In an extended computation, we expect the error in the final result to come from the (hopefully slow)
accumulation of the errors in the initial data and of the errors from the arithmetic operations on that
data. This is a major reason why DP arithmetic is mainly used in general scientific computation.
While the final result may be represented adequately by a SP number, we use DP arithmetic in an
attempt to reduce the e↵ect of the accumulation of errors introduced by the computer arithmetic
because the relative errors µ in DP arithmetic are so much smaller than in SP arithmetic.
Relatively few DP computations produce exactly a DP number. However, subtraction of nearly
equal DP numbers of the same sign is always exact and so is always a DP number. This exact
cancelation result is stated mathematically as follows:

x0 y 0 = x0 y0
1 x0
whenever  0  2. So, µs = 0 and we expect that µ ⇡ µx + µy . It is easy to see that
2 y
x0 y 0 = x0 y 0 = (x y)(1 + µ)
2.2. COMPUTER ARITHMETIC 35

xµx yµy
where µ = . We obtain an upper bound on µ by applying the triangle inequality:
x y
|µ| |x| + |y|
g⌘
|µx | + |µy | |x y|
The left hand side measures how the relative errors µx and µy in the values x0 and y 0 , respectively,
are magnified to produce the relative error µ in the value x0 y 0 . The right hand side, g, is an upper
bound on this magnification. Observe that g 1 and that g grows as |x y| gets smaller, that is
as more cancelation occurs in computing x y. When g is large, the relative error in x0 y 0 may
be large because catastrophic cancelation has occurred. In the next section we present examples
where computed results su↵er from the e↵ect of catastrophic cancelation.

Example 2.2.1. We’ll give a simple example to simulate how the floating–point addition works on
a computer. To simplify, we’ll work in three decimal digit floating–point arithmetic, and assume we
want to compute an approximation of 1/3 + 8/7.
• Since neither x = 1/3 nor y = 8/7 are representable, we must first round to three decimal
digits:
fl (x) = fl (1/3) = 3.33 ⇤ 10 1
fl (y) = fl (8/7) = 1.14 ⇤ 100
Notice the errors in just representing the numbers on our computer:
fl (x) = x(1 + µx ) ) fl (1/3) = (1/3)(1 0.001) ) |µx | = 0.001
fl (y) = y(1 + µy ) ) fl (8/7) = (8/7)(1 0.0025) ) |µy | = 0.0025
• Now compute x y:
1
x y = fl 3.33 ⇤ 10 + 1.14 ⇤ 100 = fl 1.473 ⇤ 100 = 1.473 ⇤ 100
Notice that since we simulating 3-digit decimal arithmetic, we need to round after each result.
And, x y = (x + y)(1 0.0042), and we observe the e↵ects of propagation of rounding error, since
|µa | = 0.0042 > |µx | + |µy | = 0.0035.

Problem 2.2.17. Using the previous example, investigate the propagation of errors of x y
and x y for x = 1/11 and y = 4/3. Assume 3-digit decimal floating point arithmetic.
Problem 2.2.18. Let x = 1/3, y = 8/7 and z = 1/13 and use 3-digit decimal floating point
arithmetic to compute
(x y) z and x (y z)
You should get di↵erent results, which illustrates that floating point arithmetic is not asso-
ciative.
Problem 2.2.19. Derive the expression

x0 y 0 = x0 y 0 = (x y)(1 + µ)
xµx yµy
where µ ⌘ . Show that
x y

|xµx yµy |  |x||µx | + |y||µy |  (|x| + |y|)(|µx | + |µy |)

and that
xµx yµy |x| + |y|
 (|µx | + |µy |)
x y |x y|
36 CHAPTER 2. COMPUTING WITH FLOATING POINT NUMBERS

2.3 Examples
The following examples illustrate some less obvious pitfalls in simple scientific computations.

2.3.1 Plotting a Polynomial

Consider the polynomial p(x) = (1 x)10 , which can be written in power series form as

p(x) = x10 10 x9 + 45 x8 120 x7 + 210 x6 252 x5 + 210 x4 120 x3 + 45 x2 10 x + 1

Suppose we use this power series to evaluate and plot p(x). For example, if we use DP arithmetic to
evaluate p(x) at 101 equally spaced points in the interval [0, 2] (so that the spacing between points
is 0.02), and plot the resulting values, we obtain the curve shown on the left in Fig. 2.1. The curve
has a shape we might expect. However, if we attempt to zoom in on the region near x = 1 by
using DP arithmetic to evaluating p(x) at 101 equally spaced points in the interval [0.99, 1.01] (so
that the spacing is 0.0002), and plot the resulting values, then we obtain the curve shown on the
right in Fig. 2.1. In this case, the plot suggests that p(x) has many zeros in the interval [0.99, 1.01].
(Remember, if p(x) changes sign at two points then, by continuity, p(x) has a zero somewhere
between those points.) However, the factored form p(x) = (1 x)10 implies there is only a single
zero, of multiplicity 10, at the point x = 1. Roundo↵ error incurred while evaluating the power
series and the e↵ects of cancelation of the rounded values induce this inconsistency.
−13
1 x 10
1

0.8

0.5
0.6
p(x)

p(x)

0.4 0

0.2
−0.5
0

−0.2 −1
0 0.5 1 1.5 2 0.99 0.995 1 1.005 1.01
x x

Figure 2.1: Plot of the power form p(x) = x10 10 x9 + 45 x8 120 x7 + 210 x6 252 x5 + 210 x4
120 x3 + 45 x2 10 x + 1 evaluated in DP arithmetic. The left shows a plot of p(x) using 101 equally
spaced points on the interval 0  x  2. The right zooms in on the region near x = 1 by plotting
p(x) using 101 equally spaced points on the interval 0.99  x  1.01.

In Fig. 2.1, the maximum amplitudes of the oscillations are larger to the right of x = 1 than
to the left. Recall that x = 1 is the boundary between binade 1 and binade 0. As a result, the
magnitude of the maximum error incurred by rounding can be a factor of 2 larger to the right of
x = 1 than to the left, which is essentially what we observe.

2.3.2 Repeated Square Roots

Let n be a positive integer and let x be a DP number such that 1  x < 4. Consider the following
procedure. First, initialize
p the DP variable t := x. Next, take the square root of t a total of n times
via the assignment t := t. Then, square the resulting value t a total of n times via the assignment
t := t ⇤ t. Finally, print the value of error, x t. The computed results of an experiment for various
values of x, using n = 100, are shown in Table 2.1. But for rounding, the value of the error in the
third column would be zero, but the results show errors much larger than zero! To understand what
is happening, observe that taking the square root of a number discards information. The square
2.3. EXAMPLES 37

root function maps the interval [1, 4) onto the binade [1, 2). Now [1, 4) is the union of the binades
[1, 2) and [2, 4). Furthermore, each binade contains 253 DP numbers. So, the square root function
maps each of 2 · 253 = 254 arguments into one of 253 possible square roots. On average, then, the
square root function maps two DP numbers in [1, 4) into one DP number in [1, 2). So, generally,
the DP square root of a DP argument does not contain sufficient information to recover that DP
argument, that is taking the DP square root of a DP number usually loses information.

x t x t
1.0000 1.0000 0.0000
1.2500 1.0000 0.2500
1.5000 1.0000 0.5000
1.7500 1.0000 0.7500
2.0000 1.0000 1.0000
2.2500 1.0000 1.2500
2.5000 1.0000 1.5000
2.7500 1.0000 1.7500
3.0000 1.0000 2.0000
3.2500 1.0000 2.2500
3.5000 1.0000 2.5000
3.7500 1.0000 2.7500

Table 2.1: Results of the repeated square root experiment. Here x is the exact value, t is the
computed result (which in exact arithmetic should be the same as x), and x t is the error.

2.3.3 Estimating the Derivative

Consider the forward di↵erence estimate
f (x + h) f (x)
E f (x) ⌘
h
of the derivative f 0 (x) of a continuously di↵erentiable function f (x). Assuming the function f (x) is
sufficiently smooth in an interval containing both x and x + h, for small values of the increment h,
a Taylor series expansion (see Problem 2.3.1) gives the approximation

hf 00 (x)
E f (x) ⇡ f 0 (x) +
2
As the increment h ! 0 the value of E f (x) ! f 0 (x), a fact familiar from calculus.
Suppose, when computing E f (x), that the only errors that occur involve when rounding the
exact values of f (x) and f (x + h) to working precision (WP) numbers, that is

flWP (f (x)) = f (x)(1 + µ1 ), flWP (f (x + h)) = f (x + h)(1 + µ2 )

where µ1 and µ2 account for the errors in the rounding. If WP f (x) denotes the computed value of
E f (x), then

flWP (f (x + h)) flWP (f (x))

WP f (x) ⌘
h
f (x + h)(1 + µ2 ) f (x)(1 + µ1 )
=
h
f (x + h)µ2 f (x)µ1
= E f (x) +
h
1 f (x + h)µ2 f (x)µ1
⇡ f 0 (x) + hf 00 (x) +
2 h
38 CHAPTER 2. COMPUTING WITH FLOATING POINT NUMBERS

✏WP 23
where each absolute relative error |µi |  . (For SP arithmetic ✏WP = 2 and for DP ✏WP =
2
52
2 .) Hence, we obtain

WP f (x) f 0 (x) f 00 (x) 1 f (x + h)µ2 f (x)µ1

r= 0
⇡ h 0
+
f (x) 2f (x) h f 0 (x)
f 00 (x) 1 f (x)(|µ2 | + |µ1 |) c2
⇡h +  c1 h + ⌘R
2f 0 (x) h f 0 (x) h

an approximate upper bound in accepting WP f (x) as an approximation of the value f 0 (x), where
f 00 (x) f (x)✏WP
c1 ⌘ 0
and c2 ⌘ . In deriving this upper bound we have assumed that f (x + h) ⇡
2f (x) f 0 (x)
f (x) which is valid when h is sufficiently small and f (x) is continuous on the interval [x, x + h].
Consider the expression for R. The term c1 h ! 0 as h ! 0, accounting for the error in accepting
0 c2
E f (x) as an approximation of f (x). The term ! 1 as h ! 0, accounting for the error arising
h
from using the computed, rather than the exact, values of f (x) and f (x + h). So, as h ! 0, we
might expect the absolute relative error in the forward di↵erence estimate WP f (x) of the derivative
first to decrease and then to increase. Consider using SP arithmetic, the function f (x) ⌘ sin(x)
with x = 1 radian, and the sequence of increments h ⌘ 2 n for n = 1, 2, 3, · · · , 22. Fig. 2.2 uses
dots to display log10 (2r), the digits of accuracy in the computed forward di↵erence estimate of the
derivative, and a solid curve to display log10 (2R), an estimate of the minimum digits of accuracy
obtained from the model of rounding. Note, the forward di↵erence estimate becomes more accurate
as the curve increases, reaching a maximum accuracy of about 4 digits at n = 12, and then becomes
less accurate as the curve decreases. The maximum accuracy (that is the minimum value of R)
predicted by the model is approximately proportional to the square root of the working-precision.
For h sufficiently small, what we have observed is catastrophic cancelation of the values of f (x)
and f (x + h) followed by magnification of the e↵ect of the resulting loss of significant digits by
division by a small number, h.
We will revisit the problem of numerical di↵erentiation in Chapter 5.

5
Estimated Digits of Accuracy

0
5 10 15 20
n

Figure 2.2: Forward di↵erence estimates: SP f (x) (dots) and R (solid)

2.3. EXAMPLES 39

Problem 2.3.1. A Taylor series expansion of a function f (z) about the point c is defined
to be
f 00 (c) f 000 (c)
f (z) = f (c) + (z c)f 0 (c) + (z c)2 + (z c)3 + ···
2! 3!
Conditions that guarantee when such a series expansion exists are addressed in a di↵erential
calculus course. The function must be sufficiently smooth in the sense that all of the deriva-
tives f (n) (z), n = 0, 1, 2, . . ., must exist in an open interval containing c. In addition, the
series must converge, which means the terms of the series must approach zero sufficiently
quickly, for all z values in an open interval containing c.
Suppose f is sufficiently smooth in an interval containing z = x and z = x + h, where |h| is
small. Use the above Taylor series with z = x + h and c = x to show that

f 00 (x) f 000 (x)

f (x + h) = f (x) + hf 0 (x) + h2 + h3 + ···
2! 3!
and argue that
f (x + h) f (x) hf 00 (x)
E f (x) ⌘ ⇡ f 0 (x) + .
h 2
c2
Problem 2.3.2. Let the function R(h) ⌘ c1 h + where c1 , c2 are positive constants. For
h
what value of h does R(h) attain its smallest value? (Give an expression for this smallest
f 00 (x) f (x)✏WP
value.) Now, substitute c1 ⌘ 0
and c2 ⌘ to show how the smallest value
2f (x) f 0 (x)
p
of R(h) depends on ✏WP . For h = 2 n , what integer n corresponds most closely to this
smallest value?

2.3.4 A Recurrence Relation

Consider the sequence of values {Vj }1
j=0 defined by the definite integrals
Z 1
Vj = ex 1 xj dx, j = 0, 1, 2, · · ·
0
1
It can easily be seen that 0 < Vj+1 < Vj < for all j 1; see Problem 2.3.4.
j
Integration by parts demonstrates that the values {Vj }1
j=0 satisfy the recurrence relation

Vj = 1 jVj 1, j = 1, 2, · · ·
Because we can calculate the value of
Z 1
1
V0 = ex 1
dx = 1
0 e
we can determine the values of V1 , V2 , · · · using the recurrence starting from the computed estimate
for V0 . Table 2.2 displays the values V̂j for j = 0, 1, · · · , 16 computed by evaluating the recurrence
in SP arithmetic.
The first few values V̂j are positive and form a decreasing sequence. However, the values V̂j for
j 11 alternate in sign and increase in magnitude, contradicting the mathematical properties of the
sequence. To understand why, suppose that the only rounding error that occurs is in computing V0 .
So, instead of the correct initial value V0 we have used instead the rounded initial value V̂0 = V0 + ✏.
Let {V̂j }1
j=0 be the modified sequence determined exactly from the value V̂0 . Using the recurrence,
the first few terms of this sequence are
V̂1 = 1 V̂0 = 1 (V0 + ✏) = V1 ✏
V̂2 = 1 2V̂1 = 1 2(V1 ✏) = V2 + 2✏
V̂3 = 1 3V̂2 = 1 3(V2 + 2✏) = V3 3 · 2✏ (2.5)
V̂4 = 1 4V̂3 = 1 4(V3 3 · 2✏) = V4 + 4 · 3 · 2✏
..
.
40 CHAPTER 2. COMPUTING WITH FLOATING POINT NUMBERS

V̂j
j V̂j
j!
0 6.3212E-01 6.3212E-01
1 3.6788E-01 3.6788E-01
2 2.6424E-01 1.3212E-01
3 2.0728E-01 3.4546E-02
4 1.7089E-01 7.1205E-03
5 1.4553E-01 1.2128E-03
6 1.2680E-01 1.7611E-04
7 1.1243E-01 2.2307E-05
8 1.0056E-01 2.4941E-06
9 9.4933E-02 2.6161E-07
10 5.0674E-02 1.3965E-08
11 4.4258E-01 1.1088E-08
12 -4.3110E+00 -8.9999E-09
13 5.7043E+01 9.1605E-09
14 -7.9760E+02 -9.1490E-09
15 1.1965E+04 9.1498E-09
16 -1.9144E+05 -9.1498E-09

Table 2.2: Values of Vj determined using SP arithmetic.

In general,
V̂j = Vj + ( 1)j j!✏
Now, the computed value V̂0 should have an absolute error ✏ no larger than half a ulp in the SP value
ulpSP (V0 )
of V0 . Because V0 ⇡ 0.632 is in binade 1, ✏ ⇡ = 2 25 ⇡ 3 · 10 8 . Substituting this value
2
for ✏ we expect the first few terms of the computed sequence to be positive and decreasing as theory
predicts. However, because j! · (3 · 10 8 ) > 1 for all j 11 and because Vj < 1 for all values of j, the
formula for V̂j leads us to expect that the values V̂j will ultimately be dominated by ( 1)j j!✏ and
so will alternate in sign and increase in magnitude. That this analysis gives a reasonable prediction
V̂j
is verified by Table 2.2, where the error grows like j!, and approaches a constant.
j!
What we have observed is the potential for significant accumulation and magnification of round-
ing error in a simple process with relatively few steps.
We emphasize that we do not recommend using the recurrence to evaluate the integrals V̂j . They
may be evaluated simply either symbolically or numerically using Matlab integration software; see
Chapter 5.

Problem 2.3.3. Use integration by parts to show the recurrence

Vj = 1 jVj 1, j = 1, 2, · · ·

for evaluating the integral

Z 1
Vj = ex 1 j
x dx, j = 0, 1, 2, · · ·
0
2.3. EXAMPLES 41

Problem 2.3.4. In this problem we show that the sequence {Vj }1

j=0 has positive terms and
is strictly decreasing to 0.

(a) Show that for 0 < x < 1 we have 0 < xj+1 < xj for j 0. Hence, show that
0 < Vj+1 < Vj .
1
(b) Show that for 0 < x < 1 we have 0 < ex 1
< 1. Hence, show that 0 < Vj < .
j
(c) Hence, show that the sequence {Vj }1
j=0 has positive terms and is strictly decreasing to
0.

Problem 2.3.5. The error in the terms of the sequence {V̂j }1j=0 grows because V̂j 1 is mul-
tiplied by j. To obtain an accurate approximation of Vj , we can instead run the recurrence
backwards
1 V̂j+1
V̂j = , j = M 1, M 2, · · · , 1, 0
j+1
Now, the error in V̂j is divided by j. More specifically, if you want accurate SP approxima-
tions of the values Vj for 0  j  N , start with M = N + 12 and V̂M = 0 and compute
the values of V̂j for all j = M 1, M 2, · · · , 1, 0. We start 12 terms beyond the first
value of V̂N so that the error associated with using V̂N +12 = 0 will be divided by at least
12! ⇡ 4.8 · 108 , a factor large enough to make the contribution of the initial error in V̂M to
the error in V̂N less than a SP ulp in VN . (We know that V̂M is in error – from Problem
1 1
2.3.4, VM is a small positive number such that 0 < VM < M , so |V̂M VM | = |VM | < M .)

2.3.5 Summing the Exponential Series

From calculus, the Taylor series for the exponential function

x2 xn
ex = 1 + x + + ··· + + ···
2! n!
converges mathematically for all values of x both positive and negative. Let the term
xn
Tn ⌘
n!
and define the partial sum of the first n terms

x2 xn 1
Sn ⌘ 1 + x + + ··· +
2! (n 1)!

so that Sn+1 = Sn + Tn with S0 = 0.

Let’s use this series to compute e 20.5 ⇡ 1.25 · 10 9 . For x = 20.5, the terms alternate in sign
and their magnitudes increase until we reach the term T20 ⇡ 7.06 · 107 and then they alternate in
sign and their magnitudes steadily decrease to zero. So, for any value n > 20, from the theory of
alternating series
|Sn e 20.5 | < |Tn | .
That is, the absolute error in the partial sum Sn as an approximation to the value of e 20.5 is less
than the absolute value of the first neglected term, |Tn |. Note that DP arithmetic is about 16–digits
|Tn |
accurate. So, if a value n 20 is chosen so that < 10 16 , then Sn should be a 16–digit
e 20.5
approximation to e 20.5 . The smallest value of n for which this inequality is satisfied is n = 98, so
S98 should be a 16–digit accurate approximation to e 20.5 . Table 2.3 displays a selection of values
of the partial sums Sn and the corresponding terms Tn computed using DP arithmetic.
42 CHAPTER 2. COMPUTING WITH FLOATING POINT NUMBERS

n Tn Sn+1
0 1.0000e+00 1.0000E+00
5 -3.0171e+04 -2.4057e+04
10 3.6122e+06 2.4009e+06
15 -3.6292e+07 -2.0703e+07
20 7.0624e+07 3.5307e+07
25 -4.0105e+07 -1.7850e+07
30 8.4909e+06 3.4061e+06
35 -7.8914e+05 -2.8817e+05
40 3.6183e+04 1.2126e+04
45 -8.9354e+02 -2.7673e+02
50 1.2724e+01 3.6627e+00
55 -1.1035e-01 -2.9675e-02
60 6.0962e-04 1.5382e-04
65 -2.2267e-06 -5.2412e-07
70 5.5509e-09 6.2893e-09
75 -9.7035e-12 5.0406e-09
80 1.2178e-14 5.0427e-09
85 -1.1201e-17 5.0427e-09
90 7.6897e-21 5.0427e-09
95 -4.0042e-24 5.0427e-09
98 3.7802e-26 5.0427e-09

Table 2.3: Values of Sn and Tn determined using DP arithmetic.

From the table, it appears that the sequence of values Sn is converging. However, S98 ⇡ 5.04 ·
10 9 , and the true value should be e 20.5 ⇡ 1.25 · 10 9 . Thus, S98 is an approximation of e 20.5
that is 0–digits accurate! At first glance, it appears that the computation is wrong! However, the
computed value of the sum, though very inaccurate, is reasonable. Recall that DP numbers are
about 16 digits accurate, so the largest term in magnitude, T20 , could be in error by as much as
about |T20 | · 10 16 ⇡ 7.06 · 107 · 10 16 = 7.06 · 10 9 . Of course, changing T20 by this amount
will change the value of the partial sum S98 similarly. So, using DP arithmetic, it is unlikely that
the computed value S98 will provide an estimate of e 20.5 with an absolute error much less than
7.06 · 10 9 .
So, how can we calculate e 20.5 accurately using Taylor series? Clearly we must avoid the
catastrophic cancelation involved in summing an alternating series with large terms when the value
of the sum is a much smaller number. There follow two alternative approaches which exploit simple
mathematical properties of the exponential function.
1
1. Consider the relation e x = x . If we use a Taylor series for ex for x = 20.5 it still involves
e
large and small terms Tn but they are all positive and there is no cancellation in evaluating
the sum Sn . In fact the terms Tn are the absolute values of those appearing in Table 2.3. In
adding this sum we continue until adding further terms can have no impact on the sum. When
the terms are smaller then e20.5 · 10 16 ⇡ 8.00 · 108 · 10 16 = 8.00 · 10 8 they have no further
impact so we stop at the first term smaller than this. Forming this sum and stopping when
|Tn | < 8.00·10 8 , it turns out that we need the first 68 terms of the series to give a DP accurate
approximation to e20.5 . Since we are summing a series of positive terms we expect (and get)
1
an approximation for e20.5 with a small relative error. Next, we compute e 20.5 = 20.5 using
e
this approximation for e20.5 . The result has at least 15 correct digits in a DP calculation.

2. Another idea is to use a form of range reduction. We can use the alternating series as long as
we avoid the catastrophic cancellation which leads to a large relative error. We will certainly
2.3. EXAMPLES 43

avoid this problem if we evaluate e x only for values of x 2 (0, 1) since in this case the
magnitudes of the terms of the series are monotonically decreasing. Observe that
20.5 20 0.5 1 20 0.5
e =e e = (e ) e

So, if we can calculate e 1 accurately then raise it to the 20th power, then evaluate e 0.5 by
Taylor series, and finally we can compute e 20.5 by multiplying the results. (In the spirit of
range reduction, we use, for example, the Matlab exponential function to evaluate e 1 and
then its power function to compute its 20th power; this way these quantities are calculated
accurately. If we don’t have an exponential function available, we could use the series with
x = 1 to calculate e 1 .) Since e 0.5 ⇡ 0.6, to get full accuracy we terminate the Taylor
series when |Tn | < 0.6 · 10 16 , that is after 15 terms. This approach delivers at least 15 digits
of accuracy for e 20.5 .

Problem 2.3.6. Quote and prove the alternating series theorem that shows that |Sn
e 20.5 | < |Tn | in exact arithmetic.

Problem 2.3.7. Suppose that you use Taylor series to compute ex for a value of x such
that |x| > 1. Which is the largest term in the Taylor series? In what circumstances are
Tn x
there two equally sized largest terms in the Taylor series? [Hint: = .]
Tn 1 n

Problem 2.3.8. Suppose we are using SP arithmetic (with an accuracy of about 10 7 ) and
say we attempt to compute e 15 using the alternating Taylor series. What is the largest
value Tn in magnitude. Using the fact that e 15 ⇡ 3.06 · 10 7 how many terms of the
alternating series are needed to compute e 15 to full SP accuracy? What is the approximate
absolute error in the sum as an approximation to e 15 ? [Hint: There is no need to calculate
the value of the terms Tn or the partial sums Sn to answer this question.]

Problem 2.3.9. For what value of n does the partial sum Sn form a 16–digit accurate
approximation of e 21.5 ? Using DP arithmetic, compare the computed value of Sn with
the exact value e 21.5 ⇡ 4.60 · 10 10 . Explain why the computed value of Sn is reasonable.
Tn x
[Hint: Because = , if the current value of term is Tn 1 , then sequence of assignment
Tn 1 n
statements

term := x*term
term := term/n
converts term into the value of Tn .]

2.3.6 Euclidean Length of a Vector

Many computations require the Euclidean length
p
p = a 2 + b2

a
of the 2-vector . The value p can also be viewed as the Pythagorean sum of a and b. Now,
b
p
min(|a|, |b|)  p  2 · max(|a|, |b|)

so we should be able to compute p in such a way that we never encounter floating–point underflow
and
p we rarely encounter floating–point overflow. However, when computing p via the relation p =
a2 + b2 we can encounter one or both of floating–point underflow and floating–point overflow when
44 CHAPTER 2. COMPUTING WITH FLOATING POINT NUMBERS

computing the squares of a and b. To avoid floating–point overflow, we can choose a value c in such

hai b
a way that we can scale a and b by this value of c and the resulting scaled quantities and
c c
may be safely squared. Using this scaling,
s
h a i2  b 2
p=c· +
c c
hai 
b
An obvious choice for the scaling factor is c = max(|a|, |b|) so that one of and equals 1 and
c c
the other has magnitude less than 1. Another sensible choice of scaling factor is c a power of 2 just
greater than max(|a|, |b|); the advantage of this choice is that with Standard arithmetic, division by
2 is performed exactly (ignoring underflow). If we choose the scaling factor c in one of these ways it
is possible that the smaller squared term in that occurs when computing p will underflow but this
is harmless; that is, the computed value of p will be essentially correct.
Another technique that avoids both unnecessary floating–point overflow and floating–point un-
derflow is the Moler-Morrison Pythagorean p sum algorithm displayed in the pseudocode in Fig. 2.3.
In this algorithm, the value p converges to a2 + b2 from below, and the value q converges rapidly
to 0 from above. So, floating–point overflow can occur only if the exact value of p overflows. Indeed,
only harmless floating–point underflow can occur; that is, the computed value of p will be essentially
correct.

Moler-Morrison
Pythagorean Sum Algorithm

Input: scalars a and b

p
Output: p = a 2 + b2

p := max(|a|, |b|)
q := min(|a|, |b|)
for i = 1 to N
r := (q/p)2
s := r/(4 + r)
p := p + 2sp
q := sq
next i

p
Figure 2.3: The Moler-Morrison Pythagorean sum algorithm for computing p ⌘ a2 + b2 . Typically
N = 3 suffices for any SP or DP numbers p and q.

Problem 2.3.10. For any values a and b show that

p p
min(|a|, |b|)  p ⌘ a2 + b2  2 · max(|a|, |b|)

Problem 2.3.11. Suppose a = 1 and b = 10 60 . If c ⇡ max(|a|, |b|), floating–point un-

 2
b
derflow occurs in computing in DP arithmetic. In this case, the Standard returns the
c
 2
b
value 0 for . Why is the computed value of p still accurate?
c
2.3. EXAMPLES 45

Problem 2.3.12. In the Moler-Morrison Pythagorean sum algorithm, show that though
each trip through the for-loop may change the values of p and q, in exact arithmetic it never
changes the value of the loop invariant p2 + q 2 .
p
Problem 2.3.13. Design a pseudocode to compute d = x2 + y 2 + z 2 . If floating–point
underflow occurs it should be harmless, and floating–point overflow should occur only when
the exact value of d overflows.

2.3.7 Roots of a Quadratic Equation

We aim to design a reliable algorithm to determine the roots of the quadratic equation

ax2 2bx + c = 0

(Note that the factor multiplying x is 2b and not b as in the standard notation.) We assume that
the coefficients a, b and c are real numbers, and that the coefficients are chosen so that the roots
are real. The familiar quadratic formula for these roots is
p
b± d
x± =
a
where d ⌘ b2 ac is the discriminant, assumed nonnegative here. The discriminant d is zero if
there is a double root. Here are some problems that can arise.
First, the algorithm should check for the special cases a = 0, b = 0 or c = 0. These cases are
trivial to eliminate. When a = 0 the quadratic equation degenerates p cto a linear equation whose
c
solution 2b requires at most a division. When b = 0 the roots are ± a . When c = 0 the roots are
0 and 2ba . In the latter two cases the roots will normally be calculated more accurately using these
special formulas than by using the quadratic formula.
Second, computing d can lead to either floating–point underflow or floating–point overflow, typi-
cally when either b2 or ac or both are computed. This problem can be eliminated using the technique
described in the previous section; scale the coefficients a, b and c by a power of two chosen so that
b2 and ac can be safely computed.
Third, the computation can su↵er from catastrophic cancelation when either
• the roots are nearly equal, i.e., ac ⇡ b2
• the roots have significantly di↵erent magnitudes, i.e., |ac| ⌧ b2
When ac ⇡ b2 there is catastrophic cancelation when computing d. This may be eliminated by
computing the discriminant d = b2 ac using higher precision arithmetic, when possible. For
example, if a, b and c are SP numbers, then we can use DP arithmetic to compute d. If a, b and c
are DP numbers, maybe we can use QP (quadruple, extended, precision) arithmetic to compute d;
Matlab does not provide QP. The aim when using higher precision arithmetic is to discard as few
digits as possible of b2 and ac before forming b2 ac. For many arithmetic processors this is achieved
without user intervention. That is, the discriminant is calculated to higher precision automatically,
by computing the value of the whole expression b2 ac in higher precision before rounding. When
cancelation is inevitable, we might first use the quadratic formula to compute approximations to the
roots and then usingp a Newton iteration (see Chapter 6) to improve
p these approximations.
When |ac| ⌧ b2 , d ⇡ |b| and one of the computations b ± d su↵ers from catastrophic cance-
lation. To eliminate this problem note that
p p
b+ d c b d c
= p , = p
a b d a b+ d
So, when b > 0 use p
b+ d c
x+ = , x = p
a b+ d
46 CHAPTER 2. COMPUTING WITH FLOATING POINT NUMBERS

and when b < 0 use p

c b d
x+ = p , x =
b d a
The following problems are intended as a “pencil and paper” exercises.

Problem 2.3.14. Show that a = 2049, b = 4097 and c = 8192 are SP numbers. Use
Matlab SP arithmetic to compute the roots of ax2 2bx + c = 0 using the quadratic
formula.

Problem 2.3.15. Show that a = 1, b = 4096 and c = 1 are SP numbers. Use Matlab SP
arithmetic to compute the roots of ax2 2bx + c = 0 using the quadratic formula.

2.4 Matlab Notes

Matlab provides a variety of data types, including integers and floating–point numbers. By default,
all floating–point variables and constants, and the associated computations, are held in double
precision. Indeed, they are all complex numbers but this is normally not visible to the user when
using real data. However, there are cases where real data can lead to complex number results,
for example the roots of a polynomial with real coefficients can be complex and this is handled
seamlessly by Matlab. The most recent versions of Matlab (version 7.0 and higher) allow for the
creation of single precision variables and constants. Matlab also defines certain parameters (such
as machine epsilon) associated with floating–point computations.
In addition to discussing these issues, we provide implementation details and several exercises
associated with the examples discussed in Section 2.3.

2.4.1 Integers in Matlab

Matlab supports both signed and unsigned integers. For each type, integers of 8, 16, 32 and 64
bits are supported. Matlab provides functions to convert a numeric object (e.g. an integer of
another type or length, or a double precision number) into an integer of the specified type and
length; e.g,. int8 converts numeric objects into 8–bit signed integers and uint16 converts numeric
objects into 16–bit unsigned integers. Matlab performs arithmetic between integers of the same
type and between integers of any type and double precision numbers.

2.4.2 Single Precision Computations in Matlab

By default, Matlab assumes all floating–point variables and constants, and the associated compu-
tations, are double precision. To compute using single precision arithmetic, variables and constants
must first be converted using the single function. Computations involving a mix of SP and DP
variables generally produce SP results. For example,

theta1 = 5*single(pi)/6
s1 = sin(theta1)
produces the SP values theta1= 2.6179941 and s1= 0.4999998. Because we specify single(pi),
the constants 5 and 6 in theta1 are assumed SP, and the computations use SP arithmetic.
As a comparison, if we do not specify single for any of the variables or constants,

theta2 = 5*pi/6
s2 = sin(theta2)
then Matlab produces the DP values theta2= 2.617993877991494 and s2= 0.50000000000000.
However, if the computations are written
2.4. MATLAB NOTES 47

theta3 = single(5*pi/6)
s3 = sin(theta3)
then Matlab produces the values theta3= 2.6179938, and s3= 0.5000001. The computation
5*pi/6 uses default DP arithmetic, then the result is converted to SP.

2.4.3 Special Constants

A nice feature of Matlab is that many parameters discussed in this chapter can be generated
easily. For example, eps is a function that can be used to compute ✏DP and ulpDP (y). In particular,
eps(1), eps(’double’) and eps all produce the same result, namely 2 52 . However, if y is a DP
number, then eps(y) computes ulpDP (y), which is simply the distance from y to the next largest (in
magnitude) DP number. The largest and smallest positive DP numbers can be computed using the
functions realmax and realmin, respectively.
As with DP numbers, the functions eps, realmax and realmin can be used with SP num-
bers. For example, eps(’single’) and eps(single(1)) produce the result 2 23 . Similarly,
realmax(’single’) and realmin(’single’) return, respectively, the largest and smallest SP
floating–point numbers.
Matlab also defines parameters for 1 (called inf or Inf) and NaN (called nan or NaN). It is
possible to get, and to use, these quantities in computations. For example:
• The computation 1/0 produces Inf.
• The computation 1/Inf produces 0.

• Computations of indeterminate quantities, such as 0*Inf, 0/0 and Inf/Inf produce NaN.

Problem 2.4.1. In Matlab, consider the anonymous functions:

f = @(x) x ./ (x.*(x-1));
g = @(x) 1 ./ (x-1);
What is computed by f(0), f(eps), g(0), g(eps), f(Inf), and g(Inf)? Explain these
results.

2.4.4 Floating-Point Numbers in Output

When displaying output, Matlab rounds floating–point numbers to fit the number of digits to be
displayed. For example, consider a Matlab code that sets a DP variable x to the value 1.99 and
prints it twice, first using a floating–point format with 1 place after the decimal point and second
using a floating–point format with 2 places after the decimal point. Matlab code for this is:

x = 1.99;
fprintf(’Printing one decimal point produces %3.1f \n’, x)
fprintf(’Printing two decimal points produces %4.2f \n’, x)
The first value printed is 2.0 while the second value printed is 1.99. Of course, 2.0 is not the true
value of x. The number 2.0 appears because 2.0 represents the true value rounded to the number
of digits displayed. If a printout lists the value of a variable x as precisely 2.0, that is it prints just
these digits, then its actual value may be any number in the range 1.95  x < 2.05.
A similar situation occurs in the Matlab command window. When Matlab is started, numbers
are displayed on the screen using the default “format” (called short) of 5 digits. For example, if we
set a DP variable x = 1.99999, Matlab displays the number as

2.0000
48 CHAPTER 2. COMPUTING WITH FLOATING POINT NUMBERS

More correct digits can be displayed by changing the format. So, if we execute the Matlab statement

format long
then 15 digits are displayed; that is, the number x is displayed as
1.99999000000000
Other formats can be set; see doc format for more information.

Problem 2.4.2. Using the default format short, what is displayed in the command window
when the following variables are displayed?
x = exp(1)
y = single(exp(1))
z = x - y
Do the results make sense? What is displayed when using format long?

Problem 2.4.3. Read Matlab’s help documentation on fprintf. In the example Matlab
code given above, why did the first fprintf command contain \n? What happens if this is
omitted?
Problem 2.4.4. Determine the di↵erence between the following fprintf statements:

fprintf(’%6.4f \n’,pi)
fprintf(’%8.4f \n’,pi)
fprintf(’%10.4f \n’,pi)
fprintf(’%10.4e \n’,pi)
In particular, what is the significance of the numbers 6.4, 8.4, and 10.4, and the letters f
and e?

2.4.5 Examples
Section 2.3 provided several examples that illustrate difficulties that can arise in scientific computa-
tions. Here we provide Matlab implementation details and several associated exercises.

Plotting a Polynomial
Consider the example from Section 2.3.1, where a plot of

p(x) = (1 x)10
= x10 10 x9 + 45 x8 120 x7 + 210 x6 252 x5 + 210 x4 120 x3 + 45 x2 10 x + 1

on the interval [0.99, 1.01] is produced using the power series form of p(x). In Matlab we use
linspace, plot, and a very useful function called polyval, which is used to evaluate polynomials.
Specifically, the following Matlab code is used to produce the plot shown in Fig. 2.1:

x = linspace(0.99, 1.01, 101);

c = [1, -10, 45, -120, 210, -252, 210, -120, 45, -10, 1];
p = polyval(c, x);
plot(x, p)
xlabel(’x’), ylabel(’p(x)’)

Of course, since we know the factored form of p(x), we can use it to produce an accurate plot:
2.4. MATLAB NOTES 49

x = linspace(0.99, 1.01, 101);

p = (1 - x).^(10);
plot(x, p)
Not all polynomials can be factored easily, and it may be necessary to use the power series form.

Problem 2.4.5. Use the code given above to sketch y = p(x) for values x 2 [0.99, 1.01]
using the power form of p(x). Pay particular attention to the scaling of the y-axis. What is
the largest value of y that you observe? Now modify the code to plot on the same graph the
factored form of y = p(x) and to put axes on the graph. Can you distinguish the plot of the
factored form of y = p(x)? Explain what you observe.
Problem 2.4.6. Construct a figure analogous to Fig. 2.1, but using SP arithmetic rather
than DP arithmetic to evaluate p(x). What is the largest value of y that you observe in this
case?

Repeated Square Roots

The following Matlab code performs the repeated square roots experiment outlined in Section 2.3.2
on a vector of 10 equally spaced values of x using 100 iterations (use a script M–file):
x = 1:0.25:3.75;
n = 100
t = x;
for i = 1:n
t = sqrt(t);
end
for i = 1:n
t = t .* t;
end
disp(’ x t x-t ’)
disp(’==============================’)
for k = 1:10
disp(sprintf(’%7.4f %7.4f %7.4f’, x(k), t(k), x(k)-t(k)))
end
When you run this code, the disp and sprintf commands are used to display, in the command
window, the results shown in Table 2.1. The command sprintf works just like fprintf but prints
the result to a Matlab string (displayed using disp) rather than to a file or the screen.

Problem 2.4.7. Using the Matlab script M–file above, what is the smallest number of
iterations n for which you can reproduce the whole of Table 2.1 exactly?
Problem 2.4.8. Modify the Matlab script M–file above to use SP arithmetic. What is the
smallest number of iterations n for which you can reproduce the whole of Table 2.1 exactly?

Estimating the Derivative

The following problems use Matlab for experiments related to the methods of Section 2.3.3

Problem 2.4.9. Use Matlab (with its default DP arithmetic), f (x) ⌘ sin(x) and
x = 1 radian. Create a three column table with column headers “n”, “ log10 (2r)”, and
“ log10 (2R)”. Fill the column headed by “n” with the values 1, 2, · · · , 51. The remaining
entries in each row should be filled with values computed using h = 2 n . For what value of
n does the forward di↵erence estimate DP f (x) of the derivative f 0 (x) achieve its maximum
accuracy, and what is this maximum accuracy?
50 CHAPTER 2. COMPUTING WITH FLOATING POINT NUMBERS

Problem 2.4.10. In Matlab, open the help browser, and search for single precision
mathematics. This search can be used to find an example of writing M–files for di↵erent
data types. Use this example to modify the code from Problem 2.4.9 so that it can be used
for either SP or DP arithmetic.

A Recurrence Relation
The following problems use Matlab for experiments related to the methods of Section 2.3.4 As
mentioned at the end of Section 2.3.4, we emphasize that we do not recommend the use of the
recurrence for evaluating the integrals V̂j . These integrals may be almost trivially evaluated using
the Matlab function integral; see Chapter 5 for more details.

Problem 2.4.11. Reproduce the Table 2.2 using Matlab single precision arithmetic.
Problem 2.4.12. Repeat the above analysis, but use Matlab with its default DP arith-
metic, to compute the sequence {V̂j }23 j=0 . Generate a table analogous to Table 2.2 that
V̂j
displays the values of j, V̂j , and . Hint: Assume that ✏, the error in the initial value V̂0 ,
j!
has a magnitude equal to half a ulp in the DP value of V0 . Using this estimate, show that
j!✏ > 1 when j 19.
Problem 2.4.13. Use the Matlab DP integration function integral to compute the cor-
rect values for Vj to the number of digits shown in Table 2.2
Problem 2.4.14. Consider the sequence of numbers:

xk+2 = 2.25xk+1 0.5xk , k = 1, 2, . . . (2.6)

with starting values of x1 = 1/3 and x2 = 1/12. It can be shown that this sequence of
numbers can also be written as:
41 k
xk = , k = 1, 2, . . . (2.7)
3
Looking at (2.7), what can you say about the terms xk as k increases?
(a) Write a MATLAB code that will generate the two sequences of numbers given in (2.6)
and (2.7). Your code should be written as a function m-file, with input n (number of points
to generate) and output two vectors containing values x1 , x2 , . . . , xn for each of (2.6) and
(2.7).
(b) Run the code using n = 60, and plot the resulting xk values using MATLAB’s semilogy
function. Plot both sets of points on the same axes, using di↵erent symbols and colors (e.g.,
blue circles for (2.6) and red diamonds for (2.7)).
(c) Do the points generated from your code behave as you expect, considering what you found
in part (a)? Can you explain your results?

Summing the Exponential Series

The following problems use Matlab for experiments related to the methods of Section 2.3.5

Problem 2.4.15. Implement the scheme described in Section 2.3.5, Note 1, in Matlab
DP arithmetic to compute e 20.5 .
Problem 2.4.16. Implement the scheme described in Section 2.3.5, Note 2, in Matlab
DP arithmetic to compute e 20.5 .
2.4. MATLAB NOTES 51

Euclidean Length of a Vector

The following problem uses Matlab for experiments related to the methods of Section 2.3.6.

Problem 2.4.17. Write a Matlab  DP program

 that uses the Moler-Morrison algorithm
a 1
to compute the length of the vector = . Your code should display the values of p
b 1
and q initially as well as at the end of each trip through the for loop. It should also display
the ratio of the new value of q to the cube of its previous value. Compare the value of these
1
ratios to the value of .
4(a2 + b2 )

a
Problem 2.4.18. Use DP arithmetic to compute the lengths of the vectors =
   b
60 60
10 a 10
and = using both the unscaled and scaled formulas for p. Use
1061 b 10 61
the factor c = max(|a|, |b|) in the scaled computation.

Roots of a Quadratic Equation

The following problem uses Matlab for experiments related to the methods of Section 2.3.7.
Note that Matlab does not provide software specifically for computing the roots of a quadratic
equation. However, it provides a function roots for calculating the roots of general polynomials
including quadratics; see Section 6.5 for details.

Problem 2.4.19. Write a Matlab function to compute the roots of the quadratic equation
ax2 2bx + c = 0 where the coefficients a, b and c are SP numbers. As output produce SP
values of the roots. Use DP arithmetic to compute d = b2 ac. Test your program on the
cases posed in Problems 2.3.14 and 2.3.15

Problem 2.4.20. This problem considers two di↵erent approaches to compute the roots of
a quadratic polynomial, p(x) = ax2 + bx + c . We know that the values r such that p(r) = 0,
are given by the quadratic formula:
p
b ± b2 4ac
r= .
2a
(a) A naive Matlab implementation to compute the roots might look like the function
QuadFormula1 given below. On your computer, create the Matlab function m-file
QuadFormula1.m, using this code. You should type in the comment lines as well.
(b) Test your code using two simple problems:

p(x) = x2 3x + 2
160 2
p(x) = 10 x 3 ⇤ 10160 x + 2 ⇤ 10160

Does this code compute good approximations of the true roots for this problem?

(c) Briefly explain why you did not get good approximations for the roots in one of the
polynomials, and explain how you can fix it.
Copy the code in QuadFormula1.m into a new function called QuadFormula2.m, and
modify the code so that it implements your fix. Use QuadFormula2.m to compute the
roots from the above examples and show that your new code obtains accurate approxi-
mations for both of these polynomials.
52 CHAPTER 2. COMPUTING WITH FLOATING POINT NUMBERS

function r = QuadFormula1(coeffs)
%
% Given coefficients of a quadratic polynomial, p(x), this function
% computes the roots of p(x) = 0 using the quadratic formula.
%
% Input: coeffs - vector containing the coefficents of p(x),
% [a, b, c], where p(x) = a*x^2 + b*x + c.
%
% Output: r - vector containing the two roots of p(x) = 0.
%
%
r = zeros(2,1);
a = coeffs(1);
b = coeffs(2);
c = coeffs(3);
d = sqrt(b^2 - 4*a*c);
a2 = 2*a;
r(1) = (-b + d)/a2;
r(2) = (-b - d)/a2;

Unit 2
No ratings yet
Unit 2
85 pages
BSBINM601 Assessment1
100% (2)
BSBINM601 Assessment1
3 pages
07 Datarepresentation 150216185458 Conversion Gate02
No ratings yet
07 Datarepresentation 150216185458 Conversion Gate02
43 pages
Sports Club Applications System - Mostafa Okasha
No ratings yet
Sports Club Applications System - Mostafa Okasha
183 pages
Symbol Scanner
No ratings yet
Symbol Scanner
384 pages
Chapter1 -Introduction & Number Theory
No ratings yet
Chapter1 -Introduction & Number Theory
139 pages
Chapter 3 - Data Representation Section 3.1 - Data Types
No ratings yet
Chapter 3 - Data Representation Section 3.1 - Data Types
14 pages
Filter Dropdown Keyboard Shortcuts
No ratings yet
Filter Dropdown Keyboard Shortcuts
72 pages
Numerical Analysis - Patel
No ratings yet
Numerical Analysis - Patel
16 pages
EA1 Ch3 2
No ratings yet
EA1 Ch3 2
154 pages
Representations PDF
No ratings yet
Representations PDF
14 pages
COS1521-foundations_of_computer_science_-chapter_3
No ratings yet
COS1521-foundations_of_computer_science_-chapter_3
78 pages
Unit III CAO
No ratings yet
Unit III CAO
39 pages
Machine Level Representation of Data Part 3
100% (1)
Machine Level Representation of Data Part 3
32 pages
Chapter 2: Data Representation
No ratings yet
Chapter 2: Data Representation
32 pages
LSP Project Presentation
No ratings yet
LSP Project Presentation
23 pages
233 Notes Section1
No ratings yet
233 Notes Section1
125 pages
Part 1
No ratings yet
Part 1
33 pages
COA UNIT-III PPTs Dr.G.Bhaskar ECE
No ratings yet
COA UNIT-III PPTs Dr.G.Bhaskar ECE
64 pages
Lecture11 Slides 1
No ratings yet
Lecture11 Slides 1
52 pages
Lecture 8- Artificial Neural Networks
No ratings yet
Lecture 8- Artificial Neural Networks
41 pages
Chapter 3
No ratings yet
Chapter 3
72 pages
Integer Representation
No ratings yet
Integer Representation
34 pages
C Practical File
No ratings yet
C Practical File
34 pages
Chiu, Shu-Ching 2016
No ratings yet
Chiu, Shu-Ching 2016
9 pages
Fixed _And_Floating_Point_representation
No ratings yet
Fixed _And_Floating_Point_representation
40 pages
Slide8-Number Systems and Number Representations
No ratings yet
Slide8-Number Systems and Number Representations
24 pages
(Turner) - Applied Scientific Computing - Chap - 02
No ratings yet
(Turner) - Applied Scientific Computing - Chap - 02
19 pages
Lect4 Floats
No ratings yet
Lect4 Floats
64 pages
Chapter 3 (Comp. Arch)
No ratings yet
Chapter 3 (Comp. Arch)
31 pages
COMPX203 Computer Systems: Number Representation
No ratings yet
COMPX203 Computer Systems: Number Representation
33 pages
Computations in Mechanical Engineering: Numbers and Vectors
No ratings yet
Computations in Mechanical Engineering: Numbers and Vectors
18 pages
Chapter1 2
No ratings yet
Chapter1 2
66 pages
Cao Iii PDF
No ratings yet
Cao Iii PDF
16 pages
SDA SCL: //wire - Begin //connects I2C // Serial - Begin (9600)
No ratings yet
SDA SCL: //wire - Begin //connects I2C // Serial - Begin (9600)
6 pages
Resume - Rahul - Mishra
No ratings yet
Resume - Rahul - Mishra
2 pages
L-5 Floating Point Representation of Numbers
No ratings yet
L-5 Floating Point Representation of Numbers
21 pages
Computer Arithmetic Representations
No ratings yet
Computer Arithmetic Representations
24 pages
Addition in Binary and Hexadecimal: 0 + 0 0 0 + 1 1 1 + 0 1 1 + 1 0 Carry 1
No ratings yet
Addition in Binary and Hexadecimal: 0 + 0 0 0 + 1 1 1 + 0 1 1 + 1 0 Carry 1
15 pages
Week8 Slides
No ratings yet
Week8 Slides
43 pages
Chapter 2: Computer Memory and Storage, Representing Numbers, Random Numbers
No ratings yet
Chapter 2: Computer Memory and Storage, Representing Numbers, Random Numbers
7 pages
CH03-Data-II(2) (2)
No ratings yet
CH03-Data-II(2) (2)
31 pages
Unit1 2
No ratings yet
Unit1 2
64 pages
Bits, Bytes, Integers, and Floats Notes
No ratings yet
Bits, Bytes, Integers, and Floats Notes
18 pages
Tin học đại cương - Unit 1 (part 2)
No ratings yet
Tin học đại cương - Unit 1 (part 2)
83 pages
Geological Database
No ratings yet
Geological Database
147 pages
Introduction To Numerical Computing: Statistics 580 Number Systems
No ratings yet
Introduction To Numerical Computing: Statistics 580 Number Systems
35 pages
Unit1_2
No ratings yet
Unit1_2
98 pages
11 Number Systems
No ratings yet
11 Number Systems
53 pages
Desktop Virtualization: Exploring New Technologies To Simplify and Stabilize Desktop Reliability and Support
No ratings yet
Desktop Virtualization: Exploring New Technologies To Simplify and Stabilize Desktop Reliability and Support
23 pages
ZNote for Theory A level
No ratings yet
ZNote for Theory A level
36 pages
Wa0018.
No ratings yet
Wa0018.
55 pages
Digital I Summaries - Appendix I - Bases, Conversions, Complements, Representations, and Operations
No ratings yet
Digital I Summaries - Appendix I - Bases, Conversions, Complements, Representations, and Operations
19 pages
8.3 Floating Point Numbers
No ratings yet
8.3 Floating Point Numbers
19 pages
3.1 Data Representation: 3.1.3 Real Numebrs and Normalized Floating-Point Representation
No ratings yet
3.1 Data Representation: 3.1.3 Real Numebrs and Normalized Floating-Point Representation
14 pages
CH08.2-Computer Arithmetic
No ratings yet
CH08.2-Computer Arithmetic
14 pages
Fourier
No ratings yet
Fourier
13 pages
w4 One PDF
No ratings yet
w4 One PDF
40 pages
Cluster Issues
No ratings yet
Cluster Issues
10 pages
Binary PDF
No ratings yet
Binary PDF
8 pages
Bsc. (Hons) Web Technologies: Cohort: Bwt/08/Ft
No ratings yet
Bsc. (Hons) Web Technologies: Cohort: Bwt/08/Ft
6 pages
CNSSP 7 20151120 Final
No ratings yet
CNSSP 7 20151120 Final
11 pages
Unit 3 (Coa) Notes
No ratings yet
Unit 3 (Coa) Notes
29 pages
Advanced Computational Methods: ENGR 680
No ratings yet
Advanced Computational Methods: ENGR 680
19 pages
Chapter 2-Data Representation
No ratings yet
Chapter 2-Data Representation
20 pages
Computer Arithmetic Representations
No ratings yet
Computer Arithmetic Representations
24 pages
Isaba Siwes Report
No ratings yet
Isaba Siwes Report
15 pages
Apply computer software in solving tasks
No ratings yet
Apply computer software in solving tasks
7 pages
11 MD Zakir Hussain
No ratings yet
11 MD Zakir Hussain
6 pages
Test
No ratings yet
Test
7 pages
COA Chapter 3
No ratings yet
COA Chapter 3
23 pages
Chap 02
No ratings yet
Chap 02
16 pages
L2-Variables and Floating Point Number System
No ratings yet
L2-Variables and Floating Point Number System
38 pages
Volvo BL71 PLUS Backhoe Loader Service Repair Manual INSTANT DOWNLOAD PDF
60% (5)
Volvo BL71 PLUS Backhoe Loader Service Repair Manual INSTANT DOWNLOAD PDF
2 pages
ZNote for Theory
No ratings yet
ZNote for Theory
36 pages
QG 312B 332R5 R6
No ratings yet
QG 312B 332R5 R6
2 pages
PZEM 004T V3.0 Datasheet User Manual
No ratings yet
PZEM 004T V3.0 Datasheet User Manual
7 pages
Murphy 50700597
No ratings yet
Murphy 50700597
4 pages
COMP204 Tutorial 3
No ratings yet
COMP204 Tutorial 3
4 pages
Psychology of Problem Solving The instant download
No ratings yet
Psychology of Problem Solving The instant download
16 pages
FAARFIELD 2.0.7 Readme 2021-09-14
No ratings yet
FAARFIELD 2.0.7 Readme 2021-09-14
5 pages
2 CS1FC16 Information Representation
No ratings yet
2 CS1FC16 Information Representation
4 pages
coa unit1 chapter2
No ratings yet
coa unit1 chapter2
8 pages
ARCh Presentation1
No ratings yet
ARCh Presentation1
12 pages
Computer Architecture
No ratings yet
Computer Architecture
5 pages
DSP Practical
No ratings yet
DSP Practical
2 pages
1.1 The Laws of Exponents
No ratings yet
1.1 The Laws of Exponents
5 pages
1 Commission Electricalelectronic Equipment Systems
No ratings yet
1 Commission Electricalelectronic Equipment Systems
33 pages
Dell's Customer Contact Centres in India
No ratings yet
Dell's Customer Contact Centres in India
13 pages
Fast mental calculation tricks
From Everand
Fast mental calculation tricks
EasyMath
No ratings yet