2. Numerical errors
Numerical methods are used to find an approximation of the solution of a mathematical problem with
an error that falls below a desired user-defined accuracy. In this chapter we will discuss why numerical
methods can only provide approximations. In engineering and science, when one develops a
mathematical model to describe some reality, one must be aware of the many possible error sources. To
start with, the model itself is an approximation of this reality. Hence, inherently the model cannot be
fully accurate, for example because the assumptions made to derive the model may be wrong. Also,
when developing the model, mathematical mistakes may be introduced, and the parameters needed to feed the model are usually known only approximately. Programming errors may also occur when
coding the process that solves the model. Note that all the aforementioned errors are not specific to
numerical methods; they occur in any field of engineering and science. In the remainder of this
book, we will assume that these errors are either avoided, negligible or quantifiable. However,
there are two categories of errors that are specific to the use of numerical methods. These two error
types are known as round-off errors and truncation errors. Round-off errors are due to the limited
number of digits that can be stored in memory. Truncation errors appear because we solve a simplified
and approximated problem in lieu of our actual problem. Both round-off and truncation errors will be
defined and discussed in detail in this chapter. As we will see, round-off errors cannot be avoided.
Truncation errors can be avoided for some algorithms, but in most cases both types of errors are
present.

2.1. Round-off errors


When executing numerical calculations in a computer, only a finite number of digits can be stored in the
computer memory. As a consequence, regardless of the number of stored digits, they constitute an
approximation of the actual mathematical number that should have been used in calculations. The
number of digits that can be processed depends on the size of the memory allocated for storage. A
common case is to use the so-called double precision which roughly corresponds to storing 16 decimal
digits. For example, if the number √2 has to be stored, only the first 16 decimal digits are actually in
memory. The difference between the actual number (√2 in our case) and the stored number (the first
16 digits of √2) is called the round-off error.

DEFINITION 2-1. ROUND-OFF ERRORS


The difference between the actual mathematical number 𝑟 and its representation with a finite number
of digits is called the round-off error of 𝑟.

The difficulties begin as soon as a finite number of digits are used to represent numbers to perform
calculations. Let us illustrate this with an example.

EXAMPLE 2-1. ROUND-OFF ERRORS


Consider the following Octave code which aims to compute the quantity 𝑎 = 0.2 + 0.2 + 0.2:
>> a = 0.2+0.2+0.2
a = 0.60000
As expected, Octave attributes the value 0.6 to the variable 𝑎. But does it really assign 0.6 to 𝑎? Let us
verify this by computing 𝑎 − 0.6:
>> a-0.6
ans = 1.1102e-16
The answer is not equal to zero! Even if it is very close to zero, there is a difference.

Example 2-1 shows that adding three numbers can result in an erroneous answer. The reason behind
this phenomenon is the accumulation of round-off errors during the additions. To understand how these
round-off errors appear, a closer look at how numbers are stored in a computer is required. A computer
stores numbers using base two. Technically this consists of a sequence of bits 𝑏𝑖, where a bit has only
two states (0 or 1). This is commonly referred to as binary code. For example, the binary number
𝑥 = 10100 consists of 5 bits and equals the decimal number 20, as can be verified as follows:

𝑥 = 2⁴ + 0 + 2² + 0 + 0 = 20
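As a quick check of our own (not part of the original text), Octave's built-in bin2dec function converts such a string of bits into its decimal value:
>> bin2dec('10100')
ans = 20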
The IEEE754 standard specifies how decimal numbers are expressed as binary numbers in a computer
using floating-point numbers or floating-point representation. A floating-point number has three parts: a
sign (either positive or negative), a mantissa (a sequence of 𝑁 bits 𝑏𝑖 ) and an exponent 𝑝 :

±𝑏₁𝑏₂ … 𝑏_𝑁 × 2^𝑝
Depending on how many bits are used to store the mantissa and exponent, different families of
precision are defined in the IEEE standard. Table 2-1 lists these defined families and specifies how many
bits are used for each part of the floating-point number1. For example, the popular double precision
floating-point number has a total of 64 bits (or 8 bytes, as by definition one byte is a sequence of 8 bits).
The first bit is used to encode the sign, the next 11 bits are for the exponent, and the remaining 52 are
used to store the mantissa.

A double precision floating-point number can hold approximately 16 decimal digits. Indeed, typically
about 3.25 bits are needed to encode a decimal digit (2^3.25 ≈ 10). As double precision has memory to
store 52 bits for the mantissa, this gives about 52/3.25 = 16 decimal digits.

Precision      Sign   Exponent   Mantissa
Single          1        8          23
Double          1       11          52
Long double     1       15          64

Table 2-1 IEEE standard for the different precisions of floating-point numbers. The table lists the number of bits used for each part of
the floating-point number.

Obviously, without any further rules, the decision on which part goes to the mantissa or to the exponent
is open. For example, one could encode the decimal number 12.34 as 12.34 × 10⁰, 1.234 × 10¹ or
0.1234 × 10² (and of course many other options). The IEEE standard defines which one of the
enumerated choices is actually used to store a number in a computer. The interested reader may learn
more about how floating-point numbers are handled in a computer by reading the IEEE754 standard.
The technical details, as we will realize, are not important for our purpose.

1
The IEEE 754 standard defines as well how the exponent (and its sign) is encoded in bits. For the purpose of our
presentation we do not have to know how this is done technically, and we can simply consider the exponent to be
encoded as a decimal number 𝑝 with a sign. The interested reader is referred to the IEEE 754 standard.
Expanding further on the example of representing the number 12.34 allows us to understand the
terminology “floating-point” number: by keeping the same mantissa one can move around the location
of the decimal point.

Let us go back to Example 2-1. The number 0.2, in binary code, would actually need an infinite number
of digits to represent it. To inspect the binary representation of a decimal number with Octave, one can
use the format bit command:
>> format bit
>> 0.2
ans = 0011111111001001100110011001100110011001100110011001100110011010
In fact, the binary representation of the decimal number 0.2 contains an infinitely repeating pattern (0.2 = 0.001100110011… in base two, with the group 0011 repeating forever), a situation similar to
1⁄3 = 0.333… for numbers represented in base ten. Consequently, it is not possible, using a finite number
of digits, to represent the decimal number 0.2 in binary code without round-off error. As one adds
the number 0.2 several times, the round-off error accumulates and ends up becoming significant.
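A minimal sketch of our own makes this accumulation visible: repeating the addition many times lets the tiny representation error of 0.2 build up (the exact residual is not reproduced here, as it depends on the number of additions):
>> s = 0;
>> for i = 1:10000, s = s + 0.2; end
>> s - 2000     % not exactly zero: the round-off error of 0.2 has accumulated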

The takeaway here is that one should never trust a numerical calculation. One must always be aware
that a numerical calculation is actually wrong (from a pure mathematical point of view) and represents
only an approximation. We like to refer to this fact as the “golden rule of numerical methods”:

GOLDEN RULE OF NUMERICAL METHODS


Any numerical calculation conducted with floating-point numbers is only an approximation and never
free of round-off errors. Never assume a numerical calculation is error free.

At this point it is useful to introduce the concept of significant digits. Not all authors define this concept
the same way; in the frame of this book we adopt Definition 2-1. Note that we will use the terms
significant digits and significant figures interchangeably.

DEFINITION 2-1. SIGNIFICANT DIGITS


Any digit of a number, beginning with the digit farthest to the left that is not zero and ending with the
last digit farthest to the right that is either not zero or that is a zero but is considered to be exact, is
called a significant digit.

To get familiar with this definition, it is useful to study Example 2-2.

EXAMPLE 2-2. SIGNIFICANT DIGITS


The number 20.1 has three significant digits which are 2, 0 and 1.

The number 0.0201 has as well three significant digits which again are 2, 0 and 1. This shows that
leading zeros are simply ignored when identifying significant digits.

The number 2.010 has four significant digits which are 2, 0, 1 and 0. Here the last 0 is significant
(otherwise one would have written 2.01).

A good way to understand the concept of significant digits is to keep in mind that numbers usually
represent physical quantities associated with a physical unit. If, for example, the number 20.1 represents a
distance of 20.1 meters, then changing its unit to centimeters or kilometres (which will result in moving
the decimal point) will not change the knowledge we have of this distance, and consequently will not
change which digits are significant.

2.1.1. Propagation of round-off errors


Round-off errors enter into a calculation in a complex way and their effect on the final result depends on
how the different operations are executed. Example 2-3 gives an illustration of this fact.

EXAMPLE 2-3. PROPAGATION OF ROUND-OFF ERRORS


Consider the following two functions:

(2-1) 𝑓(𝑥) = 𝑥(√(𝑥 + 1) − √𝑥)

(2-2) 𝑔(𝑥) = 𝑥 / (√(𝑥 + 1) + √𝑥)

It is straightforward to prove that 𝑓(𝑥) = 𝑔(𝑥) for any 𝑥 where 𝑓 and 𝑔 are defined (multiply 𝑓(𝑥) by
(√(𝑥 + 1) + √𝑥)⁄(√(𝑥 + 1) + √𝑥) and simplify using (𝑥 + 1) − 𝑥 = 1). Let us evaluate
these two functions numerically at 𝑥 = 500. We start by defining the functions 𝑓 and 𝑔 in Octave:
>> f = @(x) x*(sqrt(x+1)-sqrt(x));
>> g = @(x) x/(sqrt(x+1)+sqrt(x));
We used here the concept of anonymous functions to define them in Octave. Let us now evaluate these
two functions at 𝑥 = 500:
>> f(500)
ans = 11.175
>> g(500)
ans = 11.175
They seem to agree as expected. But if we display all digits that are stored inside our computer (which is
achieved with the format long command), we observe that actually they do not agree:
>> format long
>> f(500)
ans = 11.17475530074685
>> g(500)
ans = 11.17475530074720
The last three digits differ. This shows that at least one of the numerical function evaluations must be
wrong (maybe even both are wrong). At this stage it is impossible to know which one is correct (if any).

At this point it should become clear to the reader that round-off errors can accumulate during
calculations and result in inaccurate answers. Another phenomenon that can take place is known as loss
of significance. This happens when, for example, two numbers are added, both of which may be free of
round-off errors, but their sum requires more digits than can possibly be stored in the computer
memory. For example, let’s imagine we are using a hypothetical computer that can only store four digits
at a time. Let’s also assume, for the sake of this example, that this hypothetical computer stores numbers
using base 10. Consider the two decimal numbers 1234 and 0.12. Both numbers can be stored
without round-off error, but their sum (which would be 1234.12) cannot be stored free of round-off
error.

In numerical algorithms, loss of significance often happens when subtracting numbers that are similar.
For illustrative purposes, let’s imagine we are using a hypothetical computer that can only store 3 digits
in its mantissa. Again, for the sake of this example, let’s assume the numbers are stored in base ten.
Using this hypothetical computer, let’s perform the operation √4.01 − 2. As only 3 decimal
digits can be stored, √4.01 = 2.00249… is rounded to 2.00 and the result becomes:

√4.01 − 2 = 2.00 − 2.00 = 0.00 (keeping only 3 significant digits)

In this extreme example, we end up with a result which has all its significant digits wrong. One says that
all significant digits were lost.
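The following short sketch (our own illustration; the helper chop3 is hypothetical and merely mimics a base-ten machine with a 3-digit mantissa) reproduces this loss of significance in Octave:
% hypothetical helper: round a positive number to 3 significant decimal digits
chop3 = @(x) round(x ./ 10.^(floor(log10(x)) - 2)) .* 10.^(floor(log10(x)) - 2);
a = chop3(sqrt(4.01));   % stored as 2.00 on the 3-digit machine
b = chop3(2);            % stored as 2.00
a - b                    % the difference is 0: all significant digits are lost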

The crucial takeaway from the above examples is that whenever calculations involve floating-point
numbers, it is nearly certain that the result will be wrong. Any numerical calculation introduces round-
off errors, which are challenging to predict. Another way to express this fact is to recognize that any
calculations carried out with floating-point numbers will lead to the loss of some significant digits.
Observation 2-1 summarizes this discussion.

OBSERVATION 2-1. LOSING CORRECT DIGITS


If 𝑥 has 𝑚 correct digits, then in general 𝑓(𝑥) will have fewer than 𝑚 correct digits.

Since round-off errors are not avoidable in numerical calculations, the only solution is to learn to live
with them. The consequence is that we will have to develop methods able to provide estimates of errors
(recall that a good estimate is an estimation that is larger than or equal to the actual error). Section 2.3
will give a first set of methods for estimating errors.

2.1.2. Machine epsilon and indistinguishable numbers


A useful number to quantify the round-off errors is the machine epsilon 𝜀𝑚 . This number is defined as
follows2:

DEFINITION 2-2. MACHINE EPSILON


The machine epsilon is the smallest number 𝜀𝑚 such that for any number 𝑥 ≤ 0.5𝜀𝑚 one has, when
evaluating the expression numerically, 1 + 𝑥 = 1.

If calculations can be performed with infinite precision, then 𝜀𝑚 is equal to zero. But in numerical
calculations, where only a finite number of digits are available, 𝜀𝑚 will not be equal to zero. Octave,
which uses double precision floating-point numbers, stores 𝜀𝑚 in the built-in variable eps:
>> eps
ans = 2.220446049250313e-16
We can verify that indeed 1 + 𝑥 = 1 for any 𝑥 ≤ 0.5𝜀𝑚. The following lines of Octave code
demonstrate this by adding 10⁻¹⁷ to the number one:

2
The factor 0.5 in front of 𝜀𝑚 comes from the rounding convention Octave is using.
>> format long
>> x=1E-17;
>> 1+x
ans = 1
Octave computed that 1 + 𝑥, with 𝑥 = 10⁻¹⁷, is exactly equal to the decimal number 1. Numerically,
Octave is unable to distinguish between the two numbers 1 + 10⁻¹⁷ and 1.

Definition 2-2, which uses the number 1 as reference, can be generalized to an arbitrary number 𝛿.

THEOREM 2-1. INDISTINGUISHABLE NUMBERS


If 𝜀𝑚 is the value of the machine epsilon and 𝛿 a real number, then for any number 𝑥 ≤ 0.5 ∙ 𝛿 ∙ 𝜀𝑚 one
has 𝛿 + 𝑥 = 𝛿 when evaluated with floating-point numbers.

Let us illustrate Theorem 2-1 with the number 𝛿 = 3279. In this case, for any number 𝑥 smaller than
0.5 ∙ 3279 ∙ 𝜀𝑚 the relation 3279 + 𝑥 = 3279 holds. For example:
>> x = 0.2*3279*eps;
>> 3279+x == 3279
ans = 1
The reader is encouraged to repeat this example with other values of 𝑥 (some smaller and some larger
than 0.5 ∙ 3279 ∙ 𝜀𝑚 ).
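As a quick illustration of our own, one can pick one value of 𝑥 safely below the threshold and one safely above it:
>> 3279 + 0.1*3279*eps == 3279
ans = 1
>> 3279 + 3279*eps == 3279
ans = 0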

An alternative way to formulate Theorem 2-1 is summarized in Observation 2-2.

OBSERVATION 2-2. INDISTINGUISHABLE NUMBERS


If 𝜀𝑚 is the value of the machine epsilon, then the number 𝑥 and any number in the interval
[𝑥, 𝑥(1 + 0.5𝜀𝑚)] cannot be distinguished numerically.

<Figure 2-1 here>


Figure 2-1 Numerical evaluation, using double precision floating point numbers, of the quantity 𝑦 = 1 + ∆𝑥 − 1.

Figure 2-1 gives a visual representation of Theorem 2-1 and Observation 2-2. The result of the numerical
evaluation of 𝑦 = 1 + ∆𝑥 − 1 is plotted as a function of ∆𝑥. If 𝑦 could be evaluated with infinite
accuracy, then of course 𝑦 = ∆𝑥. But when using floating-point numbers, the situation is different. Due
to round-off errors, 1 + ∆𝑥 evolves stepwise with ∆𝑥, and the quantity 𝑦 = 1 + ∆𝑥 − 1 differs from ∆𝑥.
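The following short script (a minimal sketch of our own, not the book's code) reproduces the behaviour shown in Figure 2-1:
% evaluate y = 1 + dx - 1 for many small increments dx
dx = linspace(0, 5*eps, 1000);
y  = (1 + dx) - 1;           % double precision evaluation: y evolves stepwise
plot(dx, y, dx, dx, '--');   % the dashed line is the exact result y = dx
xlabel('\Delta x');
ylabel('y = 1 + \Delta x - 1');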

2.2. Truncation errors


Numerical algorithms are designed to find approximations to mathematical problems, even if a closed
form solution is unknown. Algorithms achieve this by replacing the problem to be solved with
another problem that approximates the original one. This approximated problem is chosen such
that it has a known closed form solution. The error introduced when replacing the original problem by
an approximated one is called the truncation error.

DEFINITION 2-3. TRUNCATION ERROR


The error introduced by replacing a mathematical problem by another problem is called truncation
error.
The word “truncation” can be misleading. First of all, it should not be confused with the concept of
round-off error. The “truncation” operation in Definition 2-3 does not refer to truncating a number of
digits. It refers rather to a common method used to approximate a problem: series expansion (for
example Taylor series) followed by truncating the series after a finite number of terms. Example 2-4
illustrates this idea in the case of a series expansion of the exponential function.

EXAMPLE 2-4. TRUNCATION ERROR


A well known series expansion of the exponential function is

(2-3) 𝑒^𝑥 = 1 + 𝑥 + 𝑥²/2! + 𝑥³/3! + ⋯ = ∑_{𝑛=0}^{∞} 𝑥^𝑛/𝑛!

To give an approximation of the exponential function, one can decide to include, for example, only the
first three terms of the series:

(2-4) 𝑒^𝑥 ≅ 1 + 𝑥 + 𝑥²/2!

The truncation error is then the difference between the exact value of 𝑒^𝑥 and the approximation
including only the first three terms of the series (as always, errors are measured in absolute values):

(2-5) Truncation error = |𝑒^𝑥 − (1 + 𝑥 + 𝑥²/2!)|

Note that the truncation error does not include the round-off errors. In other words, it is assumed that
the various expressions can all be evaluated with an infinite number of digits.
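As a small numerical illustration of our own (not part of the original example), the truncation error (2-5) can be evaluated directly in Octave, here at 𝑥 = 0.5:
>> x = 0.5;
>> abs(exp(x) - (1 + x + x^2/2))     % approximately 2.4e-2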

2.2.1. The truncation error of Taylor series


A common tool used in numerical methods to generate an approximated expression of an exact one is
the Taylor series expansion as stated in Theorem 2-2.

THEOREM 2-2. TAYLOR SERIES EXPANSION


Consider two real numbers 𝑥 and ℎ. If the function 𝑓 is 𝑘 + 1 times continuously differentiable in the
interval [𝑥, 𝑥 + ℎ], then there exists a number 𝜉 in [𝑥, 𝑥 + ℎ] such that

(2-6) 𝑓(𝑥 + ℎ) = 𝑓(𝑥) + 𝑓′(𝑥)ℎ + 𝑓″(𝑥)/2! ℎ² + ⋯ + 𝑓^(𝑘)(𝑥)/𝑘! ℎ^𝑘 + 𝑓^(𝑘+1)(𝜉)/(𝑘+1)! ℎ^(𝑘+1)

The term

(2-7) 𝑅 = 𝑓^(𝑘+1)(𝜉)/(𝑘+1)! ∙ ℎ^(𝑘+1)

is called the Taylor remainder of order 𝑘 + 1 and the polynomial


(2-8) 𝑇_𝑘(ℎ) = 𝑓(𝑥) + ℎ𝑓′(𝑥) + 𝑓″(𝑥)/2! ℎ² + ⋯ + 𝑓^(𝑘)(𝑥)/𝑘! ℎ^𝑘

is called the Taylor polynomial of order 𝑘 or the order 𝑘 approximation of the function 𝑓 at 𝑥 + ℎ.

Taylor series expansion is a powerful tool used to generate various approximations of mathematical
expressions. For example, if a function 𝑓 is known at some point 𝑥, one can obtain an approximation of
𝑓 at some other point 𝑥 + ℎ. Figure 2-2 illustrates this idea and suggests that the more terms are
included in the Taylor series expansion, the higher the accuracy of the approximation becomes. This is
correct when ℎ is sufficiently small (the term “sufficiently small” will be discussed below).

<Figure 2-2 here>


Figure 2-2 Graphical illustration of Taylor’s theorem. In general, as the order of the expansion increases, the approximation
becomes better.

Taylor’s theorem can also generate an expression for the truncation error 𝐸_trunc by evaluating the
remainder term of the series:

(2-9) 𝐸_trunc = |𝑓^(𝑘+1)(𝜉)/(𝑘+1)! ∙ ℎ^(𝑘+1)|

Even though 𝜉 ∈ [𝑥, 𝑥 + ℎ] is unknown, one can give an upper bound by considering the maximum of the Taylor
remainder over [𝑥, 𝑥 + ℎ]:

(2-10) 𝐸_trunc ≤ max_{𝜉∈[𝑥,𝑥+ℎ]} |𝑓^(𝑘+1)(𝜉)/(𝑘+1)! ∙ ℎ^(𝑘+1)|

Alternative ways of obtaining upper boundaries to the truncation error exist, but they are problem
specific. Example 2-5 illustrates such a case.

EXAMPLE 2-5. APPROXIMATING THE COSINE FUNCTION


The cosine function has no closed form expression that would allow its numerical evaluation. Taylor
series can be used to give numerical approximations of this function. The idea is to use a known value of
the cosine function (in our case cos(0) = 1) and then, with the help of the Taylor series, construct an
approximation for a value 𝑥 where the cosine function is not known. Using Theorem 2-2, it follows:

(2-11) cos(𝑥) = cos(0 + 𝑥) = cos(0) + 𝑥 ∙ d/d𝑥 cos(𝑥)|_{𝑥=0} + 𝑥²/2! ∙ d²/d𝑥² cos(𝑥)|_{𝑥=0} + ⋯

For an order 𝑘 = 6 approximation, one obtains after simplification:

(2-12) cos(𝑥) = 1 − 𝑥²/2! + 𝑥⁴/4! − 𝑥⁶/6! + 𝑅

where 𝑅 is the Taylor remainder of order 7. Based on the Taylor remainder term (2-7), an upper boundary
of the truncation error is:

(2-13) Truncation error ≤ |𝑥⁷/7!|

because the derivative of order 𝑘 + 1 = 7 of cos(𝑥) will always be between −1 and 1, regardless of the
point at which it is evaluated. However, in the present example, a more precise estimation of the
truncation upper boundary can be obtained. Indeed, if we are interested in small numbers 𝑥 (for
example 𝑥 = 0.3), then we can estimate the truncation error as follows:

(2-14) Truncation error = |𝑥⁸/8! − 𝑥¹⁰/10! + 𝑥¹²/12! − ⋯|

For small 𝑥, the terms in this summation decrease very rapidly and we may approximate the truncation
error by including only the first of them:

(2-15) Truncation error ≅ |𝑥⁸/8!|

Note that this estimation is more precise than estimation (2-13). As an example, let us approximate
cos(0.3):

(2-16) cos(0.3) ≅ 1 − 0.3²/2! + 0.3⁴/4! − 0.3⁶/6! ≅ 0.9553365

The truncation error is about

(2-17) Truncation error ≅ |0.3⁸/8!| ≅ 1.6 ∙ 10⁻⁹

Assuming Octave can compute cos(0.3) exactly (or at least with an error lower than our estimated
truncation error), we obtain for the true error3
>> format short E
>> x=0.3;
>> abs(1 - x^2/2 + x^4/factorial(4) - x^6/factorial(6) - cos(x))
ans = 1.6256E-09
This confirms that our truncation error estimation (2-17) is indeed a valid estimation, as it is a
conservative estimation.
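For completeness, a quick check of our own compares the general bound (2-13) with the sharper estimate (2-15) at 𝑥 = 0.3 (the format short E setting from above is still active):
>> 0.3^7/factorial(7)
ans = 4.3393E-08
>> 0.3^8/factorial(8)
ans = 1.6272E-09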

It would be wrong to conclude that adding more terms to the Taylor series expansion will always result
in the reduction of the truncation error. In fact, the Taylor series expansion must be conducted
sufficiently close to the reference point. Sufficiently close means that the expansion must be around a

3
The command format short E is used to present the answers in a more human friendly way.
value within the radius of convergence of the Taylor series. Definition 2-4 recalls the definition of the
radius of convergence of a power series.

DEFINITION 2-4. RADIUS OF CONVERGENCE OF A POWER SERIES


Consider a power series such as

(2-18) 𝑓(𝑥) = ∑_{𝑘=0}^{∞} 𝑎_𝑘 (𝑥 − 𝑏)^𝑘

The radius of convergence of this power series is the largest number 𝑟 such that the series (2-18)
converges for any 𝑥 satisfying |𝑥 − 𝑏| < 𝑟.

Example 2-6 illustrates the importance of keeping the Taylor series expansion within its radius of
convergence.

EXAMPLE 2-6. APPROXIMATION OF THE LOGARITHMIC FUNCTION


To approximate the natural logarithm function, we expand the function around 𝑥 = 1, where we know
that ln(1) = 0. Using Theorem 2-2, one obtains the following order 𝑘 expansion:

(2-19) ln(1 + 𝑥) = 𝑥 − 𝑥²/2 + 𝑥³/3 − 𝑥⁴/4 + ⋯ + (−1)^(𝑘−1) 𝑥^𝑘/𝑘 + (−1)^𝑘 𝑥^(𝑘+1)/((𝑘 + 1) 𝜉^(𝑘+1))

with 𝜉 ∈ [1, 1 + 𝑥]. Note that 𝑥 must be larger than −1 in order for ln(1 + 𝑥) to be defined. Written as a
series, Equation (2-19) becomes

(2-20) ln(1 + 𝑥) = ∑_{𝑘=1}^{∞} (−1)^(𝑘−1) 𝑥^𝑘/𝑘

which is known to have a radius of convergence of 1 from calculus4. Consequently, the truncation error
of an order 𝑘 approximation

(2-21) Truncation error = |𝑥^(𝑘+1)/((𝑘 + 1) 𝜉^(𝑘+1))|

will only decrease with the order 𝑘 if |𝑥| < 1.

<Figure 2-3 here>


Figure 2-3 Taylor series expansion of 𝑓(𝑥) = ln(1 + 𝑥) with increasing orders (order 6, 7, 14 and 19). Note that the Taylor series
approximates the function 𝑓 (dashed line) only in the interval −1 < 𝑥 < 1.

Figure 2-3 illustrates this behavior. Consider 𝑥 such that −1 < 𝑥 < 1. In this case, when increasing the
order of the Taylor series, better approximations of the function 𝑓(𝑥) = ln(1 + 𝑥) are obtained. The
truncation error (2-21) decreases as the order 𝑘 of the Taylor series increases. The situation is different

4
To be precise, series (2-20) converges as well for 𝑥 = 1, i.e. for −1 < 𝑥 ≤ 1.
for 𝑥 > 1. In this case, increasing the order of the Taylor series results in approximations of the function
𝑓(𝑥) = ln(1 + 𝑥) that become less and less accurate. This time the truncation error (2-21) increases as
the order 𝑘 of the Taylor series increases.
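The following short sketch (our own illustration, not the book's code) evaluates the partial sums of the series (2-20) at a point inside (𝑥 = 0.5) and outside (𝑥 = 2) the radius of convergence:
% order-k partial sum of the series (2-20)
Sk = @(x, k) sum((-1).^((1:k) - 1) .* x.^(1:k) ./ (1:k));
for k = [5 10 20 40]
  printf("k = %2d:  S_k(0.5) = %+.6f   S_k(2) = %+.3e\n", k, Sk(0.5, k), Sk(2, k));
end
% S_k(0.5) settles towards log(1.5) ≈ 0.405, whereas |S_k(2)| keeps growing.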

2.2.2. Behaviour of truncation errors


In the present textbook, the majority of mathematical problems encountered will be approximated
using the Taylor series expansion. The corresponding algorithms will make use of a parameter ℎ. This
parameter ℎ has to be chosen in order to perform the series expansion. Hence, it is necessary to
address how ℎ affects the behaviour of the truncation error. Theorem 2-2 gives an expression for the
truncation error using the remainder 𝑅 of the Taylor series. However, as seen in Equation (2-9), the
algorithms will not have full control over the exact value of the truncation error. The parameter ℎ and
the order 𝑘 of the series expansion can be chosen, but little control is available over 𝜉 which is only
known to be a value between 𝑥 and 𝑥 + ℎ. Luckily, to operate numerical algorithms, the exact
expression of the truncation error will not be needed, but only the behaviour of the truncation error in
the two parameters ℎ and 𝑘 is important. In fact, the relevant question is to understand how the
truncation error decreases with these parameters ℎ and 𝑘. To quantify this decrease, it is common to
use the concept of the ‘big O’ notation. For our purposes, we define the big O notation as follows:

DEFINITION 2-5. BIG O NOTATION


A real function 𝑓 is said to behave as 𝒪(𝑔) if there exists a finite number 𝐾 ≠ 0 such that

(2-22) lim_{𝑥→𝑎} 𝑓(𝑥)/𝑔(𝑥) = 𝐾

One writes 𝑓(𝑥) = 𝒪(𝑔(𝑥)) for 𝑥 → 𝑎, which reads as: 𝑓 is of order 𝑔 as 𝑥 tends to 𝑎.

When a function 𝑓 is of order 𝒪(𝑔), this means intuitively that as 𝑥 tends to approach 𝑎, the functions 𝑓
and 𝑔 become increasingly similar. Consequently, if we know that a function 𝑓 is of order 𝑔 as 𝑥 → 𝑎,
we possess more information compared to merely evaluating 𝑓(𝑎). Indeed, we further know how 𝑓(𝑥)
behaves around 𝑥 = 𝑎. As 𝑓 and 𝑔 become increasingly similar as 𝑥 → 𝑎, we can replace the function 𝑓
by the function 𝑔 in the vicinity of 𝑥 = 𝑎.

Let us apply this concept to the truncation error of a Taylor series expansion. We would like to quantify
how this truncation error behaves as ℎ → 0. Consider a Taylor series expansion 𝑇𝑘 (ℎ) of order 𝑘 of a
function 𝑓 around 𝑥. Equation (2-9), which gives the truncation error when replacing 𝑓(𝑥 + ℎ) by 𝑇𝑘 (ℎ),
suggests that this error is of order ℎ^(𝑘+1). This is indeed the case. For ℎ within the radius of convergence
of the Taylor series expansion of 𝑓(𝑥 + ℎ) and 𝑓^(𝑘+1)(𝑥) ≠ 0, one can write5

(2-23) lim_{ℎ→0} [𝑓(𝑥 + ℎ) − 𝑇_𝑘(ℎ)]/ℎ^(𝑘+1) = lim_{ℎ→0} { (1/ℎ^(𝑘+1)) [𝑓^(𝑘+1)(𝑥)/(𝑘+1)! ℎ^(𝑘+1) + 𝑓^(𝑘+2)(𝑥)/(𝑘+2)! ℎ^(𝑘+2) + ⋯] } = 𝑓^(𝑘+1)(𝑥)/(𝑘+1)!

5
The case where 𝑓 (𝑘+1) (𝑥) = 0 will be discussed separately shortly.
Equation (2-23) confirms that the truncation error of the Taylor series expansion of 𝑓(𝑥 + ℎ) is of order
ℎ^(𝑘+1) if 𝑓^(𝑘+1)(𝑥) ≠ 0. Using the big O notation, the Taylor series expansion of order 𝑘 can therefore be
written as

(2-24) 𝑓(𝑥 + ℎ) = 𝑇_𝑘(ℎ) + 𝒪(ℎ^(𝑘+1)) = 𝑓(𝑥) + 𝑓′(𝑥)ℎ + 𝑓″(𝑥)/2! ℎ² + ⋯ + 𝑓^(𝑘)(𝑥)/𝑘! ℎ^𝑘 + 𝒪(ℎ^(𝑘+1))

Let us once again stress the intuitive meaning of 𝒪(ℎ^(𝑘+1)): as ℎ → 0, the truncation error tends to
become increasingly similar to ℎ^(𝑘+1). In fact, from Equation (2-23), one can write

(2-25) |𝑓(𝑥 + ℎ) − 𝑇_𝑘(ℎ)| ≅ 𝐾 ∙ ℎ^(𝑘+1)

with 𝐾 = |𝑓^(𝑘+1)(𝑥)/(𝑘+1)!|. Equation (2-25) becomes more accurate as ℎ grows smaller. Example 2-7
illustrates the usage of this equation.

EXAMPLE 2-7. BEHAVIOUR OF TAYLOR SERIES TRUNCATION ERROR


To illustrate the intuitive interpretation of the big O notation, we consider the Taylor series expansion of
the square root function 𝑓(𝑥) = √𝑥 around 1. As an example, let us consider the order 3 expansion:

(2-26) √(1 + ℎ) = 𝑇₃(ℎ) + 𝒪(ℎ⁴) = 1 + ℎ/2 − ℎ²/8 + ℎ³/16 + 𝒪(ℎ⁴)

The truncation error is of order 4, which means that as ℎ → 0, the truncation error becomes increasingly
similar to |𝐾 ∙ ℎ⁴| with the coefficient 𝐾 given by

(2-27) 𝐾 = |𝑓⁽⁴⁾(𝑥 = 1)/4!| = 5/128

Similarly, we can obtain approximations of √(1 + ℎ) of different orders and their corresponding
truncation error estimations.

Figure 2-4 illustrates how the truncation errors for various orders of approximation behave. The true
truncation errors, computed by evaluating numerically |√(1 + ℎ) − 𝑇_𝑘(ℎ)|, are plotted for a range of
values of ℎ and expansion orders from 𝑘 = 0 to 𝑘 = 3. Furthermore, the estimations of the truncation
errors based on |𝐾 ∙ ℎ^(𝑘+1)| are added too. We can observe how the truncation errors become increasingly
similar to |𝐾 ∙ ℎ^(𝑘+1)| as ℎ → 0.

<Figure 2-4 here>


Figure 2-4 Truncation errors when approximating √(1 + ℎ) with a Taylor series expansion of order 𝑘. The circles (o) are the true
truncation errors obtained by calculating numerically |√(1 + ℎ) − 𝑇_𝑘(ℎ)| and the lines (-) are the truncation error estimations from
the Taylor remainder term |𝐾 ∙ ℎ^(𝑘+1)|.
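The following short script (our own sketch, not the book's code) reproduces the 𝑘 = 3 curve of Figure 2-4:
h        = logspace(-3, 0, 50);
T3       = 1 + h/2 - h.^2/8 + h.^3/16;   % order-3 Taylor polynomial of sqrt(1+h)
err_true = abs(sqrt(1 + h) - T3);        % true truncation error
err_est  = (5/128) * h.^4;               % estimate K*h^(k+1) with K = 5/128
loglog(h, err_true, 'o', h, err_est, '-');
xlabel('h'); ylabel('truncation error');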

Plots depicting the truncation error as a function of the parameter controlling its behaviour, the
parameter ℎ in our case, are important tools in numerical methods to understand how algorithms
behave. They will furthermore be very useful in developing tools and methods to estimate the errors introduced
by the algorithms during their numerical calculations.

Equation (2-24) tells us that the truncation error of a Taylor series of order 𝑘 (i.e. when the Taylor
polynomial 𝑇_𝑘(ℎ) is used to replace the original function) is of order 𝒪(ℎ^(𝑘+1)). Let us note
that if the function 𝑓 is such that 𝑓^(𝑘+1)(𝑥) = 0, then the truncation error is of order 𝒪(ℎ^(𝑘+2)), which is
of higher order than 𝒪(ℎ^(𝑘+1)). This statement can be proven by calculations similar to
Equation (2-23). Example 2-8 illustrates such a case.

EXAMPLE 2-8. TRUNCATION ERROR OF THE COSINE TAYLOR SERIES


Recall the Taylor series expansion of order 𝑘 = 6 of the cosine function around 𝑥 = 0:

(2-28) cos(𝑥) = 1 − 𝑥²/2! + 𝑥⁴/4! − 𝑥⁶/6! + 𝑅(𝑥)

The truncation error 𝑅(𝑥) = |cos 𝑥 − 𝑇₆(𝑥)| is of order 𝒪(𝑥⁸). To verify this, consider the expression of
the remainder term 𝑅:

(2-29) 𝑅(𝑥) = |𝑥⁸/8! − 𝑥¹⁰/10! + 𝑥¹²/12! − ⋯|

This term is indeed of order 𝒪(𝑥⁸) as

(2-30) lim_{𝑥→0} 𝑅(𝑥)/𝑥⁸ = lim_{𝑥→0} {1/8! − 𝑥²/10! + 𝑥⁴/12! − ⋯} = 1/8!

In other words: as 𝑥 → 0, the error made in replacing cos 𝑥 by the Taylor series of order 6 becomes
increasingly similar to 𝑥⁸/8!.

The big O notation has some properties that will be used frequently.

THEOREM 2-3. PROPERTIES OF BIG O NOTATION


1. The sum of two expressions of order 𝒪(ℎ𝑘 ) is again of order 𝒪(ℎ𝑘 ).
2. The ratio of an expression of order 𝒪(ℎ𝑘 ) over an expression of order 𝒪(ℎ𝑛 ) is of order
𝒪(ℎ𝑘−𝑛 ).
3. An expression of order 𝒪(ℎ𝑘 ) will eventually always become smaller than an expression of
order 𝒪(ℎ𝑛 ) if 𝑘 > 𝑛 as ℎ approaches zero.

The proof of property 1 is straightforward and based on the definition of the big O notation. Consider
two expressions 𝑓(ℎ) and 𝑔(ℎ) of order 𝒪(ℎ^𝑘). Their sum is as well of order 𝒪(ℎ^𝑘) as

(2-31) lim_{ℎ→0} [𝑓(ℎ) + 𝑔(ℎ)]/ℎ^𝑘 = lim_{ℎ→0} 𝑓(ℎ)/ℎ^𝑘 + lim_{ℎ→0} 𝑔(ℎ)/ℎ^𝑘 = 𝐾_𝑓 + 𝐾_𝑔

Similarly, we can prove property 2. Let us now discuss the third property. An expression of order 𝒪(ℎ^𝑘)
will become increasingly similar to 𝐾₁ ∙ ℎ^𝑘 whereas an expression of order 𝒪(ℎ^𝑛) will become similar to
𝐾₂ ∙ ℎ^𝑛. Now, regardless of how small 𝐾₂ may be, eventually, for a small enough ℎ, one will have
𝐾₁ ∙ ℎ^𝑘 < 𝐾₂ ∙ ℎ^𝑛 (because 𝑘 > 𝑛), which proves property 3.

Property 3 justifies notations involving the approximation sign ‘≅’. For example, when we write

(2-32) 𝑓(𝑥 + ℎ) ≅ 𝑓(𝑥) + ℎ𝑓′(𝑥)

we imply that this expression is evaluated numerically using a value of ℎ sufficiently small such that the higher
order terms, of order 2 in this case, are negligible compared to the first order approximation.

A further important application of property 3 is to compare the truncation errors of algorithms. An
algorithm having a truncation error of a higher order than another algorithm will eventually always
produce approximations with a lower truncation error.

2.3. Error propagation


When numerical calculations are conducted, errors will propagate through them. The propagation of the
errors can become quite difficult to trace. Recall Example 2-3 which illustrated that, in the case of
round-off errors, the error propagation actually depends on how the calculations are executed. In this
section, we want to study in more detail how numerical errors propagate. Due to the complexity of how
round-off errors enter into a specific calculation, we will not be able to provide a precise estimation of
the effect of round-off errors. However, we will be able to discuss the portion of the error propagation
that cannot be avoided, regardless of which way a specific numerical calculation is accomplished.

2.3.1. Propagation of an input error


Assume we would like to evaluate a function 𝑦 = 𝑓(𝑥) at some value 𝑥. Furthermore, let us imagine
that for 𝑥 we only have an approximation 𝑥𝑐 such that 𝑥𝑐 = 𝑥 + 𝜀. Here, 𝜀 is the absolute error (for
example due to round-off errors) of 𝑥𝑐. The question we want to answer is the following: assuming we
can evaluate the function 𝑓 with infinite precision, how large will the absolute error of 𝑦 = 𝑓(𝑥) be if we
evaluate this function at 𝑥𝑐 = 𝑥 + 𝜀? In numerical analysis, the error 𝜀 is referred to as the input error
and the error |𝑓(𝑥 + 𝜀) − 𝑓(𝑥)| is called the output error.

Taylor series can help us to find an estimation of the output error given an input error. We write

(2-33) 𝑓(𝑥𝑐 ) = 𝑓(𝑥 + 𝜀) = 𝑓(𝑥) + 𝜀𝑓 ′ (𝑥) + 𝒪(𝜀 2 )

For the moment, we assume that 𝑓′(𝑥) ≠ 0. The case where 𝑓 ′ (𝑥) = 0 will be discussed at the end of
this section. If 𝑓′(𝑥) ≠ 0, we can estimate, using Equation (2-33), the output error as a first order
expression in the input error 𝜀 (i.e. the input error is small enough such that the second order terms
𝒪(𝜀 2 ) are negligible compared to 𝜀𝑓 ′ (𝑥)):

(2-34) |𝑓(𝑥𝑐 ) − 𝑓(𝑥)| ≅ |𝑓′(𝑥) ∙ 𝜀| = |𝑓′(𝑥)| ∙ |𝜀|


Equation (2-34) can be understood as follows: the input error on 𝑥 (the error |𝜀| on the value 𝑥) is
multiplied by the quantity |𝑓′(𝑥)| after evaluating the function 𝑓.

From Equation (2-34) we can also obtain a relation linking the relative errors instead of the absolute
errors:

(2-35) |(𝑓(𝑥𝑐) − 𝑓(𝑥))/𝑓(𝑥)| ≅ |𝑓′(𝑥) ∙ 𝑥/𝑓(𝑥)| ∙ |𝜀/𝑥|

Equations (2-34) and (2-35) are the motivation for Definition 2-6:

DEFINITION 2-6. CONDITION NUMBER OF A FUNCTION


The absolute condition number 𝐶𝑓(𝑥) of a function 𝑓 at the point 𝑥 is

(2-36) 𝐶𝑓(𝑥) = |𝑓′(𝑥)|

and the relative condition number 𝑐𝑓(𝑥) of a function 𝑓 at the point 𝑥 is

(2-37) 𝑐𝑓(𝑥) = |𝑓′(𝑥) ∙ 𝑥/𝑓(𝑥)|

In numerical analysis, conditioning is an important concept. It measures how sensitive a calculation, or more
generally an algorithm, is to errors in the input. Depending on the mathematical problem one
is interested in, conditioning is defined differently (depending on what the input and what the output of
the problem is)6. But the idea is always the same: one wants to estimate how a small change in
the input affects the output. Figure 2-5 illustrates this idea. Some calculation, or some algorithm, 𝑓 is
applied to a quantity 𝑥. Conditioning measures how a small change ∆𝑥 in the input is reflected in the
output. If the outputs 𝑓(𝑥) and 𝑓(𝑥 + ∆𝑥) are close to each other, the condition number will be small.
Whereas, if the outputs 𝑓(𝑥) and 𝑓(𝑥 + ∆𝑥) are far from each other, the condition number will be
large. In the next chapters, we will revisit this concept more thoroughly and define quantitatively what
the terms “close” and “far” mean. For the moment we restrict ourselves to situations where the inputs
are numbers and the outputs are numbers too. In this case, the distance (how close or how far) two
numbers 𝑥1 and 𝑥2 are from each other is measured with the quantity |𝑥2 − 𝑥1 |.

<Figure 2-5 here>


Figure 2-5 The concept of conditioning in numerical analysis. Conditioning measures how a small change ∆𝑥 in the input is
reflected in the output after the calculation is accomplished. Here 𝑓(𝑥) represents a calculation (or an algorithm) that is applied
to some quantity 𝑥.

It is important to realize that the condition number of a function will only allow us to estimate the so-
called unavoidable error, which is the error due to the error in the input of the function. The actual
numerical error may be larger than the estimation using the condition number. Indeed, when evaluating

6
In chapter 3, which discusses the numerical solution of algebraic equations, the mathematical expression of the
condition number will be the inverse of the one from Definition 2-6. The reason is that inputs and outputs are
exactly reversed compared to the situation discussed here.
numerically the function, round-off errors are also introduced, and these are not considered in the
approach using the condition number, which assumes that the function can be evaluated with infinite
accuracy. This observation is summarized in Theorem 2-4.

THEOREM 2-4. UNAVOIDABLE OUTPUT ERROR ESTIMATION


When evaluating a function 𝑓, the first order estimation of the unavoidable output error ∆𝑦 due to the
input error |𝜀| is:

(2-38) ∆𝑦 ≅ 𝐶𝑓(𝑥) ∙ |𝜀|

where ∆𝑦 = |𝑓(𝑥 + 𝜀) − 𝑓(𝑥)| is the unavoidable output error (the error in the output assuming 𝑓 can
be evaluated with infinite precision) and 𝐶𝑓(𝑥) is the absolute condition number of 𝑓.

Similarly, one has for the relative errors

(2-39) |∆𝑦/𝑓(𝑥)| ≅ 𝑐𝑓(𝑥) ∙ |𝜀/𝑥|

where 𝑐𝑓(𝑥) is the relative condition number of 𝑓.

Example 2-9 illustrates how the concept of condition number of a function can be used to estimate the
propagation of input errors through calculations.

EXAMPLE 2-9. ERROR PROPAGATION AND CONDITIONING


Let us consider the function 𝑦 = 𝑓(𝑥) = 𝑥 2 . From Definition 2-6, it follows that the absolute condition
number is

(2-40) 𝐶𝑓 (𝑥) = 2|𝑥|

This means that if we evaluate the function 𝑓 at 𝑥 + 𝜀, the result will have an absolute error of about
∆𝑦 ≅ 𝐶𝑓(𝑥) ∙ |𝜀| = 2|𝑥| ∙ |𝜀|. The initial error is multiplied by the factor 2|𝑥|. We can verify this by
evaluating numerically the function 𝑓 in 𝑥 = √2. The correct answer would be 𝑓(√2) = 2. Octave finds
an absolute error of ∆𝑦 ≅ 4 ∙ 10−16 :
>> f=@(x) x^2;
>> f(sqrt(2))-2
ans = 4.4409e-16
This agrees with the condition number calculated in Equation (2-40). Indeed, when evaluating
numerically √2, a round-off error of about 𝜀 ≅ 0.5 ∙ √2 ∙ 𝜀𝑚 is introduced (refer to Observation 2-2).
Consequently, the expected absolute error when evaluating 𝑓(√2) will be about

(2-41) ∆𝑦 ≅ 2√2 ∙ 0.5 ∙ √2 ∙ 𝜀𝑚 = 2𝜀𝑚 ≅ 4.5 ∙ 10⁻¹⁶

which is similar to the error we found when evaluating 𝑓(√2) − 2 numerically with Octave.
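As an additional check of our own (the input error below is artificial and chosen purely for illustration), the first order estimate (2-38) can be verified directly with the function 𝑓 defined above:
>> x = sqrt(2); err = 1e-9;     % artificial, known input error
>> abs(f(x + err) - f(x))       % actual output error
>> 2*abs(x)*err                 % first order estimate C_f(x)*|err|; both agree to about nine digits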

So far, we have considered situations where 𝑓′(𝑥) ≠ 0. If 𝑓′(𝑥) = 0, then according to Theorem 2-4, the
first order estimation of the unavoidable error is equal to zero. As we will see, in such a case the
unavoidable error is indeed small, but not equal to zero. When a more precise estimation is required, a
higher order estimation in the input error needs to be developed. For this purpose, we use the Taylor
series expansion of 𝑓(𝑥𝑐) = 𝑓(𝑥 + 𝜀), include higher order terms and consider that 𝑓′(𝑥) = 0. We
obtain7:

(2-42) 𝑓(𝑥𝑐) = 𝑓(𝑥 + 𝜀) = 𝑓(𝑥) + 𝑓″(𝑥)/2! ∙ 𝜀² + 𝒪(𝜀³)

The unavoidable error becomes, for small enough input errors (i.e. small enough such that the third
order terms 𝒪(𝜀³) are negligible compared to the second order term 𝑓″(𝑥)𝜀²/2!):

(2-43) |𝑓(𝑥𝑐) − 𝑓(𝑥)| ≅ |𝑓″(𝑥)/2! ∙ 𝜀²| = |𝑓″(𝑥)/2!| ∙ 𝜀²

The unavoidable error is now of second order in the input error |𝜀|.

EXAMPLE 2-10. ERROR PROPAGATION WHEN 𝑓’(𝑥) = 0


Consider the cosine function 𝑦 = cos(𝑥). Let us estimate the error when computing numerically cos(𝜋)
when we approximate 𝜋 by 3.14159.

Let us first determine the round-off error of our numerical approximation of 𝜋. For this we use the
inbuilt variable pi of Octave:
>> format short E
>> abs(3.14159-pi)
ans = 2.6536E-06
The input error |𝜀| is about 2.7 ∙ 10⁻⁶.
As d/d𝑥 (cos 𝑥) = −sin 𝑥 and sin 𝜋 = 0, we have to use a second order approximation for the
unavoidable error:

(2-44) |cos(3.14159) − cos 𝜋| ≅ |−cos 𝜋 ∙ 𝜀²/2!| = 𝜀²/2 ≅ 3.5 ∙ 10⁻¹²

We can verify that this estimation is correct by calculating with Octave
>> abs(cos(3.14159)-cos(pi))
ans = 3.5207E-12
Note that the unavoidable error is indeed very small compared to the input error (as it is of second order
of the input error), but not equal to zero, as a first order estimation of the unavoidable error would
predict.

2.3.2. Estimating round-off errors introduced by a function evaluation


As stated in Theorem 2-4, the condition number allows estimating the unavoidable output error only.
However, when evaluating a function, round-off errors occur as well (refer to Observation 2-1). The
estimation of the round-off errors introduced during the evaluation of a function is a challenging

7
If 𝑓’’(𝑥) = 0, or higher orders of derivatives are zero too, an even higher order expansion is required. The
principle will remain the same as outlined in the text.
problem. Contrary to the unavoidable error, there is no simple formula to estimate it. Further analysis is
required. We present here one methodology that can be used.

Its principle is simple. We do not know how to compute 𝑓(𝑥) exactly (i.e. we do not know how many
digits of the numerical evaluation of 𝑓 are correct). However, based on Theorem 2-4, we know the
error associated with computing 𝑓(𝑥 + ℎ), assuming no round-off error is associated with the evaluation
of 𝑓. Consequently, the difference between |𝑓(𝑥 + ℎ) − 𝑓(𝑥)| and |𝑓′(𝑥)ℎ|, when the different quantities
are evaluated numerically, equals the round-off error associated with |𝑓(𝑥 + ℎ) − 𝑓(𝑥)|. To
obtain the round-off error of 𝑓(𝑥), we need to conduct this analysis for ℎ tending towards zero.

Let us put this idea into action. When evaluating any function 𝑓 at a value 𝑥, the numerical answer, which
will be written 𝑓𝑓𝑙, will have a round-off error. We can write

(2-45) 𝑓𝑓𝑙 (𝑥) = 𝑓(𝑥) + 𝜀

where 𝑓𝑓𝑙 is the numerical evaluation of 𝑓 using floating-point numbers and 𝜀 is the corresponding
accumulated round-off error.

Similarly, evaluating 𝑓(𝑥 + ℎ) numerically will produce a numerical answer 𝑓𝑓𝑙 (𝑥 + ℎ) with a
corresponding round-off error 𝜀′:

(2-46) 𝑓𝑓𝑙 (𝑥 + ℎ) = 𝑓(𝑥 + ℎ) + 𝜀′

Using Theorem 2-4, we can further write

(2-47) 𝑓𝑓𝑙 (𝑥 + ℎ) − 𝑓𝑓𝑙 (𝑥) = 𝑓(𝑥 + ℎ) − 𝑓(𝑥) + 𝜀′ − 𝜀 ≅ 𝑓′(𝑥) ∙ ℎ + 𝜀′ − 𝜀

Rearranging (2-47) and using the triangle inequality |𝑎 + 𝑏| ≤ |𝑎| + |𝑏| results in:

(2-48) |𝑓𝑓𝑙(𝑥 + ℎ) − 𝑓𝑓𝑙(𝑥)| − |𝑓′(𝑥) ∙ ℎ| ≤ |𝜀′ − 𝜀|

Note that the round-off errors 𝜀 and 𝜀′ will either add up or cancel each other out, depending on the
values of 𝑥 and ℎ.

By evaluating the left-hand side of Equation (2-48) for small values of ℎ, an estimation of 𝜀 is obtained. Indeed, if
ℎ is small, it seems reasonable to assume that the round-off errors 𝜀 and 𝜀′ are of similar magnitude. Their difference
then lies between zero and about 2|𝜀|, depending on whether the round-off errors cancel out or add up.
Consequently, for small values of ℎ:

(2-49) |𝑓𝑓𝑙(𝑥 + ℎ) − 𝑓𝑓𝑙(𝑥)| − |𝑓′(𝑥) ∙ ℎ| ≤ 2|𝜀|

where 𝜀 is the accumulated round-off error when evaluating 𝑓 numerically at 𝑥. Hence, by evaluating the
left-hand side of Equation (2-49) for a range of small values of ℎ and reading off its largest values, one
obtains an estimation of 2|𝜀|.

Note that so far, our analysis has assumed that 𝑥 is free of error. In practice, this is seldom the case. The
reader is encouraged to repeat calculations from Equation (2-45) to (2-49) and assess how an input error
in 𝑥 will impact the various calculation steps. In fact, any effect an input error may produce will cancel
out in Equation (2-47) and consequently Equation (2-49) remains valid.
In practice, using Equation (2-49) requires an adequate choice of ℎ. A lower limit is given by the effect of
the round-off error on the quantity 𝑥 + ℎ. One needs to choose ℎ large enough such that 𝑥 + ℎ and 𝑥
can be distinguished numerically. In other words, ℎ needs to be at least equal to 0.5𝜀𝑚 ∙ 𝑥 (refer to
Observation 2-2). As for the upper limit of ℎ, there is no general rule that can be given. The value of ℎ
must be small enough such that the approximation 𝑓(𝑥 + ℎ) − 𝑓(𝑥) ≅ 𝑓 ′ (𝑥) ∙ ℎ remains acceptable.
Plotting |𝑓𝑓𝑙 (𝑥 + ℎ) − 𝑓𝑓𝑙 (𝑥)| versus ℎ can help in judging if this is the case. Example 2-11 gives an
illustration on how Equation (2-49) can be used to estimate the round-off error introduced when
evaluating numerically the square root function.

EXAMPLE 2-11. ROUND-OFF ERRORS OF THE SQUARE ROOT FUNCTION


This example studies how the function 𝑦 = 𝑓(𝑥) = √𝑥 introduces round-off errors when evaluated
numerically.

As a first example, let us estimate the round-off error when evaluating √16 numerically. In this case
we know that √16 = 4. This allows us to verify our methodology. Octave computes
>> format long
>> sqrt(16)-4
ans = 0

No digits are lost during the evaluation of √16 as all 16 decimal digits of the solution are correct. Now,
let us reach the same conclusion by applying Equation (2-49), to estimate the round-off error associated
with the numerical solution of √16. For this we compute numerically

(2-50) ∆𝑦𝑓𝑙 = |𝑓𝑓𝑙(16 + ℎ) − 𝑓𝑓𝑙(16)| = |√(16 + ℎ) − √16|

for a range of values of ℎ. We use as lower limit 0.5 ∙ 16𝜀𝑚 and then increase it logarithmically (by
multiplying ℎ each time by 2). Let us note that for ℎ = 0.5 ∙ 16𝜀𝑚 we obtain ∆𝑦𝑓𝑙 = 0:
>> sqrt(16+8*eps)-sqrt(16)
ans = 0
For this value of ℎ, the numbers 16 and 16 + ℎ cannot be distinguished numerically (refer to
Observation 2-2), so the computed difference is zero. For larger values of ℎ this is no longer the case, as
seen in Figure 2-6 (a). Figure 2-6 (a) shows as well that the approximation 𝑓(𝑥 + ℎ) − 𝑓(𝑥) ≅ 𝑓′(𝑥) ∙ ℎ = ℎ/(2√𝑥)
is reasonable over the range of selected ℎ values.

<Figure 2-6 here>


Figure 2-6 (a) Plot of ∆𝑦𝑓𝑙 = |√(16 + ℎ) − √16| versus ℎ. The straight line is 𝑦 = 𝑓′(16)ℎ = ℎ/(2√16). (b) Plot of ∆𝑦𝑓𝑙 − 𝑓′(16)ℎ
versus ℎ.

Figure 2-6 (b) depicts the difference between ∆𝑦𝑓𝑙 and 𝑓′(𝑥) ∙ ℎ = ℎ/(2√𝑥). This is, according to Equation
(2-49), an estimation of twice the round-off error introduced when evaluating √16 numerically. The
figure reveals that the round-off error associated with the numerical computation of √16 is about 2 ∙
10⁻¹⁶. This is smaller than 0.5 ∙ √16 ∙ 𝜀𝑚 ≅ 4.5 ∙ 10⁻¹⁶, which means that all the computed decimal digits
are correct.
The round-off error introduced when evaluating numerically the square root function will depend on the
value of 𝑥 where this evaluation takes place. Let us repeat the same exercise, but this time for √401. In
this case Octave computes
>> format long
>> sqrt(401)
ans = 20.02498439450079
Let us try to estimate how many of the computed digits are correct. For this we need to study the
behaviour of the numerical evaluation of

(2-51) ∆𝑦𝑓𝑙 = |𝑓𝑓𝑙(401 + ℎ) − 𝑓𝑓𝑙(401)| = |√(401 + ℎ) − √401|

as a function of ℎ. Again, we choose a range of ℎ with a lower bound of 0.5 ∙ 401𝜀𝑚 .

<Figure 2-7 here>


Figure 2-7 (a) Plot of ∆𝑦𝑓𝑙 = |√(401 + ℎ) − √401| versus ℎ. The straight line is 𝑦 = 𝑓′(401)ℎ = ℎ/(2√401). (b) Plot of ∆𝑦𝑓𝑙 −
𝑓′(401)ℎ versus ℎ.

Figure 2-7 (a) confirms that the approximation 𝑓(𝑥 + ℎ) − 𝑓(𝑥) ≅ 𝑓′(𝑥) ∙ ℎ = ℎ/(2√𝑥) is reasonable for all
values of ℎ considered. From Figure 2-7 (b) we can estimate the round-off error when evaluating √401.
We find a round-off error of about 2 ∙ 10⁻¹⁵. This indicates that the last digit (maybe even the last two,
depending on how the round-off error is added or subtracted) of the numerical solution of √401 is
wrong. We conclude that 20.024984394500 are all correct digits of √401.
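The following short script (a sketch of our own, not the book's code) generates the data behind Figure 2-7:
x   = 401;
h   = 0.5*x*eps * 2.^(0:25);          % lower limit 0.5*x*eps, then doubling h
dy  = abs(sqrt(x + h) - sqrt(x));     % Delta y_fl, panel (a)
est = h / (2*sqrt(x));                % first order term f'(x)*h
loglog(h, dy, 'o', h, est, '-');
figure;
loglog(h, abs(dy - est), 'o');        % panel (b): roughly twice the round-off error
xlabel('h');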

2.4. Round-off and truncation error trade off


As discussed so far, when performing numerical calculations two sources of errors are present. The
round-off errors cannot be avoided when using floating-point numbers. The truncation errors occur
when some form of mathematical approximation is employed. Both sources contribute to the total
error. For obvious reasons it is desired to minimize their effect.

Reducing the impact of the round-off error can solely be accomplished by increasing the number of
digits used in calculations. Reformulating the mathematical expressions used in the calculations (for
example to minimize effects such as loss of significance) may mitigate the round-off error impact but not
as efficiently as employing a larger number of digits during computations.

The truncation error, on the other hand, can be reduced via two strategies derived from the Taylor series
expansion. The first strategy consists in employing a higher order expansion, whereas the second is to
conduct the expansion as locally as possible, i.e. by reducing the magnitude of ℎ in Equation (2-6). Both
strategies however have their limitations. As the truncation error is reduced by using a higher order
expansion, the higher order terms themselves become increasingly small in magnitude. This may lead to
an increase in the round-off error, thereby negating the benefit of the higher order expansion. A similar
effect will occur when reducing the value of ℎ. The following two examples illustrate this problem.

EXAMPLE 2-12. ROUND-OFF AND TRUNCATION ERROR TRADE-OFF IN TAYLOR SERIES


Let us estimate Euler’s number with as many correct digits as possible using a Taylor series expansion
𝑇𝑘 (𝑥) of the exponential function around zero, as shown by Equation (2-3).
<Figure 2-8 here>
Figure 2-8 True error when approximating Euler’s number 𝑒 with a Taylor series expansion of order 𝑘. The true value to sixteen
digits was used to represent Euler’s number 𝑒.

The exact value of Euler’s number to sixteen significant digits is 𝑒 = 2.718281828459045. Octave stores
this value in the internal variable e. Figure 2-8 depicts the true error when approximating the 16 digit
Euler’s number with the 𝑘th order Taylor series expansion 𝑇𝑘(𝑥 = 1) of the exponential function. As
expected, the error decreases with increasing order 𝑘. But when the 17th order is reached, the error no
longer decreases. This is due to the additional terms in the Taylor series expansion having become
so small that the numbers 𝑇𝑘(𝑥 = 1) and 𝑇𝑘+1(𝑥 = 1) are no longer distinguishable numerically:
>> format long
>> Tk = @(k) sum(1./factorial(linspace(0,k,k+1)));
>> Tk(17)
ans = 2.718281828459046
>> Tk(18)
ans = 2.718281828459046
The conclusion is that the numerical evaluation of the Taylor series 𝑇𝑘(1) cannot approximate 𝑒 to 16
significant digits despite using high order expansions (the last digit is wrong). For example, if we use an
expansion of order 𝑘 = 19, the truncation error is about

(2-52) Truncation error ≅ 1/20! ≅ 4 ∙ 10⁻¹⁹

However, the total error is about |𝑇₁₉(1) − 𝑒| ≅ 4 ∙ 10⁻¹⁶:

>> abs(Tk(19)-e)
ans = 4.440892098500626e-16
which is larger than the truncation error8. Consequently, the round-off error must have become
significantly larger than the truncation error and limits the accuracy of the result.
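A minimal sketch of our own, reusing the anonymous function Tk defined above, reproduces Figure 2-8:
k   = 0:25;
err = arrayfun(@(kk) abs(Tk(kk) - e), k);   % true error for each expansion order
semilogy(k, err, 'o-');
xlabel('order k'); ylabel('|T_k(1) - e|');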

The next example illustrates the issues that algorithms face if one attempts to reduce the truncation
error by conducting the expansion as locally as possible.

EXAMPLE 2-13. ROUND-OFF AND TRUNCATION ERROR TRADE-OFF


A simple algorithm to estimate numerically the first derivative of a function 𝑓(𝑥) can be developed
based on the first order Taylor series expansion of 𝑓:

(2-53) 𝑓(𝑥 + ℎ) = 𝑓(𝑥) + ℎ𝑓 ′ (𝑥) + 𝒪(ℎ2 )

After rearrangement and using Theorem 2-3 it follows:

(2-54) 𝑓′(𝑥) = [𝑓(𝑥 + ℎ) − 𝑓(𝑥)]/ℎ + 𝒪(ℎ)

8
Note that the error 4 ∙ 10⁻¹⁶ is about the expected round-off error when representing 𝑒 by a floating-point
number. Indeed, 0.5 ∙ 𝑒 ∙ 𝜀𝑚 ≅ 3 ∙ 10⁻¹⁶.
Let us apply Equation (2-54) to the exponential function 𝑓(𝑥) = 𝑒^𝑥. We aim to estimate the first
derivative at 𝑥 = 0. The numerical approximation becomes

(2-55) d/d𝑥 𝑒^𝑥 |_{𝑥=0} ≅ (𝑒^ℎ − 1)/ℎ

whereas the exact value is 𝑓′(0) = 1. Figure 2-9 displays the true absolute error, as a function of ℎ, of the
numerical estimation (2-55), computed as

(2-56) True error = |(𝑒^ℎ − 1)/ℎ − 1|
<Figure 2-9 here>


Figure 2-9 True error when approximating numerically the first derivative of the exponential function 𝑓(𝑥) = 𝑒 𝑥 in 𝑥 = 0 with
Equation (2-54) in function of ℎ. The error has two contributions: the truncation error, of order 𝒪(ℎ), and the round-off errors, of
order 𝒪(ℎ−1 ).

As expected, as ℎ decreases, so does the error. At first, the total error consists essentially of the
truncation error. This error decreases as 𝒪(ℎ), as predicted by Equation (2-54). However, at a
certain point, when ℎ reaches values lower than 10⁻⁸, the true error starts increasing again. This occurs
because the round-off errors are no longer negligible and contribute significantly to the total error.
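The following short script (our own sketch, not the book's code) reproduces the curve of Figure 2-9:
h   = logspace(-16, 0, 400);       % values of h spanning many orders of magnitude
err = abs((exp(h) - 1)./h - 1);    % true error of the approximation (2-55)
loglog(h, err);
xlabel('h'); ylabel('true error');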

It is possible to obtain a theoretical expression for the total error by a careful analysis of the
approximation of the first derivative:

(2-57) True error = |𝑓′(𝑥) − [𝑓𝑓𝑙(𝑥 + ℎ) − 𝑓𝑓𝑙(𝑥)]/ℎ| = |𝑓′(𝑥) − [𝑓(𝑥 + ℎ) + 𝜀₁ − 𝑓(𝑥) − 𝜀₂]/ℎ|

where 𝑓𝑓𝑙 stands for the floating-point number representation of the result of the function evaluation,
and 𝜀1 and 𝜀2 are the round-off errors. Rearranging and simplifying these expressions lead to

(2-58) True error = |𝑓′(𝑥) − [𝑓(𝑥 + ℎ) − 𝑓(𝑥)]/ℎ − (𝜀₁ − 𝜀₂)/ℎ| ≤ |𝑓′(𝑥) − [𝑓(𝑥 + ℎ) − 𝑓(𝑥)]/ℎ| + |(𝜀₁ − 𝜀₂)/ℎ|

The true error is the sum of two contributions. The first is the truncation error, which in our case is of
the order 𝒪(ℎ). The second is the accumulated round-off error, which in this case is of order 𝒪(ℎ−1 ).

The two aforementioned examples illustrate the important concept of round-off and truncation error
trade-off. The total error consists of two contributions: the round-off and the truncation error. While
the truncation error may be reduced by using higher order expansions or conducting more local
expansions, the round-off errors will unfortunately increase by doing so. This showcases the limitations
of numerical methods and demonstrates that the total error cannot be decreased indefinitely.

<Figure 2-10 here>


Figure 2-10 Round-off and truncation error trade-off. Total error, round-off and truncation errors are shown. In this case the
approximation is based on a series expansion of order 𝒪(ℎ𝑘 ).
Figure 2-10 illustrates the trade-off between truncation and round-off errors. In this case the
approximation is based on a series expansion of order 𝒪(ℎ^𝑘). As ℎ decreases, the total error decreases
according to 𝒪(ℎ^𝑘) because the round-off error is still several orders of magnitude smaller than the
truncation error. However, there comes a point where the round-off error becomes of similar
magnitude to, and eventually significantly larger than, the truncation error. Round-off errors will then limit the
accuracy of the result. Usually, round-off errors will increase as ℎ decreases. Figure 2-10 highlights that
running an algorithm with approximations as local as possible (small values of ℎ) will not necessarily lead to
accurate results. On the contrary, for too small values of ℎ, results may become less accurate than for
larger values of ℎ. An important task for us will be to learn how to identify at which stage an algorithm no
longer provides more accurate results, so as to avoid operating it in conditions leading to poor results.
