Numerical Errors
Numerical methods are used to find an approximation of the solution of a mathematical problem with
an error that falls below a desired user-defined accuracy. In this chapter we will discuss why numerical
methods can only provide approximations. In engineering and science, when one develops a
mathematical model to describe some reality, one must be aware of the many possible error sources. To
start with, the model itself is an approximation of this reality. Hence, inherently the model cannot be
fully accurate, for example because the assumptions made to derive the model may be wrong. Also, when developing the model, some mathematical mistakes could be introduced, and furthermore the parameters needed to feed the model are usually inaccurate. Programming errors may also occur when
coding the process to solve the model. Note that none of the aforementioned errors are specific to numerical methods; they occur in any field of engineering and science. In the remainder of this book, we will assume that these errors are either avoided, negligible, or quantifiable. However,
there are two categories of errors that are specific to the use of numerical methods. These two error
types are known as round-off errors and truncation errors. Round-off errors are due to the limited
number of digits that can be stored in memory. Truncation errors appear because we solve a simplified
and approximated problem in lieu of our actual problem. Both round-off and truncation errors will be defined and discussed in detail in this chapter. As we will see, round-off errors cannot be avoided. Truncation errors can be avoided for some algorithms, but in most cases both types of errors are present.
The difficulties begin as soon as a finite number of digits are used to represent numbers to perform
calculations. Let us illustrate this with an example.
Example 2-1 shows that adding three numbers can result in an erroneous answer. The reason behind
this phenomenon is the accumulation of round-off errors during the additions. To understand how these
round-off errors appear, a closer look at how numbers are stored in a computer is required. A computer stores numbers using base two. Technically this consists of a sequence of bits b_i, where a bit has only two states (0 or 1). This is commonly referred to as binary code. For example, the binary number x = 10100 consists of 5 bits and equals the decimal number 20, as can be verified as follows:

$x = 2^4 + 0 + 2^2 + 0 + 0 = 20$
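This conversion can be checked directly in Octave, which provides the built-in function bin2dec for converting a string of binary digits to its decimal value (a quick check):

>> bin2dec('10100')
ans = 20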
The IEEE 754 standard specifies how decimal numbers are expressed as binary numbers in a computer using floating-point numbers, or floating-point representation. A floating-point number has three parts: a sign (either positive or negative), a mantissa (a sequence of N bits b_i) and an exponent p:

$\pm\, b_1 b_2 \ldots b_N \times 2^p$
Depending on how many bits are used to store the mantissa and exponent, different families of
precision are defined in the IEEE standard. Table 2-1 lists these defined families and specifies how many
bits are used for each part of the floating-point number¹. For example, the popular double precision floating-point number has a total of 64 bits (or 8 bytes; by definition, one byte is a sequence of 8 bits). The first bit is used to encode the sign, the next 11 bits are for the exponent, and the remaining 52 are used to store the mantissa.
A double precision floating-point number can hold approximately 16 decimal digits. Indeed, typically about 3.25 bits are needed to encode a decimal digit (2^3.25 ≈ 10). As double precision has memory to store 52 bits for the mantissa, this gives about 52/3.25 ≈ 16 decimal digits.
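A quick sanity check of this rule of thumb in Octave (a rough count only; the exact number of reliable digits depends on the value being stored):

>> 52*log10(2)      % decimal digits carried by a 52-bit mantissa: about 15.7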
Obviously, without any further rules, the choice of which part goes into the mantissa and which into the exponent is open. For example, one could encode the decimal number 12.34 as 12.34 × 10^0, 1.234 × 10^1 or 0.1234 × 10^2 (and of course in many other ways). The IEEE standard defines which one of the enumerated choices is actually used to store a number in a computer. The interested reader may learn more about how floating-point numbers are handled in a computer by reading the IEEE 754 standard. The technical details, as we will realize, are not important for our purpose.
¹ The IEEE 754 standard also defines how the exponent (and its sign) is encoded in bits. For the purpose of our presentation we do not have to know how this is done technically, and we can simply consider the exponent to be encoded as a decimal number p with sign. The interested reader is referred to the IEEE 754 standard.
Expanding further on the example of representing the number 12.34 allows us to understand the terminology "floating-point" number: by keeping the same mantissa and changing the exponent, one can move around the location of the decimal point.
Let us go back to Example 2-1. The number 0.2, in binary code, would actually need an infinite number
of digits to represent it. To inspect the binary representation of a decimal number with Octave, one can
use the format bit command:
>> format bit
>> 0.2
ans = 0011111111001001100110011001100110011001100110011001100110011010
In fact, in binary code the decimal number 0.2 is equal to $0011111111001001\overline{1001}$, with the bit group 1001 repeating forever, a situation similar to $1/3 = 0.\overline{3}$ for numbers represented in base ten. Consequently, it is not possible, using a finite number of digits, to represent the decimal number 0.2 in binary code without round-off error. As the number 0.2 is added several times, the round-off error accumulates and eventually becomes significant.
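A minimal illustration of this accumulation in Octave, in the spirit of Example 2-1 (here we simply add 0.2 three times):

>> 0.2 + 0.2 + 0.2 == 0.6      % mathematically true, but numerically...
ans = 0
>> 0.2 + 0.2 + 0.2 - 0.6       % a residual of about 1.1e-16 remains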
The takeaway here is that one should never trust a numerical calculation. One must always be aware that a numerical calculation is actually wrong (from a pure mathematical point of view) and represents only an approximation. We like to refer to this fact as the "golden rule of numerical methods":
At this point it is useful to introduce the concept of significant digits. Not all authors define this concept the same way. In this book we adopt Definition 2-1. Note that we will use the terms significant digits and significant figures interchangeably.
The number 0.0201 also has three significant digits, which again are 2, 0 and 1. This shows that leading zeros are simply ignored when identifying significant digits. The number 2.010 has four significant digits, which are 2, 0, 1 and 0. Here the last 0 is significant (otherwise one would have written 2.01).
A good way to understand the concept of significant digits is to keep in mind that numbers usually represent physical quantities associated with a physical unit. If, for example, the number 20.1 represents a distance of 20.1 meters, then changing its unit to centimeters or kilometres (which moves the decimal point) does not change the knowledge we have about this distance, and consequently does not change which digits are significant.
It is straightforward to prove that f(x) = g(x) for any x where f and g are defined. Let us evaluate these two functions numerically at x = 500. We start by defining the functions f and g in Octave:
>> f = @(x) x*(sqrt(x+1)-sqrt(x));
>> g = @(x) x/(sqrt(x+1)+sqrt(x));
We used here the concept of anonymous functions to define them in Octave. Let us now evaluate these two functions at x = 500:
>> f(500)
ans = 11.175
>> g(500)
ans = 11.175
They seem to agree as expected. But if we display all digits that are stored inside our computer (which is
achieved with the format long command), we observe that actually they do not agree:
>> format long
>> f(500)
ans = 11.17475530074685
>> g(500)
ans = 11.17475530074720
The last three digits differ. This shows that at least one of the numerical function evaluations must be
wrong (maybe even both are wrong). At this stage it is impossible to know which one is correct (if any).
At this point it should become clear to the reader that round-off errors can accumulate during
calculations to result in inaccurate answers. Another phenomenon that can take place is known as loss
of significance. This happens when, for example, two numbers are added, both of which can be free of round-off errors, but their sum requires more digits than can possibly be stored in the computer memory. For example, let's imagine we are using a hypothetical computer that can only store four digits at a time. Let's also assume, for the sake of this example, that this hypothetical computer stores numbers using base 10. Let's consider the two decimal numbers 1234 and 0.12. Both of those numbers can be stored without round-off error. But their sum (which would be 1234.12) cannot be stored free of round-off error.
In numerical algorithms, loss of significance often happens when subtracting numbers that are nearly equal. For illustrative purposes, let's imagine we are using a hypothetical computer that can only store 3 digits in its mantissa. Again, for the sake of this example, let's assume the numbers are stored in base ten. Using this hypothetical computer, let's perform the operation √4.01 − 2. As only 3 decimal digits can be stored, the result becomes:
$\sqrt{4.01} - 2 = \underbrace{2.00}_{3\ \text{s.d.}} - \, 2.00 = 0.00$
In this extreme example, we end up with a result which has all its significant digits wrong. One says that
all significant digits were lost.
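The same phenomenon occurs with double precision numbers. A small sketch: for a small value of x, the numbers 1 and cos(x) are nearly equal, so the subtraction 1 − cos(x) loses all significant digits and the computed ratio below is 0 instead of the exact value of about 0.5:

>> x = 1e-8;
>> (1 - cos(x)) / x^2      % exact value is about 0.5; all significant digits are lost
ans = 0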
The crucial takeaway from the above examples is that whenever calculations involve floating-point
numbers, it is nearly certain that the result will be wrong. Any numerical calculation introduces round-
off errors, which are challenging to predict. Another way to express this fact is to recognize that any
calculations carried out with floating-point numbers will lead to the loss of some significant digits.
Observation 2-1 summarizes this discussion.
Since round-off errors are not avoidable in numerical calculations, the only solution is to learn to live with them. The consequence is that we will have to develop methods able to provide estimates of errors (recall that a good estimate is an estimate which is larger than or equal to the actual error). Section 2.3 will give a first set of methods for estimating errors.
If calculations can be performed with infinite precision, then 𝜀𝑚 is equal to zero. But in numerical
calculations, where only a finite number of digits are available, 𝜀𝑚 will not be equal to zero. Octave,
which uses double precision floating-point numbers, stores 𝜀𝑚 in the built-in variable eps:
>> eps
ans = 2.220446049250313e-16
We can verify that indeed 1 + x = 1 for any x ≤ 0.5 εm². The following lines of Octave code demonstrate this by adding 10^-17 to the number one:
>> format long
>> x=1E-17;
>> 1+x
ans = 1
Octave computed that 1 + x, with x = 10^-17, is exactly equal to the decimal number 1. Numerically, Octave is unable to distinguish between the two numbers 1 + 10^-17 and 1.

² The factor 0.5 in front of εm comes from the rounding convention Octave uses.
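The threshold 0.5 εm can be made visible with a small sketch:

>> 1 + eps == 1         % eps itself lies above the threshold 0.5*eps
ans = 0
>> 1 + 0.4*eps == 1     % 0.4*eps lies below 0.5*eps and is rounded away
ans = 1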
Definition 2-2, which uses the number 1 as reference, can be generalized to an arbitrary number 𝛿.
Let us illustrate Theorem 2-1 with the number δ = 3279. In this case, for any number x smaller than (roughly) 0.5 ∙ 3279 ∙ εm, the relation 3279 + x = 3279 holds. For example:
>> x = 0.25*3279*eps;
>> 3279+x == 3279
ans = 1
The reader is encouraged to repeat this example with other values of 𝑥 (some smaller and some larger
than 0.5 ∙ 3279 ∙ 𝜀𝑚 ).
Figure 2-1 gives a visual representation of Theorem 2-1 and Observation 2-2. The result of the numerical evaluation of y = (1 + ∆x) − 1 is plotted as a function of ∆x. If y could be evaluated with infinite accuracy, then of course y = ∆x. But when using floating-point numbers, the situation is different. Due to round-off errors, 1 + ∆x evolves stepwise with ∆x, and the quantity y = (1 + ∆x) − 1 differs from ∆x.
(2-3)  $e^x = 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!} + \cdots = \sum_{n=0}^{\infty} \frac{x^n}{n!}$
To give an approximation of the exponential function, one can decide to include, for example, only the
first three terms of the series:
(2-4)  $e^x \cong 1 + x + \frac{x^2}{2!}$
The truncation error is then the difference between the exact value of e^x and the approximation including only the first three terms of the series (as always, errors are measured in absolute values):

(2-5)  $\text{Truncation error} = \left| e^x - \left( 1 + x + \frac{x^2}{2!} \right) \right|$
Note that the truncation error does not include the round-off errors. In other words, it is assumed that
the various expressions can all be evaluated with an infinite number of digits.
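As a small illustration (assuming that Octave evaluates exp with an error far below the truncation error), the truncation error of approximation (2-4) at x = 0.5 can be computed directly:

>> x = 0.5;
>> abs(exp(x) - (1 + x + x^2/2))      % about 0.024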
The term

$T_k(h) = f(x) + f'(x)\,h + \frac{f''(x)}{2!}h^2 + \cdots + \frac{f^{(k)}(x)}{k!}h^k = \sum_{n=0}^{k} \frac{f^{(n)}(x)}{n!}\,h^n$

is called the Taylor polynomial of order k, or the order-k approximation of the function f at x + h.
Taylor series expansion is a powerful tool used to generate various approximations of mathematical expressions. For example, if a function f is known at some point x, one can obtain an approximation of f at some other point x + h. Figure 2-2 illustrates this idea. It suggests that the more terms are included in the Taylor series expansion, the higher the accuracy of the approximation becomes. This is correct when h is sufficiently small (the term "sufficiently small" will be discussed below).
Taylor's theorem can also provide an expression for the truncation error E_trunc by evaluating the remainder term of the series:

$E_{trunc} = \left| \frac{f^{(k+1)}(\xi)}{(k+1)!}\, h^{k+1} \right| \quad \text{with } \xi \in [x, x+h]$

Even though ξ ∈ [x, x + h] is unknown, one can give an upper bound by considering the maximum of the Taylor remainder over [x, x + h]:

$E_{trunc} \le \max_{\xi \in [x, x+h]} \left| \frac{f^{(k+1)}(\xi)}{(k+1)!} \right| \, h^{k+1}$
Alternative ways of obtaining upper bounds on the truncation error exist, but they are problem specific. Example 2-5 illustrates such a case.
(2-11)  $\cos(x) = \cos(0 + x) = \cos(0) + x \left. \frac{d}{dx}\cos(x) \right|_{x=0} + \frac{x^2}{2!} \left. \frac{d^2}{dx^2}\cos(x) \right|_{x=0} + \cdots$

(2-12)  $\cos(x) = 1 - \frac{x^2}{2!} + \frac{x^4}{4!} - \frac{x^6}{6!} + R$
where R is the Taylor remainder of order 7. Based on the Taylor remainder term (2-7), an upper bound on the truncation error is:

(2-13)  $\text{Truncation error} \le \left| \frac{x^7}{7!} \right|$
because the derivative of order k + 1 = 7 of cos(x) always lies between −1 and 1, regardless of the point at which it is evaluated. However, in the present example, a more precise estimate of the upper bound on the truncation error can be obtained. Indeed, if we are interested in small values of x (for example x = 0.3), then we can estimate the truncation error as follows:
(2-14)  $\text{Truncation error} = \left| \frac{x^8}{8!} - \frac{x^{10}}{10!} + \frac{x^{12}}{12!} - \cdots \right|$
For small x, the terms in this sum decrease very rapidly, and we may approximate the truncation error by including only the first of them:
(2-15)  $\text{Truncation error} \cong \left| \frac{x^8}{8!} \right|$
Note that this estimate is more precise than estimate (2-13). As an example, let us approximate cos(0.3):

(2-16)  $\cos(0.3) \cong 1 - \frac{0.3^2}{2!} + \frac{0.3^4}{4!} - \frac{0.3^6}{6!} \cong 0.95533649$

The corresponding truncation error is about

(2-17)  $\text{Truncation error} \cong \left| \frac{0.3^8}{8!} \right| \cong 1.6 \cdot 10^{-9}$
Assuming Octave can compute cos(0.3) exactly (or at least with an error lower than our estimated truncation error), we obtain for the true error³:
>> format short E
>> x=0.3;
>> abs(1 - x^2/2 + x^4/factorial(4) - x^6/factorial(6) - cos(x))
ans = 1.6256E-09
This confirms that our truncation error estimate (2-17) is indeed valid, as it is conservative (slightly larger than the true error).
It would be wrong to conclude that adding more terms to the Taylor series expansion will always reduce the truncation error. In fact, the Taylor series expansion must be conducted sufficiently close to the reference point. Sufficiently close means that the expansion must be around a value within the radius of convergence of the Taylor series. Definition 2-4 recalls the definition of the radius of convergence of a power series.

³ The command format short E is used to present the answers in a more human-friendly way.
The radius of convergence of this power series is the largest positive number r such that the series (2-18) converges as n → ∞ for every x satisfying |x − a| < r.
Example 2-6 illustrates the importance of keeping the Taylor series expansion within its radius of convergence.
(2-19)  $\ln(1 + x) = x - \frac{x^2}{2} + \frac{x^3}{3} - \frac{x^4}{4} + \cdots + (-1)^{k-1}\frac{x^k}{k} + (-1)^{k}\frac{x^{k+1}}{(k+1)\,\xi^{k+1}}$

with ξ ∈ [1, 1 + x]. Note that x must be strictly larger than −1 for ln(1 + x) to be defined. Written as a series, Equation (2-19) becomes
(2-20)  $\ln(1 + x) = \sum_{k=1}^{\infty} (-1)^{k-1} \frac{x^k}{k}$
which is known from calculus to have a radius of convergence of 1⁴. Consequently, the truncation error of an order-k approximation is

(2-21)  $\text{Truncation error} = \left| \frac{x^{k+1}}{(k+1)\,\xi^{k+1}} \right|$
Figure 2-3 illustrates this behavior. Consider x such that −1 < x < 1. In this case, when increasing the order of the Taylor series, better approximations of the function f(x) = ln(1 + x) are obtained. The truncation error (2-21) decreases as the order k of the Taylor series increases. The situation is different for x > 1. In this case, increasing the order of the Taylor series results in approximations of the function f(x) = ln(1 + x) that become less and less accurate. This time the truncation error (2-21) increases as the order k of the Taylor series increases.

⁴ To be precise, series (2-20) converges as well for x = 1, i.e. for −1 < x ≤ 1.
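This behavior can be reproduced numerically. The sketch below uses a small helper, here called Tln, that evaluates the order-k Taylor polynomial of ln(1 + x); the magnitudes quoted in the comments are approximate:

>> Tln = @(x, k) sum((-1).^(0:k-1) .* x.^(1:k) ./ (1:k));   % order-k Taylor polynomial of ln(1+x)
>> abs(Tln(0.5, 5) - log(1.5))     % inside the radius of convergence: about 1.8e-03
>> abs(Tln(0.5, 20) - log(1.5))    % increasing the order helps: about 1.5e-08
>> abs(Tln(2, 5) - log(3))         % outside the radius of convergence: about 4
>> abs(Tln(2, 20) - log(3))        % increasing the order makes it worse: about 3e+04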
(2-22)  $\lim_{x \to a} \frac{f(x)}{g(x)} = K$
One writes 𝑓(𝑥) = 𝒪(𝑔(𝑥)) for 𝑥 → 𝑎 which reads as: 𝑓 is of order 𝑔 as 𝑥 tends to 𝑎.
When a function f is of order 𝒪(g), this means intuitively that as x approaches a, the functions f and g become increasingly similar (up to the constant K). Consequently, if we know that a function f is of order g as x → a, we possess more information than by merely evaluating f(a). Indeed, we further know how f(x) behaves around x = a: as f and g become increasingly similar as x → a, we can replace the function f by the function g in the vicinity of x = a.
Let us apply this concept to the truncation error of a Taylor series expansion. We would like to quantify how this truncation error behaves as h → 0. Consider a Taylor series expansion T_k(h) of order k of a function f around x. Equation (2-9), which gives the truncation error when replacing f(x + h) by T_k(h), suggests that this error is of order h^{k+1}. This is indeed the case. For h within the radius of convergence of the Taylor series expansion of f(x + h) and f^{(k+1)}(x) ≠ 0, one can write⁵

(2-23)  $\lim_{h \to 0} \frac{|f(x+h) - T_k(h)|}{h^{k+1}} = \lim_{h \to 0} \frac{1}{h^{k+1}} \left| \frac{f^{(k+1)}(x)}{(k+1)!}h^{k+1} + \frac{f^{(k+2)}(x)}{(k+2)!}h^{k+2} + \cdots \right| = \left| \frac{f^{(k+1)}(x)}{(k+1)!} \right|$

⁵ The case where f^{(k+1)}(x) = 0 will be discussed separately shortly.
Equation (2-23) confirms that the truncation error of the Taylor series expansion of f(x + h) is of order h^{k+1} if f^{(k+1)}(x) ≠ 0. Using the big O notation, the Taylor series expansion of order k can therefore be written as

(2-24)  $f(x + h) = T_k(h) + \mathcal{O}(h^{k+1})$
Let us once again stress the intuitive meaning of 𝒪(h^{k+1}). As h → 0, the truncation error becomes increasingly similar to h^{k+1}. In fact, from Equation (2-23), one can write

(2-25)  $\text{Truncation error} \cong K \cdot h^{k+1}$

with $K = \left| \frac{f^{(k+1)}(x)}{(k+1)!} \right|$. Equation (2-25) becomes more accurate as h grows smaller. Example 2-7 illustrates the usage of this equation.
(2-26)  $\sqrt{1 + h} = T_3(h) + \mathcal{O}(h^4) = 1 + \frac{h}{2} - \frac{h^2}{8} + \frac{h^3}{16} + \mathcal{O}(h^4)$
The truncation error is of order 4, which means that as h → 0, the truncation error becomes increasingly similar to |K ∙ h^4| with the coefficient K given by

(2-27)  $K = \left| \frac{f^{(4)}(x = 1)}{4!} \right| = \frac{5}{128}$
Similarly, we can obtain approximations of √(1 + h) with different orders and their corresponding truncation error estimates.
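A quick numerical check of the order-3 case, using for example h = 0.01 (a small sketch; the values quoted in the comments are approximate):

>> h = 0.01;
>> T3 = 1 + h/2 - h^2/8 + h^3/16;
>> abs(sqrt(1 + h) - T3)     % true truncation error: about 3.9e-10
>> 5/128 * h^4               % estimate |K*h^4| from (2-27): about 3.9e-10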
Figure 2-4 illustrates how the truncation errors for various orders of approximation behave. The true truncation errors, computed by evaluating numerically |√(1 + h) − T_k(h)|, are plotted for a range of values of h and expansion orders from k = 0 to k = 3. Furthermore, the estimates of the truncation errors based on |K ∙ h^{k+1}| are added as well. We can observe how the truncation errors become increasingly similar to |K ∙ h^{k+1}| as h → 0.
Plots depicting the truncation error as a function of some parameter controlling its behavior, the parameter h in our case, are important tools in numerical methods for understanding how algorithms behave. They will further be very useful in developing tools and methods to estimate the errors introduced by algorithms during their numerical calculations.
Equation (2-24) tells us that the truncation error of a Taylor series of order k (i.e. of using a Taylor polynomial T_k(h) to replace the original function) is of order 𝒪(h^{k+1}). Let us note that if the function f is such that f^{(k+1)}(x) = 0, then the truncation error is of order 𝒪(h^{k+2}), which is of higher order than 𝒪(h^{k+1}). This statement can be proven by calculations similar to Equation (2-23). Example 2-8 illustrates such a case.
(2-28)  $\cos(x) = 1 - \frac{x^2}{2!} + \frac{x^4}{4!} - \frac{x^6}{6!} + R(x)$
The truncation error R(x) = |cos x − T_6(x)| is of order 𝒪(x^8). To verify this, consider the expression of the remainder term R:
(2-29)  $R(x) = \left| \frac{x^8}{8!} - \frac{x^{10}}{10!} + \frac{x^{12}}{12!} - \cdots \right|$
(2-30)  $\lim_{x \to 0} \frac{R(x)}{x^8} = \lim_{x \to 0} \left\{ \frac{1}{8!} - \frac{x^2}{10!} + \frac{x^4}{12!} - \cdots \right\} = \frac{1}{8!}$
In other words: as x → 0, the error in replacing cos x by the Taylor series of order 6 becomes increasingly similar to x^8/8!.
The big O notation has some properties that will be used frequently.
The proof of property 1 is straightforward and based on the definition of the big O notation. Consider two expressions f(h) and g(h) of order 𝒪(h^k). Their sum is of order 𝒪(h^k) as well, since

(2-31)  $\lim_{h \to 0} \frac{f(h) + g(h)}{h^k} = \lim_{h \to 0} \frac{f(h)}{h^k} + \lim_{h \to 0} \frac{g(h)}{h^k} = K_f + K_g$
Similarly, we can prove property 2. Let us now discuss the third property. An expression of order 𝒪(h^k) will become increasingly similar to K₁ ∙ h^k, whereas an expression of order 𝒪(h^n) will become similar to K₂ ∙ h^n. Now, regardless of how small K₂ may be, for a small enough h one will eventually have K₁ ∙ h^k < K₂ ∙ h^n (because k > n), which proves property 3.
Property 3 justifies notations involving the approximation sign '≅'. For example, when we write

$f(x + h) \cong f(x) + f'(x) \cdot h$

we imply that this expression is evaluated numerically using a value of h sufficiently small such that the higher order terms, of order 2 in this case, are negligible compared to the first order term.
Taylor series can help us find an estimate of the output error given an input error. Writing the input error as ε, i.e. evaluating f at x_c = x + ε instead of at x, we write

(2-33)  $f(x_c) = f(x + \varepsilon) = f(x) + f'(x)\,\varepsilon + \mathcal{O}(\varepsilon^2)$
For the moment, we assume that f′(x) ≠ 0. The case where f′(x) = 0 will be discussed at the end of this section. If f′(x) ≠ 0, we can estimate, using Equation (2-33), the output error as a first order expression in the input error ε (i.e. the input error is small enough such that the second order terms 𝒪(ε²) are negligible compared to ε f′(x)):

(2-34)  $\Delta y = |f(x + \varepsilon) - f(x)| \cong |f'(x)| \cdot |\varepsilon|$

From Equation (2-34) we can also obtain a relation linking the relative errors instead of the absolute errors:

(2-35)  $\left| \frac{\Delta y}{f(x)} \right| \cong \left| \frac{f'(x) \cdot x}{f(x)} \right| \cdot \left| \frac{\varepsilon}{x} \right|$
Equations (2-34) and (2-35) are the motivation for Definition 2-6:
(2-37)  $c_f(x) = \left| \frac{f'(x) \cdot x}{f(x)} \right|$
It is important to realize that the condition number of a function only allows us to estimate the so-called unavoidable error, which is the error in the output due to the error in the input of the function. The actual numerical error may be larger than the estimate obtained with the condition number. Indeed, when evaluating the function numerically, round-off errors are also introduced, and these are not considered in the approach using the condition number, which assumes that the function can be evaluated with infinite accuracy. This observation is summarized in Theorem 2-4.

⁶ In chapter 3, which discusses the numerical solution of algebraic equations, the mathematical expression of the condition number will be the inverse of the one from Definition 2-6. The reason is that inputs and outputs are exactly reversed compared to the situation discussed here.
where ∆y = |f(x + ε) − f(x)| is the unavoidable output error (the error in the output assuming f can be evaluated with infinite precision) and C_f(x) is the absolute condition number of f.
(2-39)  $\left| \frac{\Delta y}{f(x)} \right| \cong c_f(x) \cdot \left| \frac{\varepsilon}{x} \right|$
Example 2-9 illustrates how the concept of condition number of a function can be used to estimate the
propagation of input errors through calculations.
This means that if we evaluate the function f at x + ε, the result will have an absolute error of about ∆y ≅ C_f(x) ∙ ε = 2|x| ∙ ε. The initial error is multiplied by the factor 2|x|. We can verify this by evaluating the function f numerically at x = √2. The correct answer would be f(√2) = 2. Octave finds an absolute error of ∆y ≅ 4 ∙ 10^-16:
>> f=@(x) x^2;
>> f(sqrt(2))-2
ans = 4.4409e-16
This agrees with the condition number calculated in Equation (2-40). Indeed, when evaluating √2 numerically, a round-off error of about ε ≅ 0.5 ∙ √2 ∙ εm is introduced (refer to Observation 2-2). Consequently, the expected absolute error when evaluating f(√2) will be about

(2-41)  $\Delta y \cong 2\sqrt{2} \cdot 0.5 \cdot \sqrt{2} \cdot \varepsilon_m = 2\,\varepsilon_m \cong 4.4 \cdot 10^{-16}$

which is similar to the error we found when evaluating f(√2) − 2 numerically with Octave.
So far, we have considered situations where f′(x) ≠ 0. If f′(x) = 0, then according to Theorem 2-4, the first order estimate of the unavoidable error is equal to zero. As we will see, in such a case the unavoidable error is indeed small, but not equal to zero. When a more precise estimate is required, a higher order estimate in the input error needs to be developed. For this purpose, we use the Taylor series expansion of f(x_c) = f(x + ε), include higher order terms and take into account that f′(x) = 0. We obtain⁷:
(2-42)  $f(x_c) = f(x + \varepsilon) = f(x) + \frac{1}{2!} f''(x)\,\varepsilon^2 + \mathcal{O}(\varepsilon^3)$
The unavoidable error becomes, for small enough input errors (i.e. small enough such that the third order terms 𝒪(ε³) are negligible compared to the second order term $\frac{1}{2!} f''(x)\,\varepsilon^2$):
(2-43)  $|f(x_c) - f(x)| \cong \left| \frac{1}{2!} f''(x) \cdot \varepsilon^2 \right| = \left| \frac{1}{2!} f''(x) \right| \cdot \varepsilon^2$
The unavoidable error is now of second order in the input error |ε|.
Let us first determine the round-off error of our numerical approximation of π. For this we use the built-in variable pi of Octave:
>> format short E
>> abs(3.14159-pi)
ans = 2.6536E-06
The input error |ε| is about 2.7 ∙ 10^-6.
As $\frac{d}{dx}\cos(x) = -\sin(x)$ and sin π = 0, we have to use a second order approximation for the unavoidable error:

(2-44)  $|\cos(3.14159) - \cos \pi| \cong \left| -\frac{1}{2!} \cos \pi \cdot \varepsilon^2 \right| = \frac{1}{2}\,\varepsilon^2 \cong 3.5 \cdot 10^{-12}$
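This estimate can be checked directly in Octave (assuming cos itself is evaluated with an error far below the unavoidable error):

>> abs(cos(3.14159) - cos(pi))    % about 3.5e-12, in line with estimate (2-44)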
⁷ If f″(x) = 0, or higher order derivatives are zero too, an even higher order expansion is required. The principle remains the same as outlined in the text.
Let us now turn to the round-off error introduced when evaluating a function numerically. Contrary to the unavoidable error, there is no simple formula to estimate it. Further analysis is required. We present here one methodology that can be used.
Its principle is simple. We do not know how to compute f(x) exactly (i.e. we do not know how many digits of the numerical evaluation of f are correct). However, based on Theorem 2-4, we know the error associated with computing f(x + h), assuming no round-off error in the evaluation of f. Consequently, the difference between |f(x + h) − f(x)| and |f′(x) ∙ h|, when the various quantities are evaluated numerically, equals the round-off error associated with |f(x + h) − f(x)|. To obtain the round-off error of f(x), we need to conduct this analysis for h tending towards zero.
Let us put this idea into action. When evaluating any function f at a value x, the numerical answer, which will be written f_fl(x), has a round-off error. We can write

(2-45)  $f_{fl}(x) = f(x) + \varepsilon$

where f_fl is the numerical evaluation of f using floating-point numbers and ε is the corresponding accumulated round-off error.
Similarly, evaluating f(x + h) numerically will produce a numerical answer f_fl(x + h) with a corresponding round-off error ε′:

(2-46)  $f_{fl}(x + h) = f(x + h) + \varepsilon'$

Subtracting (2-45) from (2-46) and using the first order approximation f(x + h) − f(x) ≅ f′(x) ∙ h gives

(2-47)  $f_{fl}(x + h) - f_{fl}(x) \cong f'(x) \cdot h + \varepsilon' - \varepsilon$

(2-48)  $\left| f_{fl}(x + h) - f_{fl}(x) - f'(x) \cdot h \right| \cong |\varepsilon' - \varepsilon|$

Note that the round-off errors ε and ε′ will either add up or cancel each other out, depending on the values of x and h.
By evaluating Equation (2-48) for small values of h, a conservative estimate of ε is obtained. Indeed, if h is small, it seems reasonable to assume that the round-off errors ε and ε′ are of similar magnitude. Their difference is then either close to zero or close to 2|ε|, depending on whether the round-off errors cancel each other out or add up. Consequently, for small values of h:

(2-49)  $\left| f_{fl}(x + h) - f_{fl}(x) - f'(x) \cdot h \right| \cong 2|\varepsilon|$

where ε is the accumulated round-off error made when evaluating f numerically at x. Hence, Equation (2-49) can be used to obtain a conservative estimate of the round-off error.
Note that so far, our analysis has assumed that 𝑥 is free of error. In practice, this is seldom the case. The
reader is encouraged to repeat calculations from Equation (2-45) to (2-49) and assess how an input error
in 𝑥 will impact the various calculation steps. In fact, any effect an input error may produce will cancel
out in Equation (2-47) and consequently Equation (2-49) remains valid.
In practice, using Equation (2-49) requires an adequate choice of ℎ. A lower limit is given by the effect of
the round-off error on the quantity 𝑥 + ℎ. One needs to choose ℎ large enough such that 𝑥 + ℎ and 𝑥
can be distinguished numerically. In other words, ℎ needs to be at least equal to 0.5𝜀𝑚 ∙ 𝑥 (refer to
Observation 2-2). As for the upper limit of ℎ, there is no general rule that can be given. The value of ℎ
must be small enough such that the approximation 𝑓(𝑥 + ℎ) − 𝑓(𝑥) ≅ 𝑓 ′ (𝑥) ∙ ℎ remains acceptable.
Plotting |f_fl(x + h) − f_fl(x)| versus h can help in judging whether this is the case. Example 2-11 illustrates how Equation (2-49) can be used to estimate the round-off error introduced when numerically evaluating the square root function.
As a first example, let us estimate the round-off error made when evaluating √16 numerically. In this case we know that √16 = 4, which allows us to verify our methodology. Octave computes
>> format long
>> sqrt(16)-4
ans = 0
No digits are lost during the evaluation of √16, as all 16 decimal digits of the solution are correct. Now, let us reach the same conclusion by applying Equation (2-49) to estimate the round-off error associated with the numerical evaluation of √16. For this we compute numerically

$\Delta y_{fl} = f_{fl}(16 + h) - f_{fl}(16) = \sqrt{16 + h} - \sqrt{16}$

for a range of values of h. We use 0.5 ∙ 16 εm as the lower limit and then increase h logarithmically (by multiplying it each time by 2). Let us note that for h = 0.5 ∙ 16 εm we obtain ∆y_fl = 0:
>> sqrt(16+8*eps)-sqrt(16)
ans = 0
This indicates that, for this particular value of h, the numbers 16 + h and 16 cannot be distinguished numerically, so the computed difference carries no significant digits. For larger values of h this is no longer the case, as seen in Figure 2-6 (a). Figure 2-6 (a) also shows that the approximation $f(x + h) - f(x) \cong f'(x) \cdot h = \frac{h}{2\sqrt{x}}$ is reasonable over the range of selected h values.
Figure 2-6 (b) depicts the difference between ∆y_fl and $f'(x) \cdot h = \frac{h}{2\sqrt{x}}$. This is, according to Equation (2-49), an estimate of twice the round-off error introduced when evaluating √16 numerically. The figure reveals that the round-off error associated with the numerical computation of √16 is about 2 ∙ 10^-16. This is smaller than 0.5 ∙ √16 ∙ εm ≅ 4.5 ∙ 10^-16, which means that all the computed decimal digits are correct.
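The data behind a plot such as Figure 2-6 (b) can be generated with a few lines of Octave (a sketch; the range of h used here is one possible choice):

>> x = 16;
>> h = 0.5*x*eps * 2.^(0:20);        % from the lower limit 0.5*x*eps upwards, doubling each time
>> dyfl = sqrt(x + h) - sqrt(x);     % numerically evaluated differences
>> est = abs(dyfl - h/(2*sqrt(x)));  % by Equation (2-49), roughly twice the round-off error
>> max(est)                          % a few times 1e-16, consistent with Figure 2-6 (b)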
The round-off error introduced when numerically evaluating the square root function depends on the value of x at which the evaluation takes place. Let us repeat the same exercise, but this time for √401. In this case Octave computes
>> format long
>> sqrt(401)
ans = 20.02498439450079
Let us try to estimate how many of the computed digits are correct. For this we need to study the behaviour of the numerical evaluation of

$\Delta y_{fl} = \sqrt{401 + h} - \sqrt{401}$

as a function of h. Figure 2-7 (a) confirms that the approximation $f(x + h) - f(x) \cong f'(x) \cdot h = \frac{h}{2\sqrt{x}}$ is reasonable for all values of h considered. From Figure 2-7 (b) we can estimate the round-off error made when evaluating √401. We find a round-off error of about 2 ∙ 10^-15. This indicates that the last digit (maybe even the last two, depending on how the round-off error is added or subtracted) of the numerical solution of √401 is wrong. We conclude that 20.024984394500 are all correct digits of √401.
Reducing the impact of round-off errors can only be accomplished by increasing the number of digits used in the calculations. Reformulating the mathematical expressions used in the calculations (for example to minimize effects such as loss of significance) may mitigate the impact of round-off errors, but not as efficiently as employing a larger number of digits during the computations.
The truncation error, on the other hand, can be reduced via two strategies derived from the Taylor series expansion. The first strategy consists in employing a higher order expansion, whereas the second is to conduct the expansion as locally as possible, i.e. by reducing the magnitude of h in Equation (2-6). Both strategies, however, have their limitations. As the truncation error is reduced by using a higher order expansion, the higher order terms themselves become increasingly small in magnitude. This may lead to an increase in the round-off error, thereby negating the benefit of the higher order expansion. A similar effect occurs when reducing the value of h. The following two examples illustrate this problem.
The exact value of Euler's number to sixteen significant digits is e = 2.718281828459045. Octave stores this value in the internal variable e. Figure 2-8 depicts the true error when approximating this 16-digit value of Euler's number with the k-th order Taylor series expansion T_k(x = 1) of the exponential function. As expected, the error decreases with increasing order k. But once the 17th order is reached, the error no longer decreases. This is due to the additional terms in the Taylor series expansion having become so small that the numbers T_k(x = 1) and T_{k+1}(x = 1) are no longer distinguishable numerically:
>> format long
>> Tk = @(k) sum(1./factorial(linspace(0,k,k+1)));
>> Tk(17)
ans = 2.718281828459046
>> Tk(18)
ans = 2.718281828459046
The conclusion is that the numerical evaluation of the Taylor series T_k(1) cannot approximate e to 16 significant digits despite using high order expansions (the last digit is wrong). For example, if we use an expansion of order k = 19, the truncation error is about

(2-52)  $\text{Truncation error} \cong \frac{1}{20!} \cong 4 \cdot 10^{-19}$
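A quick check of the remaining error confirms that it sits at the round-off level rather than at the truncation level (reusing the anonymous function Tk defined above):

>> abs(Tk(19) - e)     % about 4e-16: one unit in the last digit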
The next example illustrates the issues that algorithms face if one attempts to reduce the truncation
error by conducting the expansion as locally as possible.
(2-54)  $f'(x) = \frac{f(x + h) - f(x)}{h} + \mathcal{O}(h)$
⁸ Note that the error 4 ∙ 10^-16 is about the expected round-off error when representing e by a floating-point number. Indeed, 0.5 ∙ e ∙ εm ≅ 3 ∙ 10^-16.
Let us apply Equation (2-54) to the exponential function f(x) = e^x. We aim to estimate the first derivative at x = 0. The numerical approximation becomes
(2-55)  $\left. \frac{d}{dx} e^x \right|_{x=0} \cong \frac{e^h - 1}{h}$
whereas the exact value is f′(0) = 1. Figure 2-9 displays the true absolute error of the numerical estimate (2-55) as a function of h, computed as

(2-56)  $\text{True error} = \left| \frac{e^h - 1}{h} - 1 \right|$
As expected, as h decreases, so does the error. At first, the total error consists essentially of the truncation error. This error decreases with order h, as predicted by Equation (2-54). However, at a certain point, when h reaches values below about 10^-8, the true error starts increasing again. This occurs because the round-off errors are no longer negligible and contribute significantly to the total error.
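The trade-off can also be seen by evaluating the true error (2-56) for a few values of h (a small sketch; the exact figures depend on the rounding details):

>> err = @(h) abs((exp(h) - 1)./h - 1);
>> err(1e-2)      % truncation dominates: roughly 5e-3 (about h/2)
>> err(1e-8)      % near the optimum: on the order of 1e-8
>> err(1e-12)     % round-off dominates: on the order of 1e-4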
It is possible to obtain a theoretical expression for the total error by a careful analysis of the approximation of the first derivative. Writing the numerically evaluated quantities as f_fl(x + h) = f(x + h) + ε₁ and f_fl(x) = f(x) + ε₂, where f_fl stands for the floating-point representation of the result of the function evaluation and ε₁ and ε₂ are the round-off errors, rearranging and simplifying these expressions leads to a total error made of two contributions. The first is the truncation error, which in our case is of order 𝒪(h). The second is the accumulated round-off error, which in this case is of order 𝒪(h⁻¹).
The two aforementioned examples illustrate the important concept of the round-off and truncation error trade-off. The total error consists of two contributions: the round-off error and the truncation error. While the truncation error may be reduced by using higher order expansions or by conducting more local expansions, the round-off errors will unfortunately increase by doing so. This showcases the limitations of numerical methods and demonstrates why the total error cannot be decreased indefinitely.