Chapter One
Number System and Numerical Error Analysis
1.1 Introduction
This section is devoted to reviewing general information concerned with the quantification of errors, such as the concept of significant digits.
Significant digits:
Significant digits are important in showing the confidence one has in a reported number. For example, if someone asked me what the population of my city is, I would simply respond, "The population of the Addis Ababa area is 6 million!" Is it exactly six million? "I don't know! But I am quite sure that it is not ten."
The problem comes if someone else were going to give me $100 for every citizen of the city; I would have to get an exact count in that case. That count would be 5,228,000 this year.
So you can see that in the statement that the population is 6 million, there is only one significant digit, i.e., 6, while in the statement that the population is 5,228,000, there are seven significant digits. That means I am quite sure about the accuracy of the number up to the seventh digit. So, how do we differentiate the number of correct digits in 6,000,000 and 5,228,000? Well, for that, one may use scientific notation. For our data we can write

6,000,000 = 6 × 10^6
5,228,000 = 5.228 × 10^6

to signify the correct number of significant digits.
Accuracy: refers to how closely a computed or measured value agrees with the true value.
Precision: refers to how closely individual computed or measured values agree with each other.
Computational Methods 1
Addis Ababa Science and Technology University
Department Of Electrical and Computer Engineering
Error in solving an engineering or science problem can arise due to several factors. First, the error
may be in the modeling technique. A mathematical model may be based on using assumptions
that are not acceptable. For example, one may assume that the drag force on a car is proportional
to the velocity of the car, but actually it is proportional to the square of the velocity of the car.
This by itself can create huge errors in determining the performance of the car, no matter how accurate the numerical methods you use are.
Second, errors may arise from mistakes in programs themselves or in the measurement of
physical quantities. But, in application of numerical methods itself, the two errors we need to
focus on are:
- Round off error and,
- Truncation error
A computer can only represent a number approximately because there is no mechanism to deal
with infiniteness in computers. For example, a number like 1/3 may be represented as 0.333333
on a PC. The round-off error generated in this case is

1/3 − 0.333333 = 0.000000333…
There are other numbers that cannot be represented exactly. For example, π and √2 are numbers
that need to be approximated in computer calculations.
Truncation error is defined as the error caused by truncating a mathematical procedure. For
example, the Maclaurin series for e^x is given as

e^x = 1 + x + x^2/2! + x^3/3! + …
This series has an infinite number of terms, but when using it to calculate e^x only a finite number of terms can be used in computation. For example, if one uses three terms in the calculation, then the resulting truncation error becomes
Truncation Error = e^x − (1 + x + x^2/2!) = x^3/3! + x^4/4! + …
Clearly we cannot avoid truncation errors; the question should rather be how we can control them. We can use the concept of relative approximation in order to do so. For the time being, let us assume that this can be achieved by setting a particular acceptable error
magnitude and deciding on the minimum number of terms involved in our calculation. With this assumption in mind, one can conclude that increasing the number of terms decreases the truncation error.
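Under this assumption, the effect of the number of terms on the truncation error can be sketched in Python (a minimal illustration; the function name and the term counts chosen are mine):

```python
import math

def exp_series(x, n_terms):
    """Approximate e^x with the first n_terms terms of its Maclaurin series."""
    total = 0.0
    term = 1.0                 # first term: x^0 / 0!
    for k in range(n_terms):
        total += term
        term *= x / (k + 1)    # next term: multiply by x/(k+1)
    return total

# The truncation error shrinks as more terms are kept:
for n in (3, 6, 10):
    print(n, abs(math.exp(1.0) - exp_series(1.0, n)))
```

With three terms the error at x = 1 is e − 2.5 ≈ 0.218; each additional term reduces it further.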
The total numerical error is the sum of the truncation error and the round-off error. We can minimize round-off errors by increasing the number of significant digits. In contrast, subtractive cancellation¹ and an increase in the number of computations (i.e., the number of terms involved) will intensify round-off errors.
Truncation error, on the other hand, can be reduced by decreasing the step size, which in turn can increase subtractive cancellation or the number of computations.
All of the above simply means that decreasing truncation error tends to increase round-off error, so we are forced to trade off between the two errors in finding the suitable step size for a particular computation.
The basic unit of memory is the binary digit, called a bit. A bit may contain a 0 or a 1. It is the simplest possible unit; a device capable of storing only zeros could hardly form the basis of a memory system.
People often say that computers use binary arithmetic because it is "efficient." What they mean (although they rarely realize it) is that digital information can be stored by distinguishing between different values of some continuous physical quantity, such as voltage or current. The more values that must be distinguished, the smaller the separation between adjacent values, and the less reliable the memory.
The binary number system requires only two values to be distinguished. Consequently, it is the
most reliable method for encoding digital information.
In everyday life, we use a number system with a base of 10, simply because we have ten fingers!
For example, look at the number 257.56. Each digit in 257.56 has a value of 0 through 9 and has a
place value. It can be written as
257.56 = 2 × 10^2 + 5 × 10^1 + 7 × 10^0 + 5 × 10^-1 + 6 × 10^-2
In a binary system, we have a similar system where the base is made of only two digits 0 and 1.
So it is a base 2 system. A number like (1011.0011) in base-2 represents the decimal number as
1011.0011 = 1 × 2^3 + 0 × 2^2 + 1 × 2^1 + 1 × 2^0 + 0 × 2^-1 + 0 × 2^-2 + 1 × 2^-3 + 1 × 2^-4
= 11.1875
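The expansion above can be checked directly with a short Python sketch (the list names are mine, used only for illustration):

```python
# Evaluate 1011.0011 (base 2) term by term, as in the expansion above.
bits_int = [1, 0, 1, 1]    # weights 2^3 ... 2^0
bits_frac = [0, 0, 1, 1]   # weights 2^-1 ... 2^-4

value = sum(b * 2 ** (3 - i) for i, b in enumerate(bits_int))
value += sum(b * 2.0 ** -(i + 1) for i, b in enumerate(bits_frac))
print(value)  # 11.1875
```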
¹ Subtractive cancellation is the term that refers to the round-off induced when subtracting two nearly equal floating-point numbers. A common instance occurs when finding the roots of a quadratic equation for the case b^2 ≫ 4ac.
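A minimal Python sketch of that case, assuming a = 1 and hypothetical values b = 1e8 and c = 1 (so the small-magnitude root is approximately −c/b): the naive quadratic formula subtracts two nearly equal numbers, while an algebraically equivalent rearrangement avoids the cancellation.

```python
import math

# Roots of x^2 + b*x + c = 0 with b^2 >> 4c (a = 1). The small-magnitude
# root is close to -c/b; the naive formula computes it badly.
b, c = 1e8, 1.0
disc = math.sqrt(b * b - 4.0 * c)

x_naive = (-b + disc) / 2.0       # subtracts two nearly equal numbers
x_stable = -2.0 * c / (b + disc)  # equivalent form, no cancellation

print(x_naive, x_stable)
```

The stable form agrees with −1e-8 to full precision, while the naive form loses most of its significant digits.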
In general in both systems we have an integer part to the left of the decimal point and we have a
fraction part to the right of the decimal point.
The fundamental unit whereby ‘meaningful’ information is represented is called a word. This is
an entity that consists of a string of binary digits or bits. Numbers are typically stored in one or
more words.
The primary logic units of a digital computer are on/off electronic components. Hence numbers on a computer are represented with binary, or base-2, systems.
Integer representation:
The most straightforward approach is the signed magnitude method. It employs the first bit of a word to indicate the sign of the number (0 for positive, 1 for negative), and the remaining bits are used to store the magnitude. The following figure shows the signed magnitude representation of the integer -173 on a 16-bit computer.
1 0 0 0 0 0 0 0 1 0 1 0 1 1 0 1
Sign  Number
Fig: The representation of the decimal integer -173 on a 16-bit computer
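The figure's bit pattern can be reproduced with a small Python sketch (the helper name `signed_magnitude` is mine, not a standard library function):

```python
def signed_magnitude(n, bits=16):
    """Signed-magnitude bit string of integer n on a bits-bit word:
    first bit is the sign (0 positive, 1 negative), the rest the magnitude."""
    mag = format(abs(n), 'b')
    if len(mag) > bits - 1:
        raise OverflowError("magnitude does not fit in the word")
    return ('1' if n < 0 else '0') + mag.zfill(bits - 1)

print(signed_magnitude(-173))  # 1000000010101101
```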
Fractional numbers are represented in computers using floating-point form. The general format is

m × b^e

where m is the mantissa, b is the base of the number system, and e is the exponent.
So we can say that 156.78 could be represented as 0.15678 × 10^3 as a floating-point number in the base-10 system.
The above figure shows one way of storing a floating-point number in a word. The mantissa is usually normalized if it has a leading zero digit. For example, the number 1/34 = 0.029411765… stored on a floating-point base-10 system that allows four decimal places to be stored would be

1/34 = 0.0294 × 10^0

The leading zero wastes one of the four available places. Normalizing the number by shifting the decimal point gives

1/34 = 0.2941 × 10^-1

which makes more space available to increase the significant figures used in approximating the number within the four decimal places available. One can easily deduce that the absolute value of a normalized mantissa is between 1/10 = 0.1 and 1. Generally, for any base b,

1/b ≤ m < 1
In addition to taking more time and memory space, the floating-point representation has its own disadvantage because of the round-off errors introduced in approximating the mantissa.
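A quick Python illustration of such mantissa round-off, reusing the 1/34 example with four significant figures (the storage is simulated with string formatting, an assumption for illustration only):

```python
# Simulate storing 1/34 with a four-significant-figure base-10 mantissa.
x = 1 / 34                       # 0.0294117647...
stored = float(f"{x:.4g}")       # normalized, 4 significant figures: 0.02941
roundoff = abs(x - stored)       # round-off error from the short mantissa
print(stored, roundoff)
```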
1.3.2. Conversion algorithm between base two and base ten systems
A point of interest regarding the two number systems is the conversion mechanism between them. We already converted a binary number to decimal at the beginning of this section; now we proceed with the reverse.
Let us take the number 11.1875.
(I) The integer part, 11, is converted by successive division by 2, recording the remainders: 11/2 = 5 remainder 1, 5/2 = 2 remainder 1, 2/2 = 1 remainder 0, and 1/2 = 0 remainder 1. Reading the remainders from last to first gives 1011.
(II) The fraction part, 0.1875, is converted by successive multiplication by 2: 0.1875 × 2 = 0.375, so a₋₁ = 0; then 0.375 × 2 = 0.75, so a₋₂ = 0.
(III) Next, 0.75 × 2 = 1.5, so a₋₃ = 1, and we carry the remaining fraction 0.5 forward.
(IV) Finally, we multiply the remaining fraction by 2: 0.5 × 2 = 1.0, so a₋₄ = 1.
Once we get a 0 after the decimal point, that marks the end of the conversion process. Thus 11.1875 = (1011.0011) in base 2.
Generally, conversion from base 10 to base two involves two independent procedures for both the
integer and the fraction part.
The following figures show the conversion flowchart
Computational Methods 6
Addis Ababa Science and Technology University
Department Of Electrical and Computer Engineering
Fig: Flow charts for the integer part and fraction part conversion from base ten to base two.
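The procedures in the flowcharts can be sketched in Python as follows (the function name and the `frac_bits` cutoff are my assumptions; the cutoff guards against fractions that never terminate in base 2):

```python
def dec_to_bin(x, frac_bits=8):
    """Convert a non-negative decimal number to a base-2 string, using
    successive division for the integer part and successive
    multiplication for the fraction part, as in the flowcharts."""
    whole = int(x)
    frac = x - whole

    # Integer part: repeated division by 2, collecting remainders.
    int_digits = []
    while whole > 0:
        int_digits.append(str(whole % 2))
        whole //= 2
    int_part = ''.join(reversed(int_digits)) or '0'

    # Fraction part: repeated multiplication by 2, collecting carries.
    # frac_bits caps the loop for fractions that never reach exactly 0.
    frac_digits = []
    while frac > 0 and len(frac_digits) < frac_bits:
        frac *= 2
        digit = int(frac)
        frac_digits.append(str(digit))
        frac -= digit

    return int_part + ('.' + ''.join(frac_digits) if frac_digits else '')

print(dec_to_bin(11.1875))  # 1011.0011
```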
In any numerical analysis, errors will arise during the calculations. To be able to deal with the issue of errors, we need to
(A) identify where the error is coming from ("is it round-off or truncation?"), followed by
(B) quantifying the error ("how large is it?"), and lastly
(C) minimizing the error as per our needs ("how about changing the step size?").
This section is devoted to addressing what we have stated under (B).
True error, denoted by E_t, is the difference between the true value (also called the exact value) and the approximate value:

E_t = true value − approximate value
The magnitude of the true error alone does not show how serious the error is, because it does not account for the magnitude of the quantity being examined.
Relative true error, denoted by ∈_t, is defined as the ratio of the true error to the true value:

∈_t = true error / true value

It is often expressed as a percentage, ∈_t × 100%. Sometimes it makes more sense to use the absolute relative true error, |∈_t|.
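For instance, taking e = 2.71828… as the known true value and a four-term Maclaurin approximation, a Python sketch of the true and relative true errors (the choice of example is mine):

```python
import math

true_value = math.exp(1)             # take e as the known true value
approx = 1 + 1 + 1/2 + 1/6           # four-term Maclaurin approximation of e
E_t = true_value - approx            # true error
rel_t = E_t / true_value             # relative true error
print(E_t, abs(rel_t) * 100)         # |relative true error| in percent
```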
So far, we discussed how to calculate true errors. Such errors are calculated only if true values are
known. In most cases we will not have the luxury of knowing true values. So when we are solving
a problem numerically, we will only have access to approximate values. We need to know how to
quantify error for such cases.
Approximate error, denoted by E_a, is defined as the difference between the present approximation and the previous approximation.
Here again the magnitude of approximate error does not show how bad the error is.
Computational Methods 8
Addis Ababa Science and Technology University
Department Of Electrical and Computer Engineering
Relative approximate error, denoted by ∈_a, is defined as the ratio of the approximate error to the present approximation:

Relative Approximate Error = approximate error / present approximation

As with the relative true error, we can express the relative approximate error as a percentage, and we can also use its absolute value, |∈_a|.
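As a sketch, the absolute relative approximate error can serve as a stopping criterion when summing the Maclaurin series for e^x (the function name and the 0.05% tolerance are my assumptions):

```python
def exp_with_tolerance(x, tol_percent=0.05):
    """Sum Maclaurin terms for e^x until the absolute relative
    approximate error |e_a| falls below tol_percent (in percent)."""
    total = 0.0
    term = 1.0   # current term x^k / k!
    k = 0
    while True:
        prev = total          # previous approximation
        total += term         # present approximation
        k += 1
        term *= x / k         # prepare the next term
        if prev != 0.0:
            rel_approx = abs((total - prev) / total) * 100.0
            if rel_approx < tol_percent:
                return total, k

e_approx, n_terms = exp_with_tolerance(1.0)
print(e_approx, n_terms)
```

No true value is needed here: the criterion compares only successive approximations, which is exactly why the relative approximate error is useful in practice.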
The scope of this section is to study how errors in numbers can influence the results of the mathematical calculations in which those numbers are involved.
(I) Functions of a single variable
Suppose we have a function f(x) whose independent variable x is approximated as x̃. What, then, will be the effect of using the approximate value x̃ on f(x)? We would like to estimate

Δf(x̃) = |f(x) − f(x̃)|

We again have a problem: we cannot obtain the exact value of f(x), since we cannot obtain the exact value of x. But if we are guaranteed that f is continuous and differentiable, and if x̃ is as close as possible to x, we may apply the Taylor series:

f(x) = f(x̃) + f′(x̃)(x − x̃) + (f″(x̃)/2!)(x − x̃)^2 + …

Dropping the second- and higher-order terms and rearranging, we approximately get

f(x) − f(x̃) ≈ f′(x̃)(x − x̃),   so that   Δf(x̃) = |f′(x̃)| Δx̃

where Δx̃ = |x − x̃| is the error in x.
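A minimal Python sketch of this first-order estimate, using the hypothetical example f(x) = x^3 with x̃ = 2.5 and Δx̃ = 0.01 (both values mine):

```python
def propagate(dfdx, x_approx, dx):
    """First-order estimate of the error in f due to an error dx in x:
    |f'(x_approx)| * dx."""
    return abs(dfdx(x_approx)) * dx

f = lambda x: x ** 3
dfdx = lambda x: 3 * x ** 2

est = propagate(dfdx, 2.5, 0.01)         # estimated error in f
actual = abs(f(2.5 + 0.01) - f(2.5))     # actual change, for comparison
print(est, actual)
```

The estimate 3 × 2.5² × 0.01 = 0.1875 agrees closely with the actual change, as expected for a small Δx̃.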
(II) Functions of more than one variable
Here we generalize the foregoing approach by applying the multivariable version of the Taylor series. Suppose we have a function f of two variables u and v. Then

f(u_{i+1}, v_{i+1}) = f(u_i, v_i) + (∂f/∂u)(u_{i+1} − u_i) + (∂f/∂v)(v_{i+1} − v_i)
  + (1/2!)[(∂²f/∂u²)(u_{i+1} − u_i)^2 + 2(∂²f/∂u∂v)(u_{i+1} − u_i)(v_{i+1} − v_i) + (∂²f/∂v²)(v_{i+1} − v_i)^2] + …
Retaining only the first-order terms and taking absolute values yields the error estimate

Δf(ũ, ṽ) = |∂f/∂u| Δũ + |∂f/∂v| Δṽ

For n independent variables x̃₁, x̃₂, …, x̃ₙ having errors Δx̃₁, Δx̃₂, …, Δx̃ₙ, this generalizes to

Δf(x̃₁, x̃₂, …, x̃ₙ) = |∂f/∂x₁| Δx̃₁ + |∂f/∂x₂| Δx̃₂ + … + |∂f/∂xₙ| Δx̃ₙ
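The general formula can be sketched in Python with the partial derivatives approximated numerically (the function name, the central-difference step h, and the example f(u, v) = u·v with its error values are all my assumptions):

```python
def propagate_multi(f, xs, dxs, h=1e-6):
    """First-order error estimate sum(|df/dx_i| * dx_i), with the partial
    derivatives approximated by central differences of step h."""
    total = 0.0
    for i, dx in enumerate(dxs):
        hi = [x + (h if j == i else 0.0) for j, x in enumerate(xs)]
        lo = [x - (h if j == i else 0.0) for j, x in enumerate(xs)]
        partial = (f(hi) - f(lo)) / (2.0 * h)
        total += abs(partial) * dx
    return total

# Example: f(u, v) = u * v with u = 2 +/- 0.05 and v = 3 +/- 0.02
df = propagate_multi(lambda p: p[0] * p[1], [2.0, 3.0], [0.05, 0.02])
print(df)  # |v|*du + |u|*dv = 3*0.05 + 2*0.02 = 0.19
```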
Here we measure the sensitivity of a given mathematical equation to variations in its input variables. We say that a computation is numerically unstable if the uncertainty of the input values is grossly magnified by the numerical method, i.e., by the function we are using.
The relative true error of x̃ is given as (x − x̃)/x̃, and from the first-order approximation above the corresponding relative error of the function value is f′(x̃)(x − x̃)/f(x̃). Taking the ratio of the two, we define the condition number as follows:

condition number = x̃ f′(x̃) / f(x̃)
This number is used to measure the extent to which the uncertainty in x is magnified by f(x), with three independent cases: