Chapter One
Number System and Numerical Error Analysis
1.1 Introduction
This section is devoted to reviewing general information concerned with the quantification of errors, such as the concept of significant digits.
Significant digits:
Significant digits are important in showing the confidence one has in a reported number. For example, if someone asked me what the population of my city is, I would simply respond, "The population of the Addis Ababa area is 6 million!" Is it exactly six million? "I don't know! But I am quite sure that it is not ten."
The problem comes if someone else were going to give me $100 for every citizen of the city; I would have to get an exact count in that case. That count would be 5,228,000 this year.
So you can see that in the statement that the population is 6 million, there is only one significant digit, i.e., 6, while in the statement that the population is 5,228,000, there are seven significant digits. That means I am quite sure about the accuracy of the number up to the seventh digit. So, how do we differentiate the number of correct digits in 6,000,000 and 5,228,000? Well, for that, one may use scientific notation. For our data we can write

6,000,000 = 6 × 10^6
5,228,000 = 5.228 × 10^6

to signify the correct number of significant digits.
Accuracy: refers to how closely a computed or measured value agrees with the true value.
Precision: refers to how closely individual computed or measured values agree with each other.
Computational Methods 1
Addis Ababa Science and Technology University
Department Of Electrical and Computer Engineering
Error in solving an engineering or science problem can arise due to several factors. First, the error
may be in the modeling technique. A mathematical model may be based on using assumptions
that are not acceptable. For example, one may assume that the drag force on a car is proportional
to the velocity of the car, but actually it is proportional to the square of the velocity of the car.
This by itself can create huge errors in determining the performance of the car, no matter how accurate the numerical methods you use are.
Second, errors may arise from mistakes in programs themselves or in the measurement of
physical quantities. But, in application of numerical methods itself, the two errors we need to
focus on are:
- Round off error and,
- Truncation error
A computer can only represent a number approximately because there is no mechanism to deal
with infiniteness in computers. For example, a number like 1/3 may be represented as 0.333333
on a PC. The round-off error generated in this case is

1/3 − 0.333333 = 0.000000333…
There are other numbers that cannot be represented exactly. For example, π and √2 are numbers
that need to be approximated in computer calculations.
Truncation error is defined as the error caused by truncating a mathematical procedure. For
example, the Maclaurin series for e^x is given as

e^x = 1 + x + x^2/2! + x^3/3! + …
This series has an infinite number of terms, but when using it to calculate e^x only a finite number of terms can be used in computation. For example, if one uses three terms in the calculation, then the resulting truncation error becomes
Truncation Error = e^x − (1 + x + x^2/2!) = x^3/3! + x^4/4! + …
Clearly we cannot avoid truncation errors; the question should rather be how we can control them. We can use the concept of relative approximation in order to do so. For the time being, let us assume that this can be achieved by setting a particular acceptable error
magnitude and deciding on the minimum number of terms involved in our calculation. With this assumption in mind, one can conclude that increasing the number of terms decreases the truncation error.
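Under this assumption, the effect of the number of terms on the truncation error can be sketched in Python (a minimal illustration; the function name and the term counts chosen are mine):

```python
import math

def exp_series(x, n_terms):
    """Approximate e^x with the first n_terms terms of its Maclaurin series."""
    total = 0.0
    term = 1.0                 # first term: x^0 / 0!
    for k in range(n_terms):
        total += term
        term *= x / (k + 1)    # next term: multiply by x/(k+1)
    return total

# The truncation error shrinks as more terms are kept:
for n in (3, 6, 10):
    print(n, abs(math.exp(1.0) - exp_series(1.0, n)))
```

With three terms the error at x = 1 is e − 2.5 ≈ 0.218; each additional term reduces it further.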
The total numerical error is the sum of the truncation error and the round-off error. We can minimize round-off errors by increasing the number of significant digits. In contrast, subtractive cancellation¹ and an increase in the number of computations (i.e., the number of terms involved) will intensify round-off errors.
Truncation error, on the other hand, can be reduced by decreasing the step size, which in turn can increase subtractive cancellation or the number of computations.
All of the above simply means that decreasing truncation error tends to increase round-off error, so we are forced to trade off between the two errors in finding the suitable step size for a particular computation.
The basic unit of memory is the binary digit, called a bit. A bit may contain a 0 or a 1. It is the simplest possible unit; a device capable of storing only zeros could hardly form the basis of a memory system.
People often say that computers use binary arithmetic because it is "efficient." What they mean (although they rarely realize it) is that digital information can be stored by distinguishing between different values of some continuous physical quantity, such as voltage or current. The more values that must be distinguished, the smaller the separation between adjacent values, and the less reliable the memory.
The binary number system requires only two values to be distinguished. Consequently, it is the
most reliable method for encoding digital information.
In everyday life, we use a number system with a base of 10, simply because we have ten fingers!
For example, look at the number 257.56. Each digit in 257.56 has a value of 0 through 9 and has a
place value. It can be written as
257.56 = 2 × 10^2 + 5 × 10^1 + 7 × 10^0 + 5 × 10^-1 + 6 × 10^-2
In a binary system, we have a similar system where the base is made of only two digits 0 and 1.
So it is a base 2 system. A number like (1011.0011) in base-2 represents the decimal number as
1011.0011 = 1 × 2^3 + 0 × 2^2 + 1 × 2^1 + 1 × 2^0 + 0 × 2^-1 + 0 × 2^-2 + 1 × 2^-3 + 1 × 2^-4
= 11.1875
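The expansion above can be checked directly with a short Python sketch (the list names are mine, used only for illustration):

```python
# Evaluate 1011.0011 (base 2) term by term, as in the expansion above.
bits_int = [1, 0, 1, 1]    # weights 2^3 ... 2^0
bits_frac = [0, 0, 1, 1]   # weights 2^-1 ... 2^-4

value = sum(b * 2 ** (3 - i) for i, b in enumerate(bits_int))
value += sum(b * 2.0 ** -(i + 1) for i, b in enumerate(bits_frac))
print(value)  # 11.1875
```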
¹ Subtractive cancellation is the term that refers to the round-off induced when subtracting two nearly equal floating-point numbers. A common instance occurs when finding the roots of a quadratic equation for the case b^2 ≫ 4ac.
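A minimal Python sketch of that case, assuming a = 1 and hypothetical values b = 1e8 and c = 1 (so the small-magnitude root is approximately −c/b): the naive quadratic formula subtracts two nearly equal numbers, while an algebraically equivalent rearrangement avoids the cancellation.

```python
import math

# Roots of x^2 + b*x + c = 0 with b^2 >> 4c (a = 1). The small-magnitude
# root is close to -c/b; the naive formula computes it badly.
b, c = 1e8, 1.0
disc = math.sqrt(b * b - 4.0 * c)

x_naive = (-b + disc) / 2.0       # subtracts two nearly equal numbers
x_stable = -2.0 * c / (b + disc)  # equivalent form, no cancellation

print(x_naive, x_stable)
```

The stable form agrees with −1e-8 to full precision, while the naive form loses most of its significant digits.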
In general in both systems we have an integer part to the left of the decimal point and we have a
fraction part to the right of the decimal point.
The fundamental unit whereby ‘meaningful’ information is represented is called a word. This is
an entity that consists of a string of binary digits or bits. Numbers are typically stored in one or
more words.
The primary logic units of a digital computer are on/off electronic components. Hence numbers on a computer are represented with binary, or base-2, systems.
Integer representation:
The most straightforward approach is the signed magnitude method. It employs the first bit of a word to indicate the sign of the number (0 for positive, 1 for negative), and the remaining bits are used to store the magnitude. The following figure shows the signed magnitude representation of the integer -173 on a 16-bit computer.
1 0 0 0 0 0 0 0 1 0 1 0 1 1 0 1
Sign  Number
Fig: The representation of the decimal integer -173 on a 16-bit computer
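The figure's bit pattern can be reproduced with a small Python sketch (the helper name `signed_magnitude` is mine, not a standard library function):

```python
def signed_magnitude(n, bits=16):
    """Signed-magnitude bit string of integer n on a bits-bit word:
    first bit is the sign (0 positive, 1 negative), the rest the magnitude."""
    mag = format(abs(n), 'b')
    if len(mag) > bits - 1:
        raise OverflowError("magnitude does not fit in the word")
    return ('1' if n < 0 else '0') + mag.zfill(bits - 1)

print(signed_magnitude(-173))  # 1000000010101101
```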
Fractional numbers are represented in computers using floating-point form. The general format is

m × b^e

where m is the mantissa, b is the base of the number system, and e is the exponent.
So we can say that 156.78 could be represented as 0.15678 × 10^3 as a floating-point number in the base-10 system.
The above figure shows one way of storing a floating-point number in a word. The mantissa is usually normalized if it has a leading zero digit. For example, the number 1/34 = 0.029411765… stored on a floating-point base-10 system that allows four decimal places to be stored would be

1/34 = 0.0294 × 10^0

The leading zero wastes one of the four available places. Normalizing the number by shifting the decimal point gives

1/34 = 0.2941 × 10^-1

which makes more space available to increase the significant figures used in approximating the number within the four decimal places available. One can easily deduce that the absolute value of a normalized mantissa is between 1/10 = 0.1 and 1. Generally, for any base b,

1/b ≤ m < 1
In addition to taking more time and memory space, the floating-point representation has its own disadvantage because of the round-off errors introduced in approximating the mantissa.
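A quick Python illustration of such mantissa round-off, reusing the 1/34 example with four significant figures (the storage is simulated with string formatting, an assumption for illustration only):

```python
# Simulate storing 1/34 with a four-significant-figure base-10 mantissa.
x = 1 / 34                       # 0.0294117647...
stored = float(f"{x:.4g}")       # normalized, 4 significant figures: 0.02941
roundoff = abs(x - stored)       # round-off error from the short mantissa
print(stored, roundoff)
```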
1.3.2. Conversion algorithm between base two and base ten systems
A point of interest regarding the two number systems is the conversion mechanism between them. We already converted a binary number to decimal at the beginning of this section; now we proceed with the reverse.
Let us take the number 11.1875.
(I) The integer part, 11, is converted by successive division by 2, recording the remainders: 11/2 = 5 remainder 1, 5/2 = 2 remainder 1, 2/2 = 1 remainder 0, and 1/2 = 0 remainder 1. Reading the remainders from last to first gives 1011.
(II) The fraction part, 0.1875, is converted by successive multiplication by 2: 0.1875 × 2 = 0.375, so a₋₁ = 0; then 0.375 × 2 = 0.75, so a₋₂ = 0.
(III) Next, 0.75 × 2 = 1.5, so a₋₃ = 1, and we carry the remaining fraction 0.5 forward.
(IV) Finally, we multiply the remaining fraction by 2: 0.5 × 2 = 1.0, so a₋₄ = 1.
Once we get a 0 after the decimal point, that marks the end of the conversion process. Thus 11.1875 = (1011.0011) in base 2.
Generally, conversion from base 10 to base two involves two independent procedures for both the
integer and the fraction part.
The following figures show the conversion flowchart
Computational Methods 6
Addis Ababa Science and Technology University
Department Of Electrical and Computer Engineering
Fig: Flow charts for the integer part and fraction part conversion from base ten to base two.
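The procedures in the flowcharts can be sketched in Python as follows (the function name and the `frac_bits` cutoff are my assumptions; the cutoff guards against fractions that never terminate in base 2):

```python
def dec_to_bin(x, frac_bits=8):
    """Convert a non-negative decimal number to a base-2 string, using
    successive division for the integer part and successive
    multiplication for the fraction part, as in the flowcharts."""
    whole = int(x)
    frac = x - whole

    # Integer part: repeated division by 2, collecting remainders.
    int_digits = []
    while whole > 0:
        int_digits.append(str(whole % 2))
        whole //= 2
    int_part = ''.join(reversed(int_digits)) or '0'

    # Fraction part: repeated multiplication by 2, collecting carries.
    # frac_bits caps the loop for fractions that never reach exactly 0.
    frac_digits = []
    while frac > 0 and len(frac_digits) < frac_bits:
        frac *= 2
        digit = int(frac)
        frac_digits.append(str(digit))
        frac -= digit

    return int_part + ('.' + ''.join(frac_digits) if frac_digits else '')

print(dec_to_bin(11.1875))  # 1011.0011
```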
In any numerical analysis, errors will arise during the calculations. To be able to deal with the issue of errors, we need to
(A) identify where the error is coming from ("is it round-off or truncation?"), followed by
(B) quantifying the error ("how large is it?"), and lastly
(C) minimizing the error as per our needs ("how about changing the step size?").
This section is devoted to addressing what we have stated under (B).
True error, denoted by E_t, is the difference between the true value (also called the exact value) and the approximate value:

E_t = true value − approximate value
The magnitude of the true error alone does not show how serious the error is, because it does not account for the magnitude of the quantity being examined.
Relative true error, denoted by ∈_t, is defined as the ratio of the true error to the true value:

∈_t = true error / true value

It is often expressed as a percentage, ∈_t × 100%. Sometimes it makes more sense to use the absolute relative true error, |∈_t|.
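For instance, taking e = 2.71828… as the known true value and a four-term Maclaurin approximation, a Python sketch of the true and relative true errors (the choice of example is mine):

```python
import math

true_value = math.exp(1)             # take e as the known true value
approx = 1 + 1 + 1/2 + 1/6           # four-term Maclaurin approximation of e
E_t = true_value - approx            # true error
rel_t = E_t / true_value             # relative true error
print(E_t, abs(rel_t) * 100)         # |relative true error| in percent
```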
So far, we discussed how to calculate true errors. Such errors are calculated only if true values are
known. In most cases we will not have the luxury of knowing true values. So when we are solving
a problem numerically, we will only have access to approximate values. We need to know how to
quantify error for such cases.
Approximate error, denoted by E_a, is defined as the difference between the present approximation and the previous approximation.
Here again the magnitude of approximate error does not show how bad the error is.
Computational Methods 8
Addis Ababa Science and Technology University
Department Of Electrical and Computer Engineering
Relative approximate error, denoted by ∈_a, is defined as the ratio of the approximate error to the present approximation:

Relative Approximate Error = approximate error / present approximation

As with the relative true error, we can express the relative approximate error as a percentage, and we can also use its absolute value, |∈_a|.
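As a sketch, the absolute relative approximate error can serve as a stopping criterion when summing the Maclaurin series for e^x (the function name and the 0.05% tolerance are my assumptions):

```python
def exp_with_tolerance(x, tol_percent=0.05):
    """Sum Maclaurin terms for e^x until the absolute relative
    approximate error |e_a| falls below tol_percent (in percent)."""
    total = 0.0
    term = 1.0   # current term x^k / k!
    k = 0
    while True:
        prev = total          # previous approximation
        total += term         # present approximation
        k += 1
        term *= x / k         # prepare the next term
        if prev != 0.0:
            rel_approx = abs((total - prev) / total) * 100.0
            if rel_approx < tol_percent:
                return total, k

e_approx, n_terms = exp_with_tolerance(1.0)
print(e_approx, n_terms)
```

No true value is needed here: the criterion compares only successive approximations, which is exactly why the relative approximate error is useful in practice.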
The scope of this section is to study how errors in numbers can influence the results of the mathematical calculations in which those numbers are involved.
(I) Functions of a single variable
Suppose we have a function f(x) whose independent variable x is approximated as x̃. What, then, will be the effect of using the approximate value x̃ on f(x)? We would like to estimate

Δf(x̃) = |f(x) − f(x̃)|

We again have a problem: we cannot obtain the exact value of f(x), since we cannot obtain the exact value of x. But if we are guaranteed that f is continuous and differentiable, and if x̃ is as close as possible to x, we may apply the Taylor series:

f(x) = f(x̃) + f′(x̃)(x − x̃) + (f″(x̃)/2!)(x − x̃)^2 + …

Dropping the second- and higher-order terms and rearranging, we approximately get

f(x) − f(x̃) ≈ f′(x̃)(x − x̃),   so that   Δf(x̃) = |f′(x̃)| Δx̃

where Δx̃ = |x − x̃| is the error in x.
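A minimal Python sketch of this first-order estimate, using the hypothetical example f(x) = x^3 with x̃ = 2.5 and Δx̃ = 0.01 (both values mine):

```python
def propagate(dfdx, x_approx, dx):
    """First-order estimate of the error in f due to an error dx in x:
    |f'(x_approx)| * dx."""
    return abs(dfdx(x_approx)) * dx

f = lambda x: x ** 3
dfdx = lambda x: 3 * x ** 2

est = propagate(dfdx, 2.5, 0.01)         # estimated error in f
actual = abs(f(2.5 + 0.01) - f(2.5))     # actual change, for comparison
print(est, actual)
```

The estimate 3 × 2.5² × 0.01 = 0.1875 agrees closely with the actual change, as expected for a small Δx̃.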
(II) Functions of more than one variable
Here we generalize the foregoing approach by applying the multivariable version of the Taylor series. Suppose we have a function f of two variables u and v. Then

f(u_{i+1}, v_{i+1}) = f(u_i, v_i) + (∂f/∂u)(u_{i+1} − u_i) + (∂f/∂v)(v_{i+1} − v_i)
  + (1/2!)[(∂²f/∂u²)(u_{i+1} − u_i)^2 + 2(∂²f/∂u∂v)(u_{i+1} − u_i)(v_{i+1} − v_i) + (∂²f/∂v²)(v_{i+1} − v_i)^2] + …
Retaining only the first-order terms and taking absolute values yields the error estimate

Δf(ũ, ṽ) = |∂f/∂u| Δũ + |∂f/∂v| Δṽ

For n independent variables x̃₁, x̃₂, …, x̃ₙ having errors Δx̃₁, Δx̃₂, …, Δx̃ₙ, this generalizes to

Δf(x̃₁, x̃₂, …, x̃ₙ) = |∂f/∂x₁| Δx̃₁ + |∂f/∂x₂| Δx̃₂ + … + |∂f/∂xₙ| Δx̃ₙ
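The general formula can be sketched in Python with the partial derivatives approximated numerically (the function name, the central-difference step h, and the example f(u, v) = u·v with its error values are all my assumptions):

```python
def propagate_multi(f, xs, dxs, h=1e-6):
    """First-order error estimate sum(|df/dx_i| * dx_i), with the partial
    derivatives approximated by central differences of step h."""
    total = 0.0
    for i, dx in enumerate(dxs):
        hi = [x + (h if j == i else 0.0) for j, x in enumerate(xs)]
        lo = [x - (h if j == i else 0.0) for j, x in enumerate(xs)]
        partial = (f(hi) - f(lo)) / (2.0 * h)
        total += abs(partial) * dx
    return total

# Example: f(u, v) = u * v with u = 2 +/- 0.05 and v = 3 +/- 0.02
df = propagate_multi(lambda p: p[0] * p[1], [2.0, 3.0], [0.05, 0.02])
print(df)  # |v|*du + |u|*dv = 3*0.05 + 2*0.02 = 0.19
```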
Here we measure the sensitivity of a given mathematical equation to variations in its input variables. We say that a computation is numerically unstable if the uncertainty of the input values is grossly magnified by the numerical method, i.e., by the function we are using.
The relative true error of x̃ is given as (x − x̃)/x̃, and from the first-order approximation above the corresponding relative error of the function value is f′(x̃)(x − x̃)/f(x̃). Taking the ratio of the two, we define the condition number as follows:

condition number = x̃ f′(x̃) / f(x̃)
This number is used to measure the extent to which the uncertainty in x is magnified by f(x), with three independent cases: