Computational Methods for Numerical Analysis with R

2 Error Analysis

2.1 True Values


Rounding numbers is common in everyday conversation. We do this because
getting an answer that is close enough is sufficient for most purposes. For in-
stance, we might give directions and say “your destination is around five miles
past the old post office,” assuming you know which post office is which. Or we
might say something costs twenty dollars when the actual price is $21.99. This
extends beyond purely numerical values. Al Jolson had an important date at
“about a quarter to nine,” and it is unlikely his date arrived at precisely 8:45
p.m.
Numerical analysis is grounded in getting answers that are good enough.
A calculation should target the highest precision and accuracy necessary
without wasting computational resources. This is because numerical analysis
finds approximations for everyday use in science and engineering. A modern
3D printer may have a printing resolution of 100 microns (a micron is
one-millionth of a meter, so 100 microns is one-tenth of a millimeter). When
constructing the computer model of something to be printed, describing the
curve at any finer resolution than the printer can reproduce is unnecessary.
Further, it wastes computing resources.
Conversely, the numbers fed into an algorithm for evaluation may be only
close approximations. The distance to the sun is commonly given as 150 million
kilometers. But the Earth’s orbit is an ellipse and the distance is constantly
fluctuating. Similarly, a meter stick may not have any marks closer together
than one millimeter. An observer may estimate where between two marks a
measurement lies, or may pick one of the endpoints.
In addition, the rule for selecting an endpoint can follow different rounding
rules. For instance, the observer may be required to pick the smaller of the
potential options. Or the observer may be required to pick the larger of the
potential options. Finally, an observer may select whichever mark appears
closest. Matters may be complicated if the rule for observational rounding
is unknown or perhaps not strictly enforced.
The level of correctness necessary to get a final answer is dependent upon
both the input numbers and the final application’s requirements. The amount
of error we can allow in a calculation is dependent entirely on the circum-
stances. We call this the error tolerance. The error tolerance for cooking can


Figure 2.1
Newark is near New York (Stamen Design)

be quite high. When making cookies, measurements for flour or sugar can be
“about a cup” or even undefined. It is unclear just how much salt is in a pinch.
On the other hand, when working with chocolate, temperatures have to be
controlled very carefully or the chocolate may not crystallize correctly when
cooled.
Numerical error is composed of two components, precision and accuracy.
Precision and accuracy are two related terms that are often used interchange-
ably. These terms describe two different facets of how an estimated value
relates to the true value. But each has a different meaning and a different
application in numerical analysis.
For this section, we will assume there is a true value, x, that we are
attempting to measure. We will further assume that x′ is the measured value.
The value of x′ may be measured through observation or may be the result of
a numerical process or calculation. Finally, the error will be represented by ε.

2.1.1 Accuracy
Accuracy measures how close an estimate is to the true value. This is the
commonly used definition for accuracy. For example, if we were travelling to
New York City, we might get directions. If we get directions to Newark, New
Jersey, we will end our trip in the wrong city, the wrong state, and on the
wrong side of the Hudson River. This is a problem of accuracy. Directions
to Boston or London are progressively less accurate than our directions to
Newark. But sometimes close is close enough, depending on how the estimate
could be used. New York is a brief train ride from Newark. How much accuracy
we need depends on the application the estimate is used for.
We can measure accuracy in different ways. The first is absolute accuracy
or absolute error. This measures the raw distance between the correct value
of x and the measured value. Formally,

εA = |x − x′|, (2.1)

where εA is the absolute error. This measurement is a raw figure and,
importantly, is in units of the underlying measure. If x is in kilometers, so
is x′. Accordingly, the value of εA will also be denominated in kilometers and
we can say our estimate was off by εA kilometers.
However, the absolute error of a measurement only partially conveys the
context of the measurement and may not present a meaningful error estimate
in context. In the directions, Newark is 16 kilometers (10 miles) from New
York. That is a substantial difference in this context. But in another context,
a 16-kilometer error may not make a substantial difference. For instance, using
the radius of the Earth’s orbit around the Sun again, an error of 16 kilometers
is not very large. This is especially true considering the Earth’s orbit varies
from 147 million to 151 million kilometers from the Sun.
Measuring how substantial the estimated error is leads to a different error
type definition. The relative error measures, as a part of the true value, how
much error there is in the measurement. Formally,
εR = (x − x′)/x, (2.2)

where εR is the relative error. By conveying the error as a proportion of the
true value, the effect of the error can be judged and evaluated.
Unlike the absolute error, the relative error is unitless. It is the ratio of the
error to the true value and the units cancel each other. This has advantages.
The unitless error estimate can be compared across different systems, if oth-
erwise appropriate. Or the relative error can be interpreted as a percentage
of the true value. Reducing relative error has a larger impact on correctness
than reducing the absolute error, when considered as a percentage of the true
value.
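Both measures are easy to compute directly in R. As a quick sketch, take the twenty-dollar price example from earlier, with the variable x0 standing in for the measured value x′:

```r
# Absolute error (2.1) and relative error (2.2) for a rounded price
x  <- 21.99                 # true value, in dollars
x0 <- 20                    # estimated value, in dollars
abs_err <- abs(x - x0)      # 1.99, still in dollars
rel_err <- abs(x - x0) / x  # about 0.0905, unitless
```

The absolute error keeps the units of the measurement, while the relative error can be read as a percentage: the estimate is off by about 9 percent.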
However, even with a small relative error, it is possible for the effect to
cause problems. Being in Newark when a job interview is in New York is
catastrophic. In contrast, after crossing hundreds of millions of kilometers to
go to another planet, if a space probe were off by 16 kilometers, it might
crash into the planet, or it just might miss the best picture opportunity. With
each estimate of the accuracy of measurement, the context is a guide to how
suitable the accuracy is.
Accuracy only tells the story of how far from the true value a result is.
There are limits on how close to the true value a result can be. If a sequence of digits is
infinite, as the digits of π are, then no computer contains sufficient memory
Figure 2.2
At least we got to New York, this time (Stamen Design)

and computational power to calculate a result using the full number of dig-
its. Practically, this is rarely a concern and the limits of the calculation are
described as the precision of the result.

2.1.2 Precision
Precision provides a measure for how specific a number is. That is, precision
explains the level of detail in a measurement. Continuing our travel analogy,
getting directions to New York when we are interested in going to the Empire
State Building is a problem of precision. We have gone to the right place,
but the right place is too large to get to the final destination. Directions
to Manhattan narrow the directions and you might see the Empire State
Building. But directions to the corner of 34th Street and Fifth Avenue will
take you to the building itself. As the directions increase in precision, we get
closer to our destination, without being incorrect at each step.
It is possible to estimate a value of x with an apparent high degree of
precision, but this may not necessarily be more accurate. For instance, we can
estimate the value of π at 22/7, a commonly used approximation,
π ≈ 22/7 ≈ 3.14285714 . . . (2.3)
This estimate is accurate to two decimal places. But the estimate given by
22/7 is a decimal that continues forever. Only the number of digits we are
willing to show limits its precision. However, the value of this estimate lies
in its accuracy, and for some purposes, that level of accuracy is sufficient.
Figure 2.3
A journey of a thousand miles starts with a good map (Stamen Design)

A more accurate estimate of π would simply be the first 8 digits,
3.14159265. This estimate, while less precise than 22/7, has greater accuracy
and is usable in many applications. Finding the area of a circle from the ra-
dius (A = πr^2) will be more accurate with this simple representation than
with 22/7. However, the area in both cases can be calculated using any num-
ber of digits. This is known as multiple precision. While multiple precision
mathematics libraries are widely available, they are mostly implemented in
computer algebra systems. R is a system designed for computation and most
calculations in R have limited precision.
While increased precision may sound beneficial, it can cause problems. A
calculation may induce greater precision in the final result than is warranted.
This can cause users of those calculations to trust them more than they should.
This is called false precision. False precision arises when a very precise estimate
for a measure is presented, but the precision presented does not necessarily
reflect the precision actually included in the number.
For instance, a note saying a box will hold 5 kilograms might turn into
11.0231 pounds. In this case, the precision is generated by the conversion fac-
tor, 2.20462 pounds per kilogram. But to say the box will hold 11.0231 pounds
adds a measure of precision that is probably unwarranted, and certainly easy
to misunderstand. The critical input value was the 5-kilogram figure. It has
one significant digit, the 5. The final result should likewise include only one
significant digit, and be interpreted as 10 pounds.
Significant digits, also called significant figures, are the parts of a number
that include the precision of that number. This is limited to any non-zero
digits in the representation of the number. The significant digits convey all of
the precision of a number. For instance, the number π can be represented as
3, 3.1, 3.14, 3.142, 3.1416, 3.14159, and so on with an increasing number of
significant digits. Accordingly, as the number of significant digits increases, so
does the precision of the estimate.
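R's built-in signif function rounds a number to a given count of significant digits, which lets us watch this progression directly:

```r
# Rounding pi to increasing numbers of significant digits
signif(pi, 1)  # 3
signif(pi, 3)  # 3.14
signif(pi, 5)  # 3.1416
signif(pi, 6)  # 3.14159
```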
Significant digits are frequently encoded in scientific notation. The number
1.5 × 10^8 is the same as 150 million and 150,000,000. When discussing the
distance from the Earth to the Sun, this estimate was shown to be the center
of a range. But as an estimate, 150 million kilometers is a good fit and
1.5 × 10^8 conveys all of the critical information. The 1.5 carries all of the
precision and the 10^8 carries the magnitude of the number. As a result, this
estimate of the radius of the Earth’s orbit contains two significant digits.
However, when doing mathematics with numbers of limited precision, the
lowest precision number determines the final precision. That is, the number
with the fewest significant digits included carries that number of significant
digits into the final calculation. That is why in the case of the 5-kilogram
holding box, the estimate in pounds should only be 10, no matter how precise a
conversion factor is used. Critically, this statement holds for multiplication and
division of estimated values, but also for addition and subtraction. However,
unlike with addition and subtraction, multiplication and division can change
the magnitude without affecting the number of significant digits in the final
result.
Coming back to rounding, the process of rounding numbers reduces both
the number of significant digits and the precision of an estimated value. For
each place of rounding, the number of significant digits is reduced by one.
With π, rounding to 6 total digits gives 3.14159, because the next digit is 2.
However, if the value of π is rounded to only 5 digits, the value is 3.1416.
In this case, a significant digit is lost and a single digit of precision is lost.
Accuracy is also lost, but by an unknown but bounded amount. In this case,
the rounding process happened in the last place and split the input range
evenly. As a result, the upper bound on the absolute error is 0.00005. We
know the error is actually smaller than that, but only because we know the
value of π.
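We can check this bound at the console. R's round function takes a number of decimal places, so rounding π to five total digits is round(pi, 4):

```r
# Rounding pi to 5 total digits and checking the error bound
round(pi, 4)                      # 3.1416
abs(pi - round(pi, 4))            # about 7.3e-06, the actual error
abs(pi - round(pi, 4)) < 0.00005  # TRUE: within the 5e-05 bound
```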
Because a computer only has a finite amount of storage, we can only store
numbers with a limited degree of accuracy and precision. In practice, there are
standard representations of numbers that limit the accuracy and precision to
a fixed value. Two important representations are the integer and the floating
point number.
2.2 Internal Data Storage


Internally, R has a variety of ways to store numbers. We have already seen the
aggregate data types such as matrices and vectors. However, the two storage
types we are going to focus on here are integer and numeric data types. Most
numbers in R are stored as floating point numbers. These are numbers that
allow for storing decimals. We will begin our discussion of data storage with
integers.
R does not use integers often and even when used as an index for a vector,
integers are usually stored as floating point numbers. As we will see, floating
point numbers are made up of several components, including two integers that
tell us the value and size of the number. So understanding how integers are
stored and used gives us a better understanding of how floating point numbers
work.

2.2.1 Binary Numbers


When R creates an integer, as we described in Section 1.2.1, R uses the native
32-bit integer format. The 32-bit integer format uses 31 bits to store the
number and one bit to store the sign of the number. As we will see, the
integer format is not the default in R. When working with numbers, integers
are usually coerced into the floating point numeric format, described later.
But the basic binary format underpins the storage of numerical data.1
The integer data type cannot store any fractional part of a number.
Within these limitations, only a limited range of numbers can be represented.
The largest positive number R can hold as an integer is 2^31 − 1, or
2147483647. As we increment the number by one, we quickly reach a number
too large for the data type:
> as.integer(2^31 - 2)
[1] 2147483646
> as.integer(2^31 - 1)
[1] 2147483647
> as.integer(2^31)
[1] NA
We use the as.integer function to ensure the result is converted into an
integer. R, if left to its own devices, converts everything to numeric, which
we will discuss in Section 2.2.2. At the negative extreme, we can see the same
behavior as we decrement toward the minimum integer:
> -2147483646L
[1] -2147483646
> -2147483646L - 1L
[1] -2147483647
> -2147483646L - 2L
[1] NA

1 See Knuth (1997) for a background on the binary number system, other bases,
radix conversion, and other properties of numbers at the hardware level.

Hexadecimal  Decimal  Binary     Hexadecimal  Decimal  Binary
0            0        0000       8            8        1000
1            1        0001       9            9        1001
2            2        0010       A            10       1010
3            3        0011       B            11       1011
4            4        0100       C            12       1100
5            5        0101       D            13       1101
6            6        0110       E            14       1110
7            7        0111       F            15       1111

Table 2.1
Hexadecimal, decimal, and binary numbers with equivalents
Here, instead of the as.integer function, we can append the capital letter
“L” to the number to force the integer format. The calculation will continue
using integers if all component parts are integers. The letter L stands for
long, the name of the declaration for a 32-bit integer. Removing the final L
converts the intermediate stages to numeric.
Generally, we will call these the maximum and minimum of the data type,
the largest positive and negative numbers. Closely related, we will call the
largest number the number with the greatest magnitude, regardless of sign.
Finally, the smallest number is the term used to describe numbers of lowest
magnitude, regardless of sign. These are the numbers closest to 0. In the case
of integers, this is 1 and -1.
The maximum value of a 32-bit integer, roughly 2.1 billion, is too small for
many common figures. The 2010 population of the world, just under 7 billion
people, could not be represented by a 32-bit integer. The value of 13! is more
than 6 billion. And there are approximately 4.33 × 10^19 potential
arrangements for a Rubik’s cube (Bandelow 1982, 45). This also ignores the
fact that the integer data type cannot store numbers smaller than 1 in
magnitude; if we entered 1/2, it would be truncated to zero:
> as.integer(0.5)
[1] 0
> as.integer(1.9)
[1] 1
This is not to say there are no advantages to storing numbers as integers.
One is that it provides a convenient format for entering binary numbers
indirectly. R does not support binary digit input without the use of a package
library. But it does support hexadecimal input. And we can use hexadecimal
input to quickly enter binary because each hexadecimal digit corresponds
exactly to four binary bits. Table 2.1 provides a reference table for the
binary-decimal-hexadecimal digits from 0 to 15. Hexadecimal numbers can be
entered if they are preceded by the “0x” code to signal the use of hexadecimal
numbers:
> 0xFACE
[1] 64206
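Base R's strtoi function converts digit strings in other bases, which gives a convenient way to check the hexadecimal-to-binary correspondence (this is a side demonstration, not something the text depends on):

```r
# Convert digit strings in arbitrary bases with strtoi
strtoi("FACE", base = 16)             # 64206, the same as 0xFACE
strtoi("1111101011001110", base = 2)  # the same value entered in binary
strtoi("1010", base = 2)              # 10, matching row A of Table 2.1
```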
Of course, R has no trouble with numbers larger than 2^31 − 1. However,
it will not use the internal integer storage format to hold these numbers. In
fact, the intermediate forms of 2^31 used in the prior two examples were not
integers. Accordingly, these expressions were valid and held the full value
until the final conversion to the integer data type.
> 2^32
[1] 4294967296
> class(2^32)
[1] "numeric"
In the example above, the numbers are being stored in the internal numeric
data type, which supports numbers larger than a 32-bit integer can store.
Numeric, based on the underlying floating point number format, provides the
most flexible and powerful option for native mathematics available to R.

2.2.2 Floating Point Numbers


Floating point numbers provide a way around the limitations of binary
integers. Floating point numbers are capable of storing noninteger values, such as
2.71828182845905, 3.14159265358979, and 0.25. Floating point numbers are
also capable of storing much larger numbers, as you can see here:
> 2^31
[1] 2147483648
> 2^40
[1] 1.099512e+12
Floating point numbers also include large magnitude negative numbers. With-
out the assignment to the integer data type, R defaults to storing the numerical
data as floating point data. There are several standards for floating point, but
we will focus on the double precision floating point number, usually just called
a double.
The double corresponds to the C and Java programming language data
type, double. It is called double precision because it has approximately double
the storage space of the standard floating point number format,
corresponding to the float data type from C. The double format is the default
storage format and is used when R identifies a number format as numeric.
Unless you have specifically changed the specification of a number, as we did
above with the as.integer function, R is almost certainly using numeric to
store data. The double precision data type has 64 bits of data storage per
number.
The C float data type is sometimes called single precision, and R in-
cludes a function to convert a number to a single precision value. However,
the underlying data type within R is still double precision and the function
has no actual effect. The R function provides internal bookkeeping for R’s
interactions with languages that support single precision data types.
The double precision data type is based on Institute of Electrical and Elec-
tronics Engineers (ieee) 754, the standard for storing floating point numbers.2
The ieee standard provides for a noncontinuous space representing both very
large and very small numbers. Under the standard, each floating point number
is composed of three parts: the base, exponent, and mantissa. It functions just
like scientific notation, but the base is not necessarily 10.
For traditional scientific notation, the base is 10, because humans are used
to working with numbers in base 10. For double precision numbers, the base
is not encoded in the storage. The base is 2 and is implicit. All ieee double
precision, and therefore, numeric, data types use base 2. Accordingly, it would
be redundant to store that information with each instance of a double precision
number.
The primary part of the number is called the mantissa or significand. This
number encodes the significant digits of the number represented in the base.
Since all double precision numbers are in base 2, this is binary. In the case
of a double precision or numeric data type, 53 of the 64 bits are used to store
the mantissa, with one of those bits reserved for the sign of the mantissa.
The final part of a floating point number is the exponent. The exponent,
like the exponent on a number recorded in scientific notation, is the power the
base is raised to, which is then multiplied by the mantissa for a final number.
For a double precision data type, there are 11 bits available for storing the
exponent, with one bit reserved for the sign of the exponent.
With these three components, a floating point number like 1000 is stored
as,

1000 = 0b1111101000 × 2^9, (2.4)

where the 2 is never stored and 0b· · · shows the binary representation of a
number. Similarly, a negative exponent can represent fractional values. Con-
sequently, the number 0.75 is also representable as,

0.75 = 3/4 = 0b0001 × 2^−1, (2.5)

as a floating point number. There is only one digit included because the
initial digit is implicit. Therefore, this number is 0b0.11 in binary. Because
the number is a fractional part in base 2, any number whose fractional part
is a sum of fractions with power-of-2 denominators can be represented
exactly. The fraction 3/4 can be represented by 1/2 + 1/4. As binary
decimals, 1/2 = 0b0.1 and 1/4 = 0b0.01. These add up, and 3/4 = 0b0.11.
Numbers that are not the sum of such fractions cannot be represented at the
exact value in the double precision format and are instead approximated.
One common example of this is 1/3.

2 Severance (1998) provides an interview with William Kahan, one of the
standard’s creators, explaining his motivations and design decisions. See
Zuras et al. (2008) for the standard.
Storing numbers in the double precision format has advantages for the
numerical processor that handles these numbers. For addition or subtraction,
both numbers must share a common exponent. One can be converted to the
other by shifting its mantissa. After this conversion, the mantissae are added
or subtracted, and the result is stored as a new floating point number. Using
10 as a base, we can add 200 to 2000 and watch the process:

200 + 2000 = 2 × 10^2 + 2 × 10^3
           = 2 × 10^2 + 20 × 10^2
           = 22 × 10^2
           = 2.2 × 10^3

The double precision format is equally adept at handling multiplication
and division. For neither is changing the exponent necessary. To multiply
two numbers, the mantissae are multiplied together and the exponents are
added together. For division, the mantissae are divided and the exponents
are subtracted. Again, using 10 as a base for its relative familiarity,

200 × 2000 = (2 × 10^2) × (2 × 10^3)
           = (2 × 2) × 10^(2+3)
           = 4 × 10^5
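We can mimic this mantissa-and-exponent view in R. The helper below is a sketch of our own (decompose is not a base R function); it recovers a mantissa in [1, 2) and a base-2 exponent using log2:

```r
# Split a nonzero double into mantissa and base-2 exponent,
# so that x == mantissa * 2^exponent
decompose <- function(x) {
  e <- floor(log2(abs(x)))
  m <- x / 2^e               # mantissa, normalized into [1, 2)
  c(mantissa = m, exponent = e)
}
decompose(1000)  # mantissa 1.953125, exponent 9
decompose(0.75)  # mantissa 1.5,      exponent -1
```

The results agree with equations (2.4) and (2.5): 1.953125 × 2^9 = 1000 and 1.5 × 2^−1 = 0.75.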

Generally, the internal details of how floating point operations work are not
necessary for numerical analysts. But some specific knowledge of the double
precision floating point format can help with some analyses before they grow
too large and complex.
In addition to representing numbers, floating point numbers permit several
special values to be stored. The most important of these is NaN, represented
in R by “NaN.” NaN stands for “not a number.” One way to get a result of NaN
is 0/0:
> 0 / 0
[1] NaN
NaN has the property that if introduced into an expression, the NaN will
generally propagate throughout the rest of the expression. For instance, NaN +
x = NaN for any value of x. Also, NaN − x, NaN × x, and NaN/x all equal NaN
for any value of x, including if x is itself NaN. There is an exception. Under
the ieee standard, NaN^x = NaN unless x = 0, in which case NaN^0 = 1. There
is a test function, is.nan, that tests whether the value given is NaN. Like the
is.* functions, it returns a value of TRUE or FALSE and can accept vector and
matrix arguments. Finally, NaN is not equal to anything, even itself:
> NaN == NaN
[1] NA
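The is.nan test and the exponent exception can both be checked at the console; note in particular that NA, discussed next, is not NaN:

```r
# is.nan accepts vectors; NA is missing, not "not a number"
is.nan(c(1, NaN, NA))  # FALSE  TRUE FALSE
NaN^0                  # 1, the ieee exception
NaN + 1                # NaN propagates through everything else
```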
NaN is distinct from NA. NaN implies a result that cannot be calculated for
whatever reason, or is not a floating point number. Some calculations that
lead to NaN, other than 0/0, are attempting to take a square root of a negative
number, or perform calculations with infinities that lead to undefined results:

> sqrt(-1)
[1] NaN
> Inf - Inf
[1] NaN
However, adding two infinities produces ∞:
> Inf + Inf
[1] Inf
NA is different from NaN in that NA is not a part of the ieee standard for
floating point numbers. NA is a construction of R used to represent a value
that is not known, as a placeholder. NA says no result was available or the
result is missing. It can be used to fill in missing values in a vector or matrix:
> c(1, 2, 3, 4, NA, 5, 6)
[1] 1 2 3 4 NA 5 6
> matrix(c(1, 2, NA, 4, NA, 6, NA, 8, 9), 3)
[,1] [,2] [,3]
[1,] 1 4 NA
[2,] 2 NA 8
[3,] NA 6 9
In addition, NaN is different from infinity, which has its own representa-
tion. Within R, positive infinity is represented as Inf and negative infinity is
represented by -Inf. The value of infinity is the result of a couple of different
operations, including exceeding the largest value the format can store. One
interesting property of the double precision data type is that division by 0
yields infinity. Within most algebraic frameworks, division by 0 yields an
undefined result, and we normally ascribe no meaning to it. It is just
undefined, not infinity, so returning infinity here is a floating point
convention rather than a mathematical identity.
Within double precision arithmetic, and by extension, R numeric arith-
metic, division by 0 leads to infinity, except when the dividend itself is 0:

> 1 / 0
[1] Inf
And this is balanced by a certain symmetry that almost any value divided by
Inf returns 0:
> 1 / Inf
[1] 0
Of course, Inf divided by itself is undefined:
> Inf / Inf
[1] NaN
In addition to Inf, the double precision data type includes a negative infinity,
-Inf. This behaves in the same way as Inf, except the sign propagates as
expected. Inf also has the property of being equal to itself, but not equal to
-Inf. Inf is greater than all other numbers and -Inf is less than all other
numbers:

> Inf == Inf
[1] TRUE
> Inf == -Inf
[1] FALSE
> Inf > 100
[1] TRUE
> Inf < 100
[1] FALSE
> -Inf > 100
[1] FALSE
Infinity is also the next number larger than the maximum value a stored
double precision number can take. For instance, if you double the value of
.Machine$double.xmax, the number returned is Inf.
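This overflow behavior is easy to demonstrate:

```r
# Doubling the largest finite double overflows to Inf
.Machine$double.xmax      # 1.797693e+308, the largest finite double
.Machine$double.xmax * 2  # Inf
```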
Finally, the most surprising aspect of the double precision data type is
the existence of a signed zero, or ±0. Negative zero, −0, acts like 0, but the
negative sign tends to propagate as expected:
> 1 / -0
[1] -Inf
Further, if a result is −0, it will be displayed as 0 while the negative is still
carried in the number and used in calculations. This can lead to a somewhat
confusing result:
> -0
[1] 0
> 1 / -Inf
[1] 0
> 1 / sqrt(-0)
[1] -Inf
Otherwise, −0 acts like 0, and −0 = 0. Generally, x − x = 0 for any value of
x, and not −0.

These special values give the floating point system a way to handle
exceptional conditions. In some other environments, division by zero, for
instance, automatically terminates the program. By instead returning a special
value such as Inf or NaN, floating point allows computation to continue while
still making it clear to us that something has gone wrong.

2.3 Numerical Error


In addition to accuracy and precision in our source numbers, the idiosyncrasies
of the floating point formats can create new errors. As we have seen, some
numbers can be represented precisely while others cannot. This mixing and
matching of numbers and storage types can lead to new errors, above and
beyond measurement error, that we must be able to plan and account for.

2.3.1 Round-Off Error and Machine ε


Another characteristic of number storage we should look at is machine error.
The machine error, frequently just called machine ε, captures the spacing of
numbers within the floating point space. As we have seen, floating point can
represent a very broad range of numbers. However, it cannot represent every
number between 0 and 1.79769313486232e+308.
Machine ε is defined as the smallest value such that,

1 + ε > 1, (2.6)

within the floating point system. Intuitively, we know that 1 plus any number
greater than 0 will be greater than 1. But with the number spacing within
floating point, that is not necessarily the case. Machine ε is the threshold
value for representation.
Given the smallest possible number we found the double precision type to
support, we might expect this number to be so minuscule as to be inconse-
quential. However, this number, available within R as .Machine$double.eps,
is comparatively quite large:
> .Machine$double.eps
[1] 2.220446e-16
While the smallest number is hundreds of orders of magnitude smaller than
even the smallest measurable distance, the value of machine ε is at the scale
of some values we might need. For instance, the width of a proton is
8.4087 × 10^−16 meters, a value just about four times larger than machine ε.
But other subatomic particles can be even smaller and it is possible to
inadvertently cross these limits with real-world analyses.
The limitation on machine ε is a real limit. Using the print function to
force R to print more digits, we can see what happens when we add machine
ε to 1, along with some nearby values:
> print(1 + .Machine$double.eps, digits = 20)
[1] 1.000000000000000222
> print(1 + .Machine$double.eps * 2, digits = 20)
[1] 1.0000000000000004441
> print(1 + .Machine$double.eps / 2, digits = 20)
[1] 1
Values equal to or greater than machine ε, when added to 1, are distinct
from 1. Those smaller are not. The computer simply rounds the result up or
down appropriately and leaves it at that.
R provides a second value to measure the precision of the floating point
implementation called .Machine$double.neg.eps. This is the smallest value
ε such that,

1 − ε < 1, (2.7)

within the floating point system. Like .Machine$double.eps, we expect any
positive number should meet this requirement, but for the same reasons, this
is not the case. This number is also much larger than some applications may
call for:
> .Machine$double.neg.eps
[1] 1.110223e-16
And we can see the number in action, just like with .Machine$double.eps.
> print(1 - .Machine$double.neg.eps, digits = 20)
[1] 0.99999999999999988898
> print(1 - .Machine$double.neg.eps * 2, digits = 20)
[1] 0.99999999999999977796
> print(1 - .Machine$double.neg.eps / 2, digits = 20)
[1] 1
Because floating point is a scientific notation system, the relative effect of
different addends to different numbers can change. We will define a new value,
εx, such that,

x + εx > x, (2.8)

in the floating point system, where x is any positive number. As x increases in
magnitude, the value of εx also increases. We can observe this experimentally
by adding ε to, for instance, 1000,

> print(1000 + .Machine$double.eps, digits = 20)
[1] 1000

which has no effect. The value of εx is directly related to, and proportional
to, the value of x. A similar relationship can be shown for the corresponding
negative ε for x. The library pracma contains a function, eps, that provides
the value of εx for any given value of x:
44 Computational Methods for Numerical Analysis with R

> library(pracma)
> eps(1000)
[1] 1.136868e-13
> eps(1000000)
[1] 1.164153e-10
> eps(1000000000)
[1] 1.192093e-07
As we can see, the value of εₓ grows with x, and the magnitude of εₓ has a
linear relationship with the magnitude of x.
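That proportionality can be sketched directly: doubles carry a fixed number of fraction bits, so the gap between x and the next representable value is machine ε scaled by the power of two just below x. The helper name eps_x below is our own simplified illustration, assuming normalized doubles; it is not pracma's actual implementation.

```r
## Approximate spacing between x and the next larger double:
## machine epsilon scaled by the binade containing x.
## (A simplified sketch; not pracma's implementation.)
eps_x <- function(x) 2^floor(log2(abs(x))) * .Machine$double.eps

eps_x(1000)        # 1.136868e-13, matching pracma::eps(1000)
eps_x(1000000000)  # 1.192093e-07
```

The factor 2^floor(log2(|x|)) identifies the power of two at or below x, which is why the error grows in proportion to the magnitude of x.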
The most important effect of this is that certain numbers cannot be represented
precisely within a floating point system. It is obvious that transcendental and
irrational numbers, such as π or √2, require infinite precision. Since infinite
precision requires infinite storage capacity, we can quickly assume these numbers
must be represented by approximations rather than true values.
This is a machine-induced error called round-off error. This error is the
difference between the true value x and its floating point representation,
x′. Round-off error is a manifestation of accuracy problems and can lead to
some bizarre interactions.
Transcendental and irrational numbers are not the only unrepresentable
numbers. Some perfectly rational numbers, such as 1/3, which repeats
infinitely, cannot be represented. And in binary, even 1/10 repeats infinitely.
These numbers are represented by rounded-off approximations. Finally, even
numbers that are rational and do not repeat in binary cannot be represented
if the mantissa requires more than 53 bits for storage, including the sign.
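A quick way to see this rounding in action is with 1/10, which has no finite binary expansion:

```r
## 0.1 is stored as a rounded binary approximation, so sums of
## tenths drift away from their exact decimal values
print(0.1 + 0.2, digits = 20)      # slightly more than 0.3
0.1 + 0.2 == 0.3                   # FALSE
isTRUE(all.equal(0.1 + 0.2, 0.3))  # TRUE: comparison with a tolerance
```

This is why comparing floating point results with all.equal, rather than ==, is the usual R idiom.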
Sometimes, this is a problem, but we can counter it with our intuitive
understanding of mathematics. For instance, we usually see the number 0.333 . . .
and convert it to 1/3 in our heads. But if we see 1.28571428571429, we probably
do not immediately recognize it as 9/7. As numerical analysis is the search
for an answer that is good enough for the application domain, we try to
balance the need for a true answer with an answer that meets the need.
Functions within the MASS library can convert decimals to reduced fractions.
If double precision is insufficient to meet our requirements, several
packages for R are available that can provide greater precision. The first is
gmp, which provides multiple precision arithmetic. Based on a widely available
library, also called gmp, this multiple precision library is well tested and
provides precision to whatever finite degree we ask. Further, the library
Rmpfr provides additional functionality using the gmp library as a base.
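As a brief sketch, the fractions function in MASS performs this conversion back to a rational form; the specific inputs here are our own illustrations:

```r
## Convert rounded decimals back to reduced fractions
library(MASS)

fractions(0.375)             # 3/8
fractions(1.28571428571429)  # a rational approximation near 9/7
```

The function uses a continued fraction expansion, so it can recover the simple fraction behind a decimal that was rounded off for display.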

2.3.2 Loss of Significance


While round-off error happens in or near the last place of a number, its
effects can appear in other places. Round-off error can be brought to prominence
with a simple subtraction exercise:
> 1 / 3 - 0.33333333333333
[1] 3.330669e-15
Here, 1/3 is stored as 53 binary digits, which is around 15 decimal digits of
significance. Subtracting 0.33333333333333, a non-repeating decimal, from 1/3
should return 3.3 × 10⁻¹⁵. But the floating point representation of 1/3 has no
significant digits after the fifteenth. As a result, the floating point subtraction
returns the leftover bits in the representation that are, in reality, the round-off
error in the storage of 1/3.
This problem is called loss of significance. We can see the same result
subtracting a number near to 1 from 1:
> 1 - 0.999999999999
[1] 9.999779e-13
Loss of significance will occur when subtracting any two “close” numbers when
at least one is not perfectly represented in binary. Mathematically, we note
that round-off error becomes more significant in subtraction of the form
x − (x − δ) as δ goes to 0, for any value of x. This error can intensify if the
subtraction is an interim value preceding a multiplication, since the interim
result will be multiplied as is and the error will increase in magnitude:
> (1 - 0.999999999999) * 1000
[1] 9.999779e-10
In practice, formulae can often be rearranged to remedy these sorts of
subtractions, as we saw with Kahan’s summation algorithm in Section 1.3.1.
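A direct illustration of rearranging away a subtraction uses the difference of square roots, where multiplying by the conjugate removes the cancellation entirely; the specific value of x here is our own illustration:

```r
## sqrt(x + 1) - sqrt(x) subtracts two nearly equal values for
## large x; the algebraically equivalent conjugate form computes
## the same quantity with no subtraction at all
x <- 1e12
sqrt(x + 1) - sqrt(x)        # suffers cancellation in the last digits
1 / (sqrt(x + 1) + sqrt(x))  # about 5e-07, at full precision
```

The two expressions are equal in exact arithmetic, but only the second keeps its full complement of significant digits in floating point.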
In other cases, round-off error can change intermediate results. For instance,
this example shows how the order of operations in the simple subtraction
problem 20.55 − 1.35 − 19.2 changes the result in floating point arithmetic:
> 20.55 - 19.2 - 1.35
[1] 1.332268e-15
> 20.55 - 1.35 - 19.2
[1] 0
When we perform both of these calculations by hand, the value is 0. But a
simple rearrangement, to accommodate the way the computer does its arithmetic,
changes the result from something close enough to 0 that we probably do not
care, to a value that is exactly 0.
Loss of significance is characterized by the behavior that the change in relative
error of a result, following an operation, is much larger than the change
in absolute error. As explained in Section 2.1.1, we use relative error to
understand the effect of the error, by showing its magnitude relative to the final
result. Generally, the change in relative error should be comparable to the change
in absolute error. Loss of significance flips this around by creating a situation
where the change in absolute error is very small, comparatively. This hinders
our ability to understand the error.
Fortunately, we can solve many instances of loss of significance by changing
our approach to the calculation. Using the familiar quadratic formula, we can

quadratic <- function(b2, b1, b0) {
    t1 <- sqrt(b1^2 - 4 * b2 * b0)
    t2 <- 2 * b2

    x1 <- -(b1 + t1) / t2
    x2 <- -(b1 - t1) / t2
    return(c(x1, x2))
}

R Function 2.1
The quadratic formula

demonstrate this. Given a, b, and c, such that ax² + bx + c = 0, the quadratic
formula is,

    x = (−b ± √(b² − 4ac)) / (2a),    (2.9)

which is implemented in R using the naïve approach in Function 2.1.
The quadratic formula is a classic example of potential loss of significance.
The formula includes a single subtraction that is, following a square root
operation, subject to a division. This combination can yield errors when the
subtraction’s terms are of the same magnitude, which may be the case with
b² − 4ac.
For some quadratic equations, this is a nonissue. If the equation in question
is 24x² − 50x + 14 = 0, then

    b² − 4ac = 50² − 4(24)(14)    (2.10)
             = 2500 − 1344        (2.11)
             = 1156               (2.12)

The square root of 1156 is 34 and nothing terribly interesting has happened.
This is usually the case, and there are a number of contrived examples
of loss of significance that take place in an imaginary floating point calculator
of five or ten significant digits. It is possible to avoid these by changing the
algorithm to not perform any subtraction in the numerator. This eliminates
the catastrophic cancellation that can cause problems.
For the quadratic formula,

    x = (−b ± √(b² − 4ac)) / (2a)                                              (2.13)
      = (−b ± √(b² − 4ac)) / (2a) × (−b ∓ √(b² − 4ac)) / (−b ∓ √(b² − 4ac))    (2.14)
      = 2c / (−b ∓ √(b² − 4ac)).                                               (2.15)
quadratic2 <- function(b2, b1, b0) {
    t1 <- sqrt(b1^2 - 4 * b2 * b0)
    t2 <- 2 * b0

    x1 <- t2 / (-b1 - t1)
    x2 <- t2 / (-b1 + t1)

    ## Reverse the order so they come
    ## back the same as quadratic()
    return(c(x2, x1))
}

R Function 2.2
The quadratic formula rewritten

An example of this is provided in Function 2.2, which is mathematically
equivalent to the quadratic formula.
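We can see the effect of the rewritten form on a small scale with a polynomial whose roots differ wildly in magnitude. The equation x² − 10⁸x + 1 = 0, a hypothetical example of our own and not drawn from the text, has true roots near 10⁸ and 10⁻⁸:

```r
## Small root of x^2 - 1e8 x + 1 = 0, true value near 1e-8
a <- 1; b <- -1e8; c <- 1
s <- sqrt(b^2 - 4 * a * c)

(-b - s) / (2 * a)   # naive formula: cancellation destroys the digits
(2 * c) / (-b + s)   # rewritten form: close to 1e-08, accurate
```

In the naive form, −b and s agree in nearly every digit, so their difference retains almost no significance; the rewritten form adds them instead.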
In practice, double precision numeric calculations provide sufficient significance
that these results are not common. However, it is possible to experience
catastrophic cancellation. In one more extreme case, assume that,³

    a = 94906265.625,         (2.16)
    b = 189812534.000, and    (2.17)
    c = 94906268.375.         (2.18)

These values, under double precision arithmetic, yield incorrect results,
though we must use the print function to see the higher precision in R:

> b2 <- 94906265.625


> b1 <- 189812534.000
> b0 <- 94906268.375
> print(quadratic(b2, b1, b0), digits = 20)
[1] -1.0000000144879792607 -1.0000000144879792607
The correct results are −1 and −1.000000028975958 . . ., though even under double
precision arithmetic, R is unable to calculate them correctly using the quadratic
formula. Not every proposed fix can solve each problem. For instance, we can
use quadratic2 with this example, as well:
> print(quadratic2(b2, b1, b0), digits = 20)
[1] -1.0000000144879790387 -1.0000000144879790387
3 This example was discovered by Kahan (November 20, 2004), who provides substantially
more detail on the subject. In addition, Kahan’s work on loss of significance is remarkable,
leading to the Kahan summation algorithm from Section 1.3.1.

Here, we have a different incorrect result that is not, in any meaningful way,
better than the results we received from the quadratic function.

2.3.3 Overflow and Underflow


Because R’s numeric data type can hold such a wide range of numbers,
underflow and overflow are less likely to be a problem than they could be with
other systems. However, it is good to understand the risks of underflow and
overflow when R is working with other operating environments.
Underflow is the phenomenon where a number is too small in magnitude,
regardless of sign, to be represented by the computer as anything other than 0.
Double precision numeric data types, as shown, can represent numbers so small
that they are not realistically usable in most applications. Like other machine
characteristics, this limit is available for inquiry through the .Machine object:
> .Machine$double.xmin
[1] 2.225074e-308
This number, if printed directly, would be a decimal point followed by
almost 300 zeros before the first nonzero digit appeared. This number is
significantly smaller than almost all numbers that may be used in practice. For
comparison, the Planck length, the smallest theoretical measurable distance,
is 1.61619926 × 10⁻³⁵ meters (Garay 1995). The lowest density vacuum ever
created in a laboratory was approximately 1 × 10⁻¹⁸ Pascals (Benvenuti and
Chiggiato 1993). And the probability of rolling a die 20 times and getting the
same number each time is approximately 2.74 × 10⁻¹⁶.
Even at the extreme lower boundary of small-scale physics and probability,
the numbers involved are far larger than the lower bound of double
precision. These exceptionally small numbers are, at best, 270 orders of
magnitude larger than the smallest number a double precision number can hold.
But smaller numbers are there if we need them, down to anything between 0
and .Machine$double.xmin.
Even an attempt to contrive an example to illustrate the point is foiled by
the floating point standard. The simplest way to actually underflow should be
to divide .Machine$double.xmin by 2, which should yield 0.
> .Machine$double.xmin / 2
[1] 1.112537e-308
In this case, the machine has outwitted us, again. The internal floating
point representation has a method to store even smaller numbers than
.Machine$double.xmin by moving leading zeros into the mantissa. These numbers
are called denormalized and provide a method to push the underflow boundary
even lower than .Machine$double.xmin. However, the implementation is
machine dependent and may not necessarily be available. In practice, we are
unlikely to reach the underflow boundary with a real-world problem.
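We can see the denormalized range directly; on machines using standard double precision, the smallest denormalized value is about 4.9 × 10⁻³²⁴, and halving it finally reaches true underflow:

```r
## Denormalized doubles extend below double.xmin, but only so far
5e-324        # the smallest representable denormalized value
5e-324 / 2    # underflows: rounds to 0
```

Note that denormalized values carry fewer significant digits than normalized ones, so precision degrades well before underflow is reached.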
Finally, we should keep in mind that the .Machine$double.xmin value is
substantially smaller than machine ε. As shown in equation 2.8, the value of the
next floating point number after x changes based on how large x is. In fact,
the floating point numbers are closer together the closer they are to 0. Further,
there are approximately the same number of floating point numbers between 0
and 1 as there are greater than 1.
At the other end of the spectrum, when the number is too large to be
expressed, we call the phenomenon overflow. Like underflow, overflow is largely
a theoretical problem, though one that is easier to create an example for. The
largest value can be interrogated in R using .Machine$double.xmax:
> .Machine$double.xmax
[1] 1.797693e+308
Because the same bits that hold negative exponents also hold positive
exponents, this is a number with more than 300 digits before the decimal point.
This number is significantly larger than almost all numbers that might be used
in practice. For comparison, cosmologists estimate there are only 10⁸⁰ hydrogen
atoms in the entire Universe. There are 2²⁵⁶ ≈ 1.15792089 × 10⁷⁷ encryption
keys possible with the 256-bit variant of the AES encryption system. And the
surface area of the Milky Way’s galactic disk is roughly 10 × 10⁴¹ square
meters.⁴
Like the smallest numbers, even at the extreme upper boundary of practical
mathematics and physics, the numbers involved are far smaller than the upper
bound of double precision. These exceptionally large numbers are more than
200 orders of magnitude smaller than the largest number double precision
can store. And as with the smallest numbers, we have larger numbers
available, if we need them, up to .Machine$double.xmax.
Here we can better contrive an example to illustrate the point. First,
limiting the values to those stored as integers, doubling the value will result
in integer overflow:
> 2147483647L * 2L
[1] NA
And a warning may be displayed, as well.
Of course, without the specification that the calculation be conducted with
32-bit integers, R will dutifully accept it and calculate a result:
> 2147483647 * 2
[1] 4294967294
An advantage to using double precision data types to store numbers is
that they can store integers up to 2⁵³ without any loss of precision.
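The 2⁵³ boundary is easy to observe: above it, the gap between consecutive doubles exceeds 1 and odd integers disappear.

```r
## Integers are exact in a double up to 2^53; beyond that, the
## spacing between representable values is 2 or more
2^53 == 2^53 + 1   # TRUE: 2^53 + 1 rounds back to 2^53
2^53 == 2^53 + 2   # FALSE: 2^53 + 2 is representable
```

This is why very large integer identifiers, when stored as doubles, can silently collide.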
Even though the computer has a mechanism to handle numbers below
.Machine$double.xmin, there is no comparable method to store numbers
larger than .Machine$double.xmax. Multiplying .Machine$double.xmax by
2 yields a curious result:

4 Both this value and the number of atoms in the Universe are drawn from Wikipedia
pages. Neither is properly cited there, either.


> .Machine$double.xmax * 2
[1] Inf
While we know this number is not infinity, R returns Inf because it has no
other value available to signify that overflow has occurred.

In the cases of both underflow and overflow, R does not raise an error. It
simply uses the best available option. In the case of underflow, using 0 is
likely a good enough answer. In the case of overflow, it is unlikely that
infinity is a good enough answer, but if seen, it should appropriately signal
that something has gone wrong.
For the same reasons as underflow, real data is unlikely to cause an overflow
condition. However, both overflow and underflow can arise in intermediate
calculations. This happens when all the input variables are representable, as
is the true final answer, but some intermediate calculation, such as an
intermediate sum when calculating a mean, is larger than can be represented.
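A short sketch shows how an intermediate value can overflow even when the inputs and the true answer are both representable; the values here are our own illustration:

```r
## Hypotenuse of a right triangle with both legs 1e300: the true
## answer (about 1.414e300) is representable, but squaring the
## legs overflows first
x <- 1e300
sqrt(x^2 + x^2)      # Inf: the intermediate x^2 overflows
x * sqrt(2)          # finite, computed by factoring x out first
```

Rescaling before squaring, as the second line does, is the standard remedy and is how robust hypotenuse routines avoid the problem.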
In addition, not all systems will process data with double precision floating
point numbers. R is capable of interacting with and exchanging data with
systems that only support single precision arithmetic, where the limits are
substantially smaller. In a single precision environment, the smallest possible
number is 5.87747175411144 × 10⁻³⁹ and the largest possible number is
3.40282366920938 × 10³⁸. While these bounds still exceed the needs of most
applications, we have seen examples where they are insufficient.
We should keep in mind where the boundaries are and be aware if they are
approaching. It is possible that we may reach a boundary during an intervening
step in a complex calculation. Monitoring and understanding the calculation’s
complete process will allow us to avoid those conditions, or at least be aware
that a result may not be trusted. We will examine that in the next subsection.

2.3.4 Error Propagation and Stability


Each of these error types can manifest at any time. For a single arithmetic
operation, this error is probably well below the threshold of concern. But we
do not usually use the computer like a pocket calculator. We use the computer
to solve more complicated problems that require a lot of mathematics.
Some operations, like matrix transformations, require numerous calculations
in succession in order to reach a result. Each of those operations is subject
to numerical error, and those errors can compound in a form of cascading
error. Understanding how errors in one calculation affect other calculations in
a complex operation is known as error propagation.
Using a practical example, rather than one from machine arithmetic, puts
this within grasp. Imagine a traditional ruler with marks every millimeter.⁵
There is a natural limitation of precision where no measurement of greater
precision than a millimeter is possible. A length measurement, ℓ, will
necessarily include some error. We will say that ℓ = ℓ′ ± ε, where ℓ is the true
length, ℓ′ is the measured length, and ε is some error. In this case, ε is 0.5
millimeters. An example is shown in Figure 2.4, where the potential error area
is shaded.

Figure 2.4
Error in the measurement of length

Figure 2.5
Additive error in measurement of length

5 You may, if necessary, use sixteenths of an inch as a substitute for the millimeter here;
the mathematics works out identically.
We can allow that this measured error is acceptable because, if it were not,
we would use a better measuring tool. However, even with that allowance,
it can cause trouble. If given two measurements that we must add, the
errors themselves are additive. Under the same circumstances, we have two
measurements, ℓ₁ and ℓ₂, such that,

    ℓ₁ = ℓ′₁ ± ε     (2.19)
    ℓ₂ = ℓ′₂ ± ε.    (2.20)

Adding the values together leads to a combined error of,

    ℓ₁ + ℓ₂ = (ℓ′₁ ± ε) + (ℓ′₂ ± ε)    (2.21)
            = ℓ′₁ + ℓ′₂ ± 2ε.          (2.22)

Assuming each measurement were taken the same way, ε₁ = ε₂, but the error
grows, regardless. In the example with the ruler, the combined error of two
measurements added is one millimeter. This is shown in Figure 2.5, where
the potential area, subject to the error estimates, is shaded. Accordingly, the
errors grow linearly: if we were designing a container to hold ten objects of
length ℓ, the combined error is 10ε, and so errors can propagate through an
addition process.
Figure 2.6
Multiplicative error in measurement of length

Other situations can amplify error more extremely. Imagine two measurements,
a length, ℓ, and a height, h, with a desire to find the surface area. If
both were taken via the same method and have the same potential error in
the measurement, ε, the error is multiplicative across all terms, and,

    ℓ × h = (ℓ′ ± ε) × (h′ ± ε)          (2.23)
          = ℓ′h′ ± (ℓ′ε + h′ε + ε²).     (2.24)

The entire error in the surface area is ℓ′ε + h′ε + ε². If the measurement error
were assumed to be within half a millimeter, as in the prior example, then the
combined potential error in the surface area is more than half the length and
height added together, in square millimeters. This is shown graphically in
Figure 2.6. The potential error in a volume measurement is, commensurately,
cubical.
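Equation (2.24) gives a worst-case bound we can evaluate directly; the measured values here are invented for illustration:

```r
## Worst-case area error for measured l' = 100 mm, h' = 50 mm
## with a ruler error of eps = 0.5 mm, per equation (2.24)
eps <- 0.5
l <- 100
h <- 50
l * eps + h * eps + eps^2   # 75.25 square millimeters of potential error
```

Against a nominal area of 5000 mm², that is a relative error of about 1.5 percent, far larger than either measurement's 0.5 percent.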
Through these examples, we can see the effects of compounding errors.
When a value is subject to machine error, through the floating point
representation or through some other source, the same error propagation patterns
emerge. In the case of the surface area shown above, the calculation will yield
a result, even if it is not necessarily correct.
But if one of the source values, such as the height, had overflowed, and was
now represented as Inf, then the final result would be Inf. The same thing
can happen with NaN. While these results are obviously unnatural, they arise
just as easily through mistake as they do through actual results. An accidental
logarithm of 0 returns -Inf and if buried within a larger calculation, we can
be left scratching our heads for a few minutes as we look for what went wrong.
This leads to the search for numerical stability, another important goal of
numerical analysis. Numerical stability is the property of a numerical algorithm
that the approximation errors, from whatever source, are not amplified by the
algorithm. This property manifests in two important ways.
The first way it manifests is that numerically stable algorithms will not
produce drastic output changes after small changes in the input. Drastic changes
are sometimes in the eye of the beholder. An algorithm where the output
is an exponential function of the input will grow quickly as the input value
increases. Here, a drastic change is one beyond the expectation of change.
The second manifestation of numerical stability refers to how the algorithm
behaves near certain boundary values or singularity points. An algorithm that
divides by the input will behave well until the input value is 0, division’s
singularity. However, the algorithm’s behavior near the singularity is as
expected, and it only produces incorrect results at 0.
A numerically unstable program is not well behaved in this sense. One
example would be an algorithm that magnifies the inherent errors without
bound. An analysis of the program will show that the final error is
indeterminate, or, if determined, that the relative error is large, leading to
unusable results. Some algorithms are only unstable at certain boundary points,
when a component function has no return value or meaning. In these cases, the
algorithm may be stable for some input domains, but not others.
Numerical stability is a desirable property to have in an algorithm, but it is
generally found in an algorithm-problem pairing. As a result, it does not nec-
essarily follow that an algorithm-problem pairing which is stable will remain
stable if the problem is changed. Sometimes, problem areas can be identified
ahead of time. In other cases, instabilities are best observed experimentally.
Observing how an algorithm behaves and produces results for different inputs
is a key aspect of numerical analysis.

2.4 Applications
With a firm understanding of how data is stored in R, we will now explore the
first application of this knowledge. In Section 2.4.1, we will see a simple method
to perform integer division, with a remainder, through repeated subtraction.
This method is slow and not very interesting. Then, using the mechanics of
binary data, we can explore a new method that implements long division as
we learned it in school.

2.4.1 Simple Division Algorithms


In our exploration of fundamental problems and efficiency, we begin by
exploring two algorithms which should not be used in practice. These algorithms
perform basic division on integers, returning the quotient and a remainder.
This is usually called Euclidean division. These algorithms should not be used
naivediv <- function(m, n) {
    quot <- 0
    r <- m

    if (n == 0)
        stop("Attempted division by 0")

    while (r >= n) {
        quot <- quot + 1
        r <- r - n
    }

    return(list(quotient = quot, remainder = r))
}

R Function 2.3
Naïve division by repeated subtraction

as a replacement for the internal R operators for division and modulo, the
integer division operation that returns only the remainder. However, they do
perform the operations and can be implemented in just a few lines of R.
The simplest approach to division mimics the basic method of multiplication.
When learning multiplication, we learn the times tables. But before that,
we learn to intuit the mechanism by successively adding blocks of numbers
together. Given four sets of three apples each, we learn to multiply by adding
four sets of three together, 4 × 3 = 3 + 3 + 3 + 3 = 12. We can divide by
reversing the process.
This method is provided in Function 2.3, the function naivediv. We will
describe this function, and others, as naïve, but this should not be considered
a pejorative. The naïve approach is the most basic approach one could take
to solving a problem. The naivediv function accepts two arguments. The first
is the dividend, the numerator of the division process. The second is the
divisor, the denominator.

The function begins by setting an internal value, r, to the dividend. From
there, the function subtracts the divisor from r in a loop until the value
of r is less than the divisor. For each subtraction, a second internal value,
quot, is incremented. When r is less than the divisor, quot holds the quotient
and r contains the remainder. The function returns a list with two components,
quotient and remainder, representing each value, respectively.
The algorithm is, at best, inefficient. It will cycle through the loop
q = ⌊m/n⌋ times. That is quite a bit, and while it decreases when n is larger, it
increases when m is larger. The function returns the quotient and remainder
for any positive values of m and n:
> naivediv(314, 7)
$quotient
[1] 44

$remainder
[1] 6
Before using these division functions, remember that R has a built-in
operator that performs the modulo operation, %%. The floor function, combined
with basic division, should be used to get the quotient. These built-in
operations should be the preferred choice:
> floor(314 / 7)
[1] 44
> 314 %% 7
[1] 6
Like most algorithms in this text, there are cleaner, faster, and more robust
implementations already available within R.

2.4.2 Binary Long Division


When we are learning division, we only use the inverse multiplication method
early on and for small numbers. We also use the standard multiplication tables
as a mental shortcut. For larger numbers, we use long division, dreaded
by elementary students for generations. The long division algorithm works
by breaking the dividend into its components, at the level of its digits, and
dividing each set of digits. Revisiting the last example, the long division for
314/7 is:6

    314 │ 7
     28 │ 44
     34
     28
      6
Of course, the dividend is broken up into groups of digits matching the
number of digits in the divisor, with each step.

The algorithm is typically conducted in base 10, but we can replicate
this process in any base. The function longdiv, provided in Function 2.4,

6 With apologies to the American readers, but as differing tableau styles for long division
are available via LaTeX, the French style is much clearer to follow.
longdiv <- function(m, n) {
    quot <- 0
    r <- 0

    if (n == 0)
        stop("Attempted division by 0")

    for (i in 31:0) {
        r <- bitwShiftL(r, 1)
        r <- r + bitwAnd(bitwShiftR(m, i), 1)
        if (r >= n) {
            r <- r - n
            quot <- quot + bitwShiftL(1, i)
        }
    }

    return(list(quotient = quot, remainder = r))
}

R Function 2.4
Binary long division

does this. This function, after initializing the quotient and remainder,
executes a simple loop over the bits of the dividend. The function need only
execute over the significant bits, eliminating leading 0s; however, when it does
loop over those leading bits, the algorithm simply contributes 0 for each of them.
The algorithm uses the bit manipulation functions inside of R to provide
access to the bits. First, shifting the bits one unit to the left is equivalent to
multiplying by two. A shift to the right is equivalent to dividing by two, with
the remainder lost. The quotient is built up bit by bit, with the highest order
bit first; any remainder, at each point, is carried through to the next step.
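The bit operations longdiv relies on can be sketched in isolation:

```r
## Shifts and masks as used by longdiv
bitwShiftL(5, 1)               # 10: left shift multiplies by two
bitwShiftR(5, 1)               # 2: right shift halves, dropping the remainder
bitwAnd(bitwShiftR(13, 2), 1)  # 1: extracts bit 2 of 13 (binary 1101)
```

The last pattern, shift then mask with 1, is exactly how the loop reads each bit of the dividend from highest to lowest.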
> longdiv(314, 7)
$quotient
[1] 44

$remainder
[1] 6
The longdiv function is substantially faster than division by repeated
subtraction. Further, its runtime is bounded by the size of the dividend, not both
the dividend and the divisor. In this implementation, the loop is executed over
all the available bits of storage, regardless of whether they are used. Therefore,
the loop will execute exactly 32 times, sufficient for the 32 bits of any integer
in R. For 1000/3, the naivediv function executes the loop 333 times, more than
ten times the maximum of longdiv.
Comments
Understanding error, its nature and sources, is a key part of numerical analysis.
There are two main types of error we must contend with, round-off and
algorithmic, and they manifest in different ways and for different reasons. Worse,
they can appear individually or in concert with each other. However, each type
of error can be mitigated through defensive programming techniques and the
knowledge that, no matter how hard we try, error cannot be eliminated. It
can, at best, be bounded, in the general case.

Further, the use of double precision arithmetic allows us to stave off some
cases of round-off error before they happen. The precision of a double precision
number is sufficient to store most numbers we need and solve most problems
we have. And there is good news in this.
Most numerical analysis problems represent some form of applied mathematical
problem. Perhaps it is defining a curve for a wind turbine or estimating
wealth disparity. Applied mathematical problems generally involve some type
of measurement of real-world values. In practice, our measurement error will
be a greater limiting factor for many of the calculations we will perform. For
instance, a standard ruler will only have a measurement resolution of 1/16th of
an inch, or so. That number can be stored as a floating point number without
problems. However, with real-world data, each of the individual measurement
errors is a range that is added to or subtracted from the given measurement. If
we are lucky, these errors cancel each other out, leaving us with a net-zero error.
In practice, when measurement error is unbiased, the error is centered
at 0 and appears with an estimable frequency pattern.
But a measuring tool, such as a ruler, may be biased, one way or the other.
Consequently, especially in the realm of error propagation, the defined errors
are maximum potential errors, not minimums, and certainly not estimates of the
actual error. It is tempting to think that if we know what the error is, we can
just adjust our results to accommodate it. But in reality, all we have is an
upper bound, in most cases.
As we work with numerical analysis, we want to keep an eye on that upper
bound of the error. The upper bound is still useful. We can use that to manage
expectations on results. Many of the algorithms we work with are focused
on minimizing that upper bound. When we fit that upper bound within a
tolerance we have specified, we know that our answer was good enough, for
whatever good enough means at that time. Fitting our algorithm to meet
our needs is a crucial part of the selection process, on par with meeting the
requirements we specified in Chapter 1.
The rest of this book will now introduce a series of algorithms for solving
different types of mathematical problems. Some are general solutions and some
are keyed to specific problems. Finally, some will be applied to specific
real-world problems as we see how all of this connects back to applied mathematics.
Exercises
1. What is the binary representation, as an integer, of the number
4714?
2. What is the floating point representation of the number 8673.34?
3. What is the floating point representation of the number 1/6?
4. The number e, the base of the natural logarithm, is approximately
2.71. What is the absolute error in that estimate? What is the
relative error?
5. Give an example of a measurement which is accurate, but not
precise.
6. Give an example of a measurement which is precise, but not
accurate.
7. How many numbers are there between 1 and .Machine$double.xmax
in double precision arithmetic?
8. How many numbers are there between 0 and 1 in double precision
arithmetic?
9. If a 5×6 array contains double precision numeric, find the minimum
number of bytes necessary to store the array.
10. Write the data storage format for the double precision representation
of the numbers 1/2, 1/3, and 1/10.
11. Find the smallest possible ε such that 1000 + ε > 1000.
12. Rewrite f(x) = √(x² + 1) − 1 so there is no loss of significance.
13. What is the relationship between εₓ and ε₋ₓ?
14. Some computers support quadruple precision arithmetic, using 128
bits to store a number. The exponent uses 15 bits, the fractional
part uses 112 bits, and the sign is stored in a single bit. What is the
largest and smallest number this can represent?
15. Research other methods of storing numeric data such as binary-
coded decimal, fixed-point, or Java’s BigDecimal and discuss the
pros and cons of these methods versus floating point numeric data
types.
