Introduction to High Performance Scientific Computing
Victor Eijkhout
with
Edmond Chow, Robert van de Geijn
Preface
The field of high performance scientific computing lies at the crossroads of a number of disciplines and skill sets,
and correspondingly, for someone to be successful at using parallel computing in science requires at least elementary
knowledge of and skills in all these areas. Computations stem from an application context, so some acquaintance
with physics and engineering sciences is desirable. Then, problems in these application areas are typically translated
into linear algebraic, and sometimes combinatorial, problems, so a computational scientist needs knowledge of
several aspects of numerical analysis, linear algebra, and discrete mathematics. An efficient implementation of
the practical formulations of the application problems requires some understanding of computer architecture, both
on the CPU level and on the level of parallel computing. Finally, in addition to mastering all these sciences, a
computational scientist needs some specific skills in software management.
The authors of this book felt that, while good texts exist on applied physics, numerical linear algebra, computer architecture, parallel computing, and performance optimization, no book brings together these strands in a unified manner.
The need for a book such as the present one was especially apparent at the Texas Advanced Computing Center: users of the facilities there often turn out to miss crucial parts of the background that would make them efficient computational scientists. This book, then, comprises those topics that the authors have found indispensable for scientists engaging in large-scale computations.
The contents of this book are a combination of theoretical material and self-guided tutorials on various practical skills. The theory chapters have exercises that can be assigned in a classroom; however, their placement in the text is such that a reader not inclined to do exercises can simply take them as statements of fact.
The tutorials should be done while sitting at a computer. Given the practice of scientific computing, they have a
clear Unix bias.
Public draft This book is unfinished and open for comments. What is missing or incomplete or unclear? Is
material presented in the wrong sequence? Kindly mail me with any comments you may have.
You may have found this book in any of a number of places; the authoritative download location is http://www.tacc.utexas.edu/~eijkhout/istc/istc.html. It is also possible to get a nicely printed copy from
lulu.com: http://www.lulu.com/product/paperback/introduction-to-high-performance-scientific
12995614.
Acknowledgement Helpful discussions with Kazushige Goto and John McCalpin are gratefully acknowledged.
Thanks to Dan Stanzione for his notes on cloud computing and Ernie Chan for his notes on scheduling of block
algorithms. Thanks to Elie de Brauwer for many comments.
Introduction
Scientific computing is the cross-disciplinary field at the intersection of modeling scientific processes, and the use
of computers to produce quantitative results from these models. As a definition, we may posit
The efficient computation of constructive methods in applied mathematics.
This clearly indicates the three branches of science that scientific computing touches on:
• Applied mathematics: the mathematical modeling of real-world phenomena. Such modeling often leads
to implicit descriptions, for instance in the form of partial differential equations. In order to obtain actual
tangible results we need a constructive approach.
• Numerical analysis provides algorithmic thinking about scientific models. It offers a constructive ap-
proach to solving the implicit models, with an analysis of cost and stability.
• Computing takes numerical algorithms and analyzes the efficacy of implementing them on actually ex-
isting, rather than hypothetical, computing engines.
One might say that ‘computing’ became a scientific field in its own right, when the mathematics of real-world
phenomena was asked to be constructive, that is, to go from proving the existence of solutions to actually obtaining
them. At this point, algorithms become an object of study themselves, rather than a mere tool.
The study of algorithms became important when computers were invented. Since mathematical operations now
were endowed with a definable time cost, the complexity of algorithms became a field of study; since computing was no
longer performed in ‘real’ numbers but in representations in finite bitstrings, the accuracy of algorithms needed to
be studied. (Some of these considerations predate the existence of computers, having been inspired by computing
with mechanical calculators. Also, Gauss discussed iterative solution of linear systems as being more robust under
occasional mistakes in computing.)
A prime concern in scientific computing is efficiency. While to some scientists the abstract fact of the existence of a
solution is enough, in computing we actually want that solution, and preferably yesterday. For this reason, we will
be quite specific about the efficiency of both algorithms and hardware.
1 Sequential Computer Architecture
In order to write efficient scientific codes, it is important to understand computer architecture. The difference in
speed between two codes that compute the same result can range from a few percent to orders of magnitude,
depending only on factors relating to how well the algorithms are coded for the processor architecture. Clearly,
it is not enough to have an algorithm and ‘put it on the computer’: some knowledge of computer architecture is
advisable, sometimes crucial.
Some problems can be solved on a single CPU, others need a parallel computer that comprises more than one
processor. We will go into detail on parallel computers in the next chapter, but even for parallel processing, it is
necessary to understand the individual CPUs.
In this chapter, we will focus on what goes on inside a CPU and its memory system. We start with a brief general
discussion of how instructions are handled, then we will look into the arithmetic processing in the processor core;
last but not least, we will devote much attention to the movement of data between memory and the processor, and in-
side the processor. This latter point is, maybe unexpectedly, very important, since memory access is typically much
slower than executing the processor’s instructions, making it the determining factor in a program’s performance;
the days when ‘flop counting’ (counting floating point operations) was the key to predicting a code’s performance are long gone. This discrepancy is
in fact a growing trend, so the issue of dealing with memory traffic has been becoming more important over time,
rather than going away.
This chapter will give you a basic understanding of the issues involved in CPU design, how it affects performance,
and how you can code for optimal performance. For much more detail, see an online book about PC architecture [57], and the standard work about computer architecture, Hennessy and Patterson [51].
Most processors, whatever their differences in detail, can on a high level of abstraction be described as von Neumann architectures. This describes a design with an undivided memory that stores both program and data (‘stored program’), and a processing unit that executes the instructions, operating on the data.
This setup distinguishes modern processors from the very earliest, and some special purpose contemporary, designs
where the program was hard-wired. It also allows programs to modify themselves, since instructions and data are
in the same storage. This allows us to have editors and compilers: the computer treats program code as data to
operate on. In this book we will not explicitly discuss compilers, the programs that translate high level languages to
machine instructions. However, on occasion we will discuss how a program at high level can be written to ensure
efficiency at the low level.
In scientific computing, however, we typically do not pay much attention to program code, focusing almost exclu-
sively on data and how it is moved about during program execution. For most practical purposes it is as if program
and data are stored separately. The little that is essential about instruction handling can be described as follows.
The machine instructions that a processor executes, as opposed to the higher level languages users write in, typically
specify the name of an operation, as well as of the locations of the operands and the result. These locations are not
expressed as memory locations, but as registers: a small number of named memory locations that are part of the
CPU2 . As an example, here is a simple C routine
void store(double *a, double *b, double *c) {
*c = *a + *b;
}
and its X86 assembler output, obtained by3 gcc -O2 -S -o - store.c:
.text
.p2align 4,,15
.globl store
.type store, @function
store:
movsd (%rdi), %xmm0 # Load *a to %xmm0
addsd (%rsi), %xmm0 # Load *b and add to %xmm0
movsd %xmm0, (%rdx) # Store to *c
ret
The instructions here are:
• A load from memory to register;
• Another load, combined with an addition;
• Writing back the result to memory.
Each instruction is processed as follows:
• Instruction fetch: the next instruction according to the program counter is loaded into the processor. We will ignore the questions of how and from where this happens.
• Instruction decode: the processor inspects the instruction to determine the operation and the operands.
• Memory fetch: if necessary, data is brought from memory into a register.
• Execution: the operation is executed, reading data from registers and writing it back to a register.
• Write-back: for store operations, the register contents is written back to memory.
2. Direct-to-memory architectures are rare, though they have existed. The Cyber 205 supercomputer in the 1980s could have 3 data streams, two from memory to the processor, and one back from the processor to memory, going on at the same time. Such an architecture is only feasible if memory can keep up with the processor speed, which is no longer the case these days.
3. This is 64-bit output; add the option -m64 on 32-bit systems.
Complicating this story, contemporary CPUs operate on several instructions simultaneously, which are said to be
‘in flight’, meaning that they are in various stages of completion. This is the basic idea of the superscalar CPU
architecture, and is also referred to as instruction-level parallelism. Thus, while each instruction can take several
clock cycles to complete, a processor can complete one instruction per cycle in favourable circumstances; in some
cases more than one instruction can be finished per cycle.
The main statistic that is quoted about CPUs is their Gigahertz rating, implying that the speed of the processor is the
main determining factor of a computer’s performance. While speed obviously correlates with performance, the story
is more complicated. Some algorithms are cpu-bound , and the speed of the processor is indeed the most important
factor; other algorithms are memory-bound , and aspects such as bus speed and cache size become important.
In scientific computing, this second category is in fact quite prominent, so in this chapter we will devote plenty of
attention to the process that moves data from memory to the processor, and we will devote relatively little attention
to the actual processor.
1.1.1.1 Pipelining
The floating point add and multiply units of a processor are pipelined, which has the effect that a stream of inde-
pendent operations can be performed at an asymptotic speed of one result per clock cycle.
The idea behind a pipeline is as follows. Assume that an operation consists of multiple simpler operations, and that
for each suboperation there is separate hardware in the processor. For instance, a multiply instruction can have the
following components:
• Decoding the instruction, including finding the locations of the operands.
• Copying the operands into registers (‘data fetch’).
• Aligning the exponents; the multiplication .3×10⁻¹ × .2×10² becomes .0003×10² × .2×10².
• Executing the multiplication, in this case giving .00006×10⁴.
• Normalizing the result, in this example to .6×10⁰.
• Storing the result.
These parts are often called the ‘stages’ or ‘segments’ of the pipeline.
If every component is designed to finish in 1 clock cycle, the whole instruction takes 6 cycles. However, if each has
its own hardware, we can execute two multiplications in less than 12 cycles:
• Execute the decode stage for the first operation;
• Do the data fetch for the first operation, and at the same time the decode for the second.
• Execute the third stage for the first operation and the second stage of the second operation simultaneously.
• Et cetera.
You see that the first operation still takes 6 clock cycles, but the second one is finished a mere 1 cycle later. This
idea can be extended to more than two operations: the first operation still takes the same amount of time as before,
but after that one more result will be produced each cycle. Formally, executing n operations on an s-segment pipeline
takes s + n − 1 cycles.
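As a check on this model, here is a small sketch that tabulates the classical time T_0(n) = s·n, the pipelined time T_s(n) = s + n − 1, and the resulting speedup; the function names and the choice s = 6 are illustrative only.

#include <stdio.h>

/* Timing model from the text: a non-pipelined unit needs s cycles per
   operation, so T0(n) = s*n; an s-segment pipeline needs Ts(n) = s + n - 1. */
static double t_classic(int s, long n)  { return (double)s * n; }
static double t_pipeline(int s, long n) { return (double)s + n - 1; }

int main(void) {
  int s = 6;                              /* six stages, as in the example above */
  long ns[] = {1, 2, 10, 100, 10000};
  for (int i = 0; i < 5; i++) {
    long n = ns[i];
    printf("n=%6ld  T0=%8.0f  Ts=%8.0f  speedup=%6.2f\n",
           n, t_classic(s, n), t_pipeline(s, n),
           t_classic(s, n) / t_pipeline(s, n));
  }
  return 0;                               /* the speedup approaches s as n grows */
}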
Exercise 1.1. Let us compare the speed of a classical floating point unit, and a pipelined one. If the
pipeline has s stages, what is the asymptotic speedup? That is, with T_0(n) the time for n
operations on a classical CPU, and T_s(n) the time for n operations on an s-segment pipeline,
what is lim_{n→∞} T_0(n)/T_s(n)?
Next you can wonder how long it takes to get close to the asymptotic behaviour. Define S_s(n)
as the speedup achieved on n operations. The quantity n_{1/2} is defined as the value of n such
that S_s(n) is half the asymptotic speedup. Give an expression for n_{1/2}.
Since a vector processor works on a number of instructions simultaneously, these instructions have to be indepen-
dent. The operation ∀i : a_i ← b_i + c_i has independent additions; the operation ∀i : a_{i+1} ← a_i·b_i + c_i feeds the result
of one iteration (a_i) to the input of the next (a_{i+1} = . . .), so the operations are not independent.
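Written out as C loops (a sketch; the function and variable names are arbitrary), the difference looks as follows:

/* Independent iterations: each result depends only on its own inputs,
   so successive operations can overlap in the pipeline. */
void independent(int n, double *a, const double *b, const double *c) {
  for (int i = 0; i < n; i++)
    a[i] = b[i] + c[i];
}

/* Dependent iterations: a[i+1] uses the a[i] computed in the previous
   iteration, so each operation has to wait for the one before it. */
void dependent(int n, double *a, const double *b, const double *c) {
  for (int i = 0; i < n-1; i++)
    a[i+1] = a[i]*b[i] + c[i];
}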
[A pipelined processor can speed up operations by a factor of 4, 5, 6 with respect to earlier CPUs. Such numbers
were typical in the 1980s when the first successful vector computers came on the market. These days, CPUs can
have 20-stage pipelines. Does that mean they are incredibly fast? This question is a bit complicated. Chip designers
continue to increase the clock rate, and the pipeline segments can no longer finish their work in one cycle, so they
are further split up. Sometimes there are even segments in which nothing happens: that time is needed to make sure
data can travel to a different part of the chip in time.]
The amount of improvement you can get from a pipelined CPU is limited, so in a quest for ever higher performance
several variations on the pipeline design have been tried. For instance, the Cyber 205 had separate addition and
multiplication pipelines, and it was possible to feed one pipe into the next without data going back to memory first.
Operations like ∀i : ai ← bi + c · di were called ‘linked triads’ (because of the number of paths to memory, one
input operand had to be scalar).
Exercise 1.2. Analyse the speedup and n1/2 of linked triads.
Another way to increase performance is to have multiple identical pipes. This design was perfected by the NEC SX
series. With, for instance, 4 pipes, the operation ∀i : a_i ← b_i + c_i would be split modulo 4, so that the first pipe
operated on indices i = 4·j, the second on i = 4·j + 1, et cetera.
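In source-code form the modulo-4 split could be pictured as follows; this is a conceptual sketch only, since on real vector hardware the distribution over the pipes is done by the processor, not by the programmer.

/* Conceptual sketch of a 4-pipe split of a[i] = b[i] + c[i]:
   pipe k handles the indices i = 4*j + k. */
void fourpipe_add(int n, double *a, const double *b, const double *c) {
  for (int k = 0; k < 4; k++)            /* one ‘pipe’ per value of k */
    for (int i = k; i < n; i += 4)
      a[i] = b[i] + c[i];
}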
Exercise 1.3. Analyze the speedup and n1/2 of a processor with multiple pipelines that operate in par-
allel. That is, suppose that there are p independent pipelines, executing the same instruction,
that can each handle a stream of operands.
(The reason we are mentioning some fairly old computers here is that true pipeline supercomputers hardly exist
anymore. In the US, the Cray X1 was the last of that line, and in Japan only NEC still makes them. However, the
functional units of a CPU these days are pipelined, so the notion is still important.)
Exercise 1.4. The operation
for (i) {
x[i+1] = a[i]*x[i] + b[i];
}
can not be handled by a pipeline or SIMD processor because there is a dependency between in-
put of one iteration of the operation and the output of the previous. However, you can transform
the loop into one that is mathematically equivalent, and potentially more efficient to compute.
Derive an expression that computes x[i+2] from x[i] without involving x[i+1]. This
is known as recursive doubling. Assume you have plenty of temporary storage. You can now
perform the calculation by
• Doing some preliminary calculations;
• computing x[i],x[i+2],x[i+4],..., and from these,
• compute the missing terms x[i+1],x[i+3],....
Analyze the efficiency of this scheme by giving formulas for T0 (n) and Ts (n). Can you think
of an argument why the preliminary calculations may be of lesser importance in some circum-
stances?
• prefetching: data can be speculatively requested before any instruction needing it is actually encountered
(this is discussed further in section 1.2.5).
As clock frequency has gone up, the processor pipeline has grown in length to make the segments executable in less
time. You have already seen that longer pipelines have a larger n1/2 , so more independent instructions are needed to
make the pipeline run at full efficiency. As the limits to instruction-level parallelism are reached, making pipelines
longer (sometimes called ‘deeper’) no longer pays off. This is generally seen as the reason that chip designers have
moved to multi-core architectures as a way of more efficiently using the transistors on a chip; see section 1.3.
There is a second problem with these longer pipelines: if the code comes to a branch point (a conditional or the test
in a loop), it is not clear what the next instruction to execute is. At that point the pipeline can stall. CPUs have taken
to ‘speculative execution’, for instance by always assuming that the test will turn out true. If the code then takes the
other branch (this is called a branch misprediction), the pipeline has to be cleared and restarted. The resulting delay
in the execution stream is called the branch penalty.
Processors are often characterized in terms of how big a chunk of data they can process as a unit. This can relate to
• The width of the path between processor and memory: can a 64-bit floating point number be loaded in
one cycle, or does it arrive in pieces at the processor.
• The way memory is addressed: if addresses are limited to 16 bits, only 64K (65,536) bytes can be identified.
Early PCs had a complicated scheme with segments to get around this limitation: an address was specified
with a segment number and an offset inside the segment.
• The number of bits in a register, in particular the size of the integer registers which manipulate data
addresses; see the previous point. (Floating point registers are often larger, for instance 80 bits in the x86
architecture.) This also corresponds to the size of a chunk of data that a processor can operate on simul-
taneously.
• The size of a floating point number. If the arithmetic unit of a CPU is designed to multiply 8-byte numbers
efficiently (‘double precision’; see section 3.2) then numbers half that size (‘single precision’) can some-
times be processed at higher efficiency, and for larger numbers (‘quadruple precision’) some complicated
scheme is needed. For instance, a quad precision number could be emulated by two double precision
numbers with a fixed difference between the exponents; a sketch of this ‘double-double’ idea is given below.
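The following sketch illustrates the general idea of emulating higher precision with pairs of doubles. It uses the well-known TwoSum construction rather than any scheme specific to this book, and all names are illustrative.

#include <stdio.h>

/* A ‘double-double’ value: the unevaluated sum hi + lo, with lo much smaller. */
typedef struct { double hi, lo; } dd;

/* Error-free addition of two doubles (Knuth's TwoSum): hi + lo == a + b exactly. */
static dd two_sum(double a, double b) {
  dd r;
  r.hi = a + b;
  double v = r.hi - a;
  r.lo = (a - (r.hi - v)) + (b - v);
  return r;
}

/* Add a plain double to a double-double, renormalizing the result. */
static dd dd_add(dd x, double y) {
  dd s = two_sum(x.hi, y);
  s.lo += x.lo;
  return two_sum(s.hi, s.lo);
}

int main(void) {
  dd acc = {1.0, 0.0};
  acc = dd_add(acc, 1e-17);   /* too small to register in a single double */
  printf("hi = %.17g  lo = %.17g\n", acc.hi, acc.lo);
  return 0;
}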
These measurements are not necessarily identical. For instance, the original Pentium processor had 64-bit data
busses, but a 32-bit processor. On the other hand, the Motorola 68000 processor (of the original Apple Macintosh)
had a 32-bit CPU, but 16-bit data busses.
The first Intel microprocessor, the 4004, was a 4-bit processor in the sense that it processed 4 bit chunks. These
days, processors are 32-bit, and 64-bit is becoming more popular.
1.2 Memory Hierarchies
We will now refine the picture of the von Neumann architecture, in which data is loaded immediately from memory
to the processor, where it is operated on. This picture is unrealistic because of the so-called memory wall: the
processor is much too fast to load data this way. Specifically, a single load can take 1000 cycles, while a processor
can perform several operations per cycle. (After this long wait for a load, the next load can come faster, but still too
slow for the processor. This matter of wait time versus throughput will be addressed below in section 1.2.2.)
In reality, there will be various memory levels in between the floating point unit and the main memory: the registers
and the caches. Each of these will be faster to a degree than main memory; unfortunately, the faster the memory on
a certain level, the smaller it will be. This leads to interesting programming problems, which we will discuss in the
rest of this chapter, and particularly section 1.5.
The use of registers is the first instance you will see of measures taken to counter the fact that loading data from
memory is slow. Access to data in registers, which are built into the processor, is almost instantaneous, unlike main
memory, where hundreds of clock cycles can pass between requesting a data item, and it being available to the
processor.
One advantage of having registers is that data can be kept in them during a computation, which obviates the need
for repeated slow memory loads and stores. For example, in
s = 0;
for (i=0; i<n; i++)
s += a[i]*b[i];
the variable s only needs to be stored to main memory at the end of the loop; all intermediate values are almost
immediately overwritten. Because of this, a compiler will keep s in a register, eliminating the delay that would
result from continuously storing and loading the variable. Such questions of data reuse will be discussed in more
detail below; we will first discuss the components of the memory hierarchy.
1.2.1 Busses
The wires that move data around in a computer, from memory to cpu or to a disc controller or screen, are called
busses. The most important one for us is the Front-Side Bus (FSB) which connects the processor to memory. In one
popular architecture, this is called the ‘north bridge’, as opposed to the ‘south bridge’ which connects to external
devices, with the exception of the graphics controller.
The bus is typically much slower than the processor, operating between 500MHz and 1GHz, which is the main
reason that caches are needed.
1.2.2 Latency and bandwidth
Data transfer, whether over the bus or between memory levels, can be described by a simple model: moving n bytes takes time
T(n) = α + βn
where α is the latency and β is the inverse of the bandwidth: the time per byte.
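A minimal sketch of this model follows; the latency and bandwidth numbers are made up for illustration only, not measurements of any actual machine.

#include <stdio.h>

/* Transfer-time model T(n) = alpha + beta*n, with alpha the latency and
   beta the time per byte (the inverse of the bandwidth). */
int main(void) {
  double alpha = 200e-9;       /* 200 ns latency: illustrative only    */
  double beta  = 1.0 / 10e9;   /* 10 GB/s bandwidth: illustrative only */
  for (long n = 8; n <= 8L<<20; n *= 8) {
    double t = alpha + beta * n;
    printf("n = %8ld bytes  T = %.3g s  effective bandwidth = %.3g bytes/s\n",
           n, t, n / t);
  }
  return 0;   /* for small n the latency dominates; for large n, the bandwidth */
}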
Typically, the further away from the processor one gets, the longer the latency is, and the lower the bandwidth.
These two factors make it important to program in such a way that, if at all possible, the processor uses data from
cache or register, rather than from main memory. To illustrate that this is a serious matter, consider a vector addition
for (i)
a[i] = b[i]+c[i]
Each iteration performs one floating point operation, which modern CPUs can do in one clock cycle by using
pipelines. However, each iteration needs two numbers loaded and one written, for a total of 24 bytes4 of memory
traffic. Typical memory bandwidth figures (see for instance figure 1.2) are nowhere near 32 bytes per cycle. This
means that, without caches, algorithm performance can be bounded by memory performance.
The concepts of latency and bandwidth will also appear in parallel computers, when we talk about sending data
from one processor to the next.
1.2.3 Registers
The registers are the memory level that is closest to the processor: any operation acts on data in register and leaves
its result in register. Programs written in assembly language explicitly use these registers:
addl %eax, %edx
which is an instruction to add the content of one register to another. As you see in this sample instruction, registers
are not numbered in memory, but have distinct names that are referred to in the assembly instruction.
Registers have a high bandwidth and low latency, because they are part of the processor. That also makes them a
very scarce resource.
1.2.4 Caches
In between the registers, which contain the data that the processor operates on, and the main memory where lots of
data can reside for a long time, are various levels of cache memory, that have lower latency and higher bandwidth
than main memory. Data from memory travels the cache hierarchy to wind up in registers. The advantage to having
cache memory is that if a data item is reused shortly after it was first needed, it will still be in cache, and therefore
it can be accessed much faster than if it would have to be brought in from memory.
The caches are called ‘level 1’ and ‘level 2’ (or, for short, L1 and L2) cache; some processors can have an L3 cache.
The L1 and L2 caches are part of the die, the processor chip, although for the L2 cache that is a recent development;
the L3 cache is off-chip. The L1 cache is small, typically around 16Kbyte. Level 2 (and, when present, level 3)
cache is more plentiful, up to several megabytes, but it is also slower. Unlike main memory, which is expandable,
caches are fixed in size. If a version of a processor chip exists with a larger cache, it is usually considerably more
expensive.
4. Actually, a[i] is loaded before it can be written, so there are 4 memory accesses, with a total of 32 bytes, per iteration.
Data that is needed in some operation gets copied into the various caches on its way to the processor. If, some
instructions later, a data item is needed again, it is first searched for in the L1 cache; if it is not found there, it is
searched for in the L2 cache; if it is not found there, it is loaded from main memory. Finding data in cache is called
a cache hit, and not finding it a cache miss.
Figure 1.2 illustrates the basic facts of caches, in this case for the AMD Opteron chip: the closer caches are to the
floating point units, the faster, but also the smaller, they are. Some points about this figure.
• Loading data from registers is so fast that it does not constitute a limitation on algorithm execution speed.
On the other hand, there are few registers. The Opteron5 has 16 general purpose registers, 8 media and
floating point registers, and 16 SIMD registers.
• The L1 cache is small, but sustains a bandwidth of 32 bytes, that is 4 double precision numbers, per cycle.
This is enough to load two operands each for two operations, but note that the Opteron can actually
perform 4 operations per cycle. Thus, to achieve peak speed, certain operands need to stay in register.
The latency from L1 cache is around 3 cycles.
• The bandwidth from L2 and L3 cache is not documented and hard to measure due to cache policies (see
below). Latencies are around 15 cycles for L2 and 50 for L3.
• Main memory access has a latency of more than 100 cycles, and a bandwidth of 4.5 bytes per cycle, which
is about 1/7th of the L1 bandwidth. However, this bandwidth is shared by the 4 cores of the Opteron chip,
5. Specifically the server chip used in the Ranger supercomputer; desktop versions may have different specifications.
so effectively the bandwidth is a quarter of this number. In a machine like Ranger, which has 4 chips per
node, some bandwidth is spent on maintaining cache coherence (see section 1.3) reducing the bandwidth
for each chip again by half.
On level 1, there are separate caches for instructions and data; the L2 and L3 cache contain both data and instruc-
tions.
You see that the larger caches are increasingly unable to supply data to the processors fast enough. For this reason
it is necessary to code in such a way that data is kept as much as possible in the highest cache level possible. We
will discuss this issue in detail in the rest of this chapter.
Exercise 1.5. The L1 cache is smaller than the L2 cache, and if there is an L3, the L2 is smaller than
the L3. Give a practical and a theoretical reason why this is so.
Consider, as a simplified example, a direct mapped cache of 64K bytes together with 32-bit memory addresses. Direct mapping then takes from each memory address the last (‘least significant’) 16 bits, and uses these
as the address of the data item in cache.
Direct mapping is very efficient because its address calculations can be performed very quickly, leading to low
latency, but it has a problem in practical applications. If two items are addressed that are separated by 8K words,
they will be mapped to the same cache location, which will make certain calculations inefficient. Example:
double a[3][8192]; int i;
for (i=0; i<512; i++)
a[2][i] = ( a[0][i]+a[1][i] )/2.;
Here, the locations of a[0][i], a[1][i], and a[2][i] are 8K from each other for every i, so the last 16 bits
of their addresses will be the same, and hence they will be mapped to the same location in cache. The execution of
the loop will now go as follows:
• The data at a[0][0] is brought into cache and register. This engenders a certain amount of latency.
Together with this element, a whole cache line is transferred.
• The data at a[1][0] is brought into cache (and register, as we will not remark anymore from now on),
together with its whole cache line, at cost of some latency. Since this cache line is mapped to the same
location as the first, the first cache line is overwritten.
• In order to write the output, the cache line containing a[2][0] is brought into cache. This is again
mapped to the same location, causing flushing of the cache line just loaded for a[1][0].
• In the next iteration, a[0][1] is needed, which is on the same cache line as a[0][0]. However, this
cache line has been flushed, so it needs to be brought in anew from main memory or a deeper cache level.
In doing so, it overwrites the cache line that holds a[1][0].
• A similar story holds for a[1][1]: it is on the cache line of a[1][0], which unfortunately has been
overwritten in the previous step.
If a cache line holds four words, we see that each four iterations of the loop involve eight transfers of elements of a,
where two would have sufficed, if it were not for the cache conflicts.
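A common remedy, not discussed in the text above, is to pad the array so that logically corresponding elements no longer map to the same cache location; the padding of 8 doubles (one cache line) is an illustrative choice.

/* Padded version of the earlier example: rows of a are now 8K+8 words apart,
   so a[0][i], a[1][i], and a[2][i] map to different cache locations. */
#define PAD 8
double a[3][8192+PAD];

void average_rows(void) {
  for (int i = 0; i < 512; i++)
    a[2][i] = ( a[0][i] + a[1][i] ) / 2.;
}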
Exercise 1.7. In the example of direct mapped caches, mapping from memory to cache was done by
using the final 16 bits of a 32 bit memory address as cache address. Show that the problems in
this example go away if the mapping is done by using the first (‘most significant’) 16 bits as
the cache address. Why is this not a good solution in general?
In a k-way associative cache, up to k cache lines that map to the same location can be kept simultaneously, so data is not
flushed prematurely as in the above example. In that example, a value of k = 2 would suffice, but in practice higher
values are often encountered.
For instance, the Intel Woodcrest processor has
• an L1 cache of 32K bytes, that is 8-way set associative with a 64 byte cache line size;
• an L2 cache of 4M bytes, that is 8-way set associative with a 64 byte cache line size.
On the other hand, the AMD Barcelona chip has 2-way associativity for the L1 cache, and 8-way for the L2.
A higher associativity (‘way-ness’) is obviously desirable, but makes a processor slower, since determining whether
an address is already in cache becomes more complicated. For this reason, the associativity of the L1 cache, where
speed is of the greatest importance, is typically lower than that of the L2.
Exercise 1.8. Write a small cache simulator in your favourite language. Assume a k-way associative
cache of 32 entries and an architecture with 16 bit addresses. Run the following experiment for
k = 1, 2, 4:
1. Let k be the associativity of the simulated cache.
2. Write the translation from 16 bit address to 32/2k bit cache address.
3. Generate 32 random machine addresses, and simulate storing them in cache.
Since the cache has 32 entries, optimally the 32 addresses can all be stored in cache. The
chance of this actually happening is small, and often the data of one address will be removed
from the cache when it conflicts with another address. Record how many addresses, out of 32,
are actually stored in the cache. Do step 3 100 times, and plot the results; give median and
average value, and the standard deviation. Observe that increasing the associativity improves
the number of addresses stored. What is the limit behaviour? (For bonus points, do a formal
statistical analysis.)
Since prefetch is controlled by the hardware, it is also described as hardware prefetch . Prefetch streams can some-
times be controlled from software, though often it takes assembly code to do so.
The Translation Look-aside Buffer (TLB) keeps a small list of recently used memory pages and their locations, so that data on these pages can be addressed without repeating the full address translation
process. The case where the page is not remembered in the TLB is called a TLB miss, and the page lookup table
is then consulted, if necessary bringing the needed page into memory. The TLB is (sometimes fully) associative
(section 1.2.4.6), using an LRU policy (section 1.2.4.2).
A typical TLB has between 64 and 512 entries. If a program accesses data sequentially, it will typically alternate
between just a few pages, and there will be no TLB misses. On the other hand, a program that accesses many random
memory locations can experience a slowdown because of such misses.
Section 1.5.4 and appendix D.5 discuss some simple code illustrating the behaviour of the TLB.
[There are some complications to this story. For instance, there is usually more than one TLB. The first one is
associated with the L2 cache, the second one with the L1. In the AMD Opteron, the L1 TLB has 48 entries, and
is fully (48-way) associative, while the L2 TLB has 512 entries, but is only 4-way associative. This means that
there can actually be TLB conflicts. In the discussion above, we have only talked about the L2 TLB. The reason
that this can be associated with the L2 cache, rather than with main memory, is that the translation from memory to
L2 cache is deterministic.]
6. Another solution is Intel’s Hyperthreading, which lets a processor mix the instructions of several instruction streams. The benefits of
this are strongly dependent on the individual case. However, this same mechanism is exploited with great success in GPUs; see section 2.8.
Figure 1.3: Projected heat dissipation of a CPU if trends had continued – this graph courtesy Pat Gelsinger
Instead of pushing the clock frequency further, chip designers have turned to putting multiple processor cores on one chip, each core being in effect a full processor with its own arithmetic/logic unit and having its own registers. Currently, CPUs with 4 or 6 cores are on the market and 8-core chips will
be available shortly. The core count is likely to go up in the future: Intel has already shown an 80-core prototype
that was developed into the 48-core ‘Single-chip Cloud Computer’, illustrated in figure 1.4. This chip has a structure with
24 dual-core ‘tiles’ that are connected through a 2D mesh network. Only certain tiles are connected to a memory
controller, others can not reach memory other than through the on-chip network.
With this mix of shared and private caches, the programming model for multi-core processors is becoming a hybrid
between shared and distributed memory:
Core The cores have their own private L1 cache, which is a sort of distributed memory. The above mentioned
Intel 80-core prototype has the cores communicating in a distributed memory fashion.
Socket On one socket, there is often a shared L2 cache, which is shared memory for the cores.
Node There can be multiple sockets on a single ‘node’ or motherboard, accessing the same shared memory.
Network Distributed memory programming (see the next chapter) is needed to let nodes communicate.
As long as there is only one copy of each data item, in main memory or in a single shared cache, conflicts are in fact impossible. However, processors typically have some private cache, which contains copies of
data from memory, so conflicting copies can occur. This situation arises in particular in multi-core designs.
Suppose that two cores have a copy of the same data item in their (private) L1 cache, and one modifies its copy.
Now the other has cached data that is no longer an accurate copy of its counterpart, so it needs to reload that item.
This will slow down the computation, and it wastes bandwidth to the core that could otherwise be used for loading
or storing operands.
The state of a cache line with respect to a data item in main memory is usually described as one of the following:
Scratch: the cache line does not contain a copy of the item;
Valid: the cache line is a correct copy of data in main memory;
Reserved: the cache line is the only copy of that piece of data;
Dirty: the cache line has been modified, but not yet written back to main memory;
Invalid: the data on the cache line is also present on other processors (it is not reserved ), and another process has
modified its copy of the data.
Exercise 1.9. Consider two processors, a data item x in memory, and cachelines x1 ,x2 in the private
caches of the two processors to which x is mapped. Describe the transitions between the states
of x1 and x2 under reads and writes of x on the two processors. Also indicate which actions
cause memory bandwidth to be used. (This list of transitions is a Finite State Automaton (FSA);
see section A.3.)
As the previous sections have shown, data transfer is costly, so it should be minimized by programming in such a way that data stays as close to the processor as possible.
Partly this is a matter of programming cleverly, but we can also look at the theoretical question: does the algorithm
allow for it to begin with?
Consider for example the vector addition
∀i : xi ← xi + yi .
This involves three memory accesses (two loads and one store) and one operation per iteration, giving a data reuse
of 1/3. The axpy (for ‘a times x plus y’) operation
∀i : xi ← xi + a · yi
has two operations, but the same number of memory accesses, since the one-time load of a is amortized. It is therefore
more efficient than the simple addition, with a reuse of 2/3.
The inner product calculation
∀i : s ← s + xi · yi
is similar in structure to the axpy operation, involving one multiplication and addition per iteration, on two vectors
and one scalar. However, now there are only two load operations, since s can be kept in register and only written
back to memory at the end of the loop. The reuse here is 1.
The matrix-matrix product C = A·B of two n×n matrices involves 3n² data items and 2n³ operations, which is of a higher order. The data reuse is O(n), meaning that
every data item will be used O(n) times. This has the implication that, with suitable programming, this operation
has the potential of overcoming the bandwidth/clock speed gap by keeping data in fast cache memory.
Exercise 1.10. The matrix-matrix product, considered as operation, clearly has data reuse by the above
definition. Argue that this reuse is not trivially attained by a simple implementation. What
determines whether the naive implementation has reuse of data that is in cache?
[In this discussion we were only concerned with the number of operations of a given implementation, not the
mathematical operation. For instance, there are ways of performing the matrix-matrix multiplication and Gaussian
elimination algorithms in fewer than O(n³) operations [78, 70]. However, this requires a different implementation,
which has its own analysis in terms of memory access and reuse.]
The matrix-matrix product is the heart of the ‘LINPACK benchmark’ [29]. The benchmark may give an optimistic
view of the performance of a computer: the matrix-matrix product is an operation that has considerable data reuse,
so it is relatively insensitive to memory bandwidth and, for parallel computers, properties of the network. Typically,
computers will attain 60–90% of their peak performance on the Linpack benchmark. Other benchmarks may give
considerably lower figures.
1.4.3 Locality
Since using data in cache is cheaper than getting data from main memory, a programmer obviously wants to code in
such a way that data in cache is reused. While placing data in cache is not under explicit programmer control, even
from assembly language, in most CPUs7 , it is still possible, knowing the behaviour of the caches, to know what
data is in cache, and to some extent to control it.
The two crucial concepts here are temporal locality and spatial locality. Temporal locality is the easiest to explain:
this describes the use of a data element within a short time of its last use. Since most caches have a LRU replacement
policy (section 1.2.4.2), if in between the two references less data has been referenced than the cache size, the
element will still be in cache and therefore quickly accessible. With other replacement policies, such as random
replacement, this guarantee can not be made.
As an example, consider the repeated use of a long vector:
for (loop=0; loop<10; loop++) {
for (i=0; i<N; i++) {
... = ... x[i] ...
}
}
Each element of x will be used 10 times, but if the vector (plus other data accessed) exceeds the cache size,
each element will be flushed before its next use. Therefore, the use of x[i] does not exhibit temporal locality:
subsequent uses are spaced too far apart in time for it to remain in cache.
If the structure of the computation allows us to exchange the loops:
for (i=0; i<N; i++) {
for (loop=0; loop<10; loop++) {
... = ... x[i] ...
}
}
the accesses to x[i] are bunched together in time, so x[i] can be kept in register or cache between uses, and temporal locality is exploited.
7. Low level memory access can be controlled by the programmer in the Cell processor and in some GPUs.
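For reference, here is a sketch of the two natural loop orderings of the product C ← C + A·B, assuming row-major storage; the ‘first’ and ‘second’ implementation discussed below can be read as these two variants. The orderings are inferred from the discussion, not taken from the original listing.

/* Two loop orderings of C <- C + A*B for n-by-n matrices, row-major storage.
   In the first, c[i][j] is invariant in the inner loop; in the second,
   a[i][k] is invariant in the inner loop. */
void matmul_first(int n, double **a, double **b, double **c) {
  for (int i = 0; i < n; i++)
    for (int j = 0; j < n; j++)
      for (int k = 0; k < n; k++)
        c[i][j] = c[i][j] + a[i][k] * b[k][j];
}
void matmul_second(int n, double **a, double **b, double **c) {
  for (int i = 0; i < n; i++)
    for (int k = 0; k < n; k++)
      for (int j = 0; j < n; j++)
        c[i][j] = c[i][j] + a[i][k] * b[k][j];
}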
Our first observation is that both implementations indeed compute C ← C + A · B, and that they both take roughly
2n3 operations. However, their memory behaviour, including spatial and temporal locality is very different.
c[i,j] In the first implementation, c[i,j] is invariant in the inner iteration, which constitutes temporal locality,
so it can be kept in register. As a result, each element of C will be loaded and stored only once.
In the second implementation, c[i,j] will be loaded and stored in each inner iteration. In particular,
this implies that there are now n3 store operations, a factor of n more than the first implementation.
a[i,k] In both implementations, a[i,k] elements are accessed by rows, so there is good spatial locality, as
each loaded cacheline will be used entirely. In the second implementation, a[i,k] is invariant in the
inner loop, which constitutes temporal locality; it can be kept in register. As a result, in the second case
A will be loaded only once, as opposed to n times in the first case.
b[k,j] The two implementations differ greatly in how they access the matrix B. First of all, b[k,j] is never
invariant so it will not be kept in register, and B engenders n³ memory loads in both cases. However, the
access patterns differ.
In the second case, b[k,j] is accessed by rows so there is good spatial locality: cachelines will be fully
utilized after they are loaded.
In the first implementation, b[k,j] is accessed by columns. Because of the row storage of the matri-
ces, a cacheline contains a part of a row, so for each cacheline loaded, only one element is used in the
columnwise traversal. This means that the first implementation has more loads for B by a factor of the
cacheline length.
Note that we are not making any absolute predictions on code performance for these implementations, or even
relative comparison of their runtimes. Such predictions are very hard to make. However, the above discussion
identifies issues that are relevant for a wide range of classical CPUs.
Exercise 1.11. Consider the following pseudocode of an algorithm for summing n numbers x[i] where
n is a power of 2:
for s=2,4,8,...,n/2,n:
for i=0 to n-1 with steps s:
x[i] = x[i] + x[i+s/2]
sum = x[0]
Analyze the spatial and temporal locality of this algorithm, and contrast it with the standard
algorithm
sum = 0
for i=0,1,2,...,n-1
sum = sum + x[i]
1.5.1 Pipelining
In section 1.1.1.1 you learned that the floating point units in a modern CPU are pipelined, and that pipelines require
a number of independent operations to function efficiently. The typical pipelineable operation is a vector addition;
an example of an operation that can not be pipelined is the inner product accumulation
for (i=0; i<N; i++)
  s += a[i]*b[i];
since each addition into s has to wait for the result of the previous one. One way to fill the pipeline is to unroll the loop, computing two partial sums independently:
for (i = 0; i < N/2-1; i ++) {
  sum1 += *(a + 0) * *(b + 0);
  sum2 += *(a + 1) * *(b + 1);
  a += 2; b += 2;
}
A first observation about this code is that we are implicitly using associativity and commutativity of addition: while
the same quantities are added, they are now in effect added in a different order. As you will see in chapter 3, in
computer arithmetic this is not guaranteed to give the exact same result.
In a further optimization, we disentangle the addition and multiplication part of each instruction. The hope is that
while the accumulation is waiting for the result of the multiplication, the intervening instructions will keep the
processor busy, in effect increasing the number of operations per second.
for (i = 0; i < N/2-1; i ++) {
temp1 = *(a + 0) * *(b + 0);
temp2 = *(a + 1) * *(b + 1);
sum1 += temp1; sum2 += temp2;
a += 2; b += 2;
}
Finally, we realize that the furthest we can move the addition away from the multiplication, is to put it right in front
of the multiplication of the next iteration:
for (i = 0; i < N/2-1; i ++) {
sum1 += temp1;
temp1 = *(a + 0) * *(b + 0);
sum2 += temp2;
temp2 = *(a + 1) * *(b + 1);
a += 2; b += 2;
}
s = temp1 + temp2;
Of course, we can unroll the operation by more than a factor of two. While we expect an increased performance,
large unroll factors need large numbers of registers. If a loop needs more registers than the CPU has available, intermediate results have to be written to memory; this is called register
spill, and it will decrease performance.
Another thing to keep in mind is that the total number of operations is unlikely to be divisible by the unroll factor.
This requires cleanup code after the loop to account for the final iterations. Thus, unrolled code is harder to write
than straight code, and people have written tools to perform such source-to-source transformations automatically.
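As an illustration of such cleanup code, here is a sketch (my own, not the code that was timed for table 1.2) of an inner product unrolled by two, with a remainder loop for odd N:

double inner_product(int N, const double *a, const double *b) {
  double sum1 = 0., sum2 = 0.;
  int i;
  for (i = 0; i+1 < N; i += 2) {   /* main unrolled loop */
    sum1 += a[i]   * b[i];
    sum2 += a[i+1] * b[i+1];
  }
  for ( ; i < N; i++)              /* cleanup: at most one leftover iteration */
    sum1 += a[i] * b[i];
  return sum1 + sum2;
}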
Cycle times for unrolling the inner product operation up to six times are given in table 1.2. Note that the timings
do not show a monotone behaviour at the unrolling by four. This sort of variation is due to various memory-related
factors.
unroll factor:    1     2    3    4    5    6
cycles:        6794   507  340  359  334  528
Table 1.2: Cycle times for the inner product operation, unrolled up to six times
Figure 1.5: Average cycle count per operation as function of the dataset size (plotted together with the cache miss fraction)
The loop over the data can be restructured so that all the repeated operations on one cache-sized block of the array are performed before moving on to the next block, assuming that the L1 size divides evenly into the dataset size. This strategy is called cache blocking or blocking for
cache reuse.
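A sketch of such a blocked loop for the repeated array sweep discussed above; the block size L1SIZE (in doubles) and the variable names are illustrative, and we assume, as in the text, that it divides N evenly.

#include <math.h>

#define L1SIZE 4096
void blocked_sweeps(int N, int NX, double *a) {
  for (int blk = 0; blk < N; blk += L1SIZE)    /* one cache-sized block at a time */
    for (int x = 0; x < NX; x++)               /* all NX sweeps reuse the block   */
      for (int i = blk; i < blk + L1SIZE; i++)
        a[i] = sqrt(a[i]);
}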
1.5.4 TLB
As explained in section 1.2.7, the Translation Look-aside Buffer (TLB) maintains a list of currently in use memory
pages, and addressing data that is located on one of these pages is much faster than data that is not. Consequently,
one wants to code in such a way that the number of pages in use is kept low.
[Figure: cache line utilization and total kcycles as a function of the stride.]
Consider code for traversing the elements of a two-dimensional array in two different ways.
#define INDEX(i,j,m,n) i+j*m
array = (double*) malloc(m*n*sizeof(double));
/* traversal #1 */
for (j=0; j<n; j++)
for (i=0; i<m; i++)
array[INDEX(i,j,m,n)] = array[INDEX(i,j,m,n)]+1;
/* traversal #2 */
for (i=0; i<m; i++)
for (j=0; j<n; j++)
array[INDEX(i,j,m,n)] = array[INDEX(i,j,m,n)]+1;
The results (see Appendix D.5 for the source code) are plotted in figures 1.8 and 1.7.
Using m = 1000 means that, on the AMD Opteron, which has pages of 512 doubles, we need roughly two pages for each column.
We run this example, plotting the number of ‘TLB misses’, that is, the number of times a page is referenced that is not
recorded in the TLB.
1. In the first traversal this is indeed what happens. After we touch an element, and the TLB records the page
it is on, all other elements on that page are used subsequently, so no further TLB misses occur. Figure 1.8
shows that, with increasing n, the number of TLB misses per column is roughly two.
Figure 1.7: Number of TLB misses per column as function of the number of columns; columnwise traversal of the
array.
Figure 1.8: Number of TLB misses per column as function of the number of columns; rowwise traversal of the
array.
2. In the second traversal, we touch a new page for every element of the first row. Elements of the second
row will be on these pages, so, as long as the number of columns is less than the number of TLB entries,
these pages will still be recorded in the TLB. As the number of columns grows, the number of TLB misses
increases, and ultimately there will be one TLB miss for each element access. Figure 1.7 shows that, with
a large enough number of columns, the number of TLB misses per column is equal to the number of
elements per column.
Another example of cache effects comes from cache associativity. Consider adding a small number m of vectors together: ∀j : y_j ← y_j + x_{i,j} for i = 1, . . . , m. If the length of the vectors y, x_i is precisely the right (or rather, wrong) number, y_j and x_{i,j} will all be mapped to
the same location in cache. As an example we take the AMD Opteron, which has an L1 cache of 64K bytes, and which is
two-way set associative. Because of the set associativity, the cache can handle two addresses being mapped to the
same cache location, but not three or more. Thus, we let the vectors be of size n = 4096 doubles, and we measure
the effect in cache misses and cycles of letting m = 1, 2, . . ..
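One plausible form of the loop being measured is sketched below; the array layout and the names are my assumption, not the actual benchmark code.

void accumulate(int m, int n, double *y, const double *x) {
  /* x holds m vectors of length n stored one after the other; vector x_i
     occupies x[i*n] .. x[i*n+n-1]. With n = 4096, elements with the same j
     are 32K bytes apart and map to the same set of a two-way associative
     64K L1 cache; with n = 4096+8 they do not. */
  for (int j = 0; j < n; j++)
    for (int i = 0; i < m; i++)
      y[j] += x[(long)i*n + j];
}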
First of all, we note that we use the vectors sequentially, so, with a cacheline of eight doubles, we should ideally
see a cache miss rate of 1/8 times the number of vectors m. Instead, in figure 1.9 we see a rate approximately
proportional to m, meaning that indeed cache lines are evicted immediately. The exception here is the case m = 1,
where the two-way associativity allows the cachelines of two vectors to stay in cache.
Compare this to figure 1.10, where we used a slightly longer vector length, so that locations with the same j are no
longer mapped to the same cache location. As a result, we see a cache miss rate around 1/8, and a smaller number
of cycles, corresponding to a complete reuse of the cache lines.
Two remarks: the cache miss numbers are in fact lower than the theory predicts, since the processor will use
prefetch streams. Secondly, in figure 1.10 we see a decreasing time with increasing m; this is probably due to a
progressively more favourable balance between load and store operations. Store operations are more expensive than
loads, for various reasons.
Figure 1.9: The number of L1 cache misses and the number of cycles for each j column accumulation, vector
length 4096
Figure 1.10: The number of L1 cache misses and the number of cycles for each j column accumulation, vector
length 4096 + 8
Consider now the matrix-vector product
∀i,j : y_i ← y_i + a_{ij} · x_j
This involves 2n² operations on n² + 2n data items, so reuse is O(1): memory accesses and operations are of the
same order. However, we note that there is a double loop involved, and the x, y vectors have only a single index, so
each element in them is used multiple times.
Exploiting this theoretical reuse is not trivial. In
/* variant 1 */
for (i)
for (j)
y[i] = y[i] + a[i][j] * x[j];
the element y[i] seems to be reused. However, the statement as given here would write y[i] to memory in every
inner iteration, and we have to write the loop as
/* variant 2 */
for (i) {
s = 0;
for (j)
s = s + a[i][j] * x[j];
y[i] = s;
}
to ensure reuse. This variant uses 2n² loads and n stores.
This code fragment only exploits the reuse of y explicitly. If the cache is too small to hold the whole vector x plus
a column of a, each element of x is still repeatedly loaded in every outer iteration.
Reversing the loops as
/* variant 3 */
for (j)
for (i)
y[i] = y[i] + a[i][j] * x[j];
exposes the reuse of x, especially if we write this as
/* variant 3 */
for (j) {
t = x[j];
for (i)
y[i] = y[i] + a[i][j] * t;
}
but now y is no longer reused. Moreover, we now have 2n² + n loads, comparable to variant 2, but n² stores, which
is of a higher order.
It is possible to get reuse both of x and y, but this requires more sophisticated programming. The key here is to split
the loops into blocks. For instance:
for (i=0; i<M; i+=2) {
s1 = s2 = 0;
for (j) {
s1 = s1 + a[i][j] * x[j];
s2 = s2 + a[i+1][j] * x[j];
}
y[i] = s1; y[i+1] = s2;
}
This is also called loop unrolling, loop tiling, or strip mining. The amount by which you unroll loops is determined
by the number of available registers.
Figure 1.11: Performance of naive and optimized implementations of the Discrete Fourier Transform
Figures 1.11 and 1.12 show that there can be a wide discrepancy between the performance of naive implementations
of an operation (sometimes called the ‘reference implementation’), and optimized implementations.
Figure 1.12: Performance of naive and optimized implementations of the matrix-matrix product
Unfortunately, optimized implementations are not simple to find. For one, since they rely on blocking, their loop nests are
double the normal depth: the matrix-matrix multiplication becomes a six-deep loop. Then, the optimal block size is
dependent on factors like the target architecture.
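A sketch of what such a six-deep blocked loop nest can look like; the block size BS is a tunable parameter chosen arbitrarily here, and this is not the optimized code whose performance is shown in figure 1.12.

#define BS 64   /* block size: a tuning parameter, chosen arbitrarily here */
void blocked_matmul(int n, const double *A, const double *B, double *C) {
  /* C <- C + A*B on n-by-n row-major matrices, computed block by block so
     that BS-by-BS tiles of A, B, and C can stay in cache while they are used. */
  for (int ii = 0; ii < n; ii += BS)
    for (int jj = 0; jj < n; jj += BS)
      for (int kk = 0; kk < n; kk += BS)
        for (int i = ii; i < ii+BS && i < n; i++)
          for (int j = jj; j < jj+BS && j < n; j++)
            for (int k = kk; k < kk+BS && k < n; k++)
              C[i*n+j] += A[i*n+k] * B[k*n+j];
}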
We make the following observations:
• Compilers are not able to extract anywhere close to optimal performance8 .
• There are autotuning projects for automatic generation of implementations that are tuned to the architec-
ture. This approach can be moderately to very successful. Some of the best known of these projects are
Atlas [84] for Blas kernels, and Spiral [73] for transforms.
8. Presenting a compiler with the reference implementation may still lead to high performance, since some compilers are trained to
recognize this operation. They will then forego translation and simply replace it by an optimized variant.
9. We are conveniently ignoring matters of set-associativity here, and basically assuming a fully associative cache.
A loop that executes a fixed operation on every element of an array, such as
for (i=0; i<N; i++)
  a[i] = sqrt(a[i]);
will take time linear in N up to the point where a fills the cache. An easier way to picture this is to compute a
normalized time, essentially a time per execution of the inner loop:
t = time();
for (x=0; x<NX; x++)
for (i=0; i<N; i++)
a[i] = sqrt(a[i]);
t = time()-t;
t_normalized = t/(N*NX);
The normalized time will be constant until the array a fills the cache, then increase and eventually level off again.
The explanation is that, as long as a[0]...a[N-1] fit in L1 cache, the inner loop will use data from the L1 cache.
Speed of access is then determined by the latency and bandwidth of the L1 cache. As the amount of data grows
beyond the L1 cache size, some or all of the data will be flushed from the L1, and performance will be determined
by the characteristics of the L2 cache. Letting the amount of data grow even further, performance will again drop to
a linear behaviour determined by the bandwidth from main memory.
2 Parallel Computer Architecture
The largest and most powerful computers are sometimes called ‘supercomputers’. For the last few decades, this has,
without exception, referred to parallel computers: machines with more than one CPU that can be set to work on the
same problem.
Parallelism is hard to define precisely, since it can appear on several levels. In the previous chapter you already saw
how inside a CPU several instructions can be ‘in flight’ simultaneously. This is called instruction-level parallelism,
and it is outside explicit user control: it derives from the compiler and the CPU deciding which instructions, out of
a single instruction stream, can be processed simultaneously. At the other extreme is the sort of parallelism where
more than one instruction stream is handled by multiple processors, often each on their own circuit board. This type
of parallelism is typically explicitly scheduled by the user.
In this chapter, we will analyze this more explicit type of parallelism, the hardware that supports it, the programming
that enables it, and the concepts that analyze it.
For further reading, a good introduction to parallel computers and parallel programming is Wilkinson and Allen [85].
2.1 Introduction
In scientific codes, there is often a large amount of work to be done, and it is often regular to some extent, with the
same operation being performed on many data. The question is then whether this work can be sped up by use of a
parallel computer. If there are n operations to be done, and they would take time t on a single processor, can they
be done in time t/p on p processors?
Let us start with a very simple example. Adding two vectors of length n
for (i=0; i<n; i++)
x[i] += y[i]
can be done with up to n processors, and execution time is linearly reduced with the number of processors. If each
operation takes a unit time, the original algorithm takes time n, and the parallel execution on p processors n/p. The
parallel algorithm is faster by a factor of p.1
1. We ignore minor inaccuracies in this result when p does not evenly divide n.
2.2.1 SIMD
Parallel computers of the SIMD type apply the same operation simultaneously to a number of data items. The
design of the CPUs of such a computer can be quite simple, since the arithmetic unit does not need separate logic
and instruction decoding units: all CPUs execute the same operation in lock step. This makes SIMD computers
excel at operations on arrays, such as
for (i=0; i<N; i++) a[i] = b[i]+c[i];
and, for this reason, they are also often called array processors. Scientific codes can often be written so that a large
fraction of the time is spent in array operations.
On the other hand, there are operations that can not be executed efficiently on an array processor. For instance,
evaluating a number of terms of a recurrence x_{i+1} = a x_i + b_i involves that many additions and multiplications, but
they alternate, so only one operation of each type can be processed at any one time. There are no arrays of numbers
here that are simultaneously the input of an addition or multiplication.
In order to allow for different instruction streams on different parts of the data, the processor would have a ‘mask
bit’ that could be set to prevent execution of instructions. In code, this typically looks like
where (x>0) {
  x[i] = sqrt(x[i])
}
The programming model where identical operations are applied to a number of data items simultaneously, is known
as data parallelism.
Such array operations can occur in the context of physics simulations, but another important source is graphics
applications. For this application, the processors in an array processor can be much weaker than the processor in a
PC: often they are in fact bit processors, capable of operating on only a single bit at a time. Along these lines, ICL
had the 4096 processor DAP [56] in the 1980s, and Goodyear built a 16K processor MPP [14] in the 1970s.
Later, the Connection Machines (CM-1, CM-2, CM-5) were quite popular. While the first Connection Machine had
bit processors (16 to a chip), the later models had traditional processors capable of floating point arithmetic, and
were not true SIMD architectures. All were based on a hyper-cube interconnection network; see section 2.6.4.
Another manufacturer that had a commercially successful array processor was MasPar.
Supercomputers based on array processing do not exist anymore, but the notion of SIMD lives on in various guises,
which we will now discuss.
2.2.1.1 Pipelining
A number of computers have been based on a vector processor or pipeline processor design. The first commer-
cially successful supercomputers, the Cray-1 and the Cyber-205 were of this type. In recent times, the Cray-X1
and the NEC SX series have featured vector pipes. The ‘Earth Simulator’ computer [75], which led the TOP500
(section 2.11) for 3 years, was based on NEC SX processors. The general idea behind pipelining was described in
section 1.1.1.1.
While supercomputers based on pipeline processors are in a distinct minority, pipelining is now mainstream in the
superscalar CPUs that are the basis for clusters. A typical CPU has pipelined floating point units, often with separate
units for addition and multiplication; see the previous chapter.
However, there are some important differences between pipelining in a modern superscalar CPU and in, more old-
fashioned, vector units. The pipeline units in these vector computers are not integrated floating point units in the
CPU, but can better be considered as attached vector units to a CPU that itself has a floating point unit. The vector
unit has vector registers2 with a typical length of 64 floating point numbers; there is typically no ‘vector cache’.
The logic in vector units is also simpler, often addressable by explicit vector instructions. Superscalar CPUs, on the
other hand, are more complicated and geared towards exploiting data streams in unstructured code.
True SIMD array processing can be found in modern CPUs and GPUs, in both cases inspired by the parallelism
that is needed in graphics applications.
Modern CPUs from Intel and AMD, as well as PowerPC chips, have instructions that can perform multiple instances
of an operation simultaneously. On Intel processors this is known as SSE: Streaming SIMD Extensions. These
extensions were originally intended for graphics processing, where often the same operation needs to be performed
on a large number of pixels. Typically, the data has a fixed total width of, say, 128 bits, which can be divided into two 64-bit
reals, four 32-bit reals, or a larger number of even smaller chunks, down to chunks of 4 bits.
Current compilers can generate SSE instructions automatically; sometimes it is also possible for the user to insert
pragmas, for instance with the Intel compiler:
void func(float *restrict c, float *restrict a,
          float *restrict b, int n)
{
#pragma vector always
  for (int i=0; i<n; i++)
    c[i] = a[i] * b[i];
}
Use of these extensions often requires data to be aligned with cache line boundaries (section 1.2.4.3), so there are
special allocate and free calls that return aligned memory.
For a nontrivial example, see figure 2.1, which describes complex multiplication using SSE3.
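As a simpler illustration than that figure, here is a minimal sketch that uses explicit SSE intrinsics to add two arrays of floats, four elements per instruction; the function name is made up, and the arrays are assumed to be 16-byte aligned with a length divisible by four:
#include <xmmintrin.h>   /* SSE intrinsics */

/* c[i] = a[i] + b[i], processing four floats at a time. */
void add4(float *c, const float *a, const float *b, int n) {
  for (int i = 0; i < n; i += 4) {
    __m128 va = _mm_load_ps(a + i);          /* load four aligned floats */
    __m128 vb = _mm_load_ps(b + i);
    _mm_store_ps(c + i, _mm_add_ps(va, vb)); /* four additions at once   */
  }
}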
Array processing on a larger scale can be found in Graphics Processing Units (GPUs). A GPU contains a large
number of simple processors, typically ordered in groups of 32. Each processor group is limited to executing the
same instruction. Thus, this is a true example of Single Instruction Multiple Data (SIMD) processing.
By far the most common parallel computer architecture these days is called Multiple Instruction Multiple Data
(MIMD): the processors execute multiple, possibly differing instructions, each on their own data. Saying that the
instructions differ does not mean that the processors actually run different programs: most of these machines operate
in Single Program Multiple Data (SPMD) mode, where the programmer starts up the same executable on the parallel
processors. Since the different instances of the executable can take differing paths through conditional statements,
or execute differing numbers of iterations of loops, they will in general not be completely in sync as they were on
SIMD machines. This lack of synchronization is called load unbalance, and it is a major source of less than perfect
speedup.
There is a great variety in MIMD computers. Some of the aspects concern the way memory is organized, and
the network that connects the processors. Apart from these hardware aspects, there are also differing ways of
programming these machines. We will see all these aspects below. Machines supporting the SPMD model are
usually called clusters. They can be built out of custom or commodity processors; if they consist of PCs, running
Linux, and connected with Ethernet, they are referred to as Beowulf clusters [49].
System software. Moreover, keeping caches coherent means that there is data traveling through the network, taking
up precious bandwidth.
The SGI Origin and Onyx computers can have more than a thousand processors in a NUMA architecture. This requires
a substantial network.
what kind of actions these are and how hard it is to actually execute them in parallel, as well as how efficient the
resulting execution is.
The discussion in this section will be mostly on a conceptual level; in section 2.5 we will go into some detail on
how parallelism can actually be programmed.
It is fairly common for a program to have loops with a simple body that is executed for all elements in a large
data set:
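A minimal version of such a loop (the array names and the length ProblemSize are illustrative):
for (i=0; i<ProblemSize; i++)
  a[i] = 2*b[i];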
Such code is considered an instance of data parallelism or fine-grained parallelism. If you had as many processors
as array elements, this code would look very simple: each processor would execute the statement
a = 2*b
If your code consists predominantly of such loops over arrays, it can be executed efficiently with all processors in
lockstep. Architectures based on this idea, where the processors can in fact only work in lockstep, have existed, see
section 2.2.1. Such fully parallel operations on arrays can appear in computer graphics, where every bit of an image
is processed independently.
Continuing the above example for a little bit, consider the operation
bleft ← shiftright(b)
bright ← shiftleft(b)
a ← (bleft + bright)/2
where the shiftleft/right instructions cause a data item to be sent to the processor with a number lower or
higher by 1. For this second example to be efficient, it is necessary that each processor can communicate quickly
with its immediate neighbours, and the first and last processor with each other.
In various contexts, such as ‘blur’ operations in graphics, it makes sense to have operations on 2D data:
for 0 < i < m do
  for 0 < j < n do
    a_{ij} ← (b_{i,j−1} + b_{i,j+1} + b_{i−1,j} + b_{i+1,j})
  end
end
and consequently processors have to be able to move data to neighbours in a 2D grid.
Unlike in the data parallel example above, the assignment of data to processor is not determined in advance in
such a scheme. Therefore, this mode of parallelism is most suited for thread-programming, for instance through the
OpenMP library; section 2.5.1.
A finite element mesh is, in the simplest case, a collection of triangles that covers a 2D object. Since angles that
are too acute should be avoided, the Delaunay mesh refinement process can take certain triangles, and replace them
by better shaped ones. This is illustrated in figure 2.3: the black triangles violate some angle condition, so either
they themselves get subdivided, or they are joined with some neighbouring ones (rendered in grey) and then jointly
redivided.
In pseudo-code, the algorithm is driven by a worklist of ‘bad’ triangles: each process repeatedly takes a triangle off
the list, replaces it by better shaped ones, and adds any newly created bad triangles back to the list (for a more
detailed discussion, see [59]). Clearly, this worklist data structure has to be shared between all processes.
Together with the dynamic assignment of data to processes, this implies that this type of irregular parallelism is
suited to shared memory programming, and is much harder to do with distributed memory.
In certain contexts, a simple, often single processor, calculation needs to be performed on many different inputs.
Scheduling the calculations as indicated above is then referred to as a parameter sweep. Since the program execu-
tions have no data dependencies and need not be done in any particular sequence, this is an example of embarrassingly
parallel computing.
The above strict realization of data parallelism assumes that there are as many processors as data elements. In
practice, processors will have much more memory than that, and the number of data elements is likely to be far
larger than the processor count of even the largest computers. Therefore, arrays are grouped onto processors in
subarrays. The code then looks like this:
my_lower_bound = // some processor-dependent number
my_upper_bound = // some processor-dependent number
for (i=my_lower_bound; i<my_upper_bound; i++)
// the loop body goes here
This model has some characteristics of data parallelism, since the operation performed is identical on a large number
of data items. It can also be viewed as task parallelism, since each processor executes a larger section of code, and
does not necessarily operate on equal sized chunks of data.
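For concreteness, the processor-dependent bounds could be computed as in the following sketch; the even block distribution and the names nprocs and myid are assumptions made for this example, not a prescription from the text:
int chunk = (ProblemSize + nprocs - 1) / nprocs;   /* ceiling of ProblemSize/nprocs */
int my_lower_bound = myid * chunk;
int my_upper_bound = my_lower_bound + chunk;
if (my_upper_bound > ProblemSize) my_upper_bound = ProblemSize;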
2.5.1 OpenMP
OpenMP is an extension to the programming languages C and Fortran. Its main approach to parallelism is the
parallel execution of loops: based on compiler directives, a preprocessor can schedule the parallel execution of the
loop iterations.
The amount of parallelism is flexible: the user merely specifies a parallel region, indicating that all iterations are
independent to some extent, and the runtime system will then use whatever resources are available. Because of
this dynamic nature, and because no data distribution is specified, OpenMP can only work with threads on shared
memory.
OpenMP is neither a language nor a library: it operates by inserting directives into source code, which are interpreted
by the compiler. Many compilers, such as GCC or the Intel compiler, support the OpenMP extensions. In Fortran,
OpenMP directives are placed in comment statements; in C, they are placed in #pragma CPP directives, which
indicate compiler specific extensions. As a result, OpenMP code still looks like legal C or Fortran to a compiler that
does not support OpenMP. Programs need to be linked to an OpenMP runtime library, and their behaviour can be
controlled through environment variables.
OpenMP features dynamic parallelism: the number of execution streams operating in parallel can vary from one
part of the code to another.
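As a minimal sketch of this loop-based approach (the arrays and their length ProblemSize are illustrative), a data parallel loop can be annotated with a directive that asks the compiler to divide the iterations over the available threads:
#pragma omp parallel for
for (i=0; i<ProblemSize; i++)
  a[i] = 2*b[i];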
For more information about OpenMP, see [21].
Threads are independent streams of instructions that are part of one process, and they therefore share each other’s
data. Threads do have the possibility of having private data; for instance, they have their own data stack.
Threads serve two functions:
1. By having more than one thread on a single processor, a higher processor utilization can result, since the
instructions of one thread can be processed while another thread is waiting for data.
2. In a shared memory context, multiple threads running on multiple processors or processor cores can be
an easy way to parallelize a process. The shared memory allows the threads to all see the same data.
#include <stdlib.h>
#include <stdio.h>
#include "pthread.h"
int sum=0;
void *adder(void *unused) {  /* pthread thread functions take and return a void pointer */
  sum = sum+1;
  return NULL;
}
#define NTHREADS 50
int main() {
  int i;
  pthread_t threads[NTHREADS];
  printf("forking\n");
  for (i=0; i<NTHREADS; i++)
    if (pthread_create(threads+i,NULL,&adder,NULL)!=0) return i+1;
  printf("joining\n");
  for (i=0; i<NTHREADS; i++)
    if (pthread_join(threads[i],NULL)!=0) return NTHREADS+i+1;
  printf("Sum computed: %d\n",sum);
  return 0;
}
The fact that this code gives the right result is a coincidence: it only happens because updating the variable is so
much quicker than creating the thread. (On a multicore processor the chance of errors will greatly increase.) If we
artificially increase the time for the update, we will no longer get the right result:
void *adder(void *unused) {
  int t = sum; sleep(1); sum = t+1;   /* sleep() requires #include <unistd.h> */
  return NULL;
}
Now all threads read out the value of sum, wait a while (presumably calculating something) and then update.
This can be fixed by having a lock on the code region that should be ‘mutually exclusive’:
pthread_mutex_t lock;
void *adder(void *unused) {
  pthread_mutex_lock(&lock);
  int t = sum; sleep(1); sum = t+1;
  pthread_mutex_unlock(&lock);
  return NULL;
}
int main() {
....
pthread_mutex_init(&lock,NULL);
The lock and unlock commands guarantee that no two threads can interfere with each other’s update.
Figure 2.4: Static or round-robin (left) vs dynamic (right) thread scheduling; the task numbers are indicated.
the problem would be even worse. Now the last processor can not start its receive since it is blocked sending x_{n−1}
to processor 0. This situation, where the program can not progress because every processor is waiting for another,
is called deadlock.
The reason for blocking instructions is to prevent accumulation of data in the network. If a send instruction were
to complete before the corresponding receive started, the network would have to store the data somewhere in the
mean time. Consider a simple example:
buffer = ... ; // generate some data
send(buffer,0); // send to processor 0
buffer = ... ; // generate more data
send(buffer,1); // send to processor 1
After the first send, we start overwriting the buffer. If the data in it hasn’t been received, the first set of values would
have to be buffered somewhere in the network, which is not realistic. By having the send operation block, the data
stays in the sender’s buffer until it is guaranteed to have been copied to the recipient’s buffer.
One way out of the problem of sequentialization or deadlock that arises from blocking instructions is the use of
non-blocking communication instructions, which include explicit buffers for the data. With a non-blocking send
instruction, the user needs to allocate a buffer for each send, and check when it is safe to overwrite the buffer.
buffer0 = ... ; // data for processor 0
send(buffer0,0); // send to processor 0
buffer1 = ... ; // data for processor 1
send(buffer1,1); // send to processor 1
...
// wait for completion of all send operations.
2.5.3 MPI
If OpenMP is the way to program shared memory, MPI [77] is the standard solution for programming distributed
memory. MPI (‘Message Passing Interface’) is a specification for a library interface for moving data between processes
that do not otherwise share data. The MPI routines can be divided roughly into the following categories:
• Process management. This includes querying the parallel environment and constructing subsets of pro-
cessors.
• Point-to-point communication. This is a set of calls where two processes interact. These are mostly vari-
ants of the send and receive calls.
• Collective calls. In these routines, all processors (or the whole of a specified subset) are involved. Exam-
ples are the broadcast call, where one processor shares its data with every other processor, or the gather
call, where one processor collects data from all participating processors.
Let us consider how the OpenMP examples can be coded in MPI. First of all, we no longer allocate
double a[ProblemSize];
but
double a[LocalProblemSize];
where the local size is roughly a 1/P fraction of the global size. (Practical considerations dictate whether you want
this distribution to be as even as possible, or rather biased in some way.)
The parallel loop is trivially parallel, with the only difference that it now operates on a fraction of the arrays:
for (i=0; i<LocalProblemSize; i++) {
a[i] = b[i];
}
However, if the loop involves a calculation based on the iteration number, we need to map that to the global value:
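A minimal sketch of such a mapping; the variable myFirstIndex, standing for the global index of this process’s first local element, and the function f are names made up for this illustration:
for (i=0; i<LocalProblemSize; i++) {
  int globalindex = i + myFirstIndex;   /* position of a[i] in the global array */
  a[i] = f(globalindex);
}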
Next, consider a three-point averaging operation; for the first and last element of the local array, the needed
neighbour values bleft and bright live on neighbouring processors:
a[i] = (b[i]+bleft+bright)/3
Obtaining the neighbour values is done as follows. First we need to ask for our processor number, so that we can start
a communication with the processors with numbers one higher and one lower.
MPI_Comm_rank(MPI_COMM_WORLD,&myTaskID);
MPI_Sendrecv
(/* to be sent: */ &b[LocalProblemSize-1],
/* result: */ &bfromleft,
/* destination */ myTaskID+1, /* some parameters omitted */ );
MPI_Sendrecv(&b[0],&bfromright,myTaskID-1 /* ... */ );
There are still two problems with this code. First, the sendrecv operations need exceptions for the first and last
processors. This can be done elegantly as follows:
MPI_Comm_rank(MPI_COMM_WORLD,&myTaskID);
MPI_Comm_size(MPI_COMM_WORLD,&nTasks);
if (myTaskID==0) leftproc = MPI_PROC_NULL;
else leftproc = myTaskID-1;
if (myTaskID==nTasks-1) rightproc = MPI_PROC_NULL;
else rightproc = myTaskID+1;
MPI_Sendrecv( &b[LocalProblemSize-1], &bfromleft, rightproc );
MPI_Sendrecv( &b[0], &bfromright, leftproc);
Exercise 2.2. There is still a problem left with this code: the boundary conditions from the original,
global, version have not been taken into account. Give code that solves that problem.
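For reference, written out with all of its arguments, the first of the abbreviated sendrecv calls above might look as follows; the message length of one element, the MPI_DOUBLE datatype, and the tag value 0 are assumptions made for this sketch:
MPI_Sendrecv(&b[LocalProblemSize-1], 1, MPI_DOUBLE, rightproc, 0,
             &bfromleft,             1, MPI_DOUBLE, leftproc,  0,
             MPI_COMM_WORLD, MPI_STATUS_IGNORE);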
MPI gets complicated if different processes need to take different actions, for example, if one needs to send data to
another. The problem here is that each process executes the same executable, so it needs to contain both the send
and the receive instruction, to be executed depending on what the rank of the process is.
if (myTaskID==0) {
  MPI_Send(myInfo,1,MPI_INT,/* to: */ 1,/* labeled: */ 0,
           MPI_COMM_WORLD);
} else {
  MPI_Recv(myInfo,1,MPI_INT,/* from: */ 0,/* labeled: */ 0,
           MPI_COMM_WORLD,/* not explained here: */ &status);
}
Although MPI is sometimes called the ‘assembly language of parallel programming’, for its perceived difficulty
and level of explicitness, it is not all that hard to learn, as evinced by the large number of scientific codes that use it.
The main issues that make MPI somewhat intricate to use are buffer management and blocking semantics.
These issues are related, and stem from the fact that, ideally, data should not be in two places at the same time. Let
us briefly consider what happens if processor 1 sends data to processor 2. The safest strategy is for processor 1 to
execute the send instruction, and then wait until processor 2 acknowledges that the data was successfully received.
This means that processor 1 is temporarily blocked until processor 2 actually executes its receive instruction, and
the data has made its way through the network. Alternatively, processor 1 could put its data in a buffer, tell the
system to make sure that it gets sent at some point, and later checks to see that the buffer is safe to reuse. This
second strategy is called non-blocking communication, and it requires the use of a temporary buffer.
Blocking operations have the disadvantage that they can lead to deadlock , if two processes wind up waiting for
each other. Even without deadlock, they can lead to considerable idle time in the processors, as they wait without
performing any useful work. On the other hand, they have the advantage that it is clear when the buffer can be
reused: after the operation completes, there is a guarantee that the data has been safely received at the other end.
The blocking behaviour can be avoided, at the cost of complicating the buffer semantics, by using non-blocking
operations. A non-blocking send (MPI_Isend) declares that a data buffer needs to be sent, but then does not
wait for the completion of the corresponding receive. There is a second operation MPI_Wait that will actually
block until the receive has been completed. The advantage of this decoupling of sending and blocking is that it now
becomes possible to write:
ISend(somebuffer,&handle); // start sending, and
// get a handle to this particular communication
{ ... } // do useful work on local data
Wait(handle); // block until the communication is completed;
{ ... } // do useful work on incoming data
With a little luck, the local operations take more time than the communication, and you have completely eliminated
the communication time.
In addition to non-blocking sends, there are non-blocking receives. A typical piece of code then looks like
ISend(sendbuffer,&sendhandle);
IReceive(recvbuffer,&recvhandle);
{ ... } // do useful work on local data
Wait(sendhandle); Wait(recvhandle);
{ ... } // do useful work on incoming data
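In actual MPI syntax, such an exchange could look like the following sketch; the buffer length count, the MPI_DOUBLE datatype, the tag 0, and the rank variable other are assumptions made for this example:
MPI_Request sendreq, recvreq;
MPI_Isend(sendbuffer, count, MPI_DOUBLE, other, 0, MPI_COMM_WORLD, &sendreq);
MPI_Irecv(recvbuffer, count, MPI_DOUBLE, other, 0, MPI_COMM_WORLD, &recvreq);
/* ... do useful work on local data ... */
MPI_Wait(&sendreq, MPI_STATUS_IGNORE);
MPI_Wait(&recvreq, MPI_STATUS_IGNORE);
/* ... do useful work on incoming data ... */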
Exercise 2.3. Go back to exercise 1 and give pseudocode that solves the problem using non-blocking
sends and receives. What is the disadvantage of this code over a blocking solution?
The Partitioned Global Address Space (PGAS) model encompasses both SMP and distributed shared memory. A typical
PGAS language, Unified Parallel C (UPC), allows you to write programs that for the most part look like regular C code.
However, by indicating how the major arrays are distributed over processors, the program can be executed in parallel.
2.5.4.1 Discussion
Parallel languages hold the promise of making parallel programming easier, since they make communication oper-
ations appear as simple copies or arithmetic operations. However, by doing so they invite the user to write code that
may not be efficient, for instance by inducing many small messages.
As an example, consider arrays a,b that have been horizontally partitioned over the processors, and that are shifted:
for (i=0; i<N; i++)
for (j=0; j<N/np; j++)
a[i][j+joffset] = b[i][j+1+joffset]
If this code is executed on a shared memory machine, it will be efficient, but a naive translation in the distributed
case will have a single number being communicated in each iteration of the i loop. Clearly, these can be combined
in a single buffer send/receive operation, but compilers are usually unable to make this transformation. As a result,
the user is forced to, in effect, re-implement the blocking that needs to be done in an MPI implementation:
for (i=0; i<N; i++)
  t[i] = b[i][N/np+joffset];          /* fetch the one element owned by the neighbour */
for (i=0; i<N; i++) {
  for (j=0; j<N/np-1; j++)
    a[i][j+joffset] = b[i][j+1+joffset];
  a[i][N/np-1+joffset] = t[i];        /* last local element uses the buffered value */
}
On the other hand, certain machines support direct memory copies through global memory hardware. In that case,
PGAS languages can be more efficient than explicit message passing, even with physically distributed memory.
2.5.4.2 UPC
As an example of UPC, here is a vector addition in which the iterations are divided cyclically over the threads:
#include <upc_relaxed.h>
#define N 100*THREADS
shared int v1[N], v2[N], v1plusv2[N];
void main()
{
  int i;
  for(i=MYTHREAD; i<N; i+=THREADS)
    v1plusv2[i]=v1[i]+v2[i];
}
The same program with an explicitly parallel loop construct:
//vect_add.c
#include <upc_relaxed.h>
#define N 100*THREADS
shared int v1[N], v2[N], v1plusv2[N];
void main()
{
int i;
upc_forall(i=0; i<N; i++; i)
v1plusv2[i]=v1[i]+v2[i];
}
2.5.4.3 Titanium
Titanium is comparable to UPC in spirit, but based on Java rather than on C.
2.5.4.6 Chapel
Chapel [1] is a new parallel programming language4 being developed by Cray Inc. as part of the DARPA-led
High Productivity Computing Systems program (HPCS). Chapel is designed to improve the productivity of high-
end computer users while also serving as a portable parallel programming model that can be used on commodity
clusters or desktop multicore systems. Chapel strives to vastly improve the programmability of large-scale parallel
computers while matching or beating the performance and portability of current programming models like MPI.
Chapel supports a multithreaded execution model via high-level abstractions for data parallelism, task parallelism,
concurrency, and nested parallelism. Chapel’s locale type enables users to specify and reason about the placement
of data and tasks on a target architecture in order to tune for locality. Chapel supports global-view data aggregates
with user-defined implementations, permitting operations on distributed data structures to be expressed in a natural
manner. In contrast to many previous higher-level parallel languages, Chapel is designed around a multiresolution
philosophy, permitting users to initially write very abstract code and then incrementally add more detail until they
are as close to the machine as their needs require. Chapel supports code reuse and rapid prototyping via object-
oriented design, type inference, and features for generic programming.
Chapel was designed from first principles rather than by extending an existing language. It is an imperative block-
structured language, designed to be easy to learn for users of C, C++, Fortran, Java, Perl, Matlab, and other popular
languages. While Chapel builds on concepts and syntax from many previous languages, its parallel features are most
directly influenced by ZPL, High-Performance Fortran (HPF), and the Cray MTA’s extensions to C and Fortran.
2.5.4.7 Fortress
Fortress [7] is a programming language developed by Sun Microsystems. Fortress5 aims to make parallelism more
tractable in several ways. First, parallelism is the default. This is intended to push tool design, library design, and
programmer skills in the direction of parallelism. Second, the language is designed to be more friendly to paral-
lelism. Side-effects are discouraged because side-effects require synchronization to avoid bugs. Fortress provides
transactions, so that programmers are not faced with the task of determining lock orders, or tuning their locking
code so that there is enough locking for correctness, but not so much that performance is impeded. The Fortress looping
constructions, together with the library, turn ‘iteration’ inside out; instead of the loop specifying how the data
is accessed, the data structures specify how the loop is run, and aggregate data structures are designed to break
into large parts that can be effectively scheduled for parallel execution. Fortress also includes features from other
languages intended to generally help productivity – test code and methods, tied to the code under test; contracts
that can optionally be checked when the code is run; and properties, that might be too expensive to run, but can be
fed to a theorem prover or model checker. In addition, Fortress includes safe-language features like checked array
bounds, type checking, and garbage collection that have proven useful in Java. Fortress syntax is designed to
resemble mathematical syntax as much as possible, so that anyone solving a problem with math in its specification
can write a program that can be more easily related to its original specification.
2.5.4.8 X10
X10 is an experimental new language currently under development at IBM in collaboration with academic part-
ners. The X10 effort is part of the IBM PERCS project (Productive Easy-to-use Reliable Computer Systems) in the
DARPA program on High Productivity Computer Systems. The PERCS project is focused on a hardware-software
co-design methodology to integrate advances in chip technology, architecture, operating systems, compilers, pro-
gramming language and programming tools to deliver new adaptable, scalable systems that will provide an order-
of-magnitude improvement in development productivity for parallel applications by 2010.
X10 aims to contribute to this productivity improvement by developing a new programming model, combined with
a new set of tools integrated into Eclipse and new implementation techniques for delivering optimized scalable
parallelism in a managed runtime environment. X10 is a type-safe, modern, parallel, distributed object-oriented
language intended to be very easily accessible to Java(TM) programmers. It is targeted to future low-end and
high-end systems with nodes that are built out of multi-core SMP chips with non-uniform memory hierarchies,
and interconnected in scalable cluster configurations. A member of the Partitioned Global Address Space (PGAS)
family of languages, X10 highlights the explicit reification of locality in the form of places; lightweight activities
embodied in async, future, foreach, and ateach constructs; constructs for termination detection (finish) and phased
computation (clocks); the use of lock-free synchronization (atomic blocks); and the manipulation of global arrays
and data structures.
2.5.4.9 Linda
As should be clear by now, the treatment of data is by far the most important aspect of parallel programming, far
more important than algorithmic considerations. The programming system Linda [39], also called a coordination
language, is designed to address the data handling explicitly. Linda is not a language as such, but can, and has been,
incorporated into other languages.
The basic concept of Linda is tuple space: data is added to a pool of globally accessible information by adding a
label to it. Processes then retrieve data by their label, and without needing to know which processes added them to
the tuple space.
Linda is aimed primarily at a different computation model than is relevant for High-Performance Computing (HPC):
it addresses the needs of asynchronous communicating processes. However, it has been used for scientific compu-
tation [25]. For instance, in parallel simulations of the heat equation (section 4.3), processors can write their data
into tuple space, and neighbouring processes can retrieve their ghost region without having to know its provenance.
Thus, Linda becomes one way of implementing this type of data exchange.
2.6 Topologies
If a number of processors are working together on a single task, most likely they need to communicate data. For this
reason there needs to be a way for data to make it from any processor to any other. In this section we will discuss
some of the possible schemes to connect the processors in a parallel machine.
In order to get an appreciation for the fact that there is a genuine problem here, consider two simple schemes that
do not ‘scale up’:
• Ethernet is a connection scheme where all machines on a network are on a single cable6 . If one machine
puts a signal on the wire to send a message, and another also wants to send a message, the latter will
detect that the sole available communication channel is occupied, and it will wait some time before
retrying its send operation. Receiving data on ethernet is simple: messages contain the address of the
intended recipient, so a processor only has to check whether the signal on the wire is intended for it.
The problems with this scheme should be clear. The capacity of the communication channel is finite,
so as more processors are connected to it, the capacity available to each will go down. Because of the
scheme for resolving conflicts, the average delay before a message can be started will also increase7 .
• In a fully connected configuration, each processor has one wire for the communications with each other
processor. This scheme is perfect in the sense that messages can be sent in the minimum amount of
time, and two messages will never interfere with each other. The amount of data that can be sent from
one processor is no longer a decreasing function of the number of processors; it is in fact an increasing
function, and if the network controller can handle it, a processor can even engage in multiple simultaneous
communications.
The problem with this scheme is of course that the design of the network interface of a processor is no
longer fixed: as more processors are added to the parallel machine, the network interface gets more con-
necting wires. The network controller similarly becomes more complicated, and the cost of the machine
increases faster than linearly in the number of processors.
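To make this concrete: a fully connected network of p processors needs p − 1 connections per network interface and p(p − 1)/2 wires in total, so a machine with 1000 processors would already require on the order of half a million wires.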
In this section we will see a number of schemes that can be scaled up to large numbers of processors.
6. We are here describing the original design of Ethernet. With the use of switches, especially in an HPC context, this description does
not really apply anymore.
7. It was initially thought that ethernet would be inferior to other solutions such as IBM’s ‘token ring’. It takes fairly sophisticated
statistical analysis to prove that it works a lot better than was naively expected.
8. We assume that connections are symmetric, so that the network is an undirected graph .
If d is the diameter, and if sending a message over one wire takes unit time (more about this in the next section),
this means a message will always arrive in at most time d.
Exercise 2.4. Find a relation between the number of processors, their degree, and the diameter of the
connectivity graph.
In addition to the question ‘how long will a message from processor A to processor B take’, we often worry about
conflicts between two simultaneous messages: is there a possibility that two messages, under way at the same time,
will need to use the same network link? This sort of conflict is called congestion or contention. Clearly, the more
links the graph of a parallel computer has, the smaller the chance of congestion.
A precise way to describe the likelihood of congestion, is to look at the bisection width . This is defined as the
minimum number of links that have to be removed to partition the processor graph into two unconnected graphs.
For instance, consider processors connected as a linear array, that is, processor Pi is connected to Pi−1 and Pi+1 .
In this case the bisection width is 1.
The bisection width w describes how many messages can, guaranteed, be under way simultaneously in a parallel
computer. Proof: take w sending and w receiving processors. The w paths thus defined are disjoint: if they were not,
we could separate the processors into two groups by removing only w − 1 links.
In practice, of course, more than w messages can be under way simultaneously. For instance, in a linear array, which
has w = 1, P/2 messages can be sent and received simultaneously if all communication is between neighbours,
and if a processor can only send or receive, but not both, at any one time. If processors can both send and receive
simultaneously, P messages can be under way in the network.
Bisection width also describes redundancy in a network: if one or more connections are malfunctioning, can a
message still find its way from sender to receiver?
Exercise 2.5. What is the diameter of a 3D cube of processors? What is the bisection width? How does
that change if you add wraparound torus connections?
While bisection width is a measure expressed as a number of wires, in practice we care about the capacity through
those wires. The relevant concept here is bisection bandwidth : the bandwidth across the bisection width, which is
the product of the bisection width, and the capacity (in bits per second) of the wires. Bisection bandwidth can be
considered as a measure for the bandwidth that can be attained if an arbitrary half of the processors communicates
with the other half. Bisection bandwidth is a more realistic measure than the aggregate bandwidth which is some-
times quoted: it is defined as the total data rate if every processor is sending: the number of processors times the
bandwidth of a connection times the number of simultaneous sends a processor can perform. This can be quite a
high number, and it is typically not representative of the communication rate that is achieved in actual applications.
2.6.4 Hypercubes
Above we gave a hand-waving argument for the suitability of mesh-organized processors, based on the prevalence
of nearest neighbour communications. However, sometimes sends and receives between arbitrary processors occur.
One example of this is the above-mentioned broadcast. For this reason, it is desirable to have a network with a
smaller diameter than a mesh. On the other hand we want to avoid the complicated design of a fully connected
network.
A good intermediate solution is the hypercube design. An n-dimensional hypercube computer has 2n processors,
with each processor connected to one other in each dimension; see figure 2.6. The nodes of a hypercube are num-
bered by bit patterns as in figure 2.7.
An easy way to describe this is to give each processor an address consisting of d bits. A processor is then connected
to all others that have an address that differs by exactly one bit.
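As a small sketch of this addressing scheme (the variable names are made up for the example), the neighbours of a node p in a d-dimensional hypercube are found by flipping each of its d address bits in turn:
/* Print the neighbours of node p in a d-dimensional hypercube. */
for (int k = 0; k < d; k++) {
  int neighbour = p ^ (1 << k);   /* address differs from p in exactly bit k */
  printf("node %d is connected to node %d\n", p, neighbour);
}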
The big advantages of a hypercube design are the small diameter and large capacity for traffic through the network.
A Gray code can be constructed recursively by reflection: take the code one dimension lower, append its mirror
image, and prefix the first half with a 0 bit and the second half with a 1 bit:
1D Gray code:              0 1
1D code and reflection:    0 1 | 1 0
append 0 and 1 bit:       00 01 | 11 10
2D Gray code:             00 01 11 10
2D code and reflection:   00 01 11 10 | 10 11 01 00
append 0 and 1 bit:      000 001 011 010 | 110 111 101 100
3D Gray code:            000 001 011 010 110 111 101 100
An alternative to these direct connection schemes is to connect all the processors to a switch or switching network.
Some popular network designs are the crossbar, the butterfly exchange, and the fat tree [47].
Switching networks are made out of switching elements, each of which has a small number (up to about a dozen)
of inbound and outbound links. By hooking all processors up to some switching element, and having multiple stages
of switching, it then becomes possible to connect any two processors by a path through the network.
The simplest switching network is a cross bar, an arrangement of n horizontal and vertical lines, with a switch
element on each intersection that determines whether the lines are connected; see figure 2.9. If we designate the
horizontal lines as inputs and the vertical lines as outputs, this is clearly a way of having n inputs be mapped to n outputs.
Every combination of inputs and outputs (sometimes called a ‘permutation’) is allowed.
Figure 2.10: A butterfly exchange network for two and four processors/memories
A butterfly exchange network allows several processors to access memory simultaneously. Also, the access times
are identical, so exchange networks are a way of implementing a UMA architecture; see section 2.3.1.
Exercise 2.13. For both the simple cross bar and the butterfly exchange, the network needs to be ex-
panded as the number of processors grows. Give the number of wires (of some unit length)
and the number of switching elements that is needed in both cases to connect n processors
and memories. What is the time that a data packet needs to go from memory to processor,
expressed in the unit time that it takes to traverse a unit length of wire and the time to traverse
a switching element?
2.6.5.3 Fat-trees
If we were to connect switching nodes like a tree, there would be a big problem with congestion close to the root
since there are only two wires attached to the root node. Say we have a k-level tree, so there are 2^k leaf nodes. If
all leaf nodes in the left subtree try to communicate with nodes in the right subtree, we have 2^{k−1} messages going
through just one wire into the root, and similarly out through one wire. A fat-tree is a tree network where each
level has the same total bandwidth, so that this congestion problem does not occur: the root will actually have 2^{k−1}
incoming and outgoing wires attached.
The first successful computer architecture based on a fat-tree was the Connection Machine CM-5.
In fat-trees, as in other switching networks, each message carries its own routing information. Since in a fat-tree
the choices are limited to going up a level, or switching to the other subtree at the current level, a message needs to
carry only as many bits of routing information as there are levels, which is log_2 n for n processors.
The theoretical exposition of fat-trees in [64] shows that fat-trees are optimal in some sense: it can deliver messages
as fast (up to logarithmic factors) as any other network that takes the same amount of space to build. The underlying
assumption of this statement is that switches closer to the root have to connect more wires, therefore take more
components, and correspondingly are larger.
This argument, while theoretically interesting, is of no practical significance, as the physical size of the network
hardly plays a role in the biggest currently available computers that use fat-tree interconnect. For instance, in the
Ranger supercomputer of The University of Texas at Austin, the fat-tree switch connects 60,000 processors, yet
takes less than 10 percent of the floor space.
A fat tree, as sketched above, would be costly to build, since for every next level a new, bigger, switch would have
to be designed. In practice, therefore, a network with the characteristics of a fat-tree is constructed from simple
switching elements; see figure 2.12. This network is equivalent in its bandwidth and routing possibilities to a fat-
tree. Routing algorithms will be slightly more complicated: in a fat-tree, a data packet can go up in only one way,
but here a packet has to know to which of the two higher switches to route.
This type of switching network is one case of a Clos network [23].
A simple model for the time to transmit a message distinguishes a fixed startup cost from a cost proportional to the
message size:
T(n) = α + βn
for the transmission time of an n-byte message. Here, α is the latency and β is the time per byte, that is, the inverse
of bandwidth.
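As an illustration with assumed values (not taken from the text): with α = 1 µs and β = 1 ns per byte, an 8-byte message takes about 1.008 µs, which is almost entirely latency, while a 1 MB message takes about 1.001 ms, which is almost entirely determined by bandwidth.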
2.7 Theory
There are two important reasons for using a parallel computer: to have access to more memory or to obtain higher
performance. It is easy to characterize the gain in memory, as the total memory is the sum of the individual memo-
ries. The speed of a parallel computer is harder to characterize. A simple approach is to let the same program run
on a single processor, and on a parallel machine with p processors, and to compare runtimes.
With T1 the execution time on a single processor and Tp the time on p processors, we define the speedup as
Sp = T1 /Tp . (Sometimes T1 is defined as ‘the best time to solve the problem on a single processor’, which allows
for using a different algorithm on a single processor than in parallel.) In the ideal case, Tp = T1 /p, but in practice
we don’t expect to attain that, so SP ≤ p. To measure how far we are from the ideal speedup, we introduce the
efficiency Ep = Sp /p. Clearly, 0 < Ep ≤ 1.
There is a practical problem with this definition: a problem that can be solved on a parallel machine may be too
large to fit on any single processor. Conversely, distributing a single processor problem over many processors may
give a distorted picture since very little data will wind up on each processor.
There are various reasons why the actual speedup is less than p. For one, using more than one processor necessitates
communication, which is overhead. Secondly, if the processors do not have exactly the same amount of work to do,
they may be idle part of the time, again lowering the actually attained speedup. Finally, code may have sections that
are inherently sequential.
Communication between processors is an important source of a loss of efficiency. Clearly, a problem that can be
solved without communication will be very efficient. Such a problem, in effect consisting of a number of completely
independent calculations, is called embarrassingly parallel; it will have close to a perfect speedup and efficiency.
Exercise 2.14. The case of speedup larger than the number of processors is called superlinear speedup.
Give a theoretical argument why this can never happen.
In practice, superlinear speedup can happen. For instance, suppose a problem is too large to fit in memory, and a
single processor can only solve it by swapping data to disc. If the same problem fits in the memory of two processors,
the speedup may well be larger than 2 since disc swapping no longer occurs. Having less, or more localized, data
may also improve the cache behaviour of a code.
Suppose that 5 percent of a code is inherently sequential; the time spent in that part can then not be reduced,
no matter how many processors are available. Thus, the speedup on that code is limited to a factor of 20. This
phenomenon is known as Amdahl’s Law [12], which we will now formulate.
Let Fs be the sequential fraction and Fp be the parallel fraction (or more strictly: the ‘parallelizable’ fraction) of a
code, respectively. Then Fp + Fs = 1. The parallel execution time Tp is the sum of the part that is sequential T1 Fs
and the part that can be parallelized T1 Fp /p:
TP = T1 (Fs + Fp /P ). (2.2)
As the number of processors grows p → ∞, the parallel execution time now approaches that of the sequential
fraction of the code: TP ↓ T1 Fs . We conclude that speedup is limited by SP ≤ 1/Fs and efficiency is a decreasing
function E ∼ 1/p.
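As a worked example with illustrative numbers: take Fs = 0.05, as in the 5 percent sequential code above, and p = 100 processors. Equation (2.2) gives Tp = T1 (0.05 + 0.95/100) = 0.0595 T1 , so the speedup is Sp ≈ 16.8, already close to the asymptotic limit of 1/Fs = 20 even though only a hundred processors are used.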
The sequential fraction of a code can consist of things such as I/O operations. However, there are also parts of a code
that in effect act as sequential. Consider a program that executes a single loop, where all iterations can be computed
independently. Clearly, this code is easily parallelized. However, by splitting the loop in a number of parts, one per
processor, each processor now has to deal with loop overhead: calculation of bounds, and the test for completion.
This overhead is replicated as many times as there are processors. In effect, loop overhead acts as a sequential part
of the code.
In practice, many codes do not have significant sequential parts, and overhead is not important enough to affect
parallelization adversely. One reason for this is discussed in the next section.
Exercise 2.15. Investigate the implications of Amdahl’s law: if the number of processors P increases,
how does the parallel fraction of a code have to increase to maintain a fixed efficiency?
If we add a fixed communication time Tc to the model, the parallel execution time becomes
Tp = T1 (Fs + Fp /p) + Tc .
Assuming for simplicity a perfectly parallelizable code (Fs = 0, Fp = 1), the speedup is
Sp = T1 / (T1 /p + Tc ).
For this to be close to p, we need Tc ≪ T1 /p, or p ≪ T1 /Tc . In other words, the number of processors should not
grow beyond the ratio of scalar execution time and communication overhead.
2.7.3 Scalability
Above, we remarked that splitting a given problem over more and more processors does not make sense: at a
certain point there is just not enough work for each processor to operate efficiently. Instead, in practice, users of
a parallel code will either choose the number of processors to match the problem size, or they will solve a series
of increasingly larger problems on correspondingly growing numbers of processors. In both cases it is hard to talk
about speedup. Instead, the concept of scalability is used.
We distinguish two types of scalability. So-called strong scalability is in effect the same as speedup, discussed
above. We say that a program shows strong scalability if, partitioned over more and more processors, it shows
perfect or near perfect speedup. Typically, one encounters statements like ‘this problem scales up to 500 processors’,
meaning that up to 500 processors the speedup will not noticeably decrease from optimal.
More interesting, weak scalability is a more vaguely defined term. It describes that, as problem size and number
of processors grow in such a way that the amount of data per processor stays constant, the speed in operations per
second of each processor also stays constant. This measure is somewhat hard to report, since the relation between
the number of operations and the amount of data can be complicated. If this relation is linear, one could state that
the amount of data per processor is kept constant, and report that parallel execution time is constant as the number
of processors grows.
Scalability depends on the way an algorithm is parallelized, in particular on the way data is distributed. In section 6.3
you will find an analysis of the matrix-vector product operation: distributing a matrix by block rows turns out not
to be scalable, but a two-dimensional distribution by submatrices is.
Present day GPUs have an architecture that combines SIMD and MIMD parallelism. For instance, an NVidia GPU
has 16 Streaming Multiprocessors (SMs), and each SM consists of 8 Streaming Processors (SPs), which correspond
to processor cores; see figure 2.13.
The SIMD nature of GPUs becomes apparent in the way CUDA starts processes. A kernel, that is, a function
that will be executed on the GPU, is started on mn cores by:
KernelProc<<<m,n>>>(args)
The collection of mn cores executing the kernel is known as a grid , and it is structured as m thread blocks of
n threads each. A thread block can have up to 512 threads.
Recall that threads share an address space (see section 2.5.1.1), so they need a way to identify what part of the data
each thread will operate on. For this, the blocks in the grid are numbered with x, y coordinates, and the threads
in a block are numbered with x, y, z coordinates. Each thread knows its coordinates in the block, and its block’s
coordinates in the grid.
            SMs     SPs             thread blocks   threads
GPU         16      8 × 16 = 128
per SM              8               8               768
Thread blocks are truly data parallel: if there is a conditional that makes some threads take the true branch and others
the false branch, then one branch will be executed first, with all threads in the other branch stopped. Subsequently,
and not simultaneously, the threads on the other branch will then execute their code. This may induce a severe
performance penalty.
These are some of the differences between GPUs and regular CPUs:
• First of all, as of this writing (late 2010), GPUs are attached processors, so any data they operate on has
to be transferred from the CPU. Since the memory bandwidth of this transfer is low, sufficient work has
to be done on the GPU to overcome this overhead.
• Since GPUs are graphics processors, they put an emphasis on single precision floating point arithmetic.
To accommodate the scientific computing community, double precision support is increasing, but double
precision speed is typically half the single precision flop rate.
• A CPU is optimized to handle a single stream of instructions, that can be very heterogeneous in character;
a GPU is made explicitly for data parallelism, and will perform badly on traditional codes.
• A CPU is made to handle one thread , or at best a small number of threads. A GPU needs a large number
of threads, far larger than the number of computational cores, to perform efficiently.
This phenomenon is described as load unbalance: there is no intrinsic reason for the one processor to be idle, and it could have been
working if we had distributed the work load differently.
‘office’ applications without the user actually installing any software. This idea is sometimes called Software-as-a-
Service, where the user connects to an ‘application server’, and accesses it through a client such as a web browser.
In the case of Google Docs, there is no longer a large central dataset, but each user interacts with their own data,
maintained on Google’s servers. This of course has the large advantage that the data is available from anywhere the
user has access to a web browser.
The term cloud computing usually refers to this internet-based model where the data is not maintained by the
user. However, it can span some or all of the above concepts, depending on who uses the term. Here is a list of
characteristics:
• A cloud is remote, involving applications running on servers that are not owned by the user. The user
pays for services on a subscription or pay-as-you-go basis.
• Cloud computing is typically associated with large amounts of data, either a single central dataset such
as an airline database server, or many independent datasets such as for Google Docs, each of which is
used by a single user or a small group of users. In the case of large datasets, they are stored in a distributed manner,
with concurrent access for the clients.
• Cloud computing makes a whole datacenter appear as a single computer to the user [71].
• The services offered by cloud computing are typically business applications and IT services, rather than
scientific computing.
• Computing in a cloud is probably virtualized, or at least the client interfaces to an abstract notion of a
server. These strategies often serve to ‘move the work to the data’.
• Server processes are loosely coupled, at best synchronized through working on the same dataset.
• Cloud computing can be interfaced with through a web browser; it can involve a business model that is ‘pay as
you go’.
• The kind of parallelism found in scientific computing is generally not possible, or not efficient, using cloud computing.
Cloud computing clearly depends on the following factors:
• The ubiquity of the internet;
• Virtualization of servers;
• Commoditization of processors and hard drives.
The infrastructure for cloud computing can be interesting from a computer science point of view, involving dis-
tributed file systems, scheduling, virtualization, and mechanisms for ensuring high reliability.
The LU factorization operation is one that has great opportunity for cache reuse, since it is based on the matrix-
matrix multiplication kernel discussed in section 1.4.2. As a result, the LINPACK benchmark is likely to run at a
substantial fraction of the peak speed of the machine. Another way of phrasing this is to say that the LINPACK
benchmark is CPU-bound.
Typical efficiency figures are between 60 and 90 percent. However, it should be noted that many scientific codes
do not feature the dense linear solution kernel, so the performance on this benchmark is not indicative of the
performance on a typical code. Linear system solution through iterative methods (section 5.5) is much less efficient,
being dominated by the bandwidth between CPU and memory (‘bandwidth bound’).
One implementation of the LINPACK benchmark that is often used is ‘High-Performance LINPACK’ (http:
//www.netlib.org/benchmark/hpl/), which has several parameters such as blocksize that can be chosen
to tune the performance.
Chapter 3
Computer Arithmetic
Of the various types of data that one normally encounters, the ones we are concerned with in the context of
scientific computing are the numerical types: integers (or whole numbers) . . . , −2, −1, 0, 1, 2, . . ., real numbers
0, 1, −1.5, 2/3, √2, log 10, . . ., and complex numbers 1 + 2i, 3 − 5i, . . .. Computer memory is organized to
give only a certain amount of space to represent each number, in multiples of bytes, each containing 8 bits. Typical
values are 4 bytes for an integer, 4 or 8 bytes for a real number, and 8 or 16 bytes for a complex number.
Since only a certain amount of memory is available to store a number, it is clear that not all numbers of a certain
type can be stored. For instance, for integers only a range is stored. In the case of real numbers, even storing a
range is not possible since any interval [a, b] contains infinitely many numbers. Therefore, any representation of
real numbers will cause gaps between the numbers that are stored. As a result, any computation that results in a
number that is not representable will have to be dealt with by issuing an error or by approximating the result. In this
chapter we will look at the ramifications of such approximations of the ‘true’ outcome of numerical calculations.
3.1 Integers
In scientific computing, most operations are on real numbers. Computations on integers rarely add up to any serious
computation load1 . It is mostly for completeness that we start with a short discussion of integers.
Integers are commonly stored in 16, 32, or 64 bits, with 16 becoming less common and 64 becoming more and
more so. The main reason for this increase is not the changing nature of computations, but the fact that integers
are used to index arrays. As the size of data sets grows (in particular in parallel computations), larger indices are
needed. For instance, in 32 bits one can store the numbers zero through 2^{32} − 1 ≈ 4 · 10^9. In other words, a 32-bit
index can address 4 gigabytes of memory. Until recently this was enough for most purposes; these days the need
for larger data sets has made 64-bit indexing necessary.
When we are indexing an array, only positive integers are needed. In general integer computations, of course,
we need to accommodate the negative integers too. There are several ways of implementing negative integers. The
1. Some computations are done on bit strings. We will not mention them at all.
simplest solution is to reserve one bit as a sign bit, and use the remaining 31 (or 15 or 63; from now on we will
consider 32 bits the standard) bits to store the absolute magnitude.

bitstring                     00···0 ... 01···1   |   10···0 ... 11···1
interpretation as bitstring   0 ... 2^{31} − 1    |   2^{31} ... 2^{32} − 1
interpretation as integer     0 ... 2^{31} − 1    |   −0 ... −(2^{31} − 1)
This scheme has some disadvantages, one being that there is both a positive and negative number zero. This means
that a test for equality becomes more complicated than simply testing for equality as a bitstring.
The scheme that is used most commonly is called 2’s complement, where integers are represented as follows.
• If 0 ≤ m ≤ 2^{31} − 1, the normal bit pattern for m is used.
• If 1 ≤ n ≤ 2^{31}, then −n is represented by the bit pattern for 2^{32} − n.

bitstring                     00···0 ... 01···1   |   10···0 ... 11···1
interpretation as bitstring   0 ... 2^{31} − 1    |   2^{31} ... 2^{32} − 1
interpretation as integer     0 ... 2^{31} − 1    |   −2^{31} ... −1
Some observations:
• There is no overlap between the bit patterns for positive and negative integers, in particular, there is only
one pattern for zero.
• The positive numbers have a leading bit zero, the negative numbers have the leading bit set.
Exercise 3.1. For the ‘naive’ scheme and the 2’s complement scheme for negative numbers, give pseu-
docode for the comparison test m < n, where m and n are integers. Be careful to distinguish
between all cases of m, n positive, zero, or negative.
Adding two numbers with the same sign, or multiplying two numbers of any sign, may lead to a result that is too
large or too small to represent. This is called overflow.
Exercise 3.2. Investigate what happens when you perform such a calculation. What does your compiler
say if you try to write down a nonrepresentable number explicitly, for instance in an assignment
statement?
In exercise 3.1 above you explored comparing two integers. Let us now explore how subtracting numbers in two’s
complement is implemented. Consider 0 ≤ m ≤ 2^{31} − 1 and 1 ≤ n ≤ 2^{31} and let us see what happens in the
computation of m − n.
Suppose we have an algorithm for adding and subtracting unsigned 32-bit numbers. Can we use that to subtract
two’s complement integers? We start by observing that the integer subtraction m − n becomes the unsigned addition
m + (2^{32} − n).
• Case: m < n. Since m + (2^{32} − n) = 2^{32} − (n − m), and 1 ≤ n − m ≤ 2^{31}, we conclude that
2^{32} − (n − m) is a valid bit pattern. Moreover, it is the bit pattern representation of the negative number
m − n, so we can indeed compute m − n as an unsigned operation on the bitstring representations of m
and n.
• Case: m > n. Here we observe that m + (2^{32} − n) = 2^{32} + m − n. Since m − n > 0, this is a
number > 2^{32} and therefore not a legitimate representation of a negative number. However, if we store
this number in 33 bits, we see that it is the correct result m − n, plus a single bit in the 33rd position.
Thus, by performing the unsigned addition, and ignoring the overflow bit, we again get the correct result.
In both cases we conclude that we can perform subtraction by adding the bitstrings that represent the positive and
negative number as unsigned integers, and ignoring overflow if it occurs.
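A minimal C sketch of this observation; it assumes the common case of a 32-bit int with two's complement representation, and that converting the unsigned result back to a signed integer wraps around, as it does on mainstream platforms:

#include <stdio.h>

/* Hedged illustration: subtract two's complement integers by adding the
   unsigned bit patterns and ignoring the overflow bit. Assumes a 32-bit
   int with two's complement representation. */
int main(void) {
  int m = 100, n = 250;               /* we want m - n = -150                */
  unsigned int um = (unsigned int)m;  /* bit pattern of m                    */
  unsigned int un = (unsigned int)-n; /* bit pattern of 2^32 - n             */
  unsigned int sum = um + un;         /* unsigned addition; a carry out of
                                         the 32nd bit is silently discarded  */
  printf("m - n computed by unsigned addition: %d\n", (int)sum);
  return 0;
}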
We introduce a base, a small integer number (10 in the preceding example, and 2 in computer numbers), and write
numbers in terms of it as a sum of t terms: a mantissa of t digits, scaled by a power of the base.
3.2.2 Limitations
Since we use only a finite number of bits to store floating point numbers, not all numbers can be represented. The
ones that can not be represented fall into two categories: those that are too large or too small (in some sense), and
those that fall in the gaps. Numbers can be too large or too small in the following ways.
Overflow The largest number we can store is (1 − β^{−t−1})β^U, and the smallest number (in an absolute sense) is
−(1 − β^{−t−1})β^U; anything larger than the former or smaller than the latter causes a condition called
overflow.
Underflow The number closest to zero is β^{−t−1} · β^L. A computation that has a result less than that (in absolute value)
causes a condition called underflow. In fact, most computers use normalized floating point numbers: the
first digit d_1 is taken to be nonzero; see section 3.2.3 for more about this. In this case, any number less
than β^{−1} · β^L causes underflow. Trying to compute a number less than that is sometimes handled by
using unnormalized floating point numbers (a process known as gradual underflow), but this is typically
tens or hundreds of times slower than computing with regular floating point numbers. At the time of this
writing, only the IBM Power6 has hardware support for gradual underflow.
The fact that only a small number of real numbers can be represented exactly is the basis of the field of round-off
error analysis. We will study this in some detail in the following sections.
For detailed discussions, see the book by Overton [69]; it is easy to find online copies of the essay by Gold-
berg [43]. For extensive discussions of round-off error analysis in algorithms, see the books by Higham [53] and
Wilkinson [86].
Normalized numbers are stored in the form 1.d_1d_2 . . . d_t × 2^{exp}.
We can now be a bit more precise about the representation error. A machine number x̃ is the representation for all x
in an interval around it. With t digits in the mantissa, this is the interval of numbers that differ from x̃ in the t + 1st
digit. For the mantissa part we get:

x ∈ [ x̃, x̃ + β^{−t} )                       in case of truncation
x ∈ [ x̃ − β^{−t}/2, x̃ + β^{−t}/2 )          in case of rounding
Often we are only interested in the order of magnitude of the error, and we will write x̃ = x(1 + ε), where |ε| ≤ β^{−t}.
This maximum relative error is called the machine precision, or sometimes machine epsilon. Typical values are:

ε ≈ 10^{−7}    32-bit single precision
ε ≈ 10^{−16}   64-bit double precision
Machine precision can be defined another way: ε is the smallest number that can be added to 1 so that 1 + ε has
a different representation than 1. A small example shows how aligning exponents can shift a too small operand so
that it is effectively ignored in the addition:

    1.0000  × 10^0                 1.0000   × 10^0
  + 1.0000  × 10^{−5}     ⇒      + 0.00001  × 10^0
                                 = 1.0000   × 10^0
Yet another way of looking at this is to observe that, in the addition x + y, if the ratio of x and y is too large, the
result will be identical to x.
The machine precision is the maximum attainable accuracy of computations: it does not make sense to ask for more
than 6-or-so digits accuracy in single precision, or 15 in double.
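A small C sketch illustrates this absorption effect; the increments 10^{−8} and 10^{−17} are chosen just below the single and double precision values of ε quoted above:

#include <stdio.h>

int main(void) {
  float  xs = 1.0f + 1.0e-8f;  /* 1e-8 is below the single precision epsilon (~1e-7)   */
  double xd = 1.0  + 1.0e-17;  /* 1e-17 is below the double precision epsilon (~1e-16) */
  printf("single precision: 1 + 1e-8  == 1 ? %s\n", xs == 1.0f ? "yes" : "no");
  printf("double precision: 1 + 1e-17 == 1 ? %s\n", xd == 1.0  ? "yes" : "no");
  return 0;
}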
Exercise 3.3. Write a small program that computes the machine epsilon. Does it make any difference
if you set the compiler optimization levels low or high? Can you find other ways in which this
computation goes wrong?
2. IEEE 754 is a standard for binary arithmetic; there is a further standard, IEEE 854, that allows decimal arithmetic.
3. Computer systems can still differ as to how to store successive bytes. If the least significant byte is stored first, the system is called
little-endian; if the most significant byte is stored first, it is called big-endian. See http://en.wikipedia.org/wiki/Endianness
for details.
The standard also declared the rounding behaviour to be ‘exact rounding’: the result of an operation should be the
rounded version of the exact result.
Above (section 3.2.2), we have seen the phenomena of overflow and underflow, that is, operations leading to unrepresentable
numbers. There is a further exceptional situation that needs to be dealt with: what result should be
returned if the program asks for illegal operations such as √−4? The IEEE 754 standard has two special quantities
for this: Inf and NaN for ‘infinity’ and ‘not a number’. If NaN appears in an expression, the whole expression will
evaluate to that value. The rule for computing with Inf is a bit more complicated [43].
An inventory of the meaning of all bit patterns in IEEE 754 double precision is given in figure 3.1. Note that for
normalized numbers the first nonzero digit is a 1, which is not stored, so the bit pattern d1 d2 . . . dt is interpreted as
1.d1 d2 . . . dt .
These days, almost all processors adhere to the IEEE 754 standard, with only occasional exceptions. For instance,
Nvidia Tesla GPUs are not standard-conforming in single precision; see http://en.wikipedia.
org/wiki/Nvidia_Tesla. The justification for this is that double precision is the ‘scientific’ mode, while
single precision is most likely used for graphics, where exact compliance matters less.
Let us consider an example in decimal arithmetic, that is, β = 10, and with a 3-digit mantissa: t = 3. The number
x = .1256 has a representation that depends on whether we round or truncate: x̃_round = .126, x̃_truncate = .125.
The error is in the 4th digit: if ε = x − x̃ then |ε| < β^{−t}.
Exercise 3.4. The number in this example had no exponent part. What are the error and relative error if
there had been one?
3.3.3 Addition
Addition of two floating point numbers is done in a couple of steps. First the exponents are aligned: the smaller of
the two numbers is written to have the same exponent as the larger number. Then the mantissas are added. Finally,
the result is adjusted so that it again is a normalized number.
As an example, consider .100 + .200 × 10^{−2}. Aligning the exponents, this becomes .100 + .002 = .102, and this
result requires no final adjustment. We note that this computation was exact, but the sum .100 + .255 × 10^{−2} has
the same result, and here the computation is clearly not exact. The error is |.10255 − .102| < 10^{−3}, and we note
that the mantissa has 3 digits, so there clearly is a relation with the machine precision here.
In the example .615 × 10^1 + .398 × 10^1 = 1.013 × 10^1 = .101 × 10^2 we see that after addition of the mantissas
an adjustment of the exponent is needed. The error again comes from truncating or rounding the first digit of the
result that does not fit in the mantissa: if x is the true sum and x̃ the computed sum, then x̃ = x(1 + ε) where, with
a 3-digit mantissa, |ε| < 10^{−3}.
Formally, let us consider the computation of s = x_1 + x_2, and we assume that the numbers x_i are represented as
x̃_i = x_i(1 + ε_i). Then the sum s is represented as

s̃ = (x̃_1 + x̃_2)(1 + ε_3)
  = x_1(1 + ε_1)(1 + ε_3) + x_2(1 + ε_2)(1 + ε_3)
  ≈ x_1(1 + ε_1 + ε_3) + x_2(1 + ε_2 + ε_3)
  ≈ s(1 + 2ε)

under the assumptions that all ε_i are small and of roughly equal size ε, and that both x_i > 0. We see that the relative
errors are added under addition.
3.3.4 Multiplication
Floating point multiplication, like addition, involves several steps. In order to multiply two numbers .m1 × β e1
and .m2 × β e2 , the following steps are needed.
• The exponents are added: e ← e1 + e2 .
• The mantissas are multiplied: m ← m1 × m2 .
• The mantissa is normalized, and the exponent adjusted accordingly.
For example: .123 · 10^0 × .567 · 10^1 = .069741 · 10^1 → .69741 · 10^0 → .697 · 10^0.
What happens with relative errors?
3.3.5 Subtraction
Subtraction behaves very differently from addition. Whereas in addition errors are added, giving only a gradual
increase of overall roundoff error, subtraction has the potential for greatly increased error in a single operation.
For example, consider subtraction with 3 digits to the mantissa: .124 − .123 = .001 → .100 · 10^{−2}. While the result
is exact, it has only one significant digit4 . To see this, consider the case where the first operand .124 is actually the
rounded result of a computation that should have resulted in .1235. In that case, the result of the subtraction should
have been .050 · 10^{−2}, that is, there is a 100% error, even though the relative error of the inputs was as small as
could be expected. Clearly, subsequent operations involving the result of this subtraction will also be inaccurate.
We conclude that subtracting almost equal numbers is a likely cause of numerical roundoff.
There are some subtleties about this example. Subtraction of almost equal numbers is exact, and we have the correct
rounding behaviour of IEEE arithmetic. Still, the correctness of a single operation does not imply that a sequence
4. Normally, a number with 3 digits to the mantissa suggests an error corresponding to rounding or truncating the fourth digit. We say
that such a number has 3 significant digits. In this case, the last two digits have no meaning, resulting from the normalization process.
of operations containing it will be accurate. While the addition example showed only modest decrease of numerical
accuracy, the cancellation in this example can have disastrous effects.
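A small C sketch of this effect: the expression (1 + x) − 1 is mathematically equal to x, but for a tiny x most significant digits are lost because 1 + x is rounded before the subtraction:

#include <stdio.h>

int main(void) {
  double x = 1.0e-15;
  double y = (1.0 + x) - 1.0;   /* mathematically equal to x */
  printf("x              = %.16e\n", x);
  printf("(1 + x) - 1    = %.16e\n", y);
  printf("relative error = %.2e\n", (y - x) / x);
  return 0;
}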
3.3.6 Examples
From the above, the reader may have gotten the impression that roundoff errors only lead to serious problems in exceptional
circumstances. In this section we will discuss some very practical examples where the inexactness of computer
arithmetic becomes visible in the result of a computation. These will be fairly simple examples; more complicated
examples exist that are outside the scope of this book, such as the instability of matrix inversion. The interested
reader is referred to [86, 53].
As an example, consider computing the sum Σ_{n=1}^{10000} 1/n² in single precision, which on most computers
means a machine precision of 10^{−7}. The problem with this example is that both the ratio between terms, and the
ratio of terms to partial sums, is ever increasing. In section 3.2.3 we observed that a too large ratio can lead to one
operand of an addition in effect being ignored.
If we sum the series in the sequence it is given, we observe that the first term is 1, so all partial sums (Σ_{n≤N} 1/n² where
N < 10000) are at least 1. This means that any term where 1/n² < 10^{−7} gets ignored since it is less than the
machine precision. Specifically, the last 7000 terms are ignored, and the computed sum is 1.644725. The first 4
digits are correct.
However, if we evaluate the sum in reverse order we obtain the exact result in single precision. We are still adding
small quantities to larger ones, but now the ratio will never be as bad as one-to-ε, so the smaller number is never
ignored. To see this, consider the ratio of two subsequent terms:

n² / (n − 1)² = n² / (n² − 2n + 1) = 1 / (1 − 2/n + 1/n²) ≈ 1 + 2/n
Since we only sum 10000 terms and the machine precision is 10^{−7}, in the addition 1/n² + 1/(n − 1)² the second term
will not be wholly ignored as it is when we sum from large to small.
Exercise 3.6. There is still a step missing in our reasoning. We have shown that in adding two subse-
quent terms, the smaller one is not ignored. However, during the calculation we add partial
sums to the next term in the sequence. Show that this does not worsen the situation.
The lesson here is that series that are monotone (or close to monotone) should be summed from small to large, since
the error is minimized if the quantities to be added are closer in magnitude. Note that this is the opposite strategy
from the case of subtraction, where operations involving similar quantities lead to larger errors. This implies that
if an application asks for adding and subtracting series of numbers, and we know a priori which terms are positive
and negative, it may pay off to rearrange the algorithm accordingly.
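A small C sketch of the summation example above, assuming the series Σ_{n=1}^{10000} 1/n² discussed there, summed in single precision in both orders:

#include <stdio.h>

int main(void) {
  const int N = 10000;
  float forward = 0.0f, backward = 0.0f;
  int n;
  for (n = 1; n <= N; n++)        /* large terms first: later terms get absorbed */
    forward += 1.0f / ((float)n * (float)n);
  for (n = N; n >= 1; n--)        /* small terms first: comparable magnitudes    */
    backward += 1.0f / ((float)n * (float)n);
  printf("summed forward : %.7f\n", forward);
  printf("summed backward: %.7f\n", backward);  /* closer to the true value 1.6448341 */
  return 0;
}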
If we write ỹ_n − y_n = ε_n for the error in the n-th computed value, then ε_n ≥ 5ε_{n−1}. The error made by this computation shows exponential growth.
3.3.6.4 Linear system solving
Sometimes we can make statements about the numerical precision of a problem even without specifying what
algorithm we use. Suppose we want to solve a linear system, that is, we have an n × n matrix A and a vector b of
size n, and we want to compute the vector x such that Ax = b. (We will actually consider algorithms for this in
chapter 5.) Since the vector b will be the result of some computation or measurement, we are actually dealing with a
vector b̃, which is some perturbation of the ideal b:
b̃ = b + ∆b.
The perturbation vector ∆b can be of the order of the machine precision if it only arises from representation error,
or it can be larger, depending on the calculations that produced b̃.
We now ask what the relation is between the exact value of x, which we would have obtained from doing an exact
calculation with A and b, which is clearly impossible, and the computed value x̃, which we get from computing
with A and b̃. (In this discussion we will assume that A itself is exact, but this is a simplification.)
Writing x̃ = x + ∆x, the result of our computation is now

Ax̃ = b̃

or

A(x + ∆x) = b + ∆b.

Since Ax = b, we get A∆x = ∆b. From this, we get (see appendix A.1 for details)

‖∆x‖/‖x‖ ≤ ‖A‖ ‖A^{−1}‖ ‖∆b‖/‖b‖.    (3.2)
The quantity ‖A‖‖A^{−1}‖ is called the condition number of a matrix. The bound (3.2) then says that any perturbation
in the right hand side can lead to a perturbation in the solution that is at most larger by the condition number of the
matrix A. Note that it does not say that the perturbation in x needs to be anywhere close to that size, but we can not
rule it out, and in some cases it indeed happens that this bound is attained.
Suppose that b is exact up to machine precision, and the condition number of A is 10^4. The bound (3.2) is often
interpreted as saying that the last 4 digits of x are unreliable, or that the computation ‘loses 4 digits of accuracy’.
Fortran In Fortran it is possible to specify the number of bytes that a number takes up: INTEGER*2, REAL*8.
Often it is possible to write a code using only INTEGER, REAL, and use compiler flags to indicate the
size of an integer and real number.
C In C, the type identifiers have no standard length. For integers there is short int, int, long
int, and for floating point float, double. The sizeof() operator gives the number of bytes
used to store a datatype.
C99, Fortran2003 Recent standards of the C and Fortran languages incorporate the C/Fortran interoperability stan-
dard, which can be used to declare a type in one language so that it is compatible with a certain type in
the other language.
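A minimal C sketch using the sizeof() operator mentioned above; the printed values depend on the platform and compiler, since the language standard fixes only minimum sizes:

#include <stdio.h>

int main(void) {
  /* The C standard fixes only minimum sizes; the values printed here
     depend on platform and compiler. */
  printf("short int: %zu bytes\n", sizeof(short int));
  printf("int      : %zu bytes\n", sizeof(int));
  printf("long int : %zu bytes\n", sizeof(long int));
  printf("float    : %zu bytes\n", sizeof(float));
  printf("double   : %zu bytes\n", sizeof(double));
  return 0;
}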
Another area where fixed point arithmetic is still used, is in signal processing. In modern CPUs, integer and floating
point operations are of essentially the same speed, but converting between them is relatively slow. Now, if the sine
function is implemented through table lookup, this means that in sin(sin x) the output of a function is used to index
the next function application. Obviously, outputting the sine function in fixed point obviates the need for conversion
between real and integer quantities, which simplifies the chip logic needed, and speeds up calculations.
3.5 Conclusions
In a way, the reason for the error is the imperfection of computer arithmetic: if we could calculate with actual real
numbers there would be no problem. However, if we accept roundoff as a fact of life, then various observations
hold:
• Operations with ‘the same’ outcomes do not behave identically from a point of view of stability; see the ‘abc-
formula’ example.
5. These two header files are not identical, and in fact not compatible. Beware, if you compile C code with a C++ compiler [32].
• Even rearrangements of the same computations do not behave identically; see the summing example.
Thus it becomes imperative to analyze computer algorithms with regard to their roundoff behaviour: does roundoff
increase as a slowly growing function of problem parameters, such as the number of terms evaluated, or is worse
behaviour possible? We will not address such questions in further detail in this book.
Chapter 4
Numerical Treatment of Differential Equations
In this chapter we will look at the numerical solution of Ordinary Differential Equations (ODEs) and Partial Differential
Equations (PDEs). These equations are commonly used in physics to describe phenomena such as the flow of air
around an aircraft, or the bending of a bridge under various stresses. While these equations are often fairly simple,
getting specific numbers out of them (‘how much does this bridge sag if there are a hundred cars on it’) is more
complicated, often taking large computers to produce the desired results. Here we will describe the techniques that
turn ODEs and PDEs into computable problems.
Ordinary differential equations describe how a quantity (either scalar or vector) depends on a single variable. Typi-
cally, this variable denotes time, and the value of the quantity at some starting time is given. This type of equation
is called an Initial Value Problem (IVP).
Partial differential equations describe functions of several variables, usually denoting space dimensions, possibly
also including time. Similar to the starting value in ODEs, PDEs need values in space to give a uniquely determined
solution. These values are called boundary values, and the problem is called a Boundary Value Problem (BVP).
Boundary value problems typically describe static mechanical structures.
Finally, we will consider the ‘heat equation’ which has aspects of both IVPs and BVPs: it describes heat spreading
through a physical object such as a rod. The initial value describes the initial temperature, and the boundary values
give prescribed temperatures at the ends of the rod.
For ease of analysis we will assume that all functions involved have sufficiently many higher derivatives, and that
each derivative is sufficiently smooth.
F = ma
a = d²x/dt² = F/m
it states that acceleration depends linearly on the force exerted on the mass. A closed form description x(t) = . . .
for the location of the mass can sometimes be derived analytically, but in many cases some form of approximation
or numerical computation is needed.
Newton’s equation is an ODE since it describes a function of one variable, time. It is an IVP since it describes the
development in time, starting with some initial conditions. As an ODE, it is ‘of second order’ since it involves a
second derivative. We can reduce this to first order, involving only first derivatives, if we allow vector quantities.
Defining u(t) = (x(t), x′(t)), we find for u:

u′ = Au + B,    A = ( 0  1 ; 0  0 ),    B = ( 0 ; F/m )
For simplicity, in this course we will only consider scalar equations; our reference equation is then

u′(t) = f(t, u(t)),   t > 0,    (4.1)

and in this section we will consider numerical methods for its solution.
Typically, the initial value in some starting point (often chosen as t = 0) is given: u(0) = u0 for some value u0 ,
and we are interested in the behaviour of u as t → ∞. As an example, f(x) = x gives the equation u′(t) = u(t).
This is a simple model for population growth: the equation states that the rate of growth is equal to the size of
the population. The equation (4.1) can be solved analytically for some choices of f , but we will not consider this.
Instead, we only consider the numerical solution and the accuracy of this process.
In a numerical method, we consider discrete size time steps to approximate the solution of the continuous time-
dependent process. Since this introduces a certain amount of error, we will analyze the error introduced in each
time step, and how this adds up to a global error. In some cases, the need to limit the global error will impose
restrictions on the numerical scheme.
in which the right hand side does not explicitly depend on t. A sufficient criterion for stability is:

∂f/∂u > 0 : unstable
∂f/∂u = 0 : neutrally stable
∂f/∂u < 0 : stable

We will often refer to the simple example f(u) = −λu, with solution u(t) = u_0 e^{−λt}. This problem is stable
if λ > 0.
Proof. If u∗ is a zero of f, meaning f(u∗) = 0, then the constant function u(t) ≡ u∗ is a solution of u′ = f(u),
a so-called ‘equilibrium’ solution. We will now consider how small perturbations from the equilibrium behave. Let
u be a solution of the ODE, and write u(t) = u∗ + η(t); then we have

η′(t) = f(u∗ + η) ≈ f(u∗) + ηf′(u∗) = ηf′(u∗),

which means that the perturbation will damp out if f′(u∗) < 0.
u(t + ∆t) = u(t) + u′(t)∆t + u″(t)∆t²/2! + u‴(t)∆t³/3! + ···
This gives for u′:

u′(t) = ( u(t + ∆t) − u(t) )/∆t − u″(t)∆t/2! − ···
or

u′(t) = ( u(t + ∆t) − u(t) )/∆t + O(∆t).
We use this equation to derive a numerical scheme: with t_0 = 0, t_{k+1} = t_k + ∆t = ··· = (k + 1)∆t, and u_k approximating u(t_k),
we get a difference equation
uk+1 = uk + ∆t f (tk , uk ).
The exact solution satisfies

u(t_{k+1}) = u(t_k) + f(t_k, u(t_k))∆t + u″(t_k)∆t²/2! + ···

so the local error, assuming for the moment that u_k = u(t_k), is

L_{k+1} = u_{k+1} − u(t_{k+1}) = u_k − u(t_k) + ∆t( f(t_k, u_k) − f(t_k, u(t_k)) ) − u″(t_k)∆t²/2! + ···
        = −u″(t_k)∆t²/2! + ···
This shows that in each step we make an error of O(∆t2 ). If we assume that these errors can be added, we find a
global error of
E_k ≈ Σ_k L_k ≈ k ∆t²/2! = (k∆t) ∆t/2! = O(∆t)
Since the global error is of first order in ∆t, we call this a ‘first order method’. Note that this error, which measures
the distance between the true and computed solutions, is of lower order than the truncation error, which is the error
in approximating the operator.
For the example f(u) = −λu, the explicit Euler scheme gives u_{k+1} = (1 − λ∆t)u_k, so the computed solution decays, as the true solution does, only if

|1 − λ∆t| < 1
⇔ −1 < 1 − λ∆t < 1
⇔ −2 < −λ∆t < 0
⇔ 0 < λ∆t < 2
⇔ ∆t < 2/λ
We see that the stability of the numerical solution scheme depends on the value of ∆t: the scheme is only stable if
∆t is small enough3 . For this reason, we call the explicit Euler method conditionally stable. Note that the stability
of the differential equation and the stability of the numerical scheme are two different questions. The continuous
problem is stable if λ > 0; the numerical problem has an additional condition that depends on the discretization
scheme used.
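To see the conditional stability in practice, here is a hedged C sketch for the test problem u′ = −λu with λ = 10, using one time step below and one above the bound 2/λ; the specific numbers are only illustrative:

#include <stdio.h>

/* Explicit Euler for u' = -lambda u, u(0) = 1, whose exact solution decays.
   The iteration is u_{k+1} = (1 - lambda dt) u_k, so it diverges when
   dt > 2/lambda. */
int main(void) {
  double lambda = 10.0;
  double dts[2] = {0.15, 0.25};   /* 0.15 < 2/lambda = 0.2 < 0.25 */
  int c, k;
  for (c = 0; c < 2; c++) {
    double dt = dts[c], u = 1.0;
    for (k = 0; k < 20; k++)
      u = u + dt * (-lambda * u);
    printf("dt = %.2f: u after 20 steps = %e\n", dt, u);
  }
  return 0;
}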
u(t − ∆t) = u(t) − u′(t)∆t + u″(t)∆t²/2! − ···

which implies

u′(t) = ( u(t) − u(t − ∆t) )/∆t + u″(t)∆t/2 + ···
3. There is a second condition limiting the time step. This is known as the Courant-Friedrichs-Lewy condition http://en.
wikipedia.org/wiki/CourantFriedrichsLewy_condition. It describes the notion that in the exact problem u(x, t) depends
on a range of u(x0 , t − ∆t) values; the time step of the numerical method has to be small enough that the numerical solution takes all these
points into account.
As before, we take the equation u′(t) = f(t, u(t)) and turn it into a computable form by replacing u′(t) by a
difference formula:

( u(t) − u(t − ∆t) )/∆t = f(t, u(t))   ⇒   u(t) = u(t − ∆t) + ∆t f(t, u(t))
Taking fixed points u_k ≡ u(k∆t), this gives a difference equation

u_{k+1} = u_k + ∆t f(t_{k+1}, u_{k+1}).
An important difference with the explicit scheme is that u_{k+1} now also appears on the right hand side of the
equation. That is, computation of u_{k+1} is now implicit. For example, let f(t, u) = −u³, then u_{k+1} = u_k − ∆t u³_{k+1}.
This needs a way to solve a nonlinear equation; typically this can be done with Newton iterations, as sketched below.
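As a sketch of what such a Newton iteration could look like for the example u′ = −u³ above (the tolerance, iteration limit, and step size are arbitrary illustrative choices):

#include <stdio.h>
#include <math.h>

/* One implicit Euler step for u' = -u^3: solve v + dt*v^3 - u_k = 0 for
   v = u_{k+1} with Newton's method, starting from the previous value. */
static double implicit_euler_step(double uk, double dt) {
  double v = uk;
  int it;
  for (it = 0; it < 20; it++) {
    double g  = v + dt * v * v * v - uk;   /* residual       */
    double dg = 1.0 + 3.0 * dt * v * v;    /* its derivative */
    double vnew = v - g / dg;              /* Newton update  */
    if (fabs(vnew - v) < 1e-14) return vnew;
    v = vnew;
  }
  return v;
}

int main(void) {
  double u = 1.0, dt = 0.5;
  int k;
  for (k = 0; k < 5; k++) {
    u = implicit_euler_step(u, dt);
    printf("t = %.1f  u = %f\n", (k + 1) * dt, u);
  }
  return 0;
}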
For the example f(u) = −λu, the implicit scheme gives u_{k+1} = u_k − λ∆t u_{k+1}, so

u_{k+1} = u_k / (1 + λ∆t) = u_0 / (1 + λ∆t)^{k+1}
If λ > 0, which is the condition for a stable equation, we find that uk → 0 for all values of λ and ∆t. This method
is called ‘unconditionally stable’. The main advantage of an implicit method over an explicit one is clearly the
stability: it is possible to take larger time steps without worrying about unphysical behaviour. Of course, large time
steps can make convergence to the steady state slower, but at least there will be no divergence.
On the other hand, implicit methods are more complicated. As you saw above, they can involve nonlinear systems
to be solved in every time step. Later, you will see an example where the implicit method requires the solution of a
system of equations.
Exercise 4.1. Analyse the accuracy and computational aspects of the following scheme for the IVP
u′(x) = f(x):
which corresponds to adding the Euler explicit and implicit schemes together. You do not have
to analyze the stability of this scheme.
Exercise 4.2. Consider the initial value problem y 0 (t) = y(t)(1 − y(t)). Observe that y ≡ 0 and y ≡ 1
are solutions. These are called ‘equilibrium solutions’.
1. A solution is stable, if perturbations ‘converge back to the solution’, meaning that for
small enough,
and
3. Write a small program to investigate the behaviour of the numerical solution under var-
ious choices for ∆t. Include program listing and a couple of runs in your homework
submission.
4. You see from running your program that the numerical solution can oscillate. Derive a
condition on ∆t that makes the numerical solution monotone. It is enough to show that
yk < 1 ⇒ yk+1 < 1, and yk > 1 ⇒ yk+1 > 1.
5. Now formulate an implicit method, and show that yk+1 is easily computed from yk . Write
a program, and investigate the behaviour of the numerical solution under various choices
for ∆t.
6. Show that the numerical solution of the implicit scheme is monotone for all choices
of ∆t.
4. Actually, the boundary conditions can be more general, involving derivatives on the interval end points.
−u_{xx}(x̄) − u_{yy}(x̄) = f(x̄) for x̄ ∈ Ω = [0, 1]² with u(x̄) = u_0 on δΩ.    (4.5)
in two space dimensions. Here, δΩ is the boundary of the domain Ω. Since we prescribe the value of u on the
boundary, such a problem is called a Boundary Value Problem (BVP).
There are several types of PDE, each with distinct mathematical properties. The most important property is that of
region of influence: if we tinker with the problem so that the solution changes in one point, what other points will
be affected.
• Elliptic PDEs have the form
Au_{xx} + Bu_{yy} + lower order terms = f,
where A, B > 0. They are characterized by the fact that all points influence each other. These equations
often describe phenomena in structural mechanics, such as a beam or a membrane. It is intuitively clear
that pressing down on any point of a membrane will change the elevation of every other point, no matter
how little.
• Hyperbolic PDEs are of the form
Au_{xx} + Bu_{yy} + lower order terms = f,
with A, B of opposite sign. Such equations describe wave phenomena. Intuitively, changing the
solution at any point will only change certain future points. Waves have a propagation speed that makes
it impossible for a point to influence points in the future that are too far away.
• Parabolic PDEs are of the form
and they describe diffusion-like phenomena. The best way to characterize them is to consider that the
solution in each point in space and time is influenced by a certain finite region at each previous point in
space.
The operator ∆ defined by

∆u = u_{xx} + u_{yy}

is a second order differential operator, and equation (4.5) a second-order PDE. Specifically, equation (4.5) is called
the Poisson equation. Second order PDEs are quite common, describing many phenomena in fluid and heat flow,
and structural mechanics.
In order to find a numerical scheme we use Taylor series as before, expressing u(x + h) or u(x − h) in terms of u
and its derivatives at x. Let h > 0, then
u(x + h) = u(x) + u′(x)h + u″(x)h²/2! + u‴(x)h³/3! + u^{(4)}(x)h⁴/4! + u^{(5)}(x)h⁵/5! + ···
and
u(x − h) = u(x) − u′(x)h + u″(x)h²/2! − u‴(x)h³/3! + u^{(4)}(x)h⁴/4! − u^{(5)}(x)h⁵/5! + ···
Our aim is now to approximate u00 (x). We see that the u0 terms in these equations would cancel out under addition,
leaving 2u(x):
u(x + h) + u(x − h) = 2u(x) + u″(x)h² + u^{(4)}(x)h⁴/12 + ···
so
−u″(x) = ( 2u(x) − u(x + h) − u(x − h) )/h² + u^{(4)}(x)h²/12 + ···    (4.6)
The basis for a numerical scheme for (4.4) is then the observation
( 2u(x) − u(x + h) − u(x − h) )/h² = f(x, u(x), u′(x)) + O(h²),
which shows that we can approximate the differential operator by a difference operator, with an O(h2 ) truncation
error.
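A small C check of this statement, using u(x) = sin x so that −u″(x) = sin x is known exactly; dividing h by 10 should reduce the error by roughly a factor 100:

#include <stdio.h>
#include <math.h>

/* Check the O(h^2) truncation error of (2u(x) - u(x+h) - u(x-h))/h^2 as an
   approximation to -u''(x), for u(x) = sin(x). */
int main(void) {
  double x = 1.0, exact = sin(x), h;
  for (h = 0.1; h > 0.5e-3; h /= 10.0) {
    double approx = (2.0 * sin(x) - sin(x + h) - sin(x - h)) / (h * h);
    printf("h = %8.1e   error = %8.1e\n", h, fabs(approx - exact));
  }
  return 0;
}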
To derive a numerical method, we divide the interval [0, 1] into equally spaced points: x_k = kh where h = 1/n and
k = 0 . . . n. With these, the Finite Difference (FD) formula (4.6) becomes a system of equations:

2u_k − u_{k+1} − u_{k−1} = h² f(x_k)   for k = 1, . . . , n − 1.

For most values of k this equation relates the unknown u_k to the unknowns u_{k−1} and u_{k+1}. The exceptions are k = 1
and k = n − 1. In that case we recall that u_0 and u_n are known boundary conditions, and we write the equations as
2u_1 − u_2 = h²f(x_1) + u_0
2u_{n−1} − u_{n−2} = h²f(x_{n−1}) + u_n
This has the form Au = f with A a fully known matrix, f a fully known vector, and u a vector of unknowns. Note
that the right hand side vector has the boundary values of the problem in the first and last locations. This means that,
if you want to solve the same differential equation with different boundary conditions, only the vector f changes.
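A minimal C sketch of this assembly for a hypothetical right hand side f(x) = sin(πx); the dense storage and the parameter values are for illustration only:

#include <stdio.h>
#include <math.h>

#define N 10   /* number of subintervals; the unknowns are u_1 .. u_{N-1} */

/* Assemble 2u_k - u_{k-1} - u_{k+1} = h^2 f(x_k) for -u'' = f on [0,1],
   with the Dirichlet values u0 and un moved into the right hand side. */
int main(void) {
  const double pi = 3.141592653589793;
  double h = 1.0 / N, u0 = 0.0, un = 1.0;
  double A[N - 1][N - 1] = {{0.0}}, f[N - 1];
  int k;
  for (k = 1; k < N; k++) {
    A[k - 1][k - 1] = 2.0;                   /* diagonal            */
    if (k > 1)     A[k - 1][k - 2] = -1.0;   /* coupling to u_{k-1} */
    if (k < N - 1) A[k - 1][k]     = -1.0;   /* coupling to u_{k+1} */
    f[k - 1] = h * h * sin(pi * k * h);      /* h^2 f(x_k)          */
  }
  f[0]     += u0;                            /* boundary data       */
  f[N - 2] += un;
  printf("assembled %d equations with mesh width h = %g\n", N - 1, h);
  return 0;
}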
Exercise 4.3. A condition of the type u(0) = u0 is called a Dirichlet boundary condition. Physically,
this corresponds to knowing the temperature at the end point of a rod. Other boundary con-
ditions exist. Specifying a value for the derivative, u′(0) = u′_0, rather than for the function
value, would be appropriate if we are modeling fluid flow and the outflow rate at x = 0 is
known. This is known as a Neumann boundary condition.
A Neumann boundary condition u′(0) = u′_0 can be modeled by stating

(u_0 − u_1)/h = u′_0.

Show that, unlike in the case of the Dirichlet boundary condition, this affects the matrix of the
linear system.
Show that having a Neumann boundary condition at both ends gives a singular matrix, and
therefore no unique solution to the linear system. (Hint: guess the vector that has eigenvalue
zero.)
Physically this makes sense. For instance, in an elasticity problem, Dirichlet boundary con-
ditions state that the rod is clamped at a certain height; a Neumann boundary condition only
states its angle at the end points, which leaves its height undetermined.
Let us list some properties of A that you will see later are relevant for solving such systems of equations:
• The matrix is very sparse: the percentage of elements that is nonzero is low. The nonzero elements are
not randomly distributed but located in a band around the main diagonal. We call this a band matrix in
general, and a tridiagonal matrix in this specific case.
• The matrix is symmetric. This property does not hold for all matrices that come from discretizing BVPs,
but it is true if there are no odd order derivatives, such as ux , uxxx , uxy .
• Matrix elements are constant in each diagonal, that is, in each set of points {(i, j) : i − j = c} for some c.
This is only true for very simple problems. It is no longer true if the differential equation has location
dependent terms such as d/dx ( a(x) d/dx u(x) ). It is also no longer true if we make h variable through the
interval, for instance because we want to model behaviour around the left end point in more detail.
• Matrix elements conform to the following sign pattern: the diagonal elements are positive, and the off-
diagonal elements are nonpositive. This property depends on the numerical scheme used, but it is often
true. Together with the following property of definiteness, this is called an M-matrix. There is a whole
mathematical theory of these matrices [15].
• The matrix is positive definite: x^t A x > 0 for all nonzero vectors x. This property is inherited from the
original continuous problem, if the numerical scheme is carefully chosen. While the use of this may not
seem clear at the moment, later you will see methods for solving the linear system that depend on it.
Strictly speaking the solution of equation (4.8) is simple: u = A−1 f . However, computing A−1 is not the best way
of computing u. As observed just now, the matrix A has only 3N nonzero elements to store. Its inverse, on the other
hand, does not have a single zero element. Although we will not prove it, this sort of statement holds for most sparse
matrices. Therefore, we want to solve Au = f in a way that does not require O(n²) storage.
Let again h = 1/n and define xi = ih and yj = jh. Our discrete equation becomes
4u_{ij} − u_{i+1,j} − u_{i−1,j} − u_{i,j+1} − u_{i,j−1} = h² f_{ij}.    (4.11)
We now have n × n unknowns u_{ij} which we can put in a linear ordering by defining I = I_{ij} = i + j × n. This
gives us N = n² equations

4u_I − u_{I+1} − u_{I−1} − u_{I+n} − u_{I−n} = h² f_I.    (4.12)
Figure 4.1 shows how equation (4.11) connects domain points.
If we take the set of equation obtained by letting I in equation (4.12) range through the domain Ω, we find a linear
system Au = f of size N with a special structure:
A =
(  4  −1                 −1                         )
( −1   4  −1                 −1                     )
(      ⋱    ⋱    ⋱                ⋱                )
(           −1    4                   −1            )    (4.13)
( −1                   4  −1                −1      )
(      −1             −1   4  −1                ⋱   )
(           ⋱               ⋱    ⋱    ⋱             )
The matrix is again banded, but unlike in the one-dimensional case, there are zeros inside the band. (This has some
important consequences when we try to solve this system; see section 5.4.3.) Because the matrix has five nonzero
diagonals, it is said to be of penta-diagonal structure.
You can also put a block structure on the matrix, by grouping the unknowns together that are in one row of the
domain. This is called a block matrix, and, on the block level, it has a tridiagonal structure, so we call this a block
tridiagonal matrix. Note that the diagonal blocks themselves are tridiagonal; the off-diagonal blocks are minus the
identity matrix.
This matrix, like the one-dimensional example above, has constant diagonals, but this is again due to the simple
nature of the problem. In practical problems it will not be true. That said, such ‘constant coefficient’ problems occur,
and when they are on rectangular domains, there are very efficient methods for solving the linear system with N log N
complexity.
Exercise 4.4. The block structure of the matrix, with all diagonal blocks having the same size, is due
to the fact that we defined our BVP on a square domain. Sketch the matrix structure that arises
from discretizing equation (4.9), again with central differences, but this time defined on a
triangular domain. Show that, again, there is a block tridiagonal structure, but that the blocks
are now of varying sizes.
For domains that are even more irregular, the matrix structure will also be irregular.
The regular block structure is also caused by our decision to order the unknowns by rows and columns. This is
known as the natural ordering or lexicographic ordering; various other orderings are possible. One common way
of ordering the unknowns is the red-black ordering or checkerboard ordering which has advantages for parallel
computation. This will be discussed in section 6.7.2.
There is more to say about analytical aspects of the BVP (for instance, how smooth is the solution and how does
that depend on the boundary conditions?) but those questions are outside the scope of this course. In the chapter on
linear algebra, we will come back to the BVP, since solving the linear system is mathematically interesting.
Another way of looking at the discretized problem is to say that it is obtained by applying the difference stencil

 ·  −1   ·
−1   4  −1
 ·  −1   ·

to the function u. Given a physical domain, we apply the stencil to each point in that domain to derive the equation
for that point. Figure 4.1 illustrates that for a square domain of n × n points. Connecting this figure with equa-
tion (4.13), you see that the connections in the same line give rise to the main diagonal and first upper and lower
offdiagonal; the connections to the next and previous lines become the nonzeros in the off-diagonal blocks.
This particular stencil is often referred to as the ‘5-point star’. There are other difference stencils; the structure of
some of them is depicted in figure 4.2. A stencil with only connections in horizontal or vertical direction is called
a ‘star stencil’, while one that has cross connections (such as the second in figure 4.2) is called a ‘box stencil’.
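To illustrate how the 5-point star is used in computations, here is a hedged C sketch that applies it to a grid function without ever forming the matrix; neighbours outside the domain are taken as zero, corresponding to homogeneous Dirichlet data:

#include <stdio.h>

#define N 4   /* interior grid points in each direction */

/* Apply y = Ax for the 5-point star without forming the matrix (4.13):
   each unknown couples to its four grid neighbours. */
static void apply_five_point(double x[N][N], double y[N][N]) {
  int i, j;
  for (i = 0; i < N; i++)
    for (j = 0; j < N; j++) {
      double v = 4.0 * x[i][j];
      if (i > 0)     v -= x[i - 1][j];
      if (i < N - 1) v -= x[i + 1][j];
      if (j > 0)     v -= x[i][j - 1];
      if (j < N - 1) v -= x[i][j + 1];
      y[i][j] = v;
    }
}

int main(void) {
  double x[N][N], y[N][N];
  int i, j;
  for (i = 0; i < N; i++)
    for (j = 0; j < N; j++)
      x[i][j] = 1.0;                 /* constant input vector */
  apply_five_point(x, y);
  printf("corner value %g, interior value %g\n", y[0][0], y[1][1]);
  return 0;
}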
Exercise 4.5. In the previous section you saw that a red-black ordering of unknowns coupled with
the regular five-point star stencil give two subsets of variables that are not connected among
themselves, that is, they form a two-colouring of the matrix graph. Can you find a colouring if
nodes are connected by the second stencil in figure 4.2?
There is a simple bound for the number of colours needed for the graph of a sparse matrix: the number of colours
is at most d + 1 where d is the degree of the graph. To see that we can colour a graph with degree d using d + 1
colours, consider a node with degree d. No matter how its neighbours are coloured, there is always an unused colour
among the d + 1 available ones.
Exercise 4.6. Consider a sparse matrix, where the graph can be coloured with d colours. Permute the
matrix by first enumerating the unknowns of the first colour, then the second colour, et cetera.
What can you say about the sparsity pattern of the resulting permuted matrix?
Exercise 4.7. Consider the third stencil in figure 4.2, used for a BVP on a square domain. What does
the sparsity structure of the resulting matrix look like, if we again order the variables by rows
and columns?
Other stencils than the 5-point star can be used to attain higher accuracy, for instance giving a truncation error
of O(h4 ). They can also be used for other differential equations than the one discussed above. For instance, it is
not hard to show that the 5-point stencil can not give a discretization of the equation uxxxx + uyyyy = f with less
than O(1) truncation error.
While the discussion so far has been about two-dimensional problems, it is easily generalized to higher dimensions
for such equations as −uxx − uyy − uzz = f . The straightforward generalization of the 5-point stencil, for instance,
becomes a 7-point stencil in three dimensions.
4.3.1 Discretization
We now discretize both space and time, by xj+1 = xj + ∆x and tk+1 = tk + ∆t, with boundary conditions
x0 = a, xn = b, and t0 = 0. We write Tjk for the numerical solution at x = xj , t = tk ; with a little luck, this will
approximate the exact solution T (xj , tk ).
For the space discretization we use the central difference formula (4.7):
∂²/∂x² T(x, t_k) |_{x=x_j}   ⇒   ( T^k_{j−1} − 2T^k_j + T^k_{j+1} ) / ∆x².
For the time discretization we can use any of the schemes in section 4.1.2. For instance, with explicit time stepping
we get
∂/∂t T(x_j, t) |_{t=t_k}   ⇒   ( T^{k+1}_j − T^k_j ) / ∆t.    (4.14)
The explicit scheme for the heat equation then becomes

( T^{k+1}_j − T^k_j ) / ∆t − α ( T^k_{j−1} − 2T^k_j + T^k_{j+1} ) / ∆x² = q^k_j
which we rewrite as
T^{k+1}_j = T^k_j + (α∆t/∆x²)( T^k_{j−1} − 2T^k_j + T^k_{j+1} ) + ∆t q^k_j    (4.15)
Pictorially, we render this as a difference stencil in figure 4.3. This expresses that the function value in each point is
determined by a combination of points on the previous time level.
It is convenient to summarize the set of equations (4.15) for a given k and all values of j in vector form as
T^{k+1} = ( I − (α∆t/∆x²) K ) T^k + ∆t q^k    (4.16)

where

K =
(  2  −1          )
( −1   2  −1      )
(      ⋱   ⋱   ⋱ )
The important observation here is that the dominant computation for deriving the vector T^{k+1} from T^k is a simple
matrix-vector multiplication:

T^{k+1} ← A T^k + ∆t q^k
Figure 4.3: The difference stencil of the Euler forward method for the heat equation.
where A = I − (α∆t/∆x²) K. Actual computer programs using an explicit method often do not form the matrix, but
evaluate the equation (4.15). However, the linear algebra formulation (4.16) is more insightful for purposes of
analysis.
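A hedged C sketch of such a matrix-free explicit time step; boundary values are held at zero, the source term q is omitted, and all parameter values are illustrative:

#include <stdio.h>

#define N 50   /* number of interior points */

/* One explicit time step (4.15), evaluated pointwise rather than as a
   matrix-vector product. */
static void heat_step(double Tnew[N], const double T[N],
                      double alpha, double dt, double dx) {
  double c = alpha * dt / (dx * dx);
  int j;
  for (j = 0; j < N; j++) {
    double left  = (j > 0)     ? T[j - 1] : 0.0;
    double right = (j < N - 1) ? T[j + 1] : 0.0;
    Tnew[j] = T[j] + c * (left - 2.0 * T[j] + right);
  }
}

int main(void) {
  double T[N], Tnew[N], dx = 1.0 / (N + 1), alpha = 1.0;
  double dt = 0.4 * dx * dx / alpha;    /* within the stability limit derived below */
  int j, k;
  for (j = 0; j < N; j++) T[j] = 1.0;   /* initial temperature */
  for (k = 0; k < 100; k++) {
    heat_step(Tnew, T, alpha, dt, dx);
    for (j = 0; j < N; j++) T[j] = Tnew[j];
  }
  printf("temperature at the midpoint after 100 steps: %f\n", T[N / 2]);
  return 0;
}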
In equation (4.14) we let T^{k+1} be defined from T^k. We can turn this around by defining T^k from T^{k−1}. For the
time discretization this gives

∂/∂t T(x_j, t) |_{t=t_k}   ⇒   ( T^k_j − T^{k−1}_j ) / ∆t.    (4.17)

The implicit time step discretization of the whole heat equation, evaluated in t_{k+1}, now becomes:

( T^{k+1}_j − T^k_j ) / ∆t − α ( T^{k+1}_{j−1} − 2T^{k+1}_j + T^{k+1}_{j+1} ) / ∆x² = q^{k+1}_j    (4.18)
Figure 4.4: The difference stencil of the Euler backward method for the heat equation.
As opposed to the explicit method, where a matrix-vector multiplication sufficed, the derivation of the vector T^{k+1}
from T^k now involves solving a linear system

A T^{k+1} = T^k + ∆t q^{k+1}    (4.19)

where A = I + (α∆t/∆x²) K, a harder operation than the matrix-vector multiplication. In this case, it is not possible, as
above, to simply evaluate the equation (4.18). Codes using an implicit method actually form the coefficient matrix, and
solve the system (4.19) as such.
Exercise 4.8. Show that the flop count for a time step of the implicit method is of the same order as of
the explicit method. (This only holds for a problem with one space dimension.) Give at least
one argument why we consider the implicit method as computationally ‘harder’.
The numerical scheme that we used here is of first order in time and second order in space: the truncation error
(section 4.1.2) is O(∆t + ∆x2 ). It would be possible to use a scheme that is second order in time by using central
differences in time too. We will not consider such matters in this course.
5. Actually, β is also dependent on `, but we will save ourselves a bunch of subscripts, since different β values never appear together in
one formula.
and t coordinates, we surmise that the solution will be a product of the separate solutions of
u_t = c_1 u,    u_{xx} = c_2 u
If the assumption holds up, we need |β| < 1 for stability.
Substituting the surmised form T^k_j = β^k e^{iℓx_j} into the explicit scheme gives

T^{k+1}_j = T^k_j + (α∆t/∆x²)( T^k_{j−1} − 2T^k_j + T^k_{j+1} )
⇒ β^{k+1} e^{iℓx_j} = β^k e^{iℓx_j} + (α∆t/∆x²)( β^k e^{iℓx_{j−1}} − 2β^k e^{iℓx_j} + β^k e^{iℓx_{j+1}} )
                    = β^k e^{iℓx_j} [ 1 + (α∆t/∆x²)( e^{−iℓ∆x} − 2 + e^{iℓ∆x} ) ]
⇒ β = 1 + 2(α∆t/∆x²)[ ½( e^{iℓ∆x} + e^{−iℓ∆x} ) − 1 ]
    = 1 + 2(α∆t/∆x²)( cos(ℓ∆x) − 1 )
For stability we need |β| < 1:
• β < 1 ⇔ 2(α∆t/∆x²)(cos(ℓ∆x) − 1) < 0: this is true for any ℓ and any choice of ∆x, ∆t.
• β > −1 ⇔ 2(α∆t/∆x²)(cos(ℓ∆x) − 1) > −2: this is true for all ℓ only if 2α∆t/∆x² < 1, that is

∆t < ∆x²/(2α)
The latter condition poses a big restriction on the allowable size of the time steps: time steps have to be small
enough for the method to be stable. Also, if we decide we need more accuracy in space and we halve the space
discretization ∆x, the number of time steps will be multiplied by four.
Let us now consider the stability of the implicit scheme. Substituting the form of the solution T^k_j = β^k e^{iℓx_j} into the
numerical scheme gives

T^{k+1}_j − T^k_j = (α∆t/∆x²)( T^{k+1}_{j−1} − 2T^{k+1}_j + T^{k+1}_{j+1} )
⇒ β^{k+1} e^{iℓx_j} − β^k e^{iℓx_j} = (α∆t/∆x²)( β^{k+1} e^{iℓx_{j−1}} − 2β^{k+1} e^{iℓx_j} + β^{k+1} e^{iℓx_{j+1}} )

Dividing out e^{iℓx_j} β^{k+1} gives

1 = β^{−1} + 2α(∆t/∆x²)( cos ℓ∆x − 1 )

β = 1 / ( 1 + 2α(∆t/∆x²)( 1 − cos ℓ∆x ) )
Since 1 − cos ℓ∆x ∈ (0, 2), the denominator is strictly > 1. Therefore the condition |β| < 1 is always satisfied,
regardless of the choices of ∆x and ∆t: the method is always stable.
Chapter 5
Numerical Linear Algebra
In chapter 4 you saw how the numerical solution of partial differential equations can lead to linear algebra problems.
Sometimes this is a simple problem – a matrix-vector multiplication in the case of the Euler forward method – but
sometimes it is more complicated: the solution of a system of linear equations. (In other applications, which we
will not discuss here, eigenvalue problems need to be solved.) You may have learned a simple algorithm for this:
elimination of unknowns, also called Gaussian elimination. This method can still be used, but we need some careful
discussion of its efficiency. There are also other algorithms, the so-called iterative solution methods, which proceed
by gradually approximating the solution of the linear system. They warrant some discussion of their own.
Because of the PDE background, we only consider linear systems that are square and nonsingular. Rectangular,
in particular overdetermined, systems have important applications too in a corner of numerical analysis known as
approximation theory. However, we will not cover that in this book.
In the example of the heat equation (section 4.3) you saw that each time step involves solving a linear system. As an
important practical consequence, any setup cost for solving the linear system will be amortized over the sequence
of systems that is to be solved. A similar argument holds in the context of nonlinear equations, a topic that we will
not discuss as such. Nonlinear equations are solved by an iterative process such as the Newton method, which in its
multidimensional form leads to a sequence of linear systems. Although these have different coefficient matrices, it
is again possible to amortize setup costs.
The solution of a linear system can be written with a fairly simple explicit formula, using determinants. This is
called ‘Cramer’s rule’. It is mathematically elegant, but completely impractical for our purposes.
If a matrix A and a vector b are given, and a vector x satisfying Ax = b is wanted, then, writing |A| for determinant,
        | a_{11}  a_{12}  ...  a_{1,i−1}  b_1  a_{1,i+1}  ...  a_{1n} |
        | a_{21}               ...        b_2             ...  a_{2n} |
x_i =   |   ⋮                              ⋮                     ⋮    |  /  |A|
        | a_{n1}               ...        b_n             ...  a_{nn} |
For any matrix M the determinant is defined recursively as
|M| = Σ_i (−1)^{i+1} m_{1i} |M^{[1,i]}|
where M [1,i] denotes the matrix obtained by deleting row 1 and column i from M . This means that computing the
determinant of a matrix of dimension n means n times computing a size n − 1 determinant. Each of these requires
n − 1 determinants of size n − 2, so you see that the number of operations required to compute the determinant is
factorial in the matrix size. This quickly becomes prohibitive, even ignoring any issues of numerical stability. Later
in this chapter you will see complexity estimates for other methods of solving systems of linear equations that are
considerably more reasonable.
Let us now look at a simple example of solving linear equations with elimination of unknowns. Consider the
system
6x1 −2x2 +2x3 = 16
12x1 −8x2 +6x3 = 26
3x1 −13x2 +3x3 = −19
We eliminate x1 from the second and third equation by
• multiplying the first equation ×2 and subtracting the result from the second equation, and
• multiplying the first equation ×1/2 and subtracting the result from the third equation.
6x1 −2x2 +2x3 = 16
0x1 −4x2 +2x3 = −6
0x1 −12x2 +2x3 = −27
Finally, we eliminate x2 from the third equation by multiplying the second equation by 3, and subtracting the result
from the third equation:
6x1 −2x2 +2x3 = 16
0x1 −4x2 +2x3 = −6
0x1 +0x2 −4x3 = −9
We can now solve x_3 = 9/4 from the last equation. Substituting that in the second equation, we get −4x_2 =
−6 − 2x_3 = −21/2 so x_2 = 21/8. Finally, from the first equation 6x_1 = 16 + 2x_2 − 2x_3 = 16 + 21/4 − 9/2 = 67/4
so x_1 = 67/24.
In the above example, the matrix coefficients could have been any real (or, for that matter, complex) coefficients,
and you could follow the elimination procedure mechanically. There is the following exception. At some point in
the computation, we divided by the numbers 6, −4, −4 which are found on the diagonal of the matrix in the last
elimination step. These quantities are called the pivots, and clearly they are required to be nonzero. The first pivot
is an element of the original matrix; the other pivots can not easily be found without doing the actual elimination.
If a pivot turns out to be zero, all is not lost for the computation: we can always exchange two matrix rows. It is not
hard to show1 that with a nonsingular matrix there is always a row exchange possible that puts a nonzero element
in the pivot location.
Exercise 5.1. Suppose you want to exchange matrix rows 2 and 3 of the system of equations in equa-
tion (5.1). What other adjustments do you have to make to make sure you still compute the
correct solution? What are the implications of exchanging two columns in that equation?
In general, with floating point numbers and round-off, it is very unlikely that a matrix element will become exactly
zero during a computation. Does that mean that pivoting is in practice almost never necessary? The answer is no:
pivoting is desirable from a point of view of numerical stability. In the next section you will see an example that
illustrates this fact.
1. And you can find this in any elementary linear algebra textbook.
A full-fledged error analysis of the algorithms we discuss is beyond the scope of this course;
see the ‘further reading’ section at the end of this chapter.
Here, we will only note two paradigmatic examples of the sort of problems that can come up in computer arithmetic:
we will show why ‘pivoting’ during LU factorization is more than a theoretical device, and we will give two
examples of problems in eigenvalue calculations.
which has the solution x = (1, 1)^t. Using the (1, 1) element to clear the remainder of the first column gives:

( ε   1          ) x = ( 1 + ε             )
( 0   1 − 1/ε    )     ( 2 − (1 + ε)/ε     )

If ε is smaller than the machine precision, the second equation yields x_2 = 1 in computer arithmetic, and the first
equation, stored as εx_1 + x_2 = 1, then gives εx_1 = 0 ⇒ x_1 = 0,
which is 100% wrong, or infinitely wrong depending on how you look at it.
What would have happened if we had pivoted as described above? We exchange the matrix rows, giving

( 1   1 ) x = ( 2     )    ⇒    ( 1   1     ) x = ( 2     )
( ε   1 )     ( 1 + ε )         ( 0   1 − ε )     ( 1 − ε )
we find a double eigenvalue 1. Note that the exact eigenvalues are expressible in working precision; it is the algo-
rithm that causes the error. Clearly, using the characteristic polynomial is not the right way to compute eigenvalues,
even in well-behaved, symmetric positive definite, matrices.
An unsymmetric example: let A be the matrix of size 20

A =
( 20  20                    )
(     19  20                )
(         ⋱    ⋱            )
(               2   20      )
(                        1  )

where all entries not shown are zero.
Since this is a triangular matrix, its eigenvalues are the diagonal elements. If we perturb this matrix by setting
A_{20,1} = 10^{−6} we find a perturbation in the eigenvalues that is much larger than in the elements.
5.3 LU factorization
So far, we have looked at eliminating unknowns in the context of solving a single system of linear equations.
Suppose you need to solve more than one system with the same matrix, but with different right hand sides. Can you
use any of the work you did in the first system to make solving subsequent ones easier?
The answer is yes. You can split the solution process in a part that only concerns the matrix, and part that is specific
to the right hand side. If you have a series of systems to solve, you have to do the first part only once, and, luckily,
that even turns out to be the larger part of the work.
Let us take a look at the same example again.
A = (  6   −2   2 )
    ( 12   −8   6 )
    (  3  −13   3 )
In the elimination process, we took the 2nd row minus 2× the first and the 3rd row minus 1/2× the first. Convince
yourself that this combining of rows can be done by multiplying A from the left by
L_1 = (    1   0   0 )
      (   −2   1   0 )
      ( −1/2   0   1 )
which is the identity with the elimination coefficients in the first column, below the diagonal. You see that the first
step in elimination of variables is equivalent to transforming the system Ax = b to L1 Ax = L1 b.
In the next step, you subtracted 3× the second row from the third. Convince yourself that this corresponds to
left-multiplying the current matrix L1 A by
L_2 = ( 1    0   0 )
      ( 0    1   0 )
      ( 0   −3   1 )
We have now transformed our system Ax = b into L_2L_1Ax = L_2L_1b, and L_2L_1A is of ‘upper triangular’ form. If
we define U = L_2L_1A, then A = L_1^{−1}L_2^{−1}U. How hard is it to compute matrices such as L_2^{−1}? Remarkably easy,
it turns out to be.
We make the following observations:
L_1 = (    1   0   0 )      L_1^{−1} = (   1   0   0 )
      (   −2   1   0 )                 (   2   1   0 )
      ( −1/2   0   1 )                 ( 1/2   0   1 )

and likewise

L_2 = ( 1    0   0 )      L_2^{−1} = ( 1   0   0 )
      ( 0    1   0 )                 ( 0   1   0 )
      ( 0   −3   1 )                 ( 0   3   1 )
If we define L = L_1^{−1}L_2^{−1}, we now have A = LU; this is called an LU factorization. We see that the coefficients
of L below the diagonal are the negative of the coefficients used during elimination. Even better, the first column
of L below the diagonal are the negative of the coefficients used during elimination. Even better, the first column
of L can be written while the first column of A is being eliminated, so the computation of L and U can be done
without extra storage, at least if we can afford to lose A.
⟨LU factorization⟩:
  for k = 1, n − 1:
    ⟨eliminate values in column k⟩

⟨eliminate values in column k⟩:
  for i = k + 1 to n:
    ⟨compute multiplier for row i⟩
    ⟨update row i⟩

⟨compute multiplier for row i⟩:
  a_{ik} ← a_{ik}/a_{kk}

⟨update row i⟩:
  for j = k + 1 to n:
    a_{ij} ← a_{ij} − a_{ik} ∗ a_{kj}

Expanding all the parts, the algorithm reads:

⟨LU factorization⟩:
  for k = 1, n − 1:
    for i = k + 1 to n:
      a_{ik} ← a_{ik}/a_{kk}                          (5.2)
      for j = k + 1 to n:
        a_{ij} ← a_{ij} − a_{ik} ∗ a_{kj}
This is the most common way of presenting the LU factorization. However, other ways of computing the same
result exist; see section 5.6.
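For concreteness, the following is a minimal Python/NumPy sketch of algorithm (5.2) (our own illustration; names such as lu_factor are not from the text). It computes the factors in the storage of A as described above, with the multipliers of L below the diagonal and U on and above it; the 3 × 3 matrix is the example used earlier in this section.

    import numpy as np

    def lu_factor(A):
        # Overwrite a copy of A with its LU factors (no pivoting):
        # strict lower triangle = multipliers of L (unit diagonal), upper triangle = U.
        A = np.array(A, dtype=float)
        n = A.shape[0]
        for k in range(n - 1):
            A[k+1:, k] /= A[k, k]                              # multipliers for column k
            A[k+1:, k+1:] -= np.outer(A[k+1:, k], A[k, k+1:])  # update the remaining rows
        return A

    A = np.array([[6., -2., 2.], [12., -8., 6.], [3., -13., 3.]])
    F = lu_factor(A)
    L = np.tril(F, -1) + np.eye(3)
    U = np.triu(F)
    print(np.allclose(L @ U, A))    # the factors reproduce A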
5.3.2 Uniqueness
It is always a good idea, when discussing numerical algorithms, to wonder if different ways of computing lead to
the same result. This is referred to as the ‘uniqueness’ of the result, and it is of practical use: if the computed result
is unique, swapping one software library for another will not change anything in the computation.
Let us consider the uniqueness of LU factorization. The definition of an LU factorization algorithm (without pivot-
ing) is that, given a nonsingular matrix A, it will give a lower triangular matrix L and upper triangular matrix U such
that A = LU . The above algorithm for computing an LU factorization is deterministic (it does not contain instruc-
tions ‘take any row that satisfies. . . ’), so given the same input, it will always compute the same output. However,
other algorithms are possible.
So let us assume that A = L1 U1 = L2 U2 where L1 , L2 are lower triangular and U1 , U2 are upper triangular. Then,
L_2^{-1}L_1 = U_2U_1^{-1}. In that equation, the left hand side is lower triangular, and the right hand side is upper triangular.
Exercise 5.2. Prove that the product of lower triangular matrices is lower triangular, and the product of
upper triangular matrices upper triangular. Is a similar statement true for inverses of nonsingu-
lar triangular matrices?
The product L_2^{-1}L_1 is apparently both lower triangular and upper triangular, so it must be diagonal. Let us call
it D, then L1 = L2 D and U2 = DU1 . The conclusion is that LU factorization is not unique, but it is unique ‘up to
diagonal scaling’.
Exercise 5.3. The algorithm in section 5.3.1 resulted in a lower triangular factor L that had ones on the
diagonal. Show that this extra condition makes the factorization unique.
Exercise 5.4. Show that an added condition of having ones on the diagonal of U is also sufficient for
the uniqueness of the factorization.
Since we can demand a unit diagonal in L or in U , you may wonder if it is possible to have both. (Give a simple
argument why this is not strictly possible.) We can do the following: suppose that A = LU where L and U are
nonsingular lower and upper triangular, but not normalized in any way. Write
L = (I + L')D_L, \qquad U = D_U(I + U'), \qquad D = D_LD_U.
Then
A = (I + L')D(I + U') \qquad (5.3)
Exercise 5.5. Show that you can also normalize the factorization in the form

A = (D + L)D^{-1}(D + U).
Consider the factorization of a tridiagonal matrix this way. How do L and U relate to the
triangular parts of A?
5.3.3 Pivoting
Above, you saw examples where pivoting, that is, exchanging rows, was necessary during the factorization process,
either to guarantee the existence of a nonzero pivot, or for numerical stability. We will now integrate pivoting into
the LU factorization.
Let us first observe that row exchanges can be described by a matrix multiplication. Let
P^{(i,j)} = \begin{pmatrix}
I & & & & \\
& 0 & & 1 & \\
& & I & & \\
& 1 & & 0 & \\
& & & & I
\end{pmatrix}
\qquad\text{(the off-diagonal ones being in rows and columns } i \text{ and } j\text{)}
then P (i,j) A is the matrix A with rows i and j exchanged. Since we may have to pivot in every iteration of the
factorization process, we introduce a sequence pi containing the j values, and we write P (i) ≡ P (i,p(i)) for short.
Exercise 5.6. Show that P (i) is its own inverse.
The process of factorizing with partial pivoting can now be described as:
• Let A(i) be the matrix with columns 1 . . . i − 1 eliminated, and partial pivoting applied to get the right
element in the (i, i) location.
• Let `(i) be the vector of multipliers in the i-th elimination step. (That is, the elimination matrix Li in this
step is the identity plus `(i) in the i-th column.)
• Let P (i+1) (with j ≥ i + 1) be the matrix that does the partial pivoting for the next elimination step as
described above.
• Then A(i+1) = P (i+1) Li A(i) .
In this way we get a factorization of the form
A = P^{(1)}L_1^{-1}P^{(2)}L_2^{-1}\cdots P^{(n-1)}L_{n-1}^{-1}U. \qquad (5.4)
Exercise 5.7. Recall from sections 1.5.7 and 1.5.8 that blocked algorithms are often desirable from a
performance point of view. Why is the ‘LU factorization with interleaved pivoting matrices’
in equation (5.4) bad news for performance?
Fortunately, equation (5.4) can be simplified: the P and L matrices ‘almost commute’. We show this by looking at
an example: P^{(2)}L_1 = \tilde L_1P^{(2)} where \tilde L_1 is very close to L_1:

P^{(2)}\begin{pmatrix} 1 & \emptyset \\ \ell^{(1)} & I \end{pmatrix}
= \begin{pmatrix} 1 & \emptyset \\ \tilde\ell^{(1)} & I \end{pmatrix}P^{(2)}

where \tilde\ell^{(1)} is the same as \ell^{(1)}, except that elements i and p(i) have been swapped. You can now easily convince
yourself that similarly P^{(2)} et cetera can be ‘pulled through’ L_1.
As a result we get

P^{(n-1)}\cdots P^{(1)}A = \tilde L_1^{-1}\cdots\tilde L_{n-1}^{-1}U. \qquad (5.5)
This means that we can again form a matrix L just as before, except that every time we pivot, we need to update the
columns of L that have already been computed.
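The following NumPy sketch (an illustration, not the text's algorithm statement) adds partial pivoting to the factorization sketch shown earlier: exchanging entire rows of the work array also swaps the already computed columns of L, which is exactly the trick just described. It returns the pivot order as an index vector rather than an explicit matrix P.

    import numpy as np

    def lu_partial_pivot(A):
        # Returns (F, perm) such that A[perm, :] = L @ U, with the unit-diagonal
        # L stored below the diagonal of F and U on and above it.
        A = np.array(A, dtype=float)
        n = A.shape[0]
        perm = np.arange(n)
        for k in range(n - 1):
            piv = k + np.argmax(abs(A[k:, k]))     # largest pivot in column k
            if piv != k:
                A[[k, piv], :] = A[[piv, k], :]    # swap full rows, including computed L columns
                perm[[k, piv]] = perm[[piv, k]]
            A[k+1:, k] /= A[k, k]
            A[k+1:, k+1:] -= np.outer(A[k+1:, k], A[k, k+1:])
        return A, perm

    A = np.array([[1e-20, 1.], [1., 1.]])          # the small-pivot example, with eps = 1e-20
    F, perm = lu_partial_pivot(A)
    L = np.tril(F, -1) + np.eye(2); U = np.triu(F)
    print(np.allclose(L @ U, A[perm, :]))          # True: the rows were exchanged first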
Exercise 5.8. If we write equation (5.5) as P A = LU , we get A = P −1 LU . Can you come up with a
simple formula for P −1 in terms of just P ? Hint: each P (i) is symmetric.
Exercise 5.9. Earlier, you saw that 2D BVP (section 4.2.3) give rise to a certain kind of matrix. We
stated, without proof, that for these matrices pivoting is not needed. We can now formally
prove this, focusing on the crucial property of diagonal dominance.
Assume that a matrix A satisfies ∀_{j≠i}: a_{ij} ≤ 0. Show that the matrix is diagonally dominant
iff there are vectors u, v ≥ 0 (meaning that each component is nonnegative) such that Au = v.
Show that, after eliminating a variable, for the remaining matrix à there are again vectors
ũ, ṽ ≥ 0 such that Ãũ = ṽ.
Now finish the argument that (partial) pivoting is not necessary if A is symmetric and di-
agonally dominant. (One can actually prove that pivoting is not necessary for any symmetric
positive definite (SPD) matrix, and diagonal dominance is a stronger condition than SPD-ness.)
Now that we have a factorization A = LU, we can use this to solve the linear system Ax = LUx = b. If we
introduce a temporary vector y = Ux, then we see this takes two steps:

Ly = b, \qquad Ux = y.
The first part, Ly = b is called the ‘lower triangular solve’, since it involves the lower triangular matrix L.
\begin{pmatrix}
1 & & & & \emptyset \\
\ell_{21} & 1 & & & \\
\ell_{31} & \ell_{32} & 1 & & \\
\vdots & & \ddots & \ddots & \\
\ell_{n1} & \ell_{n2} & \cdots & & 1
\end{pmatrix}
\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}
=
\begin{pmatrix} b_1 \\ b_2 \\ \vdots \\ b_n \end{pmatrix}
In the first row, you see that y1 = b1 . Then, in the second row `21 y1 + y2 = b2 , so y2 = b2 − `21 y1 . You can imagine
how this continues: in every i-th row you can compute yi from the previous y-values:
y_i = b_i - \sum_{j<i} \ell_{ij}y_j.
Since we compute yi in increasing order, this is also known as the ‘forward sweep’.
The second half of the solution process, the ‘upper triangular solve’, or ‘backward sweep’ computes x from U x = y:
\begin{pmatrix}
u_{11} & u_{12} & \dots & u_{1n} \\
& u_{22} & \dots & u_{2n} \\
& & \ddots & \vdots \\
\emptyset & & & u_{nn}
\end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix}
=
\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}
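A minimal Python sketch of the two sweeps (assuming a unit diagonal in L, as produced by the factorization sketch above; the function names are ours):

    import numpy as np

    def forward_sweep(L, b):
        # Solve Ly = b where L is lower triangular with unit diagonal.
        n = len(b)
        y = np.zeros(n)
        for i in range(n):
            y[i] = b[i] - L[i, :i] @ y[:i]       # y_i = b_i - sum_{j<i} l_ij y_j
        return y

    def backward_sweep(U, y):
        # Solve Ux = y where U is upper triangular with nonzero diagonal.
        n = len(y)
        x = np.zeros(n)
        for i in range(n - 1, -1, -1):
            x[i] = (y[i] - U[i, i+1:] @ x[i+1:]) / U[i, i]
        return x

    # with L, U from the factorization sketch above, solving Ax = b is then
    # x = backward_sweep(U, forward_sweep(L, b))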
5.3.5 Complexity
In the beginning of this chapter, we indicated that not every method for solving a linear system takes the same
number of operations. Let us therefore take a closer look at the complexity, that is, the number of operations as
function of the problem size, of the use of an LU factorization in solving the linear system.
The complexity of solving the linear system, given the LU factorization, is easy to compute. Looking at the lower
and upper triangular part together, you see that you perform a multiplication with all off-diagonal elements (that is,
elements `ij or uij with i 6= j). Furthermore, the upper triangular solve involves divisions by the uii elements. Now,
division operations are in general much more expensive than multiplications, so in this case you would compute the
values 1/uii , and store them instead.
Summing up, you see that, on a system of size n × n, you perform n^2 multiplications and roughly the same number
of additions. This is the same complexity as of a simple matrix-vector multiplication, that is, of computing Ax given
A and x.
The complexity of computing the LU factorization is a bit more involved to compute. Refer to the algorithm in
section 5.3.1. You see that in the k-th step two things happen: the computation of the multipliers, and the updating
of the rows.
There are n − k multipliers to be computed, each of which involves a division. After that, the update takes (n − k)^2
additions and multiplications. If we ignore the divisions for now, because there are fewer of them, we find that the
LU factorization takes \sum_{k=1}^{n-1} 2(n-k)^2 operations. If we number the terms in this sum in the reverse order, we find

\#\mathrm{ops} = \sum_{k=1}^{n-1} 2k^2.

Without further proof we state that this is \frac23 n^3 plus some lower order terms.
5.3.6 Accuracy
In section 5.2 you saw some simple examples of the problems that stem from the use of computer arithmetic, and
how these motivated the use of pivoting. Even with pivoting, however, we still need to worry about the accumulated
effect of roundoff errors. A productive way of looking at the question of attainable accuracy is to consider that by
solving a system Ax = b we get a numerical solution x + Δx which is the exact solution of a slightly different
linear system. The relative error satisfies

\frac{\|\Delta x\|}{\|x\|} \le \frac{2\kappa(A)\epsilon}{1-\kappa(A)\epsilon}

where ε is the machine precision and κ(A) = \|A\|\,\|A^{-1}\| is called the condition number of the matrix A. Without
going into this in any detail, we remark that the condition number is related to eigenvalues of the matrix.
The analysis of the accuracy of algorithms is a field of study in itself; see for instance the book by Higham [53].
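As a quick numerical illustration (made up for this section, using NumPy's cond), one can perturb the right hand side on the order of the machine precision and compare the resulting change in the solution against κ(A)ε:

    import numpy as np

    n = 50
    A = 2*np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)     # the 1D BVP matrix
    x_exact = np.ones(n)
    b = A @ x_exact
    eps = np.finfo(float).eps
    b_pert = b * (1 + eps * np.random.uniform(-1, 1, n))   # perturb b at machine precision
    x = np.linalg.solve(A, b_pert)
    print("relative error :", np.linalg.norm(x - x_exact) / np.linalg.norm(x_exact))
    print("kappa(A) * eps :", np.linalg.cond(A) * eps)     # compare the two numbers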
where M, N are the block dimensions, that is, the dimension expressed in terms of the subblocks. Usually, we
choose the blocks such that M = N and the diagonal blocks are square.
As a simple example, consider the matrix-vector product y = Ax, expressed in block terms.
\begin{pmatrix} Y_1 \\ \vdots \\ Y_M \end{pmatrix}
=
\begin{pmatrix} A_{11} & \dots & A_{1M} \\ \vdots & & \vdots \\ A_{M1} & \dots & A_{MM} \end{pmatrix}
\begin{pmatrix} X_1 \\ \vdots \\ X_M \end{pmatrix}
To see that the block algorithm computes the same result as the old scalar algorithm, we look at a single scalar
element, the k-th component of the i-th block. First,

Y_i = \sum_j A_{ij}X_j

so

Y_{i_k} = \Bigl(\sum_j A_{ij}X_j\Bigr)_k = \sum_j (A_{ij}X_j)_k = \sum_j\sum_\ell (A_{ij})_{k\ell}(X_j)_\ell
which is the product of the k-th row of the i-th blockrow of A with the whole of X.
A more interesting algorithm is the block version of the LU factorization. The algorithm (5.2) then becomes
⟨LU factorization⟩:
  for k = 1, n − 1:
    for i = k + 1 to n:                                   (5.6)
      A_{ik} ← A_{ik}A_{kk}^{-1}
      for j = k + 1 to n:
        A_{ij} ← A_{ij} − A_{ik} · A_{kj}
which mostly differs from the earlier algorithm in that the division by a_{kk} has been replaced by a multiplication
by A_{kk}^{-1}. Also, the U factor will now have pivot blocks, rather than pivot elements, on the diagonal, so U is only
‘block upper triangular’, and not strictly upper triangular.
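A NumPy sketch of algorithm (5.6) (our illustration; the block size nb is assumed to divide the matrix size evenly), together with a check that the block factors indeed reproduce A:

    import numpy as np

    def block_lu(A, nb):
        # Overwrite a copy of A with its block LU factors; nb is the block size.
        A = np.array(A, dtype=float)
        nblocks = A.shape[0] // nb
        blk = lambda i: slice(i * nb, (i + 1) * nb)
        for k in range(nblocks - 1):
            Akk_inv = np.linalg.inv(A[blk(k), blk(k)])
            for i in range(k + 1, nblocks):
                A[blk(i), blk(k)] = A[blk(i), blk(k)] @ Akk_inv       # A_ik <- A_ik A_kk^{-1}
                for j in range(k + 1, nblocks):
                    A[blk(i), blk(j)] -= A[blk(i), blk(k)] @ A[blk(k), blk(j)]
        return A

    n, nb = 6, 2
    A = np.random.rand(n, n) + n * np.eye(n)       # diagonally dominant, so no pivoting needed
    F = block_lu(A, nb)
    L, U = np.eye(n), np.zeros((n, n))
    for i in range(n // nb):
        for j in range(n // nb):
            bi, bj = slice(i*nb, (i+1)*nb), slice(j*nb, (j+1)*nb)
            if i > j: L[bi, bj] = F[bi, bj]        # strictly block lower part
            else:     U[bi, bj] = F[bi, bj]        # block upper part, including pivot blocks
    print(np.allclose(L @ U, A))                   # True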
Exercise 5.11. We would like to show that the block algorithm here again computes the same result as
the scalar one. Doing so by looking explicitly at the computed elements is cumbersome, so we
take another approach. Show first that LU factorizations are unique: if A = L1 U1 = L2 U2
and L1 , L2 have unit diagonal, then L1 = L2 , U1 = U2 .
Next, consider the computation of A_{kk}^{-1}. Show that this can be done easily by first computing
an LU factorization of Akk . Now use this to show that the block LU factorization can give L
and U factors that are strictly triangular. The uniqueness of LU factorizations then proves that
the block algorithm computes the scalar result.
Block algorithms are interesting for a variety of reasons. On single processors they are the key to high cache
utilization; see section 1.5.6. On shared memory architectures, including current multicore processors, they can be
used to schedule parallel tasks on the processors/cores; see section 6.10.
In this section we will explore what form familiar linear algebra operations take when applied to sparse matrices.
First we will concern ourselves with actually storing a sparse matrix.
It is pointless to come up with an exact definition of sparse matrix, but an operational definition is that a matrix is
called ‘sparse’ if there are enough zeros to make specialized storage feasible. We will discuss here briefly the most
popular storage schemes for sparse matrices.
In section 4.2 you have seen examples of sparse matrices that were banded. In fact, their nonzero elements are
located precisely on a number of subdiagonals. For such a matrix, a specialized storage is possible.
Let us take as an example the matrix of the one-dimensional BVP (section 4.2). Its elements are located on three
subdiagonals: the main diagonal and the first super and subdiagonal. The idea of storage by diagonals or diagonal
storage is to store the diagonals consecutively in memory. The most economical storage scheme for such a matrix
would store the 2n − 2 elements consecutively. However, for various reasons it is more convenient to waste a few
storage locations, as shown in figure 5.1.
Thus, for a matrix with size n × n and a bandwidth p, we need a rectangular array of size n × p to store the matrix.
The matrix of equation (4.8) would then be stored as
\begin{pmatrix}
\star & 2 & -1 \\
-1 & 2 & -1 \\
\vdots & \vdots & \vdots \\
-1 & 2 & \star
\end{pmatrix} \qquad (5.7)
where the stars correspond to array elements that do not correspond to matrix elements: they are the triangles in the
top left and bottom right in figure 5.1.
If we apply this scheme to the matrix of the two-dimensional BVP (section 4.2.3), it becomes wasteful, since we
would be storing many zeros that exist inside the band. Therefore, we refine this scheme by storing only the nonzero
diagonals: if the matrix has p nonzero diagonals, we need an n × p array. For the matrix of equation (4.13) this
means:
\begin{pmatrix}
\star & \star & 4 & -1 & -1 \\
\star & -1 & 4 & -1 & -1 \\
\vdots & \vdots & \vdots & \vdots & \vdots \\
-1 & -1 & 4 & -1 & -1 \\
\vdots & \vdots & \vdots & \vdots & \vdots \\
-1 & -1 & 4 & \star & \star
\end{pmatrix}
Of course, we need an additional integer array telling us the locations of these nonzero diagonals.
y_i ← y_i + A_{i,i-1}x_{i-1}, \qquad y_i ← y_i + A_{i,i}x_i, \qquad y_i ← y_i + A_{i,i+1}x_{i+1}.
In other words, the whole matrix-vector product can be executed in just three vector operations of length n (or
n − 1), instead of n inner products of length 3 (or 2).
for diag = -diag_left, diag_right
  for loc = max(1,1-diag), min(n,n-diag)
    y(loc) = y(loc) + val(loc,diag) * x(loc+diag)
  end
end
Exercise 5.12. Write a routine that computes y ← A^tx by diagonals. Implement it in your favourite
language and test it on a random matrix.
val      10 -2  3  9  3  7  8  7  3 ···  9 13  4  2 -1
col ind   0  4  0  1  5  1  2  3  0 ···  4  5  1  4  5
row ptr   0  2  5  8 12 16 19
A simple variant of CRS is Compressed Column Storage (CCS) where the elements in columns are stored contigu-
ously. Another storage scheme you may come across is coordinate storage, where the matrix is stored as a list of
triplets ⟨i, j, a_{ij}⟩.
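For illustration, here is a small Python sketch of the matrix-vector product y = Ax for a matrix stored in CRS format; the 3 × 3 example matrix is made up for this sketch and is not the matrix whose arrays are shown above.

    import numpy as np

    def crs_matvec(val, col_ind, row_ptr, x):
        # y = A x for a CRS matrix with 0-based indices.
        n = len(row_ptr) - 1
        y = np.zeros(n)
        for i in range(n):
            for k in range(row_ptr[i], row_ptr[i + 1]):
                y[i] += val[k] * x[col_ind[k]]
        return y

    # hypothetical example: the matrix [[4,0,-1],[0,4,-1],[-1,-1,4]]
    val     = [4., -1., 4., -1., -1., -1., 4.]
    col_ind = [0, 2, 1, 2, 0, 1, 2]
    row_ptr = [0, 2, 4, 7]
    print(crs_matvec(val, col_ind, row_ptr, np.ones(3)))   # row sums: [3. 3. 2.]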
for i:
  s ← 0
  for j:
    s ← s + a_{ij}x_j
  y_i ← s
⇒
y ← 0
for j:
  for i:
    y_i ← y_i + a_{ij}x_j
We see that in the second variant, columns of A are accessed, rather than rows. This means that we can use the
second algorithm for computing the A^tx product.
Exercise 5.13. Write out the code for the transpose product y = A^tx where A is stored in CRS format.
Write a simple test program and confirm that your code computes the right thing.
Exercise 5.14. What if you need access to both rows and columns at the same time? Implement an
algorithm that tests whether a matrix stored in CRS format is symmetric. Hint: keep an array
of pointers, one for each row, that keeps track of how far you have progressed in that row.
Exercise 5.15. The operations described so far are fairly simple, in that they never make changes to the
sparsity structure of the matrix. The CRS format, as described above, does not allow you to
add new nonzeros to the matrix, but it is not hard to make an extension that does allow it.
Let numbers pi , i = 1 . . . n, describing the number of nonzeros in the i-th row, be given. Design
an extension to CRS that gives each row space for q extra elements. Implement this scheme
and test it: construct a matrix with pi nonzeros in the i-th row, and check the correctness of the
matrix-vector product before and after adding new elements, up to q elements per row.
Now assume that the matrix will never have more than a total of qn nonzeros. Alter your code
so that it can deal with starting with an empty matrix, and gradually adding nonzeros in random
places. Again, check the correctness.
(i, j) ∈ E' ⇔ (n + 1 − i, n + 1 − j) ∈ E, \qquad w'_{ij} = w_{n+1-i,n+1-j}.
What does this renumbering imply for the matrix A' that corresponds to G'? If you exchange
the labels i, j on two nodes, what is the effect on the matrix A?
Some graph properties can be hard to see from the sparsity pattern of a matrix, but are easier deduced from the
graph.
Exercise 5.17. Let A be a tridiagonal matrix (see section 4.2) of size n with n odd. What does the
graph of A look like? Now zero the offdiagonal elements closest to the ‘middle’ of the matrix:
let a(n+1)/2,(n+1)/2+1 = a(n+1)/2+1,(n+1)/2 = 0. Describe what that does to the graph of A.
Such a graph is called reducible. Consider the permutation that results from putting the nodes
In section 4.2 the one-dimensional BVP led to a linear system with a tridiagonal coefficient matrix. If we do one
step of Gaussian elimination, the only element that needs to be eliminated is in the second row:
\begin{pmatrix}
2 & -1 & 0 & \dots \\
-1 & 2 & -1 & \\
0 & -1 & 2 & -1 \\
& \ddots & \ddots & \ddots
\end{pmatrix}
\;\Rightarrow\;
\begin{pmatrix}
2 & -1 & 0 & \dots \\
0 & 2-\frac12 & -1 & \\
0 & -1 & 2 & -1 \\
& \ddots & \ddots & \ddots
\end{pmatrix}
There are two important observations to be made: one is that this elimination step does not change any zero elements
to nonzero. The other observation is that the part of the matrix that is left to be eliminated is again tridiagonal.
Inductively, during the elimination no zero elements change to nonzero: the sparsity pattern of L + U is the same
as of A, and so the factorization takes the same amount of space to store as the matrix.
The case of tridiagonal matrices is unfortunately not typical, as we will now see in the case of two-dimensional
problems. Here, in the first elimination step we need to zero two elements, one in the second row and one in the
first row of the next block. (Refresher question: where do these blocks come from?)
\begin{pmatrix}
4 & -1 & 0 & \dots & -1 & & \\
-1 & 4 & -1 & 0 & \dots & 0 & -1 \\
& \ddots & \ddots & \ddots & & & \ddots \\
-1 & 0 & \dots & & 4 & -1 & \\
0 & -1 & 0 & \dots & & -1 & 4 & -1
\end{pmatrix}
\;\Rightarrow\;
\begin{pmatrix}
4 & -1 & 0 & \dots & -1 & & \\
0 & 4-\frac14 & -1 & 0 & \dots & -1/4 & -1 \\
& \ddots & \ddots & \ddots & & & \ddots \\
& -1/4 & \dots & & 4-\frac14 & -1 & \\
0 & -1 & 0 & \dots & & -1 & 4 & -1
\end{pmatrix}
You see that the second block causes two fill elements: elements that are nonzero in L or U in a location that was
zero in A.
Exercise 5.18. How many fill elements are there in the next eliminating step? Can you characterize for
the whole of the factorization which locations in A get filled in L + U ?
Exercise 5.19. The LAPACK software for dense linear algebra has an LU factorization routine that
overwrites the input matrix with the factors. Above you saw that this is possible since the columns
of L are generated precisely as the columns of A are eliminated. Why is such an algorithm not
possible if the matrix is stored in sparse format?
Exercise 5.21. The assumption of a band that is initially dense is not true for the matrix of a two-
dimensional BVP. Why does the above estimate still hold, up to some lower order terms?
What is the number of nonzeros in the matrix, and in the factorization, assuming that no addi-
tion ever results in zero? Can you find a symmetric permutation of the variables of the problem
such that the new matrix has no fill-in?
The above estimates can sometimes be improved upon by clever permuting of the matrix, but in general the state-
ment holds that an LU factorization of a sparse matrix takes considerably more space than the matrix itself. This is
one of the motivating factors for the iterative methods in the next section.
In section 5.4.3, above, you saw that during the factorization the part of the matrix that is left to be factored becomes
more and more dense. It is possible to give a dramatic demonstration of this fact. Consider the matrix of the two-
dimensional BVP (section 4.2.3; assume the number of grid points on each line or column, n, is odd), and put a
non-standard numbering on the unknowns:
1. First number all variables for which x < 0.5,
2. then number all variables for which x > 0.5,
3. then number the variables on the line x = 0.5.
Exercise 5.24. Show that the matrix has a natural 3 × 3 block structure, in which two blocks are entirely
zero. Show that, after eliminating the first two sets of variables, the remaining matrix will be a
dense matrix of size n × n.
The important feature here is that no systems are solved; instead, every iteration involves a simple matrix-vector
multiplication. Thus we have replaced a complicated operation, constructing an LU factorization and solving a
system with it, by repeated application of a simpler and cheaper operation. This makes iterative methods easier to code, and
potentially more efficient.
Let us consider a simple example to motivate the precise definition of the iterative methods. Suppose we want to
solve the system
\begin{pmatrix} 10 & 0 & 1 \\ 1/2 & 7 & 1 \\ 1 & 0 & 6 \end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix}
=
\begin{pmatrix} 21 \\ 9 \\ 8 \end{pmatrix}
which has the solution (2, 1, 1). Suppose you know (for example, from physical considerations) that solution com-
ponents are roughly the same size. Observe the dominant size of the diagonal, then, to decide that
\begin{pmatrix} 10 & & \\ & 7 & \\ & & 6 \end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix}
=
\begin{pmatrix} 21 \\ 9 \\ 8 \end{pmatrix}
might be a good approximation: solution (2.1, 9/7, 8/6). Clearly, solving a system that only involves the diagonal
of the original system is both easy to do, and, at least in this case, fairly accurate.
Another approximation to the original system would be to use the lower triangle. The system
\begin{pmatrix} 10 & & \\ 1/2 & 7 & \\ 1 & 0 & 6 \end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix}
=
\begin{pmatrix} 21 \\ 9 \\ 8 \end{pmatrix}
has the solution (2.1, 7.95/7, 5.9/6). Solving triangular systems is a bit more work than diagonal systems, but still
a lot easier than computing an LU factorization. Also, we have not generated any fill-in in the process of finding
this approximate solution.
Thus we see that there are easy to compute ways of getting considerably close to the solution. Can we somehow
repeat this trick?
Formulated a bit more abstractly, what we did was instead of solving Ax = b we solved Lx̃ = b. Now define Δx
as the distance to the true solution: x = x̃ + Δx. This gives AΔx = b − Ax̃ ≡ r. Next we solve again LΔx̃ = r
and update x̃ ← x̃ + Δx̃.
iteration      1         2         3
x_1         2.1000    2.0017    2.000028
x_2         1.1357    1.0023    1.000038
x_3         0.9833    0.9997    0.999995
In this case we get two decimals per iteration, which is not typical.
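The correction process just described is easy to try out; the following NumPy lines (a sketch, not code from the text) repeat the lower-triangular solve on the example system and should reproduce the behaviour of the table above.

    import numpy as np

    A = np.array([[10., 0., 1.], [0.5, 7., 1.], [1., 0., 6.]])
    b = np.array([21., 9., 8.])
    L = np.tril(A)                          # the lower triangle approximation
    x = np.linalg.solve(L, b)               # iteration 1 of the table
    print(1, x)
    for it in range(2, 4):
        r = b - A @ x                       # r = b - A x~
        x = x + np.linalg.solve(L, r)       # solve L dx~ = r and update
        print(it, x)                        # converges towards (2, 1, 1)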
It is now clear why iterative methods can be attractive. Solving a system by Gaussian elimination takes O(n^3)
operations, as shown above. A single iteration in a scheme such as the above takes O(n^2) operations if the matrix is
dense, and possibly as low as O(n) for a sparse matrix. If the number of iterations is low, this makes iterative
methods competitive.
Exercise 5.25. When comparing iterative and direct methods, the flop count is not the only relevant
measure. Outline some issues relating to the efficiency of the code in both cases. Which scheme
is favoured?
Instead of solving Ax = b we solve Kx = b, and define x0 as the solution: Kx0 = b. This leaves us with an error
e0 = x0 − x, for which we have the equation A(x0 − e0 ) = b or Ae0 = Ax0 − b. We call r0 ≡ Ax0 − b the residual ;
the error then satisfies Ae0 = r0 .
If we could solve the error from the equation Ae0 = r0 , we would be done: the true solution is then found as
x = x0 − e0 . However, since solving with A was too expensive the last time, we can not do so this time either, so
we determine the error correction approximately. We solve K ẽ0 = r0 and x1 = x0 − ẽ0 ; the story can now continue
with e1 = x1 − x, r1 = Ax1 − b, K ẽ1 = r1 , x2 = x1 − ẽ1 , et cetera.
The iteration scheme is then:
Let x0 be given
For i ≥ 0:
let ri = Axi − b
compute ei from Kei = ri
update xi+1 = xi − ei
The scheme we have analyzed here is called stationary iteration, where every update is performed the same way,
without any dependence on the iteration number. It has a simple analysis, but unfortunately limited applicability.
There are several questions we need to answer about iterative schemes:
• When do we stop iterating?
• How do we choose K?
• Does this scheme always take us to the solution?
• If the scheme converges, how quickly?
We will now devote some attention to these matters, though a full discussion is beyond the scope of this book.
Exercise 5.27. Consider the matrix A of equation (4.13) that we obtained from discretization of a two-
dimensional BVP. Let K be the matrix containing the diagonal of A, that is k_{ii} = a_{ii} and k_{ij} = 0
for i ≠ j. Use the Gershgorin theorem to show that |λ(I − AK^{-1})| < 1.
The argument in this exercise is hard to generalize for more complicated choices of K, such as you will see in the
next section. Here, we only remark that for certain matrices A, these choices of K will always lead to convergence,
with a speed that decreases as the matrix size increases. We will not go into the details, beyond stating that for M-matrices
(see section 4.2) these iterative methods converge. For more details on the convergence theory of stationary
iterative methods, see [83].
Above, in section 5.5.1, we derived stationary iteration as a process that involves multiplying by A and solving
with K. However, in some cases a simpler implementation is possible. Consider the case where A = K − N , and
we know both K and N . Then we write
Ax = b \;\Rightarrow\; Kx = Nx + b \;\Rightarrow\; Kx_{i+1} = Nx_i + b.

This is the same scheme as before, since

Kx_{i+1} = Nx_i + b = Kx_i - Ax_i + b = Kx_i - r_i \;\Rightarrow\; x_{i+1} = x_i - K^{-1}r_i.
The convergence criterion |λ(I − AK^{-1})| < 1 (see section 5.5.2) now simplifies to |λ(NK^{-1})| < 1.
Let us consider two specific cases. First of all, let K = DA , that is, the matrix containing the diagonal part of A:
k_{ii} = a_{ii} and k_{ij} = 0 for all i ≠ j. Then the iteration scheme becomes

for k = 1, … until convergence, do:
  for i = 1 … n:
    x_i^{(k+1)} = a_{ii}^{-1}\bigl(b_i - \sum_{j\not=i} a_{ij}x_j^{(k)}\bigr)

This requires us to have one vector for the current iterate x^{(k)}, and one for the next vector x^{(k+1)}. The easiest way
to write this is probably:

for k = 1, … until convergence, do:
  for i = 1 … n:
    t_i = a_{ii}^{-1}\bigl(b_i - \sum_{j\not=i} a_{ij}x_j^{(k)}\bigr)
  copy x^{(k+1)} ← t
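A NumPy sketch of this Jacobi iteration, written in the equivalent form x_{k+1} = x_k − D^{-1}r_k (our function and variable names):

    import numpy as np

    def jacobi(A, b, x0, tol=1e-8, maxit=100000):
        D = np.diag(A)                      # the diagonal part of A, i.e. K = D_A
        x = x0.copy()
        for k in range(maxit):
            r = A @ x - b                   # residual r_k = A x_k - b
            if np.linalg.norm(r) <= tol * np.linalg.norm(b):
                return x, k
            x = x - r / D                   # x_{k+1} = x_k - D^{-1} r_k
        return x, maxit

    # usage sketch: the tridiagonal matrix of the 1D BVP
    n = 100
    A = 2*np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
    b = np.ones(n)
    x, its = jacobi(A, b, np.zeros(n))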
5.5.4 Choice of K
The convergence and error analysis above showed that the closer K is to A, the faster the convergence will be. In
the initial examples we already saw the diagonal and lower triangular choice for K. We can describe these formally
by letting A = DA + LA + UA be a splitting into diagonal, lower triangular, upper triangular part of A. Here are
some methods with their traditional names:
• Richardson iteration: K = αI.
• Jacobi method: K = DA (diagonal part),
• Gauss-Seidel method: K = DA + LA (lower triangle, including diagonal)
• The Successive Over-Relaxation (SOR) method: K = D_A + ωL_A
• The Symmetric SOR (SSOR) method: K = (D_A + L_A)D_A^{-1}(D_A + U_A).
• Iterative refinement: K = LU where LU is a true factorization of A. In exact arithmetic, solving a system
LU x = y gives you the exact solution, so using K = LU in an iterative method would give convergence
after one step. In practice, roundoff error will make the solution be inexact, so people will sometimes
iterate a few steps to get higher accuracy.
Exercise 5.28. What is the extra cost of a few steps of iterative refinement over a single system solution?
Exercise 5.29. The Jacobi iteration for the linear system Ax = b is defined as
xi+1 = xi − K −1 (Axi − b)
where K is the diagonal of A. Show that you can transform the linear system (that is, find a
different coefficient matrix and right hand side vector) so that you can compute the same xi
vectors but with K = I, the identity matrix.
What are the implications of this strategy, in terms of storage and operation counts? Are there
special implications if A is a sparse matrix?
Suppose A is symmetric. Give a simple example to show that K −1 A does not have to be
symmetric. Can you come up with a different transformation of the system so that symmetry is
preserved and that has the same advantages as the transformation above? You can assume that
the matrix has positive diagonal elements.
There are many different ways of choosing the preconditioner matrix K. Some of them are defined algebraically, such
as the incomplete factorization discussed below. Other choices are inspired by the differential equation. For instance,
if the operator is
\frac{\delta}{\delta x}\Bigl(a(x,y)\frac{\delta}{\delta x}u(x,y)\Bigr) + \frac{\delta}{\delta y}\Bigl(b(x,y)\frac{\delta}{\delta y}u(x,y)\Bigr) = f(x,y)
then the matrix K could be derived from the operator
\frac{\delta}{\delta x}\Bigl(\tilde a(x)\frac{\delta}{\delta x}u(x,y)\Bigr) + \frac{\delta}{\delta y}\Bigl(\tilde b(y)\frac{\delta}{\delta y}u(x,y)\Bigr) = f(x,y)
Exercise 5.32. Write a simple program to experiment with linear system solving. Take the matrix from
the 1D BVP (use an efficient storage scheme) and program an iterative method using the choice
K = DA . Experiment with stopping tests on the residual and the distance between iterates.
How does the number of iterations depend on the size of the matrix?
Change the matrix construction so that a certain quantity is added to the diagonal, that is, add αI
to the original matrix. What happens when α > 0? What happens when α < 0? Can you find
the value where the behaviour changes? Does that value depend on the matrix size?
One might ask, ‘why not introduce an extra parameter and write x_{i+1} = α_ix_i + · · · ?’ Here we give a short argument
that the former scheme describes a large class of methods. Indeed, the current author is not aware of methods that
fall outside this scheme.
We defined the residual, given an approximate solution x̃, as r̃ = Ax̃−b. For this general discussion we precondition
the system as K −1 Ax = K −1 b. The corresponding residual for the initial guess x̃ is
r̃ = K −1 Ax̃ − K −1 b.
Now, the Cayley-Hamilton theorem states that for every A there exists a polynomial φ(x) such that

φ(A) = 0.

We observe that we can then write

φ(x) = 1 + xπ(x)

for some polynomial π, so that x = x̃ + π(K^{-1}A)r̃. Now, if we let x_0 = x̃, then r̃ = K^{-1}r_0, giving the equation
x = x0 + π(K −1 A)K −1 r0 .
This equation suggests an iterative scheme: if we can find a series of polynomials π (i) of degree i to approximate π,
it will give us a sequence of iterates

x_{i+1} = x_0 + \pi^{(i)}(K^{-1}A)K^{-1}r_0
that ultimately reaches the true solution. Multiplying this equation by A and subtracting b on both sides gives

r_{i+1} = \hat\pi^{(i+1)}(AK^{-1})r_0 \qquad (5.14)

where \hat\pi^{(i)} is a polynomial of degree i with \hat\pi^{(i)}(0) = 1. This statement can be used as the basis of a convergence
theory of iterative methods. However, this goes beyond the scope of this book.
Let us look at a couple of instances of equation (5.14). For i = 1 we have

r_1 = (\alpha_1AK^{-1} + \alpha_2I)r_0 \;\Rightarrow\; AK^{-1}r_0 = \beta_1r_1 + \beta_0r_0

for different values α_i. But we had already established that AK^{-1}r_0 is a combination of r_1, r_0, so now we have that
It is easy to see that the scheme (5.12) is of the form (5.16). With a little effort one can show that the reverse
implication also holds.
Summarizing, the basis of most iterative methods is a scheme where iterates get updated by all residuals computed
so far:
x_{i+1} = x_i + \sum_{j\le i} K^{-1}r_j\alpha_{ji}. \qquad (5.17)
Compare that to the stationary iteration (section 5.5.1) where the iterates get updated from just the last residual, and
with a coefficient that stays constant.
We can say more about the αij coefficients. If we multiply equation (5.17) by A, subtracting b from both sides, we
find
r_{i+1} = r_i + \sum_{j\le i} AK^{-1}r_j\alpha_{ji}. \qquad (5.18)
Let us consider this equation for a moment. If we have a starting residual r0 , the next residual is computed as
r_1 = r_0 + AK^{-1}r_0\alpha_{00}.

From this we get that AK^{-1}r_0 = \alpha_{00}^{-1}(r_1 - r_0), so for the next residual,

r_2 = r_1 + AK^{-1}r_1\alpha_{11} + AK^{-1}r_0\alpha_{01}
    = r_1 + AK^{-1}r_1\alpha_{11} + \alpha_{00}^{-1}\alpha_{01}(r_1 - r_0)
\;\Rightarrow\; AK^{-1}r_1 = \alpha_{11}^{-1}\bigl(r_2 - (1 + \alpha_{00}^{-1}\alpha_{01})r_1 + \alpha_{00}^{-1}\alpha_{01}r_0\bigr)
r_{i+1}\alpha_{i+1,i} = AK^{-1}r_i\delta_i + \sum_{j\le i} r_j\alpha_{ji}
\qquad\text{substituting } \alpha_{ii} := 1+\alpha_{ii},\ \alpha_{i+1,i} := 1-\alpha_{i+1,i};

note that \alpha_{i+1,i} = \sum_{j\le i}\alpha_{ji}, and

r_{i+1}\alpha_{i+1,i}\delta_i^{-1} = AK^{-1}r_i + \sum_{j\le i} r_j\alpha_{ji}\delta_i^{-1}.
In this, H is a so-called Hessenberg matrix: it is upper triangular with a single lower subdiagonal. Also we note that
the elements of H in each column sum to zero.
Because of the identity \gamma_{i+1,i} = \sum_{j\le i}\gamma_{ji} we can subtract b from both sides of the equation for r_{i+1} and ‘divide
out A’, giving

x_{i+1}\gamma_{i+1,i} = K^{-1}r_i + \sum_{j\le i} x_j\gamma_{ji}.
You may recognize the Gram-Schmidt orthogonalization in this (see appendix A.1.1 for an explanation); we can
use modified Gram-Schmidt by rewriting the algorithm as:
Let r_0 be given
For i ≥ 0:
  let s ← K^{-1}r_i
  let t ← AK^{-1}r_i
  for j ≤ i:
    let γ_j be the coefficient so that t − γ_jr_j ⊥ r_j
    form s ← s − γ_jx_j
    and t ← t − γ_jr_j
  let x_{i+1} = (\sum_j γ_j)^{-1}s, r_{i+1} = (\sum_j γ_j)^{-1}t.
These two versions of the FOM algorithm are equivalent in exact arithmetic, but differ in practical circumstances in
two ways:
• The modified Gram-Schmidt method is more numerically stable;
• The unmodified method allows you to compute all inner products simultaneously. We discuss this below
in section 6.6.
Even though the FOM algorithm is not used in practice, these computational considerations carry over to the
GMRES method below.
x_{i+1} = x_i − δ_ip_i,

and
• A construction of the search direction from the residuals known so far:

p_i = K^{-1}r_i + \sum_{j<i} \beta_{ij}K^{-1}r_j.
where the first and third equation were introduced above, and the second can be found by multiplying the first by A
(check this!).
We simplify the recursive derivation by introducing quantities
• x1 , r1 , p1 are the current iterate, residual, and search direction. Note that the subscript 1 does not denote
the iteration number here.
• x2 , r2 , p2 are the iterate, residual, and search direction that we are about to compute. Again, the subscript
does not equal the iteration number.
• X0 , R0 , P0 are all previous iterates, residuals, and search directions bundled together in a block of vectors.
In terms of these quantities, the update equations are then
x_2 = x_1 − δ_1p_1
r_2 = r_1 − δ_1Ap_1 \qquad (5.21)
p_2 = K^{-1}r_2 + υ_{12}p_1 + P_0u_{02}
where δ1 , υ12 are scalars, and u02 is a vector with length the number of iterations before the current. We now derive
δ1 , υ12 , u02 from the orthogonality of the residuals. To be specific, the residuals have to be orthogonal under the
K^{-1} inner product: we want to have

r_2^tK^{-1}r_1 = 0, \qquad r_2^tK^{-1}R_0 = 0.

\left.\begin{array}{l} r_1^tK^{-1}r_2 = 0 \\ r_2 = r_1 - \delta_1AK^{-1}p_1 \end{array}\right\}
\;\Rightarrow\; \delta_1 = \frac{r_1^tr_1}{r_1^tAK^{-1}p_1}.
Finding υ12 , u02 is a little harder. For this, we start by summarizing the relations for the residuals and search
directions in equation (5.20) in block form as
(R_0, r_1, r_2)\begin{pmatrix}
1 & & & & \\
-1 & 1 & & & \\
& \ddots & \ddots & & \\
& & -1 & 1 & \\
& & & -1 & 1
\end{pmatrix} = A(P_0, p_1, p_2)\,\mathrm{diag}(D_0, d_1, d_2)

(P_0, p_1, p_2)\begin{pmatrix}
I - U_{00} & -u_{01} & -u_{02} \\
& 1 & -\upsilon_{12} \\
& & 1
\end{pmatrix} = K^{-1}(R_0, r_1, r_2)
or abbreviated RJ = APD, P(I − U) = K^{-1}R where J is the matrix with identity diagonal and minus identity
subdiagonal. We then observe that
• R^tK^{-1}R is diagonal, expressing the orthogonality of the residuals.
• Combining that R^tK^{-1}R is diagonal and P(I − U) = K^{-1}R gives that R^tP = R^tK^{-1}R(I − U)^{-1}. We
now reason that (I − U)^{-1} is upper triangular, so R^tP is upper triangular. This tells us that quantities such as
r_2^tp_1 are zero.
• Combining the relations for R and P, we get first that

R^tK^{-t}AP = R^tK^{-t}RJD^{-1}

and then

P^tAP = (I - U)^{-t}R^tK^{-t}RJD^{-1}.

Here D and R^tK^{-1}R are diagonal, and (I − U)^{-t} and J are lower triangular, so P^tAP is lower triangular.
• This tells us that P0t Ap2 = 0 and pt1 Ap2 = 0.
• Taking the product of P0t A, pt1 A with the definition of p2 in equation (5.21) gives
u02 = −(P0t AP0 )−1 P0t AK −1 r2 , υ12 = −(pt1 Ap1 )−1 pt1 AK −1 r2 .
• If A is symmetric, P t AP is lower triangular (see above) and symmetric, so it is in fact diagonal. Also,
Rt K −t AP is lower bidiagonal, so, using A = At , P t AK −1 R is upper bidiagonal. Since P t AK −1 R =
P t AP (I − U ), we conclude that I − U is upper bidiagonal, so, only in the symmetric case, u02 = 0.
Some observations about this derivation.
• Strictly speaking we are only proving necessary relations here. It can be shown that these are sufficient
too.
• There are different formulas that wind up computing the same vectors, in exact arithmetic. For instance,
it is easy to derive that p_1^tr_1 = r_1^tr_1, so this can be substituted in the formulas just derived. The CG
method, as it is typically implemented, is given in figure 5.2; a sketch along those lines follows below.
• In the k-th iteration, computing P0t Ar2 (which is needed for u02 ) takes k inner products. First of all, inner
products are disadvantageous in a parallel context. Secondly, this requires us to store all search directions
indefinitely. This second point implies that both work and storage go up with the number of iterations.
Contrast this with the stationary iteration scheme, where storage was limited to the matrix and a few
vectors, and work in each iteration was the same.
• The objections just raised disappear in the symmetric case. Since u02 is zero, the dependence on P0
disappears, and only the dependence on p1 remains. Thus, storage is constant, and the amount of work
per iteration is constant. The number of inner products per iteration can be shown to be just two.
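As an illustration of the symmetric case, here is a Python sketch of the preconditioned CG iteration in the form in which it is commonly implemented, using the equivalent substitutions mentioned above (the names pcg and Kinv are ours; Kinv applies K^{-1}):

    import numpy as np

    def pcg(A, b, Kinv, x0, tol=1e-8, maxit=500):
        # Preconditioned CG with the book's sign convention r = Ax - b.
        x = x0.copy()
        r = A @ x - b
        z = Kinv(r)
        p = z.copy()                       # first search direction
        rz = r @ z
        for i in range(maxit):
            Ap = A @ p
            delta = rz / (p @ Ap)
            x = x - delta * p              # x_{i+1} = x_i - delta_i p_i
            r = r - delta * Ap
            if np.linalg.norm(r) < tol * np.linalg.norm(b):
                return x, i + 1
            z = Kinv(r)
            rz_new = r @ z
            p = z + (rz_new / rz) * p      # new direction: only p_1 is needed
            rz = rz_new
        return x, maxit

    # usage sketch: Jacobi (diagonal) preconditioner for an SPD matrix A
    # x, its = pcg(A, b, lambda r: r / np.diag(A), np.zeros(len(b)))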
Exercise 5.33. Do a flop count of the various operations in one iteration of the CG method. Assume
that A is the matrix of a five-point stencil and that the preconditioner M is an incomplete
factorization of A (section 5.5.5). Let N be the matrix size.
If we accept the fact that the function f has a minimum, which follows from the positive definiteness, we find the
minimum by computing the derivative
f'(x) = Ax − b
and asking where f'(x) = 0. And, presto, there we have the original linear system.
Exercise 5.34. Derive the derivative formula above. (Hint: write out the definition of derivative as
limh↓0 . . ..) Note that this requires A to be symmetric.
For the derivation of the iterative method, the basic form

x_{i+1} = x_i + p_i\delta_i, \qquad \delta_i = \frac{r_i^tp_i}{p_i^tAp_i}

is then derived as the one that minimizes the function f along the line x_i + δp_i.
The construction of the search direction from the residuals follows by an induction proof from the requirement that the
residuals be orthogonal. For a typical proof, see [13].
5.5.12 GMRES
In the discussion of the CG method above, it was pointed out that orthogonality of the residuals requires storage of
all residuals, and k inner products in the k’th iteration. Unfortunately, it can be proved that the work savings of the
CG method can, for all practical purposes, not be found outside of SPD matrices [33].
The GMRES method is a popular implementation of such full orthogonalization schemes. In order to keep the
computational costs within bounds, it is usually implemented as a restarted method. That is, only a certain number
(say k = 5 or 20) of residuals is retained, and every k iterations the method is restarted. The code can be found in
figure 5.3.
5.5.13 Complexity
The efficiency of Gaussian elimination was fairly easy to assess: factoring and solving a system takes, deterministically,
\frac13n^3 operations. For an iterative method, the operation count is the product of the number of operations per
iteration times the number of iterations. While each individual iteration is easy to analyze, there is no good theory
to predict the number of iterations. (In fact, an iterative method may not even converge to begin with.) Added to
this is the fact that Gaussian elimination can be coded in such a way that there is considerable cache reuse, making
the algorithm run at a fair percentage of the computer’s peak speed. Iterative methods, on the other hand, are much
slower on a flops per second basis.
All these considerations make the application of iterative methods to linear system solving somewhere in between
a craft and a black art. In practice, people do considerable experimentation to decide whether an iterative method
will pay off, and if so, which method is preferable.
In this section we will discuss a number of issues pertaining to linear algebra on parallel computers. We will take
a realistic view of this topic, assuming that the number of processors is finite, and that the problem data is always
large, relative to the number of processors. We will also pay attention to the physical aspects of the communication
network between the processors.
We will start off with a short section on the asymptotic analysis of parallelism, after which we will analyze various
linear algebra operations, including iterative methods, and their behaviour in the presence of a network with finite
bandwidth and finite connectivity. This chapter will conclude with various short remarks regarding complications
in algorithms that arise due to parallel execution.
6.1 Asymptotics
If we ignore limitations such as that the number of processors has to be finite, or the physicalities of the interconnect
between them, we can derive theoretical results on the limits of parallel computing. This section will give a brief
introduction to such results, and discuss their connection to real life high performance computing.
Consider for instance the matrix-matrix multiplication C = AB, which takes 2N^3 operations where N is the matrix
size. Since there are no dependencies between the operations for the elements of C, we can perform them all in
parallel. If we had N^2 processors, we could assign each to an (i, j) coordinate in C, and have it compute c_{ij} in 2N
time. Thus, this parallel operation has efficiency 1, which is optimal.
Exercise 6.1. Adding N numbers {x_i}_{i=1...N} can be performed in log_2 N time with N/2 processors.
As a simple example, consider the sum of n numbers: s = \sum_{i=1}^n a_i. If we have n/2 processors
we could compute:
1. Define s_i^{(0)} = a_i.
2. Iterate with j = 1, . . . , log_2 n:
3. Compute n/2^j partial sums s_i^{(j)} = s_{2i}^{(j-1)} + s_{2i+1}^{(j-1)}
We see that the n/2 processors perform a total of n operations (as they should) in log_2 n time.
The efficiency of this parallel scheme is O(1/\log_2 n), a slowly decreasing function of n. Show
that, using this scheme, you can multiply two matrices in log_2 N time with N^3/2 processors.
What is the resulting efficiency?
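The summation scheme is easy to simulate sequentially; in the following Python sketch (ours, not from the text) each pass of the while loop corresponds to one parallel step, in which all the pairwise additions of that step could be done simultaneously by different processors.

    import math

    def tree_sum(a):
        s = list(a)                       # s_i^{(0)} = a_i
        steps = 0
        while len(s) > 1:
            # one parallel step: n/2^j partial sums s_i = s_{2i} + s_{2i+1}
            s = [s[2*i] + s[2*i+1] for i in range(len(s) // 2)] + \
                (s[-1:] if len(s) % 2 else [])
            steps += 1
        return s[0], steps

    total, steps = tree_sum(range(16))
    print(total, steps, math.log2(16))    # 120 in 4 steps = log2(16)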
It is now a legitimate theoretical question to ask
• If we had infinitely many processors, what is the lowest possible time complexity for matrix-matrix
multiplication, or
• Are there faster algorithms that still have O(1) efficiency?
Such questions have been researched (see for instance [50]), but they have little bearing on high performance
computing.
A first objection to these kinds of theoretical bounds is that they implicitly assume some form of shared memory.
In fact, the formal model for these algorithms is called a Programmable Random Access Machine (PRAM), where
the assumption is that every memory location is accessible to any processor. Often an additional assumption is
made that multiple accesses to the same location are in fact possible¹. These assumptions are unrealistic in practice,
especially in the context of scaling up the problem size and the number of processors.
But even if we take distributed memory into account, theoretical results can still be unrealistic. The above summa-
tion algorithm can indeed work unchanged in distributed memory, except that we have to worry about the distance
between active processors increasing as we iterate further. If the processors are connected by a linear array, the
number of ‘hops’ between active processors doubles, and with that, asymptotically, the computation time of the it-
eration. The total execution time then becomes n/2, a disappointing result given that we throw so many processors
at the problem.
What if the processors are connected with a hypercube topology? It is not hard to see that the summation algorithm
can then indeed work in log2 n time. However, as n → ∞, can we build a sequence of hypercubes of n nodes
and keep the communication time between two connected constant? Since communication time depends on latency,
which partly depends on the length of the wires, we have to worry about the physical distance between nearest
neighbours.
The crucial question here is whether the hypercube (an n-dimensional object) can be embedded in 3-dimensional
space, while keeping the distance (measured in meters) constant between connected neighbours. It is easy to see
that a 3-dimensional grid can be scaled up arbitrarily, but the question is not clear for a hypercube. There, the length
of the wires may have to increase as n grows, which runs afoul of the finite speed of electrons.
We sketch a proof (see [35] for more details) that, in our three dimensional world and with a finite speed of light,
speedup is limited to \sqrt[4]{n} for a problem on n processors, no matter the interconnect. The argument goes as follows.
Consider an operation where a final result is collected on one processor. Assume that each processor takes a unit
volume of space, produces one result per unit time, and can send one data item per unit time. Then, in an amount of
time t, at most t^3 processors can contribute to the final result; all others are too far away. In time T, then, the number
of operations that can contribute to the final result is \int_0^T t^3\,dt = O(T^4). This means that the maximum achievable
speedup is the fourth root of the sequential time.
1. This notion can be made precise; for instance, one talks of a CREW-PRAM, for Concurrent Read, Exclusive Write PRAM.
Finally, the question ‘what if we had infinitely many processors’ is not realistic as such, but we will allow it in the
sense that we will ask the weak scaling question (section 2.7.3) ‘what if we let the problem size and the number of
processors grow proportional to each other’. This question is legitimate, since it corresponds to the very practical
deliberation whether buying more processors will allow one to run larger problems, and if so, with what ‘bang for
the buck’.
We now reason:
• If processor p has all xj values, the matrix-vector product can trivially be executed, and upon completion,
the processor has the correct values yj for j ∈ Ip .
• This means that every processor needs to have a copy of x, which is wasteful. Also it raises the question
of data integrity: you need to make sure that each processor has the correct value of x.
• In certain practical applications (for instance iterative methods, as you have seen before), the output of
the matrix-vector product is, directly or indirectly, the input for a next matrix-vector operation. This is
certainly the case for the power method which computes x, Ax, A2 x, . . .. Since our operation started
with each processor having the whole of x, but ended with it owning only the local part of Ax, we have
a mismatch.
• Maybe it is better to assume that each processor, at the start of the operation, has only the local part of x,
that is, those xi where i ∈ Ip , so that the start state and end state of the algorithm are the same. This
means we have to change the algorithm to include some communication that allows each processor to
obtain those values x_i where i ∉ I_p.
Input: Processor number p; the elements x_i with i ∈ I_p; matrix elements A_{ij} with i ∈ I_p.
Output: The elements y_i with i ∈ I_p
for i ∈ I_p do
  s ← 0
  for j ∈ I_p do
    s ← s + a_{ij}x_j
  end
  for j ∉ I_p do
    send x_j from the processor that owns it to the current one, then
    s ← s + a_{ij}x_j
  end
end
Procedure Naive Parallel MVP(A, x_local, y_local, p)
Exercise 6.2. Go through a similar reasoning for the case where the matrix is decomposed in block
columns. Describe the parallel algorithm in detail, like above, without giving pseudo code.
Let us now look at the communication in detail: we will consider a fixed processor p and consider the operations it
performs and the communication that necessitates.
The matrix-vector product evaluates, on processor p, for each i ∈ I_p, the summation

y_i = \sum_j a_{ij}x_j.
If j ∈ I_p, the instruction y_i ← y_i + a_{ij}x_j is trivial; let us therefore consider only the case j ∉ I_p. It would be nice
if we could just write the statement
y(i) = y(i) + a(i,j)*x(j)
and some lower layer would automatically transfer x(j), from whatever processor it is stored on, to a local register.
(The PGAS languages (section 2.5.4) aim to do this, but their efficiency is far from guaranteed.) An implementation,
based on this optimistic view of parallelism, is given in figure 6.1.
The immediate problem with such a ‘local’ approach is that too much communication will take place.
• If the matrix A is dense, the element x_j is necessary once for each row i ∈ I_p, and it will thus be fetched
once for every row i ∈ I_p.
• For each processor q ≠ p, there will be a (large) number of elements x_j with j ∈ I_q that need to be
transferred from processor q to p. Doing this in separate messages, rather than one bulk transfer, is very
wasteful.
With shared memory these issues are not much of a problem, but in the context of distributed memory it is better to
take a buffering approach.
Instead of communicating individual elements of x, we use a local buffer Bpq for each processor q 6= p where we
collect the elements from q that are needed to perform the product on p. (See figure 6.2 for an illustration.) The
parallel algorithm is given in figure 6.3.
In addition to preventing an element from being fetched more than once, this also combines many small mes-
sages into one large message, which is usually more efficient; recall our discussion of bandwidth and latency in
section 2.6.6.
Exercise 6.3. Give pseudocode for the matrix-vector product using nonblocking operations (section 2.5.3.3).
Above we said that having a copy of the whole of x on each processor was wasteful in space. The implicit argument
here is that, in general, we do not want local storage to be a function of the number of processors: ideally it should be
only a function of the local data. (This is related to weak scaling; section 2.7.3.)
Exercise 6.4. Make this precise. How many rows can we store locally, given a matrix size of N and
local memory of M numbers? What is the exact buffer space required?
You see that, because of communication considerations, we have actually decided that it is unavoidable, or at least
preferable, for each processor to store the whole input vector. Such trade-offs between space and time efficiency are
fairly common in parallel programming. For the dense matrix-vector product we can actually defend this overhead,
since the vector storage is of lower order than the matrix storage, so our over-allocation is percentagewise small.
Below, we will see that for the sparse matrix-vector product the overhead can be much less.
Input: Processor number p; the elements x_i with i ∈ I_p; matrix elements A_{ij} with i ∈ I_p.
Output: The elements y_i with i ∈ I_p
for q ≠ p do
  Send elements of x from processor q to p, receive in buffer B_pq.
end
y_local ← Ax_local
for q ≠ p do
  y_local ← y_local + A_{pq}B_{pq}
end
Procedure Parallel MVP(A, x_local, y_local, p)
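As an illustration (assuming the mpi4py library; the variable names and the local block size are ours), the buffering strategy for the rowwise distribution can be realized with a single allgather that plays the role of the buffers B_pq:

    # run with:  mpirun -n <nprocs> python this_script.py
    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    p, nprocs = comm.Get_rank(), comm.Get_size()
    nlocal = 4                                   # rows owned by this process
    n = nlocal * nprocs

    np.random.seed(p)
    A_local = np.random.rand(nlocal, n)          # my block of rows
    x_local = np.random.rand(nlocal)             # my part of the input vector

    x_parts = comm.allgather(x_local)            # one bulk exchange instead of many small messages
    x = np.concatenate(x_parts)                  # full input vector
    y_local = A_local @ x                        # local part of y, no further communication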
It is easy to see that the parallel dense matrix-vector product, as described above, has perfect speedup if we are al-
lowed to ignore the time for communication. In the next section you will see that the rowwise implementation above
is not optimal if we take communication into account. For scalability (section 2.7.3) we need a two-dimensional
decomposition.
The main implication of the architectural model above is that the number of active processors can only double
in each step of an algorithm. For instance, to do a broadcast, first processor 0 sends to 1, then 0 and 1 can send
to 2 and 3, then 0–3 send to 4–7, et cetera. This cascade of messages is called a minimum spanning tree of the
processor network.
6.3.1.1 Broadcast
In a broadcast, a single processor sends its data to all others. By the above doubling argument, we conclude that a
broadcast to p processors takes time at least ⌈log_2 p⌉ steps with a total latency of ⌈log_2 p⌉α. Since n elements are
sent, this adds nβ, giving a total cost lower bound of

⌈log_2 p⌉α + nβ.
6.3.1.2 Reduction
By running the broadcast backwards in time, we see that a reduction operation has the same lower bound on the
communication of ⌈log_2 p⌉α + nβ. A reduction operation also involves computation, with a total time of (p − 1)γn:
each of n items gets reduced over p processors. Since these operations can potentially be parallelized, the lower
bound on the computation is \frac{p-1}{p}γn, giving a total of

⌈log_2 p⌉α + nβ + \frac{p-1}{p}γn.
We illustrate this again, using the notation x_i^{(j)} for the data item i that was originally on processor j, and x_i^{(j:k)} for
the sum of the items i of processors j . . . k.
Victor Eijkhout
166 CHAPTER 6. HIGH PERFORMANCE LINEAR ALGEBRA
6.3.1.3 Allreduce
The cost of an allreduce is, somewhat remarkably, almost the same as of a simple reduction: since in a reduction not
all processors are active at the same time, we assume that the extra work can be spread out perfectly. This means
that the lower bound on the latency and computation stays the same. For the bandwidth we reason as follows: in
order for the communication to be perfectly parallelized, \frac{p-1}{p}n items have to arrive at, and leave, each processor.
Thus we have a total time of

⌈log_2 p⌉α + 2\frac{p-1}{p}nβ + \frac{p-1}{p}nγ.
6.3.1.4 Allgather
Again we assume that gathers with multiple targets are active simultaneously. Since every processor originates a
minimum spanning tree, we have ⌈log_2 p⌉α latency; since each processor receives n/p elements from p − 1 processors,
there is a \frac{p-1}{p}nβ bandwidth cost. The total cost for constructing a length n vector is then

⌈log_2 p⌉α + \frac{p-1}{p}nβ.
At time t = 1, there is an exchange between neighbours p0 , p1 and likewise p2 , p3; at t = 2 there is an exchange
over distance two between p0 , p2 and likewise p1 , p3 .
6.3.1.5 Reduce-scatter
In a reduce-scatter operation, processor i has an item x_i^{(i)}, and it needs \sum_j x_i^{(j)}. We could implement this by doing
a size p reduction, collecting the vector (\sum_i x_0^{(i)}, \sum_i x_1^{(i)}, . . .) on one processor, and scattering the results. However,
it is possible to combine these operations:
The reduce-scatter can be considered as an allgather run in reverse, with arithmetic added, so the cost is

⌈log_2 p⌉α + \frac{p-1}{p}n(β + γ).
Cost analysis The total cost of the algorithm is given by, approximately,

T_p(n) = T_p^{\text{1D-row}}(n) = \frac{2n^2}{p}\gamma + \underbrace{\log_2(p)\alpha + n\beta}_{\text{Overhead}}.

The parallel efficiency is then

E_p^{\text{1D-row}}(n) = \frac{S_p^{\text{1D-row}}(n)}{p}
= \frac{1}{1 + \frac{p\log_2(p)}{2n^2}\frac{\alpha}{\gamma} + \frac{p}{2n}\frac{\beta}{\gamma}}.
Thus, if one can make the problem large enough, eventually the parallel efficiency is nearly perfect. However, this
assumes unlimited memory, so this analysis is not practical.
A pessimist’s view In a strong scalability analysis, one fixes n and lets p get large, to get

\lim_{p\to\infty} E_p(n) = \lim_{p\to\infty} \frac{1}{1 + \frac{p\log_2(p)}{2n^2}\frac{\alpha}{\gamma} + \frac{p}{2n}\frac{\beta}{\gamma}} = 0.
A realist’s view In a more realistic view we increase the number of processors with the amount of data. This is
called weak scalability, and it makes the amount of memory that is available to store the problem scale linearly
with p.
Let M equal the number of floating point numbers that can be stored in a single node’s memory. Then the aggregate
memory is given by M p. Let nmax (p) equal the largest problem size that can be stored in the aggregate memory of
p nodes. Then, if all memory can be used for the matrix,

(n_{\max}(p))^2 = Mp \quad\text{or}\quad n_{\max}(p) = \sqrt{Mp}.
The question now becomes what the parallel efficiency is for the largest problem that can be stored on p nodes:

E_p^{\text{1D-row}}(n_{\max}(p)) = \frac{1}{1 + \frac{p\log_2(p)}{2(n_{\max}(p))^2}\frac{\alpha}{\gamma} + \frac{p}{2n_{\max}(p)}\frac{\beta}{\gamma}}
= \frac{1}{1 + \frac{\log_2(p)}{2M}\frac{\alpha}{\gamma} + \frac{\sqrt p}{2\sqrt M}\frac{\beta}{\gamma}}.
Now, if one analyzes what happens when the number of nodes becomes large, one finds that

\lim_{p\to\infty} E_p(n_{\max}(p)) = \lim_{p\to\infty} \frac{1}{1 + \frac{\log_2(p)}{2M}\frac{\alpha}{\gamma} + \frac{\sqrt p}{2\sqrt M}\frac{\beta}{\gamma}} = 0.

Thus, this parallel algorithm for matrix-vector multiplication does not scale.
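The efficiency formula above is easy to evaluate numerically; the following Python lines (with made-up values for α, β, γ and M, not measurements of any machine) show how the efficiency of the rowwise algorithm decays as p grows under weak scaling.

    import math

    # alpha = latency, beta = per-word transfer time, gamma = time per flop,
    # M = words of matrix storage per node (all illustrative values)
    alpha, beta, gamma, M = 1e-6, 1e-9, 1e-11, 1e8

    def efficiency_1d_row(p):
        return 1.0 / (1.0 + math.log2(p) / (2 * M) * (alpha / gamma)
                          + math.sqrt(p) / (2 * math.sqrt(M)) * (beta / gamma))

    for p in [16, 256, 4096, 65536]:
        print(p, efficiency_1d_row(p))   # efficiency decays as p grows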
Alternatively, a realist realizes that he/she has a limited amount of time, Tmax on his/her hands. Under the best of
circumstances, that is, with zero communication overhead, the largest problem that we can solve in time Tmax is
given by
T_p(n_{\max}(p)) = \frac{2(n_{\max}(p))^2}{p}\gamma = T_{\max}.

Thus

(n_{\max}(p))^2 = \frac{T_{\max}\,p}{2\gamma} \quad\text{or}\quad n_{\max}(p) = \frac{\sqrt{T_{\max}}\,\sqrt p}{\sqrt{2\gamma}}.
Then the parallel efficiency that is attained by the algorithm for the largest problem that can be solved in time T_{\max}
is given by

E_{p,n_{\max}} = \frac{1}{1 + \frac{\log_2 p}{T}\alpha + \sqrt{\frac pT}\frac{\beta}{\gamma}}

and the parallel efficiency as the number of nodes becomes large approaches

\lim_{p\to\infty} E_p = \sqrt{\frac{T\gamma}{p\beta}}.
Again, efficiency cannot be maintained as the number of processors increases and the execution time is capped.
in a reduce-scatter operation: each processor i scatters a part (Ai xi )j of its result to processor j. The receiving
processors then perform a reduction, adding all these fragments:
y_j = \sum_i (A_ix_i)_j.
Cost analysis The total cost of the algorithm is given by, approximately,
T_p^{1D-col}(n) = 2(n²/p)γ + log₂(p)α + n(β + γ),

where again the last two terms constitute the parallel overhead.
Notice that this is identical to the cost Tp1D-row (n), except with β replaced by (β + γ). It is not hard to see that the
conclusions about scalability are the same.
Next, partition

\[ A \rightarrow \begin{pmatrix} A_{00} & A_{01} & \dots & A_{0,p-1} \\ A_{10} & A_{11} & \dots & A_{1,p-1} \\ \vdots & \vdots & & \vdots \\ A_{p-1,0} & A_{p-1,1} & \dots & A_{p-1,p-1} \end{pmatrix}, \quad x \rightarrow \begin{pmatrix} x_0 \\ x_1 \\ \vdots \\ x_{p-1} \end{pmatrix}, \quad \text{and} \quad y \rightarrow \begin{pmatrix} y_0 \\ y_1 \\ \vdots \\ y_{p-1} \end{pmatrix}, \]

where A_{ij} ∈ R^{n_i×n_j} and x_i, y_i ∈ R^{n_i}, with ∑_{i=0}^{p−1} n_i = n and n_i ≈ n/p.
We will view the nodes as an r × c mesh, with p = rc, and index them as p_{ij}, with i = 0, . . . , r − 1 and j = 0, . . . , c − 1. The following illustration of a 12 × 12 matrix on a 3 × 4 processor grid shows the assignment of data to nodes, where the i, j "cell" shows the matrix and vector elements owned by p_{ij}:
[Layout illustration: a 12 × 12 matrix and the vectors x and y distributed over a 3 × 4 processor grid. Processor p_{ij} owns a 4 × 3 block of matrix elements; the subvector x_j is spread over the processors of column j, and the subvector y_i over the processors of row i.]
In other words, pij owns the matrix block Aij and parts of x and y. This makes possible the following algorithm:
• Since xj is distributed over the jth column, the algorithm starts by collecting xj on each processor pij by
an allgather inside the processor columns.
• Each processor pij then computes yij = Aij xj . This involves no further communication.
• The result yi is then collected by gathering together the pieces yij in each processor row to form yi ,
and this is then distributed over the processor row. These two operations are in fact combined to form a
reduce-scatter.
• If r = c, we can transpose the y data over the processors, so that it can function as the input for a
subsequent matrix-vector product. If, on the other hand, we are computing At Ax, then y is now correctly
distributed for the At product.
Cost analysis The total cost of the algorithm is given by, approximately,
T_p(n) = T_p^{r×c}(n) = 2(n²/p)γ + log₂(p)α + (n/c + n/r)β + (n/r)γ,

where the terms after the first constitute the parallel overhead.
We will now make the simplification that r = c = √p, so that

T_p(n) = T_p^{√p×√p}(n) = 2(n²/p)γ + log₂(p)α + (n/√p)(2β + γ),

where again the terms after the first constitute the parallel overhead.
We again ask the question what the parallel efficiency for the largest problem that can be stored on p nodes is.
E_p^{√p×√p}(n_max(p)) = 1 / (1 + (p log₂(p))/(2n²) · α/γ + √p/(2n) · (2β+γ)/γ)
                      = 1 / (1 + log₂(p)/(2M) · α/γ + 1/(2√M) · (2β+γ)/γ),
so that still

lim_{p→∞} E_p^{√p×√p}(n_max(p)) = lim_{p→∞} 1 / (1 + log₂(p)/(2M) · α/γ + 1/(2√M) · (2β+γ)/γ) = 0.
However, log₂ p grows very slowly with p and is therefore considered to act much like a constant. In this case E_p^{√p×√p}(n_max(p)) decreases very slowly and the algorithm is considered to be scalable for practical purposes.
Note that when r = p the 2D algorithm becomes the 'partitioned by rows' algorithm, and when c = p it becomes the 'partitioned by columns' algorithm. It is not hard to show that the 2D algorithm is scalable in the same sense as it is when r = c, as long as r/c is kept constant.
2. Gaussian elimination can be performed in right-looking, left-looking, and other variants; see [82].
S_{p;i} = {j : j ∉ I_p, a_{ij} ≠ 0}

y_i += a_{ij} x_j   if j ∈ S_{p;i}.
Figure 6.4: A difference stencil applied to a two-dimensional square domain, distributed over processors. A cross-
processor connection is indicated.
If we want to avoid, as above, a flood of small messages, we combine all communication into a single message per
processor. Defining
S_p = ∪_{i∈I_p} S_{p;i},

G = {j ∉ I_p : ∃_{i∈I_p} : a_{ij} ≠ 0}
processors in a P × P grid. Let the amount of work per processor be w and the communication time with each
neighbour c. Then the time to perform the total work on a single processor is T1 = P w, and the parallel time is
TP = w + 4c, giving a speed up of
SP = P w/(w + 4c) = P/(1 + 4c/w) ≈ P (1 − 4c/w).
Exercise 6.8. In this exercise you will analyze the parallel sparse matrix-vector product for a hypo-
thetical, but realistic, parallel machine. Let the machine parameters be characterized by (see
section 1.2.2):
You will note that, even though the communication during the matrix-vector product involves only a few neighbours
for each processor, giving a cost that is O(1) in the number of processors, the setup involves all-to-all communica-
tions, which is O(P ) in the number of processors. The setup can be reduced to O(log P ) with some trickery [34].
Exercise 6.10. The above algorithm for determining the communication part of the sparse matrix-vector
product can be made far more efficient if we assume a matrix that is structurally symmetric:
a_{ij} ≠ 0 ⇔ a_{ji} ≠ 0. Show that in this case no communication is needed to determine the
communication pattern.
Vector updates and inner products take only a small portion of the time of the algorithm.
Vector updates are trivially parallel, so we do not worry about them. Inner products are more interesting. Since every
processor is likely to need the value of the inner product, we use the following algorithm:
Algorithm: compute a ← x^t y where x, y are distributed vectors.
For each processor p do:
    compute a_p ← x_p^t y_p, where x_p, y_p are the parts of x, y stored on processor p
do a global reduction to compute a = ∑_p a_p;
broadcast the result
The reduction and broadcast (which can be joined into an Allreduce) combine data over all processors, so they have a communication time that increases with the number of processors. This makes the inner product potentially an expensive operation.
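As an illustration of the algorithm above, the following is a minimal MPI sketch of a distributed inner product in which the reduction and broadcast are indeed combined into a single MPI_Allreduce; the function and variable names are made up for this example.

#include <mpi.h>

/* Sketch: distributed inner product a = x^t y, where each process stores
   local parts xp, yp of length nlocal. The reduction and the broadcast
   are combined into one MPI_Allreduce, so that afterwards every process
   holds the global value. */
double distributed_inner_product(const double *xp, const double *yp,
                                 int nlocal, MPI_Comm comm) {
  double a_local = 0.0, a_global;
  for (int i = 0; i < nlocal; i++)
    a_local += xp[i]*yp[i];
  MPI_Allreduce(&a_local, &a_global, 1, MPI_DOUBLE, MPI_SUM, comm);
  return a_global;
}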
The crucial fact is that a matrix element a_{ij} is then the sum of computations, specifically certain integrals, over all elements that variables i and j share:

a_{ij} = ∑_{e : i,j∈e} a_{ij}^{(e)}.

In the above figure, a_{ij} is the sum of computations over elements 2 and 4. Now, the computations in each element share many common parts, so it is natural to assign each element e uniquely to a processor P_e, which then computes all contributions a_{ij}^{(e)}.
In section 6.2 we described how each variable i was uniquely assigned to a processor P_i. Now we see that it is not possible to make assignments P_e of elements and P_i of variables such that P_e computes in full the coefficients a_{ij} for
all i ∈ e. In other words, if we compute the contributions locally, there needs to be some amount of communication
to assemble certain matrix elements.
Exercise 6.11. Instead of storing D_A^{-1}, we could also store (D_A + L_A)^{-1}. Give a reason why this is a bad idea, even if the extra storage is no objection.
Exercise 6.12. Analyze the cost in storage and operations of solving a system with an incomplete factorization as described in section 5.5.5.
finished, the third processor has to wait for the second, and so on. The disappointing conclusion is that in parallel
only one processor will be active at any time, and the total time is the same as for the sequential algorithm.
In the next few subsections we will see different strategies for finding preconditioners that perform efficiently in
parallel.
Various approaches have been suggested to remedy this sequentiality of the triangular solve. For instance, we could simply let the processors ignore the components of x that should come from other processors:
for i=myfirstrow..mylastrow
x[i] = (y[i] - sum over j=myfirstrow..i-1 ell[i,j]*x[j]) / a[i,i]
This is not mathematically equivalent to the sequential algorithm (technically, it is called a block Jacobi method with ILU as the local solve), but since we're only looking for an approximation K ≈ A, this is simply a slightly cruder approximation.
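The pseudocode above could look as follows in C; the array names and the dense storage of the local triangular factor are chosen only for readability of this sketch.

/* Sketch: each processor performs a forward solve using only its own rows,
   ignoring couplings to components owned by other processors (block Jacobi
   with a local solve). ell[i][j] stands for the local lower triangular
   factor; the storage scheme is purely illustrative. */
void local_forward_solve(double **ell, const double *diag,
                         const double *y, double *x,
                         int myfirstrow, int mylastrow) {
  for (int i = myfirstrow; i <= mylastrow; i++) {
    double sum = 0.0;
    for (int j = myfirstrow; j < i; j++)      /* only locally owned columns */
      sum += ell[i][j]*x[j];
    x[i] = (y[i] - sum)/diag[i];
  }
}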
Exercise 6.13. Take the Gauss-Seidel code you wrote above, and simulate a parallel run. What is the effect of increasing the (simulated) number of processors?
The idea behind block methods can easily be appreciated pictorially; see figure 6.6. In effect, we ignore all connec-
tions between processors. Since in a BVP all points influence each other (see section 4.2.1), using a less connected
preconditioner will increase the number of iterations if executed on a sequential computer. However, block methods
are parallel and, as we observed above, a sequential preconditioner is very inefficient in a parallel context.
We observe that x_i directly depends on x_{i−1} and x_{i+1}, but not on x_{i−2} or x_{i+2}. Thus, let us see what happens if we permute the indices to group every other component together.
Pictorially, we take the points 1, . . . , n and colour them red and black (figure 6.7), then we permute them to first
take all red points, and subsequently all black ones. The correspondingly permuted matrix looks as follows:
\[
\begin{pmatrix}
a_{11} & & & & a_{12} & & \\
& a_{33} & & & a_{32} & a_{34} & \\
& & a_{55} & & & a_{54} & \ddots \\
& & & \ddots & & & \ddots \\
a_{21} & a_{23} & & & a_{22} & & \\
& a_{43} & a_{45} & & & a_{44} & \\
& & \ddots & \ddots & & & \ddots
\end{pmatrix}
\begin{pmatrix} x_1 \\ x_3 \\ x_5 \\ \vdots \\ x_2 \\ x_4 \\ \vdots \end{pmatrix}
=
\begin{pmatrix} y_1 \\ y_3 \\ y_5 \\ \vdots \\ y_2 \\ y_4 \\ \vdots \end{pmatrix}
\]
What does this buy us? Well, the odd numbered components x3 , x5 , . . . can now all be solved without any pre-
requisites, so if we divide them over the processors, all processors will be active at the same time. After this, the
even numbered components x2 , x4 , . . . can also be solved independently of each other, each needing the values of
the odd numbered solution components next to it. This is the only place where processors need to communicate:
suppose a processor has x100 , . . . x149 , then in the second solution stage the value of x99 needs to be sent from the
previous processor.
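A single red-black sweep for the one-dimensional three-point operator could look as follows; within each of the two phases all updates are independent, so each phase can be distributed over processors or threads. The constant stencil (−1, 2, −1) and the array layout are assumptions made only for this sketch.

/* Sketch: one red-black Gauss-Seidel sweep for the 1D three-point stencil
   -x[i-1] + 2x[i] - x[i+1] = y[i], i = 1..n, with x[0] = x[n+1] = 0
   (arrays of length n+2). First all odd (red) points are updated, then all
   even (black) points; inside each phase the updates are independent. */
void redblack_sweep(double *x, const double *y, int n) {
  #pragma omp parallel for
  for (int i = 1; i <= n; i += 2)            /* red points */
    x[i] = (y[i] + x[i-1] + x[i+1]) / 2.0;
  #pragma omp parallel for
  for (int i = 2; i <= n; i += 2)            /* black points */
    x[i] = (y[i] + x[i-1] + x[i+1]) / 2.0;
}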
Red-black ordering can be applied to two-dimensional problems too. Let us apply a red-black ordering to the
points (i, j) where 1 ≤ i, j ≤ n. Here we first apply a successive numbering to the odd points on the first line
(1, 1), (3, 1), (5, 1), . . ., then the even points of the second line (2, 2), (4, 2), (6, 2), . . ., the odd points on the third
line, et cetera. Having thus numbered half the points in the domain, we continue with the even points in the first
line, the odd points in the second, et cetera. As you can see in figure 6.8, now the red points are only connected to
black points, and the other way around. In graph theoretical terms, you have found a colouring (see appendix A.6
for the definition of this concept) of the matrix graph with two colours.
Exercise 6.14. Sketch the matrix structure that results from this ordering of the unknowns.
The red-black ordering of the previous section is a simple example of graph colouring (sometimes multi-colouring;
see also Appendix A.6). In simple cases, such as the unit square domain we considered in section 4.2.3 or its
extension to 3D, the colour number of the adjacency graph is easily determined.
(αI + d²/dx² + d²/dy²) u^(t+1) = u^(t)    (6.1)

Without proof, we state that the time-dependent problem can also be solved by

(βI + d²/dx²)(βI + d²/dy²) u^(t+1) = u^(t)    (6.2)
for suitable β. This scheme will not compute the same values on each individual time step, but it will converge to
the same steady state.
This approach has considerable advantages, mostly in terms of operation counts: the original system has to be solved either by making a factorization of the matrix, which incurs fill-in, or by solving it iteratively.
Exercise 6.15. Analyze the relative merits of these approaches, giving rough operation counts. Consider
both the case where α has dependence on t and where it does not. Also discuss the expected
speed of various operations.
A further advantage appears when we consider the parallel solution of (6.2). Note that we have a two-dimensional set of variables u_{ij}, but the operator βI + d²/dx² only connects u_{ij}, u_{i,j−1}, u_{i,j+1}. That is, each line corresponding to an i value can be processed independently. Thus, both operators can be solved fully in parallel using a one-dimensional partition of the domain. The solution of the system in (6.1), on the other hand, has limited parallelism.
Unfortunately, there is a serious complication: the operator in the x direction needs a partitioning of the domain in one direction, and the operator in y in the other. The solution usually taken is to transpose the u_{ij} value matrix in
between the two solves, so that the same processor decomposition can handle both. This transposition can take a
substantial amount of the processing time of each time step.
Exercise 6.16. Discuss the merits of and problems with a two-dimensional decomposition of the do-
main, using a grid of P = p×p processors. Can you suggest a way to ameliorate the problems?
One way to speed up these calculations, is to replace the implicit solve, by an explicit operation; see section 6.9.3.
Show that the matrix-vector product y ← Ax and the system solution x ← A^{-1}y have the same operation count.
Now consider parallelizing the product y ← Ax. Suppose we have n processors, and each
processor i stores xi and the i-th row of A. Show that the product Ax can be computed without
idle time on any processor but the first.
Figure 6.9: The difference stencil of the L factor of the matrix of a two-dimensional BVP
Can the same be done for the solution of the triangular system Ax = y? Show that the straight-
forward implementation has every processor idle for an (n − 1)/n fraction of the computation.
We will now see a number of ways of dealing with this inherently sequential component.
6.9.1 Wavefronts
Above, you saw that solving a lower triangular system of size N can have sequential time complexity of N steps. In
practice, things are often not quite that bad. Implicit algorithms such as solving a triangular system are inherently
sequential, but the number of steps can be less than is apparent at first.
Exercise 6.18. Take another look at the matrix from a two-dimensional BVP on the unit square, dis-
cretized with central differences. Derive the matrix structure if we order the unknowns by
diagonals. What can you say about the sizes of the blocks and the structure of the blocks
themselves?
Let us take another look at figure 4.1 that describes the difference stencil of a two-dimensional BVP. The cor-
responding picture for the lower triangular factor is in figure 6.9. This describes the sequentiality of the lower
triangular solve process. In other words, the value at point k can be found if its neighbours to the left (k − 1) and below (k − n) are known.
Now we see that, if we know x1 , we can not only find x2 , but also x1+n . In the next step we can determine x3 , xn+2 ,
and x2n+1 . Continuing this way, we can solve x by wavefronts: the values of x on each wavefront are independent,
so they can be solved in parallel in the same sequential step.
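A sketch of such a wavefront ordering of the lower triangular solve for the five-point stencil could look as follows; the unit diagonal and constant off-diagonal entries l_w, l_s are assumptions made only for this illustration.

/* Sketch: wavefront lower triangular solve on an n x n grid, where
   x(i,j) = y(i,j) - l_w * x(i,j-1) - l_s * x(i-1,j) (unit diagonal assumed).
   All points on an anti-diagonal w = i+j are independent, so each
   wavefront can be processed in parallel. */
#define IDX(i,j,n) ((i)*(n)+(j))
void wavefront_solve(double *x, const double *y,
                     double l_w, double l_s, int n) {
  for (int w = 0; w <= 2*(n-1); w++) {          /* loop over wavefronts */
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
      int j = w - i;
      if (j < 0 || j >= n) continue;            /* not on this wavefront */
      double v = y[IDX(i,j,n)];
      if (j > 0) v -= l_w * x[IDX(i,j-1,n)];
      if (i > 0) v -= l_s * x[IDX(i-1,j,n)];
      x[IDX(i,j,n)] = v;
    }
  }
}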
Exercise 6.19. Finish this argument. What is the maximum number of processors we can employ, and
what is the number of sequential steps? What is the resulting efficiency?
Now it is clear that the algorithm has a good deal of parallelism: the iterations in every ℓ-loop can be processed independently. However, these loops get shorter in every iteration of the outer k-loop, so it is not immediately clear how many processors we can accommodate. Moreover, it is not necessary to preserve the order of operations of the algorithm above. For instance, once the block A_22 has been updated in the k = 1 iteration, the factorization L_2 L_2^t = A_22 can start, even if the rest of the k = 1 iteration is still unfinished. Instead of looking
at the algorithm, it is a better idea to construct a Directed Acyclic Graph (DAG) (see section A.6 for a brief tutorial
on graphs) of the tasks of all inner iterations. Figure 6.10 shows the DAG of all tasks of a matrix of 4 × 4 blocks. This
graph is constructed by simulating the Cholesky algorithm above, making a vertex for every task, adding an edge
(i, j) if task j uses the output of task i.
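To make the task DAG concrete, here is a sketch of a right-looking blocked Cholesky factorization in which each inner operation becomes an OpenMP task and the DAG edges are expressed through depend clauses. The block kernels are simple unblocked reference versions written only for this illustration; a real code would call optimized BLAS/LAPACK routines, and the block size, block count, and function names are all made up.

#include <math.h>
#define B  64                       /* block size (illustrative) */
#define NB 4                        /* number of block rows/columns */

/* Unblocked reference kernels on B x B row-major blocks (illustration only). */
static void potrf(double *a) {                 /* a := its lower Cholesky factor */
  for (int j = 0; j < B; j++) {
    for (int k = 0; k < j; k++) a[j*B+j] -= a[j*B+k]*a[j*B+k];
    a[j*B+j] = sqrt(a[j*B+j]);
    for (int i = j+1; i < B; i++) {
      for (int k = 0; k < j; k++) a[i*B+j] -= a[i*B+k]*a[j*B+k];
      a[i*B+j] /= a[j*B+j];
    }
  }
}
static void trsm(const double *l, double *a) { /* a := a * l^{-t}, l lower triangular */
  for (int i = 0; i < B; i++)
    for (int j = 0; j < B; j++) {
      for (int k = 0; k < j; k++) a[i*B+j] -= a[i*B+k]*l[j*B+k];
      a[i*B+j] /= l[j*B+j];
    }
}
static void syrk(const double *a, double *c) { /* c := c - a*a^t */
  for (int i = 0; i < B; i++)
    for (int j = 0; j < B; j++)
      for (int k = 0; k < B; k++) c[i*B+j] -= a[i*B+k]*a[j*B+k];
}
static void gemm(const double *a, const double *b, double *c) { /* c := c - a*b^t */
  for (int i = 0; i < B; i++)
    for (int j = 0; j < B; j++)
      for (int k = 0; k < B; k++) c[i*B+j] -= a[i*B+k]*b[j*B+k];
}

/* Task-based factorization: Ablk[i][j] points to block (i,j), i >= j.
   The depend clauses encode exactly the edges of the task DAG. */
void block_cholesky(double *Ablk[NB][NB]) {
  #pragma omp parallel
  #pragma omp single
  for (int k = 0; k < NB; k++) {
    #pragma omp task depend(inout: Ablk[k][k][0])
    potrf(Ablk[k][k]);
    for (int i = k+1; i < NB; i++) {
      #pragma omp task depend(in: Ablk[k][k][0]) depend(inout: Ablk[i][k][0])
      trsm(Ablk[k][k], Ablk[i][k]);
    }
    for (int i = k+1; i < NB; i++)
      for (int j = k+1; j <= i; j++) {
        if (j == i) {
          #pragma omp task depend(in: Ablk[i][k][0]) depend(inout: Ablk[i][i][0])
          syrk(Ablk[i][k], Ablk[i][i]);
        } else {
          #pragma omp task depend(in: Ablk[i][k][0], Ablk[j][k][0]) \
                           depend(inout: Ablk[i][j][0])
          gemm(Ablk[i][k], Ablk[j][k], Ablk[i][j]);
        }
      }
  }
}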
Exercise 6.21. What is the diameter of this graph? Identify the tasks that lie on the path that determines
the diameter. What is the meaning of these tasks in the context of the algorithm? This path
is called the ‘critical path’. Its length determines the execution time of the computation in
parallel, even if an infinite number of processors is available.
Exercise 6.22. If there are T tasks – all taking unit time to execute – and we have p processors, what is
the theoretical minimum time to execute the algorithm? Now amend this formula to take into
account the critical path; call its length C.
In the execution of the tasks of a DAG, several observations can be made.
• If more than one update is made to a block, it is probably advantageous to have these updates be computed
by the same process. This simplifies maintaining cache coherence.
• If data is used and later modified, the use must be finished before the modification can start. This can even
be true if the two actions are on different processors, since the memory subsystem typically maintains
cache coherence, so the modifications can affect the process that is reading the data. This case can be
remedied by having a copy of the data in main memory, giving a reading process data that is reserved
(see section 1.3.1).
Molecular dynamics
Molecular dynamics is a technique for simulating the atom-by-atom behavior of molecules and deriving macro-
scopic properties from these atomistic motions. It has application to biological molecules such as proteins and
nucleic acids, as well as natural and synthetic molecules in materials science and nanotechnology. Molecular dy-
namics falls in the category of particle methods, which includes N-body problems in celestial mechanics and astro-
physics, and many of the ideas presented here will carry over to these other fields. In addition, there are special cases
of molecular dynamics including ab initio molecular dynamics where electrons are treated quantum mechanically
and thus chemical reactions can be modeled. We will not treat these special cases, but will instead concentrate on
classical molecular dynamics.
The idea behind molecular dynamics is very simple: a set of particles interact according to Newton’s law of motion,
F = ma. Given the initial particle positions and velocities, the particle masses and other parameters, as well as
a model of the forces that act between particles, Newton’s law of motion can be integrated numerically to give a
trajectory for each of the particles for all future (and past) time. Commonly, the particles reside in a computational
box with periodic boundary conditions.
A molecular dynamics time step is thus composed of two parts:
1: compute forces on all particles
2: update positions (integration).
The computation of the forces is the expensive part. State-of-the-art molecular dynamics simulations are performed
on parallel computers because the force computation is costly and a vast number of time steps are required for
reasonable simulation lengths. In many cases, molecular dynamics is applied to simulations on molecules with a
very large number of atoms as well, e.g., up to a million for biological molecules and long time scales, and up to
billions for other molecules and shorter time scales.
Numerical integration techniques are also of interest in molecular dynamics. For simulations that take a large num-
ber of time steps and for which the preservation of quantities such as energy is more important than order of
accuracy, the solvers that must be used are different than the traditional ODE solvers presented in Chapter 4.
In the following, we will introduce force fields used for biomolecular simulations and discuss fast methods for
computing these forces. Then we devote sections to the parallelization of molecular dynamics for short-range forces
and the parallelization of the 3-D FFT used in fast computations of long-range forces. We end with a section
introducing the class of integration techniques that are suitable for molecular dynamics simulations. Our treatment
of the subject of molecular dynamics in this chapter is meant to be introductory and practical; for more information,
the text [37] is recommended.
The potential is a function of the positions of all the atoms in the simulation. The force on an atom is the negative
gradient of this potential at the position of the atom.
The bonded energy is due to covalent bonds in a molecule,
E_bonded = ∑_bonds k_i (r_i − r_{i,0})² + ∑_angles k_i (θ_i − θ_{i,0})² + ∑_torsions V_n (1 + cos(nω − γ))
where the three terms are, respectively, sums over all covalent bonds, sums over all angles formed by two bonds,
and sums over all dihedral angles formed by three bonds. The fixed parameters ki , ri,0 , etc. depend on the types of
atoms involved, and may differ for different force fields. Additional terms or terms with different functional forms
are also commonly used.
The remaining two terms for the potential energy E are collectively called the nonbonded terms. Computationally,
they form the bulk of the force calculation. The electrostatic energy is due to atomic charges and is modeled by the
familiar
E_Coul = ∑_i ∑_{j>i} (q_i q_j) / (4πε₀ r_{ij})
where the sum is over all pairs of atoms, q_i and q_j are the charges on atoms i and j, and r_{ij} is the distance between atoms i and j. Finally, the van der Waals energy approximates the remaining attractive and repulsive effects, and is commonly modeled by the Lennard-Jones function

E_vdW = ∑_i ∑_{j>i} 4ε_{ij} [ (σ_{ij}/r_{ij})^{12} − (σ_{ij}/r_{ij})^6 ]
where ε_{ij} and σ_{ij} are force field parameters depending on atom types. At short distances, the repulsive (σ_{ij}/r_{ij})^{12} term is in effect, while at long distances, the dispersive (attractive) −(σ_{ij}/r_{ij})^6 term is in effect.
Parallelization of the molecular dynamics force calculation depends on parallelizing each of these individual types
of force calculations. The bonded forces are local computations in the sense that for a given atom, only nearby atom
positions and data are needed. The van der Waals forces are also local and are termed short-range because they
are negligible for large atom separations. The electrostatic forces are long-range, and various techniques have been
developed to speed up these calculations. In the next two subsections, we separately discuss the computation of
short-range and long-range nonbonded forces.
Figure 7.1: Computing nonbonded forces within a cutoff, rc . To compute forces involving the highlighted particle,
only particles in the shaded regions are considered.
Cell Lists
The idea of cell lists appears often in problems where a set of points that are nearby a given point is sought. Referring
to Fig. 7.1(a), where we illustrate the idea with a 2-D example, a grid is laid over the set of particles. If the grid
spacing is no less than rc , then to compute the forces on particle i, only the particles in the cell containing i and the
8 adjacent cells need to be considered. One sweep through all the particles is used to construct a list of particles for
each cell. These cell lists are used to compute the forces for all particles. At the next time step, since the particles
have moved, the cell lists must be regenerated or updated. The complexity of this approach is O(n) for computing
the data structure and O(n × nc ) for the force computation, where nc is the average number of particles in 9 cells
(27 cells in 3-D). The storage required for the cell list data structure is O(n).
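A minimal sketch of cell list construction in 2-D could look as follows; the linked-list-through-arrays representation (a head array per cell and a next array per particle) is one common choice, and all names, as well as the square box, are assumptions of this example. The head array must have room for all cells.

/* Sketch: build 2-D cell lists. Particles live in a square box [0,L) x [0,L);
   the grid spacing is chosen no smaller than the cutoff rc. For every cell,
   head[c] is the index of the first particle in that cell, and next[i] chains
   to the next particle in the same cell (-1 terminates the list). */
void build_cell_lists(const double *px, const double *py, int n,
                      double L, double rc, int *head, int *next, int *ncells) {
  int m = (int)(L/rc);            /* number of cells per dimension        */
  if (m < 1) m = 1;               /* so that the cell size L/m is >= rc   */
  *ncells = m*m;
  for (int c = 0; c < m*m; c++) head[c] = -1;
  for (int i = 0; i < n; i++) {   /* one sweep over all particles         */
    int cx = (int)(px[i]/(L/m)), cy = (int)(py[i]/(L/m));
    int c  = cy*m + cx;
    next[i] = head[c];            /* prepend particle i to its cell list  */
    head[c] = i;
  }
}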
In the Barnes-Hut method, space is recursively divided into 8 equal cells (in 3-D) until each cell contains zero or one
particles. Forces between nearby particles are computed individually, as normal, but for distant particles, forces are
computed between one particle and a set of distant particles within a cell. An accuracy measure is used to determine
if the force can be computed using a distant cell or must be computed by individually considering its children cells.
The Barnes-Hut method has complexity O(n log n). The fast multipole method has complexity O(n); this method
calculates the potential and does not calculate forces directly.
Particle-Mesh Methods
In particle-mesh methods, we exploit the Poisson equation
∇²φ = −(1/ε) ρ

which relates the potential φ to the charge density ρ, where 1/ε is a constant of proportionality. To utilize this
equation, we discretize space using a mesh, assign charges to the mesh points, solve Poisson’s equation on the mesh
to arrive at the potential on the mesh. The force is the negative gradient of the potential (for conservative forces
such as electrostatic forces). A number of techniques have been developed for distributing point charges in space to
a set of mesh points and also for numerically interpolating the force on the point charges due to the potentials at the
mesh points. Many fast methods are available for solving the Poisson equation, including multigrid methods and fast
Fourier transforms. With respect to terminology, particle-mesh methods are in contrast to the naive particle-particle
method where forces are computed between all pairs of particles.
It turns out that particle-mesh methods are not very accurate, and a more accurate alternative is to split each force
into a short-range, rapidly-varying part and a long-range, slowly-varying part.
One way to accomplish this easily is to weigh f by a function h(r), which emphasizes the short-range part (small r)
and by 1 − h(r) which emphasizes the long-range part (large r). The short-range part is computed by computing the
interaction of all pairs of particles within a cutoff (a particle-particle method) and the long-range part is computed
using the particle-mesh method. The resulting method, called particle-particle-particle-mesh (PPPM, or P3 M) is
due to Hockney and Eastwood, in a series of papers beginning in 1973.
Ewald Method
The Ewald method is the most popular of the methods described so far for electrostatic forces in biomolecular
simulations and was developed for the case of periodic boundary conditions. The structure of the method is similar
to PPPM in that the force is split between short-range and long-range parts. Again, the short-range part is computed
using particle-particle methods, and the long-range part is computed using Fourier transforms. Variants of the Ewald
method are very similar to PPPM in that the long-range part uses a mesh, and fast Fourier transforms are used to
solve the Poisson equation on the mesh. For additional details, see, for example [37]. In Section 7.3, we describe
the parallelization of the 3-D FFT to solve the 3-D Poisson equation.
Figure 7.2: Atom decomposition, showing a force matrix of 16 particles distributed among 8 processors. A dot
represents a nonzero entry in the force matrix. On the left, the matrix is symmetric; on the right, only one element
of a pair of skew-symmetric elements is computed, to take advantage of Newton’s third law.
An atom decomposition is illustrated by the force matrix in Fig. 7.2(a). For n particles, the force matrix is an n-by-n
matrix; the rows and columns are numbered by particle indices. A nonzero entry fij in the matrix denotes a nonzero
force on particle i due to particle j which must be computed. This force may be a nonbonded and/or a bonded
force. When cutoffs are used, the matrix is sparse, as in this example. The matrix is dense if forces are computed
between all pairs of particles. The matrix is skew-symmetric because of Newton’s third law, fij = −fji . The lines
in Fig. 7.2(a) show how the particles are partitioned. In the figure, 16 particles are partitioned among 8 processors.
Algorithm 1 shows one time step from the point of view of one processor. At the beginning of the time step, each
processor holds the positions of particles assigned to it.
An optimization is to halve the amount of computation, which is possible because the force matrix is skew-
symmetric. To do this, we choose exactly one of fij or fji for all skew-symmetric pairs such that each processor is
responsible for computing approximately the same number of forces. Choosing the upper or lower triangular part
of the force matrix is a bad choice because the computational load is unbalanced. A better choice is to compute fij
if i + j is even in the upper triangle, or if i + j is odd in the lower triangle, as shown in Fig. 7.2(b). There are many
other options.
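The choice just described can be expressed as a small predicate; the convention below is only a sketch of one of the possibilities.

/* Sketch: decide whether the force f_ij (i != j) is computed from the (i,j)
   entry or left to the skew-symmetric partner (j,i). Upper triangle: compute
   when i+j is even; lower triangle: compute when i+j is odd. Exactly one of
   (i,j) and (j,i) is selected, and the work is spread evenly over the rows. */
int computes_fij(int i, int j) {
  if (i < j) return (i + j) % 2 == 0;   /* upper triangular part */
  if (i > j) return (i + j) % 2 == 1;   /* lower triangular part */
  return 0;                             /* no self force         */
}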
When taking advantage of skew-symmetry in the force matrix, all the forces on a particle owned by a processor are
no longer computed by that processor. For example, in Fig. 7.2(b), the forces on particle 1 are no longer computed
only by the first processor. To complete the force calculation, processors must communicate to send forces that are
needed by other processors and receive forces that are computed by other processors. The above algorithm must
now be modified by adding a communication step (step 4) as shown in Algorithm 2.
This algorithm is advantageous if the extra communication is outweighed by the savings in computation. Note that
the amount of communication doubles in general.
In the figure, processor i is assigned to update the positions of particle i; in practical problems, a processor would be
assigned to update the positions of many particles. Note that, again, we first consider the case of a skew-symmetric
force matrix.
Figure 7.3: Force decomposition, showing a force matrix of 16 particles and forces partitioned among 16 processors.
We now examine the communication required in a time step for a force decomposition. Consider processor 3, which
computes partial forces for particles 0, 1, 2, 3, and needs positions from particles 0, 1, 2, 3, and also 12, 13, 14, 15.
Thus processor 3 needs to perform communication with processors 0, 1, 2, 3, and processors 12, 13, 14, 15. After
forces have been computed by all processors, processor 3 needs to collect forces on particle 3 computed by other
processors. Thus processor 3 needs to perform communication again with processors 0, 1, 2, 3.
Algorithm 3 shows what is performed in one time step, from the point-of-view of one processor. At the beginning
of the time step, each processor holds the positions of all the particles assigned to it.
In general, if there are p processors (and p is square, for simplicity), then the force matrix is partitioned into √p by √p blocks. The force decomposition just described requires a processor to communicate in three steps, with √p processors in each step. This is much more efficient than atom decompositions, which require communications among all p processors.
We can also exploit Newton’s third law in force decompositions. Like for atom decompositions, we first choose
a modified force matrix where only one of fij and fji is computed. The forces on particle i are computed by
a row of processors and now also by a column of processors. Thus an extra step of communication is needed
by each processor to collect forces from a column of processors for particles assigned to it. Whereas there were
three communication steps, there are now four communication steps when Newton’s third law is exploited (the
communication is not doubled in this case as in atom decompositions).
A modification to the force decomposition saves some communication. In Fig. 7.4, the columns are reordered
using a block-cyclic ordering. Consider again processor 3, which computes partial forces for particles 0, 1, 2, 3.
It needs positions from particles 0, 1, 2, 3, as before, but now also from particles 3, 7, 11, 15, owned by processors 3, 7, 11, 15. The latter are
processors in the same processor column as processor 3. Thus all communications are within the same processor
row or processor column, which may be advantageous on mesh-based network architectures. The modified method
is shown as Algorithm 4.
Figure 7.4: Force decomposition, with permuted columns in the force matrix. Note that columns 3, 7, 11, 15 are
now in the block column corresponding to processors 3, 7, 11, 15 (the same indices), etc.
Algorithm 4 Force decomposition time step, with permuted columns of force matrix
1: send positions of my assigned particles which are needed by other processors; receive row particle positions
needed by my processor (this communication is between processors in the same processor row, e.g., processor
3 communicates with processors 0, 1, 2, 3)
2: receive column particle positions needed by my processor (this communication is generally with processors in the
same processor column, e.g., processor 3 communicates with processors 3, 7, 11, 15)
3: (if nonbonded cutoffs are used) determine which nonbonded forces need to be computed
4: compute forces for my assigned particles
5: send forces needed by other processors; receive forces needed for my assigned particles (this communication
is between processors in the same processor row, e.g., processor 3 communicates with processors 0, 1, 2, 3)
6: update positions (integration) for my assigned particles
To exploit Newton’s third law, the shape of the import region can be halved. Now each processor only computes a
partial force on particles in its cell, and needs to receive forces from other processors to compute the total force on
these particles. Thus an extra step of communication is involved. We leave it as an exercise to the reader to work
out the details of the modified import region and the pseudocode for this case.
Figure 7.5: Spatial decomposition, showing particles in a 2-D computational box, (a) partitioned into 64 cells, (b) import region for one cell.
In the implementation of a spatial decomposition method, each cell is associated with a list of particles in its import region, similar to a Verlet neighbor list. Like a Verlet neighbor list, it is not necessary to update this list at every
time step, if the import region is expanded slightly. This allows the import region list to be reused for several time
steps, corresponding to the amount of time it takes a particle to traverse the width of the expanded region. This is
exactly analogous to Verlet neighbor lists.
In summary, the main advantage of spatial decomposition methods is that they only require communication between
processors corresponding to nearby particles. A disadvantage of spatial decomposition methods is that, for very
large numbers of processors, the import region is large compared to the number of particles contained inside each
cell.
spatial decomposition method. The advantage is greater when the size of the cells corresponding to each processor
is small compared to the cutoff radius.
Figure 7.6: Neutral territory method, showing particles in a 2-D computational box and the import region (shaded)
for one cell (center square). This Figure can be compared directly to the spatial decomposition case of Fig. 7.5(b).
See Shaw [76] for additional details.
After the forces are computed, the given processor sends the forces it has computed to the processors that need
these forces for integration. We thus have Algorithm 6.
Like other methods, the import region of the neutral territory method can be modified to take advantage of Newton’s
third law. We refer to Shaw [76] for additional details and for illustrations of neutral territory methods in 3-D
simulations.
L (with periodic boundary conditions) and the solution of φ in −Lφ = ρ. Let F denote the Fourier transform. The
original problem is equivalent to
−(F L F^{−1}) Fφ = Fρ
φ = −F^{−1} (F L F^{−1})^{−1} Fρ.
The matrix F L F^{−1} is diagonal. The forward Fourier transform F is applied to ρ, then the Fourier-space components
are scaled by the inverse of the diagonal matrix, and finally, the inverse Fourier transform F −1 is applied to obtain
the solution φ.
For realistic protein sizes, a mesh spacing of approximately 1 Ångstrom is typically used, leading to a 3-D mesh
that is quite small by many standards: 64 × 64 × 64, or 128 × 128 × 128. Parallel computation would often not be
applied to a problem of this size, but parallel computation must be used because the data ρ is already distributed
among the parallel processors (assuming a spatial decomposition is used).
A 3-D FFT is computed by computing 1-D FFTs in sequence along each of the three dimensions. For the 64 ×
64 × 64 mesh size, this is 4096 1-D FFTs of dimension 64. The parallel FFT calculation is typically bound by
communication. The best parallelization of the FFT depends on the size of the transforms and the architecture
of the computer network. Below, we first describe some concepts for parallel 1-D FFTs and then describe some
concepts for parallel 3-D FFTs. For current software and research dedicated to the parallelization and efficient
computation (using SIMD operations) of large 1-D transforms, we refer to the SPIRAL and FFTW packages. These
packages use autotuning to generate FFT codes that are efficient for the user’s computer architecture.
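The statement that a 3-D transform is a sequence of 1-D transforms along each dimension can be illustrated by the following sketch. For clarity it uses a naive O(n²) DFT for the 1-D transforms rather than an actual FFT or a library such as FFTW; all function names are made up.

#include <complex.h>
#include <math.h>
#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

/* Naive 1-D DFT (O(n^2), for illustration only) of n points that are
   'stride' apart in memory; a real code would use an FFT here. */
static void dft1d(double complex *x, int n, int stride) {
  double complex tmp[n];
  for (int k = 0; k < n; k++) {
    double complex s = 0;
    for (int j = 0; j < n; j++)
      s += x[j*stride] * cexp(-2.0*M_PI*I*j*k/(double)n);
    tmp[k] = s;
  }
  for (int k = 0; k < n; k++) x[k*stride] = tmp[k];
}

/* 3-D DFT of an n x n x n array stored as x[(i*n + j)*n + k]:
   1-D transforms are applied in sequence along each of the three dimensions. */
void dft3d(double complex *x, int n) {
  for (int i = 0; i < n; i++)               /* along the fastest (k) dimension */
    for (int j = 0; j < n; j++)
      dft1d(&x[(i*n + j)*n], n, 1);
  for (int i = 0; i < n; i++)               /* along the j dimension */
    for (int k = 0; k < n; k++)
      dft1d(&x[i*n*n + k], n, n);
  for (int j = 0; j < n; j++)               /* along the i dimension */
    for (int k = 0; k < n; k++)
      dft1d(&x[j*n + k], n, n*n);
}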
In the two figures, notice that the first two stages have data dependencies that only involve indices in the same partition.
The same is true for the second two stages for the partitioning after the transpose. Observe also that the structure of
the computations before and after the transpose are identical.
Figure 7.7: Data flow diagram for 1-D FFT for 16 points. The shaded regions show a decomposition for 4 processors
(one processor per region). In this parallelization, the first two FFT stages have no communication; the last two FFT
stages do have communication.
[Figure 7.8 panels: (a) data flow diagram (shown without horizontal lines for clarity) for a 1-D FFT on 16 points; (b) partitioning of the indices before the transpose: {0–3}, {4–7}, {8–11}, {12–15}; and after the transpose: {0,4,8,12}, {1,5,9,13}, {2,6,10,14}, {3,7,11,15}.]
Figure 7.8: 1-D FFT with transpose. The first two stages do not involve communication. The data is then transposed
among the processors. As a result, the second two stages also do not involve communication.
easily redistributed this way. The two 1-D FFTs in the plane of the slabs require no communication. The remaining
1-D FFTs require communication and could use one of the two approaches for parallel 1-D FFTs described above.
A disadvantage of the slab decomposition is that for large numbers of processors, the number of processors may
exceed the number of points in the 3-D FFT along any one dimension. An alternative is the pencil decomposition
below.
The pencil decomposition is shown in Fig. 7.9(c) for the case of 16 processors. Each processor holds one or more
pencils of the input data. If the original input data is distributed in blocks as in Fig. 7.9(a), then communication
among a row of processors (in a 3-D processor mesh) can distribute the data into the pencil decomposition. The 1-D
FFTs can then be performed with no communication. To perform the 1-D FFT in another dimension, the data needs
to be redistributed into pencils in another dimension. In total, four communication stages are needed for the entire
3-D FFT computation.
u'' = −u

where u is the displacement of a single particle from an equilibrium point. This equation could model a particle with unit mass on a spring with unit spring constant. The force on a particle at position u is −u. This system does not look like a molecular dynamics system but is useful for illustrating several ideas.
The above second order equation can be written as a system of first order equations

q' = p
p' = −q

where q = u and p = u', which is common notation used in classical mechanics. The general solution is

\[ \begin{pmatrix} q \\ p \end{pmatrix}(t) = \begin{pmatrix} \cos t & \sin t \\ -\sin t & \cos t \end{pmatrix} \begin{pmatrix} q \\ p \end{pmatrix}(0). \]
The kinetic energy of the simple harmonic oscillator is p2 /2 and the potential energy is q 2 /2 (the negative gradient
of potential energy is the force, −q). Thus the total energy is proportional to q 2 + p2 .
Now consider the solution of the system of first order equations by three methods, explicit Euler, implicit Euler, and
a method called the Störmer-Verlet method. The initial condition is (q, p) = (1, 0). We use a time step of h = 0.05
and take 500 steps. We plot q and p on the horizontal and vertical axes, respectively (called a phase plot). The exact
solution, as given above, is a unit circle centered at the origin.
Figure 7.10 shows the solutions. For explicit Euler, the solution spirals outward, meaning the displacement and
momentum of the solution increases over time. The opposite is true for the implicit Euler method. A plot of the
total energy would show the energy increasing and decreasing for the two cases, respectively. The solutions are
better when smaller time steps are taken or when higher order methods are used, but these methods are not at all
appropriate for integration of symplectic systems over long periods of time. Figure 7.10(c) shows the solution using
a symplectic method called the Störmer-Verlet method. The solution shows that q 2 + p2 is preserved much better
than in the other two methods.
Figure 7.10: Phase plot of the solution of the simple harmonic oscillator for three methods with initial value (1,0),
time step 0.05, and 500 steps. For explicit Euler, the solution spirals outward; for implicit Euler, the solution spirals
inward; the total energy is best preserved with the Störmer-Verlet method.
The Störmer-Verlet method is derived very easily. We derive it for the second order equation

u'' = f(t, u)

by approximating the second derivative with a central difference, giving

u_{k+1} = 2u_k − u_{k−1} + h² f(t_k, u_k).
The formula can equivalently be derived from Taylor series. The method is similar to linear multistep methods in
that some other technique is needed to supply the initial step of the method. The method is also time-reversible,
because the formula is the same if k + 1 and k − 1 are swapped. To explain why this method is symplectic,
unfortunately, is beyond the scope of this introduction.
The method as written above has a number of disadvantages, the most severe being that the addition of the small
h2 term is subject to catastrophic cancellation. Thus this formula should not be used in this form, and a number of
mathematically equivalent formulas (which can be derived from the formula above) have been developed.
One alternative formula is the leap-frog method:
u_{k+1} = u_k + h v_{k+1/2}
v_{k+1/2} = v_{k−1/2} + h f(t_k, u_k)
where v is the first derivative (velocity) which is offset from the displacement u by a half step. This formula does
not suffer from the same roundoff problems and also makes available the velocities, although they need to be re-
centered with displacements in order to calculate total energy at a given step. The second of this pair of equations
is basically a finite difference formula.
A third form of the Störmer-Verlet method is the velocity Verlet variant:
u_{k+1} = u_k + h v_k + (h²/2) f(t_k, u_k)
v_{k+1} = v_k + (h/2) ( f(t_k, u_k) + f(t_{k+1}, u_{k+1}) )
where now the velocities are computed at the same points as the displacements. Each of these algorithms can be
implemented such that only two sets of quantities need to be stored (two previous positions, or a position and a
velocity). These variants of the Störmer-Verlet method are popular because of their simplicity, requiring only one
costly force evaluation per step. Higher-order methods have generally not been practical.
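As an illustration, a velocity Verlet integration of the simple harmonic oscillator of this section might look as follows; the step size and step count mirror the phase-plot experiment above, and only two quantities (a position and a velocity) plus the current force are stored.

#include <stdio.h>

/* Sketch: velocity Verlet for the simple harmonic oscillator u'' = -u,
   i.e. f(t,u) = -u, with one force evaluation per step. The quantity
   q^2 + p^2 (proportional to the total energy) should stay close to its
   initial value. */
int main(void) {
  double h = 0.05, u = 1.0, v = 0.0;        /* initial condition (q,p) = (1,0) */
  double f = -u;                            /* force at the initial position   */
  for (int k = 0; k < 500; k++) {
    u += h*v + 0.5*h*h*f;                   /* position update                 */
    double fnew = -u;                       /* single new force evaluation     */
    v += 0.5*h*(f + fnew);                  /* velocity update                 */
    f  = fnew;
    if (k % 100 == 0)
      printf("step %3d: q=% .4f p=% .4f  q^2+p^2=%.6f\n", k, u, v, u*u + v*v);
  }
  return 0;
}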
The velocity Verlet scheme is also the basis of multiple time step algorithms for molecular dynamics. In these
algorithms, the slowly-varying (typically long-range) forces are evaluated less frequently and update the positions
less frequently than the quickly-varying (typically short-range) forces. Finally, many state-of-the-art molecular
dynamics codes integrate a Hamiltonian system that has been modified in order to control the simulation temperature and
pressure. Much more complicated symplectic methods have been developed for these systems.
Chapter 8
Monte Carlo simulation is a broad term for methods that use random numbers and statistical sampling to solve
problems, rather than exact modeling. From the nature of this sampling, the result will have some uncertainty, but
the statistical ‘law of large numbers’ will ensure that the uncertainty goes down as the number of samples grows.
Monte Carlo is of course a natural candidate for simulating phenomena that are statistical in nature, such as
radioactive decay, or Brownian motion. Other problems where Monte Carlo simulation is attractive are outside
the realm of scientific computing. For instance, the Black-Scholes model for stock option pricing [16] uses Monte
Carlo.
Some problems that you have seen before, such as solving a linear system of equations, can be formulated as a Monte
Carlo problem. However, this is not a typical application. Below we will discuss multi-dimensional integration,
where exact methods would take far too much time to compute.
An important tool for statistical sampling is a random number generator. We will briefly discuss the problems in
generating random numbers in parallel.
Now take random points x̄_0, x̄_1, x̄_2 ∈ [0, 1]², then we can estimate the area of Ω by counting how often f(x̄_i) is
positive or negative.
We can extend this idea to integration. The average value of a function on an interval (a, b) is defined as

⟨f⟩ = 1/(b−a) ∫_a^b f(x) dx.

If the points x_i are reasonably distributed and the function f is not too wild, this leads us to

∫_a^b f(x) dx ≈ (b − a) (1/N) ∑_{i=1}^N f(x_i).
Statistical theory, that we will not go into, tells us that the uncertainty σ_I in the integral is related to the standard deviation σ_f by

σ_I ∼ (1/√N) σ_f

for normal distributions.
So far, Monte Carlo integration does not look much different from classical integration. The difference appears
when we go to higher dimensions. In that case, for classical integration we would need N points in each dimension,
leading to N d points in d dimensions. In the Monte Carlo method, on the other hand, the points are taken at random
from the d-dimensional space, and a much lower number of points suffices.
Computationally, Monte Carlo methods are attractive since all function evaluations can be performed in parallel.
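A minimal sketch of one-dimensional Monte Carlo integration, applied to a function whose exact integral we know, might look as follows. The use of rand() is only for illustration; proper (parallel) random number generation is the subject of the next section, and the error should shrink roughly like 1/√N.

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

/* Sketch: estimate int_0^1 4/(1+x^2) dx = pi by averaging function values
   at N uniformly random points in (0,1). */
int main(void) {
  for (long N = 1000; N <= 1000000; N *= 10) {
    double sum = 0.0;
    for (long i = 0; i < N; i++) {
      double x = (double)rand()/RAND_MAX;
      sum += 4.0/(1.0 + x*x);
    }
    double estimate = sum/N;                 /* (b - a) = 1 here */
    printf("N=%8ld  estimate=%.6f  error=%.2e\n",
           N, estimate, fabs(estimate - M_PI));
  }
  return 0;
}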
Appendix A
Theoretical background
This course requires no great mathematical sophistication. In this appendix we give a quick introduction to the few
concepts that may be unfamiliar to certain readers.
v' ← v − (u^t v / u^t u) u.
It is easy to see that this satisfies the requirements.
In the general case, suppose we have an orthogonal set u1 , . . . , un and a vector v that is not necessarily orthogonal.
Then we compute
For i = 1, . . . , n:
    let c_i = u_i^t v / u_i^t u_i
For i = 1, . . . , n:
    update v ← v − c_i u_i
Often the vector v in the algorithm above is normalized. GS orthogonalization with this normalization, applied to a
matrix, is also known as the QR factorization.
Exercise 1.1. Suppose that we apply the GS algorithm to the columns of a rectangular matrix A, giving
a matrix Q. Prove that there is an upper triangular matrix R such that A = QR. If we normalize
the vector v in the algorithm above, Q has the additional property that Qt Q = I. Prove this
too.
The GS algorithm as given above computes the desired result, but only in exact arithmetic. A computer implementation can be quite inaccurate if the angle between v and one of the u_i is small. In that case, the Modified Gram-Schmidt (MGS) algorithm will perform better:
For i = 1, . . . , n:
    let c_i = u_i^t v / u_i^t u_i
    update v ← v − c_i u_i
To contrast it with MGS, the original GS algorithm is also known as Classical Gram-Schmidt (CGS).
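A sketch of Modified Gram-Schmidt applied to the columns of a small matrix could look as follows; the column-pointer storage and the function name are chosen only for readability of this example.

#include <math.h>

/* Sketch: Modified Gram-Schmidt on n column vectors of length m, stored as
   q[j][k] = k-th entry of column j. On entry the columns contain A; on exit
   they contain the orthonormal Q factor. Each new column is orthogonalized
   against the previous columns one at a time, using the already updated
   vector for every new inner product (this is what distinguishes MGS
   from CGS). */
void mgs(double **q, int m, int n) {
  for (int j = 0; j < n; j++) {
    for (int i = 0; i < j; i++) {
      double c = 0.0, uu = 0.0;
      for (int k = 0; k < m; k++) { c += q[i][k]*q[j][k]; uu += q[i][k]*q[i][k]; }
      c /= uu;                              /* c_i = u_i^t v / u_i^t u_i */
      for (int k = 0; k < m; k++) q[j][k] -= c*q[i][k];
    }
    double norm = 0.0;                      /* normalize, so that Q^t Q = I */
    for (int k = 0; k < m; k++) norm += q[j][k]*q[j][k];
    norm = sqrt(norm);
    for (int k = 0; k < m; k++) q[j][k] /= norm;
  }
}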
As an illustration of the difference between the two methods, consider the matrix

A = ( 1  1  1 )
    ( ε  0  0 )
    ( 0  ε  0 )
    ( 0  0  ε )
where ε is of the order of the machine precision, so that 1 + ε² = 1 in machine arithmetic. The CGS method proceeds as follows:
• The first column is of length 1 in machine arithmetic, so

q_1 = a_1 = (1, ε, 0, 0)^t.
It is easy to see that q2 and q3 are not orthogonal at all. By contrast, the MGS method differs in the last step:
• As before, q_1^t a_3 = 1, so

v ← a_3 − q_1 = (0, −ε, 0, ε)^t.

Then q_2^t v = (√2/2) ε (note that q_2^t a_3 = 0 before), so the second update gives

v ← v − (√2/2) ε q_2 = (0, −ε/2, −ε/2, ε)^t,   normalized: (0, −√6/6, −√6/6, 2√6/6)^t.
The sequence

x_i = A x_{i−1},

where x_0 is some starting vector, is called the power method since it computes the product of subsequent matrix powers times a vector:

x_i = A^i x_0.
There are cases where the relation between the x_i vectors is simple. For instance, if x_0 is an eigenvector of A, we have Ax_0 = λx_0 for some scalar λ, and hence x_i = λ^i x_0. However, for an arbitrary vector x_0, the sequence {x_i}_i is likely to consist of independent vectors. Up to a point.
Exercise 1.2. Let A and x be the n × n matrix and dimension n vector

\[ A = \begin{pmatrix} 1 & 1 \\ & 1 & 1 \\ & & \ddots & \ddots \\ & & & 1 & 1 \\ & & & & 1 \end{pmatrix}, \qquad x = (0, \dots, 0, 1)^t. \]
Show that the sequence [x, Ax, . . . , Ai x] is an independent set for i < n. Why is this no longer
true for i ≥ n?
Now consider the matrix B:

\[ B = \begin{pmatrix} 1 & 1 \\ & \ddots & \ddots \\ & & 1 & 1 \\ & & & 1 \\ & & & & 1 & 1 \\ & & & & & \ddots & \ddots \\ & & & & & & 1 & 1 \\ & & & & & & & 1 \end{pmatrix}, \qquad y = (0, \dots, 0, 1)^t.
\]
Show that the set [y, By, . . . , B i y] is an independent set for i < n/2, but not for any larger
values of i.
While in general the vectors x, Ax, A2 x, . . . can be expected to be independent, in computer arithmetic this story is
no longer so clear.
Suppose the matrix has eigenvalues λ_0 > λ_1 ≥ · · · ≥ λ_{n−1} and corresponding eigenvectors u_i so that

A u_i = λ_i u_i.

Let the vector x be written as

x = c_0 u_0 + · · · + c_{n−1} u_{n−1},

then

A^i x = c_0 λ_0^i u_0 + · · · + c_{n−1} λ_{n−1}^i u_{n−1}.

If we write this as

A^i x = λ_0^i [ c_0 u_0 + c_1 (λ_1/λ_0)^i u_1 + · · · + c_{n−1} (λ_{n−1}/λ_0)^i u_{n−1} ],
we see that, numerically, Ai x will get progressively closer to a multiple of u0 , the dominant eigenvector. Hence,
any calculation that uses independence of the Ai x vectors is likely to be inaccurate.
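The observation that A^i x approaches a multiple of the dominant eigenvector is the basis of the power method. A sketch, with normalization at every step to avoid overflow, could look as follows; the dense row-major storage and the function name are only assumptions of this illustration.

#include <math.h>

/* Sketch: power method for a dense n x n matrix a (row-major). Starting from
   a nonzero vector x (y is scratch space of length n), repeatedly form Ax and
   normalize. After enough iterations x approximates the dominant eigenvector
   u_0, and the returned norm ||Ax|| approximates |lambda_0|. */
double power_method(const double *a, double *x, double *y, int n, int iters) {
  double lambda = 0.0;
  for (int it = 0; it < iters; it++) {
    for (int i = 0; i < n; i++) {            /* y = A x */
      y[i] = 0.0;
      for (int j = 0; j < n; j++) y[i] += a[i*n+j]*x[j];
    }
    double norm = 0.0;
    for (int i = 0; i < n; i++) norm += y[i]*y[i];
    norm = sqrt(norm);
    lambda = norm;                           /* ||Ax|| for the normalized iterate */
    for (int i = 0; i < n; i++) x[i] = y[i]/norm;
  }
  return lambda;
}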
Taking norms:

|a_{ii} − λ| ≤ ∑_{j≠i} |a_{ij}| |x_j / x_i|
Since we do not know this value of i, we can only say that there is some value of i such that λ lies in such a circle.
This is the Gershgorin theorem.
Theorem 1 Let A be a square matrix, and let D_i be the circle with center a_{ii} and radius ∑_{j≠i} |a_{ij}|; then the eigenvalues are contained in the union of circles ∪_i D_i.
We can conclude that the eigenvalues are in the interior of these discs, if the constant vector is not an eigenvector.
A.2 Complexity
At various places in this book we are interested in how many operations an algorithm takes. It depends on the
context what these operations are, but often we count additions (or subtractions) and multiplications. This is called
the arithmetic or computational complexity of an algorithm. For instance, summing n numbers takes n−1 additions.
Another quantity that we may want to describe is the amount of space (computer memory) that is needed. Sometimes
the space to fit the input and output of an algorithm is all that is needed, but some algorithms need temporary space.
The total required space is called the space complexity of an algorithm.
Both arithmetic and space complexity depend on some description of the input, for instance, for summing an array
of numbers, the length n of the array is all that is needed. We express this dependency by saying ‘summing an array
of numbers has time complexity n − 1, where n is the length of the array’.
The time (or space) the summing algorithm takes is not dependent on other factors such as the values of the num-
bers. By contrast, some algorithms such as computing the greatest common divisor of an array of integers can be
dependent on the actual values.
Exercise 1.3. What is the time and space complexity of multiplying two square matrices of size n × n?
Assume that an addition and a multiplication take the same amount of time.
Often we aim to simplify the formulas that describe time or space complexity. For instance, if the complexity of an
algorithm is n2 + 2n, we see that for n > 2 the complexity is less than 2n2 , and for n > 4 it is less than (3/2)n2 .
On the other hand, for all values of n the complexity is at least n2 . Clearly, the quadratic term n2 is the most
important, and the linear term n becomes less and less important by ratio. We express this informally by saying that
the complexity is quadratic: there are constants c, C so that for n large enough the complexity is at least cn2 and at
most Cn2 . This is expressed for short by saying that the complexity is of order n2 , written as O(n2 ).
f(x) ≈ c_0 + c_1 x + c_2 x² + · · · + c_n x^n.
This question obviously needs to be refined. What do we mean by ‘approximately equal’? And clearly this approx-
imation formula can not hold for all x; the function sin x is bounded, whereas any polynomial is unbounded.
We will show that a function f with sufficiently many derivatives can be approximated as follows: if the n-th derivative f^(n) is continuous on an interval I, then there are coefficients c_0, . . . , c_{n−1} such that

∀_{x∈I} : |f(x) − ∑_{i<n} c_i x^i| ≤ c M_n   where M_n = max_{x∈I} |f^(n)(x)|.
It is easy to get inspiration for what these coefficients should be. Suppose

f(x) = c_0 + c_1 x + c_2 x² + · · ·

(where we will not worry about matters of convergence and how long the dots go on); then filling in

x = 0 gives c_0 = f(0).

Differentiating once, f'(x) = c_1 + 2c_2 x + 3c_3 x² + · · ·, and filling in

x = 0 gives c_1 = f'(0).

Differentiating again, f''(x) = 2c_2 + 6c_3 x + · · ·, so filling in

x = 0 gives c_2 = f''(0)/2.

One more differentiation, and

x = 0 gives c_3 = (1/3!) f^(3)(0).
Now we need to be a bit more precise. Cauchy's form of Taylor's theorem says that

f(x) = f(a) + (1/1!) f'(a)(x − a) + · · · + (1/n!) f^(n)(a)(x − a)^n + R_n(x)

where the 'rest term' R_n is

R_n(x) = (1/(n+1)!) f^(n+1)(ξ)(x − a)^{n+1}   where ξ ∈ (a, x) or ξ ∈ (x, a) depending.
If f^(n+1) is bounded, and x = a + h, then the form in which we often use Taylor's theorem is

f(x) = ∑_{k=0}^n (1/k!) f^(k)(a) h^k + O(h^{n+1}).
We have now approximated the function f on a certain interval by a polynomial, with an error that decreases
geometrically with the inverse of the degree of the polynomial.
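A small numerical check of this statement: for f(x) = e^x expanded around a = 0, the error of the degree-n Taylor polynomial at x = h behaves like O(h^{n+1}), so halving h should reduce the error by roughly a factor 2^{n+1}. The following is only an illustrative sketch.

#include <stdio.h>
#include <math.h>

/* Sketch: evaluate the degree-n Taylor polynomial of exp around a = 0
   at x = h and compare with exp(h); the error should scale like h^(n+1). */
static double taylor_exp(double h, int n) {
  double term = 1.0, sum = 1.0;
  for (int k = 1; k <= n; k++) {
    term *= h/k;                  /* h^k / k!, since f^(k)(0) = 1 for exp */
    sum  += term;
  }
  return sum;
}

int main(void) {
  int n = 3;                      /* degree of the polynomial */
  for (double h = 0.5; h > 0.01; h /= 2)
    printf("h=%7.4f  error=%.3e\n", h, fabs(exp(h) - taylor_exp(h, n)));
  return 0;
}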
For a proof of Taylor's theorem we use integration by parts. First we write

∫_a^x f'(t) dt = f(x) − f(a)

as

f(x) = f(a) + ∫_a^x f'(t) dt
V = {1, 2, 3, 4, 5, 6},   E = {(1, 2), (2, 6), (4, 3), (4, 5)}
A graph is called an undirected graph if (i, j) ∈ E ⇔ (j, i) ∈ E. The alternative is a directed graph , where we
indicate an edge (i, j) with an arrow from i to j.
Two concepts that often appear in graph theory are the degree and the diameter of a graph.
Definition 1 The degree denotes the maximum number of nodes that are connected to any node.
Definition 2 The diameter of a graph is the length of the longest shortest path in the graph, where a path is defined as a set of vertices v_1, . . . , v_{k+1} such that v_i ≠ v_j for all i ≠ j and (v_i, v_{i+1}) ∈ E for 1 ≤ i ≤ k; the length of this path is k.
Figure A.1: A graph.
As an example of graph concepts that can easily be read from the adjacency matrix, consider reducibility.
Definition 3 A graph is called irreducible if for every pair i, j of nodes there is a path from i to j and from j to i.
where B and D are square matrices. Prove the reducibility of the graph of which this is the
adjacency matrix.
For graphs with edge weights, we set the elements of the adjacency matrix to the weights:

n(M) = |V|,   M_{ij} = w_{ij} if (i, j) ∈ E, and 0 otherwise.
You see that there is a simple correspondence between weighted graphs and matrices; given a matrix, we call the
corresponding graph its adjacency graph .
The adjacency matrix of the graph in figure A.1 is
0 1 0 0 0 0
0 0 0 0 0 1
0 0 0 0 0 0
0 0 1 0 1 0
0 0 0 0 0 0
0 0 0 0 0 0
If a matrix has no zero elements, its adjacency graph has an edge between each pair of vertices. Such a graph is
called a clique.
If the graph is undirected, the adjacency matrix is symmetric.
Here is another example of how adjacency matrices can simplify reasoning about graphs.
Exercise 1.6. Let G = ⟨V, E⟩ be an undirected graph, and let G' = ⟨V, E'⟩ be the graph with the same vertices, but with edges defined by

(i, j) ∈ E' ⇔ ∃_k : (i, k) ∈ E ∧ (k, j) ∈ E.

If M is the adjacency matrix of G, show that M² is the adjacency matrix of G', where we use boolean multiplication on the elements: 1 · 1 = 1, 1 + 1 = 1.
Practical tutorials
This part of the book teaches you some general tools that are useful to a computational scientist. They all combine
theory with a great deal of exploration. Do them while sitting at a computer!
B.1.1.1.1 An assert macro for Fortran (Thanks to Robert McLay for this code.)
#if (defined(__GFORTRAN__) || defined(__G95__) || defined(__PGI))
# define MKSTR(x) "x"
#else
# define MKSTR(x) #x
#endif
#ifndef NDEBUG
# define ASSERT(x, msg) if (.not. (x) ) \
call assert( __FILE__ , __LINE__ , MKSTR(x), msg)
#else
# define ASSERT(x, msg)
#endif
subroutine assert(file, ln, testStr, msgIn)
implicit none
character(*) :: file, testStr, msgIn
integer :: ln
print *, "Assert: ",trim(testStr)," Failed at ",trim(file),":",ln
print *, "Msg:", trim(msgIn)
stop
end subroutine assert
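As an illustration (a hypothetical example, not part of the original code), the macro could be used as follows; the source file has to pass through the C preprocessor, so give it a capital .F or .F90 extension, or use your compiler's preprocessing flag.
! hypothetical use of the ASSERT macro defined above
subroutine scale_vector(x, n, factor)
  implicit none
  integer :: n, i
  real :: x(n), factor
  ! the test disappears entirely when NDEBUG is defined at compile time
  ASSERT( factor > 0.0, "scale factor must be positive" )
  do i = 1, n
     x(i) = factor * x(i)
  end do
end subroutine scale_vector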
float value,result;
result = compute(value);
Looks good? What if the computation can fail, for instance:
result = ... sqrt(val) ... /* some computation */
How do we handle the case where the user passes a negative number?
float compute(float val)
{
float result;
if (val<0) { /* then what? */
} else
result = ... sqrt(val) ... /* some computation */
return result;
}
We could print an error message and deliver some result, but the message may go unnoticed, and the calling envi-
ronment does not really receive any notification that something has gone wrong.
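One common convention, sketched here rather than taken from the text, is to return an error code and pass the actual result back through a pointer argument, so that the calling environment is forced to look at the status:
#include <stdio.h>
#include <math.h>

/* hypothetical variant of compute: the return value is an error code,
   and the result is passed back through a pointer */
int compute(float val,float *result) {
  if (val<0)
    return 1;            /* nonzero signals failure to the caller */
  *result = sqrt(val);   /* some computation */
  return 0;
}

int main() {
  float value=-5,result;
  if (compute(value,&result)!=0)
    fprintf(stderr,"compute failed for input %e\n",value);
  else
    printf("result: %e\n",result);
  return 0;
}
Other conventions are possible, such as a global status flag or, in C, the errno variable; the essential point is that the failure becomes visible to the caller.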
int *ip;
MYMALLOC(ip,500,int);
Runtime checks on memory usage (either by compiler-generated bounds checking, or through tools like valgrind
or Rational Purify) are expensive, but you can catch many problems by adding some functionality to your malloc.
What we will do here is to detect memory corruption after the fact.
We allocate a few integers to the left and right of the allocated object (line 1 in the code below), and put a recogniz-
able value in them (line 2 and 3), as well as the size of the object (line 2). We then return the pointer to the actually
requested memory area (line 4).
#define MEMCOOKIE 137
#define MYMALLOC(a,b,c) { \
char *aa; int *ii; \
aa = malloc(b*sizeof(c)+3*sizeof(int)); /* 1 */ \
ii = (int*)aa; ii[0] = b*sizeof(c); \
ii[1] = MEMCOOKIE; /* 2 */ \
aa = (char*)(ii+2); a = (c*)aa ; /* 4 */ \
aa = aa+b*sizeof(c); ii = (int*)aa; \
ii[0] = MEMCOOKIE; /* 3 */ \
}
Now you can write your own free, which tests whether the bounds of the object have not been written over.
#define MYFREE(a) { \
char *aa; int *ii,n; ii = (int*)a; \
if (*(--ii)!=MEMCOOKIE) printf("object corrupted\n"); \
n = *(--ii); aa = (char*)a+n; ii = (int*)aa; \
if (*ii!=MEMCOOKIE) printf("object corrupted\n"); \
free((int*)a-2); /* release the original allocation */ \
}
You can extend this idea: in every allocated object, also store two pointers, so that the allocated memory areas
become a doubly linked list. You can then write a macro CHECKMEMORY which tests all your allocated objects for
corruption.
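A minimal sketch of that idea (ours, not the book's code; it assumes the MYMALLOC layout above and uses a simple global table instead of the suggested linked list) could look like this:
/* hypothetical bookkeeping: a global table of pointers returned by MYMALLOC */
#define MAXOBJS 1000
static void *mem_objs[MAXOBJS];
static int mem_nobjs = 0;
#define REGISTER(p) { if (mem_nobjs<MAXOBJS) mem_objs[mem_nobjs++] = (void*)(p); }
/* check one object: cookie before the data, and cookie after the data */
#define CHECKOBJ(p) { \
  int *ii = (int*)(p); int size = ii[-2]; \
  if (ii[-1]!=MEMCOOKIE || *(int*)((char*)(p)+size)!=MEMCOOKIE) \
    printf("object %p corrupted\n",(void*)(p)); \
  }
/* walk all registered objects */
#define CHECKMEMORY { int i_; for (i_=0; i_<mem_nobjs; i_++) CHECKOBJ(mem_objs[i_]) }
Call REGISTER on the pointer right after each MYMALLOC, and sprinkle CHECKMEMORY calls through your code to narrow down where the corruption happens.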
Such solutions to the memory corruption problem are fairly easy to write, and they carry little overhead. There is a
memory overhead of at most 5 integers per object, and there is absolutely no performance penalty.
(Instead of writing a wrapper for malloc, on some systems you can influence the behaviour of the system routine.
On linux, malloc calls hooks that can be replaced with your own routines; see http://www.gnu.org/s/
libc/manual/html_node/Hooks-for-Malloc.html.)
Put all subprograms in modules so that the compiler can check for missing arguments and type mismatches. It also
allows for automatic dependency building with fdepend.
B.1.2 Debugging
When a program misbehaves, debugging is the process of finding out why. There are various strategies of finding
errors in a program. The crudest one is debugging by print statements. If you have a notion of where in your code
the error arises, you can edit your code to insert print statements, recompile, rerun, and see if the output gives you
any suggestions. There are several problems with this:
• The edit/compile/run cycle is time consuming, especially since
• often the error will be caused by an earlier section of code, requiring you to edit and rerun. Furthermore,
• the amount of data produced by your program can be too large to display and inspect effectively.
• If your program is parallel, you probably need to print out data from all processors, making the inspection process very tedious.
For these reasons, the best way to debug is by the use of an interactive debugger, a program that allows you to
monitor and control the behaviour of a running program. In this section you will familiarize yourself with gdb ,
which is the open source debugger of the GNU project. Other debuggers are proprietary, and typically come with a
compiler suite. Another distinction is that gdb is a commandline debugger; there are graphical debuggers such as
ddd (a frontend to gdb) or DDT and TotalView (debuggers for parallel codes). We limit ourselves to gdb, since it
incorporates the basic concepts common to all debuggers.
tutorials/gdb/hello.c
#include <stdlib.h>
#include <stdio.h>
int main() {
printf("hello world\n");
return 0;
}
%% cc -g -o hello hello.c
# regular invocation:
%% ./hello
hello world
# invocation from gdb:
%% gdb hello
GNU gdb 6.3.50-20050815 # ..... version info
Copyright 2004 Free Software Foundation, Inc. .... copyright info ....
(gdb) run
Starting program: /home/eijkhout/tutorials/gdb/hello
tutorials/gdb/say.c
#include <stdlib.h>
#include <stdio.h>
int main(int argc,char **argv) {
int i;
for (i=0; i<atoi(argv[1]); i++)
printf("hello world\n");
return 0;
}
%% ./say 2
hello world
hello world
%% gdb say
.... the usual messages ...
(gdb) run 2
Starting program: /home/eijkhout/tutorials/gdb/say 2
Reading symbols for shared libraries +. done
hello world
hello world
tutorials/gdb/square.c
#include <stdlib.h>
#include <stdio.h>
int main(int argc,char **argv) {
int nmax,i;
float *squares,sum;
fscanf(stdin,"%d",nmax);
for (i=1; i<=nmax; i++) {
squares[i] = 1./(i*i); sum += squares[i];
}
printf("Sum: %e\n",sum);
return 0;
}
1. Compiler optimizations are not supposed to change the semantics of a program, but sometimes do. This can lead to the nightmare scenario where a program crashes or gives incorrect results, but magically works correctly when compiled with debug options and run in a debugger.
%% cc -g -o square square.c
%% ./square
50
Segmentation fault
The segmentation fault indicates that we are accessing memory that we are not allowed to, making the program
abort. A debugger will quickly tell us where this happens:
[albook:˜/Current/istc/scicompbook/tutorials/gdb] %% gdb square
(gdb) run
50000
(gdb) print i
$1 = 11237
(gdb) print squares[i]
Cannot access memory at address 0x10000f000
and we quickly see that we forgot to allocate squares!
By the way, we were lucky here: this sort of memory error is not always detected. Starting our program with a smaller input does not lead to an error:
(gdb) run
50
Sum: 1.625133e+00
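For completeness, a corrected version of the program (our fix, not code from the text) allocates the array before using it and passes the address of nmax to fscanf:
#include <stdlib.h>
#include <stdio.h>
int main(int argc,char **argv) {
  int nmax,i;
  float *squares,sum=0.;
  fscanf(stdin,"%d",&nmax);                          /* pass an address, not a value */
  squares = (float*) malloc((nmax+1)*sizeof(float)); /* allocate before use */
  for (i=1; i<=nmax; i++) {
    squares[i] = 1./((float)i*i); sum += squares[i]; /* the cast avoids integer overflow */
  }
  printf("Sum: %e\n",sum);
  free(squares);
  return 0;
}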
B.1.3 Testing
There are various philosophies for testing the correctness of a code.
• Correctness proving: the programmer draws up predicates that describe the intended behaviour of code
fragments and proves by mathematical techniques that these predicates hold [54, 27].
• Unit testing: each routine is tested separately for correctness. This approach is often hard to do for nu-
merical codes, since with floating point numbers there is essentially an infinity of possible inputs, and it
is not easy to decide what would constitute a sufficient set of inputs.
• Integration testing: test subsystems
• System testing: test the whole code. This is often appropriate for numerical codes, since we often have
model problems with known solutions, or there are properties such as bounds that need to hold on the
global solution.
• Test-driven design
With parallel codes we run into a new category of difficulties with testing. Many algorithms, when executed in
parallel, will execute operations in a slightly different order, leading to different roundoff behaviour. For instance,
the parallel computation of a vector sum will use partial sums. Some algorithms have an inherent damping of
numerical errors, for instance stationary iterative methods (section 5.5.1), but others have no such built-in error
correction (nonstationary methods; section 5.5.7). As a result, the same iterative process can take different numbers
of iterations depending on how many processors are used.
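The following small program (ours, not from the text) mimics this effect sequentially: summing the same terms in a different order, which is effectively what a parallel reduction does, gives a slightly different floating point result.
#include <stdio.h>
int main() {
  int i, n=1000000;
  float up=0., down=0.;
  /* sum 1/i^2 in increasing and in decreasing order of i */
  for (i=1; i<=n; i++) up   += 1./((float)i*i);
  for (i=n; i>=1; i--) down += 1./((float)i*i);
  printf("ascending: %.8e descending: %.8e difference: %e\n",up,down,up-down);
  return 0;
}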
Originally, the latex compiler would output a device independent file format, named dvi, which could then be
translated to PostScript or PDF, or directly printed. These days, many people use the pdflatex program which
directly translates .tex files to .pdf files. This has the big advantage that the generated PDF files have automatic
cross linking and a side panel with table of contents. An illustration is found below.
Let us do a simple example.
\documentclass{article}
\begin{document}
Hello world!
\end{document}
Figure B.1: A minimal LATEX document
Exercise. Create a text file minimal.tex with the content as in figure B.1. Try the command pdflatex
minimal or latex minimal. Did you get a file minimal.pdf in the first case or minimal.dvi in the
second case? Use a pdf viewer, such as Adobe Reader, to view the pdf file; to view the dvi file, use a dvi viewer, or convert it to PostScript with dvips.
Caveats. If you make a typo, TEX can be somewhat unfriendly. If you get an error message and TEX is asking for
input, typing x usually gets you out, or Ctrl-C. Some systems allow you to type e to go directly into the editor to
correct the typo.
The general structure of a LATEX document is:
\documentclass{ ... }
<preamble>
\begin{document}
<your document text>
\end{document}
The ‘documentclass’ line needs a class name in between the braces; typical values are ‘article’ or ‘book’. Some
organizations have their own styles, for instance ‘ieeeproc’ is for proceedings of the IEEE.
All document text goes between the \begin{document} and \end{document} lines. (Matched ‘begin’ and
‘end’ lines are said to denote an ‘environment’, in this case the document environment.)
Victor Eijkhout
236 APPENDIX B. PRACTICAL TUTORIALS
The part before \begin{document} is called the ‘preamble’. It contains customizations for this particular doc-
ument. For instance, a command to make the whole document double spaced would go in the preamble. If you are
using pdflatex to format your document, you want a line
\usepackage{hyperref}
here.
Have you noticed the following?
• The backslash character is special: it starts a LATEX command.
• The braces are also special: they have various functions, such as indicating the argument of a command.
• The percent character indicates that everything to the end of the line is a comment.
Exercise. Create a file first.tex with the content of figure B.1 in it. Type some text before the \begin{document}
line and run pdflatex on your file.
Expected outcome. You should get an error message: you are not allowed to have text before the \begin{document}
line. Only commands are allowed in the preamble part.
Exercise. Edit your document: put some text in between the \begin{document} and \end{document} lines.
Let your text have both some long lines that go on for a while, and some short ones. Put superfluous spaces between
words, and at the beginning or end of lines. Run pdflatex on your document and view the output.
Expected outcome. You notice that the white space in your input has been collapsed in the output. TEX has its own
notions about what space should look like, and you do not have to concern yourself with this matter.
Exercise. Edit your document again, cutting and pasting the paragraph, but leaving a blank line between the two
copies. Paste it a third time, leaving several blank lines. Format, and view the output.
Expected outcome. TEX interprets one or more blank lines as the separation between paragraphs.
Exercise. Add \usepackage{pslatex} to the preamble and rerun pdflatex on your document. What
changed in the output?
Expected outcome. This should have the effect of changing the typeface from the default to Times Roman.
Caveats. Typefaces are notoriously unstandardized. Attempts to use different typefaces may or may not work. Little
can be said about this in general.
Exercise. Add the line
\section{This is a section}
before your first paragraph, and a similar line before the second. Format. You see that LATEX automatically numbers the sections, and that it handles indentation differently for the first paragraph after a heading.
Exercise. Replace article by artikel3 in the documentclass declaration line and reformat your document.
What changed?
Expected outcome. There are many documentclasses that implement the same commands as article (or another
standard style), but that have their own layout. Your document should format without any problem, but get a better
looking layout.
Caveats. The artikel3 class is part of most distributions these days, but you can get an error message about an
unknown documentclass if it is missing or if your environment is not set up correctly. This depends on your in-
stallation. If the file seems missing, download the files from http://tug.org/texmf-dist/tex/latex/
ntgclass/ and put them in your current directory; see also section B.2.2.8.
B.2.2.3 Math
Purpose. In this section you will learn the basics of math typesetting
One of the goals of the original TEX system was to facilitate the setting of mathematics. There are two ways to have
math in your document:
• Inline math is part of a paragraph, and is delimited by dollar signs.
• Display math is, as the name implies, displayed by itself.
Exercise. Put $x+y$ somewhere in a paragraph and format your document. Put \[x+y\] somewhere in a para-
graph and format.
Expected outcome. Formulas between single dollars are included in the paragraph where you declare them. Formu-
las between \[...\] are typeset in a display.
For display equations with a number, use an equation environment. Try this.
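For instance (this example is ours, not from the text):
\begin{equation}
  x^2+y^2=z^2
\end{equation}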
Here are some common things to do in math. Make sure to try them out.
• Subscripts and superscripts: $x_i^2$. If the sub or superscript is more than a single symbol, it needs to be grouped: $x_{i+1}^{2n}$. (If you need a brace in a formula, use $\{ \}$.)
• Greek letters and other symbols: $\alpha\otimes\beta_i$.
• Combinations of all these: $\int_{t=0}^\infty t\,dt$.
Exercise. Take the last example and typeset it as display math. Do you see a difference with inline math?
Expected outcome. TEX tries not to increase the distance between text lines, even if there is math in a paragraph. For this reason it typesets the bounds on an integral sign differently from display math.
B.2.2.4 Referencing
Purpose. In this section you will see TEX’s cross referencing mechanism in action.
So far you have not seen LATEX do much that would save you any work. The cross referencing mechanism of LATEX
will definitely save you work: any counter that LATEX inserts (such as section numbers) can be referenced by a label.
As a result, the reference will always be correct.
Start with an example document that has at least two section headings. After your first section heading, put the
command \label{sec:first}, and put \label{sec:other} after the second section heading. These label
commands can go on the same line as the section command, or on the next. Now put
As we will see in section˜\ref{sec:other}.
in the paragraph before the second section. (The tilde character denotes a non-breaking space.)
Exercise. Make these edits and format the document. Do you see the warning about an undefined reference? Take
a look at the output file. Format the document again, and check the output again. Do you have any new files in your
directory?
Expected outcome. On a first pass through a document, the TEX compiler will gather all labels with their values in
a .aux file. The document will display a double question mark for any references that are unknown. In the second
pass the correct values will be filled in.
Caveats. If after the second pass there are still undefined references, you probably made a typo. If you use the
bibtex utility for literature references, you will regularly need three passes to get all references resolved correctly.
Above you saw that the equation environment gives displayed math with an equation number. You can add a
label to this environment to refer to the equation number.
Exercise. Write a formula in an equation environment, and add a label. Refer to this label anywhere in the text.
Format (twice) and check the output.
Expected outcome. The \label and \ref commands are used in the same way for formulas as for section numbers.
Note that you must use \begin/end{equation} rather than \[...\] for the formula.
B.2.2.5 Lists
Purpose. In this section you will see the basics of lists.
\begin{itemize}
\item This is an item;
\item this is one too.
\end{itemize}
\begin{enumerate}
\item This item is numbered;
\item this one is two.
\end{enumerate}
Exercise. Add some lists to your document, including nested lists. Inspect the output.
Expected outcome. Nested lists will be indented further and the labeling and numbering style changes with the list
depth.
Exercise. Add a \label to an item in a numbered list, and refer to it with \ref. Format (twice) and check the output.
Expected outcome. Again, the \label and \ref commands work as before.
B.2.2.6 Graphics
Since you can not immediately see the output of what you are typing, sometimes the output may come as a surprise.
That is especially so with graphics. LATEX has no standard way of dealing with graphics, but the following is a
common set of commands:
\usepackage{graphicx} % this line in the preamble
A figure is typically coded with a figure environment, which provides several things:
• A placement specifier: the optional argument in \begin{figure}[hbp] declares that the figure has to be placed here if possible, at the bottom of the page if that's not possible, and on a page of its own if it is too big to fit on a page with text.
• A caption to be put under the figure, including a figure number;
• A label so that you can refer to the figure: as can be seen in figure~\ref{fig:first}.
• And of course the figure material.
There are various ways to fine-tune the figure placement. For instance
\begin{center}
\includegraphics{myfigure}
\end{center}
gives a centered figure.
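Putting these elements together, a typical figure looks like this (with a hypothetical graphics file myfigure):
\begin{figure}[hbp]
  \centering
  \includegraphics{myfigure}
  \caption{My first figure}
  \label{fig:first}
\end{figure}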
The following example demo.tex contains many of the elements discussed above.
\documentclass{artikel3}
\usepackage{pslatex}
\usepackage{amsmath,amssymb}
\usepackage{graphicx}
\newtheorem{theorem}{Theorem}
\newcounter{excounter}
\newenvironment{exercise}
{\refstepcounter{excounter}
\begin{quotation}\textbf{Exercise \arabic{excounter}.} }
{\end{quotation}}
\begin{document}
\title{SSC 335: demo}
\author{Victor Eijkhout}
\date{today}
\maketitle
\section{This is a section}
\label{sec:intro}
\begin{exercise}\label{easy-ex}
Left to the reader.
\end{exercise}
\begin{exercise}
Also left to the reader, just like in exercise˜\ref{easy-ex}
\end{exercise}
\begin{theorem}
This is cool.
\end{theorem}
This is a formula: $a\Leftarrow b$.
\begin{equation}
\label{eq:one}
x_i\leftarrow y_{ij}\cdot xˆ{(k)}_j
\end{equation}
Text: $\int_0ˆ1 \sqrt x\,dx$
\[
\int_0^1 \sqrt x\,dx
\]
\begin{table}[ht]
\centering
\begin{tabular}{|rl|}
\hline one&value \\ \hline another&values \\ \hline
\end{tabular}
\caption{This is the only table in my demo}
\label{tab:thetable}
\end{table}
\begin{figure}[ht]
\centering
\includegraphics{graphics/caches}
\caption{this is the only figure}
\label{fig:thefigure}
\end{figure}
As I showed in the introductory section˜\ref{sec:intro}, in the
paper˜\cite{AdJo:colorblind}, it was shown that
equation˜\eqref{eq:one}
\begin{itemize}
\item There is an item.
\item There is another item
\begin{itemize}
\item sub one
\item sub two
\end{itemize}
\end{itemize}
\begin{enumerate}
\item item one
\item item two
\begin{enumerate}
\item sub one
\item sub two
\end{enumerate}
\end{enumerate}
\tableofcontents
\listoffigures
\bibliography{math}
\bibliographystyle{plain}
\end{document}
@article{AdJo:colorblind,
author = {Loyce M. Adams and Harry F. Jordan},
title = {Is {SOR} color-blind?},
journal = {SIAM J. Sci. Stat. Comput.},
year = {1986},
volume = {7},
pages = {490--506},
abstract = {For what stencils do ordinary and multi-colour SOR have
the same eigenvalues.},
keywords = {SOR, colouring}
}
@misc{latexdemo,
author = {Victor Eijkhout},
title = {Short {\LaTeX}\ demo},
note = {SSC 335, oct 1, 2008}
}
pdflatex demo
bibtex demo
pdflatex demo
pdflatex demo
gives
SSC 335: demo
Victor Eijkhout
today

1 This is a section
This is a test document, used in [2]. It contains a discussion in section 2.
Exercise 1. Left to the reader.
Exercise 2. Also left to the reader, just like in exercise 1
This is a formula: a ⇐ b.
    x_i ← y_ij · x_j^(k)    (1)
Text: ∫_0^1 √x dx
    ∫_0^1 √x dx
As I showed in the introductory section 1, in the paper [1], it was shown that equation (1)
• There is an item.
Contents
1 This is a section 1
2 This is another section 1
List of Figures
1 this is the only figure 1
References
[1] Loyce M. Adams and Harry F. Jordan. Is SOR color-blind? SIAM J. Sci. Stat. Comput., 7:490–506, 1986.
[2] Victor Eijkhout. Short LaTeX demo. SSC 335, oct 1, 2008.
Victor Eijkhout
246 APPENDIX B. PRACTICAL TUTORIALS
The ls command gives you a listing of files that are in your present location.
Exercise. Type ls. Does anything happen?
Expected outcome. If there are files in your directory, they will be listed; if there are none, no output will be given.
This is standard Unix behaviour: no output does not mean that something went wrong, it only means that there is
nothing to report.
The cat command is often used to display files, but it can also be used to create some simple content.
Exercise. Type cat > newfilename (where you can pick any filename) and type some text. Conclude with
Control-d on a line by itself. Now use cat to view the contents of that file: cat newfilename.
Expected outcome. In the first use of cat, text was concatenated from the terminal to a file; in the second the file
was cat’ed to the terminal output. You should see on your screen precisely what you typed into the file.
Caveats. Be sure to type Control-d as the first thing on the last line of input. If you really get stuck, Control-c
will usually get you out. Try this: start creating a file with cat > filename and hit Control-c in the middle
of a line. What are the contents of your file?
Above you used ls to get a directory listing. You can also use the ls command on specific files:
Exercise. Do ls newfilename with the file that you created above; also do ls nosuchfile with a file name
that does not exist.
Expected outcome. For an existing file you get the file name on your screen; for a non-existing file you get an error
message.
Exercise. Read the man page of the ls command: man ls. Find out the size and creation date of some files, for
instance the file you just created.
Expected outcome. Did you find the ls -s and ls -l options? The first one lists the size of each file, usually in
kilobytes, the other gives all sorts of information about a file, including things you will learn about later.
Caveats. The man command puts you in a mode where you can view long text documents. This viewer is common
on Unix systems (it is available as the more or less system command), so memorize the following ways of
navigating: Use the space bar to go forward and the u key to go back up. Use g to go to the beginning of the text,
and G for the end. Use q to exit the viewer. If you really get stuck, Control-c will get you out.
(If you already know what command you’re looking for, you can use man to get online information about it. If you
forget the name of a command, man -k keyword can help you find it.)
Three more useful commands for files are: cp for copying, mv (short for ‘move’) for renaming, and rm (‘remove’)
for deleting. Experiment with them.
There are more commands for displaying a file, parts of a file, or information about a file.
Exercise. Try the commands head, tail, more, and wc on your file.
Expected outcome. head displays the first couple of lines of a file, tail the last, and more uses the same viewer
that is used for man pages. The wc (‘word count’) command reports the number of words, characters, and lines in
a file.
B.3.1.2 Directories
Purpose. Here you will learn about the Unix directory tree, how to manipulate it and how to
move around in it.
A unix file system is a tree of directories, where a directory is a container for files or more directories. We will
display directories as follows:
/ .................................................... The root of the directory tree
bin.........................................................Binary programs
home.............................................Location of users directories
The root of the Unix directory tree is indicated with a slash. Do ls / to see what the files and directories there are
in the root. Note that the root is not the location where you start when you reboot your personal machine, or when
you log in to a server.
Exercise. The command to find out your current working directory is pwd. Your home directory is your working
directory immediately when you log in. Find out your home directory.
Expected outcome. You will typically see something like /home/yourname or /Users/yourname. This is
system dependent.
Do ls to see the contents of the working directory. In the displays in this section, directory names will be followed by a slash: dir/ (see footnote 2 below), but this character is not part of their name. Example:
/home/you/
adirectory/
afile
Exercise. Make a new directory with mkdir newdir and view the current directory with ls.
The command for going into another directory, that is, making it your working directory, is cd (‘change directory’).
It can be used in the following ways:
• cd Without any arguments, cd takes you to your home directory.
• cd <absolute path> An absolute path starts at the root of the directory tree, that is, starts with /.
The cd command takes you to that location.
• cd <relative path> A relative path is one that does not start at the root. This form of the cd
command takes you to <yourcurrentdir>/<relative path>.
Exercise. Do cd newdir and find out where you are in the directory tree with pwd. Confirm with ls that the
directory is empty. How would you get to this location using an absolute path?
Expected outcome. pwd should tell you /home/you/newdir, and ls then has no output, meaning there is
nothing to list. The absolute path is /home/you/newdir.
Exercise. Let’s quickly create a file in this directory: touch onefile, and another directory: mkdir otherdir.
Do ls and confirm that there are a new file and directory.
2. You can tell your shell to operate like this by stating alias ls=ls -F at the start of your session.
The ls command has a very useful option: with ls -a you see your regular files and hidden files, which have a
name that starts with a dot. Doing ls -a in your new directory should tell you that there are the following files:
/home/you/
newdir/........................................................you are here
.
..
onefile
otherdir/
The single dot is the current directory, and the double dot is the directory one level back.
Exercise. Predict where you will be after cd ./otherdir/.. and check to see if you were right.
Expected outcome. The single dot sends you to the current directory, so that does not change anything. The
otherdir part makes that subdirectory your current working directory. Finally, .. goes one level back. In other
words, this command puts you right back where you started.
Since your home directory is a special place, there are shortcuts for cd’ing to it: cd without arguments, cd ˜, and
cd $HOME all get you back to your home.
Exercise. Go to your home directory, and from there do ls newdir to check the contents of the first directory you created, without having to go there. Then do ls .. to see the contents of the directory above your home directory.
Expected outcome. Recall that .. denotes the directory one level up in the tree: you should see your own home
directory, plus the directories of any other users.
Exercise. Can you use ls to see the contents of someone else’s home directory? In the previous exercise you saw
whether other users exist on your system. If so, do ls ../thatotheruser.
Expected outcome. If this is your private computer, you can probably view the contents of the other user’s directory.
If this is a university computer or so, the other directory may very well be protected – permissions are discussed in
the next section – and you get ls: ../otheruser: Permission denied.
Make an attempt to move into someone else’s home directory with cd. Does it work?
You can make copies of a directory with cp, but you need to add a flag to indicate that you recursively copy the
contents: cp -r. Make another directory somedir in your home so that you have
/home/you/
newdir/....................................you have been working in this one
somedir/............................................you just created this one
What is the difference between cp -r newdir somedir and cp -r newdir thirddir where thirddir
is not an existing directory name?
B.3.1.3 Permissions
Purpose. In this section you will learn about how to give various users on your system permis-
sion to do (or not to do) various things with your files.
Unix files, including directories, have permissions, indicating ‘who can do what with this file’. Actions that can be
performed on a file fall into three categories:
• reading r: any access to a file (displaying, getting information on it) that does not change the file;
• writing w: access to a file that changes its content, or even its metadata such as ‘date modified’;
• executing x: if the file is executable, to run it; if it is a directory, to enter it.
The people who can potentially access a file are divided into three classes too:
• the user u: the person owning the file;
• the group g: a group of users to which the owner belongs;
• other o: everyone else.
These nine permissions are rendered in sequence: first those of the user, then those of the group, then those of everyone else.
For instance rw-r--r-- means that the owner can read and write a file, while the owner's group and everyone else can only read it. The third letter of each triple, x, has two meanings: for files it indicates that they can be executed, for directories it indicates that they can be entered.
Permissions are also rendered numerically in groups of three bits, by letting r = 4, w = 2, x = 1:
rwx
421
Common codes are 7 = rwx and 6 = rw. You will find many files that have permissions 755 which stands for an
executable that everyone can run, but only the owner can change, or 644 which stands for a data file that everyone
can see but again only the owner can alter. You can set permissions by
chmod <permissions> file # just one file
chmod -R <permissions> directory # directory, recursively
Examples:
chmod 755 file # set to rwxr-xr-x
chmod g+w file # give group write permission
chmod g=rx file # set group permissions
chmod o-w file # take away write permission from others
chmod o= file # take away all permissions from others.
The man page gives all options.
Exercise. Make a file foo and do chmod u-r foo. Can you now inspect its contents? Make the file readable
again, this time using a numeric code. Now make the file readable to your classmates. Check by having one of them
read the contents.
Expected outcome. When you’ve made the file ‘unreadable’ by yourself, you can still ls it, but not cat it: that will
give a ‘permission denied’ message.
Adding or taking away permissions can be done with the following syntax:
chmod g+r,o-x file # give group read permission
# remove other execute permission
B.3.1.4 Wildcards
You already saw that ls filename gives you information about that one file, and ls gives you all files in the
current directory. To see files with certain conditions on their names, the wildcard mechanism exists. The following wildcards exist:
* any number of characters;
? any single character.
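For example, assuming the files from the earlier exercises exist, you could try:
%% ls newfile*     # every file whose name starts with 'newfile'
%% ls *file        # every file whose name ends in 'file'
%% ls ?????        # every file whose name is exactly five characters long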
For this section you need at least one file that contains some amount of text. You can for instance get random text
from http://www.lipsum.com/feed/html.
The grep command can be used to search for a text expression in a file.
Exercise. Search for the letter q in your text file with grep q yourfile and search for it in all files in your
directory with grep q *. Try some other searches.
Expected outcome. In the first case, you get a listing of all lines that contain a q; in the second case, grep also
reports what file name the match was found in: qfile:this line has q in it.
Caveats. If the string you are looking for does not occur, grep will simply not output anything. Remember that this
is standard behaviour for Unix commands if there is nothing to report.
In addition to searching for literal strings, you can look for more general expressions.
ˆ the beginning of the line
$ the end of the line
. any character
* any number of repetitions
[xyz] any of the characters xyz
This looks like the wildcard mechanism you just saw (section B.3.1.4) but it’s subtly different. Compare the example
above with:
%% cat s
sk
ski
skill
skiing
%% grep "ski*" s
sk
ski
skill
skiing
Some more examples: you can find
• All lines that contain the letter ‘q’ with grep q yourfile;
• All lines that start with an ‘a’ with grep "ˆa" yourfile (if your search string contains special
characters, it is a good idea to use quote marks to enclose it);
• All lines that end with a digit with grep "[0-9]$" yourfile.
The star character stands for zero or more repetitions of the character it follows, so that a* stands for any, possibly
empty, string of a characters.
Exercise. Add a few lines x = 1, x = 2, x = 3 (that is, have different numbers of spaces between x and
the equals sign) to your test file, and make grep commands to search for all assignments to x.
The characters in the table above have special meanings. If you want to search for that actual character, you have to escape it.
Exercise. Make a test file that has both abc and a.c in it, on separate lines. Try the commands grep "a.c"
file, grep a\.c file, grep "a\.c" file.
Expected outcome. You will see that the period needs to be escaped, and the search string needs to be quoted. In
the absence of either, you will see that grep also finds the abc string.
If you type a command such as ls, the shell does not just rely on a list of commands: it will actually go searching
for a program by the name ls. This means that you can have multiple different commands with the same name,
and which one gets executed depends on which one is found first.
Exercise. What you may think of as ‘Unix commands’ are often just executable files in a system directory. Do
which cd, and do an ls -l on the result
Expected outcome. The location of cd is something like /usr/bin/cd. If you ls that, you will see that it is
probably owned by root. Its executable bits are probably set for all users.
The locations where unix searches for commands make up the ‘search path’.
Exercise. Do echo $PATH. Can you find the location of cd? Are there other commands in the same location? Is
the current directory ‘.’ in the path? If not, do export PATH=".:$PATH". Now create an executable file cd
in the current directory (see above for the basics), and do cd. Some people consider having the working directory in
the path a security risk.
Exercise. Do alias chdir=cd and convince yourself that now chdir works just like cd. Do alias rm=’rm
-i’; look up the meaning of this in the man pages. Some people find this alias a good idea; can you see why?
Expected outcome. The -i ‘interactive’ option for rm makes the command ask for confirmation before each delete.
Since unix does not have a trashcan that needs to be emptied explicitly, this can be a good idea.
So far, the unix commands you have used have taken their input from your keyboard, or from a file named on the
command line; their output went to your screen. There are other possibilities for providing input from a file, or for
storing the output in a file.
B.3.3.2.1 Input redirection The grep command had two arguments, the second being a file name. You can also
write grep string < yourfile, where the less-than sign means that the input will come from the named
file, yourfile.
B.3.3.2.2 Output redirection More usefully, grep string yourfile > outfile will take what nor-
mally goes to the terminal, and send it to outfile. The output file is created if it didn’t already exist, otherwise it
is overwritten. (To append, use grep text yourfile >> outfile.)
Exercise. Take one of the grep commands from the previous section, and send its output to a file. Check that the
contents of the file are identical to what appeared on your screen before. Search for a string that does not appear in
the file and send the output to a file. What does this mean for the output file?
Expected outcome. Searching for a string that does not occur in a file gives no terminal output. If you redirect the
output of this grep to a file, it gives a zero size file. Check this with ls and wc.
B.3.3.2.3 Standard files Unix has three standard files that handle input and output:
stdin is the file that provides input for processes.
stdout is the file where the output of a process is written.
stderr is the file where error output is written.
In an interactive session, all three files are connected to the user terminal. Using input or output redirection then means that the input is taken from, or the output sent to, a different file than the terminal.
B.3.3.2.4 Command redirection Instead of taking input from a file, or sending output to a file, it is possible
to connect two commands together, so that the second takes the output of the first as input. The syntax for this is
cmdone | cmdtwo; this is called a pipeline. For instance, grep a yourfile | grep b finds all lines that
contains both an a and a b.
Exercise. Construct a pipeline that counts how many lines there are in your file that contain the string th. Use the
wc command (see above) to do the counting.
There are a few more ways to combine commands. Suppose you want to present the result of wc a bit nicely. Type
the following command
echo The line count is wc -l foo
where foo is the name of an existing file. The way to get the actual line count echoed is by the backquote:
echo The line count is `wc -l foo`
Anything in between backquotes is executed before the rest of the command line is evaluated. The way wc is used
here, it prints the file name. Can you find a way to prevent that from happening?
B.3.3.3 Processes
The Unix operating system can run many programs at the same time, by rotating through the list and giving each
only a fraction of a second to run each time. The command ps can tell you everything that is currently running.
Exercise. Type ps. How many programs are currently running? By default ps gives you only programs that you
explicitly started. Do ps guwax for a detailed list of everything that is running. How many programs are running?
How many belong to the root user, how many to you?
Expected outcome. To count the programs belonging to a user, pipe the ps command through an appropriate grep,
which can then be piped to wc.
In this long listing of ps, the second column contains the process numbers. Sometimes it is useful to have those.
The cut command can cut certain positions from a line: type ps guwax | cut -c 10-14.
To get dynamic information about all running processes, use the top command. Read the man page to find out how
to sort the output by CPU usage.
When you type a command and hit return, that command becomes, for the duration of its run, the foreground
process. Everything else that is running at the same time is a background process.
Make an executable file hello with the following contents:
#!/bin/sh
while [ 1 ] ; do
sleep 2
date
done
and type ./hello.
Exercise. Type Control-z. This suspends the foreground process. It will give you a number like [1] or [2]
indicating that it is the first or second program that has been suspended or put in the background. Now type bg to
put this process in the background. Confirm that there is no foreground process by hitting return, and doing an ls.
Expected outcome. After you put a process in the background, the terminal is available again to accept foreground
commands. If you hit return, you should see the command prompt.
Exercise. Type jobs to see the processes in the current session. If the process you just put in the background was
number 1, type fg %1. Confirm that it is a foreground process again.
Expected outcome. If a shell is executing a program in the foreground, it will not accept command input, so hitting
return should only produce blank lines.
Exercise. When you have made the hello script a foreground process again, you can kill it with Control-c. Try
this. Start the script up again, this time as ./hello & which immediately puts it in the background. You should
also get output along the lines of [1] 12345 which tells you that it is the first job you put in the background,
and that 12345 is its process ID. Kill the script with kill %1. Start it up again, and kill it by using the process
number.
Expected outcome. The command kill 12345 using the process number is usually enough to kill a running
program. Sometimes it is necessary to use kill -9 12345.
B.3.4 Scripting
The unix shells are also programming environments. You will learn more about this aspect of unix in this section.
Exercise. Check on the value of the HOME variable by typing echo $HOME. Also find the value of HOME by piping
env through grep.
There are a number of tests defined, for instance -f somefile tests for the existence of a file. Change your script
so that it will report -1 if the file does not exist.
There are also loops. A for loop looks like
for var in listofitems ; do
something with $var
done
This does the following:
• for each item in listofitems, the variable var is set to the item, and
• the loop body is executed.
As a simple example:
%% for x in a b c ; do echo $x ; done
a
b
c
In a more meaningful example, here is how you would make backups of all your .c files:
for cfile in *.c ; do
cp $cfile $cfile.bak
done
Shell variables can be manipulated in a number of ways. Execute the following commands to see that you can
remove trailing characters from a variable:
%% a=b.c
%% echo ${a%.c}
b
With this as a hint, write a loop that renames all your .c files to .x files.
B.3.4.3 Scripting
It is possible to write programs of unix shell commands. First you need to know how to put a program in a file and
have it be executed. Make a file script1 containing the following two lines:
#!/bin/bash
echo "hello world"
and type ./script1 on the command line. Result? Make the file executable and try again.
You can give your script command line arguments ./script1 foo bar; these are available as $1 et cetera in
the script.
Write a script that takes as input a file name argument, and reports how many lines are in that file.
Edit your script to test whether the file has less than 10 lines (use the foo -lt bar test), and if it does, cat the
file. Hint: you need to use backquotes inside the test.
The number of command line arguments is available as $#. Add a test to your script so that it will give a helpful
message if you call it without any arguments.
B.3.5 Expansion
The shell performs various kinds of expansion on a command line, that is, replacing part of the commandline with
different text.
Brace expansion:
%% echo a{b,cc,ddd}e
abe acce addde
This can for instance be used to delete all extensions of some base file name:
%% rm tmp.{c,s,o} # delete tmp.c tmp.s tmp.o
Tilde expansion gives your own, or someone else’s home directory:
%% echo ~
/share/home/00434/eijkhout
%% echo ~eijkhout
/share/home/00434/eijkhout
Parameter expansion gives the value of shell variables:
%% x=5
%% echo $x
5
Undefined variables do not give an error message:
%% echo $y
There are many variations on parameter expansion. Above you already saw that you can strip trailing characters:
%% a=b.c
%% echo ${a%.c}
b
Here is how you can deal with undefined variables:
%% echo ${y:-0}
0
The backquote mechanism (section B.3.3.2 above) is known as command substitution:
%% echo 123 > w
%% cat w
123
%% wc -c w
4 w
One of the first things you become aware of when you start programming is the distinction between the readable
source code, and the unreadable, but executable, program code. In this tutorial you will learn about a couple more
file types:
• A source file can be compiled to an object file, which is a bit like a piece of an executable: by itself it
does nothing, but it can be combined with other object files to form an executable.
• A library is a bundle of object files that can be used to form an executable. Often, libraries are written
by an expert and contain code for specialized purposes such as linear algebra manipulations. Libraries
are important enough that they can be commercial, to be bought if you need expert code for a certain
purpose.
You will now learn how these types of files are created and used.
Let’s start with a simple program that has the whole source in one file.
#include <stdio.h>
int main() {
printf("hello world\n");
return 0;
}
Compile this program with your favourite compiler; we will use gcc in this tutorial, but substitute your own as
desired. As a result of the compilation, a file a.out is created, which is the executable.
%% gcc hello.c
%% ./a.out
hello world
You can get a more sensible program name with the -o option:
%% gcc -o helloprog hello.c
%% ./helloprog
hello world
Main program: fooprog.c
extern bar(char*);
int main() {
bar("hello world\n");
return 0;
}
Subprogram: foosub.c
#include <stdlib.h>
#include <stdio.h>
void bar(char *s) {
  printf("%s",s);
  return;
}
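The commands for building the program out of these two files are not reproduced above, but they are along the following lines: compile each source file separately with the -c flag, then link the resulting object files together.
%% gcc -c fooprog.c
%% gcc -c foosub.c
%% gcc -o foo fooprog.o foosub.o
%% ./foo
hello world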
B.4.3 Libraries
Purpose. In this section you will learn about libraries.
If you have written some subprograms, and you want to share them with other people (perhaps by selling them),
then handing over individual object files is inconvenient. Instead, the solution is to combine them into a library. First
we look at static libraries, for which the archive utility ar is used. A static library is linked into your executable,
becoming part of it. This may lead to large executables; you will learn about shared libraries next, which do not
suffer from this problem.
Create a directory to contain your library (depending on what your library is for this can be a system directory such
as /usr/lib), and create the library file there.
%% mkdir ../lib
%% ar cr ../lib/libfoo.a foosub.o
The nm command tells you what’s in the library:
%% nm ../lib/libfoo.a
../lib/libfoo.a(foosub.o):
00000000 T _bar
U _printf
Lines with a T indicate functions defined in the library file; a U indicates a function that is used but not defined in it.
The library can be linked into your executable by explicitly giving its name, or by specifying a library path:
%% gcc -o foo fooprog.o ../lib/libfoo.a
# or
%% gcc -o foo fooprog.o -L../lib -lfoo
%% ./foo
hello world
A third possibility is to use the LD_LIBRARY_PATH shell variable. Read the man page of your compiler for its
use, and give the commandlines that create the foo executable, linking the library through this path.
Although they are somewhat more complicated to use, shared libraries have several advantages. For instance, since
they are not linked into the executable but only loaded at runtime, they lead to (much) smaller executables. They
are not created with ar, but through the compiler. For instance:
%% gcc -dynamiclib -o ../lib/libfoo.so foosub.o
%% nm ../lib/libfoo.so
../lib/libfoo.so(single module):
00000fc4 t __dyld_func_lookup
00000000 t __mh_dylib_header
00000fd2 T _bar
U _printf
00001000 d dyld__mach_header
00000fb0 t dyld_stub_binding_helper
Shared libraries are not actually linked into the executable; instead, the executable will contain the information where the library is to be found at execution time.
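As a sketch (exact details differ per system; on Apple systems the relevant variable is DYLD_LIBRARY_PATH), you link against the shared library as before, and then tell the runtime loader where to find it:
%% gcc -o foo fooprog.o -L../lib -lfoo
%% export LD_LIBRARY_PATH=../lib:${LD_LIBRARY_PATH}
%% ./foo
hello world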
B.5.1.1 C
Make the following files:
foo.c
#include "bar.h"
int c=3;
int d=4;
int main()
{
int a=2;
return(bar(a*c*d));
}
bar.c
#include "bar.h"
int bar(int a)
{
int b=10;
return(b*a);
}
bar.h
extern int bar(int);
Makefile
fooprog : foo.o bar.o
cc -o fooprog foo.o bar.o
foo.o : foo.c
cc -c foo.c
bar.o : bar.c
cc -c bar.c
clean :
rm -f *.o fooprog
The makefile has a number of rules like
foo.o : foo.c
<TAB>cc -c foo.c
which have the general form
target : prerequisite(s)
<TAB>rule(s)
where the rule lines are indented by a TAB character.
A rule, such as above, states that a ‘target’ file foo.o is made from a ‘prerequisite’ foo.c, namely by executing
the command cc -c foo.c. The precise definition of the rule is:
• if the target foo.o does not exist or is older than the prerequisite foo.c,
• then the command part of the rule is executed: cc -c foo.c
• If the prerequisite is itself the target of another rule, then that rule is executed first.
Probably the best way to interpret a rule is:
• if any prerequisite has changed,
• then the target needs to be remade,
• and that is done by executing the commands of the rule.
If you call make without any arguments, the first rule in the makefile is evaluated. You can execute other rules by
explicitly invoking them, for instance make foo.o to compile a single file.
Expected outcome. The above rules are applied: make without arguments tries to build the first target, fooprog.
In order to build this, it needs the prerequisites foo.o and bar.o, which do not exist. However, there are rules
for making them, which make recursively invokes. Hence you see two compilations, for foo.o and bar.o, and
a link command for fooprog.
Caveats. Typos in the makefile or in file names can cause various errors. In particular, make sure you use tabs and
not spaces for the rule lines. Unfortunately, debugging a makefile is not simple. Make’s error message will usually
give you the line number in the make file where the error was detected.
Exercise. Do make clean, followed by mv foo.c boo.c and make again. Explain the error message. Re-
store the original file name.
Expected outcome. Make will complain that there is no rule to make foo.c. This error was caused when foo.c
was a prerequisite, and was found not to exist. Make then went looking for a rule to make it.
Now add a second argument to the function bar. This requires you to edit bar.c and bar.h: go ahead and make
these edits. However, it also requires you to edit foo.c, but let us for now ‘forget’ to do that. We will see how
Make can help you find the resulting error.
Expected outcome. Even though conceptually foo.c would need to be recompiled since it uses the bar function,
Make did not do so because the makefile had no rule that forced it.
Exercise. Confirm that the new makefile indeed causes foo.o to be recompiled if bar.h is changed. This com-
pilation will now give an error, since you ‘forgot’ to edit the use of the bar function.
B.5.1.2 Fortran
Make the following files:
foomain.F
program test
use testmod
call func(1,2)
end program
foomod.F
module testmod
contains
subroutine func(a,b)
integer a,b
print *,a,b,c
end subroutine func
end module
and a makefile:
Makefile
fooprog : foomain.o foomod.o
gfortran -o fooprog foomain.o foomod.o
foomain.o : foomain.F
gfortran -c foomain.F
foomod.o : foomod.F
gfortran -c foomod.F
clean :
rm -f *.o fooprog
If you call make, the first rule in the makefile is executed. Do this, and explain what happens.
Expected outcome. The above rules are applied: make without arguments tries to build the first target, fooprog.
In order to build this, it needs the prerequisites foomain.o and foomod.o, which do not exist. However, there
are rules for making them, which make recursively invokes. Hence you see two compilations, for foomain.o and
foomod.o, and a link command for fooprog.
Caveats. Typos in the makefile or in file names can cause various errors. Unfortunately, debugging a makefile is not
simple. You will just have to understand the errors, and make the corrections.
Exercise. Do make clean, followed by mv foomain.F boomain.F and make again. Explain the error message. Restore the original file name.
Expected outcome. Make will complain that there is no rule to make foomain.F. This error was caused when foomain.F was a prerequisite, and was found not to exist. Make then went looking for a rule to make it.
Expected outcome. Even though conceptually foomain.F would need to be recompiled, Make did not do so because the makefile had no rule that forced it.
Exercise. Confirm that the corrected makefile indeed causes foomain.F to be recompiled.
Exercise. Edit your makefile as indicated. First do make clean, then make foo (C) or make fooprog (For-
tran).
Expected outcome. You should see the exact same compile and link lines as before.
Caveats. Unlike in the shell, where braces are optional, variable names in a makefile have to be in braces or paren-
theses. Experiment with what happens if you forget the braces around a variable name.
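For reference, a version of the C makefile with a compiler variable, roughly as this exercise assumes, could look like this (our sketch):
CC = gcc
fooprog : foo.o bar.o
	${CC} -o fooprog foo.o bar.o
foo.o : foo.c
	${CC} -c foo.c
bar.o : bar.c
	${CC} -c bar.c
clean :
	rm -f *.o fooprog
A value given on the commandline, as in make CC="icc -O2", overrides the assignment in the makefile.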
One advantage of using variables is that you can now change the compiler from the commandline:
make CC="icc -O2"
make FC="gfortran -g"
Exercise. Invoke Make as suggested (after make clean). Do you see the difference in your screen output?
Expected outcome. The compile lines now show the added compiler option -O2 or -g.
Exercise. Construct a commandline so that your makefile will build the executable fooprog v2.
3. This mechanism is the first instance you’ll see that only exists in GNU make, though in this particular case there is a similar mechanism
in standard make. That will not be the case for the wildcard mechanism in the next section.
%.o : %.c
${CC} -c $<
%.o : %.F
${FC} -c $<
This states that any object file depends on the C or Fortran file with the same base name. To regenerate the object
file, invoke the C or Fortran compiler with the -c flag. These template rules can function as a replacement for the
multiple specific targets in the makefiles above, except for the rule for foo.o.
The dependence of foo.o on bar.h can be handled by adding a rule
foo.o : bar.h
with no further instructions. This rule states, ‘if file bar.h changed, file foo.o needs updating’. Make will then
search the makefile for a different rule that states how this updating is done.
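Putting the pieces together, a complete makefile based on template rules could look like this (our sketch, for the C example):
CC = gcc
fooprog : foo.o bar.o
	${CC} -o fooprog foo.o bar.o
%.o : %.c
	${CC} -c $<
foo.o : bar.h
clean :
	rm -f *.o fooprog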
B.5.3 Wildcards
Your makefile now uses one general rule for compiling all your source files. Often, these source files will be all the
.c or .F files in your directory, so is there a way to state ‘compile everything in this directory’? Indeed there is.
Add the following lines to your makefile, and use the variable COBJECTS or FOBJECTS wherever appropriate.
# wildcard: find all files that match a pattern
CSOURCES := ${wildcard *.c}
# pattern substitution: replace one pattern string by another
COBJECTS := ${patsubst %.c,%.o,${CSOURCES}}
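The Fortran case is analogous, and the resulting object list can then be used in the link rule:
FSOURCES := ${wildcard *.F}
FOBJECTS := ${patsubst %.F,%.o,${FSOURCES}}
fooprog : ${FOBJECTS}
	gfortran -o fooprog ${FOBJECTS}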
B.5.4 Miscellania
B.5.4.1 What does this makefile do?
Above you learned that issuing the make command will automatically execute the first rule in the makefile. This
is convenient in one sense4 , and inconvenient in another: the only way to find out what possible actions a makefile
allows is to read the makefile itself, or the – usually insufficient – documentation.
A better idea is to start the makefile with a target
4. There is a convention among software developers that a package can be installed by the sequence ./configure ; make ;
make install, meaning: Configure the build process for this computer, Do the actual build, Copy files to some system directory such as
/usr/bin.
info :
@echo "The following are possible:"
@echo " make"
@echo " make clean"
Now make without explicit targets informs you of the capabilities of the makefile.
and likewise for make other. What goes wrong here is the use of $@.o as prerequisite. In Gnu Make, you can
repair this as follows:
.SECONDEXPANSION:
${PROGS} : $$@.o
In the makefiles you have seen so far, the command part was a single line. You can actually have as many lines there
as you want. For example, let us make a rule for making backups of the program you are building.
Add a backup rule to your makefile. The first thing it needs to do is make a backup directory:
.PHONY : backup
backup :
if [ ! -d backup ] ; then
mkdir backup
fi
Did you type this? Unfortunately it does not work: every line in the command part of a makefile rule gets executed
as a single program. Therefore, you need to write the whole command on one line:
backup :
if [ ! -d backup ] ; then mkdir backup ; fi
or if the line gets too long:
backup :
if [ ! -d backup ] ; then \
mkdir backup ; \
fi
Next we do the actual copy:
backup :
if [ ! -d backup ] ; then mkdir backup ; fi
cp myprog backup/myprog
But this backup scheme only saves one version. Let us make a version that has the date in the name of the saved
program.
The Unix date command can customize its output by accepting a format string. Type the following:
date
This can be used in the makefile.
Exercise. Edit the cp command line so that the name of the backup file includes the current date.
Expected outcome. Hint: you need the backquote. Consult the Unix tutorial if you do not remember what backquotes
do.
If you are defining shell variables in the command section of a makefile rule, you need to be aware of the following.
Extend your backup rule with a loop to copy the object files:
backup :
if [ ! -d backup ] ; then mkdir backup ; fi
cp myprog backup/myprog
for f in ${OBJS} ; do \
cp $f backup ; \
done
(This is not the best way to copy, but we use it for the purpose of demonstration.) This leads to an error message,
caused by the fact that Make expands $f itself, as the Make variable f, before the command is handed to the shell. What works is:
backup :
if [ ! -d backup ] ; then mkdir backup ; fi
cp myprog backup/myprog
for f in ${OBJS} ; do \
cp $$f backup ; \
done
(In this case Make replaces the double dollar by a single one when it scans the commandline. During the execution
of the commandline, $f then expands to the proper filename.)
%.pdf : %.tex
pdflatex $<
The command make myfile.pdf will, if needed, invoke pdflatex myfile.tex once. Next we repeatedly
invoke pdflatex until the log file no longer reports that further runs are needed:
%.pdf : %.tex
pdflatex $<
while [ `cat ${basename $@}.log | grep "Rerun to get cross-references right." | wc -l` -gt 0 ] ; do \
pdflatex $< ; \
done
We use the ${basename fn} macro to extract the base name without extension from the target name.
In case the document has a bibliography or index, we run bibtex and makeindex.
%.pdf : %.tex
pdflatex ${basename $@}
-bibtex ${basename $@}
-makeindex ${basename $@}
while [ `cat ${basename $@}.log | grep "Rerun to get cross-references right." | wc -l` -gt 0 ] ; do \
pdflatex ${basename $@} ; \
done
The minus sign at the start of the line means that Make should not abort if these commands fail.
Finally, we would like to use Make’s facility for taking dependencies into account. We could write a makefile that
has the usual rules
mainfile.pdf : mainfile.tex includefile.tex
but we can also discover the include files explicitly. The following makefile is invoked with
make pdf TEXFILE=mainfile
The pdf rule then uses some shell scripting to discover the include files (but not recursively), and it calls Make
again, invoking another rule and passing the dependencies explicitly.
pdf :
export includes=`grep "^.input " ${TEXFILE}.tex | awk '{v=v FS $$2".tex"} END {print v}'` ; \
${MAKE} ${TEXFILE}.pdf INCLUDES="$$includes"
First we need to have a repository. In practice, you will often use one that has been set up by a sysadmin, but here
one student will set it up in his home directory. Another option would be to use a hosting service, commercial or
free, such as http://code.google.com/projecthosting.
%% svnadmin create ./repository --fs-type=fsfs
For the purposes of this exercise we need to make this directory, and the home directory of the first student, visible
to the other student. In a practical situation this is not necessary: the sysadmin would set up the repository on the
server.
%% chmod -R g+rwX ./repository
%% chmod g+rX ~
Both students now make some dummy data to start the repository. Please choose unique directory names.
%% mkdir sourcedir
%% cat > sourcedir/firstfile
a
b
c
d
e
f
^D
This project data can now be added to the repository:
%% svn import -m "initial upload" sourcedir file://`pwd`/repository
Adding sourcedir/firstfile
Committed revision 1.
The second student needs to spell out the path
%% svn import -m "initial upload" otherdir \
file:///share/home/12345/firststudent/repository
Instead of specifying the creation message on the commandline, you can also set the EDITOR or SVN_EDITOR
environment variable and do
%% svn import sourcedir file://`pwd`/repository
An editor will then open so that you can input descriptive notes.
Now we pretend that this repository is in some far location, and we want to make a local copy. Both students will
now work on the same files in the repository, to simulate a programming team.
%% svn co file://`pwd`/repository myproject
A myproject/secondfile
A myproject/firstfile
Checked out revision 2.
(In practice, you would use http: or svn+ssh: as the protocol, rather than file:.) This is the last time you
need to specify the repository path: the local copy you have made remembers the location.
Let’s check that everything is in place:
%% ls
myproject/ repository/ sourcedir/
%% rm -rf sourcedir/
%% cd myproject/
%% ls -a
./ ../ firstfile myfile .svn/
%% svn info
Path: .
URL: file:///share/home/12345/yournamehere/repository
Repository UUID: 3317a5b0-6969-0410-9426-dbff766b663f
Revision: 2
Node Kind: directory
Schedule: normal
Last Changed Author: build
Last Changed Rev: 2
Last Changed Date: 2009-05-08 12:26:22 -0500 (Fri, 08 May 2009)
From now on all work will happen in the checked out copy of the repository.
Both students should use an editor to create a new file. Make sure to use two distinct names.
%% cat > otherfile # both students create a new file
1
2
3
Also, let both students edit an existing file, but not the same.
%% emacs firstfile # make some changes in an old file, not the same
Now use the svn status command. Svn will report that it does not know the new file, and that it detects changes
in the other.
%% svn status
? otherfile
M firstfile
With svn add you tell svn that the new file should become part of the repository:
%% svn add otherfile
A otherfile
%% svn status
A otherfile
M firstfile
You can now add your changes to the repository:
%% svn commit -m "my first batch of changes"
Sending firstfile
Adding otherfile
Transmitting file data ..
Committed revision 3.
Since both students have made changes, they need to get those that the other has made:
%% svn update
A mysecondfile
U myfile
Updated to revision 4.
This states that one new file was added, and an existing file updated.
In order for svn to keep track of your files, you should never do cp or mv on files that are in the repository. Instead,
do svn cp or svn mv. Likewise, there are commands svn rm and svn mkdir.
B.6.4 Conflicts
Purpose. In this section you will learn how to deal with conflicting edits by two users
of the same repository.
Now let’s see what happens if two people edit the same file. Let both students make an edit to firstfile, but
one to the top, the other to the bottom. After one student commits the edit, the other will see
%% emacs firstfile # make some change
%% svn commit -m "another edit to the first file"
Sending firstfile
svn: Commit failed (details follow):
svn: Out of date: ’firstfile’ in transaction ’5-1’
The solution is to get the other edit, and commit again. After the update, svn reports that it has resolved a conflict
successfully.
%% svn update
G firstfile
Updated to revision 5.
%% svn commit -m "another edit to the first file"
Sending firstfile
Transmitting file data .
Committed revision 6.
If both students make edits on the same part of the file, svn can no longer resolve the conflicts. For instance, let one
student insert a line between the first and the second, and let the second student edit the second line. Whoever tries
to commit second will get messages like this:
%% svn commit -m "another edit to the first file"
svn: Commit failed (details follow):
You’ve already seen svn info as a way of getting information about the repository. To get the history, do svn
log to get all log messages, or svn log -r 2:5 to get a range.
To see differences in various revisions of individual files, use svn diff. First do svn commit and svn update
to make sure you are up to date. Now do svn diff firstfile. No output, right? Now make an edit in
firstfile and do svn diff firstfile again. This gives you the difference between the last committed
version and the working copy.
You can also ask for differences between committed versions with svn diff -r 4:6 firstfile.
The output of this diff command is a bit cryptic, but you can understand it without too much trouble. There are
fancy GUI implementations of svn for every platform that show you differences in a much nicer way.
If you simply want to see what a file used to look like, do svn cat -r 2 firstfile. To get a copy of a
certain revision of the repository, do svn export -r 3 . ../rev3.
If you save the output of svn diff, it is possible to apply it with the Unix patch command. This is a quick way
to send patches to someone without them needing to check out the repository.
• svn revert restores the file to the state it had at your last checkout or update. For a deleted file, this means that
it is brought back into existence from the repository. This command is also useful to undo any local edits,
if you change your mind about something.
• svn rm firstfile is the official way to delete a file. You can do this even if you have already
deleted the file outside svn.
• Sometimes svn will get confused about your attempts to delete a file. You can then do svn rm
--force yourfile.
versions as needed. Additionally, when the patch is ready to merge back, it is not a single massive change, but a
sequence of small changes, which the source control system should be better able to handle.
A good introduction to Mercurial can be found at http://www.hginit.com/.
5. In order to do the examples, the h5dump utility needs to be in your path, and you need to know the location of the hdf5.h and
libhdf5.a and related library files.
Exercise. Create an HDF5 file by compiling and running the create.c example below.
Expected outcome. A file file.h5 should be created.
Caveats. Be sure to add the HDF5 include and library directories to the compile and link lines:
cc -c create.c -I. -I/opt/local/include
cc -o create create.o -L/opt/local/lib -lhdf5
The include and lib directories will be system dependent.
On the TACC clusters, do module load hdf5, which will give you environment variables TACC_HDF5_INC
and TACC_HDF5_LIB for the include and library directories, respectively.
#include "myh5defs.h"
#define FILE "file.h5"
main() {
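A minimal sketch of what such a creation program contains, assuming only the standard HDF5 C calls H5Fcreate and H5Fclose, and including hdf5.h directly rather than the book's myh5defs.h:
#include "hdf5.h"
#define FILE "file.h5"
int main() {
  hid_t file_id;
  herr_t status;
  /* H5F_ACC_TRUNC: overwrite the file if it already exists */
  file_id = H5Fcreate(FILE, H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
  status  = H5Fclose(file_id);
  return status < 0;
}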
B.7.3 Datasets
Next we create a dataset, in this example a 2D grid. To describe this, we first need to construct a dataspace:
dims[0] = 4; dims[1] = 6;
dataspace_id = H5Screate_simple(2, dims, NULL);
dataset_id = H5Dcreate(file_id, "/dset", dataspace_id, .... );
....
status = H5Dclose(dataset_id);
status = H5Sclose(dataspace_id);
Note that datasets and dataspaces need to be closed, just like files.
Exercise. Create a dataset by compiling and running the dataset.c code below.
Expected outcome. This creates a file dset.h5 that can be displayed with h5dump.
#include "myh5defs.h"
#define FILE "dset.h5"
main() {
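Again a minimal sketch rather than the actual dataset.c, assuming the HDF5 1.8 form of H5Dcreate and a dataset of doubles:
#include "hdf5.h"
#define FILE "dset.h5"
int main() {
  hid_t file_id, dataspace_id, dataset_id;
  hsize_t dims[2];
  herr_t status;
  file_id = H5Fcreate(FILE, H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
  /* a 4x6 two-dimensional dataspace */
  dims[0] = 4; dims[1] = 6;
  dataspace_id = H5Screate_simple(2, dims, NULL);
  /* in HDF5 1.8 and later, H5Dcreate takes property lists for
     link creation, dataset creation, and dataset access */
  dataset_id = H5Dcreate(file_id, "/dset", H5T_NATIVE_DOUBLE, dataspace_id,
                         H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
  status = H5Dclose(dataset_id);
  status = H5Sclose(dataspace_id);
  status = H5Fclose(file_id);
  return status < 0;
}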
Exercise. Add a scalar dataspace to the HDF5 file, by compiling and running the parmwrite.c code below.
#include "myh5defs.h"
#define FILE "pdset.h5"
main() {
%% h5dump wdset.h5
HDF5 "wdset.h5" {
GROUP "/" {
DATASET "dset" {
DATATYPE H5T_IEEE_F64LE
DATASPACE SIMPLE { ( 4, 6 ) / ( 4, 6 ) }
DATA {
(0,0): 0.5, 1.5, 2.5, 3.5, 4.5, 5.5,
(1,0): 6.5, 7.5, 8.5, 9.5, 10.5, 11.5,
(2,0): 12.5, 13.5, 14.5, 15.5, 16.5, 17.5,
(3,0): 18.5, 19.5, 20.5, 21.5, 22.5, 23.5
}
}
DATASET "parm" {
DATATYPE H5T_STD_I32LE
DATASPACE SCALAR
DATA {
(0): 37
}
}
}
}
main() {
%% h5dump wdset.h5
HDF5 "wdset.h5" {
GROUP "/" {
DATASET "dset" {
DATATYPE H5T_IEEE_F64LE
DATASPACE SIMPLE { ( 4, 6 ) / ( 4, 6 ) }
DATA {
(0,0): 0.5, 1.5, 2.5, 3.5, 4.5, 5.5,
(1,0): 6.5, 7.5, 8.5, 9.5, 10.5, 11.5,
(2,0): 12.5, 13.5, 14.5, 15.5, 16.5, 17.5,
(3,0): 18.5, 19.5, 20.5, 21.5, 22.5, 23.5
}
}
DATASET "parm" {
DATATYPE H5T_STD_I32LE
DATASPACE SCALAR
DATA {
(0): 37
}
}
}
}
If you look closely at the source and the dump, you see that the data types are declared as ‘native’, but rendered as
LE. The ‘native’ declaration makes the datatypes behave like the built-in C or Fortran data types. Alternatively, you
can explicitly indicate whether data is little endian or big endian. These terms describe how the bytes of a data item
are ordered in memory. Most architectures use little endian, as you can see in the dump output, but, notably, IBM
uses big endian.
B.7.5 Reading
Now that we have a file with some data, we can do the mirror part of the story: reading from that file. The essential
commands are
h5file = H5Fopen( .... )
....
H5Dread( dataset, .... data .... )
where the H5Dread command has the same arguments as the corresponding H5Dwrite.
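As a sketch, reading back the 4 × 6 double dataset /dset from wdset.h5 could look as follows, assuming the HDF5 1.8 forms of H5Fopen and H5Dopen:
#include <stdio.h>
#include "hdf5.h"
int main() {
  hid_t file_id, dataset_id;
  herr_t status;
  double data[4][6];
  file_id = H5Fopen("wdset.h5", H5F_ACC_RDONLY, H5P_DEFAULT);
  dataset_id = H5Dopen(file_id, "/dset", H5P_DEFAULT);
  /* same arguments as the corresponding H5Dwrite call */
  status = H5Dread(dataset_id, H5T_NATIVE_DOUBLE, H5S_ALL, H5S_ALL,
                   H5P_DEFAULT, data);
  printf("data point [1,2]: %e\n", data[1][2]);
  status = H5Dclose(dataset_id);
  status = H5Fclose(file_id);
  return status < 0;
}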
Exercise. Read data from the wdset.h5 file that you created in the previous exercise, by compiling and running
the allread.c example below.
Expected outcome. Running the allread executable will print the value 37 of the parameter, and the value 8.5
of the (1,2) data point of the array.
Caveats. Make sure that you run parmwrite to create the input file.
%% ./allread
parameter value: 37
arbitrary data point [1,2]: 8.500000e+00
B.8.1.3.1 Construction of the coefficient matrix We will construct the matrix of the 5-point stencil for the Poisson
operator (section 4.2.3) as a sparse matrix.
We start by determining the size of the matrix. PETSc has an elegant mechanism for getting parameters from the
program commandline. This prevents the need for recompilation of the code.
C:
int domain_size;
PetscOptionsGetInt
(PETSC_NULL,"-n",&domain_size,&flag);
if (!flag) domain_size = 10;
matrix_size = domain_size*domain_size;
Fortran:
call PetscOptionsGetInt(PETSC_NULL_CHARACTER,
> "-n",domain_size,flag)
if (.not.flag) domain_size = 10
matrix_size = domain_size*domain_size;
Creating a matrix involves specifying its communicator, its type (here: distributed sparse), and its local and global
sizes. In this case, we specify the global size and leave the local sizes and data distribution up to PETSc.
C:
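(The following is only a sketch of the creation calls, assuming the distributed sparse type MATMPIAIJ and leaving preallocation to the defaults; the example code belonging to the book may differ in details.)
Mat A;
MatCreate(comm,&A);
/* specify the global size only; local sizes and the
   distribution of rows are left to PETSc */
MatSetSizes(A,PETSC_DECIDE,PETSC_DECIDE,matrix_size,matrix_size);
MatSetType(A,MATMPIAIJ);
MatSetFromOptions(A);
/* recent PETSc versions also require MatSetUp before values are inserted */
MatSetUp(A);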
B.8.1.3.2 Filling in matrix elements We will now set matrix elements (refer to figure 4.1) by having each pro-
cessor iterate over the full domain, but only inserting those elements that are in its matrix block row.
C:
MatGetOwnershipRange(A,&low,&high);
for ( i=0; i<m; i++ ) {
for ( j=0; j<n; j++ ) {
I = j + n*i;
if (I>=low && I<high) {
J = I-1; v = -1.0;
if (j>0) MatSetValues
(A,1,&I,1,&J,&v,INSERT_VALUES);
J = I+1; // et cetera
}
}
}
MatAssemblyBegin(A,MAT_FINAL_ASSEMBLY);
MatAssemblyEnd(A,MAT_FINAL_ASSEMBLY);
Fortran:
call MatGetOwnershipRange(A,low,high)
do i=0,m-1
do j=0,n-1
ii = j + n*i
if (ii>=low .and. ii<high) then
jj = ii - n
if (i>0) call MatSetValues
> (A,1,ii,1,jj,v,INSERT_VALUES)
...
call MatAssemblyBegin(A,MAT_FINAL_ASSEMBLY)
call MatAssemblyEnd(A,MAT_FINAL_ASSEMBLY)
B.8.1.3.3 Finite Element Matrix assembly PETSc is versatile in dealing with Finite Element matrices (see sec-
tions 4.2.5 and 6.6.2), where elements are constructed by adding together contributions, sometimes from different
processors. This is no problem in PETSc: any processor can set (or add to) any matrix element. The assembly calls
will move data to their eventual location on the correct processors.
for (e=myfirstelement; e<mylastelement; e++) {
for (i=0; i<nlocalnodes; i++) {
I = localtoglobal(e,i);
for (j=0; j<nlocalnodes; j++) {
J = localtoglobal(e,j);
v = integration(e,i,j);
MatSetValues
(mat,1,&I,1,&J,&v,ADD_VALUES);
....
}
}
}
MatAssemblyBegin(mat,MAT_FINAL_ASSEMBLY);
MatAssemblyEnd(mat,MAT_FINAL_ASSEMBLY);
B.8.1.3.4 Linear system solver Next we declare an iterative method and preconditioner to solve the linear system
with. PETSc has a large repertoire of Krylov methods (section 5.5.7) and preconditioners.
C:
KSPCreate(comm,&Solver);
KSPSetOperators(Solver,A,A,0);
KSPSetType(Solver,KSPCGS);
{
PC Prec;
KSPGetPC(Solver,&Prec);
PCSetType(Prec,PCJACOBI);
}
Fortran:
call KSPCreate(comm,Solve)
call KSPSetOperators(Solve,A,A,0)
call KSPSetType(Solve,KSPCGS)
call KSPGetPC(Solve,Prec)
call PCSetType(Prec,PCJACOBI)
The actual preconditioner is not yet created: that will happen in the solve call.
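As a side remark, not taken from the example above: it is common to also call KSPSetFromOptions, so that the method and preconditioner can be overridden at runtime, for instance with -ksp_type cg -pc_type jacobi. A sketch:
KSPCreate(comm,&Solver);
KSPSetOperators(Solver,A,A,0);
/* pick up -ksp_type, -pc_type, -ksp_rtol, ... from the commandline */
KSPSetFromOptions(Solver);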
B.8.1.3.5 Input and output vectors Finally, we need our right hand side and solution vectors. To make them
compatible with the matrix, they need to be created with the same local and global sizes. This can be done by
literally specifying the same parameters, but a more elegant way is to query the matrix size and use the result.
C:
MatGetLocalSize(A,&isize,PETSC_NULL); // get local size
VecCreateMPI(comm,isize,PETSC_DECIDE,&Rhs);
// explicit alternative:
// VecCreateMPI(comm,PETSC_DECIDE,matrix_size,&Rhs);
VecDuplicate(Rhs,&Sol);
VecSet(Rhs,one);
Fortran:
call MatGetLocalSize(A,isize,PETSC_NULL_INTEGER)
call VecCreateMPI(comm,isize,PETSC_DECIDE,Rhs)
call VecDuplicate(Rhs,Sol)
call VecSet(Rhs,one)
B.8.1.3.6 Solving the system Solving the linear system is a one line call to KSPSolve. The story would end
there if it weren’t for some complications:
• Iterative methods can fail, and the solve call does not tell us whether that happened.
• If the system was solved successfully, we would like to know in how many iterations.
• There can be other reasons for the iterative method to halt, such as reaching its maximum number of
iterations without converging.
The following code snippet illustrates how this knowledge can be extracted from the solver object.
C:
KSPSolve(Solver,Rhs,Sol);
{
PetscInt its; KSPConvergedReason reason;
Vec Res; PetscReal norm;
KSPGetConvergedReason(Solver,&reason);
if (reason<0) {
PetscPrintf(comm,"Failure to converge\n");
} else {
KSPGetIterationNumber(Solver,&its);
PetscPrintf(comm,"Number of iterations: %d\n",its);
}
}
Fortran:
call KSPSolve(Solve,Rhs,Sol)
call KSPGetConvergedReason(Solve,reason)
if (reason<0) then
call PetscPrintf(comm,"Failure to converge\n")
else
call KSPGetIterationNumber(Solve,its)
write(msg,10) its
10 format('Number of iterations: ',i4)
call PetscPrintf(comm,msg)
end if
B.8.1.3.7 Further operations As an illustration of the toolbox nature of PETSc, here is a code snippet that
computes the residual vector after a linear system solution. This operation could have been added to the library, but
composing it yourself offers more flexibility.
C:
VecDuplicate(Rhs,&Res);
MatMult(A,Sol,Res);
VecAXPY(Res,-1,Rhs);
VecNorm(Res,NORM_2,&norm);
PetscPrintf(MPI_COMM_WORLD,"residual norm: %e\n",norm);
Fortran:
call VecDuplicate(Rhs,Res)
call MatMult(A,Sol,Res)
call VecAXPY(Res,mone,Rhs)
call VecNorm(Res,NORM_2,norm)
if (mytid==0) print *,"residual norm:",norm
Finally, we need to free all objects, both the obvious ones, such as the matrix and the vectors, and the non-obvious
ones, such as the solver.
VecDestroy(Res);
MatDestroy(A);
KSPDestroy(Solver);
VecDestroy(Rhs);
VecDestroy(Sol);
call MatDestroy(A)
call KSPDestroy(Solve)
call VecDestroy(Rhs)
call VecDestroy(Sol)
call VecDestroy(Res)
This is the power method (section A.1.2), which is expected to converge to the dominant
eigenvector.
• In each iteration of this process, print out the norm of yi and for i > 0 the norm of the
difference xi − xi−1 . Do this for some different problem sizes. What do you observe?
• The number of iterations and the size of the problem should be specified through commandline
options. Use the routine PetscOptionsGetInt.
For a small problem (say, n = 10) print out the first couple xi vectors. What do you observe?
Explanation?
Exercise 2.5. Extend the previous exercise: if a commandline option -inverse is present, the se-
quence should be generated as yi+1 = A−1 xi . Use the routine PetscOptionsHasName.
What do you observe now about the norms of the yi vectors?
6. The linear system solver from this package later became the Linpack benchmark.
7. PLapack is probably the easier to use of the two.
8. We are not going into band storage here.
B.8.2.1.1 Fortran column-major ordering Since computer memory is one-dimensional, some conversion is
needed from two-dimensional matrix coordinates to memory locations. The Fortran language uses column-major
storage, that is, elements in a column are stored consecutively; see figure B.2. This is also described informally as
‘the leftmost index varies quickest’.
B.8.2.1.2 Submatrices and the LDA parameter Using the storage scheme described above, it is clear how to
store an m × n matrix in mn memory locations. However, there are many cases where software needs access
to a matrix that is a subblock of another, larger, matrix. As you see in figure B.3 such a subblock is no longer
contiguous in memory. The way to describe this is by introducing a third parameter in addition to M,N: we let LDA
be the ‘leading dimension of A’, that is, the allocated first dimension of the surrounding array. This is illustrated in
figure B.4.
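The following C sketch, with zero-based indices and made-up sizes, shows how element (i,j) of such a subblock is addressed; the macro name INDEX is ours, not part of any library.
#include <stdio.h>
/* element (i,j), zero-based, of a column-major array with leading dimension lda */
#define INDEX(i,j,lda) ((i)+(j)*(lda))
int main() {
  /* a 4x3 subblock stored inside a 6x3 allocation: M=4, N=3, LDA=6 */
  int lda = 6, m = 4, n = 3, i, j;
  double big[6*3];
  for (i=0; i<lda*n; i++) big[i] = i;
  /* print only the m x n subblock; rows m..lda-1 of each column are skipped */
  for (i=0; i<m; i++) {
    for (j=0; j<n; j++)
      printf("%6.1f ", big[INDEX(i,j,lda)]);
    printf("\n");
  }
  return 0;
}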
B.9.2 Plotting
The basic plot commands are plot for 2D, and splot (‘surface plot’) for 3D plotting.
you get two graphs in one plot, with the x range limited to [0, 1], and the appropriate legends for the graphs. The
variable x is the default for plotting functions.
Plotting one function against another – or equivalently, plotting a parametric curve – goes like this:
set parametric
plot [t=0:1.57] cos(t),sin(t)
which gives a quarter circle.
To get more than one graph in a plot, use the command set multiplot.
B.9.2.3 Customization
Plots can be customized in many ways. Some of these customizations use the set command. For instance,
set xlabel "time"
set ylabel "output"
set title "Power curve"
You can also change the default drawing style with
set style function dots
(lines, dots, points, et cetera), or change on a single plot with
plot f(x) with points
B.9.3 Workflow
Imagine that your code produces a dataset that you want to plot, and you run your code for a number of inputs. It
would be nice if the plotting could be automated. Gnuplot itself does not have the facilities for this, but with a little
help from shell programming this is not hard to do.
Suppose you have a number of data files from successive runs of your code.
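For instance, the following shell sketch assumes data files named run*.data with two columns each, and a gnuplot build that has the pdf terminal (substitute pdfcairo or postscript if yours does not):
#!/bin/sh
# turn every run*.data file into a plot of column 2 against column 1
for datafile in run*.data ; do
  gnuplot <<EOF
set terminal pdf
set output "${datafile%.data}.pdf"
plot "${datafile}" using 1:2 with lines
EOF
done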
B.10.1.1 Arrays
C and Fortran have different conventions for storing multi-dimensional arrays. You need to be aware of this when
you pass an array between routines written in different languages.
Fortran stores multi-dimensional arrays in column-major order. For two dimensional arrays (A(i,j)) this means
that the elements in each column are stored contiguously: a 2×2 array is stored as A(1,1), A(2,1), A(1,2),
A(2,2). Three and higher dimensional arrays are an obvious extension: it is sometimes said that ‘the left index
varies quickest’.
C arrays are stored in row-major order: elements in each row are stored contiguously, and rows are then placed se-
quentially in memory. A 2×2 array A[2][2] is then stored as A[0][0], A[0][1], A[1][0], A[1][1].
A number of remarks about arrays in C.
• C (before the C99 standard) has multi-dimensional arrays only in a limited sense. You can declare them,
but if you pass them to another C function, they no longer look multi-dimensional: they have become
plain float* (or whatever type) arrays. That brings me to the next point.
• Multi-dimensional arrays in C look as if they have type float**, that is, an array of pointers that point
to (separately allocated) arrays for the rows. While you could certainly implement this:
float **A;
A = (float**)malloc(m*sizeof(float*));
for (i=0; i<m; i++)
A[i] = (float*)malloc(n*sizeof(float));
careful reading of the standard reveals that a multi-dimensional array is in fact a single block of memory,
no further pointers involved.
Given the above limitation on passing multi-dimensional arrays, and the fact that a C routine can not tell whether
it’s called from Fortran or C, it is best not to bother with multi-dimensional arrays in C, and to emulate them:
float *A;
A = (float*)malloc(m*n*sizeof(float));
#define sub(i,j,m,n) ((i)+(j)*(m))
for (i=0; i<m; i++)
for (j=0; j<n; j++)
.... A[sub(i,j,m,n)] ....
where for interoperability we store the elements in column-major fashion.
B.10.1.2 Strings
Programming languages differ widely in how they handle strings.
• In C, a string is an array of characters; the end of the string is indicated by a null character, that is the
ascii character zero, which has an all zero bit pattern. This is called null termination.
• In Fortran, a string is an array of characters. The length is maintained in an internal variable, which is
passed as a hidden parameter to subroutines.
• In Pascal, a string is an array with an integer denoting the length in the first position. Since only one byte
is used for this, strings can not be longer than 255 characters in Pascal.
As you can see, passing strings between different languages is fraught with peril. The safest solution is to use
null-terminated strings throughout; some compilers support extensions to facilitate this, for instance writing
DATA forstring /’This is a null-terminated string.’C/
Recently, the ‘C/Fortran interoperability standard’ has provided a systematic solution to this.
10. With a bit of cleverness and the right compiler, you can have a program that says print *,7 and prints 8 because of this.
The most common case of language interoperability is between C and Fortran. The problems are platform depen-
dent, but commonly
• The Fortran compiler attaches a trailing underscore to function names in the object file.
• The C compiler takes the function name as it is in the source.
Since C is a popular language to write libraries in, this means we can solve the problem by either
• Appending an underscore to all C function names; or
• Including a simple wrapper call:
int SomeCFunction(int i,float f)
{
.....
}
int SomeCFunction_(int i,float f)
{
return SomeCFunction(i,f);
}
With the latest Fortran standard it is possible to declare the external name of variables and routines:
%% cat t.f
module operator
real, bind(C) :: x
contains
subroutine s() bind(C,name='_s')
return
end subroutine
...
end module
%% ifort -c t.f
%% nm t.o
.... T _s
.... C _x
It is also possible to declare data types to be C-compatible:
type, bind(C) :: c_comp
real (c_float) :: data
integer (c_int) :: i
type (c_ptr) :: ptr
end type
B.10.1.5 Input/output
Both languages have their own system for handling input/output, and it is not really possible to meet in the middle.
Basically, if Fortran routines do I/O, the main program has to be in Fortran. Consequently, it is best to isolate I/O as much as possible.
Appendix C
Class project
C.1 Heat equation
C.1.1 Software
Write your software using the PETSc library (see tutorial B.8). In particular, use the MatMult routine for matrix-
vector multiplication and KSPSolve for linear system solution. Exception: code the Euler methods yourself.
Be sure to use a Makefile for building your project (tutorial B.5).
Add your source files, Makefile, and job scripts to an svn repository (tutorial B.6); do not add binaries or output
files. Make sure the repository is accessible by the istc00 account (which is in the same Unix group as your
account) and that there is a README file with instructions on how to build and run your code.
Implement a checkpoint/restart facility by writing vector data, size of the time step, and other necessary items, to
an hdf5 file (tutorial B.7). Your program should be able to read this file and resume execution.
C.1.2 Tests
Do the following tests on a single core.
Method stability
Run your program with
q = sin(ℓπx)
T0(x) = e^x
Ta(t) = Tb(t) = 0
α = 1
Take a space discretization h = 10−2 ; try various time steps and show that the explicit method can diverge. What
is the maximum time step for which it is stable?
For the implicit method, at first use a direct method to solve the system. This corresponds to PETSc options
KSPPREONLY and PCLU (see section 5.5.10 and the PETSc tutorial, B.8).
Now use an iterative method (for instance KSPCG and PCJACOBI); is the method still stable? Explore using a low
convergence tolerance and large time steps.
Since the forcing function q and the boundary conditions have no time dependence, the solution u(·, t) will converge
to a steady state solution u∞ (x) as t → ∞. What is the influence of the time step on the speed with which implicit
method converges to this steady state?
Run these tests with various values for ℓ.
Timing
If you run your code with the commandline option -log_summary, you will get a table of timings of the various
PETSc routines. Use that to do the following timing experiments.
• Construct your coefficient matrix as a dense matrix, rather than sparse. Report on the difference in total
memory, and the runtime and flop rate of doing one time step. Do this for both the explicit and implicit
method and explain the results.
• With a sparse coefficient matrix, report on the timing of a single time step. Discuss the respective flop
counts and the resulting performance.
Restart
Implement a restart facility: every 10 iterations write out the values of the current iterate, together with values of
∆x, ∆t, and ℓ. Add a flag -restart to your program that causes it to read the restart file and resume execution,
reading all parameters from the restart file.
Run your program for 25 iterations, and restart, causing it to run again from iteration 20. Check that the values in
iterations 20 . . . 25 match.
C.1.3 Parallelism
Do the following tests, running your code with up to 16 cores.
• At first test the explicit method, which should be perfectly parallel. Report on actual speedup attained.
Try larger and smaller problem sizes and report on the influence of the problem size.
• The above settings for the implicit method (KSPPREONLY and PCLU) lead to a runtime error. Instead,
let the system be solved by an iterative method. Read the PETSc manual and web pages to find out some
choices of iterative method and preconditioner and try them. Report on their efficacy.
C.1.4 Reporting
Write your report using LATEX (tutorial B.2). Use both tables and graphs to report numerical results. Use gnuplot
(tutorial B.9) or a related utility for graphs.
Appendix D
Codes
This section contains several simple codes that illustrate various issues relating to the performance of a single CPU.
The explanations can be found in section 1.5.
#include "papi_test.h"
#define PCHECK(e) \
if (e!=PAPI_OK) \
{printf("Problem in papi call, line %d\n",__LINE__); return 1;}
#define NEVENTS 3
#define NRUNS 200
#define L1WORDS 8096
#define L2WORDS 100000
return 0;
}
D.3 Cachelines
This code illustrates the need for small strides in vector code. The main loop operates on a vector, progressing by a
constant stride. As the stride increases, runtime will increase, since the number of cachelines transferred increases,
and the bandwidth is the dominant cost of the computation.
There are some subtleties to this code: in order to prevent accidental reuse of data in cache, the computation is
preceded by a loop that accesses at least twice as much data as will fit in cache. As a result, the array is guaranteed
not to be in cache.
/*
* File: line.c
* Author: Victor Eijkhout <eijkhout@tacc.utexas.edu>
*
* Usage: line
*/
#include "papi_test.h"
extern int TESTS_QUIET; /* Declared in test_utils.c */
#define PCHECK(e) \
if (e!=PAPI_OK) \
{printf("Problem in papi call, line %d\n",__LINE__); return 1;}
#define NEVENTS 4
values[3],(1.*L1WORDS)/values[3]);
printf("L1 accesses:\t%d\naccesses per operation:\t%9.5f\n",
values[2],(1.*L1WORDS)/values[2]);
printf("\n");
}
free(array);
return 0;
}
Note that figure 1.6 in section 1.5.3 only plots up to stride 8, while the code computes to 16. In fact, at stride 12 the
prefetch behaviour of the Opteron changes, leading to peculiarities in the timing, as shown in figure D.1.
Figure D.1: cache line utilization and total kcycles as a function of the stride.
#include "papi_test.h"
extern int TESTS_QUIET; /* Declared in test_utils.c */
#if defined(SHIFT)
array = (double*) malloc(13*(MAXN+8)*sizeof(double));
#else
array = (double*) malloc(13*MAXN*sizeof(double));
#endif
}
}
free(array);
return 0;
}
D.5 TLB
This code illustrates the behaviour of a TLB; see sections 1.2.7 and 1.5.4 for a thorough explanation. A two-
dimensional array is declared in column-major ordering (Fortran style). This means that striding through the data
by varying the i coordinate will have a high likelihood of TLB hits, since all elements on a page are accessed
consecutively. The number of TLB entries accessed equals the number of elements divided by the page size. Striding
through the array by the j coordinate will have each next element hitting a new page, so TLB misses will ensue
when the number of columns is larger than the number of TLB entries.
/*
 * File: tlb.c
 */
#include "papi_test.h"
extern int TESTS_QUIET; /* Declared in test_utils.c */
double *array;
#define COL 1
#define ROW 2
int main(int argc, char **argv)
{
int events[NEVENTS] = {PAPI_TLB_DM,PAPI_TOT_CYC}; long_long values[NEVENTS];
int retval,order=COL;
PAPI_event_info_t info, info1;
const PAPI_hw_info_t *hwinfo = NULL;
int event_code;
const PAPI_substrate_info_t *s = NULL;
retval = PAPI_library_init(PAPI_VER_CURRENT);
if (retval != PAPI_VER_CURRENT)
test_fail(__FILE__, __LINE__, "PAPI_library_init", retval);
{
int i;
for (i=0; i<NEVENTS; i++) {
retval = PAPI_query_event(events[i]); PCHECK(retval);
}
}
#define M 1000
#define N 2000
{
int m,n;
m = M;
array = (double*) malloc(M*N*sizeof(double));
for (n=10; n<N; n+=10) {
if (order==COL)
clear_right(m,n);
else
clear_wrong(m,n);
retval = PAPI_start_counters(events,NEVENTS); PCHECK(retval);
if (order==COL)
do_operation_right(m,n);
else
do_operation_wrong(m,n);
retval = PAPI_stop_counters(values,NEVENTS); PCHECK(retval);
printf("m,n=%d,%d\n#elements:\t%d\nTot cycles: %d\nTLB misses:\t%d\nmisses per column:\t%
m,n,m*n,values[1],values[0],values[0]/(1.*n));
}
free(array);
}
return 0;
}