Applied Parallel Computing-Honest
Applied Parallel Computing-Honest
Applied Parallel Computing-Honest
PARALLEL COMPUTING
7767hc_9789814307604_tp.indd 2
26/7/12 12:08 PM
9in x 6in
b1350-fm
APPLIED
PARALLEL COMPUTING
Yuefan Deng
Stony Brook University, USA
World Scientific
NEW JERSEY
LONDON
7767hc_9789814307604_tp.indd 1
SINGAPORE
BEIJING
SHANGHAI
HONG KONG
TA I P E I
CHENNAI
26/7/12 12:08 PM
Published by
World Scientific Publishing Co. Pte. Ltd.
5 Toh Tuck Link, Singapore 596224
USA office: 27 Warren Street, Suite 401-402, Hackensack, NJ 07601
UK office: 57 Shelton Street, Covent Garden, London WC2H 9HE
For photocopying of material in this volume, please pay a copying fee through the Copyright
Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, USA. In this case permission to
photocopy is not required from the publisher.
ISBN 978-981-4307-60-4
Printed in Singapore.
7/26/2012, 3:40 PM
9in x 6in
b1350-fm
PREFACE
9in x 6in
b1350-fm
9in x 6in
b1350-fm
CONTENTS
Preface
Chapter
1.1.
1.2.
1.3.
1.4.
1. Introduction
Denition of Parallel Computing
Evolution of Computers . . . . .
An Enabling Technology . . . .
Cost Eectiveness . . . . . . . .
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
1
1
4
8
9
Chapter
2.1.
2.2.
2.3.
2.4.
2.5.
2.6.
2.7.
2.8.
Models
. . . . . .
. . . . . .
. . . . . .
. . . . . .
. . . . . .
. . . . . .
. . . . . .
. . . . . .
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
13
13
14
15
15
16
17
18
18
Chapter
3.1.
3.2.
3.3.
3.4.
3.5.
3.6.
3.7.
3. Hardware Systems
Node Architectures . . . . . . .
Network Interconnections . . . .
Instruction and Data Streams .
Processor-Memory Connectivity
IO Subsystems . . . . . . . . . .
System Convergence . . . . . . .
Design Considerations . . . . . .
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
19
19
21
28
29
29
31
31
Chapter
4.1.
4.2.
4.3.
4.4.
4. Software Systems
Node Software . . . . .
Programming Models .
Parallel Debuggers . . .
Parallel Prolers . . . .
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
35
35
37
43
43
.
.
.
.
.
.
.
.
vii
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
viii
9in x 6in
b1350-fm
Chapter
5.1.
5.2.
5.3.
5. Design of Algorithms
Algorithm Models . . . . . . . . . . . . . . . . . . . . . .
Examples of Collective Operations . . . . . . . . . . . . .
Mapping Tasks to Processors . . . . . . . . . . . . . . . .
45
46
54
56
Chapter
6.1.
6.2.
6.3.
6. Linear Algebra
Problem Decomposition . . . . . . . . . . . . . . . . . . .
Matrix Operations . . . . . . . . . . . . . . . . . . . . . .
Solution of Linear Systems . . . . . . . . . . . . . . . . .
65
65
68
81
89
89
92
Chapter
8.1.
8.2.
8.3.
8.4.
8.5.
8.6.
8. Fourier Transforms
Fourier Transforms . . . . . .
Discrete Fourier Transforms .
Fast Fourier Transforms . . . .
Simple Parallelization . . . . .
The Transpose Method . . . .
Complexity Analysis for FFT .
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
105
105
106
107
111
112
113
Chapter 9. Optimization
115
9.1. Monte Carlo Methods . . . . . . . . . . . . . . . . . . . . 116
9.2. Parallelization . . . . . . . . . . . . . . . . . . . . . . . . 119
Chapter 10. Applications
10.1. Newtons Equation and Molecular Dynamics . . .
10.2. Schr
odingers Equations and Quantum Mechanics
10.3. Partition Function, DFT and Material Science . .
10.4. Maxwells Equations and Electrical Engineering .
10.5. Diusion Equation and Mechanical Engineering .
10.6. Navier-Stokes Equation and CFD . . . . . . . . .
10.7. Other Applications . . . . . . . . . . . . . . . . .
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
123
124
133
134
135
135
136
136
Appendix A. MPI
A.1. An MPI Primer . . . . . . . . .
A.2. Examples of Using MPI . . . . .
A.3. MPI Tools . . . . . . . . . . . .
A.4. Complete List of MPI Functions
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
139
139
159
161
167
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
9in x 6in
b1350-fm
ix
Contents
Appendix B. OpenMP
B.1. Introduction to OpenMP . .
B.2. Memory Model of OpenMP .
B.3. OpenMP Directives . . . . .
B.4. Synchronization . . . . . . .
B.5. Runtime Library Routines .
B.6. Examples of Using OpenMP
B.7. The Future . . . . . . . . . .
Appendix C.
Project C.1.
Project C.2.
Project C.3.
Project C.4.
Project C.5.
Project C.6.
Project C.7.
Project C.8.
Project C.9.
Project C.10.
Project C.11.
Project C.12.
Project
Project
Project
Project
Project
Project
Project
Project
Project
Project
Project
Project
Project
Project
C.13.
C.14.
C.15.
C.16.
C.17.
C.18.
C.19.
C.20.
C.21.
C.22.
C.23.
C.24.
C.25.
C.26.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
Projects
Watts and Flops of Supercomputers . . . . .
Review of Supercomputers . . . . . . . . . .
Top500 and BlueGene Supercomputers . . .
Say Hello in Order . . . . . . . . . . . . . . .
Broadcast on Torus . . . . . . . . . . . . . .
Competing with MPI on Broadcast,
Scatter, etc . . . . . . . . . . . . . . . . . . .
Simple Matrix Multiplication . . . . . . . . .
Matrix Multiplication on 4D Torus . . . . .
Matrix Multiplication and PAT . . . . . . .
Matrix Inversion . . . . . . . . . . . . . . . .
Simple Analysis of an iBT Network . . . . .
Compute Eigenvalues of Adjacency Matrices
of Networks . . . . . . . . . . . . . . . . . . .
Mapping Wave Equation to Torus . . . . . .
Load Balance in 3D Mesh . . . . . . . . . . .
Wave Equation and PAT . . . . . . . . . . .
Computing Coulombs Forces . . . . . . . . .
Timing Model for MD . . . . . . . . . . . . .
Minimizing Lennard-Jones Potential . . . . .
Install and Prole CP2K . . . . . . . . . . .
Install and Prole CPMD . . . . . . . . . . .
Install and Prole NAMD . . . . . . . . . .
FFT on Beowulf . . . . . . . . . . . . . . . .
FFT on BlueGene/Q . . . . . . . . . . . . .
Word Analysis . . . . . . . . . . . . . . . . .
Cost Estimate of a 0.1 Pops System . . . .
Design of a Pops System . . . . . . . . . .
.
.
.
.
.
.
.
171
171
172
172
174
175
178
180
.
.
.
.
.
.
.
.
.
.
181
181
181
181
182
183
.
.
.
.
.
.
.
.
.
.
.
.
183
183
183
184
184
185
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
185
185
186
186
187
187
188
188
189
190
190
191
191
191
191
.
.
.
.
.
.
.
9in x 6in
b1350-fm
203
Index
205
9in x 6in
b1350-ch01
CHAPTER 1
INTRODUCTION
1 http://www.nitrd.gov/pubs/bluebooks/1995/section.5.html.
9in x 6in
b1350-ch01
109
108
AMD K8
Pentium 4
curve shows transistor
count doubling every
two years
107
106
Barton
Atom
AMD K7
AMD K6-III
AMD K6
Pentium III
Pentium II
AMD K5
Pentium
80486
80386
80286
105
68000
104
80186
8088
8086
8085
6800
6809
8080
Z80
8008
4004
MOS 6502
RCA 1802
9in x 6in
b1350-ch01
Introduction
10000
100
Core 2 Core i7
Pentium II Pentium 4
Sandy Bridge
1000
Pentium
i486
i386
1
1000
100
10
Fig. 1.2.
9in x 6in
b1350-ch01
Floating-point
operations per second
100
1 Flop
1Kops
1Mops
1Gops
1Tops
1Pops
=1
103 = 1 Thousand
106 = 1 Million
109 = 1 Billion
1012 = 1 Trillion
1015 = 1 Quadrillion
1Eops
1Zops
1Yops
1018 = 1 Quintillion
1021 = 1 Sextillion
1024 = Septillion
Table 1.2.
Representative computer
A fast human
VPX 220 (Rank #250 in 1993); A laptop in 2010
ASCI Red (Rank #1 in 1997)
IBM Roadrunner (Rank #1 in 2008); Cray XT5
(Rank #2 in 2008);
1/8 Fujitsu K Computer (Rank #1 in 2011)
Expected in 2018
Time scales for solving medium sized and grand challenge problems.
Computer
Sub-Petaop
Tereop
1,000 Nodes Beowulf
Cluster
High-end Workstation
PC with 2 GHz Pentium
Moderate
problems
Grand challenge
problems
N/A
O(1) Hours
O(1) Seconds
O(1) Minutes
O(10) Hours
O(1) Weeks
O(1) Hours
O(1) Days
O(10) Years
O(100) Years
Applications
Protein folding, QCD,
and Turbulence
Weather
2D CFD, Simple
designs
9in x 6in
b1350-ch01
Introduction
Table 1.3.
Rmax
(Tops)
Vendor
Year
Fujitsu
2011 K computer
8,162
NUDT
2010 Tianhe-1A
2,566
Cray
2009 Jaguar
Cray XT5-HE
1,759
Dawning
2010 Nebulae
Dawning
Cluster
1,271
1,192
Cray
1,054
Bull SA
2010 Hopper
Cray XE6
2010 Tera-100
Bull Bullx
10
IBM
2009 Roadrunner
IBM
BladeCenter
1,042
1,110
1,088
1,050
Cores
Site
Country
548,352 RIKEN
Japan
Advanced
Institute for
Computational
Science
186,368 National
China
Supercomputing
Center in
Tianjin
224,162 DOE/SC/Oak
USA
Ridge
National
Laboratory
120,640 National
China
Supercomputing
Centre in
Shenzhen
73,278 GSIC Center,
Japan
Tokyo
Institute of
Technology
142,272 DOE/NNSA/
USA
LANL/SNL
111,104 NASA/Ames
USA
Research
Center/NAS
153,408 DOE/SC/LBNL/ USA
NERSC
138,368 Commissariat a France
lEnergie
Atomique
122,400 DOE/NNSA/
USA
LANL
9in x 6in
b1350-ch01
9in x 6in
Introduction
Fig. 1.4.
Fig. 1.5.
b1350-ch01
user space where a single user will control many processors. Figure 1.5 shows
the 2011s supercomputers LINPACK and power eciencies.
It is apparent from these gures that parallel computing is of enormous
value to scientists and engineers who need computing power.
9in x 6in
b1350-ch01
9in x 6in
b1350-ch01
Introduction
10
9in x 6in
b1350-ch01
1961
1984
1997
2000
2005
2007
2011
2011
1012
$1.1
$1.5 107
$3.0 104
$3.0 103
$1.0 103
$4.8 101
$1.8 100
$1.2 100
Representative technology
IBM 1620 (costing $64K)
Cray Y-MP
A Beowulf Cluster
Workstation
PC Laptop
Microwulf Cluster
HPU4Science Cluster
K Computer power cost
at one billion oating-point operations per second. During the era when no
single computing platform was able to achieve one Gops, this table lists the
total cost for multiple instances of a fast computing platform whose speed
sums to one Gops. Otherwise, the least expensive computing platform able
to achieve one Gops is listed.
3 http://www.green500.org
9in x 6in
b1350-ch01
11
Introduction
Table 1.5.
Computer
IBM BlueGene/Q
Bullx B500 Cluster
PowerEdge 1850
Mops/W
1
250
500
2097.19
169.15
21.36
$1M
$12M
$100M
9in x 6in
b1350-ch01
9in x 6in
b1350-ch02
CHAPTER 2
PERFORMANCE METRICS AND MODELS
14
9in x 6in
b1350-ch02
Fig. 2.1.
T (1, N )
.
T (P, N )
(2.1)
9in x 6in
b1350-ch02
15
P0 T (P0 , N )
.
T (P, N )
(2.2)
Most of the time, it is easy to speedup large problems than small ones.
2.3. Parallel Eciency
Parallel eciency is dened as:
E(P, N ) =
T (1, N )
S(P, N )
=
.
T (P, N )P
P
(2.3)
16
9in x 6in
b1350-ch02
Fig. 2.2.
Remarks
I(P, N ) is the average time wasted by each processor due to load
imbalance.
If Ti = Tmax for every i, then I(P, N ) = 0, resulting in a complete load
balancing.
The slowest processor Tmax can mess up the entire team. This observation shows that slave-master scheme is usually very inecient because
of the load imbalance issue due to slow master processor.
2.5. Granularity
The size of the sub-domains allocated to the processors is called the
granularity of the decomposition. Here is a list of remarks:
Granularity is usually determined by the problem size N and computer
size P .
Decreasing granularity usually increases communication and decreases
load imbalance.
9in x 6in
b1350-ch02
17
P
1.
S(P, N )
(2.7)
18
9in x 6in
b1350-ch02
2.7. Scalability
First, we dene two terms: Scalable algorithm and quasi-scalable algorithm.
A scalable algorithm is dened as those whose parallel eciency E(P, N )
remains bounded from below, i.e. E(P, N ) E0 > 0, when the number of
processors P at xed problem size.
More specically, those which can keep the eciency constant when the
problem size N is kept constant are called strong scalable, and those which
can keep the eciency constant only when N increases along with P are
called weak scalable.
A quasi-scalable algorithm is dened as those whose parallel eciency
E(P, N ) remains bounded from below, i.e. E(P, N ) E0 > 0, when the
number of processors Pmin < P < Pmax at xed problem size. The interval
Pmin < P < Pmax is called scaling zone.
Very often, at xed problem size N = N (P ), the parallel eciency
decreases monotonically as the number of processors increase. This means
that for suciently large number of processors the parallel eciency tends
to vanish. On the other hand, if we x the number of processors, the
parallel eciency usually decreases as the problem size decreases. Thus,
very few algorithms (aside from the embarrassingly parallel algorithm)
are scalable, while many are quasi-scalable. Two major tasks in designing
parallel algorithms are to maximize E0 and the scaling zone.
2.8. Amdahls Law
Suppose a fraction f of an algorithm for a problem of size N on P processors
is inherently serial and the remainder is perfectly parallel, then assume:
T (1, N ) = .
(2.8)
T (P, N ) = f + (1 f )/P.
(2.9)
Thus,
Therefore,
S(P, N ) =
1
.
f + (1 f )/P
(2.10)
9in x 6in
b1350-ch03
CHAPTER 3
HARDWARE SYSTEMS
A serial computer with one CPU and one chunk of memory while ignoring
the details of its possible memory hierarchy and some peripherals, needs
only two parameters/properties to describe itself: Its CPU speed and its
memory size.
On the other hand, ve or more properties are required to characterize
a parallel computer:
(1)
(2)
(3)
(4)
(5)
20
9in x 6in
b1350-ch03
Processor family
Power
NEC
Sparc
Intel IA-64
Intel EM64T
AMD x86 64
Intel Core
Totals
Count
Share %
45
1
2
5
380
66
1
500
9.00%
0.20%
0.40%
1.00%
76.00%
13.20%
0.20%
100%
6,274,131
122,400
8,272,600
269,498
31,597,252
12,351,314
42,830
58,930,025
and NEC processor families. AMD and Intel do not compete in the end-user
HPC system market. Thus, it should be expected that IBM, Cray and NEC
would continue to control the system designs based on their own processor
architectures, while the eorts of the other competitors will be based on
processors provided by Intel and AMD.
Currently both companies, Intel and AMD, are revamping their product
lines and are transitioning their server oerings to the quad-core processor
designs. AMD introduced its Barcelona core on 65 nm manufacturing
process as a competitor to the Intel Core architecture that should be
able to oer a comparable to Core 2 instruction per clock performance,
however the launch has been plagued by delays caused by the diculties
in manufacturing sucient number of higher clocked units and emerging
operational issues requiring additional bug xing that so far have resulted
in sub-par performance. At the same time, Intel enhanced their product
line with the Penryn core refresh on a 45 nm process featuring ramped up
the clock speed, optimized execution subunits, additional SSE4 instructions,
while keeping the power consumption down within previously dened TDP
limits of 50 W for the energy-ecient, 80 W for the standard and 130 W
for the high-end parts. According to the roadmaps published by both
companies, the parts available in 2008 consisting up to four cores on
the same processor, die with peak performance per core in the range of
815 Gops on a power budget of 1530 W and 1632 Gops on a power
budget of 5068 W. Due to the superior manufacturing capabilities, Intel
is expected to maintain its performance per watt advantage with top
Penryn parts clocked at 3.5 GHz or above, while the AMD Barcelona
parts in the same power envelope was not expected to exceed 2.5 GHz
clock speed until the second half of 2008 at best. The features of the
three processor-families that power the top supercomputers in 211 are
9in x 6in
b1350-ch03
21
Hardware Systems
Table 3.2.
Parameter
Core Count
Highest Clock Speed
L2 Cache
L3 Cache
Memory Bandwidth
Manufacturing Technology
Thermal Design Power Bins
Peak Floating Point Rate
Fujitsu
SPARC64 VIIIfx1
8
2 GHz
5 MB
N/A
64 GB/s
45 nm
58 W
128 Gops
205 Gops
AMD opteron
6100 series
8
2.3 Ghz (est)
4 MB
12 MB
42.7 GB/s
45 nm
115 W
150 Gops
given in Table 3.2 with the data collected from respective companies
websites.
22
9in x 6in
b1350-ch03
3.2.1. Topology
Here are some of the topologies that are currently in common use:
(1)
(2)
(3)
(4)
(5)
(6)
Fig. 3.1.
9in x 6in
b1350-ch03
23
Hardware Systems
Fig. 3.2.
Switch 7
Switch 5
Switch 6
Switch 1
Switch 2
Switch 3
4
Fig. 3.3.
Switch 4
24
9in x 6in
b1350-ch03
8
Fig. 3.4.
Table 3.3.
An 8 8 Network topology.
Advantages
Disadvantages
Fat tree
O-the-shelf
Good remote communication
Economical
Fault sensitive
Hard to scale up
High latency
9in x 6in
b1350-ch03
25
Hardware Systems
Table 3.4.
Rank
System
Speed
(Tops)
Processor
family
1
2
3
4
5
6
7
8
9
10
K Computer
Tianhe-1A
Cray XT5
Dawning
HP ProLiant
Cray XE6
SGI Altix ICE
Cray XE6
Bull Bullx
IBM
BladeCenter
8162
2566
1759
1271
1192
1110
1088
1054
1050
1042
SPARC64
Intel EM64T
AMD x86 64
Intel EM64T
Intel EM64T
AMD x86 64
Intel EM64T
AMD x86 64
AMD x86 64
AMD x86 64
Co-processor
Interconnect
family
Interconnect
topology
N/A
nVidia GPU
N/A
nVidia Tesla
nVidia GPU
N/A
N/A
N/A
N/A
IBM Power
X Cell
Tofu
Proprietary
SeaStar
Inniband
Inniband
Gemini
Inniband
Gemini
Inniband
Inniband
6D Torus
Fat Tree
3D Torus
Fat Tree
Fat Tree
3D Torus
Fat Tree
3D Torus
Fat Tree
Fat Tree
26
9in x 6in
b1350-ch03
MPU networks
The Micro Processor Unit (MPU) network is a combination of two
k-dimensional rectangular meshes of equal size, which are oset by 1/2
of a hop along each dimension to surround each vertex from one mesh
with a cube of 2k neighbors from the other mesh. Connecting vertices in
one mesh diagonally to their immediate neighbors in the other mesh and
removing original rectangle mesh connections produces the MPU network.
Figure 3.5 illustrates the generation of 2D MPU network by combining
two 2D meshes. Constructing MPU network of three or higher dimension is
similar. To complete wrap-around connections for boundary nodes, we apply
the cyclic property of a symmetric topology in such a way that a boundary
node encapsulates in its virtual multi-dimensional cube and connects to all
vertices of that cube.
In order to see the advantages of MPU topology, we compare MPU
topologies with torus in terms of key performance metrics as network
diameter, bisection width/bandwidth and average distance. Table 3.5 lists
the comparison of those under same dimension.
2D mesh
Fig. 3.5.
Table 3.5.
2D MPU interconnect
Network of characters
MPU (nk )
Torus (nk )
Dimensionality
Number of nodes
Node degree
Network diameter
Bisection width
Bisection bandwidth
Number of wires
k
2nk
2k
n
2k nk1
pnk1
2k nk
k
nk
2k
nk/2
2nk1
pnk1 k 1
knk
1
2
2k1 k 1
2k 1
2k1
k
2k k 1
9in x 6in
b1350-ch03
27
Hardware Systems
MSRT networks
A 3D Modied Shift Recursive Torus (MSRT) network is a 3D hierarchical
network consisting of massive nodes that are connected by a basis 3D torus
network and a 2D expansion network. It is constructed by adding bypass
links to a torus. Each node in 3D MSRT has eight links to other nodes,
of which six are connected to nearest neighbors in a 3D torus and two are
connected to bypass neighbors in the 2D expansion network. The MSRT
achieves shorter network diameter and higher bisection without increasing
the node degree or wiring complexity.
To understand the MSRT topology, let us rst start from 1D MSRT
bypass rings. A 1D MSRT bypass ring originates from a 1D SRT ring by
eliminating every other bypass link. In, 1D MSRT (L = 2; l1 = 2, l2 = 4) is
a truncated 1D SRT (L = 2; l1 = 2, l2 = 4). L = 2 is the maximum node
level. This means that two types of bypass links exist, i.e. l1 and l2 links.
Then, l1 = 2 and l2 = 4 indicates the short and long bypass links spanning
over 2l1 = 4 and 2l2 = 16 hops respectively. Figure 3.6 shows the similarity
and dierence of SRT and MSRT. We extend 1D MSRT bypass rings to 3D
MSRT networks. To maintain a suitable node degree, we add two types of
bypass links in x-axis and y-axis and then form a 2D expansion network in
xy-plane.
To study the advantages of MSRT networks, we compared the MSRT
with other networks performance metric in Table 3.6. For calculating the
average distances of 3D MSRT, we rst select a l1 -level bypass node as the
reference node and then a l2 -level bypass node so two results are present in
the left and right columns respectively.
62 60
58
56
54
10
52
12
50
14
48
16
46
18
44
20
22
42
40
24
Fig. 3.6.
26
28
30 32
34
36
38
Level-0 node
Level-1 node
Level-2 node
28
b1350-ch03
Topology
3D
4D
6D
3D
9in x 6in
Bypass
networks
Dimensions
Bisection
width
Node Diameter (1024)
degree
(hop)
links)
Torus
Torus
Torus
SRT
N/A
32 32 16
N/A
16 16 8 8
N/A
884444
L = 1;
32 32 16
l1 = 4
128 128
2D SRT
lmax = 6
3D MSRT L = 2;
32 32 16
l1 = 2,
l2 = 4
6
8
12
10
40
24
16
24
1
2
4
9
8
8
13
16
1.625
2.5
Average
distance
(hop)
20.001
12.001
8.000
12.938
8.722
9.239 9.413
Fig. 3.7.
9in x 6in
Hardware Systems
Fig. 3.8.
b1350-ch03
29
Distributed-memory.
Shared-memory.
Shared-distributed-memory.
Distributedshared-memory.
3.5. IO Subsystems
High-performance input-output subsystem is the second most essential
component of a supercomputer and its importance grows even more
30
9in x 6in
b1350-ch03
Bus/Switch
1
(a)
Fig. 3.9.
(b)
4
1
2
Fig. 3.10.
A shared-distributed-memory conguration.
apparent with the latest introduction of Web 2.0 services such as video
stream, e-commerce, and web serving.
Two popular les systems are Lustre and IBMs GPFS.
Lustre is a massively parallel distributed le system, generally used
for large scale cluster computing and it provides a high performance le
system for clusters of tens of thousands of nodes with petabytes of storage
Hardware Systems
9in x 6in
b1350-ch03
31
capacity. It serves computer clusters ranging from small workgroup to largescale, multi-site clusters. More than half of worlds top supercomputers use
Lustre le systems.
Lustre le systems can support tens of thousands of client systems,
tens of petabytes (PBs) of storage and hundreds of gigabytes per second
(GB/s) of I/O throughput. Due to Lustres high scalability, internet service
providers, nancial institutions, and the oil and gas industry deploy Lustre
le systems in their data centers.
IBMs GPFS is a high-performance shared-disk clustered le system,
adopted by some of the worlds top supercomputers such as ASC purple
supercomputer which is composed of more than 12,000 processors and has
2 petabytes of total disk storage spanning more than 11,000 disks.
GPFS provides tools for management and administration of the GPFS
cluster and allows for shared access to le systems from remote GPFS
clusters.
3.6. System Convergence
Generally, at least three parameters are necessary to quantify the quality
of a parallel system. They are:
(1) Single-node performance.
(2) Inter-node communication bandwidth.
(3) I/O rate.
The higher these parameters are, the better would be the parallel
system. On the other hand, it is the proper balance of these three parameters
that guarantees a cost-eective system. For example, a narrow internode width will slow down communication and make many applications
unscalable. A low I/O rate will keep nodes waiting for data and slow overall
performance. A slow node will make the overall system slow.
32
9in x 6in
b1350-ch03
Hardware Systems
9in x 6in
b1350-ch03
33
9in x 6in
b1350-ch03
9in x 6in
b1350-ch04
CHAPTER 4
SOFTWARE SYSTEMS
36
9in x 6in
b1350-ch04
Fig. 4.1.
Linux is a predominantly node OS, and one of the most successful free
and open source software packages. Linux, with its many dierent variations,
is installed on a wide variety of hardware devices including mobile phones,
tablet computers, routers and video game consoles, laptop and desktop
computers, mainframes, and most relevantly the nodes of parallel computers
and it runs on the nodes of most, if not all, Top500 supercomputers.
A distribution intended to run on supercomputer nodes usually omits
many functions such as graphical environment from the standard release
and instead include other software to enable communications among
such nodes.
Software Systems
9in x 6in
b1350-ch04
37
38
9in x 6in
b1350-ch04
9in x 6in
Software Systems
Fig. 4.2.
Fig. 4.3.
b1350-ch04
39
40
9in x 6in
b1350-ch04
Fig. 4.4.
Fig. 4.5.
9in x 6in
Software Systems
b1350-ch04
41
42
9in x 6in
b1350-ch04
Synchronous:
A request is sent to receiver and, when acknowledged, pushes the message
to the receiver.
When messages arrive, both sender and receiver move on.
Ready:
Sender knows, before sending, that the matching receiver has already been
posted to receive.
4.2.2. Shared-memory
Shared-memory computer (Fig. 4.6) is another large category of parallel
computers. As the name indicates, all or parts of the memory space
are shared among all the processors. That is, the memory can be
simultaneously accessed directly by multiple processors. Under this
situation, communications among processors are carried out, implicitly, by
accessing the same memory space from dierent processors, at dierent
times. This scheme permits the developer an easier task to write parallel
program, whereby at least the data sharing is made much more convenient.
Serious problems such as system scalability restricted by limited data
sharing are still unsolved. As we will discuss later, multiple techniques were
introduced to mitigate such challenges.
Although all the memory can be shared among processors without extra
diculty, in practical implementation, memory space on a single node has
Private
Private
T
Shared
Memory
T
Private
Fig. 4.6.
Private
Software Systems
9in x 6in
b1350-ch04
43
always been partitioned into shared and private part, at least logically, in
order to provide exibility in program development. Shared data is accessible
by all but private data can only be accessed by the one who owns it.
4.3. Parallel Debuggers
A parallel debugger is no dierent from a serial debugger. dbx is one popular
example of serial debugger. The dbx is a utility for source-level debugging
and execution of programs written in C and FORTRAN.
Most parallel debuggers are built around dbx with additional functions
for handling parallelism, for example, to report variable addresses and
contents in dierent processors. There are many variations to add
convenience by using graphic interface.
A typical debugger, interactive parallel debugger (IPD), is a complete
symbolic, source-level debugger for parallel programs. Beyond the standard
operations that facilitate the debugging of serial programs, IPD oers
custom features that facilitate debugging parallel programs. IPD lets you
debug parallel programs written in C, FORTRAN, and assembly language.
IPD consists of a set of debugging commands, for which help is available
from within IPD.
44
9in x 6in
b1350-ch04
clicks, although for some features, keyboard input is also supported. The
execution of ParaGraph is event driven, including both user-generated X
Window events and trace events in the data le produced by PICL. Thus,
ParaGraph provides a dynamic depiction of the parallel program while also
providing responsive interaction with the user. Menu selections determine
the execution behavior of ParaGraph both statically and dynamically.
As a further aid to the user, ParaGraph preprocesses the input trace
le to determine relevant parameters automatically before the graphical
simulation begins.
9in x 6in
b1350-ch05
CHAPTER 5
DESIGN OF ALGORITHMS
Quality algorithms, in general, must be (1) accurate, (2) ecient, (3) stable,
(4) portable, and (5) maintainable.
For parallel algorithms, the quality metric ecient must be expanded
to include high parallel eciency and scalability.
To gain high parallel eciency, two factors must be considered:
(1) Communication costs;
(2) Load imbalance costs.
For scalability, there are two distinct types: Strong scaling and weak
scaling. A strong scaling algorithm is such that it is scalable for solving a
xed total problem size with a varying number of processors. Conversely, a
weak scaling algorithm is such that it is scalable for solving a xed problem
size per processor with a varying number of processors.
Minimizing these two costs simultaneously is essential but dicult for
optimal algorithm design. For many applications, these two factors compete
to waste cycles and they have many dependencies. For example, one parameter that one can always adjust for performance granularity. Usually,
granularity depends on the number of processors, and the smaller the
granularity, the smaller load imbalance and the bigger the communication.
More specically, ecient parallel algorithms must satisfy many or all
of the following conditions:
(1)
(2)
(3)
(4)
46
9in x 6in
b1350-ch05
Master-slave.
Domain decomposition.
Control decomposition.
Data parallel.
Single program multiple data. (SPMD)
Virtual-shared-memory model.
Of course, other parallel programming models are also possible. For a large
application package, it is common to utilize two or more programming
models.
5.1.1. Master-slave
A master-slave model is the simplest parallel programming paradigm with
an exception of the embarrassingly parallel model. In the master-slave
model, a master node controls the operations of the slave nodes in the
system. This model can be applied to many types of applications and is
also popular among the initial practitioners of parallel computing. But
it has serious drawbacks for sophisticated applications in which a single
master or even a series of strategically distributed masters are unable
to handle the demands of the slaves smoothly to minimize their idling.
Thus, this model is usually, and should be, avoided for serious and larger
scale applications.
Figure 5.1 illustrates the Master-slave model.
Design of Algorithms
9in x 6in
b1350-ch05
47
Fig. 5.1. A sketch of the Master-slave model in which communications occur only
between master and individual slaves while communications never occur among slaves.
48
9in x 6in
b1350-ch05
Fig. 5.2. A sketch of domain decomposition model in which the computational domain
is decomposed to sub-domains (which are then assigned to computing resources). This
model is popular for solving problems dened on obvious physical spaces, e.g., many
PDEs. It is almost the standard approach for decomposing problems with locality.
Fig. 5.3. In this model, the computational domain is decomposed into multiple subdomains each of which is assigned to a process. This model is useful for implementing
SIMD and MIMD, and it is popular for solving PDEs. This domain decomposition
approach is particularly eective for solving problems with locality.
5.1.4. Virtual-shared-memory
Shared-memory model has the greatest advantage of accessing data without
explicit message passing because the entire (global) data space is shared
among participating processors. It has a serious drawback in achieving
scalability for larger systems with 16 processors or a similar magic upper
limit due to the greatest inability to share data harmoniously, among others.
Wise attempts came to the rescue: Virtual shared memory model in which
data are shared virtually by aid of software. To the applications developers
9in x 6in
Design of Algorithms
b1350-ch05
49
50
9in x 6in
b1350-ch05
Fig. 5.4. In virtual-shared-memory model, each process owns an entire memory space
virtually. In reality, each process only owns a portion of the entire memory space while
the operating system virtualizes the space and facilities data transfers of data when
necessary.
that are clean may not be ecient. Which is more important? There is no
universal rule. It is problem-dependent. The desirable programs must be
both clean and ecient.
The following are the main steps involved in designing a parallel
algorithm:
(1) Decomposing a problem in similar sub-problems.
(2) Scheduling the decomposed sub-problems on participating processors.
(3) Communicating the necessary information between the processors so
that each processor has sucient information for progressing.
9in x 6in
Design of Algorithms
b1350-ch05
51
Fig. 5.5. This gure illustrates the relationship between programmer time and computer
time, and how it is related to the dierent ways of handling communication.
(4) Collecting results from each processor to form the nal result to the
original problem.
There are key tasks for designing an ecient parallel algorithm:
(1) Construction of serial algorithm for each processing unit.
(2) Handlings of message passing among processing units.
(3) Balancing loads among processing units.
A parallel-ecient algorithm should always have minimal overheads
of load imbalance and communication. These two factors usually tend to
corrupt, making algorithms inecient; to minimize them individually and
simultaneously is the highest design objective that can be costly, for some
applications, to achieve. For some problems with intrinsic local interactions
such as the hyperbolic equations and short-ranged molecular dynamics, to
gain minimal communication, one must minimize the aspect ratio of the
sub-domains allocated to each processor (assuming more than one subdomain can be given to each processor), which can happen only when each
processor has one very round-shaped sub-domain. This choice of sub-domain
size, however, increases the variance in loads on all processors and creates
extreme imbalance of loads. On the other hand, if the sub-domains are made
small in size, load balance may be achieved easily, but communication soars.
Conventional wisdom tells us to choose a sub-domain with medium size.
Minimizing the overhead due to load imbalance and communication
by choosing a proper granularity is the main goal in designing parallel
algorithms. A good algorithm must be robust, which includes reliability,
exibility, and versatility.
52
9in x 6in
b1350-ch05
Embarrassingly parallel:
Little communication.
Little load imbalance.
Synchronous parallel:
Load nearly balanced.
Known communication
Patterns.
Asynchronous:
Load not balanced.
Unpredictable communication
Patterns.
Fig. 5.6.
Design of Algorithms
9in x 6in
b1350-ch05
53
54
9in x 6in
b1350-ch05
(5.1)
Fig. 5.7.
Broadcast model 1.
(5.2)
9in x 6in
Design of Algorithms
Fig. 5.8.
Fig. 5.9.
b1350-ch05
55
Broadcast model 2.
56
9in x 6in
b1350-ch05
5.2.3. Allgather
We explain the global summation for a 3D hypercube. Processor Pt contains
a number X1 for i = 1, 2, . . . , 7. We do the summation in three steps.
Step 1: Processors P0 and P1 swap contents and then add them up. So do
the pairs of processors P2 and P3 , P4 and P5 , as well as P6 and P7 .
All these pairs swap and add corresponding data in parallel. At
the end of this parallel process, each processor has its partners
content added to its own, e.g. P0 now has X0 + X1 .
Step 2: P0 exchanges its new content with P2 and performs addition (other
pairs follow this pattern). At the end of this stage, P0 has X0 +
X1 + X2 + X3 .
Step 3: Finally, P0 exchanges its new content with P4 and performs
addition (again, other pairs follow the pattern). At the end of
this stage, P0 has X0 + X1 + X2 + X3 + X4 + X5 + X6 + X7 .
In fact, after log2 8 3 stages, every processor has a copy of the global sum.
In general for a system containing P processors, the total time needed for
global summation is O(log2 P ).
5.3. Mapping Tasks to Processors
Figure 5.11 charts the pathway of mapping an application to the topology
of the computer. Several mapping functions as shown in Fig. 5.10 can be
implanted to realize the following mappings:
(1)
(2)
(3)
(4)
(5)
(6)
Linear Mapping.
2D Mapping.
3D Mapping.
Random Mapping: ying processors all over the communicator.
Overlap Mapping: convenient for communication.
Any combination of the above.
9in x 6in
b1350-ch05
57
Design of Algorithms
Call
MPI
MPI
MPI
MPI
Dims returned
Dims
Dims
Dims
Dims
create(6,2,dims)
create(7,2,dims)
create(6,3,dims)
create(7,3,dims)
3,2
7,1 or 1,7
3,2,1 or 2,3,1 or 1,2,3
no answer
Fig. 5.10.
Application
Topology
Fig. 5.11.
Process
Topology
Processors
Topology
58
9in x 6in
b1350-ch05
1
(0,1)
2
(0,2)
3
(1,0)
4
(1,1)
5
(1,2)
6
(2,0)
7
(2,1)
8
(2,2)
MPI functions.
The functions listed in Fig. 5.12 are useful when embedding a lower
dimensional Cartesian grid into a bigger one, with each sub-grid forming a
sub-communicator.
9in x 6in
b1350-ch05
Design of Algorithms
59
If the comm = 2 3 4
24 processes we have
if remain dimms = (true, false, true)
(8 processes)
new comm = 2 4
All processors in parallel computers are always connected directly
or indirectly. The connectivity among processors measured in latency,
bandwidth and others are non-uniform and they depend on the relative
or even absolute locations of the processor in the supercomputer. Thus,
the way to assign computation modules or subtasks on nodes may impact
the communication costs and the total running. To optimize this part of the
running time, we need to utilize the task mapping.
According to Bokhari in his On the Mapping Problem,2 the mapping
and the mapping problem can be dened as follows:
Suppose a problem made up of several modules that execute in
parallel is to be solved on an incompletely connected array. When
assigning modules to processors, pairs of modules that communicate
with each other should be placed, as far as possible, on processors
that are directly connected. We call the assignment of modules to
processors a mapping and the problem of maximizing the number
of pairs of communicating modules that fall on pairs of directly
connected processors the mapping problem.
The application and the parallel computer are represented as static
graphs GA and GP , respectively. GA = (VA ; EA ) is a graph where VA
represents tasks and EA means communication requests among them with
weights being the communication loads. GP = (VP ; EP ) is a graph where
VP represents processors and EP means the links among processors with
weights being the per unit communication cost.
A mapping problem can then be dened as nding a mapping: VA VP
to minimize the objective function value that associates with each mapping.
60
9in x 6in
b1350-ch05
1000
900
3.8
800
3.6
700
600
3.4
500
3.2
400
3
300
200
2.8
100
2.6
100
200
Fig. 5.13.
300
400
500
600
700
800
900
1000
9in x 6in
Design of Algorithms
Fig. 5.14.
b1350-ch05
61
62
9in x 6in
b1350-ch05
where a(i, j) is the amount of data needed to be exchanged and d((i), (j))
represents the length of shortest path between i and j. Heiss and Dromanns
introduced the Kohonens algorithm to realize this kind of topologyconserving mapping.
Model by Bhanot et al.
In 2005, a model was developed by Bhanot et al. to minimize only inter-task
communication. They neglected the actual computing cost when placing
tasks on processors linked by mesh or torus by the following model:
C(i, j)H(i, j),
(5.5)
min F =
i,j
9in x 6in
Design of Algorithms
b1350-ch05
63
i=1
(5.7)
(5.8)
64
Subject to,
9in x 6in
b1350-ch05
m
xtp = 1,
t = 1, . . . , n
p=1
n
xtp 1,
p = 1, . . . , m
t=1
Ap
xtp
n , p = 1, . . . , m,
At
t=1
(5.9)
Fig. 5.15.
9in x 6in
b1350-ch06
CHAPTER 6
LINEAR ALGEBRA
65
66
9in x 6in
b1350-ch06
M11
M21
.
..
M12
M22
..
.
..
.
Mm1
Mm2
Mmn
M1n
M2n
..
.
(6.1)
(6.2)
.
..
..
..
..
.
.
.
Pp1 Pp2 Ppq
where pq = P .
Generally, there are four methods of decomposition that are common
in practical use:
(1)
(2)
(3)
(4)
Row partition.
Column partition.
Block partition.
Scatter partition.
9in x 6in
Linear Algebra
b1350-ch06
67
where
c11
c12
c21
c22
(6.4)
M11 @1
M @2
21
.
.
.
Mp1 @p
M12 @1 M1q @1
M22 @2 M2q @2
..
..
..
.
.
.
Mp2 @p Mpq @p
(6.5)
22
2q
.
(6.6)
..
..
..
.
.
.
.
.
Mp1 @1 Mp2 @2 Mpq @p
Block Partition: Assigning a continuous block of sub-matrix to a process,
a typical 2D matrix partition.
M @2 M @2 M @k
32
3q
31
(6.7)
M41 @2 M42 @2 M4q @k
..
..
..
..
.
.
.
.
Mp1 @4 Mp2 @4 Mpq @n
68
9in x 6in
b1350-ch06
21
(6.8)
M @3 M @4 M @y
42
4q
41
..
..
..
..
.
.
.
.
Mp1 @1 Mp2 @2 Mpq @n
The properties of these decompositions vary widely. The table below lists
their key features:
Method
Properties
1D decomposition.
Easy to implement.
Relatively high communication costs.
2D decomposition.
More complication to implement.
Lower communication costs.
Block partition: Mismatch with underlying problem
mesh.
Scattered partition: Potential match with
underlying problem mesh.
(6.9)
9in x 6in
b1350-ch06
69
Linear Algebra
A11 @1 A12 @1
A21 @2 A22 @2
.
..
..
.
Ap1 @p Ap2 @p
A1q @1
A2q @2
..
..
.
.
Apq @p
Vector b: The elements of vector b are given to each processor i for each
integer i [1, p]:
b1 @i
b2 @i
b3 @i
.
..
(6.10)
bx @i
We can obtain the results on individual processors for the resulting vector c.
The PAT graph for matrix-vector multiplication Method I is as given in
Fig. 6.1.
Timing Analysis: Each processor now works on a n np matrix with a
vector of length n. The time on each processor is given by the following. On
P processors,
Tcomp(n, p) = cn
Fig. 6.1.
n
p
(6.11)
70
9in x 6in
b1350-ch06
(6.12)
T (n, 1)
P
=
Tcomp (n, p) + Tcomm (n, p)
1 + cp
n
(6.13)
Remarks
(1)
(2)
(3)
(4)
21
..
..
..
..
.
.
.
.
Ap1 @p Ap2 @p
Apq @p
b1 @1
b2 @2
b3 @3
.
.
.
(6.14)
bx @p
Step 1: Each processor multiplies its diagonal element of A with its element
of b.
@1:
A11 b1
@2:
..
.
A22 b2
..
.
@p: App bp
9in x 6in
Linear Algebra
b1350-ch06
71
b2 @1
b3 @2
b4 @3
.
..
(6.15)
b1 @p
Then, multiply the next o-diagonal elements with the local vector elements:
@1: A12 b2
@2: A23 b3
..
..
.
.
@p: Ap1 b1
Step 3: Repeat Step 2 until all elements of b have visited all processors.
The PAT graph for matrix-vector multiplication method II is as in
Fig. 6.2.
Timing Analysis: Each processor now works on a n np matrix with a
vector on length n. The time on each processor is given by the following.
On P processors,
n
(6.16)
Tcomp(n, p) = cn
p
The communication time to roll up the elements of vector b and form the
nal vector is,
n
(6.17)
Tcomm (n, p) = dp
p
Fig. 6.2.
72
9in x 6in
b1350-ch06
P
T (n, 1)
=
Tcomp(n, p) + Tcomm (n, p)
1 + cnp
(6.18)
Remarks
(1)
(2)
(3)
(4)
A12 @2 A1q @p
A22 @2 A2q @p
..
..
..
.
.
.
Ap1 @1 Ap2 @2 Apq @p
A11 @1
A21 @1
.
..
A11 b1
A12 b2
..
.
@p: App bp
The results are then lumped to form the element c1 of the vector c.
The PAT graph for matrix-vector multiplication method III is as in
Fig. 6.3.
Step 2: Repeat Step 1 for all rows to form the complete resulting vector c.
Timing Analysis: Each processor now works on a n n/p matrix with a
vector on length n. The time on each processor is given by the following.
9in x 6in
b1350-ch06
73
Linear Algebra
Fig. 6.3.
On P processors,
Tcomp(n, p) = cn
n
p
(6.19)
n
p
(6.20)
T (n, 1)
P
=
Tcomp(n, p) + Tcomm (n, p)
1 + cnp
(6.21)
Remarks
(1)
(2)
(3)
(4)
(6.22)
74
9in x 6in
b1350-ch06
where,
A11
A21
A= .
..
A12
A22
..
.
Am1 Am2
B11 B12
B21 B22
B= .
..
..
.
Bn1 Bn2
..
.
A1n
A2n
..
.
Amn
B1k
B2k
..
..
.
.
Bnk
(6.23)
(6.24)
Fig. 6.5.
9in x 6in
b1350-ch06
75
Linear Algebra
A11 @1
A12 @1 A1n @1
A22 @2 A2n @2
A21 @2
A32 @3 A3n @3
A = A31 @3
..
..
..
..
.
.
.
.
Amx @m
.
.
..
.
..
..
.
.
Am2 @m
Am1 @m
C =
C11
C22
..
.
Cmm
where,
@1:
C11 =
A1i Bi1
i=1
@2:
C22 =
A2i Bi2
i=1
..
.
@m: Cmm =
..
.
n
A1i Bim
i=1
Step 2: Roll up all rows in A by one processor unit to form A then multiply
As corresponding rows with Bs and columns to form the rst o diagonal
76
9in x 6in
b1350-ch06
A2n @1
A3n @2
A4n @3
..
A21 @1
A31 @2
A = A41 @3
.
..
A22 @1
A32 @2
A42 @3
..
.
..
.
A11 @m
A12 @m
A1n @m
C11
C21
C=
C22
..
.
C1m
..
.
Cm,m1
Cmm
where
@1:
@2:
C21 =
C32 =
n
i=1
n
A2i Bi1
A3i Bi2
i=1
..
.
@m: C1m =
..
.
n
A1i Bim
i=1
Step 3: Repeat Step 2 until all rows of A have passed through all processors.
Timing Analysis: On one processor,
Tcomp(n, 1) = cn3 tcomp
(6.25)
(6.26)
9in x 6in
Linear Algebra
b1350-ch06
77
(6.27)
The computation cost for multiplying a matrix of size n/p with a matrix of
size n is:
3
n
cn3 tcomp
(6.28)
tcomp =
Tcomp (n, p) = p2 c
p
p
Thus, the speedup is:
S(n, p) =
P
T (n, 1)
=
Tcomp(n, p) + Tcomm (n, p)
1 + cnp
(6.29)
Remarks
(1) Overhead is proportional to p/n and speedup increases with n/p.
(2) Overhead is proportional to Tcomm /Tcomp and speedup decreases while
Tcomm /Tcomp increases.
(3) The previous two comments are universal for parallel computing.
(4) Memory is parallelized.
Method II: Broadcast, Multiply, Roll (BMR Method)
Step 0: Block partition both A and B.
78
9in x 6in
b1350-ch06
B21 @1
B = B31 @4
B11 @7
A31 @8
A31 @9
B22 @2 B23 @3
B32 @5 B33 @6
B12 @8 B13 @9
9in x 6in
Linear Algebra
Fig. 6.6.
b1350-ch06
79
(6.30)
(6.31)
(6.32)
where m = n/p.
The cost of sub-matrix multiplication is:
TM (n, p) = 2m3 tcomp
(6.33)
(6.34)
80
9in x 6in
b1350-ch06
T (n, 1)
TB (n, p) + TM (n, p) + TR (n, p)
h(n, p) =
1
P 2q
+
np
2 2
tcomm
tcomp
(6.35)
(6.36)
Remarks
(1) Overhead is proportional to p/n and speedup increases with n/p.
(2) Overhead is proportional to Tcomm /Tcomp and speedup increases while
Tcomm /Tcomp decreases.
(3) The above two comments apply commonly to many parallel algorithms.
(4) Memory is parallelized.
Method III: Strassen Method
Matrix multiplication requires O(N 3 ) operations. For example, suppose
we perform AB = C where
a11 a12
b11 b12
A=
, B
a21 a22
b21 b22
How many multiplication operations are required? The answer is 23 = 8.
a11 b11 + a12 b21
a12 b12 + a12 b22
a21 b11 + a22 b21
a21 b12 + a22 b22
c11
c12
c21
c22
(6.37)
(6.38)
Linear Algebra
9in x 6in
b1350-ch06
81
(6.39)
This reduces the number of multiplications from eight to seven. For large
matrices, we can always reduce it to 7/8 and then get the computing
time from N log2 8 = 3N to log2 7 2.8N . Of course, we have many more
additions, namely ve times as many. For large N , however, the cost of the
operations will outweigh the cost of the + operations.
(6.40)
82
9in x 6in
b1350-ch06
u = a
for i = 1 to n - 1
for j = i+ 1 to n
l = u(j,i)/u(i,i)
for k = i to n
u(j,k) = u(j,k) l * u(i,k)
end
end
end
Fig. 6.7.
2 3
n tcomp
3
(6.41)
(6.42)
9in x 6in
b1350-ch06
83
Linear Algebra
Step 2: For every element in row i subtract the value of the corresponding
element in rst row multiplied by A(0,0)
A(i,0) . By doing this, the
elements in the rst column of the matrix become all zero.
Step 3: Repeat Steps 1 and 2 on the right-lower sub matrix until the
system becomes an upper triangle matrix.
(0,0)
(0,1)
(0,2)
(0,3)
(0,4)
(1,1)
(1,2)
(1,3)
(1,4)
(2,1)
(2,2)
(2,3)
(2,4)
(3,1)
(3,2)
(3,3)
(3,4)
(4,1)
(4,2)
(4,3)
(4,4)
In this method, as we can see, every step has one less node involving in
the computation. This means the parallelization is largely imbalanced. The
total computation time can be written as:
Tcomp(n, p) =
n3
tcomp
p
(6.43)
where tcomp is the time for unit operation. Compare with the serial case
(6.42), we know that roughly 1/3 of the computing power is wasted in
waiting.
During the process, the right-hand side can be treated as the last column
of the matrix. Thus, it can be distributed among the processor and thus no
special consideration is needed.
For the back substitution, we can parallelize by the following procedure:
Step 1: Solve the last unsolved variable in the last processor.
Step 2: Broadcast the value of the solved variable.
Step 3: Every processor plugs this variable and updates its corresponding
row.
Gaussian Elimination with Partial Pivoting
Like the serial cases, the algorithm might introduce big s when subtracting
two very close numbers. Partial pivoting is necessary in the practical use.
84
9in x 6in
b1350-ch06
That is, at each step, we nd the row of the sub-matrix that has the largest
rst column elements and calculate based on this row instead of the rst one.
This requires nding the largest elements among processors that adds more
communication in the algorithm. An alternative way to solve this problem
is to use the column partition.
Column Partition
Column partition gives an alternative way of doing 1D partition. In this
method, less data will be transferred in each communication circle. This
gives smaller communication overhead and easier for pivoting. However,
the problem with load imbalance still existed in this method.
Step
Step
Step
Step
Step
0:
1:
2:
3:
4:
This method has same overall computation time but much less communication when pivoting is needed.
Block Partition
We illustrate the idea by a special example 20 20 matrix solved by 16
processors. The idea can be easily generalized to more processors with a
large matrix.
We perform the elimination with the following three steps.
Step 1: Broadcast column 0 to other column processors. Note, at this
9in x 6in
Linear Algebra
b1350-ch06
85
Fig. 6.8.
86
9in x 6in
b1350-ch06
Fig. 6.9. This is an example of parallel gaussian elimination, i.e. a parallel algorithm
used to solve Ax = b. The numbers shown in the matrix are the processor IDs. They
indicate which matrix element is stored in which processor. This example shows a matrix
of size 20 20 to be solved by 16 processors. Obviously, this choice can be easily
generalized.
we may have
a11
a21
A=
a12
a22
a32
a23
a33
a43
,
a34
a44
x1
x2
x=
x3 ,
x4
b1
b2
b=
b3
(6.44)
b4
Classical Techniques
Relaxation
Conjugate-gradient
Minimal-residual
u(k) = T u(k1) + C
(6.45)
where u(0) are initial values and T is the iteration matrix, tailored by the
structure of A.
6.3.3. ADI
The alternating direction implicit (ADI) method was rst used in 1950s by
Peaceman and Rachford for solving parabolic PDEs. Since then, it has been
widely used in many other applications as well.
Linear Algebra
9in x 6in
b1350-ch06
87
Varga has rened and extended the discussions of this method. Its
implementations on dierent parallel computers are discussed in various
books and papers on parallel linear algebra.
We have the following implementations:
I. ADI on a 1D array of processors.
II. ADI on a 2D mesh of processors.
III. ADI on other architectures.
9in x 6in
b1350-ch06
9in x 6in
b1350-ch07
CHAPTER 7
DIFFERENTIAL EQUATIONS
f (x)dx
(7.1)
slice the interval [a, b] into N mesh blocks, each of equal size x = ba
N ,
shown in (7.1). Suppose xi = a + (i 1)x, the integral is approximated
as in (Fig. 7.1):
I
N
f (xi )x
(7.2)
i=1
90
9in x 6in
b1350-ch07
R
Fig. 7.1. I = ab f (x)dx. In the Riemann sum, the totall area A under the curve f (x)
is the approximation of the integral I. With 2 processors j and j + 1, we can compute
A = Aj + Aj+1 completely in parallel. Each processor only needs to know its starting
and ending points, in addition to the integrand, for partial integration. At the end, all
participating processors perform a global summation to add up partial integrals obtained
by individual processors.
Fig. 7.2.
(7.4)
Dierential Equations
9in x 6in
b1350-ch07
91
where x(i) s are random samples from D. According to the central limit
theorem, the error term,
m(Im I) N (0, 2 )
(7.5)
where 2 = var{g(x)}. This means the convergence rate of the Monte Carlo
method is O(m1/2 ), regardless of the dimension of D. This property gives
Monte Carlo method an advantage when integrating in high dimensions.
92
9in x 6in
b1350-ch07
(7.7)
k1
= u(x x, t t). Thus, we get an update scheme:
where ui1
(7.8)
Dierential Equations
9in x 6in
b1350-ch07
93
Fig. 7.3. Decomposition of computational domain into sub-domains for solving wave
equations.
94
9in x 6in
b1350-ch07
Fig. 7.4. PAT of decomposition of computational domain into sub-domains for solving
wave equations.
(7.9)
(7.10)
p
T (1, M )
=
tcomm
T (p, M )
1 + 2p
m tcomp
(7.11)
2p tcomm
M tcomp
(7.12)
T (p, M ) =
Thus, the speedup is:
S(p, M ) =
and the overhead is:
h(p, M ) =
9in x 6in
95
Dierential Equations
tcomm
tcomp .
b1350-ch07
This is a computer-inherent
2D Wave equation
Consider the following wave equation:
utt = c2 (uxx + uyy ),
t>0
(7.13)
with proper BC and IC. It is very easy to solve this equation on a sequential
computer, but on a parallel computer, it is a bit more complex. We perform
a little dierence operation (applying central dierences on both sides):
k1
+ uij
2ukij
uk+1
uki+1,j + uki1,j 2ukij
ukij+1 + uki,j1 2ukij
ij
+
t2
x2
y 2
(7.14)
Fig. 7.5.
(7.15)
96
9in x 6in
b1350-ch07
(7.16)
Proper BC and IC
After applying nite dierences (forward-dierence for the rst-order time
derivative and central-dierence for the second-order spatial derivatives),
we obtain:
t
ut+1
i,j,k + ui,j,k
(7.17)
Assuming x = y = z = t = 1, we get:
t
t
t
t
t
t
ut+1
i,j,k = ui,j,k + (ui+1,j,k + ui1,j,k ) + (ui,j+1,k + ui,j1,k ) + (ui,j,k+1
+ uti,j,k1 ) 6uti,j,k
(7.18)
(7.19)
Proper BC
Applying central dierences on both sides, we get:
ui+1,j + ui1,j 2ui,j
ui,j+1 + ui,j1 2ui,j
+
= f (xi , yj )
x2
y 2
(7.20)
(7.21)
9in x 6in
b1350-ch07
97
Dierential Equations
...
(7.22)
(7.23)
A1 I
0
I A2
0
.
..
.
..
.
..
.
A=
.
.
0
0 AY 1
0
0
I
This is a Y Y matrix with sub-matrix
4 1
1 4
.
..
..
.
.
Ai =
.
.
0
0
0
0
(7.24)
AY
0
..
.
elements
0
0
0
0
..
..
.
.
4 1
1 4
Y Y
XX
(7.25)
98
9in x 6in
b1350-ch07
(7.26)
(7.27)
3,2,2
...
(7.28)
If we dene a new index,
m = (k 1)XY + (j 1)X + i
(7.29)
I A2
A = ..
..
..
.
.
.
0
whose sub-matrix elements are:
k
a1
ak = ..
.
(7.30)
0
0
..
.
AZ
ak2
..
.
..
.
ZZ
..
.
aky
Y Y
9in x 6in
b1350-ch07
99
Dierential Equations
akj =
6
1
..
.
6
..
..
.
.
0
..
.
XX
(7.31)
Proper BC
There are two families of methods in solving Helmholtz equation:
(1) Method of moments, aka, boundary element method.
(2) Method of nite dierences.
In the following, we describe the nite-dierence method:
ui,j+1,k + ui,j1,k 2ui,j,k
ui+1,j,k + ui1,j,k 2ui,j,k
+
x2
y 2
ui,j,k+1 + ui,j,k1 2ui,j,k
+
+ 2 ui,j,k
z 2
(7.32)
(7.33)
100
9in x 6in
b1350-ch07
i i
xi (t = 0) = ai
i = 1, 2, . . . , N
(7.34)
xi (t = 0) = vi ,
The key computing issues for MD are:
(1) MD can be highly nonlinear depending on the force elds. Most of the
computing time is spent on force calculation.
(2) Force matrix denes the force from all pairs of particles or molecules,
f11 f1n
.
..
..
.
F =
.
.
.
fn1
fnn
9in x 6in
b1350-ch07
101
Dierential Equations
angles
k ( 0 ) +
impropers
nonbonded
UreyBradley
Rminij
rij
12
Rminij
rij
dihedrals
ku (u u0 )2
6
+
qi qj
rij
(7.35)
Term-1 accounts for the bond stretches where kb is the bond-force constant
and b b0 is the distance from equilibrium that the atom has moved.
Term-2 accounts for the bond angles where k is the angle-force constant
and 0 is the angle from equilibrium between three bounded atoms.
Term-3 is for the dihedrals (a.k.a. torsion angles) where k is the dihedral
force constant, n is the multiplicity of the function, is the dihedral angle
and is the phase shift.
102
9in x 6in
b1350-ch07
Term-4 accounts for the impropers, that is out of plane bending, where k
is the force constant and 0 is the out of plane angle. The Urey-Bradley
component (cross-term accounting for angle bending using 1, 3 non-bonded
interactions) comprises Term-5 where ku is the respective force constant
and u is the distance between the 1, 3 atoms in the harmonic potential.
Non-bonded interactions between pars of atoms are represented by the
last two terms. By denition, the non-bonded forces are only applied to
atom pairs separated by at least three bonds. The van Der Waals (VDW)
energy is calculated with a standard 126 Lennard-Jones potential and the
electrostatic energy with a Coulombic potential.
The Lennard-Jones potential (Fig. 7.6) is a good approximation of
many dierent types of short-ranged interaction besides modeling the VDW
energy.
12 6
(7.36)
VLJ (r) = 4
r
r
where is the depth of the depth of the potential well, is the
(nite) distance at which the inter-particle potential is zero, and r is the
distance between the particles. These parameters can be tted to reproduce
experimental data or mimic the accurate quantum chemistry calculations.
Fig. 7.6.
1 R. A. Aziz, A highly accurate interatomic potential for argon, J. Chem. Phys. Vol. 99
Issue 6 (1993) 4518.
9in x 6in
Dierential Equations
b1350-ch07
103
The r12 term describes Pauli repulsion at short ranges due to overlapping
electron orbitals and the r6 term describes attraction at long ranges such
as the van der Waals force or dispersion force.
The last term, the long-ranged force in the form of Coulomb potential
between two charged particles,
qi qj
X
i)
(X
(7.37)
F ij =
i X
j |3 i
|X
The total force on particle i by all other particles is:
F i =
n
j=i
qi qj
X
i)
(X
j |3 i
|Xi X
(7.38)
Estimating the forces cost 90% of the total MD computing time. But,
solution of the equation of motion is of particular applied Mathematics
interest as it involves solving the system of ODEs.
First, we may use the second order Verlet method:
a(t) =
(7.39)
where a(t) is the acceleration. So, the next time step can be calculated from
the present and the previous steps by:
x(t + ) = 2x(t) x(t ) + a(t) 2
F (t) 2
m
Second, we may consider the RungeKutta method:
= 2x(t) x(t ) +
(7.40)
y = f (t, y)
1
y(t + h) = y(t) + h(k1 + 2k2 + 2k3 + k4)
6
k1 = f (t, y)
1
1
k2 = f t + h, y + k1
2
2
1
1
k3 = f t + h, y + k2
2
2
(7.41)
k4 = f (t + h, y + k3 )
It turns out this portion of the MD, i.e. actual solution of the system of
ODEs, takes insignicant amount of computing time.
9in x 6in
b1350-ch07
9in x 6in
b1350-ch08
CHAPTER 8
FOURIER TRANSFORMS
105
106
9in x 6in
b1350-ch08
Fig. 8.1.
Fourier transform.
N
1
xj e2
ijk
N
k = 0, 1, 2, . . . , N 1
(8.3)
j=0
(X) =
N
1
j=0
Xj e2
ijk
N
j = 0, 1, 2, . . . , N 1
(8.4)
9in x 6in
b1350-ch08
107
Fourier Transforms
For a d-dimensional array x_{n_1 n_2 \cdots n_d} with N_l points along dimension l \in [1, d], the multidimensional DFT is

X_{k_1 k_2 \cdots k_d} = \sum_{n_1=0}^{N_1-1} e^{-2\pi i k_1 n_1/N_1} \sum_{n_2=0}^{N_2-1} e^{-2\pi i k_2 n_2/N_2} \cdots \sum_{n_d=0}^{N_d-1} e^{-2\pi i k_d n_d/N_d}\; x_{n_1 n_2 \cdots n_d}.   (8.5)
There are many FFT algorithms; the most widely used is the radix-2 Cooley–Tukey algorithm described below.
The radix-2 Cooley–Tukey FFT starts from the DFT

X_k = \sum_{n=0}^{N-1} x_n\, e^{-2\pi i nk/N}   (8.6)

and splits it into sums over the even- and odd-indexed samples:

X_k = \sum_{m=0}^{N/2-1} x_{2m}\, e^{-\frac{2\pi i}{N}(2m)k} + \sum_{m=0}^{N/2-1} x_{2m+1}\, e^{-\frac{2\pi i}{N}(2m+1)k}.   (8.7)

Denoting the two half-length transforms by E_k and O_k, this becomes

X_k = \begin{cases} E_k + e^{-2\pi i k/N} O_k, & k < N/2,\\ E_{k-N/2} - e^{-2\pi i (k-N/2)/N} O_{k-N/2}, & k \ge N/2. \end{cases}   (8.8)
First, we divide the transform into odd and even parts (assuming M is even):

X_k = E_k + \exp\left(-\frac{2\pi i k}{M}\right) O_k

and

X_{k+M/2} = E_k - \exp\left(-\frac{2\pi i k}{M}\right) O_k.
Function Y = FFT(X, n)
  if (n == 1)
    Y = X
  else
    E = FFT({X[0], X[2], ..., X[n-2]}, n/2)
    O = FFT({X[1], X[3], ..., X[n-1]}, n/2)
    for j = 0 to n-1
      Y[j] = E[j mod (n/2)]
             + exp(-2*PI*i*j/n) * O[j mod (n/2)]
    end for
  end if

Fig. 8.2. Pseudocode for the recursive radix-2 FFT.
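For readers who want something compilable, the recursion above can be written in C as the following sketch (added here, not from the original text); it assumes n is a power of two and uses C99 complex arithmetic.

#include <complex.h>
#include <math.h>
#include <stdlib.h>

/* Recursive radix-2 Cooley-Tukey FFT; n must be a power of two.
 * in and out are arrays of n complex samples. */
void fft(int n, const double complex *in, double complex *out)
{
    if (n == 1) { out[0] = in[0]; return; }

    double complex *even = malloc((n / 2) * sizeof *even);
    double complex *odd  = malloc((n / 2) * sizeof *odd);
    double complex *E    = malloc((n / 2) * sizeof *E);
    double complex *O    = malloc((n / 2) * sizeof *O);

    for (int m = 0; m < n / 2; m++) {        /* split into even/odd samples */
        even[m] = in[2 * m];
        odd[m]  = in[2 * m + 1];
    }
    fft(n / 2, even, E);
    fft(n / 2, odd, O);

    const double PI = 3.14159265358979323846;
    for (int k = 0; k < n / 2; k++) {        /* combine with twiddle factors */
        double complex w = cexp(-2.0 * I * PI * k / n);
        out[k]         = E[k] + w * O[k];    /* Eq. (8.9)  */
        out[k + n / 2] = E[k] - w * O[k];    /* Eq. (8.10) */
    }
    free(even); free(odd); free(E); free(O);
}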
General case:
Fig. 8.3.

X_k = E_k + \exp\left(-\frac{2\pi i k}{M}\right) O_k, \qquad k = 0, 1, \ldots, \frac{M}{2} - 1,   (8.9)

X_{k+M/2} = E_k - \exp\left(-\frac{2\pi i k}{M}\right) O_k, \qquad k = 0, 1, \ldots, \frac{M}{2} - 1.   (8.10)
Let T_{pm} denote the time for one complex addition or subtraction and T_{mult} the time for one complex multiplication. Each level of the radix-2 recursion performs M additions/subtractions and M/2 multiplications, so the serial time satisfies

T(M) = 2T\left(\frac{M}{2}\right) + M\,T_{pm} + \frac{M}{2}\,T_{mult}.   (8.15)

Iteratively, we have:

T(M) = \left(T_{pm} + \frac{T_{mult}}{2}\right) M \log M.   (8.16)

Let us define:

T_{+} = 2T_{pm} + T_{mult}.   (8.17)

Thus we obtain the formula for the single-processor time:

T(1, M) = \frac{T_{+}}{2}\, M \log M.   (8.18)
We distinguish two cases.

Case I, M > P: while each processor still holds more than one data point, every level of the recursion halves the local problem and costs \frac{T_{+}}{2}\,\frac{M}{P} per processor, so

T(P, M) = 2T\left(P, \frac{M}{2}\right) + \frac{T_{+}}{2}\,\frac{M}{P},   (8.19)

which unrolls to

T(P, M) = \frac{T_{+}}{2}\,\frac{M}{P} \log\frac{M}{P} + T(P, P).   (8.20)

Case II, M \le P: every combination step involves inter-processor communication. Writing T_c for the per-level compute-plus-communication time,

T(P, M) = T\left(P, \frac{M}{2}\right) + T_c,   (8.21)

so that

T(P, M) = T_c \log M \quad \text{for } M \le P.   (8.22)

Combining the two cases gives the total parallel time

T(P, M) = \frac{M}{2P}\left[T_{+} \log M + (2T_c - T_{+}) \log P\right].   (8.23)

The speedup is therefore

S(P, M) = \frac{T(1, M)}{T(P, M)} = \frac{P}{1 + \dfrac{2T_c - T_{+}}{T_{+}}\,\dfrac{\log P}{\log M}}   (8.24)

and the parallel overhead is

h(P, M) = \frac{2T_c - T_{+}}{T_{+}}\,\frac{\log P}{\log M}.   (8.25)
Remarks
1. The communication-to-computation ratio affects the overhead h(P, M).
2. The ratio P/M also affects the overhead, but much more weakly than in most other cases. For parallel FFT, the overhead is typically very small.
CHAPTER 9
OPTIMIZATION
The most important property of Monte Carlo methods is their ability to locate the global optimum while performing optimization. The objective functions of most scientific or engineering applications usually have multiple local minima (as in Fig. 9.1), and the ability to escape from such local minima is the greatest advantage of Monte Carlo methods.

Fig. 9.1. An objective function with multiple local minima.
9.2. Parallelization
9.2.1. Basics
There are many approaches to parallelizing Monte Carlo methods, and none appears ideal. The correct measure of success is still the shortest time to achieve the results. Other claims of success, such as keeping all processors busy all the time or achieving a zillion flops, are common deceptions in parallel computing, and such deceptions are particularly difficult to detect for Monte Carlo simulations.
The latest approaches to parallelizing Monte Carlo methods involve decomposing the generation of new states, the search space, or the Markov chains. The latter two are related, and none appears able to scale to parallel computers with many processors.
9.2.2. Chain mixing method
With very limited numerical experiments but excellent performance so far, I introduce a new method: the chain mixing method. The following are the basic steps:
Step 0: Start P independent Markov chains, one per processor, by the Metropolis–Hastings algorithm or by simulated annealing;
Step 1: All P processors elongate their individual Markov chains with individual random number sequences by the Metropolis–Hastings algorithm or by simulated annealing;
Step 2: At a given mixing time, suspend chain elongation, evaluate the current states of all P chains, and identify the best state;
Step 3: If the best state fails to satisfy the stop conditions, one or more best states are selected, according to the acceptance probability, and planted on the P processors; go to Step 1;
Step 4: End.
As shown in Fig. 9.2, N processors start N Markov chains with initial states X_1, X_2, \ldots, X_N and iterate until they reach states Y_1, Y_2, \ldots, Y_N, when a mixing evaluation occurs. At this time, all states except one best state Y_2 are terminated. All processors, including the one that generated state Y_2, are given the state Y_2 and use their own random number sequences to elongate the Markov chains until the next mixing, and so on. This approach is efficient only for parallel machines with few processors.
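As an illustration (added here, not part of the original text), one mixing step of the single-best-state variant could be organized with MPI roughly as follows; State, objective() and elongate_chain() are hypothetical placeholders for the user's Metropolis–Hastings machinery.

#include <mpi.h>

typedef struct { double x[16]; } State;      /* assumed chain state            */
double objective(const State *s);             /* assumed objective function     */
void   elongate_chain(State *s, int nsteps);  /* assumed MH/annealing moves     */

void mixing_step(State *s, int nsteps, MPI_Comm comm)
{
    int rank;
    MPI_Comm_rank(comm, &rank);

    elongate_chain(s, nsteps);                /* Step 1: grow the local chain   */

    struct { double val; int rank; } local, best;
    local.val  = objective(s);                /* Step 2: evaluate current state */
    local.rank = rank;
    MPI_Allreduce(&local, &best, 1, MPI_DOUBLE_INT, MPI_MINLOC, comm);

    /* Step 3: everyone adopts the best state and continues from it. */
    MPI_Bcast(s, sizeof(State), MPI_BYTE, best.rank, comm);
}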
An improvement of the above method is presented in Fig. 9.3. The difference is that two or more states are selected to reproduce. We pool the ending states Y_1, Y_2, \ldots, Y_N and compute the average objective function value of these states by

E = \frac{1}{N}\sum_{i=1}^{N} f(Y_i).
Fig. 9.2. Chain mixing with a single best state selected at each mixing.

Fig. 9.3. Chain mixing with two or more states selected to reproduce.
CHAPTER 10
APPLICATIONS
Scientists and engineers in these areas must seize the opportunity to make efficient use of this cost-effective and enabling technology.
Traditional methods such as the finite difference, finite element, Monte Carlo, and particle methods are still the essential ingredients for solving these problems, except that these algorithms must be parallelized. Fluid dynamics, turbulence, differential equations, numerical analysis, global optimization, and numerical linear algebra are a few popular areas of research.
Turbulence: Turbulence in fluid flows affects the stability and control, thermal characteristics, and fuel performance of virtually all aerospace vehicles. Understanding the fundamental physics of turbulence is requisite to reliably modeling flow turbulence for the analysis of realistic vehicle configurations.
Topics include simulation of devices and circuits, VLSI, and artificial intelligence (speech, vision, etc.).
Speech: Speech research is aimed at providing a communications interface
with computers based on spoken language. Automatic speech understanding
by computer is a large modeling and search problem in which billions of
computations are required to evaluate the many possibilities of what a
person might have said within a particular context.
Vision: The challenge is to develop human-level visual capabilities for
computers and robots. Machine vision requires image signal processing,
texture and color modeling, geometric processing and reasoning, and object
modeling. A component vision system will likely involve the integration of
all of these processes with close coupling.
10.1. Newton's Equation and Molecular Dynamics
This is the fundamental equation by Sir Isaac Newton (1642–1727) in classical mechanics, relating the force acting on a body to the change in its momentum over time. It is used in all calculations involving moving bodies in astrophysics, engineering, and molecular systems.

F = ma + \frac{dm}{dt}\,v   (10.1)
F = \begin{pmatrix} f_{11} & f_{12} & \cdots & f_{1N} \\ f_{21} & f_{22} & \cdots & f_{2N} \\ \vdots & \vdots & \ddots & \vdots \\ f_{N1} & f_{N2} & \cdots & f_{NN} \end{pmatrix}   (10.3)
Remarks
Depending on the level of resolution required, one could apply 2nd, 4th, or
6th order ODE solvers.
(2) Apply the resulting package to DNA–protein interaction analysis, thin-film deposition, and semiconductor chip design, as well as materials design and analysis.
(3) Study the microscopic interaction forms (physics) and explore novel macroscopic properties in cases where the interaction forms are known (engineering).
While molecular dynamics is a powerful method for many systems involving dynamics, continuum modeling, which does not treat the system as an assembly of particles, has been playing a major role in simulating fluids, gases, and solids. The major advantage of continuum modeling is the dramatic reduction in computing time enabled by treating the system as a finite number of interacting blocks whose motion is governed, usually, by the basic principles of energy and momentum conservation. Each of these blocks, in effect, contains thousands or tens of thousands or more particles, the basic entity in MD simulations. Thus, continuum modeling can be, depending on the size and nature of the system being simulated, thousands to millions of times faster than MD. However, the spatial and temporal resolution offered by MD is a feature that continuum modeling may never provide. It is important to select the right method for the right problem and, rather commonly, both methods may have to be used for one system for optimal accuracy and efficiency.
Molecular dynamics (MD) is in essence an N-body problem, classical mechanical or quantum mechanical depending on the scale of the problem. MD is widely used for simulating molecular-scale models of matter in fields such as solid state physics, semiconductors, astrophysics, and macro-biology. In classical MD, we consider solving N-body problems governed by Newton's second law, while for quantum MD we solve systems governed by Schrödinger's equation.
Mathematically, MD involves solving a system of non-linear ordinary differential equations (ODEs). An analytical solution of an N-body problem is hard to obtain and, most often, impossible to get. It is also very time-consuming to solve non-linear ODE systems numerically. There are two major steps in solving MD problems: computing the forces exerted on each particle and solving the ODE system. The complexity lies mostly in the computation of the force terms, while the solution of the ODEs takes a negligible (5% of total) amount of time. There exist many efficient and accurate numerical methods for solving the ODEs: for example, Runge–Kutta, leapfrog, and predictor–corrector.
m_i \frac{d^2 x_i}{dt^2} = \sum_{j} f_{ij}(x_i, x_j) + \sum_{j,k} g_{ijk}(x_i, x_j, x_k) + \cdots, \qquad i = 1, 2, \ldots, N,   (10.4)
where m_i is the mass of particle i, x_i is its position, f_{ij}(\cdot,\cdot) is a two-body force, and g_{ijk}(\cdot,\cdot,\cdot) is a three-body force. The boundary conditions and initial conditions are properly given. To keep our study simple, we consider only two-body interactions, so g_{ijk} = 0.
The solution vector X of the system can be written in the following iterative form:

X_{new} = X_{old} + \Delta X\left(X_{old}, \frac{dX_{old}}{dt}, \ldots\right).   (10.5)

The entire problem is reduced to computing

\Delta X\left(X_{old}, \frac{dX_{old}}{dt}, \ldots\right).   (10.6)
In fact, the core of the calculation is that of the force. Typically, solving the above equation costs less than 5% of the total time once the force terms are known.
Method I: Particle-mesh method.
Method II: Multipole method.
Method III: The fast multipole method.
Method IV: Rotation scheme.
Fig. 10.1. The force matrix and force vector. The force matrix element f_{ij} is the force on particle i exerted by particle j. Adding up the elements in row i, we get the total force acting on particle i from all other particles.
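To illustrate the row-sum structure of the force matrix (this sketch is not from the original text), the following C fragment accumulates the total force on each particle from pairwise contributions, using Newton's third law (f_ji = -f_ij) to halve the work; pair_force() is a hypothetical routine returning the force of particle j on particle i.

/* Accumulate total forces F_i = sum_j f_ij for an N-body system.
 * pair_force() is a hypothetical routine computing the force of j on i
 * from positions x (3 doubles per particle). */
void pair_force(const double *xi, const double *xj, double fij[3]);

void total_forces(int N, const double *x, double *F)
{
    for (int i = 0; i < 3 * N; i++) F[i] = 0.0;

    for (int i = 0; i < N; i++) {
        for (int j = i + 1; j < N; j++) {       /* upper triangle of the force matrix */
            double fij[3];
            pair_force(&x[3 * i], &x[3 * j], fij);
            for (int d = 0; d < 3; d++) {
                F[3 * i + d] += fij[d];          /* row-i contribution  */
                F[3 * j + d] -= fij[d];          /* Newton's third law  */
            }
        }
    }
}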
F_i = \sum_{j \neq i} f_{ij}   (10.7)

Fig. 10.2.

The serial time to evaluate all pairwise forces is

T(1, N) = N^2 t_{pair},   (10.8)

and on p processors, with n = N/p particles per processor,

T(p, N) = p\left(n^2 t_{pair} + n\, t_{xchg}\right),   (10.9)

so the speedup is

S(p, N) = \frac{p}{1 + \dfrac{1}{n}\,\dfrac{t_{xchg}}{t_{pair}}}   (10.10)

and the overhead is

h(p, N) = \frac{1}{n}\,\frac{t_{xchg}}{t_{pair}}.   (10.11)
Remarks
The overhead is governed by the number of particles per processor n = N/p and by the ratio t_{xchg}/t_{pair}: the larger n is, or the cheaper an exchange is relative to a pair interaction, the smaller the overhead.
These two methods are very useful; we call them screening methods.
I. Particle decomposition: The particle decomposition method distributes the particles uniformly to all processors, as in long-ranged interactions (with no consideration of the particles' positions). This method corresponds to a row partition of the force matrix.
Step I: Screening.
Step II: Communicate border mesh particles to the relevant processors.
Step III: Update particle positions in all processors.
Step IV: Scatter the new particles (same particles but new positions).
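To make the particle (row) decomposition described above concrete, here is a hedged MPI sketch (added here, not in the original text): each processor owns nlocal = N/P particles and gathers all positions before computing its rows of the force matrix.

/* Particle decomposition sketch: each rank owns nlocal = N/P particles
 * (a block of force-matrix rows) but needs all positions to evaluate
 * long-ranged pair forces, so positions are gathered to every rank. */
#include <mpi.h>

void gather_positions(int nlocal, const double *xlocal, double *xall,
                      MPI_Comm comm)
{
    /* 3 coordinates per particle; every rank contributes the same count. */
    MPI_Allgather(xlocal, 3 * nlocal, MPI_DOUBLE,
                  xall,   3 * nlocal, MPI_DOUBLE, comm);
}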
10.2. Schrödinger's Equations and Quantum Mechanics
The basic equation by Erwin Schrödinger (1887–1961) in quantum mechanics, which describes the evolution of atomic-scale motions, is given by:

-\frac{h^2}{8\pi^2 m}\nabla^2 \Psi(r, t) + V(r, t)\,\Psi(r, t) = \frac{ih}{2\pi}\frac{\partial \Psi(r, t)}{\partial t}.   (10.12)
\nabla \times E = -\frac{\partial B}{\partial t}, \qquad \nabla \cdot D = \rho,   (10.14)

\nabla \times H = \frac{\partial D}{\partial t} + J, \qquad \nabla \cdot B = 0.   (10.15)
10.7.6. Oceanography
Ocean sciences: The objective is to develop a global ocean predictive model incorporating temperature, chemical composition, circulation, and coupling to the atmosphere and other oceanographic features. This ocean model will couple to models of the atmosphere in the effort on global weather and will have specific implications for physical oceanography as well.
APPENDIX A
MPI
Fig. A.1.
Fig. A.2.
Message Passing: This is the method by which data from one processor's memory is copied to the memory of another processor. In distributed-memory systems, data is generally sent as packets of information over a network from one processor to another. A message may consist of one or more packets, and usually includes routing and/or other control information.

Process: A process is a set of executable instructions (a program) which runs on a processor. One or more processes may execute on a processor. In a message passing system, all processes communicate with each other by sending messages, even if they are running on the same processor. For reasons of efficiency, however, message passing systems generally associate only one process per processor.

Message passing library: This usually refers to a collection of routines which are embedded in application code to accomplish send, receive and other message passing operations.

Send/Receive: Message passing involves the transfer of data from one process (send) to another process (receive) and requires the cooperation of both the sending and receiving process. Send operations usually require the sending process to specify the data's location, size, type and the destination. Receive operations should match a corresponding send operation.
Synchronous/Asynchronous: Synchronous send operations will complete only after the acknowledgement that the message was safely received
by the receiving process. Asynchronous send operations may complete
even though the receiving process has not actually received the message.
142
9in x 6in
b1350-appa
Application buffer: The address space that holds the data which is to be sent or received. For example, suppose your program uses a variable called inmsg. The application buffer for inmsg is the program memory location where the value of inmsg resides.

System buffer: System space for storing messages. Depending upon the type of send/receive operation, data in the application buffer may need to be copied to/from system buffer space. This allows communication to be asynchronous.

Blocking communication: A communication routine is blocking if the completion of the call depends on certain events. For sends, the data must be successfully sent or safely copied to system buffer space so that the application buffer that contained the data is available for reuse. For receives, the data must be safely stored in the receive buffer so that it is ready for use.

Non-blocking communication: A communication routine is non-blocking if the call returns without waiting for any communication events to complete (such as the copying of the message from user memory to system memory or the arrival of the message). It is not safe to modify or use the application buffer until the non-blocking send completes; it is the programmer's responsibility to ensure that the application buffer is free for reuse. Non-blocking communications are primarily used to overlap computation with communication to provide gains in performance.

Communicators and Groups: MPI uses objects called communicators and groups to define which collections of processes may communicate with each other. Most MPI routines require us to specify a communicator as an argument. Communicators and groups will be covered in more detail later. For now, simply use MPI_COMM_WORLD whenever a communicator is required; it is the predefined communicator which includes all of our MPI processes.
Rank: Within a communicator, every process has its own unique integer identifier assigned by the system when the process initializes. A rank is sometimes also called a process ID. Ranks are contiguous and begin at zero. The rank is used by the programmer to specify the source and destination of messages, and is often used conditionally by the application to control program execution. For example,
rank = 0, do this
rank = 1, do that
Message attributes
(1) The envelope.
(2) Rank of destination.
(3) Message tag.
    (a) An ID for a particular message, to be matched by both sender and receiver.
    (b) It is like sending multiple gifts to your friend: you need a way to identify them.
    (c) MPI_TAG_UB >= 32767.
    (d) Similar in functionality to using a communicator to group messages.
    (e) comm is safer than tag, but tag is more convenient.
(4) Communicator.
(5) The data.
(6) Initial address of send buffer.
(7) Number of entries to send.
(8) Datatype of each entry, one of:
    (a) MPI_INTEGER
    (b) MPI_REAL
    (c) MPI_DOUBLE_PRECISION
    (d) MPI_COMPLEX
    (e) MPI_LOGICAL
    (f) MPI_BYTE
    (g) MPI_INT
    (h) MPI_CHAR
    (i) MPI_FLOAT
    (j) MPI_DOUBLE

Header file — C: #include "mpi.h"; FORTRAN: include 'mpif.h'.
Format, example, and error code for the two language bindings:

C:        Format:      rc = MPI_Xxxxx(parameter, ...)
          Example:     rc = MPI_Bsend(&buf, count, type, dest, tag, comm)
          Error code:  Returned as rc; MPI_SUCCESS if successful.

FORTRAN:  Format:      CALL MPI_XXXXX(parameter, ..., ierr)
          Example:     CALL MPI_BSEND(buf, count, type, dest, tag, comm, ierr)
          Error code:  Returned as the ierr parameter; MPI_SUCCESS if successful.
Program structure: The general structure is the same in both language families. These four components must be included in the program:
(1) The MPI include file (mpi.h for C, mpif.h for FORTRAN).
(2) Initialization of the MPI environment (MPI_Init).
(3) The actual work, including message passing calls.
(4) Termination of the MPI environment (MPI_Finalize).
MPI_Wtime()
MPI_WTIME()

MPI_Wtick
Returns the resolution in seconds (double precision) of MPI_Wtime.
MPI_Wtick()
MPI_WTICK()

MPI_Finalize
Terminates the MPI execution environment. This function should be the last MPI routine called in every MPI program; no other MPI routines may be called after it.
MPI_Finalize()
MPI_FINALIZE(ierr)
Examples: Figures A.3 and A.4 provide some simple examples of environment management routine calls.
#include"mpi.h"
#include<stdio.h>
int main(int argc, char *argv[]){
int numtasks, rank, rc;
rc = MPI_Init(&argc, &argv);
if (rc != 0){
printf("Error starting MPI program. Terminating. \n");
MPI_Abort(MPI_COMM_WORLD, rc);
}
MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
printf("Number of tasks= %d My rank= %d\n", numtasks, rank);
/*******
MPI_Finalize();
return 0;
}
Fig. A.3. A simple example of MPI environment management routines in C.
Fig. A.4.
Table A.1. MPI data types for C and FORTRAN.

C data types
MPI_CHAR            signed char
MPI_SHORT           signed short int
MPI_INT             signed int
MPI_LONG            signed long int
MPI_UNSIGNED_CHAR   unsigned char
MPI_UNSIGNED_SHORT  unsigned short int
MPI_UNSIGNED        unsigned int
MPI_UNSIGNED_LONG   unsigned long int
MPI_FLOAT           float
MPI_DOUBLE          double
MPI_LONG_DOUBLE     long double
MPI_BYTE            8 binary digits
MPI_PACKED          data packed or unpacked with MPI_Pack()/MPI_Unpack()

FORTRAN data types
MPI_CHARACTER        character(1)
MPI_INTEGER          integer
MPI_REAL             real
MPI_DOUBLE_PRECISION double precision
MPI_COMPLEX          complex
MPI_LOGICAL          logical
MPI_BYTE             8 binary digits
MPI_PACKED           data packed or unpacked with MPI_Pack()/MPI_Unpack()

Note that the MPI types MPI_BYTE and MPI_PACKED do not correspond to standard C or FORTRAN types. Table A.1 lists the MPI data types for both C and FORTRAN.
Destination: This is an argument to send routines which indicates the process where a message should be delivered. It is specified as the rank of the receiving process.

Source: For MPI_Recv, there is an argument source corresponding to destination in MPI_Send. This is an argument to receive routines which indicates the originating process of the message. Specified as the rank of the sending process, it may be set to the wild card MPI_ANY_SOURCE to receive a message from any task.

Tag: An arbitrary, non-negative integer assigned by the programmer to uniquely identify a message. Send and receive operations should match message tags. For a receive operation, the wild card MPI_ANY_TAG can be used to receive any message regardless of its tag. The MPI standard guarantees that integers from the range [0, 32767] can be used as tags, but most implementations allow a much larger range.

Communicator: Indicates the communication context, or set of processes for which the source or destination fields are valid. Unless the programmer is explicitly creating new communicators, the predefined communicator MPI_COMM_WORLD is usually used.
MPI_Buffer_attach(buffer,size)
MPI_Buffer_detach(buffer,size)
MPI_BUFFER_ATTACH(buffer,size,ierr)
MPI_BUFFER_DETACH(buffer,size,ierr)
MPI_Rsend
A blocking ready send; it should only be used if the programmer is certain that the matching receive has already been posted.
MPI_Rsend(buf,count,datatype,dest,tag,comm)
MPI_RSEND(buf,count,datatype,dest,tag,comm,ierr)

MPI_Sendrecv
Sends a message and posts a receive before blocking. This will block until the sending application buffer is free for reuse and until the receiving application buffer contains the received message.
MPI_Sendrecv(sendbuf,sendcount,sendtype,dest,sendtag,
    recvbuf,recvcount,recvtype,source,recvtag,comm,status)
MPI_SENDRECV(sendbuf,sendcount,sendtype,dest,sendtag,
    recvbuf,recvcount,recvtype,source,recvtag,comm,status,ierr)

MPI_Probe
Performs a blocking test for a message. The wild cards MPI_ANY_SOURCE and MPI_ANY_TAG may be used to test for a message from any source or with any tag. For the C routine, the actual source and tag will be returned in the status structure as status.MPI_SOURCE and status.MPI_TAG. For the FORTRAN routine, they will be returned in the integer array elements status(MPI_SOURCE) and status(MPI_TAG).
MPI_Probe(source,tag,comm,status)
MPI_PROBE(source,tag,comm,status,ierr)

Examples: Figures A.5 and A.6 provide some simple examples of blocking message passing routine calls.
#include "mpi.h"
#include <stdio.h>
int main(int argc, char *argv[]){
int numtasks, rank, dest, source, rc, tag=1;
char inmsg, outmsg='x';
MPI_Status Stat;
MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
MPI_Comm_rank(MPI_COOM_World, &rank);
if (rank == 0){
dest = 1;
source = 1;
rc = MPI_Send(&outmsg, 1, MPI_CHAR, dest, tag,
MPI_COMM_WORLD);
rc = MPI_Recv(&inmsg, 1, MPI_CHAR, source, tag,
MPI_COMM_WORLD, &Stat);
}
else if (rank == 1){
dest = 0;
source = 1;
rc = MPI_Recv(&inmsg, 1, MPI_CHAR, source, tag,
MPI_COMM_WORLD, &Stat);
rc = MPI_Send(&outmsg, 1, MPI_CHAR, dest, tag,
MPI_COMM_WORLD);
}
MPI_Finalize();
return 0;
}
Fig. A.5. A simple example of blocking message passing in C. Task 0 pings task 1 and
awaits a return ping.
Fig. A.6.
MPI_Isend
This identifies an area in memory to serve as a send buffer. Processing continues immediately without waiting for the message to be copied out of the application buffer. The program should not modify the application buffer until subsequent calls to MPI_Wait or MPI_Test indicate that the non-blocking send has completed.
MPI_Isend(buf,count,datatype,dest,tag,comm,request)
MPI_ISEND(buf,count,datatype,dest,tag,comm,request,ierr)

MPI_Irecv
This identifies an area in memory to serve as a receive buffer. Processing continues immediately without actually waiting for the message to be received and copied into the application buffer. A communication request handle is returned for handling the pending message status. The program must use calls to MPI_Wait or MPI_Test to determine when the non-blocking receive operation completes and the requested message is available in the application buffer.
MPI_Irecv(buf,count,datatype,source,tag,comm,request)
MPI_IRECV(buf,count,datatype,source,tag,comm,request,ierr)
MPI_Issend
Non-blocking synchronous send. Similar to MPI_Isend(), except that MPI_Wait() or MPI_Test() indicates when the destination process has received the message.
MPI_Issend(buf,count,datatype,dest,tag,comm,request)
MPI_ISSEND(buf,count,datatype,dest,tag,comm,request,ierr)

MPI_Ibsend
A non-blocking buffered send that is similar to MPI_Bsend() except that MPI_Wait() or MPI_Test() indicates when the destination process has received the message. Must be used with the MPI_Buffer_attach routine.
MPI_Ibsend(buf,count,datatype,dest,tag,comm,request)
MPI_IBSEND(buf,count,datatype,dest,tag,comm,request,ierr)

MPI_Irsend
Non-blocking ready send. Similar to MPI_Rsend() except that MPI_Wait() or MPI_Test() indicates when the destination process has received the message. This function should only be used if the programmer is certain that the matching receive has already been posted.
MPI_Irsend(buf,count,datatype,dest,tag,comm,request)
MPI_IRSEND(buf,count,datatype,dest,tag,comm,request,ierr)

MPI_Test, MPI_Testany, MPI_Testall, MPI_Testsome
MPI_Test checks the status of a specified non-blocking send or receive operation. The flag parameter is returned as logical true (1) if the operation has completed, and logical false (0) if not. For multiple non-blocking operations, the programmer can specify any, all or some completions.
MPI_Test(request,flag,status)
MPI_Testany(count,array_of_requests,index,flag,status)
MPI_Testall(count,array_of_requests,flag,array_of_statuses)
MPI_Testsome(incount,array_of_requests,outcount,array_of_indices,array_of_statuses)
MPI_Iprobe
Performs a non-blocking test for a message. The wildcards MPI_ANY_SOURCE and MPI_ANY_TAG may be used to test for a message from any source or with any tag. The integer flag parameter is returned as logical true (1) if a message has arrived, and logical false (0) if not. For the C routine, the actual source and tag will be returned in the status structure as status.MPI_SOURCE and status.MPI_TAG. For the FORTRAN routine, they will be returned in the integer array elements status(MPI_SOURCE) and status(MPI_TAG).
MPI_Iprobe(source,tag,comm,flag,status)
MPI_IPROBE(source,tag,comm,flag,status,ierr)

Examples: Figures A.7 and A.8 provide some simple examples of non-blocking message passing routine calls.
A.1.8. Collective communication routines
Collective communication involves all processes in the scope of the communicator. All processes are, by default, members of the communicator MPI_COMM_WORLD.
There are three types of collective operations:
(1) Synchronization: Processes wait until all members of the group have reached the synchronization point.
(2) Data movement: Broadcast, scatter/gather, all-to-all.
(3) Collective computation (reductions): One member of the group collects data from the other members and performs an operation (min, max, add, multiply, etc.) on that data.
#include "mpi.h"
#include <stdio.h>
int main(int argc, char *argv[]){
int numtasks, rank, next, prev, buf[2], tag1=1, tag2=2;
MPI_Request reqs[4];
MPI_Status stats[4];
MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
prev = rank-1;
next = rank+1;
if (rank == 0) prev = numtasks 1;
if (rank == (numtasks 1)) next = 0;
MPI_Irecv(&buf[0], 1, MPI_INT, prev, tag1, MPI_COMM_WORLD,
&reqs[0]);
MPI_Irecv(&buf[1], 1, MPI_INT, next, tag2, MPI_COMM_WORLD,
&reqs[1]);
MPI_Irecv(&rank, 1, MPI_INT, prev, tag2, MPI_COMM_WORLD,
&reqs[2]);
MPI_Irecv(&rank, 1, MPI_INT, prev, tag1, MPI_COMM_WORLD,
&reqs[3]);
MPI_Waitall(4, reqs, stats);
MPI_Finalize();
return 0;
}
Fig. A.7. A simple example of non-blocking message passing in C, this code represents
a nearest neighbor exchange in a ring topology.
Fig. A.8.
MPI_Scatter
Distributes distinct messages from a single source task to each task in the group.
MPI_Scatter(sendbuf,sendcnt,sendtype,recvbuf,recvcnt,recvtype,root,comm)
MPI_SCATTER(sendbuf,sendcnt,sendtype,recvbuf,recvcnt,recvtype,root,comm,ierr)

MPI_Gather
Gathers distinct messages from each task in the group to a single destination task. This routine is the reverse operation of MPI_Scatter.
MPI_Gather(sendbuf,sendcnt,sendtype,recvbuf,recvcount,recvtype,root,comm)
MPI_GATHER(sendbuf,sendcnt,sendtype,recvbuf,recvcount,recvtype,root,comm,ierr)

MPI_Allgather
Concatenation of data to all tasks in a group. Each task in the group, in effect, performs a one-to-all broadcasting operation within the group.
MPI_Allgather(sendbuf,sendcount,sendtype,recvbuf,recvcount,recvtype,comm)
MPI_ALLGATHER(sendbuf,sendcount,sendtype,recvbuf,recvcount,recvtype,comm,ierr)

MPI_Reduce
Applies a reduction operation on all tasks in the group and places the result in one task (see Table A.2 for the predefined MPI reduction operations).
MPI_Reduce(sendbuf,recvbuf,count,datatype,op,root,comm)
MPI_REDUCE(sendbuf,recvbuf,count,datatype,op,root,comm,ierr)

MPI_Allreduce
Applies a reduction operation and places the result in all tasks in the group. This is equivalent to an MPI_Reduce followed by an MPI_Bcast.
MPI_Allreduce(sendbuf,recvbuf,count,datatype,op,comm)
MPI_ALLREDUCE(sendbuf,recvbuf,count,datatype,op,comm,ierr)
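As a usage sketch (added here, not from the original text), the following C fragment computes a global dot product with MPI_Allreduce; every rank ends up with the same value of gsum.

/* Global dot product of distributed vectors x and y (nlocal entries per rank). */
#include <mpi.h>

double parallel_dot(const double *x, const double *y, int nlocal, MPI_Comm comm)
{
    double lsum = 0.0, gsum = 0.0;
    for (int i = 0; i < nlocal; i++)
        lsum += x[i] * y[i];                     /* local partial sum            */
    MPI_Allreduce(&lsum, &gsum, 1, MPI_DOUBLE,   /* combine partial sums and     */
                  MPI_SUM, comm);                /* broadcast the result to all  */
    return gsum;
}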
Table A.2. The predefined MPI reduction operations. Users can also define their own reduction functions by using the MPI_Op_create routine.

Operation     Meaning                  C data types
MPI_MAX       maximum                  integer, float
MPI_MIN       minimum                  integer, float
MPI_SUM       sum                      integer, float
MPI_PROD      product                  integer, float
MPI_LAND      logical AND              integer
MPI_BAND      bit-wise AND             integer, MPI_BYTE
MPI_LOR       logical OR               integer
MPI_BOR       bit-wise OR              integer, MPI_BYTE
MPI_LXOR      logical XOR              integer
MPI_BXOR      bit-wise XOR             integer, MPI_BYTE
MPI_MAXLOC    max value and location   float, double, long double
MPI_MINLOC    min value and location   float, double, long double
#include "mpi.h"
#include <stdio.h>
#define SIZE 4
int main(int argc, char *argv[]){
int numtasks, rank, sendcount, recvcount, source;
float sendbuf[SIZE][SIZE] = {
{ 1.0, 2.0, 3.0, 4.0},
{ 5.0, 6.0, 7.0, 8.0},
{ 9.0, 10.0, 11.0, 12.0},
{13.0, 14.0, 15.0, 16.0} };
float recvbuf[SIZE];
MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
if (numtasks == SIZE) {
source = 1;
sendcount = SIZE;
recvcount = SIZE;
MPI_Scatter(sendbuf, sendcount, MPI_FLOAT, recvbuf,
recvcount, MPI_FLOAT, source, MPI_COMM_WORLD);
printf("rank = %d Results: %f %f %f %f\n", rank,
recvbuf[0], recvbuf[1], recvbuf[2], recvbuf[3]);
}
else
printf("Must specify %d processors, Terminating.\n",
SIZE);
MPI_Finalize();
}
Fig. A.9. A simple example of collective communications in C, this code represents a
scatter operation on the rows of an array.
MPI_Scan
Computes a scan (partial reduction) of data across the tasks in the group.
MPI_Scan(sendbuf,recvbuf,count,datatype,op,comm)
MPI_SCAN(sendbuf,recvbuf,count,datatype,op,comm,ierr)

Examples: Figures A.10 and A.11 provide some simple examples of collective communications.
      program scatter
      include 'mpif.h'

      integer SIZE
      parameter (SIZE=4)
      integer numtasks, rank, sendcount
      integer recvcount, source, ierr
      real*4 sendbuf(SIZE,SIZE), recvbuf(SIZE)

c Fortran stores this array in column major order, so the
c scatter will actually scatter columns, not rows
      ...
      call MPI_SCATTER(sendbuf, sendcount, MPI_REAL,
     &                 recvbuf, recvcount, MPI_REAL,
     &                 source, MPI_COMM_WORLD, ierr)
      ...

Sample output:
rank= 0  Results: ...
rank= 1  Results: ...
rank= 2  Results: ...
rank= 3  Results: ...

Fig. A.11.

Fig. A.12.
A.2.2. Integration
Compiling integral.c Fig. A.13
>mpicc -o integral integral.c
Running integral.c Fig. A.13
>mpirun -np 4 integral
A.3. MPI Tools
Pallas
Pallas is a leading independent software company specializing in high performance computing. Pallas has assisted many organizations in migrating from sequential to parallel computing. Customers of Pallas come from all fields: hardware manufacturers, software vendors, as well as end users. Each of these has benefited from Pallas' unique experience in the development and tuning of parallel applications.
Fig. A.13.
VAMPIR
VAMPIR is currently the most successful MPI tool product (see also
Supercomputer European Watch, July 1997), or check references at
(1) http://www.cs.utk.edu/browne/perftools-review
(2) http://www.tc.cornell.edu/UserDoc/Software/PTools/vampir/
ScaLAPACK
The ScaLAPACK library includes a subset of LAPACK (Linear Algebra
PACKage) routines redesigned for distributed-memory MIMD parallel
computers. It is currently written in SPMD-type using explicit message
passing for interprocessor communication. The goal is to have ScaLAPACK
routines resemble their LAPACK equivalents as much as possible.
PGAPack
PGAPack is a general-purpose, data-structure-neutral, parallel genetic
algorithm library. It is intended to provide most capabilities desired in a
genetic algorithm library, in an integrated, seamless, and portable manner.
ARCH
ARCH is a C++-based object-oriented library of tools for parallel
programming on computers using the MPI (message passing interface)
communication library. Detailed technical information about ARCH is
available as a Cornell Theory Center Technical Report (CTC95TR288).
http://www.tc.cornell.edu/Research/tech.rep.html
OOMPI
OOMPI is an object-oriented interface to the MPI-1 standard. While OOMPI remains faithful to all the MPI-1 functionality, it offers new object-oriented abstractions that promise to expedite the MPI programming process by allowing programmers to take full advantage of C++ features.
http://www.cse.nd.edu/lsc/research/oompi
XMPI: A Run/Debug GUI for MPI
XMPI is an X/Motif based graphical user interface for running and
debugging MPI programs. It is implemented on top of LAM, an MPI cluster
computing environment, but the interface is generally independent of LAM
operating concepts. You write an MPI application in one or more MPI
programs, tell XMPI about these programs and where they are to be run,
164
9in x 6in
b1350-appa
9in x 6in
Appendix A: MPI
b1350-appa
165
166
9in x 6in
b1350-appa
Para++
The Para++ project provides a C++ interface to the MPI and PVM message
passing libraries. Their approach is to overload input and output operators
to do communication. Communication looks like standard C++ I/O.
http://www.loria.fr/para++/parapp.html.
Amelia Vector Template Library
The Amelia Vector Template Library (AVTL) is a polymorphic collection
library for distributed-memory parallel computers. It is based on ideas from
the Standard Template Library (STL) and uses MPI for communication.
ftp://riacs.edu/pub/Excalibur/avtl.html.
Parallel FFTW
FFTW, a high-performance, portable C library for performing FFTs
in one or more dimensions, includes parallel, multi-dimensional FFT
routines for MPI. The transforms operate in-place for arbitrary-size arrays
(of any dimensionality greater than one) distributed across any number of
processors. It is free for non-commercial use.
http://www.fftw.org/.
Canonical Classes for Concurrency Control
The Canonical Classes for Concurrency Control library contains a set of C++ classes that implement a variety of synchronization and data transmission paradigms. It currently supports both Intel's NX and MPI.
http://dino.ph.utexas.edu/furnish/c4.
MPI Cubix
MPI Cubix is an I/O library for MPI applications. The semantics and language binding reflect POSIX in its sequential aspects and MPI in its parallel aspects. The library is built on a few POSIX I/O functions, and each of the POSIX-like Cubix functions translates directly to a POSIX operation on a file system somewhere in the parallel computer. The library is also built on MPI and is therefore portable to any computer that supports both MPI and POSIX.
http://www.osc.edu/Lam/mpi/mpicubix.html.
MPIX Intercommunicator Extensions
The MPIX Intercommunicator Extension library contains a set of extensions to MPI that allow many functions that previously only worked with intracommunicators to be used with intercommunicators.
The MPI routines referenced in this appendix include the following (partial list):

MPI_Abort, MPI_Address, MPI_Allgather, MPI_Allgatherv, MPI_Allreduce, MPI_Alltoall, MPI_Alltoallv, MPI_Attr_delete, MPI_Attr_get, MPI_Attr_put, MPI_Barrier, MPI_Bcast, MPI_Bsend, MPI_Bsend_init, MPI_Buffer_attach, MPI_Buffer_detach, MPI_Errhandler_free, MPI_Errhandler_get, MPI_Errhandler_set, MPI_Error_class, MPI_Error_string, MPI_File_c2f, MPI_File_close, MPI_File_delete, MPI_File_f2c, MPI_File_get_amode, MPI_File_get_atomicity, MPI_File_get_byte_offset, MPI_File_get_errhandler, MPI_File_read_all, MPI_File_read_all_begin, MPI_File_read_all_end, MPI_File_read_at, MPI_File_read_at_all, MPI_File_read_at_all_begin, MPI_File_read_at_all_end, MPI_File_read_ordered, MPI_File_read_ordered_begin, MPI_File_read_ordered_end, MPI_File_read_shared, MPI_File_seek, MPI_File_seek_shared, MPI_File_set_atomicity, MPI_File_set_errhandler, MPI_File_set_info, MPI_Graph_neighbors, MPI_Graph_neighbors_count, MPI_Graphdims_get, MPI_Group_compare, MPI_Group_difference, MPI_Group_excl, MPI_Group_free, MPI_Group_incl, MPI_Group_intersection, MPI_Group_range_excl, MPI_Group_range_incl, MPI_Group_rank, MPI_Group_size, MPI_Initialized, MPI_Int2handle, MPI_Intercomm_create, MPI_Intercomm_merge, MPI_Iprobe, MPI_Irecv, MPI_Irsend, MPI_Isend, MPI_Issend, MPI_Keyval_create, MPI_Keyval_free, MPI_NULL_COPY_FN, MPI_NULL_DELETE_FN, MPI_Op_create, MPI_Op_free, MPI_Pack, MPI_Pack_size, MPI_Testall, MPI_Testany, MPI_Testsome, MPI_Topo_test, MPI_Type_commit, MPI_Type_contiguous, MPI_Type_create_darray, MPI_Type_create_subarray, MPI_Type_extent, MPI_Type_free, MPI_Type_get_contents, MPI_Type_get_envelope, MPI_Type_hvector, MPIO_Wait.
APPENDIX B
OPENMP
Fig. B.1.
if(scalar-expression)
num_threads(integer-expression)
default(shared | none)
private(list)
firstprivate(list)
shared(list)
copyin(list)
reduction(operator:list)
B.3.2. Work-sharing constructs
A work-sharing construct distributes the execution of the associated region among the threads encountering it. However, it does not launch new threads by itself; it should be used within a parallel region or combined with the parallel region constructs.
The directive for the loop construct is as follows:
#pragma omp for [clause[[,] clause] ...]
where the clause is one of the following:
private(list)
firstprivate(list)
lastprivate(list)
reduction(operator:list)
schedule(kind [, chunk_size])
collapse(n)
ordered
nowait
B.3.3. Directive clauses
The private clause declares the variables in its list to be private to each thread. For a private variable, a new object of the same type is declared once for each thread in the team, and all references to the original object are replaced with references to the new object. Hence, variables declared private should be assumed to be uninitialized for each thread.
The firstprivate clause has the same behavior as private, but it automatically initializes the variables in its list to their original values.
The lastprivate clause does what private does and, in addition, copies the variable from the last loop iteration back to the original variable object.
The shared clause declares the variables in its list to be shared among all threads.
The default clause allows the user to specify a default scope for all variables within a parallel region to be either shared or not.
The copyin clause provides a means for assigning the same value to threadprivate variables for all threads in the team.
The reduction clause performs a reduction on the variables that appear in its list. A private copy of each list variable is created for each thread. At the end of the reduction, the reduction operation is applied to all private copies of the shared variable, and the final result is written to the global shared variable.
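For instance (example added here, not in the original), a dot product with the reduction clause can be written as:

#include <omp.h>

/* Dot product using the reduction clause: each thread accumulates a
 * private copy of sum, and the copies are combined with + at the end. */
double dot(const double *x, const double *y, int n)
{
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++)
        sum += x[i] * y[i];
    return sum;
}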
B.4. Synchronization
Before we discuss synchronization, let us consider a simple example where two threads try to read and update the same variable.
x = 0;
#pragma omp parallel shared(x)
{
x = x + 1;
}
One possible interleaving of the two threads is:
Thread 1: read x (value 0)
Thread 2: read x (value 0)
Thread 1: add 1 (value 1)
Thread 2: add 1 (value 1)
Thread 1: write x = 1
Thread 2: write x = 1
Both threads read x before either one writes it back, so one of the two increments is lost; the result of this race depends on the scheduling.
have the same unspecified name. So if there exist two or more independent blocks of critical code in the same parallel region, it is important to give them different names so that these blocks can be executed in parallel.
#pragma omp critical [name]
structured block
atomic
The atomic directive ensures that a specific storage location is updated atomically, rather than exposing it to the possibility of multiple, simultaneous writes (a race condition). The atomic directive applies only to the statement immediately following it, and only the variable being updated is protected by atomic. The difference between atomic and critical is that atomic operations can be executed in parallel when updating different elements, while critical blocks are guaranteed to be serial.
#pragma omp atomic
statement expression
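As a small illustration (added here, not in the original), updating distinct histogram bins is a typical case where atomic is preferable to a single named critical section:

/* Histogram update: atomic protects each bin individually, so threads
 * updating different bins do not serialize against each other. */
void histogram(const int *bins_of, int n, int *hist)
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        #pragma omp atomic
        hist[bins_of[i]] += 1;
    }
}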
barrier
The barrier directive synchronizes all threads in the team. When a
barrier directive is reached, a thread will wait at that point until all other
threads have reached that barrier. All threads then resume executing in
parallel the code that follows the barrier.
#pragma omp barrier
flush
The flush directive identifies a synchronization point at which the implementation must provide a consistent view of memory. Thread-visible variables are written back to memory at this point.
#pragma omp flush (list)
The OpenMP lock routines access a lock variable in such a way that they always read and update the most current value of the lock variable. The lock routines include a flush with no list; the read and update of the lock variable must be implemented as if they are atomic with the flush. Therefore, it is not necessary for an OpenMP program to include explicit flush directives to ensure that the lock variable's value is consistent among different tasks.
omp_init_lock
This subroutine initializes a lock associated with the lock variable.
void omp_init_lock(omp_lock_t *lock).

omp_destroy_lock
This subroutine disassociates the given lock variable from any locks.
void omp_destroy_lock(omp_lock_t *lock).

omp_set_lock
This subroutine forces the executing thread to wait until the specified lock is available. A thread is granted ownership of a lock when it becomes available.
void omp_set_lock(omp_lock_t *lock).

omp_unset_lock
This subroutine releases the lock from the executing thread.
void omp_unset_lock(omp_lock_t *lock).

omp_test_lock
This subroutine attempts to set a lock, but does not block if the lock is unavailable.
int omp_test_lock(omp_lock_t *lock)
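A minimal usage sketch (added here, not from the original text): protecting a shared counter with an explicit lock.

#include <omp.h>
#include <stdio.h>

int main(void)
{
    omp_lock_t lock;
    int counter = 0;

    omp_init_lock(&lock);                 /* associate the lock variable  */
    #pragma omp parallel
    {
        omp_set_lock(&lock);              /* wait until the lock is free  */
        counter++;                        /* protected update             */
        omp_unset_lock(&lock);            /* release the lock             */
    }
    omp_destroy_lock(&lock);              /* disassociate the lock        */

    printf("counter = %d\n", counter);
    return 0;
}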
B.5.3. Timing routines
omp_get_wtime
This routine returns elapsed wall clock time in seconds.
double omp_get_wtime(void).

omp_get_wtick
This routine returns the precision of the timer used by omp_get_wtime.
double omp_get_wtick(void).
B.6. Examples of Using OpenMP
Similar to the MPI appendix, we provide here two simple examples (Figs. B.2 and B.3) of using OpenMP.
APPENDIX C
PROJECTS
Fujitsu K computer.
NUDT Tianhe-1A.
Cray XT5-HE.
Dawning TC3600 Cluster.
HP Cluster Platform 3000SL.
Cray XE6.
SGI Altix ICE.
Cray XE6.
Bull Bullx supernode S6010/S6030.
Roadrunner IBM BladeCenter.
In each of the categories, create a rank order for the computers you select.
Project C.4 Say Hello in Order
When running a "Hello from Processor X" program on P processors, you usually see the output on your screen ordered randomly. It is not in the order you desire, e.g., "Hello from Processor 1", then "Hello from Processor 2", etc.
Write a parallel program to enforce the order, i.e., whenever you run it, you
always get the processors to say hello to you in order. You may test it for
a system with at least 13 processors.
Project C.5 Broadcast on Torus
Broadcast 10 floating-point numbers from any point to all 6000 nodes of a 3D torus of size 10 x 20 x 30. Please design an algorithm to quickly complete the broadcast. Please report the total number of steps N needed to complete the broadcast. Additionally, make a table to show the number of links that are involved in each of the N steps of the broadcast.
Project C.6 Competing with MPI on Broadcast, Scatter, etc.
MPI implementations have a dozen or so collective operation functions such as broadcast, gather, scatter, all-gather, etc. Describe the algorithms for implementing three of these functions: broadcast, scatter, and all-gather.
Implement such functions on your chosen computer using only MPI_Isend() and MPI_Irecv(), a few other supporting (non-data-motion) functions, and other lightweight basic single-sided communication functions. Then, compare the performance of your implementation with that of the functions provided by MPI. You need to vary the size of the messages (the number of floating-point numbers to move) and the size of the computer involved in order to study how the differences depend on the message and machine sizes. For example, you may perform collective operations on N = 10^2, 10^3, 10^4 floating-point numbers with P = 2^2, 2^3, 2^4, 2^5 processing cores.
Project C.7 Simple Matrix Multiplication
Create two N x N square matrices with random elements whose values are in [-1, 1] and compute their product by using P processors. The program should be general for reasonable values of N and P. Plot the speedup curves at N = 1000, 2000, 4000 and P = 1, 2, 4, 8 processors.
Project C.8 Matrix Multiplication on 4D Torus
Design an algorithm to multiply two large square matrices A and B on a supercomputer with a 4D torus network of 15 x 15 x 16 x 9 = 32,400 nodes. You may consider A and B as consisting of 8 x 8 blocks of (180 x 180) elements each, i.e., your matrices are of size 1440 x 1440. Further, we assume the time to perform one floating-point operation (adding or multiplying a pair of floating-point
\frac{\partial^2 u}{\partial t^2} = c\left(\frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2}\right),

u(t = 0) = u_0, \qquad \frac{\partial u}{\partial t}(t = 0) = v_0,

u(x = 0) = u(x = L), \qquad u(y = 0) = u(y = L),

where c, u_0, v_0, L are constants.
Now, the computer consists of 64 processors with the same speed (1 Gflops each), and these processors are placed on a 3D mesh of size 4 x 4 x 4. Each link connecting any two nearest-neighboring processors is capable of 1 Gbps communication bandwidth. Assume we use a finite difference method to solve the problem. We have to do the following:
(1) Design an algorithm to place the 256 x 256 mesh points on the 64 processors with optimal utilization and load balance of the CPUs and communication links.
(2) Compute the expected load imbalance ratio for the CPUs resulting from the method.
(3) Compute the expected load imbalance ratio among all links resulting from the method.
(4) Repeat the above three steps if each one of the top layer of 16 processors is four times faster than each one of the processors on the second and third layers.
Project C.15 Wave Equation and PAT
Please write a parallel program, on a 3D or higher-dimensional mesh-networked parallel computer, to solve the 2D wave equation by the finite difference method. The wave equation is defined on a two-dimensional square domain with 256 mesh points in each dimension, with initial conditions
u(t = 0) = u_0, \qquad \frac{\partial u}{\partial t}(t = 0) = v_0.
framework for different methods such as, e.g., density functional theory (DFT) using a mixed Gaussian and plane waves approach (GPW), and classical pair and many-body potentials. CP2K is freely available under the GPL license. It is written in FORTRAN-95 and can be run efficiently in parallel.
Download the latest source code and compile the program for a parallel computer you can access. Run the program on that computer with 8, 16, and 32 processing cores for 100 time steps. You may use the default parameters or change them to a reasonable set that you can explain. Please build a table to show the wall clock times on all processing cores for their local calculations (accumulated over the 100 time steps). More desirably, profile the program.
Project C.20 Install and Profile CPMD
CPMD is a well-parallelized, plane wave and pseudo-potential implementation of density functional theory, developed especially for ab initio molecular dynamics. It was developed at the IBM Zurich Research Laboratory.
Download the latest source code and compile the program for a parallel computer you can access. Run the program on that computer with 8, 16, and
32 processing cores for 100 time steps. You may use the default parameters or change them to a reasonable set that you can explain. Please build a table to show the wall clock times on all processing cores for their local calculations (accumulated over the 100 time steps). More desirably, profile the program.
Project C.21 Install and Profile NAMD
NAMD is a parallel molecular dynamics code designed for high-performance
simulation of large bio-molecular systems. Based on Charm++ parallel
objects (in 2010), NAMD scales to hundreds of processors on high-end
parallel platforms and tens of processors on commodity clusters using
gigabit Ethernet. One can build NAMD from the source code or download
the binary executable for a wide variety of platforms.
Download the latest source code and compile the program for a parallel
computer you can access. Run the program on that computer with 8, 16, and
32 processing cores for 100 time steps. You may use the default parameters
or change them to a reasonable set that you can explain. Please build a
table to show the wall clock times on all processing cores for their local
calculations (accumulated over the 100 time steps). More desirably, profile the program.
Project C.22 FFT on Beowulf
The Fast Fourier Transform (FFT) is important for many problems in computational science and engineering, and its parallelization is very difficult to scale to large distributed-memory systems.
Suppose you are required to perform FFT efficiently on a Beowulf system with two layers of switches. Hypothetically, the Beowulf system has 480 processors that are divided into 32 groups of 15 each. Each 15-processor group is connected to a Gigabit switch with 16 ports, one of which is connected to the master 10-Gigabit switch that has 32 ports. We also assume the latencies of the Gigabit and 10-Gigabit switches are 50 and 20 microseconds, respectively.
(1) Draw a diagram to illustrate the Beowulf system.
(2) Design an algorithm for FFT on such a Beowulf system.
(3) Estimate the speedup for the algorithm.
APPENDIX D
PROGRAM EXAMPLES
      tag = 0
      root = 0
c     here (not always) for MPI_Gather to work root should be 0
      call MPI_Init(ierr)
      call MPI_Comm_rank(MPI_COMM_WORLD, my_rank, ierr)
      call MPI_Comm_size(MPI_COMM_WORLD, size, ierr)
      if (mod(N2,size) .ne. 0) then
        print *, 'rows are not divisible by processes'
        stop
      end if
      if (my_rank .eq. root) then
        call initialize_M(M, N1, N2)
        call initialize_v(v, N1)
      end if
      send_count = N_rows
      recv_count = N_rows
      if (my_rank .eq. root) recv_count = N2
      call MPI_Gather(prod,
     @                send_count,
     @                MPI_REAL,
     @                prod,
     @                recv_count,
     @                MPI_REAL,
     @                root,
     @                MPI_COMM_WORLD,
     @                ierr)
      if (my_rank .eq. root) call write_prod(prod, N2)
      call MPI_Finalize(ierr)
      end

      subroutine multiply(prod,v,M,N2,N1)
      real M(N2,N1), prod(N2), v(N1)
      do i=1,N2
        prod(i)=0
        do j=1,N1
subroutine multiply(prod,v,M,N2,N1)
real M(N2,N1),prod(N2),v(N1)
do i=1,N2
prod(i)=0
do j=1,N1
prod(i)=prod(i) + M(j,i)*v(j)
end do
end do
return
end
subroutine initialize_M(M,N2,N1)
real M(N2,N1)
do i=1,N2
do j=1,N1
M(j,i) = 1.*i/j
end do
end do
return
end
subroutine initialize_v(v,N1)
real v(N1)
do j=1,N1
v(j) = 1.*j
end do
return
end
      subroutine write_prod(prod,N2)
      real prod(N2)
c     The working directory for all processes except the one the
c     program was started on is your home directory.
      open(unit=1,file='~/LAM/F/prod',status='new')
      do j=1,N2
        write(1,*) j, prod(j)
      end do
      return
      end
c
c  The number of processes must be even, and the total number of
c  points must be exactly divisible by the number of processes.
c
c  This is the MPI version.
c
c  Author: David W. Walker
c  Date:   March 10, 1995
c
program nbody
implicit none
      include 'mpif.h'
integer myrank, ierr, nprocs, npts, nlocal
integer pseudohost, NN, MM, PX, PY, PZ, FX, FY, FZ
real G
parameter (pseudohost = 0)
parameter (NN=10000, G = 1.0)
parameter (MM=0, PX=1, PY=2, PZ=3, FX=4, FY=5, FZ=6)
real dx(0:NN-1), dy(0:NN-1), dz(0:NN-1)
real dist(0:NN-1), sq(0:NN-1)
real fac(0:NN-1), tx(0:NN-1), ty(0:NN-1), tz(0:NN-1)
real p(0:6,0:NN-1), q(0:6,0:NN-1)
integer i, j, k, dest, src
double precision timebegin, timeend
integer status(MPI_STATUS_SIZE)
integer newtype
double precision ran
integer iran
c
c Initialize MPI, find the rank of each process, and the number of processes
c
      call mpi_init (ierr)
      call mpi_comm_rank (MPI_COMM_WORLD, myrank, ierr)
      call mpi_comm_size (MPI_COMM_WORLD, nprocs, ierr)
c
c One process acts as the host and reads in the number of particles
c
      if (myrank .eq. pseudohost) then
        open (4,file='nbody.input')
        if (mod(nprocs,2) .eq. 0) then
          read (4,*) npts
          if (npts .gt. nprocs*NN) then
            print *,'Warning!! Size out of bounds!!'
            npts = -1
          else if (mod(npts,nprocs) .ne. 0) then
            print *,'Number of processes must divide npts'
            npts = -1
          end if
        else
      nlocal = npts/nprocs
c
c The pseudohost initializes the particle data and sends each
c process its particles.
c
if (myrank .eq. pseudohost) then
iran = myrank + 111
do i=0,nlocal-1
p(MM,i) = sngl(ran(iran))
p(PX,i) = sngl(ran(iran))
p(PY,i) = sngl(ran(iran))
p(PZ,i) = sngl(ran(iran))
p(FX,i) = 0.0
p(FY,i) = 0.0
p(FZ,i) = 0.0
end do
do k=0,nprocs-1
if (k .ne. pseudohost) then
do i=0,nlocal-1
q(MM,i) = sngl(ran(iran))
q(PX,i) = sngl(ran(iran))
q(PY,i) = sngl(ran(iran))
q(PZ,i) = sngl(ran(iran))
q(FX,i) = 0.0
q(FY,i) = 0.0
q(FZ,i) = 0.0
end do
            call mpi_send (q, 7*nlocal, MPI_REAL,
     #                     k, 100, MPI_COMM_WORLD, ierr)
          end if
        end do
      else
        call mpi_recv (p, 7*nlocal, MPI_REAL,
     #                 pseudohost, 100, MPI_COMM_WORLD, status, ierr)
c
c Each process interacts with the particles from its nprocs/2-1
c anti-clockwise neighbors. At the end of this loop p(i) in each
c process has accumulated the force from interactions with particles
c i+1, ..., nlocal-1 in its own process, plus all the particles from its
c anti-clockwise neighbors.
c
c
      do k=0,nprocs/2-2
        call mpi_sendrecv_replace (q, 7*nlocal, MPI_REAL, dest, 200,
     #       src, 200, MPI_COMM_WORLD, status, ierr)
do i=0,nlocal-1
do j=0,nlocal-1
dx(i) = p(PX,i) - q(PX,j)
dy(i) = p(PY,i) - q(PY,j)
dz(i) = p(PZ,i) - q(PZ,j)
sq(i) = dx(i)**2+dy(i)**2+dz(i)**2
dist(i) = sqrt(sq(i))
fac(i) = p(MM,i) * q(MM,j) / (dist(i) * sq(i))
tx(i) = fac(i) * dx(i)
ty(i) = fac(i) * dy(i)
tz(i) = fac(i) * dz(i)
p(FX,i) = p(FX,i)-tx(i)
q(FX,j) = q(FX,j)+tx(i)
p(FY,i) = p(FY,i)-ty(i)
q(FY,j) = q(FY,j)+ty(i)
p(FZ,i) = p(FZ,i)-tz(i)
q(FZ,j) = q(FZ,j)+tz(i)
end do
end do
end do
            p(FX,i) = p(FX,i)-tx(i)
            q(FX,j) = q(FX,j)+tx(i)
            p(FY,i) = p(FY,i)-ty(i)
            q(FY,j) = q(FY,j)+ty(i)
            p(FZ,i) = p(FZ,i)-tz(i)
            q(FZ,j) = q(FZ,j)+tz(i)
c
c In half the processes we include the interaction of each particle with
c the corresponding particle in the opposing process.
c
if (myrank .lt. nprocs/2) then
do i=0,nlocal-1
dx(i) = p(PX,i) - q(PX,i)
dy(i) = p(PY,i) - q(PY,i)
dz(i) = p(PZ,i) - q(PZ,i)
sq(i) = dx(i)**2+dy(i)**2+dz(i)**2
dist(i) = sqrt(sq(i))
fac(i) = p(MM,i) * q(MM,i) / (dist(i) * sq(i))
tx(i) = fac(i) * dx(i)
ty(i) = fac(i) * dy(i)
tz(i) = fac(i) * dz(i)
p(FX,i) = p(FX,i)-tx(i)
q(FX,i) = q(FX,i)+tx(i)
p(FY,i) = p(FY,i)-ty(i)
q(FY,i) = q(FY,i)+ty(i)
p(FZ,i) = p(FZ,i)-tz(i)
q(FZ,i) = q(FZ,i)+tz(i)
end do
endif
c
c Now the q array is returned to its home process.
c
dest = mod (nprocs+myrank-nprocs/2, nprocs)
src = mod (myrank+nprocs/2, nprocs)
      call mpi_sendrecv_replace (q, 7*nlocal, MPI_REAL, dest, 400,
     #     src, 400, MPI_COMM_WORLD, status, ierr)
end if
c
c The p and q arrays are summed to give the total force on each particle.
c
do i=0,nlocal-1
p(FX,i) = p(FX,i) + q(FX,i)
p(FY,i) = p(FY,i) + q(FY,i)
p(FZ,i) = p(FZ,i) + q(FZ,i)
end do
c
c Stop clock and write out timings
c
      timeend = mpi_wtime ()
      print *,'Node', myrank, ' Elapsed time: ',
     #        timeend-timebegin, ' seconds'
      stop
 100  format(3e15.6)
      end
D.3. Integration
Program D.3.
program integrate
include "mpif.h"
parameter (pi=3.141592654)
integer rank
INDEX
1D wave equation, 92
2D mapping, 56
3D mapping, 56
bisection bandwidth, 21
BLACS, 165
blocking communication, 142
blocking message passing, 147
broadcast, 156
buffer, 147
Canonical Classes for Concurrency Control, 166
classical molecular dynamics, 126
collective communication, 154
communication, 38
    asynchronous, 38, 141
    blocking, 142
    collective, 154
    cooperative, 139
    interrupt, 40
    non-blocking, 142
    one-sided, 139
    patterns, 41
    point to point, 147
    synchronous, 38, 141
communicator, 142, 148
connectivity, 21
control decomposition, 46
cost effectiveness, 9
efficiency, 15
embarrassingly parallel, 52, 89
envelope, 143
environment management routines, 144
explicit finite difference, 92
extended collective operation, 140
external interface, 140
random mapping, 56
rank, 142
receive, 141
reduction, 155
request, 149
Riemann summation approximation, 89
ring, 22
scalability, 18
scalable, 15
ScaLAPACK, 163
scaling zone, 18
scatter, 155
scheduling, 50
Schrödinger's equations, 133
search space decomposition, 116, 119
send, 141
serial computer, 1
shared-distributed-memory, 29
shared-memory, 29, 42
short-ranged interactions, 126
Simpson's rule, 89
source, 148
sparse linear systems, 85
speedup, 14
SPMD, 46
STAR/MPI, 164
status, 149
super speedup, 15
synchronization, 154
synchronous, 141
system buffer, 142
tag, 148
threadprivate, 172
torus, 22
trapezoid rule, 89
VAMPIR, 163
vector processing, 1
virtual-shared-memory, 46
XMPI, 163