
APPLIED PARALLEL COMPUTING

Yuefan Deng
Stony Brook University, USA

World Scientific
NEW JERSEY · LONDON · SINGAPORE · BEIJING · SHANGHAI · HONG KONG · TAIPEI · CHENNAI

Published by
World Scientific Publishing Co. Pte. Ltd.
5 Toh Tuck Link, Singapore 596224
USA office: 27 Warren Street, Suite 401-402, Hackensack, NJ 07601
UK office: 57 Shelton Street, Covent Garden, London WC2H 9HE

British Library Cataloguing-in-Publication Data


A catalogue record for this book is available from the British Library.

APPLIED PARALLEL COMPUTING


Copyright 2013 by World Scientific Publishing Co. Pte. Ltd.
All rights reserved. This book, or parts thereof, may not be reproduced in any form or by any means,
electronic or mechanical, including photocopying, recording or any information storage and retrieval
system now known or to be invented, without written permission from the Publisher.

For photocopying of material in this volume, please pay a copying fee through the Copyright
Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, USA. In this case permission to
photocopy is not required from the publisher.

ISBN 978-981-4307-60-4

Typeset by Stallion Press


Email: enquiries@stallionpress.com

Printed in Singapore.



PREFACE

This manuscript, Applied Parallel Computing, gathers the core materials from a graduate course of a similar title I have been teaching at Stony Brook University for 23 years, from a summer course I taught at the Hong Kong University of Science and Technology in 1995, and from multiple month-long and week-long training sessions at the following institutions: HKU, CUHK, HK Polytechnic, HKBC, the Institute of Applied Physics and Computational Mathematics in Beijing, Columbia University, Brookhaven National Laboratory, Northrop-Grumman Corporation, METU in Turkey, and KISTI in Korea.
The majority of the attendees are advanced undergraduate and graduate science and engineering students requiring skills in large-scale computing. Students in computer science, economics, and applied mathematics are also common in these classes.
Many participants of the above events contributed to the improvement and completion of the manuscript. My former and current graduate students, J. Braunstein, Y. Chen, B. Fang, Y. Gao, T. Polishchuk, R. Powell, and P. Zhang, have contributed new materials from their theses. Z. Lou, now a graduate student at the University of Chicago, edited most of the manuscript, and I wish to include him as a co-author.
Supercomputing is experiencing extraordinary development and is still evolving. This manuscript will evolve as well.


CONTENTS

Preface

Chapter 1. Introduction
  1.1. Definition of Parallel Computing
  1.2. Evolution of Computers
  1.3. An Enabling Technology
  1.4. Cost Effectiveness

Chapter 2. Performance Metrics and Models
  2.1. Parallel Activity Trace
  2.2. Speedup
  2.3. Parallel Efficiency
  2.4. Load Imbalance
  2.5. Granularity
  2.6. Overhead
  2.7. Scalability
  2.8. Amdahl's Law

Chapter 3. Hardware Systems
  3.1. Node Architectures
  3.2. Network Interconnections
  3.3. Instruction and Data Streams
  3.4. Processor-Memory Connectivity
  3.5. IO Subsystems
  3.6. System Convergence
  3.7. Design Considerations

Chapter 4. Software Systems
  4.1. Node Software
  4.2. Programming Models
  4.3. Parallel Debuggers
  4.4. Parallel Profilers

Chapter 5. Design of Algorithms
  5.1. Algorithm Models
  5.2. Examples of Collective Operations
  5.3. Mapping Tasks to Processors

Chapter 6. Linear Algebra
  6.1. Problem Decomposition
  6.2. Matrix Operations
  6.3. Solution of Linear Systems

Chapter 7. Differential Equations
  7.1. Integration and Differentiation
  7.2. Partial Differential Equations

Chapter 8. Fourier Transforms
  8.1. Fourier Transforms
  8.2. Discrete Fourier Transforms
  8.3. Fast Fourier Transforms
  8.4. Simple Parallelization
  8.5. The Transpose Method
  8.6. Complexity Analysis for FFT

Chapter 9. Optimization
  9.1. Monte Carlo Methods
  9.2. Parallelization

Chapter 10. Applications
  10.1. Newton's Equation and Molecular Dynamics
  10.2. Schrödinger's Equations and Quantum Mechanics
  10.3. Partition Function, DFT and Material Science
  10.4. Maxwell's Equations and Electrical Engineering
  10.5. Diffusion Equation and Mechanical Engineering
  10.6. Navier-Stokes Equation and CFD
  10.7. Other Applications

Appendix A. MPI
  A.1. An MPI Primer
  A.2. Examples of Using MPI
  A.3. MPI Tools
  A.4. Complete List of MPI Functions

Appendix B. OpenMP
  B.1. Introduction to OpenMP
  B.2. Memory Model of OpenMP
  B.3. OpenMP Directives
  B.4. Synchronization
  B.5. Runtime Library Routines
  B.6. Examples of Using OpenMP
  B.7. The Future

Appendix C. Projects
  Project C.1. Watts and Flops of Supercomputers
  Project C.2. Review of Supercomputers
  Project C.3. Top500 and BlueGene Supercomputers
  Project C.4. Say Hello in Order
  Project C.5. Broadcast on Torus
  Project C.6. Competing with MPI on Broadcast, Scatter, etc.
  Project C.7. Simple Matrix Multiplication
  Project C.8. Matrix Multiplication on 4D Torus
  Project C.9. Matrix Multiplication and PAT
  Project C.10. Matrix Inversion
  Project C.11. Simple Analysis of an iBT Network
  Project C.12. Compute Eigenvalues of Adjacency Matrices of Networks
  Project C.13. Mapping Wave Equation to Torus
  Project C.14. Load Balance in 3D Mesh
  Project C.15. Wave Equation and PAT
  Project C.16. Computing Coulomb's Forces
  Project C.17. Timing Model for MD
  Project C.18. Minimizing Lennard-Jones Potential
  Project C.19. Install and Profile CP2K
  Project C.20. Install and Profile CPMD
  Project C.21. Install and Profile NAMD
  Project C.22. FFT on Beowulf
  Project C.23. FFT on BlueGene/Q
  Project C.24. Word Analysis
  Project C.25. Cost Estimate of a 0.1 Pflops System
  Project C.26. Design of a Pflops System

Appendix D. Program Examples
  D.1. Matrix-Vector Multiplication
  D.2. Long Range N-body Force
  D.3. Integration

References

Index


CHAPTER 1
INTRODUCTION

1.1. Definition of Parallel Computing


The US Department of Energy defines parallel computing as:
"simultaneous processing by more than one processing unit on a single application."1
It is the ultimate approach for a large number of large-scale scientific, engineering, and commercial computations.
Serial computing systems have been with us for more than five decades, since John von Neumann introduced digital computing in the 1950s. A serial computer refers to a system with one central processing unit (CPU) and one memory unit, which may be so arranged as to achieve efficient referencing of data in the memory. The overall speed of a serial computer is determined by the execution clock rate of instructions and the bandwidth between the memory and the instruction unit.
To speed up the execution, one would need either to increase the clock rate or to reduce the computer size to lessen the signal travel time. Both are approaching the fundamental limits of physics at an alarming pace. Figure 1.1 shows that we keep packing more and more transistors on a chip, while Fig. 1.2 shows that increasingly creative and prohibitively expensive cooling techniques are required as a result. At this time, two options survive to allow sizable expansion of the bandwidth. First, memory interleaving divides memory into banks that are accessed independently by multiple channels. Second, memory caching divides memory into banks of hierarchical accessibility by the instruction unit, e.g. a small amount of fast memory and a large amount of slow memory. Both of these efforts increase CPU speed, but only marginally, as they run into frequency walls and memory bandwidth walls.

1 http://www.nitrd.gov/pubs/bluebooks/1995/section.5.html.



16-Core SPARC T3
6-Core Core i7
10-Core Xeon Westmere-EX
6-Core Xeon 7400
8-Core POWER7
Dual-Core Itanium 2
Quad-Core z196
Quad-Core Itanium Tukwila
AMD K10
8-Core Xeon Nehalem-EX
POWER6
Itanium 2 with 9MB cache
Six-Core Opteron 2400
AMD K10
Core i7 (Quad)
Core 2 Duo
Itanium 2
Cell

109

108

AMD K8
Pentium 4
curve shows transistor
count doubling every
two years

107

106

Barton

Atom

AMD K7
AMD K6-III
AMD K6
Pentium III
Pentium II
AMD K5
Pentium

80486

80386
80286

105
68000

104

80186

8088
8086
8085
6800
6809
8080
Z80
8008
4004

MOS 6502
RCA 1802

Fig. 1.1. Microprocessor transistor counts 19712011 and Moores Law.


Source: Wgsimon on Wikipedia.

Vector processing is another intermediate step for increasing speed. One central processing unit controls several (typically, around a dozen) vector units that can work simultaneously. The advantage is the simultaneous execution of instructions in these vector units, yielding several factors of speedup over serial processing. However, there exist some difficulties: structuring the application codes to fully utilize the vector units, and increasing the scale of the system, both limit improvement.
Obviously, the difficulty in utilizing a computer, be it serial, vector, or parallel, increases with the complexity of the system. Thus, we can safely list the computer systems according to the level of difficulty of programming them:
(1) Serial computers.
(2) Vector computers with well-developed software systems, especially compilers.
(3) Shared-memory computers whose communications are handled by referencing global variables.
(4) Distributed-memory single-instruction multiple-data computers with data parallel programming models; and

Fig. 1.2. Microprocessors' dissipated heat density (W/cm^2) vs. minimum feature size (nm).

(5) Distributed-memory multiple-instruction multiple-data systems whose communications are carried out explicitly by message passing.
On the other hand, if we consider their raw performance and flexibility, the above computer systems would be rearranged thus: distributed-memory multiple-instruction multiple-data systems, distributed-memory single-instruction multiple-data systems, shared-memory computers, vector computers, and finally the serial computers. Indeed, the "no free lunch" theorem2 applies rigorously here.
Programming a distributed-memory multiple-instruction multiple-data system will certainly burn more human neurons, but one can get many bigger problems solved more rapidly.
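As a minimal sketch of the explicit message passing mentioned in item (5) above, the following C program uses MPI (the library described in Appendix A) to send one number from process 0 to process 1. The value and the message tag are arbitrary illustrations, not taken from the text.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, size;
        MPI_Init(&argc, &argv);                 /* start the MPI runtime        */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* my ID among the processes    */
        MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of processes    */

        double x = 0.0;
        if (rank == 0 && size > 1) {
            x = 3.14159;                        /* data owned by process 0      */
            MPI_Send(&x, 1, MPI_DOUBLE, 1, 99, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&x, 1, MPI_DOUBLE, 0, 99, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("process 1 received %f from process 0\n", x);
        }
        MPI_Finalize();
        return 0;
    }

All data exchange is explicit: nothing is shared, and the sender and receiver must name each other.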
Parallel computing has been enabling many scientific and engineering projects as well as commercial enterprises such as internet services and the recently developed cloud computing technology. It is easy to predict that parallel computing will continue to play essential roles in many aspects of human activity, including entertainment and education, but it is difficult to imagine where parallel computing will lead us.
2 D.H. Wolpert and W.G. Macready, No free lunch theorems for optimization, IEEE
Transactions on Evolutionary Computation 1 (1997) 67.


Thomas J. Watson, Chairman of the Board of International Business Machines, was quoted, or misquoted, for a 1943 statement: "I think there is a world market for maybe five computers." Watson may be truly wrong, as the world market is not one of five computers; rather, the whole world needs only one big parallel computer.
The world's Top500 supercomputers are ranked bi-annually in terms of their LINPACK performance. Table 1.3 shows the Top10 such computers in June 2011.
1.2. Evolution of Computers
It is the increasing speed of parallel computers that defines their tremendous value to the scientific community. Table 1.1 defines computer speeds, while Table 1.2 illustrates the times for solving representative medium-sized and grand challenge problems.
Table 1.1. Definitions of computing speeds.

Speed      Floating-point operations per second    Representative computer
1 Flops    10^0  = 1                               A fast human
1 Kflops   10^3  = 1 Thousand
1 Mflops   10^6  = 1 Million
1 Gflops   10^9  = 1 Billion                       VPX 220 (Rank #250 in 1993); a laptop in 2010
1 Tflops   10^12 = 1 Trillion                      ASCI Red (Rank #1 in 1997)
1 Pflops   10^15 = 1 Quadrillion                   IBM Roadrunner (Rank #1 in 2008); Cray XT5 (Rank #2 in 2008); 1/8 Fujitsu K Computer (Rank #1 in 2011)
1 Eflops   10^18 = 1 Quintillion                   Expected in 2018
1 Zflops   10^21 = 1 Sextillion
1 Yflops   10^24 = 1 Septillion

Table 1.2. Time scales for solving medium-sized and grand challenge problems.

Computer                     Moderate problems    Grand challenge problems    Applications
Sub-Petaflop                 N/A                  O(1) Hours                  Protein folding, QCD, and Turbulence
Teraflop                     O(1) Seconds         O(10) Hours                 Weather
1,000-Node Beowulf Cluster   O(1) Minutes         O(1) Weeks                  2D CFD, Simple designs
High-end Workstation         O(1) Hours           O(10) Years
PC with 2 GHz Pentium        O(1) Days            O(100) Years

Table 1.3. World's Top10 supercomputers in June 2011.

Rank  Vendor    Year  Computer                                Rmax (Tflops)  Cores    Site                                                Country
1     Fujitsu   2011  K computer                              8,162          548,352  RIKEN Advanced Institute for Computational Science  Japan
2     NUDT      2010  Tianhe-1A                               2,566          186,368  National Supercomputing Center in Tianjin           China
3     Cray      2009  Jaguar Cray XT5-HE                      1,759          224,162  DOE/SC/Oak Ridge National Laboratory                USA
4     Dawning   2010  Nebulae Dawning Cluster                 1,271          120,640  National Supercomputing Centre in Shenzhen          China
5     NEC/HP    2010  TSUBAME 2.0 HP Cluster Platform 3000SL  1,192          73,278   GSIC Center, Tokyo Institute of Technology          Japan
6     Cray      2011  Cielo Cray XE6                          1,110          142,272  DOE/NNSA/LANL/SNL                                   USA
7     SGI       2011  Pleiades SGI Altix                      1,088          111,104  NASA/Ames Research Center/NAS                       USA
8     Cray      2010  Hopper Cray XE6                         1,054          153,408  DOE/SC/LBNL/NERSC                                   USA
9     Bull SA   2010  Tera-100 Bull Bullx                     1,050          138,368  Commissariat a l'Energie Atomique                   France
10    IBM       2009  Roadrunner IBM BladeCenter              1,042          122,400  DOE/NNSA/LANL                                       USA

A grand challenge is a fundamental problem in science or engineering, with broad applications, whose solution would be enabled by the application of the high-performance computing resources that could become available in the near future. For example, a grand challenge problem in 2011, which would require 1,500 years to solve on a high-end workstation, could be solved on the latest and fastest K Computer in a few hours.
Measuring computer speed is itself an evolutionary process. Instructions per second (IPS) is a measure of processor speed. Many reported speeds


have represented peak rates on artificial instruction sequences with few branches or memory references, whereas realistic workloads typically lead to significantly lower speeds. Because of these problems of inflated speeds, researchers created standardized tests such as SPECint as an attempt to measure the real effective performance in commonly used applications.
In scientific computations, the LINPACK Benchmarks are used to measure a system's floating-point computing speed. They measure how fast a computer solves a dense N-dimensional system of linear equations of the kind commonly appearing in engineering. The solution is obtained by Gaussian elimination with partial pivoting, and the result is reported in millions of floating-point operations per second.
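As a rough illustration (a sketch, not the official benchmark code), the reported rate follows from the standard operation count of Gaussian elimination, about 2N^3/3 + 2N^2 floating-point operations for an N x N system, divided by the measured solution time; the problem size and timing below are made-up values.

    #include <stdio.h>

    /* Estimate the LINPACK rate from the problem size n and the solve time. */
    double linpack_gflops(double n, double seconds) {
        double ops = 2.0 / 3.0 * n * n * n + 2.0 * n * n;
        return ops / seconds / 1.0e9;           /* report the rate in Gflops */
    }

    int main(void) {
        printf("%.2f Gflops\n", linpack_gflops(10000.0, 95.0));
        return 0;
    }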
Figure 1.3 demonstrates Moore's law in action for computer speeds over the past 50 years, while Fig. 1.4 shows the evolution of computer architectures. During the first 20 years, starting in the 1950s, mainframes were the main computers, and several users shared one processor. During the next 20 years, starting in the 1980s, workstations and personal computers formed the majority of the computers, and each user had a personal processor. During the next unknown number of years (certainly more than 20), starting in the 1990s, parallel computers have been, and will most likely continue to be, dominating the user space, where a single user will control many processors.
Fig. 1.3. The peak performance of supercomputers.


Source: Top500.org and various other sources, prior to 1993.

Fig. 1.4. Evolution of computer architectures.

Fig. 1.5. The scatter plot of supercomputers' LINPACK and power efficiencies in 2011.

Figure 1.5 shows the 2011 supercomputers' LINPACK and power efficiencies. It is apparent from these figures that parallel computing is of enormous value to scientists and engineers who need computing power.


1.3. An Enabling Technology
Parallel computing is a fundamental and irreplaceable technique used in today's science and technology, as well as in the manufacturing and service industries. Its applications cover a wide range of disciplines:
- Basic science research, including biochemistry for decoding human genetic information as well as theoretical physics for understanding the interactions of quarks and the possible unification of all four forces.
- Mechanical, electrical, and materials engineering for producing better materials such as solar cells, LCD displays, LED lighting, etc.
- The service industry, including telecommunications and the financial industry.
- Manufacturing, such as the design and operation of aircraft and bullet trains.

Its broad applications in oil exploration, weather forecasting, communication, transportation, and aerospace make it a unique technique for the national economy and defense. It is precisely this uniqueness and its lasting impact that define its role in today's rapidly growing technological society.
Parallel computing research's main concerns are:
(1) Analysis and design of
    a. Parallel hardware and software systems.
    b. Parallel algorithms.
    c. Parallel programs.
(2) Development of applications in science, engineering, and commerce.
The processing power offered by parallel computing, growing at a rate orders of magnitude higher than the impressive 60% annual rate for microprocessors, has enabled many projects and offered tremendous potential for basic research and technological advancement. Every computational science problem can benefit from parallel computing.
Supercomputing power has been energizing the science and technology community and facilitating our daily life. As it has been, supercomputing development will stimulate the birth of new technologies in hardware, software, and algorithms while enabling many other areas of scientific discovery, including maturing the 3rd and likely the 4th paradigms of scientific research. This new round of Exascale computing will bring unique excitement in lifting energy and human efficiencies in adopting and adapting electronic technologies.


1.4. Cost Effectiveness
The total cost of ownership of parallel computing technologies includes:
(1) Purchasing cost.
(2) Operating cost, including maintenance and utilities.
(3) Programming cost, in terms of added training for users.
In fact, there is an additional cost of time, i.e., the cost of delay in technology deployment due to a lack of adequate knowledge in realizing its potential. How to make a sound investment of time and money in adopting parallel computing is a complex issue.
Apart from the business consideration that parallel computing is a cost-effective technology, it can be the only option for the following reasons:
(1) To improve the absolute response time.
(2) To study problems of the absolutely largest spatial size at the highest spatial resolution.
(3) To study problems of the absolutely largest time scale at the highest temporal resolution.
The analytical methods used to solve scientific and engineering problems were driven out of fashion several decades ago due to the growing complexity of these problems. Solving problems numerically on the serial computers available at the time had been quite attractive for 20 years or so, starting in the 1960s. This alternative of solving problems with serial computers was quite successful, but became obsolete with the gradual advancement of the new parallel computing technique available only since the early 1990s. Indeed, parallel computing is the wave of the future.

1.4.1. Purchasing costs
Hardware costs are the major expense of running any supercomputer center. Interestingly, the hardware costs per unit performance have been decreasing steadily, while the costs of operating a supercomputer center, including the utilities to power it up and cool it off and the administrators' salaries, have been increasing steadily. 2010 marked the turning point when the hardware costs became less than the operating costs.
Table 1.4 shows a list of example computers that demonstrates how drastically performance has increased and price has decreased. The cost per Gflops is the cost of a set of hardware that would theoretically operate



Table 1.4. Hardware cost per Gflops at different times.

Date    Cost per Gflops    Representative technology
1961    $1.1 x 10^12       IBM 1620 (costing $64K)
1984    $1.5 x 10^7        Cray Y-MP
1997    $3.0 x 10^4        A Beowulf Cluster
2000    $3.0 x 10^3        Workstation
2005    $1.0 x 10^3        PC Laptop
2007    $4.8 x 10^1        Microwulf Cluster
2011    $1.8 x 10^0        HPU4Science Cluster
2011    $1.2 x 10^0        K Computer power cost

at one billion floating-point operations per second. During the era when no single computing platform was able to achieve one Gflops, this table lists the total cost for multiple instances of a fast computing platform whose speeds sum to one Gflops. Otherwise, the least expensive computing platform able to achieve one Gflops is listed.

1.4.2. Operating costs
Most of the operating costs involve powering up the hardware and cooling it off. The latest (June 2011) Green5003 list shows that the most efficient Top500 supercomputer runs at 2097.19 Mflops per watt, i.e. an energy requirement of about 0.5 W per Gflops. Operating such a system for one year will consume about 4 KWh per Gflops. Therefore, the lowest annual power consumption for operating the most power-efficient 1 Pflops system is 4,000,000 KWh. The energy cost on Long Island, New York, in 2011 is $0.25 per KWh, so the annual energy cost of operating a 1 Pflops system is $1M.
Quoting the same Green500 list, the least efficient Top500 supercomputer runs at 21.36 Mflops per watt, nearly 100 times less power efficient than the most power-efficient system mentioned above. Thus, if we were to run such a system at 1 Pflops, the power cost would be $100M per year. In summary, the annual costs of operating the most and the least power-efficient 1 Pflops systems among the Top500 supercomputers in 2011 are $1M and $100M respectively, with a median cost of $12M, as shown in Table 1.5.

3 http://www.green500.org

Table 1.5. Supercomputer operating cost estimates.

Computer             Green500 rank    Mflops/W    1 Pflops operating cost
IBM BlueGene/Q       1                2097.19     $1M
Bullx B500 Cluster   250              169.15      $12M
PowerEdge 1850       500              21.36       $100M
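The arithmetic behind Table 1.5 can be checked with only the numbers quoted above (0.5 W per Gflops for the most efficient system and $0.25 per KWh); the short program below is an illustrative sketch, not part of the original text.

    #include <stdio.h>

    int main(void) {
        double watts_per_gflops  = 0.5;         /* most power-efficient 2011 system */
        double hours_per_year    = 24.0 * 365.0;
        double kwh_per_gflops    = watts_per_gflops * hours_per_year / 1000.0;
        double gflops_per_pflops = 1.0e6;
        double price_per_kwh     = 0.25;        /* Long Island, NY, 2011            */

        double annual_cost = kwh_per_gflops * gflops_per_pflops * price_per_kwh;
        printf("about %.1f KWh per Gflops-year and %.2g dollars per Pflops-year\n",
               kwh_per_gflops, annual_cost);    /* roughly 4.4 KWh and $1.1 million */
        return 0;
    }

Scaling the wattage by 100 for the least efficient system reproduces the $100M figure.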

1.4.3. Programming costs


The level of programmers' investment in utilizing a computer, be it serial, vector, or parallel, naturally increases with the complexity of the system.
Given the progress in parallel computing research, including software and algorithm development, we expect the difficulty of employing most supercomputer systems to be reduced gradually. This is largely due to the popularity and tremendous value of this type of complex system.
This manuscript also attempts to make programming parallel computers enjoyable and productive.


CHAPTER 2
PERFORMANCE METRICS AND MODELS

Measuring the performance of a parallel algorithm is somewhat tricky due to the inherent complications of the relatively immature hardware systems or software tools, or both, the complexity of the algorithms, and the lack of a proper definition of timing for the different stages of a given algorithm. However, we will examine the definitions of speedup, parallel efficiency, overhead, and scalability, as well as Amdahl's law, to conclude this chapter.
2.1. Parallel Activity Trace
It is cumbersome to describe, let alone analyze, parallel algorithms. We have introduced a new graphic system, which we call the parallel activity trace (PAT) graph, to illustrate parallel algorithms. The following is an example created with the list of conventions we establish:
(1) A 2D Cartesian coordinate system is adopted to depict the graph, with the horizontal axis for wall clock time and the vertical axis for the ranks or IDs of processors or nodes or cores, depending on the level of detail at which we wish to examine the activities.
(2) A horizontal green bar is used to indicate a local serial computation on a specific processor or computing unit. Naturally, the two ends of the bar indicate the starting and ending times of the computation, and thus the length of the bar shows the amount of time the underlying serial computation takes. Above the bar, one may write down the function being executed.
(3) A red wavy line indicates that the processor is sending a message. The two ends of the line indicate the starting and ending times of the message sending process, and thus the length of the line shows the amount of time for sending a message to one or more processors. Above the line, one may write down the ranks or IDs of the receivers of the message.


Fig. 2.1.

A PAT graph for illustrating the symbol conventions.

(4) A yellow wavy line indicates that the processor is receiving a message. The two ends of the line indicate the starting and ending times of the message receiving process, and thus the length of the line shows the amount of time for receiving a message. Above the line, one may write down the ranks or IDs of the sender of the message.
(5) An empty interval signifies that the processing unit is idle.
We expect the PAT graph (Fig. 2.1) to record vividly, a priori or a posteriori, the time series of multi-processor activities in the parallel computer. The fact that many modern parallel computers can perform message passing and local computation simultaneously may lead to overlapping of the above lines. With such a PAT graph, one can easily examine the amount of local computation, the amount of communication, the load distribution, etc. As a result, one can visually consider, or obtain guidance for, the minimization of communication and load imbalance. We will follow this simple PAT graphic system to describe our algorithms and hope this little invention of ours will aid in parallel algorithm design.
2.2. Speedup
Let T(1, N) be the time required for the best serial algorithm to solve a problem of size N on one processor, and T(P, N) be the time for a given parallel algorithm to solve the same problem of the same size N on P processors. Thus, speedup is defined as:

S(P, N) = T(1, N)/T(P, N).   (2.1)


Normally, S(P, N) ≤ P. Ideally, S(P, N) = P. Rarely, S(P, N) > P; this is known as super speedup.
For some memory-intensive applications, super speedup may occur for some small N because of memory utilization. Increasing N also increases the amount of memory used, which will reduce the frequency of swapping and hence greatly increase the speedup. The effect of the memory increase will fade away when N becomes large. For these kinds of applications, it is better to measure the speedup based on some P0 processors rather than one. Thus, the speedup can be defined as:

S(P, N) = P0 T(P0, N)/T(P, N).   (2.2)

Most of the time, it is easier to speed up large problems than small ones.
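In practice, T(P, N) is measured with a wall-clock timer around the parallel region. The following is a minimal MPI sketch in C; the work() kernel is a placeholder standing in for a real application. Running the program with one process gives T(1, N), running it with P processes gives T(P, N), and Eq. (2.1) or (2.2) then yields the speedup.

    #include <mpi.h>
    #include <stdio.h>

    /* A stand-in kernel: each process handles its share of the N items.     */
    static double work(long n, int rank, int size) {
        double s = 0.0;
        for (long i = rank; i < n; i += size) s += 1.0 / (1.0 + (double)i);
        return s;
    }

    int main(int argc, char **argv) {
        long n = 100000000;                     /* problem size N (illustrative) */
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        MPI_Barrier(MPI_COMM_WORLD);            /* start all processes together  */
        double t0 = MPI_Wtime();
        double local = work(n, rank, size), total;
        MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        double tP = MPI_Wtime() - t0;           /* this is T(P, N)               */

        if (rank == 0)
            printf("P = %d  T(P,N) = %.3f s  sum = %.6f\n", size, tP, total);
        MPI_Finalize();
        return 0;
    }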
2.3. Parallel Efficiency
Parallel efficiency is defined as:

E(P, N) = T(1, N)/(T(P, N) P) = S(P, N)/P.   (2.3)

Normally, E(P, N) ≤ 1. Ideally, E(P, N) = 1. Rarely is E(P, N) > 1. It is generally acceptable to have E(P, N) ≥ 0.6. Of course, it is problem dependent.
A linear speedup occurs when E(P, N) = c, where c is independent of N and P.
Algorithms with E(P, N) = c are called scalable.
2.4. Load Imbalance
If processor i spends Ti time doing useful work (Fig. 2.2), the total time spent working by all processors is Σ_{i=0}^{P-1} Ti and the average time a processor spends working is:

Tavg = (Σ_{i=0}^{P-1} Ti)/P.   (2.4)

The term Tmax = max{Ti} is the maximum time spent by any processor, so the total processor time is P Tmax. Thus, the parameter called the load imbalance ratio is given by:

I(P, N) = (P Tmax - Σ_{i=0}^{P-1} Ti)/(Σ_{i=0}^{P-1} Ti) = Tmax/Tavg - 1.   (2.5)


Fig. 2.2.

Computing time distribution: Time ti on processor i.

Remarks
I(P, N) is the average time wasted by each processor due to load imbalance, measured in units of the average working time.
If Ti = Tmax for every i, then I(P, N) = 0, resulting in complete load balance.
The slowest processor, with time Tmax, can mess up the entire team. This observation shows that the master-slave scheme is usually very inefficient because of the load imbalance caused by a slow master processor.
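A hedged sketch of how I(P, N) of Eq. (2.5) can be measured at a synchronization point with MPI reductions follows; the busy-work section is left as a placeholder.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, P;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &P);

        double t0 = MPI_Wtime();
        /* ... the useful work of processor `rank` would go here ...          */
        double Ti = MPI_Wtime() - t0;           /* busy time of this processor */

        double Tmax, Tsum;
        MPI_Reduce(&Ti, &Tmax, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
        MPI_Reduce(&Ti, &Tsum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0) {
            double Tavg = Tsum / P;
            printf("Tmax = %g  Tavg = %g  I(P,N) = %g\n",
                   Tmax, Tavg, Tavg > 0.0 ? Tmax / Tavg - 1.0 : 0.0);
        }
        MPI_Finalize();
        return 0;
    }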

2.5. Granularity
The size of the sub-domains allocated to the processors is called the
granularity of the decomposition. Here is a list of remarks:
Granularity is usually determined by the problem size N and computer
size P .
Decreasing granularity usually increases communication and decreases
load imbalance.


Increasing granularity usually decreases communication and increases


load imbalance.
2.6. Overhead
In all parallel computing, it is the communication and load imbalance overhead that affects the parallel efficiency. Communication costs are usually buried in the processor active time. When a co-processor is added for communication, the situation becomes trickier.
We introduce a quantity called the load balance ratio. For an algorithm using P processors, at the end of one logical point (e.g. a synchronization point), processor i is busy, either computing or communicating, for Ti amount of time. Let Tmax = max{Ti}; the time that the entire system of P processors spends on computation or communication is Σ_{i=1}^{P} Ti. Finally, let the total time that all processors are occupied (by computation, communication, or being idle) be P Tmax. The ratio of these two is defined as the load balance ratio:

L(P, N) = (Σ_{i=1}^{P} Ti)/(P Tmax) = Tavg/Tmax.   (2.6)

Obviously, if Tmax is close to Tavg, the load must be well balanced, so L(P, N) approaches one. On the other hand, if Tmax is much larger than Tavg, the load must not be balanced, so L(P, N) tends to be small.
Two extreme cases are:
L(P, N) = 0 for total imbalance.
L(P, N) = 1 for perfect balance.
L(P, N) only measures the percentage of utilization during the system up time; it does not care what the system is doing. For example, if we only keep one of the P = 2 processors in a system busy, we get L(2, N) = 50%, meaning we achieved 50% utilization. If P = 100 and one processor is used, then L(100, N) = 1%, which is badly imbalanced.
Also, we define the load imbalance ratio as 1 - L(P, N). The overhead is defined as

H(P, N) = P/S(P, N) - 1.   (2.7)

Normally, S(P, N) ≤ P. Ideally, S(P, N) = P. A linear speedup means that S(P, N) = cP, where c is independent of N and P.


2.7. Scalability
First, we define two terms: scalable algorithm and quasi-scalable algorithm.
A scalable algorithm is defined as one whose parallel efficiency E(P, N) remains bounded from below, i.e. E(P, N) ≥ E0 > 0, as the number of processors P grows at fixed problem size.
More specifically, those which can keep the efficiency constant when the problem size N is kept constant are called strong scalable, and those which can keep the efficiency constant only when N increases along with P are called weak scalable.
A quasi-scalable algorithm is defined as one whose parallel efficiency E(P, N) remains bounded from below, i.e. E(P, N) ≥ E0 > 0, when the number of processors satisfies Pmin < P < Pmax at fixed problem size. The interval Pmin < P < Pmax is called the scaling zone.
Very often, at a fixed problem size N, the parallel efficiency decreases monotonically as the number of processors increases. This means that for a sufficiently large number of processors the parallel efficiency tends to vanish. On the other hand, if we fix the number of processors, the parallel efficiency usually decreases as the problem size decreases. Thus, very few algorithms (aside from the embarrassingly parallel ones) are scalable, while many are quasi-scalable. Two major tasks in designing parallel algorithms are to maximize E0 and the scaling zone.
2.8. Amdahl's Law
Suppose a fraction f of an algorithm for a problem of size N on P processors is inherently serial and the remainder is perfectly parallel, and assume:

T(1, N) = α.   (2.8)

Thus,

T(P, N) = αf + α(1 - f)/P.   (2.9)

Therefore,

S(P, N) = 1/(f + (1 - f)/P).   (2.10)

This indicates that when P → ∞, the speedup S(P, N) is bounded by 1/f. It means that the maximum possible speedup is finite even if P → ∞.
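A small illustration of Eq. (2.10): for an assumed serial fraction f = 0.05 (a made-up value for demonstration), the speedup saturates near 1/f = 20 no matter how many processors are used.

    #include <stdio.h>

    /* Tabulate Amdahl's law, Eq. (2.10), for an assumed serial fraction f.  */
    int main(void) {
        double f = 0.05;
        int procs[] = {1, 10, 100, 1000, 10000};
        for (int k = 0; k < 5; ++k) {
            int P = procs[k];
            double S = 1.0 / (f + (1.0 - f) / P);
            printf("P = %5d   S(P,N) = %6.2f\n", P, S);
        }
        printf("upper bound 1/f = %.1f\n", 1.0 / f);
        return 0;
    }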


CHAPTER 3
HARDWARE SYSTEMS

A serial computer, with one CPU and one chunk of memory (ignoring the details of its possible memory hierarchy and peripherals), needs only two parameters to describe itself: its CPU speed and its memory size.
On the other hand, five or more properties are required to characterize a parallel computer:
(1) Number and properties of computer processors;
(2) Network topologies and communication bandwidth;
(3) Instruction and data streams;
(4) Processor-memory connectivity;
(5) Memory size and I/O.

3.1. Node Architectures
One of the greatest advantages of a parallel computer is the simplicity of building the processing units. Conventional, off-the-shelf, mass-produced processors are normally used, in contrast to developing special-purpose processors such as those for Cray systems and for IBM mainframe CPUs.
In recent years, the vast majority of the designs have centered on four processor families: Power, AMD x86-64, Intel EM64T, and Intel Itanium IA-64. These four, together with the Cray and NEC families of vector processors, are the only architectures still being actively utilized in high-end supercomputer systems. As shown in Table 3.1 (constructed with data from top500.org for the June 2011 release of the Top500 supercomputers), 90% of the supercomputers use x86 processors.



Table 3.1. Top500 supercomputers' processor shares in June 2011.

Processor family    Count    Share %    Rmax sum (GF)
Power                  45     9.00%        6,274,131
NEC                     1     0.20%          122,400
Sparc                   2     0.40%        8,272,600
Intel IA-64             5     1.00%          269,498
Intel EM64T           380    76.00%       31,597,252
AMD x86_64             66    13.20%       12,351,314
Intel Core              1     0.20%           42,830
Totals                500      100%       58,930,025

The ability of third-party organizations to use available processor technologies in original HPC designs is influenced by the fact that the companies that produce end-user systems themselves own the PowerPC, Cray, and NEC processor families. AMD and Intel do not compete in the end-user HPC system market. Thus, it should be expected that IBM, Cray, and NEC will continue to control the system designs based on their own processor architectures, while the efforts of the other competitors will be based on processors provided by Intel and AMD.
Currently both companies, Intel and AMD, are revamping their product lines and are transitioning their server offerings to quad-core processor designs. AMD introduced its Barcelona core on a 65 nm manufacturing process as a competitor to the Intel Core architecture; it should be able to offer instruction-per-clock performance comparable to Core 2, but the launch has been plagued by delays caused by the difficulties in manufacturing a sufficient number of higher-clocked units and by emerging operational issues requiring additional bug fixing, which so far have resulted in sub-par performance. At the same time, Intel enhanced its product line with the Penryn core refresh on a 45 nm process, featuring ramped-up clock speed, optimized execution subunits, and additional SSE4 instructions, while keeping the power consumption within the previously defined TDP limits of 50 W for the energy-efficient, 80 W for the standard, and 130 W for the high-end parts. According to the roadmaps published by both companies, the parts available in 2008 consist of up to four cores on the same processor die, with peak performance per core in the range of 8-15 Gflops on a power budget of 15-30 W and 16-32 Gflops on a power budget of 50-68 W. Due to its superior manufacturing capabilities, Intel is expected to maintain its performance-per-watt advantage with top Penryn parts clocked at 3.5 GHz or above, while the AMD Barcelona parts in the same power envelope were not expected to exceed a 2.5 GHz clock speed until the second half of 2008 at best.

Table 3.2. Power consumption and performance of processors in 2011.

Parameter                    Fujitsu SPARC64 VIIIfx1    IBM Power BQC    AMD Opteron 6100 series
Core Count                   8                          16               8
Highest Clock Speed          2 GHz                      1.6 GHz          2.3 GHz (est)
L2 Cache                     5 MB                       -                4 MB
L3 Cache                     N/A                        -                12 MB
Memory Bandwidth             64 GB/s                    -                42.7 GB/s
Manufacturing Technology     45 nm                      -                45 nm
Thermal Design Power Bins    58 W                       -                115 W
Peak Floating Point Rate     128 Gflops                 205 Gflops       150 Gflops

The features of the three processor families that power the top supercomputers in 2011 are given in Table 3.2, with the data collected from the respective companies' websites.

3.2. Network Interconnections
As processor clock speeds hit the limits imposed by the laws of physics, the ambition of building a record-breaking computer relies more and more on embedding ever-growing numbers of processors into a single system. Thus, the network performance has to speed up along with the number of processors so as not to become the bottleneck of the system.
The number of ways of connecting a group of processors is very large. Experience indicates that only a few are optimal. Of course, each network exists for a special purpose. With the diversity of applications, it does not make sense to say which one is the best in general. The structure of a network is usually measured by the following parameters:
Connectivity: Multiplicity between any two processors.
Average distance: The average of distances from a reference node to all other nodes in the topology. Such an average should be independent of the choice of reference node in a well-designed network topology.
Diameter: Maximum distance between two processors (in other words, the number of hops between the two most distant processors).
Bisection bandwidth: The number of bits that can be transmitted in parallel multiplied by the bisection width.
1 http://www.fujitsu.com/downloads/TC/090825HotChips21.pdf.


The bisection width is defined as the minimum number of communication links that have to be removed to divide the network into two partitions with an equal number of processors.
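To make these parameters concrete, the short sketch below evaluates them for a d-dimensional torus with k nodes per dimension (k even), using the standard k-ary d-cube formulas; the formulas and the example size are illustrative assumptions, not taken from the text.

    #include <stdio.h>

    int main(void) {
        int d = 3, k = 8;                       /* e.g. an 8 x 8 x 8 torus          */
        long nodes = 1, cross = 1;
        for (int i = 0; i < d; ++i) nodes *= k;
        for (int i = 0; i < d - 1; ++i) cross *= k;

        int  diameter   = d * (k / 2);          /* at most k/2 hops per dimension   */
        long bisection  = 2 * cross;            /* wrap-around links double the cut */
        double avg_dist = d * (k / 4.0);        /* roughly k/4 hops per dimension   */

        printf("nodes = %ld  diameter = %d  bisection width = %ld  avg distance ~ %.1f\n",
               nodes, diameter, bisection, avg_dist);
        return 0;
    }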

3.2.1. Topology
Here are some of the topologies that are currently in common use:
(1) Multidimensional mesh or torus (Figs. 3.1-3.2).
(2) Multidimensional hypercube.
(3) Fat tree (Fig. 3.3).
(4) Bus and switches.
(5) Crossbar.
(6) Network.
Practically, we call a 1D mesh an array and a 1D torus a ring. Figures 3.1-3.4 illustrate some of the topologies. Note the subtle similarities and differences of the mesh and torus structures. Basically, a torus network can be constructed from a similar mesh network by wrapping around the end points of every dimension of the mesh network. In practice, as in the IBM BlueGene systems, the physically constructed torus networks can be manipulated to retain the torus structure or to become a mesh network, at will, by software means.

Fig. 3.1.

Mesh topologies: (a) 1D mesh (array); (b) 2D mesh; (c) 3D mesh.


Fig. 3.2.

Torus topologies: (a) 1D torus (ring); (b) 2D torus; (c) 3D torus.

Fig. 3.3. Fat tree topology.

Among those topologies, mesh, torus, and fat tree are the most frequently adopted in the latest systems. Their properties are summarized in Table 3.3.
Among those topologies, mesh, torus and fat tree are more frequently
adopted in latest systems. Their properties are summarized in Table 3.3.
Mesh interconnects permit a higher degree of hardware scalability than
one can aord by the fat tree network topology due to the absence of


Fig. 3.4. An 8 x 8 network topology.

Table 3.3. Properties of mesh, torus, and fat tree networks.

                 Mesh or torus                Fat tree
Advantages       Fast local communication     Off-the-shelf
                 Fault tolerant               Good remote communication
                 Easy to scale                Economical
                 Low latency
Disadvantages    Poor remote communication    Fault sensitive
                                              Hard to scale up
                                              High latency

the external federated switches formed by a large number of individual switches. Thus, a large-scale cluster design with several thousand nodes necessarily has to contain twice that number of external cables connecting the nodes to tens of switches located tens of meters away from the nodes. At these port counts and multi-gigabit link speeds, it becomes cumbersome to maintain the quality of individual connections, which leads to maintenance problems that tend to accumulate over time.
As observed on systems with Myrinet and Infiniband interconnects, intermittent errors on a single cable may have a serious impact on the performance of massively parallel applications by slowing down the communication due to the time required for connection recovery and data retransmission. Mesh network designs overcome this issue by implementing a large portion of the data links as traces on the system backplanes and by aggregating several cable connections into a single bundle attached to a

single socket, thus reducing the number of possible mechanical faults by an order of magnitude.
These considerations are confirmed by a review of the changes among the Top10 supercomputers. In June 2011, there were four systems based on mesh networks and six based on fat tree networks, a much better ratio than the one for all Top500 supercomputers, which leads to the conclusion that advanced interconnection networks such as the torus are of greater importance in scalable high-end systems (Table 3.4).
Interestingly, the SGI Altix ICE interconnect architecture is based on Infiniband hardware, typically used as a fat tree network in cluster applications, but is organized as a fat tree at the single-chassis level and as a mesh for chassis-to-chassis connections. This confirms the observations about the problems associated with using a fat tree network for large-scale system designs.
Advanced cellular network topology
Since the communication rate of a single physical channel is also limited by the clock speed of the network chips, the practical way to boost bandwidth is to increase the number of network channels in a computing node; when applied to the mesh or torus networks we discussed, this increases the number of dimensions in the network topology. However, the dimension cannot grow without restriction. That makes a cleverer network design more desirable in top-ranked computers. Here, we present two advanced cellular network topologies that might be the trend for newer computers.

Table 3.4. Overview of the Top10 supercomputers in June 2011.

Rank  System           Speed (Tflops)  Processor family  Co-processor    Interconnect family  Interconnect topology
1     K Computer       8162            SPARC64           N/A             Tofu                 6D Torus
2     Tianhe-1A        2566            Intel EM64T       nVidia GPU      Proprietary          Fat Tree
3     Cray XT5         1759            AMD x86_64        N/A             SeaStar              3D Torus
4     Dawning          1271            Intel EM64T       nVidia Tesla    Infiniband           Fat Tree
5     HP ProLiant      1192            Intel EM64T       nVidia GPU      Infiniband           Fat Tree
6     Cray XE6         1110            AMD x86_64        N/A             Gemini               3D Torus
7     SGI Altix ICE    1088            Intel EM64T       N/A             Infiniband           Fat Tree
8     Cray XE6         1054            AMD x86_64        N/A             Gemini               3D Torus
9     Bull Bullx       1050            AMD x86_64        N/A             Infiniband           Fat Tree
10    IBM BladeCenter  1042            AMD x86_64        IBM PowerXCell  Infiniband           Fat Tree


MPU networks
The Micro Processor Unit (MPU) network is a combination of two k-dimensional rectangular meshes of equal size, which are offset by 1/2 of a hop along each dimension so as to surround each vertex of one mesh with a cube of 2^k neighbors from the other mesh. Connecting vertices in one mesh diagonally to their immediate neighbors in the other mesh and removing the original rectangular mesh connections produces the MPU network. Figure 3.5 illustrates the generation of a 2D MPU network by combining two 2D meshes. Constructing an MPU network of three or higher dimensions is similar. To complete the wrap-around connections for boundary nodes, we apply the cyclic property of a symmetric topology in such a way that a boundary node encapsulates itself in its virtual multi-dimensional cube and connects to all vertices of that cube.
In order to see the advantages of the MPU topology, we compare it with the torus in terms of such key performance metrics as network diameter, bisection width/bandwidth, and average distance. Table 3.5 lists the comparison of these under the same dimension.

Fig. 3.5. 2D MPU generated by combining two shifted 2D meshes: (a) 2D mesh; (b) two shifted meshes; (c) 2D MPU interconnect.

Table 3.5. Analytical comparisons between MPU and torus networks.

Network characteristics   MPU (n^k)          Torus (n^k)          Ratio of MPU to torus
Dimensionality            k                  k                    1
Number of nodes           2n^k               n^k                  2
Node degree               2^k                2k                   2^(k-1) k^(-1)
Network diameter          n                  nk/2                 2 k^(-1)
Bisection width           2^k n^(k-1)        2n^(k-1)             2^(k-1)
Bisection bandwidth       p n^(k-1)          p n^(k-1) k^(-1)     k
Number of wires           2^k n^k            k n^k                2^k k^(-1)


MSRT networks
A 3D Modified Shift Recursive Torus (MSRT) network is a 3D hierarchical network consisting of massive numbers of nodes that are connected by a basis 3D torus network and a 2D expansion network. It is constructed by adding bypass links to a torus. Each node in a 3D MSRT has eight links to other nodes, of which six are connected to nearest neighbors in the 3D torus and two are connected to bypass neighbors in the 2D expansion network. The MSRT achieves a shorter network diameter and a higher bisection width without increasing the node degree or wiring complexity.
To understand the MSRT topology, let us first start from 1D MSRT bypass rings. A 1D MSRT bypass ring originates from a 1D SRT ring by eliminating every other bypass link. A 1D MSRT (L = 2; l1 = 2, l2 = 4) is a truncated 1D SRT (L = 2; l1 = 2, l2 = 4). L = 2 is the maximum node level. This means that two types of bypass links exist, i.e. l1 and l2 links. Then, l1 = 2 and l2 = 4 indicate the short and long bypass links spanning 2^l1 = 4 and 2^l2 = 16 hops, respectively. Figure 3.6 shows the similarity and difference of the SRT and MSRT rings. We extend 1D MSRT bypass rings to 3D MSRT networks. To maintain a suitable node degree, we add the two types of bypass links in the x-axis and y-axis and then form a 2D expansion network in the xy-plane.
To study the advantages of MSRT networks, we compare the MSRT with other networks' performance metrics in Table 3.6. For calculating the average distances of the 3D MSRT, we first select an l1-level bypass node as the reference node and then an l2-level bypass node, so two results are present in the left and right columns, respectively.

Fig. 3.6. 1D SRT ring and the corresponding MSRT ring, with level-0, level-1, and level-2 nodes.



Table 3.6. Analytical comparison between MSRT and other topologies.

Topology   Bypass networks         Dimensions              Node degree  Diameter (hops)  Bisection width (x1024 links)  Average distance (hops)
3D Torus   N/A                     32 x 32 x 16            6            40               1                              20.001
4D Torus   N/A                     16 x 16 x 8 x 8         8            24               2                              12.001
6D Torus   N/A                     8 x 8 x 4 x 4 x 4 x 4   12           16               4                              8.000
3D SRT     L = 1; l1 = 4           32 x 32 x 16            10           24               9                              12.938
2D SRT     lmax = 6                128 x 128               8            13               1.625                          8.722
3D MSRT    L = 2; l1 = 2, l2 = 4   32 x 32 x 16            8            16               2.5                            9.239 / 9.413

3.2.2. Interconnect Technology
Along with the network topology goes the interconnect technology that enables the carefully designed network to achieve its maximum performance. Unlike the network topology, which has settled down to several typical schemes, the interconnect technology shows great diversity, ranging from commodity Gigabit Ethernet all the way to specially designed proprietary networks (Fig. 3.7).

3.3. Instruction and Data Streams
Based on the nature of the instruction and data streams (Fig. 3.8), parallel computers can be classified as SISD, SIMD, MISD, or MIMD, where I stands for Instruction, D for Data, S for Single, and M for Multiple.

Fig. 3.7.

Networks for Top500 supercomputers by system count in June 2011.


Fig. 3.8. Illustration of four instruction and data streams.

An example of SISD is the conventional single-processor workstation. SIMD is the single-instruction multiple-data model. It is not very convenient for a wide range of problems, but it is reasonably easy to build. MISD is quite rare. MIMD is very popular and appears to have become the model of choice, due to its breadth of functionality.

3.4. Processor-Memory Connectivity
In a workstation, one has no choice but to connect the single memory unit to the single CPU, but for a parallel computer, given several processors and several memory units, how to connect them to deliver efficiency is a big problem. Typical configurations, shown schematically in Figs. 3.9 and 3.10, are:
Distributed-memory.
Shared-memory.
Shared-distributed-memory.
Distributed-shared-memory.
3.5. IO Subsystems
A high-performance input-output subsystem is the second most essential component of a supercomputer, and its importance grows even more apparent with the latest introduction of Web 2.0 services such as video streaming, e-commerce, and web serving.


Fig. 3.9. (a) Distributed-memory model; (b) Shared-memory model.

Fig. 3.10. A shared-distributed-memory configuration.

Two popular file systems are Lustre and IBM's GPFS.
Lustre is a massively parallel distributed file system, generally used for large-scale cluster computing, and it provides a high-performance file system for clusters of tens of thousands of nodes with petabytes of storage


capacity. It serves computer clusters ranging from small workgroups to large-scale, multi-site clusters. More than half of the world's top supercomputers use Lustre file systems.
Lustre file systems can support tens of thousands of client systems, tens of petabytes (PB) of storage, and hundreds of gigabytes per second (GB/s) of I/O throughput. Due to Lustre's high scalability, internet service providers, financial institutions, and the oil and gas industry deploy Lustre file systems in their data centers.
IBM's GPFS is a high-performance shared-disk clustered file system, adopted by some of the world's top supercomputers such as the ASC Purple supercomputer, which is composed of more than 12,000 processors and has 2 petabytes of total disk storage spanning more than 11,000 disks.
GPFS provides tools for management and administration of the GPFS cluster and allows shared access to file systems from remote GPFS clusters.
3.6. System Convergence
Generally, at least three parameters are necessary to quantify the quality of a parallel system. They are:
(1) Single-node performance.
(2) Inter-node communication bandwidth.
(3) I/O rate.
The higher these parameters are, the better the parallel system. On the other hand, it is the proper balance of these three parameters that guarantees a cost-effective system. For example, narrow inter-node bandwidth will slow down communication and make many applications unscalable. A low I/O rate will keep nodes waiting for data and slow overall performance. A slow node will make the overall system slow.

3.7. Design Considerations


Processors: Advanced pipelining, instruction-level parallelism, reduction
of branch penalties with dynamic hardware prediction and scheduling.

Networks: Inter-node networking topologies and networking controllers.

Processor-memory connectivity: (1) Centralized shared-memory, (2) distributed-memory, (3) distributed-shared-memory, and (4) virtual shared-memory. Naturally, manipulating data spread over such memory systems is a daunting undertaking and is arguably the most significant challenge of all aspects of parallel computing. Data caching, reduction of cache misses and their penalty, design of memory hierarchies, and virtual memory are all subtle issues for large systems.

Storage systems: Types, reliability, availability, and performance of storage devices, including bus-connected storage, storage area networks, and RAID.
In this 4D hardware parameter space, one can spot many points, each representing a particular architecture of parallel computer systems. Parallel computing is an application-driven technique, where unreasonable combinations of computer architecture parameters are quickly eliminated through selection, leaving us, fortunately, a small number of useful cases. To name a few: distributed-memory MIMD (IBM BlueGene systems on 3D or 4D torus networks, Cray XT and XE systems on 3D torus networks, China's National Defense University's Tianhe-1A on a fat tree network, and Japan's RIKEN K computer on a 6D torus network), shared-memory MIMD (Cray and IBM mainframes), and many distributed-shared-memory systems.
It appears that distributed-memory MIMD architectures represent optimal configurations.
Of course, we have not mentioned an important class of parallel computers, i.e. the Beowulf clusters built by hooking up commercially available nodes on commercially available networks. The nodes are made of the off-the-shelf Intel or AMD X86 processors widely marketed to the consumer space in the billions. Similarly, such nodes are connected by Ethernet networks, either with Gigabit bandwidth in the 2000s or 10-Gigabit bandwidth in the 2010s, or by Myrinet or InfiniBand networks with much lower latencies. Such Beowulf clusters provide high peak performance and are inexpensive to purchase or easy to make in house. As a result, they spread like wildfire for more than a decade starting in the 1990s. Such clusters are not without problems. First, the power efficiency (defined as the number of sustained FLOPS per kW consumed) is low. Second, scaling such a system to contain hundreds of thousands of cores is a mission impossible. Third, hardware reliability suffers badly as system size increases. Fourth, monetary cost can grow nonlinearly because of the need for additional networking. In summary, Beowulf clusters have served the computational science community profitably for a generation of researchers, and some new innovation in supercomputer architectures must be introduced to win the next battle.

CHAPTER 4
SOFTWARE SYSTEMS

The software system for a parallel computer is a monster, as it contains the node software that is typically installed on serial computers, system communication protocols and their implementations, libraries for basic parallel functions such as collective operations, parallel debuggers, and performance analyzers.
4.1. Node Software
The basic entity of operation is a processing node, and this processing node is similar to a self-contained serial computer. Thus, we must install software on each node. Node software includes the node operating system, compilers, system and application libraries, debuggers, and profilers.
4.1.1. Operating systems
Essentially, augmenting any operating system for serial computers to provide data transfers among individual processing units can constitute a parallel computing environment. Thus, regular serial operating systems (OS) such as UNIX, Linux, Microsoft Windows, and Mac OS are candidates to snowball into a parallel OS. Designing parallel algorithms for large-scale applications requires full understanding of the basics of serial computing. UNIX is a well-developed and widely adopted operating system in scientific computing communities. As evident in the picture (Fig. 4.1) copied from Wikipedia, UNIX has gone through more than four decades of development and reproduction.
Operating systems such as Linux or UNIX are widely used as the node OS for most parallel computers, with their vendor flavors. It is not impossible to see parallel computers with Windows installed as their node OS; however, it is rare to see Mac OS as the node OS in spite of its huge commercial success.

Fig. 4.1. The evolutionary relationship of several UNIX systems (Wikipedia).

Linux is the predominant node OS, and one of the most successful free and open source software packages. Linux, with its many different variations, is installed on a wide variety of hardware devices including mobile phones, tablet computers, routers and video game consoles, laptop and desktop computers, mainframes, and, most relevantly, the nodes of parallel computers; it runs on the nodes of most, if not all, Top500 supercomputers. A distribution intended to run on supercomputer nodes usually omits many functions, such as the graphical environment, from the standard release and instead includes other software to enable communications among such nodes.

4.1.2. Compilers and libraries
We limit our discussion to compilers that transform source code written in high-level programming languages into low-level executable code for the parallel computer's nodes, i.e. serial elements. All compilers always perform
the following operations when necessary: Lexical analysis, preprocessing,


parsing, semantic analysis, code optimization, and code generation. Most
commonly, a node-level executable program capable of communicating with
other executable programs must be generated by a serial language compiler.
The popular high-level languages are FORTRAN, C, and C++. Many other
serial computer programming languages can also be conveniently introduced
to enable parallel computing through language binding.
Parallel compilers, still highly experimental, are much more complex
and will be discussed later.
A computer library is a collection of resources in the form of subroutines, classes, values, etc., just like conventional brick-and-mortar libraries storing books, newspapers, and magazines. Computer libraries contain common code and data for independent programs to share, which enhances modularity. Some executable programs are both standalone programs and libraries, but most libraries are not executable. Library objects allow programs to link to each other through a process called linking, done by a linker.
4.1.3. Profilers
A computer profiler is a program that performs dynamic program analysis of the usage of memory, of particular instructions, or of the frequency and duration of function calls. It becomes a part of either the program source code or its binary executable form to collect performance data at run-time.
The most common use of profiling information is to aid program optimization. It is a powerful tool for serial programs. For parallel programs, a profiler is much more difficult to define, let alone to construct. Like parallel compilers and parallel debuggers, parallel profilers are still work in progress.
4.2. Programming Models
Most early parallel computing environments were computer-specific, with poor portability and scalability. Parallel programming was highly dependent on system architecture. Great efforts were made to bridge the gap between parallel programming and computing environments. Some standards were promoted and some software was developed.
Observation: A regular sequential programming language (like Fortran, C, or C++, etc.) and four communication statements (send, recv, myid,

and numnodes) are necessary and sufficient to form a parallel computing language.
send: One processor sends a message to the network. The sender does not need to specify the receiver, but it does need to designate a name for the message.
recv: One processor receives a message from the network. The receiver does not have the attributes of the sender, but it does know the name of the message for retrieval.
myid: An integer between 0 and P − 1 identifying a processor. myid is always unique within a partition.
numnodes: An integer giving the total number of nodes in the system.
This is the so-called single-sided message passing, which is popular in most
distributed-memory supercomputers.
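As a concrete illustration (added here; not part of the original text), these four primitives map naturally onto the MPI interface: myid and numnodes correspond to MPI_Comm_rank and MPI_Comm_size, while send and recv correspond roughly to MPI_Send and MPI_Recv, the main difference being that MPI's two-sided calls name an explicit destination or source rank and a tag rather than only a message name. A minimal sketch, assuming at least two processes:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int myid, numnodes;
    double x = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);       /* myid     */
    MPI_Comm_size(MPI_COMM_WORLD, &numnodes);   /* numnodes */

    if (myid == 0) {
        x = 3.14;
        MPI_Send(&x, 1, MPI_DOUBLE, 1, 99, MPI_COMM_WORLD);   /* send */
    } else if (myid == 1) {
        MPI_Recv(&x, 1, MPI_DOUBLE, 0, 99, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);                           /* recv */
        printf("process %d of %d received %f\n", myid, numnodes, x);
    }
    MPI_Finalize();
    return 0;
}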
It is important to note that the Network Buffer, as labeled, does not in fact exist as an independent entity and is only temporary storage. It is created either in the sender's RAM or in the receiver's RAM, depending on the readiness of the message routing information. For example, if a message's destination is known but the exact location at the destination is not, the message will be copied to the receiver's RAM for easier transmission.
4.2.1. Message passing
There are three types of communication in parallel computers: synchronous, asynchronous, and interrupt communication.
Synchronous Communication: In synchronous communication, the sender will not proceed to its next task until the receiver retrieves the message from the network. This is analogous to hand delivering a message to someone, which can be quite time consuming (Fig. 4.2).
Asynchronous Communication: During asynchronous communication, the sender will proceed to the next task whether the receiver retrieves the message from the network or not. This is similar to mailing a letter. Once the letter is placed in the post office, the sender is free to resume his or her activities. There is no protection for the message in the buffer (Fig. 4.3).

Fig. 4.2. An illustration of synchronous communication.

Fig. 4.3. An illustration of asynchronous communication.

Asynchronous message passing example: The sender issues a message, then continues with its tasks regardless of the receiver's response to the message (Fig. 4.4). The receiver has four options with regard to the message already issued by the sender, which now stays somewhere called the buffer:
(1) Receiver waits for the message.
(2) Receiver takes the message only after it has finished its current task, either computing or communicating with another processor.
(3) Receiver ignores the message.
(4) Receiver merges this message with another message still residing in the buffer.

Fig. 4.4. Sending and receiving with the network buffer.

Interrupt: The receiver interrupts the sender's current activity in order to pull messages from the sender. This is analogous to the infamous 1990s telemarketers' calls at dinnertime in the US. The sender issues a short message to interrupt the current execution stream of the receiver. The receiver becomes ready to receive a longer message from the sender. After an appropriate delay (for the interrupt to return the operation pointer to the messaging process), the sender pushes the message through to the right location of the receiver's memory, without any delay (Fig. 4.5).

Fig. 4.5. An illustration of interrupt communication.

Communication Patterns: There are nine different communication patterns.

Four Modes of MPI Message Passing:
Standard:
Assumes nothing about the matching send.
Assumes nothing about the matching receive.
Messages are usually not buffered, for a performance gain.
Buffered:
With buffering, messages stay in the sender's or receiver's memory, or both.
Messages are moved from the buffer to the destination as soon as the destination is known and ready.

Synchronous:
A request is sent to the receiver and, when it is acknowledged, the sender pushes the message to the receiver.
When the message arrives, both sender and receiver move on.
Ready:
The sender knows, before sending, that the matching receive has already been posted.
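For concreteness, a sketch (added here; it assumes MPI and at least two processes) showing the four modes as the four MPI send calls, with the receives pre-posted so that the ready send is legal:

#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    int rank, x = 42, r[4];
    MPI_Request req[4];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 1)                        /* pre-post receives; the ready send needs this */
        for (int i = 0; i < 4; i++)
            MPI_Irecv(&r[i], 1, MPI_INT, 0, i, MPI_COMM_WORLD, &req[i]);

    MPI_Barrier(MPI_COMM_WORLD);          /* make sure the receives are posted first */

    if (rank == 0) {
        int bufsize = sizeof(int) + MPI_BSEND_OVERHEAD;
        void *buf = malloc(bufsize);
        MPI_Buffer_attach(buf, bufsize);  /* user buffer required by the buffered mode */
        MPI_Send (&x, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);   /* standard    */
        MPI_Bsend(&x, 1, MPI_INT, 1, 1, MPI_COMM_WORLD);   /* buffered    */
        MPI_Ssend(&x, 1, MPI_INT, 1, 2, MPI_COMM_WORLD);   /* synchronous */
        MPI_Rsend(&x, 1, MPI_INT, 1, 3, MPI_COMM_WORLD);   /* ready       */
        MPI_Buffer_detach(&buf, &bufsize);
        free(buf);
    }
    if (rank == 1)
        MPI_Waitall(4, req, MPI_STATUSES_IGNORE);

    MPI_Finalize();
    return 0;
}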
4.2.2. Shared-memory
The shared-memory computer (Fig. 4.6) is another large category of parallel computers. As the name indicates, all or parts of the memory space are shared among all the processors. That is, the memory can be simultaneously accessed directly by multiple processors. Under this scheme, communications among processors are carried out, implicitly, by accessing the same memory space from different processors at different times. This makes it easier for the developer to write a parallel program, since at least data sharing is made much more convenient. Serious problems, such as limited system scalability due to restricted data sharing, are still unsolved. As we will discuss later, multiple techniques were introduced to mitigate such challenges.

Fig. 4.6. Schematic diagram for a typical shared-memory computer.

Although all the memory can be shared among processors without extra difficulty, in practical implementations, memory space on a single node has always been partitioned into shared and private parts, at least logically, in order to provide flexibility in program development. Shared data is accessible by all, but private data can only be accessed by the one who owns it.
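As an illustration of the shared/private split (a sketch added here; OpenMP is used only as one common realization of the shared-memory model), the array and the final accumulator are shared while each thread keeps a private partial sum:

#include <stdio.h>
#include <omp.h>

int main(void) {
    enum { N = 1000 };
    double a[N], total = 0.0;           /* shared data: visible to all threads */
    for (int i = 0; i < N; i++) a[i] = 1.0;

    #pragma omp parallel
    {
        double partial = 0.0;           /* private: each thread has its own copy */
        #pragma omp for
        for (int i = 0; i < N; i++)
            partial += a[i];
        #pragma omp critical            /* serialize updates to the shared total */
        total += partial;
    }
    printf("total = %f\n", total);
    return 0;
}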
4.3. Parallel Debuggers
A parallel debugger is no different from a serial debugger. dbx is one popular example of a serial debugger; dbx is a utility for source-level debugging and execution of programs written in C and FORTRAN.
Most parallel debuggers are built around dbx with additional functions for handling parallelism, for example, to report variable addresses and contents on different processors. There are many variations that add convenience by using a graphic interface.
A typical debugger, the interactive parallel debugger (IPD), is a complete symbolic, source-level debugger for parallel programs. Beyond the standard operations that facilitate the debugging of serial programs, IPD offers custom features that facilitate debugging parallel programs. IPD lets you debug parallel programs written in C, FORTRAN, and assembly language. IPD consists of a set of debugging commands, for which help is available from within IPD.

4.4. Parallel Profilers
Similar to profilers for serial computing, parallel profilers provide tools for analyzing parallel program execution, including serial performance, communication performance, and event-traced performance of application programs.
Many parallel profilers are in circulation and none appears dominant. One profiler, ParaGraph, introduced in the early 2000s, is likely to have been forgotten, but its design goals and expected functionalities are still not obsolete.
ParaGraph is a graphical display system for visualizing the behavior and performance of parallel programs on message-passing multiprocessor architectures. This is an ancient utility, but its philosophy is still up to date. It takes as input execution profile data provided by the Portable Instrumented Communication Library (PICL) developed at Oak Ridge National Laboratory. ParaGraph is based on the X Window System, and thus runs on a wide variety of graphical workstations. The user interface for ParaGraph is menu-oriented, with most user input provided by mouse clicks, although for some features keyboard input is also supported. The execution of ParaGraph is event driven, including both user-generated X Window events and trace events in the data file produced by PICL. Thus, ParaGraph provides a dynamic depiction of the parallel program while also providing responsive interaction with the user. Menu selections determine the execution behavior of ParaGraph both statically and dynamically. As a further aid to the user, ParaGraph preprocesses the input trace file to determine relevant parameters automatically before the graphical simulation begins.

CHAPTER 5
DESIGN OF ALGORITHMS

Quality algorithms, in general, must be (1) accurate, (2) efficient, (3) stable, (4) portable, and (5) maintainable.
For parallel algorithms, the quality metric "efficient" must be expanded to include high parallel efficiency and scalability.
To gain high parallel efficiency, two factors must be considered:
(1) Communication costs;
(2) Load imbalance costs.
For scalability, there are two distinct types: strong scaling and weak scaling. A strong scaling algorithm is one that is scalable for solving a fixed total problem size with a varying number of processors. Conversely, a weak scaling algorithm is one that is scalable for solving a fixed problem size per processor with a varying number of processors.
Minimizing these two costs simultaneously is essential but difficult for optimal algorithm design. For many applications, these two factors compete to waste cycles and they have many dependencies. For example, one parameter that one can always adjust for performance is granularity. Usually, granularity depends on the number of processors, and the smaller the granularity, the smaller the load imbalance and the bigger the communication.
More specifically, efficient parallel algorithms must satisfy many or all of the following conditions:
(1) Communication to computation ratio is minimal.
(2) Load imbalance is minimal.
(3) Sequential bottleneck is minimal.
(4) Non-parallel access to storage is minimal.
Achieving all of these is likely to ensure the high parallel efficiency of a parallel algorithm. How we achieve this is described in the following sections.
5.1. Algorithm Models


The paradigm of divide and conquer, tried and true in sequential algorithm design, is still the best foundation for parallel algorithms. The only step that needs extra consideration is the combine step. Thus the key steps are:
(1) Problem decomposition (data, control, data + control).
(2) Process scheduling.
(3) Communication handling (interconnect topology, size and number of messages).
(4) Load balance.
(5) Synchronization.
(6) Performance analysis and algorithm improvement.
The following basic models are widely utilized for the design of efficient parallel algorithms:
(1) Master-slave.
(2) Domain decomposition.
(3) Control decomposition.
(4) Data parallel.
(5) Single program multiple data (SPMD).
(6) Virtual-shared-memory model.

Of course, other parallel programming models are also possible. For a large
application package, it is common to utilize two or more programming
models.

5.1.1. Master-slave
A master-slave model is the simplest parallel programming paradigm with
an exception of the embarrassingly parallel model. In the master-slave
model, a master node controls the operations of the slave nodes in the
system. This model can be applied to many types of applications and is
also popular among the initial practitioners of parallel computing. But
it has serious drawbacks for sophisticated applications in which a single
master or even a series of strategically distributed masters are unable
to handle the demands of the slaves smoothly to minimize their idling.
Thus, this model is usually, and should be, avoided for serious and larger
scale applications.
Figure 5.1 illustrates the Master-slave model.

Fig. 5.1. A sketch of the Master-slave model in which communications occur only
between master and individual slaves while communications never occur among slaves.
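A minimal sketch of this pattern in C with MPI (added here for illustration; the task content, tags, and NTASKS are placeholders rather than anything prescribed by the text): the master deals out task indices on demand and each slave returns a result until it receives a stop tag.

#include <mpi.h>
#include <stdio.h>

#define NTASKS   100
#define TAG_WORK 1
#define TAG_STOP 2

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {                           /* master */
        int next = 0, active = size - 1, result;
        MPI_Status st;
        for (int s = 1; s < size; s++) {       /* prime each slave with one task */
            if (next < NTASKS) {
                MPI_Send(&next, 1, MPI_INT, s, TAG_WORK, MPI_COMM_WORLD); next++;
            } else {
                MPI_Send(&next, 1, MPI_INT, s, TAG_STOP, MPI_COMM_WORLD); active--;
            }
        }
        while (active > 0) {                   /* collect results, refill or stop slaves */
            MPI_Recv(&result, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                     MPI_COMM_WORLD, &st);
            if (next < NTASKS) {
                MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_WORK, MPI_COMM_WORLD); next++;
            } else {
                MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_STOP, MPI_COMM_WORLD); active--;
            }
        }
    } else {                                   /* slave */
        int task, result;
        MPI_Status st;
        while (1) {
            MPI_Recv(&task, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
            if (st.MPI_TAG == TAG_STOP) break;
            result = task * task;              /* placeholder work */
            MPI_Send(&result, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
        }
    }
    MPI_Finalize();
    return 0;
}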

5.1.2. Domain decomposition
Domain decomposition, invented much earlier in numerical analysis, particularly for numerical solutions of PDEs, is defined as a method of splitting a large problem domain into smaller ones. It is essentially a divide-and-conquer technique for arriving at the solution of an operator equation posed on a domain from the solution of related operator equations posed on sub-domains. This borrowed term for parallel computing has several definitions:
Definition I: The domain associated with a problem is divided into parts, or grains, one grain per processor of the parallel computer. In the case of domains with a metric (e.g., typical spatial domains) we use domain decomposition in a more specific sense to denote the division of space into locally connected grains.1
Definition II: In domain decomposition, the input data (the domain) is partitioned and assigned to different processors. Figure 5.2 illustrates the data decomposition model.

5.1.3. Control decomposition
In control decomposition, the tasks or instruction streams, rather than the data, are partitioned and assigned to different processors. The control decomposition approach is not as uncommon as it appears, especially for many non-mathematical applications.
Figure 5.3 illustrates control decomposition.

1 Fox, Johnson, Lyzenga, Otto, Salmon, and Walker.

Fig. 5.2. A sketch of the domain decomposition model in which the computational domain is decomposed into sub-domains (which are then assigned to computing resources). This model is popular for solving problems defined on obvious physical spaces, e.g., many PDEs. It is almost the standard approach for decomposing problems with locality.

Fig. 5.3. In this model, the computational domain is decomposed into multiple sub-domains, each of which is assigned to a process. This model is useful for implementing SIMD and MIMD, and it is popular for solving PDEs. This domain decomposition approach is particularly effective for solving problems with locality.

5.1.4. Virtual-shared-memory
The shared-memory model has the great advantage of accessing data without explicit message passing, because the entire (global) data space is shared among the participating processors. It has a serious drawback in achieving scalability for larger systems beyond 16 processors, or a similar magic upper limit, due above all to the inability to share data harmoniously. Wise attempts came to the rescue: the virtual-shared-memory model, in which data are shared virtually with the aid of software. To the application developers or the struggling programmers, there appears to be an ocean of memory to which all participating processors, regardless of their physical connections to the memory banks, can read and write data conveniently without any explicit message-passing statements. Housekeeping tasks, such as strategically addressing the reads and writes, are the business of the daemon-like agents behind the scenes.
For better or for worse, the programmer no longer needs to explicitly manage message passing and, of course, no longer controls and exploits message passing for better efficiency. The gain from removing the explicit message-passing hassle is matched by a loss of program efficiency. As in most programming trade-offs, one gains some and loses some.
The fact that this model causes major resource overheads makes it a wishful academic exercise, although it is likely a future model of choice for data sharing in parallel computing.
Figure 5.4 illustrates the virtual-shared-memory model.
5.1.5. Comparison of programming models
(1) Explicit vs. implicit.
(2) Virtual-shared-memory vs. message passing.
(3) Data parallel vs. control parallel.
(4) Master-slave vs. the rest.

5.1.6. Parallel algorithmic issues
An important issue in creating efficient algorithms for parallel computers (or for any computer in general) is the balancing of:
(1) The programmer's and the computer's efforts (see Fig. 5.5).
(2) Code clarity and code efficiency.
The amount of effort devoted by a programmer is always correlated with the efficiency and clarity of the program. The more you work on the algorithms, the more the computer will deliver. The broad spectrum of approaches for creating a parallel algorithm, ranging from brute force to state-of-the-art software engineering, allows flexibility in the amount of effort devoted by a programmer.
For short-term jobs, ignoring efficiency for quick completion of the program is the usual practice. For grand challenges that may run for years for a set of parameters, careful design of algorithms and implementation is required. Algorithms that are efficient may not be clean, and algorithms

Fig. 5.4. In the virtual-shared-memory model, each process owns the entire memory space virtually. In reality, each process only owns a portion of the entire memory space, while the operating system virtualizes the space and facilitates data transfers when necessary.

that are clean may not be efficient. Which is more important? There is no universal rule; it is problem-dependent. The desirable programs are both clean and efficient.
The following are the main steps involved in designing a parallel algorithm:
(1) Decomposing a problem into similar sub-problems.
(2) Scheduling the decomposed sub-problems on participating processors.
(3) Communicating the necessary information between the processors so that each processor has sufficient information for progressing.

Fig. 5.5. This figure illustrates the relationship between programmer time and computer time, and how it is related to the different ways of handling communication.

(4) Collecting results from each processor to form the final result of the original problem.
There are key tasks for designing an efficient parallel algorithm:
(1) Construction of the serial algorithm for each processing unit.
(2) Handling of message passing among processing units.
(3) Balancing loads among processing units.
A parallel-efficient algorithm should always have minimal overheads of load imbalance and communication. These two factors usually tend to conflict, making algorithms inefficient; minimizing them individually and simultaneously is the highest design objective, which can be costly to achieve for some applications. For some problems with intrinsic local interactions, such as the hyperbolic equations and short-ranged molecular dynamics, to gain minimal communication one must minimize the aspect ratio of the sub-domains allocated to each processor (assuming more than one sub-domain can be given to each processor), which can happen only when each processor has one very round-shaped sub-domain. This choice of sub-domain size, however, increases the variance in loads on all processors and creates extreme load imbalance. On the other hand, if the sub-domains are made small in size, load balance may be achieved easily, but communication soars. Conventional wisdom tells us to choose a sub-domain of medium size.
Minimizing the overhead due to load imbalance and communication by choosing a proper granularity is the main goal in designing parallel algorithms. A good algorithm must be robust, which includes reliability, flexibility, and versatility.
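A rough worked example (added here for illustration, assuming cubic sub-domains of side L grid points and nearest-neighbor interactions) makes the competition concrete: the computation on a sub-domain scales with its volume and the communication with its surface,

    T_comp ∝ L^3,    T_comm ∝ 6 L^2,    T_comm / T_comp ∝ 6 / L,

so enlarging (and rounding) the sub-domains lowers the relative communication cost, while shrinking them eases load balancing; the proper granularity sits between the two extremes.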

5.1.7. Levels of algorithmic complication
Being a new form of art, parallel computing can be extremely simple or extremely difficult depending on the application. Thus, application problems are classified into three classes, as shown in Fig. 5.6.

Fig. 5.6. An illustration of parallel complexity. Embarrassingly parallel: little communication, little load imbalance. Synchronous parallel: load nearly balanced, known communication patterns. Asynchronous: load not balanced, unpredictable communication patterns.

Embarrassingly parallel
The lowest class is the embarrassingly parallel problem in which very little communication is needed, if ever; it might only occur once at the

beginning to kick off the parallel executions on all processors, or at the end to collect results from all the processors. Furthermore, load balance is automatically preserved. Algorithms for this type of problem are always close to 100% parallel efficient and scalable. Examples of embarrassingly parallel algorithms include all numerical integration and most of the master-slave type problems.
Synchronous parallel
The next class is the synchronous parallel problem, in which communications are needed to synchronize the executions of a fixed cluster of processors in the middle of a run. Load is also almost balanced if the synchronization is done properly. Algorithms for this type of problem may achieve very high parallel efficiencies and they are quasi-scalable with reasonably large scaling zones.
Most problems involving local interactions, such as the hyperbolic PDEs or classical short-ranged molecular dynamics, are of this type.
Asynchronous parallel
The most difficult class is the asynchronous parallel problem, in which one can hardly formulate any communication patterns and most communications involve the entire system of processors. The load is also non-deterministic and thus rarely possible to balance statically. Algorithms for this type of problem always have small parallel efficiency and are rarely scalable.
It is easy to understand why they are not scalable. For a fixed-size problem, the computational load is nearly fixed. When the number of processors increases, the total communication increases, leading to a decrease in parallel efficiency. This type of problem is poisonous to parallel computing. The unfortunate truth is that most physically interesting problems, including linear algebra problems such as LU decomposition and matrix multiplications, elliptical PDEs, CMD with long-ranged interactions, quantum molecular dynamics, and plasma simulations, belong to this class.
Does this mean parallel computing is useless for scientific problems? The answer is no. The reason is that, in reality, when increasing the number of processors, one can always study relatively larger problems. A parallel efficiency of 50% does not pose a big challenge, but it can always make a run faster (sometimes by orders of magnitude) than a serial run. In fact, serial computing is indeed a dead end for scientific computing.
5.2. Examples of Collective Operations


To broadcast a string of numbers (or other data) to a 1D array of processors
and to compute a global sum over numbers scattered on all processors are
both very simple but typical examples of parallel computing.
5.2.1. Broadcast
Suppose processor 0 possesses N floating-point numbers {X_{N-1}, . . . , X_1, X_0} that need to be broadcast to the other P − 1 processors in the system. The best way to do this (Figs. 5.7 and 5.8) is to let P0 send out X0 to P1, which then sends X0 on to P2 (keeping a copy for itself) while P0 sends the next number, X1, to P1. At the next step, while P0 sends out X2 to P1, P1 sends out X1 to P2, etc., in a pipeline fashion. This is called wormhole communication.
Suppose the time needed to send one number from a processor to its neighbor is T_comm. Also, if a processor starts to send a number to its neighbor at time T_1 and the neighbor starts to ship out this number at time T_2, we define a parameter called:

    T_startup = T_2 − T_1.    (5.1)

Typically, T_startup ≥ T_comm. Therefore, the total time needed to broadcast all N numbers to all P processors is given by:

    T_total = N T_comm + (P − 2) T_startup.    (5.2)

Fig. 5.7. Broadcast model 1.
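A sketch of this pipelined broadcast over a 1D chain of processes, written here with plain MPI point-to-point calls (added for illustration; a production code would normally just call MPI_Bcast): at step i every interior process forwards number i − 1 to its right neighbor while receiving number i from its left neighbor.

#include <mpi.h>

#define N 1024    /* number of values to broadcast */

int main(int argc, char **argv) {
    int rank, size;
    double x[N];
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0)
        for (int i = 0; i < N; i++) x[i] = (double)i;   /* data held by P0 */

    for (int i = 0; i <= N; i++) {
        int recv_this = (rank > 0 && i < N);            /* get x[i] from the left   */
        int send_prev = (rank < size - 1 && i > 0);     /* pass x[i-1] to the right */
        if (recv_this && send_prev)
            MPI_Sendrecv(&x[i - 1], 1, MPI_DOUBLE, rank + 1, i - 1,
                         &x[i],     1, MPI_DOUBLE, rank - 1, i,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        else if (recv_this)
            MPI_Recv(&x[i], 1, MPI_DOUBLE, rank - 1, i, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        else if (send_prev)
            MPI_Send(&x[i - 1], 1, MPI_DOUBLE, rank + 1, i - 1, MPI_COMM_WORLD);
    }
    MPI_Finalize();
    return 0;
}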

Fig. 5.8. Broadcast model 2.

Fig. 5.9. Illustration of gather and scatter.

5.2.2. Gather and scatter


In this scheme, as shown in Fig. 5.9,
(1) The processors on the two ends of the arrows labeled 1 perform a
message exchange and summation.
(2) The processors on the two ends of the arrows labeled 2 perform a
message exchange and summation.
(3) The processors on the two ends of the arrows labeled 3 perform a
message exchange and summation.
At the end of the third step, all processors have the global sum.
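In a library setting, this exchange-and-sum scheme is exactly what a global reduction collective performs; a minimal sketch of obtaining the same result with a single MPI call (added here for illustration):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank;
    double local, global;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    local = (double)rank;                       /* each process contributes its value */
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    printf("process %d: global sum = %f\n", rank, global);
    MPI_Finalize();
    return 0;
}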

5.2.3. Allgather
We explain the global summation on a 3D hypercube. Processor P_i contains a number X_i for i = 0, 1, . . . , 7. We do the summation in three steps.
Step 1: Processors P0 and P1 swap contents and then add them up. So do the pairs of processors P2 and P3, P4 and P5, as well as P6 and P7. All these pairs swap and add the corresponding data in parallel. At the end of this parallel process, each processor has its partner's content added to its own, e.g. P0 now has X0 + X1.
Step 2: P0 exchanges its new content with P2 and performs the addition (other pairs follow this pattern). At the end of this stage, P0 has X0 + X1 + X2 + X3.
Step 3: Finally, P0 exchanges its new content with P4 and performs the addition (again, other pairs follow the pattern). At the end of this stage, P0 has X0 + X1 + X2 + X3 + X4 + X5 + X6 + X7.
In fact, after log2 8 = 3 stages, every processor has a copy of the global sum. In general, for a system containing P processors, the total time needed for global summation is O(log2 P).
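A sketch of this recursive-doubling global sum in C with MPI (added here for illustration; it assumes the number of processes is a power of two): at stage k each process exchanges its running sum with the partner whose rank differs in bit k and adds.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, size;
    double sum, other;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    sum = (double)rank;                       /* local value X_i */

    for (int mask = 1; mask < size; mask <<= 1) {
        int partner = rank ^ mask;            /* rank differing in one bit */
        MPI_Sendrecv(&sum, 1, MPI_DOUBLE, partner, mask,
                     &other, 1, MPI_DOUBLE, partner, mask,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        sum += other;                         /* swap and add */
    }
    printf("process %d: global sum = %f\n", rank, sum);
    MPI_Finalize();
    return 0;
}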
5.3. Mapping Tasks to Processors
Figure 5.11 charts the pathway of mapping an application to the topology of the computer. Several mapping functions, as shown in Fig. 5.10, can be implemented to realize the following mappings:
(1) Linear mapping.
(2) 2D mapping.
(3) 3D mapping.
(4) Random mapping: flying processors all over the communicator.
(5) Overlap mapping: convenient for communication.
(6) Any combination of the above.
Two benefits are performance gains and code readability.
In Table 5.1, we illustrate the mapping of a 1D application to a 2D computer architecture.
This helps create a balanced distribution of processes per coordinate dimension, depending on the number of processes in the group to be balanced and optional constraints specified by the user. One possible use is to partition all processes into an n-dimensional topology. For example,

Call                           Dims returned
MPI_Dims_create(6, 2, dims)    3, 2
MPI_Dims_create(7, 2, dims)    7, 1 or 1, 7
MPI_Dims_create(6, 3, dims)    3, 2, 1 or 2, 3, 1 or 1, 2, 3
MPI_Dims_create(7, 3, dims)    no answer

MPI_Cart_get( /* Obtaining more details on a Cartesian communicator */
    IN comm,
    IN maxdims,  /* length of vectors dims, periods, coords */
    number_of_processes_in_each_dim,
    periods_in_each_dim,
    coords_of_calling_process_in_cart_structure
)
/* Cartesian translator functions: rank <===> Cartesian coordinates */
MPI_Cart_rank( /* Cartesian coordinates => rank */
    IN comm,
    IN coords,
    rank
)
MPI_Cart_coords( /* rank => Cartesian coordinates */
    IN comm,
    IN rank,
    IN maxdims,
    coords  /* coords for the specified process */
)
/* Cartesian partition (collective call) */
MPI_Cart_sub(
    IN comm,
    IN remain_dims,  /* logical array, TRUE/FALSE */
    new_comm
)

Fig. 5.10. Some Cartesian inquire functions (local calls).

Fig. 5.11. Application topology, process topology, and processor topology: this is a process-to-processor mapping.

Table 5.1. A 1D application is mapped to a 2D computer architecture.

Rank:        0      1      2      3      4      5      6      7      8
Coordinates: (0,0)  (0,1)  (0,2)  (1,0)  (1,1)  (1,2)  (2,0)  (2,1)  (2,2)

MPI_Cart_create( /* Creating a new communicator (row-major ranking) */
    old_comm,
    number_of_dimensions_of_cart,
    array_specifying_meshes_in_each_dim,
    logical_array_specifying_periods,  /* TRUE/FALSE */
    reorder_ranks,
    comm_cart  /* new communicator */
)
MPI_Dims_create(
    number_of_nodes_in_grid,
    number_of_cartesian_dims,
    array_specifying_meshes_in_each_dim
)

Fig. 5.12. MPI functions.

The functions listed in Fig. 5.12 are useful when embedding a lower
dimensional Cartesian grid into a bigger one, with each sub-grid forming a
sub-communicator.
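A sketch putting these calls together (added here for illustration; the 2D layout, periodicity, and the choice of keeping the second dimension are arbitrary): create a balanced 2D Cartesian communicator, look up each process's coordinates, and split the grid into row sub-communicators.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int size, cart_rank, coords[2];
    int dims[2] = {0, 0}, periods[2] = {1, 1};
    int remain[2] = {0, 1};      /* keep dimension 1: one sub-communicator per row */
    MPI_Comm cart, row;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    MPI_Dims_create(size, 2, dims);                  /* balanced p x q with p*q = size */
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1, &cart);
    MPI_Comm_rank(cart, &cart_rank);                 /* rank within the Cartesian comm */
    MPI_Cart_coords(cart, cart_rank, 2, coords);     /* rank -> (row, column)          */
    MPI_Cart_sub(cart, remain, &row);                /* same row -> same sub-comm      */

    printf("cart rank %d is at (%d,%d) in a %d x %d mesh\n",
           cart_rank, coords[0], coords[1], dims[0], dims[1]);

    MPI_Comm_free(&row);
    MPI_Comm_free(&cart);
    MPI_Finalize();
    return 0;
}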

For example, if comm corresponds to a 2 × 3 × 4 Cartesian grid (24 processes) and remain_dims = (true, false, true), then the resulting new_comm is a 2 × 4 sub-grid (8 processes).
All processors in a parallel computer are connected, directly or indirectly. The connectivity among processors, measured in latency, bandwidth and other metrics, is non-uniform and depends on the relative or even absolute locations of the processors in the supercomputer. Thus, the way computation modules or subtasks are assigned to nodes may impact the communication costs and the total running time. To optimize this part of the running time, we need to utilize task mapping.
According to Bokhari in his On the Mapping Problem,2 the mapping and the mapping problem can be defined as follows:
Suppose a problem made up of several modules that execute in
parallel is to be solved on an incompletely connected array. When
assigning modules to processors, pairs of modules that communicate
with each other should be placed, as far as possible, on processors
that are directly connected. We call the assignment of modules to
processors a mapping and the problem of maximizing the number
of pairs of communicating modules that fall on pairs of directly
connected processors the mapping problem.
The application and the parallel computer are represented as static graphs G_A and G_P, respectively. G_A = (V_A, E_A) is a graph where V_A represents tasks and E_A the communication requests among them, with weights being the communication loads. G_P = (V_P, E_P) is a graph where V_P represents processors and E_P the links among processors, with weights being the per-unit communication costs.
The mapping problem can then be defined as finding a mapping V_A → V_P that minimizes the objective function value associated with each mapping.

2 S. H. Bokhari, On the mapping problem, IEEE Transactions on Computers 30(3) (1981) 207–214.

5.3.1. Supply matrix
Imitating business terminology, we introduce the supply-and-demand concept for task mapping. A supply matrix of a supercomputer network

is a measure of the connectivity of the supercomputer's interconnection networks. For example, the matrix element Sij measures the network latency for initiating data transmission between processors i and j.
Hop matrix
The hop matrix measures the distance among all nodes in terms of the minimal number of hops between all pairs of nodes. Such a hop matrix is a decent approximation of the latency matrix, which measures the delays, or distances, of messages traversing pairs of nodes. We must be aware that this approximation can be rough, as collisions, among other effects, can easily occur and messages may not always travel on shortest-distance paths.
Latency matrix
The latency matrix (Fig. 5.13) is measured from the communication time for sending a 0-byte message from one node to all other nodes, which forms a matrix.
Fig. 5.13. MPI latency of a 0-byte packet on an 8 × 8 × 16 BG/L system.

Fig. 5.14. Linear regression of the latency with respect to the hop count.

As shown in Fig. 5.14, linear regression analysis of the latency with respect to the hop count shows that the hops match the latencies remarkably well, albeit with many outliers that may mislead the optimization, i.e. the task mapping. Most of the differences result from the torus in the Z-dimension of the BG/L system.

5.3.2. Demand matrix
A demand matrix is a metric measuring the amount of data that needs to be transmitted between all pairs of subtasks. Each matrix element Dij measures the total data, in bytes for example, sent from subtask i to subtask j during a period of time. Usually, the period of time is chosen as the lifetime of the application. This is static mapping, and it is quite restrictive and limiting. We may discretize the period of time into smaller intervals matching the characteristic time scales of message passing. This can be the starting point of dynamic task mapping, whose research, let alone practical software, is still at its early stages.

5.3.3. Review of mapping models
The general idea of static mapping is to build a model in which the total communication time can be calculated from the mapping. After the model is built, the problem turns into an optimization problem and thus can be solved by existing optimization techniques.
By considering different aspects and complexities, we introduce here some of the important mapping models.
Model by Bokhari
In 1981, Bokhari proposed a model that maps n tasks onto n processors to find the minimum communication cost, independent of computation costs, which can be represented as:

    min_{f_m} Σ_{x, y ∈ V_p} G_p(x, y) G_a(f_m(x), f_m(y)).    (5.3)

This model is used in the analysis of algorithms for SIMD architectures. Bokhari points out that this model can be transformed into a graph isomorphism problem. However, it cannot reflect the differences in the amounts of inter-module communication and inter-processor links.
Model by Heiss and Dormanns
In 1996, Heiss and Dormanns formulated the mapping problem as finding a mapping π: T → P that minimizes

    CC = Σ_{(i,j) ∈ E_T} a(i, j) d(π(i), π(j)),    (5.4)

where a(i, j) is the amount of data that needs to be exchanged and d(π(i), π(j)) represents the length of the shortest path between π(i) and π(j). Heiss and Dormanns introduced Kohonen's algorithm to realize this kind of topology-conserving mapping.
Model by Bhanot et al.
In 2005, a model was developed by Bhanot et al. to minimize only inter-task communication. They neglected the actual computing cost when placing tasks on processors linked by a mesh or torus, using the following model:

    min F = Σ_{i,j} C(i, j) H(i, j),    (5.5)

where C(i, j) is the communication data volume from domain i to j and H(i, j) represents the smallest number of hops on the BlueGene/L torus between the processors allocated domains i and j. A simulated annealing technique is used to map the n tasks onto the n processors of the supercomputer to minimize the communication time.
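A compact sketch of such a simulated-annealing search (added here for illustration; the objective follows Eq. (5.5), but the problem size, the toy C and H matrices, the cooling schedule, and the swap move are illustrative choices, not taken from the cited work):

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define N 64                           /* number of tasks = number of processors */

double C[N][N], H[N][N];               /* demand (C) and hop (H) matrices */

/* F = sum_{i,j} C(i,j) * H(map[i], map[j]); a real code would update this incrementally */
double cost(const int map[N]) {
    double f = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            f += C[i][j] * H[map[i]][map[j]];
    return f;
}

void anneal(int map[N]) {
    for (int i = 0; i < N; i++) map[i] = i;        /* start from the identity mapping */
    double f = cost(map);
    for (double T = 1.0; T > 1e-4; T *= 0.999) {   /* geometric cooling schedule */
        int a = rand() % N, b = rand() % N;        /* propose swapping two tasks */
        int tmp = map[a]; map[a] = map[b]; map[b] = tmp;
        double fnew = cost(map);
        if (fnew <= f || exp((f - fnew) / T) > (double)rand() / RAND_MAX) {
            f = fnew;                              /* accept (always if better)  */
        } else {
            tmp = map[a]; map[a] = map[b]; map[b] = tmp;   /* reject: undo swap  */
        }
    }
}

int main(void) {
    int map[N];
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            C[i][j] = rand() % 10;                 /* toy demand matrix          */
            H[i][j] = abs(i - j);                  /* toy hop matrix (1D chain)  */
        }
    anneal(map);
    printf("final mapping cost = %f\n", cost(map));
    return 0;
}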

5.3.4. Mapping models and algorithms


In 2008, Chen and Deng considered a more realistic and general case to model the mapping problem. Assume the application has already been appropriately decomposed into n subtasks and the computing load of each subtask is equal. The inter-subtask communication requirements can then be described by the demand matrix D of size n × n, whose entry D(t, t′) is the required data from subtask t to subtask t′.
Under the condition that the communication of the parallel computer is heterogeneous, we can further assume that the heterogeneous properties can be described by a load matrix L of size n × m, whose entry L(t, p) is the computation cost of subtask t when it is executed on processor p, and a supply matrix S of size m × m, whose entry S(p, p′) is the cost for communication between processors p and p′, where n is the number of subtasks and m is the number of processors. With the assumptions above, we can formulate the mapping problem as a quadratic assignment problem in the following way.
Let {t_1, . . . , t_n} be the set of subtasks of the problem and {p_1, . . . , p_m} be the set of heterogeneous processors of the parallel computer to which the subtasks are assigned. In general, n > m. Let x_tp be the decision Boolean variable defined as

    x_tp = 1 if subtask t is assigned to processor p, and 0 otherwise.    (5.6)

Let y_tt′pp′ be the decision Boolean variable defined as

    y_tt′pp′ = 1 if subtasks t, t′ are assigned to p, p′ respectively, and 0 otherwise.    (5.7)

The total execution time can be expressed as

    min [ Σ_{t=1}^{n} Σ_{p=1}^{m} L(t, p) x_tp + Σ_{i=1}^{k} (D_i S)_max ].    (5.8)

Subject to

    Σ_{p=1}^{m} x_tp = 1,                 t = 1, . . . , n,

    Σ_{t=1}^{n} x_tp ≥ 1,                 p = 1, . . . , m,

    Σ_{t=1}^{n} x_tp ≤ (A_p / A_t) n,     p = 1, . . . , m,    (5.9)

where k ≤ n^2 is the total number of communication batches. The term (D_i S)_max represents the maximum value of the ith batch communication.
It can be seen that the mapping problem reduces to a minimization problem. Considering the scale of modern parallel computers, exact optimization techniques often fail for problems of this size. Several heuristic techniques have been developed for searching large solution spaces, such as simulated annealing (SA), genetic algorithms (GA), evolution strategies (ES), genetic simulated annealing (GSA) and Tabu search (TS). Figure 5.15 illustrates some of the most important techniques for the mapping problem. Details of each algorithm can easily be found in the literature.

Fig. 5.15. Solution techniques for the mapping problem.
CHAPTER 6
LINEAR ALGEBRA

Linear algebra is one of the fundamental subjects of mathematics and the building block of various scientific and engineering applications such as fluid dynamics, image and signal processing, structural biology, and many others. Because of its importance, numerical linear algebra has received a massive amount of attention, for both serial and parallel computing systems, before and after the 1990s, respectively. After years of development, parallel linear algebra is among the most mature fields of parallel computing applications. LINPACK and ScaLAPACK are representative products, and the research publications by Dongarra, Golub, Greenbaum, Ortega and many others are a small subset of the rich publications in numerical linear algebra.
6.1. Problem Decomposition
A naive way of working on parallel linear algebra is to copy all the data to be computed across the processors and perform the parallelization only in computing. In this situation, most of the parallelization is straightforward, since all information is simultaneously available to all processors.
However, the size of matrices in practical applications is so large that it not only requires intensive computing power, but also exceeds the memory capacity that one can usually get on a single node. For example, a moderate mesh of 1000 × 1000 = 10^6 points can induce a matrix of size 10^6 × 10^6, which means 10^12 double-precision entries that require roughly eight terabytes of memory. Hence, in this situation, decomposition across the data is also required.
Consider a data matrix:

    M = ( M11  M12  ...  M1n
          M21  M22  ...  M2n
          ...  ...  ...  ...
          Mm1  Mm2  ...  Mmn )    (6.1)

where mn = N and P is the number of processors; usually, N > P. Otherwise, if N < P, parallel computing would make little sense. In most analyses, we assume N/P to be an integer; however, it is not a big problem if it is not. In cases where N/P is not an integer, we can pad with 0 elements to make it an integer. For the array of processors, we can also arrange them in a processor matrix format:

    ( P11  P12  ...  P1q
      P21  P22  ...  P2q
      ...  ...  ...  ...
      Pp1  Pp2  ...  Ppq )    (6.2)

where pq = P.
Generally, there are four methods of decomposition that are common in practical use:
(1) Row partition.
(2) Column partition.
(3) Block partition.
(4) Scatter partition.
A fifth method, random decomposition, albeit rarely used, is another possibility.


The following task assignment notation (TAN) of matrix decomposition
is introduced by one of the authors in an earlier publication. The potential of
such notation has not been fully exploited. In this notation, m @P means
that sub-matrix m (which could be a single scalar number) is assigned
to processor P . When carrying computations, communication logistics is
visually evident. For example, when multiplying two square matrices in
four processors,



a11 @1 a12 @2
b11 @1 b12 @2
C=
a21 @3 a22 @4
b21 @3 b22 @4
(6.3)


c11 c12
=
c21 c22

July 27, 2012 6:17

Applied Parallel Computing

9in x 6in

Linear Algebra

b1350-ch06

67

where
c11
c12
c21
c22

= (a11 @1) (b11 @1) + (a12 @2) (b21 @3)


= (a11 @1) (b12 @2) + (a12 @2) (b22 @4)
= (a21 @3) (b11 @1) + (a22 @4) (b21 @3)
= (a21 @3) (b12 @2) + (a22 @4) (b22 @4)

(6.4)

Obviously, calculation of (a11 b11 )@1 can be carried out on P = 1 without


any communication. Calculation of (a12 @2) (b21 @3) requires moving b21
from P = 3 to P = 2 or a12 from P = 2 to P = 3 before computing.
The advantage is that all such communication patterns that could be
excruciatingly complicated can now be mathematically manipulated. It
is not dicult to quickly expect to exploit the newly introduced task
assignment notation (TAN) for helping manipulate many other algorithms
than the matrix multiplications. In fact, this TAN for matrix multiplication
can guide us to express the algorithm by a PAT graph conveniently.
Row Partition: Assigning a subset of entire rows to a processor.

    M11@1  M12@1  ...  M1q@1
    M21@2  M22@2  ...  M2q@2
    ...    ...    ...  ...
    Mp1@p  Mp2@p  ...  Mpq@p    (6.5)

Column Partition: Assigning a subset of entire columns to a processor.

    M11@1  M12@2  ...  M1q@p
    M21@1  M22@2  ...  M2q@p
    ...    ...    ...  ...
    Mp1@1  Mp2@2  ...  Mpq@p    (6.6)

Block Partition: Assigning a continuous block of sub-matrices to a processor, a typical 2D matrix partition.

    M11@1  M12@1  ...  M1q@3
    M21@1  M22@1  ...  M2q@3
    M31@2  M32@2  ...  M3q@k
    M41@2  M42@2  ...  M4q@k
    ...    ...    ...  ...
    Mp1@4  Mp2@4  ...  Mpq@n    (6.7)

Scattered Partition: Assigning a subset of scattered sub-matrices forming a particular pattern to a processor, a type of 2D partition.

    M11@1  M12@2  ...  M1q@x
    M21@3  M22@4  ...  M2q@y
    M31@1  M32@2  ...  M3q@x
    M41@3  M42@4  ...  M4q@y
    ...    ...    ...  ...
    Mp1@1  Mp2@2  ...  Mpq@n    (6.8)
The properties of these decompositions vary widely. The table below lists their key features:

Method                                    Properties
Row partition and column partition        1D decomposition.
                                          Easy to implement.
                                          Relatively high communication costs.
Block partition and scattered partition   2D decomposition.
                                          More complicated to implement.
                                          Lower communication costs.
                                          Block partition: mismatch with underlying problem mesh.
                                          Scattered partition: potential match with underlying problem mesh.
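As a small illustration of the difference (a sketch added here, not from the text), the owner of block (i, j) on a p × q processor mesh, numbered in row-major order, can be computed for the block and scattered partitions as follows; the m × n block-grid dimensions and the sample sizes in main are arbitrary.

#include <stdio.h>

/* Contiguous block partition: rows split into p slabs, columns into q slabs. */
int owner_block(int i, int j, int m, int n, int p, int q) {
    int pr = i / ((m + p - 1) / p);    /* processor row    */
    int pc = j / ((n + q - 1) / q);    /* processor column */
    return pr * q + pc;
}

/* Scattered (cyclic) partition: blocks are dealt out round-robin in both directions. */
int owner_scattered(int i, int j, int p, int q) {
    return (i % p) * q + (j % q);
}

int main(void) {
    /* 6 x 6 blocks on a 2 x 3 processor mesh: print "block owner/scattered owner" */
    for (int i = 0; i < 6; i++) {
        for (int j = 0; j < 6; j++)
            printf("%d/%d ", owner_block(i, j, 6, 6, 2, 3), owner_scattered(i, j, 2, 3));
        printf("\n");
    }
    return 0;
}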

6.2. Matrix Operations


Matrix-matrix multiplication is the fundamental operation in linear algebra, and matrix-vector multiplication is the basic operation in matrix-matrix multiplication. To learn matrix-matrix multiplication, we start from matrix-vector multiplication.

6.2.1. Matrix-vector multiplications
When multiplying a matrix and a vector, we compute:

    A b = c    (6.9)

Method I: Replicating the vector on all processors

Matrix A: Row partition matrix A according to Eq. (6.5):

    A11@1  A12@1  ...  A1q@1
    A21@2  A22@2  ...  A2q@2
    ...    ...    ...  ...
    Ap1@p  Ap2@p  ...  Apq@p

Vector b: The elements of vector b are given to each processor i, for each integer i in [1, p]:

    ( b1@i, b2@i, b3@i, . . . , bn@i )    (6.10)
We can obtain the results on individual processors for the resulting vector c.
The PAT graph for matrix-vector multiplication Method I is as given in
Fig. 6.1.
Timing Analysis: Each processor now works on an n × (n/p) matrix with a vector of length n. The time on each processor is given by the following. On P processors,

    T_comp(n, p) = c n (n/p)    (6.11)

Fig. 6.1. PAT graph for matrix-vector multiplication Method I.

The communication time to form the final vector is:

    T_comm(n, p) = d p (n/p)    (6.12)

Thus, the speedup is:

    S(n, p) = T(n, 1) / [T_comp(n, p) + T_comm(n, p)] = P / (1 + c p/n)    (6.13)
Remarks
(1) Overhead is proportional to p/n and speedup increases with n/p.
(2) This method is quite easy to implement.
(3) Memory is not parallelized (redundant use of memory).
(4) Applications where such decompositions are useful are rare.
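A sketch of Method I in C with MPI (added here for illustration; the matrix entries are filled with 1s just to make the program self-contained, and n is assumed to be divisible by the number of processes): each process holds n/p rows of A and a full copy of b, computes its slice of c, and an MPI_Allgather assembles the full result everywhere.

#include <mpi.h>
#include <stdlib.h>

#define N 512

int main(int argc, char **argv) {
    int rank, p;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    int rows = N / p;                              /* local rows of A   */
    double *A = malloc(rows * N * sizeof(double)); /* local slab of A   */
    double *b = malloc(N * sizeof(double));        /* replicated vector */
    double *c_loc = malloc(rows * sizeof(double)); /* local piece of c  */
    double *c = malloc(N * sizeof(double));        /* full result       */

    for (int i = 0; i < rows * N; i++) A[i] = 1.0;
    for (int i = 0; i < N; i++) b[i] = 1.0;

    for (int i = 0; i < rows; i++) {               /* local matrix-vector product */
        c_loc[i] = 0.0;
        for (int j = 0; j < N; j++) c_loc[i] += A[i * N + j] * b[j];
    }
    /* assemble the distributed pieces of c on every process */
    MPI_Allgather(c_loc, rows, MPI_DOUBLE, c, rows, MPI_DOUBLE, MPI_COMM_WORLD);

    free(A); free(b); free(c_loc); free(c);
    MPI_Finalize();
    return 0;
}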

Method II: Row partition both A and b

Matrix A: Row partition matrix A according to Eq. (6.5):

    A11@1  A12@1  ...  A1q@1
    A21@2  A22@2  ...  A2q@2
    ...    ...    ...  ...
    Ap1@p  Ap2@p  ...  Apq@p

Vector b: In this case, the element b_i is given to the ith processor:

    ( b1@1, b2@2, b3@3, . . . , bn@p )    (6.14)

Step 1: Each processor multiplies its diagonal element of A with its element of b.

    @1: A11 b1
    @2: A22 b2
    ...
    @p: App bp

Step 2: Roll up the elements in the processor vector.

Vector b:

    ( b2@1, b3@2, b4@3, . . . , b1@p )    (6.15)

Then, multiply the next off-diagonal elements with the local vector elements:

    @1: A12 b2
    @2: A23 b3
    ...
    @p: Ap1 b1

Step 3: Repeat Step 2 until all elements of b have visited all processors.
The PAT graph for matrix-vector multiplication Method II is as in Fig. 6.2.
Timing Analysis: Each processor now works on an n × (n/p) matrix with a vector of length n. The time on each processor is given by the following. On P processors,

    T_comp(n, p) = c n (n/p)    (6.16)

The communication time to roll up the elements of vector b and form the final vector is:

    T_comm(n, p) = d p (n/p)    (6.17)

Fig. 6.2. PAT graph for matrix-vector multiplication Method II.

Thus, the speedup is:

    S(n, p) = T(n, 1) / [T_comp(n, p) + T_comm(n, p)] = P / (1 + c p/n)    (6.18)

Remarks
(1) Overhead is proportional to p/n and speedup increases with n/p.
(2) This method is not as easy to implement.
(3) Memory use, as well as communication, is parallelized.
(4) Communication cost is higher than Method I.
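The repeated "roll up" of b is a cyclic shift around the ring of processes; a sketch of that single communication step (added here for illustration), using MPI_Sendrecv_replace so each process passes its block of b to the previous process and receives the block of the next:

#include <mpi.h>

/* One roll-up step: ship my block of b (length blk) to rank-1, receive from rank+1. */
void roll_up(double *b_block, int blk, MPI_Comm comm) {
    int rank, p;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);
    int prev = (rank - 1 + p) % p;     /* destination of my block */
    int next = (rank + 1) % p;         /* source of my new block  */
    MPI_Sendrecv_replace(b_block, blk, MPI_DOUBLE, prev, 0, next, 0,
                         comm, MPI_STATUS_IGNORE);
}

int main(int argc, char **argv) {
    double b_block[4];
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    for (int i = 0; i < 4; i++) b_block[i] = rank;   /* toy block of b           */
    roll_up(b_block, 4, MPI_COMM_WORLD);             /* block now came from rank+1 */
    MPI_Finalize();
    return 0;
}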

Method III: Column partition A and row partition b

Matrix A: Column partition matrix A according to Eq. (6.6):

    A11@1  A12@2  ...  A1q@p
    A21@1  A22@2  ...  A2q@p
    ...    ...    ...  ...
    Ap1@1  Ap2@2  ...  Apq@p

Vector b: The element b_i is given to the ith processor as in Eq. (6.14).

Step 1: Each processor i multiplies A1i with b_i.

    @1: A11 b1
    @2: A12 b2
    ...
    @p: A1p bp

The results are then lumped to form the element c1 of the vector c.
The PAT graph for matrix-vector multiplication Method III is as in Fig. 6.3.
Step 2: Repeat Step 1 for all rows to form the complete resulting vector c.
Timing Analysis: Each processor now works on an n × (n/p) matrix with a vector of length n. The time on each processor is given by the following.
July 27, 2012 6:17

Applied Parallel Computing

9in x 6in

b1350-ch06

73

Linear Algebra

Fig. 6.3.

PAT graph for matrix-vector multiplication Method III.

On P processors,
Tcomp(n, p) = cn

n
p

(6.19)

The communication required to lump all elements to form one element of c


is given by:
Tcomm (n, p) = d p

n
p

(6.20)

Thus, the speedup is,


S(n, p) =

T (n, 1)
P
=

Tcomp(n, p) + Tcomm (n, p)
1 + cnp

(6.21)

Remarks
(1)
(2)
(3)
(4)

Overhead is proportional to p/n and speedup increases with n/p.


This method is not as easy to implement.
Memory use, as well as communication, is parallelized.
Communication cost is higher than method I.

6.2.2. Matrix-matrix multiplications


When multiplying two matrices, we compute
AB = C

(6.22)

July 27, 2012 6:17

Applied Parallel Computing

74

9in x 6in

b1350-ch06

Applied Parallel Computing

where,

A11
A21

A= .
..

A12
A22
..
.

Am1 Am2

B11 B12
B21 B22

B= .
..
..
.
Bn1 Bn2

..
.

A1n
A2n

..
.

Amn

B1k
B2k

..
..
.
.
Bnk

(6.23)

(6.24)

The complexity in multiplying A and B is O(n3 ). Implementation of matrix


multiplication algorithms on a serial computer requires only four meaningful
lines of program as shown in Fig. 6.4:
Implementation of matrix multiplication algorithms on a parallel
computer, however, is an entirely dierent matter.
Method I: Row-Column partition (Ring Method) (Fig. 6.5)
for ( I = 1; I <= n; I++) {
for ( J = 1; J <= n; J++) {
for ( K = 1; K <= n; K++) {
C[I][J] += A[I][K] * B[K][j];
}
}
}
Fig. 6.4.

Fig. 6.5.

This is the sequential code for matrix multiplication.

PAT graph for matrix-matrix multiplication Method I.

July 27, 2012 6:17

Applied Parallel Computing

9in x 6in

b1350-ch06

75

Linear Algebra

Step 0: Row partition matrix A and column partition matrix B.

A11 @1
A12 @1 A1n @1

A22 @2 A2n @2
A21 @2

A32 @3 A3n @3
A = A31 @3

..
..
..
..

.
.
.
.

Amx @m

B11 @1 B12 @2 B1k @k

B21 @1 B22 @2 B2k @k

B = B31 @1 B32 @2 B3k @k


.

.
.
..
.
..
..
.
.

Bn1 @1 Bn2 @2 Bnk @k

Am2 @m

Am1 @m

Step 1: Multiply row 1 of A with column 1 of B to create c11 by processor 1,


multiply row 2 of A with column 2 of B to create c22 by processor 2, and so
on. Therefore, the diagonal elements of the matrix c are created in parallel.

C =

C11

C22
..

.
Cmm

where,
@1:

C11 =

A1i Bi1

i=1

@2:

C22 =

A2i Bi2

i=1

..
.
@m: Cmm =

..
.
n

A1i Bim

i=1

Step 2: Roll up all rows in A by one processor unit to form A then multiply
As corresponding rows with Bs and columns to form the rst o diagonal

July 27, 2012 6:17

Applied Parallel Computing

76

9in x 6in

b1350-ch06

Applied Parallel Computing

C elements. Now, A looks like this:

A2n @1
A3n @2

A4n @3

..

A21 @1
A31 @2

A = A41 @3
.
..

A22 @1
A32 @2
A42 @3
..
.

..
.

A11 @m

A12 @m

A1n @m

and C looks like,

C11
C21

C=

C22
..
.

C1m
..

.
Cm,m1

Cmm

where
@1:
@2:

C21 =
C32 =

n

i=1
n

A2i Bi1
A3i Bi2

i=1

..
.
@m: C1m =

..
.
n

A1i Bim

i=1

Step 3: Repeat Step 2 until all rows of A have passed through all processors.
Timing Analysis: On one processor,
Tcomp(n, 1) = cn3 tcomp

(6.25)

where tcomp is the time needed to perform unit computations, such as


multiplying a pair of numbers.
On P processors, the total cost is:
T (n, p) Tcomp (n, p) + Tcomm (n, p)

(6.26)

July 27, 2012 6:17

Applied Parallel Computing

9in x 6in

Linear Algebra

The communication cost to roll the rows of A is,


 2
n
Tcomm (n, p) = (p 1)
tcomm n2 tcomm
p

b1350-ch06

77

(6.27)

The computation cost for multiplying a matrix of size n/p with a matrix of
size n is:
 3
n
cn3 tcomp
(6.28)
tcomp =
Tcomp (n, p) = p2 c
p
p
Thus, the speedup is:
S(n, p) =

P
T (n, 1)
=

Tcomp(n, p) + Tcomm (n, p)
1 + cnp

(6.29)

Remarks
(1) Overhead is proportional to p/n and speedup increases with n/p.
(2) Overhead is proportional to Tcomm /Tcomp and speedup decreases while
Tcomm /Tcomp increases.
(3) The previous two comments are universal for parallel computing.
(4) Memory is parallelized.
Method II: Broadcast, Multiply, Roll (BMR Method)
Step 0: Block partition both A and B.

A11 @1 A12 @2 A13 @3

A = A21 @4 A22 @5 A23 @6


A31 @7 A32 @8 A33 @9

B11 @1 B12 @2 B13 @3

B = B21 @4 B22 @5 B23 @6


B31 @7 B32 @8 B33 @9
Step 1: Broadcast all diagonal elements of A to the individual processors
on each row. In other words, the elements of A will be distributed to the
processors in the following way:

A11 @1 A11 @2 A11 @3

A22 @4 A22 @5 A22 @6


A33 @7 A32 @8 A33 @9

July 27, 2012 6:17

78

Applied Parallel Computing

9in x 6in

b1350-ch06

Applied Parallel Computing

Step 2: Multiply row 1 of A with row 1 of B, multiply row 2 of A with row


2 of B, etc., to produce part of matrix C.
@1: A11 B11 C11 (partial)
@2: A11 B12 C12 (partial)
@3: A11 B13 C13 (partial)
@4: A22 B21 C21 (partial)
@5: A22 B22 C22 (partial)
@6: A22 B23 C23 (partial)
@7: A33 B31 C31 (partial)
@8: A33 B32 C32 (partial)
@9: A33 B33 C33 (partial)
Step 3: Broadcast the next diagonal elements of A, roll up the rows of B
and then multiply as in Step 2. Matrix A and B become

A12 @1 A12 @2 A12 @3

A = A23 @4 A23 @5 A23 @6


A31 @7

B21 @1

B = B31 @4
B11 @7

A31 @8

A31 @9

B22 @2 B23 @3

B32 @5 B33 @6
B12 @8 B13 @9

@1: A12 B21 C11 (partial)


@2: A12 B22 C12 (partial)
@3: A12 B23 C13 (partial)
@4: A23 B31 C21 (partial)
@5: A23 B32 C22 (partial)
@6: A23 B33 C23 (partial)
@7: A31 B11 C31 (partial)
@8: A31 B12 C32 (partial)
@9: A31 B13 C33 (partial)
Step 4: Repeat Step 3 until all B elements have traveled through all
processors, and then gather all the elements of C (Fig. 6.6).

July 27, 2012 6:17

Applied Parallel Computing

9in x 6in

Linear Algebra

Fig. 6.6.

b1350-ch06

79

PAT graph for matrix-matrix multiplication Method II.

Timing Analysis: On one processor,


Tcomp(n, 1) = 2n3 tcomp

(6.30)

where tcomp is the time needed to perform unit computations, such as


multiplying a pair of numbers.
On P processors, the total cost, including the broadcast (communication), sub-matrix multiply, and rolling up of the matrix elements, is:
T (n, p) TB (n, p) + TM (n, p) + TR (n, p)

(6.31)

The broadcast cost is:


TB (n, p) = m2 tcomm + (q 2)tstartup

(6.32)

where m = n/p.
The cost of sub-matrix multiplication is:
TM (n, p) = 2m3 tcomp

(6.33)

The cost of rolling up the rows of the matrix is:


TR (n, p) = 2m2 tcomm

(6.34)

July 27, 2012 6:17

Applied Parallel Computing

80

9in x 6in

b1350-ch06

Applied Parallel Computing

Thus, the speedup is:


S(n, p) =

T (n, 1)
TB (n, p) + TM (n, p) + TR (n, p)

and the overhead is:

h(n, p) =

1
P 2q
+
np
2 2



tcomm
tcomp

(6.35)


(6.36)

Remarks
(1) Overhead is proportional to p/n and speedup increases with n/p.
(2) Overhead is proportional to Tcomm /Tcomp and speedup increases while
Tcomm /Tcomp decreases.
(3) The above two comments apply commonly to many parallel algorithms.
(4) Memory is parallelized.
Method III: Strassen Method
Matrix multiplication requires O(N 3 ) operations. For example, suppose
we perform AB = C where




a11 a12
b11 b12
A=
, B
a21 a22
b21 b22
How many multiplication operations are required? The answer is 23 = 8.
a11 b11 + a12 b21
a12 b12 + a12 b22
a21 b11 + a22 b21
a21 b12 + a22 b22

c11
c12
c21
c22

(6.37)

The same applies to a 3 3 matrix. There are 33 = 27 operations.


Now, if we were to rearrange the operations in the following way,
S1
S2
S3
S4
S5
S6
S7

= (a11 + a22 ) (b11 + b22 )


= (a21 + a22 ) b11
= a11 (b12 b22 )
= a22 (b11 + b21 )
= (a11 + a12 ) b22
= (a11 + a21 ) (b11 + b12 )
= (a12 a22 ) (b21 + b22 )

(6.38)

July 27, 2012 6:17

Applied Parallel Computing

Linear Algebra

9in x 6in

b1350-ch06

81

we can express cij in terms of Sk ,


C11 = S1 + S4 S5 + S7
C12 = S3 + S5
C21 = S2 + S4
C22 = S1 + S3 S2 + S5

(6.39)

This reduces the number of multiplications from eight to seven. For large
matrices, we can always reduce it to 7/8 and then get the computing
time from N log2 8 = 3N to log2 7 2.8N . Of course, we have many more
additions, namely ve times as many. For large N , however, the cost of the
operations will outweigh the cost of the + operations.

6.3. Solution of Linear Systems


A system of linear equations, or a linear system, can be written in the matrix
form
Ax = b

(6.40)

where x is the vector of unknown variables and A is n n coecient matrix


where n2 = N . Generally, there are two ways of solving the system: direct
methods and iterative methods. Direct methods oer exact solution for the
system while iterative methods give fast approximation especially for sparse
matrices.

6.3.1. Direct methods


For dense linear systems, it is rare to nd better methods than the so-called
direct methods, although it will cost as much as O(N 3 ) of computation. Here
we discuss the classical method of Gaussian elimination. In this method, we
transform the coecient matrix A in (6.40) to an upper triangular matrix U
and then solve the system by back-substitution.
Simple Gaussian Elimination
The sequential algorithm for Gaussian elimination is actually straightforward. Figure 6.7 gives the pseudo-code that constitutes three nested loops
for Gaussian elimination.

July 27, 2012 6:17

Applied Parallel Computing

82

9in x 6in

b1350-ch06

Applied Parallel Computing

u = a
for i = 1 to n - 1
for j = i+ 1 to n
l = u(j,i)/u(i,i)
for k = i to n
u(j,k) = u(j,k) l * u(i,k)
end
end
end

Fig. 6.7.

Pseudo-code for sequential Gaussian elimination.

From the pseudo-code, it is easy to conclude that the operation count


for the Gaussian elimination is:
Tcomp (n, 1) =

2 3
n tcomp
3

(6.41)

where tcomp is the time for unit computation.


Back Substitution
After the elimination, the system became of the form
Ux = b

(6.42)

where U is an upper-triangular matrix. To nally solve the system, we have


to perform the back substitution. That is, the last element of the vector of
the unknown variables can be directly solved from the last equation, and
by substituting it into the rest of the equations, it can generate a similar
system of one lower degree. By repeating the process, x can thus be solved.
The method of parallelization for solving the linear system depends on
the way the matrix is decomposed. In practice, it is unnatural to consider
scattered partition for solving linear systems, so we only discuss the three
ways of matrix decompositions.
Row Partition
Row partition is among the most straightforward methods that parallelizes
the inner most loop. For simplicity, when demonstrating this method, we
assume the number of processors to be equal to the number of dimensions
i.e. p = n, but it is quite easy to port it into situations that p < n.
Step 0: Partition the matrix according to its rows.
Step 1: Broadcast the rst row to other rows.

July 27, 2012 6:17

Applied Parallel Computing

9in x 6in

b1350-ch06

83

Linear Algebra

Step 2: For every element in row i subtract the value of the corresponding
element in rst row multiplied by A(0,0)
A(i,0) . By doing this, the
elements in the rst column of the matrix become all zero.
Step 3: Repeat Steps 1 and 2 on the right-lower sub matrix until the
system becomes an upper triangle matrix.
(0,0)

(0,1)

(0,2)

(0,3)

(0,4)

(1,1)

(1,2)

(1,3)

(1,4)

(2,1)

(2,2)

(2,3)

(2,4)

(3,1)

(3,2)

(3,3)

(3,4)

(4,1)

(4,2)

(4,3)

(4,4)

In this method, as we can see, every step has one less node involving in
the computation. This means the parallelization is largely imbalanced. The
total computation time can be written as:
Tcomp(n, p) =

n3
tcomp
p

(6.43)

where tcomp is the time for unit operation. Compare with the serial case
(6.42), we know that roughly 1/3 of the computing power is wasted in
waiting.
During the process, the right-hand side can be treated as the last column
of the matrix. Thus, it can be distributed among the processor and thus no
special consideration is needed.
For the back substitution, we can parallelize by the following procedure:
Step 1: Solve the last unsolved variable in the last processor.
Step 2: Broadcast the value of the solved variable.
Step 3: Every processor plugs this variable and updates its corresponding
row.
Gaussian Elimination with Partial Pivoting
Like the serial cases, the algorithm might introduce big s when subtracting
two very close numbers. Partial pivoting is necessary in the practical use.

July 27, 2012 6:17

Applied Parallel Computing

84

9in x 6in

b1350-ch06

Applied Parallel Computing

That is, at each step, we nd the row of the sub-matrix that has the largest
rst column elements and calculate based on this row instead of the rst one.
This requires nding the largest elements among processors that adds more
communication in the algorithm. An alternative way to solve this problem
is to use the column partition.
Column Partition
Column partition gives an alternative way of doing 1D partition. In this
method, less data will be transferred in each communication circle. This
gives smaller communication overhead and easier for pivoting. However,
the problem with load imbalance still existed in this method.
Step
Step
Step
Step
Step

0:
1:
2:
3:
4:

Partition the matrix according to its columns.


Determine the largest element in the rst column (pivot).
Broadcast the pivot and its row number to other processors.
Every processor updates its other elements.
Repeat Steps 1 to 3 until nish.

This method has same overall computation time but much less communication when pivoting is needed.
Block Partition
We illustrate the idea by a special example 20 20 matrix solved by 16
processors. The idea can be easily generalized to more processors with a
large matrix.
We perform the elimination with the following three steps.
Step 1: Broadcast column 0 to other column processors. Note, at this

step, only p 1 processors need to be sent the column 0 elements.


Step 2: Normalize column 0, then broadcast row 0 to the other row

processors. Note, at this step, only p 1 processors need to be


sent to the row 0 elements. Eliminate column 0.
Step 3: Move to the next row and column and repeat Steps 1 and 2 until
the last matrix element is reached.
Remarks
(1) There are many other schemes for solving Ax = b; for example, row
partition or column partition.
(2) This method is, in fact, quite ecient but the load imbalance problem
persists.

July 27, 2012 6:17

Applied Parallel Computing

9in x 6in

Linear Algebra

b1350-ch06

85

(3) It is quite unnatural to scatter matrix elements to processors in such a


strange fashion as demonstrated.
(4) This method is also dicult to implement.
LU Factorization
In the practical applications, people usually need to solve a linear system
Ax = b
for multiple times with dierent b while A is constant. In this case, a predecomposed matrix that oers fast calculation of solution with dierent bs
is desirable. To achieve this, LU decomposition is a good method.
Since LU factorization is actually closely related with Gaussian
elimination, the algorithm is almost identical. The L matrix can be obtained
by saving the l value in Fig. 6.7 and the resulting matrix in Gaussian
elimination is just the U matrix. Thus, the pseudo code (Fig. 6.8) for LU
factorization looks like the following.
Thus, it is easy to see that the parallelization of the LU factorization
is very similar to that of the Gaussian elimination as shown in Fig. 6.9.
6.3.2. Iterative methods
There are rich families of serial algorithms for sparse linear systems, but
designing ecient parallel algorithms for such systems has been a challenge,
due to the lack of parallelism in these systems.
General Description
The P processors are used to solve sparse tri- (3-), 5-, and 7-diagonal
systems (also known as banded systems) of the form Ax = b. For example,
u = a
for i = 1 to n - 1
for j = i+ 1 to n
l(j,i) = u(j,i)/u(i,i)
for k = i to n
u(j,k) = u(j,k) l(j,i) * u(i,k)
end
end
end

Fig. 6.8.

Pseudo-code for sequential Gaussian elimination.

July 27, 2012 6:17

Applied Parallel Computing

86

9in x 6in

b1350-ch06

Applied Parallel Computing

Fig. 6.9. This is an example of parallel gaussian elimination, i.e. a parallel algorithm
used to solve Ax = b. The numbers shown in the matrix are the processor IDs. They
indicate which matrix element is stored in which processor. This example shows a matrix
of size 20 20 to be solved by 16 processors. Obviously, this choice can be easily
generalized.

we may have

a11
a21
A=

a12
a22
a32

a23
a33
a43

,
a34
a44

x1
x2

x=
x3 ,
x4


b1
b2

b=
b3

(6.44)

b4

Classical Techniques

Relaxation
Conjugate-gradient
Minimal-residual
u(k) = T u(k1) + C

(6.45)

where u(0) are initial values and T is the iteration matrix, tailored by the
structure of A.
6.3.3. ADI
The alternating direction implicit (ADI) method was rst used in 1950s by
Peaceman and Rachford for solving parabolic PDEs. Since then, it has been
widely used in many other applications as well.

July 27, 2012 6:17

Applied Parallel Computing

Linear Algebra

9in x 6in

b1350-ch06

87

Varga has rened and extended the discussions of this method. Its
implementations on dierent parallel computers are discussed in various
books and papers on parallel linear algebra.
We have the following implementations:
I. ADI on a 1D array of processors.
II. ADI on a 2D mesh of processors.
III. ADI on other architectures.

July 27, 2012 6:17

Applied Parallel Computing

9in x 6in

This page intentionally left blank

b1350-ch06

July 27, 2012 6:17

Applied Parallel Computing

9in x 6in

b1350-ch07

CHAPTER 7
DIFFERENTIAL EQUATIONS

Another family of numerical algorithms involves solution of dierential


equations. Everything starts from the basics of numerical integration and
dierentiation. This chapter will discuss solutions of several classic linear
partial dierential equations after introducing the integration methods.

7.1. Integration and Dierentiation


As one of the standard embarrassingly parallel problems, numerical
integration is CPU time-consuming, particularly in high dimensions. Three
integration methods for one-dimensional integrands are shown here and they
can be easily generalized to higher dimensions.

7.1.1. Riemann summation for integration


Given the integral

I=

f (x)dx

(7.1)

slice the interval [a, b] into N mesh blocks, each of equal size x = ba
N ,
shown in (7.1). Suppose xi = a + (i 1)x, the integral is approximated
as in (Fig. 7.1):
I

N


f (xi )x

(7.2)

i=1

This is the Riemann summation approximation of the integral. More


sophisticated approximation techniques, e.g. the Trapezoid Rule, Simpsons
Rule, etc. can also be easily implemented. The PAT graph can be drawn as
shown in Fig. 7.2.
89

July 27, 2012 6:17

90

Applied Parallel Computing

9in x 6in

b1350-ch07

Applied Parallel Computing

R
Fig. 7.1. I = ab f (x)dx. In the Riemann sum, the totall area A under the curve f (x)
is the approximation of the integral I. With 2 processors j and j + 1, we can compute
A = Aj + Aj+1 completely in parallel. Each processor only needs to know its starting
and ending points, in addition to the integrand, for partial integration. At the end, all
participating processors perform a global summation to add up partial integrals obtained
by individual processors.

Fig. 7.2.

PAT graph for integration.

7.1.2. Monte Carlo method for integration


Unlike the deterministic Riemann summation, the computational load of
the Monte Carlo integration does not grow when the number of dimensions
grows. In fact, the convergence rate of Monte Carlo integration only depends
on the number of trail evaluated. That is, for integration

g(x)dx
(7.3)
I=
D

we can obtain an approximation as:


1
Im = [g(x(1) ) + + g(x(m) )]
m

(7.4)

July 27, 2012 6:17

Applied Parallel Computing

Dierential Equations

9in x 6in

b1350-ch07

91

where x(i) s are random samples from D. According to the central limit
theorem, the error term,

m(Im I) N (0, 2 )
(7.5)
where 2 = var{g(x)}. This means the convergence rate of the Monte Carlo
method is O(m1/2 ), regardless of the dimension of D. This property gives
Monte Carlo method an advantage when integrating in high dimensions.

7.1.3. Simple parallelization


The most straightforward parallelization of Monte Carlo method is to let
(i)
every processor generate its own random values g(xp ). Communication
occurs once at the end of calculation for aggregating all the local approximations and produces the nal result. Thus, the parallel eciency will tend
to close to 100%.
However, the true eciency in terms of overall convergence is not
achievable in practice. The more processors used in the calculation, the
more points of evaluation is needed for the same level of accuracy. This is
mainly caused by the collision of the random sampling (i.e. repetition of
same number in dierent processors). due to the pseudo-random number
generator commonly used in computing.
With P = 1 processor making m samples, the error converges as:


1
.
1 O
m
With P processors making m samples each, the error convergence is
expected to be,


1
P O
mP
In fact, we are not precisely sure how the overall convergence varies
with increasing number of processors. With a variation of the sampling,
we may resolve the random number collision problem. For example, we
slice the integration domain D into P sub-domains and assign each subdomain to a processor. Each processor is restricted to sample in its
designated sub-domain. The rest is similar to method I: Ordinary dierential
equations.

July 27, 2012 6:17

Applied Parallel Computing

92

9in x 6in

b1350-ch07

Applied Parallel Computing

7.2. Partial Dierential Equations


We discuss solutions of several hyperbolic, parabolic and elliptical dierential equations, describe the algorithms, and analyze their performances.

7.2.1. Hyperbolic equations


Wave equations are representative of hyperbolic equations, which are the
easiest dierential equations to parallelize due to their inherent nature of
locality. Thus, there exist many methods to solve these equations; nitedierence or nite-element methods are the major ones. Each is either an
explicit scheme or implicit depending on the method chosen. The choice
may depend on the requirements for the solution, or it may be dictated by
the property of the partial dierential equation (PDE).
When it is implicit, we will have an equation AX = b to solve. Very
often A is a spare matrix, thus an indirect method or iterative method
is used. When A is dense, a direct method such as LU decomposition or
Gaussian elimination is used. The simplest method is an explicit scheme,
as those methods can be parallelized more easily.
1D Wave equation:
We rst demonstrate parallel solution of 1D wave equations with the
simplest method, i.e. an explicit nite dierence scheme. Consider the
following 1D wave equation:

utt = c2 uxx , t > 0
(7.6)
Proper BC and IC
with conveniently given initial and boundary conditions.
It is very easy to solve this equation on sequential computers, but on
a parallel computer, it is a bit more complex. We perform nite dierence
operation (applying central dierences on both sides):
uki+1 + uki1 2uki
+ uik1 2uki
uk+1
i
=
t2
x2

(7.7)

k1
= u(x x, t t). Thus, we get an update scheme:
where ui1

= f (uik1 uki , . . . , uki+1 uki1 )


uk+1
i

(7.8)

July 27, 2012 6:17

Applied Parallel Computing

Dierential Equations

9in x 6in

b1350-ch07

93

After decomposing the computational domain into sub-domains, each


processor is given a sub-domain to solve the equation. When performing
updates for the interior points, each processor can behave like a serial
processor. However, when a point on the processor boundary needs
updating, a point from a neighboring processor is needed. This requires
communication. Most often, this is done by building buer zones for the
physical sub-domains. After updating its physical sub-domain, a processor
will request that its neighbors (2 in 1D, 8 in 2D, and 26 in 3D) send their
boarder mesh point solution values (the number of mesh point solution
values sent depends on the numerical scheme used) and in the case of
irregular and adaptive grid, the point coordinates. With this information,
this processor then builds a so-called virtual sub-domain that contains the
original physical sub-domain surrounded by the buer zones communicated
from other processors (Figs. 7.3 and 7.4).
Performance analysis: Suppose M is the total number of mesh points,
uniformly distributed over p processors. Also, let tcomp be the time to
update a mesh point without communication and let tcomm be the time

Fig. 7.3. Decomposition of computational domain into sub-domains for solving wave
equations.

July 27, 2012 6:17

Applied Parallel Computing

94

9in x 6in

b1350-ch07

Applied Parallel Computing

Fig. 7.4. PAT of decomposition of computational domain into sub-domains for solving
wave equations.

to communicate one mesh point to a neighboring processor. The total time


for one processor to solve such a problem is:
T (1, M ) = M tcomp

(7.9)

With p processors, the time is:


M
tcomp + 2tcomm
p

(7.10)

p
T (1, M )
=
tcomm
T (p, M )
1 + 2p
m tcomp

(7.11)

2p tcomm
M tcomp

(7.12)

T (p, M ) =
Thus, the speedup is:
S(p, M ) =
and the overhead is:

h(p, M ) =

It is obvious that this algorithm is likely the


(a) The simplest parallel PDE algorithm.
(b) h(P, N ) P/N .
(c) h(P, N ) tc /t0 .
It is quite common for algorithms for solving PDEs to have such overhead
dependencies on granularity and communication-to-computation ratio.
We can easily observe the following:
P
1
= m
where m is the number of mesh
(a) Overhead is proportional to M
points on each processor. This means that the more mesh points each
processor holds, the smaller the overhead.

July 27, 2012 6:17

Applied Parallel Computing

9in x 6in

95

Dierential Equations

(b) Overhead is proportional to


parameter.

tcomm
tcomp .

b1350-ch07

This is a computer-inherent

2D Wave equation
Consider the following wave equation:
utt = c2 (uxx + uyy ),

t>0

(7.13)

with proper BC and IC. It is very easy to solve this equation on a sequential
computer, but on a parallel computer, it is a bit more complex. We perform
a little dierence operation (applying central dierences on both sides):
k1
+ uij
2ukij
uk+1
uki+1,j + uki1,j 2ukij
ukij+1 + uki,j1 2ukij
ij

+
t2
x2
y 2
(7.14)

After further simplication,


k1
= f (uij
, ukij , uki+1,j , uki1,j , uki,j+1 , uki,j1 )
uk+1
ij

Fig. 7.5.

Decomposing the computational domain into sub-domains.

(7.15)

July 27, 2012 6:17

Applied Parallel Computing

96

9in x 6in

b1350-ch07

Applied Parallel Computing

7.2.2. 3D Heat equation


Consider the following 3D heat equation:

ut = uxx + uyy + uzz

(7.16)

Proper BC and IC
After applying nite dierences (forward-dierence for the rst-order time
derivative and central-dierence for the second-order spatial derivatives),
we obtain:
t
ut+1
i,j,k + ui,j,k

uti+1,j,k + uti1,j,k 2uti,j,k


uti,j+1,k + uti,j1,k 2uti,j,k
+
x2
y 2
+

uti,j,k+1 + uti,j,k1 2uti,j,k


z 2

(7.17)

Assuming x = y = z = t = 1, we get:
t
t
t
t
t
t
ut+1
i,j,k = ui,j,k + (ui+1,j,k + ui1,j,k ) + (ui,j+1,k + ui,j1,k ) + (ui,j,k+1

+ uti,j,k1 ) 6uti,j,k

(7.18)

Therefore, solution of heat equation is equivalent to solving Poisson equation


at every time step. In other words, we need to solve the Poisson equation
(by iteration) at every time step.

7.2.3. 2D Poisson equation


Consider the following 2D Poisson equation:

uxx + uyy = f (x, y)

(7.19)

Proper BC
Applying central dierences on both sides, we get:
ui+1,j + ui1,j 2ui,j
ui,j+1 + ui,j1 2ui,j
+
= f (xi , yj )
x2
y 2

(7.20)

Assuming x = y = 1 (without losing generality), we can get the


following simplied equation:
ui+1,j + ui1,j + ui,j+1 + ui,j1 4ui,j = f (xi , yj )

(7.21)

July 27, 2012 6:17

Applied Parallel Computing

9in x 6in

b1350-ch07

97

Dierential Equations

We can have the following linear system of algebraic equations:

u2,1 + u0,1 + u1,2 + u1,0 4u1,1 = f (x1 , y1 )


u3,2 + u1,2 + u2,3 + u2,1 4u2,2 = f (x2 , y2 )

...

(7.22)

If we dene a new index,


m = (j 1)X + i

(7.23)

to arrange a 2D index (i, j) to a 1D in m = 1, 2, . . . , X, X +1, X +2, . . . , XY .


We can arrange the following 2D discrete Poisson equation:
Au = B
where A is a sparse matrix dened as:

A1 I
0

I A2
0

.
..
.
..
.
..
.
A=
.
.

0
0 AY 1

0
0
I
This is a Y Y matrix with sub-matrix

4 1

1 4

.
..
..
.
.
Ai =
.
.

0
0

0
0

(7.24)

AY
0
..
.

elements

0
0

0
0

..
..
.
.

4 1

1 4

Y Y

XX

which is a X X matrix and there are Y of them, i = 1, 2, . . . , Y . The nal


matrix A is a (XY ) (XY ) matrix.
Therefore, solution of 2D Poisson equation is essentially a solution of a
ve-diagonal system.
7.2.4. 3D Poisson equation
Consider the following 2D Poisson equation:

uxx + uyy + uzz = f (x, y, z)
Proper BC

(7.25)

July 27, 2012 6:17

Applied Parallel Computing

98

9in x 6in

b1350-ch07

Applied Parallel Computing

Applying central dierences on both sides, we get:


ui+1,j,k + ui1,j,k 2ui,j,k
ui,j+1,k + ui,j1,k 2ui,j,k
+
2
x
y 2
ui,j,k+1 + ui,j,k1 2ui,j,k
+
= f (xi yj zk )
z 2

(7.26)

Assuming x = y = z = 1 (without losing generality), we can get the


following simplied equation:
ui+1,j,k + ui1,j,k + ui,j+1,k + ui,j1,k + ui,j,k+1 + ui,j,k1 6ui,j,k
= f (xi , yj , zk )

(7.27)

We can have the following linear system of algebraic equations:

u2,1,1 + u0,1,1 + u1,2,1 + u1,0,1 + u1,1,2 + u1,1,0 6u1,1,1 = f (x1 , y1 , z1 )


u
+ u1,2,2 + u2,3,2 + u2,1,2 + u2,2,3 + u2,2,1 6u2,2,2 = f (x2 , y2 , z2 )

3,2,2
...
(7.28)
If we dene a new index,
m = (k 1)XY + (j 1)X + i

(7.29)

We can linearize a 3D index (i, j, k) by a 1D index (m). Dening um = uijk ,


we obtain the following 3D discrete Poisson equation,
Au = b
where A is a sparse matrix dened as:
1
I
A

I A2

A = ..
..
..
.
.
.

0
whose sub-matrix elements are:
k
a1

ak = ..
.

(7.30)

0
0
..
.

AZ

ak2
..
.

..
.

ZZ

..
.

aky

Y Y

July 27, 2012 6:17

Applied Parallel Computing

9in x 6in

b1350-ch07

99

Dierential Equations

And this is a Y Y matrix with sub-matrix elements,

akj =

6
1
..
.

6
..
..
.
.

0
..
.

XX

Thus, A is a seven-diagonal (XY Z) (XY Z) spare matrix and solution of


the above equation requires an iterative method.

7.2.5. 3D Helmholtz equation


Consider the following 3D Helmholtz equation:


uxx + uyy + uzz + 2 u = 0

(7.31)

Proper BC
There are two families of methods in solving Helmholtz equation:
(1) Method of moments, aka, boundary element method.
(2) Method of nite dierences.
In the following, we describe the nite-dierence method:
ui,j+1,k + ui,j1,k 2ui,j,k
ui+1,j,k + ui1,j,k 2ui,j,k
+
x2
y 2
ui,j,k+1 + ui,j,k1 2ui,j,k
+
+ 2 ui,j,k
z 2

(7.32)

With the assumption that x = y = z = 1, we can simplify the above


equation as:
ui+1,j,k + ui1,j,k + ui,j+1,k + ui,j1,k + ui,j,k+1 + ui,j,k1
+ (2 6)ui,j,k = 0

(7.33)

Solution of this discrete equation is very similar to that of the discrete


Poisson equation. Both methods require solution of sparse system of
equations.

July 27, 2012 6:17

100

Applied Parallel Computing

9in x 6in

b1350-ch07

Applied Parallel Computing

7.2.6. Molecular dynamics


Molecular dynamics is a popular approach for modeling many systems in
biomolecule, chemistry and physics. Classical MD involves solutions of the
following system of ODEs:

m x = fi ({xi }, {vi })

i i
xi (t = 0) = ai
i = 1, 2, . . . , N
(7.34)

xi (t = 0) = vi ,
The key computing issues for MD are:
(1) MD can be highly nonlinear depending on the force elds. Most of the
computing time is spent on force calculation.
(2) Force matrix denes the force from all pairs of particles or molecules,

f11 f1n
.
..
..

.
F =
.
.
.

fn1

fnn

and it has the following important properties:


(a) Anti-symmetrical resulting from Newtons third law.
(b) Diagonal elements are all zero (no self-interaction).
(c) Sum of one complete row is the force acting on one particle by all
others.
(d) Sum of one complete column is the force exerted by one particle on
all others.
(e) The matrix is dense and complexity of obtaining the force is
O(N 2 ) if particles are involved in long-ranged interactions such as
Coulombs force.
(f) The matrix is sparse and complexity of obtaining the force is O(N )
if particles are involved in short-ranged interactions such as bound
force only.
There are many ways to decompose the system. The two most popular
ways are: Particle decomposition and spatial decomposition. A third way,
not as common, considers force decomposition. Interaction ranges (short-,
medium-, long-) determine the decomposition methods:
(1) Particle decomposition: Decomposing the system based on particle
IDs regardless of particle positions. For example, for 100 particles

July 27, 2012 6:17

Applied Parallel Computing

9in x 6in

b1350-ch07

101

Dierential Equations

labeled as (1, 2, 3, . . . , 100) and 10 processors labeled as P1 , . . . , P10 ,


we allocate 10 particles on each processor. Particles (1, 2, . . . , 10) are
partitioned on P1 ; (11, 12, . . . , 20) on P2 , . . . , and (91, 92, . . . , 100) on
P10 .
(2) Spatial decomposition: Decomposing the system based on particle
positions regardless of particle IDs
(i) Cartesian coordinates
(ii) Cylindrical coordinates
(iii) Spherical coordinates
For example, for 100 particles labeled as (1, 2, . . . , 100) and nine processors, we allocated approximately 11 particles on each processor. We
divide the computational domain into 3 3 = 9 sub-domains. Particles
(1, 5, . . . , 19) on P1 ; (12, 14, . . . , 99) on P2 and so on.
(3) Force decomposition: Decomposing the system based on force terms
regardless of particle IDs and positions.
Of course, it is not uncommon; a mix of the above decomposition is used
for one MD simulation.
Calculation of forces Typical force elds for protein simulation, for
example, can be looked up from CHARM or AMBER library. Armies
of professionals on physical chemistry and experimental structure biology
provide the ever rened force formulation and the coecients associated:



kb (b b0 )2 +
k ( 0 )2 +
k [1 + cos(n )]
V =
bonds

angles

k ( 0 ) +

impropers


nonbonded




UreyBradley

Rminij
rij

12

Rminij
rij

dihedrals

ku (u u0 )2
6 
+

qi qj
rij

(7.35)

Term-1 accounts for the bond stretches where kb is the bond-force constant
and b b0 is the distance from equilibrium that the atom has moved.
Term-2 accounts for the bond angles where k is the angle-force constant
and 0 is the angle from equilibrium between three bounded atoms.
Term-3 is for the dihedrals (a.k.a. torsion angles) where k is the dihedral
force constant, n is the multiplicity of the function, is the dihedral angle
and is the phase shift.

July 27, 2012 6:17

102

Applied Parallel Computing

9in x 6in

b1350-ch07

Applied Parallel Computing

Term-4 accounts for the impropers, that is out of plane bending, where k
is the force constant and 0 is the out of plane angle. The Urey-Bradley
component (cross-term accounting for angle bending using 1, 3 non-bonded
interactions) comprises Term-5 where ku is the respective force constant
and u is the distance between the 1, 3 atoms in the harmonic potential.
Non-bonded interactions between pars of atoms are represented by the
last two terms. By denition, the non-bonded forces are only applied to
atom pairs separated by at least three bonds. The van Der Waals (VDW)
energy is calculated with a standard 126 Lennard-Jones potential and the
electrostatic energy with a Coulombic potential.
The Lennard-Jones potential (Fig. 7.6) is a good approximation of
many dierent types of short-ranged interaction besides modeling the VDW
energy.
 

12  6

(7.36)
VLJ (r) = 4
r
r
where  is the depth of the depth of the potential well, is the
(nite) distance at which the inter-particle potential is zero, and r is the
distance between the particles. These parameters can be tted to reproduce
experimental data or mimic the accurate quantum chemistry calculations.

Fig. 7.6.

Lennard-Jones potential, compared with empirical data.1

1 R. A. Aziz, A highly accurate interatomic potential for argon, J. Chem. Phys. Vol. 99
Issue 6 (1993) 4518.

July 27, 2012 6:17

Applied Parallel Computing

9in x 6in

Dierential Equations

b1350-ch07

103

The r12 term describes Pauli repulsion at short ranges due to overlapping
electron orbitals and the r6 term describes attraction at long ranges such
as the van der Waals force or dispersion force.
The last term, the long-ranged force in the form of Coulomb potential
between two charged particles,
qi qj
X
i)
(X
(7.37)
F ij =
i X
j |3 i
|X
The total force on particle i by all other particles is:
F i =

n

j=i

qi qj
X
i)
(X

j |3 i
|Xi X

(7.38)

Estimating the forces cost 90% of the total MD computing time. But,
solution of the equation of motion is of particular applied Mathematics
interest as it involves solving the system of ODEs.
First, we may use the second order Verlet method:
a(t) =

x(t + ) + x(t ) 2x(t)


d2 x
=
2
dt
2

(7.39)

where a(t) is the acceleration. So, the next time step can be calculated from
the present and the previous steps by:
x(t + ) = 2x(t) x(t ) + a(t) 2
F (t) 2

m
Second, we may consider the RungeKutta method:
= 2x(t) x(t ) +

(7.40)

y  = f (t, y)
1
y(t + h) = y(t) + h(k1 + 2k2 + 2k3 + k4)
6
k1 = f (t, y)


1
1
k2 = f t + h, y + k1
2
2


1
1
k3 = f t + h, y + k2
2
2

(7.41)

k4 = f (t + h, y + k3 )
It turns out this portion of the MD, i.e. actual solution of the system of
ODEs, takes insignicant amount of computing time.

July 27, 2012 6:17

Applied Parallel Computing

9in x 6in

This page intentionally left blank

b1350-ch07

July 27, 2012 6:17

Applied Parallel Computing

9in x 6in

b1350-ch08

CHAPTER 8
FOURIER TRANSFORMS

The discrete Fourier transform (DFT) has an important role in many


scientic and engineering applications. In the computer simulation of
biological, physical and chemical phenomena, the calculation of long-range
interaction depends heavily on the implementation of fast Fourier transform
(FFT), which is a revolutionary FT algorithm that can compute the DFT of
an n-point series in (n log n) time. Other areas including uid dynamics,
signal processing and time series are also utilizing FFT.
In this chapter, we discuss the simple form of serial FFT algorithm.
Based on this, we then discuss the general parallel scheme for such
algorithm. Finally, we introduce a high performance 3D FFT specially suited
for scalable supercomputers.

8.1. Fourier Transforms


Mathematically, the following integrations dene the Fourier transforms and
its inverse:
 +
1
f (t)eit dt
(8.1)
g() =
2
 +
1
f (t) =
g()eit d
(8.2)
2
Naturally, function f (t) is dened in t-space (or physical space) and
function g() is dened in the k-space (or frequency space, or Fourier space).
The key purpose of the transform is to study the way of representing general
functions by sums of simpler trigonometric functions, as evident by Fig. 8.1.
Fourier transforms have broad applications including analysis of
dierential equations that can transform dierential equations into algebraic
equations, and audio, video and image signal processing.

105

July 27, 2012 6:17

Applied Parallel Computing

106

9in x 6in

b1350-ch08

Applied Parallel Computing

Fig. 8.1.

Fourier transform.

8.2. Discrete Fourier Transforms


The sequence of N complex numbers x0 , x1 , . . . , xN 1 is transformed
into the sequence of N complex numbers X0 , X1 , . . . , XN 1 by the DFT
according to the formula:
Xk = F (x) =

N
1


xj e2

ijk
N

k = 0, 1, 2, . . . , N 1

(8.3)

j=0

The inverse discrete Fourier transform (IDFT) is given by:


xk = F

(X) =

N
1

j=0

Xj e2

ijk
N

j = 0, 1, 2, . . . , N 1

(8.4)

July 27, 2012 6:17

Applied Parallel Computing

9in x 6in

b1350-ch08

107

Fourier Transforms

It is obvious that the complex numbers Xk represent the amplitude and


phase of the dierent sinusoidal components of the input signal xn . The
DFT computes the Xk from the xn , while the IDFT shows how to compute
ijk
the xn as a sum of sinusoidal components N1 Xk e2 N with frequency k/N
cycles per sample.
1D sequence can be generalized to a multidimensional array
x(n1 , n2 , . . . , nd ) a function of d discrete variables,
nl = 0, 1, . . . , Nl 1

l [1, d]

The DFT is dened by:


Xk1 ,k2 ,...,kd
=

N
1 1
n1 =0

k1 n1
N
1

N
2 1
n2 =0

k2 n2
N
2

N
d 1
nd =0


d
nkddn=0

xn1 ,n2 ,...,nd

(8.5)

In real applications, 3D transforms are the most common.

8.3. Fast Fourier Transforms


FFT is an ecient algorithm to compute DFT and its inverse. There are
many distinct FFT algorithms involving a wide range of mathematics, from
simple complex-number arithmetic to group theory and number theory.
Computing DFT directly from the denition is often too slow to be
practical. Computing a DFT of N points in the nave way, using the
denition, takes O(N 2 ) arithmetical operations. An FFT can get the
same result in only O(N log N ) operations. The dierence in speed can
be substantial, especially for large data sets where N = 103 106 , the
computation time can be reduced by several orders of magnitude. Since
the inverse DFT is the same as the DFT, but with the opposite sign in
the exponent and a 1/N factor, any FFT algorithm can easily be adapted
for it.
Simply put, the complexity of FFT can be estimated this way. One
F T (M M ) can be converted to two shortened FTs as F T (M/2 M/2)
through change of running indices. If this conversion is done recursively, we
can reduce the complexity from O(M 2 ) to O(M log M ).
Colloquially, FFT algorithms are so commonly employed to compute
DFTs that the term FFT is often used to mean DFT.

July 27, 2012 6:17

108

Applied Parallel Computing

9in x 6in

b1350-ch08

Applied Parallel Computing

There are many FFT algorithms and the following is a short list:
(1)
(2)
(3)
(4)
(5)
(6)
(7)
(8)
(9)
(10)
(11)
(12)

Split-radix FFT algorithm.


Prime-factor FFT algorithm.
Bruuns FFT algorithm.
Raders FFT algorithm.
Bluesteins FFT algorithm.
Buttery diagram a diagram used to describe FFTs.
OdlyzkoSch
onhage algorithm applies the FFT to nite Dirichlet
series.
Overlap add/Overlap save ecient convolution methods using FFT
for long signals.
Spectral music involves application of FFT analysis to musical
composition.
Spectrum analyzers Devices that perform an FFT.
FFTW Fastest Fourier Transform in the West C library for the
discrete Fourier transform in one or more dimensions.
FFTPACK another C and Java FFT library.

Buttery algorithm the best-known FFT algorithm, is also known as the


CooleyTukey algorithm.1

A radix-2 decimation-in-time FFT is the simplest and most common


form of the CooleyTukey algorithm, although highly optimized Cooley
Tukey algorithm implementations typically use other form of the algorithm
1 James

W. Cooley was a programmer on John von Neumanns computer at the Institute


for Advanced Study at Princeton (19531956) and retired from IBM in 1991.

July 27, 2012 6:17

Applied Parallel Computing

9in x 6in

b1350-ch08

109

Fourier Transforms

as described below. Radix-2 DIT divides a DFT of size N into two


interleaved DFTs of size N/2 with each recursive stage.
The discrete Fourier transform (DFT) is dened by the formula:
Xk =

N
1


xn e2

ink
N

(8.6)

n=0

where k is an integer ranging from to N 1.


Radix-2 DIT rst computes the DFTs of the even-indexed
x2m (x0 , x2 , . . . , xN 2 ) inputs and of the odd-indexed inputs x2m+1
(x1 , x3 , . . . , xN 1 ) and then combines those two results to produce the DFT
of the whole sequence.
More specically, the Radix-2 DIT algorithm rearranges the DFT of the
function xn into two parts: a sum over the even-numbered indices n = 2m
and a sum over the odd-numbered indices n = 2m + 1:
N
2

Xk =

1


m=0

x2m e

2i
N (2m)k

N
2

1


x2m+1 e

2i
N (2m+1)k

(8.7)

m=0

Thus, the whole DFT can be calculated as follows:

2i

Ek + e N Ok ,
if k < N/2
Xk =

2i

EkN/2 e N (kN/2) OkN/2 , if k N/2

(8.8)

July 27, 2012 6:17

110

Applied Parallel Computing

9in x 6in

b1350-ch08

Applied Parallel Computing

Another graph depicting the FFT algorithm:

First, we divide the transform into odd and even parts (assuming M is
even):
Xk = Ek + exp

2ik
Ok
M

and
Xk+M/2 = Ek exp

2ik
Ok
M

Next, we recursively transform Ek and Ok to the next two 2 2 terms.


We iterative continuously as shown in Fig. 8.2 until we only have one point
left for Fourier transform. In fact, one point does not need any transform.

Function Y=FFT(X,n)
if (n==1)
Y=X
else
E=FFT({X[0],X[2],...,X[n-2]},n/2)
O=FFT({X[1],X[3],...,X[n-1]},n/2)
for j=0 to n-1
Y[j]=E[j mod (n/2)]
+exp(-2*PI*i*j/(n/2))*O[j mod (n/2)]
end for
end if
Fig. 8.2.

Pseudo code for serial FFT.

July 27, 2012 6:17

Applied Parallel Computing

Fourier Transforms

9in x 6in

b1350-ch08

111

General case:

CooleyTukey algorithms recursively re-express a DFT of a composite


size N = N1N2 as the following:

Perform N1 DFTs of size N2.


Multiply by complex roots of unity.
Perform N2 DFTs of size N1.

8.4. Simple Parallelization


Based on the serial FFT algorithm, a nature implementation of parallel
FFT is to divide the task according to the recursive algorithm until all the
processors has its own stake of subtasks. By doing this, every processor
runs independently on its own subtasks and communication occurs only
when merging back the results.
Consider a simple example of computing 16-point FFT on four
processors. The data can be partitioned as in Fig. 8.3 so the calculations of
4-point FFT are entirely local. In the second stage, the calculation of even
and odd subsequence requires P0 to communicate with P2 and P1 with P3 .
In the nal stage, P0 and P1 communicate and so do P2 and P3 .
This algorithm, although simple and closely aligned with the original
serial algorithm, is in fact quite ecient especially in the cases where the
data points are far more than the number of processors. The communication
happens log P times in the whole calculation and in each communication,
every processor sends and receives N/P amount of data.

July 27, 2012 6:17

112

Applied Parallel Computing

9in x 6in

b1350-ch08

Applied Parallel Computing

Fig. 8.3.

10

11

12

13

14

15

The scatter partition for performing 16-point FFT on four processors.

8.5. The Transpose Method


The nature improvement from the above simple parallel algorithm is the
so-called transpose method that eliminates the log P communications to
one all-to-all transpose communication. To understand how this works, it is
necessary to have a deeper insight into the FFT algorithm.
As a recursive algorithm, the original FFT can always be written in the
form of a forward loop that involves exactly log2 N steps. Moreover, the
total number of data is unchanged after every merge. Thus, the data points
Xi at step.
Take the example of the 16-point FFT on four processors in the previous
section.
One F T (M M ) can be converted into two shortened FTs as F T
M
(2 M
2 ) through a change of running variables. If this conversion is done
recursively, the complexity can be reduced from O(M 2 ) to O(M log M ).
This new transform is the FFT. Here are the details.
First, divide the transform into odd and even parts (assuming M is
even):
We then have:


ik
Ok , k = 0, 1, . . . , M 
Xk = Ek + exp 2
M

ik
Xk+ M = Ek exp 2
Ok , k = 0, 1, . . . , M 
2
M

(8.9)
(8.10)

July 27, 2012 6:17

Applied Parallel Computing

9in x 6in

b1350-ch08

113

Fourier Transforms

Next, transform Ek and Ok recursively to the next 2 2 terms. We iterate


continuously until we only have one point left for Fourier Transform. In
fact, one point does not need any transform.

8.6. Complexity Analysis for FFT


Dene the following terms:
T (M ) = Time for M point transform

(8.11)

Tcomp = Time for real number, X + or

(8.12)

Tmult = Time for multiplying 2 complex numbers = 6Tcomp

(8.13)

Tpm = Time for adding two complex

(8.14)

Therefore, using the formulae above, we get the following relationship:



T (M ) = 2T

M
2


+M

Tmult
Tpm +
2

(8.15)

Iteratively, we have:

T (M ) =

Tpm +

Tmult
2

M log M

(8.16)

Let us dene:
T+ = 2Tpm + Tmult

(8.17)

Thus obtaining the formula for the single processor time as:
T (1, M ) = T+ M log M

(8.18)

Now, Let us discuss the details of FT parallelization.


For simple FFT: We can decompose the task by physical space, fourier
space, or both. In any of these cases, we can always obtain perfect speed
up, as this problem is the EP problem.
For Complete FFT (M M on P processors): We assume M and P
are factors of 2N and m = M
P is an integer.

July 27, 2012 6:17

Applied Parallel Computing

114

9in x 6in

b1350-ch08

Applied Parallel Computing

Case I M > P:

Case I M P:

M
T+ M
T (P, N ) = 2T P,
+
2
2 P

M
M
T+
=
log
T (P, M ) +
P
2
P

M
T (P, M ) = T P,
+ T = T log M
2

Substituting this equation into the above, we get:




M
M
T (P, M ) =
2T log P + T+ log
2p
p
=

M
[T+ log M + (2T T+ ) log P ]
2P

(8.19)
(8.20)

(8.21)

(8.22)
(8.23)

Therefore, the speedup is:


S(P, M ) =

P
1+

2T T+ log P
T+
log M

(8.24)

And the overhead is:


h(P, M ) =

2T T+ log P
T+
log M

(8.25)

We can estimate T and T+ in terms of the traditional Tcomp and Tcomm :


T+ = 2Tpm + Tmult 10Tcomp

(8.26)

T = Tpm + Tmult + Tshift = 8Tcomp + 2Tcomm

(8.27)

Therefore, the overhead is:


h(P, M ) =

2Tcomm + Tcomp log P


5Tcomp
log M

(8.28)

Remarks
1. Communication to computation ratio aects overhead h(P, M ).
P
aects overhood, too, but in a much smaller way than most other
2. M
cases. For parallel FFT, overhead is typically very small.

July 27, 2012 6:17

Applied Parallel Computing

9in x 6in

b1350-ch09

CHAPTER 9
OPTIMIZATION

Mathematical optimization, aka, mathematical programming, is the


selection of a best element from some sets of available alternatives with
regard to some constraints. More generally, it involves nding best available
values of some objective function (O.F.) given a dened domain. There
are a variety of dierent types of O.F.s and dierent types of domains, a
fact that enriches optimization with many subelds: convex programming,
integer programming, quadratic programming, nonlinear programming, and
dynamic programming, as well as stochastic programming in which the
parameters and constraints involve random variables. It is a rich eld with
theories originating from, and applications to, many disciplines including
mathematics, science, nance, operations research, and engineering. It has
been adopted by, arguably, the widest communities among these areas.
Finding the best available values is a techy venture, most commonly, of
computational techniques. Such techniques include optimization algorithms
(simplex algorithms by Dantzig, combinatorial algorithms, etc), heuristics,
and iterative methods introduced by giants like Newton and Gauss. Monte
Carlo methods are examples of such iterative methods relying on repeated
random sampling to reach the best available values. The MC methods
are useful for nding the approximate solutions of problems, usually, with
many coupled degrees of freedom, when their exact solutions are infeasible
to nd.
The commonality of these optimization problems is that vast amount
of computing resources is required, which results in another commonality:
parallel computing is the only way to obtain solutions, most commonly
approximate solutions, to optimization problems. Unfortunately, there are
still no widely accepted scalable parallel algorithms for optimizations.
Indeed, this is a fertile eld where challenges and opportunities co-exist.
Achieving scalable performance for MC methods on large parallel
computers with thousands of processing cores is a serious challenge.
There are still no agreeable standard parallel algorithms for MC
115

July 27, 2012 6:17

116

Applied Parallel Computing

9in x 6in

b1350-ch09

Applied Parallel Computing

methods. We will introduce in this manuscript the early stages of a few


promising methods.
9.1. Monte Carlo Methods
Monte Carlo methods, as of today, still do not have a widely accepted
denition. Regardless, Monte Carlo methods are a class of numerical
algorithms that rely on repeated random sampling to compute their results.
This much is clear: (1) it is a computational algorithm; (2) it requires
massive repetitions; and (3) it is random sampling.
As mentioned earlier, the Monte Carlo methods are most suited
for problems whose solutions are infeasible to achieve by deterministic
approaches. They were rst practiced in 1940s by von Neumann and his
fellow Los Alamos scientists for the Manhattan Project. They are the only
viable approaches for numerical integrations in high dimensions and with
complicated domains. For example, for calculations in lattice gauge theory,
the integration dimension can be as high as a million or higher. As one of
the random walk Markov chain Monte Carlo methods, it can drive the chain
of states, iteratively, to the region of optimality in search space.
Monte Carlo methods are, arguably, the most powerful numerical
methods because it can solve problems that many others fail and it
can solve many problems in broad diverse areas. However, Monte Carlo
methods have serious deciencies. First, the sampling process is long and
the stop condition when the iteration is converged is usually not known a
prior and, thus, most MC simulations are extremely computational time
consuming although many attempts, such as the importance sampling for
generating more important samples to reduce the variance from the
desired distributions, were introduced to speed things up. Second, the
iterative nature of the sampling process makes it dicult for parallelization,
a promising remedy of the rst. Yet, parallelization of the MC methods may
enable solutions of many problems in engineering, science, and nance, as
well as social sciences, leading breakthroughs in these and other elds.
9.1.1. Basics
The key components of typical Monte Carlo methods include:
(1) Probability distribution functions: This PDF describes the desired
properties of the physical or mathematical systems being simulated.
(2) Random number generator. A source of random numbers uniformly
distributed on the unit interval must be available and these at

July 27, 2012 6:17

Applied Parallel Computing

9in x 6in

b1350-ch09

Optimization

(3)

(4)
(5)

(6)

117

random numbers can be converted conveniently to other distributions as


needed.
Sampling rule: a prescription for sampling from the specied probability
distribution function, assuming the availability of random numbers on
the unit interval.
Scoring: the outcomes must be accumulated into overall tallies or scores
for the quantities of interest.
Error estimation: an estimate of the statistical error, i.e., variance,
as a function of the number of trials and other quantities must be
determined.
Variance reduction techniques: methods for reducing the variance in the
estimated solution to reduce the computational time for Monte Carlo
simulations.

The most important function of Monte Carlo methods is the ability to locate
the global optimum while performing optimization. The objective functions
of most scientic or engineering applications usually have multiple local
minima (as in Fig. 9.1) and the ability to escape from such local minima is
the greatest advantage of the Monte Carlo methods.

9.1.2. Metropolis-Hastings Algorithm


The Metropolis-Hastings algorithm, introduced by Metropolis and four
colleagues in 1953 and extended by Hastings in 1970, is one of the Markov

Fig. 9.1.

A typical higher-dimensional Objectie Function (OF) folder to 1D.

July 27, 2012 6:17

118

Applied Parallel Computing

9in x 6in

b1350-ch09

Applied Parallel Computing

chain Monte Carlo (MCMC) methods for obtaining a sequence of random


samples from a probability distribution for which direct sampling is dicult.
This sequence can be used to approximate the distribution or to perform
other measurements in the generated distribution. This chain can also lead
the search to a region of optimality.
Major steps of numerical Metropolis-Hastingsalgorithm include:
Step
Step
Step
Step

0:
1:
2:
3:

Initiate a state X and compute its scalar objective function: E(X);


Make a small move from X to a new state Y = X + X;
Compute the new objective function E(Y ) for the new state Y ;
Accept the new state Y according to the following probability



E(Y ) E(X)
(9.1)
P (Y ) = min 1, exp
T

Step 4: If stop conditions fail, then go to Step 1;


Step 5: End.
In the Step 3 above, the acceptance probability depends on the dierence
of the O.F.s at two nearby states X and Y and a parameter T yet to
be introduced. This new parameter T , a positive real number, is widely
regarded as temperature. It is, in fact, a parameter for regulating the level
of forgiveness when accepting a state with higher O.F.
Examining the formula in Step 3, we notice
If E(Y ) < E(X), then P (Y ) = 1 and the new state Y is always accepted.
If E(Y ) > E(X)
is small and exp( E(Y )E(x)
) 1 and
If T is big, then E(Y )E(x)
T
T
thus a new bad state Y is more likely accepted.
(A special case: T P (Y ) 1. Accepting any bad state).
is big and exp( E(Y )E(x)
) 0+ and
If T is small, then E(Y )E(x)
T
T
thus a new bad state Y is more likely rejected.
(A special case: T 0 P (Y ) 0. Rejecting any bad state).
Therefore, it is not dicult to conclude that the parameter T is a tolerance
for bad states. This tolerance naturally dictates the convergence rate of the
simulation.
In the Metropolis-Hastings algorithm, T is a fixed number; dynamically adjusting it, an act called temperature scheduling, to achieve optimal convergence is the true state of the art for this algorithm. As of today, there is no universal temperature schedule, and it may never be possible to find one because an optimal temperature schedule depends on the application.
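As a concrete illustration of Steps 0-5, the following is a minimal serial sketch in C of one Metropolis-Hastings step. The quadratic-plus-cosine objective function, the uniform proposal of width delta, and the simple rand()-based sampling are illustrative assumptions, not the text's prescribed implementation.

```c
#include <math.h>
#include <stdlib.h>

/* Illustrative objective function E(X); a real application supplies its own. */
double objective(double x) { return x * x - cos(5.0 * x); }

/* One Metropolis-Hastings step for a 1D state, following Eq. (9.1). */
double mh_step(double x, double delta, double T)
{
    /* Step 1: small move from X to Y = X + dX, dX uniform in [-delta, delta]. */
    double y  = x + delta * (2.0 * rand() / (double)RAND_MAX - 1.0);

    /* Steps 2-3: accept Y with probability min{1, exp(-(E(Y)-E(X))/T)}. */
    double dE = objective(y) - objective(x);
    if (dE <= 0.0 || rand() / (double)RAND_MAX < exp(-dE / T))
        return y;      /* accept the new state   */
    return x;          /* reject: keep old state */
}

int main(void)
{
    double x = 2.0, T = 1.0;            /* fixed T: plain Metropolis-Hastings */
    for (int k = 0; k < 10000; ++k)
        x = mh_step(x, 0.1, T);
    return 0;
}
```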
9.1.3. Simulated annealing
As an adaptation of the Metropolis-Hastings algorithm, simulated annealing, introduced by Kirkpatrick and two colleagues in 1983, performs the same new state-generating step, i.e., the probability of accepting a new state Y from the old state X is:

    P(Y) = \min\left\{1, \exp\left(-\frac{E(Y) - E(X)}{T}\right)\right\}    (9.2)
The only difference is that the parameter T varies during the process of sampling. The precise form of the change of the parameter T is still elusive, but some basic principles should be followed, or a poor choice of the parameter T may slow down the convergence or lead to divergence. Another rule of thumb is to allow more bad states to be accepted at the start of the iterations than at the later stages, i.e., scheduling the parameter T to decrease as the iteration progresses.
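A common way to realize such a decreasing schedule is geometric cooling. The sketch below builds on the mh_step() routine sketched above; the starting temperature, cooling factor, sweep count, and stopping temperature are arbitrary illustrative choices, not values prescribed by the text.

```c
/* Simulated annealing driver: Metropolis-Hastings sweeps with a geometric
 * cooling schedule T <- alpha * T. All constants are illustrative only.   */
double mh_step(double x, double delta, double T);   /* sketched above */

double anneal(double x0)
{
    double x = x0;
    double T = 10.0;                 /* assumed starting temperature   */
    const double alpha = 0.95;       /* assumed cooling factor         */

    while (T > 1e-4) {               /* assumed stopping temperature   */
        for (int sweep = 0; sweep < 100; ++sweep)
            x = mh_step(x, 0.1, T);  /* many moves at the current T    */
        T *= alpha;                  /* decrease T as iteration proceeds */
    }
    return x;                        /* approximate global minimizer   */
}
```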
9.1.4. Genetic algorithm
A genetic algorithm is a search heuristic that mimics the process of natural evolution. This heuristic is used to generate useful solutions to optimization and search problems. Genetic algorithms belong to the larger class of evolutionary algorithms, inspired also by mechanisms of natural evolution such as inheritance, mutation, selection, and crossover.

9.2. Parallelization
9.2.1. Basics
There are many approaches for parallelizing Monte Carlo methods, and none appears ideal. The correct measure of success is still that of the shortest time for achieving the results. Other claims of success, such as keeping all processors busy all the time or achieving a zillion Flops, are common deceptions for parallel computing, and such deceptions are particularly difficult to detect for Monte Carlo simulations.
The latest approaches to parallelizing Monte Carlo methods always involve decomposition of the new state generation, of the search space, or of the Markov chains. The latter two are related, and none appears able to scale to parallel computers with many processors.
9.2.2. Chain mixing method
With very limited numerical experiments but fantastic performance, I would like to introduce a new method: the chain mixing method. The following are the basic steps:

Step 0: Start P independent Markov chains, one per processor, by the Metropolis-Hastings algorithm or by simulated annealing;
Step 1: All P processors elongate their individual Markov chains with individual random number sequences by the Metropolis-Hastings algorithm or by simulated annealing;
Step 2: At a given mixing time, suspend chain elongation, evaluate the current states of all P chains, and identify the best state;
Step 3: If the best state fails to satisfy the stop conditions, one or more best state(s) are selected, according to the acceptance probability, and planted on the P processors; go to Step 1;
Step 4: End.
As shown in Fig. 9.2, N processors start N Markov chains with initial states X_1, X_2, \ldots, X_n and iterate until they reach states Y_1, Y_2, \ldots, Y_n, when a mixing evaluation occurs. At this time, all states except one best state, say Y_2, are terminated. All processors, including the one that generated state Y_2, are given the state Y_2 and use their own random number sequences to elongate the Markov chains until the next mixing, and so on. This approach is efficient only for parallel machines with few processors.
An improvement of the above method is presented in Fig. 9.3. The difference is that two or more states are selected to reproduce. We pool the ending states Y_1, Y_2, \ldots, Y_n and compute the average O.F. of these states by

    \bar{E} = \frac{E(Y_1) + E(Y_2) + \cdots + E(Y_n)}{n}

and then compare the acceptance factors of the ending states

    P(Y_i) = \exp\left(-\frac{E(Y_i) - \bar{E}}{T}\right)

Fig. 9.2. Mixing method 1, in which only one chain is selected to propagate.

Fig. 9.3. Mixing method 2, in which some chains are selected to propagate.

to assess the reproduction allowance. The processors with bad states will suspend their own chain production and adopt other processors' better states to produce new Markov chains.

This schedule is a faithful parallelization of the Metropolis-Hastings algorithm or simulated annealing. Parallelization of Monte Carlo methods is a serious challenge with great benefits. For the schemes discussed here, we still have many parameters to select, e.g., the mixing periods and the reproduction probability.
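The mixing step itself maps naturally onto a collective operation. Below is a minimal MPI sketch in C of one mixing evaluation for mixing method 1; representing the state by a single double, the externally supplied objective() routine, and the use of MPI_MINLOC with MPI_Bcast are illustrative assumptions rather than the text's prescribed implementation.

```c
#include <mpi.h>

double objective(double x);                 /* application-supplied O.F. (assumed) */

/* One mixing evaluation: find the chain with the lowest objective value
 * across all processors and plant that state on every processor.        */
void mix_chains(double *state, MPI_Comm comm)
{
    int rank;
    MPI_Comm_rank(comm, &rank);

    struct { double value; int rank; } local, best;
    local.value = objective(*state);        /* current O.F. of this chain */
    local.rank  = rank;

    /* Identify the best (lowest-O.F.) state and which processor owns it. */
    MPI_Allreduce(&local, &best, 1, MPI_DOUBLE_INT, MPI_MINLOC, comm);

    /* Plant the best state everywhere; each processor then elongates its
     * own chain with its own random number sequence until the next mixing. */
    MPI_Bcast(state, 1, MPI_DOUBLE, best.rank, comm);
}
```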


CHAPTER 10
APPLICATIONS

This chapter focuses on applications of the algorithms to physics and


engineering. They include classical and quantum molecular dynamics, and
Monte Carlo methods.
First-principle, parameter-free simulations are always desirable, but are rarely feasible for a large class of problems in physics and engineering. Unfortunately, computers available today, and even decades to come, will still be insufficient for most realistic problems.

Thus, second-principle modeling, with some free parameters and simplifications of the original equations, remains the dominant approach in scientific computing. The essence of this approach rests on the representation of the physical systems by mathematical formulations. Scientists in each of their own areas are responsible for the conversion of physics to mathematics.

After the mathematical formulations are obtained, the immediate concern falls to the discretization of the equations for computer representation, which is largely the applied mathematicians' responsibility.

The mathematical formulations are commonly divided into two major categories: differential equations and global optimizations. For the former, we get partial differential equations (PDEs), which are usually nonlinear, and ordinary differential equations (ODEs), which are, in many cases, stiff. These equations are largely solved by finite-difference, finite-element, or particle methods, or their hybridization. For the latter, we always get problems with multiple local minima, for which Monte Carlo methods usually prove to be advantageous.

The following is a short list of grand challenges covering problems of intense complexity in science, engineering, and technology. Only the highest performance computing platforms, i.e. parallel computing, may offer limited insights into these problems.
Parallel computing may bring revolutionary advances to the study of the following areas, which usually require large-scale computations. Scientists and engineers in these areas must seize the opportunity to make efficient use of this cost-effective and enabling technology.
Traditional methods such as the finite difference, finite element, Monte Carlo, and particle methods are still the essential ingredients for solving these problems, except that these algorithms must be parallelized. Fluid dynamics, turbulence, differential equations, numerical analysis, global optimization, and numerical linear algebra are a few popular areas of research.

Turbulence: Turbulence in fluid flows affects the stability and control, thermal characteristics, and fuel performance of virtually all aerospace vehicles. Understanding the fundamental physics of turbulence is requisite to reliably modeling flow turbulence for the analysis of realistic vehicle configurations.

Topics include simulation of devices and circuits, VLSI, and artificial intelligence (speech, vision, etc.).
Speech: Speech research is aimed at providing a communications interface
with computers based on spoken language. Automatic speech understanding
by computer is a large modeling and search problem in which billions of
computations are required to evaluate the many possibilities of what a
person might have said within a particular context.
Vision: The challenge is to develop human-level visual capabilities for
computers and robots. Machine vision requires image signal processing,
texture and color modeling, geometric processing and reasoning, and object
modeling. A component vision system will likely involve the integration of
all of these processes with close coupling.
10.1. Newtons Equation and Molecular Dynamics
This is the fundamental equation by Sir Isaac Newton (1642–1727) in classical mechanics, relating the force acting on a body to the change in its momentum over time. It is used in all calculations involving moving bodies in astrophysics, engineering, and molecular systems.

    F = ma + \frac{dm}{dt} v    (10.1)

In molecular dynamics and N-body problems, or any models that involve particles,

    m_i \frac{d^2 x_i}{dt^2} = f_i(x_1, x_2, \ldots, x_N), \quad i \le N    (10.2)


Main issues are:

Interaction range: short-, medium-, and long-ranged.


Force estimation (accounts for 90% of CPU usage).
Solution of coupled Newton Equations (ODE system).
Science (of course!).

The force matrix is given by

    F = \begin{pmatrix}
          f_{11} & f_{12} & \cdots & f_{1N} \\
          f_{21} & f_{22} & \cdots & f_{2N} \\
          \vdots & \vdots & \ddots & \vdots \\
          f_{N1} & f_{N2} & \cdots & f_{NN}
        \end{pmatrix}    (10.3)

Remarks

The N \times N particle interactions form a force matrix.
Diagonal elements are self-interactions and are always zero.
If the particles are named properly, f_{ij} decreases as |i - j| increases.
The sum of the elements of row i is equal to the total force on particle i.
Column j represents the forces on all particles by particle j; this is rarely useful.
In a long-ranged context with N bodies, F is dense and the complexity is O(N^2).
In a short-ranged context (MD), F is sparse and the complexity is O(N).

Solution of the motion equation:

2nd order Verlet method.
2nd order Runge–Kutta method.
4th order Runge–Kutta method.

Depending on the level of resolution required, one could apply 2nd, 4th, or 6th order ODE solvers.
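To make the 2nd order Verlet option concrete, here is a minimal C sketch of one velocity Verlet update for a single particle in one dimension. The function and variable names are illustrative, and the force routine is assumed to be supplied by the application.

```c
/* One velocity Verlet step (2nd order) for a single 1D particle.
 * force() is the application-supplied force evaluation; x and v are
 * position and velocity, m the mass, dt the time step.              */
double force(double x);            /* assumed to exist elsewhere */

void verlet_step(double *x, double *v, double m, double dt)
{
    double a_old = force(*x) / m;              /* a(t)    */
    *x += (*v) * dt + 0.5 * a_old * dt * dt;   /* x(t+dt) */
    double a_new = force(*x) / m;              /* a(t+dt) */
    *v += 0.5 * (a_old + a_new) * dt;          /* v(t+dt) */
}
```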

10.1.1. Molecular dynamics


There are two types of molecular dynamics (MD) approaches currently practised by researchers in physics, chemistry, and engineering, particularly in materials science. One is the first-principle approach of directly solving an N-body system governed by the Schrödinger equation. This approach is called quantum molecular dynamics (QMD), which considers all quantum mechanical effects. The other involves solving systems of Newton's equations governing particles moving under quasi-potentials. This approach is called classical molecular dynamics (CMD).
Both methods have their advantages and shortcomings. QMD considers full quantum mechanics and is guaranteed to be correct if the system contains a sufficient number of particles, which is infeasible with today's computing power; the largest system considered so far has fewer than 500 particles. CMD can include many more particles, but the potential used is basically an averaged quantity and may be incorrectly proposed. If the potential is wrong, the whole set of results is wrong, so a study with multiple free parameters can be risky.
CMD problems are usually solved by the following methods:
(1) Particle-particle (PP).
(2) Particle-mesh (PM).
(3) Multipole method.
In the PP method, the forces are computed exactly from the inter-particle distances, while in the PM method the forces, being treated as a field quantity, are approximated on meshes. In the multipole method, the field is expanded into multipoles; the higher-order poles, contributing negligibly to the field, are truncated.

There are two types of interactions measured by distance: short-ranged and long-ranged. For short-ranged interactions, PP is the method of choice, as the complexity is O(N) where N is the number of particles. For long-ranged interactions, PP may be used, but it is surely a bad choice as the complexity becomes O(N^2).

For short-ranged interactions, the record value was N = 10^8 in 1994. If the same method is used for long-ranged interactions, N = 10^4, which is basically useless for realistic problems. Thus, a good choice of method for long-ranged interactions is the PM. The multipole method is uniquely suited to long-ranged interaction problems.

Molecular dynamics may be unique in offering a general computational tool for the understanding of many problems in physics and engineering. Although it is not a new idea, its potential has never been realized due to the lack of adequate computing resources.
10.1.2. Basics of classical MD
In this section, we plan to:

(1) Develop general-purpose parallel molecular dynamics (MD) methodologies and packages on distributed-memory MIMD computers.
(2) Apply the resulting package to DNA-protein interaction analysis, thin-film depositions, and semiconductor chip design, as well as material design and analysis.
(3) Study microscopic interaction forms (physics) and explore the novel macro properties in cases where the interacting forms are known (engineering).
While molecular dynamics is a powerful method for many systems involving dynamics, continuum modeling, which does not treat the system as an assembly of particles, has been playing a major role in simulating fluids, gases, and solids. The major advantage of continuum modeling is the dramatic reduction in computing time enabled by treating the system as a finite number of interacting blocks whose motion is governed, usually, by the basic principles of energy and momentum conservation. Each of these blocks, in effect, contains thousands or tens of thousands or more particles, the basic entity in MD simulations. Thus, continuum modeling can be, depending on the size and nature of the system being simulated, thousands to millions of times faster than MD. However, the spatial and temporal resolutions offered by MD can be a feature that continuum modeling may never provide. It is important to select the right method for the right problem and, rather commonly, both methods may have to be used for one system for optimal accuracy and efficiency.
Molecular dynamics (MD) is in essence an N-body problem: classical mechanical or quantum mechanical, depending on the scale of the problem. MD is widely used for simulating molecular-scale models of matter in such areas as solid state physics, semiconductors, astrophysics, and macro-biology. In CMD, we consider solving N-body problems governed by Newton's second law, while for quantum MD, we solve systems governed by Schrödinger's equation.

Mathematically, MD involves solving a system of non-linear ordinary differential equations (ODEs). An analytical solution of an N-body problem is hard to obtain and, most often, impossible to get. It is also very time-consuming to solve non-linear ODE systems numerically. There are two major steps in solving MD problems: computing the forces exerted on each particle and solving the ODE system. The complexity lies mostly in the computation of the force terms, while the solution of the ODEs takes a negligible (5% of total) amount of time. There exist many efficient and accurate numerical methods for solving the ODEs: for example, Runge–Kutta, Leapfrog, and Predictor-Corrector.


The computation of the force terms attracts much of the attention in solving MD problems. There are two types of problems in CMD: long-ranged and short-ranged interactions. In long-ranged interactions, the complexity for a system with N particles is O(N^2), and it seems hard to find a good method to improve this bound except to let every particle interact with every other particle. For short-ranged interactions, the complexity is C \cdot N, where C is a constant dependent on the range of interaction and on the method used to compute the forces.

The interest lies in measuring the macro physical quantities after obtaining the micro physical variables such as positions, velocities, and forces. Normally, a realistic MD system may contain N > 10^5 particles, which makes it impossible for most computers to solve. As of summer 1993, the largest number of particles that had been attempted was about 10^7. This was done on the 512-processor Delta computer by Sandia National Laboratories researchers. Of course, that study was not applied to real physics problems.
The cost is so high that the only hope of solving these problems must lie in parallel computing. So, designing an efficient and scalable parallel MD algorithm is of great concern. The MD problems that we are dealing with are classical N-body problems, i.e. we solve the following generally non-linear ordinary differential equations (ODEs),
    m_i \frac{d^2 x_i}{dt^2} = \sum_j f_{ij}(x_i, x_j) + \sum_{j,k} g_{ijk}(x_i, x_j, x_k) + \cdots, \quad i = 1, 2, \ldots, N    (10.4)

where m_i is the mass of particle i, x_i is its position, f_{ij}(\cdot,\cdot) is a two-body force, and g_{ijk}(\cdot,\cdot,\cdot) is a three-body force. The boundary conditions and initial conditions are properly given. To make our study simple, we only consider two-body interactions, so g_{ijk} = 0.
The solution vector X of the system can be written in the following iterative form:

    X_{\rm new} = X_{\rm old} + \Delta\left(X_{\rm old}, \frac{dX_{\rm old}}{dt}, \ldots\right)    (10.5)

The entire problem is reduced to computing

    \Delta\left(X_{\rm old}, \frac{dX_{\rm old}}{dt}, \ldots\right)    (10.6)

In fact, the core of the calculation is that of the force. Typically, solving the above equation costs less than 5% of the total time if the force terms are known. Normally, Runge–Kutta, Predictor-Corrector, or Leapfrog methods are used for this purpose. More than 95% of the time is spent on computing the force terms, so we concentrate our parallel algorithm design on the force calculation.

There are two types of interactions: long-ranged and short-ranged forces. The long-ranged interactions occur often in gases, while short-ranged interactions are common in solids and liquids. For long-ranged forces, the cost is typically O(N^2) and there are not many choices of schemes. However, for short-ranged forces, the cost is generally O(N) and there exists a variety of methods.
A force matrix (Fig. 10.1) is helpful in understanding the force calculation. The matrix element fij is the force acting on particle i by particle j.
Thus, if we add together all elements in a row (say, row i), we get the total
force on particle i. The matrix has several simple properties:
(1) The diagonal elements are zero.
(2) It is symmetric.
(3) The matrix can be sparse or dense depending on the system interaction
range. For short-ranged interactions, it is sparse. For long-ranged
interactions, it is dense.
Long-ranged interactions: Most N-body problems and studies of plasma under Coulomb's interaction must consider long-ranged interactions.

Method I: Particle-mesh method.
Method II: Multi-pole method.
Method III: The fast multi-pole method.
Method IV: Rotation scheme.

Fig. 10.1. The force matrix and force vector. The force matrix element f_{ij} is the force on particle i exerted by particle j. Adding up the elements in row i, we get the total force acting on particle i by all other particles.


The total two-body force on particle i is given by

    F_i = \sum_{j \ne i} f_{ij}    (10.7)

So, if N particles are distributed uniformly to p processors (assuming


n = N/p is an integer), every particle must be considered by every processor
for us to compute the total force.
As shown in Fig. 10.2, all processors simultaneously pass their own particles to their right-hand neighbors (RHN); each processor keeps a copy of its own particles. As soon as the RHN get their left-hand neighbors' particles, they compute the partial forces exerted on the local particles by the incoming ones. Next, all processors pass the incoming particles on to their RHN, and these RHN again use the newly arrived particles to compute the remaining force terms for local particles. This process is repeated until all particles have visited all processors.
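A minimal MPI sketch of this ring-passing scheme is given below. The flat position/force arrays, the fixed particle count per processor, and the pair_force() routine are illustrative assumptions; a production code would also overlap communication with computation.

```c
#include <mpi.h>
#include <string.h>

#define NLOC 256     /* assumed number of particles per processor */

/* Assumed application routine: adds to fi the force of particle xj on xi. */
void pair_force(const double *xi, const double *xj, double *fi);

/* Accumulate long-range forces by circulating particle blocks around a ring. */
void ring_forces(double x[NLOC][3], double f[NLOC][3], MPI_Comm comm)
{
    int rank, p;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);
    int right = (rank + 1) % p, left = (rank - 1 + p) % p;

    double travel[NLOC][3], incoming[NLOC][3];
    memcpy(travel, x, sizeof(travel));       /* start with own particles */
    memset(f, 0, sizeof(double) * NLOC * 3);

    for (int step = 0; step < p; ++step) {
        /* Interact local particles with the currently visiting block. */
        for (int i = 0; i < NLOC; ++i)
            for (int j = 0; j < NLOC; ++j)
                if (step != 0 || i != j)     /* skip self-interaction */
                    pair_force(x[i], travel[j], f[i]);

        /* Pass the visiting block to the right-hand neighbor,
           receive the next block from the left-hand neighbor.  */
        MPI_Sendrecv(travel,   NLOC * 3, MPI_DOUBLE, right, 0,
                     incoming, NLOC * 3, MPI_DOUBLE, left,  0,
                     comm, MPI_STATUS_IGNORE);
        memcpy(travel, incoming, sizeof(travel));
    }
}
```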
Fig. 10.2. The communication structure for long-range interaction in MD.

Performance analysis: First, we define some parameters. Let t_{xchg} be the time to move one particle from one processor to its neighbor and let t_{pair} be the time to compute the force for a pair of particles. We have

    T(1, N) = N^2 t_{pair}    (10.8)

and

    T(p, N) = p \left( n^2 t_{pair} + n\, t_{xchg} \right)    (10.9)

Therefore, the speedup is

    S(p, N) = \frac{p}{1 + \frac{1}{n} \frac{t_{xchg}}{t_{pair}}}    (10.10)

and the overhead is

    h(p, N) = \frac{1}{n} \frac{t_{xchg}}{t_{pair}}    (10.11)

Remarks

The factor 1/n is important for reducing the overhead.
The ratio t_{xchg}/t_{pair} again controls the overhead.
For long-range interactions, h(p, N) is typically very small.

Short-ranged interactions: For short-ranged interactions, the algorithm is very different. Basically, two major techniques (applicable to both serial and parallel computing) are widely used to avoid computing force terms that have negligible contribution to the total force: (1) particle-in-cell and (2) neighbor lists. Assume that particles within a distance less than r_c have a finite, non-zero interaction, while particles at distances larger than r_c do not interact.

Particle-in-cell: Create a 3D cubic mesh in the computational domain with mesh size r_c. Then particles in a certain cell of volume r_c^3 only interact with particles in the 26 adjacent cells. Thus, only a volume of 27 r_c^3 of particles must be considered. It is not the best idea, but it is much better than considering the entire domain (as in the long-ranged case), which typically contains O(10^3) cells.

Neighbor lists: The idea here is similar to the above, but more refined. Instead of creating a list for a cluster of particles, we create one for each particle. Thus the idea goes like this: for particle i, make a list of all particles in the system that can interact with i. Normally, particles do not move far within a time step; in order to avoid creating the lists every time step, we record particles in the sphere of radius r_c + \epsilon instead of just r_c. This \epsilon is an adjustable parameter (depending on the physics and the computing resources).

Of course, searching particles in the entire domain to create the lists is just too costly. The best method is to use particle-in-cell to create approximate lists and then to create the neighbor lists from them.
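As a rough illustration of the particle-in-cell screening, the sketch below bins particles into cells of size r_c using a linked-cell (head/next) structure; the fixed particle count, the cubic box of side M*rc with coordinates in [0, M*rc), and the array names are illustrative assumptions.

```c
#define NP   1000        /* assumed number of particles           */
#define M    10          /* assumed cells per dimension (= L/rc)  */
#define NCEL (M * M * M)

/* Linked-cell screening: head[c] is the first particle in cell c,
 * next[i] chains the remaining particles of that cell.
 * Coordinates are assumed to lie in the box [0, M*rc) per dimension. */
void build_cells(const double x[NP][3], double rc,
                 int head[NCEL], int next[NP])
{
    for (int c = 0; c < NCEL; ++c) head[c] = -1;
    for (int i = 0; i < NP; ++i) {
        int cx = (int)(x[i][0] / rc), cy = (int)(x[i][1] / rc),
            cz = (int)(x[i][2] / rc);
        int c  = (cx * M + cy) * M + cz;   /* cell index of particle i  */
        next[i] = head[c];                 /* push i onto cell c's list */
        head[c] = i;
    }
}
```

Candidate interaction partners of a particle are then gathered by scanning its own cell and the 26 neighboring cells, and may be filtered further into a per-particle neighbor list of radius r_c + \epsilon.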


These two methods are very useful; we call them screening methods.

I. Particle decomposition: The particle decomposition method distributes particles uniformly to all processors, as in the long-ranged interaction case (with no consideration of the particles' positions). This method corresponds to a row partition of the force matrix.

Step I:   Screening.
Step II:  Communicate border mesh particles to the relevant processors.
Step III: Update particle positions in all processors.
Step IV:  Scatter the new particles (same particles but new positions).

Remarks

Load balancing requires the system to be of uniform density (although each processor has the same number of particles, each particle may not have the same number of neighbors). A simple way to relax this problem is to re-distribute the particles to processors randomly.
Global communication is needed for screening and for scattering particles after updates.
Simple to implement.

II. Force decomposition: Force decomposition corresponds to a sub-blocking of the force matrix. Each processor is assigned a sub-block F_{\alpha\beta} of the force matrix to evaluate. For a system of N particles on p processors, each force sub-matrix is of size (N/\sqrt{p}) \times (N/\sqrt{p}).

Algorithm
(1) Get ready to receive the message data from the source processor by calling irecv.
(2) Send the message data to the destination processor by calling csend.
(3) Wait for the message (see the MPI sketch below).

Force computing
(1) Screening.
(2) Compute F_{\alpha\beta} and sum over \beta to find the partial force on particles \alpha by particles \beta.
(3) Processor P_{\alpha\beta} collects the partial forces from the row processors to compute the total force on particles \alpha.
(4) Update the N/p particles within group \alpha.
(5) Broadcast the new positions to all other processors.
(6) Repeat Steps 1–4 for the next time step.
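The irecv/csend calls named above belong to an older message-passing library; a hedged MPI rendering of the same receive-send-wait pattern (with illustrative buffer names, counts, and tags) might look like:

```c
#include "mpi.h"

/* MPI rendering of the irecv/csend/wait pattern in the algorithm above.
 * The buffers, count, datatype, tag, and partner ranks are illustrative. */
void exchange(double *sendbuf, double *recvbuf, int count,
              int dest, int source, int tag)
{
    MPI_Request req;
    MPI_Status  status;

    /* (1) Post the receive from the source processor (irecv). */
    MPI_Irecv(recvbuf, count, MPI_DOUBLE, source, tag, MPI_COMM_WORLD, &req);

    /* (2) Send the message data to the destination processor (csend). */
    MPI_Send(sendbuf, count, MPI_DOUBLE, dest, tag, MPI_COMM_WORLD);

    /* (3) Wait for the posted receive to complete. */
    MPI_Wait(&req, &status);
}
```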


Remarks

Load imbalance is again a serious problem because the matrix density


is obviously non-uniform.
No geometric information is needed for decomposition.

Space decomposition: Space decomposition is a trickier method. Basically, we slice the computational domain into p sub-domains. The sub-domain border lines are made to be co-linear with the cell lines. Typically, each cell contains about 10 particles (of course, this is physics-dependent), and the number of cells each sub-domain contains depends on N and p. This method is similar to solving a hyperbolic PDE on parallel processors, i.e. building buffer zones.

(1) Partition particles onto the participating processors according to the particles' positions.
(2) Construct buffer zones for communication. Then communicate the buffer zones to the relevant processors to build the extended sub-domains on each processor.
(3) Compute the forces for particles in each physical box (in parallel) and update their positions. There is no need to update the positions of the particles in the buffer zones.
(4) Check the new positions of the particles which lay in the buffer zone at t_old.
(5) Repeat Steps 2–4.
Remarks

Load imbalance is also a serious problem. To solve this problem, we need to decompose the computational domain into non-uniform sub-domains (and the decomposition is also dynamic). This way, one can minimize the load imbalance effect.
If a reasonable load balance is achieved, this is a very fast method.

10.2. Schrödinger's Equation and Quantum Mechanics

The basic equation by Erwin Schrödinger (1887–1961) in quantum mechanics, which describes the evolution of atomic-scale motions, is given by:

    -\frac{h^2}{8\pi^2 m} \nabla^2 \Psi(r, t) + V \Psi(r, t) = \frac{ih}{2\pi} \frac{\partial \Psi(r, t)}{\partial t}    (10.12)


10.3. Partition Function, DFT and Material Science

The partition function is a central construct in statistics and statistical mechanics. It also serves as a bridge between thermodynamics and quantum mechanics because it is formulated as a sum over the states of a macroscopic system at a given temperature. It is commonly used in condensed-matter physics and in theoretical studies of high Tc superconductivity.

    Z = \sum_j g_j \, e^{-E_j / (kT)}    (10.13)

10.3.1. Materials research


Topics include material property prediction, modeling of new materials, and
superconductivity.
Material property: High-performance computing has provided invaluable assistance in improving our understanding of the atomic nature of materials. A selected list of such materials includes semiconductors, such as silicon and gallium arsenide, and superconductors such as the copper oxide ceramics that have recently been shown to conduct electricity at extremely high temperatures.

Superconductivity: The discovery of high-temperature superconductivity in 1986 has provided the potential for spectacular energy-efficient power transmission technologies, ultra-sensitive instrumentation, and devices using phenomena unique to superconductivity. The materials supporting high-temperature superconductivity are difficult to form, stabilize, and use, and the basic properties of the superconductors must be elucidated through a vigorous fundamental research program.
Semiconductor device design: As intrinsically faster materials such as gallium arsenide are used, a fundamental understanding is required of how they operate and how to change their characteristics. Essential understanding of overlay formation, trapped structural defects, and the effect of lattice mismatch on properties is needed. Currently, it is possible to simulate electronic properties for simple regular systems; however, materials with defects and mixed atomic constituents are beyond present capabilities. Simulating systems with 100 million particles is possible on the largest parallel computers, such as the 2000-node Intel Paragon.
Nuclear fusion: Development of controlled nuclear fusion requires understanding the behavior of fully ionized gases (plasmas) at very high temperatures under the influence of strong magnetic fields in complex 3D geometries.
10.4. Maxwell's Equations and Electrical Engineering

The equations by James Clerk Maxwell (1831–1879) describe the relationship between electric and magnetic fields at any point in space as a function of the charge and electric current densities at that point. The wave equations for the propagation of light can be derived from them, and they are the basic equations of classical electrodynamics. They are used in the study of most electromagnetic phenomena, including plasma physics as well as the earth's magnetic field and its interactions with other fields in the cosmos.
    \nabla \times E = -\frac{\partial B}{\partial t}
    \nabla \cdot D = \rho
    \nabla \times H = \frac{\partial D}{\partial t} + J
    \nabla \cdot B = 0    (10.14)

10.4.1. Helmholtz equation


This equation, discovered by Hermann von Helmholtz (1821–1894), is used in acoustics and electromagnetism. It is also used in the study of vibrating membranes.

    \nabla^2 u + \lambda u = f    (10.15)

10.4.2. Electrical engineering


Topics include electromagnetic scattering, wireless communication, and
antenna design.
10.5. Diffusion Equation and Mechanical Engineering

This is the equation that describes the distribution of a certain field variable, such as temperature, as a function of space and time. It is used to describe physical, chemical, and reaction-diffusion systems, as well as some biological phenomena.

    \nabla^2 u = \frac{\partial u}{\partial t}    (10.16)


Topics include structural analysis, combustion simulation, and vehicle


simulation.
Vehicle dynamics: Analysis of the aero-elastic behavior of vehicles, and
the stability and ride analysis of vehicles are critical assessments of land
and air vehicle performance and life cycle.
Combustion systems: Attaining significant improvements in combustion efficiencies requires understanding the interplay between the flows of the various substances involved and the quantum chemistry that causes those substances to react. In some complicated cases, the quantum chemistry is beyond the reach of current supercomputers.
10.6. Navier-Stokes Equation and CFD

The Navier-Stokes equation, developed by Claude Louis Marie Navier (1785–1836) and Sir George Gabriel Stokes (1819–1903), is the primary equation of computational fluid dynamics. It relates the pressure and external forces acting on a fluid to the response of the fluid flow. Forms of this equation are used in computations for aircraft and ship design, weather prediction, and climate modeling.

    \frac{\partial u}{\partial t} + (u \cdot \nabla) u = -\frac{1}{\rho} \nabla p + \nu \nabla^2 u + \frac{1}{\rho} F    (10.17)

Aircraft design, air-breathing propulsion, and advanced sensors are some examples.
10.7. Other Applications
10.7.1. Astronomy
Data volumes generated by very large array or very long baseline array
radio telescopes currently overwhelm the available computational resources.
Greater computational power will signicantly enhance their usefulness in
exploring important problems in radio astronomy.

10.7.2. Biological engineering


Current areas of research include simulation of genetic compounds, neural
networks, structural biology, conformation, drug design, protein folding, and
the human genome.


Structural biology: The function of biologically important molecules


can be simulated by computationally intensive Monte Carlo methods in
combination with crystallographic data derived from nuclear magnetic
resonance measurements. Molecular dynamics methods are required for
the time dependent behavior of such macromolecules. The determination,
visualization, and analysis of these 3D structures are essential to the
understanding of the mechanisms of enzymic catalysis, recognition of nucleic
acids by proteins, antibody and antigen binding, and many other dynamic
events central to cell biology.
Design of drugs: Predictions of the folded conformation of proteins and
of RNA molecules by computer simulation are rapidly becoming accepted
as useful, and sometimes as a primary tool in understanding the properties
required in drug design.
Human genome: Comparison of normal and pathological molecular sequences is currently our most revealing computational method for understanding genomes and the molecular basis for disease. To benefit from the entire sequence of a single human will require capabilities for more than three billion sub-genomic units, as contrasted with the 10 to 200,000 units of typical viruses.
10.7.3. Chemical engineering
Topics include polymer simulation, reaction rate prediction, and chemical vapor deposition for thin-films.
10.7.4. Geosciences
Topics include oil and seismic exploration, enhanced oil and gas recovery.
Enhanced oil and gas recovery: This challenge has two parts. First, one needs to locate the estimated billions of barrels of oil reserves on the earth, and then to devise economic ways of extracting as much of it as possible. Thus, improved seismic analysis techniques, in addition to improved understanding of fluid flow through geological structures, are required.
10.7.5. Meteorology

Topics include prediction of weather, climate, typhoons, and global change. The aim here is to understand the coupled atmosphere-ocean-biosphere system in enough detail to be able to make long-range predictions about its behavior. Applications include understanding dynamics in the atmosphere, ozone depletion, climatological perturbations owing to man-made releases of chemicals or energy into one of the component systems, and detailed predictions of conditions in support of military missions.

10.7.6. Oceanography

Ocean sciences: The objective is to develop a global ocean predictive model incorporating temperature, chemical composition, circulation, and coupling to the atmosphere and other oceanographic features. This ocean model will couple to models of the atmosphere in the effort on global weather and will have specific implications for physical oceanography as well.


APPENDIX A
MPI

A.1. An MPI Primer


A.1.1. What is MPI?
MPI is the standard for multi-computer and cluster message passing introduced by the Message-Passing Interface Forum in April 1994. The goal of MPI is to develop a widely used standard for writing message-passing programs.

Like other portability helpers, MPI acts as an intermediary between the parallel programming language and the specific running environment, and it is more widely accepted as a portability standard. MPI has been implemented by various vendors on different platforms, and suitable implementations of MPI can be found for most distributed-memory systems. To the programmer, MPI appears in the form of libraries for FORTRAN or C family languages. Message passing is realized by an MPI routine (or function) call.

A.1.2. Historical perspective


Many developers contributed to the MPI library, as shown in Fig. A.1.

A.1.3. Major MPI issues


Process creation and management: This discusses the extension of MPI to remove the static process model in MPI. It defines routines that allow for the creation of processes.

Cooperative and one-sided communications: In one-sided communication, one process alone performs the transfer of data (the opposite of cooperative communication). This defines communication routines that can be completed by a single process, including shared-memory operations (put/get) and remote accumulate operations.


Fig. A.1.

Evolution and key components of MPI.

Extended collective operations: This extends the semantics of MPI-1 collective operations to include intercommunicators. It also adds more convenient methods of constructing intercommunicators and two new collective operations.

External interfaces: This defines routines designed to allow developers to layer on top of MPI. This includes generalized requests, routines that decode MPI opaque objects, and threads.

I/O: MPI-2 support for parallel I/O.

Language bindings: C++ bindings and FORTRAN-90 issues.

A.1.4. Concepts and terminology for message passing

Distributed memory: Every processor has its own local memory which can be accessed directly only by its own CPU. Transfer of data from one processor to another is performed over a network. This differs from shared-memory systems, which permit multiple processors to directly access the same memory resource via a memory bus (Fig. A.2).

Fig. A.2. This figure illustrates the structure of a distributed-memory system.

Message passing: This is the method by which data from one processor's memory is copied to the memory of another processor. In distributed-memory systems, data is generally sent as packets of information over a network from one processor to another. A message may consist of one or more packets, and usually includes routing and/or other control information.

Process: A process is a set of executable instructions (a program) which runs on a processor. One or more processes may execute on a processor. In a message passing system, all processes communicate with each other by sending messages, even if they are running on the same processor. For reasons of efficiency, however, message passing systems generally associate only one process per processor.

Message passing library: This usually refers to a collection of routines which are embedded in application code to accomplish send, receive and other message passing operations.

Send/Receive: Message passing involves the transfer of data from one process (send) to another process (receive) and requires the cooperation of both the sending and receiving process. Send operations usually require the sending process to specify the data's location, size, type and the destination. Receive operations should match a corresponding send operation.

Synchronous/Asynchronous: Synchronous send operations complete only after acknowledgement that the message was safely received by the receiving process. Asynchronous send operations may complete even though the receiving process has not actually received the message.


Application buffer: This is the address space that holds the data which is to be sent or received. For example, suppose your program uses a variable called inmsg. The application buffer for inmsg is the program memory location where the value of inmsg resides.

System buffer: System space for storing messages. Depending upon the type of send/receive operation, data in the application buffer may be required to be copied to/from system buffer space. This allows communication to be asynchronous.

Blocking communication: A communication routine is blocking if the completion of the call is dependent on certain events. For sends, the data must be successfully sent or safely copied to system buffer space so that the application buffer that contained the data is available for reuse. For receives, the data must be safely stored in the receive buffer so that it is ready for use.

Non-blocking communication: A communication routine is non-blocking if the call returns without waiting for any communication events to complete (such as copying of the message from user memory to system memory or arrival of the message). It is not safe to modify or use the application buffer after completion of a non-blocking send; it is the programmer's responsibility to ensure that the application buffer is free for reuse. Non-blocking communications are primarily used to overlap computation with communication to provide gains in performance.

Communicators and groups: MPI uses objects called communicators and groups to define which collection of processes may communicate with each other. Most MPI routines require us to specify a communicator as an argument. Communicators and groups will be covered in more detail later. For now, simply use MPI COMM WORLD whenever a communicator is required; it is the predefined communicator which includes all of our MPI processes.

Rank: Within a communicator, every process has its own unique integer identifier assigned by the system when the process initializes. A rank is sometimes also called a process ID. Ranks are contiguous and begin at zero. It is used by the programmer to specify the source and destination of messages, and is often used conditionally by the application to control program execution. For example,

rank = 0, do this
rank = 1, do that


Message attributes

(1) The envelope.
(2) Rank of destination.
(3) Message tag.
    (a) An ID for a particular message, to be matched by both sender and receiver.
    (b) It is like sending multiple gifts to your friend; you need to identify them.
    (c) MPI TAG UB ≥ 32767.
    (d) Similar in functionality to comm to group messages.
    (e) comm is safer than tag, but tag is more convenient.
(4) Communicator.
(5) The data.
(6) Initial address of send buffer.
(7) Number of entries to send.
(8) Datatype of each entry.
    (a) MPI INTEGER
    (b) MPI REAL
    (c) MPI DOUBLE PRECISION
    (d) MPI COMPLEX
    (e) MPI LOGICAL
    (f) MPI BYTE
    (g) MPI INT
    (h) MPI CHAR
    (i) MPI FLOAT
    (j) MPI DOUBLE
A.1.5. Using the MPI programming language

All MPI implementations nowadays support FORTRAN and C. Implementations of MPI-2 also support C++ and FORTRAN 90. Any one of these languages can be used with the MPI library to produce parallel programs. However, there are a few differences in the MPI calls between the FORTRAN and C families.

Header file:
    C:        #include "mpi.h"
    FORTRAN:  include 'mpif.h'


MPI call format:

C:        rc = MPI Xxxxx(parameter, ...)
          Example: rc = MPI Bsend(&buf, count, type, dest, tag, comm)
          Error code: returned as rc; MPI SUCCESS if successful.

FORTRAN:  CALL MPI XXXXX(parameter, ..., ierr)
          Example: CALL MPI BSEND(buf, count, type, dest, tag, comm, ierr)
          Error code: returned as the ierr parameter; MPI SUCCESS if successful.

Program structure: The general structure is the same in both language families. These four components must be included in the program:

(1) Include the MPI header.
(2) Initialize the MPI environment.
(3) Normal language sentences and MPI calls.
(4) Terminate the MPI environment.

A.1.6. Environment management routines


Several of the more commonly used MPI environment management routines
are described below:
MPI Init
MPI Init initializes the MPI execution environment. This function must be
called in every MPI program, before any other MPI functions, and must
be called only once in an MPI program. For C programs, MPI Init may be
used to pass the command line arguments to all processes, although this is
not required by the standard and is implementation dependent.
MPI Init(argc, argv)
MPI INIT(ierr)
MPI Comm size
This determines the number of processes in the group associated with
a communicator. It is generally used within the communicator MPI COMM
WORLD to determine the number of processes being used by our application.
MPI Comm size(comm,size)
MPI COMM SIZE(comm,size,ierr)


MPI Comm rank


MPI Comm rank determines the rank of the calling process within the communicator. Initially, each process will be assigned a unique integer rank between 0 and P−1 within the communicator MPI COMM WORLD. This rank is often referred to as a task ID. If a process becomes associated with other communicators, it will have a unique rank within each of these as well.
MPI Comm rank(comm,rank)
MPI COMM RANK(comm,rank,ierr)
MPI Abort
This function terminates all MPI processes associated with a communicator. In most MPI implementations, it terminates all processes regardless of the communicator specified.
MPI Abort(comm,errorcode)
MPI ABORT(comm,errorcode,ierr)
MPI Get processor name
This gets the name of the processor on which the command is executed. It also returns the length of the name. The buffer for name must be at least MPI MAX PROCESSOR NAME characters in size. What is returned into name is implementation dependent; it may not be the same as the output of the hostname or host shell commands.
MPI Get processor name(name,resultlength)
MPI GET PROCESSOR NAME(name, resultlength, ierr)
MPI Initialized
MPI Initialized indicates whether MPI Init has been called and returns a flag as either true (1) or false (0). MPI requires that MPI Init be called once and only once by each process. This may pose a problem for modules that want to use MPI and are prepared to call MPI Init if necessary. MPI Initialized solves this problem.
MPI Initialized(flag)
MPI INITIALIZED(flag,ierr)
MPI Wtime
This returns an elapsed wall clock time in seconds (double precision) on the
calling processor.


MPI Wtime()
MPI WTIME()
MPI Wtick
This returns the resolution in seconds (double precision) of MPI Wtime.
MPI Wtick()
MPI WTICK()
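As a small, hedged illustration of how MPI Wtime and MPI Wtick might be used together to time a code region, the complete program below times a placeholder loop; the workload and print format are illustrative only.

```c
#include <stdio.h>
#include "mpi.h"

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);

    double t0 = MPI_Wtime();
    double s = 0.0;
    for (long i = 1; i <= 10000000L; i++)   /* placeholder work to time */
        s += 1.0 / (double)i;
    double t1 = MPI_Wtime();

    printf("sum = %f, elapsed = %f s, clock resolution = %g s\n",
           s, t1 - t0, MPI_Wtick());

    MPI_Finalize();
    return 0;
}
```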
MPI Finalize
Terminates the MPI execution environment. This function should be the
last MPI routine called in every MPI program, no other MPI routines may
be called after it.
MPI Finalize()
MPI FINALIZE(ierr)
Examples: Figures A.3 and A.4 provide some simple examples of environment management routine calls.

#include"mpi.h"
#include<stdio.h>
int main(int argc, char *argv[]){
int numtasks, rank, rc;
rc = MPI_Init(&argc, &argv);
if (rc != 0){
printf("Error starting MPI program. Terminating. \n");
MPI_Abort(MPI_COMM_WORLD, rc);
}
MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
printf("Number of tasks= %d My rank= %d\n", numtasks, rank);
/******* do some work *******/

MPI_Finalize();
return 0;
}
Fig. A.3.

A simple example of environment management in C.

Fig. A.4. A simple example of environment management in FORTRAN.

A.1.7. Point to point communication routines


Blocking message passing routines

The more commonly used MPI blocking message passing routines are described below.

MPI Send
As a basic blocking send operation, this routine returns only after the application buffer in the sending task is free for reuse. Note that this routine may be implemented differently on different systems. The MPI standard permits the use of a system buffer but does not require it. Some implementations may actually use a synchronous send (discussed below) to implement the basic blocking send.
MPI Send(buf,count,datatype,dest,tag,comm)
MPI SEND(buf,count,datatype,dest,tag,comm,ierr)

Buffer: This is the program (application) address space which references the data that is to be sent or received. In most cases, this is simply the variable name that is being sent/received. For C programs, this argument is passed by reference and usually must be prepended with an ampersand (&var1).

Data count: Indicates the number of data elements of a particular type to be sent.
Data type: For reasons of portability, MPI predefines its data types. Programmers may also create their own data types (derived types). Note that the MPI types MPI BYTE and MPI PACKED do not correspond to standard C or FORTRAN types. Table A.1 lists the MPI data types for both C and FORTRAN.

Table A.1. MPI data types.

MPI C data types:
    MPI CHAR              signed char
    MPI SHORT             signed short int
    MPI INT               signed int
    MPI LONG              signed long int
    MPI UNSIGNED CHAR     unsigned char
    MPI UNSIGNED SHORT    unsigned short int
    MPI UNSIGNED          unsigned int
    MPI UNSIGNED LONG     unsigned long int
    MPI FLOAT             float
    MPI DOUBLE            double
    MPI LONG DOUBLE       long double
    MPI BYTE              8 binary digits
    MPI PACKED            data packed or unpacked with MPI Pack()/MPI Unpack()

MPI FORTRAN data types:
    MPI CHARACTER         character(1)
    MPI INTEGER           integer
    MPI REAL              real
    MPI DOUBLE PRECISION  double precision
    MPI COMPLEX           complex
    MPI LOGICAL           logical
    MPI BYTE              8 binary digits
    MPI PACKED            data packed or unpacked with MPI Pack()/MPI Unpack()
Destination: This is an argument to send routines which indicates the process where a message should be delivered. It is specified as the rank of the receiving process.

Source: For MPI Recv, there is an argument source corresponding to destination in MPI Send. This is an argument to receive routines which indicates the originating process of the message. Specified as the rank of the sending process, it may be set to the wild card MPI ANY SOURCE to receive a message from any task.

Tag: An arbitrary, non-negative integer assigned by the programmer to uniquely identify a message. Send and receive operations should match message tags. For a receive operation, the wild card MPI ANY TAG can be used to receive any message regardless of its tag. The MPI standard guarantees that integers from the range [0, 32767] can be used as tags, but most implementations allow a much larger range.

Communicator: Indicates the communication context, or the set of processes for which the source or destination fields are valid. Unless the programmer is explicitly creating new communicators, the predefined communicator MPI COMM WORLD is usually used.
Status: For a receive operation, this indicates the source of the message and the tag of the message. In C, this argument is a pointer to a predefined structure MPI Status (e.g., stat.MPI SOURCE, stat.MPI TAG). In FORTRAN, it is an integer array of size MPI STATUS SIZE (e.g., stat(MPI SOURCE), stat(MPI TAG)). Additionally, the actual number of bytes received is obtainable from status via the MPI Get count routine.

Request: This is used by non-blocking send and receive operations. Since non-blocking operations may return before the requested system buffer space is obtained, the system issues a unique request number. The programmer uses this system-assigned handle later (in a WAIT type routine) to determine completion of the non-blocking operation. In C, this argument is a pointer to a predefined structure MPI Request. In FORTRAN, it is an integer.
MPI Recv
Receive a message and block until the requested data is available in the application buffer in the receiving task.
MPI Recv(buf,count,datatype,source,tag,comm,status)
MPI RECV(buf,count,datatype,source,tag,comm,status,ierr)

MPI Ssend
Synchronous blocking send: Send a message and block until the application buffer in the sending task is free for reuse and the destination process has started to receive the message.
MPI Ssend(buf,count,datatype,dest,tag,comm)
MPI SSEND(buf,count,datatype,dest,tag,comm,ierr)

MPI Bsend
Buffered blocking send: Permits the programmer to allocate the required amount of buffer space into which data can be copied until it is delivered. It insulates against the problems associated with insufficient system buffer space. The routine returns after the data has been copied from the application buffer space to the allocated send buffer. It must be used with the MPI Buffer attach routine.
MPI Bsend(buf,count,datatype,dest,tag,comm)
MPI BSEND(buf,count,datatype,dest,tag,comm,ierr)


MPI Buffer attach, MPI Buffer detach
Used by the programmer to allocate/deallocate message buffer space to be used by the MPI Bsend routine. The size argument is specified in actual data bytes, not a count of data elements. Only one buffer can be attached to a process at a time. Note that the IBM implementation uses MPI BSEND OVERHEAD bytes of the allocated buffer for overhead.
MPI Buffer attach(buffer,size)
MPI Buffer detach(buffer,size)
MPI BUFFER ATTACH(buffer,size,ierr)
MPI BUFFER DETACH(buffer,size,ierr)

MPI Rsend
A blocking ready send; it should only be used if the programmer is certain that the matching receive has already been posted.
MPI Rsend(buf,count,datatype,dest,tag,comm)
MPI RSEND(buf,count,datatype,dest,tag,comm,ierr)

MPI Sendrecv
Send a message and post a receive before blocking. This will block until the sending application buffer is free for reuse and until the receiving application buffer contains the received message.
MPI Sendrecv(sendbuf,sendcount,sendtype,dest,sendtag,
    recvbuf,recvcount,recvtype,source,recvtag,comm,status)
MPI SENDRECV(sendbuf,sendcount,sendtype,dest,sendtag,
    recvbuf,recvcount,recvtype,source,recvtag,comm,status,ierr)
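As a hedged illustration, the complete program below uses MPI Sendrecv to shift an integer around a ring of processes, a pattern similar to the long-range force algorithm of Chapter 10; the variable names and tag value are illustrative.

```c
#include <stdio.h>
#include "mpi.h"

int main(int argc, char *argv[])
{
    int rank, size, token, incoming;
    MPI_Status stat;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int right = (rank + 1) % size;          /* destination                   */
    int left  = (rank - 1 + size) % size;   /* source                        */
    token = rank;                           /* each task circulates its rank */

    /* Simultaneously send to the right and receive from the left;
       MPI_Sendrecv avoids the deadlock a naive send/recv pair could cause. */
    MPI_Sendrecv(&token,    1, MPI_INT, right, 99,
                 &incoming, 1, MPI_INT, left,  99,
                 MPI_COMM_WORLD, &stat);

    printf("Task %d received %d from task %d\n", rank, incoming, left);

    MPI_Finalize();
    return 0;
}
```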
MPI Probe
Performs a blocking test for a message. The wild cards MPI ANY SOURCE
and MPI ANY TAG may be used to test for a message from any source
or with any tag. For the C routine, the actual source and tag will be
returned in the status structure as status.MPI SOURCE and status.MPI TAG.
For the FORTRAN routine, they will be returned in the integer array
status(MPI SOURCE) and status(MPI TAG).
MPI Probe(source,tag,comm,status)
MPI PROBE(source,tag,comm,status,ierr)
Examples: Figures A.5 and A.6 provide some simple examples of blocking message passing routine calls.


#include "mpi.h"
#include <stdio.h>
int main(int argc, char *argv[]){
int numtasks, rank, dest, source, rc, tag=1;
char inmsg, outmsg='x';
MPI_Status Stat;
MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
if (rank == 0){
dest = 1;
source = 1;
rc = MPI_Send(&outmsg, 1, MPI_CHAR, dest, tag,
MPI_COMM_WORLD);
rc = MPI_Recv(&inmsg, 1, MPI_CHAR, source, tag,
MPI_COMM_WORLD, &Stat);
}
else if (rank == 1){
dest = 0;
source = 0;
rc = MPI_Recv(&inmsg, 1, MPI_CHAR, source, tag,
MPI_COMM_WORLD, &Stat);
rc = MPI_Send(&outmsg, 1, MPI_CHAR, dest, tag,
MPI_COMM_WORLD);
}
MPI_Finalize();
return 0;
}
Fig. A.5. A simple example of blocking message passing in C. Task 0 pings task 1 and
awaits a return ping.

Fig. A.6. A simple example of blocking message passing in FORTRAN.

Non-blocking message passing routines

The more commonly used MPI non-blocking message passing routines are described below.

MPI Isend
This identifies an area in memory to serve as a send buffer. Processing continues immediately without waiting for the message to be copied out from the application buffer. A communication request handle is returned for handling the pending message status. The program should not modify the application buffer until subsequent calls to MPI Wait or MPI Test indicate that the non-blocking send has completed.
MPI Isend(buf,count,datatype,dest,tag,comm,request)
MPI ISEND(buf,count,datatype,dest,tag,comm,request,ierr)
MPI Irecv
This identifies an area in memory to serve as a receive buffer. Processing continues immediately without actually waiting for the message to be received and copied into the application buffer. A communication request handle is returned for handling the pending message status. The program must use calls to MPI Wait or MPI Test to determine when the non-blocking receive operation completes and the requested message is available in the application buffer.
MPI Irecv(buf,count,datatype,source,tag,comm,request)
MPI IRECV(buf,count,datatype,source,tag,comm,request,ierr)


MPI Issend
Non-blocking synchronous send. Similar to MPI Isend(), except MPI
Wait() or MPI Test() indicates when the destination process has received
the message.
MPI Issend(buf,count,datatype,dest,tag,comm,request)
MPI ISSEND(buf,count,datatype,dest,tag,comm,request,ierr)
MPI Ibsend
A non-blocking buffered send that is similar to MPI Bsend() except that MPI Wait() or MPI Test() indicates when the destination process has received the message. It must be used with the MPI Buffer attach routine.
MPI Ibsend(buf,count,datatype,dest,tag,comm,request)
MPI IBSEND(buf,count,datatype,dest,tag,comm,request,ierr)
MPI Irsend
Non-blocking ready send is similar to MPI Rsend() except MPI Wait()
or MPI Test() indicates when the destination process has received the
message. This function should only be used if the programmer is certain
that the matching receive has already been posted.
MPI Irsend(buf,count,datatype,dest,tag,comm,request)
MPI IRSEND(buf,count,datatype,dest,tag,comm,request,ierr)
MPI Test, MPI Testany, MPI Testall, MPI Testsome
MPI Test checks the status of a specified non-blocking send or receive
operation. The flag parameter is returned logical true (1) if the operation
has completed, and logical false (0) if not. For multiple non-blocking
operations, the programmer can specify any, all or some completions.
MPI Test(request,flag,status)
MPI Testany(count,array of requests,index,flag,status)
MPI Testall(count,array of requests,flag,

array of statuses)
MPI Testsome(incount,array of requests,outcount,

array of offsets, array of statuses)


MPI TEST(request,flag,status,ierr)
MPI TESTANY(count,array of requests,index,flag,status,ierr)
MPI TESTALL(count,array of requests,flag,array of statuses,
ierr)


MPI TESTSOME(incount,array of requests,outcount,


array of offsets, array of statuses,ierr)
MPI Wait, MPI Waitany, MPI Waitall, MPI Waitsome
MPI Wait blocks until a specified non-blocking send or receive operation
has completed. For multiple non-blocking operations, the programmer can
specify any, all or some completions.
MPI Wait(request,status)
MPI Waitany(count,array of requests,index,status)
MPI Waitall(count,array of requests,array of statuses)
MPI Waitsome(incount,array of requests,outcount,
array of offsets,array of statuses)
MPI WAIT(request,status,ierr)
MPI WAITANY(count,array of requests,index,status,ierr)
MPI WAITALL(count,array of requests,array of statuses,ierr)
MPI WAITSOME(incount,array of requests,outcount,
array of offsets, array of statuses,ierr)

MPI Iprobe
Performs a non-blocking test for a message. The wildcards MPI ANY SOURCE
and MPI ANY TAG may be used to test for a message from any source or with
any tag. The integer flag parameter is returned logical true (1) if a message
has arrived, and logical false (0) if not. For the C routine, the actual source
and tag will be returned in the status structure as status.MPI SOURCE and
status.MPI TAG. For the FORTRAN routine, they will be returned in the
integer array status(MPI SOURCE) and status(MPI TAG).
MPI Iprobe(source,tag,comm,flag,status)
MPI IPROBE(source,tag,comm,flag,status,ierr)
Examples: Figures A.7 and A.8 provide some simple examples of non-blocking
message passing routine calls.
A.1.8. Collective communication routines
Collective communication involves all processes in the scope of the
communicator. All processes are, by default, members of the communicator
MPI COMM WORLD.
There are three types of collective operations:
(1) Synchronization: Processes wait until all members of the group have
reached the synchronization point.


#include "mpi.h"
#include <stdio.h>
int main(int argc, char *argv[]){
int numtasks, rank, next, prev, buf[2], tag1=1, tag2=2;
MPI_Request reqs[4];
MPI_Status stats[4];
MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
prev = rank-1;
next = rank+1;
if (rank == 0) prev = numtasks - 1;
if (rank == (numtasks - 1)) next = 0;
MPI_Irecv(&buf[0], 1, MPI_INT, prev, tag1, MPI_COMM_WORLD,
&reqs[0]);
MPI_Irecv(&buf[1], 1, MPI_INT, next, tag2, MPI_COMM_WORLD,
&reqs[1]);
MPI_Isend(&rank, 1, MPI_INT, prev, tag2, MPI_COMM_WORLD,
&reqs[2]);
MPI_Isend(&rank, 1, MPI_INT, next, tag1, MPI_COMM_WORLD,
&reqs[3]);
MPI_Waitall(4, reqs, stats);
MPI_Finalize();
return 0;
}

Fig. A.7. A simple example of non-blocking message passing in C; this code represents
a nearest neighbor exchange in a ring topology.
Fig. A.8. A simple example of non-blocking message passing in FORTRAN.

(2) Data movement: Broadcast, scatter/gather, all to all.


(3) Collective computation (reductions): One member of the group
collects data from the other members and performs an operation
(min, max, add, multiply, etc.) on that data.
Collective operations are blocking. Collective communication routines do
not take message tag arguments. Collective operations within subsets of
processes are accomplished by first partitioning the subsets into new
groups and then attaching the new groups to new communicators (discussed
later). Finally, work with MPI defined datatypes, not with derived types.
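For illustration, a minimal sketch of the group-and-communicator mechanism just
mentioned is shown below. It is not one of the figures of this appendix; the names
even_ranks, even_group and even_comm are illustrative. It builds a communicator
containing only the even-ranked processes and performs a broadcast within it.

#include "mpi.h"
#include <stdio.h>
int main(int argc, char *argv[]){
int rank, size, i, nranks = 0, data = 0;
int even_ranks[128];                      /* assumes at most 256 processes */
MPI_Group world_group, even_group;
MPI_Comm even_comm;
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);
/* collect the even ranks into a list */
for (i = 0; i < size; i += 2) even_ranks[nranks++] = i;
/* partition: form a group of the even ranks, then attach it to a new communicator */
MPI_Comm_group(MPI_COMM_WORLD, &world_group);
MPI_Group_incl(world_group, nranks, even_ranks, &even_group);
MPI_Comm_create(MPI_COMM_WORLD, even_group, &even_comm);
/* collective call restricted to the new communicator */
if (even_comm != MPI_COMM_NULL) {
if (rank == 0) data = 99;
MPI_Bcast(&data, 1, MPI_INT, 0, even_comm);
MPI_Comm_free(&even_comm);
}
MPI_Group_free(&even_group);
MPI_Group_free(&world_group);
MPI_Finalize();
return 0;
}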
MPI Barrier


Creates a barrier synchronization in a group. Each task, when reaching


the MPI Barrier call, blocks until all tasks in the group reach the same
MPI Barrier call.
MPI Barrier(comm)
MPI BARRIER(comm,ierr)
MPI Bcast
Broadcasts (sends) a message from the process with rank root to all other
processes in the group.
MPI Bcast(buffer,count,datatype,root,comm)
MPI BCAST(buffer,count,datatype,root,comm,ierr)


MPI Scatter
Distributes distinct messages from a single source task to each task in the
group.
MPI Scatter(sendbuf,sendcnt,sendtype,recvbuf,recvcnt,
recvtype,root,comm)
MPI SCATTER(sendbuf,sendcnt,sendtype,recvbuf,recvcnt,
recvtype,root,comm,ierr)
MPI Gather
Gathers distinct messages from each task in the group to a single destination
task. This routine is the reverse operation of MPI Scatter.
MPI Gather(sendbuf,sendcnt,sendtype,recvbuf,recvcount,
recvtype,root,comm)
MPI GATHER(sendbuf,sendcnt,sendtype,recvbuf,recvcount,
recvtype,root,comm,ierr)
MPI Allgather
Concatenation of data to all tasks in a group. Each task in the group, in
effect, performs a one-to-all broadcasting operation within the group.
MPI Allgather(sendbuf,sendcount,sendtype,recvbuf,
recvcount,recvtype,comm)
MPI ALLGATHER(sendbuf,sendcount,sendtype,recvbuf,
recvcount,recvtype,comm,ierr)
MPI Reduce
Applies a reduction operation on all tasks in the group and places the result
in one task. The predefined MPI reduction operations are listed in Table A.2.
MPI Reduce(sendbuf,recvbuf,count,datatype,op,root,comm)
MPI REDUCE(sendbuf,recvbuf,count,datatype,op,root,comm,ierr)
MPI Allreduce
Applies a reduction operation and places the result in all tasks in the group.
This is equivalent to an MPI Reduce followed by an MPI Bcast.
MPI Allreduce(sendbuf,recvbuf,count,datatype,op,comm)
MPI ALLREDUCE(sendbuf,recvbuf,count,datatype,op,comm,ierr)
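A minimal sketch of how MPI Reduce and MPI Allreduce are typically called is given
below. It is an illustrative example, not one of the figures of this appendix; each task
simply contributes its own rank to a global sum.

#include "mpi.h"
#include <stdio.h>
int main(int argc, char *argv[]){
int rank, numtasks;
double mysum, total;
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
mysum = (double)rank;                /* each task contributes its rank */
/* sum over all tasks; only root 0 receives the result */
MPI_Reduce(&mysum, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
if (rank == 0) printf("Reduce: total = %f\n", total);
/* same reduction, but every task receives the result */
MPI_Allreduce(&mysum, &total, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
printf("Allreduce on rank %d: total = %f\n", rank, total);
MPI_Finalize();
return 0;
}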


Table A.2. The predefined MPI reduction operations. Users can also define their
own reduction functions by using the MPI Op create routine.

MPI reduction operation              C data types                  FORTRAN data types

MPI MAX (maximum)                    integer, float                integer, real, complex
MPI MIN (minimum)                    integer, float                integer, real, complex
MPI SUM (sum)                        integer, float                integer, real, complex
MPI PROD (product)                   integer, float                integer, real, complex
MPI LAND (logical AND)               integer                       logical
MPI BAND (bit-wise AND)              integer, MPI BYTE             integer, MPI BYTE
MPI LOR (logical OR)                 integer                       logical
MPI BOR (bit-wise OR)                integer, MPI BYTE             integer, MPI BYTE
MPI LXOR (logical XOR)               integer                       logical
MPI BXOR (bit-wise XOR)              integer, MPI BYTE             integer, MPI BYTE
MPI MAXLOC (max value and location)  float, double, long double    real, complex, double precision
MPI MINLOC (min value and location)  float, double, long double    real, complex, double precision

MPI Reduce scatter


First does an element-wise reduction on a vector across all tasks in the
group. Next, the result vector is split into disjoint segments and distributed
across the tasks. This is equivalent to an MPI Reduce followed by an
MPI Scatter operation.
MPI Reduce scatter(sendbuf,recvbuf,recvcount,
datatype,op,comm)
MPI REDUCE SCATTER(sendbuf,recvbuf,recvcount,
datatype,op,comm,ierr)
MPI Alltoall
Each task in a group performs a scatter operation, sending a distinct
message to all the tasks in the group in order by index.
MPI Alltoall(sendbuf,sendcount,sendtype,recvbuf,
recvcnt,recvtype,comm)
MPI ALLTOALL(sendbuf,sendcount,sendtype,recvbuf,
recvcnt,recvtype,comm,ierr)
MPI Scan
Performs a scan operation with respect to a reduction operation across a
task group.


#include "mpi.h"
#include <stdio.h>
#define SIZE 4
int main(int argc, char *argv[]){
int numtasks, rank, sendcount, recvcount, source;
float sendbuf[SIZE][SIZE] = {
{ 1.0, 2.0, 3.0, 4.0},
{ 5.0, 6.0, 7.0, 8.0},
{ 9.0, 10.0, 11.0, 12.0},
{13.0, 14.0, 15.0, 16.0} };
float recvbuf[SIZE];
MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
if (numtasks == SIZE) {
source = 1;
sendcount = SIZE;
recvcount = SIZE;
MPI_Scatter(sendbuf, sendcount, MPI_FLOAT, recvbuf,
recvcount, MPI_FLOAT, source, MPI_COMM_WORLD);
printf("rank = %d Results: %f %f %f %f\n", rank,
recvbuf[0], recvbuf[1], recvbuf[2], recvbuf[3]);
}
else
printf("Must specify %d processors, Terminating.\n",
SIZE);
MPI_Finalize();
}
Fig. A.9. A simple example of collective communications in C; this code represents a
scatter operation on the rows of an array.

MPI Scan(sendbuf,recvbuf,count,datatype,op,comm)
MPI SCAN(sendbuf,recvbuf,count,datatype,op,comm,ierr)
Examples: Figures A.10 and A.11 provide some simple examples of collective
communications.

A.2. Examples of Using MPI


A.2.1. Hello from all processes
Compiling hello.c Fig. A.12
>mpicc -o hello hello.c


      program scatter
      include 'mpif.h'

      integer SIZE
      parameter (SIZE=4)
      integer numtasks, rank, sendcount
      integer recvcount, source, ierr
      real*4 sendbuf(SIZE,SIZE), recvbuf(SIZE)

c     Fortran stores this array in column major order, so the
c     scatter will actually scatter columns, not rows
      data sendbuf /  1.0,  2.0,  3.0,  4.0,
     &                5.0,  6.0,  7.0,  8.0,
     &                9.0, 10.0, 11.0, 12.0,
     &               13.0, 14.0, 15.0, 16.0 /

      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, numtasks, ierr)

      if (numtasks .eq. SIZE) then
         source = 1
         sendcount = SIZE
         recvcount = SIZE
         call MPI_SCATTER(sendbuf, sendcount, MPI_REAL,
     &        recvbuf, recvcount, MPI_REAL, source,
     &        MPI_COMM_WORLD, ierr)
         print *, 'rank = ', rank, ' Results:', recvbuf
      else
         print *, 'Must specify', SIZE,
     &        ' processors. Terminating.'
      endif

      call MPI_FINALIZE(ierr)
      end

Fig. A.10. A simple example of collective communications in FORTRAN; this code
represents a scatter operation on the rows of an array.

rank = 0 Results:  1.000000  2.000000  3.000000  4.000000
rank = 1 Results:  5.000000  6.000000  7.000000  8.000000
rank = 2 Results:  9.000000 10.000000 11.000000 12.000000
rank = 3 Results: 13.000000 14.000000 15.000000 16.000000

Fig. A.11. This is the output of the example given in Fig. A.10.

Running hello.c Fig. A.12


>mpirun -np 4 hello

Fig. A.12. This example, hello.c, gives a Hello from all processes.
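The listing of hello.c itself is not reproduced here; the following is a minimal sketch of
such a program, consistent with the compile and run commands above and with the
output shown below (the exact source in Fig. A.12 may differ in detail).

#include "mpi.h"
#include <stdio.h>
int main(int argc, char *argv[]){
int rank, numtasks;
MPI_Init(&argc, &argv);                    /* start up MPI */
MPI_Comm_size(MPI_COMM_WORLD, &numtasks);  /* how many processes */
MPI_Comm_rank(MPI_COMM_WORLD, &rank);      /* which one am I */
printf("Hello from process %d.\n", rank);
MPI_Finalize();                            /* shut down MPI */
return 0;
}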

Output of hello.c Fig. A.12
Hello from process 0.
Hello from process 3.
Hello from process 2.
Hello from process 1.

A.2.2. Integration
Compiling integral.c Fig. A.13
>mpicc -o integral integral.c
Running integral.c Fig. A.13
>mpirun -np 4 integral
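Fig. A.13 shows only the first part of integral.c and the full listing is not reproduced
here. The following is a minimal sketch of an MPI program in the same spirit, which
integrates 4/(1 + x*x) over [0, 1] by the midpoint rule to approximate pi; the interval
count N and the variable names are illustrative assumptions, not the book's original code.

#include "mpi.h"
#include <stdio.h>
#define N 100000                     /* number of subintervals (illustrative) */
int main(int argc, char *argv[]){
int rank, numtasks, i;
double h = 1.0 / (double)N, x, mysum = 0.0, pi;
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
/* each process sums a strided subset of the midpoint-rule terms */
for (i = rank; i < N; i += numtasks) {
x = (i + 0.5) * h;
mysum += 4.0 / (1.0 + x * x);
}
mysum *= h;
/* combine the partial sums on process 0 */
MPI_Reduce(&mysum, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
if (rank == 0) printf("Computed pi = %f\n", pi);
MPI_Finalize();
return 0;
}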
A.3. MPI Tools
Pallas
Pallas is a leading independent software company specializing in high
performance computing. Pallas assisted many organizations in migrating
from sequential to parallel computing. Customers of Pallas come from all
fields: Hardware manufacturers, software vendors, as well as end users. Each
of these has benefited from Pallas' unique experience in the development
and tuning of parallel applications.


Fig. A.13. The first part of integral.c, which performs a simple integral.

In the field of MPI development tools and implementations, Pallas'
contributions include:

VAMPIR - MPI performance visualization and analysis
VAMPIRtrace - MPI profiling instrumentation
DIMEMAS - MPI application performance prediction
MPI-2 - first industrial MPI-2 implementation in Nov. 1997


VAMPIR
VAMPIR is currently the most successful MPI tool product (see also
Supercomputer European Watch, July 1997), or check references at
(1) http://www.cs.utk.edu/browne/perftools-review
(2) http://www.tc.cornell.edu/UserDoc/Software/PTools/vampir/
ScaLAPACK
The ScaLAPACK library includes a subset of LAPACK (Linear Algebra
PACKage) routines redesigned for distributed-memory MIMD parallel
computers. It is currently written in SPMD-type using explicit message
passing for interprocessor communication. The goal is to have ScaLAPACK
routines resemble their LAPACK equivalents as much as possible.
PGAPack
PGAPack is a general-purpose, data-structure-neutral, parallel genetic
algorithm library. It is intended to provide most capabilities desired in a
genetic algorithm library, in an integrated, seamless, and portable manner.
ARCH
ARCH is a C++-based object-oriented library of tools for parallel
programming on computers using the MPI (message passing interface)
communication library. Detailed technical information about ARCH is
available as a Cornell Theory Center Technical Report (CTC95TR288).
http://www.tc.cornell.edu/Research/tech.rep.html
OOMPI
OOMPI is an object-oriented interface to the MPI-1 standard. While
OOMPI remains faithful to all the MPI-1 functionality, it offers new object-
oriented abstractions that promise to expedite the MPI programming
process by allowing programmers to take full advantage of C++ features.
http://www.cse.nd.edu/lsc/research/oompi
XMPI: A Run/Debug GUI for MPI
XMPI is an X/Motif based graphical user interface for running and
debugging MPI programs. It is implemented on top of LAM, an MPI cluster
computing environment, but the interface is generally independent of LAM
operating concepts. You write an MPI application in one or more MPI
programs, tell XMPI about these programs and where they are to be run,


and then snapshot the synchronization status of MPI processes throughout


the application execution.
http://www.osc.edu/Lam/lam/xmpi.html
Aztec: An Iterative Sparse Linear Solver Package
Aztec is an iterative library that greatly simplifies the parallelization process
when solving a sparse linear system of equations Ax = b where A is a user
supplied n × n sparse matrix, b is a user supplied vector of length n and x
is a vector of length n to be computed. Aztec is intended as a software tool
for users who want to avoid cumbersome parallel programming details but
who have large sparse linear systems which require an efficiently utilized
parallel computing system.
http://www.cs.sandia.gov/HPCCIT/aztec.html
MPI Map
MPI Map, from Lawrence Livermore National Laboratory, lets programmers
visualize MPI datatypes. It uses Tcl/Tk, and it runs on parallel computers
that use the MPICH implementation of MPI. The tool lets you select one
of MPI's type constructors (such as MPI Type vector or MPI Type struct)
and enter the parameters to the constructor call. It then calls MPI to
generate the new type, extracts the type map from the resulting structure,
and presents a graphical display of the type map, showing the offset and
basic type of each element.
http://www.llnl.gov/livcomp/mpimap/
STAR/MPI
STAR/MPI is a system to allow binding of MPI to a generic (STAR)
interactive language. GCL/MPI is intended for easy-to-use master-slave
distributed-memory architecture. It combines the feedback of an interactive
language (the GCL or AKCL dialect of LISP) with the use of MPI to
take advantage of networks of workstations.
ftp://ftp.ccs.neu.edu/pub/people/gene/starmpi/
Parallel Implementation of BLAS
The sB BLAS package is a collection of parallel implementations of the
level 3 Basic Linear Algebra Subprograms. All codes were written using
MPI. A paper describing this work is also available.
http://www.cs.utexas.edu/users/rvdg/abstracts/sB BLAS.html


BLACS (Basic Linear Algebra Communication Subprograms) for MPI
An alpha test release of the BLACS for MPI is available from the
University of Tennessee, Knoxville. For more information contact R. Clint
Whaley (rwhaley@cs.utk.edu).
NAG Parallel Library
The NAG Parallel Library is a library of numerical routines
specifically produced for distributed-memory parallel computers. This
library is available under MPI [or PVM] message-passing mechanisms.
It also performs well on shared-memory computers whenever efficient
implementations of MPI [or PVM] are available. It includes the following
areas: Optimization, Dense linear algebra [including ScaLAPACK], Sparse
linear algebra, Random number generators, Quadrature, Input/Output,
data distribution, support/utility routines.
http://www.nag.co.uk/numeric/FM.html
DQS (Distributed Queuing System)
DQS now supports the launch of MPICH (Argonne/Miss State Version of
MPI) jobs and is available by anonymous ftp.
ftp://ftp.scri.fsu.edu/pub/DQS
Interprocessor Collective Communication (iCC)
The interprocessor collective communication (iCC) research project started
as a technique required to develop high performance implementations of the
MPI collective communication calls.
http://www.cs.utexas.edu/users/rvdg/intercom/.
PETSc Scientic Computing Libraries
PETSc stands for Portable Extensible Tools for scientific computing. It
is a library of routines for both uni- and parallel-processor computing.
http://www.mcs.anl.gov/home/gropp/petsc.html.
MSG Toolkit
Message-passing tools for Structured Grid communications (MSG) is an
MPI-based library intended to simplify coding of data exchange within
FORTRAN 77 codes performing data transfers on distributed Cartesian
grids. More information, the source code, and the user's guide are available
at the following site:
http://www.cerca.umontreal.ca/malevsky/MSG/MSG.html.


Para++
The Para++ project provides a C++ interface to the MPI and PVM message
passing libraries. Their approach is to overload input and output operators
to do communication. Communication looks like standard C++ I/O.
http://www.loria.fr/para++/parapp.html.
Amelia Vector Template Library
The Amelia Vector Template Library (AVTL) is a polymorphic collection
library for distributed-memory parallel computers. It is based on ideas from
the Standard Template Library (STL) and uses MPI for communication.
ftp://riacs.edu/pub/Excalibur/avtl.html.
Parallel FFTW
FFTW, a high-performance, portable C library for performing FFTs
in one or more dimensions, includes parallel, multi-dimensional FFT
routines for MPI. The transforms operate in-place for arbitrary-size arrays
(of any dimensionality greater than one) distributed across any number of
processors. It is free for non-commercial use.
http://www.fftw.org/.
Canonical Classes for Concurrency Control
The Canonical Classes for Concurrency Control library contains a set
of C++ classes that implement a variety of synchronization and data
transmission paradigms. It currently supports both Intel's NX and MPI.
http://dino.ph.utexas.edu/furnish/c4.
MPI Cubix
MPI Cubix is an I/O library for MPI applications. The semantics and language
binding reflect POSIX in its sequential aspects and MPI in its parallel aspects.
The library is built on a few POSIX I/O functions and each of the POSIX-like
Cubix functions translates directly to a POSIX operation on a file system
somewhere in the parallel computer. The library is also built on MPI and is
therefore portable to any computer that supports both MPI and POSIX.
http://www.osc.edu/Lam/mpi/mpicubix.html.
MPIX Intercommunicator Extensions
The MPIX Intercommunicator Extension library contains a set of extensions
to MPI that allow many functions that previously only worked with


intracommunicators to work with intercommunicators. Extensions include
support for additional intercommunicator construction operations and
intercommunicator collective operations.
http://www.erc.msstate.edu/mpi/mpix.html.
mpC
mpC, developed and implemented on top of MPI, is a programming
environment facilitating and supporting efficiently portable modular parallel
programming. mpC does not compete with MPI, but tries to strengthen
its advantages (portable modular programming) and to weaken its
disadvantages (a low level of parallel primitives and difficulties with
efficient portability; efficient portability means that an application running
efficiently on a particular multi-processor will run efficiently after porting to
other multi-processors). In fact, users can consider mpC as a tool facilitating
the development of complex and/or efficiently portable MPI applications.
http://www.ispras.ru/mpc/.
MPIRUN
Sam Fineberg is working on support for running multidisciplinary codes
using MPI which he calls MPIRUN. You can retrieve the MPIRUN software
from,
http://lovelace.nas.nasa.gov/Parallel/People/fineberg_homepage.html.
A.4. Complete List of MPI Functions
Constants, MPIO Request c2f, MPIO Request f2c, MPIO Test, MPIO Wait,
MPI Abort, MPI Address, MPI Allgather, MPI Allgatherv, MPI Allreduce,
MPI Alltoall, MPI Alltoallv, MPI Attr delete, MPI Attr get, MPI Attr put,
MPI Barrier, MPI Bcast, MPI Bsend, MPI Bsend init, MPI Buffer attach,
MPI Buffer detach, MPI CHAR, MPI Cancel, MPI Cart coords, MPI Cart create,
MPI Cart get, MPI Cart map, MPI Cart rank, MPI Cart shift, MPI Cart sub,
MPI Cartdim get, MPI Comm compare, MPI Comm create, MPI Comm dup, MPI Comm free,
MPI Comm get name, MPI Comm group, MPI Comm rank, MPI Comm remote group,
MPI Comm remote size, MPI Comm set name, MPI Comm size, MPI Comm split,
MPI Comm test inter, MPI DUP FN, MPI Dims create, MPI Errhandler create,
MPI Errhandler free, MPI Errhandler get, MPI Errhandler set, MPI Error class,
MPI Error string, MPI File c2f, MPI File close, MPI File delete, MPI File f2c,
MPI File get amode, MPI File get atomicity, MPI File get byte offset,
MPI File get errhandler, MPI File get group, MPI File get info,
MPI File get position, MPI File get position shared, MPI File get size,
MPI File get type extent, MPI File get view, MPI File iread, MPI File iread at,
MPI File iread shared, MPI File iwrite, MPI File iwrite at, MPI File iwrite shared,
MPI File open, MPI File preallocate, MPI File read, MPI File read all,
MPI File read all begin, MPI File read all end, MPI File read at,
MPI File read at all, MPI File read at all begin, MPI File read at all end,
MPI File read ordered, MPI File read ordered begin, MPI File read ordered end,
MPI File read shared, MPI File seek, MPI File seek shared, MPI File set atomicity,
MPI File set errhandler, MPI File set info, MPI File set size, MPI File set view,
MPI File sync, MPI File write, MPI File write all, MPI File write all begin,
MPI File write all end, MPI File write at, MPI File write at all,
MPI File write at all begin, MPI File write at all end, MPI File write ordered,
MPI File write ordered begin, MPI File write ordered end, MPI File write shared,
MPI Finalize, MPI Finalized, MPI Gather, MPI Gatherv, MPI Get count,
MPI Get elements, MPI Get processor name, MPI Get version, MPI Graph create,
MPI Graph get, MPI Graph map, MPI Graph neighbors, MPI Graph neighbors count,
MPI Graphdims get, MPI Group compare, MPI Group difference, MPI Group excl,
MPI Group free, MPI Group incl, MPI Group intersection, MPI Group range excl,
MPI Group range incl, MPI Group rank, MPI Group size, MPI Group translate ranks,
MPI Group union, MPI Ibsend, MPI Info c2f, MPI Info create, MPI Info delete,
MPI Info dup, MPI Info f2c, MPI Info free, MPI Info get, MPI Info get nkeys,
MPI Info get nthkey, MPI Info get valuelen, MPI Info set, MPI Init,
MPI Init thread, MPI Initialized, MPI Int2handle, MPI Intercomm create,
MPI Intercomm merge, MPI Iprobe, MPI Irecv, MPI Irsend, MPI Isend, MPI Issend,
MPI Keyval create, MPI Keyval free, MPI NULL COPY FN, MPI NULL DELETE FN,
MPI Op create, MPI Op free, MPI Pack, MPI Pack size, MPI Pcontrol, MPI Probe,
MPI Recv, MPI Recv init, MPI Reduce, MPI Reduce scatter, MPI Request c2f,
MPI Request free, MPI Rsend, MPI Rsend init, MPI Scan, MPI Scatter, MPI Scatterv,
MPI Send, MPI Send init, MPI Sendrecv, MPI Sendrecv replace, MPI Ssend,
MPI Ssend init, MPI Start, MPI Startall, MPI Status c2f, MPI Status set cancelled,
MPI Status set elements, MPI Test, MPI Test cancelled, MPI Testall, MPI Testany,
MPI Testsome, MPI Topo test, MPI Type commit, MPI Type contiguous,
MPI Type create darray, MPI Type create subarray, MPI Type extent, MPI Type free,
MPI Type get contents, MPI Type get envelope, MPI Type hvector, MPI Type lb,
MPI Type size, MPI Type struct, MPI Type ub, MPI Type vector, MPI Unpack,
MPI Wait, MPI Waitall, MPI Waitany, MPI Waitsome, MPI Wtick, MPI Wtime


APPENDIX B
OPENMP

B.1. Introduction to OpenMP


OpenMP stands for Open Multi-Processing, which is the de facto standard
API for writing shared-memory parallel applications. The general idea of
OpenMP is multi-threading by fork-join model (Fig. B.1): All OpenMP
programs begin as a single process called the master thread. When the
master thread reaches the parallel region, it creates multiple threads to
execute the parallel codes enclosed in the parallel region. When the threads
complete the parallel region, they synchronize and terminate, leaving only
the master thread.

B.1.1. The brief history of OpenMP


In the early 1990s, vendors supplied similar, directive-based programming
extensions for their own shared-memory computers. These extensions
offered preprocessor directives where the user specified which loops were to
be parallelized within a serial program and then the compiler would be
responsible for automatically parallelizing such loops across the processors.
These implementations were all functionally similar, but were diverging.
The first attempt to standardize the shared-memory API was the
draft for ANSI X3H5 in 1994. Unfortunately, it was never adopted. In
October 1997, the OpenMP Architecture Review Board (ARB) published
its first API specifications, OpenMP for Fortran 1.0. In 1998 they released
the C/C++ standard. Version 2.0 of the Fortran specifications was
published in 2000 and version 2.0 of the C/C++ specifications was released
in 2002. Version 2.5 is a combined C/C++/Fortran specification that was
released in 2005. Version 3.0, released in May 2008, is the current version
of the API specifications.


Fig. B.1.

Fork-join model used in OpenMP.

B.2. Memory Model of OpenMP


OpenMP provides a relaxed-consistency, shared-memory model. All
OpenMP threads have access to a place to store and to retrieve variables,
called the memory. In addition, each thread is allowed to have its own
temporary view of the memory. The temporary view of memory for each
thread is not a required part of the OpenMP memory model, but can
represent any kind of intervening structure, such as machine registers, cache,
or other local storage, between the thread and the memory. The temporary
view of memory allows the thread to cache variables and thereby to avoid
going to memory for every reference to a variable. Each thread also has
access to another type of memory that must not be accessed by other
threads, called threadprivate memory.

B.3. OpenMP Directives


OpenMP is directive based. Most parallelism is specied through the use
of compiler directives which are embedded in C/C++ or FORTRAN. A
directive in C/C++ has the following format:
#pragma omp <directive-name> [clause,...]
Example
#pragma omp parallel num threads(4)
B.3.1. Parallel region construct
The directive to create threads is,
#pragma omp parallel [clause [, ]clause] ...]
where clause is one of the following:

July 27, 2012 6:18

Applied Parallel Computing

Appendix B: OpenMP

9in x 6in

b1350-appb

173

if(scalar-expression )
num threads(integer-expression )
default(shared | none)
private(list )
firstprivate(list )
shared(list )
copyin(list )
reduction(operator :list ).
B.3.2. Work-sharing constructs
A work-sharing construct distributes the execution of the associated region
among the threads encountering it. However, it does not launch new threads
by itself. It should be used within the parallel region or combined with the
parallel region constructs.
The directive for the loop construct is as follows:
#pragma omp for [clause[[,] clause] ...]
where the clause is one of the following:
private(list )
firstprivate(list )
lastprivate(list )
reduction(operator :list )
schedule(kind [, chunk size ])
collapse(n)
ordered
nowait
B.3.3. Directive clauses
The private clause declares variables in its list to be private to each thread.
For a private variable, a new object of the same type is declared once for
each thread in the team and all references to the original object are replaced
with references to the new object. Hence, variables declared private should
be assumed to be uninitialized for each thread.
The firstprivate clause has the same behavior of private but it
automatically initializes the variables in its list according to their original
values.


The lastprivate clause does what private does and copies the
variable from the last loop iteration to the original variable object.
The shared clause declares variables in its list to be shared among all
threads.
The default clause allows user to specify a default scope for all
variables within a parallel region to be either shared or not.
The copyin clause provides a means for assigning the same value to
threadprivate variables for all threads in the team.
The reduction clause performs a reduction on the variables that
appear in its list. A private copy for each list variable is created for each
thread. At the end of the reduction, the reduction operation is applied to all
private copies of the shared variable, and the final result is written to the
global shared variable.
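As a small illustration of these clauses, the following sketch (assumed code; the array a,
the scalar scale and the temporary tmp are illustrative names) uses shared, private and
reduction together in a work-sharing loop:

#include "omp.h"
#include <stdio.h>
int main(int argc, char* argv[])
{
int i, n = 1000;
double a[1000], scale = 2.0, tmp, sum = 0.0;
for (i = 0; i < n; i++) a[i] = (double)i;
/* a and scale are shared (read only); tmp gets a private copy per thread;
   the per-thread partial sums are combined by the reduction clause */
#pragma omp parallel for shared(a, scale) private(tmp) reduction(+:sum)
for (i = 0; i < n; i++)
{
tmp = scale * a[i];
sum = sum + tmp;
}
printf("sum = %f\n", sum);
return 0;
}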

B.4. Synchronization
Before we discuss the synchronization, let us consider a simple example
where two processors try to do a read/update on same variable.
x = 0;
#pragma omp parallel shared(x)
{
x = x + 1;
}

One possible execution sequence is as follows:


1. Thread 1 loads the value of x into register A.
2. Thread 2 loads the value of x into register A.
3. Thread 1 adds 1 to register A.
4. Thread 2 adds 1 to register A.
5. Thread 1 stores register A at location x.
6. Thread 2 stores register A at location x.

The result of x will be 1, not 2 as it should be. To avoid situations like
this, x should be synchronized between the two processors.
Critical
The critical directive specifies a region of code that must be executed
by only one thread at a time. An optional name may be used to identify
the critical construct. All critical constructs without a name are considered to
have the same unspecified name. So if there exist two or more independent
critical blocks in the same parallel region, it is important
to specify them with different names so that these blocks can be executed in
parallel.
#pragma omp critical [name ]
structured block
atomic
The atomic directive ensures that a specic storage location is updated
atomically, rather than exposing it to the possibility of multiple,
simultaneous writings (race condition). The atomic directive applies only to
the statement immediately following it and only the variable being updated
is protected by atomic. The difference between atomic and critical is
that atomic operations can be executed in parallel when updating different
elements while critical blocks are guaranteed to be serial.
#pragma omp atomic
statement expression
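Returning to the read/update race shown at the beginning of this section, a minimal
sketch of the fix (assumed code, not one of the figures of this appendix) is to protect
the update with atomic; a named critical block would serve equally well here:

#include "omp.h"
#include <stdio.h>
int main(int argc, char* argv[])
{
int x = 0;
#pragma omp parallel shared(x) num_threads(2)
{
/* the atomic directive makes the read-modify-write of x indivisible */
#pragma omp atomic
x = x + 1;
}
printf("x = %d\n", x);   /* now reliably equals the number of threads */
return 0;
}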
barrier
The barrier directive synchronizes all threads in the team. When a
barrier directive is reached, a thread will wait at that point until all other
threads have reached that barrier. All threads then resume executing in
parallel the code that follows the barrier.
#pragma omp barrier
flush
The flush directive identifies a synchronization point at which the
implementation must provide a consistent view of memory. Thread-visible
variables are written back to memory at this point.
#pragma omp flush (list )

B.5. Runtime Library Routines


The OpenMP standard defines an API for library calls that perform a
variety of functions: Query the number of threads/processors, set number
of threads to use; general purpose locking routines (semaphores); portable
wall clock timing routines; set execution environment functions: Nested


parallelism, dynamic adjustment of threads. For C/C++, it may be
necessary to specify the include file omp.h.
B.5.1. Execution environment routines
omp set num threads
This sets the number of threads that will be used in the next parallel region.
It must be a positive integer.
void omp set num threads(int num threads).
omp get num threads
Returns the number of threads that are currently in the team executing the
parallel region from which it is called.
int omp get num threads(void).
omp get max threads
This returns the maximum value that can be returned by a call to the
OMP GET NUM THREADS function.
int omp get max threads(void).
omp get thread num
Returns the thread number of the thread, within the team, making this call.
This number will be between 0 and OMP GET NUM THREADS - 1. The master
thread of the team is thread 0.
int omp get thread num(void).
omp in parallel
To determine if the section code which is executing is parallel or not.
int omp in parallel(void).
B.5.2. Lock routines
The OpenMP runtime library includes a set of general-purpose lock routines
that can be used for synchronization. These general-purpose lock routines
operate on OpenMP locks that are represented by OpenMP lock variables.
An OpenMP lock variable must be accessed only through the routines
described in this section; programs that otherwise access OpenMP lock
variables are non-conforming.


The OpenMP lock routines access a lock variable in such a way that
they always read and update the most current value of the lock variable.
The lock routines include a flush with no list; the read and update to the
lock variable must be implemented as if they are atomic with the flush.
Therefore, it is not necessary for an OpenMP program to include explicit
flush directives to ensure that the lock variable's value is consistent among
different tasks.
omp init lock
This subroutine initializes a lock associated with the lock variable.
void omp init lock(omp lock t *lock).
omp destroy lock
This subroutine disassociates the given lock variable from any locks.
void omp destroy lock(omp lock t *lock).
omp set lock
This subroutine forces the executing thread to wait until the specified lock is
available. A thread is granted ownership of a lock when it becomes available.
void omp set lock(omp lock t *lock).
omp unset lock
This subroutine releases the lock from the executing subroutine.
void omp unset lock(omp lock t *lock).
omp test lock
This subroutine attempts to set a lock, but does not block if the lock is
unavailable.
int omp test lock(omp lock t *lock)
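A minimal sketch of how these lock routines fit together (assumed code, not one of the
figures of this appendix) is:

#include "omp.h"
#include <stdio.h>
int main(int argc, char* argv[])
{
int count = 0;
omp_lock_t lock;
omp_init_lock(&lock);                /* associate the variable with a lock */
#pragma omp parallel num_threads(4)
{
omp_set_lock(&lock);                 /* wait until the lock is available */
count = count + 1;                   /* only one thread updates at a time */
printf("Thread %d incremented count to %d\n",
       omp_get_thread_num(), count);
omp_unset_lock(&lock);               /* release the lock */
}
omp_destroy_lock(&lock);             /* disassociate the lock */
return 0;
}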
B.5.3. Timing routines
omp get wtime
This routine returns elapsed wall clock time in seconds.
double omp get wtime(void).
omp get wtick


This routine returns the precision of the timer used by omp get wtime.
double omp get wtick(void).
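A small sketch (assumed code) of how the timing routines are typically used to measure
a parallel region:

#include "omp.h"
#include <stdio.h>
int main(int argc, char* argv[])
{
int i;
double t0, t1, sum = 0.0;
t0 = omp_get_wtime();                /* wall clock time before the region */
#pragma omp parallel for reduction(+:sum)
for (i = 0; i < 100000000; i++)
sum += 1.0 / (1.0 + (double)i);
t1 = omp_get_wtime();                /* wall clock time after the region */
printf("sum = %f, elapsed = %f s (tick = %g s)\n",
       sum, t1 - t0, omp_get_wtick());
return 0;
}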
B.6. Examples of Using OpenMP
Similar to the MPI appendix, here we also provide two simple examples
(Figs. B.2 and B.3) of using OpenMP.

B.6.1. Hello from all threads


#include"omp.h"
#include<stdio.h>
int main(int argc, char* argv[])
{
#pragma omp parallel num_threads(4)
{
int ID = omp_get_thread_num();
int N  = omp_get_num_threads();
printf("Hello from %d of %d.\n", ID, N);


}
return 0;
}
Fig. B.2.

A simple Hello program from all threads.


B.6.2. Calculating pi by integration


#include"omp.h"
#include<stdio.h>
static long num_steps = 100000;
double step;
#define NUM_THREADS 2
int main(int argc, char* argv[])
{
int i;
double x, pi, sum = 0.0;
step = 1.0/(double)num_steps;
omp_set_num_threads(NUM_THREADS);
#pragma omp parallel for private(x) reduction(+:sum)
for (i = 0;i < num_steps; i++) /* i private by
default */
{
x = (i + 0.5) * step;
sum = sum + 4.0/(1.0 + x * x);
}
pi = step * sum;
printf("Computed pi = %f\n", pi);
return 0;
}
Fig. B.3.

OpenMP example for computing pi by integration.


B.7. The Future


MPI and OpenMP have served the parallel computing community's needs
of portability and convenience of message passing over the past nearly
20 years. Like everything else, they are approaching the second half of their
useful lifetime after passing the peaks. Parallel computers are monsters of
millions of connected processing cores with fancy network topologies and
protocols. A new programming model and, certainly, a new tool to enable
their coordination or communication, are highly desired. Many attempts,
just like those of the pre-MPI era, are starting to bear fruit and, expectedly,
a new technique similar to MPI and OpenMP will emerge.


APPENDIX C
PROJECTS

Project C.1 Watts and Flops of Supercomputers


From the latest (Nov. 2011) Top500 and Green500 lists, one can find
the ordered lists of Top500 and Green500 supercomputers. The Top500
supercomputers are ranked in terms of the sustained floating-point
operation rate Rmax in units of Gflops while the Green500 supercomputers
are ranked in terms of their power efficiency measured in terms of Mflops
per Watt of electricity. Assume the Top500 computers are marked with
L = 1, 2, . . . , 500 where L = 1 is assigned to the computer with the highest
Rmax while the Green500 computers are marked with P = 1, 2, . . . , 500
where P = 1 is assigned to the computer with the highest performance-to-power ratio.
Please name all supercomputers whose values satisfy P + L ≤ 50 and
construct a table to list the features of their processors (CPUs and GPUs if
there are any), their number of processing cores, and their interconnection
networks.
Project C.2 Review of Supercomputers
Supercomputers have been going through active development for the
last 50 years. Write an essay not less than 10 pages to document such
development. In the essay, describe the evolution of the following aspects:
(1) key architecture including the processing and network units.
(2) system performance.
(3) programming models.
(4) representative applications.
(5) major breakthroughs and bottlenecks.

Project C.3 Top500 and BlueGene Supercomputers


According to www.top500.org in 2010, many of the world's fastest
supercomputers belong to the BlueGene family designed and constructed
181


by IBM. We have to do the following:


(1) Define the main architectural components of the IBM BlueGene/P or
BlueGene/Q supercomputers.
(2) Explain how the world's top500 supercomputers are selected.
(i) Describe the benchmark used.
(ii) Describe the pros and cons of such a benchmarking system.
(iii) How would you improve the benchmarking system?
(iv) Are there any other benchmarks?

(3) Design an algorithm to solve dense linear algebraic equation on


BlueGene/P or BlueGene/Q.
Write an essay to compare five of the top ten supercomputers from
the latest top500.org list. For example, the top ten supercomputers of the
June 2011 list include:
(1) Fujitsu K computer.
(2) NDUT Tianhe-1A.
(3) Cray XT5-HE.
(4) Dawning TC3600 Cluster.
(5) HP Cluster Platform 3000SL.
(6) Cray XE6.
(7) SGI Altix ICE.
(8) Cray XE6.
(9) Bull Bullx supernode S6010/6030.
(10) Roadrunner IBM BladeCenter.

The five computers that are selected should be as diverse as possible. The
essay, at least 10 pages in length, must contain the following aspects:
(1) key architecture including the processing and network units.
(2) system performance.
(3) programming models.
(4) major advantageous features for each computer.

In each of the categories, create a rank order for the computers you select.
Project C.4 Say Hello in Order
When running a Hello from Processor X program on P processors, you
usually see output on your screen ordered randomly. It is not in the order you
desired, e.g., Hello from Processor 1, then Hello from Processor 2, etc.


Write a parallel program to enforce the order, i.e., whenever you run it, you
always get the processors to say hello to you in order. You may test it for
a system with at least 13 processors.
Project C.5 Broadcast on Torus
Broadcast 10 floating-point numbers from any point to all 6000 nodes of a
3D torus 10 × 20 × 30. Please design an algorithm to quickly complete the
broadcast. Please report the total number of steps N needed to complete
the broadcast. Additionally, make a table to show the number of links that
are involved in each of the N steps for the broadcast.
Project C.6 Competing with MPI on Broadcast, Scatter, etc
MPI implementation has a dozen or so collective operation functions such
as broadcast, gather, scatter, all-gather, etc. Describe the algorithms for
implementing three functions: Broadcast, scatter, and all-gather.
Implement such functions on your chosen computer using only
MPI Isend() and MPI Irecv() and a few other supporting (non-data-motion)
functions and other light-weight basic single-sided communication
functions. Then, compare the performance of your implementation with
that of those provided by MPI. You need to vary the size of the messages
(number of floating-point numbers to move) and the size of the computer
involved to study the dependence of the differences on the message and
machine sizes. For example, you may perform collective operations on
N = 10^2, 10^3, 10^4 floating-point numbers with P = 2^2, 2^3, 2^4, 2^5 processing cores.
Project C.7 Simple Matrix Multiplication
Create two N × N square matrices with random elements whose values are
in [-1, 1] and compute their product by using P processors. The program
should be general for reasonable values of N and P. Plot the speedup curves
at N = 1000, 2000, 4000 and P = 1, 2, 4, 8 processors.
Project C.8 Matrix Multiplication on 4D Torus
Design an algorithm to multiply two large square matrices A and B on a
4D torus networked supercomputer with 15 × 15 × 16 × 9 = 32,400 nodes.
You may consider A and B as having 8 × 8 blocks of (180 × 180) elements, i.e.,
your matrices are of the size 1440 × 1440. Further, we assume the time to perform
one floating-point operation (adding or multiplying a pair of floating-point


numbers) is 1 unit (of time). We also assume transmitting one floating-point
number between any adjacent nodes takes 1 unit (of time). Estimate the
shortest time to complete the matrix multiplication on the computer.
Project C.9 Matrix Multiplication and PAT
There are several parallel matrix multiplication algorithms for distributed-memory
parallel computers with various communication networks. In this
project, we will use one hypothetical parallel computer with P = 8 × 8 × 8 =
512 processing cores on a 3D mesh network. Now, we hope to design two
algorithms, the Ring method (aka Cannon's method) and the BMR method, to
multiply two matrices of the size 1024 × 1024. We also assume the time
for each core to perform one floating-point operation is t0 and the
time to communicate one floating-point number from one core to any one
of its nearest neighbors is tc. To simplify the problem, the time for
communicating to the second-nearest neighbors is 2tc, to the third-nearest
neighbors is 3tc, etc.
The project requires you to draw two Parallel Activity Trace (PAT)
graphs, one for each algorithm, to illustrate the activities in the cores during
the entire process of completing the matrix multiplication. A PAT is aligned
graphs of time series from all the processors. Each time series records the
activities of each processor during the matrix multiplication: sequential
operations, message send, message receive, or idle. Hopefully the PAT is
reasonably in scale.
Project C.10 Matrix Inversion
For a typical distributed-memory parallel computer, memory contents
among all processors are shared through message passing. One standard
for such sharing is called MPI, i.e. message passing interface. We have to
do the following:
(1) Define the main function groups in MPI 2.0.
(2) If a distributed-memory parallel computer is used to invert a dense
square matrix:
(i) Design an algorithm for the matrix inversion.
(ii) Analyze the performance of the algorithm.
(iii) Write a parallel program for implementing the algorithm for
inverting a 1024 × 1024 matrix on a parallel computer with 2^0, 2^1, 2^3
and 2^4 processors.


Project C.11 Simple Analysis of an iBT Network


Starting from a 3D torus 30 × 30 × 36, you add links that bypass 6 or 12
hops systematically in all three dimensions and for all nodes. This will get
you a new network that we call an Interlacing Bypass Torus (IBT) with node
degree 8, i.e., each node has, and only has, 8 links. You may also construct
a 4D torus network with the same number of nodes. You will have many
options to assemble the 4D torus. We request you make it as 15 × 15 × 16 × 9.
Please compute, for both networks, (1) the average node-to-node
distances and (2) the diameters, measured in hops.
Project C.12 Compute Eigenvalues of Adjacency Matrices
of Networks
One can form many network topologies to connect 2^6 × 2^6 = 2^12 nodes.
For our project, we form the following three topologies:
(1) A 2D torus of 2^6 × 2^6
(2) A 3D mesh of 2^3 × 2^3 × 2^4
(3) A hypercube of 12 dimensions, 2^12
Now, construct the adjacency matrices for each of the three topologies. The
adjacency matrix S for an N-node system is defined as an N × N matrix
whose element Sij is the distance between nodes i and j measured in hops.
Next, compute the eigenvalues for each of the three adjacency matrices.
Project C.13 Mapping Wave Equation to Torus
Solve a wave equation defined on a 2D grid of M × N mesh points on a
parallel computer with P × Q × R processors. These processors are arranged
on a 3D grid with nearest neighbor communication links at latency a and
bandwidth b per link. We assume that the processors at the ends of the x-, y-,
and z-directions are connected, i.e. the computer network is a 3D torus. We
further assume that the number of mesh points is much higher than that of
processors in the system. We have to do the following:
(1) Map the 2D computational domain to the 3D computer network.
(2) Argue that the mapping is near optimal for message passing.
(3) Estimate the speedup that you may get for the mapping.
(4) Write a program to solve the wave equation on an emulated distributed-memory
MIMD parallel computer of 2 × 2 × 2 for the equation problem
on a 128 × 128 mesh.


Project C.14 Load Balance in 3D Mesh


Solve a 2D wave equation on a 3D hypothetical parallel computer. The wave
equation is defined on a 2D square domain with 256 mesh-points in each
dimension. The equation is

∂²u/∂t² = c² (∂²u/∂x² + ∂²u/∂y²)
u(t = 0) = u0,  ∂u/∂t (t = 0) = v0
u(x = 0) = u(x = L),  u(y = 0) = u(y = L)

where c, u0, v0, L are constants.
Now, the computer consists of 64 processors with the same speed
(1 Gflops each) and these processors are placed on a 3D mesh 4 × 4 × 4.
Each link connecting any two nearest neighboring processors is capable of
1 Gbps communication bandwidth. Assume we use a finite difference method
to solve the problem. We have to do the following:
(1) Design an algorithm to place the 256 × 256 mesh points on the
64 processors with optimal utilization and load balance of the CPUs
and communication links.
(2) Compute the expected load imbalance ratio for CPUs resulting from
the method.
(3) Compute the expected load imbalance ratio among all links resulting
from the method.
(4) Repeat the above three steps if each one of the top layer of 16 processors
is four times faster than each one of the processors on the second and
third layers.
Project C.15 Wave Equation and PAT
Please write a parallel program on a 3D or higher dimensional mesh
networked parallel computer to solve a 2D wave equation by the finite
difference method. The wave equation is defined on a two-dimensional square
domain with 256 mesh-points in each dimension. The equation is

utt = c² (uxx + uyy)
u(t = 0) = u0,  ut(t = 0) = v0
u(x = 0) = u(x = L),  u(y = 0) = u(y = L)

where c, u0, v0, L are constants.


Assume you solve the above on a computer with 4 × 4 × 4 = 64 processing
cores. Obviously, the internal network for the cores in a processor is unknown
and we do not care. If you only have access to a Beowulf cluster, you may
use it with any network available.
and we do not care. If you only have access to a Beowulf cluster, you may
use it with any network available.
Please draw the PAT (Parallel Activity Traces) for running your
program for 10 time steps with the actual computing and communication
data from your calculations. You should include the serial (local) computing,
message passing, and idle of each of the participating processing cores.
Project C.16 Computing Coulombs Forces
In 3D, N particles are assigned electric charges with values taking random
numbers in {±1, ±2, ±3}. In other words, the charge of each particle is
one of the six random numbers with equal probability. These particles are
fixed (for convenience) at sites whose coordinates are random numbers in
a 10 × 10 × 10 cube, i.e. the Cartesian coordinates of these N particles are
any triplet of random numbers in [0, 10]. These particles interact under
Coulomb's law:

Fij = qi qj / rij^2,

where Fij is the force on particle i by j (with distance rij) with charges qi
and qj.
Write a program to compute the forces on each of the N particles by
P processors. Make the following plot: A set of speedup curves at different
values of N.
(1) For N = 1000, collect timing results for P = 2^0, 2^1, 2^2, 2^3, 2^4, 2^5.
(2) For N = 2000, collect timing results for P = 2^0, 2^1, 2^2, 2^3, 2^4, 2^5.
(3) For N = 5000, collect timing results for P = 2^0, 2^1, 2^2, 2^3, 2^4, 2^5.
Note: When collecting timing results, minimize the I/O effect.
Project C.17 Timing Model for MD
In molecular dynamics (MD) simulation of proteins or other bio-molecules,
one must compute the bonded and non-bonded forces of each atom before
solving the Newtons equation of motion, for each time step. Design a simple
timing model for MD calculation. For example, rst, construct a table to
estimate the number of operations needed to compute each of the force terms
and solution of the ODEs for P = 1 processor. Next, you divide the task

July 27, 2012 6:18

188

Applied Parallel Computing

9in x 6in

b1350-appc

Applied Parallel Computing

into P sub-tasks and assign them to P individual processors or nodes. Assume the
processor speed f, inter-node communication latency t0 and communication
bandwidth w and compute the time to finish the MD simulation on such a
system with P processors.
If you solve this MD problem on a specific computer (in 2011, a popular
computer is BlueGene/P), please look up the relevant computer parameters
(including processor speed and network latency and bandwidth) to estimate
the time needed, for each time step, for the major calculation components.
With such computer specifications and our timing model, we estimate the
parallel efficiency for a molecule with N = 100,000 atoms to be modeled
by the given supercomputer with P = 2^10, 2^12, 2^14, 2^16 processors.
Project C.18 Minimizing Lennard-Jones Potential
In three dimensions, N particles are fixed at sites that have random
coordinates in a 10 × 10 × 10 cube, i.e. any triplet of random numbers in
[0, 10] can be the coordinates of a particle, initially. These particles interact
under the so-called Lennard-Jones potential,

Vij = 1/rij^12 - 2/rij^6

where Vij is the pair-wise potential between particles i and j with
distance rij.
(1) Write a serial program to minimize the total energy of the particle
system for N = 100 by using simulated annealing method (or any
optimization method that you prefer).
(2) For P = 2, 4, and 8, write a parallel program to minimize the energy.
The minimizations should terminate at a similar final energy that can
be obtained by P = 1 as in (1) above (results within five percent
relative error are considered similar). We need to use the following two
methods to decompose the problem:
(a) Particle decomposition.
(b) Spatial decomposition.
(3) Report the timing results and speedup curve for both decompositions
and comment on their relative efficiency.
Project C.19 Install and Profile CP2K
CP2K is a program to perform atomistic and molecular simulations of
solid state, liquid, molecular, and biological systems. It provides a general


framework for different methods such as, e.g., density functional theory
(DFT) using a mixed Gaussian and plane waves approach (GPW) and
classical pair and many-body potentials. CP2K is freely available under
the GPL license. It is written in FORTRAN-95 and can be run efficiently
in parallel.
Download the latest source code and compile the program for a parallel
computer you can access. Run the program on that computer with 8, 16, and
32 processing cores for 100 time steps. You may use the default parameters
or change them to a reasonable set that you can explain. Please build a
table to show the wall clock times on all processing cores for their local
calculations (accumulated over the 100 time steps). More desirably, profile
the program.
Project C.20 Install and Profile CPMD
CPMD is a well parallelized, plane wave and pseudo-potential
implementation of Density Functional Theory, developed especially for
ab-inito molecular dynamics. It was developed at IBM Zurich Research
Laboratory. The key features of CPMD include

works with norm conserving or ultra-soft pseudo-potentials


LDA, LSD and the most popular gradient correction schemes; free
energy density functional implementation
isolated systems and system with periodic boundary conditions; kpoints
molecular and crystal symmetry
wavefunction optimization: direct minimization and diagonalization
geometry optimization: local optimization and simulated annealing
molecular dynamics: constant energy, constant temperature and
constant pressure
path integral MD
response functions
excited states
many electronic properties
time-dependent DFT (excitations, molecular dynamics in excited
states)
coarse-grained non-Markovian meta dynamics

Download the latest source code and compile the program for a parallel
computer you can access. Run the program on that computer with 8, 16, and


32 processing cores for 100 time steps. You may use the default parameters
or change them to a reasonable set that you can explain. Please build a
table to show the wall clock times on all processing cores for their local
calculations (accumulated over the 100 time steps). More desirably, profile
the program.
Project C.21 Install and Profile NAMD
NAMD is a parallel molecular dynamics code designed for high-performance
simulation of large bio-molecular systems. Based on Charm++ parallel
objects (in 2010), NAMD scales to hundreds of processors on high-end
parallel platforms and tens of processors on commodity clusters using
gigabit Ethernet. One can build NAMD from the source code or download
the binary executable for a wide variety of platforms.
Download the latest source code and compile the program for a parallel
computer you can access. Run the program on that computer with 8, 16, and
32 processing cores for 100 time steps. You may use the default parameters
or change them to a reasonable set that you can explain. Please build a
table to show the wall clock times on all processing cores for their local
calculations (accumulated over the 100 time steps). More desirably, profile
the program.
Project C.22 FFT on Beowulf
Fast Fourier Transform (FFT) is important for many problems in
computational science and engineering, and its parallelization is very
difficult to scale to large distributed-memory systems.
Suppose you are required to perform FFT efficiently on a Beowulf
system with two layers of switches. Hypothetically, the Beowulf system
has 480 processors that are divided into 32 groups of 15 each. Each
15-processor group is connected to a Gigabit switch with 16 ports, one
of which is connected to the master 10-Gigabit switch that has 32 ports.
We also assume the latencies for the Gigabit and 10-Gigabit switches are
50 and 20 microseconds, respectively.
(1) Draw a diagram to illustrate the Beowulf system.
(2) Design the algorithm for FFT on such a Beowulf system.
(3) Estimate the speedup for the algorithm.
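One hedged way to begin parts (2) and (3), assuming a transpose-based one-dimensional FFT in which every processor exchanges a block of N/P^2 complex points with every other processor, is to charge intra-group messages one Gigabit-switch latency and inter-group messages two Gigabit hops plus one 10-Gigabit hop. Per processor this gives

\[
T_{\mathrm{comm}} \;\approx\; 14\Bigl(\alpha_{\mathrm{G}} + \tfrac{m}{\beta_{\mathrm{G}}}\Bigr)
 \;+\; 465\Bigl(2\alpha_{\mathrm{G}} + \alpha_{\mathrm{10G}} + \tfrac{m}{\beta_{\mathrm{G}}}\Bigr),
\qquad m = \frac{16N}{P^{2}}\ \text{bytes},
\]

where 14 is the number of other processors in the same group, 465 the number of processors in other groups, P = 480, \alpha_{\mathrm{G}} = 50\,\mu\mathrm{s}, \alpha_{\mathrm{10G}} = 20\,\mu\mathrm{s}, and \beta_{\mathrm{G}} is the Gigabit link bandwidth. With a serial cost of roughly T_1 \approx 5N\log_2 N\,t_f, the speedup estimate for part (3) is S \approx T_1/(T_1/P + T_{\mathrm{comm}}).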
Project C.23 FFT on BlueGene/Q

If you perform a Fourier transform of a 3D function f(x, y, z) on
BlueGene/Q, please look up the relevant machine parameters (including
processor speed and network latency and bandwidth) to build a timing
model for the efficiency. Assuming you divide the function on a
1024 × 1024 × 1024 mesh, please estimate the parallel efficiency on a
BG/Q with P = 2^10, 2^12, 2^14, and 2^16 processors.
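A hedged sketch of such a model, assuming a slab decomposition of the mesh, one all-to-all transpose per three-dimensional transform, and machine parameters (per-flop time t_f, message latency \alpha, per-node bandwidth \beta) filled in from the published BG/Q specifications, is

\[
T_{\mathrm{comp}}(P) \approx \frac{5\,n^{3}\log_{2}n^{3}}{P}\,t_f,
\qquad
T_{\mathrm{comm}}(P) \approx (P-1)\,\alpha + \frac{16\,n^{3}}{P\,\beta},
\qquad n = 1024,
\]

\[
E(P) = \frac{T_{\mathrm{comp}}(P)}{T_{\mathrm{comp}}(P) + T_{\mathrm{comm}}(P)},
\]

evaluated at P = 2^{10}, 2^{12}, 2^{14}, and 2^{16}. The latency term grows with P while the per-rank data volume shrinks, which is exactly the scaling difficulty noted in Project C.22.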
Project C.24 Word Analysis
Please write a parallel program to search for the 10 most frequently used
words in the first four chapters of Charles Dickens' A Tale of Two Cities.
Please sort your list.
Project C.25 Cost Estimate of a 0.1 Pflops System
Supercomputers are designed to solve problems in science, engineering,
and finance. Make a list of at least 10 major applications in those three
areas and state the key requirements of the supercomputer resources in
terms of computing power, network, memory, storage, etc. If you are
commissioned to design a supercomputer for the applications in one
particular area, specify the key features of the system in the design. Try to
get the prices of the components on the web and estimate the cost of a
computer that can deliver 0.1 Pflops.
Project C.26 Design of a Pflops System
You are commissioned to design a 1-Petaflop IntelGene supercomputer
with Intel's newest multi-core processors and the latest (2011) BlueGene
network processors. In the design, you will need to specify the node
architecture, the towers, and the final system. The terminology and
packaging pathway of the published BlueGene design reports can
be mimicked. A datasheet for the supercomputer's mechanical, thermal,
electrical, and performance specifications is to be provided.
APPENDIX D
PROGRAM EXAMPLES

D.1. Matrix-Vector Multiplication

Program D.1. Matrix-vector multiplication in FORTRAN.
C multiplication of matrix M(N1,N2) and vector v(N1)
C product vector prod(N2)
C root process initializes M and v
C broadcasts v to all processes
C splits M into sub_M by rows by MPI_Scatter
C (N2 is divided by the number of processes)
C each process multiplies its part of M x whole v and gets its part
C of prod
C MPI_Gather collects all pieces of prod into one vector on root process
C simple case of N2 divisible by the number of processes is handled here
      parameter(N1=8)                          ! columns
      parameter(N2=8)                          ! rows
      include 'mpif.h'
      real M(N1,N2),v(N1),prod(N2)
      integer size,my_rank,tag,root
      integer send_count, recv_count
      integer N_rows

      tag  = 0
      root = 0
C here (not always) for MPI_Gather to work root should be 0
      call MPI_Init(ierr)
      call MPI_Comm_rank(MPI_COMM_WORLD,my_rank,ierr)
      call MPI_Comm_size(MPI_COMM_WORLD,size,ierr)
      if(mod(N2,size).ne.0)then
        print*, 'rows are not divisible by processes'
        stop
      end if
      if(my_rank.eq.root)then
        call initialize_M(M,N1,N2)
        call initialize_v(v,N1)
      end if
C broadcast v to all processes
      call MPI_Bcast(v, N1, MPI_REAL, root, MPI_COMM_WORLD, ierr)
C split M into sub_M by rows by MPI_Scatter
C (N2 is divided by the number of processes)
      N_rows = N2 / size
      send_count = N_rows*N1
      recv_count = N_rows*N1
C     if(my_rank.eq.root)send_count = N1*N2
      call MPI_Scatter(M, send_count, MPI_REAL,
     @                 M, recv_count, MPI_REAL,
     @                 root, MPI_COMM_WORLD, ierr)
      call multiply(prod,v,M,N_rows,N1)

      send_count = N_rows
      recv_count = N_rows
      if(my_rank.eq.root)recv_count = N2
      call MPI_Gather(prod, send_count, MPI_REAL,
     @                prod, recv_count, MPI_REAL,
     @                root, MPI_COMM_WORLD, ierr)
      if(my_rank.eq.root)call write_prod(prod,N2)
      call MPI_Finalize(ierr)
      end

      subroutine multiply(prod,v,M,N2,N1)
C N2 here is the number of local rows; the leading dimension of the
C local block of M is N1 (the number of columns)
      real M(N1,N2),prod(N2),v(N1)
      do i=1,N2
        prod(i)=0
        do j=1,N1
          prod(i)=prod(i) + M(j,i)*v(j)
        end do
      end do
      return
      end

      subroutine initialize_M(M,N1,N2)
      real M(N1,N2)
      do i=1,N2
        do j=1,N1
          M(j,i) = 1.*i/j
        end do
      end do
      return
      end

      subroutine initialize_v(v,N1)
      real v(N1)
      do j=1,N1
        v(j) = 1.*j
      end do
      return
      end

      subroutine write_prod(prod,N2)
      real prod(N2)
C the working directory of every process except the one the program
C was started on is the home directory
      open(unit=1,file='~/LAM/F/prod',status='new')
      do j=1,N2
        write(1,*)j,prod(j)
      end do
      return
      end
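Assuming an MPI Fortran wrapper such as mpif77 and a launcher such as mpirun are available, the example can be compiled and started on, say, four processes with "mpirun -np 4 a.out"; any process count that divides N2 = 8 passes the divisibility check, and the root process writes the product vector to the file opened in write_prod.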
D.2. Long Range N-body Force

Program D.2. Long-ranged N-body force calculation in FORTRAN (Source: Oak Ridge
National Laboratory).
c This program finds the force on each of a set of particles
c interacting via a long-range 1/r**2 force law.
c
c The number of processes must be even, and the total number of points
c must be exactly divisible by the number of processes.
c
c This is the MPI version.
c
c Author: David W. Walker
c Date:   March 10, 1995
c
      program nbody
      implicit none
      include 'mpif.h'
      integer myrank, ierr, nprocs, npts, nlocal
      integer pseudohost, NN, MM, PX, PY, PZ, FX, FY, FZ
      real G
      parameter (pseudohost = 0)
      parameter (NN=10000, G = 1.0)
      parameter (MM=0, PX=1, PY=2, PZ=3, FX=4, FY=5, FZ=6)
      real dx(0:NN-1), dy(0:NN-1), dz(0:NN-1)
      real dist(0:NN-1), sq(0:NN-1)
      real fac(0:NN-1), tx(0:NN-1), ty(0:NN-1), tz(0:NN-1)
      real p(0:6,0:NN-1), q(0:6,0:NN-1)
      integer i, j, k, dest, src
      double precision timebegin, timeend
      integer status(MPI_STATUS_SIZE)
      integer newtype
      double precision ran
      integer iran
c
c Initialize MPI, find rank of each process, and the number of processes
c
      call mpi_init (ierr)
      call mpi_comm_rank (MPI_COMM_WORLD, myrank, ierr)
      call mpi_comm_size (MPI_COMM_WORLD, nprocs, ierr)
c
c One process acts as the host and reads in the number of particles
c
      if (myrank .eq. pseudohost) then
         open (4,file='nbody.input')
         if (mod(nprocs,2) .eq. 0) then
            read (4,*) npts
            if (npts .gt. nprocs*NN) then
               print *,'Warning!! Size out of bounds!!'
               npts = -1
            else if (mod(npts,nprocs) .ne. 0) then
               print *,'Number of processes must divide npts'
               npts = -1
            end if
         else
            print *, 'Number of processes must be even'
            npts = -1
         end if
      end if
c
c The number of particles is broadcast to all processes
c
      call mpi_bcast (npts, 1, MPI_INTEGER, pseudohost,
     #                MPI_COMM_WORLD, ierr)
c
c Abort if number of processes and/or particles is incorrect
c
      if (npts .eq. -1) go to 999
c
c Work out number of particles in each process
c
      nlocal = npts/nprocs
c
c The pseudohost initializes the particle data and sends each
c process its particles.
c
      if (myrank .eq. pseudohost) then
         iran = myrank + 111
         do i=0,nlocal-1
            p(MM,i) = sngl(ran(iran))
            p(PX,i) = sngl(ran(iran))
            p(PY,i) = sngl(ran(iran))
            p(PZ,i) = sngl(ran(iran))
            p(FX,i) = 0.0
            p(FY,i) = 0.0
            p(FZ,i) = 0.0
         end do
         do k=0,nprocs-1
            if (k .ne. pseudohost) then
               do i=0,nlocal-1
                  q(MM,i) = sngl(ran(iran))
                  q(PX,i) = sngl(ran(iran))
                  q(PY,i) = sngl(ran(iran))
                  q(PZ,i) = sngl(ran(iran))
                  q(FX,i) = 0.0
                  q(FY,i) = 0.0
                  q(FZ,i) = 0.0
               end do
               call mpi_send (q, 7*nlocal, MPI_REAL,
     #                        k, 100, MPI_COMM_WORLD, ierr)
            end if
         end do
      else
         call mpi_recv (p, 7*nlocal, MPI_REAL,
     #                  pseudohost, 100, MPI_COMM_WORLD, status, ierr)
      end if
c
c Initialization is now complete. Start the clock and begin work.
c First each process makes a copy of its particles.
c
      timebegin = mpi_wtime ()
      do i=0,nlocal-1
         q(MM,i) = p(MM,i)
         q(PX,i) = p(PX,i)
         q(PY,i) = p(PY,i)
         q(PZ,i) = p(PZ,i)
         q(FX,i) = 0.0
         q(FY,i) = 0.0
         q(FZ,i) = 0.0
      end do
c
c Now the interactions between the particles in a single process are
c computed.
c
      do i=0,nlocal-1
         do j=i+1,nlocal-1
            dx(i) = p(PX,i) - q(PX,j)
            dy(i) = p(PY,i) - q(PY,j)
            dz(i) = p(PZ,i) - q(PZ,j)
            sq(i) = dx(i)**2+dy(i)**2+dz(i)**2
            dist(i) = sqrt(sq(i))
            fac(i) = p(MM,i) * q(MM,j) / (dist(i) * sq(i))
            tx(i) = fac(i) * dx(i)
            ty(i) = fac(i) * dy(i)
            tz(i) = fac(i) * dz(i)
            p(FX,i) = p(FX,i)-tx(i)
            q(FX,j) = q(FX,j)+tx(i)
            p(FY,i) = p(FY,i)-ty(i)
            q(FY,j) = q(FY,j)+ty(i)
            p(FZ,i) = p(FZ,i)-tz(i)
            q(FZ,j) = q(FZ,j)+tz(i)
         end do
      end do
c
c The processes are arranged in a ring. Data will be passed in an
c anti-clockwise direction around the ring.
c
      dest = mod (nprocs+myrank-1, nprocs)
      src  = mod (myrank+1, nprocs)
c
c Each process interacts with the particles from its nprocs/2-1
c anti-clockwise neighbors. At the end of this loop p(i) in each
c process has accumulated the force from interactions with particles
c i+1, ...,nlocal-1 in its own process, plus all the particles from its
c nprocs/2-1 anti-clockwise neighbors. The "home" of the q array is
c regarded as the process from which it originated. At the end of
c this loop q(i) has accumulated the force from interactions with
c particles 0,...,i-1 in its home process, plus all the particles from
c the nprocs/2-1 processes it has rotated to.
c
      do k=0,nprocs/2-2
         call mpi_sendrecv_replace (q, 7*nlocal, MPI_REAL, dest, 200,
     #               src, 200, MPI_COMM_WORLD, status, ierr)
         do i=0,nlocal-1
            do j=0,nlocal-1
               dx(i) = p(PX,i) - q(PX,j)
               dy(i) = p(PY,i) - q(PY,j)
               dz(i) = p(PZ,i) - q(PZ,j)
               sq(i) = dx(i)**2+dy(i)**2+dz(i)**2
               dist(i) = sqrt(sq(i))
               fac(i) = p(MM,i) * q(MM,j) / (dist(i) * sq(i))
               tx(i) = fac(i) * dx(i)
               ty(i) = fac(i) * dy(i)
               tz(i) = fac(i) * dz(i)
               p(FX,i) = p(FX,i)-tx(i)
               q(FX,j) = q(FX,j)+tx(i)
               p(FY,i) = p(FY,i)-ty(i)
               q(FY,j) = q(FY,j)+ty(i)
               p(FZ,i) = p(FZ,i)-tz(i)
               q(FZ,j) = q(FZ,j)+tz(i)
            end do
         end do
      end do
c
c Now q is rotated once more so it is diametrically opposite its home
c process. p(i) accumulates forces from the interaction with particles
c 0,..,i-1 from its opposing process. q(i) accumulates force from the
c interaction of its home particles with particles i+1,...,nlocal-1 in
c its current location.
c
      if (nprocs .gt. 1) then
         call mpi_sendrecv_replace (q, 7*nlocal, MPI_REAL, dest, 300,
     #               src, 300, MPI_COMM_WORLD, status, ierr)
         do i=nlocal-1,0,-1
            do j=i-1,0,-1
               dx(i) = p(PX,i) - q(PX,j)
               dy(i) = p(PY,i) - q(PY,j)
               dz(i) = p(PZ,i) - q(PZ,j)
               sq(i) = dx(i)**2+dy(i)**2+dz(i)**2
               dist(i) = sqrt(sq(i))
               fac(i) = p(MM,i) * q(MM,j) / (dist(i) * sq(i))
               tx(i) = fac(i) * dx(i)
               ty(i) = fac(i) * dy(i)
               tz(i) = fac(i) * dz(i)
               p(FX,i) = p(FX,i)-tx(i)
               q(FX,j) = q(FX,j)+tx(i)
               p(FY,i) = p(FY,i)-ty(i)
               q(FY,j) = q(FY,j)+ty(i)
               p(FZ,i) = p(FZ,i)-tz(i)
               q(FZ,j) = q(FZ,j)+tz(i)
            end do
         end do
c
c In half the processes we include the interaction of each particle
c with the corresponding particle in the opposing process.
c
         if (myrank .lt. nprocs/2) then
            do i=0,nlocal-1
               dx(i) = p(PX,i) - q(PX,i)
               dy(i) = p(PY,i) - q(PY,i)
               dz(i) = p(PZ,i) - q(PZ,i)
               sq(i) = dx(i)**2+dy(i)**2+dz(i)**2
               dist(i) = sqrt(sq(i))
               fac(i) = p(MM,i) * q(MM,i) / (dist(i) * sq(i))
               tx(i) = fac(i) * dx(i)
               ty(i) = fac(i) * dy(i)
               tz(i) = fac(i) * dz(i)
               p(FX,i) = p(FX,i)-tx(i)
               q(FX,i) = q(FX,i)+tx(i)
               p(FY,i) = p(FY,i)-ty(i)
               q(FY,i) = q(FY,i)+ty(i)
               p(FZ,i) = p(FZ,i)-tz(i)
               q(FZ,i) = q(FZ,i)+tz(i)
            end do
         endif
c
c Now the q array is returned to its home process.
c
         dest = mod (nprocs+myrank-nprocs/2, nprocs)
         src  = mod (myrank+nprocs/2, nprocs)
         call mpi_sendrecv_replace (q, 7*nlocal, MPI_REAL, dest, 400,
     #               src, 400, MPI_COMM_WORLD, status, ierr)
      end if
c
c The p and q arrays are summed to give the total force on each
c particle.
c
      do i=0,nlocal-1
         p(FX,i) = p(FX,i) + q(FX,i)
         p(FY,i) = p(FY,i) + q(FY,i)
         p(FZ,i) = p(FZ,i) + q(FZ,i)
      end do
c
c Stop clock and write out timings
c
      timeend = mpi_wtime ()
      print *,'Node', myrank, 'Elapsed time: ',
     #        timeend-timebegin, 'seconds'
c
c Do a barrier to make sure the timings are written out first
c
      call mpi_barrier (MPI_COMM_WORLD, ierr)
c
c Each process returns its forces to the pseudohost which prints them
c out.
c
      if (myrank .eq. pseudohost) then
         open (7,file='nbody.output')
         write (7,100) (p(FX,i),p(FY,i),p(FZ,i),i=0,nlocal-1)
         call mpi_type_vector (nlocal, 3, 7, MPI_REAL, newtype, ierr)
         call mpi_type_commit (newtype, ierr)
         do k=0,nprocs-1
            if (k .ne. pseudohost) then
               call mpi_recv (q(FX,0), 1, newtype,
     #                        k, 100, MPI_COMM_WORLD, status, ierr)
               write (7,100) (q(FX,i),q(FY,i),q(FZ,i),i=0,nlocal-1)
            end if
         end do
      else
         call mpi_type_vector (nlocal, 3, 7, MPI_REAL, newtype, ierr)
         call mpi_type_commit (newtype, ierr)
         call mpi_send (p(FX,0), 1, newtype,
     #                  pseudohost, 100, MPI_COMM_WORLD, ierr)
      end if
c
c Close MPI
c
 999  call mpi_finalize (ierr)
      stop
 100  format(3e15.6)
      end
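As a consistency check on the ring algorithm, let P be the number of processes and n = nlocal = N/P. The pair interactions evaluated by the listing add up to

\[
\underbrace{P\,\frac{n(n-1)}{2}}_{\text{local pass}}
+\underbrace{P\Bigl(\frac{P}{2}-1\Bigr)n^{2}}_{\text{full rotations}}
+\underbrace{P\,\frac{n(n-1)}{2}}_{\text{half rotation}}
+\underbrace{\frac{P}{2}\,n}_{\text{opposing diagonal}}
=\frac{Pn\,(Pn-1)}{2}=\binom{N}{2},
\]

so every distinct pair is computed exactly once; this is why the q array must be routed back to its home process and added to p before the forces are written out.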
D.3. Integration

Program D.3. Simple integration in FORTRAN (Source: Oak Ridge National Lab).

      program integrate
      include 'mpif.h'
      parameter (pi=3.141592654)
      integer rank

      call mpi_init (ierr)
      call mpi_comm_rank (mpi_comm_world, rank, ierr)
      call mpi_comm_size (mpi_comm_world, nprocs, ierr)
      if (rank .eq. 0) then
         open (7, file='input.dat')
         read (7,*) npts
      end if
      call mpi_bcast (npts, 1, mpi_integer, 0, mpi_comm_world, ierr)
      nlocal = (npts-1)/nprocs + 1
      nbeg   = rank*nlocal + 1
      nend   = min (nbeg+nlocal-1,npts)
      deltax = pi/npts
      psum   = 0.0
      do i=nbeg,nend
         x = (i-0.5)*deltax
         psum = psum + sin(x)
      end do
      call mpi_reduce (psum, sum, 1, mpi_real, mpi_sum, 0,
     #                 mpi_comm_world, ierr)
      if (rank .eq. 0) then
         print *,'The integral is', sum*deltax
      end if
      call mpi_finalize (ierr)
      stop
      end
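The program approximates the integral of sin x over [0, \pi] by the midpoint rule,

\[
\int_{0}^{\pi}\sin x\,dx \;\approx\; \Delta x\sum_{i=1}^{npts}\sin\!\Bigl(\bigl(i-\tfrac12\bigr)\Delta x\Bigr),
\qquad \Delta x=\frac{\pi}{npts},
\]

where each rank sums its contiguous block of indices i = nbeg, ..., nend and mpi_reduce combines the partial sums on rank 0; the printed result should approach the exact value 2 as npts grows.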
INDEX

1.5.2 Schrödinger's equations, 133
1D wave equation, 92
2D mapping, 56
3D mapping, 56

Alternating Direction Implicit, 86
Amdahl's law, 18
Amelia Vector Template Library, 166
application buffer, 142
ARCH, 163
architecture
    node, 19
array, 22
asynchronous, 141
asynchronous parallel, 53
average distance, 21
Aztec, 164

bisection bandwidth, 21
BLACS, 165
blocking communication, 142
blocking message passing, 147
Broadcast, 156
buffer, 147

classical molecular dynamics, 126
Collective communication, 154
communication, 38
    asynchronous, 38, 141
    blocking, 142
    collective, 154
    non-blocking, 142
    point to point, 147
    cooperative, 139
    interrupt, 40
    one-sided, 139
    patterns, 41
    reduction, 155
    synchronous, 38, 141
communicator, 142, 148
connectivity, 21
Cononical Classes for Concurrency Control, 166
control decomposition, 46
cost effectiveness, 9

data count, 147
data parallel, 46
data type, 147
dbx, 43
decomposition, 16
dense linear systems, 81
destination, 148
DFT, 105
diameter, 21
diffusion equation, 135
DIMEMAS, 162
discrete Fourier transform, 105
distributed memory, 29, 140
distributed-shared-memory, 29
divide and conquer, 46
domain decomposition, 46
DQS, 165

efficiency, 15
embarrassingly parallel, 52, 89
envelope, 143
environment management routines, 144
explicit finite difference, 92
extended collective operation, 140
external interface, 140

fast Fourier transform, 105, 107
fat tree, 22
FFTW, 166
force decomposition, 132
Fourier transform, 105, 113

gather, 155
Gaussian elimination, 81
global gather, 56
grand challenge, 5, 123
granularity, 16
groups, 142

hardware, 19
header file, 143
heat equation, 135
Helmholtz equation, 135
hyperbolic equations, 92
hypercube, 22

iCC, 165
integration, 89, 161
    Monte Carlo method, 90
IPD, 43

linear algebra
    block partition, 66
    column partition, 66
    row partition, 66
    scatter partition, 66
linear mapping, 56
linear speedup, 15, 17
linear system, 81
linpack, 6
load imbalance, 15
load imbalance ratio, 15
long-ranged interactions, 126

Markov chain decomposition, 120, 121
master-slave, 46
Maxwell's equations, 135
memory
    connectivity to processor, 29
memory interleaving, 1
mesh, 22
message attributes, 143
message passing, 140, 141
message passing library, 141
message tag, 143
Metropolis method, 119
molecular dynamics, 51, 125, 137
    multipole method, 126
    particle-mesh, 126
    particle-particle, 126
Monte Carlo method, 90, 115–119, 121
motion equation, 125
mpC, 167
MPI
    data type, 147
    header, 143
    history, 139
MPI Cubix, 166
MPI Map, 164
MPI_Abort, 145
MPI_Allgather, 157
MPI_Allreduce, 157
MPI_Alltoall, 158
MPI_Barrier, 156
MPI_Bcast, 156
MPI_Bsend, 149
MPI_Buffer_attach, 150
MPI_Buffer_detach, 150
MPI_Comm_rank, 145
MPI_Comm_size, 144
MPI_Finalize, 146
MPI_Gather, 157
MPI_Get_processor_name, 145
MPI_Ibsend, 153
MPI_Init, 144
MPI_Initialized, 145
MPI_Iprobe, 154
MPI_Irecv, 152
MPI_Irsend, 153
MPI_Isend, 152
MPI_Issend, 153
MPI_Probe, 150
MPI_Recv, 149
MPI_Reduce, 157
MPI_Reduce_scatter, 158
MPI_Rsend, 150
MPI_Scan, 159
MPI_Scatter, 157
MPI_Send, 147
MPI_Sendrecv, 150
MPI_Ssend, 149
MPI_Test, 153
MPI_Wait, 154
MPI_Wtick, 146
MPI_Wtime, 146
MPIRUN, 167
MPIX, 166
MSG, 165

NAG Parallel Library, 165
Navier-Stokes equation, 136
neighbor lists, 131
network buffer, 38
Newton's equation, 124
node compilers, 36
node operating systems, 35
node program debuggers, 43
non-blocking communication, 142
numerical integration, 89

OOMPI, 163
OpenMP, 171
overhead, 17
overlap mapping, 56

Pallas, 161
Para++, 166
parallel efficiency, 15
Parallel processing, 1
particle decomposition, 132
particle-in-cell, 131
partition function, 134
PETSc, 165
PGAPack, 163
point to point communication, 147
Poisson equation, 96, 97
process, 141
process creation, 139

quantum molecular dynamics, 53
quasi-scalable algorithm, 18

random mapping, 56
rank, 142
receive, 141
request, 149
Riemann summation approximation, 89
ring, 22

scalability, 18
scalable, 15
ScaLAPACK, 163
scaling zone, 18
scatter, 155
Scheduling, 50
search space decomposition, 116, 119
send, 141
serial computer, 1
shared-distributed-memory, 29
shared-memory, 29, 42
short-ranged interactions, 126
Simpson's Rule, 89
source, 148
sparse linear systems, 85
speedup, 14
SPMD, 46
STAR/MPI, 164
status, 149
super speedup, 15
synchronization, 154
synchronous, 141
system buffer, 142

tag, 148
threadprivate, 172
torus, 22
trapezoid rule, 89

VAMPIR, 163
vector processing, 1
virtual-shared-memory, 46

XMPI, 163