Brief Overview of Parallel Programming

M. D. Jones, Ph.D.
Center for Computational Research
University at Buffalo
State University of New York

Spring 2014

Background
A Little Motivation

Why do we parallel process in the first place?
Performance - either solve a bigger problem, or reach a solution faster (or both).
Hardware is becoming (actually it has been for a while now) intrinsically parallel.
Parallel programming is not easy. How easy is sequential programming? Now add another layer (a sizable one, at that) for parallelism ...
Big Picture
Decomposition

The basic idea of parallel programming is a simple one:

[Figure: a problem domain decomposed in 1D, 2D, or 3D across the available CPUs.]

in which we decompose a large computational problem onto the available processing elements (CPUs). How you achieve that decomposition is where all the fun lies ...
Decomposition (continued)

Uniform volume example - decomposition in 1D, 2D, and 3D. Note that load balancing in this case is simpler if the workload per cell is the same ...
Nonuniform example - decomposition in 3D for a 16-way parallel adaptive finite element calculation (in this case using a terrain elevation). Load balancing in this case is much more complex.

Demand for HPC
More Speed!
Why Use Parallel Computation?

Note that, in the modern era, HPC inevitably means parallel (or concurrent) computation. The driving forces behind this are pretty simple - the desire is:
Solve my problem faster, i.e. I want the answer now (and who doesn't want that?).
I want to solve a bigger problem than I (or anyone else, for that matter) have ever before been able to tackle, and do so in a reasonable time (generally "reasonable" = within a graduate student's time to graduate!).
A Concrete Example

Well, more of a discrete example, actually. Let's consider the gravitational N-body problem.

Example
Using classical gravitation, we have a very simple (but long-ranged) force/potential. For each of N bodies, the resulting force is computed from the other N-1 bodies, thereby requiring N^2 force calculations per step. A galaxy consists of approximately 10^12 such bodies, and even the best algorithm for computing the forces requires roughly N log2 N calculations, i.e. about 10^12 ln(10^12)/ln(2) calculations per step. If each calculation takes about 1 microsecond, that is roughly 40 x 10^6 seconds per step, or about 1.3 CPU-years per step. Ouch!
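To make the N^2 cost concrete, here is a minimal sketch (not from the original slides) of the direct-sum force loop being counted above; the Body type and the use of the gravitational constant G as a parameter are illustrative assumptions:

/* Direct O(N^2) gravitational force accumulation (illustrative sketch). */
#include <math.h>
#include <stddef.h>

typedef struct { double x, y, z, mass; } Body;   /* hypothetical type */

/* Accumulate the force on body i from all other bodies (N-1 terms). */
void force_on_body(const Body *b, size_t n, size_t i, double G, double f[3])
{
    f[0] = f[1] = f[2] = 0.0;
    for (size_t j = 0; j < n; j++) {
        if (j == i) continue;
        double dx = b[j].x - b[i].x;
        double dy = b[j].y - b[i].y;
        double dz = b[j].z - b[i].z;
        double r2 = dx*dx + dy*dy + dz*dz;
        double inv_r3 = 1.0 / (r2 * sqrt(r2));
        double s = G * b[i].mass * b[j].mass * inv_r3;
        f[0] += s * dx;  f[1] += s * dy;  f[2] += s * dz;
    }
}
/* Calling this for every i costs ~N^2 interactions per step; tree-based
 * algorithms reduce that to ~N log N, the count used in the example.   */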
A Size Example

On the other side, suppose that we need to diagonalize (or invert) a dense matrix being used in, say, an eigenproblem derived from some engineering or multiphysics problem.

Example
In this problem, we want to increase the resolution to capture the essential underlying behavior of the physical process being modeled. So we determine that we need a matrix of order, say, 400000. Simply to store this matrix, in 64-bit representation, requires about 1.28 x 10^12 Bytes of memory, or roughly 1200 GBytes. We could fit this onto a cluster with, say, 10^3 nodes, each having 4 GBytes of memory, by distributing the matrix across the individual memories of each cluster node.
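A minimal back-of-the-envelope sketch (not from the original slides) of the storage arithmetic above; the 4 GB node memory is the value assumed in the example:

#include <math.h>
#include <stdio.h>

int main(void)
{
    double n       = 400000.0;                /* matrix order             */
    double bytes   = n * n * 8.0;             /* 8 bytes per 64-bit real  */
    double gbytes  = bytes / (1024.0 * 1024.0 * 1024.0);   /* ~1200 GB    */
    double node_gb = 4.0;                     /* memory per cluster node  */
    printf("total storage: %.3g bytes = %.0f GB\n", bytes, gbytes);
    printf("minimum nodes at %.0f GB each: %.0f\n",
           node_gb, ceil(gbytes / node_gb));  /* fits easily on 10^3 nodes */
    return 0;
}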
So ...

So the lessons to take from the preceding examples should be clear:
It is the nature of the research (and commercial) enterprise to always be trying to accomplish larger tasks with ever increasing speed, or efficiency. For that matter, it is human nature.
These examples, while specific, apply quite generally - do you see the connection between the first example and general molecular modeling? The second example and finite element (or finite difference) approaches to differential equations? How about web searching?
Scaling Concepts

By scaling we typically mean the relative performance of a parallel vs. serial implementation:

Definition (Scaling): the speedup factor S(p) is given by
$$ S(p) = \frac{\text{sequential execution time (using optimal implementation)}}{\text{parallel execution time using } p \text{ processors}} , $$
so, in the ideal case, S(p) = p.
Inherent Limitations
Inherent Limitations in Parallel Speedup

Limitations on the maximum speedup:
the fraction of the code, f, that cannot be made to execute in parallel
parallel overhead (communication, duplication costs)

Parallel Efficiency

Using $t_S$ as the (best) sequential execution time, we note that
$$ S(p) = \frac{t_S}{t_p(p)} . $$
Using this serial fraction, f, we can note that
$$ t_p \ge f\,t_S + (1-f)\,t_S/p $$
for a lower bound, and the parallel efficiency is given by
$$ E(p) = \frac{S(p)}{p} = \frac{t_S}{p\,t_p} . $$
Amdahl's Law

This simplification for $t_p$ leads directly to Amdahl's Law.

Definition (Amdahl's Law):
$$ S(p) \le \frac{t_S}{f\,t_S + (1-f)\,t_S/p} = \frac{p}{1 + f(p-1)} . $$

G. M. Amdahl, "Validity of the Single Processor Approach to Achieving Large-Scale Computing Capabilities," AFIPS Conference Proceedings 30 (AFIPS Press, Atlantic City, NJ), 483-485, 1967.

Implications of Amdahl's Law

The implications of Amdahl's law are pretty straightforward. In the limit as $p \to \infty$,
$$ \lim_{p\to\infty} S(p) \le \frac{1}{f} . $$
If the serial fraction is 5%, the maximum parallel speedup is only 20.
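As a quick numerical check (a minimal sketch, not part of the original slides), the following tabulates the Amdahl bound S(p) <= p/(1 + f(p-1)) and its limiting value 1/f for a few serial fractions, the same quantities plotted in the figure that follows:

#include <stdio.h>

/* Amdahl bound on speedup for serial fraction f and p processors. */
static double amdahl(double f, double p) { return p / (1.0 + f * (p - 1.0)); }

int main(void)
{
    double fracs[] = { 0.001, 0.01, 0.05, 0.1 };
    for (int i = 0; i < 4; i++) {
        double f = fracs[i];
        printf("f = %5.3f : S(16) = %6.2f  S(256) = %6.2f  S(inf) <= %6.1f\n",
               f, amdahl(f, 16.0), amdahl(f, 256.0), 1.0 / f);
    }
    return 0;
}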
A Practical Example

[Figure: maximum parallel speedup MAX(S(p)) vs. number of processors p, for serial fractions f = 0.001, 0.01, 0.02, 0.05, and 0.1.]

Implications of Amdahl's Law (cont'd)

Let $t_p = t_{comm} + t_{comp}$, where $t_{comm}$ is the time spent in communication between parallel processes (unavoidable overhead) and $t_{comp}$ is the time spent in (parallelizable) computation. Taking $t_S \simeq p\,t_{comp}$,
$$ S(p) = \frac{p\,t_{comp}}{t_{comm} + t_{comp}} = \frac{p}{1 + t_{comm}/t_{comp}} . $$

The point here is that it is critical to minimize communication time relative to the time spent doing computation (a recurring theme).
Defeating Amdahl's Law

There are ways to work around the serious implications of Amdahl's law:
We assumed that the problem size was fixed, which is (very) often not the case.
Now consider a case where the problem size is allowed to vary.
Assume now that the problem size is scaled such that $t_p$ is held constant.

Gustafson's Law

Now let $t_p$ be held constant as p is increased:
$$ S_s(p) = \frac{t_S}{t_p} = \frac{f\,t_S + (1-f)\,t_S}{t_p} = \frac{f\,t_S + (1-f)\,t_S}{f\,t_S + (1-f)\,t_S/p} = \frac{p}{1-(1-p)f} \simeq p + p(1-p)f + \cdots $$

Another way of looking at this is that the serial fraction becomes negligible as the problem size is scaled. Actually, that is a pretty good definition of a scalable code ...

J. L. Gustafson, "Reevaluating Amdahl's Law," Comm. ACM 31(5), 532-533 (1988).
Scalability

Definition (Scalable): an algorithm is scalable if there is a minimal nonzero efficiency as $p \to \infty$ and the problem size is allowed to take on any value.

I like this (equivalent) one better:

Definition (Scalable): for a scalable code the sequential fraction becomes negligible as the problem size (and the number of processors) grows.

Overview of Parallel APIs
Basic Terminology
The Parallel Zoo

There are many parallel programming models, but roughly they break down into the following categories:
Threaded models: e.g., OpenMP or POSIX threads as the application programming interface (API)
Message passing: e.g., MPI = Message Passing Interface
Hybrid methods: combine elements of other models
Thread Models

[Figure: typical thread model (e.g., OpenMP) - a single program a.out with shared data, spawning threads 1, 2, ..., N over time.]

Shared address space (all threads can access the data of program a.out).
Generally limited to a single machine except in very specialized cases.
GPGPU (general purpose GPU computing) uses a thread model (with a large number of threads on the host machine).
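As an illustration of the thread model (a minimal sketch, not from the original slides), an OpenMP parallel loop in which all threads of a single a.out update shared data; the array size and loop body are arbitrary:

/* Compile with OpenMP enabled, e.g. gcc -fopenmp. */
#include <stdio.h>
#include <omp.h>

int main(void)
{
    const int n = 1000000;
    static double x[1000000];          /* shared by all threads */
    double sum = 0.0;

    #pragma omp parallel for reduction(+:sum)   /* fork a team of threads */
    for (int i = 0; i < n; i++) {
        x[i] = 1.0 / (i + 1.0);
        sum += x[i];
    }
    printf("max threads available: %d, sum = %f\n", omp_get_max_threads(), sum);
    return 0;
}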
Message Passing Models

[Figure: typical message passing model (e.g., MPI) - N separate tasks (task 0 ... task N-1), each running its own copy of a.out with its own PID and private data, exchanging messages (send/recv) and collective communications (broadcast/reduce/...) over time.]

Separate address space (tasks cannot access the data of other tasks without sending messages or participating in collective communications).
General purpose model - messages can be sent over a network, through shared memory, etc.
Managing communication is generally up to the programmer.
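A minimal message-passing sketch (not from the original slides): each task runs its own copy of the program with private data, and the tasks cooperate only through MPI calls, here a single collective reduction:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this task's id (0..N-1) */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* number of tasks         */

    double local = (double)rank, total = 0.0;
    /* collective communication: sum the private values onto task 0 */
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum over %d tasks = %g\n", size, total);
    MPI_Finalize();
    return 0;
}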
Parallelism at Three Levels

Task - macroscopic view in which an algorithm is organized around separate concurrent tasks
Data - structure the data in (separately) updatable chunks
Instruction - instruction-level parallelism (ILP) through the flow of data in a predictable fashion

Approaching the Problem

The first step in deciding where to add parallelism to an algorithm or application is usually to analyze it from the point of view of tasks and data:
Tasks: how do I reduce this problem into a set of tasks that can be executed concurrently?
Data: how do I take the key data and represent it in such a way that large chunks can be operated on independently (and thus concurrently)?
Dependencies

Bear in mind that you always want to minimize the dependencies between your concurrent tasks and data:
On the task level, design your tasks to be as independent as possible - this can also be temporal, namely taking into account any necessity for ordering the tasks and their execution.
Sharing of data will require synchronization and thus introduces data dependencies, if not race conditions or contention.

An Hour of Planning ...

Taking the time to carefully analyze an existing application, or plan a new one, can really pay off later:
Plan for parallel programming by keeping data- and task-parallel constructs/opportunities in mind.
Analyze an existing or new program to discover/confirm the locations of the time-consuming portions.
Optimize by implementing a strategy using data- or task-parallel constructs.
Auto-Parallelization

Often called implicit parallelism; some compilers (usually commercial ones) have the ability to:
automatically parallelize simple (outer) loops (thread-level parallelism, or TLP)
automatically vectorize (usually innermost loops) using ILP

Note that thread-level parallelism is only usable on an SMP architecture.

Auto-Vectorization

Like auto-parallelization, a feature of high-performance (often commercial) compilers on hardware that supports at least some level of vectorization:
the compiler finds low-level operations that can operate simultaneously on multiple data elements using a single instruction
Intel/AMD processors with SSE/SSE2/SSE3 are usually limited to vector lengths of 2^1 to 2^3 (a hardware limitation that true vector processors do not share)
the user can exert some control through compiler directives (usually vendor specific) and data organization focused on vector operations
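A minimal sketch (not from the original slides) of the kind of innermost loop an auto-vectorizing compiler can handle; the saxpy name is illustrative, and the flags mentioned in the comment are only examples:

/* Independent, unit-stride iterations; a vectorizing compiler (e.g.
 * gcc -O3 -ftree-vectorize, or a commercial compiler with vectorization
 * reports enabled) can issue SSE instructions for this loop.  The
 * restrict qualifiers promise that x and y do not alias.              */
void saxpy(int n, float a, const float *restrict x, float *restrict y)
{
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}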
Downside of Automatic Methods

The automatic parallelization schemes have a number of shortfalls:
Very limited in scope - parallelizes only a very small subset of code.
Without directives from the programmer, the compiler is very limited in identifying parallelizable regions of code.
Performance is rather easily degraded rather than sped up.
Wrong results are easy to (automatically) produce.

Shared Memory APIs
Shared Memory Parallelism

Here we consider explicit modes of parallel programming on shared memory architectures:
SHMEM, the old Cray shared memory library
Pthreads, the POSIX threads standard (available on Windows as well)
OpenMP, a common specification for compiler directives and an API
HPF, High Performance Fortran
Distributed Memory APIs
Distributed Memory Parallel Programming

In this case, mainly message passing, with a few other (compiler directives again) possibilities:
HPF, High Performance Fortran
Cluster OpenMP, a product from Intel to push OpenMP into a distributed memory regime
PVM, Parallel Virtual Machine
MPI, Message Passing Interface, which has become the dominant API for general purpose parallel programming
UPC, Unified Parallel C, extensions to ISO C99 for parallel programming (new)
CAF, Co-Array Fortran, parallel extensions to Fortran 95/2003 (new, officially part of Fortran 2008)

APIs/Language Extensions & Availability
OpenMP

Synopsis: designed for multi-platform shared memory parallel programming in C/C++/Fortran, primarily through the use of compiler directives and environment variables (library routines are also available).
Target: shared memory systems.
Current Status: Specification 1.0 released in 1997, 2.0 in 2002, 2.5 in 2005, 3.0 in 2008, 3.1 in 2011. Widely implemented by commercial compiler vendors, and now also in most open-source compilers (the 4.0 specification was released in 2013-07, not yet available in production compilers).
More Info: the OpenMP home page: http://www.openmp.org
OpenMP Availability

Platform: Linux x86_64

Version   Invocation (example)
3.1       ifort -openmp -openmp_report2 ...
3.0       pgf90 -mp ...
2.5       gfortran -fopenmp ...   (GNU >= 4.2)
3.0       gfortran -fopenmp ...   (GNU >= 4.4)
3.1       gfortran -fopenmp ...   (GNU >= 4.7)

MPI
Message Passing Interface

Synopsis: a message passing library specification proposed as a standard by a committee of vendors, implementors, and users.
Target: distributed and shared memory systems.
Current Status: the most popular (and most portable) of the message passing APIs.
More Info: ANL's main MPI site: www-unix.mcs.anl.gov/mpi
MPI Availability at CCR

Platform       Version (+MPI-2)
Linux IA64     1.2+ (C++, MPI-I/O)
Linux x86_64   1.2+ (C++, MPI-I/O), 2.x (various)

I am starting to favor the commercial Intel MPI for its ease of use, especially in terms of supporting multiple networks/protocols.

Unified Parallel C (UPC)

Unified Parallel C (UPC) is a project based at Lawrence Berkeley Laboratory to extend C (ISO C99) and provide large-scale parallel computing on both shared and distributed memory hardware:
Explicit parallel execution (SPMD)
Shared address space (shared/private data, like OpenMP), but exploits data locality
Primitives for memory management
BSD license (free download)
http://upc.lbl.gov/, community website at http://upc.gwu.edu/
Simple example of some UPC syntax:

shared int all_hits[THREADS];
...
...
for (i = 0; i < my_trials; i++) my_hits += hit();
all_hits[MYTHREAD] = my_hits;
upc_barrier;
if (MYTHREAD == 0) {
   total_hits = 0;
   for (i = 0; i < THREADS; i++) {
      total_hits += all_hits[i];
   }
   pi = 4.0*((double)total_hits)/((double)trials);
   printf("PI estimated to %10.7f from %d trials on %d threads.\n",
          pi, trials, THREADS);
}

Co-Array Fortran

Co-Array Fortran (CAF) is an extension to Fortran (Fortran 95/2003) to provide data decomposition for parallel programs (somewhat akin to UPC and HPF):
Original specification by Numrich and Reid, ACM Fortran Forum 17, no. 2, pp. 1-31 (1998).
The ISO Fortran committee included co-arrays in the next revision of the Fortran standard, Fortran 2008 (cf. Numrich and Reid, ACM Fortran Forum 24, no. 2, pp. 4-17 (2005)).
Does not require shared memory (more applicable than OpenMP).
Early compiler work at Rice: http://www.hipersoft.rice.edu/caf/index.html
Simple examples of CAF usage:

REAL, DIMENSION(N)[*] :: X, Y
X           = Y[PE]     ! get from Y[PE]
Y[PE]       = X         ! put into Y[PE]
Y[:]        = X         ! broadcast X
Y[LIST]     = X         ! broadcast X over subset of PEs in array LIST
Z(:)        = Y[:]      ! collect all Y
S = MINVAL(Y[:])        ! min (reduce) all Y
B(1:M)[1:N] = S         ! S scalar, promoted to array of shape (1:M,1:N)

UPC and CAF are examples of partitioned global address space (PGAS) languages, for which a large push is being made by AHPCRC/DARPA as part of the Petascale computing initiative (there is also one based on Java called Titanium).

Libraries
Leveraging Existing Parallel Libraries

One easy way to utilize parallel programming resources is through the use of existing parallel libraries. The most common examples (note the orientation around standard mathematical routines):
BLAS - Basic Linear Algebra Subroutines
LAPACK - Linear Algebra PACKage (uses BLAS)
ScaLAPACK - distributed memory (MPI) solvers for common LAPACK functions
PETSc - Portable, Extensible Toolkit for Scientific Computation
FFTW - Fastest Fourier Transforms in the West
BLAS

The Basic Linear Algebra Subroutines (BLAS) form a standard set of library functions for vector and matrix operations:
Level 1: vector-vector (e.g. xaxpy, xdot, where x = s, d, c, z)
Level 2: matrix-vector (e.g. xgemv, where x = s, d, c, z)
Level 3: matrix-matrix (e.g. xgemm, where x = s, d, c, z)

These routines are generally provided by vendors, hand-tuned at the level of assembly code for optimum performance on a particular processor. Shared memory (multithreaded) and distributed memory versions are available.

www.netlib.org/blas
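A minimal sketch (not from the original slides) of calling a Level 1 BLAS routine through the CBLAS interface; linking against a tuned library (e.g. MKL or ACML) rather than the reference BLAS is what delivers the vendor performance discussed below:

#include <stdio.h>
#include <cblas.h>

int main(void)
{
    double x[4] = { 1.0, 2.0, 3.0, 4.0 };
    double y[4] = { 4.0, 3.0, 2.0, 1.0 };
    /* ddot: double-precision dot product, unit strides */
    double d = cblas_ddot(4, x, 1, y, 1);
    printf("x . y = %f\n", d);   /* 20.0 */
    return 0;
}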
Vendor BLAS

Vendor implementations of the BLAS:

AMD      ACML
Apple    Velocity Engine
Compaq   CXML
Cray     libsci
HP       MLIB
IBM      ESSL
Intel    MKL
NEC      PDLIB/SX
SGI      SCSL
SUN      Sun Performance Library
Brief Overview of Parallel Programming
Libraries
Libraries
Performance Example
LAPACK
DDOT Performance
The Linear Algebra PACKage (LAPACK) is a library that lies on top of
the BLAS (for optimium performance and parallelism) and provides:
U2 Compute Node
5000
cMKL 8.1.1
Ref (-lblas)
4500
4000
solvers for systems of simultaneous linear equations
L1 cache
L2 cache
3500
least-squares solutions
MFlop/s
3000
eigenvalue problems
2500
2000
singular value problems
Main memory
1500
On CCR systems, the Intel MKL is generally preferred (includes
optimized BLAS and LAPACK)
1000
500
0
1
10
10
10
10
Vector Length [8B]
10
10
www.netlib.org/lapack
Performance advantage of Intel MKL ddot versus reference (system)
version. Level 3 BLAS routines have even more significant gains.
ScaLAPACK
The Scalable LAPACK library, ScaLAPACK, is designed for use on the
largest of problems (typically implemented on distributed memory
systems):
subset of LAPACK routines redesigned for distributed memory
MIMD parallel computers
explicit message passing for interprocessor communication
assumes matrices are laid out in a two-dimensional block cyclic
decomposition
On CCR systems, the Intel Cluster MKL is generally preferred (it includes optimized BLAS, LAPACK, and ScaLAPACK).
www.netlib.org/scalapack
Parallel Program Design Considerations
Programming Costs
Consider:
Complexity: parallel codes can be orders of magnitude more complex
(especially those using message passing) - you have to
plan for and deal with multiple instruction/data streams
Portability: parallel codes frequently have long lifetimes (proportional
to the amount of effort invested in them) - all of the serial
application porting issues apply, plus the choice of
parallel API (MPI, OpenMP, and POSIX threads are
currently good portable choices, but implementations of
them can differ from platform to platform)
Resources: overhead for parallel computation can be significant for
smaller calculations
Scalability: limitations in hardware (CPU-memory speed and
contention, for one example) and the parallel algorithms
will limit speedups. All codes will eventually reach a state
of decreasing returns at some point
Examples
Speedup Limitation Example (MD/NAMD)

[Figure: Joint Amber-CHARMM benchmark, NAMD v2.6b1 on U2 - benchmark MD s/step and parallel speedup vs. number of processors (ppn=2), for the ch_p4 and ch_gm interconnects.]

JAC (Joint Amber-CHARMM) benchmark: DHFR protein, 7182 residues, 23558 atoms, 7023 TIP3 waters, PME (9 Å cutoff).

Communication Costs
Communications
All parallel programs will need some communication between tasks, but the amount varies considerably:
minimal communication, also known as "embarrassingly parallel" - think Monte Carlo calculations
robust communication, in which processes must exchange or update information frequently - time evolution codes, domain-decomposed finite difference/element applications

Communication Considerations

Communication needs between parallel processes affect parallel programs in several important ways:
Cost: there is always overhead when communicating (a simple cost model is sketched after this list):
latency - the cost in overhead for a zero-length message (microseconds)
bandwidth - the amount of data that can be sent per unit time (MBytes/second); many applications use many small messages and are latency-bound
Scope: point-to-point communication is always faster than collective communication (which involves a large subset of, or all, processes)
Efficiency: are you using the best available resource (network) for communicating?
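A minimal sketch (not from the original slides) of the usual first-order cost model implied by the latency and bandwidth items above, t(n) = latency + n/bandwidth; the numbers used are purely illustrative, not measured CCR values:

#include <stdio.h>

/* latency in seconds, bandwidth in bytes/s, n in bytes */
static double msg_time(double latency, double bandwidth, double n)
{
    return latency + n / bandwidth;
}

int main(void)
{
    double lat = 2.0e-6, bw = 3.0e9;   /* illustrative interconnect figures */
    for (double n = 8.0; n <= 8.0e6; n *= 1000.0)
        printf("%10.0f B : %.2e s\n", n, msg_time(lat, bw, n));
    return 0;   /* small messages are dominated by latency, large by bandwidth */
}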
Latency Example - MPI/CCR

[Figure: MPI PingPong benchmark performance, U2 cluster interconnects - message time (microseconds) vs. message length (bytes) for ch_mx, ch_gm, ch_p4, and DAPL-QDR-IB.]

Latency on CCR - TCP/IP Ethernet/ch_p4, Myrinet/ch_mx, and QDR InfiniBand.

Bandwidth Example - MPI/CCR

[Figure: MPI PingPong benchmark performance, U2 cluster interconnects - bandwidth (MByte/s) vs. message length (bytes) for ch_mx, ch_gm, ch_p4, and DAPL-QDR-IB.]

Bandwidth on CCR - TCP/IP Ethernet/ch_p4, Myrinet/ch_mx, and QDR InfiniBand.
Collective Example - MPI/CCR

MPI_Alltoall Benchmark

[Figure: Intel MPI, 4 KB buffer, 12-core nodes with QLogic QDR IB, Xeon E5645 - time per MPI_Alltoall call (4 KB buffer) vs. number of MPI processes, for ppn = 1, 2, 4, 6, and 12.]

Cost of MPI_Alltoall for a 4 KB buffer on CCR/QDR IB.

Communication Considerations (cont'd)

Visibility: message-passing codes require the programmer to explicitly manage all communications, but many data-parallel codes do not, masking the underlying data transfers.
Synchronicity: synchronous communications are called blocking, as they require an explicit handshake between parallel processes; asynchronous (non-blocking) communications offer the ability to overlap computation with communication, but place an extra burden on the programmer. Examples include:
Barriers - force collective synchronization; often implied, and always expensive
Locks/Semaphores - typically protect memory locations from simultaneous/conflicting updates to prevent race conditions