Brief Overview of Parallel Programming

M. D. Jones, Ph.D.
Center for Computational Research
University at Buffalo
State University of New York

Spring 2014

Background
A Little Motivation

Why do we parallel process in the first place?
Performance - either solve a bigger problem, or reach a solution faster (or both).
Hardware is becoming (actually it has been for a while now) intrinsically parallel.
Parallel programming is not easy. How easy is sequential programming? Now add another layer (a sizable one, at that) for parallelism ...
Big Picture
Decomposition

The basic idea of parallel programming is a simple one:

[Figure: a problem domain decomposed in 1D, 2D, or 3D across the available CPUs.]

in which we decompose a large computational problem onto the available processing elements (CPUs). How you achieve that decomposition is where all the fun lies ...
Decomposition (continued)

Uniform volume example - decomposition in 1D, 2D, and 3D. Note that load balancing in this case is simpler if the workload per cell is the same ...
Nonuniform example - decomposition in 3D for a 16-way parallel adaptive finite element calculation (in this case using a terrain elevation). Load balancing in this case is much more complex.

Demand for HPC
More Speed!
Why Use Parallel Computation?

Note that, in the modern era, HPC inevitably means parallel (or concurrent) computation. The driving forces behind this are pretty simple - the desire is:
Solve my problem faster, i.e. I want the answer now (and who doesn't want that?).
I want to solve a bigger problem than I (or anyone else, for that matter) have ever before been able to tackle, and do so in a reasonable time (generally "reasonable" = within a graduate student's time to graduate!).
A Concrete Example

Well, more of a discrete example, actually. Let's consider the gravitational N-body problem.

Example
Using classical gravitation, we have a very simple (but long-ranged) force/potential. For each of N bodies, the resulting force is computed from the other N-1 bodies, thereby requiring N^2 force calculations per step. A galaxy consists of approximately 10^12 such bodies, and even the best algorithm for computing the forces requires roughly N log2 N calculations, i.e. about 10^12 ln(10^12)/ln(2) calculations per step. If each calculation takes about 1 microsecond, that is roughly 40 x 10^6 seconds per step, or about 1.3 CPU-years per step. Ouch!
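To make the N^2 cost concrete, here is a minimal sketch (not from the original slides) of the direct-sum force loop being counted above; the Body type and the use of the gravitational constant G as a parameter are illustrative assumptions:

/* Direct O(N^2) gravitational force accumulation (illustrative sketch). */
#include <math.h>
#include <stddef.h>

typedef struct { double x, y, z, mass; } Body;   /* hypothetical type */

/* Accumulate the force on body i from all other bodies (N-1 terms). */
void force_on_body(const Body *b, size_t n, size_t i, double G, double f[3])
{
    f[0] = f[1] = f[2] = 0.0;
    for (size_t j = 0; j < n; j++) {
        if (j == i) continue;
        double dx = b[j].x - b[i].x;
        double dy = b[j].y - b[i].y;
        double dz = b[j].z - b[i].z;
        double r2 = dx*dx + dy*dy + dz*dz;
        double inv_r3 = 1.0 / (r2 * sqrt(r2));
        double s = G * b[i].mass * b[j].mass * inv_r3;
        f[0] += s * dx;  f[1] += s * dy;  f[2] += s * dz;
    }
}
/* Calling this for every i costs ~N^2 interactions per step; tree-based
 * algorithms reduce that to ~N log N, the count used in the example.   */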
A Size Example

On the other side, suppose that we need to diagonalize (or invert) a dense matrix being used in, say, an eigenproblem derived from some engineering or multiphysics problem.

Example
In this problem, we want to increase the resolution to capture the essential underlying behavior of the physical process being modeled. So we determine that we need a matrix of order, say, 400000. Simply to store this matrix, in 64-bit representation, requires about 1.28 x 10^12 Bytes of memory, or roughly 1200 GBytes. We could fit this onto a cluster with, say, 10^3 nodes, each having 4 GBytes of memory, by distributing the matrix across the individual memories of each cluster node.
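A minimal back-of-the-envelope sketch (not from the original slides) of the storage arithmetic above; the 4 GB node memory is the value assumed in the example:

#include <math.h>
#include <stdio.h>

int main(void)
{
    double n       = 400000.0;                /* matrix order             */
    double bytes   = n * n * 8.0;             /* 8 bytes per 64-bit real  */
    double gbytes  = bytes / (1024.0 * 1024.0 * 1024.0);   /* ~1200 GB    */
    double node_gb = 4.0;                     /* memory per cluster node  */
    printf("total storage: %.3g bytes = %.0f GB\n", bytes, gbytes);
    printf("minimum nodes at %.0f GB each: %.0f\n",
           node_gb, ceil(gbytes / node_gb));  /* fits easily on 10^3 nodes */
    return 0;
}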
So ...

So the lessons to take from the preceding examples should be clear:
It is the nature of the research (and commercial) enterprise to always be trying to accomplish larger tasks with ever increasing speed, or efficiency. For that matter, it is human nature.
These examples, while specific, apply quite generally - do you see the connection between the first example and general molecular modeling? The second example and finite element (or finite difference) approaches to differential equations? How about web searching?
Scaling Concepts

By scaling we typically mean the relative performance of a parallel vs. serial implementation:

Definition (Scaling): the speedup factor S(p) is given by
$$ S(p) = \frac{\text{sequential execution time (using optimal implementation)}}{\text{parallel execution time using } p \text{ processors}} , $$
so, in the ideal case, S(p) = p.
Inherent Limitations
Inherent Limitations in Parallel Speedup

Limitations on the maximum speedup:
the fraction of the code, f, that cannot be made to execute in parallel
parallel overhead (communication, duplication costs)

Parallel Efficiency

Using $t_S$ as the (best) sequential execution time, we note that
$$ S(p) = \frac{t_S}{t_p(p)} . $$
Using this serial fraction, f, we can note that
$$ t_p \ge f\,t_S + (1-f)\,t_S/p $$
for a lower bound, and the parallel efficiency is given by
$$ E(p) = \frac{S(p)}{p} = \frac{t_S}{p\,t_p} . $$
Amdahl's Law

This simplification for $t_p$ leads directly to Amdahl's Law.

Definition (Amdahl's Law):
$$ S(p) \le \frac{t_S}{f\,t_S + (1-f)\,t_S/p} = \frac{p}{1 + f(p-1)} . $$

G. M. Amdahl, "Validity of the Single Processor Approach to Achieving Large-Scale Computing Capabilities," AFIPS Conference Proceedings 30 (AFIPS Press, Atlantic City, NJ), 483-485, 1967.

Implications of Amdahl's Law

The implications of Amdahl's law are pretty straightforward. In the limit as $p \to \infty$,
$$ \lim_{p\to\infty} S(p) \le \frac{1}{f} . $$
If the serial fraction is 5%, the maximum parallel speedup is only 20.
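As a quick numerical check (a minimal sketch, not part of the original slides), the following tabulates the Amdahl bound S(p) <= p/(1 + f(p-1)) and its limiting value 1/f for a few serial fractions, the same quantities plotted in the figure that follows:

#include <stdio.h>

/* Amdahl bound on speedup for serial fraction f and p processors. */
static double amdahl(double f, double p) { return p / (1.0 + f * (p - 1.0)); }

int main(void)
{
    double fracs[] = { 0.001, 0.01, 0.05, 0.1 };
    for (int i = 0; i < 4; i++) {
        double f = fracs[i];
        printf("f = %5.3f : S(16) = %6.2f  S(256) = %6.2f  S(inf) <= %6.1f\n",
               f, amdahl(f, 16.0), amdahl(f, 256.0), 1.0 / f);
    }
    return 0;
}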
A Practical Example

[Figure: maximum parallel speedup MAX(S(p)) vs. number of processors p, for serial fractions f = 0.001, 0.01, 0.02, 0.05, and 0.1.]

Implications of Amdahl's Law (cont'd)

Let $t_p = t_{comm} + t_{comp}$, where $t_{comm}$ is the time spent in communication between parallel processes (unavoidable overhead) and $t_{comp}$ is the time spent in (parallelizable) computation. Taking $t_S \simeq p\,t_{comp}$,
$$ S(p) = \frac{p\,t_{comp}}{t_{comm} + t_{comp}} = \frac{p}{1 + t_{comm}/t_{comp}} . $$

The point here is that it is critical to minimize communication time relative to the time spent doing computation (a recurring theme).
Defeating Amdahl's Law

There are ways to work around the serious implications of Amdahl's law:
We assumed that the problem size was fixed, which is (very) often not the case.
Now consider a case where the problem size is allowed to vary.
Assume now that the problem size is scaled such that $t_p$ is held constant.

Gustafson's Law

Now let $t_p$ be held constant as p is increased:
$$ S_s(p) = \frac{t_S}{t_p} = \frac{f\,t_S + (1-f)\,t_S}{t_p} = \frac{f\,t_S + (1-f)\,t_S}{f\,t_S + (1-f)\,t_S/p} = \frac{p}{1-(1-p)f} \simeq p + p(1-p)f + \cdots $$

Another way of looking at this is that the serial fraction becomes negligible as the problem size is scaled. Actually, that is a pretty good definition of a scalable code ...

J. L. Gustafson, "Reevaluating Amdahl's Law," Comm. ACM 31(5), 532-533 (1988).
Scalability

Definition (Scalable): an algorithm is scalable if there is a minimal nonzero efficiency as $p \to \infty$ and the problem size is allowed to take on any value.

I like this (equivalent) one better:

Definition (Scalable): for a scalable code the sequential fraction becomes negligible as the problem size (and the number of processors) grows.

Overview of Parallel APIs
Basic Terminology
The Parallel Zoo

There are many parallel programming models, but roughly they break down into the following categories:
Threaded models: e.g., OpenMP or POSIX threads as the application programming interface (API)
Message passing: e.g., MPI = Message Passing Interface
Hybrid methods: combine elements of other models
Thread Models

[Figure: typical thread model (e.g., OpenMP) - a single program a.out with shared data, spawning threads 1, 2, ..., N over time.]

Shared address space (all threads can access the data of program a.out).
Generally limited to a single machine except in very specialized cases.
GPGPU (general purpose GPU computing) uses a thread model (with a large number of threads on the host machine).
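As an illustration of the thread model (a minimal sketch, not from the original slides), an OpenMP parallel loop in which all threads of a single a.out update shared data; the array size and loop body are arbitrary:

/* Compile with OpenMP enabled, e.g. gcc -fopenmp. */
#include <stdio.h>
#include <omp.h>

int main(void)
{
    const int n = 1000000;
    static double x[1000000];          /* shared by all threads */
    double sum = 0.0;

    #pragma omp parallel for reduction(+:sum)   /* fork a team of threads */
    for (int i = 0; i < n; i++) {
        x[i] = 1.0 / (i + 1.0);
        sum += x[i];
    }
    printf("max threads available: %d, sum = %f\n", omp_get_max_threads(), sum);
    return 0;
}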
Message Passing Models

[Figure: typical message passing model (e.g., MPI) - N separate tasks (task 0 ... task N-1), each running its own copy of a.out with its own PID and private data, exchanging messages (send/recv) and collective communications (broadcast/reduce/...) over time.]

Separate address space (tasks cannot access the data of other tasks without sending messages or participating in collective communications).
General purpose model - messages can be sent over a network, through shared memory, etc.
Managing communication is generally up to the programmer.
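A minimal message-passing sketch (not from the original slides): each task runs its own copy of the program with private data, and the tasks cooperate only through MPI calls, here a single collective reduction:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this task's id (0..N-1) */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* number of tasks         */

    double local = (double)rank, total = 0.0;
    /* collective communication: sum the private values onto task 0 */
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum over %d tasks = %g\n", size, total);
    MPI_Finalize();
    return 0;
}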
Parallelism at Three Levels

Task - macroscopic view in which an algorithm is organized around separate concurrent tasks
Data - structure the data in (separately) updatable chunks
Instruction - instruction-level parallelism (ILP) through the flow of data in a predictable fashion

Approaching the Problem

The first step in deciding where to add parallelism to an algorithm or application is usually to analyze it from the point of view of tasks and data:
Tasks: how do I reduce this problem into a set of tasks that can be executed concurrently?
Data: how do I take the key data and represent it in such a way that large chunks can be operated on independently (and thus concurrently)?
Dependencies

Bear in mind that you always want to minimize the dependencies between your concurrent tasks and data:
On the task level, design your tasks to be as independent as possible - this can also be temporal, namely taking into account any necessity for ordering the tasks and their execution.
Sharing of data will require synchronization and thus introduces data dependencies, if not race conditions or contention.

An Hour of Planning ...

Taking the time to carefully analyze an existing application, or plan a new one, can really pay off later:
Plan for parallel programming by keeping data- and task-parallel constructs/opportunities in mind.
Analyze an existing or new program to discover/confirm the locations of the time-consuming portions.
Optimize by implementing a strategy using data- or task-parallel constructs.
Auto-Parallelization

Often called implicit parallelism; some compilers (usually commercial ones) have the ability to:
automatically parallelize simple (outer) loops (thread-level parallelism, or TLP)
automatically vectorize (usually innermost loops) using ILP

Note that thread-level parallelism is only usable on an SMP architecture.

Auto-Vectorization

Like auto-parallelization, a feature of high-performance (often commercial) compilers on hardware that supports at least some level of vectorization:
the compiler finds low-level operations that can operate simultaneously on multiple data elements using a single instruction
Intel/AMD processors with SSE/SSE2/SSE3 are usually limited to vector lengths of 2^1 to 2^3 (a hardware limitation that true vector processors do not share)
the user can exert some control through compiler directives (usually vendor specific) and data organization focused on vector operations
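A minimal sketch (not from the original slides) of the kind of innermost loop an auto-vectorizing compiler can handle; the saxpy name is illustrative, and the flags mentioned in the comment are only examples:

/* Independent, unit-stride iterations; a vectorizing compiler (e.g.
 * gcc -O3 -ftree-vectorize, or a commercial compiler with vectorization
 * reports enabled) can issue SSE instructions for this loop.  The
 * restrict qualifiers promise that x and y do not alias.              */
void saxpy(int n, float a, const float *restrict x, float *restrict y)
{
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}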
Downside of Automatic Methods

The automatic parallelization schemes have a number of shortfalls:
Very limited in scope - parallelizes only a very small subset of code.
Without directives from the programmer, the compiler is very limited in identifying parallelizable regions of code.
Performance is rather easily degraded rather than sped up.
Wrong results are easy to (automatically) produce.

Shared Memory APIs
Shared Memory Parallelism

Here we consider explicit modes of parallel programming on shared memory architectures:
SHMEM, the old Cray shared memory library
Pthreads, the POSIX threads standard (available on Windows as well)
OpenMP, a common specification for compiler directives and an API
HPF, High Performance Fortran
Distributed Memory APIs
Distributed Memory Parallel Programming

In this case, mainly message passing, with a few other (compiler directives again) possibilities:
HPF, High Performance Fortran
Cluster OpenMP, a product from Intel to push OpenMP into a distributed memory regime
PVM, Parallel Virtual Machine
MPI, Message Passing Interface, which has become the dominant API for general purpose parallel programming
UPC, Unified Parallel C, extensions to ISO C99 for parallel programming (new)
CAF, Co-Array Fortran, parallel extensions to Fortran 95/2003 (new, officially part of Fortran 2008)

APIs/Language Extensions & Availability
OpenMP

Synopsis: designed for multi-platform shared memory parallel programming in C/C++/Fortran, primarily through the use of compiler directives and environment variables (library routines are also available).
Target: shared memory systems.
Current Status: Specification 1.0 released in 1997, 2.0 in 2002, 2.5 in 2005, 3.0 in 2008, 3.1 in 2011. Widely implemented by commercial compiler vendors, and now also in most open-source compilers (the 4.0 specification was released in 2013-07, not yet available in production compilers).
More Info: the OpenMP home page: http://www.openmp.org
OpenMP Availability

Platform: Linux x86_64

Version   Invocation (example)
3.1       ifort -openmp -openmp_report2 ...
3.0       pgf90 -mp ...
2.5       gfortran -fopenmp ...   (GNU >= 4.2)
3.0       gfortran -fopenmp ...   (GNU >= 4.4)
3.1       gfortran -fopenmp ...   (GNU >= 4.7)

MPI
Message Passing Interface

Synopsis: a message passing library specification proposed as a standard by a committee of vendors, implementors, and users.
Target: distributed and shared memory systems.
Current Status: the most popular (and most portable) of the message passing APIs.
More Info: ANL's main MPI site: www-unix.mcs.anl.gov/mpi
MPI Availability at CCR

Platform       Version (+MPI-2)
Linux IA64     1.2+ (C++, MPI-I/O)
Linux x86_64   1.2+ (C++, MPI-I/O), 2.x (various)

I am starting to favor the commercial Intel MPI for its ease of use, especially in terms of supporting multiple networks/protocols.

Unified Parallel C (UPC)

Unified Parallel C (UPC) is a project based at Lawrence Berkeley Laboratory to extend C (ISO C99) and provide large-scale parallel computing on both shared and distributed memory hardware:
Explicit parallel execution (SPMD)
Shared address space (shared/private data, like OpenMP), but exploits data locality
Primitives for memory management
BSD license (free download)
http://upc.lbl.gov/, community website at http://upc.gwu.edu/
Simple example of some UPC syntax:

shared int all_hits[THREADS];
...
...
for (i = 0; i < my_trials; i++) my_hits += hit();
all_hits[MYTHREAD] = my_hits;
upc_barrier;
if (MYTHREAD == 0) {
   total_hits = 0;
   for (i = 0; i < THREADS; i++) {
      total_hits += all_hits[i];
   }
   pi = 4.0*((double)total_hits)/((double)trials);
   printf("PI estimated to %10.7f from %d trials on %d threads.\n",
          pi, trials, THREADS);
}

Co-Array Fortran

Co-Array Fortran (CAF) is an extension to Fortran (Fortran 95/2003) to provide data decomposition for parallel programs (somewhat akin to UPC and HPF):
Original specification by Numrich and Reid, ACM Fortran Forum 17, no. 2, pp. 1-31 (1998).
The ISO Fortran committee included co-arrays in the next revision of the Fortran standard, Fortran 2008 (cf. Numrich and Reid, ACM Fortran Forum 24, no. 2, pp. 4-17 (2005)).
Does not require shared memory (more applicable than OpenMP).
Early compiler work at Rice: http://www.hipersoft.rice.edu/caf/index.html
Simple examples of CAF usage:

REAL, DIMENSION(N)[*] :: X, Y
X           = Y[PE]     ! get from Y[PE]
Y[PE]       = X         ! put into Y[PE]
Y[:]        = X         ! broadcast X
Y[LIST]     = X         ! broadcast X over subset of PEs in array LIST
Z(:)        = Y[:]      ! collect all Y
S = MINVAL(Y[:])        ! min (reduce) all Y
B(1:M)[1:N] = S         ! S scalar, promoted to array of shape (1:M,1:N)

UPC and CAF are examples of partitioned global address space (PGAS) languages, for which a large push is being made by AHPCRC/DARPA as part of the Petascale computing initiative (there is also one based on Java called Titanium).

Libraries
Leveraging Existing Parallel Libraries

One easy way to utilize parallel programming resources is through the use of existing parallel libraries. The most common examples (note the orientation around standard mathematical routines):
BLAS - Basic Linear Algebra Subroutines
LAPACK - Linear Algebra PACKage (uses BLAS)
ScaLAPACK - distributed memory (MPI) solvers for common LAPACK functions
PETSc - Portable, Extensible Toolkit for Scientific Computation
FFTW - Fastest Fourier Transforms in the West
BLAS

The Basic Linear Algebra Subroutines (BLAS) form a standard set of library functions for vector and matrix operations:
Level 1: vector-vector (e.g. xaxpy, xdot, where x = s, d, c, z)
Level 2: matrix-vector (e.g. xgemv, where x = s, d, c, z)
Level 3: matrix-matrix (e.g. xgemm, where x = s, d, c, z)

These routines are generally provided by vendors, hand-tuned at the level of assembly code for optimum performance on a particular processor. Shared memory (multithreaded) and distributed memory versions are available.

www.netlib.org/blas
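A minimal sketch (not from the original slides) of calling a Level 1 BLAS routine through the CBLAS interface; linking against a tuned library (e.g. MKL or ACML) rather than the reference BLAS is what delivers the vendor performance discussed below:

#include <stdio.h>
#include <cblas.h>

int main(void)
{
    double x[4] = { 1.0, 2.0, 3.0, 4.0 };
    double y[4] = { 4.0, 3.0, 2.0, 1.0 };
    /* ddot: double-precision dot product, unit strides */
    double d = cblas_ddot(4, x, 1, y, 1);
    printf("x . y = %f\n", d);   /* 20.0 */
    return 0;
}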
Vendor BLAS

Vendor implementations of the BLAS:

AMD      ACML
Apple    Velocity Engine
Compaq   CXML
Cray     libsci
HP       MLIB
IBM      ESSL
Intel    MKL
NEC      PDLIB/SX
SGI      SCSL
SUN      Sun Performance Library
Brief Overview of Parallel Programming
Libraries
Libraries
Performance Example
LAPACK
DDOT Performance
The Linear Algebra PACKage (LAPACK) is a library that lies on top of
the BLAS (for optimium performance and parallelism) and provides:
U2 Compute Node
5000
cMKL 8.1.1
Ref (-lblas)
4500
4000
solvers for systems of simultaneous linear equations
L1 cache
L2 cache
3500
least-squares solutions
MFlop/s
3000
eigenvalue problems
2500
2000
singular value problems
Main memory
1500
On CCR systems, the Intel MKL is generally preferred (includes
optimized BLAS and LAPACK)
1000
500
0
1
10
10
10
10
Vector Length [8B]
10
10
www.netlib.org/lapack
Performance advantage of Intel MKL ddot versus reference (system)
version. Level 3 BLAS routines have even more significant gains.
ScaLAPACK
The Scalable LAPACK library, ScaLAPACK, is designed for use on the
largest of problems (typically implemented on distributed memory
systems):
subset of LAPACK routines redesigned for distributed memory
MIMD parallel computers
explicit message passing for interprocessor communication
assumes matrices are laid out in a two-dimensional block cyclic
decomposition
On CCR systems, the Intel Cluster MKL is generally preferred (it includes optimized BLAS, LAPACK, and ScaLAPACK).
www.netlib.org/scalapack
Parallel Program Design Considerations
Programming Costs
Consider:
Complexity: parallel codes can be orders of magnitude more complex
(especially those using message passing) - you have to
plan for and deal with multiple instruction/data streams
Portability: parallel codes frequently have long lifetimes (proportional
to the amount of effort invested in them) - all of the serial
application porting issues apply, plus the choice of
parallel API (MPI, OpenMP, and POSIX threads are
currently good portable choices, but implementations of
them can differ from platform to platform)
Resources: overhead for parallel computation can be significant for
smaller calculations
Scalability: limitations in hardware (CPU-memory speed and
contention, for one example) and the parallel algorithms
will limit speedups. All codes will eventually reach a state
of decreasing returns at some point
Examples
Speedup Limitation Example (MD/NAMD)

[Figure: Joint Amber-CHARMM benchmark, NAMD v2.6b1 on U2 - benchmark MD s/step and parallel speedup vs. number of processors (ppn=2), for the ch_p4 and ch_gm interconnects.]

JAC (Joint Amber-CHARMM) benchmark: DHFR protein, 7182 residues, 23558 atoms, 7023 TIP3 waters, PME (9 Å cutoff).

Communication Costs
Communications
All parallel programs will need some communication between tasks, but the amount varies considerably:
minimal communication, also known as "embarrassingly parallel" - think Monte Carlo calculations
robust communication, in which processes must exchange or update information frequently - time evolution codes, domain-decomposed finite difference/element applications

Communication Considerations

Communication needs between parallel processes affect parallel programs in several important ways:
Cost: there is always overhead when communicating (a simple cost model is sketched after this list):
latency - the cost in overhead for a zero-length message (microseconds)
bandwidth - the amount of data that can be sent per unit time (MBytes/second); many applications use many small messages and are latency-bound
Scope: point-to-point communication is always faster than collective communication (which involves a large subset of, or all, processes)
Efficiency: are you using the best available resource (network) for communicating?
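A minimal sketch (not from the original slides) of the usual first-order cost model implied by the latency and bandwidth items above, t(n) = latency + n/bandwidth; the numbers used are purely illustrative, not measured CCR values:

#include <stdio.h>

/* latency in seconds, bandwidth in bytes/s, n in bytes */
static double msg_time(double latency, double bandwidth, double n)
{
    return latency + n / bandwidth;
}

int main(void)
{
    double lat = 2.0e-6, bw = 3.0e9;   /* illustrative interconnect figures */
    for (double n = 8.0; n <= 8.0e6; n *= 1000.0)
        printf("%10.0f B : %.2e s\n", n, msg_time(lat, bw, n));
    return 0;   /* small messages are dominated by latency, large by bandwidth */
}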
Latency Example - MPI/CCR

[Figure: MPI PingPong benchmark performance, U2 cluster interconnects - message time (microseconds) vs. message length (bytes) for ch_mx, ch_gm, ch_p4, and DAPL-QDR-IB.]

Latency on CCR - TCP/IP Ethernet/ch_p4, Myrinet/ch_mx, and QDR InfiniBand.

Bandwidth Example - MPI/CCR

[Figure: MPI PingPong benchmark performance, U2 cluster interconnects - bandwidth (MByte/s) vs. message length (bytes) for ch_mx, ch_gm, ch_p4, and DAPL-QDR-IB.]

Bandwidth on CCR - TCP/IP Ethernet/ch_p4, Myrinet/ch_mx, and QDR InfiniBand.
Collective Example - MPI/CCR

MPI_Alltoall Benchmark

[Figure: Intel MPI, 4 KB buffer, 12-core nodes with QLogic QDR IB, Xeon E5645 - time per MPI_Alltoall call (4 KB buffer) vs. number of MPI processes, for ppn = 1, 2, 4, 6, and 12.]

Cost of MPI_Alltoall for a 4 KB buffer on CCR/QDR IB.

Communication Considerations (cont'd)

Visibility: message-passing codes require the programmer to explicitly manage all communications, but many data-parallel codes do not, masking the underlying data transfers.
Synchronicity: synchronous communications are called blocking, as they require an explicit handshake between parallel processes; asynchronous (non-blocking) communications offer the ability to overlap computation with communication, but place an extra burden on the programmer. Examples include:
Barriers - force collective synchronization; often implied, and always expensive
Locks/Semaphores - typically protect memory locations from simultaneous/conflicting updates to prevent race conditions