Introduction to Parallel and
High-Performance Computing
(with Machine-Learning applications)
Anshul Gupta, IBM Research
Prabhanjan (Anju) Kambadur, Bloomberg L.P.
Introduction to Parallel and High-Performance
Computing (with Machine-Learning applications)
Part 1 Part 2
Parallel computing Parallel algorithms for
basics and parallel Building a Classifier
algorithm analysis
Part 1: Parallel and High-Performance Computing
• Why parallel computing?
Part 1: Parallel and High-Performance Computing
• Why parallel computing?
• Parallel computing platforms
Part 1: Parallel and High-Performance Computing
• Why parallel computing?
• Parallel computing platforms
• Parallel algorithm basics
Part 1: Parallel and High-Performance Computing
• Why parallel computing?
• Parallel computing platforms
• Parallel algorithm basics
• Decomposition for parallelism
Part 1: Parallel and High-Performance Computing
• Why parallel computing?
• Parallel computing platforms
• Parallel algorithm basics
• Decomposition for parallelism
• Parallel programming models
Part 1: Parallel and High-Performance Computing
• Why parallel computing?
• Parallel computing platforms
• Parallel algorithm basics
• Decomposition for parallelism
• Parallel programming models
• Parallel algorithm analysis
Why parallel computing?
Microprocessor
trends: 1972--2015
Kirk M. Bresniker, Sharad
Singhal, R. Stanley Williams,
"Adapting to Thrive in a New
Economy of Memory
Abundance",Computer, vol.48,
no. 12, pp. 44-53, Dec. 2015,
doi:10.1109/MC.2015.368
Parallel computing platforms
Highest level of
parallelism:
• Compute nodes on an
interconnection network
• Possibly, thousands of nodes
• Distributed memory
• Distributed or shared address
space
• Scalability analysis crucial
Parallel computing platforms (cont.)
A generalized compute node
• (Possibly) multiple CPUs
• (Possibly) multiple GPUs
Parallel computing platforms (cont.)
Parallel computing platforms (cont.)
Shared memory on a node, but
Non-Uniform Memory Access
(NUMA).
Parallel computing platforms (cont.)
Nvidia GeForce GTX 680 (Kepler)
• 4 GPCs (graphics processing
clusters)
• 8 SMXs (streaming
multiprocessors)
• 192 X 8 = 1536 CUDA cores
Parallel hardware hierarchy
Node
ensemble
Nodes
GPUs
CPUs GPCs
SMXs
Cores
CUDA
cores
Parallel program hierarchy
Process
groups
Processes
Thread
pools
Threads
GPU thread
grids
GPU thread
blocks
GPU
threads
Parallel programming paradigms
Distributed Memory Shared Memory Accelerator
Parallel programming paradigms
Distributed Memory
• Multiple processes
• Distributed address
space
• Explicit data
movement
• Locality!
Parallel programming paradigms
Distributed Memory Shared Memory
• Multiple processes • Multiple threads
• Distributed address • Shared address space
space
• Implicit data
• Explicit data movement
movement • Locality!
• Locality!
Parallel programming paradigms
Distributed Memory Shared Memory Accelerator
• Multiple processes • Multiple threads • Host memory PCIe
• Distributed address • Shared address space Device memory
space • Explicit data
• Implicit data
movement
• Explicit data movement
movement • Locality! • Locality!
• Locality!
Algorithm design
• Algorithm design is critical to devising a computing
solution
Algorithm design
• Algorithm design is critical to devising a computation
solution
• Serial algorithm is a recipe or sequence of basic steps or
operations
Algorithm design
• Algorithm design is critical to devising a computation
solution
• Serial algorithm is a recipe or sequence of basic steps or
operations
• Parallel algorithm is a recipe for solving the given
problem using an ensemble of hardware resources
Algorithm design
• Algorithm design is critical to devising a computation
solution
• Serial algorithm is a recipe or sequence of basic steps or
operations
• Parallel algorithm is a recipe for solving the given
problem using an ensemble of hardware resources
• Specifying parallel algorithm involves a lot more than
specifying a sequence of basic steps
Algorithm design
• Algorithm design is critical to devising a computation
solution
• Serial algorithm is a recipe or sequence of basic steps or
operations
• Parallel algorithm is a recipe for solving the given
problem using an ensemble of hardware resources
• Specifying parallel algorithm involves a lot more than
specifying a sequence of basic steps
Parallel algorithm design steps
• Identifying portions of work that can be performed
concurrently
Parallel algorithm design steps
• Identifying portions of work that can be performed
concurrently
• Mapping concurrent pieces of work onto computing agents
running in parallel
Parallel algorithm design steps
• Identifying portions of work that can be performed
concurrently
• Mapping concurrent pieces of work onto computing agents
running in parallel
• Making the input, output, and intermediate data available to
the right computing agent at the right time
Parallel algorithm design steps
• Identifying portions of work that can be performed
concurrently
• Mapping concurrent pieces of work onto computing agents
running in parallel
• Making the input, output, and intermediate data available to
the right computing agent at the right time
• Managing simultaneous requests for shared data
Parallel algorithm design steps
• Identifying portions of work that can be performed
concurrently
• Mapping concurrent pieces of work onto computing agents
running in parallel
• Making the input, output, and intermediate data available to
the right computing agent at the right time
• Managing simultaneous requests for shared data
• Synchronizing computing agents for correct program execution
Parallel algorithm design steps
• Identifying portions of work that can be performed concurrently
- Decomposition
• Mapping concurrent pieces of work onto computing agents
running in parallel
• Making the input, output, and intermediate data available to the
right computing agent at the right time – Data Dependencies
• Managing simultaneous requests for shared data
• Synchronizing computing agents for correct program execution –
Task Dependencies
Decomposition for concurrency
Task Decomposition Data Decomposition
Decomposition for concurrency
Task Decomposition
• Concurrent tasks are identified
and mapped onto to threads or
processes
Decomposition for concurrency
Task Decomposition
• Concurrent tasks are identified
and mapped onto to threads or
processes
• Tasks share or exchange data
as needed
Decomposition for concurrency
Task Decomposition
• Concurrent tasks are identified
and mapped onto to threads or
processes
• Tasks share or exchange data
as needed
• May be static or dynamic
Decomposition for concurrency
Task Decomposition Data Decomposition
• Concurrent tasks are identified • Data is partitioned (input,
and mapped onto to threads or output, or intermediate)
processes
• Tasks share or exchange data
as needed
• May be static or dynamic
Decomposition for concurrency
Task Decomposition Data Decomposition
• Concurrent tasks are identified • Data is partitioned
and mapped onto to threads or • Partitions are assigned to
processes computing agents
• Tasks share or exchange data • “Owner computes” rule
as needed
• May be static or dynamic
Decomposition for concurrency
Task Decomposition Data Decomposition
• Concurrent tasks are identified • Data is partitioned
and mapped onto to threads or • Partitions are assigned to
processes computing agents
• Tasks share or exchange data • “Owner computes” rule
as needed
• Usually static
• May be static or dynamic
Decomposition for concurrency
Data + Task = Hybrid
Decomposition Decomposition Decompositions
(Example: sparse
matrix factorization)
Task decomposition example
Chess Program
R1 K1 B1 Q P1 P2
...
• Each task evaluates all moves of a single piece (branch-and-bound)
• Small data (board position) can be replicated
• Dynamic load balancing required
Data decomposition example
Dense Matrix-Vector Multiplication
P1
P2
P3
P4
P5
P6 . =
P7
P8
P9
Parallel application design guidelines
• Focus on one level of hierarchy at a time – from top to
bottom.
Parallel application design guidelines
• Focus on one level of hierarchy at a time – from top to
bottom.
• Devise the best decomposition strategy at the given level
Parallel application design guidelines
• Focus on one level of hierarchy at a time – from top to
bottom.
• Devise the best decomposition strategy at the given level
• Computing agents are likely to be parallel themselves
Parallel application design guidelines
• Focus on one level of hierarchy at a time – from top to
bottom.
• Devise the best decomposition strategy at the given level
• Computing agents are likely to be parallel themselves
• Minimize interactions, synchronization, and data
movement among computing agents
Parallel application design guidelines
• Focus on one level of hierarchy at a time – from top to
bottom.
• Devise the best decomposition strategy at the given level
• Computing agents are likely to be parallel themselves
• Minimize interactions, synchronization, and data
movement among computing agents
• Minimize load imbalance and idling among computing
agents
Parallel application design guidelines
• Focus on one level of hierarchy at a time – from top to
bottom.
• Devise the best decomposition strategy at the given level
• Computing agents are likely to be parallel themselves
• Minimize interactions, synchronization, and data
movement among computing agents
• Minimize load imbalance and idling among computing
agents
Parallel algorithm analysis
• Serial run time, TS: time required by best known method on a
single computing agent
Parallel algorithm analysis
• Serial run time, TS: time required by best known method on a
single computing agent
• Problem size, W = total amount of work: TS = kW
Parallel algorithm analysis
• Serial run time, TS: time required by best known method on a
single computing agent
• Problem size, W = total amount of work: TS = kW
• Parallel run time, TP: time elapsed between start of computation
until the last of the p computing agents finishes
Parallel algorithm analysis
• Serial run time, TS: time required by best known method on a
single computing agent
• Problem size, W = total amount of work: TS = kW
• Parallel run time, TP: time elapsed between start of computation
until the last of the p computing agents finishes
• Overhead, sum of all wasted compute resources: TO = pTP – TS
Parallel algorithm analysis
• Serial run time, TS: time required by best known method on a
single computing agent
• Problem size, W = total amount of work: TS = kW
• Parallel run time, TP: time elapsed between start of computation
until the last of the p computing agents finishes
• Overhead, sum of all wasted compute resources: TO = pTP – TS
• Speedup, ratio of serial to parallel time: S = TS/TP = pTS/(TS+TO)
Parallel algorithm analysis
• Serial run time, TS: time required by best known method on a
single computing agent
• Problem size, W = total amount of work: TS = kW
• Parallel run time, TP: time elapsed between start of computation
until the last of the p computing agents finishes
• Overhead, sum of all wasted compute resources: TO = pTP – TS
• Speedup, ratio of serial to parallel time: S = TS/TP = pTS/(TS+TO)
• Efficiency, fraction of overall time spent doing useful work:
E = S/p = TS/pTP = TS/(TS+TO)
Parallel algorithm analysis
• Serial run time, TS: time required by best known method on a
single computing agent
• Problem size, W = total amount of work: TS = kW
• Parallel run time, TP: time elapsed between start of computation
until the last of the p computing agents finishes
• Overhead, sum of all wasted compute resources: TO = pTP – TS
• Speedup, ratio of serial to parallel time: S = TS/TP = pTS/(TS+TO)
• Efficiency, fraction of overall time spent doing useful work:
E = S/p = TS/pTP = TS/(TS+TO)
Parallel algorithm analysis
Parallel algorithm analysis
Isoefficiency function
• Function fE(p) of the number
of computing agents p by
which the problem size W
must grow in order to maintain
a given efficiency E.
Isoefficiency function
• Function fE(p) of the number
of computing agents p by
which the problem size W
must grow with p in order to
maintain a given efficiency E.
• Captures the effect of
communication, load-
imbalance, contention, serial-
bottlenecks, etc.
Isoefficiency function
Scalability analysis
E = S/p = TS/pTP
Scalability analysis
E = S/p = TS/pTP
Since TO = pTP – TS or pTP = TS +TO
Scalability analysis
E = S/p = TS/pTP
Since TO = pTP – TS or pTP = TS +TO
Therefore, E = TS/(TS +TO), or E = kW/(kW+TO), because TS = kW
Scalability analysis
E = S/p = TS/pTP
Since TO = pTP – TS or pTP = TS +TO
Therefore, E = TS/(TS +TO), or E = kW/(kW+TO), because TS = kW
W = TO . E/k(1-E)
Scalability analysis
E = S/p = TS/pTP
Since TO = pTP – TS or pTP = TS +TO
Therefore, E = TS/(TS +TO), or E = kW/(kW+TO), because TS = kW
W = TO . E/k(1-E)
W ~ TO
Scalability analysis: W = O(n3)
Algorithm A Algorithm B
TP = O(n3/p) + O(n2/√p) TP = O(n3/p) + O(n√n)
Scalability analysis: W = O(n3)
Algorithm A Algorithm B
TP = O(n3/p) + O(n2/√p) TP = O(n3/p) + O(n√n)
W = O(n3) => n3 ~ n2√p W = O(n3) => n3 ~ n1.5p
Scalability analysis: W = O(n3)
Algorithm A Algorithm B
TP = O(n3/p) + O(n2/√p) TP = O(n3/p) + O(n√n)
W = O(n3) => n3 ~ n2√p W = O(n3) => n3 ~ n1.5p
n ~ √p n1.5 ~ p
Scalability analysis: W = O(n3)
Algorithm A Algorithm B
TP = O(n3/p) + O(n2/√p) TP = O(n3/p) + O(n√n)
W = O(n3) => n3 ~ n2√p W = O(n3) => n3 ~ n1.5p
n ~ √p n1.5 ~ p
W = O(n3) = O(p1.5) W = O(n3) = O(p2)
Parallel algorithm design and analysis
Dense Matrix-Vector Multiplication (1-D decomposition)
P1
P2
P3
P4
P5 A x y
P6 . =
P7
P8
P9
Parallel algorithm design and analysis
Dense Matrix-Vector Multiplication (2-D decomposition)
P1 P2 P3 P1 P1
P5 P6
A . x = y
P7 P8 P9
Parallel matrix-vector multiplication:
1-D decomposition
Parallel matrix-vector multiplication:
2-D decomposition
Scalability analysis of matrix-vector
multiplication (1-D decomposition)
TP = n2/p + tslog(p) + twn
Scalability analysis of matrix-vector
multiplication (1-D decomposition)
TP = n2/p + tslog(p) + twn
TO = tsplog(p) + twpn
Scalability analysis of matrix-vector
multiplication (1-D decomposition)
TP = n2/p + tslog(p) + twn
TO = tsplog(p) + twpn
W = O(n2)
Scalability analysis of matrix-vector
multiplication (1-D decomposition)
TP = n2/p + tslog(p) + twn
TO = tsplog(p) + twpn
W = O(n2)
1: n2 ~ plog(p)
Scalability analysis of matrix-vector
multiplication (1-D decomposition)
TP = n2/p + tslog(p) + twn
TO = tsplog(p) + twpn
W = O(n2)
1: n2 ~ plog(p)
2: n2 ~ pn, or n ~ p, or W = O(n2) = O(p2)
Scalability analysis of matrix-vector
multiplication (1-D decomposition)
TP = n2/p + tslog(p) + twn
TO = tsplog(p) + twpn
W = O(n2)
1: n2 ~ plog(p)
2: n2 ~ pn, or n ~ p, or W = O(n2) = O(p2)
Scalability analysis of matrix-vector
multiplication (2-D decomposition)
TP = n2/p + tslog(p) + tw(n/√p)log(p)
Scalability analysis of matrix-vector
multiplication (2-D decomposition)
TP = n2/p + tslog(p) + tw(n/√p)log(p)
TO = tsplog(p) + twn√plog(p)
Scalability analysis of matrix-vector
multiplication (2-D decomposition)
TP = n2/p + tslog(p) + tw(n/√p)log(p)
TO = tsplog(p) + twn√plog(p)
W = O(n2)
Scalability analysis of matrix-vector
multiplication (2-D decomposition)
TP = n2/p + tslog(p) + tw(n/√p)log(p)
TO = tsplog(p) + twn√plog(p)
W = O(n2)
1: n2 ~ plog(p)
Scalability analysis of matrix-vector
multiplication (2-D decomposition)
TP = n2/p + tslog(p) + tw(n/√p)log(p)
TO = tsplog(p) + twn√plog(p)
W = O(n2)
1: n2 ~ plog(p)
2: n2 ~ n√plog(p), or n ~ √plog(p),
or W = O(n2) = O(plog2p)
Scalability analysis of matrix-vector
multiplication (2-D decomposition)
TP = n2/p + tslog(p) + tw(n/√p)log(p)
TO = tsplog(p) + twn√plog(p)
W = O(n2)
1: n2 ~ plog(p)
2: n2 ~ n√plog(p), or n ~ √plog(p),
or W = O(n2) = O(plog2p)
Isoefficiency function of dense sparse matrix-
vector multiplication
1-D decomposition 2-D decomposition
W ~ p2 W ~ plog2p
2-D decomposition is likely to yield higher speedups, require smaller
problems to deliver the speedups, and scale more readily to larger number
of computing agents.
Concluding remarks
• Parallelism necessary for continued performance
improvement.
Concluding remarks
• Parallelism necessary for continued performance
improvement.
• Complex hierarchy of parallel computing
hardware and programming paradigms
Concluding remarks
• Parallelism necessary for continued performance
improvement.
• Complex hierarchy of parallel computing
hardware and programming paradigms
• Systematic top down parallel application design
Concluding remarks
• Parallelism necessary for continued performance
improvement.
• Complex hierarchy of parallel computing
hardware and programming paradigms
• Systematic top down parallel application design
• Decomposition strategy is critical
Concluding remarks
• Parallelism necessary for continued performance
improvement.
• Complex hierarchy of parallel computing
hardware and programming paradigms
• Systematic top down parallel application design
• Decomposition strategy is critical
• Analysis important to understand scalability
Thank you!