Parallel Programming
Parallel Computers (2)
Parallel Programming - Lecture 2
Potential for Increased Computational Speed
Speedup Factor
What is the Maximum Speedup?
Speedup Factor
$$S(p) = \frac{\text{Execution time using one processor (best sequential algorithm)}}{\text{Execution time using a multiprocessor with } p \text{ processors}} = \frac{t_s}{t_p}$$
where:
$t_s$ is the execution time of the best sequential algorithm running on a single processor;
$t_p$ is the execution time for solving the same problem on a multiprocessor.
S(p) gives the increase in speed obtained by using the multiprocessor.
The underlying algorithm used for the parallel implementation might be (and usually is) different from that of the sequential version.
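As a purely illustrative worked example (the numbers are assumed, not from the lecture): if the best sequential program takes $t_s = 20$ s and the parallel version takes $t_p = 5$ s on 8 processors, then
$$S(p) = \frac{20}{5} = 4,$$
a fourfold speedup, which is well below the processor count of 8.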
Speedup Factor
In theoretical analysis, the speedup factor can also be cast in terms of computational steps:
$$S(p) = \frac{\text{Number of computational steps using one processor}}{\text{Number of parallel computational steps with } p \text{ processors}}$$
Can also extend time complexity to parallel
computations.
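As a standard textbook illustration of the step-count form (not on this slide, and assuming an idealized machine): adding $n$ numbers takes $n - 1$ steps on one processor, while a tree reduction using $n/2$ processors needs about $\log_2 n$ parallel steps, giving
$$S(p) \approx \frac{n-1}{\log_2 n}.$$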
Speedup Factor
The maximum speedup is usually p with p
processors (linear speedup).
$$S(p) \le \frac{t_s}{t_s/p} = p$$
It is possible to get superlinear speedup ($S(p) > p$), for example because the original sequential algorithm was not optimal.
One common reason for superlinear speedup is the extra memory available in a multiprocessor system.
What is the Maximum Speedup?
Several factors will appear as overhead in the
parallel version and limit the speedup:
1. Periods when not all the processors can be
performing useful work and are simply idle.
2. Extra computations in the parallel version not
appearing in the sequential version, for
example, to recompute the constants locally.
3. Communication time between processes.
Maximum Speedup
Maximum Speedup: Amdahl’s law
If f is the fraction of the computation that cannot be parallelized (the inherently serial part), the speedup factor is given by:
$$S(p) = \frac{t_s}{f\,t_s + (1-f)\,t_s/p} = \frac{p}{1 + (p-1)f}$$
This equation is known as Amdahl's law.
Speedup against number of processors
Even with an infinite number of processors, the maximum speedup is limited to 1/f:
$$\lim_{p \to \infty} S(p) = \frac{1}{f}$$
Example: with only 5% of the computation being serial, the maximum speedup is 20, irrespective of the number of processors.
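As a rough illustration (a minimal C sketch using the formula above, not part of the lecture), the following program evaluates Amdahl's law for an assumed serial fraction f = 0.05 and shows the speedup levelling off near 1/f = 20:

```c
#include <stdio.h>

/* Predicted speedup by Amdahl's law for serial fraction f and p processors. */
double amdahl_speedup(double f, int p) {
    return (double)p / (1.0 + (p - 1) * f);
}

int main(void) {
    const double f = 0.05;                      /* assume 5% of the work is serial */
    const int procs[] = {4, 16, 64, 256, 1024};
    for (int i = 0; i < 5; i++)
        printf("p = %4d   S(p) = %6.2f\n", procs[i], amdahl_speedup(f, procs[i]));
    /* The printed speedups approach 1/f = 20, however many processors are used. */
    return 0;
}
```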
Speedup against number of processors
Types of Parallel Computers
There are two basic types of parallel computers:
1. Shared memory multiprocessor
2. Distributed memory multicomputer
Conventional Computer
A conventional computer consists of a processor
executing a program stored in a (main) memory:
Each main memory location is identified by its address.
Addresses start at 0 and extend to $2^b - 1$ when there are b bits (binary digits) in the address.
Shared Memory Multiprocessor System
A natural way to extend the single-processor model is to have multiple processors connected to multiple memory modules, such that each processor can access any memory module, in a so-called shared memory configuration.
Shared Memory Multiprocessor System
The connection between the processors and
memory is through some form of
interconnection network.
A shared memory multiprocessor system employs a single address space, which means that each location in the whole main memory system has a unique address that each processor uses to access the location.
Programming Shared Memory Multiprocessors
Programming a shared memory multiprocessor
involves having executable code stored in the
shared memory for each processor to execute.
The data for each program will also be stored in
the shared memory, and each program could
access all the data if needed.
Programming Shared Memory Multiprocessors
One way to produce the executable code for each processor is to use a high-level parallel programming language that has special parallel programming constructs and statements for declaring shared variables and parallel code sections.
Programming Shared Memory Multiprocessors
The programming approaches can be divided into:
1. A regular sequential programming language with preprocessor directives to specify the parallelism. Example: OpenMP, an industry-standard set of compiler directives and constructs added to C/C++ and Fortran (a minimal sketch follows this list).
2. Threads, in which regular high-level language code sequences are written for the individual processors.
3. A regular sequential programming language with modified syntax to specify the parallelism. Example: UPC (Unified Parallel C).
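As an illustration of approach 1 only (a minimal sketch, not taken from the lecture; the array, its size, and the reduction are invented for the example), the following C fragment uses OpenMP directives to mark a shared array and a parallel code section. It can be built with an OpenMP-capable compiler, e.g. gcc -fopenmp:

```c
#include <stdio.h>
#include <omp.h>

#define N 1000

int main(void) {
    static double a[N];   /* resides in shared memory: visible to every thread */
    double sum = 0.0;

    /* Parallel code section: the loop iterations are divided among the threads.
       'a' and 'sum' are shared; the loop index is private to each thread. */
    #pragma omp parallel for shared(a) reduction(+:sum)
    for (int i = 0; i < N; i++) {
        a[i] = (double)i;
        sum += a[i];
    }

    printf("sum = %f (computed by up to %d threads)\n", sum, omp_get_max_threads());
    return 0;
}
```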
Shared Memory Multiprocessor System
Two-processor shared memory systems are particularly cost-effective.
However, with a large number of processors it is very difficult to implement the hardware so that all the processors have fast access to all the shared memory.
Most large shared memory systems have some form of
hierarchical or distributed memory structure. Then,
processors can physically access nearby memory
locations much faster than more distant memory
locations.
The term nonuniform memory access (NUMA) is used in
these cases, as opposed to uniform memory access
(UMA).
Message-Passing Multicomputer
Complete computers connected through an
interconnection network:
Message-Passing Multicomputer
Each computer consists of a processor and local memory, but this memory is not accessible by other processors.
The interconnection network provides for processors to send messages to other processors.
The messages carry data from one processor to another as dictated by the program.
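As an illustrative sketch only (MPI is a common message-passing library for such machines, though it is not introduced on this slide; the value sent is invented for the example), the following C program sends one integer from the local memory of process 0 to process 1 over the interconnection network. It must be launched with at least two processes, e.g. via mpirun:

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;   /* data held in process 0's local memory */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* The message arrives through the interconnection network. */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("Process 1 received %d from process 0\n", value);
    }

    MPI_Finalize();
    return 0;
}
```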
Networks for Multicomputers
The purpose of the interconnection network is to
provide a physical path for messages sent from
one computer to another computer.
Key issues in network design are:
1. Bandwidth
2. Latency
3. Cost
Key issues in network design
The bandwidth is the number of bits that can be
transmitted in unit time, given as bits/sec.
The network latency is the time to make a
message transfer through the network.
The communication latency is the total time to
send the message including the software
overhead and interface delays.
Message latency, or startup time, is the time to send a zero-length message (finding the route, packing, and unpacking).
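As a rough worked example with assumed numbers (and a simple model in which the communication time is the startup time plus the message length divided by the bandwidth): sending a 1 Mbit message over a link with a 50 μs startup time and a bandwidth of 1 Gbit/s takes approximately
$$t_{comm} \approx 50\ \mu\text{s} + \frac{10^6\ \text{bits}}{10^9\ \text{bits/s}} = 50\ \mu\text{s} + 1000\ \mu\text{s} \approx 1.05\ \text{ms},$$
so the bandwidth dominates for long messages, while the startup time dominates for short ones.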
Key issues in network design
The diameter is the minimum number of links
between the two farthest nodes (computers) in
the network.
The bisection width of a network is the
minimum number of links that must be cut to
divide the network into two equal parts.
Multicomputer System
There are several ways one could interconnect
computers to form a multicomputer system:
1. Connecting every computer to every other computer with links.
   1. With c computers there are c(c − 1)/2 links in all (for example, 16 computers would need 120 links).
   2. Feasible only for very small systems.
   3. As the size increases, the number of interconnections becomes impractical for economic and engineering reasons.
2. Using a network with restricted direct interconnections. Two such networks are:
   1. The mesh network
   2. The hypercube network.
Mesh Network
A two dimensional mesh can be created by having
each node in a two dimensional array connect to
its four nearest neighbors.
[Figure: a two-dimensional mesh, with each computer/processor connected to its neighbors by links.]
Mesh Network
The diameter of a $\sqrt{p} \times \sqrt{p}$ mesh is $2(\sqrt{p} - 1)$, since reaching one corner from the opposite corner requires a path across $\sqrt{p} - 1$ links and down $\sqrt{p} - 1$ links.
Torus: the free ends of the mesh might circulate back to the opposite sides. This network is called a torus.
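As a concrete illustrative case (numbers assumed, not from the slide): $p = 64$ processors arranged as an $8 \times 8$ mesh give a diameter of
$$2(\sqrt{64} - 1) = 14 \text{ links},$$
and a bisection width of 8, since cutting the 8 links between the two middle columns splits the network into two equal halves.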
Mesh Network
Meshes are particularly convenient for many
scientific and engineering problems in which
solution points are arranged in two-dimensional or
three-dimensional arrays.
The Intel Touchstone Delta computer was designed with a two-dimensional mesh.
The J-machine, a research prototype, was designed with a three-dimensional mesh.