Lecture 4: Analytical Modeling of Parallel Programs
Parallel Computing
Fall 2008
Performance Metrics for Parallel Systems
Number of processing elements p
Execution Time
Parallel runtime: the time that elapses from the
moment a parallel computation starts to the moment
the last processing element finishes execution.
Ts: serial runtime
Tp: parallel runtime
Total Parallel Overhead T0
Total time collectively spent by all the processing elements, over and above the time required by the fastest known sequential algorithm for solving the same problem on a single processing element.
T0 = p*Tp - Ts
Performance Metrics for Parallel Systems
Speedup S:
The ratio of the serial runtime of the best sequential algorithm
for solving a problem to the time taken by the parallel
algorithm to solve the same problem on p processing elements.
S = Ts/Tp, where Ts is the runtime of the best known sequential algorithm
Example: adding n numbers: Tp = Θ(log n), Ts = Θ(n), S = Θ(n/log n) (see the short sketch at the end of this slide)
Theoretically, speedup can never exceed the number of processing elements p (S ≤ p).
Proof: Assume a speedup greater than p. Then each processing element spends less than Ts/p time solving the problem, i.e. Tp < Ts/p. In this case, a single processing element could emulate the p processing elements and solve the problem in p*Tp < Ts units of time. This is a contradiction, because speedup, by definition, is computed with respect to the best sequential algorithm.
Superlinear speedup: In practice, a speedup greater than p is sometimes observed. This usually happens when the work performed by the serial algorithm is greater than that performed by its parallel formulation, or when hardware features put the serial implementation at a disadvantage.
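As a rough sanity check of these definitions, here is a minimal Python sketch (assuming the idealized counts Ts = n - 1 additions and Tp = log2 n parallel steps for the adding-n-numbers example on p = n processing elements); it also evaluates the total overhead T0 = p*Tp - Ts defined earlier.

```python
import math

def speedup_metrics(n):
    """Idealized counts for adding n numbers on p = n processing elements."""
    Ts = n - 1                # serial work: n - 1 additions
    Tp = math.log2(n)         # parallel steps (n assumed a power of two)
    p = n
    S = Ts / Tp               # speedup S = Ts / Tp, Theta(n / log n)
    T0 = p * Tp - Ts          # total parallel overhead T0 = p*Tp - Ts
    return S, T0

for n in (16, 1024, 2**20):
    S, T0 = speedup_metrics(n)
    print(f"n = {n:>7}: S = {S:10.1f}, T0 = {T0:14.0f}")
```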
Example for Superlinear Speedup
Superlinear speedup:
Example 1: Superlinear effects from caches. With a problem instance of size A and a 64 KB cache, the cache hit rate is 80%. Assume a cache latency of 2 ns and a DRAM latency of 100 ns; the average memory access time is then 2*0.8 + 100*0.2 = 21.6 ns. If the computation is memory bound and performs one FLOP per memory access, this corresponds to a processing rate of 46.3 MFLOPS. When the problem is split across two processing elements, each works on an instance of size A/2 with its own 64 KB cache, so the cache hit rate is higher, i.e., 90%; 8% of the accesses are served by the local DRAM and the remaining 2% by remote DRAM with a latency of 400 ns, giving an average memory access time of 2*0.9 + 100*0.08 + 400*0.02 = 17.8 ns. The corresponding execution rate at each processor is 56.18 MFLOPS, and for two processors the total processing rate is 112.36 MFLOPS. The speedup is then 112.36/46.3 = 2.43!
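The arithmetic above can be verified with a short Python sketch; the hit rates and latencies are exactly the ones assumed in the example.

```python
def mflops(access_time_ns):
    """One FLOP per memory access -> rate in MFLOPS."""
    return 1e3 / access_time_ns   # 1 op per t ns = 1000/t MFLOPS

# One processor, size A: 80% cache hits (2 ns), 20% local DRAM (100 ns)
t1 = 2 * 0.80 + 100 * 0.20                      # 21.6 ns
# Two processors, size A/2 each: 90% cache, 8% local DRAM, 2% remote DRAM (400 ns)
t2 = 2 * 0.90 + 100 * 0.08 + 400 * 0.02         # 17.8 ns

r1 = mflops(t1)                  # ~46.3 MFLOPS on one processor
r2 = 2 * mflops(t2)              # ~112.36 MFLOPS aggregate on two processors
print(f"t1 = {t1} ns, t2 = {t2} ns, speedup = {r2 / r1:.2f}")   # ~2.43
```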
Example for Superlinear Speedup
Superlinear speedup:
Example 2: Superlinear effects due to exploratory decomposition. Consider exploring the leaf nodes of an unstructured tree. Each leaf has a label associated with it, and the objective is to find a node with a specified label, say S. The solution node is the rightmost leaf in the tree. A serial formulation of this problem based on depth-first traversal explores the entire tree, i.e. all 14 nodes, taking 14 units of time. Now consider a parallel formulation in which the left subtree is explored by processing element 0 and the right subtree by processing element 1. The total work done by the parallel algorithm is only 9 nodes, and the corresponding parallel time is 5 units. The speedup is therefore 14/5 = 2.8.
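A toy Python illustration of the same effect (the tree below is made up for the sketch and is not the 14-node tree from the slide's figure): it counts the nodes a serial depth-first search visits before reaching the target, against the steps needed when two processing elements each search one subtree.

```python
# Hypothetical unstructured tree: each node is (label, list of children).
tree = ('a', [
    ('b', [('c', []), ('d', [('e', []), ('f', [])])]),  # left subtree: no solution
    ('g', [('h', []), ('S', [])]),                       # right subtree holds 'S'
])

def dfs_visits(node, target):
    """Return (nodes visited, found?) for a depth-first search that stops at target."""
    label, children = node
    visited = 1
    if label == target:
        return visited, True
    for child in children:
        v, found = dfs_visits(child, target)
        visited += v
        if found:
            return visited, True
    return visited, False

serial, _ = dfs_visits(tree, 'S')            # one PE, left subtree explored first
left_subtree, right_subtree = tree[1]
work0, _ = dfs_visits(left_subtree, 'S')     # PE 0 exhausts its subtree (no 'S')
work1, _ = dfs_visits(right_subtree, 'S')    # PE 1 stops as soon as it finds 'S'
parallel = work1                             # search ends when PE 1 reports success
print(f"serial: {serial} nodes, parallel: {parallel} steps, "
      f"speedup on 2 PEs: {serial / parallel:.2f}")
```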
Performance Metrics for Parallel Systems (cont.)
Efficiency E
Ratio of speedup to the number of processing elements: E = S/p
A measure of the fraction of time for which a processing element is usefully employed.
Example: adding n numbers on n processing elements: Tp = Θ(log n), Ts = Θ(n), S = Θ(n/log n), E = Θ(1/log n)
Cost (also called Work or processor-time product) W
Product of the parallel runtime and the number of processing elements used: W = Tp*p
Example: adding n numbers on n processing elements: W = Θ(n log n).
Cost-optimal: a parallel system is cost-optimal if the cost of solving a problem on the parallel computer has the same asymptotic growth (in Θ terms), as a function of the input size, as the fastest known sequential algorithm on a single processing element.
Problem Size W2
The number of basic computation steps in the best sequential algorithm to solve the problem on a single processing element.
W2 = Ts of the fastest known algorithm to solve the problem on a sequential computer.
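A small Python sketch, in the same spirit as the speedup sketch earlier, that evaluates efficiency and cost for adding n numbers on p = n processing elements; the printed cost grows as n log n while the serial time grows only as n, which is why this formulation is not cost-optimal.

```python
import math

def adding_n_numbers(n):
    """Idealized metrics for adding n numbers on p = n processing elements."""
    p = n
    Ts = n - 1              # serial work: n - 1 additions
    Tp = math.log2(n)       # parallel time: log n steps
    E = (Ts / Tp) / p       # efficiency E = S/p, Theta(1/log n)
    W = p * Tp              # cost W = p*Tp, Theta(n log n)
    return Ts, W, E

for n in (64, 4096, 2**20):
    Ts, W, E = adding_n_numbers(n)
    print(f"n = {n:>7}: Ts = {Ts:>8}, W = {W:>12.0f}, E = {E:.3f}")
```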
Parallel vs Sequential Computing: Amdahl's Law
Theorem 0.1 (Amdahl's Law) Let f, 0 ≤ f ≤ 1, be the fraction of a computation that is inherently sequential. Then the maximum obtainable speedup S on p processors is S ≤ 1/(f + (1-f)/p).
Proof. Let T be the sequential running time of the computation. fT is the time spent on the inherently sequential part of the program. On p processors the remaining computation, if fully parallelizable, would achieve a running time of at most (1-f)T/p. Thus the running time of the parallel program on p processors is at least the sum of the execution times of the sequential and parallel components, that is, fT + (1-f)T/p. The maximum obtainable speedup is therefore S ≤ T/(fT + (1-f)T/p) = 1/(f + (1-f)/p), and the result is proven.
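A one-function Python sketch of the bound just derived; for f = 0.1 the speedup flattens toward 1/f = 10 as p grows.

```python
def amdahl(f, p):
    """Upper bound on speedup when a fraction f of the work is inherently sequential."""
    return 1.0 / (f + (1.0 - f) / p)

for p in (2, 8, 64, 1024, 10**6):
    print(f"p = {p:>7}: S <= {amdahl(0.1, p):6.2f}")   # approaches 10 as p -> infinity
```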
Amdahl's Law
Amdahl used this observation to advocate building even more powerful sequential machines, since one cannot gain much by using parallel machines. For example, if f = 10%, then S ≤ 10 as p → ∞. The underlying assumption in Amdahl's Law is that the sequential component of a program is a constant fraction of the whole program. In many instances, as the problem size increases, the fraction of the computation that is inherently sequential decreases. In many cases even a speedup of 10 is quite significant by itself.
In addition, Amdahl's Law is based on the premise that parallel computing always tries to minimize the parallel running time. In some cases a parallel computer is used instead to increase the problem size that can be solved in a fixed amount of time. For example, in weather prediction this would increase the accuracy of, say, a three-day forecast, or would allow a more accurate five-day forecast.
Parallel vs Sequential Computing: Gustafson's Law
Theorem 0.2 (Gustafson's Law) Let the execution time of a parallel algorithm consist of a sequential segment fT and a parallel segment (1-f)T, where the sequential segment is constant. The scaled speedup of the algorithm is then
S = (fT + (1-f)Tp)/(fT + (1-f)T) = f + (1-f)p
For f = 0.05 and p = 20, we get S = 19.05, whereas Amdahl's law gives S ≤ 10.26.
[Figure: execution on p processors takes time T = fT + (1-f)T, while the same scaled computation on 1 processor would take fT + (1-f)Tp = T(f + (1-f)p).]
Amdahl's Law assumes that the problem size is fixed when it deals with scalability. Gustafson's Law assumes that the running time is fixed.
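A quick Python check of the numbers quoted above, assuming p = 20 processors:

```python
def gustafson(f, p):
    """Scaled speedup with a constant sequential segment fT."""
    return f + (1.0 - f) * p

def amdahl(f, p):
    """Fixed-problem-size speedup bound, for comparison."""
    return 1.0 / (f + (1.0 - f) / p)

f, p = 0.05, 20
print(f"Gustafson scaled speedup: {gustafson(f, p):.2f}")   # 19.05
print(f"Amdahl bound:             {amdahl(f, p):.2f}")      # 10.26
```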
Brent's Scheduling Principle (Emulation)
Suppose we have an efficient parallel algorithm with unlimited parallelism, i.e. an algorithm that runs on zillions of processors. In practice, zillions of processors may not be available; suppose we have only p processors. A question that arises is: what can we do to run the efficient zillion-processor algorithm on our limited machine?
One answer is emulation: simulate the zillion-processor algorithm on the p-processor machine.
Theorem 0.3 (Brent's Principle) Suppose a parallel algorithm performs m operations in total and runs in parallel time t. Then running this algorithm on a machine with only p processors requires time at most m/p + t.
Proof: Let m_i be the number of computational operations at the i-th step, i.e. m_1 + m_2 + ... + m_t = m. If we assign the p processors to work on the m_i operations of the i-th step, they can complete that step in time ⌈m_i/p⌉ ≤ m_i/p + 1. Thus the total running time on p processors would be
Σ_{i=1}^{t} ⌈m_i/p⌉ ≤ Σ_{i=1}^{t} (m_i/p + 1) = m/p + t.
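The emulation argument can be sketched in a few lines of Python (the per-step operation counts m_i here are made up at random): the emulated time, the sum of ⌈m_i/p⌉ over all steps, never exceeds the bound m/p + t.

```python
import math
import random

def emulate(step_ops, p):
    """Time to emulate a schedule with per-step operation counts step_ops on p processors."""
    return sum(math.ceil(m_i / p) for m_i in step_ops)

random.seed(0)
step_ops = [random.randint(1, 500) for _ in range(40)]   # m_i for t = 40 parallel steps
m, t, p = sum(step_ops), len(step_ops), 8

emulated = emulate(step_ops, p)
print(f"emulated time = {emulated}, Brent bound m/p + t = {m / p + t:.1f}")
assert emulated <= m / p + t
```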
End
Thank you!