Isoefficiency Function: A Scalability Metric for Parallel Algorithms and Architectures
This paper provides a tutorial introduction to a performance evaluation metric called the isoefficiency function. Traditional methods for evaluating serial algorithms are inadequate for analyzing the performance of parallel algorithm-architecture combinations. The isoefficiency function has proven useful for evaluating the performance of a wide variety of such combinations.
On a sequential computer, the fastest algorithm for solving a given problem is the best algorithm. However, the performance of a parallel algorithm for a specific problem instance on a given number of processors provides only limited information. The time taken by a parallel algorithm to solve a problem instance depends on the problem size, the number of processors used to solve the problem, and machine characteristics such as processor speed, speed of communication channels, type of interconnection network, and routing techniques. An algorithm that yields good performance for a selected problem on a fixed number of processors on a given machine may perform poorly if any of these parameters are changed. Hence, the evaluation of a parallel algorithm on a parallel computer requires a more comprehensive analysis, and the study of scalability aids us in this analysis. The scalability of a parallel system is a measure of its capacity to deliver linearly increasing speedup with respect to the number of processors used. It reflects the capacity of the parallel system to effectively utilize an increasing amount of processing resources.
Scalability analysis of an architecture-algorithm combination helps us in answering the following questions: How does increasing the number of processors affect the performance of an algorithm? How does changing the problem size affect performance? How does changing the computation speed of the processors, or the communication speed of the interconnection network, affect the performance of a parallel computer for a given algorithm? What is the best algorithm and architecture for solving a problem as the problem size and number of processors change?
A number of performance evaluation metrics have been developed to study the scalability of parallel algorithms and architectures [3, 4, 10, 11, 12]. Kumar and Gupta [7] provide a comprehensive survey of different methods of scalability analysis. The isoefficiency function is one such metric. It relates the size of the problem being solved to the number of processors required to maintain the efficiency at a fixed value. An important feature of the isoefficiency function is that it succinctly captures, in a single expression, the effects of characteristics of the parallel algorithm as well as of the parallel architecture on which it is implemented. The isoefficiency function enables us to determine the degree of scalability of a parallel system with respect to the number of processors, the speed of processors, and the communication bandwidth of the interconnection network. Thus we can compare various combinations of parallel algorithms and architectures for a range of problem sizes and numbers of processors. Hence, using isoefficiency, we can determine the best parallel algorithm-architecture combination for a problem under a variety of situations without having to explicitly analyze all possible combinations under all possible conditions. The following sections present a discussion of the terminology necessary to use this metric, followed by a number of useful examples of isoefficiency analysis.
The efficiency (E) of a parallel system is defined as the ratio of the speedup (S) obtained to the number of processors (p) used. Therefore,

    E = S/p = T1/(T1 + To) = 1/(1 + To/T1)                                (2)
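To make these definitions concrete, here is a minimal Python sketch (an illustration added to this presentation, not part of the original analysis) that evaluates the speedup and efficiency of a parallel system from T1, To, and p, using the relation p TP = T1 + To:

    def parallel_time(t1, to, p):
        """Parallel execution time T_P, using p * T_P = T_1 + T_o."""
        return (t1 + to) / p

    def speedup(t1, to, p):
        """Speedup S = T_1 / T_P."""
        return t1 / parallel_time(t1, to, p)

    def efficiency(t1, to):
        """Efficiency E = S / p = 1 / (1 + T_o / T_1), as in Equation 2."""
        return 1.0 / (1.0 + to / t1)

    # Hypothetical values: T_1 = 1000 time units, T_o = 250 time units, p = 8.
    print(speedup(1000.0, 250.0, 8))   # 6.4
    print(efficiency(1000.0, 250.0))   # 0.8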
For certain parallel architecture-algorithm combinations, To can be negative. This implies that the speedup obtained from p processors may exceed p. This phenomenon is called superlinear speedup. An example of a parallel system capable of exhibiting such behavior is one in which memory is hierarchical and the access time increases (in discrete steps) with the memory used by the program. In this case, the effective computation speed of a large program may be slower on a serial processor than on a parallel computer employing similar processors. This is because a sequential algorithm using M bytes of memory will use only M/p bytes on each processor of a p-processor parallel computer. In this case, cache and virtual memory effects may reduce the effective computation rate of a serial processor. In order to simplify the presentation, we assume that To is non-negative in the rest of the paper.
During the course of analyzing parallel systems, we frequently encounter the notion of the size of the problem being solved. So far we have used this term informally without providing a rigorous definition. One way of expressing the problem size is to represent it by a parameter of the input size. For example, for any matrix problem involving n × n matrices the problem size could be denoted by n. A drawback of this definition is that the interpretation of problem size changes from one problem to another. For example, doubling the input size results in an 8-fold increase in the serial execution time for matrix multiplication and a 4-fold increase for matrix addition. A better definition of the size or the magnitude of the problem would be such that, irrespective of the problem, doubling the problem size always means performing twice the amount of computation. Therefore, we choose to express problem size in terms of the total number of basic operations inherent in the problem. According to this notion, the problem size for n × n matrix multiplication is Θ(n^3), while that for the addition of two n × n matrices is Θ(n^2). In order to keep the problem size unique for a given problem, we define it as the number of operations the best sequential algorithm executes in order to solve the problem on a single processor. For some problems, the best algorithm is not known; for others, the asymptotically best algorithm may have worse performance than other algorithms for problem instances of interest. In these cases, we can use the number of operations in the serial algorithm that is considered best for such problem instances. For example, for matrix multiplication, the simple Θ(n^3) algorithm is often the algorithm of choice, even though Strassen's algorithm has a better asymptotic complexity. The problem size is a function of the size of the input. We will use the symbol W to denote problem size. If the cost of executing each operation is tc, then T1 = W tc.
Example 1: Adding n numbers on a p-processor hypercube
Consider the problem of adding n numbers. For this problem the number of operations, and hence the problem size W, is equal to n. If we assume that each addition takes tc time, then T1 is equal to n tc.¹ Now consider a parallel algorithm for adding n numbers using a p-processor hypercube. This algorithm is shown in Figure 1 for n = 16 and p = 4. Each processor is allocated n/p numbers. In the first step of this algorithm, each processor locally adds its n/p numbers in Θ(n/p) time. The problem is now reduced to adding the p partial sums on p processors. This can be done by propagating and adding the partial sums as shown in Figure 1. A single step consists of one addition and one nearest-neighbor communication of a single word, each of which is a constant-time operation. For the sake of simplicity, let us assume that it takes one unit of time to add two numbers and also to communicate a number between two processors. Therefore, n/p time is spent in adding the n/p local numbers at each processor. After the local addition, the p partial sums are added in log p steps, each step consisting of one addition and one communication.
¹ In reality the number of basic operations, W, is n − 1 and the sequential execution time T1 is given by (n − 1) tc.
[Figure 1: Adding n = 16 numbers on a p = 4 processor hypercube. Panels (a)-(d) show the local sums computed at each processor followed by the log p steps in which partial sums are communicated and added.]
Thus, the total parallel execution time TP is n/p + 2 log p. The same task can be accomplished sequentially in n time units. Thus, out of the n/p + 2 log p time units that each processor spends in parallel execution, n/p time is spent in performing useful work. The remaining 2 log p units of time per processor contribute to a total overhead of

    To = 2p log p                                                         (3)

The speedup S and efficiency E for this algorithm are given by

    S = T1/TP = n/(n/p + 2 log p)                                         (4)

    E = S/p = n/(n + 2p log p)                                            (5)
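As a quick numerical check (a sketch, not from the original text), the following Python fragment evaluates Equations 3 through 5 with unit operation cost and reproduces the efficiencies reported in Table 1 below:

    import math

    def add_on_hypercube(n, p):
        """T_P, S, and E for adding n numbers on a p-processor hypercube (unit time per step)."""
        tp = n / p + 2 * math.log2(p)   # parallel time: local additions plus log p combine steps
        s = n / tp                      # Equation 4
        e = s / p                       # Equation 5
        return tp, s, e

    for n in (64, 192, 320, 512):
        print(n, [round(add_on_hypercube(n, p)[2], 2) for p in (1, 4, 8, 16, 32)])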
In the next section we formalize the notion of scalability of a parallel system.
Figure 2: Speedup vs. number of processors for adding a list of numbers on a hypercube, for n = 64, 192, 320, and 512, compared with linear speedup.
    p =       1      4      8      16     32
    n = 64    1.0    .80    .57    .33    .17
    n = 192   1.0    .92    .80    .60    .38
    n = 320   1.0    .95    .87    .71    .50
    n = 512   1.0    .97    .91    .80    .62

Table 1: Efficiency as a function of n and p for adding n numbers on p-processor hypercubes.
Given that increasing the number of processors reduces efficiency and increasing the size of the computation increases efficiency, it should be possible to keep the efficiency constant by increasing both the size of the problem and the number of processors simultaneously. For instance, in Table 1, the efficiency of adding 64 numbers on a hypercube with four processors is 0.80. If the number of processors is increased to eight and the size of the problem is scaled up to add 192 numbers, the efficiency remains 0.80. Further increasing p to 16 and n to 512 also results in the same efficiency. This behavior is exhibited by a large class of parallel systems. For these systems, the efficiency of parallel execution can be maintained at a constant value by simultaneously increasing the number of processors and the size of the problem being solved. We call such systems scalable parallel systems.
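The entries of Table 1 that share an efficiency can be generated directly: solving Equation 5 for n at a fixed efficiency E gives n = 2 p log p · E/(1 − E). A short sketch (illustrative, assuming base-2 logarithms and unit operation cost):

    import math

    def problem_size_for_efficiency(p, e=0.80):
        """Smallest n with E = n / (n + 2 p log p) equal to e (Equation 5 rearranged)."""
        return 2 * p * math.log2(p) * e / (1.0 - e)

    for p in (4, 8, 16, 32):
        print(p, problem_size_for_efficiency(p))   # 64, 192, 512, 1280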
For such systems, a natural question is: "At what rate should the problem size be increased with respect to the number of processors to keep the efficiency fixed?" For different parallel systems, the size of the problem to be solved has to increase at different rates in order to maintain a fixed efficiency as the number of processors is increased. This rate determines the degree of scalability of the parallel system. In this subsection we introduce a metric for quantitatively determining the degree of scalability of a parallel system.
Consider the efficiency of a parallel algorithm defined by Equation 2. Substituting T1 = tc W into this equation, we get:

    E = 1/(1 + To/(tc W))                                                 (6)
In Equation 6, if the problem size W is kept constant and p is increased, then the efficiency decreases because the total overhead To increases with p. If W is increased while keeping the number of processors constant, then, for scalable parallel systems, the efficiency increases. This is because, for a given p, To grows slower than Θ(W). For these parallel systems, the efficiency can be maintained at a desired value (between 0 and 1) by increasing p, provided W is also increased. For different parallel systems, W has to be increased at different rates with respect to p in order to maintain a fixed efficiency. For example, in some cases, W might need to grow as an exponential function of p to keep the efficiency from dropping as p is increased. Such parallel systems are considered poorly scalable. On parallel systems with these characteristics, it is difficult to obtain good speedups for a large number of processors unless the size of the problem being solved is enormously large. Conversely, if W needs to grow only linearly with respect to p, then the parallel system is considered highly scalable. This is because it delivers speedups increasing linearly with respect to the number of processors for problem sizes increasing at reasonable rates.
For scalable parallel systems, the efficiency can be maintained at a desired value (between 0 and 1) if the ratio To/W in the expression for efficiency is maintained at a constant value. To maintain a certain efficiency E (0 < E < 1), we manipulate Equation 6 to get:

    To/W = tc (1 − E)/E

    W = (1/tc) (E/(1 − E)) To

Let K = E/(tc (1 − E)) be a constant depending on the efficiency. The equation above can be rewritten as follows:

    W = K To                                                              (7)
From Equation 7, the problem size W can usually be obtained as a function of p by algebraic manipulations. This function dictates the rate of growth of W required to keep the efficiency fixed as p is increased. We call this function the isoefficiency function of the parallel system. The isoefficiency function of a parallel system determines the ease with which it can achieve speedups increasing in proportion to the number of processors. A small isoefficiency function implies that small increments in the problem size are sufficient for efficient utilization of an increasing number of processing elements, and hence the parallel system is highly scalable. Conversely, a large isoefficiency function indicates a poorly scalable parallel system. The isoefficiency function does not exist for unscalable parallel systems, because in such systems the efficiency cannot be kept at a constant value as p increases, no matter how fast the problem size is increased.
Example 2:
The expression for the overhead function of the problem of adding n numbers on a p-processor hypercube is given in Equation 3. Substituting the value of To in Equation 7, we get

    W = 2Kp log p                                                         (8)

Thus the asymptotic isoefficiency function for this parallel system is Θ(p log p). This means that if the number of processors is increased from p to p', the problem size (in this case, n) will have to be increased by a factor of (p' log p')/(p log p) to get the same efficiency as on p processors. In other words, increasing the number of processors by a factor of p'/p requires n to be increased by a factor of (p' log p')/(p log p) in order to increase the speedup by a factor of p'/p.
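For example, a small sketch (added here for illustration) of the growth factor (p' log p')/(p log p) implied by Equation 8 when the machine grows from p to p' processors:

    import math

    def required_growth(p, p_new):
        """Factor by which n must grow to keep efficiency constant: p' log p' / (p log p)."""
        return (p_new * math.log2(p_new)) / (p * math.log2(p))

    print(required_growth(4, 8))    # 3.0   (e.g. n = 64 -> 192, matching Table 1)
    print(required_growth(4, 16))   # 8.0   (n = 64 -> 512)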
In the simple example of adding n numbers, the communication overhead is a function of only p. In general, it can depend on both the problem size and the number of processors. A typical overhead function may have several different terms of different orders of magnitude with respect to p and W. When there are multiple terms of different orders of magnitude in the overhead function, it may be impossible or cumbersome to obtain the isoefficiency function as a closed-form function of p. For instance, consider a hypothetical parallel system for which To = p^{3/2} + p^{3/4} W^{3/4}. In this case Equation 7 becomes W = K p^{3/2} + K p^{3/4} W^{3/4}, and it is difficult to solve for W in terms of p. Recall that the condition for constant efficiency is that the ratio of To and W should remain fixed. As p and W increase in a parallel system, the efficiency is guaranteed not to drop if none of the terms of To grows faster than W. Therefore, if To has multiple terms, we balance W against each individual term of To to compute the respective isoefficiency function. The component of To that causes the problem size to grow at the fastest rate with respect to p determines the overall asymptotic isoefficiency function of the computation.
Example 3:
Consider a hypothetical parallel algorithm-architecture combination for which To = p^{3/2} + p^{3/4} W^{3/4}. If we ignore the second term of To and use only the first term in Equation 7, we get

    W = K p^{3/2}                                                         (9)

Now consider only the second term of the overhead function and repeat the above analysis. Equation 7 now takes the form

    W = K p^{3/4} W^{3/4}
    W^{1/4} = K p^{3/4}
    W = K^4 p^3                                                           (10)

In order to ensure that the efficiency does not decrease as the number of processors increases, the first and the second terms of the overhead function require the problem size to grow as Θ(p^{3/2}) and Θ(p^3), respectively. The asymptotically higher of the two rates should be regarded as the overall asymptotic isoefficiency function. Thus, the isoefficiency function is Θ(p^3) for this parallel system. This is because if the problem size W grows as Θ(p^3), then To remains of the same order as W.
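A simple numerical check of this balancing argument (a sketch using the hypothetical overhead function of Example 3): when W grows as p^3 the ratio To/W stays bounded, whereas growing W only as p^{3/2} lets the second term of To overwhelm W.

    def overhead(p, w):
        """Hypothetical overhead function from Example 3: To = p^(3/2) + p^(3/4) * W^(3/4)."""
        return p ** 1.5 + p ** 0.75 * w ** 0.75

    for p in (16.0, 256.0, 4096.0):
        w_fast = p ** 3      # grow W at the Theta(p^3) isoefficiency rate
        w_slow = p ** 1.5    # grow W only fast enough for the first term
        # To/W stays near 1 in the first column but keeps growing in the second.
        print(int(p), round(overhead(p, w_fast) / w_fast, 3), round(overhead(p, w_slow) / w_slow, 1))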
By performing isoefficiency analysis, we can test the performance of a parallel program on a few processors and then predict its performance on a larger number of processors. However, the utility of isoefficiency analysis is not limited to predicting the impact on performance of an increasing number of processors. As we shall see in later sections, it can also be used to study the behavior of a parallel system with respect to changes in other hardware-related parameters, such as the speed of the processors and the data communication channels.
We refer to the maximum number of tasks that can be executed simultaneously at any given time in a parallel algorithm as its degree of concurrency.
The degree of concurrency is a measure of the number of operations that an algorithm can perform in parallel in a problem of size W, and is independent of the parallel architecture. If C(W) is the degree of concurrency of a parallel algorithm, then given a problem of size W, at most C(W) processors can be employed effectively. For example, using Gaussian elimination to solve a system of n equations with n variables, the total amount of computation is Θ(n^3). However, the n variables have to be eliminated one after the other, and eliminating a particular variable involves Θ(n^2) computations. Thus, at most O(n^2) processors can be kept busy at any given time. Now if W = Θ(n^3) for this problem, then the degree of concurrency C(W) is Θ(W^{2/3}). Since, given a problem of size W, at most Θ(W^{2/3}) processors can be used, then given p processors, the size of the problem should be at least Θ(p^{3/2}) in order to use all the processors. Thus, the isoefficiency function of this computation due to concurrency is Θ(p^{3/2}). The isoefficiency function due to concurrency is optimal (i.e., Θ(p)) only if the degree of concurrency of the parallel algorithm is Θ(W). If the degree of concurrency of an algorithm is less than Θ(W), then the isoefficiency function due to concurrency can be worse (greater) than Θ(p). In such cases, the overall isoefficiency function of a parallel system is given by the maximum of the isoefficiency functions due to concurrency, communication, and other overheads.
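In code form, the concurrency bound for the Gaussian elimination example looks as follows (a hypothetical illustration; the function names are ours):

    def degree_of_concurrency(w):
        """C(W) = Theta(W^(2/3)) for Gaussian elimination, where W = Theta(n^3)."""
        return w ** (2.0 / 3.0)

    def min_problem_size(p):
        """Isoefficiency due to concurrency: W must grow at least as Theta(p^(3/2))."""
        return p ** 1.5

    p = 1024
    w = min_problem_size(p)
    print(w, degree_of_concurrency(w))   # 32768.0 and roughly 1024 usable processors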
The following examples illustrate the use of isoefficiency analysis for parallel algorithms implemented on a hypercube-connected parallel computer.
Example 4: Stripe-Based Matrix-Vector Product on a Hypercube
Consider the problem of multiplying an n × n matrix with an n × 1 vector. The number of basic operations needed for this matrix-vector product, and hence the problem size W, is equal to n^2. If the time taken by a single addition and multiplication operation is tc, then the sequential execution time of this algorithm is n^2 tc; i.e., T1 = n^2 tc [5].
[Figure 3: Row-striped partitioning of the matrix and the vector among p processors: (a) initial partitioning of the matrix and the starting vector x; (b) distribution of the full vector among all the processors by all-to-all broadcast; (c) entire vector distributed to each processor after the broadcast; (d) final distribution of the matrix and the result vector y.]
The all-to-all broadcast of the vector elements among p processors can be performed in ts log p + tw n(p − 1)/p time [5]. Here, ts is the startup time of the communication network and tw is the per-word transfer time. For large values of p this can be approximated by ts log p + tw n. Assuming that an addition and a multiplication take tc units of time, each processor spends tc n^2/p units of time in multiplying its n/p rows with the vector. Thus, the parallel execution time of this procedure is given by:

    TP = tc n^2/p + ts log p + tw n                                       (12)

The corresponding speedup S and efficiency E are given by:

    S = p/(1 + p(ts log p + tw n)/(tc n^2))                               (13)

    E = 1/(1 + (ts p log p + tw n p)/(tc n^2))                            (14)
Now we derive an expression for the isoefficiency function of this parallel formulation. Using the relation To = pTP − T1, we get the following expression for the overhead function of the striped matrix-vector multiplication on the hypercube:

    To = ts p log p + tw n p                                              (15)

Equating W with K times the ts term of To yields the isoefficiency due to ts:

    W = K ts p log p                                                      (16)

For the tw term, equating W = n^2 with K tw n p gives:

    n^2 = K tw n p
    n = K tw p
    W = n^2 = K^2 tw^2 p^2                                                (17)

From Equations 16 and 17 it can be inferred that the overall asymptotic rate at which the problem size needs to increase with the number of processors in order to maintain a fixed efficiency is Θ(p^2).
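To see the Θ(p^2) rate in action, here is a sketch with illustrative (hypothetical) machine parameters that evaluates Equation 14 while n is grown in proportion to p, so that W = n^2 grows as p^2:

    import math

    def stripe_efficiency(n, p, tc=1.0, ts=25.0, tw=4.0):
        """Efficiency of the striped matrix-vector product, Equation 14 (parameter values are made up)."""
        return 1.0 / (1.0 + (ts * p * math.log2(p) + tw * n * p) / (tc * n * n))

    # Growing n linearly with p (W = n^2 ~ p^2) holds the efficiency roughly constant.
    for p in (4, 16, 64, 256):
        print(p, round(stripe_efficiency(64 * p, p), 3))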
Example 5: Checkerboard-Based Matrix-Vector Product on a Hypercube
Consider an alternate partitioning of the matrix and the vector for computing the matrix-vector product. Instead of partitioning the matrix into stripes, we now divide it into p squares, each of dimensions (n/√p) × (n/√p). This partitioning is referred to as checkerboard partitioning [5]. The parallel matrix-vector multiplication algorithm based on this partitioning is illustrated in Figure 4. The vector itself is distributed along the last column of the mesh.
In the first step of the algorithm, the vector is aligned along the diagonal processors. For this, all processors of the last column send their n/√p elements of the vector to the diagonal processor of their respective rows. Then a column-wise one-to-all broadcast of these n/√p elements is performed.
[Figure 4: Checkerboard partitioning of the matrix for matrix-vector multiplication: (a) initial data distribution and communication steps to align the vector along the diagonal; (b) one-to-all broadcast of portions of the vector along processor columns; (c) single-node accumulation of partial results; (d) final distribution of the result vector.]
After this communication step, the vector is aligned along the rows of the matrix. Each processor performs n^2/p multiplications and locally adds up the n/√p sets of products. As shown in Figure 4(c), each processor now has n/√p partial sums which need to be accumulated along each row to obtain the resultant vector. Hence, the last step of the algorithm is a single-node accumulation of these n/√p values in each row, with the last processor of the row as the destination. The operation of this algorithm is illustrated in Figure 4.
We now analyze the performance of this algorithm on a hypercube-connected computer with store-and-forward routing [5]. The first step of sending a message from the last processor of a row to the diagonal processor can be performed in at most ts + tw (n/√p) log √p time. The second communication step distributes copies of the vector among the rows of processors and involves one-to-all broadcasts in all the columns of processors as shown in Figures 4(b) and 4(c). This step can be performed in (ts + tw n/√p) log √p time. If a multiplication and an addition are assumed to take tc units of time, then each processor spends approximately tc n^2/p time performing computation. Now if the resultant vector has to be placed in the last column of processors like the starting vector, then a single-node accumulation of components of size n/√p of the vector has to be performed in each row of processors. Ignoring the time required for performing additions during this step, the accumulation can be performed with a communication time of (ts + tw n/√p) log √p. Summing up the time taken by all the steps, the total parallel execution time for this procedure is given by the following equation:
    TP = tc n^2/p + ts + 2 ts log √p + 3 tw (n/√p) log √p

This can be approximated by

    TP = tc n^2/p + ts log p + (3/2) tw (n/√p) log p                      (18)

From this equation, the total overhead To is given by:

    To = ts p log p + (3/2) tw n √p log p
As before, we can equate each term in this overhead function with the problem size W. For the isoefficiency due to ts, we have W ∝ K ts p log p, where K = E/(tc(1 − E)) as before. For the isoefficiency due to tw, equating W with K times the tw term of To, we have

    n^2 = K (3/2) tw n √p log p
    n = (3/2) K tw √p log p
    n^2 = (9/4) K^2 tw^2 p log^2 p

Therefore, the isoefficiency due to tw is given by Θ(p log^2 p). Since this term asymptotically dominates the Θ(p log p) term due to ts, the overall isoefficiency of this formulation is given by Θ(p log^2 p).
From these two examples, we can see that the isoefficiency function of the stripe-based matrix-vector product algorithm, Θ(p^2), is asymptotically higher than the Θ(p log^2 p) isoefficiency function of the checkerboard-based algorithm. This implies that as the number of processors is increased, the stripe-based formulation will require much larger problem sizes to yield the same efficiencies as the checkerboard-based formulation.
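Ignoring constant factors, the gap between the two formulations can be tabulated directly (an illustrative sketch, not a measurement):

    import math

    def iso_stripe(p):
        """Isoefficiency of the striped formulation: Theta(p^2), constants dropped."""
        return p ** 2

    def iso_checkerboard(p):
        """Isoefficiency of the checkerboard formulation: Theta(p log^2 p), constants dropped."""
        return p * math.log2(p) ** 2

    for p in (16, 256, 4096):
        # Ratio of the problem sizes the two formulations need for comparable efficiency.
        print(p, round(iso_stripe(p) / iso_checkerboard(p), 1))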
Figure 5: A 16-point FFT on four processors. Pi denotes processor number i and m refers to the
iteration number.
The n = 2^r data points are distributed among the p = 2^d processors in blocks, so that the elements combined during the last (r − d) iterations reside on the same processors. Hence, this parallel FFT algorithm involves interprocessor communication only during d = log p of the log n iterations. Each communication operation involves exchanging n/p words of data. Thus, the time spent in communication during the execution of the entire algorithm is (ts + tw n/p) log p. In each iteration, a processor updates n/p elements of vector R. If a complex multiplication and addition takes time tc, then the parallel execution time of the n-point parallel FFT on a p-processor hypercube is given by:

    TP = tc (n/p) log n + ts log p + tw (n/p) log p                       (19)

From this equation, we can also see that the total overhead To is given by:

    To = ts p log p + tw n log p

We know that the problem size for an n-point FFT is given by:

    W = n log n                                                           (20)

Following the methodology of Examples 4 and 5, we can determine the isoefficiency by equating the problem size with the total overhead. For isoefficiency due to ts, we have W ∝ ts p log p, which
corresponds to an isoefficiency function of Θ(p log p). The isoefficiency due to the tw term can also be determined similarly:

    n log n = K tw n log p
    log n = K tw log p
    n = p^{K tw}
    n log n = K tw p^{K tw} log p

    W = (E/(1 − E)) (tw/tc) p^{(E/(1 − E))(tw/tc)} log p                  (21)
If the term tw E/(tc(1 − E)) is less than 1, then the rate of growth of problem size dictated by Equation 21 is less than Θ(p log p), and hence the overall isoefficiency function is Θ(p log p) for this parallel system. On the other hand, if tw E/(tc(1 − E)) exceeds 1, then Equation 21 determines the overall isoefficiency function, which is now greater than Θ(p log p). Also, in this case, the asymptotic isoefficiency function depends on the relative values of E/(1 − E), tw, and tc. Thus, the FFT algorithm is somewhat unique in that its asymptotic isoefficiency function depends on the desired efficiency and on hardware-dependent parameters. In fact, the efficiency corresponding to tw E/(tc(1 − E)) = 1 (i.e., E/(1 − E) = tc/tw, or E = tc/(tc + tw)) acts as a threshold value. For a given hypercube-connected computer with fixed tc and tw, efficiencies up to this value can be obtained easily. But efficiencies much higher than this threshold can be obtained only if the problem size is extremely large.
Let us examine the effect of the value of tw E/(tc(1 − E)) on the isoefficiency function. Consider the computation of an n-point FFT on a p-processor hypercube on which tw = tc. The isoefficiency function of the parallel FFT on this machine is (E/(1 − E)) p^{E/(1 − E)} log p. Now for E/(1 − E) ≤ 1 (i.e., E ≤ 0.5) the overall isoefficiency is Θ(p log p), but for E > 0.5 the isoefficiency function is much worse. For instance, if E = 0.9, then E/(1 − E) = 9 and hence the isoefficiency function becomes Θ(p^9 log p). Now consider the computation of an n-point FFT on a p-processor hypercube on which tw = 2tc. The threshold efficiency is 0.33. The isoefficiency function for E = 0.33 is Θ(p log p), for E = 0.5 it is Θ(p^2 log p), and for E = 0.9 it becomes Θ(p^18 log p).
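These threshold calculations are easy to reproduce. The sketch below (illustrative only) prints the threshold efficiency tc/(tc + tw) and the exponent (E/(1 − E))(tw/tc) that governs the polynomial factor in Equation 21; whenever the exponent is at most 1, the Θ(p log p) term due to ts dominates instead.

    def threshold_efficiency(tc, tw):
        """Efficiency at which tw * E / (tc * (1 - E)) = 1, i.e. E = tc / (tc + tw)."""
        return tc / (tc + tw)

    def iso_exponent(e, tc, tw):
        """Exponent of p in Equation 21: (E / (1 - E)) * (tw / tc)."""
        return e / (1.0 - e) * tw / tc

    for tw in (1.0, 2.0):                       # tw expressed in units of tc
        print("tw/tc =", tw, "threshold E =", round(threshold_efficiency(1.0, tw), 2))
        for e in (0.33, 0.5, 0.9):
            # Exponents <= 1 mean the overall isoefficiency stays at Theta(p log p).
            print("   E =", e, "-> p^%.1f log p" % iso_exponent(e, 1.0, tw))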
The above examples show that there is a limit on the efficiency that can be obtained for reasonable problem sizes, and this limit is determined by the ratio of the CPU speed to the bandwidth of the communication channels of the hypercube. This limit can be raised by increasing the bandwidth of the communication channels. On the other hand, making the CPUs faster without improving the communication bandwidth lowers this threshold. Hence, this parallel formulation of FFT performs poorly on a hypercube whose communication and computation speeds are not balanced. If the hardware is balanced with respect to communication and computation speeds, then the FFT algorithm is fairly scalable on a hypercube, with an overall isoefficiency function of Θ(p log p), and good efficiencies can be expected for a reasonably large number of processors.
This example illustrates how the isoefficiency function can be used to derive the effect of various machine-specific factors. For a more detailed scalability analysis of this and other parallel FFT algorithms, see [2, 5].
Instances of applications which conform to these characteristics are found in depth-first search of large unstructured trees used for solving discrete optimization problems [6]. Some parallel algorithms for solving this problem employ the following dynamic load balancing strategy. All work is initially assigned to one processor. An idle processor Pi selects a processor Pa using some selection criterion and sends it a work request. If processor Pa has no work, then it responds with a reject message; otherwise, it partitions its work into two parts and sends one of the pieces to Pi. This process continues until all processors exhaust the available work. Various selection criteria have been proposed in the literature [6, 8]. One technique, referred to as Global Round Robin (GRR), maintains a global pointer G located at one of the processors. This pointer initially points to the first processor in the ensemble. Each time an idle processor needs to select Pa, it reads the current value of G and requests work from PG. Before the next request from an idle processor is processed, the global pointer is incremented by one (modulo p). The global pointer thus distributes the work requests evenly over the various processors.
The non-deterministic nature of the algorithm makes it impossible to evaluate its exact parallel execution time. However, it is possible to bound the communication cost of such an algorithm. As shown in [6, 5], an upper bound on the number of communications required for this algorithm is O(p log W). Each communication takes O(log p) time, so the total overhead resulting from the communication of work is bounded by O(p log p log W). As before, this term can be equated with the problem size W to yield the isoefficiency resulting from communication overheads. Therefore,

    W ∝ O(p log p log W)

Substituting the value of W from the right-hand side of this expression back into the right-hand side and ignoring the double-log terms, we can see that the isoefficiency of this algorithm due to communication overheads is O(p log^2 p).
This term, however, does not specify the overall isoefficiency of the system, because the algorithm has overheads in addition to those for communicating work. In this algorithm a global variable is accessed repeatedly. If many processors try to access it at the same time, then only one will succeed and the others will have to wait due to contention. We analyze the isoefficiency due to contention as follows.
The global variable is accessed O(p log W) times (for read and increment operations) over the entire execution. If the processors are efficiently utilized, then the total execution time is Θ(W/p). Assume that while solving some specific problem instance of size W on p processors, there is no contention. In this case, W/p is much greater than the total time over which the shared variable is accessed. Now, as we increase the number of processors, the total execution time (i.e., W/p) decreases, but the number of times the shared variable is accessed increases. There is thus a crossover point beyond which the shared-variable accesses become a bottleneck and the overall execution time cannot be reduced further. This bottleneck can be eliminated by increasing W at a rate such that the ratio between W/p and O(p log W) remains constant. Equating W/p and O(p log W) and simplifying yields an isoefficiency term of O(p^2 log p).
Thus, since the isoefficiency due to contention asymptotically dominates the isoefficiency due to communication, the overall isoefficiency is given by O(p^2 log p).
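The crossover argument can also be checked numerically. A minimal sketch (constants set to 1, purely illustrative) that solves W/p = p log W by fixed-point iteration and compares the result with p^2 log p:

    import math

    def contention_isoefficiency(p, iterations=50):
        """Solve W = p^2 * log W (constants dropped) by fixed-point iteration."""
        w = float(p) ** 2
        for _ in range(iterations):
            w = p ** 2 * math.log2(w)
        return w

    for p in (16, 256, 4096):
        w = contention_isoefficiency(p)
        # The ratio stays bounded, consistent with the O(p^2 log p) growth rate.
        print(p, round(w), round(w / (p ** 2 * math.log2(p)), 2))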
In this example, we have seen how it is possible to model overheads due to contention using the isoefficiency metric. A number of other dynamic load balancing algorithms have been analyzed in [6, 8]. In [6], it is experimentally shown that dynamic load balancing schemes with better isoefficiency functions outperform those with poorer isoefficiency functions.
4 Summary and Concluding Remarks
The isoefficiency metric is useful in situations in which we are interested in obtaining linearly increasing performance with the number of processors. If the rate of growth of problem size is the same as that specified by the isoefficiency function, then the speedup obtained from the parallel system is linear. In certain cases, it may not be possible or desirable to increase the problem size at the rate specified by the isoefficiency function. If this rate is smaller than the isoefficiency function, then the speedup is sublinear. For a given rate of growth of problem size, the speedup curve can be used as a scalability metric. If the problem size is increased linearly with the number of processors, the speedup curve is referred to as scaled speedup [3]. The rate of growth of problem size may also be constrained by the amount of memory in the parallel computer. In this case, the problem size is increased at the fastest rate allowed by the available memory. These metrics have been investigated by Worley [12], Gustafson [3], and Sun and Ni [10]. In many situations, the rate of growth of problem size is dictated by the time available to solve the problem. In these cases, the problem size is increased with the number of processors in such a way that the parallel runtime remains constant. The scalability issues for such problems have been explored by Worley [12], Gustafson [3], and Sun and Ni [10]. It is also possible to keep the problem size fixed and use the speedup curve as a scalability metric. Performance issues related to the fixed problem size case have been addressed in [1].
There are certain interesting relationships between isoefficiency and some of these metrics. It can be shown that if the isoefficiency function is greater than Θ(p), then for a scalable parallel system the problem size cannot be increased indefinitely while maintaining a fixed execution time, no matter how many processors are used [1, 7]. In [1], we have shown that for a class of parallel systems, the relationship between the problem size and the number of processors on which the problem executes in minimum time is specified by the isoefficiency function.
Acknowledgements
This work was supported by Army Research Office grant #28408-MA-SDI to the University of
Minnesota and by the Army High Performance Computing Research Center at the University of
Minnesota. The authors would also like to thank Daniel Challou for his help in the preparation of
this document.
References
[1] Anshul Gupta and Vipin Kumar. Performance properties of large scale parallel systems.
Journal of Parallel and Distributed Computing (special issue on supercomputer performance),
November 1993. Also available as Technical Report 92-32, Department of Computer Science,
University of Minnesota, Minneapolis, MN.
[2] Anshul Gupta and Vipin Kumar. The scalability of FFT on parallel computers. IEEE Trans-
actions on Parallel and Distributed Systems, 4(8):922-932, August 1993. A detailed version
available as Technical Report TR 90-53, Department of Computer Science, University of Min-
nesota, Minneapolis, MN.
[3] John L. Gustafson. Reevaluating Amdahl's law. Communications of the ACM, 31(5):532-533,
1988.
[4] John L. Gustafson. The consequences of fixed time performance measurement. In Proceedings
of the 25th Hawaii International Conference on System Sciences: Volume III, pages 113-124,
1992.
[5] Vipin Kumar, Ananth Grama, Anshul Gupta, and George Karypis. Introduction to Parallel
Computing: Design and Analysis of Algorithms. Benjamin/Cummings, Redwood City, CA,
1994.
[6] Vipin Kumar, Ananth Grama, and V. Nageshwara Rao. Scalable load balancing techniques
for parallel computers. Technical Report 91-55, Computer Science Department, University of
Minnesota, 1991. To appear in Journal of Distributed and Parallel Computing, 1994.
[7] Vipin Kumar and Anshul Gupta. Analyzing scalability of parallel algorithms and architec-
tures. Technical Report TR 91-18, Department of Computer Science, University
of Minnesota, Minneapolis, MN, 1991. To appear in Journal of Parallel and Distributed Com-
puting, 1994. A shorter version appears in Proceedings of the 1991 International Conference
on Supercomputing, pages 396-405, 1991.
[8] Vipin Kumar and V. N. Rao. Parallel depth-first search, part II: Analysis. International
Journal of Parallel Programming, 16(6):501-519, 1987.
[9] Vipin Kumar and Vineet Singh. Scalability of Parallel Algorithms for the All-Pairs Shortest
Path Problem. Journal of Parallel and Distributed Computing, 13(2):124-138, October 1991. A
short version appears in the Proceedings of the International Conference on Parallel Processing,
1990.
[10] Xian-He Sun and L. M. Ni. Another view of parallel speedup. In Supercomputing '90 Proceed-
ings, pages 324-333, 1990.
[11] Xian-He Sun and Diane Thiede Rover. Scalability of parallel algorithm-machine combinations.
Technical Report IS-5057, Ames Laboratory, Iowa State University, Ames, IA, 1991. To appear
in IEEE Transactions on Parallel and Distributed Systems.
[12] Patrick H. Worley. The effect of time constraints on scaled speedup. SIAM Journal on Scientific
and Statistical Computing, 11(5):838-858, 1990.