Multithreaded Parallelism and Performance Measures

Marc Moreno Maza
University of Western Ontario, London, Ontario (Canada)
CS 4435 - CS 9624

Plan
1. Parallelism Complexity Measures
2. cilk for Loops
3. Scheduling Theory and Implementation
4. Measuring Parallelism in Practice
5. Announcements
Parallelism Complexity Measures
The fork-join parallelism model
int fib (int n) {
  if (n < 2)
    return (n);
  else {
    int x, y;
    x = cilk_spawn fib(n-1);  // child may run in parallel with the continuation
    y = fib(n-2);             // ordinary call, runs in the parent
    cilk_sync;                // wait for the spawned child before using x
    return (x+y);
  }
}
Example: fib(4)

(Figure: the computation dag of fib(4); the model is “processor oblivious”.)

The computation dag unfolds dynamically.
We shall also call this model multithreaded parallelism.
Terminology
(Figure: a computation dag, labeling the initial strand, the final strand, a strand, and the continue, return, spawn, and call edges.)
A strand is a maximal sequence of instructions that ends with a spawn, sync, or return (either explicit or implicit) statement.
At runtime, the spawn relation causes procedure instances to be structured as a rooted tree, called the spawn tree or parallel instruction stream, where dependencies among strands form a dag.
Work and span

We define several performance measures. We assume an ideal situation: no cache issues, no interprocessor costs.
Tp is the minimum running time on p processors.
T1 is called the work, that is, the sum of the number of instructions at each node.
T∞ is the minimum running time with infinitely many processors, called the span.
The critical path length
Assuming all strands run in unit time, the longest path in the DAG is equal
to T∞ . For this reason, T∞ is also referred to as the critical path length.
Work law
We have: Tp ≥ T1/p.
Indeed, in the best case, p processors can perform p units of work per unit of time.
Span law

We have: Tp ≥ T∞.
Indeed, Tp < T∞ would contradict the definitions of Tp and T∞.

Speedup on p processors

T1/Tp is called the speedup on p processors.
A parallel program execution can have:
linear speedup: T1/Tp = Θ(p)
superlinear speedup: T1/Tp = ω(p) (not possible in this model, though it is possible in others)
sublinear speedup: T1/Tp = o(p)
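As a quick numeric illustration (numbers invented for this note, not from the slides): suppose T1 = 100 and T∞ = 10. On p = 4 processors, the work and span laws give

\[ T_4 \;\ge\; \max\!\left(\frac{T_1}{p},\; T_\infty\right) \;=\; \max(25,\, 10) \;=\; 25, \]

so the speedup T1/T4 is at most 4, and no number of processors can push the speedup beyond T1/T∞ = 10.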
Parallelism
Because the Span Law dictates that Tp ≥ T∞, the maximum possible speedup given T1 and T∞ is
T1/T∞ = parallelism = the average amount of work per step along the span.
The Fibonacci example (1/2)

(Figure: the computation dag of Fib(4), with the strands along the span numbered 1 through 8.)

For Fib(4), we have T1 = 17 and T∞ = 8 and thus T1/T∞ = 2.125.
What about T1(Fib(n)) and T∞(Fib(n))?
The Fibonacci example (2/2)
We have T1(n) = T1(n − 1) + T1(n − 2) + Θ(1). Let’s solve it.
One verifies by induction that T1(n) ≤ aFn − b for b > 0 large enough to dominate the Θ(1) term and for a > 1.
We can then choose a large enough to satisfy the initial condition, whatever that is.
On the other hand, we also have Fn ≤ T1(n).
Therefore T1(n) = Θ(Fn) = Θ(ψⁿ) with ψ = (1 + √5)/2.
We have T∞(n) = max(T∞(n − 1), T∞(n − 2)) + Θ(1).
One easily checks T∞(n − 1) ≥ T∞(n − 2).
This implies T∞(n) = T∞(n − 1) + Θ(1).
Therefore T∞(n) = Θ(n).
Consequently the parallelism is Θ(ψⁿ/n).
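To make these measures concrete, here is a small sketch (mine, not from the slides) that computes the exact work and span of the fib dag, assuming unit-time strands and a decomposition of three strands per internal instance (before the spawn, around the call, after the sync) and one per leaf, which is consistent with T1 = 17 and T∞ = 8 for Fib(4) above:

#include <algorithm>
#include <cstdio>
#include <utility>

// Returns (work, span) of the fib(n) dag under the unit-time
// strand assumption described above.
static std::pair<long, long> work_span(int n) {
    if (n < 2) return {1, 1};          // a leaf is a single strand
    auto [w1, s1] = work_span(n - 1);  // spawned branch
    auto [w2, s2] = work_span(n - 2);  // called branch
    long work = w1 + w2 + 3;           // work adds up
    // Critical path: through the spawned child (2 local strands on the
    // path) or through the called child (3 local strands on the path).
    long span = std::max(s1 + 2, s2 + 3);
    return {work, span};
}

int main() {
    auto [w, s] = work_span(4);
    std::printf("T1 = %ld, Tinf = %ld, parallelism = %.3f\n",
                w, s, (double)w / (double)s);  // prints 17, 8, 2.125
    return 0;
}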
Series composition

(Figure: a dag A followed in series by a dag B.)

Work: T1(A ∪ B) = T1(A) + T1(B)
Span: T∞(A ∪ B) = T∞(A) + T∞(B)

Parallel composition

(Figure: dags A and B composed in parallel.)

Work: T1(A ∪ B) = T1(A) + T1(B)
Span: T∞(A ∪ B) = max(T∞(A), T∞(B))
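In Cilk++ terms, the two compositions correspond directly to calling versus spawning; a minimal sketch (runA and runB are hypothetical functions):

// Series composition: B starts only after A finishes.
runA();
runB();             // span = T∞(A) + T∞(B)

// Parallel composition: A may run concurrently with B.
cilk_spawn runA();
runB();
cilk_sync;          // span = max(T∞(A), T∞(B)), up to constants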
Some results in the fork-join parallelism model

Algorithm               Work          Span
Merge sort              Θ(n lg n)     Θ(lg³ n)
Matrix multiplication   Θ(n³)         Θ(lg n)
Strassen                Θ(n^(lg 7))   Θ(lg² n)
LU-decomposition        Θ(n³)         Θ(n lg n)
Tableau construction    Θ(n²)         Ω(n^(lg 3))
FFT                     Θ(n lg n)     Θ(lg² n)
Breadth-first search    Θ(E)          Θ(d lg V)

We shall prove those results in the next lectures.
cilk for Loops

For loop parallelism in Cilk++

(Figure: the n × n matrix A = (a_ij) and its transpose Aᵀ = (a_ji).)
cilk_for (int i=1; i<n; ++i) {
for (int j=0; j<i; ++j) {
double temp = A[i][j];
A[i][j] = A[j][i];
A[j][i] = temp;
}
}
The iterations of a cilk for loop execute in parallel.
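Note (an observation added here, not on the slide): this loop is safe to parallelize because iteration i only touches the pairs A[i][j], A[j][i] with j < i, and each unordered pair {i, j} is handled by exactly one iteration, so distinct iterations write disjoint entries.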
Implementation of for loops in Cilk++
Up to details (next week!) the previous loop is compiled as follows, using a
divide-and-conquer implementation:
void recur(int lo, int hi) {
  if (hi > lo) { // more than one iteration: coarsen
    int mid = lo + (hi - lo)/2;
    cilk_spawn recur(lo, mid);
    recur(mid+1, hi);
    cilk_sync;
  } else {       // base case: the single iteration i == lo
    int i = lo;
    for (int j=0; j<i; ++j) {
      double temp = A[i][j];
      A[i][j] = A[j][i];
      A[j][i] = temp;
    }
  }
}
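Presumably (the launch is not shown on the slide) the transformed loop is then started as recur(1, n-1), covering the original iteration range i = 1, …, n−1 inclusively.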
Analysis of parallel for loops

Here we do not assume that each strand runs in unit time.

(Figure: the divide-and-conquer recursion tree of the loop control, with the iterations as leaves.)

Span of loop control: Θ(log(n))
Max span of an iteration: Θ(n)
Span: Θ(n)
Work: Θ(n²)
Parallelism: Θ(n)
Parallelizing the inner loop
cilk_for (int i=1; i<n; ++i) {
cilk_for (int j=0; j<i; ++j) {
double temp = A[i][j];
A[i][j] = A[j][i];
A[j][i] = temp;
}
}
Span of outer loop control: Θ(log(n))
Max span of an inner loop control: Θ(log(n))
Span of an iteration: Θ(1)
Span: Θ(log(n))
Work: Θ(n²)
Parallelism: Θ(n²/log(n)). But! More on this next week…
Scheduling Theory and Implementation
Scheduling

(Figure: a shared-memory multiprocessor: processors P, each with a cache $, connected to memory, I/O and the network.)

A scheduler’s job is to map a computation to particular processors. Such a mapping is called a schedule.
If decisions are made at runtime, the scheduler is online; otherwise, it is offline.
Cilk++’s scheduler maps strands onto processors dynamically at runtime.

Greedy scheduling (1/2)
A strand is ready if all its predecessors have executed.
A scheduler is greedy if it attempts to do as much work as possible at every step.
Greedy scheduling (2/2)

(Figure: a dag executed greedily with p = 3 processors.)
In any greedy schedule, there are two types of steps:
complete step: there are at least p strands that are ready to run. The greedy scheduler selects any p of them and runs them.
incomplete step: there are strictly fewer than p strands that are ready to run. The greedy scheduler runs them all.
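As an illustration (my own sketch, not from the slides), the following program simulates a greedy schedule of a small hand-built dag with unit-time strands on p = 2 processors, counting complete and incomplete steps:

#include <algorithm>
#include <cstdio>
#include <queue>
#include <vector>

int main() {
    // An invented series-parallel dag: succ[u] lists the successors of u.
    std::vector<std::vector<int>> succ = {{1, 2}, {3}, {3}, {4, 5}, {6}, {6}, {}};
    const int n = (int)succ.size(), p = 2;
    std::vector<int> indeg(n, 0);
    for (const auto& edges : succ)
        for (int v : edges) ++indeg[v];

    std::queue<int> ready;                 // strands with all predecessors done
    for (int v = 0; v < n; ++v)
        if (indeg[v] == 0) ready.push(v);

    int steps = 0, complete = 0, executed = 0;
    while (executed < n) {
        // One greedy step: run min(p, #ready) of the ready strands.
        int run = std::min(p, (int)ready.size());
        if (run == p) ++complete;
        for (int k = 0; k < run; ++k) {
            int u = ready.front(); ready.pop();
            for (int v : succ[u])
                if (--indeg[v] == 0) ready.push(v);
        }
        executed += run;
        ++steps;
    }
    // Here T1 = 7 and Tinf = 5, and the run takes 5 steps <= 7/2 + 5.
    std::printf("T%d = %d steps (%d complete, %d incomplete)\n",
                p, steps, complete, steps - complete);
    return 0;
}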
Theorem of Graham and Brent
For any greedy schedule, we have Tp ≤ T1/p + T∞.
#complete steps ≤ T1/p: each complete step performs p units of work, and there are only T1 units in total.
#incomplete steps ≤ T∞. Indeed, let G′ be the subgraph of G that remains to be executed immediately prior to an incomplete step.
(i) During this incomplete step, all strands that can be run are actually run.
(ii) Hence removing this incomplete step from G′ reduces T∞ by one.
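As a numeric sanity check (values invented for this note): with T1 = 1000, T∞ = 20 and p = 8, the theorem gives

\[ T_8 \;\le\; \frac{T_1}{p} + T_\infty \;=\; \frac{1000}{8} + 20 \;=\; 145, \]

while the work and span laws give T8 ≥ max(125, 20) = 125, so here the greedy schedule is at most 16% above optimal.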
Corollary 1

A greedy scheduler is always within a factor of 2 of optimal.
From the work and span laws, we have:
Tp ≥ max(T1/p, T∞)    (1)
In addition, we can trivially express:
T1/p ≤ max(T1/p, T∞)    (2)
T∞ ≤ max(T1/p, T∞)    (3)
From the Graham-Brent theorem, we deduce:
Tp ≤ T1/p + T∞    (4)
   ≤ max(T1/p, T∞) + max(T1/p, T∞)    (5)
   ≤ 2 max(T1/p, T∞)    (6)
which concludes the proof.

Corollary 2

The greedy scheduler achieves linear speedup whenever T∞ = O(T1/p).
From the Graham-Brent theorem, we deduce:
Tp ≤ T1/p + T∞    (7)
   = T1/p + O(T1/p)    (8)
   = Θ(T1/p)    (9)
The idea is to operate in the range where T1/p dominates T∞. As long as T1/p dominates T∞, all processors can be used efficiently.
The quantity T1/(p T∞) is called the parallel slackness.
The work-stealing scheduler (1/11 through 11/11)

(Figures: each processor P maintains a deque of frames. A spawn or a call pushes a frame onto the executing processor’s deque, a return pops one, and a processor whose deque becomes empty steals the oldest frame from the deque of another processor. The eleven figures step through a sequence of Spawn!, Call!, Return! and Steal! events.)
Performances of the work-stealing scheduler
Assume that:
each strand executes in unit time,
for almost all “parallel steps” there are at least p strands to run,
each processor is either working or stealing.
Then, the randomized work-stealing scheduler is expected to run in
TP = T1/p + O(T∞)    (10)
During a steal-free parallel step (a step at which all processors have work on their deque) each of the p processors consumes 1 unit of work. Thus, there are at most T1/p steal-free parallel steps.
During a parallel step with steals, each thief may reduce the running time by 1 with probability 1/p. Thus, the expected number of steals is O(p T∞).
Therefore, the expected running time is
TP = (T1 + O(p T∞))/p = T1/p + O(T∞).
Overheads and burden

Let T1, T∞, Tp be given. We want to refine the randomized work-stealing complexity result.
Obviously T1/p + T∞ will over-estimate Tp in practice.
Many factors (simplification assumptions of the fork-join parallelism model, architecture limitations, costs of executing the parallel constructs, overheads of scheduling) will make Tp smaller in practice.
One may want to estimate the impact of those factors:
1. by improving the estimate of the randomized work-stealing complexity result
2. by comparing a Cilk++ program with its C++ elision
3. by estimating the costs of spawning and synchronizing

Span overhead

The span overhead is the smallest constant c∞ such that
Tp ≤ T1/p + c∞ T∞.
Recall that T1/T∞ is the maximum possible speed-up that the application can obtain.
We call the parallel slackness assumption the following property:
T1/T∞ ≫ c∞ p    (11)
that is, c∞ p is much smaller than the average parallelism.
Under this assumption it follows that T1/p ≫ c∞ T∞ holds, thus c∞ has little effect on performance when sufficient slackness exists.
Work overhead

Let Ts be the running time of the C++ elision of a Cilk++ program.
We denote by c1 the work overhead:
c1 = T1/Ts
Recall the expected running time: Tp ≤ T1/p + c∞ T∞. Thus, with the parallel slackness assumption, we get
Tp ≤ c1 Ts/p + c∞ T∞ ≃ c1 Ts/p.    (12)
We can now state the work-first principle precisely:
Minimize c1, even at the expense of a larger c∞.
This is a key feature since it is conceptually easier to minimize c1 than to minimize c∞.
Cilk++ estimates Tp as Tp = T1/p + 1.7 × burden span, where the burden span is 15000 instructions times the number of continuation edges along the critical path.

The cactus stack

(Figure: a tree of procedure instances in which A spawns or calls B and C, and C spawns or calls D and E, together with the corresponding views of the stack.)

A cactus stack is used to implement C’s rule for sharing of function-local variables.
A stack frame can only see data stored in the current and in the previous stack frames.
Space bounds

(Figure: a work-stealing execution with p = 3, each processor using at most the serial stack space S1.)

The space Sp of a parallel execution on p processors required by Cilk++’s work-stealing scheduler satisfies:
Sp ≤ p · S1    (13)
where S1 is the minimal serial space requirement.

Measuring Parallelism in Practice
Cilkview
(Figure: a cilkview speedup plot, showing the Work Law (linear speedup) and Span Law bounds, the burdened parallelism, which estimates scheduling overheads, the measured speedup, and the parallelism.)

Cilkview computes work and span to derive upper bounds on parallel performance.
Cilkview also estimates scheduling overhead to compute a burdened span for lower bounds.
The Fibonacci Cilk++ example

Code fragment:
long fib(int n)
{
if (n < 2) return n;
long x, y;
x = cilk_spawn fib(n-1);
y = fib(n-2);
cilk_sync;
return x + y;
}
Fibonacci program timing

The environment for benchmarking:
– model name : Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz
– L2 cache size : 4096 KB
– memory size : 3 GB

      #cores = 1    #cores = 2             #cores = 4
 n    timing(s)     timing(s)   speedup    timing(s)   speedup
 30      0.086         0.046     1.870        0.025     3.440
 35      0.776         0.436     1.780        0.206     3.767
 40      8.931         4.842     1.844        2.399     3.723
 45    105.263        54.017     1.949       27.200     3.870
 50   1165.000       665.115     1.752      340.638     3.420

Quicksort

Code in cilk/examples/qsort
void sample_qsort(int * begin, int * end)
{
if (begin != end) {
--end;
int * middle = std::partition(begin, end,
std::bind2nd(std::less<int>(), *end));
using std::swap;
swap(*end, *middle);
cilk_spawn sample_qsort(begin, middle);
sample_qsort(++middle, ++end);
cilk_sync;
}
}
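A minimal driver sketch (hypothetical; the slides show only the sort itself, and this assumes sample_qsort as defined above) to exercise the code on random data:

#include <cstdlib>
#include <vector>

int cilk_main() {
    const std::size_t n = 10 * 1000 * 1000;  // 10 × 10^6 integers, as in the timings
    std::vector<int> a(n);
    for (std::size_t i = 0; i < n; ++i)
        a[i] = std::rand();                  // arbitrary test data
    sample_qsort(a.data(), a.data() + n);    // sort the whole array in place
    return 0;
}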
Quicksort timing
Timing for sorting an array of integers:

             #cores = 1    #cores = 2             #cores = 4
 # of int    timing(s)     timing(s)   speedup    timing(s)   speedup
 10 × 10⁶       1.958         1.016     1.927        0.541     3.619
 50 × 10⁶      10.518         5.469     1.923        2.847     3.694
 100 × 10⁶     21.481        11.096     1.936        5.954     3.608
 500 × 10⁶    114.300        57.996     1.971       31.086     3.677

Matrix multiplication

Code in cilk/examples/matrix
Timing of multiplying a 687 × 837 matrix by a 837 × 1107 matrix:
             iterative                  recursive
 threshold   st(s)    pt(s)    su       st(s)    pt(s)    su
 10          1.273    1.165    0.721    1.674    0.399    4.195
 16          1.270    1.787    0.711    1.408    0.349    4.034
 32          1.280    1.757    0.729    1.223    0.308    3.971
 48          1.258    1.760    0.715    1.164    0.293    3.973
 64          1.258    1.798    0.700    1.159    0.291    3.983
 80          1.252    1.773    0.706    1.267    0.320    3.959

st = sequential time; pt = parallel time with 4 cores; su = speedup
The cilkview example from the documentation
Using cilk for to perform operations over an array in parallel:

static const int COUNT = 4;
static const int ITERATION = 1000000;
long arr[COUNT];

long do_work(long k){
    long x = 15;
    static const int nn = 87;
    for (long i = 1; i < nn; ++i)
        x = x / i + k % i;
    return x;
}

int cilk_main(){
    for (int j = 0; j < ITERATION; j++)
        cilk_for (int i = 0; i < COUNT; i++)
            arr[i] += do_work( j * i + i + j);
    return 0;
}
1) Parallelism Profile
   Work :                                  6,480,801,250 ins
   Span :                                  2,116,801,250 ins
   Burdened span :                         31,920,801,250 ins
   Parallelism :                           3.06
   Burdened parallelism :                  0.20
   Number of spawns/syncs :                3,000,000
   Average instructions / strand :         720
   Strands along span :                    4,000,001
   Average instructions / strand on span : 529

2) Speedup Estimate
   2 processors:  0.21 - 2.00
   4 processors:  0.15 - 3.06
   8 processors:  0.13 - 3.06
   16 processors: 0.13 - 3.06
   32 processors: 0.12 - 3.06
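Reading the profile (an interpretation added here, not on the slide): each cilk_for strand is tiny, about 720 instructions, while every continuation edge on the critical path carries an estimated scheduling burden of roughly 15000 instructions, so the burdened span dwarfs the true span and the burdened parallelism collapses to 0.20.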
A simple fix

Inverting the two for loops:

int cilk_main()
{
    cilk_for (int i = 0; i < COUNT; i++)
        for (int j = 0; j < ITERATION; j++)
            arr[i] += do_work( j * i + i + j);
    return 0;
}

1) Parallelism Profile
   Work :                                  5,295,801,529 ins
   Span :                                  1,326,801,107 ins
   Burdened span :                         1,326,830,911 ins
   Parallelism :                           3.99
   Burdened parallelism :                  3.99
   Number of spawns/syncs :                3
   Average instructions / strand :         529,580,152
   Strands along span :                    5
   Average instructions / strand on span : 265,360,221

2) Speedup Estimate
   2 processors:  1.40 - 2.00
   4 processors:  1.76 - 3.99
   8 processors:  2.01 - 3.99
   16 processors: 2.17 - 3.99
   32 processors: 2.25 - 3.99
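Why this helps (my reading of the numbers, not stated on the slide): the cilk_for is now executed only once, over COUNT long-running iterations, so the number of spawns/syncs drops from 3,000,000 to 3, the burdened span is essentially the span, and the burdened parallelism rises to 3.99.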
Timing

           #cores = 1    #cores = 2             #cores = 4
 version   timing(s)     timing(s)   speedup    timing(s)   speedup
 original    7.719         9.611      0.803      10.758      0.718
 improved    7.471         3.724      2.006       1.888      3.957

Announcements
Acknowledgements

Charles E. Leiserson (MIT) for providing me with the sources of his lecture notes.
Matteo Frigo (Intel) for supporting the work of my team with Cilk++.
Yuzhen Xie (UWO) for helping me with the images used in these slides.
Liyun Li (UWO) for generating the experimental data.

References

Robert D. Blumofe, Christopher F. Joerg, Bradley C. Kuszmaul, Charles E. Leiserson, Keith H. Randall, and Yuli Zhou. Cilk: An Efficient Multithreaded Runtime System. Journal of Parallel and Distributed Computing, 55-69, August 1996.

Robert D. Blumofe and Charles E. Leiserson. Scheduling Multithreaded Computations by Work Stealing. Journal of the ACM, Vol. 46, No. 5, pp. 720-748, September 1999.

Matteo Frigo, Charles E. Leiserson, and Keith H. Randall. The Implementation of the Cilk-5 Multithreaded Language. Proceedings of the ACM SIGPLAN ’98 Conference on Programming Language Design and Implementation, pages 212-223, June 1998.