Multithreaded Parallelism and Performance Measures

Marc Moreno Maza
University of Western Ontario, London, Ontario (Canada)
CS 4435 - CS 9624, 2015

Plan

1 Parallelism Complexity Measures
2 cilk_for Loops
3 Scheduling Theory and Implementation
4 Measuring Parallelism in Practice
5 Announcements

Parallelism Complexity Measures

The fork-join parallelism model

int fib(int n)
{
    if (n < 2) return n;
    else {
        int x, y;
        x = cilk_spawn fib(n-1);  // may run in parallel with the continuation
        y = fib(n-2);
        cilk_sync;                // wait for the spawned call to return
        return x + y;
    }
}

Example: fib(4). (The slide unfolds the dag of fib(4), whose strands are labeled with the argument values 4, 3, 2, 1, 0.) The model is "processor oblivious": the computation dag unfolds dynamically. We shall also call this model multithreaded parallelism.

Terminology

A strand is a maximal sequence of instructions that ends with a spawn, sync, or return (either explicit or implicit) statement. The dag has an initial strand and a final strand, and its edges are continue, return, spawn, and call edges. At runtime, the spawn relation causes procedure instances to be structured as a rooted tree, called the spawn tree or parallel instruction stream, where dependencies among strands form a dag.

Work and span

We define several performance measures, assuming an ideal situation: no cache issues, no interprocessor costs.
- Tp is the minimum running time on p processors.
- T1 is called the work, that is, the sum of the numbers of instructions at the nodes.
- T∞ is the minimum running time with infinitely many processors, called the span.

The critical path length

Assuming all strands run in unit time, the longest path in the dag is equal to T∞. For this reason, T∞ is also referred to as the critical path length.

Work law

We have Tp ≥ T1/p. Indeed, in the best case, p processors can do p units of work per unit of time.

Span law

We have Tp ≥ T∞. Indeed, Tp < T∞ would contradict the definitions of Tp and T∞.

Speedup on p processors

T1/Tp is called the speedup on p processors. A parallel program execution can have:
- linear speedup: T1/Tp = Θ(p)
- superlinear speedup: T1/Tp = ω(p) (not possible in this model, though it is possible in others)
- sublinear speedup: T1/Tp = o(p)
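To make these measures concrete, here is a minimal sketch (our own illustration, not part of the lecture; all names are hypothetical) that computes the work and span of a unit-cost dag whose nodes are numbered in topological order:

#include <algorithm>
#include <vector>

typedef std::vector<std::vector<int> > Dag;  // pred[v] lists the predecessors of node v

// Work T1: with unit-cost strands, simply the number of nodes.
long long work(const Dag& pred) {
    return (long long)pred.size();
}

// Span T∞: length of the longest path, computable in one pass because
// nodes are assumed numbered in topological order (u < v for each edge u -> v).
long long span(const Dag& pred) {
    std::vector<long long> dist(pred.size(), 1);  // each strand costs 1
    long long best = 0;
    for (std::size_t v = 0; v < pred.size(); ++v) {
        for (std::size_t k = 0; k < pred[v].size(); ++k)
            dist[v] = std::max(dist[v], dist[pred[v][k]] + 1);
        best = std::max(best, dist[v]);
    }
    return best;
}

On the 17-strand dag of fib(4) discussed next, work(...) returns 17 and span(...) returns 8, so the parallelism T1/T∞ is 2.125.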
Parallelism

Because the span law dictates Tp ≥ T∞, the maximum possible speedup, given T1 and T∞, is T1/T∞, which we call the parallelism: the average amount of work per step along the span.

The Fibonacci example (1/2)

For fib(4), we have T1 = 17 and T∞ = 8, and thus T1/T∞ = 2.125. (The slide highlights the eight strands of a critical path, labeled 1 through 8.) What about T1(fib(n)) and T∞(fib(n)) in general?

The Fibonacci example (2/2)

We have T1(n) = T1(n−1) + T1(n−2) + Θ(1). Let us solve this recurrence. One verifies by induction that T1(n) ≤ a·Fn − b, for b > 0 large enough to dominate the Θ(1) term and a > 1 large enough to satisfy the initial conditions, whatever they are. On the other hand, we also have Fn ≤ T1(n). Therefore T1(n) = Θ(Fn) = Θ(ψ^n), where ψ = (1 + √5)/2 is the golden ratio.

We have T∞(n) = max(T∞(n−1), T∞(n−2)) + Θ(1). One easily checks that T∞(n−1) ≥ T∞(n−2), which implies T∞(n) = T∞(n−1) + Θ(1). Therefore T∞(n) = Θ(n).

Consequently, the parallelism is Θ(ψ^n / n).

Series composition

When computations A and B are composed in series (B starts after A finishes):
Work: T1(A ∪ B) = T1(A) + T1(B)
Span: T∞(A ∪ B) = T∞(A) + T∞(B)

Parallel composition

When computations A and B are composed in parallel (they may run simultaneously):
Work: T1(A ∪ B) = T1(A) + T1(B)
Span: T∞(A ∪ B) = max(T∞(A), T∞(B))
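In Cilk++, these two compositions correspond directly to the absence or presence of a spawn. A minimal sketch (A and B stand for arbitrary functions):

void series_composition()
{
    A();  // B starts only after A completes:
    B();  // both work and span add up
}

void parallel_composition()
{
    cilk_spawn A();  // A may run in parallel with the continuation
    B();
    cilk_sync;       // join: the span is max(T∞(A), T∞(B))
}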
Some results in the fork-join parallelism model

Algorithm                 Work         Span
Merge sort                Θ(n lg n)    Θ(lg³ n)
Matrix multiplication     Θ(n³)        Θ(lg n)
Strassen                  Θ(n^(lg 7))  Θ(lg² n)
LU-decomposition          Θ(n³)        Θ(n lg n)
Tableau construction      Θ(n²)        Ω(n^(lg 3))
FFT                       Θ(n lg n)    Θ(lg² n)
Breadth-first search      Θ(E)         Θ(d lg V)

We shall prove these results in the next lectures.

cilk_for Loops

For loop parallelism in Cilk++

Transposing a square matrix A = (aij) in place (the slide shows A next to its transpose Aᵀ):

cilk_for (int i = 1; i < n; ++i) {
    for (int j = 0; j < i; ++j) {
        double temp = A[i][j];
        A[i][j] = A[j][i];
        A[j][i] = temp;
    }
}

The iterations of a cilk_for loop execute in parallel.

Implementation of for loops in Cilk++

Up to details (next week!), the previous loop is compiled as follows, using a divide-and-conquer implementation:

void recur(int lo, int hi)  // handles iterations i = lo, ..., hi
{
    if (hi > lo) {  // coarsen: more than one iteration left
        int mid = lo + (hi - lo) / 2;
        cilk_spawn recur(lo, mid);  // first half, possibly in parallel
        recur(mid + 1, hi);         // second half
        cilk_sync;
    } else {        // base case: one iteration of the original loop body
        int i = lo;
        for (int j = 0; j < i; ++j) {
            double temp = A[i][j];
            A[i][j] = A[j][i];
            A[j][i] = temp;
        }
    }
}

Analysis of parallel for loops

Here we do not assume that each strand runs in unit time.
- Span of loop control: Θ(lg n)
- Maximum span of an iteration: Θ(n)
- Span: Θ(n)
- Work: Θ(n²)
- Parallelism: Θ(n)

Parallelizing the inner loop

cilk_for (int i = 1; i < n; ++i) {
    cilk_for (int j = 0; j < i; ++j) {
        double temp = A[i][j];
        A[i][j] = A[j][i];
        A[j][i] = temp;
    }
}

- Span of outer loop control: Θ(lg n)
- Maximum span of an inner loop control: Θ(lg n)
- Span of an iteration: Θ(1)
- Span: Θ(lg n)
- Work: Θ(n²)
- Parallelism: Θ(n²/lg n)

But! More on this next week...

Scheduling Theory and Implementation

Scheduling

(The slide pictures p processors, each with its own cache, connected through a network to shared memory and I/O.) A scheduler's job is to map a computation to particular processors; such a mapping is called a schedule. If the decisions are made at runtime, the scheduler is online; otherwise, it is offline. Cilk++'s scheduler maps strands onto processors dynamically at runtime.

Greedy scheduling (1/2)

A strand is ready if all its predecessors have executed. A scheduler is greedy if it attempts to do as much work as possible at every step.

Greedy scheduling (2/2)

In any greedy schedule, there are two types of steps (illustrated on the slides with p = 3):
- complete step: at least p strands are ready to run; the greedy scheduler selects any p of them and runs them.
- incomplete step: strictly fewer than p strands are ready to run; the greedy scheduler runs them all.

Theorem of Graham and Brent

For any greedy schedule, we have Tp ≤ T1/p + T∞.

Proof. The number of complete steps is at most T1/p, by definition of T1. The number of incomplete steps is at most T∞: indeed, let G′ be the subgraph of the computation dag that remains to be executed immediately prior to an incomplete step; (i) during this incomplete step, all strands that can be run are actually run, and (ii) hence executing this incomplete step reduces the span of G′ by one.
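As a quick sanity check (our own worked example, not from the slides), the Graham-Brent bound can be combined with the work and span laws to bracket Tp:

#include <algorithm>
#include <cstdio>

// Bracket the running time of a greedy schedule:
// lower bound from the work and span laws, upper bound from Graham-Brent.
void greedy_bounds(double T1, double Tinf, int p)
{
    double lower = std::max(T1 / p, Tinf);  // Tp >= max(T1/p, T∞)
    double upper = T1 / p + Tinf;           // Tp <= T1/p + T∞
    std::printf("p = %d: %.3f <= Tp <= %.3f\n", p, lower, upper);
}

int main()
{
    greedy_bounds(17.0, 8.0, 2);  // fib(4): prints 8.500 <= Tp <= 16.500
    return 0;
}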
Corollary 1

A greedy scheduler is always within a factor of 2 of optimal.

Proof. From the work and span laws, we have:

    Tp ≥ max(T1/p, T∞)    (1)

In addition, we can trivially express:

    T1/p ≤ max(T1/p, T∞)    (2)
    T∞ ≤ max(T1/p, T∞)    (3)

From the Graham-Brent theorem, we deduce:

    Tp ≤ T1/p + T∞                        (4)
       ≤ max(T1/p, T∞) + max(T1/p, T∞)    (5)
       ≤ 2 max(T1/p, T∞)                  (6)

which concludes the proof.

Corollary 2

The greedy scheduler achieves linear speedup whenever T∞ = O(T1/p).

Proof. From the Graham-Brent theorem, we deduce:

    Tp ≤ T1/p + T∞         (7)
       = T1/p + O(T1/p)    (8)
       = Θ(T1/p)           (9)

The idea is to operate in the range where T1/p dominates T∞. As long as T1/p dominates T∞, all processors can be used efficiently. The quantity T1/(p·T∞) is called the parallel slackness.

The work-stealing scheduler (1/11 to 11/11)

These eleven slides animate the work-stealing scheduler on four processors. Each worker keeps a deque of frames and treats it like a call stack: it pushes spawned and called frames onto the bottom (Spawn!, Call!) and pops finished frames from the bottom (Return!). A worker whose deque becomes empty turns thief: it picks a victim and steals the oldest frame from the top of the victim's deque (Steal!), then resumes execution there.
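To convey the deque discipline in code, here is a schematic, sequential sketch (entirely our own illustration; the real Cilk++ runtime is concurrent, lock-free, and far more subtle):

#include <cstdlib>
#include <deque>
#include <vector>

// Each worker owns a deque of frames: the owner works on the bottom,
// thieves steal from the top.
struct Worker {
    std::deque<int> deq;  // frame ids; a real runtime stores spawn frames
};

// The owner pushes a newly spawned frame onto the bottom of its deque.
void push_bottom(Worker& w, int frame) { w.deq.push_back(frame); }

// The owner pops from the bottom (LIFO order, like a serial call stack).
bool pop_bottom(Worker& w, int& frame)
{
    if (w.deq.empty()) return false;
    frame = w.deq.back();
    w.deq.pop_back();
    return true;
}

// An idle worker steals the oldest frame from the top of a random victim.
bool steal(std::vector<Worker>& workers, std::size_t thief, int& frame)
{
    std::size_t victim = std::rand() % workers.size();
    if (victim == thief || workers[victim].deq.empty()) return false;
    frame = workers[victim].deq.front();
    workers[victim].deq.pop_front();
    return true;
}

Stealing from the top hands the thief the oldest, and typically largest, piece of pending work, which is what keeps the number of steals, and hence the scheduling overhead, low.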
Performance of the work-stealing scheduler

Assume that each strand executes in unit time, that for almost all "parallel steps" there are at least p strands to run, and that each processor is either working or stealing. Then the randomized work-stealing scheduler is expected to run in

    Tp = T1/p + O(T∞)    (10)

Proof sketch. During a steal-free parallel step (a step at which every processor has work on its deque), each of the p processors consumes one unit of work; thus there are at most T1/p steal-free parallel steps. During a parallel step with steals, each thief may reduce the remaining span by 1 with probability 1/p; thus the expected number of steals is O(p·T∞). Therefore, the expected running time is

    Tp = (T1 + O(p·T∞))/p = T1/p + O(T∞).

Overheads and burden

Let T1, T∞, and Tp be given. We want to refine the randomized work-stealing complexity result, since T1/p + T∞ under-estimates Tp in practice. Many factors (the simplifying assumptions of the fork-join parallelism model, architecture limitations, the cost of executing the parallel constructs, scheduling overheads) make Tp larger in practice. One may want to estimate the impact of those factors:
1 by improving the estimate in the randomized work-stealing complexity result,
2 by comparing a Cilk++ program with its C++ elision,
3 by estimating the costs of spawning and synchronizing.

Span overhead

The span overhead is the smallest constant c∞ such that

    Tp ≤ T1/p + c∞·T∞.

Recall that T1/T∞ is the maximum possible speedup that the application can obtain. We call the following property the parallel slackness assumption:

    T1/T∞ ≫ c∞·p    (11)

that is, c∞·p is much smaller than the average parallelism. Under this assumption, T1/p ≫ c∞·T∞ holds, so c∞ has little effect on performance when sufficient slackness exists.

Work overhead

Let Ts be the running time of the C++ elision of a Cilk++ program. We denote by c1 = T1/Ts the work overhead. Recall the expected running time Tp ≤ T1/p + c∞·T∞. With the parallel slackness assumption, we get

    Tp ≤ c1·Ts/p + c∞·T∞ ≃ c1·Ts/p.    (12)

We can now state the work-first principle precisely: minimize c1, even at the expense of a larger c∞. This is a key design principle, since it is conceptually easier to minimize c1 than to minimize c∞.

The cactus stack

A cactus stack is used to implement C's rule for sharing function-local variables: a stack frame can only see data stored in the current frame and in the frames of its ancestors. (The slide shows the views of the stack seen by five procedures A to E spawned from one another.)
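A small illustration of the visibility rule that the cactus stack preserves (our own example; parent, child1, and child2 are hypothetical):

#include <cstdio>

void child1(int* px) { *px += 1; }                 // writes the parent's x
void child2(int* py) { std::printf("%d\n", *py); } // reads the parent's y

void parent()
{
    int x = 42, y = 17;    // live in parent's stack frame
    cilk_spawn child1(&x); // child1's view of the stack includes parent's frame
    child2(&y);            // runs in parallel with child1; touches y only, so no race
    cilk_sync;             // both children saw views rooted at parent's frame
}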
In practice, Cilk++ estimates Tp as Tp = T1/p + 1.7·burden_span, where the burden_span charges 15000 instructions for each continuation edge along the critical path.

Space bounds

The space Sp of a parallel execution on p processors required by Cilk++'s work-stealing scheduler satisfies

    Sp ≤ p·S1    (13)

where S1 is the minimal serial space requirement. (The slide illustrates the bound with p = 3 workers.)

Measuring Parallelism in Practice

Cilkview

Cilkview computes work and span to derive upper bounds on parallel performance, and it estimates the scheduling overhead to compute a burdened span, which yields lower bounds. (Its speedup plot shows the work law T1/p, i.e. linear speedup, the span law, the burdened-parallelism estimate of scheduling overheads, and the measured speedup.)

The Fibonacci Cilk++ example

Code fragment:

long fib(int n)
{
    if (n < 2) return n;
    long x, y;
    x = cilk_spawn fib(n-1);
    y = fib(n-2);
    cilk_sync;
    return x + y;
}

Fibonacci program timing

The environment for benchmarking:
- model name: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz
- L2 cache size: 4096 KB
- memory size: 3 GB

n     #cores = 1    #cores = 2              #cores = 4
      timing(s)     timing(s)   speedup     timing(s)   speedup
30    0.086         0.046       1.870       0.025       3.440
35    0.776         0.436       1.780       0.206       3.767
40    8.931         4.842       1.844       2.399       3.723
45    105.263       54.017      1.949       27.200      3.870
50    1165.000      665.115     1.752       340.638     3.420

Quicksort

Code in cilk/examples/qsort (a minimal driver sketch follows the timing table below):

// requires <algorithm> (std::partition, std::swap) and <functional> (std::bind2nd, std::less)
void sample_qsort(int* begin, int* end)
{
    if (begin != end) {
        --end;  // exclude the last element (the pivot) from the partition
        int* middle = std::partition(begin, end,
                                     std::bind2nd(std::less<int>(), *end));
        using std::swap;
        swap(*end, *middle);  // move the pivot between the two halves
        cilk_spawn sample_qsort(begin, middle);
        sample_qsort(++middle, ++end);
        cilk_sync;
    }
}

Quicksort timing

Timing for sorting an array of integers:

# of int      #cores = 1    #cores = 2              #cores = 4
              timing(s)     timing(s)   speedup     timing(s)   speedup
10 × 10^6     1.958         1.016       1.927       0.541       3.619
50 × 10^6     10.518        5.469       1.923       2.847       3.694
100 × 10^6    21.481        11.096      1.936       5.954       3.608
500 × 10^6    114.300       57.996      1.971       31.086      3.677
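A minimal driver for the sample_qsort fragment above (our own sketch; the one shipped in cilk/examples/qsort differs):

#include <cstdlib>
#include <vector>

// Assumes sample_qsort from the previous slide is in scope.
int cilk_main()  // Cilk++ programs start at cilk_main instead of main
{
    std::vector<int> a(10 * 1000 * 1000);  // 10 × 10^6 integers, as in the first timing row
    for (std::size_t i = 0; i < a.size(); ++i)
        a[i] = std::rand();
    sample_qsort(&a[0], &a[0] + a.size()); // sort the whole array in parallel
    return 0;                              // a is now sorted
}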
Matrix multiplication

Code in cilk/examples/matrix. Timing for multiplying a 687 × 837 matrix by an 837 × 1107 matrix:

threshold    iterative                        recursive
             st(s)     pt(s)     su           st(s)     pt(s)     su
10           1.273     1.765     0.721        1.674     0.399     4.195
16           1.270     1.787     0.711        1.408     0.349     4.034
32           1.280     1.757     0.729        1.223     0.308     3.971
48           1.258     1.760     0.715        1.164     0.293     3.973
64           1.258     1.798     0.700        1.159     0.291     3.983
80           1.252     1.773     0.706        1.267     0.320     3.959

st = sequential time; pt = parallel time with 4 cores; su = speedup

The cilkview example from the documentation

Using cilk_for to perform operations over an array in parallel:

static const int COUNT = 4;
static const int ITERATION = 1000000;
long arr[COUNT];

long do_work(long k)
{
    long x = 15;
    static const int nn = 87;
    for (long i = 1; i < nn; ++i)
        x = x / i + k % i;
    return x;
}

int cilk_main()
{
    for (int j = 0; j < ITERATION; j++)
        cilk_for (int i = 0; i < COUNT; i++)
            arr[i] += do_work(j * i + i + j);
    return 0;
}

Cilkview reports:

1) Parallelism Profile
   Work                                  : 6,480,801,250 ins
   Span                                  : 2,116,801,250 ins
   Burdened span                         : 31,920,801,250 ins
   Parallelism                           : 3.06
   Burdened parallelism                  : 0.20
   Number of spawns/syncs                : 3,000,000
   Average instructions / strand         : 720
   Strands along span                    : 4,000,001
   Average instructions / strand on span : 529

2) Speedup Estimate
   2 processors:  0.21 - 2.00
   4 processors:  0.15 - 3.06
   8 processors:  0.13 - 3.06
   16 processors: 0.13 - 3.06
   32 processors: 0.12 - 3.06

A simple fix

Inverting the two for loops turns 3,000,000 tiny spawns/syncs into just 3, so each strand now carries substantial work:

int cilk_main()
{
    cilk_for (int i = 0; i < COUNT; i++)
        for (int j = 0; j < ITERATION; j++)
            arr[i] += do_work(j * i + i + j);
    return 0;
}

Cilkview now reports:

1) Parallelism Profile
   Work                                  : 5,295,801,529 ins
   Span                                  : 1,326,801,107 ins
   Burdened span                         : 1,326,830,911 ins
   Parallelism                           : 3.99
   Burdened parallelism                  : 3.99
   Number of spawns/syncs                : 3
   Average instructions / strand         : 529,580,152
   Strands along span                    : 5
   Average instructions / strand on span : 265,360,221

2) Speedup Estimate
   2 processors:  1.40 - 2.00
   4 processors:  1.76 - 3.99
   8 processors:  2.01 - 3.99
   16 processors: 2.17 - 3.99
   32 processors: 2.25 - 3.99

Timing

version     #cores = 1    #cores = 2              #cores = 4
            timing(s)     timing(s)   speedup     timing(s)   speedup
original    7.719         9.611       0.803       10.758      0.718
improved    7.471         3.724       2.006       1.888       3.957

Announcements

Acknowledgements

- Charles E. Leiserson (MIT), for providing me with the sources of his lecture notes.
- Matteo Frigo (Intel), for supporting the work of my team with Cilk++.
- Yuzhen Xie (UWO), for helping me with the images used in these slides.
- Liyun Li (UWO), for generating the experimental data.
References

- Robert D. Blumofe, Christopher F. Joerg, Bradley C. Kuszmaul, Charles E. Leiserson, Keith H. Randall, and Yuli Zhou. Cilk: An Efficient Multithreaded Runtime System. Journal of Parallel and Distributed Computing, 37(1):55-69, August 1996.
- Robert D. Blumofe and Charles E. Leiserson. Scheduling Multithreaded Computations by Work Stealing. Journal of the ACM, 46(5):720-748, September 1999.
- Matteo Frigo, Charles E. Leiserson, and Keith H. Randall. The Implementation of the Cilk-5 Multithreaded Language. Proceedings of the ACM SIGPLAN '98 Conference on Programming Language Design and Implementation, pages 212-223, June 1998.