0% found this document useful (0 votes)
37 views

Parallel Algorithm Design

This document discusses a method for parallelizing all-pairs computation across distributed-memory machines. It assumes an odd number of processes initially, but notes that an additional step is needed if the number is even. The method divides the pairs evenly among processes, then has processes exchange border pairs in successive steps to compute all results. Remaining issues include handling an even number of processes, mapping results to the correct matrix positions, and partitioning the input if the problem size is not evenly divisible by the number of processes.

Uploaded by

Alkesh Raj
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
37 views

Parallel Algorithm Design

This document discusses a method for parallelizing all-pairs computation across distributed-memory machines. It assumes an odd number of processes initially, but notes that an additional step is needed if the number is even. The method divides the pairs evenly among processes, then has processes exchange border pairs in successive steps to compute all results. Remaining issues include handling an even number of processes, mapping results to the correct matrix positions, and partitioning the input if the problem size is not evenly divisible by the number of processes.

Uploaded by

Alkesh Raj
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 8

There are different ways for all pairwise computation in parallel on distributed-

memory machines. Here we discuss one method.

In the following discussion we assume that there are odd number of processes.
When the number of processes is even, one additional step is needed, which
will be a homework for you to figure it out
Assume the problem size n is odd
In the example on the right, n=9 (0 0) (0 1) (0 2) (0 3) (0 4) (0 5) (0 6) (0 7) (0 8)
and there are 9(9+1)/2=45 pairs.
(1 1) (1 2) (1 3) (1 4) (1 5) (1 6) (1 7) (1 8)

(2 2) (2 3) (2 4) (2 5) (2 6) (2 7) (2 8)
First consider using n processes
For load balancing, divide the (3 3) (3 4) (3 5) (3 6) (3 7) (3 8)
pairs into n groups, each holding
(n+1)/2 pairs (4 4) (4 5) (4 6) (4 7) (4 8)

How? (5 5) (5 6) (5 7) (5 8) (0 5)

Note: need also to minimize (6 6) (6 7) (6 8) (0 6) (1 6)


communication cost
Communication pattern need to (7 7) (7 8) (0 7) (1 7) (2 7)
be regular and local
(8 8) (0 8) (1 8) (2 8) (3 8)
n n+1
Totoal pairs 𝑃𝑃0 (0 0) (0 1) (0 2) (0 3) (0 4)
2
Assume n is odd
(n=9 in the example) 𝑃𝑃1 (1 1) (1 2) (1 3) (1 4) (1 5)

𝑃𝑃2 (2 2) (2 3) (2 4) (2 5) (2 6)
How is the communication
between processors? 𝑃𝑃3 (3 3) (3 4) (3 5) (3 6) (3 7)

First exchange the number 𝑃𝑃4 (4 4) (4 5) (4 6) (4 7) (4 8)


order in red pairs
𝑃𝑃5 (5 5) (5 6) (5 7) (5 8) (0 5)

𝑃𝑃6 (6 6) (6 7) (6 8) (0 6) (1 6)

𝑃𝑃7 (7 7) (7 8) (0 7) (1 7) (2 7)

𝑃𝑃8 (8 8) (0 8) (1 8) (2 8) (3 8)
step0 step1 step2 step3 step 4
n n+1
Totoal pairs 𝑃𝑃0 (0 0) (0 1) (0 2) (0 3) (0 4)
2
Assume n is odd
(n=9 in the example) 𝑃𝑃1 (1 1) (1 2) (1 3) (1 4) (1 5) Remaining problems:
1. How about using the
𝑃𝑃2 (2 2) (2 3) (2 4) (2 5) (2 6) number of processes which
The algorithm using n processes: is less than n? – using block
Initially, each process gets one 𝑃𝑃3 (3 3) (3 4) (3 5) (3 6) (3 7) partitioning technique
vector and make a copy of the 2. How to make the final
stored vector 𝑃𝑃4 (4 4) (4 5) (4 6) (4 7) (4 8) results in the correct order,
Step 0<=k<(n+1)/2: for each i.e., the results must be
process i, 𝑃𝑃5 (5 5) (5 6) (5 7) (5 8) (5 0) placed in the correct order
do an inner product in a 2D output matrix – need
computation, 𝑃𝑃6 (6 6) (6 7) (6 8) (6 0) (6 1) index mapping
then send the copy to process i-1
(wrap around) and receive a 𝑃𝑃7 (7 7) (7 8) (7 0) (7 1) (7 2)
copy from process i+1 (wrap
around) 𝑃𝑃8 (8 8) (8 0) (8 1) (8 2) (8 3)
Blocking:
𝑨𝑨𝑻𝑻 : (𝟒𝟒 × 𝟗𝟗)
𝑪𝑪 = 𝑨𝑨𝑨𝑨𝑻𝑻 𝑎𝑎00 𝑎𝑎10 𝑎𝑎20 𝑎𝑎30 𝑎𝑎40 𝑎𝑎50 𝑎𝑎60 𝑎𝑎70 𝑎𝑎80
𝑎𝑎01 𝑎𝑎11 𝑎𝑎21 𝑎𝑎31 𝑎𝑎41 𝑎𝑎51 𝑎𝑎61 𝑎𝑎71 𝑎𝑎81
𝑻𝑻 𝑻𝑻 𝑻𝑻
𝑐𝑐𝑖𝑖𝑖𝑖 = � 𝑎𝑎𝑖𝑖𝑖𝑖 𝑎𝑎𝑗𝑗𝑗𝑗 𝑎𝑎02 𝑨𝑨𝟎𝟎 𝑎𝑎22 𝑎𝑎32 𝑨𝑨
𝑎𝑎12 𝟏𝟏 𝑎𝑎52 𝑎𝑎62 𝑨𝑨
𝑎𝑎42 𝑎𝑎72
𝟐𝟐 𝑎𝑎82
𝑎𝑎03 𝑎𝑎13 𝑎𝑎23 𝑎𝑎33 𝑎𝑎43 𝑎𝑎53 𝑎𝑎63 𝑎𝑎73 𝑎𝑎83
𝑪𝑪 is symmetric, i.e., 𝑐𝑐𝑖𝑖𝑖𝑖 = 𝑐𝑐𝑗𝑗𝑖𝑖 .
Thus consider only to calculate 𝑎𝑎00 𝑎𝑎01 𝑎𝑎02 𝑎𝑎03 𝑐𝑐00 𝑐𝑐01 𝑐𝑐02 𝑐𝑐03 𝑐𝑐04 𝑐𝑐05 𝑐𝑐06 𝑐𝑐07 𝑐𝑐08
The upper triangular elements 𝑪𝑪𝑐𝑐𝟎𝟎𝟎𝟎
𝑎𝑎10 𝑎𝑎11𝑨𝑨𝑎𝑎𝟎𝟎12 𝑎𝑎13 11 𝑐𝑐12 𝑐𝑐13𝑪𝑪𝑐𝑐𝟎𝟎𝟎𝟎
14 𝑐𝑐15 𝑐𝑐16𝑪𝑪𝑐𝑐𝟎𝟎𝟎𝟎
17 𝑐𝑐18
𝑎𝑎20 𝑎𝑎21 𝑎𝑎22 𝑎𝑎23 𝑐𝑐22 𝑐𝑐23 𝑐𝑐24 𝑐𝑐25 𝑐𝑐26 𝑐𝑐27 𝑐𝑐28
𝑎𝑎30 𝑎𝑎31 𝑎𝑎32 𝑎𝑎33 𝑐𝑐33 𝑐𝑐34 𝑐𝑐35 𝑐𝑐36 𝑐𝑐37 𝑐𝑐38
𝑨𝑨: (𝟗𝟗 × 𝟒𝟒) 𝑎𝑎 𝑎𝑎 𝑎𝑎 𝑎𝑎 𝑪𝑪𝑐𝑐𝟏𝟏𝟏𝟏
40 41 𝑨𝑨𝟏𝟏42 43 44 𝑐𝑐45 𝑐𝑐46 𝑪𝑪𝑐𝑐𝟏𝟏𝟏𝟏
47 𝑐𝑐48
𝑎𝑎50 𝑎𝑎51 𝑎𝑎52 𝑎𝑎53 𝑐𝑐55 𝑐𝑐56 𝑐𝑐57 𝑐𝑐58
𝑎𝑎60 𝑎𝑎61 𝑎𝑎62 𝑎𝑎63 𝑐𝑐66 𝑐𝑐67 𝑐𝑐68
𝑎𝑎70 𝑎𝑎71𝑨𝑨𝑎𝑎𝟐𝟐72 𝑎𝑎73 𝑪𝑪𝑐𝑐𝟐𝟐𝟐𝟐
77 𝑐𝑐78
𝑎𝑎80 𝑎𝑎81 𝑎𝑎82 𝑎𝑎83 𝑐𝑐88
𝑪𝑪: (𝟗𝟗 × 𝟗𝟗)
Blocking:

𝑪𝑪 = 𝑨𝑨𝑨𝑨𝑻𝑻

𝑨𝑨𝑻𝑻𝟎𝟎 𝑨𝑨𝑻𝑻𝟏𝟏 𝑨𝑨𝑻𝑻𝟐𝟐


𝑪𝑪𝒑𝒑𝒑𝒑 = 𝑨𝑨𝒑𝒑 𝑨𝑨𝑻𝑻𝒒𝒒

Note: the block matrix has 𝑨𝑨𝟎𝟎 𝑪𝑪𝟎𝟎𝟎𝟎 𝑪𝑪𝟎𝟎𝟎𝟎 𝑪𝑪𝟎𝟎𝟎𝟎


exactly the same form as the
non-block one
𝑨𝑨𝟏𝟏 𝑪𝑪𝟏𝟏𝟏𝟏 𝑪𝑪𝟏𝟏𝟏𝟏

𝑨𝑨𝟐𝟐 𝑪𝑪𝟐𝟐𝟐𝟐
step0 step1 step2 step3 step 4
Divide n rows of the input
matrix into 𝑛𝑛𝑏𝑏 block rows 𝑃𝑃0 (0 0) (0 1) (0 2) (0 3) (0 4)
𝑛𝑛𝑏𝑏 𝑛𝑛𝑏𝑏 + 1
Totoal pairs 𝑃𝑃1 (1 1) (1 2) (1 3) (1 4) (1 5) Remaining problems:
2
Assume 𝑛𝑛𝑏𝑏 is odd 1. How to make the final
(𝑛𝑛𝑏𝑏 = 9 in the example) 𝑃𝑃2 (2 2) (2 3) (2 4) (2 5) (2 6) results in the correct order
in a 2D matrix – need index
The algorithm using 𝑛𝑛𝑏𝑏 𝑃𝑃3 (3 3) (3 4) (3 5) (3 6) (3 7) mapping
processes:
Initially, each process gets one 𝑃𝑃4 (4 4) (4 5) (4 6) (4 7) (4 8)
block row and make a copy of the
stored block 𝑃𝑃5 (5 5) (5 6) (5 7) (5 8) (5 0)
Step 0<=k<(𝑛𝑛𝑏𝑏 +1)/2: for each
process i, 𝑃𝑃6 (6 6) (6 7) (6 8) (6 0) (6 1)
do a matrix multiplication, (Note
the computation for (i,i) is 𝑃𝑃7 (7 7) (7 8) (7 0) (7 1) (7 2)
different!)
then send the copy to process i-1 𝑃𝑃8 (8 8) (8 0) (8 1) (8 2) (8 3)
(wrap around) and receive a copy
from process i+1 (wrap around)
How about the number of processes p is even?

Your program must be able to handle both cases for p is either even or odd

You must NOT assume that the problem size n is divisible by the number of
processes p

The final results must be placed in a correct order in a 2D matrix

You might also like