
Revision

Lecture 12
February 14, 2024
Broadcast Algorithms in MPICH
• Short messages
• < MPIR_CVAR_BCAST_SHORT_MSG_SIZE
• Binomial
• Medium messages
• Scatter + Allgather (Recursive doubling)
• Large messages
• > MPIR_CVAR_BCAST_LONG_MSG_SIZE
• Scatter + Allgather (Ring)
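
A minimal usage sketch (mine, not from the slides): the algorithm selection above happens inside the library, so the calling code is identical for all message sizes; the cutoffs named above are MPICH CVARs, typically settable as environment variables of the same name.

/* Every rank calls MPI_Bcast with the same arguments; rank 0 is the root. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, data[4] = {0, 0, 0, 0};
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) { data[0] = 1; data[1] = 4; data[2] = 10; data[3] = 7; }
    MPI_Bcast(data, 4, MPI_INT, 0, MPI_COMM_WORLD);   /* all ranks now hold rank 0's data */
    printf("rank %d: %d %d %d %d\n", rank, data[0], data[1], data[2], data[3]);
    MPI_Finalize();
    return 0;
}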

2
Old vs. New MPI_Bcast

Van de Geijn

3
Reduce on 64 nodes

4
Allgather – Ring Algorithm
• Every process sends to and receives from everyone else
• Assume p processes and total n bytes
• Every process sends and receives n/p bytes
• Time
• (p – 1) * (L + n/p*(1/B))
• How can we improve?
[Figure: 4 processes (0–3), each starting with an n/p-size block, circulating blocks around the ring until every process holds 1 4 10 1 7 3 9 15]
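
A sketch of the ring exchange (my illustration, assuming each rank contributes a single int, so n/p is one element): in step s each rank forwards the block it received in the previous step to its right neighbour and receives a new block from its left neighbour; after p – 1 steps every rank holds all p blocks, which gives the (p – 1) * (L + n/p*(1/B)) cost above.

/* Ring allgather sketch: 'all' has room for p ints, 'myval' is this rank's block. */
void ring_allgather(int myval, int *all, int p, int rank, MPI_Comm comm) {
    int left  = (rank - 1 + p) % p;
    int right = (rank + 1) % p;
    all[rank] = myval;
    for (int s = 1; s < p; s++) {
        int send_idx = (rank - s + 1 + p) % p;   /* block received in the previous step (own block when s = 1) */
        int recv_idx = (rank - s + p) % p;       /* block now arriving from the left neighbour */
        MPI_Sendrecv(&all[send_idx], 1, MPI_INT, right, 0,
                     &all[recv_idx], 1, MPI_INT, left,  0,
                     comm, MPI_STATUS_IGNORE);
    }
}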

5
Non-blocking Point-to-Point

• MPI_Isend (buf, count, datatype, dest, tag, comm, request)


• MPI_Irecv (buf, count, datatype, source, tag, comm, request)

• MPI_Wait (request, status)


• MPI_Waitall (count, request, status)
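
A minimal sketch of these calls (mine), assuming exactly two ranks exchanging one int: both requests are posted up front, independent work can run in between, and MPI_Waitall completes them.

/* Post the receive and the send, then complete both requests together. */
int sendbuf = rank, recvbuf = -1;
int partner = 1 - rank;                     /* assumes exactly 2 ranks */
MPI_Request req[2];

MPI_Irecv(&recvbuf, 1, MPI_INT, partner, 0, MPI_COMM_WORLD, &req[0]);
MPI_Isend(&sendbuf, 1, MPI_INT, partner, 0, MPI_COMM_WORLD, &req[1]);
/* ... independent computation can run here ... */
MPI_Waitall(2, req, MPI_STATUSES_IGNORE);   /* buffers are safe to reuse only after this */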

6
Many-to-one Non-blocking P2P

7
Non-blocking Performance
• Standard does not require overlapping communication and
computation
• Implementation may use a thread to move data in parallel
• Implementation can delay the initiation of data transfer until “Wait”
• MPI_Test – non-blocking, tests completion, starts progress
• MPIR_CVAR_ASYNC_PROGRESS (MPICH)
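
A sketch of driving progress by hand (my example; bigbuf, n, dest, nchunks and do_local_work are placeholders): calling MPI_Test between chunks of computation gives the library a chance to move data even when no asynchronous progress thread is running.

MPI_Request req;
int done = 0;
MPI_Isend(bigbuf, n, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD, &req);
for (int chunk = 0; chunk < nchunks; chunk++) {
    do_local_work(chunk);                      /* placeholder compute routine */
    MPI_Test(&req, &done, MPI_STATUS_IGNORE);  /* non-blocking completion check; also makes progress */
}
if (!done) MPI_Wait(&req, MPI_STATUS_IGNORE);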

8
Non-blocking Point-to-Point Safety
• MPI_Isend (buf, count, datatype, dest, tag, comm, request)
• MPI_Irecv (buf, count, datatype, source, tag, comm, request)
• MPI_Wait (request, status)

    Rank 0        Rank 1
    MPI_Isend     MPI_Isend
    MPI_Recv      MPI_Recv      → Safe
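
The same pattern in code (my sketch, two ranks exchanging one int): because the send is non-blocking, neither rank blocks before posting its receive, so the exchange cannot deadlock the way two large blocking MPI_Send calls might.

int partner = 1 - rank;                      /* assumes exactly 2 ranks */
MPI_Request req;
MPI_Isend(&sendbuf, 1, MPI_INT, partner, 0, MPI_COMM_WORLD, &req);
MPI_Recv (&recvbuf, 1, MPI_INT, partner, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
MPI_Wait (&req, MPI_STATUS_IGNORE);          /* sendbuf may be reused only after the wait */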

9
Mesh Interconnect

• Diameter 2(√p – 1)
• Bisection width √p
• Cost 2(p – √p)
10
Torus Interconnect

• Diameter 2(√p/2)
• Bisection width 2√p
• Cost 2p
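
A worked check of the two topologies (my arithmetic, taking p = 64 so that √p = 8):

$\text{Mesh: } 2(\sqrt{p}-1) = 14,\quad \sqrt{p} = 8,\quad 2(p-\sqrt{p}) = 112$
$\text{Torus: } 2(\sqrt{p}/2) = 8,\quad 2\sqrt{p} = 16,\quad 2p = 128$

i.e. the torus halves the diameter and doubles the bisection width at the price of 2√p extra wraparound links.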
11
Parallelization
Parallelization Steps
1. Decomposition of computation into tasks
• Identifying portions of the work that can be performed concurrently
2. Assignment of tasks to processes
• Assigning concurrent pieces of work onto multiple processes running in parallel
3. Orchestration of data access, communication and synchronization among processes
• Distributing the data associated with the program
• Managing access to data shared by multiple processes
• Synchronizing at various stages of the parallel program execution
4. Mapping of processes to processors
• Placement of processes in the physical processor topology

13
Illustration of Parallelization Steps
Expose enough concurrency

Source: Culler et al. book


14
Performance Goals
• Expose concurrency
• Reduce inter-process communications
• Load-balance
• Reduce synchronization
• Reduce idling
• Reduce management overhead
• Preserve data locality
• Exploit network topology

15
Matrix Vector Multiplication – Decomposition
P=3?
Decomposition: identifying portions of the work that can be performed concurrently
Assignment
[Figure: the matrix rows partitioned into three row blocks, one per process P1, P2, P3]

16
Matrix Vector Multiplication – Orchestration
P=3

Decomposition
Assignment
Orchestration
• Allgather/Bcast
• Scatter
• Gather
• Initial communication
• Distribute (read by process 0) or parallel reads
• Final communication
[Figure: row blocks P1, P2, P3 as in the decomposition slide]
17
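
A minimal sketch of this orchestration (my code; assumes n is divisible by p, A and y significant on rank 0, x an n-element buffer on every rank, and <mpi.h> plus <stdlib.h> included):

/* Row-wise y = A*x: scatter row blocks of A, broadcast x, compute locally, gather y. */
void matvec_rowwise(const double *A, double *x, double *y, int n, int p) {
    int rows = n / p;
    double *Aloc = malloc((size_t)rows * n * sizeof(double));
    double *yloc = malloc((size_t)rows * sizeof(double));

    MPI_Scatter(A, rows * n, MPI_DOUBLE, Aloc, rows * n, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    MPI_Bcast(x, n, MPI_DOUBLE, 0, MPI_COMM_WORLD);   /* every rank needs the whole vector */

    for (int i = 0; i < rows; i++) {
        yloc[i] = 0.0;
        for (int j = 0; j < n; j++)
            yloc[i] += Aloc[(size_t)i * n + j] * x[j];
    }
    MPI_Gather(yloc, rows, MPI_DOUBLE, y, rows, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    free(Aloc); free(yloc);
}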
Distribute using Bcast vs. Allgather

18
Bcast vs. Allgather

19
Bcast vs. Allgather

20
Matrix Vector Multiplication – Column-wise Decomposition

Decomposition
Assignment
Orchestration
• Reduce
[Figure: the matrix columns partitioned into three column blocks P1, P2, P3]

Row-wise vs. column-wise partitioning?
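
A sketch of the column-wise orchestration (my code; each rank is assumed to already hold its cols = n/p columns as Acol, stored n x cols row-major, plus the matching slice xloc): every rank produces a full-length partial result and MPI_Reduce sums the partials into y on rank 0.

void matvec_colwise(const double *Acol, const double *xloc,
                    double *y, int n, int cols) {
    double *ypart = calloc(n, sizeof(double));         /* full-length partial result */
    for (int j = 0; j < cols; j++)
        for (int i = 0; i < n; i++)
            ypart[i] += Acol[(size_t)i * cols + j] * xloc[j];
    MPI_Reduce(ypart, y, n, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    free(ypart);
}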


21
1D Domain Decomposition

N grid points
P processes
N/P points per process
Nearest neighbor communications: 2 sends(), 2 recvs() per process
#Communications? 2
#Computations? N/P
Communication to computation ratio = 2P/N
[Figure: a 1D line of grid points partitioned among processes P1, P2, P3, P4]
22
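
A sketch of the 2 sends / 2 recvs per process (my code; u holds the local points in u[1..local_n] with ghost cells u[0] and u[local_n+1], and rank, p, local_n are assumed already set up):

int left  = (rank > 0)     ? rank - 1 : MPI_PROC_NULL;   /* boundary ranks talk to MPI_PROC_NULL (a no-op) */
int right = (rank < p - 1) ? rank + 1 : MPI_PROC_NULL;

MPI_Sendrecv(&u[1],           1, MPI_DOUBLE, left,  0,   /* send first owned point to the left */
             &u[local_n + 1], 1, MPI_DOUBLE, right, 0,   /* receive the right ghost cell       */
             MPI_COMM_WORLD, MPI_STATUS_IGNORE);
MPI_Sendrecv(&u[local_n],     1, MPI_DOUBLE, right, 1,   /* send last owned point to the right */
             &u[0],           1, MPI_DOUBLE, left,  1,   /* receive the left ghost cell        */
             MPI_COMM_WORLD, MPI_STATUS_IGNORE);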
1D Domain Decomposition
N grid points
P processes
N/P points per process
#Communications? 2√N (assuming square grid)
#Computations? N/P (assuming square grid)
Communication to computation ratio = ?
[Figure: a 2D domain of grid points split into horizontal strips, one per process]
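
Filling in the ratio asked above (my derivation): each strip process communicates 2√N points and computes on N/P points, so

$\dfrac{\text{communication}}{\text{computation}} = \dfrac{2\sqrt{N}}{N/P} = \dfrac{2P}{\sqrt{N}}$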
23
2D Domain decomposition

N grid points (√N x √N grid)
P processes (√P x √P grid)
N/P points per process

+ Several parallel communications


+ Lower communication volume/process
24
2D Domain decomposition

2 Sends(), 2 Recvs() per process

#Communications?
2√N/√P (assuming square grid)
#Computations?
N/P (assuming square grid)

Communication to computation ratio=?
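
My worked answer to the question above: two boundaries of √N/√P points each against N/P local points gives

$\dfrac{\text{communication}}{\text{computation}} = \dfrac{2\sqrt{N}/\sqrt{P}}{N/P} = \dfrac{2\sqrt{P}}{\sqrt{N}}$

which is a factor of √P smaller than the 2P/√N ratio of the 1D (strip) decomposition.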


25
Stencils

Five-point stencil
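
One sweep of the five-point stencil in code (my sketch, Jacobi-style; u and unew are local (nx+2) x (ny+2) row-major arrays whose ghost layers have already been filled by the halo exchange):

#define IDX(i, j) ((i) * (ny + 2) + (j))
for (int i = 1; i <= nx; i++)
    for (int j = 1; j <= ny; j++)
        unew[IDX(i, j)] = 0.25 * (u[IDX(i-1, j)] + u[IDX(i+1, j)] +
                                  u[IDX(i, j-1)] + u[IDX(i, j+1)]);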

26
2D Domain decomposition

N grid points (√N x √N grid)
P processes (√P x √P grid)
N/P points per process
4 Sends(), 4 Recvs() per process

#Communications?
4√N/√P (assuming square grid)
#Computations?
N/P (assuming square grid)

Communication to computation ratio=?
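
My worked answer for the five-point stencil case: four boundaries of √N/√P points each, so

$\dfrac{\text{communication}}{\text{computation}} = \dfrac{4\sqrt{N}/\sqrt{P}}{N/P} = \dfrac{4\sqrt{P}}{\sqrt{N}}$

still shrinking as the local block N/P grows.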

27
Send / Recv

MPI_Send → MPI_Recv
[Figure: data items 0 1 2 3 and 4 5 6 7 sent with MPI_Send and received with MPI_Recv into one contiguous buffer 0 1 2 3 4 5 6 7]

28
Send / Recv

Sender: MPI_Pack (buf), then MPI_Send (buf)
Receiver: MPI_Recv (buf), then MPI_Unpack (buf)
[Figure: data items 0 1 2 3 and 4 5 6 7 packed into one buffer, sent, and unpacked into a contiguous buffer 0 1 2 3 4 5 6 7]

29
MPI_Pack

int MPI_Pack (const void *inbuf, int incount, MPI_Datatype datatype,
              void *outbuf, int outsize, int *position, MPI_Comm comm)
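
A minimal sketch of the Pack/Send and Recv/Unpack sequence from the previous slide (my code; dest and source are placeholder ranks, and a robust version would size the buffers with MPI_Pack_size):

/* Sender: pack two non-contiguous 4-int rows into one buffer, send as MPI_PACKED. */
int rowA[4] = {0, 1, 2, 3}, rowB[4] = {4, 5, 6, 7};
char buf[64];
int pos = 0;
MPI_Pack(rowA, 4, MPI_INT, buf, (int)sizeof(buf), &pos, MPI_COMM_WORLD);
MPI_Pack(rowB, 4, MPI_INT, buf, (int)sizeof(buf), &pos, MPI_COMM_WORLD);
MPI_Send(buf, pos, MPI_PACKED, dest, 0, MPI_COMM_WORLD);

/* Receiver: receive the packed buffer, then unpack into one contiguous array. */
char rbuf[64];
int out[8], rpos = 0;
MPI_Recv(rbuf, (int)sizeof(rbuf), MPI_PACKED, source, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
MPI_Unpack(rbuf, (int)sizeof(rbuf), &rpos, out, 8, MPI_INT, MPI_COMM_WORLD);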
