
ST7: High Performance Simulation

MPI point-to-point communications and global communication patterns

Stéphane Vialle

Stephane.Vialle@centralesupelec.fr
http://www.metz.supelec.fr/~vialle
HPC programming strategy
Numerical algorithm

Optimized code on one core:


- Optimized compilation
- Serial optimizations
- Vectorization

Parallelized code on one node:


- Multithreading
- Minimal/relaxed synchronization
- Data-Computation localization
- NUMA effect avoidance

Distributed code on a cluster:


- Message passing across processes
- Load balanced computations
- Minimal communications
- Overlapped computations and comms
MPI point-to-point communications and
global communication patterns
1. Principles of developments with MPI
2. Dense matrix product on a ring: algorithm
3. Point-to-Point communications
4. Elementary examples of Pt-to-Pt comm (native Python data)
5. Dense matrix product on a ring: core code (Numpy arrays)
Principles of developments with MPI
A set of communicating processes
The developer designs a set of cooperative processes in a single MPI program (one executable file).

Deployment and Execution

[Figure: the same generic set of processes (Process 0 … Process 3) can be deployed either on a cluster of servers/nodes interconnected by a network (distributed memory architecture) or on one multi-core server (shared memory architecture).]
Principles of developments with MPI
Message passing paradigm
Message passing for communication between processes

Each process has a private memory space (P1 and P2 each have their own)…
…but it can send/recv messages to/from another process.

[Figure: message exchange between P1 and P2 located on two different nodes (node A and node B), and between P1 and P2 located inside one node.]
Message passing is mandatory to access data located on another node; it can also be used inside one node.
Message passing is therefore generic.

Principles of developments with MPI
Communication pattern
Identify the needed global communication pattern
In general, the communication needs of a distributed algorithm follow a global and regular pattern:
→ identify this global pattern,
→ implement it using the most efficient MPI communication routines.

Examples of patterns on top of point-to-point communications (send/recv, bsend/recv, ssend/recv, sendrecv, …):
• 1D unidirectional process ring, or bidirectional process bar (P1, P2, …, Pn)
• 2D torus of processes (processes indexed by 2D coordinates 0;0 … 2;2)
• nD hypercube of processes (processes indexed by n-bit labels 000 … 111)
Principles of developments with MPI
Communication pattern
Examples of patterns on top of collective communications (a minimal mpi4py illustration follows below):
• Master-Slaves pattern: the master scatters the work (w) to the slaves, then gathers the results (res): scatter / gather
• File reading and broadcasting: one process reads the data, then broadcasts it to all the others: bcast
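These collective routines (scatter, gather, bcast) are detailed later in the course; the following minimal mpi4py sketch, with illustrative variable names, simply shows how the two patterns above map onto them.

from mpi4py import MPI

comm = MPI.COMM_WORLD
NbP = comm.Get_size()
Me = comm.Get_rank()

# Master-Slaves pattern: the master scatters one work item per process,
# every process computes, and the master gathers the results
work = [i * 10 for i in range(NbP)] if Me == 0 else None
w = comm.scatter(work, root=0)      # each process receives its work item
res = comm.gather(w + 1, root=0)    # rank 0 receives the list of all results

# File reading and broadcasting: one process "reads" the data, all get a copy
data = "content read from a file" if Me == 0 else None
data = comm.bcast(data, root=0)
if Me == 0:
    print("gathered results:", res)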
Principles of developments with MPI
Development steps

Numerical algorithm
↓
Distributed algorithm design (with a global comm. pattern)
↓
MPI code development
↓
Deployment of the process set (on the distributed architecture)
↓
Experiment (and loop back to optimize the previous steps)

Beware: some nice algorithms cannot run efficiently on a distributed computer!
Adopt an intertwined (iterative) approach: algorithm design, coding, deployment and experiments feed back into one another.
Principles of developments with MPI
Classic mpi4py code skeleton
Classic mpi4py code:

from mpi4py import MPI
comm = MPI.COMM_WORLD            # group of all MPI processes of the program
NbP = comm.Get_size()
Me = comm.Get_rank()
print("Hello World from process ", Me, "/", NbP)
# Local computation (inside each process)
# Message passing (data exchange)
# Result saving/printing
comm.Barrier()                   # wait until all calculations are complete
                                 # before saying goodbye (not mandatory)
print("Goodbye from process ", Me)

Example of execution with 3 processes (no assumption about message printing order!):
Hello World from process 0/3
Hello World from process 2/3
Hello World from process 1/3
Principles of developments with MPI
Deploying and running an MPI code
How to launch an MPI application on a distributed architecture?
• Where to launch the processes? (machines? CPUs? cores?)
• How to number the processes? (key issue for the comm. pattern)
• What resources can a process access? (how many cores for the process?)

mpirun: the deployer of OpenMPI
  -np <#P>                           total number of processes to create
  -machinefile <FileName>            list of available machines
  -map-by … -rank-by … -bind-to …    deployment control (depending on the comm. pattern…)
  <Path/ExecName> [args]             executable code and its arguments

Examples:
mpirun -np 3 ./HelloWorld
  → runs 3 processes on the current PC
mpirun -np 6 -machinefile mach.txt -map-by ppr:2:node -rank-by core ./HelloWorld
  → runs 6 processes on 6 "sockets" in a list of dual-processor PCs…
MPI point-to-point communications
and global communication patterns
1. Principles of developments with MPI
2. Dense matrix product on a ring: algorithm
3. Point-to-Point communications
4. Elementary examples of Pt-to-Pt comm.
5. Dense matrix product on a ring: core code
Dense matrix product on a ring: algorithm

Distributed algorithm
Problem:
A, B, C: n × n matrices, i.e. N = n² elements each.

$C = A \cdot B$, with $c_{ij} = \sum_{k=1}^{n} a_{ik} \, b_{kj}$

Number of floating-point operations: $O(n^3) = O(N^{3/2})$.

How to divide the data?
• Data duplication → no size-up possible!
• Data partitioning → size-up possible, but a data displacement will be necessary.

[Figure: the matrices A, B and C, whose blocks are distributed across the processes.]
Dense matrix product on a ring: algorithm

Distributed algorithm
Partitioning on a ring of P processes:
• A is split in blocks of lines (rows) and circulates (displacement of A)
• B and C are split in blocks of columns and stay static

[Figure, step 0 (initial state): topology of the processes 0, 1, …, P-1 on a ring; partitioning and displacement of A; static partitioning of B; static partitioning of C.]
[The same slide is repeated for step 1 and step 2: the blocks of A have circulated by one position around the ring at each step, while the static partitionings of B and C are unchanged.]
Dense matrix product on a ring: algorithm

Distributed algorithm
Partitioning on a ring of P processes:
• A is split in blocks of lines and circulates; B and C are split in blocks of columns and stay static.
Results are available at the end of the P steps.

[Figure: ring communication pattern over the processes 0, 1, …, P-1, and static partitioning of C.]

Conclusion:
• Each process has computed a block of C columns.
• The P processes have worked in parallel.
→ Computation of all the blocks of C columns in parallel, in P steps.
Dense matrix product on a ring: algorithm

Distributed algorithm
Implementation strategies on a ring of P processes (pseudocode; a non-blocking mpi4py sketch follows below):

// Without overlapping
for (step=0; step<P; step++) {
  calcul();
  barrier();        // if needed
  circulation();
  barrier();        // if needed
}

// With overlapping
for (step=0; step<P; step++) {
  thread{ calcul(); }
  thread{ circulation(); }
  joinThreads();
  permutBuff();
}

• Design the algorithm with potential synchronization barriers.
• Then, depending on the communications used, either:
  • implement some explicit barriers to get a strong synchronization, or
  • rely only on the relaxed synchronization of the communications.
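As announced above, here is a hedged mpi4py sketch of the overlapping strategy that replaces the explicit threads of the pseudocode by non-blocking Isend/Irecv calls (an alternative way to overlap computations and communications). The names (A_s, A_next, acc) and sizes are illustrative, and the local computation is only a placeholder, not the course's reference code.

import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
NbP = comm.Get_size()
Me = comm.Get_rank()

SIZE = 8 * NbP
LOCAL = SIZE // NbP
A_s = np.full((LOCAL, SIZE), float(Me))   # local slice of A (circulating)
A_next = np.empty_like(A_s)               # reception buffer for the next slice
acc = np.zeros((LOCAL, LOCAL))            # stands for the local computation result

for step in range(NbP):
    # start the circulation of A ("circulation()" part of the step)
    req_recv = comm.Irecv(A_next, source=(Me + 1) % NbP)
    req_send = comm.Isend(A_s, dest=(Me - 1) % NbP)
    # overlapped local computation on the current slice ("calcul()" placeholder)
    acc += A_s[:, :LOCAL]
    # wait for both communications ("joinThreads()"), then permute the buffers
    MPI.Request.Waitall([req_recv, req_send])
    A_s, A_next = A_next, A_s             # "permutBuff()"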
MPI point-to-point communications
and global communication patterns
1. Principles of developments with MPI
2. Dense matrix product on a ring: algorithm
3. Point-to-Point communications
• Bsend/Recv
• Ssend/Recv
• Send/Recv
• Sendrecv
• Sendrecv_replace
4. Elementary examples of Pt-to-Pt comm.
5. Dense matrix product on a ring: core code
Point-to-Point communications
Available pt-to-pt communications
MPI Python code for native Python objects (and lists):

Mode \ Type    Not specified   Buffered       Synchronous    Ready
Blocking       comm.send       comm.bsend     comm.ssend     comm.rsend
               comm.recv       comm.recv      comm.recv      comm.recv
Non-blocking   comm.isend      comm.ibsend    comm.issend    comm.irsend
               comm.irecv      comm.irecv     comm.irecv     comm.irecv
Combined (blocking): comm.sendrecv

MPI Python code for NumPy arrays:

Mode \ Type    Not specified   Buffered       Synchronous    Ready
Blocking       comm.Send       comm.Bsend     comm.Ssend     comm.Rsend
               comm.Recv       comm.Recv      comm.Recv      comm.Recv
Non-blocking   comm.Isend      comm.Ibsend    comm.Issend    comm.Irsend
               comm.Irecv      comm.Irecv     comm.Irecv     comm.Irecv
Combined (blocking): comm.Sendrecv, comm.Sendrecv_replace
Dense matrix product on a ring of processes

mpi4py with Numpy arrays


Pt-to-pt communications: Send, Bsend, Ssend, Recv, Sendrecv, Sendrecv_replace

Basic syntax (a NumPy array is an object including its data, size and datatype: the name (reference) is usually sufficient):
Tab = np.zeros(Size, dtype=np.float64)
……
Bsend( Tab , dest=…)

Detailed syntax (a Python list giving the NumPy array name, its size and its datatype in the MPI standard; useful e.g. when using and reshaping a bigger array, as illustrated below):
Tab = np.zeros(Size, dtype=np.float64)
……
Bsend( [Tab, Size, MPI.DOUBLE] , dest=…)
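As a small hedged illustration of the detailed syntax (assuming at least 2 processes; the array names and sizes are only examples), the snippet below sends just the first Size elements of a bigger array:

import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
Me = comm.Get_rank()

Size = 4
Big = np.arange(10, dtype=np.float64)     # bigger than what we want to send

if Me == 0:
    # detailed syntax: send only the first Size elements of Big
    comm.Send([Big, Size, MPI.DOUBLE], dest=1)
elif Me == 1:
    Tab = np.empty(Size, dtype=np.float64)
    comm.Recv([Tab, Size, MPI.DOUBLE], source=0)
    print("PE 1 received", Tab)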
Point-to-Point communications
Buffered & blocking comm.: Bsend/Recv
Bsend(…):
• Makes a local copy of the data to send (as long as the buffer is not full)
• Returns as soon as the local copy is done
→ the original data storage can then be overwritten

Ex. on a RING of processes:
[Figure: processes k-1, k, k+1; each one copies its data into a local buffer before sending it to its neighbour.]
Point-to-Point communications
Buffered & blocking comm.: Bsend/Recv
Bsend(…):
• Makes a local copy of the data to send (as long as the buffer is not full)
• Returns as soon as the local copy is done
→ the original data storage can then be overwritten
Recv(…):
• Requires a data exchange & waits for the (entire) data reception
• Returns when all the data has been received
So: non-blocking & buffered send + blocking recv.

Ex. on a RING of processes:
[Figure: processes k-1, k, k+1; each one copies its data, sends it, and blocks only on its Recv.]

We get a relaxed synchronization on the Recv operations
→ sufficient and fast synchronization.
Point-to-Point communications
Buffered & blocking comm.: Bsend/Recv
Bsend(…) / Recv(…):
• A unique code for each process, with a unique comm. schedule:
import objsize
Data = 10
……
# bsend buffer allocation
bsize = objsize.get_deep_size(Data) + MPI.BSEND_OVERHEAD
buff = MPI.Alloc_mem(bsize)
MPI.Attach_buffer(buff)          # bsend can now use the buffer
comm.bsend(Data, dest=Me+1)      # Data is copied into the buffer
Data = comm.recv(source=Me-1)    # the process waits until it receives
                                 # a new value to store in Data
MPI.Detach_buffer()              # the process waits until all data in
                                 # the buffer have been sent
MPI.Free_mem(buff)

[Figure: on the ring k-1, k, k+1, each process copies its Data into its bsend buffer.]
Point-to-Point communications
Buffered & blocking comm.: Bsend/Recv
Bsend(…) / Recv(…):
• A unique code for each process, with a unique comm. schedule:
import numpy as np
Tab = np.zeros(size, np.float64)
……
bsize = Tab.nbytes + MPI.BSEND_OVERHEAD
buff = MPI.Alloc_mem(bsize)
MPI.Attach_buffer(buff)
comm.Bsend(Tab, dest=Me+1)       # the array Tab is copied/saved
                                 # into the attached buffer
comm.Recv(Tab, source=Me-1)      # the process waits until it receives
                                 # new values to store in Tab
MPI.Detach_buffer()
MPI.Free_mem(buff)

[Figure: on the ring k-1, k, k+1, each process copies its Tab into its bsend buffer.]
Point-to-Point communications
Buffered & blocking comm.: Bsend/Recv
Bsend(…) / Recv(…):
bsend(obj, int dest=yyy, int tag=0)
Bsend(NumpyArray, int dest=yyy, int tag=0)
obj = recv(int source=xxx, int tag=ANY_TAG)
Recv(NumpyArray, int source=xxx, int tag=ANY_TAG)    ← the NumPy array must exist, with the right size!

dest/source: the rank of the receiving/sending process.
tag: only a Send and a Recv with identical tags can match; by default, recv tags are set to ANY_TAG (which is fine). A small illustration follows below.

[Timeline figure, one possible scenario: the sender starts its Bsend, makes the local copy and returns (end of Bsend); the transfer then takes place, and the receiver's Recv returns (and is signaled) once all the data has arrived.]
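A small hedged illustration of tag matching with bsend/recv (2 processes assumed; the message contents and the extra buffer slack are illustrative assumptions): thanks to the tags, the receiver chooses in which order it consumes the two messages.

import objsize
from mpi4py import MPI

comm = MPI.COMM_WORLD
Me = comm.Get_rank()

if Me == 0:
    msg1, msg2 = "first", "second"
    bsize = (objsize.get_deep_size(msg1) + objsize.get_deep_size(msg2)
             + 2 * MPI.BSEND_OVERHEAD + 512)   # 512: extra slack (assumption)
    buff = MPI.Alloc_mem(bsize)
    MPI.Attach_buffer(buff)
    comm.bsend(msg1, dest=1, tag=1)
    comm.bsend(msg2, dest=1, tag=2)
    MPI.Detach_buffer()
    MPI.Free_mem(buff)
elif Me == 1:
    late = comm.recv(source=0, tag=2)    # matches only the tag-2 message
    early = comm.recv(source=0, tag=1)   # then the tag-1 message
    print("received in the chosen order:", late, early)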
Point-to-Point communications
Synchronous & blocking comm.: Ssend/Recv
Ssend(…):
• Requires a data exchange & waits for the (entire) data emission
• Returns when all the data has been sent
→ until then, the original data storage must not be overwritten
Recv(…):
• Requires a data exchange & waits for the (entire) data reception
• Returns when all the data has been received

Communications are appointments (rendez-vous) between processes.

[Figure: on processes 2k-1, 2k, 2k+1, the exchange proceeds in three phases, each process using its Tab array plus an extra buffer.]

Some (explicit) buffers are still required to avoid data losses.
Point-to-Point communications
Synchronous & blocking comm.: Ssend/Recv
Ssend(…) / Recv(…):
• At each step, an Ssend has to match a Recv operation
→ higher risk of deadlock than with Bsend/Recv
→ a communication schedule has to be entirely and finely designed,
  to plan each Ssend/Recv appointment and avoid deadlocks.

Ex. on a ring of processes (pseudocode; figure on processes 2k-1, 2k, 2k+1, each with its Tab and an extra buffer):

buffer = np.empty(size, …)
……
if (Me % 2 == 0):
    1 : Ssend(Tab, dest=Me+1)
    2 : Recv(Tab, source=Me-1)
else:
    1 : Recv(buffer, source=Me-1)
    2 : Ssend(Tab, dest=Me+1)
    3 : permut(buffer, Tab)
……
Point-to-Point communications
Synchronous & blocking comm.: Ssend/Recv
Ssend(…) / Recv(…):
• Execution of this communication schedule will be longer than with Bsend/Recv.
The communications take place in 2+1 phases:
• (1) first half of the comms,
• (2) then second half of the comms,
• (3) then buffer permutation.
(Same pseudocode and figure as on the previous slide.)
Point-to-Point communications
Synchronous & blocking comm.: Ssend/Recv
Ssend(…) / Recv(…): identical syntax to Bsend/Recv, but a different behavior!

ssend(obj, int dest=yyy, int tag=0, …)
Ssend(NumpyArray, int dest=yyy, int tag=0, …)
obj = recv(int source=xxx, int tag=ANY_TAG, …)
Recv(NumpyArray, int source=xxx, int tag=ANY_TAG, …)    ← the NumPy array must exist, with the right size!

[Timeline figure, one possible scenario: the sender starts its Ssend and waits; when the receiver starts its Recv, the rendez-vous is acknowledged, the transfer takes place, and both Ssend and Recv return once all the data has been transferred.]
Point-to-Point communications
« Standard & Blocking » comm.: Send/Recv
Send(…): not entirely specified!
• Allows vendors to implement communication routines optimized for their architecture
• Not a portable communication mechanism!

Classic example: behavior depending on the message size
• Under some threshold:
→ runs like a Bsend with automatic buffer management
• Above some threshold:
→ runs like an Ssend with a rendez-vous protocol!

Recv(…): unchanged
• Requires a data exchange & waits for the (entire) data reception
• Returns when all the data has been received
Point-to-Point communications
« Standard & Blocking » comm.: Send/Recv
Send(…) / Recv(…): not entirely specified!
• Allows vendors to implement communication routines optimized for their architecture
• Not a portable communication mechanism

Two opposed approaches:
• An MPI program should never use standard-blocking comms
  → when portability is the main objective.
• An MPI program should use standard-blocking comms
  → when communication efficiency is the main objective,
    and a clear documentation of the standard-mode protocol is available.
Point-to-Point communications
Combined & Blocking comm.: Sendrecv
Sendrecv(…): 1 send & 1 recv in 1 operation:
recvobj = sendrecv(sendobj, int dest=yyy, int sendtag=0,
                   int source=xxx, int recvtag=ANY_TAG, ……)
Sendrecv(sendNpArray, int dest=yyy, int sendtag=0,
         recvNpArray, int source=xxx, int recvtag=ANY_TAG, ……)
(the receive NumPy array must exist, with the right size!)

Frontier exchange with Sendrecv, comm. pattern 1, on a bar of processes P1, P2, P3
(TabSL/TabSR: frontiers sent to the left/right; TabRL/TabRR: frontiers received from the left/right):

Step 1:
  P1: Sendrecv(TabSR, dest=me+1, TabRR, source=me+1)
  P2: Sendrecv(TabSL, dest=me-1, TabRL, source=me-1)
  P3: Sendrecv(TabSR, dest=me+1, TabRR, source=me+1)
Step 2:
  P1: Sendrecv(TabSL, dest=me-1, TabRL, source=me-1)
  P2: Sendrecv(TabSR, dest=me+1, TabRR, source=me+1)
  P3: Sendrecv(TabSL, dest=me-1, TabRL, source=me-1)
Point-to-Point communications
Combined & Blocking comm.: Sendrecv
Sendrecv(…): 1 send & 1 recv in 1 operation (same syntax as on the previous slide).

Frontier exchange with Sendrecv, comm. pattern 2: every process issues exactly the same call at each step (a runnable sketch follows below):

Step 1 (all processes):  Sendrecv(TabSL, dest=me-1, TabRR, source=me+1)
Step 2 (all processes):  Sendrecv(TabSR, dest=me+1, TabRL, source=me-1)
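As announced above, here is a hedged runnable sketch of pattern 2 on a bar of processes with NumPy arrays. MPI.PROC_NULL is used at both ends of the bar so that every process can issue exactly the same two Sendrecv calls (array names and sizes are illustrative):

import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
NbP = comm.Get_size()
me = comm.Get_rank()
left  = me - 1 if me > 0 else MPI.PROC_NULL
right = me + 1 if me < NbP - 1 else MPI.PROC_NULL

n = 5
TabSL = np.full(n, float(me))      # left frontier to send
TabSR = np.full(n, float(me))      # right frontier to send
TabRL = np.empty(n)                # frontier received from the left
TabRR = np.empty(n)                # frontier received from the right

# Step 1: everybody sends to the left and receives from the right
comm.Sendrecv(TabSL, dest=left, recvbuf=TabRR, source=right)
# Step 2: everybody sends to the right and receives from the left
comm.Sendrecv(TabSR, dest=right, recvbuf=TabRL, source=left)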
Point-to-Point communications
Combined & Blocking: Sendrecv_replace
Sendrecv_replace(…): 1 send & 1 recv & only 1 buffer:
Sendrecv_replace(NumpyArray, int dest=yyy, int sendtag=0,
                 int source=xxx, int recvtag=ANY_TAG, ……)

Ex. on a ring of processes P0 … P3:
Tab = np.empty(size, np.float64)
……
Sendrecv_replace(Tab, dest=(me-1)%NbP, source=(me+1)%NbP)
……

• Blocking comm.: returns when both the Send part and the Recv part have completed.
• For native objects (no sendrecv_replace routine): data = sendrecv(data, …)
• On this example: no need for a fine schedule, just follow the comm. pattern.
• Easy to use & very efficient communications! (A minimal runnable version follows below.)
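A minimal runnable version of this ring rotation (to be launched with mpirun -np <P>; the array content is illustrative):

import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
NbP = comm.Get_size()
me = comm.Get_rank()

Tab = np.full(4, float(me))        # stands for the local slice of A
# send Tab to the left neighbour, receive into Tab from the right neighbour
comm.Sendrecv_replace(Tab, dest=(me - 1) % NbP, source=(me + 1) % NbP)
print("PE", me, "now holds the slice of PE", int(Tab[0]))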
MPI point-to-point communications
and global communication patterns
1. Principles of developments with MPI
2. Dense matrix product on a ring: algorithm
3. Point-to-Point communications
4. Elementary examples of Pt-to-Pt comm.
5. Dense matrix product on a ring: core code
Elementary examples
Example introduction

Communication pattern: ring of processes P0, P1, P2, P3.

Messages: native Python objects, e.g. [100, 12], a list of integers.

Point-to-point communications experimented with:
• send/recv
• ssend/recv
• bsend/recv
• sendrecv
Elementary examples
Communication on a ring of processes
mpirun -np 4 python3 ring.py
[Ring of processes P0 … P3]

import mpi4py
from mpi4py import MPI
comm = MPI.COMM_WORLD
NbP = comm.Get_size()
Me = comm.Get_rank()

prev = (Me-1) % NbP
suiv = (Me+1) % NbP
data = [Me*100, 12]
for i in range(NbP):
    comm.send(data, dest=prev)
    data = comm.recv(source=suiv)
    print("PE", Me, ": received ", data)
print("PE", Me, ": end")

This works:
• non-blocking send + blocking recv,
• automatic usage of a hidden buffer,
• the data to send is not crushed by the received data.
BUT: this behavior is limited to small messages (2014 int or 448 float on our cluster). Beyond this limit: program deadlock!

Observed output:
PE 0 : received [100, 12]
PE 0 : received [200, 12]
PE 1 : received [200, 12]
PE 2 : received [300, 12]
PE 2 : received [0, 12]
PE 2 : received [100, 12]
PE 3 : received [0, 12]
PE 3 : received [100, 12]
PE 3 : received [200, 12]
PE 1 : received [300, 12]
PE 1 : received [0, 12]
PE 0 : received [300, 12]
PE 2 : received [200, 12]
PE 2 : end
PE 1 : received [100, 12]
PE 1 : end
PE 3 : received [300, 12]
PE 3 : end
PE 0 : received [0, 12]
PE 0 : end
Elementary examples
Communication on a ring of processes
mpirun -np 4 python3 ring.py
[Ring of processes P0 … P3]

import mpi4py
from mpi4py import MPI
comm = MPI.COMM_WORLD
NbP = comm.Get_size()
Me = comm.Get_rank()

prev = (Me-1) % NbP
suiv = (Me+1) % NbP
data = [Me*100, 12]
for i in range(NbP):
    comm.send(data, dest=prev)
    data = comm.recv(source=suiv)
    print("PE", Me, ": received ", data)

comm.barrier()
print("PE", Me, ": end")

The MPI barrier does not serialize the print outputs (a classic pitfall)!

Observed output:
PE 0 : received [100, 12]
PE 0 : received [200, 12]
PE 0 : received [300, 12]
PE 1 : received [200, 12]
PE 1 : received [300, 12]
PE 1 : received [0, 12]
PE 1 : received [100, 12]
PE 2 : received [300, 12]
PE 2 : received [0, 12]
PE 2 : received [100, 12]
PE 2 : received [200, 12]
PE 2 : end
PE 3 : received [0, 12]
PE 3 : received [100, 12]
PE 3 : received [200, 12]
PE 3 : received [300, 12]
PE 0 : received [0, 12]
PE 0 : end
PE 1 : end
PE 3 : end
Elementary examples
Communication on a ring of processes
mpirun -np 4 python3 ring.py
[Ring of processes P0 … P3]

import objsize
import mpi4py
from mpi4py import MPI
comm = MPI.COMM_WORLD
NbP = comm.Get_size()
Me = comm.Get_rank()

prev = (Me-1) % NbP
suiv = (Me+1) % NbP
data = [Me*100, 12]
buff = MPI.Alloc_mem(objsize.get_deep_size(data) +
                     MPI.BSEND_OVERHEAD)
for i in range(NbP):
    MPI.Attach_buffer(buff)
    comm.bsend(data, dest=prev)
    data = comm.recv(source=suiv)
    MPI.Detach_buffer()
    print("PE", Me, ": received ", data)
MPI.Free_mem(buff)
print("PE", Me, ": end")

This works:
• non-blocking bsend + blocking recv,
• the message to send is copied into an explicit buffer,
• the buffer must be allocated, attached, detached and released.

Observed output:
PE 0 : received [100, 12]
PE 1 : received [200, 12]
PE 0 : received [200, 12]
PE 2 : received [300, 12]
PE 3 : received [0, 12]
PE 1 : received [300, 12]
PE 1 : received [0, 12]
PE 1 : received [100, 12]
PE 1 : end
PE 0 : received [300, 12]
PE 0 : received [0, 12]
PE 0 : end
PE 2 : received [0, 12]
PE 2 : received [100, 12]
PE 2 : received [200, 12]
PE 2 : end
PE 3 : received [100, 12]
PE 3 : received [200, 12]
PE 3 : received [300, 12]
PE 3 : end
Elementary examples
Communication on a ring of processes
mpirun -np 4 python3 ring.py
[Ring of processes P0 … P3]

import objsize
import mpi4py
from mpi4py import MPI
comm = MPI.COMM_WORLD
NbP = comm.Get_size()
Me = comm.Get_rank()

prev = (Me-1) % NbP
suiv = (Me+1) % NbP
data = [Me*100, 12]
buff = MPI.Alloc_mem((objsize.get_deep_size(data) +
                      MPI.BSEND_OVERHEAD)*NbP)
MPI.Attach_buffer(buff)
for i in range(NbP):
    comm.bsend(data, dest=prev)
    data = comm.recv(source=suiv)
    print("PE", Me, ": received ", data)
MPI.Detach_buffer()
MPI.Free_mem(buff)
print("PE", Me, ": end")

This still works (non-blocking bsend + blocking recv, explicit buffer), and it is faster but riskier:
• the buffer is attached and detached only once,
• risk of memory overflow with a too big buffer!

Observed output: same kind of trace as with the previous version.
Elementary examples
Communication on a ring of processes
mpirun -np 4 python3 ring.py
[Ring of processes P0 … P3]    Observed output: nothing!

import mpi4py
from mpi4py import MPI
comm = MPI.COMM_WORLD
NbP = comm.Get_size()
Me = comm.Get_rank()
prev = (Me-1) % NbP
suiv = (Me+1) % NbP
data = [Me*100, 12]
for i in range(NbP):
    comm.ssend(data, dest=prev)
    data = comm.recv(source=suiv)
    print("PE", Me, ": received ", data)
print("PE", Me, ": end")

Deadlock! Every process is waiting for the end of its synchronous blocking send operation.
Elementary examples
Communication on a ring of processes
mpirun -np 4 python3 ring.py
[Ring of processes P0 … P3]

import mpi4py
from mpi4py import MPI
comm = MPI.COMM_WORLD
NbP = comm.Get_size()
Me = comm.Get_rank()
prev = (Me-1) % NbP
suiv = (Me+1) % NbP
data = [Me*100, 12]

for i in range(NbP):
    if Me % 2 == 0:
        comm.ssend(data, dest=prev)
        data = comm.recv(source=suiv)
    else:
        datarecv = comm.recv(source=suiv)
        comm.ssend(data, dest=prev)
        data = datarecv
    print("PE", Me, ": received ", data)
……

This works:
• alternating blocking ssend & blocking recv (even ranks send first, odd ranks receive first),
• no deadlock.
Note: no message size limit with this solution (perfect!).

Observed output:
PE 2 : received [300, 12]
PE 3 : received [0, 12]
PE 0 : received [100, 12]
PE 0 : received [200, 12]
PE 1 : received [200, 12]
PE 1 : received [300, 12]
PE 2 : received [0, 12]
PE 3 : received [100, 12]
PE 2 : received [100, 12]
PE 0 : received [300, 12]
PE 1 : received [0, 12]
PE 1 : received [100, 12]
PE 3 : received [200, 12]
PE 3 : received [300, 12]
PE 1 : end
PE 2 : received [200, 12]
PE 2 : end
PE 0 : received [0, 12]
PE 0 : end
PE 3 : end
Elementary examples
Communication on a ring of processes
mpirun -np 4 python3 ring.py
[Ring of processes P0 … P3]

import mpi4py
from mpi4py import MPI
comm = MPI.COMM_WORLD
NbP = comm.Get_size()
Me = comm.Get_rank()
prev = (Me-1) % NbP
suiv = (Me+1) % NbP
data = [Me*100, 12]

for i in range(NbP):
    data = comm.sendrecv(data,
                         dest=prev,
                         source=suiv)
    print("PE", Me, ": received ", data)

print("PE", Me, ": end")

This works: with standard Python objects, the sendrecv routine can be used to implement a sendrecv_replace (which does not exist for native objects).
Note: no message size limit with this solution (perfect!).

Observed output:
PE 0 : received [100, 12]
PE 0 : received [200, 12]
PE 0 : received [300, 12]
PE 0 : received [0, 12]
PE 1 : received [200, 12]
PE 1 : received [300, 12]
PE 1 : received [0, 12]
PE 1 : received [100, 12]
PE 1 : end
PE 2 : received [300, 12]
PE 2 : received [0, 12]
PE 2 : received [100, 12]
PE 2 : received [200, 12]
PE 2 : end
PE 3 : received [0, 12]
PE 3 : received [100, 12]
PE 3 : received [200, 12]
PE 3 : received [300, 12]
PE 3 : end
PE 0 : end
MPI point-to-point communications
and global communication patterns
1. Principles of developments with MPI
2. Dense matrix product on a ring: algorithm
3. Point-to-Point communications
4. Elementary examples of Pt-to-Pt comm.
5. Dense matrix product on a ring: core code
Dense matrix product on a ring: core code

mpi4py implementation v1
mpi4py: Blocking Sendrecv_replace version
import numpy as np
from mpi4py import MPI
comm = MPI.COMM_WORLD
NbP = comm.Get_size()
Me = comm.Get_rank()
SIZE = 4096
LOCAL_SIZE = SIZE//NbP
# Allocate slices of A, B and C
A_s = np.empty(LOCAL_SIZE*SIZE, dtype=np.float64)
B_s = np.empty(SIZE*LOCAL_SIZE, dtype=np.float64)
C_s = np.empty(SIZE*LOCAL_SIZE, dtype=np.float64)
# Init A and B slices
……
# Loop computation-communication
for step in range(NbP):
    LocalSliceProduct(A_s, B_s, C_s, LOCAL_SIZE, SIZE, step)
    comm.Sendrecv_replace(A_s, dest=(Me-1)%NbP,
                          source=(Me+1)%NbP)
# Save matrix C
……
[Figure: each process holds a LOCAL_SIZE × SIZE slice of A and a SIZE × LOCAL_SIZE slice of C.]
End
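LocalSliceProduct is called above but not shown in these slides. Below is a possible (hedged) definition, consistent with the circulation direction used here: at step 'step', A_s holds the row block that initially belonged to process (Me+step) % NbP, so it fills the corresponding rows of the local column block C_s. It relies on the module-level Me and NbP defined above and would be placed before the loop; the course's actual routine may differ (e.g. call an optimized kernel).

def LocalSliceProduct(A_s, B_s, C_s, LOCAL_SIZE, SIZE, step):
    A_blk = A_s.reshape(LOCAL_SIZE, SIZE)       # current row block of A
    B_blk = B_s.reshape(SIZE, LOCAL_SIZE)       # local column block of B
    C_blk = C_s.reshape(SIZE, LOCAL_SIZE)       # local column block of C (view)
    first = ((Me + step) % NbP) * LOCAL_SIZE    # rows of C computed at this step
    C_blk[first:first + LOCAL_SIZE, :] = A_blk @ B_blk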
Dense matrix product on a ring: core code

C/C++ MPI implementation v1


C/C++ MPI: Blocking MPI_Sendrecv_replace version
#include <stdlib.h>
#include <mpi.h>
#define SIZE 4096
int LOCAL_SIZE;
double *A_s, *B_s, *C_s;
int Me, NbP;
MPI_Status status;

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);                 // Start using MPI
  MPI_Comm_rank(MPI_COMM_WORLD, &Me);
  MPI_Comm_size(MPI_COMM_WORLD, &NbP);
  LOCAL_SIZE = SIZE/NbP;
  // Allocate the slices of A, B and C
  A_s = (double *) malloc(LOCAL_SIZE*SIZE*sizeof(double));
  B_s = (double *) malloc(SIZE*LOCAL_SIZE*sizeof(double));
  C_s = (double *) malloc(SIZE*LOCAL_SIZE*sizeof(double));
  // Initialize the A and B slices
  ……
Part 1
Dense matrix product on a ring: core code

C/C++ MPI implementation v1


C/C++ MPI: Blocking MPI_Sendrecv_replace version
  // Loop computation-communication
  for (int step = 0; step < NbP; step++) {
    LocalSliceProduct(A_s, B_s, C_s, LOCAL_SIZE, SIZE, step);
    MPI_Sendrecv_replace(
        A_s, LOCAL_SIZE*SIZE, MPI_DOUBLE,
        (Me-1+NbP)%NbP, 0, (Me+1)%NbP, 0,
        MPI_COMM_WORLD, &status);
  }
  // Save matrix C
  ……
  // Free the allocated arrays
  free(A_s);
  free(B_s);
  free(C_s);
  // End of use of MPI
  MPI_Finalize();
}
End
Dense matrix product on a ring: core code

mpi4py implementation v2
mpi4py: Blocking Bsend/Recv version
import numpy as np
from mpi4py import MPI
comm = MPI.COMM_WORLD
NbP = comm.Get_size()
Me = comm.Get_rank()
SIZE = 4096
LOCAL_SIZE = SIZE//NbP
# Allocate slices of A, B and C
A_s = np.empty(LOCAL_SIZE*SIZE, dtype=np.float64)
B_s = np.empty(SIZE*LOCAL_SIZE, dtype=np.float64)
C_s = np.empty(SIZE*LOCAL_SIZE, dtype=np.float64)
# Init A and B slices
……
# Allocate Bsend buffer
buff = MPI.Alloc_mem(A_s.nbytes + MPI.BSEND_OVERHEAD)
Part 1
Dense matrix product on a ring: core code

mpi4py implementation v2
mpi4py: Blocking Bsend/Recv version
# Loop computation-communication
for step in range(NbP):
    MPI.Attach_buffer(buff)
    LocalSliceProduct(A_s, B_s, C_s, LOCAL_SIZE, SIZE, step)
    comm.Bsend(A_s, dest=(Me-1)%NbP)
    comm.Recv(A_s, source=(Me+1)%NbP)
    MPI.Detach_buffer()
# Release Bsend buffer
MPI.Free_mem(buff)
# Save matrix C
……
End
Dense matrix product on a ring: core code

mpi4py implementation v3
mpi4py: Blocking Ssend/Recv version
import numpy as np
from mpi4py import MPI
comm = MPI.COMM_WORLD
NbP = comm.Get_size()
Me = comm.Get_rank()
SIZE = 4096
LOCAL_SIZE = SIZE//NbP
# Allocate slices of A, B and C (plus a reception buffer for A)
A_s = np.empty(LOCAL_SIZE*SIZE, dtype=np.float64)
Buff = np.empty(LOCAL_SIZE*SIZE, dtype=np.float64)
B_s = np.empty(SIZE*LOCAL_SIZE, dtype=np.float64)
C_s = np.empty(SIZE*LOCAL_SIZE, dtype=np.float64)
# Init A and B slices
……
Part 1
Dense matrix product on a ring: core code

mpi4py implementation v3
mpi4py: Blocking Ssend/Recv version
# Loop computation-communication
for step in range(NbP):
    LocalSliceProduct(A_s, B_s, C_s, LOCAL_SIZE, SIZE, step)
    if (Me%2 == 0):
        comm.Ssend(A_s, dest=(Me-1)%NbP)
        comm.Recv(A_s, source=(Me+1)%NbP)
    else:
        comm.Recv(Buff, source=(Me+1)%NbP)
        comm.Ssend(A_s, dest=(Me-1)%NbP)
        tmp = A_s            # A_s <-> Buff permutation
        A_s = Buff
        Buff = tmp
# Save matrix C
……
End
MPI point-to-point communications and
global communication patterns

Questions ?
