ST7 SHP 2.2 Message Passing: MPI p2p Communications
Stéphane Vialle
Stephane.Vialle@centralesupelec.fr
http://www.metz.supelec.fr/~vialle
HPC programming strategy
[Figure: a numerical algorithm mapped onto processes 0-3 that exchange messages (msg) over the network; two processes P1 and P2 can run on two nodes (node A, node B) or inside one node; data distribution patterns: scatter, gather, bcast]
• Message passing is mandatory to access data located in another node
• Message passing can also be used inside one node
Principles of development with MPI
Development steps
Numerical algorithm → Distributed algorithm → MPI code development → run on the target computer!
Distributed algorithm
Problem:
A, B, C: n × n matrices, i.e. N = n² elements each
C = A·B : $c_{ij} = \sum_{k=1}^{n} a_{ik} \cdot b_{kj}$
O(Nb of floating-point ops) = O(n³) = O(N^(3/2))
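As a point of comparison (not on the original slides), a minimal sequential NumPy reference of this product, which the distributed versions below must reproduce; the size n is arbitrary:

import numpy as np

n = 1024                     # hypothetical size, just for a quick sequential check
A = np.random.rand(n, n)
B = np.random.rand(n, n)
C = A @ B                    # c_ij = sum_k a_ik * b_kj : O(n^3) = O(N^(3/2)) flops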
Distributed algorithm
Partitioning on a ring of P processes:
• A split in blocks of rows • Displacement of A
• B and C split in blocks of columns • B and C static
[Figure, step 0 (initial state): partitioning and displacement of A, static partitioning of B, static partitioning of C]
Dense matrix product on a ring: algorithm
Distributed algorithm
Partitioning on a ring of P processes:
• A split in blocks of rows • Displacement of A
• B and C split in blocks of columns • B and C static
[Figure, step 1: the blocks of A have moved one position along the ring; B and C remain static]
Dense matrix product on a ring: algorithm
Distributed algorithm
Partitioning on a ring of P processes:
• A split in blocks of rows • Displacement of A
• B and C split in blocks of columns • B and C static
[Figure, step 2: the blocks of A have moved one more position along the ring; B and C remain static]
Dense matrix product on a ring: algorithm
Distributed algorithm
Partitioning on a ring of P processes:
• A split in blocks of rows • Displacement of A
• B and C split in blocks of columns • B and C static
[Figure: ring communication pattern across processes 0, 1, …, P-1; static partitioning of C]
Conclusion:
• Each PC has computed a block of C columns
• The P PCs have worked in parallel
→ Computation of all blocks of C columns in parallel, in P steps
Dense matrix product on a ring: algorithm
Distributed algorithm
Implementation strategies on a ring of P processes:

// Without overlapping
for (step=0; step<P; step++) {
    calcul();        // computation
    barrier();       // if needed
    circulation();   // communication
    barrier();       // if needed
}

// With overlapping
for (step=0; step<P; step++) {
    thread{ calcul(); }        // computation thread
    thread{ circulation(); }   // communication thread
    joinThreads();
    permutBuff();
}
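A minimal mpi4py sketch (my own, not from the slides) of the "with overlapping" loop above, with the communication placed in a separate Python thread; the names A_next and circulation are assumptions, the computation is reduced to a placeholder, and the MPI library is assumed to provide sufficient thread support:

import threading
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
NbP = comm.Get_size()
Me = comm.Get_rank()

A_s    = np.full(4, float(Me))     # current block, used by the computation
A_next = np.empty_like(A_s)        # block received during the computation

def circulation():
    # Send the current block to the left neighbour, receive the next one from the right
    comm.Sendrecv(A_s, dest=(Me-1) % NbP, recvbuf=A_next, source=(Me+1) % NbP)

for step in range(NbP):
    t_comm = threading.Thread(target=circulation)
    t_comm.start()                 # communication runs concurrently
    partial = A_s.sum()            # placeholder for calcul() on the current block
    t_comm.join()                  # joinThreads()
    A_s, A_next = A_next, A_s      # permutBuff(): swap computation/communication buffers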
[Figure: timeline of a blocking transfer: send, local copy, recv; start of Recv, data transfer, end of Recv]
Point-to-Point communications
Synchronous & blocking comm.: Ssend/Recv
Ssend(…):
• Requests a data exchange & waits for the (entire) data emission
• Returns when all the data have been sent
→ The original data storage must not be overwritten before the call returns
Recv(…):
• Requests a data exchange & waits for the (entire) data reception
• Returns when all the data have been received
[Figure: ranks 2k-1, 2k, 2k+1, each with its own Tab buffer]
Communications are appointments (rendezvous) between the sender and the receiver.
Two opposed approaches:
• An MPI program should never use standard blocking comms.: portability is the main objective
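A minimal mpi4py sketch (not from the slides) of a deadlock-free Ssend/Recv exchange on a ring, where even ranks send first and odd ranks receive first (assumes at least 2 processes; the array size is arbitrary):

import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
NbP = comm.Get_size()
Me = comm.Get_rank()

msg = np.full(4, float(Me))        # data sent to the right neighbour
buf = np.empty_like(msg)           # reception buffer for the left neighbour's data

if Me % 2 == 0:                    # even ranks: send first, then receive
    comm.Ssend(msg, dest=(Me+1) % NbP)
    comm.Recv(buf, source=(Me-1) % NbP)
else:                              # odd ranks: receive first, then send
    comm.Recv(buf, source=(Me-1) % NbP)
    comm.Ssend(msg, dest=(Me+1) % NbP)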
Frontier exchange with Sendrecv, comm. pattern 1 (P1, P2, P3):
Step 1:
  P1: Sendrecv(TabSR, dest=me+1, TabRR, source=me+1)
  P2: Sendrecv(TabSL, dest=me-1, TabRL, source=me-1)
  P3: Sendrecv(TabSR, dest=me+1, TabRR, source=me+1)
Step 2:
  P1: Sendrecv(TabSL, dest=me-1, TabRL, source=me-1)
  P2: Sendrecv(TabSR, dest=me+1, TabRR, source=me+1)
  P3: Sendrecv(TabSL, dest=me-1, TabRL, source=me-1)
Point-to-Point communications
Combined & Blocking comm.: Sendrecv
Sendrecv(…) : 1 send & 1 recv, 1 operation:
recvobj = sendrecv(sendobj, int dest=yyy, int sendtag=0,
int source=xxx, int recvtag=ANY_TAG, ……)
Sendrecv(sendNpArray, int dest=yyy, int sendtag=0,
recvNpArray, int source=xxx, int recvtag=ANY_TAG,……)
→ The NumPy arrays (send and receive buffers: TabSL, TabSR, …) must already exist, with the right size!
Frontier exchange with Sendrecv, comm. pattern 2 (all processes issue the same calls):
Step 1 (P1, P2, P3):
  Sendrecv(TabSL, dest=me-1, TabRR, source=me+1)
Step 2 (P1, P2, P3):
  Sendrecv(TabSR, dest=me+1, TabRL, source=me-1)
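A minimal mpi4py sketch (my own) of comm. pattern 2 on a non-periodic line of processes; the frontier size H is arbitrary, and MPI.PROC_NULL is used at the two ends so that every rank can issue exactly the same calls:

import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
NbP = comm.Get_size()
Me = comm.Get_rank()
left  = Me - 1 if Me > 0       else MPI.PROC_NULL
right = Me + 1 if Me < NbP - 1 else MPI.PROC_NULL

H = 8                                       # hypothetical frontier size
TabSL = np.full(H, float(Me))               # frontier sent to the left neighbour
TabSR = np.full(H, float(Me))               # frontier sent to the right neighbour
TabRL = np.empty(H, dtype=np.float64)       # frontier received from the left
TabRR = np.empty(H, dtype=np.float64)       # frontier received from the right

# Step 1: every process sends its left frontier and receives its right one
comm.Sendrecv(TabSL, dest=left,  recvbuf=TabRR, source=right)
# Step 2: every process sends its right frontier and receives its left one
comm.Sendrecv(TabSR, dest=right, recvbuf=TabRL, source=left)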
Point-to-Point communications
Combined & Blocking: Sendrecv_replace
Sendrecv_replace(…) : 1 send & 1 recv & only 1 buffer
Sendrecv_replace(NumpyArray, int dest=yyy, int sendtag=0,
int source=xxx, int recvtag=ANY_TAG, ……)
[Figure: P0, P1, P2, P3 on a ring, each circulating its Tab buffer]
Tab = np.empty(size, dtype=np.float64)
……
comm.Sendrecv_replace(Tab, dest=(me-1)%NbP, source=(me+1)%NbP)
……
• Blocking comm.: returns when both the Send part and the Recv part have completed
• For native objects: data = sendrecv(data, …)
• In this example: no need for a fine schedule, just follow the comm. pattern
• Easy to use & very efficient communications!
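A minimal sketch (not on the slide) of the lowercase, pickle-based sendrecv for native Python objects, circulating a small dictionary on the ring:

from mpi4py import MPI

comm = MPI.COMM_WORLD
NbP = comm.Get_size()
Me = comm.Get_rank()

data = {"rank": Me, "payload": list(range(3))}    # any picklable Python object
# Send to the left neighbour, receive from the right one, in a single operation
data = comm.sendrecv(data, dest=(Me-1) % NbP, source=(Me+1) % NbP)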
MPI point-to-point communications
and global communication patterns
1. Principles of development with MPI
2. Dense matrix product on a ring: algorithm
3. Point-to-Point communications
4. Elementary examples of Pt-to-Pt comm.
5. Dense matrix product on a ring: core code
Elementary examples
Example introduction
[Figure: four processes P0, P1, P2, P3]
mpi4py implementation v1
mpi4py: Blocking Sendrecv_replace version
[Figure: A and C slices, with LOCAL_SIZE and SIZE dimensions]
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
NbP = comm.Get_size()
Me = comm.Get_rank()
SIZE = 4096
LOCAL_SIZE = SIZE//NbP

#Allocate slices of A, B and C
A_s = np.empty(LOCAL_SIZE*SIZE, dtype=np.float64)
B_s = np.empty(SIZE*LOCAL_SIZE, dtype=np.float64)
C_s = np.empty(SIZE*LOCAL_SIZE, dtype=np.float64)

#Init A and B slices
……

#Loop computation-communication
for step in range(NbP):
    LocalSliceProduct(A_s, B_s, C_s, LOCAL_SIZE, SIZE, step)
    comm.Sendrecv_replace(A_s, dest=(Me-1)%NbP, source=(Me+1)%NbP)

#Save matrix C
……
End
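The helper LocalSliceProduct() is not detailed on the slides; one possible sketch (an assumption, not the course's code), valid if at 'step' the process holds the block of rows of A initially owned by rank (Me+step)%NbP, and if the flat arrays are interpreted as row-major 2D slices:

def LocalSliceProduct(A_s, B_s, C_s, LOCAL_SIZE, SIZE, step):
    # Me and NbP are the module-level globals defined above (assumption)
    row0  = ((Me + step) % NbP) * LOCAL_SIZE        # rows of C computed at this step
    A_blk = A_s.reshape(LOCAL_SIZE, SIZE)           # circulating block of rows of A
    B_blk = B_s.reshape(SIZE, LOCAL_SIZE)           # static block of columns of B
    C_blk = C_s.reshape(SIZE, LOCAL_SIZE)           # static block of columns of C (view)
    C_blk[row0:row0+LOCAL_SIZE, :] = A_blk @ B_blk  # partial product for these rows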
Dense matrix product on a ring: core code
mpi4py implementation v2
mpi4py: Blocking Bsend/Recv version
[Figure: A and C slices, with LOCAL_SIZE and SIZE dimensions]
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
NbP = comm.Get_size()
Me = comm.Get_rank()
SIZE = 4096
LOCAL_SIZE = SIZE//NbP

#Allocate slices of A, B and C
A_s = np.empty(LOCAL_SIZE*SIZE, dtype=np.float64)
B_s = np.empty(SIZE*LOCAL_SIZE, dtype=np.float64)
C_s = np.empty(SIZE*LOCAL_SIZE, dtype=np.float64)

#Init A and B slices
……

#Allocate Bsend buffer
buff = MPI.Alloc_mem(A_s.nbytes + MPI.BSEND_OVERHEAD)
Part 1
Dense matrix product on a ring: core code
mpi4py implementation v2
mpi4py: Blocking Bsend/Recv version
#Loop computation-communication
for step in range(NbP):
    MPI.Attach_buffer(buff)
    LocalSliceProduct(A_s, B_s, C_s, LOCAL_SIZE, SIZE, step)
    comm.Bsend(A_s, dest=(Me-1)%NbP)
    comm.Recv(A_s, source=(Me+1)%NbP)
    MPI.Detach_buffer()

#Release Bsend buffer
MPI.Free_mem(buff)
mpi4py implementation v3
mpi4py: Blocking Ssend/Recv version
[Figure: A and C slices, with LOCAL_SIZE and SIZE dimensions]
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
NbP = comm.Get_size()
Me = comm.Get_rank()
SIZE = 4096
LOCAL_SIZE = SIZE//NbP

#Allocate slices of A, B and C
A_s = np.empty(LOCAL_SIZE*SIZE, dtype=np.float64)
Buff = np.empty(LOCAL_SIZE*SIZE, dtype=np.float64)
B_s = np.empty(SIZE*LOCAL_SIZE, dtype=np.float64)
C_s = np.empty(SIZE*LOCAL_SIZE, dtype=np.float64)

#Init A and B slices
……
Part 1
Dense matrix product on a ring: core code
mpi4py implementation v3
mpi4py: Blocking Ssend/Recv version
#Loop computation-communication
for step in range(NbP):
    LocalSliceProduct(A_s, B_s, C_s, LOCAL_SIZE, SIZE, step)
    if (Me%2 == 0):
        comm.Ssend(A_s, dest=(Me-1)%NbP)
        comm.Recv(A_s, source=(Me+1)%NbP)
    else:
        comm.Recv(Buff, source=(Me+1)%NbP)
        comm.Ssend(A_s, dest=(Me-1)%NbP)
        tmp = A_s      # A_s <-> Buff permutation
        A_s = Buff
        Buff = tmp

#Save matrix C
……
End
MPI point-to-point communications and
global communication patterns
Questions?