ST7 SHP 2.2 Message Passing: MPI p2p Communications
Stéphane Vialle
Stephane.Vialle@centralesupelec.fr
http://www.metz.supelec.fr/~vialle
HPC programming strategy
[Figure: a numerical algorithm mapped onto processes 0-3 that exchange messages (msg) over the network; two processes P1 and P2 can run on two nodes (node A, node B) or inside one node; data distribution patterns: scatter, gather, bcast]
• Message passing is mandatory to access data located in another node
• Message passing can also be used inside one node
Principles of development with MPI
Development steps
Numerical algorithm → Distributed algorithm → MPI code development → run on the target computer!
Distributed algorithm
Problem:
A, B, C: n × n matrices, i.e. N = n² elements each
C = A·B : $c_{ij} = \sum_{k=1}^{n} a_{ik} \cdot b_{kj}$
O(Nb of floating-point ops) = O(n³) = O(N^(3/2))
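As a point of comparison (not on the original slides), a minimal sequential NumPy reference of this product, which the distributed versions below must reproduce; the size n is arbitrary:

import numpy as np

n = 1024                     # hypothetical size, just for a quick sequential check
A = np.random.rand(n, n)
B = np.random.rand(n, n)
C = A @ B                    # c_ij = sum_k a_ik * b_kj : O(n^3) = O(N^(3/2)) flops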
Distributed algorithm
Partitioning on a ring of P processes:
• A split in blocks of rows • Displacement of A
• B and C split in blocks of columns • B and C static
[Figure, step 0 (initial state): partitioning and displacement of A, static partitioning of B, static partitioning of C]
Dense matrix product on a ring: algorithm
Distributed algorithm
Partitioning on a ring of P processes:
• A split in blocks of rows • Displacement of A
• B and C split in blocks of columns • B and C static
[Figure, step 1: the blocks of A have moved one position along the ring; B and C remain static]
Dense matrix product on a ring: algorithm
Distributed algorithm
Partitioning on a ring of P processes:
• A split in blocks of rows • Displacement of A
• B and C split in blocks of columns • B and C static
[Figure, step 2: the blocks of A have moved one more position along the ring; B and C remain static]
Dense matrix product on a ring: algorithm
Distributed algorithm
Partitioning on a ring of P processes:
• A split in blocks of rows • Displacement of A
• B and C split in blocks of columns • B and C static
[Figure: ring communication pattern across processes 0, 1, …, P-1; static partitioning of C]
Conclusion:
• Each PC has computed a block of C columns
• The P PCs have worked in parallel
→ Computation of all blocks of C columns in parallel, in P steps
Dense matrix product on a ring: algorithm
Distributed algorithm
Implementation strategies on a ring of P processes:

// Without overlapping
for (step=0; step<P; step++) {
    calcul();        // computation
    barrier();       // if needed
    circulation();   // communication
    barrier();       // if needed
}

// With overlapping
for (step=0; step<P; step++) {
    thread{ calcul(); }        // computation thread
    thread{ circulation(); }   // communication thread
    joinThreads();
    permutBuff();
}
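A minimal mpi4py sketch (my own, not from the slides) of the "with overlapping" loop above, with the communication placed in a separate Python thread; the names A_next and circulation are assumptions, the computation is reduced to a placeholder, and the MPI library is assumed to provide sufficient thread support:

import threading
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
NbP = comm.Get_size()
Me = comm.Get_rank()

A_s    = np.full(4, float(Me))     # current block, used by the computation
A_next = np.empty_like(A_s)        # block received during the computation

def circulation():
    # Send the current block to the left neighbour, receive the next one from the right
    comm.Sendrecv(A_s, dest=(Me-1) % NbP, recvbuf=A_next, source=(Me+1) % NbP)

for step in range(NbP):
    t_comm = threading.Thread(target=circulation)
    t_comm.start()                 # communication runs concurrently
    partial = A_s.sum()            # placeholder for calcul() on the current block
    t_comm.join()                  # joinThreads()
    A_s, A_next = A_next, A_s      # permutBuff(): swap computation/communication buffers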
[Figure: timeline of a blocking transfer: send, local copy, recv; start of Recv, data transfer, end of Recv]
Point-to-Point communications
Synchronous & blocking comm.: Ssend/Recv
Ssend(…):
• Requests a data exchange & waits for the (entire) data emission
• Returns when all the data have been sent
→ The original data storage must not be overwritten before the call returns
Recv(…):
• Requests a data exchange & waits for the (entire) data reception
• Returns when all the data have been received
[Figure: ranks 2k-1, 2k, 2k+1, each with its own Tab buffer]
Communications are appointments (rendezvous) between the sender and the receiver.
Two opposed approaches:
• An MPI program should never use standard blocking comms.: portability is the main objective
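A minimal mpi4py sketch (not from the slides) of a deadlock-free Ssend/Recv exchange on a ring, where even ranks send first and odd ranks receive first (assumes at least 2 processes; the array size is arbitrary):

import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
NbP = comm.Get_size()
Me = comm.Get_rank()

msg = np.full(4, float(Me))        # data sent to the right neighbour
buf = np.empty_like(msg)           # reception buffer for the left neighbour's data

if Me % 2 == 0:                    # even ranks: send first, then receive
    comm.Ssend(msg, dest=(Me+1) % NbP)
    comm.Recv(buf, source=(Me-1) % NbP)
else:                              # odd ranks: receive first, then send
    comm.Recv(buf, source=(Me-1) % NbP)
    comm.Ssend(msg, dest=(Me+1) % NbP)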
Frontier exchange with Sendrecv, comm. pattern 1 (P1, P2, P3):
Step 1:
  P1: Sendrecv(TabSR, dest=me+1, TabRR, source=me+1)
  P2: Sendrecv(TabSL, dest=me-1, TabRL, source=me-1)
  P3: Sendrecv(TabSR, dest=me+1, TabRR, source=me+1)
Step 2:
  P1: Sendrecv(TabSL, dest=me-1, TabRL, source=me-1)
  P2: Sendrecv(TabSR, dest=me+1, TabRR, source=me+1)
  P3: Sendrecv(TabSL, dest=me-1, TabRL, source=me-1)
Point-to-Point communications
Combined & Blocking comm.: Sendrecv
Sendrecv(…) : 1 send & 1 recv, 1 operation:
recvobj = sendrecv(sendobj, int dest=yyy, int sendtag=0,
int source=xxx, int recvtag=ANY_TAG, ……)
Sendrecv(sendNpArray, int dest=yyy, int sendtag=0,
recvNpArray, int source=xxx, int recvtag=ANY_TAG,……)
→ The NumPy arrays (send and receive buffers: TabSL, TabSR, …) must already exist, with the right size!
Frontier exchange with Sendrecv, comm. pattern 2 (all processes issue the same calls):
Step 1 (P1, P2, P3):
  Sendrecv(TabSL, dest=me-1, TabRR, source=me+1)
Step 2 (P1, P2, P3):
  Sendrecv(TabSR, dest=me+1, TabRL, source=me-1)
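A minimal mpi4py sketch (my own) of comm. pattern 2 on a non-periodic line of processes; the frontier size H is arbitrary, and MPI.PROC_NULL is used at the two ends so that every rank can issue exactly the same calls:

import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
NbP = comm.Get_size()
Me = comm.Get_rank()
left  = Me - 1 if Me > 0       else MPI.PROC_NULL
right = Me + 1 if Me < NbP - 1 else MPI.PROC_NULL

H = 8                                       # hypothetical frontier size
TabSL = np.full(H, float(Me))               # frontier sent to the left neighbour
TabSR = np.full(H, float(Me))               # frontier sent to the right neighbour
TabRL = np.empty(H, dtype=np.float64)       # frontier received from the left
TabRR = np.empty(H, dtype=np.float64)       # frontier received from the right

# Step 1: every process sends its left frontier and receives its right one
comm.Sendrecv(TabSL, dest=left,  recvbuf=TabRR, source=right)
# Step 2: every process sends its right frontier and receives its left one
comm.Sendrecv(TabSR, dest=right, recvbuf=TabRL, source=left)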
Point-to-Point communications
Combined & Blocking: Sendrecv_replace
Sendrecv_replace(…) : 1 send & 1 recv & only 1 buffer
Sendrecv_replace(NumpyArray, int dest=yyy, int sendtag=0,
int source=xxx, int recvtag=ANY_TAG, ……)
[Figure: P0, P1, P2, P3 on a ring, each circulating its Tab buffer]
Tab = np.empty(size, dtype=np.float64)
……
comm.Sendrecv_replace(Tab, dest=(me-1)%NbP, source=(me+1)%NbP)
……
• Blocking comm.: returns when both the Send part and the Recv part have completed
• For native objects: data = sendrecv(data, …)
• In this example: no need for a fine schedule, just follow the comm. pattern
• Easy to use & very efficient communications!
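A minimal sketch (not on the slide) of the lowercase, pickle-based sendrecv for native Python objects, circulating a small dictionary on the ring:

from mpi4py import MPI

comm = MPI.COMM_WORLD
NbP = comm.Get_size()
Me = comm.Get_rank()

data = {"rank": Me, "payload": list(range(3))}    # any picklable Python object
# Send to the left neighbour, receive from the right one, in a single operation
data = comm.sendrecv(data, dest=(Me-1) % NbP, source=(Me+1) % NbP)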
MPI point-to-point communications
and global communication patterns
1. Principles of development with MPI
2. Dense matrix product on a ring: algorithm
3. Point-to-Point communications
4. Elementary examples of Pt-to-Pt comm.
5. Dense matrix product on a ring: core code
Elementary examples
Example introduction
[Figure: four processes P0, P1, P2, P3]
mpi4py implementation v1
mpi4py: Blocking Sendrecv_replace version
[Figure: A and C slices, with LOCAL_SIZE and SIZE dimensions]
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
NbP = comm.Get_size()
Me = comm.Get_rank()
SIZE = 4096
LOCAL_SIZE = SIZE//NbP

#Allocate slices of A, B and C
A_s = np.empty(LOCAL_SIZE*SIZE, dtype=np.float64)
B_s = np.empty(SIZE*LOCAL_SIZE, dtype=np.float64)
C_s = np.empty(SIZE*LOCAL_SIZE, dtype=np.float64)

#Init A and B slices
……

#Loop computation-communication
for step in range(NbP):
    LocalSliceProduct(A_s, B_s, C_s, LOCAL_SIZE, SIZE, step)
    comm.Sendrecv_replace(A_s, dest=(Me-1)%NbP, source=(Me+1)%NbP)

#Save matrix C
……
End
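The helper LocalSliceProduct() is not detailed on the slides; one possible sketch (an assumption, not the course's code), valid if at 'step' the process holds the block of rows of A initially owned by rank (Me+step)%NbP, and if the flat arrays are interpreted as row-major 2D slices:

def LocalSliceProduct(A_s, B_s, C_s, LOCAL_SIZE, SIZE, step):
    # Me and NbP are the module-level globals defined above (assumption)
    row0  = ((Me + step) % NbP) * LOCAL_SIZE        # rows of C computed at this step
    A_blk = A_s.reshape(LOCAL_SIZE, SIZE)           # circulating block of rows of A
    B_blk = B_s.reshape(SIZE, LOCAL_SIZE)           # static block of columns of B
    C_blk = C_s.reshape(SIZE, LOCAL_SIZE)           # static block of columns of C (view)
    C_blk[row0:row0+LOCAL_SIZE, :] = A_blk @ B_blk  # partial product for these rows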
Dense matrix product on a ring: core code
mpi4py implementation v2
mpi4py: Blocking Bsend/Recv version
[Figure: A and C slices, with LOCAL_SIZE and SIZE dimensions]
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
NbP = comm.Get_size()
Me = comm.Get_rank()
SIZE = 4096
LOCAL_SIZE = SIZE//NbP

#Allocate slices of A, B and C
A_s = np.empty(LOCAL_SIZE*SIZE, dtype=np.float64)
B_s = np.empty(SIZE*LOCAL_SIZE, dtype=np.float64)
C_s = np.empty(SIZE*LOCAL_SIZE, dtype=np.float64)

#Init A and B slices
……

#Allocate Bsend buffer
buff = MPI.Alloc_mem(A_s.nbytes + MPI.BSEND_OVERHEAD)
Part 1
Dense matrix product on a ring: core code
mpi4py implementation v2
mpi4py: Blocking Bsend/Recv version
#Loop computation-communication
for step in range(NbP):
    MPI.Attach_buffer(buff)
    LocalSliceProduct(A_s, B_s, C_s, LOCAL_SIZE, SIZE, step)
    comm.Bsend(A_s, dest=(Me-1)%NbP)
    comm.Recv(A_s, source=(Me+1)%NbP)
    MPI.Detach_buffer()

#Release Bsend buffer
MPI.Free_mem(buff)
mpi4py implementation v3
mpi4py: Blocking Ssend/Recv version
[Figure: A and C slices, with LOCAL_SIZE and SIZE dimensions]
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
NbP = comm.Get_size()
Me = comm.Get_rank()
SIZE = 4096
LOCAL_SIZE = SIZE//NbP

#Allocate slices of A, B and C
A_s = np.empty(LOCAL_SIZE*SIZE, dtype=np.float64)
Buff = np.empty(LOCAL_SIZE*SIZE, dtype=np.float64)
B_s = np.empty(SIZE*LOCAL_SIZE, dtype=np.float64)
C_s = np.empty(SIZE*LOCAL_SIZE, dtype=np.float64)

#Init A and B slices
……
Part 1
Dense matrix product on a ring: core code
mpi4py implementation v3
mpi4py: Blocking Ssend/Recv version
#Loop computation-communication
for step in range(NbP):
    LocalSliceProduct(A_s, B_s, C_s, LOCAL_SIZE, SIZE, step)
    if (Me%2 == 0):
        comm.Ssend(A_s, dest=(Me-1)%NbP)
        comm.Recv(A_s, source=(Me+1)%NbP)
    else:
        comm.Recv(Buff, source=(Me+1)%NbP)
        comm.Ssend(A_s, dest=(Me-1)%NbP)
        tmp = A_s      # A_s <-> Buff permutation
        A_s = Buff
        Buff = tmp

#Save matrix C
……
End
MPI point-to-point communications and
global communication patterns
Questions?