Distributed Memory Programming Using MPI
01 Introduction
02 Blocking and Non-Blocking Communication
03 Deadlocks
04 Splitting & Collective Communication
Programming (Computing)
Sequential programming vs. parallel programming
Serial program
1. A problem is broken into a discrete series of instructions.
2. Instructions are executed sequentially, one after another, on a single processor; only one instruction executes at any moment.
[Diagram: a problem broken into sub-problems 1–4, executed one after another on a single processor]
Parallel program
1. The use of multiple computers, processors, or cores working together on a common task.
2. Each processor works on a section of the problem.
3. Processors are allowed to exchange information (data in local memory) with other processors.
[Diagram: the problem is divided among multiple processors that work on their sections in parallel]
MPI datatypes (examples)
MPI datatype    C type
MPI_FLOAT       float
MPI_DOUBLE      double
MPI_INT         int
Process
The instance of a computer program that is being executed. It contains the program code and its state.
Group
1. A group of processes.
2. Each group is associated with a communicator.
3. The base group contains all processes and is associated with MPI_COMM_WORLD.
Communicator
An object that lets processes of the same group communicate with each other.
Rank
A unique ID for each process within a certain group.
Minimal set of routines
// initializes the MPI environment
1. int MPI_Init(int *argc, char ***argv)
• The MPI processes may run on different machines.
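A minimal sketch of how the initialization routines fit together (the hello-world printf is illustrative, not from the slides):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);                       // initialize the MPI environment

    int world_size, world_rank;
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);   // number of processes
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);   // rank of this process

    printf("Hello from process %d of %d\n", world_rank, world_size);

    MPI_Finalize();                               // clean up the MPI environment
    return 0;
}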
02
Point to Point Communication
There are two types of communication: point-to-point (which we will call P2P from now on) and collective. P2P communication is divided into two operations: Send and Receive.
The most basic form of P2P communication is blocking communication. The process sending a message waits until the receiving process has finished receiving all the information. This is the easiest form of communication, but not necessarily the fastest.
Send/receive
First, process A decides a message needs to be sent to process B. Process A then packs up all of its necessary data
into a buffer for process B. These buffers are often referred to as envelopes since the data is being packed into a
single message before transmission (similar to how letters are packed into envelopes before transmission to the post
office). After the data is packed into a buffer, the communication device (which is often a network) is responsible
for routing the message to the proper location. The location of the message is defined by the process’s rank.
Even though the message is routed to B, process B still has to acknowledge that it wants to receive A's data. Once it does this, the data has been transmitted. Process A receives acknowledgement that the data has been transmitted and may go back to work.
Example

Syntax:
MPI_Send(void* data, int count, MPI_Datatype datatype, int destination, int tag, MPI_Comm communicator)
MPI_Recv(void* data, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm communicator, MPI_Status* status)

Code:
int world_rank;
MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
int world_size;
MPI_Comm_size(MPI_COMM_WORLD, &world_size);
int number;
if (world_rank == 0) {
    number = -1;
    MPI_Send(&number, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
} else if (world_rank == 1) {
    MPI_Recv(&number, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    printf("Process 1 received number %d from process 0\n", number);
}

Output: Process 1 received number -1 from process 0
MPI_Status
Blocking communication modes: Synchronous, Buffered, Ready, Standard.
Synchronous Blocking (MPI_Ssend)
Message transfer must be preceded by a sender-receiver "handshake". When the blocking synchronous send MPI_Ssend is executed, the sending task sends the receiving task a "ready to send" message. When the receiver executes the receive call, it sends a "ready to receive" message. Once both processes have successfully received the other's "ready" message, the actual data transfer can begin.
Synchronization overhead
▪ Synchronous send has more waiting, because a handshake must arrive before the send can
occur.
▪ If the receiver posts a blocking receive first, then the synchronization delay will occur on
the receiving side. Given large numbers of independent processes on disparate systems,
keeping them in sync is a challenging task and separate processes can rapidly get out of
sync causing fairly major synchronization overhead. MPI_Barrier() can be used to try to
keep nodes in sync, but probably doesn’t reduce actual overhead. MPI_Barrier() blocks
processing until all tasks have checked in (i.e., synced up), so repeatedly calling Barrier
tends to just shift synchronization overhead out of MPI_Send/Recv and into the Barrier
calls.
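A minimal sketch of the synchronous mode described above (assumes world_rank from MPI_Comm_rank as in the earlier example; the payload and tag are illustrative):

int data = 42;   // illustrative payload
if (world_rank == 0) {
    // Returns only once rank 1 has started the matching receive (handshake done).
    MPI_Ssend(&data, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
} else if (world_rank == 1) {
    MPI_Recv(&data, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}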
Buffered Blocking (MPI_Bsend)
▪ In buffered mode, the message is first copied into a user-supplied buffer and the send returns as soon as that copy completes. The sending task can then proceed with calculations that modify the original message buffer, knowing that these modifications will not be reflected in the data actually sent.
▪ Buffered mode incurs extra system overhead, because of the additional copy from the message
buffer to the user-supplied buffer. On the other hand, synchronization overhead is eliminated on the
sending task—the timing of the receive is now irrelevant to the sender. Synchronization overhead
can still be incurred by the receiving task, though, because it must block until the send has been
completed.
▪ In buffered mode, the programmer is responsible for allocating and managing the data buffer (one
per process), by using calls to MPI_Buffer_attach and MPI_Buffer_detach. This has the advantage
of providing increased control over the system, but also requires the programmer to safely manage
this space. If a buffered-mode send requires more buffer space than is available, an error will be
generated, and (by default) the program will exit.
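A minimal sketch of buffered mode (the buffer size and payload are illustrative; assumes rank 1 posts a matching MPI_Recv):

#include <stdlib.h>

int payload = 7;                                   // illustrative payload
int bufsize = sizeof(int) + MPI_BSEND_OVERHEAD;    // room for one message plus MPI bookkeeping
char *buf = malloc(bufsize);

MPI_Buffer_attach(buf, bufsize);                   // programmer-managed buffer (one per process)
MPI_Bsend(&payload, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);  // returns after copying into buf
// ... payload may be modified here without affecting the message in flight ...
MPI_Buffer_detach(&buf, &bufsize);                 // blocks until buffered messages are delivered
free(buf);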
Advantages / Disadvantages

Synchronous Blocking
  Advantages:
    ▪ Safest, therefore most portable
    ▪ No need for extra buffer space
    ▪ SEND/RECV order not critical
  Disadvantages:
    ▪ Can incur substantial synchronization overhead

Buffered Blocking
  Advantages:
    ▪ Timing of the corresponding receive is irrelevant
    ▪ Synchronization overhead on the sender is eliminated
  Disadvantages:
    ▪ Copying to the buffer incurs additional system overhead
Ready Blocking (MPI_Rsend())
• Ready mode expects a ready destination: if the corresponding Recv has not yet been posted, the message that will be received might be ill-defined, so you have to make sure the receiving process has posted a Recv before calling the ready-mode send (see the sketch below).
• It is the programmer's responsibility to make sure that the data sent has a waiting receive.
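A sketch of one way to guarantee the receive is already posted before MPI_Rsend; the barrier-based ordering is an illustrative pattern (assumes world_rank as before):

int value = 3;                        // illustrative payload
MPI_Request req;
if (world_rank == 1) {
    // Post the receive first, so it is guaranteed to exist before the ready send.
    MPI_Irecv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &req);
}
MPI_Barrier(MPI_COMM_WORLD);          // no one proceeds until the Irecv has been posted
if (world_rank == 0) {
    MPI_Rsend(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);   // safe: the matching receive exists
} else if (world_rank == 1) {
    MPI_Wait(&req, MPI_STATUS_IGNORE);
}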
Standard Blocking (MPI_Send)
● This is the standard send mode (the default, "no particular mode" mode): MPI is allowed to buffer, either on the sender or receiver side, or to wait for the matching receive.
● MPI decides which scenario is the best in terms of performance, memory, and so on. This might be
heavily dependent on the implementation.
● In any case, the data can be safely modified after the function returns. You can also reuse the buffer.
Standard blocking (Open MPI case)
▪ In Open MPI, the observed behavior is that short messages are automatically buffered on the receiver side (the eager protocol), while long messages are sent in a mode somewhat close to the synchronous mode (the rendezvous protocol).
Non-blocking Communication
Non-blocking send modes:
  Buffered: MPI_Ibsend
  Ready: MPI_Irsend
  Standard: MPI_Isend
Non-blocking receive: MPI_Irecv (one receive call matches all send modes)
Non-blocking communication
• Syntax and example: see the sketch below.
• Each non-blocking call returns an MPI_Request. Once this request has been prepared, it is necessary to complete it. There are two ways of completing a request: wait and test.
• Non-blocking does not mean asynchronous: the function returns while the data exchange is still running in the background. This means you cannot reuse the buffer until MPI_Wait or MPI_Test reports completion.
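A minimal sketch of the standard non-blocking pair (signatures as in the MPI standard; the payload is illustrative and world_rank is assumed from earlier):

// int MPI_Isend(const void *buf, int count, MPI_Datatype datatype, int dest,
//               int tag, MPI_Comm comm, MPI_Request *request);
// int MPI_Irecv(void *buf, int count, MPI_Datatype datatype, int source,
//               int tag, MPI_Comm comm, MPI_Request *request);

int number = -1;                          // illustrative payload
MPI_Request req;
if (world_rank == 0) {
    MPI_Isend(&number, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &req);
    // ... do unrelated work here, but do NOT touch `number` yet ...
    MPI_Wait(&req, MPI_STATUS_IGNORE);    // now the buffer may be reused
} else if (world_rank == 1) {
    MPI_Irecv(&number, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &req);
    MPI_Wait(&req, MPI_STATUS_IGNORE);    // `number` is valid only after this
}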
Non-blocking communication
• MPI_Wait:
int MPI_Wait(MPI_Request *request, MPI_Status *status);
int MPI_Waitany(int count, MPI_Request array_of_requests[], int *index, MPI_Status *status);
▪ Waiting forces the process to go into "blocking mode". The sending process will simply wait for the request to finish. If your process waits right after MPI_Isend, the send is equivalent to calling MPI_Send. There are two ways to wait: MPI_Wait and MPI_Waitany.
▪ The former, MPI_Wait just waits for the completion of the given request. As soon as the request is
complete an instance of MPI_Status is returned in status.
▪ The latter, MPI_Waitany waits for the first completed request in an array of requests to continue. As
soon as a request completes, the value of index is set to store the index of the completed request of
array_of_requests. The call also stores the status of the completed request.
Non-blocking communication
• MPI_Test:
int MPI_Test(MPI_Request *request, int *flag, MPI_Status *status);
int MPI_Testany(int count, MPI_Request array_of_requests[], int *index, int *flag, MPI_Status *status);
▪ Testing is a little bit different. As we've seen right before, waiting blocks the process until the request
(or a request) is fulfilled. Testing checks if the request can be completed. If it can, the request is
automatically completed and the data transferred. As with wait, there are two ways of testing: MPI_Test and MPI_Testany.
▪ As for MPI_Wait, the parameters request and status hold no mystery. Now remember that testing is non-
blocking, so in any case the process continues execution after the call. The variable flag is there to tell
you if the request was completed during the test or not. If flag != 0 that means the request has been
completed.
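A sketch of overlapping computation with communication by polling MPI_Test (the do_some_work() helper is hypothetical; `number` and the source rank are illustrative):

MPI_Request req;
int flag = 0;
int number = 0;                                    // illustrative receive buffer
MPI_Irecv(&number, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &req);

while (!flag) {
    do_some_work();                                // hypothetical: useful computation while waiting
    MPI_Test(&req, &flag, MPI_STATUS_IGNORE);      // non-blocking check; sets flag != 0 on completion
}
// Here the request is complete and `number` holds the received value.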
Race condition
Message exchange in one call
1. MPI_Sendrecv 🡪 uses two buffers, one for the outgoing message and one for the incoming message.
2. MPI_Sendrecv_replace 🡪 uses only one buffer; the received data overwrites the data that was sent.
Benefit 🡪 we use them in shift operations.
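A minimal sketch of a ring shift with MPI_Sendrecv_replace (one call both sends and receives into the same buffer; the tag and payload are illustrative, world_rank/world_size as before):

int right = (world_rank + 1) % world_size;               // neighbor to send to
int left  = (world_rank - 1 + world_size) % world_size;  // neighbor to receive from
int token = world_rank;                                   // illustrative payload

// Send `token` to the right neighbor and replace it with the value from the left neighbor.
MPI_Sendrecv_replace(&token, 1, MPI_INT, right, 0, left, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);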
When to use which?
Still working
Deadlock with MPI_Send/Recv
(Blocking-Comm.)
int a[10], b[10], myrank;
MPI_Status s1, s2;
...
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
if (myrank == 0) {
    MPI_Send(a, 10, MPI_INT, 1, 1, MPI_COMM_WORLD);
    MPI_Send(b, 10, MPI_INT, 1, 2, MPI_COMM_WORLD);
}
else if (myrank == 1) {
    MPI_Recv(b, 10, MPI_INT, 0, 2, MPI_COMM_WORLD, &s1);
    MPI_Recv(a, 10, MPI_INT, 0, 1, MPI_COMM_WORLD, &s2);
}
...
Solution for the previous case
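A sketch of one standard fix: make rank 1 receive the messages in the same order rank 0 sends them, so the first send always has a matching receive:

else if (myrank == 1) {
    // Receive tag 1 first, matching the order of the sends on rank 0.
    MPI_Recv(a, 10, MPI_INT, 0, 1, MPI_COMM_WORLD, &s2);
    MPI_Recv(b, 10, MPI_INT, 0, 2, MPI_COMM_WORLD, &s1);
}

Non-blocking calls (MPI_Irecv posted before the sends) or MPI_Sendrecv are other common ways to break this kind of cyclic wait.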
Splitting communicators (MPI_Comm_split)
MPI_Comm_split, the splitting function, is a blocking call. That means that all the processes in the original communicator have to call the function for it to return. The splitting function in MPI is prototyped as follows:
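int MPI_Comm_split(MPI_Comm comm, int color, int key, MPI_Comm *newcomm);

The usage line below is an illustrative sketch (the even/odd color choice and the name row_comm are assumptions):

// Split MPI_COMM_WORLD into "even" and "odd" sub-communicators,
// keeping the original rank order inside each new communicator.
MPI_Comm row_comm;
MPI_Comm_split(MPI_COMM_WORLD, world_rank % 2, world_rank, &row_comm);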
Collective Communication

Data transfer:
- Movement of data
- Reduction of data

Why do we use them?
- Programs become easier to understand
- Every process does roughly the same thing
- No "strange" communication patterns

Types of collective operations:
▪ Synchronization
▪ Data movement
▪ Collective computation
Synchronization
● MPI_Barrier( comm )
● Make sure that all the processes call it, or your program will idle: the barrier blocks until every process in the communicator has reached it.
One to all:
- Broadcast
- Scatter (personalized)
All to all:
- Allgather
- All to all (personalized)
Tree-Based Communication
A tree-based broadcast reaches all N processes in O(log N) stages.
[Diagram: stage 1, stage 2, ... of the broadcast tree]
Bcast SYNTAX
MPI_Bcast(
    void* data,              // the buffer with data
    int count,
    MPI_Datatype datatype,
    int root,                // the source
    MPI_Comm communicator)   // the communicator containing the receiving processes
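A minimal usage sketch (the payload value and root rank 0 are illustrative; world_rank as before):

int value = 0;
if (world_rank == 0) {
    value = 100;                 // only the root has the data initially
}
// Every process in the communicator must make this exact call.
MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);
// After the call, `value` == 100 on every process.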
COMMON COLLECTIVES
- MPI_Scatter
- MPI_Gather
- MPI_Reduce
- MPI_Alltoall
MPI_Scatter
MPI_Scatter(
    void* send_data,              // array of data that resides on the root process
    int send_count,               // number of elements sent to each process
    MPI_Datatype send_datatype,   // datatype (MPI_INT for example)
    void* recv_data,              // buffer that will hold received data
    int recv_count,               // number of elements to be received
    MPI_Datatype recv_datatype,   // datatype of elements to be received
    int root,                     // the root process that is scattering the array
    MPI_Comm communicator         // the communicator in which the root process resides
)
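A usage sketch, assuming 4 processes and an illustrative 8-element array (so each process receives 2 elements):

int send_buf[8] = {1, 2, 3, 4, 5, 6, 7, 8};   // significant only on the root process
int recv_buf[2];

MPI_Scatter(send_buf, 2, MPI_INT,             // send_count is per receiving process
            recv_buf, 2, MPI_INT,
            0, MPI_COMM_WORLD);
// Rank r now holds {2r + 1, 2r + 2} in recv_buf.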
Scatter vs. Broadcast
Broadcast sends the same data to every process in the communicator; Scatter splits an array into chunks and sends a different chunk to each process.
MPI_Gather
MPI_Gather(
    void* send_data, int send_count, MPI_Datatype send_datatype,
    void* recv_data, int recv_count, MPI_Datatype recv_datatype,
    int root, MPI_Comm communicator
)

MPI_Allgather
MPI_Allgather(
    void* send_data, int send_count, MPI_Datatype send_datatype,
    void* recv_data, int recv_count, MPI_Datatype recv_datatype,
    MPI_Comm communicator
)
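A minimal sketch contrasting the two (each process contributes one int; the 4-process assumption and values are illustrative):

int mine = world_rank * 10;           // illustrative per-process value
int all[4];                           // assumes 4 processes

// Gather: only rank 0 ends up with {0, 10, 20, 30} in `all`.
MPI_Gather(&mine, 1, MPI_INT, all, 1, MPI_INT, 0, MPI_COMM_WORLD);

// Allgather: every process ends up with {0, 10, 20, 30} in `all`.
MPI_Allgather(&mine, 1, MPI_INT, all, 1, MPI_INT, MPI_COMM_WORLD);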
[Diagram: Allgather — each process contributes one data block (A0, A1, ...); afterwards every process holds all the blocks]
MPI_Reduce(
    void* send_data,        // array of elements of type datatype
    void* recv_data,        // only significant on the process of rank root; it contains the reduced result
    int count,              // number of elements in send_data (and in the result)
    MPI_Datatype datatype,
    MPI_Op op,              // the operation you wish to apply to your data
    int root,               // the root process that will hold the result
    MPI_Comm communicator
)
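A minimal usage sketch summing one int per process with the predefined MPI_SUM operation (local values are illustrative; world_rank and the stdio include as before):

int local = world_rank + 1;   // illustrative local value
int total = 0;

// Combine the `local` values from every process; only rank 0 receives the result.
MPI_Reduce(&local, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
if (world_rank == 0) {
    printf("Sum over all ranks = %d\n", total);   // with 4 processes: 1 + 2 + 3 + 4 = 10
}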
MPI_Op examples: MPI_SUM, MPI_PROD, MPI_MAX, MPI_MIN, MPI_LAND, MPI_LOR.
MPI_Alltoall

Before: each process starts with its own set of blocks, one destined for each process (rows = processes, columns = data blocks):
A0 A1 A2 A3 A4 A5
B0 B1 B2 B3 B4 B5
C0 C1 C2 C3 C4 C5
D0 D1 D2 D3 D4 D5
E0 E1 E2 E3 E4 E5
F0 F1 F2 F3 F4 F5

After: each process finishes with all the blocks destined for itself:
A0 B0 C0 D0 E0 F0
A1 B1 C1 D1 E1 F1
A2 B2 C2 D2 E2 F2
A3 B3 C3 D3 E3 F3
A4 B4 C4 D4 E4 F4
A5 B5 C5 D5 E5 F5
05
Command line
To install MPI on Visual Studio, refer to this link.
Run the Hello World program from the Command Prompt:
mpiexec -np <number of processes to run> <.exe file>
Common Errors
Doing Things before MPI_Init()
✔ MPI_Init() isn't like fork(); it doesn't create processes.
✔ The processes are already created when you run the command (refer to slide 108).
✔ The processes already exist, but they just don't know about each other, so they can't communicate.
✔ MPI_Init() just initializes the MPI library (i.e., makes the processes in the communicator know about each other).
✔ The MPI Standard says little about the situation before MPI_Init() and after MPI_Finalize().
✔ So you have to write whatever you want enclosed between them.
✔ Writing outside them leads to unexpected & implementation-dependent results.
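A minimal sketch of the safe structure (the printf line is illustrative):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    // Risky zone: before MPI_Init() the MPI standard guarantees almost nothing.
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    printf("rank %d: all MPI calls and program logic go here\n", rank);

    MPI_Finalize();
    // Risky zone again: avoid MPI-related work after MPI_Finalize().
    return 0;
}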
Common Errors
Expecting argc & argv to be passed to all
processes
The first implementation doesn't generate errors, but some implementations don't pass these arguments to all processes, so you have to pass them explicitly using the second implementation.
Common Errors
Matching MPI_Bcast() with MPI_Recv()
If the root calls MPI_Bcast() while the other processes call MPI_Recv(), the program gets STUCK; when every process calls MPI_Bcast(), it WORKS.
REMEMBER: MPI_Bcast() must be called in all processes, with the same arguments.
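A sketch of the two patterns (the variable and values are illustrative; world_rank as before):

int value = 0;
if (world_rank == 0) value = 42;

// WRONG (hangs): broadcast on the root, point-to-point receives elsewhere.
// if (world_rank == 0) MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);
// else                 MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

// RIGHT: every process makes the same MPI_Bcast call.
MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);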
Common Errors
Assuming your MPI implementation is thread safe