
Implementing MPI under AP/Linux

David Sitsky    Paul Mackerras    Andrew Tridgell    David Walsh

CAP Research Program
Australian National University
Canberra, ACT 0200 Australia
sits@cafe.anu.edu.au

Abstract

A preliminary MPI library has been implemented for the Fujitsu AP1000+ multicomputer running the AP/Linux operating system. Under this environment, parallel programs may be dedicated to a fixed partition, or a number of parallel programs may share a partition. Therefore, the MPI library has been constructed so that messaging operations can be driven by polling and/or interrupt techniques. It has been found that polling works well when a single parallel program is running on a given partition, and that interrupt-driven communication makes far better use of the machine when multiple parallel programs are executing. Gang scheduling of multiple parallel programs which use polling was found to be relatively ineffective.

1 Introduction

MPI has previously been implemented on both the Fujitsu AP1000 [3] and AP1000+ [2, 4] multicomputers running the CellOS operating system. Under CellOS, when a parallel program is initiated, the machine is reset and the kernel is loaded before launching the parallel program on all nodes. This single-user, single-program environment means that techniques such as polling are acceptable [6, 5].

However, when running under a multicomputer operating system which supports multiple parallel programs sharing processors, a more sophisticated strategy needs to be employed in the message passing library in order to achieve the best utilization of the machine's resources.

The Linux operating system has been ported to the AP1000+ with some extensions to support parallel programs. This paper presents a preliminary implementation of an MPI library which has been built to work within this environment. Section 2 provides an overview of the AP1000+ multicomputer, while Section 3 introduces the AP/Linux operating system. The implementation of point-to-point communication is addressed in Section 4, with a performance evaluation of the library in Section 5.

2 Overview of AP1000+

The AP1000+ is a distributed memory multicomputer which can include up to 1024 nodes. As shown in Figure 1, the nodes are connected by three independent networks. Point-to-point communication is realized through the T-net, which is a wormhole-routed 2-D torus network with a bandwidth of 25 MBytes/sec per link. The B-net (broadcast network) is used for communication with the host machine and for broadcasting data to all cells from either the host or a cell. The S-net (synchronization network) provides fast barrier synchronization across the entire machine.

[Figure 1. AP1000+ System Configuration: AP1000+ cells connected to a Sun-4 host]

Each node consists of a 50MHz SuperSPARC processor and a message controller (MSC+) which controls the flow of data between the T-net and the memory of the CPU. Messages can be sent to another cell without kernel intervention by writing directly to the MSC+ registers. Memory protection is based on the notion of "contexts" as used in the definition of the SPARC Reference MMU. A context represents a mapping of virtual addresses to physical addresses. Under a Unix-like operating system, each process will have its own context. Contexts are referenced by a context-ID (a small integer), and are represented by page tables stored in system memory.

Requests sent to the MSC+ specify all addresses as virtual addresses in the context of the current process. The context-ID of the current process is sent in the message header and used by the receiving MSC+ to map virtual addresses on the remote node to physical addresses. Invalid or unmapped addresses cause the MSC+ to interrupt the CPU in order for it to either map in a page (as for a normal page fault), or for it to signal an error to the MSC+. In the first case, the transfer is continued; in the second, it is aborted.
3 Overview of AP/Linux

Linux is a modern operating system developed by a loosely-knit group of people on the internet, led by Linus Torvalds of the University of Helsinki, Finland. It provides a POSIX compliant kernel and user environment, along with all the features that are expected in a normal workstation environment. Linux is most widely used on Intel x86 processors, although it is also available for a range of other processor architectures including Alpha, M68000, MIPS, PowerPC and SPARC.

The AP/Linux operating system is based on Linux with some appropriate extensions to support a multicomputer environment. The most significant extension is the notion of a "parallel process" or parallel program, which consists of a process on a collection of nodes, each with the same process and context ID. A very simple gang-scheduler has been written to support the execution of parallel processes.
Communication using the MSC+ is achieved using the procedures described below:

    put(dest_node, local_addr, size, remote_addr, remote_flag, local_flag)
    get(src_node, local_addr, size, remote_addr, remote_flag, local_flag)

The put function will transfer the data segment of size bytes from local_addr to node dest_node at address remote_addr. The local_flag variable on the sender is incremented after the data is sent. The remote_flag variable on the receiver is incremented once the message has been transferred into the receiver's memory. The get operation is similar to put, except that it is initiated by the receiver instead of the sender. Importantly, these operations do not require a kernel trap, the data is transferred without any intermediate copying, and both functions return once the appropriate command words have been written to the MSC+ registers.
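To make the flag-based completion model concrete, the following C sketch shows how a sender might drive a put transfer and poll for its completion. The prototype, header and flag types are assumptions for illustration, not the actual AP/Linux declarations.

    /* Illustrative sketch only: the put() prototype and flag handling shown
     * here are assumptions, not the actual AP/Linux library declarations. */
    #include <stddef.h>

    extern void put(int dest_node, void *local_addr, size_t size,
                    void *remote_addr, volatile int *remote_flag,
                    volatile int *local_flag);

    static char send_buf[4096];
    static volatile int send_flag;   /* incremented once the data has left send_buf */

    void send_block(int dest_node, void *remote_addr, volatile int *remote_flag)
    {
        int before = send_flag;

        /* Returns as soon as the command words have been written to the MSC+
         * registers; the transfer itself proceeds asynchronously. */
        put(dest_node, send_buf, sizeof(send_buf),
            remote_addr, remote_flag, &send_flag);

        /* Simple polling completion: wait for the local flag to be incremented. */
        while (send_flag == before)
            ;   /* (a full library would also drain the ring buffer here) */
    }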
Each process in a parallel program has a special ring buffer which is used to receive messages from other nodes. A non-blocking send operation which uses the put command is provided, where the destination address is the ring buffer. The MSC+ hardware maintains read and write pointers to determine where incoming messages are placed.

AP/Linux provides support for interrupt-driven communication. The following functions are available:

    put_signal(node_id, signal_number)
    get_signal(node_id, signal_number)

A call to put_signal will instruct the MSC+ to send a message to a special memory location on node node_id which will cause an interrupt. The interrupt handler generates the specified signal for the process with the same context ID as the sender. The get_signal function works in a similar fashion to put_signal, except that the specified signal is generated on the local node after it is generated on the remote node. The use of these two functions is very important for implementing an interrupt-driven communication library, as the following section illustrates.

4 Point-to-point communication

This section describes the implementation of the library's basic non-blocking send and receive primitives. The goals of this implementation were:

- For large messages, no intermediate copying should be performed when transferring a message between a send and a receive buffer. This is particularly important on a machine like the AP1000+, where the speed of memory copying is less than the bandwidth of the T-net.

- The transfer of message data should occur while processes are executing, to facilitate the overlap of communication and computation.

- The user should be able to select whether polling and/or interrupt techniques are used to detect the arrival of new messages and the completion of data transfers.

4.1 Message transfer methods

The following sections describe the two different send methods, in-place and protocol, that the library employs. When an MPI program is executed, a threshold value can be specified which indicates that messages of less than the threshold size will be sent using the in-place method, while those greater than the threshold size are sent using the protocol method.
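As a rough sketch of this size-based dispatch (all identifiers below are invented for illustration; they are not the library's real names):

    /* Illustrative sketch of the choice between the two send methods. */
    #include <stddef.h>

    struct send_request;                         /* opaque nonblocking send request */

    extern size_t inplace_threshold;             /* chosen by the user at startup   */
    extern void send_inplace(const void *buf, size_t size, int dest,
                             struct send_request *req);   /* via receiver's ring buffer */
    extern void send_protocol(const void *buf, size_t size, int dest,
                              struct send_request *req);  /* rendezvous + remote get    */

    void dispatch_send(const void *buf, size_t size, int dest,
                       struct send_request *req)
    {
        if (size < inplace_threshold)
            send_inplace(buf, size, dest, req);    /* small: copy through ring buffer */
        else
            send_protocol(buf, size, dest, req);   /* large: no intermediate copy     */
    }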

[Figure 2. In-place method]

[Figure 3. Protocol method]

4.1.1 In-place method

This method transfers the message data into the receiver's ring buffer. Upon detection of this message and a matching receive request, the receiver will copy the message data from the ring buffer into the specified receive buffer. This is illustrated in Figure 2.

During the execution of an MPI program, two queues are maintained by the library. The first is known as the "unmatched received messages" queue, which contains, in time order, all messages which have been received from other nodes but have not yet been matched. The second queue is known as the "pending nonblocking receives" queue, which contains, in time order, all nonblocking receive requests for which no matching message has been received.

When a message sent using this method is detected by the receiver, the nonblocking receive request queue is scanned for any matching requests. If a matching request is located, the message is copied from the ring buffer into the receiver's buffer specified in the request object, which is subsequently dequeued and marked as completed, and the message in the ring buffer is removed. If a matching receive request is not found, a new entry is appended to the unmatched received messages queue, containing the necessary details of the message in the ring buffer.

A nonblocking receive operation is implemented by first scanning the unmatched received messages queue for a match. If a match is detected, the message is copied into the receiver's buffer and the message in the ring buffer is removed, as is its associated element in the unmatched received messages queue. If no match is detected, the nonblocking receive request object is appended to the pending nonblocking receives queue.

Since the ring buffer is used to store the message, the send request does not have to wait for the receiver to post a matching nonblocking receive. However, an extra memory copy is required to copy the data to the receiver's buffer. For this reason, this method is intended to be used mainly for sending small messages, where the overheads of the memory copy will be relatively small.
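The queue handling just described might look roughly like the following sketch. The structure layouts and helper functions are invented for illustration and are not the library's actual internals.

    /* Illustrative sketch of the two-queue matching logic; all names hypothetical. */
    #include <stddef.h>
    #include <string.h>

    struct ring_msg { int src, tag; void *data; size_t len; struct ring_msg *next; };
    struct recv_req { int src, tag; void *buf;  int complete; struct recv_req *next; };

    static struct ring_msg *unmatched_msgs;   /* received but not yet matched, in time order */
    static struct recv_req *pending_recvs;    /* posted receives with no matching message    */

    extern struct recv_req *match_and_unlink(struct recv_req **list, int src, int tag);
    extern void append_unmatched(struct ring_msg **list, struct ring_msg *msg);
    extern void release_ring_slot(struct ring_msg *msg);

    /* Called when a new in-place message is noticed in the ring buffer. */
    void handle_inplace_message(struct ring_msg *msg)
    {
        struct recv_req *req = match_and_unlink(&pending_recvs, msg->src, msg->tag);

        if (req) {
            memcpy(req->buf, msg->data, msg->len);   /* ring buffer -> receive buffer  */
            req->complete = 1;
            release_ring_slot(msg);                  /* remove message from the ring   */
        } else {
            append_unmatched(&unmatched_msgs, msg);  /* remember it for a later receive */
        }
    }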
4.1.2 Protocol method

This method uses the get operation to transfer the message directly from the sender's buffer to the receiver's buffer, avoiding any intermediate message copying. Instead of sending the message directly to the receiver, a small fixed-size message known as the "protocol message" is sent to the receiver. This message contains details such as the address of the message, the rank of the sender, the tag and context of the message, and the address of an integer variable which will be used as the remote_flag parameter in a get operation. This operation can be seen in Figure 3.

The same algorithm is used as for the in-place method when processing new incoming messages. If a matching nonblocking receive request is found, there is enough information to initiate a get operation. From the protocol message, the src_node, remote_addr, size and remote_flag parameters are known, while the matching nonblocking request object contains the local_addr and local_flag parameters. After the get call is issued, the MPI library returns so that the process can continue executing while the message is being transferred. The remote_flag and local_flag variables are stored in the associated nonblocking receive and send request objects, and when incremented they indicate that the operation has completed.

This method has the benefit of avoiding intermediate message copying. However, it requires a rendezvous between the sender and receiver before the message can be transferred. The overhead of the get call makes the in-place method more attractive for small messages.
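A sketch of this rendezvous step follows, with an assumed protocol-message layout and get() prototype; neither is taken from the actual library.

    /* Illustrative sketch only; the structure layout and prototype are assumptions. */
    #include <stddef.h>

    struct protocol_msg {
        int           src_node;      /* node of the sender                           */
        int           tag, context;  /* MPI matching information                     */
        void         *remote_addr;   /* address of the message data in the sender    */
        size_t        size;
        volatile int *remote_flag;   /* incremented on the sender when transfer ends */
    };

    extern void get(int src_node, void *local_addr, size_t size,
                    void *remote_addr, volatile int *remote_flag,
                    volatile int *local_flag);

    /* Called when a protocol message has been matched with a posted receive. */
    void start_protocol_transfer(const struct protocol_msg *p,
                                 void *local_addr, volatile int *local_flag)
    {
        /* Pull the data straight from the sender's buffer; no intermediate copy.
         * The flags are incremented when the transfer completes, so the caller
         * can return immediately and overlap computation with the transfer. */
        get(p->src_node, local_addr, p->size,
            p->remote_addr, p->remote_flag, local_flag);
    }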

4.1.3 Noncontiguous sends

The descriptions of the in-place and protocol methods thus far have assumed that the send and receive buffers are contiguous. Noncontiguous send buffers are packed into a contiguous area of memory and are then sent using the in-place method. The MSC+ supports the transfer of one-dimensional strided buffers with certain restrictions. Where possible, the get_stride operation, which works in a similar fashion to get but with strided buffers, is used in the protocol method.

4.2 Detection methods

A fundamental operation in the MPI library is to detect the arrival of new messages and to detect when a put or get transfer has completed. The library uses three different methods, which are described in the following sections. Only one particular method may be used during the lifetime of an MPI execution.
4.2.1 Polling detection

In this mode of operation, MPI wait operations are implemented by sitting in a busy-wait loop until some set of flags in the list of request objects have been incremented. The ring buffer is examined every iteration to check for the arrival of new messages to ensure progress. The ring buffer is also checked in the test and probe functions.

While this method avoids using signal handlers and the associated overheads, it can potentially waste a large amount of CPU time. Under the AP/Linux system, it is envisaged that this method would only be used if a single parallel program is running over a given set of nodes.

The other problem with this method is that protocol messages will not be detected until the next time that the user program calls one of the MPI functions which check the ring buffer. Depending on the application, this can be a significant amount of time. This delays the completion of the associated nonblocking send request.
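In outline, such a polling wait might look like the following sketch; the identifiers are placeholders, with check_ring_buffer() standing in for whatever progress routine the library actually uses.

    /* Illustrative sketch of a polling wait; identifiers are hypothetical. */
    struct request { volatile int flag; int target; };   /* done when flag reaches target */

    extern void check_ring_buffer(void);   /* drain any newly arrived messages */

    void wait_all_polling(struct request *reqs[], int n)
    {
        for (;;) {
            int done = 1;

            for (int i = 0; i < n; i++)
                if (reqs[i]->flag < reqs[i]->target)
                    done = 0;
            if (done)
                return;

            /* Examine the ring buffer on every iteration to guarantee progress. */
            check_ring_buffer();
        }
    }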
4.2.2 Signal

This method uses signals to indicate when messages have arrived or when a put or get transfer has completed. While this incurs some overhead, it potentially makes better use of the CPU.

In this mode of operation, put_signal is called after every send, directed to the same destination cell. Due to the message ordering of the MSC+, the destination cell will receive the signal after the message has arrived. The signal handler can then process the new message as described above.

A signal also needs to be generated to indicate when a put or get request has completed. When using the in-place method, the receiver knows when a nonblocking receive operation has completed, since the CPU explicitly copies the data from the ring buffer to the receiver's buffer. To generate a signal for the sender indicating that the message has been transferred from the sender's buffer, the sender invokes put_signal immediately after the send, with its own node as the node_id argument. Again, due to the ordering properties of the MSC+, the signal will not be generated until after the data has been transferred from the sender's buffer.

Similarly, for the protocol method, immediately after the receiver issues the get operation to retrieve the message, it issues a get_signal operation. Again, the MSC+ ordering properties ensure that the sender will receive a signal once the data has been transferred from the sender's buffer, while the receiver will receive a signal once the data has been transferred into the receiver's buffer.

When the MPI process is in a blocking state, the sigsuspend system call is used so that the process will not run until a signal is delivered to it. This allows other processes to run in the system. With this method, protocol messages will be detected as soon as possible, allowing early initiation of get transfers.

Critical sections within the MPI library are marked using a flag. If a critical section is interrupted by a signal, the handler returns immediately. In this case, the library code calls the handler explicitly after it exits the critical section. This is required when manipulating some of the system queues, but it is also important because the operation of writing a command to the MSC+ registers is not atomic.
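The critical-section flag might be used along the following lines; this is a sketch with invented names, not the library's actual code.

    /* Illustrative sketch of deferring the signal handler around critical sections. */
    #include <signal.h>

    static volatile sig_atomic_t in_critical;       /* set while queues/MSC+ registers are touched */
    static volatile sig_atomic_t handler_deferred;  /* a signal arrived inside a critical section  */

    extern void process_incoming(void);             /* handle new messages and completed transfers */

    void message_handler(int sig)
    {
        (void)sig;
        if (in_critical) {              /* not safe to touch library state now */
            handler_deferred = 1;
            return;
        }
        process_incoming();
    }

    void leave_critical(void)
    {
        in_critical = 0;
        if (handler_deferred) {         /* run the deferred work explicitly, outside the section */
            handler_deferred = 0;
            process_incoming();
        }
    }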
4.2.3 Pollsig detection

This mode of operation is a compromise between the polling and signals modes. During the test and probe functions, the ring buffer is checked for the arrival of new messages. When the MPI process enters a blocked state, the process polls for a certain interval specified by the user at startup time. If the condition on which the process is blocking does not become true within the specified interval, the process modifies its signal mask appropriately and calls sigsuspend. All put_signal and get_signal operations which are invoked in signals mode are also issued in this mode, in case the receiving process is waiting for a signal.

With an appropriate value of the interval, this method can avoid the overheads of signals in many cases, while limiting the amount of CPU time wasted.
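A sketch of the combined poll-then-sleep wait follows; condition_met(), check_ring_buffer() and now_usec() are placeholders for library internals, not real functions.

    /* Illustrative sketch of the pollsig wait; all identifiers are hypothetical. */
    #include <signal.h>

    extern long poll_interval_usec;       /* interval chosen by the user at startup  */
    extern int  condition_met(void);      /* has the blocking condition become true? */
    extern void check_ring_buffer(void);
    extern long now_usec(void);

    void wait_pollsig(void)
    {
        long deadline = now_usec() + poll_interval_usec;

        /* Phase 1: poll for up to the configured interval. */
        while (!condition_met() && now_usec() < deadline)
            check_ring_buffer();

        /* Phase 2: if still blocked, sleep until the detection signal arrives. */
        while (!condition_met()) {
            sigset_t mask;
            sigemptyset(&mask);           /* wait with no signals blocked      */
            sigsuspend(&mask);            /* returns after a handler has run   */
            check_ring_buffer();
        }
    }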

5 Performance

This section provides an analysis of the performance of the MPI library. A ping-pong benchmark was used to measure the latency and to measure the overheads of the different transfer and detection methods described in the previous section, for a variety of message sizes. Some of the NAS 2.1 parallel benchmarks [1] were run to examine the effect of the different transfer and detection modes on execution times. Finally, some wall-clock times are reported when running multiple parallel programs simultaneously when using the different detection modes.

5.1 Ping-pong benchmarks

The ping-pong benchmark consists of a pair of processes which echo a single message of fixed size back and forth. The first process sends a message and then receives the echoed message, while the second process first receives the message and echoes it. This is performed for a large number of iterations for a given message size, for which the total time is recorded by the first process. Dividing this time by the number of iterations determines the round-trip time for a given message size. Dividing this time by two will give the message latency; dividing the latency by the message size gives the bandwidth.
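For reference, the skeleton of such a ping-pong benchmark in MPI C looks roughly like this; the message size, tag and iteration count below are arbitrary choices, not the values used for the measurements in this paper.

    /* Minimal ping-pong sketch in MPI C; parameters are arbitrary. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        const int iters = 10000, size = 1000, tag = 0;
        char *buf = malloc(size);
        int rank;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double t0 = MPI_Wtime();
        for (int i = 0; i < iters; i++) {
            if (rank == 0) {                       /* send, then receive the echo */
                MPI_Send(buf, size, MPI_BYTE, 1, tag, MPI_COMM_WORLD);
                MPI_Recv(buf, size, MPI_BYTE, 1, tag, MPI_COMM_WORLD, &status);
            } else if (rank == 1) {                /* receive, then echo back */
                MPI_Recv(buf, size, MPI_BYTE, 0, tag, MPI_COMM_WORLD, &status);
                MPI_Send(buf, size, MPI_BYTE, 0, tag, MPI_COMM_WORLD);
            }
        }
        double total = MPI_Wtime() - t0;

        if (rank == 0) {
            double rtt = total / iters;            /* round-trip time  */
            double latency = rtt / 2.0;            /* one-way latency  */
            double bandwidth = size / latency;     /* bytes per second */
            printf("latency %g us, bandwidth %g MB/s\n",
                   latency * 1e6, bandwidth / 1e6);
        }

        free(buf);
        MPI_Finalize();
        return 0;
    }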
Figure 4 illustrates the performance of the library for message sizes up to 1000 bytes.

The "Inplace/polling" key indicates that all messages are sent in-place, while "Protocol/polling" indicates that messages are transferred with the protocol method. Both keys indicate the use of polling. As expected, small messages sent in-place are faster than those sent with the protocol method. The protocol method becomes faster for message sizes greater than 300 bytes, indicating that the overhead of using get is less than the extra memory copy from the ring buffer into the receiver's buffer. The software latency measured for a zero-sized message was 21.7 μs, which compares well to the value of 20.2 μs for the MPI library running under CellOS.

A similar figure can be seen for the "Inplace/pollsig" and "Protocol/pollsig" times, where pollsig detection is used with an infinite interval. The effect of using this configuration is that all sends will be accompanied by a put_signal operation to the receiver. Since an infinite interval has been specified, the receiver will always have the detection signal blocked, but the kernel interrupt handler will still have to process the message and generate the signal. The constant time difference between the pollsig and polling modes measures this overhead, which comes to 17.0 μs, a significant overhead compared to the 21.7 μs latency for polling.

The "Inplace/signals" and "Protocol/signals" keys indicate the times when running under signals mode. Clearly, signals mode carries a significant overhead. For a zero-sized message, the latency is 117 μs more than in the polling mode. It has not been determined at the time of writing why this large overhead exists and why the "Protocol/signals" curve is irregular.

Figure 5 illustrates the performance of the library for message sizes up to 200,000 bytes when using the in-place and protocol methods and polling. The other detection methods produce the same times. The effect of performing the extra memory copy in the in-place method is quite evident. At 200,000 bytes, the bandwidth measured was 10.3 MBytes/s for the in-place method, compared to 24.8 MBytes/s for the protocol method, which is very close to the peak 25 MBytes/s speed of the T-net.

5.2 NAS parallel benchmarks

The BT, FT, and MG programs from the NAS 2.1 parallel benchmarks were executed for both class S (small) on a 16 processor AP1000+ and for class A on a 64 processor AP1000+. The results can be seen in Table 1, where all times are normalized to those presented in the first column. The "Inplace/poll" column refers to runs where all sending is done with the in-place method and polling is used, "Protocol/poll" indicates all messages are sent with the protocol method and polling is used, "1024/poll" indicates that the message threshold for using the in-place or protocol method is 1024 bytes and polling is used, "1024/pollsig" is the same as the previous column except that pollsig with a 1 millisecond interval is used, and "1024/sig" is the same as the previous column except that signals mode is used.

The most interesting result is that the best configuration for all benchmarks was to send all messages in-place and to use polling. Under CellOS, some of the benchmarks ran faster when using signals. The primary reason that this does not occur under AP/Linux is that signals had far less overhead under CellOS. As expected, the "Protocol/poll" column showed slow times, since all messages are sent using the protocol method, where matching sending and receiving requests must rendezvous for the transfer to be initiated. For some applications, this can have a drastic effect on performance, particularly for small messages. The high overheads of signals are also reflected in the "1024/sig" column.

5.3 Multiple parallel programs

The aim of this benchmark was to measure the effect of running multiple instances of a class S SP (NAS parallel benchmark) program on all the nodes of a 16 node AP1000+. The multiple instances were started simultaneously and the wall clock time was recorded once all instances had completed. The normalized times are summarized in Table 2. All programs ran using the in-place method.

[Figure 4. Ping-pong benchmark for small messages (time vs. message size in bytes; curves: Inplace/polling, Protocol/polling, Inplace/pollsig, Protocol/pollsig)]

[Figure 5. Ping-pong benchmark for large messages (time vs. message size in bytes; curves: Inplace/polling, Protocol/polling)]

Benchmark/Class  Inplace/poll  Protocol/poll  1024/poll  1024/pollsig  1024/sig
BT/S             1.00          1.06           1.06       1.11          1.11
FT/S             1.00          1.37           1.05       1.06          1.05
MG/S             1.00          1.55           1.00       1.00          1.44
BT/A             1.00          1.02           1.02       1.02          1.00
FT/A             1.00          1.29           1.27       1.29          1.26
MG/A             1.00          1.01           1.01       1.01          1.02

Table 1. NAS Benchmark results

Instances  Polling/nogang  Polling/gang  Pollsig/gang  Signals/gang
2          36.1            3.2           2.5           2.4
3          -               10.0          3.9           3.1
4          -               -             5.4           5.1

Table 2. Multiple parallel programs benchmark

The "Polling/nogang" column indicates the total wall clock time when there is no gang scheduling and polling is used. When only 2 instances of the SP benchmark were running, the total time to complete was more than 36 times that for one instance of the program. Attempting to run more than three programs under this configuration would lead to enormous completion times. Running with the gang scheduler made a big difference, although when three instances are run, the total time becomes quite large. Running in pollsig mode with a 1 millisecond interval made a huge difference. Running in signals mode gave the best result, presumably because the CPU was not polling at all. Four programs run under signals mode took 4.25 times longer to execute than a single program running under signals mode, which indicates how important an interrupt-driven message passing library is when multiple parallel programs are running simultaneously.

6 Future directions

The AP/Linux system and its MPI library are still being developed at the time of writing. Clearly, the gang scheduler needs to be improved to be more effective when running multiple parallel programs in polling mode. The signal overheads that have been measured in the ping-pong benchmark have yet to be accounted for. This needs to be examined in more detail, and the overheads need to be reduced substantially for signals mode to be more effective. Similarly, the overheads shown for the pollsig mode in the ping-pong benchmark need to be examined.

7 Conclusion

A preliminary MPI library has been built for the AP1000+, which runs a modified version of the Linux kernel on each node. The library can employ a number of methods affecting how messages are transferred and how messages and data transfer events are detected. Measurements have shown that polling is effective when a single parallel program is executing on the machine, while signals mode is effective when many parallel programs are executing at the same time. The user determines which mode to use at runtime, the choice being governed by the current operating environment (interactive or batch) and the application's messaging characteristics.

8 Acknowledgements

We would like to thank the many contributors to Linux for providing an excellent operating system on which to base our research. Special thanks go to David Miller for his extensive help with details of the SparcLinux kernel. We would also like to thank Fujitsu Laboratories for their continuing support of the CAP Program, and the CRC for Advanced Computational Systems for their support of the Pious project.

References

[1] D. Bailey, T. Harris, W. Saphir, R. van der Wijngaart, A. Woo, and M. Yarrow. The NAS parallel benchmarks 2.0. Technical Report NAS-95-020, NASA Ames Research Center, December 1995.

[2] K. Hayashi, T. Doi, T. Horie, Y. Koyanagi, O. Shiraki, N. Imamura, T. Shimizu, H. Ishihata, and T. Shindo. AP1000+: Architectural support for parallelizing compilers and parallel programs. In International Symposium on Parallel Architectures, Algorithms, and Networks (ISPAN '94). IEEE, Dec. 1994.
[3] H. Ishihata, T. Horie, S. Inano, T. Shimizu, S. Kato, and M. Ikesaka. Third generation message passing computer AP1000. In International Symposium on Supercomputing, pages 45-55, November 1991.
[4] O. Shiraki, Y. Koyanagi, N. Imamura, K. Hayashi, T. Shimizu, T. Horie, and H. Ishihata. AP1000+: Message handling mechanism (II) - system level interface. In Summer Workshop on Parallel Processing '94, volume 94-ARC-107-22, July 1994.
[5] D. Sitsky and K. Hayashi. Implementing MPI for the Fujitsu AP1000/AP1000+ using polling, interrupts and remote copying. In Joint Symposium on Parallel Processing (JSPP '96), Jun. 1996.
[6] D. Sitsky and K. Hayashi. An MPI library which uses polling, interrupts and remote copying for the Fujitsu AP1000+. In International Symposium on Parallel Architectures, Algorithms, and Networks (ISPAN '96). IEEE, Jun. 1996.

