Infiniband Scalability in Open MPI

G. M. Shipman (1,2), T. S. Woodall (1), R. L. Graham (1), A. B. Maccabe (2)

(1) Advanced Computing Laboratory, Los Alamos National Laboratory
(2) Scalable Systems Laboratory, Computer Science Department, University of New Mexico
Abstract

Infiniband is becoming an important interconnect technology in high performance computing. Recent efforts in large scale Infiniband deployments are raising scalability questions in the HPC community. Open MPI, a new production grade implementation of the MPI standard, provides several mechanisms to enhance Infiniband scalability. Initial comparisons with MVAPICH, the most widely used Infiniband MPI implementation, show similar performance but with much better scalability characteristics. Specifically, small message latency is improved by up to 10% in medium/large jobs and memory usage per host is reduced by as much as 300%. In addition, Open MPI provides predictable latency that is close to optimal without sacrificing bandwidth performance.

1 Introduction

High performance computing (HPC) systems are continuing a trend toward distributed memory clusters consisting of commodity components. Many of these systems make use of commodity or 'near' commodity interconnects including Myrinet [17], Quadrics [3], Gigabit Ethernet and, recently, Infiniband [1]. Infiniband (IB) is increasingly deployed in small to medium sized commodity clusters. It is IB's low price/performance ratio that has made it attractive to the HPC market.

Of the available distributed memory programming models, the Message Passing Interface (MPI) standard [16] is currently the most widely used. Several MPI implementations support Infiniband, including Open MPI [10], MVAPICH [15], LA-MPI [11] and NCSA MPI [18]. However, there are concerns about the scalability of Infiniband for MPI applications, partially arising from the fact that Infiniband was initially developed as a general I/O fabric technology and not specifically targeted at HPC [4].

In this paper, we describe Open MPI's scalable support for Infiniband. In particular, Open MPI makes use of Infiniband features not currently used by other MPI/IB implementations, allowing Open MPI to scale more effectively than current implementations. We illustrate the scalability of Open MPI's Infiniband support through comparisons with the widely used MVAPICH implementation, and show that Open MPI uses less memory and provides better latency than MVAPICH on medium/large-scale clusters.
The remainder of this paper is organized as
follows. Section 2 presents a brief overview of
the Open MPI general point-to-point message
design. Next, section 3 discusses the Infiniband architecture including current limitations
of the architecture. MVAPICH is discussed in
section 4 including potential scalability issues
relating to this implementation. Section 5 provides a detailed description of Infiniband support in Open MPI. Scalability and performance
results are discussed in section 6, followed by
conclusions and future work in section 7.
2 Open MPI
The Open MPI Project is a collaborative effort
by Los Alamos National Lab, the Open Systems
Laboratory at Indiana University, the Innovative Computing Laboratory at the University of
Tennessee and the High Performance Computing Center at the University of Stuttgart (HLRS).
The goal of this project is to develop a next generation implementation of the Message Passing
Interface. Open MPI draws upon the unique expertise of each of these groups which includes
prior work on LA-MPI, LAM/MPI [20], FT-MPI [9] and PAX-MPI [13]. Open MPI is, however, a completely new MPI, designed from the
ground up to address the demands of current and
next generation architectures and interconnects.
Open MPI is based on a Modular Component Architecture (MCA) [19]. This architecture supports the runtime selection of components that are optimized for a specific operating environment. Multiple network interconnects are supported through this MCA. Currently there are two Infiniband components in Open MPI: one supporting the OpenIB verbs API and another supporting the Mellanox verbs API. In addition to being highly optimized for scalability, these components provide a number of performance and scalability parameters which allow for easy tuning.

The Open MPI point-to-point (p2p) design and implementation is based on multiple MCA frameworks. These frameworks provide functional isolation with clearly defined interfaces. Figure 1 illustrates the p2p framework architecture.
Figure 1: Open MPI p2p framework
As shown in Figure 1 the architecture consists of four layers. Working from the bottom up
these layers are the Byte Transfer Layer (BTL),
BTL Management Layer (BML), Point-to-Point
Messaging Layer (PML) and the MPI layer.
Each of these layers is implemented as an MCA
framework. Other MCA frameworks shown are
the Memory Pool (MPool) and the Registration
Cache (Rcache). While these are illustrated
and defined as layers, critical send/receive paths
bypass the BML, as it is used primarily during
initialization/BTL selection.
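To make the layering concrete, the following toy sketch (hypothetical names and layout, far simpler than Open MPI's real component structures) shows the basic idea: each BTL is a table of function pointers behind a common interface, which the PML can drive without knowing anything about the underlying interconnect.

    #include <stddef.h>

    /* Toy BTL interface (hypothetical; Open MPI's real structs differ). */
    typedef struct btl_module {
        const char *name;            /* e.g. "openib", "sm"              */
        size_t eager_limit;          /* largest eagerly-sent message     */
        int (*send)(struct btl_module *btl, int peer,
                    const void *buf, size_t len);
        int (*put)(struct btl_module *btl, int peer,   /* RDMA write     */
                   void *local, void *remote, size_t len);
    } btl_module;

    /* The PML schedules a message over whichever BTL modules the BML
     * discovered for this peer, without touching interconnect details. */
    int pml_send(btl_module **btls, int nbtls, int peer,
                 const void *buf, size_t len)
    {
        btl_module *btl = btls[peer % nbtls]; /* trivial stand-in policy */
        return btl->send(btl, peer, buf, len);
    }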
MPool The memory pool provides memory allocation/deallocation and registration/deregistration services. Infiniband requires memory to be registered (physical pages present and pinned) before send/receive or RDMA operations can use the memory as a source or target. Separating this functionality from other components allows the MPool to be shared among various layers. For example, MPI_ALLOC_MEM uses these MPools to register memory with available interconnects.

BML The BML acts as a thin multiplexing layer, allowing the BTLs to be shared among multiple upper layers. Discovery of peer resources is coordinated by the BML and cached for multiple consumers of the BTLs. After resource discovery, the BML layer may be safely bypassed by upper layers for performance. The current BML component is named R2.
PML The PML implements all logic for
p2p MPI semantics including standard,
buffered, ready, and synchronous communication modes. MPI message transfers are
scheduled by the PML based on a specific
policy. This policy incorporates BTL specific attributes to schedule MPI messages.
Short and long message protocols are implemented within the PML. All control
messages (ACK/NACK/MATCH) are also
managed at the PML. The benefit of this
structure is a separation of the transport protocol from the underlying interconnects. This significantly reduces both code complexity and code redundancy, enhancing maintainability. There are currently three PMLs available in the Open MPI code base. This paper discusses OB1, the latest generation PML, in later sections.
Rcache The registration cache allows memory
pools to cache registered memory for later
operations. When initialized, MPI message
buffers are registered with the Mpool and
cached via the Rcache. For example, during an MPI_SEND the source buffer is registered with the memory pool and this registration may then be cached, depending on the protocol in use. During subsequent MPI_SEND operations the source buffer is
checked against the Rcache, and if the registration exists the PML may RDMA the
entire buffer in a single operation without
incurring the high cost of registration.
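A minimal sketch of this mechanism follows (the structures and helper names are hypothetical illustrations, not Open MPI's actual code): a lookup that hits in the cache lets the PML skip the expensive registration entirely.

    #include <stdint.h>
    #include <stddef.h>

    /* Hypothetical registration record: a pinned region plus the keys
     * the HCA needs to address it. */
    typedef struct reg_entry {
        void    *base;               /* start of the pinned region   */
        size_t   len;                /* length of the pinned region  */
        uint32_t lkey, rkey;         /* HCA protection keys          */
        struct reg_entry *next;
    } reg_entry;

    static reg_entry *rcache_head;   /* toy cache: a linked list     */

    /* Return a cached registration covering [buf, buf+len), or NULL. */
    reg_entry *rcache_find(void *buf, size_t len)
    {
        for (reg_entry *e = rcache_head; e; e = e->next)
            if ((char *)buf >= (char *)e->base &&
                (char *)buf + len <= (char *)e->base + e->len)
                return e;            /* hit: skip the expensive pin  */
        return NULL;
    }

    /* On a miss the caller registers through the MPool and inserts the
     * new entry so later operations on the same buffer pay nothing.  */
    void rcache_insert(reg_entry *e)
    {
        e->next = rcache_head;
        rcache_head = e;
    }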
BTL The BTL modules expose the underlying
semantics of the network interconnect in a
consistent form. BTLs expose a set of communication primitives appropriate for both
send/receive and RDMA interfaces. The
BTL is not aware of any MPI semantics; it
simply moves a sequence of bytes (potentially non-contiguous) across the underlying transport. This simplicity will enable
early adoption of novel network devices
and encourages vendor support. There are
several BTL modules currently available, including TCP, GM, Portals, Shared Memory (SM), Mellanox VAPI (Mvapi) and OpenIB. Later sections discuss the Mvapi and OpenIB BTLs.
During startup, a PML component is selected
and initialized. The PML component selected
defaults to OB1 but may be overridden by a runtime parameter/environment setting. Next the
BML component R2 is selected. R2 then opens
and initializes all available BTL modules. During BTL module initialization, R2 directs peer
resource discovery on a per-BTL basis. This allows the peers to negotiate which set of interfaces they will use to communicate with each
other. This infrastructure allows for heterogeneous networking interconnects within a cluster.
3 Infiniband
The Infiniband specification is published by the Infiniband Trade Association (ITA), originally created by Compaq, Dell, Hewlett-Packard, IBM, Intel, Microsoft, and Sun Microsystems. IB was originally proposed as a general I/O technology, allowing a single I/O fabric to replace multiple existing fabrics. The goal of a single I/O fabric has since faded, and currently Infiniband is targeted as an Inter Process Communication (IPC) and Storage Area Network (SAN) interconnect technology.

Infiniband, similar to Myrinet and Quadrics, provides both Remote Direct Memory Access (RDMA) and Operating System (OS) bypass facilities. RDMA enables data transfer from the address space of an application process to a peer process across the network fabric without requiring involvement of the host CPU. Infiniband RDMA operations support both two-sided send/receive and one-sided put/get semantics. Each of these operations may be queued from the user level directly to the host channel adapter (HCA) for execution, bypassing the OS to minimize latency and processing requirements on the host CPU.

3.1 Infiniband OS Bypass

To enable OS bypass, Infiniband defines the concept of a Queue Pair (QP). The Queue Pair mechanism provides user level processes direct access to the IB HCA. Unlike traditional stack based protocols, there is no need to packetize the source buffer or process other protocol specific messages in the OS or at user level. Packetization and transport logic is located almost entirely in the HCA.

Each queue pair consists of both a send and a receive work queue, and is additionally associated with a Completion Queue (CQ). Work Queue Entries (WQEs) are posted from the user level for processing by the HCA. Upon completion of a WQE, the HCA posts an entry to the completion queue, allowing the user level process to poll and/or wait on the completion queue for events related to the queue pair.

Two-sided send/receive operations are initiated by enqueueing a send WQE on a QP's send queue. The WQE specifies only the sender's local buffer. The remote process must pre-post a receive WQE on the corresponding receive queue which specifies a local buffer address to be used as the destination of the receive. Send completion indicates that the send WQE has completed locally and results in a sender side CQ entry. When the transfer actually completes, a CQ entry is posted to the receiver's CQ.

One-sided RDMA operations are likewise initiated by enqueueing an RDMA WQE on the send queue. However, this WQE specifies both the source and target virtual addresses along with a protection key for the remote buffer. Both the protection key and the remote buffer address must be obtained by the initiator of the RDMA read/write prior to submitting the WQE. Completion of the RDMA operation is local and results in a CQ entry at the initiator. The operation is one-sided in the sense that the remote application is not involved in the request and does not receive notification of its completion.
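As an illustration of the QP/WQE/CQ model, the sketch below uses the OpenIB verbs API to post a single two-sided send and poll for its local completion; creation of the QP and CQ and registration of the buffer (mr) are assumed to have been done elsewhere.

    #include <infiniband/verbs.h>
    #include <stdint.h>

    /* Post one send WQE and spin on the CQ for its completion entry. */
    int send_and_wait(struct ibv_qp *qp, struct ibv_cq *cq,
                      struct ibv_mr *mr, size_t len)
    {
        struct ibv_sge sge = {
            .addr   = (uintptr_t)mr->addr,  /* registered source buffer */
            .length = (uint32_t)len,
            .lkey   = mr->lkey,             /* local protection key     */
        };
        struct ibv_send_wr wr = {
            .wr_id      = 1,
            .sg_list    = &sge,
            .num_sge    = 1,
            .opcode     = IBV_WR_SEND,      /* two-sided: the peer must
                                               have pre-posted a receive */
            .send_flags = IBV_SEND_SIGNALED,
        }, *bad;

        if (ibv_post_send(qp, &wr, &bad))   /* enqueue WQE, OS bypassed */
            return -1;

        struct ibv_wc wc;
        while (ibv_poll_cq(cq, 1, &wc) == 0)
            ;                               /* poll the completion queue */
        return wc.status == IBV_WC_SUCCESS ? 0 : -1;
    }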
3.2 Infiniband Resource Allocation
Infiniband does place some additional constraints on these operations. As data is moved
directly between the host channel adapter
(HCA) and user level source/destination buffers,
these buffers must be registered with the HCA in
advance of their use. Registration is a relatively
expensive operation which locks the memory
pages associated with the request, thereby preserving the virtual to physical mappings. Additionally, when supporting send/receive semantics, pre-posted receive buffers are consumed in
order as data arrives on the host channel adapter
(HCA). Since no attempt is made to match available buffers to the incoming message size, the
maximum size of a message is constrained to the
minimum size of the posted receive buffers.
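This registration requirement maps directly onto the OpenIB verbs API; the sketch below (illustrative only) pins a freshly allocated buffer and yields the protection keys later carried in WQEs.

    #include <infiniband/verbs.h>
    #include <stdlib.h>

    /* Register (pin) a buffer with the HCA; pd is an existing
     * protection domain. Returns NULL on failure. */
    struct ibv_mr *register_buffer(struct ibv_pd *pd, size_t len)
    {
        void *buf = malloc(len);
        if (!buf)
            return NULL;

        /* Locks the pages and fixes the virtual-to-physical mappings;
         * relatively expensive, hence the caching strategies discussed
         * in sections 4.3 and 5.1. */
        struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                       IBV_ACCESS_LOCAL_WRITE |
                                       IBV_ACCESS_REMOTE_READ |
                                       IBV_ACCESS_REMOTE_WRITE);
        if (!mr)
            free(buf);
        return mr;   /* mr->lkey and mr->rkey are used in WQEs */
    }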
3.3 Infiniband Transport Modes

The Infiniband specification details five modes of transport:

1. Reliable Connection (RC)
2. Reliable Datagram (RD)
3. Unreliable Connection (UC)
4. Unreliable Datagram (UD)
5. Raw Datagram

Reliable Connection provides a connection oriented transport between two queue pairs. During initialization of each QP, peers exchange addressing information used to bind the QPs and bring them to a connected state. Work requests posted on each QP's send queue are implicitly addressed to the remote peer. As with any connection oriented protocol, scalability may be a concern as the number of connected peers grows large, since resources are allocated to each QP. Both Open MPI and MVAPICH currently use the RC transport mode.

Reliable Datagram allows a single QP to be used to send and receive messages to/from other RD QPs. Whereas in RC reliability state is associated with the QP, RD associates this state with an end-to-end (EE) context. The intent of the Infiniband specification is that the EE contexts will scale much more effectively with the number of active peers. Both Reliable Connection and Reliable Datagram provide acknowledgment and retransmission. In practice, this portion of the specification has yet to be implemented.

Unreliable Connection and Unreliable Datagram are similar to their reliable counterparts in terms of QP resources. These transports differ in that they are unacknowledged services and do not provide for retransmission of dropped packets. The high cost of user level reliability relative to the hardware reliability of RC and RD makes these modes of transport inefficient for MPI.

Infiniband additionally defines the concept of a Shared Receive Queue (SRQ). A single SRQ may be associated with multiple QPs during their creation. Receive WQEs posted to the SRQ become resources shared by all associated QPs. This capability plays a significant role in improving the scalability of the connection oriented transport protocols described below.
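To ground the SRQ concept, the following sketch (illustrative only; the queue depths are arbitrary) creates an SRQ with the OpenIB verbs API and binds a reliable connection QP to it at creation time, so receive WQEs posted to the SRQ serve any associated QP.

    #include <infiniband/verbs.h>

    struct ibv_qp *qp_with_srq(struct ibv_pd *pd, struct ibv_cq *cq,
                               struct ibv_srq **srq_out)
    {
        struct ibv_srq_init_attr srq_attr = {
            .attr = { .max_wr = 512, .max_sge = 1 },
        };
        struct ibv_srq *srq = ibv_create_srq(pd, &srq_attr);
        if (!srq)
            return NULL;

        struct ibv_qp_init_attr qp_attr = {
            .send_cq = cq,
            .recv_cq = cq,
            .srq     = srq,          /* QP draws receives from the SRQ */
            .qp_type = IBV_QPT_RC,   /* reliable connection transport  */
            .cap     = { .max_send_wr = 128, .max_send_sge = 1 },
        };
        *srq_out = srq;
        return ibv_create_qp(pd, &qp_attr);
    }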
3.4 Infiniband Summary

Infiniband shares many of the architectural features of VIA. Scalability limitations of VIA are well known [5] to the HPC community. These limitations arise from a connection oriented protocol, RDMA semantics, and the lack of direct support for asynchronous progress. While the Infiniband specification does address the scalability of connection oriented protocols through the RD transport mode, the industry leader Mellanox has yet to implement this portion of the specification. Additionally, while the SRQ mechanism addresses scalability issues associated with the reliable connection oriented transport, issues related to flow control and resource management must be considered. MPI implementations must therefore compensate for these limitations in order to effectively scale to large clusters.
4 MVAPICH
MVAPICH is currently the most widely used
MPI implementation on Infiniband platforms. A descendant of MPICH [12], one of the earliest MPI implementations, as well as of MVICH [14],
MVAPICH provides several novel features for
Infiniband support. These features include small
message RDMA, caching of registered memory
regions and multi-rail IB support.
4.1 Small Message Transfer

The MVAPICH design incorporates a novel approach to small message transfer. Each peer is pre-allocated and registered a separate memory region for small message RDMA operations, called a persistent buffer association. Each of these memory regions is structured as a circular buffer, allowing the remote peer to RDMA directly into the currently available descriptor. Remote completion is detected by the peer polling the current descriptor in the persistent buffer association. A single bit can indicate completion of the RDMA, as current Mellanox hardware guarantees the last byte of an RDMA operation will be the last byte delivered to the application. This design takes advantage of the extremely low latencies of Infiniband RDMA operations.

Unfortunately, this is not a scalable solution for small message transfer. As each peer requires a separate persistent buffer, memory usage grows linearly with the number of peers. Polling each persistent buffer for completion also presents scalability problems. As the number of peers increases, the additional overhead required to poll these buffers quickly erodes the benefits of small message RDMA.

A similar design was attempted earlier on ASCI Blue Mountain, in what later evolved into LA-MPI, to support HIPPI-800. The approach was later abandoned due to poor scalability, and a hybrid approach evolved, taking advantage of HIPPI-800 firmware for multiplexing. Other alternative approaches to polling persistent buffers for completion have also been discussed and may prove to be more scalable [6].

To address the issues of small message RDMA, MVAPICH provides medium and large configuration options. These options limit the resources used for small message RDMA and revert instead to standard send/receive. As demonstrated in our results section, this configuration option improves the scalability of small message latencies but still results in sub-optimal performance as the number of peers increases.
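A rough sketch of this completion test follows (the slot layout is hypothetical, not MVAPICH's actual structures); it shows both the last-byte completion trick and the per-peer polling cost that grows with the job size.

    #include <stdint.h>

    /* Hypothetical slot in a per-peer circular RDMA buffer; the sender
     * writes the payload, and the 'done' flag is the final byte of the
     * RDMA, so it is delivered last. */
    typedef struct {
        uint32_t      len;
        char          payload[2043];
        volatile char done;
    } rdma_slot;

    /* Poll every peer's current slot: O(npeers) work per call, which is
     * exactly the overhead that erodes the latency benefit at scale. */
    int poll_persistent_buffers(rdma_slot **cur, int npeers)
    {
        for (int peer = 0; peer < npeers; peer++)
            if (cur[peer]->done) {   /* last byte arrived: message is
                                        completely delivered          */
                cur[peer]->done = 0;
                return peer;         /* consume cur[peer]->payload    */
            }
        return -1;                   /* nothing has arrived yet       */
    }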
4.2 Connection Management
MVAPICH uses static connection management,
establishing a fully connected job at startup. In
addition to eagerly establishing QP connections,
MVAPICH also allocates a persistent buffer association for each peer. If send/receive is used
instead of small message RDMA, MVAPICH allocates receive descriptors on a per QP basis instead of using the shared receive queue across
QPs. This further increases resource allocation
per peer.
4.3 Caching Registered Buffers

As discussed earlier, Infiniband requires all memory to be registered (pinned) with the HCA. Memory registration is an expensive operation, so MVAPICH caches memory registrations for later use. This allows subsequent message transfers to queue a single RDMA operation without paying any registration costs. This approach to registration assumes that the application will reuse buffers often, in order to amortize the high cost of a single up front memory registration. For some applications this is a reasonable assumption.

A potential issue when caching memory registrations is that the application may free a cached memory region and then return the associated pages to the OS (memory is returned via the sbrk function in UNIX and Linux). The application could later allocate another memory region and obtain the same virtual address as the previously freed buffer. Subsequent RDMA operations may use the cached registration, but this registration may now contain incorrect virtual to physical mappings; an RDMA operation may therefore access an unintended memory region. In order to avoid this scenario, MVAPICH forces the application to never release pages to the OS (the mallopt function in UNIX and Linux prevents pages from being given back to the OS), thereby preserving virtual to physical mappings. This approach may cause resource exhaustion, as the OS can never reclaim physical pages.
5 Design of Open MPI
In this section we discuss Open MPI’s support
for Infiniband, including techniques to enhance
scalability.
5.1 The OB1 PML Component
OB1 is the latest point-to-point management
layer for Open MPI. OB1 replaces the previous
generation PML, TEG [21]. The motivation
for a new PML was driven by code complexity at the lower layers. Previously much of the
MPI p2p semantics such as the short and long
protocols were duplicated for each interconnect.
This logic as well as RDMA specific protocol
logic was moved up to the PML layer. Initially
there was concern that moving this functionality into an upper layer would cause performance
degradation. Preliminary performance benchmarks have shown this not to be the case. This
restructuring has substantially decreased code
complexity while maintaining performance on
par with both previous Open MPI architectures
as well as other MPI implementations. Through
the use of device appropriate abstractions we
have exposed the underlying architecture to the
PML level. As such the overhead of the p2p
architecture in Open MPI is lower than that of
other MPI implementations.
OB1 provides numerous features to support both send/receive and RDMA read/write operations. The send/receive protocol uses pre-allocated/registered buffers to copy into on the send side and out of on the receive side. This protocol provides good performance for small message transfers and is used both for the eager protocol as well as for control messages.

To support RDMA operations, OB1 makes use of the MPool and Rcache components in order to cache memory regions for later RDMA operations. Both source and target buffers must be registered prior to an RDMA read or write of the buffer. Subsequent RDMA operations can make use of pre-registered memory in the MPool/Rcache. While MVAPICH prevents physical pages from being released to the OS, Open MPI instead uses memory hooks to intercept the deallocation of memory. When memory is deallocated it is checked against the Rcache, and all matching registrations are de-registered. This prevents future use of an invalid memory registration while allowing memory to be returned to the host operating system.
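A minimal sketch of the memory hook idea follows (the helper names are hypothetical; Open MPI's real implementation interposes on the allocator itself):

    #include <stddef.h>

    /* Hypothetical helpers provided by the Rcache/MPool components. */
    extern void *rcache_match(void *addr, size_t len); /* cached reg or NULL */
    extern void  rcache_evict(void *reg);              /* drop cache entry   */
    extern void  hca_deregister(void *reg);            /* unpin with the HCA */

    /* Called from an interposed free()/munmap(): invalidate any cached
     * registration covering the range before the pages return to the
     * OS, so a stale mapping can never be used by a later RDMA. */
    void memory_release_hook(void *addr, size_t len)
    {
        void *reg;
        while ((reg = rcache_match(addr, len)) != NULL) {
            rcache_evict(reg);
            hca_deregister(reg);
        }
    }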
In addition to supporting both send/receive and RDMA read/write, Open MPI provides a hybrid RDMA pipeline protocol. This protocol avoids caching of memory registrations and virtually eliminates memory copies. The protocol begins by eagerly sending data using send/receive up to a configurable eager limit. Upon receipt and match, the receiver responds with an ACK to the source and begins registering blocks of the target buffer across the available HCAs. The number of blocks registered at any given time is bounded by a configurable pipeline depth. As each registration in the pipeline completes, an RDMA control message is sent to the source to initiate an RDMA write on the block. To cover the cost of initializing the pipeline, on receipt of the initial ACK at the source, send/receive semantics are used to deliver data from the eager limit up to the initial RDMA write offset. As RDMA control messages are received at the source, the corresponding block
of the source buffer is registered and an RDMA
write operation initiated on the current block.
On local completion at the source, an RDMA
FIN message is sent to the peer. Registered
blocks are de-registered upon local completion
or receipt of the RDMA FIN message. If required, the receipt of an RDMA FIN message
may also further advance the RDMA pipeline.
This protocol effectively overlaps the cost of
registration/deregistration with RDMA writes.
Resources are released immediately and the
high overhead of a single large memory registration is avoided. Additionally, this protocol results in improved performance for applications
that seldom reuse buffers for MPI operations.
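The receiver side of the pipeline can be sketched as follows; the helpers are hypothetical stand-ins for OB1's internal scheduling and BTL calls.

    #include <stddef.h>

    #define PIPELINE_DEPTH 4             /* configurable pipeline depth */
    #define BLOCK_SIZE (256 * 1024)      /* size of each RDMA block     */

    extern void *register_block(char *addr, size_t len); /* pin pages   */
    extern void  deregister_block(void *reg);            /* unpin       */
    extern void  send_rdma_ctl(int src, void *reg);      /* CTS to src  */
    extern void *wait_fin(void);         /* block until a FIN arrives   */

    /* Keep at most PIPELINE_DEPTH blocks registered; each completed
     * registration triggers an RDMA control message, so the sender's
     * RDMA writes overlap the pinning of later blocks, and each FIN
     * lets a finished block be unpinned promptly. */
    void rdma_pipeline_recv(int src, char *buf, size_t len, size_t off)
    {
        size_t next = off;               /* first byte moved via RDMA   */
        int in_flight = 0;

        while (next < len || in_flight > 0) {
            while (next < len && in_flight < PIPELINE_DEPTH) {
                size_t blk = (len - next < BLOCK_SIZE) ? len - next
                                                       : BLOCK_SIZE;
                send_rdma_ctl(src, register_block(buf + next, blk));
                next += blk;
                in_flight++;
            }
            deregister_block(wait_fin()); /* a block finished writing   */
            in_flight--;
        }
    }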
5.2 The OpenIB and Mvapi BTLs
This section focuses on two BTL components,
both of which support the Infiniband interconnect. These two components are called Mvapi,
based on the Mellanox verbs API, and OpenIB,
based on the OpenIB verbs API. Other than this
difference, the Mvapi and OpenIB BTL components are nearly identical. Two major goals drove the design and implementation of these BTL components: performance and scalability.
The following details the scalability issues addressed in these components.
5.2.1 Connection Management

As detailed earlier, connection oriented protocols pose scaling challenges for larger clusters. In contrast to the static connection management strategy adopted by MVAPICH, Open MPI uses dynamic connection management. When one peer first initiates communication with another peer, the request is queued at the BTL layer. The BTL then establishes the connection through an out of band (OOB) channel. After connection establishment, queued sends are progressed to the peer. This results in a shorter startup time and a longer first message latency for Infiniband communication. Resource usage reflects the actual communication patterns of the application and not the number of peers in the MPI job. As such, MPI codes with scalable communication patterns will require fewer resources.

Open MPI's resource allocation scheme is detailed in Figure 2. Per peer resources include two Reliable Connection QPs: one for high priority transfers and one for low priority transfers. High priority QPs share a single Shared Receive Queue and Completion Queue, as do low priority QPs. Receive descriptors are posted to the SRQ on demand. The number of receive descriptors posted to the SRQ is calculated as

    x = log2(n) * k + b

where x is the number of receive descriptors to post, n is the number of peers in the cluster, k is a per peer scaling factor, and b is a base number of receive descriptors.
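In code, the on-demand posting target is a direct transcription of this formula (the values in the usage comment are illustrative, not Open MPI's defaults).

    #include <math.h>

    /* Receive descriptors to post to the SRQ: x = log2(n) * k + b. */
    int srq_rd_target(int n /* peers */, double k /* per-peer factor */,
                      int b /* base descriptors */)
    {
        return (int)(log2((double)n) * k) + b;
    }
    /* e.g. srq_rd_target(256, 32.0, 8) == 264 */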
The high priority QP is for small control messages and any data sent eagerly to the peer. The
low priority QP is for larger MPI level messages
as well as for all RDMA operations. Using two QPs allows Open MPI to maintain two sizes of receive descriptors: an eager size for the high priority QP and a maximum send size for the low priority QP. While this requires an additional QP per peer, we gain finer grained control over receive descriptor memory usage. In addition, using two QPs allows us to exploit parallelism available in the HCA hardware [8].
Figure 2: Open MPI Resource Allocation (each peer is dynamically allocated, as needed, a high priority and a low priority RC QP; all high priority QPs share one CQ and one SRQ with eager-limit sized receive descriptors, and all low priority QPs share one CQ and one SRQ with max-send sized receive descriptors)

5.2.2 Small Message Transfer

MVAPICH uses a pure RDMA protocol for small message transfer, requiring a separate buffer per peer. Open MPI currently avoids this scalability problem by using Infiniband's send/receive interface for small messages. In an MPI job with 64 nodes, instead of polling 64 preallocated memory regions for remote RDMA completion, Open MPI polls a single completion queue. Instead of preallocating 64 separate memory regions for RDMA operations, Open MPI will optionally post receive descriptors to the SRQ. Unfortunately, Infiniband does not support flow control when the SRQ is used. As such, Open MPI provides a simple user level flow control mechanism. As demonstrated in our results, this mechanism is probabilistic, may result in retransmission under certain communication patterns, and may require further analysis.
5.3 Asynchronous progress
A further problem with RDMA devices is the lack of direct support for asynchronous progress. Asynchronous progress in MPI is the ability of the MPI library to make progress on both the sending and receiving of messages when the application has left the MPI library. This allows for effective overlap of communication and computation.

RDMA based protocols require that the initiator of the RDMA operation be aware of both the source and destination buffers. To avoid a memory copy, and to allow the user to send and receive from arbitrary buffers of arbitrary length, the peer's memory region must be obtained by the initiator prior to each request.

Figure 3 illustrates the timing of a typical RDMA transfer in MPI using an RDMA write. The RTS, CTS and FIN messages can be sent using either send/receive or small message RDMA. Either method requires the receiver to be in the MPI library to progress the RTS, send the CTS, and then handle the completion of the RDMA operation by receiving the FIN message.

Figure 3: RDMA Write (Peer 1 sends a Match/RTS to Peer 2, Peer 2 replies with a CTS, Peer 1 performs the RDMA write and then sends a FIN)

From Figure 3 we can see that if the receiver is not currently in the MPI library on the initial RDMA of the RTS, no progress is made on the RDMA write until after the receiver enters the MPI library. In contrast to traditional RDMA interfaces, one method of providing asynchronous progress is to move the matching of the receive buffer of the MPI message to the network interface. Portals [7] style interfaces allow this by associating target memory locations with the tuple of MPI communicator, tag, and sender address, thereby eliminating the need for the sender to obtain the receiver's target memory address.

Open MPI addresses asynchronous progress for Infiniband by introducing a progress thread. The progress thread allows the Open MPI library to continue to progress messages by processing RTS/CTS and FIN messages. While this is a solution to asynchronous progress, the cost in terms of message latency is quite high. In spite of this, some applications may benefit from asynchronous progress even in the presence of higher message latency. This is especially true if the application is written in a manner to take advantage of communication/computation overlap.
6 Results

This section presents a comparison of our work. First we present scalability results in terms of per node resource allocation. Next we examine performance results, showing that while Open MPI is highly scalable it also provides excellent performance in the NAS Parallel Benchmarks (NPB) [2].

6.1 Scalability

As demonstrated earlier, the memory footprint of a pure RDMA protocol as used in MVAPICH increases linearly with the number of peers. This is partially due to the lack of dynamic connection management as well as to the resource allocation scheme. Resource allocation for the small RDMA protocol is per peer. Specifically, each peer is allocated a memory region in every other peer. As the number of peers increases this memory allocation scheme becomes intractable. Open MPI avoids these costs in two ways. First, Open MPI establishes connections dynamically on the first send to a peer. This allows resource allocation to reflect the communication pattern of the MPI application. Second, Open MPI optionally makes use of the Infiniband SRQ so that receive resources (pre-registered memory) can be shared among multiple endpoints.

To examine the memory usage of the MPI library we have used three different benchmarks. The first is a simple "hello world" application that does not communicate with any of its peers. This benchmark establishes a baseline of memory usage for an application. Figure 4 demonstrates that Open MPI's memory usage is constant, as no connections are established and therefore no resources are allocated for other peers. MVAPICH, on the other hand, preallocates resources for each peer at startup, so memory usage increases as the number of peers increases. Both the MVAPICH small and medium configurations consume more resources.

Figure 4: Hello World Memory Usage (memory usage in KBytes versus number of peers for MVAPICH small, MVAPICH medium, and Open MPI SRQ)
Our next benchmark is a pairwise ping-pong, where peers of neighboring rank ping each other; that is, rank 0 pings rank 1, rank 2 pings rank 3, and so on. As Figure 5 demonstrates, Open MPI memory usage is constant. This is due to dynamic connection management: only peers participating in communication are allocated resources. Again we see that MVAPICH memory usage ramps up with the number of peers.
Figure 5: Pairwise Ping-Pong Memory Usage (memory usage in KBytes versus number of peers for MVAPICH small, MVAPICH medium, Open MPI SRQ, and Open MPI no SRQ)

Our final memory usage benchmark is a worst case for Open MPI: each peer communicates with every other peer. As can be seen in Figure 6, Open MPI SRQ memory usage does increase as the number of peers increases, but at a much smaller rate than that of MVAPICH. This is due to the use of the SRQ for resource allocation. Open MPI without SRQ scales slightly worse than the MVAPICH medium configuration; this is due to Open MPI's use of two QPs per peer.

Figure 6: All-to-all Memory Usage (memory usage in KBytes versus number of peers for MVAPICH small, MVAPICH medium, Open MPI SRQ, and Open MPI no SRQ)

6.2 Performance

To verify the performance of our MPI implementation we present both micro benchmarks as well as the NAS Parallel Benchmarks.

6.2.1 Latency
Ping-pong latency is a standard benchmark of
MPI libraries. As with any micro-benchmark,
ping-pong provides only part of the true representation of performance. Most ping-pong results are presented using two nodes involved in
communication. While this number provides a
lower bound on communication latency, multi-node ping-pong is more representative of communication patterns in anything but trivial applications. As such, we present ping-pong latencies for a varying number of nodes in which N nodes perform the previously discussed pairwise ping-pong. This enhancement to the ping-pong benchmark helps to demonstrate scalability of small message transfers because in larger
MPI jobs the number of peers communicating at
the same time often increases.
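The pairwise benchmark is essentially the following condensed sketch (the actual harness adds warm-up iterations and gathers per-pair statistics; an even number of ranks is assumed).

    #include <mpi.h>
    #include <stdio.h>

    /* Pairwise zero-byte ping-pong: ranks 0<->1, 2<->3, ... in parallel. */
    int main(int argc, char **argv)
    {
        int rank, iters = 1000;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        int peer = (rank % 2 == 0) ? rank + 1 : rank - 1;

        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < iters; i++) {
            if (rank % 2 == 0) {             /* even ranks ping first */
                MPI_Send(NULL, 0, MPI_BYTE, peer, 0, MPI_COMM_WORLD);
                MPI_Recv(NULL, 0, MPI_BYTE, peer, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else {
                MPI_Recv(NULL, 0, MPI_BYTE, peer, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(NULL, 0, MPI_BYTE, peer, 0, MPI_COMM_WORLD);
            }
        }
        double lat_us = (MPI_Wtime() - t0) / (2.0 * iters) * 1e6;
        printf("rank %d: %.2f usec one-way\n", rank, lat_us);
        MPI_Finalize();
        return 0;
    }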
In this test, the latency of a zero byte message is measured for each pair of peers. We have then plotted the average with error bars for each
of these runs. As can be seen in Figure 7, the
small message RDMA mechanisms provided in
MVAPICH provide a benefit with a small number of peers. Unfortunately, the polling of memory regions is not a scalable architecture, as can
be seen when the number of peers participating
in the latency benchmark increases. For each
additional peer involved in the benchmark, every other peer must allocate and poll an additional memory region. Costs of polling quickly
erode any improvements in latency. Memory
usage is also higher on a per peer and aggregate basis. This trend occurs in both small and
medium MVAPICH configurations. Open MPI
provides much more predictable latencies and
outperforms MVAPICH latencies as the number
of peers increases. Open MPI - SRQ latencies
are a bit higher than Open MPI - No SRQ latencies, as the SRQ path under Mellanox HCAs is
more costly.
Figure 7: Multi-Node Zero Byte Latency (latency in microseconds versus number of peers for Open MPI SRQ, Open MPI no SRQ, MVAPICH small, and MVAPICH medium)

Table 1 shows that Open MPI send/receive latencies trail MVAPICH small message RDMA latencies but are better than MVAPICH send/receive latencies. This is an important result, as larger MVAPICH clusters will make more use of send/receive and not small message RDMA.

Configuration             Average Latency
Open MPI - Optimized      5.64
Open MPI - Default        5.94
MVAPICH - RDMA            4.19
MVAPICH - Send/Receive    6.51

Table 1: Two node ping-pong latency in microseconds. Optimized limits the number of WQEs on the RQ; Default uses the default number of WQEs on the RQ.
6.2.2 NPB

To demonstrate the performance of our implementation outside of micro benchmarks we used the NAS Parallel Benchmarks [2]. NPB is a set of benchmarks derived from computational fluid dynamics applications. All NPB benchmarks were run using the class C problem size, and all results are given in run-time (seconds). The results of these benchmarks are summarized in Table 2. Open MPI was run using three configurations: with SRQ, with SRQ and simple flow control, and without SRQ. MVAPICH was run in both small and medium cluster configurations. Open MPI without SRQ and MVAPICH performance is similar. With SRQ, Open MPI performance is similar for the BT, CG, and EP benchmarks. SP, FT and IS performance is lower with SRQ, as receive resources are quickly consumed in collective operations. Our current flow control mechanism addresses this issue for the SP benchmark, but the FT and IS benchmarks are still affected, due to their global broadcast and all-to-all communication patterns respectively. Further research into SRQ flow control techniques is ongoing.
6.3 Experimental Setup

Our experiments were performed on two different machine configurations. Two node ping-pong benchmarks were performed on dual Intel Xeon X86-64 3.2 GHz processors with 2GB of RAM and Mellanox PCI-Express Lion Cub adapters connected via a Voltaire 9288 switch, running Linux 2.6.13.2 with Open MPI pre-release 1.0 and MVAPICH 0.9.5-118. All other benchmarks were performed on a 256 node cluster consisting of dual Intel Xeon X86-64 3.4 GHz processors with a minimum of 6GB of RAM and Mellanox PCI-Express Lion Cub adapters, also connected via a Voltaire switch, running Linux 2.6.9-11 with Open MPI pre-release 1.0 and MVAPICH 0.9.5-118.
                    BT                SP
Nodes               64      256       64      256
Open MPI - No SRQ   100.03  25.03     54.39   16.08
Open MPI - SRQ      114.92  26.92     140.81  22.53
Open MPI - SRQ FC   100.13  25.33     54.90   14.61
MVAPICH - Small     98.78   27.40     53.66   15.16
MVAPICH - Large     99.22   27.58     53.87   15.84

                    CG                              FT
Nodes               32     64     128   256        32     64     128    256
Open MPI - No SRQ   20.17  12.74  7.39  5.56       36.64  18.28  9.39   4.81
Open MPI - SRQ      20.45  12.86  7.49  5.61       75.48  68.36  56.92  26.96
Open MPI - SRQ FC   21.13  12.83  7.38  5.63       54.81  35.87  19.39  24.54
MVAPICH - Small     20.33  12.96  7.84  6.11       37.59  19.42  10.17  4.84
MVAPICH - Large     20.24  13.15  7.83  6.09       37.91  19.51  9.85   4.88

                    EP                              IS
Nodes               32     64     128    256       32     64     128    256
Open MPI - No SRQ   38.89  19.84  9.95   5.11      2.23   1.62   0.97   0.52
Open MPI - SRQ      38.85  19.72  10.04  5.26      32.21  33.29  25.06  21.97
Open MPI - SRQ FC   39.10  19.76  12.88  5.12      5.32   4.38   12.35  11.12
MVAPICH - Small     39.15  19.65  10.02  5.32      2.19   1.55   0.87   0.42
MVAPICH - Large     39.10  19.59  9.89   5.31      2.20   1.56   0.87   0.50

Table 2: NPB results. Each benchmark uses the class C problem size with a varying number of nodes, one process per node. Results are given in seconds.
7 Future Work and Conclusions

Open MPI addresses many of the concerns regarding the scalability and use of Infiniband in HPC. In this section we summarize the results of this paper and provide directions for future work.

7.1 Conclusions
Open MPI’s Infiniband support provides several
techniques to improve scalability. Dynamic connection management allows per peer resource usage to reflect the application's chosen communication pattern, thereby allowing scalable MPI
codes to preserve resources. Per peer memory usage in these types of applications will be
significantly less in Open MPI when compared
to other MPI implementations which lack this
feature. Optional support for an asynchronous
progress thread addresses the lack of direct support for asynchronous progress within Infiniband, potentially further reducing buffering requirements at the HCA. Shared resource allocation scales much more effectively than per peer
resource allocation through the use of the Infiniband Shared Receive Queue (SRQ). This should
allow even fully connected applications to scale to a much higher level.

7.2 Future Work

This work has identified additional areas for improvement. As the NAS Parallel Benchmarks illustrated, there are concerns regarding the SRQ case that require further consideration. Preliminary results indicate that an effective flow control and/or resource replacement policy must be implemented, as resource exhaustion results in significant performance degradation.
Additionally, Open MPI currently utilizes an
OOB communication channel for connection establishment, which is based on TCP/IP. Using
an OOB channel based on the unreliable datagram protocol will decrease first message latency and potentially improve the performance
of the Open MPI run-time environment.
While connections are established dynamically, once opened, all connections are persistent. Some MPI codes which communicate with randomly chosen peers may therefore experience high resource usage even if communication with any given peer is infrequent. For these types of applications, dynamic connection teardown may be beneficial.
Open MPI currently supports both caching
of RDMA registrations as well as a hybrid
RDMA pipeline protocol. The RDMA pipeline
provides good results even in applications that
rarely reuse application buffers. Currently Open MPI does not cache RDMA registrations used in the RDMA pipeline protocol. Caching these registrations would allow subsequent RDMA operations to avoid the cost of registration/deregistration if the send/recv buffer is used more than once, while providing good performance even when the buffer is not used again.
Acknowledgments

The authors would like to thank Kurt Ferreira and Patrick Bridges of UNM and Jeff Squyres and Brian Barrett of IU for comments and feedback on early versions of this paper.
References

[1] Infiniband Trade Association. Infiniband architecture specification vol. 1, release 1.2, 2004.

[2] Bailey, Barszcz, Barton, Browning, Carter, Dagum, Fatoohi, Fineberg, Frederickson, Lasinski, Schreiber, Simon, Venkatakrishnan, and Weeratunga. NAS parallel benchmarks, 1994.

[3] Jon Beecroft, David Addison, Fabrizio Petrini, and Moray McLaren. QsNetII: An interconnect for supercomputing applications, 2003.

[4] R. Brightwell, D. Doerfler, and K.D. Underwood. A comparison of 4X InfiniBand and Quadrics Elan-4 technologies. In Proceedings of the 2004 IEEE International Conference on Cluster Computing, pages 193–204, September 2004.

[5] R. Brightwell and A. Maccabe. Scalability limitations of VIA-based technologies in supporting MPI. In Proceedings of the Fourth MPI Developer's and User's Conference, March 2000.

[6] Ron Brightwell. A new MPI implementation for Cray SHMEM. In PVM/MPI, pages 122–130, 2004.

[7] Ron Brightwell, Tramm Hudson, Arthur B. Maccabe, and Rolf Riesen. The Portals 3.0 message passing interface, November 1999.

[8] V. Velusamy et al. Programming the InfiniBand network architecture for high performance message passing systems. In Proceedings of the 16th IASTED International Conference on Parallel and Distributed Computing and Systems, 2004.

[9] G. E. Fagg, A. Bukovsky, and J. J. Dongarra. HARNESS and fault tolerant MPI. Parallel Computing, 27:1479–1496, 2001.

[10] E. Gabriel, G. E. Fagg, G. Bosilca, T. Angskun, J. J. Dongarra, J. M. Squyres, V. Sahay, P. Kambadur, B. Barrett, A. Lumsdaine, R. H. Castain, D. J. Daniel, R. L. Graham, and T. S. Woodall. Open MPI: Goals, concept, and design of a next generation MPI implementation. In Proceedings, 11th European PVM/MPI Users' Group Meeting, 2004.

[11] R. L. Graham, S.-E. Choi, D. J. Daniel, N. N. Desai, R. G. Minnich, C. E. Rasmussen, L. D. Risinger, and M. W. Sukalski. A network-failure-tolerant message-passing system for terascale clusters. International Journal of Parallel Programming, 31(4), August 2003.

[12] W. Gropp, E. Lusk, N. Doss, and A. Skjellum. A high-performance, portable implementation of the MPI message passing interface standard. Parallel Computing, 22(6):789–828, September 1996.

[13] Rainer Keller, Edgar Gabriel, Bettina Krammer, Matthias S. Mueller, and Michael M. Resch. Towards efficient execution of MPI applications on the grid: porting and optimization issues. Journal of Grid Computing, 1:133–149, 2003.

[14] Lawrence Berkeley National Laboratory. MVICH: MPI for Virtual Interface Architecture, August 2001.

[15] Jiuxing Liu, Jiesheng Wu, Sushmitha P. Kini, Pete Wyckoff, and Dhabaleswar K. Panda. High performance RDMA-based MPI implementation over InfiniBand. In ICS '03: Proceedings of the 17th Annual International Conference on Supercomputing, pages 295–304, New York, NY, USA, 2003. ACM Press.

[16] Message Passing Interface Forum. MPI: A Message Passing Interface. In Proceedings of Supercomputing '93, pages 878–883. IEEE Computer Society Press, November 1993.

[17] Myricom. Myrinet-on-VME protocol specification.

[18] S. Pakin and A. Pant. In Proceedings of the 8th International Symposium on High Performance Computer Architecture (HPCA-8), Cambridge, MA, February 2002.

[19] Jeffrey M. Squyres and Andrew Lumsdaine. The component architecture of Open MPI: Enabling third-party collective algorithms. In Vladimir Getov and Thilo Kielmann, editors, Proceedings, 18th ACM International Conference on Supercomputing, Workshop on Component Models and Systems for Grid Applications, pages 167–185, St. Malo, France, July 2004. Springer.

[20] J. M. Squyres and A. Lumsdaine. A component architecture for LAM/MPI. In Proceedings, 10th European PVM/MPI Users' Group Meeting, number 2840 in Lecture Notes in Computer Science, Venice, Italy, September/October 2003. Springer-Verlag.

[21] T. S. Woodall, R. L. Graham, R. H. Castain, D. J. Daniel, M. W. Sukalski, G. E. Fagg, E. Gabriel, G. Bosilca, T. Angskun, J. J. Dongarra, J. M. Squyres, V. Sahay, P. Kambadur, B. Barrett, and A. Lumsdaine. Open MPI's TEG point-to-point communications methodology: Comparison to existing implementations. In Proceedings, 11th European PVM/MPI Users' Group Meeting, 2004.