Topology Mapping for Blue Gene/L Supercomputer
Hao Yu
I-Hsin Chung
Jose Moreira
IBM Thomas J. Watson Research Center
Yorktown Heights, NY 10598-0218
{yuh,ihchung,jmoreira}@us.ibm.com
Abstract
Mapping virtual processes onto physical processors is one of
the most important issues in parallel computing. The problem of mapping processes/tasks onto processors is equivalent to the graph embedding problem, which has been studied
extensively. Although many techniques have been proposed
for embeddings of two-dimensional grids, hypercubes, etc.,
there have been few efforts on embeddings of three-dimensional
grids and tori. Motivated by the need for better task mapping
support on the Blue Gene/L supercomputer, in this paper we present
embedding and integration techniques for the embeddings
of three-dimensional grids and tori. The topology mapping
library based on these techniques generates high-quality
embeddings of two- and three-dimensional grids/tori. In addition,
the library is used in the BG/L MPI library for scalable support
of the MPI topology functions. With extensive empirical studies on large-scale systems against popular benchmarks and
real applications, we demonstrate that the library can significantly improve the communication performance and
scalability of applications.
1 Introduction
Mapping the tasks of a parallel application onto the physical processors of a parallel system is one of the most essential issues in
parallel computing. It is critical for today's supercomputing
systems to deliver sustained and scalable performance.
The problem of mapping an application's task topology onto
the underlying hardware's physical topology can be formalized
as a graph embedding problem [3], which has been studied
extensively in the past. For intuition, in this
paper we use the terms embedding and mapping interchangeably. In general, an embedding of a guest graph
G = (VG , EG ) into a host graph H = (VH , EH ) is a one-to-one
[SC2006, November 2006, Tampa, Florida, USA. 0-7695-2700-0/06 $20.00 (c) 2006 IEEE]
mapping φ from VG to VH . The quality of an embedding is
usually measured by two cost functions (parameters): dilation and expansion. The dilation of an edge (u, v) ∈ EG is the
length of a shortest path in H that connects φ(u) and φ(v).
The dilation of an embedding is the maximum dilation over
all edges in EG . The expansion of an embedding is simply
|VH |/|VG |. Intuitively, the dilation of an embedding measures the worst-case stretching of edges, and the expansion
measures the size of the host graph relative to the guest graph.
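These two cost functions can be made concrete with a short sketch. Assuming the host is a grid without torus links, the shortest-path length between two host nodes is the Manhattan distance; the helper below computes the dilation and expansion of an embedding given as a Python dict. All names here are illustrative, not taken from the paper's library:

```python
def dilation_and_expansion(guest_edges, phi, host_dims):
    """Measure an embedding phi of a guest graph into a host grid.

    guest_edges: iterable of (u, v) guest-graph edges
    phi:         dict mapping each guest node to a host coordinate tuple
    host_dims:   sizes of the host grid's dimensions (grid, no torus links)
    """
    # In a grid, a shortest path between two nodes has Manhattan length.
    def hops(p, q):
        return sum(abs(a - b) for a, b in zip(p, q))

    # Dilation: worst-case stretching of a guest edge.
    dilation = max(hops(phi[u], phi[v]) for u, v in guest_edges)

    # Expansion: |V_H| / |V_G|.
    n_host = 1
    for d in host_dims:
        n_host *= d
    expansion = n_host / len(phi)
    return dilation, expansion
```

For example, embedding a 3-node line into a 2 × 2 grid with φ = {0: (0,0), 1: (0,1), 2: (1,1)} gives dilation 1 and expansion 4/3.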
Because graph embedding problems are NP-hard, researchers have focused on developing heuristics. In
the past two decades, a large number of graph embedding
techniques have been developed to solve problems arising from
VLSI circuit layout and process mapping in parallel computing. Specifically, for optimizing the VLSI design of highly
eccentric circuits, techniques that embed rectangular two-dimensional grids into square two-dimensional grids [3; 17;
10] were proposed and significant results were obtained.
In the parallel computing domain, techniques were developed
to map processes onto various high-performance interconnects such as hierarchical networks, crossbar-based networks,
hypercube networks, switch-based interconnects, and multi-dimensional grids [24; 14; 8; 18; 16; 15; 12; 20]. In this paper, we concentrate on the problem of embedding into three-dimensional grids and tori. Our research is motivated by
the goal of providing better support for task mapping on the Blue Gene/L
(BG/L) supercomputer.
The BG/L supercomputer is a massively parallel system developed by IBM in partnership with Lawrence Livermore National Laboratory (LLNL). BG/L uses system-on-a-chip
integration [4] and a highly scalable architecture [2] to
assemble machines with up to 65,536 dual-processor compute nodes. The primary communication network of BG/L
is a three-dimensional torus. To provide system-level topology mapping support for BG/L, we need effective and scalable embedding techniques that map an application's virtual
topologies onto three-dimensional tori or two-dimensional
sub-topologies.
Although there is a large amount of work on embeddings
for various topologies, we found relatively few techniques
for efficient and scalable embeddings among three-dimensional
grids/tori.
In particular, although techniques based on graph partitioning and/or searching can find fairly good embeddings for different topologies, they are hard to parallelize
and do not scale well [12; 20; 6]. Moreover, except for the work
on hypercube embeddings [8], most techniques proposed for
task mapping on parallel systems have neglected the progress made in the area of embeddings of two-dimensional grids.
Overall, existing techniques are either not suitable for BG/L
or cover only very limited cases.
In this paper, we describe existing and newly developed grid
embedding techniques, together with integration techniques, to generate efficient mappings of parallel processes/tasks onto up to
three-dimensional physical topologies, which indirectly optimizes the nearest-neighbor communication of parallel programs. The embedding techniques we used and explored
take constant time in parallel and are therefore scalable. Our integration techniques cover rather general cases (e.g., they are not limited to grids/tori with dimension sizes that are powers of 2) with
small dilation costs. The topology mapping library based on
these techniques has been integrated into the BG/L MPI library
to support the MPI Cartesian topology [13]. Finally, we present
comprehensive experiments against popular parallel benchmarks and real applications. With the help of MPI tracing
tools, our empirical results demonstrate significant performance improvements for point-to-point communication when using the process-to-processor mappings generated by our
library.
This paper makes the following contributions:
- We present the design and integration of a comprehensive
topology mapping library for scalable three-dimensional
grid/torus embeddings. Besides an intensive exploration of the
latest developments in the area of grid embedding, we
describe extensions for embeddings of three-dimensional
grids/tori.
- Our topology mapping library provides efficient support
for the MPI virtual topology functions. The computation
of the MPI virtual topology on each processor takes constant
time, and therefore the mapping process is scalable.
- Quantitatively, we demonstrate that our topology mapping library is effective at improving the communication
performance of parallel applications running on a large
number of processors (our experiments ran on up to 4,096
processors).
The rest of the paper is organized as follows. Sec. 2
gives a brief explanation of embedding techniques for up to
two-dimensional grids. Based on the existing techniques, we
present a set of embedding operations and corresponding
predicates for their selection. Sec. 3 presents embedding
operations for 3D grids/tori. We further present the procedure for integrating the various embedding techniques into a
general and powerful library. Sec. 4 describes some issues in
the support of topology mapping for BG/L systems. Sec. 5
presents an extensive empirical study that demonstrates the effectiveness of our topology mapping library. Finally, Sec. 6
discusses related work and Sec. 7 concludes the paper.
2 Basic Grid Embeddings
In this section, we describe techniques of embedding 1D or
2D guest Cartesian topologies into 2D host Cartesian topologies (grids or tori). We describe the operations and predicates
we defined for their integration. In the next section of this
paper, we will describe how we utilize these operations for
embeddings of 3D Cartesian topologies.
2.1 1D Embeddings of Rings
For the case of embedding a ring into a 1D mesh (a.k.a. a line),
[22] described a method that embeds the first half of the nodes
of the guest ring into the host line in the same numbering direction, with each edge dilated to 2. It then maps the second
half of the nodes of the guest ring to the host line in the reverse numbering direction, still with
edges dilated to 2. In this paper, we refer to this method
as ring-wrapping. Figure 1(a) shows an example of embedding a ring of size 7 into a line of size 8 using ring-wrapping.
When the given host graph is a ring of slightly greater size
than the guest ring, a method we refer to as ring-scattering
can be used [15]. It simply stretches some
edges of the guest ring and maps the nodes onto the host ring
so that the ring links of the host graph can be utilized. Figure 1(b)
shows an example of embedding a ring of size 5 into a ring
of size 8 using ring-scattering. However, when the size of
the host ring is more than twice that of the guest ring, the
dilation of ring-scattering will exceed 2. In this case, ring-wrapping gives a better embedding.
In our later development of 3D embeddings, we will often
switch between these two basic ring embedding methods. In addition, because ring-scattering introduces an expansion greater than 1 while the expansion of ring-wrapping
is always 1, we found it convenient to use ring-wrapping
in most cases.
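As a concrete illustration, ring-wrapping admits a closed-form position function: the first half of the ring occupies the even positions of the line from left to right, and the second half walks back over the odd positions. The sketch below is our own illustrative code, not the paper's implementation:

```python
def ring_wrap(n):
    """Ring-wrapping sketch: embed a ring of size n into a line of at
    least n nodes with dilation at most 2 (in the spirit of [22]).

    Returns pos, where pos[i] is the line position of ring node i.
    The first half of the ring is laid out left to right on the even
    positions; the second half returns right to left on the odd ones.
    """
    pos = [0] * n
    half = (n + 1) // 2
    for i in range(half):            # first half: positions 0, 2, 4, ...
        pos[i] = 2 * i
    for i in range(half, n):         # second half: walks back on odd slots
        pos[i] = 2 * (n - 1 - i) + 1
    return pos
```

For the example of Figure 1(a), `ring_wrap(7)` places the ring nodes at line positions [0, 2, 4, 6, 5, 3, 1]; every ring edge spans at most 2 line positions.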
2.2 Embeddings of Lines and Rings into Higher-Dimensional Topologies
The methods to embed lines (a.k.a. pipes in [22]) and rings into
2D or 3D topologies are fairly intuitive. They are similar
to performing a naive space filling, i.e., folding (a.k.a. wrapping)
the line in a 2D or 3D space without stretching any of its
edges. Figures 1(c) and 1(d) show two examples of folding a
line onto 2D and 3D grids. Because the dilation factors of
these embeddings are 1, the methods for embedding lines into 2D
grids and into 2D tori are essentially the same.
While the embedding of a line into a 2D or 3D grid can always
achieve dilation factor 1, the embedding of a ring into 2D or 3D
grids may not. Specifically, when the size of the guest ring
is an odd number, exactly one guest edge
is dilated to 2 hops [22]. Figure 1(e) shows an example
of this case.

Figure 1: Existing Methods for Grid Embedding. (a) Ring-wrapping; (b) Ring-scattering; (c) Embed a line into a 2D grid; (d) Embed a line into a 3D grid; (e) Embed a ring into a 2D grid; (f) Fold a 12×3 grid into a 6×6 grid; (g) 90-degree turn.
We find the basic technique of folding a ring useful for
embedding 2D rectangular grids with very small aspect ratios
(the length of the shorter side over the length of the longer side) and 3D grids
with one dimension much greater than the others, by treating them as a 1D
line or ring. We discuss this further in the next
section. In addition, when embedding a ring into a 2D/3D torus,
the torus edges of the host graph are not needed.
2.3 Fold Embeddings among 2D Topologies

A well-known method for embedding 2D grids with small aspect ratio into grids with large aspect ratio is folding, first
introduced in [3]. Figure 1(f) shows an example of
embedding a 12 × 3 rectangular grid into a 6 × 6 square grid.
Under the folding process, a row of the guest grid (e.g., row
2) goes along row 2 until it reaches (2,5). Then row 2 follows a diagonal-vertical-diagonal folding maneuver. Note that
the diagonal edges in the figure do not exist in the host grid
and must be replaced with a pair of edges in the X and Y directions. As a result, the maneuver keeps the dilation cost
at most 2.
The key to the method is the embedding scheme for performing the 180-degree turn (dark nodes in the figure) with
dilation cost at most 2. Recently, a method for performing a 90-degree
turn with dilation cost at most 2 was introduced in [11]
(depicted in Figure 1(g)). By integrating the 180-degree turn and
the 90-degree turn, a 2D torus with small aspect ratio can be
folded onto a 2D grid with large aspect ratio in a fashion
similar to the ring-folding introduced above. Figure 2 gives
an intuitive example of such integration.

Figure 2: Embed 3 × 36 torus to 12 × 12 grid

Note that, to facilitate a massively parallel implementation of the embeddings, we have inverted the projection functions of the 90-degree turn and the 180-degree
turn. Due to space limitations, we do not include the
projection functions in this paper.

2.4 Matrix-Based Embedding
Matrix-based embeddings [21] are widely used to embed
among 2D grids with relatively close aspect ratios. The example given in Figure 3 shows an embedding, and the corresponding embedding matrix, for mapping a 2×32 grid into
a 5×13 grid. In the example, the dilation of the embedding
(3) is very close to the average dilation across all the edges
of the guest grid (2.5), which implies that most of the guest
edges are stretched. On the other hand, the average dilation
cost of folding-based embeddings reaches 2 only in the areas
where the turn maneuvers are applied; the remaining areas of the
embedding have an average dilation cost of 1. For this reason, our embedding selection process, presented in later
sections, gives higher priority to folding-based methods over
matrix-based methods. That is, we first try folding-based
methods when certain criteria are satisfied. When
a folding-based method fails to obtain an embedding,
we fall back to the matrix-based method.
Figure 3: Embed 2 × 32 grid to 5 × 13 grid
Nevertheless, the matrix-based method is very general and can
efficiently embed a 2D guest grid (with dimensions A × B)
into its ideal sub-grid of the host grid. Here, for a given host grid with dimensions X × Y, the ideal grid is
defined as X × Y ′ ,
where Y ′ = ⌈A × B/X⌉. Therefore, for embeddings among
2D grids, we use it as the default method. In addition, when the
guest and host grids are both tori, provided the relative orientation
of the guest graph is maintained, matrix-based embedding
can utilize the torus links of the host grid.
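The ideal sub-grid dimension is a one-line computation; the sketch below (with illustrative naming of our own) reproduces the definition Y′ = ⌈A × B / X⌉:

```python
import math

def ideal_subgrid(A, B, X):
    """Ideal sub-grid X x Y' of a host grid X x Y for an A x B guest
    grid, following the definition above: Y' = ceil(A * B / X)."""
    return X, math.ceil(A * B / X)
```

For the 2 × 32 guest grid and a host dimension X = 5, `ideal_subgrid(2, 32, 5)` returns (5, 13), the 5 × 13 ideal grid of Figure 3.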
The core of the matrix-based embedding is the generation
of the matrix, much of which is sequential because of
the dependences between contiguous cells of the matrix. Although
the methods described in [21] enable parallel generation of
the matrix, the generated matrix computes host-grid coordinates from guest-grid coordinates. What
we need is the inverse function, i.e., each physical processor
computes its own logical coordinates. Following a procedure similar to that
described in [21], we have derived the inverse functions
to enable scalable computation of the embedding in the MPI
virtual topology library.
2.5 Summary of Embeddings into 2D Topologies
We have derived a number of embedding operators to embed
1D or 2D Cartesian topologies into 2D grids or tori. The operations we defined, and later use in 3D embeddings, are summarized along with brief descriptions in Table 1. While many of the operations are directly adopted from existing techniques, others are extended or
developed from existing ones to support scalable parallel
embeddings.
First, in the context of this paper, we use the following terminology:
- G represents the guest grid/torus.
- H represents the host grid/torus.
- |G| and |H| represent the sizes of G and H.
- A, B, and C are the sorted dimensions of the guest
grid/torus, which satisfy A ≤ B ≤ C.
- A′ , B′ , or C′ represent one dimension of the guest
grid/torus. When used together, they represent distinct
dimensions of the guest grid/torus.
- X, Y , and Z are the sorted dimensions of the host
grid/torus, which satisfy X ≤ Y ≤ Z.
- X ′ , Y ′ , or Z ′ represent one dimension of the host
grid/torus. When used together, they represent distinct
dimensions of the host grid/torus.
- AB, A′ B′ , XY , or X ′Y ′ are occasionally used to represent a 2D grid/torus or a 2D sub-topology of a higher-dimensional grid/torus, where the meanings associated with
′ in A′ , B′ , X ′ , and Y ′ are consistent with the above definitions.
- A × B, A′ × B′ , X × Y , or X ′ × Y ′ represent the sizes of
corresponding grids/tori or sub-topologies.
Given that each of the above embedding methods for 2D
grids works well for certain cases, an important step in integrating them into an effective library is to define the conditions and procedure for their application. Figure 4 specifies
the procedure we use: a sequence of steps, executed in order, for embedding into 2D grids/tori. Due to space
restrictions, sub-steps such as matrix-based embeddings into
tori are omitted. Note that because the procedure is executed
in sequential order, the conditions of later steps imply that
the previous steps were not performed or did not return
a valid embedding. While most conditions and predicates are
straightforward, some deserve explanation.
In step 3, because RFold2D needs at least four turns to bring the
start and end points of the B dimension (the larger dimension of the
guest grid) together, the condition A < B/4 is necessary to embed at least the four 90-degree turns. On the other hand, for
MFold2D, since at least one 180-degree turn is
needed, the condition A < B/2 is specified. Steps 4 and 5 essentially apply the Compress() operation to cover general cases of
2D grid embedding.
3 Embeddings of 3D Cartesian Topologies
In this section, we concentrate on embedding guest grids/tori
with one, two, or three dimensions into 3D grids or tori. In addition to
the basic strategy, we also present how we guard the use
of the basic strategy to increase the chance of a successful
mapping.
The simplest embedding, called Sit3D(A,B,X,Y) in this paper,
simply applies Sit1D(A,X),
Sit1D(B,Y), and Sit1D(C,Z) to embed the three dimensions
of the guest grid/torus into the three dimensions of the host
grid/torus independently.
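Sit3D amounts to an identity placement per dimension. Assuming each guest dimension fits in the corresponding host dimension, a minimal sketch (our illustrative code, not the library's API) is:

```python
from itertools import product

def sit1d(a, x):
    """Sit1D(A,X) sketch: map a line of size a onto the first a nodes
    of a host dimension of size x (valid only when a <= x)."""
    assert a <= x
    return list(range(a))

def sit3d(guest_dims, host_dims):
    """Sit3D sketch: apply Sit1D independently to each dimension,
    yielding the identity mapping from guest to host coordinates."""
    for g, h in zip(guest_dims, host_dims):
        assert g <= h, "each guest dimension must fit in the host dimension"
    return {c: c for c in product(*(range(g) for g in guest_dims))}
```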
Table 1: Methods/Operators for Embedding into 2D Grids/Tori

Operator        | Description                                                                      | Reference
Guest topology is 1D:
Sit1D(A,X)      | map the nodes of a line of size A to the first A nodes of X                      | naive
Wrap(A)         | embed a ring onto a line with A nodes                                            | [22]
Scat(A,X)       | scatter/stretch a ring onto a ring with X nodes                                  | [15]
RFold1D(A,X)    | fold a ring onto a 2D grid/torus, starting parallel to Y                         | [22]
LFold1D(B,X)    | fold a line onto a 2D grid/torus, starting parallel to Y                         | [22]
Guest topology is 2D:
Sit2D(A,X)      | Sit1D(A,X) and Sit1D(B,Y)                                                        | naive
Compress(B,Y)   | matrix embedding of A × B onto an ideal grid X ′ × Y with B along Y (when B < Y) | [21; 22]
Stretch(B,Y)    | matrix embedding of A × B onto an ideal grid X ′ × Y with B along Y (when B > Y) | extended from [21; 22]
MFold2D(B,Y)    | fold the rectangular mesh B × A onto the mesh X × Y with B along Y               | [3]
RFold2D(B,Y)    | fold the rectangular torus B × A onto the mesh X × Y with B along Y              | extended from [22; 3; 11]

Step | Conditions                                              | Embedding Method
1    | IF( A ≤ B ≤ X ≤ Y )                                     | Sit2D(B,X)
2    | IF( A ≤ X < B ≤ Y )                                     | Sit2D(A,X)
3    | IF( (A ≤ X ≤ Y ≤ B) ∧ (A ≤ X/2 ∨ A ≤ Y/2) ∧ Y ≤ B/2 ): |
     |   IF( B has torus link ∧ A < B/4 )                      | try RFold2D(B,*)
     |   IF( B has no torus link ∧ A < B/2 )                   | try MFold2D(B,*)
4    | IF( A ≤ X ≤ Y ≤ B ):                                    |
     |   IF( B/Y ≤ X/A )                                       | Compress(B,Y)
     |   IF( B/Y > X/A )                                       | Stretch(A,X)
5    | IF( X < A ≤ B < Y )                                     | Compress(A,X) or Stretch(B,Y)

Figure 4: The procedure for selecting 2D embedding methods
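The cascade of Figure 4 can be sketched as a chain of guards. The strings returned below are stand-ins for the operators of Table 1, and the condition set follows our reading of the figure; this is a sketch, not the library's code:

```python
def select_2d_method(A, B, X, Y, b_has_torus_link=False):
    """Sketch of the selection cascade of Figure 4.  Dimension sizes
    are assumed sorted (A <= B for the guest, X <= Y for the host);
    the returned strings stand in for the operators of Table 1."""
    if A <= B <= X <= Y:
        return "Sit2D(B,X)"
    if A <= X < B <= Y:
        return "Sit2D(A,X)"
    if A <= X <= Y <= B and (A <= X / 2 or A <= Y / 2) and Y <= B / 2:
        if b_has_torus_link and A < B / 4:
            return "RFold2D(B,*)"
        if not b_has_torus_link and A < B / 2:
            return "MFold2D(B,*)"
    if A <= X <= Y <= B:
        return "Compress(B,Y)" if B / Y <= X / A else "Stretch(A,X)"
    if X < A <= B < Y:
        return "Compress(A,X) or Stretch(B,Y)"
    return None  # fall back to the default (linearization-based) mapping
```

For instance, a 3 × 4 guest on an 8 × 8 host sits directly (step 1), while a 2 × 20 guest on a 4 × 8 host without torus links reaches the MFold2D branch of step 3.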
In the following sub-sections, we describe more complicated
embedding methods we developed. Later in this section,
we summarize the described techniques and specify conditions/predicates for their applications.
3.1 3D Embeddings for Pipe and Ring
The embeddings described in this sub-section are extended
from embeddings of 1D rings into 3D topologies. Specifically, for a given 2D or 3D guest grid/torus shaped
like a rectangle with low aspect ratio or like a thin pipe, we treat
it as a 1D line/ring and fold/wrap it in the 3D host
grid/torus. The projection function is extended from that for
embeddings of 1D lines/rings into 3D grids. Figure 5(a-c) shows some representative candidate guest grids that are
considered for such embeddings.
Because the method embeds guest grids/tori whose shapes
are close to lines or rings, the technique is very effective.
Figure 5(d) shows an example of embedding a 24×3×2 grid
into a 6×6×4 host grid. In the example, three 2D-fold operations are performed; the process is similar to folding or wrapping a
ring of size 8 into a 2×2×2 grid. The worst-stretched edges
come from the 90-degree-turn and 180-degree-turn maneuvers. Each such maneuver introduces dilation cost
within a 2D sub-grid, with dilation factor 2. The folding
maneuvers applied in different 2D sub-grids do not re-embed
each other's dilated edges. Therefore, the dilation cost of the
whole embedding is 2.
In terms of the detailed embedding process, similar to folding a
1D line/ring into a 3D graph, the various turning points are determined first. Then turning methods are assigned to each
turning point. For instance, for the case of embedding a
3D torus (in the shape of a thin 3D circle) into a 3D cube,
four 90-degree-turn maneuvers may be needed to bring the end
face of the 3D ring close to its starting face. In addition, we found that the dilation costs and the average dilation costs are the same for embeddings of 3D rings
into 3D grids and into 3D tori. However, utilizing the torus links during
embedding helps the parallel tasks use the torus links
in a better organized manner and may therefore yield
slightly better performance.
3.2 Paper Folding
An intuitive way to map a two-dimensional grid into a three-dimensional grid is to follow the process of folding a paper
into multiple layers. The idea can be used for embeddings
of 2D grids, and of 3D grids shaped like a thin panel, into
near-cube 3D host grids/tori. Figures 6(a) and 6(b) give an
example of embedding a 10×9 2D grid into a 5×3×6 3D grid
with a dilation factor of 5.

This naive 3D paper folding method can be improved by utilizing the 2D grid folding techniques described in the previous
section. We call this method PaperFold. Figure 6(c) illustrates the folding scheme and shows that the dilation factor
drops to 2. Exploring the folding idea further, when
the guest grid is a 2D/3D torus, the torus edges can be maintained by applying ring folding in the 2D plane defined by the
dimensions excluding the dimension whose shape and size
are unchanged (the X dimension in Figure 6(a)).

Figure 5: 3D Ring Folding. (a) 2D candidate for 3D ring folding; (b) 3D pipe; (c) 3D ring; (d) 3D ring folding.

Figure 6: PaperFold. (a) Fold the Y dimension of a 2D grid onto the Z dimension; (b) naive 3D to 3D folding; (c) dilation-2 3D to 3D folding.

The 3D fold embedding of a 2D guest grid, or of a 3D guest grid
with a panel shape, can always be decomposed into two steps.
To embed a 3D guest grid/torus A × B × C with A < B < C
into a 3D host grid X × Y × Z with X < Y < Z, the general
condition for paper folding is A′ >= X ′ ∧ B′ > Y ′ ∧ C′ < Z ′ /2.
The procedure is to first fold dimension B′ onto C′ and then
fold A′ onto C′ , as shown in Figure 6(c).

One observation about applying folding to 3D embeddings is that each folding introduces dilation cost 2, and
the dilation costs are multiplicative. For most 3D grid embedding problems (embedding a thin 3D grid into a near-square
3D grid), we can simply apply folding twice, and the mapping
dilation cost is bounded by 4.

Often, applications require 2D square grids whose dimension sizes are not divisible by any of the sizes of the
three dimensions of the host grid. For these cases, if we simply
perform PaperFold, the expansion factor of the embedding is
fairly large. For instance, in our study of running NAS BT
on a partition with 512 nodes organized into an 8×8×8 torus,
the guest topology is a 22×22 2D square grid, because
484 is the maximal square number smaller than 512.
To embed such 2D grids into a 3D grid, we first find an intermediate 2D grid I. The requirements on I are: (a) the 2D
guest grid can be embedded into I with minimal expansion,
and (b) the size of one of the dimensions of I is a multiple
of one of the dimensions of the 3D host grid. Because the
dilation cost of embedding a 2D grid into a 3D grid using
PaperFold is 2, the final dilation of the above two-step procedure is
two times the dilation of the embedding from the guest grid
to the intermediate grid, which is usually bounded.

3.3 General 3D to 3D Embedding

Given the relatively non-trivial search space for a reasonable
application of the many basic embedding techniques to 3D embedding, we developed an approach to deal with the general
cases. Our approach is composed of two consecutive 2D to 2D
embeddings:
1. Map a 2D sub-topology of the guest grid to its ideal 2D
grid (a 2D sub-grid of a 2D sub-topology of the host
grid). Assume this embeds A × B into X × Y ′ , where Y ′ ≤ Y.
2. Map Y ′ × C onto Y × Z.
In our embedding process for 3D grids, we always first try to
find embeddings with 3D folding and PaperFold; when
an embedding cannot be found by applying these two foldings,
we apply the general approach. Therefore, the grids that end up using the general technique are likely to have shapes
and aspect ratios similar to those of the host grids. In addition, both
steps are likely to end up using matrix-based embedding (the Compress() operation described in the previous section), and we
expect the compression ratio for both Compress() steps to be
about three or smaller. Note that the two steps may dilate the
same guest edges, so the upper bound on the final dilation of
the two steps is the product of the dilation factors of the two
steps.

Table 2: Methods/Operators for Embedding into 3D Grids/Tori

Operator        | Description                                                        | Reference
Guest topology is 3D:
Sit3D(A,B,X,Y)  | Sit1D(A,X), Sit1D(B,Y), and Sit1D(C,Z)                             | naive
MFold3D(C,Z)    | treat a thin 3D structure as a 1D line and fold it in 3D;          | 3D extension of [22]
                | also used for embedding a line or a grid with small aspect ratio   |
RFold3D(C,Z)    | treat a thin-circle 3D structure as a 1D ring and fold it in 3D;   | 3D extension of [22; 11]
                | also used for embedding a ring or a torus with small aspect ratio  |
PaperFold(A,X)  | treat a thin 3D structure as a 2D paper and fold it into 3D;       | combination of [5; 3; 21]
                | the process is the same when the guest grid is 2D                  |
General3D       | embed common cases to which the above methods cannot apply         | developed

Step | Conditions                                                    | Embedding Method
1    | IF( A′ ≤ X ′ ∧ B′ ≤ Y ′ ∧ C′ ≤ Z ′ )                          | Sit3D
2    | IF( A′ ≡ X ′ ∧ B′ ≡ Y ′ )                                     | reduced to embedding into a 1D grid
3    | IF( A′ ≡ X ′ )                                                | reduced to embedding into a 2D grid
4    | IF( A′ ≥ X ′ ∧ B′ ≥ Y ′ ∧ C′ ≤ Z ′ /2 ∧ (A′ /2 > C′ ∨ B′ /2 > C′ ) ) | try PaperFold(C′ ,Z ′ )
5    | IF( A′ × B′ ≤ X ′ × Y ′ /4 ∨ (A′ ≤ X ′ /2 ∧ B′ ≤ Y ′ ∧ C ≥ Z) )      | try Fold3D(C′ ,Z ′ )
6    | IF( A′ × B′ ≤ X ′ × Y ′ )                                     | General3D(A′ ,B′ ,X ′ ,Y ′ )

Figure 7: The procedure for selecting 3D embedding methods
3.4 Summary of Embeddings into 3D Grids/Tori
As a summary of our effort on embeddings into 3D grids/tori,
we list the operations and the corresponding techniques in
Table 2.
Similar to Sec. 2.5, Figure 7 specifies our procedure for
selecting and applying the different embedding operations when
embedding 3D or lower-dimensional grids/tori into 3D host
grids/tori. The steps are executed in sequential order, and
therefore the conditions of later steps imply that the previous
steps were not performed.
In Figure 7, steps 2 and 3 simply deal with special cases
that can be simplified to embeddings among 1D and 2D
grids/tori. The condition for PaperFold (step 4) is rather
loose; it makes sure that either A′ or B′ can be folded into C′ .
It requires permuting both A, B, C and X, Y , Z to find
out whether the condition is ever met. In step 5, RFold3D is
applied when the longer dimension of the guest grid has a
torus link.
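Because the PaperFold condition must be checked over permutations of both dimension triples, a small exhaustive search suffices. The sketch below (our own illustrative code) tests the step-4 condition of Figure 7 over all pairings:

```python
from itertools import permutations

def paperfold_applicable(guest, host):
    """Check the step-4 (PaperFold) condition of Figure 7 over all
    pairings of guest dimensions (A',B',C') with host dimensions
    (X',Y',Z'), since, as noted in the text, both triples must be
    permuted.  Returns the first pairing that satisfies
    A' >= X' and B' >= Y' and C' <= Z'/2 and (A'/2 > C' or B'/2 > C'),
    or None if no pairing does."""
    for a, b, c in permutations(guest):
        for x, y, z in permutations(host):
            if a >= x and b >= y and c <= z / 2 and (a / 2 > c or b / 2 > c):
                return (a, b, c), (x, y, z)
    return None
```

For example, a 10 × 9 guest grid (treated as 10 × 9 × 1) satisfies the condition against a 5 × 3 × 6 host, matching the PaperFold example of Figure 6, while a near-cube 4 × 4 × 4 guest on a 4 × 4 × 4 host does not.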
4 Implementation Issues
Currently, besides supporting the linearization-based process
numbering/mapping, BG/L MPI allows a user to provide a
mapping file that explicitly specifies a list of torus coordinates
for all MPI tasks [5]. This simple approach allows users to
control the task placement of an application at launch
time. While having users dictate mappings simplifies the BG/L control system and the MPI implementation,
it is not portable and adds an additional task for the user, namely
the generation of different mapping files for different BG/L
partitions. In the end, an application-specific program for
mapping generation is needed to run on different BG/L partitions.
We have implemented two interfaces for our topology mapping library for BG/L. First, we provide a standalone interface
that allows users to generate a BG/L MPI mapping file for a given pair
of guest and host grids/tori. Second, we have integrated most of the functionality into the BG/L MPI library to
support the MPI Cartesian topology functions. In the rest of this
section, we briefly discuss implementation issues in the support of the MPI Cartesian topology functions and of the
BG/L virtual node operation mode.
The MPI standard defines topology functions that provide
a way to specify task layout at run time. They essentially provide a portable way to adapt MPI applications to the
communication architecture of the target hardware. MPI
specifies two types of virtual topologies: graph topology,
describing a graph with irregular connectivity, and Cartesian
topology, describing a multi-dimensional grid. Because the
communication network of BG/L is a 3D torus, we have concentrated on the support for Cartesian topology.
An MPI virtual Cartesian topology is created by
MPI_Cart_create(), whose inputs describe a preferred
Cartesian topology. Additional functions are defined for a
process to query a communicator for information related to
the communicator's Cartesian topology, e.g., the ranks of any
neighbor in the virtual grid, the dimensionality of the virtual
grid, etc. With this set of functions, an MPI application
can map its tasks dynamically and transparently, and it is
the MPI system's responsibility to realize an efficient mapping
from the application's requested grid onto the underlying
physical interconnect.
We have plugged our topology mapping library into the
system-dependent layer of the BG/L MPI implementation (essentially an optimized port of MPICH2 [19]).
Specifically, when the application calls MPI_Cart_create() or
MPI_Cart_map(), the mapping functions in our library are invoked first. If the library cannot compute a valid mapping/embedding, the default (linearization-based) implementation
is called.
BG/L supports two operation modes: co-processor mode and
virtual node mode, the latter of which uses both processors on a compute
node [5]. In co-processor mode, a single process uses both
processors. In virtual node mode, two processes, each using
half of the memory of the node, run on one compute node,
with each process bound to one processor. Virtual node mode doubles the number of tasks in the BG/L message layer; it also introduces a refinement in the addressing of tasks. Instead of
being addressed with a triplet (x,y,z) denoting the 3D physical torus coordinates, tasks are addressed with quadruplets
(x,y,z,t), where t is the processor ID (0 or 1) within a compute
node. We call the additional dimension, consisting of the two CPUs
of one compute node, the T dimension. In our implementation, to avoid complicating the problem, we did not treat
the issue as a problem of embedding into 4D tori, because the
T dimension, compared to the other three regular torus dimensions,
is too small. Instead, we always disregard the T dimension
and solve the embeddings of the lower-dimensional topologies.
Then the T dimension is placed in the inner-most (fastest-changing) dimension of the embedded virtual topologies.
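This placement can be sketched as follows. Here `embed_xyz` is a stand-in for the library's 3D mapping function, and the code reflects our reading of the scheme (halving the innermost virtual dimension across the two CPUs of a node), not the actual implementation:

```python
def place_virtual_node_mode(virtual_coord, embed_xyz):
    """Virtual node mode placement sketch.

    virtual_coord: guest coordinates with the innermost
                   (fastest-changing) dimension last.
    embed_xyz:     a 3D embedding function (stand-in for the library's
                   mapping) applied to the reduced guest coordinates.
    Returns a quadruplet (x, y, z, t) with t the CPU ID within a node.
    """
    *outer, inner = virtual_coord
    t = inner % 2                      # split innermost dim over the 2 CPUs
    x, y, z = embed_xyz(tuple(outer) + (inner // 2,))
    return (x, y, z, t)
```

With an identity 3D embedding, consecutive values of the innermost virtual coordinate land on the two CPUs of the same compute node, as described above.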
5 Empirical Evaluation
In this section, we present the evaluation of our topology
mapping library against a collection of widely used benchmark programs and real applications on a large-scale Blue
Gene/L system. We show that, for a large number of realistic cases, the mappings generated by our library not only
have much lower dilation factors, but also achieve close-to-constant hop counts for messages exchanged among processes. As a result, communication costs are significantly
reduced and scalability is largely improved.
5.1 Experiment Setup
We use the MPI tracing/profiling component of IBM High
Performance Computing Toolkit [1] in our study. We use
the following two performance metrics that are dynamically
measured by the tracing library:
- Communication Time is the total time a processor spends
in MPI communication routines.
- Average Hops is based on the Manhattan distance between two processors. The Manhattan distance of a pair of processors
p, q, with physical coordinates (x_p, y_p, z_p) and
(x_q, y_q, z_q), is defined as Hops(p, q) = |x_p - x_q| + |y_p - y_q| + |z_p - z_q|. We define the average hops for all messages sent from a given processor as:

    average-hops = (Sum_i Hops_i * Bytes_i) / (Sum_i Bytes_i)

where Hops_i is the Manhattan distance between the sending and
receiving processors of the i-th message, and Bytes_i is the
message size.
The rationale behind the average-hops metric is to measure, for any given MPI message, the number
of hops each byte has to travel. While the metric reflects the
hops of the messages actually exchanged between pairs of MPI communication partners, it differs from the dilation factors
of an embedding in that it reflects the real communication scenario.
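The metric above is straightforward to compute from a message trace; a minimal sketch (the message-tuple format is our assumption for illustration):

```python
def manhattan_hops(p, q):
    """Manhattan distance between two physical coordinates."""
    return sum(abs(a - b) for a, b in zip(p, q))

def average_hops(messages):
    """Byte-weighted average Manhattan distance over a list of
    (src_coord, dst_coord, bytes) messages, as defined in the text."""
    total_bytes = sum(b for _, _, b in messages)
    weighted = sum(manhattan_hops(p, q) * b for p, q, b in messages)
    return weighted / total_bytes

msgs = [((0, 0, 0), (1, 0, 0), 100),   # 1 hop, 100 bytes
        ((0, 0, 0), (2, 3, 1), 300)]   # 6 hops, 300 bytes
print(average_hops(msgs))  # (1*100 + 6*300) / 400 = 4.75
```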
For both performance metrics, we record the average values and the maximum values. The maximum values represent the "worst case": the maximal communication time is
the communication time of the processor with the longest
communication time; the maximal average hops is that of
the MPI message with the largest average hops.
We used benchmark programs from the NAS Parallel Benchmark suite (NPB 2.4) and two scientific applications. The NAS
NPB suite has been widely used for studies of
parallel performance; descriptions of the benchmarks can be found in [9].
In this paper, we include detailed results for 4 NAS NPB
programs (BT, SP, LU, and CG). Among the rest of the
NAS NPB programs, FT and EP do not use point-to-point
communication primitives, and IS performs point-to-point
communication only once, with each node sending one integer value to
its right-hand-side neighbor. The virtual topology of MG is
a 3D near-cube, which maps exactly onto the topology of
BG/L partitions with a permutation of the three dimensions
([5] describes how a user can specify such a mapping on BG/L).
For all the results obtained on the NAS NPB benchmarks, we
used the class D problem sizes, the largest of NPB 2.4.
The applications are SOR and SWEEP3D [23] from ASCI
benchmark programs. SOR is a program for solving the
Poisson equation using an iterative red-black SOR method.
This code uses a two dimensional process mesh, where communication is mainly boundary exchange on a static grid
with east-west and north-south neighbors. This results in
a simple repetitive communication pattern, typical of grid-point codes from a number of fields. SWEEP3D [23] is a
simplified benchmark program that solves a neutron transport problem using a pipelined wave-front method and a
two-dimensional process mesh. Input parameters determine
problem sizes and blocking factors, allowing for a wide
range of message sizes and parallel efficiencies.
Table 3: Topology Scenarios of the Test Programs. Each entry lists the total number of processes and the corresponding Cartesian topology; dilation is reported as "Dilation / Avg. Dilation".

NAS BT, SP : 2D square mesh
  Guest Topo.     Host Topo.        Default Mapping   Optimized Mapping
  256:  16x16     256:  8x8x4        3 / 1.633         2 / 1.038
  484:  22x22     512:  8x8x8        6 / 3.108         4 / 1.620
  1024: 32x32     1024: 8x16x8       5 / 2.661         2 / 1.051
  2025: 45x45     2048: 8x16x16     10 / 5.049         4 / 1.587
  4096: 64x64     4096: 8x16x32      9 / 4.802         2 / 1.025

NAS LU, CG; SOR : 2D near-square mesh
  256:  16x16     256:  8x8x4        3 / 1.633         2 / 1.038
  512:  32x16     512:  8x8x8        3 / 1.656         2 / 1.055
  1024: 32x32     1024: 8x16x8       5 / 2.661         2 / 1.051
  2048: 64x32     2048: 8x16x16      5 / 2.680         2 / 1.026
  4096: 64x64     4096: 8x16x32      9 / 4.802         2 / 1.025

SWEEP3D : 2D near-square mesh
  256:  16x16     256:  8x8x4        3 / 1.633         2 / 1.038
  512:  16x32     512:  8x8x8        5 / 2.754         2 / 1.055
  1024: 32x32     1024: 8x16x8       5 / 2.661         2 / 1.051
  2048: 32x64     2048: 8x16x16      9 / 4.768         2 / 1.026
  4096: 64x64     4096: 8x16x32      9 / 4.802         2 / 1.025
Table 3 lists brief topology-mapping-related information for the programs, the specific mapping scenarios used in our
study, and the dilation factors computed for the BG/L default
mapping and for the optimized mapping generated by our library.
In the guest/host topology columns, we list the total number of processes and the corresponding Cartesian topology.
The computed dilation factors show that the edge dilations
associated with our optimized mapping are very small (many
equal to 2), and much smaller than the dilation factors
of BG/L's default mapping.
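The dilation figures in Table 3 can be recomputed from a mapping and the guest-mesh edges; the sketch below reproduces the first row of the table for the default row-major mapping (the coordinate order is our assumption):

```python
def torus_hops(p, q, shape):
    """Manhattan distance on a torus with wraparound links."""
    return sum(min(abs(a - b), s - abs(a - b))
               for a, b, s in zip(p, q, shape))

def dilation(guest_shape, mapping, host_shape):
    """Maximum and average edge dilation of embedding a 2D guest
    mesh into a 3D host torus under the given task->coord mapping."""
    rows, cols = guest_shape
    dists = []
    for r in range(rows):
        for c in range(cols):
            for dr, dc in ((0, 1), (1, 0)):      # east and south mesh edges
                r2, c2 = r + dr, c + dc
                if r2 < rows and c2 < cols:
                    a = mapping[r * cols + c]
                    b = mapping[r2 * cols + c2]
                    dists.append(torus_hops(a, b, host_shape))
    return max(dists), sum(dists) / len(dists)

# Default row-major mapping of a 16x16 guest mesh onto an 8x8x4 torus:
default = {r: (r % 8, (r // 8) % 8, r // 64) for r in range(256)}
print(dilation((16, 16), default, (8, 8, 4)))  # → (3, 1.633...), as in Table 3
```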
In the following subsection, we investigate the performance
impact of optimized mappings with low dilation factors on
realistic programs. For the experiments presented in this section, we compiled the programs with the IBM XL compiler on
BG/L. For each scenario we studied, we generated
a BG/L mapping file using our mapping library. We then
ran the specific program and input case on a BG/L partition matching the host topology described in the scenario and
collected the performance metrics.
Note that an evaluation of the support for MPI topology
functions is not presented here, primarily
because we did not find an application or benchmark program
written with the MPI topology routines.
5.2 Results of NAS Benchmarks
Results of the 4 NAS programs are given in Figure 8. For each program, the
left graph shows the measurements of
average hops and the right graph shows the measurements
of communication time. The bars are the averages across
all messages and the lines are the "worst cases". The horizontal
axis represents the number of processors, which corresponds
to the tested mapping scenario given in Table 3.
The results of BT, SP, and LU are all consistent in the sense
that the dilation is significantly reduced (in many cases bounded
by 2 hops) by using the mapping produced by our mapping
library. As a result, BT and SP benefited from the high-quality mapping and show significant reductions in their
communication costs, primarily because the two programs have nearest-neighbor communication patterns. In particular, for the cases using BG/L partitions with 512
and 2048 compute nodes, because the programs require a
square number of processes, the mapping problems are to map 22x22 and 45x45 meshes onto the corresponding
partitions. The default, linearization-based mapping has no
special handling for such cases. One of the worst dilated
edges when mapping 22x22 onto 512 nodes with the default mapping is the edge between guest nodes (1,20) and (2,20),
which are mapped to physical nodes (2,5,0) and
(0,0,1); the edge is dilated by 6. In our optimized mapping, the perfect square meshes 22x22 and 45x45 are compressed onto 32x16 and 64x32, which are then folded into 3D
grids (Section 3.2), realizing a dilation of 4.
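The dilated edge cited above can be reproduced directly; the sketch below assumes row-major linearization for both the guest mesh and the physical coordinates (our assumption for illustration):

```python
def torus_hops(p, q, shape):
    """Manhattan distance with torus wraparound."""
    return sum(min(abs(a - b), s - abs(a - b))
               for a, b, s in zip(p, q, shape))

def linearize(guest, cols, host_shape):
    """Row-major rank of guest node (row, col), then row-major
    physical coordinates on the host torus."""
    X, Y, _ = host_shape
    rank = guest[0] * cols + guest[1]
    return (rank % X, (rank // X) % Y, rank // (X * Y))

# Guest edge (1,20)-(2,20) of the 22x22 mesh on an 8x8x8 torus:
a = linearize((1, 20), 22, (8, 8, 8))   # → (2, 5, 0)
b = linearize((2, 20), 22, (8, 8, 8))   # → (0, 0, 1)
print(torus_hops(a, b, (8, 8, 8)))      # → 6, the dilation of this edge
```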
On the other hand, although the average-hops measurements
of LU are very good when using the mapping files we generated,
there is little or no improvement in the communication cost.
This is because LU involves communication between processes at varying distances.
The results of CG are rather counterintuitive. Although
the estimated dilation of the optimized mapping is only
2, the average communication hops are much higher. This is
because in CG, each process P exchanges messages with the
processes of the same row that are exactly 2^k hops away
from P, where the hops are in terms of the guest 2D grid.
For example, assuming each row has 32 nodes, node 2
of a row communicates with nodes 3, 4, 6, 10, and 18 of
the same row. When mapping this dimension to the host grid
using the default, linearization-based mapping with a power-of-two radix, the physical distances of far-apart
communication partners may be small. On the other hand,
because the optimized mapping tries to optimize the connectivity
of the nearest neighbors, the physical distances of the long-distance communication partners may not be optimized. For
instance, Figure 9 shows how a row is mapped onto a 2D
space with inner-dimension size 8, using the
default mapping and the optimized mapping. As highlighted
in Figure 9(a), with the default mapping, node 2 ends
up 1 hop away from node 10. In Figure 9(b), with the
optimized mapping, the physical distance between node 2 and node
10 is 4.
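The contrast in Figure 9 can be reproduced by laying a 32-node row into a 2D space with inner size 8; the boustrophedon ("snake") layout below is an illustrative stand-in for the optimized mapping, not our library's exact algorithm:

```python
def row_major(i, width):
    """Default: consecutive nodes fill each row left to right."""
    return (i % width, i // width)

def snake(i, width):
    """Folded: odd rows are reversed, keeping consecutive nodes adjacent."""
    r = i // width
    c = i % width if r % 2 == 0 else width - 1 - (i % width)
    return (c, r)

def hops(p, q):
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

# CG partners of node 2 sit 2**k nodes away in the row: 3, 4, 6, 10, 18.
for layout in (row_major, snake):
    print([hops(layout(2, 8), layout(p, 8)) for p in (3, 4, 6, 10, 18)])
# row_major → [1, 2, 4, 1, 2]: node 10 is only 1 hop away
# snake     → [1, 2, 4, 4, 2]: node 10 is 4 hops away, as in Figure 9
```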
Figure 8: NAS NPB 2.4 results (with class D inputs, co-processor mode). Panels: (a) BT, (b) SP, (c) LU, (d) CG.
Figure 9: CG communication partners. (a) Default mapping; (b) Optimized mapping.
5.3 Application Results

Figure 10 gives the detailed performance results for SOR
and SWEEP3D. The SOR results confirm that, for applications with nearest-neighbor communication patterns, optimized process mapping improves communication performance significantly: not
only the average hops but also the communication time improve markedly with our mapping library. This is because
the communication pattern of SOR consists of "ping-pong" messages
among nearest neighbors, the representative communication pattern that benefits from high-quality topology mapping. Note that the measured average hops for the cases
using our mapping library stays flat, and the benefit
becomes more significant as the number of processors
increases. For 4096 processors, the average communication
time improves by 42%; for the worst case (the
processor that spends the most time in communication), the communication cost improves by 25%.

Similar to the NAS-LU results, the SWEEP3D results in
Figure 10(b) show that although the average hops improve significantly in all cases, the communication cost
is not affected. This is because the communication pattern of SWEEP3D is composed of pipelined
wavefronts: although its communications are all among
nearest neighbors, the pipeline hides the relatively high latencies between certain neighboring wavefronts containing
extended mesh edges. Nevertheless, because the communications of SWEEP3D are among neighbors, the optimized
mapping does not introduce performance degradation as in the
NAS-CG case.
6 Related Work
The problem of mapping parallel programs onto parallel systems has been studied extensively since the beginning of parallel computing. The problem is essentially equivalent to the
graph embedding problem. Nevertheless, for different applications, the problem has different constraints. For
instance, there are many-to-one and one-to-one mappings.
Similarly, some methods develop techniques
for mapping data structures onto processors, while others map
parallel processes onto processors. In this paper, we concentrated on one-to-one mappings from parallel processes to processors.
In terms of solution approaches, a large number of methods
based on graph partitioning and search-based optimization have been developed (a few examples are [12; 20;
24]). Following this approach, [6] described a simulated-annealing-based method to explore an application's communication
pattern and in turn discover the most beneficial mapping of
the application's tasks onto BG/L.

Figure 10: Application results (co-processor mode). (a) SOR; (b) SWEEP3D.

The off-line approach introduced in their paper is effective for performance tuning and
knowledge discovery of complicated applications. Our work
is orthogonal and complementary to their approach in the
sense that when the logical topology of an application's communications is well defined, our topology mapping library should be used to port the application to various BG/L
partitions easily. When the communication pattern of an application is irregular or dynamic (like that of NAS-CG), their
tool should be used to uncover a proper mapping onto BG/L
topologies.
Another approach is to embed guest graphs into host graphs
via projection functions. These methods usually develop solutions for special cases with low complexity. As discussed
in Section 2, most of the effective results on grid embeddings
were obtained for embedding into two-dimensional grids. In
this section, we discuss related efforts on embeddings into
three-dimensional grids/tori, embeddings for tori, support
for MPI topology functions, and topology-mapping-related
work for BG/L.

PARIX mapping [22] is a comprehensive topology mapping
library developed with an approach similar to ours, i.e., exploring effective techniques for 2D grid embedding. In this
sense, our work is similar to theirs. Nevertheless, their integration for embedding of 3D grids/tori is not as complete.
Directly applying their steps for embedding 3D grids into 2D
grids, the embedding of a 3D grid into a 3D grid would involve
three steps: first, unfold the 3D guest grid into a 2D grid along a
single dimension; then, perform a 2D embedding of the 2D
intermediate grid into another 2D intermediate grid that can be
folded into the 3D host grid by pipe/ring foldings; finally,
fold the second intermediate grid into 3D. The drawback of this
procedure is that the first step (unfolding from 3D to 2D)
introduces a large dilation equal to the size of the
smallest dimension.

A Gray code is an ordering of the 2^n n-bit binary numbers such that
only one bit changes from one entry to the next. Applying Gray-code
numbering for grid embedding, in the case of 8 nodes,
the nodes can be numbered 000, 001, 011, 010, 110, 111,
101, 100; this sequence is a Gray code. Owing to its simplicity,
Gray code has been successfully applied for embedding into
hypercube topologies, and researchers have explored its application for embedding among k-ary n-cube topologies [7;
16]. When embedding a one-dimensional ring onto multi-dimensional grids, Gray-code-based embedding is exactly the ring folding method mentioned in Section 2. Nevertheless, it is difficult to extend the approach
effectively to guest grids with two or more dimensions: when embedding among 2D and/or 3D grids, the
associated dilation factors are usually as large as the sizes of the dimensions.

Tao's mapping [16] is composed of ring-wrapping, simple-reduction, and general-reduction techniques for embedding
among high-dimensional meshes and tori. It is based on
a ring-wrapping method, similar to Gray-code methods,
for embedding a one-dimensional pipe/ring onto two-dimensional
grids. Based on re-factoring the sizes of the dimensions of
the guest grid and host grid, they introduced a simple reduction method and a general reduction method for embedding among
multi-dimensional grids. Their factorization-based techniques impose fairly strong constraints on the relative shapes
of the guest and host grids, and additional techniques
are needed to complement the solution scope. In addition,
their most general technique (general reduction) can introduce large dilation: for mapping onto 3D grids/tori,
the worst-case dilation factor equals the size of
the smallest dimension.

Kim and Hur [15] proposed an approach for many-to-one
embedding of a multi-dimensional torus onto a host torus
with the same number of dimensions. Their approach is
based on ring stretching/scattering, which we also use for
our torus Sit operations in this paper. Because they concentrated on many-to-one embedding, their approach does not
work well for general one-to-one grid/torus embeddings.
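The Gray-code numbering discussed in this section is cheap to generate: the standard binary-reflected Gray code of i is simply i ^ (i >> 1):

```python
def gray(i):
    """Binary-reflected Gray code of i: successive values differ in one bit."""
    return i ^ (i >> 1)

codes = [format(gray(i), "03b") for i in range(8)]
print(codes)  # ['000', '001', '011', '010', '110', '111', '101', '100']

# Adjacent entries differ in exactly one bit, so numbering the nodes of a
# ring by Gray code keeps ring neighbors at Hamming distance 1 -- the
# property exploited for hypercube and ring-folding embeddings.
assert all(bin(gray(i) ^ gray(i + 1)).count("1") == 1 for i in range(7))
```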
In terms of supporting MPI topology functions, [24] proposed a graph-partitioning-based technique for embedding
into a hierarchical communication architecture (the NEC
SX series), and [18] described techniques for embedding into
switch-based networks. These techniques are designed for
specific systems, whose networks differ from BG/L's.
Topology mapping on BG/L has been studied in [7; 6]. [7]
studied the performance impact of process mapping on BG/L
at rather small scales (using up to 128 BG/L
compute nodes). The study shows that Gray codes and a
one-step paper folding are fairly effective for mapping two-dimensional guest topologies onto BG/L partitions with sizes
up to 128 compute nodes (i.e., an 8x8x2 torus). Based on more
sophisticated grid embedding techniques, our approach covers many hard cases. In addition, our integration of existing
and novel techniques provides a rather complete solution for
systematically mapping applications onto BG/L.
7 Conclusions

To run scalable applications on today's scalable parallel systems with minimal communication overhead, effective and
efficient support for mapping parallel tasks onto physical processors is required. This paper described the design and integration of a comprehensive topology mapping library for
mapping MPI processes (parallel tasks) onto physical processors with three-dimensional grid/torus topologies. In developing the topology mapping library presented in this
paper, we not only conducted an extensive study of
existing practical techniques for grid/graph embeddings, but
also explored the design space and integration techniques for
embeddings of three-dimensional grids/tori. By providing
scalable support for the MPI virtual topology interface, portable
MPI applications can benefit from our comprehensive library. The results of our empirical study, using popular benchmarks and real applications with topologies scaling
up to 4096 nodes, further show the impact of our topology
mapping techniques and library on improving the communication performance of parallel applications.

For future work, we would like to find or co-develop applications that use the MPI virtual topology functions for a realistic
evaluation of our support for the MPI topology interface. In addition, we would like to look into embeddings of topologies
other than grids/tori into BG/L topologies.

Acknowledgement

We would like to acknowledge and thank George Almási,
José G. Castaños, and Manish Gupta from IBM T. J. Watson Research Center for their support; Brian Smith, Charles
Archer, and Joseph Ratterman from IBM Systems Group for
discussions and their effort in integrating our library into the
BG/L MPI library; and William Gropp for valuable discussions on the support of MPI topology functions.

References

[1] IBM Advanced Computing Technology Center MPI tracer/profiler. URL: http://www.research.ibm.com/actc/projects/mpitracer.shtml.

[2] Adiga, N. R., et al. 2002. An overview of the BlueGene/L supercomputer. In SC2002 – High Performance Networking and Computing.

[3] Aleliunas, R., and Rosenberg, A. L. 1982. On embedding rectangular grids in square grids. IEEE Transactions on Computers 31, 9 (September), 907–913.

[4] Almasi, G., et al. 2001. Cellular supercomputing with system-on-a-chip. In IEEE International Solid-State Circuits Conference (ISSCC).

[5] Almasi, G., Archer, C., Castanos, J. G., Erway, C. C., Heidelberger, P., Martorell, X., Moreira, J. E., Pinnow, K., Ratterman, J., Smeds, N., Steinmacher-Burow, B., Gropp, W., and Toonen, B. 2004. Implementing MPI on the BlueGene/L supercomputer. In Proc. of the Euro-Par Conference.

[6] Bhanot, G., Gara, A., Heidelberger, P., Lawless, E., Sexton, J. C., and Walkup, R. 2005. Optimizing task layout on the Blue Gene/L supercomputer. IBM Journal of Research and Development 49, 2 (March), 489–500.

[7] Smith, B. E., and Bode, B. 2005. Performance effects of node mappings on the IBM Blue Gene/L machine. In Euro-Par.

[8] Chan, M. J. 1996. Dilation-5 embedding of 3-dimensional grids into hypercubes. Journal of Parallel and Distributed Computing 33, 1 (February).

[9] Van der Wijngaart, R. F. 2002. NAS Parallel Benchmarks version 2.4. Tech. Rep. NAS-02-007, NASA Ames Research Center, Oct.

[10] Ellis, J. A. 1991. Embedding rectangular grids into square grids. IEEE Transactions on Computers 40, 1 (Jan.), 46–52.

[11] Ellis, J. A. 1996. Embedding grids into grids: Techniques for large compression ratios. Networks 27, 1–17.

[12] Erçal, F., Ramanujam, J., and Sadayappan, P. 1990. Task allocation onto a hypercube by recursive mincut bipartitioning. J. Parallel Distrib. Comput. 10, 1, 35–44.

[13] MPI Forum. 1997. MPI: A message-passing interface standard. URL: http://www.mpi-forum.org/docs/mpi-11-html/mpi-report.html, August.

[14] Hatazaki, T. 1998. Rank reordering strategy for MPI topology creation functions. In Proceedings of the 5th EuroPVM/MPI Conference, Springer-Verlag, Lecture Notes in Computer Science.

[15] Kim, S.-Y., and Hur, J. 1999. An approach for torus embedding. In Proceedings of the 1999 International Workshop on Parallel Processing, 301–306.

[16] Ma, E., and Tao, L. 1993. Embeddings among meshes and tori. Journal of Parallel and Distributed Computing 18, 44–55.

[17] Melhem, R. G., and Hwang, G.-Y. 1990. Embedding rectangular grids into square grids with dilation two. IEEE Transactions on Computers 39, 12 (December), 1446–1455.

[18] Moh, S., Yu, C., Han, D., Youn, H. Y., and Lee, B. 2001. Mapping strategies for switch-based cluster systems of irregular topology. In 8th IEEE International Conference on Parallel and Distributed Systems.

[19] The MPICH and MPICH2 homepage. URL: http://www-unix.mcs.anl.gov/mpi/mpich.

[20] Ou, C.-W., Ranka, S., and Fox, G. 1996. Fast and parallel mapping algorithms for irregular problems. J. Supercomput. 10, 2, 119–140.

[21] Röttger, M., and Schroeder, U. 1998. Efficient embeddings of grids into grids. In The 24th International Workshop on Graph-Theoretic Concepts in Computer Science, 257–271.

[22] Röttger, M., Schroeder, U., and Simon, J. 1993. Virtual topology library for PARIX. Tech. Rep. TR-005-93, Paderborn Center for Parallel Computing, University of Paderborn, Germany, November.

[23] The ASCI SWEEP3D benchmark code. URL: http://www.llnl.gov/asci_benchmarks/scsi/limited/sweep3d/asci_sweep3d.html.

[24] Träff, J. L. 2002. Implementing the MPI process topology mechanism. In Supercomputing, 1–14.