
Topology Mapping for Blue Gene/L Supercomputer

2006, ACM/IEEE SC 2006 Conference (SC'06)

Mapping virtual processes onto physical processors is one of the most important issues in parallel computing. The problem of mapping processes/tasks onto processors is equivalent to the graph embedding problem, which has been studied extensively. Although many techniques have been proposed for embeddings of two-dimensional grids, hypercubes, etc., there have been few efforts on embeddings of three-dimensional grids and tori. Motivated by the need for better task-mapping support on the Blue Gene/L supercomputer, in this paper we present embedding and integration techniques for embeddings of three-dimensional grids and tori. The topology mapping library based on these techniques generates high-quality embeddings of two- and three-dimensional grids and tori. In addition, the library is used in the BG/L MPI library for scalable support of MPI topology functions. With extensive empirical studies on large-scale systems against popular benchmarks and real applications, we demonstrate that the library can significantly improve the communication performance and scalability of applications.

Hao Yu, I-Hsin Chung, Jose Moreira
IBM Thomas J. Watson Research Center, Yorktown Heights, NY 10598-0218
{yuh,ihchung,jmoreira}@us.ibm.com

1 Introduction

Mapping the tasks of a parallel application onto the physical processors of a parallel system is one of the most essential issues in parallel computing. It is critical for today's supercomputing systems to deliver sustainable and scalable performance. The problem of mapping an application's task topology onto the underlying hardware's physical topology can be formalized as a graph embedding problem [3], which has been studied extensively in the past. For the sake of intuition, in this paper we use the terms embedding and mapping interchangeably.
In general, an embedding of a guest graph G = (VG, EG) into a host graph H = (VH, EH) is a one-to-one mapping φ from VG to VH. The quality of an embedding is usually measured by two cost functions (parameters): dilation and expansion. The dilation of an edge (u, v) ∈ EG is the length of a shortest path in H that connects φ(u) and φ(v). The dilation of an embedding is the maximum dilation over all edges in EG. The expansion of an embedding is simply |VH|/|VG|. Intuitively, the dilation of an embedding measures the worst-case stretching of edges, and the expansion measures the relative size of the guest graph. Because graph embedding problems are NP-hard, researchers have focused on developing heuristics. In the past two decades, a large number of graph embedding techniques have been developed to solve problems arising from VLSI circuit layout and from process mapping in parallel computing. Specifically, for optimizing the VLSI design of highly eccentric circuits, techniques that embed rectangular two-dimensional grids into square two-dimensional grids [3; 17; 10] were proposed and significant results were obtained. In the parallel computing domain, techniques were developed to map processes onto various high-performance interconnects such as hierarchical networks, crossbar-based networks, hypercube networks, switch-based interconnects, and multi-dimensional grids [24; 14; 8; 18; 16; 15; 12; 20].
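To make the two cost functions concrete, they can be computed directly from an explicit embedding; a minimal Python sketch (the helper names are our own illustrative choices, with host distances taken as shortest paths on a grid):

```python
from itertools import product

def grid_distance(p, q):
    # Shortest-path (Manhattan) distance between two nodes of a grid host.
    return sum(abs(a - b) for a, b in zip(p, q))

def dilation(guest_edges, phi, dist=grid_distance):
    # Maximum dilation over all guest edges (u, v) under embedding phi.
    return max(dist(phi[u], phi[v]) for u, v in guest_edges)

def expansion(num_host_nodes, num_guest_nodes):
    # |V_H| / |V_G|.
    return num_host_nodes / num_guest_nodes

# Example: the identity embedding of a 2x2 grid into itself has dilation 1.
nodes = list(product(range(2), range(2)))
phi = {n: n for n in nodes}
edges = [(u, v) for u in nodes for v in nodes if grid_distance(u, v) == 1]
```

An embedding with dilation 1 preserves every adjacency; an expansion of 1 means no host node is wasted.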
In this paper, we concentrate on the problem of embedding for three-dimensional grids and tori. Our research is motivated by the need to provide better support for task mapping on the Blue Gene/L (BG/L) supercomputer. The BG/L supercomputer is a massively parallel system developed by IBM in partnership with Lawrence Livermore National Laboratory (LLNL). BG/L uses system-on-a-chip integration [4] and a highly scalable architecture [2] to assemble machines with up to 65,536 dual-processor compute nodes. The primary communication network of BG/L is a three-dimensional torus. To provide system-level topology mapping support for BG/L, we need effective and scalable embedding techniques that map an application's virtual topologies onto three-dimensional tori or two-dimensional sub-topologies. Although there is a large body of work on embeddings for various topologies, we found relatively few techniques for efficient and scalable embeddings among three-dimensional grids/tori. In particular, although techniques based on graph partitioning and/or searching can find fairly good embeddings for different topologies, they are hard to parallelize and do not scale [12; 20; 6]. Moreover, except for the work on hypercube embeddings [8], most techniques proposed for task mapping on parallel systems have neglected the progress in the area of embeddings of two-dimensional grids. Overall, existing techniques are either not suitable for BG/L or cover only very limited cases. In this paper, we describe existing and newly developed grid embedding techniques, along with integration techniques, to generate efficient mappings of parallel processes/tasks onto up-to-three-dimensional physical topologies, which indirectly optimizes the nearest-neighbor communications of parallel programs. The embedding techniques we use and explore take constant time in parallel and are therefore scalable. Our integration techniques cover rather general cases (e.g.,
not limited to grids/tori with dimension sizes that are powers of 2) with small dilation costs. The topology mapping library based on these techniques has been integrated into the BG/L MPI library to support the MPI Cartesian topology [13]. Finally, we present comprehensive experiments against popular parallel benchmarks and real applications. With help from MPI tracing tools, our empirical results demonstrate significant performance improvements for point-to-point communication when using the process-to-processor mappings generated by our library. This paper makes the following contributions:
- We present the design and integration of a comprehensive topology mapping library for scalable embeddings of three-dimensional grids/tori. Besides an intensive exploration of the latest developments in the area of grid embedding, we describe extensions for embeddings of three-dimensional grids/tori.
- Our topology mapping library provides efficient support for the MPI virtual topology functions. The computation of an MPI virtual topology on each processor takes constant time, and therefore the mapping process is scalable.
- Quantitatively, we demonstrate that our topology mapping library is effective in improving the communication performance of parallel applications running on large numbers of processors (our experiments ran on up to 4,096 processors).
The rest of the paper is organized as follows: Sec. 2 gives a brief explanation of embedding techniques for up-to-two-dimensional grids. Based on the existing techniques, we present a set of embedding operations and corresponding predicates for their selection. Sec. 3 presents embedding operations for 3D grids/tori. We further present the procedure for integrating the various embedding techniques into a general and powerful library. Sec. 4 describes implementation issues in the support of topology mapping for BG/L systems. Sec. 5 presents an extensive empirical study to demonstrate the effectiveness of our topology mapping library. Finally, Sec.
6 discusses related work and Sec. 7 concludes the paper.

2 Basic Grid Embeddings

In this section, we describe techniques for embedding 1D or 2D guest Cartesian topologies into 2D host Cartesian topologies (grids or tori). We describe the operations and predicates we defined for their integration. In the next section, we describe how we utilize these operations for embeddings of 3D Cartesian topologies.

2.1 1D Embeddings of Rings

For the case of embedding a ring into a 1D mesh (aka a line), [22] describes a method that embeds the first half of the nodes of the guest ring into the host line in the same numbering direction, with each edge dilated to 2. It then maps the second half of the nodes of the guest ring onto the host line in the reverse numbering direction, still with edges dilated to 2. In this paper, we refer to this method as ring-wrapping. Figure 1(a) shows an example of embedding a ring of size 7 into a line of size 8 using ring-wrapping. When the given host graph is a ring of slightly greater size than the guest ring, a method we refer to as ring-scattering can be used [15]. It simply stretches some edges of the guest ring and maps the nodes onto the host ring so that the ring links of the host graph can be utilized. Figure 1(b) shows an example of embedding a ring of size 5 into a ring of size 8 using ring-scattering. However, when the size of the host ring is more than twice that of the guest ring, the dilation of ring-scattering exceeds 2; in this case, ring-wrapping gives a better embedding. In our later development of 3D embeddings, we often switch between these two basic ring embedding methods. In addition, because ring-scattering introduces an expansion greater than 1 while the expansion of ring-wrapping is always 1, we find it convenient to use ring-wrapping most of the time.
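Both ring embeddings reduce to simple closed-form index maps; a minimal sketch of the two methods (our own formulation of the techniques described above, not code from the paper):

```python
def ring_wrap(i, n):
    # Ring-wrapping: embed node i of a ring of size n into a line of >= n nodes.
    # The first half runs forward on even positions; the second half returns
    # on odd positions, so every ring edge is dilated to at most 2.
    half = (n + 1) // 2
    return 2 * i if i < half else 2 * (n - 1 - i) + 1

def ring_scatter(i, n, m):
    # Ring-scattering: stretch a ring of size n over a host ring of size m,
    # n <= m <= 2n, so the host's wrap-around link is still usable.
    return (i * m) // n
```

For example, `ring_wrap(i, 7)` places the 7-node ring on line positions 0, 2, 4, 6, 5, 3, 1, matching Figure 1(a)'s dilation-2 pattern.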
2.2 Embeddings of Lines and Rings into Higher-Dimensional Topologies

The methods for embedding lines (aka pipes in [22]) and rings into 2D or 3D topologies are fairly intuitive. They amount to a naive space filling, i.e., folding (aka wrapping) the line in a 2D or 3D space without stretching any of its edges. Figure 1(c) and (d) show two examples of folding a line onto 2D and 3D grids. Because the dilation factor of these embeddings is 1, the methods for embedding lines into 2D grids and into tori are essentially the same. While the embedding of a line into a 2D or 3D grid can always achieve dilation factor 1, the embedding of a ring into a 2D or 3D grid may not. Specifically, when the size of the guest ring is an odd number, exactly one guest edge is dilated to 2 hops [22]. Figure 1(e) shows an example of this case.

[Figure 1: Existing Methods for Grid Embedding — (a) Ring-wrapping, (b) Ring-scattering, (c) Embed a line to a 2D grid, (d) Embed a line to a 3D grid, (e) Embed a ring to a 2D grid, (f) Fold a 12x3 grid to a 6x6 grid, (g) 90-degree turn]

We find the basic technique of folding a ring useful because 2D rectangular grids with very small aspect ratios (length of the shorter side / length of the longer side) and 3D grids with one dimension much greater than the others can be treated as a 1D line or ring. We will discuss this further in the next section. In addition, when embedding a ring into a 2D/3D torus, the torus edges of the host grid are not needed.

2.3 Fold Embeddings among 2D Topologies

A well-known method for embedding 2D grids with small aspect ratio into grids with large aspect ratio is folding, first introduced in [3]. Figure 1(f) shows an example of embedding a 12 × 3 rectangular grid into a 6 × 6 square grid. Under the folding process, a row of the guest grid (e.g., row 2) goes along row 2 until it reaches (2,5). Then row 2 follows a diagonal-vertical-diagonal folding maneuver. Note that the diagonal edges in the figure do not exist in the host grid and must be replaced with a pair of edges in the X and Y directions. As a result, the maneuver keeps the dilation cost under 2. The key to the method is the embedding scheme for performing the 180-degree turn (dark nodes in the figure) with dilation cost under 2. Recently, a method for performing a 90-degree turn with dilation cost under 2 was introduced in [11] (depicted in Figure 1(g)). By integrating the 180-degree turn and the 90-degree turn, a 2D torus with small aspect ratio can be folded onto a 2D grid with large aspect ratio in a fashion similar to the ring folding introduced above. Figure 2 gives an intuitive example of such integration. Note that, to facilitate a massively parallel implementation of the embeddings, we have inverted the projection functions of the 90-degree turn and the 180-degree turn. Due to space limitations, we do not include the projection functions in this paper.

[Figure 2: Embed 3 × 36 grid to 12 × 12 grid]

2.4 Matrix-Based Embedding

Matrix-based embeddings [21] are widely used to embed among 2D grids with relatively close aspect ratios. The example given in Figure 3 shows an embedding and the corresponding embedding matrix for mapping a 2 × 32 grid into a 5 × 13 grid. In the example, the dilation of the embedding (3) is very close to the average dilation across all the edges of the guest grid (2.5), which implies that most of the guest edges are stretched. On the other hand, the average dilation cost of folding-based embeddings reaches 2 only in the areas where the turn maneuvers are applied; the rest of the embedding has an average dilation cost of 1. For this reason, our embedding selection process presented in later sections gives higher priority to folding-based methods over matrix-based methods.
That is, we first try folding-based methods when certain criteria are satisfied. When a folding-based method fails to obtain an embedding, we fall back to the matrix-based method.

[Figure 3: Embed 2 × 32 grid to 5 × 13 grid]

Nevertheless, the matrix-based method is very general and can efficiently embed a 2D guest grid (with dimensions A × B) into the ideal sub-grid of the host grid. Here, for a given host grid with dimensions X × Y, the ideal grid is defined as X × Y′, where Y′ = ⌈A × B / X⌉. Therefore, for embeddings among 2D grids, we use it as the default method. In addition, when the guest and host grids are both tori, and the relative orientation of the guest graph is maintained, matrix-based embedding can utilize the torus links of the host grid. The core of the matrix-based embedding is the generation of the embedding matrix, much of which is sequential because of the dependences between contiguous cells of the matrix. Although the methods described in [21] enable parallel generation of the matrix, the generated matrix computes the coordinates in the host grid from the coordinates in the guest grid. What we need is the inverse function, i.e., each physical processor computes its own logical coordinates. Following a procedure similar to that described in [21], we have derived the inverse functions to enable scalable computation of the embedding in the MPI virtual topology library.

2.5 Summary of Embeddings into 2D Topologies

We have derived a number of embedding operators to embed 1D or 2D Cartesian topologies into 2D grids or tori. The operations we defined and later use in 3D embeddings are summarized, along with brief descriptions, in Table 1. Many of the operations are directly adopted from existing techniques; others are extended or developed from existing ones to support scalable parallel embeddings.
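The inversion idea (each host node computing its own logical coordinates in constant time) can be illustrated on the simpler ring-wrapping operation; the actual matrix inverse functions of the paper are not given in the text, so this is a hypothetical sketch of the same principle:

```python
def ring_wrap(i, n):
    # Forward projection: guest ring node -> line position (dilation <= 2).
    half = (n + 1) // 2
    return 2 * i if i < half else 2 * (n - 1 - i) + 1

def ring_wrap_inverse(p, n):
    # Inverse projection: a host node at position p computes, locally and in
    # O(1), which guest node it hosts. This locality is what makes the
    # parallel computation of the embedding scalable.
    return p // 2 if p % 2 == 0 else n - 1 - (p - 1) // 2
```

With a closed-form inverse, no processor ever needs the full mapping table; each MPI task derives only its own coordinates.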
In the context of this paper, we use the following terminology:
- G represents the guest grid/torus.
- H represents the host grid/torus.
- |G| and |H| represent the sizes of G and H.
- A, B, and C are the sorted dimensions of the guest grid/torus, satisfying A ≤ B ≤ C.
- A′, B′, and C′ each represent one dimension of the guest grid/torus. When used together, they represent distinct dimensions of the guest grid/torus.
- X, Y, and Z are the sorted dimensions of the host grid/torus, satisfying X ≤ Y ≤ Z.
- X′, Y′, and Z′ each represent one dimension of the host grid/torus. When used together, they represent distinct dimensions of the host grid/torus.
- AB, A′B′, XY, and X′Y′ are occasionally used to represent a 2D grid/torus or a 2D sub-topology of a higher-dimensional grid/torus, where the meanings associated with ′ in A′, B′, X′, and Y′ are consistent with the above definitions.
- A × B, A′ × B′, X × Y, and X′ × Y′ represent the sizes of the corresponding grids/tori or sub-topologies.

Given that each of the above embedding methods for 2D grids works well in certain cases, an important step in integrating them into an effective library is to define the conditions and the procedure for their application. Figure 4 specifies the procedure we use: a sequence of steps, executed in order, for embedding into 2D grids/tori. Due to space restrictions, sub-steps such as matrix-based embeddings into tori are omitted. Note that because the procedure is executed in sequential order, the conditions of later steps imply that the previous steps were not performed or did not return a valid embedding. While most conditions or predicates are straightforward, some deserve explanation. In step 3, because RFold2D needs at least four turns for the start and end points of the B dimension (the larger dimension of the guest grid) to meet, the condition A < B/4 is necessary to embed at least the four 90-degree turns.
On the other hand, for the case of MFold2D, since at least one 180-degree turn is needed, the condition A < B/2 is specified. Steps 4 and 5 essentially apply the Compress() operation to cover the general cases of 2D grid embedding.

3 Embeddings of 3D Cartesian Topologies

In this section, we concentrate on embedding guest grids/tori with 1, 2, or 3 dimensions into 3D grids or tori. In addition to the basic strategy, we also present how we guard the use of the basic strategy to increase the chance of a successful mapping. The simplest embedding, called Sit3D(A,B,X,Y) in this paper, simply applies Sit1D(A,X), Sit1D(B,Y), and Sit1D(C,Z) to embed the three dimensions of the guest grid/torus into the three dimensions of the host grid/torus independently.

Table 1: Methods/Operators for Embedding into 2D Grids/Tori
(Operator | Description | Reference)
Guest topology is 1D:
- Sit1D(A,X) | map the nodes of a line of size A to the first A nodes of X | naive
- Wrap(A) | embed a ring onto a line with A nodes | [22]
- Scat(A,X) | scatter/stretch a ring onto a ring with X nodes | [15]
- RFold1D(A,X) | fold a ring onto a 2D grid/torus, starting parallel to Y | [22]
- LFold1D(B,X) | fold a line onto a 2D grid/torus, starting parallel to Y | [22]
Guest topology is 2D:
- Sit2D(A,X) | Sit1D(A,X) and Sit1D(B,Y) | naive
- Compress(B,Y) | matrix embedding of A × B onto an ideal grid X′ × Y with B along Y (when B < Y) | [21; 22]
- Stretch(B,Y) | matrix embedding of A × B onto an ideal grid X′ × Y with B along Y (when B > Y) | extended from [21; 22]
- MFold2D(B,Y) | fold rectangular mesh B × A onto mesh X × Y with B along Y | [3]
- RFold2D(B,Y) | fold rectangular torus B × A onto mesh X × Y with B along Y | extended from [22; 3; 11]

(Step | Conditions | Embedding Method)
1 | IF( A ≤ B ≤ X ≤ Y ) | Sit2D(B,X)
2 | IF( A ≤ X < B ≤ Y ) | Sit2D(A,X)
3 | IF( (A ≤ X ≤ Y ≤ B) ∧ (A ≤ X/2 ∨ A ≤ Y/2) ∧ Y ≤ B/2 ) | try RFold2D(B,*) IF( B has torus link ∧ A < B/4 ); try MFold2D(B,*) IF( B has no torus link ∧ A < B/2 )
4 | IF( A ≤ X ≤ Y ≤ B ) | Compress(B,Y) IF( B/Y ≤ X/A ); Stretch(A,X) IF( B/Y > X/A )
5 | IF( X < A ≤ B < Y ) | Compress(A,X) or Stretch(B,Y)
Figure 4: The procedure for selecting 2D embedding methods

In the following sub-sections, we describe the more complicated embedding methods we developed. Later in this section, we summarize the described techniques and specify conditions/predicates for their application.

3.1 3D Embeddings of Pipes and Rings

The embeddings described in this sub-section are extended from the embeddings of 1D rings into 3D topologies. Specifically, for a given 2D or 3D guest grid/torus shaped like a rectangle with low aspect ratio or like a thin pipe, we treat it as a 1D line/ring and fold/wrap it in the 3D host grid/torus. The projection function is extended from that for embeddings of 1D lines/rings into 3D grids. Figure 5(a–c) shows some representative candidate guest grids considered for such embeddings. Because the method embeds guest grids/tori whose shapes are close to lines or rings, the technique is very effective. Figure 5(d) shows an example of embedding a 24x3x2 grid into a 6x6x4 host grid. In the example, three 2D-fold operations are performed; the process is similar to folding or wrapping a ring of size 8 into a 2x2x2 grid. The worst-stretched edges come from the 90-degree-turn and 180-degree-turn maneuvers. Each such maneuver introduces its dilation cost within a single 2D sub-grid, with dilation factor 2. The folding maneuvers applied in different 2D sub-grids do not re-embed each other's dilated edges; therefore, the dilation cost of the whole embedding is 2. In terms of the detailed embedding process, similar to folding a 1D line/ring into a 3D graph, the various turning points are determined first; then a turning method is assigned to each turning point. For instance, for the case of embedding a 3D torus (in the shape of a thin 3D circle) into a 3D cube, four 90-degree-turn maneuvers may be needed to bring the end of the 3D ring close to its start.
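The dilation-1 projection of a line into a 3D grid that underlies these pipe foldings can be sketched as a snake-order (boustrophedon) index map; this is our own formulation of the space-filling idea, not the paper's exact projection function:

```python
def fold_line_3d(i, X, Y, Z):
    # Snake embedding of node i of a line of length X*Y*Z into an X x Y x Z
    # grid: the Z direction reverses on every run, and the Y direction
    # reverses on every X layer, so consecutive line nodes always land on
    # adjacent grid nodes (dilation 1).
    q, z_off = divmod(i, Z)   # q = which Z-run, z_off = offset within it
    p, y_off = divmod(q, Y)   # p = which X layer, y_off = row within it
    z = z_off if q % 2 == 0 else Z - 1 - z_off
    y = y_off if p % 2 == 0 else Y - 1 - y_off
    return (p, y, z)
```

The turn maneuvers of the paper extend this scheme to rings and thin tori, at the cost of dilation 2 at the turning points.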
In addition, we found that the dilation costs and the average dilation costs are the same for embeddings of 3D rings into 3D grids/tori. However, utilizing the torus links during embedding helps the parallel tasks use the torus links in a better-organized manner and may thereby yield slightly better performance.

[Figure 5: 3D Ring Folding — (a) 2D candidate for 3D ring folding, (b) 3D pipe, (c) 3D ring, (d) 3D ring folding]

3.2 Paper Folding

An intuitive way to map a two-dimensional grid into a three-dimensional grid is to follow the process of folding a paper into multiple layers. The idea can be used for embeddings of 2D grids, and of 3D grids shaped like a thin panel, into near-cube 3D host grids/tori. Figure 6(a) and (b) give an example of embedding a 10x9 2D grid into a 5x3x6 3D grid with dilation factor 5. This naive 3D paper folding can be improved by utilizing the 2D grid folding techniques described in the previous section. We call this method PaperFold. Figure 6(c) illustrates the folding scheme and shows that the dilation factor drops to 2. Exploring the folding idea further, when the guest grid is a 2D/3D torus, the torus edges can be maintained by applying ring folding in the 2D plane defined by the dimensions excluding the dimension whose shape and size are not changed (the X dimension in Figure 6(a)).

[Figure 6: PaperFold — (a) Fold the Y dimension of a 2D grid onto the Z dimension, (b) Naive 3D-to-3D folding, (c) Dilation-2 3D-to-3D folding]

The 3D fold embedding of a 2D guest grid, or of a 3D guest grid with a panel shape, can always be decomposed into two steps. To embed a 3D guest grid/torus A × B × C with A < B < C into a 3D host grid X × Y × Z with X < Y < Z, the general condition for paper folding is A′ ≥ X′ ∧ B′ > Y′ ∧ C′ < Z′/2. The procedure is to first fold dimension B′ onto C′ and then fold A′ onto C′, as shown in Figure 6(c). One observation about applying folding to 3D embeddings is that each folding introduces dilation cost 2, and the dilation costs accumulate multiplicatively. For most 3D grid embedding problems (embedding a thin 3D grid into a near-square 3D grid), we can simply apply folding twice, and the mapping dilation cost is bounded by 4. Often, applications require 2D square grids whose two dimension sizes are not divisible by any of the sizes of the three dimensions of the host grid. In these cases, if we simply perform PaperFold, the expansion factor of the embedding is fairly large. For instance, in our study of running NAS BT on a partition with 512 nodes organized into an 8x8x8 torus, the guest topology is a 22x22 2D square grid, because 484 is the maximal square number smaller than 512. To embed such 2D grids into a 3D grid, we first find an intermediate 2D grid I. The requirements on I are: (a) the 2D guest grid can be embedded into I with minimal expansion, and (b) the size of one of the dimensions of I is a multiple of one of the dimensions of the 3D host grid. Because the dilation cost of embedding a 2D grid into a 3D grid using PaperFold is 2, the final dilation of the above two-step procedure is twice the dilation of the embedding from the guest grid to the intermediate grid, which is usually bounded.

3.3 General 3D-to-3D Embedding

Given the relatively non-trivial search space for a reasonable application of the many basic embedding techniques to 3D embedding, we developed an approach to deal with the general cases.
Our approach is composed of two consecutive 2D-to-2D embeddings:
1. Map a 2D sub-topology of the guest grid to its ideal 2D grid (a 2D sub-grid of a 2D sub-topology of the host grid); assume this embeds A × B to X × Y′, where Y′ ≤ Y.
2. Map Y′ × C onto Y × Z.

In our embedding process for 3D grids, we always try to find embeddings with 3D folding and PaperFold first; when an embedding cannot be found by applying these two foldings, we apply the general approach. Therefore, the grids that end up using the general technique are likely to have shapes and aspect ratios similar to those of the host grids. In addition, both steps are likely to end up using matrix-based embedding (the Compress() operation described in the previous section), and we expect the compression ratio for both Compress() steps to be about three or smaller. Note that the two steps may dilate the same guest edges, and the upper bound on the final dilation of the two steps is the product of the dilation factors of the two steps.

Table 2: Methods/Operators for Embedding into 3D Grids/Tori
(Operator | Description | Reference)
Guest topology is 3D:
- Sit3D(A,B,X,Y) | Sit1D(A,X), Sit1D(B,Y), and Sit1D(C,Z) | naive
- MFold3D(C,Z) | treat a thin 3D structure as a 1D line and fold in 3D; also used for embedding a line or a grid with small aspect ratio | 3D extension of [22]
- RFold3D(C,Z) | treat a thin-circle 3D structure as a 1D ring and fold in 3D; also used for embedding a ring or a torus with small aspect ratio | 3D extension of [22; 11]
- PaperFold(A,X) | treat a thin 3D structure as a 2D paper and fold it into 3D; the process is the same when the guest grid is 2D | combination of [5; 3; 21]
- General3D | embedding of common cases to which the above methods cannot be applied | developed

(Step | Conditions | Embedding Method)
1 | IF( A′ ≤ X′ ∧ B′ ≤ Y′ ∧ C′ ≤ Z′ ) | Sit3D
2 | IF( A′ ≡ X′ ∧ B′ ≡ Y′ ) | reduced to embedding into a 1D grid
3 | IF( A′ ≡ X′ ) | reduced to embedding into a 2D grid
4 | IF( A′ ≥ X′ ∧ B′ ≥ Y′ ∧ C′ ≤ Z′/2 ∧ (A′/2 > C′ ∨ B′/2 > C′) ) | try PaperFold(C′,Z′)
5 | IF( A′ × B′ ≤ X′ × Y′/4 ∨ (A′ ≤ X′/2 ∧ B′ ≤ Y′ ∧ C ≥ Z) ) | try Fold3D(C′,Z′)
6 | IF( A′ × B′ ≤ X′ × Y′ ) | General3D(A′,B′,X′,Y′)
Figure 7: The procedure for selecting 3D embedding methods

3.4 Summary of Embeddings into 3D Grids/Tori

As a summary of our effort on embeddings into 3D grids/tori, we list the operations and the corresponding techniques in Table 2. As in Sec. 2.5, Figure 7 specifies our procedure for selecting and applying the different embedding operations when embedding 3D or lower-dimensional grids/tori into 3D host grids/tori. The steps are executed in sequential order, and therefore the conditions of later steps imply that the previous steps were not performed. In Figure 7, steps 2 and 3 simply deal with special cases that can be reduced to embeddings among 1D and 2D grids/tori. The condition for PaperFold (step 4) is rather loose; it makes sure that either A′ or B′ can be folded onto C′. Finding out whether the condition is ever met requires permuting both A, B, C and X, Y, Z. In step 5, RFold3D is applied when the longer dimension of the guest grid has a torus link.

4 Implementation Issues

Currently, besides supporting linearization-based process numbering/mapping, BG/L MPI allows a user to provide a mapping file that explicitly specifies a list of torus coordinates for all MPI tasks [5]. This simple approach allows users to control the task placement of an application at launch time. While having users dictate mappings simplifies the BG/L control system and the MPI implementation, it is not portable and adds an additional task for the user, i.e., the generation of different mapping files for different BG/L partitions. In the end, an application-specific program for mapping generation is needed to run on different BG/L partitions. We have implemented two interfaces for our topology mapping library for BG/L.
First, we support a standalone interface that allows users to generate a BG/L MPI mapping file for a given pair of guest and host grids/tori. Second, we have integrated most of the functionality into the BG/L MPI library to support the MPI Cartesian topology functions. In the rest of this section, we briefly discuss implementation issues in the support of the MPI Cartesian topology functions and of the BG/L virtual node operation mode. The MPI standard defines topology functions that provide a way to specify task layout at run time. They essentially provide a portable way to adapt MPI applications to the communication architecture of the target hardware. MPI specifies two types of virtual topologies: the graph topology, describing a graph with irregular connectivity, and the Cartesian topology, describing a multi-dimensional grid. Because the communication network of BG/L is a 3D torus, we have concentrated on the support for the Cartesian topology. An MPI virtual Cartesian topology is created by MPI_Cart_create(), whose inputs describe a preferred Cartesian topology. Additional functions are defined for a process to query a communicator for information related to the communicator's Cartesian topology, e.g., the rank of any neighbor in the virtual grid, the dimensionality of the virtual grid, etc. With this set of functions, an MPI application can map its tasks dynamically and transparently, and it is the MPI system's responsibility to realize an efficient mapping from the application's requested grid onto the underlying physical interconnect. We have plugged our topology mapping library into the system-dependent layer of the BG/L MPI implementation (essentially an optimized port of MPICH2 [19]). Specifically, when the application calls MPI_Cart_create() or MPI_Cart_map(), the mapping functions in our library are invoked first. If the library cannot compute a valid mapping/embedding, the default (linearization-based) implementation is called.
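The invoke-then-fall-back integration can be sketched as follows. All names here are hypothetical stand-ins rather than the actual BG/L MPI internals, and the "library" is reduced to the trivial Sit case for illustration:

```python
from itertools import product

def linearize(guest_dims, host_dims):
    # Default mapping: guest rank k -> host node k in row-major order.
    hosts = list(product(*[range(d) for d in host_dims]))
    n_guest = 1
    for d in guest_dims:
        n_guest *= d
    return {k: hosts[k] for k in range(n_guest)}

def try_library_embedding(guest_dims, host_dims):
    # Stand-in for the mapping library: handle only the trivial Sit case
    # where every guest dimension fits the matching host dimension.
    if len(guest_dims) == len(host_dims) and all(
            g <= h for g, h in zip(guest_dims, host_dims)):
        guests = list(product(*[range(d) for d in guest_dims]))
        return {k: g for k, g in enumerate(guests)}
    return None  # no valid embedding found

def cart_map(guest_dims, host_dims):
    # Try the topology-mapping library first; fall back to linearization,
    # mirroring the MPI_Cart_create()/MPI_Cart_map() integration above.
    emb = try_library_embedding(guest_dims, host_dims)
    return emb if emb is not None else linearize(guest_dims, host_dims)
```

The key property is that the fallback always produces a valid (if unoptimized) mapping, so the library can refuse hard cases without breaking the MPI semantics.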
BG/L supports two operation modes: co-processor mode and virtual node mode [5]. In co-processor mode, a single process uses both processors of a compute node. In virtual node mode, two processes, each using half of the memory of the node, run on one compute node, with each process bound to one processor. Virtual node mode doubles the number of tasks in the BG/L message layer; it also introduces a refinement in the addressing of tasks. Instead of being addressed with a triplet (x,y,z) denoting the 3D physical torus coordinates, tasks are addressed with quadruplets (x,y,z,t), where t is the processor ID (0 or 1) within a compute node. The additional dimension, consisting of the two CPUs of one compute node, is called the T dimension. In our implementation, to avoid complicating the problem, we did not treat the issue as a problem of embedding into 4D tori, because the T dimension, compared with the other three regular torus dimensions, is too small. Instead, we always disregard the T dimension and solve the embeddings of the lower-dimensional topologies. The T dimension is then put into the inner-most (fastest-changing) dimension of the embedded virtual topologies.

5 Empirical Evaluation

In this section, we present the evaluation of our topology mapping library against a collection of widely used benchmark programs and real applications on a large-scale Blue Gene/L system. We show that, for a large number of realistic cases, the mappings generated by our library not only have much lower dilation factors, but also achieve close-to-constant hop counts for the messages exchanged among processes. As a result, communication costs are significantly reduced and the scalability of applications is largely improved.

5.1 Experiment Setup

We use the MPI tracing/profiling component of the IBM High Performance Computing Toolkit [1] in our study.
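The T-dimension handling described above can be sketched as follows. This is a simplified illustration: an existing 3D embedding is reused unchanged, and the processor ID t is appended as the innermost, fastest-changing coordinate:

```python
def add_t_dimension(map3d):
    """Extend a 3D embedding to virtual node mode.

    map3d: list of (x, y, z) physical coordinates, one per compute node,
    in task order. Returns 2*len(map3d) quadruplets (x, y, z, t), with the
    T dimension varying fastest, so task pairs 2k and 2k+1 share a node.
    """
    return [coord + (t,) for coord in map3d for t in (0, 1)]

map3d = [(0, 0, 0), (1, 0, 0), (1, 1, 0)]   # a tiny 3-node embedding
vn = add_t_dimension(map3d)
assert vn[0] == (0, 0, 0, 0) and vn[1] == (0, 0, 0, 1)  # same node, t=0/1
assert vn[2] == (1, 0, 0, 0)
assert len(vn) == 6
```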
We use the following two performance metrics, which are dynamically measured by the tracing library:

- Communication Time is the total time a processor spends in MPI communication routines.

- Average Hops is based on the Manhattan distance between two processors. The Manhattan distance of a pair of processors p, q, with physical coordinates (x_p, y_p, z_p) and (x_q, y_q, z_q), is defined as

    Hops(p, q) = |x_p - x_q| + |y_p - y_q| + |z_p - z_q|.

  We define the average hops for all messages sent from a given processor as

    averagehops = (sum_i Hops_i * Bytes_i) / (sum_i Bytes_i),

  where Hops_i is the Manhattan distance between the sending and receiving processors of the i-th message, and Bytes_i is the message size.

The logical concept behind the averagehops metric is to measure, for any given MPI message, the number of hops each byte has to travel. While the metric reflects the hops of the messages actually exchanged between pairs of MPI communication partners, it differs from the dilation factor of an embedding in that it reflects the real communication scenario. For both performance metrics, we record the average values and the maximum values. The maximum values represent the "worst case": the maximal communication time is the communication time of the processor with the longest communication time, and the maximal average hops is that of the processor with the largest average hops.

We used benchmark programs from the NAS Parallel Benchmark Suite (NPB 2.4) and two scientific applications. The NAS NPB suite has been widely used for studies of parallel performance; descriptions of the programs can be found in [9]. In this paper, we include detailed results for four NAS NPB programs (BT, SP, LU, and CG). Among the rest of the NAS NPB programs, FT and EP do not use point-to-point communication primitives, and IS performs point-to-point communication only once, with each node sending one integer value to its right-hand neighbor.
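The two definitions above translate directly into code; the sketch below mirrors them (the message list and byte counts are made-up values for illustration):

```python
def hops(p, q):
    """Manhattan distance between two torus coordinates (no wraparound)."""
    return sum(abs(a - b) for a, b in zip(p, q))

def average_hops(messages):
    """Bytes-weighted average hops over (src, dst, nbytes) messages."""
    total = sum(hops(src, dst) * nbytes for src, dst, nbytes in messages)
    return total / sum(nbytes for _, _, nbytes in messages)

assert hops((0, 0, 0), (1, 2, 3)) == 6

msgs = [((0, 0, 0), (0, 0, 2), 100),   # 2 hops, 100 bytes
        ((0, 0, 0), (1, 0, 0), 300)]   # 1 hop, 300 bytes
assert average_hops(msgs) == (2 * 100 + 1 * 300) / 400  # 1.25
```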
The virtual topology of MG is a 3D near-cube, which can be mapped exactly onto the topology of BG/L partitions with a permutation of the three dimensions ([5] describes how a user can specify such a mapping on BG/L). For all the results obtained on the NAS NPB benchmarks, we used the class D problem sizes, the largest of NPB 2.4.

The applications are SOR and SWEEP3D [23] from the ASCI benchmark programs. SOR is a program for solving the Poisson equation using an iterative red-black SOR method. This code uses a two-dimensional process mesh, where communication is mainly boundary exchange on a static grid with east-west and north-south neighbors. This results in a simple repetitive communication pattern, typical of grid-point codes from a number of fields. SWEEP3D [23] is a simplified benchmark program that solves a neutron transport problem using a pipelined wave-front method and a two-dimensional process mesh. Input parameters determine problem sizes and blocking factors, allowing for a wide range of message sizes and parallel efficiencies.

Table 3: Topology Scenarios of Test Programs

  Guest Topo.      Host Topo.       Dilation / Avg. Dilation
                                    Default Mapping   Optimized Mapping
  NAS BT, SP: 2D square mesh
  256:  16x16      256:  8x8x4       3 / 1.633         2 / 1.038
  484:  22x22      512:  8x8x8       6 / 3.108         4 / 1.620
  1024: 32x32      1024: 8x16x8      5 / 2.661         2 / 1.051
  2025: 45x45      2048: 8x16x16    10 / 5.049         4 / 1.587
  4096: 64x64      4096: 8x16x32     9 / 4.802         2 / 1.025
  NAS LU, CG; SOR: 2D near-square mesh
  256:  16x16      256:  8x8x4       3 / 1.633         2 / 1.038
  512:  32x16      512:  8x8x8       3 / 1.656         2 / 1.055
  1024: 32x32      1024: 8x16x8      5 / 2.661         2 / 1.051
  2048: 64x32      2048: 8x16x16     5 / 2.680         2 / 1.026
  4096: 64x64      4096: 8x16x32     9 / 4.802         2 / 1.025
  SWEEP3D: 2D near-square mesh
  256:  16x16      256:  8x8x4       3 / 1.633         2 / 1.038
  512:  16x32      512:  8x8x8       5 / 2.754         2 / 1.055
  1024: 32x32      1024: 8x16x8      5 / 2.661         2 / 1.051
  2048: 32x64      2048: 8x16x16     9 / 4.768         2 / 1.026
  4096: 64x64      4096: 8x16x32     9 / 4.802         2 / 1.025

Table 3 lists the programs' topology-mapping-related information, the specific mapping scenarios used in our study, and the computed dilation factors for BG/L's default mapping and for the optimized mapping generated by our library. In the guest/host topology columns, we list the total number of processes and the corresponding Cartesian topology. The computed dilation factors show that the edge dilations associated with our optimized mappings are very small (many equal to 2) and much smaller than the dilation factors associated with BG/L's default mapping. In the following subsection, we investigate the performance impact of optimized mappings with low dilation factors on realistic programs.

For the experiments presented in this section, we compiled the programs with the IBM XL compilers on BG/L. For each scenario we studied, we generated a BG/L mapping file using our mapping library, ran the specific program and input case on a BG/L partition matching the host topology described in the scenario, and collected the performance metrics. Note that an evaluation of the support for MPI topology functions is not presented in this paper.
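The default-mapping column of Table 3 can be reproduced for its first row (a 16x16 mesh onto an 8x8x4 torus). The sketch assumes the default linearization enumerates the torus with x varying fastest, and uses wraparound (torus) distances:

```python
X, Y, Z = 8, 8, 4          # host torus (Table 3, 256-node row)
N = 16                     # guest: 16x16 2D mesh

def torus_dist(p, q):
    """Manhattan distance with wraparound on each torus dimension."""
    return sum(min(abs(a - b), s - abs(a - b))
               for a, b, s in zip(p, q, (X, Y, Z)))

def default_map(rank):
    """Linearization-based placement, x varying fastest (assumed order)."""
    return (rank % X, (rank // X) % Y, rank // (X * Y))

# All mesh edges of the 16x16 guest grid, as row-major rank pairs.
edges = [(N * i + j, N * i + j + 1) for i in range(N) for j in range(N - 1)]
edges += [(N * i + j, N * (i + 1) + j) for i in range(N - 1) for j in range(N)]

dists = [torus_dist(default_map(u), default_map(v)) for u, v in edges]
assert max(dists) == 3                               # dilation, as in Table 3
assert abs(sum(dists) / len(dists) - 1.633) < 1e-3   # average dilation
```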
This is primarily because we did not find an application or benchmark program written with the MPI topology routines.

5.2 Results of NAS Benchmarks

Results of the four NAS programs are given in Figure 8. For each program, the left graph shows the measurements of AverageHops and the right graph shows the measurements of communication time. The bars are the averages across all messages and the lines are the "worst cases". The horizontal axis represents the number of processors, which corresponds to the tested mapping scenarios given in Table 3.

The results of BT, SP, and LU are all consistent in the sense that the dilation is significantly reduced (in many cases bounded by 2 hops) by using the mapping produced by our mapping library. As a result, BT and SP benefit from the high-quality mapping and show significant reductions of their communication costs. This is primarily because the two programs have nearest-neighbor communication patterns. In particular, for the cases using BG/L partitions with 512 and 2048 compute nodes, because the programs require the number of processes to be a square number, the mapping problems are to map 22x22 and 45x45 meshes onto the corresponding partitions. The default, linearization-based mapping has no special handling for such cases. One of the worst dilated edges when mapping 22x22 onto 512 nodes with the default mapping is the edge between guest nodes (1,20) and (2,20), where the nodes are mapped to physical nodes (2,5,0) and (0,0,1) and the edge is dilated by 6. In our optimized mapping, the perfect square meshes 22x22 and 45x45 are compressed onto 32x16 and 64x32, which are then folded into 3D grids (Section 3.2), and a dilation of 4 is realized. On the other hand, although the average-hops measurements of LU are very good when using the mapping files we generated, there is little or no improvement in the communication cost. This is because LU involves communications between processes at varying distances. The results of CG are rather counterintuitive.
Specifically, although the estimated dilation of the optimized mapping is only 2, the average communication hops are much higher. This is because in CG, each process P exchanges messages with all processes of the same row that are exactly 2^(k-1) hops away from P, where the hops are in terms of the guest 2D grid. For example, assuming each row has 32 nodes, node 2 of a row communicates with nodes 3, 4, 6, 10, and 18 of the same row. When mapping this dimension onto the host grid using the default, linearization-based mapping with a power-of-two radix, the physical distances of far-apart communication partners may be small. On the other hand, because the optimized mapping tries to optimize the connectivity of the nearest neighbors, the physical distances of the long-distance communication partners may not be optimized. For instance, Figure 9 shows how a row is mapped onto a 2D space whose inner dimension has size 8, using the default mapping and the optimized mapping. As highlighted in Figure 9(a), with the default mapping, node 2 ends up 1 hop away from node 10. In Figure 9(b), with the optimized mapping, the physical distance between node 2 and node 10 is 4.

Figure 8: NAS NPB 2.4 results for (a) BT, (b) SP, (c) LU, and (d) CG (with class D inputs, co-processor mode)

Figure 9: CG communication partners under (a) the default mapping and (b) the optimized mapping

5.3 Application Results

Figure 10 gives the detailed performance results for SOR and SWEEP3D. The results of SOR confirm that for applications with a nearest-neighbor communication pattern, optimized process mapping improves communication performance significantly. Specifically, the SOR results show that not only the AverageHops but also the communication time improves significantly with our mapping library. This is because the communication pattern of SOR consists of "ping-pong" messages among nearest neighbors, the representative communication pattern that can benefit from high-quality topology mapping. Note that the measured average hops for the cases using our mapping library stay flat, and the benefit of using the library becomes more significant as the number of processors increases. For 4096 processors, the average communication time is improved by 42%; for the worst case (i.e., the processor that spends the most time in communication), the communication cost is improved by 25%.

Similar to the results of NAS-LU, the results of SWEEP3D in Figure 10(b) show that although the AverageHops are improved significantly in all cases, the communication cost is not affected. This is because the communication pattern of SWEEP3D is composed of pipelined wavefronts. Although its communications are all among nearest neighbors, the pipeline hides the relatively high latencies between certain neighboring wavefronts containing extended mesh edges. Nevertheless, because the communications of SWEEP3D are among neighbors, the optimized mapping does not introduce performance degradation as in the NAS-CG case.

6 Related Work

The problem of mapping parallel programs onto parallel systems has been studied extensively since the beginning of parallel computing. The problem is essentially equivalent to the graph embedding problem. Nevertheless, for different applications the problem has different constraints. For instance, there are many-to-one and one-to-one mappings. Similarly, some methods develop techniques for mapping data structures onto processors, while others map parallel processes onto processors. In this paper, we concentrated on exploring one-to-one mappings from parallel processes to processors. In terms of solution approaches, a large number of methods exploring graph partitioning and search-based optimization have been developed (with a few examples being [12; 20; 24]).
Following this approach, [6] described a simulated-annealing-based method to explore an application's communication pattern and in turn discover the most beneficial mapping of the application's tasks onto BG/L. The off-line approach introduced in that paper is effective for performance tuning and knowledge discovery for complicated applications. Our work is orthogonal and complementary to theirs: when the logical topology of an application's communications is well defined, our topology mapping library can be used to port the application to various BG/L partitions easily; when the communication pattern of an application is irregular or dynamic (like that of NAS-CG), their tool should be used to uncover a proper mapping onto BG/L topologies.

Figure 10: Application results for (a) SOR and (b) SWEEP3D (co-processor mode)

Another approach is to embed guest graphs into host graphs via projection functions. These methods usually develop solutions for special cases with low complexity. As discussed in Section 2, most of the effective results on grid embeddings have been obtained for embeddings into two-dimensional grids. In the rest of this section, we discuss related efforts on embeddings into three-dimensional grids/tori, embeddings of tori, support for MPI topology functions, and topology-mapping work for BG/L.

PARIX mapping [22] is a comprehensive topology mapping library developed with a similar approach to ours, i.e., exploring effective techniques for 2D grid embeddings. In this sense, our work is similar to theirs. Nevertheless, their integration for embeddings into 3D grids/tori is not as complete. Directly applying their steps, the embedding of a 3D grid into a 3D grid would involve three steps: first, the 3D guest grid is unfolded into a 2D grid along a single dimension; then, a 2D embedding maps this intermediate grid onto another 2D grid that can be folded into the 3D host grid by pipe/ring foldings; finally, the second intermediate grid is folded into 3D. The drawback of this procedure is that the first step (unfolding from 3D to 2D) introduces a large dilation, equal to the size of the smallest dimension.

A Gray code is an ordering of the 2^n n-bit binary numbers such that only one bit changes from one entry to the next. For example, for 8 nodes, the numbering 000, 001, 011, 010, 110, 111, 101, 100 is a Gray code. With its simplicity, Gray-code numbering has been successfully applied for embeddings into hypercube topologies, and researchers have explored its application for embeddings among k-ary n-cube topologies [7; 16]. When embedding a one-dimensional ring onto multi-dimensional grids, Gray-code-based embedding is exactly the ring-folding method mentioned in Section 2. Nevertheless, it is difficult to extend the approach effectively to guest grids with two or more dimensions; when embedding among 2D and/or 3D grids, the associated dilation factors are usually as large as the sizes of the dimensions.

Tao's mapping [16] is composed of ring-wrapping, simple-reduction, and general-reduction techniques for embeddings among high-dimensional meshes and tori. It is based on a ring-wrapping method, similar to the Gray-code methods, that embeds a one-dimensional pipe/ring onto two-dimensional grids. Based on re-factoring the sizes of the dimensions of the guest grid and host grid, they introduced a simple reduction method and a general reduction method to embed among multi-dimensional grids. Their factorization-based techniques impose fairly strong constraints on the relative shapes of the guest and host grids, and additional techniques are needed to complement the solution scope. In addition, their most general technique (general reduction) can introduce large dilation: for the case of mapping into 3D grids/tori, in the worst case its dilation factor equals the size of the smallest dimension.

Kim and Hur [15] proposed an approach for many-to-one embeddings of a multi-dimensional torus onto a host torus with the same number of dimensions. Their approach is based on ring stretching/scattering, which we also use for our torus Sit operations in this paper. Because they concentrated on many-to-one embeddings, their approach does not work well for general one-to-one grid/torus embeddings.

In terms of supporting MPI topology functions, [24] proposed a graph-partitioning-based technique for embedding into a hierarchical communication architecture (i.e., the NEC SX series), and [18] described techniques for embedding into switch-based networks. These techniques are designed for specific systems, whose networks differ from BG/L's.

Topology mapping on BG/L has been studied in [7; 6]. [7] studied the performance impact of process mapping on BG/L at rather small scales (using 128 BG/L compute nodes). The study shows that Gray codes and a one-step paper folding are fairly effective for mapping two-dimensional guest topologies onto BG/L partitions with sizes up to 128 compute nodes (i.e., an 8x8x2 torus). Based on more sophisticated grid-embedding techniques, our approach covers many hard cases. In addition, our integration of existing and novel techniques provides a rather complete solution for systematically mapping applications onto BG/L.

7 Conclusions

To run scalable applications on today's scalable parallel systems with minimal communication overhead, effective and efficient support for mapping parallel tasks onto physical processors is required. This paper describes the design and integration of a comprehensive topology mapping library for mapping MPI processes (parallel tasks) onto physical processors with three-dimensional grid/torus topologies. In developing the topology mapping library presented in this paper, we have not only engaged in an extensive study of the existing practical techniques for grid/graph embeddings, but also explored the design space and integration techniques for embeddings into three-dimensional grids/tori. By providing scalable support for the MPI virtual topology interface, portable MPI applications can benefit from our comprehensive library. The results of our empirical study, using popular benchmarks and real applications with topologies scaling up to 4096 nodes, further show the impact of our topology mapping techniques and library on improving the communication performance of parallel applications.

For future work, we would like to find or co-develop applications that use MPI virtual topology functions for a realistic evaluation of our support for the MPI topology interface. In addition, we would like to look into embeddings of topologies other than grids/tori into BG/L topologies.

Acknowledgement

We would like to acknowledge and thank George Almási, José G. Castaños, and Manish Gupta from the IBM T. J. Watson Research Center for their support; Brian Smith, Charles Archer, and Joseph Ratterman from the IBM Systems Group for discussions and their effort on integrating our library into the BG/L MPI library; and William Gropp for valuable discussions on the support of MPI topology functions.

References

[1] IBM Advanced Computing Technology Center MPI tracer/profiler. URL: http://www.research.ibm.com/actc/projects/mpitracer.shtml.

[2] Adiga, N. R., et al. 2002. An overview of the BlueGene/L supercomputer. In SC2002 – High Performance Networking and Computing.

[3] Aleliunas, R., and Rosenberg, A. L. 1982. On embedding rectangular grids in square grids. IEEE Transactions on Computers 31, 9 (September), 907–913.

[4] Almasi, G., et al. 2001. Cellular supercomputing with system-on-a-chip. In IEEE International Solid-State Circuits Conference (ISSCC).

[5] Almasi, G., Archer, C., Castanos, J. G., Erway, C. C., Heidelberger, P., Martorell, X., Moreira, J. E., Pinnow, K., Ratterman, J., Smeds, N., Steinmacher-Burow, B., Gropp, W., and Toonen, B. 2004. Implementing MPI on the BlueGene/L supercomputer. In Proc. of the Euro-Par Conference.

[6] Bhanot, G., Gara, A., Heidelberger, P., Lawless, E., Sexton, J. C., and Walkup, R. 2005. Optimizing task layout on the Blue Gene/L supercomputer. IBM Journal of Research and Development 49, 2 (March), 489–500.

[7] Smith, B. E., and Bode, B. 2005. Performance effects of node mappings on the IBM Blue Gene/L machine. In Euro-Par.

[8] Chan, M. J. 1996. Dilation-5 embedding of 3-dimensional grids into hypercubes. Journal of Parallel and Distributed Computing 33, 1 (February).

[9] Van der Wijngaart, R. F. 2002. NAS Parallel Benchmarks version 2.4. Tech. Rep. NAS-02-007, NASA Ames Research Center, Oct.

[10] Ellis, J. A. 1991. Embedding rectangular grids into square grids. IEEE Transactions on Computers 40, 1 (Jan.), 46–52.

[11] Ellis, J. A. 1996. Embedding grids into grids: Techniques for large compression ratios. Networks 27, 1–17.

[12] Erçal, F., Ramanujam, J., and Sadayappan, P. 1990. Task allocation onto a hypercube by recursive mincut bipartitioning. J. Parallel Distrib. Comput. 10, 1, 35–44.

[13] MPI Forum. 1997. MPI: A message-passing interface standard. URL: http://www.mpi-forum.org/docs/mpi-11-html/mpi-report.html, August.

[14] Hatazaki, T. 1998. Rank reordering strategy for MPI topology creation functions. In Proceedings of the 5th EuroPVM/MPI Conference, Springer-Verlag, Lecture Notes in Computer Science.

[15] Kim, S.-Y., and Hur, J. 1999. An approach for torus embedding. In Proceedings of the 1999 International Workshop on Parallel Processing, 301–306.

[16] Ma, E., and Tao, L. 1993. Embeddings among meshes and tori. Journal of Parallel and Distributed Computing 18, 44–55.

[17] Melhem, R. G., and Hwang, G.-Y. 1990. Embedding rectangular grids into square grids with dilation two. IEEE Transactions on Computers 39, 12 (December), 1446–1455.

[18] Moh, S., Yu, C., Han, D., Youn, H. Y., and Lee, B. 2001. Mapping strategies for switch-based cluster systems of irregular topology. In 8th IEEE International Conference on Parallel and Distributed Systems.

[19] The MPICH and MPICH2 homepage. URL: http://www-unix.mcs.anl.gov/mpi/mpich.

[20] Ou, C.-W., Ranka, S., and Fox, G. 1996. Fast and parallel mapping algorithms for irregular problems. J. Supercomput. 10, 2, 119–140.

[21] Röttger, M., and Schroeder, U. 1998. Efficient embeddings of grids into grids. In The 24th International Workshop on Graph-Theoretic Concepts in Computer Science, 257–271.

[22] Röttger, M., Schroeder, U., and Simon, J. 1993. Virtual topology library for PARIX. Tech. Rep. TR-005-93, Paderborn Center for Parallel Computing, University of Paderborn, Germany, November.

[23] The ASCI SWEEP3D benchmark code. URL: http://www.llnl.gov/asci_benchmarks/scsi/limited/sweep3d/asci_sweep3d.html.

[24] Träff, J. L. 2002. Implementing the MPI process topology mechanism. In Supercomputing, 1–14.