Lecture Notes in Computer Science 1573
José M.L.M. Palma, Jack Dongarra, Vicente Hernandez (Eds.): Vector and Parallel Processing — VECPAR'98, Third International Conference, Porto, Portugal, June 1998, Selected Papers and Invited Talks

A Systolic Algorithm for the Factorisation of Matrices Arising in the Field of Hydrodynamics

S.-G. Seo¹, M.J. Downie¹, G.E. Hearn¹ and C. Phillips²

¹Department of Marine Technology, University of Newcastle upon Tyne, Newcastle upon Tyne, NE1 7RU, UK
²Department of Computing Science, University of Newcastle upon Tyne, Newcastle upon Tyne, NE1 7RU, UK

Abstract. Systolic algorithms often present an attractive parallel programming paradigm. However, the unavailability of specialised hardware for efficient implementation means that such algorithms are often dismissed as being of theoretical interest only. In this paper we report on experience with implementing a systolic algorithm for matrix factorisation and present a modified version that is expected to lead to acceptable performance on a distributed memory multicomputer. The origin of the problem that generates the full complex matrix in the first place is in the field of hydrodynamics.

1. Forming the Linear System of Equations

The efficient, safe and economic design of large floating offshore structures and vessels requires a knowledge of how they respond in an often hostile wave environment [1]. Prediction of the hydrodynamic forces experienced by them and their resulting responses, which occur with six rigid degrees of freedom, involves the use of complex mathematical models leading to the implementation of computationally demanding software. The solution to such problems can be formulated in terms of a velocity potential involving an integral expression that can be thought of as representing a distribution of sources over the wetted surface of the body.
J.M.L.M. Palma, J. Dongarra and V. Hernandez (Eds.): VECPAR'98, LNCS 1573, pp. 355-364, 1999. © Springer-Verlag Berlin Heidelberg 1999

In most applications there is no closed solution to the problem and it has to be solved numerically using a discretisation procedure in which the surface of the body is represented by a number of panels, or facets. The accuracy of the solution depends on a number of factors, one of which is the resolution of the discretisation. The solution converges as the resolution becomes finer, and complicated geometries can require very large numbers of facets to attain an acceptable solution. In the simplest approach a source is associated with each panel and the interaction of the sources is modelled by a Green function which automatically satisfies the relevant wave boundary conditions. The strength of the sources is determined by satisfying a velocity continuity condition over the mean wetted surface of the body. This is achieved by setting up an influence matrix, A, for the sources based on the Green functions and solving a linear set of algebraic equations in which the unknowns, x, are either the source strengths or the velocity potential values, and the right-hand side, b, is related to the appropriate normal velocities at a representative point on each facet. By construction, the diagonal elements of A represent the influence of the sources on themselves. Since the off-diagonal elements are inversely proportional to the distance between any two sources, they tend to be of increasingly smaller magnitude as we move away from the diagonal. Hence if geometric symmetry is not used and the full matrix A is considered, it is guaranteed to be diagonally dominant. For a given wave frequency, each facet has a separate source contributing to the wave potential representing the incoming and diffracted waves, φ_I and φ_D, and one radiation velocity potential for each degree of freedom of the motion, φ_i: i = 1, 2, ..., 6.
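The diagonal dominance property described above is what later allows factorisation without pivoting. It can be checked mechanically; the following is a minimal illustrative sketch (not the authors' code), valid for real or complex entries:

```python
def is_diagonally_dominant(A):
    """Strict row diagonal dominance: |a_ii| > sum of |a_ij| over j != i,
    for every row i. abs() gives the modulus, so complex entries work too."""
    n = len(A)
    return all(
        abs(A[i][i]) > sum(abs(A[i][j]) for j in range(n) if j != i)
        for i in range(n)
    )
```

For an influence matrix of the kind described, dominance guarantees that the Doolittle factorisation used later can proceed stably without row interchanges.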
When the velocity potentials have been determined, once the source strengths are known, the pressure can be computed at every facet and the resultant forces and moments on the body computed by integrating them over the wetted surface. The forces can then be introduced into the equations of motion and the response of the vessel at the given wave frequency calculated. The complexity of the mathematical form of the Green functions, and the requirement to refine the discretisation of the wetted surfaces within practical offshore analyses, significantly increase the memory and computational load associated with the formulation of the required fluid-structure interactions. Similarly, the solution of the very large dense square matrix equations formulated in terms of complex variables requires considerable effort to provide the complex variable solution. In some problems 5,000 panels might be required using constant plane elements, or 5,000 nodes using higher order boundary elements, leading to a matrix with 25,000,000 locations. Since double precision arithmetic is required, and the numbers are complex, this will require memory of the order of 4 gigabytes. The number of operations for direct inversion or solution by iteration is large, of the order of n³; e.g. 5,000 elements requires 125,000 × 10⁶ operations. Furthermore, the sea-state for a particular wave environment has a continuous frequency spectrum which can be modelled as the sum of a number of regular waves of different frequencies with random phase angles and amplitudes determined by the nature of the spectrum. Determination of vessel responses in a realistic sea-state requires the solution of the boundary integral problem described above over a range of discrete frequencies sufficiently large to encompass the region in which the wave energy of the irregular seas is concentrated.
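The figures quoted above for the 5,000-panel example are simple back-of-envelope quantities and can be reproduced directly:

```python
n = 5_000          # panels (or nodes) in the example above
entries = n * n    # matrix locations: 25,000,000
ops = n ** 3       # order-n^3 work for a direct dense solution

print(entries)     # 25000000
print(ops)         # 125000000000, i.e. 125,000 x 10^6 operations
```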
In view of the size and complexity of such problems, and the importance of being able to treat them, it is essential to develop methods to speed up their formulation and solution times. One possible means of achieving this is through the use of parallel computers [2] [3]. Numerical techniques for solving linear systems of equations neatly divide into direct methods, which involve some transformation of A, and iterative methods, which do not change A but typically involve a matrix-vector multiplication involving A at each iteration. (Iterative techniques, such as those based on the conjugate gradient method, often employ preconditioning, which requires the formulation of a pseudo-transformation of A in order to improve the convergence characteristics.) Iterative techniques are normally restricted to the case that A is sparse, when the reduction in operations count and space requirements caused by fill-in merits their use. For our application A is a full, square matrix, and the use of a direct method of solution based on elimination techniques is therefore the most attractive proposition. The method of LU-decomposition has been chosen because in this scheme only one factorisation is required for multiple right-hand side vectors b. We recall that A is diagonally dominant, and hence pivoting, which often limits the performance of parallel algorithms, is not an issue here. It is well known that this factorisation is computationally intensive, involving order n³ arithmetic operations (multiplications and additions). In contrast, the forward- and backward-substitution required to solve the original system once the factorisation has been performed involves an order of magnitude less computation, namely order n², which becomes insignificant as the matrix size increases. Consequently, we limit consideration to the factorisation process only.
Formally, we have that the elements $u_{ij}$ of $U$ are given by

$$u_{ij} = a_{ij} - \sum_{k=1}^{i-1} l_{ik} u_{kj}, \qquad j = i, i+1, \ldots, n \qquad (1)$$

and the elements $l_{ij}$ of $L$ are given by

$$l_{ij} = \frac{1}{u_{jj}} \left( a_{ij} - \sum_{k=1}^{j-1} l_{ik} u_{kj} \right), \qquad i = j+1, \ldots, n \qquad (2)$$

(Doolittle factorisation), leading to an upper-triangular U and a unit lower-triangular L.

2. A Naive Systolic Algorithm Solution

A systolic system can be envisaged as an array of synchronised processing elements (PEs), or cells, which process data in parallel by passing them from cell to cell in a regular rhythmic pattern. Systolic arrays have gained popularity because of their ability to exploit massive parallelism and pipelining to produce high performance computing [4] [5]. Although systolic algorithms support a high degree of concurrency, they are often regarded as being appropriate only for those machines specially built for the particular algorithm in mind. This is because of the inherently high communication/computation ratio. In a soft-systolic algorithm, the emphasis is on retaining systolic computation as a design principle and mapping the algorithm onto an available (non-systolic) parallel architecture, with inevitable trade-offs in speed and efficiency due to communication and processor overheads incurred in simulating the systolic array structure. Initially a systolic matrix factorisation routine was written in Encore Parallel Fortran (epf) and tested on an Encore Multimax 520. This machine has seven dual processor cards (a maximum of ten can be accommodated), each of which contains two independent 10 MHz National Semiconductor 32532 32-bit PEs with LSI memory management units and floating-point co-processors.

Fig. 1. Data flow for LU factorisation
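The recurrences (1) and (2) translate directly into a short serial sketch (illustrative Python, not the authors' epf code). The payoff of the Doolittle choice is visible in the second routine: one factorisation serves any number of right-hand sides via order-n² substitution:

```python
def doolittle_lu(A):
    """Doolittle LU factorisation: returns (L, U) with unit lower-triangular L.
    No pivoting -- safe here because A is diagonally dominant."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    U = [[0.0] * n for _ in range(n)]
    for i in range(n):
        L[i][i] = 1.0
        for j in range(i, n):          # eq. (1): row i of U
            U[i][j] = A[i][j] - sum(L[i][k] * U[k][j] for k in range(i))
        for j in range(i + 1, n):      # eq. (2): column i of L
            L[j][i] = (A[j][i] - sum(L[j][k] * U[k][i] for k in range(i))) / U[i][i]
    return L, U

def lu_solve(L, U, b):
    """Forward- then back-substitution: reuse one factorisation for each b."""
    n = len(b)
    y = [0.0] * n
    for i in range(n):                          # solve L y = b
        y[i] = b[i] - sum(L[i][k] * y[k] for k in range(i))
    x = [0.0] * n
    for i in reversed(range(n)):                # solve U x = y
        x[i] = (y[i] - sum(U[i][k] * x[k] for k in range(i + 1, n))) / U[i][i]
    return x
```

The same code runs unchanged on complex matrices, as required by the hydrodynamic application.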
This work was undertaken on a well-established machine platform that was able to provide a mechanism for validating the model of computation, and the software developed from that model, with a view to refining the model for subsequent development on a state-of-the-art distributed memory parallel machine. epf's parallel programming paradigm is much simpler to work with than that for message passing on distributed memory architectures, and provides a convenient means of producing validation data for subsequent developments. As long as the underlying algorithm is retained, this implementation can be used to check the correctness of the equations developed and the validity of the algorithm, with a view to an eventual port.

The task of systolic array design may be defined more precisely by investigating the cell types required, the way they are to be connected, and how data moves through the array in order to achieve the desired computational effect. The hexagonal shaped cells employed here are due to Kung and Leiserson [6]. The manner in which the elements of L and U are computed in the array and passed on to the cells that require them is demonstrated in Fig. 1. An example of a 4×4 cell array is shown for the purposes of illustrating the paradigm employed. Of the PEs on the upper-right boundary, the top-most has three roles to perform:

1. Produce the diagonal components of L (which, for a Doolittle-based factorisation, are all unity).
2. Produce the diagonal components of U using elements of A which have filtered through the cells along the diagonal (and been modified, as appropriate).
3. Pass the reciprocal of the diagonal components of U down and left.

The cells on the upper-left boundary are responsible for computing the multipliers (the elements of L), having received the appropriate reciprocals of the diagonal elements of U. The flow of data through the systolic array is shown in Fig. 1.
Elements of A flow in an upward direction; elements of L (computed from (2)) flow in a right-and-down direction; and elements of U (computed from (1)) flow in a left-and-down direction. Each cell computes a value every 3 clock ticks, although they start at different times. Note that at each time step each cell responsible for forming an element of L or U calculates one multiplication only in forming a partial summation. Data flowing out correspond to the required elements of the L and U factors of A. Formally, the elements of A are fed into the cells as indicated, although for efficiency the cells are directly assigned the appropriate elements of A.

Table 1 shows the execution times (T_p) of the parallel code with a fixed-size (100×100 double complex) matrix and various numbers of PEs (p). The gradual algorithmic speed-up (S_p), defined as the ratio of the time to execute the program on a single processor to the time to execute the same parallel program on p processors, is clearly seen all the way up to twelve PEs. The (generally) decreasing efficiency (E_p), defined as the ratio of speed-up to the number of PEs, times 100, is a consequence of the von Neumann bottleneck. The results show some minor anomalies, but this is not atypical when attempting to obtain accurate timings on a shared resource, with other […] comparison to the DMT routine in terms of elapsed time. Nevertheless, the systolic algorithm shows better speed-up characteristics than the DMT algorithm, as illustrated by Fig. 2.

If A is an n by n dense matrix then the systolic algorithm implemented on n² PEs can compute L and U in 4n clock ticks, giving a cell efficiency of 33%. Assuming a hardware system in which the number of PEs is much less than the number of cells, and using an appropriate mapping of cells to PEs, we can improve this position considerably, and we now address this issue.

Fig. 3. Allocation of pseudo cells to PEs for a 6×6 matrix.
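The speed-up and efficiency measures used above reduce to two one-line formulas; the timings below are made up purely for illustration, not taken from Table 1:

```python
def speedup(t1, tp):
    """S_p: execution time on one PE divided by time on p PEs."""
    return t1 / tp

def efficiency_pct(t1, tp, p):
    """E_p: speed-up per PE, expressed as a percentage."""
    return 100.0 * speedup(t1, tp) / p

# Hypothetical timings in seconds (not the paper's measurements).
t1, t12 = 600.0, 75.0
print(speedup(t1, t12))            # 8.0
print(efficiency_pct(t1, t12, 12))
```

An efficiency below 100% on 12 PEs is exactly the "decreasing E_p" behaviour the text attributes to the von Neumann bottleneck.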
3. An Improved Systolic Algorithm

As already indicated, the ultimate goal is to produce an efficient systolic matrix factorisation routine for general-purpose distributed parallel systems, including clusters of workstations. This is to be achieved by increasing the granularity of the computation within the algorithm, and thus reducing the communication/computation ratio, while balancing the load on each PE to minimise the adverse effect of enforced synchronisation. Nevertheless, the characteristics peculiar to systolic algorithms should be retained. Thus we aim at

• massive, distributed parallelism
• local communication only
• asynchronous mode of operation

Each PE will need to perform far more complex operations than in the original systolic systems used as the basis for implementation. There is an inevitable increase in complexity from the organisation of the data handling required at each synchronisation (message-passing) point.

The systolic algorithm can be regarded as a wave front passing through the cells in an upwards direction in Fig. 1. This means that all pseudo cells in a horizontal line, corresponding to a reverse diagonal of A, become active at once. It follows that we allocate the whole of a reverse diagonal to a PE, and distribute the reverse diagonals from the top left to the bottom right in a cyclic manner so as to maintain an even load balance (for a suitably large matrix size). Fig. 3 shows such a distribution for a 6×6 matrix distributed on 4 PEs.

The computation starts at PE 0 with pseudo cell 1. As time increases, so the computation domain over the matrix domain increases, and later shrinks. The shape of the computation domain is initially triangular, to include the first few reverse diagonals. On passing the main reverse diagonal, the computation domain becomes pentagonal, and remains so until the bottom right-hand corner is reached, when it becomes a quadrilateral.
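The reverse-diagonal decomposition just described can be sketched in a few lines (illustrative only, not the authors' MPI code). Cells on the same reverse diagonal d = i + j are mutually independent, because in the Doolittle recurrences every quantity needed by cell (i, j) comes from a cell with a strictly smaller index sum; whole reverse diagonals are then dealt cyclically to PEs:

```python
def wavefront_schedule(n):
    """Group the n*n pseudo cells (i, j) by reverse diagonal d = i + j.
    All cells in one wave may be computed in the same step."""
    waves = [[] for _ in range(2 * n - 1)]
    for i in range(n):
        for j in range(n):
            waves[i + j].append((i, j))
    return waves

def owner(d, num_pes):
    """Cyclic allocation: reverse diagonal d is assigned to PE d mod num_pes."""
    return d % num_pes
```

For the 6×6 example on 4 PEs, the 11 reverse diagonals go to PEs 0, 1, 2, 3, 0, 1, ..., which keeps the load roughly even for a suitably large matrix.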
Once the computation domain has covered the whole of the domain of pseudo cells it shrinks back to the top left, whilst retaining its quadrilateral shape. The whole process is completed in 3n-2 timesteps.

A Fortran implementation of the revised systolic algorithm is currently under development using the distributed memory Cray T3D at the Edinburgh Parallel Computing Centre (EPCC), and the Fujitsu AP1000 at Imperial College, London, with MPI [8] [9] for message passing. In this implementation the elements in each reverse (top-right to bottom-left) diagonal of the matrix are bundled together so that they are dealt with by a single PE, and all of the reverse diagonals which are active at a given time step are again grouped together to be passed around as a single message. Preliminary experience with the revised algorithm indicates that it scales well, as shown by the speed-up figures for a 400 by 400 array computed on the Fujitsu machine and illustrated in Fig. 4.

Fig. 4. Speedup on distributed memory machine for 400 by 400 array

4. Conclusions

The algorithm described initially was a precursor to a more generalised one designed with the intention of increasing the granularity of computation, and with the solution of large systems of equations in mind. As expected, it performed poorly in terms of elapsed time due to the penalties imposed by an unfavourable balance between processor communication and computation. However, the speed-up characteristics compared favourably with those of a more conventional approach and pointed to the potential of the generalised algorithm. Accordingly, a generalised version of the algorithm has been developed and is currently being tested. The speed-up obtained on the distributed memory machine suggests that the approach taken is valid.
It remains to benchmark the algorithm against a conventional solver, such as the one available within the public domain software package ScaLAPACK.

References

1. Hearn, G.E., Yaakob, O., Hills, W.: Seakeeping for design: identification of hydrodynamically optimal hull forms for high speed catamarans. Proc. RINA Int. Symp. on High Speed Vessels for Transport and Defence, Paper 4, London (1995)
2. Hardy, N., Downie, M.J., Bettess, P., Graham, J.M.: The calculation of wave forces on offshore structures using parallel computers. Int. J. Num. Meth. Engng. 36 (1993) 1765-1783
3. Hardy, N., Downie, M.J., Bettess, P.: Calculation of fluid loading on offshore structures using parallel distributed memory MIMD computers. Proceedings, Parallel CFD, Paris (1993)
4. Quinton, P., Craig, I.: Systolic Algorithms & Architectures. Prentice Hall International (1991)
5. Megson, G.M. (ed.): Transformational Approaches to Systolic Design. Chapman and Hall (1994)
6. Kung, H.T., Leiserson, C.E.: Systolic Arrays for VLSI. In: Duff, I.S., Stewart, G.W. (eds.): Sparse Matrix Proceedings. Society for Industrial and Applied Mathematics (1979) 256-282
7. Applegarth, I., Barbier, C., Bettess, P.: A parallel equation solver for unsymmetric systems of linear equations. Comput. Sys. Engng. 4 (1993) 99-115
8. MPI: A Message-Passing Interface Standard. Message Passing Interface Forum (May 1994)
9. Snir, M., Otto, S., Huss-Lederman, S., Walker, D., Dongarra, J.: MPI: The Complete Reference. MIT Press (1996)
