The Parallel Solution of Systems of Linear Equations
using Iterative Methods on Transputer Networks
Rudnei Dias da Cunha
Computing Laboratory, University of Kent at Canterbury, U.K.
Centro de Processamento de Dados, Universidade Federal do Rio Grande do Sul, Brasil
Tim Hopkins
Computing Laboratory, University of Kent at Canterbury, U.K.

To appear in “Transputing for Numerical and Neural Network Applications”, IOS Press, Amsterdam.
1 Introduction
We consider the solution of systems of linear equations

    Ax = b.    (1)
Iterative methods for solving system (1) are attractive because their complexity is O(N^2), provided the number of iterations required for convergence is small compared to N, whilst direct methods like the LU and Cholesky decompositions are O(N^3), thus precluding their use for very large N. Moreover, when considering parallel implementations of such methods, iterative methods are more obviously suited to parallelization through the vectorization of arithmetic operations among replicated functional units (SIMD) or processors (MIMD).
Iterative methods can be built upon a set of basic linear algebra subroutines (BLAS), and
the successful parallelization of a given method depends strongly on how efficient these
subroutines are. We consider some of the BLAS which are most useful in the implementation of iterative methods; these include inner-products, the computation of norms and matrix-vector products.
A brief outline of the paper follows. First, we will present the parallel environment (hardware and software components) used. Then the basic linear algebra subroutines are presented, along with the results of some tests designed to ascertain how the efficiency of the implementation of these methods is affected by problem size, sparsity and the number of processors used. Finally, a case study, the implementation of a polynomial preconditioned Conjugate Gradient method, is discussed.
2 Parallel environment
2.1 Hardware
The machine used for the implementations was a MEiKO SPARC-based Computing Surface.
The processors used were T800 transputers rated at 20 MHz (T800-20). Each transputer has
4MB of DRAM memory.
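For reference, a dense N × N matrix of 64-bit reals occupies 8N^2 bytes, so the N = 1000 dense matrix alone requires 8 MB, more than the memory of a single transputer; this is consistent with the memory restrictions noted for the serial runs in §4 and §6.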
2.2 Software
The algorithms were implemented in occam 2 using 64-bit floating-point arithmetic. The
topology used to interconnect the transputers is either a rectangular or square grid.
We used the Single Program, Multiple Data (SPMD) paradigm in the implementation of the algorithms. The code that ran on each transputer was divided into ten processes. Two processes handled the incoming and outgoing messages on each link, and a buffer process was used to decouple communication and computation, the latter being carried out by a separate process.
Figure 1 shows the grid of processors and the structure of the processes in each transputer. Some of the connections are not shown for clarity.

[Figure 1: the grid of transputers and, within each transputer, the routing (link-handler), buffer and compute processes.]
We used this structure of processes to overlap communication with computation. We exploited the fact that each link is capable of exchanging data between two processors by assigning two sets of incoming/outgoing-message handlers to the four links needed for the grid topology. Since these eight processes run at a higher priority and concurrently with the computation process, we can overlap the communication with the computation if the amount of computation is sufficiently large. The data received by the handlers are only passed to the computation process when they are needed.
The communication and routing aspects were encapsulated in the communications-handling
processes, hiding these issues from the computation process. This structure allowed the
coding of a very clean computation process and reduced the amount of work involved in implementing the algorithms, because the same code (both communications and computation
processes) specific to a given method ran in all processors. Also, the reusability of the
communications-handling code was enhanced, requiring at most a few minor modifications
when implementing other iterative methods.
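As an illustration, the following minimal occam 2 sketch (not the authors' code) shows this per-transputer process layout; the PROC names, channel types and wiring are hypothetical placeholders, and the handler, buffer and compute bodies are stubs.

  -- Placeholder processes: routing, buffering and computation.
  PROC in.handler (CHAN OF REAL64 link.in, to.buffer)
    SKIP   -- would receive and route incoming messages
  :
  PROC out.handler (CHAN OF REAL64 from.buffer, link.out)
    SKIP   -- would forward outgoing messages
  :
  PROC buffer ([4]CHAN OF REAL64 from.in, to.out, CHAN OF REAL64 to.comp, from.comp)
    SKIP   -- would decouple the link handlers from the computation
  :
  PROC compute (CHAN OF REAL64 in, out)
    SKIP   -- would carry out the numerical (BLAS) work
  :
  PROC node ([4]CHAN OF REAL64 link.in, link.out)
    [4]CHAN OF REAL64 in.to.buf, buf.to.out:
    CHAN OF REAL64 buf.to.comp, comp.to.buf:
    PRI PAR
      PAR                                  -- high priority: eight handlers plus the buffer
        PAR i = 0 FOR 4
          PAR
            in.handler (link.in[i], in.to.buf[i])
            out.handler (buf.to.out[i], link.out[i])
        buffer (in.to.buf, buf.to.out, buf.to.comp, comp.to.buf)
      compute (buf.to.comp, comp.to.buf)   -- low priority: the computation process
  :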
3 Basic linear algebra subroutines

The linear algebra operations that will be discussed are the saxpy, the inner-product and the matrix-vector product.
Consider a real (scalar) value α, three real vectors u, v and w of length N and a real matrix A of size N × N. The grid of processors has P_r rows and P_c columns, the number of processors being P = P_r P_c. Assume, without loss of generality, that N is an integer multiple of P. Each processor stores s = N/P columns of A and segments of s elements of each vector used. The operations are described below.
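For example, with N = 1000 and P = 25 (a 5 × 5 grid), each processor stores s = 40 columns of A and a 40-element segment of each of the vectors.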
3.1 Saxpy
The “scalar-alpha-x-plus-y” or saxpy operation is a vector accumulation of the form w = αu + v. This operation has the characteristic that its computation is disjoint element-wise with respect to the vectors u, v and w. This means that we can compute a saxpy without any communication between processors. Parallelism is exploited in the saxpy by the fact that P processors will compute the same operation with a substantially smaller amount of data. The saxpy is computed independently in each processor as

    w_i = αu_i + v_i,    i = 1, …, s.    (2)

Note that the indices used are relative to the s variables stored in each processor and do not refer to the actual variables.
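A minimal occam 2 sketch of the local computation (not the authors' code; the PROC name is hypothetical) is:

  -- Local saxpy over the s elements stored on this transputer.
  PROC local.saxpy (VAL REAL64 alpha, VAL []REAL64 u, v, []REAL64 w)
    SEQ i = 0 FOR SIZE w
      w[i] := (alpha * u[i]) + v[i]
  :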
3.2 Inner-product
The inner-product, defined as σ = Σ_{i=1}^{N} u_i v_i, is an operation that involves the accumulation of data, implying a high level of communication between all processors. The grid topology and the process architecture used reduce the time that processors are idle waiting for the computed inner-product value to arrive, but the problem still remains. The use of the SPMD paradigm also implies the global broadcast of that value to all processors.
The inner-product is computed in three distinct phases. Phase 1 is the computation of partial sums of the form

    σ_p = Σ_{i=1}^{s} u_i v_i,    p = 1, …, P.    (3)
Phase 2 is the accumulation of the σ_p values computed in all processors. This accumulation is performed following the pattern shown in Figures 2-a and 2-b. We define processor (p_c, q_c) as the processor which will accumulate the partial values σ_p; this processor sits in the middle of the grid. The accumulation starts with all processors sending their partial values in a north or south direction to the processors that lie in the middle row (Figure 2-a). The partial values are accumulated while passing through each processor. The processors in the middle row then send their column-accumulated partial values to processor (p_c, q_c), as shown in Figure 2-b. At the end of this phase, the inner-product value is present in that processor.

[Figure 2: accumulation pattern for the inner-product: (a) partial values are sent along the columns towards the middle row; (b) the middle-row processors send their accumulated values to processor (p_c, q_c).]

Phase 3 is the broadcast phase, where processor (p_c, q_c) sends the inner-product value to its neighbours, which then relay this value to their neighbours, until all processors have the inner-product value stored.
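Phase 1 reduces to a purely local loop; a minimal occam 2 sketch (not the authors' code) is shown below, with phases 2 and 3 carried out by the communications-handling processes and therefore omitted here.

  -- Phase 1: partial sum over the s locally stored elements of u and v.
  PROC partial.inner.product (VAL []REAL64 u, v, REAL64 sigma.p)
    SEQ
      sigma.p := 0.0(REAL64)
      SEQ i = 0 FOR SIZE u
        sigma.p := sigma.p + (u[i] * v[i])
  :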
3.3 Matrix-vector product

The matrix-vector product v = Au is defined as

    v_i = Σ_{j=1}^{N} A_ij u_j,    i = 1, …, N.
The computation starts as follows. Each processor computes a vector w of s elements which holds the partial sums of the products of its locally stored columns with the corresponding elements of u, for a block of s rows. After this computation is complete, there are P vectors w stored in each processor, and P - 1 of them are to be sent to a “target” processor. The target processor is the one that holds the current s rows used to compute the w’s.
Each processor p sends its w_p vector to one of its neighbouring processors. A processor decides whether to send the vector in either the row or column direction to reach the target processor, based on the First-Row-First-Column routing algorithm (see [2]). As it passes through a processor q it is added to w_q (w_q = w_q + w_p), and this accumulated vector is again routed to the target processor, possibly being accumulated with other w vectors. The target processor will receive at most four w vectors which, when added to its own w vector, yield the desired set of s elements of v. Figure 3 depicts a possible situation for this algorithm.
[Figure 3: example of the matrix-vector product for a 2 × 8 matrix distributed by columns, showing how the partial w vectors are accumulated as they are routed towards the target processor.]
The column partitioning allows a better distribution of work, since the accumulations of data are computed as the w vectors travel through the network. It also allows us to make use of the buffered communications described in §2.2, since as soon as a w vector is sent, the processor may start the computations for a different target processor.
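The local computation for one target block of s rows can be sketched in occam 2 as follows (not the authors' code; a.block is assumed to hold the s × s block of the locally stored columns corresponding to the target rows, and u.local the matching s elements of u):

  -- Partial matrix-vector product: w holds this processor's contribution
  -- to s elements of v, to be routed to and accumulated at the target.
  PROC local.matvec.block (VAL [][]REAL64 a.block, VAL []REAL64 u.local,
                           []REAL64 w)
    SEQ i = 0 FOR SIZE w
      SEQ
        w[i] := 0.0(REAL64)
        SEQ j = 0 FOR SIZE u.local
          w[i] := w[i] + (a.block[i][j] * u.local[j])
  :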
Table 1 shows the effect of loop-unrolling on the matrix-vector product for N = 250 and 500, using a serial implementation on a single transputer. The gain is defined as (T_standard / T_unrolled - 1) × 100. The MFLOPS rate is also shown, where MFLOPS is given by (2N^2 - N) × 10^{-6} / T.

Table 1: Timings for the serial matrix-vector product with standard and unrolled loops.

             Standard               Unrolled
    N     Time(s)   MFLOPS      Time(s)   MFLOPS    Gain(%)
    250    0.3657   0.3412       0.2529   0.4932    44.5344
    500    1.4582   0.3426       0.9864   0.5075    47.8233

The use of loop-unrolling gives a gain of approximately 45%, whilst achieving a MFLOPS rate of around 70% of the theoretical maximum attainable (approximately 0.7 MFLOPS) on the T800-20 using 64-bit floating-point arithmetic. Similar results were obtained for the other operations presented here.
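As an illustration of the kind of transformation measured in Table 1, the following occam 2 sketch (not the authors' code) unrolls an accumulation loop four times; it assumes the trip count is a multiple of four.

  -- Four-way unrolled accumulation: the loop overhead is paid once per
  -- four products instead of once per product.
  PROC unrolled.dot (VAL []REAL64 u, v, REAL64 sum)
    SEQ
      sum := 0.0(REAL64)
      SEQ k = 0 FOR (SIZE u) / 4
        VAL INT i IS 4 * k:
        sum := sum + ((u[i] * v[i]) + ((u[i + 1] * v[i + 1]) +
               ((u[i + 2] * v[i + 2]) + (u[i + 3] * v[i + 3]))))
  :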
4 Results

In this section, we present the results obtained for the BLAS described in §3. We tested the routines for different problem sizes on different numbers of processors. Results are given for a serial implementation of the routines, running on a single transputer, for comparison with the parallel implementation.
The vectors are filled with random values, using a normal distribution with mean 10 and standard deviation 5, so as not to bias the measurements towards the FPU of the transputer (which does not execute some arithmetic operations if one of the operands is 0 or 1). The same is true for the scalar value α.
The code for the operations was generated using the “REDUCED” mode of the compiler,
which uses the minimum amount of exception testing (only floating-point checks are made).
Table 2 shows the timings obtained with N = 1000 and 10000 for P = 1 (serial), 4, 9, 16, 25 and 36 processors. The matrix-vector product is not computable on any of the processor grids for N = 10000 due to memory restrictions.
4.1 Efficiencies
Figure 4-a shows the effects of loop-unrolling for relatively small values of N. The efficiencies increase or decrease, for some combinations of N and P, due to the fact that relatively few unrolled loops are executed, allowing the standard loop to dominate the computation. Figure 4-b shows that for large N the efficiencies are close to 100%.

Table 2: Timings for the basic operations with N = 1000 and N = 10000.

    N = 1000     Serial     P = 4      P = 9      P = 16     P = 25     P = 36
    Operation    Time(s)    Time(s)    Time(s)    Time(s)    Time(s)    Time(s)
    αu + v       0.00390    0.00103    0.00046    0.00026    0.00017    0.00012
    u^T v        0.00416    0.00154    0.00109    0.00141    0.001216   0.001664
    Au           –          0.99834    0.47750    0.29280    0.19002    0.13317

    N = 10000    Serial     P = 4      P = 9      P = 16     P = 25     P = 36
    Operation    Time(s)    Time(s)    Time(s)    Time(s)    Time(s)    Time(s)
    αu + v       0.03763    0.00954    0.00424    0.00241    0.00157    0.00110
    u^T v        0.04038    0.00979    0.00474    0.00339    0.00250    0.00256

[Figure 4: efficiencies against the number of processors: (a) N = 250, 500, 1000; (b) N = 5000, 10000, 20000, 40000.]
Figure 5-a shows that the inner-product implementation provides acceptable efficiencies as N increases for P = 36. The matrix-vector product presents higher efficiencies than those of the inner-product for given N and P values. As will be seen in §6, the combined use of the operations provides good efficiency, since the more efficient operations (the saxpy and the matrix-vector product) contribute to minimizing the effects of the inner-product.
[Figure 5: efficiencies against the number of processors for N ranging from 250 to 40000.]

[Figure 6: efficiencies against the number of processors.]
5 A polynomial preconditioned Conjugate Gradient method

The Conjugate Gradient (CG) method is one of the most widely used techniques for the solution of systems of linear equations. Since its revival by Reid [9], who first considered it as an iterative method, several variations have been proposed (see [1]).
CG is usually used with a preconditioner, which both improves the rate of convergence and prevents divergence of the basic iterative method. It is well known that the convergence of CG depends on the distribution of the eigenvalues of the coefficient matrix. If the eigenvalues lie in the interval [a, 1], convergence will be improved if there is a clustering of the eigenvalues around 1. Preconditioners are thus used to ensure that the preconditioned coefficient matrix has its eigenvalues with the above characteristics.
Preconditioners based on incomplete Cholesky and LU factorizations are among the most
widely used (see [6]). However, these factorizations are sometimes numerically unstable
and, due to the inherently sequential nature of the forward-backward substitutions needed,
are not well suited to parallel implementations.
Another kind of preconditioner that has received attention in recent years is the polynomial preconditioner (see [6]). This preconditioner is attractive in a parallel environment as it may be implemented using the BLAS operations described in §3.
The preconditioner is applied as

    PP(m):  v = P^{-1} r
            p = γ_m v
            for i = 1 to m
                p = γ_{m-i} v + p - P^{-1} A p

where the γ_j are the coefficients of the preconditioning polynomial.
Equation (6) may be computed in a similar way and is more economical in terms of the number of arithmetic operations and vectors involved.
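As an illustration, a serial occam 2 sketch of applying the PP(m) preconditioner is given below; it is not the authors' code, and it assumes that P is taken to be the diagonal of A (held in p.diag), that gamma[0..m] holds the polynomial coefficients, and that the caller supplies work vectors v and ap of length N.

  -- Applies the PP(m) preconditioner: on exit p holds the preconditioned
  -- residual computed from r.
  PROC pp.apply (VAL [][]REAL64 a, VAL []REAL64 p.diag, gamma, r,
                 []REAL64 p, v, ap)
    VAL INT n IS SIZE r:
    VAL INT m IS (SIZE gamma) - 1:
    SEQ
      SEQ i = 0 FOR n                  -- v := P^{-1} r
        v[i] := r[i] / p.diag[i]
      SEQ i = 0 FOR n                  -- p := gamma_m v
        p[i] := gamma[m] * v[i]
      SEQ i = 1 FOR m                  -- m update steps
        SEQ
          SEQ j = 0 FOR n              -- ap := A p
            SEQ
              ap[j] := 0.0(REAL64)
              SEQ k = 0 FOR n
                ap[j] := ap[j] + (a[j][k] * p[k])
          SEQ j = 0 FOR n              -- p := gamma_{m-i} v + p - P^{-1} ap
            p[j] := ((gamma[m - i] * v[j]) + p[j]) - (ap[j] / p.diag[j])
  :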
The resulting polynomial preconditioned CG (PPCG) iteration, with M_m denoting the polynomial preconditioner, is

    ρ^(-1) = 1,   p^(-1) = 0
    for k = 0, 1, 2, … until convergence
        M_m z^(k) = r^(k)
        ρ^(k) = r^(k)T z^(k)
        β^(k) = ρ^(k) / ρ^(k-1)
        p^(k) = z^(k) + β^(k) p^(k-1)
        w^(k) = A p^(k)
        α^(k) = ρ^(k) / (p^(k)T w^(k))
        x^(k+1) = x^(k) + α^(k) p^(k)
        r^(k+1) = r^(k) - α^(k) w^(k)
6 Experimental results
As test problems, we used two generic systems. The first, a nonsymmetric, dense system, is defined by

    a_ii = 1,     i = 1, 2, …, N
    a_ij = 2,     j > i,  j = 1, 2, …, N
    a_ij = -2,    j < i,  j = 1, 2, …, N
    b_i  = 10,    i = 1, 2, …, N                (7)

Test cases I and II solve system (7) for N = 500 and 1000, respectively.
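For illustration, the dense matrix and right-hand side of system (7) can be assembled with the following occam 2 sketch (not the authors' code; indices here run from 0):

  -- Fills a and b with the entries of test system (7).
  PROC build.system.7 ([][]REAL64 a, []REAL64 b)
    SEQ i = 0 FOR SIZE b
      SEQ
        SEQ j = 0 FOR SIZE b
          IF
            j > i
              a[i][j] := 2.0(REAL64)
            j < i
              a[i][j] := -2.0(REAL64)
            TRUE
              a[i][j] := 1.0(REAL64)
        b[i] := 10.0(REAL64)
  :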
The second system, a sparse, symmetric, positive-definite, band matrix, is defined by

    a_ii = 10,    i = 1, 2, …, N
    a_ij = 1,     i = 1, 2, …, N,  j = i+1, i+2, …, i+W
    a_ji = 1,     i = 1, 2, …, N,  j = i-1, i-2, …, i-W
    b_i  = 10,    i = 1, 2, …, N                (8)

where W is the half-bandwidth (the number of diagonals above and below the main diagonal). Test case III is system (8) with N = 2000 and W = 50, which is 95% sparse. Test case IV has N = 2000 and W = 100, with a sparsity of 90%. These systems were solved using the symmetric scaling D^{-1/2} A D^{-1/2}.
We present the timings obtained by running PPCG on the four test problems. The parallel implementation was run on square grids using 4, 9, 16, 25 and 36 transputers. The results of running the serial code are also given for comparison, except for tests II and IV, which could not be solved on a single transputer due to memory size constraints. We required a tolerance of 10^{-10} for convergence, and the polynomial preconditioner was used with m = 1. Table 3 presents the timings for the test cases.
Figure 7 shows the efficiencies of PPCG. Test case I shows an acceptable performance for up to 16 processors; as shown in Table 3, for P = 36 the execution time exceeds that for P = 25. Test case II provides an efficiency of approximately 70% for P = 25. Comparing both problems, we expect that increasing N will provide higher efficiencies on a larger number of processors. Note in Table 3 that for test case II the execution time for P = 36 is less than that for P = 25.
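Assuming the usual definition of efficiency, E = T_1 / (P × T_P), Table 3 gives for test case I E ≈ 792.48 / (4 × 201.88) ≈ 0.98 on 4 processors and E ≈ 792.48 / (16 × 77.47) ≈ 0.64 on 16 processors.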
Table 3: Number of iterations and PPCG timings for the test cases on 1 to 6 × 6 processors.

                                             Processors
                      No. of        1        2 × 2      3 × 3      4 × 4      5 × 5      6 × 6
    Test   Storage  iterations   Time(s)    Time(s)    Time(s)    Time(s)    Time(s)    Time(s)
    I      Dense        446      792.4831   201.8778   102.2939    77.4727    63.7057    69.3276
    II     Dense        821         –      1503.6106   726.1686   463.9160   337.3492   297.7701
    III    Sparse       532      868.9976   255.5875   134.1046   100.5784   108.6247   114.9468
    IV     Sparse      1037         –       493.7486   254.0929   178.0733   141.8772   146.2959
Test cases III and IV show that, using the sparse storage, we obtain lower efficiencies with PPCG, mostly due to the lower efficiency of the sparse matrix-vector product for these problem sizes. However, test case IV, which is less sparse, shows a better overall efficiency of the implementation.
7 Conclusion
We have presented some results concerning the implementation of distributed linear algebra
operations on a network of transputers. The efficiency of the implementations is considerably
improved by the use of loop-unrolling.
The sparsity of the coefficient matrix A substantially affects the performance of the sparse matrix-vector product; high sparsity reduces the number of transputers that can be used effectively.

The BLAS discussed were used to implement serial and parallel versions of a Polynomial Preconditioned Conjugate Gradient method. The combined use of the BLAS operations means that the most efficient operations contribute to minimizing the effects of the less efficient ones.
[Figure 7: efficiencies of PPCG against the number of processors for the test cases.]
Acknowledgements

The first author acknowledges the financial support given by the Brazilian National Council for Scientific and Technological Development (CNPq) under grant 204062/89.6.
References
[1] S.F. Ashby, T.A. Manteuffel, and P.E. Saylor. A taxonomy for Conjugate Gradient
methods. SIAM Journal on Numerical Analysis, 27:1542–1568, 1990.
[3] P.F. Dubois, A. Greenbaum, and G.H. Rodrigue. Approximating the inverse of a matrix
for use in iterative algorithms on vector processors. Computing, 22:257–268, 1979.
[4] I.S. Duff. Sparse Matrices and Their Uses. Academic Press, London, 1981.
[5] A. George and J.W. Liu. Computer Solution of Large Sparse Positive Definite Systems.
Prentice-Hall, Englewood Cliffs, 1981.
[6] G.H. Golub and C.F. Van Loan. Matrix Computations. Johns Hopkins University Press,
Baltimore, 2nd edition, 1989.
[7] O.G. Johnson, C.A. Micchelli, and G. Paul. Polynomial preconditioners for Conjugate
Gradient calculations. SIAM Journal on Numerical Analysis, 20:362–376, 1983.
[8] INMOS Limited. Occam 2 Reference Manual. Prentice-Hall, New York, 1988.
[9] J.K. Reid. The use of Conjugate Gradients for systems of linear equations possessing
“Property A”. SIAM Journal on Numerical Analysis, 9:325–332, 1972.
[10] H.W. Roebbers and P.H. Welch. Advanced occam 2 and transputer engineering.
Course notes (parts 1 and 2), Control Laboratory, University of Twente and Computing
Laboratory, University of Kent at Canterbury, March 1991.
[11] Y. Saad. Practical use of polynomial preconditionings for the Conjugate Gradient
method. SIAM Journal on Scientific and Statistical Computing, 6:865–881, 1985.
[12] C.F. Schofield. Optimising FORTRAN programs. Ellis Horwood, Chichester, 1989.