The solution of a general block tridiagonal linear system by a cyclic odd-even reduction algorithm is considered. Under conditions of diagonal dominance, norms describing the off-diagonal blocks relative to the diagonal blocks decrease quadratically with each reduction. This allows early termination of the reduction when an approximate solution is desired. The algorithm is well-suited for parallel computation.
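Since the abstract does not reproduce the recursion itself, here is a minimal sketch of odd-even cyclic reduction for the scalar (block size 1) tridiagonal case, written recursively in Python for readability; the paper treats matrix blocks, and on a parallel machine all even-indexed eliminations at a given level proceed simultaneously. The function name and the test problem are illustrative, not taken from the paper.

```python
import numpy as np

def cyclic_reduction(a, b, c, d):
    """Solve a tridiagonal system by odd-even cyclic reduction.

    a: sub-diagonal (a[0] ignored), b: main diagonal,
    c: super-diagonal (c[-1] ignored), d: right-hand side.
    """
    n = len(b)
    if n == 1:
        return np.array([d[0] / b[0]])

    # Eliminate the odd-indexed unknowns: each even-indexed equation is
    # combined with its two odd neighbours.  Every even index can be
    # processed independently, which is where the parallelism comes from.
    ev = np.arange(0, n, 2)
    m = len(ev)
    ar, br, cr, dr = np.zeros(m), np.zeros(m), np.zeros(m), np.zeros(m)
    for k, i in enumerate(ev):
        alpha = a[i] / b[i - 1] if i > 0 else 0.0
        beta = c[i] / b[i + 1] if i + 1 < n else 0.0
        ar[k] = -alpha * a[i - 1] if i > 0 else 0.0
        cr[k] = -beta * c[i + 1] if i + 1 < n else 0.0
        br[k] = b[i] - (alpha * c[i - 1] if i > 0 else 0.0) \
                     - (beta * a[i + 1] if i + 1 < n else 0.0)
        dr[k] = d[i] - (alpha * d[i - 1] if i > 0 else 0.0) \
                     - (beta * d[i + 1] if i + 1 < n else 0.0)

    x = np.zeros(n)
    x[ev] = cyclic_reduction(ar, br, cr, dr)  # recurse on the half-size system

    # Back-substitute the odd-indexed unknowns from their even neighbours.
    for i in range(1, n, 2):
        upper = c[i] * x[i + 1] if i + 1 < n else 0.0
        x[i] = (d[i] - a[i] * x[i - 1] - upper) / b[i]
    return x

# Diagonally dominant test system: 4 on the diagonal, 1 off the diagonal.
n = 9
a, b, c = np.ones(n), 4.0 * np.ones(n), np.ones(n)
a[0], c[-1] = 0.0, 0.0
d = np.arange(1.0, n + 1)
x = cyclic_reduction(a, b, c, d)
A = np.diag(b) + np.diag(a[1:], -1) + np.diag(c[:-1], 1)
print(np.allclose(A @ x, d))  # True
```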
Proceedings of the 7th annual symposium on Computer Architecture - ISCA '80, 1980
This paper discusses the development of a high-speed pipelined arithmetic system suitable for recursive numeric computations. The core of the arithmetic system is an online pipeline network. The details of the architectural design of this arithmetic system are first presented. Then the organization of such a system to support a broad range of recursive computations, which have not been amenable to pipelining by other techniques, is described. The LU factorization of a tridiagonal matrix is used as an example to provide timing comparisons between the online pipeline network, the CRAY-1, and the systolic array of Kung and Leiserson (1978).
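As a reminder of what makes this computation recursive, the following is the standard LU recurrence for a tridiagonal matrix, the example the paper uses for its timing comparisons. The sketch is the textbook sequential factorization (no pivoting), not a model of the pipeline hardware, and the function name is ours.

```python
def tridiag_lu(a, b, c):
    """LU factorization of a tridiagonal matrix, without pivoting.

    a: sub-diagonal (a[0] unused), b: main diagonal, c: super-diagonal.
    Returns (l, u): l holds the multipliers of the unit lower bidiagonal
    factor, u the diagonal of the upper bidiagonal factor (whose
    super-diagonal is c, unchanged).
    """
    n = len(b)
    l = [0.0] * n
    u = [0.0] * n
    u[0] = b[0]
    for i in range(1, n):
        # u[i] depends nonlinearly on u[i-1]: this first-order recurrence
        # is what conventional vector pipelines cannot stream directly.
        l[i] = a[i] / u[i - 1]
        u[i] = b[i] - l[i] * c[i - 1]
    return l, u
```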
Special-purpose hardware has previously been proposed for recursive filter computations. In some cases the hardware capacity may not match the filter order, requiring the problem to be partitioned. Tradeoffs between hardware design, hardware dimensions and partitioning methods will be discussed.
Proceedings of the 1993 ACM/IEEE conference on Supercomputing - Supercomputing '93, 1993
The Gordon Bell Prize recognizes significant achievements in the application of supercomputers to scientific and engineering problems. In a special session at Supercomputing '93 the finalists of the 1993 prize competition will give presentations about their winning entries. In this note we summarize the rules for the Gordon Bell Prize and give a brief review of the history of this prize, which reflects some of the developments in high-performance computing in the last five years.
The existence of parallel and pipeline computers has inspired a new approach to algorithmic analysis. Classical numerical methods are generally unable to exploit multiple processors and powerful vector-oriented hardware. Efficient parallel algorithms can be created by reformulating familiar algorithms or by discovering new ones, and the results are often surprising. A comprehensive survey of parallel techniques for problems in linear algebra is given. Specific topics include: relevant computer models and their consequences, evaluation of ubiquitous arithmetic expressions, solution of linear systems of equations, and computation of eigenvalues.
Watson: "You have formed a theory, then?"
Holmes: "At least I have got a grip of the essential facts of the case. I shall enumerate them to you, for nothing clears up a case so much as stating it to another person, and I can hardly expect your cooperation if I do not show you the position from which we start."
SIAM Journal on Scientific and Statistical Computing, 1983
An orthogonally connected systolic array, consisting of a few types of simple processors, is constructed to perform the $QR$ decomposition of a matrix. Application is made to the solution of linear systems and linear least squares problems as well as $QL$ and $LQ$ factorizations. For matrices $A$ of bandwidth $w$ the decomposition network requires fewer than $w^2$ processors, independent of the order $n$ of $A$. In terms of the operation time of the slowest processor, the computation time varies between $2n$ and $4n$, depending on the number of codiagonals.
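The elementary operation behind such an array is a plane (Givens) rotation that annihilates one subdiagonal entry at a time. The sketch below performs the same rotations sequentially on a small dense matrix; it shows only the arithmetic each cell carries out, not the banded systolic schedule, and the function name is ours.

```python
import numpy as np

def givens_qr(A):
    """QR decomposition by Givens rotations, applied sequentially."""
    R = np.array(A, dtype=float)
    m, n = R.shape
    Q = np.eye(m)
    for j in range(n):
        for i in range(m - 1, j, -1):
            # Rotate rows i-1 and i so that R[i, j] becomes zero.
            r = np.hypot(R[i - 1, j], R[i, j])
            if r == 0.0:
                continue
            cth, sth = R[i - 1, j] / r, R[i, j] / r
            G = np.array([[cth, sth], [-sth, cth]])
            R[i - 1:i + 1, j:] = G @ R[i - 1:i + 1, j:]
            Q[:, i - 1:i + 1] = Q[:, i - 1:i + 1] @ G.T
    return Q, R

A = np.random.default_rng(0).standard_normal((5, 3))
Q, R = givens_qr(A)
print(np.allclose(Q @ R, A), np.allclose(np.tril(R, -1), 0.0))  # True True
```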
We state and prove an expansion theorem for the determinant of any Hessenberg matrix. The expansion is expressed as a vector-matrix-vector product which can be efficiently evaluated on a parallel machine. We consider the computation of the first $N$ terms of a sequence defined by a general linear recurrence. On a sequential machine this problem is $O(N^2)$; with $N/2$ processors it is $O(N)$, and with $O(N)$ processors it is $O(\log N)$ using our expansion. Other applications include locating roots of analytic functions and proving doubling formulas for linear recurrences with constant coefficients.
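The constant-coefficient case mentioned at the end maps onto powers of the companion matrix, which repeated squaring evaluates in $O(\log n)$ matrix products; this is the "doubling formula" point of view. The sketch below illustrates only that special case, not the Hessenberg-determinant expansion itself, and the function name is ours.

```python
import numpy as np

def nth_term_by_doubling(coeffs, init, n):
    """n-th term of x_i = coeffs[0]*x_{i-1} + ... + coeffs[k-1]*x_{i-k},
    given init = [x_0, ..., x_{k-1}], via powers of the companion matrix.
    np.linalg.matrix_power uses repeated squaring, i.e. O(log n) products.
    """
    k = len(coeffs)
    if n < k:
        return float(init[n])
    C = np.zeros((k, k))
    C[0, :] = coeffs                 # first row encodes the recurrence
    if k > 1:
        C[1:, :-1] = np.eye(k - 1)   # remaining rows shift the state
    state = np.array(init[::-1], dtype=float)   # (x_{k-1}, ..., x_0)
    return (np.linalg.matrix_power(C, n - k + 1) @ state)[0]

# Fibonacci: x_i = x_{i-1} + x_{i-2}, x_0 = 0, x_1 = 1.
print(nth_term_by_doubling([1.0, 1.0], [0.0, 1.0], 10))  # 55.0
```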
Iterative methods for the solution of tridiagonal systems are considered, and a new iteration is presented, whose rate of convergence is comparable to that of the optimal two-cyclic Chebyshev iteration but which does not require the calculation of optimal parameters. The convergence rate depends only on the magnitude of the elements of the tridiagonal matrix and not on its dimension or spectrum. The theory also has a natural extension to block tridiagonal systems. Numerical experiments suggest that on a parallel computer this new algorithm is the best of the iterative algorithms considered.
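The paper's iteration is not reproduced in the abstract; as a point of reference only, the sketch below is a plain Jacobi sweep for a tridiagonal system, the simplest iteration of this family, chosen because every component of the new iterate is independent of the others and can therefore be updated in one parallel step. It is not the method proposed in the paper.

```python
import numpy as np

def jacobi_tridiagonal(a, b, c, d, iters=100):
    """Jacobi sweeps for a tridiagonal system.

    a: sub-diagonal (a[0] ignored), b: main diagonal,
    c: super-diagonal (c[-1] ignored), d: right-hand side.
    """
    x = np.zeros_like(np.asarray(d, dtype=float))
    for _ in range(iters):
        left = np.concatenate(([0.0], x[:-1]))   # x_{i-1}
        right = np.concatenate((x[1:], [0.0]))   # x_{i+1}
        x = (d - a * left - c * right) / b       # all components at once
    return x

n = 9
a, b, c = np.ones(n), 4.0 * np.ones(n), np.ones(n)
d = np.arange(1.0, n + 1)
x = jacobi_tridiagonal(a, b, c, d)
A = np.diag(b) + np.diag(a[1:], -1) + np.diag(c[:-1], 1)
print(np.max(np.abs(A @ x - d)) < 1e-10)  # True for this dominant system
```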
Journal of Parallel and Distributed Computing, 1996
We present a communication library to improve the performance of PVM. The new library introduces communication primitives based on Active Messages. We propose a hybrid scheme that combines signal-driven message notification with controlled polling. The new communication library is tested alongside the standard PVM library to assess the improvement in performance.
Minimal Parallelism for Associative Computations Under Time Constraints. The parallel evaluation of $A_N = a_1 \circ a_2 \circ \cdots \circ a_N$, where $\circ$ is a binary associative operation, is studied. Under an idealized model of parallel computation, the minimal number of parallel processors required to compute $A_N$ in at most $t$ steps is determined for $\lceil \log_2 N \rceil \le t \le N-1$. This indicates that it is not always desirable to reduce the running time to an absolute minimum, and provides a lower bound on the processing power required for time-constrained evaluation of general arithmetic expressions. Results for two-input processors are generalized to $b$-input processors, and then to non-homogeneous collections of processors. The latter case does not have a closed-form solution, so approximations are analyzed.
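A fully parallel evaluation sits at the $t = \lceil \log_2 N \rceil$ end of the range studied: combine disjoint pairs in each pass, so one pass is one time step when $\lfloor N/2 \rfloor$ processors are available. The sketch below simulates that schedule sequentially; it illustrates the setting of the problem, not the paper's processor-count bounds, and the function name is ours.

```python
def parallel_reduce(values, op):
    """Pairwise (tree) evaluation of values[0] op values[1] op ... op values[-1].

    All combinations within one pass are independent, so each pass is one
    parallel step; ceil(log2(N)) passes suffice.
    """
    steps = 0
    while len(values) > 1:
        values = [op(values[i], values[i + 1]) if i + 1 < len(values) else values[i]
                  for i in range(0, len(values), 2)]
        steps += 1
    return values[0], steps

total, steps = parallel_reduce(list(range(1, 17)), lambda u, v: u + v)
print(total, steps)  # 136 4  (16 operands combined in log2(16) = 4 passes)
```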
For the first time in the Gordon Bell Prize competition, performance numbers in fractions of a teraflop were reported. Bell was so pleased that he doubled the performance awards. The Gordon Bell Prize recognizes significant achievements in the application of parallel processing to scientific and engineering problems. In 1994, finalists were named for work in two categories: Performance, which recognizes those who solved a real problem in less elapsed time than anyone else, and Price/performance, which encourages the development of cost-effective supercomputing. No entry was received in the third category of compiler-generated speedup, which measures how well compiler writers are doing at easing the programming of parallel processors. Gordon Bell, an independent consultant in Los Altos, California, is sponsoring $2,000 in prizes each year for 10 years to promote practical parallel-processing research. He was so pleased to see performance measured in tenths of a teraflop (trillions of floating-point operations per second) that he doubled the size of the two performance awards. This is the seventh year of the prize, which Computer administers. The winners were announced November 17 at the Supercomputing '94 meeting in Washington, D.C. Results: The performance winner modeled problems in structural mechanics; the price/performance winner simulated quantum mechanical interactions. Other interesting and impressive entries simulated isotropic turbulence, modeled vapor deposition, computed the internal mobility of an organic molecule, and ran non-floating-point applications such as gene sequencing and image processing.