Obtaining the eigenvalues and eigenvectors of large matrices is a key problem in electronic structure theory and many other areas of computational science. The computational effort formally scales as O(N³) with the size N of the investigated problem and thus often defines the system-size limit that practical calculations cannot overcome. In many cases, more than just a small fraction of the possible eigenvalue/eigenvector pairs is needed, so that iterative solution strategies that focus on only a few eigenvalues become ineffective. Likewise, it is not always desirable or practical to circumvent the eigenvalue solution entirely. We here review some current developments regarding dense eigenvalue solvers and then focus on the ELPA library, which facilitates the efficient algebraic solution of symmetric and Hermitian eigenvalue problems for dense matrices with real-valued and complex-valued entries, respectively, on parallel computer platforms. ELPA addresses standard as well as generalized eigenvalue problems, relying on the well-documented matrix layout of the ScaLAPACK library but replacing all actual parallel solution steps with subroutines of its own. The most time-critical step is the reduction of the matrix to tridiagonal form and the corresponding back-transformation of the eigenvectors. ELPA offers both a one-step tridiagonalization (successive Householder transformations) and a two-step transformation that is more efficient, especially for larger matrices and larger numbers of CPU cores. ELPA is based on the MPI standard, with an early hybrid MPI-OpenMP implementation available as well. Scalability beyond 10,000 CPU cores for problem sizes arising in electronic structure theory is demonstrated for current high-performance computer architectures such as Cray or Intel/Infiniband. For a matrix of dimension 260,000, scalability up to 295,000 CPU cores has been shown on BlueGene/P.
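As an illustration of what such a library computes, the following sketch shows the textbook reduction of a generalized symmetric eigenproblem A x = λ B x to standard form via a Cholesky factor of B, followed by back-transformation of the eigenvectors. The SciPy routines merely stand in for the distributed-memory (ScaLAPACK-layout) steps that ELPA performs; this is a serial sketch, not ELPA's API.

```python
# Serial sketch: generalized symmetric eigenproblem A x = lambda B x reduced
# to a standard one via B = L L^T, then eigenvectors back-transformed.
import numpy as np
from scipy.linalg import cholesky, eigh, solve_triangular

def generalized_symmetric_eig(A, B):
    """Solve A x = lambda B x for symmetric A and s.p.d. B (illustration only)."""
    L = cholesky(B, lower=True)                 # B = L L^T
    Y = solve_triangular(L, A, lower=True)      # Y = L^{-1} A
    C = solve_triangular(L, Y.T, lower=True).T  # C = L^{-1} A L^{-T}
    C = 0.5 * (C + C.T)                         # re-symmetrize against round-off
    w, Z = eigh(C)                              # standard symmetric eigenproblem
    X = solve_triangular(L.T, Z, lower=False)   # back-transform: x = L^{-T} z
    return w, X

rng = np.random.default_rng(0)
M = rng.standard_normal((200, 200))
A = 0.5 * (M + M.T)
B = M @ M.T + 200 * np.eye(200)                 # well-conditioned s.p.d. B
w, X = generalized_symmetric_eig(A, B)
print(np.allclose(A @ X, (B @ X) * w, atol=1e-8))   # residual check: True
```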
In this paper, we generalize the ideas behind the RS algorithms and the MHL algorithm. We develop a band reduction algorithm that eliminates d subdiagonals of a symmetric banded matrix with semibandwidth b (d < b), in a fashion akin to the MHL tridiagonalization algorithm. Then, like the Rutishauser algorithm, the band reduction algorithm is repeatedly applied until the reduced matrix is tridiagonal. If d = b − 1, ...
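A minimal sketch of the repeated-sweep idea (scheduling only, no numerics); the per-sweep choice of d below is an arbitrary illustration, not the strategy of the paper:

```python
# Each sweep removes d subdiagonals (d < b), so the semibandwidth shrinks
# from b to b - d; sweeps are repeated until the matrix is tridiagonal (b = 1).
def bandwidth_schedule(b, choose_d):
    """Semibandwidths visited when repeating 'remove d subdiagonals' sweeps."""
    schedule = [b]
    while b > 1:
        d = min(choose_d(b), b - 1)   # cannot remove more than b - 1 subdiagonals
        b = b - d                     # effect of one band-reduction sweep
        schedule.append(b)
    return schedule

# Halving the bandwidth in every sweep (one possible choice, for illustration):
print(bandwidth_schedule(32, lambda b: max(b // 2, 1)))   # [32, 16, 8, 4, 2, 1]
```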
Most methods for calculating the SVD (singular value decomposition) require first bidiagonalizing the matrix. The blocked reduction of a general, dense matrix to bidiagonal form, as implemented in ScaLAPACK, performs about one half of the operations with BLAS-3. By subdividing the reduction into two stages, dense → banded and banded → bidiagonal, with cubic and quadratic arithmetic costs, respectively, we are able to carry out a much higher portion of the calculations in matrix-matrix multiplications. Thus, higher performance can be expected. This paper presents and compares three parallel techniques for reducing a full matrix to banded form. (The second reduction stage is described in another paper [B. Lang, Parallel Comput. 22 (1996) 1-18].) Numerical experiments on the Intel Paragon and IBM SP/1 distributed-memory parallel computers demonstrate that the two-stage reduction approach can be significantly superior if only the singular values are required. This work was partially funded by Deutsche Forschungsgemeinschaft, Geschäftszeichen Fr 755/6-1 and Fr 755/6-2.
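The two-stage reduction can be summarized by the following factorization chain; the symbols Q1, P1, C, Q2, P2, J are chosen here for illustration and are not taken from the paper.

```latex
% Stage 1 (cubic cost, rich in matrix-matrix products): dense -> banded.
% Stage 2 (quadratic cost): banded -> bidiagonal.  Singular values are
% preserved throughout, so if only singular values are needed, the
% orthogonal factors need not be accumulated.
\[
\begin{aligned}
A &= Q_1\, C\, P_1^{\mathsf T}, && C \ \text{banded with semibandwidth } b,\\
C &= Q_2\, J\, P_2^{\mathsf T}, && J \ \text{bidiagonal},\\
A &= (Q_1 Q_2)\, J\, (P_1 P_2)^{\mathsf T}, && \sigma_i(A) = \sigma_i(J).
\end{aligned}
\]
```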
... Institute for Scientific Computing, Aachen University of Technology, D-52056 Aachen, Germany, {buecker, lang, rasch, bischof}@sc.rwth-aachen.de, http://www.sc ... In particular, Matthias Meinke and Ehab Fares deserve special recognition for their help with the RAE 2822 airfoil. ...
Numerical simulation is a powerful tool in science and engineering, and it is also used for optimizing the design of products and experiments rather than only for reproducing the behavior of scientific and engineering systems. In order to reduce the number of simulation runs, the traditional "trial and error" approach for finding near-optimum design parameters is increasingly being replaced by efficient numerical optimization algorithms. Done by hand, the coupling of simulation and optimization software is tedious and error-prone. In this note we introduce a software environment called EFCOSS (Environment For Combining Optimization and Simulation Software) that facilitates and speeds up this task by doing much of the required work automatically. Our framework includes support for automatic differentiation, providing the derivatives required by many optimization algorithms. We describe the process of integrating the widely used computational fluid dynamics package FLUENT and a MINPACK-1 least-squares optimizer into EFCOSS, and we follow a sample session solving a data assimilation problem.
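A minimal sketch of the coupling pattern that EFCOSS automates, under the assumption that the simulator is wrapped as a residual function of the design parameters. The toy model below stands in for FLUENT, its hand-coded Jacobian plays the role of AD-generated derivative code, and scipy.optimize.leastsq (a wrapper around MINPACK's Levenberg-Marquardt routines) stands in for the MINPACK-1 optimizer.

```python
# Data-assimilation flavoured example: fit model parameters p so that the
# simulated output matches synthetic "measurements".
import numpy as np
from scipy.optimize import leastsq

t_obs = np.linspace(0.0, 3.0, 40)
p_true = np.array([2.0, 0.7])
y_obs = p_true[0] * np.exp(-p_true[1] * t_obs)       # synthetic observations

def simulate(p):                                      # toy simulator stand-in
    return p[0] * np.exp(-p[1] * t_obs)

def residual(p):                                      # objective: ||residual||^2
    return simulate(p) - y_obs

def jacobian(p):                                      # role of AD-generated code
    e = np.exp(-p[1] * t_obs)
    return np.column_stack((e, -p[0] * t_obs * e))

p_fit, _ = leastsq(residual, x0=np.array([1.0, 1.0]), Dfun=jacobian)
print(p_fit)                                          # recovers approx. [2.0, 0.7]
```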
Proceedings of the 2003 ACM symposium on Applied computing - SAC '03, 2003
For functions given in the form of a computer program, automatic differentiation is an efficient technique to accurately evaluate the derivatives of that function. Starting from a given computer program, automatic differentiation generates another program for the ...
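To illustrate the underlying idea (not the specific tool of the paper), a minimal forward-mode example using operator overloading: value/derivative pairs are propagated through the same operations the original program performs, so the derivative is exact up to round-off rather than approximated.

```python
import math

class Dual:
    """Value together with its derivative with respect to one chosen input."""
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot
    def __add__(self, o):
        o = o if isinstance(o, Dual) else Dual(o)
        return Dual(self.val + o.val, self.dot + o.dot)
    __radd__ = __add__
    def __mul__(self, o):
        o = o if isinstance(o, Dual) else Dual(o)
        return Dual(self.val * o.val, self.dot * o.val + self.val * o.dot)
    __rmul__ = __mul__

def dsin(x):                       # elemental function with its derivative rule
    return Dual(math.sin(x.val), math.cos(x.val) * x.dot)

def f(x):                          # the "given computer program"
    return 3 * x * x + dsin(x)

x = Dual(1.5, 1.0)                 # seed dx/dx = 1
y = f(x)
print(y.val, y.dot)                # f(1.5) and f'(1.5) = 6*1.5 + cos(1.5)
```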
This work deals with aspects of support vector learning for large-scale data mining tasks. Based on a decomposition algorithm that can be run in serial and parallel mode, we introduce a data transformation that allows for the usage of an expensive generalized kernel without additional costs. In order to speed up the decomposition algorithm, we analyze the problem of working set selection for large data sets and study the influence of the working set sizes on the scalability of the parallel decomposition scheme. Our modifications and settings lead to improved support vector learning performance and thus allow using extensive parameter search methods to optimize classification accuracy.
Derivatives are ubiquitous in various areas of computational science, including sensitivity analysis and parameter optimization of computer models. Among the various methods for obtaining derivatives, automatic differentiation (AD) combines freedom from approximation errors, high performance, and the ability to handle arbitrarily complex codes arising from large-scale scientific investigations. In this note, we show how AD technology can aid in the ...
Australasian Conference on Knowledge Discovery and Data Mining, 2006
The support vector machine (SVM) is a well-established and accurate supervised learning method for the classification of data in various application fields. The statistical learning task, the so-called training, can be formulated as a quadratic optimization problem. During the last years the decomposition algorithm for solving this optimization problem became the most frequently used ...
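For reference, a standard form of the quadratic optimization problem mentioned above (the soft-margin SVM dual, in the usual notation; not quoted from the paper) is:

```latex
\[
\max_{\alpha \in \mathbb{R}^n} \;\;
  \sum_{i=1}^{n} \alpha_i
  \;-\; \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n}
        \alpha_i \alpha_j\, y_i y_j\, k(x_i, x_j)
\quad \text{subject to} \quad
  0 \le \alpha_i \le C, \qquad \sum_{i=1}^{n} \alpha_i y_i = 0,
\]
% where (x_i, y_i) are the training examples with labels y_i in {-1, +1},
% k(.,.) is the kernel, and C is the regularization parameter.  A decomposition
% algorithm optimizes over a small "working set" of the alpha_i at a time,
% keeping the remaining variables fixed.
```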
Proceedings. Second IEEE International Workshop on Source Code Analysis and Manipulation, 2002
... by x = A\b. Furthermore, all operators exist in a "pointwise" version; for instance, the pointwise vector multiplication u = v.*w produces a vector u whose components are equal to the product of the corresponding components in v and w. Note that the MATLAB grammar is context ...
Derivatives of almost arbitrary functions can be evaluated efficiently by automatic differentiation whenever the functions are given in the form of computer programs in a high-level programming language such as Fortran, C, or C++. In contrast to numerical differentiation, where derivatives are only approximated, automatic differentiation generates derivatives that are accurate up to machine precision. Sophisticated software tools implementing the technology of automatic differentiation are capable of automatically generating code for the product of the Jacobian matrix and a so-called seed matrix. It is shown how these tools can benefit from concepts of shared memory programming to parallelize, in a completely mechanical fashion, the gradient operations associated with each statement of the given code. The feasibility of our approach is demonstrated by numerical experiments. They were performed with a code that was generated automatically by the Adifor system and augmented with OpenMP ...
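A sketch of the statement-level view taken here: each assignment of the original code is followed by a "gradient operation" that updates a vector of p directional derivatives, one per column of the seed matrix, and these length-p vector operations are what can be distributed across threads. NumPy stands in for the Fortran/OpenMP code generated by Adifor; the variable names are illustrative only.

```python
import numpy as np

p = 4                                   # number of seed directions (columns of S)
x1, x2 = 1.5, -0.5
g_x1 = np.array([1.0, 0.0, 0.0, 0.0])   # row of the seed matrix S belonging to x1
g_x2 = np.array([0.0, 1.0, 0.0, 0.0])   # row of the seed matrix S belonging to x2

# original statement:   t = x1 * x2
# gradient operation:   g_t = x2 * g_x1 + x1 * g_x2   (length-p vector op)
t   = x1 * x2
g_t = x2 * g_x1 + x1 * g_x2

# original statement:   y = sin(t)
# gradient operation:   g_y = cos(t) * g_t
y   = np.sin(t)
g_y = np.cos(t) * g_t

print(y, g_y)    # derivatives of y w.r.t. x1 and x2 sit in the first two slots
```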
Existence or fixed point theorems, combined with interval analytic methods, provide a means to computationally prove the existence of a zero of a nonlinear system in a given interval vector. One such test is based on Borsuk's existence theorem. We discuss preconditioning techniques that are aimed at improving the effectiveness of this test.
The relatively robust representations (RRR) algorithm computes the eigendecomposition of a symmetric tridiagonal matrix T with O(n²) complexity. This article discusses how this method can be extended to the bidiagonal SVD B = UΣVᵀ. It turns out that using the RRR algorithm as a black box to compute BᵀB = VΣ²Vᵀ and BBᵀ = UΣ²Uᵀ separately may give poor results for ∥BV − UΣ∥. The use of the standard Jordan-Wielandt representation can fail as well if clusters of tiny singular values are present. A solution is to work on BᵀB and to keep factorizations of BBᵀ implicitly. We introduce a set of coupling transformations which allow us to replace the representation by a more stable one involving a diagonal matrix. Numerical results of our implementation are compared with the LAPACK routines DSTEGR, DBDSQR and DBDSDC.
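For reference, the Jordan-Wielandt (Golub-Kahan) embedding referred to above couples the two factors of the SVD; the formulation below is the standard textbook one and is not quoted from the paper.

```latex
% For the n x n bidiagonal B with SVD B = U \Sigma V^T, the symmetric matrix
\[
T_{\mathrm{GK}} \;=\;
\begin{pmatrix} 0 & B \\ B^{\mathsf T} & 0 \end{pmatrix},
\qquad
T_{\mathrm{GK}}
\begin{pmatrix} u_i \\ \pm v_i \end{pmatrix}
\;=\; \pm\,\sigma_i
\begin{pmatrix} u_i \\ \pm v_i \end{pmatrix},
\]
% so every singular value sigma_i of B yields the eigenvalue pair +/- sigma_i
% of T_GK.  Clusters of tiny singular values therefore become tight eigenvalue
% clusters around zero, which is the failure mode mentioned in the abstract.
```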
Many different heuristics have been proposed for selecting the subdivision direction in branch-and-bound result-verifying nonlinear solvers. We investigate the impact of the box-splitting techniques on the overall performance of the solver and propose a new approach combining some of the simple heuristics in a hybrid way. Numerical experiments with medium-sized example problems indicate that our approach is successful.
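Two of the classical direction-selection heuristics can be sketched as follows (maximum width and maximum smear); the hybrid strategy proposed in the paper is not reproduced here, and the derivative bounds are assumed to come from interval evaluation of the Jacobian.

```python
import numpy as np

def split_direction_max_width(box):
    """box[i] = [lo_i, hi_i]; split the currently widest coordinate interval."""
    width = box[:, 1] - box[:, 0]
    return int(np.argmax(width))

def split_direction_max_smear(box, dfdx_bound):
    """Split where (derivative bound) * (interval width) is largest."""
    width = box[:, 1] - box[:, 0]
    smear = (dfdx_bound * width).max(axis=0)   # worst component function per variable
    return int(np.argmax(smear))

box = np.array([[-1.0, 1.0], [0.0, 0.5], [2.0, 5.0]])
dfdx_bound = np.array([[0.1, 8.0, 0.2],            # bounds on |df_1/dx_i| over the box
                       [0.3, 1.0, 0.4]])           # bounds on |df_2/dx_i| over the box
print(split_direction_max_width(box))              # 2: widest interval
print(split_direction_max_smear(box, dfdx_bound))  # 1: large derivative * width
```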
Derivative information is required in numerous applications, including sensitivity analysis and numerical optimization. For simple functions, symbolic differentiation, done either manually or with a computer algebra system, can provide the derivatives, whereas divided differences (DD) have traditionally been used for functions defined by (potentially very complex) computer programs, even if only approximate values can be obtained this way. An alternative approach for such functions is automatic differentiation (AD), yielding exact derivatives at often lower cost than DD, and without restrictions on the program complexity. In this paper we compare the functionality and describe the use of ADMIT/ADMAT and ADiMat. These two AD tools provide derivatives for programs written in the MATLAB language, which is widely used for prototype and production software in scientific and engineering applications. While ADMIT/ADMAT implements a pure operator overloading approach to AD, ADiMat also employs source transformation techniques.
Proceedings of IEEE Scalable High Performance Computing Conference, 1994
We present a two-step variant of the "successive band reduction" paradigm for the tridiagonalization of symmetric matrices. Here we reduce a full matrix first to narrow-banded form and then to tridiagonal form. The first step allows easy exploitation of block orthogonal transformations. In the second step, we employ a new blocked version of a banded matrix tridiagonalization algorithm by Lang. In particular, we are able to express the update of the orthogonal transformation matrix in terms of block transformations. This expression leads to an algorithm that is almost entirely based on BLAS-3 kernels and has greatly improved data movement and communication characteristics. We also present some performance results on the Intel Touchstone DELTA and the IBM SP1.
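The "block orthogonal transformations" mentioned above are typically realized with the compact WY representation; a generic statement of that standard device (not quoted from the paper) is:

```latex
% A product of k Householder reflectors can be aggregated as
\[
Q \;=\; H_1 H_2 \cdots H_k \;=\; I \;-\; V\, T\, V^{\mathsf T},
\]
% with V in R^{n x k} holding the Householder vectors and T in R^{k x k}
% upper triangular.  Applying the block transformation to a matrix A,
\[
Q^{\mathsf T} A \;=\; A \;-\; V\, T^{\mathsf T}\, (V^{\mathsf T} A),
\]
% costs a few matrix-matrix (BLAS-3) products instead of k separate rank-1
% updates, which is the source of the improved data movement.
```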