
GPU-based parallel algorithms for sparse nonlinear systems

Journal of Parallel and Distributed Computing

In this work we describe some parallel algorithms for solving nonlinear systems using CUDA (Compute Unified Device Architecture) on a GPU (Graphics Processing Unit). The proposed algorithms are based on both the Fletcher–Reeves version of the nonlinear conjugate gradient method and a polynomial-type preconditioner based on block two-stage methods. Several parallelization strategies and different storage formats for sparse matrices are discussed. The reported numerical experiments analyze the behavior of these algorithms working in a fine-grain parallel environment compared with a thread-based environment.

V. Galiano (a), H. Migallón (a), V. Migallón (b,*), J. Penadés (b)
(a) Department of Physics and Computer Architectures, University Miguel Hernández, E-03202, Elche, Alicante, Spain
(b) Department of Computer Science and Artificial Intelligence, University of Alicante, E-03071, Alicante, Spain

Article history: Received 1 March 2011; received in revised form 5 September 2011; accepted 28 October 2011.
Keywords: GPGPU; GPU libraries; Multicore architectures; Nonlinear conjugate gradient algorithms; Parallel preconditioners; Bratu problem.

This research has been supported by the Spanish Ministry of Science and Innovation under grant number TIN2011-26254.
* Corresponding author: V. Migallón. E-mail addresses: vgaliano@umh.es (V. Galiano), hmigallon@umh.es (H. Migallón), violeta@dccia.ua.es (V. Migallón), jpenades@dccia.ua.es (J. Penadés).
doi:10.1016/j.jpdc.2011.10.016. © 2011 Elsevier Inc. All rights reserved.

1. Introduction

Graphics Processing Units (GPUs) have become powerful many-core processors and a useful tool within the computational science community. Their massively parallel architecture offers high performance in many computing applications when the algorithms map well to the characteristics of the GPU. The conjugate gradient (CG) algorithm [10] is an efficient method for solving linear systems with symmetric positive definite matrices. The performance of this algorithm can be improved by considering the preconditioned conjugate gradient (PCG) method, whose efficiency and robustness have been proved for a wide range of applications. There are several works on this kind of linear algorithm for GPUs. In [16] a sparse matrix solver for the conjugate gradient method on GPUs is described, while in [5] an efficient implementation of a Jacobi preconditioned conjugate gradient algorithm on GPUs is developed. Recently, an SSOR preconditioned conjugate gradient algorithm on GPUs has been analyzed in [9]. The authors focus on the numerical solution of the generalized Poisson equation and observe that their parallel implementation is significantly better than the Jacobi preconditioner implemented in [5]. Moreover, compared to the CPU implementation of the CG algorithm, their GPU preconditioned conjugate gradient implementation is up to 10 times faster; other related works can be found, e.g., in [3,8,18]. Focused on the problem of solving nonlinear systems, there are some works that deal with parallel solvers on GPUs. At present, techniques for solving nonlinear systems on GPUs using wavelet analysis are being developed.
Moreover, parallel solvers for nonlinear systems in Bernstein form based on subdivision and the Newton–Raphson methods have been proposed; see, e.g., [6,27] and the references cited therein.

In this work we develop parallel algorithms for solving nonlinear systems based on the CG method. The CG method for linear systems can be extended to nonlinear systems following the Fletcher–Reeves version [7]. For this purpose, consider the problem of solving the mildly nonlinear system

Ax = Φ(x),    (1)

where A ∈ ℜ^{n×n} is a symmetric positive definite matrix and Φ : ℜ^n → ℜ^n is a nonlinear diagonal mapping, i.e., the ith component Φ_i of Φ is a function only of the ith component x_i of x. Let Ψ : ℜ^n → ℜ be a nonlinear mapping and consider ⟨x, y⟩ = x^T y the inner product in ℜ^n. The minimization problem of finding x ∈ ℜ^n such that J(x) = min_{y∈ℜ^n} J(y), where J(x) = (1/2)⟨Ax, x⟩ − Ψ(x), is equivalent to finding x ∈ ℜ^n such that F(x) = Ax − Φ(x) = 0, where Φ(x) = Ψ′(x). The Fletcher–Reeves version [7] of the nonlinear conjugate gradient method (NLCG) is an effective approach for solving the nonlinear system (1), by considering the connection with this minimization problem:

Algorithm 1 (Fletcher–Reeves Nonlinear Conjugate Gradient).
Given an initial vector x^(0)
r^(0) = Φ(x^(0)) − Ax^(0)
p^(0) = r^(0)
For i = 0, 1, . . . , until convergence
    α_i → see below
    x^(i+1) = x^(i) + α_i p^(i)
    r^(i+1) = r^(i) − Φ(x^(i)) + Φ(x^(i+1)) − α_i Ap^(i)
    Convergence test
    β_{i+1} = −⟨r^(i+1), r^(i+1)⟩ / ⟨r^(i), r^(i)⟩
    p^(i+1) = r^(i+1) − β_{i+1} p^(i)

Note that, in Algorithm 1, α_i is chosen to minimize the associated functional J in the direction p^(i). This is equivalent to solving the one-dimensional zero-point problem dJ(x^(i) + α_i p^(i))/dα_i = 0. Using the Newton method to solve this zero-point problem for α_i, one obtains α_i^(k+1) = α_i^(k) − δ^(k), where

δ^(k) = [ α_i^(k) ⟨Ap^(i), p^(i)⟩ − ⟨r^(i), p^(i)⟩ + ⟨Φ(x^(i)) − Φ(x^(i) + α_i^(k) p^(i)), p^(i)⟩ ] / [ ⟨Ap^(i), p^(i)⟩ − ⟨Φ′(x^(i) + α_i^(k) p^(i)) p^(i), p^(i)⟩ ].

Note that, in order to obtain δ^(k), the inner products ⟨Ap^(i), p^(i)⟩ and ⟨r^(i), p^(i)⟩ need to be computed only once, at the first Newton iteration. Moreover, Ap^(i) is available from the conjugate gradient iteration.

In order to generate efficient algorithms to solve the nonlinear system (1), in [17] we designed a parallel version of Algorithm 1 and a parallel nonlinear preconditioned conjugate gradient algorithm, based on both Algorithm 1 and a polynomial-type preconditioner based on block two-stage methods [4]. The proposed parallel algorithms were analyzed on multicore architectures using an OpenMP model [23]. In this paper we develop the implementation of these nonlinear algorithms on GPUs. The Fletcher–Reeves version of the nonlinear conjugate gradient algorithm and its preconditioned counterpart are based on a set of operations, the most important of which are the sparse matrix vector product (SpMV) and the inner (or dot) product, in addition to other vector computations. We discuss several strategies to compute these operations and we consider different storage formats for sparse matrices. Furthermore, in the preconditioned method, incomplete LU factorizations are used to obtain the inner splittings of the block two-stage method, so we analyze several strategies taking into account that there is no inherent fine-grain parallelism in the LU solver. The algorithms described here have been implemented on an Intel Core 2 Quad Q6600 and an NVIDIA GTX 280 GPU.
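To fix ideas, the Newton iteration for α_i can be written, in serial form, as the following C sketch. It is a reference version only, assuming host copies of the vectors and componentwise functions phi and dphi standing for Φ and Φ′; the function name, tolerance, and initial guess are illustrative and not taken from the paper's code.

#include <math.h>

/* Serial sketch of the Newton line search of Algorithm 1 (illustrative names).
   x, p, Ap, r are host copies of the current NLCG vectors; phi and dphi act
   componentwise as Phi and Phi'. */
double nlcg_alpha(const double *x, const double *p, const double *Ap,
                  const double *r, int n,
                  double (*phi)(double), double (*dphi)(double),
                  double alpha0, int max_newton, double tol)
{
    double pAp = 0.0, rp = 0.0;          /* <Ap,p> and <r,p>: computed only once */
    for (int i = 0; i < n; ++i) {
        pAp += Ap[i] * p[i];
        rp  += r[i]  * p[i];
    }
    double alpha = alpha0;
    for (int k = 0; k < max_newton; ++k) {
        double num_phi = 0.0, den_phi = 0.0;
        for (int i = 0; i < n; ++i) {
            double xap = x[i] + alpha * p[i];
            num_phi += (phi(x[i]) - phi(xap)) * p[i]; /* <Phi(x) - Phi(x + alpha p), p> */
            den_phi += dphi(xap) * p[i] * p[i];       /* <Phi'(x + alpha p) p, p>       */
        }
        double delta = (alpha * pAp - rp + num_phi) / (pAp - den_phi);
        alpha -= delta;                               /* alpha^(k+1) = alpha^(k) - delta^(k) */
        if (fabs(delta) < tol)
            break;
    }
    return alpha;
}

On the GPU, the inner products appearing above are precisely the operations that are grouped into single kernels in Section 5.2.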
In order to analyze the behavior of these algorithms we have considered a nonlinear elliptic partial differential equation known as the Bratu problem [2]. In Section 2 we describe the preconditioned conjugate gradient algorithm and the constructed preconditioners. Some concepts about the GPU architecture and its parallel programming are given in Section 3. In Section 4 we review the sparse matrix storage formats used in this work, and in Section 5 we explain how the basic operations have been implemented in order to perform the algorithms described in Sections 1 and 2. In Section 6 we display the numerical results obtained using CUDA on a GPU and compare them with those obtained on the shared memory platform using OpenMP. Furthermore, a mixed model (using OpenMP and CUDA) is considered in order to exploit the characteristics of both parallel systems. Finally, concluding remarks are presented in Section 7.

2. Nonlinear preconditioned conjugate gradient method

Preconditioning is a technique for improving the condition number (cond) of a matrix. Suppose that M is a symmetric positive definite matrix that approximates A, but is easier to invert. We can solve Ax = Φ(x) indirectly by solving M^{−1}Ax = M^{−1}Φ(x). If cond(M^{−1}A) ≪ cond(A), we can iteratively solve M^{−1}Ax = M^{−1}Φ(x) more quickly than the original problem. In this case we obtain the following nonlinear preconditioned conjugate gradient algorithm (NLPCG).

Algorithm 2 (Nonlinear Preconditioned Conjugate Gradient).
Given an initial vector x^(0)
r^(0) = Φ(x^(0)) − Ax^(0)
Solve Ms^(0) = r^(0)
p^(0) = s^(0)
For i = 0, 1, . . . , until convergence
    α_i → see Algorithm 1
    x^(i+1) = x^(i) + α_i p^(i)
    r^(i+1) = r^(i) − Φ(x^(i)) + Φ(x^(i+1)) − α_i Ap^(i)
    Solve Ms^(i+1) = r^(i+1)
    Convergence test
    β_{i+1} = −⟨s^(i+1), r^(i+1)⟩ / ⟨s^(i), r^(i)⟩
    p^(i+1) = r^(i+1) − β_{i+1} p^(i)

Since the auxiliary system Ms = r must be solved at each conjugate gradient iteration, this system needs to be easily solvable. Moreover, in order to obtain an effective preconditioner, M should be a good approximation to A. One of the general preconditioning techniques for solving linear systems is the use of truncated series preconditioning [1]. These preconditioners consist of considering a splitting of the matrix A as

A = P − Q    (2)

and performing m steps of the iterative procedure defined by this splitting toward the solution of As = r, choosing s^(0) = 0. It is well known that the solution of the auxiliary system Ms = r is effected by s = (I + R + R^2 + · · · + R^{m−1})P^{−1}r, where R = P^{−1}Q and the preconditioning matrix is M_m = P(I + R + R^2 + · · · + R^{m−1})^{−1}, cf. [1]. In order to obtain the preconditioners we use m steps of block two-stage methods toward the solution of As = r, choosing s^(0) = 0. More specifically, suppose that A is partitioned into p × p blocks, with square diagonal blocks of order n_j, with n_1 + · · · + n_p = n, such that system (1) can be written as

    [ A_{11} A_{12} · · · A_{1p} ] [ x_1 ]   [ Φ_1(x) ]
    [ A_{21} A_{22} · · · A_{2p} ] [ x_2 ]   [ Φ_2(x) ]
    [   ·      ·    · · ·   ·    ] [  ·  ] = [   ·    ]    (3)
    [ A_{p1} A_{p2} · · · A_{pp} ] [ x_p ]   [ Φ_p(x) ]

where x and Φ(x) are partitioned according to the size of the blocks of A. Let us consider the splitting (2) such that P consists of the diagonal blocks of A in (3), that is, P = diag(A_{11}, . . . , A_{pp}). Let us consider the splittings A_{jj} = B_j − C_j, 1 ≤ j ≤ p; then, to solve the auxiliary system Ms = r of Algorithm 2, the following algorithm is established.
Algorithm 3 (Block Two-Stage).
Given an initial vector s^(0) = ((s_1^(0))^T, (s_2^(0))^T, . . . , (s_p^(0))^T)^T and a sequence of numbers of inner iterations q(j), 1 ≤ j ≤ p
For l = 1, 2, . . . , until convergence
    For j = 1, 2, . . . , p
        y_j^(0) = s_j^(l−1)
        For k = 1 to q(j)
            B_j y_j^(k) = C_j y_j^(k−1) + (Qs^(l−1) + r)_j
    s^(l) = ((y_1^(q(1)))^T, (y_2^(q(2)))^T, . . . , (y_p^(q(p)))^T)^T

In order to analyze the NLPCG method, we have considered in Algorithm 3 an incomplete LU factorization of each matrix A_{jj}, j = 1, 2, . . . , p; that is, the splittings A_{jj} = B_j − C_j are constructed by setting B_j = L_j U_j and C_j = R_j, and at each lth step we perform, for each j, q(j) inner iterations of the iterative procedure defined by this splitting. Let us denote by ILU(S) the incomplete LU factorization associated with the zero pattern subset S of S_n = {(i, j) : i ≠ j, 1 ≤ i, j ≤ n}. In particular, when S = {(i, j) : a_{ij} = 0}, the incomplete factorization with zero fill-in, known as ILU(0), is obtained. To improve the quality of the factorization, many strategies for altering the pattern have been proposed. In the "level of fill-in" factorizations [12], ILU(κ), κ ≥ 0, a level of fill-in is recursively attributed to each fill-in position from the levels of its parents; then, the positions of level lower than κ are removed from S. In the experiments reported here we have used these ILU(κ) factorizations for the matrices A_{jj}, j = 1, 2, . . . , p, previously defined; see [17] for a more detailed explanation of the algorithms treated here and their OpenMP implementations.

3. GPU architecture — CUDA parallel programming

The GPU architecture is based on a set of multiprocessor units called streaming multiprocessors (SM), each one containing a set of processor cores called streaming processors (SP). CUDA is a heterogeneous computing model that involves both the CPU and the GPU. In the CUDA parallel programming model [19,22], an application consists of a sequential host program that may execute parallel programs, known as kernels, on a parallel device, i.e., on a GPU. Note that the CPU could be a multicore processor running an OpenMP program; in this case only one core can call a kernel, i.e., kernel calls must be serialized. A kernel is an SPMD (Single Program Multiple Data) computation that is executed using a potentially large number of parallel threads. Each thread runs the same scalar sequential program. The programmer organizes the threads of a kernel into a grid of thread blocks, and the threads of a given block can cooperate among themselves using barrier synchronization. The main memories available in GPUs are the global memory, which has the highest latency; the constant and texture memories, which are read-only memories; and the shared memory and registers, both of which are on-chip memories, the shared memory being owned by a block and the registers by a thread. The constant and texture memories have on-chip caches. In this work the global and shared memories have been used. Thread creation, scheduling, and management are performed entirely in hardware. For example, the GTX 280 GPU contains 30 multiprocessors and can work with a maximum of 30K concurrent threads. In order to manage this large number of threads efficiently, the GPU employs an SIMT (Single Instruction Multiple Thread) architecture [14,19] in which the threads of a block are executed in groups of 32 called warps. A warp executes a single instruction at a time across all its threads. The threads of a warp are free to follow their own execution path and may therefore diverge; however, it is substantially more efficient for threads to follow the same execution path for the bulk of the computation.
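As a minimal illustration of this execution model (not code from the paper), the following kernel scales a vector: the host launches it over a grid of thread blocks and each thread handles one element, with a guard because the grid may contain more threads than elements. All names and sizes are illustrative assumptions.

#include <cuda_runtime.h>

/* One thread per vector element; threads are grouped into blocks and the
   blocks into a grid. */
__global__ void scale_kernel(double *x, double a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  /* global thread index */
    if (i < n)
        x[i] *= a;
}

void scale_on_device(double *d_x, double a, int n)
{
    int threads = 256;                          /* threads per block, a multiple of the warp size (32) */
    int blocks  = (n + threads - 1) / threads;  /* enough blocks to cover all n elements */
    scale_kernel<<<blocks, threads>>>(d_x, a, n);
    cudaDeviceSynchronize();                    /* kernel launches are asynchronous on the host */
}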
4. Sparse matrix storage formats

There are many sparse matrix representations, each with different storage requirements, computational characteristics, and methods of accessing and manipulating matrix entries. In the context of the sparse matrix vector product on GPUs, we have only considered common sparse matrix storage formats suitable for obtaining good performance on the GPU. Concretely, we have used the Compressed Row Storage (CRS) format, the ELLPACK (or ITPACK) format [11], and the ELLPACK-R format proposed in [25]. A discussion of these storage formats for the sparse matrix vector product (SpMV) on GPUs can be found in [26,25].

The Compressed Row Storage (CRS) format is a popular, general-purpose sparse matrix representation. Assuming we have a nonsymmetric sparse matrix A, the CRS format stores the matrix in three vectors: one of floating point numbers and two of integers. The floating point vector stores the values of the nonzero elements of the matrix A in a row-wise manner. One of the integer vectors stores the column indexes of the elements in the values vector. The other integer vector stores the locations in the values vector that start a row.

ELLPACK [11] was introduced as a format to compress a sparse matrix with the purpose of solving large sparse linear systems on vector computers. Note that there are some similarities between a vector architecture and the GPU architecture. This format stores the sparse matrix in two arrays, one of floating point numbers, to store the nonzero elements, and one of integers, to store the column of every nonzero element. This format, as previously mentioned, is appropriate for computing operations with sparse matrices on vector architectures. However, since ELLPACK uses the same number of elements per row, padding with zero elements, if the percentage of zeros is high and the location of entries in different rows is very irregular, then the performance of the ELLPACK data structure decreases and its storage requirements increase; see [26,25].

ELLPACK-R is a variant of the ELLPACK format introduced in [25] with the purpose of optimizing the sparse matrix vector product on GPUs. ELLPACK-R consists of the two arrays of the original ELLPACK and, in addition, an integer array that stores the actual length of every row, regardless of the number of zero elements padded.

5. GPU kernels

In this section, the primary kernels used in the implementation of the NLCG and NLPCG algorithms are explained. The kernels presented here perform the mathematical operations used in the algorithms. Following Algorithms 1 and 2, the basic operations needed to implement the proposed methods are the sparse matrix vector product, vector operations (included in level 1 BLAS [13]), the inner product, and the LU solver (included in SPARSKIT [24], only for the NLPCG method; see Section 6). In the rest of this section, we describe different ways of computing these basic operations and we analyze them in the context of our work, that is, in the context of solving nonlinear systems using the NLCG and NLPCG methods. Some optimizations will be considered in Section 6 from a global perspective with respect to the algorithms. Some of these operations are also available as kernels in libraries such as CUBLAS [20] or CUSPARSE [21]; in Section 6 our kernels are compared with those of these libraries.
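Before describing the individual kernels, the following sketch fixes a concrete layout for the CRS and ELLPACK-R formats of Section 4 and shows a one-thread-per-row SpMV in the spirit of the ELLR kernel of [25] discussed in Section 5.1. The struct and field names are our own illustrative choices, not the paper's data structures; the ELLPACK-R arrays are assumed to be stored column-major so that the threads of a warp access consecutive memory locations.

#include <cuda_runtime.h>

typedef struct {        /* CRS: one value array and two integer arrays       */
    int     n;          /* number of rows                                    */
    double *val;        /* nonzero values, stored row by row                 */
    int    *col;        /* column index of each stored value                 */
    int    *row_ptr;    /* position where each row starts (n + 1 entries)    */
} CrsMatrix;

typedef struct {        /* ELLPACK-R: padded ELLPACK arrays plus row lengths */
    int     n;          /* number of rows                                    */
    int     max_len;    /* padded row width (maximum nonzeros per row)       */
    double *val;        /* n * max_len values, column-major                  */
    int    *col;        /* n * max_len column indices, column-major          */
    int    *rl;         /* actual number of nonzeros of every row            */
} EllrMatrix;

/* One thread per row; the rl array lets each thread stop at the real row
   length, so the padded zeros are never visited. */
__global__ void spmv_ellr(EllrMatrix A, const double *x, double *y)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < A.n) {
        double dot = 0.0;
        int len = A.rl[row];
        for (int k = 0; k < len; ++k) {
            int idx = k * A.n + row;        /* column-major: coalesced across a warp */
            dot += A.val[idx] * x[A.col[idx]];
        }
        y[row] = dot;
    }
}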
5.1. Sparse matrix vector product

In order to compute the sparse matrix vector product (SpMV), we consider the different sparse matrix storage formats described in Section 4. In particular, the kernel code that computes the SpMV using the CRS format is not optimized. There are two ways to optimize this code: the first one is to use a storage format that optimizes the subsequent computation, and the second one is to optimize the access to the GPU global memory by modifying the thread mapping. However, in the second case the optimizations based on the CRS format are aimed at matrices with more nonzero elements per row than the matrices used in our work. Note that in our test example the typical number of nonzero elements per row is 7 (see Section 6), and, for example, the optimizations presented in [15] need more than 32 nonzero elements per row. Consequently, in order to optimize the SpMV operation, we consider the use of the ELLPACK and ELLPACK-R storage formats following [25]. Note that in both formats the memory access pattern is improved. Moreover, regarding the mapping of threads onto the computation of every row, we also consider the SpMV implementations proposed in [26], based on the ELLPACK-R format. In this proposal, T threads compute the ith element of the matrix-vector product, accessing the ith row of the matrix. The implementation of the SpMV following [25] is referred to in the numerical experiments as ELLR, while the implementation of [26] is referred to as ELLR-T. On the other hand, in order to compute the SpMV we also test the use of CUSPARSE [21]. CUSPARSE is a new library of GPU-accelerated sparse matrix routines for sparse/sparse and dense/sparse operations. Currently, its sparse matrix vector product is only supported in CRS format.

5.2. Vector operations

As can be seen in Algorithms 1 and 2, the common vector operations, apart from the reduction processes, are the vector copy, the scalar vector product, the axpy operation, and the computation of the nonlinear function of the nonlinear system. In order to optimize these operations, we try to group several operations into a single kernel. For example, the inner products needed for the computation of α, as well as the inner products involved in the computation of β, are grouped into single kernels. On the other hand, CUBLAS [20] has been used to compute the basic operations with vectors. CUBLAS is an implementation of BLAS (Basic Linear Algebra Subprograms) using CUDA. Note that the use of CUBLAS does not allow grouping several operations into a single kernel; in our own code, however, we perform the axpy function and the vector copy function in the same kernel.
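A hedged sketch of this kind of grouping: a single kernel that performs an axpy and a vector copy in one pass, so the shared operand is read from global memory only once and one kernel launch is saved. The particular pairing and names are illustrative; the paper groups, for example, the inner products needed for α and for β in an analogous way.

/* y <- y + a*x and z <- x in a single pass (illustrative grouping). */
__global__ void axpy_copy_kernel(double *y, double *z, const double *x,
                                 double a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        double xi = x[i];   /* x is loaded once for both operations */
        y[i] += a * xi;     /* axpy        */
        z[i]  = xi;         /* vector copy */
    }
}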
5.3. Inner product

The inner (or dot) product is a special operation in CUDA because it implies a reduction process. It is necessary to use multiple thread blocks in order to process large vectors and keep all the streaming multiprocessors (SM) of the GPU busy. For this purpose each thread block performs the reduction of a portion of the vector, but CUDA has no global synchronization to communicate partial results between thread blocks. To perform inner products we follow the proposal of the NVIDIA CUDA C SDK, i.e., we compute VECTOR_N vectors of ELEMENT_N elements. ELEMENT_N must be a multiple of the warp size to meet the alignment constraints of memory coalescing. One block computes the reduction of one or more portions of the vector. In order to avoid synchronization processes, we work with a shared memory array of size ACCUM_N which acts as an accumulator. ACCUM_N must be a power of two and preferably a multiple of the warp size. Each thread computes an accumulator element by traversing the vectors with a stride equal to ACCUM_N. To finish the kernel we perform a tree-like reduction of the results stored in the accumulator array, where synchronization between threads is necessary (see Fig. 1). Note that, in our implementation, the CPU must complete the operation by combining the VECTOR_N partial results.

Fig. 1. Inner product according to the NVIDIA CUDA C SDK.

The reduction of the VECTOR_N partial results could be performed on the CPU or in the global memory of the GPU. However, we have tested both options and we have observed that using the global memory of the GPU for this reduction does not increase the performance. This is due to the fact that this reduction does not need much communication between CPU and GPU, whereas using the GPU we need to execute a kernel performing a reduction process of log2(VECTOR_N) steps. Note that the optimal value of the VECTOR_N parameter is 128, so we would have to use a kernel with a single block of 64 threads. We have tested this process working on both global memory and shared memory; in both cases we obtain slightly lower performance than when using the CPU. On the other hand, CUBLAS also provides the inner product, and taking into account that we work with the optimized version included in CUDA Toolkit 3.2 RC, we do not consider other optimizations.

5.4. LU solver

In the LU solver each computed element of the solution vector is needed to compute the next element; there is no inherent fine-grain parallelism. We have tried various strategies to compute the LU solver on the GPU, but efficiency has not been achieved. Remark that, to compute each element of the solution vector, this procedure performs two loops without dependencies. In the first strategy, we launch N blocks in one or more kernels, where N is the size of the system; each thread computes its partial contribution to both loops, and a reduction process completes them. In the second one, following the partitioning shown in (3), the computation of each block solution is assigned to one SM of the GPU; therefore we perform a partitioning based on the number of SMs of the GPU, 30 on the GTX 280. On the other hand, strategies that group independent rows into levels in order to iterate sequentially across the constructed levels, such as the one presented in [18], entail a sequential computation due to the sparsity pattern of the LU factors in our matrix problem. Hence, the parallelism involved in the preconditioner algorithm (Algorithm 3) has been exploited on the multicore CPU architecture. However, this parallelism becomes somewhat ineffective due to CPU–GPU communications. Due to the structure of (3), and assuming that p is the number of cores, the number of iterations of the NLPCG method on a multicore increases with the number of cores, and therefore the amount of communication between CPU and GPU increases. In Section 6 we will show results using CUDA and OpenMP on the multicore CPU. Note that in the NLCG method the communications between CPU and GPU are reduced to some scalars and the arrays needed to complete the reduction operations.
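Returning to the triangular solves themselves, the sequential dependence that keeps them on the CPU is visible in a plain CRS forward/backward substitution. The sketch below is a serial reference for the two solves L y = r and U s = y of an ILU factorization; it assumes, purely for illustration, that L stores only its strictly lower part (unit diagonal) and that U stores its diagonal, which is not necessarily the SPARSKIT layout used in the paper.

typedef struct {        /* same CRS layout as before, here used for a factor */
    int     n;
    double *val;
    int    *col;
    int    *row_ptr;
} CrsFactor;

/* Forward substitution L*y = r followed by backward substitution U*s = y.
   Each y[i] (resp. s[i]) needs previously computed entries, so the outer
   loops are inherently sequential; only the inner loops are dependency-free. */
void ilu_solve(const CrsFactor *L, const CrsFactor *U,
               const double *r, double *y, double *s)
{
    for (int i = 0; i < L->n; ++i) {                      /* L assumed unit lower triangular */
        double sum = r[i];
        for (int k = L->row_ptr[i]; k < L->row_ptr[i + 1]; ++k)
            sum -= L->val[k] * y[L->col[k]];              /* uses y[j], j < i */
        y[i] = sum;
    }
    for (int i = U->n - 1; i >= 0; --i) {                 /* U upper triangular with diagonal */
        double sum = y[i], diag = 1.0;
        for (int k = U->row_ptr[i]; k < U->row_ptr[i + 1]; ++k) {
            int j = U->col[k];
            if (j == i) diag = U->val[k];
            else        sum -= U->val[k] * s[j];          /* uses s[j], j > i */
        }
        s[i] = sum / diag;
    }
}

In Algorithm 3, one such solve with B_j = L_j U_j is performed at every inner iteration of each diagonal block.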
Fig. 2 summarizes the kernels used in each algorithm, showing the grouped operations involved according to the structure of Algorithms 1 and 2. We note that the same kernel is used in both algorithms for the same operation, except for the computation of β and p^(i+1).

Fig. 2. Kernel sequence for the NLCG and NLPCG algorithms.

6. Numerical experiments

In order to illustrate the behavior of the NLCG and NLPCG methods, we have run both algorithms on a multicore computer, an Intel Core 2 Quad Q6600 at 2.4 GHz with 4 GB of RAM and 8 MB of L2 cache, called SULLI. The operating system on SULLI is Ubuntu 9.04 (Jaunty Jackalope) for 64-bit systems. The GPU is an NVIDIA GeForce GTX 280 connected to SULLI. The CUDA kernels were compiled using the NVIDIA CUDA compiler (nvcc) from CUDA Toolkit 3.2 RC.

As our illustrative example we have considered a nonlinear elliptic partial differential equation known as the Bratu problem. In this problem, heat generation from a combustion process is balanced by heat transfer due to conduction. The three-dimensional model problem is given as ∇²u − λe^u = 0, where u is the temperature and λ is a constant known as the Frank-Kamenetskii parameter; see, e.g., [2]. There are two possible steady-state solutions to this problem for a given value of λ. One solution is close to u = 0 and is easy to obtain; a starting point near the other solution is needed to converge to it. For our model case, we consider a 3D cube domain Ω of unit length and λ = 6. To solve this problem using the finite difference method, we consider a grid in Ω of d³ nodes. This discretization yields a nonlinear system of the form Ax = Φ(x), where Φ : ℜ^n → ℜ^n is a nonlinear diagonal mapping, i.e., the ith component Φ_i of Φ is a function only of the ith component of x. The matrix A is a sparse matrix of order n = d³, and the typical number of nonzero elements per row of this matrix is seven, with fewer in the rows corresponding to boundary points of the physical domain.

The analyses performed are based on the run times and the number of floating point operations per second (Flops) measured on a GeForce GTX 280, compared with the parallel run times and Flops measured on SULLI using OpenMP. First we present results for the NLCG method for problems of various sizes. In Fig. 3(a) we show the speed-up using OpenMP and different numbers of cores in SULLI, and the speed-up when the NLCG method is computed on the NVIDIA GeForce GTX 280 GPU, managed from one core of SULLI. In addition, Fig. 3(b) compares this method in terms of GFlops on both architectures. We obtain a good speed-up with OpenMP using the available cores, but it is not comparable with the speed-up obtained with the GPU, which is greater than 36. These results confirm a good interaction between the NLCG algorithm and a GPU computing platform. Note that, on the GPU, the NLCG method is about 12–13 times faster than using four cores of the CPU, achieving about 7 GFlops.

In order to call a CUDA kernel, the number of threads of each block must be established taking into account that the maximum number of threads per block is 512. The number of blocks in each grid then depends on the number of required threads. For example, if each block has 512 threads, calling the axpy kernel for a system of size n = 373,248 requires a grid with 729 blocks. On the other hand, for calling the inner product kernel, the number of blocks and the number of threads in each block should be selected in order to obtain the best performance.
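As a worked example of this launch arithmetic (host-side C, with an illustrative kernel name): n = 373,248 = 72³ elements divided into blocks of 512 threads gives exactly 729 blocks.

#include <stdio.h>

int main(void)
{
    int n       = 373248;                       /* system size, n = d^3 with d = 72          */
    int threads = 512;                          /* maximum threads per block on the GTX 280  */
    int blocks  = (n + threads - 1) / threads;  /* ceil(n / threads) = 729                   */
    printf("axpy grid: %d blocks of %d threads\n", blocks, threads);
    /* the kernel would then be launched as: axpy_kernel<<<blocks, threads>>>(...); */
    return 0;
}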
We have analyzed the behavior of the NLCG algorithm with respect to the number of threads in each block and the influence of ACCUM_N, that is, the size of the shared memory arrays that act as accumulators. Although the influence of the number of threads in each block and of ACCUM_N is not critical, the best performance is obtained using 128 or 256 threads in each block and ACCUM_N = 128.

Fig. 3. NLCG method: (a) speed-up; (b) GFlops.
Fig. 4. NLCG method: (a) comparison of sparse matrix storage formats; (b) comparison of CUBLAS and/or CUSPARSE.

In Fig. 4(a) we analyze the performance of the different storage formats explained in Section 4. The sparse matrix storage format used determines the code that computes the SpMV operation (see Section 5.1). The best results are obtained using the ELLPACK-R format (ELLR and ELLR-T). The algorithms for computing the SpMV using ELLPACK-R do not include flow control instructions that serialize the execution of a warp of 32 threads, and they allow coalesced access to the matrix data. Note that ELLPACK-R is the sparse matrix storage format with the highest memory requirements compared with the other formats discussed. This is due to the use of a vector of integers that stores the number of nonzero elements of each row and avoids the flow control inside the iterative loop that computes each element. The results shown in Fig. 4(a) have been obtained using optimal values for all the parameters. In the case of the ELLR-T implementation, for our problem, the best performance has been achieved with T = 8 and a thread block size of 256 (see [26] for a detailed description of these parameters). In [25] it is confirmed that, for matrices with a high sparsity pattern having almost the same number of nonzero elements per row, the performance improvement obtained with ELLR for the SpMV is not significant compared with the use of the CRS and ELLPACK formats. In fact, for these three cases, the SpMV represents, in our test problems, about 50% of the total solving time, and the analysis of the performance (GFlops) of the SpMV kernels of the algorithm shows similar results for these storage formats (between 2.3 and 2.6 GFlops). However, the ELLR-T implementation outperforms the other approaches; in this case the SpMV represents about 30% of the total time, achieving between 5.6 and 5.8 GFlops.

To conclude the analysis of the NLCG method, we present in Fig. 4(b) results using the CUBLAS and CUSPARSE libraries included in CUDA Toolkit 3.2 RC. We analyze the use of CUBLAS and the use of CUSPARSE separately, and the use of both together. First, when we only use the CUBLAS library the results are worse than the results using only the CUDA API (with the CRS format). This is due to the optimizations performed by grouping operations, which cannot be performed using CUBLAS. In addition, reduction operations may be affected by the parameters. In contrast, the use of CUSPARSE leads to a slight improvement. In this case the SpMV kernels of the algorithm achieve between 2.5 and 3.1 GFlops. On the other hand, the axpy operations achieve more than 4 GFlops and the inner product operations between 8.5 and 11.5 GFlops. However, the best performance is obtained in the iterative procedure that computes α, achieving between 15.2 and 19.9 GFlops. Remark that the CUBLAS and CUSPARSE libraries hide from the user the setting of parameters, both for calling kernels and in the reduction operations.
In order to analyze the NLPCG method, we first present results showing the behavior of this method with respect to its characteristic parameters, which are the number of outer iterations, the number of inner iterations of the block two-stage method, and the level of fill-in of the incomplete LU factorizations. Note that, as mentioned in Section 5.4, the LU solver is computed on the multicore CPU, where p is the number of cores. Initially, we consider only one block; thus the method uses one core and the GPU. In Fig. 5 we present results varying the number of inner iterations q and the number of outer iterations m, for a system of size 373,248 and levels of fill-in equal to 0 and 1. Fig. 5 shows that, for the matrix of size 373,248, the best results are obtained using q = 1 and m = 1. More specifically, at each nonlinear iteration of the NLPCG algorithm, only one step (m = 1) of the truncated series preconditioner defined by Algorithm 3 is used, performing q = 1 inner iterations of the iterative procedure defined by the incomplete LU factorization (see Section 2). All the experiments performed have shown results analogous to those obtained on a multicore architecture, where the best results were obtained with small values of q and m (see [17]).

Fig. 5. NLPCG, 1 core + GPU, n = 373,248: (a) ILU(0); (b) ILU(1).

Fig. 6 presents results using 2 cores and the GPU to solve a system of size 884,736. The results correspond to a level of fill-in equal to 0. Since the LU solver is performed on the CPU, using more than one core can benefit the algorithm, as shown in Fig. 7. However, it is important to remark that when optimal values are used, more than one core should not be used, because the increase in the number of global iterations required does not allow an improvement. The conclusions about the optimal value of the number of threads in each block are analogous to those obtained for the NLCG method. On the other hand, the behavior of the NLPCG algorithm is not affected by the value of ACCUM_N.

Fig. 6. NLPCG, 2 cores + GPU, n = 884,736, ILU(0).
Fig. 7. NLPCG, n = 884,736, m = 2, ILU(1).

Fig. 8 shows that the use of CUBLAS in the NLPCG method slightly increases the computational times, as we have also seen in the NLCG method. Moreover, the NLPCG method is not affected by the use of CUSPARSE. This is due to the small number of iterations of the NLPCG method, which is nearly 10 times lower than the number of iterations needed by the NLCG method. Finally, from the analysis of Fig. 9 it can be deduced that the speed-up using the GPU is greater than the speed-up using four cores of the CPU. Moreover, compared to the CPU implementation of the NLCG algorithm, our GPU implementation of the NLPCG is up to 18–20 times faster. Nevertheless, the speed-up of the NLPCG method is not comparable with the one obtained by the NLCG method, which is above 36. This is due to the need to run the LU solver on the CPU and the consequent decrease of the work performed on the GPU; in addition, the communications between CPU and GPU are increased. Note that the percentage of the execution time required by the LU solver in the NLPCG algorithm (using one core of the CPU) is about 40%–50%. This percentage increases up to 60%–70% when CPU–GPU communications are also considered. Clearly, both methods present a very different inherent parallelism, and therefore their performance on the two different parallel architectures is also different. However, as can be expected, the best execution times of both methods are always obtained when the algorithms make use of the GPU.

Fig. 8. NLPCG, 1 core + GPU, m = 1, q = 1, ILU(0).
Fig. 9. Comparison of NLCG and NLPCG on CPU and GPU, n = 884,736: (a) total execution time; (b) speed-up per iteration.

7. Conclusions

We have developed the Fletcher–Reeves version of the nonlinear conjugate gradient method and we have applied a polynomial preconditioner based on block two-stage methods. We have analyzed both methods using a multicore OpenMP model, a GPGPU model, and a mixed model in order to exploit both parallel systems. We have analyzed the proposed algorithms in order to identify the main operations, and we have implemented some optimizations and tested some libraries in order to perform these operations optimally. The CUBLAS and CUSPARSE libraries offer good performance, and the sparse matrix format should be chosen according to the parallel architecture, ELLPACK-R, and specifically its ELLR-T implementation, being the most efficient format. We would like to point out that the use of the GPU improves the results obtained with any of the proposed methods. On the other hand, the NLCG method exploits the parallelism offered by the GPU better than the NLPCG method.

References

[1] L. Adams, M-step preconditioned conjugate gradient methods, SIAM J. Sci. Stat. Comput. 6 (1985) 452–462.
[2] B.M. Averick, R.G. Carter, J.J. Moré, G. Xue, The MINPACK-2 test problem collection, Technical Report MCS-P153-0692, Mathematics and Computer Science Division, Argonne, 1992.
[3] J. Bolz, I. Farmer, E. Grinspun, P. Schröder, Sparse matrix solvers on the GPU: conjugate gradients and multigrid, ACM Trans. Graph. 22 (3) (2003) 917–924.
[4] R. Bru, V. Migallón, J. Penadés, D.B. Szyld, Parallel, synchronous and asynchronous two-stage multisplitting methods, Electron. Trans. Numer. Anal. 3 (1995) 24–38.
[5] L. Buatois, G. Caumon, B. Lévy, Concurrent number cruncher: a GPU implementation of a general sparse linear solver, Int. J. Parallel Emergent Distrib. Syst. 24 (3) (2009) 205–223.
[6] L. Dechevsky, B. Bang, J. Gundersen, A. Lakså, A.R. Kristoffersen, Solving nonlinear systems of equations on graphic processing units, Lect. Notes Comput. Sci. 5910 (2010) 719–729.
[7] R. Fletcher, C. Reeves, Function minimization by conjugate gradients, Comput. J. 7 (1964) 149–154.
[8] G.A. Gravvanis, C.K. Filelis-Papadopoulos, K.M. Giannoutakis, Solving finite difference linear systems on GPUs: CUDA parallel explicit preconditioned biconjugate conjugate gradient type methods, J. Supercomput. (2011) doi:10.1007/s11227-011-0619-z.
[9] R. Helfenstein, J. Koko, Parallel preconditioned conjugate gradient algorithm on GPU, J. Comput. Appl. Math. (2011) doi:10.1016/j.cam.2011.04.025.
[10] M.R. Hestenes, E. Stiefel, Methods of conjugate gradients for solving linear systems, J. Res. Nat. Bur. Stand. 49 (1952) 409–436.
[11] D.R. Kincaid, D.M. Young, A brief review of the ITPACK project, J. Comput. Appl. Math. 24 (1–2) (1998) 121–127.
[12] H.P. Langtangen, Conjugate gradient methods and ILU preconditioning of nonsymmetric matrix systems with arbitrary sparsity patterns, Internat. J. Numer. Methods Fluids 9 (1989) 213–233.
[13] C.L. Lawson, R.J. Hanson, D. Kincaid, F.T. Krogh, Basic linear algebra subprograms for FORTRAN usage, ACM Trans. Math. Software 5 (1979) 308–323.
[14] E. Lindholm, J. Nickolls, S. Oberman, J. Montrym, NVIDIA Tesla: a unified graphics and computing architecture, IEEE Micro 28 (2) (2008) 39–55.
[15] M. Manikandan, R. Bordawekar, Optimizing sparse matrix-vector multiplication on GPUs, IBM Research Report RC24704, 2008.
[16] D. Michels, Sparse-matrix-CG-solver in CUDA, in: Proceedings of the 15th Central European Seminar on Computer Graphics, 2011.
[17] H. Migallón, V. Migallón, J. Penadés, Parallel nonlinear conjugate gradient algorithms on multicore architectures, in: J. Vigo (Ed.), Proceedings of the International Conference on Computational and Mathematical Methods in Science and Engineering, 2009, pp. 689–700.
[18] M. Naumov, Parallel solution of sparse triangular linear systems in the preconditioned iterative methods on the GPU, NVIDIA Technical Report NVR-2011-001, 2011, http://research.nvidia.com/sites/default/files/publications/nvr-2011-001.pdf.
[19] J. Nickolls, I. Buck, M. Garland, K. Skadron, Scalable parallel programming with CUDA, ACM Queue 6 (2) (2008) 40–53.
[20] NVIDIA Corporation, CUDA CUBLAS Library, Document PG-05326-032_V01, 2010, http://developer.download.nvidia.com/compute/cuda/3_2/toolkit/docs/CUBLAS_Library.pdf.
[21] NVIDIA Corporation, CUDA CUSPARSE Library, Document PG-05329-032_V01, 2010, http://developer.download.nvidia.com/compute/cuda/3_2/toolkit/docs/CUSPARSE_Library.pdf.
[22] NVIDIA Corporation, NVIDIA CUDA C Programming Guide, Version 3.2, 2010, http://developer.download.nvidia.com/compute/cuda/3_2/toolkit/docs/CUDA_C_Programming_Guide.pdf.
[23] OpenMP official site, 2008, http://openmp.org.
[24] Y. Saad, SPARSKIT: a basic tool kit for sparse matrix computation, http://www-users.cs.umn.edu/~saad/software/SPARSKIT/sparskit.html.
[25] F. Vázquez, J.J. Fernández, E.M. Garzón, A new approach for sparse matrix vector product on NVIDIA GPUs, Concurrency Computat.: Pract. Exp. 23 (2011) 815–826.
[26] F. Vázquez, G. Ortega, J.J. Fernández, E.M. Garzón, Improving the performance of the sparse matrix vector product with GPUs, in: 10th IEEE International Conference on Computer and Information Technology, CIT 2010, 2010, pp. 1146–1151.
[27] F. Wei, J. Feng, H. Lin, GPU-based parallel solver via Kantorovich theorem for the nonlinear Bernstein polynomial system, Comput. Math. Appl. 62 (6) (2011) 2506–2517.

V. Galiano is a Professor in the Department of Physics and Computer Architectures at the Miguel Hernández University in Elche (Spain). He received his engineering degree from the Polytechnic University of Valencia and the Ph.D. degree in Computer Science in 2007. His main research interests include parallel algorithms on heterogeneous platforms for solving linear and nonlinear systems, the design of high-level interfaces for parallel libraries used in meteorological and biological applications, and parallel simulations for modeling electronic systems.

H. Migallón is a Professor in the Department of Physics and Computer Architectures at the Miguel Hernández University in Elche (Spain). He received his degree in Physics and Electronic Engineering from the University of Valencia. He is a member of the "Architecture and Computer Technology" research group at the Miguel Hernández University and of the "High Performance Computing and Parallelism" research group at the University of Alicante. His main research interests include parallel high-level interfaces for heterogeneous platforms, parallel algorithms for solving linear and nonlinear systems, parallel algorithms for image processing, and parallel simulations for modeling electronic systems.
V. Migallón is a Full Professor in the Department of Computer Science and Artificial Intelligence and a research member of the University Institute for Computing Research, both at the University of Alicante (Spain). She received the Ph.D. degree in Computer Science in 1993. She is a member of the "High Performance Computing and Parallelism" research group at the University of Alicante, which she led from its foundation in 1991 until 2002. Her main research interests are scheduling techniques and parallel algorithms on heterogeneous platforms for solving linear and nonlinear systems, and the design of high-level interfaces that simplify programming in parallel environments.

J. Penadés is a Full Professor in the Department of Computer Science and Artificial Intelligence and a research member of the University Institute for Computing Research, both at the University of Alicante (Spain). He received the Ph.D. degree in Computer Science in 1993. His main research interests are scheduling techniques and parallel algorithms on heterogeneous platforms for solving linear and nonlinear systems, and the design of high-level interfaces that simplify programming in parallel environments. He currently leads the "High Performance Computing and Parallelism" research group in Alicante.