OpenMP parallelization of multiple precision Taylor series method

arXiv:1908.09301v1 [cs.MS] 25 Aug 2019

S. Dimova(1), I. Hristov(1,a), R. Hristova(1), I. Puzynin(2), T. Puzynina(2), Z. Sharipov(2,b), N. Shegunov(1), Z. Tukhliev(2)

(1) Sofia University, Faculty of Mathematics and Informatics, Bulgaria
(2) JINR, Laboratory of Information Technologies, Dubna, Russia

Corresponding e-mails: (a) ivanh@fmi.uni-sofia.bg, (b) zarif@jinr.ru

Abstract

An OpenMP parallelization of the multiple precision Taylor series method is proposed. Very good parallel performance scalability and parallel efficiency inside one computational node of a CPU cluster are observed. We explain the details of the parallelization on the classical example of the Lorenz equations. The same approach can be applied straightforwardly to a large class of chaotic dynamical systems.

1 Introduction

Achieving a mathematically reliable long-term solution of a chaotic dynamical system is a difficult task due to the sensitive dependence on the initial conditions. Without doubt, a numerical procedure for obtaining such solutions is of great importance, because it gives us a powerful tool for theoretical investigations. A breakthrough in this direction can be found in the paper [1] of Shijun Liao. The author considers a new numerical procedure called "clean numerical simulation", based on the multiple precision Taylor series method [2], [3]. The numerical procedure in [1] works as follows. A new concept, the critical predictable time Tc, is introduced. Practical estimates of the needed number of accurate decimal digits K and the needed order N of the Taylor method are obtained by constructing Tc-K and Tc-N diagrams. Using the estimated K and N for a given time interval, the solution is computed. The obtained solution is then verified by a new computation with larger K and N over the same interval. If the two solutions coincide over the whole interval, the solution is considered mathematically reliable.

To apply the "clean numerical simulation", we have to be precise in many directions. First, we have to use a multiple precision library. To be efficient, we need a method of high order of accuracy, such as the Taylor series method: low order methods require very small steps, and as a result the solution cannot be computed in a foreseeable time. In addition, if we want a solution in the case of extremely large intervals, we need serious computational resources and a parallelization of the algorithm. The first parallelization of Liao's method is reported in [4] and later improved in [5]. A mathematically reliable solution of the Lorenz system on a time interval with a record length, namely [0, 10000], obtained with 1200 CPU cores in about 9 days and 5 hours, is given in [6]. It is explained in [4], [5] that a parallel reduction of the sums which appear when the Taylor coefficients are calculated has to be done. This is the crucial observation. However, no details of the parallel version of the algorithm are given. Most likely the authors used pure MPI (Message Passing Interface), which is a distributed memory programming model. Our goal is not to compete with the impressive simulation in [6], which uses a rather large computational resource. Our goal is to present in more detail a simple and effective OpenMP parallelization of the multiple precision Taylor series method, which uses a moderate computational resource, namely one CPU node.
For benchmarks, we use the results in [5]. Our model problem is the classical Lorenz system. However, the proposed approach is rather general, and it can be applied to a large class of chaotic dynamical systems.

2 Taylor series method for the Lorenz system

We consider as a model problem the classical Lorenz system [7]:

    dx/dt = -σx + σy,
    dy/dt = Rx - y - xz,        (1)
    dz/dt = xy - bz,

where R = 28, σ = 10, b = 8/3 are the standard Saltzman parameter values. For these parameters the system is chaotic. The N-th order Taylor series method [2], [3] for (1) with step size τ is

    x_{n+1} = x_n + Σ_{i=1}^{N} α_i τ^i,
    y_{n+1} = y_n + Σ_{i=1}^{N} β_i τ^i,        (2)
    z_{n+1} = z_n + Σ_{i=1}^{N} γ_i τ^i,

where

    α_i = (1/i!) d^i x(t_n)/dt^i,
    β_i = (1/i!) d^i y(t_n)/dt^i,
    γ_i = (1/i!) d^i z(t_n)/dt^i

are the i-th Taylor coefficients (the so-called normalized derivatives). They are calculated as follows. From equation (1) we have

    α_1 = -σα_0 + σβ_0,
    β_1 = Rα_0 - β_0 - α_0 γ_0,
    γ_1 = α_0 β_0 - b γ_0,

where α_0 = x_n, β_0 = y_n, γ_0 = z_n. By applying the Leibniz rule for the derivatives of a product of two functions, we obtain the following recursive procedure for calculating α_i, β_i, γ_i for i = 0, ..., N-1:

    α_{i+1} = (1/(i+1)) (-σα_i + σβ_i),
    β_{i+1} = (1/(i+1)) (Rα_i - β_i - Σ_{k=0}^{i} α_{i-k} γ_k),        (3)
    γ_{i+1} = (1/(i+1)) (Σ_{k=0}^{i} α_{i-k} β_k - b γ_i).

Note that we do not need any analytical expressions for the derivatives of x(t), y(t), z(t) (and we do not have any); we only need the values of the derivatives at the point t_n. Let us store the Taylor coefficients in the arrays x, y, z of length N+1: the values of α_i are stored in x[i], those of β_i in y[i], and those of γ_i in z[i]. The pseudocode for (3) is then given in Fig. 1.

    for (i = 0; i < N; i++) {
        s1 = 0.0;
        s2 = 0.0;
        for (k = 0; k <= i; k++) {
            s1 = s1 + x[i-k]*z[k];
            s2 = s2 + x[i-k]*y[k];
        }
        div = 1.0/(i+1);
        x[i+1] = Sigma*(y[i]-x[i])*div;
        y[i+1] = (R*x[i]-y[i]-s1)*div;
        z[i+1] = (s2-b*z[i])*div;
    }

Fig. 1 Pseudocode of the recursive procedure (3).

Clearly, O(N²) floating point operations are needed to calculate the coefficients. The subsequent evaluation of the Taylor series with Horner's rule needs only O(N) operations and is hence negligible. To calculate the (i+1)-th Taylor coefficient we need all previous coefficients from 0 to i. What we can do in parallel is the calculation of the sums s1 and s2, i.e. we can parallelize the loop over the index k (the inner loop in Fig. 1). This is a classical parallel reduction.

The algorithm explained above for calculating the Taylor coefficients is called automatic differentiation, or sometimes algorithmic differentiation (see [8]). Generally speaking, automatic differentiation is a recursive procedure for computing the derivatives of certain functions at a given point (without evaluating analytical formulas for the derivatives). These functions are those that can be obtained by sums, products, quotients, and compositions of some elementary functions. It is important that in all cases of dynamical systems whose right-hand side is automatically differentiable, sums like those in (3) appear. Thus, the approach for parallelization of our model problem can be applied straightforwardly to a large class of dynamical systems. A complete double precision realization of one integration step is sketched below.
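To make the procedure concrete, here is a self-contained double precision sketch of one step of the method: the recursion (3) followed by a Horner evaluation of the series (2). This toy program is ours, not the authors' code; the function name taylor_step and the toy order N = 20 are illustrative choices. A real "clean numerical simulation" replaces every double with a multiple precision number, which is the subject of the next sections.

    #include <stdio.h>

    #define N 20    /* order of the Taylor method (toy value) */

    /* One step of the N-th order Taylor method for the Lorenz system (1),
       double precision sketch. x0, y0, z0 hold the current state and are
       advanced by the step size tau. */
    void taylor_step(double *x0, double *y0, double *z0, double tau)
    {
        double x[N+1], y[N+1], z[N+1];
        int i, k;

        x[0] = *x0; y[0] = *y0; z[0] = *z0;

        /* Recursion (3): Taylor coefficients via automatic differentiation. */
        for (i = 0; i < N; i++) {
            double s1 = 0.0, s2 = 0.0, div = 1.0/(i+1);
            for (k = 0; k <= i; k++) {
                s1 += x[i-k]*z[k];    /* convolution sum for the x*z term */
                s2 += x[i-k]*y[k];    /* convolution sum for the x*y term */
            }
            x[i+1] = 10.0*(y[i]-x[i])*div;       /* sigma = 10 */
            y[i+1] = (28.0*x[i]-y[i]-s1)*div;    /* R = 28     */
            z[i+1] = (s2-(8.0/3.0)*z[i])*div;    /* b = 8/3    */
        }

        /* Evaluation of the series (2) with Horner's rule: O(N) operations. */
        double sx = x[N], sy = y[N], sz = z[N];
        for (i = N-1; i >= 0; i--) {
            sx = sx*tau + x[i];
            sy = sy*tau + y[i];
            sz = sz*tau + z[i];
        }
        *x0 = sx; *y0 = sy; *z0 = sz;
    }

    int main(void)
    {
        /* Initial conditions and step size as in the benchmark of Section 5. */
        double x = -15.8, y = -17.48, z = 35.64, tau = 0.01;
        for (int n = 0; n < 100; n++)    /* integrate up to t = 1 */
            taylor_step(&x, &y, &z, tau);
        printf("t = 1: x = %.10f, y = %.10f, z = %.10f\n", x, y, z);
        return 0;
    }

In the multiple precision setting, every double above becomes a GMP number and every arithmetic operation a GMP call, which is exactly why the O(N²) inner convolution loop dominates the cost and is the target of the parallelization.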
3 OpenMP parallel technology

Modern supercomputer clusters have nodes consisting of multicore processors. Our goal is effective and moderate (using only one node) parallelism. Parallel computing on one node performs efficiently when a shared memory, multithreaded environment such as OpenMP is used [9]. OpenMP is an application programming interface (API) consisting of a set of compiler directives (pragmas in C/C++) and a library of support functions. The standard view of parallelism in OpenMP is the fork-join model. When the program starts, only a single thread, called the master thread, is active. The master thread performs the sequential parts of the algorithm. At the points where parallel operations are required, the master thread forks (creates additional threads). The master and the additional threads work concurrently through the parallel region (some block of code). At the end of the parallel region the created threads are suspended and the flow of control returns to the single master thread; this is the join point. A nice feature of OpenMP is that it supports incremental parallelism, i.e. we transform the sequential program into a parallel one step by step, one block of code at a time. If the main part of the computational work is done in a single block, as in our case, adding one or two lines (pragmas) to the code, plus possibly some minor transformations, may be all we need for excellent speedup results.

4 OpenMP parallel reduction

We write our program in the C programming language and use the GMP library (the GNU Multiple Precision Arithmetic Library) [10]. An important feature of the GMP library is its thread safety [9]; otherwise we could not use it with OpenMP. The serial block of code that we have to parallelize, written in terms of GMP, is given in Fig. 2. We use a temporary variable temp to store the intermediate results of the multiplications before the additions to the sums s1 and s2.

    for (k = 0; k <= i; k++) {
        mpf_mul(temp, x[i-k], z[k]);
        mpf_add(s1, s1, temp);
        mpf_mul(temp, x[i-k], y[k]);
        mpf_add(s2, s2, temp);
    }

Fig. 2 The serial block of code that we have to parallelize, written in terms of GMP.

To parallelize this code, we have to perform a parallel reduction. OpenMP has a built-in reduction clause, but we cannot use it here, because we work with user-defined types for the multiple precision numbers and user-defined operations on them. Thus, we have to do the reduction manually. Because the multiple precision numbers are allocated on the heap, they cannot be privatized [9], so we make containers for the partial sums of every thread, and these containers are shared. We store the containers in an array sum of multiple precision numbers of length 2*Numt, where Numt is the number of threads: the thread with number tid stores its portion of s1 in sum[2*tid] and its portion of s2 in sum[2*tid+1]. Instead of a single temporary variable, we now have an array of temporary variables tempv of length Numt. The parallelized version of the code from Fig. 2 is given in Fig. 3. The directive #pragma omp parallel private(k,tid) creates a parallel region (the additional threads are woken up), and the integer variables k and tid are private in the parallel region (each thread has its own copy in its own stack). Every thread then obtains its ID and stores it in tid by calling the library function omp_get_thread_num(). The second directive, #pragma omp for, shares the work in the loop between the threads, i.e. every thread works on its own portion of the index range [0, ..., i]. This range is divided into Numt chunks as equal in size as possible. The closing bracket of the for loop ends the work-sharing section and acts as an implicit barrier synchronizing the threads. The second closing bracket ends the entire parallel region (the join point). After the parallel region, the master thread accumulates the containers of all threads into s1 and s2.
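The containers and the temporaries are ordinary GMP variables that must be allocated and initialized before the coefficient loop. The paper does not show this setup, so the following is a minimal sketch under our own assumptions: the helper names init_containers and reset_containers are ours, the precision prec is given in bits, and Numt is taken from omp_get_max_threads().

    /* Hypothetical setup for the shared reduction containers; the names
       and the precision handling are our assumptions, not the authors'
       code. */
    #include <gmp.h>
    #include <omp.h>
    #include <stdlib.h>

    mpf_t *sum, *tempv;
    int Numt;

    void init_containers(unsigned long prec)
    {
        int j;
        Numt = omp_get_max_threads();
        sum = malloc(2 * Numt * sizeof(mpf_t));
        tempv = malloc(Numt * sizeof(mpf_t));
        for (j = 0; j < 2 * Numt; j++)
            mpf_init2(sum[j], prec);    /* initialized to 0 by GMP */
        for (j = 0; j < Numt; j++)
            mpf_init2(tempv[j], prec);
    }

    /* Before each iteration of the outer loop over i, the partial sums
       must be reset to zero. */
    void reset_containers(void)
    {
        int j;
        for (j = 0; j < 2 * Numt; j++)
            mpf_set_ui(sum[j], 0);
    }

Resetting the containers at every outer iteration matters because the reduction in Fig. 3 accumulates into sum[2*tid] and sum[2*tid+1], and each new coefficient needs sums that start from zero.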
    #pragma omp parallel private(k,tid)
    {
        tid = omp_get_thread_num();
        #pragma omp for
        for (k = 0; k <= i; k++) {
            mpf_mul(tempv[tid], x[i-k], z[k]);
            mpf_add(sum[2*tid], sum[2*tid], tempv[tid]);
            mpf_mul(tempv[tid], x[i-k], y[k]);
            mpf_add(sum[2*tid+1], sum[2*tid+1], tempv[tid]);
        }
    }
    /* unloading the containers belonging to each thread */
    for (j = 0; j < Numt; j++) {
        mpf_add(s1, s1, sum[2*j]);
        mpf_add(s2, s2, sum[2*j+1]);
    }

Fig. 3 The parallelized block of the code.

5 Performance results

All computations are performed on the HybriLIT Heterogeneous Platform at the Laboratory of Information Technologies of JINR, Dubna, Russia [11], where the GMP library (version 6.1.2) is installed. We use one computational CPU node consisting of 2 x Intel(R) Xeon(R) E5-2695 v3 processors (28 cores, 2.3 GHz). As a benchmark we use the results for the Lorenz equations with initial conditions and step size taken from [5], namely x(0) = -15.8, y(0) = -17.48, z(0) = 35.64, τ = 0.01.

First, we compare our performance with that in Table 1 of [5]. The results in that table are for a problem with t ∈ [0, 1200], order of the Taylor series N = 400, and precision of K = 800 decimal digits. The number of CPU cores used in [5] is between 1 and 50. We did not expect high parallel efficiency on all 28 cores for this relatively small problem. Our serial time is ~9.0 hours, similar to the ~9.8 hours in [5]. However, our program shows better parallel efficiency on this small problem: we obtain about 44% parallel efficiency on 28 cores, compared with the same efficiency on 20 cores in [5]. Our time on 28 cores is ~0.73 hours, which is even better than the ~0.83 hours on 50 cores in [5].

To make a simulation on the rather large interval [0, 5000], we use order of the method N = 2000 and precision of 8000 bits (K ~ 2412 decimal digits), according to the Tc-K and Tc-N diagrams in [1]. We test the performance scalability and efficiency inside the CPU node with a short simulation test (5 integration steps). Here the speedup on p cores is the serial time divided by the time on p cores, and the parallel efficiency is the speedup divided by p; for example, on 28 cores the speedup is 174.05/8.28 ≈ 21.0 and the efficiency is 21.0/28 ≈ 75.1%. The results in Table 1 show very good performance scalability and an efficiency of about 75% on all 28 cores. In a foreseeable time (~9 days and 14 hours) on 28 cores we integrated the equations on the time interval [0, 5000] and reproduced the benchmark table S1 from paper [5].

Table 1 OpenMP parallelization of the 2000-th order Taylor method with precision ~2412 decimal digits on the example of the Lorenz equations. The computations are made on one CPU node consisting of 2 x Intel(R) Xeon(R) E5-2695 v3 processors (28 cores, 2.3 GHz). The measured time is for a short simulation test (5 integration steps).

    Number of cores   Time (seconds)   Speedup   Parallel efficiency (%)
    1 (serial)        174.05           1.00      100.0
    1 (parallel)      174.57           0.997      99.7
    4                  47.89           3.63       90.9
    8                  25.16           6.92       86.5
    14                 15.32           11.4       81.1
    28                  8.28           21.0       75.1

6 Conclusions

An OpenMP parallelization of the multiple precision Taylor series method for the Lorenz system is realized. Very good parallel performance scalability and parallel efficiency inside one computational node of a CPU cluster are observed. Our approach to parallelization can be applied in a simple way to a large class of chaotic dynamical systems in order to obtain long-term mathematically reliable solutions.
Acknowledgements

We thank the Laboratory of Information Technologies of JINR, Dubna, Russia for the opportunity to use the computational resources of the HybriLIT Heterogeneous Platform. The work is financially supported by a grant of the Plenipotentiary Representative of the Republic of Bulgaria at JINR.

References

1. Liao, Shijun. "On the reliability of computed chaotic solutions of non-linear differential equations." Tellus A: Dynamic Meteorology and Oceanography 61.4 (2008): 550-564.
2. Jorba, Angel, and Maorong Zou. "A software package for the numerical integration of ODEs by means of high-order Taylor methods." Experimental Mathematics 14.1 (2005): 99-117.
3. Barrio, Roberto, et al. "Breaking the limits: the Taylor series method." Applied Mathematics and Computation 217.20 (2011): 7940-7954.
4. Wang, Pengfei, Jianping Li, and Qian Li. "Computational uncertainty and the application of a high-performance multiple precision scheme to obtaining the correct reference solution of Lorenz equations." Numerical Algorithms 59.1 (2012): 147-159.
5. Wang, Pengfei, Yong Liu, and Jianping Li. "Clean numerical simulation for some chaotic systems using the parallel multiple-precision Taylor scheme." Chinese Science Bulletin 59.33 (2014): 4465-4472.
6. Liao, ShiJun, and PengFei Wang. "On the mathematically reliable long-term simulation of chaotic solutions of Lorenz equation in the interval [0, 10000]." Science China Physics, Mechanics and Astronomy 57.2 (2014): 330-335.
7. Lorenz, Edward N. "Deterministic nonperiodic flow." Journal of the Atmospheric Sciences 20.2 (1963): 130-141.
8. Moore, Ramon E. Methods and Applications of Interval Analysis. Society for Industrial and Applied Mathematics, 1979.
9. Chapman, Barbara, Gabriele Jost, and Ruud Van Der Pas. Using OpenMP: Portable Shared Memory Parallel Programming. Vol. 10. MIT Press, 2008.
10. https://gmplib.org/
11. Multipurpose Information and Computing Complex of the Laboratory of IT of JINR. Available at: http://hlit.jinr.ru (accessed 24.08.2019).