OpenMP parallelization of multiple precision Taylor series method

arXiv:1908.09301v1 [cs.MS] 25 Aug 2019

S. Dimova1, I. Hristov1,a, R. Hristova1, I. Puzynin2, T. Puzynina2, Z. Sharipov2,b, N. Shegunov1, Z. Tukhliev2

1 Sofia University, Faculty of Mathematics and Informatics, Bulgaria
2 JINR, Laboratory of Information Technologies, Dubna, Russia

Corresponding e-mails: a ivanh@fmi.uni-sofia.bg, b zarif@jinr.ru
Abstract. An OpenMP parallelization of the multiple precision Taylor series method is proposed. Very good parallel performance scalability and parallel efficiency are observed inside one computation node of a CPU cluster. We explain the details of the parallelization on the classical example of the Lorenz equations. The same approach can be applied straightforwardly to a large class of chaotic dynamical systems.
1 Introduction
Achieving a mathematically reliable long-term solution of a chaotic dynamical system is a difficult task due to the sensitive dependence on the initial conditions. Without doubt, having a numerical procedure for obtaining such solutions is of great importance, because it gives us a powerful tool for theoretical investigations. A breakthrough in this direction can be found in the paper [1] of Shijun Liao. The author considers a new numerical procedure called "clean numerical simulation", based on the multiple precision Taylor series method [2], [3].
The numerical procedure in [1] works as follows. A new concept, the critical predictable time Tc, is introduced. Practical estimates of the needed number of accurate decimal places K and the needed order N of the Taylor method are obtained by constructing Tc − K and Tc − N diagrams. Using the estimated K and N, the solution is calculated for the given time interval. The obtained solution is then verified by a new calculation with larger K and N over the same interval. If the two solutions coincide over the whole interval, the solution is considered mathematically reliable.
To apply the "clean numerical simulation", we have to be careful in several respects. First, we have to use a multiple precision library. In order to be efficient, we need a method of high order of accuracy, such as the Taylor series method: low order methods require very small step sizes to be used, and as a result we cannot compute the solution in a foreseeable time. In addition, if we want a solution in the case of extremely large intervals, we need a serious computational resource and a parallelization of the algorithm.
The first parallelization of Liao's method is reported in [4] and later improved in [5]. A mathematically reliable solution of the Lorenz system on a time interval of record length, namely [0,10000], obtained with 1200 CPU cores in about 9 days and 5 hours, is given in [6]. It is explained in [4], [5] that a parallel reduction of the sums that appear when we calculate the Taylor coefficients has to be done. This is of course the crucial observation. However, no details of the parallel version of the algorithm are given. Most likely the authors used pure MPI (Message Passing Interface), which is a distributed memory programming model.
Our goal is not to compare with the impressive simulation in [6], which uses a rather large computational resource. Our goal is to present in more detail a simple and effective OpenMP parallelization of the multiple precision Taylor series method that uses a moderate computational resource, namely one CPU node. For benchmarks, we use the results in [5]. Our model problem is the classical Lorenz system. However, the proposed approach is rather general, and it can be applied to a large class of chaotic dynamical systems.
2 Taylor series method for the Lorenz system
We consider as a model problem the classical Lorenz system [7]:
$$\frac{dx}{dt} = -\sigma x + \sigma y,$$
$$\frac{dy}{dt} = R x - y - xz, \qquad\qquad (1)$$
$$\frac{dz}{dt} = xy - bz,$$
where R = 28, σ = 10, b = 8/3 are the standard Saltzman parameter values. For these parameters the system is chaotic. The N-th order Taylor series method [2], [3] for (1) with step size τ is:
$$x_{n+1} = x_n + \sum_{i=1}^{N} \alpha_i \tau^i, \qquad
y_{n+1} = y_n + \sum_{i=1}^{N} \beta_i \tau^i, \qquad
z_{n+1} = z_n + \sum_{i=1}^{N} \gamma_i \tau^i, \qquad\qquad (2)$$

where
$$\alpha_i = \frac{1}{i!}\frac{d^i x(t_n)}{dt^i}, \qquad
\beta_i = \frac{1}{i!}\frac{d^i y(t_n)}{dt^i}, \qquad
\gamma_i = \frac{1}{i!}\frac{d^i z(t_n)}{dt^i}$$

are the i-th Taylor coefficients (the so-called normalized derivatives). They are calculated as follows. From equation (1) we have

$$\alpha_1 = -\sigma\alpha_0 + \sigma\beta_0, \qquad
\beta_1 = R\alpha_0 - \beta_0 - \alpha_0\gamma_0, \qquad
\gamma_1 = \alpha_0\beta_0 - b\gamma_0,$$

where $\alpha_0 = x_n$, $\beta_0 = y_n$, $\gamma_0 = z_n$.
By applying the Leibniz rule for the derivatives of a product of two functions, we obtain the following recursive procedure for calculating $\alpha_{i+1}$, $\beta_{i+1}$, $\gamma_{i+1}$ for $i = 0, \dots, N-1$:
$$\alpha_{i+1} = \frac{1}{i+1}\Bigl(-\sigma\alpha_i + \sigma\beta_i\Bigr),$$
$$\beta_{i+1} = \frac{1}{i+1}\Bigl(R\alpha_i - \beta_i - \sum_{k=0}^{i}\alpha_{i-k}\gamma_k\Bigr), \qquad\qquad (3)$$
$$\gamma_{i+1} = \frac{1}{i+1}\Bigl(\sum_{k=0}^{i}\alpha_{i-k}\beta_k - b\gamma_i\Bigr).$$
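The convolution sums in (3) come directly from the Leibniz rule. For the product xz, for instance, written in terms of the normalized derivatives,

$$\frac{1}{i!}\,\frac{d^{\,i}(xz)}{dt^{\,i}}(t_n) = \sum_{k=0}^{i}\frac{1}{(i-k)!}\,\frac{d^{\,i-k}x(t_n)}{dt^{\,i-k}}\cdot\frac{1}{k!}\,\frac{d^{\,k}z(t_n)}{dt^{\,k}} = \sum_{k=0}^{i}\alpha_{i-k}\,\gamma_k,$$

and analogously for the product xy with the coefficients $\beta_k$.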
Note that we do not need any analytical expressions for the derivatives of x(t), y(t), z(t), and we do not have any. We only need the values of the derivatives at the point tn. Let us store the Taylor coefficients in the arrays x, y, z of length N+1: the values of αi are stored in x[i], those of βi in y[i], and those of γi in z[i]. The pseudocode for (3) is then given in Fig. 1. Obviously, we need O(N²) floating point operations for calculating the coefficients. The subsequent evaluation of the Taylor series with Horner's rule needs only O(N) operations and hence is negligible. To calculate the (i+1)-th Taylor coefficient we need all previous coefficients from 0 to i. What we can do in parallel is the calculation of the sums s1 and s2, i.e. we can parallelize the loops over the index k (the inner loop in Fig. 1). This is a classical parallel reduction.
Actually, the algorithm explained above for calculating the Taylor coefficients is called automatic differentiation, or sometimes algorithmic differentiation (see [8]). Generally speaking, automatic differentiation is a recursive procedure for computing the derivatives of certain functions at a given point, without evaluating analytical formulas for the derivatives. These functions are those that can be obtained by sums, products, quotients, and compositions of elementary functions. It is important that in all cases of dynamical systems whose right-hand
side is automatically differentiable, sums like those in (3) are obtained. Thus, the
approach for parallelization of our model problem can be applied straightforwardly
to a large class of dynamical systems.
for (i = 0; i < N; i++)
{
    /* convolution sums for the products xz and xy in (3) */
    s1 = 0.0;
    s2 = 0.0;
    for (k = 0; k <= i; k++)
    {
        s1 = s1 + x[i-k]*z[k];
        s2 = s2 + x[i-k]*y[k];
    }
    /* next Taylor coefficients: alpha_{i+1} in x[i+1], beta_{i+1} in y[i+1], gamma_{i+1} in z[i+1] */
    div = 1.0/(i+1);
    x[i+1] = Sigma*(y[i]-x[i])*div;
    y[i+1] = (R*x[i]-y[i]-s1)*div;
    z[i+1] = (s2-b*z[i])*div;
}

Fig. 1 Pseudocode of the recursive procedure (3).
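To make the O(N) Horner evaluation concrete, the following sketch shows one integration step (2), written in plain double precision purely for readability (the actual program performs the same operations with GMP multiple precision numbers); the names xs, ys, zs and tau are illustrative:

/* the coefficients are already in x[0..N], y[0..N], z[0..N],
   computed by the loop in Fig. 1 */
double xs = x[N], ys = y[N], zs = z[N];
for (i = N - 1; i >= 0; i--)
{
    /* Horner's rule for the three Taylor polynomials in tau */
    xs = x[i] + tau*xs;
    ys = y[i] + tau*ys;
    zs = z[i] + tau*zs;
}
/* xs, ys, zs now hold x_{n+1}, y_{n+1}, z_{n+1}; they become the
   zeroth coefficients alpha_0, beta_0, gamma_0 for the next step */
x[0] = xs;  y[0] = ys;  z[0] = zs;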
3 OpenMP parallel technology
Modern supercomputer clusters have nodes consisting of multicore processors. Our goal is an effective and moderate parallelism, using only one node. Parallel computing on one node is performed efficiently when a shared memory, multithreaded environment such as OpenMP is used [9]. OpenMP is an application programming interface (API) consisting of a set of compiler directives (pragmas in C/C++) and a library of support functions.
The standard view of parallelism in OpenMP is the fork-join model. When the program starts, only a single thread, called the master thread, is active. The master thread performs the sequential parts of the algorithm. At those points where parallel operations are required, the master thread forks, i.e. creates additional threads. The master thread and the additional threads work concurrently through the parallel region (some block of code). At the end of the parallel region the created threads are suspended, and the flow of control returns to the single master thread; this is the join point.
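As a minimal, self-contained illustration of the fork-join model (this toy program is not part of our integrator):

#include <stdio.h>
#include <omp.h>

int main(void)
{
    printf("serial part: only the master thread is active\n");

    #pragma omp parallel   /* fork: additional threads are created here */
    {
        printf("parallel region: thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
    }                      /* join: control returns to the master thread */

    printf("serial part again: only the master thread remains\n");
    return 0;
}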
A nice feature of OpenMP is that it supports incremental parallelism, i.e. we can transform the sequential program into a parallel program step by step, one block of code at a time. If the main part of the computational work is done in only one block, as in our case, adding one or two lines (pragmas) to the code, together with some minor transformations, might be all we need for excellent speedup results.
for (k = 0; k <= i; k++)
{
    mpf_mul(temp, x[i-k], z[k]);   /* temp = x[i-k]*z[k] */
    mpf_add(s1, s1, temp);         /* s1 = s1 + temp     */
    mpf_mul(temp, x[i-k], y[k]);   /* temp = x[i-k]*y[k] */
    mpf_add(s2, s2, temp);         /* s2 = s2 + temp     */
}

Fig. 2 The serial block of code that we have to parallelize, written in terms of GMP.
4 OpenMP parallel reduction
We write our parallel program in the C programming language and use the GMP library (the GNU Multiple Precision Arithmetic Library) [10]. An important feature of the GMP library is its thread-safety [9]; otherwise we could not use it with OpenMP. The serial block of code that we have to parallelize, written in terms of GMP, is given in Fig. 2. We use a temporary variable temp to store the intermediate results of the multiplications before adding them to the sums s1 and s2.
To parallelize this code, we have to do a parallel reduction. OpenMP has a built-in reduction clause, but we cannot use it, because we use user-defined types for the multiple precision numbers and user-defined operations. Thus, we have to do the reduction manually. Because the multiple precision numbers are allocated on the heap, they cannot be privatized [9], so we have to make containers for the partial sums of every thread, and these containers will be shared. We store the containers in an array of multiple precision numbers sum with length 2*Numt, where Numt is the number of threads. The thread with number tid stores its portion of s1 in sum[2*tid] and its portion of s2 in sum[2*tid+1]. We also have an array of temporary variables tempv with length Numt.
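For completeness, one possible initialization of these shared containers is sketched below; the names sum, tempv and Numt follow the text, while prec_bits and the helper function init_containers are our illustrative assumptions, not necessarily the authors' actual code:

#include <stdlib.h>
#include <gmp.h>
#include <omp.h>

static mpf_t *sum, *tempv;
static int Numt;

void init_containers(long prec_bits)
{
    mpf_set_default_prec(prec_bits);   /* e.g. 8000 bits, see Section 5 */
    Numt  = omp_get_max_threads();
    sum   = malloc(2 * Numt * sizeof(mpf_t));
    tempv = malloc(Numt * sizeof(mpf_t));
    for (int j = 0; j < Numt; j++)
    {
        mpf_init(tempv[j]);            /* per-thread temporary          */
        mpf_init(sum[2*j]);            /* per-thread portion of s1      */
        mpf_init(sum[2*j+1]);          /* per-thread portion of s2      */
    }
}
/* a typical compile line: gcc -O2 -fopenmp program.c -lgmp */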
The parallelized version of the code from Fig. 2 is given in Fig. 3. The directive #pragma omp parallel private(k,tid) creates a parallel region (additional threads are awakened), and the integer variables k and tid are declared private in the parallel region (each thread has its own copy in its own stack). Then every thread gets its ID and stores it in tid by using the library function omp_get_thread_num(). The second directive #pragma omp for shares the work of the loop between the threads, i.e. every thread works on its own portion of the range of indices [0,...,i]. This range is divided into Numt chunks, as equal in size as possible. The first right bracket closes the for loop section and acts as an implicit barrier for synchronization of the threads. The second right bracket closes the entire parallel region (the join point). Outside the parallel region the master thread unloads all containers, belonging to each thread, into s1 and s2.
#pragma omp parallel private(k,tid)
{
    tid = omp_get_thread_num();
    #pragma omp for
    for (k = 0; k <= i; k++)
    {
        /* thread tid accumulates its portions of s1 and s2 */
        mpf_mul(tempv[tid], x[i-k], z[k]);
        mpf_add(sum[2*tid], sum[2*tid], tempv[tid]);
        mpf_mul(tempv[tid], x[i-k], y[k]);
        mpf_add(sum[2*tid+1], sum[2*tid+1], tempv[tid]);
    }
}
// unloading the containers belonging to each thread
for (j = 0; j < Numt; j++)
{
    mpf_add(s1, s1, sum[2*j]);
    mpf_add(s2, s2, sum[2*j+1]);
}

Fig. 3 The parallelized block of the code.
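One detail not shown in Fig. 3: just as s1 and s2 are set to zero at the beginning of every outer iteration in Fig. 1, the sums s1, s2 and all per-thread containers have to be cleared before the reduction for the next coefficient starts. A possible way to do this is:

/* reset the global sums and the per-thread containers to zero
   before computing the next Taylor coefficient */
mpf_set_ui(s1, 0);
mpf_set_ui(s2, 0);
for (j = 0; j < Numt; j++)
{
    mpf_set_ui(sum[2*j], 0);
    mpf_set_ui(sum[2*j+1], 0);
}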
5 Performance results
All computations are performed on the HybriLIT Heterogeneous Platform at the Laboratory of IT of JINR, Dubna, Russia [11], where the GMP library (version 6.1.2) is installed. We use one computational CPU node consisting of two Intel(R) Xeon(R) E5-2695 v3 processors (28 cores in total, 2.3 GHz).
As a benchmark we use the results for the Lorenz equations with initial conditions and step size taken from [5], namely x(0) = −15.8, y(0) = −17.48, z(0) = 35.64, τ = 0.01. First, we compare our performance with that in Table 1 in [5]. The results in this table are for a problem with t ∈ [0, 1200] that used order of the Taylor series N = 400 and precision of K = 800 decimal digits. The number of CPU cores used in [5] is between 1 and 50. We did not expect too much parallel efficiency on all 28 cores for this relatively small problem. Our serial time is ∼ 9.0 hours, which is similar to the ∼ 9.8 hours in [5]. However, our program shows better parallel efficiency on this small problem. We obtain about 44% parallel efficiency on 28 cores, compared with the same efficiency on 20 cores in [5]. Our time on 28 cores is ∼ 0.73 hours, which is even better than the time of ∼ 0.83 hours on 50 cores in [5].
To make a simulation in the rather large interval [0,5000], we use order of the method N = 2000 and precision of 8000 bits (K ∼ 2412 decimal digits), according to the Tc − K
and Tc − N diagrams in [1]. We test the performance scalability and efficiency inside the CPU node with a short simulation test (5 integration steps). The results in Table 1 show very good performance scalability and an efficiency of about 75% on all 28 cores. In a foreseeable time (∼ 9 days and 14 hours) on 28 cores we integrated the equations in the time interval [0,5000] and reproduced the benchmark table S1 from paper [5].
Table 1 OpenMP parallelization of the 2000th order Taylor method with precision ∼ 2412 decimal digits on the example of the Lorenz equations. The computations are made on one CPU node consisting of 2 x Intel(R) Xeon(R) E5-2695 v3 (28 cores, 2.3 GHz). The measured time is for a short simulation test (5 integration steps).
Number of cores    Time (seconds)    Speedup    Parallel efficiency (%)
1 (serial)             174.05          1.00           100.0
1 (parallel)           174.57          0.997           99.7
4                       47.89          3.63            90.9
8                       25.16          6.92            86.5
14                      15.32         11.4             81.1
28                       8.28         21.0             75.1
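For reference, the speedup $S_p$ and parallel efficiency $E_p$ in Table 1 are computed in the standard way from the serial time $T_1$ and the time $T_p$ on $p$ cores:

$$S_p = \frac{T_1}{T_p}, \qquad E_p = \frac{S_p}{p}\cdot 100\%,$$

so that, for example, on 28 cores $S_{28} = 174.05/8.28 \approx 21.0$ and $E_{28} \approx 21.0/28 \approx 75.1\%$.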
6 Conclusions
An OpenMP parallelization of the multiple precision Taylor series method for the Lorenz system has been realized. Very good parallel performance scalability and parallel efficiency are observed inside one computation node of a CPU cluster. Our approach for parallelization can easily be applied to a large class of chaotic dynamical systems in order to obtain long-term mathematically reliable solutions.
Acknowledgements We thank the Laboratory of Information Technologies of JINR, Dubna, Russia for the opportunity to use the computational resources of the HybriLIT Heterogeneous Platform.
The work is financially supported by a grant of the Plenipotentiary Representative of the Republic
of Bulgaria at the JINR.
References
1. Liao, Shijun. ”On the reliability of computed chaotic solutions of non-linear differential equations.” Tellus A: Dynamic Meteorology and Oceanography 61.4 (2008): 550-564
2. Jorba, Angel, and Maorong Zou. ”A software package for the numerical integration of ODEs
by means of high-order Taylor methods.” Experimental Mathematics 14.1 (2005): 99-117.
3. Barrio, Roberto, et al. ”Breaking the limits: the Taylor series method.” Applied mathematics
and computation 217.20 (2011): 7940-7954.
4. Wang, Pengfei, Jianping Li, and Qian Li. ”Computational uncertainty and the application of
a high-performance multiple precision scheme to obtaining the correct reference solution of
Lorenz equations.” Numerical Algorithms 59.1 (2012): 147-159.
5. Wang, Pengfei, Yong Liu, and Jianping Li. ”Clean numerical simulation for some chaotic
systems using the parallel multiple-precision Taylor scheme.” Chinese science bulletin 59.33
(2014): 4465-4472.
6. Liao, ShiJun, and PengFei Wang. ”On the mathematically reliable long-term simulation of
chaotic solutions of Lorenz equation in the interval [0, 10000].” Science China Physics, Mechanics and Astronomy 57.2 (2014): 330-335.
7. Lorenz, Edward N. ”Deterministic nonperiodic flow.” Journal of the atmospheric sciences
20.2 (1963): 130-141.
8. Moore, Ramon E. Methods and applications of interval analysis. Society for Industrial and
Applied Mathematics, 1979.
9. Chapman, Barbara, Gabriele Jost, and Ruud Van Der Pas. Using OpenMP: portable shared
memory parallel programming. Vol. 10. MIT press, 2008.
10. https://gmplib.org/
11. Multipurpose Information and Computing Complex of the Laboratory of IT of JINR. Available at: http://hlit.jinr.ru (accessed 24.08.2019)