Optimizing Large-Scale ODE Simulations
MARIO MULANSKY∗
where ṙ denotes the derivative with respect to t and we have omitted the explicit
time dependence r(t). The function f_i represents the local term and g_i the nearest-
neighbor coupling. Note that in this setup both the local and the coupling term can
be different for each site. In many cases, however, one faces homogeneous situations
where f and g are the same for all elements:

\dot{r}_i = f(r_i, t) + g(r_i, r_{i-1}, r_{i+1}, t).
Here, we will also assume such a homogeneous situation. Examples of such systems
include nonlinear oscillator chains [18, 4], nonlinear Klein-Gordon models [10], the
discrete nonlinear Schrödinger model [16] and classical spin chains [20]. Furthermore,
discretized partial differential equations are also sometimes solved with explicit schemes,
e.g. [27]. Nevertheless, the cache optimization technique presented below can be
generalized to heterogeneous cases, but will not work with implicit routines.
[Fig. 1: Memory layout and data dependency of the ODE function evaluation. The state is stored contiguously as x_1 y_1 z_1 x_2 y_2 z_2 … x_N y_N z_N, and each rhs evaluation k_i = f(r_i, t) + g(r_i, r_{i-1}, r_{i+1}, t) depends on site i and its two neighbors.]
Then, a new intermediate approximation is computed from the current, and possibly
previous, rhs evaluations:
(2.4)   r_i' = r_i + \Delta t \sum_{n=1}^{j} a_{j,n} k_i^n .
This intermediate approximation is then used in the next stage to compute another
rhs evaluation {k_i^{j+1}} and so on. Fig. 2 schematically shows the flow of the algorithm.
After s stages the final result is obtained using the previously computed evaluations
of the rhs:
(2.5)   \bar{r}_i = r_i + \Delta t \sum_{n=1}^{s} b_n k_i^n .
The number of stages s as well as the parameters a_{j,n}, b_n, and c_j are properties
of the Runge-Kutta scheme. The prominent Runge-Kutta-4 scheme, for example, has
s = 4 stages, i.e. four evaluations of the rhs function in Eq. (2.3) per time step.
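To make the scheme concrete, the following is a minimal sketch of a generic explicit Runge-Kutta step that implements Eqs. (2.4) and (2.5) directly (the function names and tableau layout are illustrative assumptions, not the implementation from [17]):

```cpp
// Minimal sketch of a generic explicit Runge-Kutta step implementing
// Eqs. (2.4) and (2.5) directly; function names and the tableau layout are
// illustrative assumptions, not the actual implementation from [17].
#include <cstddef>
#include <vector>

using state = std::vector<double>;

template <class Rhs>
void rk_step(Rhs rhs, state &r, double t, double dt,
             const std::vector<std::vector<double>> &a,  // a[j-1][n-1] = a_{j,n}
             const std::vector<double> &b,               // b[n-1] = b_n
             const std::vector<double> &c)               // c[j-1] = c_j
{
    const std::size_t s = b.size(), N = r.size();
    std::vector<state> k(s, state(N));  // the s rhs evaluations k^1 ... k^s
    state rp(N);                        // intermediate approximation r'
    for (std::size_t j = 0; j < s; ++j) {
        // Eq. (2.4): build r' from the previously computed rhs evaluations
        for (std::size_t i = 0; i < N; ++i) {
            rp[i] = r[i];
            for (std::size_t n = 0; n < j; ++n)
                rp[i] += dt * a[j - 1][n] * k[n][i];
        }
        rhs(rp, k[j], t + c[j] * dt);   // stage j+1: evaluate k^{j+1}
    }
    // Eq. (2.5): final result from all s rhs evaluations
    for (std::size_t i = 0; i < N; ++i)
        for (std::size_t n = 0; n < s; ++n)
            r[i] += dt * b[n] * k[n][i];
}
```

With the classic tableau a = {{1/2}, {0, 1/2}, {0, 0, 1}}, b = {1/6, 1/3, 1/3, 1/6}, c = {0, 1/2, 1/2, 1} this reduces to the Runge-Kutta-4 scheme; note that each of the s stages sweeps over the full state, which is exactly what makes large systems bandwidth-limited.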
[Figure residue (Figs. 2 and 3): schematic flow of the s-stage Runge-Kutta evaluations k^1 … k^s over the chain r_1 … r_N, computed blockwise with a given granularity; the overlap results near the block edges are marked as neglected.]
1 Here we assume a modern CPU supporting AVX, with 256-bit registers capable of holding four
values (double precision). Older CPUs might only support SSE with 128-bit registers, where a pack
size of two should be used.
Fig. 4: Memory layout of a chain of 3D systems organized in SIMD packs. Each pack
contains P = 4 values of the same type (x, y or z coordinate of a single system)
taken from distant points in the chain (see text). Note how the evaluation of the rhs
function for one value in a SIMD pack does not depend on other values of that SIMD
pack.
With a naive pack organization (packing adjacent sites into one register), the
computation of the rhs of the ODE in Eq. (2.3) would introduce coupling between
the entries of the pack. To avoid that, we divide the chain into P = 4
pieces of length N′ = N/P and represent each piece by one entry of the SIMD packs.
So let p = 1, p′ = N ′ + 1, p′′ = 2N ′ + 1 and p′′′ = 3N ′ + 1 be the starting indices of the
four parts of the chain. Then the first SIMD pack will contain {xp , xp′ , xp′′ , xp′′′ }, the
second one {yp , yp′ , yp′′ , yp′′′ } and so on. This memory layout is sketched in Fig. 4.
Compared to the original memory layout in Fig. 1, we replace the single values, e.g.
x1 , by four values from distant points of the chain (xp,p′ ,p′′ ,p′′′ ).
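The following sketch illustrates this reordering with raw AVX intrinsics, assuming N is divisible by four and C++17 semantics for aligned allocation; the function name is ours:

```cpp
// Sketch of reordering the scalar chain (layout of Fig. 1) into SIMD packs
// of P = 4 (layout of Fig. 4). Assumptions: AVX, N divisible by 4, and C++17
// so that std::vector respects the 32-byte alignment of __m256d. The name
// to_packs is ours, not from the paper's sources.
#include <cstddef>
#include <vector>
#include <immintrin.h>

// Pack j of coordinate q (q = 0,1,2 for x,y,z) gathers the q-coordinate of
// sites j, j+N', j+2N', j+3N' of the original chain (0-based indices).
std::vector<__m256d> to_packs(const std::vector<double> &chain, std::size_t N)
{
    const std::size_t Np = N / 4;  // four sub-chains of length N' = N/4
    std::vector<__m256d> packs(3 * Np);
    for (std::size_t j = 0; j < Np; ++j)
        for (std::size_t q = 0; q < 3; ++q)
            packs[3 * j + q] = _mm256_set_pd(   // arguments: high lane first
                chain[3 * (j + 3 * Np) + q],    // fourth piece of the chain
                chain[3 * (j + 2 * Np) + q],    // third piece
                chain[3 * (j + Np) + q],        // second piece
                chain[3 * j + q]);              // first piece (lowest lane)
    return packs;
}
```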
Using such a memory layout, it is now possible to compute four values at once
by virtue of SIMD. In principle, this could give a performance gain of a factor of four
in the best case. However, note that this relies on the fact that the dynamical equations
f + g are indeed independent of the chain index, as the above usage of SIMD registers
requires that the same operations are performed for each entry of the SIMD pack and
hence in the above description also for each element in the chain. Note also that at
the edges of the four parts of the chain there is coupling between different entries of
the SIMD pack. For example the second chain is stored in the second SIMD entries,
but the left neighbor of its first element is the last element of the first chain and hence
stored in the first entries of the SIMD packs. So the edges need special treatment and
their calculation cannot be carried out using the SIMD vectorization. But again, for
long chains this has a negligible performance impact. Finally, the cache optimization
described above can be applied to this memory layout as well. It works exactly the
same way, except that one now deals with SIMD packs instead of scalar values, with
a reduced chain length of N′ = N/P.
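As a small illustration of the bulk computation (under our naming, not the paper's code): the left and right neighbors of pack j are simply packs j − 1 and j + 1 of the same coordinate, so a dispersive coupling term becomes a plain vertical AVX operation:

```cpp
// Why the lanes never mix in the bulk: the left/right neighbors of pack j
// live in packs j-1 and j+1 of the same coordinate, so a dispersive coupling
// x_{i-1} - 2 x_i + x_{i+1} is a plain vertical AVX computation. (The edges
// of the four sub-chains still need the scalar fix-up described above.)
#include <immintrin.h>

__m256d coupling(__m256d left, __m256d self, __m256d right)
{
    const __m256d two = _mm256_set1_pd(2.0);
    return _mm256_add_pd(_mm256_sub_pd(left, _mm256_mul_pd(two, self)), right);
}
```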
Fig. 5a shows the performance comparison of the standard implementation, the
cache-optimized version, and the SIMD version. As seen there, using SIMD instructions
gives another performance boost of about 50% compared to the cache-optimized
version, which amounts to a total speedup of a factor of three over the standard
implementation in this simulation. Details of these performance tests are described
below.
3. Performance Study. In the following, we will compare the performance of
the different approaches presented above. The exemplary system is a chain of Roessler
oscillators [24] with nearest-neighbor coupling.
3.1. Coupled Roessler Chain. A single Roessler oscillator has a three-dimensional
(D = 3) state r_i = (x_i, y_i, z_i) following the dynamics:

(3.1)   \dot{r}_i = f(r_i, t) = \begin{pmatrix} -y_i - z_i \\ x_i + a y_i \\ b + z_i (x_i - c) \end{pmatrix}
The parameters a, b, c are constant and fixed to some typical values: a = 0.2, b = 1,
c = 9. For those parameter values, the dynamics of the Roessler system is generally
chaotic due to the existence of a strange attractor [21]. We add dispersive nearest-
neighbor coupling in the first coordinate by defining:
(3.2)   g(r_i, r_{i-1}, r_{i+1}) = \begin{pmatrix} x_{i-1} - 2 x_i + x_{i+1} \\ 0 \\ 0 \end{pmatrix} .
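For concreteness, here is a plain scalar sketch of this coupled rhs; it is a stand-in for the actual sources [17], and the boundary handling is our assumption:

```cpp
// Plain scalar sketch of the coupled Roessler rhs, Eqs. (3.1) and (3.2);
// a stand-in for the actual sources [17]. In particular, the boundary
// handling (no-flux ends) is our assumption.
#include <cstddef>
#include <vector>

void roessler_rhs(const std::vector<double> &r, std::vector<double> &drdt,
                  double a, double b, double c)
{
    const std::size_t N = r.size() / 3;  // layout: x_i, y_i, z_i per site
    for (std::size_t i = 0; i < N; ++i) {
        const double x = r[3 * i], y = r[3 * i + 1], z = r[3 * i + 2];
        const double xl = (i > 0)     ? r[3 * (i - 1)] : x;  // left neighbor
        const double xr = (i + 1 < N) ? r[3 * (i + 1)] : x;  // right neighbor
        drdt[3 * i]     = -y - z + (xl - 2.0 * x + xr);  // f_x plus coupling g
        drdt[3 * i + 1] = x + a * y;                     // f_y
        drdt[3 * i + 2] = b + z * (x - c);               // f_z
    }
}
```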
This measurement is repeated M = 10 times and we use the minimal runtime (maximal
performance) of those 10 trials. Additionally, we measure the CPU throughput in terms of
Flops/s (FLoating point OPerations per second) as well as the cache transfer bandwidth
using the likwid framework [26], which is based on the processor's performance
counters.
3.2.1. Results on Intel Xeon. In a first run, the simulations are performed on
an Intel Xeon E5 processor with 3.8 GHz (single core, turbo mode). The simulations
are implemented in C++ and compiled with the Intel Compiler version 15.0. The
Intel Xeon processor has 32KB/256KB/20MB of L1/L2/L3 cache respectively. The
representation of one state {r_i} requires N · D · 8 Bytes = 24 MBytes, clearly above
the L1/L2 cache capacities and also bigger than the available L3 cache. Therefore,
we expect the standard implementation, where 2s = 8 iterations over the state are
performed in each Runge-Kutta step, to suffer severely from bandwidth limitations.
This is seen in Fig. 5a, where the top bar in all three panels represents this standard
implementation. We find a CPU throughput of roughly 2000 MFlops/s, which
is significantly below the 3800 MFlops/s the FPU of this CPU is capable of.
Accordingly, the L3 cache bandwidth of about 17 GByte/s clearly indicates significant
L3 usage. Hence, the performance remains below 30 Runge-Kutta steps per
second.
[Fig. 5: Performance results on the Intel Xeon for the standard, cache-optimized, and SIMD implementations: overall performance in Runge-Kutta steps/s, CPU throughput in MFlops/s, and L2/L3 cache bandwidth in GByte/s.]
2 Remember that each element of the chain in the SIMD case consists of a pack of four values.
3 Tests with the g++-4.9 compiler showed similar, but slightly lower (∼10%) performance results.
Fig. 6: Performance results for the same simulation as in Fig. 5 on an AMD Opteron
2.1 GHz.
Fig. 7: Speedup gained from cache optimization and SIMD usage compared to standard
implementations as a function of the problem size. The graphs show the performance
gain in terms of total run-time of simulating a chain with nearest-neighbor
couplings of varying size on an Intel Xeon (3.8 GHz, top panel) and an AMD Opteron
(2.1 GHz, bottom panel): the speedup over a standard implementation gained from
granularity optimization (dark gray) and from additional SIMD instructions (light
gray).
For the Runge-Kutta-4 algorithm one has to store the state {r_i}, an intermediate
state {r'_i} and s rhs evaluations {k_i^j}. This gives a total memory requirement of
s + 2 = 6 states. That means for the Intel Xeon with its 20 MBytes of L3 cache, for
system sizes larger than about 3 MBytes (20 MBytes / 6 ≈ 3.3 MBytes) the L3 cache
is not sufficient to contain all required data and the even slower main memory has to be
used. This leads to a further performance decrease of the standard implementation
and hence a bigger speedup from the cache optimizations. For the AMD Opteron with
an L3 cache size of 8 MBytes, the transition to main memory should happen at system
sizes of about 1 MByte (8 MBytes / 6 ≈ 1.3 MBytes), which is also consistent with
Fig. 7. Furthermore, there we observe another increase of the speedup at system
sizes reaching about 2 GBytes. This is probably due to the fact that the main memory
of this machine is divided into NUMA domains (Nonuniform Memory Access). So for
very large systems, the single-core simulation requires memory outside of its NUMA
domain, which induces further bandwidth limitations and hence an even bigger speedup
due to cache optimization.
4. Implementation. All simulations were implemented in C++ and the sources
are made publicly available [17]. For the numerical Runge-Kutta-4 algorithm the
Boost.odeint library was used [3, 2]. It provides modern, generic implementations
of ODE solvers and greatly simplified the implementation of the cache-optimized
algorithm. For example, Boost.odeint performs the required memory allocations for
the temporary results of the RK4 steps and also supports an easy way to introduce
granularity with the help of a range interface. To efficiently employ the SIMD registers,
NT2's SIMD library [9] (proposed Boost.SIMD) was used. It provides a C++
abstraction of the low-level SIMD registers and allows for fast and easy porting
of the simulation to SIMD. By using this library, the source code is generic in the
sense that the SIMD extensions to employ (e.g. SSE3, SSE4 or AVX) are selected at
compile time rather than fixed in the source code. Furthermore, the
generic implementation of Boost.odeint allowed us to readily use SIMD data types,
and therefore SIMD registers, without the need to re-implement the Runge-Kutta-4
algorithm with SIMD instructions.
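As a hedged illustration of the basic (non-optimized) setup, the following sketch drives Boost.odeint's runge_kutta4 stepper for a simple nearest-neighbor chain; the rhs is a stand-in, and in the SIMD version the double value type would be replaced by a Boost.SIMD pack:

```cpp
// Hedged sketch of the basic (non-optimized) Boost.odeint driver; the rhs is
// a simple stand-in, see [17] for the actual sources.
#include <cstddef>
#include <vector>
#include <boost/numeric/odeint.hpp>

typedef std::vector<double> state_type;

// illustrative rhs: linear local term plus nearest-neighbor coupling
void chain_rhs(const state_type &r, state_type &drdt, double /*t*/)
{
    const std::size_t N = r.size();
    for (std::size_t i = 0; i < N; ++i) {
        const double l = (i > 0)     ? r[i - 1] : r[i];
        const double rr = (i + 1 < N) ? r[i + 1] : r[i];
        drdt[i] = -r[i] + (l - 2.0 * r[i] + rr);
    }
}

int main()
{
    using boost::numeric::odeint::runge_kutta4;
    state_type x(1 << 20, 0.1);        // large state, well beyond L3 capacity
    runge_kutta4<state_type> stepper;  // odeint allocates the temporaries k^j
    const double dt = 0.01;
    for (double t = 0.0; t < 1.0; t += dt)
        stepper.do_step(chain_rhs, x, t, dt);  // one full RK4 step
}
```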
For more details of the implementations we refer to the source codes [17]. We only
emphasize here that the introduction of granularity as well as the SIMD usage can be
done with reasonable programming effort when using Boost.odeint and Boost.SIMD.
For example, the simulation of the coupled Roessler system with granularity and
SIMD instructions requires merely about 220 lines of C++ code, only about 100 lines
more than a basic standard implementation with Boost.odeint. Hence, applying the
techniques presented here in ODE simulations with C++ is certainly achievable in
most situations.
5. Conclusions and Outlook. In this article, we have presented a technique
to improve the performance of large-scale ODE simulations with nearest-neighbor
coupling based on explicit Runge-Kutta schemes. The main performance bottleneck
for such simulations is the cache and memory bandwidth, which prevents full
utilization of the CPU throughput because data cannot be delivered to the CPU fast
enough. We are able to overcome this problem by introducing a granularity to the
algorithm, which leads to a more cache-efficient implementation. To ensure the
correctness of the algorithm, we introduced overlap computations, which effectively
increased the required computational effort. However, performance tests on two
processors, an Intel Xeon and an AMD Opteron, showed that for optimal granularity
this is clearly outweighed by the better cache utilization. With this improvement, we
again arrived at
an implementation that is bound by the CPU throughput instead of cache/memory
bandwidth. We were then able to further increase the performance by employing
SIMD instructions. For both processors we found a total performance increase of up
to a factor of three, depending on the system size (cf. Fig. 7). We conclude that
for large-scale simulations, the techniques presented here should be considered as a
valuable option to increase simulation performance, typically even before exploring
multi-core parallelization.
Note that from Figures 5b and 6b one finds that the optimal granularity roughly
corresponds to the L1 cache size of the processor used. Hence, it is possible to
choose a near-optimal granularity automatically, without a dedicated performance
study, by adjusting the granularity to the L1 cache size of the target machine, as
sketched below.
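A minimal sketch of this heuristic, assuming a Linux/glibc system where the L1 data-cache size can be queried via sysconf:

```cpp
// Hedged sketch of this heuristic: derive the granularity (block length in
// chain elements) from the L1 data-cache size. Linux/glibc only; the
// function name and the fallback value are ours.
#include <cstddef>
#include <unistd.h>

std::size_t granularity_from_l1(std::size_t bytes_per_element)
{
    long l1 = sysconf(_SC_LEVEL1_DCACHE_SIZE);  // glibc extension
    if (l1 <= 0) l1 = 32 * 1024;                // fallback: 32 KiB is typical
    return static_cast<std::size_t>(l1) / bytes_per_element;
}
```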
However, the cache optimizations are only promising if the program indeed suffers
from memory/cache bandwidth limitations. In cases, for example, where the rhs of
the ODE involves more complicated expressions that require several CPU cycles to be
computed, the cache/memory bandwidths might not impose a bottleneck and hence
introducing granularity would not lead to any performance gain. In such situations,
one should directly try to utilize SIMD instructions to increase the CPU throughput
and obtain better performance this way.
To ensure the correct computation in cases with granularity we introduced overlap
computations. The presented strategy only works for nearest-neighbor coupling. In
general, however, the same idea could also be applied in more complicated situations.
In case of next-nearest-neighbor couplings, for example, overlaps twice as long
have to be introduced, and similarly for longer coupling ranges. In principle,
granularity can also be implemented for situations with distant but sparse couplings. But
this is more complicated, as one has to keep exact track of the error propagation in
the Runge-Kutta scheme. Furthermore, if the coupling is not sparse enough, the gain
from cache optimization might not outweigh the additional overlap computations
anymore. Nevertheless, with a speedup of up to a factor of three, the cache optimization
techniques presented here proved highly valuable for nearest-neighbor coupling and
can potentially be generalized to more complex situations as well.
REFERENCES
[1] K. Ahnert, D. Demidov, and M. Mulansky, Solving ordinary differential equations on GPUs,
in Numerical Computations with GPUs, Springer, 2014, pp. 125–157.
[2] K. Ahnert and M. Mulansky, odeint. http://www.odeint.com, 2009–2014.
[3] , odeint - Solving ordinary differential equations in C++, in Symposium on the Numerical
Solution of Differential Eq. and their Applications, AIP Conference Proceedings, 2011.
[4] G. Bordyugov, A. Pikovsky, and M. Rosenblum, Self-emerging and turbulent chimeras in
oscillator chains, Physical Review E, 82 (2010), p. 035205.
[5] P. Conway, N. Kalyanasundharam, G. Donley, K. Lepak, and B. Hughes, Cache hierarchy
and memory subsystem of the AMD Opteron processor, IEEE Micro, 30 (2010), pp. 16–29.
[6] D. Demidov, K. Ahnert, K. Rupp, and P. Gottschling, Programming CUDA and OpenCL:
A case study using modern C++ libraries, SIAM Journal on Scientific Computing, 35(5)
(2013), pp. C453–C472.
[7] S. Dindar, E.B. Ford, M. Juric, Y.I. Yeo, J. Gao, A.C. Boley, B. Nelson, and J. Peters,
Swarm-NG: A CUDA library for parallel n-body integrations with focus on simulations of
planetary systems, New Astronomy, 23 (2013), pp. 6–18.
[8] C. Ding and K. Kennedy, The memory bandwidth bottleneck and its amelioration by a
compiler, in Proceedings of the 14th International Symposium on Parallel and Distributed
Processing IPDPS., IEEE, 2000, pp. 181–189.
[9] P. Estérie, J. Falcou, M. Gaunard, and J.-T. Lapresté, Boost.SIMD: generic programming
for portable SIMDization, in Proceedings of the 2014 Workshop on Programming Models for
SIMD/Vector processing, ACM, 2014, pp. 1–8.
[10] S. Flach, D.O. Krimer, and C. Skokos, Universal spreading of wave packets in disordered
nonlinear systems, Physical Review Letters, 102 (2009), p. 024101.
[11] F. Günther, M. Mehl, M. Pögl, and C. Zenger, A cache-aware algorithm for PDEs on
hierarchical data structures based on space-filling curves, SIAM Journal on Scientific Com-
puting, 28 (2006), pp. 1634–1650.
[12] A.C. Hindmarsh, P.N. Brown, K.E. Grant, S.L. Lee, R. Serban, D.E. Shumaker, and
C.S. Woodward, SUNDIALS: Suite of nonlinear and differential/algebraic equation solvers,
ACM Transactions on Mathematical Software (TOMS), 31 (2005), pp. 363–396.
[13] A. Kagi, J.R. Goodman, and D. Burger, Memory bandwidth limitations of future microprocessors,
in 23rd Annual International Symposium on Computer Architecture (ISCA), IEEE,
1996, pp. 78–89.
[14] M. Kowarschik and C. Weiß, An overview of cache optimization techniques and cache-aware
numerical algorithms, in Algorithms for Memory Hierarchies, Springer, 2003, pp. 213–232.
[15] A.R. Lebeck and D.A. Wood, Cache profiling and the SPEC benchmarks: A case study,
Computer, 27 (1994), pp. 15–26.
[16] M. Mulansky, Simulating DNLS models, arXiv:1304.1608, (2013).
[17] , Roessler performance tests source codes. Available on Github:
https://github.com/mariomulansky/olsos, 2014.
[18] M. Mulansky and A. Pikovsky, Energy spreading in strongly nonlinear disordered lattices,
New Journal of Physics, 15 (2013), p. 053015.
[19] L. Murray, GPU acceleration of Runge-Kutta integrators, Parallel and Distributed Systems,
IEEE Transactions on, 23 (2012), pp. 94–101.
[20] V. Oganesyan, A. Pal, and D.A. Huse, Energy transport in disordered classical spin chains,
Physical Review B, 80 (2009), p. 115104.
14
[21] E. Ott, Chaos in Dynamical Systems, Cambridge University Press, 2002.
[22] D.A. Patterson and J.L. Hennessy, Computer organization and design: the hard-
ware/software interface, Newnes, 2013.
[23] W.H. Press, S.A. Teukolsky, W.T. Vetterling, and B.P. Flannery, Numerical Recipes
3rd Edition: The Art of Scientific Computing, Cambridge University Press, 2007.
[24] O.E. Rössler, An equation for continuous chaos, Physics Letters A, 57 (1976), pp. 397 – 398.
[25] L.F. Shampine and M.W. Reichelt, The Matlab ODE suite, SIAM Journal on Scientific
Computing, 18 (1997), pp. 1–22.
[26] J. Treibig, G. Hager, and G. Wellein, LIKWID: A lightweight performance-oriented tool suite
for x86 multicore environments, in 39th International Conference on Parallel Processing
Workshops (ICPPW), IEEE, 2010, pp. 207–216.
[27] W. Yu-Shun, L. Qing-Hong, and S. Yong-Zhong, Two new simple multi-symplectic schemes
for the nonlinear Schrödinger equation, Chinese Physics Letters, 25 (2008), pp. 1538–1540.