High Performance Computing
Carlos J. Barrios H. · Silvio Rizzi · Esteban Meneses · Esteban Mocskos · Jose M. Monsalve Diaz · Javier Montoya (Eds.)
10th Latin American Conference, CARLA 2023
Cartagena, Colombia, September 18–22, 2023
Revised Selected Papers
Communications in Computer and Information Science 1887
Editors: Carlos J. Barrios H. (Industrial University of Santander, Bucaramanga, Colombia) and Silvio Rizzi (Argonne National Laboratory, Lemont, IL, USA)
© The Editor(s) (if applicable) and The Author(s), under exclusive license
to Springer Nature Switzerland AG 2024
and UNESCO via ICTP-Trieste. In addition, CARLA 2023 received sponsorship from
eight companies and two professional societies. Compute resources were provided by
Chameleon Cloud for some of the Tutorials. The committee also recognizes the role
and support of the Science and Technology Ministry in Colombia (MinCiencias) and
RENATA.
CARLA 2023 featured a total of three keynote speakers. In an effort to improve
women’s visibility in the field of high-performance computing (HPC), the committee
made a conscious decision to select recognized female speakers for these keynotes: Ewa
Deelman, Liliana Barbosa Santillan, and Trilce Estrada. The conference included three
invited talks by Fabrizio Gagliardi, Balaji Baktha, and Guido Araujo in
addition to five industry talks given by different sponsors. The conference also encour-
aged attendees’ participation and discussion through three panels featuring 12 interna-
tional panelists. The panels were “EuroHPCLatam Panel: Policy and Global Actions”,
“Education HPC”, and an Industry Panel.
This book contains 14 papers selected from 25 submissions and one invited paper
from one of the keynote speakers. All manuscripts were peer-reviewed by at least three
members of the Program Committee. The work by Johansell Villalobos and Esteban
Meneses titled “Evaluation of Alternatives to Accelerate Scientific Numerical Calcula-
tions on Graphics Processing Units using Python” was selected for the Best Paper Award.
Additionally, the poster by Kevin A. Brown and Robert Ross titled “Understanding HPC
Network Behavior Using Low-level Metrics” was selected for the Best Poster award.
Abstract. In this paper, the Numba, JAX, CuPy, PyTorch, and TensorFlow Python GPU-accelerated libraries were benchmarked using scientific numerical kernels on an NVIDIA V100 GPU. The benchmarks consisted of a simple Monte Carlo estimation, a particle interaction kernel, a stencil evolution of an array, and tensor operations. The benchmarking procedure included general memory consumption measurements, a statistical analysis of scalability with problem size to determine the best libraries for the benchmarks, and a productivity measurement using source lines of code (SLOC) as a metric. It was statistically determined that the Numba library outperforms the rest on the Monte Carlo, particle interaction, and stencil benchmarks. The deep learning libraries show better performance on tensor operations. The SLOC count was similar for all the libraries except Numba, whose higher count implies more time is needed for code development.
1 Introduction
Scientific computing (SC), alongside high performance computing (HPC), has
revolutionized the way complex problems are approached across disciplines. With
the exponential growth in computational power, problems that were once only
theorized can now be solved by supercomputers. This development allows the
design of simulations that are closer to reality, which sheds light on problems
like weather forecasting, plasma dynamics, and oceanic current modeling.
Also, the supercomputing simulation workflows used are becoming more com-
plex not only in their simulation process but in their pre-processing and data
analysis components. Accommodating these pipelines on HPC systems currently
requires a heterogeneous approach using accelerators which mainly include but
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024
C. J. Barrios H. et al. (Eds.): CARLA 2023, CCIS 1887, pp. 3–20, 2024.
https://doi.org/10.1007/978-3-031-52186-7_1
4 J. Villalobos and E. Meneses
are not limited to GPUs [18]. Currently, GPUs are the most common accelerators
in HPC clusters and supercomputers, due to their high energy efficiency [9]
and throughput. Referring to the Top500 supercomputer list, the heterogeneous
systems are comprised mostly of NVIDIA- and AMD-manufactured GPUs. Intel
has recently brought its GPU chips into this market with the Aurora HPC system
at Argonne National Laboratory [6].
Introducing these high performance CPU-GPU systems to the HPC world pre-
sented a faster, more energy efficient method of executing scientific software. How-
ever, it did so by introducing a different perspective on software parallelism and
a change in the usual programming paradigm. This new environment presented
challenges for the scientific community. CUDA, HIP, and OpenCL, although
specifically designed for GPU computing, are low level programming languages
that slow down productivity while prototyping and generating SC code. It is
important to note that “an increasing number of scientists are using high pro-
ductivity programming languages” [18], in which Python is a clear protagonist.
Computational scientists would prefer to have fast and portable code while still
being able to design the software for the solution of the numerical methods [18].
Python is popular because it offers a high-level abstraction for numerical modeling
and SC through appropriate APIs. According to GitHub Octoverse statistics, Python
is the second most used programming language as of 2022 [1]. Generally, Python
code is thought of as slow due to its interpreted nature, but with the introduction
of new libraries that implement just-in-time (JIT) compilation and GPU deploy-
ment with Python bindings for C/C++ code, considerable acceleration of code
execution can be seen [10]. This is the main reason Python has diffused over many
disciplines that are not limited to SC. For instance, the machine learning (ML) and
artificial intelligence fields have been constructed on the basis of clean high-level
Python APIs, such as PyTorch and TensorFlow, that implement bindings to fast
compiled CUDA/C/C++ routines. These routines perform the heavy duty linear
algebra and numerical computations for deep learning model distributed training,
which leaves condensed, readable Python code at the high level.
In that direction, this paper aims to evaluate different Python APIs that
make use of GPU architectures via JIT compilation and CUDA/C/C++ sub-
routines to accelerate scientific numerical calculations. This evaluation considers
the use of fundamental scientific numerical methods as benchmarks, specifically
Monte Carlo methods, particle interaction routines, stencil operations and linear
algebra operations. Furthermore, the programmability, performance, and scalability
of the libraries are measured. The contributions of this work include: i) a
comparative evaluation of available parallel programming Python tools for a set of
SC benchmarks; ii) a public GitLab repository with the implemented SC benchmarks
alongside the time and memory measurement modules and results; and
iii) an analysis of tradeoffs between the programming tools.
Alternatives for Python Scientific Computing on GPU 5
Python GPU execution has been benchmarked for Python bindings to CUDA
and OpenCL [8], focusing on an evaluation of performance portability and energy
efficiency between GPU models. The evaluation was done using a finite volume
shallow water code implemented with both of these libraries. Evidence suggests
that using the Python bindings does not have an important impact on the overall
performance of the code and that using Python is as computationally efficient as
using C++. Using the Python APIs alongside the Jupyter Notebook environment
increased development productivity.
NPBench is a set of NumPy computer programs that represent various
HPC applications in different domains [20]. Python machine code compilers
and NumPy-like accelerated libraries were tested and benchmarked, specifically
CuPy, DaCe, DaCe GPU, Numba, and Pythran. It was determined that alternate
implementations of numerical methods using other Python libraries drastically
accelerate execution of code but lower productivity in most cases.
OMB-Py is a module consisting of micro-benchmarks that evaluate the performance
of MPI libraries on HPC systems. It implements the first suite of
communication benchmarks for MPI-based parallel Python applications. This
package also benchmarks GPU message passing latency and library
overheads for multi-GPU applications [3].
A comparative evaluation of the performance of a cellular nonlinear network
simulator programmed in the CuPy, Numba, PyCUDA, and NumPy Python
libraries was performed in [5]. This simulator served as a benchmark for these
libraries. The data collected shows that the PyCUDA implementation was the
fastest out of the four libraries.
2 Background
GPUs are a core component in today's HPC systems. In the last 15 years the
realm of application of GPUs has expanded from gaming to the artificial intelligence
and scientific computing fields, mainly because of the high computing
throughput that GPUs can handle due to their parallel computing architecture.
The modern GPU has as its core computational unit the streaming multipro-
cessor (SM). SMs are comprised of thousands of registers that can be partitioned
among numerous threads of execution. This presents a single instruction multi-
ple thread (SIMT) programming paradigm that can be taken advantage of. Even
though GPUs have a slower clock speed than CPUs, the ability to handle data
in parallel with the SMs gives GPUs an edge for numerical applications that
involve multidimensional array operations.
The high information throughput of GPUs, on the order of TB/s for
high performance GPU architectures, proves beneficial for scientific computing
thanks to the SIMT programming paradigm. Figure 1 shows the general architecture
of modern GPUs. GPUs like NVIDIA’s Tesla V100 and Ampere A100 have
been used in the acceleration of sections of computational modeling codes in
fields such as lattice quantum chromodynamics, computational fluid dynamics,
molecular dynamics, and climate modeling. This acceleration takes place, for
instance, in the linear algebra solvers used, and domain decomposition of the
system for parallel computation, or even the whole formulation of the algorithm
of a simulation [13].
There are many Python tools that allow the implementation of numerical
methods to solve scientific computing problems on GPU. In this paper, the
CuPy, PyTorch, TensorFlow, Numba and JAX libraries were chosen for analysis.
This selection was a result of a qualitative evaluation of library relevance in the
scientific computing community, library development, and maintenance in the
Python community.
in which m_i is its mass, r_i is its position, and F_i is the total force acting on
the particle. For the interatomic potential V, r is the distance between the two
particles, σ is the distance at which the potential function becomes zero, and ε
is the depth of the potential well.
The total force F is dependent on specific functions which describe the attrac-
tion and repulsion of two particles. These functions or interparticular potentials
can be theoretically or experimentally constructed. The most used potential
function for the study of gases is the Lennard-Jones (LJ) interatomic poten-
tial. This function describes the potential energy between two particles of the
same kind taking into account the Van der Waals attraction and repulsion due
to their overlapping electron clouds. Theoretically, for each individual particle
the total force F_i is the sum of the forces caused by the interaction of the N
particles in the system, which can scale up computation time quite rapidly for
large problems. Considering this issue, an approximation can be made based
on the distance σ, which is to define a cutoff radius r_cutoff = σ so as to only consider
the particles that significantly influence the total force. With this approximation
only N_r < N calculations have to be made per particle.
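As a concrete illustration of the potential and truncation described above, the following is a minimal Python sketch of the Lennard-Jones potential and the magnitude of its force, cut off beyond a cutoff radius. The function names and the default cutoff of 2.5σ are illustrative choices (a common convention in molecular dynamics codes), not the paper's implementation:

```python
import math

def lj_potential(r, sigma=1.0, eps=1.0):
    """Lennard-Jones potential V(r) = 4*eps*((sigma/r)**12 - (sigma/r)**6).
    V(sigma) = 0 and the well depth at the minimum r = 2**(1/6)*sigma is -eps."""
    sr6 = (sigma / r) ** 6
    return 4.0 * eps * (sr6 * sr6 - sr6)

def lj_force_mag(r, sigma=1.0, eps=1.0, r_cutoff=2.5):
    """Magnitude of the radial force -dV/dr, truncated beyond the cutoff."""
    if r >= r_cutoff:
        return 0.0
    sr6 = (sigma / r) ** 6
    return 24.0 * eps * (2.0 * sr6 * sr6 - sr6) / r
```

Note that the force vanishes at the potential minimum r = 2^(1/6)σ and is zeroed past the cutoff, which is what makes the N_r < N neighbor approximation valid.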
Stencil: Partial differential equations (PDEs) are often used in scientific com-
puting. From computational fluid dynamics to electromagnetism, the models
used are always based on spatial and temporal changes of a set of variables.
Given that analytical solutions are not always available, numerical methods
need to be implemented to solve these problems. The finite difference method
(FDM) is a simple method to implement spatial model discretization for numer-
ical PDE solutions. The discretization process of PDEs results in an expression
that describes the value of a node in a structured array-like mesh as a function of
itself and neighboring nodes. This expression constitutes a computational sten-
cil. In this work, a simple averaging stencil will be applied to a three dimensional
cubic array. This operation is given by,
aijk = (a(i+1)jk + a(i−1)jk + ai(j+1)k + ai(j−1)k + aij(k+1) + aij(k−1) + aijk )/7.
Tensor Operations: Tensor operations generalize concepts that appear throughout
physics and mathematics. In continuum mechanics, PDEs are
combined with tensors to represent the change in the stress, strain, and velocity
of a system. In relativity, tensors are used to compute the trajectories of particles
in curved space-times using the geodesic equations. These mathematical objects
are not limited to any particular field of science and are commonly used in
multidimensional array handling. Simple tensor operations like outer and inner
products of matrices and vectors will be used as benchmarks for the chosen
Python tools. For vectors u and v, the outer product is (u ⊗ v)_ij = u_i v_j; for
matrices A and B, the inner (contraction) product is A : B = Σ_ij A_ij B_ij.
Particles: The particle codes all followed the same structure: a main loop which
used a Verlet integration algorithm, a vectorized approach to obtain the particle
neighbor list using distance matrices, and a vectorized evaluation of the
Lennard-Jones particle force for each neighbor in the list. All this was necessary
to take advantage of the accelerated linear algebra (XLA) and JIT compilation
characteristics of the libraries used. The exception to this is the
Numba particle code implementation, which differs considerably from the others
in its approach to calculating neighbor lists: it looped over all particles to define
their neighbors instead of calculating a distance matrix.
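The distance-matrix approach used by the non-Numba codes can be sketched as follows in NumPy. This is a hedged illustration of the technique, not the paper's code; the function name and return convention are assumptions:

```python
import numpy as np

def neighbor_lists(pos, r_cutoff):
    """Boolean neighbor mask from an all-pairs distance matrix.

    pos: (N, 3) array of particle positions.
    Returns an (N, N) mask where entry (i, j) is True if particle j
    lies within r_cutoff of particle i (excluding i itself).
    """
    diff = pos[:, None, :] - pos[None, :, :]   # (N, N, 3) displacement vectors
    dist = np.sqrt((diff ** 2).sum(axis=-1))   # (N, N) pairwise distances
    mask = dist < r_cutoff
    np.fill_diagonal(mask, False)              # a particle is not its own neighbor
    return mask
```

The broadcasting step builds an O(N²) intermediate, which is exactly why this formulation vectorizes well but also why Numba's explicit per-particle loop has a different memory profile.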
Stencil: The stencil algorithm was implemented with the exact same structure
throughout all codes. Array slicing was used for vectorized updates of the whole
array. In this benchmark, JIT compilation was used to optimize the TensorFlow
and JAX code. The Numba variant needed specification of thread ids for array
indexing.
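The slicing-based vectorized update can be sketched as below for the 7-point averaging stencil given earlier. This is an illustrative NumPy version (the benchmarked codes use the analogous slicing of each library's own array type):

```python
import numpy as np

def stencil_step(a):
    """One update of the 7-point averaging stencil on the interior of a
    3-D array, using slicing so the whole array updates at once."""
    out = a.copy()
    out[1:-1, 1:-1, 1:-1] = (
        a[2:, 1:-1, 1:-1] + a[:-2, 1:-1, 1:-1] +   # i+1, i-1 neighbors
        a[1:-1, 2:, 1:-1] + a[1:-1, :-2, 1:-1] +   # j+1, j-1 neighbors
        a[1:-1, 1:-1, 2:] + a[1:-1, 1:-1, :-2] +   # k+1, k-1 neighbors
        a[1:-1, 1:-1, 1:-1]                        # center point
    ) / 7.0
    return out
```

A uniform array is a fixed point of this operator, which makes a convenient sanity check.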
Tensor Operations: All tensor operations are already implemented in every
library except Numba, for which the specific algorithm for each tensor operation
had to be programmed. The matrix contraction and outer product algorithms were
implemented using global thread IDs since both operations are direct; the matrix
contraction algorithm calculates the resulting matrix and then performs a sum
reduction of this matrix using the @cuda.reduce decorator.
The matrix multiplication algorithm used shared memory programming since
looping directly with all threads is computationally expensive.
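The shared-memory matrix multiplication on the GPU is a tiled algorithm: threads cooperatively load small blocks into fast memory and accumulate partial products. As a CPU-side sketch of the same blocking idea (not the paper's Numba kernel; names and tile size are illustrative), the accumulation over tiles looks like this in NumPy:

```python
import numpy as np

def tiled_matmul(A, B, tile=32):
    """Blocked matrix multiply: C is accumulated tile by tile, mirroring
    the shared-memory tiling a GPU kernel performs. The result is
    identical to A @ B; only the access pattern differs."""
    n, k = A.shape
    k2, m = B.shape
    assert k == k2, "inner dimensions must match"
    C = np.zeros((n, m), dtype=A.dtype)
    for i in range(0, n, tile):
        for j in range(0, m, tile):
            for p in range(0, k, tile):
                # NumPy slicing clips at array bounds, handling ragged edges.
                C[i:i+tile, j:j+tile] += A[i:i+tile, p:p+tile] @ B[p:p+tile, j:j+tile]
    return C
```

On the GPU, the inner block product is what each thread block computes out of tiles staged in shared memory, which avoids the redundant global-memory traffic of a naive all-threads loop.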
4 Methodology
4.1 Computational Infrastructure
The resources used for the benchmarking process are found in the Kabré Super-
computing Cluster at the Costa Rica National Center of High Technology
(CeNAT). Kabré is a collection of 52 computing nodes of four different architec-
tures that are used in four main areas: machine learning, computational science,
big data, and bioinformatics. The machine learning section of the Kabré supercomputing
cluster was used in this study. These nodes are composed of four
NVIDIA Tesla V100 GPUs with 32 GB of GPU memory each, and 24 CPU
cores at 2.40 GHz with two hardware threads per core. The software tested in
this benchmark is summarized in Table 1.
Table 1. Computer program versions

Computer tool       Version
Linux distribution  CentOS Linux 7 (Core)
Linux kernel        3.10.0-1160.81.1.el7.x86_64
Python              3.9.0
PyTorch             1.13.0
TensorFlow          2.2.0
Numba               0.56.4
CuPy                8.3.0
JAX                 0.4.10

Table 2. Benchmark problem sizes for experimental design

Benchmark             Problem sizes N
Monte Carlo           [1.0e1, 2.5e7, 5.0e7, 7.5e7, 1.0e8]
Particle Interaction  [10, 3757, 7505, 11252, 15000]
Stencil               [10, 232, 455, 677, 900]
Matrix contraction    [10, 6257, 12505, 18752, 25000]
Outer product         [10, 6257, 12505, 18752, 25000]
Matrix mult.          [10, 6257, 12505, 18752, 25000]
A two-stage nested design was used for the data acquisition process, using 15
observations per entry. The factors in this experiment were the problem size
N and the programming library; Table 2 shows the problem sizes for each benchmark.
This experimental design was chosen since it allowed for a more robust
data analysis [12]. Statistical difference between means can be evaluated using a
parametric analysis of variance (ANOVA). If ANOVA assumptions are not satisfied,
a non-parametric approach using the Kruskal-Wallis median comparison
can be implemented.
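The Kruskal-Wallis test compares the mean ranks of the groups in the pooled sample. A minimal pure-Python sketch of the H statistic (without the tie correction that statistics packages apply) makes the computation concrete; the function name is illustrative:

```python
def kruskal_wallis_h(*groups):
    """Kruskal-Wallis H statistic, ignoring ties:
    H = 12 / (N (N + 1)) * sum_i n_i * (rbar_i - (N + 1) / 2)**2,
    where rbar_i is the mean rank of group i in the pooled sample."""
    pooled = sorted((x, gi) for gi, g in enumerate(groups) for x in g)
    ranks = [[] for _ in groups]
    for rank, (_, gi) in enumerate(pooled, start=1):
        ranks[gi].append(rank)
    n_total = len(pooled)
    h = 0.0
    for r in ranks:
        rbar = sum(r) / len(r)                     # mean rank of this group
        h += len(r) * (rbar - (n_total + 1) / 2) ** 2
    return 12.0 / (n_total * (n_total + 1)) * h
```

In practice the test (with tie handling and the chi-square p-value) comes from a library such as Pingouin or SciPy, as in this study; the sketch only shows what the statistic measures.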
all functions. Each script implemented the set of all benchmarks for a specific
library. This script then looped over the problem sizes specified in Table 2.
With respect to the problem size N , it varied according to the kernel being
tested since each piece of code has a different memory footprint. This way, the
GPU resources can be fully used, and the benchmark times can be long enough
to reduce any source of randomness in the execution of the kernels.
A set of N_warmup = 10 warm-up runs and N_iters = 10 timed iterations were
used to obtain consistent data. These scripts were then executed on an NVIDIA
V100 GPU using isolated Conda environments for reproducibility. The results
were then saved to disk as JSON files.
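The warm-up plus timed-iteration scheme can be sketched as below. This is a hedged, CPU-only illustration (function names are assumptions, not the repository's module); for GPU kernels the timed call must also synchronize the device before the timer stops, or only the launch overhead is measured:

```python
import json
import time

def benchmark(kernel, *args, n_warmup=10, n_iters=10):
    """Run warm-up iterations (absorbing JIT compilation and other
    one-time costs), then collect n_iters wall-clock timings."""
    for _ in range(n_warmup):
        kernel(*args)
    times = []
    for _ in range(n_iters):
        t0 = time.perf_counter()
        kernel(*args)
        times.append(time.perf_counter() - t0)
    return times

def save_results(path, results):
    """Persist timing results to disk as JSON for later analysis."""
    with open(path, "w") as f:
        json.dump(results, f, indent=2)
```

The warm-up phase is essential for JIT-based libraries such as Numba and JAX, where the first call pays the compilation cost that would otherwise contaminate the measurements.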
Data Analysis: The data analysis of these results was done using the NumPy,
Pandas, Statsmodels, and Pingouin Python libraries. First and foremost, an
ANOVA test was performed for the nested design so that the ANOVA statistical
assumptions on the residuals could be tested. The linear model y_ijk =
μ + τ_i + β_j(i) + ε_(ij)k was fit using ordinary least squares with the Statsmodels
function ols to carry this out. Then, a normality test was conducted quantitatively
using a Shapiro-Wilk test. For data that followed the ANOVA assumptions
the Tukey pairwise mean comparison test was implemented. The Kruskal-Wallis
non-parametric ANOVA, executed using the Pingouin library, was used for data
which did not satisfy the parametric ANOVA assumptions. After this,
Dunn's pairwise comparison was used as a post hoc test to evaluate the differences
between medians and quantify statistical difference. After the post hoc
pairwise tests, the best performing library was chosen according to statistical
difference and minimum median/mean execution time.
Productivity Measurement: Lines of code were used as a measure of
library productivity. To standardize this metric, the Black Python package was
employed, which applies the same format to all Python code it analyzes. After
this, the Pygount tool was used to statically analyze Python scripts and get the
source lines of code (SLOC) count.
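A minimal stand-in for the SLOC count illustrates what is being measured: lines that are neither blank nor pure comments. This is a simplified, hypothetical sketch — Pygount itself also handles docstrings, string literals, and many languages:

```python
def sloc(source):
    """Count source lines of code in a Python source string:
    non-blank lines that are not pure '#' comments.
    (A simplified stand-in for a tool like Pygount.)"""
    count = 0
    for line in source.splitlines():
        stripped = line.strip()
        if stripped and not stripped.startswith("#"):
            count += 1
    return count
```

Running a formatter such as Black first, as done here, matters because SLOC depends on line breaks: the same logic formatted differently yields different counts, so a uniform style makes the metric comparable across libraries.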
5 Results
5.1 Normality Test
ANOVAs were implemented to test the statistical assumptions on the residuals of
the obtained time data. The residuals showed no normality at a 95% confidence
level. Data normalization was attempted with a Box-Cox transformation;
however, that did not yield positive results. Due to this, non-parametric
ANOVA studies were carried out for all levels. Table 3 presents the P-values for
the residuals of each benchmark.
Fig. 2. Memory usage results from the execution of the GPU proposed benchmarks.
5.3 Scalability
Referring to Fig. 3, the Monte Carlo algorithm displays linear scalability with
respect to problem size. The CuPy library presents the highest execution times
for all sizes except N = 10. Profiling of the Monte Carlo code sheds light on the
inefficiency of the cupy.sum function. The reduction algorithm in this module
may be the cause of slow execution. Numba presents the fastest execution of all
the libraries given its different memory implementation. Random numbers are
generated on the device per thread and not stored in main memory which speeds
up the Monte Carlo process.
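The benefit of generating random numbers as they are consumed, rather than materializing them in memory first, can be illustrated with a pure-Python Monte Carlo π estimate. This is a hedged analogy to Numba's per-thread device-side generation, not the benchmarked kernel; names and the seed are illustrative:

```python
import math
import random

def pi_on_the_fly(n, seed=0):
    """Monte Carlo estimate of pi that draws each sample as it is
    consumed, so no O(n) array of random numbers is ever stored
    (analogous to per-thread on-device generation in a GPU kernel)."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(n):
        x, y = rng.random(), rng.random()
        if x * x + y * y < 1.0:
            inside += 1
    return 4.0 * inside / n
```

An array-based variant would first allocate 2n random values and then reduce over them; avoiding that allocation is what removes the main-memory traffic mentioned above.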
Considering the particle interaction benchmark, it scales as N² for all
libraries. JAX presents poor performance in this benchmark, as its execution
time is 20 times that of the other libraries. Profiling the code with nvprof highlighted
a fusion kernel in which almost 70% of the time was spent. The XLA
compiler fuses computations to optimize the computational graph; it is possible
the fusion of some computations in this code is reducing performance and thus
raising execution time.
Figure 3 clearly shows that Numba outperforms every other library in the
stencil benchmark. This is due to the fast thread access mentioned before. JAX
starts off with a high execution time, attributed to the JIT and XLA compilation
of the code, yet it scales like the other libraries.
Regarding tensor operations, the deep learning libraries, including JAX, outperformed
Numba and CuPy. These libraries are optimized for N-dimensional
array handling and multiplication due to the field they are applied in, and they
provide static, compiled GPU routines which surpass JIT-compiled kernels.
After the Kruskal-Wallis non parametric ANOVAs, data was analyzed using
the Dunn post-hoc test. Table 4 presents the best statistically similar results
after performing post-hoc tests.
Table 4. Execution time medians [s] of the best statistically similar libraries for the
benchmarks evaluated.

Benchmark      N1               N2                  N3                  N4                   N5
Monte Carlo    Numba: 0.0002    Numba: 0.1246       Numba: 0.2489       Numba: 0.3736        Numba: 0.4986
               CuPy: 0.0780     PyTorch: 2.1880     PyTorch: 4.3235     PyTorch: 6.4481      PyTorch: 8.5914
Particles      Numba: 0.0574    Numba: 0.0571       Numba: 0.0572       Numba: 0.0815        Numba: 0.0562
               JAX: 0.0576      CuPy: 0.2186        CuPy: 0.3501        CuPy: 0.5765         CuPy: 0.8917
Stencil        Numba: 0.0001    Numba: 0.0238       Numba: 0.2410       Numba: 0.8121        Numba: 1.7729
               CuPy: 0.0280     PyTorch: 0.3274     PyTorch: 1.0429     PyTorch: 3.4906      PyTorch: 11.0197
Matrix Mult.   CuPy: 0.0000     PyTorch: 0.0356     PyTorch: 0.2811     TensorFlow: 0.9431   JAX: 2.2371
               PyTorch: 0.0000  TensorFlow: 0.0356  TensorFlow: 0.2819  PyTorch: 0.9438      PyTorch: 2.2418
Matrix Contr.  CuPy: 0.0000     PyTorch: 0.0005     PyTorch: 0.0022     JAX: 0.0044          JAX: 0.0072
               PyTorch: 0.0000  JAX: 0.0009         JAX: 0.0024         PyTorch: 0.0049      TensorFlow: 0.0086
Outer Prod.    CuPy: 0.0000     PyTorch: 0.0003     PyTorch: 0.0013     PyTorch: 0.0016      PyTorch: 0.0028
               PyTorch: 0.0000  CuPy: 0.0005        JAX: 0.0014         JAX: 0.0021          JAX: 0.0033
5.4 Productivity
Python is heavily used for prototyping in the scientific community, and this is a
valuable aspect of the language. As such, there is a need to maintain this characteristic
while evolving the language towards HPC. Measuring SLOC is the main way to
examine whether productivity is indeed maintained. The SLOC data for each benchmark
is shown in Table 5. Qualitatively, relying only on SLOC, all libraries seem to have
similar productivity, with the exception of Numba. Numba presents notable
differences in SLOC count, visible in the particle interaction and the
tensor operations benchmarks. These counts are to be expected since the CUDA
functionality in Numba is of a lower level with respect to the other libraries.
Table 5. SLOC count per benchmark and library.

Benchmark          CuPy  JAX  Numba  NumPy  PyTorch  TensorFlow
Monte Carlo        11    14   20     11     14       16
Particles          87    70   97     88     77       76
Stencil            15    19   21     17     18       20
Tensor Operations  9     9    49     11     10       9
6 Final Remarks
An evaluation of five GPU computing Python libraries was performed, specifi-
cally Numba, CuPy, PyTorch, TensorFlow, and JAX. This evaluation included
memory consumption, scalability, and productivity analyses on a single NVIDIA
V100 GPU. With respect to memory consumption, all libraries seem to handle
GPU memory differently and this behavior is expected. Numba presented the
least memory usage in most benchmarks.
Considering scalability results, our analysis concluded that the data needed a
non-parametric statistical evaluation due to non-normality of residuals. Kruskal-Wallis
tests with Dunn's post-hoc test were implemented to quantify the differences
between the median execution times of the libraries. It was found that the
Numba library performs better than the rest for the Monte Carlo, particle interaction,
and stencil benchmarks. The deep learning libraries, specifically PyTorch,
outperform Numba in tensor operations. This behavior is expected since these
libraries implement highly optimized GPU routines. However, among these
libraries the times are close, so no distinction was made, with the exception of
CuPy, which presents slightly higher computation times for linear algebra kernels.
Regarding the productivity of code, SLOC were counted, and across all benchmarks
the library that stands out is Numba. This library presents a higher SLOC
count for the particle interaction algorithm and the tensor operations. From this it
can be concluded that while using Numba some productivity may be sacrificed
for speed, depending on the nature of the problem being solved. The rest
of the libraries present similar SLOC counts on all benchmarks.
Fig. 4. Monte Carlo (a), particles (b), and stencil (c) algorithms roofline graphs.
Fig. 5. Matrix multiplication (a), outer product (b), and matrix contraction (c) algo-
rithms roofline graphs.
References
1. The top programming languages. https://octoverse.github.com/2022/top-
programming-languages. Accessed 23 June 2022
2. Abadi, M., Agarwal, A., Barham, P., Brevdo, E., et al.: Tensorflow: large-scale
machine learning on heterogeneous distributed systems (2016). https://arxiv.org/
abs/1603.04467
3. Alnaasan, N., Jain, A., Shafi, A., Subramoni, H., Panda, D.K.: OMB-PY: Python
micro-benchmarks for evaluating performance of MPI libraries on HPC systems
(2021). https://arxiv.org/abs/2110.10659
4. Asanović, K., Bodik, R., Catanzaro, B., et al.: The landscape of parallel comput-
ing research: a view from Berkeley. Technical Report UCB/EECS-2006-183,
EECS Department, University of California, Berkeley, Dec 2006. http://www2.
eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-183.html
5. Dogaru, R., Dogaru, I.: A Python framework for fast modelling and simulation of
cellular nonlinear networks and other finite-difference time-domain systems (2021)
6. Argonne Leadership Computing Facility: Aurora. https://www.alcf.anl.gov/aurora
7. Frostig, R., Johnson, M.J., Leary, C.: Compiling machine learning programs via
high-level tracing. In: SysML Conference (2018)
8. Holm, H.H., Brodtkorb, A.R., Sætra, M.L.: GPU computing with Python: Perfor-
mance, energy efficiency and usability. Computation 8 (2020). https://doi.org/10.
3390/computation8010004
9. Huang, S., Xiao, S., Feng, W.: On the energy efficiency of graphics processing
units for scientific computing. In: 2009 IEEE International Symposium on Parallel
and Distributed Processing, pp. 1–8 (2009). https://doi.org/10.1109/IPDPS.2009.
5160980
10. Lam, S.K., Pitrou, A., Seibert, S.: Numba: a LLVM-based python JIT compiler,
vol. 2015-January. Association for Computing Machinery (2015). https://doi.org/
10.1145/2833157.2833162
11. Marowka, A.: Python accelerators for high-performance computing. J. Supercom-
put. 74, 1449–1460 (2018). https://doi.org/10.1007/s11227-017-2213-5
12. Montgomery, D.C.: Design and analysis of experiments (2017)
13. NVIDIA: Nvidia HPC application performance. https://developer.nvidia.com/
hpc-application-performance
14. Okuta, R., Unno, Y., Nishino, D., Hido, S., Loomis, C.: Cupy: a numpy-compatible
library for NVIDIA GPU calculations. https://github.com/cupy/cupy
15. Paszke, A., Gross, S., Massa, F., et al.: Pytorch: an imperative style, high-
performance deep learning library (2019). https://arxiv.org/abs/1912.01703
16. Pata, J., Dutta, I., Lu, N., Vlimant, J.R., et al.: ETH library data analysis with
GPU-accelerated kernels (2020). https://doi.org/10.3929/ethz-b-000484721
17. Springer, P.L.: Berkeley’s dwarfs on CUDA (2012)
18. Vetter, J.S., Brightwell, R., Gokhale, M., McCormick, P., et al.: Extreme hetero-
geneity 2018 - productive computational science in the era of extreme heterogene-
ity: report for doe ASCR workshop on extreme heterogeneity (2018). https://doi.
org/10.2172/1473756, https://www.osti.gov/servlets/purl/1473756/
19. Yang, C.: Hierarchical roofline analysis: how to collect data using performance
tools on Intel CPUs and NVIDIA GPUs (2020)
20. Ziogas, A.N., Ben-Nun, T., Schneider, T., Hoefler, T.: NPBench: a benchmarking
suite for high-performance numpy, pp. 63–74. Association for Computing Machin-
ery (2021). https://doi.org/10.1145/3447818.3460360
Enhancing a GPU-Based Wave
Propagation Application Through Loop
Tiling and Loop Fission Optimizations
1 Introduction
DEVITO is a Domain-Specific Language (DSL) written in Python for stencil
computation, focusing on seismic inversion problems [1,2]. DEVITO enables the
creation of geophysical models in Python with a high degree of abstraction,
thanks to the functions and classes provided by the tool. Using an internal
compiler, DEVITO can translate the model written from symbolic equations in
Python into a finite difference code in C/C++, which is partially optimized for
execution on the specified architecture.
In addition to running on multi-core machines, DEVITO allows code gener-
ation targeting accelerators and GPUs. Offloading the workload to the graphics
unit is done through OpenACC, a programming standard implemented through
a library for C/C++ and Fortran, enabling code optimization in an automated
way via directives [3]. The loop tiling technique stands out among the opti-
mizations implemented in OpenACC and is available for DEVITO. This loop
transformation alters the data access pattern: instead of being traversed
contiguously to the end of a dimension, the data is accessed in blocks (tiles)
whose dimensions are specified by the programmer [4]. The goal of this technique
is to exploit the principle of spatial locality, favoring the reuse of nearby
stencil points [5,6].
This work focuses on the computational optimization of a DEVITO-based
implementation of elastodynamic equations as described by Virieux [7]. There-
fore, the objective is to develop modifications in the C++ algorithm to improve
performance and mitigate bottlenecks without changing the nature of the equa-
tions and their solutions.
This work continues an ongoing effort to optimize the DEVITO tool, aiming
to contribute to existing efforts with a similar goal. N. Kukreja et al. [1] provide
an initial presentation of DEVITO, introducing some of the optimizations in the
DSL to ensure code generation with optimization levels comparable to manually
optimized code. These automated optimizations include vectorization, aligned
memory allocation, parallelization through pragmas, loop blocking on CPU, and
common sub-expression elimination. The work also measures performance in
GFLOPS/s and arithmetic intensity in FLOPS/byte for the simulation of a
three-dimensional acoustic model.
M. Louboutin et al. [8] conducted industrial-scale tests for seismic imaging
of an anisotropic model using DEVITO. The authors compared the number of
floating-point operations required for each grid point in two cases: without any
optimization and with the optimizations implemented by DEVITO (common
sub-expression elimination, factorization, and cross-iteration redundancy elim-
ination). The difference reached values approximately 81% lower in the lowest
discretization order used and around 95% lower in the highest discretization
order used, both in favor of the optimized version of the code.
L. Jesus et al. [9] performed GPU tests for six visco-acoustic equations, com-
paring the performance of two codes for each equation: one with the advanced
optimizations offered by DEVITO and another with these same optimizations
plus the use of OpenACC loop tiling. The authors used the default tile shape
\[
\begin{cases}
\rho \dfrac{\partial v_x}{\partial t} - \left( \dfrac{\partial \sigma_{xx}}{\partial x} + \dfrac{\partial \sigma_{xy}}{\partial y} + \dfrac{\partial \sigma_{xz}}{\partial z} \right) = 0, \\[1.5ex]
\rho \dfrac{\partial v_y}{\partial t} - \left( \dfrac{\partial \sigma_{yx}}{\partial x} + \dfrac{\partial \sigma_{yy}}{\partial y} + \dfrac{\partial \sigma_{yz}}{\partial z} \right) = 0, \\[1.5ex]
\rho \dfrac{\partial v_z}{\partial t} - \left( \dfrac{\partial \sigma_{zx}}{\partial x} + \dfrac{\partial \sigma_{zy}}{\partial y} + \dfrac{\partial \sigma_{zz}}{\partial z} \right) = 0, \\[1.5ex]
\dfrac{\partial \sigma_{xx}}{\partial t} - (\lambda + 2\mu) \dfrac{\partial v_x}{\partial x} - \lambda \left( \dfrac{\partial v_y}{\partial y} + \dfrac{\partial v_z}{\partial z} \right) = f_{xx}, \\[1.5ex]
\dfrac{\partial \sigma_{yy}}{\partial t} - (\lambda + 2\mu) \dfrac{\partial v_y}{\partial y} - \lambda \left( \dfrac{\partial v_x}{\partial x} + \dfrac{\partial v_z}{\partial z} \right) = f_{yy}, \\[1.5ex]
\dfrac{\partial \sigma_{zz}}{\partial t} - (\lambda + 2\mu) \dfrac{\partial v_z}{\partial z} - \lambda \left( \dfrac{\partial v_x}{\partial x} + \dfrac{\partial v_y}{\partial y} \right) = f_{zz}, \\[1.5ex]
\dfrac{\partial \sigma_{yz}}{\partial t} - \mu \left( \dfrac{\partial v_z}{\partial y} + \dfrac{\partial v_y}{\partial z} \right) = 0, \\[1.5ex]
\dfrac{\partial \sigma_{zx}}{\partial t} - \mu \left( \dfrac{\partial v_x}{\partial z} + \dfrac{\partial v_z}{\partial x} \right) = 0, \\[1.5ex]
\dfrac{\partial \sigma_{xy}}{\partial t} - \mu \left( \dfrac{\partial v_x}{\partial y} + \dfrac{\partial v_y}{\partial x} \right) = 0.
\end{cases}
\tag{1}
\]
v_x, v_y, v_z are the horizontal and vertical particle velocity fields, σ_xx, σ_yy, σ_zz,
σ_yz, σ_zx, σ_xy are the stress fields, f_xx, f_yy, f_zz are the source terms, ρ is the
density, and λ and μ are the Lamé parameters. These coefficients describe the spatially
variable properties of the earth and are related to the seismic P-wave and S-wave
velocities via λ + 2μ = ρV_p^2 and μ = ρV_s^2.
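These two relations can be inverted to recover the Lamé parameters from density and wave velocities; a small sketch with hypothetical medium values:

```python
def lame_parameters(rho, vp, vs):
    """Invert lambda + 2*mu = rho * Vp**2 and mu = rho * Vs**2 to recover
    the Lame parameters from density and P/S-wave velocities."""
    mu = rho * vs ** 2
    lam = rho * vp ** 2 - 2.0 * mu
    return lam, mu

# Hypothetical medium: rho = 2000 kg/m^3, Vp = 3000 m/s, Vs = 1500 m/s
lam, mu = lame_parameters(2000.0, 3000.0, 1500.0)
```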
2.3 Approaches
Analyses performed with the NSYS profiler on the generated C++ code showed
that 98.3% of the computational effort employed in the total execution of the
application is concentrated in just two kernels: V (53.2%) and Sig (45.1%). These
kernels are responsible for calculating the particle velocity fields (V) and the
stress fields (Sig), as shown in Eq. 1. Therefore, optimization efforts focused
on extracting performance from these two kernels. The present work compares
the results obtained from the standard version of the operator generated by
DEVITO with two new approaches developed for performance extraction, as
described below:
– Sig Fission: an approach that combines the techniques of loop tiling and
loop fission. In this approach, the V kernel maintains the tile dimensions of
(32, 4, 4). The loop fission technique divides the workload of a nesting into
two or more nests of smaller workload but with the same iteration space [12].
It was applied to the Sig kernel by separating the cross-derivatives into one
nest, and the derivatives taken in the same dimension into another nest, with
these two new kernels being computed sequentially. The tile clause is applied
again to these two new nests, maintaining the (32, 4, 4) dimension applied in
the V nest.
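As an illustration of the transformation itself (a one-dimensional Python sketch with stand-in terms, not the actual Sig derivative expressions), fission splits one nest into two lighter nests over the same iteration space:

```python
import numpy as np

def sig_update_fused(v):
    """One nest computing both derivative families together (schematic)."""
    out = np.zeros_like(v)
    for i in range(1, len(v) - 1):
        same_dim = v[i + 1] - v[i - 1]   # stand-in for same-dimension terms
        cross = 0.5 * v[i]               # stand-in for cross-derivative terms
        out[i] = same_dim + cross
    return out

def sig_update_fissioned(v):
    """The same iteration space split into two lighter nests, executed
    sequentially; each nest keeps fewer values live per iteration."""
    out = np.zeros_like(v)
    for i in range(1, len(v) - 1):       # nest 1: same-dimension terms
        out[i] = v[i + 1] - v[i - 1]
    for i in range(1, len(v) - 1):       # nest 2: cross-derivative terms
        out[i] += 0.5 * v[i]
    return out
```

The results are identical; the benefit on a GPU is that each smaller nest needs fewer registers per thread, which can raise occupancy.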
The tile dimensions of each of the kernels were determined empirically, always
choosing the ones that achieved the best performance. A limitation to be taken
into account in this process is that the product of the tile dimensions should
not exceed the maximum limit of threads per block on the GPU. Since the limit
of the utilized device is 1024 threads per block, the product of the dimensions
must be less than or equal to this value, as shown in Eq. 2, where Dsize(i)
represents the size of dimension i, and n is the number of dimensions in the tile.

\[
\prod_{i=1}^{n} \mathrm{Dsize}(i) \le 1024 \tag{2}
\]
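This constraint is simple to check programmatically; a small sketch (1024 is the per-block thread limit of the device used here, and other GPUs may differ):

```python
from math import prod

MAX_THREADS_PER_BLOCK = 1024  # limit of the V100 GPUs used in this work

def tile_is_valid(dims):
    """Eq. 2: the product of the tile dimensions must not exceed the
    device's maximum number of threads per block."""
    return prod(dims) <= MAX_THREADS_PER_BLOCK
```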
3 Results
3.1 Single GPU
Table 2 presents the average execution times in seconds for the three approaches
and the respective saved time percentage of the optimized approaches compared
to the time obtained by the Naive version. The number of cycles required to
execute the two main kernels in each approach is shown in Table 3. For better
visualization and ease of comparison, all data collected from the two additional
kernels generated in Sig Fission will be presented in a combined manner through
a weighted average (taken according to the workload of the kernels), as if they
continued to be a single kernel. According to NCU reports, computing
same-direction derivatives represents 42% of the original Sig kernel workload
and cross-derivatives make up the remaining 58%; the results of the two new
kernels are therefore weighted by these percentages, summed, and divided by
100 to yield the final combined outcome.
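The combination rule amounts to a percent-weighted average; as a short sketch (with hypothetical metric values):

```python
def combine_fissioned(metric_same, metric_cross, w_same=42.0, w_cross=58.0):
    """Merge a per-kernel metric from the two fissioned Sig kernels into one
    value: weight by workload share in percent, sum, then divide by 100."""
    return (w_same * metric_same + w_cross * metric_cross) / 100.0
```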
The results shown in both tables reveal that the optimized approaches were
able to reduce the number of cycles required for processing the two kernels and,
consequently, significantly reduce the execution time of the application.
For the V kernel, the L1 cache hit rate has increased with the application
of loop tiling in the Tiling and Sig Fission approaches, and the L2 hit rate in
both optimized approaches remained very close to that obtained by the Naive
version, as shown in the graph of Fig. 1(a). The graph in Fig. 1(b) shows that the
Sig kernel presented improvements in L1 cache hit rates, both with the isolated
application of loop tiling in the Tiling approach and with loop fission combined
with tiling in the Sig Fission approach. In L2, the Naive version obtained the
highest hit rate in this kernel, but with very little difference compared to the
two optimized versions.
The cache hit values obtained by the two optimized approaches on both
kernels converge with the previous results. The increase in cache hit rates allows
for more efficient use of data, reducing processing bottlenecks and allowing the
application to increase the use of the available processing power (increasing
FLOP/s performance and arithmetic intensity).
Fig. 1. L1 and L2 cache hit rates on kernels V (a) and Sig (b).
Fig. 2. Rate of issued warps every 100 cycles and amount of registers needed on kernels
V (a) and Sig (b).
The main advantage of the two optimized approaches in the V kernel was
reducing the cache miss rate, allowing more efficient use of data, and decreasing
the need for expensive memory accesses since the necessary data for processing
are now more frequently found in the cache memory. The reduction of waiting
bottlenecks caused by accesses to main memory allowed for better use of available
computational resources, increasing performance and arithmetic intensity. These
gains explain the superior performance of the Tiling and Sig Fission approaches
in this kernel if compared to the Naive approach, which obtained significantly
lower cache hit rates.
In the Sig kernel, the reduction in cache miss rate is also the main factor
responsible for the positive results of the Tiling and Sig Fission approaches.
Once again, the more efficient use of data and the reduced need for costly mem-
ory accesses allowed for better utilization of the hardware computational power.
However, the Sig Fission approach proved to be more efficient in reducing the
number of registers required per thread, which led to a higher rate of issued
warps. This higher warp issue rate also contributes to even better utilization
of the processing power offered by the hardware, making the application of
loop fission deliver superior results, even though the two optimized approaches
achieved very similar cache hit rates at both cache levels.
disabled. The runtime results presented in this section are averaged over five runs
conducted under identical conditions and parameters. The tests were conducted
in the same three-dimensional model with 820 elements in each dimension. Each
execution applied a simulation time of 6000 ms and a space order of 16 elements.
The values are summarized in Table 8.
The execution times of each approach and their respective percentage gains
compared to the Naive version can be found in Table 9. Once again, similar to
the tests on a single GPU, the optimized approaches achieved gains compared
to the Naive version, with a stronger emphasis on the Sig Fission approach,
which achieves the most significant performance increase. This maintenance of
the performance gain can be interpreted positively, as it indicates that the com-
munication overhead and other computational costs of domain decomposition are
small compared to the effort employed to compute the workload. Thus, there is
an indication that the developed optimizations remain effective in more robust
models, with larger stencil size, longer simulation time, and domain shared across
multiple devices.
Table 9. Elapsed times for each version running on 4 GPUs. nx = 820, ny = 820,
nz = 820, tn = 6000, so = 16.
Table 10 presents the most relevant parameters used in modeling the tests
conducted in this work. Synthetic seismograms were generated with a Ricker
source placed at the center of the model. The receivers were positioned on the sur-
face for the pressure field, while for particle velocity seismograms, the receivers
were placed on the seafloor (OBN geometry). The tests were conducted using
a distributed computing approach facilitated by MPI, enabling domain decomposition
across four NVIDIA® Tesla® V100 SXM2 32 GB devices, with NVLink
disabled. The runtime results presented in this section are again averaged over
five runs conducted under identical conditions and parameters.
The approaches underwent a new tuning process to ensure that optimal tile
sizes were used in each of the kernels for this model. The dimension (16,4,4)
was used for the V kernel in both approaches, and the dimension (32,4,4) was
used for the Sig kernel in the Tiling approach as well as for the two new kernels
generated by the fission process in the Sig Fission approach.
Table 11 compares the execution times of the three tested approaches. The
results indicate a reduction in the total execution time of the operator in the
two optimized approaches, which were once again able to achieve performance
gains when compared to the Naive version.
Table 10. SEG/EAGE 3D Salt model parameters for multi GPU executions.
Table 11. Elapsed times for each approach running on 4 GPUs. nx = 400, ny = 400,
nz = 505, tn = 4000, so = 16.
4 Conclusions
Based on the results collected by the NCU profiler, as well as the average exe-
cution times obtained, it is possible to assert that the two developed approaches
were able to overcome some of the limitations of the Naive approach and deliver
positive results. Both loop transformation techniques used positively impacted
the application, with loop tiling taking advantage of the data locality principle
and making better use of cache memory, and loop fission increasing computa-
tional performance in the Sig kernel.
The Sig Fission approach is the one that achieves the best results by com-
bining the strategies that obtained the most expressive results in the two main
kernels of the operator: using loop tiling of dimensions (32, 4, 4) in the V kernel
and using loop fission separating the cross derivatives followed by the use of loop
tiling of dimensions (32, 4, 4) in the Sig kernels.
The replicated tests on multiple GPUs maintained a similar gain pattern
as those conducted on a single GPU, demonstrating the scalability of the pro-
posed solutions. The sustained performance gains in the experiments with the
SEG/EAGE 3D Salt model, albeit to a lesser extent, also emphasize the effective-
ness of the developed work in achieving improved performance, despite variations
in the computational mesh format and the utilized model.
References
1. Kukreja, N., Louboutin, M., Vieira, F., Luporini, F., Lange, M., Gorman, G.:
Devito: automated fast finite difference computation. In: 2016 Sixth International
Workshop on Domain-Specific Languages and High-Level Frameworks for High
Performance Computing (WOLFHPC), pp. 11–19. IEEE (2016)
2. Lange, M., et al.: Devito: towards a generic finite difference DSL using symbolic
python. In: 2016 6th Workshop on Python for High-Performance and Scientific
Computing (PyHPC), pp. 67–75. IEEE (2016)
1 Introduction
– Outer Product: This operation measures the interaction between two quantum
states. The computational complexity of this function within a high-dimensional
quantum computing simulator is related to the number of nested iteration cycles;
in this case there are two cycles, so the total number of operations to be
performed is 2n, where n is the number of elements of the matrix (1).
\[
|\Psi\rangle\langle\Phi| = \begin{pmatrix} a \\ b \end{pmatrix} \begin{pmatrix} c^{*} & d^{*} \end{pmatrix} = \begin{pmatrix} ac^{*} & ad^{*} \\ bc^{*} & bd^{*} \end{pmatrix}. \tag{1}
\]
– Tensor Product: This operation is used to measure the interaction between
two or more high-dimensional quantum states (2).
\[
\begin{pmatrix} a \\ b \end{pmatrix} \otimes \begin{pmatrix} d \\ e \end{pmatrix} = \begin{pmatrix} ad \\ ae \\ bd \\ be \end{pmatrix}, \tag{2}
\]
– Pauli Matrices and Their Generalizations: The Pauli matrices can be
generalized to high dimensions using the Weyl adjoint operator (3). This
allows for the generalization of the Pauli X, Y, and Z gates to high dimensions
(4).
\[
W_{q,p}^{\dagger} = \sum_{k=0}^{d-1} \omega^{kq} \, |k \oplus p\rangle\langle k| , \tag{3}
\]
\[
Z^{m} = W_{m,0}^{\dagger} = \sum_{k=0}^{d-1} \omega^{km} P_{k} \tag{6}
\]
The code associated with this gate has a complexity O(n³ + 2n²).
• Generalized Pauli matrix Y: To find the generalized Pauli Y matrix, both
variables q and p in the Weyl adjoint operator equation must be replaced by m,
and the result must also be multiplied by a global cyclic phase of i^(m%d) (7).

\[
Y^{m} = i^{(m\%d)} \, W_{m,m}^{\dagger} \tag{7}
\]
In summary, the four generalized gates within the high-dimensional computing
simulator are its most important feature, since they clearly show the
relationship between Weyl matrices and Pauli gates, providing a solid base
for building quantum circuits in which gates are applied in high dimensions.
• Generalized Quantum Fourier Transform: It is part of the set of gates that
are applied to a qubit and is implemented in the simulator as the generalized
quantum Fourier transform (GQFT). The code associated with this function
considers the order, whether direct or inverse, and has a quadratic complexity
O(n²), where n is the number of dimensions.
• CNOT: The CNOT gate lets one qubit control another: if the control qubit is
equal to one, the controlled qubit is negated. The matrix associated with the
CNOT gate is a 4 × 4 matrix, as shown in (8):
\[
CNOT(|1\rangle \otimes |0\rangle) =
\begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 1 & 0 \end{pmatrix}
\times
\begin{pmatrix} 0 \\ 0 \\ 1 \\ 0 \end{pmatrix}
=
\begin{pmatrix} 0 \\ 0 \\ 0 \\ 1 \end{pmatrix} \tag{8}
\]
• Generalized Controlled Gate: The quantum control gate for two
dimensions (CNOT) can be applied to more than one qubit through modular
addition. For example, if you have two qubits |x⟩ and |y⟩, where the
40 H. M. Zuluaga-Bucheli et al.
control qubit is |x⟩, the result would be equal to the following multiplication (9):
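The simulator itself is implemented in C++ with Eigen and CUDA (Sect. 3); the operations above can nevertheless be sketched in NumPy for illustration (the sign and dagger conventions for the Weyl operator are assumptions read off Eqs. (3) and (6)):

```python
import numpy as np

def outer_product(psi, phi):
    """Eq. (1): |psi><phi|, the outer product with the bra entries conjugated."""
    return np.outer(psi, np.conj(phi))

def tensor_product(u, v):
    """Eq. (2): the tensor (Kronecker) product of two state vectors."""
    return np.kron(u, v)

def weyl_dagger(d, q, p):
    """Eq. (3) under the convention assumed here:
    sum over k of omega**(k*q) |k (+) p><k|, with omega = exp(2*pi*i/d)."""
    omega = np.exp(2j * np.pi / d)
    W = np.zeros((d, d), dtype=complex)
    for k in range(d):
        W[(k + p) % d, k] = omega ** (k * q)
    return W

def generalized_z(d, m):
    """Eq. (6): Z^m = W(m, 0), a diagonal matrix of powers of omega."""
    return weyl_dagger(d, m, 0)

# Eq. (8): the 4 x 4 CNOT matrix
CNOT = np.array([[1, 0, 0, 0],
                 [0, 1, 0, 0],
                 [0, 0, 0, 1],
                 [0, 0, 1, 0]], dtype=complex)
```

For d = 2 these reduce to the familiar two-level gates; for example, generalized_z(2, 1) is the standard Pauli Z, diag(1, -1).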
3 Acceleration Methodology
The acceleration methodology for the high-dimensional QuantumSkynet quan-
tum computing simulator is meticulously designed to make effective use of the
Eigen library, multithreading, and CUDA [3,11]. The Eigen library, renowned
for its efficient matrix and vector operations, introduces concepts such as Lazy
Evaluation, Vectorization, and Broadcasting. Lazy Evaluation minimizes unnec-
essary computations by intelligently determining the need for temporary vari-
ables at compile time. Vectorization optimizes the simulator by transforming it
from a Single Instruction, Single Data (SISD) to a Single Instruction, Multiple
Data (SIMD) paradigm, effectively utilizing the SIMD capabilities of modern
CPUs for enhanced performance. Broadcasting, on the other hand, allows for
operations between matrices of different dimensions by conceptually expanding
the smaller matrix, thereby ensuring smooth and efficient computations.
In addition to the Eigen library, the methodology also incorporates multi-
threading to exploit the multicore architectures of modern CPUs. This enables
the simultaneous computation of multiple quantum states, thereby significantly
improving the scalability and performance of QuantumSkynet. Furthermore, the
integration of CUDA allows the simulator to leverage the computational power
of NVIDIA GPUs for highly parallel tasks such as large matrix and vector oper-
ations [9,10]. Although the CUDA memory management model presents unique
challenges, careful design and optimization of the QuantumSkynet simulator can
result in substantial performance gains.
Overall, this combination of techniques constitutes a comprehensive accel-
eration methodology that significantly enhances the computational efficiency of
high-dimensional quantum computing simulations in QuantumSkynet, making
it a powerful tool for exploring the complex world of quantum computing.
Acceleration of High-Dimensional Quantum Computing Simulator 41
3.2 Vectorization
This concept refers to converting a program that is by default implemented in
a SISD fashion into a SIMD vector implementation. In general, most computers
today have vector units that allow them to make better use of the resources
available to them.
To execute these SIMD-type functions, computational systems provide sets of
subroutines that issue the same operation to the different processing units
within the CPU, with each unit operating on different data, so that as little
single-threaded processing as possible is used.
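The SISD-to-SIMD shift can be illustrated by comparing an explicit scalar loop with its vectorized equivalent (a NumPy sketch, not the simulator's Eigen code):

```python
import numpy as np

def saxpy_scalar(a, x, y):
    """SISD style: one scalar multiply-add per loop iteration."""
    out = [0.0] * len(x)
    for i in range(len(x)):
        out[i] = a * x[i] + y[i]
    return out

def saxpy_vectorized(a, x, y):
    """SIMD style: one array expression; the same operation is applied to
    many elements at once using the CPU's vector units."""
    return a * np.asarray(x) + np.asarray(y)
```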
3.3 Broadcasting
When operating with matrices, their dimensions might differ, making an
elementwise operation between them conceptually impossible. To solve this
issue, the concept of array broadcasting arises: the dimension of the smaller
matrix is conceptually increased so that the two can be operated on. Eigen
implements this concept within its library to give developers fluidity.
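A minimal NumPy illustration of the same idea (Eigen's API differs, but the concept is identical):

```python
import numpy as np

# A (3, 3) matrix and a length-3 vector have different shapes, so their
# elementwise sum is conceptually undefined; broadcasting virtually expands
# the smaller operand (the vector) across the missing dimension.
M = np.arange(9.0).reshape(3, 3)
v = np.array([10.0, 20.0, 30.0])
S = M + v  # v is replicated across each row of M without copying memory
```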
4 Results
This section presents a comparative analysis of the performance of the acceler-
ated version of the QuantumSkynet simulator on two different hardware archi-
tectures: Intel and AMD. The analysis focuses on the speedup rates achieved
in various operations that are critical to the simulator due to their substantial
impact on its overall computational complexity. Speedup rates provide a measure
of performance improvement obtained with the accelerated version compared to
a non-accelerated one. A detailed examination of these rates can reveal how
effectively the accelerated simulator performs computations faster and, in cer-
tain instances, extends its storage capacity on different hardware architectures.
The specific results and discussions of speedup rates for each operation tested
on Intel and AMD architectures will be detailed in the following sections.
Sum of Probabilities. The test set generated for both versions is the same,
although the accelerated version was able to process up to 90% more data.
For the sum of probabilities without acceleration, the dimension interval
ranges from a vector of size 1 to 12,000. For the accelerated version, the
dimension range grows from 1 to 12,000,000, a thousandfold increase. The
maximum number of dimensions the non-accelerated version can support is
12,000, with a time of 41 seconds. The total throughput, considering the
maximum size and time of each test, is 99% higher for the accelerated version
on the Intel and AMD architectures with GPU.
Sum of Probabilities: The speedup rates for both Intel and AMD are significantly
higher than for other operations, especially for Intel. However, the variability
in speedup rates is also much higher, as indicated by the very long box for
Intel (see Fig. 1).
Fig. 1. Comparison of speedup rates for the Sum of Probabilities using different vector
dimensions on Intel (blue) and AMD (orange) architectures. The x-axis represents the
logarithmic scale of dimensions, ranging from 50 to 12000, while the y-axis shows the
corresponding speedup rates. The graph offers insights into the relative performance
between the two architectures for this specific function. (Color figure online)
Vector Base. In its normal form, this function has a linear complexity O(n),
where n is the number of dimensions. In the tests, the non-accelerated function
took 8 seconds to generate vectors of 26,000,000 dimensions. In contrast, the
accelerated version reached a maximum of 57 million dimensions in the same 8 s.
This means that the accelerated version roughly doubles the number of dimensions
with which it can operate, representing an acceleration and performance
improvement of 46%.
Speedup Rate Base Vector: The speedup rates for Intel are generally lower than
for AMD, and there’s a large variability in the Intel speedup rates, as indicated
by the long box in the figure (see Fig. 2).
Fig. 2. Comparison of speedup rates for vector-based calculations across various dimen-
sions (logarithmic scale) using Intel and AMD architectures. The x-axis represents the
vector dimensions in a logarithmic scale, ranging from 50 to 100000, allowing for direct
comparison across a wide range of sizes. The y-axis indicates the speedup rate, with the
blue line depicting the performance for Intel and the orange line for AMD. The graph
illustrates the relative efficiencies of these architectures in handling vector calculations.
(Color figure online)
Product Vector × Matrix. This test measures the processing capacity of the
product function between a vector and a matrix. It was first performed for the
non-accelerated version, with a set of matrices where the maximum processing
capacity covered operations on vectors and matrices of dimension 1 to 120.
Each dimension is represented by a vector of size n and a matrix of size n × n.
The same dataset was used to test the accelerated version, which proved able to
operate with a far higher number of dimensions: whereas the non-accelerated
version reached a maximum dimension of 120 with a runtime of 0.05 s, the
accelerated version handled a maximum of 15,000 dimensions in 4.84 s. This
represents a performance improvement of 117%.
Fig. 3. Comparison of speedup rates for the Matrix-Matrix Product operation using
different matrix dimensions on Intel (blue) and AMD (orange) architectures. The x-axis
represents the logarithmic scale of matrix dimensions, ranging from 2 to 1592, while
the y-axis depicts the corresponding speedup rates. The graph illustrates the behavior
of the function, shedding light on the efficiency and performance characteristics of both
architectures for this particular computation. (Color figure online)
implying a storage improvement of 24%. The corresponding runtime is 15 s. In
general, the acceleration of this operation achieves two specific goals:
performance improvement in both time and space.
External Product: The speedup rates for Intel are generally higher than for
AMD, but there’s a large variability in the Intel speedup rates, as indicated by
the height of the box in the figure (see Fig. 4).
Fig. 4. Comparison of speedup rates for the External Product operation using dif-
ferent vector dimensions on Intel (blue) and AMD (orange) architectures. The x-axis
represents the logarithmic scale of vector dimensions, ranging from 2 to 15092, while
the y-axis highlights the corresponding speedup rates. The graph demonstrates the
comparative performance of both architectures, illuminating the complex relationships
between dimensionality and computation efficiency in the context of External Product
calculations. (Color figure online)
Kronecker Product. Testing of the Kronecker product [12] was performed for the
non-accelerated version with arrays of dimensions from 5 × 5 to 125 × 125, with
a maximum time of 22 s. In the accelerated version, it was possible to expand
the maximum size of the dimensions from 5 × 5 to 150 × 150, with a time of 8 s
for the largest case. The improvement obtained in space is 17% and in time 36%.
This represents 218% of the total acceleration obtained by the accelerated
version on an Intel architecture with GPU.
Kronecker Product: The speedup rates for Intel are generally higher than for
AMD, with a larger variability in the Intel speedup rates, as indicated by the
long box in the figure (see Fig. 5).
Fig. 5. Comparison of speedup rates for the Kronecker Product operation using Intel
(blue) and AMD (orange) architectures across different vector dimensions. The x-axis
illustrates the logarithmic scale of vector dimensions, ranging from 5 to 125, while the
y-axis delineates the associated speedup rates. This representation offers an insightful
analysis of the behavior and efficiency of both architectures in Kronecker Product
computations, accentuating the interplay between dimensionality and performance.
(Color figure online)
Several lines of research remain open that could extend the scope of this
project, for example in areas such as high-performance computing, quantum
computing, high-dimensional quantum machine learning, and drug discovery using
quantum algorithms. Some work based on QuantumSkynet that could be carried out
in the future includes:
References
1. André, T., Sjöqvist, E.: Dark path holonomic qudit computation. Phys. Rev. A
106(6), 062402 (2022). https://journals.aps.org/pra/abstract/10.1103/PhysRevA.
106.062402
2. Avramouli, M., Savvas, I.K., Vasilaki, A., Garani, G.: Unlocking the potential of
quantum machine learning to advance drug discovery. Electronics (Switzerland)
12(11), 2402 (2023). Multidisciplinary Digital Publishing Institute. https://www.
mdpi.com/2079-9292/12/11/2402
3. Criado-Ramón, D., Ruiz, L.B.G., Pegalajar, M.C.: CUDA-bigPSF: an optimized
version of bigPSF accelerated with graphics processing unit. Expert Syst. Appl.
230, 120661 (2023). https://doi.org/10.1016/j.eswa.2023.120661
4. Daley, A.J., et al.: Practical quantum advantage in quantum simulation. Nature
607(7920), 667–676 (2022). https://www.nature.com/articles/s41586-022-04940-6
1 Introduction
With the increasing popularity of demand response programs, large consumers
face the challenge of adopting intelligent planning strategies to effectively lever-
age the new business opportunities presented in the electricity market [14,24].
However, the integration into the electricity market can introduce complexities if
the potential impacts on normal operations are not carefully considered. In this
context, conducting a comprehensive multi-objective analysis becomes crucial
for both automated decision-making and post-evaluation processes.
Multi-objective scheduling plays a crucial role in maximizing the capabilities
of computing infrastructures while simultaneously meeting business needs, and
it has been extensively studied in the literature
[6,11,13,20]. By leveraging the concept of the Pareto front, it becomes possible to
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024
C. J. Barrios H. et al. (Eds.): CARLA 2023, CCIS 1887, pp. 50–65, 2024.
https://doi.org/10.1007/978-3-031-52186-7_4
Multi-objective Power and QoS Demand Response in Datacenters 51
narrow down the solution space to high-quality solutions. This enables effective
comparison of various techniques and facilitates accurate trade-off decisions.
This article focuses on the challenge of minimizing two objectives in data-
center operation: the power consumption and the quality of service (QoS) degra-
dation. Considering the inherent conflict between these objectives, the solution
to this problem is a Pareto front that encompasses a range of trade-off values.
To solve the considered problem, the proposed approach combines an on-off
energy-aware scheduling strategy with efficient greedy list-scheduling heuristics
[9,12,21].
The manuscript is organized as follows: Sect. 2 provides a detailed descrip-
tion of the multi-objective problem addressed in this study. Subsequently, Sect. 3
presents a review of related work in the field. Section 4 outlines the methodology
employed to resolve the problem. The experimental evaluation of the proposed
methodology is discussed in Sect. 5. Finally, Sect. 6 concludes the article by sum-
marizing the key findings and outlining potential directions for future research.
of the datacenter's participation in the response event. A better option for
characterizing the datacenter's participation in the response event is
multi-criteria analysis. Applying a multi-objective (Pareto) comparison provides
a more comprehensive perspective on the interplay and trade-offs between
monetary gains and provided QoS, which can severely affect the reputation of
the business and have severe negative implications for it.
The multi-objective analysis presented in this article provides decision-
makers with a broad range of trade-off levels between power reduction and
quality of service, in order to meet their specific business needs in different
situations. Furthermore, the analysis allows for the consideration of several
relevant factors such as: i) datacenter reputation, since knowledge about the
impact on QoS helps prevent the image of the datacenter from being damaged;
ii) sustainability, allowing ecological goals to be weighted and even
prioritized over monetary income (profit); and iii) emergency situations, where
reducing power consumption is mandatory to guarantee a proper electricity
supply for other critical facilities.
Our previous work addressed the problem of demand response in multi-tenant
datacenters that operate in colocation mode [16,19]. The analysis was conducted
at two levels. On the higher level, the datacenter negotiates with its tenants to
achieve the best price for power consumption reduction. On the lower level, ten-
ants offered a price based on their convenience in terms of their profits, applying
single-objective optimization evaluated through simulation.
This article focuses on the optimization performed by a tenant that owns
its own computational infrastructure and serves multiple clients, following the
enterprise datacenter model [25]. Instead of applying a single-objective approach
(only considering profit), a multi-objective analysis is developed. The multi-
objective analysis takes into account the power consumption reduction and the
degradation of service quality as objective functions. The next subsection defines
the specific problem addressed in this article.
2.3 Formulation

[penalty cost]   \min \sum_{j=1}^{N} \mathbf{1}_{f_j > d_j} \, c_j   (1)
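As a concrete illustration, Eq. (1) sums the penalty cost c_j only over tasks whose finish time f_j exceeds the deadline d_j. A minimal sketch follows; the tuple layout and helper name are illustrative assumptions, not the authors' code:

```python
# Sketch of the penalty-cost objective in Eq. (1): sum the penalty c_j
# over tasks whose finish time f_j exceeds the deadline d_j. The tuple
# layout (finish, deadline, cost) is an illustrative assumption.

def penalty_cost(tasks):
    """Total penalty paid for tasks that miss their deadline."""
    return sum(c for (f, d, c) in tasks if f > d)

# Three tasks as (finish_time, deadline, penalty_cost); two are late.
tasks = [(5, 10, 3.0), (12, 10, 2.5), (8, 6, 1.0)]
print(penalty_cost(tasks))  # 3.5
```

A scheduler participating in the demand response event would search for a task ordering that minimizes this quantity under the power reduction constraint.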
3 Related Work
effective in improving profit and reducing the risk of power system overflow. In
the model by Chen et al. [4], the operator negotiates with tenants to reduce their
power consumption, while tenants aim to maximize their profits. An iterative
evaluation of a supply function mechanism is used, offering monetary incen-
tives per power unit reduced to all tenants equally until the target reduction
is achieved. The monetary penalty function on tenants is defined according
to queuing theory. The model was proven to achieve optimal solutions under
certain assumptions. A small case study with three tenants showed that the pro-
posed mechanism achieved solutions close to optimal. In the approach by Tran
et al. [28], tenants are encouraged to reduce their electricity consumption. The
model combines game theory, exact and heuristic optimization methods, and a
theoretical supply function. The model considered a theoretical quadratic cost
function to model the use of other energy generation sources.
Regarding multi-objective analysis, Tchernykh et al. [27] studied several
online scheduling heuristics in cloud computing to optimize the provider income,
related to the service level agreement (and therefore to QoS degradation), and
power consumption. The best results, according to the Pareto dominance analy-
sis, were achieved by the strategy of allocating tasks to servers with the minimum
power consumption among the available options. Stavrinides et al. [26] proposed
an energy-efficient and QoS-aware scheduling approach for cloud computing. The
strategy is based on Dynamic Voltage and Frequency Scaling (DVFS) to min-
imize power consumption. The approach solely considered the number of jobs
that fail to meet their deadlines. The proposed strategy outperformed an early
deadline first heuristic across all conducted experiments.
Our previous articles [17,19] studied the participation of datacenters in
demand response programs, applying a two-level planning approach that esti-
mated the power consumption of task execution and the cooling system. The
model was extended by considering multiple power reduction targets during a
demand response event and bio-inspired metaheuristic algorithms for optimiza-
tion [10].
4 Resolution Method
– First Come First Served (FCFS): tasks are ordered by arrival time, in ascend-
ing order.
– Last In First Out (LIFO): tasks are ordered by arrival time, in descending
order.
– Early deadline first (EDF): tasks are ordered by deadline, in ascending order.
– High penalty first (HPF): tasks are ordered by penalty cost, in descending
order.
– Late deadline first (LDF): tasks are ordered by deadline, in descending order.
– the best trade-off solution, defined as the closest point to the ideal vector,
which represents a decision based on equally weighting both objectives. The
ideal vector is a non-realistic solution with the best values for power con-
sumption and penalty cost, computed from the extreme values of the global
Pareto front. The trade-off solution is defined by \arg\min_{p_i \in P} ||p_i - id||, where
id is the ideal vector and P is the set of non-dominated solutions computed
for all heuristics, both normalized to equally weigh both objectives. The ideal
vector is different for each problem scenario solved.
– the best power consumption solution, i.e., the computed solution with the
minimum power consumption.
– the best penalty cost solution, i.e., the computed solution with the minimum
penalty cost.
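The trade-off selection described above can be sketched as follows; the normalization details and the illustrative front values are assumptions, not the authors' implementation:

```python
# Sketch of picking the best trade-off solution: normalize both
# objectives to [0, 1], take the normalized ideal vector (best value in
# each objective), and return the point closest to it.
import math

def trade_off(front):
    """front: list of (power_consumption, penalty_cost) points, minimized."""
    lo = [min(p[i] for p in front) for i in (0, 1)]
    hi = [max(p[i] for p in front) for i in (0, 1)]

    def norm(p):
        # Guard against a degenerate objective with a single value.
        return tuple((p[i] - lo[i]) / ((hi[i] - lo[i]) or 1.0) for i in (0, 1))

    ideal = (0.0, 0.0)  # normalized ideal vector: best value per objective
    return min(front, key=lambda p: math.dist(norm(p), ideal))

front = [(2.0, 9.0), (3.0, 4.0), (5.0, 1.0)]  # illustrative Pareto front
print(trade_off(front))  # (3.0, 4.0)
```

The two extreme points are each at distance 1 from the ideal vector after normalization, so the balanced middle point wins.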
The experimental analysis was performed on the high performance comput-
ing platform of the National Supercomputing Center (Cluster-UY), Uruguay [22].
The reported results correspond to representative independent executions of the
proposed scheduling strategies for the considered problem scenarios. For the com-
parative analysis, the results computed using the FCFS heuristic are used as a
reference baseline, since it represents the simplest and most traditional method
for scheduling in datacenters under a business-as-usual scenario.
Since the true Pareto front of the problem is unknown, all metrics were
calculated considering the global Pareto front, built by gathering all the non-
dominated solutions found in all independent executions performed for each
scheduling strategy. This is a common approach for evaluating multi-objective
optimization metrics for real world problems [5].
Metrics were calculated using the pfevaluator library (https://pypi.org/project/
pfevaluator/). This library contains a useful set of tools for evaluating multi-
objective optimization algorithms.
This subsection reports and analyzes the results of the proposed heuristics. The
analysis focuses on the comparison of the studied multi-objective optimization
metrics to identify the usefulness of each heuristic in different situations.
Multi-objective Power and QoS Demand Response in Datacenters 59
According to Table 1, LIFO was the best heuristic regarding ER and GD,
except for ER in the medium scenario, where EDF was the best. However, the
reason LIFO outperforms EDF in GD despite having less coverage of the global
Pareto front is that LIFO solutions are closely aligned with the global Pareto
front, whereas EDF solutions tend to deviate in other areas. HPF was the second-
best heuristic regarding GD, indicating that solutions close to the Pareto front
were computed. Regarding the ONVG metric, all strategies obtained similar values.
Fig. 2. Computed Pareto fronts for the small scenario (Color figure online)
Figure 2 reveals that the solutions obtained by FCFS and EDF are farther
away from the ideal vector compared to other heuristics, especially at medium
and low power consumption. However, as the power consumption values increase
and the penalty costs decrease accordingly, FCFS and EDF solutions gradu-
ally approach the global Pareto front. EDF solutions outperformed the other
heuristics in solutions close to 3 kW of consumption. This behavior occurs because
FCFS and EDF prioritize tasks with long wait times in the pending queue.
Thus, only a few tasks are completed within the specified deadlines when com-
puting resources are scarce. FCFS and EDF are optimistic in the sense that
they assume that tasks that have been waiting longer can still be executed on
time, even if they are already delayed. Pessimistic heuristics (LIFO and LDF)
prioritize recently arrived tasks with a high probability of being completed on
time, meeting their deadlines. The pessimistic approach is more favorable for
meeting deadlines of certain tasks when computing resources are scarce
due to power consumption restrictions. However, it must be considered that
some clients may be adversely affected by long waiting times. The HPF strat-
egy, which prioritizes tasks with high penalty cost, computed well-balanced
solutions that outperformed pessimistic strategies at low power consumption
and optimistic strategies at high power consumption.
Figure 3 presents the Pareto fronts computed by each proposed scheduling
strategy for the medium scenario.
Similar to the small scenario, Fig. 3 shows that pessimistic strategies computed
better results at low power consumption levels. EDF outperformed the other
strategies at medium power consumption levels, in contrast to the results in
the small scenarios. FCFS also had better results than in the small scenario at
medium power consumption levels. The differences with results computed in the
small scenario are because the load on the computing infrastructure is lower. In
the medium scenario the task-to-server ratio is 153, whereas in the small scenario
it is 227. Considering that the tasks and servers have similar characteristics in
both scenarios and the time horizon is the same, the observed improvement for
optimistic strategies is explained by the difference in the task-to-server ratio.
Figure 4 shows the Pareto fronts computed by each proposed heuristic for the
large scenario. The behavior was similar to the small scenario, where pessimistic
heuristic solutions were better than the other heuristics at low and medium power
consumption levels. EDF solutions are better at high power consumption levels.
Table 2. Trade-off, best power consumption and best penalty cost solutions
Results in Table 2 indicate that LIFO obtained the best value of power con-
sumption for trade-off solutions in medium and large scenarios, and was the
second-best heuristic in the small scenario. Additionally, LIFO achieved accept-
able values of penalty cost for trade-off solutions in all scenarios. FCFS obtained
the best values of penalty cost for trade-off solutions in small and large scenarios,
but with high power consumption in all scenarios. Regarding the best power con-
sumption solutions, all strategies obtained the same value on power consumption
but LIFO and LDF obtained the best penalty cost in all scenarios, outperform-
ing FCFS by 12% in the small scenario, 21% in the medium scenario, and 17%
in the large scenario. Relative improvements show that pessimistic heuristics
obtained better results than the other heuristics at low power consumption lev-
els. Regarding the best penalty cost solutions, no significant differences were
observed among the computed solutions, except for the EDF heuristic in the
medium scenario (EDF improved 6% over FCFS). This improvement indicates
that, under low load of the computing infrastructure, the EDF heuristic obtains
solutions with the same penalty cost as the other heuristics while using 6% less
energy than FCFS, which is the strategy that achieves the second-best penalty
cost value.
The notebook with the complete Python code used for computing the studied
metrics and elaborating the figures is publicly available at https://colab.research.
google.com/drive/1lrkHP8u9aWaPdgmRIIf0LJ2aPbM99Rpc?usp=sharing.
References
1. Assad, U., et al.: Smart grid, demand response and optimization: a critical review
of computational methods. Energies 15(6) (2022)
2. Bahrami, S., Wong, V., Huang, J.: Data center demand response in deregulated
electricity markets. IEEE Trans. Smart Grid 10(3), 2820–2832 (2019)
3. Cao, X., Zhang, J., Poor, V.: Data center demand response with on-site renewable
generation: a bargaining approach. IEEE/ACM Trans. Network. 26(6), 2707–2720
(2018)
4. Chen, N., Ren, X., Ren, S., Wierman, A.: Greening multi-tenant data center
demand response. Perform. Eval. 91, 229–254 (2015)
5. Coello, C., Van Veldhuizen, D., Lamont, G.: Evolutionary Algorithms for Solving
Multi-objective Problems. Kluwer Academic, New York (2002)
64 J. Muraña and S. Nesmachnow
6. Cui, Y., Geng, Z., Zhu, Q., Han, Y.: Review: multi-objective optimization methods
and application in energy saving. Energy 125, 681–704 (2017)
7. Dai, Y., Gao, Y., Gao, H., Zhu, H.: Real-time pricing scheme based on Stackelberg
game in smart grid with multiple power retailers. Neurocomputing 260, 149–156
(2017)
8. Feitelson, D., Tsafrir, D., Krakov, D.: Experience with using the parallel workloads
archive. J. Parallel Distrib. Comput. 74(10), 2967–2982 (2014)
9. Guo, C., Luo, F., Cai, Z., Dong, Z.: Integrated energy systems of data centers and
smart grids: state-of-the-art and future opportunities. Appl. Energy 301 (2021)
10. Iturriaga, S., Muraña, J., Nesmachnow, S.: Bio-inspired negotiation approach for
smart-grid colocation datacenter operation. Math. Biosci. Eng. 19(3), 2403–2423
(2022)
11. Iturriaga, S., Dorronsoro, B., Nesmachnow, S.: Multiobjective evolutionary algo-
rithms for energy and service level scheduling in a federation of distributed data-
centers. Int. Trans. Oper. Res. 24(1–2), 199–228 (2016)
12. Iturriaga, S., Garcı́a, S., Nesmachnow, S.: An empirical study of the robustness
of energy-aware schedulers for high performance computing systems under uncer-
tainty. Commun. Comput. Inf. Sci. 143–157 (2014)
13. Iturriaga, S., Nesmachnow, S., Dorronsoro, B., Bouvry, P.: Energy efficient schedul-
ing in heterogeneous systems with a parallel multiobjective local search. Comput.
Inform. 32(2), 273–294 (2013)
14. Lu, X., Li, K., Xu, H., Wang, F., Zhou, Z., Zhang, Y.: Fundamentals and business
model for resource aggregator of demand response in electricity markets. Energy
204, 117885 (2020)
15. Meng, F.L., Zeng, X.J.: A Stackelberg game-theoretic approach to optimal real-
time pricing for the smart grid. Soft. Comput. 17(12), 2365–2380 (2013)
16. Muraña, J., Nesmachnow, S., Iturriaga, S., Montes de Oca, S., Belcredi, G.,
Monzón, P., Tchernykh, A.: Two level demand response planning for retail multi-
tenant datacenters. In: 18th International Conference on High Performance Com-
puting and Simulation, pp. 1–8 (2021)
17. Muraña, J., Nesmachnow, S., Iturriaga, S., Montes de Oca, S., Belcredi, G.,
Monzón, P., Shepelev, V., Tchernykh, A.: Negotiation approach for the partici-
pation of datacenters and supercomputing facilities in smart electricity markets.
Program. Comput. Softw. 46, 636–651 (2020)
18. Muraña, J., Nesmachnow, S., Armenta, F., Tchernykh, A.: Characterization, mod-
eling and scheduling of power consumption of scientific computing applications in
multicores. Clust. Comput. 22(3), 839–859 (2019)
19. Muraña, J., Nesmachnow, S.: Simulation and evaluation of multicriteria plan-
ning heuristics for demand response in datacenters. SIMULATION 99(3), 291–310
(2021)
20. Nesmachnow, S.: Parallel multiobjective evolutionary algorithms for batch schedul-
ing in heterogeneous computing and grid systems. Comput. Optim. Appl. 55(2),
515–544 (2013)
21. Nesmachnow, S., Dorronsoro, B., Pecero, J., Bouvry, P.: Energy-aware scheduling
on multicore heterogeneous grid computing systems. J. Grid Comput. 11(4), 653–
680 (2013)
22. Nesmachnow, S., Iturriaga, S.: Cluster-UY: collaborative scientific high perfor-
mance computing in Uruguay. In: Torres, M., Klapp, J. (eds.) ISUM 2019. CCIS,
vol. 1151, pp. 188–202. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-38043-4_16
23. Nesmachnow, S., Perfumo, C., Goiri, Í.: Holistic multiobjective planning of data-
centers powered by renewable energy. Clust. Comput. 18(4), 1379–1397 (2015)
24. Porteiro, R., Nesmachnow, S., Moreno-Bernal, P., Torres-Aguilar, C.: Computa-
tional intelligence for residential electricity consumption assessment: detecting air
conditioner use in households. Sustain. Energy Technol. Assess. 58, 103319 (2023)
25. Snevely, R.: Enterprise Data Center Design and Methodology. Pearson, London
(2002)
26. Stavrinides, G., Karatza, H.: An energy-efficient, QoS-aware and cost-effective
scheduling approach for real-time workflow applications in cloud computing sys-
tems utilizing DVFS and approximate computations. Futur. Gener. Comput. Syst.
96, 216–226 (2019)
27. Tchernykh, A., Lozano, L., Schwiegelshohn, U., Bouvry, P., Pecero, J., Nesmach-
now, S.: Bi-objective online scheduling with quality of service for IAAS clouds. In:
IEEE 3rd International Conference on Cloud Networking, pp. 307–312 (2014)
28. Tran, N., Pham, C., Ren, S., Han, Z., Hong, C.: Coordinated power reduction
in multi-tenant colocation datacenter: an emergency demand response study. In:
IEEE International Conference on Communications, pp. 1–6 (2016)
29. Wang, Y., Lin, X., Pedram, M.: A Stackelberg game-based optimization framework
of the smart grid with distributed PV power generations and data centers. IEEE
Trans. Energy Convers. 29(4), 978–987 (2014)
30. Zhang, Y., Paschalidis, I., Coskun, A.: Data center participation in demand
response programs with quality-of-service guarantees. In: Proceedings of the 10th
ACM International Conference on Future Energy Systems (2019)
Enhancing the Sparse Matrix Storage
Using Reordering Techniques
1 Introduction
2 Background
This section briefly introduces basic concepts such as sparse matrix storage for-
mats, the Reverse Cuthill-McKee algorithm for reordering, and the index com-
pression technique known as delta encoding.
Sparse matrix storage formats are strategies that avoid storing void information
of the sparse matrices (i.e., null coefficients usually represented by zeros) by using
underlying structures which hold information to recover the position of the non-
null elements.
68 M. Freire et al.
The most straightforward idea (in the static context) is to store only the
non-null elements in a vector and store their row and column indices separately.
This format is known as Elementary or Coordinate (COO) [2].
One of the most popular formats to store matrices is the Compressed Sparse
Row (CSR) [21]. This format is similar to COO because it holds only the non-
null values and their indices. CSR goes further in the compression, ordering the
non-null elements row-wise in the values vector and storing the index where each
row starts. If one keeps elements column-wise, the same idea gives place to the
Compressed Sparse Column (CSC) format.
In both cases, each floating-point value needs extra information storage to
recover its position in the matrix. In COO, the memory overhead is 8B per
element, and in CSR it is 4B per element plus 4B per row. In general, indices
occupy a big part of the memory used to store the matrices, so many formats
focus on reducing that overhead.
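The COO and CSR layouts above can be made concrete with a small sketch; the helper functions are illustrative, not a library API:

```python
# Sketch of COO and CSR for a small matrix. COO stores one (row, col)
# pair per non-null (8B of index data per element with 4-byte ints);
# CSR stores one column index per element plus one row pointer per row
# (4B per element plus 4B per row).

def to_coo(dense):
    vals, rows, cols = [], [], []
    for i, row in enumerate(dense):
        for j, v in enumerate(row):
            if v != 0:
                vals.append(v); rows.append(i); cols.append(j)
    return vals, rows, cols

def to_csr(dense):
    vals, cols, row_ptr = [], [], [0]
    for row in dense:
        for j, v in enumerate(row):
            if v != 0:
                vals.append(v); cols.append(j)
        row_ptr.append(len(vals))  # index where the next row starts
    return vals, cols, row_ptr

A = [[1, 0, 2],
     [0, 0, 3],
     [4, 5, 0]]
print(to_csr(A))  # ([1, 2, 3, 4, 5], [0, 2, 2, 0, 1], [0, 2, 3, 5])
```

Sorting the values row-wise lets CSR replace the per-element row index of COO with a single pointer per row, which is where the compression comes from.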
Some sparse formats admit the storage of some zeros to mitigate the irregu-
larity in the memory accesses, which damages the performance of sparse routines
in massively parallel processors like GPUs. A typical example is the ELLPACK
storage format [3] which represents the sparse matrix as two dense matrices of
size n × m, where n is the number of rows in the matrix, and m is the number of
non-null elements in the longest row. One of these matrices stores the non-null coefficients
plus the zero padding, and the other stores the column index corresponding to
each coefficient.
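The ELLPACK layout described above can be sketched as follows; padding entries here reuse column index 0 with a zero value, which is one common convention among several:

```python
# Sketch of ELLPACK: the sparse matrix becomes two dense n x m arrays,
# one with the non-null values (zero padded) and one with the matching
# column indices, where m is the longest row's non-null count.

def to_ell(dense):
    nnz_cols = [[j for j, v in enumerate(row) if v != 0] for row in dense]
    m = max(len(c) for c in nnz_cols)
    vals = [[dense[i][j] for j in cols] + [0] * (m - len(cols))
            for i, cols in enumerate(nnz_cols)]
    idx = [cols + [0] * (m - len(cols)) for cols in nnz_cols]
    return vals, idx

A = [[1, 0, 2],
     [0, 3, 0]]
print(to_ell(A))  # ([[1, 2], [3, 0]], [[0, 2], [1, 0]])
```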
If the matrix has its non-null elements grouped in clusters, a sparse format
could store just one pair of coordinates for each cluster (or block) and calculate
the rest. The set of layouts that use this strategy are known as blocked formats,
and one of the most popular is the Blocked CSR (BCSR) [15] which divides
the matrix into dense blocks of fixed size, treating it as a matrix of blocks and
applying CSR to index the blocks. The efficiency of a blocked format depends on
the sparsity pattern of the matrix it is storing. Other examples of arrangements
that use blocking techniques are bmSparse, Sliced ELL (SELL-C), or Column
Diagonal Storage [1,4,5,13,24].
In recent years much effort has been directed to reducing the weight of index
storage. In that line of work, Tang et al. proposed a family of efficient compression
schemes called bit-representation optimized (BRO) [22]. They introduced two
techniques, BRO-ELL and BRO-COO, based on ELL and COO formats. In
addition, they presented BRO-HYB, which is analogous to HYB because it stores
the regular part in BRO-ELL and the irregular on BRO-COO.
An interesting idea to reduce the memory used to store indices in sparse matrices
is to compress them by storing the distance to the previous element of the row
instead of the column index itself. This technique is known as delta encoding [23].
In the worst case, the bits needed to store a distance can be as many as those
used for the index (i.e., the index is the distance from the element (r, 0)). On
the other hand, in many cases, the coefficients in a row are considerably closer,
and we can store them in smaller data types, such as int16 or int8 (instead of
int32 ).
The main advantage is that this strategy is not related (or opposed) to any
particular storage format. For example, CSR can use delta encoding to store the
column indices, while CSC could do so with row indices. Other layouts such as
ELL or COO can also use this strategy to arrange their column indices (or, in
COO, the row index if one uses column-major ordering).
The idea of applying delta encoding and other related techniques to com-
press the indexes of sparse matrices, accelerating the performance of the SpMV
in shared memory processors and GPUs was explored in [16,17].
3 Proposal
Considering the time spent fetching sparse matrix indexing data from memory
in kernels such as the SpMV, it is reasonable to expect a performance benefit
from storing these indices with only the necessary number of bits. The standard
integer size is 4 bytes, but, for example, one can store indices lower than 2^16 with
a uint16 data type. Even when only a part of the matrix indices can be kept
with fewer bytes, entire matrix rows can often be stored with the smaller data types.
In that case, we could use a sparse format that stores rows using different integer
sizes. As the processing of rows is independent in kernels such as the SpMV,
this could improve the processing performance for the part of the matrix stored
with a smaller integer size.
In [10] we analyzed the minimum integer size needed to store the matrix
indices for a large set of symmetric sparse matrices. We also evaluated the num-
ber of rows in each matrix that needs 8, 16, or 32 bits for their indices. We
used delta encoding to replace the column index in the CSR sparse format by
the delta to the next non-null in the row and studied the impact of reorder-
ing the matrix on that strategy. The results showed that using the Reverse
Cuthill-McKee (RCM) reordering heuristic in conjunction with delta encoding, a
significant number of matrices passed from the 32 bits category (those matrices
with at least one row that needs 32 bits) to the 16 bits category, and from the
16 bits category to the 8 bits category. Of all the 1407 matrices studied, only 56 matrices
ended in the 32 bits category after applying RCM and delta encoding.
RCM has proven to be a good heuristic to reduce the bandwidth of a matrix,
but, in doing so, it sometimes extends the distance between two contiguous
elements in the rows that end up with more bandwidth. The problem is that
RCM is not designed to reduce the distance between consecutive elements, but
to reduce the bandwidth. Reducing the bandwidth of the matrix brings non-null
elements closer to each other. Thus, we can expect a collateral improvement in
the delta. However, there are many cases where reducing the bandwidth increases
the deltas (e.g., TSOPF/TSOPF_FS_b300_c3) and many more where RCM improves
the deltas, but other strategies yield better results. In particular, other authors
have replaced RCM with other heuristics for specific purposes, see as an
example [18] where the objective is the number of diagonals.
Considering the above, we centered the effort on designing a strategy that
enhances the reordering returned by RCM. The proposal relies on a light search
strategy focused on improving the ordering of matrices that stayed in the 32 bits
category after applying RCM. In a previous study [20], the authors employed
similar ideas for bandwidth reduction.
The algorithm uses a random search in conjunction with a local search to
improve the permutation produced by RCM. To guide the search, we define a
metric of how far the matrix is from being storable in 16 bits. The
fitness function is defined as follows:
f(p) = \sum_{r \in \text{rows}} \#\text{away16}_r   (1)

where #away16_r is the number of elements of row r that are more than 2^16
positions away from the previous one (or from zero if it is the first element of
the row).
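Under this definition, f(p) can be sketched directly from the per-row column indices; the helper names are hypothetical, not the authors' code:

```python
# Sketch of the fitness f(p): for each row, count the deltas larger
# than 2**16 (distance to the previous element, or to column zero for
# the first element), and sum the counts over all rows.

def away16(row_cols):
    prev, count = 0, 0
    for c in row_cols:
        if c - prev > 2**16:
            count += 1
        prev = c
    return count

def fitness(rows):
    return sum(away16(r) for r in rows)

# Only the second row has an oversized delta (99995 > 2**16).
rows = [[10, 60000], [5, 100000], [1, 2, 3]]
print(fitness(rows))  # 1
```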
The proposal consists of two stages. In the first stage, the algorithm performs
several random steps, and in the second stage, it makes directed steps towards
improving the solution. The random steps consist of a permutation of n rows
chosen randomly, allowing the routine not to get stuck in a local optimum. We
fixed a boundary α for the deterioration of f to prevent the random steps from
worsening the solution too much.
The second stage is to improve the solution given by the first one. Ideally, the
improvement in this stage outweighs the other’s deterioration (if any). This stage
focuses on fixing the matrix’s “worst” problem by finding the element furthest
away from the previous coefficient. Once found, the procedure tries to make a
permutation that moves it to a distance less than 2^16 − 1, making it possible to
store the column index in 16 bits.
Naturally, this change can have an overall negative effect because it can
enlarge distances between other elements. The procedure evaluates if the change
worsens the solution and discards it in that case. If the algorithm discards
a permutation, it tries other permutations that make that delta storable in
16 bits. If none of the experimented permutations works, the procedure applies
a fixed permutation that makes the delta equal to one.
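The two stages can be summarized in a highly simplified skeleton; the swap count, the bound alpha, and the omission of the directed-step details are assumptions for illustration, not the authors' implementation:

```python
# Simplified skeleton of the two-stage search over row permutations.
# Stage 1 applies random row swaps, rejecting candidates that worsen
# the fitness beyond a tolerance alpha; Stage 2 (abstracted here) would
# target the worst delta directly. Keeps the best solution found so far.
import random

def two_stage_search(perm, fitness, iters=6000, n_swaps=4, alpha=1.10):
    best = list(perm)
    best_f = fitness(best)
    cand = list(best)
    for _ in range(iters):
        # Stage 1: random step bounded by alpha.
        trial = list(cand)
        for _ in range(n_swaps):
            i = random.randrange(len(trial))
            j = random.randrange(len(trial))
            trial[i], trial[j] = trial[j], trial[i]
        if best_f == 0 or fitness(trial) <= alpha * best_f:
            cand = trial
        # Stage 2 would go here: find the element furthest from its
        # predecessor and try permutations bringing the delta below 2**16.
        if fitness(cand) < best_f:
            best, best_f = list(cand), fitness(cand)
    return best, best_f
```

Because the best solution is recorded at every iteration, letting alpha grow over time never loses progress even when random steps deteriorate the candidate.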
4 Experimental Evaluation
In this section, we present the evaluation of our novel heuristic. For this work, we
chose a subset of five matrices that, after applying RCM and delta encoding,
still cannot be stored using deltas of 16 bits. We took this set of symmetric
matrices from the Suite Sparse Matrix Collection [7]. Table 1 shows the main
characteristics of the testing set.
Table 1. Matrices used for the experimental evaluation with their principal character-
istics.
As the primary goal of our heuristic is to increase the number of rows that
use fewer than 32 bits for the studied matrices, Table 2 presents the number of
rows and deltas (i.e., particular elements) that require 32 bits for their storage.
These characteristics can be considered the baseline with which we compare
the performance of the new heuristic. The table presents these metrics for the
original matrices and after applying the permutation given by RCM.
Even if the most important metric is the number of rows that need int32 ,
because it directly impacts the memory requirements of the matrix, the number
of deltas gives a more precise idea of the problem (at least considering our search
strategy). The table shows that applying RCM has an overall positive impact on
Table 2. Number of rows and deltas representable only in 32 bits for the original
matrices (right) and after applying RCM (left).
both metrics. The only exception was the matrix lp1. In this case, the number
of rows that require int32 grew significantly.
In the first experiment, we executed RCM and applied the proposed heuristic
to the permutations that resulted from the method. In these tests, we fixed the
number of iterations at n = 6000. Furthermore, we used the boundary α as a
parameter that grows with the iterations to avoid getting trapped in a local
optimum. Since α grows with the number of iterations, the random step can
severely deteriorate the candidate solution after many iterations. For this reason,
each iteration stores the best solution found so far.
Table 3 shows the number of rows and deltas that still require 32 bits after
applying our heuristic method. The results show that for two of the matrices
(ASIC 100k and boyd1), the proposed heuristic gets Nrows to zero, which means
that we can store the entire matrices with a smaller data type. On the other hand,
for the other three, where there are still rows that require int32 , we get a sizeable
reduction in the number of rows that require this data type.
Table 3. Number of rows and deltas representable only in 32 bits after applying the
heuristic
Id   Rows 32   Deltas 32
1          0           0
2        763         763
3          0           0
4    462,690     462,690
5    252,980     254,240
Table 4. Number of rows that need 32, 16 and 8 bits to store its indices after the
RCM.
Table 5. Number of rows that need 32, 16 and 8 bits to store its indices after the
application of the heuristic.
Concerning the 8 bits categories, the results are diverse, with the number
of 8-bit rows decreasing in some matrices but growing in others. In general,
the percentage of rows in that category is small with respect to the total. The
primary contribution of our heuristic is that it moves rows from the 32 bits
category to the 16 bits one. Figure 1 presents the same results in percentages
comparing the matrices before and after applying the proposed heuristic. In all
matrices, the first bar represents the results with only RCM and the second the
results after applying the proposed heuristic.
Fig. 1. Percentage of rows in each category for the evaluated matrices before and after
the application of the proposed heuristic.
References
1. Monakov, A., Lokhmotov, A., Avetisyan, A.: Automatically tuning sparse matrix-
vector multiplication for GPU architectures. In: Patt, Y.N., Foglia, P., Duester-
wald, E., Faraboschi, P., Martorell, X. (eds.) HiPEAC 2010. LNCS, vol. 5952, pp.
111–125. Springer, Heidelberg (2010). https://doi.org/10.1007/978-3-642-11515-8_10
2. Barrett, R., et al.: Templates for the Solution of Linear Systems: Building
Blocks for Iterative Methods. Society for Industrial and Applied Mathemat-
ics (1994). https://doi.org/10.1137/1.9781611971538
3. Bell, N., Garland, M.: Implementing sparse matrix-vector multiplication on
throughput-oriented processors. In: Proceedings of the Conference on High Per-
formance Computing Networking, Storage and Analysis, SC 2009. Association
for Computing Machinery, New York, NY, USA (2009). https://doi.org/10.1145/
1654059.1654078
4. Berger, G., Freire, M., Marini, R., Dufrechou, E., Ezzatti, P.: Unleashing the per-
formance of bmSparse for the sparse matrix multiplication in GPUs. In: Proceed-
ings of the 2021 12th Workshop on Latest Advances in Scalable Algorithms for
Large-Scale Systems (ScalA), pp. 19–26, November 2021
5. Berger, G., Freire, M., Marini, R., Dufrechou, E., Ezzatti, P.: Advancing on an
efficient sparse matrix multiplication kernel for modern GPUs. Concurr. Comput.
Pract. Experience 35, e7271 (2022). https://doi.org/10.1002/cpe.7271
6. Cuthill, E., McKee, J.: Reducing the bandwidth of sparse symmetric matrices.
In: Proceedings of the 1969 24th National Conference, pp. 157–172. ACM Press
(1969). https://doi.org/10.1145/800195.805928
7. Davis, T.A., Hu, Y.: The University of Florida sparse matrix collection. ACM Trans.
Math. Softw. 38(1), 1–25 (2011). https://doi.org/10.1145/2049662.2049663
8. Dufrechou, E., Ezzatti, P., Quintana-Ortı́, E.S.: Selecting optimal SPMV realiza-
tions for GPUs via machine learning. Int. J. High Perform. Comput. Appl. 35(3),
254–267 (2021). https://doi.org/10.1177/1094342021990738
9. Favaro, F., Oliver, J.P., Ezzatti, P.: Unleashing the computational power of FPGAs
to efficiently perform SPMV operation. In: 40th International Conference of the
Chilean Computer Science Society, SCCC 2021, La Serena, Chile, 15–19 November
2021, pp. 1–8. IEEE (2021). https://doi.org/10.1109/SCCC54552.2021.9650418
10. Freire, M., Marichal, R., Dufrechou, E., Ezzatti, P.: Towards reducing communica-
tions in sparse matrix kernels. In: Naiouf, M., Rucci, E., Chichizola, F., De Giusti,
L. (eds.) Cloud Computing, Big Data & Emerging Topics, JCC-BD&ET 2023.
CCIS, vol. 1828, pp. 17–30. Springer, Cham (2023). https://doi.org/10.1007/978-3-031-40942-4_2
11. George, A.: Computer implementation of the finite element method. Ph.D. the-
sis, Computer Science Department, School of Humanities and Sciences, Stanford
University, CA, USA (1971)
12. George, J.A., Liu, J.W.: Computer Solution of Large Sparse Positive Definite Sys-
tems. Prentice-Hall, Englewood Cliffs (1981)
13. Godwin, J., Holewinski, J., Sadayappan, P.: High-performance sparse matrix-
vector multiplication on GPUs for structured grid computations. In: The 5th
Annual Workshop on General Purpose Processing with Graphics Processing Units,
GPGPU-5, London, United Kingdom, 3 March 2012, pp. 47–56. ACM (2012)
14. Gómez, C., Mantovani, F., Focht, E., Casas, M.: Efficiently running SPMV on
long vector architectures. In: Proceedings of the 26th ACM SIGPLAN Symposium
on Principles and Practice of Parallel Programming, PPoPP 2021, pp. 292–303.
Association for Computing Machinery, New York, NY, USA (2021). https://doi.
org/10.1145/3437801.3441592
15. Choi, J.W., Singh, A., Vuduc, R.W.: Model-driven autotuning of sparse matrix-
vector multiply on GPUs. In: Proceedings of the 15th ACM SIGPLAN Symposium
on Principles and Practice of Parallel Programming (15th PPOPP 2010), pp. 115–
125. ACM SIGPLAN, Bangalore, India, January 2010
16. Karakasis, V., Gkountouvas, T., Kourtis, K., Goumas, G.I., Koziris, N.: An
extended compression format for the optimization of sparse matrix-vector multipli-
cation. IEEE Trans. Parallel Distributed Syst. 24(10), 1930–1940 (2013). https://
doi.org/10.1109/TPDS.2012.290, https://doi.org/10.1109/TPDS.2012.290
76 M. Freire et al.
17. Kourtis, K., Goumas, G.I., Koziris, N.: Optimizing sparse matrix-vector multipli-
cation using index and value compression. In: Ramı́rez, A., Bilardi, G., Gschwind,
M. (eds.) Proceedings of the 5th Conference on Computing Frontiers, 2008, Ischia,
Italy, 5–7 May 2008, pp. 87–96. ACM (2008). https://doi.org/10.1145/1366230.
1366244
18. Marichal, R., Dufrechou, E., Ezzatti, P.: Optimizing sparse matrix storage for the
big data era. In: Naiouf, M., Rucci, E., Chichizola, F., De Giusti, L. (eds.) Cloud
Computing, Big Data & Emerging Topics - 9th Conference, JCC-BD&ET, La
Plata, Argentina, 22–25 June 2021, Proceedings. Communications in Computer
and Information Science, vol. 1444, pp. 121–135. Springer, Cham (2021). https://
doi.org/10.1007/978-3-030-84825-5 9
19. de Oliveira, S.L.G., de Abreu, A.A.A.M.: An evaluation of pseudoperipheral vertex
finders for the reverse Cuthill-McKee method for bandwidth and profile reductions
of symmetric matrices. In: 37th International Conference of the Chilean Computer
Science Society, SCCC 2018, Santiago, Chile, 5–9 November 2018, pp. 1–9. IEEE
(2018). https://doi.org/10.1109/SCCC.2018.8705263
20. de Oliveira, S.L.G., Silva, L.M.: Low-cost heuristics for matrix bandwidth reduc-
tion combined with a hill-climbing strategy. RAIRO Oper. Res. 55(4), 2247–2264
(2021). https://doi.org/10.1051/ro/2021102
21. Saad, Y.: Iterative Methods for Sparse Linear Systems, 2nd edn. Society for Indus-
trial and Applied Mathematics (SIAM), Philadelphia (2003)
22. Tang, W.T., et al.: Accelerating sparse matrix-vector multiplication on GPUs using
bit-representation-optimized schemes. In: Proceedings of the International Confer-
ence on High Performance Computing, Networking, Storage and Analysis. ACM,
November 2013. https://doi.org/10.1145/2503210.2503234
23. Willcock, J., Lumsdaine, A.: Accelerating sparse matrix computations via data
compression. In: Proceedings of the 20th Annual International Conference on
Supercomputing, ICS 2006, pp. 307–316. Association for Computing Machinery,
New York, NY, USA (2006). https://doi.org/10.1145/1183401.1183444
24. Zhang, J., Gruenwald, L.: Regularizing irregularity: bitmap-based and portable
sparse matrix multiplication for graph data on GPUs. In: Proceedings of the 1st
ACM SIGMOD Joint International Workshop on Graph Data Management Experi-
ences & Systems (GRADES) and Network Data Analytics (NDA), GRADES-NDA
2018. Association for Computing Machinery, New York, NY, USA (2018). https://
doi.org/10.1145/3210259.3210263
Towards Fault Tolerance and Resilience
in the Sequential Codelet Model
1 Introduction
Computers and embedded systems play a significant role in the daily func-
tioning of critical infrastructure and systems in a variety of sectors, including
This research used resources at the Argonne Leadership Computing Facility, a DOE
Office of Science User Facility supported under Contract DE-AC02-06CH11357. This
research was also supported by the Exascale Computing Project (17-SC-20-SC), a
collaborative effort of the U.S. Department of Energy Office of Science and the
National Nuclear Security Administration. This work was partially supported by the
National Science Foundation, under award SHF-1763654, and by Petrobras, under
grant 2018/00347-4.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024
C. J. Barrios H. et al. (Eds.): CARLA 2023, CCIS 1887, pp. 77–94, 2024.
https://doi.org/10.1007/978-3-031-52186-7_6
78 D. A. R. Perdomo et al.
healthcare, transportation, finance, industry, and defense. The disruption or
failure of essential services in these sectors can have far-reaching consequences,
including endangering human life, harming the environment, or even destabilizing
the economy.
Supporting fault tolerance and resiliency in critical HPC systems has been
extensively studied [12,20,32,33]. Fault compartmentalization is one of the main
challenges, and it applies to both the cause and the impact of faults. When the
cause is compartmentalized, abnormal components can be isolated and corrective
actions can be taken based on their overall error history. When the impact is
well defined and compartmentalized, the system can recover from faults with
only minimal re-execution. In traditional von Neumann computational models,
compartmentalization usually requires domain-specific analysis and user
involvement to segregate portions of computation and monitor execution. The main
reason is that large-scale HPC applications usually entangle computation, data
management, and parallelism management without much distinction. Coupled with
the non-deterministic execution of threads at scale, compartmentalization can
hardly be achieved in an effective, efficient, and user-transparent way.
Implementing resiliency in parallel and distributed programs often involves
complex multi-layered system interactions, spanning the operating system,
runtime, application, and workload manager. In high-performance computing,
this kind of resiliency is often achieved through checkpointing, which helps
prevent the loss of computation progress in case of failure [5,12,26].
A program execution model (PXM) refers to the governing set of rules or con-
ceptual framework that dictates the execution of a computer program by a com-
puting system [8,30]. It defines the behaviors and interactions between the vari-
ous system components involved in executing a program, including the processor,
memory, operating system, and other system resources. Among various PXMs, the
Codelet PXM [16,37] describes computation based on a hybrid between dataflow
and von Neumann computation models. Codelets are units of computation that
can be executed when their dependencies are satisfied. Each codelet can be seen
as a task, defined by a collection of instructions tailored to a particular architec-
ture. Unlike other more widely-used task models (e.g., OpenMP, OmpSs, Legion,
etc.), codelets are pure functions whose outputs only depend on their inputs. To
support more complex programs, codelets interconnect with each other forming
codelet dependence graphs (CDG) that represent the data and control depen-
dencies between them. The creation of this graph is inspired by the theory of
dataflow. An evolution of the Codelet model is the Sequential Codelet Model
(SCM) [9,10,25]. SCM describes the CDG through sequential program semantics,
and it uses techniques inspired by instruction-level parallelism to achieve hetero-
geneous, parallel, and distributed execution of programs. The SCM is based on a
hierarchical view of the system. Each level separates scheduling and orchestration,
execution components, and memory management interfaces.
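The firing rule described above, in which a codelet executes once all of its data dependencies are satisfied, can be sketched in a few lines of Python. The names and the toy CDG below are illustrative only, not part of SCM or SCMUlate:

```python
from dataclasses import dataclass, field

@dataclass
class Codelet:
    """A pure unit of computation: outputs depend only on inputs."""
    name: str
    fn: callable                               # the computation itself
    deps: list = field(default_factory=list)   # names of producer codelets

def run_cdg(codelets):
    """Fire codelets whenever all of their data dependencies are satisfied."""
    results = {}
    pending = list(codelets)
    while pending:
        ready = [c for c in pending if all(d in results for d in c.deps)]
        assert ready, "cycle in the codelet dependence graph"
        for c in ready:
            results[c.name] = c.fn(*(results[d] for d in c.deps))
            pending.remove(c)
    return results

# toy CDG: codelets a and b feed codelet c
cdg = [
    Codelet("a", lambda: 2),
    Codelet("b", lambda: 3),
    Codelet("c", lambda x, y: x + y, deps=["a", "b"]),
]
final = run_cdg(cdg)["c"]   # 5
```

In the actual model the "ready" codelets may execute in parallel on different compute units; the sequential loop here only illustrates the dataflow firing rule.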
In this paper, we argue that the Codelet Program Execution Model naturally
provides an effective foundation for compartmentalization and fault tolerance,
deriving from the explicit encapsulation of computation and data in the Codelet
Model. This paper presents mechanisms that leverage PXM behavior to provide
The hierarchical organization of the SCM machine comes from two aspects.
First, a CU of a given level is a complete machine in the level below. Input and out-
put data dependencies (i.e., operands) are stored in the register file of the level that
schedules the codelet into a CU. Second, memory operations from the level below
interface with the registers of the level above. Hence the name Memory/Register
in Fig. 1. This latter property guarantees that memory operations are bounded in
latency between two levels, providing performance guarantees for code executing
at a given level. Notice that the higher the level in the hierarchy, the larger the
footprint of the system, resulting in larger memory sizes, increased latency,
and therefore larger codelets. The hierarchical organization is broken whenever,
at a particular level, a Compute Unit is realized by a particular architecture
(e.g., GPGPUs, CPUs, or FPGAs), representing a leaf of the architecture tree.
This enables heterogeneous execution of the Sequential Codelet Model on
commodity hardware, as presented in [10].
Finally, the Write Back phase of the five-stage pipeline manages synchronization,
notifying the SU that a given Codelet has finished execution and triggering
the activation of other Codelets.
in the context of the Sequential Codelet Model presented in Sect. 2. The
aforementioned table indicates, for each mechanism, the type of implementation
(in the PXM or as part of the program) and the resource allocation (static:
fixed resources; dynamic: resources assigned on demand; hybrid: a combination
of static and dynamic). Mechanisms are analyzed qualitatively in terms of their
overhead cost: hardware, runtime, time, or storage.
Fig. 6. SU Voting
4 Evaluation
In order to demonstrate some of the approaches mentioned in Sect. 3, we have
modified SCMUlate, an emulator of the Sequential Codelet Model previously
used in [10]. SCMUlate is a functional emulator for the first level of the
Sequential Codelet Model. It leverages commodity hardware found in current
systems as the compute units of Level 1. Codelets are executed on real hardware
compute units, while the rest of the machine is emulated in software. As a
result, the mechanisms for fetch, decode, memory, and write-back, as well as
the register file of Level 1, are fully emulated. On the application side,
SCMUlate interprets and executes programs written in the Level 1 assembly
language, which uses the same syntax as Listing 1.1. Parallelization is achieved
through out-of-order execution [9].
SCMUlate was extended to allow SU-driven codelet duplication, performing the
necessary register duplication and renaming to achieve N replicas of the same
codelet. Furthermore, a perfect voting mechanism was included that compares all
replicas in order to perform fault detection. In the presence of a recoverable
error, the SU performs the necessary actions to recover the computation.
Finally, a fault injection mechanism is included that injects codelet faults
based on two fault models: Poisson (i.e., constant error rate) and Weibull. The
fault injection mechanism is an implementation of the simulation strategy
presented in [22].
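A minimal sketch of such a fault injector, assuming (as is common practice) that the Poisson model yields exponentially distributed inter-fault times, and that λ enters the Weibull model as a rate via the scale λ^(-1/β). The exact parameterization in [22] may differ, so this is illustrative only:

```python
import random

def time_to_fault(model, lam, beta=0.7):
    """Sample the time until the next fault hits a compute unit."""
    if model == "poisson":
        # constant error rate -> exponential inter-arrival times
        return random.expovariate(lam)
    # Weibull with shape beta < 1 models a decreasing hazard rate;
    # the scale is chosen so that lam acts as a rate (an assumption)
    return random.weibullvariate(lam ** (-1.0 / beta), beta)

def codelet_fails(exec_time_ms, model, lam):
    """A codelet execution fails iff a fault arrives before it completes."""
    return time_to_fault(model, lam) < exec_time_ms

random.seed(1)
trials = 10_000
failures = sum(codelet_fails(2.5, "poisson", lam=0.008) for _ in range(trials))
# with lam = 0.008 faults/ms, roughly 2% of 2.5 ms codelets fail
```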
Our evaluation includes the implementation of three resiliency mechanisms: a
2-of-3 system, a 3-of-5 system, and a dynamic M-of-N system with a maximum
N of 10 duplications. As a baseline, we use a system that has no resiliency
mechanism but is prone to failures based on the specific statistical model of
the fault injection mechanism (i.e., Poisson vs. Weibull). The experiments focus
on fault tolerance mechanisms for the Compute Unit, as explained in Sect. 3.
Thus, we assume that the only component that may fail is the Compute Unit and
that all other components are fully reliable.
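The 2-of-3 and 3-of-5 schemes above are instances of M-of-N majority voting over replica outputs; a sketch of the perfect voter (illustrative, not SCMUlate's actual code):

```python
from collections import Counter

def vote(replica_outputs, m):
    """M-of-N majority voting: accept a result iff at least m replicas agree;
    None signals an unrecoverable disagreement (the SU must re-execute)."""
    value, count = Counter(replica_outputs).most_common(1)[0]
    return value if count >= m else None

# 2-of-3: one faulty replica is tolerated
assert vote([42, 42, 7], m=2) == 42
# 3-of-5: two faulty replicas are tolerated
assert vote([1, 9, 1, 1, 5], m=3) == 1
```

A dynamic M-of-N scheme, as evaluated below, would adjust the replica count N at scheduling time based on the observed fault rate.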
Our execution environment is the Polaris supercomputer hosted by the Argonne
Leadership Computing Facility (ALCF) [1]. To evaluate our approach, we use
the implementation of the matrix multiplication algorithm in SCM that was also
used in [10]. Each codelet executes a tile of size 128 by 128 double elements.
The program executes a square matrix multiplication of size M=N=K=768
doubles (i.e., 6 by 6 tiles). For the Weibull fault model, we use β = 0.7, as
previously used by [22,31]. For λ, we precalculate a range that could
demonstrate the benefits of the approach based on the codelet execution time.
On Polaris, the MatMul 2048L codelet for a single tile runs in approximately 2.5
ms. Based on this calculation, the ranges λ = [0.0002, 0.008] for the Poisson
model and λ = [0.002, 0.08] for the Weibull model were considered appropriate.
We perform 30 repetitions per λ value, increasing λ by 0.0002 and 0.002 for
Poisson and Weibull, respectively.
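The sweep described above can be generated as follows (a sketch; the rounding guards against floating-point drift in the accumulated step):

```python
def lam_sweep(lo, hi, step):
    """Inclusive list of fault rates lo, lo+step, ..., hi."""
    n = round((hi - lo) / step)
    return [round(lo + i * step, 6) for i in range(n + 1)]

poisson_lams = lam_sweep(0.0002, 0.008, 0.0002)   # 40 rates
weibull_lams = lam_sweep(0.002, 0.08, 0.002)      # 40 rates
runs_per_config = 30 * len(poisson_lams)          # 30 repetitions per rate
```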
Fig. 15. Percentage of successful trials (Top Row) and execution time (Bottom Row) vs
fault rate λ (horizontal axis) for two fault injection models: Weibull (left) and Poisson
(Right). Weibull uses β = 0.7 in all the experiments. (Color figure online)
Our experiments are summarized in Fig. 15. The top row shows the number of
experiments that yielded a successful run. The bottom row shows the average
execution time of the successful runs. If no runs were successful, no data point
for the time is plotted, hence the broken line for No Duplication (red).
Experiments show a considerable improvement in reliability across the different
mechanisms. For Weibull, the average execution time overheads were 1.15, 1.37,
and 2.07 times the baseline (no duplication). For Poisson, on the other hand,
overheads were 1.39, 1.80, and 2.81 times the baseline. In general, the adaptive
solution presented a considerable advantage in both reliability and performance
overhead, demonstrating the benefit of scheduling-driven approaches.
5 Conclusion
This paper presents a possible implementation of resilience within the Sequen-
tial Codelet PXM. Our main contribution consists of identifying, categorizing,
and adapting state-of-the-art resilience techniques for integration within the
SCM. We demonstrated the real-world applicability of these adapted techniques
within an emulated Codelet Model. We believe our work lays the foundation for
enhancing resilience in similarly complex systems. Furthermore, our in-depth
analysis of the performance implications of these techniques offers critical, data-
driven insights. These insights will assist in understanding the trade-offs between
References
1. Argonne leadership computing facility. https://www.alcf.anl.gov/. Accessed 22
July 2023
2. GitHub - josemonsalve2/SCM: Sequential Codelet Model of Program Execution –
github.com. https://github.com/josemonsalve2/SCM/. Accessed 22 July 2023
3. Aguilera, M., Chen, W., Toueg, S.: Heartbeat: a timeout-free failure detector for
quiescent reliable communication, vol. 1320, pp. 126–140 (1997). https://doi.org/
10.1007/BFb0030680
4. Ahmad, I., Kwok, Y.K.: A new approach to scheduling parallel programs using
task duplication. In: 1994 International Conference on Parallel Processing,
vol. 2, pp. 47–51 (1994). https://doi.org/10.1109/ICPP.1994.37
5. Ansel, J., Arya, K., Cooperman, G.: DMTCP: transparent checkpointing for cluster
computations and the desktop. In: 2009 IEEE International Symposium on Parallel
& Distributed Processing, pp. 1–12 (2009). https://doi.org/10.1109/IPDPS.2009.
5161063
6. Bolchini, C., Miele, A., Sciuto, D.: An adaptive approach for online fault man-
agement in many-core architectures (2012). https://doi.org/10.1109/DATE.2012.
6176589
7. Bosilca, G., Delmas, R., Dongarra, J., Langou, J.: Algorithmic based fault tolerance
applied to high performance computing (2008)
8. Dennis, J.: A parallel program execution model supporting modular software con-
struction. In: Proceedings. Third Working Conference on Massively Parallel Pro-
gramming Models (Cat. No. 97TB100228), pp. 50–60 (1997). https://doi.org/10.
1109/MPPM.1997.715961
9. Diaz, J.M.M.: Sequential Codelet Model A SuperCodelet Program Execution
Model and Architecture. Phd thesis, University of Delaware, Newark, DE (2021)
10. Diaz, J.M.M., Harms, K., Guaitero, R.A.H., Perdomo, D.A.R., Kumaran, K., Gao,
G.R.: The supercodelet architecture. In: Proceedings of the 1st International Work-
shop on Extreme Heterogeneity Solutions. ExHET 2022. Association for Comput-
ing Machinery, New York (2022). https://doi.org/10.1145/3529336.3530823
11. DiTomaso, D., Kodi, A., Louri, A.: QORE: a fault tolerant network-on-chip archi-
tecture with power-efficient quad-function channel (QFC) buffers. In: 2014 IEEE 20th
International Symposium on High Performance Computer Architecture (HPCA),
pp. 320–331 (2014). https://doi.org/10.1109/HPCA.2014.6835942
12. Egwutuoha, I.P., Levy, D., Selic, B., Chen, S.: A survey of fault tolerance mech-
anisms and checkpoint/restart implementations for high performance comput-
ing systems. J. Supercomput. 65(3), 1302–1326 (2013). https://doi.org/10.1007/
s11227-013-0884-0
13. Fang, Y., Zou, C., Elmore, A.J., Chien, A.A.: UDP: a programmable accelerator
for extract-transform-load workloads and more. In: Proceedings of the 50th Annual
IEEE/ACM International Symposium on Microarchitecture, MICRO-50 2017, pp.
55–68. Association for Computing Machinery, New York (2017). https://doi.org/
10.1145/3123939.3123983
14. Fox, D., Diaz, J.M.M., Li, X.: Chiplets and the codelet model (2022)
15. Fox, D., Diaz, J.M., Li, X.: On memory codelets: prefetching, recoding, moving
and streaming data (2023)
16. Gao, G., Suetterlein, J., Zuckerman, S.: Toward an Execution Model for Extreme-
Scale Systems - Runnemede and Beyond. Technical memo (2011)
17. Gizopoulos, D., et al.: Architectures for online error detection and recovery in mul-
ticore processors. In: 2011 Design, Automation & Test in Europe (2011). https://
doi.org/10.1109/date.2011.5763096
18. IEC: Functional safety of electrical/electronic/programmable electronic safety-
related systems. Standard IEC 61508–1:2010. International Electrotechnical Com-
mission, Geneva, CH (2010). https://webstore.iec.ch/publication/5515
19. Iyer, R., Nakka, N., Kalbarczyk, Z., Mitra, S.: Recent advances and new avenues
in hardware-level reliability support. IEEE Micro 25(6), 18–29 (2005). https://doi.
org/10.1109/MM.2005.119
20. Kadri, N., Koudil, M.: A survey on fault-tolerant application mapping tech-
niques for network-on-chip. J. Syst. Arch. 92, 39–52 (2019). https://doi.org/
10.1016/j.sysarc.2018.10.001. https://www.sciencedirect.com/science/article/pii/
S1383762118301498
21. Kasap, S., Wächter, E.W., Zhai, X., Ehsan, S., McDonald-Maier, K.D.: Novel
lockstep-based fault mitigation approach for socs with roll-back and roll-
forward recovery. Microelectron. Reliabil. 124, 114297 (2021). https://doi.org/10.
1016/j.microrel.2021.114297. https://www.sciencedirect.com/science/article/pii/
S0026271421002638
22. Koren, I., Krishna, C.M.: Fault-Tolerant Systems. Morgan Kaufmann (2007)
23. Landwehr, A.: An experimental exploration of self-aware systems for exascale archi-
tectures (2016)
24. Linguaglossa, L., et al.: Survey of performance acceleration techniques for network
function virtualization. Proc. IEEE 107(4), 746–764 (2019). https://doi.org/10.
1109/JPROC.2019.2896848
25. Monsalve, J., Harms, K., Kalyan, K., Gao, G.: Sequential codelet model of program
execution - a super-codelet model based on the hierarchical turing machine. In:
2019 IEEE/ACM Third Annual Workshop on Emerging Parallel and Distributed
Runtime Systems and Middleware (IPDRM), pp. 1–8 (2019). https://doi.org/10.
1109/IPDRM49579.2019.00005
26. Nicolae, B., Moody, A., Gonsiorowski, E., Mohror, K., Cappello, F.: VELOC: towards
high performance adaptive asynchronous checkpointing at large scale. In: 2019
IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp.
911–920 (2019). https://doi.org/10.1109/IPDPS.2019.00099
27. Patterson, D.A., Hennessy, J.L.: Computer Architecture: A Quantitative App-
roach. Morgan Kaufmann Publishers Inc., San Francisco (1990)
28. Platunov, A., Sterkhov, A.: Watchdog mechanisms in embedded systems. Sci.
Tech. J. Inf. Technol. Mech. Opt. 301–311 (2017). https://doi.org/10.17586/2226-
1494-2017-17-2-301-311
29. Poledna, S.: Fault-Tolerant Real-Time Systems: The Problem of Replica Deter-
minism. Kluwer Academic Publishers, Boston (1996)
30. Qu, P., Yan, J., Zhang, Y., Gao, G.: Parallel turing machine, a proposal. J. Comput.
Sci. Technol. 32, 269–285 (2017). https://doi.org/10.1007/s11390-017-1721-3
31. Rozo Duque, L.A., Monsalve Diaz, J.M., Yang, C.: Improving MPSoC reliability
through adapting runtime task schedule based on time-correlated fault behavior.
In: 2015 Design, Automation & Test in Europe Conference & Exhibition (DATE),
pp. 818–823 (2015)
32. Safari, S., et al.: A survey of fault-tolerance techniques for embedded systems from
the perspective of power, energy, and thermal issues. IEEE Access 10, 12229–12251
(2022). https://doi.org/10.1109/ACCESS.2022.3144217
33. Sahoo, S.S., Ranjbar, B., Kumar, A.: Reliability-aware resource management
in multi-/many-core systems: a perspective paper. J. Low Power Electron.
Appl. 11(1) (2021). https://doi.org/10.3390/jlpea11010007. https://www.mdpi.
com/2079-9268/11/1/7
34. Salehi, M., Khavari Tavana, M., Rehman, S., Shafique, M., Ejlali, A., Henkel,
J.: Two-state checkpointing for energy-efficient fault tolerance in hard real-time
systems. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 24(7), 2426–2437
(2016). https://doi.org/10.1109/TVLSI.2015.2512839
35. Sastry Hari, S.K., Li, M.L., Ramachandran, P., Choi, B., Adve, S.V.: mSWAT: low-
cost hardware fault detection and diagnosis for multicore systems. In: Proceedings
of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture,
MICRO 42, pp. 122–132. Association for Computing Machinery, New York (2009).
https://doi.org/10.1145/1669112.1669129
36. Subasi, O., Unsal, O., Krishnamoorthy, S.: Automatic risk-based selective redun-
dancy for fault-tolerant task-parallel HPC applications. In: Proceedings of the Third
International Workshop on Extreme Scale Programming Models and Middleware,
ESPM2 2017. Association for Computing Machinery, New York (2017). https://
doi.org/10.1145/3152041.3152083
37. Suettlerlein, J., Zuckerman, S., Gao, G.R.: An implementation of the codelet
model. In: Wolf, F., Mohr, B., an Mey, D. (eds.) Euro-Par 2013. LNCS, vol.
8097, pp. 633–644. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-
642-40047-6 63
38. Tomasulo, R.M.: An efficient algorithm for exploiting multiple arithmetic units.
IBM J. Res. Dev. 11(1), 25–33 (1967). https://doi.org/10.1147/rd.111.0025
39. Weis, S., Garbade, A., Fechner, B., Mendelson, A., Giorgi, R., Ungerer, T.: Archi-
tectural support for fault tolerance in a teradevice dataflow system. Int. J. Parallel
Program. (2014). https://doi.org/10.1007/s10766-014-0312-y
40. Weis, S., et al.: A fault detection and recovery architecture for a teradevice dataflow
system. In: 2011 First Workshop on Data-Flow Execution Models for Extreme Scale
Computing, pp. 38–44 (2011). https://doi.org/10.1109/DFM.2011.9
Artificial Intelligence using HPC Scale
Parallel-Distributed Implementation
of the Lipizzaner Framework
for Multiobjective Coevolutionary
Training of Generative Adversarial
Networks
1 Introduction
This section describes the Lipizzaner framework and reviews related work. The
coevolutionary GAN training approach models the GAN training process as a
two-player game between the generator and the discriminator. The game is solved
by applying gradient-based optimization to the GAN minimax objective [5]. In
contrast, competitive coevolutionary GAN training involves two
Parallel Lipizzaner Framework for Coevolutionary Training of GANs 99
with two Intel Xeon Gold 6132 processors and NVIDIA Tesla M60 or P100
GPUs. Acceptable efficiency values were obtained (a reduction by a factor of
two), improving over a variant of the federated learning approach for both
datasets.
Cardoso et al. [3] explored the parallelization of GAN training on cloud
services. Parallel approaches were proposed to train on multiple GPUs and on
Google Tensor Processing Units (TPUs): a synchronous parallel strategy applied
over multiple GPUs on a single node, a multi-worker strategy to use multiple
nodes, and a TPU strategy for synchronous training on TPUs using all-reduce
and collective operations. The proposal was evaluated on Google Cloud and
Microsoft Azure. Linear speedup of the training process was achieved.
The original Lipizzaner implementation applies a task-based workload
distribution implemented using the Python Flask library and the HTTP protocol
for communications. Master and slave processes are executed on predefined nodes
and ports that are available for communications. Each process opens an API on
its assigned port and waits for the start-execution message. The master knows
the network ports exposed by the clients. Once all clients are identified, a
thread is created for each one, responsible for checking that the client is
still running (a heartbeat mechanism). Each thread sends a message to its
client at a fixed interval and waits for a response; if no response is received,
the client is restarted or the entire algorithm is terminated. Afterwards, the
master sends the execution-start message to each client and blocks until the
clients finish executing the algorithm. Once the clients receive the message to
start executing, they create a new thread responsible for executing the
algorithm, while the main process handles arriving requests. At the beginning of
each generation, each client sends a request to its neighbors, which respond
with the network most recently selected in their cells. This method avoids
synchronization, since the last selection is always returned regardless of
whether the nodes are in different generations. A class is used to control
read/write access to the local networks, avoiding race conditions between the
thread that executes the algorithm and the control thread. When all clients
have finished their generations, they send a completion message to the master
as a response to the heartbeat. Once all clients have notified their completion,
the master makes an HTTP request to all clients asking them to send their
generators and discriminators. With the obtained networks, the master builds an
ensemble of models, whose weights are optimized via the evolutionary algorithm,
based on a given performance metric.
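The heartbeat mechanism can be sketched with one monitor thread per client. This is a simplification under stated assumptions: the real implementation pings over HTTP, and on failure it can restart the client instead of only invoking a callback.

```python
import threading
import time

def heartbeat_monitor(client, interval_s, on_dead):
    """One monitor thread per client: ping at a fixed interval; if the ping
    gets no response, report the client as dead (restart/abort elided)."""
    def loop():
        while client["running"]:
            time.sleep(interval_s)
            if not client["ping"]():   # no response from the client
                on_dead()
                return
    t = threading.Thread(target=loop, daemon=True)
    t.start()
    return t

# toy client that answers three pings, then "dies"
answers = iter([True, True, True, False])
dead = []
client = {"running": True, "ping": lambda: next(answers)}
t = heartbeat_monitor(client, interval_s=0.01, on_dead=lambda: dead.append(True))
t.join(timeout=1.0)   # monitor exits after detecting the dead client
```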
The proposed approach for distributed execution has several drawbacks:
– It requires nodes to have visibility of the IP addresses and ports of other
nodes. This is not allowed in many clusters, as it could be a security
vulnerability, which restricts the environments in which the model can be
executed.
– Manual load balancing of processes is necessary. There is no mechanism to
determine an optimal distribution of processes onto nodes, so the distribution
must be done manually, adding work when running the algorithm.
– It introduces a high level of complexity, since the implementation uses
control threads to handle the API while the experiment is running. Due to the
use of threads, resource access control is needed to avoid race conditions.
also allows integrating other machine learning modules and libraries developed
in Python and PyTorch.
mpi4py is responsible for initializing and terminating the MPI context via
the MPI_INIT and MPI_FINALIZE functions. In turn, it provides an
object-oriented interface for defining communicators and accessing MPI
functions, and it handles the serialization and deserialization of Python
objects, using the Pickle library, to make it possible to send them as MPI
messages.
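mpi4py's lowercase send()/recv() methods transparently pickle arbitrary Python objects before handing the bytes to MPI; the round trip can be mimicked locally without an MPI launcher:

```python
import pickle

# What mpi4py's lowercase send()/recv() do under the hood: pickle the object
# into bytes, ship the bytes as an MPI message, and unpickle on arrival.
payload = {"rank": 3, "nets": ["generator-weights", "discriminator-weights"]}
wire = pickle.dumps(payload)     # serialization before the MPI send
received = pickle.loads(wire)    # deserialization after the MPI receive
assert received == payload

# Under mpirun, the equivalent exchange would be:
#   from mpi4py import MPI
#   comm = MPI.COMM_WORLD
#   comm.send(payload, dest=1, tag=0)    # pickles automatically
#   obj = comm.recv(source=0, tag=0)     # unpickles automatically
```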
Regarding the serialization of GANs, the coding for discriminators and gen-
erators defined by the original algorithm was maintained. Both networks are
packaged in a map where the discriminator is indexed by the letter “D” and the
generator is indexed by the letter “G”, to be sent in a single message to neigh-
boring processes. Each process uses lists to represent separate populations of
discriminators and generators. Two neighborhood dictionaries (one for discrim-
inators and one for generators) are maintained. Both are indexed by the rank of
neighboring processes. The neighborhoods are used during the communication
routines to keep track of the MPI ranks (origin neighborhoods) associated with
each Lipizzaner process, to know the origin of the individuals. Populations are
used to execute the evolutionary cycle.
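The packaging and bookkeeping described above can be sketched as follows; the function names and the placeholder weight strings are illustrative, not the actual implementation:

```python
def pack(generator, discriminator):
    """Package a cell's two networks into one message, keyed "G" and "D"."""
    return {"G": generator, "D": discriminator}

# Per-process state: separate populations, plus per-kind neighborhood
# dictionaries indexed by the MPI rank of each neighboring process.
generators, discriminators = [], []
gen_neighborhood, disc_neighborhood = {}, {}

def on_immigrant(rank, msg):
    """Record an incoming individual under the sender's rank, so the origin
    of each individual is known during the communication routines."""
    gen_neighborhood[rank] = msg["G"]
    disc_neighborhood[rank] = msg["D"]

on_immigrant(5, pack("G-weights-from-5", "D-weights-from-5"))
```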
The Cartesian grid topology provided by MPI makes it easy to identify neigh-
boring processes. In addition, it allows taking advantage of the physical location
in the computing nodes to improve the performance of sending messages, since
communications between neighboring nodes are the most frequent in the
Lipizzaner algorithm. In the Lipizzaner reimplementation, as many processes are
launched as cells are required in the grid, and the process with rank zero is
designated as the master. The master process is responsible for the initial
loading of the data into shared memory and for the assembly of the final model.
However, unlike the original Lipizzaner algorithm, the master process also
participates in the execution of the evolutionary algorithm as another grid
process, following an active master-slave pattern.
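With mpi4py, the Cartesian topology corresponds to comm.Create_cart([rows, cols], periods=[True, True]) followed by cart.Shift(...) to obtain neighbor ranks. The ranks such a periodic, row-major grid reports can be reproduced in pure Python for illustration (assuming the toroidal neighborhood Lipizzaner uses):

```python
def cart_neighbors(rank, rows, cols):
    """Ranks of the four neighbors of `rank` on a periodic (toroidal),
    row-major rows x cols grid, as an MPI Cartesian communicator created
    with periods=True would report them via Shift()."""
    r, c = divmod(rank, cols)
    return {
        "up":    ((r - 1) % rows) * cols + c,
        "down":  ((r + 1) % rows) * cols + c,
        "left":  r * cols + (c - 1) % cols,
        "right": r * cols + (c + 1) % cols,
    }

# rank 0 on a 3x3 grid wraps around the torus
corner = cart_neighbors(0, 3, 3)   # {"up": 6, "down": 3, "left": 2, "right": 1}
```

MPI implementations may additionally reorder ranks to match the physical node layout, which is exactly the locality benefit mentioned above.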
Figure 1 shows a diagram of the execution flow of a process in the redesigned
Lipizzaner algorithm. Labels are included for the steps in which an MPI function
is used. When a process starts its execution and mpi4py implicitly initializes
the MPI context, the process checks whether its rank is 0 (corresponding to the
master process). If the process has a different rank, it waits at the first
synchronization barrier until the training data is loaded into memory and can
be used. If the process has rank 0, it is the master process; thus, it is in
charge of reading the configuration of the algorithm, loading the training data
into memory, and waiting at the synchronization barrier for the other
processes. Once all the processes have reached the synchronization barrier, it
is released and execution continues.
After passing the barrier that controls the availability of the data, each
process initializes its population, distributes it using the MPI_ALLGATHER
function over a communicator that contains only its neighbors, and begins the
first generation of the evolutionary algorithm. From the second generation
onward, the usual migration flow begins: the process obtains immigrant
individuals from its neighbors and updates its population, executes the
evolutionary cycle to select and mutate its best individual, and starts the
emigration process, sending the best individual to its four neighbors.
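The per-process flow of the paragraphs above can be condensed into a sketch; the comm object and the update/selection helpers are toy stand-ins, not the actual implementation:

```python
class FakeComm:
    """Stand-in for the MPI neighbor exchange (names are illustrative)."""
    def allgather_initial(self): return [1, 4]       # initial distribution
    def receive_from_neighbors(self): return [7]     # immigrant individuals
    def send_to_neighbors(self, best): self.sent = best

def update(pop, immigrants):
    return pop + immigrants

def select_and_mutate(pop):
    return max(pop)   # toy "evolutionary cycle": keep the best individual

def evolve(comm, generations):
    population = comm.allgather_initial()      # distributed at generation 1
    best = None
    for gen in range(generations):
        if gen > 0:                            # usual flow from generation 2
            population = update(population, comm.receive_from_neighbors())
        best = select_and_mutate(population)   # select and mutate
        comm.send_to_neighbors(best)           # emigrate the best individual
    return best

result = evolve(FakeComm(), 3)   # 7
```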
4 Experimental Evaluation
This section describes the experimental evaluation of the developed parallel
distributed multiobjective version of Lipizzaner.
parameter          value
batch size         64
# iterations       50
learning rate      5×10^-5
population size    2
replacement size   1
tournament size    2
Fig. 3. Sample images in MNIST
Efficiency Metrics. The execution time is used as the main metric for perfor-
mance evaluation. The traditional approach for algorithmic speedup evaluation
is applied to evaluate efficiency, computing the ratio of the execution time
of the algorithm on a single computing resource (T_P1) and the execution time of
the algorithm on N computing resources (T_PN) (Eq. 1).
An adapted formulation of the parallelizability is used to evaluate the scal-
ability of the developed implementation. The approach is based on increasing
the number of computing resources while keeping the amount of work constant.
However, in the case of Lipizzaner, there is a one-to-one relationship between grid
cells (work performed) and the processing cores (available computing resources),
so the work to be done does not remain constant, but increases proportionally to
computing resources. Given this proportional increase, the adapted formulation
for evaluating scalability (SL_N) is defined by Eq. 2, where T_PM is the Lipizzaner
execution time using M processes and M processing cores and T_PN the
Lipizzaner execution time using N processes and N processing cores.
S_N = T_P1 / T_PN    (1)

SL_N = (T_PM − T_PN) / (M − N), for N < M    (2)
For a perfectly parallelizable algorithm, the execution time is constant when
increasing both variables proportionally, therefore SL_N = 0. If there are steps
that cannot be parallelized, SL_N > 0. The analysis is performed for 4, 9, and 16
resources, corresponding to Lipizzaner grids of dimensions 2×2, 3×3, and 4×4.
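Eqs. (1) and (2) translate into two small helpers (names are illustrative):

```python
def speedup(t_p1, t_pn):
    """Eq. (1): S_N, the ratio of 1-resource time to N-resource time."""
    return t_p1 / t_pn


def scalability(t_pm, t_pn, m, n):
    """Eq. (2): SL_N, extra execution time per added resource (N < M)."""
    if n >= m:
        raise ValueError("Eq. (2) requires N < M")
    return (t_pm - t_pn) / (m - n)


# A hypothetical run taking 120 s on 4 cores and 130 s on 9 cores:
print(scalability(130.0, 120.0, 9, 4))  # 2.0 extra seconds per added core
```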
Finally, the load balance of the algorithm is also evaluated. The start and
end times of each iteration are studied for every process in the grid. The evaluation
was meant to detect situations where the load is not evenly distributed between
the processes or where some process is favored by the resource manager.
The experimental analysis was performed on the high performance computing
platform of the National Supercomputing Center (Cluster-UY), Uruguay [10].
Experiments were performed on Intel Xeon Gold 6138 processors at 2.00 GHz with
NVIDIA Tesla P100 GPUs (12 GB RAM) and NVIDIA Ampere A100 GPUs (40 GB RAM,
6912 FP32 CUDA cores, 3456 FP64 CUDA cores, and 432 Tensor Cores),
seeking to run the algorithm under the same conditions. All metrics were
computed as averages over ten independent executions of each experiment.
Figure 4 presents a summary of the analysis of results quality. The Q-Q plots for
quality metrics (coverage, density, and Frechet Inception Distance (FID) [12])
are reported. Results obtained in the plane of the theoretical quantiles of a
reference normal distribution and the observed quantiles are shown in green.
The identity line, in orange, is a graphic reference of the approximation of the
results to the reference distribution. Results are correctly aligned, so they are
distributed similarly to a normal distribution, i.e., the reference distribution has
mean and standard deviation equal to those of the empirical distribution generated by the GAN.
108 S. Nesmachnow et al.
Figure 7 presents the execution time needed for each iteration of processes
belonging to a Lipizzaner grid. Figures 7(a) and 7(b) indicate that the processes
remained within one iteration of each other throughout training. However,
in Fig. 7(c) it is shown that process 10 ended its execution while process 9 still
had two iterations to process. The difference between the end of the iterations
increases as time progresses for the case of the 4×4 grid.
Summarizing, no pathologies were detected in the load distribution or in the
allocation of resources. An interesting line of future work is to explore how the
difference between the final iteration times behaves for larger grid sizes, since if
the difference were to increase, it could result in training problems.
dard datasets to analyze the efficiency and scalability of the proposed parallel
distributed training strategy for larger and more complex problems.
References
1. Arjovsky, M., Bottou, L.: Towards principled methods for training generative
adversarial networks. arXiv preprint arXiv:1701.04862 (2017)
2. Arora, S., Risteski, A., Zhang, Y.: Do GANs learn the distribution? some theory
and empirics. In: International Conference on Learning Representations (2018)
3. Cardoso, R., Golubovic, D., Lozada, I.P., Rocha, R., Fernandes, J., Vallecorsa, S.:
Accelerating GAN training using highly parallel hardware on public cloud. EPJ
Web Conf. 251, 02073 (2021)
4. Esteban, M., Toutouh, J., Nesmachnow, S.: Parallel/distributed intelligent hyper-
parameters search for generative artificial neural networks. In: Jagode, H., Anzt, H.,
Ltaief, H., Luszczek, P. (eds.) ISC High Performance 2021. LNCS, vol. 12761, pp.
297–313. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-90539-2_20
5. Goodfellow, I., et al.: Generative adversarial nets. In: Advances in Neural Infor-
mation Processing Systems, pp. 2672–2680 (2014)
6. Gui, J., Sun, Z., Wen, Y., Tao, D., Ye, J.: A review on generative adversarial
networks: algorithms, theory, and applications. IEEE Trans. Knowl. Data Eng.
35(4), 3313–3332 (2021)
7. Hardy, C., Merrer, E.L., Sericola, B.: MD-GAN: multi-discriminator generative
adversarial networks for distributed datasets. In: IEEE International Parallel and
Distributed Processing Symposium (2019)
8. Liu, M., et al.: A decentralized parallel algorithm for training generative adversarial
nets (2019). https://arxiv.org/abs/1910.12999
9. Moran, N., Pollack, J.: Coevolutionary neural population models. In: Artificial Life
Conference Proceedings, pp. 39–46. MIT Press, Cambridge, MA (2018)
10. Nesmachnow, S., Iturriaga, S.: Cluster-UY: collaborative scientific high perfor-
mance computing in Uruguay. In: Torres, M., Klapp, J. (eds.) ISUM 2019. CCIS,
vol. 1151, pp. 188–202. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-38043-4_16
11. Perez, E., Nesmachnow, S., Toutouh, J., Hemberg, E., O’Reilly, U.M.: Paral-
lel/distributed implementation of cellular training for generative adversarial neural
networks. In: IEEE International Parallel and Distributed Processing Symposium
Workshops (2020)
12. Ripa, G., Mautone, A., Vidal, A., Nesmachnow, S., Toutouh, J.: Multiobjective
coevolutionary training of generative adversarial networks. In: Genetic and Evolu-
tionary Computation Conference (2023)
13. Schmiedlechner, T., Yong, N., Al-Dujaili, A., Hemberg, E., O’Reilly, U.: Lipizzaner:
a system that scales robust generative adversarial network training (2018). https://
arxiv.org/abs/1811.12843
14. Toutouh, J., Esteban, M., Nesmachnow, S.: Parallel/distributed generative adver-
sarial neural networks for data augmentation of COVID-19 training images. In:
Nesmachnow, S., Castro, H., Tchernykh, A. (eds.) CARLA 2020. CCIS, vol. 1327,
pp. 162–177. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-68035-0_12
Provenance-Based Dynamic Fine-Tuning
of Cross-Silo Federated Learning
1 Introduction
Over the last few years, Machine Learning (ML) has gained much attention
from both academia and industry [3,8,17]. Although several ML techniques were
proposed decades ago, several factors explain why the use of ML has grown
markedly in recent years. The first one is data availability. Massive datasets
are available and can be used to train a myriad of ML models. The second factor
is that many computing resource types are available for any user, especially in
public clouds. Finally, ML has been fostered by the emergence of Deep Learning
(DL) [6], an essential technique for pattern matching at scale.
Although ML and DL represent a step forward, their learning workflow
requires that all available data are fully accessible to train a model, becom-
ing a problem for two primary reasons: Privacy and Volume. Regarding privacy,
many datasets contain sensitive data of individuals that cannot be publicized,
according to regulations such as GDPR in Europe (https://gdpr-info.eu/). Even
though anonymization techniques can be applied, the use of anonymized datasets
is often ineffective, since anonymization can be reversed. Regarding data volume, datasets
often occupy hundreds of gigabytes. Despite various initiatives aimed at providing
advanced network infrastructure to facilitate research collaborations among dif-
ferent organizations and, consequently, handle large volumes of data transfers
(e.g., RedCLARA within Latin America), not all organizations or institutions
have access to these resources. As a result, data transfers can still pose challenges.
To bridge this gap between ML and privacy, Federated Learning (FL) [13]
has been proposed. FL is a distributed technique that allows multiple users to
train models collaboratively while maintaining their training data private, thus
not exposing sensitive data. The final trained model is obtained iteratively, and
in each iteration, the users (i.e., workers) train their model locally and send
an updated version of the model (not sensitive data) back to the server that
aggregates the received locally trained models and sends the updated parameters
to the workers, so that they can continue their local training in the following
iteration. Besides the clear advantage of keeping data private, FL also reduces
costly data transfers since it removes the need to pool data into a single server.
Despite the evolution of FL in the last few years [11], there are some open,
yet crucial, problems to be addressed. One problem is that each iteration of
the FL workflow may last from a few minutes to several hours, depending on
the model type, the volume of input data, and the worker’s computing power,
which can limit the local training of large models. The training time and the
quality of the trained model in each iteration are directly associated with the
input dataset and the choice of a set of hyperparameters. Although the user can
explore several configurations of hyperparameter values by executing multiple
FL workflows, this can be time-consuming. One attractive option is to analyze
the evaluation metrics values (e.g., accuracy) after each iteration and then per-
form Dynamic Fine-tuning of hyperparameters, i.e., depending on the metric
values, hyperparameters are adjusted at runtime. According to da Silva et al.
[24], there is a lack of capabilities for enabling ML (and consequently FL) steer-
ing and dynamic execution. One of the challenges is how to properly register
which hyperparameter configurations have been used to train a specific model.
If this information is not recorded, the user may lose track of it, mainly due to
the exploratory nature of the FL workflow and its distributed execution.
Provenance data [4] can help the dynamic fine-tuning of hyperparameters
by providing a global data picture with exact dependencies. It supplies a series
of metadata that describes how data are produced in each iteration of the FL
workflow by registering hyperparameter and parameter values, evaluation met-
Provenance-Based Dynamic Fine-Tuning of Cross-Silo Federated Learning 115
2 Background
This section discusses two important concepts tackled by this paper: (i) Feder-
ated Learning and (ii) Provenance data.
global model, which can be achieved using different aggregation strategies. One
common strategy is Federated Averaging (FedAvg), which involves several local
stochastic gradient descent updates and one aggregation by the server in each
round. Once the aggregated global model is obtained, the server starts a new
iteration by sending it back to the workers.
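The FedAvg aggregation step can be sketched as below. Plain Python lists stand in for model weight tensors, and the weighting by local dataset size follows the standard FedAvg formulation; names are illustrative, not code from the paper.

```python
def fed_avg(client_weights, client_sizes):
    """One server-side round: size-weighted average of client parameters."""
    total = sum(client_sizes)
    aggregated = [0.0] * len(client_weights[0])
    for weights, size in zip(client_weights, client_sizes):
        for i, w in enumerate(weights):
            aggregated[i] += w * size / total
    return aggregated


# Two clients holding 300 and 100 training examples, respectively:
global_model = fed_avg([[1.0, 2.0], [5.0, 6.0]], [300, 100])
print(global_model)  # [2.0, 3.0]
```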
Although the trained model has its hyperparameters (e.g., dropout and acti-
vation function for DNNs), the FL workflow itself has specific parameters that
must be set. These include the number T of rounds (one round is associated with
one iteration of the learning phase), the number of worker nodes K, the fraction
of workers C used in each round, and the batch size B (the number of exam-
ples used for training) consumed in each iteration. Such parameters commonly
have to be fine-tuned together with the model’s hyperparameters to optimize the
execution both in terms of the accuracy of the result (or any other evaluation
metric) and training time. This fine-tuning is a complex yet essential
task. For example, depending on the heterogeneity of the computing power in
worker nodes, one can define a small value for parameter C, which reduces the
necessary computing power per round.
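The interaction between K and C can be sketched as a client-sampling step. The max(1, round(C·K)) rule and the function name are illustrative assumptions, not the Flower API:

```python
import random


def sample_clients(all_clients, c):
    """Pick the fraction C of the K available clients for one round."""
    n = max(1, round(c * len(all_clients)))
    return random.sample(all_clients, n)


random.seed(0)  # deterministic for the example
chosen = sample_clients(list(range(10)), 0.3)  # K = 10 workers, C = 0.3
print(len(chosen))  # 3 clients participate in this round
```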
This server-client architecture of FL, also called Model-Centric FL [27], is
classified into Cross-Device, where the clients are typically mobile devices and
the federation can reach a scale of millions of workers, or Cross-Silo (which is the focus of
this paper), where workers are usually associated with organizations and their
number is commonly reduced. In this second type of FL, the central server can
assume that all clients are available during the whole execution, as they are
robust machines or even clusters. There are some frameworks capable of exe-
cuting FL workflows. One of the most prominent is Flower [2], an open-source
framework that enables training models with FL on large numbers of workers.
Besides, it is compatible with most existing ML frameworks, such as Keras,
TensorFlow, and PyTorch, allowing research across different servers and devices, including
Cloud, Mobile, and Edge Computing. Flower is interoperable with many oper-
ating systems and hardware platforms, works well in heterogeneous edge device
environments, and is used as the basis for the approach proposed in this paper.
thing. In the context of this paper, an entity can be a trained model, the aggre-
gated model, a set of hyperparameter values, etc. Activities are actions executed
within the FL workflow and act upon entities, e.g., the local training in a worker
node or the model aggregation in the server. Finally, an Agent is a user respon-
sible for starting/stopping activities, providing datasets, etc. Although PROV
is an agnostic model, i.e., not tied to a specific knowledge domain, it can be
extended to distinct fields, such as FL.
3 Related Work
This section discusses existing approaches that use provenance data for analyz-
ing and fine-tuning hyperparameters in general ML and FL-specific approaches.
Many approaches use provenance data to foster data analytics of ML workflows [5,12,20,21,
23,26]. ModelDB [26] and ModelKB [5] are model management systems that col-
lect provenance data and other metadata related to the ML workflow. ModelDB
provides specific interfaces to well-known tools, such as SparkML and scikit-
learn, while ModelKB relies on callbacks in native ML frameworks (e.g., Ten-
sorFlow) to collect metadata about the workflow. Similarly, Schelter et al. [23]
propose a system to track the metadata and provenance of SparkML and
scikit-learn pipelines. The previous three approaches provide ways to submit analyses
118 C. Lopes et al.
to the provenance database allowing users to trace back results, but they do not
supply dynamic fine-tuning of hyperparameters.
BugDoc [12] is an approach that captures provenance data from ML work-
flows to help users understand the reasons for failed executions. Once a reason
is identified, BugDoc suggests parameter values for future executions to improve
efficiency and reduce the number of iterations in the learning workflow. Although
BugDoc represents a step forward, it does not support dynamic fine-tuning or
FL. Similarly, Pina et al. [20,21] propose an approach to capture provenance
data from Deep Learning applications to foster analytics and interpretability of
the trained model. Although Pina et al. use a provenance database to store and
query provenance data at runtime, their approach does not provide dynamic fine-tuning and
is not designed for the FL workflow.
Some provenance-based approaches are specific to FL [1,18]. The Bassa-ML
framework [1] is a blockchain-based FL platform built on top of
the TensorFlow model card toolkit (a model card is an ML document that encap-
sulates metadata and some level of provenance) and implements a blockchain-
based coordinator-less FL scheme to manage provenance data in a decentralized
way. The authors claim that using the generated ML documents permits the
addition of trust to blockchain applications while preventing attacks. Nonethe-
less, it was not reported how the shared information in the Model Card was used
for analysis, and the proposed approach does not allow for dynamic fine-tuning.
Peregrina et al. [18] propose a framework for data governance in the FL work-
flows. The data governance framework has a metadata model and management
system for tracing the participants’ operations and collecting all information
regarding the definition of the goals and configuration of the FL workflow. Their
model, however, lacks some relevant information, such as the activities executed
in the FL workflow. Besides, the proposal does not consider the dynamic tuning
of the FL hyperparameters.
where the light gray rectangles are the contributions of this paper and the white
ones are present at the Flower core framework. The architecture is composed
of six main components: (i) Strategy, (ii) Client Manager, (iii) FL Loop, (iv)
RPC Server, (v) Client, and (vi) Provenance Manager. The Client Manager is
responsible for sampling Clients that will train multiple local models. In coordi-
nation with the RPC Server, it manages numerous ClientProxy objects, which
are objects associated with a client connected to the server. A ClientProxy is
responsible for sending and receiving messages to each of the Clients, which per-
forms local model training. The entire FL workflow orchestration is performed
by FL Loop. It invokes the Strategy, an abstraction of how the FL algorithm
works in the server, to prepare the next round of FL and sends the configura-
tions to the clients involved in the training process. After each Client generates
a new local model, the FL Loop receives the updates, e.g., weights of a DNN,
and invokes the Strategy for aggregating results (Aggregate Train/Evaluation in
Fig. 2). After aggregation, the Strategy evaluates the aggregated model (Config-
ure Train/Evaluation in Fig. 2) according to the chosen evaluation metrics (e.g.,
accuracy).
In each of the aforementioned steps, provenance data are captured. The Provenance
Manager is the component responsible for receiving provenance data from
the Server and Clients and structuring it in a queryable form. In the current
version of Flower-PROV, the Provenance Manager is built on top of the DfAn-
alyzer provenance library [25] that provides generic methods for capturing and
querying provenance. On the server side, the Provenance Capture Server com-
ponent captures metadata regarding the aggregation and evaluation steps, such
as the evaluation metrics and the model sent to the clients. Client-side activi-
ties and metadata are collected from each Client using the Provenance Capture
Client component. All data are stored in a W3C PROV-compliant provenance
database and are available at runtime, i.e., as soon as they are captured, they
can be queried, which allows for using such data for dynamic fine-tuning.
When the provenance data are available, the Fine Tuning component can be
invoked, triggered by a series of events identified in the provenance database.
Such events can be identified according to user-defined criteria that we call
Dynamic Fine-tuning Policy (dfp). In this work, the FL workflow can be viewed
as a directed graph F = (S, Dep), where S are the vertices representing the
workflow steps to be executed in Server or Client (e.g., local training or model
aggregation) and Dep is a set of arcs that represents the data dependencies
amongst steps in S (i.e., the local training in Clients can only be executed after
receiving updates from the Server). Let us also represent the execution envi-
ronment as R = {r_1, ..., r_k}, which is the set of computing resources where the
Clients and the Server execute. Therefore, given an FL workflow F, an input
D, and a set of computing resources R, let X(F, D, R) = {hpv_1, hpv_2, ..., hpv_m}
be the set of hyperparameter values defined for executing the FL workflow F .
Each hyperparameter value hpvi represents one of the configuration parameters
of the FL framework and model, e.g., number of rounds (as detailed in Sub-
sect. 2.1). Thus, the goal of the Fine Tuning component is to use provenance
data to adjust hyperparameter values at runtime, seeking the set of hyperparameter
values X*(F, D, R) that satisfies a set of user-defined criteria C.
To find X*, the hyperparameter values need to be adjusted at runtime following
a procedure, and the moment at which the adjustment is performed has to be
identified by querying the provenance database. The dfp is the abstraction used
to define how hyperparameters are changed and when the change should occur.
A dfp can be formalized as dfp = {e, C, A}, where e is the type of the dfp (i.e., user-defined
or automatic, although in this paper only user-defined dfps are applied), C
is the set of constraints that have to be satisfied to change hyperparame-
ters, and A are fine-tuning actions that can be performed. Such constraints
are translated to a database function in the provenance database (named
update_hyperparameters), which is triggered every time new metadata is col-
lected. This function returns 1 if changes are required and 0 otherwise. Figure 3
shows an example of some defined constraints using SQL.
Finally, when it is defined that a change will be performed, the actions in A
are executed. One example of action is the increment on the number of epochs
in the Clients, which will be increased by 1 when a change is required in the
hyperparameters. These changes repeat until the dfp does not trigger any hyper-
parameter changes after some θ rounds.
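This trigger-and-act cycle can be sketched in Python. In Flower-PROV the constraints live in an SQL function (update_hyperparameters) inside the provenance database; the function below only mirrors the described 0/1 behavior over a per-round accuracy history, and the specific constraint used (accuracy below a target plus a two-round plateau) is an assumed example, not the contents of Fig. 3.

```python
def update_hyperparameters(history, target_accuracy):
    """Return 1 if a hyperparameter change is required, 0 otherwise."""
    if len(history) < 2:
        return 0  # the constraints need a 2-round window
    below_target = history[-1] < target_accuracy
    plateaued = abs(history[-1] - history[-2]) < 0.01  # assumed threshold
    return int(below_target and plateaued)


def apply_actions(epochs):
    """Example action from the text: increment the clients' epochs by 1."""
    return epochs + 1


print(update_hyperparameters([0.42, 0.425], 0.8))  # 1: a change is required
print(apply_actions(5))  # 6
```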
In the next section, we present the evaluation of the dynamic fine-tuning
mechanism in Flower-PROV. Flower-PROV is being open-sourced and will be
available at https://github.com/UFFeScience/Flower-PROV.
5 Experimental Evaluation
To evaluate dynamic fine-tuning of hyperparameters in Flower-PROV we have
chosen as a case study the MobileNetV2 architecture [22], a general-purpose
computer vision convolutional DNN. The dataset used was CIFAR-10 [9], a popular
multi-class balanced dataset of colored 32 × 32 pixel images formed by
ten classes, each having 6,000 images. In total, 50,000 (≈ 83.33%) images were
used for training and 10,000 (≈ 16.66%) for testing. To emulate the data privacy
attribute in Flower-PROV, each Client received a balanced and distinct partition
of the original CIFAR-10 generated with the dataset-splitter tool. We fixed 10%
of the local training datasets for the validation split in all experiments presented
in this section.
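The client data setup can be sketched as a balanced partitioner with a 10% per-client validation hold-out. This is a hypothetical stand-in for the dataset-splitter tool, working on sample indices instead of CIFAR-10 images:

```python
def partition(labels, num_clients, val_fraction=0.10):
    """Split sample indices into balanced, disjoint per-client train/val sets."""
    clients = [{"train": [], "val": []} for _ in range(num_clients)]
    by_class = {}
    for idx, y in enumerate(labels):
        by_class.setdefault(y, []).append(idx)
    for indices in by_class.values():  # keep the class balance per client
        per_client = len(indices) // num_clients
        for c in range(num_clients):
            share = indices[c * per_client:(c + 1) * per_client]
            n_val = int(len(share) * val_fraction)
            clients[c]["val"].extend(share[:n_val])
            clients[c]["train"].extend(share[n_val:])
    return clients


# 1000 samples over 10 classes split among 5 clients -> 180 train + 20 val each
parts = partition([i % 10 for i in range(1000)], 5)
print(len(parts[0]["train"]), len(parts[0]["val"]))  # 180 20
```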
A set of virtual machines in Amazon AWS were deployed to compose the
Cross-Silo FL system comprised of one server and five clients, where all clients are
available during the entire FL workflow execution, i.e., no churn is considered.
All clients in Flower-PROV execute the training and evaluation of the local ML
model in g4dn.xlarge virtual machines, which feature 4 vCPUs, 16 GiB memory,
and 1 GPU NVIDIA T4 Tensor Core with 2560 CUDA cores and 16 GB memory,
costing USD 0.5260 per hour in the On-Demand market. The server, responsible
for aggregating local models, executes in a t2.2xlarge virtual machine, which
features 8 vCPUs and 32.0 GiB memory and costs USD 0.3712 per hour in the
On-Demand market.
In addition, in each execution of the workflow, all 100 rounds were executed.
However, if the user defines a target accuracy, this value can be reached in
fewer than 100 rounds, and the workflow execution could be stopped, sparing
execution time and computing resources. If the user also defines a deadline for
the execution, the FL workflow may not finish executing all rounds in time.
Figure 7 presents the accuracy evolution over the 100 rounds of each of the 25
executions of the workflow. Let the user set the target accuracy as 0.8. Analyzing
Fig. 7, one can note that after 100 rounds, only 10 out of the 25 hyperparameter
configurations achieve the target accuracy value. On the other hand, if the target
accuracy is set to 0.5 with a deadline of 700 s, only one hyperparameter configuration
is successful. Therefore, executing all the configurations would be a waste of
time and computing resources, even considering the limited search space used.
Although the analysis provided by Fig. 7 represents a step forward, it may not
deliver the best hyperparameter configuration. Grid-search is an interesting fine-
tuning strategy if the well-performing hyperparameter combinations are known
a priori. For example, in the experiments, we consider batch size values that are
powers of two. However, the best parameter combinations sometimes cannot be
guessed intuitively. Thereby, dynamic fine-tuning is a promising approach.
Fig. 7. Accuracy evolution over 100 rounds for fine-tuning using grid-search with batch
size (a) bs = 32, (b) bs = 64, (c) bs = 128, (d) bs = 256, and (e) bs = 512.
smaller than the target accuracy defined by the user, (ii) if the hyperparameters
were not changed in the last two rounds (this allows for the latest hyperparam-
eter change to take effect), (iii) if the accuracy varies less than 0.01 after two
rounds. We also have to define the set of actions to be performed: (i) batch size
is increased by 10% and (ii) the number of epochs is increased by 2. It is worth
noticing that although provenance data are captured since the first round, the
changes in hyperparameter values start on the third round since the constraints
require analyzing a 2-round window.
Four different scenarios are set by varying the target accuracy: (i) 0.5 (DS1 ),
(ii) 0.6 (DS2 ), (iii) 0.7 (DS3 ), and (iv) 0.8 (DS4 ) and the execution deadline as
10 min. Table 1 presents the number of changes in hyperparameter values, the
chosen batch size, the chosen number of epochs, the required rounds to obtain
the best configuration, the accuracy, and the execution time. It is worth noticing
that fine-tuning hyperparameters using grid search required 25 workflow execu-
tions to obtain the best hyperparameters configuration for a given accuracy goal,
demanding a total of 36,570.32 s (≈ 10 h). Moreover, using the grid-search app-
roach, the user only explores intuitive values for the hyperparameters. However,
Table 1 shows that the hyperparameter values selected cannot be guessed intu-
itively, e.g., bs = 41 and e = 7. In addition, the target accuracy was achieved for
DS1 , DS2 , DS3 and DS4 , respectively, in less than 20 rounds (as presented in
Fig. 8), taking no more than 2,104.95 s (≈ 0.5 h), depicting a 94.24% reduction
in execution time.
Table 1. The chosen values for batch size and number of epochs for each scenario.
6 Concluding Remarks
The collaborative nature of a Federated Learning system allows multiple users to
train models on their private and sensitive data without sharing it. Reaching the
desired quality of the training results can be time-consuming and
may require a great effort from the user’s side to set the proper configuration
values. This work proposes Flower-PROV, an FL framework for dynamically fine-
tuning the hyperparameters based on provenance data captured in a queryable
form to achieve an evaluation target. An experimental analysis demonstrates
that the Flower-PROV dynamic fine-tuning shortens the training time by up to
94.24% compared with an exploratory grid search to reach the desired
target evaluation metrics, with an overhead of no more than 8%. Concerning
future perspectives, Flower-PROV will be expanded to Cross-Device FL, where
clients may not participate in all rounds. It will also consider non-IID datasets,
since clients can have different data sample sizes and probability distributions.
Furthermore, we plan to apply the framework in real-data scenarios.
References
1. Bandara, E., Shetty, S., Rahman, A., Mukkamala, R., Zhao, J., Liang, X.: Bassa-
ML – a blockchain and model card integrated federated learning provenance plat-
form. In: IEEE 19th Annual Consumer Communications and Networking Confer-
ence (CCNC), pp. 753–759 (2022)
2. Beutel, D.J., et al.: Flower: a friendly federated learning research framework. arXiv
(2020)
3. Fernandes, E., Moro, S., Cortez, P.: Data science, machine learning and big data
in digital journalism: a survey of state-of-the-art, challenges and opportunities.
Expert Syst. Appl. 221, 119795 (2023)
4. Freire, J., Koop, D., Santos, E., Silva, C.T.: Provenance for computational tasks:
a survey. Comput. Sci. Eng. 10(3), 11–21 (2008)
5. Gharibi, G., Walunj, V., Nekadi, R., Marri, R., Lee, Y.: Automated end-to-end
management of the modeling lifecycle in deep learning. Empir. Softw. Eng. 26,
1–33 (2021)
6. Goodfellow, I., Bengio, Y., Courville, A.: Deep Learning. MIT Press, Cambridge
(2016)
7. Groth, P., Moreau, L.: W3C PROV - an overview of the prov family of documents
(2013). https://www.w3.org/TR/prov-overview/
8. Kamm, S., Veekati, S.S., Müller, T., Jazdi, N., Weyrich, M.: A survey on machine
learning based analysis of heterogeneous data in industrial automation. Comput.
Ind. 149, 103930 (2023)
9. Krizhevsky, A.: Learning multiple layers of features from tiny images. Technical
report. University of Toronto (2009)
10. Li, T., Sahu, A.K., Talwalkar, A., Smith, V.: Federated learning: challenges, meth-
ods, and future directions. IEEE Sig. Process. Mag. 37(3), 50–60 (2020)
11. Li, T., Sahu, A.K., Zaheer, M., Sanjabi, M., Talwalkar, A., Smith, V.: Federated
optimization in heterogeneous networks. In: Proceedings of Machine Learning and
Systems (MLSys). mlsys.org (2020)
12. Lourenço, R., Freire, J., Simon, E., Weber, G., Shasha, D.E.: BugDoc. VLDB J.
32(1), 75–101 (2023)
13. McMahan, B., Moore, E., Ramage, D., Hampson, S., Arcas, B.A.: Communication-
efficient learning of deep networks from decentralized data. In: Proceedings of the
20th International Conference on Artificial Intelligence and Statistics (AISTATS), vol. 54, pp. 1273–1282. PMLR (2017)
14. Nair, D.G., Aswartha Narayana, C.V., Jaideep Reddy, K., Nair, J.J.: Exploring
SVM for federated machine learning applications. In: Rout, R.R., Ghosh, S.K.,
Jana, P.K., Tripathy, A.K., Sahoo, J.P., Li, K.C. (eds.) Advances in Distributed
Computing and Machine Learning. LNNS, vol. 427, pp. 295–305. Springer, Singa-
pore (2022). https://doi.org/10.1007/978-981-19-1018-0_25
15. Nogay, H.S., Adeli, H.: Diagnostic of autism spectrum disorder based on struc-
tural brain MRI images using grid search optimization and convolutional neural
networks. Biomed. Sig. Process. Control. 79(Part), 104234 (2023)
16. de Oliveira, D.C.M., Liu, J., Pacitti, E.: Data-Intensive Workflow Management:
For Clouds and Data-Intensive and Scalable Computing Environments. Synthesis
Lectures on Data Management. Morgan & Claypool Publishers, San Rafael (2019)
17. Parmar, J., Chouhan, S.S., Raychoudhury, V., Rathore, S.S.: Open-world machine
learning: applications, challenges, and opportunities. ACM Comput. Surv. 55(10),
205:1–205:37 (2023)
18. Peregrina, J.A., Ortiz, G., Zirpins, C.: Towards a metadata management system
for provenance, reproducibility and accountability in federated machine learning.
In: Zirpins, C., et al. (eds.) ESOCC 2022. CCIS, vol. 1617, pp. 5–18. Springer,
Cham (2022). https://doi.org/10.1007/978-3-031-23298-5_1
19. Pina, D.B., Chapman, A., de Oliveira, D., Mattoso, M.: Deep learning provenance
data integration: a practical approach. In: Ding, Y., Tang, J., Sequeda, J.F., Aroyo,
L., Castillo, C., Houben, G. (eds.) Companion Proceedings of the ACM Web Con-
ference 2023. WWW 2023, Austin, TX, USA, 30 April 2023–4 May 2023, pp.
1542–1550. ACM (2023)
20. Pina, D., Kunstmann, L., de Oliveira, D., Valduriez, P., Mattoso, M.: Prove-
nance supporting hyperparameter analysis in deep neural networks. In: Glavic, B.,
Braganholo, V., Koop, D. (eds.) IPAW 2020-2021. LNCS, vol. 12839, pp. 20–38.
Springer, Cham (2021). https://doi.org/10.1007/978-3-030-80960-7 2
21. Pina, D., et al.: Capturing provenance from deep learning applications using Keras-
Prov and Colab: a practical approach. J. Inf. Data Manag. 13(5) (2022)
22. Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., Chen, L.C.: MobileNetV2:
inverted residuals and linear bottlenecks. In: Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, pp. 4510–4520 (2018)
23. Schelter, S., Boese, J.H., Kirschnick, J., Klein, T., Seufert, S.: Automatically track-
ing metadata and provenance of machine learning experiments. In: Machine Learn-
ing Systems Workshop at NIPS (2017)
24. da Silva, F., Casanova, R., et al.: Workflows community summit: bringing the
scientific workflows research community together (2021)
25. Silva, V., et al.: Dfanalyzer: runtime dataflow analysis tool for computational sci-
ence and engineering applications. SoftwareX 12, 100592 (2020)
26. Vartak, M., Madden, S.: MODELDB: opportunities and challenges in managing
machine learning models. IEEE Data Eng. Bull. 41(4), 16–25 (2018)
27. Yang, Q., Liu, Y., Chen, T., Tong, Y.: Federated machine learning: concept and
applications. ACM Trans. Intell. Syst. Technol. 10(2) (2019)
High Performance Computing
Applications
A GPU Numerical Implementation of a 2D Simplified Wildfire Spreading Model
1 Introduction
Wildfires are a worldwide problem that each year consumes large areas, generating environmental and socioeconomic damage, among other harms. Most fires, for example in Chile [8], are caused by human negligence; natural causes, such as extreme weather conditions, thunderstorms, or volcanic eruptions, are less frequent. The season of greatest occurrence of this phenomenon is summer, generally in areas with a Mediterranean climate, conditions that favor the uncontrolled spread of fire.
The devastating effects and behavior of this type of disaster have made it important to develop technology for studying the phenomenon, especially to mitigate the damage it generates. For this purpose, several types of models have been developed which, according to [28], can be grouped into risk, propagation, and effect assessment models, all closely related to each other. These models follow a variety of approaches and may involve different mathematical techniques, mainly based on cellular automata or partial differential equations.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024
C. J. Barrios H. et al. (Eds.): CARLA 2023, CCIS 1887, pp. 131–145, 2024.
https://doi.org/10.1007/978-3-031-52186-7_9
132 D. San Martin and C. E. Torres
For the development of this work, we focus on fire propagation models that use a mathematical representation to describe fire behavior. Specifically, this work uses a physical model based on partial differential equations, as described by [4,12], and [30]. For a comparison between the modeling approaches and the details of the models on which this work is based, see [30,33]. The main objective of this work is to extend a numerical implementation of the selected mathematical model so that it can process multiple simulations efficiently, contributing to the study of wildfires. This allows a more in-depth analysis of the phenomenon by studying the behavior of the fire and the fuel in different scenarios, for instance, varying ignition sources, wind direction, or fuel characteristics. The use of Graphics Processing Units (GPUs) is introduced as a High-Performance Computing (HPC) tool for processing this large number of case studies.
For an overview of the main approaches used in modeling the phenomenon,
related work is presented in Sect. 2. Subsequently, in Sect. 3, the insight behind
the model studying the dynamics of wildfires is briefly described. In Sect. 4 we
present the numerical method which approximates the solution of the model
used, in addition to the proposed GPU implementation. The numerical experi-
ments and analysis of the implementation are presented in Sect. 5. Finally, the
conclusions and future work are presented in Sects. 6 and 7 respectively.
2 Related Work
Currently, there is a wide range of tools for the analysis of forest fires, including numerical simulation software. Most of this technology is based on mathematical models that, very generally, can be grouped into discrete and continuous approaches. Within the discrete approach are Cellular Automata (CA), which are widely used for their simplicity of implementation; however, it is a challenge to relate the state update rules to the physical phenomenon being modeled; see [1,2,7,14,18,19,21]. In the continuous approach, there are models based on Partial Differential Equations (PDEs), which build the model from the underlying physical process; however, these usually lead to computationally intensive implementations; see [4,13,15–17,22,24,30]. In many cases, these models are used to complement information systems for wildfire analysis (see [3,38]).
In terms of implementation, much of this software uses HPC tools such as OpenMP or MPI. Recently, due to the capabilities of GPUs and the General-Purpose Computing on Graphics Processing Units (GPGPU) approach, some of the newer software provides implementations using frameworks such as CUDA or OpenCL; see [5,6,9–11,25,34,35,37]. Since most of the references using GPUs are CA-based models, this work explores the development of an open-source PDE-based model implementation that includes an efficient numerical method compatible with the use of the CUDA framework for GPUs.
GPU Implementation of a 2D Wildfire Model 133
3 Mathematical Model
The mathematical model used in this work is based on the model originally
proposed by Asensio & Ferragut [4], also derived by Mandel et al. [22] and
Eberle et al. [13], and recently studied by San Martin & Torres in [29,30,33].
Conceptually, the model describes the temperature behavior through a pro-
cess of diffusion, convection, and reaction, in addition to fuel consumption by
the chemical reaction of the process. Diffusion represents the propagation of
temperature from a zone of higher temperature to a zone of lower temperature.
Convection is the heat transfer induced by the effect of a fluid, such as wind
in this case. The reaction process is an exothermic chemical process between
temperature and vegetable fuel.
Let u(x, t) be the temperature and β(x, t) the fuel fraction, both defined at the spatial coordinates x = (x, y) and time t, where x ∈ Ω = ]x_min, x_max[ × ]y_min, y_max[ ⊂ R² and t ∈ [t_min, t_max]. Let v(x, t) = w(t) + ∇T(x) be the vector field modeling the effect of wind and topography, and f(u, β) a non-linear heat source. Then, the mathematical model is defined as
u_t = κ ∇²u − v · ∇u + f(u, β),   in Ω × ]0, t_max],
β_t = g(u, β),                    in Ω × ]0, t_max],
u(x, t) = h1(x, t),               on Γ × ]0, t_max],      (1)
β(x, t) = h2(x, t),               on Γ × ]0, t_max],
u(x, 0) = u0(x),                  in Ω,
β(x, 0) = β0(x),                  in Ω,
where Γ represents the boundary of Ω, and
f(u, β) = s(u) β exp(u / (1 + εu)) − αu,
g(u, β) = −s(u) (ε/q) β exp(u / (1 + εu)),
with
s(u) = 1 if u ≥ u_pc, and s(u) = 0 otherwise.
The gradient is defined as ∇ = (∂/∂x, ∂/∂y), and ∇² = ∂²/∂x² + ∂²/∂y² is the Laplace operator.
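As a concrete illustration, the reaction terms above translate directly into NumPy. This is only a sketch: the parameter values are the ones used later in Sect. 5 and are otherwise arbitrary assumptions.

```python
import numpy as np

# Model parameters taken from the experiment in Sect. 5
# (illustrative values, not universal constants).
kappa, eps, u_pc, q, alpha = 10.0, 0.3, 3.0, 1.0, 0.01

def s(u):
    # Phase-change indicator: the reaction is active only above u_pc.
    return (u >= u_pc).astype(float)

def f(u, beta):
    # Non-linear heat source: Arrhenius-type reaction minus natural cooling.
    return s(u) * beta * np.exp(u / (1 + eps * u)) - alpha * u

def g(u, beta):
    # Fuel consumption rate driven by the same reaction term.
    return -s(u) * (eps / q) * beta * np.exp(u / (1 + eps * u))
```

Note how both f and g vanish below the phase-change temperature u_pc, so unburned regions stay inert until the fire front reaches them.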
4 Algorithm
The algorithm used in this work is derived from a numerical approximation using the Method of Lines, representing the PDE system (1) as the following Initial Value Problem (IVP)
ẏ(t) = Φ(t, y(t)), (2)
where
ẏ(t) = (vec(U̇(t)), vec(Ḃ(t)))
and
Φ(t, y(t)) = ( vec( κ (U(t) D_Nx^(2) + D_Ny^(2) U(t)) − V1(t) ∘ (U(t) D_Nx) − V2(t) ∘ (D_Ny U(t)) + F(t) ), vec(G(U(t), B(t))) ),
with y(0) = (vec(U(0)), vec(B(0))) the initial condition. vec(·) denotes the vectorization operator, applied in column-major order. Boundary conditions are imposed at each time step according to the values of h1(x, t) and h2(x, t) from (1).
The problem (2) is solved numerically using the second-order Finite Difference Method (FDM) and the fourth-order Runge-Kutta method (RK4) over the discrete spatial and time domain
xi = xmin + i Δx, i = 0, ..., Nx , Δx = (xmax − xmin )/Nx ,
yj = ymin + j Δy, j = 0, ..., Ny , Δy = (ymax − ymin )/Ny ,
tn = tmin + n Δt, n = 0, ..., Nt , Δt = (tmax − tmin )/Nt .
Nx, Ny, and Nt correspond to the number of intervals of the spatial and temporal domains. U stores the temperature discretization, B stores the fuel discretization, V1 and V2 are the matrices corresponding to the approximation of the components of the vector field v, and F and G are the numerical representations of the functions f and g defined in (1). D_N and D_N^(2) are the differentiation matrices that approximate the first and second spatial partial derivatives; see [30,36]. The symbol ∘ represents the Hadamard (element-wise) product.
The derivation, analysis of the convergence, and theoretical and numerically
observed computational complexity of the algorithm can be found in [30,33].
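To make the scheme concrete, the two building blocks, second-order differentiation matrices and one RK4 step, can be sketched in NumPy as follows. This is a simplification of the actual implementation in [31]; the function names are ours, and boundary rows are left untreated because boundary values are re-imposed after each step.

```python
import numpy as np

def diff_matrices(N, h):
    # Second-order centered finite-difference matrices on N+1 nodes with
    # spacing h. Only interior rows are filled (boundary values are
    # overwritten each step from h1 and h2).
    D = np.zeros((N + 1, N + 1))
    D2 = np.zeros((N + 1, N + 1))
    for i in range(1, N):
        D[i, i - 1], D[i, i + 1] = -1 / (2 * h), 1 / (2 * h)
        D2[i, i - 1], D2[i, i], D2[i, i + 1] = 1 / h**2, -2 / h**2, 1 / h**2
    return D, D2

def rk4_step(Phi, t, y, dt):
    # One step of the classical fourth-order Runge-Kutta method for (2).
    k1 = Phi(t, y)
    k2 = Phi(t + dt / 2, y + dt / 2 * k1)
    k3 = Phi(t + dt / 2, y + dt / 2 * k2)
    k4 = Phi(t + dt, y + dt * k3)
    return y + dt / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
```

Centered differences are exact for quadratics, which is a convenient sanity check when validating the matrices.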
4.1 Applications
This algorithm allows us to study, for example, the vulnerability of fuel zones or
risk maps (see examples in [30]), calculating the damage that fires can cause at
different initial ignition points. This analysis requires a large number of numeri-
cal simulations since there is uncertainty in the initial conditions of the problem.
For example, the location of the initial fire sources, and the meteorological con-
ditions, among others. See Fig. 1. Therefore, it is crucial to develop an implemen-
tation that keeps computation times in the order of minutes or seconds, since
the shorter the execution time, the higher the number of fire scenarios that can
be simulated.
[Fig. 1: schematic listing Scenario 1, Scenario 2, ..., Scenario N, each an independent simulation.]
The basic implementation of the algorithm was carried out in Python, using the NumPy library for vector handling of the structures and SciPy for handling sparse matrices. It is important to point out that these libraries provide optimized implementations of linear algebra algorithms and are mostly developed in C/C++ [27]. This version can be accessed in [31].
To compare the performance of our implementation, we also include a sequential version in C and a multi-threaded CPU version using OpenMP.
Graphics processing units (GPUs), included on graphics cards, have been widely used over the past few years because of the amount of computation they can process, including numerical simulations of PDEs. While video cards were developed primarily for graphics work, the General-Purpose GPU (GPGPU) concept has been widely adopted in scientific applications because of the computational advantages GPUs present over CPUs [20]. The parallel programming model for GPUs allows an instruction to be performed on multiple threads and is known as Single Instruction, Multiple Threads (SIMT). CUDA is a platform for GPU software development that provides a set of directives for working with the C/C++ language, while extending its use to other languages. Code on the GPU is executed in functions called kernels, which run in parallel on the available threads according to the characteristics of the graphics card. More details about CUDA programming can be found in the official development documentation [26].
The general parameters for the execution of a kernel are the number of blocks
per grid and the number of threads per block. The performance of the code on
GPU often depends on the correct selection of these parameters.
Due to the need to perform multiple numerical simulations, a GPU implementation is proposed where each thread processes the element Y_l^[m] of the scheme described in Fig. 2. Each scenario m is associated with an independent numerical simulation, with m ∈ [0, Ns − 1], l ∈ [0, 2 (Nx + 1) (Ny + 1) Ns − 1], and Ns the number of scenarios/simulations. The idea is to process 2 (Nx + 1) (Ny + 1) Ns elements in parallel at each timestep of the time integration method.
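The per-thread index arithmetic can be sketched in Python as follows. The layout of u and β inside each y^[m] is an assumption made for illustration; the actual CUDA kernel derives l from blockIdx and threadIdx.

```python
def thread_work(l, Nx, Ny, Ns):
    # Hypothetical mapping mirroring Fig. 2: the global index l selects
    # one element of one scenario's state vector.
    per_scenario = 2 * (Nx + 1) * (Ny + 1)  # temperature + fuel unknowns
    m = l // per_scenario                   # scenario index, 0 <= m < Ns
    k = l % per_scenario                    # position inside y^[m]
    # Assumed layout: temperature unknowns first, fuel unknowns second.
    field = "u" if k < per_scenario // 2 else "beta"
    return m, k, field
```

With this mapping, consecutive threads touch consecutive memory locations within a scenario, which is the kind of coalesced access GPUs favor.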
[Fig. 2: layout in GPU global memory of the vectors Y0, ..., YNt; each Yn concatenates the scenario states yn^[0], ..., yn^[Ns−1] of Scenario 0 through Scenario Ns − 1.]
The full system advanced in time is then Ẏ(t) = Φ(t, Y(t)), where
Y(t) = (y(t)^[0], ..., y(t)^[Ns−1])
represents the vector with all the numerical simulations, and
Φ(t, Y(t)) = (Φ(t, y(t))^[0], ..., Φ(t, y(t))^[Ns−1])
is the evaluation of the right-hand side of (2) for each fire scenario.
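On a CPU, this per-scenario evaluation of Φ amounts to an explicit loop, sketched below with hypothetical helper names; the GPU version removes the loop entirely by assigning one thread per vector element.

```python
import numpy as np

def Phi_batched(t, Y, Phi, n_scen, n_dof):
    # Evaluate the right-hand side of (2) independently for every scenario.
    # Y concatenates the n_dof unknowns of each of the n_scen scenarios.
    out = np.empty_like(Y)
    for m in range(n_scen):
        sl = slice(m * n_dof, (m + 1) * n_dof)
        out[sl] = Phi(t, Y[sl])  # scenarios are fully independent
    return out
```

Because the scenarios never exchange data, the loop body is embarrassingly parallel, which is exactly what makes the one-thread-per-element GPU mapping effective.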
The parameters of the functions are detailed in Eq. (1), and N_BLOCKS and N_THREADS are the CUDA kernel launch parameters. In particular, the RHSVec kernel consumes approximately 90% of the execution time.
5 Numerical Experiments
The second experiment, Fig. 4, has the spatial domain Ω = ]−100, 100[² and temporal domain t ∈ ]0, 30]. The initial condition for the temperature is
u0(x, y) = 6 if (x, y) ∈ Ω0, and u0(x, y) = 0 otherwise,
with Ω0 = [−100, −93.7] × [−21.3, 21.3], and β0 ∼ U(0, 1). The vector field used is v(x, y, t) = (1, 0). The boundary conditions are h1 = h2 = 0. The model parameters are κ = 10, ε = 0.3, u_pc = 3, q = 1, and α = 0.01, and the numbers of intervals are Nx = Ny = 127 and Nt = 900.
Figs. 3 and 4 show how the fire front, the highest temperature zone, and fuel
consumption follow the dynamics of the wind, as expected.
Fig. 4. Second experiment with a rectangular initial fire source. This experiment aims
to replicate the output presented by Mell et al. [23].
5.2 Comparison
The code was executed in the computer cluster of the Centro Cientı́fico Tec-
nológico de Valparaı́so from Universidad Técnica Federico Santa Marı́a (CCT-
Val ). Computer nodes have an Intel(R) Xeon(R) E5-2643 v2 CPU with 3.9
GHz frequency, 6 physical nodes and 12 logical cores. These nodes include a
Tesla K20m graphic card with 2496 cores, 706 MHz frequency, and 5 GB of
vRAM. Additionally, they have 64 GB of RAM.
For the OpenMP implementation, 12 threads were selected due to the shorter
average execution time. To select the CUDA kernel parameters, we performed
10 experiments for 8100 simulations. The results are presented in Table 1.
Table 1. Average execution times (s) of the algorithm for different numbers of threads
and blocks.
                      # Blocks
# Threads     32      64     128     256     512
   128      116.7    74.2    60.5    62.1    61.1
   256       74.0    59.1    59.7    58.8    57.7
   512       73.4    65.8    60.7    60.9    59.0
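A parameter sweep like the one summarized in Table 1 can be sketched as follows, where time_run is a hypothetical callback that launches the kernel with a given configuration and returns the elapsed time in seconds.

```python
import itertools
import statistics

def best_config(time_run, blocks=(32, 64, 128, 256, 512),
                threads=(128, 256, 512), reps=10):
    # Time every (blocks, threads) pair over `reps` runs and return the
    # configuration with the shortest average execution time, together
    # with the full table of averages.
    results = {}
    for b, t in itertools.product(blocks, threads):
        results[(b, t)] = statistics.mean(time_run(b, t) for _ in range(reps))
    return min(results, key=results.get), results
```

Averaging over repeated runs matters here because individual kernel timings fluctuate with scheduler and memory effects.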
Based on the previous results, the configuration of 512 blocks and 256 threads obtains the shortest average execution time, 49.42% smaller than the worst case with 32 blocks and 128 threads. Figure 5a presents the average times of experiments performed with this configuration. In addition, we define the speedup as the quotient between the execution time of the CPU implementations included in the article and that of the GPU version using CUDA. Figure 5b shows these values.
When analyzing the curves in the execution times plot, we can see how the
GPU implementation considerably reduces the computation time required for
processing the different scenarios. Notice that for over 1000 execution scenarios, the GPU implementation is approximately 88 times faster than the first version in Python. Likewise, the CUDA version is 14 times faster than the multi-threaded OpenMP CPU implementation.
The degradation of the speedup occurs mainly because the number of available threads is lower than the number of tasks to be computed under this approach; thus, the computation has to be performed in batches.
6 Conclusions
This work has presented an open-source GPU implementation of the wildfire spread model proposed by [4] with the simplification introduced by [13]. A description of the mathematical model, the algorithm, and a GPU implementation strategy is provided. To validate the implementation, an extensive comparison has been made with CPU versions, including implementations in C and OpenMP, both in the numerical results obtained and in the execution times.
The algorithm has been successfully adapted to take advantage of the power of graphics cards. The results show that the proposal is 88 times faster than the version in [30], 32 times faster than a sequential version in C, and approximately 14 times faster than a multi-threaded CPU implementation using OpenMP.
In addition, the described implementation is publicly available on GitHub
[32], therefore the results can be reproduced and the code can be used as an
active tool in the study of wildfires.
7 Future Work
Due to the limitations of the parallelization approach used in this work, specifically, using one thread to process each element of the vector of simulations, other strategies to improve the performance of the implementation remain open to exploration. For example, each thread could process more than one element of the vector, exploiting the local relationships that arise when approximating the spatial partial derivatives.
Further extensions of the current work are the utilization of different types
of available memory to further reduce GPU computation times, the use of code
profiling in order to optimize and avoid possible bottlenecks, and the exten-
sion of the presented framework to the use of multi-GPU or hybrid CPU -GPU
architecture to solve even more scenarios in parallel.
References
1. Alexandridis, A., Vakalis, D., Siettos, C., Bafas, G.: A cellular automata model for
forest fire spread prediction: the case of the wildfire that swept through Spetses
Island in 1990. Appl. Math. Comput. 204(1), 191–201 (2008). https://doi.org/10.
1016/j.amc.2008.06.046
2. Almeida, R.M., Macau, E.E.N.: Stochastic cellular automata model for wildland
fire spread dynamics. J. Phys. Conf. Ser. 285(1), 012038 (2011). https://doi.org/
10.1088/1742-6596/285/1/012038
3. Arganaraz, J., Lighezzolo, A., Clemoveki, K., Bridera, D., Scavuzzo, J., Bellis, L.:
Operational meteo fire risk system based on space information for Chaco Serrano.
IEEE Lat. Am. Trans. 16(3), 975–980 (2018). https://doi.org/10.1109/TLA.2018.
8358681
4. Asensio, M.I., Ferragut, L.: On a wildland fire model with radiation. Int. J. Numer.
Meth. Eng. 54(1), 137–157 (2002). https://doi.org/10.1002/nme.420
5. Carrillo, C., Margalef, T., Espinosa, A., Cortés, A.: Accelerating wild fire simulator
using GPU. In: Rodrigues, J.M.F., et al. (eds.) ICCS 2019. LNCS, vol. 11540, pp.
521–527. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-22750-0_46
6. Carrillo, C., Cortés, A., Margalef, T., Espinosa, A., Cencerrado, A.: Applying GPU
parallel technology to accelerate FARSITE forest fire simulator. In: Advances in
Forest Fire Research, pp. 913–921 (2018). https://doi.org/10.14195/978-989-26-16-506_100
7. Chopard, B., Droz, M.: Cellular automata model for the diffusion equation. J. Stat.
Phys. 64(3), 859–892 (1991). https://doi.org/10.1007/BF01048321
8. CONAF: Incendios Forestales en Chile (2021). http://www.conaf.cl/incendios-
forestales/incendios-forestales-en-chile/
9. Denham, M., Laneri, K.: Using efficient parallelization in graphic processing units
to parameterize stochastic fire propagation models. J. Comput. Sci. 25, 76–88
(2018). https://doi.org/10.1016/J.JOCS.2018.02.007
10. Denham, M.M., Waidelich, S., Laneri, K.: Visualization and modeling of forest fire
propagation in Patagonia. Environ. Model. Softw. 158, 105526 (2022). https://
doi.org/10.1016/J.ENVSOFT.2022.105526
11. D’Ambrosio, D., Gregorio, S.D., Filippone, G., Rongo, R., Spataro, W., Trunfio,
G.A.: A Multi-GPU approach to fast wildfire hazard mapping. Adv. Intell. Syst.
Comput. 256, 183–195 (2014). https://doi.org/10.1007/978-3-319-03581-9_13
12. Eberle, S.: Modeling and simulation of forest fire spreading. In: Pardo-Igúzquiza, E., Guardiola-Albert, C., Heredia, J., Moreno-Merino, L., Durán, J.J., Vargas-Guzmán, J.A. (eds.) Mathematics of Planet Earth, pp. 811–814. Springer, Berlin, Heidelberg (2014). https://doi.org/10.1007/978-3-642-32408-6_175
13. Eberle, S., Freeden, W., Matthes, U.: Forest fire spreading. In: Freeden, W., Nashed, M.Z., Sonar, T. (eds.) Handbook of Geomathematics, pp. 1349–1385. Springer, Berlin, Heidelberg (2015). https://doi.org/10.1007/978-3-642-54551-1_70
14. Fernandez-Anez, N., Christensen, K., Rein, G.: Two-dimensional model of smoul-
dering combustion using multi-layer cellular automaton: the role of ignition loca-
tion and direction of airflow. Fire Saf. J. 91, 243–251 (2017). https://doi.org/10.
1016/J.FIRESAF.2017.03.009
15. Ferragut, L., Asensio, M.I., Cascón, J.M., Prieto, D.: A wildland fire physical model
well suited to data assimilation. Pure Appl. Geophys. 172(1), 121–139 (2015).
https://doi.org/10.1007/s00024-014-0893-9
16. Ferragut, L., Asensio, M.I., Monedero, S.: Modelling radiation and moisture con-
tent in fire spread. Commun. Numer. Meth. Eng. 23, 819–833 (2006). https://doi.
org/10.1002/cnm.927
17. Ferragut, L., Asensio, M.I., Monedero, S.: A numerical method for solving
convection-reaction-diffusion multivalued equations in fire spread modelling. Adv.
Eng. Softw. 38(6), 366–371 (2007). https://doi.org/10.1016/J.ADVENGSOFT.
2006.09.007
18. Ghisu, T., Arca, B., Pellizzaro, G., Duce, P.: An improved cellular automata for
wildfire spread. Procedia Comput. Sci. 51, 2287–2296 (2015). https://doi.org/10.
1016/J.PROCS.2015.05.388
19. Hansen, P.B.: Parallel cellular automata: a model program for computational sci-
ence. Concurrency Pract. Experience 5(5), 425–448 (1993). https://doi.org/10.
1002/cpe.4330050504
20. Harris, M.: Introducing parallel forall. https://developer.nvidia.com/blog/?p=8.
Accessed 3 Oct 2023
21. Karafyllidis, I., Thanailakis, A.: A model for predicting forest fire spreading using
cellular automata. Ecol. Model. 99(1), 87–97 (1997). https://doi.org/10.1016/
S0304-3800(96)01942-4
22. Mandel, J., et al.: A wildland fire model with data assimilation. Math. Comput.
Simul. 79(3), 584–606 (2008). https://doi.org/10.1016/j.matcom.2008.03.015
23. Mell, W., Jenkins, M.A., Gould, J., Cheney, P.: A physics-based approach to mod-
elling grassland fires. Int. J. Wildland Fire 16(1), 1–22 (2007). https://doi.org/10.
1071/WF06002
24. Montenegro, R., Plaza, A., Ferragut, L., Asensio, M.I.: Application of a nonlinear
evolution model to fire propagation. Nonlinear Anal. Theory Methods Appl. 30(5),
2873–2882 (1997). https://doi.org/10.1016/S0362-546X(97)00341-6
25. Ntinas, V.G., Moutafis, B.E., Trunfio, G.A., Sirakoulis, G.C.: GPU and FPGA
parallelization of fuzzy cellular automata for the simulation of wildfire spread-
ing. In: Wyrzykowski, R., Deelman, E., Dongarra, J., Karczewski, K., Kitowski,
J., Wiatr, K. (eds.) PPAM 2015. LNCS, vol. 9574, pp. 560–569. Springer, Cham
(2016). https://doi.org/10.1007/978-3-319-32152-3_52
26. NVIDIA: CUDA C++ Programming Guide. https://docs.nvidia.com/cuda/cuda-
c-programming-guide/. Accessed 3 Oct 2023
27. Oliphant, T.E.: Python for scientific computing. Comput. Sci. Eng. 9(3), 10–20
(2007). https://doi.org/10.1109/MCSE.2007.58
28. Preisler, H.K., Ager, A.A.: Forest-Fire Models. Encycl. Environmetrics (2013).
https://doi.org/10.1002/9780470057339.vaf010.pub2
29. San Martín, D., Torres, C.E.: Exploring a spectral numerical algorithm for solv-
ing a wildfire mathematical model. In: 2019 38th International Conference of the
Chilean Computer Science Society (SCCC), pp. 1–7 (2019). https://doi.org/10.
1109/SCCC49216.2019.8966412
30. San Martín, D., Torres, C.E.: Ngen-Kütral: toward an open source framework for Chilean wildfire spreading. In: 2018 37th International Conference of the Chilean Computer Science Society (SCCC) (2018)
1 Introduction
Geophysical exploration methods play a vital role in our society as they enable
the discovery of fundamental resources (e.g., oil and gas) that drive the economic
development of nations. However, pursuing new oil reservoirs usually involves
destructive practices like drilling in environmentally sensitive areas and improper
waste disposal. Hence, researchers have developed applications that simulate
seismic imaging for oil detection to mitigate these adverse effects and enhance
drilling precision. On top of that, given that these applications involve a huge
amount of data and naturally lend themselves to parallel processing, graphics
processing units (GPUs) have become extensively employed to accelerate such
tasks [Lukawski et al., 2014].
GPUs are architectures designed with a single instruction, multiple data
(SIMD) approach and incorporate thousands of processing cores. This design
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024
C. J. Barrios H. et al. (Eds.): CARLA 2023, CCIS 1887, pp. 146–159, 2024.
https://doi.org/10.1007/978-3-031-52186-7_10
Towards a Multi-GPU Implementation of a Seismic Application 147
makes them well-suited as accelerator devices for executing applications that effi-
ciently handle array and matrix data structures. However, despite their impres-
sive computing capabilities, GPUs demand significant power during their opera-
tion. Consequently, optimizing the utilization of the available hardware resources
on GPUs, such as cores and memory, becomes imperative when executing paral-
lel applications [Lorenzon and Beck Filho, 2019]. By doing this, we can effectively
reduce energy consumption while mitigating the associated environmental and
economic impacts [Navaux et al., 2023].
With the increasing availability of multiple GPUs in high-performance
servers, one can further explore the processing potential through Multi-GPU
systems [Papadrakakis et al., 2011]. In this scenario, by distributing the work-
load among all the GPUs available in the system, one can take advantage of the
parallel processing power of each one, achieving significant performance improve-
ments [Liu et al., 2019]. Furthermore, the improvement in scalability provided
by multiple GPUs allows the handling of increasingly larger datasets, improving
the analysis capacity and quality of the generated seismic images through the
greater density of data incorporated in the final result.
However, effectively implementing seismic applications that can fully lever-
age the processing power of Multi-GPU systems poses a significant challenge. A
key obstacle lies in achieving efficient utilization of GPUs. Therefore, striking
a balance in workload distribution across the GPUs becomes crucial to opti-
mize the available processing power. Additionally, employing efficient strategies
for workload partitioning and thread coordination is vital to prevent resource
underutilization or overload on the GPUs. By addressing these challenges, one
can maximize the efficiency and performance of parallel applications on Multi-
GPU systems.
Considering the aforementioned scenario, we propose a Multi-GPU imple-
mentation for the Fletcher Method. Our main objective is to provide a workload
division strategy that considers fundamental aspects such as data locality and
maximizes GPU processing capacity. To validate the proposed implementation,
we performed extensive experiments using twenty-nine different grid input sets
on a system with eight GPUs. With that, we can verify the performance gains
and energy consumption reductions a Multi-GPU implementation provides as
the grid input set changes.
Through the experiments, we demonstrate that the proposed Multi-GPU
implementation can provide significant performance improvements over the
Single-GPU version (e.g., 2.77 times using 4 GPUs). Moreover, the results indi-
cate that more GPUs are associated with greater throughput, highlighting scal-
ability as a critical aspect of optimizing performance in this application. We
also show that the performance and efficiency of multi-GPU implementations
are directly proportional to the size of the input grid. However, it is essential to
highlight that for smaller input grids, the multi-GPU performance is degraded
due to the cost of inter-GPU synchronization and data communication, charac-
teristics inherent to these implementations.
148 P. H. C. Rigon et al.
(1/V²) ∂²p/∂t² = ∇²p                      (1)
(1/V²) ∂²p/∂t² = ∇²p − (∇ρ/ρ) · ∇p        (2)
Seismic modeling initializes by collecting data in a seismic survey, as illus-
trated in (Fig. 1). The procedure begins with equipment attached to a ship,
which at regular intervals emits seismic waves that reflect and refract in inter-
actions with different environmental undergrounds, working as a sonar to map
geological structures. When these waves return to the ocean’s surface, specific
sensors installed on cables towed by the ship capture and record seismic vari-
ations. These variations, a.k.a. seismic traces, correspond to the set of signals
obtained by each sensor during the wave emission. Therefore, with each emission
of waves, the seismic traces of all the sensors on the cable are recorded,
providing a comprehensive overview of the subsoil. During this operation, the
ship continues to move and emit signals periodically, thus producing a detailed
image of the seabed and underground [Chu et al., 2011].
The Fletcher method models the acoustic wave propagation in a Tilted Trans-
versely Isotropic (TTI) environment through a three-dimensional grid, in which
the size of each dimension (x,y, and z ) is defined by the variables sx, sy, and sz,
respectively. Each point on this grid represents a point in the physical environ-
ment being modeled, and this point is associated with physical characteristics
such as pressure, density, and wave velocity. In the case of TTI media, the wave
velocity varies depending on the direction. That is, each point has an associated
slope direction.
We illustrate the single-GPU implementation of the Fletcher method in Algorithm 1. It requires the following input parameters: the number of iterations the wave will propagate (endTime), a value that defines the period at which the state of the wave will be stored on disk (threshToWriteWave), and the dimensions of the grid (sx, sy, and sz).
Fig. 2. The pressure point inserted (in red) at the center of the three-dimensional pressure vector. (Color figure online)
The procedure starts by initializing the
grid with the physical characteristics of the environment through the initialize-
Grid() function. Initially, a pressure point that represents the amplitude of the
seismic wave for a given instant is inserted at the central position of the three-
dimensional pressure vector (Fig. 2). Then, before the kernel starts the execution
on the GPU, this three-dimensional array is mapped to a one-dimensional array,
following the traditional (x, y and z ) order. This means that the points x of the
same line are mapped contiguously in the one-dimensional vector resulting from
the mapping.
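Assuming this ordering, the mapping from grid coordinates to the flat index can be sketched as follows (the function name is ours, introduced only for illustration):

```python
def idx(x, y, z, sx, sy):
    # Flatten (x, y, z) in the traditional order: x varies fastest, so
    # points along the same x line are contiguous in the 1D array.
    return x + sx * (y + sy * z)
```

Contiguity along x is what lets consecutive GPU threads read consecutive memory addresses, enabling coalesced global-memory accesses.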
The loop from line 6 to 13 is responsible for iterating until the simulation
is performed. Then, for each iteration, a modulated Gaussian pulse representing
the amplitude of the seismic wave at a given time instant is inserted in the center
of the three-dimensional grid (insertSourcePointToDevice()). Then, the CUDA
Kernel is launched for execution, which will propagate this pressure point in
time. The propagation of the seismic wave is based on the computation of a
5-point stencil during the kernel execution. Stencil computation is a technique in which a center point is computed from the values of neighboring points produced by the previous kernel invocation. This approach is widely used in parallel processing algorithms and image processing [Pearson et al., 2020]. During computation using the stencil technique (Fig. 3), there is no dependency between the calculations of individual points, which means that they can be computed independently and in parallel. This property makes the processing highly parallelizable, allowing multiple points to be calculated simultaneously and speeding up execution [Pavan et al., 2019].
Once the point associated with the acoustic wave is computed, the wave
state is propagated to the previous state to proceed with the next iteration. In
this scenario, two buffers are used: pp (previous state) and pc (current state).
The current state of the wave is moved to the pp buffer, which now becomes
the previous state. At the same time, the newly calculated next state is stored
in the pc buffer, which now becomes the current state. Furthermore, when the
number of iterations reaches a defined threshold, the wave is written to the disk
(writeWave()).
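The two-buffer scheme can be sketched in a few lines; the function name and the toy update rule are illustrative stand-ins for the CUDA kernel, not the paper's implementation.

```python
def step(pp, pc, propagate):
    """Advance one time step: compute the next state, then swap roles so
    that pc becomes the previous state and the new state becomes current."""
    pn = propagate(pp, pc)  # next state from previous + current
    return pc, pn           # (new pp, new pc)

# Toy leapfrog-like update standing in for the wave-propagation kernel.
pp, pc = 0.0, 1.0
for _ in range(3):
    pp, pc = step(pp, pc, lambda prev, cur: 2 * cur - prev)
```

In CUDA this swap is done by exchanging device pointers, so no data needs to be copied between the buffers.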
In summary, for each time step, the pressure values at each grid point are
updated based on the wave equation and the previous pressure, density, and
wave velocity values of the point and its neighbors. To avoid artificial reflections
from the border of the grid, which can interfere with the wave propagation
characteristic, an absorption zone of 16 points is applied. The seismic waves are
artificially damped in this region according to the distance from the inner grid.
In this way, the closer to the border of the grid, within the absorption region, the
greater the smoothing velocity at these points, so the velocity set at the border
is zero.
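The damping scheme can be illustrated with a minimal sketch; the linear ramp is an assumption for illustration, since the text does not give the exact damping profile.

```python
ABSORB = 16  # width of the absorption zone, in grid points (from the text)

def damping_factor(dist_to_border):
    """1.0 inside the inner grid, ramping down to 0.0 at the border,
    where the velocity is forced to zero."""
    if dist_to_border >= ABSORB:
        return 1.0
    return dist_to_border / ABSORB

assert damping_factor(0) == 0.0     # at the border: fully damped
assert damping_factor(16) == 1.0    # inner grid: untouched
```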
In this section, we list the works that exploit the parallelism of seismic applica-
tions. They are organized in chronological order.
[Liu et al., 2019] explore using GPUs to accelerate the Reverse Time Migra-
tion (RTM) algorithm. The parallelization scheme focuses on using two GPUs
by employing a workload division strategy. The authors also consider a version
that relies on the unified memory scheme available in CUDA. The results suggest
that the workload division strategy presents better results than unified memory
in a multi-GPU environment. Furthermore, the authors argue that computa-
tional efficiency grows linearly with the increase in GPUs. In contrast, our paper
extends the scope to use Multi-GPU systems with up to eight GPUs, with a
balanced workload distribution approach across all these components.
[Serpa and Mishra, 2022] address optimizing the Fletcher method on multi-
core and single-GPU architectures, focusing on portability. The paper analyzes
the performance, energy consumption, and energy efficiency of two versions of
the code, an original version and an optimized version for OpenMP, OpenACC,
and CUDA. The results indicate that the CUDA version has the best performance
and energy efficiency among all evaluated versions. While it focuses on
multicore architectures and single-GPU only, our work focuses on Multi-GPU
architectures and addresses related topics such as border exchange between algo-
rithm iterations and the performance and energy improvements as the grid size
increases.
[Liu et al., 2012] discuss the implementation of the GPU-accelerated RTM
algorithm. It also addresses specific issues such as uneven topography and
anisotropic environment, i.e., it focuses on various technical aspects of imple-
menting GPU-accelerated RTM. In contrast, our work focuses on the imple-
mentation of a Multi-GPU system and on the performance and energy efficiency
results.
[Pearson et al., 2020] explore techniques to improve 3D stencil communi-
cation on heterogeneous supercomputers using strategies such as hierarchical
partitioning and optimization of data exchange between GPUs. Through tests
on up to 256 nodes, the authors demonstrate the efficiency of these techniques
in improving communication and reducing data exchange time. On the other
hand, our work uses concepts of stencil communication to implement Fletcher’s
method of propagating seismic waves in a Multi-GPU environment in CUDA,
exploring aspects such as performance and energy efficiency.
[Okamoto et al., 2010] describe how the use of multiple GPUs can signifi-
cantly speed up simulations of seismic wave propagation. A particular challenge
faced when using multiple GPUs is non-contiguous memory alignment in the
overlapping regions between subdomains processed by different GPUs. This can
lead to delays in data transfer between the device and the host node. In contrast,
our work addresses this scenario and shows that with the increase in the grid
input size, there is a proportional increase in the synchronization time between
the subdomains processed by different GPUs.
Because the three-dimensional grid is mapped to memory in the traditional way
(x, y, z), it is essential to note that the neighboring addresses along the x and
y axes will be closer in memory than the neighboring points along the z-axis.
In this scenario, assigning a continuous block in the z-axis direction to each
GPU can increase memory locality in the L1 and L2 GPU caches, reducing the
need for slower global memory accesses.
Moreover, we explore memory coalescence because GPUs are designed to be
more efficient when threads of the same warp access data stored in contiguous
memory addresses. Hence, decreasing the distance between the points makes
memory accesses more coalesced. This allows GPU memory reads and writes to
be combined into fewer memory transactions, resulting in a
more efficient use of memory bandwidth. However, although there is immediate
parallelism during kernel execution, it is worth mentioning that each call to the
CUDA kernel advances the solution by a single dt time step. Hence, to propa-
gate the wave by several time steps, the kernel needs to be iterated, and between
these iterations, we must ensure the exchange of information between the dif-
ferent GPUs. This communication occurs through synchronizing and updating
variables in the border regions of the computing domain.
For the 5-point stencil computation, the edges represent intersection zones
of the three-dimensional grid data mapped to the GPUs used for processing.
These borders are needed to synchronize data across all GPUs, ensuring the
correct reading of data throughout the execution of the CUDA kernel. Hence,
computing the stencil requires that each subdomain, present on each GPU, have
a 5-point border in the z dimension for each intersection point with the subdo-
main of other GPUs. Therefore, after the kernel execution, it is necessary to
exchange the upper and lower borders across the neighboring GPUs to propa-
gate the updated data to the next iteration of wave propagation. During this
exchange operation, cudaMemcpyDeviceToDevice was used to assess the impact
of the border exchange considering an indirect communication. When using this
function, the communication always passes through the Host.
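The border exchange can be sketched with a NumPy stand-in (host code, not CUDA); the 5-plane halo follows the text, while the array shapes and names are illustrative.

```python
import numpy as np

HALO = 5  # 5-point border in z, as described in the text

def exchange_borders(subdomains):
    """Copy each subdomain's outermost interior planes into its
    neighbours' ghost regions (axis 0 plays the role of z)."""
    for i in range(len(subdomains) - 1):
        lower, upper = subdomains[i], subdomains[i + 1]
        # upper's bottom ghost zone <- lower's top interior planes
        upper[:HALO] = lower[-2 * HALO:-HALO]
        # lower's top ghost zone <- upper's bottom interior planes
        lower[-HALO:] = upper[HALO:2 * HALO]

# Two subdomains of 20 z-planes each (including ghost zones).
a = np.zeros((20, 4, 4)); a[-2 * HALO:-HALO] = 1.0
b = np.zeros((20, 4, 4)); b[HALO:2 * HALO] = 2.0
exchange_borders([a, b])
assert (b[:HALO] == 1.0).all() and (a[-HALO:] == 2.0).all()
```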
Moreover, the workload distribution along the z-axis also provides benefits in
reducing the number of operations performed only to exchange borders through
GPUs. Although it is not possible to eliminate all communications, this strategy
minimizes the need for inter-GPU communication because threads on each GPU
can process their grid points independently without synchronizing and updating
data from other GPUs. In addition, the relative cost of border synchronization
is reduced, even though this overhead grows with the grid size (Fig. 5).
4 Methodology
The experiments were performed on a p3.16xlarge AWS instance (Table 1), which
is equipped with 64 Intel Xeon E5-2686 v4 (Broadwell) VCPUs, each supporting
two threads per core, resulting in a total of 128 available execution threads. In
addition, the instance has 8 NVIDIA Tesla V100-SXM2 16 GB GPUs and offers
488 GiB of RAM. Also, the following versions were used: CUDA v.12.0, NVIDIA
driver 525.85.12, and gcc 9.4.0 with the -O3 optimization flag.
Processor Specifications
Processor      Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30 GHz, 64 vCPUs
Architecture   Broadwell
Memory         1 MiB L1d, 1 MiB L1i, 8 MiB L2, 90 MiB L3
GPU Specifications
GPU            NVIDIA Tesla V100-SXM2
Architecture   Volta
Chip           GV100
Registers      256 KB/SM, 20480 KB/GPU
Memory         4096-bit HBM2, 16 GB, 6144 KB L2 cache
We have considered twenty-nine different input grid sizes for the Fletcher
method, ranging from 88 to 984 (the maximum size we could allocate in the
available GPU memory).
5 Results
In this section, we present, analyze, and discuss the results obtained from the
experiments. First, we discuss the performance of the Multi-GPU implementa-
tion of the Fletcher method for 1, 2, 4, and 8 GPUs. Then, energy efficiency is
analyzed by comparing the average power demand (in Watts) of the GPUs dur-
ing the iteration of the CUDA kernel with the performance (in MSamples/s).
Fig. 6. Performance results for each grid size and implementation. The higher the bar,
the better the performance.
Figure 6 shows the performance results for the entire experiment set. It
is worth mentioning that the grid set computed on the Single-GPU and 2-GPU
versions is limited by the GPU VRAM, which in this case is 16 GB per
device (NVIDIA Tesla V100-SXM2). Therefore, the first observation is that the
Multi-GPU implementation of Fletcher allows the execution of larger grid sizes,
improving the capability and quality of seismic images generated through higher
data density incorporated into the final result.
By analyzing the behavior of Fig. 7, one can highlight that the performance
grows along with the increase in the grid size, allowing the user to increase
application throughput. As an example, for a grid size equal to 504, the maximum
speedup of 2.77 is achieved over the Single-GPU with 4 GPUs. Moreover, by
using 2 GPUs, we observed a speedup over the Single-GPU greater than 2 for
grid sizes 376 and larger, indicating efficient scalability for this configuration.
However, when increasing the number of GPUs to eight, we could not obtain
gains proportional to this increase in computational capacity because of the
cost of inter-GPU communication. That is, with a small grid size, parallelism in
kernel computation is not fully exploited, and the cost of border synchronization
between the GPUs is significant relative to the execution time of the CUDA
kernel. Therefore, we argue that an effective 8-GPU strategy would require a
larger problem, where the cost of communication would have a smaller impact
relative to the throughput gain.
This statement is corroborated by the analysis of Fig. 6, which demonstrates
that the performance of multi-GPU implementations exhibits a positive lin-
ear trend with increasing input grid size. This implies that the effectiveness
of Multi-GPU computing amplifies proportionally to the scale of the problem.
Thus, its performance becomes especially notable for large-scale computational
tasks, where the ratio between the cost of inter-GPU communication and CUDA
kernel computation is optimized. Additionally, the use of 2 GPUs reaches its
performance peak for input grid sizes close to 408. Afterward, a slight decrease in
performance is observed as the grid size increases, until GPU memory capacity
(VRAM) limits computation, which it does for input sizes greater than 792.
Fig. 8. Maximum power dissipation for each version and grid input set.
Figure 8 illustrates the maximum power achieved while running the CUDA
kernel for different versions and grid input sizes. The maximum power increases
with the number of GPUs. The Single-GPU approach consistently presents the
lowest power across all problem sizes. This increase in maximum
power with the use of more GPUs is expected, as more processing units mean
more power dissipated. However, it is important to highlight that an increase
in power does not always translate into a proportional increase in performance,
highlighting the importance of considering energy efficiency when analyzing and
optimizing applications for multi-GPU systems, aiming to achieve the best bal-
ance between performance and power consumption.
Extending the analysis, we consider the average power consumed during the
execution of the CUDA kernel. Figure 9 shows that the single-GPU
implementation has high efficiency for small input grid sizes. This is due to the
fact that there is no cost of inter-GPU edge synchronization and also the fact that
computing power does not become a performance-limiting factor for these input
sizes, due to the reduced scale of the grid. Furthermore, we confirmed the low
power efficiency of multi-GPU implementations for lower problem sizes. This is
because the cost of inter-GPU synchronization and data communication inherent
in such implementations results in unnecessarily high power consumption for
issues that could be efficiently managed by a single GPU.
Furthermore, the energy efficiency of the multi-GPU versions grows linearly with
the size of the input grid. This indicates that achieving high efficiency with
multiple GPUs requires a large input set, in order to ensure that the throughput
inherent to the
multi-GPU implementation significantly outweighs the cost associated with data
synchronization. This reinforces the idea that the best performance between
single-GPU and multi-GPU implementations depends on factors such as the
input grid and the complexity of modeling wave and medium characteristics.
Acknowledgment. This work has been partially supported by Petrobras under num-
ber 2020/00182-5, by the call CNPq/MCTI/FNDCT - Universal 18/2021 under grants
406182/2021-3, and by the Coordenação de Aperfeiçoamento de Pessoal de Nível
Superior - Brazil (CAPES) - Finance Code 001.
References
Chu, C., Macy, B.K., Anno, P.D.: Approximation of pure acoustic seismic wave prop-
agation in TTI media. Geophysics 76(5), WB97–WB107 (2011)
Fletcher, R.P., Du, X., Fowler, P.J.: Reverse time migration in tilted transversely
isotropic (TTI) media. Geophysics 74(6), WCA179–WCA187 (2009)
Liu, G.-F., Meng, X.-H., Yu, Z.-J., Liu, D.-J.: An efficient scheme for multi-GPU TTI
reverse time migration. Appl. Geophys. 16(1), 56–63 (2019)
Liu, H., Li, B., Liu, H., Tong, X., Liu, Q., Wang, X., Liu, W.: The issues of prestack
reverse time migration and solutions with graphic processing unit implementation.
Geophys. Prospect. 60(5), 906–918 (2012)
Lorenzon, A.F., Beck Filho, A.C.S.: Parallel computing hits the power wall: principles,
challenges, and a survey of solutions. Springer Nature (2019)
Lukawski, M.Z., et al.: Cost analysis of oil, gas, and geothermal well drilling. J. Petrol.
Sci. Eng. 118, 1–14 (2014)
Navaux, P.O.A., Lorenzon, A.F., da Silva Serpa, M.: Challenges in high-performance
computing. J. Braz. Comput. Soc. 29(1), 51–62 (2023)
Okamoto, T., Takenaka, H., Nakamura, T., Aoki, T.: Accelerating large-scale simu-
lation of seismic wave propagation by multi-GPUs and three-dimensional domain
decomposition. Earth Planets Space 62(12), 939–942 (2010)
Padoin, E.L., Pilla, L.L., Boito, F.Z., Kassick, R.V., Velho, P., Navaux, P.O.: Eval-
uating application performance and energy consumption on hybrid CPU+GPU
architecture. Clust. Comput. 16, 511–525 (2013)
Papadrakakis, M., Stavroulakis, G., Karatarakis, A.: A new era in scientific computing:
domain decomposition methods in hybrid CPU-GPU architectures. Comput. Methods
Appl. Mech. Eng. 200(13), 1490–1508 (2011)
Pavan, P.J., Serpa, M.S., Carreño, E.D., Martínez, V., Padoin, E.L., Navaux, P.O.A.,
Panetta, J., Méhaut, J.-F.: Improving performance and energy efficiency of geo-
physics applications on GPU architectures. In: Meneses, E., Castro, H., Barrios
Hernández, C.J., Ramos-Pollan, R. (eds.) High Performance Computing: 5th Latin
American Conference, CARLA 2018, Bucaramanga, Colombia, September 26–28,
2018, Revised Selected Papers, pp. 112–122. Springer International Publishing,
Cham (2019). https://doi.org/10.1007/978-3-030-16205-4_9
Pearson, C., Hidayetoğlu, M., Almasri, M., Anjum, O., Chung, I.-H., Xiong, J., Hwu,
W.-M.W.: Node-aware stencil communication for heterogeneous supercomputers. In:
2020 IEEE International Parallel and Distributed Processing Symposium Workshops
(IPDPSW), pp. 796–805. IEEE (2020)
Serpa, M., Mishra, P.: Performance evaluation and enhancement of the fletcher method
on multicore architectures (2022)
What Does a Nation-Wide Digital Nervous
System Use for an Operating System?
Abstract. Concerns over climate change and sustainable agriculture have made
nation-wide high resolution environment monitoring and modelling desirable.
Recent developments in technology have made it affordable. An environment
modelling network is a supercomputer, but not of a familiar kind. Conventional
supercomputing approaches are appropriate for the modelling aspect, but not the
monitoring aspect. While sensor networks are familiar in the Internet of Things
(IoT), geographically remote sensors without access to mains power have harsher
resource constraints than, say, internet-ready light bulbs. A “two-realm” approach
to system software is needed.
1 Introduction
Like most nations worldwide, New Zealand is highly dependent on its primary sector.
Land is a diminishing resource, and we must make better use of it and take better care of
it. The dual needs of sustainability and economic exploitation are not symbiotic, at least
in the short term, so we need to develop much more informed methods of management
which will ensure that we can preserve and improve our environment while improving
our economic exploitation of the land. But biology is complicated, variable, riddled with
uncertainty, and highly integrated. It is difficult to be cognisant of all of the moving parts
and it is very difficult to predict beyond the most immediate time period.
Operational response lies mostly with individual producers, while governments have
responsibility for creating the right conditions for sustainable economic exploitation. The
complexity draws comparison to a nervous system which responds to continuous and
varied inputs to maintain health. One of the key components of a nervous system is con-
tinuous sensory input. One of the problems the primary sector has even in a nation-wide
context is that often the sensory mechanisms become compromised, communication can
become discontinuous and input to decision making numbed.
Many of the crops and production systems used around the world are similar;
therefore, there is further utility in linking up with international partners. We have
a truly global food supply industry. If we look at crops like apples or avocados, we
can see that they are distributed around the world, but we can also see the
importance of countries like Mexico to global supply (Mexico exported 50x more
avocados than New Zealand1 in 2021) [21]. One example would be the fight
against pests and diseases: newer Deep Learning and AI models are being used to
detect defects and diagnose pest and disease problems, and these models need to
be trained with comprehensive and reliable data sets.
These data sets could be created overseas before the problem occurs in a specific country.
This type of wider diagnosis and cooperation would only work if the nervous system
within each body (country) was functional and data interoperable. Thus, this multilevel
or layering of decision making makes the need for integration of information flow to
support decision making even more critical.
2 Background
There is an acknowledged yield gap in pretty much every growing system: the
difference between what is biologically possible and what is actually achieved. The
authors suggest that it is caused by non-decisions, incomplete decisions, or late
decisions rather than wrong decisions, which would point to the fact that the
necessary information is not getting to the correct place to make decisions at the
right time. It may be that preventative measures against yield-limiting conditions
were not taken or were taken too late and the damage was done. Having better
information in the field to rapidly inform the grower is vital; this can be achieved
by having adequate and continuous sensing of the crop and a nervous system
capable of transmitting the information to the neural hub of the system.
We argue in this position paper that making decisions in a complex environment
involving, e.g., weather forecasting, climate variation, environment modification,
and biological interactions requires a significant amount of processing capability,
both distributed and centralised, fitting the definition of a supercomputer [22],
albeit not in the traditional way.
How much sensing is needed for weather/climate sensing?
Simulations and forecasts are too often based on incomplete data and assumptions
and extrapolations are common, even if satellites are covering more and more areas. A
sub-kilometre scale is presented in the Concept Paper for the Berlin Summit for EVE
(Earth Visualization Engines) held in July 2023 [23]. It says: “The demand for km-scale
simulations is rooted in the global need for local information, at greater fidelity, both to
advance scientific understanding as well as to link to impacts and better integrate local
knowledge, including observations” [24].
The same paper gives some context for our approach: “The need to integrate local
knowledge, sample uncertainty through complementary efforts, maintain access to a truly
global talent pool, and establish global legitimacy, can likely only be met through regional
or super-regional facilities [25] for instance, a lack of open observations throughout the
Global South is crucial to fill” [26].
1 Perú exported 22x more than New Zealand in 2021, and Chile and Colombia ~ 4x more each.
But models tested and used in one country could be replicated in another if they are scalable
and modular.
162 N. Erdödy et al.
3 The Supercomputer
The supercomputer we are concerned with has not been built yet. It has not been funded
yet. But some day soon we shall build it, because we’ll have to.
There are about 250,000 nodes2. Each node contains 4 cores. The cores are 32-bit
processors, running at 120 to 260 MHz depending on the model. At peak, a node draws
about 0.45 A (two CPU packages + radio or camera) at 3.7 V, so the peak power demand
is 0.42 MW. This is unambiguously a supercomputer.
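The peak-power figure can be checked with a quick back-of-the-envelope calculation using the per-node values stated above:

```python
nodes = 250_000
current_a = 0.45   # peak current per node (two CPU packages + radio/camera)
voltage_v = 3.7    # supply voltage

peak_w = nodes * current_a * voltage_v   # total peak power in watts
assert round(peak_w / 1e6, 2) == 0.42    # ~0.42 MW, as stated
```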
Given the term “supercomputer”, you have probably imagined a room, possibly a
very large room, filled with racks and cables. There are no racks, no cables, and no room. What
you must now imagine is the nodes moving apart, until they average about 1 km apart.
That 0.42 MW peak power? It has to be supplied by batteries, which means that all these
nodes have to spend most of their time powered almost completely down.
1) The sensor nodes have four 32-bit microcontroller cores performing specialised tasks:
one to manage a camera or cameras (for cloud cover, global illuminance, frost and dew,
and precipitation), one to manage encryption and the radio, one to collect temperature,
pressure, humidity, and wind data and to do overall system management, and one to
do compression, modelling, and forecasting. There will be on the order of 250,000
of these. Each node has about 1 MB of ROM, about 1 MB of RAM, and 2–8 MB of
flash. Except for lacking MMUs and FPUs, these machines are much more powerful
than those UNIX was designed for, and could run a fair approximation of Unix V7.
There are no mass stores for virtual memory, and memory protection will be done
by static analysis. Integer arithmetic is adequate for what the sensor nodes do. FPUs
and MMUs would just increase price and current drain. “Operating system” services
will be device drivers, scheduling, and communication stack.
2) The sensor nodes communicate with aggregators through a mesh network. That net-
work is mostly the sensor nodes, filled out with communication nodes which have
two 32-bit microcontroller cores: one to manage the radio and one for system man-
agement. How many are needed depends on the range of the radios in the sensor
nodes, which will change over time. They are very similar to the sensor nodes but
trade cameras for better radios.
3) The aggregators are unattended computers which receive observations from the sen-
sor nodes, communicate with local data integration and modelling centres, which
eventually provide nation-wide real-time high-resolution weather data that can be
integrated with data from other sources and used to correct forecasting models. They
receive forecasts from those models and relay near-term expectations to the sensors.
The sensors and aggregators both know what the sensors have reported and what is
expected next, permitting high compression.
Satellite estimates of things like precipitation are qualitatively good, but their quan-
titative agreement with ground truth has room for improvement. The same is true of rain
radars, which trade wide coverage for reduced accuracy [6].
A review in 2017 found that access to weather data in New Zealand is more expensive
than in many other countries. Indeed, many New Zealand farmers get their weather data
from Norway (e.g., https://www.yr.no/nb/v), which is one of the most open countries. We
estimate that the sensor and communicator nodes can be built for about USD 65 each.
USD 18 million sounds like a lot of money for a network of tiny autonomous weather
stations, but it is 38 times cheaper than launching our own weather satellite would be4 .
Each node will be available, providing data around the clock, easy to reach and cheap
to replace.
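The cost comparison can be checked in the same back-of-the-envelope spirit; the split of the USD 18 million between sensor and communication nodes is not given in the text, so only the totals are checked.

```python
unit_cost = 65                   # USD per node (from the text)
network_cost = 18e6              # USD, total network estimate (from the text)
satellite_cost = 290e6 + 400e6   # USD, satellite + launch (footnote 4)

# The 250,000 sensor nodes alone come to USD 16.25 million, leaving room
# for the communication nodes within the USD 18 million estimate.
assert 250_000 * unit_cost == 16.25e6
assert round(satellite_cost / network_cost) == 38   # "38 times cheaper"
```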
It’s not just New Zealand that has a use for such a network. Any nation or region
that is seriously concerned about whether and how the climate is changing and what to
do about it needs a good grasp on what is actually happening, and there is no substitute
for actual measurement.
With hindsight, this nation-wide digital nervous system is an obvious answer to the
needs of climate monitoring, precision agriculture and potentially earthquake predic-
tion5 . It is only within the last few years that it has become feasible to build sensor
nodes cheap enough yet capable enough to be useful. This needed advances in micro-
controllers, low power radio communication, and sensor technology. If we don’t build
such a system, someone else will. The issues for operating systems will remain.
7 Two Realms
Many people are working on Internet of Things/Edge Computing, and one would expect
that this would be a solved problem. For example, one might turn to the Linux Foun-
dation Edge consortium [7]. There [8] we read that “eKuiper is an edge lightweight
IoT data analytics / streaming software implemented by Golang, and it can be run on
all kinds of resource-constrained edge devices.” Promising, very promising. “Features:
Lightweight, Core service package is only about 4.5 M, initial memory footprint is
10 MB”. 10 MB? That’s not lightweight, that’s huge! What happened to “can be run on
resource-constrained devices”?
The six layer architecture divides into two realms, with one of the layers on the border
between them. The “low realm” (layers 1 and 2 and in some ways 3) is very different
from the “high realm” (layers 4–6 and in some ways 3). A lot of “edge” computing is
actually pretty far from the edge. A device in a building with access to mains power
is living in the high realm, a very different world from the low realm, where a device
lives that is outside, nailed to a fencepost in a remote area, desperately trying to
save battery power. The analogue of “edge” computing in L2L is the aggregators.
4 USD 290 million for the satellite, USD 400 million for the launch, according to
https://science.howstuffworks.com/satellite10.htm.
5 “Movement along the Alpine Fault (in New Zealand), with its powerful uplift along the Southern
Alps over millions of years, forms the geological foundations for our beautiful South Island
and the stunning landscape we call home. The more we understand our natural environment
and the forces that shape it, the better prepared we can be” [27].
A. Electrical power
High realm: mains power or own generator. Reducing energy use helps to keep prices
down. Low realm: batteries + small solar panels. Keep energy use way down or it doesn’t
work at all.
B. Flexibility and maintainability
High realm: the computer should be able to do many things and be able to switch
between them. Software changes fast. Patches are installed often. Low realm: the node
has a fixed set of tasks, which changes very seldom. It makes economic sense to use
formal methods such as SPARK/Ada [9, 10] and FRAMA-C [11, 12] to ensure that
arithmetic errors, pointer errors, stack overflows, and scheduling errors will not occur.
Over-the-Air updates should be rare, but bugs happen. The software that runs in the
sensor nodes must be structured so that updates are more like hot-loading a replacement
module in an Erlang [13] program (which keeps right on running) than stopping a
process and starting a new one in Unix. Updates must, in short, be small so that they can
be transmitted and acted on without interruption to normal service.
C. Heterogeneity
Within a sensor node, each core has a different task. There is no particular benefit in
all the cores having the same hardware architecture. Limited memory means that each
core should store only the drivers for the devices attached to it and only the software
(such as network stack, image processing, modelling, compression or whatever) relevant
to its task. It also means that memory should be managed no later than system build time.
This leaves mainly scheduling and core-to-core communication for a “common OS” to
do.
The aggregators will be just sufficiently powerful single board computers, and it
is here that we might expect to see synergy with Linux Foundation Edge. Fledge [14]
in particular, with its focus on integrating and processing sensor data and forwarding
the results to various destinations, looks attractive. EVE [15] on the other hand, with
its support for “Docker containers, Kubernetes clusters and virtual machines”, is over-
engineered for the needs of this layer. In this layer there will be tens of thousands of nodes,
so keeping price and power consumption down are still issues. This means just enough
computer, and just enough operating system, to do the aggregation and forwarding job.
As a nation-wide system will not be built all at once, and as it will need incremental
maintenance/replacement, we must expect a mix of different hardware architectures and
different capabilities.
The human interface layer takes us from grass [16] to GRASS [17]. For these systems
unmodified Linux, macOS, and Windows are perfectly adequate. These devices will
typically be owned by the people who use them.
Layers 5 and 6 are already familiar to the audience.
If there is no one kernel running at all levels that can be pointed to as “the OS”, what
can we mean when we talk about “an OS for a nation-wide supercomputer”?
The answer is that the systems are unified by their common task so that the collection
of modules and services that enable the nodes to collaborate in their shared task is “the
OS”. A large part of this is communication support.
D. Communication is costly
Operating the radio is a major power drain for a sensor node. In a mesh where some of
the nodes act as intermediaries between aggregators and other sensor nodes that are out
of range, those intermediaries cannot afford to keep their receivers on at all times in case
some neighbour has something to say. Most communication needs to be on a schedule
where the radios are on at known times just enough to communicate. This means that
keeping the sensor nodes associated with an aggregator tolerably well synchronised with
it and each other is important for power economy. It is also useful because clock drift
can be a sign of other problems.
Communication between aggregators and the geographical information layer is also
difficult. Not depending on mobile phone networks or high-speed broadband is important:
not even fully developed countries have homogeneous broadband coverage within their
boundaries, and then there are the geographic obstacles (mountains, lakes,
jungle, etc.) to consider. Once again, communicating on a regular schedule, using compression to
keep data volumes down, and keeping software updates small and rare are important.
As we climb to higher layers, more, and more affordable, communication bandwidth
is available, but greater volumes of data need to be transmitted. The two-way flow of
information, with expectations flowing down and observations flowing up, means that
only the surprise in the observations (the differences between what is observed and what
is expected) need to be forwarded, not the raw observations.
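A minimal sketch of this "forward only the surprise" idea. The tolerance, the helper name, and the toy readings are our own illustration, not values from the text:

```python
# Forward an observation up-level only when it deviates from the model's
# expectation by more than a tolerance; everything else is implied by the
# expectation and need not be transmitted. Tolerance value is an assumption.

def surprises(observed, expected, tolerance=0.5):
    """Return (index, observed, delta) for readings that deviate enough."""
    out = []
    for i, (o, e) in enumerate(zip(observed, expected)):
        delta = o - e
        if abs(delta) > tolerance:
            out.append((i, o, delta))
    return out

obs = [14.1, 14.3, 17.9, 14.2]   # toy temperature readings
exp = [14.0, 14.2, 14.4, 14.3]   # what the model expected
print(surprises(obs, exp))       # only the third reading is surprising
```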
Even messaging is heterogeneous, however. AMQP [18] is well suited for the aggregators
on up, using an implementation such as RabbitMQ. Communication is so costly and
the sensor nodes so constrained that specialised, special-purpose lightweight protocols
are needed at that level.
E. Failure
Suppose the sensor nodes are so well built that the probability that a specific one fails
on any given day is 1/3653 (that is, one failure in ten years). Then on any given day, out of 250,000 nodes, 68 ± 8 will fail.
These failures will be scattered all over the nation, and it will be impractical to fix them all
on the same day. This is going to happen to any large scale sensor network. The data that
a failed node would have reported is irretrievably lost. However, for an isolated failure,
an aggregator may be able to “fill in” estimated data using geostatistical [19] techniques
such as kriging. (You would be astonished at how much on-the-ground weather data is
estimated.) The good news is that with multiple sensors (such as temperature at worm
height, at sheep height, at cow height, and at head height) failure will often be partial,
so some data may be available from a “failed” node.
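The failure arithmetic above can be checked directly, taking the stated rate of one failure per node in ten years (about 1/3653 per day):

```python
# Expected daily failures across the fleet, with the ~Poisson spread.
# Rate and fleet size come from the text above.
import math

p_fail = 1 / 3653        # probability a given node fails on a given day
nodes = 250_000

expected = nodes * p_fail       # mean failures per day
spread = math.sqrt(expected)    # approximate Poisson standard deviation

print(f"{expected:.1f} +/- {spread:.1f} failures per day")   # 68.4 +/- 8.3
```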
Data reported up-level from an aggregator needs to be tagged according to
its reliability: actual measurements, corrected measurements (allowing for drift), or
estimates.
Planning the process of repairing or replacing failed nodes is a well understood
operations management task.
F. Ownership and authentication
168 N. Erdödy et al.
8 Summary
The operating system for a nation-wide supercomputer is a set of modules (such as con-
ventional operating systems) and services communicating through messaging protocols
that authenticate identity and location of sensors, respect rights to data, and ensure that
quality-tagged reliable data are passed around while staying within resource limits. This
requires machine learning and AI at the edge to cope with failure as well as to exploit
success. A “two-realm” approach to system software is presented, starting with the
sensing capabilities and moving up to the processing capabilities that establish the
proposed architecture.
References
1. National Institute of Water and Atmospheric Research (NIWA), Instrument Systems. https://niwa.co.nz/our-services/instruments/
instrumentsystems/products/climate-stations, (Accessed 08 2022)
2. CIA World Factbook, 2022, entry New Zealand. https://www.cia.gov/the-world-factbook/cou
ntries/new-zealand/#geography, (Accessed 08 2022)
3. CIA World Factbook, 2022, entry United Kingdom. https://www.cia.gov/the-world-factbook/
countries/united-kingdom/#geography, (Accessed 08 2022)
4. Cancillería de Ecuador (2023). https://www.cancilleria.gob.ec/bolivia/wp-content/uploads/
sites/22/2021/07/ECUADOR.pdf, (Accessed 07 2022)
5. Slingo, J., et al.: Ambitious partnership needed for reliable climate prediction. Nat. Climate
Change 12 (2022)
6. Schleiss, M., et al.: The accuracy of weather radar in heavy rain: a comparative study for
Denmark. Hydrol. Earth Syst. Sci. 24(6) (2020). https://hess.copernicus.org/articles/24/3157/
2020/
7. Linux Foundation Edge home page. https://www.lfedge.org, (Accessed 08 2022)
8. Linux Foundation Edge “eKuiper project”. https://www.lfedge.org/projects/ekuiper,
(Accessed 08 2022)
9. Barnes, J.: SPARK: The Proven Approach to High Integrity Software. Altran Praxis (2012)
10. McCormick, J.W., Chapin, P.C.: Building High Integrity Applications with SPARK 2014.
Cambridge University Press (2015)
11. Baudin, P., et al.: The dogged pursuit of bug-free C programs: the Frama-C software analysis
platform. Commun. ACM 64(8) (2021)
12. Kirchner, F., Kosmatov, N., Prevosto, V., Signoles, J., Yakobowski, B.: Frama-C: a software
analysis perspective. Formal Aspects Comput. 27(3) (2015)
13. Armstrong, J.: Programming Erlang: Software for a Concurrent World. 2nd edn. Pragmatic
Programmers (2013)
14. Linux Foundation Edge Fledge project. https://www.lfedge.org/projects/fledge/, (Accessed
08 2022)
15. Linux Foundation Edge EVE project. https://www.lfedge.org/projects/eve, (Accessed 08
2022)
16. Champion, P.D., James, T., Popay, I., Ford, K.: An Illustrated Guide to Common Grasses,
Sedges and Rushes of New Zealand. The New Zealand Plant Protection Society (2012)
17. Neteler, M., Mitasova, H. (eds.): Open Source GIS. Springer US, Boston, MA (2008). https://
doi.org/10.1007/978-0-387-68574-8
18. OASIS, Advanced Message Queuing Protocol (AMQP) Version 1.0 (2012). http://docs.oasis-
open.org/amqp/core/v1.0/amqp-core-complete-v1.0.pdf
19. Kitanidis, P.K.: Introduction to Geostatistics: Applications in Hydro-geology. Cambridge
University Press (1997)
20. Gupta, U.C., Gupta, S.C.: Selenium in soils and crops, its deficiencies in livestock and humans:
Implications for management. Commun. Soil Sci. Plant Anal. 31(11–14), 1791–1807 (2008)
21. World Bank - World Integrated Trade Solution: Avocados, fresh or dried exports by country
in 2021. http://wits.worldbank.org/trade/comtrade/en/country/ALL/year/2021/tradeflow/Exp
orts/partner/WLD/product/080440, (Accessed 07 2023)
22. Supercomputing. https://www.ibm.com/topics/supercomputing, (Accessed 07 2023)
23. Conveners of the Berlin Summit for EVE (Earth Visualization Engines) (June 5 2023). https://
owncloud.gwdg.de/index.php/s/rNWYNJSdJ19iwbJ, (Accessed 07 2023)
24. Ibid., p. 8
25. Ibid., p. 9
26. Ibid., p. 15
27. Sage. https://sagecontinuum.org/, (Accessed 07 2023)
28. Alpine Fault Magnitude 8 (AF8). https://af8.org.nz/, (Accessed 07 2023)
The Impact of CUDA Execution
Configuration Parameters
on the Performance and Energy
of a Seismic Application
Institute of Informatics, Federal University of Rio Grande do Sul, Porto Alegre, Brazil
{bsschussler,phcrigon,aflorenzon,asc,navaux}@inf.ufrgs.br
1 Introduction
Geophysical exploration methods are fundamental for humanity as they explore
essential resources for the economic development of countries, such as oil and
gas. However, investigating new oil reservoirs often involves destructive practices,
such as drilling in sensitive areas and improper waste disposal. Applications
that perform seismic image simulation for oil detection have been developed to
reduce these environmental impacts and increase drilling accuracy. Since these
applications usually involve the computation of a huge amount of data and are
naturally parallel, GPUs (graphics processing units) are widely used to accelerate
such applications [Hanindhito et al., 2022].
GPUs are powerful SIMD (single instruction, multiple data) architectures
that usually consist of thousands of processing cores [Hennessy and Patterson,
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024
C. J. Barrios H. et al. (Eds.): CARLA 2023, CCIS 1887, pp. 170–183, 2024.
https://doi.org/10.1007/978-3-031-52186-7_12
– By correctly defining the number of blocks and threads per block, performance
can be improved by up to 50% and energy consumption reduced by up to
18% compared to the standard way the application is executed on the
GPU.
– For small grid sizes, a lower number of threads per block delivers better results
because it improves resource utilization, increases SM occupancy, and reduces
memory contention. Conversely, a larger number of threads per block becomes
better as the grid size grows, owing to the increased available parallelism, more
efficient memory access, and better load balancing.
2 Background
2.1 Graphic Processing Units
NVIDIA GPUs have an internal structure consisting of processing units known
as streaming multiprocessors (SMs) [Hennessy and Patterson, 2011]. The number
of SMs varies depending on the GPU architecture. For example, GPUs based
on the Volta architecture, such as the NVIDIA Tesla V100, have 80 SMs, while
those based on the Pascal architecture, like the NVIDIA Tesla P100, have 56
SMs. Within each SM, there are processing cores called CUDA cores. These
CUDA cores are responsible for executing instructions in parallel. They enable
the GPU to perform massively parallel processing and accelerate computations.
The fundamental units of execution on GPUs are threads. Each thread executes
a specific task or a portion of a computation that can be executed concurrently
with other threads.
Threads are logically organized into blocks to facilitate cooperation, commu-
nication, and synchronization among them using shared memory regions. This
organization into blocks helps manage the execution and coordination of threads
within an SM. In this context, each block is assigned to an SM. When the GPU
executes a kernel, it maps different blocks to available SMs, distributing the
workload across the SMs for parallel execution. The mapping of blocks to SMs
is done dynamically by the GPU hardware. Furthermore, each thread within a
block is individually assigned to a specific CUDA core within the corresponding
SM. This assignment allows the threads to be executed in parallel by utilizing
the available CUDA cores within the SM. By leveraging the parallelism offered
by the multiple SMs and CUDA cores, GPUs can efficiently execute large-scale
computations by dividing the workload into numerous threads, blocks, and SMs,
thereby achieving significant acceleration in performance compared to traditional
CPU architectures.
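The mapping just described can be emulated on the CPU. The helper below is our own illustration of how CUDA's built-in indices (blockIdx, blockDim, threadIdx) combine into a flat global index for a 1D launch:

```python
# CPU-side emulation of CUDA's 1D global thread index:
# global_id = blockIdx.x * blockDim.x + threadIdx.x

def global_thread_ids(grid_dim, block_dim):
    """Yield (block, thread, global_id) exactly as a 1D CUDA launch would."""
    for b in range(grid_dim):          # blockIdx.x
        for t in range(block_dim):     # threadIdx.x
            yield b, t, b * block_dim + t

ids = [g for _, _, g in global_thread_ids(grid_dim=3, block_dim=4)]
print(ids)   # [0, 1, 2, ..., 11]: every grid point gets exactly one thread
```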
Each SM typically contains a group of CUDA cores, caches, warp schedulers,
and shared memory. These components work together to enable efficient parallel
formations, locate potential oil reservoirs, and assess the viability of hydrocarbon
exploration or other geological investigations.
The algorithm that Fletcher implements is based on the numerical solution of
the wave equation. It is a partial differential equation that considers the envi-
ronment’s elastic properties (e.g., the propagation velocity of the wave) and is
represented in a three-dimensional grid. The wave propagation process is itera-
tive, where in each iteration, the algorithm calculates the approximate solution
of the wave equation at each grid point, considering the information from pre-
vious iterations. During propagation, the wave energy spreads and changes as
it interacts with the heterogeneities of the environment, updating the values at
each grid point and allowing the algorithm to model seismic waves’ reflection,
refraction, and diffraction as they propagate underground.
The CUDA implementation of the Fletcher method employs a three-
dimensional grid to represent the wave propagation in the environment. The
dimensions of this data structure are defined by the inputs of the application
(sx, sy, and sz, as shown in Fig. 1), which refer to the sizes of the x-, y-, and z-axes
of the grid, respectively. The algorithm adds 40 positions in each dimension to
this value to help during the computation and edge exchange. In this scenario, an
input set of sx = sy = sz = 56 will allocate a grid of 96 elements in each dimen-
sion. For this grid to be computed by CUDA threads in parallel throughout the
execution of the application, a 2D decomposition of the domain approach is used.
It consists of obtaining a two-dimensional plane and making a cut in the volume
(for example, in the x and y dimensions of the grid). Then, the algorithm can
iterate along the third direction, in this case, represented by the z dimension.
In this scenario, this two-dimensional (x,y) plane can be divided into blocks of
CUDA threads, where the user in the algorithm defines the number of threads
per block in the x and y domains.
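The decomposition amounts to a little integer arithmetic. The sketch below uses the 40-point halo from the text and reproduces the sx = sy = sz = 56 case; the function and variable names are our own illustration:

```python
# Launch-configuration arithmetic for the 2D domain decomposition described
# above. The 40-point halo comes from the text; helper names are assumptions.
import math

def launch_config(s, bsize_x, bsize_y, halo=40):
    """Grid extent per axis, blocks per (x, y) plane, and threads per block."""
    n = s + halo                                       # e.g. 56 + 40 = 96
    blocks = (math.ceil(n / bsize_x), math.ceil(n / bsize_y))
    return n, blocks, bsize_x * bsize_y

n, blocks, tpb = launch_config(56, 32, 16)
print(n, blocks, tpb)   # 96 (3, 6) 512
```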
Therefore, the number of blocks needed to compute the grid may be determined
by dividing the grid size by the number of threads per block in each dimension.
Moreover, the number of threads per block is calculated by multiplying the
number of CUDA threads created in each dimension (BsizeX, BsizeY ). For
example, for a grid with dimensions 96 × 96 × 96, if the number of CUDA
threads in the x and y domains equals 32 and 16, respectively,
In this section, we discuss the works that have studied the performance and
energy consumption of geophysical applications running on GPU architectures.
They are discussed in chronological order.
[Michéa and Komatitsch, 2010] discusses the influence of kernel configuration
on the performance of a three-dimensional finite-difference wave propagation
code. However, the number of threads per block was fixed while the number of
blocks was varied. Different from this work, our research simultaneously varies
the number of blocks and threads per block to analyze the impact of this variation
as the size of the propagation grid changes.
[Páez et al., 2020] evaluate the performance of two strategies (1D and 2D
decomposition) for implementing elastic modeling using different kernel con-
figurations. When using the 1D layout, only one big block containing all the
processing threads is defined. On the other hand, the 2D layout allows working
with larger blocks, while the number of threads per block is a multiple of warp
size (e.g., 32). In addition to this work, our research addresses a two-dimensional
layout, allowing the software developer to vary the number of threads per block
and the number of blocks in each dimension.
[Alkhimenkov et al., 2021] exploit GPUs in the propagation of seismic waves
in fluid-saturated porous environments. The research discusses how the number
of threads per block in the x, y, and z domains impacts the effective memory
transfer rate (MTP) of the numerical application Biot 3D. Fifteen different block
combinations were analyzed for a fixed resolution of 576 × 576 × 576. Compared
to this proposal, our paper analyzes the influence of the number of threads per
block on performance and energy consumption, besides exploring different grid
dimensions, as opposed to the fixed resolution of 576 used by [Alkhimenkov et
al., 2021].
[Sanchez-Noguez et al., 2022] evaluate different block sizes in 3D and 2D
kernels to compute 3D and 2D arrays, respectively, aiming to study the impact
of shared memory usage on performance. [Serpa and Mishra, 2022] explore the
optimization of the Fletcher method with a focus on portability, analyzing the
performance and energy consumption of eight code versions. Unlike these two
works, our paper addresses the optimization of the Fletcher method by exploring
different kernel execution parameters on GPUs and providing guidelines to end-
users so they can get better GPU usage regardless of the input set.
176 B. S. Schussler et al.
Table 1. Specifications of the experimental platform

Processor specification
  Processor     Intel Xeon E5-2686 v4
  Architecture  Broadwell
  Model         Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30 GHz, 64 vCPUs
  Memory        1 MiB L1d, 1 MiB L1i, 8 MiB L2, 90 MiB L3

GPU specification
  GPU           NVIDIA Tesla V100-SXM2
  Architecture  Volta
  Chip          GV100
  Registers     256 KB/SM, 20480 KB/GPU
  Memory        4096-bit HBM2, 16 GB, 6144 KB L2 cache
3 Methodology
The experiments were performed in a heterogeneous architecture as depicted in
Table 1. The host is an Intel Xeon CPU E5-2686 with 488 GB of main memory;
and the GPU is an NVIDIA Tesla V100 SXM2 with 16GB of main memory,
80 SMs, and 5120 CUDA cores. The Fletcher modeling application was com-
piled with CUDA version 12.0, driver v.525.85.12, and GCC v.9.4.0. We have
considered the following parameters: sixteen different grid dimensions, ranging
from 24 × 24 × 24 to 504 × 504 × 504 in intervals of 32 in each axis; fifteen
combinations of the number of threads in the x and y dimension, ranging from
4 × 4 to 32 × 32.
In the next section, we analyze the performance and energy consumption of
all executions. The performance metric considers the number of grid points cal-
culated per second during the execution, represented as MSamples/s. Therefore,
the higher this value, the better the performance. On the other hand, the energy
consumption was obtained directly from the GPU through the nvidia-smi (NVIDIA
System Management Interface) command-line tool. In this scenario, the lower the
value, the better. Each combination of the input set (grid dimension) and con-
figuration of the number of threads in each dimension was executed ten times,
and the results consider the average with a standard deviation lower than 0.5%.
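The size of this experimental campaign, and the MSamples/s metric, can be sketched as follows. The counts come from the text; the elapsed time in the example is an invented placeholder:

```python
# Campaign size: sixteen grid dimensions x fifteen block shapes x ten runs.
# MSamples/s: million grid-point updates per second over a whole execution.

def msamples_per_s(nx, ny, nz, timesteps, elapsed_s):
    """Million grid-point updates per second (higher is better)."""
    return nx * ny * nz * timesteps / elapsed_s / 1e6

grid_dims = list(range(24, 505, 32))   # 24, 56, ..., 504 -> sixteen sizes
thread_shapes = 15                     # combinations, 4x4 up to 32x32
repetitions = 10

total_runs = len(grid_dims) * thread_shapes * repetitions
print(total_runs)                      # 2400 executions in total

# e.g. a hypothetical 96^3 grid advanced 1000 time steps in 4.4 s:
print(f"{msamples_per_s(96, 96, 96, 1000, 4.4):.0f} MSamples/s")
```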
for software developers and end-users so they can optimize the execution of the
Fletcher Method based on our findings.
Fig. 2. Performance and Energy Consumption results for each configuration of the
number of blocks in x and y.
consumption. However, it also delivers the worst performance. The energy spent
by this configuration is about 14% lower than the baseline, but its performance
loss is more significant, around 60%. On the other hand, the 32 × 8 configuration
achieved the highest performance, being approximately 12% better than the
baseline, with a slight variation concerning energy consumption, around 1%.
Compared to the best configuration for energy (8 × 8), the 32 × 8 achieved 75%
higher performance, spending approximately only 12% more energy.
Three combinations are worth mentioning when considering the results achie-
ved with the grid size equal to 440 × 440 × 440, shown in Fig. 2(e). If the user
wants to optimize energy consumption, the 4 × 4 configuration delivers better
results, spending about 18% less energy than the baseline. On the other hand, if
the objective is to get the best possible performance, the ideal configuration is
32 × 8, providing 15% more performance than the baseline. However, if the user
Table 2. Best configuration found by the DSE for each grid dimension w.r.t. the energy
and performance
Fig. 3. Performance and energy results of the best and worst configuration for each
grid size normalized to the baseline.
We have found that for small grid sizes, a small number of threads per block is
capable of delivering better performance and energy consumption in most cases.
The following reasons can be listed: (i) improved resource utilization: with a
small input set, using a small number of threads per block allows for better use
of registers and shared memory, which reduces the resource contention among
threads and enables more threads to be scheduled on an SM simultaneously. (ii)
increased occupancy (i.e., the ratio of active warps to the maximum possible
warps that can be executed concurrently on an SM): with a small number of
threads per block, each thread block occupies fewer resources, leaving room for
more thread blocks to be launched, allowing the SM to hide memory latency by
switching between different thread blocks. (iii) reduced memory contention: the
memory footprint of the computation is smaller with a small input set. Hence,
using a small number of threads per block can reduce memory contention because
the threads are accessing a smaller portion of the global memory simultaneously,
resulting in fewer memory conflicts and improved memory access efficiency.
On the other hand, a larger number of threads per block may deliver better
performance as the grid size increases due to the following reasons: (i) increased
parallelism available: Larger grid sizes often require more parallelism to uti-
lize the computational power of the GPU fully. Hence, by using a larger num-
ber of threads per block, one can increase the total number of threads running
concurrently, enabling more parallel computations. (ii) efficient memory access:
by using a larger number of threads per block, one can exploit better memory
access patterns. That is, multiple threads can access memory in a coalesced man-
ner, fetching contiguous memory locations efficiently, which may reduce mem-
ory latency and improve memory throughput. (iii) load balancing: using a larger
number of threads per block allows for finer granularity in workload distribution.
In this scenario, each thread can process a smaller portion of the grid, enabling
better load balancing among threads. This can help avoid scenarios where a few
threads are overloaded while others remain idle, leading to better performance.
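How block size interacts with per-SM limits can be illustrated with a toy occupancy model. The limits below (2048 resident threads, 32 resident blocks, 64K registers per SM) match common NVIDIA architectures such as Volta, but both the limits and the per-thread register counts are assumptions for illustration; real occupancy also depends on shared memory and warp granularity:

```python
# Toy occupancy estimate: how many blocks fit on one SM, and what fraction
# of the SM's thread slots they fill. All limit values are assumptions.

def occupancy(threads_per_block, regs_per_thread,
              max_threads=2048, max_blocks=32, regs_per_sm=65536):
    by_threads = max_threads // threads_per_block
    by_regs = regs_per_sm // (regs_per_thread * threads_per_block)
    blocks = min(by_threads, by_regs, max_blocks)
    return blocks * threads_per_block / max_threads   # fraction of slots used

print(occupancy(16, 64))    # small blocks: capped by the resident-block limit
print(occupancy(256, 64))   # larger blocks: capped by the register file
```

The point of the sketch is that the binding resource limit changes with block size, which is why no single configuration wins for all grid sizes.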
Defining the number of blocks and threads per block according to the CUDA
application is challenging, as it directly impacts the performance and energy con-
sumption of the GPU system. Hence, in this work, we have performed extensive
experiments over a real-world seismic application to find configurations of blocks
and threads per block that optimize its performance and energy consumption.
When evaluating different configurations over sixteen distinct grid dimensions,
we have shown that no unique configuration can deliver the best outcome for
all grid sizes. We have also shown that performance can be improved by up to 2x
compared to the default way the application is implemented by carefully selecting
the ideal configuration. Furthermore, significant energy reductions were achieved
by defining the ideal number of blocks and threads per block.
In summary, it is worth mentioning that the optimal number of threads
per block can depend on various factors, such as GPU architecture, memory
requirements, and the nature of the computation. Therefore, it is recommended
to experiment and profile the application with different thread block sizes to find
the optimal configuration for your specific scenario. In this scenario, as future
work, we intend to implement a heuristic to automatically define the number of
blocks and threads per block according to the grid dimension and optimization
objectives (e.g., performance, energy, or the trade-off between them).
Acknowledgment. This work has been partially supported by Petrobras under num-
ber 2020/00182-5, by the call CNPq/MCTI/FNDCT - Universal 18/2021 under grants
406182/2021-3, and by the Coordenação de Aperfeiçoamento de Pessoal de Nível Supe-
rior - Brazil (CAPES) - Finance Code 001.
References
Alkhimenkov, Y., Räss, L., Khakimova, L., Quintal, B., Podladchikov, Y.Y.: Resolving
wave propagation in anisotropic Poroelastic media using graphical processing units
(GPUs). J. Geophys. Res. Solid Earth 126, e2020JB021175 (2021)
Fletcher, R.P., Du, X., Fowler, P.J.: Reverse time migration in tilted transversely
isotropic (TTI) media. Geophysics 74(6), WCA179–WCA187 (2009)
Hanindhito, B., Gourounas, D., Fathi, A., Trenev, D., Gerstlauer, A., John, L.K.:
GAPS: GPU-acceleration of PDE solvers for wave simulation. In: Proceedings of the
36th ACM International Conference on Supercomputing, ICS 2022, New York, NY,
USA. Association for Computing Machinery (2022)
Hennessy, J.L., Patterson, D.A.: Computer Architecture: A Quantitative Approach.
Elsevier (2011)
Lorenzon, A.F., Beck Filho, A.C.S.: Parallel Computing Hits the Power Wall.
SCS, Springer, Cham (2019). https://doi.org/10.1007/978-3-030-28719-1
Michéa, D., Komatitsch, D.: Accelerating a three-dimensional finite-difference wave
propagation code using GPU graphics cards. Geophys. J. Int. 182(1), 389–402 (2010)
Navaux, P.O.A., Lorenzon, A.F., da Silva Serpa, M.: Challenges in high-performance
computing. J. Braz. Comput. Soc. 29(1), 51–62 (2023)
NVIDIA: Nvidia dgx-1 with tesla v100 system architecture, Technical white paper
(2017)
Páez, A., Sánchez, I.J., Ramírez, A.B.: Computational strategies for implementation
of 2D elastic wave modeling in GPU. Entre Ciencia e Ingeniería 14, 52–58 (2020)
Sanchez-Noguez, J., Couder-Castañeda, C., Hernández-Gómez, J.J., Navarro-Reyes,
I.: Solving the heat transfer equation by a finite difference method using multi-
dimensional arrays in CUDA as in standard C. In: Gitler, I., Barrios Hernández,
C.J., Meneses, E. (eds.) CARLA 2021. CCIS, vol. 1540, pp. 221–235. Springer, Cham
(2022). https://doi.org/10.1007/978-3-031-04209-6_16
Sanders, J., Kandrot, E.: CUDA by Example: An Introduction to General-purpose
GPU Programming. Addison-Wesley Professional (2010)
Serpa, M., Mishra, P.: Performance evaluation and enhancement of the fletcher method
on multicore architectures. Int. J. Res. Publ. Rev. 3, 2649–2655 (2022)
Yuan, Y., Shi, F., Kirby, J.T., Yu, F.: FUNWAVE-GPU: multiple-GPU acceleration of
a Boussinesq-type wave model. J. Adv. Model. Earth Syst. 12(5), e2019MS001957
(2020). https://doi.org/10.1029/2019MS001957
High-Performance Computing
for Astrophysical Simulations
and Astroparticle Observations
1 Introduction
remarkable accuracy and detail. This has led to significant advancements in our
understanding of the universe, from the smallest particle interactions to the
largest scale of cosmic structure and evolution.
Several numerical codes have been developed to model various astrophysical
scenarios, ranging from large-scale problems such as galaxy formation to small-
scale phenomena such as turbulence in the interstellar medium. These codes can
be classified according to their discretisation approach, with grid-based codes
(e.g., Pencil [11], Flash [6], Ramses [16], and Pluto [10]) and particle-based
(or N-body) codes (e.g. Gadget [15], Phantom [12]) as the two common cat-
egories. These codes are often coupled with radiative transfer, gravity, nuclear
physics, and general relativity to study a wide range of phenomena, includ-
ing stellar structure, planet formation, stellar and galactic evolution, and high-
energy events such as supernovae, active galactic nuclei (AGN), and gamma-ray
bursts (GRBs).
In a recent work [4], we have employed HPC resources to model the evolution
of magnetic field configurations in the interior of convective and stratified stars.
To do this, we adapted the Pencil Code¹ [11], which is a high-level finite-
difference numerical code for modelling compressible hydrodynamic fluids with
magnetic fields and particles. The code is written primarily in Fortran and runs
efficiently in parallel on shared or distributed memory computers. Its modular
structure allows it to easily adapt to simulate different problems, supporting
Cartesian, cylindrical and spherical geometries.
On the other hand, by analysing astroparticle observations, scientists can
refine and validate the accuracy of astrophysical simulations, leading to a deeper
understanding of cosmic phenomena and enabling more accurate predictions
of astroparticle processes. For example, ground-based detection of secondary
particles allows the study of transient events such as GRBs and/or Forbush
decays [17], as well as practical applications such as muography [8]. These studies
require a detailed understanding of how secondary particles are produced in the
atmosphere and reach specific locations.
The Latin American Giant Observatory (LAGO) focuses on detecting back-
ground radiation for studying astroparticles and geophysical phenomena. LAGO
comprises an extended network of Water Cherenkov Detectors across the con-
tinent, covering a wide range of geomagnetic rigidity cutoffs and atmospheric
absorption depths. A comprehensive computational framework that considers
the influence of the geomagnetic field on the propagation of Galactic Cosmic
Rays (GCRs) has been developed to estimate the expected signals at the LAGO
detector sites. This framework consists of a collection of individual tools, collectively
known as ARTI².
In this article, we present a comprehensive overview of the main applications
developed by our group using HPC resources. Section 2 is dedicated to stellar
simulations with the Pencil code, while Sect. 3 is dedicated to astroparticle
¹ https://github.com/pencil-code/pencil-code.git.
² https://github.com/lagoproject/arti.
186 L. M. Becerra et al.
simulations with ARTI. Section 4 describes the measures taken to improve the
reproducibility of our analyses. Finally, in Sect. 5, we give some final remarks.
Fig. 1. Left: Magnetic energy decay of a purely Ohmic mode (l = 1 and n = 1). Right:
Snapshot of the magnetic field at the end of the simulation. The field lines correspond
to the poloidal component of the magnetic field, and the surface contours to the toroidal
one. We simulate with the Pencil Code in a Cartesian box with an equally-spaced
grid.
The outer diffusivity, η_o, is 10³ times greater than the inner one, η_i. The transition
zone connecting the star’s interior with the atmosphere has a width of
Δr ≈ 0.3R.
To test the Pencil Code for star-in-a-box simulations and the performance
of the magnetic diffusivity given in Eq. (5), we first solve the induction equation
neglecting the advection term:
\[ \frac{\partial \mathbf{B}}{\partial \tau} = -\nabla \times \mathbf{j} = \nabla^{2}\mathbf{B} \tag{6} \]

with τ = ηt. This equation has an analytical solution whose toroidal and poloidal components are

\[ B_{\phi}(r,\theta) = \sum_{l,n=1}^{2} \left( A_{ln}\, j_{l}(k_{ln} r) + B_{ln}\, y_{l}(k_{ln} r) \right) P_{l}^{1}(\cos\theta)\, e^{-k_{ln}^{2}\tau} \tag{8} \]

\[ A_{\phi}(r,\theta) = \sum_{l,n=1}^{2} \left( C_{ln}\, j_{l}(\omega_{ln} r) + D_{ln}\, y_{l}(\omega_{ln} r) \right) P_{l}^{1}(\cos\theta)\, e^{-\omega_{ln}^{2}\tau} \tag{9} \]
where j_l and y_l are the spherical Bessel functions of the first and second kind,
respectively, and P_l^1 are the associated Legendre polynomials. The constants
A_{ln}, B_{ln}, C_{ln}, D_{ln}, k_{ln} and ω_{ln} are determined by the boundary conditions:
regularity at the centre of the star requires B_{ln} = D_{ln} = 0, while the
continuity condition of the magnetic field across the star’s surface gives
ω_{11} = 3.14 and k_{11} = 4.49 for the n = 1, l = 1 mode.

Table 1. Hypatia cluster timing for the Pencil Code evolving stellar magnetic fields
We follow the evolution of the magnetic field for about one τ time. From
the analytical solution, the magnetic energy of the toroidal component scales
as e^{-2k_{11}^2 τ} and that of the poloidal component as e^{-2ω_{11}^2 τ}. Figure 1 shows good agreement
between the simulation and the analytical model over time. This confirms the
validity and effectiveness of employing the high magnetic conductivity atmo-
sphere in star-in-a-box simulations. Additionally, Becerra et al. [4] confirmed that
for sufficiently large box sizes (L > 3R_s), the evolution of the magnetic field in
the stellar interior remains unaffected by the periodic boundary conditions at
the box’s sides.
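The mode constants quoted above can be checked numerically: 3.14 and 4.49 coincide with the first zeros of the spherical Bessel functions j₀ and j₁ (a plain-Python check; the connection to the boundary conditions follows the text):

```python
# Locate the first positive zeros of the spherical Bessel functions j0 and j1
# by bisection; they match the quoted omega_11 ~ 3.14 and k_11 ~ 4.49.
import math

def j0(x):  # spherical Bessel function of the first kind, l = 0
    return math.sin(x) / x

def j1(x):  # spherical Bessel function of the first kind, l = 1
    return math.sin(x) / x**2 - math.cos(x) / x

def bisect(f, a, b, tol=1e-10):
    """Root of f in [a, b], assuming a sign change on the bracket."""
    fa = f(a)
    while b - a > tol:
        m = 0.5 * (a + b)
        if fa * f(m) <= 0:
            b = m
        else:
            a, fa = m, f(m)
    return 0.5 * (a + b)

print(round(bisect(j0, 3.0, 3.5), 2), round(bisect(j1, 4.0, 5.0), 2))  # 3.14 4.49
```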
Fig. 2. Left: mean time spent in each time step, per number of points and processors,
as a function of the fraction of effective points in each processor over the total
points of the simulation. Right: speed-up of the parallelization of the Pencil Code
on the Hypatia cluster.
are utilized. This trend seems to be the same for the 256³ resolution, but when
64 processors were employed, the code’s performance did not improve.
The right panel of Fig. 2 shows the speed-up achieved through parallelization
as a function of the number of processors. This quantity is calculated as the time
the code takes to complete 100 iterations on a single processor divided by the time
taken when using multiple processors. For both resolutions, the code exhibits
speedup benefits up to the utilization of 32 processors.
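The speed-up and parallel efficiency definitions used here are simple ratios; the timings below are invented placeholders, not measurements from the Hypatia runs:

```python
# Speed-up S(p) = T(1) / T(p); efficiency E(p) = S(p) / p.
# All timing values are hypothetical, for illustration only.

def speedup(t_serial, t_parallel):
    return t_serial / t_parallel

def efficiency(t_serial, t_parallel, p):
    return speedup(t_serial, t_parallel) / p

t1 = 120.0                                # seconds for 100 iterations, 1 processor
timings = {8: 18.0, 32: 6.0, 64: 6.1}     # hypothetical parallel timings

for p, tp in timings.items():
    print(p, round(speedup(t1, tp), 1), round(efficiency(t1, tp, p), 2))
```

In this toy data the speed-up stops growing past 32 processors, mirroring the saturation reported in the text.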
3 Astroparticle Applications
Atmospheric particle flux simulations require a significant number of simulated
particles. This number increases as the simulation progresses. For example, simu-
lating just one minute of flux introduces about 100,000 primary protons into the
atmosphere. As these primary particles interact with the atmosphere, they cre-
ate hundreds of millions of secondary particles. To get a realistic and statistically
meaningful representation, we must simulate at least 40 million primary protons,
equivalent to five hours of flux. In this case, high-performance computing (HPC)
is the only viable option for these simulations.
We used the ARTI [14] code to estimate the secondary particle flux at each
LAGO site and to simulate the neutron flux. ARTI is a complete framework
designed to simulate the signals produced by the secondary particles emerg-
ing from the interaction of single, multiple, and even from the complete flux of
primary cosmic rays with the atmosphere. These signals are simulated for any
particle detector located anywhere (latitude, longitude and altitude), includ-
ing the real-time atmospheric, geomagnetic and detector conditions. Formulated
through a sequence of codes written in C++, Fortran, Bash and Perl, it pro-
vides an easy-to-use integration of standard astroparticle simulation environ-
ments. ARTI supports different cluster architectures and distributed computing
solutions, such as those based on grids and on federated or public cloud implementations [13]. In the following section, we present the results for both applications.
Fig. 3. Energy spectrum of secondary particles at three different study sites: Buenos
Aires (19 m a.s.l), Bucaramanga (956 m a.s.l), and Berlin (3450 m a.s.l.). The total spec-
trum of particles produced as they reach the ground is shown in black. The electromag-
netic component comprises gamma photons (blue), electrons, and positrons (yellow).
The muon component is shown in green, and the protons in purple. The neutrons, which
have a cut of about 300 MeV applied by CORSIKA to optimize the computation time,
are shown in magenta.
ARTI uses CORSIKA [7] to evaluate the particles produced by the inter-
action of each GCR with the atmosphere. In these simulations, each secondary
particle is tracked up to the lowest energy threshold (Es ) allowed by CORSIKA,
which depends on the type of the secondary particle. Currently, these thresholds
are Es ≥ 5 MeV for muons and hadrons (excluding pions) and Es ≥ 5 keV for
electrons, pions and gamma photons. As the atmospheric profile is a key factor
for the production of secondary particles and a parameter for CORSIKA, we set
MODTRAN atmospheric profile models [9] according to the geographical position of the LAGO sites (see [14] and references therein for a detailed description
of this method).
Figure 3 shows examples of the results for the obtained spectra of each type
of secondary particle at Buenos Aires, Argentina (19 m a.s.l), Bucaramanga,
Colombia (956 m a.s.l) and Berlin, Colombia (3450 m a.s.l.). These sites are
located at different altitudes, with a difference of up to 2,000 m. These altitude
variations are reflected in the total flux, with the difference between Buenos
Aires and Berlin being almost one order of magnitude. The neutron component
in these spectra is shown in magenta, and as can be seen, CORSIKA imposes a
cutoff of about 300 MeV to optimize computational time. We use Geant4 [1] to
simulate the last 2 km of particle trajectories to capture the low-energy neutron
component. Geant4 is a software package that simulates the interaction between
radiation and matter. A more detailed discussion of this phase follows in the
next section.
As shown in [13], the generation of such simulations, when stored, transforms
into synthetic data with versatile applications for various research objectives. A
machine with 128 cores and 1 TB of RAM generated the spectra shown in
Fig. 3. ARTI, a software optimized for parallel processing, efficiently divides
these simulations among up to 120 cores, generating 120 binary files. Under
these conditions, simulating one hour of cosmic ray flux takes 1.5 h of computing
192 L. M. Becerra et al.
time. In contrast, using a desktop machine with eight cores and 16 GB of RAM,
the simulation time increases to approximately 24 h. We simulated 12 h of cosmic
ray flux for each site in this case.
Fig. 4. Energy spectrum of neutrons produced by the interaction of cosmic rays within
the atmosphere for three locations in South America. This spectrum was generated
with Geant4.
Table 2. Ratio of the flux at each altitude, taking the sea-level flux as reference.
4 Reproducibility Considerations
To improve the reproducibility of our analyses, we employ several strategies,
including the use of Docker4 and Singularity containers to preserve the compu-
tational environment. By encapsulating our software, dependencies and configu-
rations within these containers, we ensure that the exact same environment can
be replicated regardless of the underlying operating system or computing infras-
tructure. This eliminates compatibility issues and minimises the risk of software
version conflicts, allowing other researchers to reproduce our simulations easily.
We also emphasise the use of open-source software tools based on version control
systems. For example, the Pencil code we use for stellar simulations is licensed
under the GNU General Public License version 2, while ARTI, our
astroparticle simulation framework, is licensed under the 3-Clause BSD License. These
open licences promote transparency and allow researchers to access, study, mod-
ify and redistribute the software, facilitating the replication and validation of
our analyses.
In addition to containerisation and open software, our approach also aims to
facilitate the findability, accessibility, interoperability and reuse (FAIR) [18] of
4
https://hub.docker.com/u/lagocollaboration.
our digital assets. We assign persistent identifiers, such as DOIs, to our datasets
and code repositories, making them easy to find. Our open-access policy ensures
the accessibility of our research results, allowing other researchers to reproduce
and build on our work. By adhering to standard file formats, data structures
and interfaces, we promote interoperability with other tools and facilitate the
integration of our analyses into larger scientific workflows. Finally, by openly
sharing our data, code and methods, we encourage their reuse, enabling other
researchers to validate our findings and explore new research directions.
Overall, our comprehensive approach to reproducibility, which includes con-
tainerisation, open software and adherence to the FAIR principles, ensures that
our analyses can be accurately replicated, validated and extended by the scien-
tific community. By promoting transparency, accessibility and compatibility, we
contribute to the robustness and reliability of scientific research, foster collabo-
ration and advance knowledge in our field.
5 Remarks
In this paper, we have discussed two particular case studies of HPC in astrophysical and astroparticle simulations. HPC provides the computational power
to model complex astrophysical systems and validate theoretical models against
observational data.
In the first application, we used the Pencil Code, a high-order finite-difference numerical code, to simulate the evolution of magnetic field configurations in convective and stratified stars. We tested the performance of these
simulations on the Hypatia cluster and concluded that the code demonstrates
improved speedup when utilizing up to 32 processors.
We also discussed the importance of astroparticle observations in refining
and validating astrophysical simulations. Ground-based detection of secondary
particles is crucial in studying transient events and applications such as muon
radiography. Collaboration between astrophysicists, computational scientists and
observational researchers is essential to advance our knowledge of the Universe.
HPC resources, sophisticated numerical codes and astroparticle observations
enable us to tackle challenging problems and significantly advance astrophysics.
In the future, we aim to refine our simulations by incorporating additional
physics and considering more complex astrophysical scenarios. We also look
forward to exploring new avenues of research that harness the power of HPC and
astroparticle observations to deepen our understanding of cosmic phenomena.
call 890 of 2020, managed through the ICETEX contract 2022-0718. The computations
presented in this paper were performed on the Hypatia cluster at the Universidad de los
Andes and the Guane cluster at the Universidad Industrial de Santander, both located
in Colombia. These HPC clusters were accessed through the LaRedCCA initiative of
the National Academic Network of Advanced Technology, RENATA.
References
1. Agostinelli, S., Allison, J., Amako, K., Apostolakis, J., et al.: Geant4 - a simu-
lation toolkit. Nucl. Instrum. Methods Phys. Res. Sect. A: Acceler. Spectromet.
Detect. Associat. Equip. 506(3), 250–303 (2003). https://doi.org/10.1016/S0168-
9002(03)01368-8
2. Asorey, H., Dasso, S., Núñez, L.A., Pérez, Y., Sarmiento-Cano, C., Suárez-Durán,
M., the LAGO Collaboration: The LAGO space weather program: directional geomagnetic effects, background fluence calculations and multi-spectral data analysis.
In: The 34th International Cosmic Ray Conference, vol. PoS(ICRC2015), p. 142
(2015)
3. Asorey, H., Núñez, L., Suárez-Durán, M.: Preliminary results from the Latin Amer-
ican giant observatory space weather simulation chain. Space Weather 16(5), 461–
475 (2018)
4. Becerra, L., Reisenegger, A., Valdivia, J.A., Gusakov, M.E.: Evolution of random
initial magnetic fields in stably stratified and barotropic stars. Mon. Not. R. Astron.
Soc. 511(1), 732–745 (2022). https://doi.org/10.1093/mnras/stac102
5. Braithwaite, J., Nordlund, Å.: Stable magnetic fields in stellar interiors.
Astron. Astrophys. 450(3), 1077–1095 (2006). https://doi.org/10.1051/0004-6361:
20041980
6. Fryxell, B., et al.: FLASH: an adaptive mesh hydrodynamics code for modeling
astrophysical thermonuclear flashes. Astrophys. J. Suppl. 131(1), 273–334 (2000).
https://doi.org/10.1086/317361
7. Heck, D., Knapp, J., Capdevielle, J.N., Schatz, G., Thouw, T.: CORSIKA: a Monte
Carlo code to simulate extensive air showers (1998)
8. Jourde, K., et al.: Monitoring temporal opacity fluctuations of large structures
with muon radiography: a calibration experiment using a water tower. Sci. Rep.
6(23054) (2016). https://doi.org/10.1038/srep23054
9. Kneizys, F.X., Abreu, L.W., Anderson, G.P., Chetwynd, J.H., et al.: The MOD-
TRAN 2/3 report and LOWTRAN 7 model. Tech. Rep. (1996). https://web.gps.
caltech.edu/~vijay/pdf/modrept.pdf
10. Mignone, A., et al.: PLUTO: a numerical code for computational astrophysics.
Astrophys. J. Suppl. 170(1), 228–242 (2007). https://doi.org/10.1086/513316
11. Pencil Code Collaboration, Brandenburg, A., et al.: The Pencil Code, a modu-
lar MPI code for partial differential equations and particles: multipurpose and
multiuser-maintained. J. Open Source Softw. 6(58), 2807 (2021). https://doi.org/
10.21105/joss.02807
12. Price, D.J., et al.: Phantom: a smoothed particle hydrodynamics and magnetohy-
drodynamics code for astrophysics. Publ. Astron. Soc. Austral. 35, e031 (2018).
https://doi.org/10.1017/pasa.2018.25
13. Rubio-Montero, A.J., Pagán-Muñoz, R., Mayo-García, R., Pardo-Diaz, A., Sidel-
nik, I., Asorey, H.: The EOSC-synergy cloud services implementation for the Latin
American giant observatory (LAGO). arXiv preprint arXiv:2111.11190 (2021)
14. Sarmiento-Cano, C., et al.: The arti framework: cosmic rays atmospheric back-
ground simulations. Eur. Phys. J. C 82, 1019 (2022). https://doi.org/10.1140/
epjc/s10052-022-10883-z
15. Springel, V., Pakmor, R., Zier, O., Reinecke, M.: Simulating cosmic structure for-
mation with the GADGET-4 code. Mon. Not. R. Astron. Soc. 506(2), 2871–2949
(2021). https://doi.org/10.1093/mnras/stab1855
16. Teyssier, R.: Cosmological hydrodynamics with adaptive mesh refinement. A new
high resolution code called RAMSES. Astron. Astrophys. 385, 337–364 (2002).
https://doi.org/10.1051/0004-6361:20011817
17. Usoskin, I.G., et al.: Forbush decreases of cosmic rays: energy dependence of the
recovery phase. J. Geophys. Res.: Space Phys. 113(A7) (2008). https://doi.org/
10.1029/2007JA012955
18. Wilkinson, M.D., et al.: The FAIR guiding principles for scientific data manage-
ment and stewardship. Sci. Data 3(1), 160018 (2016). https://doi.org/10.1038/
sdata.2016.18
Improvement of the Simulation
of the Degradation of Reinforced Concrete
in Saltwater Environments Using Directives
1 Introduction
The corrosion of the steel structures that are part of reinforced concrete is
caused by oxygen and humidity, together with the presence of free chlorides in
the surrounding environment. Seawater has a high concentration of dissolved salts
and represents a severe threat to reinforced concrete because it promotes and
accelerates corrosion.
In coastal areas, the sea breeze carries moisture laden with chlorides, so even
structures that are not in direct contact with the sea begin to suffer from the
action of chlorides.
Reinforced concrete is one of the most used materials in construction today due to its
structural properties, low cost, and outstanding durability. The alkaline character
of the concrete pore solution keeps the reinforcement steel in a passive state, so
its corrosion rate is almost nil. However, this rate increases when aggressive agents
from the environment, such as chloride ions and carbon dioxide, penetrate the
concrete, leading to the corrosion of the steel structures that reinforce it.
In a marine environment, the leading cause of deterioration of reinforced concrete
structures is corrosion, which is initiated by chloride ions due to the exposure of these
structures to seawater, sea breezes, or the use of aggregates contaminated with salts.
Once the chloride ions reach a critical concentration on the surface of the steel,
the passive protection layer loses its stability and corrosion begins. Among the
notable deficiencies of concrete are its low tensile strength and its porosity; the
latter is a product of the hydration reaction between cement and water. Once the
concrete hardens, the remaining water leaves pores of different sizes, interconnected
with each other, which facilitate the transport of fluids, gases, and other chemical
substances that are highly detrimental to reinforced concrete, since they allow the
deterioration of the steel structures it contains to begin (Fig. 1).
Through this simulation, it is possible to analyze and experiment with what happens
inside the reinforced concrete, thus minimizing this corrosion phenomenon by changing
specific design and manufacturing parameters. Given that many variables and a large
amount of information are handled in this simulation, it is necessary to use multiple
GPU architectures to provide the researcher with the best results in the shortest possible
time.
Corrosion: The chemical reaction produced by the union of a metal with oxygen.
It is the deterioration observed in a metallic object due to an electrochemical
process of oxidative character. The rate of degradation of the material depends on
its exposure to the oxidizing agent, the temperature, whether it is exposed to
saline solutions, and the chemical properties of the metal. The process is
spontaneous, and it can also occur in materials that are not metallic [4].
Reinforced Concrete: Concrete reinforced with steel bars or meshes, although
plastic fibers, fiberglass, steel fibers, or combinations of steel bars with fibers
are also used, depending on the requirements to which the structure will be
subjected. It is used in all types of constructions, such as buildings, bridges,
dams, tunnels, and industrial works [5].
Durability of Concrete: It is defined as the ability to resist weathering, chemical attack,
abrasion, or any other process or service condition of structures that produce the concrete
deterioration. Several factors affect the durability of reinforced construction, such as the
quality of its components, dosage of each part, mix for a sufficient time to obtain a
homogeneous material, correct placement of steel structures, and suitable compaction
to avoid segregation and porosity [6].
Deterioration of Concrete: Physical, chemical, or physicochemical factors lead to the
reinforcements’ corrosion. In any of these categories, the influence of the components
of the concrete and its geographical location on the structure’s useful life is recognized.
The physical mechanisms of deterioration are associated with the dissolution of pulp
compounds in the medium, with loss of mass, increased porosity, and a drop in resistance.
The chemical mechanisms correspond to the exchange of ions of the pulp with the
medium, giving rise to compounds that are soluble or not but of a non-expansive nature
that eventually cause similar effects to the physical ones. The physicochemical combines
the two concepts, giving rise to the formation of expansive-type compounds that cause
internal tensions and lead to cracking and possible component disintegration [7].
Attack by Chlorides: Chloride can penetrate the reinforced concrete from the external
environment to the interior, generating corrosion in the reinforcements by combining
various transport mechanisms such as ionic diffusion, water absorption, water flow, and
chloride ion dispersion, or by an effect of external electric potential.
Carbonation of Concrete: Concrete is a very porous material, which allows CO2
from the air to penetrate its interior. When this happens, the CO2 reacts with
the calcium hydroxide of the concrete and the hydrated compounds of the cement,
forming calcium carbonate, which lowers the pH to values below nine
[8].
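The carbonation reaction described above can be written out; a standard form, in our notation, is:

```latex
\[
\mathrm{Ca(OH)_2} + \mathrm{CO_2} \longrightarrow \mathrm{CaCO_3} + \mathrm{H_2O},
\]
```

which consumes the hydroxide reserve that keeps the pore solution alkaline, explaining the drop of the pH below nine.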
Improvement of the Simulation of the Degradation of Reinforced Concrete 215
This paper presents a diffusion model that considers the different factors
affecting the corrosion initiation rate of the reinforcement structure, such as the
water-cement ratio, temperature, density, and chloride binding capacity of the
concrete. Since all these variables involve a significant amount of information,
it becomes necessary to use computational architectures based on multiple GPUs
to obtain better results in shorter times. This section presents the physical
problem from the material's point of view. The second section shows the
mathematical model, the simulation, and the results. Finally, conclusions and
further work are presented.
Simulation of the Prediction of the Useful life of Concrete: The corrosion of the
reinforcement due to the penetration of chloride ions and the carbonation of concrete
is a major problem that reduces the durability of reinforced concrete structures. Once
the chloride concentration around the surface of the steel reinforcement exceeds a
certain limit, or the pH value of the concrete pore solution decreases to a
threshold value due to the carbonation reaction, the steel reinforcement undergoes
depassivation, followed by metallic corrosion [10-12].
Carbonation Process: Concrete carbonation is a complex physical and chemical
process. In our model, this process is divided into four parts: (1) carbon dioxide
transport, (2) mass balance of dissolved calcium hydroxide, (3) dissolution of
solid calcium hydroxide into the concrete pore solution, and (4) chemical
reaction of CSH with carbon dioxide. The
governing equations can be given by:
where Ctc is the total chloride content in a unit of concrete volume (mol/m3 of concrete)
and Cfc is the content of free chloride ions (mol / m3 of pore solution). Cbc is the content
of bound chloride (mol/m3 of concrete). A part of the bound chloride can participate in
the chemical reaction shown in Eq. (3), releasing free chloride ions.
The amount of chloride attached depends on the concentration of free chloride in the
pore solution and the degree of carbonation because we consider carbonation. Therefore,
the instantaneous variation of the total chloride can be expressed as:
where ∂Cbc/∂Cfc is an isotherm between the bound chloride and the free chloride,
and ∂Cbc/∂αc can also be described with an isotherm in which carbonation should
be considered.
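A plausible written-out form of this balance, under the assumption (ours, for illustration) that w_e denotes the pore-solution volume fraction, so that w_e C_fc expresses the free chloride per unit concrete volume, is:

```latex
\[
C_{tc} = w_e\,C_{fc} + C_{bc},
\qquad
\frac{\partial C_{tc}}{\partial t}
  = \left(w_e + \frac{\partial C_{bc}}{\partial C_{fc}}\right)
    \frac{\partial C_{fc}}{\partial t}
  + \frac{\partial C_{bc}}{\partial \alpha_c}\,
    \frac{\partial \alpha_c}{\partial t},
\]
```

with α_c the degree of carbonation; the two partial derivatives of C_bc are the isotherms mentioned above.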
Having presented the mathematical model that describes the degradation of
reinforced concrete in saltwater environments, it is possible to propose an
algorithm for its implementation, as shown in the following section.
The codes were developed using parallelization techniques for the simulation
algorithm, based on the chloride diffusion model combined with the diffusion of
carbonates in reinforced concrete exposed to a marine environment, for execution
on architectures based on multiple GPUs. Code development was performed by
scientists of the Advanced Computing and Large Scale Group, CAGE (from the
Spanish acronym of Computación Avanzada y de Gran Escala), and the Corrosion
Research Group, GIC (from the Spanish acronym of Grupo de Investigación en
Corrosión).
The algorithm in Fig. 2 shows the parallel regions in the workflow after the data
input, calculating the current chloride and carbonate concentration and verifying the
concentrations. In the workflow, visualization is also presented, and, using some specific
functions with the directives, the simulation provides information about the process, as
seen later in Fig. 4.
The main function, which calculates the diffusion of the chloride ion in the
reinforced concrete using Fick's second law, was parallelized. For this purpose,
the "pragma acc kernels" directive was used, which tells the compiler to generate
parallel accelerator kernels (CUDA kernels in our case) for the loop nests that
follow the directive (see Fig. 3). The algorithm itself is independent of the
implementation mechanism and execution support. Still, given the need to
accelerate the execution of the simulation and the mathematical properties of the
algorithm, we concentrate on an implementation aimed at massively parallel
machines based on general-purpose CPU/GPU platforms.
218 F. A. Mejía et al.
An important aspect to consider is that, for this project and as the proposed
algorithm shows, we seek to accelerate the visualization of the phenomenon, in
this case the advance of the chloride-ion concentration in the reinforced
concrete; the visualization of the concentrations is therefore critical1 .
The implementation mechanism of the algorithm mainly used C and directives using
OpenACC in a homogeneous code. All functions and open-source code are available
in an open repository [18].
Figure 3 shows a fragment of the parallel code of the function to highlight the use of
the pragmas and the placement in the structure of the code. It is important to note the use
of collapse on the loop level since this will allow the compiler to use multi-dimensional
blocks.
Considering that about five to seven runs were performed for each of the thirteen
simulations presented (to obtain the respective values), a very good standard
deviation was obtained (with very little variability, between 0.3 and 0.5).
However, as explained below, the runs must not be confused with the iterations
performed.
1 Another important aspect to observe, for computational reasons, was the set of
performance measures concerning resource balancing, acceleration, and execution
time, visible in the proposed algorithm in Fig. 2.
The visualization of the vtk files generated by the algorithm can be seen as a
movie using the ParaView software,2 as seen in Fig. 4. Fundamentally, the
visualization highlights colors according to the key information needed to
observe the phenomenon and understand the degradation; the color differences
thus provide information about the advance of the chloride-ion concentration in
the reinforced concrete.
The simulation was executed as a sequential solution using only CPUs and as a
parallel solution on a CPU/GPU platform, initially with one GPU and then with two
GPUs, obtaining the results shown in Table 1. This configuration corresponds to
one of the GUANE-1 supercomputer node types. The presented results are in terms
of processing
2 ParaView is an open-source, multi-platform data analysis and visualization tool. It is widely
used in various fields, including the materials and condensed-matter physics community. More
information: https://www.paraview.org/
time in seconds for each of the solutions, considering the following computer
architecture elements:
Sequential Running: Intel Xeon CPU E5645 @ 2.40GHz (12 Cores), RAM 104 GB.
Linux operating system, CENTOS distribution. Compiled in GCC.
Parallel CPU/GPU: Intel Xeon CPU E5645 @ 2.40 GHz (12 Cores), RAM 104 GB,
NVIDIA GPU Tesla 12 GB GDDR5. Linux operating system, CENTOS Distribution.
Compiled in NVIDIA NVC (to exploit OpenACC directives suite).
The first column of Table 1 shows the size of the cube used for the analysis (256
× 256 × 400, 512 × 512 × 400, and 1024 × 1024 × 400). The second column presents
the number of iterations of the simulation (8000, 16000, 32000, and 64000); the
results are averages over five to seven executions. The last three columns
organize the results of the performed simulations: the third column for the
CPU-only results, and the fourth and fifth columns for the results using one GPU
and two GPUs, respectively.
When using two GPUs, the results were almost equal to twice the acceleration
obtained for the single-GPU version analyzed previously. Hence, we determine that
the algorithm is scalable and accelerates the processing as the number of GPUs
used increases.
The acceleration obtained for each cube size analyzed for the multi-GPU algorithm
is shown in Fig. 6. Comparing the results of Figs. 5 and 6, we observe a
near-linear acceleration that is independent of the change in dimension and is
sustained as the problem scales, as stated above. When using one GPU, the
acceleration obtained is about 7X, and with two GPUs it is about 12X.
Notwithstanding the presented results, it is important to note that not all code
can be easily parallelized and accelerated on a GPU using directives. Some
algorithms and computations may require more fine-grained control and
optimization using low-level GPU programming techniques. Furthermore, the
performance benefits of GPU acceleration using directives may vary depending on
the specific code and hardware configuration; it is important to profile and
optimize the code to achieve the best performance on the target GPU architecture.
For this simulation, the configuration of the support platform and the elements
organized in the compute node allow the focus to be on processing acceleration.
For example, the available RAM allows the dimensions of the experiment to be
increased easily; the limits used were due to the level of detail required by the
experiment on the degradation of reinforced concrete in saltwater locations, such
as the columns of a pier.
Figure 7 shows the execution times obtained for both the sequential version and
the parallel version using OpenACC directives. The times of the parallel version
are much smaller, and the difference with respect to the serial version increases
as the size of the simulated data cube grows.
The OpenACC directives facilitate the programming of the GPU: with a few lines of
code, and without worrying about the location of the data, they allow the program
to be parallelized on the GPU. These directives are added to the code reasonably
easily and quickly. Still, they do not allow the programmer to directly control
the hardware on which the code is executed, so the program can behave differently
than expected; in addition, not directly handling the details of the CPU-GPU
communication can cause the optimization to deteriorate performance [9]. The
OpenACC directives are based on #pragma compiler instructions following the syntax:
#pragma acc directive-name [clause-list] new-line
These directives apply to the code block immediately following the #pragma tag.
PGI supports the following OpenACC directives:
Kernels Construct. Defines the program region that must be compiled in a sequence
of kernels for execution in the accelerator device.
Data Directive. Defines data, usually arrays, that must be allocated in the
device memory for the duration of the data region, and whether the data must be
copied from the host to the device memory when entering the region and from the
device to the host memory when leaving the region.
Host Data Construct. Makes the device's data address available on the host.
Loop Directive. Describes the type of parallelism used to run the loop and
declares private and reduction variables and arrays within the loop. It applies
to the loop that appears on the line following the directive.
Combined Parallel and Loop Directive. A shortcut for specifying a Loop Directive
nested immediately within a Kernels Directive; the meaning is identical to an
explicit Kernels region containing a Loop Directive.
Cache Directive. Specifies the array elements or sub-arrays to be fetched into
the highest level of the cache for the body of a loop. It should appear at the
top, inside the loop.
Declare Directive. Specifies that an array or arrays should be allocated in the
device's memory for the duration of the implicit data region of a function,
subroutine, or program, and whether the data values will be transferred from the
host to the device memory when entering the implicit data region and from the
device to the host memory when exiting it.
Update Directive. Used while the accelerator data are valid, to update all or
part of the host memory array with the corresponding array values in the
device's memory, or to update all or part of the device memory array with the
corresponding array values in the host memory.
Routine Directive. Tells the compiler to compile a procedure for both the
accelerator and the host. In a file or routine with a procedure call, the Routine
Directive describes the attributes of the procedure when it is called on the
accelerator.
The directives used are documented in detail in the code provided at [18]. In
these experiments to improve the simulation of the degradation of reinforced
concrete in saltwater environments, achieving GPU acceleration through directives
simplifies the coding. The directives provide a high-level approach to
References
1. Li, L., Page, C., Wang, Y.: Modelling of chloride ingress into concrete from the saline
environment, pp. 1573–1582 (2005)
2. Velázquez Gonzalez, R.: Electrochemical evaluation of the corrosion grade in reinforcing
steel in the presence of admixtures. Portugaliae Electrochimica Acta 23, 179–194 (2005)
3. Nvidia Corporation. ¿Qué es el GPU Computing acelerado?. http://www.nvidia.es/object/gpu
computing-es.html
4. Mejía de Gutiérrez, R.: Durabilidad y Corrosión en Materiales Cementicios. Universidad del
Valle, Cali, Colombia. Cyted, pp. 85–115 (1999)
5. Del Valle Moreno, A., Pérez López, T., Martínez Madrid, M.: El Fenómeno de la Corrosión
en Estructuras de Concreto Reforzado. Publicación Técnica No. 182, Sanfandila, Qro, pp. 33–
50 (2001)
6. Fontana, M.G.: Corrosion Engineering. McGraw-Hill, New York, 556 pp. (1986)
7. Mindess, S., Young, J.F.: Concrete. Prentice Hall, New Jersey (1981)
8. Grupo de Investigación en Corrosión, UIS: Desarrollo metodológico electroquímico de la
corrosividad de estructuras de concreto sometidas a los ambientes marinos de las costas del
pacífico colombiano. Bucaramanga, Colombia, pp. 15–18 (2009)
Improvement of the Simulation of the Degradation of Reinforced Concrete 225
1 Introduction
Engineering and science problems use matrix operators or large matrices that are
usually symmetric. The main transformations that apply to these operators are
QR factorization for the resolution of systems of equations (which are generally
LU, Givens, Householder or Cholesky factorization) (Sameh and Kuck 1978;
Cosnard et al. 1986).
QR factorization is a basis for solving systems of equations (Bowgen and Modi 1985) and is also very useful for computing other quantities such as the determinant or the inverse of a matrix (Choi et al. 1996). The factorization can be carried out by various methods, including LU, Cholesky, Householder, or Givens rotations (Householder 1958; Sameh and Kuck 1978; Chen et al. 2008). The Householder transformation H is a reflection across a plane defined by a vector v. It allows many elements of a vector to be zeroed in a single transformation. The Householder transformation is characterized by its low complexity and by properties that allow the matrix Q to be rebuilt in a convenient way while R is being built.
c The Author(s), under exclusive license to Springer Nature Switzerland AG 2024
C. J. Barrios H. et al. (Eds.): CARLA 2023, CCIS 1887, pp. 197–211, 2024.
https://doi.org/10.1007/978-3-031-52186-7_15
198 J. C. Hernández-Cortés et al.
Applied to a vector x, with v a unit vector, the reflection gives

    Px = x − 2⟨x, v⟩v = x − 2vv^T x.

We can also write this transformation in matrix form, as follows:

    P = I − vv^T / H,    (1)

where H = (1/2)|v|^2 and v is a real vector. The Householder transformation matrix has the property of being both symmetric and orthogonal. The main idea of the method is to apply successive Householder reflections that reduce the matrix to tridiagonal form.
Parallel Hybrid-Heterogeneous Single Value Decomposition Factorization 199
This process is repeated k − 1 times to obtain the tridiagonal matrix, and for each Pi the dimension of the Householder matrix is decreased.
From the original paper of Householder (Householder 1958), the complexity of QR factorization is O(n^3).
The main problem addressed in this work is to carry out the bifactorization of large matrices on different GPU cards, so that the acceleration advantages of these devices are exploited.
2 Related Work
In recent decades, parallel algorithms have been developed for bifactorization; many of them do not use the Householder transformation. Several papers have been published on the parallelization of the Householder transformation in CUDA for bifactorization and other transformations.
Cosnuau (2014) implemented Householder bifactorization in order to tridiagonalize matrices. Small matrices of 128×128 were used to solve eigenvalue and eigenvector problems, since the author considered it more efficient to perform the calculations for small instances of the problem on the CPU. Lahabar and Narayanan (2009) implemented an algorithm to carry out SVD decomposition on a GPU. In order to perform the decomposition, it was necessary to apply bifactorization before diagonalizing the matrix; this bifactorization step is carried out with the Householder reflection method.
Sachdev et al. (2010) implemented SVD using bidiagonalization followed by diagonalization. This procedure was the first to be deployed on a GPU. Bidiagonalization was implemented using Householder transformations closely related to BLAS operations, and the QR algorithm was applied for diagonalization. This work outperformed an implementation using MATLAB and the Intel Math Kernel Library (MKL) LAPACK on the CPU.
The iterative method calculates a Householder matrix at each step, based on the tridiagonal form sought for a symmetric matrix. The symmetric matrix A is multiplied by this matrix so that the elements below the i-th row and the elements to the right of the i-th column become zero.
Several different implementations of this method are described in the following sections.
meaning that parallelization with this tool can result in excessive operations involving the creation and destruction of threads, which can create difficulties for massive instances of the problem. To take advantage of parallelization with OpenMP, the parallelized tasks must be of coarse granularity.
are performed on the GPU, while operations involving sending and receiving data are performed on the CPU.
Figure 3 shows the bifactorization process in the hybrid-heterogeneous implementation. In the first stage, the vector u, the i-th row of the matrix A, is sent to all processes using a broadcast operation. The slave and master nodes each calculate a part of the Householder matrix H; the part calculated by each node depends on the rank of the node and the total number of running processes. It is important to note that although MPI balances the nodes' loads, this implementation assumes that all GPUs have the same capabilities.
As shown in Fig. 3, Stage 2 starts when each process receives the vector u. The 2-norm of the vector, the dot product, and a product with a scalar are then computed. All these operations are performed on the CPU using OpenMP, since reduction operations collapse a vector into a single scalar; it is better to perform this type of operation on the CPU, as the GPU would require too many synchronizations between threads. At this stage, once each process has calculated its part of the Householder matrix, the master process sends the columns of the original matrix to each node to perform a partial multiplication with the Householder matrix.
In Stage 3, the slave processes send their solutions to the master process, which combines them into the matrix resulting from the multiplication of H and A. The combined solution is sent to all nodes as the rows of the matrix HA.
In Stage 4, as shown in Fig. 3, each process receives the rows of the matrix HA and performs a partial multiplication of the matrices HA and H. When these partial results are ready, they are sent to the master process. In the last stage, the master process unifies the partial solutions to form the matrix HAH. If the number of iterations is less than n − 1, the process iterates until the tridiagonal matrix is obtained.
3.5 Considerations
– The matrices are symmetric and stored by rows, so rows are sent instead of columns when the columns are needed by the nodes to perform the HA multiplication.
– In each iteration, ones are not explicitly added to the main diagonal of the reflection matrix as indicated by (Householder 1958); instead, the indices of the multiplying matrices are traversed accordingly.
– At the end of each iteration, a reordering of the matrix HAH is carried out, because this last multiplication divides the matrix by columns across the different nodes.
– CPU
• Model: Intel Xeon X5675
• Sockets: 2
The server also had three NVIDIA GPU cards, as shown in Table 1:
– CPU
• Model: Intel i7 6700HQ
• Frequency: 3.5 GHz
• Cache Memory: 6 MB
• Cores: 4 physical, 4 virtual
– RAM memory: 8 GB
n      Time (ms)
       One Thread   Two Threads   Four Threads   Six Threads
128    819          413           214            150
256    19325        9755          4997           3514
512    330515       165750        84017          57793
1000   17046837     8482279       4137022        2761388
n      Speedup
       One Thread   Two Threads   Four Threads   Six Threads
128    1            1.98          3.82           5.46
256    1            2.25          4.42           4.98
512    1            2.17          4.25           5.19
1000   1            2.009         4.12           6.17
[Figure: execution time (·10^6 ms) versus matrix size (128 to 1,000) for the GTX 960M and Tesla cards.]
Fig. 4. Householder bifactorization execution time on GTX 960M and Tesla 2070.
Table 5. Execution time for Householder bifactorization runs on two and three GPU cards.
[Figure: execution time (·10^8 ms) versus matrix size (1,000 to 4,000) for two and three GPUs.]
Fig. 5. Execution time for PahHousholder using two and three GPUs.
The bifactorization process speeds up with the number of GPUs in roughly the same proportion observed in the OpenMP experiments. The cards do not have the same characteristics, and this affects the experiment's performance in different ways, as shown in Table 4. Table 4 shows that the same implementation has a faster execution time on the Tesla card than on the GTX card.
Figure 6 shows a comparison of the accelerations obtained in the different experiments. The runtime decreases when more GPUs are used, although when the program executes on a single card, the experiment on the CPU is faster.
[Figure 6: execution time (ms, log scale) versus matrix size for Sequential, OpenMP 4 Threads, OpenMP 6 Threads, GPU, Two GPUs, and Three GPUs.]
As the number of GPU cards increases, the time required to exchange data
between the MPI processes increases. However, the time for communication
between CPU and GPU decreases, as shown in Fig. 7.
[Figure 7: communication time (·10^5 ms) versus matrix size (1,000 to 4,000) for MPI exchanges and CPU-GPU transfers with two and three GPUs.]
Figure 8 shows the data exchange times between the CPU and GPU, together with the number of communications between processes and the kernel execution time for a 4000 × 4000 matrix. We note that the kernel runtime decreases when communication between processes increases and the exchange time between CPU and GPU decreases.
Conflict of Interest. The authors declare that they have no conflict of interest.
References
Bowgen, G., Modi, J.: Implementation of QR factorization on the DAP using Householder transformations. Comput. Phys. Commun. 37(1–3), 167–170 (1985)
Chen, Y., Davis, T.A., Hager, W.W., Rajamanickam, S.: Algorithm 887: CHOLMOD, supernodal sparse Cholesky factorization and update/downdate. ACM Trans. Math. Softw. (TOMS) 35(3), 22 (2008)
Choi, J., Dongarra, J.J., Ostrouchov, L.S., Petitet, A.P., Walker, D.W., Whaley, R.C.:
Design and implementation of the ScaLAPACK LU, QR, and Cholesky factorization
routines. Sci. Program. 5(3), 173–184 (1996)
Cosnard, M., Muller, J.-M., Robert, Y.: Parallel QR decomposition of a rectangular
matrix. Numer. Math. 48(2), 239–249 (1986)
Cosnuau, A.: Computation on GPU of eigenvalues and eigenvectors of a large number
of small Hermitian matrices. Procedia Comput. Sci. 29, 800–810 (2014)
Golub, G.H., Loan, C.F.V.: Matrix Computations, 4th edn. Johns Hopkins University
Press, Baltimore (2012)
Householder, A.S.: Unitary triangularization of a nonsymmetric matrix. J. ACM
(JACM) 5(4), 339–342 (1958)
Lahabar, S., Narayanan, P.: Singular value decomposition on GPU using CUDA. In:
IEEE International Symposium on Parallel & Distributed Processing, 2009. IPDPS
2009, pp. 1–10. IEEE (2009)
Leininger, M.L., Sherrill, C.D., Allen, W.D., Schaefer, H.F., III.: Systematic study of
selected diagonalization methods for configuration interaction matrices. J. Comput.
Chem. 22(13), 1574–1589 (2001)
Liu, F., Yu, C., Meng, W.: Personalized web search by mapping user queries to cate-
gories. In: Proceedings of the Eleventh International Conference on Information and
Knowledge Management, pp. 558–565. ACM (2002)
Meneses-Viveros, A., Paredes-López, M., Hernández-Rubio, E., Gitler, I.: Energy con-
sumption model in multicore architectures with variable frequency. J. Supercomput.
77, 1–28 (2020)
Osinski, S., Weiss, D.: A concept-driven algorithm for clustering search results. IEEE
Intell. Syst. 20(3), 48–54 (2005)
Press, W.H., Teukolsky, S.A., Vetterling, W.T., Flannery, B.P.: Numerical Recipes in C:
The Art of Scientific Computing, 2nd edn. Cambridge University Press, Cambridge
(1992)
Sachdev, G.S., Vanjani, V., Hall, M.W.: Takagi factorization on GPU using CUDA. In:
Symposium on Application Accelerators in High Performance Computing, Knoxville,
Tennessee (2010)
Sameh, A.H., Kuck, D.J.: On stable parallel linear system solvers. J. ACM (JACM)
25(1), 81–91 (1978)
Sunderland, A.G.: Parallel diagonalization performance on high-performance comput-
ers. In: Parallel Scientific Computing and Optimization, pp. 57–66. Springer, Hei-
delberg (2009). https://doi.org/10.1007/978-0-387-09707-7_5
Author Index

B
Barbosa-Santillán, Liliana Ibeth 197
Barrios H., Carlos J. 212
Becerra, L. M. 184
Boeres, Cristina 113

C
Carissimi, Alexandre 146, 170
Carvajal, Andres Giraldo 36
Costa, Gabriel 21

D
de Oliveira, Daniel 113
Diaz, Jose M. Monsalve 77
Dominguez, Y. 184
Drummond, Lúcia M. A. 113
Dufrechou, Ernesto 66

E
Erdödy, Nicolás 160
Ezzatti, Pablo 66

F
Fox, Dawson 77
Freire, Manuel 66

G
Gonzaga de Oliveira, Sanderson L. 66
Guaitero, Rafael A. Herrera 77

H
Hernández-Cortés, Juan C. 197
Hernández-Rubio, Erika 197

J
Jaramillo-Villegas, Jose A. 36

L
Li, Xiaoming 77
Lopes, Camila 113
Lorenzon, Arthur F. 146, 170

M
Marichal, Raul 66
Martínez-Méndez, A. 184
Mautone, Agustín 97
Mejía, Félix A. 212
Meneses, Esteban 3
Muraña, Jonathan 50

N
Navaux, Philippe O. A. 146, 170
Nesmachnow, Sergio 50, 97
Nogueira, Peterson 21
Nunes, Alan L. 113
Núñez, L. A. 184

O
O’Keefe, Richard 160

P
Padoin, Edson L. 146
Peña B., Darío Y. 212
Perdomo, Diego A. Roa 77

R
Raskar, Siddhisanket 77
Rigon, Pedro H. C. 146, 170
Ripa, Guillermo 97
S
San Martin, Daniel 131
Sánchez-Escobar, Juan J. 197
Sarmiento-Cano, C. 184
Schussler, Brenda S. 146, 170
Silva, Laian 21
Speglich, João 21

T
Torres, Claudio E. 131
Toutouh, Jamal 97

V
Vidal, Andrés 97
Villalobos, Johansell 3
Viveros, Amilcar Meneses 197

Y
Yule, Ian 160
Yviquel, Hervé 77

Z
Zuluaga-Bucheli, Hernán M. 36