Papers by Marcos Amaris González
Anais do XVI Simpósio em Sistemas Computacionais de Alto Desempenho (WSCAD 2015)
In this paper we implement an autotuner for the compilation flags of GPU algorithms using the OpenTuner framework. An autotuner is a program that finds a combination of algorithms, or a configuration of an algorithm, that optimizes the solution of a given problem instance or set of instances. We analyse the performance gained after autotuning compilation flags for parallel algorithms on three GPU devices, and show that it is possible to improve upon the high-level optimizations of the CUDA compiler. One of the experimental settings achieved a 30% speedup.
The focus of this study is on the scheduling of moldable Bulk Synchronous Parallel (BSP) tasks on cloud computing environments. "Moldable" in this context refers to the possibility of reducing the number of processors required by a BSP task and recalculating its total execution time using a particular cost model. From there, we analyze how a moldable BSP task is influenced by the unreliable behavior of clouds, simulating the execution of several BSP tasks in existing public cloud computing environments. The objective of this paper is to analyze the difference between the completion time of the last task (makespan) in a simulated setting and in real clouds, and to conclude how useful a theoretical model can be in such environments.
Concurrency and Computation: Practice and Experience, 2018
We study the problem of executing an application represented by a precedence task graph on a parallel machine composed of standard computing cores and accelerators. Both off-line and on-line settings are addressed by proposing generic scheduling approaches. In the off-line case, we establish strong lower bounds on the worst-case performance of a known approach based on Linear Programming and replace the greedy List Scheduling policy used in this approach by a better task ordering. Although this modification leads to the same approximability guarantees, it performs much better in practice. We also extend this algorithm to more types of computing units, achieving an approximation ratio which depends on the number of different types. In the on-line case, tasks arrive in any order which respects the precedence relations and the scheduler has to take irrevocable decisions about their allocation and execution. We propose the first on-line scheduling algorithm taking into account precedences, which is based on adequate rules for selecting the type of processor on which to allocate each task. Finally, all these algorithms have been evaluated in a large number of simulations built on actual libraries, which assess their good practical behavior with respect to state-of-the-art solutions and baseline algorithms.
Concurrency and Computation: Practice and Experience, 2017
A Graphics Processing Unit (GPU) is a parallel computing coprocessor specialized in accelerating vector operations. The enormous heterogeneity of parallel computing platforms justifies and motivates the development of automated optimization tools and techniques. The Algorithm Selection Problem consists in finding a combination of algorithms, or a configuration of an algorithm, that optimizes the solution of a set of problem instances. An autotuner solves the Algorithm Selection Problem using search and optimization techniques. In this paper, we implement an autotuner for the Compute Unified Device Architecture (CUDA) compiler's parameters using the OpenTuner framework. The autotuner searches for a set of compilation parameters that optimizes the time to solve a problem. We analyze the speedups achieved, in comparison with high-level compiler optimizations, on three different GPU devices, for 17 heterogeneous GPU applications, 12 of which are from the Rodinia Benchmark Suite. The autotuner often beat the compiler's high-level optimizations, but underperformed for some problems. We achieved over 2x speedup for Gaussian Elimination and almost 2x speedup for Heart Wall, both problems from the Rodinia Benchmark, and over 4x speedup for a matrix multiplication algorithm.
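The search loop that an autotuner performs can be illustrated with a minimal sketch. This is not the OpenTuner API and not the authors' implementation: it uses plain random search, the parameter names in `SPACE` are hypothetical placeholders rather than real compiler flags, and `toy_measure` is a made-up stand-in for "compile with this configuration and time the run".

```python
import random


def random_search_autotune(space, measure, trials=50, seed=0):
    """Sample `trials` random configurations and keep the cheapest one.

    space   -- dict mapping a parameter name to the list of values it may take
    measure -- function taking a configuration dict and returning a cost
               (for a real autotuner: the measured execution time)
    """
    rng = random.Random(seed)
    best_config, best_cost = None, float("inf")
    for _ in range(trials):
        # Draw one value per parameter to build a candidate configuration.
        config = {name: rng.choice(values) for name, values in space.items()}
        cost = measure(config)
        if cost < best_cost:
            best_config, best_cost = config, cost
    return best_config, best_cost


# Hypothetical search space (assumption: names and values are illustrative only).
SPACE = {
    "opt_level": [0, 1, 2, 3],
    "max_registers": [16, 32, 64],
    "fast_math": [False, True],
}


def toy_measure(config):
    """Made-up cost model standing in for compiling and timing a GPU program."""
    cost = 10.0 - config["opt_level"]
    if config["fast_math"]:
        cost -= 1.5
    cost += abs(config["max_registers"] - 32) / 16
    return cost


if __name__ == "__main__":
    best, cost = random_search_autotune(SPACE, toy_measure, trials=100, seed=1)
    print(best, cost)
```

In a real setting, `measure` would invoke the compiler and run the resulting binary, and the random sampling would typically be replaced by the more elaborate search techniques that frameworks such as OpenTuner provide.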
Comunicaciones en Estadística, 2013
This article presents a theoretical foundation for the main similarity measures based on data compression. These techniques emerged in the last decade and have proven very useful in several fields of science; implemented in clustering techniques or other learning machines, they can be used to classify objects.
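As an illustration of a compression-based similarity measure, the sketch below computes the normalized compression distance (NCD), a well-known member of this family. The abstract does not say which measures the article covers, so the choice of NCD and of `zlib` as the compressor are assumptions made here for the example.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of `data` after zlib compression at the highest level."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings.

    NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)),
    where C(.) is the compressed size. Values near 0 suggest similar
    inputs; values near 1 suggest unrelated inputs.
    """
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


if __name__ == "__main__":
    a = b"ACGT" * 200           # highly repetitive sequence
    b = bytes(range(256)) * 4   # a different, unrelated pattern
    print("ncd(a, a) =", ncd(a, a))
    print("ncd(a, b) =", ncd(a, b))
```

A real compressor only approximates the ideal (uncomputable) complexity measure, so the distance of an object to itself is small but not exactly zero.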
2016 49th Hawaii International Conference on System Sciences (HICSS), 2016
This paper aims to show that knowing the core concepts related to a given parallel architecture is necessary to write correct code, regardless of the parallel programming paradigm used. Programmers unaware of architecture concepts, such as beginners and students, often write parallel code that is slower than its sequential version. It is also easy to write code that produces incorrect answers under specific conditions, which are hard to detect and correct. The increasing popularization of multi-core architectures motivates the implementation of parallel programming frameworks and tools, such as OpenMP, that aim to lower the difficulty of parallel programming. OpenMP uses compilation directives, or pragmas, to reduce the number of lines that the programmer needs to write. However, the programmer still has to know when and how to use each of these directives. The documentation and available tutorials for OpenMP give the idea that using compilation directives for parallel programming is easy. In this paper we show that this is not always the case by analysing a set of corrections of OpenMP programs made by students of a graduate course in Parallel and Distributed Computing at the University of São Paulo. Several incorrect examples of OpenMP pragmas were found in tutorials and official documents available on the Internet. The idea that OpenMP is easy to use can lead to superficial efforts in teaching fundamental parallel programming concepts. This can in turn lead to code that does not develop the full potential of OpenMP, and that could also crash inexplicably due to very specific and hard-to-detect conditions. Our main contribution is showing how important it is to teach core architecture and parallel programming concepts properly, even when powerful tools such as OpenMP are available.
2015 IEEE 22nd International Conference on High Performance Computing (HiPC), 2015
Models are useful to represent abstractions of software and hardware processes. Bulk Synchronous Parallel (BSP) is a bridging model for parallel computation that allows algorithmic analysis of programs on parallel computers using performance modeling. The main idea of the BSP model is the treatment of communication and computation as abstractions of a parallel system. Meanwhile, the use of GPU devices is becoming more widespread, and they are currently capable of performing efficient parallel computation for applications that can be decomposed into thousands of simple threads. However, few models for predicting application execution time on GPUs have been proposed. In this work we present a simple and intuitive BSP-based model for predicting the execution times of CUDA applications on GPUs. The model is based on the number of computations and memory accesses of the GPU, with additional information on cache usage obtained from profiling. Scalability, divergence, effect of optimizations and differences of architectures are adjusted by a single parameter. We evaluated our model using two applications and six different boards. We showed, using profile information for a single board, that the model is general enough to predict the execution time of an application with different input sizes and on different boards with the same architecture. Our model predictions were within 0.8 to 1.2 times the measured execution times, which is reasonable for such a simple model. These results indicate that the model is good enough to generalize the predictions for different problem sizes and GPU configurations.
This document presents a classification of electrocardiographic signals by means of unsupervised machine learning, specifically agglomerative hierarchical clustering techniques. These algorithms iteratively merge the groups that most resemble each other according to certain predefined distance measures. Many distance measures can serve as input to the clustering algorithms; new compression-based distance measures, invented a few years ago, have been shown to perform well in time-series classification tasks. Before classification, each electrocardiogram goes through a preprocessing stage to remove the noise present in the acquisition and then to extract the heart rate variability of each signal; data mining techniques are then applied, which transform the electrocardiographic signals into statistical indices that extract pa...
This document presents an introduction to, and a study of, the processing and characterization of electrocardiographic signals using different packages and functions for data analysis with the wavelet transform on the R numerical and statistical computing platform. The wavelet transform has proven very useful and has been implemented in many areas of engineering in many programming languages. The R language is no exception, and several implementations now exist that support statistical analysis, filtering, and characterization of data. In this work the packages wmtsa, wavethresh, wavelets, and waveslim were used, and other implementations were tested on electrocardiographic signals, in order to perform filtering, baseline recovery, QRS complex detection, and peak finding, among other tasks.
We study the problem of executing an application represented by a precedence task graph on a parallel machine composed of standard computing cores and accelerators. Contrary to most existing approaches, we distinguish the allocation and the scheduling phases and mainly focus on the allocation part of the problem: choosing the most appropriate type of computing unit for each task. We address both off-line and on-line settings and design generic scheduling approaches. In the off-line case, we establish strong lower bounds on the worst-case performance of a known approach based on Linear Programming for solving the allocation problem. Then, we refine the scheduling phase and replace the greedy List Scheduling policy used in this approach by a better ordering of the tasks. Although this modification leads to the same approximability guarantees, it performs much better in practice. We also extend this algorithm to more types of computing units, achieving an approximation ratio which depends on the number of different types. In the on-line case, we assume that the tasks arrive in an order that is not known in advance but respects the precedence relations, and that the scheduler has to take irrevocable decisions about their allocation and execution. In this setting, we propose the first on-line scheduling algorithm which takes into account precedences. Our algorithm is based on adequate rules for selecting the type of processor on which to allocate each task, and it achieves a constant factor approximation guarantee if the ratio of the number of CPUs over the number of GPUs is bounded. Finally, all these algorithms for hybrid architectures have been evaluated in a large number of simulations built on actual libraries. These simulations assess the good practical behavior of the algorithms with respect to state-of-the-art solutions, whenever these exist, or baseline algorithms.
This article presents a brief theoretical review of important similarity measures based on data compression. These measures are used in different learning machines to classify the objects that make up the system under study. Many scientific areas have benefited from these information similarity measures, among them the study of time series, images, DNA, video, audio, and software metrics. The theoretical basis of these compression techniques is Kolmogorov complexity. This document defines important concepts of that complexity and draws analogies with Shannon's information theory. Additionally, examples are presented of applications of these techniques to the classification of electrocardiographic signals using clustering algorithms. Keywords: similarity measures, mutual information, Kolmogorov complexity, compression, clustering algorithms.
Conference Presentations by Marcos Amaris González
Today, most high-performance computing (HPC) platforms have heterogeneous hardware resources (CPUs, GPUs, storage, etc.). A Graphics Processing Unit (GPU) is a parallel computing coprocessor specialized in accelerating vector operations. The prediction of application execution times on these devices is a great challenge and is essential for efficient job scheduling. There are different approaches to do this, such as analytical modeling and machine learning techniques. Analytical predictive models are useful, but require manual inclusion of interactions between architecture and software, and may not capture the complex interactions in GPU architectures. Machine learning techniques can learn to capture these interactions without manual intervention, but may require large training sets. In this paper, we compare three different machine learning approaches (linear regression, support vector machines and random forests) with a BSP-based analytical model for predicting the execution time of GPU applications. As input to the machine learning algorithms, we use profiling information from 9 applications executed on 9 different GPUs. We show that machine learning approaches provide reasonable predictions for different cases. Although their predictions were less accurate than those of the analytical model, they required no detailed knowledge of application code, hardware characteristics or explicit modeling. Consequently, whenever a database with profile information is available or can be generated, machine learning techniques can be useful for deploying automated on-line performance prediction for scheduling applications on heterogeneous architectures containing GPUs.
We study the problem of executing an application represented by a precedence task graph on a multi-core machine composed of standard computing cores and accelerators. Contrary to most existing approaches, we distinguish the allocation and the scheduling phases and mainly focus on the allocation part of the problem: choosing the most appropriate type of computing unit for each task. We address both off-line and on-line settings. In the off-line case, we establish strong lower bounds on the worst-case performance of a known approach based on Linear Programming for solving the allocation problem. Then, we refine the scheduling phase and replace the greedy list scheduling policy used in this approach by a better ordering of the tasks. Although this modification leads to the same approximability guarantees, it performs much better in practice. In the on-line case, we assume that the tasks arrive in an order that is not known in advance but respects the precedence relations, and that the scheduler has to take irrevocable decisions about their allocation and execution. In this setting, we propose the first on-line scheduling algorithm which takes into account precedences. Our algorithm is based on adequate rules for selecting the type of processor on which to allocate each task, and it achieves a constant factor approximation guarantee if the ratio of the number of CPUs over the number of GPUs is bounded. Finally, all these algorithms have been evaluated in a large number of simulations built on actual libraries. These simulations assess the good practical behavior of the algorithms with respect to state-of-the-art solutions, whenever these exist, or baseline algorithms.
This document presents a classification of electrocardiographic signals by means of unsupervised machine learning, specifically agglomerative hierarchical clustering techniques. These algorithms iteratively merge the groups that most resemble each other according to certain predefined distance measures. Many distance measures can serve as input to the clustering algorithms; new compression-based distance measures, invented a few years ago, have been shown to perform well in time-series classification tasks. Before classification, each electrocardiogram goes through a preprocessing stage to remove the noise present in the acquisition and then to extract the heart rate variability of each signal; data mining techniques are then applied, which transform the electrocardiographic signals into statistical indices that extract patterns and hidden information inherent in the signals. The presented methodology demonstrates 100% efficiency in the classification of electrocardiograms of patients with acute myocardial infarction.
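The iterative merging performed by agglomerative hierarchical clustering can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not specify the linkage criterion, so single linkage (closest pair of members) is assumed here, and the absolute difference between numbers stands in for the compression-based distances mentioned above.

```python
def agglomerative(items, distance, n_clusters):
    """Merge the two closest clusters until `n_clusters` remain.

    items      -- list of objects to cluster
    distance   -- function returning the distance between two objects
    n_clusters -- number of clusters at which merging stops
    Returns a list of clusters, each a list of indices into `items`.
    """
    # Start with every item in its own cluster.
    clusters = [[i] for i in range(len(items))]

    def linkage(a, b):
        # Single linkage (an assumption): distance of the closest pair.
        return min(distance(items[i], items[j]) for i in a for j in b)

    while len(clusters) > n_clusters:
        best = None
        for p in range(len(clusters)):
            for q in range(p + 1, len(clusters)):
                d = linkage(clusters[p], clusters[q])
                if best is None or d < best[0]:
                    best = (d, p, q)
        _, p, q = best
        # Merge cluster q into cluster p and drop q (q > p, so p stays valid).
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]
    return clusters


if __name__ == "__main__":
    values = [1.0, 1.1, 1.2, 9.0, 9.3]
    print(agglomerative(values, lambda a, b: abs(a - b), n_clusters=2))
```

Any pairwise distance function can be plugged in as `distance`, which is what lets this family of algorithms work with different distance measures.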
Graphics Processing Units (GPUs) are specialized coprocessors that were initially conceived for the purpose of accelerating vector operations, such as graphics rendering. Writing and configuring efficient algorithms for GPU devices is still a hard problem. The Algorithm Selection Problem consists of finding a combination of algorithms, or a configuration of an algorithm, that optimizes the solution of a given problem instance or set of instances. An autotuner is a program that solves the Algorithm Selection Problem automatically. In this paper we implement an autotuner for the compilation flags of GPU algorithms, using the OpenTuner framework. The autotuner produces a set of compilation flags that aims to optimize the time to solve a given problem on a specific GPU device. We analyse the performance gains of tuning the compilation flags for heterogeneous GPU algorithms across three different GPU devices. We show that it is possible to gain performance by automatically and empirically selecting a set of compilation flags for the same GPU algorithm on different devices. In one of the experimental settings we were able to achieve a 30% speedup in comparison with the compiler's high-level optimization options.