Papers by Carlos Sá
This report is the result of a study of the performance of a square matrix multiplication algorithm. To test the algorithm we will use node 431 of the SeARCH Cluster (http://search.di.uminho.pt). Throughout this work we explore three different implementations of this algorithm, with matrix sizes specifically selected to evaluate their performance impact. The internal CPU organisation and the evaluation of bottlenecks are the main focus of this work. In the algorithm, the loop index order was defined as k-j-i for our workgroup. Modern CPU architectures also provide vector computing features: the capability of using "large" processor registers to process multiple data elements in a single clock cycle. This CPU capability, commonly known as SIMD (Single Instruction, Multiple Data), will also be explored as a performance optimisation technique for our implementation. As the main tool in the experimental component of this work we'll use a C library for performance analysis, the Performance Application Programming Interface (PAPI). This library will allow us to access the internal CPU counters of node 431, analyse different metrics and draw conclusions about algorithm performance for different data sets.
This report is the result of a study on the computational improvement of the DD3IMP software package through High Performance Computing. DD3IMP is a software package for numerical simulation based on the Finite Element Method (FEM): it simulates the forming processes of sheet metals, and their elastoplastic behaviour, in deep drawing. The performance of this software is directly determined by the performance of the linear equation system solver it uses. In the current version of DD3IMP, the main solver is the Direct Sparse Solver (DSS) from the Intel Math Kernel Library (MKL). This optimised solver has shown the best performance for solving the linear system of equations in DD3IMP on machines based on Intel processors. The entire program is written in Fortran, with about 500 routines and 60k lines of code, and is already parallelised for the shared memory paradigm using OpenMP directives. In this work we explore the DD3IMP program, use profiling tools to detect where the program is most computationally expensive, and explore the possibilities of increasing its performance. The program is analysed using SeARCH Cluster nodes based on Intel Xeon processors with the Ivy Bridge microarchitecture, and a team laptop with an Intel Core i7 processor based on the Haswell microarchitecture.

I. The package DD3IMP (Deep-Drawing 3D Implicit FE Solver)

DD3IMP is a software package for the simulation of sheet metal forming and elastoplasticity in deep drawing using finite element methods. The program was developed in Fortran 95 and has more than 500 routines and 60k lines of code. The part of DD3IMP which performs the most work is the repeated solution of a linear equation system, which can be a computationally intensive task.
Since DD3IMP solves this kind of linear equation system multiple times, its resolution can become a bottleneck for performance scalability. As we'll see in the next sections, since the most computationally heavy regions of this software correspond to solving a linear equation system, the global performance is directly affected by the solver used. The system involved in DD3IMP is a linear system Ax = b, where A is a non-symmetric sparse matrix (symmetric in structure but non-symmetric in values) stored in CSR format that represents the mesh structure, x is the displacement vector and b is the vector of external forces. The solver currently implemented in DD3IMP is DSS (Direct Sparse Solver) from Intel's Math Kernel Library (MKL). The previous one was based on a conjugate gradient method, the conjugate gradient squared (CGS) combined with an ILU preconditioner, which was replaced by DSS for performance reasons. However, both solvers are currently available in the DD3IMP package and the user can select which one to use. The main difference between them is that CGS is an iterative method while DSS is a direct method. Iterative methods are commonly known for computational efficiency and fast convergence; however, previous studies of this software revealed that DSS was the faster solver on Intel processor based machines with the OpenMP implementation of DD3IMP. This optimised library is particularly efficient for large problems, and the scalability of DSS allows the program to scale almost linearly, as we'll see.

II. Starting Point and Case Studies

The starting point is DD3IMP in a sequential version and in a parallel version with OpenMP directives. As we'll see in the profiling section, in the current parallelised version more than 97% of DD3IMP execution time runs in parallel. We also have three different case studies.
Since DD3IMP is a finite element method package, the program uses numerical techniques to find approximate solutions to boundary value problems for differential equations.
This document is a portfolio gathering all the practical assignments I carried out in the course unit Engenharia de Sistemas de Computação (Computer Systems Engineering). This course unit belongs to the fourth year of the Integrated Master's in Informatics Engineering, in the Parallel and Distributed Computing specialisation profile, at the University of Minho.
This course unit has a strong practical component in application performance analysis and operating systems fundamentals, from which the 5 assignments listed below resulted. These assignments cover a diverse set of topics related to computing systems, such as programming in multicore environments, file systems, application monitoring, and clustering on GNU/Linux and Solaris systems.
- Performance Analysis - NAS Parallel Benchmark (NASA Advanced Supercomputing Division - http://www.nas.nasa.gov/publications/npb.html)
- Shared Memory programming with Pthreads
- Exploring the DTrace tool
- Active benchmarking with DTrace and IOzone
- Profiling with perf: a study of hardware and software events
This report is the result of a study of a Molecular Dynamics (MD) algorithm. For this work, the source code of an MD simulation written in C is provided. The goal is to study the MD simulation program and the complexity of its sequential algorithm, and to implement a multi-core parallel version in two different programming paradigms: shared memory and distributed memory. The shared memory version of MD will be built with OpenMP, an API for multi-platform shared-memory parallel programming in C/C++ and Fortran. The distributed version of MD will be built with MPI (Message Passing Interface), using Open MPI, a high performance message passing library. Initially, we'll study the program organisation (with gprof and callgrind) and its complexity. For all MD implementations we produce a performance analysis using the compute-641 node of the SeARCH Cluster and draw some conclusions.
This report is the result of a study of LU decomposition with partial pivoting in Matlab. In this work we'll use two provided Matlab codes, based on BLAS2 and BLAS3 operations, and implement partial pivoting in both. The first one, BLAS2LU.m, applies row permutations to a matrix with m rows and n columns, where m ≥ n. The second, BLAS3LU.m, applies a block LU factorisation and calls BLAS2LU to factorise each block. Both codes initially come without pivoting. The main goal of this work is to modify the original codes and implement partial pivoting, in routines called BLAS2LUPP and BLAS3LUPP respectively. In the experimental component of this work we test both codes with randomly generated matrices of different dimensions. For both solutions produced, we compute the numerical error using the permutation matrix, together with a speedup analysis, to draw some conclusions.
This is an academic work developed at the University of Minho.
This report is the result of a study of a Monte Carlo algorithm applied to the Travelling Salesman Problem (TSP), exploring the Simulated Annealing (SA) meta-heuristic. We have a discrete space of cities, and the algorithm finds the shortest route that starts at one of the towns, passes exactly once through each of the others, and returns to the first one. The main goal is to explore the possibility of obtaining a zero-cost solution with n cities and p processors running in parallel. To perform this analysis we'll use a TSP algorithm implemented in MATLAB.
This is an academic work developed at the University of Minho.
Teaching Documents by Carlos Sá
"JUnit is a unit testing framework for the Java programming language. JUnit has been important in the development of test-driven development, and is one of a family of unit testing frameworks, collectively known as xUnit, that originated with SUnit. The framework resides under the package junit.framework for JUnit 3.8 and earlier, and under the package org.junit for JUnit 4 and later.
A research survey performed in 2013 across 10,000 Java projects hosted on GitHub found that JUnit was the most commonly included external library."
- JUnit Wikipedia