DST4030A Lecture Notes Week 4

DST4030A: Parallel Computing

Parallel Programming Models

Dr Mike Asiyo

School of Science & Technology



Contents

1 Course Learning Objectives

2 Parallel Programming Models

3 OpenMP



Course Learning Objectives


Course Learning Objectives

Lecture Learning Objectives

At the end of this lecture, students should be able to:


1 Appreciate parallel machine architectures from a software perspective
2 Understand parallel programming models/architectures



Parallel Programming Models


Parallel Programming Models Introduction

Parallel Programming Models exist as an abstraction above hardware and memory architectures:
Shared Memory (without threads)
Shared Threads Models (Pthreads, OpenMP)
Distributed Memory / Message Passing (MPI)
Data Parallel
Hybrid
Single Program Multiple Data (SPMD)
Multiple Program Multiple Data (MPMD)



Parallel Programming Models Shared Memory Model (without threads)

In this programming model, processes/tasks share a common address space, which they read and write to asynchronously.

Various mechanisms such as locks and semaphores are used to control access to the shared memory, resolve contention, and prevent race conditions and deadlocks.

This is perhaps the simplest parallel programming model.

An advantage of this model from the programmer's point of view is that the notion of data "ownership" is lacking, so there is no need to specify explicitly the communication of data between tasks.



Parallel Programming Models Shared Memory Model (without threads)

All processes see and have equal access to shared memory. Program
development can often be simplified.

An important disadvantage in terms of performance is that it becomes more difficult to understand and manage data locality:
Keeping data local to the process that works on it conserves memory accesses, cache refreshes and bus traffic that occur when multiple processes use the same data.
Unfortunately, controlling data locality is hard to understand and may be beyond the control of the average user.



Parallel Programming Models Shared Memory Model (without threads)

Figure 1: Shared Memory Model



Parallel Programming Models Shared Memory Model (without threads)

Implementations
On stand-alone shared memory machines, native operating systems, compilers and/or hardware provide support for shared memory programming. For example, the POSIX standard provides an API for using shared memory, and UNIX provides shared memory segments (shmget, shmat, shmctl, etc.); a minimal sketch follows this list.
On distributed memory machines, memory is physically distributed across a network of machines, but made global through specialized hardware and software.
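
To make the System V shared memory calls named above concrete, here is a minimal sketch (my own illustration, not from the original notes; error handling omitted) in which a forked child writes into a segment created with shmget and the parent reads it back. In a real program, concurrent access would additionally be guarded by a lock or semaphore.

#include <iostream>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <sys/wait.h>
#include <unistd.h>

int main()
{
    // Create a private shared memory segment large enough for one int.
    int shmid = shmget(IPC_PRIVATE, sizeof(int), IPC_CREAT | 0600);
    int* shared = static_cast<int*>(shmat(shmid, nullptr, 0)); // attach it
    *shared = 0;

    if (fork() == 0) {            // child process writes into the segment
        *shared = 42;
        shmdt(shared);
        return 0;
    }
    wait(nullptr);                // parent waits, then reads the shared address space
    std::cout << "child wrote: " << *shared << std::endl;

    shmdt(shared);                // detach and mark the segment for removal
    shmctl(shmid, IPC_RMID, nullptr);
    return 0;
}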



Parallel Programming Models Threads Model

This programming model is a type of shared memory programming.


In the threads model of parallel programming, a single ”heavy
weight” process can have multiple ”light weight”, concurrent
execution paths.



Parallel Programming Models Threads Model

Figure 2: Threads Model


Parallel Programming Models Threads Model

For example:
The main program a.out is scheduled to run by the native operating system.
a.out loads and acquires all of the necessary system and user resources to run.
This is the ”heavy weight” process.
a.out performs some serial work, and then creates a number of tasks (threads)
that can be scheduled and run by the operating system concurrently.
Each thread has local data, but also, shares the entire resources of a.out. This
saves the overhead associated with replicating a program’s resources for each
thread (”light weight”). Each thread also benefits from a global memory view
because it shares the memory space of a.out.



Parallel Programming Models Threads Model

A thread’s work may best be described as a subroutine within the main program.
Any thread can execute any subroutine at the same time as other threads.
Threads communicate with each other through global memory (updating
address locations). This requires synchronization constructs to ensure that more
than one thread is not updating the same global address at any time.
Threads can come and go, but a.out remains present to provide the necessary
shared resources until the application has completed.



Parallel Programming Models Threads Model

Implementations
From a programming perspective, threads implementations
commonly comprise:
A library of subroutines that are called from within parallel source
code
A set of compiler directives imbedded in either serial or parallel
source code



Parallel Programming Models Threads Model

In both cases, the programmer is responsible for determining the parallelism (although compilers can sometimes help).
Historically, threads implementations differed substantially from each other, making it difficult for programmers to develop portable threaded applications.
Unrelated standardization efforts have resulted in two very different
implementations of threads: POSIX Threads and OpenMP.
POSIX Threads tutorial:
hpc.llnl.gov/sites/default/files/2019.08.21.TAU_.pdf
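
To illustrate the library-call style that POSIX Threads uses, here is a minimal sketch (an assumed example, not part of the lecture) that creates a small team of threads and joins them:

#include <cstdio>
#include <pthread.h>

// Each thread runs this subroutine; the argument carries its id.
void* worker(void* arg)
{
    long id = reinterpret_cast<long>(arg);
    std::printf("hello from thread %ld\n", id);
    return nullptr;
}

int main()
{
    const int num_threads = 4;
    pthread_t threads[num_threads];

    for (long i = 0; i < num_threads; ++i)   // create the team of threads
        pthread_create(&threads[i], nullptr, worker, reinterpret_cast<void*>(i));

    for (int i = 0; i < num_threads; ++i)    // wait for all of them to finish
        pthread_join(threads[i], nullptr);

    return 0;
}

Compile with, for example, g++ threads.cpp -lpthread (the file name is arbitrary).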



Parallel Programming Models OpenMP

OpenMP
Industry standard, jointly defined and endorsed by a group of
major computer hardware and software vendors, organizations and
individuals.
Compiler directive based
Portable / multi-platform, including Unix and Windows platforms
Available in C/C++ and Fortran implementations
Can be very easy and simple to use - provides for ”incremental
parallelism”. Can begin with serial code.



Parallel Programming Models Distributed Memory / Message Passing Model

This model demonstrates the following characteristics:
A set of tasks that use their own local memory during computation. Multiple tasks can reside on the same physical machine and/or across an arbitrary number of machines.
Tasks exchange data through communications by sending and receiving messages.
Data transfer usually requires cooperative operations to be performed by each process. For example, a send operation must have a matching receive operation (a minimal sketch follows this list).
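
As a sketch of the matching send/receive idea (an assumed example, not from the slides), the following MPI program has rank 1 send one integer to rank 0:

#include <iostream>
#include <mpi.h>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 1) {                       // sender
        int value = 42;
        MPI_Send(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    } else if (rank == 0) {                // matching receiver
        int value = 0;
        MPI_Recv(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        std::cout << "rank 0 received " << value << std::endl;
    }

    MPI_Finalize();
    return 0;
}

Compile with mpicxx and run with at least two processes, e.g. mpirun -np 2 ./a.out.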



Parallel Programming Models Distributed Memory / Message Passing Model

Figure 3: Distributed Memory Model



Parallel Programming Models Distributed Memory / Message Passing Model

Implementations:

From a programming perspective, message passing implementations usually comprise a library of subroutines. Calls to these subroutines are imbedded in source code. The programmer is responsible for determining all parallelism.

Historically, a variety of message passing libraries have been available since the 1980s. These implementations differed substantially from each other, making it difficult for programmers to develop portable applications.



Parallel Programming Models Distributed Memory / Message Passing Model

In 1992, the MPI Forum was formed with the primary goal of establishing a
standard interface for message passing implementations.
Part 1 of the Message Passing Interface (MPI) was released in 1994. Part 2
(MPI-2) was released in 1996 and MPI-3 in 2012. All MPI specifications are
available on the web at http://www.mpi-forum.org/docs/.
MPI is the ”de facto” industry standard for message passing, replacing virtually
all other message passing implementations used for production work. MPI
implementations exist for virtually all popular parallel computing platforms.
Not all implementations include everything in MPI-1, MPI-2 or MPI-3.



Parallel Programming Models Data Parallel Model

May also be referred to as the Partitioned Global Address Space (PGAS) model.

The data parallel model demonstrates the following characteristics:
Address space is treated globally.
Most of the parallel work focuses on performing operations on a data set.
A set of tasks work collectively on the same data structure; however, each task works on a different partition of the same data structure.
Tasks perform the same operation on their partition of work, for example, "add 4 to every array element" (see the sketch after this list).
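
A minimal OpenMP sketch of that example (assumed, not from the slides; the array size and values are arbitrary), where each thread applies the same operation to its own partition of the array:

#include <iostream>
#include <vector>
#include <omp.h>

int main()
{
    std::vector<int> data(1000000, 1);

    // Each thread gets its own chunk of iterations, i.e. its own
    // partition of the array, and applies the same operation to it.
    #pragma omp parallel for
    for (long i = 0; i < static_cast<long>(data.size()); ++i)
        data[i] += 4;

    std::cout << "data[0] = " << data[0] << std::endl;   // prints 5
    return 0;
}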



Parallel Programming Models Data Parallel Model

Figure 4: Data Parallel Model



Parallel Programming Models Data Parallel Model

On shared memory architectures, all tasks may have access to the data structure
through global memory.
On distributed memory architectures, the global data structure can be split up
logically and/or physically across tasks.
Implementations:
Currently, there are several parallel programming implementations in various stages of development, based on the Data Parallel / PGAS model.
Coarray Fortran: a small set of extensions to Fortran 95 for SPMD parallel
programming. Compiler dependent. More information:
https://en.wikipedia.org/wiki/Coarray_Fortran



OpenMP


OpenMP

OpenMP is:
An Application Program Interface (API) that may be used to explicitly direct
multi-threaded, shared memory parallelism

Comprised of three primary API components:


1 Compiler Directives
2 Runtime Library Routines
3 Environment Variables
An abbreviation for:

Short version: Open Multi-Processing


Long version: Open specifications for Multi-Processing via
collaborative work between interested parties from the hardware and
software industry, government and academia.



OpenMP

OpenMP is not:
Necessarily implemented identically by all vendors
Guaranteed to make the most efficient use of shared memory
Required to check for data dependencies, data conflicts, race
conditions, or deadlocks
Required to check for code sequences that cause a program to be
classified as non-conforming
Designed to guarantee that input or output to the same file is
synchronous when executed in parallel. The programmer is
responsible for synchronizing input and output.



OpenMP

Goals of OpenMP:
1 Standardization:
Provide a standard among a variety of shared memory
architectures/platforms
Jointly defined and endorsed by a group of major computer hardware
and software vendors
2 Lean and Mean:
Establish a simple and limited set of directives for programming
shared memory machines.
Significant parallelism can be implemented by using just 3 or 4
directives.



OpenMP

3 Ease of Use:
Provide capability to incrementally parallelize a serial program,
unlike message-passing libraries which typically require an all or
nothing approach
Provide the capability to implement both coarse-grain and fine-grain
parallelism
4 Portability:
The API is specified for C/C++ and Fortran
Public forum for API and membership
Implementations exist for most major platforms, including Unix/Linux and Windows



OpenMP OpenMP Programming Model

OpenMP has a memory model and an execution model:

1 Shared Memory Model:
OpenMP is designed for multi-processor/core, shared memory machines. The underlying architecture can be shared memory UMA or NUMA.

Figure 5: (a) UMA, (b) NUMA



OpenMP OpenMP Programming Model

2 Openmp Execution Model:


a. Thread Based Parallelism:
OpenMP programs accomplish parallelism exclusively through the
use of threads.
A thread of execution is the smallest unit of processing that can be
scheduled by an operating system. The idea of a subroutine that can
be scheduled to run autonomously might help explain what a thread
is.
Threads exist within the resources of a single process. Without the
process, they cease to exist.
Typically, the number of threads matches the number of machine processors/cores. However, the actual use of threads is up to the application.



OpenMP OpenMP Programming Model

c. Explicit Parallelism:
OpenMP is an explicit (not automatic) programming model, offering
the programmer full control over parallelization.
Parallelization can be as simple as taking a serial program and inserting compiler directives...
Or as complex as inserting subroutines to set multiple levels of parallelism, locks and even nested locks (a minimal lock sketch follows).
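
As a minimal sketch of the explicit lock routines mentioned above (an assumed example; variable names are illustrative), here is a shared counter protected by an omp_lock_t:

#include <iostream>
#include <omp.h>

int main()
{
    omp_lock_t lock;
    omp_init_lock(&lock);          // create the lock

    long counter = 0;

    #pragma omp parallel
    {
        for (int i = 0; i < 1000; ++i) {
            omp_set_lock(&lock);   // only one thread at a time past this point
            ++counter;
            omp_unset_lock(&lock); // release so other threads can enter
        }
    }

    omp_destroy_lock(&lock);
    std::cout << "counter = " << counter << std::endl;
    return 0;
}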



OpenMP Fork - Join Model:

OpenMP uses the fork-join model of parallel execution:

Figure 6: Fork - Join Model

All OpenMP programs begin as a single process: the master thread.
The master thread executes sequentially until the first parallel region construct is encountered.



OpenMP Fork - Join Model:

FORK: the master thread then creates a team of parallel threads. The statements in the program that are enclosed by the parallel region construct are then executed in parallel among the various team threads.
JOIN: when the team threads complete the statements in the parallel region construct, they synchronize and terminate, leaving only the master thread.
The number of parallel regions and the threads that comprise them are arbitrary.
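
A minimal sketch (assumed, not from the slides) that makes the fork and join visible by printing which threads execute inside and outside a parallel region:

#include <cstdio>
#include <omp.h>

int main()
{
    std::printf("before the region: only the master thread (id %d)\n",
                omp_get_thread_num());

    // FORK: a team of threads executes this block.
    #pragma omp parallel
    {
        std::printf("inside the region: thread %d of %d\n",
                    omp_get_thread_num(), omp_get_num_threads());
    }
    // JOIN: the team has synchronized and terminated.

    std::printf("after the region: only the master thread again\n");
    return 0;
}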



OpenMP Fork - Join Model:

The Components of the OpenMP API

The OpenMP API has three primary components:


Compiler directives: These are preprocessor directives that can be used by programmers to define and control parallel regions of code.
Runtime library routines: These are functions from the OpenMP library that can be called by the program; the library is linked into the program.
Environment variables: These can be used to control the behaviour of OpenMP programs.

It is possible to parallelize many sequential programs without using most of the API.



OpenMP Fork - Join Model:

OpenMP - Compiler Directives

We will focus here on C/C++ syntax

#pragma omp <directive name> [<clauses>]

Used for (a short illustration follows this list):
Defining parallel regions / spawning threads
Distributing loop iterations or sections of code between threads
Serializing sections of code (e.g. for access to I/O or shared
variables)
Synchronizing threads
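
A short illustration (assumed, not from the slides) of a directive combined with common clauses; num_threads, schedule and reduction are standard OpenMP clauses, while the variable names are arbitrary:

#include <iostream>
#include <omp.h>

int main()
{
    const int n = 1000;
    double sum = 0.0;

    // One directive: spawn a team of 4 threads, distribute the iterations
    // in chunks of 100, and combine the per-thread partial sums.
    #pragma omp parallel for num_threads(4) schedule(static, 100) reduction(+:sum)
    for (int i = 0; i < n; ++i)
        sum += i;

    std::cout << "sum = " << sum << std::endl;   // 499500
    return 0;
}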



OpenMP Fork - Join Model:

OpenMP - Runtime Library Routines

These routines, provided by the OpenMP library, are used to configure and monitor multithreading during execution, e.g.:
omp_get_num_threads() returns the number of threads in the current team
omp_in_parallel() checks whether execution is inside a parallel region
omp_set_schedule() modifies the scheduler policy
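
A minimal sketch (assumed, not from the slides) that calls these three routines:

#include <cstdio>
#include <omp.h>

int main()
{
    // Request a dynamic schedule with chunk size 4 for schedule(runtime) loops.
    omp_set_schedule(omp_sched_dynamic, 4);

    std::printf("outside: in parallel? %d\n", omp_in_parallel());   // prints 0

    #pragma omp parallel
    {
        #pragma omp single
        std::printf("inside: in parallel? %d, team size %d\n",
                    omp_in_parallel(), omp_get_num_threads());
    }
    return 0;
}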



OpenMP Fork - Join Model:

OpenMP - Environment variables

Environment variables are used to store configuration needed for running the program. In OpenMP, they are used for setting, e.g., the number of threads per team (OMP_NUM_THREADS), the maximum number of threads (OMP_THREAD_LIMIT) or the scheduler policy (OMP_SCHEDULE).
While most of these settings can also be made using clauses in the compiler directives or runtime library routines, environment variables give the user an easy way to change these crucial settings without an additional config file (parsed by your program) or rewriting/recompiling the OpenMP-enhanced program.



OpenMP Fork - Join Model:

Listing 1: Helloworld.cpp

#include <iostream>
#include <omp.h>
using namespace std;

int main()
{
    #pragma omp parallel
    {
        cout << "Hello World" << endl;
    }
    return 0;
}



OpenMP Fork - Join Model:

Listing 2: arrayex.cpp

#include <iostream>
#include <algorithm>
#include <omp.h>

#define ARRAY_SIZE 100000000
#define ARRAY_VALUE 1231

int main()
{
    omp_set_num_threads(4);
    int* arr = new int[ARRAY_SIZE];
    std::fill_n(arr, ARRAY_SIZE, ARRAY_VALUE);



OpenMP Fork - Join Model:

Listing 3: arrayex.cpp (continued)

    #pragma omp parallel for
    for (int i = 0; i < ARRAY_SIZE; i++)
    {
        arr[i] = arr[i] / arr[i] + arr[i];
    }
    return 0;
}



Thank You!

