Chapter 3 - Shared-Memory Programming, OpenMP
4.1 Shared Memory
4
Shared Memory Architecture
• All processors have access to one global memory
• All processors share the same address space
• The system runs a single copy of the OS
• Processors communicate by reading/writing to the
global memory
• Examples: multiprocessor PCs (Intel P4), Sun Fire
15K, NEC SX-7, Fujitsu PrimePower, IBM p690,
SGI Origin 3000.
5
Shared Memory Systems
(Figure: a shared-memory system can be programmed with OpenMP or Pthreads)
6
Shared Memory Systems (2)
• All processors may access the whole main memory
• Uniform Memory Access (UMA): memory access time is uniform
• Non-Uniform Memory Access (NUMA): memory access time is non-uniform
7
Clusters of SMPs
(Figure: a cluster of SMPs can be programmed with MPI or with hybrid MPI + OpenMP)
8
4.2 Multithread Programming
9
Shared Memory Programming
10
Process and thread
• A process is an instance of a computer program
• Information present in a process includes:
• Text: the machine code
• Data: global variables
• Stack: local variables
• Program counter (PC): a pointer to the instruction to be executed
11
Multi-threading
• The process contains several
concurrent execution flows
(threads)
• Each thread has its own
program counter (PC)
• Each thread has its own
private stack (variables local
to the thread)
• The instructions executed by
a thread can access:
• the process global
memory (data)
• the thread local stack
12
OpenMP vs. POSIX Threads
• POSIX threads is the other widely used shared-memory programming API.
• Fairly widely available, usually quite simple to implement on top of
OS kernel threads.
• Lower level of abstraction than OpenMP
• library routines only, no directives
• more flexible, but harder to implement and maintain
• OpenMP can be implemented on top of POSIX threads
• Not much difference in availability
• not that many OpenMP C++ implementations
• no standard Fortran interface for POSIX threads
13
4.3 OPENMP
14
What is OpenMP?
• What does OpenMP stand for?
• Open specifications for Multi Processing via collaborative work
between interested parties from the hardware and software industry,
government and academia.
• OpenMP is an Application Program Interface (API)
that may be used to explicitly direct multi-threaded,
shared memory parallelism.
• API components: Compiler Directives, Runtime Library Routines, and Environment Variables
• OpenMP is a directive-based method to invoke parallel computations on shared-memory multiprocessors
15
What is OpenMP?
• OpenMP API is specified for C/C++ and Fortran.
• OpenMP is not intrusive to the original serial code: directives appear as comment statements in Fortran and as pragmas in C/C++.
• OpenMP website: http://www.openmp.org
• Materials in this lecture are taken from various OpenMP
tutorials in the website and other places.
16
OpenMP Usage
• Applications
• Applications with intense computational needs
• From video games to big science & engineering
• Programmer Accessibility
• From beginning programmers in school to scientists to parallel computing experts
• Available to millions of programmers
• In every major (Fortran & C/C++) compiler
17
Programming Model
• Fork-Join Parallelism:
• Master thread spawns a team of threads as needed.
• Parallelism is added incrementally: i.e. the sequential
program evolves into a parallel program.
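A minimal sketch of the fork-join model in C (the printed messages and the thread count of 4 are illustrative):

#include <stdio.h>
#include <omp.h>

int main(void)
{
    printf("sequential part: only the master thread runs\n");

    #pragma omp parallel num_threads(4)     /* fork: a team of 4 threads */
    {
        printf("hello from thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
    }                                       /* join: back to the master thread */

    printf("sequential part again\n");
    return 0;
}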
18
OpenMP Syntax
• Most of the constructs in OpenMP are compiler
directives or pragmas.
• For C and C++, the pragmas take the form:
• #pragma omp construct [clause [clause]…]
• For Fortran, the directives take one of the forms:
• C$OMP construct [clause [clause]…]
• !$OMP construct [clause [clause]…]
• *$OMP construct [clause [clause]…]
• Since the constructs are directives, an OpenMP
program can be compiled by compilers that don’t
support OpenMP.
19
OpenMP: How is OpenMP Typically Used?
• OpenMP is usually used to parallelize loops:
• Find your most time consuming loops.
• Split them up between threads.
Split up this loop between multiple threads:

Sequential program:
int main() {
    double Res[1000];
    for (int i = 0; i < 1000; i++) {
        do_huge_comp(Res[i]);
    }
}

Parallel program:
int main() {
    double Res[1000];
    #pragma omp parallel for
    for (int i = 0; i < 1000; i++) {
        do_huge_comp(Res[i]);
    }
}
20
How to compile and run OpenMP
programs?
• GCC supports OpenMP from version 4.2 (OpenMP 3.0 from GCC 4.4)
• gcc -fopenmp a.c
• Try example1.c
• To run: ./a.out
• To change the number of threads:
• setenv OMP_NUM_THREADS 4 (tcsh) or export OMP_NUM_THREADS=4 (bash)
21
OpenMP:
How do Threads Interact?
• OpenMP is a shared memory model.
• Threads communicate by sharing variables.
• Unintended sharing of data can lead to race conditions:
• race condition: when the program’s outcome changes as the
threads are scheduled differently.
• To control race conditions:
• Use synchronization to protect data conflicts.
• Synchronization is expensive so:
• Change how data is stored to minimize the need for
synchronization.
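A minimal sketch of a race and one way to remove it (sum is an illustrative shared variable; the critical construct is covered later in this chapter):

int sum = 0;

/* Race: all threads update the shared variable sum concurrently,
   so the result can change from run to run. */
#pragma omp parallel for
for (int i = 0; i < 1000; i++)
    sum += i;

/* One fix: synchronize the conflicting update (a reduction clause
   would avoid most of the synchronization cost). */
sum = 0;
#pragma omp parallel for
for (int i = 0; i < 1000; i++)
{
    #pragma omp critical
    sum += i;
}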
22
OpenMP Constructs
• OpenMP’s constructs fall into 5 categories:
• Parallel Regions
• Worksharing
• Data Environment
• Synchronization
• Runtime functions/environment variables
• OpenMP is basically the same between Fortran and
C/C++
23
OpenMP: Parallel Regions
• You create threads in OpenMP with the “omp parallel” pragma.
• For example, to create a 4-thread parallel region:

double A[1000];
omp_set_num_threads(4);
#pragma omp parallel
{
    int ID = omp_get_thread_num();
    pooh(ID, A);
}

Each thread redundantly executes the code within the structured block.
• Each thread calls pooh(ID,A) for ID = 0 to 3
24
Parallel Region
25
OpenMP: Work-Sharing Constructs
• The “for” Work-Sharing construct splits up loop iterations
among the threads in a team
• For example:
!$omp parallel do &
!$omp schedule(static,3)
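An equivalent C/C++ sketch of the same static schedule (N and work(i) are illustrative):

#pragma omp parallel for schedule(static, 3)
for (int i = 0; i < N; i++)
    work(i);     /* iterations are divided among the team in chunks of 3 */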
29
dynamic Scheduling
• Iterations are distributed to
threads in the team in chunks as
the threads request them. Each
thread executes a chunk of
iterations, then requests another
chunk, until no chunks remain to
be distributed.
• The default chunk is 1
• For example:
!$omp parallel do &
!$omp schedule(dynamic,1)
30
guided Scheduling
• Iterations are assigned to threads
in the team in chunks as the
executing threads request them.
Each thread executes a chunk of
iterations, then requests another
chunk, until no chunks remain to
be assigned. The chunk size decreases as execution proceeds, down to the minimum size given by chunk
• The default value of chunk is 1
• For example:
!$omp parallel do &
!$omp schedule(guided,1)
31
runtime and auto Scheduling
• runtime: the iteration scheduling scheme is set at run time through the environment variable OMP_SCHEDULE
• auto: the scheduling decision is delegated to the compiler and/or the runtime system
• For example:
!$omp parallel do &
!$omp schedule(runtime)
• the scheduling scheme can be modified without recompiling the program by changing the environment variable OMP_SCHEDULE, for example: setenv OMP_SCHEDULE "dynamic,50"
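A C/C++ sketch (N and work(i) are illustrative):

#pragma omp parallel for schedule(runtime)
for (int i = 0; i < N; i++)
    work(i);     /* the schedule is read from OMP_SCHEDULE at run time */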
32
Scheduling experiment
(Figure) Different scheduling for a 1000-iteration loop with 4 threads: guided (top), dynamic (middle), static (bottom)
33
Sections: Work-Sharing Constructs
• The Sections work-sharing construct gives a
different structured block to each thread.
#pragma omp parallel
#pragma omp sections
{
    X_calculation();
#pragma omp section
    y_calculation();
#pragma omp section
    z_calculation();
}

By default, there is a barrier at the end of the "omp sections". Use the "nowait" clause to turn off the barrier.
34
Data-sharing attributes
• In a parallel construct the data-sharing attributes are implicitly determined by the default clause, if present
• if no default clause is present, they are shared
• Certain variables have predetermined data-sharing attributes
• Variables with automatic storage duration that are declared in a
scope inside a construct are private
• Objects with dynamic storage duration are shared
• The loop iteration variable(s) in the associated for-loop(s) of a for construct is (are) private
• A loop iteration variable for a sequential loop in a parallel
construct is private in the innermost such construct that
encloses the loop (only Fortran)
• Variables with static storage duration that are declared in a
scope inside the construct are shared
• ...
35
Data-sharing attributes clauses
• Explicitly determined data-sharing attributes are those that are
referenced in a given construct and are listed in a data-sharing
attribute clause
• shared(list): there is only one instance of the objects in the list, accessible by all threads in the team
• private(list): each thread has a copy of the variables in
the list
• firstprivate(list): same as private but all variables in
the list are initialized with the value that the original object had
before entering the parallel construct
• lastprivate(list): same as private but the thread that
executes the sequentially last iteration or section updates the
value of the objects in the list
• The default clause sets the implicit default
• default(none|shared) in C/C++
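A small C sketch combining these clauses (the variable names are illustrative):

int a = 1, b = 2, c = 3;
#pragma omp parallel for default(none) shared(a) firstprivate(b) lastprivate(c)
for (int i = 0; i < 100; i++) {
    /* a: one shared instance; b: private copy initialized to 2;
       c: private copy whose value from the last iteration is copied back */
    c = a + b + i;
}
/* here c holds the value computed in iteration i == 99 */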
36
Parallel programming using OpenMP
Data sharing
firstprivate
Particular case of private.
Each private copy is initialized with the value of the variable
of the master thread.
Example
void f() {
    int x = 17;
    #pragma omp parallel for firstprivate(x)
    for (long i = 0; i < maxval; ++i) {
        x += i;   // x is initially 17 in each thread
    }
    std::cout << x << std::endl;   // x == 17: the original x is unchanged
}
37
lastprivate
Pass the value of the private variable of the last sequential
iteration to the global variable.
Example
void f() {
    int x = 17;
    #pragma omp parallel for firstprivate(x) lastprivate(x)
    for (long i = 0; i < maxval; ++i) {
        x += i;   // x is initially 17 in each thread
    }
    std::cout << x << std::endl;   // x has the value from iteration i == maxval-1
}
38
The threadprivate directive
C/C++
#pragma omp threadprivate(list)
• Is a declarative directive
• Is used to create private copies of
• file-scope, namespace-scope or static variables in C/C++
• Follows the variable declaration in the same program unit
• Initial data are undefined, unless the copyin clause is used
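A small sketch, assuming a file-scope counter (the variable name is illustrative):

#include <stdio.h>
#include <omp.h>

int counter = 0;                      /* file-scope variable */
#pragma omp threadprivate(counter)    /* each thread gets its own copy */

int main(void)
{
    counter = 100;                    /* value set in the master thread */
    #pragma omp parallel copyin(counter)   /* copy the master value into every copy */
    {
        counter += omp_get_thread_num();
        printf("thread %d: counter = %d\n", omp_get_thread_num(), counter);
    }
    return 0;
}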
39
Data Environment:
Default Storage Attributes
• Shared Memory programming model:
• Most variables are shared by default
• Global variables are SHARED among threads
• Fortran: COMMON blocks, SAVE variables, MODULE
variables
• C: File scope variables, static
• But not everything is shared...
• Stack variables in sub-programs called from parallel
regions are PRIVATE
• Automatic variables within a statement block are
PRIVATE.
40
Private Clause
• private(var) creates a local copy of var for each
thread.
– The value is uninitialized
– Private copy is not storage associated with the
original
void wrong() {
    int IS = 0;
    #pragma omp parallel for private(IS)
    for (int J = 1; J < 1000; J++)
        IS = IS + J;       // each thread updates its own uninitialized copy of IS
    printf("%d\n", IS);     // not the sum: the private copies are discarded here
}
41
OpenMP: Reduction
• Another clause that affects the way variables are shared:
• reduction (op : list)
• The variables in “list” must be shared in the enclosing
parallel region.
• Inside a parallel or a worksharing construct:
• A local copy of each list variable is made and initialized
depending on the “op” (e.g. 0 for “+”)
• each thread applies "op" pairwise to its local copy
• Local copies are reduced into a single global copy at the
end of the construct.
42
OpenMP:
A Reduction Example
#include <omp.h>
#define NUM_THREADS 2

double func(int i);                /* defined elsewhere */

int main()
{
    int i;
    double ZZ, sum = 0.0;
    omp_set_num_threads(NUM_THREADS);
    #pragma omp parallel for reduction(+:sum) private(ZZ)
    for (i = 0; i < 1000; i++) {
        ZZ = func(i);
        sum = sum + ZZ;
    }
    return 0;
}
43
OpenMP: Synchronization
• OpenMP has the following constructs to support
synchronization:
• barrier
• critical section
• atomic
• flush
• ordered
• single
• master
44
Synchronization
45
atomic construct
• The atomic construct applies only to statements that update
the value of a variable
• Ensures that no other thread updates the variable between
reading and writing
• The allowed instructions differ between Fortran and C/C++
• Refer to the OpenMP specifications
• It is a special, lightweight form of a critical section
• Only the read/update of the variable is serialized, and only if two or more threads access the same memory address
C/C++
#pragma omp atomic [clause]
<statement>
46
atomic Examples
C/C++
#pragma omp atomic update
x += n*mass; // default update
47
Critical section
• Only one thread at a time can enter a critical section

for (I = 0; I < N; I++) {
    ......
    sum += A[I];        // cannot be parallelized if sum is shared
    ......
}

Fix:
for (I = 0; I < N; I++) {
    ......
    #pragma omp critical
    {
        sum += A[I];
    }
    ......
}
48
Master directive
• The master construct denotes a structured block that is only
executed by the master thread. The other threads just skip it
(no implied barriers or flushes).
#pragma omp parallel private(tmp)
{
    do_many_things();
    #pragma omp master
    { exchange_boundaries(); }
    #pragma omp barrier
    do_many_other_things();
}
49
Single directive
• The single construct denotes a block of code that is executed by
only one thread.
• A barrier and a flush are implied at the end of the single block.
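A small sketch (the function names are illustrative):

#pragma omp parallel
{
    do_work_in_every_thread();
    #pragma omp single
    {
        read_input();              /* executed by exactly one thread */
    }                              /* implied barrier: the other threads wait here */
    use_input_in_every_thread();
}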
50
Ordering
• An ordered region inside a loop is executed in the sequential order of the iterations (the loop must carry the ordered clause).
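A small sketch, assuming results must be printed in iteration order (heavy_work is an illustrative function):

#pragma omp parallel for ordered
for (int i = 0; i < N; i++) {
    double r = heavy_work(i);      /* runs in parallel */
    #pragma omp ordered
    printf("%d %f\n", i, r);       /* printed in the order i = 0, 1, 2, ... */
}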
51
Simple locks
• Locks in the OpenMP library.
• Also nested locks.
omp_lock_t l;
omp_init_lock(&l);
#pragma omp parallel
{
    int id = omp_get_thread_num();
    double tmp = f(id);                             // each thread computes its own value
    omp_set_lock(&l);                               // serialize access to the shared output
    cout << "ID=" << id << " tmp= " << tmp << endl;
    omp_unset_lock(&l);
}
omp_destroy_lock(&l);
52
OpenMP: Library routines
• Lock routines
• omp_init_lock(), omp_set_lock(), omp_unset_lock(), omp_test_lock()
• Runtime environment routines:
• Modify/Check the number of threads
• omp_set_num_threads(), omp_get_num_threads(),
omp_get_thread_num(), omp_get_max_threads()
• Turn on/off nesting and dynamic mode
• omp_set_nested(), omp_set_dynamic(), omp_get_nested(),
omp_get_dynamic()
• Are we in a parallel region?
• omp_in_parallel()
• How many processors are in the system?
• omp_get_num_procs()
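A short sketch using several of these routines:

#include <stdio.h>
#include <omp.h>

int main(void)
{
    printf("procs = %d, in parallel? %d\n",
           omp_get_num_procs(), omp_in_parallel());    /* 0: serial region */

    omp_set_num_threads(4);
    #pragma omp parallel
    {
        if (omp_get_thread_num() == 0)                 /* report once, from thread 0 */
            printf("team size = %d, in parallel? %d\n",
                   omp_get_num_threads(), omp_in_parallel());
    }
    return 0;
}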
53
OpenMP: Environment Variables
• OMP_NUM_THREADS
• bash:
• export OMP_NUM_THREADS=2
• csh:
• setenv OMP_NUM_THREADS 4
54
CODE. Matrix-vector multiply using a parallel loop and
critical directive
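The code itself is not reproduced on this slide; the following is a minimal sketch of such a kernel, with illustrative names (matrix A, vector x, result y, shared accumulator total):

#include <stdio.h>
#include <omp.h>
#define N 4

int main(void)
{
    double A[N][N], x[N], y[N], total = 0.0;

    for (int i = 0; i < N; i++) {          /* simple test data */
        x[i] = 1.0;
        for (int j = 0; j < N; j++) A[i][j] = i + j;
    }

    #pragma omp parallel for               /* one row of y per iteration */
    for (int i = 0; i < N; i++) {
        double row = 0.0;
        for (int j = 0; j < N; j++)
            row += A[i][j] * x[j];
        y[i] = row;
        #pragma omp critical               /* protect the shared accumulator */
        total += row;
    }

    printf("y[0] = %f, total = %f\n", y[0], total);
    return 0;
}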
55
Ex.Travelling Salesman Problem
• The map is represented as a graph with nodes
representing cities and edges representing the
distances between cities.
• A special node (city) is the starting point of the tour.
• The travelling salesman problem is to find the cycle, starting and ending at that city, that visits all nodes with the smallest total distance.
• This is a well-known NP-complete problem.
56
CODE. Sequential TSP
init_q(); init_best();
while ((p = dequeue()) != NULL) {
    for each expansion by one city {
        q = addcity(p);
        if (complete(q)) { update_best(q); }
        else enqueue(q);
    }
}
57
CODE. OpenMP TSP
do_work() {
    while ((p = dequeue()) != NULL) {
        for each expansion by one city {
            q = addcity(p);
            if (complete(q)) { update_best(q); }
            else enqueue(q);
        }
    }
}

main() {
    init_q(); init_best();
    /* dequeue, enqueue and update_best must be made thread-safe, e.g. with critical */
    #pragma omp parallel for
    for (i = 0; i < NPROCS; i++)
        do_work();
}
58
Summary
• OpenMP provides a compact, yet powerful
programming model for shared memory programming
• It is very easy to use OpenMP to create parallel programs.
• OpenMP preserves the sequential version of the
program
• Developing an OpenMP program:
• Start from a sequential program
• Identify the code segment that takes most of the time.
• Determine whether the important loops can be parallelized
• The loops may have critical sections, reduction variables, etc
• Determine the shared and private variables.
• Add directives
59
Conclusion
• OpenMP is successful in small-to-medium SMP
systems
• Multi-core CPUs dominate future computer architectures; OpenMP will likely be a major parallel programming model for these architectures.
• Simple: everybody can learn it in 2 weeks
• Not so simple: don't stop learning! Keep learning it to get better performance
60
OpenMP discussion
• Ease of use
• OpenMP takes care of thread creation and management.
• Big improvement over Pthreads.
• Synchronization
• Much higher-level constructs (critical section, barrier).
• Big improvement over Pthreads.
61
OpenMP discussion
• Expressiveness
• Data parallelism:
• MM and SOR
• Fits nicely in the paradigm
• Task parallelism:
• TSP
• Somewhat awkward: OpenMP constructs are used only to create threads, so OpenMP ends up not much different from Pthreads here.
62
OpenMP discussion
• Exposing architecture features (performance):
• Not much, similar to the pthread approach
• Assumption: dividing job into threads = improved performance.
• How valid is this assumption in reality?
• Overheads, contentions, synchronizations, etc
• This is one weak point for OpenMP: the performance of
an OpenMP program is somewhat hard to understand.
63
OpenMP final thoughts
• Main issues with OpenMP: performance
• Is there any obvious way to solve this?
• Exposing more architecture features?
• Is the performance issue more related to the fundamental way that we write parallel programs?
• OpenMP programs begin with sequential programs.
• May need to find a new way to write efficient parallel programs
in order to really solve the problem.
64
Thank you for your attention!
65