OpenMP: Martin Kruliš, Jiří Dokulil


OpenMP

Martin Kruliš
Jiří Dokulil
OpenMP
OpenMP Architecture Review Board
Compaq, HP, Intel, IBM, KAI, SGI, SUN, U.S. Department of Energy, ...
http://www.openmp.org
specifications (freely available)
1.0 C/C++ and FORTRAN versions
2.0 C/C++ and FORTRAN versions
2.5 combined C/C++ and FORTRAN
3.0 combined C/C++ and FORTRAN
4.0 combined C/C++ and FORTRAN (July 2013)
OpenMP Threading Model
Basics
pragmas
#pragma omp
simple to use
only a few constructs
programs should run without OpenMP
possible but not enforced
compilers ignore unknown pragmas
#ifdef _OPENMP
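A minimal sketch of the `_OPENMP` guard, so that one source file builds both with and without OpenMP. The function name `available_threads` is hypothetical; `omp_get_max_threads` is the standard runtime call.

```cpp
#include <cassert>
#ifdef _OPENMP
#include <omp.h>   // only present when the compiler enables OpenMP
#endif

// Upper bound on the team size of the next parallel region;
// falls back to 1 when the program is built without OpenMP.
int available_threads() {
#ifdef _OPENMP
    const int n = omp_get_max_threads();
#else
    const int n = 1;
#endif
    assert(n >= 1);
    return n;
}
```

Only the runtime library calls need the guard; the pragmas themselves are skipped by a compiler that does not know them.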
Simple example
#define N (1024*1024)
int* data=new int[N];
for(int i=0; i<N; ++i)
{
    data[i]=i;
}
Simple example cont.
#define N (1024*1024)
int* data=new int[N];
#pragma omp parallel for
for(int i=0; i<N; ++i)
{
    data[i]=i;
}
Another example
int sum=0;
#pragma omp parallel for
for(int i=0; i<N; ++i) // WRONG: data race, all threads update the shared sum
{
    sum+=data[i];
}
Variable scope
shared
one instance for all threads
private
one instance for each thread
reduction
special variant for reduction operations
valid within lexical extent
no effect in called functions
Variable scope: private
default for loop control variable
only for the parallelized loop
should (probably always) be made private
all loops in Fortran
all variables declared within the parallelized
block
all non-static variables in called functions
allocated on the stack, hence private for each thread
uninitialized values
at start of the block and after the block
except for classes
default constructor (must be accessible)
may not be shared among the threads
Variable scope: private
int j;
#pragma omp parallel for private(j)
for(int i=0; i<N/2; ++i)
{
    j=i*2;
    data[j]=i;
    data[j+1]=i;
}
Variable scope: reduction
performing e.g. sum of an array
cannot use only private variable
shared requires explicit synchronization
combination is possible and (relatively) efficient
but unnecessarily complex
each thread works on a private copy
initialized to a default value (0 for +, 1 for *, ...)
final results are joined and available to the
master thread
Variable scope: reduction
long long sum=0;
#pragma omp parallel for reduction(+:sum)
for(int i=0; i<N; ++i)
{
    sum+=data[i];
}
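The reduction loop can be wrapped into a self-contained function. This is a sketch only; the name `sum_array` and its pointer-plus-length signature are assumptions made for illustration.

```cpp
#include <cassert>

// Sums n integers. Each thread accumulates into its own private copy of
// `sum` (initialized to 0 for +); the copies are combined after the loop.
long long sum_array(const int* data, int n) {
    assert(n >= 0);
    long long sum = 0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; ++i) {
        sum += data[i];
    }
    return sum;
}
```

Built without OpenMP, the pragma is ignored and the loop runs serially with the same result.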
Variable scope: firstprivate and lastprivate
private variables at the start of the block and
after end of the block are undefined
firstprivate
each private copy is initialized to the value the variable has in the
master thread
lastprivate
variable after the parallelized block is set to the
value of the last iteration (last in the serial version)

#pragma omp for firstprivate(x) lastprivate(x)
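A small sketch of both clauses, using two separate variables for clarity. The function name `scaled_last` is hypothetical. `x` enters every thread with its value from before the loop, and `last` leaves the loop with the value written by the sequentially last iteration.

```cpp
#include <cassert>

// Returns (n - 1) * scale, computed inside the parallel loop.
int scaled_last(int n, int scale) {
    assert(n > 0);          // with zero iterations `last` would not be assigned
    int x = scale;          // firstprivate: each thread's copy starts as `scale`
    int last = -1;          // lastprivate: receives the value from i == n - 1
    #pragma omp parallel for firstprivate(x) lastprivate(last)
    for (int i = 0; i < n; ++i) {
        last = i * x;
    }
    return last;
}
```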


parallel
#pragma omp parallel
launches threads and executes block in
parallel (once for each thread)
modifiers
if (scalar expression)
variable scope modifiers (including reduction)
num_threads
especially useful in conjunction with
omp_get_thread_num()
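As an illustration of a plain parallel block, the sketch below divides an array among the threads by hand, using the thread number. The function name `fill_by_thread` and the team size of 4 are arbitrary choices for the example.

```cpp
#include <cassert>
#ifdef _OPENMP
#include <omp.h>
#endif

// Writes into owner[i] the number of the thread that handled index i.
void fill_by_thread(int* owner, int n) {
    assert(n >= 0);
    #pragma omp parallel num_threads(4)
    {
#ifdef _OPENMP
        const int tid = omp_get_thread_num();       // 0 = master
        const int nthreads = omp_get_num_threads(); // actual team size
#else
        const int tid = 0, nthreads = 1;            // serial build
#endif
        // Cyclic distribution: thread t handles t, t + nthreads, ...
        for (int i = tid; i < n; i += nthreads) {
            owner[i] = tid;
        }
    }
}
```

The block is executed once by every thread; the runtime may provide fewer than the 4 requested threads, which is why the code asks for the actual team size.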
Loop-level parallelism
#pragma omp parallel for
launch threads and execute loop in parallel
can be nested
#pragma omp for
parallel loop within another parallel block
divide work among existing threads
no (direct) nesting
requires a simple (canonical) for loop
implicit barrier at the end
Loop modifiers 1
variable scope modifiers
nowait removes barrier at the end
cannot be used with #pragma omp parallel for
ordered
loop (or called function) may contain block
marked #pragma omp ordered
such block is executed in the same order as in
serial execution of the loop
at most one such block may exist
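A sketch of the ordered clause: the squares may be computed in parallel, while the ordered block appends them in serial loop order. The function name `squares_in_order` is hypothetical.

```cpp
#include <cassert>
#include <vector>

std::vector<int> squares_in_order(int n) {
    assert(n >= 0);
    std::vector<int> out;
    #pragma omp parallel for ordered
    for (int i = 0; i < n; ++i) {
        const int sq = i * i;      // may run concurrently
        #pragma omp ordered
        {
            out.push_back(sq);     // runs one iteration at a time, in i order
        }
    }
    return out;
}
```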
Loop modifiers 2
schedule
schedule(static[, chunk_size])
chunks are assigned to threads round robin
without chunk_size: approximately equal-sized chunks, at most one per thread
schedule(dynamic[, chunk_size])
threads request chunks
default chunk size is 1
schedule(guided[, chunk_size])
like dynamic with size of chunks proportional to the amount
of remaining work, but at least chunk_size
default chunk size is 1
auto
selected by implementation
runtime
use the schedule stored in the internal control variable run-sched-var (set by OMP_SCHEDULE)
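The schedule clause matters most when iterations differ in cost. In the sketch below iteration i does work proportional to i, so a dynamic schedule is one reasonable choice. The names `work` and `triangular` and the chunk size of 8 are assumptions for illustration.

```cpp
#include <cassert>
#include <vector>

// Cost grows with i: returns 0 + 1 + ... + i.
long long work(int i) {
    long long s = 0;
    for (int k = 0; k <= i; ++k) s += k;
    return s;
}

std::vector<long long> triangular(int n) {
    assert(n >= 0);
    std::vector<long long> r(n);
    // Threads request chunks of 8 iterations as they become idle.
    #pragma omp parallel for schedule(dynamic, 8)
    for (int i = 0; i < n; ++i) {
        r[i] = work(i);
    }
    return r;
}
```

The result does not depend on the schedule; only the distribution of iterations over threads changes.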
Parallel sections
#pragma omp sections
#pragma omp section
#pragma omp section

several blocks of code that should be
evaluated in parallel
modifiers
private, firstprivate, lastprivate, reduction
nowait
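A sketch with two independent sections, each executed by one thread of the team. The function name `min_and_max` is hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <utility>

// Finds the minimum and the maximum of a non-empty array,
// one search per section.
std::pair<int, int> min_and_max(const int* data, int n) {
    assert(n > 0);
    int lo = data[0], hi = data[0];
    #pragma omp parallel
    {
        #pragma omp sections
        {
            #pragma omp section
            { lo = *std::min_element(data, data + n); }
            #pragma omp section
            { hi = *std::max_element(data, data + n); }
        }
    }
    return std::make_pair(lo, hi);
}
```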
Single
#pragma omp single
code is executed by only one thread of the team
modifiers
private, firstprivate
nowait
when not used, there is a barrier at the end of the block
copyprivate
final value of the variable is distributed to all threads in the
team after the block is executed
incompatible with nowait
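A sketch of single with copyprivate: one thread sets a value and every member of the team receives it in its own private variable. The function name `threads_that_received` is hypothetical.

```cpp
#include <cassert>

// Returns how many threads saw `value` in their private copy.
int threads_that_received(int value) {
    int received = 0, team = 0;
    #pragma omp parallel reduction(+:received, team)
    {
        int config = 0;                  // private: declared inside the block
        #pragma omp single copyprivate(config)
        {
            config = value;              // executed by one thread only
        }
        // After the single block every thread's copy holds `value`.
        team += 1;
        if (config == value) received += 1;
    }
    assert(received == team);
    return received;
}
```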
Workshare
Fortran only

SUBROUTINE A11_1(AA, BB, CC, DD, EE, FF, N)


INTEGER N
REAL AA(N,N), BB(N,N), CC(N,N), DD(N,N), EE(N,N), FF(N,N)
!$OMP PARALLEL
!$OMP WORKSHARE
AA = BB
CC = DD
EE = FF
!$OMP END WORKSHARE
!$OMP END PARALLEL
END SUBROUTINE A11_1
Master
#pragma omp master
similar to omp single
executed only by the master thread
Critical section
#pragma omp critical [(name)]
a well-known (named) critical section
at most one thread at a time can execute a critical section
with a given name
multiple pragmas with the same name form one
section
names have external linkage
all unnamed pragmas form one section
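A sketch of a named critical section protecting a shared container. The function name `collect_even` and the section name `collect_even_push` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Collects the even numbers below n. Threads append in an unspecified
// order, so the result is sorted before it is returned.
std::vector<int> collect_even(int n) {
    assert(n >= 0);
    std::vector<int> out;
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        if (i % 2 == 0) {
            #pragma omp critical (collect_even_push)
            out.push_back(i);            // one thread at a time
        }
    }
    std::sort(out.begin(), out.end());
    return out;
}
```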
Barrier
#pragma omp barrier
no associated block of code
some restrictions on placement
// invalid placement: a barrier may not be the immediate substatement of an if
if (a<10)
#pragma omp barrier
{ do_something(); }
Atomic
#pragma omp atomic
followed by an expression statement in one of the forms
x op= expr
op is one of +, *, -, /, &, ^, |, <<, or >>
expr must not reference x
x++
++x
x--
--x
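A sketch of an atomic update of a shared counter. The function name `count_multiples` is hypothetical; for a plain count a reduction would also work, atomic is used here only to show the construct.

```cpp
#include <cassert>

// Counts the multiples of k in 1..n.
int count_multiples(int n, int k) {
    assert(k > 0);
    int count = 0;
    #pragma omp parallel for
    for (int i = 1; i <= n; ++i) {
        if (i % k == 0) {
            #pragma omp atomic
            ++count;                 // indivisible read-modify-write
        }
    }
    return count;
}
```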
Flush
#pragma omp flush (variable list)
makes the thread's view of the variables consistent with the main
memory
the variable list may be omitted, in which case all variables are flushed
similar to volatile in C/C++
influences memory operation reordering that can be
performed by the compiler
cannot move read/write of the flushed variable to the other side
of the flush operation
all values of flushed variables are saved to the memory
before flush finishes
first read of flushed variable after flush is performed from
the main memory
same placement restrictions as barrier
threadprivate
#pragma omp threadprivate(list)
makes global variable private for each thread
complex restrictions
copyin, copyprivate
copyin(list)
copy value of threadprivate variable from master
thread to other members of the team
used as modifier in #pragma omp parallel
values copied at the start of the block
copyprivate(list)
copy value from one thread's private or threadprivate
variable to all other members of the team
used as modifier in #pragma omp single
values copied at the end of the block
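A sketch of threadprivate combined with copyin. The global `counter` and the function name `run_with_copyin` are hypothetical. Each thread starts from the master's value and then changes only its own copy.

```cpp
#include <cassert>

static int counter = 0;
#pragma omp threadprivate(counter)

// Returns the master's copy after the region, i.e. initial + 1.
int run_with_copyin(int initial) {
    counter = initial;                   // set the master thread's copy
    bool ok = true;
    #pragma omp parallel copyin(counter) reduction(&&:ok)
    {
        ok = ok && (counter == initial); // value copied in at region start
        ++counter;                       // affects this thread's copy only
    }
    assert(ok);
    return counter;
}
```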
Task
new in OpenMP 3.0
#pragma omp task
piece of code to be executed in parallel
immediately or later
if clause forces immediate execution when false
tied or untied (to a thread)
can be suspended, e.g. by launching nested task
modifiers
default, private, firstprivate, shared
untied
if
Task scheduling points
after explicit generation of a task
after the last instruction of a task region
taskwait region
in implicit and explicit barriers
(almost) anywhere in untied tasks
Taskwait
#pragma omp taskwait
wait for completion of all child tasks
generated since the start of the current task
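Tasks and taskwait are often illustrated with a recursive Fibonacci computation; the sketch below follows that pattern. The function names and the cutoff of 20 are assumptions. Below the cutoff the if clause is false, so the task is executed immediately.

```cpp
#include <cassert>

int fib(int n) {
    assert(n >= 0);
    if (n < 2) return n;
    int a = 0, b = 0;
    #pragma omp task shared(a) if(n > 20)
    a = fib(n - 1);
    #pragma omp task shared(b) if(n > 20)
    b = fib(n - 2);
    #pragma omp taskwait            // wait for both child tasks
    return a + b;
}

int parallel_fib(int n) {
    int result = 0;
    #pragma omp parallel
    {
        #pragma omp single          // one thread creates the root of the task tree
        result = fib(n);
    }
    return result;
}
```

The other threads of the team wait at the barrier that ends the single block, which is a task scheduling point, so they pick up the generated tasks there.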
Functions
omp_set_num_threads, omp_get_max_threads
number of threads used for parallel regions without
num_threads clause
omp_get_num_threads
number of threads in the team
omp_get_thread_num
number of calling thread within the team
0 = master
omp_get_num_procs
number of processors available to the program
Functions cont.
omp_in_parallel
checks if the caller is in active parallel region
an active region is a region without an if clause, or one whose if
condition was true
omp_set_dynamic, omp_get_dynamic
dynamic adjustment of thread number
on/off
omp_set_nested, omp_get_nested
nested parallelism
on/off
Locks
plain and nested
omp_lock_t, omp_nest_lock_t
omp_init_lock, omp_init_nest_lock
initializes the lock
omp_destroy_lock, omp_destroy_nest_lock
uninitializes
must be unlocked
omp_set_lock, omp_set_nest_lock
must be initialized
locks the lock
blocks until the lock is acquired
omp_unset_lock, omp_unset_nest_lock
must be locked and owned by the calling thread
unlocks
omp_test_lock, omp_test_nest_lock
like set but does not block
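A sketch of the plain lock lifecycle: init, set, unset, destroy. The function name `locked_sum` is hypothetical, and the empty stand-in functions exist only so that the sketch also builds without OpenMP.

```cpp
#include <cassert>
#ifdef _OPENMP
#include <omp.h>
#else
// Serial stand-ins (assumption: used only when OpenMP is disabled).
typedef int omp_lock_t;
inline void omp_init_lock(omp_lock_t*) {}
inline void omp_destroy_lock(omp_lock_t*) {}
inline void omp_set_lock(omp_lock_t*) {}
inline void omp_unset_lock(omp_lock_t*) {}
#endif

// Sums 1..n, guarding the shared total with a lock.
long long locked_sum(int n) {
    assert(n >= 0);
    omp_lock_t lock;
    omp_init_lock(&lock);
    long long total = 0;
    #pragma omp parallel for
    for (int i = 1; i <= n; ++i) {
        omp_set_lock(&lock);        // blocks until the lock is acquired
        total += i;
        omp_unset_lock(&lock);      // released by the owning thread
    }
    omp_destroy_lock(&lock);        // the lock is unlocked at this point
    return total;
}
```

Locking around every addition is deliberately naive; it demonstrates the API rather than an efficient way to sum.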
Timing routines
double omp_get_wtime()
wall-clock time in seconds
since some time in the past
may not be consistent between threads
double omp_get_wtick()
number of seconds between successive clock
ticks of the timer used by omp_get_wtime
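A sketch of timing a parallel loop with omp_get_wtime. The helper names `now_seconds` and `time_sum` are hypothetical, and the std::chrono fallback is an assumption for builds without OpenMP.

```cpp
#include <cassert>
#include <chrono>
#ifdef _OPENMP
#include <omp.h>
#endif

double now_seconds() {
#ifdef _OPENMP
    return omp_get_wtime();
#else
    typedef std::chrono::steady_clock clock_type;
    return std::chrono::duration<double>(
        clock_type::now().time_since_epoch()).count();
#endif
}

// Sums 0..n-1 into *out and returns the elapsed wall-clock seconds.
double time_sum(int n, long long* out) {
    assert(out != nullptr);
    const double start = now_seconds();  // taken by the calling thread
    long long s = 0;
    #pragma omp parallel for reduction(+:s)
    for (int i = 0; i < n; ++i) {
        s += i;
    }
    *out = s;
    return now_seconds() - start;        // same thread, so values are comparable
}
```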
Environment variables
OMP_NUM_THREADS
number of threads launched in parallel regions
omp_set_num_threads, omp_get_max_threads
OMP_SCHEDULE
used in loops with schedule(runtime)
"guided,4", "dynamic"
OMP_DYNAMIC
set if implementation may change number of threads
omp_set_dynamic, omp_get_dynamic
true or false
OMP_NESTED
controls nested parallelism
true or false
default is false
Nesting of regions
some limitations
close nesting
no #pragma omp parallel nested between the two regions
work-sharing region
for, sections, single, (workshare)
work-sharing region may not be closely nested inside a work-sharing, critical, ordered, or master region
barrier region may not be closely nested inside a work-sharing, critical, ordered, or master region
master region may not be closely nested inside a work-sharing region
ordered region may not be closely nested inside a critical region
ordered region must be closely nested inside a loop region (or parallel loop region) with an ordered clause
critical region may not be nested (closely or otherwise) inside a critical region with the same name
note that this restriction is not sufficient to prevent deadlock
OpenMP 4.0
The newest version (July 2013)
No implementations yet
Thread affinity
proc_bind(master | close | spread)
SIMD support
Explicit loop vectorization (by SSE, AVX, ...)
User defined reduction
#pragma omp declare reduction (identifier : typelist : combiner-expr) [initializer-clause]
Atomic operations with sequential consistency (seq_cst)
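A sketch of explicit vectorization with the OpenMP 4.0 simd construct, combined with a reduction. The function name `dot` is hypothetical; compilers without OpenMP 4.0 support ignore the pragma and run the scalar loop.

```cpp
#include <cassert>

// Dot product of two arrays of length n.
float dot(const float* a, const float* b, int n) {
    assert(n >= 0);
    float s = 0.0f;
    // Asks the compiler to vectorize the loop; each SIMD lane keeps a
    // partial sum that is combined at the end.
    #pragma omp simd reduction(+:s)
    for (int i = 0; i < n; ++i) {
        s += a[i] * b[i];
    }
    return s;
}
```

Because floating-point addition is not associative, a vectorized sum can differ from the serial one in the last bits for general inputs.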
OpenMP 4.0
Accelerator support
Xeon Phi cards, GPUs, ...
#pragma omp target offloads computation
device(idx)
map(variable map)
#pragma omp target update
