OpenMP Overview
openmp.org
OpenMP Release History
(Timeline figure) C/C++ 1.0 (1998) and C/C++ 2.0 (2002); Fortran 1.0, Fortran 1.1 and Fortran 2.0; in 2005 the C/C++ and Fortran specifications were merged into a single OpenMP 2.5 specification; OpenMP 3.0 followed in 2008/9.
OpenMP Programming Model:
Fork-Join Parallelism:
  Master thread spawns a team of threads as needed.
  Parallelism is added incrementally: i.e. the sequential program evolves into a parallel program.
(Figure: the master thread runs serially, forks a team of threads at each parallel region (usually independent loops or SPMD sections), and joins back to a single thread afterwards.)
How is OpenMP typically used?
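OpenMP is most often used to parallelize the program's time-consuming loops, splitting their iterations among a team of threads. A minimal hedged sketch of that pattern; the array, its size, and the loop body are illustrative rather than taken from the slides:

#include <omp.h>
#define N 1000

int main(void)
{
    double a[N];

    /* Split the iterations of the hot loop among the team of threads. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        a[i] = i * 0.5;

    return 0;
}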
OpenMP:
Some syntax details to get us started
Most of the constructs in OpenMP are compiler directives or pragmas.
For C and C++, the pragmas take the form:
#pragma omp construct [clause [clause]…]
For Fortran, the directives take one of the forms:
C$OMP construct [clause [clause]…]
!$OMP construct [clause [clause]…]
*$OMP construct [clause [clause]…]
Include file and the OpenMP lib module
#include <omp.h>
use omp_lib
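A small hedged illustration of the C/C++ form: one construct with two clauses attached (the clause choices and the body are illustrative):

#include <omp.h>
#include <stdio.h>

int main(void)
{
    int tmp = 0;

    /* "parallel" is the construct; "num_threads" and "private" are clauses. */
    #pragma omp parallel num_threads(4) private(tmp)
    {
        tmp = omp_get_thread_num();
        printf("hello from thread %d\n", tmp);
    }
    return 0;
}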
OpenMP:
Structured blocks (C/C++)
Most OpenMP* constructs apply to structured blocks.
– Structured block: a block with one point of entry at the top
and one point of exit at the bottom.
– The only branches allowed are STOP statements in
Fortran and exit() in C/C++.
A structured block (the branch stays inside the block):

   #pragma omp parallel
   {
      int id = omp_get_thread_num();
more: res[id] = do_big_job(id);
      if (!conv(res[id])) goto more;
   }
   printf("All done\n");

Not a structured block (branches into and out of the block):

   if (go_now()) goto more;
   #pragma omp parallel
   {
      int id = omp_get_thread_num();
more: res[id] = do_big_job(id);
      if (conv(res[id])) goto done;
      goto more;
   }
done: if (!really_done()) goto more;
The OpenMP* API
Parallel Regions
You create threads in OpenMP* with the omp parallel pragma.
For example, to create a 4-thread parallel region:

   double A[1000];
   omp_set_num_threads(4);              /* runtime function to request a certain number of threads */
   #pragma omp parallel
   {
      int ID = omp_get_thread_num();    /* runtime function returning a thread ID */
      pooh(ID, A);
   }
   printf("all done\n");

Each thread executes a copy of the code within the structured block, so pooh(0,A), pooh(1,A), pooh(2,A) and pooh(3,A) run in parallel. A single copy of A is shared between all threads. Threads wait at the end of the parallel region for all threads to finish before proceeding (i.e. a barrier).
OpenMP: Contents
OpenMP's constructs fall into 5 categories:
  Parallel Regions
  Work-sharing
  Data Environment
  Synchronization
  Runtime functions / environment variables
OpenMP: Work-Sharing Constructs
The for work-sharing construct splits up loop iterations among the threads in a team.

   #pragma omp parallel
   #pragma omp for
   for (I=0; I<N; I++) {
      NEAT_STUFF(I);
   }

By default, there is a barrier at the end of the "omp for". Use the nowait clause to turn off the barrier:

   #pragma omp for nowait

nowait is useful between two consecutive, independent omp for loops, as in the sketch below.
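A hedged sketch of that nowait usage, assuming two loops that touch independent arrays (the arrays and loop bodies are illustrative):

#include <omp.h>
#define N 1000

void independent_loops(double *a, double *b)
{
    #pragma omp parallel
    {
        /* No barrier after this loop: a thread may move on as soon as its
           share of the iterations is done... */
        #pragma omp for nowait
        for (int i = 0; i < N; i++)
            a[i] = i * 2.0;

        /* ...which is safe here because the second loop only touches b and
           never reads what other threads wrote to a. */
        #pragma omp for
        for (int i = 0; i < N; i++)
            b[i] = i + 1.0;
    }
}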
Work Sharing Constructs
A motivating example
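A hedged sketch of such a motivating example, contrasting a parallel region in which the loop is partitioned by hand with the same loop written using the for work-sharing construct (the arrays and the loop body are illustrative):

#include <omp.h>
#define N 1000

double a[N], b[N];

void manual_split(void)      /* parallel region only: split the iterations by hand */
{
    #pragma omp parallel
    {
        int id     = omp_get_thread_num();
        int nthr   = omp_get_num_threads();
        int istart = id * N / nthr;
        int iend   = (id + 1) * N / nthr;
        for (int i = istart; i < iend; i++)
            a[i] = a[i] + b[i];
    }
}

void worksharing(void)       /* the same loop with the for work-sharing construct */
{
    #pragma omp parallel
    #pragma omp for
    for (int i = 0; i < N; i++)
        a[i] = a[i] + b[i];
}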
OpenMP: Work-Sharing Constructs
The sections work-sharing construct gives a different structured block to each thread.

   #pragma omp parallel
   #pragma omp sections
   {
      #pragma omp section
         x_calculation();
      #pragma omp section
         y_calculation();
      #pragma omp section
         z_calculation();
   }

By default, there is a barrier at the end of the "omp sections". Use the nowait clause to turn off the barrier.
The OpenMP* API
Combined parallel/work-share
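A combined directive merges a parallel region with a work-sharing construct into a single pragma. A minimal hedged sketch using the sections construct shown earlier; the functions are illustrative placeholders:

void x_calc(void);    /* illustrative functions, defined elsewhere */
void y_calc(void);

void combined(void)
{
    /* One combined directive both creates the team of threads and divides
       the sections among them. */
    #pragma omp parallel sections
    {
        #pragma omp section
            x_calc();
        #pragma omp section
            y_calc();
    }
}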
The example below spans two files and illustrates the scope of OpenMP constructs. The lexical extent of the parallel region is the code textually enclosed by the directives in poo.f; the dynamic extent of the parallel region also includes the routines it calls, such as whoami in bar.f. Orphan directives (here, the CRITICAL section) can appear outside the lexical extent of a parallel region.

poo.f:
C$OMP PARALLEL
      call whoami
C$OMP END PARALLEL
      end

bar.f:
      subroutine whoami
      external omp_get_thread_num
      integer iam, omp_get_thread_num
      iam = omp_get_thread_num()
C$OMP CRITICAL
      print*, 'Hello from ', iam
C$OMP END CRITICAL
      return
      end
OpenMP: Contents
OpenMP's constructs fall into 5 categories:
  Parallel Regions
  Worksharing
  Data Environment
  Synchronization
  Runtime functions / environment variables
Data Environment:
Default storage attributes
Data Sharing Examples
(Figure: A, index and count are shared by all threads; temp is local to each thread.)
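A hedged C sketch of the default storage attributes (the names and sizes are illustrative): file-scope and global data are shared by all threads by default, while variables declared inside the parallel region live on each thread's stack and are therefore private.

#include <omp.h>
#include <stdio.h>

double A[10];        /* file scope: shared by all threads by default */
int    count = 10;   /* also shared by default */

int main(void)
{
    #pragma omp parallel
    {
        double temp = 0.0;    /* declared inside the region: private to each thread */
        for (int i = 0; i < count; i++)
            temp += A[i];
        printf("thread %d: temp = %g\n", omp_get_thread_num(), temp);
    }
    return 0;
}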
Data Environment:
Changing storage attributes
One can selectively change the storage attributes of variables in OpenMP constructs using the following clauses* (*these clauses apply only to the lexical extent of the OpenMP construct):
  – SHARED
  – PRIVATE
  – FIRSTPRIVATE
  – THREADPRIVATE
The value of a private variable inside a parallel loop can be transmitted to a global value outside the loop with:
  – LASTPRIVATE
The default attributes can be changed with:
  – DEFAULT (PRIVATE | SHARED | NONE)
All data clauses apply to parallel regions and worksharing constructs except shared, which only applies to parallel regions.
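A hedged C sketch combining several of these clauses on a single work-sharing loop (the variables are illustrative); DEFAULT(NONE) forces every variable's attribute to be stated explicitly:

#include <stdio.h>

int main(void)
{
    int n = 8, offset = 100, last = 0;

    #pragma omp parallel for default(none) shared(n) \
            firstprivate(offset) lastprivate(last)
    for (int i = 0; i < n; i++) {
        /* Each thread's private copy of offset starts at 100 (firstprivate). */
        last = i + offset;
    }

    /* lastprivate: last now holds the value from the sequentially last
       iteration (i == n-1), i.e. 107. */
    printf("last = %d\n", last);
    return 0;
}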
Private Clause
private(var) creates a local copy of var for each thread.
– The value is uninitialized
– Private copy is not storage-associated with the original
– The original is undefined at the end
      program wrong
      IS = 0
C$OMP PARALLEL DO PRIVATE(IS)
      DO J=1,1000
         IS = IS + J
      END DO
      print *, IS

Inside the loop, IS was not initialized. Regardless of initialization, IS is undefined at the print statement.
Firstprivate Clause
Firstprivate is a special case of private.
– Initializes each private copy with the corresponding
value from the master thread.
      program almost_right
      IS = 0
C$OMP PARALLEL DO FIRSTPRIVATE(IS)
      DO 1000 J=1,1000
         IS = IS + J
 1000 CONTINUE
      print *, IS

Each thread gets its own IS with an initial value of 0. Regardless of initialization, IS is still undefined at the print statement.
Lastprivate Clause
Lastprivate passes the value of a private from the
last iteration to a global variable.
      program closer
      IS = 0
C$OMP PARALLEL DO FIRSTPRIVATE(IS)
C$OMP+ LASTPRIVATE(IS)
      DO 1000 J=1,1000
         IS = IS + J
 1000 CONTINUE
      print *, IS

Each thread gets its own IS with an initial value of 0. After the loop, IS holds the private value from the sequentially last iteration (J=1000).
Are these two codes equivalent?

      itotal = 1000
C$OMP PARALLEL PRIVATE(np, each)
      np = omp_get_num_threads()
      each = itotal/np
      ………
C$OMP END PARALLEL

      itotal = 1000
C$OMP PARALLEL DEFAULT(PRIVATE) SHARED(itotal)
      np = omp_get_num_threads()
      each = itotal/np
      ………
C$OMP END PARALLEL

Yes: in both versions np and each are private to each thread while itotal is shared.
Threadprivate
Makes global data private to a thread
Fortran: COMMON blocks
C: File scope and static variables
A threadprivate example
Consider two different routines called within a
parallel region.
      subroutine poo
      parameter (N=1000)
      common/buf/A(N),B(N)
C$OMP THREADPRIVATE(/buf/)
      do i=1, N
         B(i) = const * A(i)
      end do
      return
      end

      subroutine bar
      parameter (N=1000)
      common/buf/A(N),B(N)
C$OMP THREADPRIVATE(/buf/)
      do i=1, N
         A(i) = sqrt(B(i))
      end do
      return
      end

Because of the THREADPRIVATE directive, each thread executing these routines has its own private copy of the common block /buf/.
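In C, as noted above, threadprivate applies to file-scope and static variables. A hedged sketch with an illustrative per-thread counter:

#include <omp.h>
#include <stdio.h>

static int counter = 0;               /* file-scope static variable...  */
#pragma omp threadprivate(counter)    /* ...made private to each thread */

int main(void)
{
    #pragma omp parallel
    {
        counter++;                    /* each thread increments its own copy */
        printf("thread %d: counter = %d\n", omp_get_thread_num(), counter);
    }
    return 0;
}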
OpenMP: Reduction
Another clause that affects the way variables are shared:
  reduction (op : list)
The variables in list must be shared in the enclosing parallel region.
Inside a parallel or a work-sharing construct:
  – A local copy of each list variable is made and initialized depending on the op (e.g. 0 for "+").
  – The compiler finds standard reduction expressions containing op and uses them to update the local copy.
  – Local copies are reduced into a single value and combined with the original global value.
OpenMP:
Reduction example
   #include <omp.h>
   void main ()
   {
      int i;
      double ZZ, func(), res=0.0;

      #pragma omp parallel for reduction(+:res) private(ZZ)
      for (i=0; i<1000; i++) {
         ZZ  = func(i);
         res = res + ZZ;
      }
   }
OpenMP: Reduction example
Remember the code we used to demo private,
firstprivate and lastprivate.
      program closer
      IS = 0
C$OMP PARALLEL DO REDUCTION(+:IS)
      DO 1000 J=1,1000
         IS = IS + J
 1000 CONTINUE
      print *, IS

With REDUCTION(+:IS), each thread accumulates into its own copy of IS and the copies are combined with the shared IS at the end of the loop, so the printed value is correct.
OpenMP: Synchronization
OpenMP has the following constructs to support synchronization:
  – critical section
  – atomic
  – barrier
  – flush (we will save flush for the advanced OpenMP tutorial)
The OpenMP* API
Synchronization – critical section (in C/C++)
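A hedged C sketch of a critical section in C/C++: only one thread at a time may update the shared result (the functions and variable names are illustrative):

#include <omp.h>
#define NITERS 1000

double big_job(int i);                  /* illustrative work function        */
void   consume(double B, double *res);  /* illustrative update of the result */

void critical_example(void)
{
    double res = 0.0;
    #pragma omp parallel
    {
        double B;
        #pragma omp for
        for (int i = 0; i < NITERS; i++) {
            B = big_job(i);
            /* Only one thread at a time executes the critical section, so
               the update to the shared result cannot race. */
            #pragma omp critical
            consume(B, &res);
        }
    }
}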
OpenMP: Synchronization
The master construct denotes a structured block that is only executed by the master thread. The other threads just skip it (no synchronization is implied).

   #pragma omp parallel private (tmp)
   {
      do_many_things();
      #pragma omp master
         { exchange_boundaries(); }
      #pragma omp barrier
      do_many_other_things();
   }
OpenMP: Synchronization work-share
The single construct denotes a block of code that is executed by only one thread. A barrier is implied at the end of the single block.

   #pragma omp parallel private (tmp)
   {
      do_many_things();
      #pragma omp single
         { exchange_boundaries(); }
      do_many_other_things();
   }
OpenMP:
Implicit synchronization
Barriers are implied on the following OpenMP constructs:
  end parallel
  end do (except when nowait is used)
  end sections (except when nowait is used)
  end single (except when nowait is used)
OpenMP: Contents
OpenMP's constructs fall into 5 categories:
  Parallel Regions
  Worksharing
  Data Environment
  Synchronization
  Runtime functions / environment variables
OpenMP: Library routines:
Runtime environment routines:
  – Modify/check the number of threads:
      omp_set_num_threads(), omp_get_num_threads(),
      omp_get_thread_num(), omp_get_max_threads()
  – Are we in a parallel region?
      omp_in_parallel()
  – How many processors are in the system?
      omp_get_num_procs()
OpenMP: Library Routines
To fix the number of threads used in a program, (1) set the number of threads, then (2) save the number you got.

   #include <omp.h>
   void main()
   {  int num_threads;
      /* Request as many threads as you have processors. */
      omp_set_num_threads( omp_get_num_procs() );
      #pragma omp parallel
      {  int id = omp_get_thread_num();
         /* Protect this op since memory stores are not atomic. */
         #pragma omp single
            num_threads = omp_get_num_threads();
         do_lots_of_stuff(id);
      }
   }
OpenMP: Environment Variables:
Control how "omp for schedule(RUNTIME)" loop iterations are scheduled:
  – OMP_SCHEDULE "schedule[, chunk_size]"
Set the default number of threads to use:
  – OMP_NUM_THREADS int_literal
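A hedged C sketch of a loop whose schedule is deferred to the environment (the loop body is illustrative); one would run it with, for example, OMP_NUM_THREADS=4 and OMP_SCHEDULE="dynamic,8" set in the environment:

#define N 1000

void scheduled_loop(double *a)
{
    /* schedule(runtime): the actual schedule is read from the OMP_SCHEDULE
       environment variable when the program runs. */
    #pragma omp parallel for schedule(runtime)
    for (int i = 0; i < N; i++)
        a[i] = i * 0.5;
}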
Agenda
Setting the stage
– Parallel computing, hardware, software, etc.
OpenMP: A quick overview
OpenMP: A detailed introduction
The SPEC OMPM2001 Benchmarks
Basic Characteristics
Code      Parallel       Total Runtime (sec)     # of parallel
          Coverage (%)   Seq.       4-cpu        sections
ammp        99.11        16841      5898            7
applu       99.99        11712      3677           22
apsi        99.84         8969      3311           24
art         99.82        28008      7698            3
equake      99.15         6953      2806           11
fma3d       99.45        14852      6050           92/30*
gafort      99.94        19651      7613            6
galgel      95.57         4720      3992           31/32*
mgrid       99.98        22725      8050           12
swim        99.44        12920      7613            8
wupwise     99.83        19250      5788           10
Parallel loop in Wupwise:

C$OMP PARALLEL
C$OMP+ PRIVATE (AUX1, AUX2, AUX3),
C$OMP+ PRIVATE (I, IM, IP, J, JM, JP, K, KM, KP, L, LM, LP),
C$OMP+ SHARED (N1, N2, N3, N4, RESULT, U, X)
C$OMP DO
      DO 100 JKL = 0, N2 * N3 * N4 - 1
      DO 100 I=(MOD(J+K+L,2)+1),N1,2
        IP=MOD(I,N1)+1
        CALL GAMMUL(1,0,X(1,(IP+1)/2,J,K,L),AUX1)
        CALL SU3MUL(U(1,1,1,I,J,K,L),'N',AUX1,AUX3)
        CALL GAMMUL(2,0,X(1,(I+1)/2,JP,K,L),AUX1)
        CALL SU3MUL(U(1,1,2,I,J,K,L),'N',AUX1,AUX2)
        CALL ZAXPY(12,ONE,AUX2,1,AUX3,1)
        CALL GAMMUL(3,0,X(1,(I+1)/2,J,KP,L),AUX1)
        CALL SU3MUL(U(1,1,3,I,J,K,L),'N',AUX1,AUX2)
        CALL ZAXPY(12,ONE,AUX2,1,AUX3,1)
        CALL GAMMUL(4,0,X(1,(I+1)/2,J,K,LP),AUX1)
        CALL SU3MUL(U(1,1,4,I,J,K,L),'N',AUX1,AUX2)
        CALL ZAXPY(12,ONE,AUX2,1,AUX3,1)
        CALL ZCOPY(12,AUX3,1,RESULT(1,(I+1)/2,J,K,L),1)
  100 CONTINUE
C$OMP END DO
C$OMP END PARALLEL
Swim
Shallow Water model written in F77/F90
Swim is known to be highly parallel
Code contains several doubly-nested loops
The outer loops are parallelized
Example parallel loop:

!$OMP PARALLEL DO
      DO 100 J=1,N
      DO 100 I=1,M
        CU(I+1,J) = .5D0*(P(I+1,J)+P(I,J))*U(I+1,J)
        CV(I,J+1) = .5D0*(P(I,J+1)+P(I,J))*V(I,J+1)
        Z(I+1,J+1) = (FSDX*(V(I+1,J+1)-V(I,J+1))-FSDY*(U(I+1,J+1)
     1               -U(I+1,J)))/(P(I,J)+P(I+1,J)+P(I+1,J+1)+P(I,J+1))
        H(I,J) = P(I,J)+.25D0*(U(I+1,J)*U(I+1,J)+U(I,J)*U(I,J)
     2               +V(I,J+1)*V(I,J+1)+V(I,J)*V(I,J))
  100 CONTINUE
Mgrid
Multigrid electromagnetism in F77/F90
Major parallel regions in rprj3 and the basic multigrid iteration
Simple loop nest patterns, similar to Swim; several 3-nested loops
Parallelized through the Polaris automatic parallelizing source-to-source translator
Applu
Non-linear PDEs, time stepping with SSOR, in F77
Major parallel regions in ssor.f, the basic SSOR iteration
Major parallel loop in subroutine syshtN.f of Galgel:

!$OMP PARALLEL
!$OMP+ DEFAULT(NONE)
!$OMP+ PRIVATE (I, IL, J, JL, L, LM, M, LPOP, LPOP1),
!$OMP+ SHARED (DX, HtTim, K, N, NKX, NKY, NX, NY, Poj3, Poj4, XP, Y),
!$OMP+ SHARED (WXXX, WXXY, WXYX, WXYY, WYXX, WYXY, WYYX, WYYY),
!$OMP+ SHARED (WXTX, WYTX, WXTY, WYTY, A, Ind0)
      If (Ind0 .NE. 1) then
! Calculate r.h.s.
Parallel loop in shuffle.f of Gafort. Locks give each thread exclusive access to the array elements it swaps; ordered locking prevents deadlock.

!$OMP PARALLEL PRIVATE(rand, iother, itemp, temp, my_cpu_id)
      my_cpu_id = 1
!$    my_cpu_id = omp_get_thread_num() + 1
!$OMP DO
      DO j=1,npopsiz-1
         CALL ran3(1,rand,my_cpu_id,0)
         iother=j+1+DINT(DBLE(npopsiz-j)*rand)
!$       IF (j < iother) THEN
!$          CALL omp_set_lock(lck(j))
!$          CALL omp_set_lock(lck(iother))
!$       ELSE
!$          CALL omp_set_lock(lck(iother))
!$          CALL omp_set_lock(lck(j))
!$       END IF
         itemp(1:nchrome)=iparent(1:nchrome,iother)
         iparent(1:nchrome,iother)=iparent(1:nchrome,j)
         iparent(1:nchrome,j)=itemp(1:nchrome)
         temp=fitness(iother)
         fitness(iother)=fitness(j)
         fitness(j)=temp
!$       IF (j < iother) THEN
!$          CALL omp_unset_lock(lck(iother))
!$          CALL omp_unset_lock(lck(j))
!$       ELSE
!$          CALL omp_unset_lock(lck(j))
!$          CALL omp_unset_lock(lck(iother))
!$       END IF
      END DO
!$OMP END DO
!$OMP END PARALLEL
Fma3D
3D finite element mechanical simulator
Largest of the SPEC OMP codes: 60,000 lines
Uses OMP DO, REDUCTION, NOWAIT, CRITICAL
Key to good scaling was the critical section
Most parallelism from simple DOs
Of the 100 subroutines, only four have parallel sections; most of them are in fma1.f90
Conversion to OpenMP took substantial work
Parallel loop in platq.f90 of Fma3D
!$OMP PARALLEL DO &
!$OMP DEFAULT(PRIVATE), SHARED(PLATQ,MOTION,MATERIAL,STATE_VARIABLES), &
!$OMP SHARED(CONTROL,TIMSIM,NODE,SECTION_2D,TABULATED_FUNCTION,STRESS),&
!$OMP SHARED(NUMP4) REDUCTION(+:ERRORCOUNT), &
!$OMP REDUCTION(MIN:TIME_STEP_MIN), &
!$OMP REDUCTION(MAX:TIME_STEP_MAX)
DO N = 1,NUMP4
SUBROUTINE PLATQ_MASS ( NEL,SecID,MatID )
Key loop in Art:

#pragma omp for private (k,m,n, gPassFlag) schedule(dynamic)
for (ij = 0; ij < ijmx; ij++) {
   j = ((ij/inum) * gStride) + gStartY;
   i = ((ij%inum) * gStride) + gStartX;
   k=0;
   for (m=j;m<(gLheight+j);m++)
      for (n=i;n<(gLwidth+i);n++)
         f1_layer[o][k++].I[0] = cimage[m][n];
   if (gPassFlag==1) {
      if (set_high[o][0]==TRUE) {
         highx[o][0] = i;
         highy[o][0] = j;
         set_high[o][0] = FALSE;
      }
      if (set_high[o][1]==TRUE) {
         highx[o][1] = i;
         highy[o][1] = j;
         set_high[o][1] = FALSE;
      }
   }
}
Ammp
Molecular dynamics
Very large loop in rectmm.c
Good parallelism required a great deal of work
Uses OMP FOR, SCHEDULE(GUIDED), and about 20,000 locks
Guided scheduling is needed because of a loop with conditional execution
Parallel loop in rectmm.c of Ammp:

#pragma omp parallel for private (n27ng0, nng0, ing0, i27ng0, natoms, ii, a1, a1q, a1serial,
  inclose, ix, iy, iz, inode, nodelistt, r0, r, xt, yt, zt, xt2, yt2, zt2, xt3, yt3, zt3, xt4,
  yt4, zt4, c1, c2, c3, c4, c5, k, a1VP , a1dpx , a1dpy , a1dpz , a1px, a1py, a1pz, a1qxx ,
  a1qxy , a1qxz ,a1qyy , a1qyz , a1qzz, a1a, a1b, iii, i, a2, j, k1, k2 ,ka2, kb2, v0, v1, v2,
  v3, kk, atomwho, ia27ng0, iang0, o ) schedule(guided)
for( ii=0; ii< jj; ii++)
  ...
  for( inode = 0; inode < iii; inode ++)
    if( (*nodelistt)[inode].innode > 0) {
      for(j=0; j< 27; j++)
        ...
        if( j == 27 )
          if( atomwho->serial > a1serial)
            for( kk=0; kk< a1->dontuse; kk++)
              if( atomwho == a1->excluded[kk])
        ...
      for( j=1; j< (*nodelistt)[inode].innode -1 ; j++)
        ...
        if( atomwho->serial > a1serial)
          for( kk=0; kk< a1->dontuse; kk++)
            if( atomwho == a1->excluded[kk]) goto SKIP2;
        ...
      for (i27ng0=0 ; i27ng0<n27ng0; i27ng0++)
        ...
        ...
      for( i=0; i< nng0; i++)
        ...
        if( v3 > mxcut || inclose > NCLOSE )
          ...
      ...
/* malloc w1[numthreads][ARCHnodes][3] */
Per-code summary (the first number for each code is its count of parallel sections, matching the table above):
   ammp       7     20k    2
   applu     22     14
   apsi      24
   art        3      1
   equake    11
   fma3d     92/30   1     2
   gafort     6     40k
   galgel    31/32*  7     3
   mgrid     12     11
   swim       8
   wupwise   10      1
Feature used to deal with NUMA machines: rely on first-touch page placement. If necessary, put
initialization into a parallel loop to avoid placing all data on the master processor.
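A hedged C sketch of that first-touch approach (the array name and size are illustrative): the data is initialized in a parallel loop with the same static schedule as the compute loop, so each page is first touched, and therefore placed, near the thread that will use it.

#include <omp.h>
#include <stdlib.h>
#define N (1 << 24)

int main(void)
{
    double *a = malloc(N * sizeof *a);

    /* First touch: each thread initializes the part of a it will later use,
       so pages end up on that thread's NUMA node instead of all on the
       master processor's node. */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < N; i++)
        a[i] = 0.0;

    /* The compute loop uses the same static schedule and therefore touches
       the same pages from the same threads. */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < N; i++)
        a[i] = a[i] + 1.0;

    free(a);
    return 0;
}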