Unit3 RMD PDF
OpenMP is an API for shared-memory parallel programming. The “MP” in OpenMP stands
for “multiprocessing,” a term that is synonymous with shared-memory parallel computing.
OpenMP is designed for systems in which each thread or process can potentially have access
to all available memory. When we’re programming with OpenMP, we view our system as a
collection of cores or CPUs, all of which have access to main memory.
The OpenMP runtime library maintains a pool of helper threads that can be used to work on
parallel regions. When a thread encounters a parallel construct and requests a team of more than
one thread, the thread will check the pool and grab idle threads from the pool, making them part of
the team. The encountering thread might get fewer helper threads than it requests if the pool does
not contain a sufficient number of idle threads. When the team finishes executing the parallel region,
the helper threads are returned to the pool.
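As an illustrative sketch (not from the original notes; the requested team size of 8 is an arbitrary choice), a program can request a team and then check how many threads the runtime actually granted with omp_get_num_threads():

#include <stdio.h>
#include <omp.h>

int main(void) {
    /* Request a team of 8 threads; the runtime may grant fewer
       if the pool does not contain enough idle threads. */
    #pragma omp parallel num_threads(8)
    {
        #pragma omp single
        printf("requested 8 threads, team actually has %d\n",
               omp_get_num_threads());
    }
    return 0;
}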
The flush operation and the temporary view allow OpenMP implementations to optimize
reads and writes of shared variables. For example, consider the program fragment in Figure 1. The
write to variable A may complete as soon as point 1 in the figure. However, the OpenMP
implementation is allowed to execute the computation denoted as “…” in the figure, before the write
to A completes. The write need not complete until point 2, when it must be firmly lodged in memory
and available to all other threads. If an OpenMP implementation uses a temporary view, then a read of
A during the “…” computation in Figure 1 can be satisfied from the temporary view, instead of going
all the way to memory for the value. So, flush and the temporary view together allow an
implementation to hide both write and read latency.
A flush of all visible variables is implied 1) in a barrier region, 2) at entry and exit from
parallel, critical and ordered regions, 3) at entry and exit from combined parallel work-sharing regions,
and 4) during lock API routines.
A flush with a list is implied at entry to and exit from atomic regions, where the list contains
the object being updated. The C and C++ languages include the volatile qualifier, which provides a
consistency mechanism for C and C++ that is related to the OpenMP consistency mechanism.
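As a hedged illustration of how flush is typically used (this sketch is not from the original notes; the variable names and the value 42 are arbitrary), one thread produces a value and sets a flag, while another spins on the flag; the flush directives keep each thread's temporary view consistent with memory:

#include <stdio.h>
#include <omp.h>

int main(void) {
    int data = 0, flag = 0;
    #pragma omp parallel num_threads(2)
    {
        if (omp_get_thread_num() == 0) {
            data = 42;                  /* produce the value           */
            #pragma omp flush(data)     /* make data visible first     */
            flag = 1;                   /* then signal the consumer    */
            #pragma omp flush(flag)
        } else {
            int ready = 0;
            while (!ready) {            /* spin until the flag is seen */
                #pragma omp flush(flag)
                ready = flag;
            }
            #pragma omp flush(data)     /* refresh the temporary view  */
            printf("consumer read %d\n", data);
        }
    }
    return 0;
}

In practice an atomic construct is usually preferred for this kind of signalling, but the sketch shows how flush and the temporary view interact as described above.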
OpenMP Pros and Cons
Pros:
Simple.
Incremental parallelism.
Decomposition is handled automatically.
Unified code for both serial and parallel applications.
Cons:
Runs only on shared-memory multiprocessors.
Scalability is limited by the memory architecture.
Reliable error handling is missing.
OpenMP Directives
OpenMP provides what’s known as a “directives-based” shared-memory API. In C and C++, these directives take the form of pragmas.
Pragmas
1. Pragmas that let you define parallel regions in which work is done by threads in parallel.
Most of the OpenMP directives either statically or dynamically bind to an enclosing parallel
region.
#pragma omp parallel - Defines a parallel region, which is code that will be executed by multiple threads in parallel.
2. Pragmas that let you define how work is distributed or shared across the threads in a
parallel region
#pragma omp section - Identifies code sections to be divided among all threads.
#pragma omp for - Causes the work done in a for loop inside a parallel region to be divided
among threads.
#pragma omp single - Lets you specify that a section of code should be executed on a single thread.
#pragma omp task - Defines an explicit task, which may be executed by the encountering thread or deferred and run later by any thread in the team.
3. Pragmas that let you control synchronization among threads
#pragma omp atomic - Specifies a memory location that will be updated atomically.
#pragma omp master - Specifies that only the master thread should execute a section of the
program.
#pragma omp barrier - Synchronizes all threads in a team; all threads pause at the barrier,
until all threads execute the barrier.
#pragma omp critical - Specifies that code is only executed on one thread at a time.
#pragma omp flush - Specifies that all threads have the same view of memory for all shared
objects.
#pragma omp ordered - Specifies that code under a parallelized for loop should be
executed like a sequential loop.
4. Pragmas that let you define the scope of data visibility across threads
#pragma omp threadprivate - Specifies that a global or static variable is replicated so that each thread gets its own private copy.
parallel
Defines a parallel region, which is code that will be executed by multiple threads in parallel.
Syntax
#pragma omp parallel [clauses]
{
code_block
}
Remarks
The parallel directive supports the following OpenMP clauses:
copyin
default
firstprivate
if
num_threads
private
reduction
shared
parallel can also be used with the sections and for directives.
Example
The following sample shows how to set the number of threads and define a parallel region. By
default, the number of threads is equal to the number of logical processors on the machine. For
example, if you have a machine with one physical processor that has hyperthreading enabled, it will
have two logical processors and, therefore, two threads.
// hello.c
#include <stdio.h>
#include <omp.h>

int main() {
    #pragma omp parallel num_threads(4)
    {
        int i = omp_get_thread_num();
        int j = omp_get_num_threads();
        printf("Hello from thread %d of %d\n", i, j);
    }
    return 0;
}
Output (the order of the lines may vary from run to run):
Hello from thread 0 of 4
Hello from thread 1 of 4
Hello from thread 2 of 4
Hello from thread 3 of 4
To compile (gcc needs the -fopenmp flag):
$ gcc -g -Wall -fopenmp -o omp_hello hello.c
To run the program, we might type
$ ./omp_hello
Here the team size is fixed at four by the num_threads(4) clause; a variant that reads the thread count from the command line would instead be run with an argument, for example
$ ./omp_hello 4
for
Specifies that the iterations of associated loops will be executed in parallel by threads in the
team in the context of their implicit tasks.
#pragma omp [parallel] for [clauses]
for_loop
clause:
private(list), firstprivate(list), lastprivate(list), reduction(reduction-identifier : list),
schedule( [modifier [, modifier] : ] kind[, chunk_size]), collapse(n), ordered[ (n) ], nowait
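A minimal sketch of the for directive inside a parallel region (this example is illustrative; the reduction and schedule clauses shown are just example choices):

#include <stdio.h>
#include <omp.h>

int main(void) {
    const int n = 1000;
    long sum = 0;
    #pragma omp parallel
    {
        /* the iterations of the loop are divided among the team */
        #pragma omp for schedule(static) reduction(+:sum)
        for (int i = 0; i < n; i++)
            sum += i;
    }
    printf("sum = %ld\n", sum);   /* prints 499500 */
    return 0;
}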
sections
A noniterative work-sharing construct that contains a set of structured blocks that are to be
distributed among and executed by the threads in a team.
#pragma omp [parallel] sections [clauses]
{
#pragma omp section
{
code_block
}
}
clause:
private(list), firstprivate(list), lastprivate(list), reduction(reduction-identifier: list), nowait
single
Specifies that the associated structured block is executed by only one of the threads in the
team.
#pragma omp single [clauses]
structured-block
clause:
private(list), firstprivate(list), copyprivate(list), nowait
master
Specifies a structured block that is executed by the master thread of the team.
#pragma omp master
structured-block
critical
Restricts execution of the associated structured block to a single thread at a time.
#pragma omp critical [(name)]
structured-block
flush
Executes the OpenMP flush operation, which makes a thread’s temporary view of memory
consistent with memory, and enforces an order on the memory operations of the variables.
#pragma omp flush [(list)]
barrier
Specifies an explicit barrier at the point at which the construct appears.
#pragma omp barrier
atomic
Ensures that a specific storage location is accessed atomically.
#pragma omp atomic
expression
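To contrast atomic and critical, here is a small sketch (the variable names are only illustrative): atomic protects a single memory update, while critical protects an arbitrary block of code:

#include <stdio.h>
#include <omp.h>

int main(void) {
    int hits = 0;
    double total = 0.0;
    #pragma omp parallel num_threads(4)
    {
        /* atomic: one memory update performed atomically */
        #pragma omp atomic
        hits++;

        /* critical: a block executed by one thread at a time */
        #pragma omp critical
        {
            total += 0.5 * omp_get_thread_num();
        }
    }
    printf("hits = %d, total = %.1f\n", hits, total);
    return 0;
}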
parallel for
Shortcut for specifying a parallel construct containing one or more associated loops and no
other statements.
#pragma omp parallel for [clauses]
for-loop
clause: Any accepted by the parallel or for directives, except the nowait clause, with identical
meanings and restrictions.
parallel sections
Shortcut for specifying a parallel construct containing one sections construct and no other
statements.
#pragma omp parallel sections [clauses]
{
[#pragma omp section]
structured-block
[#pragma omp section
structured-block]
...
}
clause: Any accepted by the parallel or sections directives, except the nowait clause, with
identical meanings and restrictions.
Work-sharing Constructs
A work-sharing construct distributes the execution of the associated statement among the
members of the team that encounter it. The work-sharing directives do not launch new threads, and
there is no implied barrier on entry to a work-sharing construct.
The sequence of work-sharing constructs and barrier directives encountered must be the same for
every thread in a team.
OpenMP defines the following work-sharing constructs, and these are described in the sections that
follow: the for, sections, and single directives.
The single directive identifies a construct that specifies that the associated structured
block is executed by only one thread in the team (not necessarily the master thread).
1. Parallel for loop
Example: see the sketch below.
Note: the loop must be in OpenMP's canonical form, i.e., the number of iterations can be computed before the loop starts and the loop variable is not modified inside the body.
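A hedged sketch of a parallel for loop (vector addition; the array size and contents are arbitrary choices for illustration):

#include <stdio.h>
#include <omp.h>

#define N 8

int main(void) {
    int a[N], b[N], c[N];
    for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2 * i; }

    /* the combined construct creates a team and divides the iterations */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];

    for (int i = 0; i < N; i++)
        printf("c[%d] = %d\n", i, c[i]);
    return 0;
}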
2. Parallel sections
Example: see the sketch below.
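A hedged sketch of parallel sections (two independent blocks that are handed to different threads when enough threads are available):

#include <stdio.h>
#include <omp.h>

int main(void) {
    #pragma omp parallel sections
    {
        #pragma omp section
        printf("section A run by thread %d\n", omp_get_thread_num());

        #pragma omp section
        printf("section B run by thread %d\n", omp_get_thread_num());
    }
    return 0;
}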
3. Parallel single:
Syntax
#pragma omp single [clause[, clause] ...]
structured-block
Arguments
clause - Can be one or more of the following clauses:
copyprivate(list) - Provides a mechanism to use a private variable in list to broadcast a value from the data environment of one implicit task to the data environments of the other implicit tasks belonging to the parallel region.
firstprivate(list) - Provides a superset of the functionality provided by the private clause. Each private data object is initialized with the value of the original object.
nowait - Indicates that an implementation may omit the barrier at the end of the worksharing region.
private(list) - Declares variables to be private to each thread in a team.
Description
Only one thread will be allowed to execute the structured block. The rest of the threads in the team wait at
the implicit barrier at the end of the single construct, unless nowait is specified. If nowait is specified, then
the rest of the threads in the team immediately execute the code after the structured block.
The following example demonstrates how to use this pragma to make sure that the printf function is
executed only once. All the threads in the team that do not execute the function proceed immediately to the
following calculations.
Example
#include <stdio.h>
#include <omp.h>
int main() {
    #pragma omp parallel
    {
        #pragma omp single nowait
        printf("Starting calculation\n");
        // Do some calculation
    }
}
OpenMP provides several run-time library routines to help you manage your program
in parallel mode. Many of these run-time library routines have corresponding environment
variables that can be set as defaults. The run-time library routines let you dynamically change
these factors to assist in controlling your program. In all cases, a call to a run-time library
routine overrides any corresponding environment variable.
Function - Description
omp_set_num_threads(nthreads) - Sets the number of threads to use for subsequent parallel regions.
omp_get_num_threads() - Returns the number of threads that are being used in the current parallel region.
omp_get_max_threads() - Returns the maximum number of threads that are available for parallel execution.
omp_get_thread_num() - Returns the unique thread number of the thread currently executing this section of code.
omp_get_num_procs() - Returns the number of processors available to the program.
omp_in_parallel() - Returns TRUE if called within the dynamic extent of a parallel region executing in parallel; otherwise returns FALSE.
omp_set_dynamic(dynamic_threads) - Enables or disables dynamic adjustment of the number of threads used to execute a parallel region. If dynamic_threads is TRUE, dynamic threads are enabled; if FALSE, they are disabled. Dynamic threads are disabled by default.
omp_get_dynamic() - Returns TRUE if dynamic thread adjustment is enabled, otherwise returns FALSE.
omp_set_nested(nested) - Enables or disables nested parallelism. If nested is TRUE, nested parallelism is enabled; if FALSE, it is disabled. Nested parallelism is disabled by default.
omp_get_nested() - Returns TRUE if nested parallelism is enabled, otherwise returns FALSE.
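As an illustrative sketch combining a few of these routines (the thread count 3 is an arbitrary choice):

#include <stdio.h>
#include <omp.h>

int main(void) {
    printf("procs = %d, max threads = %d, in parallel? %d\n",
           omp_get_num_procs(), omp_get_max_threads(), omp_in_parallel());

    omp_set_num_threads(3);     /* overrides the OMP_NUM_THREADS default */
    #pragma omp parallel
    {
        #pragma omp single
        printf("team of %d threads, in parallel? %d\n",
               omp_get_num_threads(), omp_in_parallel());
    }
    return 0;
}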
Lock Routines
Function - Description
omp_init_lock(lock) - Initializes the lock associated with lock for use in subsequent calls.
omp_destroy_lock(lock) - Causes the lock associated with lock to become undefined.
omp_set_lock(lock) - Forces the executing thread to wait until the lock associated with lock is available. The thread is granted ownership of the lock when it becomes available.
omp_unset_lock(lock) - Releases the executing thread from ownership of the lock associated with lock. The behavior is undefined if the executing thread does not own the lock associated with lock.
omp_test_lock(lock) - Attempts to set the lock associated with lock. If successful, returns TRUE, otherwise returns FALSE.
omp_init_nest_lock(lock) - Initializes the nested lock associated with lock for use in subsequent calls.
omp_destroy_nest_lock(lock) - Causes the nested lock associated with lock to become undefined.
omp_set_nest_lock(lock) - Forces the executing thread to wait until the nested lock associated with lock is available. The thread is granted ownership of the nested lock when it becomes available.
omp_unset_nest_lock(lock) - Releases the executing thread from ownership of the nested lock associated with lock if the nesting count is zero. Behavior is undefined if the executing thread does not own the nested lock associated with lock.
omp_test_nest_lock(lock) - Attempts to set the nested lock associated with lock. If successful, returns the nesting count, otherwise returns zero.
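A minimal sketch of the simple (non-nested) lock routines protecting a shared counter (the counter is only an example; this particular update could equally use atomic):

#include <stdio.h>
#include <omp.h>

int main(void) {
    omp_lock_t lock;
    int counter = 0;

    omp_init_lock(&lock);           /* initialize before first use  */
    #pragma omp parallel num_threads(4)
    {
        omp_set_lock(&lock);        /* wait until the lock is free  */
        counter++;                  /* protected update             */
        omp_unset_lock(&lock);      /* release ownership            */
    }
    omp_destroy_lock(&lock);        /* lock becomes undefined       */

    printf("counter = %d\n", counter);
    return 0;
}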
Timing Routines
Function - Description
omp_get_wtime() - Returns a double-precision value equal to the elapsed wall-clock time (in seconds) relative to an arbitrary reference time. The reference time does not change during program execution.
omp_get_wtick() - Returns a double-precision value equal to the number of seconds between successive clock ticks.
Any example that uses parallel for constructs can be used to illustrate this topic.
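For instance, a hedged sketch (not from the original notes; the array size is arbitrary) that times a parallel for loop with omp_get_wtime():

#include <stdio.h>
#include <omp.h>

#define N 1000000

int main(void) {
    static double a[N];
    double start = omp_get_wtime();        /* wall-clock start */

    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        a[i] = 2.0 * i;

    double elapsed = omp_get_wtime() - start;
    printf("loop took %f seconds (tick = %e s)\n", elapsed, omp_get_wtick());
    return 0;
}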
Function parallelism
The simplest way to create parallelism in OpenMP is to use the parallel pragma. A block
preceded by the omp parallel pragma is called a parallel region; it is executed by a newly
created team of threads. This is an instance of the SPMD model: all threads execute the same segment of code.
For instance, to compute
result = f(x) + g(x) + h(x)
you could parallelize this as:
double result, fresult, gresult, hresult;
#pragma omp parallel
{
    int num = omp_get_thread_num();
    if      (num == 0) fresult = f(x);
    else if (num == 1) gresult = g(x);
    else if (num == 2) hresult = h(x);
}
result = fresult + gresult + hresult;
Nested parallelism
What happens if you call a function from inside a parallel region, and that function itself
contains a parallel region?
int main() {
...
#pragma omp parallel
{
...
func(...);
...
}
} // end of main
void func(...) {
#pragma omp parallel
{
...
}
}
By default, the nested parallel region will have only one thread. To allow nested thread
creation, set
OMP_NESTED=true
or
omp_set_nested(1)
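A short sketch (the thread counts are arbitrary) of what enabling nesting does; with nesting enabled, each outer thread creates its own inner team:

#include <stdio.h>
#include <omp.h>

int main(void) {
    omp_set_nested(1);                        /* enable nested parallelism  */
    #pragma omp parallel num_threads(2)
    {
        int outer = omp_get_thread_num();
        #pragma omp parallel num_threads(2)   /* inner region: its own team */
        {
            printf("outer thread %d, inner thread %d\n",
                   outer, omp_get_thread_num());
        }
    }
    return 0;
}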
The omp for or omp do directive needs to be inside a parallel region. It is also
possible to have a combined omp parallel for or omp parallel do directive.
If your parallel region only contains a loop, you can combine the pragmas for the parallel
region and distribution of the loop iterations:
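A hedged sketch contrasting the two spellings (the loop bodies are arbitrary); both forms behave the same when the region contains only the loop:

#include <stdio.h>
#include <omp.h>

#define N 100

int main(void) {
    double x[N];

    /* separate pragmas: a parallel region plus a work-sharing for */
    #pragma omp parallel
    {
        #pragma omp for
        for (int i = 0; i < N; i++)
            x[i] = 1.0 * i;
    }

    /* combined form: equivalent when the region contains only the loop */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        x[i] = 2.0 * i;

    printf("x[N-1] = %f\n", x[N - 1]);
    return 0;
}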
Loop schedules
Usually you will have many more iterations in a loop than there are threads. Thus, there are
several ways you can assign your loop iterations to the threads.
#pragma omp for schedule(....)
The first distinction we now have to make is between static and dynamic schedules. With
static schedules, the iterations are assigned purely based on the number of iterations and the
number of threads (and the chunk parameter; see later). In dynamic schedules, on the other
hand, iterations are assigned to threads that are unoccupied. Dynamic schedules are a good
idea if iterations take an unpredictable amount of time, so that load balancing is needed.
Figure 2 illustrates this: assume that each core gets assigned two (blocks of) iterations and
these blocks take gradually less and less time. You see from the left picture that thread 1 gets
two fairly long blocks, whereas thread 4 gets two short blocks, thus finishing much earlier.
On the other hand, in the right figure thread 4 gets block 5, since it finishes the first set of
blocks early. The effect is a perfect load balancing.
The default static schedule is to assign one consecutive block of iterations to each thread. If
you want different sized blocks you can define a chunk size, for instance schedule(static,2).
Example: with schedule(static,2) the iterations are handed out to the threads in chunks of two,
round-robin, so with four threads iterations 0–1 go to thread 0, 2–3 to thread 1, 4–5 to thread 2,
6–7 to thread 3, 8–9 to thread 0 again, and so on.
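A small sketch (two threads and chunk size 2 are arbitrary choices) that prints which thread runs which iteration under a static and a dynamic schedule:

#include <stdio.h>
#include <omp.h>

int main(void) {
    /* static,2: chunks of 2 iterations handed out round-robin */
    #pragma omp parallel for schedule(static, 2) num_threads(2)
    for (int i = 0; i < 8; i++)
        printf("static  iter %d -> thread %d\n", i, omp_get_thread_num());

    /* dynamic,2: a thread grabs the next chunk of 2 when it becomes idle */
    #pragma omp parallel for schedule(dynamic, 2) num_threads(2)
    for (int i = 0; i < 8; i++)
        printf("dynamic iter %d -> thread %d\n", i, omp_get_thread_num());

    return 0;
}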
Ordered iterations
The omp ordered directive must be used as follows (a sketch follows this list):
It must appear within the extent of a omp for or omp parallel for construct
containing an ordered clause.
It applies to the statement block immediately following it. Statements in that block are
executed in the same order in which iterations are executed in a sequential loop.
An iteration of a loop must not execute the same omp ordered directive more than
once.
An iteration of a loop must not execute more than one distinct omp ordered directive.
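A minimal sketch of these rules (the computation itself is arbitrary): the squares may be computed out of order, but they are printed in loop order:

#include <stdio.h>
#include <omp.h>

int main(void) {
    /* the ordered clause on the loop enables the ordered directive inside */
    #pragma omp parallel for ordered
    for (int i = 0; i < 8; i++) {
        int sq = i * i;                        /* may run out of order  */
        #pragma omp ordered
        printf("i = %d, i*i = %d\n", i, sq);   /* printed in loop order */
    }
    return 0;
}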
NoWait clause
The implicit barrier at the end of a work sharing construct can be cancelled with a nowait
clause. This has the effect that threads that are finished can continue with the next code in the
parallel region:
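A hedged sketch (the two loops here are independent, which is what makes skipping the barrier safe):

#include <stdio.h>
#include <omp.h>

#define N 100

int main(void) {
    double a[N], b[N];
    #pragma omp parallel
    {
        /* no barrier after this loop: a thread moves on as soon as
           its own iterations are done */
        #pragma omp for nowait
        for (int i = 0; i < N; i++)
            a[i] = i;

        /* this loop does not read a[], so the barrier above is not needed */
        #pragma omp for
        for (int i = 0; i < N; i++)
            b[i] = 2 * i;
    }
    printf("a[N-1] = %f, b[N-1] = %f\n", a[N - 1], b[N - 1]);
    return 0;
}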
While loops
OpenMP can only handle 'for' loops: while loops cannot be parallelized directly, so you have to
find a way around that. While loops are, for instance, used to search through data:
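One common workaround, sketched here (assuming OpenMP 3.1 or later for atomic write; the data and target values are arbitrary), is to rewrite the search as a bounded for loop and record the hit atomically:

#include <stdio.h>
#include <omp.h>

#define N 1000

int main(void) {
    int data[N], target = 777, found_at = -1;
    for (int i = 0; i < N; i++) data[i] = i;

    /* a bounded for loop can be split among threads; every iteration is
       examined, and the (unique) hit is recorded atomically */
    #pragma omp parallel for
    for (int i = 0; i < N; i++) {
        if (data[i] == target) {
            #pragma omp atomic write
            found_at = i;
        }
    }
    printf("found target at index %d\n", found_at);
    return 0;
}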
PERFORMANCE CONSIDERATIONS