
CS6801 MULTI-CORE ARCHITECTURES AND PROGRAMMING

UNIT III SHARED MEMORY PROGRAMMING WITH OpenMP

1. OpenMP Execution and Memory Model


2. OpenMP Directives
3. Work-sharing Constructs
4. Library functions
5. Handling Data
6. Functional Parallelism
7. Handling Loops
8. Performance Considerations.

1. OpenMP Execution and Memory Model


What is OpenMP?

OpenMP is an API for shared-memory parallel programming. The “MP” in OpenMP stands
for “multiprocessing,” a term that is synonymous with shared-memory parallel computing.

OpenMP is designed for systems in which each thread or process can potentially have access
to all available memory. When we’re programming with OpenMP, we view our system as a
collection of cores or CPUs, all of which have access to main memory.

OpenMP Execution Model


OpenMP uses the fork-join model of parallel execution. When a thread encounters a parallel
construct, the thread creates a team composed of itself and some additional (possibly zero) helper
threads. The encountering thread becomes the master of the new team. All team members execute
the code in the parallel region. When a thread finishes its work within the parallel region, it waits at
an implicit barrier at the end of the parallel region. When all team members have arrived at the
barrier, the threads can leave the barrier. The master thread continues execution of user code in the
program beyond the end of the parallel construct, while the helper threads wait to be summoned to
join other teams.
OpenMP parallel regions can be nested inside each other. If nested parallelism is disabled,
then the team executing a nested parallel region consists of one thread only (the thread that
encountered the nested parallel construct). If nested parallelism is enabled, then the new team may
consist of more than one thread.

The OpenMP runtime library maintains a pool of helper threads that can be used to work on
parallel regions. When a thread encounters a parallel construct and requests a team of more than
one thread, the thread will check the pool and grab idle threads from the pool, making them part of
the team. The encountering thread might get fewer helper threads than it requests if the pool does
not contain a sufficient number of idle threads. When the team finishes executing the parallel region,
the helper threads are returned to the pool.
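
A minimal sketch of this fork-join pattern (the messages are illustrative only): the serial parts run on the master thread, the parallel construct forks a team, and execution joins at the implicit barrier before the master continues alone.

#include <stdio.h>
#include <omp.h>

int main(void) {
    printf("before: executed by the master thread only\n");

    #pragma omp parallel              /* fork: a team of threads is created */
    {
        printf("inside: executed by thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
    }                                 /* join: implicit barrier; helper threads return to the pool */

    printf("after: executed by the master thread only\n");
    return 0;
}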

OpenMP memory model


OpenMP supports a relaxed-consistency shared memory model.
– Threads can maintain a temporary view of shared memory which is not consistent with that of other
threads.
– These temporary views are made consistent only at certain points in the program.
– The operation which enforces consistency is called the flush operation
Explanation:
OpenMP assumes that there is a place for storing and retrieving data that is available to all
threads, called the memory. Each thread may have a temporary view of memory that it can use instead
of memory to store data temporarily when it need not be seen by other threads. Data can move
between memory and a thread's temporary view, but can never move between temporary views
directly, without going through memory.
Each variable used within a parallel region is either shared or private. The variable names
used within a parallel construct relate to the program variables visible at the point of the parallel
directive, referred to as their "original variables". Each shared variable reference inside the construct
refers to the original variable of the same name. For each private variable, a reference to the variable
name inside the construct refers to a variable of the same type and size as the original variable, but
private to the thread. That is, it is not accessible by other threads.
There are two aspects of memory system behavior relating to shared memory parallel
programs: coherence and consistency. Coherence refers to the behavior of the memory system when
a single memory location is accessed by multiple threads. Consistency refers to the ordering of
accesses to different memory locations, observable from various threads in the system.
OpenMP doesn't specify any coherence behavior of the memory system. That is left to the
underlying base language and computer system. OpenMP does not guarantee anything about the result
of memory operations that constitute data races within a program. A data race in this context is
defined to be accesses to a single variable by at least two threads, at least one of which is a write, not
separated by a synchronization operation. OpenMP does guarantee certain consistency behavior,
however. That behavior is based on the OpenMP flush operation.
The OpenMP flush operation is applied to a set of variables called the flush set. Memory
operations for variables in the flush set that precede the flush in program execution order must be
firmly lodged in memory and available to all threads before the flush completes, and memory
operations for variables in the flush set that follow the flush in program order cannot start until the
flush completes. A flush also causes any values of the flush set variables that were captured in the
temporary view to be discarded, so that later reads of those variables will come directly from
memory.
A flush without a list of variable names flushes all variables visible at that point in the
program. A flush with a list flushes only the variables in the list. The OpenMP flush operation is the
only way in an OpenMP program to guarantee that a value will move between two threads. In order
to move a value from one thread to a second thread, OpenMP requires these four actions in exactly the
following order:
1. the first thread writes the value to the shared variable,
2. the first thread flushes the variable,
3. the second thread flushes the variable, and
4. the second thread reads the variable.
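
A hedged sketch of these four steps, with the first thread publishing a value and the second thread spinning on a flag (the variable names and the busy-wait are illustrative; a real program would normally use locks or atomic constructs rather than a hand-rolled spin loop):

#include <stdio.h>
#include <omp.h>

int main(void) {
    int data = 0, flag = 0;                 /* both variables are shared */

    #pragma omp parallel num_threads(2) shared(data, flag)
    {
        if (omp_get_thread_num() == 0) {
            data = 42;                      /* 1. first thread writes the value  */
            #pragma omp flush(data)         /* 2. first thread flushes the value */
            flag = 1;                       /*    then publishes a ready flag    */
            #pragma omp flush(flag)
        } else {
            int ready = 0;
            while (!ready) {                /* spin until the flag becomes visible */
                #pragma omp flush(flag)
                ready = flag;
            }
            #pragma omp flush(data)         /* 3. second thread flushes          */
            printf("received %d\n", data);  /* 4. second thread reads the value  */
        }
    }
    return 0;
}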

The flush operation and the temporary view allow OpenMP implementations to optimize
reads and writes of shared variables. For example, consider a program fragment in which a thread
writes a shared variable A (point 1), performs some unrelated computation denoted “…”, and then
reaches a flush of A (point 2). The write to variable A may complete as soon as point 1. However,
the OpenMP implementation is allowed to execute the computation denoted as “…” before the write
to A completes. The write need not complete until point 2, when it must be firmly lodged in memory
and available to all other threads. If an OpenMP implementation uses a temporary view, then a read of
A during the “…” computation can be satisfied from the temporary view, instead of going
all the way to memory for the value. So, flush and the temporary view together allow an
implementation to hide both write and read latency.
A flush of all visible variables is implied 1) in a barrier region, 2) at entry and exit from
parallel, critical and ordered regions, 3) at entry and exit from combined parallel work-sharing regions,
and 4) during lock API routines.
A flush with a list is implied at entry to and exit from atomic regions, where the list contains
the object being updated. The C and C++ languages include the volatile qualifier, which provides a
consistency mechanism for C and C++ that is related to the OpenMP consistency mechanism.
OpenMP Pros and Cons
Pros:
Simple.
Incremental parallelism.
Decomposition is handled automatically.
Unified code for both serial and parallel applications.
Cons:
Runs only on shared-memory multiprocessors.
Scalability is limited by the memory architecture.
Reliable error handling is missing.

OpenMP Directives
OpenMP provides what’s known as a “directives-based” shared-memory API
Pragmas

 Special preprocessor instructions (#pragma).

 Typically added to a system to allow behaviors that aren’t part of the basic C
specification.
 Compilers that don’t support the pragmas ignore them.
OpenMP directives exploit shared memory parallelism by defining various types of parallel regions.
Parallel regions can include both iterative and non-iterative segments of program code.

Pragmas fall into these general categories:

1. Pragmas that let you define parallel regions in which work is done by threads in parallel.
Most of the OpenMP directives either statically or dynamically bind to an enclosing parallel
region.

#pragma omp parallel - Defines a parallel region, which is code that will be executed by
multiple threads in parallel.

2. Pragmas that let you define how work is distributed or shared across the threads in a
parallel region
#pragma omp section - Identifies code sections to be divided among all threads.
#pragma omp for - Causes the work done in a for loop inside a parallel region to be divided
among threads.
#pragma omp single - lets you specify that a section of code should be executed on a single
thread
#pragma omp task - Defines an explicit task that may be executed by any thread in the team,
possibly at a later time.
3. Pragmas that let you control synchronization among threads
#pragma omp atomic - Specifies that a memory location will be updated atomically.
#pragma omp master - Specifies that only the master thread should execute a section of the
program.

#pragma omp barrier - Synchronizes all threads in a team; all threads pause at the barrier,
until all threads execute the barrier.

#pragma omp critical - Specifies that code is only executed on one thread at a time.
#pragma omp flush - Specifies that all threads have the same view of memory for all shared
objects.
#pragma omp ordered - Specifies that code under a parallelized for loop should be
executed like a sequential loop.
4. Pragmas that let you define the scope of data visibility across threads

#pragma omp threadprivate - Specifies that a variable is private to a thread.

5. Pragmas for task synchronization


#pragma omp taskwait
#pragma omp barrier

OpenMP directive syntax

parallel

Defines a parallel region, which is code that will be executed by multiple threads in parallel.
Syntax
#pragma omp parallel [clauses]
{
code_block
}

Remarks
The parallel directive supports the following OpenMP clauses:
 copyin
 default
 firstprivate
 if
 num_threads
 private
 reduction
 shared
parallel can also be used with the sections and for directives.
Example
The following sample shows how to set the number of threads and define a parallel region. By
default, the number of threads is equal to the number of logical processors on the machine. For
example, if you have a machine with one physical processor that has hyperthreading enabled, it will
have two logical processors and, therefore, two threads.
//hello.c
#include <stdio.h>
#include <omp.h>

int main() {
    #pragma omp parallel num_threads(4)
    {
        int i = omp_get_thread_num();
        int j = omp_get_num_threads();
        printf("Hello from thread %d of %d\n", i, j);
    }
}

Output (the order of the lines may vary from run to run)
Hello from thread 0 of 4
Hello from thread 1 of 4
Hello from thread 2 of 4
Hello from thread 3 of 4

Compiling and running the parallel program

To Compile:

$ gcc -g -Wall -fopenmp -o omp_hello omp_hello.c

To run the program, we specify the number of threads on the command line. For example, we might
run the program with four threads and type

$ ./omp_hello 4
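
The hello.c listing above fixes the thread count with num_threads(4); the run command assumes a variant that takes the number of threads from the command line. A sketch of such a variant (the argument handling shown here is an assumption, following the common textbook pattern):

/* omp_hello.c : thread count taken from the command line */
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(int argc, char *argv[]) {
    int thread_count = (argc > 1) ? strtol(argv[1], NULL, 10) : 1;

    #pragma omp parallel num_threads(thread_count)
    {
        printf("Hello from thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
    }
    return 0;
}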

Syntax for other directives are

 for
Specifies that the iterations of associated loops will be executed in parallel by threads in the
team in the context of their implicit tasks.
#pragma omp [parallel] for [clauses]
for_loop
clause:
private(list), firstprivate(list), lastprivate(list), reduction(reduction-identifier : list),
schedule( [modifier [, modifier] : ] kind[, chunk_size]), collapse(n), ordered[ (n) ], nowait

 sections
A noniterative work-sharing construct that contains a set of structured blocks that are to be
distributed among and executed by the threads in a team.
#pragma omp [parallel] sections [clauses]
{
#pragma omp section
{
code_block
}
}
clause:
private(list), firstprivate(list), lastprivate(list), reduction(reduction-identifier: list), nowait

 single
Specifies that the associated structured block is executed by only one of the threads in the
team.
#pragma omp single [clauses]
structured-block
clause:
private(list), firstprivate(list), copyprivate(list), nowait

 master
Specifies a structured block that is executed by the master thread of the team.
#pragma omp master
structured-block

 critical
Restricts execution of the associated structured block to a single thread at a time.
#pragma omp critical [(name)]
structured-block
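
For example, a named critical section can serialize updates to a shared sum (a minimal sketch; the variables and the per-thread work are illustrative):

#include <stdio.h>
#include <omp.h>

int main(void) {
    double global_sum = 0.0;

    #pragma omp parallel
    {
        double my_sum = 1.0 + omp_get_thread_num();   /* stand-in for real per-thread work */

        #pragma omp critical(sum_update)   /* only one thread at a time executes this */
        global_sum += my_sum;
    }
    printf("sum = %f\n", global_sum);
    return 0;
}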

 flush
Executes the OpenMP flush operation, which makes a thread’s temporary view of memory
consistent with memory, and enforces an order on the memory operations of the variables.
#pragma omp flush [(list)]

 barrier
Specifies an explicit barrier at the point at which the construct appears.
#pragma omp barrier

 atomic
Ensures that a specific storage location is accessed atomically.
#pragma omp atomic
expression
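
For a simple scalar update such as a counter increment, atomic is usually cheaper than a critical section (a minimal sketch):

#include <stdio.h>
#include <omp.h>

int main(void) {
    int count = 0;

    #pragma omp parallel
    {
        #pragma omp atomic          /* the increment itself is performed atomically */
        count++;
    }
    printf("count = %d\n", count);  /* equals the number of threads in the team */
    return 0;
}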

 parallel for
Shortcut for specifying a parallel construct containing one or more associated loops and no
other statements.
#pragma omp parallel for [clauses]
for-loop
clause: Any accepted by the parallel or for directives, except the nowait clause, with identical
meanings and restrictions.
 parallel sections
Shortcut for specifying a parallel construct containing one sections construct and no other
statements.
#pragma omp parallel sections [clauses]
{
[#pragma omp section]
structured-block
[#pragma omp section
structured-block]
...
}
clause: Any accepted by the parallel or sections directives, except the nowait clause, with
identical meanings and restrictions.

Work-sharing Constructs
A work-sharing construct distributes the execution of the associated statement among the
members of the team that encounter it. The work-sharing directives do not launch new threads, and
there is no implied barrier on entry to a work-sharing construct.

The sequence of work-sharing constructs and barrier directives encountered must be the same for
every thread in a team.

OpenMP defines the following work-sharing constructs, and these are described in the sections that
follow:

 for directive

The for directive identifies an iterative work-sharing construct that specifies
that the iterations of the associated loop will be executed in parallel. The iterations of
the for loop are distributed across threads that already exist in the team executing the
parallel construct to which it binds.
 sections directive

The sections directive identifies a noniterative work-sharing construct that
specifies a set of constructs that are to be divided among threads in a team. Each
section is executed once by a thread in the team.
 single directive

The single directive identifies a construct that specifies that the associated structured
block is executed by only one thread in the team (not necessarily the master thread).
1. Parallel for loop

Example:
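
A minimal sketch of a parallel for loop (the array names and the size N are illustrative):

#include <stdio.h>
#include <omp.h>
#define N 1000

int main(void) {
    double a[N], b[N], c[N];
    int i;

    for (i = 0; i < N; i++) { b[i] = i; c[i] = 2.0 * i; }   /* initialize the inputs serially */

    #pragma omp parallel for        /* the N iterations are divided among the threads */
    for (i = 0; i < N; i++)
        a[i] = b[i] + c[i];

    printf("a[N-1] = %f\n", a[N - 1]);
    return 0;
}

Note that the loop index i is made private to each thread automatically.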

2. Parallel sections
Example:
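
A minimal sketch of parallel sections (the printed messages are illustrative); each section is executed once, by one thread of the team:

#include <stdio.h>
#include <omp.h>

int main(void) {
    #pragma omp parallel sections
    {
        #pragma omp section
        { printf("section 1 executed by thread %d\n", omp_get_thread_num()); }

        #pragma omp section
        { printf("section 2 executed by thread %d\n", omp_get_thread_num()); }
    }
    return 0;
}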

3. Parallel single:

Syntax
#pragma omp single [clause[, clause] ...]
structured-block

Arguments

clause - Can be one or more of the following clauses:

copyprivate(list) - Provides a mechanism to use a private variable in list to broadcast a value from
the data environment of one implicit task to the data environments of the other implicit tasks
belonging to the parallel region.
firstprivate(list) - Provides a superset of the functionality provided by the private clause. Each
private data object is initialized with the value of the original object.
nowait - Indicates that an implementation may omit the barrier at the end of the worksharing
region.
private(list) - Declares variables to be private to each thread in a team.

Description
Only one thread will be allowed to execute the structured block. The rest of the threads in the team wait at
the implicit barrier at the end of the single construct, unless nowait is specified. If nowait is specified, then
the rest of the threads in the team immediately execute the code after the structured block.

The following example demonstrates how to use this pragma to make sure that the printf function is
executed only once. All the threads in the team that do not execute the function proceed immediately to the
following calculations.

Example

#include <stdio.h>
#include <omp.h>

int main() {
    #pragma omp parallel
    {
        #pragma omp single nowait
        { printf("Starting calculation\n"); }

        // Do some calculation
    }
}

OpenMP Library Functions:

OpenMP provides several run-time library routines to help you manage your program
in parallel mode. Many of these run-time library routines have corresponding environment
variables that can be set as defaults. The run-time library routines let you dynamically change
these factors to assist in controlling your program. In all cases, a call to a run-time library
routine overrides any corresponding environment variable.

Execution Environment Routines

omp_set_num_threads(nthreads) - Sets the number of threads to use for subsequent parallel regions.
omp_get_num_threads() - Returns the number of threads that are being used in the current parallel
region.
omp_get_max_threads() - Returns the maximum number of threads that are available for parallel
execution.
omp_get_thread_num() - Returns the unique thread number of the thread currently executing this
section of code.
omp_get_num_procs() - Returns the number of processors available to the program.
omp_in_parallel() - Returns TRUE if called within the dynamic extent of a parallel region executing
in parallel; otherwise returns FALSE.
omp_set_dynamic(dynamic_threads) - Enables or disables dynamic adjustment of the number of
threads used to execute a parallel region. If dynamic_threads is TRUE, dynamic threads are enabled;
if FALSE, they are disabled. Dynamic threads are disabled by default.
omp_get_dynamic() - Returns TRUE if dynamic thread adjustment is enabled, otherwise returns
FALSE.
omp_set_nested(nested) - Enables or disables nested parallelism. If nested is TRUE, nested
parallelism is enabled; if FALSE, it is disabled. Nested parallelism is disabled by default.
omp_get_nested() - Returns TRUE if nested parallelism is enabled, otherwise returns FALSE.

Lock Routines

omp_init_lock(lock) - Initializes the lock associated with lock for use in subsequent calls.
omp_destroy_lock(lock) - Causes the lock associated with lock to become undefined.
omp_set_lock(lock) - Forces the executing thread to wait until the lock associated with lock is
available. The thread is granted ownership of the lock when it becomes available.
omp_unset_lock(lock) - Releases the executing thread from ownership of the lock associated with
lock. The behavior is undefined if the executing thread does not own the lock associated with lock.
omp_test_lock(lock) - Attempts to set the lock associated with lock. If successful, returns TRUE,
otherwise returns FALSE.
omp_init_nest_lock(lock) - Initializes the nested lock associated with lock for use in subsequent
calls.
omp_destroy_nest_lock(lock) - Causes the nested lock associated with lock to become undefined.
omp_set_nest_lock(lock) - Forces the executing thread to wait until the nested lock associated with
lock is available. The thread is granted ownership of the nested lock when it becomes available.
omp_unset_nest_lock(lock) - Releases the executing thread from ownership of the nested lock
associated with lock if the nesting count is zero. Behavior is undefined if the executing thread does
not own the nested lock associated with lock.
omp_test_nest_lock(lock) - Attempts to set the nested lock associated with lock. If successful,
returns the nesting count, otherwise returns zero.
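
A minimal sketch of the simple (non-nested) lock routines protecting a shared counter (the counter itself is illustrative):

#include <stdio.h>
#include <omp.h>

int main(void) {
    omp_lock_t lock;
    int counter = 0;

    omp_init_lock(&lock);            /* initialize the lock before first use  */

    #pragma omp parallel
    {
        omp_set_lock(&lock);         /* wait until the lock is available      */
        counter++;                   /* work done while owning the lock       */
        omp_unset_lock(&lock);       /* release ownership                     */
    }

    omp_destroy_lock(&lock);         /* the lock becomes undefined after this */
    printf("counter = %d\n", counter);
    return 0;
}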

Timing Routines

omp_get_wtime() - Returns a double-precision value equal to the elapsed wall-clock time (in
seconds) relative to an arbitrary reference time. The reference time does not change during program
execution.
omp_get_wtick() - Returns a double-precision value equal to the number of seconds between
successive clock ticks.
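
For example, omp_get_wtime can bracket a parallel loop to measure its elapsed wall-clock time (a minimal sketch; the array and its size are illustrative):

#include <stdio.h>
#include <omp.h>
#define N 1000000

int main(void) {
    static double a[N];
    double start = omp_get_wtime();        /* time before the parallel work */

    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        a[i] = 2.0 * i;

    double elapsed = omp_get_wtime() - start;
    printf("elapsed: %f s (clock tick = %g s)\n", elapsed, omp_get_wtick());
    return 0;
}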

Handling Data and function parallelism


Data parallel computation:
Perform the same operation on different items of data at the same time; the parallelism grows
with the size of the data.
Task parallel computation:
Perform distinct computations -- or tasks -- at the same time. If the number of tasks is fixed,
the parallelism is not scalable.
OpenMP Data Parallel Construct: Parallel Loop
 All pragmas begin with #pragma.
 The compiler calculates the loop bounds for each thread directly from the serial source
(computation decomposition).
 The compiler also manages the data partitioning of the result array.
 Synchronization is also automatic (barrier).

Any example with a parallel for construct can be used to illustrate this topic.

Function parallelism

The simplest way to create parallelism in OpenMP is to use the parallel pragma. A block
preceded by the omp parallel pragma is called a parallel region; it is executed by a newly
created team of threads. This follows the SPMD model: all threads execute the same segment of code.

#pragma omp parallel


{
// this is executed by a team of threads
}
It would be pointless to have the block be executed identically by all threads.
For instance, if your program computes

result = f(x)+g(x)+h(x)
you could parallelize this as
double result, fresult, gresult, hresult;
#pragma omp parallel
{
    int num = omp_get_thread_num();
    if (num==0)      fresult = f(x);
    else if (num==1) gresult = g(x);
    else if (num==2) hresult = h(x);
}
result = fresult + gresult + hresult;
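
The same functional decomposition can also be written with the sections work-sharing construct, which avoids the explicit tests on the thread number (a sketch; as in the fragment above, f, g, h and x are assumed to be declared elsewhere):

double result, fresult, gresult, hresult;
#pragma omp parallel sections
{
    #pragma omp section
    { fresult = f(x); }       /* each section is executed by one thread */
    #pragma omp section
    { gresult = g(x); }
    #pragma omp section
    { hresult = h(x); }
}
result = fresult + gresult + hresult;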

Nested parallelism
What happens if you call a function from inside a parallel region, and that function itself
contains a parallel region?

int main() {
...
#pragma omp parallel
{
...
func(...)
...
}
} // end of main
void func(...) {
#pragma omp parallel
{
...
}
}
By default, the nested parallel region will have only one thread. To allow nested thread
creation, set

OMP_NESTED=true
or
omp_set_nested(1)

Handling Loops in OpenMP


Loop parallelism is a very common type of parallelism in scientific codes, so
OpenMP has an easy mechanism for it. OpenMP parallel loops are a first example of
OpenMP 'worksharing' constructs: constructs that take an amount of work and distribute it
over the available threads in a parallel region.
The parallel execution of a loop can be handled a number of different ways. For instance, you
can create a parallel region around the loop, and adjust the loop bounds:

#pragma omp parallel


{
int threadnum = omp_get_thread_num(),
numthreads = omp_get_num_threads();
int low = N*threadnum/numthreads,
high = N*(threadnum+1)/numthreads;
for (i=low; i<high; i++)
// do something with i
}
A more natural option is to use the parallel for pragma:

#pragma omp parallel


#pragma omp for
for (i=0; i<N; i++) {
// do something with i
}
This has several advantages. For one, you don't have to calculate the loop bounds for
the threads yourself, but you can also tell OpenMP to assign the loop iterations according to
different schedules
As an illustration, consider the execution on four threads of

#pragma omp parallel


{
code1();
#pragma omp for
for (i=1; i<=4*N; i++) {
code2();
}
code3();
}
The code before and after the loop is executed identically in each thread; the loop iterations
are spread over the four threads.
Note that the parallel do and parallel for pragmas do not create a team of threads: they take
the team of threads that is active, and divide the loop iterations over them.

This means that the omp for or omp do directive needs to be inside a parallel region. It is also
possible to have a combined omp parallel for or omp parallel do directive.

If your parallel region only contains a loop, you can combine the pragmas for the parallel
region and distribution of the loop iterations:

#pragma omp parallel for


for (i=0; .....

Loop schedules
Usually you will have many more iterations in a loop than there are threads. Thus, there are
several ways you can assign your loop iterations to the threads.
#pragma omp for schedule(....)
The first distinction we now have to make is between static and dynamic schedules. With
static schedules, the iterations are assigned purely based on the number of iterations and the
number of threads (and the chunk parameter; see later). In dynamic schedules, on the other
hand, iterations are assigned to threads that are unoccupied. Dynamic schedules are a good
idea if iterations take an unpredictable amount of time, so that load balancing is needed.
Figure 2 illustrates this: assume that each core gets assigned two (blocks of) iterations and
these blocks take gradually less and less time. You see from the left picture that thread 1 gets
two fairly long blocks, whereas thread 4 gets two short blocks, thus finishing much earlier.
On the other hand, in the right figure thread 4 gets block 5, since it finishes the first set of
blocks early. The effect is a perfect load balancing.

The default static schedule is to assign one consecutive block of iterations to each thread. If
you want different sized blocks you can define

#pragma omp for schedule(static[,chunk])


(where the square brackets indicate an optional argument). With static scheduling, the
compiler will split up the loop iterations at compile time, so, provided the iterations take
roughly the same amount of time, this is the most efficient at runtime.
The choice of a chunk size is often a balance between the low overhead of having only a few
chunks, versus the load balancing effect of having smaller chunks.
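
For example, if later iterations do progressively less work, as in the scenario above, a dynamic schedule with a small chunk size usually balances the load better than the default static schedule (a sketch; work_on is a hypothetical routine whose cost varies with i):

/* default static schedule: one consecutive block of iterations per thread */
#pragma omp parallel for schedule(static)
for (int i = 0; i < N; i++)
    work_on(i);                    /* hypothetical, unevenly costed work */

/* dynamic schedule: chunks of 4 iterations handed to whichever thread is idle */
#pragma omp parallel for schedule(dynamic, 4)
for (int i = 0; i < N; i++)
    work_on(i);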

Collapsing nested loops


In general, the more work there is to divide over a number of threads, the more efficient the
parallelization will be. In the context of parallel loops, it is possible to increase the amount of
work by parallelizing all levels of loops instead of just the outer one.

Example: in

for ( i=0; i<N; i++ )
  for ( j=0; j<N; j++ )
    A[i][j] = B[i][j] + C[i][j];
All N² iterations are independent, but a regular omp for directive will only parallelize one
level. The collapse clause will parallelize more than one level:
#pragma omp for collapse(2)
for ( i=0; i<N; i++ )
  for ( j=0; j<N; j++ )
    A[i][j] = B[i][j] + C[i][j];
It is only possible to collapse perfectly nested loops, that is, the loop body of the outer loop
can consist only of the inner loop; there can be no statements before or after the inner loop in
the loop body of the outer loop. That is, the two loops in
for (i=0; i<N; i++) {
  y[i] = 0.;
  for (j=0; j<N; j++)
    y[i] += A[i][j] * x[j];
}
cannot be collapsed.

Ordered iterations
The omp ordered directive must be used as follows:
 It must appear within the extent of an omp for or omp parallel for construct
containing an ordered clause.
 It applies to the statement block immediately following it. Statements in that block are
executed in the same order in which iterations are executed in a sequential loop.
 An iteration of a loop must not execute the same omp ordered directive more than
once.
 An iteration of a loop must not execute more than one distinct omp ordered directive.
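
A minimal sketch of ordered output from a parallel loop (the computation is illustrative):

#include <stdio.h>
#include <omp.h>

int main(void) {
    #pragma omp parallel for ordered
    for (int i = 0; i < 8; i++) {
        int sq = i * i;                           /* may execute out of order         */
        #pragma omp ordered
        { printf("%d squared is %d\n", i, sq); }  /* printed in sequential loop order */
    }
    return 0;
}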

NoWait clause
The implicit barrier at the end of a work sharing construct can be cancelled with a nowait
clause. This has the effect that threads that are finished can continue with the next code in the
parallel region:

#pragma omp parallel


{
#pragma omp for nowait
for (i=0; i<N; i++) { ... }
// more parallel code
}

While loops

OpenMP can only handle 'for' loops: while loops cannot be parallelized. So you have to find
a way around that. While loops are for instance used to search through data:

while ( a[i]!=0 && i<imax ) {
  i++;
}
// now i is the first index for which a[i] is zero.

We replace the while loop by a for loop that examines all locations:

result = -1;
#pragma omp parallel for
for (i=0; i<imax; i++) {
  if (a[i]==0 && result<0) result = i;
}
Note that having all threads read and update result this way is a simplification: it does not
guarantee that the lowest such index is the one that is recorded.

PERFORMANCE CONSIDERATIONS

General Performance Recommendations


 Minimize synchronization:
 Avoid or minimize the use of synchronizations such as barrier, critical, ordered,
taskwait, and locks.
 Use the nowait clause where possible to eliminate redundant or unnecessary barriers.
For example, there is always an implied barrier at the end of a parallel region. Adding
nowait to a work-sharing loop in the region that is not followed by any code in the
region eliminates one redundant barrier.
 Use named critical sections for fine-grained locking where appropriate so that not all
critical sections in the program will use the same, default lock.
 Use the OMP_WAIT_POLICY environment variable to control the behavior of waiting
threads. By default, idle threads will be put to sleep after a certain timeout period. If a
thread does not find work by the end of the timeout period, it will go to sleep, thus
avoiding wasting processor cycles at the expense of other threads. The default timeout
period might not be appropriate for your application, causing the threads to go to sleep
too soon or too late. In general, if an application has dedicated processors to run on, then
an active wait policy that would make waiting threads spin would give better performance.
If an application runs simultaneously with other applications, then a passive wait policy
that would put waiting threads to sleep would be better for system throughput.
 Parallelize at the highest level possible, such as outermost loops. Enclose multiple loops
in one parallel region. In general, make parallel regions as large as possible to reduce
parallelization overhead.
 Use a parallel for/do construct, instead of a work-sharing for/do construct nested inside a parallel
construct.
 When possible, merge parallel loops to avoid parallelization overhead.
 Use master instead of single where possible.
o The master directive is implemented as an if statement with no implicit barrier:
if (omp_get_thread_num() == 0) {...}
o The single construct is implemented similarly to other work-sharing constructs. Keeping
track of which thread reaches single first adds additional runtime overhead. Moreover,
there is an implicit barrier if nowait is not specified, which is less efficient.
 Use explicit flush with care. A flush causes data to be stored to memory, and subsequent data
accesses may require reload from memory, all of which decrease efficiency.
Avoiding False sharing
 False sharing occurs when threads on different processors modify variables that reside on the
same cache line. This situation is called false sharing because the threads are not accessing the
same variable, but rather are accessing different variables that happen to reside on the same cache
line.
 If false sharing occurs frequently, interconnect traffic increases, and the performance and
scalability of an OpenMP application suffer significantly. False sharing degrades performance
when all the following conditions occur:
 Shared data is modified by multiple threads
 Multiple threads modify data within the same cache line
 Data is modified very frequently (as in a tight loop)
 False sharing can typically be detected when accesses to certain variables seem particularly
expensive. Careful analysis of parallel loops that play a major part in the execution of an
application can reveal performance scalability problems caused by false sharing.
 In general, false sharing can be reduced using the following techniques:
 Make use of private or threadprivate data as much as possible.
 Use the compiler’s optimization features to eliminate memory loads and stores.
 Pad data structures so that each thread's data resides on a different cache line. The size of
the padding is system-dependent, and is the size needed to push a thread's data onto a
separate cache line.
 Modify data structures so there is less sharing of data among the threads.
 Techniques for tackling false sharing are very much dependent on the particular application.
In some cases, a change in the way the data is allocated can reduce false sharing. In other
cases, changing the mapping of iterations to threads by giving each thread more work per
chunk (by changing the chunk_size value) can also lead to a reduction in false sharing.
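
A hedged sketch of the padding technique: each thread's counter is padded so that it occupies its own cache line (the 64-byte line size is an assumption and is system-dependent, as noted above):

#include <stdio.h>
#include <omp.h>

#define MAX_THREADS 64
#define CACHE_LINE  64                        /* assumed cache-line size in bytes */

struct padded_counter {
    long value;
    char pad[CACHE_LINE - sizeof(long)];      /* keeps neighbouring counters on different lines */
};

struct padded_counter counts[MAX_THREADS];    /* assumes at most MAX_THREADS threads */

int main(void) {
    #pragma omp parallel
    {
        int id = omp_get_thread_num();
        for (int i = 0; i < 1000000; i++)
            counts[id].value++;               /* each thread touches only its own cache line */
    }
    printf("thread 0 count = %ld\n", counts[0].value);
    return 0;
}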
