Multicore Programming Practices
Working Group Chairs: David Stewart & Max Domeika
President: Markus Levy
Table of Contents
Foreword: Motivation for this Multicore Programming Practices Guide
Chapter 5: Debug
5.1 Introduction
5.2 Parallel Processing Bugs
5.3 Debug Tool Support
5.4 Static Code Analysis
5.5 Dynamic Code Analysis
Chapter 6: Performance
6.1 Performance
6.2 Amdahl’s Law, Speedup, Efficiency, Scalability
6.3 Using Compiler Flags
6.4 Serial Optimizations
6.4.1 Restrict Pointers
6.4.2 Loop Transformations
6.4.3 SIMD Instructions, Vectorization
6.5 Adapting Parallel Computation Granularity
6.6 Improving Load Balancing
6.7 Removing Synchronization Barriers
6.8 Avoiding Locks/Semaphores
6.9 Avoiding Atomic Sections
6.10 Optimizing Reductions
6.11 Improving Data Locality
6.11.1 Cache Coherent Multiprocessor Systems
6.11.2 Improving Data Distribution and Alignment
6.11.3 Avoiding False Sharing
6.11.4 Cache Blocking (or Data Tiling) Technique
6.11.5 Software Cache Emulation on Scratch-Pad Memories (SPM)
6.11.6 Scratch-Pad Memory (SPM) Mapping Techniques at Compile Time
6.12 Enhancing Thread Interactions
6.13 Reducing Communication Overhead
6.14 Overlapping Communication and Computation
6.15 Collective Operations
6.16 Using Parallel Libraries
6.17 Multicore Performance Analysis
MultiMark
ParallelMark
Appendix D
The Multicore Association was founded in May 2005, back in the pioneering days of multicore
technology. To this day, the primary goal of the MCA is to develop an extensive set of application programming interfaces (APIs) and to establish an industry-supported set of multicore programming practices and services.
When the Multicore Programming Practices working group (MPP) formed 4+ years ago, it was (and
still is) a known fact that C/C++ was going to be the predominant programming language for at least
the better part of the next decade. While the industry was waiting for long-term research results to
change the programming infrastructure, the multicore programmability gap was continuously
expanding, pointing out an obvious need for a group of like-minded methodology experts to create a
standard “best practices” guide.
Through the guidance of Max Domeika (Intel) and David Stewart (CriticalBlue), the Multicore
Association created this one-of-a-kind document, providing best practices for writing multicore-ready
software using C/C++ without extensions, describing a framework of common pitfalls when
transitioning from serial to parallel code, and offering insight to minimizing debugging efforts by
reducing bugs.
Of course, the only problem with creating this kind of guide is that it’s really a work in progress. As a
matter of fact, even at the onset of this monumental project, we realized that the biggest challenge was
deciding what not to include (at least in the first version). Therefore, on one hand I personally wish
to thank the working group members, and on the other hand, I’m inviting you to step forward and join
the Multicore Association effort to help evolve this MPP Guide.
Prior to the multicore processor era, developers obtained performance increases through processor
upgrades and the inherent clock frequency boosts and microarchitecture improvements. These
upgrades required minimal software changes. Obtaining performance increases in the multicore era
requires developers to invest in significant software modifications to, in effect, transform current
sequential applications into parallel ones. This modification is nontrivial and introduces new
challenges, spanning the traditional development phases of program analysis, design, implementation,
debug, and performance tuning.
In stark contrast, the inertia of existing software is undeniable. To meet tight deadlines, development projects rarely have the freedom to make wholesale changes to applications, and instead limit risk by minimizing the scope of change. This means that a developer considering a move to multicore processors may not
have the luxury of taking on a new parallel programming language or even re-architecting the
application to support widespread concurrency. Instead, many development projects in the computing
industry are adopting an evolutionary approach to enabling multicore processors. This approach
employs existing programming tools and technology and a systematic method of introducing and
debugging concurrency in existing software.
The Multicore Association (MCA) Multicore Programming Practices (MPP) guide is a detailed set of
best practices for employing this evolutionary approach to multicore development. The following
sections detail the goal of the MPP guide, the target audience, how to apply the MPP guide, and areas
not covered.
The following are thoughts on some of the higher level decisions made:
API choices – Pthreads and MCAPI were selected for several reasons. First, these two APIs cover the two main categories of multicore programming – shared memory parallel programming and message passing. Second, Pthreads is a very commonly used API for shared memory parallel programming. MCAPI is a recently introduced API, but is novel in being a message passing-based multicore API that specifically targets embedded homogeneous and heterogeneous applications.
Architecture categories – The architecture categories (homogeneous multicore with shared
memory, heterogeneous multicore with mix of shared and non-shared memory, and
homogeneous multicore with non-shared memory) are in priority order and reflect the group’s
assessment of use in embedded applications. Admittedly, the second category is a superset of
the first and third. The motivation for listing these three categories separately is due to the
widespread and common use of techniques associated with the first and third, namely shared
memory parallel techniques and process-level parallelism. The following paragraphs dissect
this goal and offer more detail.
The phases of software development discussed in MPP, and a summary of each, follow:
Program analysis and high level design is a study of an application to determine where to add
concurrency and a strategy for modifying the application to support concurrency.
Implementation and low level design is the selection of design patterns, algorithms, and data
structures and subsequent software coding of the concurrency.
Debug comprises implementation of the concurrency in a manner that minimizes latent
concurrency issues, enabling of an application to be easily scrutinized for concurrency issues,
and techniques for finding concurrency issues.
Specifically, the benefits of this guide to the specific target audience are summarized below:
Software developers who are experienced in sequential programming will benefit from reading
this guide as it will explain new concepts which are essential when developing software
targeted at multicore processors.
Engineering managers running technical teams of hardware and software engineers will benefit
from reading this guide by becoming knowledgeable about development for multicore
processors and the learning curve faced by their software developers.
Project managers scheduling and delivering complex projects will benefit from reading this
guide as it will enable them to appreciate, and appropriately schedule, projects targeting
multicore processors.
Test engineers developing tests to validate functionality and verify performance will benefit
from reading this guide as it will enable them to write more appropriate and more efficient tests
for multicore-related projects.
In particular, the areas outside the scope of the best practices portion of the MPP guide are:
Languages other than C and C++ and extensions to C and C++ that are not part of the C and
C++ ISO standard or are proprietary extensions to the standards;
Coding style guidelines; and
Architectures and programming models other than those specified in Chapter 2. This includes
C & C++ compatible parallel libraries other than those specified in Chapter 2.
The MPP guide uses two multicore programming APIs, Pthreadsi and Multicore Communications API
(MCAPIii), for its examples. This will allow us to provide coverage of two categories of multicore
software development - shared memory programming and message-passing programming. We chose
these two APIs because they are commonly used APIs within these two software development
paradigms.
A heterogeneous multicore processor with a mix of shared and non-shared memory is a processor comprised of multiple processor cores implementing different ISAs and having access both to main memory shared between processor cores and to memory local to them. In this type of architecture, local memory refers to physical storage that is only accessible from a subset of the system. An example is the memory associated with each SPE in the Cell BE processor from Sony/Toshiba/IBM. It is generally necessary to state what portion of the system the memory is local to, as memory may be local to a processor core, a processor, an SoC, an accelerator, or a board.
3.2 Analysis
To effectively exploit multicore resources, you must parallelize your program. However, several steps
should first be taken before introducing parallelism. This section describes two such activities—
improving the program's serial performanceiv and understanding the program.
It is important to remember, though, that serial optimization is not the end goal. We want to optimize
carefully, applying only changes that will facilitate parallelization and the performance improvements
that parallelization can bring. In particular, serial optimizations that interfere with, or limit, parallelization should be avoided. For example, avoid introducing unnecessary data dependencies or
exploiting details of the single core hardware architecture (such as cache capacity).
Developers are familiar with many techniques that can be used to improve a program’s serial
performance; in fact, there is a wealth of public material (e.g., articles, books, tools, and accumulated
wisdom). But haphazard application of such improvements is generally a bad idea.
"Programmers waste enormous amounts of time thinking about, or worrying about, the speed of
noncritical parts of their programs, and these attempts at efficiency actually have a strong negative
impact when debugging and maintenance are considered. We should forget about small efficiencies,
say about 97% of the time: premature optimization is the root of all evil.”v
Instead, discipline and an iterative approach are the keys to effective serial performance tuning (Figure
1). Use measurements and careful analysis to guide decision making, change one thing at a time, and
meticulously re-measure to confirm that changes have been beneficial.
Figure 1. Serial performance tuning approach. Four key steps: prepare, measure, tune, assess.
The following sections provide short descriptions of the steps in this approach, but each topic is richer
than can be reasonably covered in this document. There has been a great deal of useful work in
performance analysis and performance tuning and most of it is applicable when preparing for
multicore (though the effects of parallelization changes are more sensitive to the order in which the
changes are applied).
3.3.1 Prepare
Assemble a collection of regression tests and benchmarks against which to improve the program's
performance. Before making any changes, think about your objectives and assemble the resources that
will help you achieve them efficiently and reliably.
What are your goals? Improving a program’s performance can be a complicated, time
consuming, and error prone task. Stop when you've reached "good enough" results.
What operational scenarios are most important? Any improvements made will be based on data
from observation of the program's execution, but which scenarios are most important? You
could focus on steady state operations, heavy load conditions, error recovery scenarios, or
some balance of such scenarios.
What performance metric makes sense for your program? Perhaps it's the time to process a unit
of work, or maybe the number of units of work completed in a unit of time. Know what to
measure so you can quantify goals or know that you've met them.
Performance optimizations are often trade-offs with some other concern, such as memory consumption,
code maintainability, or portability. Know how much you are willing to compromise in any of these
areas.
Serial performance tuning is an iterative process that involves repeated execution of your program,
both to measure performance and ensure that correct functionality is preserved, so make sure you have
a good regression suite before you start (it should go without saying that the program passes all tests
before starting).
3.3.2 Measure
After preparations are complete, the next task is to utilize your benchmarks and understand the
baseline performance of your program. Development tools, such as profilers, are typically used to
measure performance. A profiler is a tool that observes the execution of a program and collects
measures of its performance. Several different approaches are used for profiling, including sampling, instrumentation, and emulation or simulation.
The quality of the performance data varies in several ways, two of which are completeness and
resemblance of "real performance." Sampling only measures at regular intervals, generating
incomplete insight into program behavior. Non-sampling approaches produce complete data, with
respect to the measurement granularity. Any profiling approach, however, has some effect on how
closely the measurements resemble execution in the absence of measurement. Low overhead
approaches more closely resemble typical execution than those with high overhead, such as
instrumentation or interpretation, which may alter the timing properties of the program in significant
ways.
Output differs from tool to tool in both granularity and format. Profilers typically measure data relative
to functions, with simple measures being the time at which each function call and return occurs. Some
profilers capture information at a finer granularity or allow the user to add instrumentation points (e.g.,
encompassing interesting loops).
A flat profile is generally a tabular report of statistics organized by function, such as how much
execution time was spent in a function (by percent and units of time) and the number of times
the function was called.
A call graph shows execution time, again by function, relative to call chains. A call graph will
show which functions were called by a given function, how many times, and how much
execution time was used.
Typically, you want to use both forms of output. A flat profile is a good way to quickly find where
large portions of time are being spent (e.g., a single function that consumes 60% of the time). A call
graph is a good way to see where a function is being used.
The choice of profiler(s) is based on the usual criteria, such as availability/cost, target platform,
experience, and so on. Use at least one profiler that measures how much of your execution time is
spent outside of your program (e.g., in system calls) to assemble a more complete performance picture.
When developing an embedded system, you may need to use an emulation- or simulation-based
profiler before target hardware is available or to compensate for lack of a persistent store for
measurements in the target device.
3.3.3 Tune
An important rule of thumb in performance tuning is "only change one thing at a time." Each tuning
iteration, use the measurements you gathered through profiling to look for the biggest improvement
you can reasonably make.viii Remember that you will also be making substantial changes when
parallelizing your program, so it's usually best to stick with the "low-hanging fruit" or big wins when
improving the serial program’s performance. Changing an algorithm that consumes a large portion of
time to scale linearly rather than logarithmically, for instance, would be a good example of a change
worth making (for clarification, this sentence only refers to the scaling, which depends on the number
of cores, and not to the algorithm's complexity, which depends on the size of the input data). In
general, favor changes that reduce computation over those that minimize data footprint. At this stage,
we want to focus on reducing the program to the essential computations needed to function correctly.
Start by looking through your measurements for hotspots. Hotspots are areas in your program that use
disproportionate percentages of your execution time, and can be found using flat profiles. Start with
hotspots for a simple reason: a performance improvement in frequently executed code will yield better
overall improvement than a similar improvement in infrequently executed code. That is, a 10%
improvement in a function using 50% of total execution time is an overall improvement of 5%; a 10%
improvement in a function using 5% of total execution time is only an overall improvement of 0.5%.
Beware of diminishing returns.
Computing: executing the instructions is the first thing we tend to think about.
Is an appropriate algorithm used? Review the profiling data for different workloads to see if it's scaling as you would expect.
Are all computations needed? You may be able to hoist calculations out of a loop or use lookup tables to reduce computation (see the sketch following this list).
Are you using the right types and data structures? Moving larger chunks of memory around
typically takes longer (e.g., using floats is often faster than using doubles). Consider the impact
of different language constructs on your platform (e.g., the overhead that is necessary to
support exception handling).
Is your program decomposition good? Use inlining or refactoring if you have functions whose
execution time isn't significantly larger than call/return overhead.
Think about your persistence strategies. Re-initializing and re-using a data structure (or class)
can be more efficient than creating a new one.
Waiting: some operations just have to spend time waiting, which is particularly problematic in
serial programs. I/O operations like network or file system access take a long time, as does user
interaction. Examine your caching strategy for some improvements, but using concurrency is
the typical solution. Avoid introducing concurrency at this point, but note the dependency. This
will factor into your parallelization strategy.
Working with large data sets can cause programs to frequently move data in and out of
memory. Reducing unnecessary memory moves can be a big help. Memory profilers are good
tools for understanding and improving your memory use.
Though it could improve serial performance, avoid reorganizing your data for locality (i.e.,
keeping data used together close in memory, which could allow a single move instead of
several smaller moves). This kind of optimization is better performed as part of parallelization,
when factors like cache width and policy are considered.
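As a concrete illustration of the loop-hoisting and lookup-table points above, the following minimal sketch (hypothetical code, not taken from any particular application) moves a loop-invariant computation out of a loop body and replaces a repeated transcendental call with a precomputed table:

#include <cmath>
#include <vector>

// Before: the scaling factor is recomputed on every iteration even though it never changes.
void scale_naive(std::vector<double>& data, double gain)
{
    for (std::size_t i = 0; i < data.size(); ++i)
        data[i] *= std::sqrt(gain) * 0.5;    // loop-invariant work inside the loop
}

// After: the invariant factor is hoisted out of the loop and computed once.
void scale_hoisted(std::vector<double>& data, double gain)
{
    const double factor = std::sqrt(gain) * 0.5;
    for (std::size_t i = 0; i < data.size(); ++i)
        data[i] *= factor;
}

// Lookup table: replace repeated sin() calls over a small integer domain
// (0..359 degrees) with a table computed once.
double sum_sines(const std::vector<int>& degrees)
{
    static double table[360];
    static bool initialized = false;
    if (!initialized) {
        for (int d = 0; d < 360; ++d)
            table[d] = std::sin(d * 3.14159265358979323846 / 180.0);
        initialized = true;
    }
    double sum = 0.0;
    for (std::size_t i = 0; i < degrees.size(); ++i)
        sum += table[degrees[i] % 360];
    return sum;
}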
As you examine each hotspot, use several sources of information. Beyond the code itself, you can also
get a lot of quick information from the call graph generated by a profiler. By looking at calls made by
or to a function, you gain more insight into how it's used and what it does. In particular, walking back
up the call graph (callers of the function) may lead you to a hotspot that is not a leaf node of the call
graph.
3.3.4 Assess
Each time you make a change to your program, stop and assess the effect. A "good change" is one that
does not introduce bugs and that either directly or indirectly enables additional changes which improve
the performance metrics you identified in the Prepare step. Assess these criteria by re-running your
regression suite and re-measuring the program using your benchmarks.
If the regression passed and your performance improved, make sure the change is also acceptable with
regards to your other concerns. For example, it might be important to preserve the portability
properties that your program started with. Such criteria are often more difficult to validate in an
automated way, but should be part of your mental checklist.
As discussed in the following sections, some aspects of an application’s behavior are particularly
relevant to parallelization, such as a thorough understanding of computational and data dependencies.
A good application analysis will help to determine which dependencies are inherent in solving the
problem and which are artificial, perhaps introduced as part of a serial optimization.
Speed-up is achieved by effectively load balancing work across the cores to minimize execution time,
but there are limits to what is reasonable. For example, some code segments cannot be effectively
parallelized. Amdahl’s Law (see Section 6.2) expresses the maximum expected benefit for improving
a portion of an application. If the serial execution time of a portion accounts for 80% of the total time,
and that portion will be spread across eight cores, the result would be at best 3.3 times faster. If that
same portion only accounted for 20% of serial execution time, the same parallelization would be at
best 1.2 times faster. These are maximum expected improvements - various overhead factors (e.g., for
coordination) will further reduce the benefit.
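The figures above follow directly from Amdahl's Law. Writing p for the fraction of serial execution time that is parallelized and N for the number of cores, the maximum speedup is

\[ S(N) = \frac{1}{(1 - p) + p/N} \]

so that, for the examples above,

\[ p = 0.8,\; N = 8: \quad S = \frac{1}{0.2 + 0.1} \approx 3.3, \qquad p = 0.2,\; N = 8: \quad S = \frac{1}{0.8 + 0.025} \approx 1.2. \]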
The effort to parallelize the application should be justified by an objective analysis of the application,
the effort required to make the change, and the realistic expectations from the effort.
Task parallel decomposition is where the solution to a problem can be envisioned as a set of steps that can be accomplished concurrently and that are absolutely or relatively independent of each other, achieving the same outcome regardless of the order in which those tasks are carried out.
Absolutely independent of each other means that the tasks share no data and that no synchronization
between the steps must occur (very uncommon). Relatively independent of each other, the more
common situation, means that such tasks share data and at certain times the tasks must synchronize
their activities. One example of task parallel decomposition is an image processing and capture
application in which a bitmap from a camera is received and the application performs three operations:
In the serialized application, these activities are sequential (Figure 2). In the task parallelized version
of this application, two tasks can be done in parallel since there are no dependencies between them,
but the third task to save the image can only be done after the bitmap is converted to JPEG—thus
requiring synchronization between the task that performs the encoding and the task that saves the
image. If the application were only required to save the original bitmap captured from the camera and
not the converted JPEG, there would be absolutely no dependencies between the three tasks, and all
three could be performed in parallel.
Data parallel decomposition is where the solution to a problem involves segmenting and decomposing major data structures (e.g., arrays) into smaller, independent blocks of data which can be processed independently and, potentially, in parallel. A simple example is scaling every element of a 2x2 matrix A by 3:

loop i = 1 to 2
  loop j = 1 to 2
    A(i,j) = 3 * A(i,j)
  end loop
end loop

Each of the four element updates (A(1,1), A(1,2), A(2,1), A(2,2)) touches different data, so all four can be performed in parallel.
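A minimal Pthreads sketch of this data decomposition is shown below (hypothetical code; it assumes a square matrix and simply gives each thread a contiguous block of rows):

#include <pthread.h>
#include <stdio.h>

#define N        4      /* matrix dimension (assumed)   */
#define NTHREADS 2      /* number of worker threads     */

static double A[N][N];

struct block { int row_begin; int row_end; };

/* Each thread scales its own block of rows; the blocks are disjoint,
   so the updates themselves need no synchronization. */
static void *scale_block(void *arg)
{
    struct block *b = (struct block *)arg;
    for (int i = b->row_begin; i < b->row_end; ++i)
        for (int j = 0; j < N; ++j)
            A[i][j] = 3.0 * A[i][j];
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];
    struct block blocks[NTHREADS];
    int rows_per_thread = N / NTHREADS;

    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            A[i][j] = i * N + j;

    for (int t = 0; t < NTHREADS; ++t) {
        blocks[t].row_begin = t * rows_per_thread;
        blocks[t].row_end   = (t == NTHREADS - 1) ? N : (t + 1) * rows_per_thread;
        pthread_create(&tid[t], NULL, scale_block, &blocks[t]);
    }
    for (int t = 0; t < NTHREADS; ++t)
        pthread_join(tid[t], NULL);

    printf("A[0][0] = %f\n", A[0][0]);
    return 0;
}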
In practice, one form of parallelism often impacts the other; in other words, task and data parallelism are not mutually exclusive. Most application problems are amenable to both forms of
decomposition—knowing which form is the most appropriate can only be reasoned by thoroughly
understanding the problem. Some problems, such as the image processing and capture application
problem, are more amenable to task decomposition, whereas others (e.g. the scalar multiplication
problem) are more amenable to data decomposition. In either type of parallelism, data structures must
be designed so that the tasks from task parallel decomposition can perform their computations, or tasks
must be created to manipulate the independent blocks of data resulting from the data parallel
decomposition.
Shared data dependencies between different (perhaps large) portions of code may also be discernible
from call graphs, but it is more likely that inspection of the source code will be necessary to either
confirm or refute such data dependencies. Static analysis tools and memory profiling tools may be
available to aid in the identification of data dependencies for large regions of code (for example
looking for access to global variables, shared memory regions, etc. across many lines of code or source
code files). Such tools will likely be language dependent and possibly even platform dependent.
Hotspots that are the result of a computationally-intensive loop may create dependencies within an
operation. These hotspots are candidates for parallelization because if the loop can be designed to be
executed in parallel some speed-up can be achieved. However, in order to perform this activity, the
loop must be analyzed and redesigned so that each loop iteration can be made independent—that is,
each iteration can be safely executed in any order and any dependencies are not carried from each
iteration. On the other hand, some loops used in serialized applications must have dependencies
between iterations, such as performing accumulations or other associative operations which are more
difficult to remove.
To identify speed-up and achieve efficient use of all the cores, the decomposition, ordering, design,
and implementation of the tasks needed to perform the overall workload (task parallel or data parallel
decomposition) must consider the target platform. This is to ensure that the work created for each of
the tasks will be of sufficient size to keep the cores busy and be correctly balanced between the
multiple cores. Previously, in the scalar multiplication example, we illustrated that a task could be created for each element of the matrix to perform the desired operation. In practice, this would likely never be done, for several reasons:
The allocation of a task for each cell of the matrix to one of the cores would simply not scale
for a matrix of any practical sizex
The overhead introduced by parallelization would likely not outperform the serialized code
This underscores the point that determining the proper sizing (or granularity) of the tasks is non-trivial.
For software engineers performing this activity for the first time, this might involve some trial and
error after the parallelization of the application is shown to be correct. There are also heuristics which
can aid in estimating granularity which may be available for the target platform, parallelization
libraries, or languages used for the parallelization effort. For example, Thread Building Blocksxi
recommends a “grainsize” (an indicator of granularity for the number of instructions to execute in the
body of a task) of 10,000 to 100,000 instructions.
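As an illustration of how such a grain size hint is expressed, the following sketch (hypothetical code, assuming the Threading Building Blocks library is available) passes an explicit grain size to a blocked_range; with a simple_partitioner the range is split down to roughly that size and no further. Note that the grain size here is expressed in loop iterations, whereas the guidance above is phrased in instructions, so the two must be related by how much work one iteration does.

#include <tbb/parallel_for.h>
#include <tbb/blocked_range.h>
#include <tbb/partitioner.h>
#include <vector>

void scale(std::vector<double>& data)
{
    const std::size_t grainsize = 10000;   // tuning knob: iterations per task chunk
    tbb::parallel_for(
        tbb::blocked_range<std::size_t>(0, data.size(), grainsize),
        [&](const tbb::blocked_range<std::size_t>& r) {
            for (std::size_t i = r.begin(); i != r.end(); ++i)
                data[i] *= 3.0;
        },
        tbb::simple_partitioner());        // honors the grain size when splitting
}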
After conducting a parallelization exercise on the source code, revisit the assessment loop (Figure 1) to
confirm that the expected results for speed-up occurred and that the regression test for the application
passed.
Figure (described): task decomposition of an image classification application. A Receive Image task feeds four independent tasks (Identify People, Identify Words, Identify Things, Identify Place), whose outputs feed a Classify Results task.
For a given processing platform, the different ‘identify’ tasks may have different execution times.
Though the four tasks may all be run in parallel, if identifying words, things, and places are each three
times faster than identifying people, it may make sense to run the word, thing, and place identification
tasks one after another in parallel with the slower people identification task. This type of load
balancing is an example of how task parallel decomposition may need to be scheduled properly to
most effectively utilize the available processing resources.
Figure (described): data decomposition of a Filter Image operation across independent regions of the image.
Though each pixel may be computed in parallel, the granularity of decomposition must match the
available resources. For example, if there are 8 logical processor cores available, the image could be
sliced into 8 independent regions. Within a region, the pixels would be computed serially, but the
regions themselves would all be computed in parallel. Data decomposition can inherently deliver a
natural load balancing process. When the task being computed takes the same amount of time per pixel,
which represents a static workload, then the data can be divided into equal segments to balance the
load.
Figure (described): pipeline decomposition of an image-processing application into a correction stage, a smoothing stage (Smooth Image), and an edge-detection stage (Sobel) feeding the output.
Ideally, pipeline decomposition requires the number of stages to be matched to the available
processing resources. The overall throughput of the pipeline is also limited by the slowest pipeline
stage. In the example above, the smoothing function requires the most work, and for each data block,
the correction and Sobel functions must idle until the smoothing function completes. The smoothing
function’s speed determines the overall performance of the parallel implementation. Recursively
parallelizing the smoothing function may shorten its execution time and better balance the pipeline.
Dependencies between data reads and writes determine the partial order of computation. There are
three types of data dependencies which limit the ordering: true data dependencies, anti-dependencies,
and output dependencies (Figure 7).
True data dependencies imply an ordering between operations in which a data value may not be read
until after its value has been written. These are fundamental dependencies in an algorithm, although it
might be possible to refactor algorithms to minimize the impact of this data dependency.
Anti-dependencies have the opposite relationship and can possibly be resolved by variable renaming.
In an anti-dependency, a data value cannot be written until the previous data value has been read. In
Figure 7b, the final assignment to A cannot occur before B is assigned, because B needs the previous
value of A. If, in the final assignment, variable A is renamed to D, then the B and D assignments may be reordered.
Renaming may increase storage requirements when new variables are introduced if the lifetimes of the
variables overlap as code is parallelized. Anti-dependencies are common occurrences in sequential
code. For example, intermediate variables defined outside the loop may be used within each loop
iteration. This is fine when operations occur sequentially. The same variable storage may be
repeatedly reused. However, when using shared memory, if all iterations were run in parallel, they
would be competing for the same shared intermediate variable space. One solution would be to have
each iteration use its own local intermediate variables. Minimizing variable lifetimes through proper
scoping helps to avoid these dependency types.
The third type of dependency is an output dependency. In an output dependency, writes to a variable
may not be reordered if they change the final value of the variable that remains when the instructions
are complete. In Figure 7c, the final assignment to A may not be moved above the first assignment,
because the remaining value will not be correct.
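Figure 7 is not reproduced here, but the three dependency types can be summarized in a short hypothetical sketch:

#include <cstdio>

static int expr1() { return 1; }
static int expr2() { return 2; }

void dependency_examples()
{
    int A = 0, B = 0, D = 0;

    /* True (flow) dependence: B reads the value just written to A,
       so the second statement may not move before the first. */
    A = expr1();
    B = A + 1;

    /* Anti-dependence (Figure 7b): A may not be overwritten before its
       previous value has been read into B.  Renaming the write target
       from A to a new variable D removes the constraint, so the two
       statements could then be reordered. */
    B = A + 1;
    A = expr2();      /* after renaming: D = expr2(); */

    /* Output dependence (Figure 7c): two writes to the same variable may
       not be reordered, because the final value left in A would change. */
    A = expr1();
    A = expr2();

    std::printf("%d %d %d\n", A, B, D);
}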
Parallelizing an algorithm requires both honoring dependencies and appropriately matching the
parallelism to the available resources. Algorithms with a high amount of data dependencies will not
parallelize effectively. When all anti-dependencies are removed and partitioning still does not yield acceptable performance, consider changing algorithms to find an equivalent result using an algorithm
which is more amenable to parallelism. This may not be possible when implementing a standard with
strictly prescribed algorithms. In other cases, there may be effective ways to achieve similar results.
Multiple threads of execution are used to run multiple tasks simultaneously. Data which is shared between tasks must also be properly synchronized by the developer in the application (e.g., with locks) to ensure dependency relationships are properly maintained. For example, threads T0 and T1 may both
need to read and write variable ‘A’ (Figure 9). Without synchronization, the order of reads and writes
between the two threads is unpredictable, and three different outcomes are possible.
Figure 9 (described): threads T0 and T1 each read shared variable A; T0 computes A += 1 and T1 computes A += 2; each then writes A back. Depending on how the reads and writes interleave, the final value of A may be 1, 2, or 3.
Mutual exclusion through locks and semaphores is a common technique to ensure that only one thread
at a time may execute code which encompasses certain critical dependencies. While one thread holds a
lock for a critical section, other threads are blocked from entering, which if allowed, could violate
dependencies involving data shared between the sections. In the example of Figure 9, properly locking
the critical sections ensures that ‘A’ always receives 3 as a final value.
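A minimal Pthreads version of the Figure 9 example, with the critical section protected by a mutex so that A always ends at 3, might look like this (hypothetical code):

#include <pthread.h>
#include <stdio.h>

static int A = 0;
static pthread_mutex_t A_lock = PTHREAD_MUTEX_INITIALIZER;

/* Each thread's read-modify-write of A is a critical section. */
static void *add_one(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&A_lock);
    A += 1;
    pthread_mutex_unlock(&A_lock);
    return NULL;
}

static void *add_two(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&A_lock);
    A += 2;
    pthread_mutex_unlock(&A_lock);
    return NULL;
}

int main(void)
{
    pthread_t t0, t1;
    pthread_create(&t0, NULL, add_one, NULL);
    pthread_create(&t1, NULL, add_two, NULL);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    printf("A = %d\n", A);   /* always prints A = 3 */
    return 0;
}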
Generally, each processor core runs one thread at a time; more threads than cores may be created across the processor cores. When one thread blocks while waiting for the lock to clear, the operating system may wake up
another ready thread to take its place.
The cost of acquiring and releasing a lock can be significant. Locks serialize code, so locking large
critical sections will inhibit parallelism. On the other hand, using frequent, low-level locking may
impose a large penalty for synchronization. The cost of creating and destroying completed tasks is also
significant, so once again, the granularity of tasks and locking should match the available resources.
In the distributed memory model, data must be explicitly shared between tasks. Synchronization
between tasks can be achieved by synchronous send-receive semantics. A receiving task will block
until the sending data is available. Alternatively, asynchronous send-receive semantics can be used
where the receiving task can check or be notified when data is available without blocking. This
permits overlap of computation and communication leading to significant performance gains.
Relative to shared memory, distributed memory incurs a higher overhead for communication, both in the setup and tear-down of a message and in the explicit copying of data, so message passing should be optimized to minimize both costs.
What would happen if the run time of the smoothing function was dependent on the input data? For
example, regions with slowly varying pixel values might take fewer compute cycles to smooth, while
regions with rapidly changing pixel values might require extra processing. This type of workload
varies dynamically depending on the pixel values - equally-sized pixel regions may contain very
different workloads. Consider an image with widely varying pixels on the top half and a uniform color
in the bottom half. If the work is split between two tasks, top and bottom, each running on a separate processor, the bottom half will finish well in advance of the top half, and the bottom processor will
idle.
With unpredictable workloads, the schedule can only be statistically optimized. In this case, dividing
the work into smaller regions might improve the processing efficiency because the differences
between neighboring workloads will be less. As the work is divided into smaller tasks, the cost of
communication can become a significant portion of the task run time and must be considered in
determining the best task sizes. Runtime systems using thread pools or work stealing provide built-in support
for dynamic load-balancing.
Task, data, and pipeline decompositions should always consider how to balance the workloads
between regions. Static workloads are easy to analyze, while dynamic workloads will require dynamic load balancing at run time.
Some regions will easily decompose into many fine-grained regions, but there will be a tradeoff between
granularity and overhead. Data decomposing an image into pixels might be an extreme example of
fine-grained implementation. The number of available cores, overhead of creating and destroying tasks,
and communication and synchronization overhead become a significant cost.
As we described previously, profiling information from high-level analysis identifies the most
important hot spot regions. To gain the most benefit, concentrate on parallelizing the hot spots that
consume relatively large amounts of execution time. A true bottom-up approach will start with the
finest grained partitioning for hot spot regions and progressively move up to encompass larger regions,
maintaining a balance between partitioning and overhead. Once a suitable granularity is established, a
larger region can be assembled in parallel from the smaller regions until achieving a satisfactory
estimated performance level.
Using data decomposition and assuming sufficient processing resources, it may be possible to move
six data blocks in parallel through the first stage, which can then run serially through the second stage
in the same amount of time, effectively balancing the pipeline and optimizing the throughput.
Figure 11. Example depicts 2-stage pipeline combining data and pipeline decomposition.
Model-Based Parallelization (MBP) can be used for this purpose. In many modeling tools, data flow block diagrams are used to represent models. Figure 12 shows an example of a Simulink model for the Fibonacci function (𝑓𝑛+1 = 𝑓𝑛 + 𝑓𝑛−1) and a comparison of 𝑓𝑛/𝑓𝑛−1 with its convergence value, which is also called the golden ratio. In this diagram, the blocks marked 1/𝑧 are unit delays that hold the values of 𝑓𝑛 and 𝑓𝑛−1. At each step, the calculation starts by reading the values of 𝑓𝑛 and 𝑓𝑛−1 at the unit delays marked (S). Then the values of 𝑓𝑛+1 and 𝑓𝑛 are stored at the unit delays marked (T), and the calculated values are produced at the output ports marked (T). The diagram explicitly shows data dependency; task parallel decomposition becomes a model partitioning problem, which considers load balancing, inter-core communication, and computation order between start and end points (Figure 13). In addition, data parallelism can be expressed using special blocks such as the “For Iterator” block in Simulink. Therefore, MBP might be useful for the analysis and high-level design of parallelism in models.
For user-level threads (ULTs), the application is responsible for scheduling and managing the threads; the kernel manages the process but is unaware of the threads. As a result, if a ULT makes a blocking call into the kernel, the entire process blocks. For kernel-level threads (KLTs), the kernel schedules each thread and maintains information about it, and blocking is done on a per-thread basis. Hybrid ULT/KLT threading is implemented on platforms such as Linux and Solaris (Figure 14).
Figure 14 (described): hybrid threading. In user space, a thread library maps application threads onto kernel-schedulable entities (k), which the kernel in turn schedules onto processors p0 through p3.
The kernel will switch control between threads based on the scheduling algorithm and the state
of the machine
The kernel will save and restore the context of the thread - known as a context switch. This has
a detrimental effect on the performance of the software because of the inherent overhead,
adversely affecting the behavior of the underlying microprocessor and cache state.
Kernel scheduling is non-deterministic and does not reliably schedule threads in any particular
order
Even though POSIX compliant platforms feature Pthreads, some implementations are incomplete or don’t conform to the specification. Some API calls may not have underlying operating system kernel support.
Beyond this document there are several classic textsxiv xv xvi that cover Pthreads in great depth and even
more on-line tutorials.
Each Pthreads function call takes an opaque variable (object) as a function parameter. The thread
library uses this parameter to track the state of that object/resource. For instance the pthread_create
call takes a pthread_t type that contains, amongst other things, the ID of the thread. This ID can be
retrieved via the pthread_self call.
Most Pthreads functions return a value indicating the success, failure or status of the call. Check these
values and deal with them appropriately.
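Because Pthreads calls report errors through their return value rather than by setting errno, a minimal (hypothetical) checking pattern looks like this:

#include <pthread.h>
#include <string.h>
#include <stdio.h>
#include <stdlib.h>

static void *worker(void *arg)
{
    (void)arg;
    return NULL;
}

int main(void)
{
    pthread_t tid;
    int rc = pthread_create(&tid, NULL, worker, NULL);
    if (rc != 0) {
        /* Pthreads returns the error number directly (e.g., EAGAIN). */
        fprintf(stderr, "pthread_create failed: %s\n", strerror(rc));
        exit(EXIT_FAILURE);
    }
    rc = pthread_join(tid, NULL);
    if (rc != 0)
        fprintf(stderr, "pthread_join failed: %s\n", strerror(rc));
    return 0;
}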
The Pthreads function categories are outlined in Table 1. We will introduce most of this API and its
applicability later in the chapter.
Any portion of code that must be synchronized and protected by mutual exclusion is called a critical
section. Furthermore, construct the protection mechanism (Figure 15) so that all threads execute the
entry and exit routines before entering and exiting the critical section.
For Pthreads, unprotected global or static variables are particularly troublesome and should be avoided.
Threads that allocate memory on the stack can take advantage of the thread’s stack space offering
automatic privacy. Many functions return a pointer to static data, which can be a problem, but this can
be remedied by allocating memory on the heap and returning a pointer. Using thread-safe variants of
library function calls is also recommended. Furthermore, any third-party libraries should be vetted for
thread safety and thread-safe calls should be used.
In normal operation, threads compete for resources, cooperate to share them, or cooperate amongst
themselves via some type of communication method to guarantee a certain outcome. Depending on the
interaction between threads, there are many opportunities for the programmer to fall prey to race
conditions, deadlocks, and starvation.
Deadlockxxvii is a situation that arises when two threads each hold a resource that the other needs to continue, and each is waiting on the other to release it. Without external interference there is no way for either resource to be released, a situation termed circular waiting. This can usually be prevented by verification of the lock ordering - the order in which locks
are acquired and released. One rule of thumb is to acquire locks in ascending order and release in
descending order. When using C++, concepts such as RAII (Resource Acquisition Is Initialization),
leveled locking, and lock guards/scope locks can be helpful to avoid missing locksxxviii. Figure 16
illustrates a common C++ scope-locking technique. When an object goes out of scope, it’s destroyed.
The ScopeLock class destructor ensures that the mutex is unlocked when the object is destroyed
because the mutex unlock is called in the destructor of the object as the stack is wound down.
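Figure 16 is not reproduced here; the scope-locking idiom it describes can be sketched roughly as follows (hypothetical class, wrapping a Pthreads mutex):

#include <pthread.h>

// RAII wrapper: the constructor acquires the mutex and the destructor
// releases it, so the lock cannot be leaked when the scope is exited,
// whether normally or via an exception.
class ScopeLock {
public:
    explicit ScopeLock(pthread_mutex_t& m) : mutex_(m) { pthread_mutex_lock(&mutex_); }
    ~ScopeLock() { pthread_mutex_unlock(&mutex_); }
private:
    ScopeLock(const ScopeLock&);            // non-copyable
    ScopeLock& operator=(const ScopeLock&);
    pthread_mutex_t& mutex_;
};

static pthread_mutex_t g_lock = PTHREAD_MUTEX_INITIALIZER;
static int g_counter = 0;

void increment()
{
    ScopeLock guard(g_lock);   // mutex is released automatically when guard is destroyed
    ++g_counter;
}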
Starvation can happen when attempting to enforce mutual exclusion in a heavily contended critical
section. The kernel may schedule two threads in such a way that a third thread must wait indefinitely
for a shared resource, thereby starving it and making no progress. Thus, one of the goals of certain
locks is to be fair, thereby preventing starvation.
Time - Make the critical section as short as possible – some instructions take longer to execute.
Space - Execute the minimal amount of instructions during the lock.
Do not lock any section of code that executes nondeterministically.
When dealing with containers that hold data, lock data items and structures whenever possible,
as opposed to the entire container - balance this against granularity concerns.
Fine-grained parallelism means that small amounts of work are done per task, or that there is a low computation-to-communication ratio. This can facilitate load balancing, but relatively large amounts of time will be
spent coordinating the events and communicating their outcome to the other threads. It’s also possible
to produce a situation where there is more communication than computation, such as when the
program spends most of its time in the kernel dealing with mutexes or condition variables.
Coarse-grained parallelism is where there is a high computation to communication ratio. This might
indicate that the performance would increase, but depending on the algorithm’s implementation, it
may require more effort to achieve a reasonable load balance across a given machine.
For deciding on the appropriate granularity, a useful rule of thumb is that there must be enough work
to amortize the cost of communication over the computation. Start with a coarse-grained approach and
look for opportunities to implement fine-grained parallelism. More parallelism is usually better, but
this alone is not the determining factor for performance. This balance is usually dependent on the
scheduling strategies, algorithm implementation, the computation to communication ratioxxix, as well
as the available processing resources. Maximum performance is only realized after considerable
experimentation and applied performance tuning. Additionally, the number of concurrent work items
should be close to the number of cores in the machine to avoid over-subscription.
When using most Pthreads primitives, the library will ask the kernel to help it perform the task, which
in turn causes its own set of performance problems. Have an idea of the relative overhead for Pthreads
calls. This knowledge combined with understanding the potential concurrency in the software
(Amdahl’s/Gustafson’s Law)xxx will help with the optimization process.
Creating or identifying a group of program functions that will be used as tasks - called thread
functions
Creating the threads that will execute the thread functions and passing the necessary data to the
functions
Joining with and retrieving the results returned from the thread functions, if any
Minimally, all critical sections must be protected – this pertains to locations where shared
resources will be changed by at least one thread and accessed by others
A parallel pipeline system has temporal and data dependencies (Figure 20), therefore synchronization
at the output of each stage might need to be considered.
Figure 20. Parallel pipeline system with temporal and data dependencies.
A pipeline can be created using the Queue concept introduced earlier, where each stage in the pipeline
is separated by a queue object. Each stage has one or more threads that remove a data item from the
previous queue, perform some work and then place the data object in the queue of the next block.
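A rough sketch of such a queue is shown below (hypothetical code using a Pthreads mutex and condition variables; a production version would also need shutdown and error handling). Each pipeline stage then runs one or more threads that pop from the stage's input queue, process the item, and push it to the next stage's queue.

#include <pthread.h>

#define QUEUE_CAPACITY 16

// A simple bounded, thread-safe queue used to decouple adjacent pipeline
// stages.  Producers block when the queue is full, consumers block when
// it is empty.
struct WorkQueue {
    void*           items[QUEUE_CAPACITY];
    int             head, tail, count;
    pthread_mutex_t lock;
    pthread_cond_t  not_empty;
    pthread_cond_t  not_full;
};

void queue_init(WorkQueue* q)
{
    q->head = q->tail = q->count = 0;
    pthread_mutex_init(&q->lock, NULL);
    pthread_cond_init(&q->not_empty, NULL);
    pthread_cond_init(&q->not_full, NULL);
}

void queue_push(WorkQueue* q, void* item)
{
    pthread_mutex_lock(&q->lock);
    while (q->count == QUEUE_CAPACITY)          // wait until a slot is free
        pthread_cond_wait(&q->not_full, &q->lock);
    q->items[q->tail] = item;
    q->tail = (q->tail + 1) % QUEUE_CAPACITY;
    q->count++;
    pthread_cond_signal(&q->not_empty);
    pthread_mutex_unlock(&q->lock);
}

void* queue_pop(WorkQueue* q)
{
    pthread_mutex_lock(&q->lock);
    while (q->count == 0)                       // wait until an item is available
        pthread_cond_wait(&q->not_empty, &q->lock);
    void* item = q->items[q->head];
    q->head = (q->head + 1) % QUEUE_CAPACITY;
    q->count--;
    pthread_cond_signal(&q->not_full);
    pthread_mutex_unlock(&q->lock);
    return item;
}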
Figure (described): master/worker model. A master thread sends tasks to a pool of worker threads and collects their results.
The master/worker system relies on the relatively smooth operation of the master thread, which is responsible for collecting results and emitting tasks. When a large number of workers are used, the master thread may become overloaded. Conversely, allowing the worker threads to perform more of the work themselves reduces the load on the master thread.
The divide-and-conquer scheme proceeds in three phases:
1) Divide Phase - A problem is split into one or more independent sub-problems of smaller size
2) Conquer Phase - Each sub-problem is solved independently, possibly in parallel
3) Combine Phase - The sub-solutions are combined to a solution of the original problem instance
In the divide-and-conquer model, one or more threads perform the same task in parallel (SPMD
model). There is no master thread and all run independently. Recursive calls can be made concurrently
only if the calls write to different parts of the program’s memory. Divide and conquer algorithms are
subject to load-balancing problems when using non-uniform sub-problems, but this can be resolved if
the sub-problems can be further reduced.
An efficient use of the available resources on a parallel system requires enough tasks to spread over
the processor cores allowing tasks to execute in parallel. The programmer must determine the optimal
task size that will give the shortest execution time. The larger the task size, the less parallelism that is
available on a given machine. A smaller task size will result in greater parallel overhead (communication and synchronization).
The general solution to the granularity problem is NP-completexxxii, but it’s possible to find a near
optimal solution to a sub-problemxxxiii. Techniques such as task packingxxxiv xxxv and work stealingxxxvi
xxxvii
are also beneficial to arrive at a tuned scheduling algorithm.
On most processor architectures, migration of threads across cache, memory, or processor boundaries
is expensive (i.e. TLB flush, cache invalidation) and can reduce program performance. The
programmer is able to set affinities for certain threads to take advantage of shared caches, interrupt handling, and to match computation with data (locality). Additionally, setting the affinity of a thread to a single CPU and excluding other threads from using that CPU dedicates that processor to running that thread only and takes that processor out of the pool of resources that the OS can allocate. When the
affinity scheduling is manually controlled, the programmer should carefully design the affinity
scheduling scheme to ensure efficiency.
Each operating system will use different calls to set affinity, binding the thread to the processor - for
Linux, it’s sched_setaffinity and pthread_setaffinity_np (np refers to non-portable); for Solaris it’s
processor_bind. Both have similar semantics.
In the example showing the function call for cache affinity scheduling (Figure 22), the mask is a bitmap of all the cores in the machine (Figure 23). If sched_setaffinity is called again later with a different mask value, the OS migrates the thread to the requested processor.
Figure 22. Code snippet using bitmap mask for cache affinity scheduling.
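The code in Figure 22 is not reproduced here; a minimal Linux sketch of the call it describes (hypothetical core number chosen for illustration) is:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
    cpu_set_t mask;                 /* bitmap with one bit per logical core */
    CPU_ZERO(&mask);                /* clear all bits                       */
    CPU_SET(2, &mask);              /* allow this thread to run on core 2   */

    /* pid 0 means "the calling thread"; the OS migrates it to core 2 and
       keeps it there until the affinity mask is changed again. */
    if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {
        perror("sched_setaffinity");
        return 1;
    }
    return 0;
}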
Using a laundromat as an example, where there are a number of washers and dryers available for use,
people arrive with their laundry at random times. Each person is directed by an attendant to an
available washing machine (if one exists) or queued if all are busy. Each washing machine processes
one load of laundry at a time. The goal is to compute, for a given distribution of arrival times, the
average time a load of laundry spends in the system (wash time plus any time waiting for an available
machine) and the average length of the queue. The "events" in this system include loads of laundry
arriving at the attendant, loads of laundry being directed to the washing machines, and loads of
laundry leaving the washing machines. Think of this as "source" and "sink" objects to make it easier to
model loads of laundry arriving and leaving the facility. Also, the attendant must be notified when
loads of laundry leave the washing machines so that it knows whether the machines are busy.
There are many techniques for locality optimization of code. For example, cache blocking reorganizes data objects to fit in the microprocessor cache. Loop tiling is a related technique that breaks a loop's iteration space into smaller blocks, thereby ensuring that the data of interest stays in the cache.
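A minimal sketch of loop tiling (hypothetical code, with the tile size chosen to suit the cache of the target processor) is shown below for a simple matrix transpose:

#define N    1024
#define TILE 32          /* tile edge chosen so one tile fits in cache */

static double in[N][N], out[N][N];

/* Tiled transpose: the iteration space is broken into TILE x TILE blocks
   so that both the source and destination blocks stay resident in cache
   while they are being touched. */
void transpose_tiled(void)
{
    for (int ii = 0; ii < N; ii += TILE)
        for (int jj = 0; jj < N; jj += TILE)
            for (int i = ii; i < ii + TILE; ++i)
                for (int j = jj; j < jj + TILE; ++j)
                    out[j][i] = in[i][j];
}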
Figure 25. With NUMA architecture, each processor core connects to a node with dedicated memory.
Programmers must take care to initialize their data structures using the thread that will also perform the computation. This helps ensure that memory is placed closest to the node where the thread is executing. Additionally, to guarantee the best memory performance, threads should be pinned to the
processor on the node where the memory of interest is located.
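A minimal sketch of this first-touch pattern (hypothetical code): each worker thread initializes, and later computes on, its own partition of the array, so an OS with a first-touch policy places those pages on that thread's node. Pinning the threads, as described above, would be done in addition to this.

#include <pthread.h>
#include <stdlib.h>

#define N        (1 << 20)
#define NTHREADS 4

static double *data;

struct range { size_t begin, end; };

/* The thread that will do the computation touches (initializes) its
   partition first, so those pages are allocated on its local node. */
static void *init_and_compute(void *arg)
{
    struct range *r = (struct range *)arg;
    for (size_t i = r->begin; i < r->end; ++i)
        data[i] = 0.0;                  /* first touch: local page placement */
    for (size_t i = r->begin; i < r->end; ++i)
        data[i] = data[i] * 2.0 + 1.0;  /* later computation hits local memory */
    return NULL;
}

int main(void)
{
    data = (double *)malloc(N * sizeof(double));
    pthread_t tid[NTHREADS];
    struct range ranges[NTHREADS];
    size_t chunk = N / NTHREADS;

    for (int t = 0; t < NTHREADS; ++t) {
        ranges[t].begin = t * chunk;
        ranges[t].end   = (t == NTHREADS - 1) ? N : (t + 1) * chunk;
        pthread_create(&tid[t], NULL, init_and_compute, &ranges[t]);
    }
    for (int t = 0; t < NTHREADS; ++t)
        pthread_join(tid[t], NULL);
    free(data);
    return 0;
}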
To understand how MCAPI might fit into your system, look at how to divide the system and its
resources among the available operating systems and CPUs. The system itself must be designed both
with the hardware and software in mind. What are the system requirements? What is the minimum
latency needed to process a datagram? Are there any deterministic requirements that must be adhered
to? Can the application be parallelized?
It’s important to note the differences between asymmetric multiprocessing (AMP) and symmetric
multiprocessing (SMP). In some situations, SMP hardware is best used by running a single operating
system across all cores. However, sometimes it can be advantageous to distribute several operating
system instances (similar to AMP systems) across cores in an SMP system. In addition, it might be
optimal to run an application directly on a CPU without an operating system, otherwise known as a
“bare-metal” application.
After determining that using AMP is advantageous, or deemed a requirement in your system design,
there must be an inter-processor communication (IPC) mechanism for passing messages between those
instances. MCAPI specifically deals with message passing between nodes in the system. MCAPI can
be used at a very low layer to pass messages between operating systems, making it possible to abstract
away the hardware differences and uniqueness from the application.
4.23.1 MCAPI
There are three primary concepts to understand with MCAPI: a node, an endpoint, and a channel. An MCAPI implementation passes messages between nodes, where a node can be a CPU, an operating system instance, a thread, or a process.
There are two types of inter-node communication: connection-less and connected. A connection-less
endpoint can receive communication from many different nodes at any time. A connected channel is
when communication passes between two distinct endpoints; no other nodes can connect to that
endpoint.
A channel can be “socket-like” in its message passing, where all messages appear in their transmitted order, or “scalar-based,” where simple integer messages are passed between endpoints. For example, a video frame would be passed in its entirety as a datagram, whereas a command can be passed as a scalar.
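As a rough illustration only, a connection-less message exchange follows the general pattern below. This sketch assumes an MCAPI 2.0-style implementation; the exact parameter lists, types, and constants (including the hypothetical identifiers and the "wait forever" timeout constant used here) vary between MCAPI versions and vendor implementations, so consult your implementation's header and documentation.

#include <mcapi.h>

/* Hypothetical identifiers for this sketch only. */
#define MY_DOMAIN  1
#define MY_NODE    1
#define MY_PORT    100
#define PEER_NODE  2
#define PEER_PORT  200

void send_hello(void)
{
    mcapi_info_t   info;
    mcapi_status_t status;

    /* Join the MCAPI topology as (domain, node); 2.0-style call shown. */
    mcapi_initialize(MY_DOMAIN, MY_NODE, NULL, NULL, &info, &status);

    /* Connection-less endpoints: ours is created locally, the peer's is
       looked up (blocking until it exists). */
    mcapi_endpoint_t local  = mcapi_endpoint_create(MY_PORT, &status);
    mcapi_endpoint_t remote = mcapi_endpoint_get(MY_DOMAIN, PEER_NODE, PEER_PORT,
                                                 MCA_INFINITE /* implementation-defined */, &status);

    char msg[] = "hello";
    mcapi_msg_send(local, remote, msg, sizeof(msg), 1 /* priority */, &status);

    mcapi_finalize(&status);
}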
By using MCAPI at a very low layer to pass messages between operating systems, you can abstract
away all of the hardware differences and uniqueness from the application. See the MCAPI programming source code example in the Appendix. This example involves a printer function with two
operating systems (Android and Nucleus), and includes a flow chart with initialization and the
message process.
4.23.2 MRAPI
MRAPI deals with resources, such as sharing memory and synchronizing objects between processing
nodes or operating system instances. Memory sharing APIs allow a single block of memory, owned by
one operating system, to be shared among several nodes in a system. These APIs can enforce that either
a single writer or multiple readers access the memory block at any given time.
Synchronization APIs provide applications running on different nodes with the ability to synchronize
their activities. An MRAPI semaphore is a system-wide semaphore that can be accessed by any node
in the system, preventing simultaneous access by multiple nodes to a shared resource.
Using MCAPI and MRAPI together is a compelling approach that allows full-featured APIs to be
implemented in an AMP system across any of the system's nodes. This even includes nodes that are not
directly reachable from every other node: by setting up a route, data can be passed through intermediate
nodes to reach its destination. The way your system is architected will either enable or impair the
ability to pass messages efficiently throughout the system.
During development, you can use the MCAPI/MRAPI APIs on a single operating system instance, with
either a single node or multiple nodes, with all the appropriate endpoints and tasks, and the algorithm
will run. This is an easy way to test the application on a single-core operating system node before the
multicore hardware becomes available. Although it does not exercise all the parallel threading, it does
allow the logical behavior of the algorithms to be tested.
4.24.1 MTAPI
Besides MCAPI and MRAPI, the Multicore Association defined MTAPI, a set of interfaces for task
management in embedded systems. MTAPI allows complex computations to be split into fine-grained
tasks that can be executed in parallel. Threads are usually too heavy-weight for that purpose, since
context switches consume a significant amount of time. Moreover, programming with threads is
complex and error-prone due to concurrency issues such as race conditions and deadlocks.
While task schedulers are nowadays widely used, especially in desktop and server applications, they are
typically limited to a single operating system running on a homogeneous multicore processor. System-
wide task management in heterogeneous systems must be realized explicitly via communication
mechanisms. MTAPI addresses these issues by providing an API which allows parallel software to be
designed in a straightforward way, covering homogeneous and heterogeneous multicore architectures. It
abstracts from the hardware details and lets software developers focus on the application.
MTAPI provides two basic programming models, tasks and queues, which can be implemented in
software or hardware (Figure 26). Tasks are fine-grained pieces of work executed locally (on a multicore
processor) or on a remote node, e.g. an accelerator. For that purpose, MTAPI provides mechanisms to
pass data to the remote node and back to the calling node. Queues can be used to control the scheduling
policy for related tasks. For example, order-preserving queues ensure that subsequent tasks are executed
sequentially.
Figure 26. MTAPI programming models: tasks and queues, implemented in software or hardware.
Figure 27 depicts a sample architecture based on MTAPI utilizing the CPU, a DSP, and a GPU. Each of
them is represented by one or more MTAPI nodes (one core of the CPU is reserved for special purposes,
e.g. processing real-time tasks, and constitutes a node of its own). From an application perspective,
executing a task on such a system is accomplished via unified interfaces hiding the specific
characteristics of the different hardware units.
Figure 27. Sample MTAPI-based architecture: CPU cores (running a scheduler/library and operating systems), a DSP, and a GPU, each represented by one or more MTAPI nodes.
Node: An MTAPI node is an independent unit of execution. A node can be a process, a thread, a thread
pool, a general purpose processor or an accelerator.
Job and Action: A job is an abstraction representing the work and is implemented by one or more actions.
The MTAPI system binds tasks to the most suitable actions during runtime.
Task: An MTAPI task is an instance of a job together with its data environment. Since tasks are
fine-grained pieces of work, numerous tasks can be created, scheduled, and executed in parallel. A task
can be offloaded to a neighboring node other than its origin node, depending on the action binding.
Group: MTAPI groups are defined for synchronization purposes. A group is similar to a barrier in other
task models. All tasks attached to the same group must have completed before execution continues past
a call to the corresponding API function (mtapi_group_wait).
An MTAPI-compliant scheduler distributes tasks among the available processing units and combines the
results after synchronization. However, during creation, the task does not know with which action it will
be associated. MTAPI provides a dynamic binding policy between tasks and actions. This allows it to
schedule jobs on more than one hardware type, where the scheduler is responsible for balancing the load.
Depending on where a task is located, it is either a local task (if it is implemented by an action residing
on the same node) or a remote task. Figure 28 shows an example of the relationship between tasks and
actions. In the example, Tasks 1 and 2 refer to Job A, which is implemented by Actions I and II on
different nodes, so the scheduler may execute the tasks locally or remotely.
4.24.2 EMB²
In this section, we give a brief overview of the Embedded Multicore Building Blocksxxxviii (EMB²), an
open source C/C++ library for the development of parallel applications based on MTAPI. EMB² has
been specifically designed for embedded systems and the typical requirements that accompany them,
such as predictable memory consumption and support for timing-critical applications. Besides a task
scheduler, EMB² provides basic parallel algorithms, concurrent data structures, and skeletons for
implementing dataflow-based applications (Figure 29).
Figure 29. EMB² architecture: parallel algorithms, dataflow skeletons, and concurrent containers built on top of the MTAPI task management layer, which abstracts the hardware.
Figure 30 shows an example for task creation and synchronization (for convenience, we use C++
wrappers for the corresponding MTAPI functions). The main function first gets a reference to the
current node. By calling Start, it then creates a task that executes the function work in parallel to
main. Finally, it waits for the task to finish. Instead of an ordinary function, we could also pass a job
to Start in order to execute an action on an accelerator. Note that the function work takes as
parameter a reference to an object of type TaskContext, which provides information about the status
of the task and can be used to set an error code to be returned by Wait.
int main() {
Node& node = Node::GetInstance();
Task task = node.Start(&work);
// do something...
mtapi_status_t status = task.Wait();
// check status ...
}
The algorithms provided by EMB² can be used for more complex computations such as parallel loops.
They split the computations to be performed into tasks, which are executed by the MTAPI task
scheduler. Suppose, for example, we are given a range of integers and want to double each of them. In
principle, one could apply the operation to all elements in parallel as there are no data dependencies.
However, this results in unnecessary overhead if the number of elements is greater than the number of
available processor cores. A better solution is to partition the range into blocks and to process the
blocks in parallel, which is what EMB²'s parallel loop algorithm does internally (Figure 31):
std::vector<int> v;
// initialize v ...
// ForEach splits the range into blocks and runs them as parallel tasks.
ForEach(v.begin(), v.end(), [](int& i) { i *= 2; });
Similar to the previous example (Figure 30), we could also pass a job to the loop instead of a lambda
function. This way, it is even possible to process the elements of a range on different compute units, e.g.
the CPU and a GPU, where the task scheduler automatically distributes the load by means of dynamic
action binding.
For further information and examples, the reader is referred to the MTAPI specificationxxxix and the
EMB² websitexl.
1. Thread stalls caused by a thread locking a resource for a long period which causes other
threads requiring the locked resource to wait for a long time.
2. Thread convoying results from thread stalls, where multiple threads may wait to acquire a lock.
As threads wait for locks to be released, the system switches through all runnable, blocked
threads.
3. Data races (Figure 32): result from two concurrent threads performing conflicting accesses
with no explicit mechanism implemented to prevent simultaneous access. Synchronization and
critical sections may be used to correct data races. Unnecessary synchronization must be
avoided to prevent negative impact on application performance.
4. Deadlocks caused by locking hierarchies, resulting from all threads being blocked as each
thread waits on an action by another thread.
5. Livelocks: caused by threads detecting and recovering from deadlock; the processes/threads
constantly change state with respect to one another and repeatedly trigger the deadlock detection
algorithm.
The second function, thread_fix, shows how to resolve the data race issue. A mutex is created and
protects access to the increment of global_fix. The mutex serializes access to the code between the
pthread_mutex_lock call where the mutex is acquired and the pthread_mutex_unlock call, where the
mutex is released.
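The original listing for this example is not reproduced here; the following is a minimal sketch of the pattern just described, reusing the names mentioned in the text (global_fix, thread_fix) and in the fragmentary figure (global_broken), with the remaining details assumed:

#include <pthread.h>

int global_broken;                 /* updated without protection */
int global_fix;                    /* updated under a mutex      */
static pthread_mutex_t mutex_fix = PTHREAD_MUTEX_INITIALIZER;

/* Data race: two threads may interleave the read-modify-write. */
static void *thread_broken(void *p)
{
    global_broken++;
    return NULL;
}

/* Fixed: the mutex serializes access to the increment. */
static void *thread_fix(void *p)
{
    pthread_mutex_lock(&mutex_fix);
    global_fix++;
    pthread_mutex_unlock(&mutex_fix);
    return NULL;
}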
The following listing illustrates a deadlock: the thread function acquires the mutex and never releases
it, so the later lock attempt in main() blocks forever.

#include <pthread.h>
#include <stddef.h>

int global;
static pthread_mutex_t mutex;

static void* deadlock(void* p) {
    pthread_mutex_lock(&mutex);
    global++;
    /* Bug: the mutex is never unlocked, so any other thread that
       tries to acquire it will block indefinitely. */
    return NULL;
}

int main(int argc, char *argv[])
{
    pthread_t threadID1;

    global = 0;

    pthread_mutex_init(&mutex, NULL);
    pthread_create(&threadID1, NULL, deadlock, NULL);

    /* If the created thread locked the mutex first, this call never returns. */
    pthread_mutex_lock(&mutex);
    global++;

    return 0;
}
Livelocks are similar to deadlocks, and are a special case of resource starvation. In livelocks, there is a
constant change in the states of the processes with respect to one other, preventing either one of them
from progressing. Livelocks occur in algorithms which detect and recover from deadlocks, repeatedly
triggering the deadlock detection algorithm. Livelocks can be prevented by allowing only one process to
take action.
A livelock occurs when a request for an exclusive lock on a shared resource is repeatedly denied because
a series of overlapping shared locks keeps interfering with it. In the end, two or more threads may
continue to execute without any of them making progress. One example of livelock occurs
when two or more processes or threads are set to repeat a set of actions until a given condition tests true.
However, the two processes or threads may cancel each other’s actions before testing for the condition,
causing all of them to perpetually re-start. The threads (or processes) are not blocked, they simply
change their respective states, in response to changes in the other thread(s) (or process(es)), preventing
all of them from making any progress.
Thread-related bugs are difficult to find due to their non-determinism and certain bugs may only be
observed when threads execute in a specific order. The more common threading bugs include data
races, deadlocks, and thread stalls. Synchronization may solve data races, but may result in thread
stalls and deadlocks.
A number of approaches are available for debugging concurrent systems including traditional
debugging and event based debugging. Traditional debugging (breakpoint debugging) has been
applied to parallel programs, where one sequential debugger is used per parallel process. These
debuggers can only provide limited information when several processes interact. Event-based
debuggers (or monitoring debuggers) provide some replay functionality for multi-threaded
applications, but can result in high overhead. Debuggers may display control flow, using several
approaches such as textual presentation of data, time process diagrams, or animation of program
executionxliii.
The use of a threading API may impact the selection of a debugger. Using OpenMP, for example,
requires the use of an OpenMP-aware debugger, which can access information such as constructs and
types of OpenMP variables (private, shared, thread private) after threaded code generation.
In general, the use of scalar optimizations may negatively impact the quality of debug information. In
order to improve the quality of available debugging information, it may be desirable to deselect some
of the advanced optimizations. Examples of the impact of serial optimization on debugging includexliv:
Sequential stepping through inlined source code may display different source locations and
lead to confusion.
Commonly used Linux based application debuggers include dbx and gdb, which are thread aware.
Static deadlock detection tools look for violations of reasonable programming practices, and
approaches have been based on type systems, dataflow analysis, and model checking. Type systems are
syntactic frameworks for classifying phrases according to the types of values they compute. Type
checkers can verify that well-typed programs are free of deadlocks, but they require annotations and do
not scale well to large systems.
In dataflow analysis, call graph analysis is used to check for deadlock patterns. A static lock-order
graph is computed and cycles are reported as possible deadlocks. The analysis provides information
such as type information, constant values, and nullness. Dataflow algorithms generally use information
from conditional tests.
Model checking systematically checks all schedules and assumes the input program has a finite and
tractable state space. Models of the software application are built from the source code to detect
potential errors, and these models can be analyzed for behavioral characteristics. Due to the large
number of possible interleavings in a multithreaded application, model checking is computationally
expensive and does not scale well to large programs, thereby limiting its applicabilityxlv xlvi xlvii.
1. Use verbosity options if possible. Analyzing source code with many instances of multiple
issues can be overwhelming, so consider phasing in new checks as the more critical issues are
addressed.
2. Beware of false positives. Static analysis cannot guarantee 100% accuracy so confirm reported
issues before enacting source code changes.
3. Static code analysis complements the development and debugging process, but it is not
intended to be a replacement for a mature development and testing process - use it in
conjunction with other debug techniques.
In the context of multithreading, the following recommendations should help in identifying thread-
related bugs when using a dynamic thread analysis toolxlviii:
1. Benchmarks should exercise the application’s threaded sections - tools cannot identify and
analyze threading issues unless the threaded code is executed.
2. Use a code coverage tool to ensure the test suite runs through and the dynamic analysis tool
adequately exercises key code regions.
3. Use a non-optimized debug version of the application that includes symbol and line number
information.
4. Limit instrumentation added to the application. If threading is restricted to a specific area of the
application, it may not be necessary to instrument the entire application.
5. If possible, test both optimized and non-optimized versions of the application. If certain bugs
are only detected in the optimized version of the application then selected optimizations
should be re-visited to ensure that none are contributing to the spurious bugs.
6. If binary instrumentation is used the binary should be relocatable.
Following are some tips for using dynamic code analysis tools:
It may be more difficult to create a serial version for applications that will ultimately execute on
heterogeneous multicore processors. In these cases, a library that emulates the execution in software
can be used.
This step may not apply to applications which use threading for latency, such as the case when
threading a GUI for responsiveness executing on a single core processor.
For applications that execute on heterogeneous multicore processors, it may not be possible to execute
serially. Alternatively, limiting the parallel execution as much as possible can be effective. For
example, limit the parallel execution to one task per type of processor core.
Serial consistency
Logging
Synchronization Points
Thread Verification
Simulation
Stress Testing
In the parallel version (NUM_THREADS=2), the number of threads used is two, set by the command
line definition of NUM_THREADS during compilation. The function, FilterImage(), only includes the
code to divide the data space based upon the total number of threads.
char **image;
int size_x, size_y;
Figure 35 implements a log of size LOG_SIZE and lists the function, event_record(), which adds a log
entry. Notice the mutex around the increment of the array increment, id_count. Time is obtained via a
system time() function, which is assumed to be thread safe. Clock cycle information may be used if
finer granularity is desired.
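The listing itself is not reproduced here; the following is a hedged sketch of such a logging facility, keeping the names used in the text (LOG_SIZE, id_count, event_record) and assuming the rest (entry layout, message copying):

#include <pthread.h>
#include <string.h>
#include <time.h>

#define LOG_SIZE 1024

typedef struct {
    time_t timestamp;
    int    thread_id;
    char   message[64];
} log_entry_t;

static log_entry_t log_buffer[LOG_SIZE];
static int id_count;                       /* next free slot in the log */
static pthread_mutex_t log_mutex = PTHREAD_MUTEX_INITIALIZER;

void event_record(int thread_id, const char *message)
{
    /* Serialize the increment of id_count so entries are not overwritten. */
    pthread_mutex_lock(&log_mutex);
    int slot = id_count++ % LOG_SIZE;      /* wrap around a circular log */
    pthread_mutex_unlock(&log_mutex);

    log_buffer[slot].timestamp = time(NULL);   /* assumed thread safe */
    log_buffer[slot].thread_id = thread_id;
    strncpy(log_buffer[slot].message, message, sizeof(log_buffer[slot].message) - 1);
    log_buffer[slot].message[sizeof(log_buffer[slot].message) - 1] = '\0';
}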
Typical dynamic analysis tools use instrumentation which can be inserted directly into the binary file
(binary instrumentation) just before the analysis run or inserted at compilation time (source
instrumentation). Further information on instrumentation and verification technology can be found in
the paper by Banerjee et al.lii
Effective use of stress testing involves varying the above mentioned parameters and seeing if the
application behaves incorrectly. This can help catch anomalies that may rarely occur on the actual
hardware due to timing dependencies.
As you will see, performance improvements with the use of a multicore processor depend on the
software algorithms and their implementations. Ideally, parallel problems may realize speedup factors
near the number of cores, or possibly even more if the problem is split up to fit within each core's
cache(s), which avoids the use of the much slower main system memory. However, many applications
cannot be greatly accelerated unless the application developer spends a large amount of effort to re-
factor the entire application.
The speedup SP obtained on P cores is defined as:
SP = T1 / TP
where T1 is the execution time using one core, TP is the execution time using P cores, and the goal is
to obtain a speedup of P. To characterize a parallel program's behavior, we usually refer to its
efficiency (EP):
EP = T1 / (P × TP) = SP / P
A parallel program makes full use of the hardware when the efficiency is 1. It is said to be scalable
when the efficiency remains close to 1 as the number of cores increases. Scalability is rarely achieved
without increasing the amount of work to perform as the number of cores increases.
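As a brief worked example with hypothetical timings: if T1 = 8.0 s and T4 = 2.5 s, then S4 = 8.0 / 2.5 = 3.2 and E4 = 3.2 / 4 = 0.8, i.e. the four cores are used at 80% efficiency.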
Amdahl's law states that the sequential part of the execution limits the potential benefit from
parallelization. If s denotes the fraction of the execution that must remain sequential, the execution
time TP using P cores is given by:
TP = T1 × (s + (1 − s) / P)
so the achievable speedup is bounded by 1/s, no matter how many cores are used.
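As an illustration with hypothetical numbers: if 10% of the execution is sequential (s = 0.1), then on P = 16 cores TP = T1 × (0.1 + 0.9/16) ≈ 0.156 × T1, a speedup of only about 6.4 rather than 16, and even with unlimited cores the speedup cannot exceed 1/s = 10.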
Compilers implement many optimizations that must be activated using compilation flags, with the
most common and straightforward ones being -O1, -O2, and -O3, going from local to more extensive
optimizations. These flags activate a set of optimizations that can also be set or unset by more specific
flags. The most aggressive optimization mode is usually not set by default since it can greatly increase
compilation time, as well as debug effort.
More specific flags allow for inter-procedural optimizations, profile-guided optimizations, automatic
vectorization or parallelization, or help program analysis (no-alias flags, etc.). For example, in GCC,
-flto enables link-time (inter-procedural) optimization and -fprofile-generate/-fprofile-use enable
profile-guided optimization. Some flags deal with specific functional areas (floating point, function
inlining, etc.). All these flags should be tested because their impact on performance may vary
significantly from one application to another.
1. When setting the debugging flag ‘-g’, some compilers limit the extent of optimization
performed to preserve some debugging capabilities. Alternatively, some compilers may reduce
debugging information when increasing the optimization level.
2. Finding the right combination of optimization flags can be burdensome because it’s a manual
process.
3. The highest level of optimizations (e.g. -O3) does not always produce the fastest code because
in this mode, the compiler tends to make tradeoffs between optimizations, sometimes making
incorrect choices (especially due to the lack of runtime data).
4. Optimizations may change the program results (e.g. different floating-point rounding).
5. Compilers must produce correct code. When an optimization is not proven safe, it is discarded.
After a slight rewriting (e.g. removing pointer aliasing), the compiler may be able to apply an
optimization.
1. C++ doesn't support restrict yet. Restriction on aliasing can also be indicated using compiler
flags (e.g. ‘fno-alias’ or ‘fargument-noalias’), but these options should be used with care since
they may break your application software.
2. If aliasing exists and restricted pointers indicate otherwise, this may introduce difficulties in
finding bugs (especially if the activation of debug reduces optimization and hides the bug).
3. Some compilers generate multiple code variants (one assuming aliasing and another without
aliasing) in the absence of aliasing knowledge. This may have an impact on code size.
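As an illustrative sketch (the function and array names are not from the guide), restrict-qualifying the pointers tells the compiler that the arrays do not overlap, so it can keep loaded values in registers and vectorize the loop:

void add_vectors(float *restrict dst, const float *restrict a,
                 const float *restrict b, int n)
{
    for (int i = 0; i < n; i++)
        dst[i] = a[i] + b[i];   /* the store to dst[i] cannot modify a[] or b[] */
}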
/* File: unrollandjam.c */
//original code
void fori (int dest[100][100], int src[100][100], int v[]){
int i,j;
for (i=0;i < 100; i++){
for (j=0;j < 100;j++){
dest[i][j] = src[i][j] * v[j];
}
}
}
// after unroll-and-jam
void ftrans (int dest[100][100], int src[100][100], int v[]){
int i,j;
for (i=0;i < 100; i = i+2){
for (j=0;j < 100;j++){
int t = v[j];
dest[i+0][j] = src[i+0][j] * t;
dest[i+1][j] = src[i+1][j] * t;
}
}
}
Methods to exploit such instructions include writing inline assembly code, using a C intrinsic
library, or letting the compiler's automatic vectorization technology generate the SIMD instructions.
/* File: simd.c */
//original code
void daxpy( unsigned int N, double alpha, double *X, double *Y ) {
int i;
for( i = 0 ; i < N ; i++ ) {
Y[i] = alpha*Y[i] + X[i];
}
}
#include <emmintrin.h>
//SSE code (assumes N is even and X, Y are 16-byte aligned)
void daxpySSE( unsigned int N, double alpha, double *X, double *Y) {
// __m128d holds two 64-bit doubles.
__m128d alphav = _mm_set1_pd(alpha);
int i;
for( i = 0 ; i < N ; i+=2 ) {
__m128d Yv = _mm_load_pd(&Y[i]);
__m128d Xv = _mm_load_pd(&X[i]);
__m128d mulv = _mm_mul_pd(alphav, Xv);
__m128d computv = _mm_add_pd(mulv, Yv);
_mm_store_pd(&Y[i], computv);
}
}
1. On some processors, SIMD memory access instructions have data alignment requirements that
can be specified using compiler pragmas.
2. In the simplest cases, the compiler automatically generates vector code that uses SIMD
instructions.
3. It’s difficult to maintain hand-written assembly code. The use of intrinsics in high-level
language is preferable.
4. Saturated arithmetic, predicated operations, and masked operations, when available, can be
very useful to enhance performance.
When dealing with sequences of loop nests, many techniques can be used to tune the granularity.
Conceptually, these techniques are easy to understand, but in practice it is complex to get the
implementation details correct.
Loop fusion (a.k.a. loop merging) gathers consecutive loops with independent computations
into a single loop (see the sketch after this list). This transformation decreases the loop
overhead and allows for better instruction overlap. Although it can potentially increase
cache misses due to conflicts and cause register spilling, fusing loops that use the same arrays
can improve data locality.
Loop tiling transformations (a.k.a. loop blocking) divide the iteration space of a loop nest in
blocks that usually make better use of cache memories.
Loop interchange transformations can be used to make the innermost parallel loop the
outermost one, but should be applied carefully so as not to adversely affect data locality (see
Section 6.11 Improving Data Locality).
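As an illustrative sketch of loop fusion (array names and sizes are assumptions, not from the guide):

#define N 1000

void separate(double a[N], double b[N], double c[N])
{
    for (int i = 0; i < N; i++)
        b[i] = a[i] * 2.0;
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];
}

void fused(double a[N], double b[N], double c[N])
{
    for (int i = 0; i < N; i++) {   /* one loop: less overhead, better locality */
        b[i] = a[i] * 2.0;
        c[i] = a[i] + b[i];         /* a[i] and b[i] are still in registers/cache */
    }
}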
1. Achieving adequate load balancing may conflict with improving data locality, and may require
a tradeoff.
2. With OpenMP, dynamic scheduling strategies are available through the “schedule” clause for
parallel loops (see the sketch after this list).
3. Dynamic task scheduling strategies usually incur more overhead and are less scalable than
static scheduling strategies since they require some global synchronization.
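A minimal sketch of OpenMP dynamic scheduling, assuming a hypothetical process_item() whose cost varies strongly from one iteration to the next:

#include <omp.h>

void process_item(int i);   /* hypothetical: per-iteration cost varies with i */

void process_all(int n)
{
    /* Idle threads grab the next chunk of 16 iterations, balancing the load. */
    #pragma omp parallel for schedule(dynamic, 16)
    for (int i = 0; i < n; i++)
        process_item(i);
}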
1. Removing synchronization barriers can introduce race conditions when done improperly.
1. Fine granularity locking consists of protecting atomic data with a mutex in the low-level access
functions (e.g. each record of a database). Using this approach, high-level functions are not
concerned about locking, however, the number of mutexes can be very high thereby increasing
the risk of deadlocks.
2. High-level locking is where data is organized in large areas (e.g. a database) protected by a
single mutex. Although there is less risk of deadlock, the performance penalty can be large
since accesses to the atomic data of the entire area are serialized.
An efficient tradeoff can be achieved by organizing a global data structure into buckets and using a
separate lock for each bucket. With an efficient partitioning, the contention on locks can be
significantly reduced. Another approach is to make each thread compute on private copies of a value
and synchronize only to produce the global result.
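A minimal sketch of the bucket approach, assuming a simple table of counters (the bucket count and the hash are illustrative choices, not from the guide):

#include <pthread.h>

#define NUM_BUCKETS 64

typedef struct {
    pthread_mutex_t lock;
    long            value;
} bucket_t;

static bucket_t table[NUM_BUCKETS];

void table_init(void)
{
    for (int i = 0; i < NUM_BUCKETS; i++) {
        pthread_mutex_init(&table[i].lock, NULL);
        table[i].value = 0;
    }
}

void table_add(unsigned key, long amount)
{
    bucket_t *b = &table[key % NUM_BUCKETS];  /* only this bucket is locked */
    pthread_mutex_lock(&b->lock);
    b->value += amount;
    pthread_mutex_unlock(&b->lock);
}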
1. Trying locks in a non-blocking way before entering a mutex code section (and switching to a
different job) can help avoid thread idle time.
2. Too many locks may be the cause of tricky deadlock situations.
3. Some architectures provide atomic memory read/write that can be used to replace locks.
/* File: reduction.c */
// ON ALL CORES/THREADS
// Naive version: the mutex is taken on every iteration.
// dp_mutex is assumed to be a global pthread_mutex_t* initialized elsewhere.
void worker1(int *dp, int src1[], int src2[], int bg, int ed) {
int i;
for (i=bg;i < ed; i++){
int t = src1[i] * src2[i];
// START ATOMIC SECTION
pthread_mutex_lock(dp_mutex);
*dp += t;
pthread_mutex_unlock(dp_mutex);
// END ATOMIC SECTION
}
}
//ALTERNATIVE IMPLEMENTATION
// ON ALL CORES/THREADS
// Each thread accumulates into its own slot of t[]; the partial sums are
// combined once, after all threads have finished.
void worker2 (int t[],int src1[], int src2[], int bg, int ed){
int i;
for (i=bg;i < ed; i++){
t[MYCOREorTHREADNUMBER] += src1[i] * src2[i];
}
}
1. Most parallel programming APIs provide efficient built-in operations such as reduction and
prefix (see the OpenMP sketch below).
2. Parallel reductions change the order of rounding in floating-point computations, which may
generate non-reproducible results depending on the number of processor cores.
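For comparison, the same dot-product reduction can be expressed with OpenMP's reduction clause (a hedged sketch; the runtime keeps a private partial sum per thread and combines them at the end of the loop):

#include <omp.h>

int dot_product(const int *src1, const int *src2, int n)
{
    int dp = 0;
    #pragma omp parallel for reduction(+:dp)
    for (int i = 0; i < n; i++)
        dp += src1[i] * src2[i];
    return dp;
}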
In multicore processors, we can consider data distribution and alignment for both the shared and
distributed memory models. In either case, the programmer should consider the strategy of data
distribution in the main memory to reduce extra communication.
For example, assume an unaligned data structure comprised of 16 bytes. This data structure would
require two transfers with a shared-memory processor whose coherence unit (cache line) is 16 bytes
wide. This unaligned data structure would also cause extra transfers in heterogeneous multicore
systems whose minimal DMA transfer granularity is 16 bytes, since most of these systems require that
DMA transfers are aligned to a minimum size.
Programmers should be aware of prefetching effects when designing for data distribution. Since the
data is transferred to the processor in units larger than single word blocks, this could result in extra
words being transferred at the same time. If the requesting processor uses the extra words before the
words are changed by other processors or replaced by the requesting processor, additional data transfer
is not needed; this is referred to as the "prefetching effect" (this also applies to single-core systems but
the effect is more significant in multicore systems).
In cache-coherent, shared memory systems, in order to reduce the occurrence of coherence operations
and to exploit data locality (fetch once, reuse many times), cores share the memory in a coherence unit
with specific size (typically anywhere from 32 to 256 bytes, depending on the processor and system
architecture).
In a cache coherent system, which adopts one of the popular cache coherent protocols such as MSI and
MESIliii, only one core at a time can write to a specific cache line. The copies of a cache line cached
by other cores in read-only mode ("Shared" state in the MESI protocol) must be invalidated before the
line is written, while a cache line can be shared among cores as long as all accesses to it are reads.
False sharing means that coherence operations occur between two or more cores, for data not actually
shared by the cores. For example, Core0 tries to write to a shared memory location x and Core1 tries to
read from a shared memory location y, where both x and y addresses happen to be on the same cache
line (coherence unit). When Core0 performs write operations to location x and Core1 tries to read from
location y, and if the write operations and read operations are interleaved in time, the system can suffer
from false sharing.
False sharing, and its performance-degrading effect, can be illustrated in three steps (Figure 41): 1) the
cache line A, which contains both words x and y, is read by Core1, and the copy of A becomes read-
only on Core0; 2) the cache line A is invalidated on Core1 when Core0 performs a write operation on
cache line A; and 3) when Core1 wants to read word y, a cache miss occurs in the cache level closest
to Core1, and while the latest copy of cache line A is transferred to Core1, Core0's copy of cache line A
is downgraded to a read-only (shared) state.
Figure 41. False sharing: words x and y reside on the same cache line A, which is cached in the L1 caches of both Core 0 and Core 1.
Since false sharing only occurs when one or more cores try to write to the same cache line, care must
be taken when frequently written data resides on the same cache line as other, less related data.
Specifically, avoid collecting synchronization variables in one compact data structure. While this may
forgo some readability, it will result in better performance.
A simple way to resolve false sharing is to pad the cache line so that frequently written data can reside
on different cache lines. For example, suppose that in ‘struct test { int x; int y; }’ that x is frequently
written by Core0 (Figure 42). We can pad the structure to avoid false sharing at the cost of extra
memory space.
struct test
{
    int x;
    unsigned char padding[CACHE_LINE_SIZE - sizeof(int)];
    int y;
};
Figure 42. Cache-line padding example.
Figure 42 demonstrates a conceptual example and the actual system-dependent directives for
alignment are not shown. Compilers may provide pragmas or library functions for memory alignment,
such as memalign(), but the expression and behavior are system-dependent. To avoid performance
degradation and waste of memory space, it is desirable to collect data accessed by a specific core
(thread) into the same or neighboring cache lines.
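As an illustrative sketch combining padding with aligned allocation (CACHE_LINE_SIZE, NUM_THREADS, and the counter layout are assumptions; posix_memalign() is a POSIX routine, not something specific to this guide):

#include <stdlib.h>

#define CACHE_LINE_SIZE 64
#define NUM_THREADS     8

typedef struct {
    long counter;                                        /* frequently written by one thread */
    unsigned char padding[CACHE_LINE_SIZE - sizeof(long)];
} padded_counter_t;

padded_counter_t *alloc_counters(void)
{
    void *p = NULL;
    /* Align the whole array on a cache-line boundary so each counter
       occupies its own line. */
    if (posix_memalign(&p, CACHE_LINE_SIZE, NUM_THREADS * sizeof(padded_counter_t)) != 0)
        return NULL;
    return (padded_counter_t *)p;
}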
The programmer should consider the effect of automatic prefetching, a feature supported by modern
processor architectures, whereby neighboring cache lines are prefetched according to the program's
access pattern. However, false sharing avoidance, while meeting other criteria, is not always an easy
task.
In an example of cache blocking (Figure 43), the original loop computes the total sum of a 4-Mbyte
input_data array 100 times, and the loop with cache blocking breaks the 4-Mbyte input_data into
multiple chunks that fit within the 32KB cache. Because the 4-Mbyte input_data does not fit in the
cache, processing it in cache-sized chunks significantly improves the cache hit rate.
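A hedged sketch of the idea (the array size, block size, and repeated-sum workload are illustrative and do not reproduce the guide's Figure 43):

#define DATA_SIZE  (4 * 1024 * 1024)   /* 4-Mbyte input array */
#define BLOCK_SIZE (32 * 1024)         /* chunk that fits in a 32KB cache */
#define REPEAT     100

long sum_blocked(const char *input_data)
{
    long total = 0;
    /* Process one cache-sized block at a time and reuse it REPEAT times
       while it is still resident in the cache. */
    for (int block = 0; block < DATA_SIZE; block += BLOCK_SIZE) {
        for (int r = 0; r < REPEAT; r++) {
            for (int i = block; i < block + BLOCK_SIZE; i++)
                total += input_data[i];
        }
    }
    return total;   /* equals 100 times the sum of the array */
}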
Emulating a cache using software (henceforth referred to as a software cacheliv) is a method to attempt
to hide the communication latencies involved in irregular access patterns. To maximize spatial locality
in irregular access patterns, the software cache maintains the already fetched copies of memory for
reuse. In other words, software can be used to emulate a direct-mapped cache or n-way set associative
cache which behaves logically similar to a hardware cache. However, unlike hardware caches which
do not incur a penalty on a cache hit, the software cache must check if the requested cache line is kept
in local storage (scratchpad memory) even with a cache hit, making it difficult to achieve performance
comparable to hardware caches.
X = SOFTWARE_CACHE_READ(SYSTEM_MEMORY_ADDRESS)
Even when there is no change in SYSTEM_MEMORY_ADDRESS, the cache hit check routine is called
whenever executing this code fragment. For best performance, when developing an efficient cache-hit
check mechanism or access-localization technique, treat any subsequent accesses after a cache hit as
local memory accesses (scratchpad memory accesses). To achieve better performance based on access
patterns, the cache algorithm should analyze the access pattern and prefetch the cache lines that will be
used with high probability in the future.
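The following is a hedged sketch of a direct-mapped software cache read on a scratch-pad memory; the line size, number of lines, and the dma_fetch() transfer routine are assumptions, not part of the guide:

#include <stddef.h>
#include <stdint.h>

#define SC_LINE_SIZE 128
#define SC_NUM_LINES 64

typedef struct {
    uintptr_t tag;                     /* system-memory address of the line */
    int       valid;
    unsigned char data[SC_LINE_SIZE];  /* copy held in scratch-pad memory */
} sc_line_t;

static sc_line_t sc_lines[SC_NUM_LINES];

/* Hypothetical DMA routine that copies one line from system memory. */
extern void dma_fetch(void *dst, uintptr_t src, size_t size);

unsigned char software_cache_read(uintptr_t addr)
{
    uintptr_t line_addr = addr & ~(uintptr_t)(SC_LINE_SIZE - 1);
    unsigned  index     = (unsigned)((line_addr / SC_LINE_SIZE) % SC_NUM_LINES);
    sc_line_t *line     = &sc_lines[index];

    /* Hit check: executed on every access, even when the line is present. */
    if (!line->valid || line->tag != line_addr) {
        dma_fetch(line->data, line_addr, SC_LINE_SIZE);  /* miss: fetch the line */
        line->tag   = line_addr;
        line->valid = 1;
    }
    return line->data[addr - line_addr];
}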
If the shared queue length is too short, the producer may lack the memory bandwidth and the
mechanism will reduce the number of cores for the consumer task. On the other hand, if the shared
queue length is too long, the mechanism increases the number of cores for the consumer task to
accelerate the consumer-side processing. Figure 44 shows the effect of applying a shared queue on a
parallel H.264 decode running on a quad-core processor (the solid line represents the enhanced
speedup delivered by this technique).
Figure 44. Speedup of the parallel H.264 decode on one to four cores, comparing the original implementation with the queue-based one (speedup axis from 0 to 1.6).
Many applications benefit from using multi-threading; but when the performance of an application is
limited by the contention for the shared data or bus bandwidth, additional threads do not improve
performance. Instead, these threads only waste on-chip power. Thus, for power-efficient and high-
performance execution, it is important to choose the right number (and distribution) of threads.
The cost of communicating a message of size n can be modeled as:
Tcom(n) = α + β × n
where n is the size of the message, α is the startup time due to the latency, and β is the time for sending
one data unit, limited by the available bandwidth. Because of the non-trivial startup time, it is usually
better to gather many small messages into larger ones when possible to increase the effective
communication bandwidth.
1. Sending non-contiguous data is usually less efficient than sending contiguous data.
2. Do not use messages that are too large. Some APIs (e.g. MPI implementation) may switch
protocols when dealing with very large messages.
3. In some cases, the layout of processes/threads on cores may affect performance due to the
communication network latency and the routing strategy.
1. Message-passing techniques on shared memory systems are not always efficient compared
with direct memory-to-memory transfers; this is due to the extra memory copies required to
implement the sending and receiving operations. State-of-the-art implementations avoid
memory copies for large messages by using zero-copy protocols.
2. When the receive operation is posted before the corresponding send, the message passing API
(e.g. MCAPI) can store received data directly in the destination buffer, thus saving the cost
of buffering the data.
/* File: collectiveOperations.c */
//NAÏVE BROADCASTING OF DATA WITH POINT-TO-POINT MESSAGES
…
if (me == sender){
// send the data to every other task
for (i=0;i<nbtask;i++){
if (i != me){
send(data);
}
}
} else {
// every other task receives the data once from the sender
receive(data);
}
#include "mpi.h"
//BROADCASTING DATA WITH MPI
…
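The elided MPI version would typically rely on the MPI_Bcast collective; a minimal sketch (buffer type, count, and root rank are illustrative):

#include "mpi.h"

void broadcast_data(int *data, int count, int root)
{
    /* Every rank, including the root, calls MPI_Bcast; the library chooses
       an efficient algorithm (e.g. a tree) internally. */
    MPI_Bcast(data, count, MPI_INT, root, MPI_COMM_WORLD);
}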
The key element of MultiBench is the Multi-Instance Test Harness (MITH) that provides both a
framework for coordinating scalability analysis, as well as an abstraction layer to facilitate porting to
different platforms. Using the code kernels (such as angle-to-time computation, finite impulse response
filtering, IP packet checking, image rotation, and video encoding), MITH provides a method to analyze
compute scalability through a platform framework that can link together these individual kernels into
workloads with varying degrees of complexity.
MITH supports two types of parallelism: worker and context. Worker parallelism is analogous to data
decomposition. In MITH, kernels are referred to as work items. Parallel elements in a work item are
referred to as workers. For example, the image rotation kernel is parallelized by data decomposition:
splitting up the scan-lines across a specified number of workers. Context level parallelism allows the
user to set the number of workload instances to run concurrently. For example, a context of 8 would
cause MITH to launch 8 concurrent instances of the same workload, which may or may not be the
optimal set up for an 8-core processor. Contexts are synonymous with task-level parallelism. If one
context is used, the kernels inside the workload execute serially on one logical core. If the user chooses
to run more than one context, the kernels will execute in parallel across the logical cores assigned by the
O/S schedulerlvii. However, even if the contexts execute serially, the workers inside the benchmark may
execute in parallel.
The two types of parallelism may also be combined. For example, on a four-core machine, it is possible
to launch two image rotation kernel contexts, and then instruct MITH to use two workers per kernel. The
following diagrams illustrate the relationship between contexts and workers on a four-core machine.
(NOTE: The XCMD= statements above the diagrams refer to the corresponding Makefile configuration
options for the benchmark run.) In the diagram, each workload is comprised of three kernels (or work
items), K1, K2 and K3, and each Kernel may have one or more workers, [W1 … Wn].
Throughput - defined in iterations per second, each platform executes a workload a specific number of
times per second. The user adjusts the number of contexts to find the optimal throughput for each
workload for the given hardware.
Performance Scaling - this unit-less value defines how well performance scales with more computing
resources assigned to the workloads. This is done by comparing a workload running with a single
worker to that same workload with multiple workers. The user adjusts the number of workers to find the
optimal scaling for each workload.
Using MultiBench as an example, it consists of three separate marks: MultiMark, ParallelMark, and
MixMark. Each of these marks is comprised of approximately two dozen workloads.
MultiMark
MultiMark consolidates the best throughput using workloads with only one work item, each of which
uses only one worker. The calculated throughput factor is 10 times the geometric mean of the iterations
per second achieved with the best configuration for each workload (Note: 10 is a multiplication factor).
Each work item uses only one worker (-w1), and multiple copies of the task can be performed in parallel
to take advantage of concurrent hardware resources (-cN). All workloads in this mark use a 4MByte
datasetlviii.
ParallelMark
This mark consolidates the best throughput of workloads with only one work item that each use multiple
workers. The calculated throughput factor is 10 times the geometric mean of the iterations per second
achieved with the best configuration for each workload (Note: 10 is a multiplication factor). Only one
work item may be executed at a time, and multiple workers may be used to take advantage of concurrent
hardware resources (-wN). All workloads in this mark use a 4MB dataset.
MixMark
MixMark is perhaps the most telling mark: it consolidates the best throughput of workloads with
multiple different work items. These workloads are closest to workloads run on actual systems. The
calculated throughput factor is 10 times the geometric mean of the iterations per second achieved with
the best configuration for each workload (Note: 10 is a multiplication factor). All workloads in this mark
use a 4MB dataset.
Sample scores on a simulated platform with 16 cores (Figure 47) show some interesting information
even without diving into the details of specific workloads. For example, consider the fact that the
MultiMark Scaling Factor is 8.9 rather than a number closer to 16, which one might hope for with 16
cores. (Figure 47 lists the Throughput Factor and Scale Factor for each mark.)
This limitation is likely related to memory bottlenecks, as single-worker workloads have very little
synchronization and thus are less dependent on the synchronization efficiency of the platform and
operating system. A more detailed examination of the individual results may be needed to yield more
answers and highlight the trends better than a single-number representation can.
Figure 48 compares the results from two dual-core platforms. Note the significant difference in the
Performance Factors between these two platforms. While both platforms use Linux and GCC and both
have the same core frequency, they are based on different processor architectures with different memory
hierarchies. Platform #1 has a shared L2 cache while Platform #2 has separate L2 caches.
Figure 48. Performance Factor and Scale Factor for Platform #1 and Platform #2.
Basic analysis of the results shows that Platform #1 is 30% faster on MultiMark and 80% faster on
ParallelMark. Insights into why the performance is so different don’t necessarily require sophisticated
performance analysis tools. Often, simply looking deeper into the specific scores of different workloads
and correlating those with architectural differences is enough to figure out where the differences
come from. For example, synchronization overhead and cache coherency traffic both contribute to the
significantly lower ParallelMark score for Platform #2.
The first set of results (Figure 49) shows the number of iterations/second for the overall AutoBench 2.0
(taking the geometric mean of all workloads in this suite).
In the Cx-4y labels, x is the number of contexts launched (1, 2, 4, or 8), which controls core utilization,
and y is the size of the dataset per context (4k or 4M). The ‘P’ suffix indicates the perfect scaling result
extrapolated from a single core. The goal here is not to compare these two processors, but to demonstrate
how differently architectures can scale. In Figure 50, with the 4-Mbyte data sizes, the overall performance
is significantly reduced as expected, as is the overall scaling once the number of contexts launched
reaches 4 or greater.
Figure 49. AutoBench 2.0 iterations per second with 4-Kbyte datasets (C1-4k through C8-4k) for the Intel and NXP platforms, together with their perfect-scaling references (Intel-P, NXP-P).
Figure 50. AutoBench 2.0 iterations per second with 4-Mbyte datasets (C1-4M through C8-4M) for the same platforms and perfect-scaling references.
Asymmetric Homogeneous Multi-*: homogeneous multi-* with more than one micro-architecture
implemented in different processor cores that use the same ISA.
Coarse-grain Multi-threading (CMT): processor core with more than one context that share
resources but only one context can execute code at a time. Execution of the multiple contexts is time
sliced by cycle, memory access, or some other mechanism. An example of a CMT processor is the Sun
UltraSPARC T2 (“Niagara 2”) processor. The C in CMT may also stand for Cooperative or Chip
without change in definition.
Context: program-counter and set of registers within a processor core capable of executing a program.
Contexts are frequently referred to as threads or strands in hardware documentation and can be thought
of as a “logical” processor. Contexts may be defined to only support some levels in the privilege
hierarchy (user-level to kernel-level to hypervisor-level).
Homogeneous Multi-*: multi-* where all processor cores implement the same ISA. This does not
imply that the software is AMP, SMP, or a combination thereof, only that the hardware is
homogeneous.
Many-Core Processor: multicore processor with a large number of cores (typically greater than 16).
Multi-*: multicore processor, multicore system, or multi-processor system. Multi-* systems are often
referred to as Parallel Systems and their use as Parallel Computing.
Multi-Processor System: computer with more than one processor core available to the end user.
N-way System or N-core System: computer with N contexts consisting of either more than one
processor or one processor with more than one context or core.
NxM-way System or NxM-core System: computer with N processors and M contexts or cores per
processor. The next logical step of an NxMxL-way system with N processors and M cores per
processor and L contexts per core is not used.
Processor (CPU): one or more processor cores implemented on one or more chips, in a single
package from the end user perspective.
Processor Core: integrated circuit block implementing an instruction set (ISA), producing a unit that
is user-programmable.
Simultaneous Multi-threading (SMT): processor core with more than one context that share
resources within the processor core where multiple contexts can execute code at the same time. A
single-core SMT processor and a multicore processor differ in that the SMT contexts share resources
within the processor core while the multicore contexts do not. Either may share resources outside of
the processor core. An example of a SMT processor is the Intel Pentium with Hyperthreading. An
example of a Multicore SMT processor, meaning a processor where there are multiple processor cores
each of which is individually SMT, are the Intel Core i7 based processors.
Symmetric Homogeneous Multi-*: homogeneous multi-* where all the processor cores have the
same micro-architecture (and the same ISA).
System on a Chip (SoC): single chip containing at least one processor and at least one non-processor
device of relevance to the system design, such as computation accelerators, memory controllers, timers,
or similar. In general, SoC tends to refer to highly integrated chips that can indeed work as a complete
or almost complete system in their own right (adding external memory should be sufficient).
Local Memory: physical storage that is only accessible from a subset of a system. An example is the
memory associated with each SPE in the Cell BE processor from Sony/Toshiba/IBM. It is generally
necessary to state the system subset to which the memory is local.
Shared Memory: physical storage that is accessible from multiple parts of a system. As with local
memory, it is generally necessary to state what system portions share the memory.
Concurrent Tasks: two or more tasks operating at the same time. No relation among the tasks other
than that they execute at the same time is implied.
Multi-threaded (MT) Application: application with at least one process containing more than one
thread.
Multiprocessing (MP) Application: application consisting of more than one process, each with its
own mapping from virtual to physical addresses. The processes may use an operating system to
dynamically share one or more contexts or may be statically mapped to separate contexts.
Process: program running on one or more contexts with a specific mapping between virtual and
physical addresses. Conceptually, a process contains an instruction pointer, register values, a stack,
and a region of reserved memory that stays active throughout the life of a process.
Task: unit of work with in a process. Depending on coding style, a task may be performed by a
process, a thread, or the use of an accelerator.
Assumptions:
Supports heterogeneous systems consisting of different kinds of compute units (e.g., CPU, GPU,
DSP, FPGA).
Figure 51. Tool flow for developing and analyzing parallel software: compilers and parallel libraries produce a binary; static analysis, dynamic analysis, and active testing frameworks, an application debugger, and system-wide performance data collection operate on the (instrumented) binary, which executes on the target system or on a simulator/virtual platform.
C.1.1 Compilers
A number of compilers are available for compiling applications, depending upon the choice of the
threading API. As indicated in Figure 51, the compiler may modify code for generating trace
information, using either source-to-source, static binary, or dynamic instrumentation. Source-to-source
instrumentation modifies source code prior to pre-processing and compilation. Static binary
instrumentation modifies compiled binary code prior to execution.
C.1.3 Debuggers
Complexities resulting from non-deterministic thread scheduling and pre-emption, and dependencies
between control flow and data flow, can make debugging multithreaded applications more
complicated. In addition, issues caused by thread interactions, such as deadlocks and race conditions,
may also be masked by the use of debuggers.
Debuggers for concurrent systems may use a number of approaches, including traditional debugging
and event-based debugging. In traditional debugging (or break-point debugging), one sequential
debugger is used per parallel process, which limits the information that can be made available by these
debuggers. Event-based (or monitoring) debuggers provide better replay functionality for multi-
threaded applications, but may have more overhead. The use of a given threading API may impact the
choice of debuggers.
Profilers can provide behavioral data for the executed control paths. Code coverage may be improved
by using multiple application runs with carefully selected input data and artificial fault injection.
When selecting a profiler, the programmer must understand if the profiler requires specific hardware
support, whether the profiler does system-wide profiling, or if the focus is only on user applications. In
addition, the profilers may be designed to analyze utilization of select system resources, such as thread
profiling, call stack sampling, memory profiling, cache profiling, and heap profiling.
Flow chart (initialization): wait for the connection to be established with mcapi_wait(); once connected, proceed to the message process; clean up MCAPI with mcapi_finalize() on shutdown or failure.
/**************
IPC connection
***************/
/* MCAPI definitions */
#define LOCAL_NODEID 1
#define REMOTE_NODEID 0
/* port definitions */
... ports[2] = { ... , ... };
mcapi_pktchan_recv_hndl_t send_handle;
mcapi_pktchan_recv_hndl_t recv_handle;
mcapi_endpoint_t local_rm_endpoint;
mcapi_endpoint_t remote_rm_endpoint;
/***********************************************************************
*
* FUNCTION
*
* Initialization
*
* DESCRIPTION
*
* This task initializes the IPC.
*
***********************************************************************/
void Initialization()
{
    mcapi_version_t  mcapi_version;
    mcapi_status_t   mcapi_status;
    mcapi_endpoint_t local_send_endpoint;
    mcapi_endpoint_t local_recv_endpoint;
    mcapi_endpoint_t remote_recv_endpoint;
    mcapi_request_t  request;
    size_t           size;

    /* Initialize MCAPI */
    mcapi_initialize(LOCAL_NODEID, &mcapi_version, &mcapi_status);

    /* Create a local send endpoint for print job processing. */
    if (mcapi_status == MCAPI_SUCCESS)
    {
        /* ... */
    }
    /* ... */
    printf("Connected \r\n");
    /* ... */
    if (mcapi_status == MCAPI_SUCCESS)
    {
        /* ... */
    }
}
{
    STATUS status;
    mcapi_status_t mcapi_status;
    UINT32 size;
    UINT32 bytesReceived, type, cmd;
    unsigned char *in_buffer;
    unsigned char out_buffer[MAX_SIZE];
    mcapi_uint64_t tmp;
    /* Receive image from foreign node */
    mcapi_pktchan_recv(recv_handle, (void **)&in_buffer, (size_t *)&size, &mcapi_status);
    /* Do something with the MCAPI buffer, then free it */
    if (mcapi_status == MCAPI_SUCCESS) { /* ... */ }
    …
    /* Respond to message with data to send in out_buffer */
    mcapi_pktchan_send(send_handle, out_buffer, MAX_SIZE, &mcapi_status);
    if (mcapi_status != MCAPI_SUCCESS) { /* ... */ }
}
i. POSIX Threads as specified in The Open Group Base Specifications Issue 6, IEEE Std 1003.1, is used with no nonstandard extensions.
xli. Shameem Akhter, Jason Roberts, "Multi-Core Programming: Increasing Performance through Software Multithreading."
xlii. Daniel G. Waddington, Nilabya Roy, Douglas C. Schmidt, "Dynamic Analysis and Profiling of Multi-threaded Systems," http://www.cs.wustl.edu/~schmidt/PDF/DSIS_Chapter_Waddington.pdf, 2007.
xliii. Yusen Li, Feng Wang, Gang Wang, Xiaoguang Liu, Jing Liu, "MKtrace: An innovative debugging tool for multi-threaded programs on multiprocessor systems," Proceedings of the 14th Asia Pacific Software Engineering Conference, 2007.
xliv. Max Domeika, "Software Development for Embedded Multicore Systems," Newnespress.com, 2008.
xlv. Tasneem Brutch, "Migration to Multicore: Tools that can help," USENIX ;login:, Oct. 2009.
xlvi. M. Naik, C. S. Park, K. Sen, and D. Gay, "Effective Static Deadlock Detection," ICSE 2009, Vancouver, Canada.
xlvii. N. Ayewah, W. Pugh, D. Hovemeyer, J. D. Morgenthaler, and J. Penix, "Using Static Analysis to Find Bugs," IEEE Software.
lii. U. Banerjee, B. Bliss, Z. Ma, P. Petersen, "A Theory of Data Race Detection," Proceedings of the 2006 Workshop on Parallel and Distributed Systems: Testing and Debugging (International Symposium on Software Testing and Analysis), Portland, Maine, http://support.intel.com/support/performancetools/sb/CS-026934.htm.
liii. The MSI (Modified, Shared, Invalid) and MESI (Modified, Exclusive, Shared, Invalid) protocols are named after the states that cache lines can have, and are commonly used in shared memory multiprocessor systems. They allow only one "Exclusive" or "Modified" copy of a cache line in the system, serializing write accesses to the cache line and thus keeping the caches coherent. Variants of these protocols, such as MOSI and MOESI, have also been researched.
liv. This should be distinguished from a software-controlled hardware cache, where cache control such as flushing and the coherence protocol is handled by software while the other operations are performed by hardware.
lv. We assumed that the programmer has applied all possible techniques for data locality.
lvi. Scalability of Macroblock-level Parallelism for H.264 Decoding. The IEEE Fifteenth International Conference on Parallel and Distributed Systems (ICPADS), December 8-11, 2009, Shenzhen, China. http://alvarez.site.ac.upc.edu/papers/parallel_h264_icpads_2009.pdf.
lvii. The default implementation of MITH uses the POSIX pthread library and launches all contexts and workers as threads, making it easy to port this benchmark suite to any system whose operating system supports the POSIX library. For advanced users, the MITH abstraction layer provides a method to implement custom execution scheduling.
lviii. All MITH-based benchmarks support options for a small (4-Kbyte) and a large (4-Mbyte) data size.