Debugging, Profiling, Performance Analysis, Optimization
[Figure: PCAM design flow: The Problem → Initial tasks (Partitioning) → Communication → Combined Tasks (Agglomeration) → Final Program (Mapping)]
Functional Decomposition
Task parallelism
Divide the computation, then associate the data
Independent tasks of the same problem
Data Decomposition
Same operation performed on different data
Divide data into pieces, then associate computation
Decomposition Methods
Functional Decomposition
Focusing on computations can reveal structure in a problem
[Figure: example decomposition into Atmosphere Model, Hydrology Model, Ocean Model, and Land Surface Model components; grid reprinted with permission of Dr. Phu V. Luong, Coastal and Hydraulics Laboratory, ERDC]
Domain Decomposition
Example: Computing Pi
We want to compute π
One method: the method of darts*
The ratio of the area of an inscribed circle to the area of its enclosing square is proportional to π
Method of Darts
Imagine a dartboard with a circle of radius R inscribed in a square (side length 2R)
Area of circle = πR²
Area of square = (2R)² = 4R²
Area of circle / Area of square = πR² / 4R² = π/4
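As a concrete illustration, here is a minimal serial sketch of the method of darts in C (an assumed implementation, not taken from the slides): throw random darts at the square and count how many land inside the inscribed circle.

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const long n_darts = 10000000;
    long hits = 0;

    srand(12345);
    for (long i = 0; i < n_darts; i++) {
        /* random dart in the square [-R, R] x [-R, R], with R = 1 */
        double x = 2.0 * rand() / RAND_MAX - 1.0;
        double y = 2.0 * rand() / RAND_MAX - 1.0;
        if (x * x + y * y <= 1.0)   /* inside the inscribed circle */
            hits++;
    }
    /* hits / n_darts approximates (area of circle) / (area of square) = pi/4 */
    printf("pi ~= %f\n", 4.0 * (double)hits / n_darts);
    return 0;
}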
Parallelization Strategies
Which tasks are independent of each other?
Which tasks must be performed sequentially?
Use the PCAM parallel algorithm design strategy
Partition
Communication
Determine the communication pattern among tasks:
each processor throws its dart(s), then sends the results back to the manager process
Agglomeration
Combine into coarser-grained tasks, if necessary, to reduce communication requirements or other costs
To get a good value of π, we must use millions of darts
We don't have millions of processors available
Furthermore, communication between the manager and millions of worker processors would be very expensive
Solution: divide the dart throws evenly between the processors, so each processor does a share of the work
Mapping
Assign tasks to processors, subject to the tradeoff between communication cost and concurrency
Assign the role of manager to processor 0
Processor 0 will receive the tallies from all the other processors and will compute the final value of π
Every processor, including the manager, will perform an equal share of the dart throws
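A hedged sketch of this mapping in MPI (assumed code, not from the slides): every rank, including the manager rank 0, throws an equal share of darts; rank 0 collects the tallies (here with MPI_Reduce) and computes the final value of π.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const long total_darts = 100000000;
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    long my_darts = total_darts / size;   /* equal share of the work per process */
    long my_hits = 0;
    srand(rank + 1);                      /* a different random stream per rank */
    for (long i = 0; i < my_darts; i++) {
        double x = 2.0 * rand() / RAND_MAX - 1.0;
        double y = 2.0 * rand() / RAND_MAX - 1.0;
        if (x * x + y * y <= 1.0)
            my_hits++;
    }

    /* the manager (rank 0) receives and sums the tallies from all ranks */
    long all_hits = 0;
    MPI_Reduce(&my_hits, &all_hits, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("pi ~= %f\n", 4.0 * (double)all_hits / ((double)my_darts * size));

    MPI_Finalize();
    return 0;
}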
Stress Testing
Data races
Atomicity violations
Deadlocks
Memory consistency errors
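To make the first of these concrete, here is a minimal pthreads sketch of a data race (a hypothetical example, not from the slides): two threads update a shared counter without synchronization, so updates are lost nondeterministically.

#include <pthread.h>
#include <stdio.h>

static long counter = 0;              /* shared, unprotected */

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 1000000; i++)
        counter++;                    /* unsynchronized read-modify-write */
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %ld (expected 2000000)\n", counter);
    return 0;
}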
Static Analysis
[Diagram: tuning cycle: Instrumentation → Measure → Analyze → Modify / Tune → Usage / Production]
Development Cycle
Analysis: Intel Parallel Amplifier
Design (Introduce Threads):
Intel Performance libraries: IPP and MKL
OpenMP* (Intel Parallel Composer)
Explicit threading (Win32*, Pthreads*)
Workflow
Advisor Overview
Hotspot Analysis
Use Parallel Amplifier to find the hotspots in the application; it identifies the time-consuming regions.

PrimeSingle <start> <end>
Usage: ./PrimeSingle 1 1000000

bool TestForPrime(int val)
{
    // let's start checking from 3
    int limit, factor = 3;
    limit = (long)(sqrtf((float)val) + 0.5f);
    while( (factor <= limit) && (val % factor) )
        factor++;
    return (factor > limit);
}

void FindPrimes(int start, int end)
{
    // start is always odd
    int range = end - start + 1;
    for( int i = start; i <= end; i += 2 )
    {
        if( TestForPrime(i) )
            globalPrimes[gPrimesFound++] = i;
        ShowProgress(i, range);
    }
}
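A hedged sketch of how threads might be introduced here with OpenMP (an assumed variant, not the slides' threaded code): the shared globalPrimes array and gPrimesFound counter force a critical section, which is exactly the contention analyzed in the following slides.

void FindPrimes(int start, int end)
{
    // start is always odd
    int range = end - start + 1;

    #pragma omp parallel for schedule(static)
    for (int i = start; i <= end; i += 2) {
        if (TestForPrime(i)) {
            // protects the shared array and counter, but serializes the threads
            #pragma omp critical
            globalPrimes[gPrimesFound++] = i;
        }
        ShowProgress(i, range);
    }
}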
Motivation
Deadlocks
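A minimal pthreads sketch of a lock-ordering deadlock (a hypothetical example, not the slides' original): each thread holds one mutex and waits for the other, so neither can proceed.

#include <pthread.h>

static pthread_mutex_t lock_a = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t lock_b = PTHREAD_MUTEX_INITIALIZER;

static void *thread1(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&lock_a);
    pthread_mutex_lock(&lock_b);   /* waits for lock_b held by thread2 */
    pthread_mutex_unlock(&lock_b);
    pthread_mutex_unlock(&lock_a);
    return NULL;
}

static void *thread2(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&lock_b);
    pthread_mutex_lock(&lock_a);   /* waits for lock_a held by thread1 */
    pthread_mutex_unlock(&lock_a);
    pthread_mutex_unlock(&lock_b);
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, thread1, NULL);
    pthread_create(&t2, NULL, thread2, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}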
Instrumentation: background
Adds calls to library to record information
Workload Guidelines
Binary Instrumentation
The graph shows that a significant portion of the time is spent idle as a result of the critical section.
FindPrimes() and ShowProgress() are both heavily impacted by the idle time occurring in the critical section.
Performance
Double-click ShowProgress in the second-largest critical section
This implementation has implicit synchronization calls (printf)
This limits scaling performance due to the resulting context switches
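One possible mitigation, sketched here as an assumption rather than the tutorial's exact fix (the ShowProgress signature is assumed): only call printf when the displayed percentage actually changes, which removes most of the implicit synchronization.

void ShowProgress(int val, int range)
{
    static int lastPercent = -1;   // note: still shared; a threaded version needs its own protection
    int percentDone = (int)((double)val / range * 100.0);
    if (percentDone != lastPercent) {
        lastPercent = percentDone;
        printf("\b\b\b\b%3d%%", percentDone);   // far fewer printf calls
    }
}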
Comparative Analysis
Load balance: improper distribution of parallel work
Synchronization: excessive use of global data, contention for the same synchronization object
Parallel overhead: due to thread creation, scheduling, etc.
Granularity: not sufficient parallel work
Load Imbalance
[Figure: timeline of four threads between "start threads" and "join threads"; the uneven split of busy and idle time per thread shows the imbalance]
Static assignment
Are the same number of tasks assigned to each thread?
Do the tasks take different processing time?
Dynamic assignment
Is there one big task being assigned?
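A hedged OpenMP sketch (assumed example) contrasting the two assignments: with iteration costs that grow with i, static chunks leave some threads idle, while dynamic chunks rebalance at run time.

#include <stdio.h>

/* simulated task whose cost grows with i, so equal iteration counts are not equal work */
static double work(int i)
{
    double s = 0.0;
    for (int k = 0; k < (i + 1) * 1000; k++)
        s += k * 0.5;
    return s;
}

int main(void)
{
    const int n = 2000;
    double total = 0.0;

    /* static assignment: the same number of iterations per thread */
    #pragma omp parallel for schedule(static) reduction(+:total)
    for (int i = 0; i < n; i++)
        total += work(i);

    /* dynamic assignment: threads grab chunks of 8 iterations as they finish */
    #pragma omp parallel for schedule(dynamic, 8) reduction(+:total)
    for (int i = 0; i < n; i++)
        total += work(i);

    printf("total = %f\n", total);
    return 0;
}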
Unbalanced Workloads
Synchronization
[Figure: timeline of four threads showing busy, idle, and in-critical time; the threads serialize on the critical section]
Synchronization Fixes
Eliminate synchronization
Expensive but necessary evil
Use storage local to threads
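A hedged sketch of the thread-local-storage fix for the prime example (assumed code; buffer size is arbitrary): each thread collects primes into a private buffer and merges it into the shared array once, instead of entering the critical section for every prime found.

void FindPrimes(int start, int end)
{
    #pragma omp parallel
    {
        int localPrimes[65536];   // per-thread buffer; size chosen arbitrarily
        int localCount = 0;

        #pragma omp for schedule(static)
        for (int i = start; i <= end; i += 2)   // start is always odd
            if (TestForPrime(i))
                localPrimes[localCount++] = i;

        // one merge per thread instead of one critical section per prime
        #pragma omp critical
        for (int k = 0; k < localCount; k++)
            globalPrimes[gPrimesFound++] = localPrimes[k];
    }
}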
General Optimizations
Serial Optimizations
Serial optimizations along the critical path should affect execution time
Parallel Optimizations
Reduce synchronization object contention
Balance workload
Functional parallelism
Measurement: Profiling vs. Tracing
Profiling: summary statistics of performance metrics
Tracing: when and where events took place along a global timeline
Measurement: Profiling
Profiling helps to expose performance bottlenecks and hotspots
80/20 rule (Pareto principle): often 80% of the execution time is spent in 20% of your application
Optimize what matters; don't waste time optimizing things that have a negligible overall influence on performance
Implementation
Sampling: periodic OS interrupts or hardware counter traps
Build a histogram of sampled program counter (PC) values
Hotspots will show up as regions with many hits
Measurement: direct insertion of measurement code
Measure at the start and end of the regions of interest and compute the difference (see the sketch below)
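A minimal sketch of direct measurement code (generic POSIX timing, not any particular tool's API): read a clock at the start and end of the region of interest and report the difference.

#include <stdio.h>
#include <time.h>

void region_of_interest(void);   /* placeholder for the code being measured */

static double now_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

void measured_call(void)
{
    double t0 = now_seconds();
    region_of_interest();
    double t1 = now_seconds();
    printf("region_of_interest took %.6f s\n", t1 - t0);
}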
Measurement: Tracing
Tracing: recording of information about significant points (events) during program execution
entering/exiting a code region (function, loop, block, ...)
thread/process interactions (e.g., send/receive message)
Save the information in an event record:
timestamp
CPU identifier, thread identifier
event type and event-specific information
An event trace is a time-sequenced stream of event records
It can be used to reconstruct dynamic program behavior
Tracing typically requires code instrumentation
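As a sketch of what such an event record might look like (field layout assumed, not taken from any specific tracing tool):

#include <stdint.h>

typedef enum { EV_ENTER, EV_EXIT, EV_SEND, EV_RECV } event_type_t;

typedef struct {
    uint64_t     timestamp;   /* when the event occurred */
    uint32_t     cpu_id;      /* CPU identifier */
    uint32_t     thread_id;   /* thread/process identifier */
    event_type_t type;        /* enter/exit region, send/receive message, ... */
    uint64_t     info;        /* event-specific data, e.g. a region or message id */
} event_record_t;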
Visualization
Interactive exploration
Statistical analysis
Modeling
Automated analysis
Try to cope with huge amounts of performance data by automation
Examples: Paradyn, KOJAK, Scalasca, Periscope
Vampir/IPM: message communication statistics
ParaProf viewer (from the TAU toolset)
Automation Example
Late sender pattern
This pattern can be detected automatically by analyzing the trace
Cache behavior
Branching behavior
Memory and resource contention and access patterns
Pipeline stalls
Floating point efficiency
Instructions per cycle
What is PAPI?
PAPI (Performance Application Programming Interface) provides a portable interface to hardware performance counters.
Preset Events
Standard set of over 100 events for application performance tuning
No standardization of the exact definitions
Mapped to either single or linear combinations of native events on each platform
Use papi_avail to see which preset events are available on a given platform
Native Events
Any event countable by the CPU
Same interface as for preset events
Use the papi_native_avail utility to see all available native events
High-Level User API
PAPI_num_counters(): get the number of hardware counters available on the system
PAPI_flips(float *rtime, float *ptime, long long *flpins, float *mflips): simplified call to get Mflips/s (floating point instruction rate), real and processor time
PAPI_flops(float *rtime, float *ptime, long long *flpops, float *mflops): simplified call to get Mflops/s (floating point operation rate), real and processor time
PAPI_ipc(float *rtime, float *ptime, long long *ins, float *ipc): gets instructions per cycle, real and processor time
PAPI_accum_counters(long long *values, int array_len): add current counts to the values array and reset counters
PAPI_read_counters(long long *values, int array_len): copy current counts to the values array and reset counters
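A hedged usage sketch of the classic PAPI high-level API listed above (assumed surrounding code; compute_kernel() is a placeholder): the first PAPI_flops call starts the counters, and subsequent calls report totals and rates since that first call.

#include <stdio.h>
#include <papi.h>

void compute_kernel(void);   /* placeholder for the floating-point work */

int main(void)
{
    float rtime, ptime, mflops;
    long long flpops;

    /* first call initializes and starts the counters */
    if (PAPI_flops(&rtime, &ptime, &flpops, &mflops) != PAPI_OK)
        return 1;

    compute_kernel();

    /* report real time, process time, operation count, and Mflop/s rate */
    PAPI_flops(&rtime, &ptime, &flpops, &mflops);
    printf("real %.3f s, proc %.3f s, %lld flops, %.1f Mflop/s\n",
           rtime, ptime, flpops, mflops);
    return 0;
}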
TAU (UO)
PerfSuite (NCSA)
HPCToolkit (Rice)
KOJAK, Scalasca (FZ Juelich, UTK)
Open|Speedshop (SGI)
ompP (UCB)
IPM (LBNL)
[Diagram: PAPI layers: high-level API and developer (low-level) API on top of the operating system and the performance counter hardware]
Timeline display
[Timeline detail: message send and receive operations]
Communication statistics
Message histograms
Collective operations (connection lines)
Activity chart
Process-local displays
Effects of zooming
[Figure: selecting one iteration in the timeline updates the summary chart and the message statistics]
Traditional tool: huge amount of measurement data
Automatic tool: simple (one screen, two commands, three panes), showing the relevant problems and data
[Timeline diagrams: per-process time lines showing MPI_Send on one process and MPI_Recv, or MPI_Irecv + MPI_Wait, on the other; the shaded region is the waiting time]
Late Sender: time lost waiting caused by a blocking receive operation posted earlier than the corresponding send operation
Late Receiver: time lost waiting in a blocking send operation until the corresponding receive operation is called
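A hedged sketch (assumed code, not from the slides) that would produce the Late Sender pattern in a trace: rank 1 posts its blocking receive immediately, while rank 0 works before sending, so rank 1's waiting time is charged to the pattern.

#include <mpi.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    int rank, buf = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        sleep(2);                                   /* imbalance: the sender is late */
        MPI_Send(&buf, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* receive posted early; the wait time shows up as "Late Sender" */
        MPI_Recv(&buf, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    MPI_Finalize();
    return 0;
}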
Performance property: what problem?
Region tree: where in the source code? In what context?
Color coding
Location: how is the problem distributed across the machine?
Topology display
Shows the distribution of the pattern over the hardware topology
Easily scales to even larger systems
http://www.cs.uoregon.edu/research/tau/
Multi-level performance instrumentation
Multi-language automatic source instrumentation
[TAU/ParaVis 3D profile visualization, 32k processors: each point is a thread of execution, with a total of four metrics shown in relation]
PerfExplorer Correlation Analysis (Flash)
Tools References
CalFuzzer (Java): http://srl.cs.berkeley.edu/~ksen/calfuzzer/
Thrille (C): http://github.com/nicholasjalbert/Thrille