Debugging, Profiling, Performance Analysis, Optimization
[Figure: PCAM design flow: The Problem → Initial tasks (Partitioning) → Communication → Combined Tasks (Agglomeration) → Final Program (Mapping)]
Functional Decomposition
Task parallelism
Divide the computation, then associate the data
Independent tasks of the same problem
Data Decomposition
Same operation performed on different data
Divide data into pieces, then associate computation
Decomposition Methods
Functional Decomposition
Focusing on computations can reveal structure in a problem
[Figure: example decomposition into Atmosphere Model, Hydrology Model, Ocean Model, and Land Surface Model components; grid reprinted with permission of Dr. Phu V. Luong, Coastal and Hydraulics Laboratory, ERDC]
Domain Decomposition
Example: Computing Pi
We want to compute π
One method: the method of darts*
The ratio of the area of an inscribed circle to the area of its enclosing square is proportional to π
Method of Darts
Imagine a dartboard with a circle of radius R inscribed in a square (side length 2R)
Area of circle = πR²
Area of square = (2R)² = 4R²
Area of circle / Area of square = πR² / 4R² = π/4
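As a concrete illustration, here is a minimal serial sketch of the method of darts in C (an assumed implementation, not taken from the slides): throw random darts at the square and count how many land inside the inscribed circle.

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const long n_darts = 10000000;
    long hits = 0;

    srand(12345);
    for (long i = 0; i < n_darts; i++) {
        /* random dart in the square [-R, R] x [-R, R], with R = 1 */
        double x = 2.0 * rand() / RAND_MAX - 1.0;
        double y = 2.0 * rand() / RAND_MAX - 1.0;
        if (x * x + y * y <= 1.0)   /* inside the inscribed circle */
            hits++;
    }
    /* hits / n_darts approximates (area of circle) / (area of square) = pi/4 */
    printf("pi ~= %f\n", 4.0 * (double)hits / n_darts);
    return 0;
}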
Parallelization Strategies
Which tasks are independent of each other?
Which tasks must be performed sequentially?
Use the PCAM parallel algorithm design strategy
Partition
Communication
Determine the communication pattern among tasks:
each processor throws its dart(s), then sends the results back to the manager process
Agglomeration
Combine into coarser-grained tasks, if necessary, to reduce communication requirements or other costs
To get a good value of π, we must use millions of darts
We don't have millions of processors available
Furthermore, communication between the manager and millions of worker processors would be very expensive
Solution: divide the dart throws evenly between the processors, so each processor does a share of the work
Mapping
Assign tasks to processors, subject to the tradeoff between communication cost and concurrency
Assign the role of manager to processor 0
Processor 0 will receive the tallies from all the other processors and will compute the final value of π
Every processor, including the manager, will perform an equal share of the dart throws
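A hedged sketch of this mapping in MPI (assumed code, not from the slides): every rank, including the manager rank 0, throws an equal share of darts; rank 0 collects the tallies (here with MPI_Reduce) and computes the final value of π.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const long total_darts = 100000000;
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    long my_darts = total_darts / size;   /* equal share of the work per process */
    long my_hits = 0;
    srand(rank + 1);                      /* a different random stream per rank */
    for (long i = 0; i < my_darts; i++) {
        double x = 2.0 * rand() / RAND_MAX - 1.0;
        double y = 2.0 * rand() / RAND_MAX - 1.0;
        if (x * x + y * y <= 1.0)
            my_hits++;
    }

    /* the manager (rank 0) receives and sums the tallies from all ranks */
    long all_hits = 0;
    MPI_Reduce(&my_hits, &all_hits, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("pi ~= %f\n", 4.0 * (double)all_hits / ((double)my_darts * size));

    MPI_Finalize();
    return 0;
}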
Stress Testing
Data races
Atomicity violations
Deadlocks
Memory consistency errors
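To make the first of these concrete, here is a minimal pthreads sketch of a data race (a hypothetical example, not from the slides): two threads update a shared counter without synchronization, so updates are lost nondeterministically.

#include <pthread.h>
#include <stdio.h>

static long counter = 0;              /* shared, unprotected */

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 1000000; i++)
        counter++;                    /* unsynchronized read-modify-write */
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %ld (expected 2000000)\n", counter);
    return 0;
}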
Static Analysis
[Diagram: tuning cycle: Instrumentation → Measure → Analyze → Modify / Tune → Usage / Production]
Development Cycle
Analysis: Intel Parallel Amplifier
Design (Introduce Threads):
Intel Performance libraries: IPP and MKL
OpenMP* (Intel Parallel Composer)
Explicit threading (Win32*, Pthreads*)
Workflow
Advisor Overview
Hotspot Analysis
Use Parallel Amplifier to find the hotspots in the application; it identifies the time-consuming regions.

PrimeSingle <start> <end>
Usage: ./PrimeSingle 1 1000000

bool TestForPrime(int val)
{
    // let's start checking from 3
    int limit, factor = 3;
    limit = (long)(sqrtf((float)val) + 0.5f);
    while( (factor <= limit) && (val % factor) )
        factor++;
    return (factor > limit);
}

void FindPrimes(int start, int end)
{
    // start is always odd
    int range = end - start + 1;
    for( int i = start; i <= end; i += 2 )
    {
        if( TestForPrime(i) )
            globalPrimes[gPrimesFound++] = i;
        ShowProgress(i, range);
    }
}
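A hedged sketch of how threads might be introduced here with OpenMP (an assumed variant, not the slides' threaded code): the shared globalPrimes array and gPrimesFound counter force a critical section, which is exactly the contention analyzed in the following slides.

void FindPrimes(int start, int end)
{
    // start is always odd
    int range = end - start + 1;

    #pragma omp parallel for schedule(static)
    for (int i = start; i <= end; i += 2) {
        if (TestForPrime(i)) {
            // protects the shared array and counter, but serializes the threads
            #pragma omp critical
            globalPrimes[gPrimesFound++] = i;
        }
        ShowProgress(i, range);
    }
}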
Motivation
Deadlocks
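A minimal pthreads sketch of a lock-ordering deadlock (a hypothetical example, not the slides' original): each thread holds one mutex and waits for the other, so neither can proceed.

#include <pthread.h>

static pthread_mutex_t lock_a = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t lock_b = PTHREAD_MUTEX_INITIALIZER;

static void *thread1(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&lock_a);
    pthread_mutex_lock(&lock_b);   /* waits for lock_b held by thread2 */
    pthread_mutex_unlock(&lock_b);
    pthread_mutex_unlock(&lock_a);
    return NULL;
}

static void *thread2(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&lock_b);
    pthread_mutex_lock(&lock_a);   /* waits for lock_a held by thread1 */
    pthread_mutex_unlock(&lock_a);
    pthread_mutex_unlock(&lock_b);
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, thread1, NULL);
    pthread_create(&t2, NULL, thread2, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}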
Instrumentation: background
Adds calls to library to record information
Workload Guidelines
Binary Instrumentation
The graph shows that a significant portion of the time is spent idle as a result of the critical section.
FindPrimes() and ShowProgress() are both heavily impacted by the idle time occurring in the critical section.
Performance
Double-click ShowProgress in the second-largest critical section
This implementation has implicit synchronization calls (printf)
This limits scaling performance due to the resulting context switches
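One possible mitigation, sketched here as an assumption rather than the tutorial's exact fix (the ShowProgress signature is assumed): only call printf when the displayed percentage actually changes, which removes most of the implicit synchronization.

void ShowProgress(int val, int range)
{
    static int lastPercent = -1;   // note: still shared; a threaded version needs its own protection
    int percentDone = (int)((double)val / range * 100.0);
    if (percentDone != lastPercent) {
        lastPercent = percentDone;
        printf("\b\b\b\b%3d%%", percentDone);   // far fewer printf calls
    }
}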
Comparative Analysis
Load balance: improper distribution of parallel work
Synchronization: excessive use of global data, contention for the same synchronization object
Parallel overhead: due to thread creation, scheduling, etc.
Granularity: not sufficient parallel work
Load Imbalance
[Figure: timeline of four threads between "start threads" and "join threads"; the uneven split of busy and idle time per thread shows the imbalance]
Static assignment
Are the same number of tasks assigned to each thread?
Do the tasks take different processing time?
Dynamic assignment
Is there one big task being assigned?
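A hedged OpenMP sketch (assumed example) contrasting the two assignments: with iteration costs that grow with i, static chunks leave some threads idle, while dynamic chunks rebalance at run time.

#include <stdio.h>

/* simulated task whose cost grows with i, so equal iteration counts are not equal work */
static double work(int i)
{
    double s = 0.0;
    for (int k = 0; k < (i + 1) * 1000; k++)
        s += k * 0.5;
    return s;
}

int main(void)
{
    const int n = 2000;
    double total = 0.0;

    /* static assignment: the same number of iterations per thread */
    #pragma omp parallel for schedule(static) reduction(+:total)
    for (int i = 0; i < n; i++)
        total += work(i);

    /* dynamic assignment: threads grab chunks of 8 iterations as they finish */
    #pragma omp parallel for schedule(dynamic, 8) reduction(+:total)
    for (int i = 0; i < n; i++)
        total += work(i);

    printf("total = %f\n", total);
    return 0;
}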
Unbalanced Workloads
Synchronization
[Figure: timeline of four threads showing busy, idle, and in-critical time; the threads serialize on the critical section]
Synchronization Fixes
Eliminate synchronization
Expensive but necessary evil
Use storage local to threads
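A hedged sketch of the thread-local-storage fix for the prime example (assumed code; buffer size is arbitrary): each thread collects primes into a private buffer and merges it into the shared array once, instead of entering the critical section for every prime found.

void FindPrimes(int start, int end)
{
    #pragma omp parallel
    {
        int localPrimes[65536];   // per-thread buffer; size chosen arbitrarily
        int localCount = 0;

        #pragma omp for schedule(static)
        for (int i = start; i <= end; i += 2)   // start is always odd
            if (TestForPrime(i))
                localPrimes[localCount++] = i;

        // one merge per thread instead of one critical section per prime
        #pragma omp critical
        for (int k = 0; k < localCount; k++)
            globalPrimes[gPrimesFound++] = localPrimes[k];
    }
}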
General Optimizations
Serial Optimizations
Serial optimizations along the critical path should affect execution time
Parallel Optimizations
Reduce synchronization object contention
Balance workload
Functional parallelism
Measurement: Profiling vs. Tracing
Profiling: summary statistics of performance metrics
Tracing: when and where events took place along a global timeline
Measurement: Profiling
Profiling helps to expose performance bottlenecks and hotspots
80/20 rule (Pareto principle): often 80% of the execution time is spent in 20% of your application
Optimize what matters; don't waste time optimizing things that have a negligible overall influence on performance
Implementation
Sampling: periodic OS interrupts or hardware counter traps
Build a histogram of sampled program counter (PC) values
Hotspots will show up as regions with many hits
Measurement: direct insertion of measurement code
Measure at the start and end of the regions of interest and compute the difference (see the sketch below)
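A minimal sketch of direct measurement code (generic POSIX timing, not any particular tool's API): read a clock at the start and end of the region of interest and report the difference.

#include <stdio.h>
#include <time.h>

void region_of_interest(void);   /* placeholder for the code being measured */

static double now_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

void measured_call(void)
{
    double t0 = now_seconds();
    region_of_interest();
    double t1 = now_seconds();
    printf("region_of_interest took %.6f s\n", t1 - t0);
}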
Measurement: Tracing
Tracing: recording of information about significant points (events) during program execution
entering/exiting a code region (function, loop, block, ...)
thread/process interactions (e.g., send/receive message)
Save the information in an event record:
timestamp
CPU identifier, thread identifier
event type and event-specific information
An event trace is a time-sequenced stream of event records
It can be used to reconstruct dynamic program behavior
Tracing typically requires code instrumentation
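As a sketch of what such an event record might look like (field layout assumed, not taken from any specific tracing tool):

#include <stdint.h>

typedef enum { EV_ENTER, EV_EXIT, EV_SEND, EV_RECV } event_type_t;

typedef struct {
    uint64_t     timestamp;   /* when the event occurred */
    uint32_t     cpu_id;      /* CPU identifier */
    uint32_t     thread_id;   /* thread/process identifier */
    event_type_t type;        /* enter/exit region, send/receive message, ... */
    uint64_t     info;        /* event-specific data, e.g. a region or message id */
} event_record_t;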
Visualization
Interactive exploration
Statistical analysis
Modeling
Automated analysis
Try to cope with huge amounts of performance data by automation
Examples: Paradyn, KOJAK, Scalasca, Periscope
Vampir/IPM: message communication statistics
ParaProf viewer (from the TAU toolset)
Automation Example
Late sender pattern
This pattern can be detected automatically by analyzing the trace
Cache behavior
Branching behavior
Memory and resource contention and access patterns
Pipeline stalls
Floating point efficiency
Instructions per cycle
What is PAPI?
PAPI (Performance Application Programming Interface) provides a portable interface to hardware performance counters.
Preset Events
Standard set of over 100 events for application performance tuning
No standardization of the exact definitions
Mapped to either single or linear combinations of native events on each platform
Use papi_avail to see which preset events are available on a given platform
Native Events
Any event countable by the CPU
Same interface as for preset events
Use the papi_native_avail utility to see all available native events
High-Level User API
PAPI_num_counters(): get the number of hardware counters available on the system
PAPI_flips(float *rtime, float *ptime, long long *flpins, float *mflips): simplified call to get Mflips/s (floating point instruction rate), real and processor time
PAPI_flops(float *rtime, float *ptime, long long *flpops, float *mflops): simplified call to get Mflops/s (floating point operation rate), real and processor time
PAPI_ipc(float *rtime, float *ptime, long long *ins, float *ipc): gets instructions per cycle, real and processor time
PAPI_accum_counters(long long *values, int array_len): add current counts to the values array and reset counters
PAPI_read_counters(long long *values, int array_len): copy current counts to the values array and reset counters
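A hedged usage sketch of the classic PAPI high-level API listed above (assumed surrounding code; compute_kernel() is a placeholder): the first PAPI_flops call starts the counters, and subsequent calls report totals and rates since that first call.

#include <stdio.h>
#include <papi.h>

void compute_kernel(void);   /* placeholder for the floating-point work */

int main(void)
{
    float rtime, ptime, mflops;
    long long flpops;

    /* first call initializes and starts the counters */
    if (PAPI_flops(&rtime, &ptime, &flpops, &mflops) != PAPI_OK)
        return 1;

    compute_kernel();

    /* report real time, process time, operation count, and Mflop/s rate */
    PAPI_flops(&rtime, &ptime, &flpops, &mflops);
    printf("real %.3f s, proc %.3f s, %lld flops, %.1f Mflop/s\n",
           rtime, ptime, flpops, mflops);
    return 0;
}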
TAU (UO)
PerfSuite (NCSA)
HPCToolkit (Rice)
KOJAK, Scalasca (FZ Juelich, UTK)
Open|Speedshop (SGI)
ompP (UCB)
IPM (LBNL)
[Diagram: PAPI layers: high-level API and developer (low-level) API on top of the operating system and the performance counter hardware]
Timeline display
[Timeline detail: message send and receive operations]
Communication statistics
Message histograms
Collective operations (connection lines)
Activity chart
Process-local displays
Effects of zooming
[Figure: selecting one iteration in the timeline updates the summary chart and the message statistics]
Traditional tool: huge amount of measurement data
Automatic tool: simple (one screen, two commands, three panes), showing the relevant problems and data
[Timeline diagrams: per-process time lines showing MPI_Send on one process and MPI_Recv, or MPI_Irecv + MPI_Wait, on the other; the shaded region is the waiting time]
Late Sender: time lost waiting caused by a blocking receive operation posted earlier than the corresponding send operation
Late Receiver: time lost waiting in a blocking send operation until the corresponding receive operation is called
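A hedged sketch (assumed code, not from the slides) that would produce the Late Sender pattern in a trace: rank 1 posts its blocking receive immediately, while rank 0 works before sending, so rank 1's waiting time is charged to the pattern.

#include <mpi.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    int rank, buf = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        sleep(2);                                   /* imbalance: the sender is late */
        MPI_Send(&buf, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* receive posted early; the wait time shows up as "Late Sender" */
        MPI_Recv(&buf, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    MPI_Finalize();
    return 0;
}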
Performance property: what problem?
Region tree: where in the source code? In what context?
Color coding
Location: how is the problem distributed across the machine?
Topology display
Shows the distribution of the pattern over the hardware topology
Easily scales to even larger systems
http://www.cs.uoregon.edu/research/tau/
Multi-level performance instrumentation
Multi-language automatic source instrumentation
[TAU/ParaVis 3D profile visualization, 32k processors: each point is a thread of execution, with a total of four metrics shown in relation]
PerfExplorer Correlation Analysis (Flash)
Tools References
CalFuzzer (Java): http://srl.cs.berkeley.edu/~ksen/calfuzzer/
Thrille (C): http://github.com/nicholasjalbert/Thrille