AMD uProf User Guide v4.2
Trademarks
AMD, the AMD Arrow logo, and combinations thereof are trademarks of Advanced Micro Devices, Inc.
Microsoft, Windows, Windows Vista, and DirectX are registered trademarks of Microsoft Corporation.
Other product names used in this publication are for identification purposes only and may be trademarks of their
respective companies.
Rovi Corporation
This device is protected by U.S. patents and other intellectual property rights. The use of Rovi Corporation's copy
protection technology in the device must be authorized by Rovi Corporation and is intended for home and other limited
pay-per-view uses only, unless otherwise authorized in writing by Rovi Corporation.
USE OF THIS PRODUCT IN ANY MANNER THAT COMPLIES WITH THE MPEG-2 STANDARD IS EXPRESSLY
PROHIBITED WITHOUT A LICENSE UNDER APPLICABLE PATENTS IN THE MPEG-2 PATENT PORTFOLIO,
WHICH LICENSE IS AVAILABLE FROM MPEG LA, L.L.C., 6312 S. FIDDLERS GREEN CIRCLE, SUITE 400E,
GREENWOOD VILLAGE, COLORADO 80111.
This document describes how to use AMD uProf to perform CPU, GPU, and power analysis of applications running on Windows®, Linux®, and FreeBSD® operating systems on AMD processors. The latest version of this document is available on the AMD uProf website (https://www.amd.com/en/developer/uprof.html).
Intended Audience
This document is intended for software developers and performance-tuning experts who want to improve the performance of their applications. It assumes a prior understanding of CPU architecture, the concepts of threads, processes, and load modules, and familiarity with performance analysis concepts.
Conventions
The following conventions are used in this document:
Table 1. Conventions
• GUI element – A graphical user interface element, such as a menu name or button
• > – Menu item within a menu
• [] – Contents are optional in syntax
• … – Preceding element can be repeated
• | – Denotes “or”; the two options cannot be used together
• File name – Name of a file, a path, or a source code snippet
• Command – Command name or command phrase
• Hyperlink – Link to an external website
Abbreviations
The following abbreviations are used in this document:
Table 2. Abbreviations
• APERF – Actual Performance Frequency Clock Counter
• ASLR – Address Space Layout Randomization
• CCD – Core Complex Die; can contain one or more CCXs and GMI2 Fabric ports connecting to the IOD
• CLI – Command Line Interface
• CPI – Cycles Per Instruction
• CSV – Comma Separated Values format
• DC – Data Cache
• DIMM – Dual In-line Memory Module
• DRAM – Dynamic Random Access Memory
• DTLB – Data Translation Lookaside Buffer
• EBP – Event Based Profiling; uses Core PMC events
• GUI – Graphical User Interface
• IBS – Instruction Based Sampling
• IC – Instruction Cache
• IOD – IO Die
• IPC – Instructions Per Cycle
• ITLB – Instruction Translation Lookaside Buffer
• MPERF – Maximum Performance Frequency Clock Counter
• MSR – Model Specific Register
• NB – Northbridge
• OS – Operating System
• P0Freq – P0 State Frequency
• PMC – Performance Monitoring Counter
• PTI – Per Thousand Instructions
• RAPL – Running Average Power Limit
• SMU – System Management Unit
• TBP – Time-Based Profiling
• TSC – Time Stamp Counter
• UMC – Unified Memory Controller; up to 8 UMCs per socket, each supporting one DRAM channel, and each channel can have up to 2 DIMMs
Terminology
The following terms are used in this document:
Table 3. Terminology
• AMD uProf – The name of the product.
• AMDuProfGUI – The name of the graphical user interface tool.
• AMDuProfCLI – The name of the command line interface tool.
• AMDuProfPcm – The name of the command line interface tool for System Analysis.
• AMDuProfSys – The name of the Python-based command line interface tool for System Analysis.
• Client – Instance of AMD uProf or AMDuProfCLI running on a host system.
• Core – The logical core number; a core can contain one or two CPUs depending on the SMT configuration.
• Core Complex (CCX) – Consists of one or more cores and a cache system.
• CPU – Logical CPU numbers as seen by the operating system.
• Host system – System on which the AMD uProf client process runs.
• L1D, L1I Cache – Data and instruction caches exclusive to each CPU.
• L2 Cache – Shared by all the CPUs within a core.
• L3 Cache – Shared by all the CPUs within a CCX.
• Node – Logical NUMA node.
• Performance Profiling (or) CPU Profiling – Identify and analyze performance bottlenecks. Performance Profiling and CPU Profiling denote the same activity.
• Socket – The logical socket number; a socket can contain multiple nodes.
• System Analysis – Refers to the AMDuProfPcm or AMDuProfSys tools.
• Target system – System on which the profile data is collected.
Part 1:
Introduction
Chapter 1 Introduction
1.1 Overview
AMD uProf is a performance analysis tool for applications running on Windows and Linux operating systems. It allows developers to understand and improve the runtime performance of their applications.
AMD uProf offers the following functionalities:
• Performance Analysis (CPU Profile)
To identify runtime performance bottlenecks of the application.
• System Analysis
To monitor system performance metrics, such as IPC and memory bandwidth.
• Live Power Profile
To monitor thermal and power characteristics of the system.
AMD uProf has the following user interfaces:
Table 1. User Interface
• AMDuProf – GUI to perform CPU and Power Profile – Windows and Linux
• AMDuProfCLI – CLI to perform CPU and Power Profile – Windows, Linux, and FreeBSD
• AMDuProfPcm – CLI to perform System Analysis – Windows, Linux, and FreeBSD
• AMDPerf/AMDuProfSys.py – Python script for System Analysis – Windows and Linux
1.2 Specification
AMD uProf supports the following specifications. For a detailed list of supported processors and
operating systems, refer to the AMD uProf Release Notes available at:
https://www.amd.com/en/developer/uprof.html
1.2.1 Processors
• AMD “Zen”-based CPU and APU Processors
• AMD Instinct™ MI100 and MI200 accelerators (for GPU kernel profiling and tracing)
• Intel® processors (time-based profiling only)
• Run the AMD uProf CLI outside the Docker container to profile and analyze a target application running in the container:
– Attach the uProf CLI to the containerized process using the --pid option during collection. Alternatively, collect system-wide data and filter by PID during report generation.
– During report generation, provide the path to the binary and source code (--bin-path and --src-path) of the profiled application running in the container. The AMD uProf GUI doesn't support profiling and analysis in this mode.
1.3 Installing AMD uProf
1.3.1 Windows
Run the 64-bit Windows installer binary AMDuProf-x.y.z.exe.
After the installation is complete, the executables, libraries, and other required files are installed in the folder C:\Program Files\AMD\AMDuProf\.
1.3.2 Linux
1.3.2.1 Installing Using a tar File
Extract the tar.bz2 binary file to install AMD uProf:
$ tar -xf AMDuProf_Linux_x64_x.y.z.tar.bz2
Install the AMD uProf RPM package using the rpm or yum command:
$ sudo rpm --install amduprof-x.y-z.x86_64.rpm
$ sudo yum install amduprof-x.y-z.x86_64.rpm
Install the AMD uProf Debian package using the dpkg command:
$ sudo dpkg --install amduprof_x.y-z_amd64.deb
After the installation is complete, the executables, libraries, and other required files are installed in the directory /opt/AMDuProf_X.Y-ZZZ/.
While installing AMD uProf using the RPM or Debian installer packages, the Power Profiler Linux Driver is built and installed automatically. However, if you downloaded the AMD uProf tar.bz2 archive, you must install the Power Profiler Linux Driver manually, as shown in the sketch below.
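A minimal sketch of the manual installation from a tar-based setup (the AMDPowerProfilerDriver.sh script is the same one used for the uninstallation and DKMS installation steps shown below):
$ cd AMDuProf_Linux_x64_x.y.z/bin
$ sudo ./AMDPowerProfilerDriver.sh install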
The GCC and MAKE software packages are prerequisites for installing the Power Profiler Driver. If you do not have these packages, you can install them using the following commands:
On RHEL and CentOS distros:
$ sudo yum install gcc make
On Debian/Ubuntu distros:
$ sudo apt install build-essential
The installer creates a source tree for the Power Profiler Driver in the directory /usr/src/AMDPowerProfiler-<version>. All the source files required for module compilation are in this directory and are distributed under the MIT license.
To uninstall the driver, run the following commands:
$ cd AMDuProf_Linux_x64_x.y.z/bin
$ sudo ./AMDPowerProfilerDriver.sh uninstall
On Linux machines, the power profiling driver can also be installed using the Dynamic Kernel Module Support (DKMS) framework. DKMS automatically rebuilds the Power Profiler Driver module whenever the running kernel changes, which saves you from manually upgrading the driver module. The DKMS package must be installed on the target machine before running the installation steps mentioned in the above section.
The AMDPowerProfilerDriver.sh installer script automatically handles the DKMS-related configuration if the DKMS package is installed on the target machine.
Example (for Ubuntu distros):
$ sudo apt-get install dkms
$ tar -xf AMDuProf_Linux_x64_x.y.z.tar.bz2
$ cd AMDuProf_Linux_x64_x.y.z/bin
$ sudo ./AMDPowerProfilerDriver.sh install
If you upgrade the kernel version frequently, it is recommended to use DKMS for the installation.
AMD ROCm v5.5 installation is required for GPU tracing and profiling. After installing ROCm 5.5, make sure the symbolic link /opt/rocm/ points to /opt/rocm-5.5.0/:
$ ln -s /opt/rocm-5.5.0/ /opt/rocm/
If you install AMD uProf using the RPM/DEB installer, the installer runs the check script automatically and reports the status of the BCC installation and of eBPF (Extended Berkeley Packet Filter) support on the host.
1.3.3 FreeBSD
Extract the tar.bz2 binary file to install AMD uProf:
$ tar -xf AMDuProf_FreeBSD_x64_x.y.z.tar.bz2
• Linux
– A sample matrix multiplication program with makefile
/opt/AMDuProf_X.Y-ZZZ/Examples/AMDTClassicMat/
– An OpenMP example program and its variants with makefile
/opt/AMDuProf_X.Y-ZZZ/Examples/CollatzSequence_C-OMP/
• FreeBSD
– A sample matrix multiplication program with makefile
/<install dir>/AMDuProf_FreeBSD_x64_X.Y.ZZZ/Examples/AMDTClassicMat/
1.5 Support
For support options, the latest documentation, and downloads, refer to the AMD uProf portal (https://www.amd.com/en/developer/uprof.html).
Part 2:
System Analysis
2.1 Overview
The System Analysis utility AMDuProfPcm helps monitor basic performance metrics for AMD EPYC™ 7001, AMD EPYC™ 7002, AMD EPYC™ 7003, and AMD EPYC™ 9000 series (Family 17h and 19h) processors. This utility periodically collects the CPU core, L3, and DF performance event counts and reports various metrics. It is supported on Windows, Linux, and FreeBSD.
2.1.1 Prerequisite(s)
2.1.1.1 Linux
• AMDuProfPcm requires the MSR driver and either root privileges or read/write permissions for the /dev/cpu/*/msr devices, only when it is used with --msr for data collection.
• The NMI watchdog must be disabled (echo 0 > /proc/sys/kernel/nmi_watchdog).
• Set /proc/sys/kernel/perf_event_paranoid to -1.
• Use the following command to load the msr driver (these steps are combined into a single sketch after this list):
$ modprobe msr
• The roofline plotting script (AMDuProfModelling.py) requires Python 3.x and the Python module 'matplotlib'.
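The prerequisites above can be combined into a one-time setup sketch (assuming a root shell on Linux; these are the same procfs entries and driver named in the list):
# modprobe msr
# echo 0 > /proc/sys/kernel/nmi_watchdog
# echo -1 > /proc/sys/kernel/perf_event_paranoid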
2.1.1.2 FreeBSD
AMDuProfPcm uses the cpuctl module and requires either root privileges or read/write permissions for the /dev/cpuctl* devices.
Synopsis:
AMDuProfPcm [<COMMANDS>] [<OPTIONS>] -- <PROGRAM> [<ARGS>]
2.2 Options
The following table lists all the options:
Table 2. AMDuProfPcm Options
• -h – Displays this help information on the console/terminal.
• -m <metric,...> – Metrics to report; the default metric group is 'ipc'. The supported metric groups and the corresponding metrics are platform, OS, and hypervisor specific. Run AMDuProfPcm -h to get the list of supported metrics. The following metric groups are supported:
– ipc – reports metrics such as CEF, Utilization, CPI, and IPC
– fp – reports GFLOPS
– l1 – L1 cache related metrics (DC access and IC fetch miss ratio)
– l2 – L2D and L2I cache related access/hit/miss metrics
– l3 – L3 cache metrics such as L3 Access, L3 Miss, and Average Miss Latency
– dc – advanced caching metrics such as DC refills by source (supported only on AMD “Zen3” and AMD “Zen4” processors)
– memory – approximate memory read and write bandwidths in GB/s for all the channels
– pcie – PCIe bandwidth in GB/s (supported only on AMD “Zen2” and AMD “Zen4” processors)
– xgmi – approximate xGMI outbound data bytes in GB/s for all the remote links
– dma – DMA bandwidth in GB/s (supported only on AMD “Zen4” processors)
– swpfdc – software prefetch data cache from various nodes and CCX (supported only on AMD “Zen3” and AMD “Zen4” processors)
– hwpfdc – hardware prefetch data cache from various nodes and CCX (supported only on AMD “Zen3” and AMD “Zen4” processors)
– pipeline_util – top-down metrics to visualize the bottlenecks in the CPU pipeline (supported only on AMD “Zen4” processors)
Multiple metric groups can be combined with commas; an example follows this table.
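For instance, several metric groups can be requested in one run (a sketch; the option usage mirrors the Examples section later in this chapter, and the output path is arbitrary):
# ./AMDuProfPcm -m ipc,l2,l3 -a -d 30 -o /tmp/pcmdata.csv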
Following are the performance metrics for AMD EPYC™ “Zen 2” core architecture processors:
Table 3. Performance Metrics for AMD EPYC™ “Zen 2”
Metric Group: ipc
• Utilization (%) – Percentage of time the core was running, that is, non-idle time.
• Eff Freq – Core Effective Frequency (CEF), excluding halted cycles, over the sampling period, reported in GHz. The metric is based on CEF = (APERF / TSC) * P0Freq. APERF is incremented in proportion to the actual number of core cycles while the core is in the C0 state. (A worked example follows this table.)
Metric Group: l2
• L2 Miss from DC Miss – The L2 cache misses from DC misses. This metric is in PTI.
• L2 Miss from HWPF – The L2 cache misses from L2 hardware prefetching. This metric is in PTI.
• L2 Hit – All the L2 cache hits. This metric is in PTI.
• L2 Hit from IC Miss – The L2 cache hits from IC misses. This metric is in PTI.
• L2 Hit from DC Miss – The L2 cache hits from DC misses. This metric is in PTI.
• L2 Hit from HWPF – The L2 cache hits from L2 hardware prefetching. This metric is in PTI.
Metric Group: tlb
• L1 ITLB Miss – The instruction fetches that miss in the L1 Instruction Translation Lookaside Buffer (ITLB) but hit in the L2 ITLB, plus the ITLB reloads originating from the page table walker. The table walk requests are made for L1-ITLB misses and L2-ITLB misses. This metric is in PTI.
• L2 ITLB Miss – The number of ITLB reloads from the page table walker due to L1-ITLB and L2-ITLB misses. This metric is in PTI.
• L1 DTLB Miss – The number of L1 Data Translation Lookaside Buffer (DTLB) misses from load/store micro-ops. This event counts both L2-DTLB hits and L2-DTLB misses. This metric is in PTI.
• L2 DTLB Miss – The number of L2 Data Translation Lookaside Buffer (DTLB) misses from load/store micro-ops. This metric is in PTI.
Metric Group: xgmi
• xGMI0 BW, xGMI1 BW, xGMI2 BW, xGMI3 BW (GB/s) – Approximate xGMI outbound data bytes in GB/s for all the remote links.
Metric Group: pcie
• PCIe0, PCIe1, PCIe2, PCIe3 (GB/s) – Approximate PCIe bandwidth in GB/s.
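A worked illustration of the CEF formula above, using hypothetical counter readings rather than measured data: if APERF/TSC = 0.8 over the sampling period and P0Freq = 3.0 GHz, then CEF = 0.8 * 3.0 GHz = 2.4 GHz, that is, the core effectively ran at 2.4 GHz while unhalted.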
Following are the performance metrics for AMD EPYC™ “Zen 3” core architecture processors:
Table 4. Performance Metrics for AMD EPYC™ “Zen 3”
Metric Group: ipc
• Utilization (%) – Percentage of time the core was running, that is, non-idle time.
• Eff Freq – Core Effective Frequency (CEF), excluding halted cycles, over the sampling period, reported in GHz. The metric is based on CEF = (APERF / TSC) * P0Freq. APERF is incremented in proportion to the actual number of core cycles while the core is in the C0 state.
• IPC – Instructions Per Cycle (IPC) is the average number of instructions retired per CPU cycle. This is measured using the Core PMC events PMCx0C0 [Retired Instructions] and PMCx076 [CPU Clocks not Halted]. These PMC events are counted in both OS and User mode.
• CPI – Cycles Per Instruction (CPI) is the multiplicative inverse of the IPC metric. This is one of the basic performance metrics indicating how cache misses, branch mis-predictions, memory latencies, and other bottlenecks affect the execution of an application. A lower CPI value is better.
• Branch Mis-prediction Ratio – The ratio between mis-predicted branches and retired branch instructions.
Metric Group: fp
• Retired SSE/AVX Flops (GFLOPs) – The number of retired SSE/AVX FLOPs.
• Mixed SSE/AVX Stalls – Mixed SSE/AVX stalls. This metric is in per thousand instructions (PTI).
Metric Group: l1
• IC (32B) Fetch Miss Ratio – Instruction cache fetch miss ratio.
• Op Cache (64B) Fetch Miss Ratio – Operation cache fetch miss ratio.
• IC Access – All instruction cache accesses. This metric is in PTI.
• IC Miss – The instruction cache misses. This metric is in PTI.
• DC Access – All the DC accesses. This metric is in PTI.
Metric Group: l2
• L2 Miss from DC Miss – The L2 cache misses from DC misses. This metric is in PTI.
• L2 Miss from HWPF – The L2 cache misses from L2 hardware prefetching. This metric is in PTI.
• L2 Hit – All the L2 cache hits. This metric is in PTI.
• L2 Hit from IC Miss – The L2 cache hits from IC misses. This metric is in PTI.
• L2 Hit from DC Miss – The L2 cache hits from DC misses. This metric is in PTI.
• L2 Hit from HWPF – The L2 cache hits from L2 hardware prefetching. This metric is in PTI.
Metric Group: tlb
• L1 ITLB Miss – The instruction fetches that miss in the L1 Instruction Translation Lookaside Buffer (ITLB) but hit in the L2 ITLB, plus the ITLB reloads originating from the page table walker. The table walk requests are made for L1-ITLB misses and L2-ITLB misses. This metric is in PTI.
• L2 ITLB Miss – The number of ITLB reloads from the page table walker due to L1-ITLB and L2-ITLB misses. This metric is in PTI.
• L1 DTLB Miss – The number of L1 Data Translation Lookaside Buffer (DTLB) misses from load/store micro-ops. This event counts both L2-DTLB hits and L2-DTLB misses. This metric is in PTI.
• L2 DTLB Miss – The number of L2 Data Translation Lookaside Buffer (DTLB) misses from load/store micro-ops. This metric is in PTI.
• All TLBs Flushed – All the TLBs flushed. This metric is in PTI.
Following are the performance metrics for AMD EPYC™ “Zen 4” core architecture processors:
Table 5. Performance Metrics for AMD EPYC™ “Zen 4”
Metric Group: ipc
• Utilization (%) – Percentage of time the core was running, that is, non-idle time.
• Eff Freq – Core Effective Frequency (CEF), excluding halted cycles, over the sampling period, reported in GHz. The metric is based on CEF = (APERF / TSC) * P0Freq. APERF is incremented in proportion to the actual number of core cycles while the core is in the C0 state.
Metric Group: l2
• L2 Miss from DC Miss – The L2 cache misses from DC misses. This metric is in PTI.
• L2 Miss from HWPF – The L2 cache misses from L2 hardware prefetching. This metric is in PTI.
• L2 Hit – All the L2 cache hits. This metric is in PTI.
• L2 Hit from IC Miss – The L2 cache hits from IC misses. This metric is in PTI.
• L2 Hit from DC Miss – The L2 cache hits from DC misses. This metric is in PTI.
• L2 Hit from HWPF – The L2 cache hits from L2 hardware prefetching. This metric is in PTI.
Metric Group: tlb
• L1 ITLB Miss – The instruction fetches that miss in the L1 Instruction Translation Lookaside Buffer (ITLB) but hit in the L2 ITLB, plus the ITLB reloads originating from the page table walker. The table walk requests are made for L1-ITLB misses and L2-ITLB misses. This metric is in PTI.
• L2 ITLB Miss – The number of ITLB reloads from the page table walker due to L1-ITLB and L2-ITLB misses. This metric is in PTI.
• L1 DTLB Miss – The number of L1 Data Translation Lookaside Buffer (DTLB) misses from load/store micro-ops. This event counts both L2-DTLB hits and L2-DTLB misses. This metric is in PTI.
• L2 DTLB Miss – The number of L2 Data Translation Lookaside Buffer (DTLB) misses from load/store micro-ops. This metric is in PTI.
• All TLBs Flushed – All the TLBs flushed.
Metric Group: xgmi
• Local Inbound Read Data Bytes (GB/s) – Local inbound data bytes to the CPU, for example, read data.
• Local Outbound Write Data Bytes (GB/s) – Local outbound data bytes from the CPU, for example, write data.
• Remote Inbound Read Data Bytes (GB/s) – Remote socket inbound data bytes to the CPU, for example, read data.
• Remote Outbound Write Data Bytes (GB/s) – Remote socket outbound data bytes from the CPU, for example, write data.
• xGMI Outbound Data Bytes (GB/s) – Total outbound data bytes in gigabytes per second.
Metric Group: dma (not available on AMD “Zen1”, AMD “Zen2”, and AMD “Zen3” processors)
• Total Upstream DMA Read Write Data Bytes (GB/s) – Total upstream DMA, including reads and writes.
• Local Upstream DMA Read Data Bytes (GB/s) – Local upstream DMA read data bytes.
• Local Upstream DMA Write Data Bytes (GB/s) – Local upstream DMA write data bytes.
• Remote Upstream DMA Read Data Bytes (GB/s) – Remote socket upstream DMA read data bytes.
• Remote Upstream DMA Write Data Bytes (GB/s) – Remote socket upstream DMA write data bytes.
2.3 Commands
The following table lists all the commands:
Table 6. AMDuProfPcm Commands
• roofline – Collects the data required for generating the roofline model (an example invocation follows).
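An example invocation (the same command appears later in this chapter; /tmp/myapp.exe is a placeholder for your application):
$ AMDuProfPcm roofline -o /tmp/myapp-roofline.csv -- /tmp/myapp.exe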
2.4 Examples
2.4.1 Linux and FreeBSD
• Collect IPC data from core 0 for the duration of 60 seconds:
# ./AMDuProfPcm -m ipc -c core=0 -d 60 -o /tmp/pcmdata.csv
• Collect only the memory bandwidth across all the UMCs for the duration of 60 seconds and save the output in the /tmp/pcmdata.csv file:
# ./AMDuProfPcm -m memory -a -d 60 -o /tmp/pcmdata.csv
• Collect IPC data from core 0 and run the program on core 0:
# ./AMDuProfPcm -m ipc -c core=0 -o /tmp/pcmdata.csv -- /usr/bin/taskset -c 0 <application>
• Collect IPC data from cores 0-7 and run the application on cores 0-3:
# ./AMDuProfPcm -m ipc -c core=0-7 -o /tmp/pcmdata.csv -- /usr/bin/taskset -c 0-3 <application>
• Collect IPC and L2 data from core 0, report the cumulative (not timeseries) data, and run the program on core 0:
# ./AMDuProfPcm -m ipc,l2 -c core=0 -o /tmp/pcmdata.csv -C -- /usr/bin/taskset -c 0 <application>
• Print the name, description, and the available unit masks for the specified event:
# ./AMDuProfPcm -z pmcx03
• Plot roofline data and generate a PDF in the output directory /tmp:
AMDuProfModelling.py -i /tmp/roofline.csv -o /tmp/
2.4.2 Windows
Core Metrics
• Collect IPC/L2 metrics for all the cores in CCX=0 for the duration of 30 seconds:
C:\> AMDuProfPcm.exe -m ipc,l2 -c ccx=0 -d 30 -o c:\tmp\pcmdata.csv
• Collect IPC data for 30 seconds from all the cores in the system:
C:\> AMDuProfPcm.exe -m ipc -a -d 30 -o c:\tmp\pcmdata.csv
• Collect IPC and L2 data from all the cores and report the aggregated data at the system and package level:
C:\> AMDuProfPcm.exe -m ipc,l2 -a -o c:\tmp\pcmdata.csv -d 30 -A system,package
• Collect IPC and L2 data from all the cores in CCX=0 and report the cumulative (not timeseries) data:
C:\> AMDuProfPcm.exe -m ipc,l2 -c ccx=0 -o c:\tmp\pcmdata.csv -C -d 30
• Collect IPC and L2 data from all the cores and report the cumulative (not timeseries) data:
C:\> AMDuProfPcm.exe -m ipc,l2 -a -o c:\tmp\pcmdata.csv -C -d 30
• Collect IPC and L2 data from all the cores, report the cumulative (not timeseries) data, and aggregate at the system and package level:
C:\> AMDuProfPcm.exe -m ipc,l2 -a -o c:\tmp\pcmdata.csv -C -A system,package -d 30
L3 Metrics
• Collect L3 data from all the CCXs and report for the duration of 30 seconds:
C:\> AMDuProfPcm.exe -m l3 -a -d 30 -o c:\tmp\pcmdata.csv
• Collect L3 data from all the CCXs and aggregate at system and package level and report for the
duration of 30 seconds:
C:\> AMDuProfPcm.exe -m l3 -a -d 30 -A system,package -o c:\tmp\pcmdata.csv
• Collect L3 data from all the CCXs and aggregate at system and package level and report for the
duration of 30 seconds; also report for the individual CCXs:
C:\> AMDuProfPcm.exe -m l3 -a -d 30 -A system,package,ccx -o c:\tmp\pcmdata.csv
• Collect L3 data from all the CCXs for the duration of 30 seconds and report the cumulative data
(no timeseries data):
C:\> AMDuProfPcm.exe -m l3 -a -d 30 -C -o c:\tmp\pcmdata.csv
• Collect L3 data from all the CCXs, aggregate at the system and package level, and report the cumulative data (no timeseries data):
C:\> AMDuProfPcm.exe -m l3 -a -d 30 -A system,package -C -o c:\tmp\pcmdata.csv
Memory Bandwidth
• Report memory bandwidth for all the memory channels for the duration of 60 seconds and save
the output in c:\tmp\pcmdata.csv file:
C:\> AMDuProfPcm.exe -m memory -a -d 60 -o c:\tmp\pcmdata.csv
• Report total memory bandwidth aggregated at the system level for the duration of 60 seconds and
save the output in c:\tmp\pcmdata.csv file:
C:\> AMDuProfPcm.exe -m memory -a -d 60 -o c:\tmp\pcmdata.csv -A system
• Report total memory bandwidth aggregated at the system level and also report for every memory
channel:
C:\> AMDuProfPcm.exe -m memory -a -d 60 -o c:\tmp\pcmdata.csv -A system,package
• Report total memory bandwidth aggregated at the system level and also report for all the available
memory channels. To report cumulative metric value instead of the timeseries data:
C:\> AMDuProfPcm.exe -m memory -a -d 60 -o c:\tmp\pcmdata.csv -C -A system,package
• Monitor events from core 0 and dump the raw event counts for every sample in a timeseries manner; no metrics report will be generated:
C:\> AMDuProfPcm.exe -m ipc -d 60 -D c:\tmp\pcmdata_dump.csv
• Monitor events from all the cores and dump the raw event counts for every sample in a timeseries manner; no metrics report will be generated:
C:\> AMDuProfPcm.exe -m ipc -a -d 60 -D c:\tmp\pcmdata_dump.csv
Miscellaneous
• Print the name, description, and the available unit masks for the specified event:
C:\> AMDuProfPcm.exe -z pmcx03
• Cumulative reporting of IPC metrics at the end of the benchmark execution, aggregating metrics per processor package:
$ AMDuProfPcm -X -m ipc -C -A package -o /tmp/pcm.csv -- /tmp/myapp.exe
• Cumulative reporting of IPC metrics at the end of the benchmark execution, aggregating metrics at the system level:
$ AMDuProfPcm -X -m ipc -C -A system -o /tmp/pcm.csv -- /tmp/myapp.exe
• Timeseries monitoring of memory bandwidth, reported at the package and memory channel level:
$ AMDuProfPcm -X -m memory -a -A system,package -o /tmp/mem.csv
For better top-down results, disable the NMI watchdog by running the following command as root:
# echo 0 > /proc/sys/kernel/nmi_watchdog
On AMD “Zen4” 9xx4 Series processors, if the Linux kernel doesn't support accessing the DF counters, run the roofline collection with root privilege:
$ AMDuProfPcm roofline -o /tmp/myapp-roofline.csv -- /tmp/myapp.exe
• Use the -a <appname> option to specify the application name to print in the graph chart.
• As this tool uses the maximum theoretical peaks for memory traffic and floating-point performance, you can use benchmarks such as STREAM to get the peak memory bandwidth and HPL or GEMM for the peak FLOPS. Those scores can be used to plot the roofline charts; use the following options (see the sketch after this list):
– --stream <STREAM score>
– --hpl <HPL score>
– --gemm <SGEMM | DGEMM score>
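A hedged sketch of a plotting invocation combining these options (the numeric scores are placeholders for your measured benchmark results, and myapp is a hypothetical application name):
$ AMDuProfModelling.py -i /tmp/myapp-roofline.csv -o /tmp/ -a myapp --stream 380 --hpl 2.1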
Due to multiplexing, the reported metrics may be inconsistent. To minimize the impact of multiplexing, use the option -X. For better results, use taskset to bind the monitored application to a specific set of cores and monitor only the cores on which the monitored application is running.
Run the following command to collect the top-down metrics:
$ sudo AMDuProfPcm -m pipeline_util -c core=0 -A system -o /tmp/myapp-td.csv -- /usr/bin/taskset -c 0 myapp.exe
(or, use the option -X, which does not require root access)
Examples
• Timeseries monitoring of level-1 and level-2 top-down metrics (pipeline utilization) of a single-threaded program:
# AMDuProfPcm -m pipeline_util -c core=1 -o /tmp/td.csv -- /usr/bin/taskset -c 1 /tmp/myapp.exe
3.1 Overview
AMDuProfSys is a Python-based system analysis tool for AMD processors. It can be used to collect hardware events and evaluate simple counter values or complex recipes using the collected raw events. The performance metrics are based on the profile data collected using the Core, L3, DF, and UMC PMCs. This tool can be used to get the overall performance details of the hardware blocks used in the system.
3.5 Set up
Follow the installation steps in the section "Installing AMD uProf" on page 5.
3.5.1 Linux
If the tarball is used, the uProf driver must be installed manually. If you are not using the uProf driver, you can optionally use Linux perf instead. In that case, ensure that the Linux user-space perf tool is installed and that it supports the required PMC event monitoring. If the uProf driver is not used, the command line must include the option --use-linux-perf.
To install the user-space perf tool:
$ sudo apt-get install linux-tools-common linux-tools-generic linux-tools-`uname -r`
The perf_event_paranoid parameter should be set to -1 if system-wide profile data or DF and L3 metrics must be collected:
$ sudo sh -c 'echo -1 >/proc/sys/kernel/perf_event_paranoid'
3.5.2 Windows
The setup file will install all the required components to run AMDuProfSys.
After installation, AMDuProfSys is available in the following directory:
<Installed Directory>/bin/AMDPerf/AMDuProfSys.py
Python Packages
AMDuProfSys requires Python to be installed on the target platform. The minimum supported Python version is 3.6. When the tool is executed for the first time, it will prompt you to install the following Python modules (a combined installation sketch follows this list):
• tqdm — use pip3 install tqdm to install
• xlsxwriter — use pip3 install XlsxWriter to install
• yaml — use apt-get install python-yaml or pip3 install pyyaml to install
• yamlordereddictloader — use pip3 install yamlordereddictloader to install
• rich — use pip3 install rich to install
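A minimal sketch installing all the listed modules in one command (assuming pip3 is available; on Debian-based distros, python-yaml can alternatively be installed via apt-get, as noted above):
$ pip3 install tqdm XlsxWriter pyyaml yamlordereddictloader rich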
Synopsis
AMDuProfSys.py [<OPTIONS>] -- [<PROGRAM>] [<ARGS>]
Common Usages
• Display help:
AMDuProfSys.py -h
• Generate the .csv format report from the session file generated during collection:
AMDuProfSys.py report -i output_core.ses
• Collect timeseries profile data for core metrics (cores 0-5) with an interval of 1000 ms, setting the affinity of the launched application to core 0:
AMDuProfSys.py --config core -C 0-5 -I 1000 --use-linux-perf -T -o output --affinity 0 <application>
Note: Timeseries profile data collection is available only with the option --use-linux-perf.
• Collect CORE, L3, DF, and UMC metrics together:
AMDuProfSys.py collect --config core,l3,df,umc -C 0-10 <application>
3.6 Options
3.6.1 Generic
The following table lists the generic options:
Table 9. AMDuProfSys Generic Options
• -h, --help – Display the usage
• -v, --version – Print the version
• --system-info – System information
• --enable-irperf – Enable irperf. Note: Available only on Linux and requires root privilege.
• --mux-interval-core <ms> – Set the multiplexing interval in millisecond(s) (see the example after this table)
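For instance, a hedged sketch that sets a 4 ms multiplexing interval during a core-metric collection (the value 4 is an arbitrary example; the collect usage follows the Common Usages shown earlier):
AMDuProfSys.py collect --config core --mux-interval-core 4 -C 0 <application>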
3.7 Examples
• Monitor the entire system, collect the metrics defined in the config file, and generate the profile report:
AMDuProfSys.py --config core -a sleep 50
• Launch the program with core affinity set to core 0, monitor that core, and generate the profile report:
AMDuProfSys.py --config core -C 0 taskset -c 0 /tmp/scimark2
3.8 Limitations
• UMC profiling is not available on Linux for the following platforms:
– Family 17h, Models 0x30 - 0x3F
– Family 19h, Models 0x0 - 0xF
• Timeseries profile data collection is available only on Linux using the option --use-linux-perf.
Part 3:
Application Analysis
4.1 Workflow
The AMD uProf workflow has the following phases:
1. Collect — Run the application program and collect the profile data.
2. Translate — Process the profile data to aggregate, correlate, and organize it into a database.
3. Analyze — View and analyze the performance data to identify the bottlenecks.
Profile Target
The profile target is one of the following for which profile data will be collected:
• Application — Launch application and profile that process and its children.
• System — Profile all the running processes and/or kernel.
• Process — Attach to a running application (native applications only).
Profile Type
The profile type defines the type of profile data collected and how the data should be collected. The
following profile types are supported:
• CPU Profile
• CPU Trace
• GPU Profile
• GPU Trace
• System-wide Power Profile
The data collection is defined by the Sampling Configuration:
• Sampling Configuration identifies the set of Sampling Events, their Sampling Interval, and the mode.
• Sampling Event is a resource used to trigger a sampling point at which a sample (profile data) will be collected.
• Sampling Interval defines the number of occurrences of the sampling event after which an interrupt will be generated to collect the sample.
• Mode defines when to count the occurrences of the sampling event – in User mode and/or OS mode.
Type of profile data to collect – Sampled Data:
Sampled Data is the profile data that can be collected when the interrupt is generated (upon the expiry of the sampling interval of a sampling event).
The following table shows the type of profile data collected and the sampling events for each profile type:
Table 12. Sampled Data
CPU Profiling
• Profile data collected: Process ID, Thread ID, IP, Callstack, ETL tracing (Windows only), OpenMP Trace — OMPT (Linux), MPI Trace — PMPI (Linux), OS Trace — Linux BPF
• Sampling events: OS Timer, Core PMC events, IBS
CPU Tracing
• Profile data collected: User mode trace — collects syscall and pthread data; OS trace — collects schedule, diskio, syscall, pthread, and funccount data
• Sampling events: Not applicable
GPU Profiling
• Profile data collected: Perfmon metrics
• Sampling events: Not applicable
GPU Tracing
• Profile data collected: Runtime Trace — HIP and HSA
• Sampling events: Not applicable
For CPU Profiling, there are numerous micro-architecture specific events available to monitor. The tool groups the related events of interest into Predefined Sampling Configurations. For example, Assess Performance is one such configuration, used to get an overall assessment of the performance and to find potential issues for investigation. For more information, refer to “Predefined View Configuration” on page 46.
A Custom Sampling Configuration is one in which you define a sampling configuration with the events of interest.
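As an illustration of how a predefined configuration is selected from the command line (a hedged sketch: the synopsis follows the AMDuProfCLI collect syntax shown in Chapter 6, and tbp is assumed to be the abbreviation for Time-Based Profiling):
$ AMDuProfCLI collect --config tbp <PROGRAM>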
Profile Configuration
A profile configuration identifies all the information used to collect a measurement. It contains information about the profile target, sampling configuration, data to sample, and profile scheduling details.
The GUI saves these profile configuration details with a default name (for example, AMDuProf-TBP-Classic), but you can also name them yourself. As performance analysis is iterative, the configuration is persistent (it can be deleted), so you can reuse it for future data collection runs.
A profile session represents a single performance experiment for a profile configuration. The tool saves all the profile and translated data (in a database) in a folder named <profile config name>-<timestamp>.
Once the profile data is collected, uProf processes the data to aggregate and attribute the samples to the respective processes, threads, load modules, functions, and instructions. This aggregated data is then written into an SQLite database used during the Analyze phase. This translation of the raw profile data happens when the CLI generates the profile report or the GUI generates the visualization.
A View is a set of sampled event data and computed performance metrics displayed either in the GUI pages or in the text report generated by the CLI. Each predefined sampling configuration has a list of associated predefined views.
The tool can be used to filter/view only specific configurations; such a configuration is called a Predefined View. For example, the IPC assessment view lists metrics such as CPU Clocks, Retired Instructions, IPC, and CPI. For more information, refer to “Predefined Sampling Configuration” on page 44.
Notes:
1. The AMDuProf GUI uses the name of the predefined configuration in the above table.
2. The abbreviation (in Table 13 on page 44) is used with the AMDuProfCLI collect command's --config option.
3. The supported predefined configurations and the sampling events used in them are based on the processor family and model.
The following table lists the predefined view configurations for Investigate Data Access:
Table 16. Investigate Data Access Configurations
• IPC assessment (ipc_assess) – Find hotspots with low instruction level parallelism. Provides performance indicators – IPC and CPI.
• Data access assessment (dc_assess) – Provides information about data cache (DC) access, including DC miss rate and DC miss ratio.
• Data access report (dc_focus) – You can use this view to analyze L1 Data Cache (DC) behavior and compare misses versus refills.
The following table lists the predefined view configurations for Investigate Branch:
Table 17. Investigate Branch Configurations
• Investigate Branching (Branch) – You can use this view to find code with a high branch density and poorly predicted branches.
• IPC assessment (ipc_assess) – Find hotspots with low instruction level parallelism. Provides performance indicators – IPC and CPI.
• Branch assessment (br_assess) – You can use this view to find code with a high branch density and poorly predicted branches.
• Taken branch report (taken_focus) – You can use this view to find code with a high number of taken branches.
• Near return report (return_focus) – You can use this view to find code with poorly predicted near returns.
The following table lists the predefined view configurations for Assess Performance (Extended):
Table 18. Assess Performance (Extended) Configurations
• Assess Performance (Extended) (triage_assess_ext) – This view gives an overall picture of performance. You can use it to find possible issues for deeper investigation.
• IPC assessment (ipc_assess) – Find hotspots with low instruction level parallelism. Provides performance indicators – IPC and CPI.
• Branch assessment (br_assess) – Use this view to find code with a high branch density and poorly predicted branches.
• Data access assessment (dc_assess) – Provides information about data cache (DC) access, including DC miss rate and DC miss ratio.
• Misaligned access assessment (misalign_assess) – Identify regions of code that access misaligned data.
The following table lists the predefined view configurations for Investigate Instruction Access:
Table 19. Investigate Instruction Access Configurations
• IPC assessment (ipc_assess) – Find hotspots with low instruction level parallelism. Provides performance indicators – IPC and CPI.
The following table lists the predefined view configurations for Investigate CPI:
Table 20. Investigate CPI Configurations
• IPC assessment (ipc_assess) – Find hotspots with low instruction level parallelism. Provides performance indicators – IPC and CPI.
The following table lists the predefined view configurations for Instruction Based Sampling:
Table 21. Instruction Based Sampling Configurations
• IBS fetch overall (ibs_fetch_overall) – You can use this view to display an overall summary of the IBS fetch sample data.
• IBS fetch instruction cache (ibs_fetch_ic) – You can use this view to display a summary of IBS attempted-fetch Instruction Cache (IC) miss data.
• IBS fetch instruction TLB (ibs_fetch_itlb) – You can use this view to display a summary of IBS attempted-fetch ITLB misses.
• IBS fetch page translations (ibs_fetch_page) – You can use this view to display a summary of the IBS L1 ITLB page translations for attempted fetches.
• IBS All ops (ibs_op_overall) – You can use this view to display a summary of all IBS Op samples.
• IBS MEM all load/store (ibs_op_ls) – You can use this view to display a summary of IBS Op load/store data.
• IBS MEM data cache (ibs_op_ls_dc) – You can use this view to display a summary of DC behavior derived from IBS Op load/store samples.
• IBS MEM data TLB (ibs_op_ls_dtlb) – You can use this view to display a summary of DTLB behavior derived from IBS Op load/store data.
• IBS MEM locked ops and access by type (ibs_op_ls_memacc) – You can use this view to display the uncacheable (UC) memory accesses, write-combining (WC) memory accesses, and locked load/store operations.
• IBS MEM translations by page size (ibs_op_ls_page) – You can use this view to display a summary of DTLB address translations broken out by page size.
• IBS MEM forwarding and bank conflicts (ibs_op_ls_expert) – You can use this view to display the memory access bank conflicts, data forwarding, and Missed Address Buffer (MAB) hits.
Notes:
1. The AMDuProf GUI uses the ‘View configuration’ name of the predefined configuration mentioned in the above table.
2. The abbreviation is used in the CLI-generated report file.
3. The supported predefined configurations and the sampling events used in them are based on the processor family and model.
1. The menu names in the horizontal bar, such as HOME, PROFILE, SUMMARY, and ANALYZE, are called pages.
2. Each page has its sub-windows listed in the leftmost vertical pane. For example, the HOME page has various windows such as Welcome, Recent Session(s), Import Session, and so on.
3. Each window has various sections. These sections are used to specify the inputs required for a profile run, display the profile data for analysis, and provide buttons and links to navigate to associated sections. In the Welcome window, the Quick Links section has two links that allow you to start a profile session with minimal configuration steps:
– Click See what’s keeping your System busy to start a system-wide time-based profiling session that runs until you stop it and then displays the collected data.
– Click See what’s guzzling power in your System to select various power and thermal related counters and display a live view of the data through graphs.
4. The AMD uProf Resources section provides links to the AMD uProf release page and the AMD server community forum for discussions on profiling and performance tuning.
You can select one of the following profile targets from the Select Profile Target drop-down:
• Application: Select this target when you want to launch an application and profile it (or launch it and do a system-wide profile). The only compulsory option is a valid path to the executable. (By default, the path to the executable becomes the working directory unless you specify a path.)
• System: Select this if you do not wish to launch any application but want to perform either a system-wide profile or profile a specific set of cores.
• Process(es): Select this if you want to profile an application/process that is already running. This will bring up a process table, which can be refreshed. Selecting one of the processes from the table is mandatory to start the profile.
Once the profile target is selected and configured with valid data, the Next button will be enabled to go to the next screen of Start Profiling.
Note: The Next button will be enabled only if all the selected options are valid.
This screen lets you decide the type of profile data collected and how the data should be collected. You can select the profile type based on the performance analysis that you intend to perform. In the above figure:
1. Select one of the following tabs:
– Predefined Configs consists of all the predefined configurations, such as Time-based Profiling, Cache Analysis, and Assess Performance.
– Live Power Profiling consists of options to perform real-time power profiling.
– Custom Configs has options to perform Custom CPU Profile, CPU Tracing, and GPU Tracing.
2. Once you select a profile type, the left vertical pane within this window will list the options corresponding to the selected profile type. For the CPU Profile type, all the available predefined sampling configurations will be listed.
3. Modify event options are available only for the predefined configurations.
4. Click the Advanced Options button to proceed to the Advanced Options screen and set other options such as the Call Stack Options, Profile Scheduling, Sources, Symbols, and so on.
5. The details in “Profile Configuration” on page 43 are persistent and saved by the tool with a name (here, AMDuProf-EBP-ScimarkStable). You can define this name and navigate to PROFILE > Saved Configurations to reuse/select the same configuration later.
6. The Next and Previous buttons are available to navigate to the various screens of the Start Profiling flow.
The CLI command is available at the bottom of this page; it displays the CLI version of the GUI options selected on the Select Profile Configuration page.
You can set the following options on the Advanced Options screen:
1. Enable Thread Concurrency to collect the profile data needed to show the Thread Concurrency Chart on Windows.
2. Call Stack Options to enable callstack sample data collection. This profile data is used to show the Top-Down Callstack, Flame Graph, and Call Graph views.
3. Profile Scheduling to schedule the profile data collection.
4. The Next and Previous buttons are available to navigate to the various fragments within the Start Profiling screen.
5. Sources line-edit to specify the path(s) used to locate the source files of the profiled application.
6. Symbols to specify the symbol servers (Windows only) and the path(s) used to locate the symbol files of the profiled application. You can also provide a Download timeout for symbol file downloads from the server.
• TIMECHART page to visualize the MPI API trace, OS event trace, and related information as a timeline chart.
The sections available depend on the profile type. A CPU Profile will have SUMMARY, ANALYZE, MEMORY, HPC, TIMECHART, and SOURCES pages to analyze the data.
1. The top 5 hottest functions, processes, modules, and threads for the selected event are displayed.
2. The Hot Functions pie chart is interactive. You can click on any section, and the corresponding function's source will open in a separate tab in the SOURCES page.
3. The hotspots are shown per event, and the monitored event can be selected from the drop-down in the top-right corner. You can change it to any other event to update the corresponding hotspot data.
4. From the Select Summary View drop-down, select one of the following:
– Hot Threads
– Hot Processes
– Hot Functions
– Hot Modules
Based on the selection, one donut chart will be displayed at a time.
Summary Overview
Based on the selection, the Summary Overview screen will look similar to the following:
Table 22. Summary Overview
Data Collected: OS Trace
• Schedule Summary – Summary of per-thread running/wait time (percentages).
• Wait Object Summary – Time spent in operations related to several types of synchronization objects, that is, locks, mutexes, condition variables, and so on.
Timing Details: Profile Duration, Parallel Time, Serial Time, Wait Time, Sleep Time
OS Trace
The OS Trace screen will look as follows:
GPU Trace
The GPU Trace screen will look as follows:
MPI Trace
The MPI Trace screen will look as follows:
CPU Profile
The CPU Profile screen will look like Figure 11.
The thread concurrency graph displays the duration (in seconds) for which a specific number of threads were running simultaneously.
A bucketization approach is used for this graph. Instead of showing the Elapsed Time for each core, a weighted average based on the bucket size is taken. The bucket size is determined by the number of cores and the number of available pixels. This is done to avoid horizontal scrolling.
4. Event Timeline is a line graph showing the number of aggregated sample values over a period of time. You can use it to identify the hot functions within a profile region. From the Select Metric drop-down, you can select the event for which the event timeline must be plotted.
Not all entries will be loaded for a profile. To load more than the default number of entries, click the vertical scroll bar on the right. When the entries are expanded, a process- and thread-wise breakdown of the data is available.
4. The Filters pane lets you filter the profile data with the following options:
• Select View controls the counters that are displayed. The relevant counters and their derived metrics are grouped in predefined views. You can select one from the Select View drop-down.
• The Process drop-down lists all the processes in which the selected function was executed and has samples.
• The Threads drop-down lists all the threads in which the selected function was executed and has samples.
• You can use the ValueType drop-down to display the counter values as follows (a small worked example appears at the end of this subsection):
– Sample Count is the number of samples attributed to a function.
– Event Count is the product of the sample count and the sampling interval.
– Percentage is the percentage of samples collected for a function.
• The Show Assembly button shows/hides the assembly instruction table shown at the bottom of the view.
For multi-threaded or multi-process applications, if a function has been executed from multiple threads or processes, each of them will be listed in the Process and Threads drop-downs in the Filters pane. Changing them will update the profile data for that selection. By default, the profile data for the selected function, aggregated across all processes and all threads, is displayed.
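As a worked illustration of the Event Count value (hypothetical numbers): if a function accumulates 120 samples of an event whose sampling interval is 250,000, its event count is 120 * 250,000 = 30,000,000 event occurrences attributed to that function.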
Note: If the source file cannot be located or opened, only disassembly will be displayed.
1. Functions are displayed in parent-to-child order, sorted by their inclusive sample values.
2. Inclusive sample values cover a function and its descendants.
3. The Hide C++ std Library Calls option works only when C++ library calls are made. It excludes such calls from the list and displays the other child entries.
4. The Collapse entries context menu option closes all the expanded entries, Expand entries expands the child entries, and the Open Source View option displays the corresponding source view.
1. The x-axis of the flame graph shows the call-stack profile and the y-axis shows the stack depth. It is not plotted based on the passage of time. Each cell represents a stack frame; the more often a frame appears in the call-stack samples, the wider its cell. This screen has the following options:
• Module-wise coloring of the cells.
• Click on a cell to zoom into that cell and its children. Use the Reset Zoom button to visualize the entire graph.
• Right-click on a cell to view the following context options:
– Copy Function Data to copy the function name and its metrics to the clipboard.
– Open Source View to navigate to the source tab of that function.
• Hover the mouse over a cell to display a tool-tip showing the inclusive and exclusive number of samples of that function.
2. The following options are available at the top of this screen:
• Click the Zoom Graph button for a better zooming experience.
• When you type a function name in the search box, a list of all the relevant matches is displayed. Select the required function to highlight the cells corresponding to that function in the flame graph.
• The Process drop-down lists all the processes for which call-stack samples are collected. Changing the process will plot the flame graph for that particular process.
• For multi-threaded applications, the flame graph is plotted for the cumulative data of all the threads by default.
• The Threads drop-down lists all the threads for which call-stack samples are collected. Changing the thread will plot the flame graph for that thread.
• The Select Metric drop-down lists all the metrics for which call-stack samples are collected. Changing the metric will plot the flame graph for that particular metric.
1. The Function table lists all the functions with inclusive and exclusive samples. Click on a function to display its caller and callee functions in a butterfly view.
2. Lists all the parents of the function selected in the Function table.
3. Lists all the children of the function selected in the Function table.
4. Options:
• The Process drop-down lists all the processes for which call-stack samples are collected. Changing the process will show the call graph for that particular process.
• For multi-threaded applications, the call graph is plotted for the cumulative data of all the threads by default.
• The Threads drop-down lists all the threads for which call-stack samples are collected. Changing the thread will plot the call graph for that thread.
• The Select Metric drop-down lists the metrics for which call-stack samples are collected. Changing the counter will show the call graph for that particular counter.
1. The IMIX table lists all the instructions with the sample count and sample percentage for the selected options.
2. Options:
• The Select Metric drop-down lists all the metrics for which samples are collected. Changing the metric will display the IMIX information for that metric.
• The Module drop-down lists all the binaries for which samples are collected. Changing the module will display the IMIX information for that module.
• The Functions drop-down lists all the functions for which samples are collected. Changing the function will display the IMIX information for that function. By default, IMIX information for All Functions is shown.
This can be used to import the processed profile data collected using the CLI or the processed profile data saved in the GUI's profile session storage path. You must do the following:
• Specify the path containing the session.uprof file in the Profile Data File box.
• Binary Path: If the profile run was performed on one system and the corresponding raw profile data is imported on another system, you must specify the path(s) in which the binary files can be located.
• Source Path: Specify the source path(s) from where the source files can be located. No sub-directories will be searched in this path to locate any source files.
• Root Path to Sources: Specify the path to the root of multiple source directories. The entire directory and the sub-directories present in that path will be searched to locate any source files.
Note: The search might take time as all the sub-directories will be searched recursively.
• Force Database Regeneration: To forcefully regenerate the database file while importing.
• Use Cached Source/Binary/Symbol Files: Enable this option to reuse cached source, binary, and symbol files.
1. History of profile sessions opened for analysis in the GUI. The following options are available:
• Click on an entry to load the corresponding profile database for analysis.
• The See Details button displays details about this profile session, such as the profiled application,
monitored events list, and so on.
• Click Edit Options to automatically fill the Import Profile Session for the database and
update the required line-edits before opening the session.
• The Remove Entry button deletes the current profile session from the history.
2. Displays the details of the selected profile session.
5.9 Settings
There are certain application-wide settings to customize the AMD uProf experience. The SETTINGS
page is in the top-right corner and is divided into the following sections:
• Preferences: Use this section to set the global path and data reporting preferences.
– Click the Apply Changes button to apply the updated/modified settings. There are settings
which are common to profile data filters and hence, any changes to them through the Apply
Changes button will only be applied to the views that do not have local filters set.
– You can click the Reset button to reset the settings or Cancel to discard the changes that you don't
want to apply.
• Symbols: Use this section to configure the Symbol Paths and Symbol Server locations. The
Symbol server is a Windows only option. The following figure represents the Symbols section:
• Source Data: Use this section to set the Source view preferences. The following figure represents
the Source Data section:
You can use Select Disassembly Syntax to select the syntax in which you wish to see the
disassembly. By default, it is set to Intel on Windows and AT&T on Linux.
• Profile Data: Use this section to control the location of data generation during profiling. The
following figure represents Profile Data section:
– Keep Raw Files After Collection enables saving of the raw files after translation. It is disabled
by default.
– You can use the option Delete Record Session Files to delete the session files older than a given
time period. The time period is set to None by default.
– Reset Profile Configuration sets the preference to keep or clear the profile configuration
after each profile. It is set to True (clear after profiling) by default.
– Hotkey to stop profile (if running) helps halt the CPU and Power profiling.
– Hotkey to pause/resume profile helps pause or resume the CPU and Power profiling.
Note: Hotkeys are supported only on Windows.
6.1 Overview
AMD uProf’s command line interface AMDuProfCLI provides options to collect profile data and
generate reports for analysis.
AMDuProfCLI [--version] [--help] COMMAND [<options>] [<PROGRAM>] [<ARGS>]
For more information on the workflow, refer to the section “Workflow and Key Concepts”. To run the
command line interface AMDuProfCLI, run the following binaries as per the OS:
• Windows
C:\Program Files\AMD\AMDuProf\bin\AMDuProfCLI.exe
• Linux:
/opt/AMDuProf_X.Y-ZZZ/bin/AMDuProfCLI
• FreeBSD:
sh ./AMDuProf_FreeBSD_x64_X.Y.ZZZ/bin/AMDuProfCLI
The timechart command collects the profile samples and writes them into a file.
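A representative invocation is shown below (a sketch; the event list, output path, and program are illustrative, and section 6.7 lists the full option set):
$ ./AMDuProfCLI timechart --event power --event frequency -o /tmp/PowerOutput <PROGRAM> [<ARGS>]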
The above run collects the power and frequency counters on all the devices on which these counters
are supported and writes them in the output file specified with -o option. Before the profiling begins,
the given application is launched and the data is collected till the application terminates.
Common Usages:
$ AMDuProfCLI collect <PROGRAM> [<ARGS>]
$ AMDuProfCLI collect [--config <config> | -e <event>] [-a] [-d <duration>] [<PROGRAM>]
6.4.1 Options
The following table lists the collect command options:
Table 25. AMDuProfCLI Collect Command Options
Option Description
-h | --help Displays the help information on the console/terminal.
-o | --output-dir <directory-path> Base directory path in which collected data files will be saved. A
new sub-directory will be created in this directory.
--config <config> Predefined sampling configuration to be used to collect samples.
Use the command info --list collect-configs to get the list of supported
configs. Multiple occurrences of --config are allowed.
Notes:
1. It is not required to provide a umask with a predefined event.
2. Use the dedicated option --call-graph to specify the arguments related to the call
stack sample collection.
Argument details:
• user – Enable(1) or disable(0) user space samples collection
• os - Enable(1) or disable(0) kernel space samples collection
• interval – Sample collection interval. For timer, it is the time interval in
milliseconds. For PMU and predefined events, it is the count of the event
occurrences. For IBS FETCH, it is the fetch count. For IBS OP, it is the
cycle count or the dispatch count.
• op-count-control – Choose IBS OP sampling by cycle(0) count or
dispatch(1) count.
• loadstore – Enable only the IBS OP load/store samples collection, other
IBS OP samples are not collected.
• ibsop-l3miss – Enable IBS OP sample collection only when an L3 miss
occurs, for example, '-e event=ibs-op,interval=100000,ibsop-l3miss'
When F = fp, the value for N is ignored and hence, there is no need to pass it.
-g Same as passing --call-graph fp
--tid <TID,..> Profile existing threads by attaching to running threads. The thread IDs are
separated by commas.
6.4.4 Examples
Windows
• Launch the application AMDTClassicMatMul.exe and collect the Time-Based Profile (TBP)
samples:
C:\> AMDuProfCLI.exe collect -o c:\Temp\cpuprof-tbp AMDTClassicMatMul.exe
• Launch AMDTClassicMatMul.exe and collect the IBS samples in the SWP mode:
C:\> AMDuProfCLI.exe collect --config ibs -a -o c:\Temp\cpuprof-ibs-swp AMDTClassicMatMul.exe
• Launch AMDTClassicMatMul.exe and collect TBP with callstack sampling (unwind FPO
optimized stack):
C:\> AMDuProfCLI.exe collect --config tbp --call-graph 1:64:user:fpo -o c:\Temp\cpuprof-tbp
AMDTClassicMatMul.exe
• Launch AMDTClassicMatMul.exe and collect the samples for PMCx076 and PMCx0C0:
C:\> AMDuProfCLI.exe collect -e event=pmcx76,interval=250000 -e
event=pmcxc0,user=1,os=0,interval=250000 -o c:\Temp\cpuprof-tbp AMDTClassicMatMul.exe
• Launch AMDTClassicMatMul.exe and collect the samples for IBS OP with an interval of 50000:
C:\> AMDuProfCLI.exe collect -e event=ibs-op,interval=50000 -o c:\Temp\cpuprof-tbp
AMDTClassicMatMul.exe
• Launch AMDTClassicMatMul.exe and collect TBP samples with thread concurrency and thread name reporting:
C:\> AMDuProfCLI.exe collect --config tbp --thread thread=concurrency,name -o c:\Temp\cpuprof-
tbp AMDTClassicMatMul.exe
• Collect samples for PMCx076 and PMCx0C0, but collect the call graph info only for PMCx0C0:
C:\> AMDuProfCLI.exe collect -e event=pmcx76,interval=250000 -e
event=pmcxc0,interval=250000,call-graph -o c:\Temp\cpuprof-pmc AMDTClassicMatMul-bin
• Launch AMDTClassicMatMul.exe and collect the samples for predefined event RETIRED_INST
and L1_DC_REFILLS.ALL events:
C:\> AMDuProfCLI.exe collect -e event=RETIRED_INST,interval=250000 -e
event=L1_DC_REFILLS.ALL,user=1,os=0,interval=250000 -o c:\Temp\cpuprof-pmc
AMDTClassicMatMul.exe
• Launch AMDTClassicMatMul.exe and collect the samples for PMCx076 and PMCx0C0 events
with count-mask enabled:
C:\> AMDuProfCLI.exe collect -e event=pmcx076,cmask=0x0 -e
event=pmcx0c0,cmask=0x7f,interval=250000 -o c:\Temp\cpuprof-pmc AMDTClassicMatMul-bin
Linux
• Launch AMDTClassicMatMul-bin and collect the IBS samples in the SWP mode:
$ ./AMDuProfCLI collect --config ibs -a -o /tmp/cpuprof-ibs-swp AMDTClassicMatMul-bin
• Launch AMDTClassicMatMul-bin and collect TBP with callstack sampling (unwind FPO
optimized stack):
$ ./AMDuProfCLI collect --config tbp --call-graph fpo:512 -o /tmp/uprof-tbp AMDTClassicMatMul-
bin
• Launch AMDTClassicMatMul-bin and collect the samples for PMCx076 and PMCx0C0:
$ ./AMDuProfCLI collect -e event=pmcx76,interval=250000 -e
event=pmcxc0,user=1,os=0,interval=250000 -o /tmp/cpuprof-tbp AMDTClassicMatMul-bin
• Launch AMDTClassicMatMul-bin and collect the samples for IBS OP with interval 50000:
$ ./AMDuProfCLI collect -e event=ibs-op,interval=50000 -o /tmp/cpuprof-tbp AMDTClassicMatMul-
bin
• Launch AMDTClassicMatMul-bin and collect the memory accesses for false cache sharing:
$ AMDuProfCLI collect --config memory -o /tmp/cpuprof-mem AMDTClassicMatMul-bin
• Collect the samples for PMCx076 and PMCx0C0, but collect the call graph info only for
PMCx0C0:
$ AMDuProfCLI collect -e event=pmcx76,interval=250000 -e event=pmcxc0,interval=250000,call-
graph -o /tmp/cpuprof-pmc AMDTClassicMatMul-bin
• Launch AMDTClassicMatMul-bin and collect the samples for predefined event RETIRED_INST
and L1_DC_REFILLS.ALL events:
$ AMDuProfCLI collect -e event=RETIRED_INST,interval=250000 -e
event=L1_DC_REFILLS.ALL,user=1,os=0,interval=250000 -o /tmp/cpuprof-pmc AMDTClassicMatMul-bin
• Launch AMDTClassicMatMul-bin and collect all the user mode trace events:
$ AMDuProfCLI collect --trace user -o /tmp/cpuprof-umt AMDTClassicMatMul-bin
• Launch AMDTClassicMatMul-bin and collect syscalls taking 1µs or longer:
$ AMDuProfCLI collect --trace os=syscall:1000 -o /tmp/cpuprof-os AMDTClassicMatMul-bin
• Launch AMDTClassicMatMul-bin and collect the GPU Traces for hip domain:
$ AMDuProfCLI collect --trace gpu=hip -o /tmp/cpuprof-gpu AMDTClassicMatMul-bin
• Launch AMDTClassicMatMul-bin and collect the GPU Traces for hip and hsa domain:
$ AMDuProfCLI collect --trace gpu -o /tmp/cpuprof-gpu AMDTClassicMatMul-bin
• Launch AMDTClassicMatMul-bin, collect the TBP samples and GPU Traces for hip domain:
$ AMDuProfCLI collect --config tbp --trace gpu=hip -o /tmp/cpuprof-gpu AMDTClassicMatMul-bin
• Launch AMDTClassicMatMul-bin and collect the context switches, syscalls, pthread API tracing,
and function count of malloc() called:
$ AMDuProfCLI collect --trace os --func c:malloc -o /tmp/cpuprof-os AMDTClassicMatMul-bin
• Collect the system wide function count of malloc(), calloc(), and kernel functions that match the
pattern 'vfs_read*':
$ AMDuProfCLI collect --trace os --func c:malloc,calloc,kernel:vfs_read* -o /tmp/cpuprof-os -
a -d 10
• Launch AMDTClassicMatMul-bin and perform branch analysis with the default filter type:
$ AMDuProfCLI collect --branch-filter -o /tmp/cpuprof-ebp-branch AMDTClassicMatMul-bin
Common Usages:
$ AMDuProfCLI report -i <session-dir path>
6.5.1 Options
Table 28. AMDuProfCLI Report Command Options
Option Description
-h | --help Displays this help information on the console/terminal.
-i | --input-dir <directory-path> Path to the directory containing collected data.
--detail Generate detailed report.
--group-by <section> Specify the report to be generated. The supported report options are:
• process: Report process details
• module: Report module details
• thread: Report thread details
This option is applicable only with the --detail option. The default is to group
by process.
-p | --pid <PID,..> Generate report for the specified PIDs. The process IDs are separated by
commas.
Note: A maximum of 512 processes can be attached at a time.
-g Print the callgraph. Use with the option --detail or --pid (-p). With the --pid
option, the callgraph will be generated only if the callstack samples were
collected for the specified PIDs.
--cutoff <n> Cutoff to limit the number of processes, threads, modules, and functions to be
reported. n is the minimum number of entries to be reported in various
report sections. The default value is 10.
--view <view-config> Report only the events present in the given view file. Use the command
info --list view-configs to get the list of supported view-configs.
--inline Show inline functions for C, C++ executables.
Notes:
1. This option is not supported on Windows.
2. Using this option will increase the time taken to generate the report.
--show-sys-src Generate detailed function report of the system module functions (if debug
info is available) with the source statements.
--src-path <path1;...> Source file directories (semicolon separated paths). Multiple use of
--src-path is allowed.
--disasm Generate a detailed function report with assembly instructions.
--disasm-style <att | intel> Choose the syntax of assembly instructions. The supported options are
att and intel. If this option is not used:
• intel is used by default on Windows.
• att is used by default on Linux.
--disasm-only Generate the function report with only assembly instructions.
Example (the following two forms are equivalent):
--category cpu,mpi,trace,gputrace,gpuprof
--category mpi --category cpu --category trace --category gputrace --category gpuprof
--funccount-interval <funccount-interval> Specify the time interval in seconds to list the function
count detail report. If this option is not specified, the function count will be generated for the
entire profile duration.
6.5.4 Examples
Windows
• Generate report from the raw datafile on one of the predefined views:
C:\> AMDuProfCLI.exe report --view ipc_assess -i c:\Temp\pwrprof-swp\<SESSION-DIR>
• Generate report from the raw datafile providing the source and binary paths:
C:\> AMDuProfCLI.exe report --bin-path Examples\AMDTClassicMatMul\bin\ --src-path
Examples\AMDTClassicMatMul\ -i c:\Temp\cpuprof-tbp\<SESSION-DIR>
Linux
Common Usages:
$ AMDuProfCLI translate -i <session-dir path>
6.6.1 Options
The following table lists the AMDuProfCLI translate command options:
Table 31. AMDuProfCLI Translate Command Options
Option Description
-h | --help Displays the help information.
-i | --input-dir <directory-path> Path to the directory containing collected data.
--time-filter <T1:T2> Restricts the processing to the time interval between T1 and T2, where T1, T2
are time in seconds from profile start time.
--agg-interval <low | medium | high | INTERVAL> Use this option to configure the sample
aggregation interval, which is useful when the session is imported to the GUI.
A low aggregation interval generates a better timeline view in the GUI but
increases the database size.
The aggregation INTERVAL can also be specified as a numeric value in
milliseconds.
--bin-path <path> Binary file path. Multiple use of --bin-path is allowed.
--symbol-path <path> Debug symbol path. Multiple instances of --symbol-path are allowed.
--inline Inline function extraction for C and C++ executables.
Notes:
1. This option is not supported on Windows.
2. Using this option will increase the time taken to generate the report.
--retranslate Re-translate the collected data files with a different set of translation options.
--log-path <path-to-log-dir> Specify the path where the log file should be created. If this option is
not provided, the log file will be created either in the path set by the
AMDUPROF_LOGDIR environment variable or the %TEMP% path by default.
The log file name will be of the format $USER-AMDuProfCLI.log (on Linux,
FreeBSD) or %USERNAME%-AMDuProfCLI.log (on Windows).
--enable-log Enable additional logging with log file.
--enable-logts Capture the timestamp of the log records. This option should be used with the
--enable-log option.
--remove-raw-files Remove the raw data files to recover the disk space.
--export-session Create a compressed archive of the required session files, which can be used on
another system for analysis.
6.6.4 Examples
Windows
• Process all the raw data files:
> AMDuProfCLI.exe translate -i c:\Temp\cpuprof-tbp\<SESSION-DIR>
• Process the raw data files with the source and binary path:
> AMDuProfCLI.exe translate --bin-path Examples\AMDTClassicMatMul\bin\ --src-path
Examples\AMDTClassicMatMul\ -i c:\Temp\cpuprof-tbp\<SESSION-DIR>
Linux
• Process all the raw data files:
$ AMDuProfCLI translate -i /tmp/cpuprof-tbp/<SESSION-DIR>
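• Process the raw data files restricting the processing to a time window and using a low aggregation interval (a sketch combining the documented --time-filter and --agg-interval options; the values are illustrative):
$ ./AMDuProfCLI translate -i /tmp/cpuprof-tbp/<SESSION-DIR> --time-filter 5:15 --agg-interval low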
<PROGRAM> — Denotes the application to be launched before starting the power metrics collection.
<ARGS> — Denotes the list of arguments for the launch application.
Common Usages:
$ AMDuProfCLI timechart --list
$ AMDuProfCLI timechart -e <event> -d <duration> [<PROGRAM>] [<ARGS>]
6.7.1 Options
Table 34. AMDuProfCLI Timechart Command Options
Option Description
-h | --help Displays this help information.
--list Displays all the supported devices and categories.
-e | --event <type...> Collect counters for specified combination of device type and/or category
type.
Use command timechart --list for the list of supported devices and
categories.
Note: Multiple occurrences of -e are allowed.
-t | --interval <n> Sampling interval n in milliseconds. The minimum value is 10ms.
-d | --duration <n> Profile duration n in seconds.
--affinity <core...> The core affinity. Comma separated list of core-ids. Ranges of core-ids can
also be specified, for example, 0-3. The default affinity is all the available
cores. The affinity is set for the launched application.
-w | --working-dir <dir> Set the working directory for the launched target application.
-f | --format <fmt> Output file format. Supported formats are:
• txt: Text (.txt) format.
• csv: Comma Separated Value (.csv) format.
Default file format is CSV.
-o | --output-dir <dir> Output directory path.
6.7.2 Examples
Windows
• Collect all the power counter values for a duration of 10 seconds with sampling interval of 100
milliseconds:
C:\> AMDuProfCLI.exe timechart --event power --interval 100 --duration 10
• Collect all the frequency counter values for 10 seconds, sampling them every 500 milliseconds
and dumping the results into a csv file:
C:\> AMDuProfCLI.exe timechart --event frequency -o C:\Temp\output --interval 500 --duration
10
• Collect all the frequency counter values at core 0 to 3 for 10 seconds, sampling them every 500
milliseconds and dumping the results into a text file:
C:\> AMDuProfCLI.exe timechart --event core=0-3,frequency -o C:\Temp\PowerOutput --interval
500 --duration 10 --format txt
Linux
• Collect all the power counter values for a duration of 10 seconds with sampling interval of 100
milliseconds:
$ ./AMDuProfCLI timechart --event power --interval 100 --duration 10
• Collect all the frequency counter values for 10 seconds, sampling them every 500 milliseconds
and dumping the results into a csv file:
$ ./AMDuProfCLI timechart --event frequency -o /tmp/PowerOutput --interval 500 --duration 10
• Collect all the frequency counter values at core 0 to 3 for 10 seconds, sampling them every 500
milliseconds and dumping the results into a text file:
$ ./AMDuProfCLI timechart --event core=0-3,frequency -o /tmp/PowerOutput --interval 500 --
duration 10 --format txt
Common Usages:
AMDuProfCLI diff --baseline <base session-dir path> --with <non-base session-dir path> -o
<output-dir>
6.8.2 Options
The following table lists the diff command options:
Table 35. AMDuProfCLI diff Command Options
Option Description
-h | --help Displays this help information on the console/terminal.
--baseline <directory-path> Path to the directory containing collected data. The profile data in this
directory will be treated as the base profile against which all other profiles will be
compared.
--with <directory-path> Path to the directory containing collected data. Each profile specified with
--with will be considered as a non-base profile and compared against the base
profile. You can use multiple instances of --with to specify multiple non-base
profiles for comparison.
-i, --input-dir <directory-path> Path to the directory containing collected data. Multiple occurrences
of -i are allowed. The first occurrence of -i is considered as the base session, while all the
subsequent occurrences of -i are treated as non-base sessions.
Note: When using -i, --input-dir, you should not use the --baseline or --with options in
conjunction. If you use --baseline and -i together, the --baseline option will take
precedence and be considered as the base session. If the --baseline option is not present,
the first occurrence of -i will automatically be considered as the base session.
--output-dir | -o <directory-path> Path where the markdown comparison report will be generated.
6.8.3 Examples
Windows
Use the following commands to:
• Generate a comparison report of base profile data with subsequent profile data:
C:\> AMDuProfCLI.exe diff --baseline c:\Temp\cpuprof-tbp\<BASE-DIR> --with c:\Temp\cpuprof-
tbp\<NON-BASE-DIR> -o c:\Temp\cpuprof-tbp
• Generate a comparison report without ignoring the unique entries across sessions:
C:\> AMDuProfCLI.exe diff --baseline c:\Temp\cpuprof-tbp\<BASE-DIR> --with c:\Temp\cpuprof-
tbp\<NON-BASE-DIR> --type order -o c:\Temp\cpuprof-tbp
• Generate a comparison report of base profile data with subsequent profile data sorted on ibs-op
event:
C:\> AMDuProfCLI.exe diff --baseline c:\Temp\cpuprof-tbp\<BASE-DIR> --with c:\Temp\cpuprof-
tbp\<NON-BASE-DIR> --type name -s ibs-op -o c:\Temp\cpuprof-tbp
• Generate a comparison report of base profile data with successor profile data with changed
function names across sessions:
C:\> AMDuProfCLI.exe diff --baseline c:\Temp\cpuprof-tbp\<BASE-DIR> --with
c:\Temp\cpuprof-tbp\<NON-BASE-DIR> --alias
CalculateSum,CalculateUpdatedSum|enhanceOutput,optimizeOutput -o c:\Temp\cpuprof-tbp
• Generate a comparison report of base profile data with multiple successor profile data:
C:\> AMDuProfCLI.exe diff -i c:\Temp\cpuprof-tbp\<BASE-DIR> -i c:\Temp\cpuprof-tbp\<NON-BASE-
DIR1> -i c:\Temp\cpuprof-tbp\<NON-BASE-DIR2> --with c:\Temp\cpuprof-tbp\<NON-BASE-DIR3> -o
c:\Temp\cpuprof-tbp
Linux
• Generate a comparison report of base profile data with subsequent profile data:
$ AMDuProfCLI diff --baseline /tmp/cpuprof-tbp/<BASE-DIR> --with /tmp/cpuprof-tbp/<NON-BASE-
DIR> -o /tmp/cpuprof-tbp
• Generate a comparison report of base profile data with subsequent profile data sorted on PMC
event:
$ AMDuProfCLI diff --baseline /tmp/cpuprof-tbp/<BASE-DIR> --with /tmp/cpuprof-tbp/<NON-BASE-
DIR> -s event=pmcxc0,user=1,os=0 -o /tmp/cpuprof-tbp
6.9.1 Options
The following table lists the profile command options:
Table 36. AMDuProfCLI profile Command Options
Option Description
-h | --help Displays the help information on the console/terminal.
-o | --output-dir <directory-path> Base directory path in which the collected data files will be saved.
A new sub-directory will be created in this directory.
--config <config> Predefined sampling configuration to be used to collect samples.
Use the command info --list collect-configs to get the list of supported
configs. Multiple occurrences of --config are allowed.
When F = fp, the value for N is ignored and hence, there is no need to pass
it.
-g Same as passing --call-graph fp
--tid <TID,..> Profile existing threads by attaching to running threads. The thread IDs are
separated by commas.
--trace <TARGET> To trace a target domain. TARGET can be one or more of the following:
• mpi[=<openmpi|mpich>,<lwt|full>]
Provide the MPI implementation type:
'openmpi' for tracing the OpenMPI library
'mpich' for tracing MPICH and its derivative libraries, for example, Intel MPI
Provide the tracing scope:
'lwt' for light-weight tracing
'full' for complete tracing
'--trace mpi' defaults to '--trace mpi=mpich,full'
• openmp — for tracing an OpenMP application. This is the same as the option --omp.
• os[=<event1,event2,...>] — provide the event names and optional
threshold with a comma separated list. syscall and memtrace events will
take the optional threshold value as <event:threshold>. Use the command
info --list ostrace-events for a list of the OS trace events.
• user=<event1,event2,...> — provide the event names and thresholds with a
comma separated list. These events will be collected in the user mode.
Use the command info --list trace-events to get a list of the trace
events supported in user mode.
• gpu[=<hip,hsa>] — provide the domain for GPU Tracing. By default, the
domain is set to 'hip,hsa'.
Notes:
1. When the above filters are not set, the default filter type will be 'any'.
2. This option will work only with the PMC events.
3. This is applicable to per-process and attach-process profiling. However, it is not
applicable to Java app profiling.
6.9.4 Examples
Windows
• Launch application AMDTClassicMatMul.exe and collect the samples for
CYCLES_NOT_IN_HALT and RETIRED_INST events and generate report:
C:\> AMDuProfCLI.exe profile -e cycles-not-in-halt -e retired-inst --interval 1000000
-o c:\Temp\cpuprof-custom AMDTClassicMatMul.exe
C:\> AMDuProfCLI.exe profile -e event=cycles-not-in-halt,interval=250000
-e event=retired-inst,interval=500000 -o c:\Temp\cpuprof-custom AMDTClassicMatMul.exe
• Launch the application AMDTClassicMatMul.exe and collect the IBS Samples and generate IMIX
report:
AMDuProfCLI.exe profile --config ibs --imix -o c:\Temp\cpuprof-tbp AMDTClassicMatMul.exe
• Launch AMDTClassicMatMul.exe and perform Assess Performance profile for 10 seconds and
generate report:
C:\> AMDuProfCLI.exe profile --config assess -o c:\Temp\cpuprof-assess -d 10
AMDTClassicMatMul.exe
• Launch AMDTClassicMatMul.exe and collect the IBS samples in the SWP mode and generate
report sorted on ibs-op event:
C:\> AMDuProfCLI.exe profile --config ibs -a -s event=ibs-op -o c:\Temp\cpuprof-ibs-swp
AMDTClassicMatMul.exe
• Collect the TBP samples in SWP mode for 10 seconds and generate report:
C:\> AMDuProfCLI.exe profile -a -o c:\Temp\cpuprof-tbp-swp -d 10
• Launch AMDTClassicMatMul.exe, collect TBP with callstack sampling and generate report:
C:\> AMDuProfCLI.exe profile --config tbp -g -o c:\Temp\cpuprof-tbp AMDTClassicMatMul.exe
• Launch AMDTClassicMatMul.exe, collect TBP with callstack sampling (unwind FPO optimized
stack) and generate report:
C:\> AMDuProfCLI.exe profile --config tbp --call-graph 1:64:user:fpo -o c:\Temp\cpuprof-tbp
AMDTClassicMatMul.exe
• Launch AMDTClassicMatMul.exe and collect the samples for PMCx076 and PMCx0C0 and
generate report sorted on pmcxc0 event:
C:\> AMDuProfCLI.exe profile -e event=pmcx76,interval=250000 -e
event=pmcxc0,user=1,os=0,interval=250000 -s event=pmcxc0 -o c:\Temp\cpuprof-tbp
AMDTClassicMatMul.exe
• Launch AMDTClassicMatMul.exe and collect the samples for IBS OP with an interval of 50000
and generate report sorted on ibs-op event:
C:\> AMDuProfCLI.exe profile -e event=ibs-op,interval=50000 -s event=ibs-op -o
c:\Temp\cpuprof-tbp AMDTClassicMatMul.exe
• Launch AMDTClassicMatMul.exe, collect TBP samples with thread concurrency and thread name
reporting, and generate report:
C:\> AMDuProfCLI.exe profile --config tbp --thread thread=concurrency,name -o
c:\Temp\cpuprof-tbp AMDTClassicMatMul.exe
• Launch AMDTClassicMatMul.exe, collect the Power samples in SWP mode and generate report:
C:\> AMDuProfCLI.exe profile --config energy -a -o c:\Temp\pwrprof-swp AMDTClassicMatMul.exe
• Collect samples for PMCx076 and PMCx0C0, but collect the call graph info only for PMCx0C0
and generate report:
C:\> AMDuProfCLI.exe profile -e event=pmcx76,interval=250000 -e
event=pmcxc0,interval=250000,call-graph -o c:\Temp\cpuprof-pmc AMDTClassicMatMul-bin
• Launch AMDTClassicMatMul.exe and collect the samples for predefined event RETIRED_INST
and L1_DC_REFILLS.ALL events and generate report:
C:\> AMDuProfCLI.exe profile -e event=RETIRED_INST,interval=250000 -e
event=L1_DC_REFILLS.ALL,user=1,os=0,interval=250000 -o c:\Temp\cpuprof-pmc
AMDTClassicMatMul.exe
• Launch AMDTClassicMatMul.exe. Collect the TBP, Assess Performance samples, and generate
report:
C:\> AMDuProfCLI.exe profile --config tbp --config assess -o c:\Temp\cpuprof-tbp-assess
AMDTClassicMatMul.exe
Linux
• Launch the application AMDTClassicMatMul-bin. Collect the samples for
CYCLES_NOT_IN_HALT and RETIRED_INST events and generate report:
$ ./AMDuProfCLI profile -e cycles-not-in-halt -e retired-inst
--interval 1000000 -o /tmp/cpuprof-custom AMDTClassicMatMul-bin
$ ./AMDuProfCLI profile -e event=cycles-not-in-halt,interval=250000
-e event=retired-inst,interval=500000 -o /tmp/cpuprof-custom
AMDTClassicMatMul-bin
• Launch the application AMDTClassicMatMul-bin. Collect the IBS samples and generate IMIX
report from the raw data file:
$ ./AMDuProfCLI profile --config ibs --imix -o /tmp/cpuprof-tbp AMDTClassicMatMul-bin
• Launch AMDTClassicMatMul-bin. Collect the IBS samples in the SWP mode and generate report
sorted based on the ibs-op event:
$ ./AMDuProfCLI profile --config ibs -a -s event=ibs-op -o /tmp/cpuprof-ibs-swp
AMDTClassicMatMul-bin
• Collect the TBP samples in SWP mode for 10 seconds and generate report:
$ ./AMDuProfCLI profile -a -o /tmp/cpuprof-tbp-swp -d 10
• Launch AMDTClassicMatMul-bin. Collect TBP with callstack sampling and generate report:
$ ./AMDuProfCLI profile --config tbp -g -o /tmp/cpuprof-tbp AMDTClassicMatMul-bin
• Launch AMDTClassicMatMul-bin and collect TBP with callstack sampling (unwind FPO
optimized stack) and generate report:
$ ./AMDuProfCLI profile --config tbp --call-graph fpo:512 -o /tmp/uprof-tbp
AMDTClassicMatMul-bin
• Launch AMDTClassicMatMul-bin. Collect the samples for PMCx076 and PMCx0C0 and
generate report:
$ ./AMDuProfCLI profile -e event=pmcx76,interval=250000 -e
event=pmcxc0,user=1,os=0,interval=250000 -o /tmp/cpuprof-tbp AMDTClassicMatMul-bin
• Launch AMDTClassicMatMul-bin. Collect the samples for IBS OP with interval 50000 and
generate report sorted on ibs-op event:
$ ./AMDuProfCLI profile -e event=ibs-op,interval=50000 -s event=ibs-op -o /tmp/cpuprof-tbp
AMDTClassicMatMul-bin
• Attach to a thread, collect TBP samples for 10 seconds, and generate report:
$ AMDuProfCLI profile --config tbp -o /tmp/cpuprof-tbp-attach -d 10 --tid <TID>
• Collect the OpenMP trace info of an OpenMP application by passing --omp, and generate report:
$ AMDuProfCLI profile --omp --config tbp -o /tmp/openmp_trace <path-to-openmp-exe>
• Collect the samples for PMCx076 and PMCx0C0, but collect the call graph info only for
PMCx0C0 and generate report:
$ AMDuProfCLI profile -e event=pmcx76,interval=250000 -e
event=pmcxc0,interval=250000,call-graph -o /tmp/cpuprof-pmc AMDTClassicMatMul-bin
• Launch AMDTClassicMatMul-bin. Collect all the OS trace events and generate report:
$ AMDuProfCLI profile --trace os -o /tmp/cpuprof-os AMDTClassicMatMul-bin
• Launch AMDTClassicMatMul-bin. Collect the GPU Traces for the Heterogeneous-computing
Interface for Portability (HIP) domain and generate report:
$ AMDuProfCLI profile --trace gpu=hip -o /tmp/cpuprof-gpu AMDTClassicMatMul-bin
• Launch AMDTClassicMatMul-bin. Collect the TBP samples, GPU Traces for the HIP domain,
and generate report:
$ AMDuProfCLI profile --config tbp --trace gpu=hip -o /tmp/cpuprof-gpu AMDTClassicMatMul-bin
• Launch AMDTClassicMatMul-bin. Collect the GPU samples, OS Traces, and generate report:
$ AMDuProfCLI profile --config gpu --trace os -o /tmp/cpuprof-gpu-os AMDTClassicMatMul-bin
Common Usages:
$ AMDuProfCLI info --system
6.10.1 Options
The following table lists the info command options:
Table 39. AMDuProfCLI Info Command Options
Option Description
-h | --help Displays the help information.
--list <type> Lists the supported items for the following types:
• collect-configs: Predefined profile configurations that can be used with the
collect --config option.
• predefined-events: List of the supported predefined events that can be used
with collect --event option.
• pmu-events: Raw PMC events that can be used with collect --event option.
Alternatively, info --pmu-event all can be used to print information of all the
supported events.
• cacheline-events: List of event aliases to be used with report --sort-by
option for cache analysis. It is supported only on Windows and Linux
platforms.
• view-configs: List the supported data view configurations that can be used
with report --view option.
--collect-config <name> Displays the details of the given profile configuration used with the
collect --config <name> option.
Use the info --list collect-configs command for the details on the supported
profile configurations.
6.10.2 Examples
Use the following commands to:
• Print the system details:
C:\> AMDuProfCLI.exe info --system
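The following invocations are also representative (the config name tbp is illustrative; use the first command to list the actual names):
• List the supported predefined profile configurations:
C:\> AMDuProfCLI.exe info --list collect-configs
• Display the details of a given profile configuration:
C:\> AMDuProfCLI.exe info --collect-config tbp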
5. Click Advanced Options to enable call-stack, set symbol paths (if the debug files are in different
locations) and other options. Refer to the section “Advanced Options” for more information
on this screen.
6. Once all the options are set, the Start Profile button at the bottom will be enabled and you can
click on it to start the profile.
After the profile initialization, the profile data collection screen is displayed.
3. From the Select Profiling screen, select the Predefined Configs tab.
4. Select Assess Performance in the left vertical pane. Refer to the section “Predefined Sampling
Configuration” for EBP based predefined sampling configurations.
5. Click Advanced Options to enable call-stack, set symbol paths (if the debug files are in different
locations) and other options. Refer to the section “Advanced Options” for more information on this
screen.
6. Once all the options are set, the Start Profile button at the bottom will be enabled. Click it to start
the profile.
After the profile initialization, the profile data collection screen is displayed.
3. Click ANALYZE > Metrics to display the profile data table at various granularities - Process,
Load Modules, Threads, and Functions. Refer to the section “Process and Functions” for more
information on this screen.
4. Double-click any entry on the Functions table in the Grouped Metrics screen to load the source
tab for that function in the SOURCES page. Refer to the section “Source and Assembly” for more
information on this screen.
5. Click Advanced Options to enable call-stack, set symbol paths (if the debug files are in different
locations) and other options. Refer to the section “Advanced Options” for more information on this
screen.
6. Once all the options are set, the Start Profile button at the bottom will be enabled. Click it to start
the profile.
After the profile initialization, the profile data collection screen is displayed.
2. Click on Advanced Options button to turn on the Enable CSS option in Call Stack Options
pane as follows:
Refer to the section “Advanced Options” for more information on this screen.
Note: If the application is compiled with higher optimization levels and frame pointers are
omitted, the Enable FPO option can be turned on. On Linux, this will increase the
raw profile file size.
Flame Graph provides a stack visualizer based on call stack samples. The Flame Graph is available
in the ANALYZE page to analyze the call stack samples to identify hot call-paths. To access it,
navigate to ANALYZE > Flame Graph in the left vertical pane.
Refer to the section “Flame Graph” for more information on this screen.
The flame graph can be displayed based on the Process and Select Metric drop-downs. Also, it has
the function search box to search and highlight the given function name.
You can browse the data based on Process and Select Metric drop-downs. The top central table
displays call-stack samples for each function. Click on any function to update the bottom two
Caller(s) and Callee(s) tables. These tables display the callers and callees respectively of the selected
function.
Keep a note of the process id (PID) of the above JVM instance. Then, launch AMD uProf GUI or
AMD uProf CLI to attach to this process and profile.
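As a sketch, assuming the collect command supports attaching by process ID (a --pid option analogous to the --tid option in Table 25; verify with collect --help), the CLI attach could look like the following, where <PID> is the JVM process ID noted above:
$ AMDuProfCLI collect --config tbp --pid <PID> -d 10 -o /tmp/cpuprof-java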
The following figure shows the source view of the Java method:
After the profile completion, navigate to the Cache Analysis page in the MEMORY tab to analyze the
profile data. This page shows the cache-lines and their offsets with the associated metric values:
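A representative collection command for cache analysis is shown below (a sketch assuming the predefined memory config described earlier; the application binary is a placeholder):
$ AMDuProfCLI collect --config memory -o /tmp/cache_analysis <program>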
This command will launch the program and collect the profile data required to generate the cache
analysis report. The raw profile data file is created in /tmp/cache_analysis/AMDuProf-
IBS_<timestamp>/ directory.
Report Generation and Analysis
Use the following CLI command to generate the cache analysis report:
$ AMDuProfCLI report -i /tmp/cache_analysis/AMDuProf-IBS_<timestamp>/
Use any of the following metrics with the --sort-by event=<METRIC> option (for example, --sort-by
event=ldst-count) to change the sort order during report generation:
Table 42. Sort-by Metric
Sort-by Metric Description
ldst-count Total Loads and stores sampled
ld-count Total Loads
st-count Total Stores
cache-hitm Loads that were serviced either from the local or remote cache (L3) and
the cache hit state was Modified.
Note: You can also use the command info --list cacheline-events for a list of supported
metrics for sort-by option.
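For example, to regenerate the report sorted on the cache-hitm metric:
$ AMDuProfCLI report -i /tmp/cache_analysis/AMDuProf-IBS_<timestamp>/ --sort-by event=cache-hitm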
On Windows, it is enabled with the supported event Schedule. User Mode Trace is enabled only
for Application Analysis on Linux.
CPU Trace looks as follows:
Multiple categories from the custom configs can be added together, for example, CPU Profile +
CPU Trace.
When multiple categories are selected, they will be listed as breadcrumbs under Added
Categories, and you can deselect the unwanted categories. The corresponding CLI command will
be generated below them.
1. Select the Custom Configs tab and select CPU Profile from the left vertical pane.
2. Click Advanced Options to enable call-stack, set symbol paths (if the debug files are in different
locations) and other options. Refer to the section “Advanced Options” for more information on this
screen.
3. Once all the options are set, the Start Profile button at the bottom will be enabled. Click it to start
the profile.
After the profile initialization, the profile data collection screen is displayed.
4. Double-click any entry on the Functions table in the Metrics screen to load the source tab for that
function in the SOURCES page. Refer to the section “Source and Assembly” for more information on
this screen.
7.9 Advisory
7.9.1 Confidence Threshold
Metrics with a low number of samples collected for a program unit, either due to multiplexing or
statistical sampling, will be grayed out. A few points to remember are:
• This is applicable to SW Timer and Core PMC based metrics.
• This confidence threshold value can be set through Preferences section in SETTINGS page.
2. Once the raw file is generated, run the following command to translate and get the ASCII dump of
IBS OP samples:
C:\> AMDuProfCLI.exe translate --ascii event-dump -i C:\temp\AMDuProf-IBS_<timestamp>\
The CSV file containing the ASCII dump of the IBS OP samples is generated at:
C:\temp\AMDuProf-IBS_<timestamp>\IbsOpDump.csv
Example
Collect the LBR info:
$ AMDuProfCLI collect --branch-filter -o /tmp/ ./ScimarkStable/scimark2_64static
Sample Report
The report generated contains a section for branch analysis. A sample screenshot for branch analysis
summary is as follows:
Example
Launch the application AMDTClassicMatMul.exe and collect the Time-Based Profile (TBP) samples
and generate a report with the export session option enabled:
AMDuProfCLI.exe profile --config tbp --export-session -o c:\Temp\cpuprof-tbp
AMDTClassicMatMul.exe
7.13 Limitations
CPU profiling in AMD uProf has the following limitations:
• CPU profiling expects that the profiled application's executable binaries are not compressed or
obfuscated by any software protector tools, for example, VMProtect.
• In case of AMD EPYC™ 1st generation B1 parts, only one PMC register is used at a time for
Core PMC event-based profiling (EBP).
This command will launch the program to collect the profile and trace data. When the launched
application is executed, AMDuProfCLI will display the session directory path in which the raw
profile and trace data are saved.
In the above example, the session directory path is:
/tmp/threading-analysis/AMDuProf-classic_lock-Threading_Jun-13-2023_06-00-23
After processing the data and generating the report, the report file path is displayed on the terminal.
An example of the trace report sections in the .csv report file is as follows:
From the report, the application performance snapshot provides the following details:
• Number of threads/Thread count: Total number of threads created by the application.
• Elapsed time: Total elapsed time of the application.
• Serial time: Total time of the application when only one thread is running.
• Parallel time: Total time of the application when two or more threads are running.
• Run time: Total run time of all threads. If context switch records are collected, the total run time
will be the total time of all the threads executing on the CPU. Otherwise, total run time = total time - (total
wait time + total sleep time).
• Wait time: Total wait time of all the threads. Wait time is calculated as follows:
– Threading config (--config threading): Total time spent by a thread in pthread
synchronization APIs and wait system calls. Refer to sections 8.1.2 and 8.1.3 for the traced
synchronization APIs and wait system calls.
– Custom config (--trace user=syscall): Total time spent by a thread in wait system calls.
– Custom config (--trace user=pthread): Total time spent by a thread in pthread
synchronization APIs.
– Custom config (--trace os/--trace os=schedule): Total time of all the threads when a
thread is not on the CPU. It uses the context switch records to identify whether a thread is on the CPU or not.
• Sleep time: Total time spent by all the threads in sleep system calls. Refer to section 8.1.3 for
sleep system calls that are traced.
• IO time: Total time spent by all the threads in IO system calls. Refer to section 8.1.3 for IO system
calls that are traced.
• Block time: Total time spent by all the threads in blocking system calls. When the application
makes this type of system call, there is no guarantee that the application will be blocked, so this
block time is also added to the total run time. Refer to section 8.1.3 for the block system calls that
are traced.
Summary Report Sections
• System call summary: Provides the system call count and the total time spent by the application on
each system call. Helps identify the system calls consuming most of the time, which can be optimized
if the system calls are blocking in nature.
• Thread summary: Provides the total run time, the wait time of each thread, and the wait time percentage
with respect to the total time of the thread. Helps identify whether a thread is using the core effectively.
The wait time of threads should be low for an optimized application.
• Wait object summary: Provides the pthread synchronization object wait count and the total wait time
due to each synchronization object. Helps identify the object responsible for most of the wait time.
• Import the profiled session in GUI and navigate to Analyze > Thread Timeline for better
visualization, thread timeline analysis, pthread synchronization object analysis, and call stack
analysis.
Blocking APIs
• flock • recvfrom • mq_receive • splice
• fsync • recvmsg • mq_timedreceive • vmsplice
• sync • recvmmsg • msgsnd • msync
• syncfs • send • msgrcv • fcntl
• fdatasync • sendto • semget • ioctl
• sync_file_range • sendmsg • semop • epoll_create
• accept • sendmmsg • semtimedop • epoll_create1
• accept4 • mq_send • semctl • epoll_ctl
• recv • mq_timedsend
Other APIs
• socket • shmctl • mlockall • fallocate
• bind • shmget • munlockall • ioperm
• listen • shmdt • mmap • iopl
• connect • fork • munmap • mount
• socketpair • vfork • move_pages • prctl
• mq_notify • alarm • mprotect • ptrace
• mq_getattr • system • mremap • sigaction
• mq_setattr • kill • process_vm_readv • swapon
• mq_close • killpg • process_vm_writev • swapoff
• mq_unlink • brk • acct • tee
• msgget • sbrk • chroot • umount
• msgctl • mlock • dup • umount2
• pipe • munlock • dup2 • unshare
• pipe2 • mlock2 • dup3 • vhangup
• shmat
3. Select the Data Source drop-down to enable selection of data to display on the timeline. Different
types of data source are as follows:
– CPU Utilization: Plots the timeline for the CPU utilization (in %) per thread at a per second
interval. To collect sufficient such data points, the total profile duration should be greater than
or equal to 10 seconds. This is enabled only for the Threading Analysis configuration.
– Memory Consumption: Plots the timeline for the memory consumption (in MB) categorized
as physical and virtual memory consumed. This is enabled only for the Threading Analysis
configuration.
– Context Switches: Plots the timeline for both the voluntary context switch count (sleep, yield,
and so on) and the involuntary context switch count (OS scheduler triggered context switches). This
is enabled only for the Threading Analysis configuration.
– CPU Profile Samples: Plots the timeline for the CPU sample collected for the CPU events.
The following events are supported:
Table 43. Supported CPU Events
Events Availability
Retired Instructions PMC event RETIRED_INSTRUCTIONS is collected.
Cycles not in Halt PMC event CYCLES_NOT_IN_HALT is collected.
Op Cycles IBS op event is collected with ‘count cycles’ unit mask.
CPU Time Time-based profiling is performed.
– Thread Trace: Plots the timeline based on OS trace data which can either originate from eBPF
Tracing or User-mode Tracing. The trace data is categorized and aggregated at certain intervals
to generate time-series plotted in timelines. The following categories are created:
Table 44. CPU Trace Categories
Category Description
Wait Time Total time spent in synchronization objects, that is, mutex, condition variable,
semaphore, locks, barriers, latches, and so on
Sleep Time Total time spent in sleep syscalls.
Running Time If only user-mode tracing is enabled:
Running Time = Total Time – (Wait Time + Sleep Time).
If eBPF tracing is enabled, then Running Time is total active time in CPU:
Running Time = Total Time – Sleep Time (from context switch records)
Block Time Total Time spent in blocking syscalls, that is, select, epoll, poll, wait, accept, and so
on.
I/O Time Total Time spent in I/O syscalls, that is, read, write, pread, pwrite, and so on.
Syscall Time Total time spent on all traced syscalls – (Block Time + I/O Time)
4. The Select Trace Overlay drop-down enables selection of the type of trace data to display.
– Don't Show Trace: Trace data will not be loaded in the timeline.
– Thread State: Shows the current state of the thread from eBPF or User-mode tracing. In the former,
the thread state is inferred from BPF data. In the latter, the thread state is treated as Running if Running
Time > 0, otherwise Sleeping.
– Thread Trace: Displays traces for the traced libpthread functions, such as pthread_mutex_lock,
pthread_mutex_trylock, and so on.
– Syscalls: Displays traces for traced syscall in the specific region of the timeline.
5. Trace Cutoff can be used to specify a duration in nanoseconds, which acts as a cutoff to load the
trace data, that is, any traced function which takes less than the specified nanoseconds will not be
displayed.
6. Click the Reset Zoom button to reset any zoom performed earlier.
7. Hover over any timeline to view the tool-tip containing the relevant data along with the timestamp. If
trace data is also present, the tool-tip shows the relevant traced functions with their start times and durations.
8. Filter Threads/Ranks enables you to filter which threads' (or ranks') timelines must be
displayed. By default, the timelines are sorted internally and the first 6 are loaded. However, from
the table, you can select the required threads and click Apply Filter to apply the changes. If
CPU profile data is collected, highlighting functions or modules is also possible. Each function is
assigned a random color, which can be modified and highlighted in the timeline (implying there are
samples from the function/module).
9. Each entry in the filter table has the necessary data, that is, name, parent object, and samples/trace
times aggregated across the profile.
10. Click the Apply Filter button to apply a custom selection of entities or highlight entities in
timeline.
11. Click Deselect selected Items to deselect all the entries in the filtering table except the first one.
This is useful when a custom selection is required but all timelines are already loaded.
12. At the bottom of the filtering pane, the timeline legend is displayed, which helps identify the color
to which each type of ‘data source’ or ‘trace’ is mapped.
13. The Show Core Transition button is disabled by default and works only when the CPU profiling
data is collected. When enabled, a red line is displayed in each timeline to signify when a thread
changes the core.
14. If any configuration is profiled with CSS enabled, select Threading Analysis > Select Data
Source > CPU Profile Samples. The callstack section will be enabled only if you select a valid
samples region.
Note: Time-series data (from Select Data Source) will be plotted as a line graph, where the x-axis is
time and the y-axis height indicates how close the value is to the maximum it reached. For trace
records, the height is always the total height of the timeline. However, the width varies based on
the duration of the traced function.
Support Matrix
Prerequisite
Compile the OpenMP application using a supported compiler (on a supported platform) with the
required compiler options to enable OpenMP.
• Parallel Regions shows the summary of all the parallel regions. This tab is useful to quickly
understand which parallel region might be load imbalanced. Double-click on the region names to
open the Regions Detailed Analysis page.
While performing the regular profiling, add the option --trace openmp or --omp to enable OpenMP
profiling. This command will launch the program and collect the profile data required to generate the
OpenMP analysis report.
Modes of tracing OpenMP events are:
• Full Tracing: All the OpenMP events are traced in full tracing. Use the following command to
perform full OpenMP tracing:
./AMDuProfCLI collect --trace openmp=full -o /tmp/myapp_perf <openmp-app>
• Basic Tracing: Only the events which are required for the high-level report generation are traced.
The size of the trace data collected is smaller as compared to the full tracing mode. This is the default
mode. Use the following command to perform basic OpenMP tracing:
./AMDuProfCLI collect --trace openmp=basic -o /tmp/myapp_perf <openmp-app>
8.2.4 Limitations
The following features are not supported in this release:
• OpenMP profiling with system-wide profiling scope.
• Loop chunk size and schedule type when the parameters are specified using the schedule clause. In
such a case, it shows the default values (1 and Static).
• Nested parallel regions.
• GPU offloading and related constructs.
• Callstack for individual OpenMP threads.
• OpenMP profiling on Windows and FreeBSD platforms.
• Applications with static linkage of OpenMP libraries.
• Attaching to running OpenMP application.
Support Matrix
The MPI profiling supports the following components and the corresponding versions:
Table 46. MPI Profiling Support Matrix
Component Supported Versions
MPI Spec MPI v3.1
MPI Libraries Open MPI v4.1.2
MPICH v4.0.2
ParaStation MPI v5.4.8
Intel® MPI 2021.1
OS Ubuntu 18.04 LTS, 20.04 LTS, and 22.04 LTS
RHEL 8.6 and 9
CentOS 8
In this mode, mpirun launches AMDuProfCLI as its program, and the MPI application is launched
through AMDuProfCLI's arguments. So, use the following syntax to profile an MPI application using
AMDuProfCLI:
$ mpirun [options] AMDuProfCLI [options] <program> [<args>]
If an MPI application is launched on multiple nodes, AMDuProfCLI will profile all the MPI rank
processes running on all the nodes. You can analyze the data for processes that ran on one, many, or
all nodes.
To collect profile data for all the ranks on multiple nodes, use mpirun's -H / --host option or specify
-hostfile <hostfile>:
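A representative invocation is shown below (a sketch; the rank count, hostfile, and application name are placeholders):
$ mpirun -np 256 -hostfile <hostfile> AMDuProfCLI collect --config tbp -o /tmp/myapp-perf myapp.exe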
To profile only a single rank in a setup where 256 ranks run on 2 hosts (128 ranks per host):
$ mpirun -host host1:128 -np 1 $AMDUPROFCLI_CMD myapp.exe : -host host2:128,host1:128 -np 255
--map-by core myapp.exe
To run this config to collect data only for the MPI processes running on host2, execute the following
command:
$ mpirun --app myapp_config
The --host option is not mandatory for creating the report file for the localhost.
• Generate a report for all the MPI processes that ran on another host (for example, host2) on which the
MPI launcher was not launched:
$ AMDuProfCLI report --input-dir /tmp/myapp-perf/<SESSION-DIR> --host host2
• Generate a report for all the MPI processes that ran on all the hosts:
$ AMDuProfCLI report --input-dir /tmp/myapp-perf/<SESSION-DIR> --host all
8.3.4 Limitations
The MPI environment parameters such as Total number of ranks and Number of ranks running
on each node are currently supported only for OpenMPI. MPI profiling with system-wide profiling
scope is not supported.
After the kernel debug info file is downloaded, it can be found at the default path:
$ /usr/lib/debug/boot/vmlinux-`uname -r`
RHEL
Follow the steps in Red Hat knowledgebase (https://access.redhat.com/solutions/9907) to download
the RHEL kernel debug info.
After the kernel debug info file is downloaded, it can be found at the default path:
$ /usr/lib/debug/lib/modules/`uname -r`/vmlinux
8.6.6 Constraints
• Do not move the downloaded kernel debug info from its default path.
• If the kernel version gets upgraded, then download the kernel debug info for the latest kernel
version. AMD uProf will not show the correct source and assembly if there is any mismatch
between the kernel debug info and the kernel version.
• While profiling or analyzing kernel samples, do not reboot the system in between. Rebooting the
system would cause the kernel to load at a different virtual address due to the KASLR feature of
Linux kernel.
• The setting in the /proc/sys/kernel/kptr_restrict file enables AMD uProf to resolve kernel
symbols and attribute samples to kernel functions, as shown in the example below. It does not enable
source- and assembly-level or call-graph analysis.
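For example, the kernel pointer restriction can be relaxed as follows (requires root; a value of 0 exposes kernel addresses to all users, so restore your distribution's default after profiling if your security policy requires it):
$ echo 0 | sudo tee /proc/sys/kernel/kptr_restrict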
This command will launch the program and collect the profile and trace data. Once the launched
application is executed, the AMDuProfCLI will display the session directory path in which the raw
profile and trace data are saved.
In the above example, the session directory path is:
/tmp/blockio-analysis/AMDuProf-fio-OsTrace_Dec-09-2021_12-19-27/
Generate Profile Report
Use the following CLI report command to generate the profile report in .csv format by passing the
session directory path as the argument to -i option:
$ ./AMDuProfCLI report -i /tmp/blockio-analysis/AMDuProf-fio-OsTrace_Dec-09-2021_12-19-27
...
Generated report file: /tmp/blockio-analysis/AMDuProf-fio-OsTrace_Dec-09-2021_12-19-27/
report.csv
After processing the data and generating the report, the report file path is displayed on the terminal.
An example of the disk I/O report section in the .csv report file is as follows:
Use the following CLI translate command invocation to process the raw trace records saved in the
corresponding session directory path:
$ ./AMDuProfCLI translate -i /tmp/blockio-analysis/AMDuProf-classic-OsTrace_Dec-09-2021_12-19-
27
...
Translation finished
Then import this session in the GUI by specifying the session directory path in Profile Data File text
input box in the HOME > Import Session view. This will load the profile data saved in the session
directory for further analysis.
Navigate to the ANALYZE page and then select Disk I/O Stats in the vertical navigation bar as
follows:
In the above figure, the table shows various block I/O statistics at the device level.
Prerequisites
For tracing ROCr, HIP APIs, and GPU activities:
• Requires AMD ROCm 5.5 to be installed. For the steps to install AMD ROCm, refer to the section
“Installing ROCm” on page 6.
Note: Tracing might not work as expected on versions 5.2.1 or older.
• Supported accelerators: AMD Instinct™ MI100 and MI200
Optional Settings
By default, AMD uProf uses the ROCm version pointed to by the /opt/rocm/ symbolic link. To specify
a different ROCm path, export it using AMDUPROF_ROCM_PATH before launching AMD uProf.
Example:
export AMDUPROF_ROCM_PATH=/opt/rocm-5.5.0/
This command will launch the program and collect the profile and trace data. Once the launched
application is executed, the AMDuProfCLI will display the session directory path in which the raw
profile and trace data are saved.
In the above example, the session directory path is:
/tmp/gpu-analysis/AMDuProf-SampleApp-GpuTrace_Dec-09-2021_12-19-27/
The behavior is undefined when the GPU profile collection is interrupted or the launched application is
killed from another terminal.
After processing the data and generating the report, the report file path is displayed on the terminal.
An example of the GPU trace report section in the .csv report file is as follows:
For more information on GPU tracing from the GUI, refer to section 7.8.1.
This command will launch the program and collect the profile data. Once the launched application is
executed, the AMDuProfCLI will display the session directory path in which the raw profile data are
saved.
In the above example, the session directory path is:
/tmp/AMDuProf-SampleApp-GPUProfile_Dec-09-2021_12-19-27/
The behavior is undefined when the GPU profile collection is interrupted or the launched application is
killed from another terminal.
Generate Profile Report
Use the following CLI report command to generate the profile report in .csv format by passing the
session directory path as the argument to -i option:
$ ./AMDuProfCLI report -i /tmp/AMDuProf-SampleApp-GPUProfile_Dec-09-2021_12-19-27
...
Generated report file: /tmp/AMDuProf-SampleApp-GPUProfile_Dec-09-2021_12-19-27/report.csv
After processing the data and generating the report, the report file path is displayed on the terminal.
An example of the GPU profile report section in the .csv report file is as follows:
Prerequisites
For tracing OS events and runtime libraries:
• Requires Linux kernel 4.7 or later (kernel 4.15 or later is recommended).
• Root access is required to trace the OS events in Linux.
• To install BCC and eBPF scripts, refer to the section “Installing BCC and eBPF” on page 7. To
validate the BCC installation, run the script: sudo AMDuProfVerifyBpfInstallation.sh.
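A representative collection invocation is sketched below; --trace os is the OS tracing option used in
the examples later in this section, while the AMDTClassicMatMul-bin target and /tmp output
directory are illustrative choices matching the session path shown below.
$ ./AMDuProfCLI collect --trace os -o /tmp AMDTClassicMatMul-bin   # illustrative target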
This command will launch the program and collect the profile and trace data. Once the launched
application finishes executing, AMDuProfCLI displays the session directory path in which the raw
profile and trace data are saved.
In the above example, the session directory path is:
/tmp/AMDuProf-classic-OsTrace_Dec-09-2021_12-19-27/
Generate Profile Report
Use the following CLI report command to generate the profile report in .csv format by passing the
session directory path as the argument to the -i option:
$ ./AMDuProfCLI report -i /tmp/AMDuProf-classic-OsTrace_Dec-09-2021_12-19-27
...
Generated report file: /tmp/AMDuProf-classic-OsTrace_Dec-09-2021_12-19-27/report.csv
After processing the data and generating the report, the report file path is displayed on the terminal.
An example of the OS trace report section in the .csv report file is as follows:
Examples:
• Collect the function count of malloc() from libc called by AMDTClassicMatMul-bin; libc will be
searched for in the default library paths:
$ AMDuProfCLI collect --trace os=funccount --func c:malloc -o /tmp/cpuprof-os AMDTClassicMatMul-bin
• Collect context switches, syscalls, pthread API tracing, and function count of malloc() called by
AMDTClassicMatMul-bin:
$ AMDuProfCLI collect --trace os --func c:malloc -o /tmp/cpuprof-os AMDTClassicMatMul-bin
• Collect the count of malloc(), calloc(), and kernel functions that match the pattern 'vfs_read*'
system-wide:
$ AMDuProfCLI collect --trace os --func c:malloc,calloc,kernel:vfs_read* -o /tmp/cpuprof-os -a -d 10
By default, Open MPI will attempt to build all three Fortran bindings: mpif.h, the mpi module, and
the mpi_f08 module.
• MPICH
--disable-fortran
By default, the Fortran bindings are enabled. You can use this option to disable them.
Support Matrix
Table 53. Support Matrix
Component – Supported Versions
MPI Spec – MPI v3.1
MPI Libraries – Open MPI v4.1.4, MPICH v4.0.3, ParaStation MPI v5.6.0, and Intel® MPI 2021.1
OS – Ubuntu 18.04 LTS, 20.04 LTS, and 22.04 LTS; RHEL 8.6 and 9; CentOS 8.4
Languages – C, C++, and Fortran
Tracing Modes
The AMDuProf CLI supports the following two modes for MPI tracing:
• LWT – Light-weight tracing is useful for quick analysis of an application. The report is
generated in .csv format on the fly during the collection stage.
• FULL – Full tracing is useful for in-depth analysis. This mode requires post-processing for
report generation in .csv format.
MPI Implementation Support
AMD uProf supports tracing of Open MPI, MPICH, and their derivatives:
• --trace mpi=mpich for MPICH and derivatives (default option)
• --trace mpi=openmpi for Open MPI
Ensure that the correct option (mpich or openmpi) is passed depending on the MPI implementation
used for compiling the MPI application. Passing an incorrect option might cause undefined behavior.
For more information on MPI tracing options, refer to “Linux Specific Options” on page 88.
After the tracing completes, the path to the session directory is displayed on the terminal. The LWT
report is generated immediately after the collection completes and is saved in the session directory
at: <output_directory>/<SESSION_DIR>/mpi/lwt/mpi-summary.csv.
The MPI implementation (MPICH or Open MPI) should be passed in the command; MPICH is the
default. The following is a sample command:
$ mpirun -np <number of processes> ./AMDuProfCLI collect --trace mpi=lwt,openmpi -o <output_directory> <application>
Ensure that the correct option (mpich or openmpi) is passed depending on the MPI implementation
used for compiling the MPI application. Passing an incorrect option might cause undefined behavior.
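For example, a concrete LWT run of a hypothetical Open MPI application (the rank count, the
/tmp/mpi-lwt output directory, and the ./myapp binary are placeholders):
$ mpirun -np 4 ./AMDuProfCLI collect --trace mpi=lwt,openmpi -o /tmp/mpi-lwt ./myapp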
After the tracing completes, the path to the session directory is displayed on the terminal.
The MPI implementation (MPICH or Open MPI) should be passed in the command; MPICH is the
default. The following is a sample command:
$ mpirun -np <number of processes> ./AMDuProfCLI collect --trace mpi=full,openmpi -o <output_directory> <application>
Ensure that the correct option (mpich or openmpi) is passed depending on the MPI implementation
used for compiling the MPI application. Passing an incorrect option might cause undefined behavior.
Generate Profile Report
Use the following command to generate the report in .csv format, passing the session directory path
with the -i option:
$ ./AMDuProfCLI report -i <output_directory>/<SESSION_DIR>
After completing the report generation, the report.csv file path is displayed on the terminal.
Tables in the Report file
The following screenshots show example sections of a full tracing report file:
3. Tool-tip shows additional details when the mouse is hovered over a cell.
4. Color-coding legend based on data volume.
5. Sum of all the data transfers for the rank.
6. Mean of all the data transfers for the rank.
Analyzing MPI Rank Timeline
Navigate to HPC > MPI Rank Timeline to view the MPI Rank timeline. This view shows the MPI
activities in the timeline graph as follows:
8. Trace Overlay Cutoff can be used to specify a duration in nanoseconds that acts as a cutoff for
loading the trace data; that is, any traced data source that takes less than the specified duration
will not be displayed.
9. Color coding legends for data source and trace overlay.
Analyzing MPI P2P API Summary
Navigate to HPC > MPI P2P API Summary. This view summarizes the P2P APIs called by the
application as follows:
• MPI Data Transfer, which classifies MPI P2P Send/Receive and plots the volume of data
transferred in the given time interval.
9.1 Overview
System-wide Power Profile
The AMD uProf profiler offers live power profiling to monitor the behavior of systems based on
AMD CPUs and APUs. It provides various counters to monitor power and thermal characteristics.
These counters are collected from various resources, such as RAPL and MSRs, at regular time
intervals, and are either reported as a text file or plotted as line graphs. They can also be saved into
the database for future analysis.
Features
AMD uProf provides the following features:
• The GUI can be used to configure and monitor the supported power metrics.
• The TIMECHART page helps to monitor and analyze:
– Logical Core level metrics – Core Effective Frequency and P-State
– Physical Core level metrics – RAPL based Core Power
– Package level metrics – RAPL based Package Power and Temperature
• The AMDuProfCLI timechart command collects the system metrics and writes them into a text
or comma-separated-value (CSV) file.
• The API library allows you to configure and collect the supported system-level performance,
thermal, and power metrics of AMD CPUs/APUs.
• The collected live profile data can be stored in the database for future analysis.
9.2 Metrics
The supported metrics depend on the processor family and model and are broadly grouped under
various categories. The following tables list the supported counter categories by processor family:
Table 56. Family 17h Model 00h – 0Fh (AMD Ryzen™, AMD Ryzen Threadripper™, and 1st Gen AMD EPYC™)
Power Counter Category – Description
Power – Average Power for the sampling period, reported in Watts. This is an estimated consumption value based on the platform activity levels. Available for Core and Package.
Frequency – CPU Core Effective Frequency for the sampling period, reported in MHz.
Temperature – Average temperature for the sampling period, reported in Celsius. The temperature reported is with reference to Tctl. Available for Package.
P-State – CPU P-State at the time when sampling was performed.
Table 57. Family 17h Model 10h – 1Fh (AMD Ryzen™ and AMD Ryzen™ PRO APU)
Power Counter Category – Description
Power – Average Power for the sampling period, reported in Watts. This is an estimated consumption value based on platform activity levels. Available for Core and Package.
Frequency – CPU Core Effective Frequency for the sampling period, reported in MHz.
Temperature – Average temperature for the sampling period, reported in Celsius. The temperature reported is with reference to Tctl. Available for Package.
P-State – CPU P-State at the time when sampling was performed.
Table 58. Family 17h Model 70h – 7Fh (3rd Gen AMD Ryzen™)
Power Counter Category – Description
Power – Average Power for the sampling period, reported in Watts. This is an estimated consumption value based on platform activity levels. Available for Core and Package.
Frequency – CPU Core Effective Frequency for the sampling period, reported in MHz.
P-State – CPU P-State at the time when sampling was performed.
Temperature – Average temperature for the sampling period, reported in Celsius. The temperature reported is with reference to Tctl. Available for Package.
Table 60. Family 19h Model 0h – 2Fh (EPYC 7003 and EPYC 9000)
Power Counter Category – Description
Power – Average Power for the sampling period, reported in Watts. This is an estimated consumption value based on platform activity levels. Available for Core and Package.
Frequency – CPU Core Effective Frequency for the sampling period, reported in MHz.
P-State – CPU P-State at the time when sampling was performed.
Temperature – Average temperature for the sampling period, reported in Celsius. The temperature reported is with reference to Tctl. Available for Package.
3. From the Select Profile Configuration screen, select the Live Power Profile tab.
All the live profiling options and available counters are displayed in the respective panes as
follows:
4. In the Counters pane, select the required counter category and the respective options.
Note: You can configure multiple counter categories.
During the profiling, you can render the graphs live.
5. Click the Start Profile button.
In this profile type, the profile data will be generated as line graphs in the TIMECHART page for
further analysis.
The equivalent CLI command for the options selected in the GUI is also displayed for Live Power
Profiling.
1. In the TIMECHART page, the metrics will be plotted in the live timeline graphs. The line graphs
are grouped together and plotted based on the category.
2. There is a data table adjacent to each graph to display the current value of the counters.
3. From the Graph Visibility pane, you can choose the graph to display.
4. When plotting is in progress, you can:
– Click the Pause Graphs button to pause the graphs without pausing the data collection. You
can click the Play Graphs button to resume them later.
– Click the Stop Profiling button to stop the profiling without closing the view. This will stop
collecting the profile data.
– Click the Close View button to stop the profiling and close the view.
The timechart run to collect the profile samples and write them into a file is as follows:
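(The following invocation is a representative sketch; the combined power,frequency event list and
the <application> placeholder are illustrative, following the conventions of the examples in the
next section.)
$ ./AMDuProfCLI timechart --event power,frequency -o /tmp/PowerOutput <application>   # illustrative event list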
The above run collects the power and frequency counters on all the devices on which these counters
are supported and writes them to the output file specified with the -o option. Before the profiling
begins, the given application is launched, and the data is collected until the application terminates.
9.4.1 Examples
Windows
• Collect all the power counter values for a duration of 10 seconds with a sampling interval of 100
milliseconds:
C:\> AMDuProfCLI.exe timechart --event power --interval 100 --duration 10
• Collect all the frequency counter values for 10 seconds, sampling them every 500 milliseconds
and saving the results to a CSV file:
C:\> AMDuProfCLI.exe timechart --event frequency -o C:\Temp\Poweroutput --interval 500 --duration 10
• Collect all the frequency counter values for cores 0 to 3 for 10 seconds, sampling them every 500
milliseconds and saving the results to a text file:
C:\> AMDuProfCLI.exe timechart --event core=0-3,frequency -o C:\Temp\Poweroutput
--interval 500 --duration 10 --format txt
Linux
• Collect all the power counter values for a duration of 10 seconds with a sampling interval of 100
milliseconds:
$ ./AMDuProfCLI timechart --event power --interval 100 --duration 10
• Collect all the frequency counter values for 10 seconds, sampling them every 500 milliseconds
and saving the results to a CSV file:
$ ./AMDuProfCLI timechart --event frequency -o /tmp/PowerOutput
--interval 500 --duration 10
• Collect all the frequency counter values for cores 0 to 3 for 10 seconds, sampling them every 500
milliseconds and saving the results to a text file:
$ ./AMDuProfCLI timechart --event core=0-3,frequency
-o /tmp/PowerOutput --interval 500 --duration 10 --format txt
Windows
A Visual Studio 2015 solution file CollectAllCounters.sln is available in the directory C:/Program
Files/AMD/AMDuProf/Examples/CollectAllCounters/ to build the sample program.
Linux
1. Execute the following commands to build:
$ cd <AMDuProf-install-dir>/Examples/CollectAllCounters
$ g++ -O -std=c++11 CollectAllCounters.cpp -I<AMDuProf-install-dir>/include -lAMDPowerProfileAPI -L<AMDuProf-install-dir>/lib -Wl,-rpath,<AMDuProf-install-dir>/bin -o CollectAllCounters
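The built sample can then be run directly; the AMDPowerProfileAPI shared library is expected to be
resolved through the rpath set at link time:
$ ./CollectAllCounters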
9.6 Limitations
• Only one power profile session can run at a time.
• The minimum supported sampling period in the CLI is 100 ms. It is recommended to use a larger
sampling period to reduce the sampling and rendering overhead.
10.1 Overview
AMD uProf can connect to a remote system, trigger collection and translation of data on that system,
and then visualize the results in the local GUI.
Note: The CLI does not support remote profiling.
AMD uProf uses a separate AMDProfilerService binary that can be launched as an application server
on the remote target; the local GUI can then connect to this server. By default, authorization must be
set up on the server before the local GUI can connect. Complete the following steps:
1. Locate the local GUI client ID.
2. Authorize the client ID on the remote target to connect to AMDProfilerService.
3. Launch AMDProfilerService with appropriate options/permissions on remote target.
4. Specify the connection details in the local GUI to connect to the remote target.
5. The local GUI updates itself and displays the remote data (including settings, session history,
available events for profiling/tracing, and so on).
6. Proceed to import session/profile on the remote target.
7. When you are done with the remote target, disconnect to restore the local data in the GUI.
Support
Remote profiling from Windows (host/local platform) to Linux (target/remote platform) is supported.
This IP address should be one of the IP addresses of the target/remote machine on which
AMDProfilerService is launched.
If the target/remote machine has multiple IP addresses, the ping command can be used on the host/
local machine to determine which IP address (of the remote machine) is reachable from the local
machine. The reachable IP address can then be passed to the --ip option.
(Optional) You can specify the following options:
Table 61. AMDProfilerService Options
Option – Description
--port <port_number> – Specify the port number.
--logpath <path> – Specify the log file path.
--bypass-auth – Skip the authorization. Note: This option must be used with caution as it will skip the authorization.
--fsearch-depth <depth> – Specify the maximum depth for recursive file search operations. Note: This option is applicable only for importing a session from the GUI.
--fsearch-timeout <timeout> – Specify the maximum duration (in seconds) for recursive file search operations. Note: This option is applicable only for importing a session from the GUI.
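For example, a typical launch on the remote target might look like the following; the IP address,
port number, and log path are placeholders:
$ ./AMDProfilerService --ip 192.168.0.10 --port 27000 --logpath /tmp/AMDProfilerService.log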
The remote target data is displayed after a few seconds. All subsequent profiling and session-import
steps are identical to the local workflow. Once connected, the provided IP, port, and name are saved
as follows:
You can double-click on any table entry containing IP address to load the corresponding details and
connect to the required remote target.
Once connected, the title bar will reflect the connection to the remote target, and the Disconnect
button in the Remote Profile page will be enabled (instead of the Connect button), as follows:
10.5 Limitations
• Once connected to a remote target, all the Browse buttons in the GUI will remain disabled. You
can copy/paste or type the URI paths wherever required.
• If you try to connect to a remote target without closing the GUI after profiling locally, the GUI
may occasionally crash. Hence, it is recommended to close the GUI after local profiling if a
remote connection is desired.
• If local data is not required and you connect to the same remote target frequently, use the
following command to connect to it directly (if it is running):
AMDuProf <ip_address> <port>
11.1 Overview
AMD uProf supports profiling in virtualized environments. The availability of the profiling features
depends on the counters virtualized by the hypervisor. Currently, AMD uProf supports the following
hypervisors (with Linux and Windows as the guest OS on these virtualized environments):
• VMware ESXi
• Microsoft Hyper-V
• Linux KVM
• Citrix Xen
Feature support matrix on various hypervisors:
Table 62. AMD uProf Virtualization Support
The matrix covers the following columns: Microsoft Hyper-V – Host Root Partition, Host Root
Partition (system mode), and Guest VMs; KVM – Host and Guest VMs; VMware ESXi – Host and
Guest VMs; Citrix Xen – Host and Guest VMs.
CPU Profiling:
• Time Based Profiling (TBP) – Supported on all of the above.
• Micro-architecture Analysis (EBP) – Supported on all of the above except Citrix Xen (Host and Guest VMs).
• Instruction Based Sampling (IBS) – Supported only on the Microsoft Hyper-V Host Root Partition.
• Cache Analysis – Supported only on the Microsoft Hyper-V Host Root Partition.
• HPC – MPI Code Profiling – Supported on all of the above.
• HPC – OpenMP Tracing – Supported on all of the above.
• HPC – MPI Tracing – Supported on all of the above.
• OS Tracing – Supported on all of the above.
Note: The virtualized hardware counters need to be enabled while configuring the guest VMs on the
respective hypervisors.
11.2.5 Examples
• Get the kvm guest OS PID:
$ ps aux | grep kvm
• Collecting pmcx76 event data for 10 secs (for guest kallsyms and guest kernel modules):
$ ./AMDuProfCLI collect -e event=pmcx76,interval=250000 -o /tmp/cpuprof-76-guest-only -d 10 --kvm-guest 2444 --guest-kallsyms /home/amd/guest/guest-kallsyms --guest-modules /home/amd/guest/guest-module
• Collecting system-wide samples for pmcx76 event data for 10 secs (for guest kallsyms and guest
kernel modules):
$ ./AMDuProfCLI collect -e event=pmcx76,interval=250000 -o /tmp/cpuprof-76-guest-only -d 10 --kvm-guest 2444 --guest-kallsyms /home/amd/guest/guest-kallsyms --guest-modules /home/amd/guest/guest-module -a
• Collecting system-wide samples for pmcx76 event data for 10 secs (for guest kallsyms):
$ ./AMDuProfCLI collect -e event=pmcx76,interval=250000 -o /tmp/cpuprof-76-guest-only -d 10 --kvm-guest 2444 --guest-kallsyms /home/amd/guest/guest-kallsyms -a
11.3 AMDuProfPcm
AMDuProfPcm is based on the following hardware and OS primitives provided by the host or guest
operating system. Run the command ./AMDuProfCLI info --system to obtain this information and
look for the following sections:
[PERF Features Availability]
Core PMC : Yes (Required to collect dc, fp, ipc, l1, l2 metrics)
L3 PMC : Yes (Required to collect l3 metrics)
DF PMC : Yes (Required to collect memory, xgmi, pcie metrics)
PERF TS : No
In a Linux environment, check whether the msr module is available and can be loaded using the
following command:
$ modprobe msr
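If the command succeeds, the loaded module can be confirmed with a standard Linux check:
$ lsmod | grep msr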
11.4 AMDuProfSys
AMDuProfSys is based on the following hardware and OS primitives provided by the host or guest
operating system. Run the command ./AMDuProfCLI info --system to obtain this information and
look for the following sections:
[PERF Features Availability]
Core PMC : Yes (Required to collect core metrics)
L3 PMC : Yes (Required to collect l3 metrics)
DF PMC : Yes (Required to collect df metrics)
PERF TS : No
In a Linux environment, check whether the Linux kernel perf subsystem and user-space tools are
available.
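For example, the following generic Linux checks (not uProf-specific commands) confirm that perf
is usable:
$ perf --version
$ cat /proc/sys/kernel/perf_event_paranoid
A lower perf_event_paranoid value permits broader profiling access for non-root users.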
Linux
<AMDuProf-install-dir>/lib/x64/libAMDProfileController.a
amdProfileResume
When the instrumented target application is launched through AMDuProf/AMDuProfCLI, profiling
starts in the paused state and no profile data is collected until the application calls this resume API.
bool amdProfileResume ();
amdProfilePause
When the instrumented target application has to pause the profile data collection, this API must be
called:
bool amdProfilePause ();
These APIs can be called multiple times within the application, but nested Resume–Pause calls are
not supported. AMD uProf profiles the code within each Resume–Pause pair. After adding these
APIs, recompile the target application before initiating a profile session.
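A minimal usage sketch follows. The amdProfileResume/amdProfilePause calls are the APIs
described above; the AMDProfileController.h header name is an assumption here (the actual header
ships under <AMDuProf-install-dir>/include):
// Minimal sketch: profile only the region between the Resume and Pause calls.
// Assumption: the control APIs are declared in AMDProfileController.h under
// <AMDuProf-install-dir>/include.
#include <stdio.h>
#include "AMDProfileController.h"

static double hot_loop(void)
{
    double sum = 0.0;
    for (long i = 1; i < 100000000L; i++)
        sum += 1.0 / (double)i;   // representative hot work
    return sum;
}

int main(void)
{
    // Launched with --start-paused, so no data is collected yet.
    amdProfileResume();           // start collecting profile data
    double result = hot_loop();   // only this region is profiled
    amdProfilePause();            // stop collecting profile data
    printf("result: %f\n", result);
    return 0;
}
Compile it with the gcc command below and launch it through AMDuProfCLI with the
--start-paused option, as shown at the end of this section.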
Note: Do not use the -static option while compiling with g++.
To compile a C application on Linux using gcc, use the following command:
$ gcc -g <sourcefile.c> -I<AMDuProf-install-dir>/include -L<AMDuProf-install-dir>/lib/x64/ -lAMDProfileController -lrt -pthread
Linux
$ ./AMDuProfCLI collect --config tbp --start-paused -o /tmp/cpuprof-tbp /tmp/AMDuProf/Examples/ClassicCpuProfileCtrl/ClassicCpuProfileCtrl
12.1.5 Limitations
The CPU profile control APIs are not supported for MPI applications.
Chapter 13 Reference
7. From the drop-down, select Program Database (/Zi) or Program Database for Edit &
Continue (/ZI).
compensate for this event counter multiplexing. For example, if an event is monitored 50% of the
time, the CPU Profiler scales the number of event samples by a factor of 2.
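As a sketch of the scaling arithmetic (with illustrative numbers): if 1,000 samples are recorded for
an event that was scheduled on a counter for only 50% of the run, the reported estimate is
1,000 × (100 / 50) = 2,000 samples.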
L1_DTLB_MISS_RATE – The DTLB L1 miss rate: the number of DTLB L1 misses divided by the total number of retired instructions.
L2_DTLB_MISS_RATE – The L2 DTLB miss rate: the number of L2 DTLB misses divided by the total number of retired instructions.
L2_ITLB_MISS_RATE – The ITLB L2 miss rate: the number of ITLB L2 misses divided by the total number of retired instructions.
MISALIGNED_LOADS_RATIO – The misalign ratio: the number of misaligned loads divided by the total number of DC accesses.
MISALIGNED_LOADS_RATE – The misalign rate: the number of misaligned loads divided by the total number of retired instructions.
STLI_OTHER – Store-to-load conflicts: a load was unable to complete due to a non-forwardable conflict with an older store. Most commonly, a load's address range partially but not completely overlaps with an uncompleted older store. Software can avoid this problem by using the same size and alignment for loads and stores when accessing the data. Vector/SIMD code is particularly susceptible to this problem; software should construct wide vector stores by manipulating the vector elements in registers using shuffle/blend/swap instructions prior to storing to memory, instead of using narrow element-by-element stores.
L2_CACHE_ACCESSES_FROM_IC_MISSES – The number of L2 cache access requests due to L1 instruction cache misses per thousand retired instructions. These L2 cache access requests also include prefetches.
L2_CACHE_MISSES_FROM_IC_MISSES – The number of L2 cache misses from L1 instruction cache misses per thousand retired instructions.
Following table lists the IBS op metrics for AMD “Zen4” and AMD “Zen3” server platforms:
Table 69. IBS Op Metrics for AMD “Zen4” and AMD “Zen3” Server Platforms
IBS Op Metric Description
%IBS_BR_TAG_TO_RETIRE_CYCLES Percentage of IBS Branch op tag to retire cycles.
%IBS_BR_MISP_TAG_TO_RETIRE_CYCLES Percentage of IBS Branch mis-predict op tag to retire
cycles.
%IBS_TAKEN_BR_TAG_TO_RETIRE_CYCLES Percentage of IBS Branch taken op tag to retire
cycles.
%IBS_RET_TAG_TO_RETIRE_CYCLES Percentage of IBS Branch return op tag to retire
cycles.
%IBS_BR_COMP_TO_RETIRE_CYCLES Percentage of IBS Branch op completion to retire
cycles.
%IBS_BR_MISP_COMP_TO_RETIRE_CYCLES Percentage of IBS Branch mis-predict op completion
to retire cycles.
%IBS_TAKEN_BR_COMP_TO_RETIRE_CYCLES Percentage of IBS Branch taken op completion to retire cycles.
%IBS_RET_COMP_TO_RETIRE_CYCLES Percentage of IBS Branch return op completion to
retire cycles.
IBS_BR_MISP_RATE_% Branch mis-predict rate in percentage. The number of
branch mis-predicts divided by the total number of
branch operations, expressed as percentage.
%IBS_L1_DTLB_REFILL_LAT_CYCLES Percentage of cycles wasted due to L1 DTLB misses.
The number of L1 DTLB refill latency cycles divided
by the total number of Tag-To-Retire cycles of all the
operations, expressed as percentage.
IBS_ST_L1_DC_MISS_RATE_% Store L1 DC Miss rate in percentage. The number of
store L1 DC misses divided by the total number of
store ops, expressed as percentage.
IBS_LD_L1_DC_MISS_RATE_% Load L1 DC Miss rate in percentage. The number of
load L1 DC misses divided by the total number of
load ops, expressed as percentage.
IBS_LD_L1_DC_HIT_RATE_% Load L1 DC Hit rate in percentage. The number of
load L1 DC hits divided by the total number of load
ops, expressed as percentage.
IBS_LD_L2_HIT_RATE_% Load L2 Hit rate in percentage. The number of load
L2 hits divided by the total number of load ops,
expressed as percentage.
IBS_LD_LOCAL_CACHE_HIT_RATE_% Percentage of load samples where the load operation
was serviced by the shared L3 cache or other L1/L2
cache in the same CCX. The number of
IBS_LD_LOCAL_CACHE_HIT divided by
IBS_LOAD, expressed in percentage.
IBS_LD_PEER_CACHE_HIT_RATE_% Percentage of load samples where the load operation
was serviced by L2/L3 cache in a different CCX of
same NUMA node. The number of
IBS_LD_PEER_CACHE_HIT divided by
IBS_LOAD, expressed in percentage.
IBS_LD_RMT_CACHE_HIT_RATE_% Percentage of load samples where the load operation
was serviced by L2/L3 cache of different CCX in
different NUMA node. The number of
IBS_LD_RMT_CACHE_HIT divided by
IBS_LOAD, expressed in percentage.
IBS_LD_LOCAL_DRAM_HIT_RATE_% Percentage of load samples where the load operation
was serviced by local system memory (local DRAM
via the memory controller) of same NUMA node. The
number of IBS_LD_LOCAL_DRAM_HIT divided
by IBS_LOAD, expressed in percentage.
IBS_LD_RMT_DRAM_HIT_RATE_% Percentage of load samples where the load operation
was serviced by DRAM in different NUMA node.
The number of IBS_LD_RMT_DRAM_HIT divided
by IBS_LOAD, expressed in percentage.
IBS_LD_DRAM_HIT_RATE_% Percentage of load samples where the load operation
was serviced by DRAM in the system. The number of
IBS_LD_DRAM_HIT divided by IBS_LOAD,
expressed in percentage.
IBS_LD_NVDIMM_HIT_RATE_% Percentage of load samples where the load operation
was serviced by NVDIMM in the system. The number
of IBS_LD_NVDIMM_HIT divided by IBS_LOAD,
expressed in percentage.
IBS_LD_EXT_MEM_HIT_RATE_% Percentage of load samples where the load operation
was serviced by Extension Memory in the system.
The number of IBS_LD_EXT_MEM_HIT divided by
IBS_LOAD, expressed in percentage.
IBS_LD_PEER_AGENT_MEM_RATE_% Percentage of load samples where the load operation
was serviced by Peer agent Memory in the system.
The number of IBS_LD_PEER_AGENT_MEM_HIT
divided by IBS_LOAD, expressed in percentage.
IBS_LD_NON_MAIN_MEM_HIT_RATE_% Percentage of load samples where the load operation
was serviced from MMIO, configuration or PCI
space, or from the local APIC in the system. The
number of IBS_LD_NON_MAIN_MEM_HIT
divided by IBS_LOAD, expressed in percentage.
IBS_LD_L1_DC_MISS_LAT_AVE Average Load L1 DC Miss latency cycles. The total
load L1 DC miss latency cycles divided by the total
number of load L1 DC misses.
%IBS_LD_L1_DC_MISS_LAT_CYCLES Percentage of cycles wasted to fetch the data. The
number of Load L1 DC misses latency cycles divided
by the total number of Tag-To-Retire cycles of all the
operations, expressed as percentage.
%IBS_LD_L2_HIT_LAT Percentage of IBS load L2 hit latency cycles with
respect to the load L1 DC miss latency cycles.
%IBS_LD_LOCAL_CACHE_HIT_LAT Percentage of IBS load local cache hit latency cycles
with respect to the load L1 DC miss latency cycles.
%IBS_LD_PEER_CACHE_HIT_LAT Percentage of IBS load peer cache hit latency cycles
with respect to the load L1 DC miss latency cycles.
%IBS_LD_RMT_CACHE_HIT_LAT Percentage of IBS load remote cache hit latency
cycles with respect to the load L1 DC miss latency
cycles.
%IBS_LD_LOCAL_DRAM_HIT_LAT Percentage of IBS load local DRAM hit latency
cycles with respect to the load L1 DC miss latency
cycles.
%IBS_LD_RMT_DRAM_HIT_LAT Percentage of IBS load remote DRAM hit latency
cycles with respect to the load L1 DC miss latency
cycles.
%IBS_LD_DRAM_HIT_LAT Percentage of IBS load DRAM hit latency cycles with
respect to the load L1 DC miss latency cycles.
%IBS_LD_NVDIMM_HIT_LAT Percentage of IBS load NVDIMM hit latency cycles
with respect to the load L1 DC miss latency cycles.
%IBS_LD_EXTN_MEM_HIT_LAT Percentage of IBS load Extension Memory hit latency
cycles with respect to the load L1 DC miss latency
cycles.
%IBS_LD_PEER_AGENT_MEM_HIT_LAT Percentage of IBS load Peer Agent Memory hit
latency cycles with respect to the load L1 DC miss
latency cycles.
%IBS_LD_NON_MAIN_MEM_HIT_LAT Percentage of IBS load Non main memory hit latency
cycles with respect to the load L1 DC miss latency
cycles.