Intel® VTune™ Profiler User Guide
Intel Corporation
www.intel.com
Contents
Notices and Disclaimers
Chapter 2: Introduction
    What's New in Intel® VTune™ Profiler
    Tuning Methodology
    Tutorials and Samples
    Notational Conventions
    Get Help
    Product Website and Support
    Related Information
    discard-raw-data
    duration
    filter
    finalization-mode
    finalize
    format
    group-by
    help
    import
    inline-mode
    knob
    kvm-guest-kallsyms
    kvm-guest-modules
    limit
    loop-mode
    mrte-mode
    no-follow-child
    no-summary
    no-unplugged-mode
    quiet
    report
    report-knob
    report-output
    report-width
    result-dir
    resume-after
    return-app-exitcode
    ring-buffer
    search-dir
    show-as
    sort-asc
    sort-desc
    source-object
    source-search-dir
    stack-size
    start-paused
    strategy
    target-install-dir
    target-system
    target-tmp-dir
    target-duration-type
    target-pid
    target-process
    time-filter
    trace-mpi
    user-data-dir
    verbose
    version
Report Problems from Command Line
Notices and Disclaimers
The products described may contain design defects or errors known as errata which may cause the product
to deviate from published specifications. Current characterized errata are available on request.
Intel disclaims all express and implied warranties, including without limitation, the implied warranties of
merchantability, fitness for a particular purpose, and non-infringement, as well as any warranty arising from
course of performance, course of dealing, or usage in trade.
Intel technologies may require enabled hardware, software or service activation.
No product or component can be absolutely secure.
Your costs and results may vary.
© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its
subsidiaries. Other names and brands may be claimed as the property of others.
Download Here
You can download VTune Profiler from these sources:
• Standalone version
• As part of Intel® oneAPI Base Toolkit
NOTE
You can download older versions of documentation for VTune Profiler from the documentation archive.
Start Here
• Introduction
• What's New in VTune Profiler
• Get Started
• Tutorials and Samples
• Performance Analysis Cookbook
Introduction
NOTE
Intel® VTune™ Profiler is the renamed successor to Intel® VTune™ Amplifier.
NOTE
Documentation for versions of Intel® VTune™ Profiler prior to the 2021 release is available for
download only. For a list of available documentation downloads by product version, see these pages:
• Download Documentation for Intel Parallel Studio XE
• Download Documentation for Intel System Studio
Key Features
This table summarizes the availability of important analysis types per host and remote target platform using
VTune Profiler:
Analysis                        Windows Target   Linux Target   Android Target   FreeBSD* Target
Hotspots analysis                     +               +               +
Threading analysis                    +               +
Remote analysis                       +               +               +                 +
Microarchitecture Exploration         +               +               +                 +
Custom analysis                       +               +               +                 +
GPU analysis                          +               +²              +
OpenMP* analysis                      +               +
MPI analysis                          +               +
¹Preview only; ²Intel HD Graphics and Intel Iris® Graphics only; ³EBS analysis only; ⁴Hardware event-based
metrics only, excl. MMIO accesses, DPDK, SPDK
VTune Profiler provides features that facilitate the analysis and interpretation of the results:
• Top-down tree analysis: Use to understand which execution flows in your application are the most
performance-critical.
• Timeline analysis: Analyze thread activity and the transitions between threads.
• ITT API analysis: Use the ITT API to mark significant transition points in your code and analyze
performance per frame, task, and so on.
• Architecture diagram: Analyze GPU OpenCL™ applications by exploring the GPU hardware metrics per GPU
architecture block.
• Source analysis: View source with performance data attributed per source line to explore possible causes
of an issue.
• Comparison analysis: Compare performance analysis results for several application runs to localize the
performance changes introduced by your optimizations.
• Start data collection paused mode: Click the Start Paused button on the command bar to start the
application without collecting performance data and click the Resume button to enable the collection at
the right moment.
• Grouping: Group your data by different granularity in the grid view to analyze the problem from different
angles.
• Viewpoints: Choose among preset configurations of windows and panes available for the analysis result.
This helps focus on particular performance problems.
• Hot keys to start and stop the analysis: Use a batch file to create hot keys that start and stop a
particular analysis (see the sketch below).
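For example, here is a minimal shell sketch of such a script for a Linux host (the result directory and application names are placeholders; on Windows, the same vtune commands can be placed in a batch file):
#!/bin/sh
# Start a Hotspots collection in paused mode so that no data is collected yet.
vtune -collect hotspots -start-paused -result-dir r000hs -- ./myapp &
sleep 10
# Resume data collection for the phase of interest, then stop it.
vtune -command resume -result-dir r000hs
sleep 30
vtune -command stop -result-dir r000hs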
Caution
Because VTune Profiler requires specific knowledge of assembly-level instructions, its analysis may not
operate correctly if a program (target) is compiled to generate non-Intel architecture instructions. In
this case, run the analysis with a target executable compiled to generate only Intel instructions. After
you finish using VTune Profiler, you can use optimizing compiler options that generate non-Intel
architecture instructions.
See Also
Get Started with Intel® VTune™ Profiler
• Server CPUs: Intel® Xeon® processor v3 and newer families.
• Client CPUs: 4th generation Intel® Core™ processors and newer families.
Starting with this release, VTune Profiler does not support processors older than the versions listed
above. To analyze performance on older processors, use an older version of VTune Profiler.
NOTE Families of Intel® Xe graphics products starting with Intel® Arc™ Alchemist (formerly DG2) and
newer generations feature GPU architecture terminology that shifts from legacy terms. For more
information on the terminology changes and to understand their mapping with legacy content, see
GPU Architecture Terminology for Intel® Xe Graphics.
• Platform Analyses
• VTune Profiler - Platform Profiler as Analysis Type
VTune Profiler – Platform Profiler has been completely integrated into VTune Profiler as an analysis
type. Platform Profiler is now fully available as an analysis from the GUI or command line of VTune
Profiler. For more information, see Platform Profiler Analysis.
• CPU Throttling Data in System Overview Analysis
The System Overview analysis now displays information about factors that can cause the CPU to
throttle. Use this information to determine whether your system is overheating or consuming significant
power, both of which can result in frequency drops that affect system performance.
• Microarchitecture Analyses
• Platform Diagram in Memory Usage View
This release introduces the Platform diagram in the Memory Usage viewpoint of the Memory
Access analysis type. Use this diagram to understand:
• System topology
• Utilization metrics for DRAM
• Intel® UPI links
• Physical cores
The platform diagram is available for:
• All client platforms
• Server platforms based on Intel® microarchitecture code name Skylake, with up to four sockets.
• Analysis Targets
• .NET
• .NET 5 Workloads
This release introduces support for running the Hotspots analysis on .NET 5 targets in Launch
Application mode when using hardware event-based sampling.
• Extended Support for .NET 5 Workloads
You can now analyze .NET 5 workloads in the Attach to Process mode when you use Hardware
Event-Based Sampling.
• FreeBSD* OS
• Input and Output Analysis on FreeBSD
You can now run the Input and Output analysis on remote FreeBSD targets. Analysis scope is
limited to platform-level metrics.
• SPDK on FreeBSD
On FreeBSD, the Input and Output analysis now supports Storage Performance Development Kit
(SPDK) analysis, so you can get SPDK-specific performance data on this OS.
• Code Annotation
The Instrumentation and Tracing Technology API (ITT API) is now fully supported on FreeBSD OS.
The appropriate header and library files are provided as part of the FreeBSD target package. You
can use ITT API to annotate your code and collect arbitrary statistics with little to no overhead.
• Support for Unified Shared Memory Workloads
Starting with the 2021.8 release, you can profile OpenCL, SYCL, and DPC++ applications that use
Unified Shared Memory (USM) workloads. For OpenCL applications, this release also supports explicit
data transfer of the buffer as Unified Shared Memory.
• GPU Accelerators
• Source-level analysis for DPC++ and OpenMP applications running on GPU over Level Zero
The following modes in GPU Compute/Media Hotspots analysis are now available when profiling Level
Zero applications:
• Dynamic Instruction Count
• Basic Block Latency
• Memory Latency
Support also includes full-scale analysis of the kernel source per code line, including Source/Assembly
mapping.
• Advanced Data Transfer Information in GPU Offload Analysis
The following additions to the Graphics window better clarify the data transfer between the CPU host and
the GPU device when you run GPU profiling analyses:
• Allocation time information displays as part of total time by device operation.
• The Data Transferred table has been renamed to the Transfer Size table. Columns under Transfer Size
have new names for data transferred between the host and the device.
• Highlights and tooltips for workloads with sub-optimal offload schemes direct your attention to where
the offload scheme can be improved.
• Improved Tooltips for Occupancy Metrics in GPU Analysis
The GPU Compute/Media Hotspots Analysis has been enhanced to detect factors that limit peak
achievable occupancy for the hottest computing tasks that make the EU array idle when waiting for the
scheduler. Improved tooltips for occupancy metrics now provide information about peak occupancy and
bounding reasons for existing computing task launch configuration.
• GPU Analysis Coverage for Self-Check
Coverage of checks by the self-check functionality in VTune Profiler now includes GPU analyses as well.
Run the vtune-self-checker.sh script on Windows and Linux systems to check the GPU Compute/Media
Hotspots Analysis in source analysis and characterization modes when you run DPC++ applications on
an Intel GPU. You must install the Intel® oneAPI Base Toolkit for this purpose.
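For example, on a Linux system with a default oneAPI installation, a minimal sketch (paths are assumptions; adjust them to your installation):
# Set up the oneAPI environment, then run the VTune Profiler self-checker,
# which now also covers the GPU analyses.
source /opt/intel/oneapi/setvars.sh
/opt/intel/oneapi/vtune/latest/bin64/vtune-self-checker.sh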
• Occupancy Report in GPU Hotspots Analysis
The GPU Compute/Media Hotspots analysis has been enhanced to display occupancy information in the
Summary section. Use this data to understand the architectural limitations of the GPU that affect
occupancy.
• CPU Context for GPU Execution in GPU Offload Analysis
The GPU Offload analysis now presents a richer set of information about execution on the GPU by
including context from the CPU. This includes stack information on:
• Execution
• Data transfer from host to device
• Data transfer from device to host
The viewpoint for the GPU Offload Analysis now includes the Call Stack pane with a new grouping by
GPU Computing Task/Host Call Stack. Navigate through transfer data contained in these panes to
identify inefficient code paths in your application.
• Analysis of Multiple GPUs
When you have multiple GPUs connected to your system, you can now analyze all of the GPUs
collectively with the GPU Offload and GPU Compute/Media Hotspots analyses. Previously, you could
analyze a single GPU at a time after VTune Profiler identified all the GPUs connected to the system.
When you run these analyses on all connected GPUs, see analysis information about each GPU in the
Summary window. Full compute set in Characterization mode is not available in multi-adapter and
multi-tile analysis.
• Hottest CPU Tasks in GPU Offload Analysis
The Summary view in the GPU Offload analysis now includes the Hottest Host Tasks table, which
displays the most active tasks running on the CPU. Use this table to examine the overhead on the
host. Click on a performance-critical task to see more information in the Graphics window, where
results are grouped by host Task Type.
• Support for Affinity Mask
If you use the ZE_AFFINITY_MASK variable to bind your workload to a single tile, VTune Profiler can
then attribute kernels to the correct tile and also display relevant metrics per kernel.
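For example, a minimal sketch (the tile value and the application name are placeholders):
# Bind the workload to tile 0 of GPU device 0, then profile it.
export ZE_AFFINITY_MASK=0.0
vtune -collect gpu-hotspots -result-dir r001gh -- ./my_gpu_app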
• Host-GPU Bandwidth Information in GPU Offload Analysis
Previously, you checked the Analyze memory bandwidth option in the GPU Offload analysis to see
data required for this computation. Starting with this release of VTune Profiler, you can use the
Analyze host-GPU bandwidth option instead. Depending on your hardware configuration, this
selection displays DRAM bandwidth, PCIe bandwidth, or both sets of data on the timeline.
• PCIe Bandwidth Information in Custom and Command Line Runs of GPU Offload Analysis
Use new options to collect information about PCIe bandwidth (between the host and GPU sides) when
you run custom and command line runs of the GPU Offload analysis:
• Use the collect-host-gpu-pci-bandwidth switch for both custom and command line runs.
• In the UI, check the Analyze host-GPU PCIe bandwidth option for custom analysis.
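For example, a command-line sketch, assuming the switch is passed as a knob of the GPU Offload analysis (the application name is a placeholder):
# Collect PCIe bandwidth between the host and the GPU during a GPU Offload run.
vtune -collect gpu-offload -knob collect-host-gpu-pci-bandwidth=true \
      -result-dir r002go -- ./my_gpu_app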
• Improvements to Peak Occupancy Metric
The GPU Peak Occupancy metric for a computing task now flags the factors that limit peak
occupancy in the order of priority. Start tuning your application by addressing the most restricting
factor. VTune Profiler customizes recommendations for potential improvements based on the launch
parameters of the compute kernel (work size, SLM and barriers usage).
• Enhancements to GPU Offload Summary
The Summary window of the GPU Offload analysis contains these enhancements for an improved user
experience:
• Locate hotspots in your code for periods when the GPU is not busy. See the new Top hotspots when GPU
was idle table in the GPU Time, % of Elapsed Time (formerly GPU Utilization) section.
• The Hottest Computing Functions section now includes occupancy information.
• Data Collection of CPU Host Stacks
When you collect information about host stacks in the GPU Offload and GPU Compute/Media Hotspots
analyses, you can now filter the data by selecting a call stack mode from the filter bar.
• Support to Trace DirectX* API on CPU Host
This release of VTune Profiler introduces support for profiling DirectX applications on the CPU host. These
versions of the DirectX API can be traced:
• DXGI
• Direct3D 11
• Direct3D 12
• Direct3D 11 On 12 (D3D11On12)
• Hardware Support
• Analysis Support for Intel® Microarchitecture Code Named Alder Lake
This version of VTune Profiler introduces support for Intel® microarchitecture code named Alder Lake in
these analysis types:
• Microarchitecture Exploration analysis
• Memory Access analysis
• Support for Intel® Atom® Processors
This release adds support for the Intel Atom® Processor P Series (code named Snow Ridge) in the
Hotspots, Microarchitecture Exploration, Memory Access, and Input and Output analyses.
• Support for 3rd Gen Intel® Xeon® Scalable Processor Architecture
This release supports the 3rd Gen Intel® Xeon® Scalable processor architecture (code named Ice Lake
Server).
• IDE Support
• Support for Microsoft Visual Studio* 2022
This release introduces support for the integration of VTune Profiler into Microsoft Visual Studio 2022.
• VTune Profiler Server
• New Capabilities for Account and Privilege Handling
• VTune Profiler Server now supports profiling of workloads that require sudo elevation.
• Introduced support for a collector wrapper script to elevate privileges before launching or attaching
to a workload.
• Application Performance Snapshot
• Metric tooltips in HTML reports
Metric tooltips in APS HTML reports now present a more holistic view of metrics and their properties.
The new tooltips present a compact yet comprehensive overview of a metric, which helps you to better
understand the importance of metrics in performance analysis. This change includes a visual bar that
indicates where the metric value stands in terms of current performance and tuning potential.
• PCIe bandwidth info in CLI reports
APS command line reports now include PCIe bandwidth metrics. This data is only available on server
platforms when using the Sampling Driver.
• New reports and filters
APS now features the following new types of reports and filters:
• Node topology report: view relations between ranks, nodes, and PCIe devices.
• Metrics report: get a configurable table that displays any collected metric for each rank, node, or
device.
• Ability to filter data by node.
• Outlier Detection
This release introduces a mechanism for the detection of outliers, or individual metric values
contributing to an average metric that differ significantly from the overall distribution or break a certain
threshold. Outliers can cause imbalance and distort average metric values. You can now see outliers in
both HTML and CLI reports, with attribution to specific rank or node where an outlier occurred.
• Metric Tooltip Enhancements
Metric tooltips now visualize ranges of average metrics, with their minimum, maximum, and average
contributing values.
• MPI Support
• Support for MPI applications in GPU and IO analyses
The GPU Offload, GPU Compute/Media Hotspots, and Input and Output analyses now support profiling
of MPI applications, as described in the MPI Code Analysis topic.
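For example, a minimal sketch of a per-node GPU Offload collection for an MPI application (the rank count, launcher, and application name are placeholders; with some MPI launchers the -trace-mpi option may not be required):
# Profile the ranks of an MPI application with the GPU Offload analysis.
mpirun -n 4 vtune -collect gpu-offload -trace-mpi -result-dir r003go -- ./my_mpi_app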
• User Interface
• Main Vertical Toolbar
This release introduces a new main vertical toolbar. All controls previously located in the main
horizontal toolbar are now on this vertical toolbar, which is designed with clear, bright controls.
• Enhanced Project Navigator User Experience
The Project Navigator pane now features menu options to open a new or existing project to better
facilitate your VTune Profiler experience.
• Improvements to Vectorization Information
The Vectorization sections of Performance Snapshot and HPC Performance Characterization analyses
have been enriched to provide a clearer picture of the state of vectorization in your application. Quickly
see if your code is not vectorized at all, if your code does not use the latest vector instruction set
extension, or if your code has too many scalar instructions. This version of VTune Profiler also features
improved recommendations to resolve vectorization issues.
• Rich Metric Tooltips in Multiple Analyses
This release introduces rich metric tooltips in Performance Snapshot, Hotspots, HPC Performance
Characterization, and Microarchitecture Exploration analyses. The new tooltips aim to make metrics
more intuitive by providing visualizations for thresholds, desired direction (more/less is better), and
tuning potential. Hover over a metric to get this tooltip.
• Detection of Compilation with Low Optimization Level in Hotspots Analysis
When debug information is available, VTune Profiler now detects and flags modules that may have
been compiled using non-optimal compiler optimization flags in the Top Hotspots section of the
Hotspots analysis result. This can help detect underutilization of compiler optimization capabilities and
correct the build system setup.
• Platform Diagram Extended with Persistent Memory Block
For Input and Output and Memory Access analyses, the Platform Diagram shown in Summary
windows now features a dedicated block for Persistent Memory devices, together with average per-
socket bandwidth.
This data is available on server platforms based on Intel microarchitectures code named Cascade Lake
and Ice Lake.
• Changes to Viewpoint Selection
Viewpoint selection has been adjusted for each analysis type. It is now disabled for certain analysis
types and offers a curated set of the most helpful viewpoints for others. You can re-enable the display
of all applicable viewpoints in the Options pane.
• Code Annotations
• New Instrumentation and Tracing Technology API Capabilities
A new Histogram API was added to ITT API. This API enables you to collect arbitrary histogram data
without extra overhead. The Summary tab of the Input and Output analysis automatically displays this
data in the form of a histogram.
• Debug Formats
• Support for DWARF5 Debug Format
VTune Profiler now supports version 5 of the DWARF debug format. You can now use debug information
in DWARF 5 format to resolve function names and source locations for binaries.
• Command Line Analysis
• Perf Tool Parameters for All Analysis Types
You can now use the target-system option to get command-line parameters for the native Linux Perf*
tool for all CPU hardware event-based analysis types, including custom analyses. Use the get-perf-cmd
argument for this purpose. You can collect a perf trace on a target with the Linux Perf tool and
then import the trace into the VTune Profiler UI.
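A possible sketch of this flow, assuming get-perf-cmd is passed as the value of the target-system option (the exact spelling may differ in your version; application and file names are placeholders):
# Print the native Linux Perf command line that matches a Hotspots collection.
vtune -collect hotspots -target-system=get-perf-cmd
# Run the printed perf command on the target, for example:
#   perf record -o trace.perf ... -- ./myapp
# Import the collected trace into a VTune Profiler result directory.
vtune -import trace.perf -result-dir r004perf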
• Documentation
• Information on Hybrid CPU Analysis
The VTune Profiler User Guide features a new topic that explains how to profile applications that run on
hybrid platforms.
• Guidance resource on GPU-profiling features in Intel® VTune™ Profiler
A new article captures learning pathways to profile GPUs and illustrates techniques to Optimize
Applications for Intel® GPUs with Intel® VTune™ Profiler. Use this article to understand the Intel® VTune™
Profiler workflow to profile and optimize GPUs. The article also informs about several key resources
including procedural topics, cookbook recipes, and webinars that explain GPU compute profiling and
graphics profiling with Intel software analyzer products.
• New CLI Cheat Sheet for quick reference
Added a new downloadable document, the VTune Profiler CLI Cheat Sheet. You can use this print-
friendly PDF for quick reference on the VTune Profiler command-line interface.
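For quick reference, two of the most common command lines (the application and result directory names are placeholders):
# Collect Hotspots data for an application into the r000hs result directory.
vtune -collect hotspots -result-dir r000hs -- ./myapp
# Print a summary report for the collected result.
vtune -report summary -result-dir r000hs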
• New Recipes in VTune Profiler Cookbook
The VTune Profiler Performance Analysis Cookbook features these new recipes:
• Measure the performance impact of non-uniform memory access (NUMA) in multi-processor
systems.
• Analyze hot code paths in your application using Flame Graphs.
• Improve hotspot observability in a C++ application using Flame Graphs.
• Simplified Chinese translation of the Top-Down Microarchitecture Analysis Method recipe.
See Also
Introduction to VTune Profiler
Tuning Methodology
When optimizing your code for parallel hardware, consider using the following iterative approach:
Ignore the top two elements if you are not running on a cluster. There is no single recommended
starting point, because what to optimize first varies. Step back, look at all the potential optimizations,
and see where you can get the biggest gain for the least work. That is where you want to start.
Use these Intel performance analysis tools for the performance optimization workflow:
Explore the performance analysis and tuning scenarios available for VTune Profiler in:
• Tutorials
• Performance Analysis Cookbook
• Profiling Scenarios for managed code and applications using Intel® runtime libraries
• Tuning Guides
Tutorials and Samples
Learning Objective:
• Demonstrates: Iterative application optimization with VTune Profiler, finding algorithmic and hardware
utilization bottlenecks
• Performance issues: memory access, vectorization
• Analyses used: Performance Snapshot, Hotspots, Memory Access, HPC Performance Characterization,
Microarchitecture Exploration
NOTE
• Samples are non-deterministic. Your screens may vary from the screen shots shown throughout
these tutorials.
• Samples are designed only to illustrate the VTune Profiler features and do not represent best
practices for tuning any particular code. Results may vary depending on the nature of the analysis
and the code to which it is applied.
See Also
Get Help
Notational Conventions
The following conventions may be used in this document.
Convention       Explanation                                          Example
Italic           Used for introducing new terms, denotation of        The filename consists of the basename and
                 terms, placeholders, or titles of manuals.           the extension.
                                                                      For more information, refer to the Intel®
                                                                      Linker Manual.
Monospace        Indicates code elements, commands, and               printf("hello, world\n");
                 program output.
*                An asterisk at the end of a word or name             OpenMP*
                 indicates it is a third-party product trademark.
Get Help
Use these documents and resources to better understand functionality in Intel® VTune™ Profiler:
• Installation Guides
• Get Started Guide
• User Guide
• Tutorials and Cookbook
• Articles, Webinars, and Videos
• Intel Processor Event Reference
• Release Notes
NOTE
All documentation for VTune Profiler is available online in the Intel Software Documentation Library on
Intel Developer Zone (IDZ). You can also download an offline version of the VTune Profiler
documentation.
Access Documentation
Access product documentation in one of these ways:
• For the cross-platform standalone user interface of VTune Profiler: click the menu button and
select Help > documentation_format, or click the Help button on the product toolbar.
• Windows* only: For VTune Profiler integrated into the Visual Studio user interface, select Intel VTune
Profiler version > documentation_format from the Help menu, or click the product icon on the toolbar.
NOTE
• VTune Profiler is shipped as a standalone version and as part of Intel oneAPI Base Toolkit. Access to
VTune Profiler documentation may vary depending on the product shipment.
• You need an internet connection to access all VTune Profiler documentation formats listed in the
menu.
• Google* Chrome* is the recommended browser to view a downloaded copy of the VTune Profiler
documentation. If you use Microsoft* Internet Explorer* or Microsoft Edge* browser, you may
encounter these issues:
• Internet Explorer 11: No help topics show up when you select them in the TOC pane.
Solution: Add http://localhost to the list of trusted sites in the Tools > Internet Options
> Security tab. You can remove the site when you finish viewing the documentation.
• Microsoft Edge: Help panes are truncated and a proper style sheet is not applied.
Solution: Click the Menu <…> and select Open with Internet Explorer.
Installation Guides
Installation Guides contain instructions for installing the product and post-installation
configuration steps.
The full documentation set is available in the Intel Software Documentation Library on the web and is
accessible via the Help menu or the Help toolbar button.
Context-Sensitive Help
Access help topics on active GUI elements through context-sensitive help configured in VTune Profiler. These
features are available on a product-specific basis:
• Learn more | F1 button | Context Help button provide help for an active dialog box, property page,
pane, or window.
• What's This Column: In the grid, right-click a performance metric column and select the What's This
Column entry from the context menu to open a help topic for this particular metric. You can also view a
lightweight metric description in the pop-up window when hovering over the column name.
Help Tour
Use the Help Tour on the Welcome page to get started with Intel® VTune™ Profiler and understand its
interface. The tour uses a sample project to guide you through a typical workflow.
Overlays
In some windows, an overlay outlines useful tips to manage analysis data and enhance your experience.
Where available, click the icon for a tour of useful features in the analysis window.
Release Notes
VTune Profiler Release Notes provide the most up-to-date information about the product, including a product
description, technical support, system requirements, and known limitations and issues.
See Also
Tutorials and Samples
Related Information
Product Website and Support
For additional support information, see the Technical Support section of your Release Notes.
System Requirements
For detailed information on system requirements, see the Release Notes.
Related Information
To better understand the performance data provided by Intel® VTune™ Profiler, explore these
additional resources on the web.
Intel® Processor Information
For the latest updates, errata, and information on Intel processors, explore the resources available
at https://software.intel.com/content/www/us/en/develop/articles/intel-sdm.html. The following sections
describe processor manuals for Intel 64, IA-32 architecture processors and for Intel Itanium® processors.
Intel 64 and IA-32 Architectures Manuals
The Intel 64 and IA-32 Architectures Software Developer's Manual consists of the following volumes that
describe the architecture and programming environment of all Intel 64 and IA-32 architecture processors:
• Volume 1 describes the architecture and programming environment of processors supporting IA-32 and
Intel 64 architectures.
• Volume 2 includes the full Instruction Set Reference, A-Z, in one volume. Describes the format of the
instruction and provides reference pages for instructions.
• Volume 3 includes the full System Programming Guide, Parts 1, 2, and 3, in one volume. Describes the
operating-system support environment of Intel 64 and IA-32 Architectures, including: memory
management, protection, task management, interrupt and exception handling, multi-processor support,
thermal and power management features, debugging, performance monitoring, system management
mode, VMX instructions, and Intel Virtualization Technology (Intel VT).
• Intel 64 and IA-32 Architectures Software Developer's Manual Documentation Changes section
describes bug fixes made to the Intel 64 and IA-32 Software Developer's Manual between versions.
NOTE
This Change Document applies to all Intel 64 and IA-32 Software Developer's Manual sets (combined
volume set, 3 volume set and 7 volume set).
Multithreading
You are strongly encouraged to read the following books for an in-depth understanding of threading. Each book
discusses general concepts of parallel programming by explaining a particular programming technology:
Technology                      Resource
Intel Threading Building        Reinders, James. Intel Threading Building Blocks: Outfitting C++ for
Blocks                          Multi-core Processor Parallelism. O'Reilly, July 2007
                                (http://oreilly.com/catalog/9780596514808/)
OpenMP* technology              Chapman, Barbara, Gabriele Jost, Ruud van der Pas, and David J. Kuck
                                (foreword). Using OpenMP: Portable Shared Memory Parallel
                                Programming. MIT Press, October 2007
                                (http://mitpress.mit.edu/catalog/item/default.asp?ttype=2&tid=11387)
Microsoft Win32* Threading      Akhter, Shameem, and Jason Roberts. Multi-Core Programming:
                                Increasing Performance through Software Multithreading. Intel Press,
                                April 2006 (http://www.intel.com/intelpress/sum_mcp.htm).
Intel Analyzers
Explore more profiling and optimization opportunities with Intel performance analysis tools:
• Intel Advisor to design your code for performance on Intel hardware with the roofline methodology and
explore its potential for vectorization, threading, and offload optimizations.
• Intel Inspector to analyze your code for threading, memory, and persistent memory errors.
• Intel Graphics Performance Analyzers to analyze performance of your game applications (system, frame,
and trace analysis).
Install Intel® VTune™ Profiler
System Requirements
To verify hardware and software requirements for your VTune Profiler download, see Intel® VTune™ Profiler
System Requirements.
NOTE
You can download older versions of documentation for VTune Profiler from the documentation archive.
Installation Information
Whether you downloaded Intel® VTune™ Profiler as a standalone component or with the Intel® oneAPI Base
Toolkit, the default path for your <install-dir> is:
Operating System        Path to <install-dir>
macOS*                  /opt/intel/oneapi/
For OS-specific installation instructions, refer to the VTune Profiler Installation Guide.
See Also
Sampling Drivers
Sampling Drivers
Intel® VTune™ Profiler uses kernel drivers to enable hardware event-based sampling. The VTune Profiler
installer automatically uses the Sampling Driver Kit to build drivers for your kernel with the default
installation options. If the drivers were not built and set up during installation (for example, lack of
privileges, missing kernel development RPM, and so on), VTune Profiler provides an error message and, on
Linux* and Android* systems, enables driverless sampling data collection based on the Linux Perf* tool
functionality, which has some analysis limitations for a non-root user. VTune Profiler also automatically uses
the driverless mode on Linux when hardware event-based sampling collection is run with stack analysis, for
example, for Hotspots or Threading analysis types.
If driver-based collection is not used by default, you can still enable it by building and installing the
sampling drivers for your target system (for Linux, see the sketch after this list):
• Windows* targets: Verify the sampling driver is installed correctly. If required, install the driver.
• Linux* targets:
• Make sure the driver is installed.
• Build the driver, if required.
• Install the driver, if required.
• Verify the driver configuration.
• Android* targets: Verify the sampling driver is installed. If required, build and install the driver.
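For Linux targets, a minimal sketch, assuming the sampling driver sources are located in <install-dir>/sepdk/src and that the build-driver and insmod-sep scripts from that directory are used (run with sudo rights; see the README.txt in that directory for prerequisites):
# Build the sampling drivers against the currently running kernel.
cd <install-dir>/sepdk/src
./build-driver
# Load the drivers and verify that they are present.
sudo ./insmod-sep
lsmod | grep -E 'sep|socperf|pax'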
NOTE
• You may need kernel header sources and other additional software to build and load the kernel
drivers on Linux. For details, see the README.txt file in the sepdk/src directory.
• A Linux kernel update can lead to incompatibility with VTune Profiler drivers set up on the system
for event-based sampling (EBS) analysis. If the system has installed VTune Profiler boot scripts to
load the drivers into the kernel each time the system is rebooted, the drivers will be automatically
re-built by the boot scripts at system boot time. Kernel development sources required for driver
rebuild should correspond to the Linux kernel update.
• If you loaded the drivers but do not use them and no collection is happening, there is no execution
time overhead of having the drivers loaded. The memory overhead is also minimal. You can let the
drivers be loaded at boot time (for example, via the install-boot-script, which is used by
default) and not worry about it. Unless data is being collected by the VTune Profiler, there will be
no latency impact on system performance.
NOTE
If you run GPU analysis via a Remote Desktop connection, make sure your software meets these
requirements:
• Intel® Graphics driver version 15.36.14.64.4080 or higher
• The target analysis application is runnable via the Remote Desktop connection
Install Intel Metrics Discovery API Library on Linux* OS
Intel Metrics Discovery API library is supported on Linux operating systems with kernel version 4.14 or
newer. If VTune Profiler cannot collect GPU hardware metrics and provides a corresponding error message,
make sure you have installed the API library correctly.
You can download Intel Metrics Discovery API library from https://github.com/intel/metrics-discovery.
Enable Permissions
Typically, you should run the GPU Offload and GPU Compute/Media Hotspots analyses with root privileges on
Linux or as an Administrator on Windows.
If you lack root permissions on Linux, enable collecting GPU hardware metrics for non-privileged users.
Follow these steps:
• Add your username to the video group.
To check whether your username is part of the video group, enter: groups | grep video.
To add your username to the video group, enter: sudo usermod -a -G video <username>.
• Set the value of the dev.i915.perf_stream_paranoid sysctl option to 0 as follows:
sysctl -w dev.i915.perf_stream_paranoid=0
This command makes a temporary change that is lost after reboot. To make a permanent change, enter:
echo dev.i915.perf_stream_paranoid=0 > /etc/sysctl.d/60-mdapi.conf
• Since GPU analysis relies on the Ftrace* technology, use the prepare_debugfs.sh script, which sets
read/write permissions for debugfs.
See Also
Rebuild and Install the Kernel for GPU Analysis
NOTE Rebuilding the Linux kernel is only required if you need to see detailed information about GPU
utilization. You can run GPU analyses and see high level information about GPU utilization without
rebuilding your Linux kernel.
NOTE
Installing the kernel requires root permissions.
mkdir -p /tmp/kernel
cd !$
5. Download kernel sources:
CONFIG_EXPERT=y
CONFIG_FTRACE=y
CONFIG_DEBUG_FS=y
CONFIG_DRM_I915_LOW_LEVEL_TRACEPOINTS=y
Update the file, if required, and save.
8. Create a full .config file for the kernel:
make olddefconfig
9. Build objtool. This tool is required for building the sampling driver.
NOTE Profiling support for CentOS* 7 is deprecated and will be removed in a future release.
To collect the i915 ftrace events required to analyze GPU utilization, your Linux kernel must be properly
configured. If Intel® VTune™ Profiler cannot start an analysis and reports the error "Collection of
GPU usage events cannot be enabled. i915 ftrace events are not available.", you need to rebuild and install
the re-configured i915 module. For example, for kernel 4.14 and higher, these settings should be enabled:
CONFIG_EXPERT=y and CONFIG_DRM_I915_LOW_LEVEL_TRACEPOINTS=y.
If you update the kernel often, make sure to build the special kernel for GPU analysis.
NOTE
Installing the kernel requires root permissions.
On CentOS* systems, if you update the kernel rarely, you can configure and rebuild only module i915 as
follows:
1. Install build dependencies:
sudo yum install flex bison elfutils-libelf-devel
2. Create a folder for kernel source:
mkdir -p /tmp/kernel
cd !$
3. Get your kernel version:
uname -r
This is an example of the command output:
4.18.0-80.11.2.el8_0.x86_64
make olddefconfig
11. Build module i915:
make -j$(getconf _NPROCESSORS_ONLN) modules_prepare
make -j$(getconf _NPROCESSORS_ONLN) M=./drivers/gpu/drm/i915 modules
If you get the following error:
LD [M] drivers/gpu/drm/i915/i915.o
ld: no input files
you need to replace the following lines in scripts/Makefile.build:
link_multi_deps = \
$(filter $(addprefix $(obj)/, \
$($(subst $(obj)/,,$(@:.o=-objs))) \
$($(subst $(obj)/,,$(@:.o=-y))) \
$($(subst $(obj)/,,$(@:.o=-m)))),$^)
with the line:
link_multi_deps = $(filter %.o,$^)
NOTE
See the patch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=69ea912fda74a673d330d23595385e5b73e3a2b9 for more information.
sudo depmod
sudo dracut --force
15. Reboot the machine:
sudo reboot
16. Make sure the new driver is loaded:
modinfo i915 | grep filename
The command output should be the following:
filename: /lib/modules/4.18.0-80.11.2.el8_0.x86_64/extradrivers/gpu/drm/i915/i915.ko
To roll back the changes and load the original module i915:
1. Remove the reference to the folder with the new driver from the /etc/depmod.d/* files:
sudo rm /etc/depmod.d/00-extra.conf
2. Update initramfs:
sudo depmod
sudo update-initramfs -u
3. Reboot the machine:
sudo reboot
NOTE
Installing the kernel requires root permissions.
make olddefconfig
11. Build module i915:
make -j$(getconf _NPROCESSORS_ONLN) modules_prepare
make -j$(getconf _NPROCESSORS_ONLN) M=./drivers/gpu/drm/i915 modules
If you get the following error:
LD [M] drivers/gpu/drm/i915/i915.o
ld: no input files
you need to replace the following lines in scripts/Makefile.build:
link_multi_deps = \
$(filter $(addprefix $(obj)/, \
$($(subst $(obj)/,,$(@:.o=-objs))) \
$($(subst $(obj)/,,$(@:.o=-y))) \
$($(subst $(obj)/,,$(@:.o=-m)))), $^)
with the line:
link_multi_deps = $(filter %.o,$^)
NOTE
See the patch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=69ea912fda74a673d330d23595385e5b73e3a2b9 for more information.
sudo depmod
sudo update-initramfs -u
15. Reboot the machine:
sudo reboot
16. Make sure the new driver is loaded:
modinfo i915 | grep filename
The expected command output is the following:
filename: /lib/modules/4.15.0-20-generic/extradrivers/gpu/drm/i915/i915.ko
To roll back the changes and load the original module i915:
1. Remove the reference to the folder with the new driver from the /etc/depmod.d/* files:
sudo rm /etc/depmod.d/00-extra.conf
2. Update initramfs:
sudo depmod
sudo update-initramfs -u
3. Reboot the machine:
sudo reboot
Install VTune Profiler Server
Depending on your choice, you can proceed with the next steps:
• Set up transport security.
• Configure user authentication/authorization.
How It Works
1. (Reverse proxy and SAML SSO modes) Admin installs a VTune Profiler Server instance in a lab.
2. (Reverse proxy and SAML SSO modes) Admin emails the URL of the installed VTune Profiler Server to
the User(s).
3. User accesses the VTune Profiler via a supported web browser, configures and runs analysis on an
arbitrary target system.
VTune Profiler Server can be accessed from any client machine.
4. When analysis is initiated, the VTune Profiler Server installs a VTune Profiler Agent on the specified
target system. This agent performs collection and uploads results to the VTune Profiler Server for
analysis and storage.
Use this glossary of terms for your reference:
VTune Profiler Server    VTune Profiler started as a web server, serving a web site that provides access
                         to the VTune Profiler GUI from remote client machines through a web browser.
User client system       A machine that the User is logged on to and uses to access the VTune Profiler
                         Server via a web browser.
Target system            A machine, local or remote, that is profiled with VTune Profiler.
VTune Profiler Agent     A piece of VTune Profiler software that runs on a target system.
System Requirements
VTune Profiler Server System
• 64-bit Linux* or Windows* OS
• Same system requirements and supported operating system distributions as specified for VTune Profiler
command line tool in the Release Notes
Client System
• Chrome, Firefox or Safari (recent versions)
VTune Profiler Server is tested with the latest versions of supported browsers at the time of each release.
Target System
• 32- or 64-bit Linux or Windows OS
• Same system requirements and supported operating system distributions as specified for VTune Profiler
target systems in the Release Notes
NOTE
VTune Profiler Server currently does not support cross-platform profiling. If the VTune Profiler Server is
hosted on a Linux system, then it supports data collection on Linux target systems only. The same is
applicable to Windows systems.
See Also
Web Server Interface
To set up the transport security, the Admin should follow these steps:
1. Provide the signed TLS certificate to users of the VTune Profiler Server.
Make sure to include the VTune Profiler Server DNS name in either the Common Name or the Alternative
Domain Names field.
For example, if the URL to access the VTune Profiler Server is https://vtune.lab01.myorg.com, the
TLS certificate Common Name should be vtune.lab01.myorg.com, or vtune.lab01.myorg.com
should be included into Alternative Domain Names.
2. Start the VTune Profiler Server as follows:
Certificate password:
If the certificate private key is stored in a separate file, use the --tls-certificate-key option:
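A possible sketch, assuming the certificate is passed with a --tls-certificate option (check vtune-backend --help for the exact option names; paths are placeholders):
# Start the server with a signed TLS certificate.
vtune-backend --web-port=8080 --tls-certificate=/path/to/server.crt
# If the certificate private key is stored in a separate file:
vtune-backend --web-port=8080 --tls-certificate=/path/to/server.crt \
              --tls-certificate-key=/path/to/server.key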
See Also
Web Server Interface
Passphrase Authentication
In the default personal use mode, VTune Profiler Server is configured to use passphrase authentication/
authorization. When you start the server, you can specify a passphrase:
There are no usernames involved: if the passphrase is shared between multiple users, then they are treated
as the same user.
VTune Profiler persists the hash of the passphrase. The browser also persists a secure HTTPS cookie so that
you do not have to enter the passphrase each time. Cookie expiration time is configurable; the default value is 365 days.
When you access the VTune Profiler Server from a different machine or use a different browser, or if the
browser cookies are cleaned / expired, then you are prompted to enter the passphrase again.
If you forget the passphrase, you can reset it by re-running the VTune Profiler Server with the
--reset-passphrase option. The server outputs a URL with a one-time token to reset the passphrase:
vtune-backend --reset-passphrase
Serving GUI at https://127.0.0.1:65417?one-time-token=e2ed7c1365c972ec1024ac4e53179a08
When you open this URL in a web browser, you are prompted to set a new passphrase.
vtune-backend --web-port=8080
Serving GUI at https://127.0.0.1:8080
warn: Server access is limited to localhost only. To enable remote access, restart with --allow-remote-ui.
• If VTune Profiler Server and reverse proxy are on different hosts: configure the reverse
proxy to use client certificate authentication when calling the VTune Profiler Server. Provide the
VTune Profiler Server with the path to the public part of the reverse proxy client certificate:
NOTE
It is recommended to use client certificate authentication even when VTune Profiler Server and
the reverse proxy are on the same host, to prevent unauthorized access from the host system.
See Also
Web Server Interface
Open Intel® VTune™ Profiler
Once you have downloaded Intel® VTune™ Profiler, follow these steps to run the application:
1. Locate the installation directory.
2. Set environment variables.
3. Open Intel® VTune™ Profiler
• From the GUI
• From the command line
macOS*:       /opt/intel/oneapi/
Windows* OS:  <install-dir>\setvars.bat
When you run this script, it displays the product name and the build number. You can now use the vtune and
vtune-gui commands.
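For example, on a Linux host with a default oneAPI installation (the path is an assumption; adjust it to your <install-dir>):
# Set up the environment, then verify that the vtune command is available.
source /opt/intel/oneapi/setvars.sh
vtune -version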
On a macOS* system, start Intel VTune Profiler version from the Launchpad.
NOTE
You can also launch the VTune Profiler from the Eclipse* IDE.
vtune-gui /root/intel/vtune/projects/matrix/matrix.vtuneproj
See Also
Web Server Interface
Set Up Project
To start with VTune Profiler, you need to have a project that specifies a target to analyze.
To create a new project, click the New Project... link. If a project is open, its name shows up on the
Welcome page as the Current project.
To configure and run a new analysis for the current project, click Configure Analysis... on the
Welcome screen. You also use this selection to configure target and analysis settings for a project
that is currently open.
The Configure Analysis link opens the Performance Snapshot analysis type by default. This
snapshot gives you a quick overview of issues affecting your application performance.
For other analysis types, click the analysis header to open the Analysis Tree which displays all
available analyses.
For quick and easy access to an existing project used recently, click the required project name in the
Recent Projects list. Hover over a project name in the list to see the full path to the project file.
Click Open Project... to open an existing project (*.vtuneproj).
To open a recently collected result, click the required item in the Recent Results list. By default,
each result name has an identifier of its analysis type (last two letters in the result name); for
example, tr stands for Threading analysis. Hover over a result name in the list to see the full path to
the result file.
Click Open Result... to open a result file (*.vtune).
Use the link bar to access additional informational resources such as Performance Analysis Cookbook,
online product documentation or social media channels. Consider getting started with the product by
running the Help Tour that guides you through the interface using a sample project.
Review the latest Featured Content that typically includes performance tuning scenarios and tuning
methodology articles.
Use the Get Started document to get up and running with a basic Hotspots analysis using your own
application on your host system.
• Windows*
• Linux*
• macOS*
NOTE
From a macOS host, you can launch a collection on a remote Linux* system or on an Android* system
and view the data collection result on the host. VTune Profiler does not support local analysis on a
macOS host.
See Also
Introduction and Key Features
Android* Targets
Project Navigator. Use the navigator to manage your project and collected analysis results.
Menu and Toolbar. Use the VTune Profiler menu and toolbar to configure and control performance
analysis, define and view project properties. Click the button to open/close the Project
Navigator. Use the Configure Analysis toolbar button to access an analysis configuration.
Analysis type and viewpoint. View the correlation of the analysis result and a viewpoint associated
with it. A Viewpoint is a pre-set configuration of windows/panes for an analysis result. For most of
analysis types, you can click the down arrow to switch between viewpoints and focus on particular
performance metrics.
Analysis Windows. Switch between window tabs to explore the analysis type configuration options
and collected data provided by the selected viewpoint.
Grouping. Use the Grouping drop-down menu to choose a granularity level for grouping data in the
grid. Available groupings are based on the hierarchy of the program units and let you analyze the
collected data from different perspectives; for example, if you are developing specific modules in an
application and interested only in their performance, you may select the Module/Function/Call
Stack grouping and view aggregated data per module functions.
Filtering. VTune Profiler provides two basic options for filtering the collected data: per object and per
time region. Use the filter toolbar to filter out the result data according to the selected object
categories: Module, Process, Thread, and so on. To filter the data by a time region, select this region
on the timeline, right-click, and choose the Filter In by Selection context menu option.
This could be useful, for example, to get region specific data in the context summary for the HPC
Performance Characterization or GPU Compute/Media Hotspots analyses.
See Also
Open Intel® VTune™ Profiler
Analyze Performance
<vtune-install-dir>/bin64/vtune-backend
If you want the VTune Profiler Server to access a specific TCP port, specify it with the --web-port
option. For example:
vtune-backend --web-port=8080
VTune Profiler Server outputs a URL to access the GUI. For the first run, the URL includes a one-time
token. For example:
NOTE Additional command-line options are available to make the usage of VTune Profiler Server in
containers more convenient. See Use VTune Profiler Server in Containers for details.
VTune Profiler Server allows you to create a directory with a custom hierarchy, organized to best fit
your needs. Once you point VTune Profiler Server to this directory using the --data-directory option,
users will be able to access all projects and results, regardless of folder names and levels of nesting.
This can be especially useful if you're using an HPC scheduler to regularly collect VTune Profiler
performance data and put it into a shared folder on the network for later examination. For example,
you can organize your results folder by users and their workloads:
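A minimal sketch of such a layout and the corresponding server launch (all names are illustrative):
# Example results hierarchy shared on the network:
#   /shared/vtune-results/
#   ├── alice/app1/r000hs/
#   ├── alice/app2/r001gh/
#   └── bob/solver/r002hs/
# Point the server at the top-level directory:
vtune-backend --web-port=8080 --data-directory=/shared/vtune-results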
NOTE
• By default, access to the VTune Profiler Server is limited to the local host only. To enable access
from remote client and target systems, restart the server with the --allow-remote-access
option.
• By default, server host profiling is not enabled. To enable the server host profiling, restart the
server with the --enable-server-profiling option.
NOTE
If you start the VTune Profiler Server in the personal/evaluation mode with no signed TLS certificate
provided, your web browser warns you that the default self-signed server certificate is not trusted and
asks for your confirmation to proceed.
NOTE
VTune Profiler Server uses SSH for automated agent deployment. Running an SSH server on the target
machine is required for automated deployment.
NOTE
You can use tools such as wget to download the Agent directly to the target system.
2. Extract the Agent archive with your tool of choice and copy its contents to the target system.
3. Run the vtune-agent executable on the target system and specify the agent owner using the -owner
<vtune-user-id> option (see the example command after this list).
NOTE
You can find your VTune Profiler user ID in the About dialog.
4. Compare the Agent key fingerprint in the WHERE pane of the Configure Analysis window with the
fingerprint printed out by the agent upon startup. If they match, click the Admit Agent button.
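For reference, a sketch of the agent start command from step 3 (the option is spelled as given above; replace the placeholder with your own VTune Profiler user ID):

# replace <vtune-user-id> with your ID from the About dialog
./vtune-agent -owner <vtune-user-id>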
Shared Agents
You can run a shared VTune Profiler Agent. In this case, the Agent will be available to all users of an instance
of VTune Profiler Server. This means that any user of this VTune Profiler Server instance will be able to run
data collection using this agent. It is recommended to only run shared agents using dedicated faceless
accounts.
To deploy a shared agent, check the Share the agent with all VTune Profiler users checkbox in the
WHERE pane of the Configure Analysis dialog, or use the --shared command line option when deploying
an agent manually.
NOTE
VTune Profiler maintains a list of used remote systems, if any, and displays it under Remote Targets.
#!/bin/sh
#Run VTune collector as the target process owner
sudo -C 65000 -A -u <target process owner> "$@"
The sudo command runs the VTune Profiler collector under the account specified as <target process
owner>. Replace this placeholder with the account name under which the target process is running.
If the target workload or the collector request a sudo elevation during the analysis, VTune Profiler Server
requests this password interactively in the Web Interface:
NOTE
• The interactive sudo elevation requires that the VTune Profiler Agent is deployed under an account
that has sudo privileges. To achieve that, ensure that the Username that you provide during
deployment belongs to an account with sudo privileges.
• VTune Profiler provides the password directly to the target system and does not store the
password.
The dashboard opens in a new tab and shows all agents that are related to this instance of VTune
Profiler Server. This includes both connected and disconnected agents.
See Also
Install VTune Profiler Server: Set up Intel® VTune™ Profiler as a web server, using a lightweight
deployment intended for personal use or a full-scale corporate deployment supporting a multi-user
environment.
Cookbook: Using VTune Profiler Server in HPC Clusters
NOTE Support for Visual Studio* 2017 is deprecated as of the Intel® oneAPI 2022.1 release, and will
be removed in a future release.
Integrate VTune Profiler into Visual Studio During Installation
VTune Profiler integrates into Visual Studio by default. You specify the version of Visual Studio used for
integration in the IDE Integration portion of the installation wizard. If you have several versions of Visual
Studio and want to instruct the installation wizard to use a specific version for integration, click the
Customize link and specify the required version on the Choose Integration Target page. For example:
NOTE
You can only integrate one version of VTune Profiler into Visual Studio IDE.
On the Choose Integration Target page, specify the version of Visual Studio for integration by clicking
Customize.
You can also access VTune Profiler from the Tools menu in the IDE.
Load a project in the Solution Explorer window. Once you have compiled it, you can profile with VTune
Profiler. When you click the Open VTune Profiler icon from the toolbar, the application opens to the
Welcome Page.
The graphical interface of VTune Profiler integrated into Visual Studio is similar to the standalone VTune
Profiler interface.
Configure VTune Profiler for Visual Studio
To configure VTune Profiler options in the Visual Studio IDE, click the pulldown menu next to the Open
• Use the General pane to configure general collection options such as application output destination,
management of the collected raw data, and so on.
• Use the Result Location pane to specify the result name template that defines the name of the result file
and its directory.
• Use the Source/Assembly pane to manage the source file cache and specify syntax for the disassembled
code.
• Use the Privacy pane to opt in/out of collecting your information for the Intel® Software Improvement
Program.
If you need to change environment settings, however, read the documentation provided for the Visual Studio
product.
NOTE
From the standalone interface, you can access VTune Profiler options via the File > Options... menu.
Tip
When you launch VTune Profiler directly from Intel System Studio, you do not need to set environment
variables on your system because they are set during the launch process.
To open the VTune Profiler from Intel System Studio, select the Tools > VTune Profiler > Launch VTune
Profiler menu option.
See Also
Analyze Performance
Containerization Support
Use containers to set up environments for profiling:
• You can prepare a container with an environment pre-configured with all the tools you need, then develop
within that environment.
• You can move that environment to another machine without additional setup.
• You can extend containers with different sets of compilers, profilers, libraries, or other components, as
needed.
Depending on the setup, Intel® VTune™ Profiler supports the following target types and analyses:
Setup Target Type Analysis Type
NOTE
• The Hotspots (hardware event-based sampling mode) and Microarchitecture Exploration analyses
are configured to use driver-less data collection based on the Linux Perf* tool.
• In the Profile System mode, VTune Profiler profiles all applications running in the same container
or in different containers simultaneously. So, the standard limitation for the system-wide profiling
of the managed code is not applicable to Java applications running in the containers.
• The Attach to Process target type for Java apps is supported only with the Java Development Kit
(JDK).
• When VTune Profiler and an application are NOT running in the same container, both local and
remote target system configurations are available.
See Also
Profile Container Targets from the Host
Prerequisites
• Configure a Docker image:
1. Create and configure a Docker image.
For the pre-installed Intel® oneAPI Base Toolkit including VTune Profiler, you may pull an existing
Docker image from the Docker Hub repository:
host> image=amr-registry.caas.intel.com/oneapi/oneapi:base-dev-ubuntu18.04
host> docker pull "$image"
2. To enable profiling from the container and have all host processes visible from the container, run your
Docker image with --pid=host as follows:
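A minimal sketch of such a run command (only --pid=host is required by this step; the other options are illustrative):

# -it and /bin/bash are illustrative; --pid=host is the required option
host> docker run --pid=host -it "$image" /bin/bash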
NOTE
These steps are NOT required if you use a Docker image with pre-installed Intel oneAPI Base Toolkit.
1. Install the command-line interface of VTune Profiler inside your Docker container.
Make sure to select the [2] Custom installation > [3] Change components to install and de-select
components that are not required in the container environment: [3] Graphical user interface and
[4] Platform Profiler.
2. After installation, set up environment variables for the VTune Profiler. For example, for VTune Profiler in
Intel oneAPI Base Toolkit:
For example, assuming the default oneAPI installation path inside the container (adjust the path to your installation):
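# assumed default oneAPI path; adjust to your installation
source /opt/intel/oneapi/vtune/latest/env/vars.sh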
See Also
Cookbook: Profiling in a Docker* Container
Cookbook: Profiling in a Singularity* Container
Installation Guide for VTune Profiler on Linux*
Run Command Line Analysis
Prerequisites
VTune Profiler automatically detects an application running in the container. No container configuration
specific for performance analysis is required. But to run user-mode sampling analysis types (Hotspots or
Threading), make sure to run the container with the ptrace support enabled:
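One common way to do this with Docker, sketched here with illustrative options, is to grant the SYS_PTRACE capability when you start the container:

# --cap-add=SYS_PTRACE enables ptrace inside the container; the other options are illustrative
host> docker run --cap-add=SYS_PTRACE -it <image>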
2. From the WHERE pane of the Configure Analysis window, select the Local Host system to start
analysis from your host Linux system or Remote Linux (SSH) to start analysis from a remote Linux
system connected to your host system via SSH. For the remote Linux targets, make sure to configure
SSH connection.
3. From the WHAT section, specify your analysis target. For container target analysis, the following target
types are supported: Attach to Process and Profile System.
Configure your process or system target as usual using available configuration options.
NOTE
In the Profile System mode, VTune Profiler profiles all applications running in the same container or
in different containers simultaneously. So, the standard limitation for the system-wide profiling of the
managed code is not applicable to Java applications running in the containers.
You can attach the VTune Profiler running under the superuser account to a Java process or a C/C++
application with an embedded JVM instance running under a low-privileged user account. For example, you
may attach the VTune Profiler to Java-based daemons or services.
NOTE
The dynamic attach mechanism is supported only with the Java Development Kit (JDK).
4. From the HOW section, select an analysis and customize the analysis options, if required.
NOTE
The Hotspots (hardware event-based sampling mode) and Microarchitecture Exploration analyses are
configured to use driverless data collection based on the Linux Perf* tool to gather performance data
for targets running in a container.
View Data
The collected result opens in the default Hotspots viewpoint, where paths to container modules show up with
prefixes (for instance, docker or lxc):
See Also
Cookbook: Profiling in a Docker* Container
Cookbook: Profiling in a Singularity* Container
Java* Code Analysis
macOS* Support
You can run Intel® VTune™ Profiler on a macOS* host system to launch a collection on a remote Linux*
system or Android* system. You can also view the data collection result on the macOS host. However, Intel®
VTune™ Profiler does not support data collection on a local macOS machine.
Prerequisites
See the Intel VTune Profiler Installation Guide - macOS for detailed information about installing and
configuring VTune Profiler for use on a macOS host.
1. Install VTune Profiler on your macOS host.
2. Set up an SSH connection to your remote target. You may need to install the appropriate drivers on the
target system:
• Target Linux System
• Target Android System
Get Started
1. Launch the VTune Profiler GUI from the Launchpad or launch the command line collector by executing
the amplxe-vars script and running the vtune command. By default, VTune Profiler is installed under
the /Applications directory. For more information, see Standalone VTune Profiler Interface.
2. Create a new project.
3. Click the Configure Analysis icon to set up your remote collection. This opens the Performance
Snapshot analysis type by default.
NOTE Profiling support for the macOS* 11 operating system is deprecated and will be removed in a
future release.
See Also
Introduction
Analyze Performance
Set Up Project
For Microsoft Visual Studio* IDE, VTune Profiler creates a project for an active startup project, inherits Visual
Studio settings and uses the application generated for the selected project as your analysis target. The
default project directory is My VTune Results-[project name] in the solution directory.
For the standalone graphical interface, create a project by specifying its name and path to an analysis target.
The default project directory is %USERPROFILE%\My Documents\Amplifier XE\Projects on Windows*
and $HOME/intel/vtune/projects on Linux*.
To create a VTune Profiler project for the standalone GUI:
1. Click New Project... in the Welcome screen.
3. Click the Create Project button to create a container *.vtuneproj file and open the Configure Analysis window.
Your default project is pre-configured for the Performance Snapshot analysis. This presents an overview of
issues that affect the performance of your application. Click the Start button to proceed with the default
setup.
To select a different analysis type, click on the name of the analysis in the analysis header section. This
opens an Analysis Tree with all available analysis types.
NOTE
You cannot run a performance analysis or import analysis data without creating a project.
See Also
WHERE: Analysis System
Use these options to decide where you want to run the analysis.
NOTE
This type of the target system is not available for macOS*.
Arbitrary Host (not connected): Create a command line configuration for a platform NOT accessible from the current host, which is called an arbitrary target.
See Also
Analysis System Options
SSH destination field: Specify a username, hostname, and port (if required) for your remote Linux machine as username@hostname[:port].
VTune Profiler installation directory on the remote system field: Specify a path to the VTune Profiler on the remote system.
• If VTune Profiler is not installed on the remote system, the collectors are automatically copied over, installed in the default location (/tmp), and the path is supplied.
• If VTune Profiler is already installed in a location other than /tmp, add the location here.
Temporary directory on the remote system field: Specify a path to the /tmp directory on the remote system where performance results are temporarily stored.
Deploy button: Deploy the collector package to the target system if the package is not found on the target system.
Android* Options
When you select the Android Device (ADB) system on the WHERE pane, the VTune Profiler displays the
ADB destination menu and prompts you to specify an Android device for analysis. When the ADB
connection is set up, the VTune Profiler automatically detects available devices and displays them in the
menu.
Hardware platform field: Select a hardware platform for analysis from the drop-down menu, for example: Intel® processor code named Anniedale.
What's Next
In the WHAT pane, select an analysis target for the specified analysis system.
NOTE
You can launch an analysis only for targets accessible from the current host. For an arbitrary target,
you can only generate a command line configuration, save it to the buffer and later launch it on the
intended host.
See Also
Set Up Android* System
Arbitrary Targets
To change a target type for your project, click the Browse button on the WHAT pane. Select from
these target types:
Launch Application: Enable the Launch Application pane and choose and configure an application to analyze, which can be either a binary file or a script. See options for launching an application.
NOTE
This target type is not supported for the Hotspots analysis of Android applications. Use the Attach to Process or Launch Android Package types instead.
Attach to Process: Enable the Attach to Process pane and choose and configure a process to analyze. See options for attaching to a process.
Profile System: Enable the Profile System pane and configure the system-wide analysis that monitors all the software executing on your system.
Launch Android Package: Enable the Launch Android Package pane to specify the name of the Android* package to analyze and configure target options. See options for launching an Android package.
Options available for the target configuration depend on the target system you select in the WHERE pane.
To focus on analyzing particular processes, you may collect data on all processes (without selecting the
Attach to Process target type) and then filter the collected results as follows:
1. From the Grouping drop-down menu in the Bottom-up window, select the grouping by Process, for
example: Process/Function/Thread/Call Stack.
2. In the grid, right-click the process you are interested in and select the Filter In by Selection option
from the context menu.
VTune Profiler updates the grid to provide data for the selected process only.
3. From the Grouping drop-down menu, select any other grouping level you need, for example:
Function/Call Stack.
VTune Profiler groups the data for the selected process according to the granularity you specified.
NOTE
If attaching to a running process causes a hang or crash, consider launching your application with the
VTune Profiler in a paused state, and resume the collection when the application gets to an area of
interest.
See Also
WHERE: Analysis System
3. Choose a target type on the WHAT pane and configure the options below.
NOTE
To create a command line configuration for a target not accessible from the current host, choose the
Arbitrary Host target system on the WHERE pane. Make sure to choose an operating system your
target will be running with: Windows or GNU/Linux and a hardware platform.
Target options vary with the selected target system and target type (Launch Application, Launch Android
Package, Attach to Process, or Profile System).
Basic Options
Inherit settings from Visual Studio* project check box (supported for Visual Studio IDE only): Enable/disable using the project currently opened in Visual Studio IDE and its current configuration settings as a target configuration. Checking this check box makes all other target configuration settings unavailable for editing.
Inherit system environment variables check box: Inherit and merge system and user-defined environment variables. Otherwise, only the user-defined variables are set.
Application field: Specify a full path to the application to analyze, which can be a binary file or script.
Use application directory as working directory check box: Automatically match your working and application directory (enabled by default). An application directory is the directory where your application resides. For example, for a Linux application /home/foo/bar the application directory is /home/foo. Application and working directories may be different if, for example, an application file is located in one directory but should be launched from a different directory (working directory).
Working directory field: Specify a directory to use for launching your analysis target. By default, this directory coincides with the application directory.
Package name field: Specify the name of the Android* package (*.apk) to analyze.
NOTE
For performance analysis on non-rooted devices, compile your Android application with the debuggable attribute set to true (android:debuggable="true"), but make sure to set APP_OPTIM to release in your Application.mk to enable compilation with optimization.
Use MPI launcher check box: Enable the check box to generate a command line configuration for MPI analysis. Configure the following MPI analysis options:
• Select MPI launcher: Select an MPI launcher that should be used for your analysis. You can either enable the Intel MPI launcher option (default) or select Other and specify a launcher of your choice.
• Number of ranks: Specify the number of ranks used for your application.
• Profile ranks: Use All to profile all ranks, or choose Selective and specify particular ranks to profile, for example: 2-4,6-7,8.
• Result location: Specify a relative or absolute path to the directory where the analysis result should be stored.
Advanced Options
Use the Advanced section to provide more details on your target configuration.
User-defined environment variables field: Type or paste environment variables required for running your application.
Managed code profiling mode menu: Select a profiling mode for managed code. Managed mode attributes data to managed source and only collects the managed portion. Native mode collects everything but does not attribute data to managed source. Mixed mode collects everything and attributes data to managed source where appropriate.
Automatically resume collection after (sec): Specify the time that should elapse before the data collection is resumed. When this option is used, the collection starts in the paused mode automatically.
Automatically stop collection after (sec): Set the duration of data collection in seconds starting from the target run. This is useful if you want to exclude some post-processing activities from the analysis results.
Analyze child processes check box: Collect data on processes launched by the target process. Use this option when profiling an application with a script.
Estimate the application duration time. This value affects the size of collected data. For long-running targets, the sampling interval is increased to reduce the result size. For hardware event-based sampling analysis types, the VTune Profiler uses this estimate to apply a multiplier to the configured sample after value.
Allow multiple runs check box: Enable multiple runs to achieve more precise results for hardware event-based collections. When disabled, the collector multiplexes events in a single collection run, which lowers result precision.
Analyze system-wide check box: Enable analyzing all processes running on the system. When disabled, only the target process is analyzed. This option is applicable to hardware event-based sampling analysis types only.
Limit collected data by section: If the amount of raw collected data is very large and takes a long time to process, use any of the following options to limit the collected data size:
• Result size from collection start, MB: Set the maximum possible result size (in MB) to collect. VTune Profiler will start collecting data from the beginning of the target execution and suspend data collection when the specified limit for the result size is reached. For unlimited data size, specify 0.
• Time from collection end, sec: Set the timer enabling the analysis only for the last seconds before the target run or collection is terminated. For example, if you specified 2 seconds as a time limit, the VTune Profiler starts the data collection from the very beginning but saves the collected data only for the last 2 seconds before you terminate the collection.
NOTE
The size of data stored in the result directory may not exactly match
the specified result size due to the following reasons:
• The collected data may slightly exceed the limit since the VTune
Profiler only checks the data size periodically.
• During finalization, the VTune Profiler loads the raw data into a
database with additional information about source and binary files.
CPU mask field: Specify CPU(s) to collect data on (for example: 2-8,10,12-14). This option is applicable to hardware event-based analysis types only.
Custom collector field: Provide a command line for launching an external collection tool, if any. You can later import the custom collection data (time intervals and counters) in a CSV format to a VTune Profiler result.
Select finalization mode section: Finalization may take significant system resources. For a powerful target system, select Full mode to apply finalization immediately after collection. Otherwise, shorten finalization by selecting the fast mode (default) or defer it to run on another system (compute checksums only).
Wrapper script field: Provide a script that is launched on the target system before starting the collection. On the host system, you can prepare a custom script that prepares the target environment and calls the VTune Profiler collector in this environment.
An example of the wrapper script:
#!/bin/bash
# Prefix script: runs before the collection starts
echo "Target process PID: $VTUNE_TARGET_PID"
# Launch the VTune Profiler collection command passed to the script as arguments (see the note on "$@" below)
"$@"
# Postfix script: runs after the collection finishes
ls -la $VTUNE_RESULT_DIR
You can use the script to perform any actions available through the CLI of
your target operating system, and use "$@" or "$*" to pass all arguments
into the script and start VTune Profiler collection in this environment.
The following environment variables are available from the script:
VTUNE_TARGET_PID
VTUNE_TARGET_PROC_NAME
VTUNE_RESULT_DIR
VTUNE_TEMP_DIR
VTUNE_TARGET_PACKAGE_DIR
VTUNE_DATA_DIR
VTUNE_USER_DATA_DIR
NOTE
• VTune Profiler preserves the content of the script. The script is
preserved within the project and is run for every analysis within that
project. To apply any changes to the script, attach it again using the
same Wrapper script field.
• For Linux targets, make sure that the script file is saved with LF line
endings.
Result location options: Select where you want to store your result file. By default, the result is stored in the project directory.
Trace MPI check box (Linux* targets only): Configure collectors to trace MPI code and determine MPI rank IDs in case of a non-Intel MPI library implementation.
Analyze KVM guest OS check box (Linux targets only): Enable KVM guest system profiling. For proper kernel symbol resolution, make sure to specify:
• a local path to the /proc/kallsyms file copied from the guest OS
• a local path to the /proc/modules file copied from the guest OS
Analyze unplugged device check box: Enable collection on an unplugged device to exclude ADB connection and power supply impact on the results. When this option is used, you configure and launch an analysis from the host, but data collection starts after disconnecting the device from the USB cable or a network. Collection results are automatically transferred to the host as soon as you plug the device back in.
Select a system for result finalization options: The result can be finalized on the same target system where the analysis is run (default). In this case, make sure your target system is powerful enough for finalization. If you choose to finalize the result on another system, VTune Profiler will only compute module checksums to avoid ambiguity in resolving binaries on a different system.
Support Limitations
• VTune Profiler provides limited support for profiling Windows* services. For details, see the Profiling Windows
Services article on the web.
• System-wide profiling is not supported for the user-mode sampling and tracing collection.
• For driverless event-based sampling data collection, VTune Profiler supports local and remote Launch
Application, Attach to Process, and Profile System target types, but their support fully depends on the Linux
Perf profiling credentials specified in the /proc/sys/kernel/perf_event_paranoid file and managed
by the administrator of your system using root credentials. For more information, see the perf_event
related configuration files topic at http://man7.org/linux/man-pages/man2/perf_event_open.2.html. By
default, only profiling of user processes in both user and kernel spaces is permitted, so you need to
grant wider profiling credentials via the perf_event_paranoid file to employ the Profile System
target type.
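As a quick sketch, you can inspect the current setting and relax it for system-wide profiling as follows (the value 0 is illustrative; choose a value consistent with your security policy):

cat /proc/sys/kernel/perf_event_paranoid
# relax the restriction until the next reboot (illustrative value)
sudo sh -c 'echo 0 > /proc/sys/kernel/perf_event_paranoid'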
What's Next
In the HOW pane, select an analysis type applicable to the specified target type and click Start to run the
analysis.
NOTE
You can launch an analysis only for targets accessible from the current host. For an arbitrary target,
you can only generate a command line configuration, save it to the buffer and later launch it on the
intended host.
See Also
Arbitrary Targets
Click the header in the HOW pane to open an analysis tree. Select an analysis type from one of these
groups:
Performance Snapshot analysis:
• Use Performance Snapshot to get an overview of issues that affect the performance of an application on
your system. The analysis is a good starting point that recommends areas for deeper focus. You also get
guidance on other analysis types to consider running next.
Algorithm analysis:
• Use the Hotspots analysis type to investigate call paths and find where your code is spending the most
time. Identify opportunities to tune your algorithms. See Finding Hotspots tutorial: Linux | Windows.
• Use Anomaly Detection (preview) to identify performance anomalies in frequently recurring intervals of
code like loop iterations. Perform fine-grained analysis at the microsecond level.
• Memory Consumption is best for analyzing memory consumption by your app, its distinct memory
objects, and their allocation stacks. This analysis is supported for Linux targets only.
Microarchitecture analysis:
• Microarchitecture Exploration (formerly known as General Exploration) is best for identifying the CPU
pipeline stage (front-end, back-end, and so on) and hardware units responsible for your hardware
bottlenecks.
• Memory Access is best for memory-bound apps to determine which level of the memory hierarchy is
impacting your performance by reviewing CPU cache and main memory usage, including possible NUMA
issues.
Parallelism analysis:
• Threading is best for visualizing thread parallelism on available cores, locating causes of low concurrency,
and identifying serial bottlenecks in your code.
• Use HPC Performance Characterization to understand how your compute-intensive application is using the
CPU, memory, and floating point unit (FPU) resources. See Analyzing an OpenMP* and MPI Application
tutorial: Linux.
I/O analysis:
• Input and Output analysis monitors utilization of the IO subsystems, CPU and processor buses.
Accelerators analysis:
• GPU Offload (preview) is targeted for applications using a Graphics Processing Unit (GPU) for rendering,
video processing, and computations. It helps you identify whether your application is CPU or GPU bound.
• GPU Compute/Media Hotspots (preview) is targeted for GPU-bound applications and helps analyze GPU
kernel execution per code line and identify performance issues caused by memory latency or inefficient
kernel algorithms.
• CPU/FPGA Interaction analysis explores FPGA utilization for each FPGA accelerator and identifies the most
time-consuming FPGA computing tasks.
Platform analysis:
• System Overview is a driverless event-based sampling analysis that monitors the general behavior of your
target system and identifies platform-level factors that limit performance.
• Platform Profiler analysis collects data on a deployed system running a full load over an extended period
of time with insights into overall system configuration, performance, and behavior. The collection is run on
a command prompt outside of VTune Profiler and results are viewed in a web browser.
NOTE
A PREVIEW FEATURE may or may not appear in a future production release. It is available for your
use in the hopes that you will provide feedback on its usefulness and help determine its future. Data
collected with a preview feature is not guaranteed to be backward compatible with future releases.
Advanced users can create a custom analysis using the data collectors provided by VTune Profiler, or by
combining a VTune Profiler collector with another custom collector.
Search Directories
Search directories are used to locate supporting files
and display analysis information in relation to your
source code.
In some cases, the Intel® VTune™ Profiler cannot locate the supporting user files necessary for displaying
analysis information, and you may need to configure additional search locations or override standard ones.
This is required for .exe projects on Windows* created outside of Microsoft Visual Studio*, where no
information about the project directory structure is available; for C++ projects with a third-party library for
which you wish to define binaries/sources; or for imported projects with data collected remotely. When you
run a remote data collection, the VTune Profiler copies binary files from the target system to the host by
default. You need to either copy symbol and source files to the host or mount a directory with these files.
VTune Profiler searches the directories in a particular order when finalizing the collected data. For the
VTune Profiler integrated into the Visual Studio IDE, the search directories are defined by the Microsoft Visual
Studio C++ project properties.
For successful module resolution, the VTune Profiler needs to locate the following files:
• binaries (executables and dynamic libraries)
• symbols
• source files
It automatically locates the files for C/C++ projects that are not moved after building the application and
collecting the performance data.
The Configure Analysis window opens.
2. Click the Search Sources/Binaries button at the bottom to open the corresponding dialog box and specify paths for symbol, binary, and source files for file resolution on the host.
3. To add a new search directory in the Search Directories table, click the <Add a new search location> row and type in the path and name of the directory in the activated text box, or click the browse button on the right to select a directory from the list. For example, if your project was initially located in /work/projects/my_project on Linux* and then was moved to /home/user/my_project_copy, you need to specify /home/user/my_project_copy as a search directory for binary/symbol and source files.
NOTE
The search is non-recursive. Make sure to specify all required directories.
If the search directories were not configured properly and modules were not resolved, you may see the
following:
• In the Summary window, you see a pop-up message starting with "Data is not complete due to missing
symbol information for user modules...". This pop-up window provides shortcut options to specify search
directories and re-resolve the analysis result.
• In the Bottom-up or Top-down Tree pane, the module shows only one [Unknown] line instead of
meaningful lines with function names.
• When you double-click a row to view the related source code, you get a Cannot find the source file
window asking you to locate the source file.
If the VTune Profiler cannot locate symbol files for system modules, it may provide incomplete stack
information in the Bottom-up/Top-down Tree panes and Call Stack pane. In this case, you may see
[Unknown frame(s)] hotspots when attributing system layers to user code using the Call Stack Mode
option on the filter toolbar. To avoid this for Windows targets, make sure to configure the Microsoft symbol
server or set the _NT_SYMBOL_PATH environment variable. For Linux targets, enable Linux kernel analysis.
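For example, a typical _NT_SYMBOL_PATH value that points at the Microsoft symbol server with a local download cache (the cache directory is illustrative):

REM C:\LocalSymbolCache is an illustrative cache location
set _NT_SYMBOL_PATH=srv*C:\LocalSymbolCache*http://msdl.microsoft.com/download/symbols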
See Also
Dialog Box: Binary/Symbol Search
Finalization
Search Order
When locating binary/symbol/source files, the Intel® VTune™ Profiler searches the following directories, in the
following order:
1. Directory <result dir>/all (recursively).
2. Additional search directories that you defined for this project in the VTune Profiler Binary/Symbol
Search dialog box.
3. For local collection, an absolute path.
For remote collection, the VTune Profiler searches its cache directory for modules copied from the
remote system or tries to get the module from the remote system using the absolute path.
For results copied from a different machine, make sure to copy all the necessary source, symbol, and
binary files required for result finalization.
• For binaries, the path is captured in the result data files.
• For symbol files, the path is referenced in the binary file.
• For source files, the path is referenced in the symbol file.
On Linux*, to locate the vmlinux file, the VTune Profiler searches the following directories:
• /usr/lib/debug/lib/modules/`uname -r`/vmlinux
• /boot/vmlinuz-`uname -r`
4. Search around the binary file.
   1. Search the directory of the corresponding binary file.
   2. On Windows*, search the directory of the corresponding binary file and alter the name of the symbol file holding the initial extension (for example, app.dll + app_x86.pdb -> app.pdb).
   3. On Linux, search the .debug subdirectory of the corresponding binary file directory.
5. On Windows, Microsoft Visual Studio* search directories. All directories are considered non-recursive. Directories may be specific to the build configuration and platform selected at the time of collection.
6. System directories.
On Windows:
• Binary files: %SYSTEMROOT%\system32\drivers (non-recursively)
• Symbol files:
• All directories specified in the _NT_SYMBOL_PATH environment variable (non-recursively). Symbol
server paths are possible here as well as in step 2.
• srv*%SYSTEMROOT%\symbols (treated as a symbol server path)
• %SYSTEMROOT%\symbols\dll (non-recursively)
On Linux:
• Binary files: If the file to search is a bare name only (no full path, no extension), the .ko extension is appended before searching in the following directories:
   1. /lib/modules (non-recursively)
   2. /lib/modules/`uname -r`/kernel (recursively)
• Symbol files:
• /usr/lib/debug (non-recursively)
• /usr/lib/debug with appended path to the corresponding binary file (for example, /usr/lib/
debug/usr/bin/ls.debug)
• Source files:
• /usr/src (non-recursively)
• /usr/src/linux-headers-`uname -r` (non-recursively)
If the VTune Profiler cannot find a file that is necessary for a certain operation, such as viewing source, it
brings up a window enabling you to enter the location of the missing file.
NOTE
VTune Profiler automatically applies recursive search to the <result dir>/all directory and some
system directories (Linux only). Additional directories you specify in the project configuration are
searched non-recursively.
1. For non-recursive directories, the VTune Profiler searches paths by merging the parts of the
file path with the specified directory iteratively. For example, for the /aaa/bbb/ccc/
filename.ext file on Linux:
/specified/search/directory/aaa/bbb/ccc/filename.ext
/specified/search/directory/bbb/ccc/filename.ext
/specified/search/directory/ccc/filename.ext
/specified/search/directory/filename.ext
2. For recursive directories, the VTune Profiler searches the same paths as for the non-recursive
directory and, in addition, paths in all sub-directories up to the deepest available level. For
example:
/specified/search/directory/subdir1/filename.ext
/specified/search/directory/subdir1/sub…subdir1/filename.ext
...
/specified/search/directory/subdir1/sub…subdirN/filename.ext
...
/specified/search/directory/subdirN/filename.ext
3. For symbol server paths on Windows, the symsrv.dll shipped with the product is used. Custom
symsrv.dll libraries are not supported.
See Also
Search Directories
Set Up Analysis Target
Target Platform:
• Linux* OS
• Windows* OS
• Android* OS
• FreeBSD*
• QNX*
• Intel® Xeon Phi® processors (code name: Knights Landing)
Programming Language:
• C/C++
• DPC++
• Fortran
• C# (Windows Store applications)
• Java*
• JavaScript
• Python*
• Go*
• .NET*
• .NET Core
Virtual Environment:
• VMWare*
• Parallels*
• KVM*
• Hyper-V*
• Xen*
1. Click the New Project button on the toolbar to create a new project.
If you need to re-configure the target for an existing project, click the Configure Analysis
toolbar button.
The Configure Analysis window opens. By default, the project is pre-configured to run the
Performance Snapshot analysis.
2. If you do not run an analysis on the local host, expand the WHERE pane and select an appropriate
target system.
The target system can be the same as the host system, which is a system where the VTune Profiler GUI
is installed. If you run an analysis on the same system where the VTune Profiler is installed (i.e. target
system=host system), such a target system is called local. Target systems other than local are called
remote systems. But both local and remote systems are accessible targets, which means you can
access them either directly (local) or via a connection (for example, SSH connection to a remote
target).
NOTE
This type of the target system is not available for macOS*.
Remote Linux (SSH): Run an analysis on a remote regular or embedded Linux* system. VTune Profiler uses the SSH protocol to connect to your remote system. Make sure to fill in the SSH Destination field with the username, hostname, and port (if required) for your remote Linux target system as username@hostname[:port].
Android Device (ADB): Run an analysis on an Android device. VTune Profiler uses the Android Debug Bridge* (adb) to connect to your Android device. Make sure to specify an Android device targeted for analysis in the ADB Destination field. When the ADB connection is set up, the VTune Profiler automatically detects available devices and displays them in the menu.
Arbitrary Host (not connected): Create a command line configuration for a platform NOT accessible from the current host, which is called an arbitrary target.
3. From the WHAT pane, specify an application to launch or click the Browse button to select a different target type:
Launch Application (pre-selected): Enable the Launch Application pane and choose and configure an application to analyze, which can be either a binary file or a script.
NOTE
This target type is not supported for the Hotspots analysis of Android applications. Use the Attach to Process or Launch Android Package types instead.
Attach to Process: Enable the Attach to Process pane and choose and configure a process to analyze.
Profile System: Enable the Profile System pane and configure the system-wide analysis that monitors all the software executing on your system.
Launch Android Package: Enable the Launch Android Package pane to specify the name of the Android* package to analyze and configure target options.
NOTE
• If you use VTune Profiler as a web server, the list of available targets and target systems differs.
• For driverless event-based sampling data collection, VTune Profiler supports local and remote
Launch Application, Attach to Process, and Profile System target types, but their support fully
depends on the Linux Perf profiling credentials specified in the /proc/sys/kernel/
perf_event_paranoid file and managed by the administrator of your system using root
credentials. For more information, see the perf_event related configuration files topic at http://
man7.org/linux/man-pages/man2/perf_event_open.2.html. By default, only profiling of user
processes in both user and kernel spaces is permitted, so you need to grant wider profiling
credentials via the perf_event_paranoid file to employ the Profile System target type.
What's Next
Once you have specified the analysis system and target, you can either click the Start button to run
Performance Snapshot or click the analysis name in the analysis header to choose a different analysis type.
See Also
Analysis System Options
target-system
vtune option
Arbitrary Targets (not connected)
Collect Data on Remote Linux* Systems from Command Line
• Do This:
Build your application in Release mode, with maximum appropriate compiler optimization level.
Because:
• This eliminates performance issues that can be resolved by compiler optimizations, enabling you to
focus on bottlenecks that require your attention.
• Do This:
Generate debug information for your application, and, if possible, download debug information for any
third-party libraries it uses.
Because:
• This enables source-level analysis: view problematic source lines right in VTune Profiler.
• This enables resolution of function names and proper call stack information.
• By default, most compilers/IDEs do not generate debug information in Release mode.
NOTE The /O2 flag is a recommendation to ensure you are profiling the Release version of your
application with optimizations that favor speed enabled. If the production use of your application calls
for a different optimization level, use your required level. The key idea is to profile your application
when it is compiled as close to production use as possible.
• The /Zi and /DEBUG flags enable generation of debug info in the Program Database (PDB) format.
Follow these steps to configure the optimization level and debug information generation in Microsoft Visual
Studio*:
1. Enable Release build configuration:
a. On the Visual Studio toolbar, from the Solution Configuration drop-down list, select Release.
This also enables the /O2 optimization level. To check, right-click on your project and open
Properties > C/C++ > Optimization.
2. Enable Debug information generation:
a. Right-click your project and select the Properties item in the context menu.
The Property Pages dialog opens.
b. Make sure the Release configuration is selected in the Configuration drop-down list.
c. From the left pane, select C++ > General.
d. In the Debug Information Format field, choose Program Database (/Zi).
e. From the left pane, select Linker > Debugging.
f. In the Generate Debug Info field, select Generate Debug Information (/DEBUG).
g. Click OK to save your changes and close the dialog box.
These steps cover the most important compiler switches that apply to all C++ applications.
Additional compiler switches are recommended for applications that use OpenMP* or Intel® oneAPI Threading
Building Blocks for threading. See the Compiler Switches for Performance Analysis on Windows* Targets topic
for more information.
Once you have the debug information, make sure to set the Search Directories to point VTune Profiler to the
PDB and source files.
-O2 -g
• The -O2 flag enables compiler optimizations that favor speed.
NOTE The -O2 flag is a recommendation to ensure you are profiling the Release version of your
application with optimizations that favor speed enabled. If the production use of your application calls
for a different optimization level, use your required level. The key idea is to profile your application
when it is compiled as close to production use as possible.
On Linux, VTune Profiler requires debug information in the DWARF format to enable source and call stack
analysis.
The -g option usually produces debugging information in the DWARF format. If you are having trouble
generating debug information in the DWARF format, see Debug Information for Linux Binaries.
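For instance, a minimal compile line that combines these flags (the compiler and file names are illustrative; any compiler that accepts -O2 and -g works similarly):

# my_app.cpp is an illustrative source file name
g++ -O2 -g -o my_app my_app.cpp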
These steps cover the most important compiler switches that apply to all C++ applications.
Additional compiler switches are recommended for applications that use OpenMP* or Intel® oneAPI Threading
Building Blocks for threading. See the Compiler Switches for Performance Analysis on Linux* Targets topic for
more information.
Once you have the debug information, make sure to set the Search Directories to point VTune Profiler to the
binary and source files.
Windows* Targets
Use the Intel® VTune™ Profiler for the performance
analysis of Windows* targets.
You can right-click the project and select the Configure Analysis option from the menu to verify the target
properties. By default, the target type is set to Launch Application.
To choose an existing standalone executable file:
1. From the Visual Studio menu, choose File > Open > Project/Solution.
The Open Project dialog box opens.
2. Select the Executable Files (*.exe) filter and choose an executable file.
Visual Studio software creates a solution with a single project that contains your executable file. VTune
Profiler features are enabled.
3. Right-click the project and select the Intel VTune Profiler <version> > Configure Analysis... option.
The Configure Analysis window opens.
4. Click the Binary/Symbol Search or Source Search button at the bottom to specify search
directories. By default, the search directories are defined by the Microsoft Visual Studio* C++ project
properties. To view default project search directories for system functions in Visual Studio, right-click
the project in the Solution Explorer and select Properties.
When finalizing the collected data, the VTune Profiler uses these directories to search for binary
(executables and dynamic libraries), symbol (typically .pdb files), and source files supporting your
target in the particular order. VTune Profiler automatically locates the files for C/C++ projects which are
not moved after building the application and collecting the performance data.
5. Save the solution.
NOTE
Different versions of Visual Studio may have different user interface elements. Refer to the Visual
Studio online help for the exact user interface elements that you need to view file location.
When done with the configuration, click the Browse button on the HOW pane on the right to select
and run an analysis type.
See Also
Analysis Target Options
Analyze Performance
Search Directories
NOTE
To install the drivers on Windows* 7 (deprecated) and Windows* Server 2008 R2 operating systems,
you must enable the SHA-2 code signing support for these systems by applying Microsoft Security
update 3033929. If the security update is not installed, event-based sampling analysis types will not
work properly on your system.
To verify the sampling driver is installed correctly on a Microsoft Windows* OS, open the command prompt
as an administrator and run the amplxe-sepreg.exe utility located at <install-dir>/bin64.
To make sure your system meets all the requirements necessary for the hardware event-based sampling
collection, enter:
amplxe-sepreg.exe -c
This command performs the following dependency checks required to install the sampling driver:
• platform, architecture, and OS environment
• availability of the sampling driver binaries: sepdrv4_x.sys, socperf2_x.sys, and sepdal.sys
• administrative privileges
• 32/64-bit installation
To check whether the sampling driver is loaded, enter:
amplxe-sepreg.exe -s
If the sampling driver is not installed but the system is supported by the VTune Profiler, execute the following
command with the administrative privileges to install the driver:
amplxe-sepreg.exe -i
NOTE
If you configured Visual Studio to generate debug information for your files, you cannot "fix" previous
results because the executable and the debug information do not match the executable you used to
collect the old results.
To generate a native .PDB file for a native image of a .NET* managed assembly, use the Native Image
Generator tool (Ngen.exe) from the .NET Framework. Make sure the search directories specified in the
Binary/Symbol Search dialog box include the path to the generated .pdb file.
See Also
Debug Information for Windows* System Libraries
Search Directories
/Zi (highly recommended): Enable generating the symbol information required to associate addresses with source lines and to properly walk the call stack in user-mode sampling and tracing analysis types (Hotspots and Threading).
Release build (highly recommended): Enable maximum compiler optimization to focus VTune Profiler on performance problems that cannot be optimized with the compiler.
/MD or /MDd: Enable identifying the C runtime calls as system functions and differentiating them from the user code when a proper Call stack mode is applied to the VTune Profiler collection result.
/D "TBB_USE_THREADING_TOOLS": Enable full support for Intel® oneAPI Threading Building Blocks (oneTBB) in VTune Profiler. Without TBB_USE_THREADING_TOOLS set, the VTune Profiler will not properly identify concurrency issues related to using Intel TBB constructs.
/Qopenmp (highly recommended) (Intel C++ Compiler): Enable the VTune Profiler to identify parallel regions due to OpenMP* pragmas.
/Qopenmp-link:dynamic (Intel C++ Compiler): Enable the Intel Compiler to choose the dynamic version of the OpenMP runtime libraries, which has been instrumented for the VTune Profiler. Usually, this option is enabled for the Intel Compiler by default.
-gline-tables-only -fdebug-info-for-profiling (Intel oneAPI DPC++ Compiler): Enable generating debug information for GPU analysis of a DPC++ application.
-Xsprofile (Intel oneAPI DPC++ Compiler): Enable source-level mapping of performance data for FPGA application analysis.
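For example, a minimal Microsoft compiler command line that combines the basic recommended switches from the tables above (the source file name is illustrative; add the OpenMP and oneTBB switches if your application needs them):

REM my_app.cpp is an illustrative source file name; /DEBUG is passed to the linker
cl /Zi /O2 /MD my_app.cpp /link /DEBUG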
Explore the list of libraries recommended or not recommended for the user-mode sampling and tracing
analysis types:
debug:parallel: Enables the Intel® Parallel Debugger Extension for the Intel Compiler, which is not used for the VTune Profiler.
/Qopenmp-link:static: Chooses the static version of the OpenMP runtime libraries for the Intel Compiler. This version of the OpenMP runtime library does not contain the instrumentation data required for the VTune Profiler analysis.
Performance varies by use, configuration and other factors. Learn more at www.Intel.com/
PerformanceIndex.
Notice revision #20201201
See Also
Debug Information for Windows* Application Binaries
NOTE
VTune Profiler does not automatically search the Microsoft symbol server for debug information for
system files since this functionality:
• Requires an internet connection. Some users are collecting and viewing results on isolated lab
systems and do not have internet access.
• Adds an overhead to finalization of the collection results. For each module without debug
information on the local system, a request goes out to the symbol server. If symbols are available,
additional time is required to download the symbol file.
• Uses additional disk space. If symbols for system modules are not used, this disk space is wasted.
• May be unwanted. Many users do not need to examine details of time spent in system calls and
modules. Automatically downloading symbols for system files would be wasteful in this case.
Configure the Microsoft* Symbol Server from Visual Studio* IDE
NOTE
The instructions below refer to the Microsoft Visual Studio* 2015 integrated development environment
(IDE). They may slightly differ for other versions of Visual Studio IDE.
provided by default, or click the button and add the following address to the list: http://
msdl.microsoft.com/download/symbols.
5. Make sure the added address is checked.
6. In the Cache symbols in this directory field, specify the directory where the downloaded symbol files
will be stored.
NOTE
If you plan to download symbols from the Microsoft symbol server only once and then use local
storage, use the following syntax for the cache directory: srv*<local_dir>. For example:
srv*C:\Windows\symbols.
NOTE
If you use the symbol server, the finalization process may take a long time to complete the first time
the VTune Profiler downloads the debug information for system libraries to the local directory specified
in the Options (for example, C:\Windows\symbols). Subsequent finalizations should be faster.
Configure the Microsoft Symbol Server from the VTune Profiler Standalone GUI
1.
NOTE
If you specify different directories for different projects, the files will be downloaded multiple times,
adding unwanted overhead. If you have a Visual Studio project that defines a cache directory for the
symbol server, use the same directory in the standalone VTune Profiler so that you do not waste time
and space downloading symbols that already exist in a cache directory.
See Also
Highly Accurate CPU Time Data Collection
Linux* Targets
Use the Intel® VTune™ Profiler for performance
analysis on local and remote Linux* target systems.
To analyze your Linux target, do the following:
1. Prepare your target application for analysis:
• Enable downloading debug information for system kernels by installing debug info packages
available for your system version.
• Enable generating debug information for the application binaries by using the -g option when
compiling your code. Consider using the recommended compiler settings to make the performance
analysis more effective.
• Build your target in the Release mode.
• Create a baseline against which you can compare the performance improvements as a result of
tuning.
For example, you instrument your code to determine how long it takes to compress a certain file.
Your original target code, augmented to provide these timing data, serves as your performance
baseline. Every time you modify your target, compare the performance metrics of your optimized
target with the baseline, to verify that the performance has improved.
2. Prepare your target system for analysis:
• Build and install the sampling drivers, if required.
NOTE
• If the drivers were not built and set up during installation (for example, lack of privileges, missing
kernel development RPM, and so on), VTune Profiler provides an error message and enables
driverless sampling data collection based on the Linux Perf* tool functionality, which has a limited
scope of analysis options.
• On Ubuntu* systems, VTune Profiler may fail to collect Hotspots and Threading analysis data if the
scope of the ptrace() system call is limited.
To work around this issue for one session, set the value of the kernel.yama.ptrace_scope sysctl
option to 0 with this command:
sysctl -w kernel.yama.ptrace_scope=0
To make this change permanent, see the corresponding Troubleshooting topic (a minimal sketch also follows the numbered steps below).
• For remote analysis, configure SSH connection and set up your remote Linux system depending on
the analysis usage mode.
3. Create a VTune Profiler project and run the performance analysis of your choice.
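The permanent ptrace change mentioned in the note above is typically persisted through a sysctl configuration file; the exact steps are in the Troubleshooting topic, but a minimal sketch (the file name is an example) looks like this:
echo 'kernel.yama.ptrace_scope = 0' | sudo tee /etc/sysctl.d/10-ptrace.conf
sudo sysctl --system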
Ubuntu* Systems
See Also
Compiler Switches for Performance Analysis on Linux* Targets
Sampling driver install type [ build driver (default) / driver kit files only ]: Choose the driver installation option. By default, VTune Profiler uses the Sampling Driver Kit to build the driver for your kernel. You may change the option to driver kit files only if you want to build the driver manually after installation.

Driver access group [ vtune (default) ]: Set the driver access group ownership to determine which set of users can perform the collection on the system. By default, the group is vtune. Access to this group is not restricted. To restrict access, see the Driver permissions option below. You may set your own group during installation in the Advanced options or change it manually after installation by executing ./boot-script --group <your_group> from the <install-dir>/sepdk/src directory.
Driver permissions [ 660 (default) ]: Change permissions for the driver. By default, only a vtune group user can access the driver. Using this access, the user can profile the system, an application, or attach to a process.

Load driver [ yes (default) ]: Load the driver into the kernel.

Install boot script [ yes (default) ]: Use a boot script that loads the driver into the kernel each time the system is rebooted. The boot script can be disabled later by executing ./boot-script --uninstall from the <install-dir>/sepdk/src directory.

Enable per-user collection mode [ no (default) / yes ]: Install the hardware event-based collector driver with the per-user filtering on. When the filtering is on, the collector gathers data only for the processes spawned by the user who started the collection. When it is off (default), samples from all processes on the system are collected. Consider using the filtering to isolate the collection from other users on a cluster for security reasons. The administrator/root can change the filtering mode by rebuilding/restarting the driver at any time. A regular user cannot change the mode after the product is installed.
NOTE
For MPI application analysis on a Linux* cluster, you may enable the Per-
user Hardware Event-based Sampling mode when installing the Intel
Parallel Studio XE Cluster Edition. This option ensures that during the
collection the VTune Profiler collects data only for the current user. Once
enabled by the administrator during the installation, this mode cannot be
turned off by a regular user, which is intentional to preclude individual
users from observing the performance data over the whole node including
activities of other users.
After installation, you can use the respective vars.sh files to set
up the appropriate environment (PATH, MANPATH) in the current
terminal session.
Driver build options …: Specify the location of the kernel header files on this system, the path and name of the C compiler to use for building the driver, and the path and name of the make command to use for building the driver.
If drivers are loaded, but you are not a member of the group listed in the query output, request your
system administrator to add you to the group. By default, the driver access group is vtune. To check
which groups you belong to, type groups at the command line. This is only required if the permissions
are other than 666.
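As an assumed example (the user name is a placeholder), an administrator can add a user to the default vtune group with usermod; the user then needs to log out and back in for the new group membership to take effect:
groups
sudo usermod -aG vtune <user_name>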
NOTE
If there is no collection in progress, there is no execution time overhead of having the driver loaded
and very little overhead for memory usage. You can let the system module be automatically loaded at
boot time (for example, via the install-boot-script script, used by default). Unless the data is
being collected by the VTune Profiler, there will be no latency impact on the system performance.
NOTE
If the current version of the sampling driver that is shipped with the VTune Profiler installation does
not suit your needs, for example, due to a recent change in the Linux* kernel, you can find the latest
version of the sampling driver on the Sampling Driver Downloads page.
• $ ./build-driver
The script prompts the build option default for your local system.
• $ ./build-driver -ni
The script builds the driver for your local system with default options without prompting for your input.
• $ ./build-driver -ni -pu
The script builds the driver with the per-user event-based sampling collection enabled, without prompting
for your input.
• $ ./build-driver -ni \
--c-compiler=i586-i586-xxx-linux-gcc \
--kernel-version="<kernel-version>" \
--kernel-src-dir=<kernel-source-dir> \
--make-args="PLATFORM=x32 ARITY=smp" \
--install-dir=<path>
The script builds the drivers with a specified cross-compiler for a specific kernel version. This is usually
used for the cross-build for a remote target system on the current host. This example uses the following
options:
• -ni disables the interactive mode during the build.
• --c-compiler specifies the cross build compiler. The compiler should be available from the PATH
environment. If the option is not specified, the host GCC compiler is used for the build.
• --kernel-version specifies the kernel version of the target system. It should match the uname -r
output of your target system and the UTS_RELEASE in kernel-src-dir/include/generated/
utsrelease.h or kernel-src-dir/include/linux/utsrelease.h, depending on your kernel
version.
• --kernel-src-dir specifies the kernel source directory.
• --make-args specifies the build arguments. For a 32-bit target system, use PLATFORM=x32. For a 64-bit target system, use PLATFORM=x32_64.
• --install-dir specifies the path to a writable directory where the drivers and scripts are copied after
the build succeeds.
Use ./build-driver -h to get the detailed help message on the script usage.
To build the sampling driver as RPM using build services such as Open Build Service (OBS), use the sepdk.spec file located in the <install-dir>/sepdk/src directory.
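A minimal sketch of such a build on a local system with rpmbuild (depending on your rpmbuild configuration, additional options may be needed to point it at the sources):
$ cd <install-dir>/sepdk/src
$ rpmbuild -ba sepdk.spec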
$ cd <install_dir>/sepdk/src
$ ./insmod-sep -r -g <group>
where <group> is the group of users that have access to the driver.
To install the driver that is built with the per-user event-based sampling collection on, use the -pu (per-user) option as follows:
$ ./insmod-sep -g <group> -pu
If you are running on a resource-restricted environment, add the -re option as follows:
$ ./insmod-sep -re
2. Enable the Linux system to automatically load the drivers at boot time:
$ cd <install_dir>/sepdk/src
$ ./boot-script --install -g <group>
The -g <group> option is only required if you want to override the group specified when the driver was
built.
To remove the driver on a Linux system, run:
./rmmod-sep -s
If a module is not found or the name of a function cannot be resolved, the VTune Profiler displays module
identifiers within square brackets, for example: [module].
If the debug information is absent, the VTune Profiler may not unwind the call stack and display it correctly in
the Call Stack pane. Additionally in some cases, it can take significantly more time to finalize the results for
modules that do not have debug information.
If DWARF is not the default debugging information format for the compiler, or if you are using MinGW/Cygwin GCC*, use the -gdwarf-version option, for example: -gdwarf-2 or -gdwarf-3.
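For example, a typical GCC command line that requests DWARF version 3 debug information together with optimization might look like this (file names are placeholders):
gcc -g -gdwarf-3 -O2 -o myapp myapp.c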
NOTE When using the Intel Fortran compiler to compile OpenMP Offload code, make sure to use the -debug offload option.
See Also
Compiler Switches for Performance Analysis on Linux* Targets
Search Directories
-g (highly recommended): Enable generating the symbol information required to associate addresses with source lines and to properly walk the call stack in user-mode sampling and tracing collection types (Hotspots and Threading).

Release build or -O2 (highly recommended): Enable maximum compiler optimization to focus the VTune Profiler on real performance problems that cannot be optimized with the compiler.

-shared-intel (Intel® C++ Compiler), -shared-libgcc (GCC* Compiler): Enable identifying the libm and C runtime calls as system functions and differentiating them from the user code when a proper filter mode is applied to the VTune Profiler collection result.

-debug inline-debug-info (Intel C++ Compiler): Enable the VTune Profiler to identify inline functions and, according to the selected inline mode, associate the symbols for an inline function with the inline function itself or its caller. This is the default mode for GCC* 4.1 and higher.
NOTE The -debug inline-debug-info option is enabled by default for the Intel® oneAPI DPC++/C++ Compiler if you compile with optimizations (-O2 or higher) and debug information (the -g option).

-D TBB_USE_THREADING_TOOLS: Enable Intel® oneAPI Threading Building Blocks (oneTBB) analysis for the VTune Profiler. This macro is automatically set if you compile with -D_DEBUG or -DTBB_USE_DEBUG. Without TBB_USE_THREADING_TOOLS set, the VTune Profiler will not properly identify concurrency issues related to using oneTBB constructs.

-qopenmp (highly recommended) (Intel C++ Compiler): Enable the VTune Profiler to identify parallel regions due to OpenMP* pragmas.

-qopenmp-link dynamic (Intel C++ Compiler): Enable the Intel Compiler to choose the dynamic version of the OpenMP runtime libraries, which has been instrumented for the VTune Profiler. Usually, this option is enabled for the Intel Compiler by default.

--info-for-profiling (Intel oneAPI DPC++ Compiler, Intel Fortran Compiler): Enable generating debug information for GPU analysis of a SYCL application. Generate debug information for OpenMP* Offload applications compiled by the Intel Fortran compiler.
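As a sketch only, a release-style build with the Intel C++ Compiler that combines several of the recommended switches above could look like the following, with the application name as a placeholder:
icc -g -O2 -qopenmp -shared-intel -debug inline-debug-info -o myapp myapp.cpp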
-Xsprofile (Intel oneAPI DPC++ Compiler): Enable source-level mapping of performance data for FPGA application analysis.

Avoid the following switches, which interfere with performance analysis:

Debug build or -O0: Changes the performance of your application compared to a release build and may dramatically impact the performance profiling, potentially causing you to analyze and attempt optimization on a section of code that is not a performance problem in the release build.

-static, -static-libgcc: Prevents the VTune Profiler from being able to run the user-mode sampling and tracing analysis types. See below for more details.
NOTE When you specify the -fast switch with the Intel Compiler, it automatically enables -static.

-static-intel: Prevents the user-mode sampling and tracing analysis types from distinguishing system functions properly. This is the default option for the Intel Compiler.

-qopenmp-link static: Chooses the static version of the OpenMP runtime libraries for the Intel Compiler. This version of the OpenMP runtime library does not contain the instrumentation data required for the VTune Profiler analysis.

-msse4a, -m3dnow: Generates binaries that use instructions not supported by Intel processors, which may cause unknown behavior when profiling with the VTune Profiler.

-debug [parallel | extended | emit-column | expr-source-pos | semantic-stepping | variable-locations]: VTune Profiler works best with -debug full (the default mode when using -g). Other options, including parallel, extended, emit-column, expr-source-pos, semantic-stepping, and variable-locations, are not supported by the VTune Profiler. See -debug inline-debug-info for more information.

-coarray: Prevents the Threading analysis from properly identifying the locks that disable scaling in Coarray Fortran.
NOTE
There are other options that may add frame pointers to your binary as a side effect, for example: -
fexceptions (default for C++) or -O0. To make sure the executable (and shared libraries) have this
information, use the objdump -h <binary> command and make sure you see the .eh_frame_hdr
section there.
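For example, a quick check might be (the binary name is a placeholder):
objdump -h ./myapp | grep eh_frame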
User-mode sampling and tracing analysis types work better with dynamic versions of the following libraries:
User-mode sampling and tracing collection has the following limitations for analyzing statically linked
libraries/functions:
• The static version of the OpenMP runtime library supplied by the Intel Compiler does not provide the
necessary instrumentation for the Threading analysis type.
• Call Stack mode cannot properly distinguish user code from system functions.
• User-mode sampling and tracing collection cannot execute unless various C Runtime functions are
exported. There are multiple ways to do this; for example, use the -u command of the GCC compiler:
• -u malloc
• -u free
• -u realloc
• -u getenv
• -u setenv
• -u __errno_location
If your application creates Posix threads (either explicitly or via the static OpenMP library or some other
static library), you need to explicitly define the following additional functions:
• -u pthread_key_create
• -u pthread_key_delete
• -u pthread_setspecific
• -u pthread_getspecific
• -u pthread_spin_init
• -u pthread_spin_destroy
• -u pthread_spin_lock
• -u pthread_spin_trylock
• -u pthread_spin_unlock
• -u pthread_mutex_init
• -u pthread_mutex_destroy
• -u pthread_mutex_trylock
• -u pthread_mutex_lock
• -u pthread_mutex_unlock
• -u pthread_cond_init
• -u pthread_cond_destroy
• -u pthread_cond_signal
• -u pthread_cond_wait
• -u _pthread_cleanup_push
• -u _pthread_cleanup_pop
• -u pthread_setcancelstate
• -u pthread_self
• -u pthread_yield
The easiest way to do this is by creating a file with the above options and passing it to gcc or ld. For
example:
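A hedged sketch (the file name u_options.txt and the application name are placeholders): put the required -u options into a plain text file, one option per line (-u malloc, -u free, -u realloc, -u getenv, -u setenv, -u __errno_location, plus the pthread symbols if needed), and pass it to GCC with the @file response-file syntax:
gcc -g -O2 -static @u_options.txt -o myapp myapp.c -lpthread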
See Also
Compiler Switches for Performance Analysis on Windows* Targets
sysctl -w kernel.kptr_restrict=0
NOTE
To enable kernel profiling without the Intel Sampling Driver via perf, set the perf_event_paranoid
value to <= 1. See the Linux kernel documentation for details.
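For example, to lower the value for the current boot session:
sysctl -w kernel.perf_event_paranoid=1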
To resolve symbols for the Linux kernel, the VTune Profiler also uses the System.map file created during the
kernel build and shipped with the system by default. If the file is located in a non-default directory, you may
add it to the list of search directories in the Binary/Symbol Search dialog box when configuring your target
properties.
NOTE
The settings in the /proc/kallsyms and System.map file enable the VTune Profiler to resolve kernel
symbols and view kernel functions and kernel stacks but do not enable the assembly analysis.
• Browse the OS vendor FTP site and download the packages. For example, look at ftp://ftp.redhat.com/pub/redhat/linux/enterprise/5Server/en/os to get packages for Red Hat* Enterprise Server.
• Look for other sources on the internet. For example, for Red Hat Enterprise* Linux 3, 4 and 5
distros, Red Hat provides debuginfo RPMs at http://people.redhat.com/duffy/debuginfo/. After
installing the RPM, the debug version of the kernel file is located under /usr/lib/debug/boot (EL
3) or /usr/lib/debug/lib/modules (EL 4, 5).
3. Use the following commands to install the RPMs:
CFLAGS_KERNEL := -g
CFLAGS := -g
3. Run make clean; make to create the vmlinux kernel file with debug information. Once a debug
version of the kernel is created or obtained, specify that kernel file as the one to use during
performance analysis.
As soon as the debug information is available for your kernel modules, any future analysis runs will display
the kernel functions appropriately. To resolve the previously collected data against this new symbol
information, update the project Search Directories and click the Re-resolve button to apply the changes.
Limitations
When you collect data on a remote Linux system, VTune Profiler does not read /sys/module/<module-name>/sections/* when resolving the results. In this case, to resolve symbols properly:
1. Copy the <module-name>/sections folder manually from the target system to ../<parent
directory>/<module-name>/sections on the host system.
2. Add <parent directory> to VTune search directories for binary and symbol files.
See Also
Compiler Switches for Performance Analysis on Linux* Targets
Search Directories
• pthread_create()
• pthread_key_create()
• pthread_setspecific()
• pthread_getspecific()
• pthread_self()
• pthread_getattr_np()
• pthread_attr_destroy()
• pthread_attr_setstack()
• pthread_attr_getstack()
• pthread_attr_getstacksize()
• pthread_attr_setstacksize()
• If a target employs pthread_cancel() API, it exports the following symbols:
• pthread_cancel()
• _pthread_cleanup_push()
• _pthread_cleanup_pop()
• If a target employs _pthread_cleanup_push() or _pthread_cleanup_pop() API, it exports the
following symbols:
• _pthread_cleanup_push()
• _pthread_cleanup_pop()
• If a target employs the pthread_mutex_lock() API, it exports the pthread_mutex_lock() and pthread_mutex_trylock() symbols.
• If a target employs the pthread_spin_lock() API, it exports the pthread_spin_lock() and pthread_spin_trylock() symbols.
• libdl.so:
If a target employs any of dlopen(), dlsym(), or dlclose() APIs, it exports all three of them
simultaneously.
If the binary file does not export some of the symbols above, use the -u linker switch (for example, specify -Wl,-u__errno_location if you use the compiler for linking) to include symbols into the binary file at the linking stage of compilation.
See Also
Compiler Switches for Performance Analysis on Linux* Targets
1. Install VTune Profiler: Install the full-scale VTune Profiler product on the host system.
2. Prepare your target system for analysis:
   1. Set up a password-less SSH access to the target using RSA keys.
   2. Install the VTune Profiler target package with data collectors on the target Linux system.
   NOTE
   If you choose to install the target package to a non-default location, make sure to specify the correct path either with the VTune Profiler installation directory on the remote system option in the WHERE pane (GUI) or with the -target-install-dir option (CLI).
   3. Build the drivers on the host (if required), copy them to the target system and install the drivers.
   NOTE
   To build the sampling driver as RPM using build services such as Open Build Service (OBS), use the sepdk.spec file located in the <install_dir>/sepdk/src directory.
3. Configure and run remote analysis:
   1. On your host system, open the VTune Profiler GUI and select Configure Analysis.
   2. In the Where pane, specify an SSH connection to a remote Linux system.
   3. In the What pane, specify your target application on the remote system. Make sure to specify search directories for symbol/source files required for finalization on the host.
   4. In the How pane, choose and configure an analysis type.
   5. Start the analysis.
   VTune Profiler launches your application on the target, collects data, copies the analysis result and binary files to the host, and finalizes the data.
In the native usage mode, workflow steps to configure and run analysis on a remote system are similar to
the remote collectors mode.
NOTE
VTune Profiler also provides the sepdk sources for building sampling drivers. This source code may be the same as the source code provided in the SEP package, if the VTune Profiler uses the same driver as SEP. The VTune Profiler sepdk sources also include the event-based stack sampling data collector that is not part of the SEP package.
See Also
Collect Data on Remote Linux* Systems from Command Line
NOTE
The automatic installation on the remote Linux system does not build the sampling drivers although
you can install the pre-built sampling drivers if you connect via password-less SSH as the root user.
Driverless sampling data collection is based on the Linux Perf* tool functionality, which is available
without Root access and has a limited scope of analysis options. To collect advanced hardware event-
based sampling data, manually install the sampling driver or set up the password-less SSH connection
with the Root user account.
Press the Deploy button to start the automatic collectors package deployment process.
If the collectors are not automatically installed or you get an error message after an automatic install
attempt, you can install the collectors manually.
NOTE
Use both *_x86 and *_x86_64 packages if you plan to run and analyze 32-bit processes on 64-bit
systems.
2. On the target device, unpack the product package to the /tmp directory or another writable location on
the system:
target> tar -zxvf <target_package>.tgz
The VTune Profiler target package is located in the newly created directory /tmp/vtune_profiler_<version>.<package_num>.
When collecting data remotely, the VTune Profiler looks for the collectors on the target device in their default location: /tmp/vtune_profiler_<version>.<package_num>. It also temporarily stores performance results on the target system in the /tmp directory. If you installed the target package to a different location or need to specify another temporary directory, make sure to configure your target properties in the Configure Analysis window as follows:
• Use the VTune Profiler installation directory on the remote system option to specify the path to the VTune Profiler on the remote system. If the default location is used, the path is provided automatically.
• Use the Temporary directory on the remote system option to specify a non-default temporary directory.
Alternatively, use the -target-install-dir and -target-tmp-dir options from the vtune command line.
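As a sketch of the equivalent command line (host name, directories, and the analyzed application are placeholders), remote collection over SSH could look like this:
vtune -collect hotspots -target-system=ssh:user@target-host -target-install-dir=/tmp/vtune_profiler_<version>.<package_num> -target-tmp-dir=/var/tmp -- /home/user/myapp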
NOTE
Building the sampling drivers is only required if the drivers were not built as part of the collector
installation. The installation output should inform you if building the sampling driver is required.
NOTE
• Make sure kernel headers correspond to the kernel version running on the device. For details, see
the README.txt file in the sepdk/src directory.
• Make sure the compiler version corresponds to the architecture (x86 or x86_64) of the kernel running on the target system.
• For Hotspots in hardware event-based sampling mode, Microarchitecture Exploration, and Custom event-based sampling analysis types, you may not need root credentials or the sampling driver on systems with kernel 2.6.32 or higher, which export CPU PMU programming details through the /sys/bus/event_source/devices/cpu/format file. Your operating system limits on the maximum number of files opened by a process, as well as the maximum memory mapped to a process address space, still apply and may affect profiling capabilities. These capabilities are based on Linux Perf* functionality, and all its limitations fully apply to the VTune Profiler as well. For more information, see the Tutorial: Troubleshooting and Tips topic at https://perf.wiki.kernel.org/index.php/Main_Page.
NOTE
To build the sampling driver as RPM using build services such as Open Build Service (OBS), use the sepdk.spec file located in the <install-dir>/sepdk/src directory.
NOTE
A root connection is required to load the sampling drivers and to collect certain performance metrics.
You (or your administrator) can configure the system using root permissions and then set up
password-less SSH access for a non-root user if desired. For example, build and load the sampling
drivers on the target system using root access and then connect to the system and run analysis as a
non-root user. If you set up access without using the sampling drivers, then driverless event-based
sampling can still be used.
NOTE
Versions of Intel® VTune™ Profiler older than 2019 Update 5 have a different configuration for
password-less SSH. For legacy instructions, see this article.
NOTE
VTune Profiler does not keep your credentials but uses them only once to enable the password-less
access.
When the keys are applied, the terminal window closes and you can proceed with the project configuration
and analysis. For all subsequent sessions, you will not be asked to provide credentials for remote accesses to
the specified system.
Possible Issues
If the keys are copied but the VTune Profiler cannot connect to the remote system via SSH, make sure the
permissions for ~/.ssh and home directories, as well as SSH daemon configuration, are set properly.
Permissions
Make sure your ~/.ssh and ~/.ssh/authorized_keys directory permissions are not too open. Use the
following commands:
chmod go-w ~/
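In addition to the home directory, the ~/.ssh directory and the authorized_keys file are commonly restricted as follows (a general SSH convention rather than a VTune Profiler requirement):
chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys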
NOTE
For this step, you may need administrative privileges.
Make sure the following options are enabled in the SSH daemon configuration file (typically /etc/ssh/sshd_config):
RSAAuthentication yes
PubkeyAuthentication yes
AuthorizedKeysFile .ssh/authorized_keys
For root remote connections, use:
PermitRootLogin yes
If the configuration has changed, save the file and restart the SSH service with:
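The exact command depends on the distribution; on systemd-based systems, for example (the service may be named ssh instead of sshd on some distributions):
sudo systemctl restart sshd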
NOTE
The search is non-recursive. Make sure to specify all required directories.
When you run a remote analysis, the VTune Profiler launches your application on the remote target, collects
data, copies all binary files to the host, and finalizes the analysis result. During finalization, the VTune Profiler
searches the directories for binary/symbol and source data in the following order:
1. Directory <result dir>/all (recursively).
2. Additional search directories that you defined for this project in the Binary/Symbol Search/Source
Search dialog boxes or --search-dir/--source-search-dir command line options.
3. Absolute path on the remote target or VTune Profiler cache directory (binary files only).
See Also
Set Up Remote Linux* Target
Search Directories
1. From within the shell where you will be running the VTune Profiler command line or GUI, assign a value
and export TMPDIR, for example:
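For example (the directory is a placeholder):
export TMPDIR=/home/<user>/vtune_tmp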
See Also
Set Up Analysis Target
Embedded device performance data can be collected remotely on the embedded device, with the analysis run from an instance of VTune Profiler installed on the host system. This is useful when the target system is not capable of local data analysis (low performance, limited disk space, or lack of user interface control).
NOTE
Root access to the operating system kernel is required to install the collectors and drivers required for
performance analysis using VTune Profiler.
NOTE
The Intel System Studio integration layer works for embedded systems with Wind River Linux or Yocto
Project installed.
The Intel System Studio integration layer allows the Intel System Studio products to be fully integrated with
a target operating system by building the drivers and corresponding target packages into the operating
system image automatically. Use this option in the case where a platform build engineer has control over the
kernel sources and signature files, but the application engineer does not. The platform build engineer can
integrate the product drivers with the target package and include them in the embedded device image that is
delivered to the application engineer.
BBLAYERS= "\
...
<OS_INSTALL_DIR>/YoctoProject/meta-intel-iss\
...
"
b. Add the VTune Profiler recipes to conf/local.conf. Possible recipes include:
• intel-vtune-drivers: integrates all VTune Profiler drivers for PMU-based analysis with
stacks and context switches. Requires additional kernel options to be enabled.
• intel-vtune-sep-driver: integrates drivers for PMU-based analysis with minimal
requirements for kernel options.
For more information about these collection methods, see Remote Linux Target Setup in the
VTune Profiler help.
4. Build the target operating system, which will complete the integration of the VTune Profiler collectors
and drivers.
5. Flash the operating system to the target embedded device.
After flashing the operating system to the target embedded device, ensure that the appropriate VTune
Profiler drivers are present. For more information, see Building the Sampling Drivers for Linux Targets.
6. Run the analysis on the target embedded device from the host system using an SSH connection or
using the SEP commands.
a. Set up a password-less SSH access to the target using RSA keys.
b. Specify your target application and remote system.
NOTE
After configuring the remote connection, VTune Profiler will install the appropriate collectors on the
target system.
Use the Intel VTune Profiler Yocto Project Integration Layer
Intel VTune Profiler Yocto Project integration layer builds the drivers into the operating system image
automatically. Use this option in the case where a platform build engineer has control over the kernel sources
and signature files, but the application engineer does not. The platform build engineer can integrate the
product drivers with the target package and include them in the embedded device image that is delivered to
the application engineer.
BBLAYERS= "\
...
<OS_INSTALL_DIR>/vtune-layer\
...
"
e. Add the VTune Profiler recipes to conf/local.conf. Possible recipes include:
• intel-vtune-drivers: integrates all VTune Profiler drivers for PMU-based analysis with
stacks and context switches. Requires additional kernel options to be enabled.
• intel-vtune-sep-driver: integrates drivers for PMU-based analysis with minimal
requirements for kernel options.
For more information about these collection methods, see Remote Linux Target Setup in the
VTune Profiler user guide.
3. Build the target operating system, which will complete the integration of the VTune Profiler collectors
and drivers.
4. Flash the operating system to the target embedded device.
After flashing the operating system to the target embedded device, ensure that the appropriate VTune
Profiler drivers are present.
5. Run the analysis on the target embedded device from the host system using an SSH connection or using the SEP commands.
a. Set up a password-less SSH access to the target using RSA keys.
b. Specify your target application and remote system.
c. Choose an analysis type.
d. Run the analysis from the host.
Use the information available in the Sampling Enabling Product User's Guide to run the SEP commands.
6. View results in the VTune Profiler GUI.
Example: Configuring Yocto Project with the VTune Profiler Integration Layer
If the drivers were not built during collector installation, the installation output should inform you that
building the sampling driver is required.
The drivers are built either on the target system or on the host system, depending on compiler toolchain
availability:
1. If the compiler toolchain is available on the target system:
a. On the target embedded device, build the driver from the <install-dir>/sepdk/src directory
using the ./build-driver command.
b. Load the driver into the kernel using the ./insmod-sep command.
2. If the compiler toolchain is not available on the target system:
a. On the host system, cross-build the driver using the driver source from the target package
sepdk/src directory with the ./build-driver command. Provide the cross-compiler (if
necessary) and the target kernel source tree for the build.
b. Copy the sepdk/src folder to the target system.
c. Load the driver into the kernel using the ./insmod-sep command.
Example: Configuring Yocto Project with Intel VTune Profiler Target Packages
See Also
Build and Install the Sampling Drivers for Linux* Targets
Configure Yocto Project* and VTune Profiler with the Integration Layer
NOTE Profiling support for the Yocto Project* is deprecated and will be removed in a future release.
Intel® VTune™ Profiler can collect and analyze performance data on embedded Linux* devices running Yocto
Project*. This topic provides an example of setting up the VTune Profiler to collect performance data on an
embedded device with Yocto Project 1.8 installed using the Intel VTune Profiler integration layer provided
with the product installation files. The process integrates the VTune Profiler product drivers with the target
package and includes them in the embedded device image. Root access to the kernel is required.
NOTE
VTune Profiler is able to collect some performance data without installing the VTune Profiler drivers. To
collect driverless event-based sampling data, installing the drivers and root access is not required. For
full capabilities, install the VTune Profiler drivers as described here.
Only one recipe can be used at a time. There is no difference between the x86 and x86_64 target packages
for building recipes within Yocto Project. Both can be used on either 32 bit or 64 bit systems.
1. Download the VTune Profiler target package or locate the package in the <install-dir>/target/
linux directory on the host system where VTune Profiler is installed.
2. Copy the selected target package to a location on the Yocto Project build system.
cd $HOME
tar xvzf vtune_profiler_target_x86_64.tgz
2. (Optional) Modify the $HOME/vtune_profiler_<version>/sepdk/vtune-layer/conf/user.conf
file to specify user settings.
a. If the VTune Profiler recipe has been split from the target package, specify one of the following
paths:
• Path to unzipped target package: VTUNE_TARGET_PACKAGE_DIR = "$HOME/
vtune_profiler_<version>"
• Path to VTune Profiler: VTUNE_PROFILER_2020_DIR = "/opt/intel/vtune_profiler"
b. To integrate the SEP driver during system boot:
Specify ADD_TO_INITD = "y" for init-based Yocto systems;
3. Add the vtune-layer to conf/bblayers.conf:
vi conf/bblayers.conf
BBLAYERS = "$HOME/vtune_profiler_<version>/sepdk/vtune-layer\"
Your file should look similar to the following:
BBLAYERS ?= " \
$HOME/source/poky/meta \
$HOME/source/poky/meta-poky \
$HOME/source/poky/meta-yocto-bsp \
$HOME/source/poky/meta-intel \
$HOME/vtune_profiler/sepdk/vtune-layer \
"
4. Specify the Intel VTune Profiler recipe in conf/local.conf. In this example, the intel-vtune-
drivers is used.
vi "conf/local.conf"
IMAGE_INSTALL_append = " intel-vtune-drivers"
NOTE
You cannot add both intel-vtune-drivers and intel-vtune-sep-driver at the same time.
bitbake core-image-sato
NOTE
If you modified the kernel configuration options, make sure the kernel is recompiled.
Configure Yocto Project*/Wind River* Linux* and Intel® VTune™ Profiler with the Intel
System Studio Integration Layer
NOTE Profiling support for the Yocto Project* is deprecated and will be removed in a future release.
You can use Intel® VTune™ Profiler to collect and analyze performance data on embedded Linux* devices
running Yocto Project* or Wind River* Linux*. This example describes how you set up VTune Profiler using
the Intel System Studio integration layer, to collect performance data on an embedded device with Yocto
Project 1.8 or Wind River* Linux* installed. The integration layer is available with the product installation
files. The process integrates the VTune Profiler product drivers with the target package and includes them in
the embedded device image. For this example, you need root access to the kernel.
cp -r <ISS_BASE_DIR>/YoctoProject/meta-intel-iss <YOCTO_HOME>/
For Wind River* Linux*:
cp -r <ISS_BASE_DIR>/YoctoProject/meta-intel-iss <WR_HOME>/
where
• <ISS_BASE_DIR> : Root folder of the Intel System Studio installation. By default, this is /opt/
intel/system_studio_<version>.x.y/. For example, for the 2019 version, the root folder
is /opt/intel/system_studio_2019.0.0/.
• <YOCTO_HOME> : Root folder of the Yocto Project* cloned directory.
• <WR_HOME> : Root folder of the Wind River* Linux* cloned directory.
2. Register the layer by running the post-installation script.
For Yocto Project*:
In the shell console, go to the <YOCTO_HOME> folder and run this command:
$ meta-intel-iss/yp-setup/postinst_yp_iss.sh <ISS_BASE_DIR>
For Wind River* Linux*:
In the shell console, go to the <WR_HOME> folder and run this command:
$ meta-intel-iss/yp-setup/postinst_wr_iss.sh <ISS_BASE_DIR>
To uninstall the Intel System Studio integration:
1. Run the appropriate script to uninstall:
For Yocto Project*:
In the shell console, go to the <YOCTO_HOME> folder and run this command:
$ meta-intel-iss/yp-setup/uninst_yp_iss.sh
For Wind River* Linux*:
In the shell console, go to the <WR_HOME> folder and run this command:
$ meta-intel-iss/yp-setup/uninst_wr_iss.sh
2. Remove the meta-intel-iss layer.
vi /path/to/poky-fido-10.0.0/build/conf/bblayers.conf
BBLAYERS = "$HOME/source/poky/wr-iss-2019\"
Your file should look similar to the following:
BBLAYERS ?= " \
$HOME/source/poky/meta \
$HOME/source/poky/meta-poky \
$HOME/source/poky/meta-yocto-bsp \
$HOME/source/poky/meta-intel \
$HOME/source/poky/wr-iss-2019 \
"
2. Add the Intel VTune Profiler recipe to conf/local.conf. Two recipes are available, intel-vtune-drivers and intel-vtune-sep-driver.
vi "conf/local.conf"
IMAGE_INSTALL_append = " intel-vtune-drivers"
NOTE
You cannot add both intel-vtune-drivers and intel-vtune-sep-driver at the same time.
bitbake core-image-sato
2. Flash the operating system to the embedded device.
4. Configure the analysis type.
5. Start the analysis.
Configure Yocto Project* and Intel® VTune™ Profiler with the Linux* Target Package
NOTE Profiling support for the Yocto Project* is deprecated and will be removed in a future release.
Intel® VTune™ Profiler can collect and analyze performance data on embedded Linux* devices. This topic
provides an example of setting up Intel VTune Profiler to collect performance data on an embedded device
running Yocto Project*. The first section provides information for a typical use case where the required
collectors are automatically installed. The second section provides steps to manually install the collectors and
the VTune Profiler drivers for hardware event-based sampling data collection.
cd /opt/intel/vtune_profiler_2020.0.0.0/sepdk/src
b. Build the driver using the ./build-driver command. For example:
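As a sketch, a non-interactive local build with default options (mirroring the build-driver options described earlier in this chapter):
./build-driver -ni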
If the compiler toolchain is not available on the target embedded system, build the driver on the host
system and install it on the target device using the following steps:
a. Open a command prompt and navigate to the <install-dir>/sepdk/src directory. For
example:
cd /opt/intel/vtune_profiler_2020.0.0.0/sepdk/src
b. Cross-build the driver using the ./build-driver command. Provide the cross-compiler (if necessary) and the target kernel source tree for the build. For example:
mkdir drivers
./build-driver -ni \
--c-compiler=i586-i586-xxx-linux-gcc \
--kernel-version=4.4.3-yocto-standard \
--kernel-src-dir=/usr/src/kernel/ \
--make-args="PLATFORM=x32 ARITY=smp" \
--install-dir=./drivers
c. Copy the sepdk/src/drivers folder to the target system.
d. Load the driver into the kernel using the ./insmod-sep command.
FreeBSD* Targets
Intel® VTune™ Profiler allows you to collect
performance data on a FreeBSD* target system.
Intel VTune Profiler is not installed on the FreeBSD target system. Instead, you are able to install VTune
Profiler on a Linux*, Windows*, or macOS* host system and use a target package for collecting event-based
sampling data on a remote FreeBSD target system in one of the following ways:
• Using VTune Profiler's automated remote collection capability (command line or user interface)
• Collecting the results locally on the FreeBSD system and copying them to the host system for viewing with
VTune Profiler (command line only)
The following sections explain these options in more detail.
Supported Features
Remote Collection: Collection from a Linux, Windows, or macOS host system using the Intel VTune Profiler GUI or command line (vtune).
Local Collection: Collection from the FreeBSD system using:
• Intel VTune Profiler command line (vtune)
• Sampling enabling product (SEP) collectors
Remote Collection:
• Memory Access (without heap object allocation tracking)
• Input and Output (with hardware event-based metrics and SPDK analysis; without MMIO accesses and DPDK analysis)
• Custom Analysis
• View results on the host system
Local Collection:
• io (with hardware event-based metrics and SPDK analysis; without MMIO accesses and DPDK analysis)
• custom event-based sampling analysis
• View results in VTune Profiler on a Linux, Windows, or macOS host system
1. Install VTune Profiler on your Linux*, Windows*, or macOS* host. Refer to the Installation Guide for
your host system for detailed instructions.
2. Install the appropriate sampling drivers on the FreeBSD target system. For more information, see
FreeBSD* System Setup.
3. [Optional] If you want to collect performance data with stacks, build your FreeBSD target application
using the -fno-omit-frame-pointer compiler option, to allow the sampling collector to determine the
call chain via frame pointer analysis.
4. Collect performance data using remote analysis from the host system from the VTune Profiler command
line or GUI.
a. Create or open a project.
b. Specify your target application and remote system and make sure to specify search directories for
symbol/source files required for finalization on the host.
c. Choose and configure an analysis type.
Supported VTune Profiler analysis types (event-based sampling analysis only) include:
• Hotspots (hardware event-based sampling mode)
• Microarchitecture Exploration
• Memory Access (without heap object allocation tracking)
• Input and Output (with hardware event-based metrics and SPDK analysis; without MMIO
accesses and DPDK analysis)
• Custom Analysis
d. Run the analysis from the host. Depending on your settings, the application launches and runs
automatically. Once collection is finished, the result is finalized and displayed with the Summary
window open.
5. Review the results on the host system.
1. Install VTune Profiler on your Linux*, Windows*, or macOS* host. Refer to the Installation Guide for
your host system for detailed instructions.
2. Install the appropriate sampling drivers on the FreeBSD target system. For more information, see
FreeBSD* System Setup.
3. [Optional] If you want to collect performance data with stacks, build your FreeBSD target application
using the -fno-omit-frame-pointer compiler option, which allows the sampling collector to
determine the call chain via frame pointer analysis.
4. Collect performance data using one of the following methods. For more information about each of these
methods, see Remote Linux Target Setup.
• Native analysis on the target system using the VTune Profiler command line (vtune). Supported
analysis types include: hotspots, uarch-exploration, memory-access, io or custom event-based
sampling analysis.
• Native analysis on the target system using the sampling enabling product (SEP) collectors. For more
information, see the Sampling Enabling Product User Guide.
5. Copy the results to the host system.
6. Review the results with VTune Profiler.
• If you used the vtune command, open the *.vtune file.
• If you collected SEP data, import the *.tb7 file.
See Also
Introduction
Set Up FreeBSD* System
$ make
$ make install
5. Run the following command to install the drivers:
QNX* Targets
Intel® VTune™ Profiler supports collecting performance
data on QNX* target systems.
Data collection is possible via command line interface from a host system running Windows* or Linux* to the
target QNX system. The collected traces are transferred to the host system via ethernet and stored for
review. After collection, the performance results can be imported and viewed in the Intel VTune Profiler user
interface.
The target collector can be integrated into the target QNX image during the image build process and requires
only 1 MB of space on the target file system. Because the traces are transferred to the host system,
collection can be done on target systems with limited storage capacity or with read-only file systems.
1. Prerequisites
2. Set up your system
3. Run analysis
4. View and interpret results
Prerequisites
• Host System: Linux* or Windows* system with QNX BSP and VTune Profiler installed
• Target System: Supported processor running the QNX 7 operating system with the instrumented kernel, connected to the host system via Ethernet. Supported processors include Intel® Pentium®, Intel® Celeron®, or Intel Atom® processors formerly code named Apollo Lake, or Intel Atom® processors formerly code named Denverton.
• Turn off firewall restrictions for network connections between the host system and target system
Run Analysis
Analysis is run using collectors previously installed on the target QNX system and a command invoked on the
host Windows or Linux system. All result files are saved to the host system.
1. On the target QNX system, run the following command: <sep-dir>/sep
Where <sep-dir> is the location where the sep file was copied. The target collector loads and waits for
the host system to connect.
2. On the host system, run one of the following analysis commands.
• Hotspots with call stacks: <install-dir>/bin64/sep -start -d <duration> -target-ip
<target-ip-address> -target-port 9321 -lbr call_stack -out <filename>.tb7
Example command:
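A sketch with a placeholder duration (in seconds), target IP address, and output file name:
<install-dir>/bin64/sep -start -d 30 -target-ip 192.168.1.100 -target-port 9321 -lbr call_stack -out hotspots_result.tb7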
NOTE
Call stacks are hardware based and limited to a depth of 16 frames. Due to hardware limitations, the
depth of the captured call stack can be less than 16 frames.
• Custom CPU events: <install-dir>/bin64/sep.exe -start -d <duration> -target-ip
<target-ip-address> -target-port 9321 -ec "<event-list>" -out <filename>.tb7
Example command:
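A sketch with placeholder values; the event names are assumptions and depend on your processor:
<install-dir>/bin64/sep.exe -start -d 30 -target-ip 192.168.1.100 -target-port 9321 -ec "INST_RETIRED.ANY,CPU_CLK_UNHALTED.THREAD" -out custom_events.tb7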
1. On the host system, import the *.tb7 file into the previously created project.
2. Switch to the Hotspots viewpoint and review the performance data collected.
• If you collected hotspots data, begin with the Summary window in the Hotspots viewpoint. The
Top Hotspots list shows the top 5 functions that occupied the most CPU time. Double-click a
function to be taken to the Bottom-up window where you can see aggregated performance data
and a timeline showing activity over the entire collection. For more information, see Hotspots View.
• If you collected CPU event data, begin with the Microarchitecture Exploration viewpoint. For
more information, see Microarchitecture Exploration View.
See Also
Cookbook: Profiling Operating System Boot Time on Linux* and QNX*
1. Click the Configure Analysis button on the Intel® VTune™ Profiler toolbar.
The Configure Analysis window opens.
2. From the WHERE pane, select a required target system (for example, local host).
3. From the WHAT pane, select a target type (for example, Launch Application).
4. Expand the Advanced section and configure the Managed code profiling mode by choosing one of
the following options:
• Native mode collects data on native code only, does not attribute data to managed source.
• Managed mode collects everything, resolves samples attributed to native code, attributes data to
managed source only. The call stack in the analysis result displays data for managed code only.
• Mixed mode collects everything and attributes data to managed source where appropriate. Consider
using this option when analyzing a native executable that makes calls to the managed code.
• Auto mode automatically detects the type of target executable, managed or native, and switches to
the corresponding mode.
NOTE
• On Windows* OS, the managed code profiling setting is inherited automatically from the Visual
Studio* project. For native targets, the Managed code profiling mode option is disabled.
• System-wide profiling for managed code is not supported on Windows* OS.
• Managed and Mixed modes are not supported on Linux* OS.
See Also
.NET* Targets
.NET* Targets
Explore performance analysis specifics for pure .NET*
applications or native applications with .NET calls.
Intel® VTune™ Profiler automatically identifies the type of the code based on the debugger type specified in
the Visual Studio project property pages:
VTune Profiler inherits this setting to set the profiling mode for the analysis target. The following types are
possible:
• Native mode collects data on native code only, does not attribute data to managed source.
• Managed mode collects everything, resolves samples attributed to native code, attributes data to
managed source only. The call stack in the analysis result displays data for managed code only.
• Mixed collects everything and attributes data to managed source where appropriate. Consider using this
option when analyzing a native executable that makes calls to the managed code.
• Auto mode automatically detects the type of target executable, managed or native, and switches to the
corresponding mode.
3. In the Configure Analysis window, click the Search Binaries button at the bottom.
4. In the Binary/Symbol Search dialog box, add a path to the generated native .pdb file.
NOTE
• System-wide profiling is not supported for managed code.
• Starting with the VTune Amplifier 2018 Update 2, you can use the Hotspots analysis in the hardware event-based sampling mode (former Advanced Hotspots) to profile .NET Core applications running on Linux* or Windows* systems in the Launch Application mode. For product versions prior to 2018 Update 2, make sure to manually configure the CoreCLR environment variables to enable the Advanced Hotspots analysis.
See Also
Problem: Analysis of the .NET* Application Fails
mrte-mode
vtune option
NOTE
<package> varies with applications. To identify the package, use any of the following options:
• Open the Task Manager and check the properties for your application. The General tab contains
the package value including the version that should be omitted. For example, if the General tab
displays 47828<app_name>_1.0.0.4_neutral__sgvg9sxsmbbt4, then NGEN'ed modules are
located in C:\Users\Administrator\AppData\Local\Packages
\47828<app_name>_sgvg9sxsmbbt4\AC\Microsoft\CLR_v4.0_32\NativeImages\.
• Use the Process Explorer tool: explore the list of modules loaded in the application, find
*.ni.exe modules and get their location.
2. Rename the folders that include *.ni.dll or *.ni.exe so that the CLR can no longer locate the pre-compiled native images. For example, rename C:\Users\Administrator\AppData\Local\Packages\47828<app_name>_sgvg9sxsmbbt4\AC\Microsoft\CLR_v4.0_32\NativeImages\<app_name> to a temporary name.
3. Re-start your application.
CLR JIT-compiles the methods. You can use the VTune Profiler to profile your C# application until the
next automatic NGEN pre-compilation.
NOTE
This workaround is not recommended for .NET* Framework libraries (for example, mscorlib.dll).
See Also
Set Up Analysis Target
Limitations
• Only Go applications compiled with compiler version 1.6 or later are supported.
• Only 64-bit Go applications are supported.
See Also
Get Started with Intel® VTune™ Profiler
Android* Targets
Use the Intel® VTune™ Profiler installed on the Windows*, Linux*, or macOS* host to analyze code performance on a remote Android* system.
NOTE
For successful product operation, the target Android system should have ~25 MB disk space.
VTune Profiler supports the following usage mode with VTune Profiler remote collector and ADB
communication:
NOTE
If the remote VTune Profiler collector is installed on a non-rooted device, during installation you may
get an error message on missing/incorrect drivers. You can dismiss this message if you plan to run the
user-mode sampling and tracing collection (Hotspots) only.
NOTE
Depending on your system configuration, you may not need to gain a root mode access for Hotspots
(hardware event-based sampling mode), Microarchitecture Exploration and Custom EBS analysis
types.
• To enable hardware event-based sampling analysis, verify that version-compatible, pre-installed signed drivers are present on the target Android system.
Tip
Use ITT APIs to control performance data collection by adding basic instrumentation to your
application.
NOTE
You may use the Analyze unplugged device option to exclude the ADB connection and power supply
impact on the performance results. In this case, the collection starts as soon as you disconnect the
device from the USB cable or a network. The analysis results are transferred to the host when you plug the device back in.
NOTE
On Android platforms, the VTune Profiler supports hardware event-based sampling analysis types and
Hotspots analysis in the user-mode sampling mode. Other algorithmic analysis types are not
supported.
NOTE
To run Energy analysis on an Android system, use the Intel® SoC Watch tool.
See Also
Set Up Android* System
NOTE
You need the "exact" signing key that was produced at the time and on the system where your kernel
was built for your target.
• Configure your Android device for analysis.
• Gain adb access to an Android device.
• For hardware event-based sampling, gain a root mode adb access to the Android device.
• Use the pre-installed drivers on the target Android system.
Optionally, do the following:
• Enable Java* analysis.
• To view functions within Android-supplied system libraries, device drivers, or the kernel, get access from
the host development system to the exact version of these binaries with symbols not stripped.
• To view sources within Android-supplied system libraries, device drivers, or the kernel, get access from
the host development system to the sources for these components.
NOTE
The path to the Developer Options may vary depending on the manufacturer of your device and the system version.
2. Enable Unknown Sources to install the VTune Profiler Android package without Google* Play. To do
this, select Settings > Security and enable the Unknown Sources option.
> su
> setprop service.adb.tcp.port 5555
> stop adbd
> start adbd
3. Connect adb on the host to the remote device. In the Command Prompt or the Terminal on the host,
type:
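A sketch, assuming the TCP port 5555 configured in the previous step (the IP address is a placeholder):
adb connect <target-ip-address>:5555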
NOTE
There are several analysis types on Android systems that do NOT require root privileges, such as Hotspots (user-mode sampling mode) and Perf*-based driverless event-based sampling collection.
Depending on the build, you gain root mode adb access differently:
• User/Production builds : Gaining root mode adb access to a user build of the Android OS is difficult and
different for various devices. Contact your manufacturer for how to do this.
• Engineering builds : Root-mode adb access is the default for engineering builds. Engineering builds of
the Android OS are by their nature not "optimized". Using the VTune Profiler against an engineering build
is likely to result in VTune Profiler identifying code to optimize which is already optimized in user and
userdebug builds.
• Userdebug builds : Userdebug builds of the Android OS offer a compromise between good results and
easy-to-run tools. By default, userdebug builds run adb in user mode. VTune Profiler tools require root
mode access to the device, which you can gain via typing adb root on the host. These instructions are
based on userdebug builds.
<install-dir>/bin{32,64}/amplxe-androidreg.sh --package-command=install --jitvtuneinfo=src
VTune Profiler updates the /data/local.prop file as follows:
NOTE
Java analysis currently requires an instrumented Dalvik JVM. Android systems running on the 4th
Generation Intel® Core™ processors or Android systems using ART vs. Dalvik for Java are not
instrumented to support JIT profiling. You do not need to specify --jitvtuneinfo=N.
Tip
If you are able to see the --generate-debug-info option in the logcat output (adb logcat *:S
dex2oat:I), the compiler uses this option.
NOTE
For releases prior to Android 6.0 Marshmallow*, the --generate-debug-info in the examples below
should be replaced with --include-debug-symbols.
If you run the application before the system classes are compiled, you should add another compiler option -Ximage-compiler-option --generate-debug-info:
adb shell rm -f /data/dalvik-cache/*/*
adb shell dalvikvm -Xcompiler-option --generate-debug-info -Ximage-compiler-option --generate-debug-info -cp TheApp.jar
NOTE
This action is required if Java core classes get compiled to the /data/dalvik-cache/ subdirectory. Manufacturers may place them in different directories. If manufacturers supply the precompiled boot.oat file in /system/framework/x86, Java core classes will not be resolved because they cannot be re-compiled with debug information.
Compilation Settings
Performance analysis is only useful on binaries that have been optimized and have symbols to attribute
samples to source code. To achieve that:
• Compile your code with release-level settings (for example, do not use the -O0 setting on GCC*).
• Do not set APP_OPTIM to debug in your Application.mk, as this setting disables optimization (it uses -O0) when the compiler builds your binary.
• To run performance analysis (Hotspots) on non-rooted devices, make sure to compile your code with the debuggable attribute set to true in AndroidManifest.xml (see the sketch after the note below).
NOTE
If your application is debuggable (android:debuggable="true"), the default setting will be debug
instead of release. Make sure to override this by setting APP_OPTIM to release.
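As a hedged illustration of the compilation settings described above (fragments only; the rest of each file is omitted):
Application.mk:
APP_OPTIM := release
AndroidManifest.xml:
<application android:debuggable="true" ... >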
By default, the Android NDK build process for Android applications using JNI creates a version of your .so
files with symbols.
The binaries with symbols included go to [ApplicationProjectDir]/obj/local/x86.
The stripped binaries installed on the target Android system via the .apk file go to
[ApplicationProjectDir]/libs/x86. These versions of the binaries cannot be used to find source in the
VTune Profiler. However, you may collect data on the target system with these stripped binaries and then
later use the binaries with symbols to do analysis (as long as it is an exact match).
When the VTune Profiler finishes collecting the data, it copies .so files from the device (which have had their
symbols stripped). This allows the very basic functionality of associating samples to assembly code.
Tip
Use ITT APIs to control performance data collection by adding basic instrumentation to your
application.
See Also
-no-unplugged-mode
vtune option
• The Kernel [vmlinux] is one file that does not contain symbols on the target device. Typically it is
located in [AndroidOSBuildDir]/out/target/product/[your target]/linux/kernel/vmlinux.
• Many operating system binaries with symbols are located in either [AndroidOSBuildDir]/out/target/
product/[your target]/symbols, or [AndroidOSBuildDir]/out/target/product/[your
target]/obj.
• Application binaries with symbols are located in [AndroidAppBuildDir]/obj/local/x86.
• Application source files for the C/C++ modules are usually located in [AndroidAppBuildDir]/jni, not in [AndroidAppBuildDir]/src (where the Java* source files are). Some third-party software in Android
does not provide binaries with symbols. You must contact the third party to get a version of the binaries
with symbols.
• You can see if a binary has symbols by using the file command on Linux and checking that the output says not stripped.
file MyBinary.ext
MyBinary.ext: ELF 32-bit LSB shared object, Intel 80386, version 1
(SYSV), dynamically linked, not stripped
NOTE
Instrumentation-based collections, such as Hotspots in the user-mode sampling mode or Threading analysis, can cause significant overhead that grows with the number of worker threads. Instead, use Hotspots analysis in the hardware event-based sampling mode or HPC Performance Characterization to explore application scalability.
NOTE
The workflow represented in the diagram is the recommended flow to speed up the analysis process.
It is possible to run the full Intel VTune Profiler collection on the Intel Xeon Phi processor, but
finalization and visualization might be slow. You can follow the regular analysis flow directly on the
target Intel Xeon Phi processor.
Prerequisites
It is recommended to install the sampling driver for hardware event-based sampling collection types such as
HPC Performance Characterization, Memory Access, Microarchitecture Exploration, or Hotspots (hardware
event-based sampling mode). If the sampling driver is not installed, Intel VTune Profiler can work on Linux
Perf*. Be aware of the following system configuration settings:
• To enable system-wide and uncore event collection that allows the measurement of DRAM and MCDRAM
memory bandwidth that is a part of the Memory Access and HPC Performance Characterization analysis
types, use root or sudo to set /proc/sys/kernel/perf_event_paranoid to 0.
echo 0 > /proc/sys/kernel/perf_event_paranoid
• To enable collection with the Microarchitecture Exploration analysis type, increase the default limit of
opened file descriptors. Use root or sudo to increase the default value in /etc/security/limits.conf
to 100*<number_of_logical_CPU_cores>.
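For reference, the corresponding /etc/security/limits.conf entries (as shown for the same setting later in this guide) are:
<user> hard nofile <100 * number_of_logic_CPU_cores>
<user> soft nofile <100 * number_of_logic_CPU_cores>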
1. Configure and run analysis on the target system with an Intel Xeon Phi processor
There are two ways to configure and run the analysis on the target system:
• Finalization on host system (recommended): Use a command to run the analysis on the system with the
Intel Xeon Phi processor without finalizing. This option results in the best performance.
From a command prompt, run the collection with the deferred finalization option to calculate the binary
check sum for proper symbol resolution on the host system. For example, to run a Memory Access
analysis:
vtune -collect memory-access -finalization-mode=deferred -r <my_result_dir> ./my_app
For more information, see vtune Command Syntax and finalization-mode topics.
Tip
You can also generate a command using the VTune Profiler GUI as described below. After generating
the command, add the -finalization-mode=deferred option to the command to delay finalization.
• Finalization on target system: Use the VTune Profiler GUI on the host system to generate a command for
the target system with the Intel Xeon Phi processor. Run and finalize the analysis on the target system.
This method may not provide the fastest results.
1. In the WHERE pane, select the Arbitrary Host button, set the processor architecture to Intel® Processor code named Knights Landing, and specify the operating system type.
2. In the WHAT pane, select Launch Application and configure the analysis:
• Enter the application name and parameters.
• Select the Use MPI Launcher checkbox and provide the launcher name, number of ranks, ranks to
profile, and result location.
3. In the HOW pane, select and configure an analysis type.
• Hotspots
• HPC Performance Characterization
• Microarchitecture Exploration
• Memory Access
4. Click the Command Line button at the bottom of the window to generate the command.
5. Copy the generated command to a command prompt on the target system and run the analysis.
Finalization begins after the analysis completes. Finalization may take several minutes.
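For the "Finalization on host system" option described above, after copying the result directory to the host you can finalize it there. A hedged sketch, where the -search-dir value is an assumed path to binaries with symbols:
vtune -finalize -r <my_result_dir> -search-dir <path_to_binaries_with_symbols>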
See Also
Custom Analysis
NOTE
Typically the host operating system has access to the system hardware to collect performance data,
but there are cases in which the host system may also be virtualized. If this is the case and you want
to collect performance data on the host system, treat the host system as you would a guest system
and assume that it no longer has the same level of access to the system hardware.
VTune Profiler uses two sampling-based collection modes for analysis:
• User-Mode Sampling
In general, the Hotspots analysis type in this mode will work on every supported VMM because the
analysis type does not require access to the system hardware.
• Hardware Event-Based Sampling
Analysis types that use this mode (Hotspots and Microarchitecture Exploration) have limited reporting
functionality. For example, they may not include accurate results for stacks because this data relies on
information provided by precise events. Running analysis types that rely on precise events will return results, but the collected data will be incomplete or distorted. That is, the result may not point to the actual instruction that caused the event, and such samples can be difficult to distinguish from correct ones.
To enable performance analysis in the hardware event-based sampling mode on a virtual machine,
additional configuration steps are required. After you install VTune Profiler, enable the vPMU for your hypervisor:
• VMware*
• Hyper-V*
• KVM*
• Xen* Project
• Parallels* Desktop
NOTE
Analysis types based on uncore events (Memory Access, Input and Output analysis, and others) and
related performance metrics (Memory Bandwidth, PCIe Bandwidth, and others) are not supported on
virtual machines.
VTune Profiler supports profiling host and guest OS from the host system. This type of analysis is only
available for virtual machines with KVM hypervisor as a preview feature.
Use the following steps to enable event-based sampling analysis on the VMware virtual machine. Refer to the
VMware documentation for the most up-to-date information.
1. From the host system, open the configuration settings for the virtual machine.
2. Select the Processors device on the left.
3. Select the Virtualize CPU performance counters checkbox.
4. Click Save to apply the change.
Profiling Modes
Currently, the VTune Profiler supports the following usage modes for KVM guest OS profiling, and each of
them has some limitations:
VTune Profiler installation mode for each usage mode:
• KVM Guest OS (User Apps): on the guest OS
• KVM Guest OS (User and Kernel Space): on the guest OS
• Host and KVM Guest OS (User and Kernel Space) (preview feature): on the host and guest OS (VTune Profiler custom collector)
Focus on the Platform tab to analyze your code performance on the guest OS and correlate this data with CPU, GPU, power, and hardware event metrics and the interrupt count at each moment in time. If you enabled the kvm Ftrace event collection for your target, you can also monitor statistics for the KVM kernel module.
Limitations
• In this mode, the VTune Profiler collects data only on the kernel space modules on the KVM guest OS.
Data on user space modules shows up in the [Unknown] node and includes only high-level statistics.
• Call stack data is not collected for this type of profiling.
For application analysis, you need to install the Intel® VTune™ Profiler directly on your guest OS. VTune
Profiler installation detects a virtual environment and disables sampling driver installation to avoid system instability. When the product is installed, proceed with project configuration by specifying your application as an analysis target and selecting an analysis type.
When you select a hardware event-based analysis type (for example, Microarchitecture Exploration), the
VTune Profiler automatically enables a driverless event-based sampling collection using the Linux Perf* tool.
For this analysis, the VTune Profiler collects only architectural events. See the Performance Monitoring Unit
Sharing Guide for more details on the supported architectural events.
Limitations
• User-mode sampling limitations:
• Only Hotspots and Threading analyses are supported.
• No system-wide analysis is available.
• Hardware event-based sampling limitations:
• Only Hotspots and limited Microarchitecture Exploration analyses are supported.
• PEBS counters are not virtualized.
• Uncore events are not available.
• KVM modules and host system modules do not show up in the analysis result.
• Data on the guest OS and your application modules show up as locally collected statistics with no
[guest] markers.
• VMs are not required to virtualize performance counters. All performance analysis features are available to
VM users out of the box.
• Sampling drivers (VTune Profiler sampling driver or Perf*) do not need to be installed on a guest VM.
To enable KVM kernel and user space profiling from the host:
1. Install the VTune Profiler on the host and virtual machine.
NOTE
You do not need to install sampling drivers.
2. On both the host and guest systems, run the script from the bin64 folder as root:
$ prepare-debugfs.sh -g <user_group>
$ echo 0 > /proc/sys/kernel/perf_event_paranoid
3. Configure a password-less SSH access from the host to the KVM guest system.
4. If your host system is multi-socket, export the environment variable to set the time source to TSC before starting VTune Profiler:
export VTUNE_RUNTOOL_OPTIONS=--time-source=tsc
5.
6. From the WHAT pane in the Configure Analysis window, expand the Advanced section and enter the
following string to the Custom collector field:
python <vtune_install_dir>/bin64/kvm-custom-collector.py --kvm-ssh-
login=<username>@<kvm_ssh_ip> --vtune-dir-on-kvm=<vtune-install-dir>
NOTE
For additional details on particular options, see the kvm-custom-collector.py script help.
7. To collect data from the guest kernel space, select the Analyze KVM Guest OS option.
Copy /proc/kallsyms and /proc/modules files from the virtual machine to the host.
NOTE
Since these are pseudo-files, it is recommended to cat their content into regular files and then copy those files to the host. Specify paths to the copied files in the project properties.
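For example (the file locations are placeholders, and the copy reuses the password-less SSH access configured earlier):
cat /proc/kallsyms > /tmp/kallsyms
cat /proc/modules > /tmp/modules
scp <username>@<kvm_ssh_ip>:/tmp/kallsyms <username>@<kvm_ssh_ip>:/tmp/modules <local_dir>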
8. From the HOW pane, select any hardware event-based sampling analysis (for example, General
Exploration) and run the analysis from the host.
Explore the collected data by enabling all the grouping levels containing a VM component to differentiate the
host and target data.
Example 1: Hotspots Analysis (Hardware Event-Based Sampling Mode)
Analyze hotspots for both an application launched from the Linux host, app-from-host, and an application
launched on the KVM guest system, app-in-vm:
Example 2: Microarchitecture Exploration Analysis
Analyze the efficiency of the Microarchitecture Usage for the application launched on the KVM guest system.
The context summary on the right pane shows the hardware metrics for the thread (launched inside the
KVM) selected in the grid:
NOTE
• Some configurations do not support the all mode.
• CPU events virtualization requires root privileges.
• Unlike CPU profiling, GPU profiling in the hv mode is available for all domains (Dom0 and DomU).
The System Summary in the System Information dialog box should show the Virtualization-based
security item as Running:
NOTE
If your system does not meet the profiling requirements but you plan to run hardware event-based
sampling analysis with VTune Profiler, make sure to disable the Hyper-V feature in the system
settings.
Arbitrary Targets
Configure and generate a command line for
performance analysis on a system that is not
accessible from the current host.
Besides targets accessible to Intel® VTune™ Profiler directly on the host or via a remote connection (SSH or
ADB), you have an Arbitrary Host option to create a command line configuration for a platform not
accessible from the current host. You can select any of the supported hardware platforms and operating
systems, configure corresponding target and analysis options, and generate a command line by clicking the
Command Line button. The generated command line will be saved in the buffer and can be used later on
the intended host.
NOTE
The option to generate a command line from GUI via the Command Line button is available for both
accessible and arbitrary targets.
1. Create a new project or click the Configure Analysis toolbar button for an existing project.
2. From the Configure Analysis window, click the Browse button on the WHERE pane and select the Arbitrary Host (not connected) type of the target system.
3. Specify a platform for profiling:
• Select a hardware platform for analysis from the drop-down menu, for example: Intel® processor
code named Anniedale.
• Specify either Windows* or GNU*/Linux* operating system.
4. Switch to the WHAT pane to configure analysis target options.
For MPI analysis of an arbitrary target, enable the Use MPI launcher check box to generate a
command line configuration. Configure the following MPI analysis options:
• MPI launcher: Select an MPI launcher that should be used for your analysis. You can either enable
the Intel MPI launcher option (default) or select Other and specify a launcher of your choice, for
example: aprun, srun, or lbrun.
• Number of ranks: Specify the number of ranks used for your application.
• Profile ranks: Use All to profile all ranks, or choose Selective and specify particular ranks to
profile, for example: 2-4,6-7,8.
• Result location: Specify a relative or absolute path to the directory where the analysis result
should be stored.
If your target system is not powerful enough, consider selecting another system for the result finalization. In this case, VTune Profiler calculates only a binary checksum to be used for finalization on the host machine. This option is recommended for analysis on the Intel Xeon Phi processor (code name: Knights Landing).
5. Switch to the HOW pane, then choose and configure (if required) an analysis type.
6.
Click the Command Line... button at the bottom to generate a command line for your
configuration.
For example, VTune Profiler generates the following command line for a test MPI application that will
be launched on a GNU/Linux system via Intel MPI launcher and analyzed for Memory Access issues on
ranks 2-4,6-7,8:
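The generated command line itself is not shown here. As a hedged sketch only, an equivalent invocation with the Intel MPI launcher could use the -gtool option to attach the collector to the selected ranks (the rank count of 16 and the result path are assumptions):
mpirun -n 16 -gtool "vtune -collect memory-access -result-dir <result_dir>:2-4,6-7,8" ./my_app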
See Also
Analysis Target
Finalization
NOTE
This connection type uses the TCP/IP protocol suite. This connection is not secure, and it is
recommended to use this connection type in a secure lab environment.
The Analysis Communication Agent is a software agent that runs on the target system and serves as the connection between the VTune Profiler collector running on the host side and the sampling driver running on the target system.
• Host side:
• Communication Agent (TCP/IP) connection type
The Communication Agent (TCP/IP) connection type is used to connect to the Analysis
Communication Agent running on the target system via the TCP/IP protocol suite.
Prerequisites
• Sampling driver and Analysis Communication Agent implementations for your target system. You can use
the reference solution to help implement and build these components.
• A TCP/IP capable operating system with the sampling driver loaded and Analysis Communication Agent
launched.
• A host system with VTune Profiler installed.
Run Analysis
Once the target system is ready, follow these steps to run an analysis:
1. Launch VTune Profiler on the host system.
2. (Optional) Click the New Project button to create a new project.
3. Click Configure Analysis and select the Communication Agent (TCP/IP) connection type in the
WHERE pane.
Analyze Performance
After you create a project and specify a target for analysis, you are ready to run your first analysis.
Performance Snapshot
Click Configure Analysis on the Welcome page. By default, this action opens the Performance Snapshot
analysis type. This is a good starting point to get an overview of potential performance issues that affect your
application. The snapshot view includes recommendations for other analysis types you should consider next.
Analysis Groups
Click anywhere on the analysis header that contains the name of the analysis type. This opens the Analysis
Tree, where you can see other analysis types grouped into several categories. See Analysis types to get an
overview of these predefined options.
Advanced users can create custom analysis types which appear at the bottom of the analysis tree.
See Also
Run Command Line Analysis
Reference
See Also
Error Message: Application Sets Its Own Handler for Signal
VTune Profiler profiles your application using the counter overflow feature of the Performance Monitoring Unit (PMU).
The data collector interrupts a process and captures the IP of interrupted process at the time of the interrupt.
Statistically collected IPs of active processes enable the viewer to show statistically important code regions
that affect software performance.
Caution
Statistical sampling does not provide 100% accurate data. When the VTune Profiler collects an event,
it attributes not only that event but the entire sampling interval prior to it (often 10,000 to 2,000,000
events) to the current code context. For a large number of samples, this sampling error does not have a serious impact on the accuracy of the performance analysis, and the final statistical picture is still valid. But if something happens for a very short time, very few samples will exist for it. This may yield
seemingly impossible results, such as two million instructions retiring in 0 cycles for a rarely-seen
driver. In this case, you may either ignore hotspots showing an insignificant number of samples or
switch to a higher granularity (for example, function).
NOTE
This is a PREVIEW FEATURE. A preview feature may or may not appear in a future production
release. It is available for your use in the hopes that you will provide feedback on its usefulness and
help determine its future. Data collected with a preview feature is not guaranteed to be backward
compatible with future releases.
You can also create a custom analysis type based on the hardware event-based sampling collection.
Caution
Analysis types that use the hardware event-based sampling collector are limited to only one collection
allowed at a time on a system.
Prerequisites:
It is recommended to install the sampling driver for hardware event-based sampling collection types. For
Linux* and Android* targets, if the sampling driver is not installed, VTune Profiler can enable the Perf*
driverless collection. Be aware of the following configuration settings for Linux target systems:
• To enable system-wide and uncore event collection, use root or sudo to set /proc/sys/kernel/
perf_event_paranoid to 0.
echo 0 > /proc/sys/kernel/perf_event_paranoid
• To enable collection with the Microarchitecture Exploration analysis type, increase the default limit of
opened file descriptors. Use root or sudo to increase the default value in /etc/security/limits.conf
to 100*<number_of_logical_CPU_cores>.
<user> hard nofile <100 * number_of_logic_CPU_cores>
<user> soft nofile <100 * number_of_logic_CPU_cores>
See Also
Hardware Event-based Sampling Collection with Stacks
Event multiplexing is also useful if the application does not have a long steady state or takes a long time to
get to steady state. On the other hand, if application initialization is short and it gets to steady state quickly,
then you can do multiple short runs and will not need to do event multiplexing.
To enable/disable multiple runs of the data collection:
1.
NOTE
Collecting data in multiple runs is only possible if an application to launch is specified.
3. On the WHAT configuration pane, scroll down to the Advanced section and select the Allow multiple
runs option to enable more precise event data collection or deselect the option to use event
multiplexing.
If you enable the multiple run mode, the VTune Profiler runs the data collection several times for each event
set. You can easily detect these multiple runs on the Timeline pane: they are separated by grayed-out paused areas.
The multiple run mode affects the metrics calculation. All "total" types of metrics (Total Time, Elapsed Time)
are calculated for the whole analysis session that includes multiple runs while all other metrics are provided
per run.
If you want to avoid running the application multiple times but get more accurate multiplexing data, you
need to create a custom analysis and enable the Use precise multiplexing option available for the custom
hardware event-based sampling analysis configuration. This option enables a multiplexing algorithm that
switches event groups on each sample. This mode provides more reliable statistics for applications with a
short execution time. You may also consider enabling the precise multiplexing if the MUX Reliability metric for
the Microarchitecture Exploration analysis result is low.
See Also
allow-multiple-runs
vtune option
Custom Analysis Options
Problem: 'Events= Sample After Value (SAV) * Samples' Is Not True If Multiple Runs Are Disabled
NOTE
For Linux* targets, make sure your kernel is configured to support event-based stack sampling
collection.
Multitask operating systems execute all software threads in time slices (thread execution quanta). Intel® VTune™ Profiler handles thread quantum switches and performs all monitoring operations in correlation with the thread quantum layout.
The figure below explains the general idea of per-thread quantum monitoring:
• The profiler gains control whenever a thread gets scheduled on and then off a processor (that is, at thread
quantum borders). That enables the profiler to take exact measurements of any hardware performance
events or timestamps, as well as collect a call stack to the point where the thread gets activated and
inactivated.
• The profiler determines a reason for thread inactivation: it can either be an explicit request for synchronization, or a so-called thread quantum expiration, when the operating system scheduler
preempts the current thread to run another, higher-priority one instead.
• The time during which a thread remains inactive is also measured directly and differentiated based on the
thread inactivation reason: inactivity caused by a request for synchronization is called Wait time, while
inactivity caused by preemption is called Inactive time.
While a thread is active on a processor (inside a quantum), the profiler employs event-based sampling to reconstruct the program logic and associate hardware events and other characteristics with the program code. Unlike traditional event-based sampling, upon each sampling interrupt the profiler also collects:
• call stack information
• branching information (if configured so)
• processor timestamps
All that allows for statistically reconstructing program execution logic (call and control flow graphs) and
tracing threading activity over time, as well as collecting virtually any information related to hardware
utilization and performance.
Configure Stack Collection
1.
NOTE
• The event-based stack sampling data collection cannot be configured for the entire system. You
have to specify an application to launch or attach to.
• By default, on Linux* the VTune Profiler uses the driverless Perf*-based mode for hardware event-
based collection with stacks. To use the driver-based mode, set the Stack size option to 0
(unlimited).
• Call stack analysis adds overhead to your data collection. To minimize the overhead incurred by stack collection, use the Stack size option in the custom hardware event-based sampling configuration or the -stack-size knob from the CLI to limit the size of a raw stack (see the command-line sketch after this note). By default, on Linux a stack size of 1024 bytes is collected. On Windows, a full-size stack is collected by default (zero size value). If you disable this option, the overhead is reduced further, but no stack data is collected.
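A command-line sketch of this configuration (the knob names follow the options described above; treat the exact spelling as an assumption and verify it with vtune -help collect hotspots):
vtune -collect hotspots -knob sampling-mode=hw -knob enable-stack-collection=true -knob stack-size=1024 -- ./my_app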
Analyze Performance
Select the Hardware Events viewpoint and click the Event Count tab. By default, the data in the grid are
sorted by the Clockticks (CPU_CLK_UNHALTED) event count providing primary hotspots on top of the list.
Click the plus sign to expand each hotspot node (a function, by default) into a series of call paths, along
which the hotspot was executed. VTune Profiler decomposes all hardware events per call path based on the
frequency of the path execution.
The counts of the hardware events of all execution paths leading to a sampled node sum up to the event
count of that node. For example, for the CpupSyscallStub function, which is the top hotspot of the
application, the INST_RETIRED.ANY event count equals the sum of event counts for all 5 calling sequences:
25 700 419 203.
Such a decomposition is extremely important if a hotspot is in a third-party library function whose code
cannot be modified, or whose behavior depends on input parameters. In this case the only way of
optimization is analyzing the callers and eliminating excessive invocations of the function, or learning which
parameters/conditions cause most of the performance degradation.
Explore Parallelism
When the call stacks collection is enabled (for example, Collect stacks option for the Hotspots in the
hardware event-based sampling mode), the VTune Profiler analyzes context switches and displays data on
the threads activity using the context switch performance metrics.
Click the Context Switch by Reason > Synchronization column header to sort the data by this metric. Synchronization hotspots with the highest number of context switches and high Wait time values typically signal thread contention on this stack.
Select a context switch oriented type of the stack (for example, the Preemption Context Switch Count
type) in the drop-down menu of the Call Stack pane and explore the Timeline pane that shows each
separate thread execution quantum. A dark-green bar represents a single thread activity quantum; grey and light-green bars represent thread inactivity periods (context switches). Hover over a context switch region in the Timeline pane to view details on its duration, start time, and the reason for thread inactivity.
When you select a context switch region in the Timeline pane, the Call Stack pane displays a call sequence
at which a preceding quantum was interrupted.
You may also select a hardware or software event from the Timeline drop-down menu and see how the event
maps to the thread activity quanta (or to the inactivity periods).
Correlate data you obtained during the performance and parallelism analysis. Those execution paths that are
listed as the performance hotspots with the highest event count and as the synchronization hotspots are
obvious candidates for optimization. Your next step could be analyzing power metrics to understand the cost
of such a synchronization scheme in terms of energy.
NOTE
• For analyses using the Perf*-based driverless collection, the types of context switches (preemption
or synchronization) may not be identified on kernels older than 4.17 and the following metrics may
not be available: Wait time, Wait Rate, Inactive Time, Preemption and Synchronization Context
Switch Count.
• The speed at which data is generated (proportional to the sampling frequency and the intensity of thread synchronization/contention) may become greater than the speed at which the data is saved to a trace file. In this case, the profiler tries to adapt the incoming data rate to the outgoing data rate by not letting threads of the profiled program be scheduled for execution. This causes paused regions to appear on the timeline, even if no pause was explicitly requested. In extreme cases, when this procedure fails to limit the incoming data rate, the profiler begins losing sample records, but still keeps the counts of hardware events. If such a situation occurs, the hardware event counts of lost sample records are attributed to a special node: [Events Lost on Trace Overflow].
See Also
knob enable-stack-collection=true
Performance Snapshot
VTune Profiler provides several analysis types that are tailored to examine various application types and
aspects of performance. Performance Snapshot captures a picture of these aspects and presents an overview
of the workings of your application.
Use Performance Snapshot when you want to see a summary of issues affecting your application. This
analysis also includes recommendations for other analysis types that you can run next for a deeper
investigation.
Run the Analysis
Before running Performance Snapshot, make sure you Create a project.
1. Click Configure Analysis on the VTune Profiler welcome screen. This opens the Performance
Snapshot analysis type by default. You can also select this analysis from the Analysis Tree.
2. In the WHAT pane, specify your target application and any application parameters.
3. In the HOW pane, click the Start button ( ) to run the analysis.
NOTE
To run Performance Snapshot from the command line for this configuration, use the
Command Line button at the bottom.
4. Once the data collection is complete, see a performance overview in the Summary tab.
The overview typically includes several metrics along with their descriptions.
See guidance on other analyses you should consider running next. The Analysis
Tree highlights these recommendations.
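As noted above, you can run the same configuration from the command line. A typical generated command has this shape (the result directory and application name are placeholders):
vtune -collect performance-snapshot -r <result_dir> -- <my_app> [app_parameters]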
See Also
Run Command Line Analysis
Reference
Algorithm Group
The analyses in the Algorithm group target software
tuning. They help you understand where your
application spends the most time. You can also
analyze the efficiency of your algorithms.
The Algorithm group includes these analysis types:
• Hotspots focuses on a particular target, identifies functions that took the most CPU time to execute,
restores the call tree for each function, and shows thread activity.
• Anomaly Detection analysis helps you identify performance anomalies in frequently recurring intervals of
code like loop iterations.
• Memory Consumption analyzes your Linux* native or Python* targets to explore memory consumption
(RAM) over time and identify memory objects allocated and released during the analysis run.
NOTE
Intel® VTune™ Profiler is the renamed version of Intel® VTune™ Amplifier.
NOTE
• If you cannot run the hardware event-based sampling with stacks, disable the Collect stacks
option and run the collection. To correlate the obtained hardware event-based sampling data with
stacks, run a separate Hotspots analysis in the User-Mode Sampling mode.
• On 32-bit Linux* systems, the VTune Profiler uses a driverless Perf*-based collection for the
hardware event-based sampling mode.
1. Click the (standalone GUI)/ (Visual Studio IDE) Configure Analysis button on the VTune Profiler welcome screen.
2. In the HOW pane, select the Hotspots analysis from the Analysis Tree.
3. Configure the following options:
User-Mode Sampling mode: Select to enable the user-mode sampling and tracing collection for hotspots and call stack analysis (formerly known as Basic Hotspots). This collection mode uses a fixed sampling interval of 10ms. If you need to change the interval, click the Copy button and create a custom analysis configuration.

Hardware Event-Based Sampling mode: Select to enable hardware event-based sampling collection for hotspots analysis (formerly known as Advanced Hotspots). You can configure the following options for this collection mode:
• CPU sampling interval, ms to specify an interval (in milliseconds) between CPU samples. Possible values for the hardware event-based sampling mode are 0.01-1000. 1 ms is used by default.
• Collect stacks to enable advanced collection of call stacks and thread context switches.
NOTE
When changing collection options, pay attention to the Overhead diagram on the right. It dynamically changes to reflect the collection overhead incurred by the selected options.

Show additional performance insights check box: Get additional performance insights, such as vectorization, and learn next steps. This option collects additional CPU events, which may enable the multiplexing mode. The option is enabled by default.

Details button: Expand/collapse a section listing the default non-editable settings used for this analysis type. If you want to modify or enable additional settings for the analysis, you need to create a custom configuration by copying an existing predefined configuration. VTune Profiler creates an editable copy of this analysis type configuration.
4. Click the Start button to run the analysis.
NOTE
To generate the command line for this configuration, click the Command Line... button at the bottom.
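As a hedged illustration, the generated command lines for the two collection modes usually look like this (the result directory and application are placeholders):
vtune -collect hotspots -r <result_dir> -- ./my_app
vtune -collect hotspots -knob sampling-mode=hw -knob enable-stack-collection=true -r <result_dir> -- ./my_app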
View Data
When the data is collected, VTune Profiler opens it in the Hotspots by CPU Utilization viewpoint providing
the following views for analysis:
• Summary window displays statistics on the overall application execution to analyze CPU time and
processor utilization.
• Bottom-up window displays hotspot functions in the bottom-up tree, CPU time and CPU utilization per
function.
• Top-down Tree window displays hotspot functions in the call tree, performance metrics for a function only
(Self value) and for a function and its children together (Total value).
• Caller/Callee window displays parent and child functions of the selected focus function.
• Platform window provides details on CPU and GPU utilization, frame rate, memory bandwidth, and user
tasks (if corresponding metrics are collected).
What's Next
1. Identify the most time-consuming function in the grid and double-click it for source analysis.
2. Analyze the source of the critical function starting with the highlighted hottest code line and moving
further with the Hotspot Navigation options.
3. Modify your code to remove bottlenecks and improve the performance of your application.
4. Re-run the analysis and verify your optimization with the comparison mode.
For further steps, explore the Insights section provided in the Summary window. This section contains
information on your target performance against metrics collected in addition to standard hotspots metrics. If
there are any performance issues detected, the VTune Profiler flags such a metric value and provides an
insight on potential next steps to fix the problem.
Information provided by Hotspots analysis is important for tuning serial applications and it is still useful for
tuning the serial sections of parallel applications. The Hotspots analysis data helps you understand what your
application is doing and identify the code that is critical to tune. For parallel applications running on multi-
core systems you may need additional analyses: Threading or HPC Performance Characterization.
See Also
collect
hotspots vtune option
Tutorial: Analyze Common Performance Bottlenecks on Linux* - C++ Sample Code
Tutorial: Analyze Common Performance Bottlenecks on Windows* - C++ Sample Code
Hotspots View
Identify program units that took the most CPU time.
These are recognized as hotspots. The Hotspots
viewpoint is available for all analysis results.
Follow these steps to interpret performance data available in the Hotspots viewpoint:
1. Define a performance baseline.
2. Identify the hottest function.
3. Identify algorithm issues.
4. Analyze source.
5. Explore other analysis types.
If this metric value is flagged as critical, consider running the Microarchitecture Exploration analysis to
dive deeper into hardware metrics.
Next, focus your tuning efforts on the program units with the largest Poor value. This means that your
application underutilized the CPU time during the execution of these program units. The overall goal of
optimization is to achieve the Ideal (green) or OK (orange) CPU utilization state and shorten the Poor and Over CPU utilization values.
Identify Algorithm Issues
If you identify issues with the calling sequences in your application, you can improve performance by revising
the order in which functions are called. Use these methods:
• Top-down Tree pane: Analyze the Total and Self time data for callers and callees of the hotspot function to
understand whether this time can be optimized.
• Call Stack pane: Identify the highest contributing stack for the program unit(s) selected in the Bottom-
up or Top-down Tree panes. Use the navigation buttons to see the different stacks that
called the selected program unit(s). The contribution bar shows the contribution of the currently visible
stack to the overall time spent by the selected program unit(s). You can also use the drop-down list in the
Call Stack pane to view data for different types of stacks.
NOTE
Stack data is available by default for the user-mode sampling mode. To have this data for the
hardware event-based sampling mode, you need to enable the Collect stacks option in the Hotspots
analysis configuration.
Analyze Source
Double-click the hottest function to view its related source code in the Source/Assembly window. Open the
code editor directly from Intel® VTune™ Profiler and improve your code (for example, minimizing the number
of calls to the hotspot function).
What's Next
If you ran the analysis with the default Show additional performance insights option, the Summary
view will include the Insights section that provides additional metrics for your target such as efficiency of
the hardware usage and vectorization. This information helps you identify potential next steps for your
performance analysis and understand where you could focus your optimization efforts.
Related information
• An explanation of Flame Graphs
• Flame Graph Window
• Source Code Analysis
• View Stacks
• Reference
NOTE
This is a PREVIEW FEATURE. A preview feature may or may not appear in a future production
release. It is available for your use in the hopes that you will provide feedback on its usefulness and
help determine its future. Data collected with a preview feature is not guaranteed to be backward
compatible with future releases.
The control flow trace feature in Intel® PT generates a variety of packets that, when combined with the
binaries of a program by a post-processing tool, can be used to produce an exact execution trace. The
packets record flow information such as instruction pointers (IP), indirect branch targets, and directions of
conditional branches within contiguous code regions (basic blocks). For descriptions of key concepts in Intel®
PT, see Chapter 35 of the Intel Software Developer's Manual, Volume 3C: System Programming Guide.
To detect software performance anomalies using VTune Profiler, you use the Instrumentation and Tracing
Technology (ITT) API to designate specific code regions of interest and then run Anomaly Detection analysis.
// Requires the ITT notification header and library to be linked into the application.
#include <ittnotify.h>

// Create the region handle once, for example during initialization.
__itt_pt_region region = __itt_pt_region_create("MyRegion");

for(…;…;…)
{
    __itt_mark_pt_region_begin(region);  // start of the code region of interest
    <code processing your task>
    __itt_mark_pt_region_end(region);    // end of the code region of interest
}
NOTE
To run Anomaly Detection from the command line, use the Command Line button at the
bottom.
View Data
Once the analysis is complete, VTune Profiler displays results in the Summary window.
• Elapsed Time indicates the total time spent on all code regions of interest.
• Code Region of Interest Duration Histogram plots the number of instances of performance-critical
tasks against specified duration (or latency). See specific code regions in the Fast and Slow regions to
understand why the duration changed.
• Collection and Platform Info displays relevant details about the system, data about the collection
platform, and the resulting set size.
View Data on a Different System
The above procedure is useful when you process analysis results on the same system where you collected
data. If you want to transfer the collected data onto a different system before you view it, run the archive
command after data collection to copy essential binaries to the results folder. Complete this step before you transfer results to the new system so that collection details load without problems.
To run the archive command:
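A typical invocation looks like this (the result directory is a placeholder):
vtune -archive -r <result_dir>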
NOTE
To view collected data on a different system, you must copy all binaries including system and compiler
runtime binaries that are linked to your main binary and were accessed during the collection. The
archive command is useful for this purpose since it is not easy to copy these binaries manually.
Next Steps
See the Anomaly Detection view for information on interpreting collected data in these ways:
• Load trace details for each analysis in the Bottom-up window.
• Look for unexpected kernel activity. See if applications entered certain kernels that should not have been
activated during the analysis.
• Use the source and assembly views to compare code regions of interest in fast and slow regions of the
histogram.
See Also
Anomaly Detection View Interpret results after performing Anomaly Detection analysis on your
application. Identify performance anomalies by examining code regions of interest.
View Data
Once you complete running Anomaly Detection on your application, the collected data displays in the
Summary window.
Start with the Code Region of Interest Duration Histogram. This shows the number of instances of a
performance-critical task for a specific duration or latency (in ms).
Examine the histogram to see:
• Code regions of interest
• Information about regions where simulations executed faster or slower than normal
This diagram identifies unexpected performance outliers in the Slow region.
NOTE If necessary, use the sliders on the X-axis to adjust the thresholds for Fast, Good, and Slow
latencies.
The key metrics and their interpretation:
• Instructions Retired, Call Count, Total Iteration Count: control flow metrics. Instructions Retired refers to the number of entries into a kernel.
• Wait Time, Inactive Time: duration for which a thread was idle because of synchronization or preemption.
Use this window as a hub to detect the following types of performance anomalies.
• Context Switch Anomaly
• Kernel-Induced Anomaly
• Frequency Drops
• Control Flow Deviation Anomaly
4. Expand the instance to drill down to a function or a stack. Identify the stack(s) that brought the thread
to an idle state.
Kernel-Induced Anomaly
1. In the Intel Processor Trace Details window, sort the data by Kernel Time. The topmost element of
the stack points to the entry point into the kernel. Where the ratio of kernel time to Elapsed Time is
high, a significant amount of time was spent in the kernel. In this example, 566 out of 997
microseconds were spent in the kernel for the highlighted thread.
2. Expand the thread to see contributing stacks that could be responsible for long kernel times.
Due to the presence of dynamic code in the kernel and drivers, it is not possible to perform static processing
of these binaries. The kernel_activity node at the top of the stack aggregates all performance data for
kernel activity that happened during a specific instance of the Code Region of Interest.
Since kernel binaries are not processed, VTune Profiler cannot collect code flow metrics like Call Count,
Iteration Count, or Instructions Retired. All these metrics are zero, except Instructions Retired, which
indicates the number of entries into the kernel.
A possible explanation for a kernel-induced anomaly could be network speed. This could cause a slowdown
when control goes to the kernel while receiving a request and sending a response over the network.
Frequency Drops
Find information about frequency drops in one of these windows:
• Bottom-up window: Shows frequency information for the entire application.
• Intel Processor Trace Details window: Shows frequency information only for the loaded region.
Frequency drops can happen due to several reasons:
• There are Intel® Advanced Vector Extensions (Intel® AVX) instructions used inside or outside a loaded
code region.
• There are underlying hardware issues like cooling.
• Apart from your application, low activity on the core and OS can also cause frequency drops. Look for high
numbers of Inactive Time or Wait Time.
Control Flow Deviation Anomaly
1. Select a node in the grid where you see a high value for Instructions Retired.
2. Right-click and select Filter In by Selection from the context menu.
3. Switch to the Caller/Callee window.
In the flat profile view, you can see functions annotated with Self and Total CPU Times. The caller view
shows the callers of the selected function in a bottom-up representation. The callee view shows a call
tree from the selected function in a top-down representation.
In this example, the call to the _slab_evict_one function from _slab_evict_rand causes a significant delay, as evidenced by the Self CPU Time.
Source Code Analysis:
This is an alternative method to identify deviations in the control flow.
1. Check the Total Iteration Count to compare the number of loop iterations between a fast and slow
iteration.
2. If the slower iteration has a higher iteration count, switch to Source Assembly view and examine the
source code of the function.
3. Check to see if the slower iteration passed the validation of the cached element.
Both of these methods indicate the presence of a Cache Eviction, which can occur infrequently. While you may not be able to eliminate cache evictions entirely, you can minimize them in these ways:
• Increase the cache size.
• Update cache data and repeat the analysis.
See Also
Anomaly Detection Analysis
Analyze Performance
How It Works
During Memory Consumption analysis, the VTune Profiler data collector intercepts memory allocation and
deallocation events and captures a call sequence (stack) for each allocation event (for deallocation, only a
function that released the memory is captured). VTune Profiler stores the calling instruction pointer (IP)
along with a call sequence in data collection files, and then analyzes and displays this data in a result tab.
1. Click the (standalone GUI)/ (Visual Studio IDE) Configure Analysis button on the Intel® VTune™ Profiler toolbar.
2. From the HOW pane, click the Browse button and select Memory Consumption.
The Memory Consumption analysis is pre-configured to collect data at the memory objects (data structures) granularity, which is provided due to instrumentation of memory allocations/deallocations and getting static/global variables from symbol information.
3. Optionally, you may configure the Minimal dynamic memory object size to track option. This option
helps reduce runtime overhead of the instrumentation. The default value is 32 bytes.
4. Click the Start button to run the analysis.
NOTE
Generate the command line for this configuration using the Command Line button at the
bottom.
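As a hedged sketch, an equivalent command line looks like this; the knob that corresponds to the minimal object size option can be listed with vtune -help collect memory-consumption:
vtune -collect memory-consumption -r <result_dir> -- ./my_app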
View Data
By default, the analysis result opens in the Memory Consumption viewpoint. Identify peaks of the memory
consumption on the Timeline pane and analyze allocation stacks for the hotspot functions. Double-click a
hotspot function to switch to the Source view and analyze the source lines allocating a high amount of
memory.
See Also
Memory Consumption and Allocations View
collect
memory-consumption vtune option
For further investigation, switch to the Bottom-up tab and explore the memory consumption distribution
over time. Focus on the peak values on the Timeline pane, select a time range of interest, right-click, and
use the Filter In by Selection context menu option to filter in the program units (functions, modules,
processes, and so on) executed during this range:
In the example above, the python foo function allocated 915 310 048 bytes of memory in a call tree
displayed in the Call Stack pane on the right but released only 817 830 048 bytes. 92MB is the maximum
Allocation/Deallocation delta value that signals a potential memory leak. Clicking the foo function opens the
Source view highlighting the code line that allocates the maximum memory. Use this information for deeper
code analysis to identify a cause of the memory leaks.
See Also
Memory Consumption Analysis
Analyze Performance
• To enable collection with the Microarchitecture Exploration analysis type, increase the default limit of
opened file descriptors. Use root or sudo to increase the default value in /etc/security/limits.conf
to 100*<number_of_logical_CPU_cores>.
<user> hard nofile <100 * number_of_logic_CPU_cores>
<user> soft nofile <100 * number_of_logic_CPU_cores>
NOTE
Intel® VTune™ Profiler is the renamed version of Intel® VTune™ Amplifier.
How It Works
The Microarchitecture Exploration analysis strategy varies by microarchitecture. For modern
microarchitectures starting with Intel microarchitecture code name Ivy Bridge, the Microarchitecture
Exploration analysis is based on the Top-Down Microarchitecture Analysis Method using the Top-Down
Characterization methodology, which is a hierarchical organization of event-based metrics that identifies the
dominant performance bottlenecks in an application.
Superscalar processors can be conceptually divided into the front-end, where instructions are fetched and
decoded into the operations that constitute them, and the back-end, where the required computation is
performed. Each cycle, the front-end generates up to four of these operations. It places them into pipeline
slots that then move through the back-end. Thus, for a given execution duration in clock cycles, it is easy to
determine the maximum number of pipeline slots containing useful work that can be retired in that duration.
The actual number of retired pipeline slots containing useful work, though, rarely equals this maximum. This
can be due to several factors: some pipeline slots cannot be filled with useful work, either because the front-
end could not fetch or decode instructions in time (Front-end bound execution) or because the back-end was
not prepared to accept more operations of a certain kind (Back-end bound execution). Moreover, even
pipeline slots that do contain useful work may not retire due to bad speculation. Front-end bound execution
may be due to a large code working set, poor code layout, or microcode assists. Back-end bound execution
may be due to long-latency operations or other contention for execution resources. Bad speculation is most
frequently due to branch misprediction.
Each cycle, each core can fill up to four of its pipeline slots with useful operations. Therefore, for some time
interval, it is possible to determine the maximum number of pipeline slots that could have been filled in and
issued during that time interval. This analysis performs this estimate and breaks up all pipeline slots into four
categories:
• Pipeline slots containing useful work that issued and retired (Retired)
• Pipeline slots containing useful work that issued and cancelled (Bad speculation)
• Pipeline slots that could not be filled with useful work due to problems in the front-end (Front-end Bound)
• Pipeline slots that could not be filled with useful work due to a backup in the back-end (Back-end Bound)
To use Microarchitecture Exploration analysis, first determine which top-level category dominates for
hotspots of interest. You can then dive into the dominating category by expanding its column. There, you can
find many issues that may contribute to that category.
You can also run the Microarchitecture Exploration analysis on other microarchitectures that are NOT covered
with the Top-Down Method in the VTune Profiler:
• Intel Microarchitecture Code Name Sandy Bridge: This microarchitecture is already partially based
on the top-down method and the VTune Profiler provides a hierarchical analysis of the hardware metrics
based on the following categories: Filled Pipeline Slots and Unfilled Pipeline Slots (Stalls).
• Intel Microarchitectures Code Name Nehalem and Westmere: During Microarchitecture Exploration
analysis on these microarchitectures, the VTune Profiler collects metrics that help identify such hardware-
level performance problems as:
• Front End stall and its causes
• Stalls at execution and retirement: particularly those caused by stalls due to the various high latency
loads, wasted work caused by branch misprediction, or long latency instructions.
NOTE
• For a detailed tuning methodology behind the Microarchitecture Exploration analysis and some of
the complexities associated with this analysis, see Understanding How General Exploration Works in
Intel® VTune™ Profiler.
• For architecture-specific Tuning Guides, visit https://software.intel.com/en-us/articles/processor-
specific-performance-analysis-papers.
1. Click the Configure Analysis button on the Intel® VTune™ Profiler toolbar (standalone GUI) or in the Visual Studio IDE.
The Configure Analysis window opens.
2. From the HOW pane, click the Browse button and select Microarchitecture Exploration.
3. Configure the following options:
• Extend granularity for the top-level metrics selection area: By default, VTune Profiler collects data required to compute the top-level metrics (Front-End Bound, Bad Speculation, Memory Bound, Core Bound, and Retiring) and all their sub-metrics. You may limit the data collection by selecting particular top-level metrics. In this case, VTune Profiler extends the level of granularity and collects additional sub-metrics only for the selected top-level metrics. For example, if you select the Memory Bound top-level metric, VTune Profiler collects additional data and provides Memory Bound sub-metrics (such as DRAM Bound, Store Bound, and so on), which helps narrow down the analysis to particular microarchitecture levels. Limiting the amount of data collected simultaneously may also improve profiling accuracy due to less multiplexing. This may be particularly helpful for short-running applications or applications with short phases.
• Evaluate max DRAM bandwidth check box: Evaluate the maximum achievable local DRAM bandwidth before the collection starts. This data is used to scale bandwidth metrics on the timeline and calculate thresholds. The option is enabled by default.
• Collection mode drop-down menu: Choose the Detailed sampling-based collection mode (default) to view a data breakdown per function and other hotspots. Use the Summary counting-based mode for an overview of the whole profiling run. This mode has lower collection overhead and faster post-processing time.
• Details button: Expand/collapse a section listing the default non-editable settings used for this analysis type. If you want to modify or enable additional settings for the analysis, create a custom configuration by copying an existing predefined configuration. VTune Profiler creates an editable copy of this analysis type configuration.
NOTE
• For detailed information on events collected for Microarchitecture Exploration on a particular
microarchitecture, refer to the Intel Processor Event Reference.
• You may generate the command line for this configuration using the Command Line button at the bottom.
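For reference, a collection command line for this analysis typically looks like the sketch below. The analysis type name (uarch-exploration in recent versions) is an assumption that can vary between VTune Profiler versions, so prefer the command produced by the Command Line button or check vtune -help collect:

  vtune -collect uarch-exploration -result-dir r001ue -- ./myApplication

Here r001ue and myApplication are placeholder names for the result directory and the profiled application.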
View Data
To analyze the collected data, use the default Microarchitecture Exploration viewpoint that provides a high-
level performance overview based on the Top-Down Microarchitecture Analysis Method. To better understand where to focus your optimization efforts and which part of the microarchitecture pipeline introduces inefficiencies, start with the Microarchitecture Pipe.
See Also
collect microarchitecture-exploration
vtune option to run the analysis from CLI
Hardware Event-based Sampling Collection
Set Up Project
The four leaf categories serve as high-level performance metrics in the Microarchitecture Exploration
viewpoint.
Each metric is an event ratio defined by Intel architects and has its own predefined threshold. VTune Profiler analyzes a ratio value for each aggregated program unit (for example, function). When this value exceeds the threshold and the program unit accounts for more than 5% of the total collection CPU time, it signals a potential performance problem and highlights such a value in pink.
NOTE
• For a detailed tuning methodology behind the Microarchitecture Exploration analysis and some of
the complexities associated with this analysis, see Understanding How General Exploration Works in
Intel® VTune™ Profiler.
• For architecture-specific Tuning Guides, visit https://software.intel.com/en-us/articles/processor-
specific-performance-analysis-papers.
To interpret the performance data provided during the hardware event-based sampling analysis, you may
follow the steps below:
1. Learn metrics and define a performance baseline.
2. Identify hardware issues.
3. Analyze source.
4. Explore other analysis types/viewpoints.
In the example above, mousing over the L1 Bound metric displays the metric description in the tooltip.
A flagged metric value signals a performance issue for the whole application execution. Mouse over the
flagged value to read the issue description:
You may use the performance issues identified by the VTune Profiler as a baseline for comparison of versions
before and after optimization. Your primary performance indicator is the Elapsed time value.
Grayed out metric values indicate that the data collected for this metric is unreliable. This may happen, for
example, if the number of samples collected for PMU events is too low. In this case, when you hover over
such an unreliable metric value, the VTune Profiler displays a message:
You may either ignore this data, or rerun the collection with the data collection time, sampling interval, or
workload increased.
By default, the VTune Profiler collects Microarchitecture Exploration data in the Detailed mode. In this mode,
all metric names in the Summary view are hyperlinks. Clicking such a hyperlink opens the Bottom-up
window and sorts the data in the grid by the selected metric. The lightweight Summary collection mode is
limited to the Summary view statistics.
In the example above, created on the Intel microarchitecture code name Skylake, VTune Profiler identified the sphere_intersect function as one of the biggest hotspots, consuming a large share of CPU time. VTune Profiler detected that the back-end portion of the pipeline caused the stalls and identified the Memory Bound > L1 Bound issue as the dominant bottleneck: 14.6% of the Clockticks spent in this function were stalled on L1 data cache misses. This means that if you focus on this function hotspot and optimize it, you can potentially gain a ~15% speed-up for this function.
VTune Profiler is able to identify the most common types of pipeline bottlenecks. You may go deeper for more
details. If the deeper levels of the metrics do not display any data, it means that the VTune Profiler cannot
see a dominant bottleneck on the lower level.
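You can also inspect a finalized result from the terminal. The commands below are a sketch; r001ue is a placeholder result directory, and the exact report names and groupings are documented under the vtune report and group-by options:

  vtune -report summary -result-dir r001ue
  vtune -report hotspots -result-dir r001ue -group-by function

The first command prints the Summary statistics, including the top-level metrics; the second lists per-function data similar to the Bottom-up grid.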
Analyze Source
When you have identified a critical function, double-click it to open the Source/Assembly window and analyze the source code.
The Source/Assembly window displays locator metrics that show what code contributed the most to the issue represented by the metric. For example, if the Back-End Bound metric equals 60% for your function, the source view for this function splits the 60% value across the function source lines or instructions to help you identify the source line or instruction that contributes the most to the total 60% Back-End Bound metric.
Use the hotspots navigation toolbar buttons to navigate to the biggest hotspot for each locator metric and
identify the code to optimize.
What's Next
• You may view the collected data using the Hotspots viewpoint or run the Hotspots analysis type. Analyzing
the source and assembly code for the hotspot function in the Hotspots viewpoint helps identify which
instruction contributes most to the poor performance and how much CPU time the hotspot source line
takes. Such a code analysis could be useful for the hotspots that do not show any issues in the sub-
metrics but do show problems at the upper level of metrics (see the example above).
• Run the comparison analysis to understand the performance gain you obtained after your optimization.
• You may create your custom analysis configuration and monitor events you are interested in.
NOTE
• For information on processor events, see the Intel Processor Event Reference.
• Explore tuning recipes for hardware issues in the Performance Analysis Cookbook.
See Also
Analyze Performance
Custom Analysis
Microarchitecture Pipe
Explore the µPipe diagram of the CPU
microarchitecture metrics provided by the Intel®
VTune™ Profiler with the Microarchitecture Exploration
analysis to identify inefficiencies in the CPU utilization.
When your Microarchitecture Exploration analysis result is collected, the VTune Profiler opens the Summary
window that provides an overview of your target app performance based on the Top-down Microarchitecture
Analysis Method (TMA). Treat the diagram as a pipe with an output flow equal to the ratio Actual Instructions Retired / Possible Maximum Instructions Retired (pipe efficiency). If pipeline stalls decrease retiring, the pipe shape narrows.
The µPipe is based on CPU pipeline slots that represent the hardware resources needed to process one micro-operation. Usually several pipeline slots are available each cycle (the pipeline width). If a pipeline slot does not retire, it is considered a stall. The fraction of retired pipeline slots represents CPU microarchitecture efficiency; execution with no stalls on any CPU cycle is considered 100% efficient.
There are usually multiple reasons for stalled pipeline slots. Identifying these reasons and their root causes is the core of CPU microarchitecture performance analysis based on the TMA model.
The µPipe in the Microarchitecture Exploration viewpoint visualizes top-level CPU microarchitecture metrics as
fractions of the overall number of pipeline slots in a pipe form where all the stalls are represented as
obstacles making the pipe narrow.
The pipe is divided into 3 columns and 5 rows where each row represents a pipeline high-level metric:
• Retiring metric (a fraction of retired pipeline slots) in the middle green row represents the efficiency of the pipe and spans all 3 columns.
• Memory Bound metric row above the Retiring metric spans 2 columns.
• Core Bound metric row under the Retiring metric spans 2 columns.
• Front-End Bound metric is the top row.
• Bad Speculation metric row at the bottom may be represented as a drain, indicating wasted CPU work.
The height of the whole pipe is a constant value. The height of every row equals the fraction represented by
the corresponding metric.
Red color signals a potential performance problem. The fraction of green in the diagram helps you estimate how good the execution efficiency is. So, the pipe form clearly represents existing CPU microarchitecture issues and enables you to recognize the following common patterns:
A: No significant issues
B: Memory Bound execution
C: Core Bound execution
D: Front-End Bound execution
E: Bad Speculation issues (for example, branch misprediction)
F: A combination of Memory Bound and Bad Speculation issues
Example 1
This is an example of a pipe representing significant Front-End Bound and Core Bound issues limiting the overall efficiency to 24.4%:
Example 2
This is an example of good CPU execution efficiency with a Front-End issue:
See Also
Instructions Retired Event
Memory Access Analysis for Cache Misses and High Bandwidth Issues
Use the Intel® VTune™ Profiler's Memory Access analysis to identify memory-related issues, like NUMA problems and bandwidth-limited accesses, and attribute performance events to memory objects (data structures). This attribution is enabled by instrumenting memory allocations/deallocations and by obtaining static/global variables from symbol information.
NOTE
Intel® VTune™ Profiler is a new renamed version of the Intel® VTune™ Amplifier.
How It Works
Memory Access analysis type uses hardware event-based sampling to collect data for the following metrics:
• Loads and Stores metrics that show the total number of loads and stores
• LLC Miss Count metric that shows the total number of last-level cache misses
• Local DRAM Access Count metric that shows the total number of LLC misses serviced by the local
memory
• Remote DRAM Access Count metric that shows the number of accesses to the remote socket
memory
• Remote Cache Access Count metric that shows the number of accesses to the remote socket cache
• Memory Bound metric that shows a fraction of cycles spent waiting due to demand load or store
instructions
• L1 Bound metric that shows how often the machine was stalled without missing the L1 data cache
• L2 Bound metric that shows how often the machine was stalled on L2 cache
• L3 Bound metric that shows how often the CPU was stalled on L3 cache, or contended with a sibling
core
• L3 Latency metric that shows a fraction of cycles with demand load accesses that hit the L3 cache
under unloaded scenarios (possibly L3 latency limited)
• NUMA: % of Remote Accesses metric shows percentage of memory requests to remote DRAM. The
lower its value is, the better.
• DRAM Bound metric that shows how often the CPU was stalled on the main memory (DRAM). This
metric enables you to identify DRAM Bandwidth Bound, UPI Utilization Bound issues, as well as
Memory Latency issues with the following metrics:
• Remote / Local DRAM Ratio metric that is defined by the ratio of remote DRAM loads to local DRAM loads
• Local DRAM metric that shows how often the CPU was stalled on loads from the local memory
• Remote DRAM metric that shows how often the CPU was stalled on loads from the remote
memory
• Remote Cache metric that shows how often the CPU was stalled on loads from the remote cache in
other sockets
• Average Latency metric that shows an average load latency in cycles
NOTE
• The list of metrics may vary depending on your microarchitecture.
• The UPI Utilization metric replaced QPI Utilization starting with systems based on Intel
microarchitecture code name Skylake.
Many of the collected events used in the Memory Access analysis are precise. This simplifies understanding
the data access pattern. Off-core traffic is divided into the local DRAM and remote DRAM accesses. Typically,
you should focus on minimizing remote DRAM accesses that usually have a high cost.
1. Click the Configure Analysis button on the Intel® VTune™ Profiler toolbar (standalone GUI) or in the Visual Studio IDE.
2. From the HOW pane, click the Browse button and select Memory Access.
3. Configure the following options:
• Minimal dynamic memory object size to track, in bytes spin box (Linux only): Specify a minimal size of dynamic memory allocations to analyze. This option helps reduce the runtime overhead of the instrumentation. The default value is 1024.
• Evaluate max DRAM bandwidth check box: Evaluate the maximum achievable local DRAM bandwidth before the collection starts. This data is used to scale bandwidth metrics on the timeline and calculate thresholds. The option is enabled by default.
• Analyze OpenMP regions check box: Instrument and analyze OpenMP regions to detect inefficiencies such as imbalance, lock contention, or overhead on performing scheduling, reduction, and atomic operations. The option is disabled by default.
• Details button: Expand/collapse a section listing the default non-editable settings used for this analysis type. If you want to modify or enable additional settings for the analysis, create a custom configuration by copying an existing predefined configuration. VTune Profiler creates an editable copy of this analysis type configuration.
4. Click the Start button to run the analysis.
Limitations:
• Memory objects analysis can be configured for Linux* targets only and only for processors based on Intel
microarchitecture code name Sandy Bridge or later.
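The equivalent collection can also be launched from the command line. This is a sketch: the knob names (analyze-mem-objects, mem-object-size-min-thres) are assumptions that may differ between VTune Profiler versions, so verify them with vtune -help collect memory-access:

  vtune -collect memory-access -knob analyze-mem-objects=true -knob mem-object-size-min-thres=1024 -result-dir r002ma -- ./myApplication

The two knobs correspond to the memory object instrumentation and the minimal dynamic memory object size described above; r002ma and myApplication are placeholder names.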
View Data
For analysis, explore the Memory Usage viewpoint that includes the following windows:
• Summary window displays statistics on the overall application execution, including the application-level
bandwidth utilization histogram.
• Bottom-up window displays performance data per metric for each hotspot object. If you enable the
Analyze memory objects option for data collection, the Bottom-up window also displays memory
allocation call stacks in the grid and Call Stack pane. Use the Memory Object grouping level, preceded by the Function level, to view memory objects as the source location of an allocation call.
• Platform window provides details on tasks specified in your code with the Task API, Ftrace*/Systrace*
event tasks, OpenCL™ API tasks, and so on. If corresponding platform metrics are collected, the Platform
window displays over-time data as GPU usage on a software queue, CPU time usage, OpenCL™ kernels
data, and GPU performance per the Overview group of GPU hardware metrics, Memory Bandwidth, and
CPU Frequency.
Support Limitations
Memory Access analysis is supported on the following platforms:
• 2nd Generation Intel® Core™ processors
• Intel® Xeon® processor families, or later
• 3rd Generation Intel Atom® processor family, or later
If you need to analyze older processors, you can create a custom analysis and choose events related to
memory accesses. However, you will be limited to memory-related events available on those processors. For
information about memory access events per processor, see the VTune Profiler tuning guides.
For dynamic memory object analysis on Linux, the VTune Profiler instruments the following Memory
Allocation APIs:
• standard system memory allocation API: mmap, malloc/free, calloc, and others
• memkind - https://github.com/memkind/memkind
• jemalloc - https://github.com/memkind/jemalloc
• pmdk - https://github.com/pmem/pmdk
See Also
Memory Usage View
collect memory-access
vtune option
Intel Processor Events Reference
CPU Metrics Reference
Sampling Interval
NOTE
The platform diagram is available for:
• All client platforms
• Server platforms based on Intel® microarchitecture code name Skylake, with up to four sockets.
If you selected the Evaluate max DRAM bandwidth option in your analysis configuration, the Platform
Diagram shows the average DRAM utilization. Otherwise, it shows the average DRAM bandwidth.
The Average UPI Utilization metric displays UPI utilization in terms of transmit. Irrespective of the number of UPI links that connect a pair of packages, the Platform Diagram shows a single cross-socket connection. If there are several links, the diagram displays the maximum value.
On top of each socket, the Average Physical Core Utilization metric indicates the utilization of physical
cores by computations of the application under analysis.
Once you examine the topology and utilization information in the diagram, focus on other sections in the
Summary window and then switch to the Bottom-up and Platform windows next.
NOTE
Memory objects identification is supported only for Linux targets and only for processors based on Intel microarchitecture code name Sandy Bridge and later. On Windows*, you can group by Cachelines, see the metrics against the code, and figure out which data structures the code accesses.
For memory objects data, click the Bottom-up tab and select a grouping level containing Memory Object
or Memory Object Allocation Source. The Memory Object granularity groups the data by individual
allocations (call site and size) while Memory Object Allocation Source groups by the place where an
allocation happened.
Only metrics based on DLA-capable hardware events are applicable to the memory objects analysis. For example, the CPU Time metric is based on the non-DLA-capable Clockticks event, so it cannot be applied to memory objects. Examples of applicable metrics are Loads, Stores, LLC Miss Count, and Average Latency.
This histogram shows how much time the system bandwidth was utilized by the selected bandwidth domain
and provides thresholds to categorize bandwidth utilization as High, Medium and Low. By default, for Memory
Analysis results the thresholds are calculated based on the maximum achievable DRAM bandwidth measured
by the VTune Profiler before the collection starts and displayed in the System Bandwidth section of the
Summary window. To enable this functionality for custom analysis results, make sure to select the Evaluate
max DRAM bandwidth option. If this option is not enabled, the thresholds are calculated based on the
maximum bandwidth value collected for this result. You can also set the threshold by moving sliders at the
bottom. The modified values will be applied to all subsequent results in this project.
Explore the table under the histogram to identify which functions were frequently accessed while the
bandwidth utilization for the selected domain was high. Clicking a function from the list opens the Bottom-up
window with the grid automatically grouped by Bandwidth Domain / Bandwidth Utilization Type /
Function / Call Stack and this function highlighted. Under the DRAM, GB/sec > High utilization type, you
can see all functions executing when the system DRAM bandwidth utilization was high. Sort the grid by LLC
Miss Count to see what functions contributed to the high DRAM bandwidth utilization the most:
In addition to identifying bandwidth-limited code, the VTune Profiler provides a workflow to see the
frequently accessed memory objects (variables, data structures, arrays) that had an impact on the high
bandwidth utilization. So, if you enabled the memory object analysis for your target, the Bandwidth
Utilization section includes a table with the top memory objects that were frequently accessed while the
bandwidth utilization for the selected domain was high. Click such an object to switch to the Bottom-up
window with the grid automatically grouped by Bandwidth Domain / Bandwidth Utilization Type /
Memory Object / Allocation Stack and this object highlighted. Under the DRAM > High utilization type,
explore all memory objects that were accessed when the system DRAM bandwidth utilization was high. Sort
the grid by LLC Miss Count to see what memory objects contributed to the high DRAM bandwidth utilization
the most:
Hover over a bar with high bandwidth value to learn how much data was read from or written to DRAM
through the on-chip memory controller. Use time-filtering context menu options to filter in a specific range of
time during which bandwidth is notable. Then, switch to the core-based events that correlate with bandwidth
in the grid below to determine what specific code is inducing all the bandwidth.
NOTE
Interconnect bandwidth analysis is supported by the VTune Profiler for Intel microarchitecture code
name Ivy Bridge EP and later.
Switch to the Bottom-up tab and select the Bandwidth Domain / Bandwidth Utilization type /
Function / Call Stack grouping level. Expand the Interconnect domain grid row and then expand the
High utilization type row to see all functions that were executing when the system Interconnect bandwidth
utilization was high:
You can also select areas with the high Interconnect bandwidth utilization in the Timeline view and filter in by
this selection:
After the filter is applied, the grid view below the Timeline pane shows what was executing during that time
range.
Analyze Source
When you have identified a critical function, double-click it to open the Source/Assembly window and analyze the source code. The Source/Assembly window displays hardware metrics per code line for the selected function.
To view the Source/Assembly data for memory objects:
1. Select the ../Function / Memory Object /.. grouping level (the Function granularity should precede
the Memory Object granularity) in the Bottom-up window.
2. Expand a function and double-click a memory object under this function.
The Source/Assembly window opens displaying metrics per function source lines where accesses to
the selected memory object happened.
NOTE
• For information on processor events, see the Intel Processor Event Reference.
• For information on the performance tuning for HPC-computers using the event-based sampling
collection, see http://software.intel.com/en-US/articles/processor-specific-performance-analysis-
papers/.
• For information on performance improvement opportunities with NUMA hardware, see https://
software.intel.com/en-us/articles/performance-improvement-opportunities-with-numa-hardware.
See Also
Source Code Analysis
Threading Analysis
Use the Threading analysis to identify how efficiently an application uses available processor compute cores and to explore inefficiencies in threading runtime usage or contention on threading synchronization that makes threads wait and prevents effective processor utilization.
NOTE
• Threading analysis combines and replaces the Concurrency and Locks and Waits analysis types
available in previous versions of Intel® VTune™ Profiler.
• Intel® VTune™ Profiler is a new renamed version of the Intel® VTune™ Amplifier.
Intel® VTune™ Profiler uses the Effective CPU Utilization metric as a main measurement of threading
efficiency. The metric is built on how an application utilizes the available logical cores. For throughput
computing, it is typical to load one logical core per physical core.
The following aspects of Threading Analysis provide possible reasons for poor CPU utilization:
• Thread count: a quick glance at the application thread count can give clues to threading inefficiencies,
such as a fixed number of threads that might prevent the application from scaling to a larger number of
cores or lead to thread oversubscription
• Wait time (trace-based or context switch-based): analyze threads waiting on synchronization objects or
I/O
• Spin and overhead time: estimate threading runtime overhead or the impact of spin waits (busy or active
waits)
The Threading Analysis provides two collection modes with major differences in thread wait time collection
and interpretation:
• User-Mode Sampling and Tracing, which can recognize synchronization objects and collect thread wait
time by objects using tracing. This is helpful in understanding thread interaction semantics and making
optimization changes based on that data. There are two groups of synchronization objects supported by
Intel VTune Profiler: objects usually used for synchronization between threads (such as Mutex or
Semaphore) and objects associated with waits on I/O operations (such as Stream).
• Hardware Event-Based Sampling and Context Switches, which collects thread inactive wait time based on context switch information. Even though synchronization objects are not identified in this mode, you can find the problematic synchronization functions by using the wait time attributed to call stacks, with lower overhead than the previous collection mode. The analysis based on context switches also shows thread preemption time, which is useful in measuring the impact of thread oversubscription on a system.
Since context switch information is collected with call stacks, it is possible to explore the reasons for Inactive Wait Time by wait functions with their call paths. The Hardware Event-Based Sampling and Context Switches mode shows the places in the code where the wait was induced by a synchronization object or I/O operation.
The Hardware Event-Based Sampling and Context Switches mode is based on the hardware event-based sampling collection and analyzes all the processes running on your system at the moment, providing context switching data on whole-system performance. On Linux* systems, Inactive Wait Time collection is available with the driverless Perf*-based collection on kernel version 4.4 or later. Inactive wait reasons are available on kernel 4.17 and later.
NOTE
On 32-bit Linux* systems, the VTune Profiler uses a driverless Perf*-based collection for the hardware
event-based sampling mode.
1. Click the Configure Analysis button on the Intel® VTune™ Profiler toolbar (standalone GUI) or in the Visual Studio IDE.
The Configure Analysis window opens.
2. From the HOW pane, click the Browse button and select Threading.
3. Configure the collection options:
• User-Mode Sampling and Tracing mode: Select to enable the user-mode sampling and tracing collection for synchronization object analysis. This collection mode uses a fixed sampling interval of 10 ms. If you need to change the interval, click the Copy button and create a custom analysis configuration. For intervals less than 10 ms, use the Hardware Event-Based Sampling and Context Switches mode.
• Hardware Event-Based Sampling and Context Switches mode: Select to enable hardware event-based sampling and context switches collection. You can configure the CPU sampling interval, ms option to specify an interval (in milliseconds) between CPU samples. Possible values for the hardware event-based sampling mode are 0.01-1000. 1 ms is used by default.
NOTE
When changing collection options, pay attention to the Overhead diagram on the right. It dynamically changes to reflect the collection overhead incurred by the selected options.
• Details button: Expand/collapse a section listing the default non-editable settings used for this analysis type. If you want to modify or enable additional settings for the analysis, create a custom configuration by copying an existing predefined configuration. VTune Profiler creates an editable copy of this analysis type configuration.
NOTE
To run Threading Analysis from the command line for this configuration, use the Command
Line button at the bottom.
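For reference, a command line equivalent of this configuration is sketched below. The sampling-and-waits knob name and its hw/sw values are assumptions that may differ between versions; prefer the command produced by the Command Line button:

  vtune -collect threading -knob sampling-and-waits=hw -result-dir r003tr -- ./myApplication

Here hw selects the Hardware Event-Based Sampling and Context Switches mode and sw selects User-Mode Sampling and Tracing; r003tr and myApplication are placeholder names.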
View Data
The Threading analysis results appear in the Threading Efficiency viewpoint, which consists of the following
windows/panes:
• Summary window displays statistics on the overall application execution, identifying CPU time and
processor utilization.
• Bottom-up window displays hotspot functions in the bottom-up tree, CPU time and CPU utilization per
function.
• Top-down Tree window displays hotspot functions in the call tree, performance metrics for a function only
(Self value) and for a function and its children together (Total value).
• Caller/Callee window displays parent and child functions of the selected focus function.
• Platform window provides details on CPU and GPU utilization, frame rate, memory bandwidth, and user
tasks (if corresponding metrics are collected).
What's Next
1. Start on the result Summary window to explore the Effective CPU utilization of your application and
identify reasons for underutilization connected with synchronization, parallel work arrangement
overhead, or incorrect thread count. Click links associated with flagged issues to be taken to more
detailed information. For example, clicking a sync object name in the Top Waiting Objects table takes
you to that object in the Bottom-up window.
2. Analyze thread interaction on synchronization objects with wait and signal stacks and transitions on the timeline. Explore CPU time spent in threading runtimes to classify inefficiencies in their use.
3. Modify your code to remove CPU utilization bottlenecks and improve the parallelism of your application.
Concentrate your tuning on objects with long Wait time where the system is poorly utilized (red bars)
during the wait. Consider adding parallelism, rebalancing, or reducing contention. Ideal utilization
(green bars) occurs when the number of running threads equals the number of available logical cores.
4. Re-run the analysis to verify your optimization with the comparison mode and identify more possible
areas for improvement.
For more information and interpretation tips, see Threading Efficiency View.
See Also
Threading Efficiency View
collect threading
vtune option
HPC Performance Characterization Analysis
1. Define a performance baseline
2. Examine wait time, spin and overhead time, and thread count metrics
3. Review the timeline
4. Analyze the application source code
5. Explore other analysis types for further diagnosis and optimization
Explore the Spin Time, Overhead Time, Wait Time, and Total Thread Count to identify the main cause
of performance issues.
Wait Time
A high thread wait time can cause poor CPU utilization. One common problem in parallel applications is
threads waiting too long on synchronization objects that are on the critical path of application execution (for
example, locks). Parallel performance suffers when waits occur while cores are under-utilized. Threading
analysis helps to analyze thread wait time and find synchronization bottlenecks.
Explore the Bottom-up window to identify the most performance-critical synchronization objects. Note that some uninteresting threads may wait for a long time on objects that are used infrequently. Focus your tuning efforts on the waits with both high Wait Time and Wait Count values, especially if they have poor utilization/concurrency.
By default, the synchronization objects are sorted by Wait time. You can view the time distribution per
utilization level by clicking the button at the Wait Time by Utilization column header to expand the
column.
To identify the highest contributing stack for the synchronization objects selected in the Bottom-up or Top-
down Tree panes, use the navigation buttons on the stack pane. The contribution bar shows the
contribution of the currently visible stack to the overall time spent by the selected synchronization objects.
You can also use the drop-down list in the Call Stack pane to view data for different types of stacks.
You should try to eliminate or minimize the Wait Time for the synchronization objects with the highest Wait
Time (or longest red bars, if the bar format is selected) and Wait Count values.
In Hardware Event-based Sampling and Context Switches mode, sort functions by Inactive Sync Wait Time. Use the Caller/Callee pane to figure out the call sites in the application that call a wait function with high Inactive Sync Wait Time.
NOTE
The spin time shown in the Spin and Overhead Time section might be included in the wait time based on user-level sampling and tracing.
Thread Count
Threading analysis shows the time an application spends in oversubscription by flagging when the application runs more threads than the number of logical cores on the machine. Running an excessive number of threads can increase CPU time because some of the threads may be waiting on others to complete, or time may be wasted on context switches. Another common issue is running with a fixed number of threads,
which can cause performance degradation when running on a platform with a different number of cores. For
example, running with a significantly lower number of threads than the number of cores available can cause
higher application elapsed time.
Use the Total Thread Count metric available on the Summary window to determine if your application has
thread oversubscription or could benefit from increased threading.
In Hardware Event-based Sampling and Context Switches mode, use the Preemption Wait Time metric to estimate the impact of oversubscription. The higher the metric value on worker threads, the higher the impact of oversubscription on the application performance. Note that thread preemption can also be triggered by a conflict with other applications or kernel threads running on a system.
To understand what your application was doing during a particular time frame, select this range on the
timeline, right-click and choose Zoom In and Filter In by Selection. VTune Profiler will display functions or
sync objects used during this time range.
For User-mode Sampling and Tracing collection mode, select the Transitions option on the timeline to
explore thread interactions.
For Hardware Event-based Sampling and Context Switches mode, the timeline is helpful in exploring inactive
waits. Select an inactive time area on the timeline to display the wait stack on the stack pane that
corresponds to the context switch.
Analyze Source
Double-click the hottest synchronization object (with the highest Wait Time and Wait Count values) to view
its related source code file in the Source/Assembly window. From the Timeline pane, you can double-click
the transition line to open the call site for this transition. You can open the code editor directly from the
VTune Profiler and edit your code.
See Also
Analyze Performance
View Stacks
How It Works
The HPC Performance Characterization analysis type can be used as a starting point for understanding the
performance aspects of your application. Additional scalability metrics are available for applications that use
Intel OpenMP* or Intel MPI runtime libraries.
During HPC Performance Characterization analysis, the Intel® VTune™ Profiler data collector profiles your
application using event-based sampling collection. OpenMP analysis metrics for Intel OpenMP runtime library
are based on User API instrumentation enabled in the runtime library.
Typically the collector will gather data for a specified application, but it can collect system-wide performance
data with limited detail if required.
NOTE
Vectorization and GFLOPS metrics are supported on Intel® microarchitectures formerly code named Ivy
Bridge, Broadwell, and Skylake. Limited support is available for Intel® Xeon Phi™ processors formerly
code named Knights Landing. The metrics are not currently available on 4th Generation Intel
processors. Expand the Details section on the analysis configuration pane to view the processor
family available on your system.
The analysis can be run from within the VTune Profiler GUI or from the command line.
NOTE
Intel® VTune™ Profiler is a new renamed version of the Intel® VTune™ Amplifier.
1. Click the Configure Analysis button on the Intel® VTune™ Profiler toolbar (standalone GUI) or in the Visual Studio IDE.
2. From the HOW pane, click the Browse button and select HPC Performance Characterization.
3. Configure the following options:
• Collect stacks check box: Enable advanced collection of call stacks and thread context switches. The option is disabled by default.
• Evaluate max DRAM bandwidth check box: Evaluate the maximum achievable local DRAM bandwidth before the collection starts. This data is used to scale bandwidth metrics on the timeline and calculate thresholds. The option is enabled by default.
• Analyze OpenMP regions check box: Instrument and analyze OpenMP regions to detect inefficiencies such as imbalance, lock contention, or overhead on performing scheduling, reduction, and atomic operations. The option is enabled by default.
• Details button: Expand/collapse a section listing the default non-editable settings used for this analysis type. If you want to modify or enable additional settings for the analysis, create a custom configuration by copying an existing predefined configuration. VTune Profiler creates an editable copy of this analysis type configuration.
NOTE
You may generate the command line for this configuration using the Command Line button at
the bottom.
4. Click the Start button to run the analysis.
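A command line equivalent of this configuration is sketched below; the enable-stack-collection knob name is an assumption that may differ between versions, so prefer the output of the Command Line button:

  vtune -collect hpc-performance -knob enable-stack-collection=true -result-dir r004hpc -- ./myApplication

r004hpc and myApplication are placeholder names for the result directory and the profiled application.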
View Data
Use the HPC Performance Characterization viewpoint to review the following:
• Effective Physical Core Utilization: Explore application parallel efficiency by looking at physical core
utilization by the application code execution. Look for scalability problems involving the use of serial time
versus parallel time, tuning potential for OpenMP regions, and MPI imbalance.
• Memory Bound: Evaluate whether the application is memory bound. To understand deeper problems, run
the Memory Access Analysis to identify specific memory objects causing issues.
• Vectorization: Determine if floating-point loops are bandwidth bound or vectorized. For bandwidth bound
loops/functions, run the Memory Access Analysis to reduce bandwidth consumption. For vectorization
optimization opportunities, use the Intel Advisor to run a vectorization analysis.
• Intel® Omni-Path Fabric Usage: Identify performance bottlenecks caused by reaching the interconnect
limits.
Use the Analyzing an OpenMP* and MPI Application tutorial to review basic steps for tuning a hybrid
application. The tutorial is available from the Intel Developer Zone at https://software.intel.com/en-us/itac-
vtune-mpi-openmp-tutorial-lin.
See Also
HPC Performance Characterization View
Tip
Use the Analyzing an OpenMP* and MPI Application tutorial to review basic steps for tuning a hybrid
application. The tutorial is available from the Intel Developer Zone at https://software.intel.com/en-
us/itac-vtune-mpi-openmp-tutorial-lin. You can also find a webinar about HPC Performance
Characterization analysis at https://software.intel.com/en-us/videos/hpc-applications-need-high-
performance-analysis.
Use the Elapsed Time and GFLOPS values as a baseline for comparison of versions before and after
optimization.
Sub-optimal application topology can result in induced DRAM and Intel® QuickPath Interconnect (Intel® QPI)
or Intel® Ultra Path Interconnect (Intel® UPI) cross-socket traffic. These incidents can limit performance.
NOTE
The platform diagram is available for:
• All client platforms.
• Server platforms based on Intel® microarchitecture code name Skylake, with up to four sockets.
If you selected the Evaluate max DRAM bandwidth option in your analysis configuration, the Platform
Diagram shows the average DRAM utilization. Otherwise, it shows the average DRAM bandwidth.
The Average UPI Utilization metric displays UPI utilization in terms of transmit. Irrespective of the number of UPI links that connect a pair of packages, the Platform Diagram shows a single cross-socket connection. If there are several links, the diagram displays the maximum value.
On top of each socket, the Average Physical Core Utilization metric indicates the utilization of physical
cores by computations of the application under analysis.
Once you examine the topology and utilization information in the diagram, focus on other sections in the
Summary window and then switch to the Bottom-up window.
CPU Utilization
• Explore the Effective Physical Core Utilization metric as a measure of the parallel efficiency of the
application. A value of 100% means that the application code execution uses all available physical cores.
If the value is less than 100%, it is worth looking at the second level metrics to discover reasons for
parallel inefficiency.
• Learn about opportunities to use the logical cores. In some cases, using logical cores leads to application
concurrency increases and overall performance improvements.
• For some Intel® processors, such as Intel® Xeon Phi™ or Intel Atom®, or systems where Intel Hyper-
Threading Technology (Intel HT Technology) is OFF or absent, the metric breakdown between physical and
logical core utilization is not available. In these cases, a single Effective CPU Utilization metric is
displayed to show parallel execution efficiency.
• For applications that do not use OpenMP or MPI runtime libraries:
• Review the Effective CPU Utilization Histogram, which displays the Elapsed Time of your
application, broken down by CPU utilization levels.
• Use the data in the Bottom-up and Top-down Tree windows to identify the most time-consuming
functions in your application by CPU utilization. Focus on the functions with the largest CPU time and
low CPU utilization level as your candidates for optimization (for example, parallelization).
• For applications with Intel OpenMP*:
• Compare the serial time to the parallel region time. If the serial portion is significant, consider options
to minimize serial execution, either by introducing more parallelism or by doing algorithm or
microarchitecture tuning for sections that seem unavoidably serial. For high thread-count machines,
serial sections have a severe negative impact on potential scaling (Amdahl's Law) and should be
minimized as much as possible. Look at serial hotspots to define candidates for further parallelization.
• Review the OpenMP Potential Gain to estimate the efficiency of OpenMP parallelization in the parallel part of the code. The Potential Gain metric estimates the difference in elapsed time between the actual measurement and an idealized execution of parallel regions, assuming perfectly balanced threads and zero overhead of the OpenMP runtime on work arrangement. Use this data to understand the maximum time that you may save by improving OpenMP parallelism. If the Potential Gain for a region is significant, you can go deeper and select the link on a region name to navigate to the Bottom-up window with an OpenMP Region dominant grouping and the region of interest selected.
• Consider running Threading analysis when there are multiple locks used in one parallel construct to
find the performance impact of a particular lock.
• For MPI applications:
Review the MPI Imbalance metric that shows the CPU time spent by ranks spinning in waits on communication operations, normalized by the number of ranks on the profiling node. Issue detection for this metric is based on the minimal MPI Busy Wait time across ranks. If the minimal MPI Busy Wait time across ranks is not significant, then the rank with the minimal time most likely lies on the critical path of application execution. In this case, review the CPU utilization metrics for this rank.
The sub-section MPI Rank on Critical Path shows OpenMP efficiency metrics like Serial Time (outside of any OpenMP region), Parallel Region time, and OpenMP Potential Gain. If the minimal MPI Busy Wait time is significant, it can be a result of a suboptimal communication schema between ranks or an imbalance triggered by another node. In this case, use Intel® Trace Analyzer and Collector for in-depth analysis of the communication schema.
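For MPI applications, the collection is typically launched under the MPI launcher so that the ranks on the profiling node are analyzed. The sketch below assumes a generic mpirun launcher; the -trace-mpi option helps group data by node for launchers other than Intel MPI and may not be required in your environment:

  mpirun -n 16 vtune -collect hpc-performance -trace-mpi -result-dir r005hpc -- ./myMpiApplication

This typically produces a result directory per node that you can open in VTune Profiler; r005hpc and myMpiApplication are placeholder names.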
GPU Utilization
GPU utilization metrics display when:
• Your application makes use of a GPU.
• Your system is configured to collect GPU data. See Set Up System for GPU Analysis.
Under Elapsed Time, the GPU section presents an overview of how your application offloads work to the
GPU.
• The Time metric indicates if the GPU was idle at any point during data collection. A value of 100% implies
that your application offloaded work to the GPU throughout the duration of data collection. Anything lower
presents an opportunity to improve GPU utilization.
• The IPC Rate metric indicates the average number of instructions per cycle processed by the two FPU pipelines of Intel® Integrated Graphics. To have your workload fully utilize the floating-point capability of the GPU, the IPC Rate should be closer to 2.
Next, look into GPU Utilization when Busy. This section can help you understand if your workload can use
the GPU more efficiently.
Ideally, your GPU utilization should be 100%. If GPU Utilization when Busy is <100%, there were cycles
where the GPU was stalled or idle.
• EU State breaks down the activity of GPU execution units. Check here to see if they were stalled or idle
when processing your workload.
• Occupancy is a measure of the efficiency of GPU thread scheduling. A value below 100% suggests that you tune the sizes of the work items in your workload. Consider running the GPU Offload Analysis, which provides insight into computing tasks running on the GPU as well as additional GPU-related performance metrics.
If your application offloads code via Intel OpenMP*, check the Offload Time section:
• The Offload Time metric displays the total duration of the OpenMP offload regions in your workload. If
Offload Time is below 100%, consider offloading more code to the GPU.
• The Compute, Data Transfer, and Overhead metrics help you understand what constitutes the Offload
Time. Ideally, the Compute portion should be 100%. If the Data Transfer component is significant, try
to transfer less data between the host and the GPU.
In the Top OpenMP Offload Regions section, review the breakdown of offload and GPU metrics by OpenMP
offload region. Focus on regions that take up a significant portion of the Offload Time.
The names of the OpenMP offload regions use this format:
<func_name>$omp$target$region:dvc=<device_number>@<file_name>:<line_number>
where:
• func_name is the name of the source function where the OpenMP target directive is declared.
• device_number is the internal OpenMP device number where the offload was targeted.
• file_name and line_number constitute the source location of the OpenMP target directive.
When you compile your OpenMP application, the func_name, file_name, and line_number fields require you
to pass debug information options to the Intel Compiler. If debug information is absent, these fields get
default values.
• line_number: enable with -g (Linux OS) or /Zi (Windows OS); default value if debug information is absent: 0
• func_name: enable with -g (Linux OS) or /Zi (Windows OS); default value if debug information is absent: unknown
• file_name: enable with -g -mllvm -parallel-source-info=2 (Linux OS) or /Zi -mllvm -parallel-source-info=2 (Windows OS); default value if debug information is absent: unknown
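For example, a Linux compile line that provides all three fields might look like the sketch below; the icpx driver and the spir64 offload target are assumptions about your toolchain, while the -g and -mllvm -parallel-source-info=2 options come from the table above:

  icpx -fiopenmp -fopenmp-targets=spir64 -g -mllvm -parallel-source-info=2 offload_sample.cpp -o offload_sample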
For applications that use OpenMP offload, the Bottom-up window displays additional information.
• Group by OpenMP Offload Region. In this grouping, the grid displays:
• OpenMP Offload Time metrics
• Instance Count
• GPU metrics
• The timeline view displays ruler markers that indicate the span of OpenMP Offload Regions and
OpenMP Offload Operations within those regions.
Memory Bound
• A high Memory Bound value might indicate that a significant portion of execution time was lost while
fetching data. The section shows a fraction of cycles that were lost in stalls being served in different cache
hierarchy levels (L1, L2, L3) or fetching data from DRAM. For last level cache misses that lead to DRAM, it
is important to distinguish if the stalls were because of a memory bandwidth limit since they can require
specific optimization techniques when compared to latency bound stalls. VTune Profiler shows a hint about
identifying this issue in the DRAM Bound metric issue description. This section also offers the percentage
of accesses to a remote socket compared to a local socket to see if memory stalls can be connected with
NUMA issues.
• For Intel® Xeon Phi™ processors formerly code named Knights Landing, there is no way to measure memory stalls directly to assess memory access efficiency in general. Therefore, Back-End Bound stalls, which include memory-related stalls, are shown instead as a high-level characterization metric. The second-level metrics focus particularly on memory access efficiency.
• A high L2 Hit Bound or L2 Miss Bound value indicates that a high ratio of cycles were spent handling L2 hits or misses.
• The L2 Miss Bound metric does not take into account data brought into the L2 cache by the hardware
prefetcher. However, in some cases the hardware prefetcher can generate significant DRAM/MCDRAM
traffic and saturate the bandwidth. The Demand Misses and HW Prefetcher metrics show the
percentages of all L2 cache input requests that are caused by demand loads or the hardware
prefetcher.
• A high DRAM Bandwidth Bound or MCDRAM Bandwidth Bound value indicates that a large
percentage of the overall elapsed time was spent with high bandwidth utilization. A high DRAM
Bandwidth Bound value is an opportunity to run the Memory Access analysis to identify data
structures that can be allocated in high bandwidth memory (MCDRAM), if it is available.
• The Bandwidth Utilization Histogram shows how much time the system bandwidth was utilized by a
certain value (Bandwidth Domain) and provides thresholds to categorize bandwidth utilization as High,
Medium and Low. The thresholds are calculated based on benchmarks that calculate the maximum value.
You can also set the threshold by moving sliders at the bottom of the histogram. The modified values are
applied to all subsequent results in the project.
• Switch to the Bottom-up window and review the Memory Bound columns in the grid to determine
optimization opportunities.
• If your application is memory bound, consider running a Memory Access analysis for deeper metrics and
the ability to correlate these metrics with memory objects.
Vectorization
NOTE
Vectorization and GFLOPS metrics are supported on Intel® microarchitectures formerly code named Ivy
Bridge, Broadwell, and Skylake. Limited support is available for Intel® Xeon Phi™ processors formerly
code named Knights Landing. The metrics are not currently available on 4th Generation Intel
processors. Expand the Details section on the analysis configuration pane to view the processor
family available on your system.
• The Vectorization metric represents the percentage of packed (vectorized) floating point operations. 0%
means that the code is fully scalar while 100% means the code is fully vectorized. The metric does not
take into account the actual vector length used by the code for vector instructions. As a result, if the code
is fully vectorized and uses a legacy instruction set that loaded only half a vector length, the Vectorization
metric still shows 100%.
Low vectorization means that a significant fraction of floating point operations are not vectorized. Use
Intel® Advisor to understand possible reasons why the code was not vectorized.
The second-level metrics allow for rough estimates of the amount of floating point work at a particular precision and show the actual vector length of vector instructions at that precision. A partial vector length can indicate legacy instruction set usage and an opportunity to recompile the code with a modern instruction set, which can lead to additional performance improvement. Relevant metrics might include:
• Instruction Mix
• FP Arithmetic Instructions per Memory Read or Write
• The Top Loops/Functions with FPU Usage by CPU Time table shows the top functions that contain
floating point operations sorted by CPU time and allows for a quick estimate of the fraction of vectorized
code, the vector instruction set used in the loop/function, and the loop type.
• For Intel® Xeon Phi™ processors (formerly code named Knights Landing), the following FPU metrics are
available instead of FLOP counters:
• SIMD Instructions per Cycle
• Fraction of packed SIMD instructions versus scalar SIMD Instructions per cycle
• Vector instructions for loops set based on static analysis
• Outgoing and Incoming Bandwidth Bound metrics show the percentage of elapsed time that the application spent in communication close to or reaching the interconnect bandwidth limit.
• Bandwidth Utilization Histogram shows how much time the interconnect bandwidth was utilized by a certain value (Bandwidth Domain) and provides thresholds to categorize bandwidth utilization as High, Medium, and Low.
• Outgoing and Incoming Packet Rate metrics show the percentage of elapsed time that the application spent in communication close to or reaching the interconnect packet rate limit.
• Packet Rate Histogram shows how much time the interconnect packet rate reached a certain value and provides thresholds to categorize the packet rate as High, Medium, and Low.
3. Analyze Source
Double-click the function you want to optimize to view its related source code file in the Source/Assembly
window. You can open the code editor directly from the Intel® VTune™ Profiler and edit your code (for
example, minimizing the number of calls to the hotspot function).
A preview HTML report is available to see process/thread affinity along with thread CPU execution and
remote accesses. Use the following command to generate the preview HTML report:
NOTE
This is a PREVIEW FEATURE. A preview feature may or may not appear in a future production
release. It is available for your use in the hopes that you will provide feedback on its usefulness and
help determine its future. Data collected with a preview feature is not guaranteed to be backward
compatible with future releases.
See Also
Analyze Performance
Viewing Source
NOTE
The full set of Input and Output analysis metrics is available on Intel® Xeon® processors only.
NOTE
On FreeBSD systems, the graphical user interface of VTune Profiler is not supported. You can still
configure and run the analysis from a Linux* or Windows* system using remote SSH capabilities, or
collect the result locally from the CLI. For more information on available options, see FreeBSD Targets.
Platform-Level Metrics
To collect hardware event-based metrics, either load the Intel sampling driver or configure driverless
hardware event collection (Linux targets only).
IO Analysis Features | Configuration Check Box | Prerequisites/Applicability
OS- and API-Level Metrics
IO Analysis Configuration Check Box: Prerequisites/Applicability

DPDK
• Make sure DPDK is built with VTune Profiler support enabled.
• When profiling DPDK as an FD.io VPP plugin, modify the DPDK_MESON_ARGS variable in build/external/packages/dpdk.mk with the same flags as described in the Profiling with VTune section.
• Not available for FreeBSD targets. Not available in system-wide mode.

SPDK
• Make sure SPDK is built using the --with-vtune advanced build option.
• When profiling in Attach to Process mode, make sure to set up the environment variables before launching the application.
• Not available in Profile System mode.

Kernel I/O
• To collect these metrics, VTune Profiler enables Ftrace* collection, which requires access to debugfs. On some systems, this requires that you reconfigure your permissions with the prepare_debugfs.sh script located in the bin directory, or use root privileges (see the sketch after this table).
• Not available for FreeBSD targets.
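If debugfs access is not yet configured for the Kernel I/O metrics, a minimal sketch of running the permissions script with root privileges follows; the placeholder path is an assumption, so adjust it to your VTune Profiler installation:
$ sudo <vtune-install-dir>/bin/prepare_debugfs.sh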
Analyze Platform Performance Understand the platform-level metrics provided by the Input and
Output analysis of Intel® VTune™ Profiler.
Analyze DPDK Applications Use the Input and Output analysis of Intel® VTune™ Profiler to profile
DPDK applications and collect batching statistics for polling threads performing Rx and event
dequeue operations.
Analyze SPDK Applications Use the Input and Output analysis of Intel® VTune™ Profiler to profile
SPDK applications and estimate SPDK Effective Time and SPDK Latency, and identify under-
utilized throughput of an SPDK device.
Analyze Linux Kernel I/O Use the Input and Output analysis of Intel® VTune™ Profiler to match
user-level code to I/O operations executed by the hardware.
io Command Line Analysis
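For reference, a minimal command-line sketch of the io analysis linked above; the application name is illustrative:
$ vtune -collect io -- ./my_app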
Example of a Platform Diagram for a single-socket server with 8 active NVMe SSDs, a network interface
card, and persistent memory:
NOTE
The Platform Diagram is available starting with server platforms based on Intel® microarchitecture
code named Skylake, with up to four sockets.
I/O devices are shown with short names that indicate the PCIe bus and device numbers. Full device name,
link capabilities, and status are shown in the device tooltip. Hover over the device image to see detailed
device information.
The Platform Diagram highlights device status issues that may be a reason for limited throughput. A common
issue is that the configured link speed/width does not match the maximum speed/width of the device.
When device capabilities are known and the maximum physical bandwidth can be calculated, the device link
is attributed with the Effective Link Utilization metric that represents the ratio of bandwidth consumed on
data transfers to the available physical bandwidth. This metric does not account for protocol overhead (TLP
headers, DLLPs, physical encoding) and reflects link utilization in terms of payloads. Thus, it cannot reach
100%. However, this metric can indicate how far the link is from saturation. The maximum theoretical
bandwidth is calculated from the device link capabilities shown in the device tooltip.
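For example, with illustrative numbers, a device connected over a PCIe Gen3 x8 link has a raw rate of 8 GT/s × 8 lanes, or roughly 8 GB/s per direction, so 4 GB/s of payload traffic corresponds to an Effective Link Utilization of about 50%.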
The Platform Diagram shows the Average DRAM Utilization when the Evaluate max DRAM bandwidth
checkbox is selected in the analysis configuration. Otherwise, it shows the average DRAM bandwidth.
If the system is equipped with persistent memory, the Platform Diagram shows the Average Persistent
Memory Bandwidth.
The Average UPI Utilization metric reveals UPI utilization in terms of transmit. The Platform Diagram
shows a single cross-socket connection, regardless of how many UPI links connect a pair of packages. If
there is more than one link, the maximum value is shown.
The Average Physical Core Utilization metric, displayed on top of each socket, indicates the utilization of
physical cores by computations of the application being analyzed.
Once you examine topology and utilization, drill down into the details to investigate platform performance.
• Outbound PCIe Bandwidth is induced by core transactions targeting the memory or registers of the I/O
device. Typically, the core accesses the device memory through the Memory-Mapped I/O (MMIO) address
space.
• Outbound PCIe Read — the core reads from the registers of the device.
• Outbound PCIe Write — the core writes to the registers of the device.
NOTE
• The Inbound PCIe Bandwidth metrics are only available for server platforms based on Intel®
microarchitecture code named Sandy Bridge EP and newer.
• The Outbound PCIe Bandwidth metrics are only available for server platforms based on Intel®
microarchitecture code named Haswell EP and newer.
The granularity of Inbound and Outbound PCIe Bandwidth metrics depends on CPU model, collector
used, and user privileges. For details, see the Platform-Level Metrics table.
You can analyze the Inbound and Outbound PCIe Bandwidth over time on a per-device basis using the
timeline in the Bottom-up or the Platform tabs:
The L3 Hit/Miss Ratios for Inbound I/O requests reflect the proportions of requests made by I/O devices
to the system memory that hit/miss the L3 cache. For a detailed explanation of Intel® DDIO utilization
efficiency, see the Effective Utilization of Intel® Data Direct I/O Technology Cookbook recipe.
NOTE
L3 Hit/Miss metrics are available for Intel® Xeon® processors code named Haswell and newer.
The Average Latency metric of the Inbound PCIe read/write groups shows an average amount of time
the platform spends on processing inbound read/write requests for a single cache line.
The CPU/IO conflicts ratio shows a portion of Inbound I/O write requests that experienced contention for a
cache line between the IO controller and some other agent on the CPU, which can be a core or another IO
controller. These conflicts are caused by the simultaneous access to the same cache line. Under certain
conditions, such access may cause the IO controller to lose ownership of this cache line. This forces the IO
controller to reacquire the ownership of this cache line. Such issues can occur in applications that use the
polling communication model, resulting in suboptimal throughput and latency. To resolve this, consider tuning
the Snoop Response Hold Off option of the Integrated IO configuration of UEFI/BIOS (option name may
vary depending on platform manufacturer).
NOTE
Average Latency for inbound I/O reads/writes and CPU/IO Conflicts metrics are available on Intel®
Xeon® processors code named Skylake and newer.
The granularity of DDIO efficiency metrics—second-level metrics for Inbound I/O bandwidth—depends on
CPU model, collector used, and user privileges. For details, see the Platform-Level Metrics table.
You can get a per-device breakdown for Inbound and Outbound Traffic, Inbound request L3 hits and
misses, Average latencies, and CPU/IO Conflicts using the Bottom-up pane with the Package /
M2PCIe or Package / IO Unit grouping:
NOTE
Intel VT-d metrics are available starting with server platforms based on Intel® microarchitecture code
named Ice Lake.
The top-level metric shows the average total Address Translation Rate.
The IOTLB (I/O Translation Lookaside Buffer) is an address translation cache in the remapping hardware unit
that caches effective translations from virtual addresses, used by devices, to host physical addresses. IOTLB
lookups happen on address translation requests. The IOTLB Hit and IOTLB Miss metrics reflect the ratios of
address translation requests hitting and missing the IOTLB.
The next-level metrics for IOTLB misses are:
• Average IOTLB Miss Penalty, ns — average amount of time spent on handling an IOTLB miss. Includes
looking up the context cache, intermediate page table caches and page table reads (page walks) on a
miss, which turn into memory read requests.
• Memory Accesses Per IOTLB Miss — average number of memory read requests (page walks) per
IOTLB miss.
The granularity of Intel VT-d metrics depends on CPU model, collector used, and user privileges. For details,
see the Platform-Level Metrics table. When prerequisites are met, Intel VT-d metrics can be viewed per sets
of I/O devices—PCIe devices and/or integrated accelerators. Each set includes all devices handled by the
single I/O controller, which commonly serves 16 PCIe lanes. Switch to the Bottom-up window and use
Package / IO Unit grouping:
Use the Bottom-up pane to locate sources of memory-mapped PCIe device accesses. Explore the call stacks
and drill down to source and assembly view:
Double-click the function name to drill down into the source or assembly view and locate the code responsible
for MMIO reads and writes at the source line level:
NOTE
MMIO access data is collected when the Locate MMIO accesses check box is selected. However,
there are some limitations:
• This feature is only available starting with server platforms based on the Intel® microarchitecture
code name Skylake.
• Only Attach to Process and Launch Application collection modes are supported. When running
in the Profile System mode, this option only reveals functions performing reads from uncacheable
memory.
VTune Profiler provides per-channel breakdown for DRAM and PMEM bandwidth:
NOTE
To profile a DPDK application using VTune Profiler, make sure DPDK is built with VTune Profiler options
enabled. See the DPDK guide for more information.
When profiling DPDK as an FD.io VPP plugin, modify the DPDK_MESON_ARGS variable in build/external/
packages/dpdk.mk with the same flags as described in the Profiling with VTune section.
DPDK statistics collection is not supported for FreeBSD* targets and is not available in Profile System mode.
DPDK Rx Spin Time = Num of calls that return 0 packets / Total num of calls
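For example, with illustrative numbers, if 900 out of 1000 receive calls in a sampling interval return 0 packets:
DPDK Rx Spin Time = 900 / 1000 = 90%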
Use the Platform tab to explore the DPDK Rx Spin Time metric on the timeline on a per-thread basis:
To learn more about core utilization in DPDK applications, see the corresponding cookbook recipe.
This histogram shows batching statistics for packet (event) dequeue operation from the DPDK eventdev
library. It provides statistics for each eventdev port, representing each worker thread that polls the event
device. Explore the histogram to identify inhomogeneous load distribution and oversubscribed or
underutilized worker threads.
DPDK Event Dequeue Spin Time = Num of calls that return 0 packets / Total num of dequeue calls
Navigate to the Platform tab to explore the DPDK Event Dequeue Spin Time metric on the timeline. Per-
worker dequeue statistics reveal details about load balancing, which enables you to analyze pipeline
configuration efficiency and to identify underlying pipeline bottlenecks.
To learn more about the DPDK eventdev pipeline, see the DPDK Event Device Profiling Cookbook recipe.
NOTE
To enable VTune Profiler capabilities, make sure SPDK is built using the --with-vtune=<vtune-
install-dir> advanced build option.
When profiling in Attach to Process mode, make sure to set up the environment variables before
launching the application.
Not available in Profile System mode.
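A minimal build sketch for the option named in the note above, assuming SPDK sources are already available; the install path is an assumption, so adjust it to your environment:
$ cd spdk
$ ./configure --with-vtune=/opt/intel/oneapi/vtune/latest
$ make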
The SPDK Effective Time metric shows the amount of time the application spent performing any activity,
excluding polling for I/O operation completion:
You can use the timeline in the Platform tab to correlate areas of SPDK throughput utilization with SPDK I/O
operations and to get a breakdown of PCIe traffic per physical device:
Latency = Sample Duration / Total Number of IOPs in Sample
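For example, with illustrative numbers, a 1 ms sample in which 4 I/O operations complete gives:
Latency = 1 ms / 4 = 250 µs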
NOTE
This analysis relies on data provided by the kernel block driver sub-system. If your platform uses a
non-standard block driver sub-system, for example user-space storage drivers, I/O metrics will not be
available in this analysis type.
VTune Profiler provides the following system-wide metrics for the kernel I/O analysis:
• I/O Wait — this system-wide metric represents the amount of time during which the CPU cores were idle
due to threads being in an I/O wait state.
• I/O Queue Depth — this metric shows the number of I/O requests submitted to the storage device. If
the number of requests in a queue is zero, this means that there are no requests scheduled, and the disk
is not utilized at all.
• I/O Data Transfer — this metric shows the number of bytes read from or written to the storage
device(s).
• Page Faults — this metric shows the number of page faults that have occurred on the system. It is
particularly useful when analyzing access to memory-mapped files.
• CPU Activity — this metric represents the portion of time the system spent in one of the following states:
• Idle state — the CPU core is idle
• Active state — the CPU core is executing a thread
• I/O Wait — the CPU core is idle, but there is a thread that could potentially be executed on this core
that is blocked by disk access.
All I/O metrics collected by VTune Profiler, such as I/O Wait Time, I/O Waits, and I/O Queue Depth, are
collected in a system-wide mode and are not target-specific.
The I/O Wait Time metric represents a portion of time during which the threads are in I/O wait state while
the system has cores in idle state. In this case, the number of threads is not greater than the number of
idling cores. This aggregated I/O Wait Time metric is an integral function of the I/O Wait metric that is
available in the Timeline pane of the Bottom-up window.
To estimate how quickly storage requests are served by the kernel sub-system, see the Disk Input and
Output Histogram. Use the Operation Type drop-down menu to select the type of I/O operation you are
interested in. For example, for I/O writes, 2-4 storage requests executed within 0.06 seconds or more are
classified as slow by VTune Profiler:
To explore this type of I/O request in greater detail, switch to the Bottom-up window.
By zooming in on an area of interest, you can get a closer look at different metrics and understand the
reason behind high I/O wait time.
VTune Profiler collects the I/O Wait type of context switches caused by I/O accesses from the thread, and
provides a system-wide I/O Wait metric in the CPU Activity area. Use this data to identify imbalance
between I/O and compute operations.
System-wide I/O Wait shows the time during which the system cores were idle, but there were threads in a
context switch due to I/O access. Use this metric to estimate the dependency of performance on the storage
medium.
For example, an I/O Wait value of 100% means that all cores of the system are idle, but there are threads
blocked by I/O requests. To solve this issue, change the logic of the application to run compute threads in
parallel with I/O tasks. Alternatively, consider using faster storage.
An I/O Wait value of 0% could mean one of the following:
• Regardless of the number of threads blocked on storage access, all CPU cores are actively executing
application code.
• No threads are blocked on storage access.
Explore the I/O Queue Depth area to see the number of storage requests submitted to the storage
device. Spikes correspond to the maximum number of requests. Zero-value gaps on the I/O Queue Depth
chart correspond to points in the application run when storage was not utilized at all.
To identify the exact points in time when slow I/O packets were scheduled for execution, enable the Slow
markers for the I/O Queue Depth metric:
To identify points of high bandwidth, analyze the I/O Data Transfer area that shows the number of bytes
read from or written to the storage device.
To view a Task Time call stack for a particular I/O call, select the required I/O API marker on the timeline
and explore the stack in the Call Stack pane:
NOTE
A PREVIEW FEATURE may or may not appear in a future production release. While a preview feature
is available for your use, feedback about its usefulness will determine its availability in future releases.
Data collected with a preview feature is not guaranteed to be compatible with future releases.
Prerequisites:
• Install the sampling driver for hardware event-based sampling collection types. For Linux* and Android*
targets, if the sampling driver is not installed, VTune Profiler can work on Perf* (driverless collection).
• To enable system-wide and uncore event collection, use root or sudo to set /proc/sys/kernel/
perf_event_paranoid to 0.
$ echo 0 > /proc/sys/kernel/perf_event_paranoid
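To make this setting persistent across reboots, one hedged option is a sysctl drop-in file; the file name is illustrative:
$ echo 'kernel.perf_event_paranoid=0' | sudo tee /etc/sysctl.d/99-vtune.conf
$ sudo sysctl -p /etc/sysctl.d/99-vtune.conf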
See Also
Optimize applications for Intel® GPUs with Intel® VTune Profiler
GPU Architecture Terminology for Intel® Xe Graphics
Optimize Your GPU Application with Intel oneAPI Base Toolkit
Offload Modeling Perspective in Intel® Advisor to estimate GPU offload overhead
Configure the Analysis
On Windows systems, to monitor general GPU usage over time, run VTune Profiler as an Administrator.
• Set up your system for GPU analysis.
• For SYCL applications: make sure to compile your code with the -gline-tables-only and -fdebug-
info-for-profiling Intel oneAPI DPC++ Compiler options (see the compile sketch after this list).
• Create a project and specify an analysis system and target.
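A minimal compile sketch for the options named in the SYCL bullet above, using the icpx driver; the source and binary names are illustrative:
$ icpx -fsycl -gline-tables-only -fdebug-info-for-profiling matrix.cpp -o matrix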
NOTE
If you have multiple Intel GPUs connected to your system, run the analysis on the GPU of your choice
or on all connected devices. For more information, see Analyze Multiple GPUs.
• Stalled: The normalized sum of all cycles on all cores spent stalled. At least one thread is
loaded, but the core is stalled for some reason.
• Idle: The normalized sum of all cycles on all cores when no threads were scheduled on a core.
• The EU Threads Occupancy metric shows the normalized sum of all cycles on all cores and
thread slots when a slot has a thread scheduled.
• The Computing Threads Started metric shows the number of threads started across all EUs for
compute work.
NOTE Families of Intel® Xe graphics products starting with Intel® Arc™ Alchemist (formerly DG2) and
newer generations feature GPU architecture terminology that shifts from legacy terms. For more
information on the terminology changes and to understand their mapping with legacy content, see
GPU Architecture Terminology for Intel® Xe Graphics.
NOTE
To generate the command line for any analysis configuration, use the Command Line button at the
bottom of the interface.
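For reference, a minimal command-line equivalent of this analysis; the application name is illustrative:
$ vtune -collect gpu-offload -- ./my_sycl_app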
Once the GPU Offload Analysis completes data collection, the Summary window displays metrics that
describe:
• GPU usage
• GPU idle time
• The most active computing tasks that ran on the CPU host
• The most active computing tasks that ran on the CPU when the GPU was idle
• The most active computing tasks that ran on the GPU, along with occupancy information
You also see Recommendations and guidance for next steps.
The total time is broken down into:
• Allocation time
• Time for data transfer from host to device
• Execution time
• Time for data transfer from device to host
This breakdown can help you better understand the balance between data transfer and GPU execution time.
The Graphics window also displays, in the Transfer Size section, the size of the data transferred between
host and device per computation task.
Computation tasks with sub-optimal offload schemes are highlighted in the table, with details to help you
improve those schemes.
Examine Energy Consumption by your GPU
In Linux environments, when you run the GPU Offload analysis on an Intel® Iris® Xe MAX graphics discrete
GPU, you can see energy consumption information for the GPU device. To collect this information, make sure
you check the Analyze power usage option when you configure the analysis.
NOTE Energy consumption metrics do not display in GPU profiling analyses that scan Intel® Iris® Xe
MAX graphics on Windows machines.
Once the analysis completes, see energy consumption data in these sections of your results.
In the Graphics window, observe the Energy Consumption column in the grid when grouped by
Computing Task. Sort this column to identify the GPU kernels that consumed the most energy. You can also
see this information mapped in the timeline.
Tune for Power Usage
When you locate individual GPU kernels that consume the most energy, for optimum power efficiency, start
by tuning the top energy hotspot.
Tune for Processing Time
If your goal is to optimize GPU processing time, keep a check on energy consumption metrics per kernel to
monitor the tradeoff between performance time and power use.
Move the Energy Consumption column next to Total Time to make this comparison easier.
You may notice that the correlation between power use and processing time is not direct. The kernels that
compute the fastest may not be the same kernels that consume the least amounts of energy. Check to see if
larger values of power usage correspond to longer stalls/wait periods.
Support Aspect: SYCL application with OpenCL as back end / SYCL application with Level Zero as back end
Data collection: For both back ends, VTune Profiler collects and shows GPU computing tasks and the GPU computing queue.
Data display: For both back ends, VTune Profiler maps the collected GPU HW metrics to specific kernels and displays them on a diagram.
See Also
Optimize applications for Intel® GPUs with Intel® VTune Profiler
GPU Architecture Terminology for Intel® Xe Graphics
Set Up System for GPU Analysis
NOTE
This is a PREVIEW FEATURE. A preview feature may or may not appear in a future production
release. It is available for your use in the hopes that you will provide feedback on its usefulness and
help determine its future. Data collected with a preview feature is not guaranteed to be backward
compatible with future releases.
GPU metrics help identify how efficiently GPU hardware resources are used and whether any performance
improvements are possible. Many metrics are represented as a ratio of cycles when the GPU functional
unit(s) is in a specific state over all the cycles available for a sampling period.
NOTE
If you have multiple Intel GPUs connected to your system, run the analysis on the GPU of your choice
or on all connected devices. For more information, see Analyze Multiple GPUs.
6. Click Start to run the analysis.
NOTE
To generate the command line for this configuration, use the Command Line... button at the bottom.
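A minimal command-line equivalent of this configuration; the application name is illustrative:
$ vtune -collect gpu-hotspots -- ./my_app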
Analysis Results
Once the GPU Compute/Media Hotspots Analysis completes data collection, the Summary window displays
metrics that describe:
• GPU time
• Occupancy
• Peak occupancy you can expect to achieve with the existing computing task configuration
• The most active computing tasks that ran on the GPU
NOTE Families of Intel® Xe graphics products starting with Intel® Arc™ Alchemist (formerly DG2) and
newer generations feature GPU architecture terminology that shifts from legacy terms. For more
information on the terminology changes and to understand their mapping with legacy content, see
GPU Architecture Terminology for Intel® Xe Graphics.
Configure Characterization Analysis
Use the Characterization configuration option to:
• Monitor the Render and GPGPU engine usage (Intel Graphics only)
• Identify which parts of the engine are loaded
• Correlate GPU and CPU data
When you select the Characterization radio button, the configuration section expands with additional
options:
The Characterization drop-down menu provides platform-specific presets of the GPU metrics. All presets,
except for the Dynamic Instruction Count, collect data about execution units (EUs) activity: EU Array
Active, EU Array Stalled, EU Array Idle, Computing Threads Started, and Core Frequency, and each one
introduces additional metrics:
• Overview metric set includes additional metrics that track general GPU memory accesses such as
Memory Read/Write Bandwidth, GPU L3 Misses, Sampler Busy, Sampler Is Bottleneck, and GPU Memory
Texture Read Bandwidth. These metrics can be useful for both graphics and compute-intensive
applications.
• Compute Basic (with global/local memory accesses) metric group includes additional metrics that
distinguish accessing different types of data on a GPU: Untyped Memory Read/Write Bandwidth, Typed
Memory Read/Write Transactions, SLM Read/Write Bandwidth, Render/GPGPU Command Streamer
Loaded, and GPU EU Array Usage. These metrics are useful for compute-intensive workloads on the GPU.
• Compute Extended metric group includes additional metrics targeted only for GPU analysis on the Intel
processor code name Broadwell and higher. For other systems, this preset is not available.
• Full Compute metric group is a combination of the Overview and Compute Basic event sets.
• Dynamic Instruction Count metric group counts the execution frequency of specific classes of
instructions. With this metric group, you also get an insight into the efficiency of SIMD utilization by each
kernel.
NOTE You can run the GPU Compute/Media Hotspots analysis in Characterization mode for Windows*,
Linux* and Android* targets. However, you must have root/administrative privileges to run the
analysis in this mode.
For the Characterization analysis, you can also collect additional data:
• Use the Trace GPU programming APIs option to analyze SYCL, OpenCL™, or Intel Media SDK programs
running on Intel Processor Graphics. This option may affect the performance of your application on the
CPU side.
For SYCL or OpenCL applications, you can identify the hottest kernels and the GPU architecture
block where a performance issue for a particular kernel was detected.
For Intel Media SDK programs, you can explore Intel Media SDK task execution on the timeline and
correlate this data with the GPU usage at each moment of time.
Support limitations:
• OpenCL kernels analysis is possible for Windows and Linux targets running on Intel Graphics.
• Intel Media SDK program analysis is possible for Windows and Linux targets running on Intel Graphics.
• Only Launch Application or Attach to Process target types are supported.
NOTE
In the Attach to Process mode, if you attach to a process after the computing queue has already been
created, VTune Profiler does not display data for the OpenCL kernels in this queue.
• Use the Analyze memory bandwidth option to collect the data required to compute memory
bandwidth. This type of analysis requires Intel sampling drivers to be installed.
• Use the GPU sampling interval, ms field to specify an interval (in milliseconds) between GPU samples
for GPU hardware metrics collection. By default, VTune Profiler uses a 1 ms interval.
Control Flow group: if, else, endif, while, break, cont, call, calla, ret, goto, jmpi, brd, brc, join, halt, and mov or add instructions that explicitly change the ip register.
Int16 & HP Float | Int32 & SP Float | Int64 & DP Float groups: Bit operations (only for integer types): and, or, xor, and others. Arithmetic operations: mul, sub, and others; avg, frc, mac, mach, mad, madm. Vector arithmetic operations: line, dp2, dp4, and others.
In the Instruction count mode, VTune Profiler also provides Operations per second metrics calculated
as a weighted sum of the following executed instructions (a worked example follows the list):
• Bit operations (only for integer types):
• and, not, or, xor, asr, shr, shl, bfrev, bfe, bfi1, bfi2, ror, rol - weight 1
• Arithmetic operations:
• add, addc, cmp, cmpn, mul, rndu, rndd, rnde, rndz, sub - weight 1
• avg, frc, mac, mach, mad, madm - weight 2
• Vector arithmetic operations:
• line - weight 2
• dp2, sad2 - weight 3
• lrp, pln, sada2 - weight 4
• dp3 - weight 5
• dph - weight 6
• dp4 - weight 7
• dp4a - weight 8
• Extended math operations:
• math.inv, math.log, math.exp, math.sqrt, math.rsq, math.sin, math.cos (weight 4)
• math.fdiv, math.pow (weight 8)
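For example, with illustrative counts, a kernel that executes 1,000,000 add instructions (weight 1), 500,000 mad instructions (weight 2), and 100,000 dp4 instructions (weight 7) within one second is credited with:
1,000,000 × 1 + 500,000 × 2 + 100,000 × 7 = 2,700,000 operations per second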
NOTE
The type of an operation is determined by the type of a destination operand.
View Data
VTune Profiler runs the analysis and opens the data in the GPU Compute/Media Hotspots viewpoint
providing various platform data in the following windows:
• Summary window displays overall and per-engine GPU usage, percentage of time the EUs were stalled or
idle with potential reasons for this, and the hottest GPU computing tasks.
• Graphics window displays CPU and GPU usage data per thread and provides an extended list of GPU
hardware metrics that help analyze accesses to different types of GPU memory. For GPU metrics
description, hover over the column name in the grid or right-click and select the What's This Column?
context menu option.
Support Aspect: SYCL application with OpenCL as back end / SYCL application with Level Zero as back end
Data collection: For both back ends, VTune Profiler collects and shows GPU computing tasks and the GPU computing queue.
Data display: For both back ends, VTune Profiler maps the collected GPU HW metrics to specific kernels and displays them on a diagram.
NOTE
For a use case on profiling a SYCL application running on an Intel GPU, see Profiling a SYCL App
Running on a GPU in the Intel® VTune Profiler Performance Analysis Cookbook .
See Also
Optimize applications for Intel® GPUs with Intel® VTune Profiler
GPU Architecture Terminology for Intel® Xe Graphics
Optimize Your GPU Application with Intel oneAPI Base Toolkit
GPU Compute/Media Hotspots View
EU Array Stalled/Idle
Control Flow group: if, else, endif, while, break, cont, call, calla, ret, goto, jmpi, brd, brc, join, halt, and mov or add instructions that explicitly change the ip register.
Synchronization group: wait
Int16 & HP Float | Int32 & SP Float | Int64 & DP Float groups: Bit operations (only for integer types): and, or, xor, and others. Arithmetic operations: mul, sub, and others; avg, frc, mac, mach, mad, madm. Vector arithmetic operations: line, dp2, dp4, and others.
NOTE
The type of an operation is determined by the type of a destination operand.
In the Graphics tab, the VTune Profiler also provides the SIMD Utilization metric. This metric helps identify
kernels that underutilize the GPU by producing instructions that cause thread divergence. A common cause of
low SIMD utilization is conditional branching within the kernel, since the threads execute all of the execution
paths sequentially, with each thread executing one path while the other threads are stalled.
To get additional information, double-click the hottest function to open the source view. Enable both the
Source and Assembly panes to get a side-by-side view of the source code and the resulting assembly code.
You can then locate the assembly instructions with low SIMD Utilization values and map them to specific lines
of code by clicking on the instruction. This allows you to determine and optimize the kernels that do not meet
your desired SIMD Utilization criteria.
NOTE For information on the Instruction Set Architecture (ISA) of Intel® Iris® Xe MAX Graphics, see
the Intel® Iris® Xe MAX Graphics Open Source Programmer's Reference Manual.
Analyze Source
If you selected the Source Analysis mode for the GPU Compute/Media Hotspots analysis, you can analyze a
kernel of interest for basic block latency or memory latency issues. To do this, in the Graphics tab, expand
the kernel node and double-click the function name. VTune Profiler redirects you to the hottest source line for
the selected function:
The GPU Compute/Media Hotspots analysis provides a full-scale analysis of the kernel source per code line.
The hottest kernel code line is highlighted by default.
To view the performance statistics on GPU instructions executed per kernel instance, switch to the Assembly
view:
NOTE
If your OpenCL kernel uses inline functions, make sure to enable the Inline Mode on the filter toolbar
to have a correct attribution of the GPU Cycles per function. See examples.
Once the analysis completes, see energy consumption data in these sections of your results.
In the Graphics window, observe the Energy Consumption column in the grid when grouped by
Computing Task. Sort this column to identify the GPU kernels that consumed the most energy. You can also
see this information mapped in the timeline.
Tune for Power Usage
When you locate individual GPU kernels that consume the most energy, for optimum power efficiency, start
by tuning the top energy hotspot.
Tune for Processing Time
If your goal is to optimize GPU processing time, keep a check on energy consumption metrics per kernel to
monitor the tradeoff between performance time and power use.
Move the Energy Consumption column next to Total Time to make this comparison easier.
You may notice that the correlation between power use and processing time is not direct. The kernels that
compute the fastest may not be the same kernels that consume the least amounts of energy. Check to see if
larger values of power usage correspond to longer stalls/wait periods.
NOTE Energy consumption metrics do not display in GPU profiling analyses that scan Intel® Iris® Xe
MAX graphics on Windows machines.
t = sum + p;
c = (t - sum) - p;
sum = t;
}
data[gid] = sum * sqrt(12.f);
}
To compare these operations, run the GPU In-kernel profiling in the Basic block latency mode and double-
click the kernel in the grid to open the Source view:
The Source view analysis highlights the pown() call as the most expensive operation in this kernel.
p -=c;
t = data[gid] + p;
c = (t - data[gid]) - p;
data[gid] = t;
}
data[gid] *= sqrt(12.f);
}
To identify which read instruction takes the longest time, run the GPU In-kernel Profiling in the Memory
latency mode:
The Source view analysis shows that the compiler understands that each thread works only with its own
element from the input buffer and generates code that performs the read only once. The value from the
input buffer is stored in a register and reused in other operations, so the compiler does not generate
additional reads.
See Also
Hotspots Report
from command line
View Data on Inline Functions
• Activity
• Idle
For other compiler options (exclusive to OpenCL profiling), see the FPGA Programming Guide.
• Create a VTune Profiler project.
1. Click the (standalone GUI)/(Visual Studio IDE) Configure Analysis button on the Intel® VTune™ Profiler toolbar.
3. In the HOW pane, click the Browse button.
• Select CPU/FPGA Interaction analysis type from the Accelerators group.
• Enter the CPU sampling interval in milliseconds.
• Specify if the collection should include CPU call stacks.
• Specify a source for the FPGA profiling data:
• OpenCL Profiling API - This source profiles only the host application.
• AOCL Profiler - This source profiles the host application as well as the design on your FPGA.
NOTE
To generate the command line for this configuration, use the Command Line button.
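A minimal command-line equivalent of this configuration; the application name is illustrative:
$ vtune -collect fpga-interaction -- ./my_fpga_app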
View Data
The CPU/FPGA Interaction analysis results appear in the CPU/FPGA Interaction viewpoint. The viewpoint
contains these windows:
• The Summary window displays statistics on the overall application execution, identifying CPU time and
processor utilization, and execution time for DPC++ or OpenCL kernels. Double-click a kernel in the
Bottom-up view to see detailed performance data through the Source view.
• The Bottom-up window displays functions in the Bottom-up tree, CPU time and CPU utilization per
function. Click the functions or kernels in this view to see the Source view.
• The Platform window displays over-time metric and performance data for DPC++ or OpenCL kernels,
memory transfers, CPU context switches, FPU utilization, and CPU threads with DPC++ or OpenCL
kernels.
What's Next
Use the CPU/FPGA Interaction viewpoint to review the following:
• FPGA Utilization: Look at the FPGA Top Compute Tasks on the Summary window for a list of kernels
running on the FPGA. The Bottom-up window shows the Total and Average execution time for every
kernel.
• Memory Transfers: Look at the Data Transferred column on the Bottom-up window or the Computing
Queue rows on the Platform window to view DPC++ or OpenCL kernels and memory transfers.
• Workload Impact: The Context Switch Time metric on the Summary window shows how much time
was spent in CPU context switches. Context switches can also be seen on the Platform tab as they
occurred during application execution.
See Also
fpga-interaction Command Line Analysis
Intel FPGA SDK for OpenCL Pro Edition: Best Practices Guide
To interpret the performance data provided in the CPU/FPGA Interaction viewpoint, you may follow the steps
below:
1. Define a Performance Baseline
2. Assess FPGA Utilization
3. Review Memory Transfers
4. Determine Workload Impact
5. Review FPGA device metrics
6. Analyze channel depth
7. Analyze loops
8. Analyze Source of the host application part
9. Analyze Source of the kernel running on FPGA device
Switch to the Bottom-up window and use the Computing Task Purpose / Source Computing Task
(FPGA) grouping to view the hotspots for kernels.
Tip
You can click a task from the FPGA Top Compute Tasks list to be taken to that task on the Bottom-
up window.
Review the FPGA Utilization timeline, which shows how many kernels and transfers are executing at the
same time on the FPGA.
If the channel is full all the time, the write side of the channel is working faster than the read side, and the
channel will be stalling in the write kernel. If the channel is mostly empty, the read side is likely to be
stalling, and if the channel is bigger than 32 bits deep, you can reduce it in size without a performance hit.
Analyze Loops
Analyze the occupancy for profiled loops:
See Also
Analyze Performance
Reference
Intel FPGA SDK for OpenCL Pro Edition: Best Practices Guide
• Platform Profiler analysis collects data on a deployed system running a full load over an extended period
of time with insights into overall system configuration, performance, and behavior. The collection is run on
a command prompt outside of VTune Profiler and results are viewed in a web browser.
Prerequisites:
• For best results, install the sampling driver for hardware event-based sampling collection types. For
Linux* and Android* targets, if the sampling driver is not installed, VTune Profiler can work on Perf*
(driverless collection).
• To enable system-wide and uncore event collection, use root or sudo to set /proc/sys/kernel/
perf_event_paranoid to 0.
$ echo 0 > /proc/sys/kernel/perf_event_paranoid
For Linux targets, the System Overview analysis collects the following Ftrace* events: sched, freq, idle,
workq, irq, softirq.
For Android targets, the System Overview analysis collects the following events:
• Atrace* events: input, view, webview, audio, video, camera, hal, res, dalvik
• Ftrace events: sched, freq, idle, workq, filesystem, irq, softirq, sync, disk
• Measure particular stages of workload execution without static instrumentation
• Analyze CPU core activities at the microsecond level
• Analyze a kernel/driver or application module by measuring exact CPU time with nanosecond precision
• Triage latency issues resulting from:
• changes in the execution code flow
• preemption by another process
• resource sharing issues
• page faults
• power consumption issues caused by unexpected wake-ups
NOTE
• This analysis requires direct access to the hardware. It does not work inside a Guest VM.
• In most cases, the collection overhead in this mode is less than 10%. It can be higher if your
application is IO or DRAM bound.
• The Hardware Tracing mode does not require sampling drivers.
From the HOW pane, click the Browse button and select System Overview.
3. Select Hardware Tracing or Hardware Event-Based Sampling mode.
For the Hardware Tracing mode, you can also enable the Analyze interrupts option.
With the default Hardware Tracing configuration, Intel® VTune™ Profiler stops the data collection when
a 1GB data limit is reached. You can change this limit in the Advanced section of the WHAT pane:
4. In the HOW pane, check options if you are interested in examining power usage or understanding
reasons for throttling behavior.
5. Click the Start button to run the analysis.
VTune Profiler collects the data, generates an rxxxso result, and opens it in the default System Overview
viewpoint.
NOTE
To run this analysis from the command line, use the Command Line button at the bottom.
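A minimal command-line sketch that collects system-wide data for a fixed duration; the duration value is illustrative:
$ vtune -collect system-overview -duration 10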
Once the data collection is finished, see the Energy Consumption section of the Summary window.
This section shows the total power consumed by the system during data collection, as well as the breakdown
by CPU package and DRAM module.
Switch to the Platform window to get a detailed view of power consumption over time. You can correlate
different metrics, such as DRAM bandwidth, CPU frequency, and CPU utilization, with the amount of power
consumed by each device.
NOTE
On the timeline, device power is represented in millijoules per second, which is physically equivalent to
milliwatts.
Throttling Analysis
If your CPU is operating at temperatures outside safe thermal limits, you may observe a significant drop in
CPU frequency as the system attempts to stabilize. The drop in frequency to restore safe CPU operating
temperature can result in significant performance loss. Run the System Overview analysis to analyze factors
that can cause the CPU to throttle in this way.
In the HOW pane of the Configure Analysis window, check the Analyze throttling reasons checkbox.
Then run the analysis.
Once the data collection is finished, see the CPU Throttling Reasons section in the Summary window.
Switch to the Platform window to see a breakdown of throttling events according to the reasons causing
them.
THERMAL: Frequency has dropped below the OS frequency due to a thermal event.
RSR-LIMIT: Frequency has dropped below the OS frequency due to a Residency State Regulation Limit violation.
OTHER: Frequency has dropped below the OS frequency due to electrical or other constraints.
MAX-TURBO-LIMIT: Frequency has dropped below the OS frequency due to multi-core turbo limits.
TURBO-ATTENUATION: Frequency has dropped below the OS frequency due to turbo transition attenuation. This can cause performance degradation due to frequent changes in operating ratio.
For more information about these reasons, see the Intel® 64 and IA-32 Architectures Software Developer's
Manual.
See Also
Analyze Interrupts
Task Analysis
Analyze Interrupts
If you configured your collection to monitor IRQ Ftrace* events either by using the System Overview analysis
type or custom analysis, the Intel® VTune™ Profiler analyzes code performance inside IRQs and displays
interrupts statistics in the default Hardware Events viewpoint. Follow the steps below to analyze the collected
interrupt data:
• Identify most critical interrupt handlers.
• Analyze slow interrupts on the timeline.
Prerequisites
Analysis of interrupts requires access to the Linux Ftrace subsystem in /sys/kernel/debug/tracing.
Typically, it is only accessible for the root user.
To analyze interrupts, either run the analysis as root, or edit permissions for /sys/kernel/debug/tracing
as described in the Limitations section of the Linux* and Android* Kernel Analysis topic.
Clicking an interrupt handler in the list opens the grid view grouped by Interrupt/Interrupt Duration
Type/Function/Call Stack level.
• Interrupt Duration Histogram shows a distribution of interrupt handler instances per duration
type defined by VTune Profiler. A high number of slow instances may signal a performance bottleneck.
Use the drop-down menu to view data for different interrupt handlers.
When you identify a slow interrupt in the Summary window, you can switch to the Event Count tab
sorted by the Interrupt/.. level, locate this interrupt, expand the hierarchy to view the function where the
slow interrupts occurred, and double-click the function to explore its source code in the Source view.
See Also
Linux* and Android* Kernel Analysis
for IRQ event collection
Window: Platform
Locate the interrupt-intensive regions and zoom in. Hover over a module name to see the Module Entry
Point, which reveals the cause of an interrupt. For example, a page fault:
Or a timer interrupt:
Analyze Thread Activity at the Microsecond Level
Hardware Tracing analysis enables you to analyze data at a high granularity level. This could be particularly
useful, for example, to debug a network workload with a one-second duration between requests:
Zoom in to a single request. For example, the ping application measures and prints 250µs as reply time:
do_idle is executed.
Locate a region with multiple context switches or high kernel activity and zoom in to investigate. For
example, in this case the operating system has rescheduled a thread multiple times due to various reasons,
including preemption and synchronization. Hover over the markers to get additional details and to determine
the root cause of the issue.
In the grid pane of the Platform window, use the Process / Module / Module Entry Point grouping to
get a detailed view of user-mode and kernel activity. Expand a module and study the module entry points to
determine the amount of time spent by the module in the kernel mode.
You can also examine the number and frequency of Kernel-mode Entries caused by a specific module and
function to determine the performance impact of kernel activity.
Hardware Tracing collection is more precise than event-based sampling and provides all the modules
executed with their precise time.
See Also
Analyze Interrupts
Use Platform Profiler Analysis to ensure that you use the available hardware in the most efficient way for a
long-running workload.
NOTE
To run this analysis from the command line, click the Command Line button at the bottom.
Once data collection is complete, see a performance overview in the Platform Profiler viewpoint.
See Also
Platform Profiler View
Hover over the system information area to see details about the system used for data collection.
See Also
Platform Profiler Analysis
You can also create your own grouping and include the 'Core Type' entity in it. To do this, use the Customize
Grouping dialog box from the Grouping pulldown menu and select your combination of entities.
Group by Core Type in Timeline
You can also use the timeline view to group data by Core Type. To do this, select one of the available
groupings (from the pulldown menu) that contain the Core Type entity. This example shows the Process /
Core Type grouping in the timeline.
Use this hierarchical display of data to analyze microarchitecture bottlenecks in P-Cores and E-Cores. You will
also find a similar breakdown by core type in other analysis types (Memory Access or HPC Performance
Characterization) since they share some of the same metrics.
Prerequisites
Intel® VTune™ Profiler provides accurate source analysis if your code is compiled with debug information
and the debug information is written correctly to the binary file (for Linux* targets) or to the debug
information file/symbol file (for Windows* targets).
Access Source View
To open the source/assembly code of a specific item, either double-click the selected item in the grid view/
Call Stack/Timeline pane, or select the View Source option from the context menu:
Depending on the route you used to access the Source view, the data representation on the panes may
slightly differ:
• If you access the Source view by clicking a function in the grid, the VTune Profiler opens the source at the
hottest (with the highest value of the metric selected for hotspot navigation) line of this function in the
Source/Assembly pane.
• When you click a call stack function, the VTune Profiler opens the source highlighting the call site (the
location where a function call is made) at the top of the call stack. The call site is marked with a yellow arrow.
• If you click a wait in the Timeline pane, the VTune Profiler opens a wait function highlighting the waiting
call site. If you double-click a transition (for Threading data), it highlights the signaling call site.
Analyze Code
The Source/Assembly window opens in a separate tab:
NOTE
• One source code line may have one or more related assembly instructions while one
instruction has only one related code line.
• Synchronization is possible only if the debug line information is available for the selected
function.
Hotspot navigation buttons. Typically, the VTune Profiler opens the source code highlighting the
most performance-critical code line based on the key metric set up for this analysis. To go further and
freely navigate between code lines that have the highest metric value (hotspots), use these toolbar
buttons:
The Source pane shows your code written in a high-level programming language, for example, C, C++,
or Fortran. The Source pane opens if the symbol information for the selected function is
available.
Hotspot navigation metric column. By default, the source view navigation is based on the key
analysis metric like the CPU Time for the Hotspots analysis. Such a metric column is highlighted. To
change the hotspot navigation metric, right-click the required column and select Use for Hotspot
Navigation command from the context menu.
The Assembly pane displays disassembled code. This code shows the exact order of the assembly
instructions executed by the processor. Instructions on the Assembly pane are grouped into basic
blocks. To get help on a particular instruction, select it in the grid, right-click and choose Instruction
Reference from the context menu.
For better navigation in the Assembly pane, you may select one of the available granularity levels in
the Assembly grouping drop-down menu: Address, Basic Block/Address, or Function Range/
Basic Block/Address. VTune Profiler updates the Assembly view grouping the instructions into
collapsible nodes according to the selected hierarchy.
If correct debug information is missing, or the symbol file is unavailable, the assembly data may be
incorrect. In this case, the VTune Profiler uses heuristics to define function boundaries in the binary
module.
Heat map markers. Use the blue markers to the right of the vertical scroll bar to quickly identify the
hotspot lines (based on the hotspot navigation metric). To view a hotspot, move the scroll bar slider
to the marker. The bright blue marker indicates a hot line for the function you drilled down into.
Light blue markers indicate hot lines in other functions.
Edit Source
When tuning your target, you may need to modify the source code. VTune Profiler enables you to open the
source files for editing directly from the Source/Assembly window.
To launch the source editor:
1. In the Source pane, select a line you want to edit.
2. Right-click the line and select Edit Source from the context menu, or click the Open Source File button.
NOTE
The Source/Assembly analysis is not supported for source code that uses the #line directive.
See Also
Debug Information for Linux* Application Binaries
Custom Analysis
Create a new custom analysis type based on available
predefined analysis configurations.
To create and run a new custom analysis type:
Prerequisites: Make sure a VTune Profiler project is created.
1. Click the (standalone GUI)/(Visual Studio IDE) Configure Analysis button on the Intel® VTune™ Profiler toolbar.
Enable an editable mode for the configuration and specify the following analysis identifiers:
• Analysis name: Enter/edit a name of this custom analysis type.
• Command line name: Enter/edit a name of the custom analysis type that will be used as an
identifier when analyzing the project from the command line. Keep it short for your convenience.
• Analysis identifier: Specify a shorthand identifier to be appended to the name of each result
produced by this analysis type. For example, adding the tr identifier for the Threading analysis
result produces the following result name: r000tr, where 000 is the result number.
• Comments: Provide a short meaningful description of the analysis type you create. This information
may help you easily identify the analysis type specifics later.
See Also
Custom Analysis Options
Analyze I/O waits check box: Analyze the percentage of time each thread and CPU spends in I/O wait state.
Analyze interrupts check box: Collect interrupt events that alter a normal execution flow of a program. Such events can be generated by hardware devices or by CPUs. Use this data to identify slow interrupts that affect your code performance.
Analyze loops check box: Extend loops analysis to collect advanced loops information, such as instruction set usage, and display analysis results by loops and functions.
Analyze memory consumption check box (for Linux targets only): Collect and analyze information about memory objects with the highest memory consumption.
Analyze OpenMP regions check box: Instrument the OpenMP* regions in your application to group performance data by regions/work-sharing constructs and detect inefficiencies such as imbalance, lock contention, or overhead on performing scheduling, reduction, and atomic operations. Using this option may cause higher overhead and increase the result size.
Analyze PCIe bandwidth check box: Collect the events required to compute PCIe bandwidth. As a result, you will be able to analyze the distribution of the read/write operations on the timeline and identify where your application could be stalled due to approaching the bandwidth limits of the PCIe bus. In the Device class drop-down menu, you can choose a device class where you need to analyze PCIe bandwidth: processing accelerators, mass storage controller, network controller, or all classes of the devices (default).
NOTE
This analysis is possible only on the Intel microarchitecture code name
Sandy Bridge EP and later.
Analyze power usage check box: Track power consumption by processor over time to see whether it can cause CPU throttling.
Analyze Processor Graphics hardware events drop-down menu: Analyze performance data from Intel HD Graphics and Intel Iris Graphics (further: Intel Graphics) based on the predefined groups of GPU metrics.
Analyze system-wide context switches check box: Analyze detailed scheduling layout for all threads on the system and identify the nature of context switches for a thread (preemption or synchronization).
Analyze user tasks, events, and counters check box: Analyze tasks, events, and counters specified in your code via the ITT API. This option causes a higher overhead and increases the result size.
Analyze user histogram check box: Analyze the histogram specified in your code via the Histogram API. This option increases both overhead and result size.
Analyze user synchronization check box: Enable User synchronization API profiling to analyze thread synchronization. This option causes higher overhead and increases result size.
Chipset events field: Specify a comma-separated list of chipset events (up to 5 events) to monitor with the hardware event-based sampling collector.
Collect context switches check box: Analyze detailed scheduling layout for all threads in your application, explore time spent on a context switch, and identify the nature of context switches for a thread (preemption or synchronization).
NOTE
The types of the context switches (preemption or synchronization) cannot be identified if the analysis uses Perf* based driverless collection.
Collect CPU sampling data menu: Choose whether to collect information about CPU samples and related call stacks.
Collect highly accurate CPU time check box (for Windows targets only): Obtain more accurate CPU time data. This option causes more runtime overhead and increases result size. Administrator privileges are required.
336
Analyze Performance 7
C
Collect I/O API data menu: Choose whether to collect information about I/O calls and related call stacks. This analysis option helps identify where threads are waiting or enables you to compute thread concurrency. The collector instruments APIs, which causes higher overhead and increases result size.
Collect Parallel File System counters check box: Enable collection of the Parallel File System counters to analyze Lustre* file system performance statistics, including Bandwidth, Package Rate, Average Packet Size, and others.
Collect signalling API data menu: Choose whether to collect information about synchronization objects and call stacks for signaling calls. This analysis option helps identify synchronization transitions in the timeline and signalling call stacks for associated waits. The collector instruments signalling APIs, which causes higher overhead and increases result size.
Collect stacks check box: Enable advanced collection of call stacks and thread context switches to analyze performance, parallelism, and power consumption per execution path.
Collect synchronization API data menu: Choose whether to collect information about synchronization wait calls and related call stacks. This analysis option helps identify where threads are waiting or enables you to compute thread concurrency. The collector instruments APIs, which causes higher overhead and increases result size.
Collect thread affinity check box: Analyze thread pinning to sockets, physical cores, and logical cores. Identify incorrect affinity that utilizes logical cores instead of physical cores and contributes to poor physical CPU utilization.
NOTE
Affinity information is collected at the end of the thread lifetime, so the resulting data may not show the whole issue for dynamic affinity that is changed during the thread lifetime.
CPU Events table:
• Specify hardware events to collect using the check boxes in the first column. By default, the table lists all events available for the target platform with events used for the original analysis configuration pre-selected. You may use the Search functionality to find events of interest. To get more details on an event, select it in the table and click the Explain button.
• Modify the Sample After value for an event to control the number of events after which the VTune Profiler interrupts the event data collection. The Sample After value depends on the target duration. Based on the duration value, the VTune Profiler adjusts the Sample After value with a multiplier.
Disable alternative stacks for signal handlers check box (available for Linux targets): Disable using alternative stacks for signal handlers. Consider this option for profiling standard Python 3 code on Linux.
Enable driverless collection check box: Use driverless Perf*-based hardware event-based collection when possible.
Evaluate max DRAM bandwidth check box: Evaluate maximum achievable local DRAM bandwidth before the collection starts. This data is used to scale bandwidth metrics on the timeline and calculate thresholds.
Event mode drop-down list: Limit event-based sampling collection to USER (user events) or OS (system events) mode. By default, all event types are collected.
GPU Profiling mode drop-down menu: Select a profiling mode to either characterize GPU performance issues based on GPU hardware metric presets, or enable a source analysis to identify basic block latency due to algorithm inefficiencies or memory latency due to memory access issues. Use the Computing task of interest table to specify the kernels of interest and narrow down the GPU analysis to specific kernels, minimizing the collection overhead. If required, modify the instance step for each kernel, which is a sampling interval (in the number of kernels).
Limit PMU collection to counting check box: Enable to collect counts of events instead of the default detailed context data for each PMU event (such as code or hardware context). Counting mode introduces less overhead but gives less information.
Linux Ftrace events / Android framework events field: Use the kernel events library to select Linux Ftrace* and Android* framework events to monitor with the collector. The collected data shows up as tasks in the Timeline pane. You can also apply the task grouping level to view performance statistics in the grid.
Managed runtime type to analyze menu: Choose a type of the managed runtime to analyze. Available options are:
• for Windows targets: combined Java* and .NET* analysis; combined Java, .NET, and Python* analysis; Python-only analysis
• for Linux targets: Java-only analysis; combined Java and Python analysis; Python-only analysis
Minimal memory object size to track, in bytes spin box (for Linux targets only): Specify a minimal size of memory allocations to analyze. This option helps reduce runtime overhead of the instrumentation.
Profile with Hardware Tracing check box: Enable driverless hardware tracing collection to explore CPU activities of your code at the microsecond level and triage latency issues.
Stack size, in bytes field: Specify the size of a raw stack (in bytes) to process. The Unlimited size value in the GUI corresponds to the value 0 in the command line. Possible values are numbers between 0 and 2147483647.
Stack type drop-down menu: Choose between software stack and hardware LBR-based stack types. Software stacks have no depth limitations and provide more data, while hardware stacks introduce less overhead. Typically, the software stack type is recommended unless the collection overhead becomes significant. Note that the hardware LBR stack type may not be available on all platforms.
Stack unwinding mode menu: Choose whether collection requires online (during collection) or offline (after collection) stack unwinding. Offline mode reduces analysis overhead and is typically recommended.
Stitch stacks check box: For applications using Intel® oneAPI Threading Building Blocks (oneTBB) or OpenMP* with Intel runtime libraries, restructure the call flow to attach stacks to a point introducing a parallel workload.
Trace GPU Programming APIs check box: Capture the execution time of OpenCL™ kernels, DPC++ tasks, and Intel Media SDK programs on a GPU, identify performance-critical GPU tasks, and analyze the performance per GPU hardware metrics.
Uncore sampling interval, ms field: Specify an interval (in milliseconds) between uncore event samples.
Use precise multiplexing check box: Enable a fine-grain event multiplexing mode that switches event groups on each sample. This mode provides more reliable statistics for applications with a short execution time. You can also consider applying the precise multiplexing algorithm if the MUX Reliability metric value for your results is low.
NOTE
You may generate the command line for this configuration using the Command Line... button at the
bottom.
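A generated command line is typically based on the collect-with option with the selected settings passed as knobs. For illustration only (a sketch; the exact analysis type, knobs, and application are placeholders that depend on your configuration):
vtune -collect-with runsa -knob sampling-interval=1 -knob enable-stack-collection=true -- ./my_app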
See Also
collect-with
vtune option to configure custom analysis from command line
By default, the VTune Profiler detects CPU time based on the OS scheduler tick granularity. As a result, the
CPU time values may be inaccurate for targets that execute in short quanta less than the OS scheduler tick
interval (for example, frame-by-frame computation in video decoders).
Accurate collection of CPU time information is available for the user-mode sampling and tracing analysis
types (Hotspots and Threading) and enabled by default in the predefined analysis configurations when you
run both the VTune Profiler and your application to analyze with administrator privileges.
To collect more accurate CPU time information, the VTune Profiler uses the Event Tracing for Windows* (ETW)
capability. For example, without ETW, a sample is taken every 10ms. For each sample, the OS is queried for
the amount of time the thread executed and the difference is calculated between the samples, resulting in
the delta. The information returned by the OS via this mechanism has a coarse granularity. VTune Profiler
totals the deltas and displays it in the user interface. However, with ETW enabled, the VTune Profiler can filter
out any time spent executing other threads and accurately calculate time for monitored threads within each
10ms sample based on the context switch information acquired from ETW. Based on this additional
information, the CPU time metric calculated for the function/thread will be more accurate.
VTune Profiler needs exclusive access to the Microsoft* NT Kernel Logger. Therefore, only one VTune Profiler
collection can run in this mode on the system and no other tools can use the service. If the VTune Profiler
cannot get access to the NT Kernel Logger, the collection will continue with this mode disabled.
This type of collection takes more processing time and disk space. VTune Profiler may generate up to 5 MB of
temporary data per minute per logical CPU depending on the system configuration and the profiled target.
Enabling or disabling the accurate CPU time collection depends on what is executing on the system during
data collection and the structure of your application. In specific cases, there may be about a 3% variation
between "normal" and "highly accurate" CPU time. But, there are corner cases where the difference could be
as high as 30% or 40%. If the thread is executing, but happens to be inactive every 10ms that a sample is
taken without ETW, the results would grossly misrepresent the execution time. Or, if the thread is mostly
inactive, but runs exactly on the frequency of the 10ms samples, it may appear to consume large amounts of
time, when in reality it does not. The best thing to do is to test it yourself, if possible. That is, collect the Basic
Hotspots data with and without this option on and compare the resulting data. This can tell you if running
without the highly accurate CPU time option produces results accurate enough to direct your optimization
efforts, or if you need to have Administrative privileges so that you can enable this option. However, if you
are restricted from using highly accurate CPU time because of your corporation's policies, you can, in
general, be confident that analysis of your application's performance is valid using "normal" Hotspots data
collection.
To disable highly accurate CPU time collection for custom analysis:
1. Create a new custom analysis (based on an existing configuration such as Hotspots or Threading).
2. Deselect the Collect highly accurate CPU time option.
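From the command line, the same effect can likely be achieved with the accurate-cpu-time-detection knob referenced below (a sketch; verify the exact knob name and values with vtune -help, and note that the application name is a placeholder):
vtune -collect-with runss -knob accurate-cpu-time-detection=false -- my_app.exe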
See Also
knob
accurate-cpu-time-detection option
Warnings about Accurate CPU Time Collection
1. In the HOW pane, select an existing hardware event-based analysis (for example, Microarchitecture
Exploration) and click the Copy button to create a custom copy of this configuration.
The new analysis type shows up under the Custom Analysis group in the HOW pane.
2. From the list of PMU events supported for the current platform, select the events you want the VTune
Profiler to monitor in your new configuration.
You may select an event and click the Explain... button at the bottom to open the Intel Processor
Event Reference and read more details on the selected event.
To filter in/out the event list for particular event(s), specify search keywords (applied to both the Event
Name and Event Description columns) in the Filter field.
NOTE
Usually precise events have a _PS postfix (for example, UOPS_RETIRED.RETIRE_SLOTS_PS) and/or a
clear indication (Precise Event) in the Event Description column.
NOTE
You may configure the VTune Profiler to monitor all the events in a single collection run using event
multiplexing or allow multiple runs to collect more precise event data.
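From the command line, the equivalent event selection is typically passed through the event-config knob of a custom hardware event-based collection, for example (the event names and application are illustrative; use the names listed for your platform):
vtune -collect-with runsa -knob event-config=INST_RETIRED.ANY,CPU_CLK_UNHALTED.THREAD -- ./my_app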
See Also
Custom Analysis Options
knob
event-config option to specify events from CLI
Caution
The event skid affects the accuracy of your analysis results. When the grouping level is very small (for
example, instruction, source line, or basic block), the Intel® VTune™ Profiler attributes performance
results incorrectly. For example, when row A induces a problem, row B shows up as a hotspot. If
different CPU events in the formula of a hardware event-based metric have different skids, the VTune
Profiler may attribute data to different blocks, which makes all metrics invalid. This type of issue
typically does not show up at the function granularity.
See Also
Hardware Event-based Sampling Collection
Retirement and write back of state to visible registers is only done for instructions and uops that are on the
correct execution path. Instructions and uops of incorrectly predicted paths are flushed upon identification of
the misprediction and the correct paths are then processed. Retirement of the correct execution path
instructions can proceed when two conditions are satisfied:
• The uops associated with the instruction to be retired have completed, allowing the retirement of the
entire instruction, or in the case of instructions that generate a very large number of uops, enough to fill the
retirement window.
• Older instructions and their uops of correctly predicted paths have retired.
Intel® VTune™ Profiler monitors the Instructions Retired event for all analysis types based on the hardware
event-based sampling (EBS), also known as Performance Monitoring Counter (PMC) analysis in the sampling
mode. The Instructions Retired event is also part of the basic Clockticks per Instructions Retired (CPI) metric
that shows how much latency affected an application execution.
For performance analysis, you may check how many instructions started their execution in the OOO pipeline (ISSUED counter or EXECUTED counter) and compare that number with the count of retired operations. A high difference shows that the CPU does a lot of useless work and uses excess power.
See Also
Hardware Event-Based Sampling Collection
Precise Events
Precise events are events for which the exact
instruction addresses that caused the event are
available.
You can configure these events to collect extended information, such as the values of all the registers evaluated at the IP of the interrupt, on IA-32 and Intel® 64 architecture systems. For example, on the Intel Core™ 2 processor family, an L2 load miss that retrieves a cacheline can be identified with the MEM_LOAD_RETIRED.L2_LINE_MISS event. The register values and the disassembly allow the reconstruction of the linear address of the memory operation that caused the event.
Check the HOW configuration pane in the Configure Analysis window to make sure the events you use are
precise. Usually precise events have a _PS postfix (for example, MEM_LOAD_RETIRED.FB_HIT_PS) in the
Description column as follows:
See Also
Hardware Event-based Sampling Collection
For example, for KVM guest OS profiling consider selecting the following Linux Ftrace events to track
IRQ injection process: kvm, irq, softirq, and workq.
The collected data shows up as tasks in the default viewpoint. Start with the Summary window to identify
the most time-consuming tasks in the Top Tasks section. Analyze task duration statistics presented by task
type in the Task Duration histogram:
Use the sliders to set up thresholds for fast and slow task instances.
Clicking a task in the Top Tasks section opens the Bottom-up window grouped by tasks. To analyze tasks
over time, switch to the Platform window:
Limitations
On some systems, the Linux Ftrace subsystem, located in the debugfs partition in /sys/kernel/debug/tracing, may be accessible for the root user only. In this case, the VTune Profiler provides an error message: Ftrace collection is not possible due to a lack of credentials. Root privileges are required. To enable Ftrace event collection on such a system, you may either run the VTune Profiler with root privileges or change the permissions manually under the root account (for example, with the chown command) or by running the prepare-debugfs.sh script, for example:
$ ./install/bin64/prepare-debugfs.sh [option]
where [option] is one of the following:
-i | --install: Configure the autoload debugfs boot script and install it in the appropriate system directory.
-c | --check: Mount without options; the script will configure debugfs and check permissions.
-b | --batch: Run in a non-interactive mode (exiting in case of already changed permissions) without options; the script will configure debugfs.
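For instance, to install the autoload boot script under the root account so that the permissions persist across reboots (assuming the script location shown above):
$ ./install/bin64/prepare-debugfs.sh --install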
See Also
Custom Analysis
Task Analysis
Analyze Interrupts
knob
atrace-config/ftrace-config option for CLI
Problem: No GPU Utilization Data Is Collected
Sampling Interval
Configure the amount of wall-clock time the Intel®
VTune™ Profiler waits before collecting each sample
(sampling interval).
The sampling interval is used to calculate the target number of samples and the Sample After value (SAV).
Increasing the sampling interval may be useful for profiles with long durations or profiles that create large
results. Typically, the size of the collected result is affected by such factors as duration, thread and core
counts, selected analysis type, additional collection knobs, and application behavior.
You may change the default sampling interval as follows:
1. Click the (standalone GUI) / (Visual Studio* IDE) Configure Analysis button on the VTune Profiler toolbar.
2. Select a predefined analysis type from the HOW pane or create a custom analysis type.
3. Use the CPU sampling interval, ms field to specify the required interval.
For user-mode sampling and tracing types, specify a number (in milliseconds) between 1 and 1000.
Default: 10ms. For hardware event-based sampling types, specify a number between 0.01 and 1000.
Default: 1ms.
NOTE
For hardware event-based sampling types, the sampling interval serves as a simple SAV multiplier so
that the default interval value of 1ms just leaves the SAV intact. The sampling interval value of 0.1ms
divides the SAV for all events by 10 making them overflow 10 times more frequently. The sampling
interval value of 10ms multiplies the SAV for all events by 10 making them overflow 10 times less
frequently.
To determine an appropriate sampling interval, consider the duration of the collection, the speed of your
processors, and the amount of software activity. For instance, if the duration of sampling time is more than
10 minutes, consider increasing the sampling interval to 50 milliseconds. This reduces the number of
interrupts and the number of samples collected and written to disk. The smaller the sampling interval, the
larger the number of samples collected and written to disk.
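From the command line, the interval is set with the sampling-interval knob, for example (a sketch; the application is a placeholder):
vtune -collect hotspots -knob sampling-interval=50 -- ./my_app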
The minimal value of the sampling interval for the user-mode sampling and tracing collection depends on the
system:
• 10 milliseconds for Windows* systems with a single CPU
NOTE
For driverless Perf*-based data collection on the targets running under Xen Hypervisor, the VTune
Profiler automatically sets the sampling interval to 0 to switch to the integrated Perf sampling interval.
This configuration provides more precise performance statistics in the hypervisor environment.
See Also
knob
sampling-interval vtune option
Custom Analysis Options
Energy Analysis
Use Intel® SoC Watch and Intel® VTune™ Profiler to collect and analyze power and energy consumption
metrics. You can collect data on Windows, Linux, or Android systems. Use this data to identify system
behaviors that waste energy.
6. Click Start to run the analysis.
When the analysis completes, VTune Profiler displays package power usage information (collected by Intel®
SoC Watch) in the Platform tab.
Track package power usage to see if the CPU is likely to enter a throttling phase. If that happens, you can
run a throttling analysis to explore possible causes.
For detailed information about using Intel SoC Watch, see the Energy Analysis User Guide.
NOTE Users in Linux environments do not require root privileges to run energy analysis. Once your
system administrator installs VTune Profiler sampling drivers and configures them with the necessary
permissions, users without root privileges can collect energy data when profiling with VTune Profiler.
On Windows systems, you must have administrator privileges to collect data on energy consumption.
For example, to run a collection for 1 minute (-t 60), gather data about how much time the CPU
spends in low power states (-f cpu-cstate), include trace data (-m), and store the reports in a
specified directory location with the specified file name (-o results/test), you would use:
socwatch -t 60 -f cpu-cstate -m -o results/test -r vtune
The import file is saved to the results directory as test.pwr.
For detailed descriptions of options and the different metrics that can be collected, see Intel SoC Watch
Command Options or the Getting Started section of the Intel SoC Watch User's Guide (Linux and
Android | Windows).
Tip
• Use feature group names as a shorthand for specifying several features (metrics) that should be
collected at the same time. For instance, -f sys collects many commonly used metrics, including
low power state residency for CPU, GPU, and devices, CPU temperature and frequency, and
memory bandwidth.
• Use the --help option to discover all of the available metrics that can be collected on the system
(found under feature and feature group names) as well as other options for controlling data
collection and reporting.
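For instance, a broader collection that uses the sys feature group mentioned above, following the naming of the earlier example (the duration and output path are placeholders):
socwatch -t 120 -f sys -m -o results/systest -r vtune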
2. If running on a remote target system, copy the import file to the system where VTune Profiler is
installed. The import file has a (*.pwr) extension, such as results/test.pwr from the example
command.
3. Launch VTune Profiler.
4. Click the Import Result button on the toolbar and browse to the import file that you copied from the target system.
When the import completes, the Platform Power Analysis viewpoint opens automatically.
See Also
Interpret Energy Analysis Data with Intel® VTune™ Profiler
NOTE
Collecting energy analysis data with Intel® SoC Watch is available for target Android*, Windows*, or
Linux* devices. Import and viewing of the Intel SoC Watch results is supported with any version of the
VTune Profiler.
After you collect energy analysis data on your target system using the Intel® SoC Watch collector, you can import a result file (*.pwr) to Intel® VTune™ Profiler on your host system and view Platform Power analysis data with the following windows. The windows that appear depend on which metrics are collected:
• Summary Window displays a summary of the data collected. This window is a good starting point for identifying energy issues.
• Correlate Metrics window displays timelines for all collected data in the same time scale. This window is a good starting point for identifying energy issues.
• Bandwidth window displays the DDR SDRAM memory events and bandwidth usage over time.
• Core Wake-ups window displays wake-up events that caused the core to switch from a sleep state to an active state.
• CPU C/P States window displays CPU sleep state and processor frequency data correlated. The data is displayed according to the hierarchy for the platform on which the data was collected, and over time.
• Graphics C/P States window displays graphics sleep state and P-state data collected. The data is displayed by device and over time.
• NC Device window displays the different D0ix sleep states for North Complex devices, overall counts and over time.
• SC Device window displays the different D0ix sleep states for South Complex devices, overall counts and over time.
• Thermal Sample window displays the temperature readings from the cores and SoC.
• Timer Resolution (Windows* OS only) displays the timer resolution and requests to change it, including the process requesting the change.
• Wakelocks window (Android* OS only) displays wakelock data indicating why the system can or cannot enter the ACPI S3 (Suspend-To-RAM) state.
For detailed descriptions of each of these windows, see the Intel VTune Profiler help.
See Also
Android* Targets
Remote Linux Target Setup
Search Directories
NOTE
Collecting energy analysis data with Intel® SoC Watch is available for target Android*, Windows*, or
Linux* devices. Import and viewing of the Intel SoC Watch results is supported with any version of the
VTune Profiler.
After you collect energy analysis data on your target system using the Intel® SoC Watch collector, you can import a result file (*.pwr) to Intel® VTune™ Profiler on your host system. Energy analysis data is opened in the Platform Power analysis viewpoint.
To interpret the performance data provided during the energy analysis, you may follow the steps below:
1. Analyze overall statistics.
2. Identify cores with the highest time spent in C0 state.
Tip
Click the Details link next to the table or graph title on the Summary tab to view more information
about that metric in another tab.
Use the timeline view to understand when state transitions occur. Hover over a chart point to view the sleep
states details for the particular moment of time. The deeper the color of the chart, the deeper the sleep state
of the CPU. Select a region of the graph and zoom into the selection to see detailed sleep state transitions.
See Also
Viewing Source
NOTE
For additional use cases, explore the Intel® VTune™ Profiler Performance Analysis Cookbook.
For example, create a run.bat file on Windows* or run.sh file on Linux* with the following command:
Windows:
6. In the Advanced section, select the Auto Managed code profiling mode and enable the Analyze child
processes option.
Similarly, you can configure an analysis with the VTune Profiler command line interface, vtune. For example,
for the Hotspots analysis on Linux run the following command line:
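(The original command is not reproduced here; the following is a minimal sketch in which the class name and result directory are placeholders.)
vtune -collect hotspots -result-dir r001hs -- java -cp . MyApp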
NOTE The dynamic attach mechanism is supported only with the Java Development Kit (JDK).
To configure Java analysis in the Attach to Process mode under Low-privilege Account (Linux*
Only):
For hardware event-based sampling analysis types, you can attach the VTune Profiler running under the
superuser account to a Java process or a C/C++ application with embedded JVM instance running under a
low-privileged user account. For example, you may attach the VTune Profiler to Java based daemons or
services.
To do this, run the VTune Profiler under the root account, select the Attach to Process target type and
specify the java process name or PID.
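A command-line counterpart, run under the root account, might look like this (a sketch; hotspots in the hardware event-based sampling mode is used for illustration, and the process name is an assumption):
vtune -collect hotspots -knob sampling-mode=hw -target-process java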
NOTE
Due to inlining during the compilation stage, some functions may not appear in the stack by default.
Make sure to select the Show inline functions option for the Inline Mode on the filter bar.
Limitations
VTune Profiler supports analysis of Java applications with some limitations:
• System-wide profiling is not supported for managed code.
• The JVM interprets some rarely called methods instead of compiling them for the sake of performance.
VTune Profiler does not recognize interpreted Java methods and marks such calls as !Interpreter in the
restored call stack.
If you want such functions to be displayed in stacks with their names, force the JVM to compile them by
using the -Xcomp option (show up as [Compiled Java code] methods in the results). However, the
timing characteristics may change noticeably if many small or rarely used functions are being called
during execution.
• When opening source code for a hotspot, the VTune Profiler may attribute events or time statistics to an
incorrect piece of the code. It happens due to JDK Java VM specifics. For a loop, the performance metric
may slip upward. Often the information is attributed to the first line of the hot method's source code. In
the example below, a real hotspot line consuming most CPU time is line 35.
• Consider events and time mapping to the source code lines as approximate.
• For the Hotspots analysis type in the user-mode sampling mode, the VTune Profiler may display only a
part of the call stack. To view the complete stack on Windows, use the -Xcomp additional command line
JDK Java VM option that enables the JIT compilation for better quality of stack walking.
To view the complete stack on Linux, use additional command line JDK Java VM options that change
behavior of the Java VM:
• Use the -Xcomp additional command line JDK Java VM option that enables the JIT compilation for
better quality of stack walking.
• On Linux* x86, use client JDK Java VM instead of the server Java VM: either explicitly specify -client,
or simply do not specify -server JDK Java VM command line option.
• On Linux x64, specify -XX:-UseLoopCounter command line option that switches off on-the-fly
substitution of the interpreted method with the compiled version.
• Java application profiling is supported for the Hotspots and Microarchitecture analysis types. Support for
the Threading analysis is limited as some embedded Java synchronization primitives (which do not call
operating system synchronization objects) cannot be recognized by the VTune Profiler. As a result, some
of the timing metrics may be distorted.
• There are no dedicated libraries supplying a user API for collection control in the Java source code.
However, you may want to try applying the native API by wrapping the __itt calls with JNI calls.
See Also
Enable Java* Analysis on Android* System
Stitch Stacks for Intel® oneAPI Threading Building Blocks or OpenMP* Analysis
NOTE
Only Windows* and Linux* target systems are supported.
3. In the Launch Application configuration pane, specify a path to the installed Python interpreter in the
Application field and a path to your Python script in the Application parameters field.
NOTE
If you specify a relative path to your Python script in the Application parameters field, the VTune
Profiler properly resolves full function or method names only for the imported modules, and does not
resolve the names inside the main script. Consider specifying the absolute path to the script.
In addition, you may select the Auto managed code profiling mode, and the VTune Profiler
automatically detects the type of target executable, managed or native, and switches to the
corresponding mode. Optionally, you may select Analyze child processes option to collect data on
processes launched by the target process. For example, on Linux your configuration may look like this:
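If you prefer the command line, an equivalent launch configuration might look like this (a sketch; the interpreter and script paths are hypothetical):
vtune -collect hotspots -- /usr/bin/python3 /home/user/scripts/my_script.py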
In case your Python application needs to run before the profiling starts or cannot be launched at the
start of this analysis, you may attach the VTune Profiler to the Python process. To do this, select the
Attach to Process target type and specify the Python process name or PID as follows:
NOTE
When you attach the VTune Profiler to the Python process, make sure you initialize the Global
Interpreter Lock (GIL) inside your script before you start the analysis. If GIL is not initialized, the
VTune Profiler collector initializes it only when a new Python function is called.
4. From the HOW configuration pane on the right, select the Hotspots, Threading, or Memory
Consumption analysis type.
5. Configure the following options, if required, or use the defaults:
User-Mode Sampling mode: Select to enable the user-mode sampling and tracing collection for hotspots and call stack analysis (formerly known as Basic Hotspots). This collection mode uses a fixed sampling interval of 10ms. If you need to change the interval, click the Copy button and create a custom analysis configuration.
Hardware Event-Based Sampling mode: Select to enable hardware event-based sampling collection for hotspots analysis (formerly known as Advanced Hotspots). You can configure the following options for this collection mode:
• CPU sampling interval, ms to specify an interval (in milliseconds) between CPU samples. Possible values for the hardware event-based sampling mode are 0.01-1000. 1 ms is used by default.
• Collect stacks to enable advanced collection of call stacks and thread context switches.
NOTE
When changing collection options, pay attention to the Overhead diagram on the
right. It dynamically changes to reflect the collection overhead incurred by the
selected options.
Show additional performance insights check box: Get additional performance insights, such as vectorization, and learn next steps. This option collects additional CPU events, which may enable the multiplexing mode. The option is enabled by default.
Details button: Expand/collapse a section listing the default non-editable settings used for this analysis type. If you want to modify or enable additional settings for the analysis, you need to create a custom configuration by copying an existing predefined configuration. VTune Profiler creates an editable copy of this analysis type configuration.
6. Click the Start button to run the analysis.
Identifying Hotspots
Hotspots analysis in the user-mode sampling mode helps identify sections of your Python code that take a
long time to execute (hotspots), along with their timing metrics and call stacks. It also displays the workload
distribution over threads in the Timeline pane.
By default, the VTune Profiler uses the Auto managed code profiling mode, which enables you to view and analyze mixed stacks for Python/C++ applications. In the example below, you can see a native hotspot Intel® oneAPI Math Kernel Library (oneMKL) function on the left pane. The mixed call stack analysis on the right pane reveals a Python black_scholes function that actually calls the hotspot function:
Double-click the black_scholes function on the Call Stack pane to open the source view on call site line
66:
To view call stacks only inside your Python code, filter out Python core and system functions by selecting
Only user functions option for the Call Stack Mode on the filter bar.
Limitations
VTune Profiler supports Python code profiling with some limitations:
• Only Python distributions 2.6 and later are supported.
• If you use Python extensions that compile Python code to the native language (JIT, C/C++), the VTune
Profiler may show incorrect analysis results. Consider using JIT Profiling API to solve this problem.
• Python code profiling is supported for Windows and Linux target systems only.
• In some cases, the VTune Profiler may not resolve full names of Python functions and modules on
Windows OS. It displays correct source information, so you can view the source directly from the VTune
Profiler's viewpoints.
• Proper thread names are not always displayed in the Timeline pane.
• If your application has very low stack depth, which includes called functions and imported modules, the
VTune Profiler does not collect Python data. Consider using deeper calls to enable the profiling.
• When collecting data remotely, the VTune Profiler may not resolve full function or method names, and
display the source code of your Python script. To solve this problem for Linux targets, copy the source files
to a directory on your host system with a path identical to the path on your target system before running
the analysis.
See Also
knob
mrte-type=python option
Hotspots View
NOTE
Using the Intel C++ compiler is recommended to get more comprehensive diagnostics from the VTune Profiler.
Start exploration of oneTBB parallelization efficiency with Hotspots. Look at the Effective CPU Utilization Histogram to see the parallelization level of your application. Note that the histogram reflects the parallelization level of your application based on effective time, which excludes time spent in threading runtimes.
If you see a significant portion of your elapsed time spent with Idle or Poor CPU utilization, explore the Top
Hotspots table. Flagged oneTBB functions might mean that the application spends CPU time in the oneTBB
runtime because of parallel inefficiencies like scheduling overhead or imbalance. To discover the reason,
hover over the flag.
The Bottom-up tab can give you more details about synchronization or overhead in particular oneTBB
constructs. Expand the Spin Time and Overhead Time columns in the grid to determine why a particular
oneTBB runtime function had a higher than usual execution time. oneTBB runtime functions are flagged when
they consume more than 5% of the CPU time.
For example, a oneTBB runtime function with a high Scheduling value may indicate that your application has
threading work divided into small pieces, which leads to excessive scheduling overhead as the application
calls to the runtime. You can resolve this issue by increasing the threading chunk size.
If there is an idle wait time when the oneTBB runtime does not burn the CPU on synchronization, it is useful
to run the Threading analysis to explore synchronization bottlenecks that can prevent effective CPU
utilization. VTune Profiler recognizes all types of Intel TBB synchronization objects. If you assign a meaningful
name to an object you create in the source code, the VTune Profiler recognizes and represents it in the Result
tab. For performance reasons, this functionality is not enabled by default in oneTBB headers. To make the
user-defined objects visible to the VTune Profiler, recompile your application with
TBB_USE_THREADING_TOOLS set to 1.
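For example, with GCC on Linux this could be as simple as the following (a sketch; the source file and binary names are placeholders):
g++ -O2 -g -DTBB_USE_THREADING_TOOLS=1 my_app.cpp -ltbb -o my_app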
To display an overhead introduced by oneTBB library internals, the VTune Profiler creates a pseudo
synchronization object TBB Scheduler that includes all waits from the oneTBB runtime libraries.
See Also
Cookbook: OpenMP* Code Analysis Method
NOTE
The version of the Intel MPI library included with the Intel Parallel Studio Cluster Edition makes an
important switch to use the Hydra process manager by default for mpirun. This provides high scalability across a large number of nodes.
This topic focuses on how to use the VTune Profiler command line tool to analyze an MPI application. Refer to
the Additional Resources section below to learn more about other analysis tools.
Use the VTune Profiler for a single-node analysis including threading when you start analyzing hybrid codes
that combine parallel MPI processes with threading for a more efficient exploitation of computing resources.
HPC Performance Characterization analysis is a good starting point to understand CPU utilization, memory
access, and vectorization efficiency aspects and define the tuning strategy to address performance gaps. The
CPU Utilization section contains the MPI Imbalance metric, which is calculated for MPICH-based MPIs. Further
steps might include Intel Trace Analyzer and Collector to look at MPI communication efficiency, Memory
Access analysis to go deeper on memory issues, Microarchitecture Exploration analysis to explore
microarchitecture issues, or Intel Advisor to dive into vectorization tuning specifics.
Use these basic steps required to analyze MPI applications for imbalance issues with the VTune Profiler:
1. Configure installation for MPI analysis on Linux host.
2. Configure and run MPI analysis with the VTune Profiler.
3. Control collection with the MPI_Pcontrol function.
4. Resolve symbols for MPI modules.
5. View collected data.
Explore additional information on MPI analysis:
• MPI implementations supported by VTune Profiler
• MPI system modules recognized by VTune Profiler
• Analysis limitations
• Additional resources
• -quiet / -q option suppresses the diagnostic output like progress messages. This option is
recommended, but not required.
• -collect <analysis type> is an analysis type you run with the VTune Profiler. To view a list of available analysis types, use the vtune -help collect command.
• -trace-mpi adds a per-node suffix to the result directory name and adds a rank number to a process
name in the result. This option is required for non-Intel MPI launchers.
• -result-dir <my_result> specifies the path to a directory in which the analysis results are stored.
If an MPI application is launched on multiple nodes, VTune Profiler creates a number of result directories per
compute node in the current directory, named as my_result.<hostname1>, my_result.<hostname2>, ...
my_result.<hostnameN>, encapsulating the data for all the ranks running on the node in the same
directory. For example, the Hotspots analysis (hardware event-based sampling mode) run on 4 nodes collects
data on each compute node:
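A command that produces such per-node results might look like this (a sketch; the rank counts, result name, and binary are placeholders):
mpirun -n 16 -ppn 4 vtune -quiet -collect hotspots -knob sampling-mode=hw -trace-mpi -result-dir my_result -- ./a.out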
For example, to profile only two selected ranks assigned to the second node (myhost2), run the following command:
mpirun -host myhost1 -n 8 ./a.out : -host myhost2 -n 6 ./a.out : -host myhost2 -n 2 vtune -result-dir foo -c hotspots -k sampling-mode=hw ./a.out
As a result, the VTune Profiler creates a result directory in the current directory foo.myhost2 (given
that process ranks 14 and 15 were assigned to the second node in the job).
3. As an alternative to the previous example, you can create a configuration file with the following
content:
NOTE
The examples above use the mpirun command as opposed to mpiexec and mpiexec.hydra while
real-world jobs might use the mpiexec* ones. mpirun is a higher-level command that dispatches to
mpiexec or mpiexec.hydra depending on the current default and options passed. All the listed
examples work for the mpiexec* commands as well as the mpirun command.
Common syntax:
• Pause data collection: MPI_Pcontrol(0)
• Resume data collection: MPI_Pcontrol(1)
• Exclude initialization phase: Use with the VTune Profiler -start-paused option by adding the MPI_Pcontrol(1) call right after the initialization code completes. Unlike ITT API calls, using the MPI_Pcontrol function to control data collection does not require linking the profiled application with a static ITT API library and therefore does not require changes in the build configuration of the application.
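For instance, to skip the initialization phase, start the collection paused and let the application resume it with an MPI_Pcontrol(1) call placed after initialization (a sketch; the launcher arguments are placeholders):
mpirun -n 16 vtune -start-paused -collect hotspots -trace-mpi -result-dir my_result -- ./a.out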
Click the menu button and select Open > Result... and browse to the required result file (*.vtune).
Tip
You may copy a result to another system and view it there (for example, to open a result collected on
a Linux* cluster on a Windows* workstation).
VTune Profiler classifies MPI functions as system functions, similar to Intel® oneAPI Threading Building Blocks (oneTBB) and OpenMP* functions. This approach helps you focus on your code rather than MPI internals.
You can use the VTune Profiler GUI Call Stack Mode filter bar combo box and CLI call-stack-mode option to
enable displaying the system functions and thus view and analyze the internals of the MPI implementation.
The call stack mode User functions+1 is especially useful to find the MPI functions that consumed most of
CPU Time (Hotspots analysis) or waited the most (Threading analysis). For example, in the call chain main()
-> foo() -> MPI_Bar() -> MPI_Bar_Impl() -> ..., MPI_Bar() is the actual MPI API function you use
and the deeper functions are MPI implementation details. The call stack modes behave as follows:
• The Only user functions call stack mode attributes the time spent in the MPI calls to the user function
foo() so that you can see which of your functions you can change to actually improve the performance.
• The default User functions+1 mode attributes the time spent in the MPI implementation to the top-level
system function - MPI_Bar() so that you can easily see outstandingly heavy MPI calls.
• The User/system functions mode shows the call tree without any re-attribution so that you can see
where exactly in the MPI library the time was spent.
NOTE
VTune Profiler prefixes the profile version of MPI functions with P, for example: PMPI_Init.
VTune Profiler provides oneTBB and OpenMP support. Use these thread-level parallel solutions in addition to
MPI-style parallelism to maximize the CPU resource usage across the cluster, and to use the VTune Profiler to
analyze the performance of that level of parallelism. The MPI, OpenMP, and oneTBB features in the VTune
Profiler are functionally independent, so all usual features of OpenMP and oneTBB support are applicable
when looking into a result collected for an MPI process. For hybrid OpenMP and MPI applications, the VTune
Profiler displays a summary table listing top MPI ranks with OpenMP metrics sorted by MPI Busy Wait from
low to high values. The lower the Communication time is, the longer a process was on a critical path of MPI
application execution. For deeper analysis, explore OpenMP analysis by MPI processes lying on the critical path.
Example:
This example displays the performance report for functions and modules analyzed for any analysis type. Note
that this example opens per-node result directories (result_dir.host1, result_dir.host2) and groups data by processes (MPI ranks) encapsulated in the per-node result:
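(The original command is not reproduced here; a minimal sketch, run once per per-node result directory, could be the following.)
vtune -report hotspots -group-by process -result-dir result_dir.host1
vtune -report hotspots -group-by process -result-dir result_dir.host2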
• An MPI implementation needs to operate in cases when the VTune Profiler process (vtune) exists between the launcher process (mpirun/mpiexec) and the application process. This means that the communication information should be passed using environment variables, as most MPI implementations do. VTune Profiler does not work with an MPI implementation that tries to pass communication information from its immediate parent process.
NOTE
This list is provided for reference only. It may change from version to version without any additional
notification.
Analysis Limitations
• VTune Profiler does not support MPI dynamic processes (for example, the MPI_Comm_spawn dynamic process API).
Additional Resources
For more details on analyzing MPI applications, see the Intel Parallel Studio Cluster Edition and online MPI
documentation at http://software.intel.com/en-US/articles/intel-mpi-library-documentation/. For information
on installing VTune Profiler in a cluster environment, see the Intel VTune Profiler Installation Guide for Linux.
There are also other resources available online that discuss usage of the VTune Profiler with other Parallel
Studio Cluster Edition tools:
• Tutorial: Analyzing an OpenMP* and MPI Application available from https://software.intel.com/en-us/
articles/intel-vtune-amplifier-tutorials
• Hybrid applications: Intel MPI Library and OpenMP at http://software.intel.com/en-US/articles/hybrid-
applications-intelmpi-openmp/
See Also
Cookbook: Profiling MPI Applications
Specify Search Directories from Command Line
HPC Performance Characterization Analysis
NOTE
This is a PREVIEW FEATURE. A preview feature may or may not appear in a future production
release. It is available for your use in the hopes that you will provide feedback on its usefulness and
help determine its future. Data collected with a preview feature is not guaranteed to be backward
compatible with future releases.
NOTE
The Fabric Profiler tool is distributed as part of Intel® VTune™ Profiler. Full documentation of the tool,
examples, and pre-collected trace files are available in the Fabric Profiler package.
esp_enter("<region_name>");
esp_exit("<region_name>");
c. Rebuild the application.
c. There are many Fabric Profiler configuration parameters. The module sets them to default values
which are sufficient when you run your application for the first time. The configuration parameters
are described in a separate section.
d. For a dynamic application, add the data collector library to the LD_PRELOAD variable.
For example:
export LD_PRELOAD=$ESP_ROOT/lib/libesp.so:$LD_PRELOAD
srun --export=LD_PRELOAD,ALL <rest of srun command>
If you have loaded the esp module, the environment variable ESP_LIB contains the path to
libesp.so. See the sample job scripts *.slurm and *.lsf in the examples directory.
If the ESP_VERBOSITY_LEVEL environment variable is set correctly and the banners do not display on
function call, contact esp-support@intel.com for further assistance.
2. Merge the trace files.
The Fabric Profiler banner lists the path to the trace files. To merge traces, run esp_merge_traces.sh
script:
$ESP_ROOT/bin/esp_merge_traces.sh \
<path to application executable> <path to trace directory> <number of PEs>
3. Copy the trace files in the root level of the traces directory to the machine where you have installed the
analyzer.
NOTE
espr is a general report that summarizes all of the trace data in HTML format. Each sample application
in the examples directory includes this report so you can view the report for the sample application
without running the SHMEM application or MATLAB runtime. The esp/examples/samples/html
directory contains files named {app name}_{number of PEs}.html and associated directories named
{app name}_{number of PEs}_html_files. Open the HTML file in a browser to view the report
generated by the analyzer from the corresponding trace files in esp/examples/output/samples/
trace.
Types of Analyzers
This table describes each analyzer in the Fabric Profiler package, along with associated operations that you
can perform.
espba (Barrier Trace Analyzer)
Purpose: Reads the function trace file and displays barrier wait times for each barrier call in the source code for each PE.
Suggested operations:
• Take any of these measurements: PE wait time, PE arrival time, Node wait density, PE percent Late, PE Outlier Late.
• Vary the threshold.
• Restrict your results to a specific lexical occurrence (a particular source code line containing a barrier).
espfbla (Fabric Backlog Analyzer)
Purpose: Reads the put trace file and correlates that with the HFI trace file to visualize fabric backlog at any point in time.
Suggested operations:
• Select "Show Region Bounds" and choose regions of interest. If the SHMEM code defined code regions, the temporal regions are highlighted on the graph of network backlog against time.
• Select an individual node to display its associated backlog.
• View injection and/or ejection backlog (requested less actual):
• Injection requested: data sent off-node by this node in the application
• Injection actual: data sent into the network by the HFI
• Ejection requested: data sent by other nodes in the application to this node
• Ejection actual: data received from the network according to the HFI
• Zoom and pan to bring areas into focus.
• Try offset adjustment modes.
• Switch between toggle and rate displays.
• Use the data cursor. Click on the widget first, then click anywhere on the plot to see data values for that point.
espla (Function (latency) Trace Analyzer)
Purpose: Reads the function trace file and displays function latency for all instrumented SHMEM calls. Trace files that contain ~100,000s of function calls can take several minutes to
Suggested operations:
• Select individual function calls to display latency hot spots for each call.
• If the application defined Fabric Profiler regions, click View Regions. Choose regions to
NOTE
You may also configure a custom analysis to collect GPU usage data. To do this, select the GPU
Utilization option in the analysis configuration. This option introduces the least overhead during the
collection, while the Analyze Processor Graphics hardware events option adds medium overhead, and the Trace GPU Programming APIs option adds the highest overhead.
If you select the Compute Basic preset during the analysis configuration, VTune Profiler analyzes metrics that
distinguish accessing different types of data on a GPU and displays the Occupancy section. See information
about GPU tasks with low occupancy and understand how you can achieve peak occupancy:
If the peak occupancy is flagged as a problem for your application, inspect factors that limit the use of all
the threads on the GPU. Consider modifying your code with corresponding solutions:
SLM size requested per workgroup in a computing task is too high: Decrease the SLM size or increase the Local size.
Barrier synchronization (the sync primitive can cause low occupancy due to a limited number of hardware barriers on a GPU subslice): Remove barrier synchronization or increase the Local size.
If the occupancy is flagged as a problem for your application, change your code to improve hardware thread
scheduling. These are some reasons that may be responsible for ineffective thread scheduling:
• A tiny computing task could cause considerable overhead when compared to the task execution time.
• There may be high imbalance between the threads executing a computing task.
The Compute Basic preset also enables an analysis of the DRAM bandwidth usage. If the GPU workload is
DRAM bandwidth-bound, the corresponding metric value is flagged. You can explore the table with GPU
computing tasks heavily using the DRAM bandwidth during execution.
If you select the Full Compute preset and multiple run mode during the analysis configuration, the VTune
Profiler will use both Overview and Compute Basic event groups for data collection and provide all types of
reasons for the EU array stalled/idle issues in the same view.
NOTE
To analyze Intel® HD Graphics and Intel® Iris® Graphics hardware events, make sure to set up your system for GPU analysis.
To analyze GPU performance data per HW metrics over time, open the Graphics window and focus on the Timeline pane. The list of GPU metrics displayed in the Graphics window depends on the hardware events preset selected during the analysis configuration.
The example below shows the Overview group of metrics collected for the GPU bound application:
The first metric to look at is GPU Execution Units: EU Array Idle metric. Idle cycles are wasted cycles. No
threads are scheduled and the EUs' precious computational resources are not being utilized. If EU Array
Idle is zero, the GPU is reasonably loaded and all EUs have threads scheduled on them.
In most cases the optimization strategy is to minimize the EU Array Stalled metric and maximize the EU
Array Active. The exception is memory bandwidth-bound algorithms and workloads where optimization
should strive to achieve a memory bandwidth close to the peak for the specific platform (rather than
maximize EU Array Active).
Memory accesses are the most frequent reason for stalls. The importance of memory layout and carefully
designed memory accesses cannot be overestimated. If the EU Array Stalled metric value is non-zero and
correlates with the GPU L3 Misses, and if the algorithm is not memory bandwidth-bound, you should try to
optimize memory accesses and layout.
Sampler accesses are expensive and can easily cause stalls. Sampler accesses are measured by the
Sampler Is Bottleneck and Sampler Busy metrics.
Explore Execution of OpenCL™ Kernels
If you know that your application uses OpenCL software technology and the GPU Computing Threads
Dispatch metric in the Timeline pane of the Graphics window confirms that your application is doing
substantial computation work on the GPU, you may continue your analysis and capture the timing (and other
information) of OpenCL kernels running on Intel Graphics. To run this analysis, enable the Trace GPU
Programming APIs option during analysis configuration. The GPU Compute/Media Hotspots analysis
enables this option by default.
The Summary view shows OpenCL kernels running on the GPU in the Hottest GPU Computing Tasks
section and flags the performance-critical kernels. Clicking such a kernel name opens the Graphics window
grouped by Computing Task (GPU) / Instance. You may also want to group the data in the grid by the
Computing Task. VTune Profiler identifies the following computing task purposes: Compute (kernels),
Transfer (OpenCL routines responsible for transferring data from the host to a GPU), and Synchronization
(for example, clEnqueueBarrierWithWaitList).
The corresponding columns show the overall time a kernel ran on the GPU and the average time for a single invocation (corresponding to one call of clEnqueueNDRangeKernel), work group sizes, as well as averaged GPU hardware metrics collected for a kernel. Hover over a metric column header to read the metric
description. If a metric value for a computing task exceeds a threshold set up by Intel architects for the
metric, this value is highlighted in pink, which signals a performance issue. Hover over such a value to read
the issue description.
Analyze and optimize hot kernels with the longest Total Time values first. These include kernels characterized
by long average time values and kernels whose average time values are not long, but they are invoked more
frequently than the others. Both groups deserve attention.
To view details on OpenCL kernels submission and analyze the time spent in the queue, explore the
Computing Queue data in the Timeline pane of the Graphics or Platform window.
See Also
Intel® Media SDK Program Analysis
(Linux* only)
Configure GPU Analysis from Command Line
Follow these steps to explore the data provided by the VTune Profiler for OpenCL application analysis:
1. Explore summary statistics:
• Analyze GPU usage.
• Identify why execution units (EUs) were stalled or idle.
• Identify OpenCL kernels overutilizing both Floating Point Units (FPUs).
2. Analyze hot GPU OpenCL kernels.
3. Correlate OpenCL kernels data with GPU metrics.
4. Explore the computing queue.
5. Analyze source and assembly code.
You can correlate this data with the GPU Time used by GPU engines while your application was running:
If the GPU Time takes a significant portion of the Elapsed Time (95.6%), it clearly indicates that the
application is GPU-bound. You see that 94.4% of the GPU Time was spent on the OpenCL kernel execution.
For OpenCL applications, the VTune Profiler provides a list of OpenCL kernels with the highest execution time
on the GPU:
Mouse over the flagged kernels to learn what kind of performance problems were identified during their
execution. Clicking such a kernel name in the list opens the Graphics window grouped by computing tasks,
sorted by the Total Time, and with this kernel selected in the grid.
Depending on the GPU hardware events preset you used during the analysis configuration, the VTune Profiler
explores potential reasons for stalled/idle GPU execution units and provides them in the Summary. For
example, for the Compute Basic preset, you may analyze GPU L3 Bandwidth Bound issues:
In this example, EU stalls are caused by high GPU L3 bandwidth. You may click the hottest kernels in the list to switch to the Graphics view and drill down to the Source or Assembly views of the selected kernel to identify possible options for cache reuse.
If your application spends more than 80% of the collection time heavily utilizing floating point units, the VTune Profiler highlights this value as an issue and lists the kernels that overutilized the FPUs:
You can switch to the Timeline pane on the Graphics tab and explore the distribution of the GPU EU
Instructions metric that shows the FPU usage during the analysis run:
In the example below, the Accelerator_Intersect kernel took the most time to execute (53.398s). The GPU metrics collected for this workload show high L3 bandwidth usage and time spent in stalls when executing this kernel. For compute-bound code, this indicates that the performance might be limited by cache usage.
Analyze and optimize hot kernels with the longest Total Time values first. These include kernels characterized
by long average time values and kernels whose average time values are not long, but they are invoked more
frequently than the others. Both groups deserve attention.
If a kernel instance used the OpenCL 2.0 Shared Virtual Memory (SVM), the VTune Profiler detects it and,
depending on your hardware, displays the SVM usage type as follows:
• Coarse-Grained Buffer SVM: Sharing occurs at the granularity of regions of OpenCL buffer memory
objects. Cross-device atomics are not supported.
• Fine-Grained Buffer SVM: Sharing occurs at the granularity of individual loads and stores within
OpenCL buffer memory objects. Cross-device atomics are optional.
• Fine-Grained System SVM: Sharing occurs at the granularity of individual loads/stores occurring
anywhere within the host memory. Cross-device atomics are optional.
Every clCreateKernel call results in a line in the Compute category. If two different kernels with the same name (even from the same source) were created with two clCreateKernel calls (and then invoked through two or more clEnqueueNDRangeKernel calls), two lines with the same kernel name appear in the table. If they are enqueued twice with a different global or local size or different sets of SVM arguments, they are also listed separately in the grid. To aggregate data per the same kernel source, use the Computing Task Purpose/Source Computing Task (GPU) grouping.
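For illustration, here is a minimal host-side sketch of this pattern, assuming an already built cl_program and cl_command_queue; the kernel name my_kernel and the work sizes are placeholders (not taken from this guide), and kernel argument setup is omitted:

#include <CL/cl.h>

void enqueue_same_kernel_twice(cl_program program, cl_command_queue queue)
{
    cl_int err = CL_SUCCESS;
    size_t gws = 1024, lws = 64;                 /* placeholder work sizes */

    /* Two kernel objects created from the same program and kernel name... */
    cl_kernel kA = clCreateKernel(program, "my_kernel", &err);
    cl_kernel kB = clCreateKernel(program, "my_kernel", &err);

    /* ...and enqueued separately appear as two lines with the same kernel
       name under the Computing Task grouping.                             */
    clEnqueueNDRangeKernel(queue, kA, 1, NULL, &gws, &lws, 0, NULL, NULL);
    clEnqueueNDRangeKernel(queue, kB, 1, NULL, &gws, &lws, 0, NULL, NULL);

    clReleaseKernel(kA);
    clReleaseKernel(kB);
}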
NOTE
GPU hardware metrics are available if you enabled the Analyze Processor Graphics events option
for Intel® HD Graphics or Intel® Iris® Graphics. To collect these metrics, make sure to set up your
system for GPU analysis.
You may find it easier to analyze your OpenCL application by exploring the GPU hardware metrics per GPU
architecture blocks. To do this, choose the Computing Task grouping level in the Graphics window, select
an OpenCL kernel of interest and click the Memory Hierarchy Diagram tab in the Timeline pane. VTune
Profiler updates the architecture diagram for your platform with performance data per GPU hardware metrics
for the time range the selected kernel was executed.
Currently this feature is available starting with the 4th generation Intel® Core™ processors and the Intel®
Core™ M processor, with a wider scope of metrics presented for the latter one.
NOTE
You can right-click the Memory Hierarchy Diagram, select Show Data As and choose a format of
metric data representation:
• Total Size
• Bandwidth (default)
• Percent of Bandwidth Maximum Value
VTune Profiler displays kernels with the same name and size in the same color. Synchronization tasks are marked with vertical hatching. Data transfers (OpenCL routines responsible for transferring data from the host system to a GPU) are marked with cross-diagonal hatching.
NOTE
In the Attach mode, if you attach to a process after the computing queue has already been created, VTune Profiler will not display data for the OpenCL kernels in this queue.
Analyze the assembler code provided by your compiler for the OpenCL kernel, estimate its complexity,
identify issues, match the critical assembly lines with the affected source code, and optimize, if possible. For
example, if you see that some code lines were compiled into a high number of assembly instructions,
consider simplifying the source code to decrease the number of assembly lines and make the code more
cache-friendly.
Explore GPU metrics data per computing task in the Graphics window and drill down to the Source/
Assembly view to explore instructions that may have contributed to the detected issues. For example, if you
identified the Sampler Busy or Stalls issues in the Graphics window, you may search for the send
instructions in the Assembly pane and analyze their usage since these instructions often cause frequent
stalls and overload the sampler. Each send/sends instruction is annotated with comments in square brackets
that show a purpose of the instruction, such as data reads/writes (for example, Typed/Untyped Surface
Read), accesses to various architecture units (Sampler, Video Motion Estimation), end of a thread
(Thread Spawner), and so on. For example, this sends instruction is used to access the Sampler unit:
0x408 260 sends (8|M0) r10:d r100 r8 0x82 0x24A7000 [Sampler, msg-length:1, resp-length:4, header:yes, func-control:27000]
NOTE
• Source/Assembly support is available for OpenCL programs with sources and for kernels created with IL (intermediate language), if the intermediate SPIR-V binary was built with the -gline-tables-only -s <cl_source_file_name> option.
• The Source/Assembly analysis is not supported for the source code using the #line directive.
• If your OpenCL kernels use inline functions, you can enable the Inline Mode filter bar option to view
inline functions in the grid and analyze them in the Source view.
See Also
GPU Compute/Media Hotspots Analysis (Preview)
Configure Target
Launch the VTune Profiler with root privileges and configure analysis for your Intel Media SDK target.
For the Launch Application mode, follow the standard project setup and analysis target setup process and
specify your application or a script as a target. VTune Profiler automatically sets environment variables and,
on Linux, creates an .mfx_trace configuration file for Intel Media SDK program analysis.
For the Attach To Process and Profile System modes, the .mfx_trace file is not created by the VTune Profiler automatically, which makes the Intel Media SDK program analysis incomplete. You need to manually enable MFX tracing as follows:
1. Configure the system to include ITT traces in the result.
For Linux:
export INTEL_LIBITTNOTIFY32=/opt/intel/oneapi/vtune/latest/lib32/runtime/libittnotify_collector.so
export INTEL_LIBITTNOTIFY64=/opt/intel/oneapi/vtune/latest/lib64/runtime/libittnotify_collector.so
For Windows:
Run Analysis
1.
See Also
GPU Application Analysis on Intel® HD Graphics and Intel® Iris® Graphics
knob enable-gpu-runtimes
to enable Intel Media SDK program analysis from command line
VTune Profiler automatically sets up thresholds for slow and fast frame rates. You may change them, if needed, by dragging the slider at the bottom of the histogram. The thresholds you set are automatically applied to all subsequent results for this project.
Switch to the Bottom-up window and group the data in the grid by Frame Domain/Frame Duration
Type/Function/Call Stack:
This grouping displays frame analysis metrics, including Frame Time, which is the wall time during which frames were active. Focus on the frames with the highest Frame Time values. Expand a frame domain node to see frames grouped by frame duration. You may select slow frames, right-click, and select Filter In by Selection to filter out all the data other than slow frames in this domain. Then you may group the data back by Function/Call Stack to see the functions that took most of the time in these slow frames:
Analyze the Timeline
In the Bottom-up window, analyze the frame data represented in the Timeline pane. If you filtered the grid
by slow frames, the Timeline data is also automatically filtered to display data for the selected frames:
The scale area displays frame markers. Hovering over a marker opens a tooltip with details on frame
duration, frame rate and so on.
The Frame Rate band displays how the frame rate is changing over time. To understand the cause of the
bottleneck, identify sections with the Slow or Fast frame types and analyze the CPU Utilization data. For
example, you may detect a Slow frame rate in a section with poor CPU utilization or thread contention. In this case, you may parallelize the code to utilize CPU resources more effectively or optimize the thread management.
To identify a hotspot function containing the critical frame from the Timeline view, select the range with the
Slow or Fast frame rate. VTune Profiler highlights the selected frame in the Bottom-up grid.
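For reference, frame data of this kind is typically produced by the ITT Frame API in the profiled application. Below is a minimal sketch in C; the domain name and the per-frame work are placeholders rather than names used by this guide:

#include <ittnotify.h>

void render_loop(int frame_count)
{
    /* The domain name defines the Frame Domain shown in the grid. */
    __itt_domain *domain = __itt_domain_create("MyApp.Rendering");

    for (int i = 0; i < frame_count; i++) {
        __itt_frame_begin_v3(domain, NULL);
        /* ... per-frame work to be measured ... */
        __itt_frame_end_v3(domain, NULL);
    }
}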
Task Analysis
Focus your performance analysis on a task - program
functionality performed by a particular code section.
Use the Intel® VTune™ Profiler to analyze the following types of tasks:
• ITT API tasks: Analyze performance of particular code regions (tasks) if your target uses the Task API to mark task regions (see the sketch after this list) and you enabled the Analyze user tasks, events and counters option during the analysis type configuration.
• Platform tasks: Analyze tasks enabled for analysis of Ftrace* events, Atrace* events, Intel Media SDK
programs, OpenCL™ kernels, and so on.
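As referenced in the first item above, task regions are marked in the target source with the Task API. Below is a minimal sketch in C; the domain and task names are placeholders, and the target is assumed to be linked against the ittnotify library shipped with VTune Profiler:

#include <ittnotify.h>

static __itt_domain        *domain;
static __itt_string_handle *task_name;

static void do_work(void)
{
    __itt_task_begin(domain, __itt_null, __itt_null, task_name);
    /* ... code region attributed to the task ... */
    __itt_task_end(domain);
}

int main(void)
{
    domain    = __itt_domain_create("MyApp.Domain");   /* placeholder */
    task_name = __itt_string_handle_create("my_task"); /* placeholder */
    do_work();
    return 0;
}

To collect data for the marked tasks: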
1. Click the Configure Analysis button on the VTune Profiler toolbar (standalone GUI or Visual Studio IDE).
2. Choose the analysis type from the HOW pane.
3. Select the Analyze user tasks, events, and counters option.
4. Click the Start button to run the analysis.
VTune Profiler collects data detecting the marked tasks.
Analyze the collected results to identify the task regions and task duration versus application performance
over time.
To interpret the data provided during the user task analysis, you may use the following options:
• Identify the most critical tasks.
• Analyze slow tasks per function.
• Analyze tasks per thread.
If you collected data for Ftrace/Atrace tasks using the System Overview or a custom analysis with Ftrace/
Atrace events selected, the Summary window also provides the Task Duration Histogram that helps you
identify slow tasks:
Use the Task Type drop-down list to switch between different tasks and analyze their duration. Based on the
thresholds set up for the task duration, you can understand whether the duration of the selected task is
acceptable or slow.
Analyze Slow Tasks per Function
Click a task type in the Top Tasks section to switch to the grid view (for example, Bottom-up or Event
Count) grouped by the Task Type granularity. The task selected in the Summary window is highlighted.
For example, for ITT API tasks collected during the Threading analysis the Bottom-up grid view is grouped
by Task Type/Function/Call Stack:
In the example above, the func4_task task has the longest duration - 2.923 seconds. You may expand the
node to see the function this task belongs to. Double-click the function to analyze the source code in the
Source view.
For Ftrace/Atrace tasks collected during the System Overview analysis, you may select the Task Type/Task
Duration Type/Function/Call Stack granularity and explore functions executed while a slow task instance
was running. You may double-click the function to open its source code and analyze the most time-
consuming source lines.
User tasks are shown on the timeline with yellow markers. Hover over a task marker for task execution details. In the example above, the func2_task task started 3.4 seconds into the application execution on the thread threadstartex (TID: 8684) and lasted for 3.002 seconds.
If you collected platform-wide metrics, you may switch to the Platform window and identify threads
responsible for particular tasks. Each task shows up in the Thread section as a separate layer.
For Ftrace/Atrace tasks, the Platform view provides an option to enable Slow Tasks markers and explore
the CPU utilization, GPU usage and power consumption at the moment of slow tasks execution:
See Also
Pane: Timeline
Switch Viewpoints
View Instrumentation and Tracing Technology (ITT) API Task Data in Intel® VTune™ Profiler
NOTE
The Start button may be disabled if you either did not specify the analysis target or selected the
analysis type that is not supported by your processor.
To pause the analysis at the application start and then manually resume it when required, click the
Start Paused button.
NOTE
• You can provide a meaningful name for the result (for example, application name) for better
identification. To do this, select the result, right-click and choose Rename. The file extension
*.vtune cannot be changed.
• To change the result name template or the default directory for result location, go to Tools > Options (or Options... in the standalone interface menu) and select Intel VTune Profiler <version> > Result Location from the left pane of the Options dialog box.
• You may program hot keys to start/stop a particular analysis. For more details, see http://software.intel.com/en-us/articles/using-hot-keys-in-vtune-amplifier-xe/.
See Also
Pause Data Collection
Finalization
Finalization
Finalization is the process by which Intel® VTune™ Profiler converts the collected data to a database, resolves symbol information, and pre-computes data to make further analysis more efficient and responsive.
VTune Profiler finalizes data automatically when data collection completes.
VTune Profiler provides the following finalization modes:
• Full mode is used to perform the finalization on unchanged sampling data on the target system. This
mode takes the most time and resources to complete, but produces the most accurate results.
• Fast (default) mode is used to perform the finalization on the target system using algorithmically reduced
sampling data. This greatly reduces the finalization time with a negligible impact on accuracy in most
cases.
• Deferred mode is used to collect the sampling data and calculate the binary checksums to perform the
finalization on another machine. After data collection completes, you can finalize and open the analysis
result on the host system. This mode may be useful for profiling applications on targets with limited
computational resources, such as IoT devices, and finalizing the result later on the host machine.
• None option is used to skip finalization entirely and to not calculate the binary checksums. You can also
finalize this result later, however, you may encounter certain limitations. For example, if the binaries on
the target system have changed or have become unavailable since the sampling data collection, binary
resolution may produce an inaccurate or missing result for the affected binary.
2. From the WHERE pane, click the Browse button, choose a target system, and specify the required details.
3. From the WHAT pane, click the Browse button to choose an appropriate target type.
4. Expand the Advanced section on the WHAT pane and scroll down to select the required finalization mode, for example: Deferred to use another system.
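If you collect from the command line, the finalization mode can be selected there as well. A possible invocation, with hypothetical application and result directory names (see the finalization-mode and finalize option descriptions for the exact syntax):

vtune -collect hotspots -finalization-mode=deferred -- ./myapp
vtune -finalize -r r000hs -search-dir /path/to/binaries

The second command performs the deferred finalization later, for example on a more capable host.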
NOTE
When the analysis result is collected and open, you can always check the used finalization mode in the
Summary view > Collection and Platform Info section.
Re-Finalize Results
You may want to re-finalize a result to:
• update symbol information after changes in the search directories settings
• resolve [Unknown] entries in the results
Beware that re-finalization can lead to wrong results if you do not have the original binaries for your target
on the machine performing the re-finalization; for example, if you recompiled the target. The re-finalization
deletes the old database and then picks up the newer versions of the binaries. Since the collector raw data
does not contain a binary checksum, the VTune Profiler does not know when a binary has changed and
attempts to resolve the symbols matching the old addresses against the new binary. As a result, the VTune
Profiler may unwind stacks incorrectly and resolve samples to the wrong functions. To avoid this, make sure
you configured the search directories to use the correct files.
By default, the VTune Profiler saves the raw collector data after finalization. You may choose to remove these
data to reduce the size of the result file if you do not plan to re-finalize this result in the future. To remove
the raw collector data, from the Microsoft Visual Studio* menu go to Tools > Options > Intel VTune
Profiler <version> > General pane and select the Remove raw collector data after result finalization
option. To remove the raw collector data in the standalone interface, click the menu button and
select Options... > General.
To re-finalize a result in the Microsoft Visual Studio* IDE, select the result in the Solution Explorer, right-click
and select Re-resolve and Open.
To re-finalize a result in the standalone VTune Profiler interface:
1.
See Also
finalization-mode
vtune option
Search Directories
Use the Pause/Resume API to Insert Calls into Your Code to Start and Stop the Analysis
To get details on using the Pause/Resume API, see the Collection Control API topic.
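A minimal sketch of that pattern in C, assuming the collection is started in the paused state (with the Start Paused button or the -start-paused option) and the application is linked with the ittnotify library:

#include <ittnotify.h>

void workload(void)
{
    __itt_resume();   /* start recording the region of interest */
    /* ... code to analyze ... */
    __itt_pause();    /* stop recording again                   */
}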
When the data collection is complete, the VTune Profiler displays paused regions in the Timeline pane as
follows:
See Also
start-paused vtune option
Problem: Unexpected Paused Time
When the data size limit is reached and the data collection is suspended, click the Stop button on the command toolbar at the bottom of the Configure Analysis window. VTune Profiler proceeds with the analysis of the collected data. If you want to extend the data collection for your target application for future analysis runs, you may modify the default size limit for collected data as follows:
1.
data-limit
vtune option
ring-buffer
vtune option
Manage Analysis Duration from Command Line
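For reference, the collection data limit can also be set directly from the command line with the data-limit option. A possible invocation with a hypothetical application name (check the data-limit option description for the exact semantics, for example whether 0 disables the limit):

vtune -collect hotspots -data-limit=1000 -- ./myapp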
2. Click the Configure Analysis toolbar button (standalone GUI or Visual Studio IDE) to choose and configure your analysis.
The Configure Analysis window opens.
3. From the HOW pane, choose a predefined or custom analysis type and configure the required settings.
4.
5. Click the Copy button to copy the command line to the clipboard.
6. Paste the copied command line to the shell.
7. Optionally, edit the application data in the command line as required.
If you analyze a remote application from the local host, make sure to:
• Set up your remote Linux or Android target system for data collection.
• Specify the correct path to the remote application in the command line.
• Use the -target-system=<system_details> option to specify your remote target address (for
Linux) or device name (for Android). For example:
host>./vtune -target-system=ssh:user@hostName -collect hotspots -- myapp
8. Press Enter to launch the analysis from the command line.
VTune Profiler collects the data and saves the result to the analysis result directory under your working
directory.
9. Open your data collection result file in the GUI or as a text-based command line report.
NOTE
To enable analyzing the source code, make sure to copy the required symbol/source files from your
remote machine and update the search directories in the Binary/Symbol Search or Source Search
dialog boxes.
See Also
Collect Data on Remote Linux* Systems from Command Line
target-system
vtune option
Intel® VTune™ Profiler Command Line Interface
Sampling Interval
This option configures the amount of wall-clock time the VTune Profiler waits before collecting each sample.
The smaller the Sampling Interval, the larger the number of samples collected and written to the disk. The
minimal value of the sampling interval depends on the system:
• 10 milliseconds for systems with a single CPU
• 15 milliseconds for systems with multi-core CPUs
To disable/modify the sampling interval value:
From GUI:
1. In the Configure Analysis window > HOW pane, click the Browse button and select an analysis type,
for example, Hotspots and use the Hardware Event-based Sampling mode.
2. For the CPU sampling interval, ms option, specify a required value.
From CLI:
Use the -knob sampling-interval=<value> option. For example:
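A possible invocation with a hypothetical application name, assuming the Hotspots analysis in the hardware event-based sampling mode:

vtune -collect hotspots -knob sampling-mode=hw -knob sampling-interval=5 -- ./myapp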
Stack Size
This option is used to specify the size of a raw stack (in bytes) to process during hardware event-based
sampling collection. Zero value means unlimited size. Possible values are numbers between 0 and
2147483647.
To disable/modify this option:
From GUI:
1. In the Configure Analysis window > HOW pane, click the Browse button and select the Custom
Analysis > your_custom_analysis type.
2. In the custom configuration, decrease the Stack size, in bytes value.
From CLI:
Use the -stack-size option, for example:
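A possible invocation with a hypothetical application name (shown here with the Hotspots analysis in hardware event-based sampling mode rather than a custom analysis):

vtune -collect hotspots -knob sampling-mode=hw -stack-size=8192 -- ./myapp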
See Also
Custom Analysis
knob
vtune option
stack-size
vtune option
• Application mode: You can leverage the statistics collected by your target application to enhance the VTune Profiler analysis. For example, a part of your application may be executed many times in one run, and some of these instances may exhibit a performance problem. You can retrieve the time frames where problems occur from your application log file and supply this data to the VTune Profiler.
• Custom collector mode: If you cannot or do not want to collect statistics directly from your application during the VTune Profiler analysis, you may either create a custom collector or use an existing external collector (for example, ftrace, ETW, or logcat) and launch it from the VTune Profiler. To enable this mode, configure a VTune Profiler analysis type to use the Custom collector option and specify a command starting your external collector.
Convert Custom Data to the CSV Format and Import It to VTune Profiler
To import the externally collected data to the VTune Profiler:
1. Convert the collected custom data to a csv file with a predefined structure.
To do this for the custom collector mode, you need to configure the collector to output the data in the
required CSV format using the VTUNE_HOSTNAME environment variable that identifies the name of the
current host required for the csv file format. For the application mode, you may identify the hostname
from the Computer name field provided in the Summary window for your result, or from the summary
command line report.
2. Import the csv file to the VTune Profiler result using any of the following options:
in GUI:
a. Open the VTune Profiler result that was launched in parallel with the external data collection.
b. Open the Analysis Target tab, or Analysis Type tab.
c. Click the Import from CSV button on the command toolbar on the left.
The Choose a File to Import dialog box opens.
d. Navigate to the required csv file and click Open. You may import several csv files at a time.
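Alternatively, the csv file can be imported from the command line with the import option. A possible invocation with hypothetical file and result directory names (see the import option description for the exact syntax):

vtune -import my_data-hostname-myhost.csv -r r000hs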
NOTE
Importing a csv file to the VTune Profiler result does not affect symbol resolution in the result. For
example, you can safely import a csv file to a result located on a system where module and debug
information is not available.
NOTE
If you develop a custom collector yourself, you may use the VTUNE_DATA_DIR environment variable to
make your collector identify the VTune Profiler result directory and automatically save the custom
collection result (in the CSV format) to this directory. In this case, external statistics will be imported
to the VTune Profiler result automatically.
See Also
Use a Custom Collector
custom-collector
vtune option
The following environment variables, provided by the VTune Profiler, enable the custom collector to do this:

AMPLXE_DATA_DIR
    Identify a path to the VTune Profiler analysis result. The custom collector uses this path to save the output csv file and make it accessible for the VTune Profiler that adds the csv data to the native VTune Profiler result.
AMPLXE_HOSTNAME
    Identify the full hostname of the machine where data was collected. The hostname is a mandatory part of the csv file name.
AMPLXE_COLLECT_CMD
    Manage a custom data collection. The custom collector may receive the values listed below. After any of these commands, the custom collector should exit immediately and return control to the VTune Profiler.
    NOTE For each command, the custom collector will be re-launched.
    start    Start custom data collection. If required, the custom collector may create a background process.
    stop     Stop data collection (background process), convert data to a csv file, copy it to the result directory (specified by AMPLXE_DATA_DIR), and return control to the VTune Profiler.
AMPLXE_COLLECT_PID
    Identify a Process ID of the application to analyze. VTune Profiler sets this environment variable to the PID of the root target process. The custom collector may use it, for example, to filter the data. VTune Profiler sets this variable only when profiling in the Launch Application or Attach to Process mode. For system-wide profiling, the value is empty. When your profiled application spawns a tree of processes, the AMPLXE_COLLECT_PID variable points to the PID of the launched or attached process. This is important to know in case of using a script to launch a workload, since you may need to use your own means to pass the child process PID to the custom collector.
The templates below demonstrate an interaction between the VTune Profiler and a custom collector:
Example in Python:
import os

def main():
    cmd = os.environ['AMPLXE_COLLECT_CMD']
    if cmd == "start":
        path = os.environ['AMPLXE_DATA_DIR']
        # starting collection of data to the given directory
    elif cmd == "stop":
        pass  # stopping the collection and making transformation of own data to CSV if necessary

main()
Example in Windows CMD shell:
:start
rem Start command in non-blocking mode
start <my collector command to start the collection> "%AMPLXE_DATA_DIR%"\data_file.csv
exit 0
:stop
<my collector command to stop the collection>
exit 0
4. From the HOW pane, select the required analysis type, for example, Hotspots.
5. Configure available analysis options as you need.
6. Click the Start button to launch the VTune Profiler analysis and collect custom data in parallel.
VTune Profiler does the following:
a. Launches the target application, if any, in the suspended mode.
b. Launches the custom collector in the attach (or system-wide) mode.
c. Switches the application to the active mode and starts profiling.
If your custom collector cannot be launched in the attach mode, the collection may produce incomplete
data.
To launch a custom collector from the command line:
Use the -custom-collector=<string> option.
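A possible invocation with a hypothetical collector script and application name:

vtune -collect hotspots -custom-collector="python /path/to/my_collector.py" -- ./myapp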
NOTE
If you use your target application as a custom collector, you do not need to apply the Custom
collector option but make sure your application uses the following variables:
• AMPLXE_DATA_DIR environment variable to identify a path to the VTune Profiler result directory and
save the output csv file in this location.
• AMPLXE_HOSTNAME environment variable to identify the name of the current host and use it for the
csv file name.
See Also
Import External Data
Cookbook: Core Utilization in DPDK Apps tracing with the custom collector
Profiling Tensorflow* workloads with Intel® VTune™ Profiler using the Custom collector option
Intel® VTune™ Profiler Command Line Interface
File Name
The csv file name should specify the hostname where your custom collector gathered the data, following these format requirements:
Filename format: [user-defined]-hostname-<hostname-of-system>.csv
Where:
• [user-defined] is an optional string, for example, describing the type of data collected
• -hostname- is a required text that must be specified verbatim
• <hostname-of-system> is the name of the system where the data is collected. If you use a custom collector, you can retrieve the hostname by using the VTUNE_HOSTNAME environment variable. If you create a CSV file to import into an existing result, you can either refer to the Summary window that provides the required hostname in the Collection and Platform Info section > Computer name, or check the corresponding vtune summary report: vtune -r <result> -R summary.
Example: phases-hostname-octagon53.csv
NOTE
If the hostname in the csv file name is not specified or specified incorrectly, the VTune Profiler
displays the imported data with the following limitations:
• Event timestamps are represented in the UTC format.
• Only global data (not attributed to specific threads/processes) are displayed.
For imported interval values, use 5 columns, where the order of columns is important:
name,start_tsc.[QPC|CLOCK_MONOTONIC_RAW|RDTSC|UTC],end_tsc,[pid],[tid]
start_tsc.[QPC|CLOCK_MONOTONIC_RAW|RDTSC|UTC]
    Event start timestamp. This column name has a QPC, CLOCK_MONOTONIC_RAW, RDTSC, or UTC suffix that indicates the type of the timestamp counter:
    • Specify QPC (QueryPerformanceCounter) on Windows* OS if the performance counter is used, and specify CLOCK_MONOTONIC_RAW on Linux* OS if clock_gettime(CLOCK_MONOTONIC_RAW) is used.
    • Specify RDTSC if the RDTSC counter is used. To obtain RDTSC:

      #include <stdint.h>
      int64_t rdtsc()
      {
          int64_t tstamp;
      #if defined(__x86_64__)
          asm( "rdtsc\n\t"
               "shlq $32,%%rdx\n\t"
               "or %%rax,%%rdx\n\t"
               "movq %%rdx,%0\n\t"
               : "=g"(tstamp)
               :
               : "rax", "rdx" );
      #elif defined(__i386__)
          asm( "rdtsc\n": "=A"(tstamp) );
      #else
      #error NYI
      #endif
          return tstamp;
      }

    • Specify UTC if date and time is used. The expected format is YYYY-MM-DD hh:mm:ss.sssss, where the number of decimal digits is arbitrary.

pid
    Process ID, provided optionally. Absence of a value in this field does not affect how a result is imported except for extremely rare cases when the following conditions are all met:
    • Thread ID is reused by the operating system within the collection time frame.
    • Different threads with the same thread ID generate records for the csv file.
    • Timestamps are inaccurate and data may be attributed to more than one thread with the same thread ID.
    You may specify this field as an empty value within the data, or skip it from both the file header and the data entirely.

tid
    Thread ID, provided optionally. If a value is specified in this field, the interval will be interpreted as a Task; otherwise, the interval will be interpreted and shown as a Frame. You may specify this field as an empty value within the data, or skip it from both the file header and the data entirely.
Examples
Format for Discrete Values
You can import two types of discrete values:
• Cumulative data type (for example, distance, hardware event count), specified with the .COUNT suffix in
the csv file
• Instantaneous data type (for example, power consumption, temperature), specified with the .INST suffix
in the csv file
tsc.[QPC|CLOCK_MONOTONIC_RAW|RDTSC|UTC]
    Event timestamp. The same timestamp counter suffixes apply as for interval values. For example, to obtain RDTSC:

      #include <stdint.h>
      int64_t rdtsc()
      {
          int64_t tstamp;
      #if defined(__x86_64__)
          asm( "rdtsc\n\t"
               "shlq $32,%%rdx\n\t"
               "or %%rax,%%rdx\n\t"
               "movq %%rdx,%0\n\t"
               : "=g"(tstamp)
               :
               : "rax", "rdx" );
      #elif defined(__i386__)
          asm( "rdtsc\n": "=A"(tstamp) );
      #else
      #error NYI
      #endif
          return tstamp;
      }

    • Specify UTC if date and time is used. The expected format is YYYY-MM-DD hh:mm:ss.sssss, where the number of decimal digits is arbitrary.
CounterName1
    Name of the event. Each counter has a separate column. The COUNT suffix is used to specify a cumulative counter value. The INST suffix is used to specify instantaneous counter values.
pid
    Process ID, provided optionally. Absence of a value in this field does not affect how a result is imported except for extremely rare cases when the following conditions are all met:
    • Thread ID is reused by the operating system within the collection time frame.
    • Different threads with the same thread ID generate records for the csv file.
    • Timestamps are inaccurate and data may be attributed to more than one thread with the same thread ID.
    You may specify this field as an empty value within the data, or skip it from both the file header and the data entirely.
tid
    Thread ID, provided optionally. If a value is specified in this field, the interval will be interpreted as a Task; otherwise, the interval will be interpreted and shown as a Frame. You may specify this field as an empty value within the data, or skip it from both the file header and the data entirely.
Examples
Additional Requirements
• Make sure each csv file contains only one table. If you need to load several tables, create several csv
files with one table per file.
• Use commas as value separators.
• Use RDTSC, UTC, or the performance counter (QueryPerformanceCounter on Windows OS and CLOCK_MONOTONIC_RAW on Linux OS) to specify event timestamps.
See Also
Import External Data
import
vtune option
Run VTune Profiler to Get Linux Perf Options for Analysis
When the VTune Profiler runs a performance data collection in the driverless mode, it uses a Linux Perf
command line and logs it inside the result folder in the <result-folder>/data.0/perfcmd file. To get a
correct set of Perf options, do the following:
1. Install the VTune Profiler on any Linux system with a similar hardware configuration (the same CPU
family) as the system where real performance profiling is planned to be run.
2. Run a VTune Profiler analysis of your interest to generate the perfcmd file with Perf options:
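A possible sequence, using a trivial placeholder workload; the result directory name (r000hs here) may differ on your system:

vtune -collect hotspots -knob sampling-mode=hw -- sleep 5
cat r000hs/data.0/perfcmd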
NOTE
• You do not run any real workload here. The only purpose of this run is to generate the perfcmd
file.
• VTune Profiler license is not required for this step since you only collect data without opening it.
3. Open the perfcmd file and copy-paste its content to a Linux Perf command invocation on your real
target system.
NOTE
Your Perf tool should contain a patch from https://github.com/torvalds/linux/commit/f92da71280fb8da3a7c489e08a096f0b8715f939#diff-809984534aa420619413fdf4c260605d. In Linux kernel versions 4.19 and later, this patch is applied out of the box; in earlier versions, you need to apply it manually and recompile the Perf tool.
See Also
Set Up Project
name,start_tsc.QPC,end_tsc,pid,tid
frame1,2,30,,
frame1,33,59,,
taskType1,3,43,1,1
taskType2,5,33,1,1
taskType1,46,59,1,1
taskType2,45,54,1,1
VTune Profiler will process data with missing PID and TID as frames. Data with the PID and TID specified will
be processed as tasks.
Example 2: CSV File with the System Counter Timestamp
name,start_tsc.UTC,end_tsc,pid,tid
Frame1,2013-08-28 01:02:03.0004,2013-08-28 01:02:03.0005,,
Task,2013-08-28 01:02:03.0004,2013-08-28 01:02:03.0005,1234,1235
Example 3: CSV File with Interval Data Bound to a Process
name,start_tsc.TSC,end_tsc,pid,tid
function1_task_type,419280823342846,419280876920231,12832,11644
function2_task_type,419280876920231,419281044717992,12832,11644
function1_task_type,419281044745822,419281102121452,12832,11644
function2_task_type,419281102121452,419281277898762,12832,11644
function1_task_type,419281277935812,419281342158661,12832,11644
function2_task_type,419281342158661,419281527040239,12832,11644
VTune Profiler processes this data as tasks (TID and PID values are specified) and displays the result in the
Platform window as follows:
Example 4: Command Line Report for Imported Interval Data Bound to a Process
In this example, the hotspots report shows counters bound to a specific process/thread grouped by tasks.
Example 5: CSV File with Interval Data Not Bound to a Process
name,start_tsc.TSC,end_tsc,pid,tid
calibrating_frame,419743756747826,419747241283878,,
open_file_frame,419747251423510,419747504506086,,
VTune Profiler processes this data as frames (there are no TID and PID values specified) and displays the
result as follows:
With the VTune Profiler, you can easily correlate the frame data in the Timeline pane and grid view.
Example 6: Command Line Report for Imported Interval Data Not Bound to a Process
In this example, the hotspots report shows counters not bound to a specific thread/process grouped by
frame domain:
tsc.TSC,global_inst_val1.INST,global_counterWIV.COUNT,pid,tid
78912463824135,3,6,,
78916553573577,6,9,,
78919519641325,3,12,,
78922574591880,6,18,,
78925599513489,3,21,,
VTune Profiler processes this data and displays the result as follows:
Discrete cumulative counter values, both thread-specific and global (not thread-specific), are provided in the
grid view and in the Timeline pane in yellow. Instantaneous counter values, thread-specific and global, are
displayed in blue in the Timeline pane only.
NOTE
To view global counter values in the grid, make sure to select a generic (not thread specific) grouping
level like Frame Domain/Frame/Function/Call Stack.
See Also
Import External Data
import
option
Result Tab
    This is a container of all other viewpoint elements. This tab has the same name as the VTune Profiler result file. The result tab name uses the r@@@{at} format, where @@@ is an incremented result number starting with 000 and at is the analysis type.
    For example: r004hs is the fifth result run in this project and provides data for the Hotspots (hs) analysis. Hotspots is the analysis type name. Hotspots by CPU Utilization is the name of the viewpoint selected via the down arrow. Use this arrow to switch to other viewpoints available for this analysis result.
Windows
    Each result tab includes a number of windows presenting collected data from different perspectives. Each window has a corresponding tab. To ease your navigation, some windows are synchronized: when you select an element in a window, the same element is automatically selected in other windows of the same viewpoint. The list of windows depends on the selected viewpoint.
    Each window has a corresponding context help topic available via the F1 button or icon.
    NOTE Context help as part of this product help is available on the web only. You may also download a copy of this help from the VTune Profiler documentation archive.
Panes
    Each window typically includes two or three panes, such as the Call Stack pane, Timeline pane, and others.
NOTE
For a brief overview on a particular viewpoint, click the question mark icon at the viewpoint name.
All the data views make your analysis more convenient and manageable with the following options:
Switch Viewpoints
Use a viewpoint, a pre-set configuration of Intel®
VTune™ Profiler's data views, to focus on specific
performance problems.
NOTE
By default, VTune Profiler shows only a managed selection of viewpoints that may be helpful for the specific analysis type. You can enable the display of all applicable viewpoints by enabling the Show all applicable viewpoints option in the Options pane.
When you select a viewpoint, you select a set of performance metrics the Intel® VTune™ Profiler shows in the
windows of the result tab. To select the required viewpoint, click the down arrow:
• Name of the current viewpoint: click the down arrow next to the viewpoint name to open a drop-down menu with a choice of applicable viewpoints.
• Viewpoint drop-down menu: displays a list of viewpoints available for the current analysis type.
Explore the table below to understand which viewpoints are available for each analysis type:
Hotspots by CPU Utilization
    Helps identify hotspots: code regions in the application that consume a lot of CPU time. CPU time is broken down into CPU utilization states: idle, poor, fair, and good.
Threading Efficiency
    Shows how your multi-threaded application is utilizing available CPU cores and helps identify the possible causes of ineffective utilization. Use this view to find threads waiting too long on synchronization objects (locks) or identify scheduling overhead.
Microarchitecture Exploration
    Helps identify where the application is not making the best use of available hardware resources. This viewpoint displays metrics derived from hardware events. The Summary window reports overall metrics for the entire execution along with explanations of the metrics. From the Bottom-up and Top-down Tree windows you can locate the hardware issues in your application. Cells are highlighted when potential opportunities to improve performance are detected. Hover over the highlighted metrics in the grid to see explanations of the issues.
Hardware Events
    Displays statistics of monitored hardware events: estimated count and/or the number of samples collected. Use this view to identify code regions (modules, functions, code lines, and so on) with the highest activity for an event of interest.
Memory Usage
    Helps understand how effectively your application uses memory resources and identify potential memory access related issues like excessive access to remote memory on NUMA platforms, hitting DRAM or Interconnect bandwidth limits, and others. It provides various performance metrics for both the application code and memory objects (arrays).
HPC Performance Characterization
    Helps understand how effectively your application uses CPU, memory, and floating-point operation resources. Use this view to identify scalability issues for Intel OpenMP and MPI runtimes as well as next steps to increase memory and FPU efficiency.
Input and Output
    Shows input/output data, CPU and bus utilization statistics correlated with the execution of your target. Use this view to identify long latency of I/O requests, explore call stacks for I/O functions, analyze slow I/O requests on the timeline, and identify imbalance between I/O and compute operations.
GPU Compute/Media Hotspots
    Helps identify GPU tasks with high GPU utilization and estimate their effectiveness. It is particularly useful for DPC++ computing tasks, analysis of OpenCL™ kernels, and Intel Media SDK tasks. Use this view to identify the most time-consuming GPU computing tasks, analyze GPU task execution over time, explore the GPU hardware metrics per GPU architecture blocks, and so on.
FPGA Hotspots
    Helps identify the FPGA and CPU tasks with high utilization. Use this view to assess FPGA time spent executing kernels, overall time for memory transfers between the CPU and FPGA, and how well a workload is balanced between the CPU and FPGA.
Platform Power Analysis
    Helps identify where the application is generating idle and wake-up behavior that can lead to inefficient use of energy. Where possible, it provides data from both the OS and hardware perspective, such as the detailed C-state residency report that shows the OS-requested time in deep sleep states compared to the actual residency the hardware indicated.
See Also
Interpret Energy Analysis Data with Intel® VTune™ Profiler
Analyze Performance
In the Threading Efficiency example above, columns in the Top-down Tree window match the columns in the Bottom-up window as follows:
The Bottom-up window provides only the Self type of data (function without callees). In the grid, Self time/count column headers do not have a suffix.
The Total type of data (function + all callees' Self data) is provided in the <data>:Total column and is unique to the Top-down Tree window. In the example above, these are the Wait Time:Total by Utilization and Wait Count:Total columns.
Self time for a program unit in the Bottom-up window equals the sum of Self time values for the same program unit in different call sequences in the Top-down Tree window.
See Also
Window: Bottom-up
View Stacks
View Stacks
Manage the Intel® VTune™ Profiler view to display call
stacks for user and system functions and estimate an
impact of each stack on the performance metrics.
Intel VTune Profiler provides call stack information in the Call Stack pane, Bottom-up pane, Top-down
Tree, and Caller/Callee pane. You may use the following options to manage and analyze stacks in different
views:
• Change stack layout
• Navigate between stacks
• View stacks per metric
• View system functions in the stack
• View source for a stack function
Manage the stack representation in the grid (Bottom-up or Top-down Tree pane) by using the stack layout toolbar button.
The button dynamically changes according to the selected layout. For example, if the chain layout is selected
for the view, the button changes to show an option to choose a tree layout, and vice versa.
Chain layouts are typically more useful for the bottom-up view:
While tree layouts are more natural for the top-down view:
NOTE
Chain layout in the Top-down Tree pane is possible only if there is no branching AND when all values
of data columns are the same for the parent and for the child.
To navigate between stacks and locate the most critical stack, use the Call Stack pane and click the next/previous arrows.
To view information on several stacks or program units, Ctrl-click to select these stacks or program units in the Bottom-up or Top-down Tree pane. The Call Stack pane shows the highest contributing stack from all the selected stacks, with the contribution calculated based on the sum of all selected stacks. All the stacks related to the selection are added to the tab and you can navigate to them using the next/previous arrows.
Note that though each stack in the Bottom-up pane corresponds to a call stack provided in the Call Stack
pane, the number of tree branches in the Bottom-up grid does not necessarily equal the number of stacks
in the Call Stack pane. Since the stack in the Bottom-up pane is function-based and the stacks in the Call
Stack pane are line-number-based, the number of stacks in these views may differ.
For example, in the screen capture below, the Bottom-up pane shows two stacks for the grid_intersect
function whereas the Call Stack pane shows that 17 stacks exist.
To view the source for a function in a stack, double-click it in the Call Stack pane. For example, in a Threading analysis result, if you double-click the topmost item of the Wait Time (Sync Object Creation) stack, the related source file opens on the source line that created the corresponding synchronization object.
If the source code is not found, you can either locate it manually, or open the Assembly pane for this
program unit.
If you select a system function, the Source/Assembly window opens the source file of the system function
if it is available. If not, it shows the disassembly for the binary file containing this system function.
See Also
Window: Bottom-up
To view system functions (for example, kernel stacks) in the user function stacks, select the User/system functions call stack mode:
To locate the call of the kernel function in the assembly code, double-click the function in the Call Stack pane.
NOTE
For more accurate kernel stack analysis on Linux targets, use the CONFIG_FRAME_POINTER=y option
for kernel configuration.
See Also
Enable Linux* Kernel Analysis
Toolbar: Filter
call-stack-mode
vtune option
inline-mode vtune option
CPU Time
    Time during which the CPU is actively executing your application on all cores.
Overhead and Spin Time
    Combined Overhead and Spin time, calculated as CPU Time where the call site type is Overhead plus CPU Time where the call site type is Synchronization.
Wait Time
    Distribution of time when one thread is waiting for a signal from another thread. For example, a thread that needs a lock that is currently held by another thread is waiting for the other thread to release the lock.
Wait Count
    Distribution of the number of times the corresponding system wait API was called.
Spin Time
    Distribution of Wait Time during which the CPU was busy.
Context Switch Time
    Distribution of software thread inactive time due to a context switch, regardless of its reason (Preemption or Synchronization), over different call stacks.
Context Switch Count
    Distribution of the number of context switches, regardless of their reason (Preemption or Synchronization), over different call stacks.
Preemption Context Switch Count
    Distribution of the number of context switches where the operating system task scheduler switched a thread off a processor to run another, higher-priority thread.
Synchronization Context Switch Count
    Distribution of the number of context switches where a thread was switched off because of making an explicit call to a thread synchronization API or to an I/O API.
Inactive Time
    Distribution of time during which a thread remained preempted from execution.
Event metric such as Instructions Retired, Clockticks, LLC Miss, and others
    Distribution of a hardware event. Use this metric to identify stacks with the highest contribution of the event count to the total event count collected for the target.
Wait Time (Signal)
    Distribution of Wait Time by call stacks of a signaling thread that was releasing a lock where the thread was waiting. Use this metric to identify signaling stacks that resulted in long waits, so that you can optimize the algorithms of the signaling thread.
Wait Count (Signal)
    Distribution of Wait Count by call stacks of a signaling thread that was releasing a lock where the thread was waiting. Use this metric to identify signaling stacks that resulted in a high number of waits.
Spin Time (Signal)
    Distribution of Spin Time by call stacks of a signaling thread that was releasing a lock where the thread was waiting. Use this metric to identify signaling stacks that resulted in long waits while the CPU is busy.
Wait Time (Sync Object Creation)
    Distribution of Wait Time by various object creations. For example, the currently selected row in the grid may contain wait operations on various objects created in different places of the program.
Loads (Memory Allocation)
    Distribution of the total number of loads in the stacks allocating memory objects.
Execution (Computing Task (GPU))
    Distribution of time spent in the stacks to execute computing tasks. Use this metric to identify the most expensive operations for offload.
Host-to-Device Transfer (Computing Task (GPU))
    Distribution of time spent in the stacks to transfer data from the host to the device. Use this metric to identify the most expensive operations for offload.
Device-to-Host Transfer (Computing Task (GPU))
    Distribution of time spent in the stacks to transfer data from the device to the host. Use this metric to identify the most expensive operations for offload.
NOTE
If a selected stack type is not applicable to a selected program unit, VTune Profiler uses the first
applicable stack type from the stack type list instead.
See Also
Pane: Call Stack
Reference
Synchronize the selected data
    Select a program unit of your interest in a grid or Timeline pane, and the VTune Profiler highlights the same unit in other panes/windows.
Re-group the displayed data
    Select the required granularity from the Grouping drop-down menu. The available groups depend on the analysis type.
Expand/Collapse data in the column
    Click the expand/collapse buttons in the data columns to expand the column by utilization, such as poor or OK utilization, and by threads within the utilization definition.
Expand/Collapse row data
    Click the expand/collapse buttons to show/hide the next level of grouping or call stack elements.
Change the data representation format
    Right-click the data column, select Show Data As >, and select from the different data format options. The data format you configure is used in all the windows.
Select rows
    Shift-click to select two or more consecutive rows. Ctrl-click to select two or more rows that are not consecutive.
Filter the content of the window
    • Use the drop-down controls in the Filter toolbar to filter data by the contribution of a selected program unit. The percentage contribution depends on the filtering metric selected by clicking the Filter icon. In the example below, the analyze_locks process contributes 53.4% of the Clockticks event count (the default filtering metric for the hardware event-based analysis) to the overall application Clockticks event count, so filtering the collected data by this module causes the viewpoint to show 53.4% of the overall Clockticks data.
    • Use the Filter In/Out by Selection options of the context menu. VTune Profiler filters in/out the data based on the Total time of the selected item(s).
    Filtering the data in one window applies the same filter to all the windows of that viewpoint. If you applied filters available on the Filter bar to the data already filtered with the Filter In/Out by Selection context menu options, all filters are combined and applied simultaneously.
View source/assembly code
    Select a program unit you need and double-click it. If the source file is not found, the Assembly pane is displayed.
See Also
Window: Bottom-up
Toolbar: Filter
To Do This    Do This
Sort the data    Right-click the list of threads/cores/CPUs (depending on the analysis type) and select
the required type of sorting from the Sort By context menu option:
• Row Start Time sorts the rows by the thread creation time.
• Row Label sorts the rows alphabetically.
• <Metric> sorts the rows by the performance metric monitored for the selected viewpoint, for example,
CPU Time, Hardware Event Count, and others.
• Ascending sorts the program units in ascending order by one of the categories selected above.
• Descending sorts the program units in descending order by one of the categories selected above.
Re-order the rows    Select the row you need, hold, and drag it to the required position. Press SHIFT to
select multiple adjacent rows. Press CTRL to select multiple disjointed rows.
Filter data    Select the required program unit(s), right-click, and choose from the context menu to
filter in or filter out the data in the view by the selected items. To go back to the default view,
select the Remove All Filters option.
Zoom in/Zoom out the timeline    Click the Zoom In/Zoom Out buttons on the timeline toolbar.
Change the height of the row    Right-click, select the Change Band Height option from the Timeline
context menu, and select the required mode:
• Super Tiny mode fits all available rows (corresponding to program units such as processes, threads, and
so on) into the timeline area and displays metric data in a gradient fill. This mode is especially useful
for results with multiple processes/threads since it shows all the data in a compact way ("bird's-eye
view") with no scroll bar. It helps you observe the large numbers of rows typical for high-end parallel
applications and easily recognize application phases and places of underutilization for further zooming/
filtering.
If there are more rows than pixels, multiple rows can share a pixel, in which case the pixel shows the
maximum value. If you hover over a chart object, the tooltip shows all of the rows assigned to a pixel
separately. If you resize the window, the timeline view is re-drawn and rows are re-assigned to pixels.
If there is data, the active ranges are colored: the more data associated with a pixel, the more intense
the color used for drawing. Otherwise, the band is shown in a black background color.
For hierarchical data, the Super Tiny mode shows timeline data for the last level of the hierarchy
aggregated by the upper levels. For example, for the Process/Thread grouping you see thread data
aggregated by process. Hover over a chart element to view the full hierarchy listed in the tooltip.
NOTE
The Super Tiny display mode is available only for the HPC Performance
Characterization viewpoint.
• Normal mode sets the normal row height (about 16-18px). This mode shows charts,
time markers, row identification (threads), and transitions. Rows can be reordered.
• Rich mode sets the maximum row height (35-50px). This mode shows charts, charts
for nested tasks, time markers, row identification (threads), and transitions. Rows
can be reordered and their height can be manually adjusted.
Change the measurement units on the time scale    Right-click, select the Show Time Scale As context menu
option, and choose from the following values:
• Elapsed Time (default)
• OS Timestamp
• CPU Timestamp
See Also
Pane: Timeline
VTune Profiler applies these threshold changes to the data provided in all viewpoints/windows of the current
and subsequent results in this project.
See Also
Window: Summary
Use This Format    To Do This
Bar    Display a graphical indicator of the amount of CPU time spent on this row item (blue bar) or the
processor utilization during CPU or Wait time (composed bar). The longer the bar, the higher the value.
The composed bar is available for the Threading analysis only.
Percent Display the amount of time calculated as the percentage of the cell value to the sum
of values in this column for the whole result (or to the non-filtered-out items if a
filter is applied). For the nested columns (for example, CPU Time > Idle), the sum
of values used in the formula is based on the top-level column values (for example,
CPU Time).
In the compare mode, the same formula is used for per-result columns (for
example: CPU Time:<result 1 name>, CPU Time:<result 2 name>). But for the
Difference column (for example: CPU Time:Difference), the percent value is
calculated as the cell value / sum of values in this column for the first result (or for
non-filtered-out items if a filter is applied).
Time Display the amount of time the processor spent on this row item. The unit size (m,
s, ms) is added to each cell value.
Time and Bar Display both the amount of time and a bar.
Counts For the Threading Efficiency viewpoint, display the number of times the
corresponding system wait API was called. For the event-based sampling results,
display event count based on the number of samples collected. Event Counts =
Samples x Sample After value.
Scientific Display performance values in the scientific notation. Typically this format is
recommended if a value is < 0.001.
Double    For some viewpoints available for the hardware event-based sampling analysis types, display the
percentage of CPU cycles used by a program unit. For example, 1.533 means that 153.3% of CPU cycles were
used to handle a particular hardware issue during the execution of the selected program unit.
Double and Bar For some viewpoints available for the hardware event-based sampling analysis
types, display the percentage of CPU cycles used by a program unit and
corresponding graphical indicator.
Percent of Collection Time    For some metrics (for example, OpenMP* and MPI metrics), display the Time
value as a percent of Collection Time, which is the wall time from the beginning to the end of
collection, excluding Paused Time.
NOTE
The values in the data columns are rounded. For items that are sums of several other items, such as a
function with several stacks, the rounded sums may differ slightly from the sum of rounded
summands.
For example, two values displayed as 0.123 may show a total of 0.247, because the rounded values in the
grid do not sum up exactly: (0.123 + 0.123) != 0.247.
See Also
CPU Metrics Reference
Filter by Objects
To filter by particular program units (functions, modules, and so on), use any of the following options:
• Context menu options: Select objects of interest in the grid, right-click and choose the Filter In by
Selection context menu option to exclude all objects from the view other than the objects you selected.
And conversely, choosing the Filter Out by Selection hides the selected data. The filter bar at the
bottom is updated to show the percentage of the displayed data by a certain metric.
For example, you want to filter in the grid by the most time-consuming function sphere_intersect:
When the filter is applied, the filter bar shows that you see only 24.9% of the collected CPU Time data.
• Filter toolbar options: Select a program unit in the filtering drop-down menu (process, module, thread)
to filter your grid and Timeline view to display the data for this particular program unit. For example,
if you select the analyze_locks process introducing 51.5% of the CPU Time, the result data will display
statistics for this module only and the filter bar provides an indicator that only 51.5% of the CPU Time
data is currently displayed:
The context summary on the right will be updated for the selected time range and the filter toolbar will show
the percentage of the data (per the default metric for this viewpoint) displayed.
Group Data
You can organize a view to focus on the sequence of data you need using the Grouping menu. The available
groups depend on the analysis type and viewpoint:
For example, if you want to view the collected data for the modules you develop, you may select the
Module/Function/Call Stack granularity, identify the hottest functions in your modules, and then switch to
the Function/Thread/Logical Core/Call Stack granularity to see which CPUs your hot functions were
running on.
VTune Profiler provides a set of pre-configured granularities that could be semantically divided into the
following groups:
Groups targeted for analysis    Description
GPU analysis Analyze the CPU activity while the GPU was either idle or executing some code
Examples:
Render and GPGPU Packet Stage / Function / Call Stack
Render and GPGPU Packet Stage / Thread / Function / Call Stack
Typically, you start your analysis with the Summary window where clicking an object of interest opens the
grid pre-grouped in the most convenient way for analysis.
If the pre-configured grouping levels do not suit your analysis purposes, you can create your own grouping
levels by clicking the Customize Grouping button and configuring the Custom Grouping dialog box.
See Also
Filter and Group Command Line Reports
from command line
Cookbook: OpenMP* Code Analysis Method
Requirements
This option is supported if you compile your code using:
• Linux*:
• GCC* compiler 4.1 (or higher)
• Intel® oneAPI DPC++/C++ Compiler. The -debug:inline-debug-info option is enabled by default if
you compile with optimization level -O2 and higher, and if debugging is enabled with the -g option.
• Windows*:
• Intel® C++ Compiler Classic, with /debug:inline-debug-info option.
• Intel® oneAPI DPC++/C++ Compiler and Microsoft* Visual C++, with the /Zo option. The /Zo option is
enabled by default when you generate debug information with /Zi or /Z7.
• Any other compiler that can generate debug information for inline functions in the DWARF format on Linux
or Microsoft PDB format on Windows.
• JIT Profiling API is used for inline functions of JIT-compiled code.
You can select the Source Function/Function/Call Stack level in the Grouping menu to view all
instances of the inline function in one row.
If you double-click the GetModelParams inline function, you can identify the code line that took the most
CPU time and analyze the corresponding assembly code:
But if you double-click the main function entry and explore the source, you can see that all CPU time is
attributed to the code line where the GetModelParams inline function is called:
Double-clicking the GPU_FFT_Global source function opens the source view positioned on the code line
invoking this function with 95.3% of Estimated GPU Cycles attributed to it:
But if you select the Computing Task/Function/Call Stack or Computing Task/Source Function/Call
Stack grouping level and enable the Inline Mode for this view, you see that the GPU_FFT_Global function
took only 4.7% of the GPU Cycles, while four inline functions took the rest of cycles:
Double-click the hottest GPU_FftIteration function to analyze its source and assembly code:
See Also
Toolbar: Filter
Analyze Loops
Use the Intel® VTune™ Profiler to view a hierarchy of
the loops in your application call tree and identify code
sections for optimization.
To view and analyze loops in your application:
1. Create a custom analysis (for example, Loop Analysis) based on hardware event-based collection and
select the Analyze loops, Estimate call counts, and Estimate trip counts options.
2. Select the required filtering level from the Loop Mode drop-down menu on the Filter toolbar.
• Loops only: Display loops as regular nodes in the tree. Loop name consists of:
• start address of the loop
• number of the code line where this loop is created
• name of the function where this loop is created
• Loops and functions: Display both loops and functions as separate nodes.
• Functions only (default): Display data by function with no loop information.
VTune Profiler updates the grid according to the selected filtering level.
3. Analyze Self and Total metrics in the Bottom-up and Top-down Tree windows and identify the most
time-consuming loops.
4. Double-click a loop of interest to view the source code.
VTune Profiler opens a source file for the function with the selected loop. The code line creating the loop
is highlighted.
NOTE
You can see the code line information only if debug information for your function is available.
Examples
To identify the most time-consuming loop, select the Loops only mode in the Bottom-up window. By
default, loops with the highest CPU Time values show up at the top of the grid.
To identify the heaviest top-level loops, switch to the Top-down Tree window. The data in the grid is sorted
by the Total time metric displaying the hottest top-level loops first:
See Also
Custom Analysis
loop-mode
vtune option
Toolbar: Filter
Stitch Stacks for Intel® oneAPI Threading Building Blocks or OpenMP* Analysis
Use the Stitch stacks option to restore a logical call
tree for Intel® oneAPI Threading Building Blocks
(oneTBB) or OpenMP* applications by catching
notifications from the runtime and attaching stacks to a
point introducing a parallel workload.
Typically, the real execution flow in applications based on Intel® oneAPI Threading Building Blocks
(oneTBB) or OpenMP is very different from the code flow. During user-mode sampling and tracing analysis
of a oneTBB-based application or an OpenMP application using Intel runtime libraries, Intel® VTune™
Profiler automatically enables the Stitch stacks option. To view the OpenMP or oneTBB object hierarchy,
explore the data provided in the Top-down Tree pane.
NOTE
• To analyze a logically structured OpenMP call flow, make sure to compile and run your code with
the Intel® Compiler 13.1 Update 3 or higher (part of the Intel Composer XE 2013 Update 3).
• Stack stitching is available when you run the application from the VTune Profiler (the Launch
Application target type). It does not work when attaching to the application (the Attach to
Process target type).
You may want to disable stack stitching, for example, to minimize the collection overhead. To do this for your
predefined user-mode sampling and tracing analysis type (for example, Hotspots or Threading), you need to
create a new custom analysis configuration and deselect the Stitch stacks option in the Custom Analysis
configuration. You may use the same modified GUI analysis configuration for command line analysis. For this,
just click the Command Line… button in the Configure Analysis window and copy the generated
command line to run it from the terminal window. Alternatively, you can manually configure the command
line for a custom runss analysis using the knob stack-stitching=false option like this:
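A minimal sketch of such a command line (the application path and the enable-stack-collection knob are
assumptions; adjust them to your configuration):
vtune -collect-with runss -knob enable-stack-collection=true -knob stack-stitching=false -- ./myApplication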
In this case, the Top-down Tree pane (or top-down report) displays separate entries for OpenMP worker
threads.
Examples
Call stack in the Top-down Tree pane with the Stitch stacks option disabled:
Call stack in the Top-down Tree pane with the Stitch stacks option enabled (default behavior):
See Also
Window: Top-down Tree
knob
stack-stitching=true
See Also
Context Menu: Grid
You can do the same in the standalone version of the product, using the Rename Result context menu
option in the Project Navigator.
To change the default result name template:
1. Open the Result Location pane as follows:
• Visual Studio IDE: Go to Tools > Options > Intel VTune Profiler <version> > Result Location pane.
• Standalone interface: Click the menu button and select Options... > Intel VTune Profiler <version> >
Result Location pane.
2. In the Result name template text box, edit the text to configure the naming scheme for new analysis
results. By default, r@@@{at} scheme is used, where {at} is an analysis type (for example, hs for
Hotspots).
NOTE
Do not remove the @@@ part from the template. This is a placeholder enabling multiple runs of the
same analysis configuration.
3. Expand the Advanced options section and edit the Store result in (and create link file to) another
directory field to specify a directory of your choice.
All subsequent analysis results will be located under the folder you defined in this tab.
Installation Information
Whether you downloaded Intel® VTune™ Profiler as a standalone component or with the Intel® oneAPI Base
Toolkit, the default path for your <install-dir> is:
Operating System Path to <install-dir>
macOS* /opt/intel/oneapi/
For OS-specific installation instructions, refer to the VTune Profiler Installation Guide.
Result (*.vtune) produced with a preset analysis type    The location of the result files is controlled
by the user. The default location for VTune Profiler is:
• On Linux: /root/intel/vtune/projects/[project directory]/r@@@{at}
• On Windows:
  • VTune Profiler Results\[project name]\r@@@{at} directory in the solution directory (Visual Studio* IDE)
  • %USERPROFILE%\My Documents\Profiler XE\Projects\[project directory]\r@@@{at} directory (Standalone
    VTune Profiler GUI)
Result (*.vtune) produced with a custom analysis type    The location of the result files is controlled
by the user. The default location for the VTune Profiler is:
• On Linux: /root/intel/vtune/projects/[project directory]/r@@@
• On Windows:
  • VTune Profiler Results\[project name]\r@@@ directory in the solution directory (Visual Studio* IDE)
  • %USERPROFILE%\My Documents\Profiler XE\Projects\[project directory]\r@@@ directory (Standalone VTune
    Profiler GUI)
To open a result from the standalone GUI, select Open > Result... from the menu and browse to the result
file. To open a result from Visual Studio, double-click the node in the Solution Explorer.
Project File
Project (for example, *.vtuneproj)    The filename is controlled by the system. However, the file
location is controlled by the user. The default location is:
• On Linux: /root/intel/vtune/projects/[project directory]
• On Windows:
  • VTune Profiler Results\[project name] directory in the solution directory (Visual Studio* IDE)
  • Profiler XE\Projects\[project directory] directory (Standalone Intel VTune Profiler GUI)
Examples
Run the Hotspots analysis and then run the Threading analysis. If you use the default naming convention and
result location, the VTune Profiler names and saves the results in the following manner:
• Standalone GUI Linux:
• /root/intel/vtune/projects/r000hs/r000hs.vtune
• /root/intel/vtune/projects/r001tr/r001tr.vtune
• Standalone GUI Windows:
• %USERPROFILE%\My Documents\Profiler XE\Projects\[project directory]\r000hs
\r000hs.vtune
• %USERPROFILE%\My Documents\Profiler XE\Projects\[project directory]\r001tr
\r001tr.vtune
• Visual Studio IDE:
• VTune Profiler Results\[project name]\r000hs\r000hs.vtune
• VTune Profiler Results\[project name]\r001tr\r001tr.vtune
where
• hs is the Hotspots analysis type
• tr is the Threading analysis type
See Also
Pane: Options - Result Location
NOTE
Make sure the search directories are accessible to the VTune Profiler. For example, if you are to import
the data collected remotely, you need to copy the sources and binaries to the host system where the
VTune Profiler is installed or make them available via a shared drive.
To import a result, click the menu button and select Import Result..., or click the Import Result button
on the toolbar.
The Import window opens.
4. Choose between two options:
• import an *.vtune result (a marker file with associated result directories) collected remotely with
the VTune Profiler command line interface;
• import a raw trace file collected by standalone collector tools.
Import Results
You can perform multiple collections on a remote system (with or without result finalization) with a full-
fledged VTune Profiler command line interface, copy the result directories to the host, and import the
result(s) into a VTune Profiler project.
To import result directories into a VTune Profiler project:
1. In the Import window, select the Import a result into the current project option.
2. Click the browse button to navigate to the required directory.
3. If required, click the Search Sources/Binaries button on the right to view/modify the search
directories.
4. Click the Import button on the right.
VTune Profiler copies the result directory to the current project folder and result name appears in the
Project Navigator as a node of the current project.
NOTE
If you do not need to copy a result, select the Import via a link instead of a result copy option.
VTune Profiler will import the result via this link.
NOTE
For FPGA data collected with the Profiler Runtime Wrapper, you must import a folder with the
profile.json file. Use the Import multiple trace files from a directory option in the Import
window. See the section below on importing trace files into a VTune Profiler project.
NOTE
The Linux kernel exposes Perf API to the Perf tool starting from version 2.6.31. Any attempts to run
the Perf tool on kernels prior to this version lead to undefined results or even crashes. See Linux Perf
documentation for more details.
NOTE
For FPGA data collected with the Profiler Runtime Wrapper, you need to use this option to import a
folder with the profile.json file. See the FPGA Optimization Guide for oneAPI DPC++ for details on
generating the profiling data.
3. If required, click the Search Sources/Binaries button on the right to view/modify the search
directories.
4. Click the Import button on the right.
VTune Profiler copies the trace file (or a directory with multiple traces) to the project directory, creates
an *.vtune result directory, finalizes the trace(s) in the directory, and imports it to the current project.
When you open the result in the VTune Profiler, it uses all applicable viewpoints to represent the data.
NOTE
• To reduce the size of the imported data, consider removing the copy of the trace file in the project
directory using the Remove raw collector data after resolving the result option available from
Options... > Intel VTune Profiler <version> > General tab in the standalone interface menu
or from Tools > Options... > Intel VTune Profiler <version> > General tab in the Microsoft
Visual Studio* IDE. This option makes the result smaller but prevents future re-finalization.
• You can run a custom data collection (with a third-party collector or your own collection utility) in
parallel with the VTune Profiler analysis run, convert the collected data to a *.csv file, and import
this file to the VTune Profiler project using the Import from CSV GUI option or the -import CLI option.
You may also choose to use the Custom collector option of the VTune Profiler to run your custom
collection directly from the VTune Profiler.
See Also
import
vtune option
Dialog Box: Binary/Symbol Search
Compare Results
Compare your analysis results before and after
optimization and identify a performance gain.
Use this feature on a regular basis for regression testing to quickly see where each version of your target has
better performance.
You can compare any results that have common performance metrics. Intel® VTune™ Profiler provides
comparison data for these common metrics only.
To compare two analysis results:
1. Click the Compare Results button on the VTune Profiler toolbar.
The Compare Results window opens.
Option                                Description
Result 1 / Result 2 drop-down menu    Specify the results you want to compare. Choose the result of the
current project from the drop-down menu, or click the Browse button to choose a result from a different
project.
Swap Results button    Click this button to change the order of the result files you want to compare.
Result 1 always serves as the basis for comparison.
Compare button    Click this button to view the difference between the specified result files. This
button is only active if the selected results can be compared. Otherwise, an error message is displayed.
2. Specify two results that you want to compare and click the Compare button.
A new result tab opens providing difference between the two results per performance metric.
The tab name combines the identifiers of two results. For example, the comparison of the Microarchitecture
Exploration analysis results r001ue and r005ue appears as r001ue-r005ue. The data views in the
comparison mode provide calculation of the difference between the two results in the order you originally
defined in the Compare Results window and as specified in the tab title.
You can compare performance statistics in the following views:
Summary window Analyze the difference in the overall application performance between two results
and the system/platform difference, if any. Start exploring the changes from the
Summary window and then move to the Bottom-up analysis to identify the
changes per program unit.
Bottom-up window Analyze the data columns of the two results and a new column with the difference
between these results for a function and its callers.
Event Count Compare results and identify the difference in event count and performance per
window hardware event-based metrics collected during event-based sampling analysis.
Top-Down Tree Explore the performance difference between two collection runs for a function and
window its callees.
Caller/Callee Get a holistic picture of the performance changes before and after optimization by
window comparing data for a function, its callers and callees.
Source/Assembly Understand how differently input values, command line parameters, or compilation
window options affect the performance when you are optimizing your target. Double-click a
program unit of your interest and compare the performance data for each line of the
source/assembly code.
See Also
Difference Report
from command line
If                                                       Then
The source and binary files are not modified and the     Compare performance for each source/assembly
debug information is available                           code line.
The source and binary files are not modified but the     Compare performance for each assembly
binaries are compiled without the debug information      instruction.
The source files are not modified but the binary         Compare performance for each source code line.
files are re-compiled with different options             NOTE: When comparing the source code for binary
                                                         files with different checksums, only the Source
                                                         pane is available.
The source and binary files are modified                 Intel® VTune™ Profiler cannot compare performance
                                                         for source/assembly code and displays an error
                                                         message.
Example
When you click the hotspot function in the Bottom-up window, the VTune Profiler opens the Source pane
that displays the CPU time data per each result and the difference between the results.
You see that the execution of the hottest line 64 took less CPU time in result r006hs.
See Also
Window: Compare Results
Compare Results
For example, suppose your binary with a my_f function was modified by adding a new function my_f1 and new
calls to this function. As a result, the my_f address has changed. If you compare the results before and
after the modification using the default Call Stack grouping, VTune Profiler treats the same function
with different addresses as separate instances and does not compare them:
When the data is aggregated by Source Function Stack, VTune Profiler ignores start addresses and compares
functions by source file objects:
Single result: cell_data_value / absolute_max_value_in_result_column
Compared results: cell_data_value / max(absolute_max_value_in_1st_result_column,
absolute_max_value_in_corresponding_2nd_result_column)
                    CPU Time:r001    CPU Time:r002    CPU Time:Difference
Performance data    1s               3s               2s
See Also
Bottom-up Comparison
Comparison Summary
Difference Report
from command line
Comparison Summary
When you click the Compare Results button and select two results to compare, the Summary window
shows the difference between these results.
NOTE
• You can compare any results that have common performance metrics. Intel® VTune™ Profiler will
provide comparison data for these common metrics only.
• Make sure to close the results before comparing them.
You see that the code changes in the second result have slightly decreased the Elapsed time of the
application in comparison with the baseline (result1), though the CPU Time has increased from 19.645
seconds to 21.571 seconds.
Clicking a metric in this section opens the Bottom-up view sorted by this metric in the Difference column.
For this example, the second result introduced slight degradation in CPU Time for the first and second
functions.
The chart shows that the Elapsed time spent within the Poor processor utilization level has slightly increased
with the second result. This means that the changes made for the second run have not optimized the
utilization of the processor cores but introduced a slight optimization reducing the total Elapsed time.
NOTE
You can click the Copy to Clipboard button next to any summary section and copy its content to
the clipboard.
See Also
Compare Results
Bottom-up Comparison
Bottom-up Comparison
To view the difference before and after optimization for a function and its callers, click the Bottom-up sub-
tab for the comparison result you created using the Compare Results window.
In the compare mode, the Bottom-up window shows the data columns of the two results and a new column
showing the difference between the two results for each program unit. The difference is calculated as <Result
1 Value> - <Result 2 Value>.
The CPU time-specific difference is calculated as <Result 1 CPU time> - <Result 2 CPU time>, which is
r000hs - r004hs (see the tab title). Expand the first two columns to see the data used for the
calculation.
For the grid_intersect function in this example, the difference in Poor CPU utilization time is -0.138s,
which means that serial CPU time slightly increased after the code modification (Result 2).
See Also
Compare Results
Window: Bottom-up
Comparison Summary
Example: Comparison for Hotspots Analysis Results
The function foo() is called from two places in your application, bar1() and bar2(). If you see that foo()
became slower in result 2, use the Top-down Tree window (compare mode) to check whether it became
slower when being called by bar1(), by bar2(), or both.
Tip
To compare results with stacks and without stacks, switch the Call Stack Mode filterbar option to
User/System functions to attribute performance data to functions where samples occurred.
See Also
Window: Top-down Tree
Bottom-up Comparison
Comparison Summary
Comparing Results
• Collect performance analysis data for your target application using your specified analysis type and other
options.
• Generate reports from analysis results.
• Import data files collected remotely.
• Compare performance before and after optimization.
NOTE
• See the VTune Profiler CLI Cheat Sheet quick reference on VTune Profiler command line interface.
• To access the most current command line documentation for vtune, enter: vtune -help.
• When you perform a task through the VTune Profiler GUI, you can use the command generation
feature to display the corresponding command and save it for future use.
• You cannot create a project from the command line. You must use the GUI for this purpose.
See Also
vtune Command Syntax
vtune Actions
Run Command Line Analysis
[-action-option] Action-options modify behavior specific to the action. You can have
multiple action-options per action. Using an action-option that does
not apply to the action results in a usage error.
NOTE
Long names of the options can be abbreviated. If the option
consists of several words you can abbreviate each word, keeping
the dash between them. Make sure an abbreviated version
unambiguously matches the long name. For example, the -
option-name option can be abbreviated as -opt-name, -op-na,
-opt-n, or -o-n.
[-global-option] Global-options modify behavior in the same manner for all actions.
You can have multiple global-options per action.
NOTE
You may use vtune to analyze remote targets running on regular
Linux* or Android* systems.
NOTE
See the VTune Profiler CLI Cheat Sheet quick reference on VTune Profiler command line interface.
Example
This example runs the Hotspots analysis for the sample target located at the /home/test/ directory on a
Linux* system, saves the analysis result in the r001hs subdirectory of the current directory, and displays
the default summary report.
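A command line matching this description might look like the following (the sample binary name and exact
path are assumptions):
vtune -collect hotspots -result-dir r001hs -- /home/test/sample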
See Also
vtune Actions
vtune Actions
The vtune command tool of the Intel® VTune™ Profiler supports different command options.
Actions
collect Run the specified analysis type and collect data into a result.
NOTE
To access the most current command line documentation for an action, enter vtune -help
<action>, where <action> is one of the available actions. To see all available actions, enter vtune -
help.
Action Options
Action options define a behavior applicable to the specified action; for example, the -result-dir option
specifies the result directory path for the collect action.
NOTE
To access the list of available action options for an action, enter vtune -help <action>, where
<action> is one of the available actions. To see all available actions, enter vtune -help.
Global Options
Global options define a behavior applicable to all actions; for example, the -quiet option suppresses non-
essential messages for all actions. You may have one or more global options per command.
NOTE
To access the list of available global options for an action, enter vtune -help <action>, where
<action> is one of the available actions. To see all available actions, enter vtune -help.
• VTune Profiler can re-use analysis configuration options set in the GUI to generate an equivalent
command line configuration. You can copy this command line to the clipboard and use it for command line
analysis. To do this, use the Command Line... button in the Configure Analysis window. This also works
for custom analysis types.
• To get more information on an action, enter: vtune -help <action>. For example, this command
returns help for the collect-with action:
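vtune -help collect-with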
See Also
vtune Command Syntax
Analysis Type Description
gpu-hotspots (preview)    Identify Graphics Processing Unit (GPU) tasks with high GPU utilization and
estimate the effectiveness of this utilization. This analysis type is intended for analysis of
applications that use a GPU for rendering, video processing, and computations with explicit support of
Intel® Media SDK and OpenCL™ software technology.
gpu-offload Explore code execution on various CPU and GPU cores on your platform,
correlate CPU and GPU activity, and identify whether your application is GPU or
CPU bound.
vpp Get a holistic view of system behavior. Gain insights into platform-level
configuration, utilization, and imbalance issues that relate to compute, memory,
storage, IO and interconnects.
graphics-rendering (preview)    Analyze the CPU/GPU utilization of your code running on the Xen
virtualization platform. Explore GPU usage per GPU engine and GPU hardware metrics that help understand
where performance improvements are possible. If applicable, this analysis also detects OpenGL-ES API
calls and displays them on the timeline.
fpga-interaction Analyze the CPU/FPGA interaction issues via exploring OpenCL kernels running
on FPGA, identify the most time-consuming FPGA kernels.
io Monitor utilization of the IO subsystems, CPU and processor buses. This analysis
type uses the hardware event-based sampling collection and system-wide
Ftrace* collection (for Linux* and Android* targets)/ETW collection (Windows*
targets) to provide a consistent view of the storage sub-system combined with
hardware events and an easy-to-use method to match user-level source code
with I/O packets executed by the hardware.
system-overview Monitor a general behavior of your target system and identify platform-level
factors that limit performance.
Collector Description
runsa Profile your application using the counter overflow feature of the Performance
Monitoring Unit (PMU).
runss Profile the application execution and take snapshots of how that application utilizes
the processors in the system. The collector interrupts a process, collects the value of
all active instruction addresses and captures a calling sequence for each of these
samples.
Next Steps
When the collection is complete, the VTune Profiler saves the data as an analysis result in the default or
specified result directory. You can either view the result in the GUI or generate a formatted analysis report.
See Also
vtune Command Syntax
Android* Targets
Syntax
vtune -collect performance-snapshot [-knob <knobName=knobValue>] [--] <target>
Knobs:
• collect-memory-bandwidth
Collect the data required to compute memory bandwidth.
Default value : false
NOTE
For the most current information on available knobs (configuration options) for Performance Snapshot
analysis, enter:
vtune -help collect performance-snapshot
Example
This example shows how to run Performance Snapshot on a Linux* myApplication application:
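A possible command line for this example (the application path is an assumption):
vtune -collect performance-snapshot -- ./myApplication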
What's Next
When the data collection is complete, do one of the following to view the result:
• Use the -report action to view the data from command line.
• Use the -report-output option to write the report to a .txt or .csv file.
• Open the data collection result (*.vtune) in the VTune Profiler graphical interface.
See Also
Syntax
vtune -collect hotspots -knob <knobName=knobValue> [--] <target>
Knobs: sampling-mode, enable-stack-collection, sampling-interval, enable-
characterization-insights
NOTE
For the most current information on available knobs (configuration options) for the Hotspots analysis,
enter:
vtune -help collect hotspots
Example
This example shows how to run the Hotspots analysis in the user-mode sampling mode for a Linux*
myApplication:
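One way to request the user-mode sampling mode explicitly is the sampling-mode knob (the sw value shown
here is an assumption based on the knob list above):
vtune -collect hotspots -knob sampling-mode=sw -- ./myApplication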
NOTE
The hardware event-based sampling mode replaced the advanced-hotspots analysis starting with
VTune Amplifier 2019.
What's Next
When the data collection is complete, do one of the following to view the result:
• Use the -report action to view the data from command line.
• Use the -report-output option to write the report to a .txt or .csv file.
• Open the data collection result (*.vtune) in the VTune Profiler graphical interface.
See Also
Hotspots Analysis for CPU Usage Issues
NOTE
This is a PREVIEW FEATURE. A preview feature may or may not appear in a future production
release. It is available for your use in the hopes that you will provide feedback on its usefulness and
help determine its future. Data collected with a preview feature is not guaranteed to be backward
compatible with future releases.
Syntax
vtune -collect anomaly-detection [-knob <knobName=knobValue>] [--] <target>
Knobs:
• ipt-regions-to-load
Specify the maximum number (10-5000) of code regions to load for detailed analysis. To load details
efficiently, maintain this number at or below 1000.
Default value : 1000
Range : 10-5000
• max-region-duration
Specify the maximum duration (0.001-1000ms) of analysis per code region.
Default value : 100 ms
Range : 0.001-1000ms
NOTE
For the most current information on available knobs (configuration options) for Anomaly Detection
analysis, enter:
vtune -help collect anomaly-detection
Example
This example shows how to run Anomaly Detection analysis on a sample application called myApplication.
The analysis runs over 1000 code regions, analyzing each region for 300 ms.
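A command line sketch that matches these settings (the application path is an assumption):
vtune -collect anomaly-detection -knob ipt-regions-to-load=1000 -knob max-region-duration=300 -- ./myApplication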
What's Next
When the data collection is complete, do one of the following to view the result:
• Use the -report action to view the data from command line.
• Use the -report-output option to write the report to a .txt or .csv file.
• Open the data collection result (*.vtune) in the VTune Profiler graphical interface.
See Also
NOTE
Threading analysis combines and replaces the Concurrency and Locks and Waits analysis types
available in previous versions of Intel® VTune™ Profiler.
Threading analysis uses user-mode sampling and tracing collection. With this analysis you can estimate the
impact each synchronization object has on the application and understand how long the application had to
wait on each synchronization object, or in blocking APIs, such as sleep and blocking I/O.
There are two groups of synchronization objects supported by the Intel® VTune™ Profiler:
• objects usually used for synchronization between threads, such as Mutex or Semaphore
• objects associated with waits on I/O operations, such as Stream
Syntax
vtune -collect threading [-knob <knobName=knobValue>] [--] <target>
Knobs: sampling-interval
NOTE
For the most current information on available knobs (configuration options) for the Threading analysis,
enter:
vtune -help collect threading
Example
This example shows how to run the Threading analysis on a Linux* myApplication application:
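A possible command line for this example (the application path is an assumption):
vtune -collect threading -- ./myApplication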
What's Next
When the data collection is complete, do one of the following to view the result:
• Use the -report action to view the data from command line.
• Use the -report-output option to write the report to a .txt or .csv file.
• Open the data collection result (*.vtune) in the VTune Profiler graphical interface.
See Also
Syntax
vtune -collect memory-consumption [-knob <knobName=knobValue>] [--] <target>
Knobs: mem-object-size-min-thres.
NOTE
For the most current information on available knobs (configuration options) for the Memory
Consumption analysis, enter:
vtune -help collect memory-consumption
Example
This example shows how to run the Memory Consumption analysis on a Python test application:
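A possible command line for this example (the script name and Python interpreter are assumptions):
vtune -collect memory-consumption -- python3 test.py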
What's Next
When the data collection is complete, do one of the following to view the result:
• Use the -report action to view the data from command line.
• Use the -report-output option to write the report to a .txt or .csv file.
• Open the data collection result (*.vtune) in the VTune Profiler graphical interface.
See Also
Memory Consumption Analysis
configuration from GUI
Memory Consumption and Allocations View
Syntax
vtune -collect hpc-performance [-knob <knobName=knobValue>] [--] <target>
Knobs: sampling-interval, enable-stack-collection, collect-memory-bandwidth, dram-
bandwidth-limits, analyze-openmp, collect-affinity.
NOTE
For the most current information on available knobs (configuration options) for the HPC Performance
Characterization analysis, enter:
vtune -help collect hpc-performance
Example
The following example runs the HPC Performance Characterization analysis on a Linux* application with
memory bandwidth analysis enabled:
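A possible command line for this example (the application path is an assumption):
vtune -collect hpc-performance -knob collect-memory-bandwidth=true -- ./myApplication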
What's Next
When the data collection is complete, do one of the following to view the result:
• Use the -report action to view the data from command line.
• Use the -report-output option to write the report to a .txt or .csv file.
• Open the data collection result (*.vtune) in the VTune Profiler graphical interface.
See Also
HPC Performance Characterization Analysis
Syntax
vtune -collect uarch-exploration [-knob [knobName=knobValue]] [--] <target>
Knobs: collect-memory-bandwidth, pmu-collection-mode, dram-bandwidth-limits, sampling-
interval, collect-frontend-bound, collect-bad-speculation, collect-memory-bound,
collect-core-bound, collect-retiring.
By default, the Microarchitecture Exploration analysis runs in the detailed PMU collection mode and
collects sub-metrics for all top-level metrics: Core Bound, Memory Bound, Front-End Bound, Bad
Speculation, and Retiring. If required, you may configure the knob option to disable the collection of
sub-metrics for a particular top-level metric.
NOTE
• For the most current information on available knobs (configuration options) for the
Microarchitecture Exploration analysis, enter:
vtune -help collect uarch-exploration
• The general-exploration analysis type value is deprecated. Make sure to use the uarch-
exploration option instead.
Examples
This example runs the Microarchitecture Exploration analysis on a Linux* matrix app with memory bandwidth
analysis enabled:
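A possible command line for this example (the matrix binary path is an assumption):
vtune -collect uarch-exploration -knob collect-memory-bandwidth=true -- ./matrix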
What's Next
When the data collection is complete, do one of the following to view the result:
• Use the -report action to view the data from command line.
• Use the -report-output option to write the report to a .txt or .csv file.
• Open the data collection result (*.vtune) in the VTune Profiler graphical interface.
See Also
Microarchitecture Exploration Analysis for Hardware Issues
Syntax
vtune -collect memory-access [-knob <knobName=knobValue>] [--] <target>
Knobs: sampling-interval, analyze-mem-objects (Linux* targets only), mem-object-size-min-thres
(Linux targets only), dram-bandwidth-limits, analyze-openmp.
NOTE
For the most current information on available knobs (configuration options) for the Memory Access
analysis, enter:
vtune -help collect memory-access
Example
This example shows how to run the Memory Access analysis on a Linux* myApplication app, collect data
on dynamic memory objects, and evaluate maximum achievable local DRAM bandwidth before the collection
starts:
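A command line sketch that matches this description (the application path is an assumption):
vtune -collect memory-access -knob analyze-mem-objects=true -knob dram-bandwidth-limits=true -- ./myApplication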
What's Next
When the data collection is complete, do one of the following to view the result:
• Use the -report action to view the data from command line.
• Use the -report-output option to write the report to a .txt or .csv file.
• Open the data collection result (*.vtune) in the VTune Profiler graphical interface.
See Also
Memory Access Analysis for Cache Misses and High Bandwidth Issues
NOTE
This analysis is deprecated in the GUI and available from command line only.
TSX Exploration analysis type uses hardware event-based sampling collection and is targeted for the Intel®
processors supporting Intel® Transactional Synchronization Extensions (Intel® TSX). This analysis type
collects events that help understand Intel® Transactional Synchronization Extensions behavior and causes of
transactional aborts.
Syntax
vtune -collect tsx-exploration [-knob <knobName=knobValue>] [--] <target>
Knobs: analysis-step, enable-user-tasks.
NOTE
For the most current information on available knobs (configuration options) for the TSX Exploration
analysis, enter:
vtune -help collect tsx-exploration
Example
This example shows how to run the TSX Exploration analysis on a Linux* myApplication with user tasks
analysis enabled:
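A possible command line for this example (the application path is an assumption):
vtune -collect tsx-exploration -knob enable-user-tasks=true -- ./myApplication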
What's Next
When the data collection is complete, do one of the following to view the result:
• Use the -report action to view the data from command line.
• Use the -report-output option to write the report to a .txt or .csv file.
• Open the data collection result (*.vtune) in the VTune Profiler graphical interface.
See Also
Configure Analysis Options from Command Line
NOTE
This analysis is deprecated in GUI and available from command line only.
TSX Hotspots analysis type uses hardware event-based sampling collection and is targeted for the Intel®
processors supporting Intel® Transactional Synchronization Extensions (Intel® TSX). This analysis type uses
the UOPS_RETIRED.ALL_PS hardware event that emulates precise clockticks and helps identify performance-
critical program units inside transactions.
Syntax
vtune -collect tsx-hotspots [-knob <knobName=knobValue>] [--] <target>
Knobs: sampling-interval, enable-stack-collection.
NOTE
For the most current information on available knobs (configuration options) for the TSX Hotspots
analysis, enter:
vtune -help collect tsx-hotspots
Example
This example shows how to run the TSX Hotspots analysis on a Linux* myApplication with call stack and
thread context switch collection enabled:
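A possible command line for this example (the application path is an assumption):
vtune -collect tsx-hotspots -knob enable-stack-collection=true -- ./myApplication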
What's Next
When the data collection is complete, do one of the following to view the result:
• Use the -report action to view the data from command line.
• Use the -report-output option to write the report to a .txt or .csv file.
• Open the data collection result (*.vtune) in the VTune Profiler graphical interface.
See Also
Configure Analysis Options from Command Line
NOTE
This analysis is deprecated in GUI and available from command line only.
SGX Hotspots analysis type is targeted for systems with Intel Software Guard Extensions (Intel SGX) feature
enabled. It uses the INST_RETIRED.PREC_DIST hardware event that emulates precise clockticks and helps
identify performance-critical program units inside security enclaves. Using the precise event is mandatory for
the analysis on the systems with the Intel SGX enabled because regular non-precise events do not provide a
correct instruction pointer and therefore cannot be attributed to correct modules.
Syntax
vtune -collect sgx-hotspots [-knob <knobName=knobValue>] [--] <target>
Knobs: sampling-interval, enable-user-tasks.
NOTE
For the most current information on available knobs (configuration options) for the SGX Hotspots
analysis, enter:
vtune -help collect sgx-hotspots
Example
The following example shows how to run the SGX Hotspots Analysis on a Linux* myApplication:
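A possible command line for this example (the application path is an assumption):
vtune -collect sgx-hotspots -- ./myApplication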
What's Next
When the data collection is complete, do one of the following to view the result:
• Use the -report action to view the data from command line.
• Use the -report-output option to write the report to a .txt or .csv file.
• Open the data collection result (*.vtune) in the VTune Profiler graphical interface.
See Also
Configure Analysis Options from Command Line
• Explore GPU kernels with high GPU utilization, estimate the effectiveness of this utilization, identify
possible reasons for stalls or low occupancy, and explore options for improvement.
• Explore the performance of your application per selected GPU metrics over time.
• Analyze the hottest SYCL* or OpenCL™ kernels for inefficient kernel code algorithms or incorrect work item
configuration.
NOTE You can run the GPU Compute/Media Hotspots analysis in Characterization mode for Windows*,
Linux* and Android* targets. However, you must have root/administrative privileges to run the
analysis in this mode.
For the Characterization analysis, you can also collect additional data:
• Use the Trace GPU programming APIs option to analyze SYCL, OpenCL™, or Intel Media SDK programs
running on Intel Processor Graphics. This option may affect the performance of your application on the
CPU side.
For SYCL or OpenCL applications, you may identify the hottest kernels and identify the GPU architecture
block where a performance issue for a particular kernel was detected.
For Intel Media SDK programs, you may explore the Intel Media SDK tasks execution on the timeline and
correlate this data with the GPU usage at each moment of time.
Support limitations:
• OpenCL kernels analysis is possible for Windows and Linux targets running on Intel Graphics.
• Intel Media SDK program analysis is possible for Windows and Linux targets running on Intel Graphics.
• Only Launch Application or Attach to Process target types are supported.
NOTE
In the Attach to Process mode if you attached to a process when the computing queue is already
created, VTune Profiler will not display data for the OpenCL kernels in this queue.
• Use the Analyze memory bandwidth option to collect the data required to compute memory
bandwidth. This type of analysis requires Intel sampling drivers to be installed.
• Use the GPU sampling interval, ms field to specify an interval (in milliseconds) between GPU samples
for GPU hardware metrics collection. By default, the VTune Profiler uses 1ms interval.
Control Flow group if, else, endif, while, break, cont, call, calla, ret, goto, jmpi,
brd, brc, join, halt and mov, add instructions that explicitly change the ip
register.
Int16 & HP Float | Int32 & SP Float | Int64 & DP Float groups    Bit operations (only for integer types):
and, or, xor, and others. Arithmetic operations: mul, sub, and others; avg, frc, mac, mach, mad, madm.
Vector arithmetic operations: line, dp2, dp4, and others.
In the Instruction count mode, VTune Profiler also provides Operations per second metrics calculated as
a weighted sum of the following executed instructions:
• Bit operations (only for integer types):
• and, not, or, xor, asr, shr, shl, bfrev, bfe, bfi1, bfi2, ror, rol - weight 1
• Arithmetic operations:
• add, addc, cmp, cmpn, mul, rndu, rndd, rnde, rndz, sub - weight 1
• avg, frc, mac, mach, mad, madm - weight 2
• Vector arithmetic operations:
• line - weight 2
• dp2, sad2 - weight 3
• lrp, pln, sada2 - weight 4
• dp3 - weight 5
• dph - weight 6
• dp4 - weight 7
• dp4a - weight 8
• Extended math operations:
• math.inv, math.log, math.exp, math.sqrt, math.rsq, math.sin, math.cos (weight 4)
• math.fdiv, math.pow (weight 8)
NOTE
The type of an operation is determined by the type of a destination operand.
Syntax
vtune -collect gpu-hotspots [-knob <knobName=knobValue>] -- <target> [target_options]
Knobs: gpu-sampling-interval, profiling-mode, characterization-mode, code-level-analysis,
collect-programming-api, computing-task-of-interest, target-gpu.
NOTE
For the most current information on available knobs (configuration options) for the GPU Compute/
Media Hotspots analysis, enter:
vtune -help collect gpu-hotspots
Example
This example runs the gpu-hotspots analysis in the default characterization mode with the default
overview GPU hardware metric preset:
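A possible command line for this example (the application path is an assumption; no knobs are needed for
the defaults):
vtune -collect gpu-hotspots -- ./myApplication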
What's Next
When the data collection is complete, do one of the following to view the result:
• Use the -report action to view the data from command line.
• Use the -report-output option to write the report to a .txt or .csv file.
• Open the data collection result (*.vtune) in the VTune Profiler graphical interface.
See Also
Optimize applications for Intel® GPUs with Intel® VTune Profiler
GPU Compute/Media Hotspots Analysis (Preview)
Syntax
vtune -collect gpu-offload [-knob <knobName=knobValue>] -- <target> [target_options]
Knobs: collect-cpu-gpu-bandwidth, collect-programming-api, enable-stack-collection,
enable-characterization-insights, target-gpu.
NOTE
For the most current information on available knobs (configuration options) for the GPU Offload
analysis, enter:
vtune -help collect gpu-offload
Example
This example runs the GPU Offload analysis with tracing for GPU programming APIs enabled on the specified
Linux* application:
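A possible command line for this example (the application path is an assumption):
vtune -collect gpu-offload -knob collect-programming-api=true -- ./myApplication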
What's Next
When the data collection is complete, do one of the following to view the result:
• Use the -report action to view the data from command line.
• Use the -report-output option to write the report to a .txt or .csv file.
• Open the data collection result (*.vtune) in the VTune Profiler graphical interface.
See Also
Optimize applications for Intel® GPUs with Intel® VTune Profiler
GPU Offload Analysis
Prerequisites
For successful analysis, make sure to configure your system as follows:
• For Xen virtualization platforms:
• Virtualize CPU performance counters on a Xen platform to enable full-scale event-based sampling.
• Establish a password-less SSH connection to the remote target system with the Xen platform installed.
• To analyze Intel® HD and Intel® Iris® Graphics hardware events on a GPU, make sure to set up your
system for GPU analysis
Syntax
vtune [--target-system=ssh:username@hostname[:port]] --collect graphics-rendering [--
knob <knobName=knobValue>] -- [target] [target_options]
Knobs: gpu-sampling-interval, gpu-counters-mode=render-basic.
NOTE
For the most current information on available knobs (configuration options) for the GPU Rendering,
enter:
vtune -help collect graphics-rendering
Example
This example runs system-wide GPU Rendering analysis for a remote Xen target:
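A command line sketch following the syntax above (the SSH user/host and the 30-second duration are
assumptions):
vtune --target-system=ssh:user@xen-target --collect graphics-rendering --duration 30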
What's Next
When the data collection is complete, do one of the following to view the result:
• Use the -report action to view the data from command line.
• Use the -report-output action-option to write the report to a .txt or .csv file.
• Open the data collection result (*.vtune) in the VTune Profiler graphical interface.
See Also
GPU Rendering Analysis (Preview)
Syntax
vtune -collect fpga-interaction [-knob <knobName=knobValue>] [--] <target>
Knobs: sampling-interval, enable-stack-collection.
NOTE
For the most current information on available knobs (configuration options) for the CPU/FPGA
Interaction analysis, enter:
vtune -help collect fpga-interaction
Example
This example runs the CPU/FPGA Interaction analysis on an application with stack collection enabled:
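A command along these lines should work (the application name is a placeholder):
vtune -collect fpga-interaction -knob enable-stack-collection=true -- ./myApplication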
See Also
CPU/FPGA Interaction Analysis
Syntax
vtune -collect io [-knob <knobName=knobValue>] [-- target] [target_options]
Knobs
Platform-Level Metric Knobs:
Knob           Values       Default   Description
dpdk           true/false   false     Collect DPDK metrics. Make sure DPDK is built with VTune Profiler support.
spdk           true/false   false     Collect SPDK metrics. Make sure SPDK is built with VTune Profiler support.
kernel-stack   true/false   false     Profile the Linux kernel I/O stack.
Prerequisites
Linux* OS:
Load the sampling driver or use driverless hardware event collection (Linux).
See the Input and Output analysis User Guide for detailed prerequisites for each metric type.
FreeBSD* OS:
Install the FreeBSD target package and configure your system following the instructions.
Examples
Example 1: Input and Output Analysis — Launch a Target Application
Run the Input and Output analysis with Intel® VT-d metrics collection enabled for the target application
<app>:
Or attach by PID:
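A sketch of the attach form; the Intel VT-d knob is omitted here because its exact name should be checked with vtune -help collect io:
vtune -collect io -target-pid <PID>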
What's Next
When the data collection is complete, do one of the following to view the result:
• Use the -report action to view the data from command line.
• Use the -report-output action-option to write the report to a .txt or .csv file.
• Open the data collection result (*.vtune) in the VTune Profiler graphical interface.
See Also
Input and Output Analysis
Input and Output analysis
Syntax
vtune -collect system-overview [-knob <knobName=knobValue>] -- <target>
Knobs: collecting-mode, sampling-interval, enable-interrupts-collection, analyze-throttling-reasons.
NOTE
For the most current information on available knobs (configuration options) for the System Overview
analysis, enter:
vtune -help collect system-overview
Example 1:
This example runs the System Overview analysis on a guest OS via Kernel-based Virtual Machine with
specified kallsyms and modules files paths.
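A possible form, with the guest file paths and duration as placeholders:
vtune -collect system-overview -kvm-guest-kallsyms=/path/to/guest/kallsyms -kvm-guest-modules=/path/to/guest/modules -duration 30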
Example 2:
This example runs the System Overview analysis for the matrix application in the low-overhead hardware
tracing mode.
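Assuming hw-tracing is the collecting-mode knob value that selects the hardware tracing mode, the command might look like:
vtune -collect system-overview -knob collecting-mode=hw-tracing -- ./matrix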
What's Next
When the data collection is complete, do one of the following to view the result:
• Use the -report action to view the data from command line.
• Use the -report-output action-option to write the report to a .txt or .csv file.
• Open the data collection result (*.vtune) in the VTune Profiler graphical interface.
See Also
System Overview Analysis
from GUI
Analyze Latency Issues
Use Platform Profiler analysis to ensure that you use the available hardware in the most optimal way for a long-running workload.
Syntax
vtune -collect platform-profiler [-knob <knobName=knobValue>] -- <target>
Knobs:
• analyze-persistent-memory
Collect performance information for Intel® Optane™ Persistent Memory modules.
Default value: false
NOTE
For the most current information on available knobs for Platform Profiler analysis, run this command:
vtune -help collect platform-profiler
Example:
This example demonstrates how you run Platform Profiler analysis.
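A minimal sketch (the application name is a placeholder):
vtune -collect platform-profiler -- ./myApplication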
What's Next
When the data collection is complete, open the data collection result (*.vtune) in the VTune Profiler graphical interface.
See Also
Platform Profiler Analysis
runsa
The hardware event-based sampling collector of the VTune Profiler profiles your application using the counter
overflow feature of the Performance Monitoring Unit (PMU).
Syntax:
vtune -collect-with runsa [-knob <knobName=knobValue>] [--] <target>
NOTE
For the most current information on available knobs (configuration options) for the hardware event-
based sampling, enter:
vtune -help collect-with runsa
runss
The user-mode sampling and tracing collector profiles an application execution and takes snapshots of how
that application utilizes the processors in the system. The collector interrupts a process, collects the value of
all active instruction addresses and captures a calling sequence for each of these samples.
Syntax:
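Assuming the same general form as the other collectors, the syntax should look like:
vtune -collect-with runss [-knob <knobName=knobValue>] [--] <target>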
NOTE
For the most current information on available knobs (configuration options) for the user-mode
sampling and tracing, enter:
vtune -help collect-with runss
Example:
This example runs user-mode sampling and tracing collection for the sample application with loop analysis enabled.
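The exact knob name can be listed with vtune -help collect-with runss; assuming a knob named enable-loop-analysis (an assumption, not verified), the command might look like:
vtune -collect-with runss -knob enable-loop-analysis=true -- ./sample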
What's Next
When the data collection is complete, do one of the following to view the result:
• Use the -report action to view the data from command line.
• Use the -report-output action-option to write the report to a .txt or .csv file.
• Open the data collection result (*.vtune) in the VTune Profiler graphical interface.
See Also
collect-with
action
Configure Analysis Options from Command Line
NOTE
System-wide collection is available for Hardware Event-based Sampling Collection types only.
Example
This example runs the Hotspots analysis in the hardware event-based sampling mode for the sample application and collects data system-wide.
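A command along these lines should produce such a collection (the sample application path is a placeholder):
vtune -collect hotspots -knob sampling-mode=hw -analyze-system -- ./sample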
Example
This example runs a system-wide Hotspots analysis in the hardware event-based sampling mode for 60 seconds.
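For example, assuming no target application is launched and the run is bounded by the duration option:
vtune -collect hotspots -knob sampling-mode=hw -analyze-system -duration 60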
See Also
analyze-system
option
Set Up Analysis Target
from GUI
NOTE
If you plan to collect data remotely using the full-scale command line interface of the VTune Profiler
installed on your target Linux system, see the topic Running Command Line Analysis. You may use the
Command Line option in the VTune Profiler graphical interface to automatically generate a command
line for an analysis configuration selected in the GUI. Make sure to edit the generated command line
for remote collection as described in the Generating Command Line Configuration from GUI topic.
Use the following command line syntax to run the analysis on remote Linux system:
host>./vtune -target-system=ssh:user@target <-action> <analysis_type> [<-knob>
[knobName=knobValue]] [-target-tmp-dir=PATH] [-target-install-dir=PATH][--] <target>
where
• -target-system=ssh:user@target is a remote Linux target
• <-action> is the action to perform the analysis (collect or collect-with)
• <analysis_type> is the type of analysis
• <-knob> is a configuration option that modifies the analysis. For a list of available knobs, enter:
vtune -help <action> <analysis_type>
• [knobName=knobValue] is the name of specified knob and its value
• [-target-tmp-dir=PATH] is a path to the temporary directory on the remote system where
performance results are temporarily stored
• [-target-install-dir=PATH] is a path to the VTune Profiler target package installed on the remote
system
• <target> is the path and name of the application to analyze
Examples
Example 1: Event-based System-wide Sampling Collection
The command line below collects system-wide Hotspots analysis information without call stacks. This command automatically pulls in the modules required for viewing results from the device and caches them in the temp directory on the host. This happens only on the first collection; all subsequent collections reuse modules from the cache.
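One possible form of this command, with the user, target, and knob values as placeholders:
vtune -target-system=ssh:user@target -collect hotspots -knob sampling-mode=hw -knob enable-stack-collection=false -analyze-system -duration 30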
See Also
Set Up Remote Linux* Target
enable-gpu-usage=true | false
Collectors: runss, runsa
Analyze frame rate and usage of Processor Graphics engines.

gpu-counters-mode=none | overview | global-local-accesses | compute-extended | full-compute | render-basic
Collectors: gpu-hotspots, graphics-rendering, gpu-offload, runss, runsa
Analyze performance data from Processor Graphics based on the GPU Metrics Reference:
• overview - track general GPU memory accesses such as Memory Read/Write Bandwidth, GPU L3 Misses, Sampler Busy, Sampler Is Bottleneck, and GPU Memory Texture Read Bandwidth. These metrics can be useful for both graphics and compute-intensive applications.
• global-local-accesses - include metrics that distinguish accessing different types of data on a GPU: Untyped Memory Read/Write Bandwidth, Typed Memory Read/Write Transactions, SLM Read/Write Bandwidth, Render/GPGPU Command Streamer Loaded, and GPU EU Array Usage. These metrics are useful for compute-intensive workloads on the GPU.
• compute-extended - analyze GPU activity on the Intel processor code named Broadwell. This metric set is disabled for other systems.
• full-compute - collect both overview and compute-basic metrics with the allow-multiple-runs option enabled to analyze all types of EU array stalled/idle issues in the same view.
• render-basic (preview) - collect Pixel Shader, Vertex Shader, and Output Merger metrics.
This option is available only for supported platforms with the Intel Graphics Driver installed.

gpu-sampling-interval=<value in us>
Collectors: gpu-hotspots, runss, runsa
Set the interval between GPU samples between 10 and 1000 microseconds. The default is 1000 us. An interval of less than 100 us is not recommended.

enable-gpu-runtimes=true | false
Collectors: gpu-hotspots, runss, runsa
Capture the execution time of OpenCL™ kernels and Intel Media SDK programs on a GPU, identify performance-critical GPU computing tasks, and analyze the performance per GPU hardware metrics.
NOTE
OpenCL kernel analysis is currently supported for Windows and Linux
target systems with Intel HD Graphics and Intel Iris Graphics. Intel® Media
SDK Program Analysis Configuration is supported for Linux targets only and
should be started with root privileges.
Examples
Example 1: Running Analysis for an Intel Media SDK Application
This example starts vtune as root and launches the GPU Compute/Media Hotspots analysis for an Intel Media
SDK application running on Linux:
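A minimal sketch of such a command (the application path is a placeholder):
sudo vtune -collect gpu-hotspots -- ./my_media_sdk_app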
For example, to run GPU Compute/Media Hotspots analysis, collect GPU hardware metrics and trace OpenCL
kernels on the BitonicSort application (-g is the option of the application), enter:
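One possible form, using the enable-gpu-runtimes knob described in the table above (the knob choice and application path are assumptions):
vtune -collect gpu-hotspots -knob enable-gpu-runtimes=true -- ./BitonicSort -g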
Examples
This command generates a callstacks report on the r001hs hotspots result on a Windows* system, searching
for symbol files in the C:\Import\system_modules high-priority search directory, and sends the report to
stdout. -R is the short form of the -report action, and -r is the short form of the result-dir action-option.
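A command along these lines should produce that report (the high-priority qualifier for the search directory is omitted here; see vtune -help search-dir for the exact syntax):
vtune -R callstacks -r r001hs -search-dir C:\Import\system_modules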
This command opens the source view for the foo function annotated with the Hotspots analysis metrics data
collected for the r001hs result. It uses the /home/my_sources directory to search for source files.
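A possible form of this command, assuming the source-object and source-search-dir action-options:
vtune -report hotspots -r r001hs -source-object function=foo -source-search-dir /home/my_sources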
See Also
Import Results from Command Line
Search Directories
from GUI
Search Directories for Remote Linux* Targets
NOTE
Use the user-data-dir action-option to specify the base directory for result paths.
Example
This command runs the Hotspots analysis of myApplication in the current working directory, which is
named test. The result is saved in a default-named directory under the /home/test/ directory. If this was
the first Hotspots analysis run, the result directory would be named r000hs.
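For example (the application name is a placeholder):
vtune -collect hotspots -- ./myApplication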
See Also
result-dir action-option
user-data-dir
action-option
Start collection in the paused mode, and then automatically resume collection
To start data collection in the paused mode, use the start-paused action option as follows:
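A sketch of the general form:
vtune -collect <analysis_type> -start-paused -- <target>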
Examples
This example starts the Hotspots analysis of the sample Linux* application in the paused mode, and then
resumes collection after a 50 second pause.
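Assuming the resume-after value is interpreted in seconds (verify the units with vtune -help), the command might look like:
vtune -collect hotspots -start-paused -resume-after 50 -- ./sample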
See Also
resume-after action-option
command
action
Pause Data Collection
NOTE
To view all the analysis types that are available for your processor, use the command line help:
vtune -help collect
NOTE
For hardware event-based analysis types, a multiplier applies to the configured Sample After value.
Example
Perform a Hotspots analysis in the user-mode sampling mode using a medium sampling interval that is
appropriate for targets with a duration of 15 minutes to 3 hours.
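Assuming medium is a valid value of the target-duration-type action-option, the command might look like (the application name is a placeholder):
vtune -collect hotspots -target-duration-type=medium -- ./myApplication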
NOTE
To start the analysis in the paused mode or pause the collection during the analysis, refer to Pause
Collection from the Command Line section.
Examples
Example 1: Ending analysis after specified time
Start a Hotspots analysis of myApplication and end analysis after 60 seconds.
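A command along these lines should work:
vtune -collect hotspots -duration 60 -- ./myApplication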
See Also
Pause Data Collection
from GUI
target-duration-type
action-option
duration
action-option
See Also
data-limit
command line option
ring-buffer
command line option
Limit Data Collection
from GUI
The result appears in the Intel VTune Profiler Results folder. You can now work with the command line
result exactly as with the result collected from GUI, for example: view source/assembly, filter performance
data, or compare it with another result of the same analysis type.
Click the menu button, select Open > Result..., and navigate to the result file.
See Also
Generate Command Line Reports
NOTE
The Linux* kernel exposes Perf API to the Perf tool starting from version 2.6.31. Any attempts to run
the Perf tool on kernels prior to this version lead to undefined results or even crashes. See Linux Perf
documentation for more details.
NOTE
To import a CSV file with external data, use the -result-dir option and specify the name of an
existing directory of the result that was collected by the VTune Profiler in parallel with the external
collection. VTune Profiler adds the externally collected statistics to the result and provides integrated
data in the Timeline pane.
3. You can use the command line to display the imported result in the VTune Profiler GUI, or generate a
report to view it.
• In the GUI:
vtune-gui <result_dir>/<result>.vtune
• In the CLI:
vtune -report <report_type> -result-dir <result_dir>/<result>.vtune
NOTE
• Use the search-dir action-option to specify symbol and binary files locations for module
resolution.
• For Linux targets, make sure to generate the debug information for your binary files using the -g
option for compiling and linking. This enables the VTune Profiler to collect accurate performance
data.
• To minimize the size of the result, you may use the discard-raw-data action-option, but this will
prevent re-finalizing the result.
• Imported result files may not have all the fields that are present in the VTune Profiler result files, so
some types of data may be missing from the report.
NOTE
• <project_folder> must be a non-existing folder, or you will get an error.
• The energy analysis data file has an extension of .pwr.
You may include a path with the project name to create the project in a directory other than the current
directory.
VTune Profiler should start up and automatically open your project in the Platform Power Analysis
viewpoint.
Examples
This command imports the /home/import/r001.tb6 data collection file on Linux, searching the same
directory for binary and symbol information. The result is output to the current working directory.
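A possible form of this command:
vtune -import /home/import/r001.tb6 -search-dir /home/import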
Generate the callstacks report from the imported r001hs Hotspots result, searching the /home/import/
r001hs directory for binary and symbol information.
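For example, a command along these lines:
vtune -R callstacks -r /home/import/r001hs -search-dir /home/import/r001hs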
See Also
import
action
Import Results and Traces into VTune Profiler GUI
Search Directories
from GUI
NOTE
Raw collector data is used to re-finalize a result. If the collect action is performed with the
discard-raw-data option, so that the raw data is deleted after the initial finalization, the result
cannot be re-finalized.
Re-Finalize a Result
To force result re-finalization, run the finalize action using this general syntax:
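A sketch of the general form:
vtune -finalize -r <result_dir> [-search-dir <path>]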
Example
This example re-finalizes the r001hs result, searching for symbol files in the specified search directory.
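For example (the search directory path is a placeholder):
vtune -finalize -r r001hs -search-dir /home/my_symbols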
See Also
finalize
action
Finalization
NOTE
-R is the short form of the report action, and -r is the short form of the result-dir action-option.
The command syntax for generating a report could also be written as: vtune -R <report_type> -r
<result_path>
Report Types
The vtune command can generate the following types of reports:
Value Description
callstacks Report full stack data for each hotspot function; identify the impact of each
stack on the function CPU or Wait time. You can use the group-by or filter
options to sort the data by:
• callstack
• function
• function-callstack
top-down Report call sequences (stacks) detected during collection phase, starting
from the application root (usually, the main() function). Use this report to
see the impact of program units together with their callees.
gprof-cc Report a call tree with the time (CPU and Wait time, if available) spent in
each function and its children.
Example
This example displays a Hotspots report for the r001hs result, presenting CPU time for the functions of the
target in descending order starting from the most time-consuming function.
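The command producing this output should look like:
vtune -report hotspots -r r001hs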
Function          CPU Time  CPU Time:Effective Time  CPU Time:Effective Time:Idle  CPU Time:Effective Time:Poor  CPU Time:Effective Time:Ok  CPU Time:Effective Time:Ideal  CPU Time:Effective Time:Over
----------------  --------  -----------------------  ----------------------------  ----------------------------  --------------------------  -----------------------------  ----------------------------
grid_intersect    3.371s    3.371s                   0s                            3.371s                        0s                          0s                             0s
sphere_intersect  2.673s    2.673s                   0s                            2.673s                        0s                          0s                             0s
render_one_pixel  0.559s    0.559s                   0s                            0.559s                        0s                          0s                             0s
...
See Also
report action
Save and Format Command Line Reports
Summary Report
Similar to the Summary window, available in GUI, the summary report provides overall performance data of
your target. Intel® VTune™Profiler automatically generates the summary report when data collection
completes. To disable this report, use the no-summary option in your command when performing a collect
or collect-with action.
Use the following syntax to generate the Summary report from a preexisting result:
vtune -report summary -result-dir <result_path>
The summary report output depends on the collection type:
• User-mode Sampling and Tracing Collection Summary Report
• Hardware Event-based Sampling Collection Summary Report
Top Hotspots
Function Module CPU Time
--------- ---------- --------
multiply1 matrix.exe 10.069s
To identify the cause of the wait, view the result in the GUI performance pane, or generate a performance
report.
Hardware Events
Hardware Event Type                  Hardware Event Count  Hardware Event Sample Count  Events Per Sample
-----------------------------------  --------------------  ---------------------------  -----------------
CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE   24,832,593            8                            1000030
CPU_CLK_UNHALTED.REF_TSC             3,471,208,416         120                          24000000
CPU_CLK_UNHALTED.REF_XCLK            43,877,874            14                           1000030
CPU_CLK_UNHALTED.THREAD              3,903,569,890         127                          24000000
FP_ARITH_INST_RETIRED.SCALAR_DOUBLE  943,046,424           14                           20000030
INST_RETIRED.ANY                     4,536,715,682         140                          24000000
UOPS_EXECUTED.THREAD                 5,282,967,942         72                           20000030
UOPS_RETIRED.RETIRE_SLOTS            5,587,595,565         76                           20000030
Name: Intel(R) Xeon(R) E5/E7 v2 Processor code named Ivytown
Frequency: 2.694 GHz
Logical CPU Count: 24
Bandwidth Utilization
Bandwidth Domain  Platform Maximum  Observed Maximum  Average Bandwidth  % of Elapsed Time with High BW Utilization(%)
----------------  ----------------  ----------------  -----------------  ---------------------------------------------
DRAM, GB/sec      15                11.300            2.836              0.4%
Collection and Platform Info
Application Command Line: C:\samples\tachyon\vc10\analyze_locks_Win32_Release\analyze_locks.exe "C:\samples\tachyon\dat\balls.dat"
Operating System: Microsoft Windows 10
Computer Name: My Computer
Result Size: 31 MB
Collection start time: 09:33:44 07/06/2017 UTC
See Also
report action
summary action-option
Window: Summary
in GUI
Hotspots Report
Use the hotspots command line report to identify program units (for example: functions, modules, or
objects) that take the most processor time (Hotspots analysis), underutilize available CPUs or have long
waits (Threading analysis), and so on.
Use the hotspots report to view the hottest GPU computing tasks (or their instances) identified with the gpu-hotspots or gpu-offload analysis.
The report displays the hottest program units in descending order by default, starting from the most performance-critical unit. The command-line reports provide the same data that is displayed in the default GUI analysis viewpoint.
NOTE
To display a list of available groupings for a Hotspots report, enter vtune -report hotspots -r
<result_dir> group-by=?. If you do not specify a result directory, the latest result is used by
default.
Examples
Example 1: Hotspots Report with Module Grouping
This example opens the Hotspots report for the r001hs Hotspots analysis result and groups the data by
module.
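A command along these lines should produce this report:
vtune -report hotspots -r r001hs -group-by module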
KERNELBASE 0.679s
ntdll 0.164s
...
Example 2: Hotspots Report with Limited Items
This example displays the Hotspots report for the r001hs analysis result including only the top two functions
with the highest CPU Time values. Functions having insignificant impact on performance are excluded from
output.
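Assuming the limit action-option is used:
vtune -report hotspots -r r001hs -limit 2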
See Also
Summary Report
Example
This example generates the hw-events report for the specified Hotspots analysis (hardware event-based
sampling mode).
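For example (the result directory is a placeholder):
vtune -report hw-events -r <result_dir>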
See Also
report action
Filter and Group Command Line Reports
from command line
Callstacks Report
Intel® VTune™ Profiler collects call stack information during User-Mode Sampling and Tracing Collection or Hardware Event-based Sampling Collection with stack collection enabled. Use the callstacks report to see how the hot functions are called. This report type focuses on call sequences, beginning from the functions that take the most CPU time.
You can use the -column option to filter the callstacks report and focus on a specific metric, for example:
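A sketch of such a command, with the column name chosen as an example:
vtune -report callstacks -r <result_dir> -column="CPU Time"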
NOTE
To display a list of columns available for callstacks report, enter: vtune -report callstacks -r
<result_dir> column=?
Examples
Example 1: Callstacks Report with Limited Items
The following example generates a callstacks report for the most recent analysis result and limits the
number of functions and function stacks to 5 items.
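Assuming the limit action-option:
vtune -report callstacks -limit 5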
tbb::interface6::internal 0s tachyon_analyze_locks
tbb::interface6::internal
execute<tbb::interface6::internal 0s tachyon_analyze_locks
execute::interface6::internal
[TBB parallel_for on draw_task] 0s tachyon_analyze_locks
tbb::interface6::internal::execute(void)
[TBB Dispatch Loop] 0s libtbb.so.2
tbb::internal::local_wait_for_all(tbb::task&, tbb::task*)
...
See Also
report action
Filter and Group Command Line Reports
Top-down Report
Similar to the Top-down window available in the GUI, the top-down report represents call sequences (stacks) detected during the collection phase, starting from the application root. Use the top-down report to explore the call sequence flow of the application and analyze the time spent in each program unit and its callees.
NOTE
Intel® VTune™ Profiler collects information about program unit callees only during User-Mode Sampling
and Tracing Collection or Hardware Event-based Sampling Collection with stack collection enabled.
Examples
Example 1: Hotspots Top-down Report
This example displays the report for the specified Hotspots analysis in the user-mode sampling mode with function stacks limited to 5 elements.
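One possible form, assuming the limit action-option is what caps the number of stack elements:
vtune -report top-down -r <result_dir> -limit 5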
Function Stack       CPU Time:Total  CPU Time:Effective Time:Total  CPU Time:Spin Time:Total  CPU Time:Overhead Time:Total
-------------------  --------------  -----------------------------  ------------------------  ----------------------------
Total                100.000%        100.000%                       100.000%                  100.000%
func@0x6b2daccf      99.853%         99.835%                        100.000%                  100.000%
func@0x6b2dacf0      99.853%         99.835%                        100.000%                  100.000%
BaseThreadInitThunk  99.853%         99.835%                        100.000%                  100.000%
thread_video         95.614%         97.876%                        78.195%                   0.0%
Example 2: Hotspots Report with Enabled Call Stack Collection (Linux*)
This command runs the Hotspots analysis in the hardware event-based sampling mode with enabled call
stack collection.
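A command along these lines should produce such a collection (the application name is a placeholder):
vtune -collect hotspots -knob sampling-mode=hw -knob enable-stack-collection=true -- ./myApplication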
Function Stack       CPU Time:Total  CPU Time:Effective Time:Total  CPU Time:Spin Time:Total  Context Switch Time:Total  Context Switch Time:Wait Time:Total  Context Switch Time:Inactive Time:Total
-------------------  --------------  -----------------------------  ------------------------  -------------------------  -----------------------------------  ---------------------------------------
Total                100.000%        100.000%                       100.000%                  100.000%                   100.000%                             100.000%
func@0x6b2daccf      97.595%         97.704%                        89.202%                   65.777%                    90.121%                              62.893%
func@0x6b2dacf0      97.595%         97.704%                        89.202%                   65.777%                    90.121%                              62.893%
BaseThreadInitThunk  97.595%         97.704%                        89.202%                   65.777%                    90.121%                              62.893%
threadstartex        67.091%         67.855%                        8.335%                    29.825%                    9.027%                               32.289%
...
Example 3: Hotspots Report with Disabled Stack Collection (Windows*)
This command runs the Hotspots analysis in the hardware event-based sampling mode with disabled call
stack collection.
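For example (the application name is a placeholder):
vtune -collect hotspots -knob sampling-mode=hw -- ./myApplication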
This command generates the top-down report for the previously collected result, and shows the result for
columns with the time:total string in the title. The report does not include information about program unit
callees, as it was not collected during the analysis.
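A possible form of the report command, assuming the column action-option accepts a substring filter:
vtune -report top-down -r <result_dir> -column=time:total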
Function Stack         CPU Time:Total  CPU Time:Effective Time:Total  CPU Time:Spin Time:Total
---------------------  --------------  -----------------------------  ------------------------
Total                  100.000%        100.000%                       100.000%
grid_intersect         50.172%         50.213%                        0.0%
sphere_intersect       31.740%         31.766%                        0.0%
grid_bounds_intersect  3.766%          3.769%                         0.0%
pos2grid               0.778%          0.778%                         0.0%
...
See Also
report action
Window: Top-down Tree
gprof-cc Report
You can use the Intel® VTune™ Profiler command line interface to display analysis results in a gprof-like format. The gprof-cc report shows how much time is spent in each program unit and in its callers and callees. The report is sorted by the time spent in the function and its callees.
Example
This example generates a gprof-cc report from the r001hs hotspots result.
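The command should look like:
vtune -report gprof-cc -r r001hs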
The empty lines divide the report into entries, one for each function. The first line of the entry shows the
caller of the function, the second line shows the called function, and the following lines show function callees.
The Index by function name portion of the report shows the function index sorted by function name.
                         <spontaneous>
[2]    100.0    0.0   11.319    func@0x6b2daccf [2]
                0.0   11.319        func@0x6b2dacf0 [3]

                0.0   11.319    func@0x6b2daccf [2]
[3]    100.0    0.0   11.319    func@0x6b2dacf0 [3]
                0.0   11.319        BaseThreadInitThunk [1]

                0.0   10.709    thread_trace [9]
[4]    94.61    0.0   10.709    [TBB parallel_for on class draw_task] [4]
                0.0   10.709        draw_task::operator() [5]

Index Function
Index Function
----- ------------------------------------------------
[96] ColorAccum
[30] ColorAddS
[15] ColorScale
[137] CreateCompatibleBitmap
[138] DeleteObject
[211] EngAcquireSemaphore
[139] EngCopyBits
[212] EtwEventRegister
[45] ExAcquirePushLockExclusiveEx
[35] ExAcquireResourceExclusiveLite
...
Difference Report
Comparing two results from the command line is a quick way to check your application for regressions. Use the following syntax to create a difference report for the specified analysis results:
vtune -report <report_name> -r <result1_path> -r <result2_path>
where
• <report_name> is the type of report for comparison
• <result1_path> is a directory where your first result file is located
• <result2_path> is a directory where your second result file is located
Example
This example compares r001hs and r002hs Hotspots analysis results collected on Linux and displays CPU
time difference for each function of the analyzed application. In the result for the optimized application
(r002hs), a new main function is running for 0.010 seconds, while the Hotspot function algorithm_2 is
optimized by 1.678 seconds.
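The comparison command should look like:
vtune -report hotspots -r r001hs -r r002hs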
Function Module Result 1:CPU Time Result 2:CPU Time Difference:CPU Time
algorithm_1 matrix 1.225 1.222 0.003
algorithm_2 matrix 3.280 1.602 1.678
main matrix 0 0.010 -0.010
1. Create a baseline.
• Run the vtune tool to analyze your target using a particular analysis type. For example:
On Linux*
vtune -collect hotspots -- sample
On Windows*:
vtune -collect hotspots -- sample.exe
The command runs a Hotspots analysis on the sample or sample.exe target and writes the result
to the current working directory. A Summary report is written to stdout.
• Generate a report to use as a baseline for further analysis. For example:
vtune -report hotspots -result-dir r001hs
This creates a Hotspots report that shows the CPU time for each function of the sample or
sample.exe target.
2. Update your source code to optimize the target application.
3. Create and run the script that:
• On Linux: Sets the path to the vtune installation folder.
• On Windows: Invokes sep-vars.cmd in the Intel® VTune™ Profiler installation folder to set up the environment.
• Starts the vtune command to collect performance data.
• Runs the vtune command to compare the current result with the initial baseline result and displays
the difference. For example:
vtune -R hotspots -r r001hs -r r002hs
This example compares CPU time for each function in results r001hs and r002hs and displays both
results side-by-side with the calculated difference. The positive difference between the performance
values indicates an improvement for result 2. The negative difference indicates a regression.
NOTE
You can compare results of the same analysis type or performance metrics only.
Installation Information
Whether you downloaded Intel® VTune™ Profiler as a standalone component or with the Intel® oneAPI Base
Toolkit, the default path for your <install-dir> is:
Operating System Path to <install-dir>
macOS* /opt/intel/oneapi/
For OS-specific installation instructions, refer to the VTune Profiler Installation Guide.
See Also
vtune Command Syntax
Examples
Example 1: Report Displaying Source Data
This example generates a hotspots report that displays source data for the grid_intersect function. The
report is filtered to display only data columns with source, instructions, cpi values in the title. Since
the result directory is not specified, the most recent hotspots analysis result is used.
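A sketch of such a command; the exact column filter syntax is an assumption and can be checked with vtune -help report hotspots:
vtune -report hotspots -source-object function=grid_intersect -column=source,instructions,cpi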
466 grid_intersect grid.cpp 0x646 0x40d340
0x40d34b 7,600,000 0.750 xor eax, esp
466 grid_intersect grid.cpp 0x646 0x40d340
0x40d34d 3,800,000 4.500 mov dword ptr [esp+0xd4], eax
466 grid_intersect grid.cpp 0x646 0x40d340
0x40d354 5,700,000 0.333 push esi
466 grid_intersect grid.cpp 0x646 0x40d340
0x40d355 1,900,000 1.000 mov esi, dword ptr [esp+0xe4]
466 grid_intersect grid.cpp 0x646 0x40d340
0x40d35c 1,900,000 10.000 push edi
466 grid_intersect grid.cpp 0x646 0x40d340
0x40d35d 3,800,000 0.500 mov edi, dword ptr [esp+0xe4]
466 grid_intersect grid.cpp 0x646 0x40d340
0x40d364 1,900,000 2.000 mov dword ptr [esp+0x74], edi
466 grid_intersect grid.cpp 0x646 0x40d340
0x40d368 3,800,000 3.500 test byte ptr [esi+0x8], 0x8
475 grid_intersect grid.cpp 0x646 0x40d340
0x40d36c 5,700,000 0.667 jnz 0x40d96f <Block 64>
475 grid_intersect grid.cpp 0x646 0x40d340
0x40d372 9,500,000 3.800 Block 2
[Unknown] [Unknown] [Unknown] [Unknown] 0
0x40d372 0 0.000 lea eax, ptr [esp+0x50]
478 grid_intersect grid.cpp 0x646 0x40d340
0x40d376 push eax
478 [Unknown] [Unknown] [Unknown] [Unknown]
0x40d377 1,900,000 11.000 lea eax, ptr [esp+0x8c]
478 grid_intersect grid.cpp 0x646 0x40d340
0x40d37e 1,900,000 0.000 push eax
478 grid_intersect grid.cpp 0x646 0x40d340
0x40d37f 3,800,000 1.000 push esi
478 grid_intersect grid.cpp 0x646 0x40d340
0x40d380 0 push edi
478 grid_intersect grid.cpp 0x646 0x40d340
0x40d381 1,900,000 1.000 call 0x40e4a0 <grid_bounds_intersect>
478 grid_intersect grid.cpp 0x646 0x40d340
0x40d386 15,200,000 2.375 Block 3
[Unknown] [Unknown] [Unknown] [Unknown] 0
0x40d386 13,300,000 2.286 add esp, 0x10
478 grid_intersect grid.cpp 0x646 0x40d340
0x40d389 1,900,000 3.000 test eax, eax
478 grid_intersect grid.cpp 0x646 0x40d340
0x40d38b jz 0x40d96f <Block 64>
478 [Unknown] [Unknown] [Unknown] [Unknown]
0x40d391 3,800,000 2.000 Block 4
[Unknown] [Unknown] [Unknown] [Unknown] 0
0x40d391 0 0.000 movsd xmm0, qword ptr [esp+0x88]
481 grid_intersect grid.cpp 0x646 0x40d340
0x40d39a 3,800,000 1.000 comisd xmm0, qword ptr [esi+0x48]
481 grid_intersect grid.cpp 0x646 0x40d340
0x40d39f 0 jnbe 0x40d96f <Block 64>
481 grid_intersect grid.cpp 0x646 0x40d340
0x40d3a5 5,700,000 2.000 Block 5
[Unknown] [Unknown] [Unknown] [Unknown] 0
0x40d3a5 1,900,000 1.000 sub esp, 0x8
484 grid_intersect grid.cpp 0x646 0x40d340
0x40d3a8 1,900,000 1.000 lea eax, ptr [esp+0x10]
484 grid_intersect grid.cpp 0x646 0x40d340
See Also
report action
source-object action-option
Filter and Group Command Line Reports
NOTE
To be sure the correct result is used, use the result-dir option to specify the result directory. If not
specified when generating a report, the report uses the highest numbered compatible result in the
current working directory.
Examples:
• Generate a Hotspots report from the r001hs result on Linux*, and save it to /home/test/MyReport.txt
in text format.
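A command along these lines should work:
vtune -report hotspots -r r001hs -format text -report-output /home/test/MyReport.txt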
Example:
Output a Hotspots report from the most recent result as a text file with a maximum width of 60 characters
per line.
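For example (the output file name is a placeholder):
vtune -report hotspots -format text -report-width 60 -report-output MyReport.txt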
See Also
report
action
Filter and Group Command Line Reports
NOTE
To display a list of available groupings for a particular report, use -help report <report_name> .
Examples:
• Write stack information for all functions in the threading analysis result r00tr and group data by call
stack:
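A possible form of this command:
vtune -report callstacks -r r00tr -group-by callstack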
NOTE
• To display a list of available filters for a particular report, use -report <report_name> -result
<result_dir> -filter=? .
• To specify multiple filter items, use multiple -filter option attributes. Multiple values for the same
column are combined with 'OR'. Values for different columns are combined with 'AND'.
Examples:
• Display a Hotspots report on the most recent result in the current working directory, but only include data
on the sample module:
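For example, assuming the filter is applied to the Module column:
vtune -report hotspots -filter module=sample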
Example:
Generate a Hotspots report from the most recent compatible result, group the result data by function, and
only display user functions and system functions called directly from user functions:
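One possible form, assuming the call-stack-mode action-option with the user-plus-one value:
vtune -report hotspots -group-by function -call-stack-mode user-plus-one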
NOTE
To display a list of columns available for a particular report, type: vtune -report <report_name> -
r <result_dir> column=?
Examples:
• Show grouping and data columns only for event columns with the *INST_RETIRED.* string in the title:
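A sketch of such a command (the exact wildcard handling of the column filter is an assumption):
vtune -report hw-events -r <result_dir> -column=INST_RETIRED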
group-by
action-option
sort-asc
action-option
sort-desc
action-option
filter
action-option
column
action-option
These command line options are designed to trigger certain actions inside VTune Profiler Server in order to
make it more convenient to run VTune Profiler Server inside a container.
All of these options apply to the vtune-backend binary.
--base-url=http(s)://<host>:<port>/<pathname>/
Usage Example:
1. Enable SSH port forwarding on the host:
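A typical forwarding command might look like this (the port and host names are placeholders):
ssh -L 127.0.0.1:55001:127.0.0.1:55001 user@vtune-server-host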
See Also
Install VTune Profiler Server Set up Intel® VTune™ Profiler as a web server, using a lightweight
deployment intended for personal use or a full-scale corporate deployment supporting multi-user
environment.
Web Server Interface Use Intel® VTune™ Profiler in a web server mode to get an easy on-boarding
experience, benefit from a collaborative multi-user environment, and access a common repository
of collected performance results.
Cookbook: Using VTune Profiler Server in HPC Clusters
Prerequisites: Make sure to prepare a target Android* system and your application for analysis.
To run an analysis on an Android device:
1. Launch your application on the target device.
2. Find out the <pid> or <name> of the application running on the remote Android system. For example, you can use the adb shell ps command for this purpose:
adb shell ps
...
root 2956 2 0 0 c1263c67 00000000 S kworker/u:3
u0_a34 8485 174 770232 54260 ffffffff 00000000 R com.intel.tbb.example.tachyon
shell 8502 235 2148 1028 00000000 b76bcf46 R ps
...
3. Optional: If you have several Android devices, you may set the ANDROID_SERIAL environment variable
to specify the device you plan to use for analysis. For example:
export ANDROID_SERIAL=emulator-5554 or export ANDROID_SERIAL=10.23.235.47:5555
4. On the development host, run vtune to collect data.
• On Windows*: <install-dir>\bin{32,64}
• On Linux*: <install-dir>/bin{32,64}
• <search_dir> is a path to search for binary files used by your Android application
• <source_search_dir> is a path to search for source files used by your Android application
• <target_application> is an application to analyze. The command option depends on analysis
target type:
• To specify an application (a native Linux* application running on Android) or a script to analyze,
enter the path to the application or the script on your host system.
NOTE
This target type is not supported for the Hotspots analysis of Android applications.
• To specify an Android application package to analyze, enter the name of the Android package
installed on a remote device.
• To specify a particular process to attach to and analyze, use the -target-process option to specify the application by process name, or the -target-pid option to specify the application by process PID.
• To profile your Android system, do not specify a target application.
NOTE
System-wide analysis is possible on rooted devices only.
5. Optional: You can send pause and resume commands during collection from another console window,
for example:
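For example (the result directory is a placeholder):
vtune -command pause -r <result_dir>
vtune -command resume -r <result_dir>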
NOTE
You may use the Command Line... option in the VTune Profiler graphical interface to automatically
generate a command line for an analysis configuration selected in the GUI.
NOTE
Java analysis is not supported for the 4th Generation Intel® Core™ processors (based on Intel
microarchitecture code name Haswell).
Example
This example runs Hotspots analysis on target Android system.
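A minimal sketch, using the package name shown in the ps output above and omitting the optional device name:
vtune -target-system=android -collect hotspots -- com.intel.tbb.example.tachyon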
NOTE
Java analysis is not supported for the 4th Generation Intel® Core™ processors (based on Intel
microarchitecture code name Haswell) or systems using ART.
The following event-based sampling analysis types are supported by the VTune Profiler on Android systems:
• hotspots
• uarch-exploration
• memory-access
• system-overview
To associate JITed Java functions to samples in the system-wide event-based sampling, you have the
following two options:
• Specify -target-process Process.Name for the process you are interested in, similar to how you do this for the event-based call stack collection.
• For any process you are interested in, copy the JIT files for the PID of that process into the data.0
directory, and re-resolve the results in the VTune Profiler GUI:
1. Collect results:
Examples
Example 1: Microarchitecture Exploration Analysis
This example launches the specified Android package and collects a complete list of events required to analyze typical client applications running on the 4th Generation Intel Core processor.
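A possible form of this command (the package name is a placeholder):
vtune -target-system=android -collect uarch-exploration -- <package_name>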
By default, the VTune Profiler does not collect stack data during hardware event-based sampling. To enable
call stack analysis, use the enable-stack-collection=true knob. For example:
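A sketch of such a command (the package name is a placeholder):
vtune -target-system=android -collect hotspots -knob sampling-mode=hw -knob enable-stack-collection=true -- <package_name>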
Custom Analysis
Use the -collect-with option to configure VTune Profiler to run a custom user-mode sampling and tracing (runss) or event-based sampling (runsa) analysis with non-default configuration options. For example, to run a custom event-based sampling analysis, use the -collect-with option and specify the required event counters with the -knob event-config option as follows:
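A sketch with two example events (the package name is a placeholder):
vtune -collect-with runsa -target-system=android -knob event-config=INST_RETIRED.ANY,CPU_CLK_UNHALTED.THREAD -- <package_name>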
NOTE
To display a list of events available on the target PMU, enter:
vtune -collect-with <collector> -target-system=android:deviceName -knob event-
config=? <target_application>
You can take any counter that the Performance Monitoring Unit (PMU) of that processor supports.
Additionally, you can enable multiple counters at a time. Each processor supports only a specific number of counters that can be collected at a time. You can collect more events than the processor supports by using the -event-mux option, which round-robins the specified events on the available counters of that processor.
NOTE
Typically, it is recommended to use analysis types with predefined sets of counters; the use of specific counters is intended for advanced users. Note that the names of some counters may not exactly correspond to the analysis scope provided with these counters.
After collecting these counters, import the result to the VTune Profiler GUI and explore the Microarchitecture Exploration data.
See Also
Android* Targets
nm libgomp.so.1.0.0
If the library does not contain any symbols, either install/compile a new library with symbols or generate
debug information for the library. For example, on Fedora* you can install GCC debug information from
the yum repository:
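The exact package and command depend on your distribution; a command along these lines (package name assumed) installs the debug information:
sudo debuginfo-install libgomp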
Use the hotspots report to identify the hottest program units. Use the following command to list the top five
parallel regions with the highest Potential Gain metric values:
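A sketch of such a command; the grouping and column names are assumptions and can be listed with group-by=? and column=?:
vtune -report hotspots -r <result_dir> -group-by=region -sort-desc="OpenMP Potential Gain" -limit 5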
Analyze the OpenMP Potential Gain columns data that shows a breakdown of Potential Gain in the region by representing the cost (in elapsed time) of the inefficiencies, normalized by the number of OpenMP threads. The elapsed time cost helps you decide whether you need to invest in addressing a particular type of inefficiency. VTune Profiler can recognize the following types of inefficiencies:
• Imbalance: threads finish their work at different times and wait on a barrier. If the imbalance time is significant, try a dynamic scheduling type. The Intel OpenMP runtime library from Intel Parallel Studio Composer Edition reports precise imbalance numbers, so these metrics do not depend on statistical accuracy, unlike the other inefficiencies that are calculated based on sampling.
• Lock Contention: threads wait on contended locks or "ordered" parallel loops. If the lock contention time is significant, try to avoid synchronization inside a parallel construct by using reduction operations, thread-local storage, or less costly atomic operations for synchronization.
• Creation: overhead on parallel work arrangement. If the time for parallel work arrangement is significant, try to make parallelism more coarse-grain by moving parallel regions to an outer loop.
• Scheduling: OpenMP runtime scheduler overhead on parallel work assignment for working threads. If the scheduling time is significant, which often happens for dynamic types of scheduling, use a "dynamic" schedule with a bigger chunk size or a "guided" schedule type.
• Atomics: OpenMP runtime overhead on performing atomic operations.
• Reduction: time spent on reduction operations.
Limitations
VTune Profiler supports the analysis of parallel OpenMP regions with the following limitations:
• The maximum number of supported lexical parallel regions is 512, which means that no region annotations are emitted for regions whose scope is reached after 512 other parallel regions are encountered.
• Regions from nested parallelism are not supported. Only top-level items emit regions.
• VTune Profiler does not support static linkage of OpenMP libraries.
See Also
Cookbook: OpenMP* Code Analysis Method
MPI Code Analysis
NOTE
To see all knobs available for a predefined analysis type, enter:
vtune -help collect <analysis_type>
To see knobs for a custom analysis type, enter:
vtune -help collect-with <analysis_type>
Examples
Example 1: Running Java Analysis
The following command line runs the Hotspots analysis on a java command on Linux*:
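For example (the class and application arguments are placeholders):
vtune -collect hotspots -- java -cp . MyClass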
NOTE
The dynamic attach mechanism is supported only with the Java Development Kit (JDK).
The following example attaches the Hotspots analysis to a running Java process on Linux:
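For example (the process ID is a placeholder):
vtune -collect hotspots -target-pid <java_pid>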
NOTE
For more information on analyzing the summary report data, refer to the Summary Report section.
Examples
The following example generates the summary report for the Hotspots analysis result. For user-mode sampling and tracing analysis results, the summary report includes Collection and Platform information, CPU information, and a summary of the basic metrics.
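The command producing the reports below should look like:
vtune -report summary -r r001hs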
On Windows:
------------------------ ------------------------------------------
Operating System Microsoft Windows 10
Result Size 21258782
Collection start time 11:58:36 15/04/2019 UTC
Collection stop time 11:58:50 15/04/2019 UTC
CPU
---
Parameter r001hs
----------------- -------------------------------------------------
Name 4th generation Intel(R) Core(TM) Processor family
Frequency 2494227391
Logical CPU Count 4
Summary
-------
Elapsed Time: 12.939
CPU Time: 14.813
Average CPU Usage: 1.012
On Linux:
CPU
---
Parameter r001hs
----------------- -------------------------------------------------
Name 3rd generation Intel(R) Core(TM) Processor family
Frequency 3492067692
Logical CPU Count 8
Summary
-------
Elapsed Time: 10.183
CPU Time: 19.200
Average CPU Usage: 1.885
This example generates the summary report for the Hotspots analysis (hardware event-based sampling mode) result. For hardware event-based sampling analysis results, the summary report includes Collection and Platform information, CPU information, a summary of the basic metrics, and an event summary.
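The command should look like:
vtune -report summary -r r002hs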
------------------------ ------------------------------------------
Operating System 3.16.0-30-generic NAME="Ubuntu"
VERSION="14.04.2 LTS, Trusty Tahr"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 14.04.2 LTS"
VERSION_ID="14.04"
HOME_URL="http://www.ubuntu.com/"
SUPPORT_URL="http://help.ubuntu.com/"
BUG_REPORT_URL="http://bugs.launchpad.net/ubuntu/"
Result Size 171662827
Collection start time 10:44:34 15/04/2019 UTC
Collection stop time 10:44:50 15/04/2019 UTC
CPU
---
Parameter r002hs
----------------- -------------------------------------------------
Name 4th generation Intel(R) Core(TM) Processor family
Frequency 2494227445
Logical CPU Count 4
Summary
-------
Elapsed Time: 15.463
CPU Time: 6.392
Average CPU Usage: 0.379
CPI Rate: 1.318
Event summary
-------------
Hardware Event Type         Hardware Event Count:Self  Hardware Event Sample Count:Self  Events Per Sample
--------------------------  -------------------------  --------------------------------  -----------------
INST_RETIRED.ANY            13014608235                8276                              1900000
CPU_CLK_UNHALTED.THREAD     17158609921                8207                              1900000
CPU_CLK_UNHALTED.REF_TSC    15942400300                5163                              1900000
BR_INST_RETIRED.NEAR_TAKEN  1228364727                 4648                              200003
CALL_COUNT                  213650621                  75413                             1
ITERATION_COUNT             370567815                  84737                             1
LOOP_ENTRY_COUNT            162943310                  70069                             1
The report displays the hottest program units in descending order by default, starting from the most performance-critical unit. The command-line reports provide the same data that is displayed in the default GUI analysis viewpoints.
NOTE
• To display a list of available groupings for a hotspots report, enter: vtune -report hotspots -
r <result_dir> group-by=?.
• To set the number of top items to include in a report, use the limit action option: vtune -report
<report_type> -limit <value> -r <result_dir>
Examples
This example generates the hotspots report for the Hotspots analysis result and groups the data by module.
The result file is not specified and VTune Profiler uses the latest analysis result.
Function               CPU Time  CPU Time:Effective Time  CPU Time:Effective Time:Idle  CPU Time:Effective Time:Poor  CPU Time:Effective Time:Ok  CPU Time:Effective Time:Ideal  CPU Time:Effective Time:Over  CPU Time:Spin Time  CPU Time:Overhead Time  Module          Function (Full)        Source File   Start Address
---------------------  --------  -----------------------  ----------------------------  ----------------------------  --------------------------  -----------------------------  ----------------------------  ------------------  ----------------------  --------------  ---------------------  ------------  -------------
consume_time           10.371s   10.371s                  0s                            10.341s                       0.020s                      0.010s                         0s                            0s                  0s                      mixed_call.dll  consume_time           mixed_call.c  0x180001000
NtWaitForSingleObject  1.609s    0s                       0s                            0s                            0s                          0s                             0s                            1.609s              0s                      ntdll.dll       NtWaitForSingleObject  [Unknown]     0x1800906f0
WriteFile              0.245s    0.245s                   0.009s                        0.190s                        0.030s                      0.016s                         0s                            0s                  0s                      KERNELBASE.dll  WriteFile              [Unknown]     0x180001c50
func@0x707d5440        0.114s    0.010s                   0s                            0.010s                        0s                          0s                             0s                            0.104s              0s                      jvm.dll         func@0x707d5440        [Unknown]     0x707d5440
func@0x705be5c0        0.072s    0.025s                   0s                            0.025s                        0s                          0s                             0s                            0.047s              0s                      jvm.dll         func@0x705be5c0        [Unknown]     0x705be5c0
...
On Linux:
Function  CPU Time  CPU Time:Effective Time  CPU Time:Effective Time:Idle  CPU Time:Effective Time:Poor  CPU Time:Effective Time:Ok  CPU Time:Effective Time:Ideal  CPU Time:Effective Time:Over  CPU Time:Spin Time  CPU Time:Overhead Time  Module  Function (Full)  Source File  Start Address
--------  --------  -----------------------  ----------------------------
Module CPU Time CPU Time:Effective Time CPU Time:Effective Time:Idle CPU
Time:Effective Time:Poor CPU Time:Effective Time:Ok CPU Time:Effective Time:Ideal CPU
Time:Effective Time:Over CPU Time:Spin Time CPU Time:Overhead Time Instructions Retired CPI
Rate Wait Rate CPU Frequency Ratio Context Switch Time Context Switch Time:Wait Time
Context Switch Time:Inactive Time Context Switch Count Context Switch Count:Preemption
Context Switch Count:Synchronization Module
Path
-------------- -------- ----------------------- ----------------------------
---------------------------- -------------------------- -----------------------------
---------------------------- -------
mixed_call.dll 15.294s 15.294s
0.419s 14.871s 0.004s
0s 0s 0s 0s
21,148,958,284 1.907 0.000 1.149
1.401s 0s 1.401s
26,769 26,769 0 C:\work\module
Java\module Java\java_mixed_call\vc9\bin32\mixed_call.dll
jvm.dll 0.582s 0.582s
0.033s 0.547s 0.002s
0s 0s 0s 0s
792,807,896 1.513 0.437 0.899
0.047s 0.005s 0.042s
462 451 11 C:\Program Files
(x86)\Java\jre8\bin\client\jvm.dll
ntoskrnl.exe 0.404s 0.404s
0.034s 0.370s 0.001s
0s 0s 0s 0s
660,557,183 1.096 0.000
0.780
C:\WINDOWS\system32\ntoskrnl.exe
...
On Linux:
Module CPU Time CPU Time:Effective Time CPU Time:Effective Time:Idle CPU
Time:Effective Time:Poor CPU Time:Effective Time:Ok CPU Time:Effective Time:Ideal CPU
Time:Effective Time:Over CPU Time:Spin Time CPU Time:Overhead Time Instructions Retired CPI
Rate Wait Rate CPU Frequency Ratio Context Switch Time Context Switch Time:Wait Time
Context Switch Time:Inactive Time Context Switch Count Context Switch Count:Preemption
Context Switch Count:Synchronization Module
Path
---------------- -------- ----------------------- ----------------------------
---------------------------- -------------------------- -----------------------------
---------------------------- ------
libmixed_call.so 15.294s 15.294s
0.419s 14.871s 0.004s
0s 0s 0s 0s
21,148,958,284 1.907 0.000 1.149
1.401s 0s 1.401s
26,769 26,769 0 /tmp/
java_mixed_call/src/libmixed_call.so
libjvm.so 0.582s 0.582s
0.033s 0.547s 0.002s
0s 0s 0s 0s
792,807,896 1.513 0.437 0.899
0.047s 0.005s 0.042s
462 451 11 /tmp/
java_mixed_call/src/libmjvm.so
...
...
Analyze Stacks
To get the maximum performance out of your Java application, consider writing and compiling performance-critical modules of your Java project in native languages, such as C or even assembly. This helps your application take advantage of vectorization and make full use of powerful CPU resources such as vector computing (implemented via SIMD units and instruction sets). In this case, compute-intensive functions become hotspots in the profiling results, which is expected as they do most of the job. However, you might be interested not only in the hotspot functions themselves, but also in identifying the locations in Java code from which these functions were called via the JNI interface. Tracing such cross-runtime calls in mixed-language algorithm implementations can be a challenge.
Use the callstacks report to display full stack data for each hotspot function and identify the impact of
each stack on the function CPU or Wait time.
NOTE
To display a list of available groupings for a callstacks report, enter vtune -report callstacks
-r <result_dir> group-by=?.
Example
The following command line generates the callstacks report for the specified Hotspots analysis result.
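A command along these lines (the result directory is a placeholder):
vtune -report callstacks -r <result_dir>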
On Windows:
NOTE
To display a list of available groupings for a hw-events report, enter vtune -report hw-events -r
<result_dir> group-by=?.
Example
This example generates the hw-events report for the specified Hotspots analysis (hardware event-based
sampling mode) result.
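A command along these lines (the result directory is a placeholder):
vtune -report hw-events -r <result_dir>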
On Windows:
1.401s 0s 1.401s
26,769 26,769 0
[libmixed_call.so] [libmixed_call.so] [Unknown] 0
[libjvm.so] 792,807,896
1,199,773,286 1,335,034,092
0.047s 0.005s 0.042s
462 451 11 [libjvm.so]
[libjvm.so] [Unknown] 0
...
Limitations
VTune Profiler supports analysis of Java applications with some limitations:
• System-wide profiling is not supported for managed code.
• The JVM interprets some rarely called methods instead of compiling them for the sake of performance.
VTune Profiler does not recognize interpreted Java methods and marks such calls as !Interpreter in the
restored call stack.
If you want such functions to be displayed in stacks with their names, force the JVM to compile them by
using the -Xcomp option (show up as [Compiled Java code] methods in the results). However, the
timing characteristics may change noticeably if many small or rarely used functions are being called
during execution.
• When opening source code for a hotspot, the VTune Profiler may attribute events or time statistics to an
incorrect piece of the code. It happens due to JDK Java VM specifics. For a loop, the performance metric
may slip upward. Often the information is attributed to the first line of the hot method's source code.
• Consider events and time mapping to the source code lines as approximate.
• For the user-mode sampling based Hotspots analysis type, the VTune Profiler may display only a part of
the call stack. To view the complete stack on Windows, use the -Xcomp additional command line JDK Java
VM option that enables the JIT compilation for better quality of stack walking. On Linux, use additional
command line JDK Java VM options that change behavior of the Java VM:
• Use the -Xcomp additional command line JDK Java VM option that enables the JIT compilation for
better quality of stack walking.
• On Linux* x86, use client JDK Java VM instead of the server Java VM: either explicitly specify -client,
or simply do not specify -server JDK Java VM command line option.
• On Linux x64, specify -XX:-UseLoopCounter command line option that switches off on-the-fly
substitution of the interpreted method with the compiled version.
• Java application profiling is supported for the Hotspots and Microarchitecture analysis types. Support for
the Threading analysis is limited as some embedded Java synchronization primitives (which do not call
operating system synchronization objects) cannot be recognized by the VTune Profiler. As a result, some
of the timing metrics may be distorted.
• There are no dedicated libraries supplying a user API for collection control in the Java source code.
However, you may want to try applying the native API by wrapping the __itt calls with JNI calls.
See Also
Java* Code Analysis
from GUI
Enable Java* Analysis on Android* System
Stitch Stacks for Intel® oneAPI Threading Building Blocks or OpenMP* Analysis
NOTE
See the VTune Profiler CLI Cheat Sheet quick reference on VTune Profiler command line interface.
Option Description
Each option description provides the following information:
• A short description of the option.
• Products: This section lists the names of products supporting this option.
• GUI Equivalent: This section shows the equivalent of the option in the integrated development
environment (IDE)/standalone GUI client. If no equivalent is available, None is specified.
• Syntax: This section describes the command line syntax of the option.
• Arguments: This section lists the arguments related to the option. If it has no arguments, None is
specified.
• Default: This section shows the default setting for the option.
• Modifiers: This section lists the modifiers for the described action. The section is only available for
actions.
• Actions Modified: This section lists the actions modified by the described option. The section is only
available for modifiers.
• Description: This section provides the full description for the option.
• Alternate Options: These options can be used instead of the described option. If no alternate options
are available, None is specified.
• Example: This is a typical usage example of the option.
• See Also: This section provides links to further information related to the option such as other options or
corresponding GUI procedures.
General Rules
• Options can be preceded by a single dash ("-") or a double dash ("--").
• Option names and values can be separated with a space (" "), or an equal sign ("=").
• Options defining the collection are specified before the analyzed target and can appear on the command
line in any order. Options related to the target are specified after the target.
• You cannot combine options with a single dash. For example, the -q and -c options cannot be specified
as -qc.
• Options may have short and long names. Short names consist of one letter. Long names consist of one or
more words separated by dashes. Both short and long names are case-sensitive. Long and short option
names can be used interchangeably. For example, you may use -report or -R to generate a report.
• Long names of the options can be abbreviated. If the option consists of several words you can abbreviate
each word, keeping the dash between them. Make sure an abbreviated version unambiguously matches
the long name. For example, the -option-name option can be abbreviated as -opt-name, -op-na,
-opt-n, or -o-n.
• If the abbreviation is ambiguous between two available options, a syntax error is reported.
• You can disable Boolean default options by specifying -no-<optionname> from the command line. For
example, to avoid displaying a summary report after analysis, run vtune with the -no-summary option.
Conversely, if the default is -no-<option>, you can disable it by specifying -<optionname>.
• You can specify multiple values for the option by using the option several times, or by using the option
once and specifying comma-separated values (make sure there are no spaces around the commas). The
examples below are equivalent and specify two filters for the r001tr result when generating a hotspots
report.
On Linux*:
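A possible pair of equivalent commands (the process and module names are placeholders; the second form
relies on the comma-separated syntax described above):
vtune -report hotspots -r r001tr -filter process=myApp -filter module=libfoo.so
vtune -report hotspots -r r001tr -filter process=myApp,module=libfoo.so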
See Also
vtune Actions
allow-multiple-runs
Enable multiple runs to achieve more precise results
for hardware event-based collections.
Syntax
-allow-multiple-runs
-no-allow-multiple-runs
Actions Modified
collect, collect-with
Description
By default, no-allow-multiple-runs is enabled, and a collect or collect-with action performs a single
analysis run, multiplexing events if necessary. Performing multiple analysis runs can provide more precise
results for hardware event-based collections. To enable multiple runs, specify allow-multiple-runs.
Example
This example runs the target application twice, collecting different events on each run.
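One possible command line, using the runsa collector and a placeholder event list:
vtune -collect-with runsa -allow-multiple-runs -knob event-config=<event_name1>,<event_name2> -- ./myApp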
See Also
Allow Multiple Runs or Multiplex Events
from GUI
vtune Command Syntax
analyze-kvm-guest
Analyze a KVM guest OS running on your system.
Syntax
-analyze-kvm-guest | -no-analyze-kvm-guest
Actions Modified
collect-with
Description
Analyze a KVM guest OS running on your system. For successful analysis, make sure to do the following:
1. Copy these files from the guest OS to your local file system:
• /proc/kallsyms
• /proc/modules
• any guest OS’s modules of interest (vmlinux, any *.ko files, and so on)
2. Specify a Linux target system for analysis using the target-system option.
3. Configure your VTune Profiler analysis target by using the kvm-guest-kallsyms, kvm-guest-modules,
and search-dir options to specify paths to the files copied in step 1 for accurate module resolution.
4. Configure your collect-with action by using the knob ftrace-config=<events> option to specify Linux
FTrace* events that track the IRQ injection process.
Example
Enable a custom hardware event-based sampling collection for the KVM guest OS and collect irq, softirq,
workq, and kvm FTrace events:
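A possible command line; the kallsyms and modules paths are placeholders for the files copied from the
guest OS:
vtune -collect-with runsa -analyze-kvm-guest -kvm-guest-kallsyms=/home/<user>/guest/kallsyms -kvm-guest-modules=/home/<user>/guest/modules -knob ftrace-config=irq,softirq,workq,kvm -knob event-config=<event_list> -- <target>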
See Also
Profile KVM Kernel and User Space on the KVM System
from GUI
Targets in Virtualized Environments
knob
ftrace-config
kvm-guest-modules
kvm-guest-kallsyms
vtune Actions
analyze-system
Enable analysis of all processes running on the
system.
GUI Equivalent
Configure Analysis window > WHAT pane > Advanced section > Analyze system-wide option
Syntax
-analyze-system
-no-analyze-system
Default
no-analyze-system
Actions Modified
collect, collect-with
Description
For hardware event-based analysis types, no-analyze-system is enabled by default, so only the target
process is analyzed. Use analyze-system if you want to analyze all processes running on the system. Data
on CPU consumption for these other processes shows how they affect the performance of the target process.
Example
Perform the Hotspots analysis (hardware event-based sampling mode) of all processes running on the
system.
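For example (myApp is a placeholder; sampling-mode=hw selects the hardware event-based sampling mode):
vtune -collect hotspots -knob sampling-mode=hw -analyze-system -- ./myApp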
See Also
vtune Actions
app-working-dir
Specify the application directory in auto-generated
commands.
GUI Equivalent
Configure Analysis window > HOW pane > Launch Application target type
Syntax
-app-working-dir=<PATH>
Arguments
A string containing the PATH/name.
Default
Default is the current working directory.
Actions Modified
collect, collect-with
Description
If your data files are stored in a separate location from the application, use the app-working-dir option to
specify the application working directory.
Example
This command line example changes the application directory to C:\myAppDirectory (on Windows*) and
to /home/myAppDirectory (on Linux*) to run the myApp application, uses binary and symbol files found in
the directory specified by the search-dir option to finalize the result, writes the result in the default result
directory, and then returns to the working directory.
On Windows:
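A possible command line (the directories and application names are placeholders):
vtune -collect hotspots -app-working-dir=C:\myAppDirectory -search-dir=C:\myAppDirectory -- myApp.exe
On Linux*:
vtune -collect hotspots -app-working-dir=/home/myAppDirectory -search-dir=/home/myAppDirectory -- ./myApp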
See Also
vtune Actions
call-stack-mode
Choose how to show system functions in the call
stack.
Syntax
-call-stack-mode <value>
Arguments
<value> - Type of call stack display. The following values are available:
Argument Description
user-plus-one Show user functions and system functions called directly from user
functions.
Default
user-plus-one Collected data is attributed to user functions and system functions called directly
from user functions.
Actions Modified
collect, finalize, import, report
Description
Use the call-stack-mode option when performing data collection, finalization or importation, to set call
stack data attribution for the result or report. If set for collection, finalization or importation, this sets the
default view when the result is opened in the GUI, and applies to any reports unless overridden in the
command used to generate the report.
Example
Generate a hotspots result and include system as well as user functions in the call stack. This is now the
project-level setting, and if the result is viewed in the GUI, the call stack shows both user functions and
system functions.
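A possible command line, assuming the all argument value and a placeholder application name:
vtune -collect hotspots -call-stack-mode all -- ./myApp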
See Also
vtune Actions
collect
Run the specified analysis type and collect data into a
result.
Syntax
-collect <analysis_type>
-c <analysis_type>
Arguments
hotspots Identify your most time-consuming source code using one of the
available collection modes:
Collection type: user-mode sampling and tracing collection or
hardware event-based sampling.
Knobs: enable-characterization-insights, enable-stack-collection,
sampling-interval, sampling-mode.
threading Analyze how your application is using available logical CPU cores,
discover where parallelism is incurring synchronization overhead, find
how waits affect your application's performance, and identify potential
candidates for parallelization.
Collection type: user-mode sampling and tracing collection.
Knobs: sampling-interval.
hpc-performance Identify opportunities to optimize CPU, memory, and FPU utilization for
compute-intensive or throughput applications.
Collection type: hardware event-based sampling collection.
Knobs: enable-stack-collection, collect-memory-bandwidth,
sampling-interval, dram-bandwidth-limits.
uarch-exploration (formerly known as general-exploration)
Identify and locate the most significant hardware issues that affect the
performance of your application. Use this analysis type as a starting
point for microarchitecture analysis.
Collection type: hardware event-based sampling collection.
Knobs: enable-stack-collection, collect-memory-bandwidth, enable-user-tasks.
memory-access Measure a set of metrics to identify memory access related issues (for
example, specific for NUMA architectures).
Collection type: hardware event-based sampling collection.
Knobs: sampling-interval, dram-bandwidth-limits, analyze-openmp;
Linux only: analyze-mem-objects, mem-object-size-min-thres.
sgx-hotspots (deprecated) Analyze hotspots inside security enclaves for systems with the Intel®
Software Guard Extensions (Intel® SGX) feature enabled.
Collection type: hardware event-based sampling collection.
Knobs: enable-stack-collection, enable-user-tasks.
cpugpu-concurrency (deprecated) Enable the CPU/GPU Concurrency analysis and explore code execution
on the various CPU and GPU cores in your system, correlate CPU and
GPU activity and identify whether your application is GPU or CPU
bound.
Knobs: sampling-interval, enable-user-tasks, enable-user-sync,
enable-gpu-usage, gpu-counters-mode, enable-gpu-runtimes.
gpu-hotspots Identify GPU tasks with high GPU utilization and estimate the
effectiveness of this utilization.
Collection type: hardware event-based sampling collection.
Knobs: gpu-sampling-interval, enable-gpu-usage, gpu-counters-mode,
enable-gpu-runtimes, enable-stack-collection.
gpu-profiling (deprecated) Analyze GPU kernel execution per code line and identify performance
issues caused by memory latency or inefficient kernel algorithms.
Collection type: hardware event-based sampling collection.
Knobs: gpu-profiling-mode, kernels-to-profile.
graphics-rendering (preview) Analyze the CPU/GPU utilization of your code running on the Xen
virtualization platform. Explore GPU usage per GPU engine and GPU
hardware metrics that help understand where performance
improvements are possible. If applicable, this analysis also detects
OpenGL-ES API calls and displays them on the timeline.
Collection type: hardware event-based sampling collection.
Knobs: gpu-sampling-interval, gpu-counters-mode.
fpga-interaction Analyze the CPU/FPGA interaction issues via exploring OpenCL kernels
running on FPGA, identify the most time-consuming FPGA kernels.
Collection type: hardware event-based sampling collection.
Knobs: sampling-interval, enable-stack-collection.
NOTE
For Android* systems, VTune Profiler provides GPU analysis only on processors with Intel®
HD Graphics and Intel® Iris® Graphics. You cannot view the collected results in the CLI
report. To view the results, open the result file in GUI.
Modifiers
[no-]allow-multiple-runs, [no-]analyze-system, data-limit, discard-raw-data, duration,
finalization-mode, [no-]follow-child, knob, mrte-mode, quiet, resume-after,
return-app-exitcode, ring-buffer, search-dir, start-paused, strategy, [no-]summary,
target-duration-type, target-pid, target-process, target-system, trace-mpi,
no-unplugged-mode, user-data-dir, verbose
Description
Use the collect action to perform analysis and collect data. By default, this process performs the specified
type of analysis, collects and finalizes data into a result file, and outputs a Summary report to stdout. In most
cases you will want to use the search-dir action-option to specify the search directory. Some analysis types
support the knob option, which allows you to specify additional analysis settings.
There are many options that you can use to customize the behavior of the collect action to suit your
purposes. For example, you can choose whether to analyze a child process only, whether to start collection
after a certain amount of time has elapsed, or whether to perform collection without finalizing the result.
There are a few examples included in this topic. For more information, use one of the help commands
described below, or browse or search this documentation for information on the type of analysis you wish to
perform.
NOTE
To access the most current command line documentation for an action, enter vtune -help
<action>, where <action> is one of the available actions. To see all available actions, enter
vtune -help.
Examples
This command runs the hotspots analysis in the hardware event-based sampling mode for a Linux myApp
application, writes the result to the default directory, and outputs a summary report by default.
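For example (myApp is a placeholder application name):
vtune -collect hotspots -knob sampling-mode=hw -- ./myApp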
This command uses the no-auto-finalize action-option to start a Threading analysis, collect performance
data, and exit without finalizing the result.
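A possible form of such a command (myApp is a placeholder):
vtune -collect threading -no-auto-finalize -- ./myApp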
See Also
Run Command Line Analysis
collect-with
action
Analyze Performance
in GUI
vtune Command Syntax
collect-with
Run a custom hardware event-based sampling or
user-mode sampling and tracing collection using your
settings.
Syntax
-collect-with <collector_name>
Arguments
collector_name Description
Modifiers
[no-]allow-multiple-runs, analyze-kvm-guest, [no-]analyze-system, app-working-dir,
call-stack-mode, cpu-mask, data-limit, discard-raw-data, duration, finalization-mode,
[no-]follow-child, inline-mode, knob, mrte-mode, quiet, result-dir, resume-after,
return-app-exitcode, ring-buffer, search-dir, start-paused, strategy, [no-]summary,
target-duration-type, target-pid, target-process, no-unplugged-mode, user-data-dir,
verbose
Description
Use the collect-with action when you want finer control over analysis settings than the collect action
can offer. Specify both the collector type and the knob. The collector type determines the type of collection,
and the knob determines the level or granularity. Lower levels are coarser grained, while higher levels are
finer grained. The analysis process includes finalization of the result, and a summary report is displayed by
default.
For the runsa (event-based sampling) collector, the event-config knob option specifies the list of events to
collect. To display a list of events available on the target PMU, enter:
vtune -collect-with runsa -knob event-config=? <target>
The command returns names and short descriptions of available events. For more information on the events,
use the Intel Processor Events Reference.
NOTE
• To access the most current command line documentation for the collect or collect-with action,
enter vtune -help collect or vtune -help collect-with.
• For the most current information on available knobs, enter vtune -help collect
<analysis_type> or vtune -help collect-with <analysis_type>, where <analysis_type>
is the type of analysis you wish to perform.
Example
This example runs the hardware event-based sampling collector for the sample Linux* application on the
specified events and displays a summary report.
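One possible command line, with placeholder event names and application:
vtune -collect-with runsa -knob event-config=<event_name1>,<event_name2> -- ./myApp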
See Also
Hardware Event-based Sampling Collection
Custom Analysis
in GUI
column
Specify substrings for the column names to display
only corresponding columns in the report.
Syntax
-column=<string>
Arguments
<string> - Full name of the column or its substring.
Actions Modified
report , report-output
Description
Filter in the report to display only data columns (typically corresponding to performance metrics or hardware
events) with the specified <string> in the title. For example, specify -column=Total to view only Total
metrics in the report. Columns used for data grouping are always displayed.
To display a list of columns available for a particular report, type: vtune -report <report_name> -r
<result_dir> column=?.
Example
Display grouping and data columns only for event columns with the *INST_RETIRED.* string in the title:
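For example (the result directory name is a placeholder):
vtune -report hw-events -r r001hs -column=INST_RETIRED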
See Also
Save and Format Command Line Reports
command
Issue a command to a running collect action.
Syntax
-command=<value>
Arguments
<value> Description
mark Place time-stamped mark in the data that can be referenced during analysis.
pause Temporarily suspend the collection process. Use -command resume when you are
ready to continue collection.
resume Continue collection on a paused collection process.
status Print collection status.
stop Terminate a running collection process. Alternatively, use ctrl + c.
Modifiers
result-dir, user-data-dir
Description
This option performs one of the following actions on a running collect action: pause, resume, stop,
status, or mark. Use with result-dir to specify the result directory for the running analysis.
Example
This example terminates the collect process in the default directory.
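A possible form of this command:
vtune -command=stop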
See Also
vtune Command Syntax
vtune Actions
cpu-mask
Syntax
-cpu-mask=<cpu_mask1>,<cpu_mask3>-<cpu_mask5>...
Arguments
CPU number or a range of numbers.
Actions Modified
collect, collect-with
Description
This option specifies the CPU(s) for which data will be collected during hardware event-based sampling
collection. Specify a list of comma-separated CPU IDs (with no spaces) and/or the range(s) of CPU IDs. A
range is represented by a beginning and ending ID, separated by a dash.
Example
This example collects samples on four CPUs (1, 3, 4, and 5) for a Linux sample application.
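One possible command line, using the runsa collector and a placeholder event list:
vtune -collect-with runsa -cpu-mask=1,3-5 -knob event-config=<event_list> -- ./myApp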
csv-delimiter
Specify the delimiter for a tabular report.
Syntax
-csv-delimiter=<delimiter>
Arguments
Actions Modified
The report action, used with the format csv action-option. To write the report to a file, also use the
report-output option.
Description
Use this option to specify a delimiter when using -format csv to generate a report in CSV format.
Example
Generate a tabular hotspots report from the most recent result, using comma delimiters, and save the report
as MyReport.csv in the current working directory.
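A possible command line (assuming comma is accepted as the delimiter name):
vtune -report hotspots -format csv -csv-delimiter comma -report-output MyReport.csv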
Sample output:
Module,Process,CPU Time
worker3.so,main,10.735
worker1.so,main,5.525
worker2.so,main,3.612
worker5.so,main,3.103
worker4.so,main,1.679
main,main,0.064
See Also
Generate Command Line Reports
vtune Actions
cumulative-threshold-percent
Set a percent of the target CPU/Wait time to display
only the hottest program units that exceed this
threshold.
GUI Equivalent
Window: Summary - Hotspots
Syntax
-cumulative-threshold-percent=<value>
Arguments
<value> The percent of target CPU/Wait time consumed by the program units
displayed.
Default
Actions Modified
report
Description
Use the cumulative-threshold-percent action-option to generate a performance detail report that
focuses on program units that exceed the specified percentage of target CPU/Wait time. Functions below the
specified threshold are filtered out, so your report includes just the hottest program units, and excludes
those that are insignificant.
Example
Linux*: Generate a Performance Detail report from the r001hs Hotspots result that only includes the hottest
functions, which cumulatively account for 90% of target CPU time. Functions in the remaining 10% of
target CPU time are excluded.
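One possible command line, assuming the hotspots report type:
vtune -report hotspots -r r001hs -cumulative-threshold-percent=90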
Module      Function                Result 1:CPU Time  Result 2:CPU Time  Difference:CPU Time  Cumulative Percent
matrix.exe  algorithm_2             3.106              3.131              -0.025               100.000

Module      Function                Result 1:CPU Time  Result 2:CPU Time  Difference:CPU Time  Cumulative Percent
ntdll.dll   KiFastSystemCallRet     0.012              0                  0.012                39.956
ntdll.dll   NtWaitForSingleObject   0.113              0.110              0.003                50.051
See Also
Change Threshold Values
vtune Actions
custom-collector
Launch an external collector to gather custom interval
and counter statistics for your target in parallel with
the VTune Profiler.
Syntax
-custom-collector=<string>
Arguments
Actions Modified
collect, collect-with
Description
Your custom collector can be an application you analyze with the VTune Profiler or a collector that can be
launched with the VTune Profiler.
Use the -custom-collector option to specify an external collector other than a target analysis application.
When you start a collection, the VTune Profiler does the following:
1. Launches the target application in the suspended mode.
2. Launches the custom collector in the attach (or system-wide) mode.
3. Switches the application to the active mode and starts profiling.
If your custom collector cannot be launched in the attach mode, the collection may produce incomplete data.
You can later import custom collection data (time intervals and counters) in a CSV format to the VTune
Profiler result.
Example
This example runs Hotspots analysis in the default user-mode sampling mode and also launches an external
script collecting custom statistics for the specified application:
Windows:
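A possible command line; the collector script name and application are placeholders:
vtune -collect hotspots -custom-collector="python my_collector.py" -- myApp.exe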
See Also
vtune Actions
data-limit
Limit the amount of raw data (in MB) to be collected.
Syntax
-data-limit=<integer>
Arguments
Actions Modified
collect, collect-with
Description
Use the data-limit action-option to limit the amount of raw data (in MB) to be collected. Zero data limit
means no limit for data collection.
Alternate Options
Example
Perform a Hotspots analysis and limit the size of collected data to 200MB.
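For example (myApp is a placeholder):
vtune -collect hotspots -data-limit=200 -- ./myApp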
See Also
Limit Data Collection
ring-buffer
action-option
vtune Command Syntax
vtune Actions
discard-raw-data
Specify removal of raw collector data after finalization.
Syntax
-discard-raw-data
-no-discard-raw-data
Actions Modified
collect, collect-with, finalize, import
Description
Use the discard-raw-data action-option if you want to remove raw collector data after the result is
finalized. This makes the result files smaller.
NOTE
Keeping raw data enables result re-finalization. Do not use this option if you want to re-
finalize the results in the future.
Example
This example runs the Hotspots analysis for the sample Linux* application, generates a default summary
report, and removes raw collector data.
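A possible command line (myApp is a placeholder):
vtune -collect hotspots -discard-raw-data -- ./myApp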
See Also
Finalization
vtune Actions
duration
Specify the duration for collection (in seconds).
Syntax
-duration=<value>
Arguments
Actions Modified
collect, collect-with
Description
The duration option is required for system-wide collection and specifies the duration of the collection (in
seconds). System-wide collection occurs when no target is specified on the command line when collection
is initiated. The option can also be used when the target is specified but you want to set a specific
duration for data collection.
Example
This command performs system-wide collection of Hotspots for 20 seconds.
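One possible command line, assuming the hardware event-based sampling mode and no target:
vtune -collect hotspots -knob sampling-mode=hw -duration=20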
See Also
Manage Analysis Duration from Command Line
vtune Actions
filter
Specify which data to include or exclude.
Syntax
-filter <column_name> [= | !=]<value>
Arguments
Argument Description
Actions Modified
report
Description
Use the filter option to include or exclude data from a report based on the specified column_name, the =
or != operator, and the value for that column.
To display a list of available filter attributes for a particular report, use vtune -report <report_name> -r
<result_dir> filter=? option. If you do not specify a result directory, the latest result is used by default.
Examples
Generate a hotspots report on Linux* from the specified hotspots result that only includes data from the
appname process. Data from other processes is excluded. This report is sent to stdout.
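For example (the result directory name is a placeholder):
vtune -report hotspots -r r001hs -filter process=appname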
See Also
Filter and Group Command Line Reports
from CLI
group-by
action-option
vtune Actions
finalization-mode
Perform full finalization, fast finalization, deferred
finalization or skip finalization.
GUI Equivalent
Configure Analysis window > WHAT pane > Advanced section > Select finalization mode option
Syntax
-finalization-mode=<value>
Arguments
Default
fast vtune performs fast finalization with a reduced number of loaded samples.
Actions Modified
collect, collect-with, import, finalize
Description
Use the finalization-mode option with the collect, collect-with, import, and finalize commands
to define the finalization mode for the result.
Use the full finalization mode to perform the finalization on unchanged sampling data on the target system.
This mode takes the most time and resources to complete, but produces the most accurate results.
Use the fast finalization mode to perform the finalization on the target system using algorithmically reduced
sampling data. This greatly reduces the finalization time with a negligible impact on accuracy in most cases.
If you discover inaccuracies in your finalization, you can always use the finalize action with the full
finalization mode to re-finalize the result in full mode.
Use the deferred finalization mode to collect the sampling data and the binary checksums to perform the
finalization on another machine. After data collection completes, you can finalize and open the analysis result
on the host system. This mode may be useful for profiling applications on targets with limited computational
resources, such as IoT devices, and finalizing the result later on the host machine.
NOTE
To have binaries successfully resolved during finalization, ensure that the host system has access to
the binaries.
Use the none option to skip finalization entirely and to not collect the binary checksums. You can also finalize
this result later, however, you may encounter certain limitations. For example, if the binaries on the target
system have changed or have become unavailable since the sampling data collection, binary resolution may
produce an inaccurate or missing result for the affected binary.
You can always repeat the finalization process in a different mode using the finalize action.
Example
The following command starts the Hotspots analysis on Windows and only calculates the binary checksums
for finalization on another machine.
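A possible command line, assuming the deferred mode corresponds to this behavior:
vtune -collect hotspots -finalization-mode=deferred -- myApp.exe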
See Also
Intel® Xeon Phi™ Processor Targets
finalize
option
Run Command Line Analysis
Finalization
vtune Actions
finalize
Perform symbol resolution to finalize or re-resolve a
result.
Syntax
-finalize -result-dir <PATH>
-I -result-dir <PATH>
Arguments
The finalize action must be used with the result-dir action-option, which passes in the PATH/name of
the result directory.
Modifiers
call-stack-mode, discard-raw-data, inline-mode, quiet, result-dir, search-dir, verbose
Description
Use the finalize action when you need to finalize an un-finalized or improperly finalized result in the
directory specified by the result-dir action-option. Use GUI tools to change search directories settings, or
use the search-dir action-option with the finalize action to re-finalize the result and update symbol
information.
Normally, finalization is performed automatically as part of a collect or import action. However, you may
need to re-finalize a result if:
• Finalization was suppressed during collection or importation, for example when the
-finalization-mode=none action-option was specified for a collect or collect-with action.
• You need to re-resolve a result that was not properly finalized because some of the source or symbol
files were missing. When such a result is viewed in the GUI or in reports, the word [Unknown] commonly appears.
Example
In this example, finalization is suppressed when generating a Hotspots analysis result r001hs on Linux*.
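A possible pair of commands (the search directory is a placeholder):
vtune -collect hotspots -finalization-mode=none -r r001hs -- ./myApp
vtune -finalize -r r001hs -search-dir /home/<user>/myApp/bin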
See Also
Finalization
vtune Actions
format
Syntax
-format <value>
Arguments
<value> Description
csv CSV output format. File extension is .csv. Must be used with csv-
delimiter option.
xml XML output format. File extension is .xml. Available for summary report
only.
html HTML output format. File extension is .html. Available for summary report
only.
Default
text
Actions Modified
report
Description
Use the format action-option to specify output format for report. To print to a file, use this with the report-
output option. If you choose csv, you must also use the csv-delimiter option to specify the delimiter, such as
comma.
NOTE
XML and HTML formats are available for the summary report only.
Example
Generate a Hotspots report in CSV file format using a comma delimiter and save it as MyReport.csv in the
current working directory.
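One possible command line (the report type and output file name are placeholders):
vtune -report hotspots -format csv -csv-delimiter comma -report-output MyReport.csv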
See Also
vtune Command Syntax
vtune Actions
group-by
Specify grouping in a report.
Syntax
-group-by <granularity1>,<granularity2>
Arguments
Argument Description
Actions Modified
report
Description
Use the group-by action-option to group data in your report by your specified criteria. For multiple grouping
levels, add arguments separated by commas (no spaces).
NOTE
For some reports (for example, top-down report) you can specify only a single grouping
level.
To display a list of available groupings for a particular report, type: vtune -report <report_name> -r
<result_dir> group-by=?. If you do not specify a result directory, the latest result is used by default.
NOTE
The function value groups the result data both by function and by module. To group just by
the function, use function-only.
Example
Output a hotspots report for the latest result with data grouped by module:
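For example (the latest result is used since no result directory is specified):
vtune -report hotspots -group-by module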
Output a hotspots report for the latest result with data grouped by thread and function:
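And with two grouping levels, separated by a comma:
vtune -report hotspots -group-by thread,function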
See Also
Save and Format Command Line Reports
vtune Actions
help
Display brief explanations of command line
arguments.
Syntax
-h, -help
-help <action>
-help collect <analysis_type>
-help collect-with <collector_type>
-help report <report_type>
Arguments
Argument Description
Description
Use the help action to access help for the vtune command. The help for each action includes
explanations and usage examples.
Below is a list of available actions:
help, version, import, finalize, report, collect, collect-with, command
Examples
Display all available vtune actions.
vtune -help
Display help for the collect action, including all available options.
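Based on the syntax above:
vtune -help collect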
See Also
Get Help
vtune Actions
import
Import one or more collection data files/directories.
Syntax
-import <PATH>
Arguments
A string containing the PATH of the data files to import. To import several files, make sure to use the import
option for each path.
Modifiers
call-stack-mode, discard-raw-data, inline-mode, result-dir, search-dir, user-data-dir
Description
Use the import action to import one or more collection data files into the VTune Profiler. You may import the
following formats:
• .tb6 or .tb7 with event-based sampling data. To import the files, use the -result-dir option and
specify the name for a new directory you want to create for the imported data. If you do not use the -
result-dir option, the VTune Profiler creates a new directory with the default name.
• .perf files with event-based sampling data collected by Linux* Perf tool. To ensure accurate data
representation in the VTune Profiler, make sure to run the Perf collection with the predefined command
line options:
• For application analysis:
perf record -o <trace_file_name>.perf --call-graph dwarf -e cpu-cycles,instructions <application_to_launch>
• For process analysis:
perf record -o <trace_file_name>.perf --call-graph dwarf -e cpu-cycles,instructions -p <PID> sleep 15
where the -e option is used to specify a list of events to collect as -e <list of events>; the --call-graph
option (optional) configures samples to be collected together with the thread call stack at the moment a
sample is taken. See the Linux Perf documentation on possible call stack collection options (for example,
dwarf) and their availability in different OS kernel versions.
NOTE
The Linux kernel exposes Perf API to the Perf tool starting from version 2.6.31. Any attempts to run
the Perf tool on kernels prior to this version lead to undefined results or even crashes. See Linux Perf
documentation for more details.
• To import a csv file, use the -result-dir option and specify the name of an existing directory of the
result that was collected by the VTune Profiler in parallel with the external data collection. VTune Profiler
adds the externally collected statistics to the result and provides integrated data in the Timeline pane.
NOTE
Importing a csv file to the VTune Profiler result does not affect symbol resolution in the result. For
example, you can safely import a csv file to a result located on a system where module and debug
information is not available.
• *.pwr processed Intel SoC Watch files with energy analysis data
Example
This example imports the sample_data.tb7 file into a VTune Profiler project and creates the result directory
r000hs:
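A possible command line:
vtune -import sample_data.tb7 -r r000hs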
This example imports a trace file collected with the Linux Perf tool into a VTune Profiler project and creates a
default result directory r000 (since no result directory is specified from the command line):
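One possible command line (the trace file name is a placeholder):
vtune -import mytrace.perf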
See Also
vtune Command Syntax
vtune Actions
inline-mode
GUI Equivalent
Toolbar: Filter > Inline Mode menu
Syntax
-inline-mode off | on
Actions Modified
collect, finalize, import, report
Description
Use inline-mode off with the collect, finalize or import actions if you want to exclude inline functions
from the stack in results. You can also use this with the report action to exclude inline functions from
reports.
By default, this option is enabled so that performance details for all inline functions used in the application
are included in the stack in results and reports.
NOTE
This option is available if information about inline functions is available in debug information
generated by compilers. See View Data on Inline Functions for supported compilers and
options.
Example
Generate a hotspots report with inline mode disabled.
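A possible command line (the latest result is used by default):
vtune -report hotspots -inline-mode off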
See Also
View Data on Inline Functions
from GUI
vtune Command Syntax
vtune Actions
knob
Set configuration options for the specified analysis
type or collector type.
Syntax
-knob | -k <knob-name>=<knob-value>
Arguments
knob-name An analysis type or collector type may have one or more configuration
options (knobs) that provide additional instructions for performing the
specified type of analysis. To use a knob, you must specify the knob
name and knob value.
Multiple knob options are allowed and can be followed by additional
action-options, as well as global-options, if needed.
knob-value There are values available for each knob. In most cases this is a
Boolean value, so for Boolean knobs, specify <knob-name>=true to
enable the knob.
NOTE
Knob behavior may vary depending on the analysis type or collector type.
<knob-name> Description
accurate-cpu-time-detection=true | false (Windows only)
    Collect more accurate CPU time data. This option requires additional disk space and post-processing
    time. Administrator privileges are required.
    Supported analysis: runss
    Default: true
analyze-power-usage=true | false
    Collect information about energy consumed by CPU, DRAM, and discrete GPU.
analyze-throttling-reasons=true | false
    Collect information about factors that cause the CPU to throttle.
    Supported analysis: system-overview
    Default: false
characterization-mode=overview | global-local-accesses | compute-extended | full-compute | instruction-count
    Monitor the Render and GPGPU engine usage (Intel Graphics only), identify which parts of the engine
    are loaded, and correlate GPU and CPU data.
    The Characterization mode uses platform-specific presets of the GPU metrics. All presets, except for
    the instruction-count, collect data about execution units (EUs) activity: EU Array Active, EU Array
    Stalled, EU Array Idle, Computing Threads Started, and Core Frequency; and each one introduces
    additional metrics:
    • overview metric set includes additional metrics that track general GPU memory accesses such as
      Memory Read/Write Bandwidth, GPU L3 Misses, Sampler Busy, Sampler Is Bottleneck, and GPU Memory
      Texture Read Bandwidth. These metrics can be useful for both graphics and compute-intensive
      applications.
    • global-local-accesses metric group includes additional metrics that distinguish accessing
      different types of data on a GPU: Untyped Memory Read/Write Bandwidth, Typed Memory Read/Write
      Transactions, SLM Read/Write Bandwidth, Render/GPGPU Command Streamer Loaded, and GPU EU Array
      Usage. These metrics are useful for compute-intensive workloads on the GPU.
    • compute-extended metric group includes additional metrics targeted only for GPU analysis on the
      Intel processor code name Broadwell and higher. For other systems, this preset is not available.
    • full-compute metric group is a combination of the overview and global-local-accesses event sets.
    • instruction-count metric group counts the execution frequency of specific classes of instructions.
    Supported analysis: gpu-hotspots, graphics-rendering, runsa
    Default: overview
Default value: bb-latency
    • bb-latency mode helps you identify issues caused by algorithm inefficiencies. In this mode, VTune
      Profiler measures the execution time of all basic blocks. Basic block is a straight-line code
      sequence that has a single entry point at the beginning of the sequence and a single exit point at
      the end of this sequence. During post-processing, VTune Profiler calculates the execution time for
      each instruction in the basic block. So, this mode helps understand which operations are more
      expensive.
    • mem-latency mode helps identify latency issues caused by memory accesses. In this mode, VTune
      Profiler profiles memory read/synchronization instructions to estimate their impact on the kernel
      execution time. Consider using this option, if you ran the gpu-hotspots analysis in the
      Characterization mode, identified that the GPU kernel is throughput or memory-bound, and want to
      explore which memory read/synchronization instructions from the same basic block take more time.
    Supported analysis: gpu-hotspots
collect-bad-speculation=true | false
    Collect the minimum set of data required to compute top-level metrics and all Bad Speculation
    sub-metrics.
    Supported analysis: uarch-exploration, runsa
    Default value: true
collect-core-bound=true | false
    Collect the minimum set of data required to compute top-level metrics and all Core Bound sub-metrics.
collect-frontend-bound=true | false
    Collect the minimum set of data required to compute top-level metrics and all Front-End Bound
    sub-metrics.
collect-cpu-gpu-bandwidth=true | false
    Collect DRAM bandwidth data for all hosts. Additionally, collect PCIe bandwidth for supported server
    hosts (Intel® micro-architectures code named Ice Lake and Sapphire Rapids). To view collected data in
    GUI, enable the Analyze CPU host-GPU bandwidth option.
    Supported analysis: gpu-offload
    Default: false
collect-cpu-gpu-pci-bandwidth=true | false
    Collect PCIe bandwidth for supported server hosts (Intel® micro-architectures code named Ice Lake and
    Sapphire Rapids). This knob is available for custom analyses only. To view collected data in GUI,
    enable the Analyze CPU host-GPU bandwidth option.
    Supported analysis: runsa
    Default: false
collect-io-waits=true | false
    Analyze the percentage of time each thread and CPU spends in I/O wait state.
collect-memory-bound=true | false
    Collect the minimum set of data required to compute top-level metrics and all Memory Bound
    sub-metrics.
collect-programming-api=true | false
    Analyze execution of DPC++ apps, OpenCL™ kernels and Intel® Media SDK programs on Intel HD Graphics
    and Intel® Iris® Graphics. This option may affect the performance of your application on the CPU side.
    Supported analysis: gpu-hotspots, gpu-offload, runsa
    Default for gpu-hotspots: true, for runss: false.
collect-retiring=true | false
    Collect the minimum set of data required to compute top-level metrics and all Retiring sub-metrics.
collecting-mode=hw-sampling | hw-tracing
    Specify the system-wide collection mode to either explore CPU, GPU, and I/O resources utilization with
    the default event-based sampling mode, or enable the low-overhead hardware tracing and identify a root
    cause of latency issues.
    Supported analysis: system-overview, runsa
    Default value: hw-sampling
counting-mode=true | false
    Choose between collecting detailed context data for each PMU event (such as code or hardware context)
    or the counts of events. Counting mode introduces less overhead but gives less information.
    Supported analysis: runsa
    Default: false
dram-bandwidth-limits=true | false
    Evaluate maximum achievable local DRAM bandwidth before the collection starts. This data is used to
    scale bandwidth metrics on the timeline and calculate thresholds.
    Supported analysis: performance-snapshot, memory-access, uarch-exploration, hpc-performance, runsa
    Default: true for the HPC Performance Characterization and Microarchitecture Exploration analysis
    with the collect-memory-bandwidth knob enabled; true for the Memory Access and Microarchitecture
    Exploration analysis.
enable-context-switches=true | false
    Analyze detailed scheduling layout for all threads in your application, explore time spent on a
    context switch and identify the nature of context switches for a thread (preemption or
    synchronization).
    Supported analysis: runsa
    Default: false
enable-gpu-usage=true | false
    Analyze frame rate and usage of Intel HD Graphics and Intel® Iris® Graphics engines and identify
    whether your application is GPU or CPU bound.
    Supported analysis: runss, runsa
    Default: false
enable-interrupt-collection=true | false
    Collect interrupt events that alter a normal execution flow of a program. Such events can be generated
    by hardware devices or by CPUs. Use this data to identify slow interrupts that affect your code
    performance.
    Supported analysis: system-overview
    Default: false
enable-system-cswitch=true | false
    Analyze detailed scheduling layout for all threads on the system and identify the nature of context
    switches for a thread (preemption or synchronization).
    Supported analysis: runsa
    Default: false
enable-thread-affinity=true | false
    Analyze thread pinning to sockets, physical cores, and logical cores. Identify incorrect affinity that
    utilizes logical cores instead of physical cores and contributes to poor physical CPU utilization.
    Default: false
    NOTE
    Affinity information is collected at the end of the thread lifetime, so the resulting data may not
    show the whole issue for dynamic affinity that is changed during the thread lifetime.
enable-user-tasks=true | false
    Analyze tasks, events and counters specified in your application via the Task API. This option causes
    higher overhead and increases result size.
event-config=<event_name1>,<event_name2>,...
    Configure PMU events to collect with the hardware event-based sampling collector. Multiple events can
    be specified as a comma-separated list (no spaces).
    NOTE
    To display a list of events available on the target PMU, enter:
    vtune -collect-with runsa -knob event-config=? <target>
    The command returns names and short descriptions of available events. For more information on the
    events, use Intel Processor Events Reference.
io-mode=off | stack | nostack
    Enable to identify where threads are waiting or compute thread concurrency. The collector instruments
    APIs, which causes higher overhead and increases result size.
    Supported analysis: runss, runsa
    Default: off
ipt-regions-to-load=<number> between 10 and 5000
    Specify the maximum number (10-5000) of code regions to load for detailed analysis.
    Supported analysis: anomaly-detection
    Default: 1000
mem-object-size-min-thres=<number>
    Specify a minimal size of memory allocations to analyze. This option helps reduce runtime overhead of
    the instrumentation. This option is supported for Linux targets only running on the Intel
    microarchitecture code name Sandy Bridge (or later).
    Supported analysis: memory-access
    Default: 1024 bytes
no-altstack=true | false
    Disable using alternative stacks for signal handlers. Consider this option for profiling standard
    Python 3 code on Linux.
Default: sw
    Use sw to identify CPU hotspots and explore a call flow of your program. This mode does not require
    sampling drivers to be installed but incurs more collection overhead.
    Use hw to identify application hotspots based on such basic hardware events as Clockticks and
    Instructions Retired. This is a low-overhead collection mode but it requires the sampling driver to
    be installed on your system.
    Supported analysis: hotspots, threading
stack-size=<number>
    A number between 0 and 2147483647. Default is 0 (unlimited stack size).
    Reduce the collection overhead and limit the stack size (in bytes) processed by the VTune Profiler.
    Supported analysis: runsa
stack-type=software | lbr
    Choose between software stack and hardware LBR-based stack types. Software stacks have no depth
    limitations and provide more data while hardware stacks introduce less overhead. Typically, software
    stack type is recommended unless the collection overhead becomes significant. Note that hardware LBR
    stack type may not be available on all platforms.
    Supported analysis: runsa
    Default: software
stackwalk-mode=online | offline
    Choose between online (during collection) and offline (after collection) modes to analyze stacks.
    Offline mode reduces analysis overhead and is typically recommended.
    Supported analysis: runss
    Default: offline
target-gpu=<domain:bus:device.function>
    Select a target GPU for profiling when you have multiple GPUs connected to your system. If unset,
    VTune Profiler selects the newest GPU architecture it can detect.
waits-mode=off | stack | nostack
    Enable to identify where threads are waiting or compute thread concurrency. The collector instruments
    APIs, which causes higher overhead and increases result size.
    Supported analysis: runss
    Default: off
Actions Modified
collect, collect-with
Description
Use the knob action-option to configure knob settings for a collect (predefined analysis types) or
collect-with (custom analysis types) action where the analysis type supports one or more knobs. Each
analysis type or collector type supports a specific set of knobs, and each knob requires a value. In most
cases the knob value is Boolean, so you would use True to enable the knob.
Example
This example returns a list of knobs for the Threading analysis type:
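Per the NOTE in the collect topic, the knob list is available through the help action:
vtune -help collect threading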
This example runs a custom event-based sampling data collection on an Android system enabling collection
of Android framework and chipset events.
See Also
Custom Analysis Options
in GUI
Analyze Performance
from GUI
API Support
vtune Actions
kvm-guest-kallsyms
Specify a local path to the /proc/kallsyms file
copied from the guest system.
Syntax
-kvm-guest-kallsyms=<string>
Arguments
A string containing the PATH, for example: /home/<user>/[guest]/<kvm kallsyms path>.
Actions Modified
collect, collect-with
Description
Specify a local path to the /proc/kallsyms file copied from the guest OS for proper finalization.
Example
Enable a custom hardware event-based sampling collection for the KVM guest OS and collect irq, softirq,
workq, and kvm FTrace* events:
See Also
Profile KVM Kernel and User Space on the KVM System
from GUI
Targets in Virtualized Environments
knob
ftrace-config
kvm-guest-modules
analyze-kvm-guest
vtune Actions
kvm-guest-modules
Specify a local path to the /proc/modules file copied
from the guest system.
Syntax
-kvm-guest-modules=<string>
Arguments
A string containing the PATH, for example: /home/<user>/<guest mount path>/<kvm modules path>.
Actions Modified
collect, collect-with
Description
Specify a local path to the /proc/modules file copied from the guest OS for proper finalization.
Example
Enable a custom hardware event-based sampling collection for the KVM guest OS mounted to the
/home/vtune/guest_mount directory:
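One possible command line; the file locations under the mount point and the event list are assumptions:
vtune -collect-with runsa -analyze-kvm-guest -kvm-guest-kallsyms=/home/vtune/guest_mount/proc/kallsyms -kvm-guest-modules=/home/vtune/guest_mount/proc/modules -knob event-config=<event_list> -- <target>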
See Also
knob
ftrace-config
analyze-kvm-guest
kvm-guest-kallsyms
vtune Actions
limit
Set the number of top items to include in a report.
Syntax
-limit <value>
Arguments
Actions Modified
report
Description
Use the limit action-option when you only want to include the top items in a report, and specify the number
of items (program units) to include.
Example
Output a Hotspots report on the ten modules with the highest CPU time values.
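A possible command line (grouping by module so that the limit applies to modules):
vtune -report hotspots -group-by module -limit 10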
See Also
vtune Command Syntax
vtune Actions
loop-mode
Show or hide loops in the stack.
IDE Equivalent
Toolbar: Filter > Loop Mode drop-down menu
Syntax
-loop-mode=<value>
Arguments
loop-only Display loops as regular nodes in the tree. Loop name consists of:
• start address of the loop
• number of the code line where this loop is created
• name of the function where this loop is created
Default
Actions Modified
report
Description
Use the loop-mode option when performing data collection, finalization or importation, to set loop view for
the result or report. You can also use this option with the report action to override the project-level setting
for viewing a hierarchy of the loops in your application call tree.
Example
This command displays the data collected during the Hotspots analysis in the callstack report that is filtered
to show loops only:
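A possible command line (the result directory name is a placeholder):
vtune -report callstacks -r r001hs -loop-mode loop-only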
[Loop@0x7dea03b7 in func@0x7dea0392]
ntdll.dll 0.002
[Loop@0x7dea03a6 in func@0x7dea0392]
ntdll.dll 0.002
[Outside any loop]
[Unknown] 0
[Loop@0x1400147f0 in func@0x140014782]
mfeapfk.sys 0.001
[Outside any loop]
[Unknown] 0.001
[Loop@0x14001a111 in func@0x14001a0c0]
mfeapfk.sys 0.001
[Loop@0x14001a100 in func@0x14001a0c0]
mfeapfk.sys 0.001
[Outside any loop]
[Unknown] 0
[Loop@0x1402d0329 in func@0x1402d02af]
ntoskrnl.exe 0.001
[Outside any loop]
[Unknown] 0.001
See Also
Analyze Loops
mrte-mode
Specify managed profiling mode for Java*, Python*,
Go*, .NET*, and Windows* Store applications.
Syntax
-mrte-mode <value>
Arguments
<value> Profiling mode for the managed code. Possible values are:
Actions Modified
collect, collect-with
Description
Use the mrte-mode option to specify one of the following Microsoft* run-time environment profiling modes:
auto, native, mixed, or managed.
Example
Collect hotspots data on native code only for a Windows sample application:
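For example (myApp.exe is a placeholder):
vtune -collect hotspots -mrte-mode native -- myApp.exe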
See Also
Managed Code Targets
in GUI
Java* Code Analysis from the Command Line
vtune Actions
no-follow-child
Syntax
-no-follow-child
-follow-child
Actions Modified
collect
Description
Use the no-follow-child action-option when you want to exclude child processes from collect action
data collection and analysis. This option is recommended when profiling an application launched by a script.
Example
In this example, only the myApp Linux* application will be profiled. No information will be collected about any
child processes initiated by myApp.
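A possible command line:
vtune -collect hotspots -no-follow-child -- ./myApp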
See Also
Run Command Line Analysis
vtune Actions
no-summary
Suppress summary report generation.
Syntax
-no-summary
-summary
Actions Modified
collect
Description
When performing certain actions, such as collect or collect-with, a Summary is generated and sent to stdout
by default. To suppress this, use the no-summary option when performing data collection. This can save time
and system resources when analyzing large applications.
Example
This example runs the Hotspots analysis for the sample application without generating a summary report.
On Windows*:
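A possible command line (myApp.exe is a placeholder):
vtune -collect hotspots -no-summary -- myApp.exe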
See Also
report
option
Summary Report
vtune Actions
no-unplugged-mode
Enable collection from an unplugged Android* device
to exclude ADB connection and power impact on the
results.
GUI Equivalent
Analyze detached device option in the WHAT: Analysis Target pane
Syntax
-no-unplugged-mode
-unplugged-mode
Actions Modified
collect, collect-with
Description
The unplugged-mode option enables collection on an unplugged Android device to exclude ADB connection
and power supply impact on the results. When this option is used, you configure and launch an analysis from
the host but data collection starts after disconnecting the device from the USB cable or a network. Collection
results are automatically transferred to the host as soon as you plug the device back in.
Example
This command configures Hotspots analysis for the application on an Android system that will be launched
after disconnecting the device from the USB cable or a network:
See Also
Android* Target Analysis from the Command Line
vtune Actions
quiet
Syntax
-quiet
-q
Actions Modified
collect, finalize, report, version
Description
Use the quiet option to limit the amount of information displayed by vtune. Only error, fatal error, and
warning messages are displayed when this option is used.
Example
This example suppresses unimportant messages while running the Hotspots analysis of the Linux* sample
application and generating the default summary report.
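For example (myApp is a placeholder):
vtune -collect hotspots -quiet -- ./myApp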
See Also
vtune Actions
report
Generate a specified type of report from an analysis
result.
GUI Equivalent
Viewpoint
Syntax
-report <report_name>
-R <report_name>
Arguments
Argument Description
Value Description
callstacks Report full stack data for each hotspot function; identify the impact
of each stack on the function CPU or Wait time. You can use the
group-by or filter options to sort the data by:
• callstack
• function
• function-callstack
top-down Report call sequences (stacks) detected during collection phase,
starting from the application root (usually, the main() function). Use
this report to see the impact of program units together with their
callees.
gprof-cc Report a call tree with the time (CPU and Wait time, if available)
spent in each function and its children.
Modifiers
call-stack-mode, csv-delimiter, cumulative-threshold-percent, discard-raw-data, filter,
format, group-by, inline-mode, limit, quiet, report-output, result-dir, search-dir,
source-search-dir, source-object, verbose, time-filter, loop-mode, column
Description
Use the report action to generate a report from an existing result. The report type must be compatible with
the analysis type used in the collection.
By default, your report is written to stdout. If you want to save it to a file, use the report-output action-
option.
Both short names and long names are case-sensitive. For example, -R is the short name of the report
action, and -r is the short name of the result-dir action-option.
NOTE
To get the list of available report types, use the vtune -help report command.
To display help for a specific report type, use vtune -help report <report_name>, where
<report_name> is the type of report that you want to create.
Example
In this pair of examples, a collect action is used to perform a hotspots analysis for the Linux* sample
target and write the result to the current working directory. The second command uses the report action to
generate a hotspots report from the most recent result and write it to stdout.
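A possible pair of commands (myApp is a placeholder):
vtune -collect hotspots -- ./myApp
vtune -report hotspots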
See Also
report-output
option
Save and Format Command Line Reports
report-knob
Set configuration options for the specified report type.
Syntax
-report-knob <knobName>=<knobValue>
Arguments
<knobName> <knobValue> Supported Report Description
NOTE
This knob is available
only for the HPC
Performance
Characterization analysis
report.
Actions Modified
report
Description
Use the -report-knob action-option to configure knob settings for a report action.
Example
This example generates the summary report for the HPC Performance Characterization analysis result and
skips issue descriptions.
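A command of roughly this form produces output like the one below (it assumes a show-issues report knob and an illustrative result directory name):
vtune -report summary -report-knob show-issues=false -r ./r000hpc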
Collection and Platform Info
Application Command Line: ./sp.B.x
User Name: vtune
Operating System: 3.10.0-327.el7.x86_64 NAME="Red Hat Enterprise Linux Server" VERSION="7.2 (Maipo)" ID="rhel" ID_LIKE="fedora" VERSION_ID="7.2" PRETTY_NAME="Red Hat Enterprise Linux Server 7.2 (Maipo)" ANSI_COLOR="0;31" CPE_NAME="cpe:/o:redhat:enterprise_linux:7.2:GA:server" HOME_URL="https://www.redhat.com/" BUG_REPORT_URL="https://bugzilla.redhat.com/" REDHAT_BUGZILLA_PRODUCT="Red Hat Enterprise Linux 7" REDHAT_BUGZILLA_PRODUCT_VERSION=7.2 REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux" REDHAT_SUPPORT_PRODUCT_VERSION="7.2"
Computer Name: test
Result Size: 1 GB
Collection start time: 19:04:30 13/07/2016 UTC
Collection stop time: 19:04:53 13/07/2016 UTC
CPU
Name: Intel(R) Xeon(R) E5/E7 v2 Processor code named Ivytown
Frequency: 2.694 GHz
Logical CPU Count: 24
This example generates the summary report for the HPC Performance Characterization analysis result and
shows issue descriptions.
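Correspondingly, a command like this (same assumptions as above) keeps the issue descriptions:
vtune -report summary -report-knob show-issues=true -r ./r000hpc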
See Also
vtune Command Syntax
vtune Actions
report-output
Write a generated report to a file.
Syntax
-report-output <pathname>
Arguments
Argument Description
<dir> Name of the directory if you are writing multiple report files
Actions Modified
report
Description
Use the report-output action-option to write a report to a file.
NOTE
If you specify a .csv file, use the csv-delimiter option to specify which delimiter you want to
use in the report.
Example
This example generates a wait-time report for the r001tr Threading analysis result and saves it in the /home/text/report.txt file.
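A command along these lines would do it (it assumes the wait-time report type named above):
vtune -report wait-time -r r001tr -report-output /home/text/report.txt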
See Also
vtune Command Syntax
vtune Actions
report-width
Set the maximum width for a report
Syntax
-report-width <double>
Arguments
Default
None
Actions Modified
report
Description
If a report is too wide to view or print properly, use the report-width option to limit the number of
characters per line.
Example
Output a hotspots report from the most recent result as a text file with a maximum width of 60 characters
per line.
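One way to express this (the output file name is illustrative):
vtune -report hotspots -report-width 60 -format text -report-output hotspots_report.txt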
See Also
vtune Command Syntax
vtune Actions
result-dir
Specify the result directory.
Syntax
-result-dir <PATH>
-r <PATH>
Arguments
Argument Description
Actions Modified
collect, collect-with, finalize, import, report
Description
Use the result-dir option to specify the result directory. If you specify the result directory for collection or
to import results from other projects, you should also specify the result directory for any actions that use this
result, such as report. Specifying the result directory when using the finalize action is highly
recommended.
If you want to specify the result directory name, you can use the auto-incremented counter pattern @@@ with
a prefix and/or suffix.
For example, you could use the prefix myResult- and the usual analysis type suffix like this: myResult-@@@{at}. If you then perform a memory error analysis, followed by a threading error analysis, specifying -result-dir myResult-@@@{at} each time, the result directories would be assigned the following names: myResult-000mi1 and myResult-001ti2.
Both short names and long names are case-sensitive. For example, -R is the short name of the report
action, and -r is the short name of the result-dir action-option.
Example
This example starts the Threading analysis of the myApplication application and saves the results in the
baseline result directory.
On Linux*:
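For example (the application path is illustrative):
vtune -collect threading -result-dir baseline -- ./myApplication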
See Also
Specify Result Directory from Command Line
vtune Actions
resume-after
Resume collection after the specified number of
seconds.
Syntax
-resume-after <value>
Arguments
Argument Description
<value> The number of seconds that should elapse before data collection is
resumed. Fractions of seconds are possible, for example: 1.56 for 1 sec 560
msec.
Actions Modified
collect
Description
Use the resume-after option with the start-paused option to automatically exit paused mode after the
specified number of seconds has elapsed.
Example
This example starts a Linux* sample application in paused mode and resumes the Hotspots analysis in 5
seconds.
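A command along these lines would do it (the application path is illustrative):
vtune -collect hotspots -start-paused -resume-after 5 -- ./myApplication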
See Also
vtune Command Syntax
vtune Actions
return-app-exitcode
Return the exit code of the target.
Syntax
-return-app-exitcode
Actions Modified
collect
Description
Use the return-app-exitcode option to return the exit code of the target rather than that of the vtune tool.
Example
This example runs the Threading analysis for the sample Linux* application, generates a default summary
report, and returns the exit code of the sample application.
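For instance (the application path is illustrative):
vtune -collect threading -return-app-exitcode -- ./myApplication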
See Also
vtune Command Syntax
vtune Actions
ring-buffer
Limit the amount of raw data collected by keeping only the data gathered during the last N seconds before the target or the collection is terminated.
GUI Equivalent
Configure Analysis window >WHAT pane > Advanced > Limit collected data by: Time from
collection end, sec option
Syntax
-ring-buffer=<integer>
Arguments
Actions Modified
collect, collect-with
Description
Use the ring-buffer action-option to limit the amount of raw data to be collected. The option sets the timer
(in sec) that enables the analysis only for the last seconds before the target or collection is terminated.
Alternate Options
Example
Enable a Hotspots analysis for the last 10 seconds before the collection is terminated.
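For example (the application path is illustrative):
vtune -collect hotspots -ring-buffer=10 -- ./myApplication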
See Also
Limit Data Collection
data-limit
action-option
vtune Command Syntax
vtune Actions
search-dir
Specify a search directory for binary and symbol files.
Syntax
-search-dir DIR
Arguments
Actions Modified
collect, finalize, import
Description
This option specifies search directories for binary and symbol files. It is often used in conjunction with the
finalize action to re-finalize a result when a symbol file was missing during collection. It is also used if you
import results from another system.
During data collection, the result directory is set as the default search directory for the collected result.
If you import results from another system, specify additional search directories for system modules. To show
correct results, the vtune tool requires the same modules that were used for data collection. To ensure the
Intel® VTune™ Profiler takes the right module, copy the original system modules to your system.
Alternate Options
Examples
When your binary and symbol files are in multiple directories, use the search-dir option multiple times so that all the necessary directories are searched.
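For instance, a re-finalization along these lines (the result directory and search paths are illustrative):
vtune -finalize -r ./r000hs -search-dir /usr/lib/debug -search-dir ./mybuild/bin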
See Also
source-search-dir
action-option
vtune Command Syntax
vtune Actions
Search Directories
Finalization
show-as
Syntax
-show-as samples | events | percent
Arguments
Argument Description
samples Show the total number of samples collected for each event in the
viewpoints provided for the hardware event-based sampling data collection.
events Show the number of times the event occurred during sampling data
collection. VTune Profiler determines this value by applying the following
formula for each event: < Event name > samples * Sample After
value.
percent Show the percentage of samples collected for the event. This value is calculated using the following formula: (Number of samples collected for the event / Total number of samples collected) x 100.
Actions Modified
report
Description
Choose the data format for displaying results collected during hardware event-based sampling.
Example
Generate a hardware events report for the result collected during a hotspots analysis and show as a
percentage of events.
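A command of this form would do it (the result directory name is illustrative):
vtune -report hw-events -show-as percent -r ./r000hs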
See Also
vtune Actions
sort-asc
Syntax
-sort-asc <string>
-s <string>
Arguments
Argument Description
Actions Modified
report
Description
Use the sort-asc option with the report action to sort data by the specified column name in ascending
order. Each column name corresponds to a performance metric or event.
You can specify multiple values as a comma-separated string (no spaces).
Example
This example sorts the data collected in the r001ue result and displayed in the Hardware Events report in
the ascending order by the INST_RETIRED.ANY and CPU_CLK_UNHALTED.CORE event columns.
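A sketch of the command (the column names may need to match the exact report column headers):
vtune -report hw-events -r r001ue -sort-asc INST_RETIRED.ANY,CPU_CLK_UNHALTED.CORE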
See Also
Generate Command Line Reports
Reference
sort-desc
Syntax
-sort-desc <string>
-S <string>
Arguments
Argument Description
Actions Modified
report
Description
Use the sort-desc option with the report action to sort data by the specified column name in descending
order. Each column name corresponds to a performance metric or event.
You can specify multiple values as a comma-separated string (no spaces).
Example
Sort the data collected in the r001ue result and displayed in the Hardware Events report in the descending
order by the INST_RETIRED.ANY and CPU_CLK_UNHALTED.CORE event columns.
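Similarly, a sketch of the command (the column names may need to match the exact report column headers):
vtune -report hw-events -r r001ue -sort-desc INST_RETIRED.ANY,CPU_CLK_UNHALTED.CORE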
See Also
Generate Command Line Reports
Reference
source-object
Type of source object to display in a report for source
or assembly data.
Syntax
-source-object <object_type> [=]<value>
Arguments
Argument Description
Actions Modified
report with either hw-events or hotspots report type.
Description
Use the source-object option to switch the report to source or assembly view mode, including associated
performance data. To define a particular object, you can specify this option more than once. For example, if
two modules each have a function named foo, VTune Profiler will throw an error unless you specify both the
module and function.
Tip
By default, source view is displayed. Specify group-by address to see disassembly view with
associated performance data.
Examples
Generate a hardware events report that displays source data for the foo function. Since the result directory
is not specified, the most recent hardware analysis result in the current working directory is used.
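A command of this form would do it (foo is the function named above):
vtune -report hw-events -source-object function=foo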
See Also
filter
vtune Actions
source-search-dir
Specify a search directory for source files.
Syntax
-source-search-dir DIR
Arguments
Argument Description
Actions Modified
report
Description
This option specifies search directories for source files. Use this option to specify the location of source files required to provide a correct source view report with the source-object option.
During data collection, the result directory is set as the default search directory for the collected result.
Alternate Options
Example
This command opens the source view with the hotspots performance metrics for the foo function and uses
the directory to search for source files.
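A command along these lines would do it (the source directory is illustrative):
vtune -report hotspots -source-object function=foo -source-search-dir /home/user/src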
See Also
search-dir
action-option
source-object
action-option
vtune Command Syntax
vtune Actions
Search Directories
stack-size
Specify the size of a raw stack (in bytes) to process.
Syntax
-stack-size=<value in bytes>
Arguments
Possible <value>: numbers between 0 and 2147483647
Default
NOTE
For driverless sampling collection, the default value is 1024 bytes.
Actions Modified
collect-with
Description
When you configure a custom hardware event-based sampling collection, you can reduce the collection overhead by limiting the stack size (in bytes) that VTune Profiler processes, using the -stack-size option.
Example
This example configures and runs a custom event-based sampling data collection with the stack size limited
to 8192 bytes:
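A sketch of such a run, assuming the runsa custom collector and its enable-stack-collection knob (the application path is illustrative):
vtune -collect-with runsa -knob enable-stack-collection=true -stack-size=8192 -- ./myApplication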
See Also
vtune Command Syntax
vtune Actions
start-paused
Start data collection in the paused mode.
Syntax
-start-paused
Actions Modified
collect with one of the user-mode sampling analysis types
Description
This option starts the data collection in the paused mode.
Collection resumes when pause/resume API calls in the target code are reached, when the command action
is used with the resume argument, or if the resume-after option is used, when the specified time has
elapsed.
Example
This example starts the hotspots analysis of the sample application in the paused mode.
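For example (the application path is illustrative):
vtune -collect hotspots -start-paused -- ./myApplication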
See Also
resume-after
option
vtune Command Syntax
vtune Actions
strategy
Specify which processes to analyze.
Syntax
-strategy <process_name1>:<profiling_mode>,<process_name2>:<profiling_mode>,...
Arguments
Argument Description
<process_name> The name of the process to which the strategy
configuration applies. If <process_name> is empty,
the strategy configuration applies by default to all
processes for which a profiling strategy is not
specified.
<profiling_mode> The strategy for profiling the specified process.
Possible values are:
Value Description
trace:trace Collect data on the process, and its child processes.
notrace:trace Do not analyze the process, but collect data on its
child processes.
notrace:notrace Ignore the process, and its child processes, while
collecting data.
trace:notrace Analyze the process, but do not collect data on its
child processes.
Actions Modified
collect, collect-with
Description
Use the strategy action-option to specify which processes to analyze, and which to ignore.
Example
This example performs a Hotspots analysis where the strategy configuration limits data collection to the
example process, and ignores its child processes.
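A sketch of such a command (the process and application names are illustrative):
vtune -collect hotspots -strategy example:trace:notrace -- ./example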
See Also
vtune Command Syntax
vtune Actions
target-install-dir
Specify a path to the VTune Profiler target package
installed on the remote system.
Syntax
-target-install-dir=<string>
Arguments
<string> Path to the product installed on a remote Linux system. If the product
is installed to the default location, this option is configured
automatically.
Default
/opt/intel/vtune_profiler_<version>
Actions Modified
collect, collect-with
Description
VTune Profiler supports command line analysis of applications running on a remote Linux or Android system
(target) when the required product components are installed on both the host and the target system.
Example
This command runs Hotspots analysis with stacks for a Linux application and specifies a path to the remote
version of the VTune Profiler installed to a non-default location:
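A sketch of such a command, assuming an SSH target specification and illustrative host and paths:
vtune -collect hotspots -target-system=ssh:user@targetdevice -target-install-dir=/home/user/intel/vtune_profiler -- /home/user/myApplication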
See Also
Set Up Remote Linux* Target
vtune Actions
target-system
Collect data on a remote machine using SSH/ADB
connection.
Syntax
-target-system=<string>
Arguments
get-perf-cmd:<pmu_type>    Instead of running the collection directly, generate a collection command line for the target, using the PMU type reported by the sep -platform-list command (see the example below). Use this argument when:
• You do not have an SSH connection to the target machine.
• You cannot install VTune Profiler on the target machine, for security reasons.
NOTE
The Linux Perf* tool (driverless collection) supports complex event names that contain .:= symbols in v4.18 and newer versions.
Actions Modified
collect, collect-with
Description
Intel® VTune™ Profiler enables you to analyze applications running on a remote Linux system or Android device
(target system) using the VTune Profiler command line interface (vtune) installed on the host system
(remote usage mode). Use the target-system option to specify your target system and enable remote data
collection.
For details, see Linux* System Setup for Remote Analysis and Android* System Setup.
Example
This command runs Hotspots analysis in the hardware event-based mode for the application on a Linux
embedded system:
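A sketch of the SSH form of the command (the host and application path are illustrative):
vtune -collect hotspots -knob sampling-mode=hw -target-system=ssh:user@targetdevice -- /home/user/myApplication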
Alternatively, when there is no SSH connection to the target, identify the PMU type with the sep tool and pass it to the get-perf-cmd argument:
$ sep -platform-list
...
Platform: 111, PMU: skylake_server, Signature: 0x50650, CPU name: Intel(R) Xeon(R) Processor
code named Skylake
...
$ vtune --collect uarch-exploration --target-system=get-perf-cmd:skylake_server
This command runs Hotspots analysis in the user-mode sampling mode for the application on an Android
system:
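A sketch of the Android form (the application path is illustrative; a specific device can usually be addressed as android:<device_id>):
vtune -collect hotspots -knob sampling-mode=sw -target-system=android -- /data/local/tmp/myAndroidApp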
See Also
Set Up Remote Linux* Target
Android* Targets
vtune Actions
target-tmp-dir
Specify a path to the temporary directory on the
remote system where performance results are
temporarily stored.
Syntax
-target-tmp-dir=<string>
Arguments
Default
/tmp
Actions Modified
collect, collect-with
Description
VTune Profiler supports command line analysis of applications running on a remote Linux system (target)
when the required product components are installed on both the host and the target system.
Example
This command runs Hotspots analysis with stacks for a Linux application and specifies a non-default temporary location on the remote system:
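A sketch of such a command (the target host and paths are illustrative):
vtune -collect hotspots -target-system=ssh:user@targetdevice -target-tmp-dir=/home/user/tmp -- /home/user/myApplication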
See Also
Temporary Directory for Performance Results on Linux* Targets
vtune Actions
target-duration-type
Adjust the sampling interval for longer-running
targets.
Syntax
-target-duration-type veryshort | short | medium | long
Arguments
Actions Modified
collect, collect-with
Description
If your target runs 15 minutes or longer, or if it runs less than one minute, use the target-duration-type action-option to set a different duration type. The collect or collect-with action uses this value to adjust
the sampling interval, which determines how much data is collected. For longer-running targets, the
sampling interval is greater (less frequent) to reduce the amount of collected data. For very short-running
targets, the sampling interval is smaller (more frequent). For hardware event-based analysis types, a
multiplier applies to the configured Sample After value.
NOTE
This option is deprecated. Use the -knob sampling-interval option instead.
Example
Perform a Hotspots analysis using a medium sampling interval that is appropriate for targets with a duration
of 15 minutes to 3 hours.
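For example (the application path is illustrative):
vtune -collect hotspots -target-duration-type medium -- ./myApplication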
See Also
Manage Analysis Duration from Command Line
Sampling Interval
vtune Actions
target-pid
Attach a collection to a running process specified by
the process ID.
Syntax
-target-pid <value>
Arguments
ID of the process that you want to analyze.
Actions Modified
collect, collect-with
Description
Use the target-pid option to attach a collect or collect-with action to a running process specified by
its process ID (pid).
Alternate Options
The target-process option provides the same capabilities, but uses the process name to specify the process.
Example
Attach a hotspots collection to a running process whose ID is 1234.
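For example:
vtune -collect hotspots -target-pid 1234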
See Also
vtune Actions
target-process
Attach a collection to a running process specified by
the process name.
Syntax
-target-process <string>
Arguments
A string containing the name of the process to profile.
Actions Modified
collect, collect-with
Description
Use the target-process option to attach a collect or collect-with action to a running process specified
by the process name.
Alternate Options
The target-pid option provides the same capabilities, but uses the process ID to specify the process.
Example
In this example, a Hotspots analysis is attached to the myApp process, which is already running on the
system.
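For example:
vtune -collect hotspots -target-process myApp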
See Also
vtune Command Syntax
vtune Actions
time-filter
Filter reports by a time range.
IDE Equivalent
Pane: Timeline
Syntax
-time-filter=<value>
Arguments
Default
OFF    By default, vtune reports display data for the full analysis duration.
Actions Modified
report
Description
Use the time-filter option to filter the report and display data for the specified time range only. For
example, -time-filter=2.3:5.4 reports data collected from 2.3 seconds to 5.4 seconds of Elapsed Time.
Examples
vtune -R hotspots -time-filter=2.3:5.4
See Also
Run Command Line Analysis
trace-mpi
For Message Passing Interface (MPI) analysis, configure collectors to determine the MPI rank ID when a non-Intel MPI library implementation is used.
Syntax
-trace-mpi | -no-trace-mpi
Default
-no-trace-mpi
Actions Modified
collect, collect-with
Description
Based on the PMI_RANK or PMI_ID MPI analysis environment variable (whichever is set), the VTune Profiler
extends a process name with the captured rank number, which helps differentiate ranks in a VTune Profiler result with multiple ranks. The process naming schema in this case is <process_name> (rank <N>).
Use the -trace-mpi option to enable detecting an MPI rank ID for MPI implementations that do not provide
the environment variable.
Examples
This command runs the Hotspots analysis type (hardware event-based sampling mode) with enabled MPI
rank ID detection.
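A sketch of such a launch (the MPI launcher, rank count, and application are illustrative):
mpirun -n 4 vtune -collect hotspots -knob sampling-mode=hw -trace-mpi -- ./myMpiApp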
See Also
MPI Code Analysis
user-data-dir
Specify the base directory for result paths.
Syntax
-user-data-dir <PATH>
Arguments
A string containing the PATH/name of the user data directory.
Actions Modified
collect, finalize, import
Description
Use the user-data-dir action-option with the result-dir action-option when you want to specify a base
directory for results.
Example
This example runs a Threading analysis of the sample Linux application and creates the default-named result
directories under the myresults directory.
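For example (the application path is illustrative):
vtune -collect threading -user-data-dir ./myresults -- ./myApplication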
See Also
vtune Command Syntax
vtune Actions
result-dir
option
Manage Result Files
from GUI
verbose
Display detailed information on actions performed by
the vtune tool.
Syntax
-verbose
-v
Description
Use the verbose option when you want to see detailed information on the actions performed by the vtune
command.
Example
This example displays detailed information while running a Hotspots analysis.
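For example (the application path is illustrative):
vtune -collect hotspots -verbose -- ./myApplication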
See Also
quiet
option
vtune Command Syntax
vtune Actions
version
Display version information for the vtune tool.
Syntax
-version
-V
Description
This action displays version information for the Intel® VTune™ Profiler and the vtune command.
Example
This example shows version information for the Intel® VTune™ Profiler and the vtune command.
vtune -version
See Also
vtune Command Syntax
vtune Actions
Introduction
Crash Report Actions
Action    Argument    Description
create-bug-report    <PATH>: pathname for the bug report.    Package the following into a bug report package: product log files, system information, crash reports for each running product process, and product installation details.
list-crash-report    None    Output a list of existing bug reports.
report-system-info    None    Output system information.
send-crash-report    <PATH>: pathname for the bug report.    Email the specified bug/crash report(s) to the Intel Customer Support Team.
Examples
This command generates a bug report package and stores it in a compressed file under the name you
specify, such as 001bug.
amplxe-feedback -create-bug-report=001bug
This command creates a list of crash report filenames.
amplxe-feedback -list-crash-report
This command outputs system information so you can provide this information to support.
amplxe-feedback -report-system-info
This command forwards the specified bug report to the Intel Customer Support Team.
amplxe-feedback -send-bug-report=r0001b
API Support 9
Intel® VTune™ Profiler supports two kinds of APIs:
• The Instrumentation and Tracing Technology API (ITT API) provided by the Intel® VTune™ Profiler enables
your application to generate and control the collection of trace data during its execution.
• The JIT (Just-In-Time) Profiling API provides functionality to report information about just-in-time
generated code that can be used by performance tools. You need to insert JIT Profiling API calls in the
code generator to report information before JIT-compiled code goes to execution. This information is
collected at runtime and used by tools like Intel® VTune™ Profiler to display performance metrics
associated with JIT-compiled code.
The Instrumentation and Tracing Technology API (ITT API) provided by the Intel® VTune™ Profiler enables your
application to generate and control the collection of trace data during its execution.
ITT API has the following features:
• Controls application performance overhead based on the amount of traces that you collect.
• Enables trace collection without recompiling your application.
• Supports applications in C/C++ and Fortran environments on Windows*, Linux*, FreeBSD*, or Android*
systems.
• Supports instrumentation for tracing application code.
To use the APIs, add API calls in your code to designate logical tasks. These markers will help you visualize
the relationship between tasks in your code relative to other CPU and GPU tasks. To see user tasks in your
performance analysis results, enable the Analyze user tasks checkbox in analysis settings.
NOTE
The ITT API is a set of pure C/C++ functions. There are no Java* or .NET* APIs. If you need runtime
environment support, you can use a JNI, or C/C++ function call from the managed code. If the
collector causes significant overhead or data storage, you can pause the analysis to reduce the
overhead.
See Also
Task Analysis
View Instrumentation and Tracing Technology (ITT) API Task Data in Intel® VTune™ Profiler
User applications/modules linked to the static ITT API library do not have a runtime dependency on a
dynamic library. Therefore, they can be executed without Intel® VTune™ Profiler.
To use the ITT APIs, set up your C/C++ or Fortran application using the steps provided in Configuring Your
Build System.
Unicode Support
All API functions that take parameters of type __itt_char follow the Windows OS unicode convention. If
UNICODE is defined when compiling on a Windows OS, __itt_char is wchar_t, otherwise it is char. The
actual function names are suffixed with A for the ASCII APIs and W for the unicode APIs. Both types of
functions are defined in the DLL that implements the API.
Strings that are all ASCII characters are internally equivalent for both the unicode and the ASCII API
versions. For example, the following strings are equivalent:
See Also
Minimize ITT API Overhead
Task Analysis
NOTE
ITT API usage is supported on Windows*, Linux*, FreeBSD*, and Android* systems. It is not
supported for QNX* systems.
Before instrumenting your application, you need to configure your build system to be able to reach the API
headers and libraries.
For Windows* and Linux* systems:
• Add <install_dir>/sdk/include to your INCLUDE path for C/C++ applications or
<install_dir>/sdk/[lib32 or lib64] to your INCLUDE path for Fortran applications
• Add <install_dir>/sdk/lib32 to your 32-bit LIBRARIES path
• Add <install_dir>/sdk/lib64 to your 64-bit LIBRARIES path
NOTE
On Linux* systems, you have to link the dl and pthread libraries to enable ITT API functionality. Not
linking these libraries will not prevent your application from running, but no ITT API data will be
collected.
NOTE
Header and library files are available from the vtune_profiler_target_x86_64.tgz FreeBSD target
package. See Set Up FreeBSD* System for more information.
NOTE
The ITT API headers, static libraries, and Fortran modules previously located at <install_dir>/
include and <install_dir>/lib32 [64] folders were moved to the <install_dir>/sdk folder
starting the VTune Profiler 2021.1-beta08 release. Copies of these files are retained at their old
locations for backwards compatibility and these copies should not be used for new projects.
#include <ittnotify.h>
The ittnotify.h header contains definitions of ITT API routines and important macros which provide the
correct logic of API invocation from a user application.
The ITT API is designed to incur almost zero overhead when tracing is disabled. But if you need fully zero
overhead, you can compile out all ITT API calls from your application by defining the
INTEL_NO_ITTNOTIFY_API macro in your project at compile time, either on the compiler command line, or
in your source file, prior to including the ittnotify.h file.
C/C++ example:
__itt_pause();
Fortran example:
USE ITTNOTIFY
CALL ITT_PAUSE()
For more information, see Instrumenting Your Application.
Link the libittnotify.a (Linux*, Android*, FreeBSD*) or libittnotify.lib (Windows*) Static Library to
Your Application
You need to link the static library, libittnotify.a (Linux*, FreeBSD*, Android*) or libittnotify.lib
(Windows*), to your application. If tracing is enabled, this static library loads the ITT API implementation and
forwards ITT API instrumentation data to VTune Profiler. If tracing is disabled, the static library ignores ITT
API calls, causing nearly zero instrumentation overhead.
After you instrument your application by adding ITT API calls to your code and link the libittnotify.a
(Linux*, FreeBSD*, Android*) or libittnotify.lib (Windows*) static library, your application will check
the INTEL_LIBITTNOTIFY32 or the INTEL_LIBITTNOTIFY64 environment variable, depending on your
application's architecture. If that variable is set, it will load the libraries defined in the variable.
Make sure to set these environment variables for the ittnotify_collector to enable data collection:
On Windows*:
INTEL_LIBITTNOTIFY32=<install-dir>\bin32\runtime\ittnotify_collector.dll
INTEL_LIBITTNOTIFY64=<install-dir>\bin64\runtime\ittnotify_collector.dll
On Linux*:
INTEL_LIBITTNOTIFY32=<install-dir>/lib32/runtime/libittnotify_collector.so
INTEL_LIBITTNOTIFY64=<install-dir>/lib64/runtime/libittnotify_collector.so
On FreeBSD*:
INTEL_LIBITTNOTIFY64=<target-package>/lib64/runtime/libittnotify_collector.so
See Also
Basic Usage and Configuration
NOTE
The variables should contain the full path to the library without quotes.
Example
On Linux*:
export INTEL_LIBITTNOTIFY32=/opt/intel/oneapi/vtune/latest/lib32/runtime/libittnotify_collector.so
export INTEL_LIBITTNOTIFY64=/opt/intel/oneapi/vtune/latest/lib64/runtime/libittnotify_collector.so
On FreeBSD:
export INTEL_LIBITTNOTIFY64=<target-package>/lib64/runtime/libittnotify_collector.so
NOTE You may need to change the path to reflect the placement of the FreeBSD target package on
your target system.
See Also
Set Up Analysis Target
Example
The following sample shows how four basic ITT API functions are used in a multithreaded application:
• Domain API
• String Handle API
• Task API
• Thread Naming API
#include <windows.h>
#include <ittnotify.h>

// Forward declaration of the worker thread routine and the shared "done" flag.
// (These declarations and the main() body are filled in here so the fragment is
// complete; the ITT API calls below are unchanged from the original fragment.)
DWORD WINAPI workerthread(LPVOID data);
static volatile bool g_done = false;

// Create a domain that is visible globally.
__itt_domain* domain = __itt_domain_create("Example.Domain.Global");
// Create a string handle associated with the "main" task.
__itt_string_handle* handle_main = __itt_string_handle_create("main");

int main()
{
    // Create a task associated with the "main" routine.
    __itt_task_begin(domain, __itt_null, __itt_null, handle_main);
    // Start a few worker threads that run the instrumented work loop below.
    for (int i = 0; i < 4; i++)
        ::CreateThread(NULL, 0, workerthread, (LPVOID)(INT_PTR)i, 0, NULL);
    // Wait a while,...
    ::Sleep(5000);
    g_done = true;
    // Mark the end of the main task
    __itt_task_end(domain);
    return 0;
}
// Create string handle for the work task.
__itt_string_handle* handle_work = __itt_string_handle_create("work");
DWORD WINAPI workerthread(LPVOID data)
{
    // Set the name of this thread so it shows up in the UI as something meaningful
    char threadname[32];
    wsprintf(threadname, "Worker Thread %d", data);
    __itt_thread_set_name(threadname);
    // Each worker thread does some number of "work" tasks
    while(!g_done)
    {
        __itt_task_begin(domain, __itt_null, __itt_null, handle_work);
        ::Sleep(150);
        __itt_task_end(domain);
    }
    return 0;
}
See Also
Basic Usage and Configuration
Domain API
String Handle API
Task API
Conditional Compilation
For best performance in the release version of your code, use conditional compilation to turn off annotations.
Define the macro INTEL_NO_ITTNOTIFY_API before you include ittnotify.h during compilation to
eliminate all __itt_* functions from your code.
You can also remove the static library from the linking stage by defining this macro.
    // Do some work here...
    __itt_task_end(detailed);
}

// This is my entry point.
int main(int argc, char** argv)
{
    if (argc < 2)
    {
        // Disable detailed domain if we do not need tracing from that in this
        // application run
        detailed->flags = 0;
    }
    MyFunction(atoi(argv[1]));
}
See Also
Basic Usage and Configuration
View Instrumentation and Tracing Technology (ITT) API Task Data in Intel® VTune™ Profiler
User task and API data can be visualized in Intel®
VTune™ Profiler performance analysis results.
After you have added basic annotations to your application to control performance data collection, you can
view these annotations in the Intel VTune Profiler timeline. All supported instrumentation and tracing
technology (ITT) API tasks can be visualized in VTune Profiler.
Use the following steps to include ITT API tasks in your performance analysis collection:
1. Click the Configure Analysis button on the Intel® VTune™ Profiler toolbar (in the standalone GUI or the Visual Studio* IDE).
The Configure Analysis window opens.
2. Set up the analysis target in the WHERE and WHAT panes.
3. From the HOW pane, click the Browse button and select an analysis type. For more information about each analysis type, see Performance Analysis Setup.
4. Select the Analyze user tasks, events, and counters checkbox to view the API tasks, counters, and
events that you added to your application code.
NOTE
In some cases, the Analyze user tasks, events, and counters checkbox is in the expandable
Details section. To enable the checkbox, use the Copy button at the top of the tab to create an
editable version of the analysis type configuration. For more information, see Custom Analysis .
• Grid view: Set the Grouping to Task Domain / Task Type / Function / Call Stack or Task Type /
Function / Call Stack to view task data in the grid pane.
• Platform tab: Individual tasks are available in a larger view on the Platform tab. Hover over a task to get
more information.
See Also
Instrumentation and Tracing Technology APIs
Task Analysis
Domain API
A domain enables tagging trace data for different modules or libraries in a program. Domains are specified
by unique character strings, for example TBB.Internal.Control.
Each domain is represented by an opaque __itt_domain structure, which you can use to tag each of the ITT
API calls in your code.
You can selectively enable or disable specific domains in your application, in order to filter the subsets of
instrumentation that are collected into the output trace capture file. To disable a domain, set its flags field to 0. This disables tracing for a particular domain while keeping the rest of the code unmodified. The
overhead of a disabled domain is a single if check.
To create a domain, use the following primitives:
__itt_domain *ITTAPI __itt_domain_create(const char *name)
For a domain name, the URI naming style is recommended, for example,
com.my_company.my_application. The set of domains is expected to be static over the application's
execution time, therefore, there is no mechanism to destroy a domain.
Any domain can be accessed by any thread in the process, regardless of which thread created the domain.
This call is thread-safe.
Parameters of the primitives:
Usage Example
#include "ittnotify.h"
// A minimal use of the primitive described above; the domain name is illustrative
// and follows the recommended URI naming style.
__itt_domain* domain = __itt_domain_create("com.my_company.my_application");
See Also
Basic Usage and Configuration
Instrument Your Application
Minimize ITT API Overhead
See Also
Basic Usage and Configuration
Minimize ITT API Overhead
void __itt_pause(void)
    Run the application without collecting data. VTune Profiler reduces the overhead of collection by collecting only critical information, such as thread and process creation.
void __itt_resume(void)
    Resume data collection. VTune Profiler resumes collecting all data.
void __itt_detach(void)
    Detach data collection. VTune Profiler detaches all collectors from all processes. Your application continues to work but no data is collected for the running collection.
Pausing the data collection has the following effects:
• Data collection is paused for the whole program, not only within the current thread.
• Some runtime analysis overhead reduction.
• The following APIs are not affected by pausing the data collection:
• Domain API
• String Handle API
• Thread Naming API
• The following APIs are affected by pausing the data collection. Data is not collected for these APIs while in
paused state:
• Task API
• Frame API
• Event API
• User-Defined Synchronization API
NOTE
A reasonable Pause/Resume API call frequency is about 1 Hz. Because this operation pauses and resumes data collection in all processes in the analysis run and sends the corresponding collection state notification to the GUI, calling it frequently for small workloads is not recommended.
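As a minimal sketch of these calls in C (the workload functions are illustrative placeholders):
#include <ittnotify.h>

static void uninteresting_phase(void) { /* e.g. setup we do not want to profile */ }
static void interesting_work(void)    { /* the code region we want to analyze */ }

int main(void)
{
    __itt_pause();           /* exclude the setup phase from the collected data */
    uninteresting_phase();
    __itt_resume();          /* collect everything from here on */
    interesting_work();
    __itt_detach();          /* optionally stop collection for the rest of the run */
    return 0;
}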
See Also
Basic Usage and Configuration
View Instrumentation and Tracing Technology (ITT) API Task Data in Intel® VTune™ Profiler
Usage Example
You can use the following thread naming example to give a meaningful name to the thread you wish to focus
on and ignore the service thread.
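A minimal sketch of that idea (thread creation is omitted; the names and functions are illustrative):
#include <ittnotify.h>

void worker_thread_entry(void)
{
    /* Give this thread a meaningful name in the analysis results. */
    __itt_thread_set_name("Data Processing Thread");
    /* ... do the interesting work ... */
}

void service_thread_entry(void)
{
    /* Exclude this service thread from the analysis results. */
    __itt_thread_ignore();
    /* ... housekeeping ... */
}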
See Also
Basic Usage and Configuration
Task API
A task is a logical unit of work performed by a particular thread. Tasks can nest; thus, tasks typically
correspond to functions, scopes, or a case block in a switch statement. You can use the Task API to assign
tasks to threads.
Task API is a per-thread function that works in resumed state. This function does not work in paused state.
The Task API does not enable a thread to suspend the current task and switch to a different task (task
switching), or move a task to a different thread (task stealing).
A task instance represents a piece of work performed by a particular thread for a period of time. The task is
defined by the bracketing of __itt_task_begin() and __itt_task_end() on the same thread.
NOTE
To be able to see user tasks in your results, enable the Analyze user tasks checkbox in analysis
settings.
Usage Example
The following code snippet creates a domain and a couple of tasks at global scope.
#include "ittnotify.h"

// Domain and string handles created at global scope, as described above.
// (The names, the do_foo() declaration, and the body of BeginFrame() are filled in
// here so the fragment is complete.)
void do_foo(double seconds);
__itt_domain* domain = __itt_domain_create("Example.Domain.Global");
__itt_string_handle* shMyTask = __itt_string_handle_create("My Task");
__itt_string_handle* shMySubtask = __itt_string_handle_create("My SubTask");

void BeginFrame() {
    __itt_task_begin(domain, __itt_null, __itt_null, shMyTask);
    do_foo(1);
}
void DoWork() {
    __itt_task_begin(domain, __itt_null, __itt_null, shMySubtask);
    do_foo(1);
    __itt_task_end(domain);
}
void EndFrame() {
    do_foo(1);
    __itt_task_end(domain);
}
int main() {
    BeginFrame();
    DoWork();
    EndFrame();
    return 0;
}
#ifdef _WIN32
#include <ctime>
clock_gettime(CLOCK_REALTIME, &start_time);
while(1) {
    clock_gettime(CLOCK_REALTIME, &current_time);
    TYPE cur_nsec = (long)((current_time.tv_sec - start_time.tv_sec - sec) * NSEC +
                           current_time.tv_nsec - start_time.tv_nsec);
    if (cur_nsec >= 0)
        break;
}
}
#endif
See Also
Basic Usage and Configuration
Minimize ITT API Overhead
Task Analysis
View Instrumentation and Tracing Technology (ITT) API Task Data in Intel® VTune™ Profiler
Frame API
Use the frame API to insert calls to the desired places in your code and analyze performance per frame,
where frame is the time period between frame begin and end points. When frames are displayed in
Intel® VTune™ Profiler, they are displayed in a separate track, so they provide a way to visually separate this
data from normal task data.
Frame API is a per-process function that works in resumed state. This function does not work in paused
state.
You can run the frame analysis to:
• Analyze Windows OS game applications that use DirectX* rendering.
• Analyze graphical applications performing repeated calculations.
• Analyze transaction processing on a per transaction basis to discover input cases that cause bad
performance.
Frames represent a series of non-overlapping regions of Elapsed time. Frames are global in nature and not connected with any specific thread. The following ITT APIs enable analyzing code frames and presenting the analysis data:
__itt_frame_begin_v3 with NULL as the id parameter designates the beginning of the frame.
__itt_frame_end_v3 with NULL as the id parameter designates the end of the frame.
NOTE
The analysis types based on the hardware event-based sampling collection are limited to 64 distinct
frame domains.
Usage Example
The following example uses the frame API to capture the Elapsed times for the specified code sections.
#include "ittnotify.h"
// Create the domain used by the frame calls below (the domain name is illustrative).
__itt_domain* pD = __itt_domain_create("My.Frame.Domain");
__itt_frame_begin_v3(pD, NULL);
do_foo_1();
__itt_frame_end_v3(pD, NULL);
__itt_frame_begin_v3(pD, NULL);
do_foo_2();
__itt_frame_end_v3(pD, NULL);
See Also
Basic Usage and Configuration
Viewing ITT API Task Data in Intel VTune Profiler
Histogram API
Use the Histogram API to define histograms that
display arbitrary data in histogram form in Intel®
VTune™ Profiler.
The Histogram API enables you to define custom histogram graphs in your code to display arbitrary data of
your choice in VTune Profiler.
Histograms can be especially useful for showing statistics that can be split by individual units for cross-
comparison.
For example, you can use this API in your workload to:
• Track load distribution
• Track resource utilization
• Identify oversubscribed or underutilized worker nodes
Any histogram instance can be accessed by any thread in the process, regardless of which thread created the
histogram. The Histogram API call is thread-safe.
NOTE
By default, Histogram API data collection and visualization are available in the Input and Output
analysis only. To see the histogram in the result of other analysis types, create a custom analysis
based on the pre-defined analysis type you are interested in, and enable the Analyze user
histogram checkbox in the custom analysis options.
Primitives:
Usage Example
The following example creates a histogram to store worker thread statistics:
NOTE
The User-Defined Synchronization API works with the Threading analysis type.
void __itt_sync_create(void *addr, const __itt_char *objtype, const __itt_char *objname, int attribute)
    Register the creation of a sync object using char or Unicode string.
void __itt_sync_rename(void *addr, const __itt_char *name)
    Assign a name to a sync object using char or Unicode string, after it was created.
void __itt_sync_prepare(void *addr)
    Enter spin loop on user-defined sync object.
void __itt_sync_cancel(void *addr)
    Quit spin loop without acquiring spin object.
void __itt_sync_acquired(void *addr)
    Define successful spin loop completion (sync object acquired).
void __itt_sync_releasing(void *addr)
    Start sync object releasing code. This primitive is called before the lock release call.
Each API call has a single parameter, addr. The address, not the value, is used to differentiate between two
or more distinct custom synchronization objects. Each unique address enables the VTune Profiler to track a
separate custom object. Therefore, to use the same custom object to protect access in different parts of your
code, use the same addr parameter around each.
When properly embedded in your code, the primitives tell the VTune Profiler when the code is attempting to
perform some type of synchronization. Each prepare primitive must be paired with a cancel or acquired
primitive.
Each user-defined synchronization construct may involve any number of synchronization objects. Each
synchronization object must be triggered off of a unique memory handle, which the user-defined
synchronization API uses to track the object. Any number of synchronization objects may be tracked at one
time using the user-defined synchronization API, as long as each object uses a unique memory pointer. You
can think of this as modeling objects similar to the WaitForMultipleObjects function in the Windows* OS
API. You can create more complex synchronization constructs out of a group of synchronization objects;
however, it is not advisable to interlace different user-defined synchronization constructs as this results in
incorrect behavior.
long spin = 1;
. . . .
. . . .
__itt_sync_prepare((void *) &spin );
while(ResourceBusy);
// spin wait;
__itt_sync_acquired((void *) &spin );
Using the cancel API may be applicable to other scenarios where the current thread tests the user
synchronization construct and decides to do something useful instead of waiting for a signal from another
thread. See the following code example:
long spin = 1;
. . . .
. . . .
__itt_sync_prepare((void *) &spin );
while(ResourceBusy)
{
__itt_sync_cancel((void *) &spin );
//
// Do useful work
//
. . . . .
. . . . .
//
// Once done with the useful work, this construct will test the
// lock variable and try to acquire it again. Before this can
// be done, a call to the prepare API is required.
//
__itt_sync_prepare((void *) &spin );
}
__itt_sync_acquired((void *) &spin);
After you acquire a lock, you must call the releasing API before the current thread releases the lock. The
following example shows how to use the releasing API:
long spin = 1;
. . . .
. . . .
__itt_sync_releasing((void *) &spin );
// Code here should free the resource
CSEnter()
{
__itt_sync_prepare((void*) &cs);
while(LockIsUsed)
{
if(LockIsFree)
{
// Code to actually acquire the lock goes here
__itt_sync_acquired((void*) &cs);
}
if(timeout)
{
__itt_sync_cancel((void*) &cs );
}
}
}
CSLeave()
{
if(LockIsMine)
{
__itt_sync_releasing((void*) &cs);
// Code to actually release the lock goes here
}
}
This simple critical section example demonstrates how to use the user-defined synchronization primitives.
When looking at this example, note the following points:
• Each prepare primitive is paired with an acquired primitive or a cancel primitive.
• The prepare primitive is placed immediately before the user code begins waiting for the user lock.
• The acquired primitive is placed immediately after the user code actually obtains the user lock.
• The releasing primitive is placed before the user code actually releases the user lock. This ensures that
another thread does not call the acquired primitive before the VTune Profiler realizes that this thread has
released the lock.
Barrier()
{
teamflag = false;
__itt_sync_releasing((void *) &counter);
InterlockedIncrement(&counter); // use the atomic increment primitive appropriate to your
OS and compiler
See Also
Basic Usage and Configuration
Event API
The event API is used to observe when demarcated events occur in your application, or to identify how long it
takes to execute demarcated regions of code. Set annotations in the application to demarcate areas where
events of interest occur. After running analysis, you can see the events marked in the Timeline pane.
Event API is a per-thread function that works in resumed state. This function does not work in paused state.
NOTE
• On Windows* OS platforms you can define Unicode to use a wide character version of APIs that
pass strings. However, these strings are internally converted to ASCII strings.
• On Linux* OS platforms only a single variant of the API exists.
__itt_event __itt_event_create(const __itt_char *name, int namelen);
    Create an event type with the specified name and length. This API returns a handle to the event type that should be passed into the following event start and event end APIs as a parameter. The namelen parameter refers to the name length in number of characters, not the number of bytes.
int __itt_event_start(__itt_event event);
    Call this API with your previously created event type handle to register an instance of the event. Event start appears in the Timeline pane display as a tick mark.
NOTE
To see events and user tasks in your results, create a custom analysis (based on the pre-defined
analysis you are interested in) and select the Analyze user tasks, events and counters checkbox in
the analysis settings.
Usage Example: Creating and Marking Single Events
The __itt_event_create API returns a new event handle that you can subsequently use to mark user
events with the __itt_event_start API. In this example, two event type handles are created and used to
set the start points for tracking two different types of events.
#include "ittnotify.h"
// Two event type handles, created once and reused to mark event instances.
// (The event names and wrapper functions are an illustrative completion of this example.)
__itt_event event_frame  = __itt_event_create("Frame Processed", 15);
__itt_event event_packet = __itt_event_create("Packet Received", 15);

void on_frame_processed() { __itt_event_start(event_frame); }
void on_packet_received() { __itt_event_start(event_packet); }
See Also
Basic Usage and Configuration
View Instrumentation and Tracing Technology (ITT) API Task Data in Intel® VTune™ Profiler
Counter API
The Counter API is used to observe user-defined global characteristic counters that are unknown to VTune
Profiler. For example, it is useful for system on a chip (SoC) development when different counters may
represent different parts of the SoC and count some hardware characteristics.
To define and create a counter object, use the following primitives:
__itt_counter
__itt_counter_create(const char *name, const char *domain);
__itt_counter_createA(const char *name, const char *domain);
__itt_counter_createW(const wchar_t *name, const wchar_t *domain);
__itt_counter_create_typed (const char *name, const char *domain, __itt_metadata_type
type);
NOTE
Applicable to uint64 counters only.
Usage Example
The following example creates a counter that measures temperature and memory usage metrics:
#include "ittnotify.h"
// Counters created with the primitives above (names and domain are illustrative);
// the temperature variable backs the set_value call below.
__itt_counter temperatureCounter = __itt_counter_create("Temperature", "MyDomain");
__itt_counter memoryUsageCounter = __itt_counter_create("Memory Usage", "MyDomain");
unsigned long long temperature = 0;
while (...)
{
...
temperature = getTemperature();
__itt_counter_set_value(temperatureCounter, &temperature);
__itt_counter_inc_delta(memoryUsageCounter, getAllocatedMemSize());
__itt_counter_dec_delta(memoryUsageCounter, getDeallocatedMemSize());
...
}
__itt_counter_destroy(temperatureCounter);
__itt_counter_destroy(memoryUsageCounter);
See Also
Basic Usage and Configuration
void __itt_module_loadW(void* start_addr, void* end_addr, const wchar_t* path)
    Call this function after the relocation of a module. Provide the new start and end addresses for the module and the full path to the module on the local drive.
void __itt_module_loadA(void* start_addr, void* end_addr, const char* path)
    Call this function after the relocation of a module. Provide the new start and end addresses for the module and the full path to the module on the local drive.
void __itt_module_load(void* start_addr, void* end_addr, const char* path)
    Call this function after the relocation of a module. Provide the new start and end addresses for the module and the full path to the module on the local drive.
Usage Example
#include "ittnotify.h"
__itt_module_load(relocatedBaseModuleAddress, relocatedEndModuleAddress, "/some/path/to/dynamic/library.so");
See Also
Basic Usage and Configuration
Instrumenting Your Application
Minimizing ITT API Overhead
Usage Tips
Follow these guidelines when using the memory allocation APIs:
• Create wrapper functions for your routines, and put the __itt_heap_*_begin and __itt_heap_*_end
calls in these functions.
• Allocate a unique domain for each pair of allocate/free functions when calling
__itt_heap_function_create. This allows the VTune Profiler to verify a matching free function is
called for every allocate function call.
• Annotate the beginning and end of every allocate function and free function.
• Call all function pairs from the same stack frame, otherwise the VTune Profiler assumes an exception
occurred and the allocation attempt failed.
• Do not call an end function without first calling the matching begin function.
typedef void* __itt_heap_function;
    Declare a handle type to match begin and end calls and domains.
Usage Example: Heap Allocation
#include <ittnotify.h>
__itt_heap_function my_allocator;
__itt_heap_function my_reallocator;
__itt_heap_function my_freer;
void* my_malloc(size_t s)
{
void* p;
__itt_heap_allocate_begin(my_allocator, s, 0);
p = user_defined_malloc(s);
__itt_heap_allocate_end(my_allocator, &p, s, 0);
return p;
}
return(np);
}
See Also
Basic Usage and Configuration
The JIT (Just-In-Time) Profiling API provides functionality to report information about just-in-time generated
code that can be used by performance tools. You need to insert JIT Profiling API calls in the code generator to
report information before JIT-compiled code goes to execution. This information is collected at runtime and
used by tools like Intel® VTune™ Profiler to display performance metrics associated with JIT-compiled code.
You can use the JIT Profiling API to profile such environments as dynamic JIT compilation of JavaScript code
traces, JIT execution in OpenCL™ applications, Java*/.NET* managed execution environments, and custom
ISV JIT engines.
The standard VTune Profiler installation contains a static part (as a static library and source files) and a
profiler object. The JIT engine generating code during runtime communicates with a profiler object through
the static part. During runtime, the JIT engine reports the information about JIT-compiled code stored in a
trace file by the profiler object. After collection, the VTune Profiler uses the generated trace file to resolve the
JIT-compiled code. If the VTune Profiler is not installed, profiling is disabled.
Use the JIT Profiling API to:
• Profile trace-based and method-based JIT-compiled code
• Analyze split functions
• Explore inline functions
JIT profiling is supported with the Launch Application target option for event-based sampling.
#include <jitprofiling.h>
if (iJIT_IsProfilingActive() != iJIT_SAMPLING_ON) {
return;
}
iJIT_NotifyEvent(iJVM_EVENT_TYPE_METHOD_LOAD_FINISHED,
(void*)&jmethod);
iJIT_NotifyEvent(iJVM_EVENT_TYPE_SHUTDOWN, NULL);
Usage Tips
• If any iJVM_EVENT_TYPE_METHOD_LOAD_FINISHED event overwrites an already reported method, then
such a method becomes invalid and its memory region is treated as unloaded. VTune Profiler displays the
metrics collected by the method until it is overwritten.
• If supplied line number information contains multiple source lines for the same assembly instruction (code
location), then VTune Profiler picks up the first line number.
• Dynamically generated code can be associated with a module name. Use the iJIT_Method_Load_V2
structure.
• If you register a function with the same method ID multiple times, specifying different module names,
then the VTune Profiler picks up the module name registered first. If you want to distinguish the same
function between different JIT engines, supply different method IDs for each function. Other symbolic
information (for example, source file) can be identical.
#include <jitprofiling.h>
iJIT_Method_Load a = {0};
a.method_id = method_id;
a.method_load_address = 0x100;
a.method_size = 0x20;
iJIT_Method_Load b = {0};
b.method_id = method_id;
b.method_load_address = 0x200;
b.method_size = 0x30;
iJIT_NotifyEvent(iJVM_EVENT_TYPE_METHOD_LOAD_FINISHED, (void*)&a);
iJIT_NotifyEvent(iJVM_EVENT_TYPE_METHOD_LOAD_FINISHED, (void*)&b);
Usage Tips
• If a iJVM_EVENT_TYPE_METHOD_LOAD_FINISHED event overwrites an already reported method, then such
a method becomes invalid and its memory region is treated as unloaded.
• All code regions reported with the same method ID are considered as belonging to the same method.
Symbolic information (method name, source file name) will be taken from the first notification, and all
subsequent notifications with the same method ID will be processed only for line number table
information. So, the VTune Profiler will map samples to a source line using the line number table from the
current notification while taking the source file name from the very first one.
• If you register a second code region with a different source file name and the same method ID, this
information will be saved and will not be considered as an extension of the first code region, but VTune
Profiler will use the source file of the first code region and map performance metrics incorrectly.
• If you register a second code region with the same source file as for the first region and the same
method ID, the source file will be discarded but VTune Profiler will map metrics to the source file
correctly.
• If you register a second code region with a null source file and the same method ID, provided line
number info will be associated with the source file of the first code region.
#include <jitprofiling.h>
// method_id parent_id
// [-- c --] 3000 2000
iJIT_Method_Load a = {0};
a.method_id = 1000;
iJIT_Method_Inline_Load b = {0};
b.method_id = 2000;
b.parent_method_id = 1000;
iJIT_Method_Inline_Load c = {0};
c.method_id = 3000;
c.parent_method_id = 2000;
iJIT_Method_Inline_Load d = {0};
d.method_id = 2001;
d.parent_method_id = 1000;
iJIT_NotifyEvent(iJVM_EVENT_TYPE_METHOD_LOAD_FINISHED, (void*)&a);
iJIT_NotifyEvent(iJVM_EVENT_TYPE_METHOD_INLINE_LOAD_FINISHED, (void*)&b);
iJIT_NotifyEvent(iJVM_EVENT_TYPE_METHOD_INLINE_LOAD_FINISHED, (void*)&c);
iJIT_NotifyEvent(iJVM_EVENT_TYPE_METHOD_INLINE_LOAD_FINISHED, (void*)&d);
Usage Tips
• Each inline (iJIT_Method_Inline_Load) method should be associated with two method IDs: one for
itself; one for its immediate parent.
• Address regions of inline methods of the same parent method cannot overlap each other.
• Execution of the parent method must not be started until it and all its inline methods are reported.
• In case of nested inline methods, the order of iJVM_EVENT_TYPE_METHOD_INLINE_LOAD_FINISHED events is not important.
• If any event overwrites either inline method or top parent method, then the parent, including inline
methods, becomes invalid and its memory region is treated as unloaded.
See Also
JIT Profiling API Reference
1. Include the jitprofiling.h file, located under the <install-dir>\include (Windows*) or <install-dir>/include (Linux*) directory, in your code. This header file provides all API function prototypes and type definitions.
2. Link to jitprofiling.lib (Windows*) or jitprofiling.a (Linux*), located under <install-dir>\lib32 or <install-dir>\lib64 (Windows*), or <install-dir>/lib32 or <install-dir>/lib64 (Linux*).
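For example, a Linux* build command might look like the following sketch, where myjit.c stands for your JIT engine source file and the -ldl and -lpthread libraries may or may not be needed on your system:
$ gcc -I <install-dir>/include myjit.c <install-dir>/lib64/jitprofiling.a -ldl -lpthread -o myjit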
int iJIT_NotifyEvent( iJIT_JVM_EVENT event_type, void *EventSpecificData );
Use this API to send a notification of event_type with the data pointed by EventSpecificData to the agent. The reported information is used to attribute samples obtained from any Intel® VTune™ Profiler collector.

unsigned int iJIT_GetNewMethodID( void );
Generate a new method ID. You must use this function to assign unique and valid method IDs to methods reported to the profiler. This API returns a new unique method ID. When out of unique method IDs, this API function returns 0.

iJIT_IsProfilingActiveFlags iJIT_IsProfilingActive( void );
Returns the current mode of the profiler: off or sampling, using the iJIT_IsProfilingActiveFlags enumeration. This API returns iJIT_SAMPLING_ON by default, indicating that sampling is running. It returns iJIT_NOTHING_RUNNING if no profiler is running.
6. Select Configuration Properties > C/C++ > General and add the path to the headers (<install-
dir>/include) to Additional Include Directories.
7. Select Configuration Properties > C/C++ > Linker > General and add the path to the library
(<install-dir>/lib32 or <install-dir>/lib64) to Additional Library Directories.
8. Click OK to apply the changes and close the window.
9. Rebuild the solution with the new project settings.
Installation Information
Whether you downloaded Intel® VTune™ Profiler as a standalone component or with the Intel® oneAPI Base
Toolkit, the default path for your <install-dir> is:
macOS*: /opt/intel/oneapi/
For OS-specific installation instructions, refer to the VTune Profiler Installation Guide.
See Also
About JIT Profiling API
JIT Profiling API Reference
Basic Usage and Configuration
See prerequisites here
iJIT_NotifyEvent
Reports information about JIT-compiled code to the
agent.
Syntax
int iJIT_NotifyEvent( iJIT_JVM_EVENT event_type, void *EventSpecificData );
Description
The iJIT_NotifyEvent function sends a notification of event_type with the data pointed by
EventSpecificData to the agent. The reported information is used to attribute samples obtained from any
Intel® VTune™ Profiler collector. This API needs to be called after JIT compilation and before the first entry into
the JIT-compiled code.
Input Parameters
Parameter Description
iJIT_JVM_EVENT event_type Notification code sent to the agent. See a complete list of
event types below.
The following values are allowed for event_type:
iJVM_EVENT_TYPE_METHOD_LOAD_FINISHED
Send this notification after a JITted method has been loaded into memory, and possibly JIT compiled, but before the code is executed. Use the iJIT_Method_Load structure for EventSpecificData. The return value of iJIT_NotifyEvent is undefined.
iJIT_Method_Inline_Load Structure
When you use the iJIT_Method_Inline_Load structure to describe the JIT compiled method, use
iJVM_EVENT_TYPE_METHOD_INLINE_LOAD_FINISHED as an event type to report it. The
iJIT_Method_Inline_Load structure has the following fields:
Field Description
unsigned int method_id Unique method ID. Method ID cannot be smaller than 999.
You must either use the API function
iJIT_GetNewMethodID to get a valid and unique method
ID, or else manage ID uniqueness and correct range by
yourself.
unsigned int parent_method_id Unique immediate parent’s method ID. Method ID may not
be smaller than 999. You must either use the API function
iJIT_GetNewMethodID to get a valid and unique method ID,
or else manage ID uniqueness and correct range by yourself.
char *method_name The name of the method, optionally prefixed with its class
name and appended with its complete signature. This
argument cannot be set to NULL.
void *method_load_address The base address of the method code. Can be NULL if the
method is not JITted.
unsigned int method_size The code size of the method in memory. If 0, then data provided with the event are not accepted.
unsigned int line_number_size The number of entries in the line number table. 0 if none.
pLineNumberInfo line_number_table Pointer to the line numbers info array. Can be NULL if
line_number_size is 0. See LineNumberInfo structure for
a description of a single entry in the line number info array.
iJIT_Method_Load Structure
When you use the iJIT_Method_Load structure to describe the JIT compiled method, use
iJVM_EVENT_TYPE_METHOD_LOAD_FINISHED as an event type to report it. The iJIT_Method_Load
structure has the following fields:
Field Description
unsigned int method_id Unique method ID. Method ID cannot be smaller than 999.
You must either use the API function
iJIT_GetNewMethodID to get a valid and unique method
ID, or else manage ID uniqueness and correct range by
yourself.
char *method_name The name of the method, optionally prefixed with its class
name and appended with its complete signature. This
argument cannot be set to NULL.
void *method_load_address The base address of the method code. Can be NULL if the
method is not JITted.
unsigned int method_size The code size of the method in memory. If 0, then data provided with the event are not accepted.
unsigned int line_number_size The number of entries in the line number table. 0 if none.
pLineNumberInfo line_number_table Pointer to the line numbers info array. Can be NULL if
line_number_size is 0. See LineNumberInfo structure for
a description of a single entry in the line number info array.
iJIT_Method_Load_V2 Structure
When you use the iJIT_Method_Load_V2 structure to describe the JIT compiled method, use
iJVM_EVENT_TYPE_METHOD_LOAD_FINISHED_V2 as an event type to report it. The iJIT_Method_Load_V2
structure has the following fields:
Field Description
unsigned int method_id Unique method ID. Method ID cannot be smaller than 999.
You must either use the API function
iJIT_GetNewMethodID to get a valid and unique method
ID, or else manage ID uniqueness and correct range by
yourself.
char *method_name The name of the method, optionally prefixed with its class
name and appended with its complete signature. This
argument cannot be set to NULL.
void *method_load_address The base address of the method code. Can be NULL if the
method is not JITted.
unsigned int method_size The code size of the method in memory. If 0, then data provided with the event are not accepted.
unsigned int line_number_size The number of entries in the line number table. 0 if none.
pLineNumberInfo line_number_table Pointer to the line numbers info array. Can be NULL if
line_number_size is 0. See LineNumberInfo structure
for a description of a single entry in the line number info
array.
char *module_name Module name. Can be NULL. The module name can be
useful for distinguishing among different JIT engines. VTune
Profiler will display reported methods grouped by specific
module.
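For example, a minimal sketch of reporting a JIT-compiled method with a module name; code_addr, code_size, and the name strings are placeholders for values produced by your JIT engine:
#include <jitprofiling.h>
iJIT_Method_Load_V2 m = {0};
m.method_id = iJIT_GetNewMethodID();      /* unique ID obtained from the API */
m.method_name = "MyClass.compute()";      /* placeholder method name */
m.method_load_address = code_addr;        /* placeholder: start address of the JITted code */
m.method_size = code_size;                /* placeholder: size of the JITted code region, in bytes */
m.module_name = "MyJitEngine";            /* used to group reported methods by JIT engine */
iJIT_NotifyEvent(iJVM_EVENT_TYPE_METHOD_LOAD_FINISHED_V2, (void*)&m);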
LineNumberInfo Structure
Use the LineNumberInfo structure to describe a single entry in the line number information of a code
region. A table of line number entries provides information about how the reported code region is mapped to
source file. VTune Profiler uses line number information to attribute the samples (virtual address) to a line
number. It is acceptable to report different code addresses for the same source line:
Offset    LineNumber
1         2
12        4
15        2
18        1
21        30
VTune Profiler constructs the following table using the client data:
Code address range    Line number
0-1                   2
1-12                  4
12-15                 2
15-18                 1
18-21                 30
Field Description
unsigned int Offset Opcode byte offset from the beginning of the method.
unsigned int LineNumber Matching source line number offset (from beginning of
source file).
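For example, the client table above could be filled in as follows (a minimal sketch; m stands for an iJIT_Method_Load structure being prepared for iJIT_NotifyEvent, as in the earlier examples):
LineNumberInfo lines[] = {
    { 1, 2 },    /* code offsets up to 1 map to source line 2 */
    { 12, 4 },   /* offsets 1-12 map to source line 4, and so on */
    { 15, 2 },
    { 18, 1 },
    { 21, 30 }
};
m.line_number_size = sizeof(lines) / sizeof(lines[0]);
m.line_number_table = lines;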
Return Values
The return values are dependent on the particular iJIT_JVM_EVENT.
See Also
About JIT Profiling API
iJIT_IsProfilingActive
Returns the current mode of the agent.
Syntax
iJIT_IsProfilingActiveFlags JITAPI iJIT_IsProfilingActive( void );
Description
The iJIT_IsProfilingActive function returns the current mode of the agent.
Input Parameters
None
Return Values
iJIT_SAMPLING_ON, indicating that the agent is running, or iJIT_NOTHING_RUNNING if no agent is running.
See Also
About JIT Profiling API
Using JIT Profiling API
iJIT_GetNewMethodID
Generates a new unique method ID.
Syntax
unsigned int iJIT_GetNewMethodID(void);
Description
The iJIT_GetNewMethodID function generates a new method ID upon each call. Use this API to obtain unique and valid method IDs for methods or traces reported to the agent if you do not have your own mechanism to generate unique method IDs.
Input Parameters
None
Return Values
A new unique method ID. When out of unique method IDs, this API function returns 0.
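For example (a minimal sketch):
unsigned int method_id = iJIT_GetNewMethodID();
if (method_id == 0) {
    /* The pool of unique method IDs is exhausted; skip reporting this method. */
}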
See Also
About JIT Profiling API
Using JIT Profiling API
.NET* APIs
RegisterClassA ThreadPool_UnsafeRegisterWaitForSingleObject_4
RegisterClassW ThreadPool_QueueUserWorkItem_1
RegisterClassExA ThreadPool_QueueUserWorkItem_2
RegisterClassExW ThreadPool_UnsafeQueueUserWorkItem
UnregisterClassA ThreadPool_UnsafeQueueNativeOverlapped
UnregisterClassW Timer_Ctor_1
GetClassInfoA Timer_Ctor_2
GetClassInfoW Timer_Ctor_3
GetClassInfoExA Timer_Ctor_4
GetClassInfoExW Timer_Ctor_5
GetWindowLongA Monitor_Exit
GetWindowLongW MonitorWait
GetWindowLongPtrA Monitor_Wait_1
GetWindowLongPtrW Monitor_Wait_2
GetClassLongA Monitor_Wait_3
GetClassLongW Monitor_Wait_4
GetClassLongPtrA Monitor_Wait_5
GetClassLongPtrW Monitor_Pulse
SetWindowLongA Monitor_PulseAll
SetWindowLongW Monitor_Enter
SetWindowLongPtrA Monitor_Enter_1
SetWindowLongPtrW MonitorTryEnter
SetClassLongA Monitor_TryEnter_1
SetClassLongW Monitor_TryEnter_2
SetClassLongPtrA Monitor_TryEnter_3
SetClassLongPtrW Monitor_TryEnter_4
AutoResetEvent_Ctor Monitor_TryEnter_5
ManualResetEvent_Ctor Mutex_Ctor_1
EventWaitHandle_Ctor_1 Mutex_Ctor_2
EventWaitHandle_Ctor_2 Mutex_Ctor_3
EventWaitHandle_Ctor_3 Mutex_Ctor_4
EventWaitHandle_Ctor_4 Mutex_Ctor_5
EventWaitHandle_OpenExisting_1 Mutex_Release
EventWaitHandle_OpenExisting_2 Mutex_OpenExisting_1
EventWaitHandle_Set Mutex_OpenExisting_2
EventWaitHandle_Reset Semaphore_Ctor_1
WaitHandle_WaitOne_1 Semaphore_Ctor_2
WaitHandle_WaitOne_2 Semaphore_Ctor_3
WaitHandle_WaitOne_3 Semaphore_Ctor_4
WaitHandle_WaitAny_1 Semaphore_OpenExisting_1
WaitHandle_WaitAny_2 Semaphore_OpenExisting_2
WaitHandle_WaitAny_3 Semaphore_Release_1
WaitHandle_WaitAll_1 Semaphore_Release_2
WaitHandle_WaitAll_2 ReaderWriterLock_Ctor
WaitHandle_WaitAll_3 ReaderWriterLock_AcquireReaderLock_1
WaitHandle_SignalAndWait_1 ReaderWriterLock_AcquireReaderLock_2
WaitHandle_SignalAndWait_2 ReaderWriterLock_AcquireWriterLock_1
WaitHandle_SignalAndWait_3 ReaderWriterLock_AcquireWriterLock_2
Thread_Join_1 ReaderWriterLock_ReleaseReaderLock
Thread_Join_2 ReaderWriterLock_ReleaseWriterLock
Thread_Join_3 ReaderWriterLock_UpgradeToWriterLock_1
Thread_Sleep_1 ReaderWriterLock_UpgradeToWriterLock_2
Thread_Sleep_2 ReaderWriterLock_DowngradeFromWriterLock
Thread_Interrupt ReaderWriterLock_RestoreLock
ThreadPool_RegisterWaitForSingleObject_1 ReaderWriterLock_ReleaseLock
ThreadPool_RegisterWaitForSingleObject_2 WaitHandle_WaitOne_4
ThreadPool_RegisterWaitForSingleObject_3 WaitHandle_WaitOne_5
ThreadPool_RegisterWaitForSingleObject_4 WaitHandle_WaitAny_4
ThreadPool_UnsafeRegisterWaitForSingleObject_1 WaitHandle_WaitAny_5
ThreadPool_UnsafeRegisterWaitForSingleObject_2 WaitHandle_WaitAll_4
ThreadPool_UnsafeRegisterWaitForSingleObject_3 WaitHandle_WaitAll_5
Callback APIs
BindIoCompletionCallback QueueUserAPC
GetOverlappedResult RaiseException
RtlInitializeConditionVariable SleepConditionVariableCS
RtlWakeAllConditionVariable SleepConditionVariableSRW
RtlWakeConditionVariable
InitializeCriticalSection RtlInitializeCriticalSection
InitializeCriticalSection RtlTryEnterCriticalSection
InitializeCriticalSectionEx RtlEnterCriticalSection
InitializeCriticalSectionAndSpinCount RtlLeaveCriticalSection
RtlInitializeCriticalSectionAndSpinCount RtlSetCriticalSectionSpinCount
RtlDeleteCriticalSection
Event APIs
CreateEventA OpenEventW
CreateEventExA PulseEvent
CreateEventExW ResetEvent
CreateEventW SetEvent
OpenEventA PulseEvent
Fiber APIs
SwitchToFiber DeleteFiber
CreateFiberEx FiberStartRoutineWrapper
File/Directory APIs
CreateFileA FindFirstFileW
CreateFileW FindFirstFileExA
OpenFile FindFirstFileExW
WriteFile FindNextChangeNotification
WriteFileEx FindNextFileA
WriteFileGather FindNextFileW
ReadFile GetCurrentDirectoryA
ReadFileEx GetCurrentDirectoryW
ReadFileScatter MoveFileA
FindFirstChangeNotificationA MoveFileW
FindFirstChangeNotificationW MoveFileExA
FindCloseChangeNotification MoveFileExW
CreateDirectoryA ReadDirectoryChangesW
CreateDirectoryW RemoveDirectoryA
CreateDirectoryExA RemoveDirectoryW
CreateDirectoryExW SetCurrentDirectoryA
DeleteFileA SetCurrentDirectoryW
DeleteFileW lock
FindFirstFileA unlock
Input/output APIs
CreateMailslotA ReadConsoleInputA
CreateMailslotW ReadConsoleInputW
DeviceIoControl ReadConsoleA
FindFirstPrinterChangeNotification ReadConsoleW
FindClosePrinterChangeNotification WaitCommEvent
GetStdHandle WaitForInputIdle
Memory Allocation APIs
malloc LocalReAlloc
calloc LocalSize
realloc LocalUnlock
free GetProcessHeap
RtlAllocateHeap GetProcessHeaps
RtlReAllocateHeap HeapAlloc
RtlFreeHeap HeapCompact
RtlSizeHeap HeapCreate
GlobalAlloc HeapDestroy
GlobalFlags HeapFree
GlobalFree HeapLock
GlobalHandle HeapQueryInformation
GlobalLock HeapReAlloc
GlobalReAlloc HeapSetInformation
GlobalSize HeapSize
GlobalUnlock HeapUnlock
LocalAlloc HeapValidate
LocalFlags HeapWalk
LocalFree
LocalHandle
LocalLock
Mutex APIs
CreateMutexA OpenMutexA
CreateMutexExA OpenMutexW
CreateMutexExW ReleaseMutex
CreateMutexW
Networking APIs
RpcNsBindingLookupBeginA closesocket
RpcNsBindingLookupBeginW connect
RpcNsBindingLookupNext recv
RpcNsBindingLookupDone recvfrom
RpcNsBindingImportBeginA send
RpcNsBindingImportBeginW sendto
RpcNsBindingImportNext select
RpcNsBindingImportDone WSASocketA
RpcStringBindingComposeA WSASocketW
RpcStringBindingComposeW WSAAccept
RpcServerListen WSAConnect
RpcMgmtWaitServerListen WSASend
RpcMgmtInqIfIds WSASendTo
RpcEpResolveBinding WSARecv
RpcCancelThread WSARecvFrom
RpcMgmtEpEltInqBegin WSAGetOverlappedResult
RpcMgmtEpEltInqDone WSACreateEvent
RpcMgmtEpEltInqNextA WSACloseEvent
RpcMgmtEpEltInqNextW WSAResetEvent
socket WSASetEvent
accept WSAWaitForMultipleEvents
Object APIs
CloseHandle DuplicateHandle
InitOnceBeginInitialize InitOnceExecuteOnce
InitOnceComplete RtlRunOnceInitialize
Pipe APIs
CallNamedPipeA TransactNamedPipe
CallNamedPipeW WaitNamedPipeA
ConnectNamedPipe WaitNamedPipeW
CreateNamedPipeA
CreateNamedPipeW
Process APIs
CreateProcessA TerminateProcess
CreateProcessW ExitProcess
OpenProcess RtlExitUserProcess
Semaphore APIs
CreateSemaphoreA OpenSemaphoreA
CreateSemaphoreExA OpenSemaphoreW
CreateSemaphoreExW ReleaseSemaphore
CreateSemaphoreW
Sleep APIs
Sleep SleepEx
RtlInitializeSRWLock RtlAcquireSRWLockShared
RtlAcquireSRWLockExclusive RtlReleaseSRWLockShared
RtlReleaseSRWLockExclusive
Thread APIs
CreateThread RtlExitUserThread
CreateRemoteThread TerminateThread
OpenThread SuspendThread
ExitThread Wow64SuspendThread
FreeLibraryAndExitThread ResumeThread
Threadpool APIs
CreateIoCompletionPort CreateTimerQueue
GetQueuedCompletionStatus CreateTimerQueueTimer
PostQueuedCompletionStatus DeleteTimerQueueTimer
CreateThreadpoolWait DeleteTimerQueueEx
CreateThreadpoolWork DeleteTimerQueue
TrySubmitThreadpoolCallback UnregisterWait
CreateThreadpoolTimer UnregisterWaitEx
CreateThreadpoolIo QueueUserWorkItem
CreateThreadpoolCleanupGroup RegisterWaitForSingleObject
Timer APIs
CancelWaitableTimer OpenWaitableTimerA
CreateWaitableTimerA OpenWaitableTimerW
CreateWaitableTimerW SetWaitableTimer
Wait APIs
MsgWaitForMultipleObjects WaitForMultipleObjectsEx
MsgWaitForMultipleObjectsEx WaitForSingleObject
SignalObjectAndWait WaitForSingleObjectEx
WaitForMultipleObjects RegisteredWaitHandle_Unregister
GetMessageA PostThreadMessageW
GetMessageW ReplyMessage
PeekMessageA WaitMessage
PeekMessageW DialogBoxParamA
SendMessageA DialogBoxParamW
SendMessageW DialogBoxIndirectParamA
SendMessageTimeoutA DialogBoxIndirectParamW
SendMessageTimeoutW MessageBoxA
SendMessageCallbackA MessageBoxW
SendMessageCallbackW MessageBoxExA
SendNotifyMessageA MessageBoxExW
SendNotifyMessageW NdrSendReceive
BroadcastSystemMessageExA NdrNsSendReceive
BroadcastSystemMessageExW PrintDlgA
BroadcastSystemMessageA PrintDlgW
BroadcastSystemMessageW PrintDlgExA
PostMessageA PrintDlgExW
PostMessageW ConnectToPrinterDlg
PostThreadMessageA
Timer, signal and wait APIs
setitimer clock_nanosleep
getitimer pause
wait alarm
waitpid signal
waitid sigaction
wait3 sigprocmask
wait4 sigsuspend
sleep sigpending
usleep sigtimedwait
ualarm sigwaitinfo
nanosleep sigwait
I/O API
getwc read
getw write
getchar readv
getwchar writev
getch open
wgetch fopen
mvgetch fdopen
gets close
fgetc fclose
fgetwc io_submit
fgets io_cancel
fgetws io_setup
fread io_destroy
fwrite io_getevents
pipe
select epoll_pwait
pselect poll
epoll_wait ppoll
Network API
socket recv
accept recvfrom
connect send
shutdown sendto
ioctl funlockfile
flock lockf
flockfile fcntl
DSO API
dlopen dlvsym
dlclose dladdr
dlsym dladdr1
RPC API
callrpc pmap_rmtcall
clnt_broadcast pmap_set
clntudp_create svc_run
clntudp_bufcreate svc_sendreply
clntraw_create svcraw_create
pmap_getmaps svctcp_create
pmap_getport svcudp_bufcreate
svcudp_create
pthread_exit pthread_rwlock_timedrdlock
pthread_cancel pthread_rwlock_timedwrlock
pthread_barrier_init pthread_spin_init
pthread_barrier_destroy pthread_spin_destroy
pthread_barrier_wait pthread_spin_lock
pthread_mutex_init pthread_spin_unlock
pthread_mutex_destroy pthread_cond_init
pthread_mutex_lock pthread_cond_destroy
pthread_mutex_unlock pthread_cond_broadcast
pthread_mutex_timedlock pthread_cond_signal
pthread_rwlock_init pthread_cond_timedwait
pthread_rwlock_destroy pthread_cond_wait
pthread_rwlock_rdlock pthread_key_create
pthread_rwlock_wrlock pthread_key_delete
pthread_rwlock_unlock pthread_sigmask
pthread_create pthread_setcancelstate
pthread_join
POSIX Interprocess Communication API
sem_init recvmsg
sem_destroy sendmsg
sem_wait msgrcv
sem_timedwait msgsnd
sem_post msgget
semop semget
semtimedop
mq_close mq_timedreceive
mq_open mq_send
mq_receive mq_timedsend
See Also
API Support
Troubleshooting 10
This section describes known problems and questions you may encounter when analyzing your application
with the Intel® VTune™ Profiler, and suggests solutions:
• Best Practices: Resolve VTune Profiler BSODs, Crashes, and Hangs in Windows OS
• Error Message: Application Sets Its Own Handler for Signal
• Error Message: Cannot Enable Event-Based Sampling Collection
• Error Message: Cannot Collect GPU Hardware Metrics
• Error Message: Cannot Collect GPU Hardware Metrics for the Selected Adapter
• Error Message: Cannot Load Data File
• Error Message: Cannot Locate Debugging Symbols
• Error Message: Result Is Empty
• Error Message: Client Is Not Authorized To Connect to Server
• Error Message: Make sure you have root privileges to analyze Processor Graphics hardware events
• Error Message: No Pre-built Driver Exists for This System
• Error Message: Not All OpenCL Code Profiling Callbacks Are Received
• Error Message: Problem Accessing the Sampling Driver
• Error Message: Required Key Not Available
• Error Message: Scope of ptrace System Call Application Is Limited
• Error Message: Stack Size Is Too Small
• Error Message: Symbol File Is Not Found
• Problem: Analysis of the .NET* Application Fails
• Problem: Cannot Access Documentation
• Problem: CPU Time for Hotspots and Threading Analysis Is Too Low
• Problem: Events= Sample After Value (SAV) * Samples Is Wrong for Disabled Multiple Runs
• Problem: Guessed Stack Frames
• Problem: GUI Hangs or Crashes
• Problem: Inaccurate Sum in the Grid
• Problem: Information Collected via ITT API Is Not Available When Attaching to a Process
• Problem: No GPU Utilization Data Is Collected
• Problem: Same Functions Are Compared As Different Instances
• Problem: Skipped Stack Frames
• Problem: Stack in the Top-Down Tree Window Is Incorrect
• Problem: Stacks in Call Stack and Bottom-Up Panes Are Different
• Problem: System Functions Appear in the User Functions Only Mode
• Problem: VTune Profiler is Slow to Respond When Collecting or Displaying Data
• Problem: VTune Profiler is Slow on XServers with SSH Connection
• Problem: Unexpected Paused Time
• Problem: {Unknown Timer} in the Platform Power Analysis Viewpoint
• Problem: Unknown Critical Error Due to Disabled Loopback Interface
• Problem: Unknown Frames
• Problem: Unreadable text in Intel VTune Profiler on macOS*
• Problem: Unsupported Windows Operating System
• Warnings about Accurate CPU Time Collection
Suggestion: Every time you upgrade to the latest version of Windows OS, uninstall your existing version of Intel® VTune™ Profiler and install the latest available version.
Suggestion: Update all Intel® VTune™ Profiler drivers by installing the latest available version.
When symbol resolution happens in the finalization process, Intel® VTune™ Profiler may have to retrieve and
process large .pdb files. If used within Microsoft Visual Studio, Intel® VTune™ Profiler uses the Visual Studio
settings to find symbol files and any additional paths provided in Intel® VTune™ Profiler settings. However, if
Intel® VTune™ Profiler uses a symbol server, the resolution waits on updates and therefore slows down.
Depending on the size of the .pdb files, this may cause Intel® VTune™ Profiler to stall or hang.
Suggestion: If your analysis requires symbols for system libraries, use a local cache (like the location defined in Visual Studio) instead of a symbol server. Also, remove large .pdb files from the symbol location you provide to Intel® VTune™ Profiler if these files are not required for your analysis.
Suggestion: Exclude the pin.exe process from your virus scanning software or disable the scan when running an Intel® VTune™ Profiler collection. Also, pause synchronization and/or back-up utilities until Intel® VTune™ Profiler finalization is complete.
Suggestion: Run Intel® VTune™ Profiler as an administrator. You can then profile processes with elevated privileges. You can also configure Intel® VTune™ Profiler to profile specific modules. See the Advanced section in the WHAT pane for this purpose.
Problem: User-mode sampling for Threading analysis is too slow or creates too much overhead.
Suggestion: Run Threading analysis with Hardware Event-Based Sampling (HEBS) and context switches enabled. This provides the context switch data necessary to understand thread behavior.

Problem: Hotspots analysis is unavailable with HEBS and stack collection enabled.
Suggestion: Disable stack collection. To correlate hotspots with stacks, run a separate Hotspots analysis with user-mode sampling enabled.

Problem: Intel® VTune™ Profiler hangs or crashes when attaching to a running process.
Suggestion: Run Intel® VTune™ Profiler with the application in paused state. Resume data collection when the application gets to an area of interest.

Problem: Data collection crashes when using the Instrumentation and Tracing Technology (ITT) API.
Suggestion: Create a custom analysis. Disable the checkbox to analyze user tasks, events, and counters. Identify if the API is causing the crash.
Get Help
The suggestions described in this topic can help resolve several crashes or stalls. If you are still facing issues,
contact us so we may better assist you.
• Contact Customer Support.
• Discuss in the Analyzers developer forum.
• See if the issue has been addressed in the Intel® VTune™ Profiler release notes.
See Also
Hardware Event-based Sampling Collection
Cause
User-mode sampling and tracing collector cannot profile applications that set up the signal handler for a
signal used by the Intel® VTune™ Profiler.
Solution
When collecting data with vtune, add the --run-pass-thru=--profiling-signal=<not_used_signal> command line option, where <not_used_signal> is a signal that is not used by the application you analyze; select the signal from the SIGRTMIN..SIGRTMAX range.
Alternatively, you may set the environment variable AMPLXE_RUNTOOL_OPTIONS=--profiling-signal=<not_used_signal>. You may do this either from your terminal window before running the VTune Profiler GUI, or from the Configure Analysis window by entering the variable into the User-defined Environment Variables field.
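For example, assuming a Hotspots collection of an application launched as ./myapp that does not use real-time signal 37 (both the signal number and the application name are placeholders):
vtune -collect hotspots --run-pass-thru=--profiling-signal=37 -- ./myapp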
See Also
Set Up Analysis Target
• Hyper-V allows sampling-based performance profiling on the latest generations of Intel microarchitectures, code named Skylake and Goldmont onward. VTune Profiler cannot work in a Hyper-V environment running on Intel microarchitectures code named Haswell or Broadwell.
Solution
To enable hardware event-based sampling collection for systems prior to Windows 10 RS3, do the following:
• Enable access to the PMU resources through BIOS options (if it was disabled manually).
• Disable the Hyper-V feature as follows:
1. From the Start menu select Search > Settings > Turn Windows features on or off to open the
Windows Features window.
2. Make sure to disable the Hyper-V feature and its sub-features and restart the system.
3. If the Hyper-V feature is not disabled even after the system reboot, you must disable the BIOS VMX
(virtualization feature) if it was not turned off already.
To troubleshoot hardware event-based sampling collection problems for Windows 10 RS3, make sure you
have the Credential Guard and Device Guard security features disabled on your system.
See Also
Profiling Targets in the Hyper-V* Environment
Cause
To collect GPU hardware metrics and GPU utilization data on Linux, the VTune Profiler uses the Intel® Metric
Discovery API library distributed with the product. If it cannot access the library, the corresponding error
message is provided.
Solution
Consider upgrading your Intel® VTune™ Profiler to version 2021.1, available as part of the Intel oneAPI Base Toolkit or as a stand-alone component. This version of the product automatically selects the latest libstdc++ available at runtime to satisfy the GPU analysis requirements, so no additional configuration is required.
For VTune Profiler versions 2020, 2021.1.0 beta04, and older, install the Intel Metric Discovery API library from the official repository at https://github.com/intel/metrics-discovery and make sure to meet the following requirements:
• To enable the VTune Profiler to successfully load the library, it should be linked to libstdc++ (version GLIBCXX_3.4.20 or older) or statically linked to libstdc++. If libmd.so is dynamically linked to a newer version of libstdc++, make sure to have that libstdc++ loaded into the process before loading libmd.so. You can do this, for example, by re-defining the LD_PRELOAD environment variable:
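For example, a sketch for a system where the newer libstdc++ lives under /usr/lib/x86_64-linux-gnu (adjust the path for your distribution), followed by launching the standalone VTune Profiler GUI from the same shell:
$ export LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libstdc++.so.6
$ vtune-gui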
Cause
To collect GPU hardware metrics and GPU utilization data on Linux systems (or Windows systems with driver
versions older than 27.20.100.8280), VTune Profiler uses the Intel® Metric Discovery API library that is
distributed with the product. This error message displays if VTune Profiler cannot access the selected GPU
adapter.
Solution
Make sure that you have set the AMPLXE_TARGET_GPU environment variable correctly. See a description of
this issue in the VTune Profiler release notes to set the variable.
For Windows systems, update the driver for the selected GPU adapters.
For Linux systems, install a version of the Intel Metric Discovery API library that is newer than 1.6.0 to
support the selection of video adapters. To collect metrics from the video card of your choice, disable other
adapters in the BIOS first.
Solution
Consider providing an alternative temporary directory for collected data.
See Also
Analysis Target Setup
If the debug information is absent, the VTune Profiler may not unwind the call stack and display it correctly in
the Call Stack pane. Additionally in some cases, it can take significantly more time to finalize the results for
modules that do not have debug information.
Solution
For accurate performance analysis, you are recommended to have the debug information available on the
system where the VTune Profiler is installed. See detailed instructions to enable:
• debug information for Windows application binaries
• debug information for Windows system libraries
• debug information for Linux application binaries
• debug information for Linux kernels
See Also
Compiler Switches for Performance Analysis on Windows* Targets
Cause
The data collection period could be too short (for example, <10ms), so that the VTune Profiler could not
capture performance data.
Solution
Consider the following options:
• Verify that you can run your application without the VTune Profiler.
You may have two console windows: the first one for building the application and the second one for
launching the VTune Profiler. The second console should run the application smoothly before attempting to
launch the VTune Profiler. If you see an error message reporting problems with loading shared libraries on
the second console, set up the environment correctly either via the LD_LIBRARY_PATH variable or by
running source <install-dir>/env/vars.sh for Linux* and vars.bat for Windows*. Once the
application runs, start the VTune Profiler from that environment.
• If the analysis duration is too short, increase the workload for your application.
See Also
Manage Result Files
Solution
You may permanently allow root access by applying either of the two proposed methods.
See Also
GPU Application Analysis on Intel® HD Graphics and Intel® Iris® Graphics
Cause
You selected the Analyze Processor Graphics events option of the GPU analysis but do not have a
supported version of the Intel® Metric Discovery API library installed.
Solution
To analyze Intel® HD Graphics and Intel® Iris® Graphics hardware events, make sure to set up your system for GPU analysis.
See Also
GPU Application Analysis on Intel® HD Graphics and Intel® Iris® Graphics
Solution
To resolve this issue, execute the following commands to configure the kernel sources:
$ cd /usr/src/linux
$ make mrproper
$ cp /boot/config-`uname -r` .config
$ vi Makefile
Make sure that EXTRAVERSION matches the tail of the output of uname -r. The resulting /usr/src/linux/include/version.h should have a UTS_RELEASE that matches the output of uname -r. Once that is true,
run the following commands:
$ make oldconfig
$ make dep
After completing these steps, run the build-driver script to build the sampling driver against the kernel sources in /usr/src/linux.
Solution
Use the OpenCL™ API to set callbacks on the events returned by clEnqueue* functions and wait for the callbacks to be received. For example:
#include <atomic>
#include <thread>
...
#include <CL/cl2.hpp>

std::atomic_uint32_t number_of_uncompleted_callbacks{0};

// Callback invoked by the OpenCL runtime when the enqueued command completes.
void CL_CALLBACK completion_callback(cl_event, cl_int, void*)
{
    --number_of_uncompleted_callbacks;
}
...
auto kernelFunc = cl::KernelFunctor<cl::Buffer, cl_int>(prog, "sin_cos");
cl::Event event = kernelFunc(cl::EnqueueArgs(cl::NDRange(dataBuf.size())), clDataBuf, 0);
++number_of_uncompleted_callbacks;
event.setCallback(CL_COMPLETE, completion_callback);
...
// Do not exit until all completion callbacks have been received.
while (number_of_uncompleted_callbacks.load())
{
    std::this_thread::yield();
}
return EXIT_SUCCESS;
}
See Also
GPU OpenCL™ Application Analysis
Cause
Intel® VTune™ Profiler cannot access the hardware event-based sampling (EBS) driver required to run a
hardware event-based sampling analysis type. This problem happens if the sampling driver was not loaded or
you do not have correct permissions.
Solution
Make sure the sampling drivers are loaded:
> lsmod | grep sep3_1 or > lsmod | grep sep4_
> lsmod | grep pax
If the drivers are already loaded, make sure you are a member of the vtune user group. You can check
the /etc/group file or contact your system administrator to find out if you are a member of this group.
See Also
Sampling Drivers
Solution
Make sure you use the same signing key that was produced at the time and on the system where your kernel
was built for your target.
See Also
Android* System Setup
Cause
VTune Profiler may fail to collect data for Hotspots and Threading analysis types on the Ubuntu* operating
system if the scope of ptrace() system call application is limited.
Solution
Set the value of the kernel.yama.ptrace_scope sysctl option to 0 with this command:
sysctl -w kernel.yama.ptrace_scope=0
To make this change permanent, set the kernel.yama.ptrace_scope value to 0 in the /etc/sysctl.d/10-ptrace.conf file using root permissions and reboot the machine.
Cause
When setting up the SIGPROF signal handler, the VTune Profiler attempts to configure the signal to use an alternative stack via the sigaltstack() API, so that its signal handler does not depend on the stack size of the profiled application. If the application uses an alternative signal stack itself, the Intel® VTune™ Profiler requires that the alternative stack size is at least 64 KB. This may not be the case if the application uses the SIGSTKSZ constant for the alternative stack size (which is 8192 bytes). In this case, the data collection may terminate with an error message.
Solution
Configure the VTune Profiler not to set up the alternative stack and to use the stack provided by the application. To do this, pass the following command line option to the tool:
vtune -run-pass-thru=--no-altstack
Or, set up the environment variable AMPLXE_RUNTOOL_OPTIONS=--no-altstack.
See Also
Pane: Call Stack
View Stacks
You may check that the vdso module is in a dynamic dependency list:
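For example, one possible check (a sketch; ./myapp is a placeholder for your application binary) is to look for the vdso entry in the output of ldd:
$ ldd ./myapp | grep vdso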
See Also
Debug Information for Linux* Application Binaries
Cause
If your .NET application performs security checks based on a known public key (for example, checks whether
its assemblies are strong-name signed), it may either crash when launched by the VTune Profiler or provide
unpredicted analysis results.
Solution
This is a third-party technology limitation. To workaround this issue, you are recommended to disable the
security check for any of the user-mode sampling and tracing analysis types.
See Also
.NET* Code Analysis
Solution
For the best experience with context help, use the Google Chrome* browser.
You can also access these VTune Profiler documents directly
• Get Started Guide
• Installation Guide
• VTune Profiler User Guide
• Tutorials
• VTune Profiler Performance Analysis Cookbook
• Intel Processor Event Reference
Download offline versions of the VTune documentation from this repository: https://
d1hdbi2t0py8f.cloudfront.net/index.html?prefix=vtune-docs/.
Get Help
Solution
Try one of the following:
• Extend the duration of the analysis run.
• Windows OS only: Enable accurate CPU time detection. To do this for the Hotspots or Threading analysis,
it is enough to run the VTune Profiler with administrative permissions. You may also enable this option
explicitly in the custom analysis configuration by checking the Collect highly accurate CPU time box.
Make sure to extend the maximum size of raw collector data.
NOTE
Accurate CPU time collection produces a significant amount of temporary data depending on the
system configuration and the profiled target. VTune Profiler may generate up to 5 MB of temporary data per minute per logical CPU.
See Also
Warnings about Accurate CPU Time Collection
Custom Analysis Options
Solution
Select the Allow multiple runs option to disable event multiplexing and run a separate data collection for
each event group. This mechanism provides more precise data on collected events.
See Also
Sample After Value
Cause
VTune Profiler did not unwind the stack to reduce data collection overhead, but resolved the stack
heuristically.
[Guessed stack frame] is considered to be a system function. If the Call Stack Mode filter bar option is
set to User/system functions, the VTune Profiler displays [Guessed stack frame(s)].
Solution
To avoid displaying [Guessed stack frame(s)], set the Call Stack Mode filter bar option to Only user
functions.
See Also
Manage Data Views
There are also some processes that can interfere with the VTune Profiler collection and finalization, such as
virus scanners and synchronization/back-up utilities. Virus scanners can cause problems in the process the
VTune Profiler uses for software-based analysis types, such as Threading. Some synchronization utilities can
also cause finalization to fail if they try to back up a file while the VTune Profiler is processing it.
Crashes during the collection are rare but may happen in some situations, for example, if the VTune Profiler
tries to instrument or attach to a privileged process or service that is not accessible to it.
Solution
To workaround a problem with GUI hangs during finalization, consider the following:
• If symbols for system libraries are necessary for your analysis, use a local cache instead of a symbol
server, such as the location defined for Visual Studio.
• Remove large pdb files from the search directories provided to the VTune Profiler if they are not the focus
of your analysis.
• Exclude the pin.exe process from your virus scanner, or disable the virus scanner while running the
VTune Profiler collection.
• Pause synchronization and/or back-up utilities until the finalization is complete.
To prevent a possible crash for the VTune Profiler accessing processes with elevated privileges, run the VTune
Profiler as administrator. You can also configure the VTune Profiler to profile specific modules in the
Advanced section of the WHAT pane.
See Also
Finalization
Cause
The values in the data columns are rounded. For items that are sums of several other items, such as a
function with several stacks, the rounded sums may differ slightly from the sum of rounded summands.
For example, two rows whose values display as 0.123 after rounding may have a sum that displays as 0.247: the underlying unrounded values add up exactly, but the displayed rounded values do not (0.123 + 0.123 != 0.247).
See Also
Manage Data Views
NOTE
The variables should contain the full path to the library without quotes.
See Also
Analysis Target
Cause
Intel® VTune™ Profiler may not collect the detailed GPU utilization data in the following cases:
• GPU analysis is run without root privileges.
• Intel Graphics driver is not signed properly.
• Linux kernel is configured with the CONFIG_FTRACE option disabled.
Solution
Depending on the root cause, which is typically identified by the VTune Profiler and described in a warning
message, consider one of the following workarounds:
• Make sure to properly set up your system for GPU analysis.
• Since detailed GPU utilization analysis relies on the Ftrace* technology (i915 Ftrace events collection),
your Linux kernel should be properly configured.
• If you update the kernel rarely, configure and rebuild only module i915.
• If you update the kernel often, build the special kernel for GPU analysis.
If your system does not support i915 Ftrace event collection, all the GPU Utilization statistics will be
calculated based on the hardware events and attributed to the Render and GPGPU engine.
See Also
Rebuild and Install the Kernel for GPU Analysis
Cause
You are using the Function Stack grouping for the recompiled binary. The Function Stack grouping uses
function start addresses and is based on function instances.
Solution
Switch to the Source Function Stack grouping level to ignore start addresses and display the data by
source file objects.
See Also
Compare Results
Cause
VTune Profiler did not unwind the stack to reduce data collection overhead, and failed to resolve the stack
heuristically.
Solution
You may collect deeper stacks by creating a custom event-based sampling analysis and increasing the Stack
size option value in bytes (-stack-size option in CLI), though beware that this also increases the collection
overhead.
See Also
Custom Analysis Options
Solution
Decrease the optimization level of your project and rebuild the target. Then profile with the Intel® VTune™
Profiler.
See Also
Compiler Switches for Performance Analysis on Linux* Targets
Cause
There are several stacks going to the same function, but to different code lines (call sites).
The call tree in the Bottom-up pane aggregates these stacks in one line but the Call Stack pane shows
each as a separate stack. For more details, see the Call Stacks in the Bottom-up Pane and Call Stack Pane
topic.
See Also
Pane: Call Stack
Window: Bottom-up
Cause
If there is a system function that has no user function calling it, the system function appears and its time is
shown in the analysis result windows.
See Also
Call Stack Mode
Viewing Stacks
Cause
• If your project directory (and consequently, the result files) are located on an NFS-mounted directory and
not on a local disk, this significantly impacts performance of the tool in several areas: writing of the
results is slower, updating project information is slower, and when loading the results for display you may
see delays of several minutes.
• If application binaries are on an NFS-mounted drive but not on a local drive, the VTune Profiler takes
longer to parse symbol information and present the results.
Solution
Make sure your project directory (and consequently, the result files) and application binaries are located on a
local disk and not an NFS-mounted directory. By default, the projects are stored in $HOME/intel/vtune/
projects. If your home directory is on an NFS-mounted drive to facilitate access from multiple systems, you
should ensure that you set the project directory to a local directory at project creation.
See Also
Set Up Project
Cause
The GUI response may be slow if you use an X-server (for example, Xming*) with SSH on Windows to run
the VTune Profiler GUI on a connected Linux machine and the X-server is slow.
Solution
Option 1: Enable Traffic Compression
Compression may help if you are forwarding X sessions on a dial-up or slow network. Turn on the
compression with ssh -C or specify Compression yes in your configuration file.
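For example (a sketch; user and linux-host are placeholders for your account and the remote Linux machine):
$ ssh -C -X user@linux-host
Or, in your ~/.ssh/config file:
Host linux-host
    Compression yes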
NOTE
You can explore all available options with man ssh_config.
Change your configuration file with the Cipher option depending on whether you are connecting with SSH1
or SSH2:
• for SSH1, use Cipher blowfish
• for SSH2, use Ciphers blowfish-cbc,aes128-cbc,3des-cbc,cast128-cbc,arcfour,aes192-
cbc,aes256-cbc
You may also follow recommendations provided in the documentation to an X-server you are using.
See Also
Configure SSH Access for Remote Collection
This may happen when collecting call stacks with hardware event-based sampling (EBS).
Cause
In the above example, the application called __itt_pause() at about the 22 sec mark. But the other,
smaller pauses were inserted by the VTune Profiler, which temporarily pauses profiling when data generation
rate exceeds data spill rate and it is about to lose data. The data is flushed and then the collection resumes.
In the paused regions, your application is not executing: the VTune Profiler lets the application exhaust its
current quanta and then prevents it from being scheduled on the CPU until all the data has been saved to a
file.
Solution
You can ignore this injected paused time. For example, in the Summary information below, you can see that
Paused Time is part of the Elapsed Time, but is not included in CPU Time.
See Also
Pausing Data Collection
Cause
The kernel configuration prevents the VTune Profiler from collecting the required data: it cannot identify the
PID/TID/module or process name for the timer.
Solution
You may set CONFIG_TIMER_STAT=Y in the boot configuration file and recompile the kernel.
See Also
Interpreting Energy Analysis Data
Solution
Run the following command to enable the loopback interface: ifconfig lo up.
Problem:
• You run Hotspots and/or Threading analysis and your application uses a system API intensively.
• You run Threading analysis and your application uses a synchronization API causing waits that slow down the application.
Cause: VTune Profiler cannot unwind the stack correctly since stacks do not reach user code and stay inside the system modules. Often such stacks may be limited to call sites from system modules. Since VTune Profiler tries to attach incomplete stacks to previous full stacks via [Unknown frame(s)], you may see [Unknown frame(s)] hotspots when attributing system layers to user code via the Call stack mode option on the Filter bar.
NOTE
Windows* only: Missing PDB files may lead to the incorrect stack information only for 32-bit
applications. For 64-bit applications, stack unwinding information is encoded inside the application.
Solution
1. On Windows, make sure the search directories, specified in the Binary/Symbol Search dialog box,
include paths to PDB files for your application modules. For more details, see the Search Directories
topic.
2. On Windows, specify paths to the Microsoft* symbol server in Tools > Options > Debugging >
Symbols page. On Linux, make sure to install the debug info packages available for your system
version. For more details, see the Using Debug Information topic.
3. Re-finalize the result.
On Windows, the VTune Profiler will use the symbol files for system modules from the specified cache
directory and provide a more complete call stack.
See Also
Search Order
Cause
Running the X11* version of XQuartz* on a macOS system caused the text in the VTune Profiler graphical
interface to appear garbled and unreadable. The problem is related to the XQuartz X11 server performing
font anti-aliasing, even in 256 color mode.
Solution
Reset the XQuartz preference to "millions" of colors and restart XQuartz.
See Also
macOS* Support
Cause
In general, VTune Profiler is compatible with Windows OS versions supported by Microsoft, but it may lag one update behind the latest major version. Depending on the changes in the OS update, this may cause
incompatibility with the VTune Profiler drivers, particularly the sampling driver for hardware event-based
collections. VTune Profiler installer detects an unsupported OS and fails to install incompatible drivers. While
this can prevent hardware event-based sampling and stack collection, other analysis types using user-mode
sampling, such as Hotspots and Threading, can still be run. If the VTune Profiler is already installed when
your Windows system is updated to an unsupported version, the data collector may cause a crash or BSOD
while accessing the required drivers (sampling, graphics, or third-party drivers).
Solution
After installing the latest major Windows update, uninstall and reinstall the latest version of the VTune
Profiler.
Make sure all drivers are up to date.
See Also
Control Data Collection
Reference 11
Explore the following reference information for Intel® VTune™ Profiler:
• Graphical User Interface Reference
• CPU Metrics Reference
• GPU Metrics Reference
• OpenCL™ Kernel Analysis Metrics Reference
• Energy Analysis Metrics Reference
• Intel Processor Events Reference
The following graphical user interface reference topics are typically accessed from the product via the Learn More link, the Context Help button, or the F1 button:
• Context Menu: Grid
• Context Menu: Call Stack Pane
• Context Menu: Project Navigator
• Context Menu: Source/Assembly Window
• Dialog Box: Binary/Symbol Search
• Dialog Box: Source Search
• Hot Keys
• Menu: Customize Grouping
• Menu: Intel VTune Profiler
• Pane: Call Stack
• Pane: Options - General
• Pane: Options - Result Location
• Pane: Options - Source/Assembly
• Pane: Project Navigator
• Pane: Timeline
• Toolbar: Command
• Toolbar: Filter
• Toolbar: Source/Assembly
• Toolbar: Intel VTune Profiler
• Window: Bandwidth - Platform Power Analysis
• Window: Bottom-up
• Window: Caller/Callee
• Window: Cannot Find file type File
• Window: Collection Log
• Window: Compare Results
• Window: Configure Analysis
• Window: Core Wake-ups - Platform Power Analysis
• Window: Correlate Metrics - Platform Power Analysis
• Window: CPU C/P States - Platform Power Analysis
• Window: Debug
• Window: Event Count
• Window: Flame Graph
• Window: Graphics - GPU Hotspots
• Window: Graphics C/P States - Platform Power Analysis
• Window: NC Device States - Platform Power Analysis
• Window: Platform
• Window: Platform Power Analysis
• Window: Sample Count
View Source Open the Source/Assembly window of the selected program unit.
Change Focus Function Use a function selected in the Callers or Callees pane as a focus function
and display its parent and child functions.
What's This Column? Open a help topic describing the selected metric column.
Show Data As Specify the data format for the collected data (for example, time, percent,
bar, counts, and others).
This option is available for columns displaying numeric data.
Select All Select all items in the grid. The Selected data row at the bottom of the
grid is updated to sum up all selected data per metric. Selecting data in
one of the panes, Bottom-up or Top-down Tree, automatically updates
the other pane and Call Stack pane.
Expand Selected Rows Expand all child entries for the selected row(s).
Find Open a search bar and search for a string in the grid.
Export to CSV... Export the content of the active pane to CSV format.
Copy Rows to Clipboard Copy the content of the selected rows or a cell into the clipboard buffer.
Copy Cells to Clipboard
Filter In by Selection Filter in the grid and Timeline pane based on the currently selected rows.
Selecting this menu item updates the filter bar based on the current
selection. All rows except for the selected ones will be hidden. To show
rows again, use the Clear all filters button on the Filter toolbar.
If you applied filters available on the Filter bar to the data already filtered
with the Filter In/Out by Selection context menu options, all filters are
combined and applied simultaneously.
Filter Out by Selection Filter out the grid and Timeline pane based on the currently selected
rows. Selecting this menu item updates the filter bar based on the current
selection. All selected rows will be hidden. To show rows again, use the Clear all filters button on the Filter toolbar.
Show Grouping Area Show/hide the Grouping drop-down menu at the top of the Bottom-up
pane.
See Also
Window: Bottom-up
Window: Caller/Callee
Pane: Timeline
Toolbar: Filter
View Source hyperlink Open the Source/Assembly window for the program unit in the selected
stack.
Show Modules toggle Display the module names of the program units selected in the Call Stack
pane.
Show Source File and Line toggle Display the source file names of the program units selected in the Call Stack pane and the line number where the call was made.
Stack Selector Switch between available stacks using the left/right arrows.
Copy to Clipboard button Copy the data into the clipboard buffer to paste it to a different location.
See Also
Viewing Source
Pane: Call Stack
New Project... Open the Create a Project dialog box to browse to or create a directory
in which the Intel® VTune™ Profiler will create a project
(config.amplxeproj).
Open Project from New Location Open the Select Project dialog box to browse to a directory containing VTune Profiler projects.
Copy Path to Clipboard Copy the path to the currently opened project to the system clipboard.
Close Project Close the current project and any opened results.
Configure Analysis... Open the Configure Analysis window to modify project properties
including a target system, a target type, and an analysis type.
Close All Results Close all opened results for this project.
Delete Project Immediately delete the selected project and associated results from the
Project Navigator and file system.
Rename Project Rename the selected project in the Project Navigator immediately and in
the file system after you close the project or exit the VTune Profiler.
Copy Project Path to Clipboard Copy the path to the selected project to the system clipboard.
Re-resolve and Open Finalize the selected result again. You may use this option after changing
the search directories settings to enable updating the symbol information.
This option is available if the result is NOT open in the grid.
Compare Open the Compare Results window and select a result to compare the
current result with.
Delete Result Delete the selected result from the Project Navigator and file system.
Rename Result Rename the selected result in the Project Navigator immediately and in
the file system after you close the result or project, or exit the VTune
Profiler.
NOTE
The corresponding result directory in the file system is not renamed.
Copy Result Path to Clipboard Copy the path to the selected result to the system clipboard.
See Also
Pane: Project Navigator
Set Up Project
Analyze Performance
Edit Source Launch the source file editor. This option is only available for the Source
pane.
Instruction Reference Open the Reference help system for particular assembly instruction. This
option is only available for the Assembly pane.
What's This Column? Open a help topic for the selected performance metric column.
Show Data As Specify the format to display the collected data. You can view the data as:
• Time
• Percent
• Bar
• Time and Bar
• Percent and Bar
This option is only available for columns displaying numeric data.
Define the current metric column in the Source and Assembly views.
This option is only available for columns displaying numeric data.
Export to CSV Export the content of the active pane to CSV format.
Copy Rows to Clipboard Copy the content of the selected rows into the clipboard buffer.
Copy Cell to Clipboard Copy the content of the selected cell into the clipboard buffer.
See Also
Source Code Analysis
1. On the Intel® VTune™ Profiler toolbar, click the Configure Analysis button.
The result tab opens the Configure Analysis window.
2. Specify your analysis system on the WHERE pane and analysis target on the WHAT pane.
3. Click the Search Sources/Binaries button on the command toolbar at the bottom.
4. In the dialog box, select Binaries/Symbols from the left pane.
To manage the search directories list, hover over a respective line to see the action buttons.
<Add a new search location> field Add a new local search directory or a symbol server path to the list by clicking the field and typing the path and name of the directory in the activated text box.
If running an analysis from the standalone VTune Profiler GUI on Windows* OS,
make sure to configure the Microsoft* symbol server by adding the following line
to the list of search directories:
srv*C:\local_symbols_cache_location*http://msdl.microsoft.com/download/symbols
where local_symbols_cache_location is the location of local symbols. VTune
Profiler will download debug symbols for system libraries to this location and use
them to resolve collected data and provide accurate performance data for system
modules.
NOTE
The search is non-recursive. Make sure to specify correct paths to the binary/symbol
files.
See Also
Search Directories
1. On the Intel® VTune™ Profiler toolbar, click the Configure Analysis button.
The result tab opens the Configure Analysis window.
2. Specify your analysis system on the WHERE pane and analysis target on the WHAT pane.
3. Click the Search Sources/Binaries button on the command toolbar at the bottom.
4. In the dialog box, select Sources from the left pane.
To manage the search directories list, hover over a respective line to see the action buttons.
<Add a new search location> field: Add a new local search directory to the list by clicking the field and typing the path and name of the directory in the activated text box.
NOTE
The search is non-recursive. Make sure to specify correct paths to the source files.
Down arrow button: Move the selected directory down the search priority list.
See Also
Search Directories
Hot Keys
Use hot keys supported by the Intel® VTune™ Profiler to quickly perform various tasks:
Use This To Do This
Alt + 1 Launch the VTune Profiler and start the analysis of the selected type, or resume the
data collection after it has been paused.
Alt + 9 Open the Configure Analysis window to choose and run a new analysis.
Ctrl + O Open the Select Result dialog box to select and open an existing analysis result.
NOTE
You may program hot keys to start/stop a particular analysis. For more details, see http://software.intel.com/en-us/articles/using-hot-keys-in-vtune-amplifier-xe/.
You may organize the collected data to explore it from a different perspective. To do this, click the Customize Grouping button in the grid view and compose the grouping you need.
List of available grouping levels: Select grouping levels required for your custom grouping. This list provides all levels supported by the Intel® VTune™ Profiler. Make sure to select grouping levels applicable to your analysis type.
Custom grouping field: View the custom grouping you created. The grouping shows up in the Grouping menu in the order presented in this field. If the grouping uses levels not applicable to the current analysis, no data is shown in the grid.
Left and Right arrows: Use the left and right arrows to add/remove grouping levels in the custom grouping. Use the double right arrows to remove all levels from the custom grouping.
Up and Down arrows: Modify the order of grouping levels selected for the custom grouping.
The grouping you create is added to the Grouping menu for the current session and automatically removed
when you close the result.
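If you prefer the command line, a similar regrouping can be requested through the group-by option of the vtune report command. A minimal sketch (the result directory and grouping levels are placeholders):
vtune -report hotspots -result-dir r000hs -group-by process,module,function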
See Also
Grouping and Filtering Data
In the Visual Studio IDE sub-menu (File > Intel VTune Profiler), these options are available:
Open VTune Profiler: Open VTune Profiler within the Microsoft Visual Studio IDE.
Configure Analysis with VTune Profiler: Configure your VTune Profiler project and profile your target with VTune Profiler.
To access the VTune Profiler menu in the standalone GUI, click the Menu button in the upper left corner. The following commands are available:
Welcome: Open the Welcome page that provides direct access to the most recent projects and results. You can also use this page to open or create a VTune Profiler project or access the latest technical articles on the product functionality.
Help Tour: Launch an interactive tour of the product that uses a sample pre-collected result to demo basic product functionality.
New > Project... (CTRL+SHIFT+N): Create a new VTune Profiler project that introduces your analysis target.
New > Compare Results... (CTRL+ALT+O): Open the Compare Results dialog box and specify analysis results to compare. You can compare only the results of the same analysis type.
New > Analysis... (CTRL+N): Open the Configure Analysis window to choose, configure, and run an analysis.
New > <analysis type> Analysis: Run the specified analysis type without opening the Configure Analysis window. For your convenience, this list of analysis types includes the most recent configurations you ran.
Open > Project... (CTRL+SHIFT+O): Open an existing VTune Profiler project to introduce your analysis target and start analysis.
Import Result... (CTRL+ALT+N): Open the Import window to import a data file, such as *.tb6.
View > Project Navigator: Open the Project Navigator window to explore the currently selected project.
Options...: Open the Options dialog box to configure general, result name, or source/assembly options.
Help > <doc_format>: Open one of the following online documentation formats for the VTune Profiler:
• Intel VTune Profiler version User Guide
• Get Started with Intel VTune Profiler version
• VTune Profiler Developer Forum
• Cookbooks and Tutorials
• Intel Processor Event Reference
Help > Additional Resources: Access VTune Profiler documentation on the Intel Developer Zone or download it for offline use.
See Also
Toolbar: Intel VTune Profiler
Compare Results
Finalization
Stack metric drop-down menu. Select a performance metric to explore the distribution of this metric
over stacks of the selected object. For example, for the Threading Efficiency viewpoint the Wait Time
metric is preselected. For the GPU Offload viewpoint, the Execution metric is preselected.
Navigation bar. Click the next/previous arrows to view stacks for the selected program unit(s).
The stack types are classified by metrics and depend on the selected viewpoint. For example, for the
Threading Efficiency viewpoint the Wait Time stack type displays call stacks where the object
selected in the grid contributed to the application Wait time.
When multiple stacks lead to the selected program unit, the Call Stack pane shows the stack that
contributed most to the metric value, the hottest path, as the first stack. To see other stacks, click the
navigation arrows.
NOTE
• If several stacks lead to the same function from different code lines, the bottom-up tree shown in the Bottom-up grid aggregates these stacks into one line, but the Call Stack pane shows each as a separate stack.
• If a selected stack type is not applicable to a selected program unit, the VTune Profiler
automatically uses the first applicable stack type from the stack type list instead.
Contribution bar. Analyze the indicator of the contribution of the currently visible stack to the overall
metric data for the selected program unit(s). If you select a single stack in the result window, the
Contribution bar shows 100%. If more than one program unit is selected, all the related stacks are
added to the calculation.
In the example above, the function selected in the Bottom-up grid had 3 Wait Time stacks leading to it
with the total Wait time 23.718 seconds. The first stack is responsible for 97.9% (or 23.230s) of the
overall 23.718 seconds. Note that the Bottom-up grid aggregates all 3 stacks into one since all of them
go to the same function in different code lines.
Call stack for a program unit selected in the grid or in the Timeline pane. Analyze the call
sequence for the selected function according to the stack metric selected in the navigation bar. Each
row in the stack represents a function (with an RVA and a line number of the call site, if available) that
called the function in the row above it. When the Call Stack Mode on the filter toolbar is set to Only
user functions, the system functions are shown at the bottom of the stack. When set to User/
system functions, the system functions are shown in the correct location, according to the call
sequence.
Click a hyperlink or double-click a function in the stack to open the source exactly where this function
was called.
NOTE
If you see [Unknown frame(s)] identifiers in the stack, it means that the VTune Profiler could
not locate symbol files for system or your application modules. See the Resolving Unknown
Frames topic for more details.
Context menu. Manage the call stack representation in the Call Stack pane (applicable to all stacks).
Right-click and select an option. For example, you may de-select the Show in One-Line Mode option
to view functions in two lines:
NOTE
When you compare two analysis results, the Call Stack pane does not show any call stacks.
See Also
Metrics Distribution Over Call Stacks
View Stacks
In the Microsoft Visual Studio IDE, click the pull-down menu next to the Open VTune Profiler icon and select Options:
From the standalone VTune Profiler interface: Click the menu button and select Options... > Intel
VTune Profiler version > General.
The following options are available:
Use This To Do This
Application output destination options: Choose the location for the output of the analyzed application:
• Product output window: Direct the application output to the Application Output pane in the Collection Log window.
• Separate console window: Direct the application output to a separate console window (default).
• Microsoft Visual Studio* output window: View the application output in the Microsoft Visual Studio* output window. Use this option to see the output during the analysis.
Remove raw collector data after resolving the result check box: Enable/disable removing raw collector data after finalizing the result. Removing raw data makes the result file smaller but prevents future re-finalization.
Display verbose messages in the Collection Log window check box: Enable/disable detailed collection status messages in the Collection Log window. Make sure to re-open the result to apply this change.
Show all applicable viewpoints check box: Display all applicable viewpoints in the viewpoint selector for every analysis type.
Specify path to the adb executable field: Specify the path to the adb executable used to access an Android* device for analysis with the VTune Profiler.
See Also
Set Up Android* System
From the Microsoft Visual Studio* IDE: Click the pull-down menu next to the Open VTune Profiler icon and select Options:
From the standalone VTune Profiler interface: Click the menu button and select Options... > Intel VTune Profiler version > Result Location.
Use the Result Location pane to configure the following options:
Use This To Do This
Result name template text box: Change the default template defining the name of the result file and its directory (see the example below).
NOTE
Do not remove the @@@ part from the template. This is a placeholder
enabling multiple runs of the same analysis configuration.
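For example, with a template such as r@@@{at} (a commonly used default), the @@@ placeholder expands to a sequence number and {at} to the analysis type abbreviation, so consecutive Hotspots runs produce result directories such as r000hs, r001hs, and so on. The exact default template in your installation may differ.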
See Also
VTune Profiler Filenames and Locations
Tab size: text box Set the tab character display width in white spaces. The tab size should be an
integer starting from 1.
Cache source files check box: Save your source files in the cache. You can go back to the cached sources at any time in the future and explore the performance data collected per code line at that moment of time.
If you enable this option, the VTune Profiler caches your sources in the result
database when you open the Source window for the first time and provides the
following message:
When you open the Source window for this result for the second time, one of the
following behaviors is possible:
• If the source file has not been changed, the VTune Profiler opens the source
from the located source path. The message about caching the source file
shows up at the bottom. The Open Source File Editor toolbar button is
enabled.
• If the source file has been changed, the VTune Profiler opens the source from the cached file and provides a notification about this at the bottom.
NOTE
• VTune Profiler opens previously cached source files even if the Cache
source files option is disabled now.
• If you have the Cache source files option enabled and open a changed
source file that does not match the selected result, the VTune Profiler will
cache it but will not use it for this result.
See Also
Source Code Analysis
Project Navigator
The Project Navigator pane provides a hierarchical
view of your projects and results based on the
directory where the opened project resides.
To access this pane: Click the Project Navigator icon on the Intel® VTune™ Profiler toolbar in the
standalone graphical interface. To manage VTune Profiler projects/results from the Microsoft Visual Studio*
IDE, use the Solution Explorer functionality.
Use This To Do This
Project node: Double-click to open the project. Right-click the project node to access the project context menu.
NOTE
Opening a project closes the currently opened project.
Result node: Double-click to open the result. Right-click the result node to access the result context menu.
NOTE
Opening a result opens the associated project if it is not already open.
See Also
VTune Profiler Filenames and Locations
Set Up Project
Pane: Timeline
Use the Timeline pane to visualize metrics over time
at either the thread level or platform level and identify
patterns, anomalies, and trends in the data.
You can hover, zoom-in, and filter the data at interesting points in time to get more detail. Typically the
Timeline pane is located at the bottom of the window but for the views focused on the metrics distribution
over time, it may occupy the upper or central part of the window. Data presented in the Timeline pane varies
depending on the analysis type and viewpoint.
The Timeline pane typically provides the following data:
Toolbar. Navigation control to zoom in/out on areas of interest. For more details on the Timeline controls, see the Managing Timeline View topic.
Platform metrics. Depending on the analysis type, the Timeline pane may present
several areas with platform specific metrics such as GPU engine usage, computing queue
for OpenCL™ applications, bandwidth data, power consumption, and so on. The most
detailed analysis of the platform metrics is available with the Timeline pane in the
Platform window.
Application metrics per grouping level. Depending on the viewpoint, the data may be
represented by threads, modules, processes, cores, packages, and other units monitored
by the data collector during the analysis run. For most viewpoints, the Thread grouping is the default. For some viewpoints, you may change the grouping level using the
drop-down menu in the Legend area.
Note that the CPU Time metric value provided in the Thread area is applicable to a particular thread, where 100% is the maximum possible utilization for a thread. For example, for the selection above, 94.2% CPU Time utilization means that the thread was active 94.2% of the time and waiting 5.8% of the time.
Legend. Types of data presented on the timeline. Filter in/out any type of data presented in the timeline by selecting/deselecting the corresponding check boxes. The list of performance metrics presented in the view depends on the selected analysis type and viewpoint.
VTune Profiler also uses special indicators to classify the presented data on the timeline:
• Markers. Color markers indicate an area on the timeline when a particular task/
frame/event/etc. was executed. Hover over a marker to see the execution details for
the selected element. The following markers are available:
• Frame markers show frame duration. Available for applications using frames.
• User task markers provide information on a task executed at this particular moment of time. Available for applications using the Task API (see the instrumentation sketch after this list).
• CPU sample markers indicate the exact points where profiling samples were taken during hardware event-based stack sampling collection. Use the marker density to estimate the data resolution: the VTune Profiler interpolates the sampling data, so accuracy depends on the number of samples. In this case, the CPU Sample markers show more accurate information, revealing sporadic CPU utilization for the thread.
Sample markers also help you understand exactly how filtering and Spin/Overhead time calculation work. VTune Profiler filters or classifies samples as a whole, so when you filter by time it is important to know whether a sample point falls into the selected time interval or not. No data interpolation is done for sampling data when filtering or classifying sample metrics.
• VSync markers for vertical synchronization. If your application uses vertical
synchronization, you can select the VSync timeline option, estimate the correlation
between VSync events and application frames, identify frames missing VSync
events and explore possible reasons.
• Sampling point markers indicate the points at which a data sample was read during energy analysis. Hover over a marker to see the value(s) read at that time.
• Wake-up object markers for energy analysis that show processor wake-ups on
the timeline. Hover over a yellow marker to see the time when the selected wake-
up happened and the name of the wake-up object.
• Slow task markers show the duration of tasks (I/O Wait, Ftrace*, Atrace*, and so on) that are categorized as slow (according to the thresholds set up in the Summary window).
• I/O APIs markers
• Context switches. The time threads spend on context switches. Hover over a context switch area to see the details on its duration, reason, and affected CPU. If you
choose the Context Switch Time option in the Call Stack pane and select a context
switch in the Timeline pane, the Call Stack pane shows a call sequence at which a
preceding thread execution quantum was interrupted.
• Transitions. The execution flow between threads where one thread signals to another
thread waiting to receive that signal. For example, one thread attempts to acquire a
lock held by another thread, which then releases it. The release acts like a signal to
the waiting thread. Hover over a transition for more details. Double-click a transition
to open the source code.
• Memory transfers. OpenCL routines responsible for transferring data from the host
system to a GPU are marked with cross-diagonal hatching on a computing queue:
• Scaling indicators. For GPU metrics and bandwidth graphs, the VTune Profiler provides maximum Y-axis values used to scale the graphs. The color of such a value corresponds to the color of the relevant metric in the legend. For example, for the GPU L3 Cache Misses and Memory Access metrics, the maximum Y-axis value for the selected scale is 20.153 GB/sec for the GPU Memory Read Bandwidth and the GPU Memory Write Bandwidth, and 521849224.729 Misses/sec for GPU L3 Misses.
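As a reference for the user task markers mentioned in the list above, the following is a minimal instrumentation sketch in C using the ITT Task API (the domain and task names are arbitrary examples; link the application against the ittnotify library shipped with VTune Profiler):
#include <ittnotify.h>

static __itt_domain* domain;        /* profiling domain shown in the VTune Profiler results */
static __itt_string_handle* task;   /* task name shown as a user task marker on the timeline */

void do_work(void)
{
    __itt_task_begin(domain, __itt_null, __itt_null, task);  /* opens the user task marker */
    /* ... workload to be marked on the timeline ... */
    __itt_task_end(domain);                                   /* closes the user task marker */
}

int main(void)
{
    domain = __itt_domain_create("MyDomain");
    task = __itt_string_handle_create("MyTask");
    do_work();
    return 0;
}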
Tooltips. Hover over a chart element to get statistics on this metric/program unit for the
selected moment of time.
For the GPU analysis of applications using OpenCL software technology, the Timeline pane in the Graphics
window provides the following tabs:
• Platform tab that focuses on a per-thread and per-process distribution of the CPU and GPU hardware
metrics collected during the analysis run.
• Architecture Diagram tab that is provided for OpenCL application analysis collected with the Analyze Processor Graphics hardware events option on systems with Intel® HD Graphics and Intel® Iris® Graphics. This tab helps you better understand the distribution of the GPU hardware metrics per architecture block for the period the selected OpenCL kernel was running.
NOTE
Collecting energy analysis data with Intel® SoC Watch is available for target Android*, Windows*, or
Linux* devices. Import and viewing of the Intel SoC Watch results is supported with any version of the
VTune Profiler.
See Also
Window: Bottom-up
Click the Configure Analysis button on the product toolbar (standalone GUI or Visual Studio IDE).
• Windows* only: From the Microsoft Visual Studio* Tools > Intel VTune Profiler <version> menu,
select the Configure Analysis option.
• From the standalone interface menu, select New > Analysis....
The VTune Profiler result tab opens providing the command bar on the right. The command bar is
dynamically changing depending on the analysis phase. The following commands are available:
Start/Resume: Run the analysis, or resume the analysis after a pause. To enable this button:
• Select a system for analysis on the WHERE pane.
• Specify an analysis target on the WHAT pane. If you work in Visual Studio, the project target is automatically associated with the current project.
• Select an analysis type on the HOW pane.
Start Paused: Launch the application but run the analysis after some delay. To resume the analysis, click the Resume button.
Pause: Pause the data collection at any time while the application is running. To resume the data collection, click the Resume button.
Stop: Stop the data collection. This button is only enabled during collection.
Cancel: Cancel the data collection. This button is only enabled during collection.
Mark Timeline: Mark an important moment in the application execution. These marks appear in the Timeline pane. This button is only enabled during collection.
Search Sources/Binaries: Open the search dialog box with the Binary/Symbol Search tab to specify search directories for binary and symbol files in your project and the Source Search tab to specify search directories for source files in your project.
Re-resolve: Finalize the result again. This button shows up on the command bar when you try to run the target after changing the search directories settings.
Import from CSV: Import external performance data into a VTune Profiler result as a CSV file. You may collect the external performance data with a custom collector outside the VTune Profiler or with your target application used for the VTune Profiler analysis (see the command-line sketch after this list).
Command Line: Generate a command line version of the selected configuration and save it to the buffer for running from a terminal window. You can use this approach to configure and run your remote application analysis.
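For reference, the generated command line and the CSV import typically look like the following sketches; the analysis type, result directory, file name, and application path are placeholders, and the exact import workflow for your data may differ:
vtune -collect hotspots -result-dir r001hs -- /path/to/myApp
vtune -import my_custom_data.csv -result-dir r001hs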
See Also
Pause Data Collection
Finalization
Toolbar: Filter
Use the Filter toolbar to filter the data displayed in the grid or Timeline pane. Filtering settings applied to
the currently opened result are saved for the whole project and automatically applied to the subsequent
results in this project.
Metric filter: Mouse over the Filter icon to enable the metric drop-down menu and select a filtering metric.
By default, you see 100% of all metric data collected in the result. Metric values vary with a
viewpoint and analysis type.
For example, for the Hotspots viewpoint available for the Hotspots analysis result (hardware
event-based sampling mode) there are CPU Time and Instructions Retired event metrics
available, where the CPU Time is selected by default. Open any filtering drop-down menu to
see the percentage of the CPU Time each module/process/thread introduces into the overall
CPU Time for the result:
If you select a program unit in the filtering drop-down menu, your grid and Timeline view will
be filtered out to display data for this particular program unit. For example, if you select the
analyze_locks process introducing 53.4% of the CPU Time, the result data will display
statistics for this process only and the Filter bar provides an indicator that only 53.4% of the
CPU Time data is currently displayed:
Module filter: Select a module to filter the collected data by its contribution (see the command-line sketch after this table). All data related to other modules is hidden. By default, Any Module is selected. This option does not filter any data.
Thread filter: Select a thread to filter the collected data by its contribution. All data related to other threads is hidden. By default, Any Thread is selected. This option does not filter any data.
Process filter: Select a process to filter the collected data by its contribution. All data related to other processes is hidden. By default, Any Process is selected. This option does not filter any data.
Thread Efficiency filter: Select a thread efficiency level to filter the collected data by its contribution. All data related to other efficiency levels is hidden. By default, Any Thread Efficiency is selected. This option does not filter any data. This filter is applied to the Hotspots by Thread Concurrency and Threading Efficiency viewpoints for user-mode sampling and tracing analysis results.
Sleep States filter: Select a sleep state (C0 - Cn) to filter the collected data by its contribution. The deeper the sleep state of the CPU, the greater the power savings. This filter is available for Energy analysis results only.
Wake-up Reason filter: Filter data by the types of objects that force the processor to wake up. Possible wake-up reasons are timer, interrupt, IPI, and so on. This filter is available for Energy analysis results only.
Timer Type filter: Filter data by the type of timers that force the processor to wake up. Choose between User and Kernel Timers. This filter is available for Energy analysis results only.
Clear Filter icon: Remove all filters currently applied on the Filter bar.
Loop Mode option: Select a type of hierarchy to display loop data in the grid. The following types are available:
• Loops only: Display loops as regular nodes in the tree. The loop name consists of:
• the start address of the loop
• the number of the code line where this loop is created
• the name of the function where this loop is created
• Loops and functions: Display both loops and functions as separate nodes.
• Functions only (default): Display data by function with no loop information.
NOTE
If you applied filters available on the Filter bar to the data already filtered with the Filter In/Out by
Selection context menu options, all filters are combined and applied simultaneously.
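The same filtering is also available from the command line through the filter option of the vtune report command. A minimal sketch (the result directory and module name are placeholders):
vtune -report hotspots -result-dir r000hs -filter module=analyze_locks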
See Also
Group and Filter Data
filter
vtune option
call-stack-mode
vtune option
inline-mode
vtune option
loop-mode
vtune option
Toolbar: Source/Assembly
Use the Source/Assembly toolbar to navigate between the most performance-critical code sections
(hotspots). In the Source pane, you can navigate between source code lines, in the Assembly pane you can
navigate between assembly instructions.
Source button Toggle the Source pane on/off. This button is enabled only when both source and
assembly code is available.
Assembly Toggle the Assembly pane on/off. This button is enabled only when both source and
button assembly code is available.
Horizontal Mode button
Go to Biggest Function Hotspot button: Go to the code line that has the biggest hotspot navigation metric value in the selected function.
Go to Bigger Function Hotspot button: Go to the previous (by the hotspot navigation metric value) hot line in the selected function.
Go to Smaller Function Hotspot button: Go to the next (by the hotspot navigation metric value) hot line in the selected function.
Go to Smallest Function Hotspot button: Go to the code line that has the smallest hotspot navigation metric value in the selected function.
Open Source File Editor button: Edit the source code in the default code editor. This option is available for the Source pane only.
NOTE
To select a hotspot navigation metric, right-click the required metric column in the Source view and
select Use for Hotspot Navigation.
See Also
Source Code Analysis
Project Navigator (standalone client only): Open the Project Navigator pane that provides a hierarchical view of your projects and results.
Import Result (standalone client only): Open the Import window and specify result or raw data collection file(s) to import into the current project. VTune Profiler creates a result with the imported data and locates it in the current project.
Options: Set options to collect, display, and save profiling data. View privacy information about collected data.
NOTE
VTune Profiler toolbar icons look slightly different in different versions of the Microsoft Visual Studio*
IDE. The Compare Results button is not available from the toolbar in the Microsoft Visual Studio*
IDE.
VTune Profiler also provides a lightweight integration to the Eclipse* development environment, adding the
following buttons in the Eclipse GUI:
Open Intel VTune Profiler Help: Open the VTune Profiler Get Started page providing access to the product documentation resources.
When you view results, VTune Profiler provides an additional toolbar for the Bottom-up and Top-down
Tree windows:
View Stacks as a Chain/Tree: Change the stack layout for the Call Stack grouping level.
Customize Grouping: Create a custom grouping for the current viewpoint using the Custom Grouping dialog box.
See Also
Menu: Intel VTune Profiler
NOTE
Platform Power Analysis viewpoint is available as part of energy analysis. Collecting energy analysis
with Intel® SoC Watch is available for target Android*, Windows*, or Linux* devices. Import and
viewing of the Intel SoC Watch results is supported with any version of the VTune Profiler.
Bandwidth Pane
The Bandwidth pane displays the bandwidth values for the data collected. Bandwidth data is collected as byte
counts and is displayed as MB/sec. The bandwidth is given in both total values and the average bandwidth by
event type and component. You can change the unit displayed by right-clicking a data cell and selecting the
Show Data As option to select an alternate unit.
The average bandwidth displayed in this pane is typically the most important metric used to determine
bandwidth usage during collection. The other columns display the number of bytes transferred by event and
by the device or component.
There are two types of bandwidth data that can be collected: approximate bandwidth and detailed
bandwidth. Approximate bandwidth is measured across all devices with a lower level of detail. Detailed
bandwidth allows in-depth collection for the specified device and events related to that device. The type of
bandwidth collected is specified when running the Intel SoC Watch collector. For more information about the
options to use for detailed bandwidth collection, see the Intel SoC Watch User's Guide for the operating
system of your target device.
Timeline Pane
Use the Timeline pane to view bandwidth changes over time. Expand the timeline vertically to improve the
data visualization and see more bandwidth values. Consider removing the Sampling Points from the timeline
while viewing the full timeline to improve visibility of the lowest bandwidth values. You can add the sampling points again after zooming in on a section of the timeline.
Hover over the timeline to view a tooltip listing the exact bandwidth values at that time during the collection
(MB/sec). The blue sampling points indicate the time at which the sample is obtained from the hardware. The
duration between sampling points is the sampling interval that was specified at collection time.
Filters applied on a timeline in one window are applied on all other windows within the viewpoint. This is
useful if you identify an issue on one tab and want to see how the issue impacts the metrics shown on a
different tab.
See Also
Interpreting Energy Analysis Data
Viewing Energy Analysis Data
Viewpoint
Grouping Data
Managing Timeline View
Window: Bottom-up
Use the Bottom-up window to analyze program performance from the bottom level, where a child function is placed directly above its parent (bottom-up analysis).
To access this window: Click the Bottom-up tab. Depending on the analysis type, the Bottom-up window
may include the following panes:
• Bottom-up pane
• Call Stack pane
• Timeline pane
Bottom-up Pane
Data provided in the Bottom-up pane depends on the analysis, data collection type, and viewpoint you apply.
Grouping menu. Each row in the grid corresponds to a grouping level (granularity) of program units
(module, function, synchronization object, and others). For example, the data in the Hotspots
viewpoint is grouped by Function/Call Stack.
Call stack. Analyze a tree hierarchy of the call stacks that lead to the selected program unit. Click
the triangle sign to expand a row and view call trees for each program unit. Each tree is a call stack
that called the selected unit. Each tree lists, in a single row, all the program units that had only one caller, with an arrow indicating the call relationship. Program units that had more than one caller are split so that each caller appears in a separate row under that callee. If a function was called
from different code lines (call sites) in the same parent function, the Bottom-up pane aggregates
such stacks into one and sums up their CPU time. The full information on the stack is shown in the
Call Stack pane.
The time value for a row is equal to the sum of all the nested items from that row.
NOTE
• Call stack information is always available for the results of the User-Mode Sampling
collection. It is also available for the results of the hardware event-based sampling
collection, if you enabled the Collect stacks option during the analysis configuration.
Otherwise, the Call Stack column for the event-based results shows "Unknown" entries in
the call tree.
• If you see [Unknown frame(s)] identifiers for the functions, it means that the VTune
Profiler could not locate symbol files for system or your application modules. See the
Resolving Unknown Frame(s) topic for more details.
• If the VTune Profiler does not find debug information in binaries, it statically identifies
function boundaries and assigns hotspot addresses to generated pseudo names
func@address for such functions, for example:
Performance metrics. Each data column in the grid corresponds to a performance metric. By
default, all program units are sorted in descending order by metric values in the first column
providing the most performance-critical program units first. You may click a column header to sort the
table by the required metric.
The list of performance metrics varies depending on the analysis type. Mouse over a column header
(metric) to read the metric description, or right-click and select the What's This Column? option
from the context menu.
If a metric has a threshold value set up by the VTune Profiler architect and this value is exceeded, the
VTune Profiler highlights such a value in pink. You may mouse over a pink cell to read the description
of the detected issue and tuning advice for this issue.
For some analysis types, you may see grayed out metric values in the grid, which indicates that the
data collected for such a metric is unreliable. This may happen, for example, if the number of
samples collected for PMU events is too low. In this case, when you hover over such an unreliable
metric value, the VTune Profiler displays a message: The amount of collected PMU samples is too low
to reliably calculate the metric.
Depending on the analysis type and viewpoint, the Bottom-up view may represent the CPU Time by
utilization levels. Focus your tuning efforts on the program units with the largest Poor value. This
means that during the execution of these program units your application underutilized the CPU time.
The overall goal of optimization is to achieve the Ideal (green) or OK (orange) CPU utilization
state and shorten the Poor and Over CPU utilization values.
Chain layouts are typically more useful for the bottom-up view, while
tree layouts are more natural for the top-down view.
See Also
Manage Data Views
Reference
View Stacks
Window: Caller/Callee
To access this window: Click the Caller/Callee sub-tab in the result tab.
The Caller/Callee window is available in all viewpoints that provide call stack data.
Use this window to analyze parent and child functions of the selected focus function and identify the most
time-critical call paths.
Functions pane. The Functions pane displays a flat list of functions providing data per the following
metrics:
• Self time: Active processor time spent in a function.
• Total time: Active processor time spent in the function and its callees.
By default, the grid is sorted by the Total time metric. Select a function of interest in the grid (focus
function) and explore its callers and callees on the right panes.
You may select a function of interest and filter the grid to display the functions included in all
subtrees that contain the selected function at any level. To do this, select the function, right-click and
choose the Filter In by Selection context menu option. For the call tree view, switch to the Top-
down Tree window.
You can also change a focus function from the Callers or Callees panes by double-clicking a function
of interest. Alternatively, you may select a function, right-click and choose the Change Focus
Function context menu option.
VTune Profiler highlights this function in the Functions pane and updates the Callers and Callees
panes to display its parent and child functions respectively.
You can double-click a function of interest in the Functions pane to go to the source view and explore
the function performance by a source line.
Callers pane. The Callers pane shows parent functions (callers) for the function currently selected in
the Functions pane.
Callees pane. The Callees pane shows child functions (callees) for the function currently selected in
the Functions pane.
See Also
CPU Metrics Reference
Specify location of file to open text box: Specify the correct path to the file that is not found. You may choose the required file from the list. If the file you specify is invalid or partially valid, the VTune Profiler displays an error message.
Add the directory to the search list as check box: Enable adding a new directory to the search list. This option is active when you enter a directory in the Specify location of file to open text box. To add a folder to the list of search directories for the current project, select it from the drop-down list. This helps locate the module/source/symbol files for the next analysis runs.
Assembly button on the toolbar: View the disassembly code for the current selection.
OK button: Close the window. If you provided a valid location in the Specify location of file to open text box, the VTune Profiler opens the source code for the selected item. If you cannot provide a valid location for the file, click the Assembly button on the toolbar to view the disassembly code or close the Source/Assembly window.
Skip button: Stop searching for symbol files and open the Source/Assembly window. This button is only available when a symbol file is not found.
See Also
Dialog Box: Binary/Symbol Search
Search Directories
Intel® VTune™ Profiler uses two types of data collectors: user-mode sampling and tracing collector and
hardware event-based sampling collector. During data collection and finalization the VTune Profiler provides
status messages in the Collection Log window. If required, you can click the Clear Log button to
delete the log.
NOTE
You may enable detailed collection messages by using the Display verbose messages in the
Collection Log window option, available from the Options... > Intel VTune Profiler version >
General pane.
Configure Analysis window > WHAT pane available via the Configure Analysis toolbar button.
• (for VTune Profiler integrated into Visual Studio) Displays the analysis result in the Solution Explorer. The
naming scheme of the analysis result is specified in the Tools > Options... > Intel VTune
Profiler version > Result Location pane.
• Opens the result tab with the default viewpoint.
Application Output
If you configured the General pane options to display the application output in the product output window,
the VTune Profiler redirects the output to the Application Output pane.
See Also
Control Data Collection
Finalization
Troubleshooting
Click the Compare Results button on the Intel® VTune™ Profiler toolbar.
You can compare two results that have common performance metrics. VTune Profiler provides comparison
data for these common metrics only.
Result 1 / Result 2 drop-down menus: Specify the results you want to compare. Choose a result of the current project from the drop-down menu, or click the Browse button to choose a result from a different project.
Swap Results button: Click this button to change the order of the result files you want to compare. Result 1 always serves as the basis for comparison.
Compare button: Click this button to view the difference between the specified result files. This button is only active if the selected results can be compared. Otherwise, an error message is displayed.
When you click the Compare button, the VTune Profiler opens a new result tab with the performance data
for Result 1 and Result 2 side-by-side and their calculated delta.
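Comparable output can also be produced from the command line by passing two result directories to a single report. A minimal sketch (the result directory names are placeholders):
vtune -report summary -result-dir r000hs -result-dir r001hs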
See Also
Comparing Results
Bottom-up Comparison
Comparison Summary
Comparing Source Code
• enables you to specify binary and source files for successful post-processing finalization (for example, for remote analysis);
• creates a command line version of the selected configuration that can be copied and used on other systems.
NOTE
Platform Power Analysis viewpoint is available as part of energy analysis. Collecting energy analysis
with Intel® SoC Watch is available for target Android*, Windows*, or Linux* devices. Import and
viewing of the Intel SoC Watch results is supported with any version of the VTune Profiler.
NOTE
Additional details about the wake-up objects, such as Process Name or ThreadID, are available for
results collected on a Linux* or Android* system only.
Unknown: The operating system did not log a wake-up reason between exiting idle and re-entering idle, or the wake-up reason was not passed to Intel VTune Profiler.
Timeline Pane
The Timeline pane shows the time spent in the active state (C0) or the various sleep states (Cn) as well as
the total wake-up count for the package, package cores, and hardware cores. Use the Core Wake-ups pane
to filter the wake-up types shown in the timeline by right-clicking a wake-up reason and selecting Filter In
by Selection. Filters applied on a timeline in one window are applied on all other windows within the
viewpoint. This is useful if you identify an issue on one tab and want to see how the issue impacts the
metrics shown on a different tab.
Toolbar
Navigation control to zoom in/out on areas of
interest. For more details on the Timeline control, see Managing
Timeline View.
Legend
Types of data presented on the timeline. Filter in/out any type
of data presented on the timeline by selecting/deselecting
corresponding check boxes. For example, each state is a
different color and you may only be interested in the time spent
in the active state. You can also filter in and out the hardware
or package/core data.
The Wake-up Object marker shows processor wake-ups on
the timeline. Hover over a yellow marker to see the time when
the sleep state changed to an active state and the name of the
wake-up object. Zoom in on the timeline to view individual
markers if they are not visible when viewing the timeline for
the full collection time.
Package/Module/Core C-states
Graphical representation of the sleep states in each core and in the overall package. Each state is a different color, which can be filtered using the legend. Hover over the band to view the total wake-up count. Click the expand/collapse control to expand the package and view the individual modules and cores.
Hardware C-states
Graphical representation of the sleep states on the hardware.
Each state is a different color, which can be filtered using the
legend.
Wake up Band
Represents the wake-up objects that caused the core to switch
from a sleep state to an active state. Each wake-up object type
uses a unique color. By hovering over the band, you can view
all of the wake-up objects at that point in time, including
details such as wake-up object type, start time, and duration.
Find an area of interest in the timeline, such as a time when
the core was active for a period of time, and then select the
Zoom In and Filter In by Selection action to view the
reasons the core became active. You can view the wake-up
reasons and additional details for the time selected in the Core
Wake-ups pane.
See Also
Interpreting Energy Analysis Data
Viewing Energy Analysis Data
Viewpoint
Grouping Data
NOTE
Platform Power Analysis viewpoint is available as part of energy analysis. Collecting energy analysis
with Intel® SoC Watch is available for target Android*, Windows*, or Linux* devices. Import and
viewing of the Intel SoC Watch results is supported with any version of the VTune Profiler.
The timelines in the Correlate Metrics window can also be found in other sub-tabs with the Platform Power
Analysis result tab. The Correlate Metrics window is a good starting point if you are interested in identifying
areas of energy inefficiency.
Toolbar
Navigation control to zoom in/out on areas of
interest. For more details on the Timeline control, see Managing
Timeline View.
Legend
Types of data presented on the timeline. Filter in/out any type
of data presented on the timeline by selecting/deselecting
corresponding check boxes. For example, to remove the
timeline for the North Complex Devices from the view, uncheck
the North Complex Devices checkbox.
Expandable Rows
Click the expand/collapse control to expand the data and view metrics for
individual cores or devices.
Tooltips
Hover over the individual timelines to see data specific to that
metric at that point during the collection. In the example, the
C-States and Wake-up Counts for the Packages, Modules, and
Cores are shown.
Wake-up Objects
Processor wake-ups on the timeline. Hover over a yellow
marker to see the time when the sleep state change happened
and the name of the wake-up object. Zoom in on the timeline
to view individual wake-up markers.
Sampling Points
The point at which the sample was obtained from the
hardware. The duration between sampling points is the
sampling interval, which was specified during collection. Hover
over a blue marker to see the time when the sample was
obtained. Zoom in on the timeline to view individual sampling
point markers and the time they occurred.
Examples
In the first example, the CPU starts in the active state and then drops into one of the deeper sleep states.
The spikes in the CPU activity correspond to spikes in other timelines, such as the temperature and SoC
power consumption. By viewing all data on one tab, you can identify trends and associations between
metrics. To view each metric in more detail, visit the metric-specific tab.
In the second example, the CPU spends most of its time in the active state, and the similar activity levels for the Core C-States and Frequency indicate a balanced distribution of that activity.
See Also
Pane: Timeline
NOTE
Platform Power Analysis viewpoint is available as part of energy analysis. Collecting energy analysis
with Intel® SoC Watch is available for target Android*, Windows*, or Linux* devices. Import and
viewing of the Intel SoC Watch results is supported with any version of the VTune Profiler.
CPU C/P States Pane
The CPU C/P States pane shows the time spent in sleep states (C-States) and at each processor frequency
(P-State). Intel SoC Watch can collect sleep states as requested by the OS (ACPI C-States) as well as the
actual states used at the hardware level on a Windows* system. The data can be displayed per core or per
package using the Grouping drop-down. Click the expand /collapse buttons in the data columns to
expand or hide the columns of data for ACPI C-States, hardware C-State, and P-States. You can change the
unit displayed by right-clicking a data cell and selecting the Show Data As option to select an alternate unit.
For example, if you are analyzing an idle scenario, you would use this report to see if most of the collection
time was spent in the deepest possible sleep state. The time spent in CPU states is shown in the Core C-
States Time by Core Sleep State columns (CC0-CCn for cores, MC0-MCn for modules, and PC0-PCn for
packages). C0 represents the active state and Cn represents a sleep state, where the larger the number, the
deeper the sleep state. Spending more time in deeper sleep states (C1-Cn) provides greater power savings.
In the example below, both cores spent the most time in the deepest CPU sleep state C7, which corresponds
to the OS request for the deepest sleep state ACPI C3. This is the desired result when the system being
tested is idle. Expand the columns under P-State by Core Frequency to read the full values for the
processor frequencies. Time in 0GHz indicates the time the processor was not running (total time in sleep
states).
Right-click in a column and select Show Data As > Percent to view the data in that column as a percent of
the total time rather than in seconds. If the core spent a higher than expected percentage of time in an
unexpected state, you can look at the timeline to identify when the core was in that state and then switch to
the Core Wake-ups window to identify reasons for the change in state.
Timeline Pane
The Timeline pane graphically displays the C-States of each core, at each point in time. Filters applied on a
timeline in one window are applied on all other windows within the viewpoint. This is useful if you identify an
issue on one tab and want to see how the issue impacts the metrics shown on a different tab.
Toolbar
Navigation control to zoom in/out on areas of
interest. For more details on the Timeline control, see Managing
Timeline View.
Legend
Types of data presented on the timeline. Filter in/out any type
of data presented on the timeline by selecting/deselecting
corresponding check boxes. For example, each state is a
different color and you may only be interested in the time spent
in the active state. You can also filter in and out the hardware
or package/core data if you are only interested in frequency
metrics.
The Wake-up Object marker shows processor wake-ups on
the timeline. Hover over a yellow marker to see the time when
the sleep state change happened and the name of the wake-up
object.
Package/Core C-states
Graphical representation of the sleep states in each core and in
the overall package. Each state is a different color, which can
be filtered using the legend. Hover over the band to view the
total wake-up count. Click the expand/collapse control to expand the package
and view the individual cores.
Hardware C-states
Graphical representation of the sleep states on the hardware.
Each state is a different color, which can be filtered using the
legend.
Frequency (by core)
Core frequency values at each point during the collection.
Hover over the frequency P-State line to view a tooltip listing
the frequency at each time point.
Wake up Band
Represents the wake-up objects that caused the core to switch
from a sleep state to an active state. Each wake-up object type
uses a unique color. By hovering over the band, you can view
all of the wake-up objects at that point in time, including
details such as wake-up object type, start time, and duration.
See Also
Interpreting Energy Analysis Data
Viewing Energy Analysis Data
Viewpoint
Grouping Data
Energy Analysis Metrics
C-State
P-State
Window: Debug
By default, during data collection, all application output and the collector event log display in a separate console window. To change the output window, open the standalone GUI menu and go to the Options... > Intel VTune Profiler version > General pane.
By default, the Debug window appears at the bottom of the view.
To choose what output to view, select an output source from the Show output from drop-down list.
See Also
Pane: General
The list of hardware events depends on the analysis type. You may right-click an event column and select the
What's This Column context menu option to open the description of the selected event.
When you explore the hardware events statistics for a result, you may drag and drop the columns in the grid
for your convenience. VTune Profiler automatically saves your preferences and keeps the column order for
subsequent result views.
Timeline Pane
The Timeline pane is synchronized with the Event Count pane. The Thread area of the Timeline pane
shows the number of times the selected event (CPU_CLK_UNHALTED.REF_TSC in the example below)
occurred while a thread was running. You may use the Hardware Event Count drop-down menu in the
legend area to choose a different event.
The Hardware Event Type area shows the application-level performance per each event.
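The same per-event statistics can be extracted from the command line with the hardware events report. A minimal sketch (the result directory is a placeholder):
vtune -report hw-events -result-dir r000hs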
See Also
Intel Processor Events Reference
Switch Viewpoints
b. If you are running the analysis in Hardware Event-Based Sampling mode, check the Collect Stacks option (see the command-line sketch after these steps).
2. When the analysis is complete and results display, switch to the Flame Graph tab. You can also click the Flame Graph link in the Insight section of the Summary window.
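The equivalent collection can be launched from the command line. A minimal sketch, assuming the Hotspots analysis type with the hardware sampling and stack collection knobs (the application path and result directory are placeholders):
vtune -collect hotspots -knob sampling-mode=hw -knob enable-stack-collection=true -result-dir r002hs -- /path/to/myApp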
Details Area:
Hover over a flame graph element to get CPU Time as well as the percentage of
Total Time taken by the selected stack-frame.
Tooltips:
When you hover over a flame graph element, a tool tip displays these details for
the selected bar or stack frame:
• CPU Time
• Function name
• Module name
• Source file
• Function type
Legend:
The legend describes the types of functions included in the flame graph.
Navigation Bar:
Use these controls in the navigation bar to manage the flame graph display:
• Select the Flame Graph mode.
• Select the Icicle Graph mode. This inverts the flame graph display.
• Undo the last zoom action.
• Restore the flame graph to its original view.
Search:
Search for any functions in the flame graph. You can use regular expressions in
the search string. When the results display, the CPU Time and percentage of Total
Time include the times for all of the matched functions.
Analyze Flame Graph Data
Use these tips to analyze the application information contained in your flame graph:
• For hot code paths in your application, analyze the time spent on each function and its callees. The width of each function bar reflects its fraction of CPU time.
• Choose between the Flame Graph and Icicle Graph visualizations to help with your analysis.
• Filter data through the Filter bar and/or Timeline.
• Optimize your application starting with the lowest function in the flame graph and working your way up.
• Pay close attention to the hottest user and synchronization functions. In the flame graph, they appear as
the widest functions.
• Use the stack pane to dive into the source code of a function.
Related information
• An explanation of Flame Graphs
• Hotspots View
• Java Code Analysis
Grid. Analyze basic performance metrics per program unit and identify the most time-
consuming units. If your application uses the OpenCL software technology and you ran
the analysis with the Trace GPU Programming APIs option enabled, the grid is
grouped by Computing Task Purpose granularity by default.
Analyze and optimize hot kernels with the longest Total Time values first. These include
kernels characterized by long Average Time values and kernels whose Average Time
values are not long, but they are invoked more frequently than the others (see Instance
Count values). Both groups deserve attention. For more details, see GPU OpenCL™
Application Analysis.
To understand the CPU activity (which module/function was executed and its CPU time)
while the GPU execution units were idle, queued, or busy executing some code, use the
GPU Render and EU Engine State grouping level:
Thread. Explore CPU and GPU utilization by a particular thread. The Platform tab displays the thread name as the name of the module where the thread function resides. For example, if you have a myFoo function that belongs to the MyMegaFoo (Linux*) or MyMegaFoo.dll (Windows*) module, the thread name is displayed as MyMegaFoo (Linux*) or MyMegaFoo.dll (Windows*). This approach helps you easily identify the
location of the thread code producing the work displayed on the timeline.
Windows* targets only: Correlate CPU and GPU usage and estimate whether your
application is GPU bound. GPU Engines Usage bars show DMA packets on CPU threads
originating GPU tasks. The bars are colored according to the type of used GPU engine
(yellow bars in the example above correspond to the Render and GPGPU engine).
GPU hardware metrics. If you enabled the Analyze Processor Graphics hardware
events option for GPU analysis on the processors with the Intel® HD and Intel® Iris®
Graphics, the VTune Profiler displays the statistics for the selected group of metrics over
time.
For example, for the default Overview group of metrics, you may start with GPU
Execution Units: EU Array Idle metric. Idle cycles are wasted cycles. No threads are
scheduled and the EUs' precious computational resources are not being utilized. If EU
Array Idle is zero, the GPU is reasonably loaded and all EUs have threads scheduled on
them.
In most cases the optimization strategy is to minimize the EU Array Stalled metric and
maximize the EU Array Active. The exception is memory bandwidth-bound algorithms
and workloads where optimization should strive to achieve a memory bandwidth close to
the peak for the specific platform (rather than maximize EU Array Active).
Memory accesses are the most frequent reason for stalls. The importance of memory layout and carefully designed memory accesses cannot be overstated. If the EU Array Stalled metric value is non-zero and correlates with GPU L3 Misses, and the algorithm is not memory bandwidth-bound, try to optimize memory accesses and layout.
Sampler accesses are expensive and can easily cause stalls. Sampler accesses are
measured by the Sampler Is Bottleneck and Sampler Busy metrics.
NOTE
To analyze Intel Graphics hardware events on a GPU, make sure to set up your system for
GPU analysis.
Windows targets only: Switch to the Platform window to explore how the execution
path of the OpenCL device queue correlates to the DMA packets software queue.
GPU Usage metrics. GPU usage bars are colored according to the type of used GPU
engine.
In theory, if the Platform tab shows that the GPU is busy most of the time, with only small idle gaps between busy intervals, and the GPU software queue rarely drops to zero, your application is GPU bound. If the gaps between busy intervals are large and the CPU is busy during these gaps, your application is CPU bound. Such clear-cut situations are rare, however, and a detailed analysis is needed to understand all dependencies. For example, an application may be mistakenly considered GPU bound when GPU engine usage is serialized (for example, when the GPU engines responsible for video processing and for rendering are loaded in turns). In this case, the ineffective scheduling on the GPU results from the application code running on the CPU.
For further OpenCL kernel analysis, select a computing task you are interested in (for example,
AdvancedPaths) and switch to the Architecture Diagram tab. VTune Profiler displays performance data
per GPU hardware metrics for the time range when the selected kernel was executed:
Flagged values signal a performance issue. In this example, ~50% of the GPU time was spent in stalls. This
means that performance is limited by memory or sampler accesses.
See Also
GPU Application Analysis on Intel® HD Graphics and Intel® Iris® Graphics
Pane: Timeline
NOTE
Platform Power Analysis viewpoint is available as part of energy analysis. Collecting energy analysis
with Intel® SoC Watch is available for target Android*, Windows*, or Linux* devices. Import and
viewing of the Intel SoC Watch results is supported with any version of the VTune Profiler.
Shows the time spent in each state or frequency, organized by device. Click the expand/collapse
buttons in the data columns to expand the column and show data for different C-States and P-States in each
device. You can change the unit displayed by right-clicking a data cell and selecting the Show Data As
option to select an alternate unit. For example, select Show Data As > Time and Bar to view a visual
representation of the percent of collection time spent in each state.
Timeline Pane
Displays the C-states and P-states of each device at each point in time. The states are shown in a different
color as identified by the legend to the right of the timeline. The frequency graph uses data points to indicate
that the data has been read from the hardware at discrete sampling points instead of from a residency
counter. Hover over a blue marker to see the time when the sampling point occurred.
Time spent in each state is represented by a heat map. The heat map data may not be visible when viewing
the full timeline. Zoom in on a section of interest to view the heat map and details about the data points. The
heat map, represented in the example below with shades of red in the Graphics P-States timeline,
illustrates how active the device was since the previous sample. The deeper the red color, the longer it was in
the active state. The exact transitions between active and idle are not known. Hover over a point to view the
percentage of time in the active state. In the example below, the device was active for 99.2% of the time
between the two sampling points and the color is the deepest shade of red. The bars with lighter shades
indicate less time in the active state.
Use the timeline to identify times when there was a higher frequency for a longer period of time and ensure
that it matches expectations. If it does not, you can look at the CPU C/P States tab to show CPU activity at
the same time. You can also view the Bandwidth tab to see if a similar spike in activity occurs in that tab.
Filters applied on a timeline in one window are applied on all other windows within the viewpoint. This is
useful if you identify an issue on one tab and want to see how the issue impacts the metrics shown on a
different tab.
See Also
Interpreting Energy Analysis Data
Viewing Energy Analysis Data
Viewpoint
Grouping Data
Managing Timeline View
NOTE
Platform Power Analysis viewpoint is available as part of energy analysis. Collecting energy analysis
with Intel® SoC Watch is available for target Android*, Windows*, or Linux* devices. Import and
viewing of the Intel SoC Watch results is supported with any version of the VTune Profiler.
The North Complex Device States pane shows the list of devices in the North complex and displays the
sample counts for each device. Click the expand/collapse buttons in the data columns to expand the
column and show data for different D0ix states in each device. You can change the unit displayed by right-
clicking a data cell and selecting the Show Data As option to select an alternate unit. For example, you
could select Show Data As > Percent to view the percent of collection time a particular device spent in the
active state.
A device remaining in the active state (D0i0) can prevent the system from entering a deep sleep state
(S0ix). Compare device time spent in the active state with System Sleep States or Graphics C/P State.
Timeline Pane
The Timeline pane displays the D0ix states of each device at each point when the data was read. Each state
is shown in a different color. Use the legend on the right to add or remove D0ix states from the timeline.
Hover over a data point to see the percentage of time spent in each state. Zoom in or out on the timeline to
view trends in more detail. Filters applied on a timeline in one window are applied on all other windows
within the viewpoint. This is useful if you identify an issue on one tab and want to see how the issue impacts
the metrics shown on a different tab. The following example shows a zoomed-in view of the result above to
show individual data points.
See Also
Interpreting Energy Analysis Data
Viewing Energy Analysis Data
Viewpoint
Grouping Data
Managing Timeline View
Window: Platform
To access this window: Click the Platform sub-tab in the result tab.
Depending on the metrics collected during the analysis, use the Platform window to:
• Inspect CPU and GPU utilization, frame rate and memory bandwidth.
• Explore your application performance for user tasks such as Intel ITT API tasks, Ftrace*/Systrace* event
tasks, DPC++ and OpenCL™ API tasks, and so on.
• Correlate CPU and GPU activity and identify whether your application/some phases of it are GPU or CPU
bound.
• Analyze CPU/GPU interactions and software queue for GPU engines at each moment of time.
The Platform window represents a distribution of the performance data over time. For example, on Linux
the Platform window displays the following data:
Frame Rate. Identify bounds for GPU and CPU frames (Windows only), where:
• CPU Frame X (Present) is the time range between the moment frame X-1 is queued for
presentation and the moment frame X is queued for presentation.
• GPU Frame X (Flip) is the time range between the moment frame X-1 is rendered on
the screen and the moment frame X is rendered on the screen.
Hover over a frame object to view a summary including data on frame duration, frame
rate, and others:
CPU and GPU frames with the same ID are displayed in the same color.
GPU Engine. Explore overall GPU utilization per GPU engine at each moment of time. By
default, the Platform window displays GPU Utilization and software queues per GPU
engine. Hover over an object executed on the GPU (in yellow) to view a short summary on
GPU utilization, where GPU Utilization is the time when a GPU engine was executing a
workload. You can explore the top GPU Utilization band in the chart to estimate the
percentage of GPU engine utilization (yellow areas vs. white spaces) and options to submit
additional work to the hardware.
To view and analyze GPU software queues, select an object (packet) in the queue and the
VTune Profiler highlights the corresponding software queue bounds:
A full software queue prevents packet submissions and causes waits on the CPU side in the user-mode driver until there is space in the queue. To check whether such a stall degrades your performance, reduce the workload on the hardware and switch to the Graphics window to see whether there are fewer waits on the CPU in the threads that spawn packets. Another option is to load the queue with additional tasks and see whether the queue length increases.
Each packet in the Platform window has its own ID that helps track its life cycle in a
software queue. The ID does not correspond to the rendered frames. You may identify
where a packet came from by the thread name (corresponding to the name of the module
where a thread entry point resides) specified in the tooltip.
Horizontal hatching is used for data that may not be accurate due to collection issues (for example, a missing event from the Intel® Graphics Driver). This type of data is identified as
Reconstructed packets in the Legend.
Windows only:
For Windows targets, you may select the Packet Type drop-down menu option in the
Legend area to explore GPU utilization and software queues per DMA packet domain:
Windows only:
On Windows, you can explore how the execution path (marked in blue) of the OpenCL
device queue (in orange) correlates with the DMA packets software queue (in black). The
OpenCL kernel queue submits kernels to the driver, where DMA packets of different types are multiplexed into a single DMA queue. In the example above, the Render and GPGPU queue serves both graphics-originated (GHAL3D) and compute-originated (OpenCL) packets.
Thread. Explore CPU utilization by thread. The Platform window displays the thread
name as the name of the module where the thread function resides. For example, if you have a myFoo function that belongs to the MyMegaFoo module, the thread name is displayed as MyMegaFoo. This approach helps you easily identify the location of the thread code producing the work displayed on the timeline.
If your code uses the Task API to mark task regions, or you enabled any system tasks for monitoring specific events, the task objects show up on the timeline and you can hover
over such an object for details:
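For reference, a minimal sketch of marking a task region with the ITT Task API from C/C++ might look like the following (the domain and task names, MyDomain and MyTask, are illustrative only):

    #include <ittnotify.h>

    // Create the domain and task name handles once; they are reused for every task instance.
    static __itt_domain* domain = __itt_domain_create("MyDomain");
    static __itt_string_handle* task_name = __itt_string_handle_create("MyTask");

    void do_work(void)
    {
        __itt_task_begin(domain, __itt_null, __itt_null, task_name);
        // ... the workload that shows up as a task object on the timeline ...
        __itt_task_end(domain);
    }

When the collection runs with user task analysis enabled, each do_work invocation appears as a MyTask object on the timeline.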
Windows only:
Hover over a context switch area to see the details on its duration, reason, and affected
CPU. Dark-green context switches show time slices when a thread was busy with a
workload while light-green context switch objects show areas where a thread was waiting
for a synchronization object. Gray areas show inactivity periods caused by preemption
when the operating system task scheduler switched a thread off a processor to run
another, higher-priority thread.
Correlate CPU and GPU utilization and estimate whether your application is CPU or GPU
bound. GPU Engines utilization bars show DMA packets on the CPU threads that originate GPU tasks. The bars are colored according to the type of GPU engine used (yellow bars in the example below correspond to the Render and GPGPU engine). While the GPU Engine area of the Platform window shows aggregated GPU utilization for all threads and processes in the system, the GPU Engines Utilization bars in the Thread area show GPU engine utilization by a particular thread.
GPU Metrics. Correlate the data on GPU activity per GPU metrics with the CPU utilization
data. The GPU Utilization bars are colored according to the type of used GPU engine.
To analyze CPU and GPU utilization per thread, switch to the Graphics window.
NOTE
To analyze Intel HD Graphics and Intel® Iris® Graphics hardware events on a GPU, make sure to
set up your system for GPU analysis.
Core Frequency. Explore the ratio between the actual and the nominal CPU frequencies.
Values above 1.0 indicate that the CPU is operating in Turbo Boost mode.
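For example, assuming a nominal frequency of 3.0 GHz, a core sampled while running at 3.9 GHz shows a ratio of 3.9 / 3.0 = 1.3.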
NOTE
This data is available only for the hardware event-based sampling analysis results.
DRAM Bandwidth. Explore the application performance per Uncore to DRAM Bandwidth
metrics over time.
NOTE
This data is available only for the hardware event-based sampling analysis results with the
bandwidth events collection enabled.
Interrupt. Identify the intervals where system interrupts occurred. Hover over an
interrupt object to view full details in the tooltip:
NOTE
This type of data shows up for the custom data collection results if you enabled the
corresponding Ftrace events collection during the analysis type configuration.
NOTE
To monitor general GPU utilization over time on Windows OS, run the VTune Profiler as an
Administrator.
The EU Stalled/Idle metric shows the time when execution units were stalled or idle. High values are
flagged as a performance issue with a negative impact on compute-bound applications.
See Also
GPU Compute/Media Hotspots Analysis (Preview)
Task Analysis
Analyze Interrupts
Energy analysis data collected by Intel SoC Watch version 2.3 or later on an Android* or Linux* device can
be imported into Intel® VTune™ Profiler and visualized with the Platform Power Analysis viewpoint. The
Summary window is always present, but other windows within the viewpoint will vary depending on the
metrics collected with Intel SoC Watch. For example, the ddr-bw metrics are visualized on the DDR
Bandwidth window. The metrics available to you will depend on your device hardware and operating
system. Review the Intel SoC Watch User's Guide for your operating system for detailed information on each
metric.
• Sampled Residency Data: The value is gathered by sampling data over regular intervals. There is a set
range of values. The exact time of transition between values is not known, but the percentage of time
spent in each value is calculated and displayed as a heat map in the timeline pane.
For example, the Graphics C-State status is collected at regular intervals. The value transitioned in and
out of different C-States during the collection time, but the exact transition time is not tracked. Instead, a
heat map shows that more time was spent in one state than the other. Hover over the graph to see the
exact percentage of time spent in each state.
• Sampled Counter Data: The value is gathered by measuring a count since the previous sampling point.
The data is then calculated into a rate per second to show the changes over time and visualized as a line
graph in the timeline pane.
For example, the DDR Bandwidth data is displayed as a line graph with different lines for read, write, read
partials, and write partials. Sampling points show when the counts were collected.
• Traced Residency Data: The value is gathered when the state changes from one value to another. The
time spent in the previous state is known and can be displayed in the timeline pane. In some cases an
additional metric is tracked, such as the frequency values for the Core P-State Residency metric.
For example, for the Core C-State Residency metric, a processor is in a certain C-State at any given time.
C0 is the active state and Cn is a sleep state where a larger number means a deeper sleep state. When
the processor transitions from one C-State to another, an event is emitted and the transition and time
spent in the previous state is logged. The values are visualized as colored bars indicating the time in a
certain state in the timeline pane. Drag and select an area of the timeline and then select the Zoom In
on Selection option from the menu that appears to show finer granularity in the timeline pane. For more
information, see Managing Timeline View.
• Traced Event Data: The value is gathered when a new event occurs. Each event is displayed on the
timeline with an event marker showing the exact time that the event occurred. Events of the same type
are shown with the same color marker. The legend to the right of the timeline shows what color marker
corresponds to each event type collected.
Unlike other traced event data, Wakeup and Abort events are displayed as bars and triangle event points
on the timeline pane. Each event is color-coded by event type (timer, scheduled, etc.). The bar length
shows how the event corresponds with the CPU sleep state, even though the event is instantaneous. The
exact time of the wakeup or abort event is shown with the triangle.
See Also
Window: Summary - Platform Power Analysis
• Context Summary pane
The list of hardware events depends on the analysis type. You may right-click an event column and select the
What's This Column context menu option to open the description of the selected event.
When you explore the hardware events statistics for a result, you may drag and drop the columns in the grid
for your convenience. VTune Profiler automatically saves your preferences and keeps the column order for
subsequent result views.
Timeline Pane
The Timeline pane is synchronized with the Sample Count pane. The Thread area of the Timeline pane
shows the number of samples collected for the selected event (INST_RETIRED.ANY in the example below)
while a thread was running. You may use the Hardware Event Sample Count drop-down menu in the
legend area to choose a different event.
The Hardware Event Type area shows the application-level performance for each event.
See Also
Intel Processor Events Reference
Switch Viewpoints
NOTE
Platform Power Analysis viewpoint is available as part of energy analysis. Collecting energy analysis
with Intel® SoC Watch is available for target Android*, Windows*, or Linux* devices. Import and
viewing of the Intel SoC Watch results is supported with any version of the VTune Profiler.
The South Complex Device States pane shows the list of devices in the South Complex and displays
estimated sample counts for each device. The sample counts are not a precise measure of the length of time
each device spent in a state, but can be used as a guideline to determine if a device spent a greater amount
of time in a particular state than was expected.
Click the expand/collapse buttons in the data columns to expand the column and show data for
different D-States in each device. You can change the unit displayed by right-clicking a data cell and selecting
the Show Data As option to select an alternate unit. For example, you could select Show Data As >
Percent to view the percent of collection time a particular device spent in the active state.
Timeline Pane
The Timeline pane displays the D0ix states of each device, at each point in time. You can rearrange the order
of the devices in the timeline by dragging and dropping.
Toolbar
Navigation control to zoom in/out on areas of interest in the view. For more details on the Timeline control, see Managing
Timeline View.
Legend
Types of data presented on the timeline. Filter in/out any type
of data presented on the timeline by selecting/deselecting
corresponding check boxes.
South Complex Devices
Graphical representation of the time spent in a D-State. Each
state is a different color, which can be filtered using the legend.
Hover over the timeline for a device to view the total
percentage of time spent in a particular state.
Zoom in or out on the timeline to view trends in more detail.
Filters applied on a timeline in one window are applied on all
other windows within the viewpoint. This is useful if you
identify an issue on one tab and want to see how the issue
impacts the metrics shown on a different tab.
See Also
Interpreting Energy Analysis Data
Viewing Energy Analysis Data
Viewpoint
Grouping Data
Window: Summary
Use the Summary window as a starting point for your analysis in the following viewpoints:
NOTE
• Click a metric or an object name represented in the Summary window as a hyperlink to open the Bottom-up window with the grid data sorted by the selected metric or with the selected object highlighted. By default, the grid data is grouped by Thread/Page Faults, which helps you analyze the data more easily.
• Click the Copy to Clipboard button to copy the content of the selected summary section to the clipboard.
Analysis Metrics
Explore the list of CPU metrics to understand high-level statistics of an overall application execution.
For Linux* targets, Intel® VTune™ Profiler introduces the I/O Wait Time metric that helps you estimate
whether your application is I/O-bound:
The I/O Wait Time metric represents the portion of time when threads reside in the I/O wait state while there are idle cores on the system. At any moment, the number of counted threads does not exceed the number of idle cores on the system. The aggregated I/O Wait Time metric is an integral of the I/O Wait metric available in the Timeline pane of the Bottom-up view. If the I/O Wait Time is a substantial part of the application Elapsed Time, as in the example above, switch to the Platform window to take a closer look at all the metrics on the timeline and understand what caused the high I/O Wait time.
VTune Profiler analyzes metrics, compares their values with the threshold values provided by Intel architects,
and, if the threshold is exceeded, it flags the metric value as a performance issue for an application as a
whole. Mouse over the flagged value to read an issue description and tuning recommendation.
NOTE
This histogram is available if you collected results with the Analyze memory bandwidth option
enabled.
SPDK Info
Explore the SPDK Info section for overall I/O performance statistics. To see how each device performed per operation or metric, expand the corresponding block and identify a potential I/O performance imbalance among SSDs:
SPDK Throughput
Explore the SPDK Throughput histogram and table to identify how long your workload has been under-utilizing the throughput of the selected SPDK device (Low utilization level):
Top Hotspots
VTune Profiler displays the most performance-critical functions and their CPU Time in the Top Hotspots
section. Optimizing these functions typically results in improving overall application performance. Clicking a
function in the list opens the Bottom-up window with this function selected.
The grayed-out [Others] module, if provided, displays the total value for all other functions in the application that are not included in this table.
NOTE
You can control the number of objects in this list and displayed metrics via the viewpoint configuration
file.
To get more details on this type of I/O request, switch to the Timeline pane in the Bottom-up window.
Collection start time: Start time (in UTC format) of the external collection. Explore the Timeline pane to track the performance statistics provided by the custom collector over time.
Collection stop time: Stop time (in UTC format) of the external collection. Explore the Timeline pane to track the performance statistics provided by the custom collector over time.
Collector type: Type of the data collector used for the analysis. The following types are possible:
• Driver-based sampling
• Driver-less Perf*-based sampling: per-process or system-wide
• User-mode sampling and tracing
CPU Information
Logical CPU Count: Logical CPU count for the machine used for the collection.
User Name: User launching the data collection. This field is available if you enabled the per-user event-based sampling collection mode during the product installation.
GPU Information
EU Count: Number of execution units (EUs) in the Render and GPGPU engine. This data is Intel® HD Graphics and Intel® Iris® Graphics (further: Intel Graphics) specific.
Max EU Thread Count: Maximum number of threads per execution unit. This data is Intel Graphics specific.
Max Core Frequency: Maximum frequency of the Graphics processor. This data is Intel Graphics specific.
Graphics Performance Analysis: GPU metrics collection is enabled on the hardware level. This data is Intel Graphics specific.
NOTE
Some systems disable collection of extended metrics such as L3 misses, memory accesses,
sampler busyness, SLM accesses, and others in the BIOS. On some systems you can set a
BIOS option to enable this collection. The presence or absence of the option and its name are
BIOS vendor specific. Look for the Intel® Graphics Performance Analyzers option (or
similar) in your BIOS and set it to Enabled.
See Also
Input and Output Analysis
Comparison Summary
NOTE
You may click the Copy to Clipboard button to copy the content of the selected summary section
to the clipboard.
Analysis Metrics
The first section displays the summary statistics on the overall application execution per hardware-related
metrics measured in Pipeline Slots or Clockticks. Metrics are organized by execution categories in a list and
also represented as a µPipe diagram. To view a metric description, mouse over the help icon :
In the example above, mousing over the L1 Bound metric displays the metric description in the tooltip.
A flagged metric value signals a performance issue for the whole application execution. Mouse over the
flagged value to read the issue description:
You may use the performance issues identified by the VTune Profiler as a baseline for comparison of versions
before and after optimization. Your primary performance indicator is the Elapsed time value.
Grayed out metric values indicate that the data collected for this metric is unreliable. This may happen, for
example, if the number of samples collected for PMU events is too low. In this case, when you hover over
such an unreliable metric value, the VTune Profiler displays a message:
You may either ignore this data, or rerun the collection with the data collection time, sampling interval, or
workload increased.
By default, the VTune Profiler collects Microarchitecture Exploration data in the Detailed mode. In this mode,
all metric names in the Summary view are hyperlinks. Clicking such a hyperlink opens the Bottom-up
window and sorts the data in the grid by the selected metric. The lightweight Summary collection mode is
limited to the Summary view statistics.
Vertical bars: Hover over a bar to identify the amount of Elapsed time the application spent using the specified number of logical CPUs.
Target Utilization: Identify the target CPU utilization. This number is equal to the number of logical CPUs. Consider this number as your optimization goal.
Average CPU Utilization: Identify the average number of CPUs used, aggregated over the entire run. It is calculated as CPU time / Elapsed time. CPU utilization at any point in time cannot surpass the available number of logical CPUs. Even when the system is oversubscribed and there are more threads running than CPUs, the CPU utilization does not exceed the number of CPUs. Use this number as a baseline for your performance measurements. The closer this number is to the number of logical CPUs, the better, except when the CPU time goes to spinning.
Utilization Indicator bar: Analyze how the various utilization levels map to the number of simultaneously utilized logical CPUs.
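As a worked example of the Average CPU Utilization formula above: an application that accumulates 40 seconds of CPU time over 10 seconds of Elapsed time has an Average CPU Utilization of 40 / 10 = 4 logical CPUs.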
NOTE
In the CPU Utilization histogram, the VTune Profiler treats the Spin and Overhead time as Idle
CPU utilization. Different analysis types may recognize Spin and Overhead time differently
depending on availability of call stack information. This may result in a difference of CPU
Utilization graphical representation per analysis type.
NOTE
The Effective CPU Utilization Histogram is available for Microarchitecture Exploration results
collected in the Detailed mode only.
Computer Name: Name of the computer used for the collection.
Collection start time: Start time (in UTC format) of the external collection. Explore the Timeline pane to track the performance statistics provided by the custom collector over time.
Collection stop time: Stop time (in UTC format) of the external collection. Explore the Timeline pane to track the performance statistics provided by the custom collector over time.
Collector type: Type of the data collector used for the analysis. The following types are possible:
• Driver-based sampling
• Driver-less Perf*-based sampling: per-process or system-wide
• User-mode sampling and tracing
CPU Information
Logical CPU Count: Logical CPU count for the machine used for the collection.
User Name: User launching the data collection. This field is available if you enabled the per-user event-based sampling collection mode during the product installation.
GPU Information
EU Count: Number of execution units (EUs) in the Render and GPGPU engine. This data is Intel® HD Graphics and Intel® Iris® Graphics (further: Intel Graphics) specific.
Max EU Thread Count: Maximum number of threads per execution unit. This data is Intel Graphics specific.
Max Core Frequency: Maximum frequency of the Graphics processor. This data is Intel Graphics specific.
Graphics Performance Analysis: GPU metrics collection is enabled on the hardware level. This data is Intel Graphics specific.
NOTE
Some systems disable collection of extended metrics such as L3 misses, memory accesses,
sampler busyness, SLM accesses, and others in the BIOS. On some systems you can set a
BIOS option to enable this collection. The presence or absence of the option and its name are
BIOS vendor specific. Look for the Intel® Graphics Performance Analyzers option (or
similar) in your BIOS and set it to Enabled.
See Also
Microarchitecture Exploration View
Comparison Summary
NOTE
Click the Copy to Clipboard button to copy the content of the selected summary section to the
clipboard.
GPU Utilization
If your system satisfies configuration requirements for GPU analysis (i915 ftrace event collection is
supported), VTune Profiler displays detailed GPU Utilization analysis data across all engines that had at
least one DMA packet executed. By default, the VTune Profiler flags GPU utilization of less than 80% as a performance issue. In the example below, GPU engines were utilized for 85.9% of the application elapsed time.
Depending on the target platform used for GPU analysis, the GPU Utilization section in the Summary
window shows the time (in seconds) used by GPU engines. Note that GPU engines may work in parallel and
the total time taken by GPU engines does not necessarily equal the application Elapsed time.
You may correlate GPU Time data with the Elapsed Time metric. The GPU Time value shows a share of the
Elapsed time used by a particular GPU engine. If the GPU Time takes a significant portion of the Elapsed
Time, it clearly indicates that the application is GPU-bound.
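For example, if a GPU engine reports 9 seconds of GPU Time against an Elapsed Time of 10 seconds, that engine was busy for roughly 90% of the run, which strongly suggests that the application is GPU-bound.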
If your system does not support i915 ftrace event collection, all the GPU Utilization statistics will be
calculated based on the hardware events and attributed to the Render and GPGPU engine.
The Summary view provides the Packet Queue Depth Histogram that helps you estimate the GPU
software queue depth per GPU engine during the target run:
Ideally, your goal is an effective GPU engine utilization with evenly loaded queues and minimal duration for
the zero queue depth.
For a high-level view of the DMA packet execution during the target run, review the Packet Duration
Histogram:
Select a required packet type from the drop-down menu and identify how effectively these packets were
executed on the GPU. Having high Packet Count values for the minimal duration is optimal.
To get detailed information on the packet queues and execution, switch to the Platform tab and analyze the
GPU software queue on the timeline.
For OpenCL™ applications, explore the Hottest GPU Computing Tasks section that helps you understand
which OpenCL kernels had performance issues:
Mouse over a flagged computing task for details on a performance issue. For example, for the Intersect
computing task a significant portion of the GPU time was spent in stalls, which may result from frequent
sampler or memory accesses. Click a hot GPU computing task to open the Graphics window with this
computing task pre-selected for your convenience.
EU Array Stalled/Idle
For the compute-bound workloads, explore the EU Array Stalled/Idle section that shows the most typical
reasons why the execution units could be waiting. This section shows up for the analysis that collects Intel®
HD Graphics and Intel® Iris® Graphics hardware events for the GPU Compute/Media Hotspots.
Depending on the event preset you used for the configuration, the VTune Profiler analyzes metrics for stalled/idle execution units. The GPU Compute/Media Hotspots analysis by default collects the Overview preset
including the metrics that track general GPU memory accesses, such as Sampler Busy and Sampler Is
Bottleneck, and GPU L3 bandwidth. As a result, the EU Array Stalled/Idle section displays the Sampler
Busy section with a list of GPU computing tasks with frequent access to the Sampler and hottest GPU
computing tasks bound by GPU L3 bandwidth:
If you select the Compute Basic preset during the analysis configuration, VTune Profiler analyzes metrics that
distinguish accessing different types of data on a GPU and displays the Occupancy section. See information
about GPU tasks with low occupancy and understand how you can achieve peak occupancy:
If the peak occupancy is flagged as a problem for your application, inspect factors that limit the use of all
the threads on the GPU. Consider modifying your code with corresponding solutions:
Factor responsible for Low Peak Occupancy: The SLM size requested per workgroup in a computing task is too high.
Solution: Decrease the SLM size or increase the Local size (see the sketch after this table).
Factor responsible for Low Peak Occupancy: Barrier synchronization (the sync primitive can cause low occupancy due to a limited number of hardware barriers on a GPU subslice).
Solution: Remove barrier synchronization or increase the Local size.
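The sketch below is an illustrative host-side fragment only (the queue and kernel handles are assumed to exist already); it shows where the Local size is set when an OpenCL™ kernel is enqueued, which is the knob the solutions above refer to:

    #include <CL/cl.h>

    // Enqueue the kernel with a larger work-group (Local) size, one of the
    // remedies suggested above for low peak occupancy.
    cl_int enqueue_with_larger_workgroup(cl_command_queue queue, cl_kernel kernel)
    {
        size_t global = 1024 * 1024;   // total number of work-items (assumed)
        size_t local  = 256;           // larger Local size than, for example, 32
        return clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                                      &global, &local, 0, NULL, NULL);
    }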
If the occupancy is flagged as a problem for your application, change your code to improve hardware thread
scheduling. These are some reasons that may be responsible for ineffective thread scheduling:
• A tiny computing task could cause considerable overhead when compared to the task execution time.
• There may be high imbalance between the threads executing a computing task.
The Compute Basic preset also enables an analysis of the DRAM bandwidth usage. If the GPU workload is
DRAM bandwidth-bound, the corresponding metric value is flagged. You can explore the table with GPU
computing tasks heavily using the DRAM bandwidth during execution.
If you select the Full Compute preset and multiple run mode during the analysis configuration, the VTune
Profiler will use both Overview and Compute Basic event groups for data collection and provide all types of
reasons for the EU array stalled/idle issues in the same view.
NOTE
To analyze Intel® HD Graphics and Intel® Iris® Graphics hardware events, make sure to set up your
system for GPU analysis
FPU Utilization
If your application execution takes more than 80% of collection time heavily utilizing both floating point units
(FPUs), the VTune Profiler highlights such a value as an issue and lists the kernels that overutilized the FPUs:
Click a flagged kernel to switch to the Graphics tab > Timeline pane, explore the distribution of the GPU
EU Instructions metric that shows the FPU usage during the analysis run, and identify time ranges with the
highest metric values. To address a high FPU utilization issue in your code, consider reducing computations.
Bandwidth Utilization
For memory-bound applications, explore the Bandwidth Utilization Histogram section that includes
statistics on the average system bandwidth and a Bandwidth Utilization histogram that shows how intensively
your application was using each bandwidth domain:
The Hardware Events viewpoint is enabled for all hardware event-based sampling results and is targeted
primarily for the analysis of monitored hardware events: estimated count and/or the number of samples
collected. In the Summary window, explore the following data:
• Analysis metrics
• Hardware Events
• Uncore Event Count
• Top Tasks
• Collection and Platform Info
NOTE
You may click the Copy to Clipboard button to copy the content of the selected summary section
to the clipboard.
Analysis Metrics
The Summary window displays a list of CPU metrics that help you estimate an overall application execution.
For a metric description, hover over the corresponding question mark icon to read the pop-up help.
Use the Elapsed Time metric as your primary indicator and a baseline for comparison of results before and
after optimization. Note that for multithreaded applications, the CPU Time is different from the Elapsed Time
since the CPU Time is the sum of CPU time for all application threads.
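For example, an application that keeps four threads fully busy for 10 seconds of Elapsed Time accumulates roughly 40 seconds of CPU Time.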
Hardware Events
This section provides a list of hardware events monitored for this analysis and the statistics collected:
Hardware Event Type: Event name provided as a hyperlink. Clicking an event name opens the Event Count window sorted by the selected event. You can identify a function with the highest event/sample count and double-click it to open the Source view and identify which code line generated the highest count for the event of interest.
Hardware Event Count: Estimated number of times this event occurred during the collection.
Events per Sample: Number of events collected at one sample (Sample After Value).
Uncore Event Type: Event name provided as a hyperlink. Clicking an event name opens the Uncore Event Count window sorted by the selected event.
Uncore Event Count: The number of times this uncore event occurred during the collection.
Top Tasks
This section provides a list of tasks that took most of the time to execute, where tasks are either code
regions marked with Task API, or system tasks enabled to monitor Ftrace* events, Atrace* events, Intel
Media SDK programs, OpenCL™ kernels, and so on.
Clicking a task type in the table opens the grid view (for example, Bottom-up or Event Count) grouped by
the Task Type granularity. See Task Analysis for more information.
Collection start time: Start time (in UTC format) of the external collection. Explore the Timeline pane to track the performance statistics provided by the custom collector over time.
Collection stop time: Stop time (in UTC format) of the external collection. Explore the Timeline pane to track the performance statistics provided by the custom collector over time.
Collector type: Type of the data collector used for the analysis. The following types are possible:
• Driver-based sampling
• Driver-less Perf*-based sampling: per-process or system-wide
• User-mode sampling and tracing
CPU Information
Logical CPU Count: Logical CPU count for the machine used for the collection.
User Name: User launching the data collection. This field is available if you enabled the per-user event-based sampling collection mode during the product installation.
GPU Information
Stepping: Microprocessor version.
EU Count: Number of execution units (EUs) in the Render and GPGPU engine. This data is Intel® HD Graphics and Intel® Iris® Graphics (further: Intel Graphics) specific.
Max EU Thread Count: Maximum number of threads per execution unit. This data is Intel Graphics specific.
Max Core Frequency: Maximum frequency of the Graphics processor. This data is Intel Graphics specific.
Graphics Performance Analysis: GPU metrics collection is enabled on the hardware level. This data is Intel Graphics specific.
NOTE
Some systems disable collection of extended metrics such as L3 misses, memory accesses,
sampler busyness, SLM accesses, and others in the BIOS. On some systems you can set a
BIOS option to enable this collection. The presence or absence of the option and its name are
BIOS vendor specific. Look for the Intel® Graphics Performance Analyzers option (or
similar) in your BIOS and set it to Enabled.
See Also
Sample After Value
NOTE
You may click the Copy to Clipboard button to copy the content of the selected summary section
to the clipboard.
Analysis Metrics
The Summary window displays a list of CPU metrics that help you estimate an overall application execution.
For a metric description, hover over the corresponding question mark icon to read the pop-up help. For
metric values flagged as performance issues, hover over such a value for details:
Use the Elapsed Time metric as your primary indicator and a baseline for comparison of results before and
after optimization. Note that for multithreaded applications, the CPU Time is different from the Elapsed Time
since the CPU Time is the sum of CPU time for all application threads.
For some analysis types, the Effective CPU Time is classified per CPU utilization as follows:
Idle: Idle utilization. By default, if the CPU Time is insignificant (less than 50% of 1 CPU), such CPU utilization is classified as idle.
Poor: Poor utilization. By default, poor utilization is when the number of simultaneously running CPUs is less than or equal to 50% of the target CPU utilization.
Ideal: Ideal utilization. By default, ideal utilization is when the number of simultaneously running CPUs is between 86-100% of the target CPU utilization.
The Overhead and Spin Time metrics, if provided (depending on the analysis), can tell you how your application's use of synchronization and threading libraries is impacting the CPU time. Review the metrics within these categories to learn where your application might be spending additional time making calls to synchronization and threading libraries such as the system synchronization API, Intel® oneAPI Threading Building Blocks (oneTBB), and OpenMP*. VTune Profiler identifies the following types of inefficiencies in your code that take CPU time:
Imbalance or Serial Spinning Time: CPU time when working threads are spinning on a synchronization barrier, consuming CPU resources. This can be caused by load imbalance, insufficient concurrency for all working threads, or waits on a barrier in the case of serialized execution.
Lock Contention Spin Time: CPU time when working threads are spinning on a lock, consuming CPU resources. A high metric value may signal inefficient parallelization with highly contended synchronization objects. To avoid intensive synchronization, consider using reduction, atomic operations, or thread-local variables where possible (see the sketch after this table).
Other Spin Time: Unclassified Spin time spent in a threading runtime library.
Creation Overhead Time: CPU time that a runtime library spends on organizing parallel work.
Scheduling Overhead Time: CPU time that a runtime library spends on work assignment for threads. If the time is significant, consider using coarse-grain work chunking.
Reduction Overhead Time: CPU time that a runtime library spends on loop or region reduction operations.
Atomics Overhead Time: CPU time that a runtime library spends on atomic operations.
Other Overhead Time: Unclassified Overhead time spent in a threading runtime library.
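As an illustrative sketch only (not from the product documentation), the following C++ fragment shows the kind of change the Lock Contention Spin Time recommendation points at: accumulating into a thread-local value and publishing it with a single atomic operation instead of taking a contended lock for every update:

    #include <atomic>
    #include <vector>

    std::atomic<long> hits{0};   // shared counter, assumed to be updated by many threads

    void count_hits(const std::vector<int>& data)
    {
        long local = 0;                     // thread-local accumulation, no synchronization
        for (int v : data)
            if (v > 0)
                ++local;
        // One atomic update per thread instead of one lock acquisition per element.
        hits.fetch_add(local, std::memory_order_relaxed);
    }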
Depending on the analysis type, the VTune Profiler may analyze a metric, compare its value with the
threshold value provided by Intel architects, and highlight the metric value in pink as a performance issue for
an application as a whole. The issue description for such a value may be provided below the critical metric or
when you hover over the highlighted metric.
Each metric in the list shows up as a hyperlink. Clicking a hyperlink opens the Bottom-up window and sorts
the grid by the selected metric or highlights the selected object in the grid.
Top Hotspots
VTune Profiler displays the most performance-critical functions and their CPU Time in the Top Hotspots
section. Optimizing these functions typically results in improving overall application performance. Clicking a
function in the list opens the Bottom-up window with this function selected.
The grayed-out [Others] module, if provided, displays the total value for all other functions in the application that are not included in this table.
NOTE
You can control the number of objects in this list and displayed metrics via the viewpoint configuration
file.
Top Tasks
This section provides a list of tasks that took most of the time to execute, where tasks are either code
regions marked with Task API, or system tasks enabled to monitor Ftrace* events, Atrace* events, Intel
Media SDK programs, OpenCL™ kernels, and so on.
Clicking a task type in the table opens the grid view (for example, Bottom-up or Event Count) grouped by
the Task Type granularity. See Task Analysis for more information.
Vertical bars: Hover over a bar to identify the amount of Elapsed time the application spent using the specified number of logical CPU cores.
Target Utilization: Identify the target CPU utilization. This number is equal to the number of logical CPU cores. Consider this number as your optimization goal.
Average Effective CPU Utilization: Identify the average number of CPUs used, aggregated over the entire run. It is calculated as CPU time / Elapsed time. CPU utilization at any point in time cannot surpass the available number of logical CPU cores. Even when the system is oversubscribed and there are more threads running than CPUs, the CPU utilization does not exceed the number of CPUs. Use this number as a baseline for your performance measurements. The closer this number is to the number of logical CPU cores, the better, except when the CPU time goes to spinning.
Utilization Indicator bar: Analyze how the various utilization levels map to the number of simultaneously utilized logical CPU cores.
NOTE
In the CPU Utilization histogram, the VTune Profiler treats the Spin and Overhead time as Idle
CPU utilization. Different analysis types may recognize Spin and Overhead time differently
depending on availability of call stack information. This may result in a difference of CPU
utilization graphical representation per analysis type.
Domain drop-down menu: Choose a frame domain to analyze with the frame rate histogram. If only one domain is available, the drop-down menu is grayed out. Then, you can switch to the Bottom-up window grouped by Frame Domain, filter the data by slow frames, and switch to the Function grouping to identify functions in the slow frame domains. Try to optimize your code to keep the frame rate constant (for example, from 30 to 60 frames per second).
Vertical bars: Hover over a bar to see the total number of frames in your application executed with a specific frame rate. A high number of slow or fast frames signals a performance bottleneck.
Frame rate bar: Use the sliders to adjust the frame rate threshold (in frames per second) for the currently open result and all subsequent results in the project.
Collection and Platform Info
This section provides the following data:
Collection start time: Start time (in UTC format) of the external collection. Explore the Timeline pane to track the performance statistics provided by the custom collector over time.
Collection stop time: Stop time (in UTC format) of the external collection. Explore the Timeline pane to track the performance statistics provided by the custom collector over time.
Collector type: Type of the data collector used for the analysis. The following types are possible:
• Driver-based sampling
• Driver-less Perf*-based sampling: per-process or system-wide
• User-mode sampling and tracing
CPU Information
Logical CPU Count: Logical CPU count for the machine used for the collection.
User Name: User launching the data collection. This field is available if you enabled the per-user event-based sampling collection mode during the product installation.
GPU Information
EU Count: Number of execution units (EUs) in the Render and GPGPU engine. This data is Intel® HD Graphics and Intel® Iris® Graphics (further: Intel Graphics) specific.
Max EU Thread Count: Maximum number of threads per execution unit. This data is Intel Graphics specific.
Max Core Frequency: Maximum frequency of the Graphics processor. This data is Intel Graphics specific.
Graphics Performance Analysis: GPU metrics collection is enabled on the hardware level. This data is Intel Graphics specific.
NOTE
Some systems disable collection of extended metrics such as L3 misses, memory accesses,
sampler busyness, SLM accesses, and others in the BIOS. On some systems you can set a
BIOS option to enable this collection. The presence or absence of the option and its name are
BIOS vendor specific. Look for the Intel® Graphics Performance Analyzers option (or
similar) in your BIOS and set it to Enabled.
See Also
Comparison Summary
Thread Concurrency
CPU Utilization
NOTE
You may click the Copy to Clipboard button to copy the content of the selected summary section
to the clipboard.
Analysis Metrics
The Summary window displays metrics that help you estimate an overall application execution. For a metric
description, hover over the corresponding question mark icon to read the pop-up help.
Use the Elapsed Time, GFLOPS, or GFLOPS Upper Bound (Intel® Xeon Phi™ processor only) metric as your
primary indicator and a baseline for comparison of results before and after optimization.
CPU Utilization
The CPU Utilization section displays metrics for CPU usage during the collection time.
It also shows the Elapsed Time the application spent using the specified number of logical CPU cores. Use the Average
Physical Core Utilization and Average Logical Core Utilization numbers as a baseline for your performance
measurements. The CPU usage at any point cannot surpass the available number of logical CPU cores.
Memory Bound
A high Memory Bound value might indicate that a significant portion of execution time was lost while fetching
data. The section shows a fraction of cycles that were lost in stalls being served in different cache hierarchy
levels (L1, L2, L3) or fetching data from DRAM. For last level cache misses that lead to DRAM, it is important
to distinguish whether the stalls were caused by a memory bandwidth limit, since such stalls can require specific optimization techniques compared to latency-bound stalls. VTune Profiler shows a hint about identifying
this issue in the DRAM Bound metric issue description. This section also offers the percentage of accesses to
a remote socket compared to a local socket to see if memory stalls can be connected with NUMA issues.
For Intel® Xeon Phi™ processors formerly code named Knights Landing, there is no way to measure memory
stalls to assess memory access efficiency in general. Therefore, Back-end Bound stalls, which include memory-related stalls, are shown instead as a high-level characterization metric. The second-level metrics focus particularly on memory access efficiency.
• A high L2 Hit Bound or L2 Miss Bound value indicates that a high ratio of cycles was spent handling L2 hits or misses.
• The L2 Miss Bound metric does not take into account data brought into the L2 cache by the hardware
prefetcher. However, in some cases the hardware prefetcher can generate significant DRAM/MCDRAM
traffic and saturate the bandwidth. The Demand Misses and HW Prefetcher metrics show the
percentages of all L2 cache input requests that are caused by demand loads or the hardware prefetcher.
• A high DRAM Bandwidth Bound or MCDRAM Bandwidth Bound value indicates that a large
percentage of the overall elapsed time was spent with high bandwidth utilization. A high DRAM
Bandwidth Bound value is an opportunity to run the Memory Access analysis to identify data structures
that can be allocated in high bandwidth memory (MCDRAM), if it is available.
The Bandwidth Utilization Histogram shows how much time the system bandwidth was utilized by a
certain value (Bandwidth Domain) and provides thresholds to categorize bandwidth utilization as High,
Medium and Low. The thresholds are calculated based on benchmarks that calculate the maximum value. You
can also set the threshold by moving the sliders at the bottom of the histogram. The modified values are
applied to all subsequent results in the project.
If your application is memory bound, consider running a Memory Access analysis to identify deeper memory
issues and examine memory objects in more detail.
Vectorization
NOTE
Vectorization and GFLOPS metrics are supported on Intel® microarchitectures formerly code named Ivy
Bridge, Broadwell, and Skylake. Limited support is available for Intel® Xeon Phi™ processors formerly
code named Knights Landing. The metrics are not currently available on 4th Generation Intel
processors. Expand the Details section on the analysis configuration pane to view the processor
family available on your system.
This metric shows how efficiently the application is using floating point units for vectorization. Expand the
GFLOPS or GFLOPS Upper Bound (Intel Xeon Phi processors only) section to show the number of Scalar
and Packed GFLOPS. This section provides a quick estimate of the amount of FLOPs that were not vectorized.
The Top Loops/Functions with FPU Usage by CPU Time table shows the top functions that contain
floating point operations sorted by CPU time and allows for a quick estimate of the fraction of vectorized
code, the vector instruction set used in the loop/function, and the loop type.
For example, if a floating point loop (function) is bandwidth bound, use the Memory Access analysis to
resolve the bandwidth bound issue. If a floating point loop is vectorized, use the Intel Advisor to improve the
vectorization. If the loop is also bandwidth bound, the bandwidth bound issue should be resolved prior to
improving vectorization. Click one of the function names to switch to the Bottom-up window and evaluate if
the function is memory bound.
• Outgoing and Incoming Bandwidth Bound metrics show the percentage of elapsed time that the application spent in communication close to or at the interconnect bandwidth limit.
• Bandwidth Utilization Histogram shows how much time the interconnect bandwidth was utilized by a certain value (Bandwidth Domain) and provides thresholds to categorize bandwidth utilization as High, Medium, and Low.
• Outgoing and Incoming Packet Rate metrics show the percentage of elapsed time that the application spent in communication close to or at the interconnect packet rate limit.
• Packet Rate Histogram shows how much time the interconnect packet rate reached a certain value and provides thresholds to categorize the packet rate as High, Medium, and Low.
Collection and Platform Info
This section provides the following data:
Collection start time: Start time (in UTC format) of the external collection. Explore the Timeline pane to track the performance statistics provided by the custom collector over time.
Collection stop time: Stop time (in UTC format) of the external collection. Explore the Timeline pane to track the performance statistics provided by the custom collector over time.
CPU Information
Logical CPU Count: Logical CPU core count for the machine used for the collection.
User Name: User launching the data collection. This field is available if you enabled the per-user event-based sampling collection mode during the product installation.
See Also
HPC Performance Characterization View
Reference
Comparison Summary
NOTE
Click the Copy to Clipboard button to copy the content of the selected summary section to the
clipboard.
Analysis Metrics
The first section displays the summary statistics on the overall application execution:
All metric names are hyperlinks. Clicking such a hyperlink opens the Bottom-up window and sorts the data
in the grid by the selected metric.
Collection start time: Start time (in UTC format) of the external collection. Explore the Timeline pane to track
the performance statistics provided by the custom collector over time.
Collection stop time: Stop time (in UTC format) of the external collection. Explore the Timeline pane to track
the performance statistics provided by the custom collector over time.
Collector type: Type of the data collector used for the analysis. The following types are possible:
• Driver-based sampling
• Driver-less Perf*-based sampling: per-process or system-wide
• User-mode sampling and tracing
CPU Information
Logical CPU Count: Logical CPU count for the machine used for the collection.
User Name: User launching the data collection. This field is available if you enabled the per-user
event-based sampling collection mode during the product installation.
GPU Information
EU Count: Number of execution units (EUs) in the Render and GPGPU engine. This data is Intel®
HD Graphics and Intel® Iris® Graphics (further: Intel Graphics) specific.
Max EU Thread Count: Maximum number of threads per execution unit. This data is Intel Graphics specific.
Max Core Frequency: Maximum frequency of the Graphics processor. This data is Intel Graphics specific.
Graphics Performance Analysis: GPU metrics collection is enabled on the hardware level. This data is Intel
Graphics specific.
NOTE
Some systems disable collection of extended metrics such as L3 misses, memory accesses,
sampler busyness, SLM accesses, and others in the BIOS. On some systems you can set a
BIOS option to enable this collection. The presence or absence of the option and its name are
BIOS vendor specific. Look for the Intel® Graphics Performance Analyzers option (or
similar) in your BIOS and set it to Enabled.
See Also
Memory Consumption Analysis
NOTE
You may click the Copy to Clipboard button to copy the content of the selected summary section
to the clipboard.
Analysis Metrics
The Summary window displays a list of memory-related CPU metrics that help you estimate an overall
memory usage during application execution. For a metric description, hover over the corresponding question
mark icon to read the pop-up help:
Memory Bound metrics are measured either as Clockticks or as Pipeline Slots. Metrics measured in Clockticks
are less precise compared to the metrics measured in Pipeline Slots since they may overlap and their sum at
some level does not necessarily match the parent metric value. But such metrics are still useful for
identifying the dominant performance bottleneck in the code.
Hover over a value flagged with a performance issue to read the recommendation for further analysis.
For example, a high Memory Bound value typically indicates that a significant fraction of the execution
pipeline slots could be stalled due to demand memory loads and stores. For further details, you may switch
to the Bottom-up window and explore metric data per memory object.
A high DRAM Bandwidth Bound metric value indicates that your system spent much time heavily utilizing the
DRAM bandwidth. The calculation of this metric relies on the accurate maximum system DRAM bandwidth
measurement provided in the System Bandwidth section below.
System Bandwidth
This section provides various system bandwidth-related properties detected by the product. Depending on
the number of sockets on your system, the following types of system bandwidth are measured:
Max DRAM System Bandwidth: Maximum DRAM bandwidth measured for the whole system (across all
packages) by running a micro-benchmark before the collection starts. If the system has already been actively
loaded at the moment the collection starts (for example, with the attach mode), the value may be less
accurate.
Max DRAM Single-Package Bandwidth: Maximum DRAM bandwidth for a single package, measured by running
a micro-benchmark before the collection starts. If the system has already been actively loaded at the moment
the collection starts (for example, with the attach mode), the value may be less accurate.
These values are used to define the default High, Medium, and Low bandwidth utilization thresholds for the
Bandwidth Utilization Histogram and to scale over-time bandwidth graphs in the Bottom-up view. By
default, for Memory Access analysis results the system bandwidth is measured automatically. To enable this
functionality for custom analysis results, make sure to select the Evaluate max DRAM bandwidth option.
If you switch to the Bottom-up window and group the grid data by ../Bandwidth Utilization Type/.., you
can identify functions or memory objects with high bandwidth utilization in the specific bandwidth domain.
If you select the Interconnect domain, you will be able to check whether the performance of your
application is limited by the bandwidth of Interconnect links (inter-socket connections). Then, you may
switch to the Bottom-up window and identify code and memory objects with NUMA issues.
Single-Package domains are displayed for systems with two or more CPU packages, and the histogram
for them shows the distribution of the elapsed time per maximum bandwidth utilization among all packages.
Use this data to identify situations where your application utilizes bandwidth only on a subset of CPU
packages. In this case, the whole-system bandwidth utilization represented by domains like DRAM may be
low, whereas the performance is in fact limited by the bandwidth utilization of a single package.
NOTE
• Interconnect bandwidth analysis is supported by the VTune Profiler for Intel microarchitecture code
name Ivy Bridge EP and later.
• To learn bandwidth capabilities, refer to your system specifications or run appropriate benchmarks
to measure them; for example, Intel Memory Latency Checker can provide maximum achievable
DRAM and Interconnect bandwidth.
NOTE
• Memory objects identification is supported only for Linux targets and only for processors based on
Intel microarchitecture code name Sandy Bridge and later.
• Only metrics based on DLA-capable hardware events are applicable to the memory objects analysis.
For example, the CPU Time metric is based on the non-DLA-capable Clockticks event, so it cannot be
applied to memory objects. Examples of applicable metrics are Loads, Stores, LLC Miss Count, and
Average Latency.
Clicking an object in the table opens the Bottom-up window with the grid data grouped by Memory
Object/Function/Allocation Stack. The selected hotspot object is highlighted.
Top Tasks
This section provides a list of tasks that took the most time to execute, where tasks are either code
regions marked with the Task API or system tasks enabled to monitor Ftrace* events, Atrace* events, Intel
Media SDK programs, OpenCL™ kernels, and so on.
Clicking a task type in the table opens the grid view (for example, Bottom-up or Event Count) grouped by
the Task Type granularity. See Task Analysis for more information.
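For reference, the following minimal sketch shows how a code region can be marked with the Task API so that it shows up under Top Tasks. It assumes the ittnotify header and library that ship with VTune Profiler are available; the domain and task names are hypothetical.

#include <ittnotify.h>

// Create a domain and a task name handle once; both are part of the ITT API.
static __itt_domain* domain = __itt_domain_create("MyApp.Domain");
static __itt_string_handle* task_name = __itt_string_handle_create("ProcessFrame");

void process_frame() {
    __itt_task_begin(domain, __itt_null, __itt_null, task_name);  // mark task start
    // ... work that will be attributed to the "ProcessFrame" task ...
    __itt_task_end(domain);                                       // mark task end
}

Build against the ittnotify library and enable user task analysis in the analysis configuration so that the marked regions are collected.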
Latency Histogram
This histogram shows a distribution of loads per latency (in cycles).
Collection and Platform Info
Computer Name: Name of the computer used for the collection.
Collection start time: Start time (in UTC format) of the external collection. Explore the Timeline pane to track
the performance statistics provided by the custom collector over time.
Collection stop time: Stop time (in UTC format) of the external collection. Explore the Timeline pane to track
the performance statistics provided by the custom collector over time.
Collector type: Type of the data collector used for the analysis. The following types are possible:
• Driver-based sampling
• Driver-less Perf*-based sampling: per-process or system-wide
• User-mode sampling and tracing
CPU Information
Logical CPU Count: Logical CPU count for the machine used for the collection.
User Name: User launching the data collection. This field is available if you enabled the per-user
event-based sampling collection mode during the product installation.
GPU Information
EU Count: Number of execution units (EUs) in the Render and GPGPU engine. This data is Intel®
HD Graphics and Intel® Iris® Graphics (further: Intel Graphics) specific.
Max EU Thread Count: Maximum number of threads per execution unit. This data is Intel Graphics specific.
Max Core Frequency: Maximum frequency of the Graphics processor. This data is Intel Graphics specific.
Graphics Performance Analysis: GPU metrics collection is enabled on the hardware level. This data is Intel
Graphics specific.
NOTE
Some systems disable collection of extended metrics such as L3 misses, memory accesses,
sampler busyness, SLM accesses, and others in the BIOS. On some systems you can set a
BIOS option to enable this collection. The presence or absence of the option and its name are
BIOS vendor specific. Look for the Intel® Graphics Performance Analyzers option (or
similar) in your BIOS and set it to Enabled.
See Also
Memory Usage View
Comparison Summary
NOTE
Platform Power Analysis viewpoint is available as part of energy analysis. Collecting energy analysis
with Intel® SoC Watch is available for target Android*, Windows*, or Linux* devices. Import and
viewing of the Intel SoC Watch results is supported with any version of the VTune Profiler.
Depending on the options selected when running the Intel SoC Watch collector and the operating system or
platform on which the analysis was run, the Summary window provides the following statistics in the
Platform Power Analysis viewpoint:
• Wake-ups/sec per Core
• Top 5 Frequencies
• Top 5 Causes of Core Wake-ups
• Top 5 Kernel Wakelocks
• Core Frequency Histogram
• Elapsed Time per Core Sleep State Histogram
• Elapsed Time per System Sleep State Histogram
• Elapsed Time per Graphics Device State Histogram
• Collection and Platform Information
After reviewing the information on the Summary window, switch to the Correlate Metrics window to view
all timeline data on one window. The Correlate Metrics window is another method of identifying energy
trends in the collected data.
Tip
• Click the Copy to Clipboard button to copy the content of the selected summary section to the
clipboard.
• Click the Details link next to the table or graph title on the Summary tab to view more
information about that metric in another window of the Platform Power Analysis viewpoint.
Available Core Time: Total execution time across all cores (elapsed time x number of cores).
CPU Utilization (%): Percentage of time spent in the active state (C0) during collection. A greater
percentage of time spent in the active state (C0) is an indication of higher energy consumption.
Total Time in Non-C0 States: Total time spent in sleep states (C1-Cn) across all cores. The larger the
C-state number, the deeper the sleep state and the greater the energy savings.
See Window: Core Wake-ups - Platform Power Analysis for more information.
Top 5 Frequencies
View the total time and total percentage of time spent in each of the top 5 processor frequencies. The 0 GHz
frequency represents time when the processor was inactive (in a sleep state). Switch to the CPU C/P States
sub-tab to view more detailed information about core frequency. See Window: CPU C/P States - Platform
Power Analysis for more information.
Cn represents the inactive or sleep state during which the device consumes the least energy. The larger the
C-State number, the deeper the sleep state. A greater amount of time spent in the C0 or active state is an
indication of higher energy consumption. Switch to the Core Wake-ups sub-tab to view more detailed
information about the reasons the cores spent time in active states and to view a timeline indicating when
the cores were active. See Window: Core Wake-ups - Platform Power Analysis for more information.
See Also
Interpreting Energy Analysis Data
Viewing Energy Analysis Data
Viewpoint
NOTE
Platform Power Analysis viewpoint is available as part of energy analysis. Collecting energy analysis
with Intel® SoC Watch is available for target Android*, Windows*, or Linux* devices. Import and
viewing of the Intel SoC Watch results is supported with any version of the VTune Profiler.
state. Click the expand/collapse buttons in the data columns to expand the column and show data for
different S-States in each device. You can change the unit displayed by right-clicking a data cell and
selecting an alternate unit from the Show Data As option. For example, you could select Show Data As >
Percent to view the percent of collection time a particular device spent in the active state.
In the following example, the system never leaves the active S0i0 state. Either the CPU is active or one or
more devices kept the system active during collection. The active devices can be identified by switching to
the NC Device States tab or the SC Device States tab and looking for a device or devices that were active
during the collection. Use the CPU C/P States window to check the CPU activity level.
Timeline Pane
The Timeline pane displays the S-States of the system at each point in time. Each state is shown in a
different color. Use the legend on the right to see the colors related to the different states or features. Zoom
in on the timeline to better view the transitions between inactive and active states. Hover over the timeline
to view the percent of time spent in each state.
Filters applied on a timeline in one window are applied on all other windows within the viewpoint. This is
useful if you identify an issue on one tab and want to see how the issue impacts the metrics shown on a
different tab.
See Also
Interpreting Energy Analysis Data
NOTE
Platform Power Analysis viewpoint is available as part of energy analysis. Collecting energy analysis
with Intel® SoC Watch is available for target Android*, Windows*, or Linux* devices. Import and
viewing of the Intel SoC Watch results is supported with any version of the VTune Profiler.
Temperature Pane
The Temperature pane shows the sample counts at each temperature reading in degrees Celsius (°C) for
each core or device. A greater number of sample counts indicates that the device or core spent more of the
collection time at that temperature. Click the expand/collapse buttons in the data columns to expand
the column and show data for different temperature readings in each device. You can change the unit
displayed by right-clicking a data cell and selecting an alternate unit from the Show Data As option. For
example, you can display the sample counts as a percentage of the total sample counts.
Timeline Pane
The Timeline pane displays the temperatures of each core at each point in time during the collection. Expand
the timeline rows vertically to view subtle temperature shifts. Zoom in on the timeline to view sampling
points. Filters applied on a timeline in one window are applied on all other windows within the viewpoint. This
is useful if you identify an issue on one tab and want to see how the issue impacts the metrics shown on a
different tab.
Shifts in core temperature often mirror shifts in processor frequency. When the processor runs at a higher
frequency, the temperature also rises. In the following example of the Correlated Metrics tab showing both
Thermal Sample and Core P-State Frequency data, both the temperature and the frequency fluctuate for the
first 4 seconds of collection and then remain fairly stable.
If the temperature is high but the frequency is low, it could mean that the CPU is being throttled to lower
the core temperature.
See Also
Interpreting Energy Analysis Data
Viewpoint
Grouping Data
Window: CPU C/P States - Platform Power Analysis
NOTE
Platform Power Analysis viewpoint is available as part of energy analysis. Collecting energy analysis
with Intel® SoC Watch is available for target Android*, Windows*, or Linux* devices. Import and
viewing of the Intel SoC Watch results is supported with any version of the VTune Profiler.
In the following example, most of the collection time was spent at a low timer resolution value of 4 ms.
Applications generally request more frequent system timer wakeups like this to ensure a faster response.
Such changes to the system timer should be restricted to critical regions in the application, since they impact
the entire system. Use the Timeline pane to see which process or processes caused the change in timer
resolution.
Timeline Pane
The Timeline pane shows a graphical representation of the timer resolution value changes and the duration
each application spent at each resolution value.
Toolbar: Navigation control to zoom in/out on the view on areas of interest. Filters applied on a timeline in
one window are applied on all other windows within the viewpoint. This is useful if you identify an issue on
one tab and want to see how the issue impacts the metrics shown on a different tab. For more details on the
Timeline control, see Managing Timeline View.
Legend: Types of data presented on the timeline. Filter in/out any type of data presented on the timeline by
selecting/deselecting the corresponding check boxes.
System Timer Resolution: Timer resolution value changes over time. The black line illustrates the change in
timer resolution. The colored bar at the bottom illustrates the duration at each timer resolution value.
Requested/Application Timer Resolution: Requests for timer resolution change and duration by application.
Hover over the timeline to view a tooltip listing information about the request, including start time, duration,
application requesting the change, and requested resolution value (ms). Zoom in on the timeline to view
changes in timer resolution value.
See Also
Interpreting Energy Analysis Data
Viewing Energy Analysis Data
Viewpoint
Grouping Data
Function Stack: The Function Stack column represents call sequences (stacks) detected during the collection
phase, starting from the application root (usually the main() function). The time value for a row is equal to
the sum of all the nested items from that row. Use this data to see the impact of program units together with
their callees. This type of investigation is known as a top-down analysis.
In the example above, the hotspot thread_video function has three callees, where rt_renderscene is the first
candidate for optimization.
The call stacks are always available for the results of the user-mode sampling and tracing collection. They are
also available for the results of the hardware event-based sampling collection if you enabled the Collect
stacks option during the analysis configuration. Otherwise, the Function Stack column for the event-based
sampling results shows a flat list of the functions.
<Performance metrics>: Each data column in the Top-Down Tree grid corresponds to a performance metric.
The list of performance metrics varies with the analysis type and selected viewpoint. In the Top-down Tree
window, the Intel® VTune™ Profiler provides two types of metrics:
• Self metrics show performance data collected within particular procedures and functions.
• Total metrics show performance data collected within functions AND children (callees).
By default, all program units are sorted in descending order by the metric values in the first column (for
example, CPU Time: Total), providing the most performance-critical program units first. You may click a
column header to re-sort the table by the required metric.
NOTE
Mouse over a column header to see a metric description.
See Also
Manage Data Views
Timeline Pane
NOTE
If there are no uncore events selected for the analysis, the Timeline pane is empty.
NOTE
Platform Power Analysis viewpoint is available as part of energy analysis. Collecting energy analysis
with Intel® SoC Watch is available for target Android*, Windows*, or Linux* devices. Import and
viewing of the Intel SoC Watch results is supported with any version of the VTune Profiler.
Wakelock Pane
The Wakelock pane shows the list of wakelock objects for the user/application or kernel, depending on the
grouping selected. Change the grouping selection to view data about either kernel or application/user
wakelocks and the process that caused the lock or unlock. The following grouping levels and combinations of
these grouping levels are available from the Grouping drop-down menu:
• Kernel wakelock
• Locking processes
• Application name
• User locking process
• User wakelock tag
The grid displays the sample counts for each object. Click the expand/collapse buttons in the data
columns to expand the column and show data for different wakelock objects. By default, the table is sorted
by the Kernel Wakelock/Locking Process/Locking Thread grouping in descending order, which provides
the objects with the highest total lock duration first. You can change the unit displayed by right-clicking a
data cell and selecting an alternate unit from the Show Data As option.
The following columns are available for kernel wakelocks.
Wakelock Lock Count by Lock Reason: Number of wakelocks for the following reasons: Process, Existing Lock,
Unknown. An Existing Lock was already started when the collection began.
Wakelock Unlock Count by Unlock Reason: The number of wakelock unlocks for the following reasons:
Process, Timeout, Overwritten, Unknown. An Unknown wakelock unlock reason may mean that the wakelock
continued after the collection ended.
In the following example, the PowerManagerService.Wakelocks kernel wakelock had already started before
the collection began and continued after the collection ended.
Application/user wakelocks show information for the APK name rather than the wakelock name, as kernel
wakelocks do. The following columns are available for application/user wakelocks:
In the following example, two wakelocks originated from the com.intel.wakelockapp APK.
Timeline Pane
Toolbar: Navigation control to zoom in/out on the view on areas of interest. For more details on the Timeline
control, see Managing Timeline View.
Legend: Types of data presented on the timeline. Filter in/out any type of data presented on the timeline by
selecting/deselecting the corresponding check boxes. For example, you may only be interested in the
application/user wakelock data and want to remove the kernel wakelock timelines for an expanded view of
the application/user wakelock data.
Application Name: Graphical representation of the wakelock duration for each application APK.
Wakelock: Graphical representation of the kernel wakelock duration through the collection time.
Wakelock Details: Hover over the timeline of an application to view tooltips with details such as the wakelock
type, start time, duration, locking and unlocking process name, and application name. Hover over the
timeline of a kernel wakelock to view tooltips with details such as the wakelock type, start time, duration,
locking and unlocking reasons, and locking process. Zoom in on the timeline to view the exact time when the
wakelock started and when it was released. It is possible for one wakelock to begin before another ends,
causing an overlap.
Filters applied on a timeline in one window are applied on all other windows within the viewpoint. This is
useful if you identify an issue on one tab and want to see how the issue impacts the metrics shown on a
different tab. For more details on the timeline control, see Managing Timeline View.
See Also
Interpreting Energy Analysis Data
Viewpoint
Grouping Data
1. Examine the FP_ASSIST and OTHER_ASSISTS events to determine the specific cause.
2. Add compiler options that eliminate x87 code and enable the DAZ (denormals-are-zero) and FTZ
(flush-to-zero) modes.
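As an illustration only (not the compiler-option approach described above), DAZ and FTZ can also be enabled programmatically through SSE control-register intrinsics. A minimal sketch:

#include <xmmintrin.h>  // _MM_SET_FLUSH_ZERO_MODE
#include <pmmintrin.h>  // _MM_SET_DENORMALS_ZERO_MODE

void enable_daz_ftz() {
    // Flush denormal results to zero and treat denormal inputs as zero.
    // Note: these modes deviate from IEEE 754 semantics.
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
    _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);
}

The setting applies per thread, so call it on every thread that performs the affected floating-point work.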
Average Bandwidth
Metric Description
Average bandwidth utilization during the analysis.
Back-End Bound
Metric Description
Back-End Bound metric represents a Pipeline Slots fraction where no uOps are being delivered due to a lack
of required resources for accepting new uOps in the Back-End. Back-End is a portion of the processor core
where an out-of-order scheduler dispatches ready uOps into their respective execution units, and, once
completed, these uOps get retired according to program order. For example, stalls due to data-cache misses
or stalls due to the divider unit being overloaded are both categorized as Back-End Bound. Back-End Bound is
further divided into two main categories: Memory Bound and Core Bound.
Possible Issues
A significant proportion of pipeline slots are remaining empty. When operations take too long in the back-
end, they introduce bubbles in the pipeline that ultimately cause fewer pipeline slots containing useful work
to be retired per cycle than the machine is capable of supporting. This opportunity cost results in slower
execution. Long-latency operations like divides and memory operations can cause this, as can too many
operations being directed to a single execution port (for example, more multiply operations arriving in the
back-end per cycle than the execution unit can support).
Memory Bandwidth
Metric Description
This metric represents a fraction of cycles during which an application could be stalled due to approaching
bandwidth limits of the main memory (DRAM). This metric does not aggregate requests from other threads/
cores/sockets (see Uncore counters for that). Consider improving data locality in NUMA multi-socket
systems.
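As one hedged illustration of the data-locality advice above, a common technique on Linux NUMA systems is first-touch placement: initialize data with the same threads that later use it, so each page is physically allocated near the core that accesses it. The sketch below uses OpenMP and assumes the default first-touch page placement policy; the function and sizes are hypothetical.

#include <cstddef>

double* alloc_first_touch(std::size_t n) {
    // Uninitialized allocation: pages are not touched yet.
    double* a = new double[n];
    // First touch in parallel: each page is placed on the NUMA node of the
    // thread that writes it first, matching the later parallel access pattern.
    #pragma omp parallel for schedule(static)
    for (long long i = 0; i < static_cast<long long>(n); ++i) {
        a[i] = 0.0;
    }
    return a;
}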
LLC Miss
Metric Description
The LLC (last-level cache) is the last, and longest-latency, level in the memory hierarchy before main
memory (DRAM). Any memory requests missing here must be serviced by local or remote DRAM, with
significant latency. The LLC Miss metric shows a ratio of cycles with outstanding LLC misses to all cycles.
Possible Issues
A high number of CPU cycles is being spent waiting for LLC load misses to be serviced. Possible optimizations
are to reduce the data working set size, improve data access locality, block and consume data in chunks
that fit in the LLC, or better exploit the hardware prefetchers. Consider using software prefetchers, but note
that they can increase latency by interfering with normal loads and can increase pressure on the memory
system.
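As an illustration of blocking data into chunks that fit in a cache level, the sketch below tiles a simple matrix transpose. The block size of 64 is an assumption to be tuned for the target cache; the function and array names are hypothetical.

// Blocked (tiled) transpose of an n x n row-major matrix. Processing the
// matrix in BLOCK x BLOCK tiles keeps the working set of the inner loops
// small enough to stay resident in the cache, improving reuse of loaded lines.
constexpr int BLOCK = 64;  // tune to the cache level you are targeting

void transpose_blocked(const double* src, double* dst, int n) {
    for (int ii = 0; ii < n; ii += BLOCK) {
        for (int jj = 0; jj < n; jj += BLOCK) {
            for (int i = ii; i < ii + BLOCK && i < n; ++i) {
                for (int j = jj; j < jj + BLOCK && j < n; ++j) {
                    dst[static_cast<long long>(j) * n + i] =
                        src[static_cast<long long>(i) * n + j];
                }
            }
        }
    }
}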
UTLB Overhead
Metric Description
This metric represents a fraction of cycles spent on handling first-level data TLB (or UTLB) misses. As with
ordinary data caching, focus on improving data locality and reducing working-set size to reduce UTLB
overhead. Additionally, consider using profile-guided optimization (PGO) to collocate frequently-used data on
the same page. Try using larger page sizes for large amounts of frequently-used data. This metric does not
include store TLB misses.
Possible Issues
A significant proportion of cycles is being spent handling first-level data TLB misses. As with ordinary data
caching, focus on improving data locality and reducing working-set size to reduce UTLB overhead.
Additionally, consider using profile-guided optimization (PGO) to collocate frequently-used data on the same
page. Try using larger page sizes for large amounts of frequently-used data.
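On Linux, one hedged example of requesting larger pages for a big, frequently-used allocation is transparent huge pages via madvise; availability and effect depend on the kernel configuration, and the function name here is hypothetical.

#include <sys/mman.h>
#include <cstddef>

void* alloc_huge_hint(std::size_t bytes) {
    // Anonymous mapping; MADV_HUGEPAGE asks the kernel to back it with
    // transparent huge pages where possible, reducing TLB pressure.
    void* p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) return nullptr;
    madvise(p, bytes, MADV_HUGEPAGE);  // advisory only; failure can be ignored
    return p;
}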
Port Utilization
Metric Description
This metric represents a fraction of cycles during which an application was stalled due to Core non-divider-
related issues, for example, heavy data dependency between nearby instructions or a sequence of
instructions that overloads specific ports. Hint: loop vectorization (most compilers feature auto-vectorization
options today) reduces pressure on the execution ports, as multiple elements are calculated with the same
uOp.
Possible Issues
A significant fraction of cycles was stalled due to Core non-divider-related issues.
Tips
Use vectorization to reduce pressure on the execution ports, as multiple elements are calculated with the
same uOp.
Port 0
Metric Description
This metric represents the fraction of Core cycles during which the CPU dispatched uOps on execution port 0
(SNB+: ALU; HSW+: ALU and 2nd branch).
Port 1
Metric Description
This metric represents the fraction of Core cycles during which the CPU dispatched uOps on execution port 1
(ALU).
Port 2
Metric Description
This metric represents the fraction of Core cycles during which the CPU dispatched uOps on execution port 2
(Loads and Store-address).
Port 3
Metric Description
This metric represents the fraction of Core cycles during which the CPU dispatched uOps on execution port 3
(Loads and Store-address).
Port 4
Metric Description
This metric represents the fraction of Core cycles during which the CPU dispatched uOps on execution port 4
(Store-data).
Possible Issues
This metric represents the fraction of Core cycles during which the CPU dispatched uOps on execution port 4
(Store-data). Note that this metric value may be highlighted due to a Split Stores issue.
Port 5
Metric Description
This metric represents the fraction of Core cycles during which the CPU dispatched uOps on execution port 5
(SNB+: Branches and ALU; HSW+: ALU).
Port 6
Metric Description
This metric represents the fraction of Core cycles during which the CPU dispatched uOps on execution port 6
(Branches and simple ALU).
Port 7
Metric Description
This metric represents the fraction of Core cycles during which the CPU dispatched uOps on execution port 7
(simple Store-address).
BACLEARS
Metric Description
This metric estimates a fraction of cycles lost due to the Branch Target Buffer (BTB) prediction corrected by a
later branch predictor.
Possible Issues
A significant number of CPU cycles is lost due to Branch Target Buffer (BTB) predictions corrected by a later
branch predictor. Consider reducing the number of taken branches.
Superscalar processors can be conceptually divided into the 'front-end', where instructions are fetched and
decoded into the operations that constitute them; and the 'back-end', where the required computation is
performed. Each cycle, the front-end generates up to four of these operations placed into pipeline slots that
then move through the back-end. Thus, for a given execution duration in clock cycles, it is easy to determine
the maximum number of pipeline slots containing useful work that can be retired in that duration. The actual
number of retired pipeline slots containing useful work, though, rarely equals this maximum. This can be due
to several factors: some pipeline slots cannot be filled with useful work, either because the front-end could
not fetch or decode instructions in time ('Front-end bound' execution) or because the back-end was not
prepared to accept more operations of a certain kind ('Back-end bound' execution). Moreover, even pipeline
slots that do contain useful work may not retire due to bad speculation. Front-end bound execution may be
due to a large code working set, poor code layout, or microcode assists. Back-end bound execution may be
due to long-latency operations or other contention for execution resources. Bad speculation is most
frequently due to branch misprediction.
Possible Issues
A significant proportion of pipeline slots are remaining empty. When operations take too long in the back-
end, they introduce bubbles in the pipeline that ultimately cause fewer pipeline slots containing useful work
to be retired per cycle than the machine is capable of supporting. This opportunity cost results in slower
execution. Long-latency operations like divides and memory operations can cause this, as can too many
operations being directed to a single execution port (for example, more multiply operations arriving in the
back-end per cycle than the execution unit can support).
FP Arithmetic
Metric Description
This metric represents the overall fraction of arithmetic floating-point (FP) uOps the CPU has executed
(retired).
FP Assists
Metric Description
Certain floating point operations cannot be handled natively by the execution pipeline and must be performed
by microcode (small programs injected into the execution stream). For example, when working with very
small floating point values (so-called denormals), the floating-point units are not set up to perform these
operations natively. Instead, a sequence of instructions to perform the computation on the denormal is
injected into the pipeline. Since these microcode sequences might be hundreds of instructions long, these
microcode assists are extremely deleterious to performance.
Possible Issues
A significant portion of execution time is spent in floating point assists.
Tips
Consider enabling the DAZ (Denormals Are Zero) and/or FTZ (Flush To Zero) options in your compiler to flush
denormals to zero. These options may improve performance if the denormal values are not critical in your
application. Also note that the DAZ and FTZ modes are not compatible with the IEEE Standard 754.
FP Scalar
Metric Description
This metric represents the fraction of arithmetic floating-point (FP) scalar uOps the CPU has executed.
Analyze metric values to identify why vector code is not generated, which is typically caused by the selected
algorithm or missing/wrong compiler switches.
FP Vector
Metric Description
This metric represents the fraction of arithmetic floating-point (FP) vector uOps the CPU has executed. Make
sure the vector width is as expected.
FP x87
Metric Description
This metric represents the fraction of floating-point (FP) x87 uOps the CPU has executed. It accounts for
instructions beyond x87 FP arithmetic operations; hence, it may be used as an indicator of high x87 usage,
which should preferably be upgraded to a modern ISA. Consider compiler flags that generate newer SSE or
AVX instruction sets, which typically perform better and provide vector capability.
MS Assists
Metric Description
Certain corner-case operations cannot be handled natively by the execution pipeline and must be performed
by the microcode sequencer (MS), where 1 or more uOps are issued. The microcode sequencer performs
microcode assists (small programs injected into the execution stream), inserting flows, and writing to the
instruction queue (IQ). For example, when working with very small floating point values (so-called
denormals), the floating-point units are not set up to perform these operations natively. Instead, a sequence
of instructions to perform the computation on the denormal is injected into the pipeline. Since these
microcode sequences might be hundreds of instructions long, these microcode assists are extremely
deleterious to performance.
Possible Issues
A significant portion of execution time is spent in microcode assists, inserted flows, and writing to the
instruction queue (IQ). Examine the FP Assist and SIMD Assist metrics to determine the specific cause.
Branch Mispredict
Metric Description
When a branch mispredicts, some instructions from the mispredicted path still move through the pipeline. All
work performed on these instructions is wasted since they would not have been executed had the branch
been correctly predicted. This metric represents the fraction of slots the CPU has wasted due to Branch
Misprediction. These slots are either wasted by uOps fetched from an incorrectly speculated program path or
by stalls when the out-of-order part of the machine needs to recover its state from a speculative path.
Possible Issues
A significant proportion of branches are mispredicted, leading to excessive wasted work or Back-End stalls
due to the machine needing to recover its state from a speculative path.
Tips
1. Identify heavily mispredicted branches and consider making your algorithm more predictable or reducing
the number of branches. You can add more work to 'if' statements and move them higher in the code flow for
earlier execution. If you use 'switch' or 'case' statements, put the most commonly executed cases first (see
the sketch after this section). Avoid using virtual function pointers for heavily executed calls.
2. Use profile-guided optimization in the compiler.
See the Intel 64 and IA-32 Architectures Optimization Reference Manual for general strategies to address
branch misprediction issues.
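The following sketch illustrates the most-common-case-first ordering mentioned in the tips above. The packet kinds and their frequencies are hypothetical and would normally come from your own profiling data; the same idea applies to ordering switch/case statements.

enum PacketKind { PACKET_DATA, PACKET_ACK, PACKET_RESET };

// Test the most common condition first so the branch usually goes the same
// way and predicts well. Assumed traffic mix: DATA ~90%, ACK ~9%, rest rare.
int handle_packet(PacketKind kind) {
    if (kind == PACKET_DATA)  return 0;   // hot path first
    if (kind == PACKET_ACK)   return 1;
    if (kind == PACKET_RESET) return 2;   // rare paths last
    return -1;
}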
Bus Lock
Metric Description
Intel processors provide a LOCK# signal that is asserted automatically during certain critical memory
operations to lock the system bus or equivalent link. While this output signal is asserted, requests from other
processors or bus agents for control of the bus are blocked. This metric measures the ratio of bus cycles
during which a LOCK# signal is asserted on the bus. The LOCK# signal is asserted when there is a locked
memory access due to uncacheable memory, a locked operation that spans two cache lines, or a page walk
from an uncacheable page table.
Possible Issues
Bus locks have a very high performance penalty. It is highly recommended to avoid locked memory accesses
to improve memory concurrency.
Tips
Examine the BUS_LOCK_CLOCKS.SELF event in the source/assembly view to determine where the LOCK#
signals are asserted from. If they come from themselves, look at Back-end issues, such as memory latency
or reissues. Account for skid.
Cache Bound
Metric Description
This metric shows how often the machine was stalled on L1, L2, and L3 caches. While cache hits are serviced
much more quickly than hits in DRAM, they can still incur a significant performance penalty. This metric also
includes coherence penalties for shared data.
Possible Issues
A significant proportion of cycles are being spent on data fetches from caches. Check Memory Access analysis
to see if accesses to L2 or L3 caches are problematic and consider applying the same performance tuning as
you would for a cache-missing workload. This may include reducing the data working set size, improving data
access locality, blocking or partitioning the working set to fit in the lower cache levels, or exploiting hardware
prefetchers. Consider using software prefetchers, but note that they can interfere with normal loads, increase
latency, and increase pressure on the memory system. This metric includes coherence penalties for shared
data. Check Microarchitecture Exploration analysis to see if contested accesses or data sharing are indicated
as likely issues.
Clears Resteers
Metric Description
This metric measures the fraction of cycles the CPU was stalled due to Branch Resteers as a result of Machine
Clears.
Possible Issues
A significant fraction of cycles could be stalled due to Branch Resteers as a result of Machine Clears.
The CPI value of an application or function is an indication of how much latency affected its execution. Higher
CPI values mean there was more latency in your system - on average, it took more clockticks for an
instruction to retire. Latency in your system can be caused by cache misses, I/O, or other bottlenecks.
When you want to determine where to focus your performance tuning effort, the CPI is the first metric to
check. A good CPI rate indicates that the code is executing optimally.
The main way to use CPI is by comparing a current CPI value to a baseline CPI for the same workload. For
example, suppose you made a change to your system or your code and then ran the VTune Profiler and
collected CPI. If the performance of the application decreased after the change, one way to understand what
may have happened is to look for functions where CPI increased. If you have made an optimization that
improved the runtime of your application, you can look at VTune Profiler data to see if CPI decreased. If it
did, you can use that information to help direct you toward further investigations. What caused CPI to
decrease? Was it a reduction in cache misses, fewer memory operations, lower memory latency, and so on?
How do I know when CPI is high?
The CPI of a workload depends on the code, the processor, and the system configuration.
VTune Profiler analyzes the CPI value against the threshold set up by Intel architects. These numbers can be
used as a general guide:
Good: 0.75    Poor: 4
A CPI < 1 is typical for instruction-bound code, while a CPI > 1 may show up for a stall-cycle-bound
application, which is also likely memory bound.
If a CPI value exceeds the threshold, the VTune Profiler highlights this value in pink.
A high value for this ratio (>1) indicates that, over the current code region, instructions are taking a high
number of processor clocks to execute. This could indicate a problem if most of the instructions are not
predominantly high-latency instructions and/or coming from microcode ROM. In this case there may be
opportunities to modify your code to improve the efficiency with which instructions are executed within the
processor.
For processors with Intel® Hyper-Threading Technology, this ratio measures the CPI for the phases where the
physical package is not in any sleep mode, that is, at least one logical processor in the physical package is in
use. Clockticks are continuously counted on logical processors even if the logical processor is in a halted
state (not executing instructions). This can impact the logical processor's CPI ratio because the Clockticks
event continues to be accumulated while the Instructions Retired event is unchanged. A high CPI value still
indicates a performance problem; however, a high CPI value on a specific logical processor could indicate poor
CPU usage and not an execution problem.
If your application is threaded, CPI at all code levels is affected. The Clockticks event counts independently
on each logical processor; parallel execution is not accounted for.
For example, consider the following:
Function XYZ on logical processor 0: 4000 Clockticks / 1000 Instructions
Function XYZ on logical processor 1: 4000 Clockticks / 1000 Instructions
The CPI for function XYZ is (8000 / 2000) = 4.0. If parallel execution were taken into account in Clockticks,
the CPI would be (4000 / 2000) = 2.0. Knowledge of the application behavior is necessary when interpreting
the Clockticks event data.
What are the pitfalls of using CPI?
CPI can be misleading, so you should understand the pitfalls. CPI (latency) is not the only factor affecting the
performance of your code on your system. The other major factor is the number of instructions executed
(sometimes called path length). All optimizations or changes you make to your code will affect either the
time to execute instructions (CPI) or the number of instructions to execute, or both. Using CPI without
considering the number of instructions executed can lead to an incorrect interpretation of your results. For
example, you vectorized your code and converted your math operations to operate on multiple pieces of data
at once. This would have the effect of replacing many single-data math instructions with fewer multiple-data
math instructions. This would reduce the number of instructions executed overall in your code, but it would
likely raise your CPI because multiple-data instructions are more complex and take longer to execute. In
many cases, this vectorization would increase your performance, even though CPI went up.
It is important to be aware of your total instructions executed as well. The number of instructions executed is
generally called INST_RETIRED in the VTune Profiler. If your instructions retired remain fairly constant,
CPI can be a good indicator of performance (this is the case with system tuning, for example). If both the
number of instructions and CPI are changing, you need to look at both metrics to understand why your
performance increased or decreased. Finally, an alternative to looking at CPI is applying the top-down
method.
CPI Rate
Metric Description
Cycles per Instruction Retired, or CPI, is a fundamental performance metric indicating approximately how
much time each executed instruction took, in units of cycles. Modern superscalar processors issue up to four
instructions per cycle, suggesting a theoretical best CPI of 0.25. But various effects (long-latency memory,
floating-point, or SIMD operations; non-retired instructions due to branch mispredictions; instruction
starvation in the front-end) tend to pull the observed CPI up. A CPI of 1 is generally considered acceptable
for HPC applications but different application domains will have very different expected values. Nonetheless,
CPI is an excellent metric for judging an overall potential for application performance tuning.
Possible Issues
The CPI may be too high. This could be caused by issues such as memory stalls, instruction starvation,
branch misprediction or long latency instructions. Explore the other hardware-related metrics to identify what
is causing high CPI.
Cycles per Instructions Retired is a fundamental performance metric indicating the average amount of time
each instruction took to execute, in units of cycles. For Intel Atom processors, the theoretical best CPI per
thread is 0.50, but CPIs over 2.0 warrant investigation. High CPI values may indicate latency in the system
that could be reduced, such as long-latency memory or floating-point operations, non-retired instructions due
to branch mispredictions, or instruction starvation in the front-end. Be aware that some optimizations, such
as SIMD, use fewer (but more complex) instructions, which can increase CPI, while debug code can add
redundant instructions, which can decrease CPI.
Possible Issues
The CPI may be too high. This could be caused by issues such as memory stalls, instruction starvation,
branch misprediction or long latency instructions. Explore the other hardware-related metrics to identify what
is causing high CPI.
CPU Time
Metric Description
CPU Time is time during which the CPU is actively executing your application.
Core Bound
Metric Description
This metric represents how much Core non-memory issues were a bottleneck. A shortage of hardware
compute resources and dependencies between the software's instructions are both categorized under Core
Bound. Hence, it may indicate that the machine ran out of out-of-order (OOO) resources, that certain
execution units are overloaded, or that dependencies in the program's data or instruction flow are limiting
performance (for example, chains of long-latency FP arithmetic operations).
CPU Frequency
Metric Description
Frequency calculated from the APERF/MPERF MSR registers captured on the clock cycles event.
It is a software frequency that shows the average logical core frequency between two samples. The smaller
the sampling interval, the closer the metric is to the real hardware frequency.
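For reference, a commonly used relationship (stated here as an illustration of the APERF/MPERF approach rather than the exact implementation) is:
Average frequency ≈ Base (TSC) frequency × ΔAPERF / ΔMPERF, where ΔAPERF and ΔMPERF are the increments of the respective MSRs between two samples.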
CPU Time
Metric Description
CPU Time is time during which the CPU is actively executing your application.
CPU Utilization
Metric Description
This metric evaluates the parallel efficiency of your application. It estimates the percentage of all the logical
CPU cores in the system that are used by your application, without including the overhead introduced by the
parallel runtime system. 100% utilization means that your application keeps all the logical CPU cores busy for
the entire time that it runs.
Depending on the analysis type, you can see the CPU Utilization data in the Bottom-up grid (HPC
Performance Characterization), on the Timeline pane, and in the Summary window on the Effective CPU
Utilization histogram:
Utilization Histogram
For the histogram, the Intel® VTune™ Profiler identifies a processor utilization scale, calculates the target CPU
utilization, and defines default utilization ranges depending on the number of processor cores. You can
change the utilization ranges by dragging the sliders, if required.
Idle: Idle utilization. By default, if the CPU Time on all threads is less than 0.5 of 100% CPU Time on one
core, such CPU utilization is classified as idle. Formula: Σ(i=1..ThreadsCount) CPUTime(T,i)/T < 0.5, where
CPUTime(T,i) is the total CPU Time on thread i on interval T.
Poor: Poor utilization. By default, poor utilization is when the number of simultaneously running CPUs is less
than or equal to 50% of the target CPU utilization.
Ideal: Ideal utilization. By default, ideal utilization is when the number of simultaneously running CPUs is
between 86% and 100% of the target CPU utilization.
VTune Profiler treats the Spin and Overhead time as Idle CPU utilization. Different analysis types may
recognize Spin and Overhead time differently depending on availability of call stack information. This may
result in a difference of CPU Utilization graphical representation per analysis type.
For the HPC Performance Characterization analysis, the VTune Profiler differentiates Effective Physical Core
Utilization vs. Effective Logical Core Utilization for all systems other than Intel® Xeon Phi™ processors
code named Knights Mill and Knights Landing.
For Intel® Xeon Phi™ processors code named Knights Mill and Knights Landing, as well as systems with Intel
Hyper-Threading Technology (Intel HT Technology) OFF, only the generic Effective CPU Utilization metric is
provided.
CPU Utilization vs. Thread Efficiency
CPU Utilization may be higher than the Thread Efficiency (available for Threading analysis) if a thread is
executing code on a CPU while it is logically waiting (that is, the thread is spinning).
CPU Utilization may be lower than the Thread Efficiency if:
1. The concurrency level is higher than the number of available cores (oversubscription) and, thus,
reaching this level of CPU utilization is not possible. Generally, large oversubscription negatively impacts
the application performance since it causes excessive context switching.
2. There was a period when the profiled process was swapped out. Thus, while it was not logically waiting,
it was not scheduled for any CPU either.
Possible Issues
The metric value is low, which may signal a poor logical CPU cores utilization caused by load imbalance,
threading runtime overhead, contended synchronization, or thread/process underutilization. Explore CPU
Utilization sub-metrics to estimate the efficiency of MPI and OpenMP parallelism or run the Threading
analysis to identify parallel bottlenecks for other parallel runtimes.
Cycles of 3+ Ports Utilized
Metric Description
This metric represents the fraction of Core cycles during which the CPU executed a total of 3 or more uOps
per cycle across all execution ports.
Divider
Metric Description
Not all arithmetic operations take the same amount of time. Divides and square roots, both performed by the
DIV unit, take considerably longer than integer or floating point addition, subtraction, or multiplication. This
metric represents cycles fraction where the Divider unit was active.
Possible Issues
The DIV unit is active for a significant portion of execution time.
Tips
Locate the hot long-latency operation(s) and try to eliminate them. For example, if dividing by a constant,
consider replacing the divide with a multiply by the reciprocal of the constant. If dividing an integer by a
power of two, consider using a right-shift instead.
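A minimal sketch of the replacements suggested above; the constants and function names are assumptions for illustration.

// Division by a floating-point constant replaced with multiplication by its
// reciprocal (may differ slightly in rounding from a true divide).
double scale(double x) {
    constexpr double inv3 = 1.0 / 3.0;    // computed at compile time
    return x * inv3;                       // instead of x / 3.0
}

// Integer division by a power of two replaced with a right shift
// (valid for non-negative values).
unsigned int quarter(unsigned int x) {
    return x >> 2;                         // instead of x / 4
}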
Effective Physical Core Utilization
Metric Description
This metric represents how efficiently the application utilized the physical CPU cores available and helps
evaluate the parallel efficiency of the application. It shows the percent of average utilization of all physical
CPU cores on the system. Effective Physical Core Utilization contains only effective time and does not include
spin and overhead. A utilization of 100% means that all of the physical CPU cores were loaded by
computations of the application.
Possible Issues
The metric value is low, which may signal a poor physical CPU cores utilization caused by:
• load imbalance
• threading runtime overhead
• contended synchronization
• thread/process underutilization
• incorrect affinity that utilizes logical cores instead of physical cores
Explore sub-metrics to estimate the efficiency of MPI and OpenMP parallelism or run the Locks and Waits
analysis to identify parallel bottlenecks for other parallel runtimes.
Effective Time
Metric Description
Effective Time is CPU time spent in the user code. This metric does not include Spin and Overhead time.
Elapsed Time
Metric Description
Elapsed time is the wall time from the beginning to the end of collection.
Execution Stalls
Metric Description
Execution stalls may signify that a machine is running at full capacity, with no computation resources wasted.
Sometimes, however, long-latency operations can serialize while waiting for critical computation resources.
This metric is the ratio of cycles with no micro-operations executed to all cycles.
Possible Issues
The percentage of cycles with no micro-operations executed is high. Look for long-latency operations at code
regions with high execution stalls and try to use alternative methods or lower latency operations. For
example, consider replacing 'div' operations with right-shifts, or try to reduce the latency of memory
accesses.
False Sharing
Metric Description
This metric shows how often the CPU was stalled on store operations to a shared cache line. False sharing
can often be avoided by padding data structures so that threads access different cache lines.
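A minimal sketch of the padding technique mentioned above; the 64-byte cache line size is an assumption that matches typical Intel processors, and the counter layout is hypothetical.

#include <cstddef>

// Without padding, counters used by different threads can share one cache
// line, causing the line to bounce between cores on every store.
struct alignas(64) PaddedCounter {
    long value = 0;
    char pad[64 - sizeof(long)];  // keep each counter on its own cache line
};

PaddedCounter per_thread_counter[8];  // one element per thread, no false sharing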
Far Branch
Metric Description
This metric indicates when a call/return is using a far pointer. A far call is often used to transfer from user
code to privileged code.
Possible Issues
Transferring from user to privileged code may be too frequent. Consider reducing calls to system APIs.
FPU Utilization
Metric Description
This metric represents how intensively your program uses the FPU. 100% means that the FPU is fully loaded
and is retiring a vector instruction with full capacity every cycle of the application execution.
Possible Issues
The metric value is low. This can indicate poor FPU utilization because of non-vectorized floating point
operations, or inefficient vectorization due to legacy vector instruction set or memory access pattern issues.
Consider using vector analysis in Intel Advisor for data and tips to improve vectorization efficiency in your
application.
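As a simple illustration of a loop that is amenable to packed (vector) floating-point execution, the sketch below uses the OpenMP SIMD directive; the function and array names are hypothetical, and the directive requires compiling with OpenMP SIMD support (for example, -qopenmp-simd with the Intel compiler).

// A unit-stride loop with no loop-carried dependence; the OpenMP SIMD
// directive asks the compiler to generate packed (vector) FP instructions.
void saxpy(float a, const float* x, float* y, int n) {
    #pragma omp simd
    for (int i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}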
% of Packed FP Instructions
Metric Description
This metric represents the percentage of all packed floating point instructions.
% of Scalar FP Instructions
Metric Description
This metric represents the percentage of scalar floating point instructions.
Loop Type
Metric Description
Displays a loop type (body, peel, remainder) based on the Intel Compiler opt-report information.
Vector Instruction Set
Metric Description
Displays the Vector Instruction Set used for arithmetic floating point computations and memory access
operations.
Possible Issues
You are not using a modern vectorization instruction set. Consider recompiling your code using compiler
options that allow using a modern vectorization instruction set. See the compiler User and Reference Guide
for C++ or Fortran for more details.
Front-End Bandwidth
Metric Description
This metric represents a fraction of slots during which CPU was stalled due to front-end bandwidth issues,
such as inefficiencies in the instruction decoders or code restrictions for caching in the DSB (decoded uOps
cache). In such cases, the front-end typically delivers a non-optimal amount of uOps to the back-end.
Front-End Bound
Metric Description
Front-End Bound metric represents a slots fraction where the processor's Front-End undersupplies its Back-
End. Front-End denotes the first part of the processor core responsible for fetching operations that are
executed later on by the Back-End part. Within the Front-End, a branch predictor predicts the next address to
fetch, cache-lines are fetched from the memory subsystem, parsed into instructions, and lastly decoded into
micro-ops (uOps). Front-End Bound metric denotes unutilized issue-slots when there is no Back-End stall
(bubbles where Front-End delivered no uOps while Back-End could have accepted them). For example, stalls
due to instruction-cache misses would be categorized as Front-End Bound.
Possible Issues
A significant portion of Pipeline Slots is remaining empty due to issues in the Front-End.
Tips
Make sure the code working set size is not too large and the code layout does not require too many memory
accesses per cycle to get enough instructions for filling four pipeline slots, or check for microcode assists.
Front-End Other
Metric Description
This metric accounts for those slots that were not delivered by the front-end and do not count as a common
front-end stall.
Possible Issues
A significant portion of pipeline slots was not delivered by the front-end for reasons that do not classify as a
common front-end stall.
Branch Resteers
Metric Description
This metric represents the fraction of cycles the CPU was stalled due to Branch Resteers.
Possible Issues
A significant fraction of cycles was stalled due to Branch Resteers. Branch Resteers estimate the Front-End
delay in fetching operations from corrected path, following all sorts of mispredicted branches. For example,
branchy code with lots of mispredictions might get categorized as Branch Resteers. Note the value of this
node may overlap its siblings.
DSB Switches
Metric Description
Intel microarchitecture code name Sandy Bridge introduces a new decoded ICache. This cache, called the
DSB (Decoded Stream Buffer), stores uOps that have already been decoded, avoiding many of the penalties
of the legacy decode pipeline, called the MITE (Micro-instruction Translation Engine). However, when control
flows out of the region cached in the DSB, the front-end incurs a penalty as uOp issue switches from the DSB
to the MITE. This metric measures this penalty.
Possible Issues
A significant portion of cycles is spent switching from the DSB to the MITE. This may happen if a hot code
region is too large to fit into the DSB.
Tips
Consider changing code layout (for example, via profile-guided optimization) to help your hot regions fit into
the DSB.
See the "Optimization for Decoded ICache" section in the Intel 64 and IA-32 Architectures Optimization
Reference Manual for more details.
ICache Misses
Metric Description
To introduce new uOps into the pipeline, the core must either fetch them from a decoded instruction cache,
or fetch the instructions themselves from memory and then decode them. In the latter path, the requests to
memory first go through the L1I (level 1 instruction) cache that caches the recent code working set. Front-
end stalls can accrue when fetched instructions are not present in the L1I. Possible reasons are a large code
working set or fragmentation between hot and cold code. In the latter case, when a hot instruction is fetched
into the L1I, any cold code on its cache line is brought along with it. This may result in the eviction of other,
hotter code.
Possible Issues
A significant proportion of instruction fetches are missing in the instruction cache.
Tips
1. Use profile-guided optimization to reduce the size of hot code regions.
2. Consider compiler options to reorder functions so that hot functions are located together.
3. If your application makes significant use of macros, try to reduce this by either converting the relevant
macros to functions or using linker options to eliminate repeated code.
4. Consider the Os/O1 optimization level or the following subset of optimizations to decrease your code
footprint:
• Use inlining only when it decreases the footprint.
• Disable loop unrolling.
• Disable intrinsic inlining.
ITLB Overhead
Metric Description
In x86 architectures, mappings between virtual and physical memory are facilitated by a page table, which is
kept in memory. To minimize references to this table, recently-used portions of the page table are cached in
a hierarchy of 'translation look-aside buffers', or TLBs, which are consulted on every virtual address
translation. As with data caches, the farther a request has to go to be satisfied, the worse the performance
impact. This metric estimates the performance penalty of page walks induced on ITLB (instruction TLB)
misses.
Possible Issues
A significant proportion of cycles is spent handling instruction TLB misses.
Tips
1. Use profile-guided optimization and IPO to reduce the size of hot code regions.
2. Consider compiler options to reorder functions so that hot functions are located together.
3. If your application makes significant use of macros, try to reduce this by either converting the relevant
macros to functions or using linker options to eliminate repeated code.
4. For Windows targets, add function splitting.
5. Consider using large code pages.
See the "Length-Changing Prefixes (LCP)" section in the Intel 64 and IA-32 Architectures Optimization
Reference Manual.
MS Switches
Metric Description
This metric represents a fraction of cycles when the CPU was stalled due to switches of uop delivery to the
Microcode Sequencer (MS). Commonly used instructions are optimized for delivery by the DSB or MITE
pipelines. Certain operations cannot be handled natively by the execution pipeline, and must be performed
by microcode (small programs injected into the execution stream). Switching to the MS too often can
negatively impact performance. The MS is designated to deliver long uOp flows required by CISC instructions
like CPUID, or uncommon conditions like Floating Point Assists when dealing with Denormals.
Possible Issues
A significant fraction of cycles was stalled due to switches of uOp delivery to the Microcode Sequencer (MS).
Commonly used instructions are optimized for delivery by the DSB or MITE pipelines. Certain operations
cannot be handled natively by the execution pipeline, and must be performed by microcode (small programs
injected into the execution stream). Switching to the MS too often can negatively impact performance. The
MS is designated to deliver long uOp flows required by CISC instructions like CPUID, or uncommon conditions
like Floating Point Assists when dealing with Denormals. Note that this metric value may be highlighted due to a Microcode Sequencer issue.
Front-End Latency
Metric Description
This metric represents the fraction of slots during which the CPU was stalled due to front-end latency issues, such as instruction-cache misses, ITLB misses, or fetch stalls after a branch misprediction. In such cases, the front-end delivers no uOps.
General Retirement
Metric Description
This metric represents the fraction of slots during which the CPU was retiring uOps that did not originate from the Microcode Sequencer. It correlates with the total number of instructions executed by the program; a uOps-per-Instruction ratio of 1 is expected. While this is the most desirable of the top four categories, high values may still indicate areas for improvement. If possible, focus on techniques that reduce the instruction count or generate more efficient instructions, such as vectorization.
Ideal Time
Metric Description
Ideal Time is the estimated time for all parallel regions, assuming they were perfectly load-balanced with zero OpenMP runtime overhead, according to the formula: Total user CPU time in all regions / Number of OpenMP threads.
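For example (hypothetical numbers), if all parallel regions together accumulate 8 seconds of user CPU time and the application runs with 4 OpenMP threads, the Ideal Time estimate for those regions is 8 / 4 = 2 seconds.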
Inactive Time
Metric Description
The time while threads were preempted by the system and remained inactive.
Instruction Starvation
Metric Description
A large code working set size or a high degree of branch misprediction can induce instruction delivery stalls
at the front-end, such as misses in the L1I. Such stalls are called Instruction Starvation. This metric is the
ratio of cycles generated when no instruction was issued by the front-end to all cycles.
Possible Issues
A significant number of CPU cycles is spent waiting for code to be delivered due to L1I misses or other
problems. Look for ways to reduce the code working set, branch misprediction, and the use of virtual
functions.
Interrupt Time
IPC
Metric Description
Instructions Retired per Cycle (IPC) shows the average number of retired instructions per cycle. Modern superscalar processors issue up to four instructions per cycle, suggesting a theoretical best IPC of 4, but various effects (long-latency memory, floating-point, or SIMD operations; non-retired instructions due to branch mispredictions; instruction starvation in the front-end) tend to pull the observed IPC down. An IPC of 1 is generally considered acceptable for HPC applications, but different application domains have very different expected values. Nonetheless, IPC is an excellent metric for judging the overall potential for application performance tuning.
Possible Issues
The IPC may be too low. This could be caused by issues such as memory stalls, instruction starvation, branch
misprediction or long latency instructions. Explore the other hardware-related metrics to identify what is
causing low IPC.
L1 Bound
Metric Description
This metric shows how often the machine was stalled without missing the L1 data cache. The L1 cache typically has the shortest latency. However, in certain cases like loads blocked on older stores, a load might suffer a high latency even though it is being satisfied by the L1.
Possible Issues
This metric shows how often the machine was stalled without missing the L1 data cache. The L1 cache typically has the shortest latency. However, in certain cases like loads blocked on older stores, a load might suffer a high latency even though it is being satisfied by the L1. Note that this metric value may be highlighted due to DTLB Overhead or Cycles of 1 Port Utilized issues.
4K Aliasing
Metric Description
This metric estimates how often memory load accesses were aliased by preceding stores (in program order) with a 4K address offset. A false match may cost a few cycles to re-issue the load. However, a short re-issue duration is often hidden by the out-of-order core and HW optimizations. Hence, you may safely ignore a high value of this metric unless it propagates up into parent nodes of the hierarchy (for example, to L1 Bound).
Possible Issues
A significant proportion of cycles is spent dealing with false 4k aliasing between loads and stores.
Tips
Use the source/assembly view to identify the aliasing loads and stores, and then adjust your data layout so
that the loads and stores no longer alias. See the Intel 64 and IA-32 Architectures Optimization Reference
Manual for more details.
DTLB Overhead
Metric Description
In x86 architectures, mappings between virtual and physical memory are facilitated by a page table, which is
kept in memory. To minimize references to this table, recently-used portions of the page table are cached in
a hierarchy of 'translation look-aside buffers', or TLBs, which are consulted on every virtual address
translation. As with data caches, the farther a request has to go to be satisfied, the worse the performance
impact. This metric estimates the performance penalty paid for missing the first-level data TLB (DTLB) that
includes hitting in the second-level data TLB (STLB) as well as performing a hardware page walk on an STLB
miss.
Possible Issues
A significant proportion of cycles is being spent handling first-level data TLB misses.
Tips
1. As with ordinary data caching, focus on improving data locality and reducing the working-set size to
minimize the DTLB overhead.
2. Consider using profile-guided optimization (PGO) to collocate frequently-used data on the same page.
3. Try using larger page sizes for large amounts of frequently-used data.
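As a Linux-specific sketch of tip 3 (not part of VTune Profiler; it assumes huge pages have been reserved on the system, for example through vm.nr_hugepages), a large, frequently-used buffer can be backed by a 2 MB huge page with mmap, with a fallback to regular pages if the request fails:

#include <sys/mman.h>
#include <cstdlib>

int main() {
    const std::size_t size = 2 * 1024 * 1024;   // one 2 MB huge page
    // MAP_HUGETLB asks the kernel to back the mapping with huge pages,
    // so large buffers need far fewer DTLB entries.
    void* buf = mmap(nullptr, size, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (buf == MAP_FAILED) {
        // Fall back to ordinary 4 KB pages if no huge pages are available.
        buf = mmap(nullptr, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    }
    if (buf == MAP_FAILED) return EXIT_FAILURE;
    // ... place frequently-used data in buf ...
    munmap(buf, size);
    return 0;
}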
FB Full
Metric Description
This metric gives a rough estimate of how often L1D Fill Buffer unavailability prevented additional L1D-miss memory access requests from proceeding. The higher the metric value, the deeper the memory hierarchy level from which the misses are satisfied. A high value often hints at approaching bandwidth limits (to the L2 cache, L3 cache, or external memory).
Possible Issues
L1D Fill Buffer unavailability frequently prevented additional L1D-miss memory access requests from proceeding, which often hints at approaching bandwidth limits (to the L2 cache, L3 cache, or external memory). Avoid adding software prefetches if the application is indeed memory bandwidth limited.
Lock Latency
Metric Description
This metric represents the fraction of cycles the CPU spent handling cache misses due to lock operations. Due to the microarchitecture's handling of locks, they are classified as L1 Bound regardless of which memory source satisfied them.
Possible Issues
A significant fraction of CPU cycles was spent handling cache misses due to lock operations. Due to the microarchitecture's handling of locks, they are classified as L1 Bound regardless of which memory source satisfied them. Note that this metric value may be highlighted due to a Store Latency issue.
Split Loads
Metric Description
Throughout the memory hierarchy, data moves at cache line granularity - 64 bytes per line. Although this is
much larger than many common data types, such as integer, float, or double, unaligned values of these or
other types may span two cache lines. Recent Intel architectures have significantly improved the
performance of such 'split loads' by introducing split registers to handle these cases, but split loads can still
be problematic, especially if many split loads in a row consume all available split registers.
Possible Issues
A significant proportion of cycles is spent handling split loads.
Tips
Consider aligning your data to the 64-byte cache line granularity. See the Intel 64 and IA-32 Architectures
Optimization Reference Manual for more details.
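A minimal sketch of this tip (hypothetical buffer, not taken from the product samples) uses C++17 std::aligned_alloc so that the array starts on a 64-byte boundary and naturally-sized elements never straddle two cache lines:

#include <cstdlib>   // std::aligned_alloc, std::free

int main() {
    constexpr std::size_t kCount = 1024;
    // Start the buffer on a 64-byte boundary; the size passed to
    // std::aligned_alloc must be a multiple of the alignment.
    double* data = static_cast<double*>(
        std::aligned_alloc(64, kCount * sizeof(double)));
    if (!data) return 1;
    for (std::size_t i = 0; i < kCount; ++i) data[i] = 0.0;
    std::free(data);
    return 0;
}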
L1 Hit Rate
Metric Description
The L1 cache is the first, and shortest-latency, level in the memory hierarchy. This metric provides the ratio
of demand load requests that hit the L1 cache to the total number of demand load requests.
L1D Replacements
Metric Description
Replacements into the L1D
L2 Bound
Metric Description
This metric shows how often the machine was stalled on the L2 cache. Avoiding cache misses (L1 misses / L2 hits) will improve the latency and increase performance.
L2 Hit Bound
Metric Description
The L2 is the last and longest-latency level in the memory hierarchy before the main memory (DRAM) or
MCDRAM. While L2 hits are serviced much more quickly than hits in DRAM or MCDRAM, they can still incur a
significant performance penalty. This metric also includes coherence penalties for shared data. The L2 Hit
Bound metric shows a ratio of cycles spent handling L2 hits to all cycles. The cycles spent handling L2 hits
are calculated as L2 CACHE HIT COST * L2 CACHE HIT COUNT where L2 CACHE HIT COST is a constant
measured as typical L2 access latency in cycles.
Possible Issues
A significant proportion of cycles is being spent on data fetches that miss the L1 but hit the L2. This metric
includes coherence penalties for shared data.
Tips
1. If contested accesses or data sharing are indicated as likely issues, address them first. Otherwise, consider
the performance tuning applicable to an L2-missing workload: reduce the data working set size, improve
data access locality, consider blocking or partitioning your working set so that it fits into the L1, or better
exploit hardware prefetchers.
2. Consider using software prefetchers, but note that they can interfere with normal loads, potentially
increasing latency, as well as increase pressure on the memory system.
L2 Hit Rate
Metric Description
The L2 is the last and longest-latency level in the memory hierarchy before DRAM or MCDRAM. While L2 hits
are serviced much more quickly than hits in DRAM or MCDRAM, they can still incur a significant performance
penalty. This metric provides a ratio of the demand load requests that hit the L2 to the total number of the
demand load requests serviced by the L2. This metric does not include instruction fetches.
Possible Issues
The L2 is the last and longest-latency level in the memory hierarchy before DRAM or MCDRAM. While L2 hits
are serviced much more quickly than hits in DRAM, they can still incur a significant performance penalty. This
metric provides the ratio of demand load requests that hit the L2 to the total number of the demand load
requests serviced by the L2. This metric does not include instruction fetches.
L2 HW Prefetcher Allocations
Metric Description
The number of L2 allocations caused by HW Prefetcher.
L2 Input Requests
Metric Description
A total number of L2 allocations. This metric accounts for both demand loads and HW prefetcher requests.
L2 Miss Bound
Metric Description
The L2 is the last and longest-latency level in the memory hierarchy before the main memory (DRAM) or
MCDRAM. Any memory requests missing here must be serviced by local or remote DRAM or MCDRAM, with
significant latency. The L2 Miss Bound metric shows a ratio of cycles spent handling L2 misses to all cycles.
The cycles spent handling L2 misses are calculated as L2 CACHE MISS COST * L2 CACHE MISS COUNT where
L2 CACHE MISS COST is a constant measured as typical DRAM access latency in cycles.
Possible Issues
A high number of CPU cycles is being spent waiting for L2 load misses to be serviced.
Tips
1. Reduce the data working set size, improve data access locality, consider blocking and consuming data in chunks that fit into the L2, or better exploit hardware prefetchers.
2. Consider using software prefetchers but note that they can increase latency by interfering with normal
loads, as well as increase pressure on the memory system.
L2 Miss Count
Metric Description
The L2 is the last and longest-latency level in the memory hierarchy before the main memory (DRAM) or
MCDRAM. Any memory requests missing here must be serviced by local or remote DRAM or MCDRAM, with
significant latency. The L2 Miss Count metric shows the total number of demand loads that missed the L2.
Misses due to the HW prefetcher are not included.
L2 Replacement Percentage
Metric Description
When a cache line is brought into the L2 cache, another line must be evicted to make room for it. When lines
in active use are evicted, a performance problem may arise from continually rotating data back into the
cache. This metric measures the percentage of all replacements due to each row. For example, if the
grouping is set to 'Function', this metric shows the percentage of all replacements due to each function,
summing up to 100%.
Possible Issues
This row is responsible for a majority of all L2 cache replacements. Some replacements are unavoidable, and
a high level of replacements may not indicate a problem. Consider this metric only when looking for the
source of a significant number of L2 cache misses for a particular grouping. If these replacements are
marked as a problem, try rearranging data structures (for example, moving infrequently-used data away
from more-frequently-used data so that unused data is not taking up cache space) or re-ordering operations
(to get as much use as possible out of data before it is evicted).
L2 Replacements
Metric Description
Replacements into the L2
L3 Bound
Metric Description
This metric shows how often the CPU was stalled on the L3 cache or contended with a sibling core. Avoiding cache misses (L2 misses / L3 hits) improves the latency and increases performance.
Contested Accesses
Metric Description
Contested accesses occur when data written by one thread is read by another thread on a different core.
Examples of contested accesses include synchronizations such as locks, true data sharing such as modified
locked variables, and false sharing. This metric is a ratio of cycles generated while the caching system was
handling contested accesses to all cycles.
Possible Issues
There is a high number of contested accesses to cachelines modified by another core. Consider either using
techniques suggested for other long latency load events (for example, LLC Miss) or reducing the contested
accesses. To reduce contested accesses, first identify the cause. If it is synchronization, try increasing
synchronization granularity. If it is true data sharing, consider data privatization and reduction. If it is false
data sharing, restructure the data to place contested variables in distinct cachelines. This may increase the
working set due to padding, but false sharing can always be avoided.
Data Sharing
Metric Description
Data shared by multiple threads (even just read shared) may cause increased access latency due to cache
coherency. This metric measures the impact of that coherency. Excessive data sharing can drastically harm
multithreaded performance. This metric is defined by the ratio of cycles while the caching system is handling
shared data to all cycles. It does not measure waits due to contention on a variable, which is measured by
the analysis.
Possible Issues
Significant data sharing by different cores is detected.
Tips
1. Examine the Contested Accesses metric to determine whether the major component of data sharing is due
to contested accesses or simple read sharing. Read sharing is a lower priority than Contested Accesses or
issues such as LLC Misses and Remote Accesses.
2. If simple read sharing is a performance bottleneck, consider changing the data layout across threads or rearranging the computation. However, this type of tuning may not be straightforward and could introduce other, more serious performance issues.
L3 Latency
Metric Description
This metric shows a fraction of cycles with demand load accesses that hit the L3 cache under unloaded
scenarios (possibly L3 latency limited). Avoiding private cache misses (i.e. L2 misses/L3 hits) will improve
the latency, reduce contention with sibling physical cores and increase performance. Note the value of this
node may overlap with its siblings.
LLC Hit
Metric Description
The LLC (last-level cache) is the last, and longest-latency, level in the memory hierarchy before main
memory (DRAM). While LLC hits are serviced much more quickly than hits in DRAM, they can still incur a
significant performance penalty. This metric also includes coherence penalties for shared data.
Possible Issues
A significant proportion of cycles is being spent on data fetches that miss in the L2 but hit in the LLC. This
metric includes coherence penalties for shared data.
Tips
1. If contested accesses or data sharing are indicated as likely issues, address them first. Otherwise, consider
the performance tuning applicable to an LLC-missing workload: reduce the data working set size, improve
data access locality, consider blocking or partitioning your working set so that it fits into the low-level cache,
or better exploit hardware prefetchers.
2. Consider using software prefetchers, but note that they can interfere with normal loads, potentially
increasing latency, as well as increase pressure on the memory system.
SQ Full
Metric Description
This metric measures the fraction of cycles where the Super Queue (SQ) was full, taking into account all request types and both hardware SMT threads. The Super Queue is used for requests to access the L2 cache or to go out to the Uncore.
Possible Issues
This row is responsible for a majority of all last-level cache replacements. Some replacements are
unavoidable, and a high level of replacements may not indicate a problem. Consider this metric only when
looking for the source of a significant number of last-level cache misses for a particular grouping. If these
replacements are marked as a problem, try rearranging data structures (for example, moving infrequently-
used data away from more-frequently-used data so that unused data is not taking up cache space) or re-
ordering operations (to get as much use as possible out of data before it is evicted).
LLC Replacements
Metric Description
Replacements into the LLC
Machine Clears
Metric Description
Certain events require the entire pipeline to be cleared and restarted from just after the last retired instruction. This metric measures three such events: memory ordering violations, self-modifying code, and certain loads to illegal address ranges. The Machine Clears metric represents the fraction of Pipeline Slots the CPU wasted due to Machine Clears. These slots are either wasted by uOps fetched prior to the clear, or by stalls the out-of-order portion of the machine incurs while recovering its state after the clear.
Possible Issues
A significant portion of execution time is spent handling machine clears.
Tips
Examine the MACHINE_CLEARS events to determine the specific cause. See the "Memory Disambiguation"
section in the Intel 64 and IA-32 Architectures Optimization Reference Manual for more details.
Max DRAM Single-Package Bandwidth
Metric Description
Maximum DRAM bandwidth for single package measured by running a micro-benchmark before the collection
starts. If the system has already been actively loaded at the moment of collection start (for example, with
the attach mode), the value may be less accurate.
Memory Bandwidth
Metric Description
This metric represents a fraction of cycles during which an application could be stalled due to approaching
bandwidth limits of the main memory (DRAM). This metric does not aggregate requests from other threads/
cores/sockets (see Uncore counters for that). Consider improving data locality in NUMA multi-socket
systems.
Possible Issues
A significant fraction of cycles was stalled due to approaching bandwidth limits of the main memory (DRAM).
Tips
Improve data accesses to reduce cacheline transfers from/to memory using these possible techniques:
• Consume all bytes of each cacheline before it is evicted (for example, reorder structure elements and split
non-hot ones).
• Merge compute-limited and bandwidth-limited loops.
• Use NUMA optimizations on a multi-socket system.
NOTE
Software prefetches do not help a bandwidth-limited application.
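One common way to apply the first tip above (shown here with hypothetical field names; a sketch, not a layout prescribed by the product) is to split rarely-used fields out of a hot structure so that every cache line transferred from memory carries only data the loop actually consumes:

// Before: each 64-byte line fetched for the hot fields also drags in the
// rarely-used description bytes.
struct ParticleMixed {
    float x, y, z;
    float vx, vy, vz;
    char  description[40];   // cold data interleaved with hot data
};

// After: hot and cold parts live in separate arrays, so a streaming loop
// over the hot array consumes every byte of every cache line it transfers.
struct ParticleHot  { float x, y, z, vx, vy, vz; };
struct ParticleCold { char description[40]; };

float sum_x(const ParticleHot* p, int n) {
    float s = 0.0f;
    for (int i = 0; i < n; ++i) s += p[i].x;   // bandwidth-friendly traversal
    return s;
}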
Memory Bound
Metric Description
This metric shows how memory subsystem issues affect performance. Memory Bound measures the fraction of slots where the pipeline could be stalled due to demand load or store instructions. This mainly accounts for incomplete in-flight memory demand loads that coincide with execution starvation, in addition to less common cases where stores could imply back-pressure on the pipeline.
Possible Issues
The metric value is high. This can indicate that a significant fraction of execution pipeline slots could be stalled due to demand memory loads and stores. Use the Memory Access analysis to get a breakdown of this metric by memory hierarchy level, along with memory bandwidth information and correlation with memory objects.
DRAM Bound
Metric Description
This metric shows how often the CPU was stalled on the main memory (DRAM). Caching typically improves the latency and increases performance.
UPI Utilization Bound
Metric Description
This metric represents the percentage of elapsed time the system spent with high UPI utilization. Explore the Bandwidth Utilization Histogram and make sure the Low/Medium/High utilization thresholds are correct for your system. You can manually adjust them, if required.
NOTE
The UPI Utilization metric replaced QPI Utilization starting with systems based on Intel®
microarchitecture code name Skylake.
Possible Issues
The system spent much time heavily utilizing UPI bandwidth. Improve data accesses using NUMA
optimizations on a multi-socket system.
Memory Latency
Metric Description
This metric represents a fraction of cycles during which an application could be stalled due to the latency of
the main memory (DRAM). This metric does not aggregate requests from other threads/cores/sockets (see
Uncore counters for that). Consider optimizing data layout or using Software Prefetches (through the
compiler).
Possible Issues
This metric represents a fraction of cycles during which an application could be stalled due to the latency of
the main memory (DRAM).
Tips
Improve data accesses or interleave them with compute using such possible techniques as data layout re-
structuring or software prefetches (through the compiler).
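The compiler normally inserts prefetches on its own, but as a hand-written illustration (hypothetical loop and prefetch distance), the _mm_prefetch intrinsic can request data a fixed number of iterations ahead of the consuming load:

#include <xmmintrin.h>   // _mm_prefetch, _MM_HINT_T0

float sum(const float* data, int n) {
    const int kDistance = 16;   // prefetch distance; tune per platform
    float s = 0.0f;
    for (int i = 0; i < n; ++i) {
        if (i + kDistance < n)
            // Hint the hardware to start bringing this line into the caches
            // before the loop actually needs it.
            _mm_prefetch(reinterpret_cast<const char*>(data + i + kDistance),
                         _MM_HINT_T0);
        s += data[i];
    }
    return s;
}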
Local DRAM
Metric Description
This metric shows how often the CPU was stalled on loads from local memory. Caching will improve the latency and increase performance.
Possible Issues
The number of CPU stalls on loads from the local memory exceeds the threshold. Consider caching data to
improve the latency and increase the performance.
Remote Cache
Metric Description
This metric shows how often the CPU was stalled on loads from remote cache in other sockets. This is often caused by non-optimal NUMA memory allocations.
Possible Issues
The number of CPU stalls on loads from the remote cache exceeds the threshold. This is often caused by
non-optimal NUMA memory allocations.
Remote DRAM
Metric Description
This metric shows how often the CPU was stalled on loads from remote memory. This is often caused by non-optimal NUMA memory allocations.
Possible Issues
The number of CPU stalls on loads from the remote memory exceeds the threshold. This is often caused by
non-optimal NUMA memory allocations.
Memory Efficiency
Metric Description
This metric represents how efficiently the memory subsystem was used by the application. It shows the
percent of cycles where the pipeline was not stalled due to demand load or store instructions. The metric is
based on the Memory Bound measurement.
Microarchitecture Usage
Metric Description
The Microarchitecture Usage metric is a key indicator that helps estimate (in %) how effectively your code runs on the current microarchitecture. Microarchitecture Usage can be impacted by long-latency memory, floating-point, or SIMD operations; non-retired instructions due to branch mispredictions; or instruction starvation in the front-end.
Possible Issues
Your code efficiency on this platform is too low.
Possible causes: memory stalls, instruction starvation, branch misprediction, or long-latency instructions.
Tips
Run Microarchitecture Exploration analysis to identify the cause of the low microarchitecture usage efficiency.
Microcode Sequencer
Metric Description
This metric represents a fraction of slots during which CPU was retiring uOps fetched by the Microcode
Sequencer (MS) ROM. The MS is used for CISC instructions not fully decoded by the default decoders (like
repeat move strings), or by microcode assists used to address some modes of operation (like in Floating-
Point assists).
Possible Issues
A significant fraction of cycles was spent retiring uOps fetched by the Microcode Sequencer.
Tips
1. Make sure the /arch compiler flags are correct.
2. Check the child Assists metric and, if it is highlighted as an issue, follow the provided recommendations.
Note that this metric value may be highlighted due to an MS Switches issue.
Mispredicts Resteers
Metric Description
This metric measures the fraction of cycles the CPU was stalled due to Branch Resteers as a result of Branch Misprediction at the execution stage.
Possible Issues
A significant fraction of cycles could be stalled due to Branch Resteers as a result of Branch Misprediction at the execution stage.
MPI Imbalance
Metric Description
MPI Imbalance shows the CPU time spent by ranks spinning in waits on communication operations, normalized by the number of ranks. A high metric value can be caused by workload imbalance between ranks, a non-optimal communication schema, or non-optimal settings of the MPI library. Explore details on communication inefficiencies with Intel Trace Analyzer and Collector.
MS Entry
Metric Description
This metric estimates a fraction of cycles lost due to the Microcode Sequencer entry.
Possible Issues
A significant number of CPU cycles is lost due to the Microcode Sequencer entry.
MUX Reliability
Metric Description
This metric estimates the reliability of HW event-based metrics. Since the number of collected HW events exceeds the number of counters, Intel® VTune™ Profiler uses event multiplexing (MUX) to share HW counters and collect different subsets of events over time. This may affect the precision of the collected event data. The ideal value for this metric is 1. If the value is less than 0.7, the collected data may not be reliable.
Possible Issues
The precision of the collected HW event data is insufficient, so the metrics data may be unreliable. Consider increasing your application execution time, using the multiple runs mode instead of event multiplexing, or creating a custom analysis with a limited subset of HW events. If you are using a driverless collection, consider reducing the value in the /sys/bus/event_source/devices/cpu/perf_event_mux_interval_ms file.
NOTE
A high value for this metric does not guarantee the accuracy of the hardware-based metrics. However, a low value definitely puts the metrics in question, and you should re-run the analysis using the Allow multiple runs option or increase the execution time to improve accuracy.
Other
Metric Description
This metric represents a non-floating-point (FP) uop fraction the CPU has executed. If your application has no
FP operations, this is likely to be the biggest fraction.
Overhead Time
Metric Description
Overhead time is CPU time spent on the overhead of known synchronization and threading libraries, such as
system synchronization APIs, Intel® oneAPI Threading Building Blocks (oneTBB), and OpenMP.
Possible Issues
A significant portion of CPU time is spent in synchronization or threading overhead. Consider increasing task
granularity or the scope of data synchronization.
Page Walk
Metric Description
In x86 architectures, mappings between virtual and physical memory are facilitated by a page table that is
kept in memory. To minimize references to this table, recently-used portions of the page table are cached in
a hierarchy of 'translation look-aside buffers', or TLBs, which are consulted on every virtual address
translation. As with data caches, the farther a request has to go to be satisfied, the worse the performance
impact is. This metric estimates the performance penalty paid for missing the first-level TLB that includes
hitting in the second-level data TLB (STLB) as well as performing a hardware page walk on an STLB miss.
Possible Issues
Page Walks have a large performance penalty because they involve accessing the contents of multiple
memory locations to calculate the physical address. Since this metric includes the cycles handling both
instruction and data TLB misses, look at ITLB Overhead and DTLB Overhead and follow the instructions to
improve performance. Also examine PAGE_WALKS.D_SIDE_CYCLES and PAGE_WALKS.I_SIDE_CYCLES
events in the source/assembly view for further breakdown. Account for skid.
Paused Time
Metric Description
Paused time is the amount of Elapsed time during which the analysis was paused using either the GUI, CLI
commands, or user API.
Pipeline Slots
Metric Description
A pipeline slot represents hardware resources needed to process one uOp.
The Top-Down Characterization assumes that for each CPU core, on each clock cycle, there are several
pipeline slots available. This number is called Pipeline Width.
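For example (hypothetical numbers), on a core with a Pipeline Width of 4, an interval of 1,000,000 unhalted clock cycles provides 4,000,000 Pipeline Slots for the Top-Down breakdown.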
Potential Gain
Possible Issues
The time wasted on load imbalance or parallel work arrangement is significant and negatively impacts the
application performance and scalability. Explore OpenMP regions with the highest metric values. Make sure
the workload of the regions is enough and the loop schedule is optimal.
Imbalance
Metric Description
OpenMP Potential Gain Imbalance shows the maximum elapsed time that could be saved if the OpenMP construct were optimized to have no imbalance. It is calculated as the sum of CPU time spent by all OpenMP threads spinning on a barrier, divided by the number of OpenMP threads.
Possible Issues
Significant time spent waiting on an OpenMP barrier inside of a parallel region can be a result of load
imbalance. Consider using dynamic work scheduling to reduce the imbalance, where possible.
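A minimal OpenMP sketch of this suggestion (hypothetical loop body; the chunk size of 16 is only an example to be tuned):

void scale(double* a, int n) {
    // schedule(dynamic, 16): threads grab 16 iterations at a time, which
    // evens out the load when iterations have very different costs.
    #pragma omp parallel for schedule(dynamic, 16)
    for (int i = 0; i < n; ++i) {
        a[i] = a[i] * a[i];   // stand-in for irregular per-iteration work
    }
}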
Lock Contention
Metric Description
OpenMP Potential Gain Lock Contention shows the elapsed time cost of OpenMP locks and ordered synchronization. A high metric value may signal inefficient parallelization with highly contended synchronization objects. To avoid intensive synchronization, consider using reduction, atomic operations, or thread-local variables where possible. This metric is based on CPU sampling and does not include passive waits.
Possible Issues
When synchronization objects are used inside a parallel region, threads can spend CPU time waiting on a lock
release, contending with other threads for a shared resource. Where possible, reduce synchronization by
using reduction or atomic operations, or minimize the amount of code executed inside the critical section.
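As a sketch of the reduction advice (hypothetical summation loop, not taken from the product samples), replacing a critical section with an OpenMP reduction removes the contended lock entirely:

double total(const double* a, int n) {
    double sum = 0.0;
    // reduction(+:sum) gives each thread a private partial sum and combines
    // the partial sums once at the end, instead of serializing every
    // addition on a critical section or lock.
    #pragma omp parallel for reduction(+ : sum)
    for (int i = 0; i < n; ++i)
        sum += a[i];
    return sum;
}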
Pre-Decode Wrong
Metric Description
This metric estimates the fraction of cycles lost due to the decoder predicting the wrong instruction length.
Possible Issues
A significant number of CPU cycles is lost due to the decoder predicting the wrong instruction length.
Retire Stalls
Metric Description
This metric is defined as a ratio of the number of cycles when no micro-operations are retired to all cycles. In
the absence of performance issues, long latency operations, and dependency chains, retire stalls are
insignificant. Otherwise, retire stalls result in a performance penalty. On processors based on the Intel
microarchitecture code name Nehalem, this metric is based on precise events that do not suffer from
significant skid.
Possible Issues
A high number of retire stalls is detected. This may result from branch misprediction, instruction starvation,
long latency operations, and other issues. Use this metric to find where you have stalled instructions. Once
you have located the problem, analyze metrics such as LLC Miss, Execution Stalls, Remote Accesses, Data
Sharing, and Contested Accesses, or look for long-latency instructions like divisions and string operations to
understand the cause.
Retiring
Metric Description
Retiring metric represents a Pipeline Slots fraction utilized by useful work, meaning the issued uOps that
eventually get retired. Ideally, all Pipeline Slots would be attributed to the Retiring category. Retiring of 100%
would indicate the maximum possible number of uOps retired per cycle has been achieved. Maximizing
Retiring typically increases the Instruction-Per-Cycle metric. Note that a high Retiring value does not necessarily mean there is no more room for performance improvement. For example, Microcode assists are categorized under Retiring. They hurt performance and can often be avoided.
Possible Issues
A high fraction of pipeline slots was utilized by useful work.
Tips
While the goal is to make this metric value as high as possible, a high Retiring value for non-vectorized code could prompt you to consider code vectorization. Vectorization enables doing more computations without significantly increasing the number of instructions, thus improving performance. Note that this metric value may be highlighted due to a Microcode Sequencer (MS) issue, in which case performance can be improved by avoiding use of the MS.
CPU time spent on waits for MPI communication operations is significant and can negatively impact the
application performance and scalability. This can be caused by load imbalance between ranks, active
communications or non-optimal settings of MPI library. Explore details on communication inefficiencies with
Intel Trace Analyzer and Collector.
Other
Metric Description
This metric shows unclassified Serial CPU Time.
SIMD Assists
Metric Description
SIMD assists are invoked when an EMMS instruction is executed after MMX technology code has changed the MMX state in the floating-point stack. The EMMS instruction clears the MMX technology state at the end of all MMX technology procedures or subroutines, and before calling other procedures or subroutines that may execute x87 floating-point instructions; intermixing MMX and x87 instructions can incur a performance penalty. SIMD assists are also required for Streaming SIMD Extensions (SSE) instructions that have a denormal input when the DAZ (Denormals Are Zero) flag is off, or an underflowed result when the FTZ (Flush To Zero) flag is off.
Possible Issues
A significant portion of execution time is spent in SIMD assists. Consider enabling the DAZ (Denormals Are
Zero) and/or FTZ (Flush To Zero) options in your compiler to flush denormals to zero. This option may
improve performance if the denormal values are not critical in your application. Also note that the DAZ and
FTZ modes are not compatible with the IEEE Standard 754.
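In addition to the compiler option, the FTZ and DAZ bits can be set at run time through the SSE intrinsics shown in this sketch (call it once per thread before the floating-point work starts; the function name is illustrative):

#include <xmmintrin.h>   // _MM_SET_FLUSH_ZERO_MODE
#include <pmmintrin.h>   // _MM_SET_DENORMALS_ZERO_MODE

void enable_ftz_daz() {
    // Flush To Zero: denormal results are replaced with zero.
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
    // Denormals Are Zero: denormal inputs are treated as zero.
    _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);
}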
This metric represents how intensively your program uses the FPU. 100% means that the FPU is fully loaded
and is retiring a vector instruction with full capacity every cycle of the application execution.
SP GFLOPS
Metric Description
Number of single precision giga-floating point operations calculated per second. All double operations are
converted to two single operations.
Spin Time
Metric Description
Spin time is Wait Time during which the CPU is busy. This often occurs when a synchronization API causes
the CPU to poll while the software thread is waiting. Some Spin Time may be preferable to the alternative of
increased thread context switches. Too much Spin Time, however, can reflect lost opportunity for productive
work.
Possible Issues
A significant portion of CPU time is spent waiting. Use this metric to discover which synchronizations are
spinning. Consider adjusting spin wait parameters, changing the lock implementation (for example, by
backing off then descheduling), or adjusting the synchronization granularity.
Communication (MPI)
Metric Description
MPI Busy Wait Time is CPU time during which the MPI runtime library is spinning on waits in communication operations. A high metric value can be caused by load imbalance between ranks, active communications, or non-optimal settings of the MPI library. Explore details on communication inefficiencies with Intel Trace Analyzer and Collector.
Possible Issues
CPU time spent on waits for MPI communication operations is significant and can negatively impact the
application performance and scalability. This can be caused by load imbalance between ranks, active
communications or non-optimal settings of MPI library. Explore details on communication inefficiencies with
Intel Trace Analyzer and Collector.
Lock Contention
Metric Description
Lock Contention time is CPU time during which working threads spin on a lock, consuming CPU resources. A high metric value may signal inefficient parallelization with highly contended synchronization objects. To avoid intensive synchronization, consider using reduction, atomic operations, or thread-local variables where possible.
Possible Issues
When synchronization objects are used inside a parallel region, threads can spend CPU time waiting on a lock
release, contending with other threads for a shared resource. Where possible, reduce synchronization by
using reduction or atomic operations, or minimize the amount of code executed inside the critical section.
Other (Spin)
Metric Description
This metric shows unclassified Spin time spent in a threading runtime library.
payload work. In cases when a parallel runtime (for example, Intel® Threading Building Blocks, Intel® Cilk™,
OpenMP*) is used inefficiently, a significant portion of time may be spent inside the parallel runtime wasting
CPU time at high concurrency levels. For example, if you increase the number of threads performing some
fixed load of work in parallel, each thread gets less work and the overhead, as a relative measure, will get
larger. It is a basic application of Amdahl's Law.
To detect this wasted CPU time, Intel® VTune™ Profiler analyzes the call stack at the point of interest and
computes the Overhead time performance metric. VTune Profiler classifies the stack layers into user, system,
and overhead layers and attributes the CPU time spent in system functions called by overhead functions to
the overhead functions.
Spin Time
Spin time is the Wait time during which the CPU is busy. This often occurs when a synchronization API causes
the CPU to poll while the software thread is waiting. Some Spin time may be preferable to the alternative of
increased thread context switches. Too much Spin time, however, can reflect lost opportunity for productive
work.
Overhead and Spin Time
VTune Profiler provides the combined Overhead and Spin Time metric in the grid and Timeline view of the
Hotspots by CPU Utilization, Hotspots by Thread Concurrency, and Hotspots viewpoints. This metric
represents the sum of the Overhead and Spin time values calculated as CPU Time where Call Site Type is
Overhead + CPU Time where Call Site Type is Synchronization. To view the Overhead and Spin time
values separately, expand the column by clicking the symbol.
NOTE
VTune Profiler ignores the Overhead and Spin time when calculating the CPU Utilization metric.
Possible Issues
A significant portion of CPU time is spent in synchronization or threading overhead. Consider increasing task
granularity or the scope of data synchronization.
Atomics
Metric Description
Atomics time is CPU time that a runtime library spends on atomic operations.
Possible Issues
CPU time spent on atomic operations is significant. Consider using reduction operations where possible.
Creation
Metric Description
Creation time is CPU time that a runtime library spends on organizing parallel work.
Possible Issues
CPU time spent on parallel work arrangement can be a result of too fine-grain parallelism. Try parallelizing
outer loops, rather than inner loops, to reduce the work arrangement overhead.
Other (Overhead)
Metric Description
This metric shows unclassified Overhead time spent in a threading runtime library.
Reduction
Metric Description
Reduction time is CPU time that a runtime library spends on loop or region reduction operations.
Possible Issues
A significant portion of CPU time is spent performing reduction operations.
Scheduling
Metric Description
Scheduling time is CPU time that a runtime library spends on work assignment for threads. If the time is
significant, consider using coarse-grain work chunking.
Possible Issues
Dynamic scheduling with small work chunks can cause increased overhead due to threads frequently
returning to the scheduler for more work. Try increasing the chunk size to reduce this overhead.
Tasking
Metric Description
Tasking time is CPU time that a runtime library spends on allocating and completing tasks.
Split Stores
Metric Description
Throughout the memory hierarchy, data moves at cache line granularity - 64 bytes per line. Although this is
much larger than many common data types, such as integer, float, or double, unaligned values of these or
other types may span two cache lines. Recent Intel architectures have significantly improved the
performance of such 'split stores' by introducing split registers to handle these cases. But split stores can still
be problematic, especially if they consume split registers which could be servicing other split loads.
Possible Issues
A significant portion of cycles is spent handling split stores.
Tips
Consider aligning your data to the 64-byte cache line granularity.
Note that this metric value may be highlighted due to a Port 4 issue.
Store Bound
Metric Description
This metric shows how often the CPU was stalled on store operations. Even though memory store accesses do not typically stall out-of-order CPUs, there are a few cases where stores can lead to actual stalls.
Possible Issues
CPU was stalled on store operations for a significant fraction of cycles.
Tips
Consider False Sharing analysis as your next step.
Store Latency
Metric Description
This metric represents the fraction of cycles the CPU spent handling long-latency store misses (misses in the 2nd-level cache).
Possible Issues
A significant fraction of cycles was spent handling long-latency store misses (misses in the 2nd-level cache). Consider avoiding or reducing unnecessary (or easily re-loadable/re-computable) memory stores. Note that this metric value may be highlighted due to a Lock Latency issue.
Task Time
Metric Description
Total amount of time spent within a task.
Thread Concurrency
Thread Oversubscription
Metric Description
Thread Oversubscription indicates the time spent in code where the number of simultaneously working threads exceeds the number of logical cores available on the system.
Possible Issues
The application spent a significant amount of time in thread oversubscription. This can negatively impact parallel performance because of thread preemption and context switch costs.
[uOps]
Metric Description
uOp, or micro-op, is a low-level hardware operation. The CPU Front-End is responsible for fetching the
program code represented in architectural instructions and decoding them into one or more uOps.
VPU Utilization
Metric Description
This metric measures the fraction of micro-ops that performed packed vector operations of any vector length and any mask. The VPU Utilization metric can be used in conjunction with the compiler's vectorization report to assess VPU utilization and to understand the compiler's judgement about the code. Note that this metric does not account for loads and stores, and does not take vector length or masking into consideration. It includes integer packed SIMD.
Possible Issues
This metric measures the fraction of micro-ops that performed packed vector operations of any vector length and any mask. The VPU Utilization metric can be used in conjunction with the compiler's vectorization report to assess VPU utilization and to understand the compiler's judgement about the code. Note that this metric does not account for loads and stores, and does not take vector length or masking into consideration. This metric includes integer packed SIMD.
Wait Count
Metric Description
Wait Count measures the number of times software threads wait due to APIs that block or cause
synchronization.
Wait Rate
Metric Description
Average Wait time (in milliseconds) per synchronization context switch. A low metric value may signal increased contention between threads and inefficient use of the system API.
Possible Issues
The average Wait time is too low. This could be caused by small timeouts, high contention between threads,
or excessive calls to system synchronization functions. Explore the call stack, the timeline, and the source
code to identify what is causing low wait time per synchronization context switch.
Wait Time
Metric Description
Wait Time occurs when software threads are waiting due to APIs that block or cause synchronization. Wait
Time is per-thread, therefore the total Wait Time can exceed the application Elapsed Time.
Intel® VTune™ Profiler collects and analyzes the following groups of GPU metrics for Intel® HD Graphics and
Intel® Iris® Graphics:
• Overview metrics:
• Memory Read Bandwidth
• Memory Write Bandwidth
• L3 Miss Rate
• Sampler Busy
• Sampler Is Bottleneck
• GPU Memory Texture Read Bandwidth
Starting with the fifth generation of the Intel® Core™ processor family (code name: Broadwell), the following metrics are included:
• L3 Shader Bandwidth
• L3 Sampler Bandwidth
• L3 Miss Ratio
• Shared Local Memory Read Bandwidth
• Shared Local Memory Write Bandwidth
• Compute basic (with global/local memory accesses) metrics:
• Untyped Memory Read Bandwidth
• Untyped Memory Write Bandwidth
• Typed Memory Read Transactions
• Typed Memory Write Transactions
• Shared Local Memory Read Bandwidth
• Shared Local Memory Write Bandwidth
NOTE
To analyze Intel® HD Graphics and Intel® Iris® Graphics hardware events, make sure to set up your system for GPU analysis.
See Also
Running GPU Analysis from Command Line
Average Time
Metric Description
Average amount of time spent in the task.
See Also
Reference for Performance Metrics
Metric Description
Number of threads started across all EUs for compute work.
Possible Issues
A high thread issue rate lowers GPU usage efficiency due to thread creation overhead, even for lightweight GPU threads. To improve performance, change the kernel code to increase the load per work item and adjust the global work size, thus decreasing the number of GPU threads.
See Also
Reference for Performance Metrics
Metric Description
Number of threads started across all EUs for compute work per second.
See Also
Reference for Performance Metrics
CPU Time
Metric Description
CPU Time is time during which the CPU is actively executing your application.
See Also
Reference for Performance Metrics
Metric Description
The normalized sum of all cycles on all cores when both EU FPU pipelines were actively processing
See Also
Reference for Performance Metrics
EU Array Active
Metric Description
The normalized sum of all cycles on all cores spent actively executing instructions.
See Also
Reference for Performance Metrics
EU Array Idle
Metric Description
The normalized sum of all cycles on all cores when no threads were scheduled on a core.
Possible Issues
A significant portion of GPU time is spent idle. That is usually caused by imbalance or thread scheduling
problems.
See Also
Reference for Performance Metrics
EU Array Stalled/Idle
Metric Description
The average time the EUs were stalled or idle.
Possible Issues
The time when the EUs were stalled or idle is high, which has a negative impact on compute-bound
applications.
See Also
Reference for Performance Metrics
EU Array Stalled
Metric Description
The normalized sum of all cycles on all cores spent stalled. At least one thread is loaded, but the core is
stalled for some reason.
Possible Issues
A significant portion of GPU time is spent in stalls. For compute bound code it indicates that the performance
might be limited by memory or sampler accesses.
See Also
Reference for Performance Metrics
EU IPC Rate
Metric Description
The average rate of instructions per cycle (IPC) calculated for 2 FPU pipelines
See Also
Reference for Performance Metrics
Metric Description
The normalized sum of all cycles on all cores when EU send pipeline was actively processing
See Also
Reference for Performance Metrics
EU Threads Occupancy
Metric Description
The normalized sum of all cycles on all cores and thread slots when a slot has a thread scheduled.
See Also
Reference for Performance Metrics
Global
Metric Description
Total working size of a computing task.
See Also
Reference for Performance Metrics
Metric Description
The normalized sum of all cycles on all cores with at least one thread loaded.
See Also
Reference for Performance Metrics
GPU L3 Bound
Metric Description
This metric shows how often the GPU was idle or stalled on the L3 cache.
Possible Issues
L3 bandwidth was high when EUs were stalled or idle. Consider improving cache reuse.
See Also
Reference for Performance Metrics
Metric Description
Read and write miss ratio in GPU L3 cache. This doesn't count code lookups.
See Also
Reference for Performance Metrics
GPU L3 Misses
Metric Description
Read and write misses in GPU L3 cache.
See Also
Reference for Performance Metrics
Metric Description
Read and write misses in GPU L3 cache. This doesn't count code lookups.
See Also
Reference for Performance Metrics
Metric Description
GPU memory read bandwidth between the GPU, chip uncore (LLC) and main memory. This metric counts all
memory accesses that miss the internal GPU L3 cache or bypass it and are serviced either from uncore or
main memory.
See Also
Reference for Performance Metrics
Metric Description
Sampler unit misses in sampler cache.
See Also
Reference for Performance Metrics
Metric Description
GPU write bandwidth between the GPU, chip uncore (LLC) and main memory. This metric counts all memory
accesses that miss the internal GPU L3 cache or bypass it and are serviced either from uncore or main
memory.
See Also
Reference for Performance Metrics
Metric Description
Number of texels returned from the sampler.
See Also
Reference for Performance Metrics
GPU Utilization
Metric Description
The percentage of time when GPU engine was utilized.
VTune Profiler collects high level information about the GPU Utilization metric when you run the GPU
Offload and GPU Compute/Media Hotspots analyses. This information is available in the GPU Offload
viewpoint. To see more detailed metric information, rebuild the Linux kernel to enable i915 ftrace events.
Use the Summary, Platform, and Graphics window to explore the GPU utilization at the application and
computing task level.
GPU Utilization in the Summary Window
If your system satisfies configuration requirements for GPU analysis (i915 ftrace event collection is
supported), VTune Profiler displays detailed GPU Utilization analysis data across all engines that had at
least one DMA packet executed. By default, the VTune Profiler flags GPU utilization of less than 80% as a performance issue. In the example below, 85.9% of the application elapsed time was utilized by GPU engines.
Depending on the target platform used for GPU analysis, the GPU Utilization section in the Summary
window shows the time (in seconds) used by GPU engines. Note that GPU engines may work in parallel and
the total time taken by GPU engines does not necessarily equal the application Elapsed time.
You may correlate the GPU Time data with the Elapsed Time metric. The GPU Time value shows the share of the Elapsed Time used by a particular GPU engine. If the GPU Time takes a significant portion of the Elapsed Time, this clearly indicates that the application is GPU-bound.
If your system does not support i915 ftrace event collection, all the GPU Utilization statistics will be
calculated based on the hardware events and attributed to the Render and GPGPU engine.
GPU Utilization in the Platform Window
Explore overall GPU utilization per GPU engine at each moment of time. By default, the Platform window
displays GPU Utilization and software queues per GPU engine. Hover over an object executed on the GPU (in
yellow) to view a short summary on GPU utilization, where GPU Utilization is the time when a GPU engine
was executing a workload. Explore the top GPU Utilization band in the chart to estimate the percentage of GPU engine utilization (yellow areas vs. white spaces) and identify opportunities to submit additional work to the hardware.
To view and analyze GPU software queues, select an object (packet) in the queue and the VTune Profiler
highlights the corresponding software queue bounds:
A full software queue prevents packet submissions and causes waits on the CPU side in the user-mode driver until there is space in the queue. To check whether such a stall degrades your performance, you can reduce the workload on the hardware and switch to the Graphics window to see whether there are fewer waits on the CPU in the threads that spawn packets. Another option is to load the queue with additional tasks and see whether the queue length increases.
Possible Issues
GPU utilization is low. Consider offloading more work to the GPU to increase overall application performance.
See Also
GPU Application Analysis on Intel® HD Graphics and Intel® Iris® Graphics
Instance Count
Metric Description
Total number of times a task is run.
See Also
Reference for Performance Metrics
Metric Description
Total number of bytes transferred between Samplers and L3 caches.
See Also
Reference for Performance Metrics
Metric Description
Total number of bytes transferred directly between EUs and L3 caches.
See Also
Reference for Performance Metrics
Metric Description
The Last Level Uncore cache (LLC) miss rate across all look-ups done from the GPU.
See Also
Reference for Performance Metrics
Metric Description
The Last Level Uncore cache (LLC) miss count across all lookups done from the GPU.
See Also
Reference for Performance Metrics
Local
Metric Description
Local space size of a computing task. For example, for an OpenCL kernel, it is the work-group size.
See Also
Reference for Performance Metrics
Metric Description
Maximum GPU usage across engines that had at least one packet on them.
See Also
Reference for Performance Metrics
Occupancy
Metric Description
The normalized sum of all cycles on all cores and thread slots when a slot has a thread scheduled.
Possible Issues
A low value of the Occupancy metric may be caused by inefficient work scheduling. Make sure work items are neither too small nor too large.
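As a hedged illustration (not a VTune Profiler feature), the host-side OpenCL™ fragment below shows where the scheduling granularity of a computing task is chosen; the helper name launch_1d and the work-group size of 256 are assumptions to be tuned per kernel and device:

#include <CL/cl.h>

/* Hypothetical helper: enqueue `kernel` over `n` work items with an explicit
 * local work-group size. A group that is too small leaves thread slots idle
 * (low Occupancy); one that is too large can exhaust registers or shared
 * local memory. Assumes n is a multiple of the chosen work-group size. */
cl_int launch_1d(cl_command_queue queue, cl_kernel kernel, size_t n)
{
    size_t global = n;      /* Global Work Size: total number of work items */
    size_t local  = 256;    /* Local Work Size: work-group size to tune     */
    return clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                                  &global, &local, 0, NULL, NULL);
}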
See Also
Reference for Performance Metrics
PS EU Active %
Metric Description
The metric PS EU Active % represents the percentage of overall GPU time that the EUs were actively
executing Pixel Shader instructions.
This metric is important if pixel shading seems to be the bottleneck for selected rendering calls.
Possible Issues
• If PS EU Active % is 50%, it means that half of the overall GPU time was spent actively executing Pixel
Shader instructions.
• If PS EU Active % is 0%, it means that no Pixel Shader was associated with the selected draw calls, or
that the amount of time actively executing Pixel Shader instructions was negligible.
To improve performance:
• If PS EU Active % accounts for most of the EU active time, then to improve performance you may need
to simplify the pixel shader.
• If PS EU Active % is larger than you would expect and you are encountering slow rendering times, you
should examine the pixel shader code for potential reasons why these stalls may be occurring.
See Also
GPU Rendering Analysis (Preview)
PS EU Stall %
Metric Description
The metric PS EU Stall % represents the percentage of overall GPU time that the EUs were stalled in Pixel
Shader instructions. This metric is important if pixel shading seems to be the bottleneck for selected
rendering calls.
NOTE
This metric does not show total amount of stalled time in the pixel shader, but only the fraction of time
when pixel shader stalls caused the entire EU to stall. The entire EU stalls when all of its threads are
stalled.
Possible Issues
• If PS EU Stall % is 50%, it means that half of the overall GPU time was spent stalled on Pixel Shader
instructions.
• If PS EU Stall % is 0%, it means that no Pixel Shader was associated with the selected rendering calls or that Pixel Shader threads were not causing EU stalls.
To improve performance:
• If PS EU Stall % accounts for most of the EU active time, then to improve performance you may need to
simplify the pixel shader.
• If PS EU Stall % is larger than you expect and you are encountering slow rendering times, you need to
concentrate on pixel shader code to find reasons for these stalls.
See Also
GPU Rendering Analysis (Preview)
Metric Description
Ratio of the bandwidth on this link to its theoretical peak.
See Also
Reference for Performance Metrics
Metric Description
Ratio of the write bandwidth on this link to its write theoretical peak.
See Also
Reference for Performance Metrics
Metric Description
Ratio of the read bandwidth on this link to its read theoretical peak.
See Also
Reference for Performance Metrics
Metric Description
The normalized sum of all cycles where commands exist on the GPU Render/GPGPU ring.
See Also
Reference for Performance Metrics
Samples Blended
Metric Description
The Samples Blended metric represents the total number of blended samples or pixels written to all render
targets.
See Also
GPU Rendering Analysis (Preview)
Metric Description
The Samples Killed in PS, pixels metric represents the total number of samples or pixels dropped in pixel
shaders.
See Also
GPU Rendering Analysis (Preview)
Samples Written
Metric Description
The Samples Written metric represents the number of pixels/samples written to render targets.
The graphics driver 9.17.10 introduces a new notion of deferred clears. For the sake of optimization, the driver decides whether to defer the actual rendering of clear calls in case subsequent clear and draw calls make them unnecessary. As a result, when clear calls are deferred, Intel® VTune™ Profiler shows their GPU Duration and Samples Written as zero. If it later turns out that a deferred clear call needs to be drawn, the work associated with that clear call is included in the duration of the erg that is being drawn at the moment the deferred clear is flushed, which is not necessarily a clear call. This means that the VTune Profiler metrics associated with an erg accurately reflect the real work associated with that erg, including any deferred clear work attributed to it.
See Also
GPU Rendering Analysis (Preview)
Sampler Busy
Metric Description
The normalized sum of all cycles on all cores when the Sampler was busy while EUs were stalled or idle.
Possible Issues
Sampler was overutilized when EUs were stalled or idle. Consider reducing the image-related operations.
See Also
Reference for Performance Metrics
Sampler Is Bottleneck
Metric Description
The Sampler stalls EUs when its input FIFO queue is full and starves its output FIFO, so EUs have to wait to submit requests to the Sampler.
Possible Issues
A significant number of sampler accesses might cause stalls. Consider decreasing the use of the sampler or accessing it with better locality.
See Also
Reference for Performance Metrics
Metric Description
Untyped memory reads from Shared Local Memory.
See Also
Reference for Performance Metrics
Metric Description
Untyped memory writes to Shared Local Memory.
See Also
Reference for Performance Metrics
SIMD Width
Metric Description
The number of work items processed by a GPU thread.
See Also
Reference for Performance Metrics
Size
Metric Description
Amount of memory processed on a GPU.
See Also
Reference for Performance Metrics
Total, GB/sec
Metric Description
Average bandwidth of data transfer between a CPU and a GPU. In some cases (for example,
clEnqueueMapBuffer), there may be transfers generating high bandwidth values because memory is not
copied but shared via L3 cache.
See Also
Reference for Performance Metrics
Total Time
Metric Description
Total amount of time spent within a task.
See Also
Reference for Performance Metrics
Metric Description
Bandwidth of memory read from typed buffers. Note that reads from images (for example created with
clCreateImage) are counted by sampler accesses and Texture Read metrics.
See Also
Reference for Performance Metrics
Metric Description
Bandwidth of memory written to typed buffers (for example created with clCreateImage).
See Also
Reference for Performance Metrics
Metric Description
Transaction Coalescence is the ratio of the used bytes to all bytes requested by the transaction. The lower the coalescence, the larger the share of bandwidth that is wasted. It originates from the GPU Data Port function that dynamically merges scattered memory operations into fewer operations over non-duplicated 64-byte cacheline requests. For example, if a 16-wide SIMD operation consecutively reads integer array elements with a stride of 2, the coalescence of such a transaction is 50%, because half of the bytes in the requested cachelines are not used.
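To make the arithmetic of this example concrete, here is a small self-contained sketch; the 64-byte cacheline, 4-byte integer elements, 16-wide SIMD read, and cacheline-aligned base address are all assumptions for illustration only:

#include <stdio.h>

/* Coalescence = bytes actually used / bytes requested (rounded up to whole
 * 64-byte cachelines). Purely illustrative arithmetic. */
int main(void)
{
    const int cacheline = 64, elem = 4, simd = 16;
    for (int stride = 1; stride <= 4; stride *= 2) {
        int bytes_used      = simd * elem;          /* payload the EUs consume */
        int bytes_spanned   = simd * stride * elem; /* address range touched   */
        int lines           = (bytes_spanned + cacheline - 1) / cacheline;
        double coalescence  = 100.0 * bytes_used / (lines * cacheline);
        printf("stride %d: %d cacheline(s) requested, coalescence %.0f%%\n",
               stride, lines, coalescence);
    }
    return 0;
}

With a stride of 2, the sketch reports two cachelines requested and 50% coalescence, matching the description above.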
See Also
Reference for Performance Metrics
Metric Description
Transaction Coalescence is the ratio of the used bytes to all bytes requested by the transaction. The lower the coalescence, the larger the share of bandwidth that is wasted. It originates from the GPU Data Port function that dynamically merges scattered memory operations into fewer operations over non-duplicated 64-byte cacheline requests. For example, if a 16-wide SIMD operation consecutively reads integer array elements with a stride of 2, the coalescence of such a transaction is 50%, because half of the bytes in the requested cachelines are not used.
See Also
Reference for Performance Metrics
Metric Description
Bandwidth of memory read from untyped buffers (for example created with clCreateBuffer).
See Also
Reference for Performance Metrics
Metric Description
Bandwidth of memory written to untyped buffers (for example created with clCreateBuffer).
See Also
Reference for Performance Metrics
Metric Description
Transaction Coalescence is the ratio of the used bytes to all bytes requested by the transaction. The lower the coalescence, the larger the share of bandwidth that is wasted. It originates from the GPU Data Port function that dynamically merges scattered memory operations into fewer operations over non-duplicated 64-byte cacheline requests. For example, if a 16-wide SIMD operation consecutively reads integer array elements with a stride of 2, the coalescence of such a transaction is 50%, because half of the bytes in the requested cachelines are not used.
See Also
Reference for Performance Metrics
Metric Description
Transaction Coalescence is the ratio of the used bytes to all bytes requested by the transaction. The lower the coalescence, the larger the share of bandwidth that is wasted. It originates from the GPU Data Port function that dynamically merges scattered memory operations into fewer operations over non-duplicated 64-byte cacheline requests. For example, if a 16-wide SIMD operation consecutively reads integer array elements with a stride of 2, the coalescence of such a transaction is 50%, because half of the bytes in the requested cachelines are not used.
See Also
Reference for Performance Metrics
VS EU Active
Metric Description
The VS EU Active metric represents the percentage of overall GPU time that the execution units (EUs) were
actively executing Vertex Shader instructions. This metric is important if vertex processing seems to be a
bottleneck for selected rendering calls.
Possible Issues
• If VS EU Active is 50%, half of the overall GPU time was spent actively executing Vertex Shader
instructions.
• If VS EU Active is 0%, no Vertex Shader was associated with the selected draw calls, or the amount of
time actively executing Vertex Shader instructions was negligible.
To improve performance:
• If VS EU Active accounts for most of the EU active time, then to improve performance you should
simplify the vertex shader or simplify and optimize the geometry of your primitives.
• If VS EU Active is significant, you should examine your vertex shader code to find the reasons that might
be causing stalls.
See Also
GPU Rendering Analysis (Preview)
VS EU Stall
Metric Description
The VS EU Stall metric represents the percentage of overall GPU time that the execution units (EUs) were
stalled in Vertex Shader instructions. This metric is important if vertex processing seems to be the bottleneck
for selected rendering calls.
NOTE
This metric does not include the total amount of time stalled in the vertex shader, but only the fraction
of the time when vertex shader stalls were causing the entire EU to stall. The entire EU stalls when all
of its threads are stalled.
Possible Issues
• If VS EU Stall is 50%, it means that half of the overall GPU time was spent stalled on Vertex Shader
instructions.
• If VS EU Stall is 0%, it means that no Vertex Shader was associated with the selected rendering calls or that Vertex Shader threads were not causing EU stalls.
To improve performance:
• If VS EU Stall accounts for most of the EU active time, then to improve performance you might need to
simplify the vertex shader or simplify and optimize geometry.
• If VS EU Stall is significant, you need to concentrate on vertex shader code to find the reasons that are
causing stalls.
See Also
GPU Rendering Analysis (Preview)
Metric Description
Total amount of time spent within a computing task (OpenCL™ kernel).
See Also
Interpreting GPU OpenCL Application Analysis Data
Instance Count
Metric Description
Total number of times a computing task (OpenCL™ kernel) is run.
See Also
Interpreting GPU OpenCL Application Analysis Data
SIMD Width
Metric Description
The number of work items processed by a GPU thread.
See Also
Interpreting GPU OpenCL Application Analysis Data
SIMD Utilization
Metric Description
The ratio of active SIMD lanes to the width of the SIMD instructions.
See Also
Reference for Performance Metrics
Work Size
Metric Description
Global Work Size is the total workspace size of a computing task (OpenCL™ kernel). Local Work Size is the local work-group size of a computing task.
See Also
Interpreting GPU OpenCL Application Analysis Data
Metric Description
Total execution time over all cores.
See Also
Reference for Performance Metrics
C-State
C-State residencies are collected from hardware and/or the operating system (OS).
For systems that collect OS C-State residencies, CPU C-states are core power states requested by the
Operating System Directed Power Management (OSPM) infrastructure that define the degree to which the
processor is "idle".
For systems that collect hardware C-State residencies, CPU C-States are obtained by reading the processor’s
MSRs which count the actual time spent in each C-State.
C-States range from C0 to Cn. C0 indicates an active state. All other C-states (C1-Cn) represent idle sleep
states where the processor clock is inactive (cannot execute instructions) and different parts of the processor
are powered down. As the C-States get deeper, the exit latency (the time it takes to transition back to C0) becomes longer and the power savings become greater.
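For a rough illustration of the OS-level view of these residencies (independent of VTune Profiler and Intel SoC Watch), Linux exposes per-core cpuidle counters in sysfs; the paths and state names below depend on the cpuidle driver and are assumptions for this sketch:

#include <stdio.h>

/* Print the name and cumulative residency (microseconds) of each idle state
 * reported for cpu0 through the Linux cpuidle sysfs interface. */
int main(void)
{
    char path[128], name[32];
    for (int s = 0; s < 16; s++) {
        snprintf(path, sizeof(path),
                 "/sys/devices/system/cpu/cpu0/cpuidle/state%d/name", s);
        FILE *f = fopen(path, "r");
        if (!f)
            break;                      /* no deeper states exposed */
        if (fscanf(f, "%31s", name) != 1)
            name[0] = '\0';
        fclose(f);

        unsigned long long usec = 0;
        snprintf(path, sizeof(path),
                 "/sys/devices/system/cpu/cpu0/cpuidle/state%d/time", s);
        f = fopen(path, "r");
        if (f) {
            fscanf(f, "%llu", &usec);
            fclose(f);
        }
        printf("cpu0 %-10s residency: %llu us\n", name, usec);
    }
    return 0;
}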
NOTE
This metric is collected as part of energy analysis. Collecting energy analysis data with Intel® SoC
Watch is available for target Android*, Windows*, or Linux* devices. Import and viewing of the Intel
SoC Watch results is supported with any version of the VTune Profiler.
See Also
Energy Analysis
Interpreting Energy Analysis Data
D0ix States
D0ix-states represent power states ranging from D0i0 to D0i3, where D0i0 is fully powered on and D0i3 is
primarily powered off.
The SoC is organized into a north and south complex where the compute intensive components (for example,
video decode, image processing, and others) are located in the north complex. The south complex contains
I/O, audio, system management, and other components. SoC components should be in the D0i3 state when
not in use.
NOTE
This metric is collected as part of energy analysis. Collecting energy analysis data with Intel® SoC
Watch is available for target Android*, Windows*, or Linux* devices. Import and viewing of the Intel
SoC Watch results is supported with any version of the VTune Profiler.
See Also
Interpreting Energy Analysis Data
NOTE
This metric is collected as part of energy analysis. Collecting energy analysis data with Intel® SoC
Watch is available for target Android*, Windows*, or Linux* devices. Import and viewing of the Intel
SoC Watch results is supported with any version of the VTune Profiler.
See Also
Energy Analysis with Intel VTune Profiler
Interpreting Energy Analysis Data
Window: Bandwidth
NOTE
This metric is collected as part of energy analysis. Collecting energy analysis data with Intel® SoC
Watch is available for target Android*, Windows*, or Linux* devices. Import and viewing of the Intel
SoC Watch results is supported with any version of the VTune Profiler.
See Also
Energy Analysis
Interpreting Energy Analysis Data
Idle Wake-ups
Number of times a thread caused the system to wake up from idleness to begin executing the thread.
This metric is available in the Hardware Events viewpoint if you enabled the Collect stacks option
during the hardware event-based sampling analysis configuration.
See Also
Hardware Event-based Sampling Collection with Stacks
P-State
CPU P-states represent voltage-frequency control states defined as performance states in the industry
standard Advanced Configuration and Power Interface (ACPI) specification (see http://www.acpi.info for more
details).
In voltage-frequency control, the voltage and clocks that drive circuits are increased or decreased in
response to a workload. The operating system requests specific P-states based on the current workload. The
processor may accept or reject the request and set the P-state based on its own state.
P-states columns represent the processor’s supported frequencies and the time spent in each frequency
during the collection period.
NOTE
This metric is collected as part of energy analysis. Collecting energy analysis data with Intel® SoC
Watch is available for target Android*, Windows*, or Linux* devices. Import and viewing of the Intel
SoC Watch results is supported with any version of the VTune Profiler.
See Also
Interpreting Energy Analysis Data
Energy Analysis Metrics
S0ix States
S0ix-states represent the residency in the Intel® SoC idle standby power states. The S0ix states shut off part
of the SoC when they are not in use. The S0ix states are triggered when specific conditions within the SoC
have been achieved, for example: certain components are in low power states. The SoC consumes the least
amount of power in the deepest (for example, S0i3) state.
On Linux*, Android*, and Chrome* OS, the ACPI S-State represents the system's residency in the ACPI Suspend-To-RAM (S3) state. In the Suspend-To-RAM state, the Linux kernel powers down many of the system's components
while maintaining the system’s state in its main memory. The system consumes the least amount of power
possible while in the Suspend-To-RAM state. Note that any wakelock will prevent the system from entering
the Suspend-To-RAM state.
NOTE
This metric is collected as part of energy analysis. Collecting energy analysis data with Intel® SoC
Watch is available for target Android*, Windows*, or Linux* devices. Import and viewing of the Intel
SoC Watch results is supported with any version of the VTune Profiler.
See Also
Energy Analysis
Interpreting Energy Analysis Data
Window: Wakelocks
Temperature
Temperature columns show the number of samples collected at each temperature reading (in °C) for each device.
NOTE
This metric is collected as part of energy analysis. Collecting energy analysis data with Intel® SoC
Watch is available for target Android*, Windows*, or Linux* devices. Import and viewing of the Intel
SoC Watch results is supported with any version of the VTune Profiler.
See Also
Energy Analysis: to analyze the power consumption of your Android*, Windows*, or Linux* platform, run the Intel® SoC Watch collector and view the results using Intel VTune Profiler.
Timer Resolution
The default timer resolution on Windows* is 15.6 ms – a timer interrupt 64 times a second. While in
connected standby, the resolution will be changed by the operating system to 30 seconds. When programs
increase the timer frequency (that is, request a smaller timer interval), they increase the power consumption of the platform.
The Timer Resolution shows the time spent in each resolution interval during the collection period.
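For illustration only, the Windows* sketch below shows the kind of code that raises the timer frequency and therefore appears in this data; timeBeginPeriod and timeEndPeriod belong to the Windows multimedia timer API (link with winmm.lib), and the 1 ms request is an arbitrary example value:

#include <windows.h>
#include <timeapi.h>

int main(void)
{
    timeBeginPeriod(1);   /* request ~1 ms timer resolution                     */
    Sleep(10 * 1000);     /* placeholder work: the platform now wakes up more
                             often and consumes more power during this interval */
    timeEndPeriod(1);     /* always restore the default resolution              */
    return 0;
}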
NOTE
This metric is collected as part of energy analysis. Collecting energy analysis data with Intel® SoC
Watch is available for target Android*, Windows*, or Linux* devices. Import and viewing of the Intel
SoC Watch results is supported with any version of the VTune Profiler.
See Also
Energy Analysis
Interpreting Energy Analysis Data
Window: Timer Resolution
Metric Description
Total time spent in the active C0 state over all cores.
See Also
Reference for Performance Metrics
Metric Description
Total time in sleep states C1-Cx over all cores.
See Also
Reference for Performance Metrics
Metric Description
Total time spent in the active S0i0 state.
See Also
Reference for Performance Metrics
See Also
Interpreting Energy Analysis Data
Wake-ups
Metric Description
Percentage of core wake-ups over all cores.
See Also
Reference for Performance Metrics
Metric Description
Rate of wake-ups.
See Also
Reference for Performance Metrics
NOTE
For more information on Intel® 64 and IA-32 architectures, explore Intel Software Developer Manuals
available at https://software.intel.com/en-us/articles/intel-sdm.
For details on hardware events supported by your system's PMU, use any of the following options:
• When adding new events to your custom configuration, select an event in the table and explore its short
description, or click the Explain button to open the Intel Processor Events Reference for more details.
• For a full list of processor events and descriptions, explore the web-based Intel Processor Events
Reference.