Gpa Cookbook 2023.1-767264-775816
Cookbook
Intel® Graphics Performance Analyzers Cookbook
Contents
Chapter 1: Intel® Graphics Performance Analyzers Cookbook
Platform-Based Graphics Performance Analysis
  Identify Basic GPU-CPU Bound Scenarios
Performance Optimization for Intel® Processor Graphics
  Optimize Sampler
  Optimize Shader Execution
Notices and Disclaimers
NOTE
Recipes are added on a regular basis. Please use the Intel® GPA Forum to communicate suggestions
for new recipes.
Related Information
Get Started with Intel® Graphics Performance Analyzers
Intel® Graphics Performance Analyzers User Guide
Intel® Graphics Performance Analyzers for Windows* Host
Each Graphics Trace Analyzer track shows specific performance events generated by your application and
system at subsequent stages of graphics command execution:
Track | Description
Thread Activity | Shows how threads from different processes, including your profiled application, have been executed.
Hardware GPU queue | Shows how the GPU executes commands forming a frame buffer you see on the screen.
Flip queue and VSYNC events | Shows work performed by a display manager.
CPU frames | Shows the range containing graphics commands between two successive frame buffer swap calls.
Driver CPU queue | Shows how many graphics commands are scheduled by a graphics driver for execution by the GPU.
Debug and ITT events | Shows the result of user-defined instrumentation of the profiled application and markers matched with performance data generated by the system.
To start platform-based graphics performance analysis, use the Hardware GPU queue, Flip queue, and Driver CPU
queue tracks to quickly determine whether your application is GPU-bound or CPU-bound. The Thread Activity track,
CPU frames track, and Debug and ITT events track can be used for a detailed analysis of CPU-bound
scenarios.
See Also
Identify Basic GPU/CPU Bound Scenarios
Ingredients
To identify GPU-bound graphics applications, you need the following:
• Tool: Intel® GPA Graphics Trace Analyzer
NOTE
To download a free copy of the Intel® Graphics Performance Analyzers toolkit, visit the Intel® GPA
product page.
NOTE Buffer execution time is the interval between a command buffer appearing in a queue and the
execution of its last command. The longer this interval, the more GPU-bound your application is.
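The definition in the note above can be written as a small calculation. This is only a sketch: the `CommandBuffer` record and its field names are hypothetical and are not part of any Intel® GPA export format; the timestamps are made-up values in microseconds.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CommandBuffer:
    """Hypothetical record of one command buffer on a queue track."""
    enqueued_us: float           # time the buffer appeared in the queue
    last_command_done_us: float  # time its last command finished executing


def buffer_execution_time_us(buf: CommandBuffer) -> float:
    """Interval between appearance in the queue and the end of the last command."""
    if buf.last_command_done_us < buf.enqueued_us:
        raise ValueError("a buffer cannot finish before it is enqueued")
    return buf.last_command_done_us - buf.enqueued_us


# Made-up example: the buffer appears at 1,000 us and completes at 17,500 us.
example = CommandBuffer(enqueued_us=1_000.0, last_command_done_us=17_500.0)
print(buffer_execution_time_us(example))  # 16500.0
```

Comparing this interval across the buffers of a trace is one way to see which ones keep the GPU busy the longest.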
Typically, a GPU-bound application is an application that has a combination of the following factors: very
complicated shaders running on GPU; memory consuming assets, such as geometry or textures; or too many
drawing commands submitted into command buffers.
Tip For a detailed analysis and optimization of GPU-bound graphics applications, use Graphics Frame
Analyzer.
NOTE
Frame time is an interval from the appearance of the first frame package in a queue till the execution
of the last frame package in the queue.
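As an illustration of the frame time definition, the following sketch derives it from a list of frame packages. The `(enqueued_us, executed_us)` tuple layout is an assumption made for this example and does not describe a Graphics Trace Analyzer data format.

```python
from typing import Iterable, Tuple

# Hypothetical layout: (time the package appeared in the queue,
#                       time the package finished executing), in microseconds.
Package = Tuple[float, float]


def frame_time_us(packages: Iterable[Package]) -> float:
    """Time from the first package appearing to the last package executing."""
    pkgs = list(packages)
    if not pkgs:
        raise ValueError("a frame needs at least one package")
    first_appearance = min(enqueued for enqueued, _ in pkgs)
    last_execution = max(executed for _, executed in pkgs)
    return last_execution - first_appearance


# Made-up frame consisting of three packages.
frame = [(0.0, 4_000.0), (500.0, 9_000.0), (1_200.0, 16_000.0)]
print(frame_time_us(frame))  # 16000.0
```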
NOTE
Disable VSync synchronization in rendering, and then recapture the trace to continue the analysis.
Once VSync is disabled, the distribution of queue packages on the timeline may change. Your
application might then appear GPU-bound or CPU-bound.
One probable scenario in these conditions is inadequate synchronization of the GPU and CPU parts of
rendering. For example, the GPU may stall waiting for resources to be prepared on the CPU. Such
desynchronization affects the User Mode Driver, making it accumulate an excessive number of packages.
NOTE
A CPU-bound scenario is the most complex case to optimize. Use code analysis provided by Intel®
VTune™ Profiler to explore CPU bottlenecks in rendering and frame analysis with Graphics Frame
Analyzer to explore GPU bottlenecks. To explore CPU bottlenecks, you can also use Graphics Trace
Analyzer tracks with events generated by Debug API and Instrumentation and Tracing Technology API
(ITT API) markup.
NOTE In the default Graphics Trace Analyzer color scheme, queue packages from different processes
have different colors.
NOTE
In this scenario, it is not possible to determine accurately whether the application is GPU-bound or
CPU-bound. Stop all irrelevant applications that use the GPU, and then recapture a trace to continue
the analysis.
See Also
Launching an Application
Platform Analysis
Methodology
Use this series of recipes to learn how to use the Intel® GPA Graphics Frame Analyzer on Intel® Processor
Graphics to profile your code efficiently and to find bottlenecks in the graphics pipeline.
1. How to start analysis
2. How Graphics Frame Analyzer identifies bottlenecks using hardware metrics
Ingredients
To optimize performance of graphics applications on Intel® Processor Graphics with Intel® GPA, you need the
following:
• Tool: Intel® Graphics Performance Analyzers
NOTE
To download a free copy of the Intel® Graphics Performance Analyzers toolkit, visit the Intel® GPA
product page.
NOTE It is recommended to analyze performance with the latest driver and version of Intel® GPA.
3. Open the captured frame with the Intel® GPA Graphics Frame Analyzer.
NOTE For Vulkan, open the captured stream in the Multiframe View, and then select a frame to open
with Intel® GPA Graphics Frame Analyzer.
4. Click the button to enable the Advanced Profiling mode, and then select any event or group of events for
further analysis.
In the normal mode you can manually select one event or a contiguous range of events. To properly observe
graphics architecture, the selected events should meet the following conditions:
• Total cycle count of all selected events is ≥ 20,000.
NOTE
Check the GPU Core Clocks, cycles metric.
• There are no state changes between the events, such as shader changes or pipeline state changes.
Texture and constant changes are exempt from this rule, unless the texture is a dynamically generated
surface.
• Events share the same render, depth, and stencil surface. This is not an explicit check in Intel® GPA.
If you select a set of events that does not meet the above conditions, these events are treated as filtered
events, and the analysis is not conducted. When using metrics analysis techniques like this, do not include
any state change within the selection. For example, if you measure two draw calls where one has a depth
attachment and the other does not, any potential hotspot associated with depth is averaged out over
the two draw calls, which dilutes the results.
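The selection conditions above can be summarized as a checklist. The sketch below is not how Intel® GPA implements its filtering; the `Event` record and its identifier fields are hypothetical stand-ins for whatever state your own tooling tracks. Only the 20,000-cycle threshold comes from the text.

```python
from dataclasses import dataclass
from typing import List, Sequence

MIN_TOTAL_CYCLES = 20_000  # threshold stated in the text


@dataclass(frozen=True)
class Event:
    """Hypothetical description of one selected event (for example, a draw call)."""
    gpu_core_clocks: int    # value of the "GPU Core Clocks, cycles" metric
    pipeline_state_id: int  # assumed id covering shaders and pipeline state
    render_target_id: int   # assumed id of the bound render surface
    depth_stencil_id: int   # assumed id of the bound depth/stencil surface


def selection_problems(events: Sequence[Event]) -> List[str]:
    """Return the reasons why a selection is unsuitable for metrics analysis."""
    if not events:
        return ["selection is empty"]
    problems = []
    if sum(e.gpu_core_clocks for e in events) < MIN_TOTAL_CYCLES:
        problems.append("total cycle count is below 20,000")
    if len({e.pipeline_state_id for e in events}) > 1:
        problems.append("shader or pipeline state changes inside the selection")
    if len({(e.render_target_id, e.depth_stencil_id) for e in events}) > 1:
        problems.append("events do not share render, depth, and stencil surfaces")
    return problems


good = [Event(15_000, 1, 7, 9), Event(12_000, 1, 7, 9)]
mixed = [Event(15_000, 1, 7, 9), Event(12_000, 2, 7, 0)]
print(selection_problems(good))   # []
print(selection_problems(mixed))
```

Texture and constant changes are deliberately left out of `pipeline_state_id`, because the text exempts them from the rule.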
Each of the metrics blocks in the Intel® GPA Graphics Frame Analyzer Metrics pane is mapped based on the
graphics processing unit workflows. Intel® Processor Graphics performs deeply pipelined parallel execution of
the front-end work and the back-end work within a single event. The front-end work includes geometry
transformation, rasterization, early depth/stencil, etc. The back-end work includes pixel shading, sampling,
color write, blend, and late depth/stencil. Due to the deeply pipelined execution, hotspots from downstream
architectural blocks bubble up and stall upstream blocks. This can make it difficult to find the actual hotspot.
To find the primary hotspot using the metrics, Intel® GPA walks the pipeline in reverse order. Intel® GPA
follows two separate workflows: one for 3D workloads and one for general-purpose computing on graphics
processing units (GPGPU).
Workflow for 3D workloads:
Green nodes within the flowcharts represent potential bottlenecks within the GPU. At each node, Intel® GPA
asks whether the bottleneck is primary. If yes, the bottleneck for the particular selection is found. If no,
Intel® GPA continues to the next node in the flowchart. Blue nodes branch the decision path, and grey nodes
represent terminal hotspots.
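The idea of walking the pipeline in reverse can be sketched as follows. This is an illustration only: the stage list is taken from the front-end and back-end description above, its exact ordering is an assumption, and the boolean "flagged" input replaces the real per-node metric checks of the flowcharts, which are not reproduced here.

```python
from typing import Mapping, Optional, Sequence

# Assumed front-to-back ordering, based on the stages named in the text.
PIPELINE_ORDER: Sequence[str] = (
    "geometry transformation",
    "rasterization",
    "early depth/stencil",
    "pixel shading",
    "sampling",
    "color write",
    "blend",
    "late depth/stencil",
)


def primary_hotspot(
    flagged: Mapping[str, bool],
    order: Sequence[str] = PIPELINE_ORDER,
) -> Optional[str]:
    """Return the most downstream flagged stage, or None if nothing is flagged.

    Downstream stalls propagate upstream, so the first flagged stage found
    when walking from the end of the pipeline is treated as the primary one.
    """
    for stage in reversed(order):
        if flagged.get(stage, False):
            return stage
    return None


# Rasterization looks busy only because sampling stalls it from downstream.
print(primary_hotspot({"rasterization": True, "sampling": True}))  # sampling
```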
NOTE Families of Intel® Xe graphics products starting with Intel® Arc™ Alchemist (formerly DG2) and
newer generations feature GPU architecture terminology that shifts from legacy terms. For more
information on the terminology changes and to understand their mapping with legacy content, see
GPU Architecture Terminology for Intel® Xe Graphics.
For more information about sampler and shader execution hotspots, read the following sections: Optimize
Sampler, Optimize Shader Execution.
See Also
Developer and Optimization Guide for Intel® Processor Graphics Gen11 API
Launching an Application
Profiling Desktop API Frames
Metrics Pane
Optimize Sampler
Sampling is the process of fetching a value from a texture at a given position. You can configure multiple
sampling parameters, such as filtering mode, to balance visual results and sampling performance.
Intel® GPA Graphics Frame Analyzer checks the difference between the percentage of time when a Sampler
Input is available and the percentage of time when a Sampler Output is ready.
Metric Name | Description
GPU / Sampler : Slice <N> Subslice<M> Sampler Input Available | Percentage of time there is input from the EUs on slice ‘N’ and subslice ‘M’ to the sampler.
GPU / Sampler : Slice <N> Subslice<M> Sampler Output Ready | Percentage of time there is output from the sampler to EUs on slice ‘N’ and subslice ‘M’.
NOTE Families of Intel® Xe graphics products starting with Intel® Arc™ Alchemist (formerly DG2) and
newer generations feature GPU architecture terminology that shifts from legacy terms. For more
information on the terminology changes and to understand their mapping with legacy content, see
GPU Architecture Terminology for Intel® Xe Graphics.
When Input Available is more than 10 percent greater than Output Ready for a subslice of a given slice, the
sampler is not returning data to the EUs as fast as it is being requested, and the sampler is probably the
hotspot. This comparison only indicates a primary hotspot when the samplers are relatively busy, which means
that both EU Occupancy and EU Stall are relatively high.
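The rule above can be expressed as a simple predicate. This is a sketch with two assumptions: the 10 percent gap is read as a difference in percentage points, and "relatively high" is replaced by a hypothetical `busy_threshold_pct` of 50, a value that does not come from Intel® documentation.

```python
def sampler_is_likely_hotspot(
    input_available_pct: float,
    output_ready_pct: float,
    eu_occupancy_pct: float,
    eu_stall_pct: float,
    busy_threshold_pct: float = 50.0,  # assumed meaning of "relatively high"
) -> bool:
    """Apply the Input Available vs. Output Ready comparison for one subslice."""
    gap = input_available_pct - output_ready_pct
    samplers_busy = (
        eu_occupancy_pct >= busy_threshold_pct
        and eu_stall_pct >= busy_threshold_pct
    )
    return gap > 10.0 and samplers_busy


# Made-up metric values for three subslices.
print(sampler_is_likely_hotspot(80.0, 60.0, 90.0, 70.0))  # True
print(sampler_is_likely_hotspot(80.0, 75.0, 90.0, 70.0))  # False: gap too small
print(sampler_is_likely_hotspot(80.0, 60.0, 20.0, 10.0))  # False: EUs are not busy
```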
1. Optimize Sampler Bottleneck with Graphics Frame Analyzer
• Reduce Texture Size
• Change Filter Parameters in Pixel Shader
Ingredients
To optimize a Sampler bottleneck, you need the following:
• Application: Unreal Engine 4* Sun Temple sample, DirectX SDK* CascadedShadowMaps11 sample
• Tool: Intel® GPA Graphics Frame Analyzer
NOTE
To download a free copy of the Intel® Graphics Performance Analyzers toolkit, visit the Intel® GPA
product page.
With Intel® GPA Graphics Frame Analyzer you can optimize the Sampler bottleneck with real-time
experiments, such as changing texture size and filter parameters in a pixel shader.
2. Click the Show All Resources button, and then click the Textures tab to open the list of sampled
textures.
3. Reduce the size of one or more large textures. For example, the marble texture size is 1024x1024
pixels. Select a smaller size, for example 256x256, and then click the button.
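To get a feel for the size of this change, the sketch below compares the texel counts of the two example sizes. It covers only the top mip level and says nothing about the texture format or the resulting speedup, which has to be measured in Graphics Frame Analyzer.

```python
from typing import Tuple

Size = Tuple[int, int]  # (width, height) in texels


def texel_count(size: Size) -> int:
    """Number of texels in the top mip level of a 2D texture."""
    width, height = size
    return width * height


def reduction_factor(original: Size, reduced: Size) -> float:
    """How many times fewer texels the reduced texture has."""
    return texel_count(original) / texel_count(reduced)


print(texel_count((1024, 1024)))                   # 1048576
print(texel_count((256, 256)))                     # 65536
print(reduction_factor((1024, 1024), (256, 256)))  # 16.0
```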
Result:
Difference:
The textures before and after changing the size look quite similar, but the Sampler metric in the 3D Pipeline
tab is now green. The execution time is improved by 18% for the selected segment and by 4% overall.
The pink segment contains the texture and shadow rendering. Shadow properties are set in the pixel
shader.
2. Select the Shader resource in the Resource List, and then choose the Pixel shader type. The pixel
shader contains the CalculatePCFPercentLit method with m1 and m2 values, which represent the
iteration range in the filter loop.
m1 and m2 formulas:
m1 = m_iPCFBlurSize / -2
m2 = m_iPCFBlurSize / 2 + 1,
where m_iPCFBlurSize is the kernel size. The initial kernel size is 9, m1 = -4, and m2 = 5.
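The following sketch reproduces the m1 and m2 values and estimates how many shadow map samples the filter takes per pixel. Two assumptions are made: the shader's integer division truncates toward zero (Python's `//` floors, so the negative bound is written as `-(size // 2)`), and the filter iterates over a square region from m1 to m2 (exclusive) in both x and y. The helper names are hypothetical.

```python
from typing import Tuple


def pcf_loop_bounds(blur_size: int) -> Tuple[int, int]:
    """Return (m1, m2) for a given kernel size, using truncating division."""
    if blur_size < 1:
        raise ValueError("kernel size must be at least 1")
    m1 = -(blur_size // 2)     # same as blur_size / -2 with truncation
    m2 = blur_size // 2 + 1
    return m1, m2


def pcf_sample_count(blur_size: int) -> int:
    """Samples per pixel, assuming a square x/y loop over [m1, m2)."""
    m1, m2 = pcf_loop_bounds(blur_size)
    return (m2 - m1) ** 2


print(pcf_loop_bounds(9), pcf_sample_count(9))  # (-4, 5) 81
print(pcf_loop_bounds(1), pcf_sample_count(1))  # (0, 1) 1
```

Under these assumptions, the sample count grows with the square of the kernel size, which is why shrinking the kernel relieves the Sampler so strongly.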
The metrics values are improved, but the Sampler is still a bottleneck.
4. Check the extreme condition by setting the kernel size to 1, m1 to 0, and m2 to 1.
The Sampler is underlined green now. The execution time is improved by 8% overall and by 89% for the
selection segment.
Compare the original and the resulting textures:
Original:
Result:
Difference:
See Also
How to find a bottleneck with Graphics Frame Analyzer
Intel® Processor Graphics developer documents
DXGI_FORMAT enumeration (dxgiformat.h)
Cascaded Shadow Maps
Optimize Shader Execution
NOTE Families of Intel® Xe graphics products starting with Intel® Arc™ Alchemist (formerly DG2) and
newer generations feature GPU architecture terminology that shifts from legacy terms. For more
information on the terminology changes and to understand their mapping with legacy content, see
GPU Architecture Terminology for Intel® Xe Graphics.
When EU Array / Pipes: EU FPU0 Pipe Active or EU Array / Pipes: EU FPU1 Pipe Active is above 90 percent,
the primary hotspot may be the number of instructions per clock (IPC). If so, adjust shader algorithms to
remove unnecessary instructions, or use more efficient instructions, to improve IPC. For IPC-limited pixel
shaders, ensure maximum throughput by limiting shader temporary registers to ≤ 16.
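The two thresholds in this paragraph can be captured in a short helper. This is a sketch: the function and parameter names are hypothetical, and the metric values would come from the Graphics Frame Analyzer Metrics pane rather than from code.

```python
FPU_ACTIVE_THRESHOLD_PCT = 90.0  # threshold stated in the text
MAX_TEMP_REGISTERS = 16          # limit stated in the text for IPC-limited pixel shaders


def is_ipc_limited(fpu0_active_pct: float, fpu1_active_pct: float) -> bool:
    """True if either FPU pipe is active more than 90 percent of the time."""
    return (
        fpu0_active_pct > FPU_ACTIVE_THRESHOLD_PCT
        or fpu1_active_pct > FPU_ACTIVE_THRESHOLD_PCT
    )


def temp_registers_within_limit(temp_register_count: int) -> bool:
    """True if the shader stays within the recommended temporary register count."""
    return temp_register_count <= MAX_TEMP_REGISTERS


# Made-up metric values.
print(is_ipc_limited(93.5, 41.0))       # True
print(is_ipc_limited(60.0, 55.0))       # False
print(temp_registers_within_limit(24))  # False
```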
NOTE The recipe describes how to optimize the Shader Execution bottleneck on a particular sample.
Some of these optimizations can be applied to real-world graphics applications.
Ingredients
To optimize the Shader Execution bottleneck, you need the following:
• Application: Microsoft D3D12Multithreading sample:
https://github.com/microsoft/DirectX-Graphics-Samples/tree/master/Samples/Desktop/D3D12Multithreading
• Tool: Intel® GPA Graphics Frame Analyzer
NOTE
To download a free copy of the Intel® Graphics Performance Analyzers toolkit, visit the Intel® GPA
product page.
totalLight += lightPass;
}
return diffuseColor * saturate(totalLight);
}
4. Select the ISA type in the Shader Code drop-down list to analyze the GEN Assembly.
In the GEN Assembly, you can find the instructions generated by the Intel® Graphics Compiler. The
example also contains many complex math operations, such as square roots, inverse square roots, and
inversions:
• math.sqt – 3 instructions
• math.rsqt – 9 instructions
• math.inv – 7 instructions
There are also many arithmetic operations: 83 multiplications and 80 fused multiply/add instructions. To
optimize the Shader Execution bottleneck, you can try to remove unnecessary computations from the shader code
or simplify the shader.
NOTE If a shader has loops or branches, the number of executed instructions may not correspond to
the execution count in the assembly code.
Perform Optimization
Once you have determined the areas for optimization, try the following corresponding changes to fix the Shader
Execution bottleneck:
1. Eliminate the constant condition to remove the flow control.
button.
2. Click the resource binding description to open the buffer in the Resource Viewer.
Since sampleShadowMap equals 1, the condition sampleShadowMap && i == 0 is equivalent to i == 0. The
modification makes the shader linear, without flow control instructions, and the number of assembly lines
drops from 280 to 263. The only basic block (a linear sequence of instructions) contains 262
instructions. The change has an insignificant effect on performance, but it simplifies further shader analysis.
vVertNormal = normalize(vVertNormal);
vVertTangent = normalize(vVertTangent);
float3 vVertBinormal = normalize(cross(vVertTangent, vVertNormal));
float fCosAngle = dot(vLightToPixelNormalized, vLightDir / length(vLightDir));
You can find a vertex buffer in Graphics Frame Analyzer in the same way as the constant buffer
content, using the Shader Resource panel. If you look at the vertex buffer content, you can see that
the Normal and Tangent vertex attributes are already normalized on the CPU, so their normalization
can be removed from the shader code.
Light intensity depends on the distance passed from a light position to a pixel. Light attenuation starts
from distance 800, the vFalloffs.x parameter. The bounding box size in the sample shader is much
smaller than the distance that causes light attenuation.
You can remove the distance attenuation without rendering impact, because the
saturate(vFalloffs.x - fDist) expression always equals 1. Upon the distance attenuation
removal, the shader contains only one inverse instruction and three inverse square root instructions.
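The claim that the attenuation term is always 1 can be checked numerically. In this sketch, `saturate` mirrors the HLSL intrinsic (clamp to the range 0..1), the falloff start of 800 comes from the text, and the sampled distance range is a made-up stand-in for a scene that is much smaller than the falloff distance.

```python
def saturate(x: float) -> float:
    """Clamp a value to the range [0, 1], like the HLSL saturate intrinsic."""
    return min(1.0, max(0.0, x))


def distance_attenuation(falloff_start: float, dist: float) -> float:
    """The attenuation term saturate(vFalloffs.x - fDist) from the shader."""
    return saturate(falloff_start - dist)


FALLOFF_START = 800.0  # vFalloffs.x, as stated in the text

# Hypothetical light-to-pixel distances well inside the falloff distance.
distances = [0.0, 50.0, 250.0, 500.0, 799.0]
print(all(distance_attenuation(FALLOFF_START, d) == 1.0 for d in distances))  # True

# Attenuation would only start to matter beyond a distance of 799.
print(distance_attenuation(FALLOFF_START, 799.5))  # 0.5
```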
The shader performs two consecutive vector-matrix multiplications:
Deleting one extra ComputeLightIntensity function call reduces the number of instructions from 189 to
166.
The performed optimizations reduce the number of instructions by a factor of 1.6 compared to the original
shader.
Verify Optimizations
To check that shader optimizations do not affect rendering, compare the original and modified render targets
using the Diff Visualization mode in the Graphics Frame Analyzer:
There are no visual changes. A small difference becomes visible only when you scale the Color Histogram,
because the order of operations in the code changed slightly.
The performed optimizations reduce the draw call duration from 549 to 367 µs, which gives a 1.5x
performance gain. The primary bottleneck in the sample frame moves to the L3 cache.
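The reported gain follows directly from the two draw call durations. A minimal sketch of the arithmetic, using hypothetical helper names:

```python
def speedup(before_us: float, after_us: float) -> float:
    """How many times faster the optimized draw call is."""
    return before_us / after_us


def time_saved_pct(before_us: float, after_us: float) -> float:
    """Share of the original duration that was removed, in percent."""
    return (before_us - after_us) / before_us * 100.0


BEFORE_US = 549.0  # original draw call duration from the text
AFTER_US = 367.0   # optimized draw call duration from the text

print(round(speedup(BEFORE_US, AFTER_US), 1))         # 1.5
print(round(time_saved_pct(BEFORE_US, AFTER_US), 1))  # 33.2
```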
See Also
How to find a bottleneck with Graphics Frame Analyzer
Intel® Processor Graphics developer guides
Gen9 Compute Architecture
Introduction to GEN Assembly
The products described may contain design defects or errors known as errata which may cause the product
to deviate from published specifications. Current characterized errata are available on request.
Intel disclaims all express and implied warranties, including without limitation, the implied warranties of
merchantability, fitness for a particular purpose, and non-infringement, as well as any warranty arising from
course of performance, course of dealing, or usage in trade.