Intel® Graphics Performance Analyzers Cookbook

Contents
Chapter 1: Intel® Graphics Performance Analyzers Cookbook
Platform-Based Graphics Performance Analysis
Identify Basic GPU-CPU Bound Scenarios
Performance Optimization for Intel® Processor Graphics
Optimize Sampler
Optimize Shader Execution
Notices and Disclaimers

Intel® Graphics Performance Analyzers Cookbook
Intel® Graphics Performance Analyzers (Intel® GPA) provides a solution for graphics analysis and optimization
that can help you improve performance of games and other graphics-intensive applications.
This Cookbook provides performance analysis scenarios (recipes) that help you solve specific real-world
performance issues using Intel® GPA tools:
• Platform-Based Graphics Performance Analysis
• Performance Optimization for Intel® Processor Graphics

NOTE
Recipes are added on a regular basis. Please use the Intel® GPA Forum to communicate suggestions
for new recipes.

Related Information
Get Started with Intel® Graphics Performance Analyzers
Intel® Graphics Performance Analyzers User Guide
Intel® Graphics Performance Analyzers for Windows* Host

Platform-Based Graphics Performance Analysis


As graphics processors become more powerful, rendered images become more realistic and graphics become
more resource-demanding. Game developers try to use every advantage that modern GPUs offer and often run
into graphics performance issues.
When talking about game graphics, developers usually mean a scene in a game world whose objects are
rendered by a series of graphics commands that form a final frame. This final frame is the output image that
users see on the screen. The number of frames rendered per second (FPS) is a common metric for measuring
overall game performance. A reasonable frame rate depends on the type of game: a dynamic game, such as a
first-person shooter, usually requires a higher FPS rate than a turn-based strategy game.
If FPS drops below the desired level, you need to identify the software or hardware bottlenecks that limit
game performance. Modern games are complex applications, and not only in terms of graphics. Slow graphics
is not always the result of a bottleneck in the GPU part of a game: the CPU can also affect the game and,
indirectly, its graphics performance. That is why identifying whether your application is CPU-bound or
GPU-bound can be tricky.
Intel® GPA Graphics Trace Analyzer gives high-level insight into the execution flow of a running graphics
application. Graphics Trace Analyzer shows performance events generated by the system kernel and drivers,
as well as application events that you define through debug API events and Instrumentation and Tracing
Technology (ITT) markers provided by Intel® GPA.
Once you open a captured trace in the Graphics Trace Analyzer, you can explore the following critical tracks
for performance analysis:


Thread Activity: Shows how threads from different processes, including your profiled application, have been
executed.
Hardware GPU queue: Shows how the GPU executes the commands that form the frame buffer you see on the
screen.
Flip queue and VSYNC events: Shows work performed by a display manager.
CPU frames: Shows the range containing graphics commands between two successive frame buffer swap calls.
Driver CPU queue: Shows how many graphics commands are scheduled by a graphics driver for execution by
the GPU.
Debug and ITT events: Shows the result of user-defined instrumentation of the profiled application and
markers matched with performance data generated by the system.

Each Graphics Trace Analyzer track shows specific performance events generated by your application and
system at successive stages of graphics command execution.

To start platform-based graphics performance analysis, use the Hardware GPU queue, Flip queue, and Driver
CPU queue tracks to quickly determine whether your application is GPU-bound or CPU-bound. Use the Thread
Activity, CPU frames, and Debug and ITT events tracks for a detailed analysis of CPU-bound scenarios.


See Also
Identify Basic GPU/CPU Bound Scenarios

Identify Basic GPU-CPU Bound Scenarios


If rendering in your graphics application is visibly slow, explore GPU and CPU queues available in Graphics
Trace Analyzer to determine whether your application is GPU-bound or CPU-bound.
1. How to start analysis
2. Analyze GPU and CPU queues
• Typical GPU-bound scenario
• VSync-bound scenario
• Typical CPU-bound scenario
• Multi-Process GPU Utilization Scenario

Ingredients
To identify GPU-bound graphics applications, you need the following:
• Tool: Intel® GPA Graphics Trace Analyzer

NOTE
To download a free copy of the Intel® Graphics Performance Analyzers toolkit, visit the Intel® GPA
product page.

• Operating System: Windows*


• GPU: Any
• API: DirectX* 9-12, Vulkan*

How to Start Analysis


To get started with your analysis:
1. Launch the Intel® GPA Graphics Monitor on your target system.
2. Capture a sample trace. A trace contains performance data connected with your application and
system.
3. Open the captured trace in the Graphics Trace Analyzer to explore performance events in GPU/CPU
queues and VSync events generated by a window display manager.

Analyze GPU and CPU queues


Graphics rendering is a process of submitting commands to a graphics driver. The driver batches the submitted
commands in command buffers, pushes the buffers into a Driver CPU queue, and schedules the commands
for execution on the GPU. The size of a queue indicates whether the GPU is busy or starved. The queue size
also shows how many graphics commands are submitted and how many of them are waiting for execution.

Typical GPU Bound Scenario


• Hardware queue is completely busy executing command buffers and has no visible gaps.
• Driver queue continuously accumulates command buffers waiting for the execution on the GPU.
• Average command buffer execution time exceeds the desired limit based on the expected FPS rate.


NOTE Buffer execution time is an interval between command buffer appearance in a queue and
executing its last command. The longer this interval, the more GPU-bound your application is.
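
The third criterion above can be checked with simple arithmetic. The sketch below is a minimal illustration, assuming that the "desired limit" is the per-frame time budget (1000 ms divided by the target FPS) and that the buffer execution times were read off the trace by hand. The function names and sample values are hypothetical and are not part of Intel® GPA.

```python
def frame_budget_ms(target_fps):
    """Time available per frame, in milliseconds, at the target frame rate."""
    if target_fps <= 0:
        raise ValueError("target_fps must be positive")
    return 1000.0 / target_fps


def exceeds_budget(buffer_times_ms, target_fps):
    """True when the average command buffer execution time is over budget."""
    average = sum(buffer_times_ms) / len(buffer_times_ms)
    return average > frame_budget_ms(target_fps)


# Hypothetical execution times read off a trace, in milliseconds.
sample_times_ms = [21.5, 19.8, 22.4]
print(round(frame_budget_ms(60), 2))        # 16.67
print(exceeds_budget(sample_times_ms, 60))  # True
```

An average above the budget is only one of the three indicators. The queue shapes described above still need to be checked in the trace itself.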

Typically, a GPU-bound application is an application that has a combination of the following factors: very
complicated shaders running on GPU; memory consuming assets, such as geometry or textures; or too many
drawing commands submitted into command buffers.

Tip For a detailed analysis and optimization of GPU-bound graphics applications, use Graphics Frame
Analyzer.

VSync Bound Scenario


• Hardware queue has visible gaps, indicating that the GPU is not fully busy.
• Driver queue has visible gaps, indicating that the CPU part of the graphics workload is light.
• Frame time is shorter than VSync intervals.

NOTE

Frame time is an interval from the appearance of the first frame package in a queue till the execution
of the last frame package in the queue.


NOTE
Disable VSync synchronization in rendering, and then recapture the trace to continue the analysis.
Once VSync is disabled, the distribution of queue packages on the timeline may change, and your
application might appear GPU-bound or CPU-bound.

Typical CPU Bound Scenario


• Hardware queue size is small and has visible gaps. This means that the GPU is idle most of the time.
• Driver queue size is large.

One probable cause in these conditions is inadequate synchronization between the GPU and CPU parts of
rendering. For example, the GPU may stall while waiting for resources to be prepared on the CPU. Such
desynchronization affects the User Mode Driver, making it accumulate an excessive number of packages.


NOTE
The CPU-bound scenario is the most complex case to optimize. Use code analysis provided by Intel®
VTune™ Profiler to explore CPU bottlenecks in rendering, and frame analysis with Graphics Frame
Analyzer to explore GPU bottlenecks. To explore CPU bottlenecks, you can also use Graphics Trace
Analyzer tracks with events generated by Debug API and Instrumentation and Tracing Technology API
(ITT API) markup.

Multi-Process GPU Utilization Scenario


• More than one graphics application runs simultaneously.
• GPU queue is full and contains packages from multiple processes.

NOTE In the default Graphics Trace Analyzer color scheme, queue packages from different processes
have different colors.

NOTE
In this scenario, it is not possible to determine accurately whether the application is GPU-bound or
CPU-bound. Stop all irrelevant applications that use the GPU, and then recapture a trace to continue
the analysis.
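
The four scenarios can be summarized as a small decision procedure. The sketch below is a rough illustration only: it assumes you have already judged by eye, from the Graphics Trace Analyzer tracks, whether each queue shows gaps. The `QueueObservation` type, its fields, and the returned labels are hypothetical and do not correspond to any Intel® GPA API.

```python
from dataclasses import dataclass


@dataclass
class QueueObservation:
    """Hand-assessed facts about one captured trace (hypothetical fields)."""
    hardware_queue_has_gaps: bool
    driver_queue_has_gaps: bool
    frame_time_ms: float
    vsync_interval_ms: float
    gpu_process_count: int = 1


def classify(obs):
    # Packages from several processes make the picture unreliable.
    if obs.gpu_process_count > 1:
        return "inconclusive: stop other GPU applications and recapture"
    # GPU always busy while the driver queue keeps accumulating work.
    if not obs.hardware_queue_has_gaps and not obs.driver_queue_has_gaps:
        return "GPU-bound"
    # Both queues idle part of the time and frames finish within a VSync interval.
    if (obs.hardware_queue_has_gaps and obs.driver_queue_has_gaps
            and obs.frame_time_ms < obs.vsync_interval_ms):
        return "VSync-bound"
    # GPU mostly idle while the driver queue stays large.
    if obs.hardware_queue_has_gaps and not obs.driver_queue_has_gaps:
        return "CPU-bound"
    return "unclear"


print(classify(QueueObservation(False, False, 25.0, 16.7)))  # GPU-bound
print(classify(QueueObservation(True, True, 9.0, 16.7)))     # VSync-bound
print(classify(QueueObservation(True, False, 25.0, 16.7)))   # CPU-bound
```

The multi-process check comes first because, as noted above, no other conclusion is reliable until the trace is recaptured with only the profiled application using the GPU.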

See Also
Launching an Application
Platform Analysis

Performance Optimization for Intel® Processor Graphics


Performance Optimization for Intel® Processor Graphics is a series of recipes to help you determine and
optimize performance bottlenecks in graphics applications.

Methodology
Use the series of recipes to learn how to use the Intel® GPA Graphics Frame Analyzer on Intel® Processor
Graphics to profile your code efficiently and to find bottlenecks in the graphics pipeline.
1. How to start analysis
2. How Graphics Frame Analyzer identifies bottlenecks using hardware metrics

Ingredients
To optimize performance of graphics applications on Intel® Processor Graphics with Intel® GPA, you need the
following:
• Tool: Intel® Graphics Performance Analyzers

NOTE
To download a free copy of the Intel® Graphics Performance Analyzers toolkit, visit the Intel® GPA
product page.

• Operating System: Windows*, Ubuntu*


• GPU: Intel® Processor Graphics Gen6 - Gen11
• API: DirectX* 9 - 12, Vulkan*, OpenGL*

How to Start Analysis


To get started with your analysis:
1. Launch the Intel® GPA Graphics Monitor on your target system.
2. Capture a sample frame or stream (for Vulkan) from your game with the Intel® GPA Heads-Up Display
(HUD).

NOTE It is recommended to analyze performance with the latest driver and version of Intel® GPA.

3. Open the captured frame with the Intel® GPA Graphics Frame Analyzer.

NOTE For Vulkan, open the captured stream in the Multiframe View, and then select a frame to open
with Intel® GPA Graphics Frame Analyzer.

4. Click the button to enable the Advanced Profiling mode, and then select any event or group of events for
further analysis.
In the normal mode you can manually select one event or a contiguous range of events. To properly observe
graphics architecture, the selected events should meet the following conditions:
• Total cycle count of all selected events is ≥ 20,000.

NOTE
Check the GPU Core Clocks, cycles metric.

• There are no state changes between the events, such as shader changes, pipeline state changes, and so on.
Texture and constant changes are exempt from this rule, unless the texture is a dynamically generated
surface.


• Events share the same render, depth, and stencil surface. This is not an explicit check in Intel® GPA.
If you select a set of events that do not meet the above conditions, these events are considered filtered
events, and the analysis is not conducted. When you use metrics analysis techniques like this, avoid any
state change within the selection. For example, if you measure two draw calls where one has a depth
attachment and the other does not, any potential hotspot associated with depth is averaged over the two
draw calls, which dilutes the results.
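
The selection rules can be expressed as a simple pre-check. The sketch below is an illustration only: the `Event` type and its fields are hypothetical stand-ins for values you would read from the tool (for example, the GPU Core Clocks metric), and a single `pipeline_state_id` is assumed to capture every state change that is not exempt.

```python
from dataclasses import dataclass

MIN_TOTAL_CYCLES = 20_000  # threshold stated in the conditions above


@dataclass(frozen=True)
class Event:
    """One selected event (hypothetical representation)."""
    gpu_core_clocks: int
    pipeline_state_id: int   # assumed to change with shaders or pipeline state
    render_targets: tuple    # render, depth, and stencil surface names


def selection_is_analyzable(events):
    """Apply the three selection conditions to a list of events."""
    if not events:
        return False
    total_cycles = sum(e.gpu_core_clocks for e in events)
    same_state = len({e.pipeline_state_id for e in events}) == 1
    same_targets = len({e.render_targets for e in events}) == 1
    return total_cycles >= MIN_TOTAL_CYCLES and same_state and same_targets


targets = ("rt0", "depth0", "stencil0")
good = [Event(12_000, 1, targets), Event(9_000, 1, targets)]
mixed = [Event(12_000, 1, targets), Event(9_000, 2, targets)]
print(selection_is_analyzable(good))   # True
print(selection_is_analyzable(mixed))  # False
```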

How Graphics Frame Analyzer Identifies Bottlenecks Using Hardware Metrics


Once the selection is made, Intel® GPA Graphics Frame Analyzer plays back the frame on your GPU, collects
performance data, and highlights the graphics architectural blocks that have bottlenecks.
A red hotspot icon next to the hardware block name means that this part of the GPU pipeline is the primary
bottleneck. An orange icon means that the node is not a primary bottleneck, but does offer performance
optimization opportunities.

Each of the metrics blocks in the Intel® GPA Graphics Frame Analyzer Metrics pane is mapped to the
graphics processing unit workflows. Intel® Processor Graphics performs deeply pipelined parallel execution of
the front-end work and the back-end work within a single event. The front-end work includes geometry
transformation, rasterization, early depth/stencil, and so on. The back-end work includes pixel shading,
sampling, color write, blend, and late depth/stencil. Due to the deeply pipelined execution, hotspots from
downstream architectural blocks bubble up and stall upstream blocks. This can make it difficult to find the
actual hotspot. To find the primary hotspot using the metrics, Intel® GPA walks the pipeline in reverse order.
Intel® GPA follows two separate workflows: one for 3D workloads and one for general-purpose computing on
graphics processing units (GPGPU).
Workflow for 3D workloads:


Workflow for compute workloads:


Green nodes within the flowcharts represent potential bottlenecks within the GPU. At each node, Intel® GPA
checks whether the bottleneck is primary. If yes, the bottleneck for the particular selection is found. If no,
Intel® GPA continues to the next node in the flowchart. Blue nodes branch the decision path, and grey nodes
represent terminal hotspots.
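
The idea of walking the pipeline in reverse order can be sketched as follows. This is a simplified illustration, not the actual Intel® GPA flowchart: the block list and its order are assumptions taken from the front-end and back-end stages named in the text, and the per-block flags are hypothetical inputs.

```python
# Assumed front-to-back order, based on the stages named in the text.
PIPELINE_ORDER = [
    "geometry transformation",
    "rasterization",
    "early depth/stencil",
    "pixel shading",
    "sampling",
    "color write and blend",
    "late depth/stencil",
]


def find_primary_hotspot(looks_stalled):
    """Return the most downstream block flagged as stalled, or None.

    A stall in a downstream block bubbles up and makes upstream blocks
    look stalled too, so the search starts from the end of the pipeline.
    """
    for block in reversed(PIPELINE_ORDER):
        if looks_stalled.get(block, False):
            return block
    return None


# Hypothetical reading: a sampler stall has bubbled up to earlier stages.
flags = {"rasterization": True, "pixel shading": True, "sampling": True}
print(find_primary_hotspot(flags))  # sampling
```

Walking front to back with the same flags would stop at rasterization, which is only a symptom of the stall further down the pipeline.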

NOTE Families of Intel® Xe graphics products starting with Intel® Arc™ Alchemist (formerly DG2) and
newer generations feature GPU architecture terminology that shifts from legacy terms. For more
information on the terminology changes and to understand their mapping with legacy content, see
GPU Architecture Terminology for Intel® Xe Graphics.

For more information about sampler and shader execution hotspots, read the following sections: Optimize
Sampler, Optimize Shader Execution.

See Also
Developer and Optimization Guide for Intel® Processor Graphics Gen11 API
Launching an Application
Profiling Desktop API Frames

Metrics Pane

Optimize Sampler
Sampling is the process of fetching a value from a texture at a given position. You can configure multiple
sampling parameters, such as filtering mode, to balance visual results and sampling performance.
Intel® GPA Graphics Frame Analyzer checks the difference between the percentage of time when a Sampler
Input is available and the percentage of time when a Sampler Output is ready.
Metric Name | Description
GPU / Sampler : Slice <N> Subslice<M> Sampler Input Available | Percentage of time there is input from the EUs on slice ‘N’ and subslice ‘M’ to the sampler.
GPU / Sampler : Slice <N> Subslice<M> Sampler Output Ready | Percentage of time there is output from the sampler to EUs on slice ‘N’ and subslice ‘M’.

NOTE Families of Intel® Xe graphics products starting with Intel® Arc™ Alchemist (formerly DG2) and
newer generations feature GPU architecture terminology that shifts from legacy terms. For more
information on the terminology changes and to understand their mapping with legacy content, see
GPU Architecture Terminology for Intel® Xe Graphics.

When Input Available is >10 percent greater than Output Ready for a subslice of a given slice, the sampler is
not returning data back to the EUs as fast as it is being requested. The sampler is probably the hotspot. This
comparison only indicates a primary hotspot when the samplers are relatively busy, which means that both
EU Occupancy and EU Stall are relatively high.
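
As a rough illustration of this rule, the sketch below compares the two sampler metrics for one subslice. It assumes that ">10 percent greater" means a gap of more than 10 percentage points, and it uses a hypothetical 50 percent threshold for "relatively high" EU Occupancy and EU Stall, because the text does not give a number. The function name is illustrative only.

```python
def sampler_is_suspect(input_available_pct, output_ready_pct,
                       eu_occupancy_pct, eu_stall_pct,
                       busy_threshold_pct=50.0):
    """Apply the Input Available vs. Output Ready comparison for one subslice.

    busy_threshold_pct is an assumed stand-in for "relatively high".
    """
    gap = input_available_pct - output_ready_pct
    samplers_busy = (eu_occupancy_pct >= busy_threshold_pct
                     and eu_stall_pct >= busy_threshold_pct)
    return gap > 10.0 and samplers_busy


print(sampler_is_suspect(85.0, 60.0, 90.0, 70.0))  # True
print(sampler_is_suspect(85.0, 80.0, 90.0, 70.0))  # False: gap is only 5 points
print(sampler_is_suspect(85.0, 60.0, 30.0, 10.0))  # False: EUs are not busy
```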
1. Optimize Sampler Bottleneck with Graphics Frame Analyzer
• Reduce Texture Size
• Change Filter Parameters in Pixel Shader

Ingredients
To optimize a Sampler bottleneck, you need the following:
• Application: Unreal Engine 4* Sun Temple sample, DirectX SDK* CascadedShadowMaps11 sample
• Tool: Intel® GPA Graphics Frame Analyzer

NOTE
To download a free copy of the Intel® Graphics Performance Analyzers toolkit, visit the Intel® GPA
product page.

• Operating System: Windows* 10


• GPU: Intel® Processor Graphics Gen9 and higher
• API: DirectX* 11

Optimize Sampler Bottleneck with Graphics Frame Analyzer


There can be multiple reasons for the sampler to be a hotspot. To speed up the sampler, you can try the
following:
• Reduce the texture size.
• Change a filtering mode.
• Choose a texture format with a smaller amount of data for a pixel or an uncompressed texture format, if
possible. In some cases, the uncompressed format may cause a new bottleneck for larger textures.
• Reduce the number of surfaces on the screen where the texture is applied.
• Adjust the sampling access pattern to make an access to the texture more linear.


With Intel® GPA Graphics Frame Analyzer you can optimize the Sampler bottleneck with real-time
experiments, such as changing texture size and filter parameters in a pixel shader.

Reduce Texture Size


To reduce the texture size, do the following:
1. Open the event with the discovered Sampler bottleneck in the Graphics Frame Analyzer Resource
Viewer by selecting this event on the Main bar chart.

2. Click the Show All Resources button, and then click the Textures tab to open the list of sampled
textures.


3. Reduce the size of one or more large textures. For example, the marble texture size is 1024x1024
pixels. Select a smaller size, for example 256x256, and then click the button.


4. Compare the original and the resulting textures (figures: Original, Result, Difference).

The textures before and after changing the size look quite similar, but the Sampler metric in the 3D Pipeline
tab is now green. The execution time is improved by 18% for selection segments and by 4% overall.
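
To see why the smaller texture relieves the sampler, it helps to compare how much data each size represents. The sketch below is a back-of-the-envelope calculation that assumes an uncompressed format with 4 bytes per texel and ignores mip levels. The actual format of the marble texture is not stated in this recipe.

```python
def texel_count(width, height):
    return width * height


def base_level_bytes(width, height, bytes_per_texel=4):
    """Size of the top mip level; 4 bytes per texel is an assumption."""
    return texel_count(width, height) * bytes_per_texel


original = (1024, 1024)
reduced = (256, 256)
print(texel_count(*original) // texel_count(*reduced))  # 16
print(base_level_bytes(*original))                      # 4194304
print(base_level_bytes(*reduced))                       # 262144
```

Reducing each dimension by a factor of 4 leaves 16 times fewer texels to fetch.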


Change Filter Parameters in Pixel Shader


Percentage-Closer Filtering (PCF) often affects graphics application performance, so the experiment
described here uses PCF as an example of changing filter parameters to optimize the Sampler bottleneck.
Percentage-Closer Filtering can be used to render antialiased shadows and soft shadows. For more
information on PCF, see https://docs.microsoft.com/en-us/windows/win32/dxtecharts/cascaded-shadow-maps.
To change filter parameters, do the following:
1. Open the event with the discovered Sampler bottleneck in the Graphics Frame Analyzer Resource
Viewer by selecting this event on the Main bar chart.


The pink segment contains the texture and shadow rendering. Shadow properties are set in the pixel
shader.
2. Select the Shader resource in the Resource List, and then choose the Pixel shader type. The pixel
shader contains the CalculatePCFPercentLit method with m1 and m2 values, which represent the
iteration range in the filter loop.
m1 and m2 formulas:
m1 = m_iPCFBlurSize / -2
m2 = m_iPCFBlurSize / 2 + 1,


where m_iPCFBlurSize is the kernel size. The initial kernel size is 9, m1 = -4, and m2 = 5.

3. Reduce the kernel size to 3, set m1 to -1 and m2 to 2.


The metrics values are improved, but the Sampler is still a bottleneck.
4. Check the extreme condition by setting the kernel size to 1, m1 to 0, and m2 to 1.


The Sampler is underlined green now. The execution time is improved by 8% overall and by 89% for the
selection segment.
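
The m1 and m2 values used in the steps above follow directly from the formulas, and they determine how many shadow map samples the filter takes. The sketch below reproduces the three kernel sizes. It assumes that the filter iterates from m1 up to, but not including, m2 on both axes of a square kernel, which is an assumption about the sample's loop structure. Integer division is written to truncate toward zero, as HLSL does.

```python
def pcf_loop_bounds(blur_size):
    """Return (m1, m2) for a given kernel size.

    int() truncates toward zero like HLSL integer division; Python's //
    would floor the negative quotient and give a different m1.
    """
    m1 = int(blur_size / -2)
    m2 = blur_size // 2 + 1
    return m1, m2


def pcf_sample_count(blur_size):
    """Samples taken by an assumed square loop over [m1, m2) on each axis."""
    m1, m2 = pcf_loop_bounds(blur_size)
    return (m2 - m1) ** 2


for size in (9, 3, 1):
    print(size, pcf_loop_bounds(size), pcf_sample_count(size))
# 9 (-4, 5) 81
# 3 (-1, 2) 9
# 1 (0, 1) 1
```

Under this assumption, going from a kernel size of 9 to 1 cuts the number of texture samples per pixel from 81 to 1, which fits the large improvement reported for the selected segment.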
Compare the original and the resulting textures (figures: Original, Result, Difference).

See Also
How to find a bottleneck with Graphics Frame Analyzer
Intel® Processor Graphics developer documents
DXGI_FORMAT enumeration (dxgiformat.h)
Cascaded Shadow Maps

Optimize Shader Execution


A shader is a program that handles programmable graphics pipeline stages or performs general-purpose
computations on a GPU. Shaders are executed on the execution units (EUs) of the GEN architecture. In each
EU, the primary computation units are a pair of SIMD (Single Instruction, Multiple Data) floating-point units
(FPUs). FPU0 processes floating-point and integer operations. FPU1 performs floating-point operations and
extended math instructions, so it is also referred to as the Extended Math (EM) unit.
To detect that Shader Execution is a bottleneck, Intel® GPA Graphics Frame Analyzer checks whether the
load on an FPU pipe is more than 90 percent. Usually, the Shader Execution bottleneck is caused by pixel and
compute shaders that perform complex computations and are executed many times.
Metric Name | Description
EU Array / Pipes: EU FPU0 Pipe Active | Percentage of time the Floating Point Unit (FPU) pipe is actively executing instructions.
EU Array / Pipes: EU FPU1 Pipe Active | Percentage of time the Extended Math (EM) pipe is actively executing instructions.

NOTE Families of Intel® Xe graphics products starting with Intel® Arc™ Alchemist (formerly DG2) and
newer generations feature GPU architecture terminology that shifts from legacy terms. For more
information on the terminology changes and to understand their mapping with legacy content, see
GPU Architecture Terminology for Intel® Xe Graphics.

When EU Array / Pipes: EU FPU0 Pipe Active or EU Array / Pipes: EU FPU1 Pipe Active is above 90 percent,
the primary hotspot may be due to the number of instructions per clock (IPC). If so, adjust the shader
algorithms to remove unnecessary instructions, or use more efficient instructions, to improve IPC. For
IPC-limited pixel shaders, ensure maximum throughput by limiting shader temporary registers to ≤ 16.
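
The 90 percent rule can be written as a one-line check. The sketch below is illustrative only: the function name is hypothetical, and the inputs are the two Pipe Active percentages read from the Metrics pane.

```python
FPU_ACTIVE_THRESHOLD_PCT = 90.0  # threshold stated in the text


def shader_execution_is_suspect(fpu0_active_pct, fpu1_active_pct):
    """True when either FPU pipe is active more than 90 percent of the time."""
    return max(fpu0_active_pct, fpu1_active_pct) > FPU_ACTIVE_THRESHOLD_PCT


print(shader_execution_is_suspect(94.2, 41.0))  # True
print(shader_execution_is_suspect(72.5, 63.0))  # False
```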

NOTE The recipe describes how to optimize the Shader Execution bottleneck on a particular sample.
Some of these optimizations can be applied to real-world graphics applications.

Ingredients
To optimize the Shader Execution bottleneck, you need the following:
• Application: Microsoft D3D12Multithreading sample: https://github.com/microsoft/DirectX-Graphics-Samples/tree/master/Samples/Desktop/D3D12Multithreading
• Tool: Intel® GPA Graphics Frame Analyzer

NOTE
To download a free copy of the Intel® Graphics Performance Analyzers toolkit, visit the Intel® GPA
product page.

• Operating System: Windows* 10


• GPU: Intel® Processor Graphics Gen9 and higher
• API: DirectX* 11/12

Define Code Portions to Optimize


To find potential hotspots in the shader, do the following:
1. Open the event with the discovered Shader Execution bottleneck in the Graphics Frame Analyzer
Resource Viewer by selecting this event on the Main bar chart.


2. Select Shader in the Resource List to open the shader source.


3. Analyze the shader source to understand the algorithm and find potential places for optimization.
The pixel shader invokes CalcLightingColor for each light (NUM_LIGHTS=3). For the first light, it also
computes the shadow with the CalcUnshadowedAmountPCF2x2 function. CalcLightingColor is called three
times, while the other functions are called only once per shader invocation, so CalcLightingColor is
potentially the primary place for optimization.

float4 CalcLightingColor(float3 vLightPos, float3 vLightDir, float4 vLightColor, float4 vFalloffs,
    float3 vPosWorld, float3 vPerPixelNormal)
{
    float3 vLightToPixelUnNormalized = vPosWorld - vLightPos;
    // Dist falloff = 0 at vFalloffs.x, 1 at vFalloffs.x - vFalloffs.y
    float fDist = length(vLightToPixelUnNormalized);
    float fDistFalloff = saturate((vFalloffs.x - fDist) / vFalloffs.y);
    // Normalize from here on.
    float3 vLightToPixelNormalized = vLightToPixelUnNormalized / fDist;
    // Angle falloff = 0 at vFalloffs.z, 1 at vFalloffs.z - vFalloffs.w
    float fCosAngle = dot(vLightToPixelNormalized, vLightDir / length(vLightDir));
    float fAngleFalloff = saturate((fCosAngle - vFalloffs.z) / vFalloffs.w);
    // Diffuse contribution.
    float fNDotL = saturate(-dot(vLightToPixelNormalized, vPerPixelNormal));
    return vLightColor * fNDotL * fDistFalloff * fAngleFalloff;
}

float4 CalcUnshadowedAmountPCF2x2(int lightIndex, float4 vPosWorld)
{
    // Compute pixel position in light space.
    float4 vLightSpacePos = vPosWorld;
    vLightSpacePos = mul(vLightSpacePos, lights[lightIndex].view);
    vLightSpacePos = mul(vLightSpacePos, lights[lightIndex].projection);
    vLightSpacePos.xyz /= vLightSpacePos.w;
    // Translate from homogeneous coords to texture coords.
    float2 vShadowTexCoord = 0.5f * vLightSpacePos.xy + 0.5f;
    vShadowTexCoord.y = 1.0f - vShadowTexCoord.y;
    // Depth bias to avoid pixel self-shadowing.
    float vLightSpaceDepth = vLightSpacePos.z - SHADOW_DEPTH_BIAS;
    // Find sub-pixel weights.
    float2 vShadowMapDims = float2(1280.0f, 720.0f); // need to keep in sync with .cpp file
    float4 vSubPixelCoords = float4(1.0f, 1.0f, 1.0f, 1.0f);
    vSubPixelCoords.xy = frac(vShadowMapDims * vShadowTexCoord);
    vSubPixelCoords.zw = 1.0f - vSubPixelCoords.xy;
    float4 vBilinearWeights = vSubPixelCoords.zxzx * vSubPixelCoords.wwyy;
    // 2x2 percentage closer filtering.
    float2 vTexelUnits = 1.0f / vShadowMapDims;
    float4 vShadowDepths;
    vShadowDepths.x = shadowMap.Sample(sampleClamp, vShadowTexCoord);
    vShadowDepths.y = shadowMap.Sample(sampleClamp, vShadowTexCoord + float2(vTexelUnits.x, 0.0f));
    vShadowDepths.z = shadowMap.Sample(sampleClamp, vShadowTexCoord + float2(0.0f, vTexelUnits.y));
    vShadowDepths.w = shadowMap.Sample(sampleClamp, vShadowTexCoord + vTexelUnits);
    // What weighted fraction of the 4 samples are nearer to the light than this pixel?
    float4 vShadowTests = (vShadowDepths >= vLightSpaceDepth) ? 1.0f : 0.0f;
    return dot(vBilinearWeights, vShadowTests);
}

float4 PSMain(PSInput input) : SV_TARGET
{
    float4 diffuseColor = diffuseMap.Sample(sampleWrap, input.uv);
    float3 pixelNormal = CalcPerPixelNormal(input.uv, input.normal, input.tangent);
    float4 totalLight = ambientColor;
    for (int i = 0; i < NUM_LIGHTS; i++)
    {
        float4 lightPass = CalcLightingColor(lights[i].position, lights[i].direction,
            lights[i].color, lights[i].falloff, input.worldpos.xyz, pixelNormal);
        if (sampleShadowMap && i == 0)
        {
            lightPass *= CalcUnshadowedAmountPCF2x2(i, input.worldpos);
        }
        totalLight += lightPass;
    }
    return diffuseColor * saturate(totalLight);
}
4. Select the ISA type in the Shader Code drop-down list to analyze the GEN Assembly.

In the GEN Assembly, you can find the instructions, which are generated by Intel® Graphics Compiler. The
example also contains a lot of complex math operations, such as square roots, inverse square roots and
inversions:
• math.sqt – 3 instructions
• math.rsqt – 9 instructions
• math.inv – 7 instructions
There are also many arithmetic operations: 83 multiplications and 80 fused multiply-add instructions. To
optimize the Shader Execution bottleneck, you can try to remove unnecessary computations from the shader
code or simplify the shader.

NOTE If a shader has loops or branches, the number of executed instructions may not correspond to
the instruction count in the assembly code.
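
Instruction counts like the ones above can be tallied from a text listing. The sketch below counts the first token of each line in a small, simplified listing. The listing is hypothetical and does not follow real GEN Assembly syntax, where opcodes carry execution sizes, predicates, and register regions.

```python
from collections import Counter


def count_mnemonics(listing):
    """Count the leading mnemonic on each non-empty, non-comment line."""
    counts = Counter()
    for line in listing.splitlines():
        tokens = line.split()
        if tokens and not tokens[0].startswith("//"):
            counts[tokens[0]] += 1
    return counts


# Simplified, made-up listing used only to exercise the counter.
LISTING = """
// hypothetical excerpt
mul r1 r2 r3
mad r4 r1 r2 r3
math.rsqt r5 r4
math.inv r6 r5
mul r7 r6 r1
"""

counts = count_mnemonics(LISTING)
print(counts["mul"], counts["mad"], counts["math.rsqt"], counts["math.inv"])  # 2 1 1 1
```

As the note above points out, this is a static count: with loops or branches, the number of instructions actually executed can differ.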

Perform Optimization
Once you have determined the areas for optimization, try the following changes to fix the Shader
Execution bottleneck:
1. Eliminate the constant condition to remove the flow control.


Consider the following condition inside the loop:

if (sampleShadowMap && i == 0)
{
    lightPass *= CalcUnshadowedAmountPCF2x2(i, input.worldpos);
}
sampleShadowMap is a constant from SceneConstantBuffer. To view the constant buffer content, open
the Shader Resource list by clicking the button.

2. Click the resource binding description to open the buffer in the Resource Viewer.


Since sampleShadowMap equals 1, the condition sampleShadowMap && i == 0 is equivalent to i == 0. This
modification makes the shader linear, without flow control instructions, and the number of assembly lines
drops from 280 to 263. The only basic block (a linear sequence of instructions) contains 262
instructions. The change affects performance insignificantly, but simplifies further shader analysis.


3. Reduce the number of complex math and floating point instructions.


Though there are no explicit square roots in the shader source, these instructions are produced by the
normalize() and length() HLSL intrinsics:

vVertNormal = normalize(vVertNormal);
vVertTangent = normalize(vVertTangent);
float3 vVertBinormal = normalize(cross(vVertTangent, vVertNormal));
float fCosAngle = dot(vLightToPixelNormalized, vLightDir / length(vLightDir));
You can find a vertex buffer in the Graphics Frame Analyzer in the same way as the constant buffer
content, using the Shader Resource panel. If you look at the vertex buffer content, you can see that
the Normal and Tangent vertex attributes are already normalized on the CPU, so the corresponding
normalize() calls can be removed from the shader code.

You can also remove divisions by vFalloffs.y and vFalloffs.w:

float fDistFalloff = saturate((vFalloffs.x - fDist) / vFalloffs.y);
float fAngleFalloff = saturate((fCosAngle - vFalloffs.z) / vFalloffs.w);
These divisions are redundant, because their divisors are equal to 1. Generally, such divisions can be
replaced with multiplication by the inverse parameter.

Light intensity depends on the distance passed from a light position to a pixel. Light attenuation starts
from distance 800, the vFalloffs.x parameter. The bounding box size in the sample shader is much
smaller than the distance that causes light attenuation.

You can remove the distance attenuation without affecting rendering, because the
saturate(vFalloffs.x - fDist) expression always equals 1. After the distance attenuation is
removed, the shader contains only one inverse instruction and three inverse square root instructions.
The shader performs two consecutive vector-matrix multiplications:

vLightSpacePos = mul(vLightSpacePos, lights[lightIndex].view);
vLightSpacePos = mul(vLightSpacePos, lights[lightIndex].projection);
Instead of these subsequent multiplications by view and projection matrices, you can precompute the
view-projection matrix on the CPU.
As a result, the total number of instructions reduces from 262 to 189.
4. Remove redundant function calls.
Light 0 and light 2 in SceneConstantBuffer have the same parameters, so the shader performs
the same operations twice.


Deleting the one redundant CalcLightingColor call reduces the number of instructions from 189 to
166.
Together, the optimizations reduce the number of instructions by a factor of 1.6 compared to the original
shader.
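
Two of the rewrites from step 3 can be sanity-checked numerically on the CPU. The sketch below is a minimal illustration in plain Python: it checks that multiplying a row vector by the view matrix and then the projection matrix gives the same result as multiplying once by a precomputed view-projection matrix, and that the distance falloff stays at 1 for distances well below vFalloffs.x = 800 with a divisor of 1. The matrices and the test vector are made-up values, not data from the sample.

```python
def mat_mul(a, b):
    """4x4 matrix product a * b."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def vec_mul(v, m):
    """Row vector times matrix, matching HLSL mul(vector, matrix)."""
    return [sum(v[k] * m[k][j] for k in range(4)) for j in range(4)]


def saturate(x):
    """Clamp to [0, 1], like the HLSL intrinsic."""
    return max(0.0, min(1.0, x))


def distance_falloff(dist, falloff_x=800.0, falloff_y=1.0):
    return saturate((falloff_x - dist) / falloff_y)


# Hypothetical matrices (row-vector convention) and position.
view = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [2, 3, 4, 1]]
projection = [[2, 0, 0, 0], [0, 2, 0, 0], [0, 0, 1, 1], [0, 0, -1, 0]]
position = [1, 2, 3, 1]

two_steps = vec_mul(vec_mul(position, view), projection)
view_projection = mat_mul(view, projection)   # computed once on the CPU
one_step = vec_mul(position, view_projection)

print(two_steps)                 # [6, 10, 6, 7]
print(one_step == two_steps)     # True
print(distance_falloff(100.0))   # 1.0
```

With real floating-point matrices, the two results can differ in the last bits because the order of operations changes, which is why a visual difference check is still worthwhile after this kind of rewrite.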

Verify Optimizations
To check that shader optimizations do not affect rendering, compare the original and modified render targets
using the Diff Visualization mode in the Graphics Frame Analyzer:

There are no visual changes. A small difference becomes visible only when you scale the Color Histogram,
because the order of operations in the code changed slightly.

The optimizations reduce the draw call duration from 549 to 367 µs, which gives a 1.5x
performance gain. The primary bottleneck in the sample frame moves to the L3 cache.
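
The reported factors can be reproduced from the numbers in this recipe. The short calculation below assumes that the 1.6x instruction reduction compares the 262 instructions in the single basic block with the final 166, which the text does not state explicitly.

```python
def improvement_factor(before, after):
    """How many times smaller the 'after' value is than the 'before' value."""
    return before / after


print(round(improvement_factor(262, 166), 1))  # 1.6 (instruction count)
print(round(improvement_factor(549, 367), 1))  # 1.5 (draw call duration, µs)
```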

See Also
How to find a bottleneck with Graphics Frame Analyzer
Intel® Processor Graphics developer guides
Gen9 Compute Architecture
Introduction to GEN Assembly

Notices and Disclaimers


Intel technologies may require enabled hardware, software or service activation.
No product or component can be absolutely secure.
Your costs and results may vary.
© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its
subsidiaries. Other names and brands may be claimed as the property of others.
No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this
document.


The products described may contain design defects or errors known as errata which may cause the product
to deviate from published specifications. Current characterized errata are available on request.

Intel disclaims all express and implied warranties, including without limitation, the implied warranties of
merchantability, fitness for a particular purpose, and non-infringement, as well as any warranty arising from
course of performance, course of dealing, or usage in trade.
