Intel Architecture Code Analyzer, IACA-Guide
User's Guide
Revision: 2.0.1
Legal Information
INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY
ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN
INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL
DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR
WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT,
COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.
A "Mission Critical Application" is any application in which failure of the Intel Product could result, directly or indirectly, in personal
injury or death. SHOULD YOU PURCHASE OR USE INTEL'S PRODUCTS FOR ANY SUCH MISSION CRITICAL APPLICATION, YOU SHALL
INDEMNIFY AND HOLD INTEL AND ITS SUBSIDIARIES, SUBCONTRACTORS AND AFFILIATES, AND THE DIRECTORS, OFFICERS, AND
EMPLOYEES OF EACH, HARMLESS AGAINST ALL CLAIMS COSTS, DAMAGES, AND EXPENSES AND REASONABLE ATTORNEYS' FEES
ARISING OUT OF, DIRECTLY OR INDIRECTLY, ANY CLAIM OF PRODUCT LIABILITY, PERSONAL INJURY, OR DEATH ARISING IN ANY WAY
OUT OF SUCH MISSION CRITICAL APPLICATION, WHETHER OR NOT INTEL OR ITS SUBCONTRACTOR WAS NEGLIGENT IN THE DESIGN,
MANUFACTURE, OR WARNING OF THE INTEL PRODUCT OR ANY OF ITS PARTS.
Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence
or characteristics of any features or instructions marked "reserved" or "undefined". Intel reserves these for future definition and shall
have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject
to change without notice. Do not finalize a design with this information.
The products described in this document may contain design defects or errors known as errata which may cause the product to deviate
from published specifications. Current characterized errata are available on request.
Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order.
Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by
calling 1-800-548-4725, or go to: http://www.intel.com/design/literature.htm
This document contains information on products in the design phase of development.
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors.
Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software,
operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and
performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when
combined with other products.
BlueMoon, BunnyPeople, Celeron, Celeron Inside, Centrino, Centrino Inside, Cilk, Core Inside, E-GOLD, Flexpipe, i960, Intel, the Intel
logo, Intel AppUp, Intel Atom, Intel Atom Inside, Intel Core, Intel Inside, Intel Insider, the Intel Inside logo, Intel NetBurst, Intel
NetMerge, Intel NetStructure, Intel SingleDriver, Intel SpeedStep, Intel Sponsors of Tomorrow., the Intel Sponsors of Tomorrow. logo,
Intel StrataFlash, Intel vPro, Intel XScale, InTru, the InTru logo, the InTru Inside logo, InTru soundmark, Itanium, Itanium Inside, MCS,
MMX, Moblin, Pentium, Pentium Inside, Puma, skoool, the skoool logo, SMARTi, Sound Mark, Stay With It, The Creators Project, The
Journey Inside, Thunderbolt, Ultrabook, vPro Inside, VTune, Xeon, Xeon Inside, X-GOLD, XMM, X-PMU and XPOSYS are trademarks of
Intel Corporation in the U.S. and/or other countries.
* Other names and brands may be claimed as the property of others.
Contents
1 Introduction
1.1 Intel® Architecture Code Analyzer Accuracy
1.2 Processor Support
1.3 Platform Support
2 Analysis
2.1 Throughput Analysis
2.2 Latency Analysis
2.3 Graphs
2.4 Analysis Report Notes
2.4.1 Unbound Instructions
2.4.2 Combining 256-bit Intel® AVX and Legacy Intel® SSE
2.4.3 Unsupported Instructions
4 Examples
4.1 Throughput Analysis – 4x4 Matrix Multiply
4.1.1 Initial Code Version
4.1.2 Optimization
4.2 Latency and Graph Analysis – Add Reduction
4.2.1 Initial Code Version
4.2.2 Optimization 1
4.2.3 Optimization 2
5 Release Contents
5.1 Windows* OS
5.2 Linux* OS
5.3 Mac OS X*
1 Introduction
Intel® Architecture Code Analyzer helps you statically analyze the data dependency,
latency, and throughput of instruction sequences (kernels) on Intel® microarchitectures.
The performance data reported by the tool may significantly deviate from actual
performance observed on an Intel® processor. You can achieve the most accurate
throughput and latency measurements by executing the analyzed code on the processor
itself. The Intel® Architecture Code Analyzer complements such measured data with
information on port binding, bottlenecks, and critical paths.
NOTE: Intel® Architecture Code Analyzer has been validated on 64-bit Windows* 7, 64-bit
Ubuntu* 10.04, and Mac OS X* 10.6 and 10.7. It should work on other versions of
Windows*, Linux*, and Mac OS X* operating systems.
2 Analysis
Intel® Architecture Code Analyzer performs two different types of analysis: Throughput
and Latency.
The detailed section of the throughput analysis report contains one line for each instruction
in the analyzed block. Each line contains:
• Number of the instruction micro-ops.
• Average number of cycles per iteration that the instruction was bound to each
processor port. For most instructions this simply means the number of cycles the
instruction was bound to each port. However, if a particular micro-op may execute
on more than one port, the average number of cycles per iteration may be a partial
cycle for each port because that micro-op may bind to a different port on each
iteration.
• An indication of whether the instruction is on the critical path of the analyzed code.
The critical path for Throughput Analysis consists of all instructions that use the
throughput bottleneck.
• Instruction disassembly in Intel® Software Developer’s Manual (MASM) style
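The fractional per-port figures described in the list above can be illustrated with a small model. This is a minimal sketch and not the tool's algorithm: the function name `average_port_cycles` and the alternating binding pattern are assumptions made for illustration only.

```python
from collections import defaultdict
from fractions import Fraction


def average_port_cycles(bindings_per_iteration):
    """Average, per port, the cycles one micro-op spent bound to it.

    bindings_per_iteration holds one port name per iteration: the port
    the micro-op was bound to (for one cycle) in that iteration.
    """
    totals = defaultdict(int)
    for port in bindings_per_iteration:
        totals[port] += 1
    iterations = len(bindings_per_iteration)
    return {port: Fraction(count, iterations) for port, count in totals.items()}


# Hypothetical micro-op that alternates between ports 0 and 1.
alternating = ["0", "1", "0", "1"]
averages = average_port_cycles(alternating)
print({port: float(avg) for port, avg in averages.items()})  # {'0': 0.5, '1': 0.5}
```

A micro-op that always binds to the same port would instead show a whole number of cycles in that port's column.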
Some ports have both a regular pipe and a secondary pipe. The two pipes of such a port are
separated by a hyphen and look like two separate ports in the detailed report. Specifically:
• Port 0 has the Divider pipe split from it. In the first cycle they are both busy, then
port 0 is available for the next micro-op and the Divider pipe is kept busy for the
duration of the divide operation.
• Load ports 2 and 3 have an Address Generation Unit (AGU) split from them. For
256-bit load operations that keep the port busy for two cycles, the AGU gets freed
after the first cycle and can process a store address generation if such a micro-op is
available for execution.
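The split between port 0 and its Divider pipe can be pictured as two occupancy lists. The sketch below only restates the rule given above; the function name `divide_occupancy` and the 10-cycle duration are assumptions, not values taken from the tool.

```python
def divide_occupancy(divide_cycles):
    """Return (port0_busy, divider_busy) cycle lists for one divide micro-op.

    Cycle 0 is the cycle in which the micro-op is dispatched.
    """
    if divide_cycles < 1:
        raise ValueError("divide_cycles must be at least 1")
    port0_busy = [0]                           # port 0 is held for the first cycle only
    divider_busy = list(range(divide_cycles))  # the Divider pipe is held for the whole divide
    return port0_busy, divider_busy


# 10 cycles is a made-up duration used only for illustration.
port0, divider = divide_occupancy(divide_cycles=10)
print(port0)    # [0]
print(divider)  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```

From cycle 1 onward port 0 can accept another micro-op, while a second divide would have to wait for the Divider pipe.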
Following is an example Throughput Analysis report:
N - port number or number of cycles resource conflict caused delay, DV - Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3), CP - on a critical path
F - Macro Fusion with the previous instruction occurred
* - instruction micro-ops not bound to a port
^ - Micro Fusion happened
# - ESP Tracking sync uop was issued
@ - SSE instruction followed an AVX256 instruction, dozens of cycles penalty is expected
! - instruction not supported, was not accounted in Analysis
N - port number or number of cycles resource conflict caused delay, DV - Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3), CP - on a critical path
F - Macro Fusion with the previous instruction occurred
* - instruction micro-ops not bound to a port
^ - Micro Fusion happened
# - ESP Tracking sync uop was issued
@ - Intel(R) AVX to Intel(R) SSE code switch, dozens of cycles penalty is expected
! - instruction not supported, was not accounted in Analysis
The Resource delay is counted since all the sources of the instructions are ready
and until the needed resource becomes available
2.3 Graphs
Use the -graph option to set Intel® Architecture Code Analyzer to output the data
dependency graph.
TIP: Graph files produced by Intel® Architecture Code Analyzer can be opened with Graphviz.
The data dependency graph may differ between the two analysis types. Throughput
analysis treats the analyzed code block as the body of an infinite loop, so its graph
can contain inter-iteration dependencies. Red nodes in the graph indicate instructions
that are on the critical path for that particular analysis.
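The difference between the two graphs can be sketched with a small model of read-after-write dependencies. This is an illustration only: the two-instruction block, the function names, and the DOT layout are assumptions and do not reproduce the files the tool writes.

```python
def dependency_edges(instructions, loop=False):
    """Return (producer, consumer, carried) read-after-write edges.

    instructions is a list of (dest, sources) tuples over register names.
    With loop=True the block is treated as repeating, so a read with no
    earlier producer in the block is linked to the last writer of that
    register in the previous iteration (carried=True).
    """
    final_writer = {dest: i for i, (dest, _) in enumerate(instructions)}
    writer = {}
    edges = []
    for i, (dest, sources) in enumerate(instructions):
        for src in dict.fromkeys(sources):  # drop repeated sources, keep order
            if src in writer:
                edges.append((writer[src], i, False))
            elif loop and src in final_writer:
                edges.append((final_writer[src], i, True))
        writer[dest] = i
    return edges


def to_dot(instructions, edges):
    """Render the edges as Graphviz DOT text; carried edges are dashed."""
    lines = ["digraph deps {"]
    for i, (dest, sources) in enumerate(instructions):
        label = f"{i}: {dest} <- {', '.join(sources)}"
        lines.append(f'  n{i} [label="{label}"];')
    for producer, consumer, carried in edges:
        style = " [style=dashed]" if carried else ""
        lines.append(f"  n{producer} -> n{consumer}{style};")
    lines.append("}")
    return "\n".join(lines)


# Hypothetical block: instruction 0 accumulates into xmm0, instruction 1 reads it.
block = [("xmm0", ("xmm0", "xmm1")), ("xmm2", ("xmm0", "xmm3"))]
single_pass = dependency_edges(block)
as_loop = dependency_edges(block, loop=True)
print(to_dot(block, as_loop))
```

In the single-pass view only the edge from instruction 0 to instruction 1 exists. In the loop view instruction 0 also depends on its own result from the previous iteration, which is the kind of extra edge a throughput graph can show.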
For more information on transitions between Intel® AVX and Intel® SSE, see Avoiding
AVX-SSE Transition Penalties.
In such cases, the summary report includes two additional lines at the top indicating that
such instruction(s) exist in your code, and marks the instruction with a ‘!’ character in all
columns.
When analyzing a loop construct, place the macros at the following locations:
while ( condition )
{
IACA_START
<loop body>
}
IACA_END
This placement skips the loop initialization and includes the loop-end branch instruction.
These macros modify the ebx register in IA-32 code. As a result, the compiler saves this
register just before the macro and restores it immediately after the macro. This adds POP
and PUSH instructions at the beginning and end of the analyzed block. By default, Intel®
Architecture Code Analyzer ignores those instructions, as they are not part of the original
code. See section 3.2 for how to force the tool to analyze those instructions.
For the 64-bit version of the Microsoft* Visual C++ compiler, use IACA_VC64_START and
IACA_VC64_END instead.
Once you insert the macros into your code, build your code into an executable file or an
object file.
NOTE: Input files generated with the Intel compiler option –Qipo are not supported.
Available <options>:
-64             64-bit input file (required for 64-bit object files only)
-arch <type>    Architecture type. These are the available types: NHM, WSM, SNB, IVB
COULD NOT FIND START_MARKER             Code did not contain the proper marker(s). See
COULD NOT FIND END_MARKER               section 3.1 for more details.
CAN'T DETERMINE MODE, PLEASE USE        Intel® Architecture Code Analyzer cannot determine the
ONE OF -32/-64 COMMAND LINE OPTIONS     supplied file format (32-bit or 64-bit). Use the -32
                                        or -64 option to specify.
4 Examples
This section provides examples of how to analyze and optimize code using Intel®
Architecture Code Analyzer.
4.1.2 Optimization
The Throughput Analysis Report shows that the total throughput (Block Throughput) is 12
cycles, and port 5 was most pressured (Throughput Bottleneck), with 12 micro-ops
allocated to it.
Examination of the instructions that bind to port 5 in the instruction analysis report shows
that the instructions were broadcasts and vpermilps. The broadcasts can only execute on
port 5, but replacing them with 128-bit loads followed by vinsertf128 instructions
reduces the pressure on port 5 because vinsertf128 can execute on port 0. These
changes reduced the throughput to 10 cycles.
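The reasoning above rests on a simple bound: a port accepts at most one micro-op per cycle, so the busiest port limits how quickly an iteration can complete. The sketch below works through that arithmetic. Only the figure of 12 for port 5 comes from the report discussed above; the values for ports 0 and 1, the number of micro-ops moved, and the function names are placeholders chosen for illustration.

```python
def bottleneck(port_cycles):
    """Return (port, cycles) for the most heavily loaded port.

    port_cycles maps a port name to the cycles per iteration bound to it.
    """
    port = max(port_cycles, key=port_cycles.get)
    return port, port_cycles[port]


def move_uops(port_cycles, src, dst, count):
    """Return a copy with `count` cycles of work moved from src to dst."""
    moved = dict(port_cycles)
    moved[src] -= count
    moved[dst] += count
    return moved


# Port 5 = 12 is the figure quoted above; ports 0 and 1 are placeholders.
before = {"0": 6, "1": 9, "5": 12}
# Hypothetical rewrite: two micro-ops now bind to port 0 instead of port 5.
after = move_uops(before, src="5", dst="0", count=2)

print(bottleneck(before))  # ('5', 12)
print(bottleneck(after))   # ('5', 10)
```

The real distribution across ports has to be read from the tool's report; the point of the model is only that relieving the most pressured port lowers the bound.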
4.2.2 Optimization 1
The analysis report and graph show that all instructions are on the same data dependency
path because they all depend on xmm0. We can optimize this code by constructing an add
tree, which reduces the dependency between instructions. This change reduced the
latency from 7 to 5 cycles.
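The effect of an add tree on the dependency chain can be checked with a small model. This is a sketch under stated assumptions: every add is given a latency of one cycle, the instruction order shown for the tree is one balanced arrangement rather than the guide's listing, and port availability is ignored entirely.

```python
def dependency_depth(instructions, latency=1):
    """Length in cycles of the longest read-after-write chain in a block.

    instructions is a list of (dest, sources) tuples in program order.
    """
    ready = {}  # register -> cycle at which its latest value is available
    depth = 0
    for dest, sources in instructions:
        start = max((ready.get(src, 0) for src in sources), default=0)
        ready[dest] = start + latency
        depth = max(depth, ready[dest])
    return depth


# Serial reduction: every add reads and writes xmm0.
serial = [("xmm0", ("xmm0", f"xmm{i}")) for i in range(1, 8)]

# One possible balanced add tree over the same eight registers.
tree = [
    ("xmm0", ("xmm0", "xmm1")),
    ("xmm2", ("xmm2", "xmm3")),
    ("xmm4", ("xmm4", "xmm5")),
    ("xmm6", ("xmm6", "xmm7")),
    ("xmm0", ("xmm0", "xmm2")),
    ("xmm4", ("xmm4", "xmm6")),
    ("xmm0", ("xmm0", "xmm4")),
]

print(dependency_depth(serial))  # 7
print(dependency_depth(tree))    # 3
```

Under these assumptions the serial chain is seven adds deep, which agrees with the 7 cycles reported for the initial version. The tree's depth of three is only a lower bound: the tool reports 5 cycles because it also accounts for effects this model leaves out, such as the resource conflicts examined in section 4.2.3.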
4.2.3 Optimization 2
The analysis report tells us that instruction 4 (vpaddd xmm6, xmm6, xmm7) was delayed by
instruction 2 due to a resource conflict, and that instruction 4 is on a critical path. Because
instruction 5 depends on instruction 4 and instruction 6 depends on instruction 5, both of
these instructions are also delayed, and these last three add instructions can only be
executed at a rate of one per cycle. The result from instruction 2 (vpaddd xmm0, xmm0,
xmm2) is not needed until instruction 6 (vpaddd xmm0, xmm0, xmm4), so we can resolve
this issue by moving vpaddd xmm0, xmm0, xmm2 lower in the add tree, which enables the
code to fully utilize the resources, reducing the latency to 4 cycles.
5 Release Contents
This section lists the files required for running on Windows*, Linux*, and Mac OS X*
operating systems to analyze IA-32 and Intel® 64 code. Each section also explains which
environmental variables to modify.
5.1 Windows* OS
Add the directory that contains iaca.exe to the PATH environment variable.
Filename                Description
iaca.exe                Intel® Architecture Code Analyzer command-line tool.
iacaLoader.dll          Intel Architecture Code Analyzer shared library.
iacaLogicNHM.dll        Intel Architecture Code Analyzer shared library for each
iacaLogicWSM.dll        of the supported architectures.
iacaLogicSNB.dll
iacaLogicIVB.dll
iacaArchDataNHM.dll     Instruction databases for each of the supported
iacaArchDataWSM.dll     architectures.
iacaArchDataSNB.dll
iacaArchDataIVB.dll
XED2NHM.dll             XED2 shared libraries for each of the supported
XED2WSM.dll             architectures.
XED2SNB.dll
XED2IVB.dll
iacaMarks.h             Header file for the start/end markers.
                        Place this file in another directory.
msvcp100.dll            Microsoft Visual Studio* 2010 runtime redistributable
msvcr100.dll            packages.
5.2 Linux* OS
Add the bin/ directory to the PATH environment variable.
Filename                     Description
bin/iaca                     Intel Architecture Code Analyzer command-line tool
bin/iaca.sh                  Intel Architecture Code Analyzer invocation script
lib/libiacaLoader.so         Intel Architecture Code Analyzer shared objects
lib/libiacaLogicNHM.so       Intel Architecture Code Analyzer shared objects for each
lib/libiacaLogicWSM.so       of the supported architectures
lib/libiacaLogicSNB.so
lib/libiacaLogicIVB.so
lib/libiacaArchDataNHM.so    Instruction databases for each of the supported
lib/libiacaArchDataWSM.so    architectures
lib/libiacaArchDataSNB.so
lib/libiacaArchDataIVB.so
lib/libXED2NHM.so            XED2 shared objects for each of the supported
lib/libXED2WSM.so            architectures
lib/libXED2SNB.so
lib/libXED2IVB.so
include/iacaMarks.h          Header file for the start/end markers
5.3 Mac OS X*
Add the bin/ directory to the PATH environment variable.
Filename                     Description
bin/iaca                     Intel Architecture Code Analyzer command-line tool
bin/iaca.sh                  Intel Architecture Code Analyzer invocation script
lib/libiacaLoader.so         Intel Architecture Code Analyzer shared objects
lib/libiacaLogicNHM.so       Intel Architecture Code Analyzer shared objects for each
lib/libiacaLogicWSM.so       of the supported architectures
lib/libiacaLogicSNB.so
lib/libiacaLogicIVB.so
lib/libiacaArchDataNHM.so    Instruction databases for each of the supported
lib/libiacaArchDataWSM.so    architectures
lib/libiacaArchDataSNB.so
lib/libiacaArchDataIVB.so
lib/libXED2NHM.so            XED2 shared objects for each of the supported
lib/libXED2WSM.so            architectures
lib/libXED2SNB.so
lib/libXED2IVB.so
include/iacaMarks.h          Header file for the start/end markers