Performance Optimization and Tuning Techniques for IBM Power Systems Processors Including IBM POWER8
Peter Bergner
Brian Hall
Julian Wang
Suresh Warrier
Madhusudanan Kandasamy
David Wendt
Tulio Magno
Alex Mericas
Steve Munroe
Mauricio Oliveira
Bill Schmidt
Will Schmidt
Redbooks
SG24-8171-01
Note: Before using this information and the product it supports, read the information in Notices on
page ix.
Contents
Notices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
Trademarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .x
IBM Redbooks promotions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi
Preface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiii
Authors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiii
Now you can become a published author, too! . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xvii
Comments welcome. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xvii
Stay connected to IBM Redbooks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xvii
Summary of changes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xix
August 2015, Second Edition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xix
Chapter 1. Optimization and tuning on IBM POWER8 processor-based systems . . . . 1
1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Outline of this guide . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Conventions that are used in this guide . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.4 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.5 Optimizing performance on POWER8 processor-based systems. . . . . . . . . . . . . . . . . . 6
1.5.1 Lightweight tuning and optimization guidelines. . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.5.2 Deployment guidelines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.5.3 Deep performance optimization guidelines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
Chapter 2. The IBM POWER8 processor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.1 Introduction to the POWER8 processor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.2 Using POWER8 features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.2.1 Multi-core and multi-thread . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.2.2 Multipage size support (page sizes 4 KB, 64 KB, 16 MB, and 16 GB) . . . . . . . . . 32
2.2.3 Efficient use of cache and memory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.2.4 Transactional memory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
2.2.5 Vector Scalar eXtension . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
2.2.6 Decimal floating point . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
2.2.7 In-core cryptography and integrity enhancements . . . . . . . . . . . . . . . . . . . . . . . . 47
2.2.8 On-chip accelerators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
2.2.9 Storage synchronization (sync, lwsync, lwarx, stwcx., and eieio) . . . . . . . . . . . . . 49
2.2.10 Fixed-point load and store quadword instructions . . . . . . . . . . . . . . . . . . . . . . . . 51
2.2.11 Instruction fusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
2.2.12 Event-based branches (or user-level fast interrupts) . . . . . . . . . . . . . . . . . . . . . . 52
2.2.13 Power management and system performance . . . . . . . . . . . . . . . . . . . . . . . . . . 52
2.2.14 Coherent Accelerator Processor Interface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
2.3 I/O adapter affinity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
2.4 Related publications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
Chapter 5. IBM i . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
5.2 Using Power features with IBM i . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
5.2.1 Multi-core and multi-thread . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
5.2.2 Multipage size support on IBM i . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
5.2.3 Vector Scalar eXtension . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
5.2.4 Decimal floating point . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
5.3 IBM i operating system-specific optimizations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
5.3.1 IBM i advanced optimization techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
5.3.2 Performance management on IBM i . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
5.4 Related publications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
Chapter 6. Linux . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
6.2 Using Power features with Linux . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
6.2.1 Multi-core and multi-thread . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
Notices
This information was developed for products and services offered in the U.S.A.
IBM may not offer the products, services, or features discussed in this document in other countries. Consult
your local IBM representative for information on the products and services currently available in your area. Any
reference to an IBM product, program, or service is not intended to state or imply that only that IBM product,
program, or service may be used. Any functionally equivalent product, program, or service that does not
infringe any IBM intellectual property right may be used instead. However, it is the user's responsibility to
evaluate and verify the operation of any non-IBM product, program, or service.
IBM may have patents or pending patent applications covering subject matter described in this document. The
furnishing of this document does not grant you any license to these patents. You can send license inquiries, in
writing, to:
IBM Director of Licensing, IBM Corporation, North Castle Drive, Armonk, NY 10504-1785 U.S.A.
The following paragraph does not apply to the United Kingdom or any other country where such
provisions are inconsistent with local law: INTERNATIONAL BUSINESS MACHINES CORPORATION
PROVIDES THIS PUBLICATION "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR
IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF NON-INFRINGEMENT,
MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Some states do not allow disclaimer of
express or implied warranties in certain transactions, therefore, this statement may not apply to you.
This information could include technical inaccuracies or typographical errors. Changes are periodically made
to the information herein; these changes will be incorporated in new editions of the publication. IBM may make
improvements and/or changes in the product(s) and/or the program(s) described in this publication at any time
without notice.
Any references in this information to non-IBM websites are provided for convenience only and do not in any
manner serve as an endorsement of those websites. The materials at those websites are not part of the
materials for this IBM product and use of those websites is at your own risk.
IBM may use or distribute any of the information you supply in any way it believes appropriate without incurring
any obligation to you.
Any performance data contained herein was determined in a controlled environment. Therefore, the results
obtained in other operating environments may vary significantly. Some measurements may have been made
on development-level systems and there is no guarantee that these measurements will be the same on
generally available systems. Furthermore, some measurements may have been estimated through
extrapolation. Actual results may vary. Users of this document should verify the applicable data for their
specific environment.
Information concerning non-IBM products was obtained from the suppliers of those products, their published
announcements or other publicly available sources. IBM has not tested those products and cannot confirm the
accuracy of performance, compatibility or any other claims related to non-IBM products. Questions on the
capabilities of non-IBM products should be addressed to the suppliers of those products.
This information contains examples of data and reports used in daily business operations. To illustrate them
as completely as possible, the examples include the names of individuals, companies, brands, and products.
All of these names are fictitious and any similarity to the names and addresses used by an actual business
enterprise is entirely coincidental.
COPYRIGHT LICENSE:
This information contains sample application programs in source language, which illustrate programming
techniques on various operating platforms. You may copy, modify, and distribute these sample programs in
any form without payment to IBM, for the purposes of developing, using, marketing or distributing application
programs conforming to the application programming interface for the operating platform for which the sample
programs are written. These examples have not been thoroughly tested under all conditions. IBM, therefore,
cannot guarantee or imply reliability, serviceability, or function of these programs.
Trademarks
IBM, the IBM logo, and ibm.com are trademarks or registered trademarks of International Business Machines
Corporation in the United States, other countries, or both. These and other IBM trademarked terms are
marked on their first occurrence in this information with the appropriate symbol (® or ™), indicating US
registered or common law trademarks owned by IBM at the time this information was published. Such
trademarks may also be registered or common law trademarks in other countries. A current list of IBM
trademarks is available on the Web at http://www.ibm.com/legal/copytrade.shtml
The following terms are trademarks of the International Business Machines Corporation in the United States,
other countries, or both:
Active Memory
AIX
AIX 5L
Blue Gene/L
DB2
FDPR
IBM
IBM Watson
Micro-Partitioning
POWER
Power Architecture
POWER Hypervisor
Power Systems
Power Systems Software
POWER6
POWER6+
POWER7
POWER7+
POWER8
PowerLinux
PowerPC
PowerVM
PowerVP
Rational
Redbooks
Redbooks (logo)
System z
Tivoli
WebSphere
Preface
This IBM Redbooks publication focuses on gathering the correct technical information,
and laying out simple guidance for optimizing code performance on IBM POWER8
processor-based systems that run the IBM AIX, IBM i, or Linux operating systems. There is
straightforward performance optimization that can be performed with a minimum of effort and
without extensive previous experience or in-depth knowledge.
The POWER8 processor contains many new and important performance features, such as
support for eight hardware threads in each core and support for transactional memory. The
POWER8 processor is a strict superset of the IBM POWER7+ processor, and so all of the
performance features of the POWER7+ processor, such as multiple page sizes, also appear
in the POWER8 processor. Much of the technical information and guidance for optimizing
performance on POWER8 processors that is presented in this guide also applies to
POWER7+ and earlier processors, except where the guide explicitly indicates that a feature is
new in the POWER8 processor.
This guide strives to focus on optimizations that tend to be positive across a broad set of
IBM POWER processor chips and systems. Specific guidance is given for the POWER8
processor; however, the general guidance is applicable to the IBM POWER7+,
IBM POWER7, IBM POWER6, IBM POWER5, and even to earlier processors.
This guide is directed at personnel who are responsible for performing migration and
implementation activities on POWER8 processor-based systems. This includes system
administrators, system architects, network administrators, information architects, and
database administrators (DBAs).
Authors
This book was produced by a team of specialists from around the world working at the
International Technical Support Organization, Poughkeepsie Center.
Peter Bergner is the GCC Compiler Team Lead within the
Linux on Power Toolchain department. Since joining IBM in
1996, Peter has worked in various areas, including compiler
optimizer development for the IBM i platform, as a core
member of the teams that ported Linux and GLIBC to 64-bit
POWER, and as a team lead for the IBM Blue Gene/L
compiler and runtime library development team. He obtained a
PhD in Electrical Engineering from the University of Minnesota.
Brian Hall is the lead analyst for performance improvement efforts with the IBM Cloud
Innovation Laboratory team. He works with many IBM software products to capitalize on
the IBM Power Architecture and develop performance preferred practices for software
development and deployment. After joining IBM in 1987, Brian originally worked on the
IBM XL C/C++/Fortran compilers and on the just-in-time compiler for IBM Java on Power.
He has a Bachelor's degree in Computer Science from Queen's University at Kingston and
a Master's degree in Computer Science from the University of Toronto.
Alex Mericas is a member of the IBM Systems and Technology Group in Austin, Texas. He
is a Senior Technical Staff Member and is the Performance Architect for the POWER8
processor. He designed the performance monitoring unit on POWER4, POWER5,
POWER6, POWER7, and IBM PowerPC 970 processor-based systems. Alex is an IBM
Master Inventor with 47 US patent applications and 22 issued patents covering
microprocessor design and hardware performance monitors.
Steve Munroe is a Senior Technical Staff Member at the
Rochester, Minnesota Lab in IBM US. He has 38 years of
experience in the software development field. He holds a
Bachelor's degree in Computer Science from Washington State
University (1974). His areas of expertise include PowerISA,
compilers, POSIX run times, and performance analysis. He has
written extensively about IBM POWER performance and Java
performance.
Mauricio Oliveira is a Staff Software Engineer at the Linux Technology Center at IBM
Brazil. His areas of expertise include Linux performance and Debian and Ubuntu
distributions on IBM Power Systems. He also worked with official benchmark
publications for Linux on IBM Power Systems and early development (bootstrap) of Debian
on Little Endian 64-bit PowerPC. Mauricio holds a Master of Computer Science and
Technology degree and a Bachelor of Engineering degree in Computer Engineering from
Federal University of Itajubá, Brazil.
Comments welcome
Your comments are important to us!
We want our books to be as helpful as possible. Send us your comments about this book or
other IBM Redbooks publications in one of the following ways:
Use the online Contact us review Redbooks form found at:
ibm.com/redbooks
Send your comments in an email to:
redbooks@us.ibm.com
Mail your comments to:
IBM Corporation, International Technical Support Organization
Dept. HYTD Mail Station P099
2455 South Road
Poughkeepsie, NY 12601-5400
Summary of changes
This section describes the technical changes that are made in this edition of the book and in
previous editions. This edition might also include minor corrections and editorial changes that
are not identified.
Summary of Changes for SG24-8171-01 for Performance Optimization and Tuning
Techniques for IBM Power Systems Processors Including IBM POWER8 as created or
updated on August 28, 2015.
Chapter 1. Optimization and tuning on IBM POWER8 processor-based systems
1.1 Introduction
This guide gathers the correct technical information and lays out simple guidance for
optimizing code performance on IBM Power Systems that run the AIX, IBM i, or Linux
operating systems.
This guide focuses on optimizations that tend to be positive across a broad set of IBM
POWER processor chips and systems. Much of the technical information and guidance for
optimizing performance on the POWER8 processor that is presented in this guide also
applies to POWER7+ and earlier processors, except where the guide explicitly indicates that
a feature is new in the POWER8 processor.
Straightforward performance optimization can be performed with a minimum of effort and
without extensive previous experience or in-depth knowledge. This optimization work can
accomplish the following goals:
Substantially improve the performance of the application that is being optimized for the
POWER8 processor (the focus of this guide).
Typically, carry over improvements to systems that are based on related processor chips,
such as the IBM POWER7+, IBM POWER7, and IBM POWER6 processor chips.
Improve performance on other platforms.
The POWER8 processor contains many new and important performance features, such as
support for eight hardware threads in each core and support for transactional memory. The
POWER8 processor is a strict superset of the POWER7+ processor, and so all of the
performance features of the POWER7+ processor, such as multiple page sizes, also appear
in the POWER8 processor.
This guide is directed at personnel who are responsible for performing migration and
implementation activities on POWER8 processor-based systems, including systems
administrators, system architects, network administrators, information architects, program
product developers, software architects, database administrators (DBAs), and compiler
writers.
Section 1.5.2, Deployment guidelines on page 15 describes deployment choices, that is,
system setup and configuration choices, so you can tune these designed-for-performance
IBM Power Systems for your environment. Together with 1.5.1, Lightweight tuning and
optimization guidelines on page 7, these simple optimization strategies and deployment
guidance satisfy the requirements for most environments and can deliver substantial
improvements.
Finally, 1.5.3, Deep performance optimization guidelines on page 21 describes some of the
more advanced investigative techniques that can be used to identify performance bottlenecks
in an application. It is here that optimization efforts move into the application code, and
improvements are typically made by modifying source code. Coverage in this last area is fairly
rudimentary, focusing on general areas of investigation and the tools that you can use.
Most of the remaining material in this guide is technical information that was developed by
domain experts at IBM:
This guide provides hardware information about the POWER8 processor (see Chapter 2,
The IBM POWER8 processor on page 25), highlighting the important features from a
performance perspective and laying out the basic information that is drawn upon by the
material that follows.
This guide describes the system software stack, examining the IBM POWER Hypervisor
(see Chapter 3, The IBM POWER Hypervisor on page 57), the AIX, IBM i, and Linux
operating systems and system libraries (see Chapter 4, IBM AIX on page 71, Chapter 5,
IBM i on page 111, and Chapter 6, Linux on page 117), and the compilers (see
Chapter 7, Compilers and optimization tools for C, C++, and Fortran on page 141). Java
(see Chapter 8, Java on page 173) also receives extensive coverage.
Chapter 4, IBM AIX on page 71 highlights some of the areas in which AIX exposes some
new features of the POWER8 processor. Then, this chapter examines a set of operating
system-specific optimization opportunities. The chapter concludes with a short description
of AIX preferred practices regarding system setup and maintenance.
Chapter 5, IBM i on page 111 describes IBM i support for a number of features in
POWER8 processors (including features that are available in previous generations of
POWER processors). The chapter describes how this operating system can be effective in
automatically capitalizing on many new POWER architecture features without changes to
existing programs. The chapter also provides information about IBM Portable Application
Solutions Environment for i (PASE for i), a part of IBM i that allows some AIX application
binary files to run on IBM i with little or no changes.
Chapter 6, Linux on page 117 describes the primary Linux operating systems that are
used on POWER8 processor-based systems. The chapter covers using features of the
POWER architecture, and operating system-specific optimization opportunities.
Linux is based on community efforts that are focused not only on the Linux kernel, but also
on all of the complementary packages, tools, toolchains, and GNU Compiler Collection
(GCC) compilers that are needed to use POWER8 processor-based systems effectively.
IBM provides the expertise for Power Systems by developing, optimizing, and pushing
open source changes to the Linux communities.
Chapter 7, Compilers and optimization tools for C, C++, and Fortran on page 141
describes current compiler versions and optimization levels and how, for projects with
increased focus on runtime performance, you can take advantage of the more advanced
compiler optimization techniques. It describes XL compiler static analysis and runtime
checking to validate the correctness of the program.
Chapter 8, Java on page 173 describes the optimization and tuning of Java based
applications that are running in a POWER environment.
Finally, this book covers important information about IBM middleware, DB2 (see
Chapter 9, IBM DB2 on page 193) and IBM WebSphere Application Server (see
Chapter 10, IBM WebSphere Application Server on page 205). Various applications use
middleware, and it is critical that the middleware is tuned correctly and performs well. The
middleware chapters cover how these products are optimized for POWER8
processor-based systems, including select preferred practices for tuning and deploying
these products.
The following appendixes are included:
Appendix A, Analyzing malloc usage under IBM AIX on page 211 explains some simple
techniques for analyzing how an application is using the system memory allocation
routines (malloc and related functions in the C library). malloc is often a bottleneck for
application performance, especially under AIX. AIX has an extensive set of optimized
malloc implementations, and it is easy to switch between them without rebuilding or
changing an application. Knowing how an application uses malloc is key to choosing the
best memory allocation alternatives that AIX offers. Even Java applications often make
extensive use of malloc, either in Java Native Interface (JNI) code that is part of the
application itself or in the Java class libraries, or in binary code that is part of the software
development kit (SDK).
Appendix B, Performance tools and empirical performance analysis on page 215
describes some of the important performance tools that are available on the IBM Power
Architecture under AIX or Linux, and strategies for using them in empirical performance
analysis efforts.
These performance tools are most often used as part of the advanced investigative
techniques that are described in 1.5.3, Deep performance optimization guidelines on
page 21, except for the performance advisors, which are intended as investigative tools
that are appropriate for a broader audience of users.
Throughout the book, there are links to related sections among the chapters. For example,
Vector Scalar eXtension (VSX) is described in the processor chapter (Chapter 2, The IBM
POWER8 processor on page 25), all of the OS chapters (Chapter 4, IBM AIX on page 71,
Chapter 5, IBM i on page 111, and Chapter 6, Linux on page 117), and in the compiler
chapter (Chapter 7, Compilers and optimization tools for C, C++, and Fortran on page 141).
Therefore, after the description of VSX in the processor chapter, there are links to that same
section in the OS chapters and in the compiler chapter.
After you review the advice in this guide, for more information, visit the IBM Power Systems
website at:
http://www.ibm.com/systems/power/index.html
The conventions table from 1.3, Conventions that are used in this guide, pairs text formats
with examples, including bolded monofont for commands such as ldedit and italicized
monofont for options such as -mcmodel={medium|large}.
1.4 Background
Continuing trends in processor design are making it more important than ever to consider
analyzing and working to improve application performance. In the past, two of the ways in
which newer processor chips delivered higher performance were by:
Increasing the clock rate
Making microarchitectural improvements that increase the performance of a single thread
Often, upgrading to a new processor chip gave existing applications a 50% or possibly 100%
performance improvement, leaving little incentive to spend much effort to get an uncertain
amount of additional performance. However, the approach in the industry has shifted, so that
the newer processor chips do not substantially increase clock rates, as compared to the
previous generation. In some cases, clock rates declined in newer designs. Recent designs
also generally offer more modest improvements in the performance of a single execution
thread.
Instead, the focus has shifted to delivering multiple cores per processor chip, and to delivering
more hardware threads in each core (known as simultaneous multi-threading (SMT) in IBM
Power Architecture terminology). This situation means that some of the best opportunities for
improving the performance of an application are in delivering scalable code by having an
application make effective use of multiple concurrent threads of execution.
Coupled with the trend toward aggressive multi-core and multi-threaded designs, there are
sometimes changes in the amount of cache and memory bandwidth available to each
hardware thread. Cache sizes and chip-level bandwidth are, in some cases, increasing at a
slower rate than the growth of hardware threads, meaning that the amount of cache per
thread is not growing as rapidly. In particular instances, it decreases from one generation to
the next. Again, this situation shows where deeper analysis and performance optimization
efforts can provide some benefits.
There is also a recent trend toward adding transactional memory support to processors and
toward support for special purpose accelerators. Transactional memory is a feature that
simplifies multi-threaded programming by providing safe access mechanisms to shared data.
Special purpose accelerators may be based on adding new instructions to the core, on
chip-level accelerators, or on fast and efficient access mechanisms to new off-chip
accelerators, such as graphics processing units (GPUs) or field-programmable gate arrays
(FPGAs).
1. Lightweight tuning and optimization guidelines
Lightweight tuning covers simple prescriptive steps for tuning application performance on
POWER8 processor-based systems. These simple steps can be carried out without
detailed knowledge of the internals of the application that is being optimized and usually
without modifying the application source code. Simple system utilization and performance
tools are used for understanding and improving your application performance. The steps
and tools are general guidelines that apply to all types of applications. Although they are
simple and straightforward, they often lead to significant performance improvements. It is
possible to accomplish these steps in as little as two days or so for a small application.
Two weeks might be required to perform these steps for a large and complex application.
Performance improvement: Consider lightweight tuning to be the starting point for any
performance improvement effort.
2. Deployment guidelines
3. Deep performance optimization guidelines
Deep performance analysis covers performance tools and general strategies for identifying
and fixing application bottlenecks. This type of analysis requires more familiarity with
performance tools and analysis techniques, sometimes requiring a deeper understanding
of the application internals, and often requiring a more dedicated and lengthy effort. Often,
a simpler analysis is all that is required to identify serious bottlenecks in an application;
however, a more detailed investigation is required to perform an exhaustive search for all
of the opportunities for increasing performance.
Performance improvement: Consider this the last activity to undertake, and start with
the simpler analysis steps for a moderately serious performance effort. The more complex
iterative analysis is reserved for only the most performance-critical applications.
This chapter provides only minimal background on the guidance provided. Detailed material
about these topics is incorporated in the chapters that follow and in the appendixes. The
following chapters and appendixes also cover many other performance topics that are not
addressed here.
Guidance for POWER8 processor-based systems: The guidance that is provided in this
book specifically applies to POWER8 processor chips and systems. The guidance that is
provided also generally applies to previous generations of POWER processor chips and
systems, including POWER7, POWER6, and POWER5 processor-based systems. When
the guidance is not applicable to all generations of Power Systems, it is noted.
Performance test beds must be sized and configured for performance and scalability testing.
Choose your scalability goals based on the requirements that are placed on an application,
and the test bed must accommodate at least the minimum requirements. For example, when
you target a multi-threaded application to scale up to four cores on POWER8
processor-based systems, it is important that the test bed be at least a 4-core system and
that tests are configured to run in various configurations (1-core, 2-core, and 4-core). You
want to be able to measure performance across the different configurations such that the
scalability can be computed. Ideally, a 4-core system delivers four times the performance of a
1-core system, but in practice, the scalability is less than ideal. Scalability bottlenecks might
not be clearly visible if the only testing done for this example was in a 4-core configuration.
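For example (the throughput numbers here are purely illustrative), if the 1-core configuration
sustains 1,000 transactions per second and the 4-core configuration sustains 3,200
transactions per second, the measured scaling factor is 3.2 out of an ideal 4.0, that is, 80%
scalability, and the intermediate 2-core measurement shows whether the loss appears early
or only at higher core counts.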
With the multi-threaded POWER8 cores (see 2.2, Using POWER8 features on page 28),
each processor core can be instantiated with one, two, four, or eight logical CPUs within the
operating system. A 4-core server, with SMT8 mode (eight hardware threads per core),
means that the operating system is running 32 logical CPUs. Also, larger-core servers are
becoming more pervasive, with scaling considerations well beyond 4-core servers.
The performance test bed should be a dedicated LPAR. You must ensure that there is no
other activity on the system (including on other LPARs, if any, configured on the system) when
performance tests are run. The initial performance testing should be done in a dedicated
resource environment to minimize the factors that affect performance. Ensure that the LPAR
is running an up-to-date version of the operating system, at the level that is expected for the
typical usage of the application. Keep the test bed in place after any performance effort so
that performance can occasionally be monitored, which ensures that later maintenance of an
application does not introduce a performance regression.
Choosing the appropriate workloads for performance work is also important. Ideally, a
workload has the following characteristics:
Be representative of the expected actual usage of the application.
Have simple measures of performance that are easily collected and compared, such as
run time or transactions/second.
Be easy to set up and run in an automated environment, with a fairly short run time for a
fast turnaround in performance experiments.
Have a low run-to-run variability across duplicated runs, such that extensive tests are not
required to obtain a statistically significant measure of performance.
Produce a result that is easily tested for correctness.
When an application is being optimized for multiple operating systems, much of the
performance work can be undertaken on just one of the operating systems. However, some
performance characteristics are operating system-dependent, so some analysis must be
performed on each operating system. In particular, perform profiling and lock analysis
separately for each operating system to account for differences in system libraries and
kernels. Each operating system also has unique scalability considerations.
Critically, all compilers that are used to build an application must use up-to-date versions that
offer full support for the target processor chip. Older levels of a compiler might tolerate newer
processor chips, but they do not capitalize on the unique features of the latest processor
chips. For the IBM XL compilers on AIX or Linux, XLC13 and XLF15 are the first compiler
versions that have processor-specific tuning for POWER8 processor-based systems. For the
GCC compiler on Linux, IBM Advance Toolchain Version 7.0 (and later versions) contain an
updated GCC compiler that is preferred for POWER7 and POWER8 processor-based
systems, and Version 8.0 and later support POWER Little Endian and Big Endian. The IBM
XL Fortran Compiler is recommended over gfortran for the most optimized high floating point
performance characteristics.
For more information about the Advance Toolchain features and supported environments, see
the Introduction and Supported Linux Distributions sections of the Advance Toolchain wiki
page at the following website:
https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/W51a7ffcf4dfd_4b40_9d82_446ebc23c550/page/IBM%20Advance%20Toolchain%20for%20PowerLinux%20Documentation
For the GCC compiler on Linux, the GCC compilers that come with the distributions
recognize and take advantage of the POWER architecture and its optimizations. For improved
optimizations and newer GCC technology, the IBM Advance Toolchain package provides an
updated GCC compiler and optimized toolchain libraries for use with POWER8
processor-based systems.
The Advance Toolchain is a key performance technology that is available for Power Systems
running Linux. It includes newer, POWER-optimized versions of compilers (GCC, G++,
GFortran, and GCCGo (since Version 8.0)), utilities, and libraries, along with various
performance tools. The full Advance Toolchain must be installed in the build environment, and
the Advance Toolchain runtime package must be installed in the performance test bed. The
Toolchain is designed to coexist with the GCC compilers and toolchain that are provided in
the standard Linux distributions. More information is available in 6.3.1, GCC, toolchain, and
IBM Advance Toolchain on page 129.
Along with the compilers for C/C++ and Fortran, there is the separate IBM Feedback Directed
Program Restructuring (FDPR) tool to optimize performance. FDPR takes a post-link
executable image (such as one produced by static compilers) and applies additional
optimizations. FDPR is another tool that can be considered for optimizing applications that
are based on an executable image. More details can be found in 7.4, IBM Feedback Directed
Program Restructuring on page 160.
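As an illustrative sketch (the program name and the training workload script are hypothetical,
and the exact flags should be checked against the installed FDPR version), the AIX fdpr
wrapper can run the instrumentation, training, and optimization phases in one invocation:
fdpr -p myapp -x ./run_training_workload.sh
The optimized executable file is written alongside the original. The training workload should
be representative of production use so that the profile-driven code restructuring helps rather
than hurts.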
Java also contains a dynamic Just-In-Time (JIT) compiler, and only newer versions are tuned
for POWER8 processor-based systems. However, Java compilations to binary code take
place at application execution time, so a newer Java release must be installed on the
performance test bed system.
-qarch=pwr5 -qtune=balanced for an executable file that can run on POWER5 and
higher processor-based systems, and is tuned for good performance for all recent
Power Systems (including POWER6, POWER7, and POWER8 processor-based
systems)
-mtune=power7 to tune for the POWER7 processor on GCC and -mtune=power8 for the
POWER8 processor on GCC
Strict options: Sometimes the compilers can produce faster code by subtly altering the
semantics of the original source code. An example of this scenario is expression
reorganization. Especially for floating point code, the effect of expression reorganization
can produce different results. For some applications, these optimizations must be
prevented to achieve valid results. For the XL compilers, certain semantic-altering
transformations are allowed by default at higher optimization levels, such as -O3, but those
transformations can be disabled by using the -qstrict option (for example, -O3
-qstrict). For GCC, the default is strict mode, but you can use -ffast-math to enable
optimizations that are not concerned with Not a Number (NaN), signed zeros, infinities,
floating point expression reorganization, or setting the errno variable. The new -Ofast
GCC option includes -O3 and -ffast-math, and might include other options in the future.
Source code compatibility options: The XL compilers assume that the C and C++ source
code conforms to the language rules for aliasing. On occasion, older source code fails when
compiled with optimization because the code violates the language rules (see the sketch
after this list). A workaround for this situation is to use the -qalias=noansi option. The GCC
workaround is the -fno-strict-aliasing option.
Profile Directed Feedback (PDF): PDF is an advanced optimization feature of the
compilers to consider for performance-critical applications.
Interprocedural Analysis (IPA): IPA is an advanced optimization feature of the compilers
to consider for performance-critical applications.
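The following minimal sketch (a hypothetical type-punning helper, not taken from any
particular application) shows the kind of source code that violates the C aliasing rules and
can misbehave at higher optimization levels unless one of the aliasing options that are listed
above is used:

/* Reading an unsigned int through a float pointer violates the C
 * aliasing rules; an optimizing compiler may reorder or eliminate
 * these accesses unless -qalias=noansi (XL) or -fno-strict-aliasing
 * (GCC) is specified. */
float bits_to_float(unsigned int bits)
{
    return *(float *)&bits;  /* undefined behavior under strict aliasing */
}

The portable fix is to copy the bytes (for example, with memcpy) rather than to cast between
incompatible pointer types.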
A simple way to experiment with the C, C++, and Fortran compilation options is to repeatedly
build an application with different option combinations, and then to run it and measure
performance to see the effect. If higher optimization levels produce invalid results, try adding
one or both of the -qstrict and -qalias options with the XL compilers, or
-fno-strict-aliasing with GCC.
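As a hedged illustration (the source file name is hypothetical, and the best combination is
workload-dependent), such experiments might compare command lines like the following,
adding or removing individual options between runs and measuring the result each time:
xlc -O3 -qarch=pwr8 -qtune=pwr8 -qstrict -c kernel.c
gcc -O3 -mcpu=power8 -mtune=power8 -fno-strict-aliasing -c kernel.c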
Not all source files must be compiled with the same set of options, but all files must be
compiled with at least a minimum level of optimization. There are cases where optimization
was omitted on just one or two important source files, which caused an application to suffer
from substantially reduced performance.
Java options
Many Java applications are performance-sensitive to the configuration of the Java heap and
garbage collection (GC). Experimentation with different heap sizes and GC policies is an
important first optimization step. For generational GC, consider using the options that specify
the split between nursery space (also known as the new or young space) and tenured space
(also known as the old space). Most Java applications have modest requirements for
long-lived objects in the tenured space, but frequently allocate new objects with a short life
span in the nursery space. For more information, see 8.5, Java garbage collection tuning on
page 183.
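For example (the heap sizes and the class name are illustrative only and must be sized from
measurement), a generational configuration that fixes the overall Java heap at 4 GB and the
nursery at 1 GB can be started as follows:
java -Xgcpolicy:gencon -Xms4g -Xmx4g -Xmn1g MyApplication
Setting -Xms equal to -Xmx avoids heap resizing during the run, which makes performance
comparisons between GC experiments more repeatable.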
If 64-bit Java is used, use the -Xcompressedrefs option. In newer Java releases, the
compressed references option is the default for a 64-bit Java. For more information, see 8.3.4,
Compressed references on page 177.
By default, newer releases of Java use 64 KB medium pages for the Java heap, which is the
equivalent of explicitly specifying the -Xlp64k option. Linux defaults to 64 KB pages, but AIX
defaults to 4 KB pages. If older releases of Java are used on AIX, use the -Xlp64k option;
otherwise, those releases default to using 4 KB pages. Often, there is some additional
performance improvement that is seen in using larger 16 MB large pages by using the -Xlp
option. However, using 16 MB pages normally requires explicit configuration by the
administrator of the AIX or Linux operating system to reserve a portion of the memory to be
used exclusively for large pages. (For more information, see 8.3.2, Configuring large pages
for Java heap and code cache on page 176.) As such, the medium pages are a better choice
for general use, and the large pages can be considered for performance critical applications.
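As a sketch of the AIX side of that configuration (the number of regions is an assumption
that must be sized for the Java heap being used), an administrator might reserve 16 MB
pages with the vmo command and then start Java with the -Xlp option:
vmo -r -o lgpg_regions=256 -o lgpg_size=16777216
The -r flag applies the change at the next restart; 256 regions of 16 MB reserve 4 GB of
memory exclusively for large pages in this sketch.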
Many Java applications benefit from turning off the default hardware prefetching on the
POWER7 processor, and some applications might benefit from doing so on the POWER8
processor. Some recent Java releases turn off hardware prefetching by default. If you turned
off hardware prefetching on the POWER7 processor, revisit that tuning because hardware
changes on the POWER8 processor have made prefetching more beneficial across different
types of code and applications. For more information, see Tuning to capitalize on hardware
performance features on page 14. The new and improved hardware prefetcher on the
POWER8 processor has proven to be beneficial to numerous Java applications.
On Power Systems, the -Xcodecache option often delivers a small improvement in
performance, especially in a large Java application. This option specifies the size of each
code cache that is allocated by the JIT compiler for the binary code that is generated for Java
methods. Ideally, all of the compiled Java method binary code fits into a single code cache,
eliminating the small penalty that might occur when one Java method calls another method
when the binary code for the two methods is in different code caches. To use this option,
determine how much code space is being used, and then set the size of the option correctly.
The maximum size of each code cache that is allocated is 32 MB, so the largest value that
can be used for this option is -Xcodecache32m. For more information, see 8.3.5, JIT code
cache on page 180.
The JIT compiler automatically uses an appropriate optimization level when it compiles Java
methods. Recent Java releases automatically fully use all of the new features of the target
POWER8 processor of the system on which an application is running.
For more information about Java performance, see Chapter 8, Java on page 173.
Optimized libraries
Optimized libraries are important for application performance. This section covers some
considerations that are related to standard libraries for AIX or Linux, libraries for Java, or
specialized mathematical subroutine libraries that are available for the Power Architecture.
AIX malloc
The AIX operating system offers various memory allocation packages (the standard malloc()
and related routines in the C library). The default package offers good space efficiency and
performance for single-threaded applications, but it is not a good choice for the scalability of
multi-threaded applications. Choosing the correct malloc package on AIX is important for
performance. Even Java applications can make extensive use of malloc through JNI code or
internally in the Java Runtime Environment (JRE).
Fortunately, AIX offers a number of different memory allocation packages that are appropriate
for different scenarios. These different packages are chosen by setting environment variables
and do not require any code modification or rebuilding of an application.
Choosing the best malloc package requires some understanding of how an application uses
the memory allocation routines. Appendix A, Analyzing malloc usage under IBM AIX on
page 211 shows how to easily collect the required information. Following the data collection,
experiment with various alternatives, alone or in combination. Some alternatives that deliver
high performance include:
Pool malloc: The pool front end to the malloc subsystem optimizes the allocation of
memory blocks of 512 bytes or less. It is common for applications to allocate many small
blocks, and pools are particularly space- and time-efficient for that allocation pattern.
Thread-specific pools are used for multi-threaded applications. The pool malloc is a good
choice for both single-threaded and multi-threaded applications.
Multiheap malloc: The multiheap malloc package uses up to 32 separate heaps, reducing
contention when multiple threads attempt to allocate memory. It is a good choice for
multi-threaded applications.
Using the pool front end and multiheap malloc in combination is a good alternative for
multi-threaded applications. Small memory block allocations, typically the most common, are
handled with high efficiency by the pool front end. Larger allocations are handled with good
scalability by the multiheap malloc. A simple example of specifying the pool and multiheap
combination is by using the following environment variable setting:
MALLOCOPTIONS=pool,multiheap
For more information about malloc alternatives, see 4.3.1, Malloc on page 95.
java/util/concurrent
For Java, all of the standard class libraries are included with the JRE. One package of interest
for scalability optimization is java/util/concurrent. Some classes in java/util/concurrent
are more scalable replacements for older classes, such as
java/util/concurrent/ConcurrentHashMap, which can be used as a replacement for
java/util/Hashtable. ConcurrentHashMap might be slightly less efficient than Hashtable
when run in smaller system configurations where scalability is not an issue, so there can be
trade-offs. Also, switching packages requires a source code change, albeit a simple one.
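A minimal sketch of that source change (the class and its contents are illustrative; only the
standard class libraries are used):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class CounterMap {
    // Drop-in replacement for java.util.Hashtable: both are thread-safe,
    // but ConcurrentHashMap scales better because it does not take a
    // single global lock on every get() and put().
    private final Map<String, Integer> counters = new ConcurrentHashMap<>();

    public void increment(String key) {
        // Atomic read-modify-write; no external synchronization is needed.
        counters.merge(key, 1, Integer::sum);
    }

    public int get(String key) {
        return counters.getOrDefault(key, 0);
    }
}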
Recent POWER processors allow not only prefetching to be enabled or disabled, but they
also allow fine-tuning of the prefetch engine. Such fine-tuning is especially beneficial for
scientific, engineering, and memory-intensive applications. Because the effect of
hardware prefetching is heavily dependent on the way that an application accesses memory,
and also dependent on the cache sizes of a particular POWER chip, it is always best to test
explicitly the effects of different prefetch settings on each chip on which the application is
expected to run.
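For example (treat the values as starting points to be validated by measurement), the system
default prefetch setting can be changed through the Data Stream Control Register (DSCR):
dscrctl -n -s 1 (AIX: disable hardware prefetching for the current boot)
dscrctl -n -s 0 (AIX: restore the processor default)
ppc64_cpu --dscr=1 (Linux: disable hardware prefetching)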
For more information about hardware prefetching and hardware and operating system tuning
and usage for optimum performance, see Chapter 2, The IBM POWER8 processor on
page 25, Chapter 4, IBM AIX on page 71, Chapter 6, Linux on page 117, and Chapter 5,
IBM i on page 111.
Cache affinity
The hardware threads for each core of a POWER8 processor share a core-specific cache
space. For multi-threaded applications where different threads are accessing the same data,
it can be advantageous to arrange for those threads to run on the same core. By doing so, the
shared data remains resident in the core-specific cache space, as opposed to moving
between different private cache spaces in the system. This enhanced cache affinity can
provide more efficient utilization of the cache space in the system and reduce the latency of
data references.
Similarly, the multiple cores on a POWER8 processor share a chip-specific cache space.
Again, arranging the software threads that are sharing the data to run on the same POWER8
processor (when the partition spans multiple chips) often allows more efficient utilization of
cache space and reduced data reference latencies.
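As an illustration (the CPU numbers assume an SMT8 core whose logical CPUs are
numbered 0 - 7, and the application name is hypothetical), the software threads of a process
can be restricted to a single core's logical CPUs:
taskset -c 0-7 ./app (Linux)
execrset -c 0-7 -e ./app (AIX)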
Memory affinity
By default, the POWER Hypervisor attempts to satisfy the memory requirements of a partition
by using the local memory DIMMs for the processor cores that are allocated to the partition.
For larger partitions, however, the partition might contain a mixture of local and remote
memory. For an application that is running on a particular core or chip, the application runs
best when using only local memory. This enhanced memory affinity reduces the latency of
memory accesses.
The PowerVM Hypervisor uses a three-level affinity mechanism in its scheduler to enforce
affinity as much as possible. The reason why absolute affinity is not always possible is that
partitions can expand and use unused cycles of other LPARs. This process is done by using
uncapped mode in Power, where the uncapped cycles might not always have affinity.
Therefore, binding logical processors that are seen at the operating system level to physical
threads seen at the hypervisor level works only in some cases in shared partitions. Achieving
a high level of affinity is difficult when multiple partitions share resources from a single pool,
especially at high utilization, and when partitions are expanding to use other partition cycles.
Therefore, creating large shared processor core pools that span across chips tends to create
remote memory accesses. For this reason, it might be less desirable to use larger partitions
and large processor core pools where high-level affinity performance is expected.
Virtualized deployments can use Micro-Partitioning, where a partition is allocated a fraction of
a core. Micro-Partitioning allows a core allocation as small as 0.1 cores in older firmware
levels, and as small as 0.05 cores in more recent firmware levels, when coupled with
supporting operating system levels. This powerful mechanism provides great flexibility in
deployments. However, small core allocations are most appropriate for situations in which
many virtual machines are often idle, so that the active 0.05 core LPARs can use those idle
cycles.
Also, there is one negative performance effect in deployments with very small partitions, in
particular with 0.1 or fewer cores at high system utilization: Java warm-up times can be
greatly increased. In a Java execution, the JIT compiler produces binary code for Java
methods dynamically. Steady-state optimal performance is reached after a portion of the
Java methods are compiled to binary code. With very small partitions, there might be a long
warm-up period before reaching steady-state performance, where a 0.05 core LPAR cannot
get additional cycles from other LPARs because the other LPARs are consuming their
cycles. However, if the workload that is running on this small LPAR does not need more than
5% of a processor core capacity, then the performance impact is mitigated.
For more information about this topic, see Chapter 3, The IBM POWER Hypervisor on
page 57.
Memory requirements
For good performance, there needs to be enough physical memory available so that
application data does not need to be frequently paged in and out between memory and disk.
The physical memory that is allocated to a partition must be enough to satisfy the
requirements of the operating system and the applications that are running on the partition.
Java is sensitive to having enough physical memory available to contain the Java heap
because Java applications often have frequent GC cycles where large portions of the Java
heap are accessed. If portions of the Java heap are paged out to disk by the operating system
because of a lack of physical memory, then GC cycles can cause a large amount of disk
activity, which is known as thrashing.
SMT mode
The POWER8 processor is designed to support as many as eight SMT threads in each core.
This is up to eight separate concurrent threads of instruction execution that share the
hardware resources of the core. At the operating system level, this is seen as up to eight
logical CPUs per core in the partition. The operating system therefore can schedule up to
eight software threads to run concurrently on the core.
Different operating systems can choose to run by default at different SMT modes. AIX
defaults to SMT4 when running on a POWER8 processor, and Linux defaults to SMT8
(assuming newer operating system levels that are POWER8 aware). As a deployment choice,
configuring the system to run in a particular SMT mode is easily done by the system
administrator by using the smtctl command on AIX or the ppc64_cpu --smt command on
Linux.
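For example, to cap a partition at SMT4 dynamically:
smtctl -t 4 -w now (AIX)
ppc64_cpu --smt=4 (Linux)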
Note: The SMT mode that the operating system is running in specifies a maximum SMT
level, and not a fixed level. The AIX and Linux operating systems dynamically alter the
SMT level up to the maximum permitted. During periods where there are few software
threads available to run, the operating system can dynamically reduce the SMT mode.
During periods where there are many software threads available to run, the operating
system dynamically switches to the maximum SMT mode that the system administrator
has configured.
SMT is sometimes a trade-off between the best performance a single thread of execution can
achieve versus the best total throughput the partition can achieve. To understand this better,
consider the following cases:
A particular software thread is consuming only a modest fraction of the hardware
resources of the core. This often occurs for threads in a Java application, for example. In
Java, there is typically a higher frequency of loads, stores and branches, and a lower level
of instruction-level parallelism, in the binary code. In a case such as this, SMT effectively
supports many software threads running simultaneously on the same core and achieves a
high level of total throughput without sacrificing the performance of the individual threads.
A particular software thread can consume a large fraction of the hardware resources of
the core. This sometimes occurs in numerically intensive C code, for example. At high
SMT modes with many active threads, this particular software thread is competing with
other threads for core resources, and can be starved for resources by the other threads. In
a case such as this, the highest single thread performance of this particular software
thread is achieved at lower SMT modes. Conversely, the highest throughput is still
achieved with a high SMT mode, at the expense of reducing the performance of some
individual threads.
With the larger number of cores per chip in POWER8 processor-based systems as compared
to previous generations and the higher number of SMT threads per core on the POWER8
processor, one natural tendency is to create partitions with more logical CPUs than in the
past. One effect that has been repeatedly seen with more logical CPUs is for an application to
start suffering from scalability bottlenecks. All applications typically have a limit to their
scalability, and more logical CPUs can cause or exacerbate a scalability issue.
Counterintuitively, application performance goes down when a scalability bottleneck appears.
When this happens, users sometime experiment with different SMT modes and come to the
conclusion that the application naturally prefers a lower SMT mode, and adjust the operating
system configuration. This is often the incorrect conclusion. As explained previously, a small
minority of applications do perform better at SMT2, but most applications run well at SMT4 or
SMT8 if there are no scalability bottlenecks present.
For cases where an application is seeing reduced performance with higher SMT modes
because of scalability effects, possible remediations include:
Bind the application to run on a subset of the available logical CPUs. See the example
under Power dedicated LPARs on page 17.
Reduce the number of cores in the partition, which frees hardware resources that can be
used to create other partitions.
Perform the analysis steps that are outlined in 1.5.3, Deep performance optimization
guidelines on page 21 for identifying scalability bottlenecks and fixing the application so
that it scales better.
As a temporary measure, lower the SMT mode of the partition. This should be considered
a temporary measure only because by artificially lowering the SMT mode to address a
scalability issue, you are effectively wasting some of the available hardware resources of
the system.
For the cases where an application intrinsically runs better in a lower SMT mode, and where
there is a desire to achieve the best possible single thread performance at the expense of
overall throughput, possible remediations include:
Segregate the system into partitions running in lower SMT modes and other partitions
running in higher SMT modes. Run the applications that prefer lower SMT modes in the
partitions with lower SMT levels.
Use the hybrid thread and core feature to have some cores in a partition that is run in
lower SMT modes, while other cores run in high SMT modes. Bind the applications that
prefer lower SMT modes to the cores running in lower SMT modes and bind other
applications to the other cores.
For more information see 2.2.1, Multi-core and multi-thread on page 28, 4.2.1, Multi-core
and multi-thread on page 72, 5.2.1, Multi-core and multi-thread on page 112, and 6.2.1,
Multi-core and multi-thread on page 119, all of which address multi-core and multi-thread
from the processor and operating system standpoints.
Common operating system tools for gathering this general information include topas and
perfpmr (AIX), top and LPCPU (Linux), vmstat, iostat, and netstat. Detailed CPU usage
information is available by running sar. This command diagnoses cases where some
logical processors are saturated and others are underutilized, an issue that is seen with
network interrupt processing on Linux.
Collect a time-based profile of the application to see where run time is concentrated.
Some possible areas of concern are:
Particular user routines or Java methods with a high concentration of execution time.
This situation is an indication of a poor coding practice or an inefficient algorithm that is
being used in the application itself.
Particular library routines or Java class library methods with a high concentration of
execution time. First, determine whether the hot routine or method is legitimately used
to that extent. Look for alternatives or more efficient versions, such as using the
optimized libraries in the IBM Advance Toolchain or the vector routines in the MASS
library (for more information, see Mathematical Acceleration Subsystem Library and
Engineering and Scientific Subroutine Library on page 13).
A concentration of run time in the pthreads library (see Java profiling example on
page 242) or in kernel locking routines. This situation is associated with a locking
issue. This locking might ultimately arise at the system level (as seen with malloc
locking issues on AIX), or at the application level in Java code (associated with
synchronized blocks or methods in Java code). The source of locking issues is not
always immediately apparent from a profile. For example, with AIX malloc locking
issues, the time that is spent in the malloc and free routines might be low, with almost
all of the impact appearing in kernel locking routines.
The tools for gathering profiles are tprof (AIX), OProfile (Linux), and perf (Linux) (these
tools are described in IBM Rational Performance Advisor on page 221). The curt tool
(see AIX trace-based analysis tools on page 226) also provides a breakdown, describing
where CPU time is consumed and includes more useful information, such as a system call
summary.
Where there are indications of a locking issue, collect locking information.
With locking problems, the primary concern is to determine where the locking originates in
the application source code. Cases such as AIX malloc locking can be easily solved just
by switching to a more scalable memory allocation package through the MALLOCTYPE and
MALLOCOPTIONS environment variables. In this case, examine how malloc is used and
consider making changes at the source code level. For example, rather than repeatedly
allocating many small blocks of memory by calling malloc for each block, the application
can allocate an array of blocks and then internally manage the space (see the sketch after
this list).
As mentioned in java/util/concurrent on page 14, Java locking issues that are associated
with some older classes, such as java/util/Hashtable, can be easily solved by using
java/util/concurrent/ConcurrentHashMap.
For Java programs, use Java Lock Monitor (see Java Health Center on page 241). For
non-Java programs, use the splat tool on AIX (see AIX trace-based analysis tools on
page 226).
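The following C sketch illustrates the block-pool idea that is mentioned in the list above. It is a minimal, illustrative example and not from any library (the names node_pool, pool_init, pool_alloc, and pool_destroy are assumptions): one malloc call reserves space for many nodes, and the application hands the nodes out itself, avoiding repeated trips through a possibly lock-contended allocator.

#include <stdio.h>
#include <stdlib.h>

struct node { int key; struct node *next; };

struct node_pool {
    struct node *blocks;   /* one large allocation that holds every node */
    size_t next_free;      /* index of the next unused node */
    size_t capacity;
};

static int pool_init(struct node_pool *p, size_t capacity)
{
    p->blocks = malloc(capacity * sizeof *p->blocks);  /* single malloc call */
    p->next_free = 0;
    p->capacity = capacity;
    return p->blocks ? 0 : -1;
}

static struct node *pool_alloc(struct node_pool *p)
{
    /* Hand out nodes from the array; no per-node malloc call is needed. */
    return (p->next_free < p->capacity) ? &p->blocks[p->next_free++] : NULL;
}

static void pool_destroy(struct node_pool *p)
{
    free(p->blocks);       /* one free call releases every node */
}

int main(void)
{
    struct node_pool pool;
    if (pool_init(&pool, 1000000) != 0)
        return 1;
    struct node *n = pool_alloc(&pool);
    if (n != NULL)
        n->key = 42;
    printf("allocated node key: %d\n", n ? n->key : -1);
    pool_destroy(&pool);
    return 0;
}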
For Java, the WAIT tool is a powerful, easy-to-use analysis tool that is based on collecting
thread state information.
Using the WAIT tool requires installing and running only a data collection shell. The shell
collects various information about the Java program execution, the most important of
which is a set of javacore files. The javacore files show the state of all of the threads at the
time the file was dumped. The collected data is submitted to an online tool by using a web
browser, and the tool analyzes the data and displays the results with a GUI. The GUI
presents information about thread states and has powerful features to drill down to see
call chains.
The WAIT tool results combine many of the features of a time-based profile, a lock
monitor, and other tools. For Java programs, the WAIT tool might be one of the first
analysis tools to consider because of its versatility and ease of use.
For more information about IBM Whole-system Analysis of Idle Time, which is the
browser-based (that is, no-install) WAIT tool, go to:
http://wait.researchlabs.ibm.com
Chapter 2. The IBM POWER8 processor
[Figure: POWER8 processor chip layout, showing the cores with their private L2 caches and an 8 MB L3 region per core, plus the memory controllers]
POWER8 processor-based systems use memory buffer chips to interface between the
POWER8 processor and DDR3 or DDR4 memory. Each buffer chip also includes an L4 cache
to reduce the latency of local memory accesses. The number of memory controllers, memory
buffer chips, PCIe lanes, and cores that are available for use depend upon the particular
POWER8 processor-based system.
Each core is a 64-bit implementation of the IBM Power Instruction Set Architecture (ISA)
Version 2.07 and has the following features:
Multi-threaded design, capable of up to eight-way simultaneous multithreading (SMT)
32 KB, eight-way set-associative L1 i-cache
64 KB, eight-way set-associative L1 d-cache
72-entry Effective to Real Address Translation (ERAT) for effective to real address
translation for instructions (fully associative)
48-entry primary ERAT (fully associative) and 144-entry secondary ERAT for effective to
real address translation for data
Aggressive branch prediction, using both local and global prediction tables with a selector
table to choose the best predictor
16-entry link stack
256-entry count cache
Aggressive out-of-order execution
Two symmetric fixed-point execution units
Two symmetric load/store units and two load units, all four of which can also run simple
fixed-point instructions
An integrated, multi-pipeline vector-scalar floating point unit for running both scalar and
SIMD-type instructions, including the Vector Multimedia eXtension (VMX) instruction set
and the new Vector Scalar eXtension (VSX) instruction set, and capable of up to eight
floating point operations (flops) per cycle (four double precision or eight single precision)
In-core Advanced Encryption Standard (AES) encryption capability
Hardware data prefetching with 16 independent data streams and software control
Hardware decimal floating point (DFP) capability
The POWER8 processor is designed for system offerings from single-socket blades to
multi-socket Enterprise servers. It incorporates a triple-scope broadcast coherence protocol
over local and global SMP links to provide superior scaling attributes. Multiple-scope
coherence protocols reduce the amount of SMP link bandwidth that is required by attempting
operations on a limited scope (single chip or multi-chip group) when possible. If the operation
cannot complete coherently, the operation is reissued by using a larger scope to complete the
operation.
Here are additional features that can augment performance of the POWER8 processor:
Adaptive power management.
Support for DDR3 and DDR4 memory through memory buffer chips that offload the
memory support from the POWER8 memory controller.
16 MB L4 cache within the memory buffer chip that reduces the memory latency for local
access to memory behind the buffer chip. The operation of the L4 cache is transparent to
applications running on the POWER8 processor.
On-chip accelerators, including on-chip encryption, compression, and random number
generation accelerators.
For more information about this topic, see 2.3, I/O adapter affinity on page 55.
Processor               Cores/system   Maximum SMT mode   Maximum hardware threads per LPAR
IBM POWER4 processor    32             ST                 32
IBM POWER5 processor    64             SMT2               128
IBM POWER6 processor    64             SMT2               128
IBM POWER7 processor    256            SMT4               1024
IBM POWER8 processor    192            SMT8               1536
Information about the multi-thread per core features by single LPAR scaling is available in the
following tables:
Table 4-1 on page 73 (AIX)
Table 5-1 on page 112 (IBM i)
Table 6-1 on page 119 (Linux)
Simultaneous multithreading
The Power Architecture uses simultaneous multithreading (SMT) to provide multiple streams
of hardware execution. The POWER8 processor provides eight SMT hardware threads per
core and can be configured to run in SMT8, SMT4, SMT2, or single-threaded mode (SMT1
mode or, as referred to in this publication, ST mode). The POWER7 and POWER7+
processors provide four SMT hardware threads per core and can be configured to run in
SMT4, SMT2, or ST mode. POWER6 and POWER5 processors provide two SMT threads per
core, and can be run in SMT2 mode or ST mode.
By using multiple SMT threads, a workload can take advantage of more of the hardware
features that are provided in the POWER processor than if a single SMT thread is used per
core. By configuring the processor core to run in multi-threaded mode, the operating system
can maximize the use of the hardware capabilities that are provided in the system and the
overall workload throughput by correctly balancing software threads across all of the cores
and SMT hardware threads in the partition.
SMT does include some performance tradeoffs:
SMT can provide a significant throughput and capacity improvement on POWER
processors. When you are in SMT mode, there is a trade-off between overall CPU
throughput and the performance of each hardware thread. SMT allows multiple instruction
streams to be run simultaneously, but this concurrency can cause some resource conflict
between the instruction streams. This conflict can result in a decrease in performance for
an individual thread, but an increase in overall throughput.
Some workloads do not run well with the SMT feature. This situation is not typical for
commercial workloads, but it has been observed with scientific (floating point-intensive)
workloads.
Information about the topic of SMT, from the OS perspective, is available in the following
sections:
Simultaneous multithreading on page 73 (AIX)
Simultaneous multithreading on page 112 (IBM i)
Simultaneous multithreading on page 119 (Linux)
PPR(11:13)  Priority level      Priority Nop   Minimum privilege      Minimum privilege    Minimum privilege
                                               required to set level  required to set      required to set
                                               in POWER5, POWER6,     level in a           level in a
                                               and POWER7 processors  POWER7+ processor    POWER8 processor
b'000'      Thread shutoff      -              Hypervisor             Hypervisor           Hypervisor
            (read only; set by
            disabling thread)
b'001'      Very low            or 31,31,31    Supervisor             Problem-state        Problem-state
b'010'      Low                 or 1,1,1       Problem-state          Problem-state        Problem-state
b'011'      Medium low          or 6,6,6       Problem-state          Problem-state        Problem-state
b'100'      Medium              or 2,2,2       Problem-state          Problem-state        Problem-state
b'101'      Medium high         or 5,5,5       Supervisor             Supervisor           Problem-state
b'110'      High                or 3,3,3       Supervisor             Supervisor           Supervisor
b'111'      Very high           or 7,7,7       Hypervisor             Hypervisor           Hypervisor
a. The required privilege to set a particular SMT thread priority level is associated with the physical processor
implementation that the LPAR is running on, and not the processor compatible mode. Therefore, setting Very Low
SMT priority requires only user level privilege on POWER7+ processors, even when running in IBM
POWER6-compatible, POWER6+-compatible, or POWER7-compatible modes.
For more information about SMT priority levels, see Power ISA Version 2.07, found at:
https://www.power.org/documentation/power-isa-v-2-07b/
Changing the SMT priority level can generally be done in one of the following ways:
Running a Priority Nop, a special form of the or x,x,x nop
Writing a value to the Program Priority Register (PPR) by running mtppr
Through a system call, which can be used by problem-state programs to set priorities in
the range that is permitted for the supervisor state
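For example, the following C fragment (a minimal sketch; the function names are illustrative assumptions) uses the priority nops from the table above to lower the SMT thread priority around a polling loop and then restore it:

static inline void smt_priority_low(void)    { __asm__ volatile ("or 1,1,1"); }
static inline void smt_priority_medium(void) { __asm__ volatile ("or 2,2,2"); }

void poll_for_work(volatile int *flag)
{
    smt_priority_low();          /* hint: this thread is only spinning */
    while (*flag == 0)
        ;                        /* busy-wait */
    smt_priority_medium();       /* restore normal priority for real work */
}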
2.2.2 Multipage size support (page sizes (4 KB, 64 KB, 16 MB, and 16 GB))
The virtual address space of a program is divided into segments. The size of each segment
can be either 256 MB or 1 TB on Power Systems. The virtual address space can also consist
of a mix of these segment sizes. The segments are further divided into units that are called pages. IBM
Power Architecture supports multiple virtual memory page sizes, which provides performance
benefits to an application because of hardware efficiencies that are associated with larger
page sizes.4
The POWER5+ and later processors support four virtual memory page sizes: 4 KB, 64 KB, 16
MB, and 16 GB. The POWER6 and later processors also support using 64 KB pages inside
segments along with a base page size of 4 KB.5 The 16 GB pages can be used only within
1 TB segments.
Large pages provide multiple technical advantages:
Reduced Page Faults and Translation Lookaside Buffer (TLB) Misses: A single large page
that is being constantly referenced remains in memory. This feature eliminates the
possibility of several small pages often being swapped out.
Unhindered Data Prefetching: A large page enables unhindered data prefetch (which is
constrained by page boundaries).
Increased TLB Reach: This feature saves space in the TLB by holding one translation
entry instead of n entries, which increases the amount of memory that can be accessed by
an application without incurring hardware translation delays.
Increased ERAT Reach: The ERAT on Power Systems is a first level and fully associative
translation cache that can go directly from effective to real address. Large pages also
improve the efficiency and coverage of this translation cache as well.
Large segments (1 TB) also provide reduced Segment Lookaside Buffer (SLB) misses, and
increases the reach of the SLB. The SLB is a cache of the most recently used Effective to
Virtual Segment translations.
The 16 MB and 16 GB pages are intended only for high-performance environments; however,
64 KB pages are considered general-purpose, and most workloads benefit from using 64 KB
pages rather than 4 KB pages.
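As a hedged illustration on Linux, an application can explicitly request large pages for a buffer by using mmap with the MAP_HUGETLB flag. This sketch assumes that huge pages are configured on the system (on AIX, large page use is controlled through other mechanisms; see the sections that are listed below):

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>

#define BUF_SIZE (16UL * 1024 * 1024)       /* one 16 MB region */

int main(void)
{
    /* MAP_HUGETLB asks the kernel to back the mapping with huge pages. */
    void *buf = mmap(NULL, BUF_SIZE, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (buf == MAP_FAILED) {
        perror("mmap(MAP_HUGETLB)");        /* likely no huge pages reserved */
        return 1;
    }
    /* A single translation entry can now cover much more of the buffer,
     * increasing TLB reach as described above. */
    munmap(buf, BUF_SIZE);
    return 0;
}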
For more information about this topic, from the OS perspective, see:
4.2.2, Multipage size support on AIX on page 83
5.2.2, Multipage size support on IBM i on page 113
6.2.2, Multipage size support on Linux on page 123
Cache sharing
Power Systems consist of multiple processor cores and multiple processor chips that share
caches and memory in the system. The architecture uses a processor and memory layout
that you can use to scale the hardware to many nodes of processor chips and memory. One
advantage is that systems can be used for multiple workloads and workloads that are large.
However, these characteristics must be carefully weighed in the design, implementation, and
evaluation of a workload. Aspects of a program, such as the allocation of data across cores
and chips and the layout of data within a data structure, play a key role in maximizing
performance, especially when scaling across many processor cores and chips.
Power Systems use a cache-coherent SMP design, in which all of the memory in the system
is accessible to all of the processor cores in the system, and all of the cache is coherently
maintained:
Any processor core on any chip can access the memory of the entire system.
Any processor core can access the contents of any core's cache, even if it is on a different
chip.
Processor core access: In both of these cases, the processor core can access only
memory or cache that it has authorized access to using normal operating system and
Hypervisor memory access permissions and controls.
In POWER8 processor-based systems, each chip consists of twelve processor cores, each
with on-core L1 instruction and d-caches, an L2 cache, and an L3 cache, as shown in
Figure 2-2.6
[Figure 2-2: The POWER8 processor chip and memory, showing twelve cores, each with a 512 KB L2 cache and an 8 MB L3 region, connected through memory controllers to eight memory buffer chips, each with a 16 MB L4 cache and attached memory]
All of these caches are effectively shared. The L2 cache has a longer access latency than L1,
and L3 has a longer access latency than L2. Each chip also has memory controllers, allowing
direct access to a portion of the memory DIMMs in the system.7 Thus, it takes longer for an
application thread to access data in cache or memory that is attached to a remote chip than
to access data in a local cache or memory. These types of characteristics are often referred to
as affinity performance effects (for more information, see The POWER8 processor and
affinity performance effects on page 16). In many cases, systems are built around
different processor models that have varying characteristics (for example, although L3 cache
is architecturally supported, it might not be implemented on some models).
Functionally, it does not matter which core in the system an application thread is running on,
or what memory the data it is accessing is on. However, this situation does affect the
performance of applications because accessing a remote memory or cache takes more time
than accessing a local memory or cache.8 This situation becomes even more pronounced with
the capability of modern systems to support massive scaling and the resulting possibility for
remote accesses to occur across a large processor interconnection complex.
The effect of these system properties can be observed by application threads because they
often move, sometimes rather frequently, between processor cores. This situation can
happen for various reasons, such as a page fault or lock contention that results in the
application thread being preempted while it waits for a condition to be satisfied, and then
being resumed on a different core. Any application data that is in the cache local to the
original core is no longer in the local cache because the application thread moved and a
remote cache access is required.9 Although modern operating systems, such as AIX, attempt
to ensure that cache and memory affinity is retained, this movement does occur, and can
result in a loss in performance. For an introduction to the concepts of cache and memory
affinity, see The POWER8 processor and affinity performance effects on page 16.
The POWER Hypervisor is responsible for:
Virtualization of processor cores and memory that is presented to the operating system
Ensuring that the affinity between the processor cores and memory an LPAR is using is
maintained as much as possible
However, it is important for application designers to consider affinity issues in the design of
applications, and to carefully assess the impact of application thread and data placement on
the cores and the memory that is assigned to the LPAR the application is running in.
Various techniques that are employed at the system level can alleviate the effect of cache
sharing. One example is to configure the LPAR so that the amount of memory that is
requested for the LPAR is satisfied by the memories that are locally available to processor
cores in the system (the memory DIMMs that are attached to the memory controllers for each
processor core). It is more likely that the POWER Hypervisor can maintain affinity between
the processor cores and memory that is assigned to the partition, improving performance.10
For more information about LPAR configuration and running the AIX lssrad -va command to
query the affinity characteristics of a partition, see Chapter 3, The IBM POWER Hypervisor
on page 57. The equivalent Linux command is numactl --hardware.
The rest of this section covers multiple topics that can affect application performance,
including the effects of cache geometry, alignment of data, and sensitivity to the scaling of
applications to more cores.
Cache geometry
Cache geometry refers to the specific layout of the caches in the system, including their
location, interconnection, and sizes. These design details change for every processor chip,
even within the Power Architecture. Figure 2-2 on page 33 shows the layout of a POWER8
chip, including the processor cores, caches, and local memory. Table 2-3 shows the cache
sizes and related geometry information for POWER8 processor-based systems.11
Table 2-3 POWER8 storage hierarchy
Cache                      POWER7 processor-based     POWER7+ processor-based    POWER8 processor-based
                           system                     system                     systems
L1 i-cache:
  capacity/associativity   32 KB, 4-way               32 KB, 4-way               32 KB, 8-way
L1 d-cache:
  capacity/associativity,  32 KB, 8-way;              32 KB, 8-way;              64 KB, 8-way;
  bandwidth                2 16-B reads or            2 16-B reads or            4 16-B reads or
                           1 16-B write per cycle     1 16-B write per cycle     1 16-B write per cycle
L2 cache:
  capacity/associativity   256 KB per core, 8-way     256 KB per core, 8-way     512 KB per core, 8-way
L3 cache:
  capacity/associativity,  On-chip;                   On-chip;                   On-chip;
  bandwidth                4 MB/core, 8-way;          10 MB/core, 8-way;         8 MB/core, 8-way;
                           16-B reads and 16-B        16-B reads and 16-B        32-B reads and 32-B
                           writes per cycle           writes per cycle           writes per cycle
L4 cache:
  capacity/associativity   N/A                        N/A                        16 MB per buffer chip,
                                                                                 16-way; up to 8 buffer
                                                                                 chips per socket
11 Ibid
12 Splitting Data Objects to Increase Cache Utilization (Preliminary Version, 9th October 1998), found at:
http://citeseer.uark.edu:8080/citeseerx/viewdoc/summary?doi=10.1.1.84.3359
As described in Eliminate False Sharing, Stop your CPU power from invisibly going down
the drain,13 it is also important to assess carefully the impact of this strategy, especially
when applied to systems where there are a high number of CPU cores and a phenomenon
referred to as false sharing can occur. False sharing occurs when multiple data elements
are in the same cache line that can otherwise be accessed independently. For example, if
two different hardware threads wanted to update (store) two different words in the same
cache line, only one of them at a time can gain exclusive access to the cache line to
complete the store. This situation results in:
Cache line transfers between the processors where those threads are
Stalls in other threads that are waiting for the cache line
All threads except the one that most recently updated the line being left without a copy of the line in their cache
This effect is compounded as the number of application threads that share the cache line
(that is, threads that are using different data in the cache line under contention) is scaled
upwards.14 The discussion about cache sharing15 also presents techniques for analyzing
false sharing and suggestions for addressing the phenomenon.
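A common way to address false sharing is to give each independently updated element its own cache line. The following C sketch (illustrative, not from the source) aligns per-thread counters to the 128-byte Power cache line size:

#include <stdalign.h>

#define CACHE_LINE 128                  /* Power Systems cache line size */
#define NTHREADS   4

struct padded_counter {
    alignas(CACHE_LINE) unsigned long value;   /* one counter per cache line */
};

static struct padded_counter counters[NTHREADS];

/* Each thread increments only its own counter. Because every counter now
 * occupies a distinct cache line, a store by one thread no longer steals
 * the line that another thread is updating. */
void count_event(int tid)
{
    counters[tid].value++;
}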
Prefetching to avoid cache miss penalties
Prefetching to avoid cache miss penalties is another technique that is used to improve
performance of applications. The concept is to prefetch blocks of data to be placed into the
cache a number of cycles before the data is needed. This action hides the penalty of
waiting for the data to be read from main storage. Prefetching can be speculative when,
based on the conditional path that is taken through the code, the data might end up not
being required. The benefit of prefetching depends on how often the prefetched data is
used. Although prefetching is not strictly related to cache geometry, it is an important
technique.
A caveat to prefetching is that, although it is common for the technique to improve
performance for single-thread, single core, and low utilization environments, it can
decrease performance in high thread-count per-socket and high-utilization environments.
Most systems today virtualize processors and the memory that is used by the workload.
Because of this situation, the application designer must consider that, although an LPAR
might be assigned only a few cores, the overall system likely has a large number of cores.
Further, if the LPARs are sharing processor cores, the problem becomes compounded.
The dcbt and dcbtst instructions are commonly used to prefetch data.16,17 Power
Architecture ISA 2.06 Stride N Prefetch Engines to boost Application's performance
provides an overview about how these instructions can be used to improve application
performance. These instructions can be used directly in hand-tuned assembly language
code, or they can be accessed through compiler built-ins or directives.
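For example, with GCC the generic __builtin_prefetch built-in can be used to issue prefetches, which on Power typically compile to dcbt or dcbtst. This is a minimal sketch; the prefetch distance of 16 elements is an illustrative assumption that must be tuned:

static long sum_with_prefetch(const long *a, long n)
{
    long sum = 0;
    for (long i = 0; i < n; i++) {
        if (i + 16 < n)
            __builtin_prefetch(&a[i + 16], 0, 1);  /* read; low temporal locality */
        sum += a[i];
    }
    return sum;
}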
A preferred way to use prefetching is to have the compiler decide where in the application
code to place prefetch instructions. The type of analysis that is required is highly suited for
computers to perform.
Prefetching is also automatically done by the POWER8 hardware and is configurable, as
described in Data prefetching using d-cache instructions and the Data Streams Control
Register (DSCR) on page 39.
13 Eliminate False Sharing, Stop your CPU power from invisibly going down the drain, found at:
http://drdobbs.com/goparallel/article/showArticle.jhtml?articleID=217500206
14 Ibid
15 Ibid
16 dcbt (Data Cache Block Touch) instruction, found at:
http://www-01.ibm.com/support/knowledgecenter/api/redirect/aix/v7r1/topic/com.ibm.aix.aixassem/doc/alangref/idalangref_dcbt_instrs.htm
17 dcbtst (Data Cache Block Touch for Store) instruction, found at:
http://www-01.ibm.com/support/knowledgecenter/api/redirect/aix/v7r1/topic/com.ibm.aix.aixassem/doc/alangref/idalangref_dcbstst_instrs.htm
Alignment of data
Processors are optimized for accessing data elements on their naturally aligned boundaries.
Unaligned data accesses might require extra processing time by the processor for individual
load or store instructions. They might require a trap and emulation by the host operating
system. Ensuring natural data alignment also ensures that individual accesses do not span
cache line boundaries.
Similar to the idea of splitting structures into hot and cold elements, the concept of data
alignment seeks to optimize cache performance by ensuring that data does not span across
multiple cache lines. The cache line size in Power Systems is 128 bytes.
The general technique for alignment is to keep operands (data) on natural boundaries, such
as a word or doubleword boundary (that is, an int is aligned on a word boundary in memory).
This technique might involve padding and reordering data structures to avoid cases such as
the interleaving of chars and doubles: char; double; char; double. High-level language
compilers can ensure optimal data alignment by inserting padding. However, data layout must
be carefully analyzed to avoid an undue increase in size by such methods. For example, the
previous case of a structure containing char; double; char; double; requires 14 bytes of
padding. Such an increase in size might result in more cache misses or page misses
(especially for rarely referenced groupings of data).
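The effect of member ordering can be checked directly with sizeof. This C sketch (illustrative, not from the source) contrasts the interleaved layout that is described above with a reordered one:

#include <stdio.h>

struct interleaved {            /* char, double, char, double: 14 bytes of padding */
    char   c1; double d1;
    char   c2; double d2;
};

struct reordered {              /* doubles first: only 6 bytes of trailing padding */
    double d1, d2;
    char   c1, c2;
};

int main(void)
{
    printf("interleaved: %zu bytes\n", sizeof(struct interleaved));  /* 32 */
    printf("reordered:   %zu bytes\n", sizeof(struct reordered));    /* 24 */
    return 0;
}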
Additionally, to achieve optimal performance, floating point and VMX/VSX have different
alignment requirements. For example, the preferred VSX alignment is 16 bytes instead of the
element size of the data type being used. This situation means that VSX data that is smaller
than 16 bytes must be padded out to 16 bytes. The compilers introduce padding as
necessary to provide optimal alignment for vector data types.
Non-vector data that is intended to be accessed through VSX instructions should be aligned
so that VSX loads and stores are performed on addresses that are aligned to 16-byte
boundaries. However, the POWER8 processor improves the handling of misaligned
accesses. Most loads, which cross cache lines and hit in the d-cache, are handled by the
hardware with minimal impact on performance.
Byte ordering
The byte ordering (Big Endian or Little Endian) is specified by the operating system. In Little
Endian mode, byte swapping is performed before data is written to storage and before data is
fetched into the execution units. The Load and Store Multiple instructions and the Move Assist
instructions are not supported in Little Endian mode. Attempting to run any of these
instructions in Little Endian mode causes the system alignment error handler to be started.
The POWER8 processor can operate with the same byte ordering for both instruction and
data, or with Split Endian, with instructions and data having different byte ordering.
In general terms, an application that rarely accesses memory beyond its local caches (that is,
a core-centric application) scales nearly perfectly across more cores. Performance loss when
scaling across multiple cores tends to come from one or more of the following sources:
Increased cache misses (often from invalidations of data by other processor cores,
especially for locks)
The increased cost of cache misses, which in turn drives overall memory and interconnect
fabric traffic into the region of bandwidth limitations (saturating the memory busses and
interconnect)
The additional cores that are being added to the workload in other nodes, resulting in
increased latency in reaching memory and caches in those nodes
Briefly, cache miss requests and returning data can end up being routed through busses that
connect multiple chips and memory, which have particular bandwidth and latency
characteristics. The goal for scaling across multiple cores, then, is to minimize the change in
the potential penalties that are associated with cache misses and data requests as the
workload size grows.
It is difficult to assess what strategies are effective for scaling to more cores without
considering the complex aspects of a specific application. For example, if all of the cores that
the application is running across eventually access all of the data, then it might be wise to
interleave data across the processor sockets (which are typically a grouping of processor
chips) to optimize them from a memory bus utilization point of view. However, if the access
pattern to data is more localized so that, for most of the data, separate processor cores are
accessing it most of the time, the application might obtain better performance if the data is
close to the processor core that is accessing that data the most (maintaining affinity between
the application thread and the data it is accessing). For the latter case, where the data ought
to be close to the processor core that is accessing the data, the AIX MEMORY_AFFINITY=MCM
environment variable can be set to achieve this behavior. For Linux, the equivalent is the -l
option on a numactl command.
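For example (an illustration; myapp is a placeholder):

MEMORY_AFFINITY=MCM ./myapp        # AIX: prefer memory local to the running thread
numactl -l ./myapp                 # Linux: allocate memory on the local NUMA node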
When multiple processor cores access the same lock-protected data, each update to the lock
invalidates the cache line that holds it on the other cores, and programs can suffer. This
phenomenon is often referred to as hot locks, where a lock is holding data that has a high rate
of contention.
of contention. Hot locks result in cache-to-cache intervention and can easily limit the ability to
scale a workload because all updates to the lock are serialized.
Tools such as splat (see AIX trace-based analysis tools on page 226) can be used to
identify hot locks. Additionally, the transactional memory (TM) feature can speed up
lock-based programs. Learn more about TM in 2.2.4, Transactional memory on page 42.
Hot locks can be caused by the programmer having lock control access to too large an area of
data, which is known as coarse-grained locking.18 In that case, the strategy to deal effectively
with a hot lock is to split the lock into a set of fine-grained locks, such that multiple locks, each
managing a smaller portion of the data than the original lock, now manage the data for which
access is being serialized. Hot locks can also be caused by trying to scale an application to
more cores than the original design intended. In that case, using an even finer grain of locking
might be possible, or changes can be made in data structures or algorithms, such that lock
contention is reduced.
Additionally, the programmer must spend time considering the layout of locks in the cache to
ensure that multiple locks, especially hot locks, are not in the same cache line because any
updates to the lock itself results in the cache line being invalidated on other processor cores.
When possible, pad the locks so that they are in their own distinct cache line.
For more information about this topic, see 2.3, I/O adapter affinity on page 55.
The dcbt and dcbtst instructions can also be used to provide hints about the transient nature
of accesses to data elements. If TH=0b10000, the dcbt instruction provides a hint that the
program will probably soon load from the block that contains the byte addressed by EA, and
that the program's need for the block will be transient (that is, the time interval during
which the program accesses the block is likely to be short). If TH=0b10001, the dcbt
instruction provides a hint that the program will probably not access the block that contains
the byte addressed by EA for a relatively long period.
The contents of the DSCR, a special purpose register, affects how the data prefetcher
responds to hardware-detected and software-defined data streams.
The layout of the DSCR register is shown in Table 2-4.
Table 2-4 DSCR register layout (field names are defined following the table)
Bits     Field
0:38     Reserved
39       SWTE
40       HWTE
41       STE
42       LTE
43       SWUE
44       HWUE
45:54    UNITCNT
55:57    URG
58       LSD
59       SNSE
60       SSE
61:63    DPFD
Where:
39 Software Transient Enable (SWTE)
New field added in the POWER8 processor. Applies the transient attribute to
software-defined streams.
40 Hardware Transient Enable (HWTE)
New field added in the POWER8 processor. Applies the transient attribute to
hardware-detected streams.
41 Store Transient Enable (STE)
New field added in the POWER8 processor. Applies the transient attribute to store
streams.
42 Load Transient Enable (LTE)
New field added in the POWER8 processor. Applies the transient attribute to load streams.
43 Software Unit count Enable (SWUE)
New field added in the POWER8 processor. Applies the unit count to software-defined
streams.
44 Hardware Unit count Enable (HWUE)
New field added in the POWER8 processor. Applies the unit count to hardware-detected
streams.
45:54 Unit Count (UNITCNT)
New field added in the POWER8 processor. Number of units in data stream. Streams that
exceed this count are terminated.
55:57 Depth Attainment Urgency (URG)
New field added in the POWER7+ processor. This field indicates how quickly the prefetch
depth can be reached for hardware-detected streams:
0: Default
1: Not urgent
2: Least urgent
3: Less urgent
4: Medium
5: Urgent
6: More urgent
7: Most urgent
58 Load Stream Disable (LSD)
New field added in the POWER7+ processor. Disables hardware detection and initiation of
load streams.
59 Stride-N Stream Enable (SNSE)
Enables hardware detection and initiation of stride-N streams.
60 Store Stream Enable (SSE)
Enables hardware detection and initiation of store streams.
61:63 Default Prefetch Depth (DPFD)
Supplies the default prefetch depth for hardware-detected and software-defined streams.
The ability to enable or disable the three types of streams that the hardware can detect (load
streams, store streams, or stride-N streams), or to set the default prefetch depth, allows
empirical testing of any application. There are no simple rules for determining which settings
are optimum overall for an application: The performance of prefetching depends on many
different characteristics of the application in addition to the characteristics of the specific
system and its configuration. Data prefetches are purely speculative, meaning they can
improve performance greatly when the data that is prefetched is, in fact, referenced by the
application later, but can also degrade performance by expending bandwidth on cache lines
that are not later referenced, or by displacing cache lines that are later referenced by the
program.
Similarly, setting DPFD to a deeper depth tends to improve performance for data streams that
are predominately sourced from memory because the longer the latency to overcome, the
deeper the prefetching must be to maximize performance. But deeper prefetching also
increases the possibility of stream overshoot, that is, prefetching lines beyond the end of the
stream that are not later referenced. Prefetching in multi-core processor implementations has
implications for other threads or processes that are sharing cache (in SMT mode) or the same
system bandwidth.
For information about modifying the DSCR value by using the XL compiler family, see 7.3.4,
Data Streams Control Register controls on page 154.
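Because the problem-state DSCR is architected as SPR 3 in Power ISA 2.07, it can also be read and written directly with inline assembly, as in this hedged C sketch (the function names are illustrative; in practice, prefer the compiler and operating system controls that are referenced above):

static unsigned long read_dscr(void)
{
    unsigned long value;
    __asm__ volatile ("mfspr %0, 3" : "=r" (value));   /* read DSCR (SPR 3) */
    return value;
}

static void write_dscr(unsigned long value)
{
    __asm__ volatile ("mtspr 3, %0" : : "r" (value));  /* write DSCR (SPR 3) */
}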
Where:
RA specifies a source general-purpose register for EA computation.
RB specifies a source general-purpose register for EA computation.
CT indicates the level of cache the block is to be loaded into. The only supported value for
the POWER8 processor is 2.
Information about the efficient use of cache, from the OS perspective, is available in the
following sections:
4.2.3, Efficient use of cache on page 86 (AIX)
6.2.3, Efficient use of cache on page 123 (Linux)
7.3.4, Data Streams Control Register controls on page 154 (compilers)
A transaction may be put into suspended state by the application by using the tsuspend.
instruction. This allows a sequence of instructions within the transaction to have the same
effect as though the sequence were run in the absence of a transaction. For example, such
instructions are not run speculatively, and any storage updates are committed, regardless of
transaction success or failure. The tresume. instruction is used to resume the transaction and
to continue speculative execution of instructions.
Checkpoint state
When a transaction is initiated, and when it is restored following transaction failure, a set of
registers is saved or restored, representing the checkpoint state of the processor (for
example, the pre-transactional state). The checkpoint state includes all of the problem-state
writable registers, with the exception of CR0, FXCC, EBBHR, EBBRR, BESCR, the
performance monitor registers, and the TM special purpose registers (SPRs).
The checkpoint state is not directly accessible in either the supervisor or problem state.
Instead, the checkpoint state is copied into the respective registers when the treclaim.
instruction is run. This allows privileged code to save or modify values. The checkpoint state
is copied back into the speculative registers (from the respective user-accessible registers)
when the new trechkpt. instruction is run.
Transaction failure
A transaction might fail for various reasons, which can be either externally induced or
self-induced. External causes include conflicts with the storage accesses of another process
thread (for example, they both access the same storage area and one of the accesses is a
store). There are many self-induced causes for a transaction to fail, for example:
Explicitly aborted by using a set of conditional and unconditional abort instructions (for
example, various forms of the tabort. instruction)
Too many nested transactions
Too many storage accesses performed in the transactional state, causing a state overflow
Execution of certain instructions that are disallowed in transactional state (for example,
slbie, dcbi, and so on)
When a transaction fails, a software failure handler may be started. This is accomplished by
redirecting control to the instruction following the tbegin. instruction of the outermost
transaction and setting CR0 to 0b1010. Therefore, when writing a TM program, the tbegin.
instruction must always be followed with a conditional branch (for example, beq), predicated
on bit 2 of CR0. The target of the branch should be the software failure handler that is
responsible for handling the transaction failure. For comparison, when tbegin. runs
successfully at the start of the transaction, CR0 is set to either 0b0000 or 0b0100.
A transaction failure may be of a transient or a persistent type. Transient failures are typically
considered temporary failures, and persistent failures indicate that it is unlikely that the
transaction will succeed if restarted. The failure handler can retry the transaction or employ a
different locking construct or logic path, depending on the nature of the failure. When handling
transient type failures, applications might find it useful to keep a count of transient failures and
to treat the failure as a persistent type failure on reaching a threshold. If the failure is of
persistent type, the expectation is that the applications fall back to non-transactional logic.
When transaction failure occurs while in a suspended state, failure handling occurs after the
transaction is resumed by using the tresume. instruction.
The software failure handler may identify the cause of the transaction failure by examining bits
0:31 of the Transaction EXception And Summary Register (TEXASR), a special purpose
register that is associated with the TM architecture. In particular, bits 0:6 indicate the failure
code, and bit 7 indicates whether the failure is persistent and whether the transaction will
likely fail if attempted again. These bits are copied from the treclaim. instruction (privileged
code) or the tabort. instruction (problem state code) that are used by software to induce a
transaction failure.
The Power Architecture Platform reserves a range of failure codes for use by client operating
systems and a separate range for use by a hypervisor, leaving a range of codes free for use
by software applications:
0x00 0x3F is reserved for use by the OS.
0x40 0xDF is free for use by problem state (application) code.
0xE0 0xFF is reserved for use by a hypervisor.
Problem-state code is limited to failure codes in the range specified above when providing a
failure reason with the tabort. instruction.
Sample transaction
Example 2-1 is a sample of assembly language code, showing a simple transaction that
writes the value in GPR 5 into the address in GPR 4, which is assumed to be shared among
multiple threads of execution. If the transaction fails because of a persistent cause, the code
falls back to an alternative code path at the label lock_based_update (the code for the
alternative path is not shown) (based on sample code available from Power.org20).
Example 2-1 A transaction that writes to an address that is shared among multiple execution threads
trans_entry:
    tbegin.                     # Start transaction
    beq     failure_hdlr        # Handle transaction failure
    # Transaction body
    stw     r5, 0(r4)
    tend.                       # End transaction
    b       trans_exit

# Failure handler
failure_hdlr:
    mfspr   r4, TEXASRU         # Read the upper half of TEXASR
    andis.  r5, r4, 0x0100      # Test the failure persistent bit
    bne     lock_based_update   # Persistent failure: use the lock-based path
    b       trans_entry         # Transient failure: retry the transaction

lock_based_update:
    ...                         # Lock-based update path (not shown)

trans_exit:
Synchronization mechanisms
In multi-thread programs, synchronization mechanisms are used to ensure that threads have
exclusive access to critical sections (CS). Usually, compare-and-swap (CAS, on x86_64) or
load-link/store-conditional (LL/SC, on PowerPC) instructions are used to build locks, a
common synchronization mechanism. The semantics of locks are as follows: a running
program acquires the lock, runs its CS in a serialized way (only one thread of execution at a
time), and releases the lock.
The serialization of threads in critical sections is a bottleneck to achieving high performance
in multi-thread programs. There are some techniques for mitigating or removing such
performance issues, for example, non-blocking algorithms, lock-free data structures, and
fine-grained locking.
Lock Elision (LE) is another optimization technique, one that uses Hardware Transactional
Memory (HTM) primitives to avoid acquiring locks. It relies on the behavior of algorithms
whose critical sections do not always conflict, for example, a hash table insertion where
updates can be done in parallel and locks are needed only when the same bucket is
accessed at the same time.

LE uses HTM to first try a transaction on the shared data. If the transaction succeeds, no
lock is required. If the transaction cannot succeed, such as during concurrent access to the
same data, LE falls back to the default locking mechanism, as shown in the sketch that
follows.
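The following C sketch outlines the lock elision pattern by using GCC's PowerPC HTM built-ins (compile with -mhtm on a POWER8 target). It is illustrative only: the names elided_lock and elided_unlock and the retry limit are assumptions, and a production implementation would also read the fallback lock inside the transaction so that a concurrently held lock aborts the transaction.

#include <pthread.h>

#define ELISION_RETRIES 3

static pthread_mutex_t fallback_lock = PTHREAD_MUTEX_INITIALIZER;

/* Returns 1 if running transactionally, 0 if the fallback lock was taken. */
static int elided_lock(void)
{
    for (int i = 0; i < ELISION_RETRIES; i++) {
        if (__builtin_tbegin(0))
            return 1;                       /* transactional path: no lock */
    }
    pthread_mutex_lock(&fallback_lock);     /* persistent failure: real lock */
    return 0;
}

static void elided_unlock(int transactional)
{
    if (transactional)
        __builtin_tend(0);                  /* commit the transaction */
    else
        pthread_mutex_unlock(&fallback_lock);
}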
For more information about the topic of transactional memory, from the OS and compiler
perspectives, see:
A 64-entry Unified Register File is shared across VSX, the Binary floating point unit (BFP),
VMX, and the DFP unit. The thirty-two 64-bit Floating Point Registers (FPRs), which are used
by the BFP and DFP units, are mapped to registers 0 - 31 of the Vector Scalar Registers. The
32 vector registers (VRs) that are used by the VMX are mapped to registers 32 - 63 of the
VSRs, as shown in Table 2-5.
Table 2-5 The Unified Register File
FPR0     VSR0
FPR1     VSR1
...      ...
FPR30    VSR30
FPR31    VSR31
VR0      VSR32
VR1      VSR33
...      ...
VR30     VSR62
VR31     VSR63
VSX supports Double Precision Scalar and Vector Operations and Single Precision Vector
Operations. VSX instructions are broadly divided into two categories that can operate on 64
vector scalar registers:21, 22, 23, 24
Computational instructions: Addition, subtraction, multiplication, division, extracting the
square root, rounding, conversion, comparison, and combinations of these operations
Non-computational instructions: Loads/stores, moves, select values, and so on
In terms of compiler support for vectors, XLC supports vector processing technologies
through language extensions on both AIX and Linux. GCC supports using the VSX engine on
Linux. XL and GCC C implement and extend the AltiVec Programming Interface specification.
For more information about the topic of VSX, from the OS and compiler perspectives, see:
AES
AES was established for the encryption of electronic data by the US National Institute of
Standards and Technology (NIST) in 2001 (FIPS PUB 197). AES is a symmetric-key
algorithm that processes data blocks of 128 bits (a block cipher algorithm), and, therefore,
naturally fits into the 128-bit VSX data flow. The AES algorithm is covered in five new
instructions, available in Power ISA Version 2.07.26
25 How to Leverage Decimal Floating-Point unit on POWER6 for Linux, found at:
https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/Welcome%20to%20High%20Performance%20Computing%20%28HPC%29%20Central/page/How%20to%20Leverage%20Decimal%20Floating-Point%20unit%20on%20POWER6%20for%20Linux
26 Power ISA Version 2.07, found at: https://www.power.org/documentation/power-isa-v-2-07b/
SHA-2
SHA-2 was designed by the US National Security Agency (NSA) and published in 2001 by the
NIST (FIPS PUB 180-2). It is a set of four hash functions (SHA-224, SHA-256, SHA-384, and
SHA-512) with message digests that are 224, 256, 384, and 512 bits. The SHA-2 functions
compute the digest based on 32-bit words (SHA-224 and SHA-256) or 64-bit words (SHA-384
and SHA-512). Different combinations of rotate and xor vector instructions have been
identified to be merged into a new instruction to accelerate the SHA-2 family. The new
instruction comes in two flavors:
In word (32-bit), targeting SHA-224 and SHA-256
In doubleword (64 bit), accelerating SHA-384 and SHA-512 (Power ISA v2.07)
CRC
CRC can be seen as an error-detecting code. It is used in storage devices and digital
networks to protect data from accidental (or hacker-intended) changes to raw data. Data to be
stored or information that is sent over the network (in a stream) gets a short checksum
attached (based on the remainder of the polynomial division and modulo operations). CRC is
a reversible function, which makes it unsuitable for use in digital signatures, but it is in use for
error detection when data is transferred, for example, in an Ethernet network protocol.
CRC algorithms are defined by the different generator polynomial used. For example, an n-bit
CRC is defined by an n-bit polynomial. Examples for applications using CRC-32 are Ethernet
(Open Systems Interconnection (OSI) physical layer), Serial Advanced Technology
Attachment (Serial ATA), Moving Picture Experts Group (MPEG-2), GNU Project file
compression software (Gzip), and Portable Network Graphics (PNG, fixed 32-bit polynomial).
In contrast, Internet Small Computer System Interface (iSCSI) and the Stream Control
Transmission Protocol (SCTP transport layer protocol) are based on a different, 32-bit
polynomial.27 The POWER8 enhancements focus on a specific application that supports only
one single generator polynomial, and they help to accelerate any kind of CRC size, ranging
from 8-bit CRC, 16-bit CRC, and 32-bit CRC, to 64-bit CRC.
For more information about the topic of in-core cryptography, from the OS and compiler
perspectives, see:
4.2.7, On-chip encryption accelerator on page 94 (AIX)
7.3.1, In-core cryptography on page 148 (XL and GCC compiler families)
On-chip random number generator: AIX capitalizes on the on-chip random number
generator, providing the advantages of stronger hardware-based random numbers. In
some instances, there can also be a performance advantage.
For more information about this topic, from the AIX perspective, see:
4.2.7, On-chip encryption accelerator on page 94 (AIX)
AIX /dev/random (random number generation) on page 94 (AIX)
Associated instructions
The following instructions provide various storage synchronization mechanisms:
sync
This instruction creates a memory barrier that orders all storage accesses (a
full barrier, also known as heavyweight sync).
lwsync
This instruction creates a lighter-weight memory barrier that orders all pairs
of storage accesses except a store followed by a load (lightweight sync).
28 PowerPC storage model and AIX programming: What AIX programmers need to know about how their software
accesses shared storage, found at: http://www.ibm.com/developerworks/systems/articles/powerpc.html
29 Ibid
30 sync (Synchronize) or dcs (Data Cache Synchronize) instruction, found at:
http://www-01.ibm.com/support/knowledgecenter/api/redirect/aix/v7r1/index.jsp?topic=%2Fcom.ibm.aix.aixassem%2Fdoc%2Falangref%2Fidalangref_sync_dcs_instrs.htm
31 PowerPC storage model and AIX programming: What AIX programmers need to know about how their software
accesses shared storage, found at: http://www.ibm.com/developerworks/systems/articles/powerpc.html
lwarx
This instruction loads a word from the target location and establishes a
reservation on that location for use by a subsequent stwcx. instruction.
stwcx.
This instruction performs a store to the target location only if the location
specified by a previous lwarx instruction is not used for storage by another
processor (hardware thread) or mechanism, which invalidates the
reservation.33
eieio
This instruction creates a memory barrier that provides an order for storage
accesses caused by load, store, dcbz, eciwx, and ecowx instructions.34
makeitso
New in the POWER8 processor, this instruction allows data to be pushed out to
the coherence point as quickly as possible. Running the makeitso instruction
provides a hint that preceding stores should be made visible with higher
priority.
lbarx/stbcx.
These instructions were added in the POWER8 processor and are similar to
lwarx/stwcx., except that they load and store a byte.
lharx/sthcx.
These instructions were added in the POWER8 processor and are similar to
lwarx/stwcx., except that they load and store a 16-bit halfword.
ldarx/stdcx.
These instructions are similar to lwarx/stwcx., except that they load and
store a 64-bit doubleword (requires 64-bit mode).
lqarx/stqcx.
These instructions were added in the POWER8 processor and are similar to
lwarx/stwcx., except that they load and store a 128-bit quad word (requires
64-bit mode).
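Application code rarely needs to issue these instructions directly. As a hedged illustration, the C11 atomic operations compile on Power to the matching larx/stcx. pair plus the required barriers:

#include <stdatomic.h>

static atomic_int shared_counter;

/* On Power, this compiles to a lwarx/stwcx. retry loop (the byte, halfword,
 * doubleword, or quadword forms would be used for other operand sizes). */
int increment_counter(void)
{
    return atomic_fetch_add(&shared_counter, 1) + 1;
}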
Where to use
Care must be taken when you use synchronization mechanisms in any processor architecture
because the associated load and store instructions have a heavier weight than normal loads
and stores, and the barrier operations have a cost that is associated with them. Thus, it is
imperative that the programmer carefully consider when and where to use such operations,
so that data consistency is ensured without adversely affecting the performance of the
software and the overall system.
PowerPC storage model and AIX programming35 describes where synchronization
mechanisms must be used to ensure that the code adheres to the Power Architecture.
Although this documentation covers how to write compliant code, it does not cover the
performance aspect of using the mechanisms.
Unless the code is hand-tuned assembly language code, take advantage of the locking
services that are provided by the operating system because they are tuned and provide the
necessary synchronization mechanisms. Power Instruction Set Architecture Version 2.07
provides assembly language programming examples for sharing storage. For more
information, see Appendix B, Performance tools and empirical performance analysis on
page 215.
For more information about this topic, from the perspective of compiler built-ins, see 7.3.3,
Built-in functions for storage synchronization on page 154.
The POWER8 processor can also fuse eligible adjacent instruction pairs. An addis instruction
followed by one of the eligible load instructions (LD, LBZ, LHZ, and LWZ), where the RT of
the addis is the same as the RA of the load, is internally fused by the POWER8 processor
into a single instruction.
The controls for all modes that are listed above are available on the Advanced System
Management Interface and are described in more detail in a white paper that is found at:
http://www.ibm.com/common/ssi/cgi-bin/ssialias?infotype=SA&subtype=WH&htmlfid=POW03125USEN
Additionally, the appendix of this white paper includes links to other papers that detail the
performance benefits and impacts of using these controls.
Figure 2-3 shows a high-level view of how an accelerator communicates with the POWER8
processor through CAPI. The POWER8 processor provides a Coherent Attached Processor
Proxy (CAPP), which is responsible for extending the coherence in the processor
communications to an external device. The coherency protocol is tunneled over standard
PCIe Gen3 connections, effectively making the accelerator part of the coherency domain.
The accelerator adapter implements the Power Service Layer (PSL), which provides address
translation and system memory cache for the accelerator functions. The custom processors
on the board, which might consist of an FPGA or an Application Specific Integrated Circuit
(ASIC), use this layer to access shared memory regions and cache areas as though they were
a processor in the system. This ability greatly enhances the performance of the data access
for the device and simplifies the programming effort to use the device. Instead of being
treated as an I/O device, the hardware accelerator is treated as a processor, which
eliminates the need for a device driver to perform communication and for Direct Memory
Access operations that require system calls to the operating system kernel. By removing
these layers, the data transfer operation requires fewer clock cycles in the processor,
greatly improving the I/O performance.
The implementation of CAPI on the POWER8 processor allows hardware companies to
develop solutions for specific application demands and use the performance of the POWER8
processor for general applications. The developers can also provide custom acceleration of
specific functions by using a hardware accelerator, with a simplified programming model and
efficient communication with the processor and memory resources.
PowerPC storage model and AIX programming: What AIX programmers need to know
about how their software accesses shared storage, found at:
http://www.ibm.com/developerworks/systems/articles/powerpc.html
Refer to the following sections:
Power Instruction Set Architecture, Section 4.4.3, Memory Barrier Instructions, and the
Synchronize instruction description
Product documentation for XL C/C++ for AIX, V12.1 (PDF format), found at:
http://www.ibm.com/support/docview.wss?uid=swg27024811
Simple performance lock analysis tool (splat), found at:
http://www-01.ibm.com/support/knowledgecenter/api/redirect/aix/v7r1/topic/com.ibm.aix.prftools/doc/prftools/idprftools_splat.htm
What makes Apple's PowerPC memcpy so fast?, found at:
http://stackoverflow.com/questions/1990343/what-makes-apples-powerpc-memcpy-so-fast
What programmers need to know about hardware prefetching?, found at:
http://www.futurechips.org/chip-design-for-all/prefetching.html
Chapter 3. The IBM POWER Hypervisor
Shared LPAR: Weight selection to assign a level of priority for obtaining uncapped
capacity (excess cycles to address peak usage).
Shared LPAR: Multiple shared pools to address software licensing costs, which
prevents a set of partitions from exceeding its capacity consumption.
Active Memory Sharing: The size of a shared pool is based on active workload
memory consumption:
- Inactive workload memory is used for active workloads, which reduces the memory
capacity of the pool.
- The Active Memory Deduplication option can reduce memory capacity further.
- AIX file system cache memory is loaned to address memory demands, which leads to
memory savings.
Active Memory Sharing: The shared pool size determines the level of memory
over-commitment. Start without over-commitment, and reduce the pool based on workload
consumption.
Active Memory Expansion: AIX working set memory is compressed.
Active Memory Sharing and Active Memory Expansion can be deployed on the
same workload.
Active Memory Sharing: VIOS sizing is critical for CPU and memory.
Virtual Ethernet: Inter-partition communication over VLANs for higher network
performance.
Shared Ethernet versus host Ethernet.
Virtual disk I/O: Virtual small computer system interface (vSCSI), N_Port ID
Virtualization (NPIV), file-backed storage, and storage pool.
Dynamic resource movement (DLPAR) to adapt to growth.
If a partition has multiple virtual processors, they might be scheduled to run simultaneously
on the physical processor cores.
Partition entitlement is the guaranteed resource that is available to a partition. A partition that
is defined as capped can consume only the processor units that are explicitly assigned as its
entitled capacity. An uncapped partition can consume more than its entitlement, but is limited
by many factors:
Uncapped partitions can exceed their entitlement only if there is unused capacity in the
shared pool, in dedicated partitions that share their physical processor cores while active or
inactive, in unassigned physical processors, or in Capacity on Demand (CoD) utility
processors.
If the partition is assigned to a virtual shared processor pool, the capacity for all of the
partitions in the virtual shared processor pool might be limited.
The number of virtual processors in an uncapped partition limits how much CPU it can
consume. For example:
An uncapped partition with one virtual CPU can consume at most one physical processor
core of CPU resources under any circumstances.
An uncapped partition with four virtual CPUs can consume at most four physical processor
cores of CPU resources.
Virtual processors can be added or removed from a partition by using HMC actions.
Entitlement also determines the number of SPLPARs that can be configured for a shared
processor pool. The sum of the entitlement of all the SPLPARs cannot exceed the number of
physical cores that are configured in a shared pool.
For example, a shared pool has eight cores and 16 SPLPARs are created, each with 0.1 core
entitlement and one virtual CPU. In our example, we configured the partitions with 0.1 core
entitlement because these partitions are not running that frequently. In this example, the sum
of the entitlement of all the 16 SPLPARs comes to 1.6 cores. The rest of the 6.4 cores and
any unused cycles from the 1.6 entitlement can be dispatched as uncapped cycles.
At the same time, keeping entitlement low when there is capacity in the shared pool is not
always a preferred practice. Unless the partitions are frequently idle, or there is a plan to add
more partitions, the preferred practice is that the sum of the entitlement of all the SPLPARs
configured is close to the capacity in the shared pool. Entitlement cycles are guaranteed, so
when a partition is using its entitlement cycles, the partition is not preempted; however, a
partition can be preempted when it is dispatched to use excess cycles. Following this
preferred practice allows the hypervisor to optimize the affinity of the partition's memory and
processor cores and also reduces unnecessary preemptions of the virtual processors.
Entitlement also affects the choice of memory and processors that are assigned by the
hypervisor for the partition. The hypervisor uses the entitlement value as a guide to the
amount of CPU that a partition consumes. If the entitlement is undersized, performance can
be adversely affected, for example, if there are four cores per processor chip and two
partitions are consistently consuming about 3.5 processors of CPU capacity. If the partitions
are undersized with four virtual processors and 2.0 entitlement (that is, entitlement is set
below normal usage levels), the hypervisor may allocate both of the partitions on the same
processor chip, as the entitlement of 2.0 allows two partitions to fit into a 4-core processor
chip. If both partitions consistently consume 3.5 processors worth of capacity, the hypervisor
is forced to dispatch some of the virtual processors on chips that do not contain memory that
is associated with the partitions. If the partitions were configured with an entitled capacity of
3.5 instead of 2.0, the hypervisor places each partition on its own processor chip to ensure
that there is sufficient processor capacity for each partition. This improves the locality,
resulting in better performance.
If the peak usage is below the 50% mark, then there is no need for more virtual processors. In
this case, look at the ratio of virtual processors to configured entitlement and if the ratio is
greater than 1, then consider reducing the ratio. If there are too many virtual processors that
are configured, AIX can fold those virtual processors so that the workload can run on fewer
virtual processors to optimize virtual processor performance.
For example, if an SPLPAR is given a CPU entitlement of 2.0 cores and four virtual
processors in an uncapped mode, then the hypervisor can dispatch the virtual processors to
four physical cores concurrently if there are free cores available in the system. The SPLPAR
uses unused cores and the applications can scale up to four cores. However, if the system
does not have free cores, then the hypervisor dispatches four virtual processors on two cores
so that the concurrency is limited to two cores. In this situation, each virtual processor is
dispatched for a reduced time slice as two cores are shared across four virtual processors.
This situation can impact performance, so AIX operating system processor folding support
might reduce the number of virtual processors that are dispatched so that only two
or three virtual processors are dispatched across the two physical cores.
REF1   SRAD       CPU
0      0          0-31
       1          32-63
REF1 represents a domain, and domains vary by platform. SRAD always references a chip.
However, lssrad does not report the actual physical domain or chip location of the partition: it
is a relative value whose purpose is to inform you whether the resources of the partition are
within the same domain or chip. The output of this lssrad example indicates that the LPAR is
allocated with 16 cores from two chips within the same domain. The lssrad command output
was taken from an SMT4 platform, so CPU 0-31 represents eight cores.
When all the resources are free (an initial machine state or restart of the CEC), PowerVM
allocates memory and cores as optimally as possible. At partition boot time, PowerVM is
aware of all of the LPAR configurations, so placement of processors and memory is made
regardless of the order of activation of the LPARs.
However, after the initial configuration, the setup might not stay static. Numerous operations
take place, such as:
Reconfiguration of existing LPARs with new profiles
Reactivating existing LPARs and replacing them with new LPARs
Adding and removing resources to LPARs dynamically (DLPAR operations)
Any of these changes might result in memory fragmentation, causing LPARs to be spread
across multiple domains. There are ways to minimize or even eliminate the spread. For the
first two operations, the spread can be minimized by releasing the resources that are
assigned to the deactivated LPARs.
Resources of an LPAR can be released by running the following commands:
chhwres -r mem -m <system_name> -o r -q <num_of_Mbytes> --id <lp_id>
chhwres -r proc -m <system_name> -o r --procunits <number> --id <lp_id>
The first command frees the memory, and the second command frees cores.
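For example (with a hypothetical managed system name, partition ID, and sizes), the
following commands release 4 GB of memory and 0.5 processor units from partition 7:

chhwres -r mem -m Server-8286-42A -o r -q 4096 --id 7
chhwres -r proc -m Server-8286-42A -o r --procunits 0.5 --id 7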
Fragmentation because of frequent movement of memory or processor cores between
partitions is avoidable with correct planning. DLPAR actions can be done in a controlled way
so that the performance impact of resource addition or deletion is minimal. Planning for
growth helps alleviate the fragmentation that is caused by DLPAR operations. Knowing the
LPARs that must grow or shrink dynamically, and placing them with LPARs that can tolerate
nodal crossing latency (less critical LPARs), is one approach to handling the changes of
critical LPARs dynamically. In such a configuration, when growth is needed for the critical
LPAR, the resources that are assigned to the non-critical LPAR can be reduced so that the
critical LPAR can grow. Another method of managing fragmentation is to monitor the affinity
score of the system or important partitions and use the Dynamic Platform Optimizer to
reoptimize the memory and processor that is assigned to the partitions.
Affinity groups
PowerVM firmware has support for affinity groups that can be used to group multiple LPARs
within the same processor chip, processor socket, or drawer. When using affinity groups, it is
important to understand the physical configuration of the processor cores and memory that is
contained within the processor chips, processor sockets, and drawers, such that the size of
the affinity group does not exceed the capacity of the wanted domain. For example, if the
system has 4 cores and 64 GB of memory per processor chip, and you want to contain the
partitions to a single processor chip, ensure that the size of the affinity group does not exceed
four cores and 64 GB of memory. When calculating the memory size of an affinity group and
what is available on a chip, the computed value must account for the memory that is used by
the hypervisor for I/O space and for objects that are associated with the partition, such as the
hardware page table.
Note: As a general rule, the memory that is wanted for an affinity group should be only
90 - 95% of the physical memory that is contained in a domain. If the affinity group is larger
than the wanted domain, the hypervisor cannot contain the affinity group within a single
domain.
This affinity group feature can be used in multiple situations:
LPARs that are dependent or related, such as server and client, and application server
and database server, can be grouped so they are in the same book.
Affinity groups can be created that are large enough such that they force the assignment
of LPARs to be in different books. For example, if you have a two-socket system and the
total resources (memory and processor cores) assigned to the two groups exceeds the
capacity of a single socket, these two groups are forced to be in separate sockets.
If a pair of LPARs is created with the intent of one being a failover to another partition, and
one partition fails, the other partition (which is placed in the same node, if both are in the
same affinity group) uses all of the resources that were freed up from the failed LPAR.
The following HMC CLI command adds or removes a partition from an affinity group:
chsyscfg -r prof -m <system_name> -i
"name=<profile_name>,lpar_name=<partition_name>,affinity_group_id=<group_id>"
group_id is a number 1 - 255 (255 groups can be defined), and affinity_group_id=none
removes a partition from the group.
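For example (with hypothetical system, profile, and partition names), the following command
places lpar5 in the highest-priority affinity group:

chsyscfg -r prof -m Server-8286-42A -i "name=default_profile,lpar_name=lpar5,affinity_group_id=255"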
When the hypervisor places resources at frame restart, it first places all the LPARs in group
255, then the LPARs in group 254, and so on. Place the most important partitions regarding
affinity in the highest configured group.
IBM PowerKVM is the port of KVM for hardware virtualization of Power Systems and provides
full virtualization on POWER8 scale-out systems. The architecture-specific model for Power
Systems is called kvm-hv.ko. PowerKVM also includes the virtualization packages such as
libvirt, which provide the tools, runtime libraries, and a daemon for managing platform
virtualization. There is both a CLI interface (the virsh command, which is part of the
libvirt-client package) and a web-based interface, Kimchi, for managing virtualization,
including starting and stopping virtual machines.
In KVM terminology, a virtual machine is more commonly referred to as the guest. The
hypervisor is often referred to as running on the host machine. The hypervisor consists of the
operating system (including the virtualization modules) and firmware that directly runs on the
hardware and supports running guests.
Note: On the PowerVM hypervisor, a virtual machine or guest is called a logical partition
(LPAR).
PowerKVM V2.1, released June 2014, is the first PowerKVM release. Only Linux distributions
(such as RHEL, Ubuntu, SLES, or Fedora) are supported as guest OSes by PowerKVM.
Unlike PowerVM, there is no need for a Hardware Management Console (HMC) to manage
PowerKVM. Instead, the industry-standard Intelligent Platform Management Interface (IPMI)
is used to manage the host. On IBM Power Systems, the IPMI server runs on the
service controller and not on the host. Therefore, commands directed to the IPMI server must
use the service processor IP address in the command line.
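For example (with a hypothetical service processor IP address and password), the power
state of the host can be queried through IPMI, and guests can be listed with virsh:

ipmitool -I lanplus -H 192.0.2.10 -P mypassword chassis power status
virsh list --all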
For more information about the KVM technology on IBM systems and the various
virtualization support tools (such as qemu, libvirt, Kimchi, IPMI, and so on), see IBM
PowerKVM Configuration and Use, SG24-8231.
Note: IBM PowerKVM Configuration and Use, SG24-8231 can help in configuring
PowerKVM and the guest OSes optimally.
Chapter 4.
IBM AIX
This chapter describes the optimization and tuning of POWER8 and other Power Systems
processor-based servers running the AIX operating system.
4.1 Introduction
AIX is regarded as a good choice for building an IT infrastructure on IBM systems that are
designed with Power Architecture technology. With its proven scalability, advanced
virtualization, security, manageability, and reliability features, it is an enterprise-class OS. In
particular, AIX is the only operating system that uses decades of IBM technology innovation
that is designed to provide the highest level of performance and reliability of any UNIX
operating system. AIX has demonstrated leadership performance on various system
benchmarks.
The performance benefits of AIX include:
Deep integration with the Power Architecture (core design with the Power Architecture).
Autonomic optimization:
- A single OS image configures itself to support any POWER processor.
- Dynamic workload optimization.
Performance on a wide variety of system configurations:
- Scales from 0.05 to 256 cores (up to 1024 logical processors).
- Horizontal (native clustering) and vertical scaling.
Strong virtualization support for PowerVM virtualization:
- Tight integration with PowerVM.
- Enabler for virtual I/O (VIO).
Full set of integrated performance tools.
AIX V6.1 and AIX V7.1 run on and maximize the capabilities of POWER8 processor-based
systems, the latest generation of POWER processor-based systems, while supporting
POWER4, POWER5, POWER6, and POWER7 (including POWER7+) processor-based
systems.
For more information about this topic, see 4.5, Related publications on page 107.
Configuration                                    AIX release
32-core/32-thread                                5.3/6.1/7.1
64-core/128-thread (SMT2)                        5.3/6.1/7.1
64-core/256-thread (SMT4)                        6.1 (TL4)/7.1
256-core/1024-thread (SMT4) or
128-core/1024-thread (SMT8)                      7.1
Simultaneous multithreading
Simultaneous Multithreading (SMT) is a feature of the Power Architecture and is described in
Simultaneous multithreading on page 29. SMT is supported in AIX, as described in
Simultaneous multithreading.
AIX provides options to allow SMT customization. The smtctl command allows the SMT
feature to be enabled, disabled, or capped (SMT2 versus SMT4 mode on POWER7
processor-based systems, and SMT2 or SMT4 modes versus SMT8 on POWER8
processor-based systems). The partition-wide tuning option, smtctl, changes the SMT mode
of all processor cores in the partition. It is built on the AIX dynamic reconfiguration (AIX DR)
framework to allow hardware threads (logical processors) to be added and removed in a
running partition. Because of this option's global nature, it is normally set by system
administrators. Most (commercial) AIX systems use the default SMT setting, which is
enabled (that is, SMT2 mode on POWER5 and POWER6 processor-based systems, and
SMT4 mode on POWER7 and POWER8 processor-based systems).
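For example, the following command switches the partition to SMT4 mode immediately,
rather than at the next boot (an illustrative use of smtctl):

smtctl -t 4 -w now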
When SMT is enabled (SMT2, SMT4, or SMT8 mode), the AIX kernel takes advantage of the
platform feature to change SMT modes dynamically. These mode switches are done based
on partition load (the number of running or waiting to run software threads) to choose the
optimal SMT mode for the CPUs in the partition. The mode switching policies optimize overall
workload throughput, but do not attempt to optimize individual software threads.
For more information about the topic of SMT, from the processor and OS perspectives, see:
Simultaneous multithreading on page 29 (processor)
Simultaneous multithreading on page 112 (IBM i)
Simultaneous multithreading on page 119 (Linux)
Where to use
SMT thread priority can be used to improve the performance of a workload by lowering the
SMT thread priority on a hardware thread whose software thread is doing lower-value work,
such as in the following situations:
The thread is waiting on a lock.
The thread is waiting on an event, such as the completion of an I/O event.
Alternatively, process-threads that are performance-sensitive can maximize their
performance by ensuring that their SMT thread priority level is set to an elevated level.
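As an illustrative sketch (the helper names are hypothetical), the Power ISA priority hint
no-ops of the form or Rx,Rx,Rx can be issued from C to lower the hardware thread priority
while spinning on a lock and to restore it afterward:

/* Hypothetical helpers built on the Power ISA priority hint no-ops. */
static inline void smt_priority_low(void)    { __asm__ __volatile__("or 1,1,1"); }
static inline void smt_priority_medium(void) { __asm__ __volatile__("or 2,2,2"); }

void spin_lock(volatile int *lock)
{
    smt_priority_low();                       /* yield issue slots to sibling threads */
    while (__sync_lock_test_and_set(lock, 1))
        ;                                     /* spin until the lock is free          */
    smt_priority_medium();                    /* restore normal priority              */
}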
Affinity APIs
Most applications must be bound to logical processors to get a performance benefit from
memory/cache affinity; binding prevents the AIX dispatcher from moving the application to
processor cores in different SRADs while the application runs.
AIX provides bind-ids, Resource Sets (RSETs), and Scheduler Resource Allocation Domains
(SRADs) for affinity tuning in the following ways:
Bindprocessor: Provides affinity to a single hardware thread that is identified by bind-id. It
does not provide topology.
RSET: Provides affinity to a group of hardware threads and supports memory binding. It
provides topology.
SRAD: Provides affinity to a scheduler resource domain and supports memory binding. It
provides topology.
The most likely way to obtain a benefit from memory affinity is to limit the application to
running only on the processor cores that are contained in a single SRAD. You can accomplish
this task with the help of RSET (commands/API) and SRAD (APIs). If the application needs
just a single processor, then the bindprocessor command or the bindprocessor() function
can be used; this can also be done with the resource set affinity commands (rset) and
services. Often, affinity is provided as an administrator option that can be optionally
enabled on large systems.
When the application requires more processor cores than are contained in a single SRAD, the
performance benefit through memory affinity depends on the memory allocation and access
patterns of the various threads in the application. Applications with threads that individually
allocate and reference unique data areas can see improved performance.
Bind processor
Processor affinity is the probability of dispatching a thread to the logical processor that was
previously running it. If a thread is interrupted and later redispatched to the same logical
processor, the processor's cache might still contain lines that belong to the thread. If the
thread is dispatched to a different logical processor, it probably experiences a series of cache
misses until its cache working set is retrieved from RAM or the other logical processor's
cache. If a dispatchable thread must wait until the logical processor that it was previously
running on is available, the thread might experience an even longer delay.
The highest possible degree of processor affinity is to bind a thread to a specific logical
processor. Binding means that the thread is dispatched to that logical processor only,
regardless of the availability of other logical processors.
The bindprocessor command and the bindprocessor() subroutine bind the thread (or
threads) of a specified process to a particular logical processor. Explicit binding is inherited
through fork() and exec() system calls. The bindprocessor command requires the process
identifier of the process whose threads are to be bound or unbound, and the bind CPU
identifier of the logical processor to be used. This bind-id is different from the logical
processor number and does not have any topology that is associated with it. Bind-ids cannot
be associated to a specific chip/core because they tend to change on every DR operation
(such as an SMT mode change). For NUMA affinity, RSETs or SRADs should be used to
restrict the application to a set of logical processors in the same core/SRAD and its local
memory.
CPU binding is useful for CPU-intensive applications; however, it can sometimes be
counter-productive for I/O-intensive applications.
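As a brief sketch (the bind-id value is hypothetical), a process can bind all of its threads to a
single logical processor as follows:

#include <stdio.h>
#include <unistd.h>
#include <sys/processor.h>

int main(void)
{
    /* Bind every thread of this process to the logical processor with bind-id 3. */
    if (bindprocessor(BINDPROCESS, getpid(), 3) != 0)
        perror("bindprocessor");
    return 0;
}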
RSETS
Every process and kernel thread can have an RSET attached to it. The CPUs on which a
thread can be dispatched are controlled by a hierarchy of resource sets. RSETs are
mandatory bindings and are accepted by the AIX kernel always. Also, RSETs can affect
dynamic reconfiguration (DR) activities.
Resource sets
The hierarchy includes several types of resource sets, such as the thread effective RSET,
the process effective RSET, and the partition RSET.
Another type of RSET is the exclusive RSET. Exclusive use processor resource sets
(XRSETs) allow an installation to limit the usage of the processors in XRSETs; they are used
only by work that is attached to those XRSETs. They can be created by running the mkrset
command in the sysxrset namespace.
RSET data types and operations
The public shipped header file rset.h contains declarations for the public RSET data types
and function prototypes.
An RSET is an opaque data type. Applications allocate an RSET by calling rs_alloc().
Applications receive a handle to the RSET. The RSET handle (data type rsethandle_t in
sys/rset.h) is then used in RSET APIs to manipulate or attach the RSET.
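A minimal sketch of this flow, assuming an RSET named test/lotsofcpus was already created
with mkrset (error handling abbreviated), might look like the following example:

#include <stdio.h>
#include <unistd.h>
#include <sys/rset.h>

int main(void)
{
    rsethandle_t rset = rs_alloc(RS_EMPTY);   /* allocate an empty RSET handle */
    rsid_t id;

    /* Fetch the named RSET from the system registry. */
    if (rs_getnamedrset("test", "lotsofcpus", rset) != 0) {
        perror("rs_getnamedrset");
        return 1;
    }
    /* Attach it to the current process so its threads run only on those CPUs. */
    id.at_pid = getpid();
    if (ra_attachrset(R_PROCESS, id, rset, 0) != 0) {
        perror("ra_attachrset");
        return 1;
    }
    rs_free(rset);
    return 0;
}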
The following commands manage RSETs:
lsrset: Displays the RSETs that are attached to a process. For example:
lsrset -p 28026
mkrset: Makes a named RSET containing specific CPU and memory pools and places the
RSET in the system registry. For example, mkrset -c 6 10-12 test/lotsofcpus creates
an RSET named test/lotsofcpus that contains the specified CPUs.
rmrset: Removes an RSET from the system registry. For example:
rmrset test/lotsofcpus
attachrset: Attaches an RSET to a specified PID. The RSET can either be in the system
registry, or CPUs or mempools that are specified in the command. For example:
attachrset test/lotsofcpus 28026
detachrset: Detaches an RSET from a specified process. For example:
detachrset -P 20828
execrset: Runs a specific program or command with a specified RSET. For example:
execrset sys/node.04.00000 -e test
The following services allocate, query, and manipulate RSETs:
rs_free()
rs_init()
rs_op()
rs_getinfo()
rs_getrad()
rs_numrads()
rs_getpartition()
rs_setpartition()
The following services manage the RSET system registry. There are services to create,
obtain, and delete RSETs in the registry:
rs_discardname()
rs_getnameattr()
rs_getnamedrset()
rs_setnameattr()
rs_registername()
Attachment services
Here are the RSET attachment services:
ra_attachrset()
ra_detachrset()
ra_exec()
ra_fork()
ra_get_attachinfo()
ra_free_attachinfo(): This service frees the memory that was allocated for the
attachment information that was returned by ra_get_attachinfo().
ra_getrset()
API support
SRADIDs can be attached to threads and memory by using the following functions:
ra_attach() (new)
ra_fork()
ra_exec()
ra_mmap() and ra_mmapv()
ra_shmget() and ra_shmgetv()
SRADIDs can be detached from threads and memory by using the sra_detach() function
(new).
For more information about the topic of affinitization and binding, from the processor and OS
perspectives, see:
Affinitization and binding to hardware threads on page 31 (processor)
Affinitization and binding on page 121 (Linux)
An XRSET alone can be used to ensure that only specific work uses a CPU set. There is also
the ability to restrict work execution to primary threads in an XRSET. This ability is known as
an STRSET. STRSETs allow software threads to use ST execution mode independently of the
load on the other CPUs in the system. Work can be placed onto STRSETs by running the
following commands:
execrset -S
ra_attach(R_STRSET)
For more information about this topic, from the processor and OS perspectives, see:
Hybrid thread and core on page 31 (processor)
Hybrid thread and core on page 122 (Linux)
For more information about this topic, see 4.5, Related publications on page 107.
AIX folding
Folding is a key AIX feature on shared processor LPARs that can improve both system and
partition performance. Folding is needed for supporting many partitions in a system. It is an
integrated feature, requiring both hardware and PowerVM support. The AIX component that
manages folding is the Virtual Processor Manager (VPM).
The basic concept of folding is to compress work to a smaller number of cores, based on CPU
utilization, by folding the remaining cores. The unused cores are folded by VPM, and
PowerVM does not schedule them for dispatch in the partition unless the operating system
requests that the cores be unfolded (or woken up), for example, when the workload changes
or when a timer interrupt needs to be fired on that core.
As an example, an LPAR might have 24 virtual cores (processors) assigned, but is consuming
only a total of three physical processors across all of these virtual cores. Folding compresses
(moves) all work to a smaller number of cores (three cores plus some extra cores to handle
spikes in workload), allowing PowerVM to allocate the unused cores for use elsewhere on the
system.
Folding generally improves LPAR and system performance by reducing the context switching
of cores between partitions across a system and the context switching of software
threads across multiple cores in an LPAR. It improves overall affinity at both the LPAR and
system levels.
VPM runs once per second and computes how many cores are kept unfolded based on the
overall CPU utilization of the LPAR. On POWER8 processor-based systems, the folding
algorithm has been enhanced to include the average load (or the average number of runnable
software threads) as a factor in the computation.
Folding can be enabled and disabled by using the schedo command to adjust the value of the
vpm_fold_policy tunable. A second tunable, vpm_xvcpus, can be used to increase the number
of spare, unfolded CPUs. This can improve response time for workloads with steep utilization
spikes, or on partitions with a high interrupt load, although it can result in higher core
usage.
AIX V6.1 TL8 and AIX V7.1 TL2 introduced a new scaled throughput-based folding algorithm
that can be enabled by using the schedo command to adjust the value of the
vpm_throughput_mode tunable. The default folding algorithm favors single-threaded
performance and overall LPAR throughput over core utilization. The new scaled throughput
algorithm can favor reduced core utilization and higher core throughput, instead of overall
LPAR throughput. The new algorithm applies both load and utilization data to make folding
decisions. It can switch unfolded cores to SMT2, SMT4, or SMT8 modes when the workload
increases, rather than unfolding more cores, as shown in Figure 4-1.
The degree of SMT mode (SMT2 or SMT4 (or SMT8 for POWER8 processor-based
systems)) to favor reduced core utilization can be controlled by assigning the appropriate
value to the vpm_throughput_mode tunable (2 for SMT2 mode, 4 for SMT4 mode, and 8 for
SMT8 mode). When the vpm_throughput_mode tunable is set to a value of 1, the folding
algorithm behaves like the legacy folding algorithm and favors single-threaded (ST mode)
performance. However, unlike the legacy algorithm, which uses only utilization data, the new
algorithm employs both load and utilization data to make folding decisions.
The default value of the vpm_throughput_mode tunable is 1 on POWER8 processor-based
systems, and on POWER7 and earlier processor-based systems, the default value is zero
(the legacy folding algorithm continues to be applicable).
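For example (with illustrative values), both tunables can be adjusted with the schedo
command, where -p makes the change persistent across restarts:

schedo -p -o vpm_xvcpus=2            # keep two extra virtual processors unfolded
schedo -p -o vpm_throughput_mode=4   # fill unfolded cores to SMT4 before unfolding more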
Page size   Required hardware                          Requires user configuration   Restricted
4 KB        ALL                                        No                            No
64 KB       POWER5+ processor-based system or later    No                            No
16 MB       POWER4 processor-based system or later     Yes                           Yes
16 GB       POWER5+ processor-based system or later    Yes                           Yes

Supported segment page sizes:
Segment base page size   Supported page sizes   Required hardware
4 KB                     4 KB/64 KB             POWER6 processor-based system
64 KB                    64 KB                  POWER5+ processor-based system
16 MB                    16 MB                  POWER4 processor-based system
16 GB                    16 GB                  POWER5+ processor-based system
83
Page sizes are an attribute of an individual segment. Earlier POWER processors supported
only a single page size per segment. The system administrator or user had to choose the
optimal page size for a specific application based on its memory footprint. The POWER5+
processor introduced the concept of mixed or multiple page sizes within a single segment: 4
KB and 64 KB. POWER7 and later processors support mixed page segment sizes of 4 KB, 64
KB, and 16 MB.
Starting with Version 6.1, AIX takes advantage of this new hardware capability on POWER6
and later processor-based systems to combine the conservative memory usage aspects of
the 4 KB page size in sparsely referenced memory regions with the performance benefits of
the 64 KB page size in densely referenced memory regions. AIX V6.1 takes advantage of this
automatically, without user intervention, although it is disabled in segments that have an
explicit page size that is selected by the user. This AIX feature is referred to as dynamic
Variable Page Size Support (VPSS). Some applications might prefer to use a larger page
size, even when a 64 KB region is not fully referenced. The page size promotion
aggressiveness factor (PSPA) can be used to reduce the memory-referenced requirement, at
which point a group of 4 KB pages is promoted to a 64 KB page size. The vmo command on
AIX allows configuration of the VMM tunable parameters. The PSPA can be set for the whole
system by using the vmm_default_pspa vmo tunable, or for a specific process by using the
vm_pattr system call.
In addition to 4 KB and 64 KB page sizes, AIX supports 16 MB pages, also called large
pages, and 16 GB pages, also called huge pages. These page sizes are intended for use only
in high-performance environments, and AIX by default does not automatically configure a
system to use these page sizes.
Use the vmo tunables lgpg_regions and lgpg_size to configure the number of 16 MB large
pages on a system.
The following example allocates 1 GB of 16 MB large pages:
vmo -r -o lgpg_regions=64 -o lgpg_size=16777216
To use large pages, non-root users must have the CAP_BYPASS_RAC_VMM capability in AIX
enabled. The system administrator can add this capability by running chuser:
chuser capabilities=CAP_BYPASS_RAC_VMM,CAP_PROPAGATE <user_id>
Huge pages must be configured by using the Hardware Management Console (HMC). To do
so, complete the following steps:
1. On the managed system, click Properties → Memory → Advanced Options → Show
Details to change the number of 16 GB pages.
2. Assign 16 GB huge pages to a partition by changing the partition profile.
The vmo tunable vmm_mpsize_support can be used to limit multiple page size support. The
default value of 1 supports all four page sizes, but the tunable can be set to other values to
configure which page sizes are supported.
These page sizes can be configured with an environment variable or with settings in an
application XCOFF binary with the ldedit or ld commands, as shown in Table 4-4.
Table 4-4 Page sizes for four regions of a 32-bit or 64-bit process address space
Region           ld or ldedit option   LDR_CNTRL environment variable
Data             bdatapsize            DATAPSIZE
Stack            bstackpsize           STACKPSIZE
Text             btextpsize            TEXTPSIZE
Shared memory    None                  SHMPSIZE
You can specify a different page size to use for each of the four regions of a process address
space. Only the 4 KB and 64 KB page sizes are supported for all four memory regions. The
16 MB page size is supported only for the process data, process text, and process shared
memory regions. The 16 GB page size is supported only for a process shared memory
region.
You can set the preferred page sizes for an application in the XCOFF/XCOFF64 binary file by
running the ldedit or ld commands.
The ld or cc commands can be used to set these page size options when you are linking an
executable command:
ld -o mpsize.out -btextpsize:4K -bstackpsize:64K sub1.o sub2.o
cc -o mpsize.out -btextpsize:4K -bstackpsize:64K sub1.o sub2.o
The ldedit command can be used to set these page size options in an existing executable
command:
ldedit -btextpsize=4K -bdatapsize=64K -bstackpsize=64K mpsize.out
You can set the preferred page sizes of a process with the LDR_CNTRL environment variable.
As an example, the following command causes the mpsize.out process to use 4 KB pages for
its data, 64 KB pages for its text, 64 KB pages for its stack, and 64 KB pages for its shared
memory on supported hardware:
LDR_CNTRL=DATAPSIZE=4K@TEXTPSIZE=64K@SHMPSIZE=64K mpsize.out
Page size environment variables override any page size settings in an executable XCOFF
header. Also, the DATAPSIZE environment variable overrides any LARGE_PAGE_DATA
environment variable setting.
Rather than using the LDR_CNTRL environment variable, consider marking specific executable
files to use large pages because this limits the large page usage to the specific application
that benefits from large page usage.
Support for specifying the page size to use for the shared memory of a process with the
SHMPSIZE environment variable is available starting in IBM AIX 5L Version 5.3 with the
5300-08 Technology Level, or later, and AIX Version 6.1 with the 6100-01 Technology Level,
or later.
dscr_ctl() API
#include <sys/machine.h>
int dscr_ctl(int op, void *buf_p, int size)
Where:
op: The operation to perform.
buf_p: A pointer to the input or output buffer for the operation.
size: The size, in bytes, of the buffer.
The action that is taken depends on the value of the operation parameter that is defined in
<sys/machine.h>:
DSCR_WRITE
Stores a new value from the input buffer into the process context and
in the DSCR.
DSCR_READ
Reads the current value of DSCR and returns it in the output buffer.
DSCR_GET_PROPERTIES
Returns the platform prefetch properties, such as the default prefetch
depth, in the output buffer.
DSCR_SET_DEFAULT
Sets a new operating system default prefetch depth.
The prefetch depth values that are defined in <sys/machine.h> are:
DPFD_DEFAULT      0
DPFD_NONE         1
DPFD_SHALLOWEST   2
DPFD_SHALLOW      3
DPFD_MEDIUM       4
DPFD_DEEP         5
DPFD_DEEPER       6
DPFD_DEEPEST      7
DSCR_SSE          8
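A minimal sketch of the API (assuming that the DSCR value is passed as a 64-bit integer)
might set and then read back the prefetch depth for the calling process:

#include <stdio.h>
#include <sys/machine.h>

int main(void)
{
    long long dscr = DPFD_DEEPEST;    /* request the deepest prefetch depth */

    if (dscr_ctl(DSCR_WRITE, &dscr, sizeof(dscr)) != 0)   /* set the process DSCR */
        perror("dscr_ctl(DSCR_WRITE)");
    if (dscr_ctl(DSCR_READ, &dscr, sizeof(dscr)) == 0)    /* read it back */
        printf("DSCR is now %lld\n", dscr);
    return 0;
}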
The dscr_value is treated as a decimal number unless it starts with 0x, in which case it is
treated as hexadecimal.
To cancel a permanent setting of the operating system default prefetch depth at start time, run
the following command:
dscrctl -c
Applications that have predictable data access patterns, such as numerical applications that
process arrays of data in a sequential manner, benefit from aggressive data prefetching.
These applications must run with the default operating system prefetch depth, or whichever
settings are empirically found to be the most beneficial.
Applications that have highly unpredictable data access patterns, such as some
transactional applications, can be negatively affected by aggressive data prefetching. The
data that is prefetched is unlikely to be needed, and the prefetching uses system bandwidth
and might displace useful data from the caches. Some WebSphere Application Server and
DB2 workloads have this characteristic. Performance can be improved by disabling hardware
prefetching in these cases by running the following command:
dscrctl -n -s 1
This system (partition) wide disabling is only appropriate if it is expected to benefit all of the
applications that are running in the partition. However, the same effect can be achieved on an
application-specific basis by using the programming API.
For more information about the efficient use of cache, from the processor and OS
perspectives, see:
2.2.3, Efficient use of cache and memory on page 33 (processor)
6.2.3, Efficient use of cache on page 123 (Linux)
For more information about this topic, see 4.5, Related publications on page 107.
Debugger support
The dbx AIX debugger, found in /usr/ccs/bin/dbx, supports machine-level debugging of TM
programs. This support includes the ability to disassemble the new TM instructions, and to
display the TM SPRs.
Setting a breakpoint inside of a transaction causes the transaction to unconditionally fail
whenever the breakpoint is encountered. To determine the cause and location of a failing
transaction, the approach is to set a breakpoint on the transaction failure handler, and then to
view the TEXASR and TFIAR registers when the breakpoint is encountered.
The TEXASR, TFIAR, and TFHAR registers can be displayed by using the print
subcommand with the $texasr, $tfiar, or $tfhar parameter. The line of code that is
associated with the address that is found in TFIAR and TFHAR can be displayed by using the
list subcommand, for example:
(dbx) list at $tfiar
A new tm_status subcommand is provided that displays and interprets the contents of the
TEXASR register. This is useful in determining the nature of a transaction failure.
Tracing support
The AIX trace facility has been expanded to include a set of trace events for TM operations
that are performed by AIX, including the processing of TM-type facility unavailable interrupts,
preemptions that cause transaction failure, and other operations that can cause transaction
failure. The trace event identifier 675 can be used as input to the trace and trcrpt commands
to view TM-related trace events.
The reason that AIX cannot allow system calls to be made while in the transactional state is
that any operations (writes or updates, including I/O) that are performed by AIX underneath a
system call cannot be rolled back.
For more information about the topic of VSX, from the processor, OS, and compiler
perspectives, and for related publications, see 4.5, Related publications on page 107.
Note: The printf() function uses new options to print these new data types:
_Decimal32 uses %Hf
_Decimal64 uses %Df
_Decimal128 uses %DDf
The IBM XL C/C++ Compiler, Release 9 or later, includes native DFP language
support. Here is a list of compiler options for IBM XL compilers that are related to DFP:
-qdfp: Enables DFP support. This option makes the compiler recognize DFP literal
suffixes, and the _Decimal32, _Decimal64, and _Decimal128 keywords.
For hardware supported DFP, with -qarch=pwr6, -qarch=pwr7, or -qarch=pwr8, run the
following command:
cc -qdfp
For software emulation of DFP (on earlier processor chips), run the following
command:
cc -qdfp -qfloat=dfpemulate
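For example, a minimal program (with illustrative values; the dd suffix marks a _Decimal64
literal) can be built with cc -qdfp:

#include <stdio.h>

int main(void)
{
    _Decimal64 price = 9.99dd;          /* dd suffix: _Decimal64 literal */
    _Decimal64 tax   = price * 0.07dd;  /* exact decimal arithmetic      */
    printf("tax = %Df\n", tax);         /* %Df prints a _Decimal64       */
    return 0;
}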
The GCC compilers for Power Systems also include native DFP language support.
Here is a list of GCC compiler options that are related to DFP:
-mno-hard-dfp: Instructs the compiler to use calls to library functions to handle DFP
computation, regardless of the architecture level. If your application is dynamically
linked to the libdfp variant and running on POWER6 or POWER7 processor-based
systems, then the run time automatically binds to the libdfp variant implemented
with hardware DFP instructions. Otherwise, the software DFP library is used. You
might experience performance degradation when you use software emulation.
4.3.1 Malloc
Every application needs a fast, scalable, and memory-efficient allocator. However, each
application's memory request patterns are different. It is difficult to provide one common
allocator or tunable that can satisfy the needs of all applications. AIX provides different
memory allocators and suboptions within the allocator so that a system administrator or
developer can choose more suitable settings for their application. This section explains the
available choices and when to choose them.
Memory allocators
AIX provides three different allocators, and each of them uses a different memory
management algorithm and data structures. These allocators work independently, so the
application developer must choose one of them by exporting the MALLOCTYPE environment
variable. The allocators are:
Default allocator
The default allocator is selected when the MALLOCTYPE environment variable is unset. This
setting maintains consistent performance, even in a worst case scenario, but might not
be as memory-efficient as the Watson allocator. This allocator is ideal for 32-bit applications
that do not make frequent calls to malloc().
Watson allocator
This allocator is selected when MALLOCTYPE=watson is set. This allocator is designed for
64-bit applications. It is memory efficient, scalable, and provides good performance. This
allocator has a built-in bucket component for allocation requests up to 512 bytes. Table 4-5
provides the mapping for the allocation requests to bucket size.
Table 4-5 Mapping for allocation requests to bucket size
Request size   Bucket size   Request size   Bucket size
1 - 4          4             129 - 144      144
5 - 8          8             145 - 160      160
9 - 12         12            161 - 176      176
13 - 16        16            177 - 192      192
17 - 20        20            193 - 208      208
21 - 24        24            209 - 224      224
25 - 28        28            225 - 240      240
29 - 32        32            241 - 256      256
33 - 40        40            257 - 288      288
41 - 48        48            289 - 320      320
49 - 56        56            321 - 352      352
57 - 64        64            353 - 384      384
65 - 80        80            385 - 416      416
81 - 96        96            417 - 448      448
97 - 112       112           449 - 480      480
113 - 128      128           481 - 512      512
For more information, see System memory allocation using the malloc subsystem, found at:
http://www-01.ibm.com/support/knowledgecenter/api/redirect/aix/v7r1/topic/com.ibm.aix.genprogc/doc/genprogc/sys_mem_alloc.htm
malloc pools
This option enables a high-performance front end to the malloc subsystem for managing
storage objects smaller than 513 bytes. This suboption is similar to the built-in bucket
allocator of the Watson allocator. However, this suboption maintains a bucket for each
thread, providing lock-free allocation and deallocation for blocks smaller than 513 bytes.
This suboption improves the performance of multi-threaded applications because the time
spent on locking is avoided for blocks smaller than 513 bytes.
The pool option makes small memory block allocations fast (no locking) and memory
efficient (no header on each allocation object). The pool malloc both speeds up
single-threaded applications, and improves the scalability of multi-threaded applications.
malloc disclaim
When this option is enabled, free() automatically disclaims memory. This suboption is
useful for reducing the paging space requirement. This option can be set by exporting
MALLOCOPTIONS=disclaim.
Use cases
Here are some use cases that you can use to set up your environment:
For a 32-bit single-threaded application, use the default allocator.
For a 64-bit application, use the Watson allocator.
For multi-threaded applications, use the multiheap option. Set the number of heaps
proportional to the number of threads in the application.
For single-threaded or multi-threaded applications that make frequent allocations and
deallocations of memory blocks smaller than 513 bytes, use the malloc pool option.
If the memory usage pattern of the application shows high usage of memory blocks of the
same size (or sizes that map to a common block size in the buckets option) that are
greater than 512 bytes, configure the malloc buckets option.
For older applications that require high performance and do not have memory
fragmentation issues, use malloc 3.1.
Ideally, the Watson allocator, along with the multiheap and malloc pool options, is good
for most multi-threaded applications. The pool front end is fast and scalable for small
allocations, and with multiheap, ensures scalability for larger and less frequent allocations.
If you notice high memory usage in the application process even after you run free(), the
disclaim option can help.
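For example (with illustrative values and a hypothetical application name), a multi-threaded
64-bit server might be started with the Watson allocator, four heaps, and the pool front end:

export MALLOCTYPE=watson
export MALLOCOPTIONS=multiheap:4,pool
./server_app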
For more information about this topic, see 4.5, Related publications on page 107.
SPINLOOPTIME=n
The SPINLOOPTIME variable controls the number of times the system tries to get a busy
mutex or spin lock without taking a secondary action, such as calling the kernel to yield the
process. This control is intended for MP systems, where it is hoped that the lock that is
held by another actively running pthread is released quickly. The parameter works only within
libpthreads (user threads). If locks typically become available within a short period, you
might want to increase the spin time by setting this environment variable. The number of
times to try a busy lock before yielding to another pthread is n. The default is 40, and n must
be a positive value.
YIELDLOOPTIME=n
The YIELDLOOPTIME variable controls the number of times that the system yields the logical
processor when it tries to acquire a busy mutex or spin lock before it goes to sleep on the
lock. The logical processor is yielded to another kernel thread, assuming that there is
another executable thread with sufficient priority. This variable is effective in complex
applications, where multiple locks are in use. The number of times to yield the logical
processor before blocking on a busy lock is n. The default is 0 and n must be a positive
value.
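For example (with illustrative values), both tunables are set in the environment before the
application is started:

export SPINLOOPTIME=500
export YIELDLOOPTIME=16
./mt_app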
For more information about this topic, see 4.5, Related publications on page 107.
4.3.3 pollset
AIX 5L V5.3 introduced the pollset APIs. Pollsets are an AIX replacement for UNIX select()
and poll(). Pollset, select(), and poll() all allow an application to query efficiently the
status of file descriptors. This action is typically done to allow a single application to multiplex
I/O across many file descriptors. Pollset APIs can be more efficient when the number of file
descriptors that are queried becomes large.
Efficient I/O event polling through the pollset interface on AIX contains a pollset summary and
outlines the most advantageous use of Java. To see this document, go to the following
website:
http://www.ibm.com/developerworks/aix/library/au-pollset/index.html
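A minimal sketch of the pollset flow (a single descriptor, with error handling abbreviated)
follows:

#include <sys/poll.h>
#include <sys/pollset.h>

/* Wait for input on fd by using a pollset; returns the pollset_poll() result. */
int wait_for_input(int fd)
{
    pollset_t ps = pollset_create(-1);   /* -1: no fixed descriptor limit */
    struct poll_ctl ctl = { .cmd = PS_ADD, .events = POLLIN, .fd = fd };
    struct pollfd ready[1];
    int n;

    if (ps < 0 || pollset_ctl(ps, &ctl, 1) != 0)
        return -1;
    n = pollset_poll(ps, ready, 1, -1);  /* block until fd is readable */
    pollset_destroy(ps);
    return n;
}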
For more information about this topic, see 4.5, Related publications on page 107.
The direct I/O access method bypasses the file cache and transfers data directly from disk
into the user space buffer, as opposed to using the normal cache policy of placing pages in
kernel memory.
At the user level, file systems can be mounted by using the dio option with the mount
command.
At the programming level, applications enable direct I/O access to a file by passing the
O_DIRECT flag to the open subroutine. This flag is defined in the fcntl.h file. Applications must
be compiled with _ALL_SOURCE enabled to see the definition of O_DIRECT.
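For example, a file can be opened for direct I/O as follows (note that direct I/O generally
requires buffers that are aligned to, and sized in multiples of, the file system block size):

#define _ALL_SOURCE          /* required so that fcntl.h exposes O_DIRECT */
#include <fcntl.h>

int open_direct(const char *path)
{
    return open(path, O_RDWR | O_DIRECT);   /* bypass the file cache */
}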
For more information, see Working with file I/O, found at:
http://www-01.ibm.com/support/knowledgecenter/api/redirect/aix/v6r1/index.jsp?topic=%2Fcom.ibm.aix.genprogc%2Fdoc%2Fgenprogc%2Fworking_file_io.htm
The 32-bit ABI provides an ILP32 model (32-bit integers, longs, and pointers). The 64-bit ABI
provides an LP64 model (32-bit integers and 64-bit longs/pointers). Although current POWER
CPUs have 64-bit fixed-point registers, the 32-bit ABI treats them as 32-bit fixed-point
registers (the high 32 bits of all fixed-point registers are treated as volatile or undefined).
The 32-bit ABI preserves only 32-bit fixed-point context across subroutine linkage, non-local
goto (longjmp()), or signal delivery. 32-bit programs must not attempt to use 64-bit registers
when they run in 32-bit mode (32-bit ABI). In general, other registers (floating point, vector,
and status registers) are the same size in both the 32-bit and 64-bit ABIs.
Starting with AIX V6.1, all supervisor code (kernel, kernel extensions, and device drivers)
uses the 64-bit ABI. In general, a unified system call interface is provided to applications that
provides efficient system call linkage to both 32-bit and 64-bit applications. Because the
AIX V6.1 kernel is 64-bit, it implies that all systems supported by AIX V6.1 support the 64-bit
ABI. Some older IBM PowerPC CPUs supported on AIX 5L V5.3 cannot run the 64-bit ABI.
Operating system libraries provide both 32-bit and 64-bit objects, allowing full support for
either ABI. Development tools (assembly language, linker, and debuggers) support both ABIs.
Trade-offs
The primary motivation to choose the 64-bit ABI is to go beyond the 4 GB direct memory
addressability barrier. A second reason is to improve scalability by extending some 32-bit
data type limits that are in the 32-bit ABI (time_t, pid_t, and offset_t). Lastly, 64-bit mode
provides access to 64-bit fixed-point registers and instructions that can improve the
performance of specific fixed-point operations (long long arithmetic and 64-bit memory
copies).
The 64-bit ABI does have some performance drawbacks: the 64-bit fixed-point registers and
the LP64 model grow stack usage and data structure sizes, which can cause a performance
drawback for some applications. Also, 64-bit text is larger for most compiles, producing a
larger i-cache footprint.
The most significant issue is typically the porting effort (for existing applications), as changing
between ILP32 and LP64 normally requires a port. Large memory addressability and
scalability are normally the deciding factors when you choose an application execution model.
For more information about this topic, see 4.5, Related publications on page 107.
The following list has more information about the associated subroutines:
thread_wait
The thread_wait subroutine allows a thread to wait or block until another thread posts it
with the thread_post or the thread_post_many subroutine or until the time limit that is
specified by the timeout value expires.
If the event for which the thread is waiting and is posted occurs only in the
future, the thread_wait subroutine can be called with a timeout value of 0 to clear any
pending posts:
thread_wait(0);
thread_post
The thread_post subroutine posts the thread whose thread ID is indicated by the value of
the tid parameter, of the occurrence of an event. If the posted thread is waiting in
thread_wait, it is awakened immediately. If it is not waiting in thread_wait, the next call to
thread_wait is not blocked, but returns with success immediately.
Multiple posts to the same thread without an intervening wait by the specified thread count
only as a single post. The posting remains in effect until the indicated thread calls the
thread_wait subroutine, upon which the posting is cleared.
thread_post_many
The thread_post_many subroutine posts one or more threads of the occurrence of the
event. The number of threads to be posted is specified by the value of the nthreads
parameter, and the tidp parameter points to an array of thread IDs of threads that must be
posted. The subroutine works just like the thread_post subroutine, but can be used to
post to multiple threads at the same time. A maximum of 512 threads can be posted in one
call to the thread_post_many subroutine.
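A brief sketch of the post/wait handshake between two threads follows; the header
placement and the interpretation of the timeout as milliseconds are assumptions:

#include <stdio.h>
#include <sys/thread.h>   /* assumed to declare thread_self(), thread_wait(), thread_post() */

void waiter(void)
{
    /* Block until another thread posts this one, or until the timeout expires. */
    int rc = thread_wait(1000);
    printf("thread_wait returned %d\n", rc);
}

void poster(tid_t waiter_tid)
{
    /* waiter_tid is the waiting thread's kernel thread ID, from thread_self(). */
    thread_post(waiter_tid);
}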
For more information about this topic, see 4.5, Related publications on page 107.
Documentation
AIX provides optimizations that enable sharing of loaded text (libraries and dynamically
loaded modules). Sharing text among processes often improves performance because it
reduces resource usage (memory and disk space). It also allows unrelated software threads
to share cache space when they run concurrently. Lastly, it can reduce load times when the
code is already loaded by a previous program.
Applications can control whether private or shared loads are performed to shared text
regions. Shared loads require that execute permissions be set for group/other on the text files.
As a preferred practice, enable sharing.
For more information about this topic, see 4.5, Related publications on page 107.
This type of installation sometimes takes a little effort on the part of the application, but it
allows you to get the most value from using WPARs. If there is a need to run the same version
of the software in several WPARs, this type of installation provides the following benefits:
It increases administrative efficiency by reducing the number of application instances that
users must maintain. The administrator saves time in application-maintenance tasks, such
as applying fixes and performing backups and migrations.
It allows users to quickly deploy multiple instances of the same application, each in its own
secure and isolated environment. It can take only a matter of minutes to create and start a
WPAR to run a shared installation of the application.
By sharing one AIX or application image among multiple WPARs, the memory resource
usage is reduced because only one copy of the application image is in real memory.
For more information about WPAR, see WPAR concepts, found at:
http://www-01.ibm.com/support/knowledgecenter/api/redirect/aix/v7r1/topic/com.ibm.
aix.wpar/wpar-overview.htm
For more information about the topic of operating system-specific optimizations, from the IBM
i and Linux perspectives, see:
5.3, IBM i operating system-specific optimizations on page 114 (IBM i)
6.3, Linux operating system-specific optimizations on page 129 (Linux)
4.4.1 AIX preferred practices that are applicable to all Power Systems
generations
Preferred practices for the installation and configuration of all Power Systems generations are
noted in the following list:
If this server is a VIOS, then run the VIO Performance Advisor on the VIOS. Instructions
are available for Virtual I/O Server Advisor at the following website:
http://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/Power%20S
ystems/page/VIOS%20Advisor
For more information, see VIOS Performance Advisor on page 217.
For logical partitions (LPARs) with Java applications, run and evaluate the output from the
Java Performance Advisor, which can be run on POWER5 and POWER6 processor-based
systems, to determine whether there is an existing issue before you migrate to a POWER7
processor-based system. Instructions are available for Java Performance Advisor (JPA)
at the following website:
https://www.ibm.com/developerworks/community/wikis/home/wiki/Power%20Systems/pa
ge/Java%20Performance%20Advisor%20(JPA)
For more information, see Java Performance Advisor on page 219.
For virtualized environments, you can also use the IBM PowerVM Virtualization
Performance Advisor. Instructions for the IBM PowerVM Virtualization Performance
Advisor are found at the following website:
https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/Power%20
Systems/page/PowerVM%20Virtualization%20Performance%20Advisor
For more information, see Virtualization Performance Advisor on page 218.
The number of online virtual CPUs of a single LPAR cannot exceed the number of active
CPUs in a pool. See the output of lparstat -i from the LPAR to see the values for online
virtual CPUs and active CPUs in the pool.
IBM maintains a strong focus on the quality and reliability of Power Systems servers. To
maintain this reliability, the currency of Licensed Internal Code levels on your systems is
critical. Therefore, apply the latest Power Systems Firmware and management console
levels as soon as possible. These service pack updates contain a collective number of
High Impact or PERvasive (HIPER) fixes that continue to provide you with the system
availability you expect from Power Systems.
When you install firmware from the HMC, avoid the do not auto accept option. Selecting
this advanced option can cause firmware installation problems.
Subscribe to My Notifications to receive customizable communications that contain
important news and new or updated support content, such as publications, hints and
tips, technical notes, product flashes (alerts), downloads, and drivers.
4.4.2 AIX preferred practices that are applicable to POWER7 and POWER8
processor-based systems
This section covers the AIX preferred practices that are applicable to POWER7 and POWER8
processor-based systems.
Chapter 5.
IBM i
This chapter describes the optimization and tuning of the POWER8 processor and other
Power Systems processor-based servers running the IBM i operating system.
5.1 Introduction
IBM i provides an operating environment that emphasizes integration, security, and ease of
use.
The following table shows IBM i release support by processor generation:

IBM i release   POWER6 processor-based systems   POWER7 processor-based systems   POWER8 processor-based systems
IBM i 6.1       Supported                        Not supported                    Not supported
IBM i 6.1.1     Supported                        Supported                        Not supported
IBM i 7.2       Supported                        Supported                        Supported
For more information about this topic, from the processor and OS perspectives, see:
2.2.1, Multi-core and multi-thread on page 28 (processor)
4.2.1, Multi-core and multi-thread on page 72 (AIX)
6.2.1, Multi-core and multi-thread on page 119 (Linux)
Simultaneous multithreading
Simultaneous multithreading (SMT) is a feature of the Power Architecture and is described in
Simultaneous multithreading on page 29.
For more information about this topic, from the processor and OS perspectives, see:
2.2.6, Decimal floating point on page 47 (processor)
4.2.6, Decimal floating point on page 92 (AIX)
6.2.6, Decimal floating point on page 126 (Linux)
For more information about the topic of operating system-specific optimizations, from the AIX
and Linux perspectives, see:
4.3, AIX operating system-specific optimizations on page 95 (AIX)
6.3, Linux operating system-specific optimizations on page 129 (Linux)
Chapter 6.
Linux
This chapter describes the optimization and tuning of the POWER8 and other Power Systems
processor-based servers running the Linux operating system.
6.1 Introduction
When you work with POWER7, POWER7+, or POWER8 processor-based servers and
solutions, a solid choice for running enterprise-level workloads is Linux. Red Hat Enterprise
Linux (RHEL), SUSE Linux Enterprise Server (SLES), and Ubuntu provide operating systems
that are optimized and targeted for the Power Architecture. These operating systems run
natively on the Power Architecture and are designed to take full advantage of the specialized
features of Power Systems.
RHEL and SLES support both POWER7 and POWER8 processor-based systems. Ubuntu is
supported on POWER8 processor-based systems only (starting with Ubuntu Version 14.04).
Unless otherwise stated, references to the POWER8 processor or POWER8
processor-based systems in this chapter apply to all three Linux distributions, and
references to POWER7 or POWER7 processor-based systems apply only to RHEL and
SLES.
All of these Linux distributions provide the tools, kernel support, optimized compilers, and
tuned libraries for Power Systems to achieve excellent performance. For advanced users,
more application and customer-specific tuning approaches are also available.
Additionally, IBM provides a number of added value packages, tools, and extensions that
provide for more tunings, optimizations, and products for the best possible performance on
POWER8 processor-based systems. The typical Linux open source performance tools that
Linux users are comfortable with are available on Linux on Power systems.
The IBM Linux on Power Tools repository enables the use of standard Linux package
management tools (such as yum and zypper) to provide easy access to IBM recommended
tools:
IBM Linux on Power hardware diagnostic aids and productivity tools
IBM Software Development Toolkit for Linux on Power servers
IBM Advance Toolchain for Linux on Power Systems servers
The IBM Linux on Power Tools repository is found at:
http://www.ibm.com/support/customercare/sas/f/lopdiags/yum.html
Under a PowerVM hypervisor, Linux on Power supports everything from small virtualized
Micro-Partitioning partitions up through large dedicated partitions containing all of the
resources of a high-end server. Under a PowerKVM hypervisor, Linux on Power supports
running as a KVM guest on POWER8 processor-based systems.
IBM premier products, such as IBM XL compilers, IBM Java products, IBM WebSphere, and
IBM DB2 database products, all provide Power Systems optimized support with the RHEL,
SLES, and Ubuntu operating systems.
For more information about this topic, see 6.5, Related publications on page 139.
The maximum number of logical CPUs that the Linux kernel supports varies by release:

Linux release                     Maximum number of logical CPUs
SLES 10                           128
RHEL 5                            256
RHEL 6, SLES 11                   1024
RHEL 7, SLES 12, Ubuntu 14.04     2048
Simultaneous multithreading
Simultaneous multithreading (SMT) is a feature of the Power Architecture and is described in
Simultaneous multithreading on page 29.
On a POWER8 processor-based system, with a properly enabled Linux distribution, or distro,
the Linux operating system supports up to eight hardware threads per core (SMT=8).
With the POWER8 processor cores, the SMT hardware threads are more equal in the
execution implementation, which allows the system to support flexible SMT scheduling and
management.
Application throughput and SMT scaling from SMT=1 to SMT=2, to SMT=4, and to SMT=8 is
highly application-dependent. With additional hardware threads that are available for
scheduling, the ability of the processor cores to switch from a waiting (stalled) hardware
thread to another thread that is ready for processing can improve overall system effectiveness
and throughput.
High SMT modes are best for maximizing total system throughput, and lower SMT modes
might be appropriate for high performance threads and low latency applications. For code
with low levels of instruction-level parallelism (often seen in Java code, for example), high
SMT modes are preferred.
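On a Linux on Power system, the SMT mode can be inspected and changed with the
ppc64_cpu command from the powerpc-utils package (a brief sketch; the modes that are
available depend on the platform):
ppc64_cpu --smt          (show the current SMT mode)
ppc64_cpu --smt=4        (run four hardware threads per core)
ppc64_cpu --smt=off      (run in single-threaded mode)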
For more information about the topic of SMT, from the processor and OS perspectives, see:
Simultaneous multithreading on page 29 (processor)
Simultaneous multithreading on page 73 (AIX)
Simultaneous multithreading on page 112 (IBM i)
On a POWER8 processor-based system, the logical CPUs that are active in each SMT mode
are numbered as follows (the first three cores are shown):

SMT=8: 0,1,2,3,4,5,6,7,  8,9,10,11,12,13,14,15,  16,17,18,19,20,21,22,23, ...
SMT=4: 0,1,2,3,          8,9,10,11,              16,17,18,19, ...
SMT=2: 0,1,              8,9,                    16,17, ...
SMT=1: 0,                8,                      16, ...
The sched_setaffinity application programming interface (API) allows processes and threads
to have affinity to specific logical processors, as described in Affinitization and binding on
page 121.
Because the POWER8 processor supports running up to eight threads per core, the CPU
numbering is different than in POWER7 processor-based systems, which supported only up
to four threads per core. Therefore, an application that specifically binds processes to threads
must be aware of the new CPU numbering to ensure that the binding is correct because there
are now more threads available for each core.
For more information about this topic, see 6.5, Related publications on page 139.
The current GLIBC (from Version 2.16) provides the system header sys/platform/ppc.h,
which contains a wrapper for setting the PPR by using the Priority Nop mechanism, as shown
in Example 6-1.
Example 6-1 GLIBC PPR set functions
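A minimal sketch of these wrappers in use, assuming glibc 2.16 or later, follows:
/* Lower the SMT thread priority while spinning, then restore it. */
#include <sys/platform/ppc.h>

static void wait_for_flag(volatile int *flag)
{
    __ppc_set_ppr_low();     /* give other hardware threads on this core priority */
    while (*flag == 0)
        __ppc_yield();       /* hint to the processor that this thread is spinning */
    __ppc_set_ppr_med();     /* restore the default medium priority */
}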
Where to use
SMT thread priority can be used to improve the performance of a workload by lowering the
priority of the SMT thread that is running a particular process-thread in certain situations
(for example, while the thread spins waiting for a lock).
Linux scheduler
The Linux Completely Fair Scheduler (CFS) handles load balancing across CPUs and uses
scheduler modules to make policy decisions. CFS works with multi-core and multi-thread
processors and balances tasks across real processors. CFS also groups and tunes related
tasks together.
The Linux topology considers physical packages, threads, siblings, and cores. The CFS
scheduler domains help to determine load balancing. The base domain contains all sibling
threads of the physical CPU, the next parent domain contains all physical CPUs, and the next
parent domain takes NUMA nodes into consideration.
Because of the specific asymmetrical thread ordering of POWER7 processors, special Linux
scheduler modifications were added for the POWER7 CPU type. With the POWER8
processor, this logic is no longer needed because any of the SMT8 threads can act as the
primary thread by design. This means that the number of threads that are active in the core at
one time determines the dynamic SMT mode (for example, from a performance perspective,
thread 0 can be the same as thread 7). Idle threads should be napping or in a deeper sleep if
they are idle for a period.
taskset
Use the taskset command to retrieve, set, and verify the CPU affinity information of a
running process.
numactl
Similar to the taskset command, use the numactl command to retrieve, set, and verify the
CPU affinity information of a running process. The numactl command, however, additionally
provides control over the local memory allocation policy.
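For example (myapp is a placeholder binary name):
taskset -c 0-7 ./myapp                          (start myapp bound to logical CPUs 0 - 7)
taskset -cp 4 <pid>                             (move a running process to logical CPU 4)
numactl --cpunodebind=0 --membind=0 ./myapp     (bind CPUs and memory to NUMA node 0)
numactl --hardware                              (display the NUMA topology of the system)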
When there is no work to be done on a CPU, the scheduler goes into the idle loop, and Linux
calls into the hypervisor to report that the CPU is truly idle. The kernel-to-hypervisor interface
is defined in the Power Architecture Platform Reference (PAPR) found at http://power.org.
In this case, it is the H_CEDE hypervisor call.
For more information about this topic, from the processor and OS perspectives, see:
Hybrid thread and core on page 31 (processor)
Hybrid thread and core on page 80 (AIX)
For more information about the efficient use of cache, from the processor and OS
perspectives, see:
2.2.3, Efficient use of cache and memory on page 33 (processor)
4.2.3, Efficient use of cache on page 86 (AIX)
For more information about this topic, see 6.5, Related publications on page 139.
Debugger support
The GDB debugger currently supports only machine-level debugging of TM programs. This
support includes the ability to disassemble the new TM instructions. This support does not
allow the setting of breakpoints within a transaction. Setting a breakpoint inside of a
transaction causes the transaction to unconditionally fail whenever the breakpoint is
encountered. To determine the cause and location of a failing transaction, set a breakpoint on
the transaction failure handler, and then view the TEXASR and TFIAR registers when the
breakpoint is encountered.
For more information about the topic of transactional memory, see the related sections in the
processor, OS, and compiler chapters.
For more information about the topic of Vector Scalar eXtension (VSX), see the related
sections in the processor, AIX, IBM i, and compiler chapters.
Note: The printf() function uses new options to print these new data types:
_Decimal32 uses %Hf.
_Decimal64 uses %Df.
_Decimal128 uses %DDf.
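As a brief illustration, the following sketch declares DFP values and prints them with these
modifiers. It assumes a DFP-enabled compiler (for example, xlc -qdfp) and a runtime with
DFP printf support, as described later in this section:
#include <stdio.h>

int main(void)
{
    _Decimal64 price = 12.34DD;                      /* DD marks a _Decimal64 literal */
    _Decimal128 total = (_Decimal128) price * 3.0DL; /* DL marks a _Decimal128 literal */
    printf("price=%Df total=%DDf\n", price, total);
    return 0;
}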
The IBM XL C/C++ Compiler, release 9 or later for AIX and Linux, includes native DFP
language support. Here is a list of the compiler options for IBM XL compilers that are
related to DFP:
-qdfp: Enables DFP support. This option makes the compiler recognize DFP literal
suffixes, and the _Decimal32, _Decimal64, and _Decimal128 keywords.
-ldfp: Enables the DFP function that is provided by the Advance Toolchain on
Linux.
For hardware supported DFP, with -qarch=pwr6, -qarch=pwr7, or -qarch=pwr8, use the
following command:
cc -qdfp
For software emulation of DFP (on earlier processor chips), use the following
command:
cc -qdfp -qfloat=dfpemulate
The GCC compilers for Power Systems also include native DFP language support.
As of SLES 11 SP1 and RHEL 6, and in accord with the Institute of Electrical and
Electronics Engineers (IEEE) 754R, DFP is fully integrated with compiler and runtime
(printf and DFP math) support. For older Linux distribution releases (RHEL 5/SLES 10
and earlier), you can use the freely available Advance Toolchain compiler and runtime
libraries. The Advance Toolchain runtime libraries can also be integrated with recent
XL (V9+) compilers for DFP exploitation.
The latest Advance Toolchain compiler and run times can be downloaded from the
following website:
ftp://ftp.unicamp.br/pub/linuxpatch/toolchain/at/
Advance Toolchain is a self-contained toolchain that does not rely on the base system
toolchain for operability. In fact, it is designed to coexist with the toolchain that is
shipped with the operating system. You do not have to uninstall the regular GCC
compilers that come with your Linux distribution to use the Advance Toolchain.
The latest Enterprise distributions and the Advance Toolchain run time use the Linux CPU
tune library capability to automatically select hardware DFP or software implementation
library variants, based on the hardware platform.
Here is a list of GCC compiler options for Advance Toolchain that are related to DFP:
-mno-hard-dfp: Instructs the compiler to use calls to library functions to handle DFP
computation, regardless of the architecture level. If your application is dynamically
linked to the libdfp variant and running on POWER6, POWER7, or POWER8
processors, then the run time automatically binds to the libdfp variant that is
implemented with hardware DFP instructions. Otherwise, the software DFP library
is used. You might experience performance degradation when you use software
emulation.
-ldfp: Enables the DFP function that is provided by recent Linux Enterprise
Distributions or the Advance Toolchain run time.
Decimal Floating Point Library (libdfp) is an implementation of the joint efforts of the
International Organization for Standardization and the International Electrotechnical
Commission (ISO/IEC). ISO/IEC technical report ISO/IEC TR 24732 describes the
C-language library routines that are necessary to provide the C library runtime support for
decimal floating point data types, as introduced in IEEE 754-2008, namely _Decimal32,
_Decimal64, and _Decimal128.
The library provides functions, such as sin and cos, for the decimal types that are
supported by GCC. Current development and documentation can be found at
https://github.com/libdfp/libdfp, and RHEL6 and SLES11 provide this library as a
supplementary extension. Advance Toolchain also ships with the library.
To view the results and see what symbols the event samples are associated with, run the
following command:
opreport --symbols
If you see the following message, no samples were found for the specified event when the
application ran:
opreport error: No sample file found
For more information about this topic, from the processor and OS perspectives, see:
2.2.6, Decimal floating point on page 47 (processor)
4.2.6, Decimal floating point on page 92 (AIX)
5.2.4, Decimal floating point on page 113 (IBM i)
For more information, see 6.5, Related publications on page 139.
2. Information technology -- Programming languages, their environments, and system software interfaces --
Extension for the programming language C to support decimal floating-point arithmetic, found at:
http://www.iso.org/iso/catalogue_detail.htm?csnumber=38842
There are interoperability considerations with the Power Architecture Executable and
Linkable Format (ELF) application binary interface (ABI), which can complicate the use of
this facility while remaining ABI-compliant. As a result, user applications should use the API
that is provided by libpaf-ebb, which handles the ABI implications consistently and correctly
and provides a handler by proxy.
For more information about this topic, from the processor perspective, see 2.2.12,
Event-based branches (or user-level fast interrupts) on page 52 (processor).
For more information about EBB, see the following website:
https://github.com/paflib/paflib/wiki/Event-Based-Branching----Overview,-ABI,-and-API
The handling of floating point and vector data is the same (register sizes, formats, and
instructions) for 32-bit and 64-bit modes. Therefore, for these applications, the key decision
depends on the address space requirements. For 32-bit POWER applications (32-bit mode
applications that are running on 64-bit POWER hardware with a 64-bit kernel), the address
space is limited to 4 GB, which is the limit of a 32-bit address. 64-bit applications are limited to
16 TB of application program or data per process. This limitation is not a hardware one, but a
restriction of the shared Linux virtual memory manager implementation. For applications
with low latency response requirements, using the larger 64-bit address space to avoid I/O
latencies (through memory-mapped files or large local caches) is a good trade-off.
CPU-tuned libraries
If an application must support only one POWER hardware platform (such as POWER7 and
later processor-based systems), then compiling the entire application with the appropriate
-mcpu= and -mtune= compiler flags might be the best option.
For example, -mcpu=power7 allows the compiler to use all the POWER7 instructions, such as
the VSX category. The -mcpu=power7 option also implies -mtune=power7 if it is not explicitly
set.
The GCC compiler does not have any specific POWER7+ optimizations, so use -mcpu=power7
or -mtune=power7.
The -mcpu=power8 option allows the compiler to use instructions that were added for the
POWER8 processor, such as cryptography built-in functions, direct move instructions that
allow data movement between the general-purpose registers and the floating point or floating
vector registers, and additional vector scalar instructions that were introduced.
-mcpu generates code for a specific machine. If you specify -mcpu=power7, the code also runs
on a POWER8 processor-based system, but not on a POWER6 processor-based system.
-mcpu=power6x generates instructions that are not implemented on POWER7 or POWER8
processor-based systems, and -mcpu=power6 generates code that runs on POWER7 and
POWER8 processor-based systems. The -mtune option focuses on optimizing the order of
the instructions.
Most applications do need to run on more than one platform, for example, in POWER7 mode
and POWER8 mode. For applications composed of a main program and a set of shared
libraries, or applications that spend significant execution time in other shared libraries (from
the Linux run time or extra packages), you can create packages that automatically select the
best optimization for each platform.
Linux also supports automatic CPU tuned library selection. There are a number of
implementation options for CPU tuned library implementers as described here. For more
information, see Optimized Libraries, found at:
https://www.ibm.com/developerworks/community/wikis/home?lang=en#/wiki/W51a7ffcf4df
d_4b40_9d82_446ebc23c550/page/Optimized%20Libraries
The Linux Technology Center works with the SUSE, Canonical, and Red Hat Linux
Distribution Partners to provide some automatic CPU-tuned libraries for the C/POSIX runtime
libraries. However, these libraries might not be supported for all platforms or have the latest
optimization.
One advantage of the Advance Toolchain is that the runtime RPMs for the current release do
include CPU-tuned libraries for all the supported POWER processors and the latest
processor-specific optimization and capabilities, which are constantly updated. Additional
libraries are added as they are identified. The Advance Toolchain run time can be used with
either Advance Toolchain GCC or XL compilers and includes configuration files to simplify
linking XL compiled programs with the Advance Toolchain runtime libraries.
These techniques are not restricted to systems libraries, and can be easily applied to
application shared library components. The dynamic code path and processor tuned libraries
are good starting points. With this method, the compiler and dynamic linker do most of the
work. You need only some additional build time and extra media for the multiple library
images.
In this example, the following conditions apply:
Your product is implemented in your own shared library, such as libmyapp.so.
You want to support Linux running on POWER5, POWER6, POWER7, and POWER8
processor-based systems.
DFP and Vector considerations:
Your oldest supported platform is a POWER5 processor-based system, which does not
have a DFP or the Vector unit.
The POWER6 processor has DFP and a Vector Unit implementing the older Vector
Multimedia eXtension (VMX) (vector float but no vector double) instructions.
POWER7 and POWER8 processors have DFP and the new VSX (the original VMX
instructions plus Vector Double and more).
Your application benefits greatly from both hardware decimal floating point and
high-performance vector instructions, but if you compile your application with -mcpu=power7
-O3, it does not run on POWER5 (no hardware DFP instructions) or POWER6 (no vector
double instructions) processor-based systems.
You can optimize all of these Power platforms if you build and install your application and
libraries correctly by completing the following steps:
1. Build the main application binary file and the default version of libmyapp.so for the oldest
supported platform (in this case, use -mcpu=power5 -O3). You can still use decimal data
because the Advance Toolchain and the newest SLES 11 and RHEL 6 include a DFP
emulation library and run time.
2. Install the application (myapp) into the appropriate ./bin directory and libmyapp.so into
the appropriate ./lib64 directory. The following paths provide the application main and
default run time for your product:
/opt/ibm/myapp1.0/bin/myapp
/opt/ibm/myapp1.0/lib64/libmyapp.so
3. Compile and link libmyapp.so with -mcpu=power6 -O3, which enables the compiler to
generate DFP and VMX instructions for POWER6 processor-based systems.
4. Install this version of libmyapp.so into the appropriate ./lib64/power6 directory. For
example:
/opt/ibm/myapp1.0/lib64/power6/libmyapp.so
5. Compile and link the fully optimized version of libmyapp.so for POWER7 processors with
-mcpu=power7 -O3, which enables the compiler to generate DFP and all the VSX
instructions. Install this version of libmyapp.so into the appropriate ./lib64/power7
directory. For example:
/opt/ibm/myapp1.0/lib64/power7/libmyapp.so
6. Compile and link the fully optimized version of libmyapp.so for the POWER8 processor
with -mcpu=power8 -O3, which enables the compiler to generate DFP and all the VSX
instructions. Install this version of libmyapp.so into the appropriate ./lib64/power8
directory. For example:
/opt/ibm/myapp1.0/lib64/power8/libmyapp.so
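The per-platform builds differ only in the -mcpu level, for example (a sketch; the paths are
abbreviated and the source layout is hypothetical):
gcc -O3 -mcpu=power5 -fPIC -shared -o lib64/libmyapp.so myapp.c
gcc -O3 -mcpu=power6 -fPIC -shared -o lib64/power6/libmyapp.so myapp.c
gcc -O3 -mcpu=power7 -fPIC -shared -o lib64/power7/libmyapp.so myapp.c
gcc -O3 -mcpu=power8 -fPIC -shared -o lib64/power8/libmyapp.so myapp.c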
By simply running some extra builds, your myapp1.0 is fully optimized for the current and
N-1/N-2 POWER hardware releases. When you start your application with the appropriate
LD_LIBRARY_PATH (including /opt/ibm/myapp1.0/lib64), the dynamic linker automatically
searches the subdirectories under the library path for names that match the current
platform (POWER5, POWER6, POWER7, or POWER8 processor-based systems). If the
dynamic linker finds the shared library in the subdirectory with the matching platform
name, it loads that version; otherwise, the dynamic linker looks in the base lib64 directory
and uses the default implementation. This process continues for all directories in the
library path and recursively for any dependent libraries.
Linux malloc
Generally, tuning malloc invocations on Linux systems is application-specific.
Storage within arenas can be reused without kernel intervention. The default malloc
implementation uses trylock techniques to detect contentions between POSIX threads, and
then tries to assign each thread its own arena. This action works well when the same thread
frees storage that it allocates, but it does result in more contention when malloc storage is
passed between producer and consumer threads. The default malloc implementation also
uses atomic operations and finer-grained critical sections (lock and unlock) to enhance
parallel thread execution, a trade-off that gains better multi-thread execution at the expense
of a longer malloc path length with multiple atomic operations per call.
Large allocations (greater than MMAP_THRESHOLD) require a kernel syscall for each malloc()
and free(). The Linux Virtual Memory Management (VMM) policy does not allocate any real
memory pages to an anonymous mmap() until the application touches those pages. The
benefit of this policy is that real memory is not allocated until it is needed. The downside is
that, as the application populates the new allocation with data, the application experiences
multiple page faults on first touch to allocate and zero-fill each page. This means that more
of the processing happens when the memory is first touched rather than earlier, when the
original mmap call is made. In addition, this first-touch timing can affect the NUMA
placement of each memory page.
Such storage is unmapped by free(), so each new large malloc allocation starts with a flurry
of page faults. This situation is partially mitigated by the larger (64 KB) default page size of
the RHEL and SLES on Power Systems; there are fewer page faults than with 4 KB pages.
For more information about tuning malloc parameters, see Malloc Tunable Parameters, found
at:
http://www.gnu.org/software/libtool/manual/libc/Malloc-Tunable-Parameters.html
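When the defaults do not fit an allocation pattern, some of these parameters can also be
adjusted programmatically through mallopt(). A minimal sketch follows; the values are
illustrative, not recommendations:
#include <malloc.h>

void tune_malloc_for_large_buffers(void)
{
    /* Only use mmap() for allocations of 4 MB or more. */
    mallopt(M_MMAP_THRESHOLD, 4 * 1024 * 1024);
    /* Keep up to 8 MB of freed heap before trimming it back to the kernel. */
    mallopt(M_TRIM_THRESHOLD, 8 * 1024 * 1024);
}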
Thread-caching malloc
Under some circumstances, an alternative malloc implementation can prove beneficial for
improving application performance. Packaged as part of Google's Perftools package
(http://code.google.com/p/gperftools/?redir=1), and in the IBM Advance Toolchain, this
specialized malloc implementation can improve performance across a number of C and C++
applications.
Thread-caching malloc (TCMalloc) uses a thread-local cache for each thread and moves
objects from the memory heap into the local cache as needed. Small objects (less than
32 KB) are mapped into allocatable size-classes. A thread cache contains a singly linked list of
free objects per size-class. Large objects are rounded up to a page size (4 KB) and handled
by a central page heap, which is an array of linked lists.
For more information about how TCMalloc works, see TCMalloc: Thread-Caching Malloc,
found at:
http://gperftools.googlecode.com/svn/trunk/doc/tcmalloc.html
The TCMalloc implementation is part of the gperftools project. For more information about
this topic, go to the following website:
http://code.google.com/p/gperftools/
Usage
To use TCMalloc, link TCMalloc into your application by using the -ltcmalloc linker flag, as
in the following command:
$ gcc [...] -ltcmalloc
You can also use TCMalloc in applications that you did not compile yourself by using
LD_PRELOAD as follows:
$ LD_PRELOAD="/usr/lib/libtcmalloc.so"
These examples assume that the TCMalloc library is in /usr/lib. With the Advance Toolchain
V5.0.4, the 32-bit and 64-bit libraries are in /opt/at5.0/lib and /opt/at5.0/lib64.
1. Set the following environment variables for the application:
TCMALLOC_MEMFS_MALLOC_PATH=/libhugetlbfs/ defines the libhugetlbfs mount point.
HUGETLB_ELFMAP=RW allocates both RSS and BSS (text/code and data) segments on the
large pages, which is useful for codes that have large static arrays, such as Fortran
programs.
HUGETLB_MORECORE=yes allows heap usage on the large pages.
2. Allocate the number of large pages from the system by running one of the
following commands:
# echo N > /proc/sys/vm/nr_hugepages
# echo N > /proc/sys/vm/nr_overcommit_hugepages
Where:
N is the number of large pages to be reserved. A peak usage of 4 GB by your program
requires 256 large pages (4096/16).
nr_hugepages is the static pool. The kernel reserves N * 16 MB of memory from the
static pool to be used exclusively by the large pages allocation.
nr_overcommit_hugepages is the dynamic pool. The kernel sets a maximum usage of N
large pages and dynamically allocates or deallocates these large pages.
3. Set up the libhugetlbfs mount point by running the following commands:
# mkdir -p /libhugetlbfs
# mount -t hugetlbfs hugetlbfs /libhugetlbfs
4. Monitor large pages usage by running the following command:
# cat /proc/meminfo | grep Huge
This command produces the following output:
HugePages_Total:
HugePages_Free:
HugePages_Rsvd:
HugePages_Surp:
Hugepagesize:
Where:
HugePages_Total is the total pages that are allocated on the system for LP usage.
HugePages_Free is the total number of free large pages available.
HugePages_Rsvd is the total of large pages that are reserved but not used.
Hugepagesize is the size of a single LP.
You can monitor large pages by NUMA nodes by running the following command:
# watch -d grep Huge /sys/devices/system/node/node*/meminfo
MicroQuill SmartHeap
MicroQuill SmartHeap is an optimized malloc that is used for SPECcpu2006 publishes for
optimizing performance on selected benchmark components. For more information, see
SmartHeap for SMP: Does your app not scale because of heap contention?, found at:
http://www.microquill.com/smartheapsmp/index.html
The PowerKVM host runs in single-threaded (or SMT off) mode. Enabling microthreading
requires first switching to SMT on mode on the host, then setting the number of subcores per
core, and finally switching back to SMT off mode with the following commands in this specific
order:
ppc64_cpu --smt=on
ppc64_cpu --subcores-per-core=4
ppc64_cpu --smt=off
Important: No guests can be active when you run these commands.
CPU numbering on a POWER8 host is usually 0, 8, 16, 24, and so on (in multiples of 8)
because the other seven threads of each core (1 - 7, 9 - 15, 17 - 23, and so on) are disabled
in SMT off mode. When switching to multithreading mode, CPUs 2, 4, and 6 (of the first core),
CPUs 10, 12, and 14 (of the second core), and so on, also become active.
For more information about multithreading, see 5.3.2, Microthreading, in IBM PowerKVM
Configuration and Use, SG24-8231.
Open source applications now also support Little Endian mode on Power Systems. Many
third-party and most IBM applications have migrated to Little Endian, and work continues on
optimizing them to run efficiently.
A new, more efficient ABI has also been introduced for Little Endian mode, which is described
in the next section.
Red Hat Enterprise Linux 6 Performance Tuning Guide, Optimizing subsystem throughput
in Red Hat Enterprise Linux 6, Edition 4.0, found at:
http://docs.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/6/html-single/Perfor
mance_Tuning_Guide/index.html
SMT settings, found at:
http://www.ibm.com/support/knowledgecenter/POWER7/p7hc3/iphc3attributes.htm?cp=
POWER7%2F1-8-3-7-2-0-3
Simultaneous multithreading, found at:
http://www-01.ibm.com/support/knowledgecenter/api/redirect/lnxinfo/v3r0m0/index
.jsp?topic=%2Fliaai.hpctune%2Fsmtsetting.htm
SUSE Linux Enterprise Server System Analysis and Tuning Guide (Version 11 SP3),
found at:
http://www.suse.com/documentation/sles11/pdfdoc/book_sle_tuning/book_sle_tuning
.pdf
Chapter 7.
Compilers and optimization tools for C, C++, and Fortran
On GCC, the equivalent options are -mcpu and -mtune. So, for an application that must run on
a POWER7 processor-based system, but that is usually run on a POWER8 processor-based
system, the options are -mcpu=power7 and -mtune=power8.
The XLC13 and XLF15 compilers for AIX and Linux introduce an extension to -qtune to
indicate the SMT mode in which the application will most often run. For example, for an
application that must run on POWER8 processor-based systems and will most often run in
SMT4 mode (that is, four hardware threads per core), use -qarch=pwr8
-qtune=pwr8:smt4. If the same application might run in several different SMT modes, consider
using -qarch=pwr8 -qtune=pwr8:balanced.
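For example, a sketch of a compile command for a hot source file (the file name is a
placeholder):
xlc -O3 -qarch=pwr8 -qtune=pwr8:smt4 -c hot_loop.c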
The POWER8 processor supports the Vector Scalar eXtension (VSX) instruction set, which
improves performance for numerical applications over regular data sets. These performance
features can increase the performance of some computations, and can be accessed
manually by using the Altivec vector extensions, or automatically by the XL compiler by using
-O3 or above with -qarch=pwr7 or -qarch=pwr8. By default, these options implicitly enable
-qsimd, which allows the XL compilers to transform loops in an application to use VSX
instructions. The POWER8 processor includes several extensions to the Vector Multimedia
eXtension (VMX) and VSX instruction sets, which can improve performance of applications by
using 64-bit integer types and single-precision floating point.
The GCC compiler equivalents are the -maltivec and -mvsx options, which you should
combine with -ftree-vectorize and -fvect-cost-model. On GCC, the combination of -O3
and -mcpu=power7 or -mcpu=power8 implicitly enables Altivec and VSX code generation with
auto-vector (-ftree-vectorize) and -mpopcntd. Other important options include
-mrecip=rsqrt and -mveclibabi=mass (which require -ffast-math or -Ofast to be effective).
If the compiler uses optimizations that depend on the MASS libraries, the link command must
explicitly name the MASS library directories and library names.
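The following sketch compiles with the MASS vector ABI and links against the static MASS
archives; the installation path and library names vary by MASS version and platform and are
illustrative assumptions only:
gcc -O3 -mcpu=power8 -ffast-math -ftree-vectorize -mveclibabi=mass -c a.c
gcc -o program a.o -L/opt/ibm/xlmass/8.1/lib64 -lmassvp8 -lmass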
For more information about this topic, see 7.7, Related publications on page 171.
Prerequisites
The XL compilers assist with identifying certain programming errors that are outlined in
7.2.1, Common prerequisites on page 143:
Static analysis/warnings: The XL compilers can identify suspicious code constructs, and
provide some information about these constructs through the -qinfo=all option. Examine
the output of this option to identify suspicious code constructs and validate that the
constructs are correct.
Runtime analysis or warnings: The XL compilers can cause the application to perform
runtime checks to validate program correctness by using the -qcheck option. This option
triggers a program abort when an error condition (such as a null pointer dereference or
out-of-bounds array access) is encountered, making the problem easier to identify. This
option has a significant performance cost, so use it only during functional verification, not
in a production environment.
Aliasing compliance: The C, C++, and Fortran languages specify rules that govern the
access of data through overlapping pointers. These rules are exploited aggressively by
optimization techniques, but they can lead to incorrect results if they are broken. The
compiler can be instructed not to take advantage of these rules, at a cost in runtime
performance, which can be useful for older code that was written without following these
rules. The options that disable these aliasing assumptions are -qalias=noansi for C/C++
and -qalias=nostd for Fortran. A small illustration follows this list.
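The following function breaks the ANSI aliasing rules by writing through an int pointer that
is assumed to alias a float; under strict aliasing, the optimizer can reorder or cache the
accesses, so legacy code like this needs -qalias=noansi (XL) or -fno-strict-aliasing
(GCC) to behave as its author intended:
/* Illustration only: f and i are assumed to point to the same storage. */
float scale_by_two(float *f, int *i)
{
    *i = 0x40000000;      /* write the bit pattern of 2.0f through an int pointer */
    return *f * 2.0f;     /* under strict aliasing, *f may be a stale value */
}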
The XLC13 and XLF15 compilers include enhancements to -qinfo and -qcheck to detect
accesses to uninitialized variables and stack corruption or stack clobbering.
High-order transformations
The XL compilers have sophisticated optimizations to improve the performance of numeric
applications. These applications often contain regular loops that process large amounts of
data. The high-order transformation (HOT) optimizations in these compilers analyze these
loops, identify opportunities for restructuring them to improve cache usage, improve data
reuse, and expose more instruction-level parallelism to the hardware. For these types of
applications, the performance impact of this option can be substantial.
There are two levels of aggressiveness to the HOT optimization framework in these
compilers:
Level 0, which is the default at optimization level -O3, performs a minimal amount of loop
optimization, focusing on simple opportunities and minimizing compilation time.
Level 1, which is the default at optimization levels -O4 and up, performs full loop analysis
and transformation of loops.
The HOT optimizations can be explicitly requested through the -qhot=level=0 and
-qhot=level=1 options. The -qhot option alone enables -qhot=level=1. The -O3 -qhot
options are preferred for numerical applications.
OpenMP
The OpenMP API is an industry specification for shared-memory parallel programming. The
latest XL Compilers provide a full implementation of the OpenMP 3.1 specification and partial
support of the OpenMP 4.0 specification in C, C++, and Fortran. You can program with
OpenMP to capitalize on the incremental introduction of parallelism in an existing application
by adding pragmas or directives to specify how the application can be parallelized.
For applications with available parallelism, OpenMP can provide a simple solution for parallel
programming without requiring low-level thread manipulation. The OpenMP implementation
on the XL compilers is available by using the -qsmp=omp option.
Whole-program analysis
Traditional compiler optimizations operate independently on each application source file.
Inter-procedural optimizations operate at the whole-program scope by using the interaction
between parts of the application on different source files. It is often effective for large-scale
applications that are composed of hundreds or thousands of source files.
On the XL compilers, these capabilities are accessed by using the -qipa option. It is also
implied when you use optimization levels -O4 and -O5. In this phase, the compiler saves a
high-level representation of the program in the object files during compilation, and reoptimizes
it at the whole-program scope during the link phase. For this situation to occur, the compiler
driver must be used to link the resulting binary file instead of starting the system linker
directly.
Whole-program analysis (IPA) is effective on programs that use many global variables,
overflowing the default AIX limit on global symbols. If the application requires the use of the
-bbigtoc option to link successfully on AIX, it is likely a good candidate for IPA optimization.
There are three levels of IPA optimization on the XL compilers (0, 1, and 2). By default, -qipa
implies -qipa=level=1, which performs basic program restructuring. For more aggressive
optimization, apply -qipa=level=2, which performs full program restructuring during the link
step. The time that it takes to complete the link step can increase significantly.
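A sketch of an IPA build, in which the compiler driver also performs the link step so that the
whole-program reoptimization can occur:
xlc -O3 -qipa=level=2 -c a.c
xlc -O3 -qipa=level=2 -c b.c
xlc -O3 -qipa=level=2 -o program a.o b.o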
For the PDF profile data to be written out at the end of execution, the program must either
implicitly or explicitly call the exit() library subroutine. Using exit() causes code that is
introduced as part of the PDF instrumentation to be run and write out the PDF profile data. In
contrast, running the _exit() system call skips the writing of the PDF profile data file, which
results in inaccurate profile data being recorded.
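A typical PDF build cycle with the XL compilers looks like the following sketch (sample1 and
sample2 stand for representative training inputs):
xlc -O3 -qpdf1 -o program a.c b.c
./program < sample1
./program < sample2
xlc -O3 -qpdf2 -o program a.c b.c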
Prerequisites
The GCC compiler assists with identifying certain programming errors that are outlined in
7.2.1, Common prerequisites on page 143:
Static analysis and warnings. The -pedantic and -pedantic-errors options warn of
violations of ISO C or ISO C++ standards.
The language standard to enforce and the aliasing compliance requirements are specified
by the -std, -ansi, and -fno-strict-aliasing options. For example:
ISO C 1990 level: -std=c89, -std=iso9899:1990, and -ansi
ISO C 1998 level: -std=c99 and -std=iso9899:1999
Do not assume strict aliasing rules for the language level: fno-strict-aliasing
The GCC compiler documentation contains more details about these options.1, 2, 3
The following additional GCC options are commonly used to improve performance:
-fpeel-loops
-funroll-loops
-ftree-vectorize
-fvect-cost-model
-mcmodel=medium
Specifying the -mveclibabi=mass option and linking to the MASS libraries enables more loops
for -ftree-vectorize. The MASS libraries support only static archives for linking, and so they
require explicit naming and library search order for each platform and mode.
1. Language Standards Supported by GCC, found at:
http://gcc.gnu.org/onlinedocs/gcc-3.4.2/gcc/Standards.html#Standards
2. Options Controlling C Dialect, found at:
http://gcc.gnu.org/onlinedocs/gcc-3.4.2/gcc/C-Dialect-Options.html#C-Dialect-Options
3. Options That Control Optimization, and specifically the discussion of -fstrict-aliasing, found at:
http://gcc.gnu.org/onlinedocs/gcc-3.4.2/gcc/Optimize-Options.html#Optimize-Options
ABI improvements
The -mcmodel={medium|large} option implements important ABI improvements that are
further optimized in hardware for future generations of the POWER processor. This
optimization extends the Table-Of-Contents (TOC) to 2 GB and eliminates the previous
requirement for -mminimal-toc or multi-TOC switching within a single program or library.
The default for newer GCC compilers (including Advance Toolchain V4.0 and later) is
-mcmodel=medium. This model logically extends the TOC to include local static data and
constants and allows direct data access relative to the TOC pointer.
OpenMP
The OpenMP API is an industry specification for shared-memory parallel programming. The
current GCC compilers, starting with GCC 4.4 (Advance Toolchain V4.0 and later), provide a
full implementation of the OpenMP 3.0 specification in C, C++, and Fortran. Programming
with OpenMP allows you to benefit from the incremental introduction of parallelism in an
existing application by adding pragmas or directives to specify how the application can
be parallelized.
For applications with available parallelism, OpenMP can provide a simple solution for parallel
programming, without requiring low-level thread manipulation. The GNU OpenMP
implementation on the GCC compilers is available under the -fopenmp option. GCC also
provides auto-parallelization under the -ftree-parallelize-loops option.
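A minimal OpenMP sketch, built with gcc -fopenmp -std=gnu99:
#include <omp.h>
#include <stdio.h>

int main(void)
{
    double sum = 0.0;
    /* The iterations are divided among the available threads;
       the reduction clause combines the per-thread partial sums. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < 1000000; i++)
        sum += i * 0.5;
    printf("sum=%f (max threads=%d)\n", sum, omp_get_max_threads());
    return 0;
}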
Whole-program analysis
Traditional compiler optimizations operate independently on each application source file.
Inter-procedural optimizations operate at the whole-program scope, by using the interaction
between parts of the application on different source files. It is often effective for large-scale
applications that are composed of hundreds or thousands of source files.
Starting with GCC 4.6 (Advance Toolchain V5.0), there is the Link Time Optimization (LTO)
feature. LTO allows separate compilation of multiple source files but saves additional
information (an abstract program description) in the resulting object file. Then, at application
link time, the linker can collect all the objects (with the additional information) and pass them
back to the compiler (GCC) for whole-program analysis (IPA) and final code generation.
The GCC LTO feature is enabled during the compile and link phases by the -flto option. A
simple example follows:
gcc -flto -O3 -c a.c
gcc -flto -O3 -c b.c
gcc -flto -o program a.o b.o
Additional options that can be used with -flto include:
-flto-partition={1to1|balanced|none}
-flto-compression-level=n
Detailed descriptions about -flto and its related options are in Options That Control
Optimization, found at:
http://gcc.gnu.org/onlinedocs/gcc-4.6.3/gcc/Optimize-Options.html#Optimize-Options
Profile-based optimization
Profile-based optimization allows the compiler to collect information about the program
behavior and use that information when you make code generation decisions. It involves
compiling the program twice: first, to generate an instrumented version of the application that
collects program behavior data when run, and a second time to generate an optimized binary
by using information that is collected by running the instrumented binary file through a set of
typical inputs for the application.
Profile-based optimization in the GCC compiler is accessed through the -fprofile-generate
and -fprofile-use options on top of the -O2 and higher optimization levels. The
instrumented binary file is generated by using -fprofile-generate on top of all other options,
and running the resulting binary file generates the profile data (by default, one .gcda file per
object file). For example:
gcc -fprofile-generate -O3 -c a.c
gcc -fprofile-generate -O3 -c b.c
gcc -fprofile-generate -o program a.o b.o
program < sample1
program < sample2
program < sample3
gcc -fprofile-use -O3 -c a.c
gcc -fprofile-use -O3 -c b.c
gcc -fprofile-use -o program a.o b.o
Additional options that are related to GCC PDF include:
-fprofile-correction
-fprofile-dir=PATH
Detailed descriptions about -fprofile-generate and its related options can be found in
Options That Control Optimization, found at:
http://gcc.gnu.org/onlinedocs/gcc-4.6.3/gcc/Optimize-Options.html#Optimize-Options
For more information about this topic, see 7.7, Related publications on page 171.
AES
The following built-in functions are provided for the implementation of the AES algorithm:
vsbox
GCC: vector unsigned long long __builtin_crypto_vsbox (vector unsigned long long)
XL C/C++: vector unsigned char __vsbox (vector unsigned char)
XLF: VSBOX (ARG1), where ARG1 and result are unsigned vector types of kind 1
vcipher
GCC: vector unsigned long long __builtin_crypto_vcipher (vector unsigned long
long, vector unsigned long long)
XL C/C++: vector unsigned char __vcipher (vector unsigned char, vector unsigned
char)
XLF: VCIPHER (ARG1,ARG2), where ARG1, ARG2, and result are unsigned vector
types of kind 1
vcipherlast
GCC: vector unsigned long long __builtin_crypto_vcipherlast (vector unsigned
long long, vector unsigned long long)
XL C/C++: vector unsigned char __vcipherlast (vector unsigned char, vector unsigned
char)
XLF: VCIPHERLAST (ARG1,ARG2), where ARG1, ARG2, and result are unsigned vector
types of kind 1
vncipher
GCC: vector unsigned long long __builtin_crypto_vncipher (vector unsigned long
long, vector unsigned long long)
XL C/C++: vector unsigned char __vncipher (vector unsigned char, vector unsigned
char)
XLF: VNCIPHER (ARG1,ARG2), where ARG1, ARG2, and result are unsigned vector
types of kind 1
vncipherlast
GCC: vector unsigned long long __builtin_crypto_vncipherlast (vector unsigned
long long, vector unsigned long long)
XL C/C++: vector unsigned char __vncipherlast (vector unsigned char, vector
unsigned char)
XLF: VNCIPHERLAST (ARG1,ARG2), where ARG1, ARG2, and result are unsigned
vector types of kind 1
For more information, see AES on page 47.
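As a brief illustration, the following sketch applies one AES encryption round with the GCC
built-in function. It must be compiled with -mcpu=power8; the state and round-key values are
supplied by the caller:
#include <altivec.h>

/* One AES round: SubBytes, ShiftRows, MixColumns, and AddRoundKey. */
vector unsigned long long
aes_encrypt_round(vector unsigned long long state,
                  vector unsigned long long round_key)
{
    return __builtin_crypto_vcipher(state, round_key);
}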
XLF: VPMSUMD (ARG1, ARG2), where ARG1, ARG2, and result are unsigned vector types
of kind 8
For more information, see AES special mode of operation: Galois Counter Mode on
page 48.
SHA-2
The following built-in functions are provided for the implementation of SHA-2 hash functions:
vshasigmad
GCC: vector unsigned long long __builtin_crypto_vshasigmad (vector unsigned long
long, int, int)
XL C/C++: vector unsigned long long __vshasigmad (vector unsigned long long, int, int)
XLF: VSHASIGMAD (ARG1,ARG2,ARG3), where ARG1 and result are unsigned vector
types of kind 8, and ARG2 and ARG3 are integer types
vshasigmaw
GCC: vector unsigned int __builtin_crypto_vshasigmaw (vector unsigned int, int, int)
XL C/C++: vector unsigned int __vshasigmaw (vector unsigned int, int, int)
XLF: VSHASIGMAW (ARG1,ARG2,ARG3), where ARG1 and result are unsigned vector
types of kind 4, and ARG2 and ARG3 are integer types
For more information, see SHA-2 on page 48.
CRC
The following built-in functions are provided for the implementation of the CRC algorithm:
vpmsumd
GCC: vector unsigned long long __builtin_crypto_vpmsum (vector unsigned long long,
vector unsigned long long)
XL C/C++: vector unsigned long long __vpmsumd (vector unsigned long long, vector
unsigned long long)
XLF: VPMSUMD (ARG1, ARG2), where ARG1, ARG2, and result are unsigned vector
types of kind 8
vpmsumw
GCC: vector unsigned int __builtin_crypto_vpmsum (vector unsigned int, vector
unsigned int)
XL C/C++: vector unsigned int __vpmsumw (vector unsigned int, vector unsigned int)
XLF: VPMSUMW (ARG1, ARG2), where ARG1, ARG2, and result are unsigned vector
types of kind 4
vpmsumh
GCC: vector unsigned short __builtin_crypto_vpmsum (vector unsigned short, vector
unsigned short)
XL C/C++: vector unsigned short __vpmsumh (vector unsigned short, vector unsigned
short)
XLF: VPMSUMH (ARG1, ARG2), where ARG1, ARG2, and result are unsigned vector
types of kind 2
vpmsumb
GCC: vector unsigned char __builtin_crypto_vpmsum (vector unsigned char, vector
unsigned char)
XL C/C++: vector unsigned char __vpmsumb (vector unsigned char, vector unsigned
char)
XLF: VPMSUMB (ARG1, ARG2), where ARG1, ARG2, and result are unsigned vector
types of kind 1
For more information about the topic of in-core cryptography, from the processor and OS
perspectives, see:
2.2.7, In-core cryptography and integrity enhancements on page 47 (processor)
4.2.7, On-chip encryption accelerator on page 94 (AIX)
Vector type                Interpretation of content    Range of values
vector unsigned char       16 unsigned char             0..255
vector signed char         16 signed char               -128..127
vector bool char           16 unsigned char             0, 255
vector unsigned short      8 unsigned short             0..65535
vector signed short        8 signed short               -32768..32767
vector bool short          8 unsigned short             0, 65535
vector unsigned int        4 unsigned int               0..2^32-1
vector signed int          4 signed int                 -2^31..2^31-1
vector bool int            4 unsigned int               0, 2^32-1
vector unsigned long long  2 unsigned long long         0..2^64-1
vector signed long long    2 signed long long           -2^63..2^63-1
vector bool long long      2 unsigned long long         0, 2^64-1
vector float               4 float                      IEEE-754 single-precision values
vector double              2 double                     IEEE-754 double-precision values
vector pixel               8 unsigned short             1/5/5/5 pixel
Vector types: The vector double type requires architectures that support the VSX
instruction set extensions, such as the POWER7 processor. You must specify the XL
-qarch=pwr7 -qaltivec compiler options when you use this type, or the GCC
-mcpu=power7 or -mvsx options.
The hardware does not have instructions for supporting vector unsigned long long, vector
bool long long, or vector signed long long. In GCC, you can declare these types, but the only
hardware operation that you can use these types for is vector floating point convert. In 64-bit
mode, vector long is the same as vector long long. In 32-bit mode, these types are not
permitted.
All vector types are aligned on a 16-byte boundary. An aggregate that contains one or more
vector types is aligned on a 16-byte boundary, and padded, if necessary, so that each
member of vector type is also 16-byte aligned. Vector data types can use some of the unary,
binary, and relational operators that are used with primitive data types. All operators require
compatible types as operands unless otherwise stated. For more information about the
operators usage, see the XLC online publications.4, 5, 6
Individual elements of vectors can be accessed by using the VMX or the VSX built-in
functions. For more information about the VMX and the VSX built-in functions, see the
Built-in functions section of Vector Built-in Functions.7
Vector initialization
A vector type is initialized by a vector literal or any expression that has the same vector type.
For example:8
vector unsigned int v1;
vector unsigned int v2 = (vector unsigned int)(10);// XL only, not GCC
v1 = v2;
The number of values in a braced initializer list must be less than or equal to the number of
elements of the vector type. Any uninitialized element is initialized to zero.
Here are examples of vector initialization that use initializer lists:
vector unsigned int v1 = {1};// initialize the first 4 bytes of v1 with 1
// and the remaining 12 bytes with zeros
vector unsigned int v2 = {1,2};// initialize the first 8 bytes of v2 with 1 and 2
// and the remaining 8 bytes with zeros
vector unsigned int v3 = {1,2,3,4};// equivalent to the vector literal
// (vector unsigned int) (1,2,3,4)
For Fortran, consider the Engineering and Scientific Subroutine Library (ESSL), which provides vectorization support: selected routines, including key FFT and BLAS routines, have vector analogs in the library.
For more information about the topic of VSX, from the processor and OS perspectives, see:
The following low-level HTM built-in functions are available in GCC:

unsigned int __builtin_tbegin (void *)
unsigned int __builtin_tend (void *)
unsigned int __builtin_tabort (unsigned int)
unsigned int __builtin_tresume (void)
unsigned int __builtin_tsuspend (void)
unsigned int __builtin_tcheck (void)
unsigned int __builtin_treclaim (unsigned int)
unsigned int __builtin_trechkpt (void)
unsigned int __builtin_tsr (unsigned int)
unsigned long __builtin_get_texasr (void)
unsigned long __builtin_get_texasru (void)
unsigned long __builtin_get_tfhar (void)
unsigned long __builtin_get_tfiar (void)
#include <htmintrin.h>

if (__builtin_tbegin (0))
  {
    /* Transaction State Initiated. */
    if (is_locked (lock))
      __builtin_tabort (0);
    a = b + c;
    __builtin_tend (0);
  }
else
  {
    /* Transaction State Failed. Use locks. */
    acquire_lock (lock);
    a = b + c;
    release_lock (lock);
  }
A slightly more complicated example is shown in Example 7-4. This example shows an
attempt to retry the transaction a specific number of times before falling back to using
locks.
Example 7-4 Complex use of HTM built-in functions
#include <htmintrin.h>

int num_retries = 10;

while (1)
  {
    if (__builtin_tbegin (0))
      {
        /* Transaction State Initiated. */
        if (is_locked (lock))
          __builtin_tabort (0);
        a = b + c;
        __builtin_tend (0);
        break;
      }
    else
      {
        /* Transaction State Failed. Use locks if the transaction
           failure is "persistent" or we've tried too many times. */
        if (num_retries-- <= 0
            || _TEXASRU_FAILURE_PERSISTENT (__builtin_get_texasru ()))
          {
            acquire_lock (lock);
            a = b + c;
            release_lock (lock);
            break;
          }
      }
  }
In some cases, it can be useful to know whether the code that is being run is in the
transactional state or not. Unfortunately, that cannot be determined by analyzing the HTM
Special Purpose Registers (SPRs). That specific information is contained only within the
Machine State Register (MSR) Transaction State (TS) bits, which are not accessible by
user code. To allow access to that information, we have added one final built-in function
and some associated macros to help the user to determine what the transaction state is at
a particular point in their code:
unsigned int __builtin_ttest (void)
Usage of the built-in function and its associated macro might look like the code that is
shown in Example 7-5.
Example 7-5 Determine the transaction state
#include <htmintrin.h>

unsigned char tx_state = __builtin_ttest ();

if (_HTM_STATE (tx_state) == _HTM_TRANSACTIONAL)
  {
    /* Code to use in transactional state. */
  }
else if (_HTM_STATE (tx_state) == _HTM_NONTRANSACTIONAL)
  {
    /* Code to use in non-transactional state. */
  }
else if (_HTM_STATE (tx_state) == _HTM_SUSPENDED)
  {
    /* Code to use in transaction suspended state. */
  }
A second option for using HTM is by using the slightly higher-level inline functions that are
common to GCC and the IBM XL compilers on both POWER and System z. These sets
of common HTM built-in functions are defined in the htmxlintrin.h header file and can be
used to write code that can be compiled on POWER or System z by using either the IBM
XL or GCC compilers. See Example 7-6.
Example 7-6 HTM intrinsic functions common to IBM XL and GCC compilers
long __TM_simple_begin (void)
long __TM_begin (void* const TM_buff)
long __TM_end (void)
void __TM_abort (void)
long __TM_is_user_abort (void* const TM_buff)
long __TM_is_illegal (void* const TM_buff)
long __TM_is_conflict (void* const TM_buff)
long __TM_is_failure_persistent (void* const TM_buff)
long __TM_nesting_depth (void* const TM_buff)
Using these built-in functions, you can create a more portable version of the code that is
shown in Example 7-4 on page 157 so that it works on POWER and on System z, by using
either GCC or the XL compilers. This more portable version is shown in Example 7-7.
Example 7-7 Complex HTM usage using portable HTM intrinsics
#ifdef __GNUC__
# include <htmxlintrin.h>
#endif

int num_retries = 10;
TM_buff_type TM_buff;

while (1)
  {
    if (__TM_begin (TM_buff) == _HTM_TBEGIN_STARTED)
      {
        /* Transaction State Initiated. */
        if (is_locked (lock))
          __TM_abort ();
        a = b + c;
        __TM_end ();
        break;
      }
    else
      {
        /* Transaction State Failed. Use locks if the transaction
           failure is "persistent" or we've tried too many times. */
        if (num_retries-- <= 0
            || __TM_is_failure_persistent (TM_buff))
          {
            acquire_lock (lock);
            a = b + c;
            release_lock (lock);
            break;
          }
      }
  }
The third and most portable option uses a high-level language interface that is
implemented by GCC and the GNU Transactional Memory Library (LIBITM), which is
described at the following website:
http://gcc.gnu.org/wiki/TransactionalMemory
This high-level language option is enabled by using the -fgnu-tm option (-mcpu=power8
and -mhtm are not needed), and it provides a common transactional model across multiple
architectures and multiple compilers by using the __transaction_atomic {...} language
construct. The LIBITM library, which is included with the GCC compiler, can determine, at
run time, whether it is running on a processor that supports HTM instructions, and, if so, it
uses them in running the transaction. Otherwise, it automatically falls back to using
software TM, which relies on locks. LIBITM also can retry a transaction by using HTM if
the initial transaction begin failed, similar to the complicated example (Example 7-4 on
page 157). An example of the third option that is equivalent to the complicated examples
(Example 7-4 on page 157 and Example 7-7 on page 159) is simple and is shown in
Example 7-8.
Example 7-8 GNU Transactional Memory Library (LIBITM) Usage
__transaction_atomic
{
a = b + c;
}
Support for the HTM built-in functions, the XL HTM built-in functions, and LIBITM support will
be in an upcoming Free Software Foundation (FSF) version of GCC. However, it is also
available in the GCC 4.8-based compiler that is shipped in Advance Toolchain (AT) V7.0.
For more information about the topic of TM, from the processor, OS, and compiler
perspectives, see:
7.4.1 Introduction
FDPR optimizes the executable binary file of a program by collecting information about the
behavior of the program while the program is used for a typical workload, and then creates a
new version of the program that is optimized for that workload. Both main executable and
dynamically linked libraries (DLLs) are supported.
FDPR performs global optimizations at the level of the entire executable, including statically linked library code. Because the executable to be optimized by FDPR is not relinked, the compiler and linker conventions do not need to be preserved, thus allowing aggressive optimizations that are not available to optimizing compilers.
The main advantage that is provided by FDPR is the reduced footprint of both code and data,
resulting in more effective cache usage. The principal optimizations of FDPR include global
code reordering, global data reordering, function inlining, and loop unrolling, along with
various tuning options that are tailored for the specific POWER target. The effectiveness of
the optimization depends largely on how representative the collected profile is regarding the
true workload.
FDPR runs on both AIX and Linux and produces optimized code for all versions of the Power
Architecture. The POWER7 processor is its default target architecture.
Figure 7-1 shows how FDPR is used to optimize executable programs.

Figure 7-1 FDPR operation: (1) instrumentation turns the input executable into an instrumented executable; (2) running the instrumented executable on a typical workload collects a profile; (3) optimization combines the input executable with the profile to produce the optimized executable
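In command form, the three steps can look like the following sketch, where my_prog is a placeholder and the instrumented executable and profile file names follow the tool defaults on your system:

$ fdprpro -a instr my_prog                              # 1. create the instrumented executable
$ ./my_prog.instr <typical workload>                    # 2. run it to collect the profile
$ fdprpro -a opt my_prog -f <profile> -o my_prog.fdpr   # 3. create the optimized executable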
Instrumentation stack
The instrumentation uses the stack to save registers, dynamically allocating space on the stack at a default location below the current stack pointer. On AIX, this default location is at offset -10240, and on Linux, it is at offset -1800. In some cases, especially in multi-threaded applications where the stack space is divided between the threads, a deep calling sequence can bring the application quite close to the end of its stack, which can cause the instrumented application to fail. To allocate the instrumentation area closer to the current stack pointer, use the -iso option:
$ fdprpro -a instr my_prog -iso -300
7.4.6 Optimization
The optimization step is performed by running the following command:
$ fdprpro -a opt in [-o out] -f prof [opts]
If out is not specified, the output file is in.fdpr. There is no default profile file; if none is specified or if the profile is empty, the resulting output binary file is not optimized.
Code reordering
Global code reordering works in two phases: making chains and reordering the chains.
The initial chains are sequentially ordered basic blocks, with branch conditions inverted where
necessary, so that branches between the basic blocks are mostly not taken. This
configuration makes instruction prefetching more efficient. Chains are terminated when the
heat (that is, execution count) goes below a certain threshold relative to the initial heat.
165
The second phase orders chains by successively merging the two most strongly linked chains, based on how frequent the calls between the chains are. Combining chains crosses
function boundaries. Thus, a function can be broken into multiple chunks in which different
pieces of different functions are placed closely if there is a high frequency of call, branch, and
return between them. This approach improves code locality and thus i-cache and page
table efficiency.
You use the following options for code reordering:
--reorder-code (-RC): This component is the hard-working component of the global code
reordering. Use --rcaf to determine the aggressiveness level:
0: No change
1: Standard (default)
2: Most aggressive
Use --rcctf to lower the threshold for terminating chains. Use -pp to preserve function
integrity and -pc to preserve CSECT integrity (AIX only). These two options limit global
code reordering and might be requested for ease of debugging.
--branch-folding (-bf) and --branch-prediction (-bp): These options control important
parts of the code reordering process. The -bf folds branch to branch into a single branch.
The -bp sets the static branch prediction bit when taken or not taken statistics justify it.
Function inlining
FDPR performs function inlining of function bodies into their respective calling sites if the call
site is selected by one of a number of user-selected filters:
Dominant callers (--selective-inlining (-si), -sidf f, and -siht f): The filter criterion here is that the call site is dominant regarding other callers of the called function (the callee). It is controlled by two attributes. The -sidf option sets the domination percentage threshold (default 80). The -siht option further restricts the selection to functions hotter than the threshold, which is specified as a percentage relative to the average (default 100).
Hot functions (--inline-hot-functions f (-ihf f)): This filter selects inlining for all call
sites where the call is hotter than the heat threshold (in percent, relative to the average).
Small functions (--inline-small-functions f (-isf f)): This filter selects for inlining all
functions whose size, in bytes, is smaller than or equal to the parameter.
Selective hot code (--selective-hot-code-inline f (-shci f)): The filter computes how
much execution count is saved if the function is inlined at a call site and selects those sites
where the relative saving is above the percentage.
De-virtualization
De-virtualization is addressed by the --ptrgl-optimization (-pto) option. The call-by-pointer mechanism (ptrgl) sets a new TOC anchor, loads the function address, moves it to the count register (CTR), and branches indirectly through the CTR. The -pto option optimizes this
mechanism in cases where there are few hot targets from a calling site. In terms of C++, it
de-virtualizes the virtual method calls by calling the actual targets directly. The optimized
code compares the address of the function descriptor, which is used for the indirect call,
against the address of a hot candidate, as identified in the profile, and conditionally calls such
a target directly. If none of the hot targets match, the code starts the original indirect call
mechanism. The idea is that most of the time the conditional direct branches are run instead
of the ptrgl mechanism. The impact of the optimization on performance depends heavily on
the function call profile.
166
Performance Optimization and Tuning Techniques for IBM Power Systems Processors Including IBM POWER8
The following thresholds can help to tune the optimization and to adjust it to different
workloads:
Use -ptoht thres to set the frequency threshold for indirect calls that will be optimized
(thres can be 0 - 1, with 0.8 by default).
Use -ptosl n to set the limit of the number of hot functions to optimize in a given indirect
call site (the default for n is 3).
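For example, a hypothetical invocation that loosens the frequency threshold and limits each site to two hot targets might look like this (my_prog and the profile file name are placeholders):

$ fdprpro -a opt my_prog -f my_prog.prof -pto -ptoht 0.6 -ptosl 2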
Loop-unrolling
Most programs spend their time in loops. This statement is true regardless of the target
architecture or application. FDPR has one option to control the unrolling optimization for
loops: --loop-unrolling factor (-lu factor).
FDPR optimizes a loop by using a technique that is called loop-unrolling. By unrolling a loop
n times, the number of back branches is reduced n times, so code prefetch efficiency can be
improved. The downside with loop-unrolling is code inflation, which results in increased code
footprint and increased i-cache misses. Unlike traditional loop-unrolling, FDPR can mitigate
this problem by unrolling only the hottest paths in the loop. The factor parameter determines
the aggressiveness of the optimization. With -O3, the optimization is started with -lu 9.
By default, loops are unrolled two times. Use -lu factor to change that default.
Architecture-specific optimizations
Here are some architecture-specific optimizations:
--machine tgt (-m tgt): FDPR optimizations include general optimizations that are based on a high-level program representation as a control and data flow, in addition to peephole optimizations that rely on different architecture features. Those optimizations can perform better when they are tuned for specific platforms. The -m flag allows the user to specify the target machine model in cases where the program is not intended for use on multiple target platforms. The default target is the POWER7 processor.
--align-code code (-A code): Optimizing the alignment and the placement of the code is crucial to the performance of the program. Correct alignment can improve instruction fetching and dispatching. The alignment algorithm in FDPR uses different techniques that are based on the target platform. Some techniques are generic for the Power Architecture, and others follow the dispatch rules of the specific machine model. If code is 1 (the default), FDPR applies a standard alignment algorithm that is adapted for the selected target machine (see -m in the previous bullet point). If code is 2, FDPR applies a more advanced version, using dispatch rules and other heuristics to decide how the program code chunks are placed relative to i-cache sectors, again based on the selected target. A value of 0 disables the alignment algorithm.
Function optimization
FDPR includes a number of function level optimizations that are based on detailed data flow
analysis (DFA). With DFA, optimizations can determine the data that is contained in each
register at each point in the function and whether this value is used later.
167
Peephole optimization
Peephole optimization requires only a small context around the specific site in the code that is being optimized. The more important peephole optimizations that FDPR performs are -las, -tlo, and -nop.
--load-after-store (-las): In recent Power Architectures, when a load instruction from
address A closely follows a store to that address, it can cause the load to be rejected. The
instruction is then tried in a slower mode, which produces a large performance penalty.
This behavior is also called Load-Hit-Store (LHS). With the -las optimization, the load is
pushed further from the store, thus avoiding the reject condition.
--toc-load-optimization (-tlo): The TOC is a data section in programs where pointers
are kept to avoid the lengthy address computation at run time. Loading an address (a
pointer) is a costly operation and FDPR can reduce the amount of processing if the
address is close enough to the TOC anchor (R2). In such cases, the load from TOC is
replaced by addi Rt,R2,offset, where R2+offset equals a loaded address. The
optimization is performed after data is reordered so that commonly accessed data is
placed closer to R2, increasing the potential of this optimization. A TOC is used in 32-bit
and 64-bit programs on AIX, and in 64-bit programs on Power Systems running Linux.
Linux 32-bit uses a GOT, but this optimization is not relevant here.
--nop-removal (-nop): The compiler (or the linker) sometimes inserts no-operation (NOP)
instructions in various places to create some necessary space in the instruction stream.
The most common place is following a function call in code. Because the call might have
modified the TOC anchor register (R2), the compiler inserts a load instruction that resets
R2 to its correct value for the current function. Because FDPR has a global view of the
program, the optimization can remove the NOP if the called function uses the same TOC
(the TOC anchor is used in AIX and in Linux 64-bit).
Data reordering
The profile that is collected by FDPR provides important information about the running of
branch instructions, thus enabling efficient code reordering. The profile does not provide
direct information about whether to put specific objects one after the other. Nevertheless,
FDPR can infer such a placement by using the collected profile.
Here are the relevant options:
--reorder-data (-RD): This optimization reorders data by placing pointers and data closer
to the TOC anchor, depending on their hotness. FDPR uses a heuristic where the hotness
is computed as the total count of basic blocks where the pointer to the data was retrieved
from the TOC.
--reduce-toc thres (-rt thres): The optimization removes from the TOC entries that are
colder than the threshold. Their access, if any, is replaced by computing the address (see
-tlo optimization in Peephole optimization on page 168). Typically, you use -rt 0, which
removes only the entries that are never accessed.
Combination optimizations
FDPR has predefined optimization sets that provide a good starting point for
performance tuning:
-O: Performs code reordering (-RC) with the branch prediction bit setting (-bp), branch
folding (-bf), and NOOP instructions removal (-nop).
-O2: Adds to -O function de-virtualization (-pto), TOC-load optimization (-tlo), function
inlining (-isf 8), and some function optimizations (-hr, -see 0, and -kr).
-O3: Turns on data reordering (-RD and -rt 0), loop-unrolling (-lu), more aggressive
function optimization (-see 1 and -vro), and employs more aggressive inlining (-lro and
-isf 12). This set provides an aggressive but still stable set of optimizations that are
beneficial for many benchmarks and applications.
-O4: Essentially turns on more aggressive inlining (-sidf 50, -ihf 20, and -shci 90). As a
result, the number of branches is reduced, but at the cost of increasing the code footprint.
This option works well with large i-caches or with small to medium programs/threads.
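For example, a typical aggressive run that uses one of these predefined sets might look like the following sketch (the executable and profile file names are placeholders):

$ fdprpro -a opt my_prog -f my_prog.prof -O3 -o my_prog.fdpr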
7.5 Using the Advance Toolchain with IBM XLC and XLF
For XLC13 and XLF15, the new_install script that is shipped with the Linux package includes a new feature. Run this script, and it detects whether the Advance Toolchain (AT) is installed in the environment. If it is, the script automatically generates a configuration file with the AT information specified and creates a new invocation command that is named xlc_at, which uses the generated configuration file. You can then use the xlc_at invocation to combine the XL compiler with the AT runtime.
NVIDIA also provides a set of CUDA libraries that include highly optimized kernels for various
purposes. Some of these libraries and tools can be found at the following website:
https://developer.nvidia.com/gpu-accelerated-libraries
One of these libraries is NVBLAS, a drop-in BLAS implementation that automatically offloads certain BLAS calls to GPU kernels from the cuBLAS library. For more information about NVBLAS, go to the following website:
http://docs.nvidia.com/cuda/nvblas/index.html
CUDA kernels are written as C functions. NVIDIA also provides a C++ library that is called
Thrust to better integrate with existing C++ applications. Here is a Thrust example:
void sortVector( int* values, int count )
{
thrust::device_vector<int> d_vec(count); // create device memory
thrust::copy(values, values+count, d_vec.begin()); // copy data to GPU
thrust::sort(d_vec.begin(), d_vec.end()); // call builtin sort kernel
thrust::copy(d_vec.begin(), d_vec.end(), values); // copy result from GPU
}
The Thrust library also supports functors and integrates well with the C++ Standard Template Library. For more information about the Thrust library, see the following website:
https://developer.nvidia.com/thrust
For more information about CUDA and POWER8 processor-based systems, see NVIDIA
CUDA on IBM POWER8: Technical Overview, Software Installation, and Application,
REDP-5169.
XL Compiler Documentation:
C and C++ Compilers
Optimization and Programming Guide - XL C/C++ for AIX, V12.1, found at:
http://www.ibm.com/support/docview.wss?uid=swg27024208
Fortran compilers
Optimization and Programming Guide - XL Fortran for AIX, V14.1, found at:
http://www.ibm.com/support/docview.wss?uid=swg27024219
Chapter 8.
Java
This chapter describes the optimization and tuning of Java-based applications that are running on a POWER8 processor-based system. It covers the following topics:
For more information about this topic, see 8.8, Related publications on page 192.
The supported page sizes depend on the processor generation and the operating system level:
4 KB: all platforms; Linux: RHEL 5, SLES 10, and earlier; AIX: all versions; no user configuration is required.
64 KB: POWER5+ processor-based systems or later; Linux: RHEL 6, SLES 11; no user configuration is required.
16 MB: POWER4 processor-based systems or later; Linux: RHEL 5, SLES 11; user configuration is required.
8.3.1 Medium and large pages for Java heap and code cache
Medium and large pages can be enabled for the Java heap and JIT code cache
independently of other memory areas. IBM JVM supports at least three page sizes,
depending on the platform:
4 KB (default)
64 KB
16 MB
Large pages, specifically 16 MB pages, do have some processing impact and are best suited
for long-running applications with large memory requirements. The -Xlp64k option provides
many of the benefits of 16 MB pages with less impact and can be suitable for workloads that
benefit from large pages but do not take full advantage of 16 MB pages.
Starting with IBM Java 6 SR7, the default page size is 64 KB.
Starting with IBM Java 7 SR4 (and Java 6.2.6 SR5), more command-line options are available to specify the page size for the Java heap and the code cache: the -Xlp:objectheap:pagesize=<size> and -Xlp:codecache:pagesize=<size> options are supported. To obtain the available large page sizes and the current setting, use the -verbose:sizes option. The current settings are the requested sizes and not the sizes that are obtained.
8.3.2 Configuring large pages for Java heap and code cache
In an AIX environment, to use large pages with Java requires both configuring the large
pages and setting the v_pinshm tunable to a value of one by running vmo. The following
example demonstrates how to configure dynamically 1 GB of 16 MB pages and set the
v_pinshm tunable:
# vmo -o lgpg_regions=64 -o lgpg_size=16777216 -o v_pinshm=1
To configure permanently large pages, the -r option must be specified with the vmo
command. Run bosboot to configure the large pages at boot time:
# vmo -r -o lgpg_regions=64 -o lgpg_size=16777216 -o v_pinshm=1
# bosboot -a
Non-root users must have the CAP_BYPASS_RAC_VMM capability on AIX enabled to use large
pages. The system administrator can add this capability by running chuser:
# chuser capabilities=CAP_BYPASS_RAC_VMM,CAP_PROPAGATE <user_id>
On Linux, 1 GB of 16 MB pages are configured by running echo:
# echo 64 > /proc/sys/vm/nr_hugepages
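To make the reservation persist across restarts, the same value can be recorded in /etc/sysctl.conf; a sketch, using the 64-page reservation from the example above:

# echo "vm.nr_hugepages=64" >> /etc/sysctl.conf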
8.3.3 Prefetching
Prefetching is an important strategy to reduce memory latency and take full advantage of
on-chip caches. The -XtlhPrefetch option can be specified to enable aggressive prefetching
of thread-local heap memory shortly before objects are allocated. This option ensures that
the memory that is required for new objects that are allocated from the TLH is fetched into the
cache ahead of time if possible, reducing latency and increasing overall object allocation
speed.
A POWER8 processor-based system has increased cache sizes compared to POWER7 and
POWER7+ processor-based systems, and also features an additional L4 cache. Therefore, it
is important to conduct thorough performance evaluations with TLH prefetching to determine
whether it is beneficial to the application being run. The -XnotlhPrefetch option can be used
to disable explicitly TLH prefetching if it is enabled by default. This option can provide
noticeable gains for workloads that frequently allocate objects, such as transactional
workloads, but it can also hurt performance if prefetching causes more important data to be
thrown out of the cache.
In addition to the TLH prefetching, POWER processors feature a hardware prefetching engine
that can detect certain memory allocation patterns and effectively prefetch memory.
Applications that access memory in a linear, predictable fashion can benefit from enabling
hardware prefetching, but this must be done in cooperation with the operating system. For a
brief description of the dscrctl and ppc64_cpu commands that can be used to affect hardware
prefetching on AIX and Linux respectively, see 1.5.1, Lightweight tuning and optimization
guidelines on page 7.
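For illustration, hardware prefetching can be disabled system-wide along these lines (a sketch; a DSCR value of 1 disables the prefetcher, and 0 restores the hardware default):

AIX:   # dscrctl -n -s 1
Linux: # ppc64_cpu --dscr=1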
Example 8-1 depicts the failure. For a heap size of 4 GB, large pages are requested
(requestedPageSize is 0x1000000, or 16 MB), but conventional pages are obtained (pageSize is 0x10000, or 64 KB); the problem does not occur without compressed references (the -Xnocompressedrefs option).
Example 8-1 The requestedPageSize and pageSize attributes in the verbose GC log (failure)
# java -Xlp -Xmx4g -Xms4g -verbose:gc -version 2>&1 | grep -i pagesize
<attribute name="pageSize" value="0x10000" />
<attribute name="requestedPageSize" value="0x1000000" />
# java -Xnocompressedrefs -Xlp -Xmx4g -Xms4g -verbose:gc -version 2>&1 | grep -i pagesize
<attribute name="pageSize" value="0x1000000" />
<attribute name="requestedPageSize" value="0x1000000" />
To resolve the problem, you can choose one of the following methods:
1. Disable prelink completely.
This can be accomplished by removing the prelink settings and package. Consider its
applicability and requirement in your environment before proceeding. To accomplish this
task, for example, on RHEL 6, run the following commands:
# prelink --undo --all
# yum remove prelink
2. Disable prelink selectively for the affected shared libraries.
This can be accomplished by discovering the shared libraries that are relinked to the
conflicting virtual memory segment, then reverting, and then disabling the prelink setting
for those shared libraries in the configuration files of the prelink utility. Use the following
steps to perform this action:
a. Create and compile a Java program that simply waits for some time (5 minutes). It
allows you to examine the shared libraries in the memory map of the JVM. See
Example 8-2.
Example 8-2 Java program for waiting 5 minutes
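The body of this example can be as simple as a timed sleep; the following is a minimal sketch (the class name SleepFiveMinutes matches its use in Example 8-3):

public class SleepFiveMinutes
{
    public static void main(String[] args) throws InterruptedException
    {
        Thread.sleep(5 * 60 * 1000);    // wait 5 minutes
    }
}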
Example 8-3 Discover shared libraries in the JVM 4 GB - 1 TB virtual memory segment
# java SleepFiveMinutes &
[1] 3154
# libs="$(grep '^[0-9a-z]\{9,10\}-' /proc/3154/smaps | sort -u -k6 | awk '{ print $6 }')"
# grep '^[0-9a-z]\{9,10\}-' /proc/3154/smaps | sort -u -k6
8001230000-8001240000 rw-p 00000000 00:00 0
8001000000-8001030000 r-xp 00000000 fd:00 524686   /lib64/ld-2.12.so
8001050000-8001210000 r-xp 00000000 fd:00 524687   /lib64/libc-2.12.so
8001240000-8001250000 r-xp 00000000 fd:00 524689   /lib64/libdl-2.12.so
8001470000-8001490000 r-xp 00000000 fd:00 524696   /lib64/libgcc_s-4.4.7-20120601.so.1
8001330000-8001410000 r-xp 00000000 fd:00 524695   /lib64/libm-2.12.so
8001270000-8001290000 r-xp 00000000 fd:00 524697   /lib64/libpthread-2.12.so
8001e00000-8001e20000 r-xp 00000000 fd:00 524704   /lib64/libresolv-2.12.so
80012b0000-80012c0000 r-xp 00000000 fd:00 524386   /lib64/librt-2.12.so
# for lib in $libs; do objdump -p $lib | grep -m1 LOAD | awk '{ printf $5 }'; echo " $lib"; done
0x0000008001000000 /lib64/ld-2.12.so
0x0000008001050000 /lib64/libc-2.12.so
0x0000008001240000 /lib64/libdl-2.12.so
0x0000008001470000 /lib64/libgcc_s-4.4.7-20120601.so.1
0x0000008001330000 /lib64/libm-2.12.so
0x0000008001270000 /lib64/libpthread-2.12.so
0x0000008001e00000 /lib64/libresolv-2.12.so
0x00000080012b0000 /lib64/librt-2.12.so
# kill -9 3154
c. Revert the prelink setting to the shared libraries (which has an immediate effect), and
configure the ibm-java.conf prelink configuration file so that it does not relink those
shared libraries anymore (by using the -b option). These steps are described in
Example 8-4.
Example 8-4 Revert and disable the prelink setting to the shared libraries
# for lib in $libs; do prelink --undo $lib; echo "-b $lib" >> /etc/prelink.conf.d/ibm-java.conf; done
# for lib in $libs; do objdump -p $lib | grep -m1 LOAD | awk '{ printf $5 }'; echo " $lib"; done
0x0000000000000000 /lib64/ld-2.12.so
0x0000000000000000 /lib64/libc-2.12.so
0x0000000000000000 /lib64/libdl-2.12.so
0x0000000000000000 /lib64/libgcc_s-4.4.7-20120601.so.1
0x0000000000000000 /lib64/libm-2.12.so
0x0000000000000000 /lib64/libpthread-2.12.so
0x0000000000000000 /lib64/libresolv-2.12.so
0x0000000000000000 /lib64/librt-2.12.so
# cat /etc/prelink.conf.d/ibm-java.conf
-b /lib64/ld-2.12.so
-b /lib64/libc-2.12.so
-b /lib64/libdl-2.12.so
-b /lib64/libgcc_s-4.4.7-20120601.so.1
-b /lib64/libm-2.12.so
-b /lib64/libpthread-2.12.so
-b /lib64/libresolv-2.12.so
-b /lib64/librt-2.12.so
After performing one of these methods, the problem should be resolved.
You can verify that the values of the requestedPageSize and pageSize attributes are now equal, and inspect the 4 GB - 1 TB virtual memory segment to verify that no shared libraries remain there. Example 8-5 shows that verification: the values are equal, and the memory map lists the heap segment of 4 GB size (0x800000000 - 0x700000000 = 0x100000000 bytes = 4 GB) with 16 MB pages.
Example 8-5 The requestedPageSize and pageSize attributes in the verbose GC log (success)
Two techniques can be used to determine whether the code cache allocation sizes or total
limit must be altered. First, a Java core file can be produced by running kill -3 <pid> at the
end/stable state of your application. The core file shows how many pieces of code cache are
allocated. The active amount of code cache can be estimated by summing all of the pieces.
For example, if 20 MB is needed to run the application, -Xcodecache5m (four pieces of 5 MB
each) typically allocates 20 MB code caches at start time, and they are likely close to each
other and have better performance for cross-code cache calls. Second, to determine whether
the total code cache is sufficient, the -Xjit:verbose option can be used to print method
names as they are compiled. If compilation fails because the limit of code cache is reached,
an error to that effect is printed.
The shared classes cache can contain the following items:
Bootstrap classes
Application classes
Metadata that describes the classes
Ahead-of-time (AOT) compiled code
Starting with IBM Java 8, SHA2 (for example, SHA224, SHA256, SHA384, and SHA512) is
accelerated by using POWER8 in-core SHA instructions. SHA2 is enabled by default and no
command-line parameter is required. In-core SHA instructions can increase speed, as
compared with equivalent JIT-generated code.
Although the Java heap is a contiguous range of memory addresses, any region within that
range can be committed or released as required. This situation enables the balanced
collector to contract the heap more dynamically and aggressively than other garbage
collectors, which typically require the committed portion of the heap to be contiguous. Java
heap configuration for the -Xgcpolicy:balanced strategy can be specified through the -Xmn,
-Xmx, and -Xms options.
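For example, a hypothetical invocation that selects the balanced collector and fixes the heap sizes might look like this (the application name and the sizes are placeholders):

# java -Xgcpolicy:balanced -Xms4g -Xmx4g -Xmn1g MyApp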
A POWER8 core supports four threading modes:
ST
SMT2
SMT4
SMT8
The default SMT mode on a POWER7 or later processor depends on the AIX version and the compatibility mode with which the processor cores are running. Table 8-3 shows the default SMT modes.
Table 8-3 The default SMT mode depends on the AIX version and compatibility mode

AIX version    Compatibility mode    Default SMT mode
AIX V7.1       POWER8                SMT4
AIX V6.1       POWER7                SMT4
AIX V6.1       POWER6/POWER6+        SMT2
AIX 5L V5.3    POWER6/POWER6+        SMT2
Most applications benefit from SMT. However, some applications do not scale with an
increased number of logical CPUs on an SMT-enabled system. One way to address such an
application scalability issue is to make a smaller LPAR or use processor binding, as described
in 8.6.2, Using resource sets on page 187.
Additionally, if you need improved performance from your larger new system, there is a
potential alternative. If your application semantics support it, you might be able to run multiple
smaller instances of your application, each bound to exclusive processors/cores. In this way,
the aggregate performance from the multiple instances of your application might be able to
meet your performance expectations. This alternative is one of the WebSphere preferred
practices. For more information about selecting an appropriate SMT mode, see Scalability
challenges when moving from a POWER5 or POWER6 processor-based system to a
POWER7 or POWER8 processor-based system on page 208.
For applications that might benefit from a lower SMT mode with fewer logical CPUs,
experiment with using SMT2 or ST modes. For more information, from the processor and OS
perspectives, see:
AIX environment
In an AIX environment, RSETs allow specifying on which logical CPUs an application can run.
They are useful when an application that does not scale beyond a certain number of logical
CPUs is run on a large LPAR, for example, an application that scales up to eight logical CPUs
but is run on an LPAR that has 64 logical CPUs.
For more information, see The POWER8 processor and affinity performance effects on
page 16. An example is included in Partition sizes and affinity on page 16.
RSETs can be created with the mkrset command and attached to a process by using the
attachrset command. An alternative way is creating an RSET and attaching it to an
application in a single step by using the execrset command.
The following example demonstrates how to use execrset to create an RSET with CPUs 4 - 7
and run an application that is attached to it:
execrset -c 4-7 -e <application>
In addition to running the application that is attached to an RSET, set the MEMORY_AFFINITY environment variable to MCM to ensure that the application's private and shared memory is allocated from memory that is local to the logical CPUs of the RSET:
MEMORY_AFFINITY=MCM
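Combining the two, a hypothetical invocation that binds a Java application to CPUs 4 - 7 with local memory allocation might look like this (MyApp is a placeholder):

# MEMORY_AFFINITY=MCM execrset -c 4-7 -e java MyApp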
In general, RSETs are created on core boundaries. For example, a partition with four
POWER8 cores that are running in SMT4 mode has 16 logical CPUs. Create an RSET with
four logical CPUs by selecting four SMT threads that belong to one core. Create an RSET
with eight logical CPUs by selecting eight SMT threads that belong to two cores. The smtctl
command can be used to determine which logical CPUs belong to which core, as shown
in Example 8-6.
Example 8-6 Use the smtctl command to determine which logical CPUs belong to which core
# smtctl
This system is SMT capable.
This system supports up to 4 SMT threads per processor.
SMT is currently enabled.
SMT boot mode is not set.
SMT threads are bound to the same physical processor.

proc0 has 4 SMT threads.
Bind processor 0 is bound with proc0
Bind processor 1 is bound with proc0
Bind processor 2 is bound with proc0
Bind processor 3 is bound with proc0

proc4 has 4 SMT threads.
Bind processor 4 is bound with proc4
Bind processor 5 is bound with proc4
Bind processor 6 is bound with proc4
Bind processor 7 is bound with proc4
The smtctl output in Example 8-6 shows that the system is running in SMT4 mode, with bind processors (logical CPUs) 0 - 3 belonging to proc0 and bind processors 4 - 7 belonging to proc4. Create an RSET with four logical CPUs either for CPUs 0 - 3 or for CPUs 4 - 7.
To achieve the best performance with RSETs that are created across multiple cores, all cores
of the RSET must be from the same chip and in the same scheduler resource allocation
domain (SRAD). The lssrad command can be used to determine which logical CPUs belong
to which SRAD, as shown in Example 8-7:
Example 8-7 Use the lssrad command to determine which logical CPUs belong to which SRAD
# lssrad -av
REF1   SRAD        MEM      CPU
0
          0    22397.25    0-31
1
          1    29801.75    32-63
The output in Example 8-7 shows a system that has two SRADs. CPUs 0 - 31 belong to the
first SRAD, and CPUs 32 - 63 belong to the second SRAD. In this example, create an RSET
with multiple cores either by using the CPUs of the first or second SRAD.
Authority for RSETs: A user must have root authority or have CAP_NUMA_ATTACH capability
to use RSETs.
Linux environment
In a Linux environment, the equivalent to execrset is the taskset command. The following
example demonstrates how to use taskset to create a taskset with CPUs 4 - 7 and run an
application that is attached to it:
taskset -c 4-7 <application>
There is no equivalent environment variable to MEMORY_AFFINITY on Linux; however, there is a
command, numactl, that can accomplish the same task as MEMORY_AFFINITY and the execrset
and taskset commands. For example:
numactl [-l | --localalloc] -C 4-7 <application>
The -l | --localalloc option is analogous to MEMORY_AFFINITY=MCM.
In general, for both the gencon and optavgpause GC policies, concurrent marking can be
tuned with the -Xconcurrentlevel<number> option, which specifies the ratio between the
amount of heap that is allocated and heap that is marked. The default value is 8. The number
of low-priority mark threads can be set with the -Xconcurrentbackground<number> option. By
default, one thread is used for concurrent marking.
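For example, a hypothetical invocation that lowers the concurrent marking ratio and uses two background mark threads might look like this (MyApp is a placeholder):

# java -Xgcpolicy:gencon -Xconcurrentlevel4 -Xconcurrentbackground2 MyApp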
For more information about this topic, see 8.8, Related publications on page 192.
The JIT compiler can offload a parallel loop to the GPU only when the loop body avoids certain constructs, including the following ones:
Method invocations
Intermediate operations, such as map or filter
User-defined exceptions
New/delete statements
To enable GPU processing of the parallel loops, set the -Xjit:enableGPU option on the
command line when you start your Java application.
For more information, see the Java 8 SDK documentation, found at:
https://www-01.ibm.com/support/knowledgecenter/SSYKE2_8.0.0/com.ibm.java.lnx.80.doc/diag/understanding/gpu_jit.html
// C source code
#include <jni.h>

JNIEXPORT void JNICALL Java_CudaTest1_grayscale0(JNIEnv* env, jclass,
        jobject buffer, jint width, jint height)
{
    unsigned char* bytes = (unsigned char*)env->GetDirectBufferAddress(buffer);
    int length = width * height * 3; // 3 bytes per pixel
    grayscale(bytes, width, height); // see C/C++ section on example implementation
}
Chapter 9.
IBM DB2
This chapter describes the optimization and tuning of DB2 running on POWER
processor-based servers. It covers the following topics:
9.2.1 Affinitization
A simple way to achieve affinitization on POWER7 and POWER8 processor-based systems is
through the DB2 registry variable DB2_RESOURCE_POLICY. In general, this variable defines a
policy that outlines which operating system resources are available for DB2 databases. When
this variable is set to AUTOMATIC, the DB2 database system automatically detects the POWER
hardware topology and computes the best way to assign engine dispatchable units (EDUs) to
various hardware modules. The goal is to determine the most efficient way to share memory
between multiple EDUs that need access to the same regions of memory.
On AIX, the AUTOMATIC setting uses Scheduler Resource Allocation Domain Identifier
(SRADID) attachments for affinity purposes. On Linux, the AUTOMATIC setting uses NUMA
nodes (as exposed by libnuma) for affinity purposes.
The AUTOMATIC setting can be used on POWER7 and POWER8 processor-based systems
running the following or later releases:
AIX V6.1 Technology Level (TL) 5 with DB2 10.1
Linux (Little Endian) with DB2 10.5 FP5
This setting is intended for multi-socket SCM and all DCM Power Systems. It is best to run a
performance analysis of the workload before and after you set this variable to AUTOMATIC to
validate the performance improvement.
For more information about other usages of DB2_RESOURCE_POLICY and other memory-related DB2 registry variables, see Chapter 2, AIX configuration, in Best Practices for DB2 on AIX 6.1 for POWER Systems, SG24-7821.
Enabling large page support (AIX) (for DB2 Version 10.1 for Linux, UNIX, and Windows), found at:
http://www-01.ibm.com/support/knowledgecenter/api/redirect/db2luw/v10r1/index.jsp?topic=%2Fcom.ibm.db2.luw.admin.dbobj.doc%2Fdoc%2Ft0010405.html
To enable large page support on Linux operating systems, complete the following steps:
Note: On Linux, large pages are referred to as huge pages; the two terms describe the same feature.
1. Configure Linux server for large page support by running the following command:
echo "vm.nr_hugepages=<LargePages>" >> /etc/sysctl.conf
2. Restart the server.
In both cases, after the server is configured for large pages and restarted, use the following
steps to enable large page support in DB2:
1. Set the DB2_LARGE_PAGE_MEM registry variable by running db2set:
db2set DB2_LARGE_PAGE_MEM=DB
2. Start the DB2 database manager by running db2start:
db2start
SIMD instructions are low-level CPU instructions that enable you to perform the same
operation on multiple data points at the same time.
DB2 10.5 with BLU Acceleration auto-detects whether it is running on an SIMD-enabled CPU,
and automatically uses SIMD to effectively multiply the power of the CPU. In particular, BLU
Acceleration can use a single SIMD instruction to get results from multiple data elements.
Figure 9-1 is an example of a scan operation that involves a predicate evaluation.
Figure 9-1 Compare predicate evaluation with and without SIMD on POWER using DB2 10.5 with BLU
Acceleration
The left side of Figure 9-1 illustrates a typical operation, namely, that each data element (or
column value) is evaluated, one after another. The right side of the figure shows how BLU
Acceleration processes four columns at a time by using SIMD. Think of it as CPU power
multiplied by four. Although Figure 9-1 shows a predicate evaluation, BLU Acceleration can
also take advantage of SIMD processing for join operations, arithmetic, and more.
DB2 also supports the PowerVM Live Partition Mobility (LPM) feature when virtual I/O is
configured. LPM allows an active database to be moved from a system with limited memory
to one with more memory without disrupting the operating system or applications. When
coupling dynamic LPAR (DLPAR) with STMM, the newly migrated database can automatically
adjust to the additional memory resource for better performance.
DB2 Virtualization, SG24-7805 describes in considerable detail the concept of DB2
virtualization, in addition to setup, configuration, and management of DB2 on IBM Power
Systems with PowerVM technology. That book follows many of the preferred practices for
Power Systems virtualization and has a list of preferred practices for DB2 on PowerVM.
Non-buffered I/O
By default, DB2 uses CIO or DIO for newly created table space containers because
non-buffered I/O provides more efficient underlying storage access over buffered I/O on most
workloads, with most of the benefit realized by bypassing the file system cache. Non-buffered
I/O is configured through the NO FILE SYSTEM CACHING clause of the table space definition. To
maximize the benefits of non-buffered I/O, a correct buffer pool size is essential. This size can
be achieved by using STMM to tune the buffer pool sizes. (The default buffer pool is always
tuned by STMM, but user-created buffer pools must specify the automatic keyword for the
size to allow STMM to tune them.) When STMM is enabled, it automatically adjusts the buffer
pool size for optimal performance.
For file systems that support CIO, such as AIX JFS2, DB2 automatically uses this I/O method
because of its performance benefits over DIO.
The DB2 log file by default uses DIO, which brings similar performance benefits as avoiding
file system cache for table spaces.
Asynchronous I/O
In general, DB2 users cannot explicitly choose synchronous or asynchronous I/O. However,
to improve the overall response time of the database system, minimizing synchronous I/O is
preferred and can be achieved through correct database tuning. Consider the following items:
Synchronous read I/O can occur when a DB2 agent needs a page that is not in the buffer
pool to process an SQL statement. In addition, a synchronous write I/O can occur if no
clean pages are available in the buffer pool to make room to bring another page from disk
into that buffer pool. This situation can be minimized by having sufficiently large buffer
pools or setting the buffer pool size to automatic to allow STMM to find its optimal size, in
addition to tuning the page cleaning (by using the chngpgs_thresh database parameter).
Not all pages read into buffer pools are done synchronously. Depending on the SQL
statement, DB2 can prefetch pages of data into buffer pools through asynchronous I/O.
When prefetching is enabled, two parallel activities occur during query processing: data
processing and data page I/O. The latter is done through the I/O servers that wait for
prefetch requests from the former. These prefetch requests contain a description of the I/O
that must satisfy the query. The number of I/O servers for a database is specified through
the num_ioservers configuration parameter. By default, this parameter is automatically
tuned during database start.
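For example, num_ioservers can be set back to automatic tuning with a DB2 CLP command along these lines (MYDB is a placeholder database name):

db2 update db cfg for MYDB using num_ioservers automatic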
For more information about how to monitor and tune AIO for DB2, see Best Practices for DB2
on AIX 6.1 for POWER Systems, SG24-7821.
After IOCP is configured on AIX, DB2 by default capitalizes on this feature for all asynchronous I/O requests. With IOCP configured, AIO server processes from the AIX operating system manage the I/O requests by processing many requests in the most optimal way for the system.
For more information about this topic, see 9.8, Related publications on page 202.
AIX tprof
tprof is a powerful profiling tool on the AIX platform that does program counter-sampling in
clock interrupts. It can work on any binary without recompilation and is a great tool for
codepath analysis.
For instructions about using the tprof command, go to the following website:
http://www-01.ibm.com/support/knowledgecenter/api/redirect/aix/v7r1/topic/com.ibm.
aix.prftools/doc/prftools/tprofcommand.htm
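For example, a common system-wide profiling run collects user, kernel, shared-library, and Java ticks for 60 seconds; a sketch (see the tprof documentation for the exact flags on your AIX level):

# tprof -skeuj -x sleep 60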
9.7 Conclusion
DB2 is positioned to capitalize on many POWER processor features to maximize the return
on investment (ROI) of the full IBM stack. During the entire DB2 development cycle, there is a
targeted effort to take advantage of POWER processor features and ensure that the highest
level of optimization is employed on this platform. With every new POWER processor
generation, DB2 ensures that the key features are supported and brought into play at the
POWER processor launch by working on such features well in advance of general availability.
This type of targeted effort ensures that DB2 is at the forefront of optimization for POWER
processor applications.
Chapter 10. IBM WebSphere Application Server
10.1.1 Installation
Multiple versions of WebSphere Application Server, and multiple versions of AIX, are supported on POWER7 and POWER8 processor-based systems. Table 10-1 shows some of the installation considerations. Use the most currently available
code, including the latest installation binary files. For the most current AIX installation and
configuration details, see 4.4.1, AIX preferred practices that are applicable to all Power
Systems generations on page 105.
Important: If running on a POWER7 processor-based system, use the versions of
WebSphere Application Server that run in POWER7 mode with performance
enhancements. Similarly, for POWER8 processor-based systems, use the versions of
WebSphere Application Server that run in POWER8 mode with performance
enhancements.
Table 10-1 Installation considerations
Associated website: http://www.ibm.com/support/docview.wss?uid=swg21422150
10.1.2 Deployment
When you start WebSphere Application Server, there is an option to bind the Java processes to specific processor cores, which circumvents the operating system scheduler dispatching the work to any available processor in the pool. In certain cases, using RSETs and binding the JVM to stay within core or socket boundaries improves performance. Table 10-2 on page 207 lists some of the deployment considerations.
Table 10-2 Deployment considerations
Workload partitioning (WPAR) in AIX V6.1:
http://www.ibm.com/developerworks/aix/library/au-wpar61aix/
Troubleshooting and performance analysis of different applications in versioned WPARs:
http://www.ibm.com/developerworks/aix/library/au-wpars/
10.1.3 Performance
When you run WebSphere Application Server on POWER7 and POWER8 processor-based systems, end-to-end performance depends on many subsystems, including the network, memory, disk, and CPU subsystems; Java configuration and tuning is also a crucial consideration. Topology also plays a major role in the performance of the enterprise application that is being deployed. The architecture of the application must be considered when you determine the best deployment topology.
Table 10-3 includes links to preferred practices documents, which target each of these major
areas.
Table 10-3 Performance considerations
Java Performance on POWER7 - Best practice:
http://www.ibm.com/common/ssi/cgi-bin/ssialias?infotype=SA&subtype=WH&htmlfid=POW03066USEN
http://www.ibm.com/developerworks/aix/library/au-aix7networkoptimize1/index.html
http://www.ibm.com/developerworks/aix/library/au-aix7memoryoptimize1/index.html
Optimizing AIX V7 performance: Part 2, Monitoring logical volumes and analyzing the results:
http://www.ibm.com/developerworks/aix/library/au-aix7optimize2/index.html
https://www.ibm.com/developerworks/wikis/display/WikiPtype/Java+Performance+Advisor
http://www.ibm.com/developerworks/aix/library/au-performancedectective/index.html
MustGather: Performance, hang, or high CPU issues with WebSphere Application Server on AIX:
http://www.ibm.com/support/docview.wss?uid=swg21052641
Appendix A. Analyzing malloc usage under IBM AIX
Introduction
There is a simple methodology on AIX to collect useful information about how an application
uses the C heap. That information can then be used to choose and tune the appropriate
malloc settings. The type of information that typically must be collected is:
The distribution of malloc allocation sizes that are used by an application, which shows
whether AIX MALLOCOPTIONS, such as pool and buckets, are expected to perform well. This
information can be used to fine-tune bucket sizes.
The steady state size of the heap, which shows how to size the pool option.
Additional information about thread counts, malloc usage per thread, and so on, can be
useful, but the information that is presented here presents a basic view.
This appendix does not apply to the watson2 allocator (see Memory allocators on page 95),
which autonomically adjusts to the memory usage of an application and does not require
specific tuning.
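The allocation-size distribution can be collected with the malloc buckets statistics facility; a sketch of the setting follows (see the AIX malloc documentation for the full option syntax):

export MALLOCOPTIONS=buckets,bucket_statistics:stdout

Running the application with this setting produces a report like the following summary.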
Example A-1 Malloc buckets statistical summary

==================================
Malloc buckets statistical summary
==================================
Configuration values:
Number of buckets:            16
Bucket sizing factor:         32
Blocks per bucket:          1024

Allocation request totals:
Buckets allocator:     118870654
Default allocator:        343383
Total for process:     119214037

Allocation requests by bucket

Bucket      Maximum       Number of
Number      Block Size    Allocations
------      ----------    -----------
     0              32      104906782
     1              64        9658271
     2              96        1838903
     3             128         880723
     4             160         300990
     5             192         422310
     6             224         143923
     7             256         126939
     8             288         157459
     9             320          72162
    10             352          87108
    11             384          56136
    12             416          63137
    13             448          66160
    14             480          45571
    15             512          44080

Allocation requests by heap

Heap        Buckets       Default
Number      Allocator     Allocator
------      ----------    ---------
     0      118870654        343383
This environment variable causes the program to produce a histogram of allocation sizes
when it terminates. The number of allocation requests that are satisfied by the default
allocator indicates the fraction of requests that are too large for the buckets allocator (larger
than 512 bytes, in this example). By modifying some of the malloc buckets configuration
options, you can, for example, obtain more information about larger allocation sizes.
To discover the steady state size of the heap, set the following environment variable:
export MALLOCDEBUG=log
Run the application to a steady state point, attach to it by running dbx, and then run the malloc subcommand. Example A-2 shows a sample output.
Example A-2 Sample output from the malloc subroutine
(dbx) malloc
The following options are enabled:
        Implementation Algorithm........ Default Allocator (Yorktown)
        Malloc Log
                Stack Depth............. 4

Statistical Report on the Malloc Subsystem:
        Heap 0
                heap lock held by................ pthread ID 0x20023358
                bytes acquired from sbrk()....... 5309664
                bytes in the freespace tree...... 334032
                bytes held by the user........... 4975632
                allocations currently active..... 76102
                allocations since process start.. 20999785

        The Process Heap
                Initial process brk value........ 0x20013850
                current process brk value........ 0x214924c0
                sbrk()s called by malloc......... 78
The bytes held by the user value indicates how much heap space is allocated. By attaching with dbx at several points in the run and running the malloc subcommand each time, you can get a good estimate of the heap space that is needed by the application.
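A minimal sketch of taking one such snapshot, assuming that the application runs as PID 12345 (the PID is illustrative):

dbx -a 12345 <<'EOF'
malloc
detach
EOF

The -a flag attaches dbx to the running process, and detach resumes it.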
For more information, see System memory allocation using the malloc subsystem, found at:
http://www-01.ibm.com/support/knowledgecenter/api/redirect/aix/v7r1/topic/com.ibm.aix.genprogc/doc/genprogc/sys_mem_alloc.htm
Appendix B. Performance tools and empirical performance analysis
Introduction
This appendix includes a general description of performance advisors, and descriptions that are specific to the three performance advisors that are referenced in this book:
AIX
Linux
Java (either AIX or Linux)
Performance advisors
IBM developed four new performance advisors that empower users to address their own performance issues and make the best use of their Power Systems server. These performance advisors can be run by a broad class of users.
The first three of these advisors are tools that run and analyze the configuration of a system
and the software that is running on it. They also provide advice about the performance
implications of the current configuration and suggestions for improvement. These three
advisors are documented in Expert system advisors on page 216.
The fourth advisor is part of the IBM Rational Developer for Power Systems Software. It is a
component of an integrated development environment (IDE), which provides a set of features
for performance tuning of C and C++ applications on AIX and Linux. That advisor is
documented in IBM Rational Performance Advisor on page 221.
All of the advisors follow the same reporting format, which is a single-page XML file that you can use to assess conditions quickly by visually inspecting the report and looking at the descriptive icons, as shown in Figure B-1.
Figure B-1 Descriptive icons in expert system advisors (AIX Partition Virtualization, VIOS Advisor, and
Java Performance Advisor)
The XML reports that are generated by all of the advisors are interactive. If a problem is
detected, three pieces of information are shared with the user:
1. What is this?
This section explains why a particular topic was monitored, and provides a definition of the
performance metric or setting.
2. Why is it important?
This report entry explains why the topic is relevant and how it impacts performance.
3. How do I modify it?
Instructions for addressing the problem are listed in this section.
The VIOS Performance Advisor assesses the following areas:
CPU
Shared processing pool
Memory
Fibre Channel performance
Disk I/O subsystem
Shared Ethernet adapter
The output is presented on a single page, and copies of the report can be saved, making it
easy to document the settings and performance of VIOS over time. The goal of the advisor is
for you to be able to self-assess the health of your VIOS and act to attain optimal
performance.
Figure B-2 shows a window of the VIOS Performance Advisor, focusing on the FC adapter section of the report, which attempts to guide the user in determining whether any of the FC ports are saturated, and, if so, to what extent. An investigate icon is displayed next to the idle FC port to confirm that the idle adapter port is intentional and the result of an administrative configuration design choice.
The VIOS Advisor can be found at the following website:
https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/Power%20Sys
tems/page/VIOS%20Advisor
The output is presented in a single window, and copies of the report can be saved, making it
easy for the user to document the settings and performance of their LPAR over time. The goal
of the advisor is for the user to be able to self-assess the health of their LPAR and act to attain
optimal performance.
Figure B-3 is a snapshot of the LPAR Virtualization Performance Advisor, focusing on the
LPAR optimization section of the report, which applies virtualization preferred practice
guidance to the LPAR configuration, resource usage of the LPAR, and shared processor pool,
and determines whether the LPAR configuration is optimized. If the advisor finds that the
LPAR configuration is not optimal for the workload, it guides the user in determining the best
possible configuration. The LPAR Performance Advisor can be found at the following website:
https://www.ibm.com/developerworks/community/blogs/simplyaix/entry/lpar_performanc
e_advisor?lang=en
The Java Performance Advisor provides recommendations in areas that include:
JVM tunables: Heap sizing, garbage collection (GC) policy, page size, and so on
WebSphere Application Server related settings for a WebSphere Application Server process
The guidance is based on Java tuning preferred practices. The criteria that are used to
determine the guidance include the relative importance of the Java application, machine
usage (test and production), and the user's expertise level.
Figure B-4 on page 221 is a snapshot of Java and WebSphere Application Server
recommendations from a sample run, indicating the best JVM optimization and WebSphere
Application Server settings for better results, per Java preferred practices. Details about the
metrics can be obtained by expanding each of the metrics. The output of the run is a simple
XML file that can be viewed by using the supplied XSL viewer and any browser. The Java
Performance Advisor (JPA) can be found at the following website:
https://www.ibm.com/developerworks/wikis/display/WikiPtype/Java+Performance+Advisor
IBM Rational Performance Advisor
Rational Performance Advisor gathers data from several sources. The raw application
performance data comes from the same expert-level tprof and OProfile CPU profilers that
are described in AIX on page 223 and Linux on page 233, and other low-level operating
system tools. The debug information that is generated by the compiler allows this data to be
matched back to the original source code. XLC compilers can generate XML report files that
provide information about optimizations that were performed during compilation. Finally, the
application build and runtime systems are analyzed to determine whether there are any
potential environmental problems.
All of this data is automatically gathered, correlated, analyzed, and presented in a way that is
quick to access and easy to understand (Figure B-5).
For more information about Rational Performance Advisor, including a trial download, see
Rational Developer for AIX and Linux C/C++ Edition, found at:
http://www.ibm.com/software/products/en/dev-c-cpp
222
Performance Optimization and Tuning Techniques for IBM Power Systems Processors Including IBM POWER8
AIX
This section introduces tools and techniques that are used for optimizing software for the
combination of Power Systems and AIX. The intended audience for this section is software
development teams. As such, this section does not address performance topics that are
related to capacity planning, and system-level performance monitoring and tuning.
To download Java for AIX, go to the following website:
http://www.ibm.com/developerworks/java/jdk/aix/
For capacity planning, see the IBM Systems Workload Estimator, found at the following
website:
http://www-912.ibm.com/estimator
For system-level performance monitoring and tuning information for AIX, see Performance management, found at:
http://www-01.ibm.com/support/knowledgecenter/api/redirect/aix/v7r1/topic/com.ibm.aix.prftungd/doc/prftungd/performance_management-kickoff.htm
The bedrock of any empirically based software optimization effort is a suite of repeatable
benchmark tests. To be useful, such tests must be representative of the manner in which
users interact with the software. For many commercial applications, a benchmark test
simulates the actions of multiple users that drive a prescribed mix of application transactions.
Here, the fundamental measure of performance is throughput (the number of transactions
that are run over a period) with an acceptable response time. Other applications are more batch-oriented, where a few jobs are started and the time to completion is measured. Whichever benchmark style is used, it must be repeatable: within some small tolerance (typically a few percent), running the benchmark several times on the same setup yields the same result.
Tools and techniques that are employed in software performance analysis focus on
pinpointing aspects of the software that inhibit performance. At a high level, here are the two
most common inhibitors to application performance:
Areas of code that consume large amounts of CPU resources. Such code results from inefficient algorithms, poor coding practices, or inadequate compiler optimization.
Waiting for locks or external events. Locks are used to serialize execution through critical
sections, that is, sections of code where the need for data consistency requires that only
one software thread run at a time. An example of an external event is the system that is
waiting for a disk I/O to complete. Although the amount of time that an application must
wait for external events might be outside of the control of the application (for example, the
time that is required for a disk I/O depends on the type of storage employed), simply being
aware that the application is having to wait for such an event can open the door to potential
optimizations.
CPU profiling
A CPU profiler is a performance tool that shows in which code CPU resources are being
consumed. tprof is a powerful CPU profiler that encompasses a broad spectrum of
profiling functions:
It can profile any program, library, or kernel extension that is compiled with C, C++,
Fortran, or Java compilers. It can profile machine code that is created in real time by the
JIT compiler.
It can attribute time to processes, threads, subroutines (user mode, kernel mode, shared
library, and Java methods), source statements, and even individual machine instructions.
In most cases, no recompilation of object files is required.
Usage of tprof typically focuses on generating subroutine-level profiles to pinpoint code
hotspots, and to examine the impact of an attempted code optimization. A common way to
run tprof is as follows:
$ tprof -E -skeuz -x sleep 10
The -E flag instructs tprof to employ the performance monitoring unit (PMU) as the sampling
mechanism to generate the profile. Using the PMU as the sampling mechanism provides a
more accurate profile than the default time-based sampling mechanism, as the PMU
sampling mechanism can accurately sample regions of kernel code where interrupts are
disabled. The s, k, e, and u flags instruct tprof to generate subroutine-level profiles for shared
library, kernel, kernel extension, and user-level activity. The z flag instructs tprof to report
CPU time in the number of ticks (that is, samples) instead of percentages. The -x sleep 10
argument instructs tprof to collect profiling data during the running of the sleep 10
command. This command collects profile data over the entire system (including all running
processes) over a period of 10 seconds.
Excerpts from a tprof report are shown in Example B-1, Example B-2, and Example B-3 on
page 226.
Example B-1 is a breakdown of samples of the processes that are running on the system.
When multiple processes have the same name, they have only one line in this report: the
number of processes with that name is in the Freq column. Total is the total number of
samples that are accumulated by the process, and Kernel, User, and Shared are the
number of samples that are accumulated by the processes in kernel (including kernel
extensions), user space, and shared libraries. Other is a catchall for samples that do not fall
in the other categories. The most common scenario where samples wind up in Other is
because of CPU resources that are being consumed by machine code that is generated in
real time by the JIT compiler. The -j flag of tprof can be used to attribute these samples to
Java methods.
Example: B-1 Excerpt from a tprof report - breakdown of samples of processes running on the system
Process            Freq  Total  Kernel   User  Shared  Other
=======            ====  =====  ======   ====  ======  =====
wait                  4   5810    5810      0       0      0
./version1            1   1672      35   1637       0      0
/usr/bin/tprof        2     15      13      0       2      0
/etc/syncd            1      2       2      0       0      0
/usr/bin/sh           2      2       2      0       0      0
swapper               1      1       1      0       0      0
/usr/bin/trcstop      1      1       1      0       0      0
rmcd                  1      1       1      0       0      0
=======            ====  =====  ======   ====  ======  =====
Total                13   7504    5865   1637       2      0
Example B-2 is a breakdown of samples of the threads that are running on the system. In
addition to the columns that are described in Example B-1, this report has PID and TID
columns that detail the process IDs and thread IDs.
Example: B-2 Excerpt from a tprof report - breakdown of threads that are running on the system
Process             PID      TID      Total  Kernel   User  Shared  Other
=======             ===      ===      =====  ======   ====  ======  =====
wait                16392    16393     1874    1874      0       0      0
wait                12294    12295     1873    1873      0       0      0
wait                20490    20491     1860    1860      0       0      0
./version1          245974   606263    1672      35   1637       0      0
wait                8196     8197       203     203      0       0      0
/usr/bin/tprof      291002   643291      13      13      0       0      0
/usr/bin/tprof      274580   610467       2       0      0       2      0
/etc/syncd          73824    110691       2       2      0       0      0
/usr/bin/sh         245974   606263       1       1      0       0      0
/usr/bin/sh         245976   606265       1       1      0       0      0
/usr/bin/trcstop    245976   606263       1       1      0       0      0
swapper             0        3            1       1      0       0      0
rmcd                155876   348337       1       1      0       0      0
=======             ===      ===      =====  ======   ====  ======  =====
Total                                  7504    5865   1637       2      0
Example B-3 from the report gives the subroutine-level profile for the ./version1 program. In this simple example, all of the time is spent in main().
Example: B-3 Excerpt from a tprof report - subroutine-level profile for the version1 program with all time
spent in main()
Profile: ./version1
Total Ticks For All Processes (./version1) = 1637

Subroutine    Ticks   %      Source       Address  Bytes
==========    =====   =====  ======       =======  =====
.main          1637   21.82  version1.c       350    536
For more information about using AIX tprof for Java programs, see Hot method or routine
analysis on page 241.
The functions of tprof are rich and cannot be fully described in this guide. For complete tprof documentation, see tprof Command, found at:
http://www-01.ibm.com/support/knowledgecenter/api/redirect/aix/v7r1/index.jsp?topic=/com.ibm.aix.cmds/doc/aixcmds5/tprof.htm
The curt command takes as its input a trace that is collected by using the AIX trace facility, and generates a report that breaks down how CPU time is consumed by various entities, such as processes, threads, system calls, and hypervisor calls.
One of the most useful reports from curt is the System Calls Summary. This report provides a
system-wide summary of the system calls that are run while the trace is collected. For each
system call, the following information is provided:
Count: The number of times the system call was run during the monitoring interval
Total Time: Amount of CPU time (in milliseconds) consumed in running the system call
% sys time: Percentage of overall CPU capacity that is spent in running the system call
Avg Time: Average CPU time that is consumed for each execution of the system call
Min Time: Minimum CPU time that is consumed during an execution of the system call
Max Time: Maximum CPU time that is consumed during an execution of the system call
SVC: Name and address of the system call
An example report is shown in Example B-4.

Example: B-4 System Calls Summary
Total Time   % sys   Avg Time  Min Time  Max Time  SVC (Address)
(msec)       time    (msec)    (msec)    (msec)
===========  ======  ========  ========  ========  ======================
  3172.0694  14.60%    0.0257    0.0128    0.9064  kpread(2a2d5e8)
  1354.6939   6.24%    2.5133    0.0163    4.1719  listio64(516ea40)
   757.6204   3.49%    0.0286    0.0162    0.0580  _esend(2a29f88)
   447.7029   2.06%    0.0169    0.0082    0.0426  _erecv(2a29e98)
   266.1382   1.23%    0.0269    0.0143    0.5350  kpwrite(2a2d588)
   167.8132   0.77%    0.0049    0.0032    0.0204  _thread_wait(2a28778)
As a first step, compare the mix of system calls to the expectation of how the application is
expected to behave. Is the mix aligned with expectations? If not, first confirm that the trace is
collected while the intended workload runs. If the trace is collected at the correct time and the
mix still differs from expectations, then investigate the application logic. Also, examine the list
of system calls for potential optimizations. For example, if select or poll is used frequently,
consider employing the pollset facility (see 4.3.3, pollset on page 98).
As a further breakdown, curt provides a report of the system calls that are run by each
thread. An example report is shown in Example B-5.
Example: B-5 System calls run by each thread
Report for Thread Id: 549305 (hex 861b9) Pid: 323930 (hex 4f15a)
Process Name: proc1
--------------------
Total Application Time (ms): 89.010297
Total System Call Time (ms): 160.465531
Total Hypervisor Call Time (ms): 18.303531

Thread System Call Summary
--------------------------
   Count  Total Time  Avg Time  Min Time  Max Time  SVC (Address)
              (msec)    (msec)    (msec)    (msec)
========  ==========  ========  ========  ========  ================
     492    157.0663    0.3192    0.0032    0.6596  listio64(516ea40)
     494      3.3656    0.0068    0.0002    0.0163  GetMultipleCompletionStatus(549a6a8)
      12      0.0238    0.0020    0.0017    0.0022  _thread_wait(2a28778)
       6      0.0060    0.0010    0.0007    0.0014  thread_unlock(2a28838)
       4      0.0028    0.0007    0.0005    0.0008  thread_post(2a288f8)
Another useful report that is provided by curt is the Pending System Calls Summary. This
summary shows the list of threads that are in an unfinished system call at the end of the
trace. An example report is given in Example B-6.
Example: B-6 Threads that are in an unfinished system call at the end of the trace
The splat command produces a detailed report for each lock, such as the following pthread mutex report, which includes the call-chain where the lock was created:

00000000D268EB88  .pthread_once
00000000D01FE588  .__libs_init
00000000D01EB2FC  .dne_callbacks
00000000D01EB280  ._libc_declare_data_functions
00000000D269F960  ._pth_init_libc
00000000D268A2B4  .pthread_init
00000000D01EAC08  .__modinit
000000001000014C  .__start

  Acqui-  | Miss  Spin   Wait   Busy  |     Secs Held      |  Percent Held ( 26.235284s )
  sitions | Rate  Count  Count  Count |  CPU      Elapsed  | Real  Real     Comb  Real
          |                           |                    | CPU   Elapsed  Spin  Wait
        1 | 0.000     0      0      0 |0.000006  0.000006  | 0.00  0.00     0.00  0.00
-------------------------------------------------------------------------------------
             Depth    Min   Max   Avg
  SpinQ                 0     0     0
  WaitQ                 0     0     0
  Recursion             0     1     0

             Acqui-   Miss    Spin    Wait    Busy    Percent Held of Total Time
 PThreadID   sitions  Rate    Count   Count   Count   CPU     Elapse  Spin    Wait
~~~~~~~~~~  ~~~~~~~~  ~~~~~~  ~~~~~~  ~~~~~~  ~~~~~~  ~~~~~~  ~~~~~~  ~~~~~~  ~~~~~~
         1         1    0.00       0       0       0    0.00    0.00    0.00    0.00

                Acqui-   Miss   Spin   Wait   Busy   Percent Held of Total Time
 Function Name  sitions  Rate   Count  Count  Count  CPU    Elapse  Spin   Wait   Return Address    Start Address
 ^^^^^^^^^^^^^  ^^^^^^^  ^^^^^  ^^^^^  ^^^^^  ^^^^^  ^^^^^  ^^^^^^  ^^^^^  ^^^^^  ^^^^^^^^^^^^^^^^  ^^^^^^^^^^^^^^^^
 .pthread_once        0   0.00      0      0      0  99.99   99.99   0.00   0.00  00000000D268EC98  00000000D2684180
 .pthread_once        1   0.00      0      0      0   0.01    0.01   0.00   0.00  00000000D268EB88  00000000D2684180
In addition to the common header information and the [pthread MUTEX] identifier, this report lists the following lock details:
Parent thread
Creation time: Elapsed time in seconds after the first event that is recorded in the trace (if available)
Deletion time: Elapsed time in seconds after the first event that is recorded in the trace (if available)
PID: Process identifier
Process Name
Call-chain
Acquisitions: The number of times that the lock was acquired in the analysis interval
Miss Rate
Spin Count
Wait Count: The number of times that a thread is forced into a suspended wait state while waiting for the lock to come available
Busy Count
Seconds Held
Elapse(d)
Percent Held: Real CPU, Real Elapsed, Comb(ined) Spin, and Real Wait
Depth: SpinQ, WaitQ, and Recursion
The hpmcount command runs a command and reports the PMU counter values for a selected event group, as in the following example:

# hpmcount -g 38 ./unaligned
Group: 38
Counting mode: user
Counting duration: 21.048874056 seconds
PM_LSU_FLUSH_ULD (LRQ unaligned load flushes)  :  4320840034
PM_LSU_FLUSH_UST (SRQ unaligned store flushes) :           0
The hpmstat command is similar to hpmcount, except that it collects performance data on a
system-wide basis, rather than just for the running of a command.
Generally, scenarios in which the ratio of (LRQ unaligned load flushes + SRQ unaligned store flushes) to Run instructions completed is greater than 0.5% should be investigated further. The tprof command can be used to further pinpoint where in the code the unaligned storage references are occurring. To pinpoint unaligned loads, the -E
unaligned storage references are occurring. To pinpoint unaligned loads, the -E
PM_MRK_LSU_FLUSH_ULD flag is added to the tprof command line, and to pinpoint unaligned
stores, the -E PM_MRK_LSU_FLUSH_UST flag is added. When these flags are used, tprof
generates a profile where unaligned loads and stores are sampled instead of
time-based sampling.
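For example, a sketch that samples unaligned loads system-wide for 10 seconds, reusing the flags from the earlier tprof example:

tprof -E PM_MRK_LSU_FLUSH_ULD -skeuz -x sleep 10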
Examples of alignment issues that cause an alignment interrupt include execution of a lmw or
lwarx instruction on a non-word-aligned boundary. These issues can be detected by running
alstat. This command can be run with an interval, which is the number of seconds between
each report. An example is presented in Example B-9.
Example: B-9 Alignment issues can be addressed with the alstat command
> alstat 5
  Alignment    Alignment
  SinceBoot        Delta
       2016            0
       2016            0
       2016            0
       2016            0
       2016            0
       2016            0
The key metric in the alstat report is the Alignment Delta. This metric is the number of
alignment interrupts that occurred during the interval. Nonzero counts in this column merit
further investigation with tprof. Running tprof with the -E ALIGNMENT flag generates a profile
that shows where the unaligned references are occurring.
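A minimal sketch of such a run, again collecting system-wide for 10 seconds:

tprof -E ALIGNMENT -skeuz -x sleep 10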
For more information, see alstat Command, found at:
http://www-01.ibm.com/support/knowledgecenter/api/redirect/aix/v7r1/index.jsp?topic=/com.ibm.aix.cmds/doc/aixcmds1/alstat.htm
The emstat command reports instructions that the operating system must emulate because they are not supported by the underlying processor. Like alstat, emstat can be run with an interval, which is the number of seconds between each report. Sample output follows.
> emstat 5
  Emulation    Emulation
  SinceBoot        Delta
          0            0
          0            0
          0            0
          0            0
          0            0
The key metric is the Emulation Delta (the number of instructions that are emulated during
each interval). Nonzero values merit further investigation. Running tprof with the -E
EMULATION flag generates a profile that shows where the emulated instructions are.
For more information, see emstat Command, found at:
http://www-01.ibm.com/support/knowledgecenter/api/redirect/aix/v7r1/index.jsp?topic=/com.ibm.aix.cmds/doc/aixcmds2/emstat.htm
Linux
This section introduces tools and techniques that are used for optimizing software on the
combination of Power Systems and Linux. The intended audience for this section is software
development teams.
To download Java for Linux, go to the following website:
http://www.ibm.com/developerworks/java/jdk/linux/
The IBM SDK for Linux on Power provides an Eclipse C/C++ IDE with Linux tools integration, including graphical presentation and source code view integration with Linux execution profiling (gprof/OProfile/Perf), malloc and memory usage (valgrind), pthread synchronization (helgrind), SystemTap tapsets, and tapset development.
Hotspot analysis
You should profile the application and look for hotspots. When you run the application under
one or more representative workloads, use a hardware-based profiling tool such as OProfile.
OProfile can be run directly as a command-line tool or under the IBM SDK for Linux on
Power.
Note: You might find it useful to use the Linux perf tool in its top mode. The perf top command works similarly to the classic top command; however, instead of showing processes, it shows hot functions.
The OProfile tools can monitor the whole system (LPAR), including all the tasks and the
kernel. This action requires root authority, but is the preferred way to profile the kernel and
complex applications with multiple cooperating processes. OProfile is fully enabled to take
samples by using the full set of the PMU events (run ophelp for a complete list of events).
OProfile can produce text file reports that are organized by process, program and libraries,
function symbols, and annotated source file and line number or machine code disassembly.
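A minimal system-wide collection sketch, assuming OProfile 0.9.8 or later (which provides operf; system-wide mode requires root authority):

operf --system-wide &        # start sampling the whole system in the background
OPERF_PID=$!
sleep 60                     # let the workload run for 60 seconds
kill -INT $OPERF_PID         # stop sampling
wait $OPERF_PID
opreport --symbols           # summarize the samples by function symbol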
The IBM SDK for Linux on Power can profile applications that are associated with Eclipse
projects. The SDK automates the setup and running of the profile, but is restricted to a single
application, its libraries, and direct kernel calls. The SDK is easier to use, as it is hierarchically
organized by percentage with program, function symbol, and line number. Clicking the line
number in the profile pane jumps the source view pane to the matching source file and line
number. This action simplifies edit, compile, and profile tuning activities.
The whole system profile is a good place to start. You might find that your application is
consuming most of the CPU cycles, and deeper analysis of the application is the next logical
step. The IBM SDK for Linux on Power provides a number of helpful tools, including integrated
application profiling (OProfile and valgrind), Migration Assistant, and the Source
Code Advisor.
Excessive locking or poor lock granularity can also result in high kernel usage (in the kernels
spin_lock, futex, and scheduler components) when applications move to larger system
configurations. This situation might require adjusting the application lock strategy and
possibly the type of lock mechanism that is used as well:
POSIX pthread_mutex and pthread_rwlock locks are complex and heavy, and POSIX
semaphores are simpler and lighter.
Use trylock forms to spin in user mode for a limited time when appropriate. Use this technique when there is normally a finite lock hold time and limited contention for the resource. This technique avoids context switches and scheduler impact in the kernel.
Reserve POSIX pthread_spinlock and sched_yield for applications that have exclusive
use of the system and with carefully designed thread affinity (assigning specific threads to
specific cores).
The compiler provides inline functions (__sync_fetch_and_add, __sync_fetch_and_or, and so on) that are better suited for simple atomic updates than POSIX lock and unlock (see the sketch that follows this list). Use thread-local storage, where appropriate, to avoid locking in thread-safe code.
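A minimal sketch of such an atomic update (the file name is illustrative, and -mcpu=power8 assumes a POWER8 system):

cat > atomic_counter.c << 'EOF'
#include <stdio.h>

static long counter;   /* shared counter, updated without a lock */

int main(void)
{
    int i;
    for (i = 0; i < 1000000; i++)
        __sync_fetch_and_add(&counter, 1);   /* atomic read-modify-write */
    printf("counter = %ld\n", counter);
    return 0;
}
EOF
gcc -O3 -mcpu=power8 atomic_counter.c -o atomic_counter
./atomic_counter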
In GCC, you must specify the -O3 optimization level and inform the compiler that you are
running on a newer processor chip with the Vector ISA extensions. In fact, with GCC, you
need both -O3 and -mcpu=power7 for the compiler to generate code that capitalizes on the
new VSX feature of the POWER7 processor. You need both -O3 and -mcpu=power8 for the
compiler to take advantage of the latest VSX instructions that are implemented on the
POWER8 processor.
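For example (myapp.c is a placeholder source file):

gcc -O3 -mcpu=power7 myapp.c -o myapp    # generate VSX code for POWER7
gcc -O3 -mcpu=power8 myapp.c -o myapp    # use the newest VSX instructions on POWER8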
One source of optimized libraries is the IBM Advance Toolchain for Linux on Power. The
Advance Toolchain provides alternative runtime libraries for all the common POSIX C
language, Math, and pthread libraries that are highly optimized (-O3 and -mcpu=) for multiple
Power Systems platforms (including POWER7 and POWER8 processor-based systems). The
IBM Advance Toolchain runtime RPM provides multiple CPU tuned library instances and
automatically selects the specific library version that is optimized for the specific POWER5,
POWER6, POWER7, or POWER8 processor-based system.
If there are specific open source or third-party libraries that are dominating the execution
profile of your application, you must ask the distribution or library product owner to provide a
build that uses higher optimization. Alternatively, for open source library packages, you can
build your own optimized binary version of those packages.
Program usage of non-portable data types and inline assembly language can cause poor performance on the POWER processor, and must always be investigated and addressed. For example, the long double data type is supported for both Intel x86 and POWER, but has a different size, data range, and implementation. The x86 80-bit floating-point format is implemented in hardware and is faster than (although not compatible with) the AIX long double, which is implemented as an algorithm that uses two 64-bit doubles. Neither one is fully IEEE-compliant, and both must be avoided in cross-platform application code and libraries.
Another example is a small Intel specific optimization that uses inline x86 assembly language and conditionally provides a generic C implementation for other platforms. In most cases,
GCC provides an equivalent built-in function that generates the optimal code for each
platform. Replacing inline assembly language with GCC built-in functions makes the
application more portable and provides equivalent or better performance on all platforms.
To use the Migration Advisor (MA) tool, complete the following steps:
1. Import your project into the SDK.
2. Select the project's Properties.
3. Select the Linux/x86 to PowerLinux application Migration check box under C/C++
General/Code Analysis.
4. Right-click the project name, and select Run Migration Advisor.
Hotspot profiling
IBM SDK for Linux on Power integrates the Linux OProfile hardware event profiling with the
application source code view. This configuration is a convenient way to do hotspot analysis.
The integrated Linux Tools profiler focuses on an application that is selected from the current
SDK project.
After you run the application, the SDK opens an OProfile tab in a console window. This
window shows a nested set of twisties, starting with the event (cycles by default), then
program/library, function, and source line (within function). The developer drills down by
opening the twisties in the profile window, opening the next level of detail. Items are ordered
by profile frequency with highest frequency first. Clicking the function or line number entries in
the profile window causes the source view to jump to the corresponding source file or
line number.
This process is a convenient way to do hotspot analysis, focusing only on the top three to five
items at each level in the profile. Examine the source code for algorithmic problems, excess
conversions, unneeded debug code, and so on, and make the appropriate source
code changes.
With your application code (or subset) imported in to the SDK, it is easy to edit, compile, and
profile code changes and verify improvements. As the developer makes code improvements,
the hotspots in the profile change. Repeat this process until performance is satisfactory or all
the profile entries at the function level are in the low single digits.
To use the integrated profiler, right-click the project and select Profile As → Profile with OProfile. If your project contains multiple applications, or if the application needs setup or inputs to run the specific workload, then create profile configurations as needed.
Source Code Advisor
The Source Code Advisor (SCA) can find and recommend solutions for many coding style and machine hazards. The process generates a journal that associates performance problems (including hazards) with specific source file and line numbers.
The SCA window has a drill-down hierarchy similar to the profile window that is described in
Hotspot profiling on page 237. The SCA window is organized as a list of problem categories,
and then nested twisties, for affected functions and source line numbers within functions.
Functions and lines are ordered by the percent of overall contribution to execution time.
Associated with each problem is a plain language description and suggested solution that
describes a source change or compiler or linker options that are expected to resolve the
problem. Clicking the line number item jumps the source display to the associated source file
and line number for editing.
SCA uses the Feedback Directed Program Restructuring (FDPR) tool to instrument your
application (or library) for code and data flow trace when you run a workload. The resulting
FDPR journal is used to drive the SCA analysis. Running FDPR and retrieving the journal is automated by clicking Profile As → Profile with Source Code Advisor.
Pipeline stall analysis with the cycles per instruction breakdown tool
The cycles per instruction (CPI) metric is a measure of the average processor clock cycles
that are needed to complete an instruction. The CPI value is a measure of processor
performance and, in a modern processor such as the POWER processor, a high value can
indicate poor performance because of a high ratio of stalls in the execution pipeline. By
collecting information from the processors PMU, those events and derived metrics can be
mapped to the CPU functional units (for example, branch, load or store, or floating point),
where they occurred. These events and metrics can be represented in a hierarchical
breakdown of cycles, called the CPI breakdown model (CBM). For more information about
CPI metric and pipeline analysis, see Commonly Used Metrics for Performance Analysis,
found at (registration required):
https://www.power.org/documentation/commonly-used-metrics-for-performance-analysis/
The IBM SDK for Linux on Power includes the CPI breakdown tool for automating the collection of PMU stall events and for building a CBM representation of application execution.
After you run the application, the CPI breakdown tool opens a CBM view in the default Eclipse
perspective. This view shows a breakdown of stall events and metrics, along with their
contribution percentage and description. A sample is shown in Figure B-6 on page 239.
In the CBM view (see Figure B-6), click any of the squares to open the drill-down menu that
shows a nested set of twisties, including the event, program or library, function, and source
line, as shown in the lower right of Figure B-6. As you drill down, the items are ordered by
profile frequency, with highest frequency first. Click a function or line number in the profile
window to open the source view and jump to the corresponding source file and line number.
Use the CPI Breakdown tool to measure application behavior in the POWER processor and
for hotspot analysis. The tool assists in finding the CPU functional units with a high ratio of
stalls and the corresponding chunks of source code that are likely to be the cause of
performance degradation of your application.
Verbose GC Log
The verbose GC log is a key tool for understanding the memory characteristics of a particular workload. The information that is provided in the log can be used to guide tuning decisions to minimize GC impact and improve overall performance. Logging can be activated with the -verbose:gc option and is directed to the command terminal. Logging can be redirected to a file with the -Xverbosegclog:<file> option.
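For example (MyApp is a placeholder class name):

java -verbose:gc MyApp                # write GC events to the terminal
java -Xverbosegclog:gc.log MyApp      # redirect the verbose GC log to gc.log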
Verbose logs capture many types of GC events, such as regular GC cycles, allocation
failures, heap expansion and contraction, events that are related to concurrent marking, and
scavenger collections. Verbose logs also show the approximate length of time many events
take, the number of bytes processed (if applicable), and other relevant metrics. Information
relevant to many of the tuning issues for GC can be obtained from the log, such as
appropriate GC policies, optimal constant heap size, optimal min and max free space factors,
and growth and shrink sizes. For a detailed description of verbose log output, see Diagnostics
Guide for IBM SDK and Runtime Environment Java Technology Edition, Version 6, found at:
http://www-01.ibm.com/support/knowledgecenter/api/redirect/javasdk/v6r0/topic/com.ibm.java.doc.diagnostics.60/homepage/plugin-homepage-java6.html
For more information about the GC and memory visualizer, see Java diagnostics, IBM style,
Part 2: Garbage collection with the IBM Monitoring and Diagnostic Tools for Java Garbage
Collection and Memory Visualizer, found at:
http://www.ibm.com/developerworks/java/library/j-ibmtools2
Hot method or routine analysis
General information about running the profiler and interpreting the results are in AIX on
page 223 and Linux on page 233. For Java profiling, additional Java options are required to
profile the machine code that is generated for methods by the JIT compiler:
AIX 32-bit: -agentlib:jpa=instructions=1
AIX 64-bit: -agentlib:jpa64=instructions=1
Linux OProfile: -agentlib:jvmti_oprofile
The entire execution of a Java program can be profiled, for example, on AIX by running the
following command:
tprof -ujeskzl -A -I -E -x java
However, it is more common to profile Java after a warm-up period so that JIT compilation
activity has completed. To profile after a warm-up, start Java and wait an appropriate interval
until a steady-state performance is reached, which is anywhere from a few seconds to a few
minutes for large applications. Then, start the profiler, for example, on AIX, by running the
following command:
tprof -ujeskzl -A -I -E -x sleep 60
On Linux, OProfile and perf can be used in a similar fashion; for more information, see Java
profiling example.
242
Performance Optimization and Tuning Techniques for IBM Power Systems Processors Including IBM POWER8
The program also uses the Double class, creating many short-lived objects by using new. By
running the program with a small Java heap, GC is frequently required to free the Java heap
space that is taken by the Double objects that are no longer in use.
Example B-12 shows how this program was run and profiled on AIX. 64-bit Java was used
with the options -Xms10m and -Xmx10m to specify the size of the Java heap. The profile that is
generated appears in the java.prof file.
Example: B-12 Results of running tprof on AIX
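A representative invocation, reconstructed from the options that are described above (the jpa64 agent enables JIT instruction profiling, and the ProfileTest class name is taken from the excerpts that follow; the exact options are illustrative), is:

tprof -ujeskzl -A -I -E -x java -agentlib:jpa64=instructions=1 -Xms10m -Xmx10m ProfileTest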
Example B-13 and Example B-14 contain excerpts from the java.prof file that is created on
AIX. Here are the notable elements of the profile:
Lock contention impact: The impact of spin locking is shown in Example B-13 as ticks in
the libj9jit24.so helper routine jitMonitorEntry, in the AIX pthreads library
libpthreads.a, and in the AIX kernel routine _check_lock. This Java program clearly has
excessive lock contention with jitMonitorEntry consuming 26.66% of the ticks in the
profile. jitMonitorEntry and other routines, such as jitMethodMonitorEntry, indicate
spin locking at the Java language level, and the impact in the pthreads library or
_check_lock is locking at the system level, which might be associated with Java locks. For
example, libpthreads.a and _check_lock are active for lock contention that is related to
malloc on AIX.
Example: B-13 AIX profile excerpt showing kernel and shared library ticks
Subroutine        Ticks   %      Source     Address  Bytes
==========        =====   =====  ======     =======  =====
._check_lock        240    5.71  low.s         3420     40

Shared Object                           Ticks   %      Address          Bytes
=============                           =====   =====  =======          =====
libj9jit24.so                            1157   27.51  900000003e81240  5c8878
libj9gc24.so                              510   12.13  900000004534200  91d66
/usr/lib/libpthreads.a[shr_xpg5_64.o]     175    4.16  900000000b83200  30aa0

Profile: libj9jit24.so
Subroutine        Ticks   %      Source     Address  Bytes
==========        =====   =====  ======     =======  =====
.jitMonitorEntry   1121   26.66  nathelp.s   549fc0    cc0
GC impact: The impact of initializing new objects and of GC is shown in Example B-13 as
12.13% of ticks in the libj9gc24.so shared object. This high GC impact is related to the
excessive creation of Double objects in the sample program.
Java method execution: In Example B-14, the profile shows the time that is spent in the
ProfileTest class, which is broken down by method. Some methods appear more than
one time in the breakdown because they are compiled multiple times at increasing
optimization levels by the JIT compiler. Most of the ticks appear in the final highly
optimized version of the doWork()D method, into which the initialize()V and
calculate()V methods are inlined by the JIT compiler.
Example: B-14 AIX profile excerpt showing Java classes and methods
Class               Ticks   %
=====               =====   =====
ProfileTest          1401   33.32
                       38    0.90
java/lang/Float         5    0.12
java/lang/Double        3    0.07
java/lang/Math          3    0.07

Profile: ProfileTest
Method          Ticks   %      Source            Address    Bytes
======          =====   =====  ======            =======    =====
doWork()D        1385   32.94  ProfileTest.java  1107283bc  b54
doWork()D           6    0.14  ProfileTest.java  110725148  464
doWork()D           4    0.10  ProfileTest.java  110726e3c  156c
initialize()V       3    0.07  ProfileTest.java  1107262dc  b4c
calculate()V        2    0.05  ProfileTest.java  110724400  144
initialize()V       1    0.02  ProfileTest.java  1107255c4  d04
Example B-15 contains a shell program that collects a profile on Linux by using OProfile. The
resulting profile might be similar to the previous example profile on AIX, indicating substantial
time in spin locking and in GC. Depending on some specifics of the Linux system, however,
the locking impact can appear in routines in the libj9thr24.so shared object, as compared
to the AIX spin locking seen in libj9jit24.so.
In some cases, an environment variable setting might be necessary to indicate the location of
the JVMTI library that is needed for running OProfile with Java:
Linux 32-bit: LD_LIBRARY_PATH=/usr/lib/oprofile
Linux 64-bit: LD_LIBRARY_PATH=/usr/lib64/oprofile
Alternatively, you can specify the full path to the JVMTI library on the Java command line,
such as:
java -agentpath:/usr/lib/oprofile/libjvmti_oprofile.so
Example: B-15 Linux shell to collect a profile by using OProfile
#!/bin/bash
#
# Sketch: profile a Java program with OProfile and report hot symbols,
# including JIT-compiled methods. Assumes the oprofile JVMTI agent and
# OProfile's operf are installed; the ProfileTest class name is illustrative.
#
export LD_LIBRARY_PATH=/usr/lib64/oprofile        # 64-bit JVMTI agent location
operf java -agentlib:jvmti_oprofile ProfileTest   # run and sample the program
opreport --symbols > java.prof                    # write the per-symbol report
Locking analysis
Locking bottlenecks are fairly common in Java applications. Collect locking information to identify any bottlenecks, and then take the appropriate steps to eliminate the problems. A common case is when older java.util classes, such as Hashtable, do not scale well and cause a locking bottleneck. An easy solution is to use java.util.concurrent classes instead, such as ConcurrentHashMap.
Locking can be at the Java code level or at the system level. Java Lock Monitor is an easy to use tool that identifies locking bottlenecks at the Java language level or in internal JVM locking. A profile that shows a significant fraction of time in kernel locking routines indicates system-level locking, which might be related to an underlying Java locking issue. Other AIX tools, such as splat, are helpful in diagnosing locking problems at the system level. Always evaluate locking in the largest required scalability configuration (the largest number of cores).
WAIT was originally developed for Java and Java Platform, Enterprise Edition workloads, but
a beta version that works with C/C++ native code is also available. The WAIT diagnostic
capabilities are not limited to traditional Java bottlenecks such as GC problems or hot
methods. WAIT employs an expert rule system to look at how Java code communicates with
the wider world to provide a high-level view of system and application bottlenecks.
WAIT is also agentless (relying on javacores, ps, vmstat, and similar information, all of which
are subject to availability). For example, WAIT produces a report with whatever subset of data
can be extracted on a machine. Getting javacores, ps, and vmstat data almost never requires
a change to command lines, environment variables, and so on.
The output is viewed in a browser, such as Firefox, Chrome, Safari, or Internet Explorer; assuming that a browser is available, no additional installation is needed to view the WAIT output.
Reports are interactive, and clicking different elements reveals more information. Manuals,
animated demonstrations, and sample reports are also available on the WAIT website.
For more information about WAIT, go to the following website:
http://wait.researchlabs.ibm.com
This website also has sample input files for WAIT, so users can try out the data analysis and
visualization aspects without collecting any data.
Back cover
SG24-8171-01
ISBN 0738440922
Printed in U.S.A.
ibm.com/redbooks