Perfbook-1c 2023 06 11a
Edited by:
Paul E. McKenney
Facebook
paulmck@kernel.org
v2023.06.11a
Legal Statement
This work represents the views of the editor and the authors and does not necessarily
represent the view of their respective employers.
Trademarks:
• IBM, z Systems, and PowerPC are trademarks or registered trademarks of International
Business Machines Corporation in the United States, other countries, or
both.
• Linux is a registered trademark of Linus Torvalds.
• Intel, Itanium, Intel Core, and Intel Xeon are trademarks of Intel Corporation or its
subsidiaries in the United States, other countries, or both.
• Arm is a registered trademark of Arm Limited (or its subsidiaries) in the US and/or
elsewhere.
• SPARC is a registered trademark of SPARC International, Inc. Products bearing
SPARC trademarks are based on an architecture developed by Sun Microsystems,
Inc.
• Other company, product, and service names may be trademarks or service marks
of such companies.
The non-source-code text and images in this document are provided under the terms
of the Creative Commons Attribution-Share Alike 3.0 United States license.1 In brief,
you may use the contents of this document for any purpose, personal, commercial, or
otherwise, so long as attribution to the authors is maintained. Likewise, the document
may be modified, and derivative works and translations made available, so long as
such modifications and derivations are offered to the public on equal terms as the
non-source-code text and images in the original document.
Source code is covered by various versions of the GPL.2 Some of this code is
GPLv2-only, as it derives from the Linux kernel, while other code is GPLv2-or-later.
See the comment headers of the individual source files within the CodeSamples directory
in the git archive3 for the exact licenses. If you are unsure of the license for a given
code fragment, you should assume GPLv2-only.
Combined work © 2005–2023 by Paul E. McKenney. Each individual contribution is
copyright by its contributor at the time of contribution, as recorded in the git archive.
1 https://creativecommons.org/licenses/by-sa/3.0/us/
2 https://www.gnu.org/licenses/gpl-2.0.html
3 git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/perfbook.git
Contents
2 Introduction
2.1 Historic Parallel Programming Difficulties
2.2 Parallel Programming Goals
2.2.1 Performance
2.2.2 Productivity
2.2.3 Generality
2.3 Alternatives to Parallel Programming
2.3.1 Multiple Instances of a Sequential Application
2.3.2 Use Existing Parallel Software
2.3.3 Performance Optimization
2.4 What Makes Parallel Programming Hard?
2.4.1 Work Partitioning
2.4.2 Parallel Access Control
2.4.3 Resource Partitioning and Replication
2.4.4 Interacting With Hardware
2.4.5 Composite Capabilities
2.4.6 How Do Languages and Environments Assist With These Tasks?
2.5 Discussion
5 Counting
5.1 Why Isn’t Concurrent Counting Trivial?
5.2 Statistical Counters
5.2.1 Design
5.2.2 Array-Based Implementation
5.2.3 Per-Thread-Variable-Based Implementation
5.2.4 Eventually Consistent Implementation
5.2.5 Discussion
5.3 Approximate Limit Counters
5.3.1 Design
5.3.2 Simple Limit Counter Implementation
5.3.3 Simple Limit Counter Discussion
5.3.4 Approximate Limit Counter Implementation
5.3.5 Approximate Limit Counter Discussion
5.4 Exact Limit Counters
5.4.1 Atomic Limit Counter Implementation
5.4.2 Atomic Limit Counter Discussion
5.4.3 Signal-Theft Limit Counter Design
5.4.4 Signal-Theft Limit Counter Implementation
5.4.5 Signal-Theft Limit Counter Discussion
5.4.6 Applying Exact Limit Counters
7 Locking
7.1 Staying Alive
7.1.1 Deadlock
7.1.2 Livelock and Starvation
7.1.3 Unfairness
7.1.4 Inefficiency
7.2 Types of Locks
7.2.1 Exclusive Locks
7.2.2 Reader-Writer Locks
7.2.3 Beyond Reader-Writer Locks
7.2.4 Scoped Locking
7.3 Locking Implementation Issues
7.3.1 Sample Exclusive-Locking Implementation Based on Atomic Exchange
7.3.2 Other Exclusive-Locking Implementations
7.4 Lock-Based Existence Guarantees
7.5 Locking: Hero or Villain?
7.5.1 Locking For Applications: Hero!
11 Validation
11.1 Introduction
11.1.1 Where Do Bugs Come From?
11.1.2 Required Mindset
11.1.3 When Should Validation Start?
11.1.4 The Open Source Way
11.2 Tracing
11.3 Assertions
11.4 Static Analysis
11.5 Code Review
11.5.1 Inspection
11.5.2 Walkthroughs
11.5.3 Self-Inspection
11.6 Probability and Heisenbugs
11.6.1 Statistics for Discrete Testing
11.6.2 Statistics Abuse for Discrete Testing
11.6.3 Statistics for Continuous Testing
11.6.4 Hunting Heisenbugs
11.7 Performance Estimation
11.7.1 Benchmarking
11.7.2 Profiling
11.7.3 Differential Profiling
11.7.4 Microbenchmarking
11.7.5 Isolation
11.7.6 Detecting Interference
11.8 Summary
Glossary
Bibliography
Credits
LaTeX Advisor
Reviewers
Machine Owners
Original Publications
Figure Credits
Other Support
Acronyms
Index
If you would only recognize that life is hard, things
would be so much easier for you.
Louis D. Brandeis
Chapter 1
How To Use This Book
The purpose of this book is to help you program shared-memory parallel systems
without risking your sanity.1 Nevertheless, you should think of the information in this
book as a foundation on which to build, rather than as a completed cathedral. Your
mission, if you choose to accept, is to help make further progress in the exciting field of
parallel programming—progress that will in time render this book obsolete.
Parallel programming in the 21st century is no longer focused solely on science,
research, and grand-challenge projects. And this is all to the good, because it means
that parallel programming is becoming an engineering discipline. Therefore, as befits
an engineering discipline, this book examines specific parallel-programming tasks and
describes how to approach them. In some surprisingly common cases, these tasks can
be automated.
This book is written in the hope that presenting the engineering discipline underlying
successful parallel-programming projects will free a new generation of parallel hackers
from the need to slowly and painstakingly reinvent old wheels, enabling them to instead
focus their energy and creativity on new frontiers. However, what you get from this
book will be determined by what you put into it. It is hoped that simply reading this
book will be helpful, and that working the Quick Quizzes will be even more helpful.
However, the best results come from applying the techniques taught in this book to
real-life problems. As always, practice makes perfect.
But no matter how you approach it, we sincerely hope that parallel programming
brings you at least as much fun, excitement, and challenge as it has brought to us!
1 Or, perhaps more accurately, without much greater risk to your sanity than that incurred
by non-parallel programming. Which, come to think of it, might not be saying all that much.
1.1 Roadmap
This book is a handbook of widely applicable and heavily used design techniques, rather
than a collection of optimal algorithms with tiny areas of applicability. You are currently
reading Chapter 1, but you knew that already. Chapter 2 gives a high-level overview of
parallel programming.
Chapter 3 introduces shared-memory parallel hardware. After all, it is difficult to
write good parallel code unless you understand the underlying hardware. Because
hardware constantly evolves, this chapter will always be out of date. We will nevertheless
do our best to keep up. Chapter 4 then provides a very brief overview of common
shared-memory parallel-programming primitives.
Chapter 5 takes an in-depth look at parallelizing one of the simplest problems
imaginable, namely counting. Because almost everyone has an excellent grasp of
counting, this chapter is able to delve into many important parallel-programming issues
without the distractions of more-typical computer-science problems. My impression is
that this chapter has seen the greatest use in parallel-programming coursework.
Chapter 6 introduces a number of design-level methods of addressing the issues
identified in Chapter 5. It turns out that it is important to address parallelism at the
design level when feasible: To paraphrase Dijkstra [Dij68], “retrofitted parallelism
considered grossly suboptimal” [McK12c].
The next three chapters examine three important approaches to synchronization.
Chapter 7 covers locking, which is still not only the workhorse of production-quality
parallel programming, but is also widely considered to be parallel programming’s worst
villain. Chapter 8 gives a brief overview of data ownership, an often overlooked but
remarkably pervasive and powerful approach. Finally, Chapter 9 introduces a number
of deferred-processing mechanisms, including reference counting, hazard pointers,
sequence locking, and RCU.
Chapter 10 applies the lessons of previous chapters to hash tables, which are heavily
used due to their excellent partitionability, which (usually) leads to excellent performance
and scalability.
As many have learned to their sorrow, parallel programming without validation is a
sure path to abject failure. Chapter 11 covers various forms of testing. It is of course
impossible to test reliability into your program after the fact, so Chapter 12 follows up
with a brief overview of a couple of practical approaches to formal verification.
Chapter 13 contains a series of moderate-sized parallel programming problems. The
difficulty of these problems varies, but should be appropriate for someone who has
mastered the material in the previous chapters.
Chapter 14 looks at advanced synchronization methods, including non-blocking
synchronization and parallel real-time computing, while Chapter 15 covers the advanced
topic of memory ordering. Chapter 16 follows up with some ease-of-use advice.
Chapter 17 looks at a few possible future directions, including shared-memory parallel system
design, software and hardware transactional memory, and functional programming for
parallelism. Finally, Chapter 18 reviews the material in this book and its origins.
This chapter is followed by a number of appendices. The most popular of these
appears to be Appendix C, which delves even further into memory ordering. Appendix E
contains the answers to the infamous Quick Quizzes, which are discussed in the next
section.
1.2 Quick Quizzes
“Quick quizzes” appear throughout this book, and the answers may be found in
Appendix E starting on page 705. Some of them are based on material in the section in
which that quick quiz appears, but others require you to think beyond that section, and, in some
cases, beyond the realm of current knowledge. As with most endeavors, what you get
out of this book is largely determined by what you are willing to put into it. Therefore,
readers who make a genuine effort to solve a quiz before looking at the answer find their
effort repaid handsomely with increased understanding of parallel programming.
Quick Quiz 1.1: Where are the answers to the Quick Quizzes found?
Quick Quiz 1.2: Some of the Quick Quiz questions seem to be from the viewpoint of the
reader rather than the author. Is that really the intent?
Quick Quiz 1.3: These Quick Quizzes are just not my cup of tea. What can I do about it?
In short, if you need a deep understanding of the material, then you should invest
some time into answering the Quick Quizzes. Don’t get me wrong, passively reading
the material can be quite valuable, but gaining full problem-solving capability really
does require that you practice solving problems. Similarly, gaining full code-production
capability really does require that you practice producing code.
Quick Quiz 1.4: If passively reading this book doesn’t get me full problem-solving and
code-production capabilities, what on earth is the point???
I learned this the hard way during coursework for my late-in-life Ph.D. I was studying
a familiar topic, and was surprised at how few of the chapter’s exercises I could answer
off the top of my head.2 Forcing myself to answer the questions greatly increased
my retention of the material. So with these Quick Quizzes I am not asking you to do
anything that I have not been doing myself.
Finally, the most common learning disability is thinking that you already understand
the material at hand. The quick quizzes can be an extremely effective cure.
2 So I suppose that it was just as well that my professors refused to let me waive that
class!
1.3 Alternatives to This Book
As Knuth learned the hard way, if you want your book to be finite, it must be focused.
This book focuses on shared-memory parallel programming, with an emphasis on
software that lives near the bottom of the software stack, such as operating-system
kernels, parallel data-management systems, low-level libraries, and the like. The
programming language used by this book is C.
If you are interested in other aspects of parallelism, you might well be better served
by some other book. Fortunately, there are many alternatives available to you:
Bornat.
the problems inherent in parallelism often take a back seat to getting one’s head
around a real-world application.
4. If you want to work with Linux-kernel device drivers, then Corbet’s, Rubini’s, and
Kroah-Hartman’s “Linux Device Drivers” [CRKH05] is indispensable, as is the
Linux Weekly News web site (https://lwn.net/). There is a large number of
books and resources on the more general topic of Linux kernel internals.
5. If your primary focus is scientific and technical computing, and you prefer a
patternist approach, you might try Mattson et al.’s textbook [MSM05]. It covers
Java, C/C++, OpenMP, and MPI. Its patterns are admirably focused first on design,
then on implementation.
6. If your primary focus is scientific and technical computing, and you are interested
in GPUs, CUDA, and MPI, you might check out Norm Matloff’s “Programming
on Parallel Machines” [Mat17]. Of course, the GPU vendors have quite a bit of
additional information [AMD20, Zel11, NVi17a, NVi17b].
7. If you are interested in POSIX Threads, you might take a look at David R. Butenhof’s
book [But97]. In addition, W. Richard Stevens’s book [Ste92, Ste13] covers UNIX
and POSIX, and Stewart Weiss’s lecture notes [Wei13] provide a thorough and
accessible introduction with a good set of examples.
8. If you are interested in C++11, you might like Anthony Williams’s “C++ Concurrency
in Action: Practical Multithreading” [Wil12, Wil19].
9. If you are interested in C++, but in a Windows environment, you might try Herb
Sutter’s “Effective Concurrency” series in Dr. Dobb’s Journal [Sut08]. This series
does a reasonable job of presenting a commonsense approach to parallelism.
10. If you want to try out Intel Threading Building Blocks, then perhaps James
Reinders’s book [Rei07] is what you are looking for.
11. Those interested in learning how various types of multi-processor hardware cache
organizations affect the implementation of kernel internals should take a look at
Curt Schimmel’s classic treatment of this subject [Sch94].
12. If you are looking for a hardware view, Hennessy’s and Patterson’s classic
textbook [HP17, HP11] is well worth a read. A “Reader’s Digest” version of this
tome geared for scientific and technical workloads (bashing big arrays) may be
found in Andrew Chien’s textbook [Chi22]. If you are looking for an academic
textbook on memory ordering from a more hardware-centric viewpoint, that of
Daniel Sorin et al. [SHW11, NSHW20] is highly recommended. For a memory-ordering
tutorial from a Linux-kernel viewpoint, Paolo Bonzini’s LWN series is a
good place to start [Bon21a, Bon21e, Bon21c, Bon21b, Bon21d, Bon21f].
13. Those wishing to learn about the Rust language’s support for low-level concurrency
should refer to Mara Bos’s book [Bos23].
14. Finally, those using Java might be well-served by Doug Lea’s textbooks [Lea97,
GPB+07].
However, if you are interested in principles of parallel design for low-level software,
especially software written in C, read on!
1.4 Sample Source Code
This book discusses its fair share of source code, and in many cases this source code
may be found in the CodeSamples directory of this book’s git tree. For example, on
UNIX systems, you should be able to type the following:
    find CodeSamples -name rcu_rcpls.c -print
This command will locate the file rcu_rcpls.c, which is called out in Appendix B.
Non-UNIX systems have their own well-known ways of locating files by filename.
1.5 Whose Book Is This?
As the cover says, the editor is one Paul E. McKenney. However, the editor does accept
contributions via the perfbook@vger.kernel.org email list. These contributions
can be in pretty much any form, with popular approaches including text emails, patches
against the book’s LaTeX source, and even git pull requests. Use whatever form
works best for you.
To create patches or git pull requests, you will need the LaTeX source to
the book, which is at
git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/perfbook.git, or, alternatively,
https://git.kernel.org/pub/scm/linux/kernel/git/paulmck/perfbook.git.
You will of course also need git
and LaTeX, which are available as part of most mainstream Linux distributions. Other
packages may be required, depending on the distribution you use. The list of required
packages for a few popular distributions is given in the file FAQ-BUILD.txt in the
LaTeX source to the book.
To create and display a current LaTeX source tree of this book, use the list of Linux
commands shown in Listing 1.1. In some environments, the evince command that
displays perfbook.pdf may need to be replaced, for example, with acroread. The
git clone command need only be used the first time you create a PDF; subsequently,
you can run the commands shown in Listing 1.2 to pull in any updates and generate an
updated PDF. The commands in Listing 1.2 must be run within the perfbook directory
created by the commands shown in Listing 1.1.
PDFs of this book are sporadically posted at
https://kernel.org/pub/linux/kernel/people/paulmck/perfbook/perfbook.html
and at http://www.rdrop.com/users/paulmck/perfbook/.
The actual process of contributing patches and sending git pull requests is similar to
that of the Linux kernel, which is documented here:
https://www.kernel.org/doc/html/latest/process/submitting-patches.html.
One important requirement
is that each patch (or commit, in the case of a git pull request) must contain a valid
Signed-off-by: line, which has the following format:
(a) The contribution was created in whole or in part by me and I have the right to
submit it under the open source license indicated in the file; or
(b) The contribution is based upon previous work that, to the best of my knowledge, is
covered under an appropriate open source license and I have the right under that
license to submit that work with modifications, whether created in whole or in part
by me, under the same open source license (unless I am permitted to submit under
a different license), as indicated in the file; or
(c) The contribution was provided directly to me by some other person who certified
(a), (b) or (c) and I have not modified it.
(d) I understand and agree that this project and the contribution are public and that
a record of the contribution (including all personal information I submit with
it, including my sign-off) is maintained indefinitely and may be redistributed
consistent with this project or the open source license(s) involved.
This is quite similar to the Developer’s Certificate of Origin (DCO) 1.1 used by the
Linux kernel. You must use your real name: I unfortunately cannot accept pseudonymous
or anonymous contributions.
The language of this book is American English; however, the open-source nature
of this book permits translations, and I personally encourage them. The open-source
licenses covering this book additionally allow you to sell your translation, if you wish. I
do request that you send me a copy of the translation (hardcopy if available), but this
is a request made as a professional courtesy, and is not in any way a prerequisite to
the permission that you already have under the Creative Commons and GPL licenses.
Please see the FAQ.txt file in the source tree for a list of translations currently in
progress. I consider a translation effort to be “in progress” once at least one chapter has
been fully translated.
There are many styles under the “American English” rubric. The style for this
particular book is documented in Appendix D.
As noted at the beginning of this section, I am this book’s editor. However, if you
choose to contribute, it will be your book as well. In that spirit, I offer you Chapter 2,
our introduction.
If parallel programming is so hard, why are there so
many parallel programs?
Unknown
Chapter 2
Introduction
Parallel programming has earned a reputation as one of the most difficult areas a
hacker can tackle. Papers and textbooks warn of the perils of deadlock, livelock, race
conditions, non-determinism, Amdahl’s-Law limits to scaling, and excessive realtime
latencies. And these perils are quite real; we authors have accumulated uncounted years
of experience along with the resulting emotional scars, grey hairs, and hair loss.
However, new technologies that are difficult to use at introduction invariably become
easier over time. For example, the once-rare ability to drive a car is now commonplace
in many countries. This dramatic change came about for two basic reasons: (1) Cars
became cheaper and more readily available, so that more people had the opportunity to
learn to drive, and (2) Cars became easier to operate due to automatic transmissions,
automatic chokes, automatic starters, greatly improved reliability, and a host of other
technological improvements.
The same is true for many other technologies, including computers. It is no longer
necessary to operate a keypunch in order to program. Spreadsheets allow most
non-programmers to get results from their computers that would have required a team of
specialists a few decades ago. Perhaps the most compelling example is web-surfing
and content creation, which since the early 2000s has been easily done by untrained,
uneducated people using various now-commonplace social-networking tools. As
recently as 1968, such content creation was a far-out research project [Eng68], described
at the time as “like a UFO landing on the White House lawn” [Gri00].
Therefore, if you wish to argue that parallel programming will remain as difficult as
it is currently perceived by many to be, it is you who bears the burden of proof, keeping
in mind the many centuries of counter-examples in many fields of endeavor.
2.1 Historic Parallel Programming Difficulties
As indicated by its title, this book takes a different approach. Rather than complain
about the difficulty of parallel programming, it instead examines the reasons why
parallel programming is difficult, and then works to help the reader to overcome these
difficulties. As will be seen, these difficulties have historically fallen into several
categories, including:
2. The typical researcher’s and practitioner’s lack of experience with parallel systems.
Many of these historic difficulties are well on the way to being overcome. First, over
the past few decades, the cost of parallel systems has decreased from many multiples of
that of a house to that of a modest meal, courtesy of Moore’s Law [Moo65]. Papers
calling out the advantages of multicore CPUs were published as early as 1996 [ONH+96].
IBM introduced simultaneous multi-threading into its high-end POWER family in 2000,
and multicore in 2001. Intel introduced hyperthreading into its commodity Pentium
line in November 2000, and both AMD and Intel introduced dual-core CPUs in 2005.
Sun followed with the multicore/multi-threaded Niagara in late 2005. In fact, by 2008,
it was becoming difficult to find a single-CPU desktop system, with single-core CPUs
being relegated to netbooks and embedded devices. By 2012, even smartphones were
starting to sport multiple CPUs. By 2020, safety-critical software standards started
addressing concurrency.
Second, the advent of low-cost and readily available multicore systems means that the
once-rare experience of parallel programming is now available to almost all researchers
and practitioners. In fact, parallel systems have long been within the budget of students
and hobbyists. We can therefore expect greatly increased levels of invention and
innovation surrounding parallel systems, and that increased familiarity will over time
make the once prohibitively expensive field of parallel programming much more friendly
and commonplace.
Third, in the 20th century, large systems of highly parallel software were almost
always closely guarded proprietary secrets. In happy contrast, the 21st century has
seen numerous open-source (and thus publicly available) parallel software projects,
including the Linux kernel [Tor03], database systems [Pos08, MS08], and
message-passing systems [The08, Uni08a]. This book will draw primarily from the Linux kernel,
but will provide much material suitable for user-level applications.
Fourth, even though the large-scale parallel-programming projects of the 1980s and
1990s were almost all proprietary projects, these projects have seeded other communities
with cadres of developers who understand the engineering discipline required to develop
production-quality parallel code. A major purpose of this book is to present this
engineering discipline.
Unfortunately, the fifth difficulty, the high cost of communication relative to that
of processing, remains largely in force. This difficulty has been receiving increasing
attention during the new millennium. However, according to Stephen Hawking,
the finite speed of light and the atomic nature of matter will limit progress in this
area [Gar07, Moo03]. Fortunately, this difficulty has been in force since the late 1980s,
so that the aforementioned engineering discipline has evolved practical and effective
strategies for handling it. In addition, hardware designers are increasingly aware of
these issues, so perhaps future hardware will be more friendly to parallel software, as
discussed in Section 3.3.
Quick Quiz 2.1: Come on now!!! Parallel programming has been known to be exceedingly
hard for many decades. You seem to be hinting that it is not so hard. What sort of game are you
playing?
2.2 Parallel Programming Goals
The three major goals of parallel programming (over and above those of sequential
programming) are as follows:
1. Performance.
2. Productivity.
3. Generality.
Unfortunately, given the current state of the art, it is possible to achieve at best two of
these three goals for any given parallel program. These three goals therefore form the
iron triangle of parallel programming, a triangle upon which overly optimistic hopes all
too often come to grief.1
Quick Quiz 2.3: Oh, really??? What about correctness, maintainability, robustness, and so
on?
Quick Quiz 2.4: And if correctness, maintainability, and robustness don’t make the list, why
do productivity and generality?
Quick Quiz 2.5: Given that parallel programs are much harder to prove correct than are
sequential programs, again, shouldn’t correctness really be on the list?
[Figure 2.1: horizontal axis: Year, 1975 to 2020; vertical-axis ticks from 0.1 to 10000]
2.2.1 Performance
Performance is the primary goal behind most parallel-programming effort. After all, if
performance is not a concern, why not do yourself a favor: Just write sequential code,
and be happy? It will very likely be easier and you will probably get done much more
quickly.
Quick Quiz 2.7: Are there no cases where parallel programming is about something other
than performance?
Note that “performance” is interpreted broadly here, including for example scalability
(performance per CPU) and efficiency (performance per watt).
That said, the focus of performance has shifted from hardware to parallel software.
This change in focus is due to the fact that, although Moore’s Law continues to deliver
increases in transistor density, it has ceased to provide the traditional single-threaded
performance increases. This can be seen in Figure 2.1,2 which shows that writing
single-threaded code and simply waiting a year or two for the CPUs to catch up may
no longer be an option. Given the recent trends on the part of all major manufacturers
towards multicore/multithreaded systems, parallelism is the way to go for those wanting
to avail themselves of the full performance of their systems.
Quick Quiz 2.8: Why not instead rewrite programs from inefficient scripting languages to C
or C++?
Even so, the first goal is performance rather than scalability, especially given that the
easiest way to attain linear scalability is to reduce the performance of each CPU [Tor01].
2 This plot shows clock frequencies for newer CPUs theoretically capable of retiring one
or more instructions per clock, and MIPS (millions of instructions per second, usually from
the old Dhrystone benchmark) for older CPUs requiring multiple clocks to execute even the
simplest instruction. The reason for shifting between these two measures is that the newer
CPUs’ ability to retire multiple instructions per clock is typically limited by memory-system
performance. Furthermore, the benchmarks commonly used on the older CPUs are obsolete,
and it is difficult to run the newer benchmarks on systems containing the old CPUs, in part
because it is hard to find working instances of the old CPUs.
Given a four-CPU system, which would you prefer? A program that provides 100
transactions per second on a single CPU, but does not scale at all? Or a program that
provides 10 transactions per second on a single CPU, but scales perfectly? The first
program seems like a better bet, though the answer might change if you happened to
have a 32-CPU system.
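The arithmetic behind this comparison can be sketched in C as follows. The function names are hypothetical, and the model is deliberately idealized: it assumes that the first program delivers the same rate no matter how many CPUs are available, and that the second program scales perfectly linearly.

```c
#include <assert.h>

/* Throughput of a program that does not scale: extra CPUs do not help. */
double nonscaling_tps(double tps_one_cpu, int ncpus)
{
	assert(ncpus >= 1);
	return tps_one_cpu;
}

/* Throughput of a program that scales perfectly (idealized assumption). */
double scaling_tps(double tps_one_cpu, int ncpus)
{
	assert(ncpus >= 1);
	return tps_one_cpu * ncpus;
}

/* Smallest CPU count at which the scalable program is strictly faster. */
int crossover_cpus(double fast_tps, double scalable_tps)
{
	int n = 1;

	assert(scalable_tps > 0.0);
	while (scaling_tps(scalable_tps, n) <= nonscaling_tps(fast_tps, n))
		n++;
	return n;
}
```

With the numbers used above, `scaling_tps(10.0, 4)` is 40 transactions per second against 100 for the non-scaling program, `scaling_tps(10.0, 32)` is 320, and `crossover_cpus(100.0, 10.0)` returns 11. Under these assumptions, the scalable program wins only on systems with more than ten CPUs.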
That said, just because you have multiple CPUs is not necessarily in and of itself
a reason to use them all, especially given the recent decreases in price of multi-CPU
systems. The key point to understand is that parallel programming is primarily a
performance optimization, and, as such, it is one potential optimization of many. If your
program is fast enough as currently written, there is no reason to optimize, either by
parallelizing it or by applying any of a number of potential sequential optimizations.3
By the same token, if you are looking to apply parallelism as an optimization to a
sequential program, then you will need to compare parallel algorithms to the best
sequential algorithms. This may require some care, as far too many publications ignore
the sequential case when analyzing the performance of parallel algorithms.
2.2.2 Productivity
Quick Quiz 2.9: Why all this prattling on about non-technical issues??? And not just any
non-technical issue, but productivity of all things? Who cares?
3 Of course, if you are a hobbyist whose primary interest is writing parallel software,
that is more than enough reason to parallelize whatever software you are interested in.
[Figure: log-scale plot versus year, 1975–2020]
One of the inescapable consequences of the rapid decrease in the cost of hardware is
that software productivity becomes increasingly important. It is no longer sufficient
merely to make efficient use of the hardware: It is now necessary to make extremely
efficient use of software developers as well. This has long been the case for sequential
hardware, but parallel hardware has become a low-cost commodity only recently.
Therefore, only recently has high productivity become critically important when
creating parallel software.
Quick Quiz 2.10: Given how cheap parallel systems have become, how can anyone afford to
pay people to program them?
Perhaps at one time, the sole purpose of parallel software was performance. Now,
however, productivity is gaining the spotlight.
2.2.3 Generality
One way to justify the high cost of developing parallel software is to strive for maximal
generality. All else being equal, the cost of a more-general software artifact can be
spread over more users than that of a less-general one. In fact, this economic force
explains much of the maniacal focus on portability, which can be seen as an important
special case of generality.4
Unfortunately, generality often comes at the cost of performance, productivity, or
both. For example, portability is often achieved via adaptation layers, which inevitably
exact a performance penalty. To see this more generally, consider the following popular
parallel programming environments:
C/C++ “Locking Plus Threads”: This category, which includes POSIX Threads
(pthreads) [Ope97], Windows Threads, and numerous operating-system kernel
environments, offers excellent performance (at least within the confines of a
single SMP system) and also offers good generality. Pity about the relatively low
productivity.
4 Kudos to Michael Wong for pointing this out.
[Figure: software stack with layers including Application, System Libraries, Container, Firmware, and Hardware, annotated with Productivity, Performance, and Generality]
MPI: This Message Passing Interface [MPI08] powers the largest scientific and technical
computing clusters in the world and offers unparalleled performance and scalability.
In theory, it is general purpose, but it is mainly used for scientific and technical
computing. Its productivity is believed by many to be even lower than that of
C/C++ “locking plus threads” environments.
OpenMP: This set of compiler directives can be used to parallelize loops. It is thus
quite specific to this task, and this specificity often limits its performance. It is,
however, much easier to use than MPI or C/C++ “locking plus threads.”
SQL: Structured Query Language [Int92] is specific to relational database queries.
However, its performance is quite good as measured by the Transaction Processing
Performance Council (TPC) benchmark results [Tra01]. Productivity is excellent;
in fact, this parallel programming environment enables people to make good
use of a large parallel system despite having little or no knowledge of parallel
programming concepts.
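To make the OpenMP entry in the list above more concrete, the following minimal sketch parallelizes a loop with a single compiler directive. The function name `vec_add()` is hypothetical. When the code is built with OpenMP support (for example, with GCC's `-fopenmp` option), the loop iterations are divided among threads. When it is built without OpenMP support, the directive is ignored and the loop simply runs sequentially.

```c
#include <assert.h>
#include <stddef.h>

/* Element-wise sum of two vectors: out[i] = a[i] + b[i]. */
void vec_add(const double *a, const double *b, double *out, size_t n)
{
	assert(a != NULL && b != NULL && out != NULL);

	/*
	 * Each iteration is independent of all the others, so the loop
	 * can be handed to OpenMP unchanged.  The loop variable is
	 * private to each thread.
	 */
	#pragma omp parallel for
	for (long i = 0; i < (long)n; i++)
		out[i] = a[i] + b[i];
}
```

The contrast with the other entries is that the sequential and parallel versions of this function differ by only one line, which is one way of seeing both the ease of use and the loop-specific nature of this environment.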
[Figure 2.4: a general-purpose environment (tailored to hardware or to an abstraction) at the center, surrounded by four special-purpose environments, each productive for one of users 1, 2, 3, and 4]
(hence the importance of generality), and performance lost in lower layers cannot easily
be recovered further up the stack. In the upper layers of the stack, there might be very
few users for a given specific application, in which case productivity concerns are
paramount. This explains the tendency towards “bloatware” further up the stack: Extra
hardware is often cheaper than extra developers. This book is intended for developers
working near the bottom of the stack, where performance and generality are of greatest
concern.
It is important to note that a tradeoff between productivity and generality has existed
for centuries in many fields. For but one example, a nailgun is more productive than
a hammer for driving nails, but in contrast to the nailgun, a hammer can be used for
many things besides driving nails. It should therefore be no surprise to see similar
tradeoffs appear in the field of parallel computing. This tradeoff is shown schematically
in Figure 2.4. Here, users 1, 2, 3, and 4 have specific jobs that they need the computer to
help them with. The most productive possible language or environment for a given user is
one that simply does that user’s job, without requiring any programming, configuration,
or other setup.
Quick Quiz 2.11: This is a ridiculously unachievable ideal! Why not focus on something that
is achievable in practice?
Unfortunately, a system that does the job required by user 1 is unlikely to do user 2’s job.
In other words, the most productive languages and environments are domain-specific,
and thus by definition lacking generality.
Another option is to tailor a given programming language or environment to the
hardware system (for example, low-level languages such as assembly, C, C++, or Java)
or to some abstraction (for example, Haskell, Prolog, or Snobol), as is shown by the
circular region near the center of Figure 2.4. These languages can be considered to be
general in the sense that they are equally ill-suited to the jobs required by users 1, 2, 3,
and 4. In other words, their generality comes at the expense of decreased productivity
when compared to domain-specific languages and environments. Worse yet, a language
that is tailored to a given abstraction is likely to suffer from performance and scalability
problems unless and until it can be efficiently mapped to real hardware.
2.3 Alternatives to Parallel Programming
In order to properly consider alternatives to parallel programming, you must first decide
on what exactly you expect the parallelism to do for you. As seen in Section 2.2, the
primary goals of parallel programming are performance, productivity, and generality.
Because this book is intended for developers working on performance-critical code near
the bottom of the software stack, the remainder of this section focuses primarily on
performance improvement.
It is important to keep in mind that parallelism is but one way to improve performance.
Other well-known approaches include the following, in roughly increasing order of
difficulty:
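[Figure 2.5: the four categories of parallel-programming tasks (work partitioning, parallel access control, resource partitioning and replication, and interacting with hardware), set against the goals of performance, productivity, and generality]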
Parallelism can be a powerful optimization technique, but it is not the only such
technique, nor is it appropriate for all situations. Of course, the easier it is to
parallelize your program, the more attractive parallelization becomes as an optimization.
Parallelization has a reputation of being quite difficult, which leads to the question
“exactly what makes parallel programming so difficult?”
2.4 What Makes Parallel Programming Hard?
One such approach is to carefully consider the tasks that parallel programmers must
undertake that are not required of sequential programmers. We can then evaluate how
well a given programming language or environment assists the developer with these
tasks. These tasks fall into the four categories shown in Figure 2.5, each of which is
covered in the following sections.
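[Figure 2.6: ordering of the four categories of parallel-programming tasks (work partitioning, parallel access control, resource partitioning and replication, and interacting with hardware), set against the goals of performance, productivity, and generality]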
Although these four capabilities are fundamental, good engineering practice uses
composites of these capabilities. For example, the data-parallel approach first partitions
the data so as to minimize the need for inter-partition communication, partitions the code
accordingly, and finally maps data partitions and threads so as to maximize throughput
while minimizing inter-thread communication, as shown in Figure 2.6. The developer
can then consider each partition separately, greatly reducing the size of the relevant state
space, in turn increasing productivity. Even though some problems are non-partitionable,
clever transformations into forms permitting partitioning can sometimes greatly enhance
both performance and scalability [Met99].
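The data-parallel approach described above can be sketched as follows, under several assumptions made purely for illustration: POSIX threads are available (build with `-pthread` where required), the data is an array of `long`, the work is a simple sum, and there are four partitions. The names `parallel_sum()`, `sum_partition()`, and `NPARTS` are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdlib.h>

#define NPARTS 4	/* Number of partitions and threads: arbitrary. */

struct partition {
	const long *data;	/* First element of this partition. */
	size_t n;		/* Number of elements in this partition. */
	long sum;		/* Result, written once by the owning thread. */
};

/* Each thread sums only its own partition, so no locking is needed. */
static void *sum_partition(void *arg)
{
	struct partition *p = arg;
	long sum = 0;	/* Accumulate locally to avoid sharing cache lines. */

	for (size_t i = 0; i < p->n; i++)
		sum += p->data[i];
	p->sum = sum;
	return NULL;
}

/* Split data[0..n-1] into NPARTS contiguous chunks and sum them in parallel. */
long parallel_sum(const long *data, size_t n)
{
	struct partition parts[NPARTS];
	pthread_t tids[NPARTS];
	size_t chunk = n / NPARTS;
	long total = 0;

	assert(data != NULL);
	for (int i = 0; i < NPARTS; i++) {
		parts[i].data = data + i * chunk;
		/* The last partition also takes any leftover elements. */
		parts[i].n = (i == NPARTS - 1) ? n - i * chunk : chunk;
		if (pthread_create(&tids[i], NULL, sum_partition, &parts[i]) != 0)
			abort();
	}
	/* The only inter-thread communication: collecting partial sums. */
	for (int i = 0; i < NPARTS; i++) {
		if (pthread_join(tids[i], NULL) != 0)
			abort();
		total += parts[i].sum;
	}
	return total;
}
```

Because each thread touches only its own partition until the final collection step, the developer can reason about `sum_partition()` as if it were sequential code.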
Although many environments require the developer to deal manually with these tasks,
there are long-standing environments that bring significant automation to bear. The poster
child for these environments is SQL, many implementations of which automatically
parallelize single large queries and also automate concurrent execution of independent
queries and updates.
These four categories of tasks must be carried out in all parallel programs, but that
of course does not necessarily mean that the developer must manually carry out these
tasks. We can expect to see ever-increasing automation of these four tasks as parallel
systems continue to become cheaper and more readily available.
Quick Quiz 2.16: Are there any other obstacles to parallel programming?
2.5 Discussion
Until you try, you don’t know what you can’t do.
Henry James
This section has given an overview of the difficulties with, goals of, and alternatives to
parallel programming. This overview was followed by a discussion of what can make
parallel programming hard, along with a high-level approach for dealing with parallel
programming’s difficulties. Those who still insist that parallel programming is impossibly
difficult should review some of the older guides to parallel programming [Seq88,
Bir89, BK85, Inm85]. The following quote from Andrew Birrell’s monograph [Bir89]
is especially telling:
Writing concurrent programs has a reputation for being exotic and difficult.
I believe it is neither. You need a system that provides you with good primitives
and suitable libraries, you need a basic caution and carefulness, you need an
armory of useful techniques, and you need to know of the common pitfalls.
I hope that this paper has helped you towards sharing my belief.
The authors of these older guides were well up to the parallel programming challenge
back in the 1980s. As such, there are simply no excuses for refusing to step up to the
parallel-programming challenge here in the 21st century!
We are now ready to proceed to the next chapter, which dives into the relevant
properties of the parallel hardware underlying our parallel software.
Premature abstraction is the root of all evil.
A cast of thousands
Chapter 3
Hardware and its Habits
Most people intuitively understand that passing messages between systems is more
expensive than performing simple calculations within the confines of a single system.
But communicating among threads within the confines of a single
shared-memory system can also be quite expensive. This chapter therefore looks at the
cost of synchronization and communication within a shared-memory system. These
few pages can do no more than scratch the surface of shared-memory parallel hardware
design; readers desiring more detail would do well to start with a recent edition of
Hennessy’s and Patterson’s classic text [HP17, HP95].
Quick Quiz 3.1: Why should parallel programmers bother learning low-level properties of
the hardware? Wouldn’t it be easier, better, and more elegant to remain at a higher level of
abstraction?
3.1 Overview
Mechanical Sympathy: Hardware and software
working together in harmony.
Martin Thompson
Careless reading of computer-system specification sheets might lead one to believe that
CPU performance is a footrace on a clear track, as illustrated in Figure 3.1, where the
race always goes to the swiftest.
Although there are a few CPU-bound benchmarks that approach the ideal case shown
in Figure 3.1, the typical program more closely resembles an obstacle course than a
race track. This is because the internal architecture of CPUs has changed dramatically
over the past few decades, courtesy of Moore’s Law. These changes are described in the
following sections.
[Figure: cartoon sign reading “PIPELINE ERROR: BRANCH MISPREDICTION”]
instruction and data handling; speculative execution, and more [HP17, HP11] in order
to optimize the flow of instructions and data through the CPU. Some cores have more
than one hardware thread, which is variously called simultaneous multithreading (SMT)
or hyperthreading (HT) [Fen73], each of which appears as an independent CPU to
software, at least from a functional viewpoint. These modern hardware features can
greatly improve performance, as illustrated by Figure 3.2.
Achieving full performance with a CPU having a long pipeline requires highly
predictable control flow through the program. Suitable control flow can be provided
by a program that executes primarily in tight loops, for example, arithmetic on large
matrices or vectors. The CPU can then correctly predict that the branch at the end of
the loop will be taken in almost all cases, allowing the pipeline to be kept full and the
CPU to execute at full speed.
However, branch prediction is not always so easy. For example, consider a program
with many loops, each of which iterates a small but random number of times. For
another example, consider an old-school object-oriented program with many virtual
objects that can reference many different real objects, all with different implementations
for frequently invoked member functions, resulting in many calls through pointers. In
these cases, it is difficult or even impossible for the CPU to predict where the next
branch might lead. Then either the CPU must stall waiting for execution to proceed
far enough to be certain where that branch leads, or it must guess and then proceed
using speculative execution. Although guessing works extremely well for programs
with predictable control flow, for unpredictable branches (such as those in binary search)
the guesses will frequently be wrong. A wrong guess can be expensive because the
CPU must discard any speculatively executed instructions following the corresponding
branch, resulting in a pipeline flush. If pipeline flushes appear too frequently, they
drastically reduce overall performance, as fancifully depicted in Figure 3.3.
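The following sketch shows the kind of data-dependent branch at issue. The function name `count_at_least()` is hypothetical. The result is the same whatever the order of the input, but on many CPUs the running time can differ: If the array is sorted, the branch settles into long predictable runs, whereas random data makes it close to a coin flip. Note that this is only an illustration, because a compiler is free to replace the branch with branch-free code, in which case the difference disappears.

```c
#include <assert.h>
#include <stddef.h>

/* Count the elements of v[0..n-1] that are at least threshold. */
size_t count_at_least(const int *v, size_t n, int threshold)
{
	size_t count = 0;

	assert(v != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		/*
		 * Data-dependent branch: Easy to predict if v[] is sorted
		 * (a long run of "skip" followed by a long run of "count"),
		 * but hard to predict if v[] holds random values.
		 */
		if (v[i] >= threshold)
			count++;
	}
	return count;
}
```

By contrast, the loop-closing branch in this function is taken on every iteration but the last, which is the easily predicted case described earlier.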
This gets even worse in the increasingly common case of hyperthreading (or SMT, if
you prefer), especially on pipelined superscalar out-of-order CPUs featuring speculative
execution. In this case, all the hardware threads sharing a core also
share that core’s resources, including registers, cache, execution units, and so on. The
instructions are often decoded into micro-operations, and use of the shared execution
[Figure: block diagram of a core shared by two hardware threads: instructions from Thread 0 and Thread 1 feed decode and translate, a micro-op scheduler, registers (100s!), and execution units]
intel/microarchitectures/skylake_(server).
2 It is only fair to add that each of these single cycles lasted no less than 1.6 microseconds.
Although the large caches found on modern microprocessors can do quite a bit to help
combat memory-access latencies, these caches require highly predictable data-access
patterns to successfully hide those latencies. Unfortunately, common operations such as
traversing a linked list have extremely unpredictable memory-access patterns—after
all, if the pattern were predictable, we software types would not bother with the pointers,
right? Therefore, as shown in Figure 3.5, memory references often pose severe obstacles
to modern CPUs.
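The contrast between predictable and unpredictable access patterns can be sketched with two functions that compute the same sum. The names `sum_array()`, `sum_list()`, and `struct node` are hypothetical. In the array version, each address is a fixed stride from the previous one. In the list version, the address of the next node is not known until the current node has been loaded, so each cache miss must be resolved before the next access can even be issued.

```c
#include <assert.h>
#include <stddef.h>

struct node {
	long value;
	struct node *next;
};

/* Array traversal: addresses advance by a fixed, predictable stride. */
long sum_array(const long *v, size_t n)
{
	long sum = 0;

	assert(v != NULL || n == 0);
	for (size_t i = 0; i < n; i++)
		sum += v[i];
	return sum;
}

/*
 * List traversal: each load of p->next depends on the previous load,
 * and the nodes may be scattered anywhere in memory.
 */
long sum_list(const struct node *head)
{
	long sum = 0;

	for (const struct node *p = head; p != NULL; p = p->next)
		sum += p->value;
	return sum;
}
```

How large the difference is in practice depends on how the nodes were allocated and on the cache hierarchy of the system at hand, so it is best measured rather than assumed.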
Thus far, we have only been considering obstacles that can arise during a given CPU’s
execution of single-threaded code. Multi-threading presents additional obstacles to the
CPU, as described in the following sections.
that can sometimes hide cache latencies, the resulting effect on performance is all too
often as depicted in Figure 3.6.
Unfortunately, atomic operations usually apply only to single elements of data.
Because many parallel algorithms require that ordering constraints be maintained
between updates of multiple data elements, most CPUs provide memory barriers. These
memory barriers also serve as performance-sapping obstacles, as described in the next
section.
Quick Quiz 3.2: What types of machines would allow atomic operations on multiple data
elements?
If the CPU were not constrained to execute these statements in the order shown, the
effect would be that the variable “a” would be incremented without the protection of
“mylock”, which would certainly defeat the purpose of acquiring it. To prevent such
destructive reordering, locking primitives contain either explicit or implicit memory
barriers. Because the whole purpose of these memory barriers is to prevent reorderings
that the CPU would otherwise undertake in order to increase performance, memory
barriers almost always reduce performance, as depicted in Figure 3.7.
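A critical section of the kind described above might look like the following sketch. The use of a POSIX mutex is an assumption made for illustration, as is the function name `increment_a()`. Only the variable `a` and the lock `mylock` come from the discussion above.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

static pthread_mutex_t mylock = PTHREAD_MUTEX_INITIALIZER;
static long a;

/* Increment a under the protection of mylock and return the new value. */
long increment_a(void)
{
	long ret;

	if (pthread_mutex_lock(&mylock) != 0)
		abort();
	/*
	 * The lock and unlock operations provide the ordering that keeps
	 * this update from leaking out of the critical section.
	 */
	ret = ++a;
	if (pthread_mutex_unlock(&mylock) != 0)
		abort();
	return ret;
}
```

The ordering needed to keep the increment between the acquisition and the release is supplied by the locking primitives themselves, so the caller does not need to add explicit memory barriers.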
As with atomic operations, CPU designers have been working hard to reduce
memory-barrier overhead, and have made substantial progress.
[Figure: cartoon labeled “Memory Barrier”]
[Figure: cartoon labeled “CACHE-MISS TOLL BOOTH”]
3.2 Overheads
Don’t design bridges in ignorance of materials, and
don’t design low-level software in ignorance of the
underlying hardware.
Unknown
This section presents actual overheads of the obstacles to performance listed out in the
previous section. However, it is first necessary to get a rough view of hardware system
architecture, which is the subject of the next section.
[Figure: system hardware architecture (fragment): CPU 4 through CPU 7, each with its own cache, attached to interconnects]
1. CPU 0 checks its local cache, and does not find the cacheline. It therefore records
the write in its store buffer.
2. A request for this cacheline is forwarded to CPU 0’s and 1’s interconnect, which
checks CPU 1’s local cache, and does not find the cacheline.
3. This request is forwarded to the system interconnect, which checks with the other
three dies, learning that the cacheline is held by the die containing CPU 6 and 7.
4. This request is forwarded to CPU 6’s and 7’s interconnect, which checks both
CPUs’ caches, finding the value in CPU 7’s cache.
5. CPU 7 forwards the cacheline to its interconnect, and also flushes the cacheline
from its cache.
6. CPU 6’s and 7’s interconnect forwards the cacheline to the system interconnect.
7. The system interconnect forwards the cacheline to CPU 0’s and 1’s interconnect.
8. CPU 0’s and 1’s interconnect forwards the cacheline to CPU 0’s cache.
9. CPU 0 can now complete the write, updating the relevant portions of the newly
arrived cacheline from the value previously recorded in the store buffer.
Quick Quiz 3.4: This is a simplified sequence of events? How could it possibly be any more
complex?
Quick Quiz 3.5: Why is it necessary to flush the cacheline from CPU 7’s cache?
If the unlocked state is represented by zero and the locked state is represented by the
value one, then a CAS operation on the lock that specifies zero for the old value and one
for the new value will acquire the lock if it is not already held. The key point is that
there is only one access to the memory location, namely the CAS operation itself.
In contrast, a normal CAS operation’s old value is derived from some earlier load. For
example, to implement an atomic increment, the current value of that location is loaded
and that value is incremented to produce the new value. Then in the CAS operation, the
value actually loaded would be specified as the old value and the incremented value
as the new value. If the value had not been changed between the load and the CAS,
this would increment the memory location. However, if the value had in fact changed,
then the old value would not match, causing a miscompare that would result in the
CAS operation failing. The key point is that there are now two accesses to the memory
location, the load and the CAS.
Thus, it is not surprising that on-core blind CAS consumes only about seven
nanoseconds, while on-core CAS consumes about 18 nanoseconds. The non-blind
case’s extra load does not come for free. That said, the overheads of these operations are
similar to those of same-CPU CAS and lock, respectively.
Quick Quiz 3.6: Table 3.1 shows CPU 0 sharing a core with CPU 224. Shouldn’t that instead
be CPU 1???
Table 3.2: Cache Geometry for 8-Socket System With Intel Xeon Platinum 8176 CPUs
@ 2.10 GHz
A blind CAS involving CPUs in different cores but on the same socket consumes
almost fifty nanoseconds, or almost one hundred clock cycles. The code used for this
cache-miss measurement passes the cache line back and forth between a pair of CPUs,
so this cache miss is satisfied not from memory, but rather from the other CPU’s cache.
A non-blind CAS operation, which as noted earlier must look at the old value of the
variable as well as store a new value, consumes over one hundred nanoseconds, or
more than two hundred clock cycles. Think about this a bit. In the time required to
do one CAS operation, the CPU could have executed more than two hundred normal
instructions. This should demonstrate the limitations not only of fine-grained locking,
but of any other synchronization mechanism relying on fine-grained global agreement.
If the pair of CPUs are on different sockets, the operations are considerably more
expensive. A blind CAS operation consumes almost 150 nanoseconds, or more than three
hundred clock cycles. A normal CAS operation consumes more than 400 nanoseconds,
or almost one thousand clock cycles.
Worse yet, not all pairs of sockets are created equal. This particular system appears
to be constructed as a pair of four-socket components, with additional latency penalties
when the CPUs reside in different components. In this case, a blind CAS operation
consumes more than three hundred nanoseconds, or more than seven hundred clock
cycles. A CAS operation consumes almost a full microsecond, or almost two thousand
clock cycles.
Quick Quiz 3.7: Surely the hardware designers could be persuaded to improve this situation!
Why have they been content with such abysmal performance for these single-instruction
operations?
Quick Quiz 3.8: Table E.1 in the answer to Quick Quiz 3.7 on page 716 says that on-core CAS
is faster than both same-CPU CAS and on-core blind CAS. What is happening there?
can degrade energy efficiency and cache-miss latency, the ever-growing cache sizes on
production microprocessors attest to the power of this optimization.
A final hardware optimization is read-mostly replication, in which data that is
frequently read but rarely updated is present in all CPUs’ caches. This optimization
allows the read-mostly data to be accessed exceedingly efficiently, and is the subject of
Chapter 9.
In short, hardware and software engineers are really on the same side, with both trying
to make computers go fast despite the best efforts of the laws of physics, as fancifully
depicted in Figure 3.12 where our data stream is trying its best to exceed the speed of
light. The next section discusses some additional things that the hardware engineers
might (or might not) be able to do, depending on how well recent research translates to
practice. Software’s contribution to this noble goal is outlined in the remaining chapters
of this book.
The major reason that concurrency has been receiving so much focus over the past few
years is the end of Moore’s-Law induced single-threaded performance increases (or
“free lunch” [Sut08]), as shown in Figure 2.1 on page 12. This section briefly surveys a
few ways that hardware designers might bring back the “free lunch”.
However, the preceding section presented some substantial hardware obstacles to
exploiting concurrency. One severe physical limitation that hardware designers face is
the finite speed of light. As noted in Figure 3.11 on page 34, light can manage only
about an 8-centimeter round trip in a vacuum during a 1.8 GHz clock
period. This distance drops to about 3 centimeters for a 5 GHz clock. Both of these
distances are relatively small compared to the size of a modern computer system.
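These distances follow directly from dividing the speed of light by the clock frequency, as the following sketch shows. The function name `round_trip_cm()` is hypothetical, and the calculation assumes light in a vacuum.

```c
#include <assert.h>

#define SPEED_OF_LIGHT_M_PER_S 299792458.0	/* In a vacuum. */

/*
 * Farthest point, in centimeters, that light in a vacuum can reach and
 * return from during one clock period at the given frequency.
 */
double round_trip_cm(double clock_hz)
{
	double per_period_m;

	assert(clock_hz > 0.0);
	/* Total distance light covers in one clock period, in meters. */
	per_period_m = SPEED_OF_LIGHT_M_PER_S / clock_hz;
	/* Convert to centimeters and halve for the out-and-back trip. */
	return per_period_m * 100.0 / 2.0;
}
```

Here `round_trip_cm(1.8e9)` evaluates to roughly 8.3 centimeters and `round_trip_cm(5.0e9)` to roughly 3.0 centimeters, in agreement with the figures quoted above.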
To make matters even worse, electric waves in silicon move from three to thirty times
more slowly than does light in a vacuum, and common clocked logic constructs run still
more slowly. For example, a memory reference may need to wait for a local cache lookup
to complete before the request may be passed on to the rest of the system. Furthermore,
relatively low-speed and high-power drivers are required to move electrical signals
from one silicon die to another, for example, to communicate between a CPU and main
memory.
[Figure 3.13: a 3 cm die compared with a stack of 1.5 cm dies; labeled dimensions: 70 um, 3 cm, 1.5 cm]
Quick Quiz 3.10: But individual electrons don’t move anywhere near that fast, even in
conductors!!! The electron drift velocity in a conductor under semiconductor voltage levels is
on the order of only one millimeter per second. What gives???
There are nevertheless some technologies (both hardware and software) that might
help improve matters:
1. 3D integration,
3.3.1 3D Integration
3-dimensional integration (3DI) is the practice of bonding very thin silicon dies to
each other in a vertical stack. This practice provides potential benefits, but also poses
significant fabrication challenges [Kni08].
Perhaps the most important benefit of 3DI is decreased path length through the system,
as shown in Figure 3.13. A 3-centimeter silicon die is replaced with a stack of four
1.5-centimeter dies, in theory decreasing the maximum path through the system by a
factor of two, keeping in mind that each layer is quite thin. In addition, given proper
attention to design and placement, long horizontal electrical connections (which are
both slow and power hungry) can be replaced by short vertical electrical connections,
which are both faster and more power efficient.
However, delays due to levels of clocked logic will not be decreased by 3D integration,
and significant manufacturing, testing, power-supply, and heat-dissipation problems
must be solved for 3D integration to reach production while still delivering on its
promise. The heat-dissipation problems might be solved using semiconductors based
on diamond, which is a good conductor for heat, but an electrical insulator. That said, it
remains difficult to grow large single diamond crystals, to say nothing of slicing them
into wafers. In addition, it seems unlikely that any of these technologies will be able to
deliver the exponential increases to which some people have become accustomed. That
said, they may be necessary steps on the path to the late Jim Gray’s “smoking hairy golf
balls” [Gra02].
incrementing the loop counter, testing this counter, and branching back to the top of the
loop are in some sense wasted effort: The real goal is instead to multiply corresponding
elements of the two vectors. Therefore, a specialized piece of hardware designed
specifically to multiply vectors could get the job done more quickly and with less energy
consumed.
This is in fact the motivation for the vector instructions present in many commodity
microprocessors. Because these instructions operate on multiple data items simultane-
ously, they would permit a dot product to be computed with less instruction-decode and
loop overhead.
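The loop overhead in question can be seen in the following sketch, in which the function names are hypothetical. The first function pays for an increment, a test, and a branch on every multiplication. The second handles four elements per trip through the loop, imitating in portable C the effect of a four-element-wide vector instruction. It does not use actual vector instructions, and whether a given compiler turns either loop into vector instructions depends on the compiler, its options, and the target CPU.

```c
#include <assert.h>
#include <stddef.h>

/* Scalar dot product: one multiply-add per trip through the loop. */
double dot_scalar(const double *x, const double *y, size_t n)
{
	double sum = 0.0;

	assert((x != NULL && y != NULL) || n == 0);
	for (size_t i = 0; i < n; i++)	/* Increment, test, branch each time. */
		sum += x[i] * y[i];
	return sum;
}

/*
 * Four independent partial sums per trip through the loop, so that the
 * loop overhead is paid once per four elements instead of once per element.
 */
double dot_by_four(const double *x, const double *y, size_t n)
{
	double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
	size_t i;

	assert((x != NULL && y != NULL) || n == 0);
	for (i = 0; i + 4 <= n; i += 4) {
		s0 += x[i] * y[i];
		s1 += x[i + 1] * y[i + 1];
		s2 += x[i + 2] * y[i + 2];
		s3 += x[i + 3] * y[i + 3];
	}
	for (; i < n; i++)	/* Leftover elements, if n is not a multiple of 4. */
		s0 += x[i] * y[i];
	return (s0 + s1) + (s2 + s3);
}
```

Because the two functions add the products in different orders, their floating-point results can differ in the low-order bits for general inputs, which is one reason compilers do not always make this transformation on their own.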
Similarly, specialized hardware can more efficiently encrypt and decrypt, compress
and decompress, encode and decode, and perform many other tasks besides. Unfortunately, this
efficiency does not come for free. A computer system incorporating this specialized
hardware will contain more transistors, which will consume some power even when not
in use. Software must be modified to take advantage of this specialized hardware, and
this specialized hardware must be sufficiently generally useful that the high up-front
hardware-design costs can be spread over enough users to make the specialized hardware
affordable. In part due to these sorts of economic considerations, specialized hardware
has thus far appeared only for a few application areas, including graphics processing
(GPUs), vector processors (MMX, SSE, and VMX instructions), and, to a lesser extent,
encryption. And even in these areas, it is not always easy to realize the expected
performance gains, for example, due to thermal throttling [Kra17, Lem18, Dow20].
Unlike the server and PC arena, smartphones have long used a wide variety of
hardware accelerators. These hardware accelerators are often used for media decoding,
so much so that a high-end MP3 player might be able to play audio for several minutes—
with its CPU fully powered off the entire time. The purpose of these accelerators is
to improve energy efficiency and thus extend battery life: Special purpose hardware
can often compute more efficiently than can a general-purpose CPU. This is another
example of the principle called out in Section 2.2.3: Generality is almost never free.
Nevertheless, given the end of Moore’s-Law-induced single-threaded performance
increases, it seems safe to assume that increasing varieties of special-purpose hardware
will appear.
Although multicore CPUs seem to have taken the computing industry by surprise, the
fact remains that shared-memory parallel computer systems have been commercially
available for more than a quarter century. This is more than enough time for significant
parallel software to make its appearance, and it indeed has. Parallel operating systems
are quite commonplace, as are parallel threading libraries, parallel relational database
management systems, and parallel numerical software. Use of existing parallel software
can go a long way towards solving any parallel-software crisis we might encounter.
Perhaps the most common example is the parallel relational database management
system. It is not unusual for single-threaded programs, often written in high-level
scripting languages, to access a central relational database concurrently. In the resulting
highly parallel system, only the database need actually deal directly with parallelism. A
very nice trick when it works!
3.4 Software Design Implications
The values of the ratios in Table 3.1 are critically important, as they limit the efficiency
of a given parallel application. To see this, suppose that the parallel application uses
CAS operations to communicate among threads. These CAS operations will typically
involve a cache miss, that is, assuming that the threads are communicating primarily
with each other rather than with themselves. Suppose further that the unit of work
corresponding to each CAS communication operation takes 300 ns, which is sufficient
time to compute several floating-point transcendental functions. Then about half of the
execution time will be consumed by the CAS communication operations! This in turn
means that a two-CPU system running such a parallel program would run no faster than
a sequential implementation running on a single CPU.
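The arithmetic behind this claim can be sketched in a few lines of C. The function names are hypothetical, and the 300 ns cost of a CAS cache miss is an assumption chosen to match the "about half" in the text rather than a value taken from Table 3.1. The sketch also assumes that the work divides perfectly among the CPUs.

```c
#include <assert.h>
#include <stdio.h>

/* Fraction of run time spent on useful work, given per-unit costs in ns. */
double work_fraction(double work_ns, double comm_ns)
{
	assert(work_ns > 0.0 && comm_ns >= 0.0);
	return work_ns / (work_ns + comm_ns);
}

/* Speedup over a sequential run that pays no communication cost. */
double effective_speedup(int ncpus, double work_ns, double comm_ns)
{
	return ncpus * work_fraction(work_ns, comm_ns);
}

/* Print the two-CPU case from the text. */
void print_cas_example(void)
{
	const double work_ns = 300.0;
	const double cas_ns = 300.0;	/* ASSUMPTION: cost of one CAS cache miss */

	printf("useful fraction: %.2f\n", work_fraction(work_ns, cas_ns));
	printf("2-CPU speedup: %.2f\n", effective_speedup(2, work_ns, cas_ns));
}
```

With equal work and communication costs, half of each CPU's time is overhead, so two CPUs deliver the throughput of one.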
The situation is even worse in the distributed-system case, where a single
communications operation might take as long as thousands or even millions
of floating-point operations. This illustrates how important it is for communications
operations to be extremely infrequent and to enable very large quantities of processing.
Quick Quiz 3.11: Given that distributed-systems communication is so horribly expensive,
why does anyone bother with such systems?
The lesson should be quite clear: Parallel algorithms must be explicitly designed with
these hardware properties firmly in mind. One approach is to run nearly independent
threads. The less frequently the threads communicate, whether by atomic operations,
locks, or explicit messages, the better the application’s performance and scalability will
be. This approach will be touched on in Chapter 5, explored in Chapter 6, and taken to
its logical extreme in Chapter 8.
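As a minimal sketch of nearly independent threads, the following code sums an array by giving each thread a private slice. The names parallel_sum() and struct slice are hypothetical, and the choice of four threads is arbitrary. Each thread communicates exactly once, by storing its partial sum just before it exits.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define NTHREADS 4

struct slice {
	const int *base;	/* first element of this thread's private slice */
	size_t n;		/* number of elements in the slice */
	long sum;		/* written only by the owning thread */
};

static void *sum_slice(void *arg)
{
	struct slice *s = arg;
	long sum = 0;

	for (size_t i = 0; i < s->n; i++)
		sum += s->base[i];
	s->sum = sum;		/* read by the parent only after pthread_join() */
	return NULL;
}

long parallel_sum(const int *data, size_t n)
{
	pthread_t tid[NTHREADS];
	struct slice s[NTHREADS];
	size_t chunk = n / NTHREADS;
	long total = 0;
	int ret;

	for (int i = 0; i < NTHREADS; i++) {
		s[i].base = data + i * chunk;
		/* The last thread also takes any leftover elements. */
		s[i].n = (i == NTHREADS - 1) ? n - i * chunk : chunk;
		s[i].sum = 0;
		ret = pthread_create(&tid[i], NULL, sum_slice, &s[i]);
		assert(ret == 0);
	}
	for (int i = 0; i < NTHREADS; i++) {
		ret = pthread_join(tid[i], NULL);
		assert(ret == 0);
		total += s[i].sum;
	}
	return total;
}
```

Accumulating into a local variable keeps the inner loop free of stores to memory that other threads might touch. On most systems this builds with cc -pthread.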
Another approach is to make sure that any sharing be read-mostly, which allows the
CPUs’ caches to replicate the read-mostly data, in turn allowing all CPUs fast access.
This approach is touched on in Section 5.2.4, and explored more deeply in Chapter 9.
In short, achieving excellent parallel performance and scalability means striving for
embarrassingly parallel algorithms and implementations, whether by careful choice of
data structures and algorithms, use of existing parallel applications and environments,
or transforming the problem into an embarrassingly parallel form.
Quick Quiz 3.12: OK, if we are going to have to apply distributed-programming techniques to
shared-memory parallel programs, why not just always use these distributed techniques and
dispense with shared memory?
1. The good news is that multicore systems are inexpensive and readily available.
2. More good news: The overhead of many synchronization operations is much lower
than it was on parallel systems from the early 2000s.
3. The bad news is that the overhead of cache misses is still high, especially on large
systems.
The remainder of this book describes ways of handling this bad news.
In particular, Chapter 4 will cover some of the low-level tools used for parallel
programming, Chapter 5 will investigate problems and solutions to parallel counting,
and Chapter 6 will discuss design disciplines that promote performance and scalability.
You are only as good as your tools, and your tools
are only as good as you are.
Unknown
Chapter 4
Tools of the Trade
This chapter provides a brief introduction to some basic tools of the parallel-programming
trade, focusing mainly on those available to user applications running on operating
systems similar to Linux. Section 4.1 begins with scripting languages, Section 4.2
describes the multi-process parallelism supported by the POSIX API and touches on
POSIX threads, Section 4.3 presents analogous operations in other environments, and
finally, Section 4.4 helps to choose the tool that will get the job done.
Quick Quiz 4.1: You call these tools??? They look more like low-level synchronization
primitives to me!
Please note that this chapter provides but a brief introduction. More detail is available
from the references (and from the Internet), and more information will be provided in
later chapters.
The Linux shell scripting languages provide simple but effective ways of managing
parallelism. For example, suppose that you had a program compute_it that you needed
to run twice with two different sets of arguments. This can be accomplished using
UNIX shell scripting as follows:
1 compute_it 1 > compute_it.1.out &
2 compute_it 2 > compute_it.2.out &
3 wait
4 cat compute_it.1.out
5 cat compute_it.2.out
Lines 1 and 2 launch two instances of this program, redirecting their output to two
separate files, with the & character directing the shell to run the two instances of the
program in the background. Line 3 waits for both instances to complete, and lines 4
and 5 display their output. The resulting execution is as shown in Figure 4.1: The two
instances of compute_it execute in parallel, wait completes after both of them do,
and then the two instances of cat execute sequentially.
[Figure 4.1: the two compute_it instances run in parallel, followed in sequence by wait, cat compute_it.1.out, and cat compute_it.2.out]
Quick Quiz 4.2: But this silly shell script isn’t a real parallel program! Why bother with such
trivia???
Quick Quiz 4.3: Is there a simpler way to create a parallel shell script? If so, how? If not,
why not?
For another example, the make software-build scripting language provides a -j option
that specifies how much parallelism should be introduced into the build process. Thus,
typing make -j4 when building a Linux kernel specifies that up to four build steps be
executed concurrently.
It is hoped that these simple examples convince you that parallel programming need
not always be complex or difficult.
Quick Quiz 4.4: But if script-based parallel programming is so easy, why bother with anything
else?
This section scratches the surface of the POSIX environment, including pthreads [Ope97],
as this environment is readily available and widely implemented. Section 4.2.1 provides
a glimpse of the POSIX fork() and related primitives, Section 4.2.2 touches on thread
creation and destruction, Section 4.2.3 gives a brief overview of POSIX locking, and,
finally, Section 4.2.4 describes a specific lock which can be used for data that is read by
many threads and only occasionally updated.
executing a fork() primitive is said to be the “parent” of the newly created process.
A parent may wait on its children using the wait() primitive.
Please note that the examples in this section are quite simple. Real-world applications
using these primitives might need to manipulate signals, file descriptors, shared memory
segments, and any number of other resources. In addition, some applications need to
take specific actions if a given child terminates, and might also need to be concerned
with the reason that the child terminated. These issues can of course add substantial
complexity to the code. For more information, see any of a number of textbooks on the
subject [Ste92, Wei13].
If fork() succeeds, it returns twice, once for the parent and again for the child.
The value returned from fork() allows the caller to tell the difference, as shown in
Listing 4.1 (forkjoin.c). Line 1 executes the fork() primitive, and saves its return
value in local variable pid. Line 2 checks to see if pid is zero, in which case, this is the
child, which continues on to execute line 3. As noted earlier, the child may terminate
via the exit() primitive. Otherwise, this is the parent, which checks for an error return
from the fork() primitive on line 4, and prints an error and exits on lines 5–7 if so.
Otherwise, the fork() has executed successfully, and the parent therefore executes
line 9 with the variable pid containing the process ID of the child.
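The three-way outcome of fork() can be sketched as follows. This is not Listing 4.1, so the line numbers cited above do not apply to it. The function name fork_and_reap() is hypothetical. The sketch uses waitpid() to reap the specific child and _exit() in the child so that the child does not flush stdio buffers inherited from the parent.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Fork a child that exits with child_code, wait for it in the parent,
 * and return the child's exit status, or -1 on abnormal termination.
 */
int fork_and_reap(int child_code)
{
	int status;
	pid_t pid;

	assert(child_code >= 0 && child_code <= 255);
	pid = fork();
	if (pid == 0) {
		/* Child: fork() returned zero. */
		_exit(child_code);
	}
	if (pid < 0) {
		/* Parent: fork() failed, so there is no child. */
		perror("fork");
		exit(EXIT_FAILURE);
	}
	/* Parent: pid contains the process ID of the child. */
	if (waitpid(pid, &status, 0) != pid) {
		perror("waitpid");
		exit(EXIT_FAILURE);
	}
	return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

A caller might use fork_and_reap(7) to confirm that the exit code chosen by the child arrives intact in the parent.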
The parent process may use the wait() primitive to wait for its children to complete.
However, use of this primitive is a bit more complicated than its shell-script counterpart,
as each invocation of wait() waits for but one child process. It is therefore customary
to wrap wait() into a function similar to the waitall() function shown in Listing 4.2
(api-pthreads.h), with this waitall() function having semantics similar to the
shell-script wait command. Each pass through the loop spanning lines 6–14 waits on
one child process. Line 7 invokes the wait() primitive, which blocks until a child
process exits, and returns that child’s process ID. If the process ID is instead −1, this
indicates that the wait() primitive was unable to wait on a child. If so, line 9 checks
for the ECHILD errno, which indicates that there are no more child processes, so that
line 10 exits the loop. Otherwise, lines 11 and 12 print an error and exit.
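A loop of the kind just described might look as follows. This is a sketch in the spirit of waitall() rather than a copy of Listing 4.2, so its line numbers differ. The names waitall_count() and spawn_and_waitall() are hypothetical, and returning the number of children reaped is an addition made here for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Wait for all of this process's children; return how many were reaped. */
int waitall_count(void)
{
	int nreaped = 0;
	int status;

	for (;;) {
		pid_t pid = wait(&status);	/* blocks until one child exits */

		if (pid == -1) {
			if (errno == ECHILD)
				break;		/* no more children */
			perror("wait");
			exit(EXIT_FAILURE);
		}
		nreaped++;
	}
	return nreaped;
}

/* Fork nchildren trivial children, then reap them all. */
int spawn_and_waitall(int nchildren)
{
	assert(nchildren >= 0);
	for (int i = 0; i < nchildren; i++) {
		pid_t pid = fork();

		if (pid == 0)
			_exit(0);		/* child has nothing to do */
		if (pid < 0) {
			perror("fork");
			exit(EXIT_FAILURE);
		}
	}
	return waitall_count();
}
```

Calling waitall_count() when no children exist returns zero immediately, because the very first wait() fails with ECHILD.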
Quick Quiz 4.5: Why does this wait() primitive need to be so complicated? Why not just
make it work like the shell-script wait does?
It is critically important to note that the parent and child do not share memory. This
is illustrated by the program shown in Listing 4.3 (forkjoinvar.c), in which the child
sets a global variable x to 1 on line 9, prints a message on line 10, and exits on line 11.
The parent continues at line 20, where it waits on the child, and on line 21 finds that its
copy of the variable x is still zero. The output is thus as follows:
Quick Quiz 4.6: Isn’t there a lot more to fork() and wait() than discussed here?
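The lack of shared memory between parent and child can be checked with a short sketch. This is not Listing 4.3: the function name parent_view_of_x() is hypothetical, and the child's message is omitted for brevity.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int x;	/* global variable; after fork() each process has its own copy */

/* Return the parent's value of x after the child has set its copy to 1. */
int parent_view_of_x(void)
{
	pid_t pid;
	pid_t reaped;

	x = 0;
	pid = fork();
	if (pid == 0) {
		x = 1;		/* changes only the child's copy */
		_exit(0);
	}
	if (pid < 0) {
		perror("fork");
		exit(EXIT_FAILURE);
	}
	reaped = waitpid(pid, NULL, 0);	/* child has finished its store */
	assert(reaped == pid);
	(void)reaped;
	return x;		/* the parent's copy was never modified */
}
```

Even though the parent reads x only after the child has exited, it still sees zero.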
Quick Quiz 4.7: If the mythread() function in Listing 4.4 can simply return, why bother
with pthread_exit()?
Note that this program carefully makes sure that only one of the threads stores a value
to variable x at a time. Any situation in which one thread might be storing a value to a
given variable while some other thread either loads from or stores to that same variable
is termed a data race. Because the C language makes no guarantee that the results of
a data race will be in any way reasonable, we need some way of safely accessing and
modifying data concurrently, such as the locking primitives discussed in the following
section.
But your data races are benign, you say? Well, maybe they are. But please do
everyone (yourself included) a big favor and read Section 4.3.4.1 very carefully. As
compilers optimize more and more aggressively, there are fewer and fewer truly benign
data races.
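One way to avoid a data race without locks is to rely on the ordering that thread creation and joining provide. The following sketch uses hypothetical function names and is not Listing 4.4. It has the child thread store to x only between pthread_create() and pthread_join(), while the parent touches x only outside that window.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

static int x;	/* shared by all threads in the process */

static void *child_thread(void *arg)
{
	(void)arg;
	x = 1;		/* parent does not access x while this thread runs */
	return NULL;
}

/* Return the value of x that the parent sees after joining the child. */
int thread_view_of_x(void)
{
	pthread_t tid;
	int ret;

	x = 0;		/* child does not exist yet, so no concurrent access */
	ret = pthread_create(&tid, NULL, child_thread, NULL);
	assert(ret == 0);
	ret = pthread_join(tid, NULL);
	assert(ret == 0);
	(void)ret;
	return x;	/* child has terminated, so again no concurrent access */
}
```

Unlike the fork() case, the parent does see the child's store, because threads within a process share memory.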
Quick Quiz 4.8: If the C language makes no guarantees in presence of a data race, then why
does the Linux kernel have so many data races? Are you trying to tell me that the Linux kernel
is completely broken???
This exclusive-locking property is demonstrated using the code shown in Listing 4.5
(lock.c). Line 1 defines and initializes a POSIX lock named lock_a, while line 2
similarly defines and initializes a lock named lock_b. Line 4 defines and initializes a
shared variable x.
Lines 6–33 define a function lock_reader() which repeatedly reads the shared
variable x while holding the lock specified by arg. Line 12 casts arg to a pointer
to a pthread_mutex_t, as required by the pthread_mutex_lock() and pthread_
mutex_unlock() primitives.
Quick Quiz 4.10: Why not simply make the argument to lock_reader() on line 6 of
Listing 4.5 be a pointer to a pthread_mutex_t?
Quick Quiz 4.11: What is the READ_ONCE() on lines 20 and 47 and the WRITE_ONCE() on
line 47 of Listing 4.5?
Lines 14–18 acquire the specified pthread_mutex_t, checking for errors and exiting
the program if any occur. Lines 19–26 repeatedly check the value of x, printing the
new value each time that it changes. Line 25 sleeps for one millisecond, which allows
this demonstration to run nicely on a uniprocessor machine. Lines 27–31 release
the pthread_mutex_t, again checking for errors and exiting the program if any
occur. Finally, line 32 returns NULL, again to match the function type required by
pthread_create().
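The acquire, access, and release pattern with error checking can be sketched as follows. This is a simplified stand-in for Listing 4.5, not a copy of it: the functions read_x_locked() and add_to_x_locked() are hypothetical, and they take the mutex pointer directly instead of through a void pointer. Note that the pthread_mutex_*() primitives return an error number rather than setting errno.

```c
#include <assert.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

pthread_mutex_t lock_a = PTHREAD_MUTEX_INITIALIZER;
int x;	/* shared variable, accessed only while holding a lock */

/* Return the value of x, read while holding the mutex at pmlp. */
int read_x_locked(pthread_mutex_t *pmlp)
{
	int en;
	int val;

	assert(pmlp != NULL);
	en = pthread_mutex_lock(pmlp);
	if (en != 0) {
		fprintf(stderr, "pthread_mutex_lock: %s\n", strerror(en));
		exit(EXIT_FAILURE);
	}
	val = x;
	en = pthread_mutex_unlock(pmlp);
	if (en != 0) {
		fprintf(stderr, "pthread_mutex_unlock: %s\n", strerror(en));
		exit(EXIT_FAILURE);
	}
	return val;
}

/* Add delta to x while holding the mutex at pmlp. */
void add_to_x_locked(pthread_mutex_t *pmlp, int delta)
{
	int en;

	assert(pmlp != NULL);
	en = pthread_mutex_lock(pmlp);
	if (en != 0) {
		fprintf(stderr, "pthread_mutex_lock: %s\n", strerror(en));
		exit(EXIT_FAILURE);
	}
	x += delta;
	en = pthread_mutex_unlock(pmlp);
	if (en != 0) {
		fprintf(stderr, "pthread_mutex_unlock: %s\n", strerror(en));
		exit(EXIT_FAILURE);
	}
}
```

Exclusion holds only if every access to x goes through the same mutex.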
Quick Quiz 4.12: Writing four lines of code for each acquisition and release of a pthread_
mutex_t sure seems painful! Isn’t there a better way?
Lines 35–56 of Listing 4.5 show lock_writer(), which periodically updates the
shared variable x while holding the specified pthread_mutex_t. As with lock_
reader(), line 39 casts arg to a pointer to pthread_mutex_t, lines 41–45 acquire
the specified lock, and lines 50–54 release it. While holding the lock, lines 46–49
increment the shared variable x, sleeping for five milliseconds between each increment.
Listing 4.6 shows a code fragment that runs lock_reader() and lock_writer() as
threads using the same lock, namely, lock_a. Lines 2–6 create a thread running lock_
reader(), and then lines 7–11 create a thread running lock_writer(). Lines 12–19
wait for both threads to complete. The output of this code fragment is as follows:
Because both threads are using the same lock, the lock_reader() thread cannot
see any of the intermediate values of x produced by lock_writer() while holding the
lock.
Quick Quiz 4.13: Is “x = 0” the only possible output from the code fragment shown in
Listing 4.6? If so, why? If not, what other output could appear, and why?
Listing 4.7 shows a similar code fragment, but this time using different locks: lock_
a for lock_reader() and lock_b for lock_writer(). The output of this code
fragment is as follows:
Creating two threads w/different locks:
lock_reader(): x = 0
lock_reader(): x = 1
lock_reader(): x = 2
lock_reader(): x = 3
Because the two threads are using different locks, they do not exclude each other, and
can run concurrently. The lock_reader() function can therefore see the intermediate
values of x stored by lock_writer().
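That different locks do not exclude each other can be shown deterministically with pthread_mutex_trylock(). The following is a sketch with hypothetical names (struct probe, probe_while_holding_a()) and is not Listing 4.7. The parent holds lock_a while a second thread tries both locks.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stddef.h>

static pthread_mutex_t lock_a = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t lock_b = PTHREAD_MUTEX_INITIALIZER;

struct probe {
	int ret_a;	/* result of trying lock_a, which the parent holds */
	int ret_b;	/* result of trying lock_b, which nobody holds */
};

static void *probe_locks(void *arg)
{
	struct probe *p = arg;

	p->ret_a = pthread_mutex_trylock(&lock_a);
	if (p->ret_a == 0)
		pthread_mutex_unlock(&lock_a);
	p->ret_b = pthread_mutex_trylock(&lock_b);
	if (p->ret_b == 0)
		pthread_mutex_unlock(&lock_b);
	return NULL;
}

/* Hold lock_a in the caller while another thread probes both locks. */
struct probe probe_while_holding_a(void)
{
	struct probe p = { -1, -1 };
	pthread_t tid;
	int ret;

	ret = pthread_mutex_lock(&lock_a);
	assert(ret == 0);
	ret = pthread_create(&tid, NULL, probe_locks, &p);
	assert(ret == 0);
	ret = pthread_join(tid, NULL);
	assert(ret == 0);
	ret = pthread_mutex_unlock(&lock_a);
	assert(ret == 0);
	(void)ret;
	return p;
}
```

The probing thread is refused lock_a with EBUSY but acquires lock_b without delay, which is why a writer holding lock_b can run concurrently with a reader holding lock_a.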
Quick Quiz 4.14: Using different locks could cause quite a bit of confusion, what with
threads seeing each others’ intermediate states. So should well-written parallel programs restrict
themselves to using a single lock in order to avoid this kind of confusion?
Quick Quiz 4.15: In the code shown in Listing 4.7, is lock_reader() guaranteed to see all
the values produced by lock_writer()? Why or why not?
Quick Quiz 4.16: Wait a minute here!!! Listing 4.6 didn’t initialize shared variable x, so why
does it need to be initialized in Listing 4.7?
Although there is quite a bit more to POSIX exclusive locking, these primitives
provide a good start and are in fact sufficient in a great many situations. The next section
takes a brief look at POSIX reader-writer locking.
[Figure 4.2: log-scale plot against Number of CPUs (Threads), 0 to 450, showing an ideal line and traces labeled 1us, 10us, 100us, and 10000us]
Lines 7–10 define goflag, which synchronizes the start and the end of the test. This
variable is initially set to GOFLAG_INIT, then set to GOFLAG_RUN after all the reader
threads have started, and finally set to GOFLAG_STOP to terminate the test run.
Lines 12–44 define reader(), which is the reader thread. Line 19 atomically
increments the nreadersrunning variable to indicate that this thread is now running,
and lines 20–22 wait for the test to start. The READ_ONCE() primitive forces the
compiler to fetch goflag on each pass through the loop—the compiler would otherwise
be within its rights to assume that the value of goflag would never change.
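The wait-for-start loop can be sketched as follows. The READ_ONCE() and WRITE_ONCE() definitions here are assumptions: they are volatile casts in the style of the Linux kernel, and they rely on the GCC and Clang __typeof__ extension. The function names are hypothetical, and GOFLAG_STOP and nreadersrunning are omitted to keep the sketch short.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* ASSUMPTION: volatile-cast definitions, not necessarily the book's own. */
#define READ_ONCE(x) (*(volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, val) \
	do { *(volatile __typeof__(x) *)&(x) = (val); } while (0)

#define GOFLAG_INIT 0
#define GOFLAG_RUN  1

static int goflag = GOFLAG_INIT;
static int observed = -1;	/* written by waiter, read after join */

static void *waiter(void *arg)
{
	(void)arg;
	/* The volatile access makes the compiler reload goflag each pass. */
	while (READ_ONCE(goflag) == GOFLAG_INIT)
		continue;
	observed = READ_ONCE(goflag);
	return NULL;
}

/* Start a waiter, release it, and return the flag value it observed. */
int start_and_release_waiter(void)
{
	pthread_t tid;
	int ret;

	WRITE_ONCE(goflag, GOFLAG_INIT);
	ret = pthread_create(&tid, NULL, waiter, NULL);
	assert(ret == 0);
	WRITE_ONCE(goflag, GOFLAG_RUN);
	ret = pthread_join(tid, NULL);
	assert(ret == 0);
	(void)ret;
	return observed;
}
```

Had the loop used a plain load of goflag, the compiler could legitimately have loaded it once and spun forever.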
Quick Quiz 4.17: Instead of using READ_ONCE() everywhere, why not just declare goflag
as volatile on line 10 of Listing 4.8?
Quick Quiz 4.18: READ_ONCE() only affects the compiler, not the CPU. Don’t we also need
memory barriers to make sure that the change in goflag’s value propagates to the CPU in a
timely fashion in Listing 4.8?
Quick Quiz 4.19: Would it ever be necessary to use READ_ONCE() when accessing a per-thread
variable, for example, a variable declared using GCC’s __thread storage class?
The loop spanning lines 23–41 carries out the performance test. Lines 24–28 acquire
the lock, lines 29–31 hold the lock for the specified number of microseconds, lines 32–36
release the lock, and lines 37–39 wait for the specified number of microseconds before
re-acquiring the lock. Line 40 counts this lock acquisition.
Line 42 moves the lock-acquisition count to this thread’s element of the
readcounts[] array, and line 43 returns, terminating this thread.
Figure 4.2 shows the results of running this test on a 224-core Xeon system with two
hardware threads per core for a total of 448 software-visible CPUs. The thinktime
parameter was zero for all these tests, and the holdtime parameter set to values ranging
from one microsecond (“1us” on the graph) to 10,000 microseconds (“10000us” on the
Quick Quiz 4.21: But one microsecond is not a particularly small size for a critical section.
What do I do if I need a much smaller critical section, for example, one containing only a few
instructions?
Quick Quiz 4.22: The system used is a few years old, and new hardware should be faster. So
why should anyone worry about reader-writer locks being slow?
Despite these limitations, reader-writer locking is quite useful in many cases, for
example when the readers must do high-latency file or network I/O. There are alternatives,
some of which will be presented in Chapters 5 and 9.
arguments, all the atomic operations are fully ordered, and the arguments permit weaker
orderings. For example, “atomic_load_explicit(&a, memory_order_relaxed)”
is vaguely similar to the Linux kernel’s “READ_ONCE()”.1
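For concreteness, here is a minimal sketch of C11 atomics using the _explicit() forms with memory_order_relaxed. The counter name a follows the text, while NTHREADS, NINCS, and the function names are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

#define NTHREADS 4
#define NINCS    10000

static atomic_long a;

static void *incrementer(void *arg)
{
	(void)arg;
	/* Relaxed ordering: atomic, but no ordering of other accesses. */
	for (int i = 0; i < NINCS; i++)
		atomic_fetch_add_explicit(&a, 1, memory_order_relaxed);
	return NULL;
}

/* Increment a from several threads and return the final value. */
long count_with_relaxed_atomics(void)
{
	pthread_t tid[NTHREADS];
	int ret;

	atomic_store_explicit(&a, 0, memory_order_relaxed);
	for (int i = 0; i < NTHREADS; i++) {
		ret = pthread_create(&tid[i], NULL, incrementer, NULL);
		assert(ret == 0);
	}
	for (int i = 0; i < NTHREADS; i++) {
		ret = pthread_join(tid[i], NULL);
		assert(ret == 0);
	}
	(void)ret;
	return atomic_load_explicit(&a, memory_order_relaxed);
}
```

No increments are lost even with relaxed ordering, because atomicity and ordering are separate properties. The pthread_join() calls supply the ordering needed for the final load.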
“fiber”, “event”, “execution agent”, and so on. Similar design principles apply to all of them.
5 How is that for a circular definition?
smp_thread_id()
Because the thread_id_t returned from create_thread() is system-dependent,
the smp_thread_id() primitive returns a thread index corresponding to the thread
making the request. This index is guaranteed to be less than the maximum number
of threads that have been in existence since the program started, and is therefore
useful for bitmasks, array indices, and the like.
for_each_thread()
The for_each_thread() macro loops through all threads that exist, including all
threads that would exist if created. This macro is useful for handling the per-thread
variables introduced in Section 4.2.8.
for_each_running_thread()
The for_each_running_thread() macro loops through only those threads that
currently exist. It is the caller’s responsibility to synchronize with thread creation
and deletion if required.
wait_thread()
The wait_thread() primitive waits for completion of the thread specified by the
thread_id_t passed to it. This in no way interferes with the execution of the
specified thread; instead, it merely waits for it. Note that wait_thread() returns
the value that was returned by the corresponding thread.
wait_all_threads()
The wait_all_threads() primitive waits for completion of all currently running
threads. It is the caller’s responsibility to synchronize with thread creation and
deletion if required. However, this primitive is normally used to clean up at the
end of a run, so such synchronization is normally not needed.
Quick Quiz 4.26: What happened to the Linux-kernel equivalents to fork() and wait()?
4.3.3 Locking
A good starting subset of the Linux kernel’s locking API is shown in Listing 4.13,
each API element being described in the following section. This book’s CodeSamples
locking API closely follows that of the Linux kernel.
until the spinlock becomes available. In some environments, such as pthreads, this
waiting will involve blocking, while in others, such as the Linux kernel, it might
involve a CPU-bound spin loop.
The key point is that only one thread may hold a spinlock at any given time.
spin_trylock()
The spin_trylock() primitive acquires the specified spinlock, but only if it is
immediately available. It returns true if it was able to acquire the spinlock and
false otherwise.
spin_unlock()
The spin_unlock() primitive releases the specified spinlock, allowing other
threads to acquire it.
spin_lock(&mutex);
counter++;
spin_unlock(&mutex);
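To try the fragment above in user space, one possibility is to layer the three primitives on pthread_mutex_t, in which case waiting blocks rather than spins. The definitions below are an assumption made for illustration and are not necessarily those of the CodeSamples directory or of the Linux kernel. The function inc_counter() is hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* ASSUMPTION: user-space stand-ins built on pthread mutexes. */
typedef pthread_mutex_t spinlock_t;

void spin_lock(spinlock_t *sp)
{
	int ret = pthread_mutex_lock(sp);

	assert(ret == 0);
	(void)ret;
}

/* Return true if the lock was acquired, false if it was already held. */
bool spin_trylock(spinlock_t *sp)
{
	return pthread_mutex_trylock(sp) == 0;
}

void spin_unlock(spinlock_t *sp)
{
	int ret = pthread_mutex_unlock(sp);

	assert(ret == 0);
	(void)ret;
}

spinlock_t mutex = PTHREAD_MUTEX_INITIALIZER;
long counter;

/* Increment counter under the protection of mutex. */
void inc_counter(void)
{
	spin_lock(&mutex);
	counter++;
	spin_unlock(&mutex);
}
```

Because only one thread may hold the lock at a time, the load, add, and store that make up counter++ cannot interleave with those of another thread.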
Quick Quiz 4.27: What problems could occur if the variable counter were incremented
without the protection of mutex?
Quick Quiz 4.28: What is wrong with loading Listing 4.14’s global_ptr up to three times?
Load tearing occurs when the compiler uses multiple load instructions for a single
access. For example, the compiler could in theory compile the load from global_
ptr (see line 1 of Listing 4.14) as a series of one-byte loads. If some other thread
was concurrently setting global_ptr to NULL, the result might have all but one
byte of the pointer set to zero, thus forming a “wild pointer”. Stores using such
a wild pointer could corrupt arbitrary regions of memory, resulting in rare and
difficult-to-debug crashes.
Worse yet, on (say) an 8-bit system with 16-bit pointers, the compiler might have
no choice but to use a pair of 8-bit instructions to access a given pointer. Because
the C standard must support all manner of systems, the standard cannot rule out
load tearing in the general case.
Store tearing occurs when the compiler uses multiple store instructions for a single
access. For example, one thread might store 0x12345678 to a four-byte integer
variable at the same time another thread stored 0xabcdef00. If the compiler
used 16-bit stores for either access, the result might well be 0x1234ef00, which
could come as quite a surprise to code loading from this integer. Nor is this a
strictly theoretical issue. For example, there are CPUs that feature small immediate
instruction fields, and on such CPUs, the compiler might split a 64-bit store
into two 32-bit stores in order to reduce the overhead of explicitly forming the
64-bit constant in a register, even on a 64-bit CPU. There are historical reports
of this actually happening in the wild (e.g. [KM13]), but there is also a recent
report [Dea19].7
Of course, the compiler simply has no choice but to tear some stores in the general
case, given the possibility of code using 64-bit integers running on a 32-bit system.
6 That is, normal loads and stores instead of C11 atomics, inline assembly, or volatile
accesses.
7 Note that this tearing can happen even on properly aligned and machine-word-sized
accesses, and in this particular case, even for volatile stores. Some might argue that this
behavior constitutes a bug in the compiler, but either way it illustrates the perceived value of
store tearing from a compiler-writer viewpoint.
But for properly aligned machine-sized stores, WRITE_ONCE() will prevent store
tearing.
Load fusing occurs when the compiler uses the result of a prior load from a given
variable instead of repeating the load. Not only is this sort of optimization just fine
in single-threaded code, it is often just fine in multithreaded code. Unfortunately,
the word “often” hides some truly annoying exceptions.
For example, suppose that a real-time system needs to invoke a function named
do_something_quickly() repeatedly until the variable need_to_stop is set,
and that the compiler can see that do_something_quickly() does not store to
need_to_stop. One (unsafe) way to code this is shown in Listing 4.16. The
compiler might reasonably unroll this loop sixteen times in order to reduce the
per-invocation overhead of the backwards branch at the end of the loop. Worse yet, because
the compiler knows that do_something_quickly() does not store to need_to_
stop, the compiler could quite reasonably decide to check this variable only once,
resulting in the code shown in Listing 4.17. Once entered, the loop on lines 2–19
will never exit, regardless of how many times some other thread stores a non-zero
value to need_to_stop. The result will at best be consternation, and might well
also include severe physical damage.
The compiler can fuse loads across surprisingly large spans of code. For example,
in Listing 4.18, t0() and t1() run concurrently, and do_something() and
do_something_else() are inline functions. Line 1 declares pointer gp, which
C initializes to NULL by default. At some point, line 5 of t0() stores a non-NULL
pointer to gp. Meanwhile, t1() loads from gp three times on lines 10, 12, and 15.
Given that line 13 finds that gp is non-NULL, one might hope that the dereference
on line 15 would be guaranteed never to fault. Unfortunately, the compiler is within
its rights to fuse the reads on lines 10 and 15, which means that if line 10 loads
NULL and line 12 loads &myvar, line 15 could load NULL, resulting in a fault.8
Note that the intervening READ_ONCE() does not prevent the other two loads from
being fused, despite the fact that all three are loading from the same variable.
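One common way to sidestep this kind of problem is to load the shared pointer exactly once into a local variable and to use only the local copy afterwards. The sketch below is not Listing 4.18. It uses hypothetical function names and assumed volatile-cast definitions of READ_ONCE() and WRITE_ONCE() that rely on the GCC and Clang __typeof__ extension.

```c
#include <assert.h>
#include <stddef.h>

/* ASSUMPTION: volatile-cast definitions, not necessarily the book's own. */
#define READ_ONCE(x) (*(volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, val) \
	do { *(volatile __typeof__(x) *)&(x) = (val); } while (0)

static int myvar = 42;
static int *gp;		/* NULL until t0_publish() runs */

/* Publish a pointer to myvar for readers to find. */
void t0_publish(void)
{
	WRITE_ONCE(gp, &myvar);
}

/* Return the value gp points to, or -1 if nothing has been published. */
int t1_read(void)
{
	int *p = READ_ONCE(gp);	/* the only load from gp */

	if (p == NULL)
		return -1;
	return *p;		/* uses the checked local copy, not gp */
}
```

Because the NULL check and the dereference both use p, there is no second load from gp for the compiler to fuse with an earlier one. This sketch addresses only compiler behavior and says nothing about hardware ordering of the data that the pointer references.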
Quick Quiz 4.29: Why does it matter whether do_something() and do_something_
else() in Listing 4.18 are inline functions?
Store fusing can occur when the compiler notices a pair of successive stores to a given
variable with no intervening loads from that variable. In this case, the compiler is
within its rights to omit the first store. This is never a problem in single-threaded
code, and in fact it is usually not a problem in correctly written concurrent code.
After all, if the two stores are executed in quick succession, there is very little
chance that some other thread could load the value from the first store.
However, there are exceptions, for example as shown in Listing 4.19. The function
shut_it_down() stores to the shared variable status on lines 3 and 8, and
so assuming that neither start_shutdown() nor finish_shutdown() access
status, the compiler could reasonably remove the store to status on line 3.
Unfortunately, this would mean that work_until_shut_down() would never exit
its loop spanning lines 14 and 15, and thus would never set other_task_ready,
which would in turn mean that shut_it_down() would never exit its loop spanning
lines 5 and 6, even if the compiler chooses not to fuse the successive loads from
other_task_ready on line 5.
And there are more problems with the code in Listing 4.19, including code
reordering.
Code reordering is a common compilation technique used to combine common
subexpressions, reduce register pressure, and improve utilization of the many
functional units available on modern superscalar microprocessors. It is also
another reason why the code in Listing 4.19 is buggy. For example, suppose that
the do_more_work() function on line 15 does not access other_task_ready.
Then the compiler would be within its rights to move the assignment to other_
task_ready on line 16 to precede line 14, which might be a great disappointment
for anyone hoping that the last call to do_more_work() on line 15 happens before
the call to finish_shutdown() on line 7.
It might seem futile to prevent the compiler from changing the order of accesses in
cases where the underlying hardware is free to reorder them. However, modern
machines have exact exceptions and exact interrupts, meaning that any interrupt or
exception will appear to have happened at a specific place in the instruction stream.
This means that the handler will see the effect of all prior instructions, but won’t see
the effect of any subsequent instructions. READ_ONCE() and WRITE_ONCE() can
therefore be used to control communication between interrupted code and interrupt
handlers, independent of the ordering provided by the underlying hardware.9
Invented loads were illustrated by the code in Listings 4.14 and 4.15, in which the
compiler optimized away a temporary variable, thus loading from a shared variable
more often than intended.
Invented loads can be a performance hazard. These hazards can occur when a load
of a variable in a “hot” cacheline is hoisted out of an if statement. These hoisting
optimizations are not uncommon, and can cause significant increases in cache
misses, and thus significant degradation of both performance and scalability.
Invented stores can occur in a number of situations. For example, a compiler emitting
code for work_until_shut_down() in Listing 4.19 might notice that
other_task_ready is not accessed by do_more_work(), and is stored to on line 16. If
do_more_work() were a complex inline function, it might be necessary to do a
register spill, in which case one attractive place to use for temporary storage is
other_task_ready. After all, there are no accesses to it, so what is the harm?
Of course, a non-zero store to this variable at just the wrong time would result
in the while loop on line 5 terminating prematurely, again allowing finish_
shutdown() to run concurrently with do_more_work(). Given that the entire
point of this while appears to be to prevent such concurrency, this is not a good
thing.
9 That said, the various standards committees would prefer that you use atomics or
Quick Quiz 4.30: Ouch! So can’t the compiler invent a store to a normal variable pretty
much any time it likes?
Finally, pre-C11 compilers could invent writes to unrelated variables that happened
to be adjacent to written-to variables [Boe05, Section 4.2]. This variant of invented
stores has been outlawed by the prohibition against compiler optimizations that
invent data races.
Store-to-load transformations can occur when the compiler notices that a plain store
might not actually change the value in memory. For example, consider Listing 4.22.
Line 1 fetches p, but the “if” statement on line 2 also tells the compiler that the
developer thinks that p is usually zero.10 The barrier() statement on line 4
forces the compiler to forget the value of p, but one could imagine a compiler
choosing to remember the hint—or getting an additional hint via feedback-directed
optimization. Doing so would cause the compiler to realize that line 5 is often an
expensive no-op.
Such a compiler might therefore guard the store of NULL with a check, as shown
on lines 5 and 6 of Listing 4.23. Although this transformation is often desirable, it
could be problematic if the actual store was required for ordering. For example,
a write memory barrier (Linux kernel smp_wmb()) would order the store, but
not the load. This situation might suggest use of smp_store_release() over
smp_wmb().
10 The unlikely() function provides this hint to the compiler, and different compilers
Dead-code elimination can occur when the compiler notices that the value from a
load is never used, or when a variable is stored to, but never loaded from. This
can of course eliminate an access to a shared variable, which can in turn defeat
a memory-ordering primitive, which could cause your concurrent code to act in
surprising ways. Experience thus far indicates that relatively few such surprises
will be at all pleasant. Elimination of store-only variables is especially dangerous
in cases where external code locates the variable via symbol tables: The compiler
is necessarily ignorant of such external-code accesses, and might thus eliminate a
variable that the external code relies upon.
Reliable concurrent code clearly needs a way to cause the compiler to preserve the
number, order, and type of important accesses to shared memory, a topic taken up by
Sections 4.3.4.2 and 4.3.4.3, which are up next.
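One common way to get this kind of preservation is a volatile cast, which is roughly how READ_ONCE() and WRITE_ONCE() behave for machine-sized accesses. The following is a minimal sketch rather than any project's actual definitions. It assumes GCC or Clang (for __typeof__), and the names stop_requested, request_stop(), and spin_until_stopped() are hypothetical.

```c
/* Simplified sketch of READ_ONCE()/WRITE_ONCE(); assumes GCC or Clang. */
#define READ_ONCE(x) (*(volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, val) \
	do { *(volatile __typeof__(x) *)&(x) = (val); } while (0)

static int stop_requested;	/* Hypothetical shared flag. */

/* Would be called from another thread or from a signal handler. */
static void request_stop(void)
{
	WRITE_ONCE(stop_requested, 1);	/* The store cannot be omitted. */
}

/*
 * Polls the flag at most max_spins times.  Each iteration does a real
 * load, so the compiler cannot fuse the loads into a single check.
 * Returns the number of polls done before the flag was seen, or -1.
 */
static int spin_until_stopped(int max_spins)
{
	int i;

	for (i = 0; i < max_spins; i++)
		if (READ_ONCE(stop_requested))
			return i;
	return -1;
}
```

With a plain load in place of READ_ONCE(), the compiler would be within its rights to check stop_requested once and then either skip the loop or spin without rereading it.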
This wording might be reassuring to those writing low-level code, except for the
fact that compiler writers are free to completely ignore non-normative notes. Parallel
programmers might instead reassure themselves that compiler writers would like to
avoid breaking device drivers (though perhaps only after a few “frank and open”
discussions with device-driver developers), and device drivers impose at least the
following constraints [MWPF18]:
11 JF Bastien thoroughly documented the history and use cases for the volatile keyword
in C++ [Bas18].
12 Note that this leaves unspecified what to do with 128-bit loads and stores on CPUs
Concurrent code also relies on the first two constraints to avoid undefined behavior
that could result due to data races if any of the accesses to a given object was either
non-atomic or non-volatile, assuming that all accesses are aligned and machine-sized.
The semantics of mixed-size accesses to the same locations are more complex, and are
left aside for the time being.
So how does volatile stack up against the earlier examples?
Using READ_ONCE() on line 1 of Listing 4.14 avoids invented loads, resulting in the
code shown in Listing 4.24.
As shown in Listing 4.25, READ_ONCE() can also prevent the loop unrolling in
Listing 4.17.
READ_ONCE() and WRITE_ONCE() can also be used to prevent the store fusing and
invented stores that were shown in Listing 4.19, with the result shown in Listing 4.26.
However, this does nothing to prevent code reordering, which requires some additional
tricks taught in Section 4.3.4.3.
Finally, WRITE_ONCE() can be used to prevent the store invention shown in
Listing 4.20, with the resulting code shown in Listing 4.27.
To summarize, the volatile keyword can prevent load tearing and store tearing in
cases where the loads and stores are machine-sized and properly aligned. It can also
prevent load fusing, store fusing, invented loads, and invented stores. However, although
it does prevent the compiler from reordering volatile accesses with each other, it
does nothing to prevent the CPU from reordering these accesses. Furthermore, it does
nothing to prevent either compiler or CPU from reordering non-volatile accesses with
each other or with volatile accesses. Preventing these types of reordering requires
the techniques described in the next section.
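To make the ordering gap concrete, here is a minimal message-passing sketch in which a plain store to a payload must be ordered before a flag store. It uses C11 release and acquire operations, which is an assumption made for portability and is not necessarily the technique that the next section presents. The names payload, ready, publish(), and try_consume() are hypothetical.

```c
#include <stdatomic.h>

static int payload;		/* Plain data, published via ready. */
static atomic_int ready;	/* Zero-initialized flag. */

static void publish(int value)
{
	payload = value;	/* Plain store, ordered by the release below. */
	atomic_store_explicit(&ready, 1, memory_order_release);
}

/* Returns 1 and fills *out if the payload has been published, else 0. */
static int try_consume(int *out)
{
	if (!atomic_load_explicit(&ready, memory_order_acquire))
		return 0;
	*out = payload;		/* Ordered after the acquire load above. */
	return 1;
}
```

Marking ready as merely volatile would keep the compiler from fusing or tearing its accesses, but would not order the plain payload accesses against it.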
The doctor’s advice might seem unhelpful, but one time-tested way to avoid
concurrently accessing shared variables is to access those variables only when holding a
particular lock, as will be discussed in Chapter 7. Another way is to access a given
“shared” variable only from a given CPU or thread, as will be discussed in Chapter 8. It
is possible to combine these two approaches. For example, a given variable might be
modified only by a given CPU or thread while holding a particular lock, and might be
read either from that same CPU or thread on the one hand, or from some other CPU or
thread while holding that same lock on the other. In all of these situations, all accesses
to the shared variables may be plain C-language accesses.
Here is a list of situations allowing plain loads and stores for some accesses to a
given variable, while requiring markings (such as READ_ONCE() and WRITE_ONCE())
for other accesses to that same variable:
1. A shared variable is only modified by a given owning CPU or thread, but is read by
other CPUs or threads. All stores must use WRITE_ONCE(). The owning CPU or
thread may use plain loads. Everything else must use READ_ONCE() for loads.
2. A shared variable is only modified while holding a given lock, but is read by
code not holding that lock. All stores must use WRITE_ONCE(). CPUs or threads
holding the lock may use plain loads. Everything else must use READ_ONCE() for
loads.
3. A shared variable is only modified while holding a given lock by a given owning
CPU or thread, but is read by other CPUs or threads or by code not holding that
lock. All stores must use WRITE_ONCE(). The owning CPU or thread may use
plain loads, as may any CPU or thread holding the lock. Everything else must use
READ_ONCE() for loads.
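The second situation in the list above might look as follows. This is a minimal sketch assuming POSIX mutexes and GCC or Clang (for __typeof__). The names cfg_lock, cfg_level, raise_level(), and peek_level() are hypothetical, and the macros are simplified stand-ins.

```c
#include <pthread.h>

/* Simplified stand-ins for the marked-access macros. */
#define READ_ONCE(x) (*(volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, val) \
	do { *(volatile __typeof__(x) *)&(x) = (val); } while (0)

static pthread_mutex_t cfg_lock = PTHREAD_MUTEX_INITIALIZER;
static int cfg_level;	/* Modified only while holding cfg_lock. */

/* Updater: holds the lock, so its load may be plain; its store is marked. */
static void raise_level(void)
{
	pthread_mutex_lock(&cfg_lock);
	WRITE_ONCE(cfg_level, cfg_level + 1);
	pthread_mutex_unlock(&cfg_lock);
}

/* Lockless reader: must mark its load. */
static int peek_level(void)
{
	return READ_ONCE(cfg_level);
}
```

The plain load inside raise_level() is safe because the lock excludes all other stores, while the marked store is needed because peek_level() can run at any time.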
Quick Quiz 4.32: What needs to happen if an interrupt or signal handler might itself be
interrupted?
In most other cases, loads from and stores to a shared variable must use
READ_ONCE() and WRITE_ONCE() or stronger, respectively. But it bears repeating that neither
READ_ONCE() nor WRITE_ONCE() provides any ordering guarantees other than within
the compiler. See the above Section 4.3.4.3 or Chapter 15 for information on such
guarantees.
Examples of many of these data-race-avoidance patterns are presented in Chapter 5.
__get_thread_var()
The __get_thread_var() primitive accesses the current thread’s variable.
init_per_thread()
The init_per_thread() primitive sets all threads’ instances of the specified
variable to the specified value. The Linux kernel accomplishes this via normal C
initialization, relying on clever use of linker scripts and code executed during the
CPU-online process.
DEFINE_PER_THREAD(int, counter);
The value of the counter is then the sum of its instances. A snapshot of the value of
the counter can thus be collected as follows:
    for_each_thread(t)
        sum += READ_ONCE(per_thread(counter, t));
Again, it is possible to gain a similar effect using other mechanisms, but per-thread
variables combine convenience and high performance, as will be shown in more detail
in Section 5.2.
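For readers who want to experiment, the per-thread primitives used above can be approximated with an array indexed by thread ID. The following is a simplified sketch, not the definitions used by the CodeSamples directory. NR_THREADS, my_thread_id, and counter_snapshot() are hypothetical, and this for_each_thread() visits every slot rather than only running threads.

```c
#define NR_THREADS 4	/* Assumed fixed upper bound on thread count. */

#define READ_ONCE(x) (*(volatile __typeof__(x) *)&(x))

/* One array element per thread; real code would pad to a cache line. */
#define DEFINE_PER_THREAD(type, name) type __per_thread_##name[NR_THREADS]
#define per_thread(name, t) (__per_thread_##name[t])
#define for_each_thread(t) for ((t) = 0; (t) < NR_THREADS; (t)++)

/* Hypothetical: each thread would set this once when it starts. */
static _Thread_local int my_thread_id;
#define __get_thread_var(name) per_thread(name, my_thread_id)

DEFINE_PER_THREAD(int, counter);

/* Collects a snapshot of the counter by summing all instances. */
static int counter_snapshot(void)
{
	int t;
	int sum = 0;

	for_each_thread(t)
		sum += READ_ONCE(per_thread(counter, t));
	return sum;
}
```

A thread that has set my_thread_id to 1 can then update its own instance with __get_thread_var(counter), while any thread can read instance 2 with per_thread(counter, 2).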
Quick Quiz 4.34: What do you do if you need a per-thread (not per-CPU!) variable in the
Linux kernel?
As a rough rule of thumb, use the simplest tool that will get the job done. If you
can, simply program sequentially. If that is insufficient, try using a shell script to
mediate parallelism. If the resulting shell-script fork()/exec() overhead (about 480
microseconds for a minimal C program on an Intel Core Duo laptop) is too large,
try using the C-language fork() and wait() primitives. If the overhead of these
primitives (about 80 microseconds for a minimal child process) is still too large, then
you might need to use the POSIX threading primitives, choosing the appropriate locking
and/or atomic-operation primitives. If the overhead of the POSIX threading primitives
(typically sub-microsecond) is too great, then the primitives introduced in Chapter 9
may be required. Of course, the actual overheads will depend not only on your hardware,
but most critically on the manner in which you use the primitives. Furthermore,
always remember that inter-process communication and message-passing can be good
alternatives to shared-memory multithreaded execution, especially when your code
makes good use of the design principles called out in Chapter 6.
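To get a feel for the fork() and wait() step in this progression, consider the following minimal sketch, which creates a child that exits immediately and then collects its status. The function name run_minimal_child() is hypothetical. Timing many calls to it would give a rough local estimate of the fork()/wait() overhead, which will differ from the figures quoted above.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Forks a child that exits immediately with the given status (0-255),
 * waits for it, and returns the status the parent observed, or -1 on error.
 */
static int run_minimal_child(int status)
{
	int wstatus;
	pid_t pid = fork();

	if (pid < 0)
		return -1;		/* fork() failed. */
	if (pid == 0)
		_exit(status);		/* Child: exit without running atexit handlers. */
	if (waitpid(pid, &wstatus, 0) != pid)
		return -1;
	return WIFEXITED(wstatus) ? WEXITSTATUS(wstatus) : -1;
}
```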
Quick Quiz 4.35: Wouldn’t the shell normally use vfork() rather than fork()?
Because concurrency was added to the C standard several decades after the C language
was first used to build concurrent systems, there are a number of ways of concurrently
accessing shared variables. All else being equal, the C11 standard operations described
in Section 4.2.6 should be your first stop. If you need to access a given shared variable
both with plain accesses and atomically, then the modern GCC atomics described in
Section 4.2.7 might work well for you. If you are working on an old codebase that
uses the classic GCC __sync API, then you should review Section 4.2.5 as well as
the relevant GCC documentation. If you are working on the Linux kernel or similar
codebase that combines use of the volatile keyword with inline assembly, or if you
need dependencies to provide ordering, look at the material presented in Section 4.3.4
as well as that in Chapter 15.
Whatever approach you take, please keep in mind that randomly hacking multi-
threaded code is a spectacularly bad idea, especially given that shared-memory parallel
systems use your own intelligence against you: The smarter you are, the deeper a hole
you will dig for yourself before you realize that you are in trouble [Pok16]. Therefore, it
is necessary to make the right design choices as well as the correct choice of individual
primitives, as will be discussed at length in subsequent chapters.
As easy as 1, 2, 3!
Unknown
Chapter 5
Counting
Counting is perhaps the simplest and most natural thing a computer can do. However,
counting efficiently and scalably on a large shared-memory multiprocessor can be quite
challenging. Furthermore, the simplicity of the underlying concept of counting allows
us to explore the fundamental issues of concurrency without the distractions of elaborate
data structures or complex synchronization primitives. Counting therefore provides an
excellent introduction to parallel programming.
This chapter covers a number of special cases for which there are simple, fast, and
scalable counting algorithms. But first, let us find out how much you already know
about concurrent counting.
Quick Quiz 5.1: Why should efficient and scalable counting be hard??? After all, computers
have special hardware for the sole purpose of doing counting!!!
Quick Quiz 5.2: Network-packet counting problem. Suppose that you need to collect
statistics on the number of networking packets transmitted and received. Packets might be
transmitted or received by any CPU on the system. Suppose further that your system is capable
of handling millions of packets per second per CPU, and that a systems-monitoring package
reads the count every five seconds. How would you implement this counter?
Quick Quiz 5.3: Approximate structure-allocation limit problem. Suppose that you need
to maintain a count of the number of structures allocated in order to fail any allocations once
the number of structures in use exceeds a limit (say, 10,000). Suppose further that the structures
are short-lived, the limit is rarely exceeded, and a “sloppy” approximate limit is acceptable.
Quick Quiz 5.4: Exact structure-allocation limit problem. Suppose that you need to
maintain a count of the number of structures allocated in order to fail any allocations once the
number of structures in use exceeds an exact limit (again, say 10,000). Suppose further that
these structures are short-lived, and that the limit is rarely exceeded, that there is almost always
at least one structure in use, and suppose further still that it is necessary to know exactly when
this counter reaches zero, for example, in order to free up some memory that is not required
unless there is at least one structure in use.
Quick Quiz 5.5: Removable I/O device access-count problem. Suppose that you need to
maintain a reference count on a heavily used removable mass-storage device, so that you can
tell the user when it is safe to remove the device. As usual, the user indicates a desire to remove
the device, and the system tells the user when it is safe to do so.
Section 5.1 shows why counting is non-trivial. Sections 5.2 and 5.3 investigate
network-packet counting and approximate structure-allocation limits, respectively.
Section 5.4 takes on exact structure-allocation limits. Finally, Section 5.5 presents
performance measurements and discussion.
Sections 5.1 and 5.2 contain introductory material, while the remaining sections are
more advanced.
Let’s start with something simple, for example, the straightforward use of arithmetic
shown in Listing 5.1 (count_nonatomic.c). Here, we have a counter on line 1, we
increment it on line 5, and we read out its value on line 10. What could be simpler?
Quick Quiz 5.6: One thing that could be simpler is ++ instead of that concatenation of
READ_ONCE() and WRITE_ONCE(). Why all that extra typing???
This approach has the additional advantage of being blazingly fast if you are doing
lots of reading and almost no incrementing, and on small systems, the performance is
excellent.
There is just one large fly in the ointment: This approach can lose counts. On my
six-core x86 laptop, a short run invoked inc_count() 285,824,000 times, but the final
value of the counter was only 35,385,525. Although approximation does have a large
place in computing, loss of 87 % of the counts is a bit excessive.
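The mechanism behind these lost counts can be replayed deterministically. The sketch below runs in a single thread and steps through one possible interleaving of two concurrent non-atomic increments. The local variables a and b stand in for registers of two hypothetical threads A and B, and replay_lost_update() is likewise a hypothetical name.

```c
static unsigned long counter;

/*
 * Replays, in a single thread, one interleaving of two concurrent
 * non-atomic increments.  The variables a and b stand for registers
 * of hypothetical threads A and B.
 */
static unsigned long replay_lost_update(void)
{
	unsigned long a, b;

	counter = 0;
	a = counter;		/* Thread A loads 0. */
	b = counter;		/* Thread B loads 0 before A stores. */
	counter = a + 1;	/* Thread A stores 1. */
	counter = b + 1;	/* Thread B also stores 1: A's count is lost. */
	return counter;
}
```

Two increments were requested, but the counter ends at one. On real hardware, each such overlap silently discards a count, which is how 285,824,000 increments can leave only 35,385,525 behind.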
Quick Quiz 5.7: But can’t a smart compiler prove that line 5 of Listing 5.1 is equivalent to the
++ operator and produce an x86 add-to-memory instruction? And won’t the CPU cache cause
this to be atomic?
Quick Quiz 5.8: The 8-figure accuracy on the number of failures indicates that you really did
test this. Why would it be necessary to test such a trivial program, especially when the bug is
easily seen by inspection?
[Figure 5.1: atomic increment performance plotted against Number of CPUs (Threads)]
However, it is slower: On my six-core x86 laptop, it is more than twenty times slower
than non-atomic increment, even when only a single thread is incrementing.1
This poor performance should not be a surprise, given the discussion in Chapter 3,
nor should it be a surprise that atomic increment gets slower as
the number of CPUs and threads increases, as shown in Figure 5.1. In this figure, the
horizontal dashed line resting on the x axis is the ideal performance that would be
achieved by a perfectly scalable algorithm: With such an algorithm, a given increment
would incur the same overhead that it would in a single-threaded program. Atomic
increment of a single global variable is decidedly non-ideal, and gets multiple
orders of magnitude worse with additional CPUs.
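For reference, an atomically incremented counter might be sketched as follows. This version uses C11 atomics, which is an assumption: the listing that the text measures may well use a different API. The function names simply mirror those used for the non-atomic counter.

```c
#include <stdatomic.h>

static atomic_ulong counter;	/* Zero-initialized at file scope. */

static void inc_count(void)
{
	/* Atomic read-modify-write: no increments can be lost. */
	atomic_fetch_add_explicit(&counter, 1, memory_order_relaxed);
}

static unsigned long read_count(void)
{
	return atomic_load_explicit(&counter, memory_order_relaxed);
}
```

Every call to inc_count() needs exclusive access to the cache line holding counter, which is why throughput degrades as CPUs are added.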
Quick Quiz 5.9: Why doesn’t the horizontal dashed line on the x axis meet the diagonal line
at 𝑥 = 1?
Quick Quiz 5.10: But atomic increment is still pretty fast. And incrementing a single variable
in a tight loop sounds pretty unrealistic to me, after all, most of the program’s execution should
be devoted to actually doing work, not accounting for the work it has done! Why should I care
about making this go faster?
more quickly than atomically incrementing the counter. Of course, if your only goal is to
make the counter increase quickly, an easier approach is to simply assign a large value to the
counter. Nevertheless, there is likely to be a role for algorithms that use carefully relaxed
notions of correctness in order to gain greater performance and scalability [And91, ACMS03,
Rin13, Ung11].
[Figure 5.2: CPUs, caches, and interconnects, with red arrows showing the cache line circulating among the CPUs]
For another perspective on global atomic increment, consider Figure 5.2. In order
for each CPU to get a chance to increment a given global variable, the cache line
containing that variable must circulate among all the CPUs, as shown by the red arrows.
Such circulation will take significant time, resulting in the poor performance seen in
Figure 5.1, which might be thought of as shown in Figure 5.3. The following sections
discuss high-performance counting, which avoids the delays inherent in such circulation.
Quick Quiz 5.11: But why can’t CPU designers simply ship the addition operation to the data,
avoiding the need to circulate the cache line containing the global variable being incremented?
This section covers the common special case of statistical counters, where the count
is updated extremely frequently and the value is read out rarely, if ever. These will be
used to solve the network-packet counting problem posed in Quick Quiz 5.2.
5.2.1 Design
Statistical counting is typically handled by providing a counter per thread (or CPU, when
running in the kernel), so that each thread updates its own counter, as was foreshadowed
in Section 4.3.6 on page 73. The aggregate value of the counters is read out by simply
summing up all of the threads’ counters, relying on the commutative and associative
properties of addition. This is an example of the Data Ownership pattern that will be
introduced in Section 6.3.4 on page 135.
Quick Quiz 5.12: But doesn’t the fact that C’s “integers” are limited in size complicate things?
Such an array can be wrapped into per-thread primitives, as shown in Listing 5.3
(count_stat.c). Line 1 defines an array containing a set of per-thread counters of
type unsigned long named, creatively enough, counter.
Lines 3–8 show a function that increments the counters, using the
__get_thread_var() primitive to locate the currently running thread’s element of the counter
array. Because this element is modified only by the corresponding thread, non-atomic
increment suffices. However, this code uses WRITE_ONCE() to prevent destructive
compiler optimizations. For but one example, the compiler is within its rights to use a
to-be-stored-to location as temporary storage, thus writing what would be for all intents
and purposes garbage to that location just before doing the desired store. This could
of course be rather confusing to anything attempting to read out the count. The use of
WRITE_ONCE() prevents this optimization and others besides.
[Figure 5.4: CPUs, caches, and interconnects, with green arrows showing each CPU incrementing its own thread’s variable]
Quick Quiz 5.14: What other nasty optimizations could GCC apply?
Lines 10–18 show a function that reads out the aggregate value of the counter, using
the for_each_thread() primitive to iterate over the list of currently running threads,
and using the per_thread() primitive to fetch the specified thread’s counter. This
code also uses READ_ONCE() to ensure that the compiler doesn’t optimize these loads
into oblivion. For but one example, a pair of consecutive calls to read_count()
might be inlined, and an intrepid optimizer might notice that the same locations were
being summed and thus incorrectly conclude that it would be simply wonderful to sum
them once and use the resulting value twice. This sort of optimization might be rather
frustrating to people expecting later read_count() calls to account for the activities of
other threads. The use of READ_ONCE() prevents this optimization and others besides.
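The overall pattern can be sketched with an explicit array and an explicit thread index in place of the per-thread primitives. This is a simplified illustration rather than the contents of count_stat.c. NR_THREADS and the tid parameter are assumptions, and real code would also pad each element to its own cache line to avoid false sharing.

```c
#define NR_THREADS 4	/* Assumed fixed upper bound on thread count. */

#define READ_ONCE(x) (*(volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, val) \
	do { *(volatile __typeof__(x) *)&(x) = (val); } while (0)

static unsigned long counter[NR_THREADS];

/* Only the thread owning slot tid ever calls this with that index. */
static void inc_count(int tid)
{
	unsigned long *p_counter = &counter[tid];

	/* Plain load is fine (sole writer); the store is marked. */
	WRITE_ONCE(*p_counter, *p_counter + 1);
}

/* Any thread may call this; every load is marked. */
static unsigned long read_count(void)
{
	int t;
	unsigned long sum = 0;

	for (t = 0; t < NR_THREADS; t++)
		sum += READ_ONCE(counter[t]);
	return sum;
}
```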
Quick Quiz 5.15: How does the per-thread counter variable in Listing 5.3 get initialized?
Quick Quiz 5.16: How is the code in Listing 5.3 supposed to permit more than one counter?
This approach scales linearly with increasing numbers of updater threads invoking
inc_count(). As is shown by the green arrows on each CPU in Figure 5.4, the reason
for this is that each CPU can make rapid progress incrementing its thread’s variable,
without any expensive cross-system communication. As such, this section solves the
network-packet counting problem presented at the beginning of this chapter.
Quick Quiz 5.17: The read operation takes time to sum up the per-thread values, and
during that time, the counter could well be changing. This means that the value returned by
read_count() in Listing 5.3 will not necessarily be exact. Assume that the counter is being
incremented at rate 𝑟 counts per unit time, and that read_count()’s execution consumes 𝛥
units of time. What is the expected error in the return value?
2 GCC provides its own __thread storage class, which was used in previous versions of
this book. The two methods for specifying a thread-local variable are interchangeable when
using GCC.
Quick Quiz 5.20: Why on earth do we need something as heavyweight as a lock guarding the
summation in the function read_count() in Listing 5.4?
we want to terminate the program with an accurate counter value). The inc_count()
function shown on lines 5–10 is similar to its counterpart in Listing 5.3. The
read_count() function shown on lines 12–15 simply returns the value of the global_count
variable.
However, the count_init() function on lines 34–44 creates the eventual() thread
shown on lines 17–32, which cycles through all the threads, summing the per-thread
local counter and storing the sum to the global_count variable. The eventual()
thread waits an arbitrarily chosen one millisecond between passes.
The count_cleanup() function on lines 46–51 coordinates termination. The call to
smp_load_acquire() here and the call to smp_store_release() in eventual()
ensure that all updates to global_count are visible to code following the call to
count_cleanup().
This approach gives extremely fast counter read-out while still supporting linear
counter-update scalability. However, this excellent read-side performance and
update-side scalability come at the cost of the additional thread running eventual().
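The division of labor among the three functions can be sketched without creating any threads by exposing a single summation pass. This is a simplified illustration, not Listing 5.5 itself. NR_THREADS, the tid parameter, and eventual_pass() are assumptions, with eventual_pass() standing in for one iteration of the loop that the eventual() thread would run, with a wait between passes.

```c
#define NR_THREADS 4	/* Assumed fixed upper bound on thread count. */

#define READ_ONCE(x) (*(volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, val) \
	do { *(volatile __typeof__(x) *)&(x) = (val); } while (0)

static unsigned long counter[NR_THREADS];	/* Per-thread counts. */
static unsigned long global_count;		/* Published sum. */

/* Update side: touches only the caller's own slot. */
static void inc_count(int tid)
{
	WRITE_ONCE(counter[tid], counter[tid] + 1);
}

/* Read side: a single load, no summation. */
static unsigned long read_count(void)
{
	return READ_ONCE(global_count);
}

/* One pass of the summing loop; a dedicated thread would repeat this. */
static void eventual_pass(void)
{
	int t;
	unsigned long sum = 0;

	for (t = 0; t < NR_THREADS; t++)
		sum += READ_ONCE(counter[t]);
	WRITE_ONCE(global_count, sum);
}
```

Between passes, read_count() returns a stale value, which is the price paid for its speed.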
Quick Quiz 5.23: Why doesn’t inc_count() in Listing 5.5 need to use atomic instructions?
After all, we now have multiple threads accessing the per-thread counters!
Quick Quiz 5.24: Won’t the single global thread in the function eventual() of Listing 5.5
be just as severe a bottleneck as a global lock would be?
Quick Quiz 5.25: Won’t the estimate returned by read_count() in Listing 5.5 become
increasingly inaccurate as the number of threads rises?
Quick Quiz 5.26: Given that in the eventually-consistent algorithm shown in Listing 5.5 both
reads and updates have extremely low overhead and are extremely scalable, why would anyone
bother with the implementation described in Section 5.2.2, given its costly read-side code?
Quick Quiz 5.27: What is the accuracy of the estimate returned by read_count() in
Listing 5.5?
5.2.5 Discussion
These three implementations show that it is possible to obtain near-uniprocessor
performance for statistical counters, despite running on a parallel machine.
Quick Quiz 5.28: What fundamental difference is there between counting packets and counting
the total number of bytes in the packets, given that the packets vary in size?
Quick Quiz 5.29: Given that the reader must sum all the threads’ counters, this counter-read
operation could take a long time given large numbers of threads. Is there any way that the
increment operation can remain fast and scalable while allowing readers to also enjoy not only
reasonable performance and scalability, but also good accuracy?
Given what has been presented in this section, you should now be able to answer the
Quick Quiz about statistical counters for networking near the beginning of this chapter.
5.3.1 Design
One possible design for limit counters is to divide the limit of 10,000 by the number
of threads, and give each thread a fixed pool of structures. For example, given 100
threads, each thread would manage its own pool of 100 structures. This approach is
simple, and in some cases works well, but it does not handle the common case where
a given structure is allocated by one thread and freed by another [MS93]. On the one
hand, if a given thread takes credit for any structures it frees, then the thread doing most
of the allocating runs out of structures, while the threads doing most of the freeing
have lots of credits that they cannot use. On the other hand, if freed structures are
credited to the CPU that allocated them, it will be necessary for CPUs to manipulate
each other’s counters, which will require expensive atomic instructions or other means
of communicating between threads.3
In short, for many important workloads, we cannot fully partition the counter. Given
that partitioning the counters was what brought the excellent update-side performance
for the three schemes discussed in Section 5.2, this might be grounds for some pessimism.
However, the eventually consistent algorithm presented in Section 5.2.4 provides an
interesting hint. Recall that this algorithm kept two sets of books, a per-thread counter
variable for updaters and a global_count variable for readers, with an eventual()
thread that periodically updated global_count to be eventually consistent with the
values of the per-thread counter. The per-thread counter perfectly partitioned the
counter value, while global_count kept the full value.
For limit counters, we can use a variation on this theme where we partially partition
the counter. For example, consider four threads with each having not only a per-thread
counter, but also a per-thread maximum value (call it countermax).
But then what happens if a given thread needs to increment its counter, but counter
is equal to its countermax? The trick here is to move half of that thread’s counter
value to a globalcount, then increment counter. For example, if a given thread’s
counter and countermax variables were both equal to 10, we do the following:
allocated it, then this simple partitioning approach works extremely well.
3. To balance out the addition, subtract five from this thread’s counter.
Although this procedure still requires a global lock, that lock need only be acquired
once for every five increment operations, greatly reducing that lock’s level of contention.
We can reduce this contention as low as we wish by increasing the value of countermax.
However, the corresponding penalty for increasing the value of countermax is reduced
accuracy of globalcount. To see this, note that on a four-CPU system, if countermax
is equal to ten, globalcount will be in error by at most 40 counts. In contrast, if
countermax is increased to 100, globalcount might be in error by as much as 400
counts.
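The transfer of half of a full per-thread counter to globalcount can be sketched as follows. This is a minimal illustration that assumes POSIX mutexes and passes the per-thread state explicitly. The names struct thread_ctr and inc_with_spill() are hypothetical.

```c
#include <pthread.h>

static pthread_mutex_t gblcnt_mutex = PTHREAD_MUTEX_INITIALIZER;
static unsigned long globalcount;

/* Per-thread state, passed explicitly in this sketch. */
struct thread_ctr {
	unsigned long counter;
	unsigned long countermax;
};

/* Increments this thread's counter, spilling half to globalcount if full. */
static void inc_with_spill(struct thread_ctr *tc)
{
	if (tc->counter == tc->countermax) {
		unsigned long half = tc->counter / 2;

		pthread_mutex_lock(&gblcnt_mutex);
		globalcount += half;	/* Move half to the global count... */
		tc->counter -= half;	/* ...and balance the addition. */
		pthread_mutex_unlock(&gblcnt_mutex);
	}
	tc->counter++;			/* Now there is room. */
}
```

Starting from counter and countermax both equal to 10, the first call takes the lock and leaves counter at 6 and globalcount at 5. The next four calls run without the lock.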
This raises the question of just how much we care about globalcount’s deviation
from the aggregate value of the counter, where this aggregate value is the sum of
globalcount and each thread’s counter variable. The answer to this question depends
on how far the aggregate value is from the counter’s limit (call it globalcountmax).
The larger the difference between these two values, the larger countermax can be
without risk of exceeding the globalcountmax limit. This means that the value of a
given thread’s countermax variable can be set based on this difference. When far from
the limit, the countermax per-thread variables are set to large values to optimize for
performance and scalability, while when close to the limit, these same variables are set
to small values to minimize the error in the checks against the globalcountmax limit.
This design is an example of parallel fastpath, which is an important design pattern
in which the common case executes with no expensive instructions and no interactions
between threads, but where occasional use is also made of a more conservatively
designed (and higher overhead) global algorithm. This design pattern is covered in
more detail in Section 6.4.
2. The sum of all threads’ countermax values must be less than or equal to
globalreserve.
3. Each thread’s counter must be less than or equal to that thread’s countermax.
[Figure 5.5: bar diagram relating globalcountmax, globalcount, globalreserve, and the per-thread counter and countermax variables for threads 0–3]
Lines 1–18 show add_count(), which adds the specified value delta to the counter.
Line 3 checks to see if there is room for delta on this thread’s counter, and, if so,
line 4 adds it and line 5 returns success. This is the add_count() fastpath, and it
does no atomic operations, references only per-thread variables, and should not incur
any cache misses.
Quick Quiz 5.31: What is with the strange form of the condition on line 3 of Listing 5.7?
Why not the more intuitive form of the fastpath shown in Listing 5.8?
If the test on line 3 fails, we must access global variables, and thus must acquire
gblcnt_mutex on line 7, which we release on line 11 in the failure case or on line 16
in the success case. Line 8 invokes globalize_count(), shown in Listing 5.9,
which clears the thread-local variables, adjusting the global variables as needed, thus
simplifying global processing. (But don’t take my word for it, try coding it yourself!)
Lines 9 and 10 check to see if addition of delta can be accommodated, with the
meaning of the expression preceding the less-than sign shown in Figure 5.5 as the
difference in height of the two red (leftmost) bars. If the addition of delta cannot
be accommodated, then line 11 (as noted earlier) releases gblcnt_mutex and line 12
returns indicating failure.
Otherwise, we take the slowpath. Line 14 adds delta to globalcount, and then
line 15 invokes balance_count() (shown in Listing 5.9) in order to update both the
global and the per-thread variables. This call to balance_count() will usually set this
thread’s countermax to re-enable the fastpath. Line 16 then releases gblcnt_mutex
(again, as noted earlier), and, finally, line 17 returns indicating success.
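The add_count() walkthrough above can be followed against this minimal sketch, which is not Listing 5.7 itself. It assumes POSIX mutexes and C11 _Thread_local, and its line numbers will not match those in the text. The nthreads variable and this simplified balance_count(), which hands the calling thread an equal share of the remaining headroom, are assumptions.

```c
#include <pthread.h>

#define WRITE_ONCE(x, val) \
	do { *(volatile __typeof__(x) *)&(x) = (val); } while (0)

static pthread_mutex_t gblcnt_mutex = PTHREAD_MUTEX_INITIALIZER;
static unsigned long globalcountmax = 10000;	/* Limit used in the text. */
static unsigned long globalcount;
static unsigned long globalreserve;	/* Sum of all threads' countermax. */
static int nthreads = 1;		/* Hypothetical thread count. */

static _Thread_local unsigned long counter;
static _Thread_local unsigned long countermax;

/* Folds this thread's state into the global variables. */
static void globalize_count(void)
{
	globalcount += counter;
	counter = 0;
	globalreserve -= countermax;
	countermax = 0;
}

/* Simplified stand-in: give this thread an equal share of what is left. */
static void balance_count(void)
{
	countermax = (globalcountmax - globalcount - globalreserve) / nthreads;
	globalreserve += countermax;
}

static int add_count(unsigned long delta)
{
	if (countermax - counter >= delta) {	/* Fastpath: no lock. */
		WRITE_ONCE(counter, counter + delta);
		return 1;
	}
	pthread_mutex_lock(&gblcnt_mutex);	/* Slowpath: global state. */
	globalize_count();
	if (globalcountmax - globalcount - globalreserve < delta) {
		pthread_mutex_unlock(&gblcnt_mutex);
		return 0;			/* Would exceed the limit. */
	}
	globalcount += delta;
	balance_count();			/* Usually re-enables the fastpath. */
	pthread_mutex_unlock(&gblcnt_mutex);
	return 1;
}
```

With a single thread, the first add_count(1) takes the slowpath and sets countermax to 9,999, after which additions run on the fastpath until the limit of 10,000 is reached.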
Quick Quiz 5.32: Why does globalize_count() zero the per-thread variables, only to
later call balance_count() to refill them in Listing 5.7? Why not just leave the per-thread
variables non-zero?
Lines 20–36 show sub_count(), which subtracts the specified delta from the
counter. Line 22 checks to see if the per-thread counter can accommodate this
subtraction, and, if so, line 23 does the subtraction and line 24 returns success. These
lines form sub_count()’s fastpath, and, as with add_count(), this fastpath executes
no costly operations.
If the fastpath cannot accommodate subtraction of delta, execution proceeds to the
slowpath on lines 26–35. Because the slowpath must access global state, line 26 acquires
gblcnt_mutex, which is released either by line 29 (in case of failure) or by line 34 (in
case of success). Line 27 invokes globalize_count(), shown in Listing 5.9, which
again clears the thread-local variables, adjusting the global variables as needed. Line 28
checks to see if the counter can accommodate subtracting delta, and, if not, line 29
releases gblcnt_mutex (as noted earlier) and line 30 returns failure.
Quick Quiz 5.33: Given that globalreserve counted against us in add_count(), why
doesn’t it count for us in sub_count() in Listing 5.7?
Quick Quiz 5.34: Suppose that one thread invokes add_count() shown in Listing 5.7, and
then another thread invokes sub_count(). Won’t sub_count() return failure even though
the value of the counter is non-zero?
If, on the other hand, line 28 finds that the counter can accommodate subtracting
delta, we complete the slowpath. Line 32 does the subtraction and then line 33 invokes
balance_count() (shown in Listing 5.9) in order to update both global and per-thread
variables (hopefully re-enabling the fastpath). Then line 34 releases gblcnt_mutex,
and line 35 returns success.
Quick Quiz 5.35: Why have both add_count() and sub_count() in Listing 5.7? Why not
simply pass a negative number to add_count()?
Lines 38–51 show read_count(), which returns the aggregate value of the counter. It
acquires gblcnt_mutex on line 43 and releases it on line 49, excluding global operations
from add_count() and sub_count(), and, as we will see, also excluding thread
creation and exit. Line 44 initializes local variable sum to the value of globalcount,
and then the loop spanning lines 45–48 sums the per-thread counter variables. Line 50
then returns the sum.
Listing 5.9 shows a number of utility functions used by the add_count(),
sub_count(), and read_count() primitives shown in Listing 5.7.
Lines 1–7 show globalize_count(), which zeros the current thread’s per-thread
counters, adjusting the global variables appropriately. It is important to note that this
function does not change the aggregate value of the counter, but instead changes how
the counter’s current value is represented. Line 3 adds the thread’s counter variable to
globalcount, and line 4 zeroes counter. Similarly, line 5 subtracts the per-thread
countermax from globalreserve, and line 6 zeroes countermax. It is helpful to
refer to Figure 5.5 when reading both this function and balance_count(), which is
next.
Lines 9–19 show balance_count(), which is, roughly speaking, the inverse of
globalize_count(). This function’s job is to set the current thread’s countermax
variable to the largest value that avoids the risk of the counter exceeding the
globalcountmax limit. Changing the current thread’s countermax variable of course
requires corresponding adjustments to counter, globalcount and globalreserve,
as can be seen by referring back to Figure 5.5. By doing this, balance_count()
maximizes use of add_count()’s and sub_count()’s low-overhead fastpaths. As with
globalize_count(), balance_count() is not permitted to change the aggregate
value of the counter.
Lines 11–13 compute this thread’s share of that portion of globalcountmax that
is not already covered by either globalcount or globalreserve, and assign the
computed quantity to this thread’s countermax. Line 14 makes the corresponding
adjustment to globalreserve. Line 15 sets this thread’s counter to the middle of
the range from zero to countermax. Line 16 checks to see whether globalcount can
in fact accommodate this value of counter, and, if not, line 17 decreases counter
accordingly. Finally, in either case, line 18 makes the corresponding adjustment to
globalcount.
[Figure 5.6: three configurations of globalcount, globalreserve, and each thread’s counter (c 0 through c 3) and countermax (cm 0 through cm 3): the initial state, the state after globalize_count(), and the state after balance_count().]
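The pieces described above can be pulled together in a minimal sketch. It is not the code of Listings 5.7 and 5.9: it assumes a fixed NTHREADS, and it passes an explicit thread index t in place of true per-thread variables so that the logic can be exercised from a single thread. Thread registration and any access annotations are omitted.

```c
#include <assert.h>
#include <pthread.h>

#define NTHREADS 4  /* assumption: fixed number of threads */

/* Assumption: explicit slots indexed by t stand in for per-thread variables. */
static unsigned long counter[NTHREADS];
static unsigned long countermax[NTHREADS];
static unsigned long globalcountmax = 10000;
static unsigned long globalcount;
static unsigned long globalreserve;
static pthread_mutex_t gblcnt_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Fold thread t's state into the globals; the aggregate value is unchanged. */
static void globalize_count(int t)
{
	globalcount += counter[t];
	counter[t] = 0;
	globalreserve -= countermax[t];
	countermax[t] = 0;
}

/* Give thread t its share of the remaining headroom; aggregate unchanged. */
static void balance_count(int t)
{
	countermax[t] = (globalcountmax - globalcount - globalreserve) / NTHREADS;
	globalreserve += countermax[t];
	counter[t] = countermax[t] / 2;
	if (counter[t] > globalcount)
		counter[t] = globalcount;
	globalcount -= counter[t];
}

static int add_count(int t, unsigned long delta)
{
	if (countermax[t] - counter[t] >= delta) {  /* fastpath: no lock */
		counter[t] += delta;
		return 1;
	}
	pthread_mutex_lock(&gblcnt_mutex);          /* slowpath */
	globalize_count(t);
	if (globalcountmax - globalcount - globalreserve < delta) {
		pthread_mutex_unlock(&gblcnt_mutex);
		return 0;
	}
	globalcount += delta;
	balance_count(t);
	pthread_mutex_unlock(&gblcnt_mutex);
	return 1;
}

static int sub_count(int t, unsigned long delta)
{
	if (counter[t] >= delta) {                  /* fastpath: no lock */
		counter[t] -= delta;
		return 1;
	}
	pthread_mutex_lock(&gblcnt_mutex);          /* slowpath */
	globalize_count(t);
	if (globalcount < delta) {
		pthread_mutex_unlock(&gblcnt_mutex);
		return 0;
	}
	globalcount -= delta;
	balance_count(t);
	pthread_mutex_unlock(&gblcnt_mutex);
	return 1;
}

static unsigned long read_count(void)
{
	unsigned long sum;
	int t;

	pthread_mutex_lock(&gblcnt_mutex);
	sum = globalcount;
	for (t = 0; t < NTHREADS; t++)
		sum += counter[t];
	pthread_mutex_unlock(&gblcnt_mutex);
	return sum;
}
```

In this sketch, a first add_count(0, 5) takes the slowpath and leaves thread 0 with a non-zero countermax, after which small additions by thread 0 stay on the fastpath. A sub_count() by a different thread can then fail even though read_count() is non-zero, because that thread cannot see counts held in thread 0's counter.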
Quick Quiz 5.36: Why set counter to countermax / 2 in line 15 of Listing 5.9? Wouldn’t
it be simpler to just take countermax counts?
counter variables), again as indicated by the lowermost of the two dotted lines
connecting the center and rightmost configurations. The globalreserve variable
is also adjusted so that this variable remains equal to the sum of the four threads’
countermax variables. Because thread 0’s counter is less than its countermax,
thread 0 can once again increment the counter locally.
Quick Quiz 5.37: In Figure 5.6, even though a quarter of the remaining count up to the limit
is assigned to thread 0, only an eighth of the remaining count is consumed, as indicated by the
uppermost dotted line connecting the center and the rightmost configurations. Why is that?
To solve the exact structure-allocation limit problem noted in Quick Quiz 5.4, we need a
limit counter that can tell exactly when its limits are exceeded. One way of implementing
such a limit counter is to cause threads that have reserved counts to give them up. One
way to do this is to use atomic instructions. Of course, atomic instructions will slow
down the fastpath, but on the other hand, it would be silly not to at least give them a try.
The variables and access functions for a simple atomic limit counter are shown in
Listing 5.12 (count_lim_atomic.c). The counter and countermax variables in
earlier algorithms are combined into the single variable counterandmax shown on
line 1, with counter in the upper half and countermax in the lower half. This variable
is of type atomic_t, which has an underlying representation of int.
Lines 2–6 show the definitions for globalcountmax, globalcount,
globalreserve, counterp, and gblcnt_mutex, all of which take on roles similar
to their counterparts in Listing 5.10. Line 7 defines CM_BITS, which gives the
number of bits in each half of counterandmax, and line 8 defines MAX_COUNTERMAX,
which gives the maximum value that may be held in either half of counterandmax.
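The packing of counter and countermax into one word might be sketched as follows. This is an illustration under stated assumptions rather than the code of Listing 5.12: it uses unsigned int in place of atomic_t's underlying int, and it derives CM_BITS from CHAR_BIT. The helper names merge_counterandmax() and split_counterandmax_int() are chosen here for illustration.

```c
#include <assert.h>
#include <limits.h>

/* Number of bits in each half of the combined word. */
#define CM_BITS (sizeof(unsigned int) * CHAR_BIT / 2)
/* Largest value that fits in either half. */
#define MAX_COUNTERMAX ((1U << CM_BITS) - 1)

/* Pack counter c into the upper half and countermax cm into the lower half. */
static unsigned int merge_counterandmax(unsigned int c, unsigned int cm)
{
	assert(c <= MAX_COUNTERMAX && cm <= MAX_COUNTERMAX);
	return (c << CM_BITS) | cm;
}

/* Unpack a combined word into its counter and countermax components. */
static void split_counterandmax_int(unsigned int cami,
				    unsigned int *c, unsigned int *cm)
{
	*c = (cami >> CM_BITS) & MAX_COUNTERMAX;
	*cm = cami & MAX_COUNTERMAX;
}
```

Because both values live in a single word, one atomic operation on that word can read or update the pair consistently.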
Quick Quiz 5.39: In what way does line 7 of Listing 5.12 violate the C standard?
Lines 1–32 show add_count(), whose fastpath spans lines 8–15, with the remainder
of the function being the slowpath. Lines 8–14 of the fastpath form a compare-and-swap
(CAS) loop, with the atomic_cmpxchg() primitive on lines 13–14 performing the
actual CAS. Line 9 splits the current thread’s counterandmax variable into its counter
(in c) and countermax (in cm) components, while placing the underlying int into old.
Line 10 checks whether the amount delta can be accommodated locally (taking care
to avoid integer overflow), and if not, line 11 transfers to the slowpath. Otherwise,
line 12 combines an updated counter value with the original countermax value into
new. The atomic_cmpxchg() primitive on lines 13–14 then atomically compares this
thread’s counterandmax variable to old, updating its value to new if the comparison
succeeds. If the comparison succeeds, line 15 returns success; otherwise, execution
continues in the loop at line 8.
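The shape of this CAS loop can be sketched with C11 atomics, with atomic_compare_exchange_strong() standing in for the atomic_cmpxchg() primitive. The function name add_count_fastpath() and the use of atomic_uint are assumptions made for this sketch. It returns 0 where the listing would transfer to the slowpath.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>

#define CM_BITS (sizeof(unsigned int) * CHAR_BIT / 2)
#define MAX_COUNTERMAX ((1U << CM_BITS) - 1)

/*
 * Try to add delta to the counter half of *counterandmax.
 * Returns 1 on success, or 0 if the caller must take the slowpath.
 */
static int add_count_fastpath(atomic_uint *counterandmax, unsigned int delta)
{
	unsigned int old, new, c, cm;

	do {
		/* Snapshot the combined word and split it into its halves. */
		old = atomic_load(counterandmax);
		c = (old >> CM_BITS) & MAX_COUNTERMAX;
		cm = old & MAX_COUNTERMAX;
		/* Checking delta first avoids overflow in c + delta. */
		if (delta > MAX_COUNTERMAX || c + delta > cm)
			return 0;
		new = ((c + delta) << CM_BITS) | cm;
		/* Retry if some other update changed the word in the meantime. */
	} while (!atomic_compare_exchange_strong(counterandmax, &old, new));
	return 1;
}
```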
Quick Quiz 5.42: Yecch! Why the ugly goto on line 11 of Listing 5.13? Haven’t you heard
of the break statement???
Quick Quiz 5.43: Why would the atomic_cmpxchg() primitive at lines 13–14 of Listing 5.13
ever fail? After all, we picked up its old value on line 9 and have not changed it!
the loop spanning lines 11–16 adds the per-thread counters to this sum, isolating each
per-thread counter using split_counterandmax() on line 13. Finally, line 18 returns
the sum.
Listings 5.15 and 5.16 show the utility functions globalize_count(),
flush_local_count(), balance_count(), count_register_thread(), and
count_unregister_thread(). The code for globalize_count() is shown on lines 1–12
of Listing 5.15, and is similar to that of previous algorithms, with the addition of line 7,
which is now required to split out counter and countermax from counterandmax.
The code for flush_local_count(), which moves all threads’ local counter state
to the global counter, is shown on lines 14–32. Line 22 checks to see if the value of
globalreserve permits any per-thread counts, and, if not, line 23 returns. Otherwise,
line 24 initializes local variable zero to a combined zeroed counter and countermax.
The loop spanning lines 25–31 sequences through each thread. Line 26 checks to see if
the current thread has counter state, and, if so, lines 27–30 move that state to the global
counters. Line 27 atomically fetches the current thread’s state while replacing it with
zero. Line 28 splits this state into its counter (in local variable c) and countermax (in
local variable cm) components. Line 29 adds this thread’s counter to globalcount,
while line 30 subtracts this thread’s countermax from globalreserve.
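A minimal sketch of this flush, assuming C11 atomics, with atomic_exchange() standing in for the atomic fetch-and-replace operation. NR_THREADS and the layout of counterp[] are assumptions of the sketch, and the caller is assumed to hold gblcnt_mutex.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>
#include <stddef.h>

#define NR_THREADS 4  /* assumption: small fixed thread limit */
#define CM_BITS (sizeof(unsigned int) * CHAR_BIT / 2)
#define MAX_COUNTERMAX ((1U << CM_BITS) - 1)

static atomic_uint *counterp[NR_THREADS];  /* NULL for non-existent threads */
static unsigned long globalcount;
static unsigned long globalreserve;

/* Move every thread's local counter state to the global variables. */
static void flush_local_count(void)
{
	const unsigned int zero = 0;  /* zeroed counter and countermax */
	unsigned int old, c, cm;
	int t;

	if (globalreserve == 0)       /* no per-thread counts can exist */
		return;
	for (t = 0; t < NR_THREADS; t++) {
		if (counterp[t] == NULL)
			continue;
		/* Atomically fetch the thread's state, replacing it with zero. */
		old = atomic_exchange(counterp[t], zero);
		c = (old >> CM_BITS) & MAX_COUNTERMAX;
		cm = old & MAX_COUNTERMAX;
		assert(globalreserve >= cm);
		globalcount += c;
		globalreserve -= cm;
	}
}
```

As with globalize_count(), the aggregate value of the counter is unchanged: each thread's counter moves into globalcount, and its countermax is removed from globalreserve.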
Quick Quiz 5.44: What stops a thread from simply refilling its counterandmax variable
immediately after flush_local_count() on line 14 of Listing 5.15 empties it?
Quick Quiz 5.45: What prevents concurrent execution of the fastpath of either add_count()
or sub_count() from interfering with the counterandmax variable while
flush_local_count() is accessing it on line 27 of Listing 5.15?
Lines 1–22 of Listing 5.16 show the code for balance_count(), which refills
the calling thread’s local counterandmax variable. This function is quite similar
to that of the preceding algorithms, with changes required to handle the merged
counterandmax variable. Detailed analysis of the code is left as an exercise for the
reader, as it is with the count_register_thread() function starting on line 24 and
the count_unregister_thread() function starting on line 33.
Quick Quiz 5.46: Given that the atomic_set() primitive does a simple store to the specified
atomic_t, how can line 21 of balance_count() in Listing 5.16 work correctly in the face of
concurrent flush_local_count() updates to this variable?
[Figure 5.7: signal-theft state machine with states IDLE, REQ, ACK, and READY. Recoverable transition labels: “need flush”, “no count”, “flushed”, “!counting”, “counting”, and “done counting”.]
from other threads. Because signal handlers run in the context of the signaled thread,
atomic operations are not necessary, as shown in the next section.
Quick Quiz 5.47: But signal handlers can be migrated to some other CPU while running.
Doesn’t this possibility mean that atomic instructions and memory barriers are required to
reliably communicate between a thread and a signal handler that interrupts that thread?
4 For those with black-and-white versions of this book, IDLE and READY are green,
Quick Quiz 5.48: In Figure 5.7, why is the REQ theft state colored red?
Quick Quiz 5.49: In Figure 5.7, what is the point of having separate REQ and ACK theft
states? Why not simplify the state machine by collapsing them into a single REQACK state?
Then whichever of the signal handler or the fastpath gets there first could set the state to READY.
Quick Quiz 5.52: In Listing 5.18, why doesn’t line 30 check for the current thread sending
itself a signal?
Quick Quiz 5.53: The code shown in Listings 5.17 and 5.18 works with GCC and POSIX.
What would be required to make it also conform to the ISO C standard?
The loop spanning lines 33–46 waits until each thread reaches READY state, then steals
that thread’s count. Lines 34–35 skip any non-existent threads, and the loop spanning
lines 36–40 waits until the current thread’s theft state becomes READY. Line 37
blocks for a millisecond to avoid priority-inversion problems, and if line 38 determines
that the thread’s signal has not yet arrived, line 39 resends the signal. Execution reaches
line 41 when the thread’s theft state becomes READY, so lines 41–44 do the thieving.
Line 45 then sets the thread’s theft state back to IDLE.
Quick Quiz 5.54: In Listing 5.18, why does line 39 resend the signal?
In either case, line 12 prevents the compiler from reordering the fastpath body to follow
line 13, which permits any subsequent signal handlers to undertake theft. Line 14 again
disables compiler reordering, and then line 15 checks to see if the signal handler deferred
the theft state-change to READY, and, if so, line 16 uses smp_store_release() to
set the theft state to READY, further ensuring that any CPU that sees the READY
state also sees the effects of line 9. If the fastpath addition at line 9 was executed, then
line 18 returns success.
Otherwise, we fall through to the slowpath starting at line 19. The structure of the
slowpath is similar to those of earlier examples, so its analysis is left as an exercise
to the reader. Similarly, the structure of sub_count() in Listing 5.20 is the same as
that of add_count(), so the analysis of sub_count() is also left as an exercise for the
reader, as is the analysis of read_count() in Listing 5.21.
This is but one reason why high-quality APIs are so important: They permit
implementations to be changed as required by ever-changing hardware performance
characteristics.
Quick Quiz 5.56: What if you want an exact limit counter to be exact only for its lower limit,
but to allow the upper limit to be inexact?
Although a biased counter can be quite useful, it is only a partial solution
to the removable I/O device access-count problem called out on page 77. When
attempting to remove a device, we must not only know the precise number of current
I/O accesses, we also need to prevent any future accesses from starting. One way to
accomplish this is to read-acquire a reader-writer lock when updating the counter, and to
write-acquire that same reader-writer lock when checking the counter. Code for doing
I/O might be as follows:
1 read_lock(&mylock);
2 if (removing) {
3   read_unlock(&mylock);
4   cancel_io();
5 } else {
6   add_count(1);
7   read_unlock(&mylock);
8   do_io();
9   sub_count(1);
10 }
Line 1 read-acquires the lock, and either line 3 or 7 releases it. Line 2 checks to see
if the device is being removed, and, if so, line 3 releases the lock and line 4 cancels
the I/O, or takes whatever action is appropriate given that the device is to be removed.
Otherwise, line 6 increments the access count, line 7 releases the lock, line 8 performs
the I/O, and line 9 decrements the access count.
Quick Quiz 5.58: This is ridiculous! We are read-acquiring a reader-writer lock to update the
counter? What are you playing at???
Table 5.1: Statistical and limit counter performance

Algorithm                        Updates  Reads (ns)
(count_*.c)    Section  Exact?   (ns)       1 CPU   8 CPUs   64 CPUs  420 CPUs
stat           5.2.2             6.3          294      303       315       612
stat_eventual  5.2.4             6.4            1        1         1         1
end            5.2.3             2.9          301    6,309   147,594   239,683
end_rcu        13.5.1            2.9          454      481       508     2,317
lim            5.3.2    N        3.2          435    6,678   156,175   239,422
lim_app        5.3.4    N        2.4          485    7,041   173,108   239,682
lim_atomic     5.4.1    Y        19.7         513    7,085   199,957   239,450
lim_sig        5.4.4    Y        4.7          519    6,805   120,000   238,811

1 write_lock(&mylock);
2 removing = 1;
3 sub_count(mybias);
4 write_unlock(&mylock);
5 while (read_count() != 0)
6   poll(NULL, 0, 1);
7 remove_device();
Line 1 write-acquires the lock and line 4 releases it. Line 2 notes that the device is
being removed, and the loop spanning lines 5–6 waits for any I/O operations to complete.
Finally, line 7 does any additional processing needed to prepare for device removal.
Quick Quiz 5.59: What other issues would need to be accounted for in a real system?
This chapter has presented the reliability, performance, and scalability problems with
traditional counting primitives. The C-language ++ operator is not guaranteed to
function reliably in multithreaded code, and atomic operations to a single variable
neither perform nor scale well. This chapter therefore presented a number of counting
algorithms that perform and scale extremely well in certain special cases.
It is well worth reviewing the lessons from these counting algorithms. To that end,
Section 5.5.1 overviews requisite validation, Section 5.5.2 summarizes performance and
scalability, Section 5.5.3 discusses the need for specialization, and finally, Section 5.5.4
enumerates lessons learned and calls attention to later chapters that will expand on these
lessons.
aided and abetted by the complexities inherent in maintaining a 64-bit count on a 32-bit
system. Therefore, validation is not optional, even for the simple algorithms presented
in this chapter.
The statistical counters are tested for acting like counters (“counttorture.h”), that
is, that the aggregate sum in the counter changes by the sum of the amounts added by
the various update-side threads.
The limit counters are also tested for acting like counters (“limtorture.h”), and
additionally checked for their ability to accommodate the specified limit.
Both of these test suites produce performance data that is used in Section 5.5.2.
Although this level of validation is good and sufficient for textbook implementations
such as these, it would be wise to apply additional validation before putting similar
algorithms into production. Chapter 11 describes additional approaches to testing,
and given the simplicity of most of these counting algorithms, most of the techniques
described in Chapter 12 can also be quite helpful.
Quick Quiz 5.61: Even on the fourth row of Table 5.1, the read-side performance of these
statistical counter implementations is pretty horrible. So why bother with them?
The bottom half of Table 5.1 shows the performance of the parallel limit-counting
algorithms. Exact enforcement of the limits incurs a substantial update-side performance
penalty, although on this x86 system that penalty can be reduced by substituting signals
for atomic operations. All of these implementations suffer from read-side lock contention
in the face of concurrent readers.
Quick Quiz 5.62: Given the performance data shown in the bottom half of Table 5.1, we
should always prefer signals over atomic operations, right?
Quick Quiz 5.63: Can advanced techniques be applied to address the lock contention for
readers seen in the bottom half of Table 5.1?
In short, this chapter has demonstrated a number of counting algorithms that perform
and scale extremely well in a number of special cases. But must our parallel counting
be confined to special cases? Wouldn’t it be better to have a general algorithm that
operated efficiently in all cases? The next section looks at these questions.
This problem is not specific to arithmetic. Suppose you need to store and query data.
Should you use an ASCII file? XML? A relational database? A linked list? A dense
array? A B-tree? A radix tree? Or one of the plethora of other data structures and
environments that permit data to be stored and queried? It depends on what you need
to do, how fast you need it done, and how large your data set is—even on sequential
systems.
Similarly, if you need to count, your solution will depend on the size of the numbers
you need to work with, how many CPUs need to be manipulating a given number
concurrently, how the number is to be used, and what level of performance and scalability
you will need.
Nor is this problem specific to software. The design for a bridge meant to allow people
to walk across a small brook might be as simple as a single wooden plank. But you would
probably not use a plank to span the kilometers-wide mouth of the Columbia River, nor
would such a design be advisable for bridges carrying concrete trucks. In short, just
as bridge design must change with increasing span and load, so must software design
change as the number of CPUs increases. That said, it would be good to automate this
process, so that the software adapts to changes in hardware configuration and in workload.
There has in fact been some research into this sort of automation [AHS+ 03, SAH+ 03],
and the Linux kernel does some boot-time reconfiguration, including limited binary
rewriting. This sort of adaptation will become increasingly important as the number of
CPUs on mainstream systems continues to increase.
In short, as discussed in Chapter 3, the laws of physics constrain parallel software
just as surely as they constrain mechanical artifacts such as bridges. These constraints
force specialization, though in the case of software it might be possible to automate the
choice of specialization to fit the hardware and workload in question.
Of course, even generalized counting is quite specialized. We need to do a great
number of other things with computers. The next section relates what we have learned
from counters to topics taken up later in this book.
The examples in this chapter have shown that an important scalability and performance
tool is partitioning. The counters might be fully partitioned, as in the statistical counters
discussed in Section 5.2, or partially partitioned as in the limit counters discussed in
Sections 5.3 and 5.4. Partitioning will be considered in far greater depth in Chapter 6,
and partial parallelization in particular in Section 6.4, where it is called parallel fastpath.
Quick Quiz 5.65: But if we are going to have to partition everything, why bother with
shared-memory multithreading? Why not just partition the problem completely and run as
multiple processes, each in its own address space?
The partially partitioned counting algorithms used locking to guard the global data,
and locking is the subject of Chapter 7. In contrast, the partitioned data tended to be fully
under the control of the corresponding thread, so that no synchronization whatsoever
was required. This data ownership will be introduced in Section 6.3.4 and discussed in
more detail in Chapter 8.
Because integer addition and subtraction are extremely cheap compared to typical
synchronization operations, achieving reasonable scalability requires that synchronization
operations be used sparingly. One way of achieving this is to batch the addition and
subtraction operations, so that a great many of these cheap operations are handled by
a single synchronization operation. Batching optimizations of one sort or another are
used by each of the counting algorithms listed in Table 5.1.
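The batching idea can be illustrated in isolation. The following sketch is not one of the chapter's algorithms: the batch size BATCH, struct local_batch, and batch_add() are hypothetical names, and a C11 atomic add stands in for the expensive synchronization operation.

```c
#include <assert.h>
#include <stdatomic.h>

#define BATCH 64  /* hypothetical batch size */

static atomic_ulong global_total;  /* shared, expensive to update */

struct local_batch {
	unsigned long pending;  /* thread-private, cheap to update */
	unsigned long flushes;  /* number of expensive operations performed */
};

/* Accumulate locally, touching shared state once per BATCH counts. */
static void batch_add(struct local_batch *lb, unsigned long delta)
{
	lb->pending += delta;
	if (lb->pending >= BATCH) {
		atomic_fetch_add(&global_total, lb->pending);
		lb->pending = 0;
		lb->flushes++;
	}
}
```

With unit increments, 1,000 calls to batch_add() perform only 15 atomic operations. The price is that global_total can lag the true value by up to BATCH − 1 per thread.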
Finally, the eventually consistent statistical counter discussed in Section 5.2.4 showed
how deferring activity (in that case, updating the global counter) can provide substantial
performance and scalability benefits. This approach allows common case code to use
much cheaper synchronization operations than would otherwise be possible. Chapter 9
will examine a number of additional ways that deferral can improve performance,
scalability, and even real-time response.
Summarizing the summary:
1. Partitioning promotes performance and scalability.
2. Partial partitioning, that is, partitioning applied only to common code paths, works
almost as well.
3. Partial partitioning can be applied to code (as in Section 5.2’s statistical counters’
partitioned updates and non-partitioned reads), but also across time (as in
Section 5.3’s and Section 5.4’s limit counters running fast when far from the limit, but
slowly when close to the limit).
4. Partitioning across time often batches updates locally in order to reduce the number
of expensive global operations, thereby decreasing synchronization overhead, in
turn improving performance and scalability. All the algorithms shown in Table 5.1
make heavy use of batching.
5. Read-only code paths should remain read-only: Spurious synchronization writes
to shared memory kill performance and scalability, as seen in the count_end.c
row of Table 5.1.
6. Judicious use of delay promotes performance and scalability, as seen in
Section 5.2.4.
7. Parallel performance and scalability is usually a balancing act: Beyond a certain
point, optimizing some code paths will degrade others. The count_stat.c and
count_end_rcu.c rows of Table 5.1 illustrate this point.
8. Different levels of performance and scalability will affect algorithm and
data-structure design, as do a large number of other factors. Figure 5.1 illustrates this
point: Atomic increment might be completely acceptable for a two-CPU system,
but nevertheless be completely inadequate for an eight-CPU system.
[Figure 5.8: the “Work Partitioning”, “Parallel Access Control”, “Resource Partitioning and Replication”, and “Interacting With Hardware” bubbles, annotated with the “Batch”, “Weaken”, and “Partition” optimizations.]
Summarizing still further, we have the “big three” methods of increasing performance
and scalability, namely (1) partitioning over CPUs or threads, (2) batching so that more
work can be done by each expensive synchronization operation, and (3) weakening
synchronization operations where feasible. As a rough rule of thumb, you should
apply these methods in this order, as was noted earlier in the discussion of Figure 2.6
on page 22. The partitioning optimization applies to the “Resource Partitioning and
Replication” bubble, the batching optimization to the “Work Partitioning” bubble,
and the weakening optimization to the “Parallel Access Control” bubble, as shown
in Figure 5.8. Of course, if you are using special-purpose hardware such as digital
signal processors (DSPs), field-programmable gate arrays (FPGAs), or general-purpose
graphical processing units (GPGPUs), you may need to pay close attention to the
“Interacting With Hardware” bubble throughout the design process. For example, the
structure of a GPGPU’s hardware threads and memory connectivity might richly reward
very careful partitioning and batching design decisions.
In short, as noted at the beginning of this chapter, the simplicity of counting has
allowed us to explore many fundamental concurrency issues without the distraction of
complex synchronization primitives or elaborate data structures. Such synchronization
primitives and data structures are covered in later chapters.
Divide and rule.
Philip II of Macedon
Chapter 6
Partitioning and Synchronization Design
This chapter describes how to design software to take advantage of modern commodity
multicore systems by using idioms, or “design patterns” [Ale79, GHJV95, SSRB00], to
balance performance, scalability, and response time. Correctly partitioned problems lead
to simple, scalable, and high-performance solutions, while poorly partitioned problems
result in slow and complex solutions. This chapter will help you design partitioning
into your code, with some discussion of batching and weakening as well. The word
“design” is very important: You should partition first, batch second, weaken third, and
code fourth. Changing this order often leads to poor performance and scalability along
with great frustration.1
This chapter will also look at some specific problems, including:
1. Constraints on the classic Dining Philosophers problem requiring that all the
philosophers be able to dine concurrently.
2. Lock-based double-ended queue implementations that provide concurrency between
operations on both ends of a given queue when there are many elements in the
queue, but still work correctly when the queue contains only a few elements. (Or,
for that matter, no elements.)
3. Summarizing the rough quality of a concurrent algorithm with only a few numbers.
[Figure: the Dining Philosophers problem, with philosophers P1 through P5 seated around a table.]
Although partitioning is more widely understood than it was in the early 2000s, its value
is still underappreciated. Section 6.1.1 therefore takes a more highly parallel look at the
classic Dining Philosophers problem and Section 6.1.2 revisits the double-ended queue.
[Figure 6.3: the Dining Philosophers problem with numbered forks, showing philosophers P1 through P5 around the table, with fork 5 between P5 and P1, fork 1 between P1 and P2, fork 2 between P2 and P3, and fork 4 between P4 and P5.]
1990s.3 More recent solutions number the forks as shown in Figure 6.3. Each philosopher
picks up the lowest-numbered fork next to his or her plate, then picks up the other fork.
The philosopher sitting in the uppermost position in the diagram thus picks up the
leftmost fork first, then the rightmost fork, while the rest of the philosophers instead
pick up their rightmost fork first. Because two of the philosophers will attempt to pick
up fork 1 first, and because only one of those two philosophers will succeed, there will
be five forks available to four philosophers. At least one of these four will have two
forks, and will thus be able to eat.
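The lowest-numbered-fork-first rule might be sketched as follows. The seating is an assumption read from Figure 6.3: philosopher p sits between fork p and fork p − 1, with fork 5 taking the place of fork 0 for philosopher 1. One POSIX mutex per fork stands in for the forks, and the function names are illustrative.

```c
#include <assert.h>
#include <pthread.h>

#define NPHIL 5  /* philosophers and forks are both numbered 1..NPHIL */

/* One lock per fork; slot 0 is unused so that indexes match fork numbers. */
static pthread_mutex_t fork_lock[NPHIL + 1] = {
	PTHREAD_MUTEX_INITIALIZER, PTHREAD_MUTEX_INITIALIZER,
	PTHREAD_MUTEX_INITIALIZER, PTHREAD_MUTEX_INITIALIZER,
	PTHREAD_MUTEX_INITIALIZER, PTHREAD_MUTEX_INITIALIZER,
};

/* Compute the order in which philosopher p must pick up the two forks. */
static void fork_order(int p, int *first, int *second)
{
	int a = p;
	int b = (p == 1) ? NPHIL : p - 1;

	assert(p >= 1 && p <= NPHIL);
	*first = a < b ? a : b;   /* lowest-numbered fork first */
	*second = a < b ? b : a;
}

static void pick_up_forks(int p)
{
	int first, second;

	fork_order(p, &first, &second);
	pthread_mutex_lock(&fork_lock[first]);
	pthread_mutex_lock(&fork_lock[second]);
}

static void put_down_forks(int p)
{
	int first, second;

	fork_order(p, &first, &second);
	pthread_mutex_unlock(&fork_lock[second]);
	pthread_mutex_unlock(&fork_lock[first]);
}
```

Because every philosopher acquires locks in increasing fork number, no cycle of waiting philosophers can form. Philosophers 1 and 2 both contend for fork 1 first, as described above.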
This general technique of numbering resources and acquiring them in numerical
order is heavily used as a deadlock-prevention technique. However, it is easy to imagine
a sequence of events that will result in only one philosopher eating at a time even though
all are hungry:
In short, this algorithm can result in only one philosopher eating at a given time, even
when all five philosophers are hungry, despite the fact that there are more than enough
forks for two philosophers to eat concurrently. It should be possible to do better than
this!
One approach is shown in Figure 6.4, which includes four philosophers rather than five
to better illustrate the partition technique. Here the upper and rightmost philosophers
share a pair of forks, while the lower and leftmost philosophers share another pair of
forks. If all philosophers are simultaneously hungry, at least two will always be able to
eat concurrently. In addition, as shown in the figure, the forks can now be bundled so
that the pair are picked up and put down simultaneously, simplifying the acquisition and
release algorithms.
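A rough sketch of the partitioned approach follows, assuming the numbering P1 (upper), P2 (rightmost), P3 (lower), and P4 (leftmost). Each bundled pair of forks is represented by a single C11 atomic_flag used as a try-lock. The names try_pick_up() and put_down() are illustrative.

```c
#include <assert.h>
#include <stdatomic.h>

#define NPHIL 4  /* philosophers are numbered 1..NPHIL */

/* One flag per bundled pair of forks: set means the pair is in use. */
static atomic_flag fork_bundle[NPHIL / 2] = {
	ATOMIC_FLAG_INIT, ATOMIC_FLAG_INIT,
};

/* P1 and P2 share bundle 0, while P3 and P4 share bundle 1. */
static int bundle_of(int p)
{
	assert(p >= 1 && p <= NPHIL);
	return (p - 1) / 2;
}

/* Returns 1 if philosopher p obtained both forks, 0 if the pair is busy. */
static int try_pick_up(int p)
{
	return !atomic_flag_test_and_set(&fork_bundle[bundle_of(p)]);
}

/* Release both forks at once. */
static void put_down(int p)
{
	atomic_flag_clear(&fork_bundle[bundle_of(p)]);
}
```

Acquiring a bundle is a single operation, so there is no lock ordering to get wrong. The two bundles are independent, so one philosopher from each pair can eat at the same time.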
Quick Quiz 6.1: Is there a better solution to the Dining Philosophers Problem?
Quick Quiz 6.2: How would you validate an algorithm alleged to solve the Dining
Philosophers Problem?
3 It is all too easy to denigrate Dijkstra from the viewpoint of the year 2021, more than
50 years after the fact. If you still feel the need to denigrate Dijkstra, my advice is to publish
something, wait 50 years, and then see how well your ideas stood the test of time.
[Figure 6.4: the partitioned Dining Philosophers problem, with four philosophers P1 (upper), P2 (rightmost), P3 (lower), and P4 (leftmost).]
This suite includes both sequential and concurrent tests. Although this suite is good
and sufficient for textbook code, you should test considerably more thoroughly for code
intended for production use. Chapters 11 and 12 cover a large array of validation tools
and techniques.
But with a prototype test suite in place, we are ready to look at the double-ended-queue
algorithms in the next sections.
[Figure: a double-ended queue with left- and right-hand locks (Lock L, Lock R) and headers (Header L, Header R), shown holding from zero to four elements (0 through 3).]
[Figure: a compound double-ended queue built from two queues, DEQ L and DEQ R, protected by Lock L and Lock R.]
[Figure 6.7: hashed double-ended queue in its initial state, with Lock L and Index L on the left and Lock R and Index R on the right.]
to avoid deadlock, for example, always acquiring the left-hand lock before acquiring
the right-hand lock. This will be much simpler than applying two locks to the same
double-ended queue, as we can unconditionally left-enqueue elements to the left-hand
queue and right-enqueue elements to the right-hand queue. The main complication
arises when dequeuing from an empty queue, in which case it is necessary to:
1. If holding the right-hand lock, release it and acquire the left-hand lock.
Quick Quiz 6.4: In this compound double-ended queue implementation, what should be done
if the queue has become non-empty while releasing and reacquiring the lock?
[Figure 6.8: hashed double-ended queue after insertions, each state shown with Index L and Index R. Top: the single element R1. Middle (after “Enq 3R”): R4 R1 R2 R3. Bottom (after “Enq 3L1R”): R4 R5 R2 R3 above L0 R1 L−2 L−1.]
[Figure 6.9: hashed double-ended queue with 16 elements; the rows recoverable here are R4 R5 R6 R7 and L0 R1 R2 R3.]
that deadlock is avoided by acquiring the index locks before the chain locks, and by
never acquiring more than one lock of a given type (index or chain) at a time.
Each hash chain is itself a double-ended queue, and in this example, each holds
every fourth element. The uppermost portion of Figure 6.8 shows the state after a
single element (“R1”) has been right-enqueued, with the right-hand index having been
incremented to reference hash chain 2. The middle portion of this same figure shows
the state after three more elements have been right-enqueued. As you can see, the
indexes are back to their initial states (see Figure 6.7); however, each hash chain is
now non-empty. The lower portion of this figure shows the state after three additional
elements have been left-enqueued and an additional element has been right-enqueued.
From the last state shown in Figure 6.8, a left-dequeue operation would return element
“L−2” and leave the left-hand index referencing hash chain 2, which would then contain
only a single element (“R2”). In this state, a left-enqueue running concurrently with a
right-enqueue would result in lock contention, but the probability of such contention
can be reduced to arbitrarily low levels by using a larger hash table.
Figure 6.9 shows how 16 elements would be organized in a four-hash-bucket parallel
double-ended queue. Each underlying single-lock double-ended queue holds a one-
quarter slice of the full parallel double-ended queue.
Listing 6.1 shows the corresponding C-language data structure, assuming an existing
struct deq that provides a trivially locked double-ended-queue implementation. This
data structure contains the left-hand lock on line 2, the left-hand index on line 3,
the right-hand lock on line 4 (which is cache-aligned in the actual implementation),
the right-hand index on line 5, and, finally, the hashed array of simple lock-based
double-ended queues on line 6. A high-performance implementation would of course
use padding or special alignment directives to avoid false sharing.
Listing 6.2 (lockhdeq.c) shows the implementation of the enqueue and dequeue
functions.4 Discussion will focus on the left-hand operations, as the right-hand
operations are trivially derived from them.
Lines 1–13 show pdeq_pop_l(), which left-dequeues and returns an element if
possible, returning NULL otherwise. Line 6 acquires the left-hand spinlock, and line 7
computes the index to be dequeued from. Line 8 dequeues the element, and, if line 9
finds the result to be non-NULL, line 10 records the new left-hand index. Either way,
line 11 releases the lock, and, finally, line 12 returns the element if there was one, or
NULL otherwise.
4 One could easily create a polymorphic implementation in any number of languages,
Lines 29–38 show pdeq_push_l(), which left-enqueues the specified element.
Line 33 acquires the left-hand lock, and line 34 picks up the left-hand index. Line 35
left-enqueues the specified element onto the double-ended queue indexed by the left-hand
index. Line 36 then updates the left-hand index and line 37 releases the lock.
As noted earlier, the right-hand operations are completely analogous to their left-
handed counterparts, so their analysis is left as an exercise for the reader.
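The operations described above can be sketched as follows. This is not the book's lockhdeq.c, but a minimal stand-in written under stated assumptions: POSIX mutexes replace the spinlocks, the element type struct elem and the helpers deq_push() and deq_pop() are hypothetical, and the initial index values (0 on the left, 1 on the right) are inferred from the description of Figure 6.8. Each operation acquires one index lock and then one chain lock, which is the ordering that the text says avoids deadlock.

```c
#include <pthread.h>
#include <stddef.h>

#define PDEQ_N_BKTS 4	/* number of hash chains (arbitrary) */

struct elem { struct elem *prev, *next; int val; };	/* hypothetical */

struct deq {			/* trivially locked double-ended queue */
	pthread_mutex_t lock;
	struct elem head;	/* sentinel of a circular list */
};

struct pdeq {
	pthread_mutex_t llock;	/* guards lidx */
	int lidx;
	pthread_mutex_t rlock;	/* guards ridx */
	int ridx;
	struct deq bkt[PDEQ_N_BKTS];
};

static int moveleft(int i)  { return (i + PDEQ_N_BKTS - 1) % PDEQ_N_BKTS; }
static int moveright(int i) { return (i + 1) % PDEQ_N_BKTS; }

void pdeq_init(struct pdeq *d)
{
	pthread_mutex_init(&d->llock, NULL);
	pthread_mutex_init(&d->rlock, NULL);
	d->lidx = 0;
	d->ridx = 1;
	for (int i = 0; i < PDEQ_N_BKTS; i++) {
		pthread_mutex_init(&d->bkt[i].lock, NULL);
		d->bkt[i].head.prev = d->bkt[i].head.next = &d->bkt[i].head;
	}
}

/* Link e in at the left end (left != 0) or the right end of q. */
static void deq_push(struct deq *q, struct elem *e, int left)
{
	pthread_mutex_lock(&q->lock);
	struct elem *after = left ? &q->head : q->head.prev;
	e->prev = after;
	e->next = after->next;
	after->next->prev = e;
	after->next = e;
	pthread_mutex_unlock(&q->lock);
}

/* Unlink and return the leftmost or rightmost element, NULL if empty. */
static struct elem *deq_pop(struct deq *q, int left)
{
	pthread_mutex_lock(&q->lock);
	struct elem *e = left ? q->head.next : q->head.prev;
	if (e == &q->head) {
		e = NULL;
	} else {
		e->prev->next = e->next;
		e->next->prev = e->prev;
	}
	pthread_mutex_unlock(&q->lock);
	return e;
}

struct elem *pdeq_pop_l(struct pdeq *d)
{
	pthread_mutex_lock(&d->llock);
	int i = moveright(d->lidx);	/* chain holding leftmost element */
	struct elem *e = deq_pop(&d->bkt[i], 1);
	if (e != NULL)
		d->lidx = i;		/* record new left-hand index */
	pthread_mutex_unlock(&d->llock);
	return e;
}

void pdeq_push_l(struct elem *e, struct pdeq *d)
{
	pthread_mutex_lock(&d->llock);
	deq_push(&d->bkt[d->lidx], e, 1);
	d->lidx = moveleft(d->lidx);
	pthread_mutex_unlock(&d->llock);
}

struct elem *pdeq_pop_r(struct pdeq *d)
{
	pthread_mutex_lock(&d->rlock);
	int i = moveleft(d->ridx);
	struct elem *e = deq_pop(&d->bkt[i], 0);
	if (e != NULL)
		d->ridx = i;
	pthread_mutex_unlock(&d->rlock);
	return e;
}

void pdeq_push_r(struct elem *e, struct pdeq *d)
{
	pthread_mutex_lock(&d->rlock);
	deq_push(&d->bkt[d->ridx], e, 0);
	d->ridx = moveright(d->ridx);
	pthread_mutex_unlock(&d->rlock);
}
```

Replaying the sequence of Figure 6.8 against this sketch (four right-enqueues, three left-enqueues, one more right-enqueue) leaves the left-hand index at 1, so that a left-dequeue returns the most recently left-enqueued element and moves the left-hand index to chain 2, as the text describes.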
Quick Quiz 6.5: Is the hashed double-ended queue a good solution? Why or why not?
Quick Quiz 6.9: Surely the left-hand lock must sometimes be available!!! So why is it
necessary that line 25 of Listing 6.3 unconditionally release the right-hand lock?
(DCAS) instructions are not needed for lock-free implementations of double-ended queues.
Instead, the common compare-and-swap (e.g., x86 cmpxchg) suffices.
6 In short, a linearization point is a single point within a given function where that
function can be said to have taken effect. In this lock-based implementation, the linearization
points can be said to be anywhere within the critical section that does the work.
7 Nir Shavit produced relaxed stacks for roughly the same reasons [Sha11]. This situation
leads some to believe that the linearization points are useful to theorists rather than developers,
and leads others to wonder to what extent the designers of such data structures and algorithms
were considering the needs of their users.
the data used by your concurrent program through a single queue, you really need to
rethink your overall design.
Quick Quiz 6.13: Is there a significantly better way of handling concurrency for double-ended
queues?
These two examples show just how powerful partitioning can be in devising parallel
algorithms. Section 6.3.5 looks briefly at a third example, matrix multiply. However, all
three of these examples beg for more and better design criteria for parallel programs, a
topic taken up in the next section.
One way to obtain the best performance and scalability is to simply hack away until
you converge on the best possible parallel program. Unfortunately, if your program is
other than microscopically tiny, the space of possible parallel programs is so huge that
convergence is not guaranteed in the lifetime of the universe. Besides, what exactly is
the “best possible parallel program”? After all, Section 2.2 called out no fewer than
three parallel-programming goals of performance, productivity, and generality, and
the best possible performance will likely come at a cost in terms of productivity and
generality. We clearly need to be able to make higher-level choices at design time in
order to arrive at an acceptably good parallel program before that program becomes
obsolete.
However, more detailed design criteria are required to actually produce a real-world
design, a task taken up in this section. This being the real world, these criteria often
conflict to a greater or lesser degree, requiring that the designer carefully balance the
resulting tradeoffs.
As such, these criteria may be thought of as the “forces” acting on the design, with
particularly good tradeoffs between these forces being called “design patterns” [Ale79,
GHJV95].
The design criteria for attaining the three parallel-programming goals are speedup,
contention, overhead, read-to-write ratio, and complexity:
6.2. DESIGN CRITERIA 127
Contention: If more CPUs are applied to a parallel program than can be kept busy by
that program, the excess CPUs are prevented from doing useful work by contention.
This may be lock contention, memory contention, or a host of other performance
killers.
Work-to-Synchronization Ratio: A uniprocessor, single-threaded, non-preemptible,
and non-interruptible8 version of a given parallel program would not need any
synchronization primitives. Therefore, any time consumed by these primitives (in-
cluding communication cache misses as well as message latency, locking primitives,
atomic instructions, and memory barriers) is overhead that does not contribute
directly to the useful work that the program is intended to accomplish. Note that
the important measure is the relationship between the synchronization overhead
and the overhead of the code in the critical section, with larger critical sections
able to tolerate greater synchronization overhead. The work-to-synchronization
ratio is related to the notion of synchronization efficiency.
Read-to-Write Ratio: A data structure that is rarely updated may often be replicated
rather than partitioned, and furthermore may be protected with asymmetric
synchronization primitives that reduce readers’ synchronization overhead at the
expense of that of writers, thereby reducing overall synchronization overhead.
Corresponding optimizations are possible for frequently updated data structures,
as discussed in Chapter 5.
Complexity: A parallel program is more complex than an equivalent sequential program
because the parallel program has a much larger state space than does the sequential
program, although large state spaces having regular structures can in some cases
be easily understood. A parallel programmer must consider synchronization
primitives, messaging, locking design, critical-section identification, and deadlock
in the context of this larger state space.
This greater complexity often translates to higher development and maintenance
costs. Therefore, budgetary constraints can limit the number and types of modi-
fications made to an existing program, since a given degree of speedup is worth
only so much time and trouble. Worse yet, added complexity can actually reduce
performance and scalability.
Therefore, beyond a certain point, there may be potential sequential optimizations
that are cheaper and more effective than parallelization. As noted in Section 2.2.1,
parallelization is but one performance optimization of many, and is furthermore an
optimization that applies most readily to CPU-based bottlenecks.
These criteria will act together to enforce a maximum speedup. The first three criteria are
deeply interrelated, so the remainder of this section analyzes these interrelationships.9
Note that these criteria may also appear as part of the requirements specification,
and further that they are one solution to the problem of summarizing the quality of
a concurrent algorithm from page 113. For example, speedup may act as a relative
desideratum (“the faster, the better”) or as an absolute requirement of the workload (“the
system must support at least 1,000,000 web hits per second”). Classic design pattern
languages describe relative desiderata as forces and absolute requirements as context.
An understanding of the relationships between these design criteria can be very
helpful when identifying appropriate design tradeoffs for a parallel program.
1. The less time a program spends in exclusive-lock critical sections, the greater the
potential speedup. This is a consequence of Amdahl’s Law [Amd67] because only
one CPU may execute within a given exclusive-lock critical section at a given time.
More specifically, for unbounded linear scalability, the fraction of time that the
program spends in a given exclusive critical section must decrease as the number
of CPUs increases. For example, a program will not scale to 10 CPUs unless it
spends much less than one tenth of its time in the most-restrictive exclusive-lock
critical section.
2. Contention effects consume the excess CPU and/or wallclock time when the actual
speedup is less than the number of available CPUs. The larger the gap between the
number of CPUs and the actual speedup, the less efficiently the CPUs will be used.
Similarly, the greater the desired efficiency, the smaller the achievable speedup.
4. If the critical sections have high overhead compared to the primitives guarding
them, the best way to improve speedup is to increase parallelism by moving to
reader/writer locking, data locking, asymmetric primitives, or data ownership.
5. If the critical sections have high overhead compared to the primitives guarding them
and the data structure being guarded is read much more often than modified, the
best way to increase parallelism is to move to reader/writer locking or asymmetric
primitives.
6. Many changes that improve SMP performance, for example, reducing lock con-
tention, also improve real-time latencies [McK05c].
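The first item in this list can be made concrete with a little arithmetic. The following sketch evaluates the usual statement of Amdahl's Law, speedup = 1/(s + (1 − s)/n), where s is the fraction of single-CPU execution time spent in the most-restrictive exclusive-lock critical section and n is the number of CPUs. The function name amdahl_speedup() is hypothetical, and the model ignores contention and synchronization overhead, so it gives an upper bound rather than a prediction.

```c
/*
 * Upper bound on speedup from Amdahl's Law.
 * s: fraction of single-CPU time that is serialized (0 <= s <= 1).
 * n: number of CPUs (n >= 1).
 */
double amdahl_speedup(double s, int n)
{
	return 1.0 / (s + (1.0 - s) / n);
}
```

For example, a program spending exactly one tenth of its time in such a critical section gets a speedup of at most about 5.26 on 10 CPUs, which is why the fraction must be much less than one tenth for the program to scale to 10 CPUs.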
Quick Quiz 6.14: Don’t all these problems with critical sections mean that we should just
always use non-blocking synchronization [Her90], which doesn’t have critical sections?
It is worth reiterating that contention has many guises, including lock contention,
memory contention, cache overflow, thermal throttling, and much else besides. This
chapter looks primarily at lock and memory contention.
6.3. SYNCHRONIZATION GRANULARITY 129
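[Figure 6.10: levels of synchronization granularity. Sequential Program, Code Locking,
Data Locking, and Data Ownership, with Partition/Batch transitions between Sequential
Program and Code Locking and between Code Locking and Data Locking, and Own/Disown
transitions between Data Locking and Data Ownership.]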
Figure 6.10 gives a pictorial view of different levels of synchronization granularity, each
of which is described in one of the following sections. These sections focus primarily
on locking, but similar granularity issues arise with all forms of synchronization.
10 This plot shows clock frequencies for newer CPUs theoretically capable of retiring
one or more instructions per clock, and MIPS for older CPUs requiring multiple clocks to
execute even the simplest instruction. The reason for taking this approach is that the newer
CPUs’ ability to retire multiple instructions per clock is typically limited by memory-system
performance.
[Figure: log-scale plot of CPU clock frequency / MIPS versus year, 1975 to 2020.]
[Figure: log-scale plot of relative performance versus year, 1970 to 2020, including an
Ethernet trace.]
simplicity of the hash-table lookup code in Listing 6.4 underscores this point.11 A key
point is that speedups due to parallelism are normally limited to the number of CPUs.
In contrast, speedups due to sequential optimizations, for example, careful choice of
data structure, can be arbitrarily large.
Quick Quiz 6.15: What should you do to validate a hash table?
On the other hand, if you are not in this happy situation, read on!
11 The examples in this section are taken from Hart et al. [HMB06], adapted for clarity by
with synchronized instances, you are instead using “data locking”, described in Section 6.3.3.
access that hash table. Another way of looking at this is that hash_lock is partitioning
time, thus giving each requesting CPU its own partition of time during which it owns this
hash table. In addition, in a well-designed algorithm, there should be ample partitions
of time during which no CPU owns this hash table.
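A minimal sketch of code locking applied to a hash table follows. Only the name hash_lock comes from the text; the node layout, the bucket count, the use of a POSIX mutex, and the function names are assumptions made for illustration. Every access to the table, regardless of bucket, is bracketed by the single global lock, so at any given time at most one thread owns the entire table.

```c
#include <pthread.h>
#include <stddef.h>

#define NBUCKETS 16	/* hypothetical table size */

struct node {
	unsigned long key;
	struct node *next;
};

/* One global lock guards the whole table: code locking. */
static pthread_mutex_t hash_lock = PTHREAD_MUTEX_INITIALIZER;
static struct node *buckets[NBUCKETS];

void hash_insert(struct node *n)
{
	pthread_mutex_lock(&hash_lock);
	n->next = buckets[n->key % NBUCKETS];
	buckets[n->key % NBUCKETS] = n;
	pthread_mutex_unlock(&hash_lock);
}

/* Return 1 if key is present, 0 otherwise. */
int hash_search(unsigned long key)
{
	int found = 0;

	pthread_mutex_lock(&hash_lock);
	for (struct node *p = buckets[key % NBUCKETS]; p != NULL; p = p->next) {
		if (p->key == key) {
			found = 1;
			break;
		}
	}
	pthread_mutex_unlock(&hash_lock);
	return found;
}
```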
Quick Quiz 6.16: “Partitioning time”? Isn’t that an odd turn of phrase?
always translates into increased performance and scalability. For this reason, data
locking was heavily used by Sequent in its kernels [BK85, Inm85, Gar90, Dov90, MD92,
MG92, MS93].
Another way of looking at this is to think of each ->bucket_lock as mediating
ownership not of the entire hash table as was done for code locking, but only for the
bucket corresponding to that ->bucket_lock. Each lock still partitions time, but
the per-bucket-locking technique also partitions the address space, so that the overall
technique can be said to partition spacetime. If the number of buckets is large enough,
this partitioning of space should with high probability permit a given CPU immediate
access to a given hash bucket.
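The per-bucket approach might be sketched as follows. The field name bucket_lock follows the text; the rest (POSIX mutexes, the bucket count, the node layout, and the function names) is assumed for illustration. Because each bucket carries its own lock, two threads contend only when their keys hash to the same bucket.

```c
#include <pthread.h>
#include <stddef.h>

#define NBUCKETS 16	/* hypothetical table size */

struct node {
	unsigned long key;
	struct node *next;
};

struct bucket {
	pthread_mutex_t bucket_lock;	/* guards only this bucket's list */
	struct node *list_head;
};

/* Statically allocated, so the locks themselves are never freed. */
static struct bucket table[NBUCKETS];

void hash_init(void)
{
	for (int i = 0; i < NBUCKETS; i++) {
		pthread_mutex_init(&table[i].bucket_lock, NULL);
		table[i].list_head = NULL;
	}
}

void hash_insert(struct node *n)
{
	struct bucket *bp = &table[n->key % NBUCKETS];

	pthread_mutex_lock(&bp->bucket_lock);
	n->next = bp->list_head;
	bp->list_head = n;
	pthread_mutex_unlock(&bp->bucket_lock);
}

/* Return 1 if key is present, 0 otherwise. */
int hash_search(unsigned long key)
{
	struct bucket *bp = &table[key % NBUCKETS];
	int found = 0;

	pthread_mutex_lock(&bp->bucket_lock);
	for (struct node *p = bp->list_head; p != NULL; p = p->next) {
		if (p->key == key) {
			found = 1;
			break;
		}
	}
	pthread_mutex_unlock(&bp->bucket_lock);
	return found;
}
```

A performance-oriented version would likely also pad or align struct bucket so that adjacent locks do not share a cache line.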
However, as those who have taken care of small children can again attest, even
providing enough to go around is no guarantee of tranquillity. The analogous situation
can arise in SMP programs. For example, the Linux kernel maintains a cache of files
and directories (called “dcache”). Each entry in this cache has its own lock, but the
entries corresponding to the root directory and its direct descendants are much more
likely to be traversed than are more obscure entries. This can result in many CPUs
contending for the locks of these popular entries, resulting in a situation not unlike that
shown in Figure 6.15.
In many cases, algorithms can be designed to reduce the incidence of data skew, and in
some cases eliminate it entirely (for example, in the Linux kernel’s dcache [MSS04,
Cor10a, Bro15a, Bro15b, Bro15c]). Data locking is often used for partitionable data
structures such as hash tables, as well as in situations where multiple entities are each
represented by an instance of a given data structure. The Linux-kernel task list is an
example of the latter, each task structure having its own alloc_lock and pi_lock.
A key challenge with data locking on dynamically allocated structures is ensuring
that the structure remains in existence while the lock is being acquired [GKAS99]. The
code in Listing 6.6 finesses this challenge by placing the locks in the statically allocated
hash buckets, which are never freed. However, this trick would not work if the hash
table were resizeable, so that the locks were now dynamically allocated. In this case,
there would need to be some means to prevent the hash bucket from being freed during
the time that its lock was being acquired.
Quick Quiz 6.17: What are some ways of preventing a structure from being freed while its
lock is being acquired?
can be as simple as the sequential-program case shown in Listing 6.4. Such situations
are often referred to as “embarrassingly parallel”, and, in the best case, resemble the
situation previously shown in Figure 6.14.
Another important instance of data ownership occurs when the data is read-only, in
which case, all threads can “own” it via replication.
Where data locking partitions both the address space (with one hash bucket per
partition) and time (using per-bucket locks), data ownership partitions only the address
space. Data ownership need not partition time because a given thread or CPU is
assigned permanent ownership of a given address-space partition.
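To illustrate, here is a small sketch of data ownership in which keys are statically assigned to threads by a hash, and each thread touches only its own partition. All names here (NPART, struct partition, key_owner(), owned_insert(), owned_search()) are hypothetical. No locks or atomic operations appear because, by construction, no partition is ever accessed by more than one thread; the caller is responsible for passing its own thread index.

```c
#define NPART 4		/* one partition per thread (assumed) */
#define PART_SIZE 8	/* capacity of each partition (assumed) */

struct partition {
	unsigned long keys[PART_SIZE];
	int n;
};

/* parts[i] is owned by thread i and is never touched by any other thread. */
static struct partition parts[NPART];

/* Map a key to the index of the thread that owns it. */
int key_owner(unsigned long key)
{
	return (int)(key % NPART);
}

/* Called only by thread `me`.  Returns 1 on success, 0 otherwise. */
int owned_insert(int me, unsigned long key)
{
	struct partition *pp = &parts[me];

	if (key_owner(key) != me || pp->n >= PART_SIZE)
		return 0;	/* not ours, or partition full */
	pp->keys[pp->n++] = key;
	return 1;
}

/* Called only by thread `me`.  Returns 1 if key is present, 0 otherwise. */
int owned_search(int me, unsigned long key)
{
	struct partition *pp = &parts[me];

	for (int i = 0; i < pp->n; i++)
		if (pp->keys[i] == key)
			return 1;
	return 0;
}
```

A thread needing a key owned by another thread would have to hand the request to the owner, for example by message passing, which this sketch does not attempt to show.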
Quick Quiz 6.18: But won’t system boot and shutdown (or application startup and shutdown)
be partitioning time, even for data ownership?
$$\lambda = n \lambda_0 \tag{6.1}$$
Here, $n$ is the number of CPUs and $\lambda_0$ is the transaction-processing capability of a
single CPU. Note that the expected time for a single CPU to execute a single transaction
in the absence of contention is $1/\lambda_0$.
13 Of course, if there are 8 CPUs all incrementing the same shared variable, then each
CPU must wait at least 35 nanoseconds for the other seven CPUs to each do their
5-nanosecond increments before consuming an additional 5 nanoseconds doing its own
increment. In fact, the wait will be longer due to the need to move the variable from one
CPU to another.
[Figure 6.16: Synchronization Efficiency. Plots synchronization efficiency (0.1 to 1)
against the number of CPUs (threads), 10 to 100, with traces labeled 10, 25, 50, 75,
and 100.]
Because the CPUs have to “wait in line” behind each other to get their chance to
increment the single shared variable, we can use the M/M/1 queueing-model expression
for the expected total waiting time:
$$T = \frac{1}{\mu - \lambda} \tag{6.2}$$
Substituting the above value of $\lambda$:
$$T = \frac{1}{\mu - n \lambda_0} \tag{6.3}$$
Now, the efficiency is just the ratio of the time required to process a transaction
in absence of synchronization ($1/\lambda_0$) to the time required including synchronization
($T + 1/\lambda_0$):
$$e = \frac{1/\lambda_0}{T + 1/\lambda_0} \tag{6.4}$$
Substituting the above value for $T$ and simplifying:
$$e = \frac{\frac{\mu}{\lambda_0} - n}{\frac{\mu}{\lambda_0} - (n - 1)} \tag{6.5}$$
But the value of $\mu/\lambda_0$ is just the ratio of the time required to process the transaction
(absent synchronization overhead) to that of the synchronization overhead itself (absent
contention). If we call this ratio $f$, we have:
$$e = \frac{f - n}{f - (n - 1)} \tag{6.6}$$
Figure 6.16 plots the synchronization efficiency 𝑒 as a function of the number of
CPUs/threads 𝑛 for a few values of the overhead ratio 𝑓 . For example, again using the
5-nanosecond atomic increment, the 𝑓 = 10 line corresponds to each CPU attempting
an atomic increment every 50 nanoseconds, and the 𝑓 = 100 line corresponds to
each CPU attempting an atomic increment every 500 nanoseconds, which in turn
corresponds to some hundreds (perhaps thousands) of instructions. Given that each trace
drops off sharply with increasing numbers of CPUs or threads, we can conclude that
synchronization mechanisms based on atomic manipulation of a single global shared
variable will not scale well if used heavily on current commodity hardware. This is an
abstract mathematical depiction of the forces leading to the parallel counting algorithms
that were discussed in Chapter 5. Your real-world mileage may differ.
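Equation 6.6 is simple enough to evaluate directly, as in the sketch below. The function name sync_efficiency() is hypothetical. The M/M/1 model only applies while the offered load stays below the service rate, which here means n < f, so the sketch returns zero once n reaches f; that clamp is an assumption of this sketch, not something taken from the text.

```c
/*
 * Synchronization efficiency from Equation 6.6:
 *   e = (f - n) / (f - (n - 1))
 * f: ratio of transaction-processing time to synchronization overhead.
 * n: number of CPUs or threads.
 */
double sync_efficiency(double f, int n)
{
	if (n >= f)
		return 0.0;	/* saturated: outside the model's range (assumed clamp) */
	return (f - n) / (f - (n - 1));
}
```

With f = 10, a single CPU already runs at an efficiency of 0.9, and nine CPUs run at 0.5. With f = 100, ten CPUs run at 90/91, or about 0.989.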
Nevertheless, the concept of efficiency is useful, and even in cases having little or no
formal synchronization. Consider for example a matrix multiply, in which the columns
of one matrix are multiplied (via “dot product”) by the rows of another, resulting in
an entry in a third matrix. Because none of these operations conflict, it is possible to
partition the columns of the first matrix among a group of threads, with each thread
computing the corresponding columns of the result matrix. The threads can therefore
operate entirely independently, with no synchronization overhead whatsoever, as is done
in matmul.c. One might therefore expect a perfect efficiency of 1.0.
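A sketch of this kind of partitioning appears below. It is not matmul.c; the names mm_worker(), matmul_parallel(), and MAXTHREADS are hypothetical, and it assumes square matrices stored in row-major order with the convention c = a × b. Each thread computes a contiguous range of columns of the result, reads the inputs without modifying them, and writes only its own columns, so no locks are needed.

```c
#include <pthread.h>

#define MAXTHREADS 16	/* arbitrary limit for this sketch */

struct mm_arg {
	const double *a, *b;		/* n-by-n inputs, row-major */
	double *c;			/* n-by-n result, row-major */
	int n, first_col, last_col;	/* this thread computes [first_col, last_col) */
};

static void *mm_worker(void *p)
{
	struct mm_arg *ap = p;
	int n = ap->n;

	for (int j = ap->first_col; j < ap->last_col; j++) {
		for (int i = 0; i < n; i++) {
			double sum = 0.0;

			for (int k = 0; k < n; k++)
				sum += ap->a[i * n + k] * ap->b[k * n + j];
			ap->c[i * n + j] = sum;
		}
	}
	return NULL;
}

/* Returns 0 on success, -1 on bad arguments or thread-creation failure. */
int matmul_parallel(const double *a, const double *b, double *c,
		    int n, int nthreads)
{
	pthread_t tid[MAXTHREADS];
	struct mm_arg arg[MAXTHREADS];

	if (nthreads < 1 || nthreads > MAXTHREADS)
		return -1;
	for (int t = 0; t < nthreads; t++) {
		arg[t] = (struct mm_arg){ a, b, c, n,
					  t * n / nthreads,
					  (t + 1) * n / nthreads };
		if (pthread_create(&tid[t], NULL, mm_worker, &arg[t]) != 0) {
			while (t-- > 0)		/* wait for threads already started */
				pthread_join(tid[t], NULL);
			return -1;
		}
	}
	for (int t = 0; t < nthreads; t++)
		pthread_join(tid[t], NULL);
	return 0;
}
```

Note that even without locks, threads writing adjacent columns of a row-major result can end up writing to the same cache lines, and thread creation and joining have their own costs, so "no synchronization primitives" does not by itself imply an efficiency of 1.0.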
However, Figure 6.17 tells a different story, especially for a 64-by-64 matrix multiply,
which never gets above an efficiency of about 0.3, even when running single-threaded,
and drops sharply as more threads are added.14 The 128-by-128 matrix does better,
but still fails to demonstrate much performance increase with added threads. The
256-by-256 matrix does scale reasonably well, but only up to a handful of CPUs. The
512-by-512 matrix multiply’s efficiency is measurably less than 1.0 on as few as 10
threads, and even the 1024-by-1024 matrix multiply deviates noticeably from perfection
at a few tens of threads. Nevertheless, this figure clearly demonstrates the performance
and scalability benefits of batching: If you must incur synchronization overhead, you
may as well get your money’s worth, which is the solution to the problem of deciding
on granularity of synchronization put forth on page 113.
Quick Quiz 6.19: How can a single-threaded 64-by-64 matrix multiply possibly have an
efficiency of less than 1.0? Shouldn’t all of the traces in Figure 6.17 have efficiency of exactly
1.0 when running on one thread?
14 In contrast to the smooth traces of Figure 6.16, the wide error bars and jagged traces of
6.4. PARALLEL FASTPATH 139
[Figure: Parallel Fastpath and its related design patterns: Reader/Writer Locking, RCU,
Hierarchical Locking, and Allocator Caches.]
Quick Quiz 6.21: What did you do to validate this matrix multiply algorithm?
4. Resource Allocator Caches ([McK96a, MS93]). See Section 6.4.3 for more detail.
our hash-table search might be adapted to do hierarchical locking, but also shows the
great weakness of this approach: We have paid the overhead of acquiring a second lock,
but we only hold it for a short time. In this case, the data-locking approach would be
simpler and likely perform better.
Quick Quiz 6.22: In what situation would hierarchical locking work well?
[Figure 6.19: allocator-cache flow of memory blocks. Each of the CPU 0 and CPU 1 pools
serves its own CPU's allocations and frees, sends blocks to the code-locked Global Pool
on overflow, and retrieves blocks from the Global Pool when empty.]
laptop I am using right now. We could simply assign each CPU a five-gigabyte region
of memory, and allow each CPU to allocate from its own region, without the need for
locking and its complexities and overheads. Unfortunately, this scheme fails when CPU 0
only allocates memory and CPU 1 only frees it, as happens in simple producer-consumer
workloads.
The other extreme, code locking, suffers from excessive lock contention and over-
head [MS93].
The commonly used solution uses parallel fastpath with each CPU owning a modest
cache of blocks, and with a large code-locked shared pool for additional blocks. To
prevent any given CPU from monopolizing the memory blocks, we place a limit on the
number of blocks that can be in each CPU’s cache. In a two-CPU system, the flow of
memory blocks will be as shown in Figure 6.19: When a given CPU is trying to free a
block when its pool is full, it sends blocks to the global pool, and, similarly, when that
CPU is trying to allocate a block when its pool is empty, it retrieves blocks from the
global pool.
The actual data structures for a “toy” implementation of allocator caches are shown
in Listing 6.9 (“smpalloc.c”). The “Global Pool” of Figure 6.19 is implemented
by globalmem of type struct globalmempool, and the two CPU pools by the per-
thread variable perthreadmem of type struct perthreadmempool. Both of these
data structures have arrays of pointers to blocks in their pool fields, which are filled
from index zero upwards. Thus, if globalmem.pool[3] is NULL, then the remainder
of the array from index 4 up must also be NULL. The cur fields contain the index of the
highest-numbered full element of the pool array, or −1 if all elements are empty. All
small, but this small size makes it easier to single-step the program in order to get a feel for
its operation.
In either case, line 18 checks for the per-thread pool still being empty, and if not,
lines 19–21 remove a block and return it. Otherwise, line 23 tells the sad tale of memory
exhaustion.
Listing 6.11 shows the memory-block free function. Line 6 gets a pointer to this thread’s
pool, and line 7 checks to see if this per-thread pool is full.
If so, lines 8–15 empty half of the per-thread pool into the global pool, with lines 8
and 14 acquiring and releasing the spinlock. Lines 9–12 implement the loop moving
blocks from the local to the global pool, and line 13 sets the per-thread pool’s count to
the proper value.
In either case, line 16 then places the newly freed block into the per-thread pool.
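The allocation and free paths described above can be sketched as follows. This is a stand-in for smpalloc.c, not a copy of it: it assumes a POSIX mutex in place of the spinlock, C11 _Thread_local in place of the per-thread-variable primitives, a hypothetical struct memblock, and a hypothetical initpools() that fills the global pool. The names globalmem, perthreadmem, the cur and pool fields, and TARGET_POOL_SIZE come from the text; the sizes 3 and 40 are those quoted in Quick Quiz 6.25.

```c
#include <pthread.h>
#include <stddef.h>

#define TARGET_POOL_SIZE 3
#define GLOBAL_POOL_SIZE 40

struct memblock { char bytes[64]; };	/* hypothetical block type */

struct globalmempool {
	pthread_mutex_t mutex;
	int cur;			/* highest full index, or -1 if empty */
	struct memblock *pool[GLOBAL_POOL_SIZE];
};

struct perthreadmempool {
	int cur;			/* highest full index, or -1 if empty */
	struct memblock *pool[2 * TARGET_POOL_SIZE];
};

static struct globalmempool globalmem = { PTHREAD_MUTEX_INITIALIZER, -1, { NULL } };
static _Thread_local struct perthreadmempool perthreadmem = { -1, { NULL } };
static struct memblock blocks[GLOBAL_POOL_SIZE];

void initpools(void)	/* call once, before any threads allocate */
{
	for (int i = 0; i < GLOBAL_POOL_SIZE; i++)
		globalmem.pool[i] = &blocks[i];
	globalmem.cur = GLOBAL_POOL_SIZE - 1;
}

struct memblock *memblock_alloc(void)
{
	struct perthreadmempool *pcpp = &perthreadmem;
	struct memblock *p;
	int i;

	if (pcpp->cur < 0) {	/* slowpath: refill from the global pool */
		pthread_mutex_lock(&globalmem.mutex);
		for (i = 0; i < TARGET_POOL_SIZE && globalmem.cur >= 0; i++) {
			pcpp->pool[i] = globalmem.pool[globalmem.cur];
			globalmem.pool[globalmem.cur--] = NULL;
		}
		pcpp->cur = i - 1;
		pthread_mutex_unlock(&globalmem.mutex);
	}
	if (pcpp->cur >= 0) {	/* fastpath: no lock needed */
		p = pcpp->pool[pcpp->cur];
		pcpp->pool[pcpp->cur--] = NULL;
		return p;
	}
	return NULL;		/* memory exhausted */
}

void memblock_free(struct memblock *p)
{
	struct perthreadmempool *pcpp = &perthreadmem;
	int i;

	if (pcpp->cur >= 2 * TARGET_POOL_SIZE - 1) {	/* full: spill half */
		pthread_mutex_lock(&globalmem.mutex);
		for (i = pcpp->cur; i >= TARGET_POOL_SIZE; i--) {
			globalmem.pool[++globalmem.cur] = pcpp->pool[i];
			pcpp->pool[i] = NULL;
		}
		pcpp->cur = i;
		pthread_mutex_unlock(&globalmem.mutex);
	}
	pcpp->pool[++pcpp->cur] = p;	/* fastpath: no lock needed */
}
```

The global lock is touched only when a per-thread pool is completely empty on allocation or completely full on free, so a thread that allocates and frees in short runs stays on the lock-free fastpath.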
Quick Quiz 6.23: Doesn’t this resource-allocator design resemble that of the approximate
limit counters covered in Section 5.3?
[Figure 6.21: allocator cache performance. x-axis: Allocation Run Length (0 to 25);
y-axis: successful allocation/free pairs per microsecond (0 to 30).]
6.4.3.6 Performance
Rough performance results16 are shown in Figure 6.21, running on a dual-core Intel
x86 running at 1 GHz (4300 bogomips per CPU) with at most six blocks allowed in
each CPU’s cache. In this micro-benchmark, each thread repeatedly allocates a group
of blocks and then frees all the blocks in that group, with the number of blocks in the
group being the “allocation run length” displayed on the x-axis. The y-axis shows the
number of successful allocation/free pairs per microsecond—failed allocations are not
counted. The “X”s are from a two-thread run, while the “+”s are from a single-threaded
run.
Note that run lengths up to six scale linearly and give excellent performance, while run
lengths greater than six show poor performance and almost always also show negative
scaling. It is therefore quite important to size TARGET_POOL_SIZE sufficiently large,
which fortunately is usually quite easy to do in actual practice [MSK01], especially
given today’s large memories. For example, in most systems, it is quite reasonable to
set TARGET_POOL_SIZE to 100, in which case allocations and frees are guaranteed to
be confined to per-thread pools at least 99 % of the time.
As can be seen from the figure, the situations where the common-case data-ownership
applies (run lengths up to six) provide greatly improved performance compared to the
cases where locks must be acquired. Avoiding synchronization in the common case will
be a recurring theme through this book.
Quick Quiz 6.24: In Figure 6.21, there is a pattern of performance rising with increasing run
length in groups of three samples, for example, for run lengths 10, 11, and 12. Why?
Quick Quiz 6.25: Allocation failures were observed in the two-thread tests at run lengths of
19 and greater. Given the global-pool size of 40 and the per-thread target pool size 𝑠 of three,
number of threads 𝑛 equal to two, and assuming that the per-thread pools are initially empty
with none of the memory in use, what is the smallest allocation run length 𝑚 at which failures
can occur? (Recall that each thread repeatedly allocates 𝑚 blocks of memory, and then frees the
𝑚 blocks of memory.) Alternatively, given 𝑛 threads each with pool size 𝑠, and where each
thread repeatedly first allocates 𝑚 blocks of memory and then frees those 𝑚 blocks, how large
must the global pool size be? Note: Obtaining the correct answer will require you to examine
the smpalloc.c source code, and very likely single-step it as well. You have been warned!
16 This data was not collected in a statistically meaningful way, and therefore should be
viewed with great skepticism and suspicion. Good data-collection and -reduction practice is
discussed in Chapter 11. That said, repeated runs gave similar results, and these results match
more careful evaluations of similar algorithms.
6.4.3.7 Validation
Validation of this simple allocator spawns a specified number of threads, with each
thread repeatedly allocating a specified number of memory blocks and then deallocating
them. This simple regimen suffices to exercise both the per-thread caches and the global
pool, as can be seen in Figure 6.21.
Much more aggressive validation is required for memory allocators that are to be used
in production. The test suites for tcmalloc [Ken20] and jemalloc [Eva11] are instructive,
as are the tests for the Linux kernel’s memory allocator.
The toy parallel resource allocator was quite simple, but real-world designs expand on
this approach in a number of ways.
First, real-world allocators are required to handle a wide range of allocation sizes, as
opposed to the single size shown in this toy example. One popular way to do this is to
offer a fixed set of sizes, spaced so as to balance external and internal fragmentation,
such as in the late-1980s BSD memory allocator [MK88]. Doing this would mean that
the “globalmem” variable would need to be replicated on a per-size basis, and that the
associated lock would similarly be replicated, resulting in data locking rather than the
toy program’s code locking.
Second, production-quality systems must be able to repurpose memory, meaning that
they must be able to coalesce blocks into larger structures, such as pages [MS93]. This
coalescing will also need to be protected by a lock, which again could be replicated on a
per-size basis.
Third, coalesced memory must be returned to the underlying memory system, and
pages of memory must also be allocated from the underlying memory system. The
locking required at this level will depend on that of the underlying memory system, but
could well be code locking. Code locking can often be tolerated at this level, because
this level is so infrequently reached in well-designed systems [MSK01].
Concurrent userspace allocators face similar challenges [Ken20, Eva11].
Despite this real-world design’s greater complexity, the underlying idea is the same—
repeated application of parallel fastpath, as shown in Table 6.1.
And “parallel fastpath” is one of the solutions to the non-partitionable application
problem put forth on page 113.
6.5. BEYOND PARTITIONING 147
This chapter has discussed how data partitioning can be used to design simple linearly
scalable parallel programs. Section 6.3.4 hinted at the possibilities of data replication,
which will be used to great effect in Section 9.5.
The main goal of applying partitioning and replication is to achieve linear speedups,
in other words, to ensure that the total amount of work required does not increase
significantly as the number of CPUs or threads increases. A problem that can be
solved via partitioning and/or replication, resulting in linear speedups, is embarrassingly
parallel. But can we do better?
To answer this question, let us examine the solution of labyrinths and mazes. Of
course, labyrinths and mazes have been objects of fascination for millennia [Wik12],
so it should come as no surprise that they are generated and solved using computers,
including biological computers [Ada11], GPGPUs [Eri08], and even discrete hard-
ware [KFC11]. Parallel solution of mazes is sometimes used as a class project in
universities [ETH11, Uni10] and as a vehicle to demonstrate the benefits of parallel-
programming frameworks [Fos10].
Common advice is to use a parallel work-queue algorithm (PWQ) [ETH11, Fos10].
This section evaluates this advice by comparing PWQ against a sequential algorithm
(SEQ) and also against an alternative parallel algorithm, in all cases solving randomly
generated square mazes. Section 6.5.1 discusses PWQ, Section 6.5.2 discusses
an alternative parallel algorithm, Section 6.5.4 analyzes its anomalous performance,
Section 6.5.5 derives an improved sequential algorithm from the alternative parallel
algorithm, Section 6.5.6 makes further performance comparisons, and finally Section 6.5.7
presents future directions and concluding remarks.
[Figure 6.22: cells of a 3×3 maze numbered by distance from the start: 1 2 3 / 2 3 4 / 3 4 5]
Line 7 visits the initial cell, and each iteration of the loop spanning lines 8–21 traverses
passages headed by one cell. The loop spanning lines 9–13 scans the ->visited[]
array for a visited cell with an unvisited neighbor, and the loop spanning lines 14–19
traverses one fork of the submaze headed by that neighbor. Line 20 initializes for the
next pass through the outer loop.
The pseudocode for maze_try_visit_cell() is shown on lines 1–12 of Listing 6.13
(maze.c). Line 4 checks to see if cells c and t are adjacent and connected, while line 5
checks to see if cell t has not yet been visited. The celladdr() function returns the
address of the specified cell. If either check fails, line 6 returns failure. Line 7 indicates
the next cell, line 8 records this cell in the next slot of the ->visited[] array, line 9
indicates that this slot is now full, and line 10 marks this cell as visited and also records
the distance from the maze start. Line 11 then returns success.
The pseudocode for maze_find_any_next_cell() is shown on lines 14–28 of
Listing 6.13 (maze.c). Line 17 picks up the current cell’s distance plus 1, while lines 19,
21, 23, and 25 check the cell in each direction, and lines 20, 22, 24, and 26 return
true if the corresponding cell is a candidate next cell. The prevcol(), nextcol(),
prevrow(), and nextrow() each do the specified array-index-conversion operation.
If none of the cells is a candidate, line 27 returns false.
The path is recorded in the maze by counting the number of cells from the starting
point, as shown in Figure 6.22, where the starting cell is in the upper left and the
ending cell is in the lower right. Starting at the ending cell and following consecutively
decreasing cell numbers traverses the solution.
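The backward walk described above can be sketched as follows. This is a minimal illustration, not the book's maze.c: the dist[][] array, the trace_solution() name, and the absence of walls are all assumptions made for this example, with the numbering taken from the 3×3 grid of Figure 6.22.

```c
#include <assert.h>

#define N 3	/* Hypothetical maze dimension, chosen for illustration. */

/* Distance-from-start numbering; the start cell is numbered 1. */
static const int dist[N][N] = {
	{ 1, 2, 3 },
	{ 2, 3, 4 },
	{ 3, 4, 5 },
};

/*
 * Walk from (row, col) back to the start by repeatedly stepping to a
 * neighbor whose number is exactly one less than the current cell's.
 * This sketch assumes no walls; a real maze would also need to check
 * that the two cells are connected.  Returns the number of cells on
 * the path, or -1 if the numbering is inconsistent.
 */
static int trace_solution(int row, int col)
{
	static const int dr[] = { -1, 1, 0, 0 };
	static const int dc[] = { 0, 0, -1, 1 };
	int len = 1;

	assert(row >= 0 && row < N && col >= 0 && col < N);
	while (dist[row][col] != 1) {
		int i;

		for (i = 0; i < 4; i++) {
			int r = row + dr[i];
			int c = col + dc[i];

			if (r < 0 || r >= N || c < 0 || c >= N)
				continue;
			if (dist[r][c] == dist[row][col] - 1) {
				row = r;
				col = c;
				len++;
				break;
			}
		}
		if (i == 4)
			return -1;	/* No predecessor: not a valid solution. */
	}
	return len;
}
```

Starting from the lower-right cell, each step moves to a cell numbered one lower, so the path length equals the ending cell's number.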
[Figure: CDF of solution times (ms) for SEQ and PWQ; y-axis: probability]
index. Second, the parent visits the first cell on each child’s behalf, which the child
retrieves on line 8. Third, the maze is solved as soon as one child locates a cell that has
been visited by the other child. When maze_try_visit_cell() detects this, it sets
a ->done field in the maze structure. Fourth, each child must therefore periodically
check the ->done field, as shown on lines 13, 18, and 23. The READ_ONCE() primitive
must disable any compiler optimizations that might combine consecutive loads or that
might reload the value. A C++1x volatile relaxed load suffices [Smi19]. Finally, the
maze_try_visit_cell() function must use compare-and-swap to mark a cell
as visited. However, no constraints on ordering are required beyond those provided by
thread creation and join.
The pseudocode for maze_find_any_next_cell() is identical to that shown in
Listing 6.13, but the pseudocode for maze_try_visit_cell() differs, and is shown
in Listing 6.15. Lines 8–9 check to see if the cells are connected, returning failure if not.
The loop spanning lines 11–18 attempts to mark the new cell visited. Line 13 checks to
see if it has already been visited, in which case line 16 returns failure, but only after
line 14 checks to see if we have encountered the other thread, in which case line 15
indicates that the solution has been located. Line 19 updates to the new cell, lines 20
and 21 update this thread’s visited array, and line 22 returns success.
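The core of this scheme, marking a cell visited with compare-and-swap and detecting an encounter with the other thread, might look roughly like the sketch below. It uses C11 atomics rather than the primitives of Listing 6.15, and the owner[] array, try_claim_cell(), and maze_solved() are hypothetical names introduced only for illustration. The connectivity check and the per-thread visited array are omitted.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define NCELLS 16	/* Hypothetical maze size. */

/* 0 means unvisited; otherwise the ID (1 or 2) of the visiting thread. */
static atomic_int owner[NCELLS];
static atomic_bool done;	/* Set once the two threads meet. */

/*
 * Try to claim a cell for thread tid.  Returns true if this thread
 * claimed it.  If the other thread already owns it, the two searches
 * have met, so flag the maze as solved.  Relaxed ordering is used
 * because, as the text notes, thread creation and join supply the
 * only ordering that is needed.
 */
static bool try_claim_cell(int cell, int tid)
{
	int expected = 0;

	assert(cell >= 0 && cell < NCELLS);
	if (atomic_compare_exchange_strong_explicit(&owner[cell], &expected, tid,
						    memory_order_relaxed,
						    memory_order_relaxed))
		return true;
	if (expected != tid)	/* Claimed by the other thread. */
		atomic_store_explicit(&done, true, memory_order_relaxed);
	return false;
}

/* Periodic check, playing the role that READ_ONCE() plays in the text. */
static bool maze_solved(void)
{
	return atomic_load_explicit(&done, memory_order_relaxed);
}
```

On failure, the compare-and-swap leaves the current owner in expected, which is what allows a single operation to both attempt the claim and identify who got there first.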
Performance testing revealed a surprising anomaly, shown in Figure 6.24. The median
solution time for PART (17 milliseconds) is more than four times faster than that of
SEQ (79 milliseconds), despite running on only two threads.
The first reaction to such a dramatic performance anomaly is to check for bugs, which
suggests stringent validation be applied. This is the topic of the next section.
Figure 6.24: CDF of Solution Times For SEQ, PWQ, and PART (x-axis: solution time in ms; y-axis: probability)
Additional manual validation was applied by Paul’s wife, who greatly enjoys solving
puzzles.
However, if this maze software were to be used in production, whatever that might
mean, it would be wise to construct an independent maze fsck program. Nevertheless,
the mazes and solutions all proved to be quite valid. The next section therefore more
deeply analyzes the scalability anomaly called out in Section 6.5.2.
[Figure: CDF of speedup relative to SEQ, with curves for SEQ/PWQ and SEQ/PART]
[Figure 6.27: solution time (ms) versus percent of maze cells visited, for SEQ, PWQ, and PART]
Further investigation showed that PART sometimes visited fewer than 2 % of the
maze’s cells, while SEQ and PWQ never visited fewer than about 9 %. The reason for
this difference is shown by Figure 6.26. If the thread traversing the solution from the
upper left reaches the circle, the other thread cannot reach the upper-right portion of the
maze. Similarly, if the other thread reaches the square, the first thread cannot reach the
lower-left portion of the maze. Therefore, PART will likely visit a small fraction of the
non-solution-path cells. In short, the superlinear speedups are due to threads getting
in each other’s way. This is a sharp contrast with decades of experience with parallel
programming, where workers have struggled to keep threads out of each other’s way.
Figure 6.27 confirms a strong correlation between cells visited and solution time
for all three methods. The slope of PART’s scatterplot is smaller than that of SEQ,
indicating that PART’s pair of threads visits a given fraction of the maze faster than can
SEQ’s single thread. PART’s scatterplot is also weighted toward small visit percentages,
confirming that PART does less total work, hence the observed humiliating parallelism.
This humiliating parallelism also provides more than 2x speedup on two CPUs, as put
forth on page 113.
The fraction of cells visited by PWQ is similar to that of SEQ. In addition, PWQ’s
solution time is greater than that of PART, even for equal visit fractions. The reason for
this is shown in Figure 6.28, which has a red circle on each cell with more than two
neighbors. Each such cell can result in contention in PWQ, because one thread can
enter but two threads can exit, which hurts performance, as noted earlier in this chapter.
In contrast, PART can incur such contention but once, namely when the solution is
located. Of course, SEQ never contends.
Quick Quiz 6.26: Given that a 2D maze achieved 4x speedup on two CPUs, would a 3D maze
achieve an 8x speedup on two CPUs?
[Figure: CDF of speedup relative to SEQ, with curves for PART, PWQ, and SEQ -O3]
[Figure: CDF of speedup relative to SEQ (-O3), with curves for COPART, PWQ, and PART]
[Figure: PART and PWQ plotted against maze size (y-axis label not recoverable)]
[Figure: speedup relative to COPART (-O3) versus maze size, for PART and PWQ]
[Figure: PART and PWQ plotted against number of threads, 1 through 8 (y-axis label not recoverable)]
reasons for the peak at two threads are (1) the lower complexity of termination detection
in the two-thread case and (2) the fact that there is a lower probability of the third
and subsequent threads making useful forward progress: Only the first two threads are
guaranteed to start on the solution line. This disappointing performance compared to
results in Figure 6.32 is due to the less-tightly integrated hardware available in the larger
and older Xeon system running at 2.66 GHz.
Quick Quiz 6.27: Why place the third, fourth, and so on threads on the diagonal? Why not
instead distribute them evenly around the maze?
solving mazes from mildly scalable to humiliatingly parallel and back again. It is hoped
that this experience will motivate work on parallelism as a first-class design-time whole-
application optimization technique, rather than as a grossly suboptimal after-the-fact
micro-optimization to be retrofitted into existing programs.
Most important, although this chapter has demonstrated that applying parallelism at
the design level gives excellent results, this final section shows that this is not enough.
For search problems such as maze solution, this section has shown that search strategy
is even more important than parallel design. Yes, for this particular type of maze,
intelligently applying parallelism identified a superior search strategy, but this sort of
luck is no substitute for a clear focus on search strategy itself.
As noted back in Section 2.2, parallelism is but one potential optimization of many.
A successful design needs to focus on the most important optimization. Much though I
might wish to claim otherwise, that optimization might or might not be parallelism.
However, for the many cases where parallelism is the right optimization, the next
chapter covers that synchronization workhorse, locking.
Chapter 7
Locking
Locking is the worst general-purpose
synchronization mechanism except for all those
other mechanisms that have been tried from time to
time.
With apologies to the memory of Winston Churchill
and to whoever he was quoting
In recent concurrency research, locking often plays the role of villain. Locking stands
accused of inciting deadlocks, convoying, starvation, unfairness, data races, and all
manner of other concurrency sins. Interestingly enough, the role of workhorse in
production-quality shared-memory parallel software is also played by locking. This
chapter will look into this dichotomy between villain and hero, as fancifully depicted in
Figures 7.1 and 7.2.
There are a number of reasons behind this Jekyll-and-Hyde dichotomy:
1. Many of locking’s sins have pragmatic design solutions that work well in most
cases, for example:
(a) Use of lock hierarchies to avoid deadlock.
(b) Deadlock-detection tools, for example, the Linux kernel’s lockdep facility
[Cor06a].
(c) Locking-friendly data structures, such as arrays, hash tables, and radix trees,
which will be covered in Chapter 10.
2. Some of locking’s sins are problems only at high levels of contention, levels
reached only by poorly designed programs.
3. Some of locking’s sins are avoided by using other synchronization mechanisms in
concert with locking. These other mechanisms include statistical counters (see
Chapter 5), reference counters (see Section 9.2), hazard pointers (see Section 9.3),
sequence-locking readers (see Section 9.4), RCU (see Section 9.5), and simple
non-blocking data structures (see Section 14.2).
4. Until quite recently, almost all large shared-memory parallel programs were
developed in secret, so that it was not easy to learn of these pragmatic solutions.
5. Locking works extremely well for some software artifacts and extremely poorly for
others. Developers who have worked on artifacts for which locking works well
can be expected to have a much more positive opinion of locking than those who
have worked on artifacts for which locking works poorly, as will be discussed in
Section 7.5.
6. All good stories need a villain, and locking has a long and honorable history
serving as a research-paper whipping boy.
Quick Quiz 7.1: Just how can serving as a whipping boy be considered to be in any way
honorable???
This chapter will give an overview of a number of ways to avoid locking’s more
serious sins.
7.1 Staying Alive
[Figure 7.3: deadlock cycle, drawn as a directed graph over Threads A, B, and C and Locks 1, 2, 3, and 4]
Given that locking stands accused of deadlock and starvation, one important concern
for shared-memory parallel developers is simply staying alive. The following sections
therefore cover deadlock, livelock, starvation, unfairness, and inefficiency.
7.1.1 Deadlock
Deadlock occurs when each member of a group of threads is holding at least one lock
while at the same time waiting on a lock held by a member of that same group. This
happens even in groups containing a single thread when that thread attempts to acquire
a non-recursive lock that it already holds. Deadlock can therefore occur even given but
one thread and one lock!
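The single-thread case can be demonstrated without actually hanging by using a POSIX error-checking mutex, which reports the self-deadlock instead of blocking. The following is a minimal sketch; the self_deadlock_attempt() name is made up for this example.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>

/*
 * Acquire an error-checking mutex twice from the same thread.
 * Returns the result of the second acquisition attempt.
 */
static int self_deadlock_attempt(void)
{
	pthread_mutexattr_t attr;
	pthread_mutex_t lock;
	int ret;

	pthread_mutexattr_init(&attr);
	pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
	pthread_mutex_init(&lock, &attr);
	pthread_mutexattr_destroy(&attr);

	ret = pthread_mutex_lock(&lock);
	assert(ret == 0);	/* First acquisition succeeds. */

	/* A PTHREAD_MUTEX_NORMAL mutex would hang here forever. */
	ret = pthread_mutex_lock(&lock);

	pthread_mutex_unlock(&lock);
	pthread_mutex_destroy(&lock);
	return ret;
}
```

With PTHREAD_MUTEX_ERRORCHECK, the second pthread_mutex_lock() returns EDEADLK, which makes this mutex type a handy debugging aid, at the cost of some extra overhead per acquisition.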
Without some sort of external intervention, deadlock is forever. No thread can acquire
the lock it is waiting on until that lock is released by the thread holding it, but the thread
holding it cannot release it until the holding thread acquires the lock that it is in turn
waiting on.
We can create a directed-graph representation of a deadlock scenario with nodes for
threads and locks, as shown in Figure 7.3. An arrow from a lock to a thread indicates
that the thread holds the lock, for example, Thread B holds Locks 2 and 4. An arrow
from a thread to a lock indicates that the thread is waiting on the lock, for example,
Thread B is waiting on Lock 3.
A deadlock scenario will always contain at least one deadlock cycle. In Figure 7.3,
this cycle is Thread B, Lock 3, Thread C, Lock 4, and back to Thread B.
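One way to see how such a cycle can be found mechanically is to follow the arrows. The sketch below is an illustration under stated assumptions: the holder[] and waiting_on[] arrays and the function names are hypothetical, locks are assumed to be exclusive, and only the arrows that the text spells out are encoded (Thread A's arrows are omitted because the text does not describe them).

```c
#include <assert.h>
#include <stdbool.h>

#define NTHREADS 3	/* Threads A, B, and C. */
#define NLOCKS 5	/* Locks 1-4; index 0 is unused. */
#define NONE (-1)

enum { THREAD_A, THREAD_B, THREAD_C };

/* holder[l] is the thread holding lock l (lock-to-thread arrow). */
static int holder[NLOCKS];
/* waiting_on[t] is the lock thread t waits on (thread-to-lock arrow). */
static int waiting_on[NTHREADS];

/*
 * Starting from thread t, alternately follow thread-to-lock and
 * lock-to-thread arrows.  A thread waits on at most one lock and an
 * exclusive lock has at most one holder, so the walk is a simple
 * chain that either ends or revisits a thread.
 */
static bool in_deadlock_cycle(int t)
{
	int start = t;
	int steps;

	assert(t >= 0 && t < NTHREADS);
	for (steps = 0; steps < NTHREADS; steps++) {
		int l = waiting_on[t];

		if (l == NONE || holder[l] == NONE)
			return false;	/* Chain ends: progress is possible. */
		t = holder[l];
		if (t == start)
			return true;	/* Back where we started: a cycle. */
	}
	return false;	/* Blocked behind a cycle, but not part of it. */
}

/* Encode only the arrows that the text gives for Figure 7.3. */
static void setup_figure_arrows(void)
{
	int i;

	for (i = 0; i < NLOCKS; i++)
		holder[i] = NONE;
	for (i = 0; i < NTHREADS; i++)
		waiting_on[i] = NONE;
	holder[2] = THREAD_B;		/* Thread B holds Locks 2 and 4 ... */
	holder[4] = THREAD_B;
	waiting_on[THREAD_B] = 3;	/* ... and waits on Lock 3. */
	holder[3] = THREAD_C;		/* Thread C holds Lock 3 ... */
	waiting_on[THREAD_C] = 4;	/* ... and waits on Lock 4. */
}
```

Tools such as lockdep use far more sophisticated bookkeeping, but the underlying idea of searching a graph for cycles is the same.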
Quick Quiz 7.2: But the definition of lock-based deadlock only said that each thread was
holding at least one lock and waiting on another lock that was held by some thread. How do
you know that there is a cycle?
Although there are some software environments such as database systems that can
recover from an existing deadlock, this approach requires either that one of the threads
be killed or that a lock be forcibly stolen from one of the threads. This killing and
forcible stealing works well for transactions, but is often problematic for kernel and
application-level use of locking: Dealing with the resulting partially updated structures
can be extremely complex, hazardous, and error-prone.
Therefore, kernels and applications should instead avoid deadlocks. Deadlock-avoidance
strategies include locking hierarchies (Section 7.1.1.1), local locking
hierarchies (Section 7.1.1.2), layered locking hierarchies (Section 7.1.1.3), temporal locking
hierarchies (Section 7.1.1.4), strategies for dealing with APIs containing pointers to
locks (Section 7.1.1.5), conditional locking (Section 7.1.1.6), acquiring all needed locks
first (Section 7.1.1.7), single-lock-at-a-time designs (Section 7.1.1.8), and strategies for
signal/interrupt handlers (Section 7.1.1.9). Although there is no deadlock-avoidance
strategy that works perfectly for all situations, there is a good selection of tools to choose
from.
But suppose that a library function does invoke the caller’s code. For example,
qsort() invokes a caller-provided comparison function. Now, normally this comparison
function will operate on unchanging local data, so that it need not acquire locks, as
shown in Figure 7.4. But maybe someone is crazy enough to sort a collection whose
keys are changing, thus requiring that the comparison function acquire locks, which
might result in deadlock, as shown in Figure 7.5. How can the library function avoid
this deadlock?
[Figure 7.4: Application layer (Lock A, Lock B) above Library layer (Lock C, qsort())]
[Figure 7.5: Application and Library layers (Lock C, qsort()), with the resulting DEADLOCK marked]
The golden rule in this case is “Release all locks before invoking unknown code.”
To follow this rule, the qsort() function must release all of its locks before invoking
the comparison function. Then qsort() will not be holding any of its locks while the
comparison function acquires any of the caller’s locks, which avoids deadlock.
Quick Quiz 7.4: But if qsort() releases all its locks before invoking the comparison function,
how can it protect against races with other qsort() threads?
To see the benefits of local locking hierarchies, compare Figures 7.5 and 7.6. In
both figures, application functions foo() and bar() invoke qsort() while holding
Locks A and B, respectively. Because this is a parallel implementation of qsort(),
it acquires Lock C. Function foo() passes function cmp() to qsort(), and cmp()
acquires Lock B. Function bar() passes a simple integer-comparison function (not
shown) to qsort(), and this simple function does not acquire any locks.
Now, if qsort() holds Lock C while calling cmp() in violation of the golden
release-all-locks rule above, as shown in Figure 7.5, deadlock can occur. To see this,
suppose that one thread invokes foo() while a second thread concurrently invokes
bar(). The first thread will acquire Lock A and the second thread will acquire Lock B.
If the first thread’s call to qsort() acquires Lock C, then it will be unable to acquire
Lock B when it calls cmp(). But the first thread holds Lock C, so the second thread’s
call to qsort() will be unable to acquire it, and thus unable to release Lock B, resulting
in deadlock.
In contrast, if qsort() releases Lock C before invoking the comparison function,
which is unknown code from qsort()’s perspective, then deadlock is avoided as shown
in Figure 7.6.
If each module releases all locks before invoking unknown code, then deadlock is
avoided if each module separately avoids deadlock. This rule therefore greatly simplifies
deadlock analysis and greatly improves modularity.
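A minimal sketch of the golden rule follows, using a hypothetical library that iterates over its own data and invokes a caller-supplied function. The names lib_lock, items[], lib_for_each(), and add_item() are assumptions introduced for this example and are not taken from the book's code.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define NITEMS 4

/* Hypothetical library state protected by lib_lock. */
static pthread_mutex_t lib_lock = PTHREAD_MUTEX_INITIALIZER;
static int items[NITEMS] = { 3, 1, 4, 1 };

/*
 * Invoke the caller-supplied function on each item.  The library lock
 * is held only while copying the item out, and is dropped before the
 * unknown code runs, so the callback may acquire any of its own locks.
 */
static void lib_for_each(void (*func)(int item, void *arg), void *arg)
{
	int i;

	assert(func != NULL);
	for (i = 0; i < NITEMS; i++) {
		int snapshot;

		pthread_mutex_lock(&lib_lock);
		snapshot = items[i];
		pthread_mutex_unlock(&lib_lock);

		func(snapshot, arg);	/* No library locks held here. */
	}
}

/* Example callback that acquires the caller's own lock. */
static pthread_mutex_t app_lock = PTHREAD_MUTEX_INITIALIZER;

static void add_item(int item, void *arg)
{
	pthread_mutex_lock(&app_lock);
	*(int *)arg += item;
	pthread_mutex_unlock(&app_lock);
}
```

Note that items[] can change between iterations because the lock is dropped each time around the loop, so this design is only appropriate when the library can tolerate such changes.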
Nevertheless, this golden rule comes with a warning. When you release those locks,
any state that they protect is subject to arbitrary changes, changes that are all too
easy for the function’s caller to forget, resulting in subtle and difficult-to-reproduce
bugs. Because the qsort() comparison function rarely acquires locks, let’s switch to a
different example.
Consider the recursive tree iterator in Listing 7.1 (rec_tree_itr.c). The iterator
visits every node in the tree, invoking a user’s callback function. The tree lock is
released before the invocation and re-acquired after return. This code makes dangerous
assumptions: (1) The number of children of the current node has not changed, (2) The
ancestors stored on the stack by the recursion are still there, and (3) The visited node
itself has not been removed and freed. A few of these hazards can be encountered if one
thread calls tree_add() while another thread releases the tree’s lock to run a callback
function.
Quick Quiz 7.5: So the iterating thread may or may not observe the added child. What is the
big deal?
One strategy is to ensure that state is preserved despite the lock being released, for
example, by acquiring a reference on a node to prevent it from being freed. Alternatively,
the state can be re-initialized once the lock is re-acquired after the callback function
returns.
[Figure: Application layer (Lock A in foo(), Lock B in bar()) above Library layer (Lock C in qsort()) above Lock D in cmp()]
list_next(), which results in deadlock. We can avoid the deadlock by layering the
locking hierarchy to take the list-iterator locking into account.
This layered approach can be extended to an arbitrarily large number of layers, but
each added layer increases the complexity of the locking design. Such increases in
complexity are particularly inconvenient for some types of object-oriented designs, in
which control passes back and forth among a large group of objects in an undisciplined
manner. This mismatch between the habits of object-oriented design and the need to
avoid deadlock is an important reason why parallel programming is perceived by some
to be so difficult.
Some alternatives to highly layered locking hierarchies are covered in Chapter 9.
However, grace periods last for many milliseconds, so waiting another millisecond
before starting a new grace period is not normally a problem. Therefore, if call_rcu()
detects a possible deadlock with the scheduler, it arranges to start the new grace period
later, either within a timer handler or within the scheduler-clock interrupt handler,
depending on configuration. Because no scheduler locks are held across either handler,
deadlock is successfully avoided.
The overall approach is thus to adhere to a locking hierarchy by deferring lock
acquisition to an environment in which no locks are held.
One exception is functions that hand off some entity, where the caller’s lock must
be held until the handoff is complete, but where the lock must be released before the
function returns. One example of such a function is the POSIX pthread_cond_wait()
function, where passing a pointer to a pthread_mutex_t prevents hangs due to lost
wakeups.
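The usual pthread_cond_wait() pattern shows why the mutex pointer is passed in: the check of the condition and the decision to sleep must happen under the same lock that the waker holds when changing the condition. The sketch below uses made-up names (mtx, cond, ready, payload, wait_for_payload(), producer()) and is meant only as an illustration.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

static pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
static bool ready;	/* Protected by mtx. */
static int payload;	/* Protected by mtx. */

/* Waiter: mtx is held across the check and the decision to sleep. */
static int wait_for_payload(void)
{
	int val;

	pthread_mutex_lock(&mtx);
	while (!ready)	/* Loop guards against spurious wakeups. */
		pthread_cond_wait(&cond, &mtx);	/* Releases mtx while asleep. */
	val = payload;
	pthread_mutex_unlock(&mtx);
	return val;
}

/* Waker: updates the state under the same mutex, then signals. */
static void *producer(void *arg)
{
	assert(arg != NULL);
	pthread_mutex_lock(&mtx);
	payload = *(int *)arg;
	ready = true;
	pthread_cond_signal(&cond);
	pthread_mutex_unlock(&mtx);
	return NULL;
}
```

If the waiter instead released the mutex itself and then slept in a separate step, the producer could set ready and signal in between, and the wakeup would be lost.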
Quick Quiz 7.8: Doesn’t the fact that pthread_cond_wait() first releases the mutex and
then re-acquires it eliminate the possibility of deadlock?
In short, if you find yourself exporting an API with a pointer to a lock as an argument
or as the return value, do yourself a favor and carefully reconsider your API design. It
might well be the right thing to do, but experience indicates that this is unlikely.
Quick Quiz 7.9: Can the transformation from Listing 7.4 to Listing 7.5 be applied universally?
Quick Quiz 7.10: But the complexity in Listing 7.5 is well worthwhile given that it avoids
deadlock, right?
A related approach, two-phase locking [BHG87], has seen long production use in
transactional database systems. In the first phase of a two-phase locking transaction, locks
are acquired but not released. Once all needed locks have been acquired, the transaction
enters the second phase, where locks are released, but not acquired. This locking
approach allows databases to provide serializability guarantees for their transactions,
in other words, to guarantee that all values seen and produced by the transactions are
consistent with some global ordering of all the transactions. Many such systems rely on
the ability to abort transactions, although this can be simplified by avoiding making
any changes to shared data until all needed locks are acquired. Livelock and deadlock
are issues in such systems, but practical solutions may be found in any of a number of
database textbooks.
if holding such a lock, it is illegal to attempt to acquire any lock that is ever acquired
outside of a signal handler without blocking signals.
Quick Quiz 7.12: Suppose Lock A is never acquired within a signal handler, but Lock B
is acquired both from thread context and by signal handlers. Suppose further that Lock A
is sometimes acquired with signals unblocked. Why is it illegal to acquire Lock A holding
Lock B?
If a lock is acquired by the handlers for several signals, then each and every one of
these signals must be blocked whenever that lock is acquired, even when that lock is
acquired within a signal handler.
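For the simple case of a lock that is acquired by the handler for a single signal, the thread-context side of this rule might be sketched as follows. Because pthread_mutex_lock() is not async-signal-safe, this example assumes a tiny spinlock built on a C11 atomic_flag. The lock, the counter, and all function names are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <signal.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical lock acquired both by threads and by the SIGUSR1 handler. */
static atomic_flag sig_lock = ATOMIC_FLAG_INIT;
static atomic_int shared_count;	/* Stands in for the protected data. */

static void sig_spin_lock(void)
{
	while (atomic_flag_test_and_set_explicit(&sig_lock, memory_order_acquire))
		continue;
}

static void sig_spin_unlock(void)
{
	atomic_flag_clear_explicit(&sig_lock, memory_order_release);
}

/* SIGUSR1 is blocked automatically while its own handler runs. */
static void sigusr1_handler(int sig)
{
	(void)sig;
	sig_spin_lock();
	atomic_fetch_add_explicit(&shared_count, 1, memory_order_relaxed);
	sig_spin_unlock();
}

/* Thread context: block the signal before acquiring the lock. */
static void thread_context_update(void)
{
	sigset_t block, old;
	int ret;

	sigemptyset(&block);
	sigaddset(&block, SIGUSR1);
	ret = pthread_sigmask(SIG_BLOCK, &block, &old);
	assert(ret == 0);

	sig_spin_lock();
	atomic_fetch_add_explicit(&shared_count, 1, memory_order_relaxed);
	sig_spin_unlock();

	pthread_sigmask(SIG_SETMASK, &old, NULL);	/* Restore the old mask. */
}
```

Without the pthread_sigmask() call, a SIGUSR1 arriving while the thread holds sig_lock would cause the handler to spin forever on a lock held by the very thread it interrupted. If the lock were shared by handlers for several signals, each handler would also need to block the other signals, for example via the sa_mask field passed to sigaction().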
Quick Quiz 7.13: How can you legally block signals within a signal handler?
7.1.1.10 Discussion
There are a large number of deadlock-avoidance strategies available to the shared-
memory parallel programmer, but there are sequential programs for which none of them
is a good fit. This is one of the reasons that expert programmers have more than one
tool in their toolbox: Locking is a powerful concurrency tool, but there are jobs better
addressed with other tools.
Quick Quiz 7.15: Given an object-oriented application that passes control freely among a
group of objects such that there is no straightforward locking hierarchy,[a] layered or otherwise,
how can this application be parallelized?
[a] Also known as “object-oriented spaghetti code.”
Nevertheless, the strategies described in this section have proven quite useful in many
settings.
Quick Quiz 7.16: How can the livelock shown in Listing 7.6 be avoided?
For better results, backoffs should be bounded, and even better high-contention results
are obtained via queued locking [And90], which is discussed more in Section 7.3.2.
Of course, best of all is to use a good parallel design that avoids these problems by
maintaining low lock contention.
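A minimal sketch of bounded exponential backoff, applied to conditional acquisition of a second lock, is shown below. The delay constants and the lock_both_with_backoff() name are assumptions chosen for illustration. Production code would typically also randomize the delay so that contending threads do not retry in lockstep.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <time.h>

#define BACKOFF_MIN_NS 1000L	/* Hypothetical initial delay: 1 us. */
#define BACKOFF_MAX_NS 1000000L	/* Hypothetical upper bound: 1 ms. */

/*
 * Acquire first, then conditionally acquire second.  If second is
 * unavailable, release first, wait, and retry.  The delay doubles on
 * each failure but never exceeds BACKOFF_MAX_NS.  Returns the number
 * of retries that were needed; on return both locks are held.
 */
static int lock_both_with_backoff(pthread_mutex_t *first,
				  pthread_mutex_t *second)
{
	long delay_ns = BACKOFF_MIN_NS;
	int retries = 0;

	assert(first != second);
	for (;;) {
		struct timespec ts = { .tv_sec = 0, .tv_nsec = delay_ns };

		pthread_mutex_lock(first);
		if (pthread_mutex_trylock(second) == 0)
			return retries;	/* Both locks are now held. */
		pthread_mutex_unlock(first);

		nanosleep(&ts, NULL);
		delay_ns *= 2;
		if (delay_ns > BACKOFF_MAX_NS)
			delay_ns = BACKOFF_MAX_NS;	/* Bound the backoff. */
		retries++;
	}
}
```

The upper bound matters because unbounded doubling can leave a thread sleeping long after the lock has become available.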
7.1.3 Unfairness
Unfairness can be thought of as a less-severe form of starvation, where a subset of
threads contending for a given lock is granted the lion’s share of the acquisitions. This
can happen on machines with shared caches or NUMA characteristics, for example, as
shown in Figure 7.8. If CPU 0 releases a lock that all the other CPUs are attempting to
acquire, the interconnect shared between CPUs 0 and 1 means that CPU 1 will have an
advantage over CPUs 2–7. Therefore CPU 1 will likely acquire the lock. If CPU 1 holds
the lock long enough for CPU 0 to be requesting the lock by the time CPU 1 releases it
and vice versa, the lock can shuttle between CPUs 0 and 1, bypassing CPUs 2–7.
[Figure 7.8: system architecture with CPUs, per-CPU caches, and shared interconnects]
Footnote 3: Try not to get too hung up on the exact definitions of terms like livelock, starvation,
and unfairness. Anything that causes a group of threads to fail to make adequate forward
progress is a bug that needs to be fixed, and debating names doesn’t fix bugs.
Quick Quiz 7.18: Wouldn’t it be better just to use a good parallel design so that lock contention
was low enough to avoid unfairness?
7.1.4 Inefficiency
Locks are implemented using atomic instructions and memory barriers, and often
involve cache misses. As we saw in Chapter 3, these instructions are quite expensive,
roughly two orders of magnitude greater overhead than simple instructions. This can be
a serious problem for locking: If you protect a single instruction with a lock, you will
increase the overhead by a factor of one hundred. Even assuming perfect scalability,
one hundred CPUs would be required to keep up with a single CPU executing the same
code without locking.
This situation is not confined to locking. Figure 7.9 shows how this same principle
applies to the age-old activity of sawing wood. As can be seen in the figure, sawing a
board converts a small piece of that board (the width of the saw blade) into sawdust.
Of course, locks partition time instead of sawing wood, but just like sawing wood,
using locks to partition time wastes some of that time due to lock overhead and (worse
yet) lock contention. One important difference is that if someone saws a board into
too-small pieces, the resulting conversion of most of that board into sawdust will be
immediately obvious. In contrast, it is not always obvious that a given lock acquisition
is wasting excessive amounts of time.
And this situation underscores the importance of the synchronization-granularity
tradeoff discussed in Section 6.3, especially Figure 6.16: Too coarse a granularity will
limit scalability, while too fine a granularity will result in excessive synchronization
overhead.
Acquiring a lock might be expensive, but once the lock is held, the CPU’s caches are an effective
performance booster, at least for large critical sections. In addition, once a lock is held,
the data protected by that lock can be accessed by the lock holder without interference
from other threads.
Quick Quiz 7.19: How might the lock holder be interfered with?
The Rust programming language takes lock/data association a step further by allowing
the developer to make a compiler-visible association between a lock and the data that
it protects [JJKD21]. When such an association has been made, attempts to access
the data without the benefit of the corresponding lock will result in a compile-time
diagnostic. The hope is that this will greatly reduce the frequency of this class of bugs.
Of course, this approach does not apply straightforwardly to cases where the data to be
locked is distributed throughout the nodes of some data structure or when that which is
locked is purely abstract, for example, when a small subset of state-machine transitions
is to be protected by a given lock. For this reason, Rust allows locks to be associated
with types rather than data items or even to be associated with nothing at all. This last
option permits Rust to emulate traditional locking use cases, but is not popular among
Rust developers. Perhaps the Rust community will come up with other mechanisms
tailored to other locking use cases.
7.2 Types of Locks
Only locks in life are what you think you know, but
don’t. Accept your ignorance and try something new.
Dennis Vickers
There are a surprising number of types of locks, more than this short chapter can
possibly do justice to. The following sections discuss exclusive locks (Section 7.2.1),
reader-writer locks (Section 7.2.2), multi-role locks (Section 7.2.3), and scoped locking
(Section 7.2.4).
1. Strict FIFO, with acquisitions starting earlier acquiring the lock earlier.
2. Approximate FIFO, with acquisitions starting sufficiently earlier acquiring the lock
earlier.
3. FIFO within priority level, with higher-priority threads acquiring the lock earlier
than any lower-priority threads attempting to acquire the lock at about the same
time, but so that some FIFO ordering applies for threads of the same priority.
4. Random, so that the new lock holder is chosen randomly from all threads attempting
acquisition, regardless of timing.
5. Unfair, so that a given acquisition might never acquire the lock (see Section 7.1.3).
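As an illustration of the strict-FIFO end of this spectrum, a ticket lock grants the lock in the order in which acquisitions began. The following is a minimal sketch using C11 atomics. The struct ticket_lock type and its functions are hypothetical and omit refinements, such as backoff while spinning, that a production implementation would need.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical ticket lock: tickets are served in the order taken. */
struct ticket_lock {
	atomic_uint next_ticket;	/* Next ticket to hand out. */
	atomic_uint now_serving;	/* Ticket currently allowed in. */
};

static void ticket_lock_acquire(struct ticket_lock *tl)
{
	unsigned int me;

	assert(tl != NULL);
	/* Taking a ticket fixes this thread's place in line. */
	me = atomic_fetch_add_explicit(&tl->next_ticket, 1,
				       memory_order_relaxed);
	while (atomic_load_explicit(&tl->now_serving,
				    memory_order_acquire) != me)
		continue;	/* Spin until our ticket comes up. */
}

static void ticket_lock_release(struct ticket_lock *tl)
{
	unsigned int cur;

	assert(tl != NULL);
	/* Only the holder updates now_serving, so a plain increment is safe. */
	cur = atomic_load_explicit(&tl->now_serving, memory_order_relaxed);
	atomic_store_explicit(&tl->now_serving, cur + 1, memory_order_release);
}
```

The price of this fairness is that every waiter spins on the same now_serving variable, which can generate considerable cache traffic under high contention.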
Table 7.1: VAX/VMS DLM lock-mode compatibility

                        Null CR   CW   PR   PW   EX
Null (Not Held)
Concurrent Read (CR)                             X
Concurrent Write (CW)                  X    X    X
Protected Read (PR)               X         X    X
Protected Write (PW)              X    X    X    X
Exclusive (EX)               X    X    X    X    X
But suppose a large number of readers hold the lock and a writer is waiting to acquire
the lock. Should readers be allowed to continue to acquire the lock, possibly starving the
writer? Similarly, suppose that a writer holds the lock and that a large number of both
readers and writers are waiting to acquire the lock. When the current writer releases the
lock, should it be given to a reader or to another writer? If it is given to a reader, how
many readers should be allowed to acquire the lock before the next writer is permitted
to do so?
There are many possible answers to these questions, with different levels of complexity,
overhead, and fairness. Different implementations might have different costs. For example,
some types of reader-writer locks incur extremely large latencies when switching from
read-holder to write-holder mode. Here are a few possible approaches:
2. Batch-fair implementations ensure that when both readers and writers are acquiring
the lock, both have reasonable access via batching. For example, the lock might
admit five readers per CPU, then two writers, then five more readers per CPU, and
so on.
Of course, these distinctions matter only under conditions of high lock contention.
Please keep the waiting/blocking dual nature of locks firmly in mind. This will
be revisited in Chapter 9’s discussion of scalable high-performance special-purpose
alternatives to locking.
(DLM) [ST87], which is shown in Table 7.1. Blank cells indicate compatible modes,
while cells containing “X” indicate incompatible modes.
The VAX/VMS DLM uses six modes. For purposes of comparison, exclusive locks
use two modes (not held and held), while reader-writer locks use three modes (not held,
read held, and write held).
The first mode is null, or not held. This mode is compatible with all other modes,
which is to be expected: If a thread is not holding a lock, it should not prevent any other
thread from acquiring that lock.
The second mode is concurrent read, which is compatible with every other mode except
for exclusive. The concurrent-read mode might be used to accumulate approximate
statistics on a data structure, while permitting updates to proceed concurrently.
The third mode is concurrent write, which is compatible with null, concurrent read,
and concurrent write. The concurrent-write mode might be used to update approximate
statistics, while still permitting reads and other updates to proceed concurrently.
The fourth mode is protected read, which is compatible with null, concurrent read, and
protected read. The protected-read mode might be used to obtain a consistent snapshot
of the data structure, while permitting reads but not updates to proceed concurrently.
The fifth mode is protected write, which is compatible with null and concurrent
read. The protected-write mode might be used to carry out updates to a data structure
that could interfere with protected readers but which could be tolerated by concurrent
readers.
The sixth and final mode is exclusive, which is compatible only with null. The
exclusive mode is used when it is necessary to exclude all other accesses.
It is interesting to note that exclusive locks and reader-writer locks can be emulated
by the VAX/VMS DLM. Exclusive locks would use only the null and exclusive modes,
while reader-writer locks might use the null, protected-read, and protected-write modes.
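The compatibility rules described above can be captured in a small lookup table. The following is a minimal sketch, not code from any DLM implementation: the dlm_mode enumeration, the dlm_conflict[] array, and dlm_compatible() are all hypothetical names introduced for illustration. The table entries follow the per-mode descriptions given in the text.

```c
#include <assert.h>
#include <stdbool.h>

/* The six VAX/VMS DLM modes, in the order in which the text describes them. */
enum dlm_mode {
	DLM_NL,		/* null (not held) */
	DLM_CR,		/* concurrent read */
	DLM_CW,		/* concurrent write */
	DLM_PR,		/* protected read */
	DLM_PW,		/* protected write */
	DLM_EX,		/* exclusive */
	DLM_NMODES
};

/* dlm_conflict[a][b] is true if modes a and b are incompatible. */
static const bool dlm_conflict[DLM_NMODES][DLM_NMODES] = {
	/*             NL CR CW PR PW EX */
	[DLM_NL] = { 0, 0, 0, 0, 0, 0 },
	[DLM_CR] = { 0, 0, 0, 0, 0, 1 },
	[DLM_CW] = { 0, 0, 0, 1, 1, 1 },
	[DLM_PR] = { 0, 0, 1, 0, 1, 1 },
	[DLM_PW] = { 0, 0, 1, 1, 1, 1 },
	[DLM_EX] = { 0, 1, 1, 1, 1, 1 },
};

/* May mode "want" be granted while some other thread holds mode "held"? */
static bool dlm_compatible(enum dlm_mode held, enum dlm_mode want)
{
	assert(held < DLM_NMODES && want < DLM_NMODES);
	return !dlm_conflict[held][want];
}
```

With this table, treating protected read as the reader mode and protected write as the writer mode gives reader-writer behavior: dlm_compatible(DLM_PR, DLM_PR) is true, while any pairing that involves DLM_PW with DLM_PR or DLM_PW is false.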
Quick Quiz 7.21: Is there any other way for the VAX/VMS DLM to emulate a reader-writer
lock?
Although the VAX/VMS DLM policy has seen widespread production use for
distributed databases, it does not appear to be used much in shared-memory applications.
One possible reason for this is that the greater communication overheads of distributed
databases can hide the greater overhead of the VAX/VMS DLM’s more-complex
admission policy.
Nevertheless, the VAX/VMS DLM is an interesting illustration of just how flexible
the concepts behind locking can be. It also serves as a very simple introduction to the
locking schemes used by modern DBMSes, which can have more than thirty locking
modes, compared to VAX/VMS’s six.
[Figure 7.10: Locking Hierarchy (root rcu_node structure at the top, with CPU 0 through CPU m * N − 1 at the bottom)]
This approach can be quite useful; in fact, in 1990 I was convinced that it was the only
type of locking that was needed.6 One very nice property of RAII locking is that you
don’t need to carefully release the lock on each and every code path that exits that scope,
a property that can eliminate a troublesome set of bugs.
However, RAII locking also has a dark side. RAII makes it quite difficult to encapsulate
lock acquisition and release, for example, in iterators. In many iterator implementations,
you would like to acquire the lock in the iterator’s “start” function and release it in the
iterator’s “stop” function. RAII locking instead requires that the lock acquisition and
release take place in the same level of scoping, making such encapsulation difficult or
even impossible.
Strict RAII locking also prohibits overlapping critical sections, due to the fact that
scopes must nest. This prohibition makes it difficult or impossible to express a number
of useful constructs, for example, locking trees that mediate between multiple concurrent
attempts to assert an event. Of an arbitrarily large group of concurrent attempts, only
one need succeed, and the best strategy for the remaining attempts is for them to fail as
quickly and painlessly as possible. Otherwise, lock contention becomes pathological on
large systems (where “large” is many hundreds of CPUs). Therefore, C++17 [Smi19]
has escapes from strict RAII in its unique_lock class, which allows the scope of the
critical section to be controlled to roughly the same extent as can be achieved with
explicit lock acquisition and release primitives.
Example strict-RAII-unfriendly data structures from Linux-kernel RCU are shown
in Figure 7.10. Here, each CPU is assigned a leaf rcu_node structure, and each
rcu_node structure has a pointer to its parent (named, oddly enough, ->parent), up to
the root rcu_node structure, which has a NULL ->parent pointer. The number of child
rcu_node structures per parent can vary, but is typically 32 or 64. Each rcu_node
structure also contains a lock named ->fqslock.
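The overlapping critical sections that such a hierarchy calls for can be sketched as follows. This is a minimal illustration using POSIX mutexes rather than the Linux kernel's own primitives, and it is not the kernel's code: the names rcu_node_sketch, work_in_progress, do_work(), and post_event_sketch() are hypothetical. The point to notice is that each child's ->fqslock is released only after the attempt to acquire the parent's ->fqslock, so the two critical sections overlap rather than nest.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

struct rcu_node_sketch {
	struct rcu_node_sketch *parent;	/* NULL for the root */
	pthread_mutex_t fqslock;
};

static atomic_int work_in_progress;	/* nonzero while a winner does the work */
static int work_done;			/* number of do_work() calls, guarded by the root's lock */

static void do_work(void)
{
	work_done++;
}

/* Returns 1 if this caller did the work, 0 if it dropped out early. */
static int post_event_sketch(struct rcu_node_sketch *leaf)
{
	struct rcu_node_sketch *rnp = leaf;
	struct rcu_node_sketch *rnp_old = NULL;
	int did_work = 0;

	assert(leaf != NULL);
	for (; rnp != NULL; rnp = rnp->parent) {
		/* Give up if the work is already underway or the lock is busy. */
		int lost = atomic_load(&work_in_progress) ||
			   pthread_mutex_trylock(&rnp->fqslock) != 0;

		/* Drop the child's lock only after trying for the parent's. */
		if (rnp_old != NULL)
			pthread_mutex_unlock(&rnp_old->fqslock);
		if (lost)
			return 0;
		rnp_old = rnp;
	}
	/* At this point rnp_old is the root, whose lock we hold. */
	if (!atomic_load(&work_in_progress)) {
		atomic_store(&work_in_progress, 1);
		do_work();
		atomic_store(&work_in_progress, 0);
		did_work = 1;
	}
	pthread_mutex_unlock(&rnp_old->fqslock);
	return did_work;
}
```

Because every failed attempt returns immediately, at most one caller per rcu_node_sketch structure proceeds to the next level, which bounds contention on the root's lock by the tree's fanout.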
6 My later work with parallelism at Sequent Computer Systems very quickly disabused
Quick Quiz 7.22: The code in Listing 7.8 is ridiculously complicated! Why not conditionally
acquire a single global lock?
Quick Quiz 7.23: Wait a minute! If we “win” the tournament on line 16 of Listing 7.8, we get
to do all the work of do_force_quiescent_state(). Exactly how is that a win, really?
Developers are almost always best-served by using whatever locking primitives are
provided by the system, for example, the POSIX pthread mutex locks [Ope97, But97].
Nevertheless, studying sample implementations can be helpful, as can considering the
challenges posed by extreme workloads and environments.
7 Which is why many RAII locking implementations provide a way to leak the lock out of
the scope in which it was acquired and into the scope in which it is to be released. However, some
object must mediate the scope leaking, which can add complexity compared to non-RAII
explicit locking primitives.
Lock acquisition is carried out by the xchg_lock() function shown on lines 4–10.
This function uses a nested loop, with the outer loop repeatedly atomically exchanging
the value of the lock with the value one (meaning “locked”). If the old value was already
the value one (in other words, someone else already holds the lock), then the inner loop
(lines 7–8) spins until the lock is available, at which point the outer loop makes another
attempt to acquire the lock.
Quick Quiz 7.25: Why bother with the inner loop on lines 7–8 of Listing 7.9? Why not simply
repeatedly do the atomic exchange operation on line 6?
Lock release is carried out by the xchg_unlock() function shown on lines 12–15.
Line 14 atomically exchanges the value zero (“unlocked”) into the lock, thus marking it
as having been released.
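A lock of this general shape can be sketched with C11 atomics. This is an illustration under stated assumptions rather than a reproduction of the listing discussed above: it substitutes the standard atomic_exchange() and atomic_load() operations for whatever primitives that listing uses, and the names xchglock_sketch_t, xchg_lock_sketch(), and xchg_unlock_sketch() are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

typedef atomic_int xchglock_sketch_t;	/* 0: unlocked, 1: locked */

static void xchg_lock_sketch(xchglock_sketch_t *lp)
{
	assert(lp != NULL);
	/* Outer loop: swap in 1; an old value of 0 means we now hold the lock. */
	while (atomic_exchange(lp, 1) == 1) {
		/* Inner loop: spin using only loads until the lock looks free. */
		while (atomic_load(lp) == 1)
			continue;
	}
}

static void xchg_unlock_sketch(xchglock_sketch_t *lp)
{
	assert(lp != NULL);
	/* Atomically exchange zero ("unlocked") into the lock word. */
	(void)atomic_exchange(lp, 0);
}
```

The inner loop means that a waiting thread repeats the exchange only after it has observed the lock word to be zero.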
Quick Quiz 7.26: Why not simply store zero into the lock word on line 14 of Listing 7.9?
This lock is a simple example of a test-and-set lock [SR84], but very similar
mechanisms have been used extensively as pure spinlocks in production.
There are nevertheless some situations where high lock contention is the lesser of the available
evils, and in any case, studying schemes that deal with high levels of contention is a good
mental exercise.
are used in recent versions of the Linux kernel [Cor14b]. Queued locks avoid high
cache-invalidation overhead by assigning each thread a queue element. These queue
elements are linked together into a queue that governs the order in which the lock will be
granted to the waiting threads. The key point is that each thread spins on its own queue
element, so that the lock holder need only invalidate the first element from the next
thread’s CPU’s cache. This arrangement greatly reduces the overhead of lock handoff at
high levels of contention.
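The idea of each thread spinning on its own queue element can be sketched as an MCS-style queued lock using C11 atomics. This is a simplified illustration and is not the Linux kernel's queued-spinlock implementation; the names mcs_lock_sketch, mcs_node_sketch, mcs_init(), mcs_acquire(), and mcs_release() are hypothetical. Each caller supplies its own queue element, which must remain valid until the matching release.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct mcs_node_sketch {
	_Atomic(struct mcs_node_sketch *) next;	/* successor in the queue, if any */
	atomic_int locked;			/* nonzero while this waiter must spin */
};

struct mcs_lock_sketch {
	_Atomic(struct mcs_node_sketch *) tail;	/* last queue element, NULL if free */
};

static void mcs_init(struct mcs_lock_sketch *l)
{
	atomic_store(&l->tail, NULL);
}

static void mcs_acquire(struct mcs_lock_sketch *l, struct mcs_node_sketch *me)
{
	struct mcs_node_sketch *prev;

	assert(l != NULL && me != NULL);
	atomic_store(&me->next, NULL);
	atomic_store(&me->locked, 1);
	/* Enqueue ourselves at the tail. */
	prev = atomic_exchange(&l->tail, me);
	if (prev != NULL) {
		/* Link in behind our predecessor, then spin on our own element. */
		atomic_store(&prev->next, me);
		while (atomic_load(&me->locked))
			continue;
	}
}

static void mcs_release(struct mcs_lock_sketch *l, struct mcs_node_sketch *me)
{
	struct mcs_node_sketch *succ = atomic_load(&me->next);

	if (succ == NULL) {
		struct mcs_node_sketch *expected = me;

		/* No visible successor: try to mark the lock free. */
		if (atomic_compare_exchange_strong(&l->tail, &expected, NULL))
			return;
		/* A successor is enqueueing; wait for it to link itself in. */
		while ((succ = atomic_load(&me->next)) == NULL)
			continue;
	}
	/* Hand off by writing only to the successor's queue element. */
	atomic_store(&succ->locked, 0);
}
```

The handoff in mcs_release() touches only the successor's element, which is what limits cache invalidation to a single waiting CPU.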
More recent queued-lock implementations also take the system’s architecture into
account, preferentially granting locks locally, while also taking steps to avoid
starvation [SSVM02, RH03, RH02, JMRR02, MCM02]. Many of these can be thought of as
analogous to the elevator algorithms traditionally used in scheduling disk I/O.
Unfortunately, the same scheduling logic that improves the efficiency of queued locks
at high contention also increases their overhead at low contention. Beng-Hong Lim and
Anant Agarwal therefore combined a simple test-and-set lock with a queued lock, using
the test-and-set lock at low levels of contention and switching to the queued lock at high
levels of contention [LA94], thus getting low overhead at low levels of contention and
getting fairness and high throughput at high levels of contention. Browning et al. took
a similar approach, but avoided the use of a separate flag, so that the test-and-set fast
path uses the same sequence of instructions that would be used in a simple test-and-set
lock [BMMM05]. This approach has been used in production.
Another issue that arises at high levels of contention is delay of the lock holder,
especially delay due to preemption. Such preemption can result in priority inversion,
in which a low-priority thread holding a lock is preempted by a medium-priority
CPU-bound thread while a high-priority thread blocks attempting to acquire that lock.
The result is that the CPU-bound medium-priority thread prevents the high-priority
thread from running. One solution is priority inheritance [LR80],
which has been widely used for real-time computing [SRL90, Cor06b], despite some
lingering controversy over this practice [Yod04a, Loc02].
Another way to avoid priority inversion is to prevent preemption while a lock is held.
Because preventing preemption while locks are held also improves throughput, most
proprietary UNIX kernels offer some form of scheduler-conscious synchronization
mechanism [KWS97], largely due to the efforts of a certain sizable database vendor.
These mechanisms usually take the form of a hint that preemption should be avoided in
a given region of code. This hint is typically a bit set in a particular machine
register, which enables
extremely low per-lock-acquisition overhead for these mechanisms. In contrast, Linux
avoids these hints. Instead, the Linux kernel community’s response to requests for
scheduler-conscious synchronization was a mechanism called futexes [FRK02, Mol06,
Ros06, Dre11].
Interestingly enough, atomic instructions are not strictly needed to implement
locks [Dij65, Lam74]. An excellent exposition of the issues surrounding locking
implementations based on simple loads and stores may be found in Herlihy and Shavit’s
textbook [HS08, HSLS20]. The main point echoed here is that such implementations
currently have little practical application, although a careful study of them can be both
entertaining and enlightening. Nevertheless, with one exception described below, such
study is left as an exercise for the reader.
Gamsa et al. [GKAS99, Section 5.3] describe a token-based mechanism in which a
token circulates among the CPUs. When the token reaches a given CPU, it has exclusive
access to anything protected by that token. There are any number of schemes that may
be used to implement the token-based mechanism, for example:
1. Maintain a per-CPU flag, which is initially zero for all but one CPU. When a
CPU’s flag is non-zero, it holds the token. When it finishes with the token, it zeroes
its flag and sets the flag of the next CPU to one (or to any other non-zero value).
2. Maintain a per-CPU counter, which is initially set to the corresponding CPU’s
number, which we assume to range from zero to 𝑁 − 1, where 𝑁 is the number
of CPUs in the system. When a CPU’s counter is greater than that of the next
CPU (taking counter wrap into account), the first CPU holds the token. When it is
finished with the token, it sets the next CPU’s counter to a value one greater than
its own counter.
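The two schemes above can be sketched as follows. This is a minimal illustration rather than the implementation of Gamsa et al.: NCPUS and all of the function and variable names are hypothetical, and the "CPUs" are simply array indices. The counter comparison relies on converting an unsigned difference to a signed integer, which behaves as a wrap-tolerant comparison on common two's-complement implementations.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define NCPUS 4	/* hypothetical number of CPUs for this sketch */

/* Scheme 1: per-CPU flag, initially nonzero only for CPU 0. */
static atomic_int token_flag[NCPUS];

static void flag_init(void)
{
	for (int i = 0; i < NCPUS; i++)
		atomic_store(&token_flag[i], i == 0);
}

static bool flag_holds_token(int cpu)
{
	return atomic_load(&token_flag[cpu]) != 0;
}

static void flag_pass_token(int cpu)
{
	assert(flag_holds_token(cpu));
	atomic_store(&token_flag[cpu], 0);
	atomic_store(&token_flag[(cpu + 1) % NCPUS], 1);
}

/* Scheme 2: per-CPU counter, initially equal to the CPU's number. */
static atomic_uint token_ctr[NCPUS];

static void ctr_init(void)
{
	for (int i = 0; i < NCPUS; i++)
		atomic_store(&token_ctr[i], (unsigned int)i);
}

static bool ctr_holds_token(int cpu)
{
	unsigned int mine = atomic_load(&token_ctr[cpu]);
	unsigned int next = atomic_load(&token_ctr[(cpu + 1) % NCPUS]);

	/* Wrap-tolerant "mine > next": the difference, viewed as signed, is positive. */
	return (int)(mine - next) > 0;
}

static void ctr_pass_token(int cpu)
{
	assert(ctr_holds_token(cpu));
	/* Set the next CPU's counter to one greater than our own. */
	atomic_store(&token_ctr[(cpu + 1) % NCPUS],
		     atomic_load(&token_ctr[cpu]) + 1);
}
```

Note that with the counter scheme's initial values, it is the highest-numbered CPU that holds the token first, because only its counter exceeds that of its successor.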
Quick Quiz 7.27: How can you tell if one counter is greater than another, while accounting
for counter wrap?
Quick Quiz 7.28: Which is better, the counter approach or the flag approach?
This lock is unusual in that a given CPU cannot necessarily acquire it immediately,
even if no other CPU is using it at the moment. Instead, the CPU must wait until the
token comes around to it. This is useful in cases where CPUs need periodic access
to the critical section, but can tolerate variances in token-circulation rate. Gamsa et
al. [GKAS99] used it to implement a variant of read-copy update (see Section 9.5), but
it could also be used to protect periodic per-CPU operations such as flushing per-CPU
caches used by memory allocators [MS93], garbage-collecting per-CPU data structures,
or flushing per-CPU data to shared storage (or to mass storage, for that matter).
The Linux kernel now uses queued spinlocks [Cor14b], but because of the complexity
of implementations that provide good performance across the range of contention
levels, the path has not always been smooth [Mar18, Dea18]. As increasing numbers
of people gain familiarity with parallel hardware and parallelize increasing amounts
of code, we can continue to expect more special-purpose locking primitives to appear;
see, for example, Guerraoui et al. [GGL+19, Gui18]. Nevertheless, you should carefully
consider this important safety tip: Use the standard synchronization primitives whenever
humanly possible. The big advantage of the standard synchronization primitives over
roll-your-own efforts is that the standard primitives are typically much less bug-prone.9
However, you will notice that my hair is much greyer than it was before I started doing that
sort of work. Coincidence? Maybe. But are you really willing to risk your own hair turning
prematurely grey?
1. Global variables and static local variables in the base module will exist as long as
the application is running.
2. Global variables and static local variables in a loaded module will exist as long as
that module remains loaded.
3. A module will remain loaded as long as at least one of its functions has an active
instance.
4. A given function instance’s on-stack variables will exist until that instance returns.
5. If you are executing within a given function or have been called (directly or
indirectly) from that function, then the given function has an active instance.
To see one of these race conditions, consider the following sequence of events:
1. Thread 0 invokes delete(0), and reaches line 10 of the listing, acquiring the lock.
2. Thread 1 concurrently invokes delete(0), reaching line 10, but spins on the lock
because Thread 0 holds it.
3. Thread 0 executes lines 11–14, removing the element from the hashtable, releasing
the lock, and then freeing the element.
4. Thread 0 continues execution, and allocates memory, getting the exact block of
memory that it just freed.
5. Thread 0 then initializes this block of memory as some other type of structure.
6. Thread 1’s spin_lock() operation fails due to the fact that what it believes to be
p->lock is no longer a spinlock.
Because there is no existence guarantee, the identity of the data element can change
while a thread is attempting to acquire that element’s lock on line 10!
One way to fix this example is to use a hashed set of global locks, so that each hash
bucket has its own lock, as shown in Listing 7.11. This approach allows acquiring
the proper lock (on line 9) before gaining a pointer to the data element (on line 10).
Although this approach works quite well for elements contained in a single partitionable
data structure such as the hash table shown in the listing, it can be problematic if a
given data element can be a member of multiple hash tables or of more-complex
data structures such as trees or graphs. Not only can these problems be solved,
but the solutions also form the basis of lock-based software transactional memory
implementations [ST95, DSS06]. However, Chapter 9 describes simpler—and faster—
ways of providing existence guarantees.
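The per-bucket locking approach can be sketched as follows, using POSIX mutexes in place of spinlocks. This is an illustration under stated assumptions and not the listing referred to above: NBUCKETS, element_sketch, table_init(), insert_sketch(), and delete_sketch() are hypothetical names, and each bucket holds at most one element to keep the example short. The essential point is that the lock lives in a global table indexed by hash value, so it can be acquired before any pointer to the element is picked up.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

#define NBUCKETS 16	/* hypothetical table size */

struct element_sketch {
	int key;
};

static struct element_sketch *hashtable[NBUCKETS];
static pthread_mutex_t locktable[NBUCKETS];	/* one global lock per bucket */

static int hashfunction(int key)
{
	return (int)((unsigned int)key % NBUCKETS);
}

static void table_init(void)
{
	for (int i = 0; i < NBUCKETS; i++) {
		int ret = pthread_mutex_init(&locktable[i], NULL);

		assert(ret == 0);
		(void)ret;
		hashtable[i] = NULL;
	}
}

/* Returns 1 if inserted, 0 if the bucket was occupied or allocation failed. */
static int insert_sketch(int key)
{
	int b = hashfunction(key);
	struct element_sketch *p = malloc(sizeof(*p));

	if (p == NULL)
		return 0;
	p->key = key;
	pthread_mutex_lock(&locktable[b]);
	if (hashtable[b] != NULL) {
		pthread_mutex_unlock(&locktable[b]);
		free(p);
		return 0;
	}
	hashtable[b] = p;
	pthread_mutex_unlock(&locktable[b]);
	return 1;
}

/* Returns 1 if the element was found and freed, 0 otherwise. */
static int delete_sketch(int key)
{
	int b = hashfunction(key);
	struct element_sketch *p;

	pthread_mutex_lock(&locktable[b]);	/* acquire the lock first... */
	p = hashtable[b];			/* ...and only then fetch the pointer */
	if (p == NULL || p->key != key) {
		pthread_mutex_unlock(&locktable[b]);
		return 0;
	}
	hashtable[b] = NULL;
	pthread_mutex_unlock(&locktable[b]);
	free(p);	/* safe: no other thread can still reach p via the table */
	return 1;
}
```

Because the lock is not part of the element, freeing the element cannot invalidate a lock that some other thread is in the midst of acquiring.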
As is often the case in real life, locking can be either hero or villain, depending on
how it is used and on the problem at hand. In my experience, those writing whole
applications are happy with locking, those writing parallel libraries are less happy, and
those parallelizing existing sequential libraries are extremely unhappy. The following
sections discuss some reasons for these differences in viewpoints.
then the application can associate a lock with each tree. The application then acquires
and releases locks as needed, so that the library need not be aware of parallelism at all.
Instead, the application controls the parallelism, so that locking can work very well, as
was discussed in Section 7.5.1.
However, this strategy fails if the library implements a data structure that requires
internal concurrency, for example, a hash table or a parallel sort. In this case, the library
absolutely must control its own synchronization.
1. If the application invokes the library function from within a signal handler, then
that signal must be blocked every time that the library function is invoked from
outside of a signal handler.
2. If the application invokes the library function while holding a lock acquired within
a given signal handler, then that signal must be blocked every time that the library
function is called outside of a signal handler.
These rules can be enforced by using tools similar to the Linux kernel’s lockdep lock
dependency checker [Cor06a]. One of the great strengths of lockdep is that it is not
fooled by human intuition [Ros11].
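Following these rules by hand means wrapping each non-handler call to the library function with code that blocks the relevant signal. A minimal sketch using pthread_sigmask() appears below; the helper names signal_is_blocked() and call_with_signal_blocked() are hypothetical, as is library_function(), which stands in for the real library call and merely records whether the signal was blocked when it ran.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <signal.h>
#include <stddef.h>

/* Is sig currently blocked in the calling thread? */
static int signal_is_blocked(int sig)
{
	sigset_t cur;

	/* A NULL new mask leaves the mask unchanged and just reports it. */
	pthread_sigmask(SIG_SETMASK, NULL, &cur);
	return sigismember(&cur, sig) == 1;
}

static int saw_blocked;	/* set by the stand-in library function */

/* Hypothetical stand-in for a library function that acquires locks. */
static void library_function(void)
{
	saw_blocked = signal_is_blocked(SIGUSR1);
}

/* Invoke fn() with sig blocked, then restore the caller's signal mask. */
static int call_with_signal_blocked(int sig, void (*fn)(void))
{
	sigset_t block, saved;
	int ret;

	assert(fn != NULL);
	sigemptyset(&block);
	sigaddset(&block, sig);
	ret = pthread_sigmask(SIG_BLOCK, &block, &saved);
	if (ret)
		return ret;
	fn();
	return pthread_sigmask(SIG_SETMASK, &saved, NULL);
}
```

Restoring the saved mask, rather than unconditionally unblocking the signal, keeps the wrapper correct when the caller already had that signal blocked.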
1. The data structures protected by that lock are likely to be in some intermediate
state, so that naively breaking the lock might result in arbitrary memory corruption.
2. If the child creates additional threads, two threads might break the lock concurrently,
with the result that both threads believe they own the lock. This could again result
in arbitrary memory corruption.
The pthread_atfork() function is provided to help deal with these situations. The
idea is to register a triplet of functions, one to be called by the parent before the fork(),
one to be called by the parent after the fork(), and one to be called by the child after
the fork(). Appropriate cleanups can then be carried out at these three points.
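One common arrangement can be sketched as follows: the handler run before the fork() acquires the library's lock, so that the protected data is in a consistent state when the child's copy is made, and the two handlers run after the fork() release it. This is a minimal sketch under those assumptions. The names cache_lock, cache_entries, and install_atfork_handlers() are hypothetical, and the child handler additionally resets the hypothetical library state.

```c
#include <assert.h>
#include <pthread.h>

static pthread_mutex_t cache_lock = PTHREAD_MUTEX_INITIALIZER;
static int cache_entries;	/* hypothetical library state guarded by cache_lock */

/* Called in the parent before fork(): quiesce the data structure. */
static void atfork_prepare(void)
{
	pthread_mutex_lock(&cache_lock);
}

/* Called in the parent after fork(): resume normal operation. */
static void atfork_parent(void)
{
	pthread_mutex_unlock(&cache_lock);
}

/* Called in the child after fork(): start from a known state. */
static void atfork_child(void)
{
	cache_entries = 0;	/* the child begins with an empty cache */
	pthread_mutex_unlock(&cache_lock);
}

/* Register the three handlers; returns 0 on success. */
static int install_atfork_handlers(void)
{
	int ret = pthread_atfork(atfork_prepare, atfork_parent, atfork_child);

	assert(ret >= 0);
	return ret;
}
```

A library would typically call install_atfork_handlers() once during its initialization.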
Be warned, however, that coding of pthread_atfork() handlers is quite subtle in
general. The cases where pthread_atfork() works best are cases where the data
structure in question can simply be re-initialized by the child. This might be one
reason why the POSIX standard forbids use of any non-async-signal-safe functions
between the fork() and the exec(), which rules out acquisition of locks during that
time.
Other alternatives to fork()/exec() include posix_spawn() and
io_uring_spawn() [Tri22, Edg22].
constructing parallel libraries using locking is possible, but not as easy as constructing
a parallel application.
These flaws and the consequences for locking are discussed in the following sections.
1. Determining when to resize the hash table. In this case, an approximate count
should work quite well. It might also be useful to trigger the resizing operation
from the length of the longest chain, which can be computed and maintained in a
nicely partitioned per-chain manner.
2. Producing an estimate of the time required to traverse the entire hash table. An
approximate count works well in this case, also.
3. For diagnostic purposes, for example, to check for items being lost when transferring
them to and from the hash table. This clearly requires an exact count. However,
given that this usage is diagnostic in nature, it might suffice to maintain the lengths
of the hash chains, then to infrequently sum them up while locking out addition
and deletion operations.
It turns out that there is now a strong theoretical basis for some of the constraints that
performance and scalability place on a parallel library’s APIs [AGH+11a, AGH+11b,
McK11b]. Anyone designing a parallel library needs to pay close attention to those
constraints.
Although it is all too easy to blame locking for what are really problems due to a
concurrency-unfriendly API, doing so is not helpful. On the other hand, one has little
choice but to sympathize with the hapless developer who made this choice in (say)
1985. It would have been a rare and courageous developer to anticipate the need for
parallelism at that time, and it would have required an even more rare combination of
brilliance and luck to actually arrive at a good parallel-friendly API.
Times change, and code must change with them. That said, there might be a huge
number of users of a popular library, in which case an incompatible change to the API
would be quite foolish. Adding a parallel-friendly API to complement the existing
heavily used sequential-only API is usually the best course of action.
Nevertheless, human nature being what it is, we can expect our hapless developer
to be more likely to complain about locking than about his or her own poor (though
understandable) API design choices.
level rather than the thread level. In general, if a task is proving extremely hard, it is
worth some time spent thinking about not only alternative ways to accomplish that
particular task, but also alternative tasks that might better solve the problem at hand.
7.6 Summary
Achievement unlocked.
Unknown
Locking is perhaps the most widely used and most generally useful synchronization
tool. However, it works best when designed into an application or library from the
beginning. Given the large quantity of pre-existing single-threaded code that might
need to one day run in parallel, locking should therefore not be the only tool in your
parallel-programming toolbox. The next few chapters will discuss other tools, and how
they can best be used in concert with locking and with each other.
It is mine, I tell you. My own. My precious. Yes, my
precious.
Gollum in The Fellowship of the Ring, J.R.R. Tolkien
Chapter 8
Data Ownership
One of the simplest ways to avoid the synchronization overhead that comes with locking
is to parcel the data out among the threads (or, in the case of kernels, CPUs) so that a
given piece of data is accessed and modified by only one of the threads. Interestingly
enough, data ownership covers each of the “big three” parallel design techniques: It
partitions over threads (or CPUs, as the case may be), it batches all local operations, and
its elimination of synchronization operations is weakening carried to its logical extreme.
It should therefore be no surprise that data ownership is heavily used: Even novices use
it almost instinctively. In fact, it is so heavily used that this chapter will not introduce
any new examples, but will instead refer back to those of previous chapters.
Quick Quiz 8.1: What form of data ownership is extremely difficult to avoid when creating
shared-memory parallel programs (for example, using pthreads) in C or C++?
There are a number of approaches to data ownership. Section 8.1 presents the
logical extreme in data ownership, where each thread has its own private address space.
Section 8.2 looks at the opposite extreme, where the data is shared, but different threads
own different access rights to the data. Section 8.3 describes function shipping, which
is a way of allowing other threads to have indirect access to data owned by a particular
thread. Section 8.4 describes how designated threads can be assigned ownership of a
specified function and the related data. Section 8.5 discusses improving performance
by transforming algorithms with shared data to instead use data ownership. Finally,
Section 8.6 lists a few software environments that feature data ownership as a first-class
citizen.
This example runs two instances of the compute_it program in parallel, as separate
processes that do not share memory. Therefore, all data in a given process is owned by that
process, so that almost the entirety of data in the above example is owned. This approach
almost entirely eliminates synchronization overhead. The resulting combination of
extreme simplicity and optimal performance is obviously quite attractive.
Quick Quiz 8.2: What synchronization remains in the example shown in Section 8.1?
Quick Quiz 8.3: Is there any shared data in the example shown in Section 8.1?
This same pattern can be written in C as well as in sh, as illustrated by Listings 4.1
and 4.2.
It bears repeating that these trivial forms of parallelism are not in any way cheating
or ducking responsibility, but are rather simple and elegant ways to make your code
run faster. This approach is fast, scales well, is easy to program, is easy to maintain,
and gets the job
done. In addition, taking this approach (where applicable) allows the developer more
time to focus on other things, whether these involve applying sophisticated
single-threaded optimizations to compute_it or applying sophisticated
parallel-programming patterns to portions of the code where this approach is inapplicable.
What is not to like?
The next section discusses the use of data ownership in shared-memory parallel
programs.
Concurrent counting (see Chapter 5) uses data ownership heavily, but adds a twist.
Threads are not allowed to modify data owned by other threads, but they are permitted to
read it. In short, the use of shared memory allows more nuanced notions of ownership
and access rights.
For example, consider the per-thread statistical counter implementation shown in
Listing 5.4 on page 83. Here, inc_count() updates only the corresponding thread’s
instance of counter, while read_count() accesses, but does not modify, all threads’
instances of counter.
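This ownership pattern can be sketched as follows. The sketch is not the listing referred to above: it uses an explicit thread index and C11 relaxed atomics in place of per-thread variables, and NTHREADS, inc_count_sketch(), and read_count_sketch() are hypothetical names. Only the owning thread ever writes a given array element, while any thread may read all of them.

```c
#include <assert.h>
#include <stdatomic.h>

#define NTHREADS 4	/* hypothetical maximum number of threads */

static atomic_ulong counter[NTHREADS];	/* counter[i] is owned by thread i */

/* Called only by thread "me": updates only that thread's own instance. */
static void inc_count_sketch(int me)
{
	unsigned long old;

	assert(me >= 0 && me < NTHREADS);
	/* A load followed by a store suffices because there is only one writer. */
	old = atomic_load_explicit(&counter[me], memory_order_relaxed);
	atomic_store_explicit(&counter[me], old + 1, memory_order_relaxed);
}

/* Callable by any thread: reads, but never modifies, every instance. */
static unsigned long read_count_sketch(void)
{
	unsigned long sum = 0;

	for (int i = 0; i < NTHREADS; i++)
		sum += atomic_load_explicit(&counter[i], memory_order_relaxed);
	return sum;
}
```

Because each element has exactly one writer, no atomic read-modify-write operation is needed on the update path.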
Quick Quiz 8.4: Does it ever make sense to have partial data ownership where each thread
reads only its own instance of a per-thread variable, but writes to other threads’ instances?
Partial data ownership is also common within the Linux kernel. For example, a given
CPU might be permitted to read a given set of its own per-CPU variables only with
interrupts disabled, while another CPU might be permitted to read that same set of the first
CPU’s per-CPU variables only when holding the corresponding per-CPU lock. Then
that given CPU would be permitted to update this set of its own per-CPU variables if
it both has interrupts disabled and holds its per-CPU lock. This arrangement can be
thought of as a reader-writer lock that allows each CPU very low-overhead access to its
own set of per-CPU variables. There are a great many variations on this theme.
For its own part, pure data ownership is also both common and useful, for example,
the per-thread memory-allocator caches discussed in Section 6.4.3 starting on page 141.
In this algorithm, each thread’s cache is completely private to that thread.
The previous section described a weak form of data ownership where threads reached
out to other threads’ data. This can be thought of as bringing the data to the functions
that need it. An alternative approach is to send the functions to the data.
Such an approach is illustrated in Section 5.4.3 beginning on page 101, in particular the
flush_local_count_sig() and flush_local_count() functions in Listing 5.18
on page 103.
The flush_local_count_sig() function is a signal handler that acts as the shipped
function. The pthread_kill() function in flush_local_count() sends the signal—
shipping the function—and then waits until the shipped function executes. This shipped
function has the not-unusual added complication of needing to interact with any
concurrently executing add_count() or sub_count() functions (see Listing 5.19 on
page 104 and Listing 5.20 on page 105).
Quick Quiz 8.5: What mechanisms other than POSIX signals may be used for function
shipping?
The earlier sections describe ways of allowing each thread to keep its own copy or its
own portion of the data. In contrast, this section describes a functional-decomposition
approach, where a special designated thread owns the rights to the data that is required to
do its job. The eventually consistent counter implementation described in Section 5.2.4
provides an example. This implementation has a designated thread that runs the
eventual() function shown on lines 17–32 of Listing 5.5. This eventual() thread
periodically pulls the per-thread counts into the global counter, so that accesses to the
global counter will, as the name says, eventually converge on the actual value.
Quick Quiz 8.6: But none of the data in the eventual() function shown on lines 17–32
of Listing 5.5 is actually owned by the eventual() thread! In just what way is this data
ownership???
8.5 Privatization
Data ownership works best when the data can be partitioned so that there is little or no
need for cross-thread access or update. Fortunately, this situation is reasonably common,
and in a wide variety of parallel-programming environments.
Examples of data ownership include:
1 But note that a great many other classes of applications have also been ported to
All things come to those who wait.
Violet Fane
Chapter 9
Deferred Processing
The strategy of deferring work goes back before the dawn of recorded history. It has
occasionally been derided as procrastination or even as sheer laziness. However, in
the last few decades workers have recognized this strategy’s value in simplifying and
streamlining parallel algorithms [KL80, Mas92]. Believe it or not, “laziness” in parallel
programming often outperforms and out-scales industriousness! These performance
and scalability benefits stem from the fact that deferring work can enable weakening of
synchronization primitives, thereby reducing synchronization overhead.
Those who are willing and able to read and understand this chapter will uncover many
mysteries, including:
2. A concurrent reference counter that not only avoids this trap, but also avoids
expensive atomic read-modify-write accesses, and in addition avoids
writes of any kind to the data structure being traversed.
5. A synchronization primitive whose use cases are far more conceptually
complex than is the primitive itself.
General approaches to work deferral include reference counting (Section 9.2), hazard
pointers (Section 9.3), sequence locking (Section 9.4), and RCU (Section 9.5). Finally,
Section 9.6 describes how to choose among the work-deferral schemes covered in this
chapter and Section 9.7 discusses updates. But first, Section 9.1 will introduce an
example algorithm that will be used to compare and contrast these approaches.
This chapter will use a simplified packet-routing algorithm to demonstrate the value
of these approaches and to allow them to be compared. Routing algorithms are used
in operating-system kernels to deliver each outgoing TCP/IP packet to the appropriate
network interface. This particular algorithm is a simplified version of the classic 1980s
packet-train-optimized algorithm used in BSD UNIX [Jac88], consisting of a simple
linked list.1 Modern routing algorithms use more complex data structures; however, a
simple algorithm will help highlight issues specific to parallelism in a straightforward
setting.
We further simplify the algorithm by reducing the search key from a quadruple
consisting of source and destination IP addresses and ports all the way down to a simple
integer. The value looked up and returned will also be a simple integer, so that the data
structure is as shown in Figure 9.1, which directs packets with address 42 to interface 1,
address 56 to interface 3, and address 17 to interface 7. This list will normally be
searched frequently and updated rarely. In Chapter 3 we learned that the best ways to
evade inconvenient laws of physics, such as the finite speed of light and the atomic
nature of matter, are to either partition the data or to rely on read-mostly sharing. This
chapter applies read-mostly sharing techniques to Pre-BSD packet routing.
Listing 9.1 (route_seq.c) shows a simple single-threaded implementation corresponding
to Figure 9.1. Lines 1–5 define a route_entry structure and line 6 defines
the route_list header. Lines 8–20 define route_lookup(), which sequentially
searches route_list, returning the corresponding ->iface, or ULONG_MAX if there is
no such route entry. Lines 22–33 define route_add(), which allocates a route_entry
structure, initializes it, and adds it to the list, returning -ENOMEM in case of memory-
allocation failure. Finally, lines 35–47 define route_del(), which removes and frees
the specified route_entry structure if it exists, or returns -ENOENT otherwise.
This single-threaded implementation serves as a prototype for the various concurrent
implementations in this chapter, and also as an estimate of ideal scalability and
performance.
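The single-threaded algorithm described above can be sketched as follows. This is a minimal illustration rather than a reproduction of Listing 9.1, so the line numbers cited in the text do not apply to it. It assumes a hand-rolled singly linked list, and the ->re_next field name is a hypothetical choice made for this sketch.

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Sketch only: ->re_next is an assumed field name. */
struct route_entry {
	struct route_entry *re_next;
	unsigned long addr;
	unsigned long iface;
};

static struct route_entry *route_list;	/* List header, initially empty. */

/* Return the interface for addr, or ULONG_MAX if there is no such entry. */
unsigned long route_lookup(unsigned long addr)
{
	struct route_entry *rep;

	for (rep = route_list; rep != NULL; rep = rep->re_next)
		if (rep->addr == addr)
			return rep->iface;
	return ULONG_MAX;
}

/* Add a route, returning 0 on success or -ENOMEM on allocation failure. */
int route_add(unsigned long addr, unsigned long interface)
{
	struct route_entry *rep = malloc(sizeof(*rep));

	if (rep == NULL)
		return -ENOMEM;
	rep->addr = addr;
	rep->iface = interface;
	rep->re_next = route_list;	/* Insert at the head of the list. */
	route_list = rep;
	return 0;
}

/* Remove and free the entry for addr, or return -ENOENT if it is absent. */
int route_del(unsigned long addr)
{
	struct route_entry **repp;
	struct route_entry *rep;

	for (repp = &route_list; (rep = *repp) != NULL; repp = &rep->re_next) {
		if (rep->addr == addr) {
			*repp = rep->re_next;	/* Unlink, then free. */
			free(rep);
			return 0;
		}
	}
	return -ENOENT;
}
```

For example, after route_add(42, 1), a call to route_lookup(42) returns 1, and once route_del(42) has succeeded, later lookups of address 42 return ULONG_MAX.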
1 In other words, this is not OpenBSD, NetBSD, or even FreeBSD, but none other than
Pre-BSD.
9.1. RUNNING EXAMPLE 203
Reference counting tracks the number of references to a given object in order to prevent
that object from being prematurely freed. As such, it has a long and honorable history
of use dating back to at least an early 1960s Weizenbaum paper [Wei63]. Weizenbaum
discusses reference counting as if it were already well known, so it likely dates back to
the 1950s or even to the 1940s. And perhaps even further, given that people repairing
large dangerous machines have long used a mechanical reference-counting technique
implemented via padlocks. Before entering the machine, each worker locks a padlock
onto the machine’s on/off switch, thus preventing the machine from being powered
on while that worker is inside. Reference counting is thus an excellent time-honored
candidate for a concurrent implementation of Pre-BSD routing.
To that end, Listing 9.2 shows data structures and the route_lookup() function
and Listing 9.3 shows the route_add() and route_del() functions (both in
route_refcnt.c). Since these algorithms are quite similar to the sequential algorithm shown
in Listing 9.1, only the differences will be discussed.
Starting with Listing 9.2, line 2 adds the actual reference counter, line 6 adds
a ->re_freed use-after-free check field, line 9 adds the routelock that will be
used to synchronize concurrent updates, and lines 11–15 add re_free(), which sets
->re_freed, enabling route_lookup() to check for use-after-free bugs. In
route_lookup() itself, lines 29–30 release the reference count of the prior element and free
it if the count becomes zero, and lines 34–42 acquire a reference on the new element,
with lines 35 and 36 performing the use-after-free check.
Quick Quiz 9.1: Why bother with a use-after-free check?
In Listing 9.3, lines 11, 15, 24, 32, and 39 introduce locking to synchronize concurrent
updates. Line 13 initializes the ->re_freed use-after-free-check field, and finally
lines 33–34 invoke re_free() if the new value of the reference count is zero.
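The reference-count handling described above might be sketched as follows, using C11 atomics. The helper names re_alloc(), re_get(), and re_put() are hypothetical and are not taken from route_refcnt.c, and the sketch omits both the list traversal and the update-side lock. Please note that re_get() reads fields of an entry that a concurrent route_del() might already have freed, so the ->re_freed check is a debugging aid rather than a guarantee of safety.

```c
#include <stdatomic.h>
#include <stdlib.h>

struct route_entry {
	atomic_int re_refcnt;	/* The list's own reference plus one per reader. */
	unsigned long addr;
	unsigned long iface;
	atomic_int re_freed;	/* Set just before freeing, for debug checks. */
};

/* Hypothetical helper: allocate an entry holding the list's reference. */
struct route_entry *re_alloc(unsigned long addr, unsigned long iface)
{
	struct route_entry *rep = malloc(sizeof(*rep));

	if (rep == NULL)
		return NULL;
	atomic_init(&rep->re_refcnt, 1);
	atomic_init(&rep->re_freed, 0);
	rep->addr = addr;
	rep->iface = iface;
	return rep;
}

/* Mark the entry as freed so that later checks can notice, then free it. */
static void re_free(struct route_entry *rep)
{
	atomic_store(&rep->re_freed, 1);
	free(rep);
}

/* Hypothetical helper: acquire a reference, or return -1 if the
 * use-after-free check trips. */
int re_get(struct route_entry *rep)
{
	if (atomic_load(&rep->re_freed))
		return -1;
	atomic_fetch_add(&rep->re_refcnt, 1);
	return 0;
}

/* Hypothetical helper: release a reference. Returns 1 if this was the
 * last reference and the entry was therefore freed, 0 otherwise. */
int re_put(struct route_entry *rep)
{
	if (atomic_fetch_sub(&rep->re_refcnt, 1) == 1) {
		re_free(rep);
		return 1;
	}
	return 0;
}
```

Every re_get() and re_put() in this sketch is an atomic read-modify-write operation on memory shared among all readers of that entry, even when the workload performs no updates.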
Quick Quiz 9.2: Why doesn’t route_del() in Listing 9.3 use reference counts to protect the
traversal to the element to be freed?
Figure 9.2 shows the performance and scalability of reference counting on a read-only
workload with a ten-element list running on an eight-socket 28-core-per-socket
hyperthreaded 2.1 GHz x86 system with a total of 448 hardware threads
(hps.2019.12.02a/lscpu.hps). The “ideal” trace was generated by running the sequential code
shown in Listing 9.1, which works only because this is a read-only workload. The
reference-counting performance is abysmal and its scalability even more so, with the
“refcnt” trace indistinguishable from the x-axis. This should be no surprise in view of
Chapter 3: The reference-count acquisitions and releases have added frequent shared-
memory writes to an otherwise read-only workload, thus incurring severe retribution
from the laws of physics. As well it should, given that all the wishful thinking in the
world is not going to increase the speed of light or decrease the size of the atoms used
in modern digital electronics.
Quick Quiz 9.3: Why the break in the “ideal” line at 224 CPUs in Figure 9.2? Shouldn’t it be
a straight line?
9.2. REFERENCE COUNTING 205
[Figure 9.2: Lookups per Millisecond (0 to 2.5x10^7) versus Number of CPUs (Threads) (0 to 450), with traces “ideal” and “refcnt”]
Quick Quiz 9.4: Shouldn’t the refcnt trace in Figure 9.2 be at least a little bit off of the
x-axis???
2. Thread B invokes route_del() in Listing 9.3 to delete the route entry for
address 42. It completes successfully, and because this entry’s ->re_refcnt field
was equal to the value one, it invokes re_free() to set the ->re_freed field and
to free the entry.
9.3. HAZARD POINTERS 207
The problem is that the reference count is located in the object to be protected, but
that means that there is no protection during the instant in time when the reference
count itself is being acquired! This is the reference-counting counterpart of a locking
issue noted by Gamsa et al. [GKAS99]. One could imagine using a global lock or
reference count to protect the per-route-entry reference-count acquisition, but this
would result in severe contention issues. Although algorithms exist that allow safe
reference-count acquisition in a concurrent environment [Val95], they are not only
extremely complex and error-prone [MS95], but also provide terrible performance and
scalability [HMBW07].
In short, concurrency has most definitely reduced the usefulness of reference counting!
Of course, as with other synchronization primitives, reference counts also have well-
known ease-of-use shortcomings. These can result in memory leaks on the one hand or
premature freeing on the other.
And this is the reference-counting trap that awaits unwary developers of concurrent
code, noted back on page 201.
Quick Quiz 9.5: If concurrency has “most definitely reduced the usefulness of reference
counting”, why are there so many reference counters in the Linux kernel?
One way of avoiding problems with concurrent reference counting is to implement the
reference counters inside out, that is, rather than incrementing an integer stored in the
data element, instead store a pointer to that data element in per-CPU (or per-thread)
lists. Each element of these lists is called a hazard pointer [Mic04a].2 The value of a
given data element’s “virtual reference counter” can then be obtained by counting the
number of hazard pointers referencing that element. Therefore, if that element has been
rendered inaccessible to readers, and there are no longer any hazard pointers referencing
it, that element may safely be freed.
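The “virtual reference counter” idea can be sketched as a scan over the per-thread hazard-pointer lists. This is a minimal illustration under stated assumptions: a fixed number of threads, a fixed number of hazard pointers per thread, and the hypothetical names hazptrs, hp_refcount(), and hp_may_free(), none of which come from hazptr.h.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_THREADS 4	/* Assumed fixed thread count. */
#define HP_PER_THREAD 2	/* Assumed number of hazard pointers per thread. */

/* Per-thread lists of hazard pointers; NULL marks an unused slot. */
static _Atomic(void *) hazptrs[NR_THREADS][HP_PER_THREAD];

/* Count the hazard pointers referencing p, in other words, compute
 * the value of p's virtual reference counter. */
int hp_refcount(void *p)
{
	int n = 0;

	for (int t = 0; t < NR_THREADS; t++)
		for (int i = 0; i < HP_PER_THREAD; i++)
			if (atomic_load(&hazptrs[t][i]) == p)
				n++;
	return n;
}

/* An element that has already been made inaccessible to readers may be
 * freed once no hazard pointer references it. */
bool hp_may_free(void *removed)
{
	return hp_refcount(removed) == 0;
}
```

Unlike an in-object counter, the readers' stores go to their own hazard-pointer slots, so the data structure being traversed is never written by readers.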
Of course, this means that hazard-pointer acquisition must be carried out quite carefully
in order to avoid destructive races with concurrent deletion. One implementation is
shown in Listing 9.4, which shows hp_try_record() on lines 1–16, hp_record()
on lines 18–27, and hp_clear() on lines 29–33 (hazptr.h).
The hp_try_record() macro on line 16 is simply a casting wrapper for the
_h_t_r_impl() function, which attempts to store the pointer referenced by p into the hazard
pointer referenced by hp. If successful, it returns the value of the stored pointer. If it
fails due to that pointer being NULL, it returns NULL. Finally, if it fails due to racing
with an update, it returns a special HAZPTR_POISON token.
2 Also independently invented by others [HLM02].
Quick Quiz 9.6: Given that papers on hazard pointers use the bottom bits of each pointer to
mark deleted elements, what is up with HAZPTR_POISON?
Line 6 reads the pointer to the object to be protected. If line 8 finds that this pointer
was either NULL or the special HAZPTR_POISON deleted-object token, it returns the
pointer’s value to inform the caller of the failure. Otherwise, line 9 stores the pointer
into the specified hazard pointer, and line 10 forces full ordering of that store with the
reload of the original pointer on line 11. (See Chapter 15 for more information on
memory ordering.) If the value of the original pointer has not changed, then the hazard
pointer protects the pointed-to object, and in that case, line 12 returns a pointer to that
object, which also indicates success to the caller. Otherwise, if the pointer changed
between the two READ_ONCE() invocations, line 13 indicates failure.
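The acquisition steps just described can be sketched with C11 atomics as shown below. This is not the code from Listing 9.4: the function names hp_try_record_sketch() and hp_clear_sketch() are hypothetical, the numeric value of HAZPTR_POISON is assumed, relaxed atomic loads and stores stand in for READ_ONCE() and WRITE_ONCE(), and a sequentially consistent fence stands in for the full memory barrier.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed token marking a deleted object; the real value may differ. */
#define HAZPTR_POISON ((void *)(uintptr_t)0x8)

typedef _Atomic(void *) hazptr_t;

/* Try to record *p into the hazard pointer *hp. Returns the protected
 * pointer on success, NULL if *p was NULL, or HAZPTR_POISON if *p was
 * poisoned or changed while it was being recorded. */
void *hp_try_record_sketch(_Atomic(void *) *p, hazptr_t *hp)
{
	void *tmp = atomic_load_explicit(p, memory_order_relaxed);

	if (tmp == NULL || tmp == HAZPTR_POISON)
		return tmp;	/* Nothing to protect. */
	atomic_store_explicit(hp, tmp, memory_order_relaxed);
	/* Order the hazard-pointer store before the reload of *p. */
	atomic_thread_fence(memory_order_seq_cst);
	if (tmp == atomic_load_explicit(p, memory_order_relaxed))
		return tmp;	/* Unchanged, so tmp is now protected. */
	return HAZPTR_POISON;	/* Raced with an update. */
}

/* Release the protection once the caller is done with the object. */
void hp_clear_sketch(hazptr_t *hp)
{
	atomic_store_explicit(hp, NULL, memory_order_release);
}
```

A caller that receives HAZPTR_POISON would typically restart its traversal, while a NULL return simply means that there was no object to protect.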
Quick Quiz 9.7: Why does hp_try_record() in Listing 9.4 take a double indirection to the
data element? Why not void * instead of void **?
Which is a very good thing, because B’s successor is the now-freed element C, which
means that Thread 0’s subsequent accesses might have resulted in arbitrarily horrible
memory corruption, especially if the memory for element C had since been re-allocated
for some other purpose. Therefore, hazard-pointer readers must typically restart the
full traversal in the face of a concurrent deletion. Often the restart must go back to
some global (and thus immortal) pointer, but it is sometimes possible to restart at some
intermediate location if that location is guaranteed to still be live, for example, due to
the current thread holding a lock, a reference count, etc.
Quick Quiz 9.9: Readers must “typically” restart? What are some exceptions?
Because algorithms using hazard pointers might be restarted at any step of their
traversal through the linked data structure, such algorithms must typically take care to
avoid making any changes to the data structure until after they have acquired all the
hazard pointers that are required for the update in question.
Quick Quiz 9.10: But don’t these restrictions on hazard pointers also apply to other forms of
reference counting?
9.4. SEQUENCE LOCKS 213
[Figure 9.3: Plot with horizontal axis “Number of CPUs (Threads)” from 0 to 450, vertical axis from 0 to 2.5x10^7, and a trace labeled “hazptr”]
Quick Quiz 9.12: The paper “Structured Deferral: Synchronization via Procrastination”
[McK13] shows that hazard pointers have near-ideal performance. Whatever happened in
Figure 9.3???
And hazard pointers are the concurrent reference counter mentioned on page 201.
The next section attempts to improve on hazard pointers by using sequence locks, which
avoid both read-side writes and per-object memory barriers.
The published sequence-lock record [Eas71, Lam77] extends back as far as that of
reader-writer locking, but sequence locks nevertheless remain in relative obscurity.
Sequence locks are used in the Linux kernel for read-mostly data that must be seen in
a consistent state by readers. However, unlike reader-writer locking, readers do not
exclude writers. Instead, like hazard pointers, sequence locks force readers to retry
an operation if they detect activity from a concurrent writer. As can be seen from
Figure 9.4, it is important to design code using sequence locks so that readers very
rarely need to retry.
Quick Quiz 9.13: Why isn’t this sequence-lock discussion in Chapter 7, you know, the one on
locking?
The key component of sequence locking is the sequence number, which has an even
value in the absence of updaters and an odd value if there is an update in progress.
Readers can then snapshot the value before and after each access. If either snapshot has
an odd value, or if the two snapshots differ, there has been a concurrent update, and the
reader must discard the results of the access and then retry it. Readers therefore use
the read_seqbegin() and read_seqretry() functions shown in Listing 9.8 when
accessing data protected by a sequence lock. Writers must increment the value before
and after each update, and only one writer is permitted at a given time. Writers therefore
use the write_seqlock() and write_sequnlock() functions shown in Listing 9.9
when updating data protected by a sequence lock.
As a result, sequence-lock-protected data can have an arbitrarily large number of
concurrent readers, but only one writer at a time. Sequence locking is used in the Linux
kernel to protect calibration quantities used for timekeeping. It is also used in pathname
traversal to detect concurrent rename operations.
A simple implementation of sequence locks is shown in Listing 9.10 (seqlock.h).
The seqlock_t data structure is shown on lines 1–4, and contains the sequence number
along with a lock to serialize writers. Lines 6–10 show seqlock_init(), which, as
the name indicates, initializes a seqlock_t.
Lines 12–19 show read_seqbegin(), which begins a sequence-lock read-side
critical section. Line 16 takes a snapshot of the sequence counter, and line 17 orders
this snapshot operation before the caller’s critical section. Finally, line 18 returns the
value of the snapshot (with the least-significant bit cleared), which the caller will pass
to a later call to read_seqretry().
Quick Quiz 9.14: Why not have read_seqbegin() in Listing 9.10 check for the low-order
bit being set, and retry internally, rather than allowing a doomed read to start?
Lines 21–29 show read_seqretry(), which returns true if there was at least one
writer since the time of the corresponding call to read_seqbegin(). Line 26 orders the
caller’s prior critical section before line 27’s fetch of the new snapshot of the sequence
counter. Line 28 checks whether the sequence counter has changed, in other words,
whether there has been at least one writer, and returns true if so.
Quick Quiz 9.15: Why is the smp_mb() on line 26 of Listing 9.10 needed?
Quick Quiz 9.16: Can’t weaker memory barriers be used in the code in Listing 9.10?
Quick Quiz 9.17: What prevents sequence-locking updaters from starving readers?
Lines 31–36 show write_seqlock(), which simply acquires the lock, increments
the sequence number, and executes a memory barrier to ensure that this increment is
ordered before the caller’s critical section. Lines 38–43 show write_sequnlock(),
which executes a memory barrier to ensure that the caller’s critical section is ordered
before the increment of the sequence number on line 41, then releases the lock.
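The sequence-lock operations described above can be sketched with C11 atomics as follows. This is an illustration rather than the code in seqlock.h: a simple atomic_flag spinlock is assumed for serializing writers, sequentially consistent fences stand in for the memory barriers, and the pair_write() and pair_read() functions along with the val_a and val_b variables are hypothetical examples of sequence-lock-protected data.

```c
#include <stdatomic.h>
#include <stdbool.h>

typedef struct {
	atomic_ulong seq;	/* Even when idle, odd during an update. */
	atomic_flag lock;	/* Assumed spinlock serializing writers. */
} seqlock_t;

void seqlock_init(seqlock_t *slp)
{
	atomic_init(&slp->seq, 0);
	atomic_flag_clear(&slp->lock);
}

/* Begin a read-side critical section, returning an even snapshot. */
unsigned long read_seqbegin(seqlock_t *slp)
{
	unsigned long s = atomic_load_explicit(&slp->seq, memory_order_relaxed);

	atomic_thread_fence(memory_order_seq_cst);
	return s & ~0x1UL;	/* An odd value can never match on retry. */
}

/* Return true if a writer ran since the matching read_seqbegin(). */
bool read_seqretry(seqlock_t *slp, unsigned long oldseq)
{
	atomic_thread_fence(memory_order_seq_cst);
	return atomic_load_explicit(&slp->seq, memory_order_relaxed) != oldseq;
}

void write_seqlock(seqlock_t *slp)
{
	while (atomic_flag_test_and_set_explicit(&slp->lock, memory_order_acquire))
		continue;	/* Spin until the lock is acquired. */
	atomic_fetch_add_explicit(&slp->seq, 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);
}

void write_sequnlock(seqlock_t *slp)
{
	atomic_thread_fence(memory_order_seq_cst);
	atomic_fetch_add_explicit(&slp->seq, 1, memory_order_relaxed);
	atomic_flag_clear_explicit(&slp->lock, memory_order_release);
}

/* Hypothetical protected data: two values that must be seen together. */
static seqlock_t sl;
static atomic_long val_a, val_b;

void pair_write(long a, long b)
{
	write_seqlock(&sl);
	atomic_store_explicit(&val_a, a, memory_order_relaxed);
	atomic_store_explicit(&val_b, b, memory_order_relaxed);
	write_sequnlock(&sl);
}

void pair_read(long *a, long *b)
{
	unsigned long s;

	do {	/* Retry until no writer overlapped the two loads. */
		s = read_seqbegin(&sl);
		*a = atomic_load_explicit(&val_a, memory_order_relaxed);
		*b = atomic_load_explicit(&val_b, memory_order_relaxed);
	} while (read_seqretry(&sl, s));
}
```

The reader writes nothing to shared memory, but it may loop for as long as writers keep arriving, and it never follows a pointer that a writer might free.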
Quick Quiz 9.18: What if something else serializes writers, so that the lock is not needed?
Quick Quiz 9.19: Why isn’t seq on line 2 of Listing 9.10 unsigned rather than unsigned
long? After all, if unsigned is good enough for the Linux kernel, shouldn’t it be good enough
for everyone?
So what happens when sequence locking is applied to the Pre-BSD routing table?
Listing 9.11 shows the data structures and route_lookup(), and Listing 9.12 shows
route_add() and route_del() (route_seqlock.c). This implementation is once
again similar to its counterparts in earlier sections, so only the differences will be
highlighted.
In Listing 9.11, line 5 adds ->re_freed, which is checked on lines 29 and 30. Line 8
adds a sequence lock, which is used by route_lookup() on lines 18, 23, and 32, with
lines 24 and 33 branching back to the retry label on line 17. The effect is to retry any
lookup that runs concurrently with an update.
In Listing 9.12, lines 11, 14, 23, 31, and 39 acquire and release the sequence lock,
while lines 10 and 33 handle ->re_freed. This implementation is therefore quite
straightforward.
It also performs better on the read-only workload, as can be seen in Figure 9.5, though
its performance is still far from ideal. Worse yet, it suffers use-after-free failures. The
problem is that the reader might encounter a segmentation violation due to accessing an
already-freed structure before read_seqretry() has a chance to warn of the concurrent
update.
[Figure 9.5: Plot with horizontal axis “Number of CPUs (Threads)” from 0 to 450, vertical axis from 0 to 2.5x10^7, and traces labeled “seqlock” and “hazptr”]
Quick Quiz 9.20: Can this bug be fixed? In other words, can you use sequence locks as
the only synchronization mechanism protecting a linked list supporting concurrent addition,
deletion, and lookup?
As hinted on page 201, both the read-side and write-side critical sections of a sequence
lock can be thought of as transactions, and sequence locking therefore can be thought
of as a limited form of transactional memory, which will be discussed in Section 17.2.
The limitations of sequence locking are: (1) Sequence locking restricts updates and
(2) Sequence locking does not permit traversal of pointers to objects that might be freed
by updaters. These limitations are of course overcome by transactional memory, but can
also be overcome by combining other synchronization primitives with sequence locking.
Sequence locks allow writers to defer readers, but not vice versa. This can result in
unfairness and even starvation in writer-heavy workloads.3 On the other hand, in the
absence of writers, sequence-lock readers are reasonably fast and scale linearly. It is only
human to want the best of both worlds: Fast readers without the possibility of read-side
failure, let alone starvation. In addition, it would also be nice to overcome sequence
locking’s limitations with pointers. The following section presents a synchronization
mechanism with exactly these properties.
All of the mechanisms discussed in the preceding sections used one of a number of
approaches to defer specific actions until they may be carried out safely. The reference
counters discussed in Section 9.2 use explicit counters to defer actions that could disturb
readers, which results in read-side contention and thus poor scalability. The hazard
pointers covered by Section 9.3 use implicit counters in the guise of per-thread lists
3 Dmitry Vyukov describes one way to reduce (but, sadly, not eliminate) reader
starvation: http://www.1024cores.net/home/lock-free-algorithms/reader-writer-problem/improved-lock-free-seqlock.
9.5. READ-COPY UPDATE (RCU) 219
of pointers. This avoids read-side contention, but requires readers to do stores and
conditional branches, as well as either full memory barriers in read-side primitives or
real-time-unfriendly inter-processor interrupts in update-side primitives.4 The sequence
lock presented in Section 9.4 also avoids read-side contention, but does not protect
pointer traversals and, like hazard pointers, requires either full memory barriers in
read-side primitives, or inter-processor interrupts in update-side primitives. These
schemes’ shortcomings raise the question of whether it is possible to do better.
This section introduces read-copy update (RCU), which provides an API that allows
readers to be associated with regions in the source code, rather than with expensive
updates to frequently updated shared data. The remainder of this section examines RCU
from a number of different perspectives. Section 9.5.1 provides the classic introduction
to RCU, Section 9.5.2 covers fundamental RCU concepts, Section 9.5.3 presents the
Linux-kernel API, Section 9.5.4 introduces some common RCU use cases, and finally
Section 9.5.5 covers recent work related to RCU.
Although RCU has gained a reputation for being subtle and difficult, when used
properly, it is quite straightforward. In fact, no less an authority than Butler Lampson
classifies it as easy concurrency [AH22, Chapter 3].
4 In some important special cases, this extra work can be avoided by using link counting
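[Figure 9.6: Insertion in four states. (1) gptr is NULL. After kmalloc(): (2) gptr is still NULL and p references an uninitialized structure (->addr=?, ->iface=?). After initialization: (3) gptr is still NULL and p references a structure with ->addr=42 and ->iface=1. After smp_store_release(&gptr, p): (4) gptr references that structure.]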
Similarly, one might hope that readers could use a single C-language assignment
to fetch the value of gptr, and be guaranteed to either get the old value of NULL or
to get the newly installed pointer, but either way see a valid result. Unfortunately,
Section 4.3.4.1 dashes these hopes as well. To obtain this guarantee, readers must
instead use READ_ONCE(), or, as will be seen, rcu_dereference(). However, on
most modern computer systems, each of these read-side primitives can be implemented
with a single load instruction, exactly the instruction that would normally be used in
single-threaded code.
Reviewing Figure 9.6 from the viewpoint of readers, in the first three states all readers
see gptr having the value NULL. Upon entering the fourth state, some readers might
see gptr still having the value NULL while others might see it referencing the newly
inserted element, but after some time, all readers will see this new element. At all times,
all readers will see gptr as containing a valid pointer. Therefore, it really is possible
to add new data to linked data structures while allowing concurrent readers to execute
the same sequence of machine instructions that is normally used in single-threaded
code. This no-cost approach to concurrent reading provides excellent performance and
scalability, and also is eminently suitable for real-time use.
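The insertion sequence discussed above can be sketched in user-space C11 as follows. The publish_route() and read_iface() names are hypothetical, a release store stands in for smp_store_release(), and an acquire load is used as a conservative stand-in for READ_ONCE() or rcu_dereference(). The sketch covers only a single insertion into an initially NULL gptr and makes no attempt at deletion.

```c
#include <limits.h>
#include <stdatomic.h>
#include <stdlib.h>

struct route {
	unsigned long addr;
	unsigned long iface;
};

static _Atomic(struct route *) gptr;	/* Initially NULL. */

/* Updater: allocate, initialize, and only then publish the structure.
 * Returns 0 on success or -1 on allocation failure. */
int publish_route(unsigned long addr, unsigned long iface)
{
	struct route *p = malloc(sizeof(*p));

	if (p == NULL)
		return -1;
	p->addr = addr;		/* Initialization happens first... */
	p->iface = iface;
	/* ...and the release store orders it before publication. */
	atomic_store_explicit(&gptr, p, memory_order_release);
	return 0;
}

/* Reader: sees either NULL or a fully initialized structure. */
unsigned long read_iface(void)
{
	struct route *p = atomic_load_explicit(&gptr, memory_order_acquire);

	return p != NULL ? p->iface : ULONG_MAX;
}
```

A reader running concurrently with publish_route(42, 1) obtains either ULONG_MAX or 1, but never a partially initialized ->iface value.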
Insertion is of course quite useful, but sooner or later, it will also be necessary to
delete data. As can be seen in Figure 9.7, the first step is easy. Again taking the lessons
from Section 4.3.4.1 to heart, smp_store_release() is used to NULL the pointer, thus
moving from the first row to the second in the figure. At this point, pre-existing readers
see the old structure with ->addr of 42 and ->iface of 1, but new readers will see a
NULL pointer, that is, concurrent readers can disagree on the state, as indicated by the
“2 Versions” in the figure.
[Figure 9.7: Deletion. (1) gptr references the structure with ->addr=42 and ->iface=1 (1 Version). After smp_store_release(&gptr, NULL), annotated “gptr = NULL; /*almost*/”: (2) the structure remains visible to pre-existing readers (2 Versions). After “wait for readers???”: (3) no readers reference the structure (1 Version).]
Quick Quiz 9.21: Why does Figure 9.7 use smp_store_release() given that it is storing a
NULL pointer? Wouldn’t WRITE_ONCE() work just as well in this case, given that there is no
structure initialization to order against the store of the NULL pointer?
Quick Quiz 9.22: Readers running concurrently with each other and with the procedure
outlined in Figure 9.7 can disagree on the value of gptr. Isn’t that just a wee bit problematic???
We get back to a single version simply by waiting for all the pre-existing readers to
complete, as shown in row 3. At that point, all the pre-existing readers are done, and
no later reader has a path to the old data item, so there can no longer be any readers
referencing it. It may therefore be safely freed, as shown on row 4.
Thus, given a way to wait for pre-existing readers to complete, it is possible to both
add data to and remove data from a linked data structure, despite the readers executing
the same sequence of machine instructions that would be appropriate for single-threaded
execution. So perhaps going all the way was not too far after all!
But how can we tell when all of the pre-existing readers have in fact completed? This
question is the topic of Section 9.5.1.3. But first, the next section defines RCU’s core
API.
[Table 9.1: Core RCU API, with columns “Primitive” and “Purpose”]
for the upcoming sections introducing RCU and covering its fundamentals. The full
API is covered in Section 9.5.3.
Three members of the core APIs are used by readers. The rcu_read_lock() and
rcu_read_unlock() functions delimit RCU read-side critical sections. These may
be nested, so that one rcu_read_lock()–rcu_read_unlock() pair can be enclosed
within another. In this case, the nested set of RCU read-side critical sections acts as
one large critical section covering the full extent of the nested set. The third read-side
API member, rcu_dereference(), fetches an RCU-protected pointer. Conceptually,
rcu_dereference() simply loads from memory, but we will see in Section 9.5.2.1
that rcu_dereference() must prevent the compiler and (in one case) the CPU from
reordering its load with later memory operations that dereference this pointer.
Quick Quiz 9.23: What is an RCU-protected pointer?
The other three members of the core APIs are used by updaters. The
synchronize_rcu() function implements the “wait for readers” operation from Figure 9.7. The
call_rcu() function is the asynchronous counterpart of synchronize_rcu(): it
invokes the specified function after all pre-existing RCU readers have completed.
Finally, the rcu_assign_pointer() macro is used to update an RCU-protected
pointer. Conceptually, this is simply an assignment statement, but we will see in
Section 9.5.2.1 that rcu_assign_pointer() must prevent the compiler and the CPU
from reordering this assignment to precede any prior assignments used to initialize the
pointed-to structure.
Quick Quiz 9.24: What does synchronize_rcu() do if it starts at about the same time as
an rcu_read_lock()?
The core RCU API is summarized in Table 9.1 for easy reference. With that, we are
ready to continue this introduction to RCU with the key RCU operation, waiting for
readers.
reference counters in Figure 9.2 on page 205. Hazard pointers profoundly reduce this
overhead, but, as we saw in Figure 9.3 on page 213, not to zero. Nevertheless, many
RCU implementations use counters with carefully controlled cache locality.
A second approach observes that memory synchronization is expensive, and therefore
uses registers instead, namely each CPU’s or thread’s program counter (PC), thus
imposing no overhead on readers, at least in the absence of concurrent updates. The
updater polls each relevant PC, and if that PC is not within read-side code, then the
corresponding CPU or thread is within a quiescent state, in turn signaling the completion
of any reader that might have access to the newly removed data element. Once all
CPUs’ or threads’ PCs have been observed to be outside of any reader, the grace
period has completed. Please note that this approach poses some serious challenges,
including memory ordering, functions that are sometimes invoked from readers, and
ever-exciting code-motion optimizations. Nevertheless, this approach is said to be used
in production [Ash15].
A third approach is to simply wait for a fixed period of time that is long enough to
comfortably exceed the lifetime of any reasonable reader [Jac93, Joh95]. This can work
quite well in hard real-time systems [RLPB18], but in less exotic settings, Murphy says
that it is critically important to be prepared even for unreasonably long-lived readers.
To see this, consider the consequences of failing to do so: A data item will be freed while
the unreasonable reader is still referencing it, and that item might well be immediately
reallocated, possibly even as a data item of some other type. The unreasonable reader
and the unwitting reallocator would then be attempting to use the same memory for two
very different purposes. The ensuing mess will be exceedingly difficult to debug.
A fourth approach is to wait forever, secure in the knowledge that doing so will
accommodate even the most unreasonable reader. This approach is also called “leaking
memory”, and has a bad reputation due to the fact that memory leaks often require
untimely and inconvenient reboots. Nevertheless, this is a viable strategy when the
update rate and the uptime are both sharply bounded. For example, this approach could
work well in a high-availability cluster where systems were periodically crashed in order
to ensure that the cluster really remained highly available.6 Leaking the memory is also a
viable strategy in environments having garbage collectors, in which case the garbage
collector can be thought of as plugging the leak [KL80]. However, if your environment
lacks a garbage collector, read on!
A fifth approach avoids the periodic crashes in favor of periodically “stopping the
world”, as exemplified by the traditional stop-the-world garbage collector. This approach
was also heavily used during the decades before ubiquitous connectivity, when it was
common practice to power systems off at the end of each working day. However, in
today’s always-connected always-on world, stopping the world can gravely degrade
response times, which has been one motivation for the development of concurrent
garbage collectors [BCR03]. Furthermore, although we need all pre-existing readers to
complete, we do not need them all to complete at the same time.
This observation leads to the sixth approach, which is stopping one CPU or thread
at a time. This approach has the advantage of not degrading reader response times at
all, let alone gravely. Furthermore, numerous applications already have states (termed
quiescent states) that can be reached only after all pre-existing readers are done. In
transaction-processing systems, the time between a pair of successive transactions might
6 The program that forces the periodic crashing is sometimes known as a “chaos monkey”:
be a quiescent state. In reactive systems, the state between a pair of successive events
might be a quiescent state. Within non-preemptive operating-system kernels, a context
switch can be a quiescent state [MS98a]. Either way, once all CPUs and/or threads have
passed through a quiescent state, the system is said to have completed a grace period, at
which point all readers in existence at the start of that grace period are guaranteed to
have completed. As a result, it is also guaranteed to be safe to free any data
items that were removed prior to the start of that grace period.7
Within a non-preemptive operating-system kernel, for a context switch to be a valid
quiescent state, readers must be prohibited from blocking while referencing a given
data-structure instance obtained via the gptr pointer shown in Figures 9.6 and 9.7. This
no-blocking constraint is consistent with similar constraints on pure spinlocks, where a
CPU is forbidden from blocking while holding a spinlock. Without this constraint, all
CPUs might be consumed by threads spinning in an attempt to acquire a spinlock held
by a blocked thread. The spinning threads will not relinquish their CPUs until they
acquire the lock, but the thread holding the lock cannot possibly release it until one of
the spinning threads relinquishes a CPU. This is a classic deadlock situation, and this
deadlock is avoided by forbidding blocking while holding a spinlock.
Again, this same constraint is imposed on reader threads dereferencing gptr: Such
threads are not allowed to block until after they are done using the pointed-to data
item. Returning to the second row of Figure 9.7, where the updater has just completed
executing the smp_store_release(), imagine that CPU 0 executes a context switch.
Because readers are not permitted to block while traversing the linked list, we are
guaranteed that all prior readers that might have been running on CPU 0 will have
completed. Extending this line of reasoning to the other CPUs, once each CPU has
been observed executing a context switch, we are guaranteed that all prior readers
have completed, and that there are no longer any reader threads referencing the newly
removed data element. The updater can then safely free that data element, resulting in
the state shown at the bottom of Figure 9.7.
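The line of reasoning above can be sketched as a toy grace-period detector that counts quiescent states. All of the names here (qs_count, qsbr_quiescent_state(), qsbr_gp_start(), qsbr_gp_done()) are hypothetical, the number of CPUs is assumed to be fixed, and the sketch ignores idle or offline CPUs, counter wrap, and the finer points of memory ordering, so it is an illustration of the idea rather than a usable RCU implementation.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define NR_CPUS 3	/* Assumed fixed number of CPUs (or threads). */

/* Per-CPU count of quiescent states, for example, context switches. */
static atomic_ulong qs_count[NR_CPUS];

/* Snapshot of the counters taken at the start of a grace period. */
struct gp_snapshot {
	unsigned long count[NR_CPUS];
};

/* Invoked by a CPU each time it passes through a quiescent state.
 * Readers must not be in progress on that CPU at this point. */
void qsbr_quiescent_state(int cpu)
{
	atomic_fetch_add(&qs_count[cpu], 1);
}

/* Updater: after removing the data item, record each CPU's count. */
void qsbr_gp_start(struct gp_snapshot *snap)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		snap->count[cpu] = atomic_load(&qs_count[cpu]);
}

/* Updater: true once every CPU has passed through at least one
 * quiescent state since the snapshot, at which point all readers
 * that predate the snapshot must have completed. */
bool qsbr_gp_done(const struct gp_snapshot *snap)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		if (atomic_load(&qs_count[cpu]) == snap->count[cpu])
			return false;
	return true;
}
```

An updater would remove the data item, call qsbr_gp_start(), wait until qsbr_gp_done() returns true, and only then free the item.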
This approach is termed quiescent-state-based reclamation (QSBR) [HMB06]. A
QSBR schematic is shown in Figure 9.8, with time advancing from the top of the
figure to the bottom. The cyan-colored boxes depict RCU read-side critical sections,
each of which begins with rcu_read_lock() and ends with rcu_read_unlock().
CPU 1 does the WRITE_ONCE() that removes the current data item (presumably having
previously read the pointer value and availed itself of appropriate synchronization), then
waits for readers. This wait operation results in an immediate context switch, which is a
quiescent state (denoted by the pink circle), which in turn means that all prior reads
on CPU 1 have completed. Next, CPU 2 does a context switch, so that all readers on
CPUs 1 and 2 are now known to have completed. Finally, CPU 3 does a context switch.
At this point, all readers throughout the entire system are known to have completed, so
the grace period ends, permitting synchronize_rcu() to return to its caller, in turn
permitting CPU 1 to free the old data item.
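The bookkeeping behind this schematic can be sketched as a small single-threaded model. This is only an illustration under stated assumptions, not how the Linux kernel tracks grace periods: the names toy_context_switch(), toy_gp_start(), and toy_gp_done() are hypothetical, the CPUs are numbered from zero, and the plain counters would need atomic accesses and memory ordering in a real concurrent setting.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_NR_CPUS 3

/* Number of context switches (quiescent states) seen so far on each CPU. */
static unsigned long toy_ctxt_switches[TOY_NR_CPUS];

/* Snapshot of the per-CPU counters taken when a grace period starts. */
struct toy_grace_period {
	unsigned long snap[TOY_NR_CPUS];
};

/* Record a context switch, which is a quiescent state, on the given CPU. */
void toy_context_switch(int cpu)
{
	assert(cpu >= 0 && cpu < TOY_NR_CPUS);
	toy_ctxt_switches[cpu]++;
}

/* Start a grace period by remembering each CPU's current counter value. */
void toy_gp_start(struct toy_grace_period *gp)
{
	for (int cpu = 0; cpu < TOY_NR_CPUS; cpu++)
		gp->snap[cpu] = toy_ctxt_switches[cpu];
}

/*
 * The grace period has ended once every CPU has been observed to
 * context-switch at least once since toy_gp_start() was called.
 */
bool toy_gp_done(const struct toy_grace_period *gp)
{
	for (int cpu = 0; cpu < TOY_NR_CPUS; cpu++)
		if (toy_ctxt_switches[cpu] == gp->snap[cpu])
			return false;
	return true;
}
```

In this model, an updater would remove the data item, call toy_gp_start(), and free the item only after toy_gp_done() returns true, which happens only after all three CPUs have passed through toy_context_switch().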
Quick Quiz 9.25: In Figure 9.8, the last of CPU 3’s readers that could possibly have access
to the old data item ended before the grace period even started! So why would anyone bother
waiting until CPU 3’s later context switch???
7 It is possible to do much more with RCU than simply defer reclamation of memory, but
deferred reclamation is RCU’s most common use case, and is therefore an excellent place to
start. For an example of the more general case of deferred execution, please see phased state
change in Section 9.5.4.3.
[Figure 9.8: QSBR schematic. Columns: CPU 1, CPU 2, CPU 3. CPU 1 executes WRITE_ONCE(gptr, NULL) and synchronize_rcu(), and later free(). Labels: Context Switch, Reader, Grace Period.]
The for_each_online_cpu() primitive iterates over all CPUs, and the sched_
setaffinity() function causes the current thread to execute on the specified CPU,
which forces the destination CPU to execute a context switch. Therefore, once the
for_each_online_cpu() has completed, each CPU has executed a context switch,
which in turn guarantees that all pre-existing reader threads have completed.
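A user-space analog of this loop can be sketched with the Linux sched_getaffinity() and sched_setaffinity() system-call wrappers. The function name toy_visit_each_cpu() is hypothetical, and the sketch visits only the CPUs that the calling thread is allowed to run on. Please note that it illustrates only the shape of the loop: user-space threads can be preempted in the middle of a reader, so visiting each CPU does not by itself provide a grace period for user-level readers.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>

/*
 * Migrate the calling thread onto each CPU in its affinity mask, one
 * CPU at a time, then restore the original mask.  Returns the number
 * of CPUs visited, or -1 if an affinity call fails.
 */
int toy_visit_each_cpu(void)
{
	cpu_set_t allowed, one;
	int visited = 0;

	if (sched_getaffinity(0, sizeof(allowed), &allowed) != 0)
		return -1;
	for (int cpu = 0; cpu < CPU_SETSIZE; cpu++) {
		if (!CPU_ISSET(cpu, &allowed))
			continue;
		CPU_ZERO(&one);
		CPU_SET(cpu, &one);
		/* Restricting the mask to one CPU forces a migration to it. */
		if (sched_setaffinity(0, sizeof(one), &one) != 0)
			return -1;
		visited++;
	}
	/* Restore the original affinity mask. */
	if (sched_setaffinity(0, sizeof(allowed), &allowed) != 0)
		return -1;
	/* The caller is running, so its mask held at least one CPU. */
	assert(visited > 0);
	return visited;
}
```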
Please note that this approach is not production quality. Correct handling of a
number of corner cases and the need for a number of powerful optimizations mean that
production-quality implementations are quite complex. In addition, RCU implementa-
tions for preemptible environments require that readers actually do something, which
in non-real-time Linux-kernel environments can be as simple as defining rcu_read_
lock() and rcu_read_unlock() as preempt_disable() and preempt_enable(),
respectively.8 However, this simple non-preemptible approach is conceptually complete,
8 Some toy RCU implementations that handle preempted read-side critical sections are
shown in Appendix B.
Referring back to Listing 9.13, note that route_lock is used to synchronize between
concurrent updaters invoking ins_route() and del_route(). However, this lock is
not acquired by readers invoking access_route(): Readers are instead protected by
the QSBR techniques described in Section 9.5.1.3.
Note that ins_route() simply returns the old value of gptr, which Figure 9.6
assumed would always be NULL. This means that it is the caller’s responsibility to figure
out what to do with a non-NULL value, a task complicated by the fact that readers might
still be referencing it for an indeterminate period of time. Callers might use one of the
following approaches:
3. Pass the returned pointer to a later invocation of ins_route() to restore the earlier
value.
This example shows one general approach to reading and updating RCU-protected
data structures. However, there is quite a variety of use cases, several of which are
covered in Section 9.5.4.
In summary, it is in fact possible to create concurrent linked data structures that can
be traversed by readers executing the same sequence of machine instructions that would
be executed by single-threaded readers. The next section summarizes RCU’s high-level
properties.
A key RCU property is that reads need not wait for updates. This property enables
RCU implementations to provide low-cost or even no-cost readers, resulting in low
overhead and excellent scalability. This property also allows RCU readers and updaters
to make useful concurrent forward progress. In contrast, conventional synchronization
primitives must enforce strict mutual exclusion using expensive instructions, thus
increasing overhead and degrading scalability, but also typically prohibiting readers and
updaters from making useful concurrent forward progress.
Quick Quiz 9.29: Doesn’t Section 9.4’s seqlock also permit readers and updaters to make
useful concurrent forward progress?
[Figure 9.9: Linux-kernel RCU API uses, plotted by year (horizontal axis: Year, 2000 to 2025).]
[Figure 9.10: Columns ins_route() and access_route(). ins_route() steps: Allocate (pre-initialization garbage), Initialize (valid route structure), Publish pointer. access_route() steps: Subscribe to pointer, Dereference pointer. Arrow labels: Not OK; OK; Surprising, but OK.]
the RCU-protected data item. For their part, updaters can be thought of as publishing
new versions.
Unfortunately, as laid out in Section 4.3.4.1 and reiterated in Section 9.5.1.1, it is
unwise to use plain accesses for these publication and subscription operations. It is
instead necessary to inform both the compiler and the CPU of the need for care, as can
be seen from Figure 9.10, which illustrates interactions between concurrent executions
of ins_route() (and its caller) and access_route() from Listing 9.13.
The ins_route() column from Figure 9.10 shows ins_route()’s caller allocating
a new route structure, which then contains pre-initialization garbage. The caller then
initializes the newly allocated structure, and then invokes ins_route() to publish a
pointer to the new route structure. Publication does not affect the contents of the
structure, which therefore remain valid after publication.
The access_route() column from this same figure shows the pointer being sub-
scribed to and dereferenced. This dereference operation absolutely must see a valid
route structure rather than pre-initialization garbage because referencing garbage could
result in memory corruption, crashes, and hangs. As noted earlier, avoiding such
garbage means that the publish and subscribe operations must inform both the compiler
and the CPU of the need to maintain the needed ordering.
Publication is carried out by rcu_assign_pointer(), which ensures that ins_
route()’s caller’s initialization is ordered before the actual publication operation’s
store of the pointer. In addition, rcu_assign_pointer() must be atomic in the sense
that concurrent readers see either the old value of the pointer or the new value of the
pointer, but not some mash-up of these two values. These requirements are met by the
C11 store-release operation, and in fact in the Linux kernel, rcu_assign_pointer()
is defined in terms of smp_store_release(), which is similar to C11 store-release.
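The publication and subscription requirements can be sketched in user space with C11 atomics. This is a minimal sketch under stated assumptions rather than the Linux-kernel implementation: the names toy_gptr, toy_publish(), toy_subscribe(), and toy_new_route() are hypothetical, the caller is assumed to serialize updaters, and the subscription uses an acquire load, which is stronger than strictly needed but easy to reason about.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

struct route {
	unsigned long addr;
	unsigned long iface;
};

/* The RCU-protected pointer (hypothetical stand-in for gptr). */
static _Atomic(struct route *) toy_gptr;

/* Allocate and fully initialize a route structure before publication. */
struct route *toy_new_route(unsigned long addr, unsigned long iface)
{
	struct route *rp = malloc(sizeof(*rp));

	if (rp != NULL) {
		rp->addr = addr;
		rp->iface = iface;
	}
	return rp;
}

/*
 * Publish newp (which may be NULL) and return the old pointer.
 * The release store orders the caller's initialization before the
 * store of the pointer.  The caller must serialize updaters.
 */
struct route *toy_publish(struct route *newp)
{
	struct route *oldp = atomic_load_explicit(&toy_gptr,
						  memory_order_relaxed);

	/* Republishing the live structure would hand it back as "old". */
	assert(newp == NULL || newp != oldp);
	atomic_store_explicit(&toy_gptr, newp, memory_order_release);
	return oldp;
}

/* Subscribe: readers see either the old or the new pointer value. */
struct route *toy_subscribe(void)
{
	return atomic_load_explicit(&toy_gptr, memory_order_acquire);
}
```

A reader that obtains a non-NULL pointer from toy_subscribe() is guaranteed to see the initialized fields. Deciding when the pointer returned by toy_publish() may be freed is a separate problem, namely that of waiting for pre-existing readers.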
Note that if concurrent updates are required, some sort of synchronization mechanism
will be required to mediate among multiple concurrent rcu_assign_pointer() calls
on the same pointer. In the Linux kernel, locking is the mechanism of choice, but
Adding data to a linked structure without disrupting readers is a good thing, as are the
cases where this can be done with no added read-side cost compared to single-threaded
readers. However, in most cases it is also necessary to remove data, and this is the
subject of the next section.
[Figure 9.11: P0() reader (rcu_read_lock(); r2 = y; rcu_read_unlock()) and P1() updater (x = 1; synchronize_rcu()).]
In RCU’s case, each of the things waited on is called an RCU read-side critical
section. As noted in Table 9.1, an RCU read-side critical section starts with an
rcu_read_lock() primitive, and ends with a corresponding rcu_read_unlock()
primitive. RCU read-side critical sections can be nested, and may contain pretty much
any code, as long as that code does not contain a quiescent state. For example, within
the Linux kernel, it is illegal to sleep within an RCU read-side critical section because
a context switch is a quiescent state.10 If you abide by these conventions, you can
use RCU to wait for any pre-existing RCU read-side critical section to complete, and
synchronize_rcu() uses indirect means to do the actual waiting [DMS+12, McK13].
The relationship between an RCU read-side critical section and a later RCU grace
period is an if-then relationship, as illustrated by Figure 9.11. If any portion of a given
critical section precedes the beginning of a given grace period, then RCU guarantees
that all of that critical section will precede the end of that grace period. In the figure,
P0()’s access to x precedes P1()’s access to this same variable, and thus also precedes
the grace period generated by P1()’s call to synchronize_rcu(). It is therefore
guaranteed that P0()’s access to y will precede P1()’s access. In this case, if r1’s final
value is 0, then r2’s final value is guaranteed to also be 0.
Quick Quiz 9.32: What other final values of r1 and r2 are possible in Figure 9.11?
The relationship between an RCU read-side critical section and an earlier RCU grace
period is also an if-then relationship, as illustrated by Figure 9.12. If any portion of
a given critical section follows the end of a given grace period, then RCU guarantees
that all of that critical section will follow the beginning of that grace period. In the
figure, P0()’s access to y follows P1()’s access to this same variable, and thus follows
the grace period generated by P1()’s call to synchronize_rcu(). It is therefore
guaranteed that P0()’s access to x will follow P1()’s access. In this case, if r2’s final
value is 1, then r1’s final value is guaranteed to also be 1.
10 However, a special form of RCU called SRCU [McK06] does permit general sleeping
[Figure 9.12: P1() updater (x = 1; synchronize_rcu()) and P0() reader (rcu_read_lock(); rcu_read_unlock()).]
Quick Quiz 9.33: What would happen if the order of P0()’s two accesses was reversed in
Figure 9.12?
Finally, as shown in Figure 9.13, an RCU read-side critical section can be completely
overlapped by an RCU grace period. In this case, r1’s final value is 1 and r2’s final
value is 0.
However, it cannot be the case that r1’s final value is 0 and r2’s final value is 1.
This would mean that an RCU read-side critical section had completely overlapped
a grace period, which is forbidden (or at the very least constitutes a bug in RCU).
RCU’s wait-for-readers guarantee therefore has two parts: (1) If any part of a given
RCU read-side critical section precedes the beginning of a given grace period, then
the entirety of that critical section precedes the end of that grace period. (2) If any
part of a given RCU read-side critical section follows the end of a given grace period,
then the entirety of that critical section follows the beginning of that grace period. This
definition is sufficient for almost all RCU-based algorithms, but for those wanting more,
simple executable formal models of RCU are available as part of Linux kernel v4.17
and later, as discussed in Section 12.3.2. In addition, RCU’s ordering properties are
examined in much greater detail in Section 15.4.3.
Quick Quiz 9.34: What would happen if P0()’s accesses in Figures 9.11–9.13 were stores?
[Figure 9.13: P1() updater (x = 1; synchronize_rcu(); y = 1;) and P0() reader (rcu_read_lock(); rcu_read_unlock()).]
[Figure 9.14: Four orderings of a reader's rcu_read_lock() and rcu_read_unlock() relative to an updater's Remove and synchronize_rcu(). Only the title of panel (4) survives: Grace period within reader (BUG!!!).]
[Figure 9.15: Reader traversing a concurrently updated list. Row 1: list A B C D, reader has seen { A }. Row 2: list A B C D, reader { A, B }. Row 3: list B C D, reader { A, B }. Row 4: list B C D E, reader { A, B }. Row 5: list B C D E, reader { A, B, C, D, E }.]
Because RCU readers can make forward progress while updates are in progress,
different readers might disagree about the state of the data structure, a topic taken up by
the next section.
However, maintaining multiple weakly consistent versions can provide some surprises.
For example, consider Figure 9.15, in which a reader is traversing a linked list that is
concurrently updated.11 In the first row of the figure, the reader is referencing data
item A, and in the second row, it advances to B, having thus far seen A followed by B.
In the third row, an updater removes element A and in the fourth row an updater adds
element E to the end of the list. In the fifth and final row, the reader completes its
traversal, having seen elements A through E.
Except that there was no time at which such a list existed. This situation might
be even more surprising than that shown in Figure 9.7, in which different concurrent
readers see different versions. In contrast, in Figure 9.15 the reader sees a version that
never actually existed!
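The five rows of this scenario can be replayed deterministically in a single thread. The following is only a sketch of the interleaving, with hypothetical names (toy_snapshot(), toy_replay()) and a singly linked list standing in for the list in the figure. It omits the memory ordering and grace-period handling that real concurrent code would need, and element A is simply left intact after being unlinked, as it would be until a grace period elapsed.

```c
#include <assert.h>
#include <stddef.h>

struct node {
	char name;
	struct node *next;
};

/* Copy the names of all elements reachable from head into buf. */
static void toy_snapshot(const struct node *head, char *buf)
{
	size_t n = 0;

	for (; head != NULL; head = head->next)
		buf[n++] = head->name;
	buf[n] = '\0';
}

/*
 * Replay the interleaving.  On return, seen holds the elements that
 * the reader encountered, and versions holds the list contents before
 * any update, after A is removed, and after E is added.
 */
void toy_replay(char seen[6], char versions[3][6])
{
	struct node e = { 'E', NULL };
	struct node d = { 'D', NULL };
	struct node c = { 'C', &d };
	struct node b = { 'B', &c };
	struct node a = { 'A', &b };
	struct node *head = &a;
	const struct node *reader;
	size_t n = 0;

	toy_snapshot(head, versions[0]);	/* "ABCD" */

	reader = head;				/* Row 1: reader is at A. */
	seen[n++] = reader->name;
	reader = reader->next;			/* Row 2: reader moves to B. */
	seen[n++] = reader->name;

	head = a.next;				/* Row 3: updater unlinks A. */
	toy_snapshot(head, versions[1]);	/* "BCD" */

	d.next = &e;				/* Row 4: updater appends E. */
	toy_snapshot(head, versions[2]);	/* "BCDE" */

	/* Row 5: reader completes its traversal from B onward. */
	for (reader = reader->next; reader != NULL; reader = reader->next) {
		assert(n < 5);
		seen[n++] = reader->name;
	}
	seen[n] = '\0';
}
```

The reader's record is "ABCDE", yet none of the three recorded versions of the list contains all five elements.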
One way to resolve this strange situation is via weaker semantics. A reader traversal
must encounter any data item that was present during the full traversal (B, C, and D),
and might or might not encounter data items that were present for only part of the
traversal (A and E). Therefore, in this particular case, it is perfectly legitimate for the
reader traversal to encounter all five elements. If this outcome is problematic, another
way to resolve this situation is through use of stronger synchronization mechanisms,
such as reader-writer locking, or clever use of timestamps and versioning, as discussed
in Section 9.5.4.11. Of course, stronger mechanisms will be more expensive, but then
again the engineering life is all about choices and tradeoffs.
Strange though this situation might seem, it is entirely consistent with the real world.
As we saw in Section 3.2, the finite speed of light cannot be ignored within a computer
system, and it most certainly cannot be ignored outside of this system. This in turn
means that any data within the system representing state in the real world outside of
the system is always and forever outdated, and thus inconsistent with the real world.
Therefore, it is quite possible that the sequence {A, B, C, D, E} occurred in the real
world, but due to speed-of-light delays was never represented in the computer system’s
memory. In this case, the reader’s surprising traversal would correctly reflect reality.
As a result, algorithms operating on real-world data must account for inconsistent
data, either by tolerating inconsistencies or by taking steps to exclude or reject them. In
many cases, these algorithms are also perfectly capable of dealing with inconsistencies
within the system.
The pre-BSD packet routing example laid out in Section 9.1 is a case in point.
The contents of a routing list are set by routing protocols, and these protocols feature
significant delays (seconds or even minutes) to avoid routing instabilities. Therefore,
by the time a routing update reaches a given system, that system might well have been sending packets
the wrong way for quite some time. Sending a few more packets the wrong way for the
few microseconds during which the update is in flight is clearly not a problem because
the same higher-level protocol actions that deal with delayed routing updates will also
deal with internal inconsistencies.
Nor is Internet routing the only situation tolerating inconsistencies. To repeat, any
algorithm in which data within a system tracks outside-of-system state must tolerate
inconsistencies, which includes security policies (often set by committees of humans),
storage configuration, and WiFi access points, to say nothing of removable hardware such
as microphones, headsets, cameras, mice, printers, and much else besides. Furthermore,
the large number of Linux-kernel RCU API uses shown in Figure 9.9, combined with
the Linux kernel’s heavy use of reference counting and with increasing use of hazard
pointers in other projects, demonstrates that tolerance for such inconsistencies is more
common than one might imagine.
One root cause of this common-case tolerance of inconsistencies is that single-item
lookups are much more common in practice than are full-data-structure traversals. After
all, full-data-structure traversals are much more expensive than single-item lookups, so
developers are motivated to avoid such traversals. Not only are concurrent updates less
likely to affect a single-item lookup than they are a full traversal, but it is also the case
that an isolated single-item lookup has no way of detecting such inconsistencies. As a
result, in the common case, such inconsistencies are not just tolerable, they are in fact
invisible.
In such cases, RCU readers can be considered to be fully ordered with updaters,
despite the fact that these readers might be executing the exact same sequence of
machine instructions that would be executed by a single-threaded program, as hinted on
page 201. For example, referring back to Listing 9.13 on page 226, suppose that each
reader thread invokes access_route() exactly once during its lifetime, and that there
is no other communication among reader and updater threads. Then each invocation
of access_route() can be ordered after the ins_route() invocation that produced
the route structure accessed by line 11 of the listing in access_route() and ordered
before any subsequent ins_route() or del_route() invocation.
In summary, maintaining multiple versions is exactly what enables the extremely
low overheads of RCU readers, and as noted earlier, many algorithms are unfazed by
multiple versions. However, there are algorithms that absolutely cannot handle multiple
versions. There are techniques for adapting such algorithms to RCU [McK04], for
example, the use of sequence locking described in Section 13.4.2.
Exercises These examples assumed that a mutex was held across the entire update
operation, which would mean that there could be at most two versions of the list active
at a given time.
Quick Quiz 9.35: How would you modify the deletion example to permit more than two
versions of the list to be active?
Quick Quiz 9.36: How many RCU versions of a given list can be active at any given time?
Quick Quiz 9.37: How can the per-update overhead of RCU be reduced?
Quick Quiz 9.38: How can RCU updaters possibly delay RCU readers, given that neither
rcu_read_lock() nor rcu_read_unlock() spin or block?
These three RCU components allow data to be updated in the face of concurrent
readers that might be executing the same sequence of machine instructions that would
be used by a reader in a single-threaded implementation. These RCU components can
be combined in different ways to implement a surprising variety of different types of
RCU-based algorithms, a number of which are presented in Section 9.5.4. However, it
is usually better to work at higher levels of abstraction. To this end, the next section
describes the Linux-kernel API, which includes simple data structures such as lists.
lists in addition to the pointer-oriented core API of Table 9.1. The Linux kernel itself
also provides RCU-protected hash tables and search trees.
Operating-systems kernels such as Linux operate near the bottom of the “iron triangle”
of the software stack shown in Figure 2.3, where performance is critically important.
There are thus specialized variants of a number of RCU APIs for use on fastpaths.
For example, as discussed in Section 9.5.3.3, RCU_INIT_POINTER() may be used
in place of rcu_assign_pointer() in cases where the RCU-protected pointer is
being assigned to NULL or when that pointer is not yet accessible by readers. Use of
RCU_INIT_POINTER() allows the compiler more leeway in selecting instructions and
carrying out optimizations, thus increasing performance.
On the other hand, when used incorrectly RCU_INIT_POINTER() can result in silent
memory corruption, so please be careful! Yes, in some cases, the kernel can check for
inappropriate use of RCU API members from a given kernel context, but the constraints
of RCU_INIT_POINTER() use are not yet checkable.
Finally, within the Linux kernel, the aforementioned limits of human cognition are
compounded by the variety and severity of workloads running on Linux. As of v5.16,
this has given rise to no fewer than five flavors of RCU, each designed to provide
different performance, scalability, response-time, and energy efficiency tradeoffs to
RCU readers and writers. These RCU flavors are the subject of the next section.
13 This citation covers v4.20 and later. Documentation for earlier versions of the Linux-

Table 9.2: RCU Wait-to-Finish APIs

Columns:
  RCU: Original
  SRCU: Sleeping readers
  Tasks RCU: Free tracing trampolines
  Tasks RCU Rude: Free idle-task tracing trampolines
  Tasks RCU Trace: Protect sleepable BPF programs

Initialization and cleanup
  SRCU: DEFINE_SRCU(), DEFINE_STATIC_SRCU(), init_srcu_struct(), cleanup_srcu_struct()

Read-side critical-section markers
  RCU: rcu_read_lock() !, rcu_read_unlock() !, rcu_read_lock_bh(), rcu_read_unlock_bh(), rcu_read_lock_sched(), rcu_read_unlock_sched() (plus anything disabling bottom halves, preemption, or interrupts)
  SRCU: srcu_read_lock(), srcu_read_unlock()
  Tasks RCU: Voluntary context switch
  Tasks RCU Rude: Voluntary context switch and preempt-enable regions of code
  Tasks RCU Trace: rcu_read_lock_trace(), rcu_read_unlock_trace()

Update-side primitives (synchronous)
  RCU: synchronize_rcu(), synchronize_net(), synchronize_rcu_expedited()
  SRCU: synchronize_srcu(), synchronize_srcu_expedited()
  Tasks RCU: synchronize_rcu_tasks()
  Tasks RCU Rude: synchronize_rcu_tasks_rude()
  Tasks RCU Trace: synchronize_rcu_tasks_trace()

Update-side primitives (asynchronous / callback)
  RCU: call_rcu() !
  SRCU: call_srcu()
  Tasks RCU: call_rcu_tasks()
  Tasks RCU Rude: call_rcu_tasks_rude()
  Tasks RCU Trace: call_rcu_tasks_trace()

Update-side primitives (wait for callbacks)
  RCU: rcu_barrier()
  SRCU: srcu_barrier()
  Tasks RCU: rcu_barrier_tasks()
  Tasks RCU Rude: rcu_barrier_tasks_rude()
  Tasks RCU Trace: rcu_barrier_tasks_trace()

Update-side primitives (initiate / wait)
  RCU: get_state_synchronize_rcu(), cond_synchronize_rcu()

Update-side primitives (free memory)
  RCU: kfree_rcu()

Type-safe memory
  RCU: SLAB_TYPESAFE_BY_RCU

Read side constraints
  RCU: No blocking (only preemption)
  SRCU: No synchronize_srcu() with same srcu_struct
  Tasks RCU: No voluntary context switch
  Tasks RCU Rude: Neither blocking nor preemption
  Tasks RCU Trace: No RCU tasks trace grace period

Read side overhead
  RCU: CPU-local accesses (barrier() on PREEMPT=n)
  SRCU: Simple instructions, memory barriers
  Tasks RCU: Free
  Tasks RCU Rude: CPU-local accesses (free on PREEMPT=n)
  Tasks RCU Trace: CPU-local accesses

Asynchronous update-side overhead
  All five columns: sub-microsecond

Grace-period latency
  RCU: 10s of milliseconds
  SRCU: Milliseconds
  Tasks RCU: Seconds
  Tasks RCU Rude: Milliseconds
  Tasks RCU Trace: 10s of milliseconds

Expedited grace-period latency
  RCU: 10s of microseconds
  SRCU: Microseconds
  Tasks RCU: N/A
  Tasks RCU Rude: N/A
  Tasks RCU Trace: N/A

The “RCU” column corresponds to the consolidation of the three Linux-kernel RCU
implementations [McK19c, McK19a], in which RCU read-side critical sections start
with rcu_read_lock(), rcu_read_lock_bh(), or rcu_read_lock_sched() and
end with rcu_read_unlock(), rcu_read_unlock_bh(), or rcu_read_unlock_sched(),
respectively. Any region of code that disables bottom halves, interrupts,
or preemption also acts as an RCU read-side critical section. RCU read-side critical
sections may be nested. The corresponding synchronous update-side primitives,
synchronize_rcu() and synchronize_rcu_expedited(), along with their synonym
synchronize_net(), wait for any type of currently executing RCU read-side
critical sections to complete. The length of this wait is known as a “grace period”, and
synchronize_rcu_expedited() is designed to reduce grace-period latency at the
expense of increased CPU overhead and IPIs. The asynchronous update-side primitive,
call_rcu(), invokes a specified function with a specified argument after a subsequent
grace period. For example, call_rcu(p,f); will result in the “RCU callback” f(p)
being invoked after a subsequent grace period. There are situations, such as when
unloading a Linux-kernel module that uses call_rcu(), when it is necessary to wait for
all outstanding RCU callbacks to complete [McK07e]. The rcu_barrier() primitive
does this job.
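The callback pattern behind call_rcu() can be sketched as a toy single-threaded queue. All names here (toy_rcu_head, toy_call_rcu(), toy_invoke_callbacks(), toy_free_foo()) are hypothetical, and the sketch assumes that the caller invokes toy_invoke_callbacks() only after a grace period has elapsed; it does not detect grace periods and is not safe for concurrent use. As in the Linux kernel, the queued item is a small header embedded in the enclosing structure, and the callback recovers that structure from the header.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Header embedded in each structure that is to be freed by callback. */
struct toy_rcu_head {
	struct toy_rcu_head *next;
	void (*func)(struct toy_rcu_head *head);
};

/* Singly linked queue of pending callbacks, in registration order. */
static struct toy_rcu_head *toy_cb_list;
static struct toy_rcu_head **toy_cb_tail = &toy_cb_list;

/* Queue func(head) for invocation after a later grace period. */
void toy_call_rcu(struct toy_rcu_head *head,
		  void (*func)(struct toy_rcu_head *head))
{
	assert(func != NULL);
	head->func = func;
	head->next = NULL;
	*toy_cb_tail = head;
	toy_cb_tail = &head->next;
}

/*
 * Invoke and dequeue all pending callbacks.  The caller must ensure
 * that a grace period has elapsed since the last toy_call_rcu().
 * Returns the number of callbacks invoked.
 */
int toy_invoke_callbacks(void)
{
	struct toy_rcu_head *list = toy_cb_list;
	int n = 0;

	toy_cb_list = NULL;
	toy_cb_tail = &toy_cb_list;
	while (list != NULL) {
		struct toy_rcu_head *next = list->next;

		list->func(list);	/* May free the enclosing structure. */
		list = next;
		n++;
	}
	return n;
}

/* Example user: a structure with an embedded callback header. */
struct foo {
	int a;
	struct toy_rcu_head rh;
};

int toy_nr_freed;

/* Callback: map the header back to its struct foo, then free it. */
void toy_free_foo(struct toy_rcu_head *head)
{
	struct foo *p = (struct foo *)((char *)head - offsetof(struct foo, rh));

	free(p);
	toy_nr_freed++;
}
```

In this toy, draining the queue with toy_invoke_callbacks() plays a role loosely analogous to that of rcu_barrier(): once it returns, no previously queued callback remains outstanding.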
Quick Quiz 9.40: How do you prevent a huge number of RCU read-side critical sections from
indefinitely blocking a synchronize_rcu() invocation?
Quick Quiz 9.41: The synchronize_rcu() API waits for all pre-existing interrupt handlers
to complete, right?
Quick Quiz 9.42: What is the difference between synchronize_rcu() and rcu_barrier()?
Similar to normal RCU, self-deadlock can be avoided using the asynchronous call_
srcu() function. However, special care must be taken when using call_srcu()
because a single task could register SRCU callbacks very quickly. Given that SRCU
allows readers to block for arbitrary periods of time, this could consume an arbitrarily
large quantity of memory. In contrast, given the synchronous synchronize_srcu()
interface, a given task must finish waiting for a given grace period before it can start
waiting for the next one.
Also similar to RCU, there is an srcu_barrier() function that waits for all prior
call_srcu() callbacks to be invoked.
In other words, SRCU compensates for its extremely weak forward-progress guarantees
by permitting the developer to restrict its scope.
The “Tasks RCU” column in Table 9.2 displays a specialized RCU API that mediates
freeing of the trampolines used in Linux-kernel tracing. These trampolines are used to
transfer control from a point in the code being traced to the code doing the actual tracing.
It is of course necessary to ensure that all code executing within a given trampoline has
finished before freeing that trampoline.
Changes to the code being traced are typically limited to a single jump or call
instruction, and thus cannot accommodate the sequence of code required to implement
rcu_read_lock() and rcu_read_unlock(). Nor can the trampoline contain these
calls to rcu_read_lock() and rcu_read_unlock(). To see this, consider a CPU
that is just about to start executing a given trampoline. Because it has not yet executed
the rcu_read_lock(), that trampoline could be freed at any time, which would
come as a fatal surprise to this CPU. Therefore, trampolines cannot be protected by
synchronization primitives executed in either the traced code or in the trampoline itself.
Which does raise the question of exactly how the trampoline is to be protected.
The key to answering this question is to note that trampoline code never contains
code that either directly or indirectly does a voluntary context switch. This code might
be preempted, but it will never directly or indirectly invoke schedule(). This suggests
a variant of RCU having voluntary context switches and idle execution as its only
quiescent states. This variant is Tasks RCU.
Tasks RCU is unusual in having no read-side marking functions, which is good given
that its main use case has nowhere to put such markings. Instead, calls to schedule()
serve directly as quiescent states. Updates can use synchronize_rcu_tasks() to wait
for all pre-existing trampoline execution to complete, or they can use its asynchronous
counterpart, call_rcu_tasks(). There is also an rcu_barrier_tasks() that waits
for completion of callbacks corresponding to all prior invocations of call_rcu_
tasks(). There is no synchronize_rcu_tasks_expedited() because there has
not yet been a request for it, though implementing a useful variant of it would not be
free of challenges.
The “Tasks RCU Rude” column provides a more effective variant of the toy imple-
mentation presented in Section 9.5.1.4. This variant causes each CPU to execute a
context switch, so that any voluntary context switch or any preemptible region of code
can serve as a quiescent state. The Tasks RCU Rude variant uses the Linux-kernel
workqueues facility to force concurrent context switches, in contrast to the serial
CPU-by-CPU approach taken by the toy implementation. The API mirrors that of Tasks
RCU, including the lack of explicit read-side markers.
Finally, the “Tasks RCU Trace” column provides an RCU implementation with
functionality similar to that of SRCU, except with much faster read-side markers.14
However, this speed is a consequence of the fact that these markers do not execute
memory-barrier instructions, which means that Tasks RCU Trace grace periods must
often send IPIs to all CPUs and must always scan the entire task list, thus degrading
real-time response and consuming considerable CPU time. Nevertheless, in the absence
of readers, the resulting grace-period latency is reasonably short, rivaling that of RCU.
14 And thus is unusual for the Tasks RCU family for having explicit read-side markers!
Quick Quiz 9.46: Are there any downsides to the fact that these traversal and update primitives
can be used with any of the RCU API family members?
pointer. When get_nulls_value() returns an unexpected value, the reader can take
corrective action, for example, restarting its traversal from the beginning.
Quick Quiz 9.47: But what if an hlist_nulls reader gets moved to some other bucket and
then back again?
15 q = kmalloc(sizeof(*p), GFP_KERNEL);
16 *q = *p;
17 q->b = 2;
18 q->c = 3;
19 list_replace_rcu(&p->list, &q->list);
20 synchronize_rcu();
21 kfree(p);
The following discussion walks through this code, using Figure 9.19 to illustrate
the state changes. The triples in each element represent the values of fields ->a,
->b, and ->c, respectively. The red-shaded elements might be referenced by readers,
and because readers do not synchronize directly with updaters, readers might run
concurrently with this entire replacement process. Please note that backwards pointers
and the link from the tail to the head are omitted for clarity.
The initial state of the list, including the pointer p, is the same as for the deletion
example, as shown on the first row of the figure.
The following text describes how to replace the 5,6,7 element with 5,2,3 in such a
way that any given reader sees one of these two values.
Line 15 allocates a replacement element, resulting in the state as shown in the second
row of Figure 9.19. At this point, no reader can hold a reference to the newly allocated
element (as indicated by its green shading), and it is uninitialized (as indicated by the
question marks).
Line 16 copies the old element to the new one, resulting in the state as shown in the
third row of Figure 9.19. The newly allocated element still cannot be referenced by
readers, but it is now initialized.
Table 9.4: RCU-Protected List APIs

List types:
    list: Circular doubly linked list
    hlist: Linear doubly linked list
    hlist_nulls: Linear doubly linked list with marked NULL pointer, with up to 31 bits of marking
    hlist_bl: Linear doubly linked list with bit locking

Structures:
    list: struct list_head
    hlist: struct hlist_head, struct hlist_node
    hlist_nulls: struct hlist_nulls_head, struct hlist_nulls_node
    hlist_bl: struct hlist_bl_head, struct hlist_bl_node

Initialization:
    list: INIT_LIST_HEAD_RCU()

Full traversal:
    list: list_for_each_entry_rcu(), list_for_each_entry_lockless()
    hlist: hlist_for_each_entry_rcu(), hlist_for_each_entry_rcu_bh(), hlist_for_each_entry_rcu_notrace()
    hlist_nulls: hlist_nulls_for_each_entry_rcu(), hlist_nulls_for_each_entry_safe()
    hlist_bl: hlist_bl_for_each_entry_rcu()

Resume traversal:
    list: list_for_each_entry_continue_rcu(), list_for_each_entry_from_rcu()
    hlist: hlist_for_each_entry_continue_rcu(), hlist_for_each_entry_continue_rcu_bh(), hlist_for_each_entry_from_rcu()

Stepwise traversal:
    list: list_entry_rcu(), list_entry_lockless(), list_first_or_null_rcu(), list_next_rcu(), list_next_or_null_rcu()
    hlist: hlist_first_rcu(), hlist_next_rcu(), hlist_pprev_rcu()
    hlist_nulls: hlist_nulls_first_rcu(), hlist_nulls_next_rcu()
    hlist_bl: hlist_bl_first_rcu()

Add:
    list: list_add_rcu(), list_add_tail_rcu()
    hlist: hlist_add_before_rcu(), hlist_add_behind_rcu(), hlist_add_head_rcu(), hlist_add_tail_rcu()
    hlist_nulls: hlist_nulls_add_head_rcu()
    hlist_bl: hlist_bl_add_head_rcu(), hlist_bl_set_first_rcu()

Delete:
    list: list_del_rcu()
    hlist: hlist_del_rcu(), hlist_del_init_rcu()
    hlist_nulls: hlist_nulls_del_rcu(), hlist_nulls_del_init_rcu()
    hlist_bl: hlist_bl_del_rcu(), hlist_bl_del_init_rcu()

Replace:
    list: list_replace_rcu()
    hlist: hlist_replace_rcu()

Splice:
    list: list_splice_init_rcu(), list_splice_tail_init_rcu()
[Figure 9.19: list states after each step: Allocate (?,?,?), Copy (5,6,7), Update (5,2,3), list_replace_rcu() (5,2,3), synchronize_rcu() (5,2,3), kfree()]
Line 17 updates q->b to the value “2”, and line 18 updates q->c to the value “3”, as
shown on the fourth row of Figure 9.19. Note that the newly allocated structure is still
inaccessible to readers.
Now, line 19 does the replacement, so that the new element is finally visible to readers,
and hence is shaded red, as shown on the fifth row of Figure 9.19. At this point, as
shown below, we have two versions of the list. Pre-existing readers might see the 5,6,7
element (which is therefore now shaded yellow), but new readers will instead see the
5,2,3 element. But any given reader is guaranteed to see one set of values or the other,
not a mixture of the two.
After the synchronize_rcu() on line 20 returns, a grace period will have elapsed,
and so all reads that started before the list_replace_rcu() will have completed. In
particular, any readers that might have been holding references to the 5,6,7 element are
guaranteed to have exited their RCU read-side critical sections, and are thus prohibited
from continuing to hold a reference. Therefore, there can no longer be any readers
holding references to the old element, as indicated by its green shading in the sixth row of
Figure 9.19. As far as the readers are concerned, we are back to having a single version
of the list, but with the new element in place of the old.
After the kfree() on line 21 completes, the list will appear as shown on the final
row of Figure 9.19.
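The allocate, copy, update, replace, wait, and free steps described above can be sketched in userspace C. This is a minimal sketch under several assumptions, not the kernel code that the text walks through: a single atomic pointer (the hypothetical `slot`) stands in for the list element, C11 atomics stand in for rcu_assign_pointer() and rcu_dereference(), and a toy counter-based RCU stands in for the kernel's rcu_read_lock(), rcu_read_unlock(), and synchronize_rcu(). The function names read_sum() and replace_b_c() are likewise illustrative, and the caller is assumed to serialize updaters.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Toy counter-based RCU: an assumption for illustration only. */
static atomic_int rcu_refcnt;

static void rcu_read_lock(void)   { atomic_fetch_add(&rcu_refcnt, 1); }
static void rcu_read_unlock(void) { atomic_fetch_sub(&rcu_refcnt, 1); }

/* Spin until no reader is inside a read-side critical section. */
static void synchronize_rcu(void)
{
	while (atomic_load(&rcu_refcnt) != 0)
		continue;
}

struct foo {
	int a, b, c;
};

/* Hypothetical RCU-protected slot standing in for the list element. */
static _Atomic(struct foo *) slot;

/* Reader: sees either the old or the new element, never a mixture. */
static int read_sum(void)
{
	struct foo *p;
	int sum = -1;

	rcu_read_lock();
	p = atomic_load_explicit(&slot, memory_order_acquire);
	if (p)
		sum = p->a + p->b + p->c;
	rcu_read_unlock();
	return sum;
}

/* Updater (caller serializes updaters): returns 0 on success. */
static int replace_b_c(int b, int c)
{
	struct foo *p = atomic_load_explicit(&slot, memory_order_relaxed);
	struct foo *q = malloc(sizeof(*q));		/* Allocate. */

	if (!p || !q) {
		free(q);
		return -1;
	}
	*q = *p;					/* Copy. */
	q->b = b;					/* Update the copy. */
	q->c = c;
	atomic_store_explicit(&slot, q, memory_order_release); /* Replace. */
	synchronize_rcu();				/* Wait for old readers. */
	free(p);					/* Free the old version. */
	return 0;
}
```

Because the old element is never modified after publication, a reader that fetched the old pointer sees 5,6,7 throughout its critical section, while a reader that fetched the new pointer sees 5,2,3.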
Despite the fact that RCU was named after the replacement case, the vast majority
of RCU usage within the Linux kernel relies on the simple independent insertion and
deletion, as was shown in Figure 9.15 in Section 9.5.2.3.
The next section looks at APIs that assist developers in debugging their code that
makes use of RCU.
Quick Quiz 9.48: Why isn’t there a rcu_read_lock_tasks_held() for Tasks RCU?
Because rcu_read_lock() cannot be used from the idle loop, and because energy-
efficiency concerns have caused the idle loop to become quite ornate, rcu_is_
watching() returns true if invoked in a context where use of rcu_read_lock() is
legal. Note again that srcu_read_lock() may be used from idle and even offline
CPUs, which means that rcu_is_watching() does not apply to SRCU.
RCU_LOCKDEP_WARN() emits a warning if lockdep is enabled and if its argument
evaluates to true. For example, RCU_LOCKDEP_WARN(!rcu_read_lock_held())
would emit a warning if invoked outside of an RCU read-side critical section.
RCU_NONIDLE() may be used to force RCU to watch when executing the statement
that is passed in as the sole argument. For example, RCU_NONIDLE(WARN_ON(!rcu_
is_watching())) would never emit a warning. However, changes in the 2020–2021
timeframe extend RCU’s reach deeper into the idle loop, which should greatly reduce or
even eliminate the need for RCU_NONIDLE().
Finally, rcu_sleep_check() emits a warning if invoked within an RCU, RCU-bh,
or RCU-sched read-side critical section.
[Figure 9.21: lookups per millisecond versus number of CPUs (threads); traces: ideal, RCU, seqlock, hazptr]
listed in Table 9.6 and as displayed in Figure 9.23. Following the sections listed in this
table, Section 9.5.4.12 provides a summary.
[Figure 9.22: lookups per millisecond versus number of CPUs (threads); traces: RCU-QSBR, seqlock, hazptr]
Figure 9.21 shows the performance on the read-only workload. RCU scales quite
well, and offers nearly ideal performance. However, this data was generated using
the RCU_SIGNAL flavor of userspace RCU [Des09b, MDJ13f], for which rcu_read_
lock() and rcu_read_unlock() generate a small amount of code. What happens for
the QSBR flavor of RCU, which generates no code at all for rcu_read_lock() and
rcu_read_unlock()? (See Section 9.5.1, and especially Figure 9.8, for a discussion
of RCU QSBR.)
The answer to this is shown in Figure 9.22, which shows that RCU QSBR’s performance
and scalability actually exceed that of the ideal synchronization-free workload.
Quick Quiz 9.49: Wait, what??? How can RCU QSBR possibly be better than ideal? Just
what rubbish definition of ideal would fail to be the best of all possible results???
Quick Quiz 9.50: Given RCU QSBR’s read-side performance, why bother with any other
flavor of userspace RCU?
1. Make a change, for example, to the way that the OS reacts to an NMI.
2. Wait for all pre-existing read-side critical sections to completely finish (for example,
by using the synchronize_sched() primitive).16 The key observation here is
that subsequent RCU read-side critical sections are guaranteed to see whatever
change was made.
3. Clean up, for example, return status indicating that the change was successfully
made.
The remainder of this section presents example code adapted from the Linux kernel.
In this example, the nmi_stop() function in the now-defunct oprofile facility uses
synchronize_sched() to ensure that all in-flight NMI notifications have completed
before freeing the associated resources. A simplified version of this code is shown in
Listing 9.16.
Lines 1–4 define a profile_buffer structure, containing a size and an indefinite
array of entries. Line 5 defines a pointer to a profile buffer, which is presumably
initialized elsewhere to point to a dynamically allocated region of memory.
Lines 7–16 define the nmi_profile() function, which is called from within an
NMI handler. As such, it cannot be preempted, nor can it be interrupted by a normal
interrupt handler. However, it is still subject to delays due to cache misses, ECC errors,
and cycle stealing by other hardware threads within the same core. Line 9 gets a local
pointer to the profile buffer using the rcu_dereference() primitive to ensure memory
ordering on DEC Alpha, and lines 11 and 12 exit from this function if there is no profile
buffer currently allocated, while lines 13 and 14 exit from this function if the pcvalue
argument is out of range. Otherwise, line 15 increments the profile-buffer entry indexed
by the pcvalue argument. Note that storing the size with the buffer guarantees that the
range check matches the buffer, even if a large buffer is suddenly replaced by a smaller
one.
16 In Linux kernel v5.1 and later, synchronize_sched() has been subsumed into
synchronize_rcu().
Lines 18–27 define the nmi_stop() function, where the caller is responsible for
mutual exclusion (for example, holding the correct lock). Line 20 fetches a pointer to the
profile buffer, and lines 22 and 23 exit the function if there is no buffer. Otherwise, line 24
NULLs out the profile-buffer pointer (using the rcu_assign_pointer() primitive to
maintain memory ordering on weakly ordered machines), and line 25 waits for an RCU
Sched grace period to elapse, in particular, waiting for all non-preemptible regions of
code, including NMI handlers, to complete. Once execution continues at line 26, we
are guaranteed that any instance of nmi_profile() that obtained a pointer to the old
buffer has returned. It is therefore safe to free the buffer, in this case using the kfree()
primitive.
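The profile-buffer walkthrough above can be approximated in userspace as follows. This is a sketch under stated assumptions rather than the oprofile code: the toy counter-based RCU and its explicit rcu_read_lock() and rcu_read_unlock() markers replace the implicit protection that NMI context and synchronize_sched() provide in the kernel, C11 atomics replace rcu_dereference() and rcu_assign_pointer(), and nmi_start() is a hypothetical helper added only so that the example has a buffer to work with. As in the text, the caller of nmi_start() and nmi_stop() is assumed to provide mutual exclusion.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Toy counter-based RCU: an assumption for illustration only. */
static atomic_int rcu_refcnt;

static void rcu_read_lock(void)   { atomic_fetch_add(&rcu_refcnt, 1); }
static void rcu_read_unlock(void) { atomic_fetch_sub(&rcu_refcnt, 1); }

static void synchronize_rcu(void)
{
	while (atomic_load(&rcu_refcnt) != 0)
		continue;
}

struct profile_buffer {
	long size;		/* Stored with the buffer so the range check matches it. */
	atomic_long entry[];	/* One counter per profiled location. */
};

static _Atomic(struct profile_buffer *) buf;

/* Stand-in for the NMI-handler path: count one hit at pcvalue. */
static void nmi_profile(unsigned long pcvalue)
{
	struct profile_buffer *p;

	rcu_read_lock();
	p = atomic_load_explicit(&buf, memory_order_acquire);
	if (p && pcvalue < (unsigned long)p->size)
		atomic_fetch_add(&p->entry[pcvalue], 1);
	rcu_read_unlock();
}

/* Hypothetical helper: allocate and publish a buffer. Returns 0 on success. */
static int nmi_start(long size)
{
	struct profile_buffer *p;

	if (size <= 0 || atomic_load(&buf))
		return -1;
	p = malloc(sizeof(*p) + size * sizeof(p->entry[0]));
	if (!p)
		return -1;
	p->size = size;
	for (long i = 0; i < size; i++)
		atomic_init(&p->entry[i], 0);
	atomic_store_explicit(&buf, p, memory_order_release);
	return 0;
}

/* Unpublish the buffer, wait for in-flight nmi_profile() calls, then free. */
static void nmi_stop(void)
{
	struct profile_buffer *p = atomic_load_explicit(&buf, memory_order_relaxed);

	if (!p)
		return;
	atomic_store_explicit(&buf, NULL, memory_order_release);
	synchronize_rcu();
	free(p);
}
```

The ordering in nmi_stop() is the point of the example: the pointer is cleared first, so no new nmi_profile() call can obtain the buffer, and only then does the updater wait and free.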
Quick Quiz 9.51: Suppose that the nmi_profile() function was preemptible. What would
need to change to make this example work correctly?
In short, RCU makes it easy to dynamically switch among profile buffers (you just
try doing this efficiently with atomic operations, or at all with locking!). This is a rare
use of RCU in its pure form. RCU is normally used at higher levels of abstraction, as
will be shown in the following sections.
Figure 9.24 shows a timeline for an example phased state change to efficiently handle
maintenance operations. If there is no maintenance operation in progress, common-case
operations must proceed quickly, for example, without acquiring a reader-writer lock.
However, if there is a maintenance operation in progress, the common-case operations
must be undertaken carefully, taking into account added complexities due to their
running concurrently with that maintenance operation. This means that common-case
operations will incur higher overhead during maintenance operations, which is one
reason that maintenance operations are normally scheduled to take place during times
of low load.
In the figure, these apparently conflicting requirements are resolved by having a
prepare phase prior to the maintenance operation and a cleanup phase after it, during
which the common-case operations can proceed either quickly or carefully.
Example pseudo-code for this phased state change is shown in Listing 9.17. The
common-case operations are carried out by cco() within an RCU read-side critical
section extending from line 5 to line 10. Here, line 6 checks a global be_careful flag,
invoking cco_carefully() or cco_quickly(), as indicated.
This allows the maint() function to set the be_careful flag on line 15 and wait for
an RCU grace period on line 16. When control reaches line 17, all cco() functions that
saw a false value of be_careful (and thus which might invoke the cco_quickly()
function) will have completed their operations, so that all currently executing cco()
functions will be invoking cco_carefully(). This means that it is safe for the
do_maint() function to be invoked. Line 18 then waits for all cco() functions that
might have run concurrently with do_maint() to complete, and finally line 19 sets the
be_careful flag back to false.
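The phased state change just described can be sketched in userspace C. This sketch makes several assumptions: a toy counter-based RCU stands in for the real primitives, an atomic_bool stands in for the be_careful flag, and the bodies of cco_quickly(), cco_carefully(), and do_maint() are hypothetical placeholders that merely count invocations.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Toy counter-based RCU: an assumption for illustration only. */
static atomic_int rcu_refcnt;

static void rcu_read_lock(void)   { atomic_fetch_add(&rcu_refcnt, 1); }
static void rcu_read_unlock(void) { atomic_fetch_sub(&rcu_refcnt, 1); }

static void synchronize_rcu(void)
{
	while (atomic_load(&rcu_refcnt) != 0)
		continue;
}

static atomic_bool be_careful;

/* Hypothetical placeholders that only count how often they run. */
static atomic_int quick_calls, careful_calls, maint_runs;

static void cco_quickly(void)   { atomic_fetch_add(&quick_calls, 1); }
static void cco_carefully(void) { atomic_fetch_add(&careful_calls, 1); }
static void do_maint(void)      { atomic_fetch_add(&maint_runs, 1); }

/* Common-case operation: a lightweight flag check inside a read-side section. */
static void cco(void)
{
	rcu_read_lock();
	if (atomic_load(&be_careful))
		cco_carefully();
	else
		cco_quickly();
	rcu_read_unlock();
}

/* Maintenance operation: prepare, maintain, clean up. */
static void maint(void)
{
	atomic_store(&be_careful, true);
	synchronize_rcu();	/* All cco() calls that saw "false" are done. */
	do_maint();
	synchronize_rcu();	/* All cco() calls overlapping do_maint() are done. */
	atomic_store(&be_careful, false);
}
```

The common-case path pays only for a flag load and the read-side markers, while the cost of waiting falls entirely on maint().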
[Figure 9.24: timeline of common-case operations versus maintenance operations. Common-case operations run quickly, then either quickly or carefully during Prepare, carefully during Maintenance, either quickly or carefully during Clean up, and quickly again afterwards.]
Quick Quiz 9.52: What is the point of the second call to synchronize_rcu() in function
maint() in Listing 9.17? Isn’t it OK for any cco() invocations in the clean-up phase to invoke
either cco_carefully() or cco_quickly()?
Quick Quiz 9.53: How can you be sure that the code shown in maint() in Listing 9.17 really
works?
Phased state change allows frequent operations to use light-weight checks, without
the need for expensive lock acquisitions or atomic read-modify-write operations, and is
used in the Linux kernel in the guise of rcu_sync [NZ13] to implement a variant of
reader-writer semaphores with lightweight readers. Phased state change adds only a
checked state variable to the wait-to-finish use case (Section 9.5.4.2), thus also residing
at a rather low level of abstraction.
system executing in an RCU read-side critical section? Wouldn’t that prevent any data from a
SLAB_TYPESAFE_BY_RCU slab ever being returned to the system, possibly resulting in OOM
events?
Otherwise, line 13 acquires the update-side spinlock, and line 14 then checks that the
element is still the one that we want. If so, line 15 leaves the RCU read-side critical
section, line 16 removes it from the table, line 17 releases the lock, line 18 waits for
all pre-existing RCU read-side critical sections to complete, line 19 frees the newly
removed element, and line 20 indicates success. If the element is no longer the one we
want, line 22 releases the lock, line 23 leaves the RCU read-side critical section, and
line 24 indicates failure to delete the specified key.
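The deletion sequence described above (lock the element, recheck it, leave the read-side critical section, unlink, unlock, wait, free) can be sketched in userspace C. The following rests on assumptions rather than reproducing Listing 9.18: each bucket holds at most one element, a toy counter-based RCU and an atomic_flag spinlock stand in for the kernel primitives, and the names hashfunction(), insert(), and delete_key() are illustrative.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Toy counter-based RCU: an assumption for illustration only. */
static atomic_int rcu_refcnt;

static void rcu_read_lock(void)   { atomic_fetch_add(&rcu_refcnt, 1); }
static void rcu_read_unlock(void) { atomic_fetch_sub(&rcu_refcnt, 1); }

static void synchronize_rcu(void)
{
	while (atomic_load(&rcu_refcnt) != 0)
		continue;
}

#define NBUCKETS 16

struct element {
	int key;
	atomic_flag lock;	/* Per-element update-side spinlock. */
};

static _Atomic(struct element *) hashtable[NBUCKETS];

static void spin_lock(atomic_flag *l)
{
	while (atomic_flag_test_and_set_explicit(l, memory_order_acquire))
		continue;
}

static void spin_unlock(atomic_flag *l)
{
	atomic_flag_clear_explicit(l, memory_order_release);
}

static int hashfunction(int key)
{
	return (unsigned int)key % NBUCKETS;
}

/* Hypothetical helper: publish a new element if its bucket is empty. */
static int insert(int key)
{
	struct element *expected = NULL;
	struct element *p = malloc(sizeof(*p));

	if (!p)
		return 0;
	p->key = key;
	atomic_flag_clear(&p->lock);
	if (!atomic_compare_exchange_strong(&hashtable[hashfunction(key)],
					    &expected, p)) {
		free(p);
		return 0;
	}
	return 1;
}

/* Returns 1 if the key was found and deleted, 0 otherwise. */
static int delete_key(int key)
{
	int b = hashfunction(key);
	struct element *p;

	rcu_read_lock();	/* Keeps *p from being freed while we lock it. */
	p = atomic_load_explicit(&hashtable[b], memory_order_acquire);
	if (!p || p->key != key) {
		rcu_read_unlock();
		return 0;
	}
	spin_lock(&p->lock);
	if (atomic_load(&hashtable[b]) == p && p->key == key) {
		rcu_read_unlock();	/* The lock now guarantees existence. */
		atomic_store_explicit(&hashtable[b], NULL, memory_order_release);
		spin_unlock(&p->lock);
		synchronize_rcu();	/* Wait for pre-existing readers. */
		free(p);
		return 1;
	}
	spin_unlock(&p->lock);
	rcu_read_unlock();
	return 0;
}
```

The read-side critical section is what makes it safe to acquire a lock that lives inside the element being deleted: the element cannot be freed between the lookup and the spin_lock().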
Quick Quiz 9.56: Why is it OK to exit the RCU read-side critical section on line 15 of
Listing 9.18 before releasing the lock on line 17?
Quick Quiz 9.57: Why not exit the RCU read-side critical section on line 23 of Listing 9.18
before releasing the lock on line 22?
Quick Quiz 9.58: The RCU-based algorithm shown in Listing 9.18 looks very similar to that
in Listing 7.11, so why should the RCU-based approach be any better?
Alert readers will recognize this as only a slight variation on the original wait-to-finish
theme (Section 9.5.4.2), adding publish/subscribe, linked structures, a heap allocator
(typically), and deferred reclamation, as shown in Figure 9.23. They might also note the
deadlock-immunity advantages over the lock-based existence guarantees discussed in
Section 7.4.
must manually indicate when a given data structure is eligible to be collected and (2) The
programmer must manually mark the RCU read-side critical sections where references
might be held.
Despite these differences, the resemblance does go quite deep. In fact, the first
RCU-like mechanism I am aware of used a reference-count-based garbage collector
to handle the grace periods [KL80], and the connection between RCU and garbage
collection has been noted more recently [SWS16].
The light-weight garbage collector use case is very similar to the existence-guarantee
use case, adding only the desired non-blocking algorithm to the mix. This light-weight
garbage collector use case can also be used in conjunction with the existence guarantees
described in the next section.
[Figures 9.25 and 9.26: nanoseconds per operation versus number of CPUs (threads); traces: rwlock and RCU]
Quick Quiz 9.60: Didn’t an earlier edition of this book show RCU read-side overhead way
down in the sub-picosecond range? What happened???
Quick Quiz 9.61: Why is there such large variation for the RCU trace in Figure 9.25?
Note that reader-writer locking is more than an order of magnitude slower than RCU
on a single CPU, and is more than four orders of magnitude slower on 192 CPUs. In
contrast, RCU scales quite well. In both cases, the error bars cover the full range of the
measurements from 30 runs, with the line being the median.
[Figure 9.27: plot against critical-section duration (nanoseconds); trace labels: 10 CPUs, 1 CPU, RCU]
Of course, the low performance of reader-writer locking in Figures 9.25 and 9.26 is
exaggerated by the unrealistic zero-length critical sections. The performance advantages
of RCU decrease as the overhead of the critical sections increases, as shown in Figure 9.27,
which was run on the same system as the previous plots. Here, the y-axis represents
the sum of the overhead of the read-side primitives and that of the critical section, and
the x-axis represents the critical-section overhead in nanoseconds. But please note
the log-scale y-axis, which means that the small separations between the traces still
represent significant differences. This figure shows non-preemptible RCU, but given
that preemptible RCU’s read-side overhead is only about three nanoseconds, its plot
would be nearly identical to Figure 9.27.
Quick Quiz 9.63: Why the larger error ranges for the submicrosecond durations in Figure 9.27?
There are three traces for reader-writer locking, with the upper trace being for
100 CPUs, the next for 10 CPUs, and the lowest for 1 CPU. The greater the number of
CPUs and the shorter the critical sections, the greater is RCU’s performance advantage.
These performance advantages are underscored by the fact that 100-CPU systems are
no longer uncommon and that a number of system calls (and thus any RCU read-side
critical sections that they contain) complete within microseconds.
In addition, as is discussed in the next paragraph, RCU read-side primitives are almost
entirely deadlock-immune.
was in fact its immunity to read-side deadlocks. This immunity stems from the fact that
RCU read-side primitives do not block, spin, or even do backwards branches, so that
their execution time is deterministic. It is therefore impossible for them to participate in
a deadlock cycle.
Quick Quiz 9.64: Is there an exception to this deadlock immunity, and if so, what sequence of
events could lead to deadlock?
rcu_read_lock();
list_for_each_entry_rcu(p, &head, list_field) {
	do_something_with(p);
	if (need_update(p)) {
		spin_lock(&my_lock);
		do_update(p);
		spin_unlock(&my_lock);
	}
}
rcu_read_unlock();
Note that do_update() is executed under the protection of the lock and under RCU
read-side protection.
Another interesting consequence of RCU’s deadlock immunity is its immunity to a
large class of priority inversion problems. For example, low-priority RCU readers cannot
prevent a high-priority RCU updater from acquiring the update-side lock. Similarly, a
low-priority RCU updater cannot prevent high-priority RCU readers from entering an
RCU read-side critical section.
Quick Quiz 9.65: Immunity to both deadlock and priority inversion??? Sounds too good to
be true. Why should I believe that this is even possible?
Realtime Latency Because RCU read-side primitives neither spin nor block, they
offer excellent realtime latencies. In addition, as noted earlier, this means that they are
immune to priority inversion involving the RCU read-side primitives and locks.
However, RCU is susceptible to more subtle priority-inversion scenarios. For example,
a high-priority process waiting for an RCU grace period to elapse can be
blocked by low-priority RCU readers in -rt kernels. This can be solved by using RCU
priority boosting [McK07d, GMTW08].
However, use of RCU priority boosting requires that rcu_read_unlock() do
deboosting, which entails acquiring scheduler locks. Some care is therefore required
within the scheduler and RCU to avoid deadlocks, which as of the v5.15 Linux kernel
requires RCU to avoid invoking the scheduler while holding any of RCU’s locks.
This in turn means that rcu_read_unlock() is not always lockless when RCU
priority boosting is enabled. However, rcu_read_unlock() will still be lockless if
its critical section was not priority-boosted. Furthermore, critical sections will not be
priority boosted unless they are preempted, or, in -rt kernels, they acquire non-raw
spinlocks. This means that rcu_read_unlock() will normally be lockless from the
perspective of the highest priority task running on any given CPU.
RCU Readers and Updaters Run Concurrently Because RCU readers never spin
nor block, and because updaters are not subject to any sort of rollback or abort semantics,
RCU readers and updaters really can run concurrently. This means that RCU readers
might access stale data, and might even see inconsistencies, either of which can render
conversion from reader-writer locking to RCU non-trivial.
However, in a surprisingly large number of situations, inconsistencies and stale data
are not problems. The classic example is the networking routing table. Because routing
updates can take considerable time to reach a given system (seconds or even minutes),
the system will have been sending packets the wrong way for quite some time when
the update arrives. It is usually not a problem to continue sending packets the wrong
way for a few additional milliseconds. Furthermore, because RCU updaters can make
changes without waiting for RCU readers to finish, the RCU readers might well see the
change more quickly than would batch-fair reader-writer-locking readers, as shown in
Figure 9.28.
Quick Quiz 9.66: But how many other algorithms really tolerate stale and inconsistent data?
Once the update is received, the rwlock writer cannot proceed until the last reader
completes, and subsequent readers cannot proceed until the writer completes. However,
these subsequent readers are guaranteed to see the new value, as indicated by the green
shading of the rightmost boxes. In contrast, RCU readers and updaters do not block
each other, which permits the RCU readers to see the updated values sooner. Of course,
because their execution overlaps that of the RCU updater, all of the RCU readers might
well see updated values, including the three readers that started before the update.
Nevertheless, only the green-shaded rightmost RCU readers are guaranteed to see the
updated values.
Reader-writer locking and RCU simply provide different guarantees. With reader-
writer locking, any reader that begins after the writer begins is guaranteed to see new
values, and any reader that attempts to begin while the writer is spinning might or
might not see new values, depending on the reader/writer preference of the rwlock
implementation in question. In contrast, with RCU, any reader that begins after the
updater completes is guaranteed to see new values, and any reader that completes after
the updater begins might or might not see new values, depending on timing.
The key point here is that, although reader-writer locking does indeed guarantee
consistency within the confines of the computer system, there are situations where
this consistency comes at the price of increased inconsistency with the outside world,
courtesy of the finite speed of light and the non-zero size of atoms. In other words,
reader-writer locking obtains internal consistency at the price of silently stale data with
respect to the outside world.
Note that if a value is computed while read-holding a reader-writer lock, and then
that value is used after that lock is released, then this reader-writer-locking use case
is using stale data. After all, the quantities that this value is based on could change
at any time after that lock is released. This sort of reader-writer-locking use case is
often easy to convert to RCU, as will be shown in Listings 9.19, 9.20, and 9.21 and the
accompanying text.
RCU Grace Periods Extend for Many Milliseconds With the exception of userspace
RCU [Des09b, MDJ13f], expedited grace periods, and several of the “toy” RCU
implementations described in Appendix B, RCU grace periods extend for milliseconds.
Although there are a number of techniques to render such long delays harmless, including
use of the asynchronous interfaces (call_rcu() and call_rcu_bh()) or of the polling
interfaces (get_state_synchronize_rcu(), start_poll_synchronize_rcu(),
and poll_state_synchronize_rcu()), this situation is a major reason for the rule
of thumb that RCU be used in read-mostly situations.
As noted in Section 9.5.3, within the Linux kernel, shorter grace periods may be
obtained via expedited grace periods, for example, by invoking synchronize_rcu_
expedited() instead of synchronize_rcu(). Expedited grace periods can reduce
delays to as little as a few tens of microseconds, albeit at the expense of higher CPU
utilization and IPIs. The added IPIs can be especially unwelcome in some real-time
workloads.
Code: Reader-Writer Locking vs. RCU In the best case, the conversion from
reader-writer locking to RCU is quite simple, as shown in Listings 9.19, 9.20, and 9.21,
all taken from Wikipedia [MPA+06].
However, the transformation is not always this straightforward. This is because neither
the spin_lock() nor the synchronize_rcu() in Listing 9.21 excludes the readers in
Listing 9.20. First, the spin_lock() does not interact in any way with rcu_read_
lock() and rcu_read_unlock(), thus not excluding them. Second, although both
write_lock() and synchronize_rcu() wait for pre-existing readers, only write_
lock() prevents subsequent readers from commencing.18 Thus, synchronize_rcu()
18 Kudos to whoever pointed this out to Paul.
RCU dispenses entirely with constraint #3 and weakens the other two as follows:
1. Writers wait for any pre-existing read-holders before progressing to the destructive
phase of their update (usually the freeing of memory).
(a) That data has been unlinked so as to be inaccessible to later readers, and
(b) A subsequent RCU grace period has elapsed.
Of course, there are some reader-writer-locking use cases for which RCU’s weakened
semantics are inappropriate, but experience in the Linux kernel indicates that more than
80% of reader-writer locks can in fact be replaced by RCU. For example, a common
reader-writer-locking use case computes some value while holding the lock and then
uses that value after releasing that lock. This use case results in stale data, and therefore
often accommodates RCU’s weaker semantics.
This interaction of temporal and spatial constraints is illustrated by the RCU singleton
data structure shown in Figures 9.6 and 9.7. This structure is defined on lines 1–4 of
Listing 9.22, and contains two integer fields, ->a and ->b (singleton.c). The current
instance of this structure is referenced by the curconfig pointer defined on line 4.
The fields of the current structure are passed back through the cur_a and cur_b
parameters to the get_config() function defined on lines 6–20. These two fields can
be slightly out of date, but they absolutely must be consistent with each other. The
get_config() function provides this consistency within the RCU read-side critical
section starting on line 10 and ending on either line 13 or line 18, which provides the
needed temporal synchronization. Line 11 fetches the pointer to the current myconfig
structure. This structure will be used regardless of any concurrent changes due to
calls to the set_config() function, thus providing the needed spatial synchronization.
If line 12 determines that the curconfig pointer was NULL, line 14 returns failure.
Otherwise, lines 16 and 17 copy out the ->a and ->b fields and line 19 returns success.
These ->a and ->b fields are from the same myconfig structure, and the RCU read-side
critical section prevents this structure from being freed, thus guaranteeing that these
two fields are consistent with each other.
The structure is updated by the set_config() function shown in Listing 9.23.
Lines 5–8 allocate and initialize a new myconfig structure. Line 9 atomically exchanges
a pointer to this new structure with the pointer to the old structure in curconfig, while
also providing full memory ordering both before and after the xchg() operation, thus
providing the needed updater/reader spatial synchronization on the one hand and the
needed updater/updater synchronization on the other. If line 10 determines that the
pointer to the old structure was in fact non-NULL, line 11 waits for a grace period (thus
providing the needed reader/updater temporal synchronization) and line 12 frees the old
structure, safe in the knowledge that there are no longer any readers still referencing it.
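The get_config() and set_config() pair described above can be sketched in userspace C. This is a sketch under assumptions, not the contents of Listings 9.22 and 9.23: a toy counter-based RCU replaces the real primitives, a C11 acquire load stands in for rcu_dereference(), a sequentially consistent atomic_exchange() stands in for the fully ordered xchg(), and the return-value conventions (1 for success and 0 for failure in get_config(), 0 and -1 in set_config()) are chosen here for illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Toy counter-based RCU: an assumption for illustration only. */
static atomic_int rcu_refcnt;

static void rcu_read_lock(void)   { atomic_fetch_add(&rcu_refcnt, 1); }
static void rcu_read_unlock(void) { atomic_fetch_sub(&rcu_refcnt, 1); }

static void synchronize_rcu(void)
{
	while (atomic_load(&rcu_refcnt) != 0)
		continue;
}

struct myconfig {
	int a;
	int b;
};

static _Atomic(struct myconfig *) curconfig;

/* Reader: both fields come from one structure, so they are mutually consistent. */
static int get_config(int *cur_a, int *cur_b)
{
	struct myconfig *mcp;

	rcu_read_lock();	/* Temporal synchronization. */
	mcp = atomic_load_explicit(&curconfig, memory_order_acquire);
	if (!mcp) {
		rcu_read_unlock();
		return 0;
	}
	*cur_a = mcp->a;	/* The pointer is loaded once and reused: */
	*cur_b = mcp->b;	/* spatial synchronization.               */
	rcu_read_unlock();
	return 1;
}

/* Updater: publish a new structure, then reclaim the old one. */
static int set_config(int cur_a, int cur_b)
{
	struct myconfig *mcp = malloc(sizeof(*mcp));

	if (!mcp)
		return -1;
	mcp->a = cur_a;
	mcp->b = cur_b;
	mcp = atomic_exchange(&curconfig, mcp);	/* Returns the old pointer. */
	if (mcp) {
		synchronize_rcu();	/* Wait for readers of the old structure. */
		free(mcp);
	}
	return 0;
}
```

Because the exchange is atomic, concurrent set_config() calls each receive a distinct old pointer, so every old structure is freed exactly once.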
Figure 9.29 shows an abbreviated representation of get_config() on the left and
right and a similarly abbreviated representation of set_config() in the middle. Time
advances from top to bottom, and the address space of the objects referenced by
curconfig advances from left to right. The boxes with comma-separated numbers
each represent a myconfig structure, with the constraint that ->b is the square of ->a.
Each blue dash-dotted arrow represents an interaction with the old structure (on the left,
containing “5,25”) and each green dashed arrow represents an interaction with the new
structure (on the right, containing “9,81”).
[Figure 9.29: time advances downward and address space advances rightward; curconfig references the “5,25” and then the “9,81” structure. Readers: rcu_read_lock(); mcp = ...; *cur_a = mcp->a; (5) *cur_b = mcp->b; (25). Updater: mcp = kmalloc(...); mcp = xchg(&curconfig, mcp); synchronize_rcu(); with the grace period marked at center.]
The black dotted arrows represent temporal relationships between RCU readers
on the left and right and the RCU grace period at center, with each arrow pointing
from an older event to a newer event. The call to synchronize_rcu() followed the
leftmost rcu_read_lock(), and therefore that synchronize_rcu() invocation must
not return until after the corresponding rcu_read_unlock(). In contrast, the call to
synchronize_rcu() precedes the rightmost rcu_read_lock(), which allows the
return from that same synchronize_rcu() to ignore the corresponding rcu_read_
unlock(). These temporal relationships prevent the myconfig structures from being
freed while RCU readers are still accessing them.
The two horizontal grey dashed lines represent the period of time during which
different readers get different results. However, each reader will see one and only one of
the two objects. All readers that end before the first horizontal line will see the leftmost
myconfig structure, and all readers that start after the second horizontal line will see
the rightmost structure. Between the two lines, that is, during the grace period, different
readers might see different objects, but as long as each reader loads the curconfig
pointer only once, each reader will see a consistent view of its myconfig structure.
Quick Quiz 9.68: But doesn’t the RCU grace period start sometime after the call to
synchronize_rcu() rather than in the middle of that xchg() statement?
In short, when operating on a suitable linked data structure, RCU combines temporal
and spatial synchronization in order to approximate reader-writer locking, with RCU
read-side critical sections acting as the reader-writer-locking reader, as shown in
Figures 9.23 and 9.29. RCU’s temporal synchronization is provided by the read-side
markers, for example, rcu_read_lock() and rcu_read_unlock(), as well as the
update-side grace-period primitives, for example, synchronize_rcu() or call_
rcu(). The spatial synchronization is provided by the read-side rcu_dereference()
family of primitives, each of which subscribes to a version published by rcu_assign_
spin_lock(&mylock);
p = head;
rcu_assign_pointer(head, NULL);
spin_unlock(&mylock);
/* Wait for all references to be released. */
synchronize_rcu();
kfree(p);
The assignment to head prevents any future references to p from being acquired, and
the synchronize_rcu() waits for any previously acquired references to be released.
Quick Quiz 9.70: But wait! This is exactly the same code that might be used when thinking
of RCU as a replacement for reader-writer locking! What gives?
Of course, RCU can also be combined with traditional reference counting, as discussed
in Section 13.2.
But why bother? Again, part of the answer is performance, as shown in Figures 9.30
and 9.31, again showing data taken on a 448-CPU 2.1 GHz Intel x86 system for non-
preemptible and preemptible Linux-kernel RCU, respectively. Non-preemptible RCU’s
advantage over reference counting ranges from more than an order of magnitude at one
CPU up to about four orders of magnitude at 192 CPUs. Preemptible RCU’s advantage
ranges from about a factor of three at one CPU up to about three orders of magnitude at
192 CPUs.
[Figures 9.30 and 9.31: nanoseconds per operation versus number of CPUs (threads); traces: refcnt and RCU]
[Plot for Figure 9.32: nanoseconds per operation versus critical-section duration (nanoseconds); trace labels: 10 CPUs, 1 CPU, RCU]
Figure 9.32: Response Time of RCU vs. Reference Counting, 192 CPUs
However, as with reader-writer locking, the performance advantages of RCU are most
pronounced for short-duration critical sections and for large numbers of CPUs, as shown
in Figure 9.32 for the same system. In addition, as with reader-writer locking, many
system calls (and thus any RCU read-side critical sections that they contain) complete
in a few microseconds.
Although traditional reference counters are usually associated with a specific data
structure, or perhaps a specific group of data structures, this approach does have some
disadvantages. For example, maintaining a single global reference counter for a large
variety of data structures typically results in bouncing the cache line containing the
reference count. As we saw in Figures 9.30–9.32, such cache-line bouncing can severely
degrade performance.
In contrast, RCU’s lightweight rcu_read_lock(), rcu_dereference(), and
rcu_read_unlock() read-side primitives permit extremely frequent read-side usage with
negligible performance degradation. Note, however, that the calls to rcu_dereference()
do nothing specific to acquire a reference to the pointed-to object. The heavy
lifting is instead done by the rcu_read_lock() and rcu_read_unlock() primitives
and their interactions with RCU grace periods.
And ignoring those calls to rcu_dereference() permits RCU to be thought of as a
“bulk reference-counting” mechanism, where each call to rcu_read_lock() obtains
a reference on each and every RCU-protected object, and with little or no overhead.
However, the restrictions that go with RCU can be quite onerous. For example, in many
cases, the Linux-kernel prohibition against sleeping while in an RCU read-side critical
section would defeat the entire purpose. Such cases might be better served by the hazard
pointers mechanism described in Section 9.3. Cases where code rarely sleeps have been
handled by using RCU as a reference count in the common non-sleeping case and by
bridging to an explicit reference counter when sleeping is necessary.
Alternatively, situations where a reference must be held by a single task across a
section of code that sleeps may be accommodated with Sleepable RCU (SRCU) [McK06].
This fails to cover the not-uncommon situation where a reference is “passed” from one
task to another, for example, when a reference is acquired when starting an I/O and
released in the corresponding completion interrupt handler. Again, such cases might be
better handled by explicit reference counters or by hazard pointers.
Of course, SRCU brings restrictions of its own, namely that the return value from
srcu_read_lock() be passed into the corresponding srcu_read_unlock(), and that
no SRCU primitives be invoked from hardware interrupt handlers or from non-maskable
interrupt (NMI) handlers. The jury is still out as to how much of a problem is presented
by this restriction, and as to how it can best be handled.
However, in the common case where references are held within the confines of
a single CPU or task, RCU can be used as a high-performance and highly scalable
reference-counting mechanism.
As shown in Figure 9.23, quasi reference counts add RCU readers as individual or
bulk reference counts, possibly also bridging to reference counters in corner cases.
Nevertheless, there are situations where consistency and fresh data are required
across multiple data elements. Fortunately, there are a number of approaches that avoid
inconsistency and stale data, including the following:
1. Enclose RCU readers within sequence-locking readers, forcing the RCU readers to
be retried should an update occur, as described in Section 13.4.2 and Section 13.4.3.
2. Place the data that must be consistent into a single element of a linked data structure,
and refrain from updating those fields within any element visible to RCU readers.
RCU readers gaining a reference to any such element are then guaranteed to see
consistent values. See Section 13.5.4 for additional details.
3. Use a per-element lock that guards a “deleted” flag to allow RCU readers to reject
stale data [McK04, ACMS03].
4. Provide an existence flag that is referenced by all data elements whose update is to
appear atomic to RCU readers [McK14d, McK14a, McK15b, McK16b, McK16a].
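The second approach in the list above can be sketched in userspace C11 as follows. This is only a minimal illustration under stated assumptions, not code from this book: struct config, cur_config, config_read(), and config_update() are hypothetical names, the C11 acquire load and release-ordered exchange stand in for rcu_dereference() and rcu_assign_pointer(), and reclamation is left to the caller, who must wait for a grace period (for example, with synchronize_rcu()) before freeing the old version.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical element whose two fields must be seen as a matched pair. */
struct config {
	int low;
	int high;
};

/* Hypothetical globally visible pointer to the current version. */
static _Atomic(struct config *) cur_config;

/* Reader: fetch one version, then read both fields from that same version. */
static int config_read(int *low, int *high)
{
	struct config *p = atomic_load_explicit(&cur_config, memory_order_acquire);

	if (!p)
		return 0;
	*low = p->low;
	*high = p->high;
	return 1;
}

/*
 * Updater: never modify a published element in place.  Instead, build a
 * new element, publish it, and return the old one to the caller, who must
 * not free it until all pre-existing readers have finished.
 */
static struct config *config_update(int low, int high)
{
	struct config *newp = malloc(sizeof(*newp));

	assert(newp);
	newp->low = low;
	newp->high = high;
	return atomic_exchange_explicit(&cur_config, newp, memory_order_acq_rel);
}
```

Because a reader only ever sees a fully initialized element, it cannot observe the low value from one update paired with the high value from another.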
That said, it is possible to build higher-level constructs on top of RCU, including the
various use cases described in the earlier sections. Furthermore, I have no doubt that
new use cases will continue to be found for RCU, as well as for any of a number of other
synchronization primitives. And so it is that RCU’s use cases are conceptually more
complex than is RCU itself, as hinted on page 201.
Quick Quiz 9.71: Which of these use cases best describes the Pre-BSD routing example in
Section 9.5.4.1?
In the meantime, Figure 9.33 shows some rough rules of thumb on where RCU is
most helpful.
[Figure 9.33 (diagram). Axis labels: "100% Reads", "100% Writes", and "Need Fully Fresh and Consistent Data". Box labels: "Read-Mostly, Stale & Inconsistent Data OK (RCU Works Great!!!)"; "Read-Mostly, Need Consistent Data (RCU Works Well)"; "Read-Write, Need Consistent Data (RCU Might Be OK...)"; "Write-Mostly, Need Consistent Data (RCU Not Best)*".]
* 1. RCU provides ABA protection for update-friendly synchronization mechanisms
* 2. RCU provides bounded wait-free read-side primitives for real-time use
As shown in the blue box in the upper-right corner of the figure, RCU works best
if you have read-mostly data where stale and inconsistent data is permissible (but see
below for more information on stale and inconsistent data). The canonical example
of this case in the Linux kernel is routing tables. Because it may have taken many
seconds or even minutes for the routing updates to propagate across the Internet, the
system has been sending packets the wrong way for quite some time. Having some
small probability of continuing to send some of them the wrong way for a few more
milliseconds is almost never a problem.
If you have a read-mostly workload where consistent data is required, RCU works
well, as shown by the green “read-mostly, need consistent data” box. One example
of this case is the Linux kernel’s mapping from user-level System-V semaphore IDs
to the corresponding in-kernel data structures. Semaphores tend to be used far more
frequently than they are created and destroyed, so this mapping is read-mostly. However,
it would be erroneous to perform a semaphore operation on a semaphore that has
already been deleted. This need for consistency is handled by using the lock in the
in-kernel semaphore data structure, along with a “deleted” flag that is set when deleting
a semaphore. If a user ID maps to an in-kernel data structure with the “deleted” flag set,
the data structure is ignored, so that the user ID is flagged as invalid.
Although this requires that the readers acquire a lock for the data structure representing
the semaphore itself, it allows them to dispense with locking for the mapping data
structure. The readers therefore locklessly traverse the tree used to map from ID to
data structure, which in turn greatly improves performance, scalability, and real-time
response.
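The “deleted”-flag check described above might look something like the following userspace sketch. All of the names here (struct sem_obj, id_map, sem_install(), sem_lookup_and_lock(), and sem_delete()) are hypothetical, a small array stands in for the kernel's ID-to-structure tree, and a pthread mutex stands in for the in-kernel lock. In real code, the lookup would run within an RCU read-side critical section and the structure would be freed only after a grace period.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the in-kernel semaphore structure. */
struct sem_obj {
	pthread_mutex_t lock;
	bool deleted;		/* Guarded by ->lock. */
	int value;
};

#define MAX_IDS 4
/* Hypothetical ID-to-structure map, read without any lock. */
static _Atomic(struct sem_obj *) id_map[MAX_IDS];

/* Initialize a structure and make it visible to lookups. */
static void sem_install(int id, struct sem_obj *sp, int value)
{
	assert(id >= 0 && id < MAX_IDS);
	pthread_mutex_init(&sp->lock, NULL);
	sp->deleted = false;
	sp->value = value;
	atomic_store_explicit(&id_map[id], sp, memory_order_release);
}

/*
 * Return the structure for this ID with its lock held, or NULL if the ID
 * is invalid or the structure was deleted after the lookup found it.
 */
static struct sem_obj *sem_lookup_and_lock(int id)
{
	struct sem_obj *sp;

	if (id < 0 || id >= MAX_IDS)
		return NULL;
	sp = atomic_load_explicit(&id_map[id], memory_order_acquire);
	if (!sp)
		return NULL;
	pthread_mutex_lock(&sp->lock);
	if (sp->deleted) {	/* Stale structure: treat the ID as invalid. */
		pthread_mutex_unlock(&sp->lock);
		return NULL;
	}
	return sp;		/* Caller must release ->lock. */
}

/* Deleter: set the flag under the lock, then remove the mapping. */
static void sem_delete(int id)
{
	struct sem_obj *sp = atomic_load_explicit(&id_map[id], memory_order_acquire);

	assert(sp);
	pthread_mutex_lock(&sp->lock);
	sp->deleted = true;
	pthread_mutex_unlock(&sp->lock);
	atomic_store_explicit(&id_map[id], NULL, memory_order_release);
	/* Real code would free sp only after a grace period elapses. */
}
```

The per-structure lock serializes the flag check against deletion, while the mapping itself is traversed without acquiring any lock.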
As indicated by the yellow “read-write” box, RCU can also be useful for read-write
workloads where consistent data is required, although usually in conjunction with a
number of other synchronization primitives. For example, the directory-entry cache in
recent Linux kernels uses RCU in conjunction with sequence locks, per-CPU locks, and
per-data-structure locks to allow lockless traversal of pathnames in the common case.
Although RCU can be very beneficial in this read-write case, the corresponding code is
often more complex than that of the read-mostly cases.
Finally, as indicated by the red box in the lower-left corner of the figure, update-mostly
workloads requiring consistent data are rarely good places to use RCU, though there
are some exceptions [DMS+ 12]. For example, as noted in Section 9.5.4.5, within
By the year 2000, the initiative had passed to open-source projects, most notably the
Linux kernel community [Rus00a, Rus00b, MS01, MAK+ 01, MSA+ 02, ACMS03].20
RCU was accepted into the Linux kernel in late 2002, with many subsequent
improvements for scalability, robustness, real-time response, energy efficiency, and specialized
use cases. As of 2023, Linux-kernel RCU is still under active development.
Quick Quiz 9.73: Why didn’t Kung’s and Lehman’s paper result in immediate use of RCU?
However, in the mid 2010s, there was a welcome upsurge in RCU research and
development across a number of communities and institutions [Kaa15]. Section 9.5.5.1
describes uses of RCU, Section 9.5.5.2 describes RCU implementations (as well as work
that both creates and uses an implementation), and finally, Section 9.5.5.3 describes
verification and validation of RCU and its uses.
Austin Clements, Frans Kaashoek, and Nickolai Zeldovich of MIT created an RCU-
optimized balanced binary tree (Bonsai) [CKZ12], and applied this tree to the Linux
kernel’s VM subsystem in order to reduce read-side contention on the Linux kernel’s
mmap_sem. This work resulted in order-of-magnitude speedups and scalability up to
at least 80 CPUs for a microbenchmark featuring large numbers of minor page faults.
This is similar to a patch developed earlier by Peter Zijlstra [Zij14], and both were
limited by the fact that, at the time, filesystem data structures were not safe for RCU
readers. Clements et al. avoided this limitation by optimizing the page-fault path for
anonymous pages only. More recently, filesystem data structures have been made safe
for RCU readers [Cor10a, Cor11], so perhaps this work can be implemented for all page
types, not just anonymous pages—Peter Zijlstra has, in fact, recently prototyped exactly
this, and Laurent Dufour and Michel Lespinasse have continued work along these lines.
For their part, Matthew Wilcox and Liam Howlett are working towards use of RCU to
enable fine-grained locking of and lockless access to other memory-management data
structures.
Yandong Mao and Robert Morris of MIT and Eddie Kohler of Harvard University
created another RCU-protected tree named Masstree [MKM12] that combines ideas
from B+ trees and tries. Although this tree is about 2.5x slower than an RCU-protected
hash table, it supports operations on key ranges, unlike hash tables. In addition, Masstree
supports efficient storage of objects with long shared key prefixes and, furthermore,
provides persistence via logging to mass storage.
The paper notes that Masstree’s performance rivals that of memcached, even given that
Masstree is persistently storing updates and memcached is not. The paper also compares
Masstree’s performance to the persistent datastores MongoDB, VoltDB, and Redis,
reporting significant performance advantages for Masstree, in some cases exceeding two
orders of magnitude. Another paper [TZK+ 13], by Stephen Tu, Wenting Zheng, Barbara
Liskov, and Samuel Madden of MIT and Kohler, applies Masstree to an in-memory
database named Silo, achieving 700K transactions per second (42M transactions per
minute) on a well-known transaction-processing benchmark. Interestingly enough, Silo
guarantees linearizability without incurring the overhead of grace periods while holding
locks.
Maya Arbel and Hagit Attiya of Technion took a more rigorous approach [AA14]
to an RCU-protected search tree that, like Masstree, allows concurrent updates. This
paper includes a proof of correctness, including proof that all operations on this
tree are linearizable. Unfortunately, this implementation achieves linearizability by
incurring the full latency of grace-period waits while holding locks, which degrades
scalability of update-only workloads. One way around this problem is to abandon
linearizability [HKLP12, McK14d]; however, Arbel and Attiya instead created an RCU
variant that reduces low-end grace-period latency. Of course, nothing comes for free,
and this RCU variant appears to hit a scalability limit at about 32 CPUs. Although
there is much to be said for dropping linearizability, thus gaining both performance
and scalability, it is very good to see academics experimenting with alternative RCU
implementations.
Alexander Matveev (MIT), Nir Shavit (MIT and Tel-Aviv University), Pascal Felber
(University of Neuchâtel), and Patrick Marlier (also University of Neuchâtel) [MSFM15]
produced an RCU-like mechanism that can be thought of as software transactional
memory that explicitly marks read-only transactions. Their use cases require holding
locks across grace periods, which limits scalability [MP15a, MP15b]. This appears to
be the first academic RCU-related work to make good use of the rcutorture test suite,
and also the first to have submitted a performance improvement to Linux-kernel RCU,
which was accepted into v4.4.
Alexander Matveev’s RLU was followed up by MV-RLU from Jaeho Kim et
al. [KMK+ 19]. This work improves scalability over RLU by permitting multiple
concurrent updates, by avoiding holding locks across grace periods, and by using
asynchronous grace periods, for example, call_rcu() instead of synchronize_rcu().
This paper also made some interesting performance-evaluation choices that are discussed
further in Section 17.2.3.3 on page 594.
Adam Belay et al. created an RCU implementation that guards the data structures used
by TCP/IP’s address-resolution protocol (ARP) in their IX operating system [BPP+ 16].
Geoff Romer and Andrew Hunter (both at Google) proposed a cell-based API for
RCU protection of singleton data structures for inclusion in the C++ standard [RH18].
Dimitrios Siakavaras et al. have applied HTM and RCU to search trees [SNGK17,
SBN+ 20], Christina Giannoula et al. have used HTM and RCU to color graphs [GGK18],
and SeongJae Park et al. have used HTM and RCU to optimize high-contention locking
on NUMA systems.
Alex Kogan et al. applied RCU to the construction of range locking for scalable
address spaces [KDI20].
Production uses of RCU are listed in Section 9.6.3.3.
Property                     Reference Counting    Hazard Pointers                Sequence Locks      RCU
Readers                      Slow and unscalable   Fast and scalable              Fast and scalable   Fast and scalable
Memory Overhead              Counter per object    Pointer per reader per object  No protection       None
Duration of Protection       Can be long           Can be long                    No protection       User must bound duration
Need for Traversal Retries   If object deleted     If object deleted              If any update       Never
automatically mutated Linux-kernel RCU’s source code to test the coverage of the
rcutorture test suite. The effort found several holes in this suite’s coverage, one of
which was hiding a real bug (since fixed) in Tiny RCU.
With some luck, all of this validation work will eventually result in more and better
tools for validating concurrent code.
Section 9.6.1 provides a high-level overview and then Section 9.6.2 provides a more
detailed view of the differences between the deferred-processing techniques presented
in this chapter. This discussion assumes a linked data structure that is large enough
that readers do not hold references from one traversal to another, and where elements
might be added to and removed from the structure at any location and at any time.
Section 9.6.3 then points out a few publicly visible production uses of hazard pointers,
sequence locking, and RCU. This discussion should help you to make an informed
choice between these techniques.
The “Duration of Protection” row describes constraints (if any) on how long
a user may protect a given object. Reference counting and hazard pointers
can both protect objects for extended time periods with no untoward side effects, but
maintaining an RCU reference to even one object prevents all other RCU-protected objects
from being freed. RCU readers must therefore be relatively short in order to avoid running the
system out of memory, with special-purpose implementations such as SRCU, Tasks
RCU, and Tasks Trace RCU being exceptions to this rule. Again, sequence locks provide
no pointer-traversal protection, which is why they are normally used on static data.
The “Need for Traversal Retries” row tells whether a new reference to a given object
may be acquired unconditionally, as it can with RCU, or whether the reference acquisition
can fail, resulting in a retry operation, which is the case for reference counting, hazard
pointers, and sequence locks. In the case of reference counting and hazard pointers,
retries are only required if an attempt is made to acquire a reference to a given object while
that object is in the process of being deleted, a topic covered in more detail in the
next section. Sequence locking must of course retry its critical section should it run
concurrently with any update.
Quick Quiz 9.76: But don’t Linux-kernel kref reference counters allow guaranteed
unconditional reference acquisition?
Of course, different rows will have different levels of importance in different situations.
For example, if your current code is having read-side scalability problems with hazard
pointers, then it does not matter that hazard pointers can require retrying reference
acquisition because your current code already handles this. Similarly, if response-time
considerations already limit the duration of reader traversals, as is often the case in
kernels and low-level applications, then it does not matter that RCU has duration-limit
requirements because your code already meets them. In the same vein, if readers must
already write to the objects that they are traversing, the read-side overhead of reference
counters might not be so important. Of course, if the data to be protected is in statically
allocated variables, then sequence locking’s inability to protect pointers is irrelevant.
Finally, there is some work on dynamically switching between hazard pointers and
RCU based on dynamic sampling of delays [BGHZ16]. This defers the choice between
hazard pointers and RCU to runtime, and delegates responsibility for the decision to the
software.
Nevertheless, this table should be of great help when choosing between these
techniques. But those wishing more detail should continue on to the next section.
Of course, as shown in the “Updates and Readers Progress Concurrently” row, this
detection of updates implies that sequence locking does not permit updaters and readers
to make forward progress concurrently. After all, preventing such forward progress is
the whole point of using sequence locking in the first place! This situation points the
way to using sequence locking in conjunction with reference counting, hazard pointers,
or RCU in order to provide both existence guarantees and update detection. In fact, the
Linux kernel combines RCU and sequence locking in this manner during pathname
lookup.
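The update detection described above can be illustrated by a small userspace sequence lock built on C11 atomics. This is a sketch under stated assumptions rather than the Linux kernel's seqlock implementation: struct seq_pair and the seq_*() functions are hypothetical names, and seq_write() assumes that the caller already excludes other updaters.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical pair of values protected by a sequence counter. */
struct seq_pair {
	atomic_uint seq;	/* Odd while an update is in progress. */
	atomic_int a;
	atomic_int b;
};

/* Begin a read attempt by taking a snapshot of the sequence counter. */
static unsigned seq_read_begin(struct seq_pair *sp)
{
	return atomic_load_explicit(&sp->seq, memory_order_acquire);
}

/* End a read attempt: nonzero means an update interfered, so retry. */
static int seq_read_retry(struct seq_pair *sp, unsigned start)
{
	atomic_thread_fence(memory_order_acquire);
	return (start & 1) ||
	       atomic_load_explicit(&sp->seq, memory_order_relaxed) != start;
}

/* Updater: the caller must ensure that there is only one at a time. */
static void seq_write(struct seq_pair *sp, int a, int b)
{
	unsigned s = atomic_load_explicit(&sp->seq, memory_order_relaxed);

	atomic_store_explicit(&sp->seq, s + 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_release);
	atomic_store_explicit(&sp->a, a, memory_order_relaxed);
	atomic_store_explicit(&sp->b, b, memory_order_relaxed);
	atomic_store_explicit(&sp->seq, s + 2, memory_order_release);
}

/* Reader: repeat the critical section until no update is detected. */
static void seq_read(struct seq_pair *sp, int *a, int *b)
{
	unsigned start;

	do {
		start = seq_read_begin(sp);
		*a = atomic_load_explicit(&sp->a, memory_order_relaxed);
		*b = atomic_load_explicit(&sp->b, memory_order_relaxed);
	} while (seq_read_retry(sp, start));
}
```

Note that the reader never prevents the updater from running. It merely discards and repeats any read attempt that overlapped an update, which is why this mechanism by itself offers no existence guarantee for pointed-to data.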
The “Contention Among Readers”, “Reader Per-Critical-Section Overhead”, and
“Reader Per-Object Traversal Overhead” rows give a rough sense of the read-side
overhead of these techniques. The overhead of reference counting can be quite large,
with contention among readers along with a fully ordered read-modify-write atomic
operation required for each and every object traversed. Hazard pointers incur the
overhead of a memory barrier for each data element traversed, and sequence locks
incur the overhead of a pair of memory barriers for each attempt to execute the critical
section. The overhead of RCU implementations varies from nothing to that of a pair
of memory barriers for each read-side critical section, thus providing RCU with the
best performance, particularly for read-side critical sections that traverse many data
elements. Of course, the read-side overhead of all deferred-processing variants can be
reduced by batching, so that each read-side operation covers more data.
Quick Quiz 9.77: But didn’t the answer to one of the quick quizzes in Section 9.3 say that
pairwise asymmetric barriers could eliminate the read-side smp_mb() from hazard pointers?
The “Reader Forward Progress Guarantee” row shows that only RCU has a bounded
wait-free forward-progress guarantee, which means that it can carry out a finite traversal
by executing a bounded number of instructions.
The “Reader Reference Acquisition” row indicates that only RCU is capable of
unconditionally acquiring references. The entry for sequence locks is “Unsafe” because,
again, sequence locks detect updates rather than acquiring references. Reference
counting and hazard pointers both require that traversals be restarted from the beginning
if a given acquisition fails. To see this, consider a linked list containing objects A, B, C,
and D, in that order, and the following series of events:
2. An updater removes object B, but refrains from freeing it because the reader holds
a reference. The list now contains objects A, C, and D, and object B’s ->next
pointer is set to HAZPTR_POISON.
3. The updater removes object C, so that the list now contains objects A and D.
Because there is no reference to object C, it is immediately freed.
4. The reader tries to advance to the successor of the object following the now-removed
object B, but the poisoned ->next pointer prevents this. Which is a good thing,
because object B’s ->next pointer would otherwise point to the freelist.
5. The reader must therefore restart its traversal from the head of the list.
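The poison check in the above sequence of events can be sketched as follows. This fragment shows only the removal-with-poisoning and restart-detection steps, and deliberately omits the publication and rechecking of the hazard pointers themselves. The names struct node, remove_after(), and advance() are hypothetical, as is the numeric value chosen for HAZPTR_POISON.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

struct node {
	int key;
	_Atomic(struct node *) next;
};

/* Hypothetical poison value, assumed never to be a valid node address. */
#define HAZPTR_POISON ((struct node *)(uintptr_t)0x8)

/* Updater: unlink the successor of prev and poison its ->next pointer. */
static struct node *remove_after(struct node *prev)
{
	struct node *victim = atomic_load(&prev->next);

	assert(victim && victim != HAZPTR_POISON);
	atomic_store(&prev->next, atomic_load(&victim->next));
	atomic_store(&victim->next, HAZPTR_POISON);
	return victim;	/* To be freed only once no reader references it. */
}

/*
 * Reader step: advance from cur to its successor.  Returns 1 and sets
 * *nextp on success.  Returns 0 if cur has been removed, in which case
 * the reader must restart its traversal from the head of the list.
 */
static int advance(struct node *cur, struct node **nextp)
{
	struct node *n = atomic_load(&cur->next);

	if (n == HAZPTR_POISON)
		return 0;
	*nextp = n;
	return 1;
}
```

Without the poisoning, a reader still referencing the removed object B would follow its stale ->next pointer to object C, which might already have been freed.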
the only blocking operation is a wait to free memory, which results in a situation that,
for many purposes, is as good as non-blocking [DMS+ 12].
As shown in the “Automatic Reclamation” row, only reference counting can automate
freeing of memory, and even then only for non-cyclic data structures. Certain use
cases for hazard pointers and RCU can provide automatic reclamation using link counts,
which can be thought of as reference counts, but applying only to incoming links from
other parts of the data structure [Mic18].
Finally, the “Lines of Code” row shows the size of the Pre-BSD Routing Table
implementations, giving a rough idea of relative ease of use. That said, it is important to
note that the reference-counting and sequence-locking implementations are buggy, and
that a correct reference-counting implementation is considerably more complex [Val95,
MS95]. For its part, a correct sequence-locking implementation requires the addition of
some other synchronization mechanism, for example, hazard pointers or RCU, so that
sequence locking detects concurrent updates and the other mechanism provides safe
reference acquisition.
As more experience is gained using these techniques, both separately and in
combination, the rules of thumb laid out in this section will need to be refined. However, this
section does reflect the current state of the art.
21 Kudos to Mathias Stearn, Matt Wilson, David Goldblatt, LiveJournal user fanf, Nadav
Har’El, Avi Kivity, Dmitry Vyukov, Raul Guitterez S., Twitter user @peo3, Paolo Bonzini,
and Thomas Monjalon for locating a great many of these use cases.
In 2011, Samy Al Bahra added sequence locking to the Concurrency Kit
library [Bah11c].
Paolo Bonzini added a simple sequence-lock to the QEMU emulator in 2013 [Bon13].
Alexis Menard abstracted a sequence-lock implementation in Chromium in
2016 [Men16].
A simple sequence locking implementation was added to jemalloc() in
2018 [Gol18a]. The eigen library also has a special-purpose queue that is managed by a
mechanism resembling sequence locking.
(ReadMostly)”.
as well as showing how best to apply each of them. And with that, we have uncovered
the last of the mysteries put forth on page 201.
The next section discusses updates, a ticklish issue for many of the read-mostly
mechanisms described in this chapter.
The deferred-processing techniques called out in this chapter are most directly applicable
to read-mostly situations, which begs the question “But what about updates?” After all,
increasing the performance and scalability of readers is all well and good, but it is only
natural to also want great performance and scalability for writers.
We have already seen one situation featuring high performance and scalability
for writers, namely the counting algorithms surveyed in Chapter 5. These algorithms
featured partially partitioned data structures so that updates can operate locally, while the
more-expensive reads must sum across the entire data structure. Silas Boyd-Wickizer
has generalized this notion to produce OpLog, which he has applied to Linux-kernel
pathname lookup, VM reverse mappings, and the stat() system call [BW14].
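The partitioned-counter idea recalled above can be illustrated as follows. This is a simplified sketch rather than one of the Chapter 5 algorithms verbatim: NR_THREADS, struct thread_count, inc_count(), and read_count() are hypothetical names, thread IDs are assumed to be small integers assigned by the caller, and the 64-byte alignment is an assumed cache-line size.

```c
#include <assert.h>
#include <stdatomic.h>

#define NR_THREADS 4	/* Hypothetical fixed number of threads. */

/* One counter per thread, aligned so that each occupies its own 64 bytes. */
struct thread_count {
	_Alignas(64) atomic_ulong count;
};

static struct thread_count counters[NR_THREADS];

/*
 * Update: only thread tid ever writes counters[tid], so a plain load
 * followed by a store suffices, with no atomic read-modify-write.
 */
static void inc_count(int tid)
{
	unsigned long v;

	assert(tid >= 0 && tid < NR_THREADS);
	v = atomic_load_explicit(&counters[tid].count, memory_order_relaxed);
	atomic_store_explicit(&counters[tid].count, v + 1, memory_order_relaxed);
}

/* Read: the more-expensive side, which sums across all threads' counters. */
static unsigned long read_count(void)
{
	unsigned long sum = 0;
	int i;

	for (i = 0; i < NR_THREADS; i++)
		sum += atomic_load_explicit(&counters[i].count, memory_order_relaxed);
	return sum;
}
```

Updates touch only thread-local memory and therefore scale well, whereas each read must visit every thread's counter.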
Another approach, called “Disruptor”, is designed for applications that process
high-volume streams of input data. The approach is to rely on single-producer-single-
consumer FIFO queues, minimizing the need for synchronization [Sut13]. For Java
applications, Disruptor also has the virtue of minimizing use of the garbage collector.
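A single-producer-single-consumer FIFO of the general kind mentioned above can be sketched with C11 atomics as shown below. This is not Disruptor's actual implementation: struct spsc_ring, spsc_push(), spsc_pop(), and RING_SIZE are hypothetical, and the sketch assumes exactly one producer thread and exactly one consumer thread.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define RING_SIZE 8	/* Hypothetical capacity, must be a power of two. */

struct spsc_ring {
	atomic_size_t head;	/* Written only by the consumer. */
	atomic_size_t tail;	/* Written only by the producer. */
	int slot[RING_SIZE];
};

/* Producer side: returns false if the ring is full. */
static bool spsc_push(struct spsc_ring *r, int v)
{
	size_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
	size_t head = atomic_load_explicit(&r->head, memory_order_acquire);

	if (tail - head == RING_SIZE)
		return false;
	r->slot[tail & (RING_SIZE - 1)] = v;
	/* Release: the slot's contents become visible before the new tail. */
	atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
	return true;
}

/* Consumer side: returns false if the ring is empty. */
static bool spsc_pop(struct spsc_ring *r, int *v)
{
	size_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
	size_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);

	if (head == tail)
		return false;
	*v = r->slot[head & (RING_SIZE - 1)];
	/* Release: the slot has been read before it is handed back. */
	atomic_store_explicit(&r->head, head + 1, memory_order_release);
	return true;
}
```

Because each index has a single writer, neither side needs a lock or an atomic read-modify-write operation, which is the sense in which such queues minimize synchronization.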
And of course, where feasible, fully partitioned or “sharded” systems provide excellent
performance and scalability, as noted in Chapter 6.
The next chapter will look at updates in the context of several types of data structures.
Bad programmers worry about the code. Good
programmers worry about data structures and their
relationships.
Linus Torvalds
Chapter 10
Data Structures
1. Data structures designed in full accordance with the good advice given in Chapter 6
can nonetheless abjectly fail to scale on some types of systems.
2. Data structures designed in full accordance with the good advice given in both
Chapter 6 and Chapter 9 can still abjectly fail to scale on some types of systems.
Section 10.1 presents the motivating application for this chapter’s data structures.
Chapter 6 showed how partitioning improves scalability, so Section 10.2 discusses
partitionable data structures. Chapter 9 described how deferring some actions can
greatly improve both performance and scalability, a topic taken up by Section 10.3.
Section 10.4 looks at a non-partitionable data structure, splitting it into read-mostly and
partitionable portions, which improves both performance and scalability. Because this
chapter cannot delve into the details of every concurrent data structure, Section 10.5
surveys a few of the important ones. Although the best performance and scalability
result from design rather than after-the-fact micro-optimization, micro-optimization is
nevertheless necessary for the absolute best possible performance and scalability, as
described in Section 10.6. Finally, Section 10.7 presents a summary of this chapter.
There are a huge number of data structures in use today, so much so that there are
multiple textbooks covering them. This section focuses on a single data structure,
namely the hash table. This focused approach allows a much deeper investigation of
how concurrency interacts with data structures, and also focuses on a data structure that
is heavily used in practice. Section 10.2.1 overviews the design, and Section 10.2.2
presents the implementation. Finally, Section 10.2.3 discusses the resulting performance
and scalability.
struct hashtab
    ->ht_nbuckets = 4
    ->ht_cmp
    ->ht_bkt[0]: ->htb_head, ->htb_lock; chain of two struct ht_elem (each with ->hte_next and ->hte_hash)
    ->ht_bkt[1]: ->htb_head, ->htb_lock; empty chain
    ->ht_bkt[2]: ->htb_head, ->htb_lock; chain of one struct ht_elem (with ->hte_next and ->hte_hash)
    ->ht_bkt[3]: ->htb_head, ->htb_lock; empty chain
In addition, each bucket has its own lock, so that elements in different buckets of the
hash table may be added, deleted, and looked up completely independently. A large
hash table with a large number of buckets (and thus locks), with each bucket containing
a small number of elements should therefore provide excellent scalability.
This macro uses a simple modulus: If more aggressive hashing is required, the caller
needs to implement it when mapping from key to hash value. The remaining two
functions acquire and release the ->htb_lock corresponding to the specified hash
value.
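The bucket mapping and the per-bucket locking functions might be sketched as follows. The structure and field names follow the diagram above, but the macro name HASH2BKT(), the hashtab_alloc() helper, hashtab_unlock(), and the use of pthread mutexes are assumptions made for this illustration rather than a reproduction of the listing being described.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

struct ht_elem {
	struct ht_elem *hte_next;
	unsigned long hte_hash;
};

struct ht_bucket {
	struct ht_elem *htb_head;
	pthread_mutex_t htb_lock;	/* One lock per bucket. */
};

struct hashtab {
	unsigned long ht_nbuckets;
	int (*ht_cmp)(struct ht_elem *htep, void *key);
	struct ht_bucket ht_bkt[];	/* Flexible array of buckets. */
};

/* Map a hash value to its bucket using a simple modulus. */
#define HASH2BKT(htp, h) (&(htp)->ht_bkt[(h) % (htp)->ht_nbuckets])

/* Hypothetical helper: allocate a table with nbuckets empty buckets. */
static struct hashtab *hashtab_alloc(unsigned long nbuckets,
				     int (*cmp)(struct ht_elem *htep, void *key))
{
	struct hashtab *htp;
	unsigned long i;

	assert(nbuckets > 0);
	htp = malloc(sizeof(*htp) + nbuckets * sizeof(htp->ht_bkt[0]));
	if (!htp)
		return NULL;
	htp->ht_nbuckets = nbuckets;
	htp->ht_cmp = cmp;
	for (i = 0; i < nbuckets; i++) {
		htp->ht_bkt[i].htb_head = NULL;
		pthread_mutex_init(&htp->ht_bkt[i].htb_lock, NULL);
	}
	return htp;
}

/* Acquire the lock of the bucket corresponding to the specified hash. */
static void hashtab_lock(struct hashtab *htp, unsigned long hash)
{
	pthread_mutex_lock(&HASH2BKT(htp, hash)->htb_lock);
}

/* Release the lock of the bucket corresponding to the specified hash. */
static void hashtab_unlock(struct hashtab *htp, unsigned long hash)
{
	pthread_mutex_unlock(&HASH2BKT(htp, hash)->htb_lock);
}
```

Because each bucket has its own lock, threads holding the locks for hash values that map to different buckets do not block one another.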
Listing 10.3 shows hashtab_lookup(), which returns a pointer to the element with
the specified hash and key if it exists, or NULL otherwise. This function takes both a hash
value and a pointer to the key because this allows users of this function to use arbitrary
keys and arbitrary hash functions. Line 8 maps from the hash value to a pointer to the
corresponding hash bucket. Each pass through the loop spanning lines 9–14 examines
one element of the bucket’s hash chain. Line 10 checks to see if the hash values match,
and if not, line 11 proceeds to the next element. Line 12 checks to see if the actual key
matches, and if so, line 13 returns a pointer to the matching element. If no element
matches, line 15 returns NULL.
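The lookup logic just described can be sketched as follows. This is an approximation for illustration only, so its lines do not correspond to the line numbers of Listing 10.3. It uses a singly linked chain for brevity, omits the per-bucket locks (the caller is assumed to hold the bucket's lock), and the struct zoo_he element type and zoo_cmp() comparison function are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct ht_elem {
	struct ht_elem *hte_next;
	unsigned long hte_hash;
};

struct ht_bucket {
	struct ht_elem *htb_head;
};

struct hashtab {
	unsigned long ht_nbuckets;
	int (*ht_cmp)(struct ht_elem *htep, void *key);
	struct ht_bucket *ht_bkt;
};

/* Return the element with the specified hash and key, or NULL if none. */
static struct ht_elem *hashtab_lookup(struct hashtab *htp,
				      unsigned long hash, void *key)
{
	struct ht_bucket *htb;
	struct ht_elem *htep;

	assert(htp->ht_nbuckets > 0);
	htb = &htp->ht_bkt[hash % htp->ht_nbuckets];
	for (htep = htb->htb_head; htep; htep = htep->hte_next) {
		if (htep->hte_hash != hash)
			continue;	/* Cheap hash comparison first. */
		if (htp->ht_cmp(htep, key))
			return htep;	/* Full key comparison matched. */
	}
	return NULL;
}

/* Hypothetical user element with a string key, ht_elem embedded first. */
struct zoo_he {
	struct ht_elem he;
	const char *name;
};

/* Hypothetical key comparison: nonzero if the element's name matches. */
static int zoo_cmp(struct ht_elem *htep, void *key)
{
	return strcmp(((struct zoo_he *)htep)->name, key) == 0;
}
```

Passing both the hash value and a key pointer is what allows the caller to choose an arbitrary key type and hash function, with ->ht_cmp() encapsulating the key comparison.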
Quick Quiz 10.2: But isn’t the double comparison on lines 10–13 in Listing 10.3 inefficient in
the case where the key fits into an unsigned long?
Listing 10.4 shows the hashtab_add() and hashtab_del() functions that add and
delete elements from the hash table, respectively.
The hashtab_add() function simply sets the element’s hash value on line 4, then
adds it to the corresponding bucket on lines 5 and 6. The hashtab_del() function
simply removes the specified element from whatever hash chain it is on, courtesy of the
doubly linked nature of the hash-chain lists. Before calling either of these two functions,
the caller is required to ensure that no other thread is accessing or modifying this same
bucket, for example, by invoking hashtab_lock() beforehand.
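The addition and deletion operations described above might be sketched as follows. This is an illustration under stated assumptions and not the listing itself: the ->hte_pprev back-link is a hypothetical way of obtaining the doubly linked property, locking is omitted because the caller is assumed to have already excluded other accesses to the bucket, and deferred freeing of deleted elements is not shown.

```c
#include <assert.h>
#include <stddef.h>

struct ht_elem {
	struct ht_elem *hte_next;
	struct ht_elem **hte_pprev;	/* Hypothetical: address of the pointer to this element. */
	unsigned long hte_hash;
};

struct ht_bucket {
	struct ht_elem *htb_head;
};

struct hashtab {
	unsigned long ht_nbuckets;
	struct ht_bucket *ht_bkt;
};

/* Record the hash value, then add the element at the head of its bucket. */
static void hashtab_add(struct hashtab *htp, unsigned long hash,
			struct ht_elem *htep)
{
	struct ht_bucket *htb = &htp->ht_bkt[hash % htp->ht_nbuckets];

	htep->hte_hash = hash;
	htep->hte_next = htb->htb_head;
	htep->hte_pprev = &htb->htb_head;
	if (htb->htb_head)
		htb->htb_head->hte_pprev = &htep->hte_next;
	htb->htb_head = htep;
}

/*
 * Remove the element from whatever chain it is on.  The back-link makes
 * this possible without knowing the bucket or searching the chain.
 */
static void hashtab_del(struct ht_elem *htep)
{
	assert(htep->hte_pprev);
	*htep->hte_pprev = htep->hte_next;
	if (htep->hte_next)
		htep->hte_next->hte_pprev = htep->hte_pprev;
	htep->hte_next = NULL;
	htep->hte_pprev = NULL;
}
```

Note that hashtab_del() needs neither the table pointer nor the hash value, which is the practical benefit of the doubly linked chains.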
[Plot: traces labeled "ideal" and "bucket" versus number of CPUs (threads), 5 to 25 CPUs.]
[Plot: total lookups per millisecond versus number of CPUs (threads), 0 to 450 CPUs.]
Figure 10.3: Read-Only Hash-Table Performance For Schrödinger’s Zoo, 448 CPUs
too clearly worse than abysmal. This clearly underscores the dangers of extrapolating
performance from a modest number of CPUs.
Of course, one possible reason for the collapse in performance might be that more
hash buckets are needed. We can test this by increasing the number of hash buckets.
Quick Quiz 10.3: Instead of simply increasing the number of hash buckets, wouldn’t it be
better to cache-align the existing hash buckets?
However, as can be seen in Figure 10.4, changing the number of buckets has almost no
effect: Scalability is still abysmal. In particular, we still see a sharp dropoff at 29 CPUs
and beyond, clearly demonstrating the complication put forward on page 289. And just
as clearly, something else is going on.
The problem is that this is a multi-socket system, with CPUs 0–27 and 224–251
mapped to the first socket as shown in Figure 10.5. Test runs confined to the first
28 CPUs therefore perform quite well, but tests that involve socket 0’s CPUs 0–27 as
well as socket 1’s CPU 28 incur the overhead of passing data across socket boundaries.
This can severely degrade performance, as was discussed in Section 3.2.1. In short, large
multi-socket systems require good locality of reference in addition to full partitioning.
The remainder of this chapter will discuss ways of providing good locality of reference
within the hash table itself, but in the meantime please note that one other way to provide
good locality of reference would be to place large data elements in the hash table. For
example, Schrödinger might attain excellent cache locality by placing photographs or
even videos of his animals in each element of the hash table. But for those needing hash
tables containing small data elements, please read on!
Quick Quiz 10.4: Given the negative scalability of the Schrödinger’s Zoo application across
sockets, why not just run multiple copies of the application, with each copy having a subset of
the animals and confined to run on a single socket?
One key property of the Schrödinger’s-zoo runs discussed thus far is that they are all
read-only. This makes the performance degradation due to lock-acquisition-induced
cache misses all the more painful. Even though we are not updating the underlying hash
table itself, we are still paying the price for writing to memory. Of course, if the hash
table was never going to be updated, we could dispense entirely with mutual exclusion.
This approach is quite straightforward and is left as an exercise for the reader. But
even with the occasional update, avoiding writes avoids cache misses, and allows the
read-mostly data to be replicated across all the caches, which in turn promotes locality
of reference.
The next section therefore examines optimizations that can be carried out in read-
mostly cases where updates are rare, but could happen at any time.
Although partitioned data structures can offer excellent scalability, NUMA effects can
result in severe degradations of both performance and scalability. In addition, the
need for read-side synchronization can degrade performance in read-mostly situations.
However, we can achieve both performance and scalability by using RCU, which
was introduced in Section 9.5. Similar results can be achieved using hazard pointers
(hazptr.c) [Mic04a], which will be included in the performance results shown in this
section [McK13].

          Hyperthread
Socket    0          1
0         0–27       224–251
1         28–55      252–279
2         56–83      280–307
3         84–111     308–335
4         112–139    336–363
5         140–167    364–391
6         168–195    392–419
7         196–223    420–447
Quick Quiz 10.5: But if elements in a hash table can be removed concurrently with lookups,
doesn’t that mean that a lookup could return a reference to a data element that was removed
immediately after it was looked up?
Listing 10.8 shows hashtab_add() and hashtab_del(), both of which are quite
similar to their counterparts in the non-RCU hash table shown in Listing 10.4. The
hashtab_add() function uses cds_list_add_rcu() instead of cds_list_add()
in order to ensure proper ordering when an element is added to the hash table at the
same time that it is being looked up. The hashtab_del() function uses
cds_list_del_rcu() instead of cds_list_del_init() to allow for the case where an element
is looked up just before it is deleted. Unlike cds_list_del_init(), cds_list_del_rcu()
leaves the forward pointer intact, so that hashtab_lookup() can traverse to the
newly deleted element’s successor.
Of course, after invoking hashtab_del(), the caller must wait for an RCU grace
period (e.g., by invoking synchronize_rcu()) before freeing or otherwise reusing
the memory for the newly deleted element.
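The two properties just described (ordered publication on insertion, and a forward pointer left intact on deletion) can be illustrated with a minimal sketch using C11 atomics on a singly linked list. This is not the userspace-RCU implementation: struct node, list_add_pub(), list_del_keep_next(), and list_find() are hypothetical names chosen for illustration, and the grace-period wait is indicated only by a comment.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical singly linked element; not the cds_list_head type. */
struct node {
	int key;
	_Atomic(struct node *) next;
};

/* Insert n after prev, publishing it only after it is fully initialized. */
static void list_add_pub(struct node *prev, struct node *n)
{
	struct node *succ = atomic_load_explicit(&prev->next,
						 memory_order_relaxed);

	atomic_store_explicit(&n->next, succ, memory_order_relaxed);
	/* Release store: a reader that sees n also sees its initialized fields. */
	atomic_store_explicit(&prev->next, n, memory_order_release);
}

/* Unlink victim, which must be prev's successor, leaving victim->next intact. */
static void list_del_keep_next(struct node *prev, struct node *victim)
{
	struct node *succ = atomic_load_explicit(&victim->next,
						 memory_order_relaxed);

	atomic_store_explicit(&prev->next, succ, memory_order_release);
	/*
	 * victim->next is deliberately not cleared: a reader already
	 * referencing victim can still advance to succ.  The caller must
	 * wait for a grace period before freeing or reusing victim.
	 */
}

/* Reader-side search, starting just past the dummy list head. */
static struct node *list_find(struct node *head, int key)
{
	struct node *p = atomic_load_explicit(&head->next,
					      memory_order_acquire);

	while (p != NULL && p->key != key)
		p = atomic_load_explicit(&p->next, memory_order_acquire);
	return p;
}
```

In this sketch, a reader that fetched a pointer to the deleted element just before the unlink can still reach the element's successor, which is why the memory must not be reused until all such readers have finished.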
In addition, concurrent test runs run lookups concurrently with updates in order to
catch all manner of data-structure corruption problems. Some runs also continually
resize the hash table concurrently with both lookups and updates to verify correct
behavior, and also to verify that resizes do not unduly delay either readers or updaters.
Finally, the concurrent tests output statistics that can be used to track down performance
and scalability issues and that provide the raw data used by Section 10.3.3.
Quick Quiz 10.6: The hashtorture.h file contains more than 1,000 lines! Is that a
comprehensive test or what???
All code requires significant validation effort, and high-performance concurrent code
requires more validation than most.
As you can see, both RCU and hazard pointers perform and scale much better than
per-bucket locking because read-only replication avoids NUMA effects. The difference
increases with larger numbers of threads. Results from a globally locked implementation
are also shown, and as expected the results are even worse than those of the per-bucket-
locked implementation. RCU does slightly better than hazard pointers.
Figure 10.7 shows the same data on a linear scale. This drops the global-locking
trace into the x-axis, but allows the non-ideal performance of RCU and hazard pointers
to be more readily discerned. Both show a change in slope at 224 CPUs, and this is due
to hardware multithreading. At 224 and fewer CPUs, each thread has a core to itself. In
this regime, RCU does better than does hazard pointers because the latter’s read-side
memory barriers result in dead time within the core. In short, RCU is better able to
utilize a core from a single hardware thread than is hazard pointers.
This situation changes above 224 CPUs. Because RCU is using more than half of
each core’s resources from a single hardware thread, RCU gains relatively little benefit
from the second hardware thread in each core. The slope of the hazard-pointers trace
also decreases at 224 CPUs, but less dramatically, because the second hardware thread
is able to fill in the time that the first hardware thread is stalled due to memory-barrier
latency. As we will see in later sections, this second-hardware-thread advantage depends
on the workload.
But why is RCU’s performance a factor of five less than ideal? One possibility is that
the per-thread counters manipulated by rcu_read_lock() and rcu_read_unlock()
are slowing things down. Figure 10.8 therefore adds the results for the QSBR variant
of RCU, whose read-side primitives do nothing. And although QSBR does perform
slightly better than does RCU, it is still about a factor of five short of ideal.
Figure 10.9 adds completely unsynchronized results, which works because this is
a read-only benchmark with nothing to synchronize. Even with no synchronization
whatsoever, performance still falls far short of ideal, thus demonstrating two more
complications on page 289.
The problem is that this system has sockets with 28 cores, which have the modest
cache sizes shown in Table 3.2 on page 37. Each hash bucket (struct ht_bucket)
occupies 56 bytes and each element (struct zoo_he) occupies 72 bytes for the RCU
and QSBR runs. The benchmark generating Figure 10.9 used 262,144 buckets and
up to 262,144 elements, for a total of 33,554,432 bytes, which not only overflows the
1,048,576-byte L2 caches by more than a factor of thirty, but is also uncomfortably
close to the L3 cache size of 40,370,176 bytes, especially given that this cache has
only 11 ways. This means that L2 cache collisions will be the rule and also that L3
cache collisions will not be uncommon, so that the resulting cache misses will degrade
performance. In this case, the bottleneck is not in the CPU, but rather in the hardware
memory system.
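The footprint arithmetic above can be checked with a short sketch. The per-bucket and per-element sizes are those quoted in the text for the RCU and QSBR runs; the function names ht_footprint() and cache_multiple() are hypothetical helpers introduced only for this illustration.

```c
#include <stdint.h>

/* Sizes quoted in the text for the RCU and QSBR runs. */
enum {
	BUCKET_BYTES = 56,	/* struct ht_bucket */
	ELEM_BYTES = 72,	/* struct zoo_he */
};

/* Total bytes occupied by the bucket array plus the elements. */
static uint64_t ht_footprint(uint64_t nbuckets, uint64_t nelems)
{
	return nbuckets * BUCKET_BYTES + nelems * ELEM_BYTES;
}

/* Whole number of times the footprint fills a cache of the given size. */
static uint64_t cache_multiple(uint64_t footprint, uint64_t cache_bytes)
{
	return footprint / cache_bytes;
}
```

With 262,144 buckets and 262,144 elements, ht_footprint() yields 262,144 × 128 bytes, which is 32 times the 1,048,576-byte L2 cache size yet still below the 40,370,176-byte L3 cache size.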
Additional evidence for this memory-system bottleneck may be found by examining
the unsynchronized code. This code does not need locks, so each hash bucket occupies
only 16 bytes compared to the 56 bytes for RCU and QSBR. Similarly, each hash-table
element occupies only 56 bytes compared to the 72 bytes for RCU and QSBR. So it is
unsurprising that the single-CPU unsynchronized run performs up to about half again
faster than that of either QSBR or RCU.
Quick Quiz 10.7: How can we be so sure that the hash-table size is at fault here, especially
given that Figure 10.4 on page 295 shows that varying hash-table size has almost no effect?
Might the problem instead be something like false sharing?
What if the memory footprint is reduced still further? Figure E.5 on page 788 shows
that RCU attains very nearly ideal performance on the much smaller data structure
represented by the pre-BSD routing table.
Quick Quiz 10.8: The memory system is a serious bottleneck on this big system. Why bother
putting 448 CPUs on a system without giving them enough memory bandwidth to do something
useful???
As noted earlier, Schrödinger is surprised by the popularity of his cat [Sch35], but
recognizes the need to reflect this popularity in his design. Figure 10.10 shows the
results of 64-CPU runs, varying the number of CPUs that are doing nothing but looking
up the cat. Both RCU and hazard pointers respond well to this challenge, but bucket
locking scales negatively, eventually performing as badly as global locking. This should
not be a surprise because if all CPUs are doing nothing but looking up the cat, the lock
corresponding to the cat’s bucket is for all intents and purposes a global lock.
This cat-only benchmark illustrates one potential problem with fully partitioned
sharding approaches. Only the CPUs associated with the cat’s partition are able to access
the cat, limiting the cat-only throughput. Of course, a great many applications have
good load-spreading properties, and for these applications sharding works quite well.
However, sharding does not handle “hot spots” very well, with the hot spot exemplified
by Schrödinger’s cat being but one case in point.
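The hot-spot effect can be made concrete with a small sketch that counts how many distinct per-bucket locks a stream of lookups would touch. The table size NBUCKETS, the modulo hash in bucket_of(), and the helper locks_touched() are all assumptions made for illustration, since the text does not specify the zoo's hash function.

```c
#include <stddef.h>

#define NBUCKETS 1024	/* hypothetical table size for illustration */

/* Hypothetical hash: simple modulo, standing in for the real hash function. */
static unsigned long bucket_of(unsigned long key)
{
	return key % NBUCKETS;
}

/* Count the distinct per-bucket locks that a stream of lookups would touch. */
static size_t locks_touched(const unsigned long *keys, size_t n)
{
	static unsigned char seen[NBUCKETS];
	size_t i, count = 0;

	for (i = 0; i < NBUCKETS; i++)
		seen[i] = 0;
	for (i = 0; i < n; i++) {
		unsigned long b = bucket_of(keys[i]);

		if (!seen[b]) {
			seen[b] = 1;
			count++;
		}
	}
	return count;
}
```

A cat-only workload touches exactly one lock no matter how many CPUs issue lookups, so per-bucket locking degenerates to global locking, whereas a well-spread workload touches many locks.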
If we were only ever going to read the data, we would not need any concurrency
control to begin with. Figure 10.11 therefore shows the effect of updates on readers.
At the extreme left-hand side of this graph, all but one of the CPUs are doing lookups,
while to the right all 448 CPUs are doing updates. For all four implementations, the
number of lookups per millisecond decreases as the number of updating CPUs increases,
of course reaching zero lookups per millisecond when all 448 CPUs are updating. Both
hazard pointers and RCU do well compared to per-bucket locking because their readers
do not increase update-side lock contention. RCU does well relative to hazard pointers
as the number of updaters increases due to the latter’s read-side memory barriers, which
incur greater overhead, especially in the presence of updates, and particularly when
execution involves more than one socket. It therefore seems likely that modern hardware
heavily optimizes memory-barrier execution, greatly reducing memory-barrier overhead
in the read-only case.
Where Figure 10.11 showed the effect of increasing update rates on lookups,
Figure 10.12 shows the effect of increasing update rates on the updates themselves. Again,
at the left-hand side of the figure all but one of the CPUs are doing lookups and at the
right-hand side of the figure all 448 CPUs are doing updates. Hazard pointers and RCU
start off with a significant advantage because, unlike bucket locking, readers do not
exclude updaters. However, as the number of updating CPUs increases, update-side
overhead starts to make its presence known, first for RCU and then for hazard pointers.
Of course, all three of these implementations beat global locking.
It is quite possible that the differences in lookup performance observed in Figure 10.11
are affected by the differences in update rates. One way to check this is to artificially
throttle the update rates of per-bucket locking and hazard pointers to match that of
RCU. Doing so does not significantly improve the lookup performance of per-bucket
locking, nor does it close the gap between hazard pointers and RCU. However,
removing the read-side memory barriers from hazard pointers (thus resulting in an
unsafe implementation) does nearly close the gap between hazard pointers and RCU.
Although this unsafe hazard-pointer implementation will usually be reliable enough for
benchmarking purposes, it is absolutely not recommended for production use.
Quick Quiz 10.9: The dangers of extrapolating from 28 CPUs to 448 CPUs were made quite
clear in Section 10.2.3. Would extrapolating up from 448 CPUs be any safer?
And this situation exposes yet another of the complications listed on page 289.
Heisenberg taught us to live with this sort of uncertainty [Hei27], which is a good
thing because computing hardware and software act similarly. For example, how do
you know that a piece of computing hardware has failed? Often because it does not
respond in a timely fashion. Just like the cat’s heartbeat, this results in a window of
uncertainty as to whether or not the hardware has really failed, as opposed to just being
slow.
Furthermore, most computing systems are intended to interact with the outside world.
Consistency with the outside world is therefore of paramount importance. However,
as we saw in Figure 9.28 on page 266, increased internal consistency can come at the
expense of degraded external consistency. Techniques such as RCU and hazard pointers
give up some degree of internal consistency to attain improved external consistency.
In short, internal consistency is not necessarily a natural part of all problem domains,
and often incurs great expense in terms of performance, scalability, consistency with
the outside world [HKLP12, HHK+13, Rin13], or all of the above.
Fixed-size hash tables are perfectly partitionable, but resizable hash tables pose
partitioning challenges when growing or shrinking, as fancifully depicted in Figure 10.14.
However, it turns out that it is possible to construct high-performance scalable
RCU-protected hash tables, as described in the following sections.
The key insight behind the first hash-table implementation is that each data element
can have two sets of list pointers, with one set currently being used by RCU readers (as
well as by non-RCU updaters) and the other being used to construct a new resized hash
table. This approach allows lookups, insertions, and deletions to all run concurrently
with a resize operation (as well as with each other).
The resize operation proceeds as shown in Figures 10.15–10.18, with the initial
two-bucket state shown in Figure 10.15 and with time advancing from figure to figure.
The initial state uses the zero-index links to chain the elements into hash buckets. A
four-bucket array is allocated, and the one-index links are used to chain the elements
into these four new hash buckets. This results in state (b) shown in Figure 10.16, with
readers still using the original two-bucket array.
The new four-bucket array is exposed to readers and then a grace-period operation
waits for all readers, resulting in state (c), shown in Figure 10.17. In this state, all
readers are using the new four-bucket array, which means that the old two-bucket array
may now be freed, resulting in state (d), shown in Figure 10.18.
This design leads to a relatively straightforward implementation, which is the subject
of the next section.
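The two-sets-of-list-pointers idea can be sketched in a few lines of single-threaded C. This is only an illustration under simplifying assumptions: the types struct elem and struct table and the functions table_link(), table_distribute(), and table_find() are hypothetical, locking and RCU are omitted, and a modulo hash stands in for the real hash function.

```c
#include <stddef.h>

/* Hypothetical element with two sets of list pointers, indexed 0 and 1. */
struct elem {
	unsigned long key;
	struct elem *next[2];
};

/* Hypothetical table version: a bucket array plus the pointer set it uses. */
struct table {
	size_t nbuckets;
	int idx;
	struct elem **bkt;
};

/* Push e onto the bucket selected by its key, using tp's pointer set. */
static void table_link(struct table *tp, struct elem *e)
{
	size_t b = e->key % tp->nbuckets;

	e->next[tp->idx] = tp->bkt[b];
	tp->bkt[b] = e;
}

/*
 * Chain every element of the old table into the new table.  Only the
 * new table's pointer set is written, so traversals of the old table
 * are undisturbed.
 */
static void table_distribute(struct table *oldp, struct table *newp)
{
	size_t b;
	struct elem *e;

	newp->idx = !oldp->idx;
	for (b = 0; b < oldp->nbuckets; b++)
		for (e = oldp->bkt[b]; e != NULL; e = e->next[oldp->idx])
			table_link(newp, e);
}

/* Search one table version through its own pointer set. */
static struct elem *table_find(struct table *tp, unsigned long key)
{
	struct elem *e = tp->bkt[key % tp->nbuckets];

	while (e != NULL && e->key != key)
		e = e->next[tp->idx];
	return e;
}
```

After table_distribute() returns, every element is reachable through both table versions, which corresponds to state (b). A concurrent implementation would then publish the new table, wait for a grace period, and free the old bucket array.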
->ht_new field on line 14. If there is no resize operation in progress, ->ht_new is NULL.
Thus, a resize operation proceeds by allocating a new ht structure and referencing it
via the ->ht_new pointer, then advancing ->ht_resize_cur through the old table’s
buckets. When all the elements have been added to the new table, the new table is linked
into the hashtab structure’s ->ht_cur field. Once all old readers have completed, the
old hash table’s ht structure may be freed.
The ->ht_idx field on line 15 indicates which of the two sets of list pointers is
being used by this instantiation of the hash table, and is used to index the ->hte_next[]
array in the ht_elem structure on line 3.
The ->ht_cmp(), ->ht_gethash(), and ->ht_getkey() fields on lines 16–18
collectively define the per-element key and the hash function. The ->ht_cmp() function
compares a specified key with that of the specified element, the ->ht_gethash() function
calculates the specified key’s hash, and ->ht_getkey() extracts the key from the
enclosing data element.
The ht_lock_state shown on lines 22–25 is used to communicate lock state from
a new hashtab_lock_mod() to hashtab_add(), hashtab_del(), and
hashtab_unlock_mod(). This state prevents the algorithm from being redirected to the wrong
bucket during concurrent resize operations.
The ht_bucket structure is the same as before, and the ht_elem structure differs
from that of previous implementations only in providing a two-element array of list
pointer sets in place of the prior single set of list pointers.
In a fixed-sized hash table, bucket selection is quite straightforward: Simply transform
the hash value to the corresponding bucket index. In contrast, when resizing, it is also
necessary to determine which of the old and new sets of buckets to select from. If the
bucket that would be selected from the old table has already been distributed into the
new table, then the bucket should be selected from the new table as well as from the old
table. Conversely, if the bucket that would be selected from the old table has not yet
been distributed, then the bucket should be selected from the old table.
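The selection rule above reduces to a single comparison, as the following minimal sketch suggests. The struct ht_sketch type and the functions bucket_index() and bucket_distributed() are hypothetical, as is the convention that resize_cur holds the index of the last old-table bucket already distributed, with -1 meaning that none have been.

```c
#include <stddef.h>

/* Hypothetical, simplified table descriptor. */
struct ht_sketch {
	long nbuckets;
	long resize_cur;		/* last distributed old bucket, or -1 */
	struct ht_sketch *ht_new;	/* NULL when no resize is in progress */
};

/* Map a hash value to a bucket index in the given table version. */
static long bucket_index(const struct ht_sketch *htp, unsigned long hash)
{
	return (long)(hash % (unsigned long)htp->nbuckets);
}

/*
 * Returns nonzero if the old-table bucket for this hash has already
 * been distributed, in which case the new table must also be used.
 */
static int bucket_distributed(const struct ht_sketch *htp, unsigned long hash)
{
	return bucket_index(htp, hash) <= htp->resize_cur;
}
```

When bucket_distributed() returns zero, only the old table's bucket is involved; otherwise the corresponding bucket in the table referenced by ht_new must be selected as well.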
Bucket selection is shown in Listing 10.10, which shows ht_get_bucket() on
lines 1–11 and ht_search_bucket() on lines 13–28. The ht_get_bucket() function
returns a reference to the bucket corresponding to the specified key in the specified
hash table, without making any allowances for resizing. It also stores the bucket index
corresponding to the key into the location referenced by parameter b on line 7, and the
hash value corresponding to the key into the location referenced by parameter h
(if non-NULL) on line 9. Line 10 then returns a reference to the corresponding
bucket.
The ht_search_bucket() function searches for the specified key within the specified
hash-table version. Line 20 obtains a reference to the bucket corresponding to the
specified key. The loop spanning lines 21–26 searches that bucket, so that if line 24
detects a match, line 25 returns a pointer to the enclosing data element. Otherwise, if
there is no match, line 27 returns NULL to indicate failure.
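The shape of these two functions can be sketched as follows. This is a simplified stand-in rather than the code in Listing 10.10: the types ht_elem_s, ht_bucket_s, and ht_s, the integer keys, the identity_hash() function, and the singly linked chains are all assumptions made to keep the example self-contained, and RCU list traversal is replaced by a plain loop.

```c
#include <stddef.h>

/* Hypothetical simplified types; integer keys and singly linked chains. */
struct ht_elem_s {
	struct ht_elem_s *next;
	unsigned long key;
};

struct ht_bucket_s {
	struct ht_elem_s *head;
};

struct ht_s {
	long nbuckets;
	unsigned long (*gethash)(unsigned long key);
	struct ht_bucket_s *bkt;
};

/* Hypothetical hash function used only for illustration. */
static unsigned long identity_hash(unsigned long key)
{
	return key;
}

/* Return the bucket for key, storing its index in *b and its hash in *h. */
static struct ht_bucket_s *get_bucket(struct ht_s *htp, unsigned long key,
				      long *b, unsigned long *h)
{
	unsigned long hash = htp->gethash(key);

	*b = (long)(hash % (unsigned long)htp->nbuckets);
	if (h != NULL)
		*h = hash;
	return &htp->bkt[*b];
}

/* Search the key's bucket in this table version; NULL indicates failure. */
static struct ht_elem_s *search_bucket(struct ht_s *htp, unsigned long key)
{
	long b;
	struct ht_bucket_s *htbp = get_bucket(htp, key, &b, NULL);
	struct ht_elem_s *e;

	for (e = htbp->head; e != NULL; e = e->next)
		if (e->key == key)
			return e;
	return NULL;
}
```

Note that neither function makes any allowance for resizing; that responsibility falls to the caller, which chooses the table version to pass in.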
Quick Quiz 10.10: How does the code in Listing 10.10 protect against the resizing process
progressing past the selected bucket?
bucket’s lock, which will prevent any concurrent resizing operation from distributing
that bucket, though of course it will have no effect if that bucket has already been
distributed. Lines 14–15 store the bucket pointer and pointer-set index into their
respective fields in the ht_lock_state structure, which communicates the information
to hashtab_add(), hashtab_del(), and hashtab_unlock_mod(). Line 16 then
checks to see if a concurrent resize operation has already distributed this bucket across
the new hash table, and if not, line 17 indicates that there is no already-resized hash
bucket and line 18 returns with the selected hash bucket’s lock held (thus preventing
a concurrent resize operation from distributing this bucket) and also within an RCU
read-side critical section. Deadlock is avoided because the old table’s locks are always
acquired before those of the new table, and because the use of RCU prevents more than
two versions from existing at a given time, thus preventing a deadlock cycle.
Otherwise, a concurrent resize operation has already distributed this bucket, so
line 20 proceeds to the new hash table, line 21 selects the bucket corresponding to
the key, and line 22 acquires the bucket’s lock. Lines 23–24 store the bucket pointer
and pointer-set index into their respective fields in the ht_lock_state structure,
which again communicates this information to hashtab_add(), hashtab_del(), and
hashtab_unlock_mod(). Because this bucket has already been resized and because
hashtab_add() and hashtab_del() affect both the old and the new ht_bucket
structures, two locks are held, one on each of the two buckets. Additionally, both
elements of each array in the ht_lock_state structure are used, with the [0] element
pertaining to the old ht_bucket structure and the [1] element pertaining to the new
structure. Once again, hashtab_lock_mod() exits within an RCU read-side critical
section.
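The locking protocol described above can be sketched as follows, with the old table's bucket always locked first. This is an illustrative approximation only: the types struct bkt, struct tbl, and struct lock_state, the functions lock_mod() and unlock_mod(), the atomic_flag spinlock, and the convention that resize_cur is -1 before any bucket has been distributed are all assumptions, and the RCU read-side critical section and the associated marked accesses are omitted.

```c
#include <stdatomic.h>
#include <stddef.h>

struct bkt {
	atomic_flag lock;	/* list head omitted for brevity */
};

struct tbl {
	long nbuckets;
	long resize_cur;	/* last distributed bucket, or -1 (assumption) */
	int idx;		/* which pointer set this table uses */
	struct tbl *ht_new;
	struct bkt *bkt;
};

struct lock_state {
	struct bkt *hbp[2];	/* [0]: old-table bucket, [1]: new-table bucket */
	int hls_idx[2];
};

static void bkt_lock(struct bkt *b)
{
	while (atomic_flag_test_and_set_explicit(&b->lock, memory_order_acquire))
		continue;
}

static void bkt_unlock(struct bkt *b)
{
	atomic_flag_clear_explicit(&b->lock, memory_order_release);
}

/* Lock the bucket(s) that an update with this hash value must modify. */
static void lock_mod(struct tbl *htp, unsigned long hash, struct lock_state *lsp)
{
	long b = (long)(hash % (unsigned long)htp->nbuckets);

	bkt_lock(&htp->bkt[b]);		/* old table's lock is taken first */
	lsp->hbp[0] = &htp->bkt[b];
	lsp->hls_idx[0] = htp->idx;
	if (b > htp->resize_cur) {	/* not yet distributed */
		lsp->hbp[1] = NULL;
		return;
	}
	htp = htp->ht_new;		/* already distributed: lock new bucket too */
	b = (long)(hash % (unsigned long)htp->nbuckets);
	bkt_lock(&htp->bkt[b]);
	lsp->hbp[1] = &htp->bkt[b];
	lsp->hls_idx[1] = htp->idx;
}

/* Release whatever lock_mod() acquired. */
static void unlock_mod(struct lock_state *lsp)
{
	bkt_unlock(lsp->hbp[0]);
	if (lsp->hbp[1] != NULL)
		bkt_unlock(lsp->hbp[1]);
}
```

Because the old bucket's lock is held across the check of resize_cur, the resize operation cannot begin distributing that bucket between the check and the update.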
Now that we have bucket selection and concurrency control in place, we are ready to
search and update our resizable hash table. The hashtab_lookup(), hashtab_add(),
and hashtab_del() functions are shown in Listing 10.12.
The hashtab_lookup() function on lines 1–10 of the listing does hash lookups.
Line 7 fetches the current hash table and line 8 searches the bucket corresponding to the
specified key. Line 9 returns a pointer to the searched-for element or NULL when the
search fails. The caller must be within an RCU read-side critical section.
Quick Quiz 10.12: The hashtab_lookup() function in Listing 10.12 ignores concurrent
resize operations. Doesn’t this mean that readers might miss an element that was previously
added during a resize operation?
The hashtab_add() function on lines 12–22 of the listing adds new data elements
to the hash table. Line 15 picks up the current ht_bucket structure into which the
new element is to be added, and line 16 picks up the index of the pointer pair. Line 18
adds the new element to the current hash bucket. If line 19 determines that this bucket
has been distributed to a new version of the hash table, then line 20 also adds the new
element to the corresponding new bucket. The caller is required to handle concurrency,
The actual resizing itself is carried out by hashtab_resize(), shown in Listing 10.13
on page 311. Line 16 conditionally acquires the top-level ->ht_lock, and if this
acquisition fails, line 17 returns -EBUSY to indicate that a resize is already in progress.
Otherwise, line 18 picks up a reference to the current hash table, and lines 19–22
allocate a new hash table of the desired size. If a new set of hash/key functions has
been specified, these are used for the new table, otherwise those of the old table are
preserved. If line 23 detects memory-allocation failure, line 24 releases ->ht_lock
and line 25 returns a failure indication.
Line 27 picks up the current table’s index and line 28 stores its inverse to the new
hash table, thus ensuring that the two hash tables avoid overwriting each other’s linked
lists. Line 29 then starts the bucket-distribution process by installing a reference to the
new table into the ->ht_new field of the old table. Line 30 ensures that all readers who
are not aware of the new table complete before the resize operation continues.
Each pass through the loop spanning lines 31–42 distributes the contents of one of
the old hash table’s buckets into the new hash table. Line 32 picks up a reference to the
old table’s current bucket and line 33 acquires that bucket’s spinlock.
Quick Quiz 10.14: In the hashtab_resize() function in Listing 10.13, what guarantees
that the update to ->ht_new on line 29 will be seen as happening before the update to
->ht_resize_cur on line 40 from the perspective of hashtab_add() and hashtab_del()? In
other words, what prevents hashtab_add() and hashtab_del() from dereferencing a NULL
pointer loaded from ->ht_new?
Each pass through the loop spanning lines 34–39 adds one data element from the
current old-table bucket to the corresponding new-table bucket, holding the new-table
bucket’s lock during the add operation. Line 40 updates ->ht_resize_cur to indicate
that this bucket has been distributed. Finally, line 41 releases the old-table bucket lock.
Execution reaches line 43 once all old-table buckets have been distributed across
the new table. Line 43 installs the newly created table as the current one, and line 44
waits for all old readers (who might still be referencing the old table) to complete. Then
line 45 releases the resize-serialization lock, line 46 frees the old hash table, and finally
line 47 returns success.
Quick Quiz 10.15: Why is there a WRITE_ONCE() on line 40 in Listing 10.13?
1 You see only two traces? The dashed one is composed of two traces that differ only
Figure 10.19: Overhead of Resizing Hash Tables Between 262,144 and 524,288 Buckets
vs. Total Number of Elements (x-axis: Number of CPUs (Threads))
The lower three traces are for the 2,097,152-element hash table. The upper dashed
trace corresponds to the 262,144-bucket fixed-size hash table, the solid trace in the
middle for low CPU counts and at the bottom for high CPU counts to the resizable
hash table, and the other trace to the 524,288-bucket fixed-size hash table. The fact that
there are now an average of eight elements per bucket can only be expected to produce
a sharp decrease in performance, as in fact is shown in the graph. But worse yet, the
hash-table elements occupy 128 MB, which overflows each socket’s 39 MB L3 cache,
with performance consequences analogous to those described in Section 3.2.2. The
resulting cache overflow means that the memory system is involved even for a read-only
benchmark, and as you can see from the sublinear portions of the lower three traces, the
memory system can be a serious bottleneck.
Quick Quiz 10.16: How much of the difference in performance between the large and small
hash tables shown in Figure 10.19 was due to long hash chains and how much was due to
memory-system bottlenecks?
Referring to the last column of Table 3.1, we recall that the first 28 CPUs are in
the first socket, on a one-CPU-per-core basis, which explains the sharp decrease in
performance of the resizable hash table beyond 28 CPUs. Sharp though this decrease
is, please recall that it is due to constant resizing back and forth. It would clearly be
better to resize once to 524,288 buckets, or, even better, do a single eight-fold resize to
2,097,152 buckets, thus dropping the average number of elements per bucket down to
the level enjoyed by the runs producing the upper three traces.
The key point from this data is that the RCU-protected resizable hash table performs
and scales almost as well as does its fixed-size counterpart. The performance during
an actual resize operation of course suffers somewhat due to the cache misses caused
by the updates to each element’s pointers, and this effect is most pronounced when the
memory system becomes a bottleneck. This indicates that hash tables should be resized
by substantial amounts, and that hysteresis should be applied to prevent performance
degradation due to too-frequent resize operations. In memory-rich environments,
hash-table sizes should furthermore be increased much more aggressively than they are
decreased.
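One way to apply this advice is a resize policy with separate grow and shrink thresholds. The following sketch is an assumption-laden illustration, not a recommendation from the text: the threshold values, the growth and shrink factors, the minimum size, and the function name resize_target() are all hypothetical.

```c
#include <stddef.h>

/* Hypothetical policy constants; the text gives no specific values. */
#define GROW_LOAD	2	/* grow when elements per bucket exceed this */
#define SHRINK_DIV	8	/* shrink when buckets exceed 8x the elements */
#define GROW_FACTOR	8	/* grow aggressively ... */
#define SHRINK_FACTOR	2	/* ... but shrink cautiously */
#define MIN_BUCKETS	1024

/* Return the bucket count the table should have, given its current state. */
static size_t resize_target(size_t nbuckets, size_t nelems)
{
	if (nelems > nbuckets * GROW_LOAD)
		return nbuckets * GROW_FACTOR;
	if (nbuckets > MIN_BUCKETS && nelems * SHRINK_DIV < nbuckets)
		return nbuckets / SHRINK_FACTOR;
	return nbuckets;	/* within the hysteresis band: do not resize */
}
```

With these constants, a table that has just grown has at least one element per four buckets, which is well above the shrink threshold, and a table that has just shrunk has fewer than one element per four buckets, which is well below the grow threshold. The gap between the two thresholds is what prevents resizing back and forth.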
The next step is to wait for all pre-existing readers to complete, resulting in state (e).
In this state, all readers are using the new small hash table, so that the old large hash
table’s buckets may be freed, resulting in the final state (f).
Growing a relativistic hash table reverses the shrinking process, but requires more
grace-period steps, as shown in Figure 10.21. The initial state (a) is at the top of this
figure, with time advancing from top to bottom.
We start by allocating the new large two-bucket hash table, resulting in state (b).
Note that each of these new buckets references the first element destined for that bucket.
These new buckets are published to readers, resulting in state (c). After a grace-period
operation, all readers are using the new large hash table, resulting in state (d). In this
state, only those readers traversing the even-values hash bucket traverse element 0,
which is therefore now colored white.
At this point, the old small hash buckets may be freed, although many implementations
use these old buckets to track progress “unzipping” the list of items into their respective
new buckets. The last even-numbered element in the first consecutive run of such
elements now has its pointer-to-next updated to reference the following even-numbered
element. After a subsequent grace-period operation, the result is state (e). The vertical
arrow indicates the next element to be unzipped, and element 1 is now colored black to
indicate that only those readers traversing the odd-values hash bucket may reach it.
Next, the last odd-numbered element in the first consecutive run of such elements
now has its pointer-to-next updated to reference the following odd-numbered element.
After a subsequent grace-period operation, the result is state (f). A final unzipping
operation (including a grace-period operation) results in the final state (g).
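The sequence of pointer updates in the unzipping steps can be sketched sequentially for the two-bucket case. This is only an illustration of the order of updates: struct item, bucket_of_item(), and unzip() are hypothetical names, elements are assigned to buckets by the parity of a non-negative integer value, and the grace-period waits that a concurrent implementation needs between steps appear only as a comment.

```c
#include <stddef.h>

struct item {
	int val;		/* assumed non-negative */
	struct item *next;
};

/* Which of the two new buckets an item belongs to: 0 for even, 1 for odd. */
static int bucket_of_item(const struct item *p)
{
	return p->val & 1;
}

/*
 * Unzip a single interleaved chain into two chains, one pointer update
 * at a time.  Returns the number of pointer updates performed.
 */
static int unzip(struct item *p)
{
	int steps = 0;

	while (p != NULL) {
		struct item *last = p;
		struct item *q, *r;

		/* Find the last element in p's consecutive run. */
		while (last->next != NULL &&
		       bucket_of_item(last->next) == bucket_of_item(p))
			last = last->next;
		q = last->next;	/* first element of the other bucket's run */
		if (q == NULL)
			break;
		/* Skip the other bucket's run to find p's next element. */
		for (r = q; r != NULL &&
		     bucket_of_item(r) == bucket_of_item(q); r = r->next)
			continue;
		last->next = r;
		steps++;
		/* A concurrent implementation would wait for a grace period here. */
		p = q;
	}
	return steps;
}
```

For the chain 0, 1, 2, 3, this performs three updates: element 0 is pointed at element 2, element 1 at element 3, and finally element 2's pointer is set to NULL. The grace period between steps matters because a reader traversing the odd bucket might still be referencing element 2 when its pointer-to-next is about to change.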
In short, the relativistic hash table reduces the number of per-element list pointers
at the expense of additional grace periods incurred during resizing. These additional
grace periods are usually not a problem because insertions, deletions, and lookups may
proceed concurrently with a resize operation.
It turns out that it is possible to reduce the per-element memory overhead from a pair
of pointers to a single pointer, while still retaining O(1) deletions. This is accomplished
by augmenting a split-order list [SS06] with RCU protection [Des09b, MDJ13c]. The
data elements in the hash table are arranged into a single sorted linked list, with each
hash bucket referencing the first element in that bucket. Elements are deleted by setting
low-order bits in their pointer-to-next fields, and these elements are removed from the
list by later traversals that encounter them.
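The low-order-bit deletion technique can be sketched in single-threaded form as follows. This is not the userspace RCU library's code: struct so_node and the helper functions are hypothetical, the sketch relies on nodes being aligned so that bit 0 of a valid pointer is always zero, and a concurrent implementation would use atomic compare-and-swap operations rather than plain stores.

```c
#include <stddef.h>
#include <stdint.h>

struct so_node {
	struct so_node *next;	/* bit 0 set means "this node is deleted" */
	unsigned long key;
};

#define DELETED_BIT ((uintptr_t)1)

static struct so_node *mark_deleted(struct so_node *p)
{
	return (struct so_node *)((uintptr_t)p | DELETED_BIT);
}

static int is_deleted(const struct so_node *p)
{
	return ((uintptr_t)p & DELETED_BIT) != 0;
}

static struct so_node *strip(struct so_node *p)
{
	return (struct so_node *)((uintptr_t)p & ~DELETED_BIT);
}

/* Logically delete n by tagging its own pointer-to-next field. */
static void logical_delete(struct so_node *n)
{
	n->next = mark_deleted(n->next);
}

/*
 * Traverse the list from a dummy head that is never deleted, physically
 * unlinking logically deleted nodes.  Returns the number of live nodes.
 */
static size_t sweep(struct so_node *head)
{
	struct so_node *prev = head;
	struct so_node *cur = strip(head->next);
	size_t live = 0;

	while (cur != NULL) {
		if (is_deleted(cur->next)) {
			prev->next = strip(cur->next);	/* unlink cur */
		} else {
			prev = cur;
			live++;
		}
		cur = strip(cur->next);
	}
	return live;
}
```

Deletion itself touches only the victim's own pointer, which is what allows it to run in constant time; the cost of unlinking is deferred to whichever later traversal encounters the marked node.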
This RCU-protected split-order list is complex, but offers lock-free progress guarantees
for all insertion, deletion, and lookup operations. Such guarantees can be important
in real-time applications. An implementation is available from recent versions of the
userspace RCU library [Des09b].
The preceding sections have focused on data structures that enhance concurrency due
to partitionability (Section 10.2), efficient handling of read-mostly access patterns
(Section 10.3), or application of read-mostly techniques to avoid non-partitionability
(Section 10.4). This section gives a brief review of other data structures.
One of the hash table’s greatest advantages for parallel use is that it is fully partitionable,
at least while not being resized. One way of preserving the partitionability and the
size independence is to use a radix tree, which is also called a trie. Tries partition
the search key, using each successive key partition to traverse the next level of the
trie. As such, a trie can be thought of as a set of nested hash tables, thus providing
the required partitionability. One disadvantage of tries is that a sparse key space can
result in inefficient use of memory. There are a number of compression techniques that
may be used to work around this disadvantage, including hashing the key value to a
smaller keyspace before the traversal [ON07]. Radix trees are heavily used in practice,
including in the Linux kernel [Pig06].
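As a rough illustration of how a trie partitions its search key, the following sketch consumes a 32-bit key four bits at a time. All names (trie_node, trie_index(), trie_insert(), trie_lookup()) and the choice of four bits per level are assumptions made for this example, and the code is sequential. It is not the Linux kernel's radix tree.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define TRIE_BITS	4			/* Key bits consumed per level. */
#define TRIE_FANOUT	(1 << TRIE_BITS)	/* Children per node. */
#define TRIE_LEVELS	(32 / TRIE_BITS)	/* Levels for a 32-bit key. */

struct trie_node {
	void *slot[TRIE_FANOUT];	/* Children, or values at the last level. */
};

/* Extract the key partition used at the given level (0 is the root). */
static unsigned int trie_index(uint32_t key, int level)
{
	int shift = (TRIE_LEVELS - 1 - level) * TRIE_BITS;

	return (key >> shift) & (TRIE_FANOUT - 1);
}

/* Insert a value, allocating interior nodes as needed.  Returns 0 on success. */
static int trie_insert(struct trie_node *root, uint32_t key, void *value)
{
	struct trie_node *np = root;
	int level;

	for (level = 0; level < TRIE_LEVELS - 1; level++) {
		unsigned int i = trie_index(key, level);

		if (np->slot[i] == NULL) {
			np->slot[i] = calloc(1, sizeof(struct trie_node));
			if (np->slot[i] == NULL)
				return -1;
		}
		np = np->slot[i];
	}
	np->slot[trie_index(key, TRIE_LEVELS - 1)] = value;
	return 0;
}

/* Look up a key, returning its value or NULL if absent. */
static void *trie_lookup(struct trie_node *root, uint32_t key)
{
	struct trie_node *np = root;
	int level;

	for (level = 0; level < TRIE_LEVELS - 1; level++) {
		np = np->slot[trie_index(key, level)];
		if (np == NULL)
			return NULL;
	}
	return np->slot[trie_index(key, TRIE_LEVELS - 1)];
}
```

In this sketch, a key that shares no prefix with any other key requires seven freshly allocated interior nodes, which shows in miniature why sparse key spaces can use memory inefficiently.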
One important special case of both a hash table and a trie is what is perhaps the oldest
of data structures, the array and its multi-dimensional counterpart, the matrix. The fully
partitionable nature of matrices is exploited heavily in concurrent numerical algorithms.
Self-balancing trees are heavily used in sequential code, with AVL trees and red-
black trees being perhaps the most well-known examples [CLRS01]. Early attempts
to parallelize AVL trees were complex and not necessarily all that efficient [Ell80].
However, more recent work on red-black trees provides better performance and scalability
by using RCU for readers and hashed arrays of locks2 to protect reads and updates,
respectively [HW11, HW14]. It turns out that red-black trees rebalance aggressively,
which works well for sequential programs, but not necessarily so well for parallel use.
Recent work has therefore made use of RCU-protected “bonsai trees” that rebalance less
aggressively [CKZ12], trading off optimal tree depth to gain more efficient concurrent
updates.
Concurrent skip lists lend themselves well to RCU readers, and in fact represent an
early academic use of a technique resembling RCU [Pug90].
Concurrent double-ended queues were discussed in Section 6.1.2, and concurrent
stacks and queues have a long history [Tre86], though not normally the most impressive
performance or scalability. They are nevertheless a common feature of concurrent
libraries [MDJ13d]. Researchers have recently proposed relaxing the ordering constraints
of stacks and queues [Sha11], with some work indicating that relaxed-ordered queues
actually have better ordering properties than do strict FIFO queues [HKLP12, KLP12,
HHK+13].
It seems likely that continued work with concurrent data structures will produce novel
algorithms with surprising properties.
10.6 Micro-Optimization
The devil is in the details.
Unknown
The data structures shown in this chapter were coded straightforwardly, with no adaptation
to the underlying system’s cache hierarchy. In addition, many of the implementations
used pointers to functions for key-to-hash conversions and other frequent operations.
Although this approach provides simplicity and portability, in many cases it does give
up some performance.
The following sections touch on specialization, memory conservation, and hardware
considerations. Please do not mistake these short sections for a definitive treatise on
this subject. Whole books have been written on optimizing to a specific CPU, let alone
to the set of CPU families in common use today.
10.6.1 Specialization
The resizable hash table presented in Section 10.4 used an opaque type for the key.
This allows great flexibility, permitting any sort of key to be used, but it also incurs
significant overhead due to the calls via pointers to functions. Now, modern hardware
uses sophisticated branch-prediction techniques to minimize this overhead, but on the
other hand, real-world software is often larger than can be accommodated even by
today’s large hardware branch-prediction tables. This is especially the case for calls via
pointers, in which case the branch prediction hardware must record a pointer in addition
to branch-taken/branch-not-taken information.
This overhead can be eliminated by specializing a hash-table implementation to a given
key type and hash function, for example, by using C++ templates. Doing so eliminates
the ->ht_cmp(), ->ht_gethash(), and ->ht_getkey() function pointers in the ht
2 In the guise of swissTM [DFGG11], which is a variant of software transactional
structure shown in Listing 10.9 on page 307. It also eliminates the corresponding calls
through these pointers, which could allow the compiler to inline the resulting fixed
functions, eliminating not only the overhead of the call instruction, but the argument
marshalling as well.
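Although the text mentions C++ templates, the same contrast can be sketched in plain C. The generic_ht structure below is a hypothetical stand-in rather than the ht structure from Listing 10.9, and the multiplicative hash is an arbitrary choice made only for illustration. The point is that generic_bucket() must make an indirect call, while u32_bucket() calls a fixed function that the compiler is free to inline.

```c
#include <assert.h>
#include <stdint.h>

/* Generic flavor: the hash function is reached through a pointer. */
struct generic_ht {
	unsigned long nbuckets;
	unsigned long (*gethash)(const void *key);
};

static unsigned long generic_bucket(const struct generic_ht *htp,
				    const void *key)
{
	return htp->gethash(key) % htp->nbuckets;	/* Indirect call. */
}

/* Specialized flavor: key type and hash function fixed at compile time. */
static inline uint32_t u32_hash(uint32_t key)
{
	return key * 2654435761u;	/* Illustrative multiplicative hash. */
}

static inline unsigned long u32_bucket(unsigned long nbuckets, uint32_t key)
{
	return u32_hash(key) % nbuckets;	/* Direct call, easily inlined. */
}

/* Adapter allowing the generic flavor to use the same hash function. */
static unsigned long u32_hash_via_pointer(const void *key)
{
	return u32_hash(*(const uint32_t *)key);
}
```

Both flavors compute the same bucket for a given key, so any performance difference between them comes from the call mechanism rather than from the hash itself. Whether that difference matters for a given workload is something to measure rather than assume.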
Quick Quiz 10.17: How much do these specializations really save? Are they really worth it?
All that aside, one of the great benefits of modern hardware compared to that available
when I first started learning to program back in the early 1970s is that much less
specialization is required. This allows much greater productivity than was possible back
in the days of four-kilobyte address spaces.
2. They cannot participate in the lockdep deadlock detection tooling in the Linux
kernel [Cor06a].
4. They do not participate in priority boosting in -rt kernels, which means that
preemption must be disabled when holding bit spinlocks, which can degrade
real-time latency.
In short, there is a tradeoff between minimal memory overhead on the one hand, and
performance and simplicity on the other. Fortunately, the relatively large memories
available on modern systems have allowed us to prioritize performance and simplicity
over memory overhead. However, even though the year 2022’s pocket-sized smartphones
sport many gigabytes of memory and its mid-range servers sport terabytes, it is sometimes
necessary to take extreme measures to reduce memory overhead.
1. Place read-mostly data far from frequently updated data. For example, place
read-mostly data at the beginning of the structure and frequently updated data at
the end. Place data that is rarely accessed in between.
2. If the structure has groups of fields such that each group is updated by an independent
code path, separate these groups from each other. Again, it can be helpful to place
3 A number of these rules are paraphrased and expanded on here with permission from
Orran Krieger.
rarely accessed data between the groups. In some cases, it might also make sense to
place each such group into a separate structure referenced by the original structure.
3. Where possible, associate update-mostly data with a CPU, thread, or task. We saw
several very effective examples of this rule of thumb in the counter implementations
in Chapter 5.
4. Going one step further, partition your data on a per-CPU, per-thread, or per-task
basis, as was discussed in Chapter 8.
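The first two rules above might be applied to a structure roughly as follows. This is a sketch under stated assumptions: the conn_stats structure and its fields are invented for illustration, the 64-byte cache-line size is typical of many current CPUs but must be checked for the actual target, and the code relies on C11 alignas.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>

#define CACHE_LINE_SIZE 64	/* Assumption: verify for your target CPU. */

struct conn_stats {
	/* Read-mostly fields at the beginning of the structure. */
	unsigned long id;
	const char *name;

	/* Rarely accessed field placed in between. */
	unsigned long created_time;

	/*
	 * Frequently updated fields at the end.  Each group is updated
	 * by an independent code path, so each gets its own cache line.
	 */
	alignas(CACHE_LINE_SIZE) unsigned long rx_packets;	/* Receive path. */
	alignas(CACHE_LINE_SIZE) unsigned long tx_packets;	/* Transmit path. */
};

/* Catch accidental layout changes at compile time. */
_Static_assert(offsetof(struct conn_stats, rx_packets) % CACHE_LINE_SIZE == 0,
	       "rx_packets must start a cache line");
_Static_assert(offsetof(struct conn_stats, tx_packets) % CACHE_LINE_SIZE == 0,
	       "tx_packets must start a cache line");
```

The price of this layout is memory: the structure grows to several cache lines, which is the memory-versus-performance tradeoff discussed earlier in this section.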
There has been some work towards automated trace-based rearrangement of structure
fields [GDZE10]. This work might well ease one of the more painstaking tasks required
to get excellent performance and scalability from multithreaded software.
An additional set of rules of thumb deals with locks:
1. Given a heavily contended lock protecting data that is frequently modified, take
one of the following approaches:
(a) Place the lock in a different cache line than the data that it protects.
(b) Use a lock that is adapted for high contention, such as a queued lock.
(c) Redesign to reduce lock contention. (This approach is best, but is not always
trivial.)
2. Place uncontended locks into the same cache line as the data that they protect. This
approach means that the cache miss that brings the lock to the current CPU also
brings its data.
3. Protect read-mostly data with hazard pointers, RCU, or, for long-duration critical
sections, reader-writer locks.
Of course, these are rules of thumb rather than absolute rules. Some experimentation
is required to work out which are most applicable to a given situation.
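The lock-placement rules 1(a) and 2 can be contrasted in a short sketch. The hot_counter and cold_counter structures are hypothetical, and the 64-byte cache-line size is again an assumption to be checked against the target system.

```c
#include <assert.h>
#include <pthread.h>
#include <stdalign.h>
#include <stddef.h>

#define CACHE_LINE_SIZE 64	/* Assumption: verify for your target CPU. */

/* Heavily contended lock: keep it in a different cache line than its data. */
struct hot_counter {
	alignas(CACHE_LINE_SIZE) pthread_mutex_t lock;
	alignas(CACHE_LINE_SIZE) unsigned long count;
};

/*
 * Uncontended lock: place it right after its data so that the cache miss
 * that fetches one also tends to fetch the other.  Whether the lock fits
 * entirely within that line depends on sizeof(pthread_mutex_t).
 */
struct cold_counter {
	alignas(CACHE_LINE_SIZE) unsigned long count;
	pthread_mutex_t lock;
};
```

As the text says, these are rules of thumb, so the choice between the two layouts is best confirmed by measuring under the actual contention level.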
10.7 Summary
This chapter has focused primarily on hash tables, including resizable hash tables, which
are not fully partitionable. Section 10.5 gave a quick overview of a few non-hash-table
data structures. Nevertheless, this exposition of hash tables is an excellent introduction
to the many issues surrounding high-performance scalable data access, including:
1. Fully partitioned data structures work well on small systems, for example, single-
socket systems.
3. Read-mostly techniques, such as hazard pointers and RCU, provide good locality
of reference for read-mostly workloads, and thus provide excellent performance
and scalability even on larger systems.
4. Read-mostly techniques also work well on some types of non-partitionable data
structures, such as resizable hash tables.
5. Large data structures can overflow CPU caches, reducing performance and
scalability.
6. Additional performance and scalability can be obtained by specializing the data
structure to a specific workload, for example, by replacing a general key with a
32-bit integer.
7. Although requirements for portability and for extreme performance often conflict,
there are some data-structure-layout techniques that can strike a good balance
between these two sets of requirements.
That said, performance and scalability are of little use without reliability, so the next
chapter covers validation.
If it is not tested, it doesn’t work.
Unknown
Chapter 11
Validation
I have had a few parallel programs work the first time, but that is only because I have
written an extremely large number of parallel programs over the past few decades. And I
have had far more parallel programs that fooled me into thinking that they were working
correctly the first time than actually were working the first time.
I thus need to validate my parallel programs. The basic trick behind validation is to
realize that the computer knows what is wrong. It is therefore your job to force it to tell
you. This chapter can therefore be thought of as a short course in machine interrogation.
But you can leave the good-cop/bad-cop routine at home. This chapter covers much
more sophisticated and effective methods, especially given that most computers couldn’t
tell a good cop from a bad cop, at least as far as we know.
A longer course may be found in many recent books on validation, as well as at least
one older but valuable one [Mye79]. Validation is an extremely important topic that
cuts across all forms of software, and is worth intensive study in its own right. However,
this book is primarily about concurrency, so this chapter will do little more than scratch
the surface of this critically important topic.
Section 11.1 introduces the philosophy of debugging. Section 11.2 discusses tracing,
Section 11.3 discusses assertions, and Section 11.4 discusses static analysis. Section 11.5
describes some unconventional approaches to code review that can be helpful when the
fabled 10,000 eyes happen not to be looking at your code. Section 11.6 overviews the
use of probability for validating parallel software. Because performance and scalability
are first-class requirements for parallel programming, Section 11.7 covers these topics.
Finally, Section 11.8 gives a fanciful summary and a short list of statistical traps to
avoid.
But never forget that the three best debugging tools are a thorough understanding of
the requirements, a solid design, and a good night’s sleep!
11.1 Introduction
Debugging is like being the detective in a crime
movie where you are also the murderer.
Filipe Fortes
Section 11.1.1 discusses the sources of bugs, and Section 11.1.2 overviews the mindset
required when validating software. Section 11.1.3 discusses when you should start
validation, and Section 11.1.4 describes the surprisingly effective open-source regimen
of code review and community testing.
1. Computers lack common sense, despite huge sacrifices at the altar of artificial
intelligence.
The first two points should be uncontroversial, as they are illustrated by any number
of failed products, perhaps most famously Clippy and Microsoft Bob. By attempting to
relate to users as people, these two products raised common-sense and theory-of-mind
expectations that they proved incapable of meeting. Perhaps the set of software assistants
that are now available on smartphones will fare better, but as of 2021 reviews are mixed.
That said, the developers working on them by all accounts still develop the old way:
The assistants might well benefit end users, but not so much their own developers.
This human love of fragmentary plans deserves more explanation, especially given
that it is a classic two-edged sword. This love of fragmentary plans is apparently due
to the assumption that the person carrying out the plan will have (1) common sense
and (2) a good understanding of the intent and requirements driving the plan. This
latter assumption is especially likely to hold in the common case where the person
doing the planning and the person carrying out the plan are one and the same: In this
case, the plan will be revised almost subconsciously as obstacles arise, especially when
that person has a good understanding of the problem at hand. In fact, the love of
fragmentary plans has served human beings well, in part because it is better to take
random actions that have some chance of locating food than to starve to death while
attempting to plan the unplannable. However, the usefulness of fragmentary plans in
the everyday life of which we are all experts is no guarantee of their future usefulness in
stored-program computers.
Furthermore, the need to follow fragmentary plans has had important effects on the
human psyche, due to the fact that throughout much of human history, life was often
difficult and dangerous. It should come as no surprise that executing a fragmentary
plan that has a high probability of a violent encounter with sharp teeth and claws
requires almost insane levels of optimism—a level of optimism that actually is present
in most human beings. These insane levels of optimism extend to self-assessments
of programming ability, as evidenced by the effectiveness of (and the controversy
over) code-interviewing techniques [Bra07]. In fact, the clinical term for a human
being with less-than-insane levels of optimism is “clinically depressed”. Such people
usually have extreme difficulty functioning in their daily lives, underscoring the perhaps
counter-intuitive importance of insane levels of optimism to a normal, healthy life.
Furthermore, if you are not insanely optimistic, you are less likely to start a difficult
but worthwhile project.1
Quick Quiz 11.1: When in computing is it necessary to follow a fragmentary plan?
An important special case is the project that, while valuable, is not valuable enough
to justify the time required to implement it. This special case is quite common, and
one early symptom is the unwillingness of the decision-makers to invest enough to
actually implement the project. A natural reaction is for the developers to produce an
unrealistically optimistic estimate in order to be permitted to start the project. If the
organization is strong enough and its decision-makers ineffective enough, the project
might succeed despite the resulting schedule slips and budget overruns. However, if
the organization is not strong enough and if the decision-makers fail to cancel the
project as soon as it becomes clear that the estimates are garbage, then the project might
well kill the organization. This might result in another organization picking up the
project and either completing it, canceling it, or being killed by it. A given project
might well succeed only after killing several organizations. One can only hope that
the organization that eventually makes a success of a serial-organization-killer project
maintains a suitable level of humility, lest it be killed by its next such project.
Quick Quiz 11.2: Who cares about the organization? After all, it is the project that is
important!
Important though insane levels of optimism might be, they are a key source of bugs
(and perhaps failure of organizations). The question is therefore “How to maintain the
optimism required to start a large project while at the same time injecting enough reality
to keep the bugs down to a dull roar?” The next section examines this conundrum.
From these definitions, it logically follows that any reliable non-trivial program
contains at least one bug that you do not know about. Therefore, any validation effort
undertaken on a non-trivial program that fails to find any bugs is itself a failure. A good
validation is therefore an exercise in destruction. This means that if you are the type of
person who enjoys breaking things, validation is just the job for you.
1 There are some famous exceptions to this rule of thumb. Some people take on difficult
or risky projects in order to at least temporarily escape from their depression. Others have
nothing to lose: The project is literally a matter of life or death.
Quick Quiz 11.3: Suppose that you are writing a script that processes the output of the time
command, which looks as follows:
real 0m0.132s
user 0m0.040s
sys 0m0.008s
The script is required to check its input for errors, and to give appropriate diagnostics if fed
erroneous time output. What test inputs should you provide to this program to test it for use
with time output generated by single-threaded programs?
But perhaps you are a super-programmer whose code is always perfect the first time
every time. If so, congratulations! Feel free to skip this chapter, but I do hope that you
will forgive my skepticism. You see, I have met too many people who claimed to be able to
write perfect code the first time, which is not too surprising given the previous discussion
of optimism and over-confidence. And even if you really are a super-programmer, you
just might find yourself debugging lesser mortals’ work.
One approach for the rest of us is to alternate between our normal state of insane
optimism (Sure, I can program that!) and severe pessimism (It seems to work, but I just
know that there have to be more bugs hiding in there somewhere!). It helps if you enjoy
breaking things. If you don’t, or if your joy in breaking things is limited to breaking
other people’s things, find someone who does love breaking your code and have them
help you break it.
Another helpful frame of mind is to hate it when other people find bugs in your code.
This hatred can help motivate you to torture your code beyond all reason in order to
increase the probability that you will be the one to find the bugs. Just make sure to
suspend this hatred long enough to sincerely thank anyone who does find a bug in your
code! After all, by so doing, they saved you the trouble of tracking it down, and possibly
at great personal expense dredging through your code.
Yet another helpful frame of mind is studied skepticism. You see, believing that you
understand the code means you can learn absolutely nothing about it. Ah, but you know
that you completely understand the code because you wrote or reviewed it? Sorry, but
the presence of bugs suggests that your understanding is at least partially fallacious.
One cure is to write down what you know to be true and double-check this knowledge,
as discussed in Sections 11.2–11.5. Objective reality always overrides whatever you
might think you know.
One final frame of mind is to consider the possibility that someone’s life depends on
your code being correct. One way of looking at this is that consistently making good
things happen requires a lot of focus on a lot of bad things that might happen, with an
eye towards preventing or otherwise handling those bad things.2 The prospect of these
bad things might also motivate you to torture your code into revealing the whereabouts
of its bugs.
This wide variety of frames of mind opens the door to the possibility of multiple
people with different frames of mind contributing to the project, with varying levels of
optimism. This can work well, if properly organized.
Some people might see vigorous validation as a form of torture, as depicted in
Figure 11.1.3 Such people might do well to remind themselves that, Tux cartoons aside,
2 For more on this philosophy, see the chapter entitled “The Power of Negative Thinking”
from Chris Hadfield’s excellent book entitled “An Astronaut’s Guide to Life on Earth.”
3 The cynics among us might question whether these people are afraid that validation
they are really torturing an inanimate object, as shown in Figure 11.2. Rest assured that
those who fail to torture their code are doomed to be tortured by it!
However, this leaves open the question of exactly when during the project lifetime
validation should start, a topic taken up by the next section.
It is worth reiterating that this advice applies to first-of-a-kind projects. If you are
instead doing a project in a well-explored area, you would be quite foolish to refuse
to learn from previous experience. But you should still start validating right at the
4 The old saying “First we must code, then we have incentive to think” notwithstanding.
beginning of the project, but hopefully guided by others’ hard-won knowledge of both
requirements and pitfalls.
An equally important question is “When should validation stop?” The best answer
is “Some time after the last change.” Every change has the potential to create a bug,
and thus every change must be validated. Furthermore, validation development should
continue through the full lifetime of the project. After all, the Darwinian perspective
above implies that bugs are adapting to your validation suite. Therefore, unless you
continually improve your validation suite, your project will naturally accumulate hordes
of validation-suite-immune bugs.
But life is a tradeoff, and every bit of time invested in validation suites is a bit of time
that cannot be invested in directly improving the project itself. These sorts of choices
are never easy, and it can be just as damaging to overinvest in validation as it can be to
underinvest. But this is just one more indication that life is not easy.
Now that we have established that you should start validation when you start the
project (if not earlier!), and that both validation and validation development should
continue throughout the lifetime of that project, the following sections cover a number
of validation techniques and methods that have proven their worth.
1. How many of those eyeballs are actually going to look at your code?
2. How many will be experienced and clever enough to actually find your bugs?
I was lucky: There was someone out there who wanted the functionality provided by
my patch, who had long experience with distributed filesystems, and who looked at my
patch almost immediately. If no one had looked at my patch, there would have been
no review, and therefore none of those bugs would have been located. If the people
looking at my patch had lacked experience with distributed filesystems, it is unlikely
that they would have found all the bugs. Had they waited months or even years to look, I
likely would have forgotten how the patch was supposed to work, making it much more
difficult to fix them.
However, we must not forget the second tenet of open-source development, namely
intensive testing. For example, a great many people test the Linux kernel. Some test
patches as they are submitted, perhaps even yours. Others test the -next tree, which is
helpful, but there is likely to be a delay of several weeks or even months between the time
that you write the patch and the time that it appears in the -next tree, by which time the
patch will not be quite as fresh in your mind. Still others test maintainer trees, which
often have a similar time delay.
Quite a few people don’t test code until it is committed to mainline, or the master
source tree (Linus’s tree in the case of the Linux kernel). If your maintainer won’t
accept your patch until it has been tested, this presents you with a deadlock situation:
Your patch won’t be accepted until it is tested, but it won’t be tested until it is accepted.
Nevertheless, people who test mainline code are still relatively aggressive, given that
many people and organizations do not test code until it has been pulled into a Linux
distro.
And even if someone does test your patch, there is no guarantee that they will be
running the hardware and software configuration and workload required to locate your
bugs.
Therefore, even when writing code for an open-source project, you need to be prepared
to develop and run your own test suite. Test development is an underappreciated and
very valuable skill, so be sure to take full advantage of any existing test suites available
to you. Important as test development is, we must leave further discussion of it to books
dedicated to that topic. The following sections therefore discuss locating bugs in your
code given that you already have a good test suite.
11.2 Tracing
When all else fails, add a printk()! Or a printf(), if you are working with user-mode
C-language applications.
The rationale is simple: If you cannot figure out how execution reached a given point
in the code, sprinkle print statements earlier in the code to work out what happened. You
can get a similar effect, and with more convenience and flexibility, by using a debugger
such as gdb (for user applications) or kgdb (for debugging Linux kernels). Much more
sophisticated tools exist, with some of the more recent offering the ability to rewind
backwards in time from the point of failure.
These brute-force testing tools are all valuable, especially now that typical systems
have more than 64K of memory and CPUs running faster than 4 MHz. Much has been
written about these tools, so this chapter will add only a little more.
However, these tools all have a serious shortcoming when you need a fastpath to tell
you what is going wrong, namely, these tools often have excessive overheads. There are
special tracing technologies for this purpose, which typically leverage data ownership
techniques (see Chapter 8) to minimize the overhead of runtime data collection. One
example within the Linux kernel is “trace events” [Ros10b, Ros10c, Ros10d, Ros10a],
which uses per-CPU buffers to allow data to be collected with extremely low overhead.
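The data-ownership idea behind such low-overhead tracing can be sketched in userspace with per-thread buffers. This is not how Linux-kernel trace events are implemented. The names trace_rec, trace_event(), and trace_recent() are hypothetical, and the sketch assumes C11 thread-local storage.

```c
#include <assert.h>
#include <stddef.h>

#define TRACE_BUF_SIZE 1024	/* Records per thread; must be a power of two. */

struct trace_rec {
	const char *event;	/* Points to a string literal: no copying. */
	unsigned long arg;
};

/* Each thread owns its buffer, so recording needs no locks or atomics. */
static _Thread_local struct trace_rec trace_buf[TRACE_BUF_SIZE];
static _Thread_local unsigned long trace_count;

/* Record one event, overwriting the oldest record once the buffer fills. */
static inline void trace_event(const char *event, unsigned long arg)
{
	struct trace_rec *rp = &trace_buf[trace_count++ & (TRACE_BUF_SIZE - 1)];

	rp->event = event;
	rp->arg = arg;
}

/* Fetch the i-th most recent record (0 is newest), or NULL if unavailable. */
static const struct trace_rec *trace_recent(unsigned long i)
{
	if (i >= trace_count || i >= TRACE_BUF_SIZE)
		return NULL;
	return &trace_buf[(trace_count - 1 - i) & (TRACE_BUF_SIZE - 1)];
}
```

Recording costs only a few stores to thread-private memory, so the fastpath sees no cache-line contention. Merging and formatting the per-thread buffers is left to a slower offline step, which this sketch omits.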
Even so, enabling tracing can sometimes change timing enough to hide bugs, resulting
in heisenbugs, which are discussed in Section 11.6 and especially Section 11.6.4. In the
kernel, BPF can do data reduction in the kernel, reducing the overhead of transmitting
the needed information from the kernel to userspace [Gre19]. In userspace code, there
is a huge number of tools that can help you. One good starting point is Brendan Gregg’s
blog.5
Even if you avoid heisenbugs, other pitfalls await you. For example, although the
machine really does know all, what it knows is almost always way more than your head
can hold. For this reason, high-quality test suites normally come with sophisticated
scripts to analyze the voluminous output. But beware—scripts will only notice what you
tell them to. My rcutorture scripts are a case in point: Early versions of those scripts
were quite satisfied with a test run in which RCU grace periods stalled indefinitely. This
of course resulted in the scripts being modified to detect RCU grace-period stalls, but
this does not change the fact that the scripts will only detect problems that I make them
detect. But note well that unless you have a solid design, you won’t know what your
script should check for!
Another problem with tracing and especially with printk() calls is that their
overhead can rule out production use. In such cases, assertions can be helpful.
11.3 Assertions
No man really becomes a fool until he stops asking
questions.
Charles P. Steinmetz
	if (something_bad_is_happening())
		complain();
In parallel code, one bad something that might happen is that a function expecting
to be called under a particular lock might be called without that lock being held.
Such functions sometimes have header comments stating something like “The caller
must hold foo_lock when calling this function”, but such a comment does no good
unless someone actually reads it. An executable statement carries far more weight.
The Linux kernel’s lockdep facility [Cor06a, Ros11] therefore provides a
lockdep_assert_held() function that checks whether the specified lock is held. Of course,
lockdep incurs significant overhead, and thus might not be helpful in production.
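Userspace code has no lockdep, but the "caller must hold foo_lock" comment can still be turned into an executable statement. The following sketch is only a loose analog of lockdep_assert_held(). It uses a hypothetical pair of wrappers, foo_lock_acquire() and foo_lock_release(), that record lock ownership in a thread-local flag, and it assumes that foo_lock is acquired only through these wrappers.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t foo_lock = PTHREAD_MUTEX_INITIALIZER;
static _Thread_local bool foo_lock_held_by_me;	/* Private to each thread. */
static unsigned long foo_count;			/* Protected by foo_lock. */

static void foo_lock_acquire(void)
{
	pthread_mutex_lock(&foo_lock);
	foo_lock_held_by_me = true;
}

static void foo_lock_release(void)
{
	foo_lock_held_by_me = false;
	pthread_mutex_unlock(&foo_lock);
}

/* The caller must hold foo_lock, and this time the code checks. */
static void foo_update(void)
{
	assert(foo_lock_held_by_me);
	foo_count++;
}
```

Because the flag is thread-local, checking it involves no shared-memory access, and building with NDEBUG defined compiles the check out entirely for production use.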
An especially bad parallel-code something is unexpected concurrent access to data.
The kernel concurrency sanitizer (KCSAN) [Cor16a] uses existing markings such as
5 http://www.brendangregg.com/blog/
Static analysis is a validation technique where one program takes a second program as
input, reporting errors and vulnerabilities located in this second program. Interestingly
enough, almost all programs are statically analyzed by their compilers or interpreters.
These tools are far from perfect, but their ability to locate errors has improved immensely
over the past few decades, in part because they now have much more than 64K bytes of
memory in which to carry out their analyses.
The original UNIX lint tool [Joh77] was quite useful, though much of its
functionality has since been incorporated into C compilers. There are nevertheless lint-like tools
in use to this day. The sparse static analyzer [Cor04b] finds higher-level issues in the
Linux kernel, including:
Code review is a special case of static analysis with human beings doing the analysis.
This section covers inspection, walkthroughs, and self-inspection.
11.5.1 Inspection
Traditionally, formal code inspections take place in face-to-face meetings with formally
defined roles: Moderator, developer, and one or two other participants. The developer
reads through the code, explaining what it is doing and why it works. The one or two
other participants ask questions and raise issues, hopefully exposing the author’s invalid
assumptions, while the moderator’s job is to resolve any resulting conflicts and take
notes. This process can be extremely effective at locating bugs, particularly if all of the
participants are familiar with the code at hand.
However, this face-to-face formal procedure does not necessarily work well in the
global Linux kernel community. Instead, individuals review code separately and provide
comments via email or IRC. The note-taking is provided by email archives or IRC logs,
and moderators volunteer their services as required by the occasional flamewar. This
process also works reasonably well, particularly if all of the participants are familiar with
the code at hand. In fact, one advantage of the Linux kernel community approach over
traditional formal inspections is the greater probability of contributions from people not
familiar with the code, who might not be blinded by the author’s invalid assumptions,
and who might also test the code.
Quick Quiz 11.7: Just what invalid assumptions are you accusing Linux kernel hackers of
harboring???
It is quite likely that the Linux kernel community’s review process is ripe for
improvement:
1. There is sometimes a shortage of people with the time and expertise required to
carry out an effective review.
2. Even though all review discussions are archived, they are often “lost” in the sense
that insights are forgotten and people fail to look up the discussions. This can
result in re-insertion of the same old bugs.
11.5.2 Walkthroughs
A traditional code walkthrough is similar to a formal inspection, except that the group
“plays computer” with the code, driven by specific test cases. A typical walkthrough
team has a moderator, a secretary (who records bugs found), a testing expert (who
generates the test cases) and perhaps one to two others. These can be extremely effective,
albeit also extremely time-consuming.
It has been some decades since I have participated in a formal walkthrough, and I
suspect that a present-day walkthrough would use single-stepping debuggers. One could
imagine a particularly sadistic procedure as follows:
11.5.3 Self-Inspection
Although developers are usually not all that effective at inspecting their own code,
there are a number of situations where there is no reasonable alternative. For example,
the developer might be the only person authorized to look at the code, other qualified
developers might all be too busy, or the code in question might be sufficiently bizarre
that the developer is unable to convince anyone else to take it seriously until after
demonstrating a prototype. In these cases, the following procedure can be quite helpful,
especially for complex parallel code:
1. Write a design document with requirements, diagrams for data structures, and
rationale for design choices.
2. Consult with experts, updating the design document as needed.
3. Write the code in pen on paper, correcting errors as you go. Resist the temptation
to refer to pre-existing nearly identical code sequences; instead, copy them.
4. At each step, articulate and question your assumptions, inserting assertions or
constructing tests to check them.
5. If there were errors, copy the code in pen on fresh paper, correcting errors as you
go. Repeat until the last two copies are identical.
6. Produce proofs of correctness for any non-obvious code.
When I follow this procedure for new RCU code, there are normally only a few
bugs left at the end. With a few prominent (and embarrassing) exceptions [McK11a], I
usually manage to locate these bugs before others do. That said, this is getting more
difficult over time as the number and variety of Linux-kernel users increases.
Quick Quiz 11.8: Why would anyone bother copying existing code in pen on paper??? Doesn’t
that just increase the probability of transcription errors?
Quick Quiz 11.9: This procedure is ridiculously over-engineered! How can you expect to get
a reasonable amount of software written doing it this way???
Quick Quiz 11.10: What do you do if, after all the pen-on-paper copying, you find a bug while
typing in the resulting code?
The above procedure works well for new code, but what if you need to inspect code
that you have already written? You can of course apply the above procedure for old
code in the special case where you wrote one to throw away [FPB79], but the following
approach can also be helpful in less desperate circumstances:
This works because describing the code in detail is an excellent way to spot
bugs [Mye79]. This second procedure is also a good way to get your head around
someone else’s code, although the first step often suffices.
Although review and inspection by others is probably more efficient and effective,
the above procedures can be quite helpful in cases where for whatever reason it is not
feasible to involve others.
At this point, you might be wondering how to write parallel code without having to
do all this boring paperwork. Here are some time-tested ways of accomplishing this:
1. Write a sequential program that scales through use of available parallel library
functions.
2. Write sequential plug-ins for a parallel framework, such as map-reduce, BOINC,
or a web-application server.
3. Fully partition your problems, then implement sequential program(s) that run in
parallel without communication.
4. Stick to one of the application areas (such as linear algebra) where tools can
automatically decompose and parallelize the problem.
But the sad fact is that even if you do the paperwork or use one of the above ways to
more-or-less safely avoid paperwork, there will be bugs. If nothing else, more users and
a greater variety of users will expose more bugs more quickly, especially if those users
are doing things that the original developers did not consider. The next section describes
how to handle the probabilistic bugs that occur all too commonly when validating
parallel software.
Quick Quiz 11.11: Wait! Why on earth would an abstract piece of software fail only
sometimes???
So your parallel program fails sometimes. But you used techniques from the earlier
sections to locate the problem and now have a fix in place! Congratulations!!!
Now the question is just how much testing is required in order to be certain that you
actually fixed the bug, as opposed to just reducing the probability of it occurring on
the one hand, having fixed only one of several related bugs on the other hand, or having
made some ineffectual unrelated change on yet a third hand. In short, what is the answer to
the eternal question posed by Figure 11.3?
Unfortunately, the honest answer is that an infinite amount of testing is required to
attain absolute certainty.
Quick Quiz 11.12: Suppose that you had a very large number of systems at your disposal. For
example, at current cloud prices, you can purchase a huge amount of CPU time at low cost.
Why not use this approach to get close enough to certainty for all practical purposes?
But suppose that we are willing to give up absolute certainty in favor of high
probability. Then we can bring powerful statistical tools to bear on this problem.
However, this section will focus on simple statistical tools. These tools are extremely
helpful, but please note that reading this section is not a substitute for statistics classes.6
For our start with simple statistical tools, we need to decide whether we are doing
discrete or continuous testing. Discrete testing features well-defined individual test runs.
6 Which I most highly recommend. The few statistics courses I have taken have provided
[Figure 11.3 (cartoon): “Hooray! I passed the stress test!”]
For example, a boot-up test of a Linux kernel patch is a discrete test: The
kernel either comes up or it does not. Although you might spend an hour boot-testing
your kernel, the number of times you attempted to boot the kernel and the number of
times the boot-up succeeded would often be of more interest than the length of time you
spent testing. Functional tests tend to be discrete.
On the other hand, if my patch involved RCU, I would probably run rcutorture,
which is a kernel module that, strangely enough, tests RCU. Unlike booting the kernel,
where the appearance of a login prompt signals the successful end of a discrete test,
rcutorture will happily continue torturing RCU until either the kernel crashes or until
you tell it to stop. The duration of the rcutorture test is usually of more interest than
the number of times you started and stopped it. Therefore, rcutorture is an example
of a continuous test, a category that includes many stress tests.
Statistics for discrete tests are simpler and more familiar than those for continuous
tests, and furthermore the statistics for discrete tests can often be pressed into service
for continuous tests, though with some loss of accuracy. We therefore start with discrete
tests.
For those preferring formulas, call the probability of a single failure $f$. The probability
of a single success is then $1 - f$ and the probability that all of $n$ tests will succeed is $S_n$:
$$S_n = (1 - f)^n \tag{11.1}$$
The probability of failure is $1 - S_n$, or:
$$F_n = 1 - (1 - f)^n \tag{11.2}$$
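As a quick check on Eqs. 11.1 and 11.2, the following minimal Python sketch evaluates both for a given per-run failure probability. The function names are hypothetical, chosen only for this illustration, and the sketch assumes that test runs are independent and that $f$ is expressed as a fraction (0.1) rather than as a percentage (10).

```python
def all_pass_probability(f: float, n: int) -> float:
    """Eq. 11.1: probability that all n independent runs succeed."""
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must be a fraction between 0 and 1, not a percentage")
    return (1.0 - f) ** n


def any_fail_probability(f: float, n: int) -> float:
    """Eq. 11.2: probability of at least one failure in n runs."""
    return 1.0 - all_pass_probability(f, n)


# Five runs of a test that fails 10 % of the time.
print(f"{any_fail_probability(0.1, 5):.1%}")  # prints 41.0%
```

The range check is there only to catch the percentage-versus-fraction mistake early.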
Quick Quiz 11.13: Say what??? When I plug the earlier five-test 10 %-failure-rate example
into the formula, I get 59,050 % and that just doesn’t make sense!!!
So suppose that a given test has been failing 10 % of the time. How many times do
you have to run the test to be 99 % sure that your supposed fix actually helped?
Another way to ask this question is “How many times would we need to run the test to
cause the probability of failure to rise above 99 %?” After all, if we were to run the test
enough times that the probability of seeing at least one failure becomes 99 %, and there
are no failures, there is only a 1 % probability of this “success” being due to dumb luck.
And if we plug $f = 0.1$ into Eq. 11.2 and vary $n$, we find that 43 runs gives us a 98.92 %
chance of at least one test failing given the original 10 % per-test failure rate, while 44
runs gives us a 99.03 % chance of at least one test failing. So if we run the test on our
fix 44 times and see no failures, there is a 99 % probability that our fix really did help.
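The "plug in and vary $n$" search just described can be automated. The following is a minimal sketch, assuming independent runs; the function name is hypothetical.

```python
def smallest_run_count(f: float, confidence: float) -> int:
    """Smallest n for which Eq. 11.2 reaches the requested confidence.

    f is the per-run failure probability before the fix, as a fraction.
    """
    if not 0.0 < f <= 1.0:
        raise ValueError("f must be in (0, 1]")
    if not 0.0 <= confidence < 1.0:
        raise ValueError("confidence must be in [0, 1)")
    n = 0
    # Keep adding runs until the chance of at least one failure is high enough.
    while 1.0 - (1.0 - f) ** n < confidence:
        n += 1
    return n


print(smallest_run_count(0.1, 0.99))  # prints 44
```

The loop needs more iterations as `f` shrinks, which is one practical reason to prefer a closed-form solution.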
But repeatedly plugging numbers into Eq. 11.2 can get tedious, so let’s solve for $n$:
$$F_n = 1 - (1 - f)^n \tag{11.3}$$
$$1 - F_n = (1 - f)^n \tag{11.4}$$
$$\log (1 - F_n) = n \log (1 - f) \tag{11.5}$$
$$n = \frac{\log (1 - F_n)}{\log (1 - f)} \tag{11.6}$$
Plugging $f = 0.1$ and $F_n = 0.99$ into Eq. 11.6 gives 43.7, meaning that we need 44
consecutive successful test runs to be 99 % certain that our fix was a real improvement.
This matches the number obtained by the previous method, which is reassuring.
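For readers who would rather let the computer do the arithmetic, here is a minimal sketch of Eq. 11.6. The function name and its default confidence are assumptions made for this illustration.

```python
import math


def runs_required(f: float, confidence: float = 0.99) -> float:
    """Eq. 11.6: failure-free runs needed for the given confidence.

    f is the pre-fix per-run failure probability, as a fraction.
    The result is fractional; round up to get a whole number of runs.
    """
    if not 0.0 < f < 1.0:
        raise ValueError("f must be strictly between 0 and 1")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    return math.log(1.0 - confidence) / math.log(1.0 - f)


n = runs_required(0.1)
print(round(n, 1), math.ceil(n))  # prints 43.7 44
```

Rounding up rather than to nearest is deliberate: a fractional result means that the next-lower whole number of runs falls just short of the requested confidence.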
Quick Quiz 11.14: In Eq. 11.6, are the logarithms base-10, base-2, or base-e?
Figure 11.4 shows a plot of this function. Not surprisingly, the less frequently each
test run fails, the more test runs are required to be 99 % confident that the bug has been
fixed. If the bug caused the test to fail only 1 % of the time, then a mind-boggling 458
test runs are required. As the failure probability decreases, the number of test runs
required increases, going to infinity as the failure probability goes to zero.
The moral of this story is that when you have found a rarely occurring bug, your
testing job will be much easier if you can come up with a carefully targeted test with a
much higher failure rate. For example, if your targeted test raised the failure rate from
1 % to 30 %, then the number of runs required for 99 % confidence would drop from
458 to a more tractable 13.
But these thirteen test runs would only give you 99 % confidence that your fix had
produced “some improvement”. Suppose you instead want to have 99 % confidence that
your fix reduced the failure rate by an order of magnitude. How many failure-free test
runs are required?
[Figure 11.4: Number of Tests Required for 99 Percent Confidence Given Failure Rate. Horizontal axis: Per-Run Failure Probability.]
An order-of-magnitude improvement from a 30 % failure rate would be a 3 % failure rate.
Plugging these numbers into Eq. 11.6 yields:
$$n = \frac{\log (1 - 0.99)}{\log (1 - 0.03)} = 151.2 \tag{11.7}$$
So our order of magnitude improvement requires roughly an order of magnitude more
testing. Certainty is impossible, and high probabilities are quite expensive. This is
why making tests run more quickly and making failures more probable are essential
skills in the development of highly reliable software. These skills will be covered in
Section 11.6.4.
$$F_m = \frac{\lambda^m}{m!} e^{-\lambda} \tag{11.8}$$
Here $F_m$ is the probability of $m$ failures in the test and $\lambda$ is the expected failure
rate per unit time. A rigorous derivation may be found in any advanced probability
textbook, for example, Feller’s classic “An Introduction to Probability Theory and Its
Applications” [Fel50], while a more intuitive derivation may be found in the first edition
of this book [McK14c, Equations 11.8–11.26].
Let’s try reworking the example from Section 11.6.2 using the Poisson distribution.
Recall that this example involved a test with a 30 % failure rate per hour, and that the
question was how long the test would need to run error-free on an alleged fix to be 99 %
certain that the fix actually reduced the failure rate. In this case, $m$ is zero, so that
Eq. 11.8 reduces to:
$$F_0 = e^{-\lambda} \tag{11.9}$$
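Solving this requires setting $F_0$ to 0.01 and solving for $\lambda$, resulting in:
$$\lambda = -\ln 0.01 = 4.6 \tag{11.10}$$
More generally, given $n$ failures per unit time and a desired confidence of $P$ %, the
required error-free test duration $T$ is:
$$T = -\frac{1}{n} \ln \frac{100 - P}{100} \tag{11.11}$$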
Quick Quiz 11.15: Suppose that a bug causes a test failure three times per hour on average.
How long must the test run error-free to provide 99.9 % confidence that the fix significantly
reduced the probability of failure?
As before, the less frequently the bug occurs and the greater the required level of
confidence, the longer the required error-free test run.
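Eq. 11.11 is easy to evaluate mechanically. The sketch below assumes that failures arrive randomly at a constant average rate, as the Poisson model requires; the function and parameter names are hypothetical.

```python
import math


def error_free_time(failures_per_unit_time: float, confidence_percent: float) -> float:
    """Eq. 11.11: error-free run time needed for the given percent confidence.

    The result is in the same time unit as the failure rate (for example,
    hours if the rate is failures per hour).
    """
    if failures_per_unit_time <= 0.0:
        raise ValueError("failure rate must be positive")
    if not 0.0 < confidence_percent < 100.0:
        raise ValueError("confidence must be strictly between 0 and 100 percent")
    return -math.log((100.0 - confidence_percent) / 100.0) / failures_per_unit_time


# The running example: 0.3 failures per hour, 99 % confidence.
print(f"{error_free_time(0.3, 99):.1f} hours")
```

Halving the failure rate doubles the required run time, and each additional "nine" of confidence adds a fixed increment of ln 10 divided by the failure rate.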
Suppose that a given test fails about once every hour, but after a bug fix, a 24-hour
test run fails only twice. Assuming that the failure leading to the bug is a random
occurrence, what is the probability that the small number of failures in the second run
was due to random chance? In other words, how confident should we be that the fix
actually had some effect on the bug? This probability may be calculated by summing
Eq. 11.8 as follows:
$$F_0 + F_1 + \dots + F_{m-1} + F_m = \sum_{i=0}^{m} \frac{\lambda^i}{i!} e^{-\lambda} \tag{11.12}$$
This is the Poisson cumulative distribution function, which can be written more
compactly as:
$$F_{i \le m} = \sum_{i=0}^{m} \frac{\lambda^i}{i!} e^{-\lambda} \tag{11.13}$$
Here $m$ is the actual number of errors in the long test run (in this case, two) and $\lambda$
is the expected number of errors in the long test run (in this case, 24). Plugging $m = 2$
and $\lambda = 24$ into this expression gives the probability of two or fewer failures as about
$1.2 \times 10^{-8}$. In other words, we have a high level of confidence that the fix actually had
some relationship to the bug.7
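The summation in Eq. 11.13 is short enough to evaluate directly. The following minimal sketch does so using only the Python standard library; the function name is an assumption made for this illustration.

```python
import math


def poisson_cdf(m: int, lam: float) -> float:
    """Eq. 11.13: probability of m or fewer failures when lam are expected."""
    if m < 0 or lam < 0:
        raise ValueError("m and lam must be non-negative")
    return math.exp(-lam) * sum(lam ** i / math.factorial(i) for i in range(m + 1))


# Two failures observed in a 24-hour run where 24 were expected.
print(f"{poisson_cdf(2, 24):.1e}")  # prints 1.2e-08
```

For large values of `lam` or `m`, the individual terms can become very large before being scaled down, so a statistics library would be a more robust choice than this direct sum.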
Quick Quiz 11.16: Doing the summation of all the factorials and exponentials is a real pain.
Isn’t there an easier way?
Quick Quiz 11.17: But wait!!! Given that there has to be some number of failures (including
the possibility of zero failures), shouldn’t Eq. 11.13 approach the value 1 as 𝑚 goes to infinity?
The Poisson distribution is a powerful tool for analyzing test results, but the fact is
that in this last example there were still two remaining test failures in a 24-hour test
run. Such a low failure rate results in very long test runs. The next section discusses
counter-intuitive ways of improving this situation.
7 Of course, this result in no way excuses you from finding and fixing the bug(s) resulting
observer effect from classical physics. Nevertheless, the name has stuck.
Consider the count-lossy code in Section 5.1. Adding printf() statements will likely
greatly reduce or even eliminate the lost counts. However, converting the load-add-store
sequence to a load-add-delay-store sequence will greatly increase the incidence of lost
counts (try it!). Once you spot a bug involving a race condition, it is frequently possible
to create an anti-heisenbug by adding delay in this manner.
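The load-add-delay-store idea can be sketched as follows. This is a Python stand-in rather than the C code of Section 5.1, and the function name, thread count, and delay value are all assumptions chosen for illustration. The counter is deliberately updated without any synchronization.

```python
import threading
import time


def lossy_count(nthreads: int, increments: int, delay: float) -> int:
    """Increment a shared counter with no locking and return its final value."""
    counter = 0

    def worker() -> None:
        nonlocal counter
        for _ in range(increments):
            tmp = counter          # load
            tmp += 1               # add
            if delay > 0:
                time.sleep(delay)  # delay: widens the race window
            counter = tmp          # store

    threads = [threading.Thread(target=worker) for _ in range(nthreads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter


if __name__ == "__main__":
    expected = 4 * 20
    print("expected", expected, "got", lossy_count(4, 20, 0.001))
```

With the delay in place, the final count usually falls far short of the expected value, because each thread sleeps while holding a stale copy of the counter. Removing the delay makes lost counts much rarer, which is exactly the heisenbug-versus-anti-heisenbug contrast.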
Of course, this raises the question of how to find the race condition in the first place.
Although very lucky developers might accidentally create delay-based anti-heisenbugs
when adding debug code, this is in general a dark art. Nevertheless, there are a number
of things you can do to find your race conditions.
One approach is to recognize that race conditions often end up corrupting some
of the data involved in the race. It is therefore good practice to double-check the
synchronization of any corrupted data. Even if you cannot immediately recognize
the race condition, adding delay before and after accesses to the corrupted data might
change the failure rate. By adding and removing the delays in an organized fashion (e.g.,
binary search), you might learn more about the workings of the race condition.
Quick Quiz 11.18: How is this approach supposed to help if the corruption affected some
unrelated pointer, which then caused the corruption???
Another important approach is to vary the software and hardware configuration and
look for statistically significant differences in failure rate. For example, back in the
1990s, it was common practice to test on systems having CPUs running at different
clock rates, which tended to make some types of race conditions more probable. One
way of getting a similar effect today is to test on multi-socket systems, thus incurring
the large delays described in Section 3.2.
However you choose to add delays, you can then look more intensively at the code
implicated by those delays that make the greatest difference in failure rate. It might be
helpful to test that code in isolation, for example.
One important aspect of software configuration is the history of changes, which is
why git bisect is so useful. Bisection of the change history can provide very valuable
clues as to the nature of the heisenbug, in this case presumably by locating a commit
that shows a change in the software’s response to the addition or removal of a given
delay.
Quick Quiz 11.19: But I did the bisection, and ended up with a huge commit. What do I do
now?
Once you locate the suspicious section of code, you can then introduce delays to
attempt to increase the probability of failure. As we have seen, increasing the probability
of failure makes it much easier to gain high confidence in the corresponding fix.
However, it is sometimes quite difficult to track down the problem using normal
debugging techniques. The following sections present some other alternatives.
It is often the case that a given test suite places relatively low stress on a given subsystem,
so that a small change in timing can cause a heisenbug to disappear. One way to create
an anti-heisenbug for this case is to increase the workload intensity, which has a good
chance of increasing the bug’s probability. If the probability is increased sufficiently, it
may be possible to add lightweight diagnostics such as tracing without causing the bug
to vanish.
How can you increase the workload intensity? This depends on the program, but here
are some things to try:
4. Change the size of the problem, for example, if doing a parallel matrix multiply,
change the size of the matrix. Larger problems may introduce more complexity,
but smaller problems often increase the level of contention. If you aren’t sure
whether you should go large or go small, just try both.
However, it is often the case that the bug is in a specific subsystem, and the structure
of the program limits the amount of stress that can be applied to that subsystem. The
next section addresses this situation.
[Figure: RCU timeline along a Time axis, marking call_rcu(), Grace-Period Start, Grace-Period End, and Callback Invocation, with readers labeled Near Miss and Error.]
near misses as the error condition could therefore result in false positives, which need
to be avoided in the automated rcutorture testing.
By sheer dumb luck, rcutorture happens to include some statistics that are sensitive
to the near-miss version of the grace period. As noted above, these statistics are subject
to false positives due to their unsynchronized access to RCU’s state variables, but these
false positives turn out to be extremely rare on strongly ordered systems such as the
IBM mainframe and x86, occurring less than once per thousand hours of testing.
These near misses occurred roughly once per hour, about two orders of magnitude
more frequently than the actual errors. Use of these near misses allowed the bug’s root
cause to be identified in less than a week and a high degree of confidence in the fix to
be built in less than a day. In contrast, excluding the near misses in favor of the real
errors would have required months of debug and validation time.
To sum up near-miss counting, the general approach is to replace counting of
infrequent failures with more-frequent near misses that are believed to be correlated
with those failures. These near-misses can be considered an anti-heisenbug to the real
failure’s heisenbug because the near-misses, being more frequent, are likely to be more
robust in the face of changes to your code, for example, the changes you make to add
debugging code.
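To get a rough feel for why near-miss counting pays off, the zero-failure Poisson calculation can be applied to both event rates. This is only an illustrative sketch: the 99 % confidence level is an assumption, "two orders of magnitude" is taken to be exactly a factor of 100, and the function name is hypothetical.

```python
import math


def hours_for_confidence(events_per_hour: float, confidence: float = 0.99) -> float:
    """Event-free hours needed so that seeing zero events by chance
    has probability 1 - confidence, assuming Poisson arrivals."""
    if events_per_hour <= 0.0 or not 0.0 < confidence < 1.0:
        raise ValueError("rate must be positive and confidence in (0, 1)")
    return -math.log(1.0 - confidence) / events_per_hour


near_miss_rate = 1.0                 # roughly once per hour, per the text
error_rate = near_miss_rate / 100.0  # assumed: two orders of magnitude rarer

print(f"near misses: {hours_for_confidence(near_miss_rate):.1f} hours")
print(f"real errors: {hours_for_confidence(error_rate):.1f} hours")
```

Under these assumptions, a single validation run shrinks from hundreds of hours to a few hours, and every debugging iteration benefits from the same factor.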
enough (both less likely than you might think), then the formal-verification tools described in
Chapter 12 can be helpful.
a vCPU was instead an interrupt storm preventing that vCPU from making forward
progress on the interrupted code. If the code you are debugging is new to you, this log is
also an excellent place to document the relationships between code and data structures.
Keeping a log when you are furiously chasing a difficult bug might seem like needless
paperwork, but it has on many occasions saved me from debugging around and around
in circles, which can waste far more time than keeping a log ever could.
Using Taleb’s nomenclature [Tal07], a white swan is a bug that we can reproduce. We
can run a large number of tests, use ordinary statistics to estimate the bug’s probability,
and use ordinary statistics again to estimate our confidence in a proposed fix. An
unsuspected bug is a black swan. We know nothing about it, we have no tests that
have yet caused it to happen, and statistics is of no help. Studying our own behavior,
especially the number and types of mistakes we make, can turn black swans into grey
swans. We might not know exactly what the bugs are, but we have some idea of their
number and maybe also of their type. Ordinary statistics is still of no help (at least not
until we are able to reproduce one of the bugs), but robust13 testing methods can be of
great help. The goal, therefore, is to use experience and good validation practices to
turn the black swans grey, focused testing and analysis to turn the grey swans white, and
ordinary methods to fix the white swans.
That said, thus far, we have focused solely on bugs in the parallel program’s
functionality. However, performance is a first-class requirement for a parallel program.
Otherwise, why not write a sequential program? To repurpose Kipling, our goal when
writing parallel code is to fill the unforgiving second with sixty minutes’ worth of
distance run. The next section therefore discusses a number of performance bugs that
would be happy to thwart this Kiplingesque goal.
Parallel programs usually have performance and scalability requirements. After all, if
performance is not an issue, why not use a sequential program? Ultimate performance
and linear scalability might not be necessary, but there is little use for a parallel program
that runs slower than its optimal sequential counterpart. And there really are cases
where every microsecond matters and every nanosecond is needed. Therefore, for
parallel programs, insufficient performance is just as much a bug as is incorrectness.
Quick Quiz 11.21: That is ridiculous!!! After all, isn’t getting the correct answer later than
one would like better than getting an incorrect answer???
Quick Quiz 11.22: But if you are going to put in all the hard work of parallelizing an
application, why not do it right? Why settle for anything less than optimal performance and
linear scalability?
Validating a parallel program must therefore include validating its performance. But
validating performance means having a workload to run and performance criteria with
which to evaluate the program at hand. These needs are often met by performance
benchmarks, which are discussed in the next section.
13 That is to say brutal.
11.7.1 Benchmarking
Frequent abuse aside, benchmarks are both useful and heavily used, so it is not helpful
to be too dismissive of them. Benchmarks span the range from ad hoc test jigs to
international standards, but regardless of their level of formality, benchmarks serve four
major purposes:
Of course, the only completely fair framework is the intended application itself. So
why would anyone who cared about fairness in benchmarking bother creating imperfect
benchmarks rather than simply using the application itself as the benchmark?
Running the actual application is in fact the best approach where it is practical.
Unfortunately, it is often impractical for the following reasons:
1. The application might be proprietary, and you might not have the right to run the
intended application.
2. The application might require more hardware than you have access to.
3. The application might use data that you cannot access, for example, due to privacy
regulations.
4. The application might take longer than is convenient to reproduce a performance
or scalability problem.14
Creating a benchmark that approximates the application can help overcome these
obstacles. A carefully constructed benchmark can help promote performance, scalability,
energy efficiency, and much else besides. However, be careful to avoid investing too
much into the benchmarking effort. It is after all important to invest at least a little into
the application itself [Gra91].
11.7.2 Profiling
In many cases, a fairly small portion of your software is responsible for the majority of
the performance and scalability shortfall. However, developers are notoriously unable
to identify the actual bottlenecks by inspection. For example, in the case of a kernel
buffer allocator, all attention focused on a search of a dense array which turned out to
represent only a few percent of the allocator’s execution time. An execution profile
collected via a logic analyzer focused attention on the cache misses that were actually
responsible for the majority of the problem [MS93].
An old-school but quite effective method of tracking down performance and scalability
bugs is to run your program under a debugger, then periodically interrupt it, recording
the stacks of all threads at each interruption. The theory here is that if something is
slowing down your program, it has to be visible in your threads’ executions.
That said, there are a number of tools that will usually do a much better job of helping
you to focus your attention where it will do the most good. Two popular choices are
gprof and perf. To use perf on a single-process program, prefix your command with
perf record, then after the command completes, type perf report. There is a lot
of work on tools for performance debugging of multi-threaded programs, which should
make this important job easier. Again, one good starting point is Brendan Gregg’s
blog.15
11.7.4 Microbenchmarking
Microbenchmarking can be useful when deciding which algorithms or data structures
are worth incorporating into a larger body of software for deeper evaluation.
One common approach to microbenchmarking is to measure the time, run some
number of iterations of the code under test, then measure the time again. The difference
between the two times divided by the number of iterations gives the measured time
required to execute the code under test.
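The measure-iterate-measure approach can be sketched in a few lines. This is a minimal illustration only: the function name is hypothetical, and `time.perf_counter_ns()` is simply one convenient monotonic clock.

```python
import time


def time_per_iteration_ns(func, iterations: int) -> float:
    """Average wall-clock nanoseconds per call of func over `iterations` calls."""
    if iterations <= 0:
        raise ValueError("iterations must be positive")
    start = time.perf_counter_ns()      # measure the time
    for _ in range(iterations):         # run the code under test
        func()
    end = time.perf_counter_ns()        # measure the time again
    return (end - start) / iterations


# Example: time a small list sort.
data = list(range(100, 0, -1))
print(time_per_iteration_ns(lambda: sorted(data), 10_000), "ns per iteration")
```

Note that the measured value includes the overhead of the loop and of the function call itself, in addition to the overhead of reading the clock.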
Unfortunately, this approach to measurement allows any number of errors to creep in,
including:
1. The measurement will include some of the overhead of the time measurement.
This source of error can be reduced to an arbitrarily small value by increasing the
number of iterations.
2. The first few iterations of the test might incur cache misses or (worse yet) page
faults that might inflate the measured value. This source of error can also be
reduced by increasing the number of iterations, or it can often be eliminated entirely
by running a few warm-up iterations before starting the measurement period. Most
systems have ways of detecting whether a given process incurred a page fault,
and you should make use of this to reject runs whose performance has been thus
impeded.
15 http://www.brendangregg.com/blog/
3. Some types of interference, for example, random memory errors, are so rare that
they can be dealt with by running a number of sets of iterations of the test. If the
level of interference was statistically significant, any performance outliers could be
rejected statistically.
4. Any iteration of the test might be interfered with by other activity on the system.
Sources of interference include other applications, system utilities and daemons,
device interrupts, firmware interrupts (including system management interrupts, or
SMIs), virtualization, memory errors, and much else besides. Assuming that these
sources of interference occur randomly, their effect can be minimized by reducing
the number of iterations.
The first and fourth sources of interference provide conflicting advice, which is one
sign that we are living in the real world.
Quick Quiz 11.23: But what about other sources of error, for example, due to interactions
between caches and memory layout?
The following sections discuss ways of dealing with these measurement errors, with
Section 11.7.5 covering isolation techniques that may be used to prevent some forms of
interference, and with Section 11.7.6 covering methods for detecting interference so as
to reject measurement data that might have been corrupted by that interference.
11.7.5 Isolation
The Linux kernel provides a number of ways to isolate a group of CPUs from outside
interference.
First, let’s look at interference by other processes, threads, and tasks. The Linux-specific
sched_setaffinity() system call may be used to move most tasks off of a given
set of CPUs and to confine your tests to that same group. The Linux-specific user-level
taskset command may be used for the same purpose, though both sched_setaffinity()
and taskset require elevated permissions when applied to tasks that you do not own.
Linux-specific control groups (cgroups) may be used for this same purpose. This approach
can be quite effective at reducing interference, and is sufficient in many cases. However,
it does have limitations. For example, it cannot do anything about the per-CPU kernel
threads that are often used for housekeeping tasks.
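The "confine your tests" half of this approach can be sketched from Python, which exposes the underlying system call as `os.sched_setaffinity()` on Linux. The helper name below is hypothetical, and the sketch only pins the calling process; moving other tasks off of the chosen CPUs is a separate step that is not shown.

```python
import os


def confine_to_cpus(cpus):
    """Pin the calling process to the given CPUs.

    Returns the resulting affinity set, or None on platforms that do not
    provide sched_setaffinity().
    """
    if not hasattr(os, "sched_setaffinity"):
        return None
    os.sched_setaffinity(0, cpus)      # pid 0 means the calling process
    return os.sched_getaffinity(0)
```

For example, `confine_to_cpus({2, 3})` would restrict the test to CPUs 2 and 3, assuming that those CPUs exist and are available to the process. Pinning the calling process itself normally needs no special privilege.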
One way to avoid interference from per-CPU kernel threads is to run your test at
a high real-time priority, for example, by using the POSIX sched_setscheduler()
system call. However, note that if you do this, you are implicitly taking on responsibility
for avoiding infinite loops, because otherwise your test can prevent part of the kernel
from functioning. This is an example of the Spiderman Principle: “With great power
comes great responsibility.” And although the default real-time throttling settings often
16 Systems with adequate cooling tend to look like gaming systems.
address such problems, they might do so by causing your real-time threads to miss their
deadlines.
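Requesting a real-time priority might look as follows from Python on Linux. This is a sketch under stated assumptions: the helper name and the choice of `SCHED_FIFO` at priority 1 are illustrative, and the call is expected to fail without suitable privileges, in which case the helper simply reports failure.

```python
import os


def try_realtime(priority: int = 1) -> bool:
    """Try to switch the calling process to SCHED_FIFO at the given priority.

    Returns True on success and False if the platform lacks the call or
    the request is refused (for example, for lack of privilege).
    """
    if not hasattr(os, "sched_setscheduler"):
        return False
    try:
        os.sched_setscheduler(0, os.SCHED_FIFO, os.sched_param(priority))
    except OSError:
        return False
    return True
```

If this succeeds, the warning above applies in full: a test that spins forever at real-time priority can starve whatever else needs that CPU.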
These approaches can greatly reduce, and perhaps even eliminate, interference
from processes, threads, and tasks. However, they do nothing to prevent interference
from device interrupts, at least in the absence of threaded interrupts. Linux
allows some control of threaded interrupts via the /proc/irq directory, which
contains numerical directories, one per interrupt vector. Each numerical directory
contains smp_affinity and smp_affinity_list. Given sufficient permissions,
you can write a value to these files to restrict interrupts to the specified
set of CPUs. For example, either “echo 3 > /proc/irq/23/smp_affinity” or
“echo 0-1 > /proc/irq/23/smp_affinity_list” would confine interrupts on
vector 23 to CPUs 0 and 1, at least given sufficient privileges. You can use “cat
/proc/interrupts” to obtain a list of the interrupt vectors on your system, how many
are handled by each CPU, and what devices use each interrupt vector.
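The two files take the same information in different notations: smp_affinity takes a hexadecimal bitmask and smp_affinity_list takes a list of CPU ranges. The following sketch converts a set of CPU numbers into both forms. The function names are hypothetical, and as a simplification the mask helper handles only CPUs 0 through 31, so that a single hexadecimal word suffices.

```python
def cpus_to_mask(cpus) -> str:
    """Hexadecimal bitmask for smp_affinity (CPUs 0-31 only in this sketch)."""
    mask = 0
    for cpu in cpus:
        if not 0 <= cpu < 32:
            raise ValueError("this sketch only handles CPUs 0 through 31")
        mask |= 1 << cpu               # one bit per CPU
    return format(mask, "x")


def cpus_to_list(cpus) -> str:
    """Range-list string for smp_affinity_list, for example '0-1,4'."""
    ranges = []
    start = prev = None
    for cpu in sorted(set(cpus)):
        if start is None:
            start = prev = cpu
        elif cpu == prev + 1:
            prev = cpu                 # extend the current range
        else:
            ranges.append((start, prev))
            start = prev = cpu
    if start is not None:
        ranges.append((start, prev))
    return ",".join(str(a) if a == b else f"{a}-{b}" for a, b in ranges)


print(cpus_to_mask([0, 1]), cpus_to_list([0, 1]))  # prints 3 0-1
```

These helpers only build the strings. Actually writing them to the files under /proc/irq still requires the permissions discussed above.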
Running a similar command for all interrupt vectors on your system would confine
interrupts to CPUs 0 and 1, leaving the remaining CPUs free of interference. Or mostly
free of interference, anyway. It turns out that the scheduling-clock interrupt fires on
each CPU that is running in user mode.17 In addition you must take care to ensure that
the set of CPUs that you confine the interrupts to is capable of handling the load.
But this only handles processes and interrupts running in the same operating-system
instance as the test. Suppose that you are running the test in a guest OS that is itself
running on a hypervisor, for example, Linux running KVM? Although you can in theory
apply the same techniques at the hypervisor level that you can at the guest-OS level, it is
quite common for hypervisor-level operations to be restricted to authorized personnel.
In addition, none of these techniques work against firmware-level interference.
Quick Quiz 11.24: Wouldn’t the techniques suggested to isolate the code under test also affect
that code’s performance, particularly if it is running within a larger application?
scheduling-clock interrupts to be disabled on CPUs that have only one runnable task.
As of 2021, this is largely complete.
Opening and reading files is not the way to low overhead, and it is possible to get the
count of context switches for a given thread by using the getrusage() system call, as
shown in Listing 11.1. This same system call can be used to detect minor page faults
(ru_minflt) and major page faults (ru_majflt).
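As a rough illustration of this approach (a sketch, not a copy of Listing 11.1), the following C fragment snapshots the context-switch and page-fault counters before and after a measurement. The structure and function names are hypothetical. The sketch assumes Linux, where RUSAGE_THREAD restricts the counts to the calling thread, and falls back to process-wide counts elsewhere.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* Exposes RUSAGE_THREAD on Linux. */
#endif
#include <assert.h>
#include <stddef.h>
#include <sys/resource.h>

#ifdef RUSAGE_THREAD
#define RUSAGE_WHO RUSAGE_THREAD    /* Calling thread only. */
#else
#define RUSAGE_WHO RUSAGE_SELF      /* Fallback: whole process. */
#endif

/* Hypothetical container for the counters of interest. */
struct interference_snapshot {
    long ctxsw;     /* Voluntary plus involuntary context switches. */
    long minflt;    /* Minor page faults. */
    long majflt;    /* Major page faults. */
};

/* Fill in *s from getrusage().  Returns 0 on success, -1 on error. */
int take_snapshot(struct interference_snapshot *s)
{
    struct rusage ru;

    assert(s != NULL);
    if (getrusage(RUSAGE_WHO, &ru) != 0)
        return -1;
    s->ctxsw = ru.ru_nvcsw + ru.ru_nivcsw;
    s->minflt = ru.ru_minflt;
    s->majflt = ru.ru_majflt;
    return 0;
}

/* Return nonzero if any counter changed between the two snapshots. */
int interference_detected(const struct interference_snapshot *before,
                          const struct interference_snapshot *after)
{
    return after->ctxsw != before->ctxsw ||
           after->minflt != before->minflt ||
           after->majflt != before->majflt;
}
```

A test harness might take one snapshot before the timed region and another after it, discarding the measurement whenever interference_detected() returns nonzero.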
Unfortunately, detecting memory errors and firmware interference is quite system-
specific, as is the detection of interference due to virtualization. Although avoidance is
better than detection, and detection is better than statistics, there are times when one
must avail oneself of statistics, a topic addressed in the next section.
The fact that smaller measurements are more likely to be accurate than larger
measurements suggests that sorting the measurements in increasing order is likely to be
productive.18 The fact that the measurement uncertainty is known allows us to accept
measurements within this uncertainty of each other: If the effects of interference are
large compared to this uncertainty, this will ease rejection of bad data. Finally, the fact
that some fraction (for example, one third) can be assumed to be good allows us to
blindly accept the first portion of the sorted list, and this data can then be used to gain
an estimate of the natural variation of the measured data, over and above the assumed
measurement error.
The approach is to take the specified number of leading elements from the beginning
of the sorted list, and use these to estimate a typical inter-element delta, which in turn
may be multiplied by the number of elements in the list to obtain an upper bound on
permissible values. The algorithm then repeatedly considers the next element of the list.
If it falls below the upper bound, and if the distance between the next element and the
previous element is not too much greater than the average inter-element distance for the
portion of the list accepted thus far, then the next element is accepted and the process
repeats. Otherwise, the remainder of the list is rejected.
18 To paraphrase the old saying, “Sort first and ask questions later.”
Listing 11.2 shows a simple sh/awk script implementing this notion. Input consists
of an x-value followed by an arbitrarily long list of y-values, and output consists of one
line for each input line, with fields as follows:
1. The x-value.
--divisor: Number of segments to divide the list into, for example, a divisor of four
means that the first quarter of the data elements will be assumed to be good. This
defaults to three.
--relerr: Relative measurement error. The script assumes that values that differ by
less than this error are for all intents and purposes equal. This defaults to 0.01,
which is equivalent to 1 %.
Lines 1–3 of Listing 11.2 set the default values for the parameters, and lines 4–21 parse
any command-line overriding of these parameters. The awk invocation on line 23 sets
the values of the divisor, relerr, and trendbreak variables to their sh counterparts.
In the usual awk manner, lines 24–50 are executed on each input line. The loop spanning
lines 24 and 25 copies the input y-values to the d array, which line 26 sorts into increasing
order. Line 27 computes the number of trustworthy y-values by applying divisor and
rounding up.
Lines 28–32 compute the maxdelta lower bound on the upper bound of y-values. To
this end, line 29 multiplies the difference in values over the trusted region of data by the
divisor, which projects the difference in values across the trusted region across the
entire set of y-values. However, this value might well be much smaller than the relative
error, so line 30 computes the absolute error (d[i] * relerr) and adds that to the
difference delta across the trusted portion of the data. Lines 31 and 32 then compute
the maximum of these two values.
Each pass through the loop spanning lines 33–40 attempts to add another data value to
the set of good data. Lines 34–39 compute the trend-break delta, with line 34 disabling
this limit if we don’t yet have enough values to compute a trend, and with line 37
multiplying trendbreak by the average difference between pairs of data values in the
good set. If line 38 determines that the candidate data value would exceed the lower
bound on the upper bound (maxdelta) and that the difference between the candidate
data value and its predecessor exceeds the trend-break difference (maxdiff), then
line 39 exits the loop: We have the full good set of data.
Lines 41–49 then compute and print statistics.
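For readers more comfortable with C than with awk, the selection step described above can be sketched as follows. This follows the prose description rather than translating Listing 11.2 line by line, so details such as rounding might differ from the script. The function name count_good() is hypothetical, and its divisor, relerr, and trendbreak parameters are assumed to play the same roles as the script's variables of the same names.

```c
#include <assert.h>
#include <stdlib.h>

static int cmp_double(const void *a, const void *b)
{
    double x = *(const double *)a;
    double y = *(const double *)b;

    return (x > y) - (x < y);
}

/*
 * Sort d[0..n-1] into increasing order and return the number of
 * leading elements accepted as good data.
 */
size_t count_good(double *d, size_t n, size_t divisor,
                  double relerr, double trendbreak)
{
    size_t ntrust;      /* Number of elements trusted blindly. */
    size_t j;
    double delta, maxdelta, maxdelta1, maxdiff;

    assert(n > 0 && divisor >= 1);
    qsort(d, n, sizeof(d[0]), cmp_double);
    ntrust = (n + divisor - 1) / divisor;       /* Round up. */

    /* Lower bound on the upper bound of acceptable values. */
    delta = d[ntrust - 1] - d[0];
    maxdelta = delta * divisor;
    maxdelta1 = delta + d[ntrust - 1] * relerr;
    if (maxdelta1 > maxdelta)
        maxdelta = maxdelta1;

    /* Try to extend the good set one element at a time. */
    for (j = ntrust; j < n; j++) {
        if (j < 2)      /* Too few values for a trend: disable limit. */
            maxdiff = d[n - 1] - d[0];
        else            /* trendbreak times the average gap so far. */
            maxdiff = trendbreak * (d[j - 1] - d[0]) / (j - 1);
        if (d[j] - d[0] > maxdelta && d[j] - d[j - 1] > maxdiff)
            break;
    }
    return j;
}
```

After the call, d[0] through d[count - 1] hold the accepted values, from which the minimum, maximum, and average can be computed directly.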
Quick Quiz 11.25: This approach is just plain weird! Why not use means and standard
deviations, like we were taught in our statistics classes?
Quick Quiz 11.26: But what if all the y-values in the trusted group of data are exactly zero?
Won’t that cause the script to reject any non-zero value?
Although statistical interference detection can be quite useful, it should be used only
as a last resort. It is far better to avoid interference in the first place (Section 11.7.5), or,
failing that, to detect interference via measurement (Section 11.7.6.1).
11.8 Summary
Although validation never will be an exact science, much can be gained by taking
an organized approach to it, as an organized approach will help you choose the right
validation tools for your job, avoiding situations like the one fancifully depicted in
Figure 11.6.
A key choice is that of statistics. Although the methods described in this chapter
work very well most of the time, they do have their limitations, courtesy of the Halting
Problem [Tur37, Pul00]. Fortunately for us, there is a huge number of special cases in
which we can not only work out whether a program will halt, but also estimate how long
it will run before halting, as discussed in Section 11.7. Furthermore, in cases where a
given program might or might not work correctly, we can often establish estimates for
what fraction of the time it will work correctly, as discussed in Section 11.6.
Nevertheless, unthinking reliance on these estimates is brave to the point of fool-
hardiness. After all, we are summarizing a huge mass of complexity in code and data
structures down to a single solitary number. Even though we can get away with such
bravery a surprisingly large fraction of the time, abstracting all that code and data away
will occasionally cause severe problems.
One possible problem is variability, where repeated runs give wildly different results.
This problem is often addressed using standard deviation. However, using two numbers
to summarize the behavior of a large and complex program is about as brave as using
only one number. In computer programming, the surprising thing is that use of the
mean or the mean and standard deviation is often sufficient. Nevertheless, there are no
guarantees.
One cause of variation is confounding factors. For example, the CPU time consumed
by a linked-list search will depend on the length of the list. Averaging together runs
with wildly different list lengths will probably not be useful, and adding a standard
deviation to the mean will not be much better. The right thing to do would be to control for
list length, either by holding the length constant or by measuring CPU time as a function
of list length.
Of course, this advice assumes that you are aware of the confounding factors, and
Murphy says that you will not be. I have been involved in projects that had confounding
factors as diverse as air conditioners (which drew considerable power at startup, thus
causing the voltage supplied to the computer to momentarily drop too low, sometimes
resulting in failure), cache state (resulting in odd variations in performance), I/O errors
(including disk errors, packet loss, and duplicate Ethernet MAC addresses), and even
porpoises (which could not resist playing with an array of transponders, which could be
otherwise used for high-precision acoustic positioning and navigation). And this is but
one reason why a good night’s sleep is such an effective debugging tool.
In short, validation always will require some measure of the behavior of the system.
To be at all useful, this measure must be a severe summarization of the system, which in
turn means that it can be misleading. So as the saying goes, “Be careful. It is a real
world out there.”
But what if you are working on the Linux kernel, which as of 2017 was estimated to
have more than 20 billion instances running throughout the world? In that case, a bug
that occurs once every million years on a single system will be encountered more than
50 times per day across the installed base. A test with a 50 % chance of encountering
this bug in a one-hour run would need to increase that bug’s probability of occurrence
by nearly ten orders of magnitude, which poses a severe challenge to today’s testing
methodologies. One important tool that can sometimes be applied with good effect
to such situations is formal verification, the subject of the next chapter, and, more
speculatively, Section 17.4.
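The installed-base arithmetic above can be checked with a few lines of C. This is a back-of-the-envelope sketch: the function name encounters_per_day() is hypothetical, and a 365.25-day year is assumed.

```c
#include <assert.h>

/*
 * Expected number of times per day that a bug is encountered across
 * an installed base, given the mean time between occurrences on a
 * single system.
 */
double encounters_per_day(double instances, double years_between_failures)
{
    const double days_per_year = 365.25;    /* Assumed year length. */

    assert(instances >= 0 && years_between_failures > 0);
    return instances / (years_between_failures * days_per_year);
}
```

With 20 billion instances and one occurrence per million years per system, this gives a bit under 55 encounters per day, consistent with the “more than 50 times per day” figure.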
The topic of choosing a validation plan, be it testing, formal verification, or both, is
taken up by Section 12.7.
Beware of bugs in the above code; I have only proved
it correct, not tried it.
Donald Knuth
Chapter 12
Formal Verification
Parallel algorithms can be hard to write, and even harder to debug. Testing, though
essential, is insufficient, as fatal race conditions can have extremely low probabilities
of occurrence. Proofs of correctness can be valuable, but in the end are just as prone
to human error as is the original algorithm. In addition, a proof of correctness cannot
be expected to find errors in your assumptions, shortcomings in the requirements,
misunderstandings of the underlying software or hardware primitives, or errors that you
did not think to construct a proof for. This means that formal methods can never replace
testing. Nevertheless, formal methods can be a valuable addition to your validation
toolbox.
It would be very helpful to have a tool that could somehow locate all race conditions.
A number of such tools exist. For example, Section 12.1 provides an introduction to
the general-purpose state-space search tools Promela and Spin, Section 12.2 similarly
introduces the special-purpose ppcmem tool, Section 12.3 looks at an example axiomatic
approach, Section 12.4 briefly overviews SAT solvers, Section 12.5 briefly overviews
stateless model checkers, Section 12.6 sums up use of formal-verification tools for
verifying parallel algorithms, and finally Section 12.7 discusses how to decide how
much and what type of validation to apply to a given software project.
This section features the general-purpose Promela and Spin tools, which may be used
to carry out a full state-space search of many types of multi-threaded code. They are
also used to verify data communication protocols. Section 12.1.1 introduces Promela and
Spin, including a couple of warm-up exercises verifying both non-atomic and atomic
increment. Section 12.1.2 describes use of Promela, including example command
lines and a comparison of Promela syntax to that of C. Section 12.1.3 shows how
Promela may be used to verify locking, Section 12.1.4 uses Promela to verify an unusual
implementation of RCU named “QRCU”, and finally Section 12.1.5 applies Promela to
early versions of RCU’s dyntick-idle implementation.
12.1. STATE-SPACE SEARCH 359
This will produce output as shown in Listing 12.2. The first line tells us that our
assertion was violated (as expected given the non-atomic increment!). The second line
says that a trail file was written describing how the assertion was violated. The “Warning”
line reiterates that all was not well with our model. The second paragraph describes the
type of state-search being carried out, in this case for assertion violations and invalid
end states. The third paragraph gives state-size statistics: This small model had only 45
states. The final line shows memory usage.
The trail file may be rendered human-readable as follows:
spin -t -p increment.spin
This gives the output shown in Listing 12.3. As can be seen, the first portion of the
init block created both incrementer processes, both of which first fetched the counter,
then both incremented and stored it, losing a count. The assertion then triggered, after
which the global state is displayed.
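To get a feel for what a full search of this tiny model involves, the following C sketch enumerates every interleaving of two incrementers, each of which first fetches the counter and then stores the incremented value. It is only an analogy to what Spin does: it walks every execution path rather than recording visited states, and all names in it are hypothetical.

```c
#include <assert.h>

#define NPROCS 2

struct model_state {
    int counter;        /* Shared counter. */
    int tmp[NPROCS];    /* Each process's fetched copy. */
    int pc[NPROCS];     /* 0: about to fetch, 1: about to store, 2: done. */
};

struct search_result {
    int min_final;      /* Smallest final counter value seen. */
    int max_final;      /* Largest final counter value seen. */
    int nruns;          /* Number of complete interleavings. */
};

/* Recursively try every process that can still take a step. */
static void explore(struct model_state s, struct search_result *r)
{
    int p;
    int done = 1;

    for (p = 0; p < NPROCS; p++) {
        struct model_state next = s;

        if (s.pc[p] == 2)
            continue;
        done = 0;
        if (s.pc[p] == 0)
            next.tmp[p] = s.counter;        /* Fetch. */
        else
            next.counter = s.tmp[p] + 1;    /* Increment and store. */
        next.pc[p]++;
        explore(next, r);
    }
    if (done) {
        if (s.counter < r->min_final)
            r->min_final = s.counter;
        if (s.counter > r->max_final)
            r->max_final = s.counter;
        r->nruns++;
    }
}

struct search_result search_increment_model(void)
{
    struct model_state init = { 0 };
    struct search_result r = { NPROCS + 1, -1, 0 };

    explore(init, &r);
    return r;
}
```

For two incrementers there are six interleavings, and the smallest final value is 1 rather than 2, which is the lost count that the assertion catches.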
Table 12.1:
# incrementers   # states    memory (MB)
1                11          128.7
2                52          128.7
3                372         128.7
4                3,496       128.9
5                40,221      131.7
6                545,720     174.0
7                8,521,446   881.9
It is easy to fix this example by placing the body of the incrementer processes in an
atomic block as shown in Listing 12.4. One could also have simply replaced the pair of
statements with counter = counter + 1, because Promela statements are atomic.
Either way, running this modified model gives us an error-free traversal of the state
space, as shown in Listing 12.5.
Table 12.1 shows the number of states and memory consumed as a function of number
of incrementers modeled (by redefining NUMPROCS):
Running unnecessarily large models is thus subtly discouraged, although 882 MB is
well within the limits of modern desktop and laptop machines.
With this example under our belt, let’s take a closer look at the commands used to
analyze Promela models and then look at more elaborate examples.
spin -a qrcu.spin
Create a file pan.c that fully searches the state machine.
The -wN option specifies the hashtable size. The default for full state-space search
is -w24 (see footnote 1).
If you aren’t sure whether your machine has enough memory, run top in one
window and ./pan in another. Keep the focus on the ./pan window so that you
can quickly kill execution if need be. As soon as CPU time drops much below
100 %, kill ./pan. If you have removed focus from the window running ./pan,
you may wait a long time for the windowing system to grab enough memory to do
anything for you.
Another option to avoid memory exhaustion is the -DMEMLIM=N compiler flag.
For example, -DMEMLIM=2000 would set a maximum of 2 GB.
Don’t forget to capture the output, especially if you are working on a remote
machine.
If your model includes forward-progress checks, you will likely need to enable
“weak fairness” via the -f command-line argument to ./pan. If your forward-
progress checks involve accept labels, you will also need the -a argument.
spin -t -p qrcu.spin
Given the trail file output by a run that encountered an error, output the sequence
of steps leading to that error. The -g flag will also include the values of changed
global variables, and the -l flag will also include the values of changed local
variables.
1 As of Spin Version 6.4.6 and 6.4.8. In the online manual of Spin dated 10 July 2011,
the default for exhaustive search mode is said to be -w19, which does not match the actual
behavior.
because they cause the state space to explode. On the other hand, there is no
penalty for infinite loops in Promela as long as none of the variables monotonically
increase or decrease—Promela will figure out how many passes through the loop
really matter, and automatically prune execution beyond that point.
6. In C torture-test code, it is often wise to keep per-task control variables. They are
cheap to read, and greatly aid in debugging the test code. In Promela, per-task
control variables should be used only when there is no other alternative. To see
this, consider a 5-task verification with one bit each to indicate completion. This
gives 32 states. In contrast, a simple counter would have only six states, more than
a five-fold reduction. That factor of five might not seem like a problem, at least
not until you are struggling with a verification program possessing more than 150
million states consuming more than 10 GB of memory!
7. One of the most challenging things both in C torture-test code and in Promela is
formulating good assertions. Promela also allows never claims that act like an
assertion replicated between every line of code.
8. Dividing and conquering is extremely helpful in Promela in keeping the state space
under control. Splitting a large model into two roughly equal halves will result
in the state space of each half being roughly the square root of the whole. For
example, a million-state combined model might reduce to a pair of thousand-state
models. Not only will Promela handle the two smaller models much more quickly
with much less memory, but the two smaller algorithms are easier for people to
understand.
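The state counts quoted in item 6 above follow from simple arithmetic, sketched here in C with hypothetical function names: one completion bit per task gives two to the power of the number of tasks, while a single counter of completed tasks needs only one state per possible count.

```c
#include <assert.h>

/* States needed to track completion with one bit per task. */
unsigned long long states_per_task_bits(unsigned int ntasks)
{
    assert(ntasks < 64);
    return 1ULL << ntasks;
}

/* States needed when a single counter records how many tasks are done. */
unsigned long long states_counter(unsigned int ntasks)
{
    return (unsigned long long)ntasks + 1;
}
```

For five tasks these return 32 and 6, respectively. The gap widens quickly: at ten tasks the bit-per-task encoding needs 1,024 states where the counter needs only 11.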
1. Memory reordering. Suppose you have a pair of statements copying globals x and
y to locals r1 and r2, where ordering matters (e.g., unprotected by locks), but
where you have no memory barriers. This can be modeled in Promela as follows:
if
:: 1 -> r1 = x;
        r2 = y
:: 1 -> r2 = y;
        r1 = x
fi
3. Promela does not provide functions. You must instead use C preprocessor macros.
However, you must use them carefully in order to avoid combinatorial explosion.
by the N_LOCKERS macro definition on line 3. The mutex itself is defined on line 5, an
array to track the lock owner on line 6, and line 7 is used by assertion code to verify that
only one process holds the lock.
The locker process is on lines 9–18, and simply loops forever acquiring the lock on
line 13, claiming it on line 14, unclaiming it on line 15, and releasing it on line 16.
The init block on lines 20–44 initializes the current locker’s havelock array entry on
line 26, starts the current locker on line 27, and advances to the next locker on line 28.
Once all locker processes are spawned, the do-od loop moves to line 29, which checks
the assertion. Lines 30 and 31 initialize the control variables, lines 32–40 atomically
sum the havelock array entries, line 41 is the assertion, and line 42 exits the loop.
We can run this model by placing the two code fragments of Listings 12.8 and 12.9
into files named lock.h and lock.spin, respectively, and then running the following
commands:
spin -a lock.spin
cc -DSAFETY -o pan pan.c
./pan
The output will look something like that shown in Listing 12.10. As expected, this
run has no assertion failures (“errors: 0”).
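The idea behind this kind of verification can be imitated in a few dozen lines of C. The sketch below is not the Promela model: it is a hand-written explicit-state search over a simplified lock with three lockers, in which “holding the lock” is derived from each locker's program counter rather than kept in a havelock array. All names other than N_LOCKERS are hypothetical. The atomic_acquire parameter selects between an atomic test-and-set and a deliberately broken acquisition that tests and sets in separate steps.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define N_LOCKERS 3
#define N_PCS 5                         /* Program-counter values per locker. */
#define MAX_STATES (2 * 5 * 5 * 5)      /* 2 * N_PCS^N_LOCKERS; keep in sync. */

struct lock_state {
    int mutex;              /* 0: free, 1: held. */
    int pc[N_LOCKERS];      /* 0: test, 1: set, 2: claim, 3: unclaim, 4: release. */
};

static int encode(const struct lock_state *s)
{
    int i, idx = 0;

    for (i = N_LOCKERS - 1; i >= 0; i--)
        idx = idx * N_PCS + s->pc[i];
    return idx * 2 + s->mutex;
}

/* Number of lockers currently claiming to hold the lock. */
static int holders(const struct lock_state *s)
{
    int i, n = 0;

    for (i = 0; i < N_LOCKERS; i++)
        n += (s->pc[i] == 3);
    return n;
}

/* Take one step of locker p.  Returns false if p is blocked. */
static bool step(struct lock_state *s, int p, bool atomic_acquire)
{
    switch (s->pc[p]) {
    case 0:
        if (s->mutex)
            return false;           /* Lock held: keep spinning. */
        if (atomic_acquire)
            s->mutex = 1;           /* Test and set in one step. */
        s->pc[p] = atomic_acquire ? 2 : 1;
        return true;
    case 1:
        s->mutex = 1;               /* Broken variant: set after test. */
        s->pc[p] = 2;
        return true;
    case 2:
    case 3:
        s->pc[p]++;                 /* Claim, then unclaim. */
        return true;
    default:
        s->mutex = 0;               /* Release and start over. */
        s->pc[p] = 0;
        return true;
    }
}

static int dfs(struct lock_state s, bool atomic_acquire, bool *visited)
{
    int p, violations = 0;
    int idx = encode(&s);

    if (visited[idx])
        return 0;
    visited[idx] = true;
    if (holders(&s) > 1)
        violations++;
    for (p = 0; p < N_LOCKERS; p++) {
        struct lock_state next = s;

        if (step(&next, p, atomic_acquire))
            violations += dfs(next, atomic_acquire, visited);
    }
    return violations;
}

/* Count reachable states in which more than one locker holds the lock. */
int count_violations(bool atomic_acquire)
{
    bool visited[MAX_STATES];
    struct lock_state init;

    memset(visited, 0, sizeof(visited));
    memset(&init, 0, sizeof(init));
    return dfs(init, atomic_acquire, visited);
}
```

With an atomic acquisition, no reachable state has more than one holder. With the broken acquisition, the search finds states in which two or more lockers hold the lock at once, which is the sort of failure a model's assertion is there to catch.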
Quick Quiz 12.1: Why is there an unreached statement in locker? After all, isn’t this a full
state-space search?
Quick Quiz 12.2: What are some Promela code-style issues with this example?
1. There is a qrcu_struct that defines a QRCU domain. Like SRCU (and unlike
other variants of RCU) QRCU’s action is not global, but instead focused on the
specified qrcu_struct.
2. There are qrcu_read_lock() and qrcu_read_unlock() primitives that delimit
QRCU read-side critical sections. The corresponding qrcu_struct must be
passed into these primitives, and the return value from qrcu_read_lock() must
be passed to qrcu_read_unlock().
For example:
idx = qrcu_read_lock(&my_qrcu_struct);
/* read-side critical section. */
qrcu_read_unlock(&my_qrcu_struct, idx);
A Linux-kernel patch for QRCU has been produced [McK07c], but is unlikely to ever
be included in the Linux kernel.
Returning to the Promela code for QRCU, the global variables are as shown in
Listing 12.11. This example uses locking and includes lock.h. Both the number of
readers and writers can be varied using the two #define statements, giving us not one
but two ways to create combinatorial explosion. The idx variable controls which of
the two elements of the ctr array will be used by readers, and the readerprogress
variable allows an assertion to determine when all the readers are finished (since a QRCU
update cannot be permitted to complete until all pre-existing readers have completed
their QRCU read-side critical sections). The readerprogress array elements have
values as follows, indicating the state of the corresponding reader:
contains two unconditional branches with guards on lines 4 and 8, which causes Promela
to non-deterministically choose one of the two (but again, the full state-space search
causes Promela to eventually make all possible choices in each applicable situation).
The first branch fetches the zero-th counter and sets i to 1 (so that line 14 will fetch the
first counter), while the second branch does the opposite, fetching the first counter and
setting i to 0 (so that line 14 will fetch the zero-th counter).
Quick Quiz 12.3: Is there a more straightforward way to code the do-od statement?
With the sum_unordered macro in place, we can now proceed to the update-side
process shown in Listing 12.14. The update-side process repeats indefinitely, with
the corresponding do-od loop ranging over lines 7–57. Each pass through the loop
first snapshots the global readerprogress array into the local readerstart array on
lines 12–21. This snapshot will be used for the assertion on line 53. Line 23 invokes
sum_unordered, and then lines 24–27 re-invoke sum_unordered if the fastpath is
potentially usable.
Lines 28–40 execute the slowpath code if need be, with lines 30 and 38 acquiring and
releasing the update-side lock, lines 31–33 flipping the index, and lines 34–37 waiting
for all pre-existing readers to complete.
Lines 44–56 then compare the current values in the readerprogress array to those
collected in the readerstart array, forcing an assertion failure should any readers that
started before this update still be in progress.
Quick Quiz 12.4: Why are there atomic blocks at lines 12–21 and lines 44–56, when the
operations within those atomic blocks have no atomic implementation on any current production
microprocessor?
Quick Quiz 12.5: Is the re-summing of the counters on lines 24–27 really necessary?
All that remains is the initialization block shown in Listing 12.15. This block simply
initializes the counter pair on lines 5–6, spawns the reader processes on lines 7–14, and
spawns the updater processes on lines 15–21. This is all done within an atomic block to
reduce state space.
To run the QRCU example, combine the code fragments in the previous section into a
single file named qrcu.spin, and place the definitions for spin_lock() and spin_
unlock() into a file named lock.h. Then use the following commands to build and
run the QRCU model:
spin -a qrcu.spin
cc -DSAFETY [-DCOLLAPSE] -o pan pan.c
./pan [-mN]
The output shows that this model passes all of the cases shown in Table 12.2. It
would be nice to run three readers and three updaters. However, simple extrapolation
indicates that this will require about half a terabyte of memory. What to do?
It turns out that ./pan gives advice when it runs out of memory, for example, when
attempting to run three readers and three updaters:
Table 12.2:
updaters   readers   # states      depth     memory (MB)
1          1         376           95        128.7
1          2         6,177         218       128.9
1          3         99,728        385       132.6
2          1         29,399        859       129.8
2          2         1,071,181     2,352     169.6
2          3         33,866,736    12,857    1,540.8
3          1         2,749,453     53,809    236.6
3          2         186,202,860   328,014   10,483.7
a Obtained with the compiler flag -DCOLLAPSE specified.
Let’s try the suggested compiler flag -DMA=N, which generates code for aggressive
compression of the state space at the cost of greatly increased search overhead. The
required commands are as follows:
spin -a qrcu.spin
cc -DSAFETY -DMA=96 -O2 -o pan pan.c
./pan -m20000000
Here, the depth limit of 20,000,000 is an order of magnitude larger than the expected
depth deduced from simple extrapolation. Although this increases up-front memory
usage, it avoids wasting a long run due to incomplete search resulting from a too-tight
depth limit. This run took a little more than 3 days on a POWER9 server. The result is
shown in Listing 12.16. This Spin run completed successfully with a total memory usage
of only 6.5 GB, which is almost two orders of magnitude lower than the -DCOLLAPSE
usage of about half a terabyte.
Quick Quiz 12.6: A compression rate of 0.48 % corresponds to a 200-to-1 decrease in memory
occupied by the states! Is the state-space search really exhaustive???
For reference, Table 12.3 summarizes the Spin results with -DCOLLAPSE and -DMA=N
compiler flags. The memory usage is obtained with minimal sufficient search depths
and -DMA=N parameters shown in the table. Hashtable sizes for -DCOLLAPSE runs are
tweaked by the -wN option of ./pan to avoid using too much memory hashing small
state spaces. Hence the memory usage is smaller than what is shown in Table 12.2,
where the hashtable size starts from the default of -w24. The runtime is from a POWER9
server, which shows that -DMA=N suffers up to about an order of magnitude higher CPU
overhead than does -DCOLLAPSE, but on the other hand reduces memory overhead by
well over an order of magnitude.
So far so good. But adding a few more updaters or readers would exhaust memory,
even with -DMA=N (see footnote 2). So what to do? Here are some possible approaches:
1. See whether a smaller number of readers and updaters suffice to prove the general
case.
2. Manually construct a proof of correctness.
2 Alternatively, the CPU consumption would become excessive.
-DCOLLAPSE -DMA=N
updaters readers # states depth reached -wN memory (MB) runtime (s) N memory (MB) runtime (s)
1. For synchronize_qrcu() to exit too early, then by definition there must have
been at least one reader present during synchronize_qrcu()’s full execution.
2. The counter corresponding to this reader will have been at least 1 during this time
interval.
4. The above two items imply that if the counter corresponding to this reader is exactly
one, then the other counter must be greater than or equal to one. Similarly, if the
other counter is equal to zero, then the counter corresponding to the reader must
be greater than or equal to two.
5. Therefore, at any given point in time, either one of the counters will be at least 2,
or both of the counters will be at least one.
6. However, the synchronize_qrcu() fastpath code can read only one of the
counters at a given time. It is therefore possible for the fastpath code to fetch the
first counter while zero, but to race with a counter flip so that the second counter is
seen as one.
7. There can be at most one reader persisting through such a race condition, as
otherwise the sum would be two or greater, which would cause the updater to take
the slowpath.
8. But if the race occurs on the fastpath’s first read of the counters, and then again on
its second read, there have to have been two counter flips.
9. Because a given updater flips the counter only once, and because the update-side
lock prevents a pair of updaters from concurrently flipping the counters, the only
way that the fastpath code can race with a flip twice is if the first updater completes.
10. But the first updater will not complete until after all pre-existing readers have
completed.
11. Therefore, if the fastpath races with a counter flip twice in succession, all pre-
existing readers must have completed, so that it is safe to take the fastpath.
Of course, not all parallel algorithms have such simple proofs. In such cases, it may
be necessary to enlist more capable tools.
Therefore, if you do intend to use QRCU, please take care. Its proofs of correctness
might or might not themselves be correct. Which is one reason why formal verification
is unlikely to completely replace testing, as Donald Knuth pointed out so long ago.
Quick Quiz 12.8: Given that we have two independent proofs of correctness for the QRCU
algorithm described herein, and given that the proof of incorrectness covers what is known to
be a different algorithm, why is there any room for doubt?
Paul reviewed the code repeatedly from October 2007 to February 2008, and almost
always found at least one bug. In one case, Paul even coded and tested a fix before
realizing that the bug was illusory, and in fact in all cases, the “bug” turned out to be
illusory.
Near the end of February, Paul grew tired of this game. He therefore decided to enlist
the aid of Promela and Spin. The following presents a series of seven increasingly
realistic Promela models, the last of which passes, consuming about 40 GB of main
memory for the state space.
More important, Promela and Spin did find a very subtle bug for me!
Quick Quiz 12.9: Yeah, that’s just great! Now, just what am I supposed to do if I don’t happen
to have a machine with 40 GB of main memory???
Still better would be to come up with a simpler and faster algorithm that has a smaller
state space. Even better would be an algorithm so simple that its correctness was
obvious to the casual observer!
Sections 12.1.5.1–12.1.5.4 give an overview of preemptible RCU’s dynticks interface,
followed by Section 12.1.6’s discussion of the validation of the interface.
2. When entering the outermost of a possibly nested set of interrupt handlers, and
sees the new value of dynticks_progress_counter will also see the completion of
any prior RCU read-side critical sections.
Similarly, when a CPU that is in dynticks-idle mode prepares to start executing a
newly runnable task, it invokes rcu_exit_nohz():
1 static inline void rcu_exit_nohz(void)
2 {
3 __get_cpu_var(dynticks_progress_counter)++;
4 mb();
5 WARN_ON(!(__get_cpu_var(dynticks_progress_counter) &
6 0x1));
7 }
Line 3 fetches the current CPU’s number, while lines 5 and 6 increment the rcu_
update_flag nesting counter if it is already non-zero. Lines 7–9 check to see whether
we are the outermost level of interrupt, and, if so, whether dynticks_progress_
counter needs to be incremented. If so, line 10 increments dynticks_progress_
counter, line 11 executes a memory barrier, and line 12 increments rcu_update_flag.
As with rcu_exit_nohz(), the memory barrier ensures that any other CPU that sees
the effects of an RCU read-side critical section in the interrupt handler (following the
rcu_irq_enter() invocation) will also see the increment of dynticks_progress_
counter.
Quick Quiz 12.10: Why not simply increment rcu_update_flag, and then only increment
dynticks_progress_counter if the old value of rcu_update_flag was zero???
Quick Quiz 12.11: But if line 7 finds that we are the outermost interrupt, wouldn’t we always
need to increment dynticks_progress_counter?
Line 3 fetches the current CPU’s number, as before. Line 5 checks to see if the
rcu_update_flag is non-zero, returning immediately (via falling off the end of the
function) if not. Otherwise, lines 6 through 12 come into play. Line 6 decrements
rcu_update_flag, returning if the result is not zero. Line 8 verifies that we are indeed
leaving the outermost level of nested interrupts, line 9 executes a memory barrier,
line 10 increments dynticks_progress_counter, and lines 11 and 12 verify that this
variable is now even. As with rcu_enter_nohz(), the memory barrier ensures that
any other CPU that sees the increment of dynticks_progress_counter will also see
the effects of an RCU read-side critical section in the interrupt handler (preceding the
rcu_irq_exit() invocation).
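The interrupt entry and exit logic described above can be exercised in user space with a small single-CPU model. The following C sketch is reconstructed from the prose description only, so it should not be taken as the kernel's actual code. The per-CPU variables are modeled as plain globals, irq_nesting is a hypothetical stand-in for in_interrupt(), atomic_thread_fence() stands in for the kernel's memory barrier, and the order in which the wrappers update irq_nesting is an assumption.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical single-CPU stand-ins for the kernel's per-CPU state. */
long dynticks_progress_counter;     /* Even value: CPU is in dynticks-idle mode. */
int rcu_update_flag;                /* Interrupt nesting count kept for RCU. */
int irq_nesting;                    /* Stand-in for in_interrupt(). */

void model_rcu_irq_enter(void)
{
    if (rcu_update_flag)
        rcu_update_flag++;          /* Already tracking: count nesting. */
    if (irq_nesting == 0 && (dynticks_progress_counter & 0x1) == 0) {
        dynticks_progress_counter++;                /* Now odd. */
        atomic_thread_fence(memory_order_seq_cst);  /* Memory barrier. */
        rcu_update_flag++;
    }
}

void model_rcu_irq_exit(void)
{
    if (rcu_update_flag) {
        if (--rcu_update_flag)
            return;                 /* Still nested. */
        assert(irq_nesting == 0);   /* Leaving the outermost interrupt. */
        atomic_thread_fence(memory_order_seq_cst);  /* Memory barrier. */
        dynticks_progress_counter++;
        assert((dynticks_progress_counter & 0x1) == 0);    /* Even again. */
    }
}

/* Assumed ordering: RCU is notified before the nesting level rises... */
void model_irq_entry(void)
{
    model_rcu_irq_enter();
    irq_nesting++;
}

/* ...and after it falls. */
void model_irq_return(void)
{
    irq_nesting--;
    model_rcu_irq_exit();
}
```

Starting from dynticks-idle mode (an even counter), a pair of nested interrupts makes the counter odd on the outermost entry, leaves it unchanged for the nested entry and exit, and makes it even again on the outermost exit. If the counter is already odd, meaning that the CPU was not in dynticks-idle mode, neither function changes anything.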
These two sections have described how the dynticks_progress_counter variable
is maintained during entry to and exit from dynticks-idle mode, both by tasks and by
interrupts and NMIs. The following section describes how this variable is used by
preemptible RCU’s grace-period machinery.
[Figure: grace-period state machine. States, in order: rcu_try_flip_idle_state (no RCU activity), rcu_try_flip_waitack_state (wait for acknowledgements), rcu_try_flip_waitzero_state (wait for RCU read-side critical sections to complete), rcu_try_flip_waitmb_state (wait for memory barriers). Recovered edge labels: "Still no activity", "Memory barrier".]
4 long curr;
5 long snap;
6
7 curr = per_cpu(dynticks_progress_counter, cpu);
8 snap = per_cpu(rcu_dyntick_snapshot, cpu);
9 smp_mb();
10 if ((curr == snap) && ((curr & 0x1) == 0))
11 return 0;
12 if (curr != snap)
13 return 0;
14 return 1;
15 }
1 proctype dyntick_nohz()
2 {
3 byte tmp;
4 byte i = 0;
5
6 do
7 :: i >= MAX_DYNTICK_LOOP_NOHZ -> break;
8 :: i < MAX_DYNTICK_LOOP_NOHZ ->
9 tmp = dynticks_progress_counter;
10 atomic {
11 dynticks_progress_counter = tmp + 1;
12 assert((dynticks_progress_counter & 1) == 1);
13 }
14 tmp = dynticks_progress_counter;
15 atomic {
16 dynticks_progress_counter = tmp + 1;
17 assert((dynticks_progress_counter & 1) == 0);
18 }
19 i++;
20 od;
21 }
Lines 6 and 20 define a loop. Line 7 exits the loop once the loop counter i has
reached the limit MAX_DYNTICK_LOOP_NOHZ. Line 8 tells the loop construct to execute
lines 9–19 for each pass through the loop. Because the conditionals on lines 7 and 8
are exclusive of each other, the normal Promela random selection of true conditions
is disabled. Lines 9 and 11 model rcu_exit_nohz()’s non-atomic increment of
dynticks_progress_counter, while line 12 models the WARN_ON(). The atomic
construct simply reduces the Promela state space, given that the WARN_ON() is not
strictly speaking part of the algorithm. Lines 14–18 similarly model the increment and
WARN_ON() for rcu_enter_nohz(). Finally, line 19 increments the loop counter.
Each pass through the loop therefore models a CPU exiting dynticks-idle mode (for
example, starting to execute a task), then re-entering dynticks-idle mode (for example,
that same task blocking).
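For comparison, one pass through this loop corresponds roughly to the following user-space C sketch. It is an illustration rather than the kernel's code: a single file-scope variable stands in for one CPU's dynticks_progress_counter, atomic_thread_fence() stands in for the kernel's memory barrier, assert() stands in for WARN_ON(), and the barrier placement follows the ordering argument given earlier in this section (after the increment on exit from dynticks-idle mode, before it on entry). The name model_nohz_passes() is hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical stand-in for one CPU's dynticks_progress_counter. */
static long dynticks_progress_counter;	/* starts even: CPU is dynticks-idle */

/* Models rcu_exit_nohz(): the counter becomes odd on leaving dynticks-idle mode. */
static void model_rcu_exit_nohz(void)
{
	long tmp = dynticks_progress_counter;		/* non-atomic read ... */

	dynticks_progress_counter = tmp + 1;		/* ... followed by write */
	atomic_thread_fence(memory_order_seq_cst);	/* stand-in for the barrier */
	assert((dynticks_progress_counter & 0x1) == 1);	/* stand-in for WARN_ON() */
}

/* Models rcu_enter_nohz(): the counter becomes even on entering dynticks-idle mode. */
static void model_rcu_enter_nohz(void)
{
	long tmp;

	atomic_thread_fence(memory_order_seq_cst);	/* stand-in for the barrier */
	tmp = dynticks_progress_counter;
	dynticks_progress_counter = tmp + 1;
	assert((dynticks_progress_counter & 0x1) == 0);	/* stand-in for WARN_ON() */
}

/* Each iteration mirrors one pass through the Promela loop. */
static void model_nohz_passes(int max_loops)
{
	for (int i = 0; i < max_loops; i++) {
		model_rcu_exit_nohz();
		model_rcu_enter_nohz();
	}
}
```

Unlike the Promela model, this sketch runs only one interleaving. It is the state-space search, not the code itself, that explores what happens when the grace-period machinery observes the counter between the read and the write.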
Quick Quiz 12.13: Why isn’t the memory barrier in rcu_exit_nohz() and rcu_enter_nohz()
modeled in Promela?
Quick Quiz 12.14: Isn’t it a bit strange to model rcu_exit_nohz() followed by rcu_enter_nohz()?
Wouldn’t it be more natural to instead model entry before exit?
The next step is to model the interface to RCU’s grace-period processing. For this, we
need to model dyntick_save_progress_counter(), rcu_try_flip_waitack_needed(),
rcu_try_flip_waitmb_needed(), as well as portions of rcu_try_flip_waitack()
and rcu_try_flip_waitmb(), all from the 2.6.25-rc4 kernel. The
following grace_period() Promela process models these functions as they would be
invoked during a single pass through preemptible RCU’s grace-period processing.
1 proctype grace_period()
2 {
3 byte curr;
4 byte snap;
5
6 atomic {
7 printf("MDLN = %d\n", MAX_DYNTICK_LOOP_NOHZ);
8 snap = dynticks_progress_counter;
9 }
10 do
11 :: 1 ->
12 atomic {
13 curr = dynticks_progress_counter;
14 if
15 :: (curr == snap) && ((curr & 1) == 0) ->
16 break;
17 :: (curr - snap) > 2 || (snap & 1) == 0 ->
18 break;
19 :: 1 -> skip;
20 fi;
21 }
22 od;
23 snap = dynticks_progress_counter;
24 do
25 :: 1 ->
26 atomic {
27 curr = dynticks_progress_counter;
28 if
29 :: (curr == snap) && ((curr & 1) == 0) ->
30 break;
31 :: (curr != snap) ->
32 break;
33 :: 1 -> skip;
34 fi;
35 }
36 od;
37 }
Lines 6–9 print out the loop limit (but only into the “.trail” file in case of error) and
model a line of code from rcu_try_flip_idle() and its call to
dyntick_save_progress_counter(), which takes a snapshot of the current CPU’s
dynticks_progress_counter variable. These two lines are executed atomically to reduce state
space.
Lines 10–22 model the relevant code in rcu_try_flip_waitack() and its call to
rcu_try_flip_waitack_needed(). This loop is modeling the grace-period state
machine waiting for a counter-flip acknowledgement from each CPU, but only that part
that interacts with dynticks-idle CPUs.
Line 23 models a line from rcu_try_flip_waitzero() and its call to
dyntick_save_progress_counter(), again taking a snapshot of the CPU’s
dynticks_progress_counter variable.
Finally, lines 24–36 model the relevant code in rcu_try_flip_waitmb() and its
call to rcu_try_flip_waitmb_needed(). This loop is modeling the grace-period
state machine waiting for each CPU to execute a memory barrier, but again only that
part that interacts with dynticks-idle CPUs.
Quick Quiz 12.15: Wait a minute! In the Linux kernel, both dynticks_progress_counter
and rcu_dyntick_snapshot are per-CPU variables. So why are they instead being modeled
as single global variables?
1 #define GP_IDLE 0
2 #define GP_WAITING 1
3 #define GP_DONE 2
4 byte grace_period_state = GP_DONE;
The grace_period() process sets this variable as it progresses through the grace-
period phases, as shown below:
1 proctype grace_period()
2 {
3 byte curr;
4 byte snap;
5
6 grace_period_state = GP_IDLE;
7 atomic {
8 printf("MDLN = %d\n", MAX_DYNTICK_LOOP_NOHZ);
9 snap = dynticks_progress_counter;
10 grace_period_state = GP_WAITING;
11 }
12 do
13 :: 1 ->
14 atomic {
15 curr = dynticks_progress_counter;
16 if
17 :: (curr == snap) && ((curr & 1) == 0) ->
18 break;
Lines 6, 10, 25, 26, 29, and 44 update this variable (combining atomically with
algorithmic operations where feasible) to allow the dyntick_nohz() process to verify
the basic RCU safety property. The form of this verification is to assert that the value of
the grace_period_state variable cannot jump from GP_IDLE to GP_DONE during a
time period over which RCU readers could plausibly persist.
Quick Quiz 12.16: Given there are a pair of back-to-back changes to grace_period_state
on lines 25 and 26, how can we be sure that line 25’s changes won’t be lost?
1 proctype dyntick_nohz()
2 {
3 byte tmp;
4 byte i = 0;
5 bit old_gp_idle;
6
7 do
8 :: i >= MAX_DYNTICK_LOOP_NOHZ -> break;
9 :: i < MAX_DYNTICK_LOOP_NOHZ ->
10 tmp = dynticks_progress_counter;
11 atomic {
12 dynticks_progress_counter = tmp + 1;
13 old_gp_idle = (grace_period_state == GP_IDLE);
14 assert((dynticks_progress_counter & 1) == 1);
15 }
16 atomic {
17 tmp = dynticks_progress_counter;
18 assert(!old_gp_idle ||
19 grace_period_state != GP_DONE);
20 }
21 atomic {
22 dynticks_progress_counter = tmp + 1;
23 assert((dynticks_progress_counter & 1) == 0);
24 }
25 i++;
26 od;
27 }
1 proctype dyntick_nohz()
2 {
3 byte tmp;
4 byte i = 0;
5 bit old_gp_idle;
6
7 do
8 :: i >= MAX_DYNTICK_LOOP_NOHZ -> break;
9 :: i < MAX_DYNTICK_LOOP_NOHZ ->
10 tmp = dynticks_progress_counter;
11 atomic {
12 dynticks_progress_counter = tmp + 1;
13 old_gp_idle = (grace_period_state == GP_IDLE);
14 assert((dynticks_progress_counter & 1) == 1);
15 }
16 atomic {
17 tmp = dynticks_progress_counter;
18 assert(!old_gp_idle ||
19 grace_period_state != GP_DONE);
20 }
21 atomic {
22 dynticks_progress_counter = tmp + 1;
23 assert((dynticks_progress_counter & 1) == 0);
24 }
25 i++;
26 od;
27 dyntick_nohz_done = 1;
28 }
With this variable in place, we can add assertions to grace_period() to check for
unnecessary blockage as follows:
1 proctype grace_period()
2 {
3 byte curr;
4 byte snap;
5 bit shouldexit;
6
7 grace_period_state = GP_IDLE;
8 atomic {
9 printf("MDLN = %d\n", MAX_DYNTICK_LOOP_NOHZ);
10 shouldexit = 0;
11 snap = dynticks_progress_counter;
12 grace_period_state = GP_WAITING;
13 }
14 do
15 :: 1 ->
16 atomic {
17 assert(!shouldexit);
18 shouldexit = dyntick_nohz_done;
19 curr = dynticks_progress_counter;
20 if
21 :: (curr == snap) && ((curr & 1) == 0) ->
22 break;
23 :: (curr - snap) > 2 || (snap & 1) == 0 ->
24 break;
25 :: else -> skip;
26 fi;
27 }
28 od;
29 grace_period_state = GP_DONE;
30 grace_period_state = GP_IDLE;
31 atomic {
32 shouldexit = 0;
33 snap = dynticks_progress_counter;
34 grace_period_state = GP_WAITING;
35 }
36 do
37 :: 1 ->
38 atomic {
39 assert(!shouldexit);
40 shouldexit = dyntick_nohz_done;
41 curr = dynticks_progress_counter;
42 if
43 :: (curr == snap) && ((curr & 1) == 0) ->
44 break;
45 :: (curr != snap) ->
46 break;
47 :: else -> skip;
48 fi;
49 }
50 od;
51 grace_period_state = GP_DONE;
52 }
condition on line 23 does not hold either because snap is odd and because curr is only
one greater than snap.
So one of these two conditions has to be incorrect. Referring to the comment block
in rcu_try_flip_waitack_needed() for the first condition:
If the CPU remained in dynticks mode for the entire time and didn’t take any
interrupts, NMIs, SMIs, or whatever, then it cannot be in the middle of an
rcu_read_lock(), so the next rcu_read_lock() it executes must use the
new value of the counter. So we can safely pretend that this CPU already
acknowledged the counter.
The first condition does match this, because if “curr == snap” and if curr is even,
then the corresponding CPU has been in dynticks-idle mode the entire time, as required.
So let’s look at the comment block for the second condition:
If the CPU passed through or entered a dynticks idle phase with no active
irq handlers, then, as above, we can safely pretend that this CPU already
acknowledged the counter.
The first part of the condition is correct, because if curr and snap differ by two,
there will be at least one even number in between, corresponding to having passed
completely through a dynticks-idle phase. However, the second part of the condition
corresponds to having started in dynticks-idle mode, not having finished in this mode.
We therefore need to be testing curr rather than snap for being an even number.
The corrected C code is as follows:
Lines 10–13 can now be combined and simplified, resulting in the following. A
similar simplification can be applied to rcu_try_flip_waitmb_needed().
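The difference between the original and corrected tests can be seen in a small user-space C sketch. The three predicates below are transcribed from the conditions that appear in the Promela models in this section, not copied from the kernel source: plain long arguments replace the per-CPU variables, the memory barrier is omitted, and the function names are hypothetical. A return value of one means that the grace period must keep waiting for this CPU.

```c
#include <assert.h>

/* Original form: the second test examines snap, which is the bug. */
static int waitack_needed_buggy(long curr, long snap)
{
	if ((curr == snap) && ((curr & 0x1) == 0))
		return 0;
	if ((curr - snap) > 2 || (snap & 0x1) == 0)
		return 0;
	return 1;
}

/* Corrected form: test curr rather than snap for being even. */
static int waitack_needed_fixed(long curr, long snap)
{
	if ((curr == snap) && ((curr & 0x1) == 0))
		return 0;
	if ((curr - snap) > 2 || (curr & 0x1) == 0)
		return 0;
	return 1;
}

/* Combined form, using the condition from the later Promela models. */
static int waitack_needed_simplified(long curr, long snap)
{
	if ((curr - snap) >= 2 || (curr & 0x1) == 0)
		return 0;
	return 1;
}
```

With snap equal to one and curr equal to two (the CPU was not idle at the snapshot but has since entered dynticks-idle mode), the buggy form returns one and so blocks needlessly, while the other two forms return zero. Note that the combined form is slightly more permissive than the corrected form: it also accepts a difference of exactly two with curr odd, consistent with the observation that two values differing by two have an even value between them.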
12.1.6.4 Interrupts
There are a couple of ways to model interrupts in Promela:
1. Using C-preprocessor tricks to insert the interrupt handler between each and every
statement of the dyntick_nohz() process, or
2. Modeling the interrupt handler with a separate process.
A bit of thought indicated that the second approach would have a smaller state space,
though it requires that the interrupt handler somehow run atomically with respect to the
dyntick_nohz() process, but not with respect to the grace_period() process.
Fortunately, it turns out that Promela permits you to branch out of atomic statements.
This trick allows us to have the interrupt handler set a flag, and recode dyntick_nohz()
to atomically check this flag and execute only when the flag is not set. This
can be accomplished with a C-preprocessor macro that takes a label and a Promela
statement as follows:
1 #define EXECUTE_MAINLINE(label, stmt) \
2 label: skip; \
3 atomic { \
4 if \
5 :: in_dyntick_irq -> goto label; \
6 :: else -> stmt; \
7 fi; \
8 }
EXECUTE_MAINLINE(stmt1,
tmp = dynticks_progress_counter)
Line 2 of the macro creates the specified statement label. Lines 3–8 are an atomic
block that tests the in_dyntick_irq variable, and if this variable is set (indicating
that the interrupt handler is active), branches out of the atomic block back to the label.
Otherwise, line 6 executes the specified statement. The overall effect is that mainline
execution stalls any time an interrupt is active, as required.
1 proctype dyntick_nohz()
2 {
3 byte tmp;
4 byte i = 0;
5 bit old_gp_idle;
6
7 do
8 :: i >= MAX_DYNTICK_LOOP_NOHZ -> break;
9 :: i < MAX_DYNTICK_LOOP_NOHZ ->
10 EXECUTE_MAINLINE(stmt1,
11 tmp = dynticks_progress_counter)
12 EXECUTE_MAINLINE(stmt2,
13 dynticks_progress_counter = tmp + 1;
14 old_gp_idle = (grace_period_state == GP_IDLE);
15 assert((dynticks_progress_counter & 1) == 1))
16 EXECUTE_MAINLINE(stmt3,
17 tmp = dynticks_progress_counter;
18 assert(!old_gp_idle ||
19 grace_period_state != GP_DONE))
20 EXECUTE_MAINLINE(stmt4,
21 dynticks_progress_counter = tmp + 1;
22 assert((dynticks_progress_counter & 1) == 0))
23 i++;
24 od;
25 dyntick_nohz_done = 1;
26 }
Quick Quiz 12.18: But what if the dyntick_nohz() process had “if” or “do” statements
with conditions, where the statement bodies of these constructs needed to execute non-atomically?
1 proctype dyntick_irq()
2 {
3 byte tmp;
4 byte i = 0;
5 bit old_gp_idle;
6
7 do
8 :: i >= MAX_DYNTICK_LOOP_IRQ -> break;
9 :: i < MAX_DYNTICK_LOOP_IRQ ->
10 in_dyntick_irq = 1;
11 if
12 :: rcu_update_flag > 0 ->
13 tmp = rcu_update_flag;
14 rcu_update_flag = tmp + 1;
15 :: else -> skip;
16 fi;
17 if
18 :: !in_interrupt &&
19 (dynticks_progress_counter & 1) == 0 ->
20 tmp = dynticks_progress_counter;
21 dynticks_progress_counter = tmp + 1;
22 tmp = rcu_update_flag;
23 rcu_update_flag = tmp + 1;
24 :: else -> skip;
25 fi;
26 tmp = in_interrupt;
27 in_interrupt = tmp + 1;
28 old_gp_idle = (grace_period_state == GP_IDLE);
29 assert(!old_gp_idle ||
30 grace_period_state != GP_DONE);
31 tmp = in_interrupt;
32 in_interrupt = tmp - 1;
33 if
34 :: rcu_update_flag != 0 ->
35 tmp = rcu_update_flag;
36 rcu_update_flag = tmp - 1;
37 if
38 :: rcu_update_flag == 0 ->
39 tmp = dynticks_progress_counter;
40 dynticks_progress_counter = tmp + 1;
41 :: else -> skip;
42 fi;
43 :: else -> skip;
44 fi;
45 atomic {
46 in_dyntick_irq = 0;
47 i++;
48 }
49 od;
50 dyntick_irq_done = 1;
51 }
Lines 11–25 model rcu_irq_enter(), and lines 26 and 27 model the relevant
snippet of __irq_enter(). Lines 28–30 verify safety in much the same manner as
do the corresponding lines of dyntick_nohz(). Lines 31 and 32 model the relevant
snippet of __irq_exit(), and finally lines 33–44 model rcu_irq_exit().
Quick Quiz 12.20: What property of interrupts is this dyntick_irq() process unable to
model?
1 proctype grace_period()
2 {
3 byte curr;
4 byte snap;
5 bit shouldexit;
6
7 grace_period_state = GP_IDLE;
8 atomic {
9 printf("MDLN = %d\n", MAX_DYNTICK_LOOP_NOHZ);
10 printf("MDLI = %d\n", MAX_DYNTICK_LOOP_IRQ);
11 shouldexit = 0;
12 snap = dynticks_progress_counter;
13 grace_period_state = GP_WAITING;
14 }
15 do
16 :: 1 ->
17 atomic {
18 assert(!shouldexit);
19 shouldexit = dyntick_nohz_done && dyntick_irq_done;
20 curr = dynticks_progress_counter;
21 if
22 :: (curr - snap) >= 2 || (curr & 1) == 0 ->
23 break;
24 :: else -> skip;
25 fi;
26 }
27 od;
28 grace_period_state = GP_DONE;
29 grace_period_state = GP_IDLE;
30 atomic {
31 shouldexit = 0;
32 snap = dynticks_progress_counter;
33 grace_period_state = GP_WAITING;
34 }
35 do
36 :: 1 ->
37 atomic {
38 assert(!shouldexit);
39 shouldexit = dyntick_nohz_done && dyntick_irq_done;
40 curr = dynticks_progress_counter;
41 if
42 :: (curr != snap) || ((curr & 1) == 0) ->
43 break;
44 :: else -> skip;
45 fi;
46 }
47 od;
48 grace_period_state = GP_DONE;
49 }
The implementation of grace_period() is very similar to the earlier one. The only
changes are the addition of line 10 to add the new interrupt-count parameter, changes
to lines 19 and 39 to add the new dyntick_irq_done variable to the liveness checks,
and of course the optimizations on lines 22 and 42.
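The second of these optimizations can be checked mechanically. The sketch below compares the two-test form of the memory-barrier check with the combined condition used in the model above, over a small range of counter values. The function names are hypothetical, plain long arguments replace the per-CPU variables, and the memory barrier is omitted.

```c
#include <assert.h>

/* Two-test form of the memory-barrier check. */
static int waitmb_needed_original(long curr, long snap)
{
	if ((curr == snap) && ((curr & 0x1) == 0))
		return 0;
	if (curr != snap)
		return 0;
	return 1;
}

/* Combined form, matching the condition in the Promela model. */
static int waitmb_needed_combined(long curr, long snap)
{
	if ((curr != snap) || ((curr & 0x1) == 0))
		return 0;
	return 1;
}

/* Count the (curr, snap) pairs in [0, limit) on which the two forms disagree. */
static int waitmb_count_mismatches(long limit)
{
	int mismatches = 0;

	for (long snap = 0; snap < limit; snap++)
		for (long curr = 0; curr < limit; curr++)
			if (waitmb_needed_original(curr, snap) !=
			    waitmb_needed_combined(curr, snap))
				mismatches++;
	return mismatches;
}
```

For this check the combined condition is an exact Boolean equivalent of the two separate tests, so the mismatch count is zero for any limit. In both forms, the only case that requires waiting is an unchanged odd counter.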
This model (dyntickRCU-irqnn-ssl.spin) verifies without errors, visiting
roughly half a million states. However, this version of the model
does not handle nested interrupts, a topic that is taken up in the next section.
1 proctype dyntick_irq()
2 {
3 byte tmp;
4 byte i = 0;
5 byte j = 0;
6 bit old_gp_idle;
7 bit outermost;
8
9 do
10 :: i >= MAX_DYNTICK_LOOP_IRQ &&
11 j >= MAX_DYNTICK_LOOP_IRQ -> break;
12 :: i < MAX_DYNTICK_LOOP_IRQ ->
13 atomic {
14 outermost = (in_dyntick_irq == 0);
15 in_dyntick_irq = 1;
16 }
17 if
18 :: rcu_update_flag > 0 ->
19 tmp = rcu_update_flag;
20 rcu_update_flag = tmp + 1;
21 :: else -> skip;
22 fi;
23 if
24 :: !in_interrupt &&
25 (dynticks_progress_counter & 1) == 0 ->
26 tmp = dynticks_progress_counter;
27 dynticks_progress_counter = tmp + 1;
28 tmp = rcu_update_flag;
29 rcu_update_flag = tmp + 1;
30 :: else -> skip;
31 fi;
32 tmp = in_interrupt;
33 in_interrupt = tmp + 1;
34 atomic {
35 if
36 :: outermost ->
37 old_gp_idle = (grace_period_state == GP_IDLE);
38 :: else -> skip;
39 fi;
40 }
41 i++;
42 :: j < i ->
43 atomic {
44 if
45 :: j + 1 == i ->
46 assert(!old_gp_idle ||
47 grace_period_state != GP_DONE);
48 :: else -> skip;
49 fi;
50 }
51 tmp = in_interrupt;
52 in_interrupt = tmp - 1;
53 if
54 :: rcu_update_flag != 0 ->
55 tmp = rcu_update_flag;
56 rcu_update_flag = tmp - 1;
57 if
58 :: rcu_update_flag == 0 ->
59 tmp = dynticks_progress_counter;
60 dynticks_progress_counter = tmp + 1;
61 :: else -> skip;
62 fi;
63 :: else -> skip;
64 fi;
65 atomic {
66 j++;
67 in_dyntick_irq = (i != j);
68 }
69 od;
70 dyntick_irq_done = 1;
71 }
1 proctype dyntick_nmi()
2 {
3 byte tmp;
4 byte i = 0;
5 bit old_gp_idle;
6
7 do
8 :: i >= MAX_DYNTICK_LOOP_NMI -> break;
9 :: i < MAX_DYNTICK_LOOP_NMI ->
10 in_dyntick_nmi = 1;
11 if
12 :: rcu_update_flag > 0 ->
13 tmp = rcu_update_flag;
14 rcu_update_flag = tmp + 1;
15 :: else -> skip;
16 fi;
17 if
18 :: !in_interrupt &&
19 (dynticks_progress_counter & 1) == 0 ->
20 tmp = dynticks_progress_counter;
21 dynticks_progress_counter = tmp + 1;
22 tmp = rcu_update_flag;
23 rcu_update_flag = tmp + 1;
24 :: else -> skip;
25 fi;
26 tmp = in_interrupt;
27 in_interrupt = tmp + 1;
28 old_gp_idle = (grace_period_state == GP_IDLE);
29 assert(!old_gp_idle ||
30 grace_period_state != GP_DONE);
31 tmp = in_interrupt;
32 in_interrupt = tmp - 1;
33 if
34 :: rcu_update_flag != 0 ->
35 tmp = rcu_update_flag;
36 rcu_update_flag = tmp - 1;
37 if
38 :: rcu_update_flag == 0 ->
39 tmp = dynticks_progress_counter;
40 dynticks_progress_counter = tmp + 1;
41 :: else -> skip;
42 fi;
43 :: else -> skip;
44 fi;
45 atomic {
46 i++;
47 in_dyntick_nmi = 0;
48 }
49 od;
50 dyntick_nmi_done = 1;
51 }
Of course, the fact that we have NMIs requires adjustments in the other components.
For example, the EXECUTE_MAINLINE() macro now needs to stall for the NMI
handler as well as for the interrupt handler, which it does by checking
in_dyntick_nmi in addition to in_dyntick_irq, as follows:
8 fi; \
9 }
1 proctype dyntick_irq()
2 {
3 byte tmp;
4 byte i = 0;
5 byte j = 0;
6 bit old_gp_idle;
7 bit outermost;
8
9 do
10 :: i >= MAX_DYNTICK_LOOP_IRQ &&
11 j >= MAX_DYNTICK_LOOP_IRQ -> break;
12 :: i < MAX_DYNTICK_LOOP_IRQ ->
13 atomic {
14 outermost = (in_dyntick_irq == 0);
15 in_dyntick_irq = 1;
16 }
17 stmt1: skip;
18 atomic {
19 if
20 :: in_dyntick_nmi -> goto stmt1;
21 :: !in_dyntick_nmi && rcu_update_flag ->
22 goto stmt1_then;
23 :: else -> goto stmt1_else;
24 fi;
25 }
26 stmt1_then: skip;
27 EXECUTE_IRQ(stmt1_1, tmp = rcu_update_flag)
28 EXECUTE_IRQ(stmt1_2, rcu_update_flag = tmp + 1)
29 stmt1_else: skip;
30 stmt2: skip; atomic {
31 if
32 :: in_dyntick_nmi -> goto stmt2;
33 :: !in_dyntick_nmi &&
34 !in_interrupt &&
35 (dynticks_progress_counter & 1) == 0 ->
36 goto stmt2_then;
37 :: else -> goto stmt2_else;
38 fi;
39 }
40 stmt2_then: skip;
41 EXECUTE_IRQ(stmt2_1,
42 tmp = dynticks_progress_counter)
43 EXECUTE_IRQ(stmt2_2,
44 dynticks_progress_counter = tmp + 1)
45 EXECUTE_IRQ(stmt2_3, tmp = rcu_update_flag)
46 EXECUTE_IRQ(stmt2_4, rcu_update_flag = tmp + 1)
47 stmt2_else: skip;
48 EXECUTE_IRQ(stmt3, tmp = in_interrupt)
49 EXECUTE_IRQ(stmt4, in_interrupt = tmp + 1)
50 stmt5: skip;
51 atomic {
52 if
53 :: in_dyntick_nmi -> goto stmt4;
Note that we have open-coded the “if” statements (for example, lines 17–29). In
addition, statements that process strictly local state (such as line 59) need not exclude
dyntick_nmi().
Finally, grace_period() requires only a few changes:
1 proctype grace_period()
2 {
3 byte curr;
4 byte snap;
5 bit shouldexit;
6
7 grace_period_state = GP_IDLE;
8 atomic {
9 printf("MDL_NOHZ = %d\n", MAX_DYNTICK_LOOP_NOHZ);
10 printf("MDL_IRQ = %d\n", MAX_DYNTICK_LOOP_IRQ);
11 printf("MDL_NMI = %d\n", MAX_DYNTICK_LOOP_NMI);
12 shouldexit = 0;
13 snap = dynticks_progress_counter;
14 grace_period_state = GP_WAITING;
15 }
16 do
17 :: 1 ->
18 atomic {
19 assert(!shouldexit);
20 shouldexit = dyntick_nohz_done &&
21 dyntick_irq_done &&
22 dyntick_nmi_done;
23 curr = dynticks_progress_counter;
24 if
25 :: (curr - snap) >= 2 || (curr & 1) == 0 ->
26 break;
27 :: else -> skip;
28 fi;
29 }
30 od;
31 grace_period_state = GP_DONE;
32 grace_period_state = GP_IDLE;
33 atomic {
34 shouldexit = 0;
35 snap = dynticks_progress_counter;
36 grace_period_state = GP_WAITING;
37 }
38 do
39 :: 1 ->
40 atomic {
41 assert(!shouldexit);
42 shouldexit = dyntick_nohz_done &&
43 dyntick_irq_done &&
44 dyntick_nmi_done;
45 curr = dynticks_progress_counter;
46 if
47 :: (curr != snap) || ((curr & 1) == 0) ->
48 break;
49 :: else -> skip;
50 fi;
51 }
52 od;
53 grace_period_state = GP_DONE;
54 }
Quick Quiz 12.21: Does Paul always write his code in this painfully incremental manner?
2. Documenting code can help locate bugs. In this case, the documentation effort
located a misplaced memory barrier in rcu_enter_nohz() and rcu_exit_nohz(),
as shown by the following patch [McK08d].
3. Validate your code early, often, and up to the point of destruction. This effort
located one subtle bug in rcu_try_flip_waitack_needed() that would have
been quite difficult to test or debug, as shown by the following patch [McK08c].
4. Always verify your verification code. The usual way to do this is to insert a
deliberate bug and verify that the verification code catches it. Of course, if the
verification code fails to catch this bug, you may also need to verify the bug itself,
and so on, recursing infinitely. However, if you find yourself in this position, getting
a good night’s sleep can be an extremely effective debugging technique. You will
then see that the obvious verify-the-verification technique is to deliberately insert
bugs in the code being verified. If the verification fails to find them, the verification
clearly is buggy.
5. Use of atomic instructions can simplify verification. Unfortunately, use of the
cmpxchg atomic instruction would also slow down the critical IRQ fastpath, so
it is not appropriate in this case.
6. The need for complex formal verification often indicates a need to re-think
your design.
To this last point, it turns out that there is a much simpler solution to the dynticks
problem, which is presented in the next section.
dynticks_nesting
This counts the number of reasons that the corresponding CPU should be monitored
for RCU read-side critical sections. If the CPU is in dynticks-idle mode, then this
counts the IRQ nesting level, otherwise it is one greater than the IRQ nesting level.
dynticks
This counter’s value is even if the corresponding CPU is in dynticks-idle mode and
there are no IRQ handlers currently running on that CPU, otherwise the counter’s
value is odd. In other words, if this counter’s value is odd, then the corresponding
CPU might be in an RCU read-side critical section.
dynticks_nmi
This counter’s value is odd if the corresponding CPU is in an NMI handler, but
only if the NMI arrived while this CPU was in dyntick-idle mode with no IRQ
handlers running. Otherwise, the counter’s value will be even.
dynticks_snap
This will be a snapshot of the dynticks counter, but only if the current RCU grace
period has extended for too long a duration.
dynticks_nmi_snap
This will be a snapshot of the dynticks_nmi counter, but again only if the current
RCU grace period has extended for too long a duration.
If both dynticks and dynticks_nmi have taken on an even value during a given
time interval, then the corresponding CPU has passed through a quiescent state during
that interval.
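This rule can be expressed as a small predicate over a snapshot and a later sample of the two counters. The following C sketch is an illustration under the assumption that each counter changes only by increments of one, so that any change from an odd value must pass through an even value. The struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical sample of one CPU's two counters. */
struct dynticks_sample {
	long dynticks;
	long dynticks_nmi;
};

/*
 * A counter took on an even value somewhere in the interval if it is
 * even now or if it changed since the snapshot.
 */
static bool counter_was_even(long curr, long snap)
{
	return curr != snap || (curr & 0x1) == 0;
}

/* Both counters must have been even at some point during the interval. */
static bool passed_quiescent_state(struct dynticks_sample snap,
				   struct dynticks_sample curr)
{
	return counter_was_even(curr.dynticks, snap.dynticks) &&
	       counter_was_even(curr.dynticks_nmi, snap.dynticks_nmi);
}
```

The only way for a counter to fail this test is to hold the same odd value at both ends of the interval, which is exactly the case in which the CPU might have been in an RCU read-side critical section the whole time.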
Quick Quiz 12.22: But what happens if an NMI handler starts running before an IRQ handler
completes, and if that NMI handler continues running until a second IRQ handler starts?
Line 6 ensures that any prior memory accesses (which might include accesses from
RCU read-side critical sections) are seen by other CPUs before those marking entry to
dynticks-idle mode. Lines 7 and 12 disable and reenable IRQs. Line 8 acquires a pointer
to the current CPU’s rcu_dynticks structure, and line 9 increments the current CPU’s
dynticks counter, which should now be even, given that we are entering dynticks-idle
mode in process context. Finally, line 10 decrements dynticks_nesting, which
should now be zero.
The rcu_exit_nohz() function is quite similar, but increments dynticks_nesting
rather than decrementing it and checks for the opposite dynticks polarity.
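A user-space C sketch of this pair of functions follows. It is a single-threaded model, not the kernel's code: struct rcu_dynticks_model is a hypothetical stand-in for the per-CPU rcu_dynticks structure, IRQ disabling is reduced to a comment, assert() stands in for the kernel's sanity checks, and atomic_thread_fence() stands in for the memory barrier. Placing the barrier last in the exit function is an assumption made by symmetry with the entry function.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical stand-in for one CPU's rcu_dynticks structure. */
struct rcu_dynticks_model {
	int dynticks_nesting;	/* reasons to monitor this CPU for readers */
	int dynticks;		/* even: dynticks-idle with no irq handlers */
};

/* Models entry to dynticks-idle mode from process context. */
static void model_rcu_enter_nohz(struct rcu_dynticks_model *rdtp)
{
	/* Prior accesses must be seen before the marking of idle entry. */
	atomic_thread_fence(memory_order_seq_cst);
	/* The kernel disables irqs here; this model is single-threaded. */
	rdtp->dynticks++;
	assert((rdtp->dynticks & 0x1) == 0);	/* should now be even */
	rdtp->dynticks_nesting--;
	assert(rdtp->dynticks_nesting == 0);	/* should now be zero */
}

/* Models exit from dynticks-idle mode: opposite polarity and direction. */
static void model_rcu_exit_nohz(struct rcu_dynticks_model *rdtp)
{
	rdtp->dynticks++;
	assert((rdtp->dynticks & 0x1) == 1);	/* should now be odd */
	rdtp->dynticks_nesting++;
	/* Assumed placement: later accesses must follow the marking of exit. */
	atomic_thread_fence(memory_order_seq_cst);
}
```

Starting from a running CPU with both fields equal to one, a call to model_rcu_enter_nohz() followed by model_rcu_exit_nohz() leaves dynticks at three and dynticks_nesting back at one.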
sections. Line 10 therefore executes a memory barrier to ensure that the increment of
dynticks is seen before any RCU read-side critical sections that the subsequent IRQ
handler might execute.
Line 18 of rcu_irq_exit() decrements dynticks_nesting, and if the result is
non-zero, line 19 silently returns. Otherwise, line 20 executes a memory barrier to
ensure that the increment of dynticks on line 21 is seen after any RCU read-side critical
sections that the prior IRQ handler might have executed. Line 22 verifies that dynticks
is now even, consistent with the fact that no RCU read-side critical sections may appear
in dynticks-idle mode. Lines 23–25 check to see if the prior IRQ handlers enqueued any
RCU callbacks, forcing this CPU out of dynticks-idle mode via a reschedule API if so.
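The nesting logic described above can be sketched in user-space C as follows. This is a hedged single-threaded model: struct rcu_dynticks_model is a hypothetical stand-in for the per-CPU structure, atomic_thread_fence() stands in for the memory barrier, assert() stands in for the kernel's sanity checks, and the check for enqueued RCU callbacks is left out. The entry function is written as the mirror image of the exit function, which is an assumption rather than a transcription.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical stand-in for one CPU's rcu_dynticks structure. */
struct rcu_dynticks_model {
	int dynticks_nesting;	/* reasons to monitor this CPU for readers */
	int dynticks;		/* even: dynticks-idle with no irq handlers */
};

/* Models irq entry: only the first reason to monitor bumps dynticks. */
static void model_rcu_irq_enter(struct rcu_dynticks_model *rdtp)
{
	if (rdtp->dynticks_nesting++)
		return;		/* CPU was not idle, or this irq is nested */
	rdtp->dynticks++;
	assert((rdtp->dynticks & 0x1) == 1);	/* should now be odd */
	/* The increment must be seen before the handler's RCU readers. */
	atomic_thread_fence(memory_order_seq_cst);
}

/* Models irq exit: only the last reason to monitor bumps dynticks. */
static void model_rcu_irq_exit(struct rcu_dynticks_model *rdtp)
{
	if (--rdtp->dynticks_nesting)
		return;		/* still a reason to monitor this CPU */
	/* The handler's RCU readers must be seen before the increment. */
	atomic_thread_fence(memory_order_seq_cst);
	rdtp->dynticks++;
	assert((rdtp->dynticks & 0x1) == 0);	/* should now be even */
}
```

From an idle CPU (dynticks_nesting zero, dynticks even), two nested entries and two exits change dynticks exactly twice, once at the outermost entry and once at the outermost exit. From a non-idle CPU, dynticks is never touched.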
Linux-kernel RCU’s dyntick-idle code has since been rewritten yet again based on a
suggestion from Andy Lutomirski [McK15c], but it is time to sum up and move on to
other topics.
12.1.6.15 Discussion
A slight shift in viewpoint resulted in a substantial simplification of the dynticks
interface for RCU. The key change leading to this simplification was minimizing
sharing between IRQ and NMI contexts. The only sharing in this simplified interface is
references from NMI context to IRQ variables (the dynticks variable). This type of
sharing is benign, because the NMI functions never update this variable, so that its value
remains constant through the lifetime of the NMI handler. This limitation of sharing
allows the individual functions to be understood one at a time, in happy contrast to the
situation described in Section 12.1.5, where an NMI might change shared state at any
point during execution of the IRQ functions.
Verification can be a good thing, but simplicity is even better.
Although Promela and Spin allow you to verify pretty much any (smallish) algorithm,
their very generality can sometimes be a curse. For example, Promela does not
understand memory models or any sort of reordering semantics. This section therefore
describes some state-space search tools that understand memory models used by
production systems, greatly simplifying the verification of weakly ordered code.
For example, Section 12.1.4 showed how to convince Promela to account for weak
memory ordering. Although this approach can work well, it requires that the developer
fully understand the system’s memory model. Unfortunately, few (if any) developers
fully understand the complex memory models of modern CPUs.
Therefore, another approach is to use a tool that already understands this memory
ordering, such as the PPCMEM tool produced by Peter Sewell and Susmit Sarkar at the
University of Cambridge, Luc Maranget, Francesco Zappa Nardelli, and Pankaj Pawan
at INRIA, and Jade Alglave at Oxford University, in cooperation with Derek Williams
of IBM [AMP+11]. This group formalized the memory models of Power, Arm, x86, as
well as that of the C/C++11 standard [Smi19], and produced the PPCMEM tool based
on the Power and Arm formalizations.
Quick Quiz 12.24: But x86 has strong memory ordering, so why formalize its memory model?
The PPCMEM tool takes litmus tests as input. A sample litmus test is presented
in Section 12.2.1. Section 12.2.2 relates this litmus test to the equivalent C-language
program, Section 12.2.3 describes how to apply PPCMEM to this litmus test, and
Section 12.2.4 discusses the implications.
Lines 8–17 are the lines of code for each process. A given process can have empty
lines, as is the case for P0’s line 11 and P1’s lines 12–17. Labels and branches are
permitted, as demonstrated by the branch on line 14 to the label on line 17. That said,
too-free use of branches will expand the state space. Use of loops is a particularly good
way to explode your state space.
Lines 19–20 show the assertion, which in this case indicates that we are interested in
whether P0’s and P1’s r3 registers can both contain zero after both threads complete
execution. This assertion is important because there are a number of use cases that
would fail miserably if both P0 and P1 saw zero in their respective r3 registers.
This should give you enough information to construct simple litmus tests. Some
additional documentation is available, though much of this additional documentation
is intended for a different research tool that runs tests on actual hardware. Perhaps
more importantly, a large number of pre-existing litmus tests are available with the
online tool (available via the “Select ARM Test” and “Select POWER Test” buttons at
https://www.cl.cam.ac.uk/~pes20/ppcmem/). It is quite likely that one of these
pre-existing litmus tests will answer your Power or Arm memory-ordering question.
Putting all this together, the C-language equivalent to the entire litmus test is as shown
in Listing 12.24. The key point is that if atomic_add_return() acts as a full memory
barrier (as the Linux kernel requires it to), then it should be impossible for P0()’s and
P1()’s r3 variables to both be zero after execution completes.
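To see why full barriers rule out this outcome, it can help to enumerate the possible executions by hand or by a few lines of code. The sketch below assumes the usual store-buffering shape, in which P0 stores to x and then loads y while P1 stores to y and then loads x, and it assumes that the full barriers make the four accesses behave as if they were interleaved in some global order. The barriers themselves therefore do not appear as steps, and the names sb_outcomes and sb_enumerate() are hypothetical. This is a toy full-state-space search over interleavings, not a model of Power's weak ordering.

```c
#include <assert.h>

/* Outcome counts indexed by [P0's r3][P1's r3]. */
struct sb_outcomes {
	int count[2][2];
};

/*
 * Enumerate every interleaving of two threads with two operations each.
 * Bit i of "schedule" selects which thread runs at step i: 1 for P0,
 * 0 for P1.  Only schedules giving each thread exactly two steps count.
 */
static struct sb_outcomes sb_enumerate(void)
{
	struct sb_outcomes out = { { { 0, 0 }, { 0, 0 } } };

	for (unsigned int schedule = 0; schedule < 16; schedule++) {
		int x = 0, y = 0;		/* shared variables */
		int r3_p0 = 0, r3_p1 = 0;	/* per-thread results */
		int pc0 = 0, pc1 = 0;		/* per-thread program counters */
		int p0_steps = 0;

		for (int step = 0; step < 4; step++)
			p0_steps += (schedule >> step) & 1;
		if (p0_steps != 2)
			continue;

		for (int step = 0; step < 4; step++) {
			if ((schedule >> step) & 1) {
				if (pc0++ == 0)
					x = 1;		/* P0: store to x */
				else
					r3_p0 = y;	/* P0: load from y */
			} else {
				if (pc1++ == 0)
					y = 1;		/* P1: store to y */
				else
					r3_p1 = x;	/* P1: load from x */
			}
		}
		out.count[r3_p0][r3_p1]++;
	}
	return out;
}
```

Of the six interleavings, none ends with both r3 values equal to zero, because whichever load runs last must follow both stores. A tool such as PPCMEM is needed precisely because real hardware is not limited to these six interleavings unless the barriers do their job.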
Quick Quiz 12.28: Does the lwsync on line 10 in Listing 12.23 provide sufficient ordering?
4. The tools are restricted to small loop-free code fragments running on small numbers
of threads. Larger examples result in state-space explosion, just as with similar
tools such as Promela and Spin.
5. The full state-space search does not give any indication of how each offending
state was reached. That said, once you realize that the state is in fact reachable, it
is usually not too hard to find that state using the interactive tool.
6. These tools are not much good for complex data structures, although it is possible
to create and traverse extremely simple linked lists using initialization statements
of the form “x=y; y=z; z=42;”.
7. These tools do not handle memory mapped I/O or device registers. Of course,
handling such things would require that they be formalized, which does not appear
to be in the offing.
8. The tools will detect only those problems for which you code an assertion. This
weakness is common to all formal methods, and is yet another reason why testing
remains important. In the immortal words of Donald Knuth quoted at the beginning
of this chapter, “Beware of bugs in the above code; I have only proved it correct,
not tried it.”
That said, one strength of these tools is that they are designed to model the full range
of behaviors allowed by the architectures, including behaviors that are legal, but which
current hardware implementations do not yet inflict on unwary software developers.
Therefore, an algorithm that is vetted by these tools likely has some additional safety
margin when running on real hardware. Furthermore, testing on real hardware can only
find bugs; such testing is inherently incapable of proving a given usage correct. To
appreciate this, consider that the researchers routinely ran in excess of 100 billion test
runs on real hardware to validate their model. In one case, behavior that is allowed
by the architecture did not occur, despite 176 billion runs [AMP+ 11]. In contrast, the
full-state-space search allows the tool to prove code fragments correct.
It is worth repeating that formal methods and tools are no substitute for testing. The
fact is that producing large reliable concurrent software artifacts, the Linux kernel for
example, is quite difficult. Developers must therefore be prepared to apply every tool at
their disposal towards this goal. The tools presented in this chapter are able to locate
bugs that are quite difficult to produce (let alone track down) via testing. On the other
hand, testing can be applied to far larger bodies of software than the tools presented in
this chapter are ever likely to handle. As always, use the right tools for the job!
Of course, it is always best to avoid the need to work at this level by designing your
parallel code to be easily partitioned and then using higher-level primitives (such as
locks, sequence counters, atomic operations, and RCU) to get your job done more
straightforwardly. And even if you absolutely must use low-level memory barriers and
read-modify-write instructions to get your job done, the more conservative your use of
these sharp instruments, the easier your life is likely to be.
Although the PPCMEM tool can solve the famous “independent reads of independent
writes” (IRIW) litmus test shown in Listing 12.27, doing so requires no less than
fourteen CPU hours and generates no less than ten gigabytes of state space. That said,
this situation is a great improvement over that before the advent of PPCMEM, where
solving this problem required perusing volumes of reference manuals, attempting proofs,
discussing with experts, and being unsure of the final answer. Although fourteen hours
can seem like a long time, it is much shorter than weeks or even months.
However, the time required is a bit surprising given the simplicity of the litmus test,
which has two threads storing to two separate variables and two other threads loading
from these two variables in opposite orders. The assertion triggers if the two loading
threads disagree on the order of the two stores. Even by the standards of memory-order
litmus tests, this is quite simple.
One reason for the amount of time and space consumed is that PPCMEM does a
trace-based full-state-space search, which means that it must generate and evaluate
all possible orders and combinations of events at the architectural level. At this level,
both loads and stores correspond to ornate sequences of events and actions, resulting
in a very large state space that must be completely searched, in turn resulting in large
memory and CPU consumption.
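To get a feel for what “generate and evaluate all possible orders” means, consider the following sketch, which exhaustively enumerates the interleavings of the IRIW litmus test. This is a deliberately simplified illustration rather than PPCMEM’s algorithm: It assumes sequential consistency and treats each load and store as a single indivisible step, whereas PPCMEM models each access as a sequence of architectural events. The names iriw_state, iriw_step(), iriw_search(), and iriw_count() are hypothetical.

```c
#include <assert.h>

#define NTHREADS 4

/* Shared variables, per-thread program counters, and reader registers. */
struct iriw_state {
	int x, y;
	int pc[NTHREADS];
	int r[4];
};

/* Number of memory accesses in each thread: two writers, two readers. */
static const int iriw_len[NTHREADS] = { 1, 1, 2, 2 };

/* Execute the next access of thread t as one indivisible step. */
static void iriw_step(struct iriw_state *s, int t)
{
	assert(t >= 0 && t < NTHREADS && s->pc[t] < iriw_len[t]);
	switch (t) {
	case 0: /* Writer: x = 1 */
		s->x = 1;
		break;
	case 1: /* Writer: y = 1 */
		s->y = 1;
		break;
	case 2: /* Reader: loads x, then y */
		if (s->pc[t] == 0)
			s->r[0] = s->x;
		else
			s->r[1] = s->y;
		break;
	case 3: /* Reader: loads y, then x */
		if (s->pc[t] == 0)
			s->r[2] = s->y;
		else
			s->r[3] = s->x;
		break;
	}
	s->pc[t]++;
}

/* Depth-first search over every interleaving reachable from state s. */
static void iriw_search(struct iriw_state s, long *total, long *bad)
{
	int finished = 1;

	for (int t = 0; t < NTHREADS; t++) {
		if (s.pc[t] < iriw_len[t]) {
			struct iriw_state next = s;

			finished = 0;
			iriw_step(&next, t);
			iriw_search(next, total, bad);
		}
	}
	if (finished) {
		(*total)++;
		/* The readers disagree on the order of the two stores. */
		if (s.r[0] == 1 && s.r[1] == 0 && s.r[2] == 1 && s.r[3] == 0)
			(*bad)++;
	}
}

/* Returns the number of interleavings; *bad counts disagreements. */
long iriw_count(long *bad)
{
	struct iriw_state init = { 0 };
	long total = 0;

	*bad = 0;
	iriw_search(init, &total, bad);
	return total;
}
```

Even this toy model visits 180 interleavings for a mere six memory accesses, none of which trigger the assertion under the assumed sequential consistency. Splitting each access into multiple architectural events, as a weak-memory model must, multiplies the number of orderings accordingly.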
Of course, many of the traces are quite similar to one another, which suggests that
an approach that treated similar traces as one might improve performance. One such
approach is the axiomatic approach of Alglave et al. [AMT14], which creates a set of
axioms to represent the memory model and then converts litmus tests to theorems that
might be proven or disproven over this set of axioms. The resulting tool, called “herd”,
conveniently takes as input the same litmus tests as PPCMEM, including the IRIW
litmus test shown in Listing 12.27.
However, where PPCMEM requires 14 CPU hours to solve IRIW, herd does so in
17 milliseconds, which represents a speedup of more than six orders of magnitude.
That said, the problem is exponential in nature, so we should expect herd to exhibit
exponential slowdowns for larger problems. And this is exactly what happens: For
example, if we add four more writes per writing CPU as shown in Listing 12.28, herd
slows down by a factor of more than 50,000, requiring more than 15 minutes of CPU
time. Adding threads also results in exponential slowdowns [MS14].
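The two ratios quoted above can be checked directly from the stated run times. The following arithmetic uses only the numbers given in the text (14 CPU hours, 17 milliseconds, and 15 minutes):

```latex
\frac{14\ \mathrm{h}}{17\ \mathrm{ms}}
  = \frac{14 \times 3600\ \mathrm{s}}{0.017\ \mathrm{s}}
  \approx 3.0 \times 10^{6},
\qquad
\frac{15\ \mathrm{min}}{17\ \mathrm{ms}}
  = \frac{900\ \mathrm{s}}{0.017\ \mathrm{s}}
  \approx 5.3 \times 10^{4}
```

The first ratio is the speedup of more than six orders of magnitude, and the second is the slowdown of more than 50,000.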
Despite their exponential nature, both PPCMEM and herd have proven quite useful
for checking key parallel algorithms, including the queued-lock handoff on x86 systems.
The weaknesses of the herd tool are similar to those of PPCMEM, which were described
in Section 12.2.4. There are some obscure (but very real) cases for which the PPCMEM
and herd tools disagree, and as of 2021 many but not all of these disagreements had been
resolved.
It would be helpful if the litmus tests could be written in C (as in Listing 12.24) rather
than assembly (as in Listing 12.23). This is now possible, as will be described in the
following sections.
Of course, if P0() and P1() use different locks, as shown in Listing 12.30
(C-Lock2.litmus), then all bets are off. And in this case, the herd tool’s output features
the string Sometimes, correctly indicating that use of different locks allows P1() to
see x having a value of one.
4 The output of the herd tool is compatible with that of PPCMEM, so feel free to look at
Listings 12.25 and 12.26 for examples showing the output format.
Quick Quiz 12.30: Why bother modeling locking directly? Why not simply emulate locking
with atomic operations?
But locking is not the only synchronization primitive that can be modeled directly:
The next section looks at RCU.
readers (modeled by P0() on lines 12–35) access a linked list in a round-robin fashion by
“leaking” a pointer to the last list element accessed into variable c. Updaters (modeled
by P1() on lines 37–49) remove an element, taking care to avoid disrupting current or
future readers.
Quick Quiz 12.31: Wait!!! Isn’t leaking pointers out of an RCU read-side critical section a
critical bug???
Lines 4–8 define the initial linked list, tail first. In the Linux kernel, this would be a
doubly linked circular list, but herd is currently incapable of modeling such a beast.
The strategy is instead to use a singly linked linear list that is long enough that the end is
never reached. Line 9 defines variable c, which is used to cache the list pointer between
successive RCU read-side critical sections.
Again, P0() on lines 12–35 models readers. This process models a pair of successive
readers traversing round-robin through the list, with the first reader on lines 19–26 and
the second reader on lines 27–34. Line 20 fetches the pointer cached in c, and if line 21
sees that the pointer was NULL, line 22 restarts at the beginning of the list. In either case,
line 24 advances to the next list element, and line 25 stores a pointer to this element back
into variable c. Lines 27–34 repeat this process, but using registers r3 and r4 instead
of r1 and r2. As with Listing 12.31, this litmus test stores zero to emulate free(), so
line 52 checks for any of these four registers being NULL, also known as zero.
Because P0() leaks an RCU-protected pointer from its first RCU read-side critical
section to its second, P1() must carry out its update (removing x) very carefully. Line 41
removes x by linking w to y. Line 42 waits for readers, after which no subsequent reader
has a path to x via the linked list. Line 43 fetches c, and if line 44 determines that
c references the newly removed x, line 45 sets c to NULL and line 46 again waits for
readers, after which no subsequent reader can fetch x from c. In either case, line 48
emulates free() by storing zero to x.
Quick Quiz 12.32: In Listing 12.32, why couldn’t a reader fetch c just before P1() zeroed it
on line 45, and then later store this same value back into c just after it was zeroed, thus defeating
the zeroing operation?
The output of the herd tool when running this litmus test features Never, indicating
that P0() never accesses a freed element, as expected. Also as expected, removing
either synchronize_rcu() results in P0() accessing a freed element, as indicated by
Sometimes in the herd output.
Quick Quiz 12.33: In Listing 12.32, why not have just one call to synchronize_rcu()
immediately before line 48?
Quick Quiz 12.34: Also in Listing 12.32, can’t line 48 be WRITE_ONCE() instead of
smp_store_release()?
These sections have shown how axiomatic approaches can successfully model
synchronization primitives such as locking and RCU in C-language litmus tests. Longer
term, the hope is that the axiomatic approaches will model even higher-level software
artifacts, producing exponential verification speedups. This could potentially allow
axiomatic verification of much larger software systems, perhaps incorporating spatial-
synchronization techniques from separation logic [GRY13, ORY01]. Another alternative
is to press the axioms of boolean logic into service, as described in the next section.
12.4 SAT Solvers
[Figure 12.2: cbmc processing flow: C Code → CBMC → Logic Expression → SAT Solver → Trace Generation (If Counterexample Located) → Verification Result]
Any finite program with bounded loops and recursion can be converted into a logic
expression, which might express that program’s assertions in terms of its inputs. Given
such a logic expression, it would be quite interesting to know whether any possible
combinations of inputs could result in one of the assertions triggering. If the inputs are
expressed as combinations of boolean variables, this is simply SAT, also known as the
satisfiability problem. SAT solvers are heavily used in verification of hardware, which
has motivated great advances. A world-class early 1990s SAT solver might be able to
handle a logic expression with 100 distinct boolean variables, but by the early 2010s
million-variable SAT solvers were readily available [KS08].
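The satisfiability question itself can be stated in a few lines of code. The following sketch answers it by brute-force enumeration of all assignments, which is emphatically not how production SAT solvers work, but which does show exactly what is being asked. The clause encoding (positive and negative integers for literals, zero for padding) and the names clause_true() and sat_brute_force() are assumptions made for this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_LITS 3 /* Literals per clause; unused slots hold 0. */

/*
 * A literal is +v for variable v and -v for its negation, with
 * variables numbered from 1.  Bit (v - 1) of "assignment" is v's value.
 */
static bool clause_true(const int lits[MAX_LITS], unsigned int assignment)
{
	for (int i = 0; i < MAX_LITS; i++) {
		int lit = lits[i];
		int var = lit > 0 ? lit : -lit;

		if (lit == 0)
			continue; /* Padding. */
		if ((lit > 0) == (((assignment >> (var - 1)) & 1u) != 0))
			return true;
	}
	return false;
}

/*
 * Try all 2^nvars assignments.  Returns true and stores the first
 * satisfying assignment in *model if the conjunction of clauses is
 * satisfiable, otherwise returns false.
 */
bool sat_brute_force(const int clauses[][MAX_LITS], size_t nclauses,
		     int nvars, unsigned int *model)
{
	assert(nvars >= 1 && nvars < 32);
	for (unsigned int a = 0; a < (1u << nvars); a++) {
		size_t c;

		for (c = 0; c < nclauses; c++)
			if (!clause_true(clauses[c], a))
				break;
		if (c == nclauses) {
			if (model)
				*model = a;
			return true;
		}
	}
	return false;
}
```

The outer loop runs up to 2^nvars times, so this approach is hopeless long before reaching the million-variable problems mentioned above. It is the solvers’ ability to prune this search that makes them practical.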
In addition, front-end programs for SAT solvers can automatically translate C code
into logic expressions, taking assertions into account and generating assertions for error
conditions such as array-bounds errors. One example is the C bounded model checker,
or cbmc, which is available as part of many Linux distributions. This tool is quite easy
to use, with cbmc test.c sufficing to validate test.c, resulting in the processing flow
shown in Figure 12.2. This ease of use is exceedingly important because it opens the
door to formal verification being incorporated into regression-testing frameworks. In
contrast, the traditional tools that require non-trivial translation to a special-purpose
language are confined to design-time verification.
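For concreteness, here is a sketch of the sort of code that might appear in a file such as test.c. It is a hypothetical example rather than one taken from the cbmc distribution: It contains a bounded loop, an explicit assertion, and an array index, which are the kinds of properties that the text above says such tools check. The names NBUCKETS, bucket_of(), and histogram() are invented for this illustration.

```c
#include <assert.h>
#include <stddef.h>

#define NBUCKETS 8

/* Map an arbitrary key to a bucket index. */
size_t bucket_of(unsigned int key)
{
	size_t idx = key % NBUCKETS;

	assert(idx < NBUCKETS); /* The property to be checked. */
	return idx;
}

/* Count how many of the n keys fall into each bucket. */
void histogram(const unsigned int *keys, size_t n,
	       unsigned int counts[NBUCKETS])
{
	for (size_t i = 0; i < NBUCKETS; i++)
		counts[i] = 0;
	for (size_t i = 0; i < n; i++)
		counts[bucket_of(keys[i])]++; /* Index must stay in bounds. */
}
```

An actual verification run would also need an entry point, for example a main() function that invokes histogram() on a small array, and might need options controlling loop unwinding. Please consult the tool’s documentation for these details, which are outside the scope of this sketch.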
More recently, SAT solvers have appeared that handle parallel code. These solvers
operate by converting the input code into static single assignment (SSA) form, then
generating all permitted access orders. This approach seems promising, but it remains
to be seen how well it works in practice. One encouraging sign is work in 2016
applying cbmc to Linux-kernel RCU [LMKM16, LMKM18, Roy17]. This work used
minimal configurations of RCU, and verified scenarios using small numbers of threads,
but nevertheless successfully ingested Linux-kernel C code and produced a useful
result. The logic expressions generated from the C code had up to 90 million variables,
450 million clauses, occupied tens of gigabytes of memory, and required up to 80 hours
of CPU time for the SAT solver to produce the correct result.
Nevertheless, a Linux-kernel hacker might be justified in feeling skeptical of a claim
that his or her code had been automatically verified, and such hackers would find
many fellow skeptics going back decades [DMLP79]. One way to productively express
such skepticism is to provide bug-injected versions of the allegedly verified code. If
the formal-verification tool finds all the injected bugs, our hacker might gain more
confidence in the tool’s capabilities. Of course, tools that find valid bugs of which
the hacker was not yet aware will likely engender even more confidence. And this is
exactly why there is a git archive with a 20-branch set of mutations, with each branch
potentially containing a bug injected into Linux-kernel RCU [McK17]. Anyone with a
formal-verification tool is cordially invited to try that tool out on this set of verification
challenges.
Currently, cbmc is able to find a number of injected bugs. However, it has not yet been
able to locate a bug that RCU’s maintainer was not already aware of. Nevertheless, there
is some reason to hope that SAT solvers will someday be useful for finding concurrency
bugs in parallel code.
The SAT-solver approaches described in the previous section are quite convenient and
powerful, but the full tracking of all possible executions, including state, can incur
substantial overhead. In fact, the memory and CPU-time overheads can sharply limit
the size of programs that can be feasibly verified, which raises the question of whether
less-exact approaches might find bugs in larger programs.
Although the jury is still out on this question, stateless model checkers such as
Nidhugg [LSLK14] have in some cases handled larger programs [KS17b], and with
similar ease of use, as illustrated by Figure 12.3. In addition, Nidhugg was more than
an order of magnitude faster than was cbmc for some Linux-kernel RCU verification
scenarios. Of course, Nidhugg’s speed and scalability advantages are tied to the fact
that it does not handle data non-determinism, but this was not a factor in these particular
verification scenarios.
Nevertheless, as with cbmc, Nidhugg has not yet been able to locate a bug that
Linux-kernel RCU’s maintainer was not already aware of. However, it was able to
demonstrate that one historical bug in Linux-kernel RCU was fixed by a different commit
than the maintainer thought, which gives some additional hope that stateless model
checkers like Nidhugg might someday be useful for finding concurrency bugs in parallel
code.
[Figure 12.3: Nidhugg processing flow: C Code → Nidhugg → LLVM Internal Representation → Dynamic Partial Order Reduction (DPOR) Algorithm → Trace Generation (If Counterexample Located) → Verification Result]
12.6 Summary
The formal-verification techniques described in this chapter are very powerful tools
for validating small parallel algorithms, but they should not be the only tools in your
toolbox. Despite decades of focus on formal verification, testing remains the validation
workhorse for large parallel software systems [Cor06a, Jon11, McK15d].
It is nevertheless quite possible that this will not always be the case. To see this,
consider that there were estimated to be more than twenty billion instances of the Linux
kernel as of 2017. Suppose that the Linux kernel has a bug that manifests on average
every million years of runtime. As noted at the end of the preceding chapter, this
bug will be appearing more than 50 times per day across the installed base. But the
fact remains that most formal validation techniques can be used only on very small
codebases. So what is a concurrency coder to do?
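The failure rate quoted above follows directly from the two stated figures, assuming that failures are spread evenly across the installed base:

```latex
\frac{2 \times 10^{10}\ \text{instances}}{10^{6}\ \text{years per failure}}
  = 2 \times 10^{4}\ \text{failures per year},
\qquad
\frac{2 \times 10^{4}\ \text{failures per year}}{365\ \text{days per year}}
  \approx 55\ \text{failures per day}
```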
Think in terms of finding the first bug, the first relevant bug, the last relevant bug,
and the last bug.
The first bug is normally found via inspection or compiler diagnostics. Although the
increasingly sophisticated compiler diagnostics comprise a lightweight sort of formal
verification, it is not common to think of them in those terms. This is in part due to an
odd practitioner prejudice which says “If I am using it, it cannot be formal verification”
on the one hand, and a large gap between compiler diagnostics and verification research
on the other.
Although the first relevant bug might be located via inspection or compiler diagnostics,
it is not unusual for these two steps to find only typos and false positives. Either way,
the bulk of the relevant bugs, that is, those bugs that might actually be encountered in
production, will often be found via testing.
When testing is driven by anticipated or real use cases, it is not uncommon for the last
relevant bug to be located by testing. This situation might motivate a complete rejection
of formal verification. However, irrelevant bugs have an annoying habit of suddenly
becoming relevant at the least convenient moment possible, courtesy of black-hat attacks.
For security-critical software, which appears to be a continually increasing fraction of
the total, there can thus be strong motivation to find and fix the last bug. Testing is
demonstrably unable to find the last bug, so there is a possible role for formal verification,
assuming, that is, that formal verification proves capable of growing into that role. As
this chapter has shown, current formal verification systems are extremely limited.
Quick Quiz 12.35: But shouldn’t sufficiently low-level software be for all intents and purposes
immune to being exploited by black hats?
Please note that formal verification is often much harder to use than is testing. This is
in part a cultural statement, and there is reason to hope that formal verification will be
perceived to be easier with increased familiarity. That said, very simple test harnesses
can find significant bugs in arbitrarily large software systems. In contrast, the effort
required to apply formal verification seems to increase dramatically as the system size
increases.
I have nevertheless made occasional use of formal verification for almost 30 years
by playing to formal verification’s strengths, namely design-time verification of small
complex portions of the overarching software construct. The larger overarching software
construct is of course validated by testing.
Quick Quiz 12.36: In light of the full verification of the L4 microkernel, isn’t this limited
view of formal verification just a little bit obsolete?
One final approach is to consider the following two definitions from Section 11.1.2
and the consequence that they imply:
Definition: Bug-free programs are trivial programs.
Definition: Reliable programs have no known bugs.
Consequence: Any non-trivial reliable program contains at least one as-yet-unknown
bug.
From this viewpoint, any advances in validation and verification can have but two
effects: (1) An increase in the number of trivial programs or (2) A decrease in the
number of reliable programs. Of course, the human race’s increasing reliance on
multicore systems and software provides extreme motivation for a very sharp increase
in the number of trivial programs.
However, if your code is so complex that you find yourself relying too heavily
on formal-verification tools, you should carefully rethink your design, especially if
your formal-verification tools require your code to be hand-translated to a special-
purpose language. For example, a complex implementation of the dynticks interface
for preemptible RCU that was presented in Section 12.1.5 turned out to have a much
simpler alternative implementation, as discussed in Section 12.1.6.9. All else being
equal, a simpler implementation is much better than a proof of correctness for a complex
implementation.
And the open challenge to those working on formal verification techniques and
systems is to prove this summary wrong! To assist in this task, Verification Challenge 6
is now available [McK17]. Have at it!!!
[Figure: RCU lines of code (LoC, left axis), RCU test LoC, and test code as a percentage of the total (% Test, right axis), by Linux release from v2.6.12 through v6.3]
tests. Linux kernels v5.12 and v5.13 started adding the ability to change a given CPU’s
callback-offloading status at runtime and also added the torture.sh acceptance-test
script. Linux kernel v5.14 added distributed rcutorture. Linux kernel v5.15 added
demonic vCPU placement in rcutorture testing, which was successful in locating a
number of race conditions.5 Linux kernel v5.17 removed the RCU_FAST_NO_HZ Kconfig
option. Numerous other changes may be found in the Linux kernel’s git archives.
We have established that the validation budget varies from one project to the next, and
also over the lifetime of any given project. But how should the validation investment be
split between testing and formal verification?
This question is being answered naturally as compilers adopt increasingly aggressive
formal-verification techniques into their diagnostics and as formal-verification tools
continue to mature. In addition, the Linux-kernel lockdep and KCSAN tools illustrate
the advantages of combining formal verification techniques with run-time analysis, as
discussed in Section 11.3. Other combined techniques analyze traces gathered from
executions [dOCdO19]. For the time being, the best practice is to focus first on testing
and to reserve explicit work on formal verification for those portions of the project
that are not well-served by testing, and that have exceptional needs for robustness. For
example, Linux-kernel RCU relies primarily on testing, but has made occasional use of
formal verification as discussed in this chapter.
In short, choosing a validation plan for concurrent software remains more an art than
a science, let alone a field of engineering. However, there is every reason to expect that
increasingly rigorous approaches will continue to become more prevalent.
5 The trick is to place one pair of vCPUs within the same core on one socket, while
placing another pair within the same core on some other socket. As you might expect
from Chapter 3, this produces different memory latencies between different pairs of vCPUs
(https://paulmck.livejournal.com/62071.html).
You don’t learn how to shoot and then learn how to
launch and then learn to do a controlled spin—you
learn to launch-shoot-spin.
Ender’s Shadow, Orson Scott Card
Chapter 13
Putting It All Together
Another approach is to “just say no” to counting, following the example of the
noatime mount option. If this approach is feasible, it is clearly the best: After all,
nothing is faster than doing nothing. If the lookup count cannot be dispensed with, read
on!
Any of the counters from Chapter 5 could be pressed into service, with the statistical
counters described in Section 5.2 being perhaps the most common choice. However,
this results in a large memory footprint: The number of counters required is the number
of data elements multiplied by the number of threads.
If this memory overhead is excessive, then one approach is to keep per-core or
even per-socket counters rather than per-CPU counters, with an eye to the hash-table
performance results depicted in Figure 10.3. This will require that the counter increments
be atomic operations, especially for user-mode execution where a given thread could
migrate to another CPU at any time.
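A per-socket variant of such a counter might look something like the following sketch, which uses C11 atomics in place of the Linux kernel’s primitives. The shard count, the modulo mapping from CPU number to shard, and the lookup_count names are all assumptions made for illustration. Real code would derive the mapping from the system’s topology and would likely pad each shard to its own cache line.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define NSHARDS 4 /* Hypothetical: for example, one shard per socket. */

struct lookup_count {
	atomic_ulong shard[NSHARDS];
};

void lookup_count_init(struct lookup_count *lc)
{
	assert(lc != NULL);
	for (size_t i = 0; i < NSHARDS; i++)
		atomic_init(&lc->shard[i], 0);
}

/*
 * Count one lookup.  The increment must be atomic because several
 * CPUs share a shard, and because a user-mode thread might migrate
 * after "cpu" was sampled.
 */
void lookup_count_inc(struct lookup_count *lc, unsigned int cpu)
{
	atomic_fetch_add_explicit(&lc->shard[cpu % NSHARDS], 1,
				  memory_order_relaxed);
}

/* Sum the shards.  The result is approximate if updates are ongoing. */
unsigned long lookup_count_read(struct lookup_count *lc)
{
	unsigned long sum = 0;

	for (size_t i = 0; i < NSHARDS; i++)
		sum += atomic_load_explicit(&lc->shard[i],
					    memory_order_relaxed);
	return sum;
}
```

The memory footprint is now the number of data elements multiplied by NSHARDS rather than by the number of threads, at the cost of an atomic read-modify-write instruction on each lookup.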
If some elements are looked up very frequently, there are a number of approaches that
batch updates by maintaining a per-thread log, where multiple log entries for a given
element can be merged. After a given log entry has a sufficiently large increment or
after sufficient time has passed, the log entries may be applied to the corresponding data
elements. Silas Boyd-Wickizer has done some work formalizing this notion [BW14].
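The following sketch shows one way that such a per-thread log might be structured. It is not drawn from the cited work: The fixed-size log, the eviction policy, the threshold value, and all of the names are assumptions chosen to keep the example short. The time-based flush mentioned above is omitted, but could be approximated by having each thread invoke log_flush_all() periodically.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define LOG_SLOTS 4       /* Hypothetical per-thread log size. */
#define FLUSH_THRESHOLD 8 /* Hypothetical "sufficiently large" increment. */

struct element {
	atomic_ulong lookups; /* Shared count, updated only on flush. */
};

struct log_entry {
	struct element *elem; /* NULL if this slot is unused. */
	unsigned long pending;
};

/* One of these per thread, so no synchronization is needed on the log. */
struct thread_log {
	struct log_entry slot[LOG_SLOTS];
};

/* Apply one log entry to its element and mark the slot unused. */
static void log_flush_slot(struct log_entry *e)
{
	if (e->elem && e->pending)
		atomic_fetch_add_explicit(&e->elem->lookups, e->pending,
					  memory_order_relaxed);
	e->elem = NULL;
	e->pending = 0;
}

/* Record one lookup of elem, merging with any existing log entry. */
void log_lookup(struct thread_log *log, struct element *elem)
{
	struct log_entry *unused = NULL;

	assert(log != NULL && elem != NULL);
	for (size_t i = 0; i < LOG_SLOTS; i++) {
		struct log_entry *e = &log->slot[i];

		if (e->elem == elem) {
			if (++e->pending >= FLUSH_THRESHOLD)
				log_flush_slot(e);
			return;
		}
		if (!e->elem && !unused)
			unused = e;
	}
	if (!unused) { /* Log full: evict slot 0 to make room. */
		unused = &log->slot[0];
		log_flush_slot(unused);
	}
	unused->elem = elem;
	unused->pending = 1;
}

/* Apply all of this thread's pending increments. */
void log_flush_all(struct thread_log *log)
{
	for (size_t i = 0; i < LOG_SLOTS; i++)
		log_flush_slot(&log->slot[i]);
}
```

With these values, a frequently accessed element incurs one atomic operation per eight lookups instead of one per lookup. The price is that the shared count lags the true count until the log is flushed.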
1. A lock residing outside of the object must be held while manipulating the reference
count.
2. The object is created with a non-zero reference count, and new references may be
acquired only when the current value of the reference counter is non-zero. If a
thread does not have a reference to a given object, it might seek help from another
thread that already has a reference.
3. In some cases, hazard pointers may be used as a drop-in replacement for reference
counters.
4. An existence guarantee is provided for the object, thus preventing it from being
freed while some other entity might be attempting to acquire a reference. Existence
guarantees are often provided by automatic garbage collectors, and, as is seen in
Sections 9.3 and 9.5, by hazard pointers and RCU, respectively.
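The second approach in the list above, acquiring a reference only when the counter is already non-zero, can be sketched using a C11 compare-and-swap loop. The names obj_init(), obj_get_unless_zero(), and obj_put() are hypothetical, and the memory orderings shown are one plausible choice rather than the only correct one.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct obj {
	atomic_int refcount;
};

/* The object is created with a non-zero reference count. */
void obj_init(struct obj *o)
{
	assert(o != NULL);
	atomic_init(&o->refcount, 1);
}

/*
 * Acquire a reference, but only if the count is currently non-zero.
 * Returns false if the count had already reached zero, in which case
 * the object is being (or has been) released and must not be used.
 */
bool obj_get_unless_zero(struct obj *o)
{
	int old = atomic_load_explicit(&o->refcount, memory_order_relaxed);

	while (old != 0) {
		/* On failure, "old" is reloaded with the current value. */
		if (atomic_compare_exchange_weak_explicit(&o->refcount,
							  &old, old + 1,
							  memory_order_acquire,
							  memory_order_relaxed))
			return true;
	}
	return false;
}

/* Release a reference.  Returns true if this was the last one. */
bool obj_put(struct obj *o)
{
	return atomic_fetch_sub_explicit(&o->refcount, 1,
					 memory_order_acq_rel) == 1;
}
```

Note that this sketch does nothing to keep the memory containing the counter from being freed and reused while obj_get_unless_zero() is executing. Something else, such as one of the existence guarantees in the fourth approach, must handle that.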
                                     Release
Acquisition        Locks   Reference Counts   Hazard Pointers   RCU
Locks              −       CAM                M                 CA
Reference Counts   A       AM                 M                 A
Hazard Pointers    M       M                  M                 M
RCU                CA      MCA                M                 CA
and “MCA” are equivalent to “CAM”, so that there are sections below for only the
first four cases and the sixth case: “−”, “A”, “AM”, “CAM”, and “M”. Later sections
describe optimizations that can improve performance if reference acquisition and release
are very frequent, and the reference count need be checked for zero only very rarely.
Simple counting, with neither atomic operations nor memory barriers, can be used
when the reference-counter acquisition and release are both protected by the same
lock. In this case, it should be clear that the reference count itself may be manipulated
non-atomically, because the lock provides any necessary exclusion, memory barriers,
atomic instructions, and disabling of compiler optimizations. This is the method of
choice when the lock is required to protect other operations in addition to the reference
count, but where a reference to the object must be held after the lock is released.
Listing 13.1 shows a simple API that might be used to implement simple non-atomic
reference counting—although simple reference counting is almost always open-coded
instead.
The kref structure itself, consisting of a single atomic data item, is shown in lines 1–3
of Listing 13.2. The kref_init() function on lines 5–8 initializes the counter to
the value “1”. Note that the atomic_set() primitive is a simple assignment: The
name stems from the data type of atomic_t rather than from the operation. The
kref_init() function must be invoked during object creation, before the object has
been made available to any other CPU.
The kref_get() function on lines 10–14 unconditionally atomically increments the
counter. The atomic_inc() primitive does not necessarily explicitly disable compiler
optimizations on all platforms, but the fact that the kref primitives are in a separate
module and that the Linux kernel build process does no cross-module optimizations has
the same effect.
The kref_sub() function on lines 16–28 atomically decrements the counter, and if
the result is zero, line 24 invokes the specified release() function and line 25 returns,
informing the caller that release() was invoked. Otherwise, kref_sub() returns
zero, informing the caller that release() was not called.
2 As of Linux v4.10. Linux v4.11 introduced a refcount_t API that improves efficiency
on weakly ordered platforms, but which is functionally equivalent to the atomic_t that it
replaced.
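For readers without a Linux-kernel source tree at hand, the behavior described above can be approximated in userspace. The following sketch is not the kernel’s code: It substitutes C11 atomics for atomic_t, uses a my_ prefix to avoid any confusion with the real API, and adds a hypothetical struct widget to show how a release() function can recover the enclosing object.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct my_kref {
	atomic_int refcount;
};

/* Invoke during object creation, before the object is published. */
void my_kref_init(struct my_kref *k)
{
	atomic_init(&k->refcount, 1);
}

/* Unconditionally acquire a reference; caller must already hold one. */
void my_kref_get(struct my_kref *k)
{
	int old = atomic_fetch_add_explicit(&k->refcount, 1,
					    memory_order_relaxed);

	assert(old > 0);
	(void)old; /* Avoid an unused-variable warning under NDEBUG. */
}

/*
 * Drop "count" references.  If that leaves none, invoke release()
 * and return 1, otherwise return 0.
 */
int my_kref_sub(struct my_kref *k, int count,
		void (*release)(struct my_kref *k))
{
	assert(release != NULL);
	if (atomic_fetch_sub_explicit(&k->refcount, count,
				      memory_order_acq_rel) == count) {
		release(k);
		return 1;
	}
	return 0;
}

/* Hypothetical enclosing object. */
struct widget {
	int released;
	struct my_kref ref;
};

/* Recover the enclosing widget from its embedded my_kref. */
void widget_release(struct my_kref *k)
{
	struct widget *w =
		(struct widget *)((char *)k - offsetof(struct widget, ref));

	w->released = 1; /* Real code would free the widget here. */
}
```

A caller would typically invoke my_kref_sub(&w->ref, 1, widget_release) when it is finished with the widget, and would avoid touching the widget afterwards if the return value is non-zero.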
Quick Quiz 13.3: Suppose that just after the atomic_sub_and_test() on line 22 of
Listing 13.2 is invoked, that some other CPU invokes kref_get(). Doesn’t this result in that
other CPU now having an illegal reference to a released object?
Quick Quiz 13.4: Suppose that kref_sub() returns zero, indicating that the release()
function was not invoked. Under what conditions can the caller rely on the continued existence
of the enclosing object?
Quick Quiz 13.5: Why not just pass kfree() as the release function?
hensive debugging checks, but the overall effect in the absence of bugs is identical.
The Linux kernel’s fget() and fput() primitives use this style of reference counting.
Simplified versions of these functions are shown in Listing 13.4.4
Line 4 of fget() fetches the pointer to the current process’s file-descriptor table, which
might well be shared with other processes. Line 6 invokes rcu_read_lock(), which
enters an RCU read-side critical section. The callback function from any subsequent
call_rcu() primitive will be deferred until a matching rcu_read_unlock() is
reached (line 10 or 14 in this example). Line 7 looks up the file structure corresponding
to the file descriptor specified by the fd argument, as will be described later. If there
is an open file corresponding to the specified file descriptor, then line 9 attempts to
atomically acquire a reference count. If it fails to do so, lines 10–11 exit the RCU
read-side critical section and report failure. Otherwise, if the attempt is successful,
lines 14–15 exit the read-side critical section and return a pointer to the file structure.
The fcheck_files() primitive is a helper function for fget(). Line 22 uses
rcu_dereference() to safely fetch an RCU-protected pointer to this task’s current
file-descriptor table, and line 24 checks to see if the specified file descriptor is in range. If
so, line 25 fetches the pointer to the file structure, again using the rcu_dereference()
primitive. Line 26 then returns a pointer to the file structure or NULL in case of failure.
The fput() primitive releases a reference to a file structure. Line 31 atomically
decrements the reference count, and, if the result was zero, line 32 invokes the call_rcu()
primitive in order to free up the file structure (via the file_free_rcu()
function specified in call_rcu()’s second argument), but only after all currently-
executing RCU read-side critical sections complete, that is, after an RCU grace period
has elapsed.
Once the grace period completes, the file_free_rcu() function obtains a pointer
to the file structure on line 39, and frees it on line 40.
This code fragment thus demonstrates how RCU can be used to guarantee existence
while an in-object reference count is being incremented.
which applies this technique to RCU [McK06]. This approach eliminates the need for
atomic instructions or memory barriers on the increment and decrement primitives,
but still requires that code-motion compiler optimizations be disabled. In addition, the
primitives such as synchronize_srcu() that check for the aggregate reference count
reaching zero can be quite slow. This underscores the fact that these techniques are
designed for situations where the references are frequently acquired and released, but
where it is rarely necessary to check for a zero reference count.
However, it is usually the case that use of reference counts requires writing (often
atomically) to a data structure that is otherwise read only. In this case, reference counts
are imposing expensive cache misses on readers.
It is therefore worthwhile to look into synchronization mechanisms that do not require
readers to write to the data structure being traversed. One possibility is the hazard
pointers covered in Section 9.3 and another is RCU, which is covered in Section 9.5.
This section looks at some issues that can be addressed with the help of hazard pointers.
In addition, hazard pointers can sometimes be used to address the issues called out in
Section 13.5, and vice versa.
5 https://github.com/facebook/folly
The key point is that where reader-writer locking readers block all updates for that
lock, hazard pointers instead simply hang onto the data that is actually needed, while
still allowing updates to proceed.
If the reader cannot reasonably be converted to use reference counting, the tricks
in Section 13.5.8 might be helpful.
The girl who can’t dance says the band can’t play.
Yiddish proverb
One approach is to use sequence locks (see Section 9.4), so that wedlock-related
updates are carried out under the protection of write_seqlock(), while reads requiring
wedlock consistency are carried out within a read_seqbegin() / read_seqretry()
loop. Note that sequence locks are not a replacement for RCU protection: Sequence
locks protect against concurrent modifications, but RCU is still needed to protect against
concurrent deletions.
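The reader-side retry loop described above can be sketched in user space as follows. This is only a minimal illustration using C11 atomics: the toy_ names are hypothetical stand-ins rather than the Linux-kernel write_seqlock(), read_seqbegin(), and read_seqretry() primitives, the sketch assumes a single updater, and it does nothing to protect against concurrent deletion.

```c
#include <stdatomic.h>

/* Hypothetical toy sequence lock: an odd count means an update is in progress. */
struct toy_seqlock {
	atomic_uint seq;
};

/* Two correlated fields that readers must see as a consistent pair. */
struct toy_pair {
	struct toy_seqlock sl;
	atomic_int a;
	atomic_int b;
};

static void toy_pair_init(struct toy_pair *p)
{
	atomic_init(&p->sl.seq, 0);
	atomic_init(&p->a, 0);
	atomic_init(&p->b, 0);
}

/* Reader side: wait for an even count and return it as the snapshot. */
static unsigned toy_read_begin(struct toy_seqlock *sl)
{
	unsigned s;

	do {
		s = atomic_load_explicit(&sl->seq, memory_order_acquire);
	} while (s & 1);
	return s;
}

/* Reader side: nonzero means an update intervened, so the reader must retry. */
static int toy_read_retry(struct toy_seqlock *sl, unsigned start)
{
	atomic_thread_fence(memory_order_acquire);
	return atomic_load_explicit(&sl->seq, memory_order_relaxed) != start;
}

/* Update side (single updater assumed): count goes odd, data changes, count goes even. */
static void toy_pair_update(struct toy_pair *p, int a, int b)
{
	atomic_fetch_add_explicit(&p->sl.seq, 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_release);
	atomic_store_explicit(&p->a, a, memory_order_relaxed);
	atomic_store_explicit(&p->b, b, memory_order_relaxed);
	atomic_fetch_add_explicit(&p->sl.seq, 1, memory_order_release);
}

/* Retry loop: only a read that saw no concurrent update is accepted. */
static void toy_pair_read(struct toy_pair *p, int *a, int *b)
{
	unsigned s;

	do {
		s = toy_read_begin(&p->sl);
		*a = atomic_load_explicit(&p->a, memory_order_relaxed);
		*b = atomic_load_explicit(&p->b, memory_order_relaxed);
	} while (toy_read_retry(&p->sl, s));
}
```

A caller would invoke toy_pair_update() from its one updater and toy_pair_read() from any number of readers. With multiple updaters, the update side would also need a lock.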
This approach works quite well when the number of correlated elements is small,
the time to read these elements is short, and the update rate is low. Otherwise, updates
might happen so quickly that readers might never complete. Although Schrödinger
does not expect that even his least-sane relatives will marry and divorce quickly enough
for this to be a problem, he does realize that this problem could well arise in other
situations. One way to avoid this reader-starvation problem is to have the readers use
the update-side primitives if there have been too many retries, but this can degrade
both performance and scalability. Another way to avoid starvation is to have multiple
sequence locks, in Schrödinger’s case, perhaps one per species.
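The first of these starvation-avoidance strategies might look something like the following sketch, in which a reader that exhausts a retry budget falls back to the update-side lock. All names here are hypothetical, the budget of eight retries is an arbitrary assumption, and a simple C11 atomic_flag spinlock stands in for whatever lock the updaters actually use.

```c
#include <stdatomic.h>

#define TOY_MAX_RETRIES 8	/* Hypothetical retry budget before falling back. */

struct toy_guarded_pair {
	atomic_flag lock;	/* Update-side spinlock. */
	atomic_uint seq;	/* Odd while an update is in progress. */
	atomic_int a;
	atomic_int b;
};

static void toy_gp_init(struct toy_guarded_pair *p)
{
	atomic_flag_clear(&p->lock);
	atomic_init(&p->seq, 0);
	atomic_init(&p->a, 0);
	atomic_init(&p->b, 0);
}

static void toy_gp_lock(struct toy_guarded_pair *p)
{
	while (atomic_flag_test_and_set_explicit(&p->lock, memory_order_acquire))
		continue;
}

static void toy_gp_unlock(struct toy_guarded_pair *p)
{
	atomic_flag_clear_explicit(&p->lock, memory_order_release);
}

static void toy_gp_update(struct toy_guarded_pair *p, int a, int b)
{
	toy_gp_lock(p);
	atomic_fetch_add_explicit(&p->seq, 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_release);
	atomic_store_explicit(&p->a, a, memory_order_relaxed);
	atomic_store_explicit(&p->b, b, memory_order_relaxed);
	atomic_fetch_add_explicit(&p->seq, 1, memory_order_release);
	toy_gp_unlock(p);
}

/* Returns the number of failed lockless attempts before a stable read. */
static int toy_gp_read(struct toy_guarded_pair *p, int *a, int *b)
{
	int retries = 0;
	unsigned s;

	while (retries < TOY_MAX_RETRIES) {
		s = atomic_load_explicit(&p->seq, memory_order_acquire);
		if (!(s & 1)) {
			*a = atomic_load_explicit(&p->a, memory_order_relaxed);
			*b = atomic_load_explicit(&p->b, memory_order_relaxed);
			atomic_thread_fence(memory_order_acquire);
			if (atomic_load_explicit(&p->seq, memory_order_relaxed) == s)
				return retries;
		}
		retries++;
	}
	/* Budget exhausted: exclude updaters instead of racing with them. */
	toy_gp_lock(p);
	*a = atomic_load_explicit(&p->a, memory_order_relaxed);
	*b = atomic_load_explicit(&p->b, memory_order_relaxed);
	toy_gp_unlock(p);
	return retries;
}
```

The fallback path makes the degradation concrete: a reader holding the update-side lock delays every updater of that pair, and contends with any other readers that have also fallen back.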
In addition, if the update-side primitives are used too frequently, poor performance
and scalability will result due to lock contention. One way to avoid this is to maintain a
per-element sequence lock, and to hold both spouses’ locks when updating their marital
status. Readers can do their retry looping on either of the spouses’ locks to gain a stable
view of any change in marital status involving both members of the pair. This avoids
contention due to high marriage and divorce rates, but complicates gaining a stable view
of all marital statuses during a single scan of the database.
If the element groupings are well-defined and persistent, which marital status is hoped
to be, then one approach is to add pointers to the data elements to link together the
members of a given group. Readers can then traverse these pointers to access all the
data elements in the same group as the first one located.
This technique is used heavily in the Linux kernel, perhaps most notably in the dcache
subsystem [Bro15b]. Note that it is likely that similar schemes also work with hazard
pointers.
This approach provides sequential consistency to successful readers, each of which
will either see the effects of a given update or not, with any partial updates resulting in
a read-side retry. Sequential consistency is an extremely strong guarantee, incurring
equally strong restrictions and equally high overheads. In this case, we saw that readers
might be starved on the one hand, or might need to acquire the update-side lock on
the other. Although this works very well in cases where updates are infrequent, it
unnecessarily forces read-side retries even when the update does not affect any of the
data that a retried reader accesses. Section 13.5.4 therefore covers a much weaker
form of consistency that not only avoids reader starvation, but also avoids any form of
read-side retry. The next section instead presents a weaker form of consistency that can
be provided with much lower probabilities of reader starvation.
2. Allocate and initialize a copy of the element with the new name.
3. Write-acquire the sequence lock on the element with the old name, which has the
side effect of ordering this acquisition with the following insertion. Concurrent
lookups of the old name will now repeatedly retry.
4. Insert the copy of the element with the new name. Lookups of the new name will
now succeed.
5. Execute smp_wmb() to order the prior insertion with the subsequent removal.
6. Remove the element with the old name. Concurrent lookups of the old name will
now fail.
Thus, readers looking up the old name will retry until the new name is available, at
which point their final retry will fail. Any subsequent lookups of the new name will
succeed. Any reader succeeding in looking up the new name is guaranteed that any
subsequent lookup of the old name will fail, perhaps after a series of retries.
Quick Quiz 13.8: Is it possible to write-acquire the sequence lock on the new element before
it is inserted instead of acquiring that of the old element before it is removed?
13.5. RCU RESCUES 433
This section shows how to apply RCU to some examples discussed earlier in this book.
In some cases, RCU provides simpler code, in other cases better performance and
scalability, and in still other cases, both.
13.5.1.1 Design
The hope is to use RCU rather than final_mutex to protect the thread traversal in
read_count() in order to obtain excellent performance and scalability from read_
count(), rather than just from inc_count(). However, we do not want to give
up any accuracy in the computed sum. In particular, when a given thread exits, we
absolutely cannot lose the exiting thread’s count, nor can we double-count it. Such
an error could result in inaccuracies equal to the full precision of the result, in other
words, such an error would make the result completely useless. And in fact, one of the
purposes of final_mutex is to ensure that threads do not come and go in the middle
of read_count() execution.
Therefore, if we are to dispense with final_mutex, we will need to come up with
some other method for ensuring consistency. One approach is to place the total count
for all previously exited threads and the array of pointers to the per-thread counters into
a single structure. Such a structure, once made available to read_count(), is held
constant, ensuring that read_count() sees consistent data.
13.5.1.2 Implementation
Lines 1–4 of Listing 13.5 show the countarray structure, which contains a ->total
field for the count from previously exited threads, and a counterp[] array of pointers
to the per-thread counter for each currently running thread. This structure allows a
given execution of read_count() to see a total that is consistent with the indicated set
of running threads.
Lines 6–8 contain the definition of the per-thread counter variable, the global pointer
countarrayp referencing the current countarray structure, and the final_mutex
spinlock.
Lines 10–13 show inc_count(), which is unchanged from Listing 5.4.
Lines 15–31 show read_count(), which has changed significantly. Lines 22
and 29 substitute rcu_read_lock() and rcu_read_unlock() for acquisition and
release of final_mutex. Line 23 uses rcu_dereference() to snapshot the current
countarray structure into local variable cap. Proper use of RCU will guarantee that
this countarray structure will remain with us through at least the end of the current
RCU read-side critical section at line 29. Line 24 initializes sum to cap->total, which
is the sum of the counts of threads that have previously exited. Lines 25–27 add up the
per-thread counters corresponding to currently running threads, and, finally, line 30
returns the sum.
The initial value for countarrayp is provided by count_init() on lines 33–41.
This function runs before the first thread is created, and its job is to allocate and zero
the initial structure, and then assign it to countarrayp.
Lines 43–50 show the count_register_thread() function, which is invoked by
each newly created thread. Line 45 picks up the current thread’s index, line 47 acquires
final_mutex, line 48 installs a pointer to this thread’s counter, and line 49 releases
final_mutex.
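To make the data layout concrete, here is a single-threaded sketch of the countarray idea. It is not Listing 13.5, and its line numbers do not correspond to those cited in the text. The toy_rcu_read_lock() and toy_rcu_read_unlock() functions are do-nothing stand-ins for the RCU read-side primitives, a C11 acquire load stands in for rcu_dereference(), final_mutex is omitted, NR_THREADS is an assumed fixed limit, and the hypothetical count_unregister() frees the old structure immediately where real code would first wait for a grace period.

```c
#include <stdatomic.h>
#include <stdlib.h>

#define NR_THREADS 4	/* Hypothetical fixed thread limit for this sketch. */

/* Do-nothing stand-ins for the RCU read-side markers (single-threaded sketch). */
static void toy_rcu_read_lock(void) { }
static void toy_rcu_read_unlock(void) { }

struct countarray {
	unsigned long total;			/* Sum of counts from exited threads. */
	unsigned long *counterp[NR_THREADS];	/* Counters of running threads. */
};

static _Atomic(struct countarray *) countarrayp;

/* Allocate and zero the initial structure, then publish it. */
static int count_init(void)
{
	struct countarray *cap = calloc(1, sizeof(*cap));

	if (!cap)
		return -1;
	atomic_store_explicit(&countarrayp, cap, memory_order_release);
	return 0;
}

/* Install a pointer to a thread's counter (real code would hold final_mutex). */
static void count_register(int idx, unsigned long *ctr)
{
	struct countarray *cap;

	cap = atomic_load_explicit(&countarrayp, memory_order_relaxed);
	cap->counterp[idx] = ctr;
}

/* Sum the exited-thread total and all live counters from one snapshot. */
static unsigned long read_count(void)
{
	struct countarray *cap;
	unsigned long sum;
	int t;

	toy_rcu_read_lock();
	cap = atomic_load_explicit(&countarrayp, memory_order_acquire);
	sum = cap->total;
	for (t = 0; t < NR_THREADS; t++)
		if (cap->counterp[t] != NULL)
			sum += *cap->counterp[t];
	toy_rcu_read_unlock();
	return sum;
}

/* Fold an exiting thread's count into ->total in a fresh copy, then publish it. */
static int count_unregister(int idx)
{
	struct countarray *old, *cap;

	old = atomic_load_explicit(&countarrayp, memory_order_relaxed);
	cap = malloc(sizeof(*cap));
	if (!cap)
		return -1;
	*cap = *old;
	cap->total += *cap->counterp[idx];
	cap->counterp[idx] = NULL;
	atomic_store_explicit(&countarrayp, cap, memory_order_release);
	/* Real code must wait for a grace period before this free(). */
	free(old);
	return 0;
}
```

Because an exiting thread's count moves into ->total in the same newly published structure that drops its counterp[] entry, a reader sees either the old structure or the new one, and thus neither loses nor double-counts that thread.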
Quick Quiz 13.11: Hey!!! Line 48 of Listing 13.5 modifies a value in a pre-existing
countarray structure! Didn’t you say that this structure, once made available to read_
count(), remained constant???
sections, thus dropping any such references. Line 71 can then safely free the old
countarray structure.
Quick Quiz 13.12: Given the fixed-size counterp array, exactly how does this code avoid a
fixed upper bound on the number of threads???
13.5.1.3 Discussion
Quick Quiz 13.13: Wow! Listing 13.5 contains 70 lines of code, compared to only 42 in
Listing 5.4. Is this extra complexity really worth it?
Use of RCU enables exiting threads to wait until other threads are guaranteed
to be done using the exiting threads’ __thread variables. This allows the read_
count() function to dispense with locking, thereby providing excellent performance
and scalability for both the inc_count() and read_count() functions. However, this
performance and scalability come at the cost of some increase in code complexity. It is
hoped that compiler and library writers employ user-level RCU [Des09b] to provide
safe cross-thread access to __thread variables, greatly reducing the complexity seen
by users of __thread variables.
1 rcu_read_lock();
2 if (removing) {
3 rcu_read_unlock();
4 cancel_io();
5 } else {
6 add_count(1);
7 rcu_read_unlock();
8 do_io();
9 sub_count(1);
10 }
The RCU read-side primitives have minimal overhead, thus speeding up the fastpath,
as desired.
The updated code fragment removing a device is as follows:
1 spin_lock(&mylock);
2 removing = 1;
3 sub_count(mybias);
4 spin_unlock(&mylock);
5 synchronize_rcu();
6 while (read_count() != 0) {
7 poll(NULL, 0, 1);
8 }
9 remove_device();
Here we replace the reader-writer lock with an exclusive spinlock and add a
synchronize_rcu() to wait for all of the RCU read-side critical sections to complete.
Because of the synchronize_rcu(), once we reach line 6, we know that all remaining
I/Os have been accounted for.
Of course, the overhead of synchronize_rcu() can be large, but given that device
removal is quite rare, this is usually a good tradeoff.
1. The array is initially 16 characters long, and thus ->length is equal to 16.
2. CPU 0 loads the value of ->length, obtaining the value 16.
3. CPU 1 shrinks the array to be of length 8, and assigns a pointer to a new 8-character
block of memory into ->a[].
4. CPU 0 picks up the new pointer from ->a[], and stores a new value into element
12. Because the array has only 8 characters, this results in a SEGV or (worse yet)
memory corruption.
1. The array is initially 16 characters long, and thus ->length is equal to 16.
2. CPU 0 loads the value of ->fa, obtaining a pointer to the structure containing the
value 16 and the 16-byte array.
3. CPU 0 loads the value of ->fa->length, obtaining the value 16.
4. CPU 1 shrinks the array to be of length 8, and assigns a pointer to a new foo_a
structure containing an 8-character block of memory into ->fa.
5. CPU 0 indexes ->a[] through the ->fa pointer that it loaded in step 2, and stores a new value into element
12. But because CPU 0 is still referencing the old foo_a structure that contains
the 16-byte array, all is well.
Of course, in both cases, CPU 1 must wait for a grace period before freeing the old
array.
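The second sequence of events relies on the length and the array living in the same separately allocated structure. A minimal user-space sketch of that layout follows. The function names are hypothetical, C11 release stores and acquire loads stand in for rcu_assign_pointer() and rcu_dereference(), and foo_resize() hands the old structure back to its caller, which is assumed to free it only after a grace period.

```c
#include <stdatomic.h>
#include <stdlib.h>
#include <string.h>

/* The length and the array it describes travel together. */
struct foo_a {
	int length;
	char a[];
};

struct foo {
	_Atomic(struct foo_a *) fa;
};

static struct foo_a *foo_a_alloc(int length)
{
	struct foo_a *fa = calloc(1, sizeof(*fa) + length);

	if (fa)
		fa->length = length;
	return fa;
}

/*
 * Publish a resized copy.  Returns the old structure, which the caller
 * may free only after a grace period, or NULL on allocation failure.
 */
static struct foo_a *foo_resize(struct foo *fp, int new_length)
{
	struct foo_a *old = atomic_load_explicit(&fp->fa, memory_order_relaxed);
	struct foo_a *nfa = foo_a_alloc(new_length);
	int n;

	if (!nfa)
		return NULL;
	n = old->length < new_length ? old->length : new_length;
	memcpy(nfa->a, old->a, n);
	atomic_store_explicit(&fp->fa, nfa, memory_order_release);
	return old;
}

/* Store through one snapshot, so the bounds check and the access agree. */
static int foo_store(struct foo *fp, int idx, char c)
{
	struct foo_a *fa = atomic_load_explicit(&fp->fa, memory_order_acquire);

	if (idx < 0 || idx >= fa->length)
		return -1;
	fa->a[idx] = c;
	return 0;
}
```

A reader that fetched the old foo_a before the resize continues to see a length of 16 paired with a 16-byte array, while a reader arriving afterwards sees a length of 8 paired with the 8-byte array.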
A more general version of this approach is presented in the next section.
6 This situation is similar to that described in Section 13.4.2, except that here readers
need only see a consistent view of a given single data element, not the consistent view of a
group of data elements that was required in that earlier section.
allocated, filled in with the measurements, and the animal structure’s ->mp field is
updated to point to this new measurement structure using rcu_assign_pointer().
After a grace period elapses, the old measurement structure can be freed.
Quick Quiz 13.14: But can’t the approach shown in Listing 13.9 result in extra cache misses,
in turn resulting in additional read-side overhead?
This approach enables readers to see correlated values for selected fields while
incurring only minimal read-side overhead. This per-data-element consistency suffices in the
common case where a reader looks only at a single data element.
7 Why would such a quantity be useful? Beats me! But group statistics are often useful.
The mode flags are usually zero, but can include the PERCPU_REF_INIT_ATOMIC
bit if the counter is to start in slow non-per-CPU (that is, atomic) mode. There is
also a PERCPU_REF_ALLOW_REINIT bit that allows a given percpu_ref counter to be
reused via a call to percpu_ref_reinit() without needing to be freed and reallocated.
Regardless of how the percpu_ref structure is initialized, percpu_ref_get() may be
used to acquire a reference and percpu_ref_put() may be used to release a reference.
When in per-CPU mode, the percpu_ref structure cannot determine whether or
not its value has reached zero. When such a determination is necessary, percpu_ref_
kill() may be invoked. This function switches the structure into atomic mode and
removes the initial reference installed by the call to percpu_ref_init(). Of course,
when in atomic mode, calls to percpu_ref_get() and percpu_ref_put() are quite
expensive, but percpu_ref_put() can tell when the value reaches zero.
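The two modes can be illustrated with a deliberately simplified single-threaded model. The toy_ref names below are hypothetical and this is not the Linux-kernel percpu_ref implementation: there are no real per-CPU variables, no atomic operations, and no waiting for in-flight per-CPU operations when switching modes. The model shows only why zero cannot be detected in per-CPU mode and how the kill operation folds the per-CPU deltas into a single count.

```c
#include <stdbool.h>

#define TOY_NR_CPUS 4	/* Hypothetical CPU count for this model. */

struct toy_ref {
	bool atomic_mode;		/* false: per-CPU mode. */
	long percpu[TOY_NR_CPUS];	/* Per-CPU deltas, individually possibly negative. */
	long count;			/* Global count, starting with the initial reference. */
};

static void toy_ref_init(struct toy_ref *r)
{
	int i;

	for (i = 0; i < TOY_NR_CPUS; i++)
		r->percpu[i] = 0;
	r->atomic_mode = false;
	r->count = 1;	/* Initial reference, dropped by toy_ref_kill(). */
}

static void toy_ref_get(struct toy_ref *r, int cpu)
{
	if (r->atomic_mode)
		r->count++;
	else
		r->percpu[cpu]++;	/* Cheap, but no one sees the total. */
}

/* Returns true only when this put takes the count to zero (atomic mode only). */
static bool toy_ref_put(struct toy_ref *r, int cpu)
{
	if (r->atomic_mode)
		return --r->count == 0;
	r->percpu[cpu]--;
	return false;	/* Per-CPU mode cannot tell whether the total is zero. */
}

/* Switch to atomic mode, fold in per-CPU deltas, and drop the initial reference. */
static bool toy_ref_kill(struct toy_ref *r)
{
	int i;

	r->atomic_mode = true;
	for (i = 0; i < TOY_NR_CPUS; i++) {
		r->count += r->percpu[i];
		r->percpu[i] = 0;
	}
	return --r->count == 0;
}
```

Note that a reference acquired on one CPU may be released on another, so only the sum of the per-CPU deltas is meaningful, which is why the model must fold them together before it can test for zero.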
Readers desiring more percpu_ref information are referred to the Linux-kernel
documentation and source code.
Figure: State machine with states CLOSED, OPEN, CLOSING, REOPENING, and RECLOSING, with transitions labeled open(), close(), and CB.
If that reader can be reasonably converted to use RCU, that might solve the problem.
The reason is that RCU readers do not completely block updates, but rather block only
the cleanup portions of those updates (including memory reclamation). Therefore, if
the system has ample memory, converting the reader-writer lock to RCU may suffice.
However, converting to RCU does not always suffice. For example, the code might
traverse an extremely large linked data structure within a single RCU read-side critical
section, which might so greatly extend the RCU grace period that the system runs out of
memory. These situations can be handled in a couple of different ways: (1) Use SRCU
instead of RCU and (2) Acquire a reference to exit the RCU reader.
Quick Quiz 13.16: But how would this work with a resizable hash table, such as the one
described in Section 10.4?
If a little knowledge is a dangerous thing, just think
what you could do with a lot of knowledge!
Unknown
Chapter 14
Advanced Synchronization
This chapter covers synchronization techniques used for lockless algorithms and parallel
real-time systems.
Although lockless algorithms can be quite helpful when faced with extreme require-
ments, they are no panacea. For example, as noted at the end of Chapter 5, you should
thoroughly apply partitioning, batching, and well-tested packaged weak APIs (see
Chapters 8 and 9) before even thinking about lockless algorithms.
But after doing all that, you still might find yourself needing the advanced techniques
described in this chapter. To that end, Section 14.1 summarizes techniques used
thus far for avoiding locks and Section 14.2 gives a brief overview of non-blocking
synchronization. Memory ordering is also quite important, but it warrants its own
chapter, namely Chapter 15.
The second form of advanced synchronization provides the stronger forward-progress
guarantees needed for parallel real-time computing, which is the topic of Section 14.3.
In short, lockless techniques are quite useful and are heavily used. However, it is
best if lockless techniques are hidden behind a well-defined API, such as inc_count(),
memblock_alloc(), rcu_read_lock(), and so on. The reason for this is
that undisciplined use of lockless techniques is a good way to create difficult bugs. If
you believe that finding and fixing such bugs is easier than avoiding them, please re-read
Chapters 11 and 12.
NBS class 1 was formulated some time before 2015, classes 2, 3, and 4 were first
formulated in the early 1990s, class 5 was first formulated in the early 2000s, and class 6
was first formulated in 2013. The final two classes have seen informal use for a great
many decades, but were reformulated in 2013.
14.2. NON-BLOCKING SYNCHRONIZATION 447
Quick Quiz 14.1: Given that there will always be a sharply limited number of CPUs available,
is population obliviousness really useful?
In theory, any parallel algorithm can be cast into wait-free form, but there is a
relatively small subset of NBS algorithms that are in common use. A few of these are
listed in the following section.
Although mutual exclusion is required to dequeue a single element (so that dequeue
is blocking), it is possible to carry out a non-blocking removal of the entire contents
of the queue. What is not possible is to dequeue any given element in a non-blocking
manner: The enqueuer might have failed between lines 9 and 10 of the listing, so that
the element in question is only partially enqueued. This results in a half-NBS algorithm
where enqueues are NBS but dequeues are blocking. This algorithm is nevertheless
heavily used in practice, in part because most production software is not required to
tolerate arbitrary fail-stop errors.
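The window in which an element is only partially enqueued can be seen in the following sketch of an exchange-based enqueue. This is not the listing referred to above, and its line numbers do not correspond to those cited. The structure and function names are assumptions, as is the use of a dummy head node, and the traversal helper is intended only for single-threaded demonstration.

```c
#include <stdatomic.h>
#include <stddef.h>

struct qnode {
	_Atomic(struct qnode *) next;
	int val;
};

struct queue {
	struct qnode head;		/* Dummy node: head.next is the first element. */
	_Atomic(struct qnode *) tail;	/* Last node, or &head if the queue is empty. */
};

static void queue_init(struct queue *q)
{
	atomic_init(&q->head.next, NULL);
	atomic_init(&q->tail, &q->head);
}

/* Non-blocking enqueue: one atomic exchange followed by one store. */
static void enqueue(struct queue *q, struct qnode *n)
{
	struct qnode *old_tail;

	atomic_store_explicit(&n->next, NULL, memory_order_relaxed);
	old_tail = atomic_exchange_explicit(&q->tail, n, memory_order_acq_rel);
	/*
	 * An enqueuer that stops here has claimed the tail, but n is
	 * not yet reachable from the head: it is partially enqueued.
	 */
	atomic_store_explicit(&old_tail->next, n, memory_order_release);
}

/* Demonstration only: count the elements reachable from the head. */
static int queue_count_reachable(struct queue *q)
{
	struct qnode *p = atomic_load_explicit(&q->head.next, memory_order_acquire);
	int n = 0;

	while (p) {
		n++;
		p = atomic_load_explicit(&p->next, memory_order_acquire);
	}
	return n;
}
```

An enqueuer never waits on any other thread, but a dequeuer that reaches a node whose ->next pointer is still NULL while ->tail points elsewhere has no choice but to wait for the stalled enqueuer.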
Quick Quiz 14.2: Wait! In order to dequeue all elements, both the ->head and ->tail
pointers must be changed, which cannot be done atomically on typical computer systems. So
how is this supposed to work???
list_pop_all(): One of them will get the list, and the other a NULL pointer, at least
assuming that there were no concurrent calls to list_push().
An instance of list_pop_all() that obtains a non-empty list in p processes this
list in the loop spanning lines 27–33. Line 28 prefetches the ->next pointer, line 30
invokes the function referenced by foo() on the current node, line 31 frees the current
node, and line 32 sets up p for the next pass through the loop.
But suppose that a pair of list_push() instances run concurrently with a list_
pop_all() with a list initially containing a single Node 𝐴. Here is one way that this
scenario might play out:
1. The first list_push() instance pushes a new Node 𝐵, executing through line 17,
having just stored a pointer to Node 𝐴 into Node 𝐵’s ->next pointer.
2. The list_pop_all() instance runs to completion, setting top to NULL and
freeing Node 𝐴.
3. The second list_push() instance runs to completion, pushing a new Node 𝐶,
but happens to allocate the memory that used to belong to Node 𝐴.
4. The first list_push() instance executes the cmpxchg() on line 18. Because new
Node 𝐶 has the same address as the newly freed Node 𝐴, this cmpxchg() succeeds
and this list_push() instance runs to completion.
Note that both pushes and the list_pop_all() ran successfully despite the reuse of Node 𝐴’s
memory. This is an unusual property: Most data structures require protection against
what is often called the ABA problem.
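For readers without the listing at hand, a push-and-pop-all stack of this general shape can be sketched with C11 atomics as follows. This is an approximation rather than the listing itself, so the line numbers cited above do not apply to it. The return values, the add_val() callback, and the visited_sum variable are hypothetical additions for demonstration.

```c
#include <stdatomic.h>
#include <stdlib.h>

struct node_t {
	int val;
	struct node_t *next;
};

static _Atomic(struct node_t *) top;	/* Statically initialized to NULL. */

/* Push: retry the compare-and-exchange until the new node becomes the top. */
static int list_push(int val)
{
	struct node_t *newnode = malloc(sizeof(*newnode));
	struct node_t *oldtop;

	if (!newnode)
		return -1;
	newnode->val = val;
	oldtop = atomic_load_explicit(&top, memory_order_relaxed);
	do {
		newnode->next = oldtop;	/* On failure, oldtop holds the current top. */
	} while (!atomic_compare_exchange_weak_explicit(&top, &oldtop, newnode,
							memory_order_release,
							memory_order_relaxed));
	return 0;
}

/* Pop everything with one atomic exchange, then process and free privately. */
static long list_pop_all(void (*foo)(struct node_t *p))
{
	struct node_t *p = atomic_exchange_explicit(&top, NULL, memory_order_acquire);
	long n = 0;

	while (p) {
		struct node_t *next = p->next;

		if (foo)
			foo(p);
		free(p);
		p = next;
		n++;
	}
	return n;
}

/* Hypothetical callback for demonstration: accumulate the values visited. */
static long visited_sum;

static void add_val(struct node_t *p)
{
	visited_sum += p->val;
}
```

Because list_pop_all() removes the entire list in a single exchange, two concurrent callers cannot both obtain the same nodes, and a pusher's compare-and-exchange needs only the current value of top to be correct, not the identity of the node it points to.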
But this property holds only for algorithms written in assembly language. The sad fact
is that most languages (including C and C++) do not support pointers to lifetime-ended
objects, such as the pointer to the old Node 𝐴 contained in Node 𝐵’s ->next pointer.
In fact, compilers are within their rights to assume that if two pointers (call them p and
q) were returned from two different calls to malloc(), then those pointers must not be
equal. Real compilers really will generate the constant false in response to a p==q
comparison. A pointer to an object that has been freed, but whose memory has been
reallocated for a compatibly typed object is termed a zombie pointer.
Many concurrent applications avoid this problem by carefully hiding the memory
allocator from the compiler, thus preventing the compiler from making inappropriate
assumptions. This obfuscatory approach currently works in practice, but might well one
day fall victim to increasingly aggressive optimizers. There is work underway in both
the C and C++ standards committees to address this problem [MMS19, MMM+20]. In
the meantime, please exercise great care when coding ABA-tolerant algorithms.
Quick Quiz 14.3: So why not ditch antique languages like C and C++ for something more
modern?
housing the NBS algorithm. In contrast, real-time systems assume that the
scheduler is doing its level best to satisfy any scheduling constraints it knows about,
and, in the absence of such constraints, its level best to honor process priorities
and to provide fair scheduling to processes of the same priority. Non-demonic
schedulers allow real-time programs to use simpler algorithms than those required
for NBS [ACHS13, Bra11].
Quick Quiz 14.4: Why does anyone care about demonic schedulers?
as illustrated in Figure 5.1. It is only reasonable to assume that these latencies also
depend on the details of the system’s interconnect. In addition, given that hardware
vendors generally do not publish upper bounds for cache-miss latencies, it seems brave
to assume that memory-reference instructions are in fact wait-free in modern computer
systems. And the antique systems for which such bounds are available suffer from
profound overall slowness.
Furthermore, hardware is not the only source of slowness for memory-reference
instructions. For example, when running on typical computer systems, both loads
and stores can result in page faults. Which cause in-kernel page-fault handlers to be
invoked. Which might acquire locks, or even do I/O, potentially even using something
like network file system (NFS). All of which are most emphatically blocking operations.
Nor are page faults the only kernel-induced hazard. A given CPU might be interrupted
at any time, and the interrupt handler might run for some time. During this time, the
user-mode ostensibly non-blocking algorithm will not be running at all. This situation
raises interesting questions about the forward-progress guarantees provided by system
calls relying on interrupts, for example, the membarrier() system call.
Things do look bleak, but the non-blocking nature of such algorithms can be at least
partially redeemed using a number of approaches:
1. Run on bare metal, with paging disabled. If you are both brave and confident that
you can write code that is free of wild-pointer bugs, this approach might be for you.
2. Run on a non-blocking operating-system kernel [GC96]. Such kernels are quite
rare, in part because they have traditionally completely failed to provide the
hoped-for performance and scalability advantages over lock-based kernels. But
perhaps you should write one.
3. Use facilities such as mlockall() to avoid page faults, while also ensuring that
your program preallocates all the memory it will ever need at boot time. This can
work well, but at the expense of severe common-case underutilization of memory.
In environments that are cost-constrained or power-limited, this approach is not
likely to be feasible.
4. Use facilities such as the Linux kernel’s NO_HZ_FULL tickless mode [Cor13]. In
recent versions of the Linux kernel, this mode directs interrupts away from a
designated set of CPUs. However, this can sharply limit throughput for applications
that are I/O bound during even part of their operation.
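As an illustration of the third approach, the following sketch locks the process's memory with mlockall() and then carves later allocations out of a pool that was allocated and touched up front. The pool, the rt_ function names, and the 16-byte alignment are all assumptions. Note also that mlockall() can fail for unprivileged processes that exceed their locked-memory limit, which the sketch reports rather than treating as fatal.

```c
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

static char *rt_pool;		/* Hypothetical preallocated memory pool. */
static size_t rt_pool_size;
static size_t rt_pool_used;

/*
 * Returns 0 if memory is locked and the pool is ready, 1 if the pool is
 * ready but locking failed, and -1 if the pool could not be allocated.
 */
static int rt_memory_setup(size_t pool_bytes)
{
	/* Lock current pages and any pages mapped in the future. */
	int locked = mlockall(MCL_CURRENT | MCL_FUTURE);

	rt_pool = malloc(pool_bytes);
	if (!rt_pool)
		return -1;
	/* Touch every page now so that no first-touch faults occur later. */
	memset(rt_pool, 0, pool_bytes);
	rt_pool_size = pool_bytes;
	rt_pool_used = 0;
	return locked == 0 ? 0 : 1;
}

/* Bump allocator: no system calls and no page faults on touched, locked memory. */
static void *rt_alloc(size_t bytes)
{
	void *p;

	bytes = (bytes + 15) & ~(size_t)15;	/* Keep 16-byte alignment. */
	if (bytes > rt_pool_size - rt_pool_used)
		return NULL;
	p = rt_pool + rt_pool_used;
	rt_pool_used += bytes;
	return p;
}
```

The pool must be sized for the worst case, which is exactly the memory underutilization that item 3 warns about.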
that developers often learn their specification after the fact, one bug at a time. A few
such proof techniques were discussed in Chapter 12.3
It is often asserted that linearizability maps well to sequential specifications, which
are said to be more natural than are concurrent specifications [RR20]. But this assertion
fails to account for our highly concurrent objective universe. This universe can only be
expected to select for ability to cope with concurrency, especially for those participating
in team sports or overseeing small children. In addition, given that the teaching of
sequential computing is still believed to be somewhat of a black art [PBCE20], it is
reasonable to expect that teaching of concurrent computing is in a similar state of
disarray. Therefore, focusing on only one proof technique is unlikely to be a good way
forward.
Again, please understand that linearizability is quite useful in many situations. Then
again, so is that venerable tool, the hammer. But there comes a point in the field of
computing where one should put down the hammer and pick up a keyboard. Similarly,
it appears that there are times when linearizability is not the best tool for the job.
To their credit, there are some linearizability advocates who are aware of some of its
shortcomings [RR20]. There are also proposals to extend linearizability, for example,
interval-linearizability, which is intended to handle the common case of operations
that require non-zero time to complete [CnRR18]. It remains to be seen whether these
proposals will result in theories able to handle modern concurrent software artifacts,
especially given that several of the proof techniques discussed in Chapter 12 already
handle many modern concurrent software artifacts.
“So the reason linearizability is important is to rescue 1980s proof techniques?” The advocate
immediately replied in the affirmative, then spent some time disparaging a particular modern
proof technique. Oddly enough, that technique was one of those successfully applied to
Linux-kernel RCU.
14.3. PARALLEL REAL-TIME COMPUTING 455
and wait-free synchronization, which is all to the good. Hopefully theoretical frameworks
will continue to improve their ability to describe software actually used in practice.
Those who feel that theory should lead the way are referred to the inimitable Peter
Denning, who said of operating systems: “Theory follows practice” [Den15], or to
the eminent Tony Hoare, who said of the whole of engineering: “In all branches of
engineering science, the engineering starts before the science; indeed, without the early
products of engineering, there would be nothing for the scientist to study!” [Mor07]. Of
course, once an appropriate body of theory becomes available, it is wise to make use of
it. However, note well that the first appropriate body of theory is often one thing and
the first proposed body of theory quite another.
Quick Quiz 14.5: It seems like the various members of the NBS hierarchy are rather useless.
So why bother with them at all???
We might therefore say that a given soft real-time application must meet its response-
time requirements at least some fraction of the time, for example, we might say that it
must execute in less than 20 microseconds 99.9 % of the time.
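A requirement of this form is easy to check against measured latencies. The following sketch uses a hypothetical function name and expresses the required fraction in parts per million, so that 99.9 % becomes 999000, which avoids floating-point rounding in the comparison.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Returns true if at least required_ppm parts per million of the n
 * latency samples are strictly below deadline_us.
 */
static bool meets_soft_deadline(const double *latency_us, size_t n,
				double deadline_us, unsigned long required_ppm)
{
	size_t i;
	size_t met = 0;

	for (i = 0; i < n; i++)
		if (latency_us[i] < deadline_us)
			met++;
	if (n == 0)
		return false;	/* No evidence either way. */
	return (unsigned long long)met * 1000000ULL >=
	       (unsigned long long)n * required_ppm;
}
```

With 1000 samples, a 20-microsecond deadline, and a 99.9 % requirement, a single late sample is tolerated but a second one is not.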
This of course raises the question of what is to be done when the application fails
to meet its response-time requirements. The answer varies with the application, but
one possibility is that the system being controlled has sufficient stability and inertia
to render harmless the occasional late control action. Another possibility is that the
application has two ways of computing the result, a fast and deterministic but inaccurate
method on the one hand and a very accurate method with unpredictable compute time
on the other. One reasonable approach would be to start both methods in parallel, and
if the accurate method fails to finish in time, kill it and use the answer from the fast
but inaccurate method. One candidate for the fast but inaccurate method is to take no
control action during the current time period, and another candidate is to take the same
control action as was taken during the preceding time period.
In short, it does not make sense to talk about soft real time without some measure of
exactly how soft it is.
an option, as fancifully depicted in Figure 14.2. We simply cannot expect the poor
gentleman depicted in that figure to be reassured by our saying “Rest assured that if a
missed deadline results in your tragic death, it most certainly will not have been due to
a software problem!” Hard real-time response is a property of the entire system, not
just of the software.
But if we cannot demand perfection, perhaps we can make do with notification, similar
to the soft real-time approach noted earlier. Then if the Life-a-Tron in Figure 14.2 is
about to miss its deadline, it can alert the hospital staff.
Unfortunately, this approach has the trivial solution fancifully depicted in Figure 14.3.
A system that always immediately issues a notification that it won’t be able to meet its
deadline complies with the letter of the law, but is completely useless. There clearly
must also be a requirement that the system meets its deadline some fraction of the
time, or perhaps that it be prohibited from missing its deadlines on more than a certain
number of consecutive operations.
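The consecutive-miss form of this requirement can be checked with a few lines of code. In the following sketch, the function name is hypothetical, and a sample counts as a miss when it is not strictly below the deadline.

```c
#include <stddef.h>

/* Returns the length of the longest run of consecutive missed deadlines. */
static size_t max_consecutive_misses(const double *latency_us, size_t n,
				     double deadline_us)
{
	size_t i;
	size_t run = 0;
	size_t worst = 0;

	for (i = 0; i < n; i++) {
		if (latency_us[i] >= deadline_us) {
			if (++run > worst)
				worst = run;
		} else {
			run = 0;	/* A met deadline ends the current run. */
		}
	}
	return worst;
}
```

A system that always immediately announces a missed deadline produces a run as long as the trace itself, and therefore fails any finite limit on consecutive misses.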
We clearly cannot take a sound-bite approach to either hard or soft real time. The
next section therefore takes a more real-world approach.
5 Decades later, the acceptance tests for some types of computer systems involve large
detonations, and some types of communications networks must deal with what is delicately
termed “ballistic jamming.”
specifications for interrupts and for wake-up operations, but quite rare for (say) filesystem
unmount operations. One reason for this is that it is quite difficult to bound the amount
of work that a filesystem-unmount operation might need to do, given that the unmount
is required to flush all of that filesystem’s in-memory data to mass storage.
This means that real-time applications must be confined to operations for which
bounded latencies can reasonably be provided. Other operations must either be pushed
out into the non-real-time portions of the application or forgone entirely.
There might also be constraints on the non-real-time portions of the application. For
example, is the non-real-time application permitted to use the CPUs intended for the
real-time portion? Are there time periods during which the real-time portion of the
application is expected to be unusually busy, and if so, is the non-real-time portion of
the application permitted to run at all during those times? Finally, by what amount
is the real-time portion of the application permitted to degrade the throughput of the
non-real-time portion?
6 Important safety tip: Worst-case response times from USB devices can be extremely
long. Real-time systems should therefore take care to place any USB devices well away from
critical paths.
hardware, drivers, and software configuration would be required to support phase 4’s
more severe requirements.
A key advantage of this phase-by-phase approach is that the latency budgets can
be broken down, so that the application’s various components can be developed
independently, each with its own latency budget. Of course, as with any other kind of
budget, there will likely be the occasional conflict as to which component gets which
fraction of the overall budget, and as with any other kind of budget, strong leadership
and a sense of shared goals can help to resolve these conflicts in a timely fashion. And,
again as with other kinds of technical budget, a strong validation effort is required in
order to ensure proper focus on latencies and to give early warning of latency problems.
A successful validation effort will almost always include a good test suite, which might
be unsatisfying to the theorists, but has the virtue of helping to get the job done. As a
point of fact, as of early 2021, most real-world real-time systems use an acceptance test
rather than formal proofs.
However, the widespread use of test suites to validate real-time systems does have
a very real disadvantage, namely that real-time software is validated only on specific
configurations of hardware and software. Adding additional configurations requires
additional costly and time-consuming testing. Perhaps the field of formal verification
will advance sufficiently to change this situation, but as of early 2021, rather large
advances are required.
Quick Quiz 14.8: Formal verification is already quite capable, benefiting from decades of
intensive study. Are additional advances really required, or is this just a practitioner’s excuse to
continue to lazily ignore the awesome power of formal verification?
In addition to latency requirements for the real-time portions of the application, there
will likely be performance and scalability requirements for the non-real-time portions
of the application. These additional requirements reflect the fact that ultimate real-time
latencies are often attained by degrading scalability and average performance.
Software-engineering requirements can also be important, especially for large appli-
cations that must be developed and maintained by large teams. These requirements
often favor increased modularity and fault isolation.
This is a mere outline of the work that would be required to specify deadlines and
environmental constraints for a production real-time system. It is hoped that this outline
clearly demonstrates the inadequacy of the sound-bite-based approach to real-time
computing.
[Figure: a stimulus feeds hard real-time "reflexes", which produce the response, alongside
non-real-time strategy and planning.]
Quick Quiz 14.9: Differentiating real-time from non-real-time based on what can “be achieved
straightforwardly by non-real-time systems and applications” is a travesty! There is absolutely
no theoretical basis for such a distinction!!! Can’t we do better than that???
[Figure: real-time response regimes on a logarithmic latency scale marked 1 s, 100 ms, 10 ms,
1 ms, 100 μs, 10 μs, 1 μs, 100 ns, 10 ns, and 1 ns. In order of decreasing latency: scripting
languages, Linux 2.4 kernel, real-time Java (with GC), Linux 2.6.x/3.x kernel, real-time Java
(no GC), Linux 4.x/5.x kernel, Linux -rt patchset, specialty RTOSes (no MMU), hand-coded
assembly, and custom digital hardware.]
[Figure: RTOS processes and Linux processes. Components shown for the Linux kernel: RCU
read-side critical sections, spinlock critical sections, interrupt handlers, scheduling-clock
interrupt, interrupt disable, and preempt disable. Also shown: the RTOS.]
of a few milliseconds, even when the garbage collector is used. The Linux 2.6.x and 3.x
kernels can provide real-time latencies of a few hundred microseconds if painstakingly
configured, tuned, and run on real-time-friendly hardware. Special real-time Java
implementations can provide real-time latencies below 100 microseconds if use of the
garbage collector is carefully avoided. (But note that avoiding the garbage collector
means also avoiding Java’s large standard libraries, thus also avoiding Java’s productivity
advantages.) The Linux 4.x and 5.x kernels can provide sub-hundred-microsecond
latencies, but with all the same caveats as for the 2.6.x and 3.x kernels. A Linux kernel
incorporating the -rt patchset can provide latencies well below 20 microseconds, and
specialty real-time operating systems (RTOSes) running without MMUs can provide
sub-ten-microsecond latencies. Achieving sub-microsecond latencies typically requires
hand-coded assembly or even special-purpose hardware.
Of course, careful configuration and tuning are required all the way down the stack. In
particular, if the hardware or firmware fails to provide real-time latencies, there is nothing
that the software can do to make up for the lost time. Worse yet, high-performance
hardware sometimes sacrifices worst-case behavior to obtain greater throughput. In
fact, timings from tight loops run with interrupts disabled can provide the basis for a
high-quality random-number generator [MOZ09]. Furthermore, some firmware does
cycle-stealing to carry out various housekeeping tasks, in some cases attempting to
cover its tracks by reprogramming the victim CPU’s hardware clocks. Of course, cycle
stealing is expected behavior in virtualized environments, but people are nevertheless
working towards real-time response in virtualized environments [Gle12, Kis14]. It
is therefore critically important to evaluate your hardware’s and firmware’s real-time
capabilities.
But given competent real-time hardware and firmware, the next layer up the stack is
the operating system, which is covered in the next section.
[Figure 14.7: panels showing RT Linux processes and Linux processes above the kernel's RCU
read-side critical sections, spinlock critical sections, interrupt handlers, scheduling-clock
interrupt, interrupt disable, and preempt disable regions, for kernel configurations from
CONFIG_PREEMPT=n through the -rt patchset (bottom row).]
[Figure: RT Linux processes and Linux processes, with the kernel's spinlock critical sections,
interrupt handlers, scheduling-clock interrupt, interrupt disable, and preempt disable regions.]
The bottom row of Figure 14.7 shows the -rt patchset, which features threaded
(and thus preemptible) interrupt handlers for many devices, which also allows the
corresponding “interrupt-disabled” regions of these drivers to be preempted. These
drivers instead use locking to coordinate the process-level portions of each driver with its
threaded interrupt handlers. Finally, in some cases, disabling of preemption is replaced
by disabling of migration. These measures result in excellent response times in many
systems running the -rt patchset [RMF19, dOCdO19].
A final approach is simply to get everything out of the way of the real-time process,
clearing all other processing off of any CPUs that this process needs, as shown in
Figure 14.8. This was implemented in the 3.10 Linux kernel via the
CONFIG_NO_HZ_FULL Kconfig parameter [Cor13, Wei12]. It is important to note that this approach
requires at least one housekeeping CPU to do background processing, for example
running kernel daemons. However, when there is only one runnable task on a given
non-housekeeping CPU, scheduling-clock interrupts are shut off on that CPU, removing
an important source of interference and OS jitter. With a few exceptions, the kernel
does not force other processing off of the non-housekeeping CPUs, but instead simply
provides better performance when only one runnable task is present on a given CPU. Any
number of userspace tools may be used to force a given CPU to have no more than one
runnable task. If configured properly, a non-trivial undertaking, CONFIG_NO_HZ_FULL
offers real-time threads levels of performance that come close to those of bare-metal
systems [ACA+18]. Frédéric Weisbecker produced a practical guide to
CONFIG_NO_HZ_FULL configuration [Wei22d, Wei22b, Wei22e, Wei22c, Wei22a, Wei22f].
There has of course been much debate over which of these approaches is best for real-
time systems, and this debate has been going on for quite some time [Cor04a, Cor04c].
As usual, the answer seems to be “It depends,” as discussed in the following sections.
Section 14.3.5.1 considers event-driven real-time systems, and Section 14.3.5.2 considers
real-time systems that use a CPU-bound polling loop.
Timers are clearly critically important for real-time operations. After all, if you
cannot specify that something be done at a specific time, how are you going to respond
by that time? Even in non-real-time systems, large numbers of timers are generated,
so they must be handled extremely efficiently. Example uses include retransmit timers
for TCP connections (which are almost always canceled before they have a chance to
fire),7 timed delays (as in sleep(1), which are rarely canceled), and timeouts for the
poll() system call (which are often canceled before they have a chance to fire). A
good data structure for such timers would therefore be a priority queue whose addition
and deletion primitives were fast and O(1) in the number of timers posted.
The classic data structure for this purpose is the calendar queue, which in the Linux
kernel is called the timer wheel. This age-old data structure is also heavily used in
discrete-event simulation. The idea is that time is quantized. For example, in the Linux
kernel, the duration of the time quantum is the period of the scheduling-clock interrupt.
A given time can be represented by an integer, and any attempt to post a timer at some
non-integral time will be rounded to a convenient nearby integral time quantum.
One straightforward implementation would be to allocate a single array, indexed by
the low-order bits of the time. This works in theory, but in practice systems create large
numbers of long-duration timeouts (for example, the two-hour keepalive timeouts for
TCP sessions) that are almost always canceled. These long-duration timeouts cause
problems for small arrays because much time is wasted skipping timeouts that have not
yet expired. On the other hand, an array that is large enough to gracefully accommodate
a large number of long-duration timeouts would consume too much memory, especially
given that performance and scalability concerns require one such array for each and
every CPU.
A common approach for resolving this conflict is to provide multiple arrays in a
hierarchy. At the lowest level of this hierarchy, each array element represents one unit
of time. At the second level, each array element represents 𝑁 units of time, where 𝑁 is
the number of elements in each array. At the third level, each array element represents
𝑁² units of time, and so on up the hierarchy. This approach allows the individual arrays
to be indexed by different bits, as illustrated by Figure 14.9 for an unrealistically small
eight-bit clock. Here, each array has 16 elements, so the low-order four bits of the time
(currently 0xf) index the low-order (rightmost) array, and the next four bits (currently
0x1) index the next level up. Thus, we have two arrays each with 16 elements, for a
total of 32 elements, which, taken together, is much smaller than the 256-element array
that would be required for a single array.
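As a minimal sketch of this indexing, assuming the two-level, 16-element layout of the
eight-bit example, the array and element for a given expiry time might be computed as
follows. The names wheel_slot_for() and wheel_current_index() are illustrative only and
are not taken from the Linux kernel:

```c
#include <assert.h>
#include <stdint.h>

#define WHEEL_BITS 4                    /* bits of time consumed per level */
#define WHEEL_SIZE (1u << WHEEL_BITS)   /* 16 elements per array */
#define WHEEL_MASK (WHEEL_SIZE - 1u)

struct wheel_slot {
	unsigned level;   /* 0 = low-order array, 1 = next level up */
	unsigned index;   /* element within that array */
};

/* Element that the current time selects in the array at the given level. */
static unsigned wheel_current_index(uint8_t now, unsigned level)
{
	assert(level < 2);
	return (now >> (level * WHEEL_BITS)) & WHEEL_MASK;
}

/* Choose the array and element for a timer expiring at time "expires". */
static struct wheel_slot wheel_slot_for(uint8_t now, uint8_t expires)
{
	struct wheel_slot s;
	uint8_t delta = (uint8_t)(expires - now);   /* wraps modulo 256 */

	if (delta < WHEEL_SIZE) {
		/* Expires within 16 time units: low-order four bits index level 0. */
		s.level = 0;
		s.index = expires & WHEEL_MASK;
	} else {
		/* Further out: the next four bits index the next level up. */
		s.level = 1;
		s.index = (expires >> WHEEL_BITS) & WHEEL_MASK;
	}
	return s;
}
```

With the current time at 0x1f, wheel_current_index() yields 0xf for the low-order array
and 0x1 for the next level up, and a timer expiring at 0x45 lands in element 4x of the
upper array.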
This approach works extremely well for throughput-based systems. Each timer
operation is O(1) with a small constant, and each timer element is touched at most 𝑚 + 1
times, where 𝑚 is the number of levels.
[Figure 14.9: timer wheel for an eight-bit clock. The upper (leftmost) array has elements
0x through fx, the lower (rightmost) array has elements x0 through xf, and the current
time is 1f.]
Unfortunately, timer wheels do not work well for real-time systems, and for two
reasons. The first reason is that there is a harsh tradeoff between timer accuracy and timer
overhead, which is fancifully illustrated by Figures 14.10 and 14.11. In Figure 14.10,
timer processing happens only once per millisecond, which keeps overhead acceptably
low for many (but not all!) workloads, but which also means that timeouts cannot
be set for finer than one-millisecond granularities. On the other hand, Figure 14.11
shows timer processing taking place every ten microseconds, which provides acceptably
fine timer granularity for most (but not all!) workloads, but which processes timers so
frequently that the system might well not have time to do anything else.
The second reason is the need to cascade timers from higher levels to lower levels.
Referring back to Figure 14.9, we can see that any timers enqueued on element 1x in
the upper (leftmost) array must be cascaded down to the lower (rightmost) array so
that they may be invoked when their time arrives. Unfortunately, there could be a large
number of timeouts waiting to be cascaded, especially for timer wheels with larger
numbers of levels. The power of statistics causes this cascading to be a non-problem for
throughput-oriented systems, but cascading can result in problematic degradations of
latency in real-time systems.
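The cost of cascading can be seen in the following sketch, which again assumes a two-level
wheel with 16 elements per array and singly linked lists of timers. The structure and
function names are hypothetical, and the code is far simpler than anything in the Linux
kernel:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define WHEEL_BITS 4
#define WHEEL_SIZE (1u << WHEEL_BITS)
#define WHEEL_MASK (WHEEL_SIZE - 1u)

struct wtimer {
	struct wtimer *next;
	uint8_t expires;
};

struct wheel {
	struct wtimer *lower[WHEEL_SIZE];   /* one time unit per element */
	struct wtimer *upper[WHEEL_SIZE];   /* WHEEL_SIZE time units per element */
};

static void wheel_push(struct wtimer **head, struct wtimer *t)
{
	t->next = *head;
	*head = t;
}

/* Post a timer: near-term timers go to the lower array, others to the upper. */
static void wheel_add(struct wheel *w, uint8_t now, struct wtimer *t)
{
	uint8_t delta = (uint8_t)(t->expires - now);

	if (delta < WHEEL_SIZE)
		wheel_push(&w->lower[t->expires & WHEEL_MASK], t);
	else
		wheel_push(&w->upper[(t->expires >> WHEEL_BITS) & WHEEL_MASK], t);
}

/*
 * Cascade: invoked when the low-order bits of "now" wrap to zero.  Every
 * timer queued on the upper element selected by "now" moves to the lower
 * array.  Returns the number of timers moved.
 */
static size_t wheel_cascade(struct wheel *w, uint8_t now)
{
	struct wtimer **slot = &w->upper[(now >> WHEEL_BITS) & WHEEL_MASK];
	struct wtimer *t = *slot;
	size_t moved = 0;

	assert((now & WHEEL_MASK) == 0);
	*slot = NULL;
	while (t) {
		struct wtimer *next = t->next;

		wheel_push(&w->lower[t->expires & WHEEL_MASK], t);
		moved++;
		t = next;
	}
	return moved;
}
```

The loop in wheel_cascade() runs once per queued timer, so its execution time is bounded
only by the number of timers posted to that element, which is exactly what a real-time
system cannot tolerate.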
Of course, real-time systems could simply choose a different data structure, for
example, some form of heap or tree, giving up O(1) bounds on insertion and deletion
operations to gain O(log 𝑛) limits on data-structure-maintenance operations. This can
be a good choice for special-purpose RTOSes, but is inefficient for general-purpose
systems such as Linux, which routinely support extremely large numbers of timers.
The solution chosen for the Linux kernel’s -rt patchset is to differentiate between
timers that schedule later activity and timeouts that schedule error handling for low-
probability errors such as TCP packet losses. One key observation is that error handling
is normally not particularly time-critical, so that a timer wheel’s millisecond-level
granularity is good and sufficient. Another key observation is that error-handling
timeouts are normally canceled very early, often before they can be cascaded. In
addition, systems commonly have many more error-handling timeouts than they do
timer events, so that an O(log 𝑛) data structure should provide acceptable performance
for timer events.
However, it is possible to do better, namely by simply refusing to cascade timers.
Instead of cascading, the timers that would otherwise have been cascaded all the way
down the calendar queue are handled in place. This does result in up to a few percent
error for the time duration, but the few situations where this is a problem can instead
use tree-based high-resolution timers (hrtimers).
[Figure: two interrupt-handling timelines. Non-threaded: mainline code, interrupt, interrupt
handler, return from interrupt, mainline code (long latency: degrades response time).
Threaded: mainline code, interrupt, return from interrupt, mainline code, with the interrupt
handler running in a preemptible IRQ thread (short latency: improved response time).]
In short, the Linux kernel’s -rt patchset uses timer wheels for error-handling timeouts
and a tree for timer events, providing each category the required quality of service.
Another downside is that poorly written high-priority real-time code might starve
the interrupt handler, for example, preventing networking code from running, in turn
making it very difficult to debug the problem. Developers must therefore take great
care when writing high-priority real-time code. This has been dubbed the Spiderman
principle: With great power comes great responsibility.
Priority inheritance is used to handle priority inversion, which can be caused by,
among other things, locks acquired by preemptible interrupt handlers [SRL90]. Suppose
that a low-priority thread holds a lock, but is preempted by a group of medium-priority
threads, at least one such thread per CPU. If an interrupt occurs, a high-priority IRQ
thread will preempt one of the medium-priority threads, but only until it decides to
acquire the lock held by the low-priority thread. Unfortunately, the low-priority thread
cannot release the lock until it starts running, which the medium-priority threads prevent
it from doing. So the high-priority IRQ thread cannot acquire the lock until after one of
the medium-priority threads releases its CPU. In short, the medium-priority threads are
indirectly blocking the high-priority IRQ threads, a classic case of priority inversion.
Note that this priority inversion could not happen with non-threaded interrupts
because the low-priority thread would have to disable interrupts while holding the lock,
which would prevent the medium-priority threads from preempting it.
In the priority-inheritance solution, the high-priority thread attempting to acquire the
lock donates its priority to the low-priority thread holding the lock until such time as
the lock is released, thus preventing long-term priority inversion.
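User-level code can request the same behavior through POSIX threads. The following is a
minimal sketch, assuming a platform that supports the PTHREAD_PRIO_INHERIT mutex protocol.
The helper name pi_mutex_init() is illustrative only:

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>

/*
 * Initialize "m" as a priority-inheritance mutex, so that a thread holding
 * it is boosted to the priority of the highest-priority thread blocked on it.
 * Returns 0 on success or a POSIX error number on failure.
 */
static int pi_mutex_init(pthread_mutex_t *m)
{
	pthread_mutexattr_t attr;
	int ret;

	assert(m != NULL);
	ret = pthread_mutexattr_init(&attr);
	if (ret)
		return ret;
	ret = pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
	if (!ret)
		ret = pthread_mutex_init(m, &attr);
	pthread_mutexattr_destroy(&attr);
	return ret;
}
```

The mutex is then used with the usual pthread_mutex_lock() and pthread_mutex_unlock()
calls. The priority donation matters only when the threads involved run at different
real-time priorities.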
Of course, priority inheritance does have its limitations. For example, if you can
design your application to avoid priority inversion entirely, you will likely obtain
somewhat better latencies [Yod04b]. This should be no surprise, given that priority
inheritance adds a pair of context switches to the worst-case latency. That said, priority
inheritance can convert indefinite postponement into a limited increase in latency, and
the software-engineering benefits of priority inheritance may outweigh its latency costs
in many applications.
Another limitation is that it addresses only lock-based priority inversions within the
context of a given operating system. One priority-inversion scenario that it cannot
address is a high-priority thread waiting on a network socket for a message that is
to be written by a low-priority process that is preempted by a set of CPU-bound
medium-priority processes. In addition, a potential disadvantage of applying priority
inheritance to user input is fancifully depicted in Figure 14.14.
A final limitation involves reader-writer locking. Suppose that we have a very large
number of low-priority threads, perhaps even thousands of them, each of which read-
holds a particular reader-writer lock. Suppose that all of these threads are preempted by
a set of medium-priority threads, with at least one medium-priority thread per CPU.
Finally, suppose that a high-priority thread awakens and attempts to write-acquire this
same reader-writer lock. No matter how vigorously we boost the priority of the threads
read-holding this lock, it could well be a good long time before the high-priority thread
can complete its write-acquisition.
There are a number of possible solutions to this reader-writer lock priority-inversion
conundrum:
Quick Quiz 14.10: But if you only allow one reader at a time to read-acquire a reader-writer
lock, isn’t that the same as an exclusive lock???
Quick Quiz 14.11: Suppose that preemption occurs just after the load from
t->rcu_read_unlock_special.s on line 12 of Listing 14.3. Mightn’t that result in the task failing to invoke
rcu_read_unlock_special(), thus failing to remove itself from the list of tasks blocking
the current grace period, in turn causing that grace period to extend indefinitely?
Another important real-time feature of RCU, whether preemptible or not, is the ability
to offload RCU callback execution to a kernel thread. To use this, your kernel must be
built with CONFIG_RCU_NOCB_CPU=y and booted with the rcu_nocbs= kernel boot
parameter specifying which CPUs are to be offloaded. Alternatively, any CPU specified
by the nohz_full= kernel boot parameter described in Section 14.3.5.2 will also have
its RCU callbacks offloaded.
In short, this preemptible RCU implementation enables real-time response for read-
mostly data structures without the delays inherent to priority boosting of large numbers
of readers, and also without delays due to callback invocation.
Preemptible spinlocks are an important part of the -rt patchset due to the long-
duration spinlock-based critical sections in the Linux kernel. This functionality has
not yet reached mainline: Although they are a conceptually simple substitution of
sleeplocks for spinlocks, they have proven relatively controversial. In addition, the
real-time functionality that is already in the mainline Linux kernel suffices for a
great many use cases, which slowed the -rt patchset’s development rate in the early
2010s [Edg13, Edg14]. However, preemptible spinlocks are absolutely necessary to the
task of achieving real-time latencies down in the tens of microseconds. Fortunately,
the Linux Foundation organized an effort to fund moving the remaining code from the -rt
patchset to mainline.
Per-CPU variables are used heavily in the Linux kernel for performance reasons.
Unfortunately for real-time applications, many use cases for per-CPU variables require
coordinated update of multiple such variables, which is normally provided by disabling
preemption, which in turn degrades real-time latencies. Real-time applications clearly
need some other way of coordinating per-CPU variable updates.
One alternative is to supply per-CPU spinlocks, which as noted above are actually
sleeplocks, so that their critical sections can be preempted and so that priority inheritance
is provided. In this approach, code updating groups of per-CPU variables must acquire
the current CPU’s spinlock, carry out the update, then release whichever lock was acquired,
keeping in mind that a preemption might have resulted in a migration to some other
CPU. However, this approach introduces both overhead and the possibility of deadlock.
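The per-CPU-lock approach can be illustrated with a user-space analogy, sketched here
under several assumptions: pthread mutexes stand in for the kernel's sleeping spinlocks,
NR_CPUS is an arbitrary illustrative value, and current_cpu() is a hypothetical stand-in
for whatever reports the CPU on which the caller is running:

```c
#include <assert.h>
#include <pthread.h>

#define NR_CPUS 4   /* illustrative CPU count */

struct percpu_stats {
	pthread_mutex_t lock;   /* stands in for a per-CPU sleeplock */
	long packets;
	long bytes;
};

static struct percpu_stats stats[NR_CPUS] = {
	{ PTHREAD_MUTEX_INITIALIZER, 0, 0 },
	{ PTHREAD_MUTEX_INITIALIZER, 0, 0 },
	{ PTHREAD_MUTEX_INITIALIZER, 0, 0 },
	{ PTHREAD_MUTEX_INITIALIZER, 0, 0 },
};

/* Hypothetical stand-in for the CPU the caller is currently running on. */
static int fake_cpu_id;

static int current_cpu(void)
{
	return fake_cpu_id;
}

/* Update a group of per-CPU variables as one coordinated unit. */
static void stats_account(long nbytes)
{
	int cpu = current_cpu();   /* sample the CPU exactly once */
	struct percpu_stats *s;

	assert(cpu >= 0 && cpu < NR_CPUS);
	s = &stats[cpu];
	pthread_mutex_lock(&s->lock);
	/* A migration here is harmless: "s" still names the lock we hold. */
	s->packets++;
	s->bytes += nbytes;
	pthread_mutex_unlock(&s->lock);   /* release the lock that was acquired */
}
```

The key design point is that the CPU number is sampled once and the resulting pointer is
used for both acquisition and release, so a migration in the middle of the critical
section cannot cause the wrong lock to be released.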
Another alternative, which is used in the -rt patchset as of early 2021, is to convert
preemption disabling to migration disabling. This ensures that a given kernel thread
remains on its CPU through the duration of the per-CPU-variable update, but could also
allow some other kernel thread to intersperse its own update of those same variables,
courtesy of preemption. There are cases such as statistics gathering where this is not a
problem. In the surprisingly rare case where such mid-update preemption is a problem,
the use case at hand must properly synchronize the updates, perhaps through a set of
per-CPU locks specific to that use case. Although introducing locks again introduces the
possibility of deadlock, the per-use-case nature of these locks makes any such deadlocks
easier to manage and avoid.
Closing event-driven remarks. There are of course any number of other Linux-kernel
components that are critically important to achieving world-class real-time latencies,
for example, deadline scheduling [dO18b, dO18a]. However, those listed in this section
give a good feeling for the workings of the Linux kernel augmented by the -rt patchset.
This command would confine interrupt #44 to CPUs 0–3. Note that scheduling-clock
interrupts require special handling, and are discussed later in this section.
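Confining interrupt #44 to CPUs 0–3 amounts to writing the hexadecimal CPU mask f to that
interrupt's smp_affinity file. A minimal sketch in C follows, assuming the
/proc/irq/44/smp_affinity interface. The helper names cpu_range_mask() and
write_irq_affinity() are illustrative only:

```c
#include <assert.h>
#include <stdio.h>

/* Bitmask with one bit set for each CPU in the range [first, last]. */
static unsigned long cpu_range_mask(unsigned first, unsigned last)
{
	unsigned long mask = 0;
	unsigned cpu;

	assert(first <= last && last < 8 * sizeof(mask));
	for (cpu = first; cpu <= last; cpu++)
		mask |= 1UL << cpu;
	return mask;
}

/*
 * Write "mask" in hexadecimal to "path", for example
 * /proc/irq/44/smp_affinity.  Returns 0 on success, -1 on failure.
 */
static int write_irq_affinity(const char *path, unsigned long mask)
{
	FILE *fp = fopen(path, "w");

	if (!fp)
		return -1;
	if (fprintf(fp, "%lx\n", mask) < 0) {
		fclose(fp);
		return -1;
	}
	return fclose(fp) == 0 ? 0 : -1;
}
```

For example, write_irq_affinity("/proc/irq/44/smp_affinity", cpu_range_mask(0, 3)) would
request the confinement described above, and would need to run as root.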
A second source of OS jitter is due to kernel threads and daemons. Individual
kernel threads, such as RCU’s grace-period kthreads (rcu_bh, rcu_preempt, and
rcu_sched), may be forced onto any desired CPUs using the taskset command, the
sched_setaffinity() system call, or cgroups.
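For the sched_setaffinity() route, a minimal sketch might look as follows. It assumes
Linux with glibc, which requires _GNU_SOURCE for the CPU_SET() family of macros, and the
helper name pin_task_to_cpu() is illustrative only:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <sched.h>
#include <sys/types.h>

/*
 * Restrict task "pid" (0 means the calling thread) to the single CPU "cpu".
 * Returns 0 on success or -errno on failure.
 */
static int pin_task_to_cpu(pid_t pid, int cpu)
{
	cpu_set_t set;

	assert(cpu >= 0 && cpu < CPU_SETSIZE);
	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	if (sched_setaffinity(pid, sizeof(set), &set) != 0)
		return -errno;
	return 0;
}
```

Passing the PID of a kernel thread moves that kthread, provided that the caller has
sufficient privilege and that the kthread is not a per-CPU kthread bound to its CPU.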
Per-CPU kthreads are often more challenging, sometimes constraining hardware
configuration and workload layout. Preventing OS jitter from these kthreads requires
either that certain types of hardware not be attached to real-time systems, that all interrupts
and I/O initiation take place on housekeeping CPUs, that special kernel Kconfig or
boot parameters be selected in order to direct work away from the worker CPUs, or that
worker CPUs never enter the kernel. Specific per-kthread advice may be found in the
Linux kernel source Documentation directory at kernel-per-CPU-kthreads.txt.
A third source of OS jitter in the Linux kernel for CPU-bound threads running
at real-time priority is the scheduler itself. This is an intentional debugging feature,
designed to ensure that important non-realtime work is allotted at least 50 milliseconds
out of each second, even if there is an infinite-loop bug in your real-time application.
However, when you are running a polling-loop-style real-time application, you will need
to disable this debugging feature. This can be done as follows:
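One way to express this in C, assuming the sched_rt_runtime_us sysctl file under
/proc/sys/kernel, is to write -1 to that file. The helper name set_rt_runtime_us() is
illustrative only, and the path is passed in so that the sketch can be exercised without
touching the running system:

```c
#include <assert.h>
#include <stdio.h>

/*
 * Write "runtime_us" to the file at "path", normally
 * /proc/sys/kernel/sched_rt_runtime_us.  A value of -1 disables
 * real-time throttling.  Returns 0 on success, -1 on failure.
 */
static int set_rt_runtime_us(const char *path, long runtime_us)
{
	FILE *fp;

	assert(runtime_us >= -1);
	fp = fopen(path, "w");
	if (!fp)
		return -1;
	if (fprintf(fp, "%ld\n", runtime_us) < 0) {
		fclose(fp);
		return -1;
	}
	return fclose(fp) == 0 ? 0 : -1;
}
```

A call such as set_rt_runtime_us("/proc/sys/kernel/sched_rt_runtime_us", -1) would then
disable the throttling.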
You will of course need to be running as root to execute this command, and you will
also need to carefully consider the aforementioned Spiderman principle. One way to
minimize the risks is to offload interrupts and kernel threads/daemons from all CPUs
running CPU-bound real-time threads, as described in the paragraphs above. In addition,
you should carefully read the material in the Documentation/scheduler directory.
The material in the sched-rt-group.rst file is particularly important, especially if
you are using the cgroups real-time features enabled by the CONFIG_RT_GROUP_SCHED
Kconfig parameter.
A fourth source of OS jitter comes from timers. In most cases, keeping a given CPU
out of the kernel will prevent timers from being scheduled on that CPU. One important
exception is recurring timers, where a given timer handler posts a later occurrence
of that same timer. If such a timer gets started on a given CPU for any reason, that
timer will continue to run periodically on that CPU, inflicting OS jitter indefinitely. One
crude but effective way to offload recurring timers is to use CPU hotplug to offline all
worker CPUs that are to run CPU-bound real-time application threads, online these
same CPUs, then start your real-time application.
A fifth source of OS jitter is provided by device drivers that were not intended for
real-time use. For an old canonical example, in 2005, the VGA driver would blank the
screen by zeroing the frame buffer with interrupts disabled, which resulted in tens of
milliseconds of OS jitter. One way of avoiding device-driver-induced OS jitter is to
carefully select devices that have been used heavily in real-time systems, and which
have therefore had their real-time bugs fixed. Another way is to confine the device’s
interrupts and all code using that device to designated housekeeping CPUs. A third way
is to test the device’s ability to support real-time workloads and fix any real-time bugs.8
A sixth source of OS jitter is provided by some in-kernel full-system synchronization
algorithms, perhaps most notably the global TLB-flush algorithm. This can be avoided by
avoiding memory-unmapping operations, and especially avoiding unmapping operations
within the kernel. As of early 2021, the way to avoid in-kernel unmapping operations is
to avoid unloading kernel modules.
A seventh source of OS jitter is provided by scheduling-clock interrupts and RCU
callback invocation. These may be avoided by building your kernel with the
NO_HZ_FULL Kconfig parameter enabled, and then booting with the nohz_full= parameter
specifying the list of worker CPUs that are to run real-time threads. For example,
nohz_full=2-7 would designate CPUs 2, 3, 4, 5, 6, and 7 as worker CPUs, thus
leaving CPUs 0 and 1 as housekeeping CPUs. The worker CPUs would not incur
scheduling-clock interrupts as long as there is no more than one runnable task on each
worker CPU, and each worker CPU’s RCU callbacks would be invoked on one of the
housekeeping CPUs. A CPU that has suppressed scheduling-clock interrupts due to
there only being one runnable task on that CPU is said to be in adaptive ticks mode
or in nohz_full mode. It is important to ensure that you have designated enough
housekeeping CPUs to handle the housekeeping load imposed by the rest of the system,
which requires careful benchmarking and tuning.
8 If you take this approach, please submit your fixes upstream so that others can benefit.
After all, when you need to port your application to a later version of the Linux kernel, you
will be one of those “others”.
An eighth source of OS jitter is page faults. Because most Linux implementations
use an MMU for memory protection, real-time applications running on these systems
can be subject to page faults. Use the mlock() and mlockall() system calls to pin
your application’s pages into memory, thus avoiding major page faults. Of course, the
Spiderman principle applies, because locking down too much memory may prevent the
system from getting other work done.
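A minimal sketch of this pinning follows, assuming a POSIX system that provides
mlockall(). The helper names pin_all_memory() and prefault_buffer() are illustrative, and
the prefaulting step is a common companion technique rather than something that
mlockall() requires:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Pin all current and future pages of the calling process into memory.
 * Returns 0 on success or -errno on failure, for example when the
 * RLIMIT_MEMLOCK resource limit is too small for an unprivileged process.
 */
static int pin_all_memory(void)
{
	if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0)
		return -errno;
	return 0;
}

/*
 * Touch every page of "buf" so that the pages are faulted in before the
 * real-time portion of the application starts.  Contents are preserved.
 * Returns the number of page-sized strides touched.
 */
static size_t prefault_buffer(volatile unsigned char *buf, size_t len)
{
	long pagesize = sysconf(_SC_PAGESIZE);
	size_t off, strides = 0;

	assert(buf != NULL && pagesize > 0);
	for (off = 0; off < len; off += (size_t)pagesize) {
		buf[off] = buf[off];   /* read and write back one byte */
		strides++;
	}
	if (len > 0)
		buf[len - 1] = buf[len - 1];   /* cover an unaligned final page */
	return strides;
}
```

Checking the return value of pin_all_memory() matters in practice, because a failed
mlockall() otherwise leaves the application silently exposed to page faults.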
A ninth source of OS jitter is unfortunately the hardware and firmware. It is therefore
important to use systems that have been designed for real-time use.
Unfortunately, this list of OS-jitter sources can never be complete, as it will change
with each new version of the kernel. This makes it necessary to be able to track down
additional sources of OS jitter. Given a CPU 𝑁 running a CPU-bound usermode thread,
the commands shown in Listing 14.4 will produce a list of all the times that this CPU
entered the kernel. Of course, the N on line 5 must be replaced with the number of the
CPU in question, and the 1 on line 2 may be increased to show additional levels of
function call within the kernel. The resulting trace can help track down the source of
the OS jitter.
As always, there is no free lunch, and NO_HZ_FULL is no exception. As noted earlier,
NO_HZ_FULL makes kernel/user transitions more expensive due to the need for delta
process accounting and the need to inform kernel subsystems (such as RCU) of the
transitions. As a rough rule of thumb, NO_HZ_FULL helps with many types of real-time
and heavy-compute workloads, but hurts other workloads that feature high rates of
system calls and I/O [ACA+18]. Additional limitations, tradeoffs, and configuration
advice may be found in Documentation/timers/no_hz.rst.
As you can see, obtaining bare-metal performance when running CPU-bound real-
time threads on a general-purpose OS such as Linux requires painstaking attention
to detail. Automation would of course help, and some automation has been applied,
but given the relatively small number of users, automation can be expected to appear
relatively slowly. Nevertheless, the ability to gain near-bare-metal performance while
running a general-purpose operating system promises to ease construction of some
types of real-time systems.
catalog would fill multiple books—but rather a brief overview of the types of components
available.
A natural place to look for real-time software components would be algorithms
offering wait-free synchronization [Her91], and in fact lockless algorithms are very
important to real-time computing. However, wait-free synchronization only guarantees
forward progress in finite time. Although a century is finite, this is unhelpful when your
deadlines are measured in milliseconds, let alone microseconds.
Nevertheless, there are some important wait-free algorithms that do provide bounded
response time, including atomic test and set, atomic exchange, atomic fetch-and-add,
single-producer/single-consumer FIFO queues based on circular arrays, and numerous
per-thread partitioned algorithms. In addition, recent research has confirmed the
observation that algorithms with lock-free guarantees9 also provide the same latencies
in practice (in the wait-free sense), assuming a stochastically fair scheduler and absence
of fail-stop bugs [ACHS13]. This means that many non-wait-free stacks and queues are
nevertheless appropriate for real-time use.
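As one illustration of a bounded-response-time algorithm from this list, here is a minimal
sketch of a single-producer/single-consumer FIFO based on a circular array, written with
C11 atomics. The names, the element type, and the capacity are all illustrative choices:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define FIFO_SIZE 8u   /* capacity, which must be a power of two */

_Static_assert((FIFO_SIZE & (FIFO_SIZE - 1u)) == 0, "FIFO_SIZE must be a power of two");

struct spsc_fifo {
	_Atomic size_t head;   /* advanced only by the consumer */
	_Atomic size_t tail;   /* advanced only by the producer */
	int slot[FIFO_SIZE];
};

/* Producer side: returns false if the FIFO is full. */
static bool spsc_enqueue(struct spsc_fifo *f, int v)
{
	size_t tail = atomic_load_explicit(&f->tail, memory_order_relaxed);
	size_t head = atomic_load_explicit(&f->head, memory_order_acquire);

	if (tail - head == FIFO_SIZE)
		return false;
	f->slot[tail & (FIFO_SIZE - 1u)] = v;
	/* Publish the element only after it has been written. */
	atomic_store_explicit(&f->tail, tail + 1, memory_order_release);
	return true;
}

/* Consumer side: returns false if the FIFO is empty. */
static bool spsc_dequeue(struct spsc_fifo *f, int *v)
{
	size_t head = atomic_load_explicit(&f->head, memory_order_relaxed);
	size_t tail = atomic_load_explicit(&f->tail, memory_order_acquire);

	assert(v != NULL);
	if (head == tail)
		return false;
	*v = f->slot[head & (FIFO_SIZE - 1u)];
	/* Release the slot only after it has been read. */
	atomic_store_explicit(&f->head, head + 1, memory_order_release);
	return true;
}
```

Neither operation contains a loop or a retry, so each completes in a fixed number of
steps regardless of what the other thread is doing. A zero-initialized struct spsc_fifo is
an empty FIFO.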
Quick Quiz 14.12: But isn’t correct operation despite fail-stop bugs a valuable fault-tolerance
property?
lock-free algorithms only guarantee that at least one thread will make progress in finite time.
See Section 14.2 for more details.
Quick Quiz 14.13: I couldn’t help but spot the word “include” before this list. Are there other
constraints?
This result opens a vast cornucopia of algorithms and data structures for use in
real-time software—and validates long-standing real-time practice.
Of course, a careful and simple application design is also extremely important. The
best real-time components in the world cannot make up for a poorly thought-out design.
For parallel real-time applications, synchronization overheads clearly must be a key
component of the design.
Many real-time applications consist of a single CPU-bound loop that reads sensor data,
computes a control law, and writes control output. If the hardware registers providing
sensor data and taking control output are mapped into the application’s address space,
this loop might be completely free of system calls. But beware of the Spiderman
principle: With great power comes great responsibility, in this case the responsibility
to avoid bricking the hardware by making inappropriate references to the hardware
registers.
This arrangement is often run on bare metal, without the benefits of (or the interference
from) an operating system. However, increasing hardware capability and increasing
levels of automation motivate increasing software functionality, for example, user
interfaces, logging, and reporting, all of which can benefit from an operating system.
One way of gaining much of the benefit of running on bare metal while still having
access to the full features and functions of a general-purpose operating system is to use
the Linux kernel’s NO_HZ_FULL capability, described in Section 14.3.5.2.
One type of big-data real-time application takes input from numerous sources, processes
it internally, and outputs alerts and summaries. These streaming applications are often
highly parallel, processing different information sources concurrently.
One approach for implementing streaming applications is to use dense-array circular
FIFOs to connect different processing steps [Sut13]. Each such FIFO has only a single
thread producing into it and a (presumably different) single thread consuming from
it. Fan-in and fan-out points use threads rather than data structures, so if the output of
several FIFOs needed to be merged, a separate thread would input from them and output
to another FIFO for which this separate thread was the sole producer. Similarly, if the
output of a given FIFO needed to be split, a separate thread would input from this FIFO
and output to several FIFOs as needed.
This discipline might seem restrictive, but it allows communication among threads
with minimal synchronization overhead, and minimal synchronization overhead is
important when attempting to meet tight latency constraints. This is especially true
when the amount of processing for each step is small, so that the synchronization
overhead is significant compared to the processing overhead.
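The single-producer/single-consumer discipline described above can be sketched in userspace C11 as follows. This is only an illustration under stated assumptions: the names spsc_fifo, spsc_enqueue(), and spsc_dequeue() and the capacity FIFO_SIZE are hypothetical, and C11 atomics stand in for whatever primitives a production implementation would use. Neither operation spins, blocks, or retries, so each completes in a bounded number of steps.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define FIFO_SIZE 8  /* hypothetical capacity; must be a power of two */

struct spsc_fifo {
	int buf[FIFO_SIZE];
	_Atomic size_t head;  /* next slot to fill; written only by the producer */
	_Atomic size_t tail;  /* next slot to drain; written only by the consumer */
};

/* Producer side: returns false instead of blocking when the FIFO is full. */
static bool spsc_enqueue(struct spsc_fifo *f, int v)
{
	size_t head = atomic_load_explicit(&f->head, memory_order_relaxed);
	size_t tail = atomic_load_explicit(&f->tail, memory_order_acquire);

	if (head - tail == FIFO_SIZE)
		return false;
	f->buf[head & (FIFO_SIZE - 1)] = v;
	/* Release: the element must become visible before the new head does. */
	atomic_store_explicit(&f->head, head + 1, memory_order_release);
	return true;
}

/* Consumer side: returns false instead of blocking when the FIFO is empty. */
static bool spsc_dequeue(struct spsc_fifo *f, int *v)
{
	size_t tail = atomic_load_explicit(&f->tail, memory_order_relaxed);
	size_t head = atomic_load_explicit(&f->head, memory_order_acquire);

	if (head == tail)
		return false;
	*v = f->buf[tail & (FIFO_SIZE - 1)];
	/* Release: the slot must be read before the producer may reuse it. */
	atomic_store_explicit(&f->tail, tail + 1, memory_order_release);
	return true;
}
```

A fan-in thread would call spsc_dequeue() on each of its input FIFOs and spsc_enqueue() on its single output FIFO, thus remaining the sole producer for that output.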
The individual threads might be CPU-bound, in which case the advice in
Section 14.3.6.2 applies. On the other hand, if the individual threads block waiting for data
from their input FIFOs, the advice of the next section applies.
We will use fuel injection into a mid-sized industrial engine as a fanciful example for
event-driven applications. Under normal operating conditions, this engine requires that
the fuel be injected within a one-degree interval surrounding top dead center. If we
assume a 1,500-RPM rotation rate, we have 25 rotations per second, or about 9,000
degrees of rotation per second, which translates to 111 microseconds per degree. We
therefore need to schedule the fuel injection to within a time interval of about 100
microseconds.
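The arithmetic above can be checked with a few lines of C. The function name us_per_degree() is hypothetical and exists only for this illustration.

```c
/* Microseconds available per degree of crankshaft rotation at a given RPM. */
static double us_per_degree(double rpm)
{
	double rev_per_sec = rpm / 60.0;           /* 1,500 RPM is 25 rotations/s */
	double deg_per_sec = rev_per_sec * 360.0;  /* 25 rotations/s is 9,000 degrees/s */

	return 1e6 / deg_per_sec;                  /* roughly 111 us per degree */
}
```

Doubling the rotation rate halves the time per degree, so a faster engine would tighten the scheduling window accordingly.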
Suppose that a timed wait were to be used to initiate fuel injection, although if you are
building an engine, I hope you supply a rotation sensor. We need to test the timed-wait
functionality, perhaps using the test program shown in Listing 14.5. Unfortunately, if
we run this program, we can get unacceptable timer jitter, even in a -rt kernel.
One problem is that POSIX CLOCK_REALTIME is, oddly enough, not intended for real-
time use. Instead, it means “realtime” as opposed to the amount of CPU time consumed
by a process or thread. For real-time use, you should instead use CLOCK_MONOTONIC.
However, even with this change, results are still unacceptable.
Another problem is that the thread must be raised to a real-time priority by using the
sched_setscheduler() system call. But even this change is insufficient, because we
can still see page faults. We also need to use the mlockall() system call to pin the
application’s memory, preventing page faults. With all of these changes, results might
finally be acceptable.
In other situations, further adjustments might be needed. It might be necessary to
affinity time-critical threads onto their own CPUs, and it might also be necessary to
affinity interrupts away from those CPUs. It might be necessary to carefully select
hardware and drivers, and it will very likely be necessary to carefully select kernel
configuration.
As can be seen from this example, real-time computing can be quite unforgiving.
Suppose that you are writing a parallel real-time application that needs to access data
that is subject to gradual change, perhaps due to changes in temperature, humidity, and
barometric pressure. The real-time response constraints on this program are so severe
that it is not permissible to spin or block, thus ruling out locking, nor is it permissible to
use a retry loop, thus ruling out sequence locks and hazard pointers. Fortunately, the
temperature and pressure are normally controlled, so that a default hard-coded set of
data is usually sufficient.
However, the temperature, humidity, and pressure occasionally deviate too far from
the defaults, and in such situations it is necessary to provide data that replaces the
defaults. Because the temperature, humidity, and pressure change gradually, providing
the updated values is not a matter of urgency, though it must happen within a few minutes.
The program is to use a global pointer imaginatively named cur_cal that normally
references default_cal, which is a statically allocated and initialized structure that
contains the default calibration values in fields imaginatively named a, b, and c.
Otherwise, cur_cal points to a dynamically allocated structure providing the current
calibration values.
Listing 14.6 shows how RCU can be used to solve this problem. Lookups are
deterministic, as shown in calc_control() on lines 9–15, consistent with real-time
requirements. Updates are more complex, as shown by update_cal() on lines 17–35.
Quick Quiz 14.14: Given that real-time systems are often used for safety-critical applications,
and given that runtime memory allocation is forbidden in many safety-critical situations, what
is with the call to malloc()???
Quick Quiz 14.15: Don’t you need some kind of synchronization to protect update_cal()?
This example shows how RCU can provide deterministic read-side data-structure
access to real-time programs.
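The shape of this solution can be approximated in userspace C11, as in the following sketch. It is not Listing 14.6 and it is not RCU: it shows only the pointer-publication half of the problem, using an acquire load on the read side and an atomic exchange on the update side. The function publish_cal(), the default values, and the control law are all hypothetical, and the grace-period wait that RCU would supply before freeing the old structure is left to the caller.

```c
#include <stdatomic.h>
#include <stdlib.h>

struct calibration {
	int a, b, c;
};

static struct calibration default_cal = { 62, 33, 0 };  /* hypothetical defaults */
static _Atomic(struct calibration *) cur_cal = &default_cal;

/* Read side: a single pointer load, with no locks, loops, or retries. */
static int calc_control(int t, int h, int press)
{
	struct calibration *p =
		atomic_load_explicit(&cur_cal, memory_order_acquire);

	return t * p->a + h * p->b + press * p->c;  /* hypothetical control law */
}

/* Update side: fully initialize a new structure, then publish it.
 * Returns the previous structure, or NULL if allocation failed.
 * The caller must not free the previous structure until all readers that
 * might still hold a pointer to it have finished, which is the job that
 * a grace-period primitive performs in an RCU-based version. */
static struct calibration *publish_cal(int a, int b, int c)
{
	struct calibration *p = malloc(sizeof(*p));

	if (!p)
		return NULL;
	p->a = a;
	p->b = b;
	p->c = c;
	return atomic_exchange_explicit(&cur_cal, p, memory_order_acq_rel);
}
```

The key property is that calc_control() executes the same short instruction sequence regardless of what updaters are doing, which is what makes its response time predictable.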
If the answer to any of these questions is “yes”, you should choose real-fast over
real-time; otherwise, real-time might be for you.
Choose wisely, and if you do choose real-time, make sure that your hardware, firmware,
and operating system are up to the job!
The art of progress is to preserve order amid change
and to preserve change amid order.
Alfred North Whitehead
Chapter 15
Advanced Synchronization:
Memory Ordering
Causality and sequencing are deeply intuitive, and hackers often have a strong grasp
of these concepts. These intuitions can be quite helpful when writing, analyzing, and
debugging not only sequential code, but also parallel code that makes use of standard
mutual-exclusion mechanisms such as locking. Unfortunately, these intuitions break
down completely in code that instead uses weakly ordered atomic operations and
memory barriers. One example of such code implements the standard mutual-exclusion
mechanisms themselves, while another example implements fast paths that use weaker
synchronization. Insults to intuition notwithstanding, some argue that weakness is a
virtue [Alg13]. Virtue or vice, this chapter will help you gain an understanding of
memory ordering that, with practice, will be sufficient to implement synchronization
primitives and performance-critical fast paths.
Section 15.1 will demonstrate that real computer systems can reorder memory
references, give some reasons why they do so, and provide some information on how
to prevent undesired reordering. Sections 15.2 and 15.3 will cover the types of pain
that hardware and compilers, respectively, can inflict on unwary parallel programmers.
Section 15.4 gives an overview of the benefits of modeling memory ordering at higher
levels of abstraction. Section 15.5 follows up with more detail on a few representative
hardware platforms. Finally, Section 15.6 provides some reliable intuitions and useful
rules of thumb.
Quick Quiz 15.1: This chapter has been rewritten since the first edition, and heavily edited
since the second edition. Did memory ordering change all that since 2014, let alone 2021?
One motivation for memory ordering can be seen in the trivial-seeming litmus test in
Listing 15.1 (C-SB+o-o+o-o.litmus), which at first glance might appear to guarantee
that the exists clause never triggers.1 After all, if 0:r2=0 as shown in the exists
clause,2 we might hope that Thread P0()’s load from x1 into r2 must have happened
before Thread P1()’s store to x1, which might raise further hopes that Thread P1()’s
load from x0 into r2 must happen after Thread P0()’s store to x0, so that 1:r2=2, thus
never triggering the exists clause. The example is symmetric, so similar reasoning
might lead us to hope that 1:r2=0 guarantees that 0:r2=2. Unfortunately, the lack
of memory barriers dashes these hopes. The CPU is within its rights to reorder the
statements within both Thread P0() and Thread P1(), even on relatively strongly
ordered systems such as x86.
Quick Quiz 15.2: The compiler can also reorder Thread P0()’s and Thread P1()’s memory
accesses in Listing 15.1, right?
This willingness to reorder can be confirmed using tools such as litmus7 [AMT14],
which found that the counter-intuitive ordering happened 314 times out of 100,000,000
trials on an x86 laptop. Oddly enough, the perfectly legal outcome where both loads
return the value 2 occurred less frequently, in this case, only 167 times.3 The lesson here
is clear: Increased counter-intuitivity does not necessarily imply decreased probability!
The following sections show exactly how this intuition breaks down, and then put
forward some mental models of memory ordering that can help you avoid these pitfalls.
Section 15.1.1 gives a brief overview of why hardware misorders memory accesses,
and then Section 15.1.2 gives an equally brief overview of how you can thwart such
misordering. Finally, Section 15.1.3 lists some basic rules of thumb, which will be
further refined in later sections. These sections focus on hardware reordering, but rest
assured that compilers reorder much more aggressively than hardware ever dreamed of
doing. Compiler-induced reordering will be taken up in Section 15.3.
1 Purists would instead insist that the exists clause is never satisfied, but we use “trigger”
the system is loaded, and much else besides. So why not try it out on your own system?
[Figure 15.1: simplified block diagram of a modern CPU]
But why does memory misordering happen in the first place? Can’t CPUs keep track of
ordering on their own? Isn’t that why we have computers in the first place, to keep track
of things?
Many people do indeed expect their computers to keep track of things, but many also
insist that they keep track of things quickly. In fact, so intense is the focus on performance
that modern CPUs are extremely complex, as can be seen in the simplified block diagram
in Figure 15.1. Those needing to squeeze the last few percent of performance from their
systems will in turn need to pay close attention to the fine details of this figure when
tuning their software. Except that this close attention to detail means that when a given
CPU degrades with age, the software will no longer run quickly on it. For example, if
the leftmost ALU fails, software tuned to take full advantage of all of the ALUs might
well run more slowly than untuned software. One solution to this problem is to take
systems out of service as soon as any of their CPUs start degrading.
Another option is to recall the lessons of Chapter 3, especially the lesson that for
many important workloads, main memory cannot keep up with modern CPUs, which
can execute hundreds of instructions in the time required to fetch a single variable from
memory. For such workloads, the detailed internal structure of the CPU is irrelevant,
and the CPU can instead be approximated by the blue shapes in Figure 15.2 labeled
CPU, store buffer, and cache.
[Figure 15.3: two CPUs, each with its own store buffer and cache, sharing memory]
Table 15.1:
    | CPU 0                                    | CPU 1
Row | Instruction       | Store Buffer | Cache | Instruction       | Store Buffer | Cache
1   | (Initial state)   |              | x1==0 | (Initial state)   |              | x0==0
2   | x0 = 2;           | x0==2        | x1==0 | x1 = 2;           | x1==2        | x0==0
3   | r2 = x1; (0)      | x0==2        | x1==0 | r2 = x0; (0)      | x1==2        | x0==0
4   | (Read-invalidate) | x0==2        | x0==0 | (Read-invalidate) | x1==2        | x1==0
5   | (Finish store)    |              | x0==2 | (Finish store)    |              | x1==2
Because of these data-intensive workloads, CPUs sport increasingly large caches,
as was seen back in Figure 3.11, which means that although the first load by a given
CPU from a given variable will result in an expensive cache miss as was discussed in
Section 3.1.6, subsequent repeated loads from that variable by that CPU might execute
very quickly because the initial cache miss will have loaded that variable into that CPU’s
cache.
However, it is also necessary to accommodate frequent concurrent stores from
multiple CPUs to a set of shared variables. In cache-coherent systems, if the caches hold
multiple copies of a given variable, all the copies of that variable must have the same
value. This works extremely well for concurrent loads, but not so well for concurrent
stores: Each store must do something about all copies of the old value (another cache
miss!), which, given the finite speed of light and the atomic nature of matter, will be
slower than impatient software hackers would like. And these strings of stores are the
reason for the blue block labeled store buffer in Figure 15.2.
Removing the internal CPU complexity from Figure 15.2, adding a second CPU, and
showing main memory results in Figure 15.3. When a given CPU stores to a variable
not present in that CPU’s cache, then the new value is instead placed in that CPU’s store
buffer. The CPU can then proceed immediately, without having to wait for the store to
do something about all the old values of that variable residing in other CPUs’ caches.
Although store buffers can greatly increase performance, they can cause instructions
and memory references to execute out of order, which can in turn cause serious confusion,
as fancifully illustrated in Figure 15.4.
[Figure 15.4: cartoon reading “Look! I can do things out of order.”]
In particular, store buffers cause the memory misordering illustrated by Listing 15.1.
Table 15.1 shows the steps leading to this misordering. Row 1 shows the initial state,
where CPU 0 has x1 in its cache and CPU 1 has x0 in its cache, both variables having a
value of zero. Row 2 shows the state change due to each CPU’s store (lines 9 and 17 of
Listing 15.1). Because neither CPU has the stored-to variable in its cache, both CPUs
record their stores in their respective store buffers.
Quick Quiz 15.3: But wait!!! On row 2 of Table 15.1 both x0 and x1 each have two values at
the same time, namely zero and two. How can that possibly work???
Row 3 shows the two loads (lines 10 and 18 of Listing 15.1). Because the variable
being loaded by each CPU is in that CPU’s cache, each load immediately returns the
cached value, which in both cases is zero.
But the CPUs are not done yet: Sooner or later, they must empty their store buffers.
Because caches move data around in relatively large blocks called cachelines, and
because each cacheline can hold several variables, each CPU must get the cacheline into
its own cache so that it can update the portion of that cacheline corresponding to the
variable in its store buffer, but without disturbing any other part of the cacheline. Each
CPU must also ensure that the cacheline is not present in any other CPU’s cache, for
which a read-invalidate operation is used. As shown on row 4, after both read-invalidate
operations complete, the two CPUs have traded cachelines, so that CPU 0’s cache now
contains x0 and CPU 1’s cache now contains x1. Once these two variables are in
their new homes, each CPU can flush its store buffer into the corresponding cacheline,
leaving each variable with its final value as shown on row 5.
Quick Quiz 15.4: But don’t the values also need to be flushed from the cache to main memory?
In summary, store buffers are needed to allow CPUs to handle store instructions
efficiently, but they can result in counter-intuitive memory misordering.
But what do you do if your algorithm really needs its memory references to be
ordered? For example, suppose that you are communicating with a driver using a pair of
flags, one that says whether or not the driver is running and the other that says whether
there is a request pending for that driver. The requester needs to set the request-pending
flag, then check the driver-running flag, and if false, wake the driver. Once the driver has
serviced all the pending requests that it knows about, it needs to clear its driver-running
flag, then check the request-pending flag to see if it needs to restart. This very reasonable
approach cannot work unless there is some way to make sure that the hardware processes
the stores and loads in order. This is the subject of the next section.
    | CPU 0                                    | CPU 1
Row | Instruction       | Store Buffer | Cache | Instruction       | Store Buffer | Cache
1   | (Initial state)   |              | x1==0 | (Initial state)   |              | x0==0
2   | x0 = 2;           | x0==2        | x1==0 | x1 = 2;           | x1==2        | x0==0
3   | smp_mb();         | x0==2        | x1==0 | smp_mb();         | x1==2        | x0==0
4   | (Read-invalidate) | x0==2        | x0==0 | (Read-invalidate) | x1==2        | x1==0
5   | (Finish store)    |              | x0==2 | (Finish store)    |              | x1==2
6   | r2 = x1; (2)      |              | x1==2 | r2 = x0; (2)      |              | x0==2
one value on row 2, however, as promised earlier, the smp_mb() invocations straighten
things out in the end.
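The driver-running and request-pending flags described earlier follow this same store-then-load pattern on both sides, so each side needs a full barrier between its store and its load. The following userspace sketch uses C11's atomic_thread_fence(memory_order_seq_cst) as a rough stand-in for smp_mb(). The function names post_request() and driver_try_stop() are hypothetical, and the wakeup mechanism itself is omitted.

```c
#include <stdatomic.h>
#include <stdbool.h>

static atomic_bool request_pending;
static atomic_bool driver_running;

/* Requester: announce the request, then check on the driver.
 * Returns true if the caller must wake the driver. */
static bool post_request(void)
{
	atomic_store_explicit(&request_pending, true, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);  /* order the store before the load */
	return !atomic_load_explicit(&driver_running, memory_order_relaxed);
}

/* Driver: called after servicing every request it knows about.
 * Returns true if a request slipped in and the driver must restart. */
static bool driver_try_stop(void)
{
	atomic_store_explicit(&driver_running, false, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);  /* order the store before the load */
	return atomic_load_explicit(&request_pending, memory_order_relaxed);
}
```

With both fences in place, it cannot happen that the requester sees the driver still running while the driver sees no pending request, so at least one of the two will act on the request. Removing either fence reopens the window illustrated by the store-buffering example.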
Although full barriers such as smp_mb() have extremely strong ordering guarantees,
their strength comes at a high price in terms of forgone hardware and compiler
optimizations. A great many situations can be handled with much weaker ordering
guarantees that use much cheaper memory-ordering instructions, or, in some cases, no
memory-ordering instructions at all.
Table 15.3 provides a cheatsheet of the Linux kernel’s ordering primitives and their
guarantees. Each row corresponds to a primitive or category of primitives that might
or might not provide ordering, with the columns labeled “Prior Ordered Operation”
and “Subsequent Ordered Operation” being the operations that might (or might not) be
ordered against. Cells containing “Y” indicate that ordering is supplied unconditionally,
while other characters indicate that ordering is supplied only partially or conditionally.
Blank cells indicate that no ordering is supplied.
The “Store” row also covers the store portion of an atomic RMW operation. In
addition, the “Load” row covers the load component of a successful value-returning
_relaxed() RMW atomic operation, although the combined “_relaxed() RMW
operation” line provides a convenient combined reference in the value-returning case. A
CPU executing unsuccessful value-returning atomic RMW operations must invalidate
the corresponding variable from all other CPUs’ caches. Therefore, unsuccessful value-
returning atomic RMW operations have many of the properties of a store, which means
that the “_relaxed() RMW operation” line also applies to unsuccessful value-returning
atomic RMW operations.
The *_acquire row covers smp_load_acquire(), cmpxchg_acquire(),
xchg_acquire(), and so on; the *_release row covers smp_store_release(),
rcu_assign_pointer(), cmpxchg_release(), xchg_release(), and so on; and
the “Successful full-strength non-void RMW” row covers atomic_add_return(),
atomic_add_unless(), atomic_dec_and_test(), cmpxchg(), xchg(), and so
on. The “Successful” qualifiers apply to primitives such as atomic_add_unless(),
cmpxchg_acquire(), and cmpxchg_release(), which have no effect on either memory
or ordering when they indicate failure, as indicated by the earlier “_relaxed()
RMW operation” row.
Column “C” indicates cumulativity and propagation, as explained in Sections 15.2.7.1
and 15.2.7.2. In the meantime, this column can usually be ignored when there are at
most two threads involved.
Quick Quiz 15.5: The rows in Table 15.3 seem quite random and confused. Whatever is the
conceptual basis of this table???
Quick Quiz 15.6: Why is Table 15.3 missing smp_mb__after_unlock_lock() and
smp_mb__after_spinlock()?
It is important to note that this table is just a cheat sheet, and is therefore in no way a
replacement for a good understanding of memory ordering. To begin building such an
understanding, the next section will present some basic rules of thumb.
[Figure 15.5: CPU 0 executes memory reference X0, a memory barrier, and memory reference Y0; CPU 1 executes memory reference Y1, a memory barrier, and memory reference X1. Given Y0 before Y1, the memory barriers guarantee X0 before X1.]
A given thread sees its own accesses in order. This rule assumes that loads and
stores from/to shared variables use READ_ONCE() and WRITE_ONCE(), respectively.
Otherwise, the compiler can profoundly scramble4 your code, and sometimes the CPU
can do a bit of scrambling as well, as discussed in Section 15.5.4.
Interrupts and signal handlers are part of a thread. Both interrupt and signal
handlers happen between a pair of adjacent instructions in a thread. This means that a
given handler appears to execute atomically from the viewpoint of the interrupted thread,
at least at the assembly-language level. However, the C and C++ languages do not define
the results of handlers and interrupted threads sharing plain variables. Instead, such
shared variables must be sig_atomic_t, lock-free atomics, or volatile.
On the other hand, because the handler executes within the interrupted thread’s
context, the memory ordering used to synchronize communication between the handler
and the thread can be extremely lightweight. For example, the counterpart of an acquire
load is a READ_ONCE() followed by a barrier() compiler directive and the counterpart
of a release store is a barrier() followed by a WRITE_ONCE(). The counterpart of a
full memory barrier is barrier(). Finally, disabling interrupts or signals (as the case
may be) within the thread excludes handlers.
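The lightweight handler-to-thread ordering described above might look as follows in userspace. This is a sketch under stated assumptions: READ_ONCE(), WRITE_ONCE(), and barrier() are Linux-kernel primitives, so the macros below are userspace stand-ins that rely on the GCC/Clang __typeof__ extension and on C11's atomic_signal_fence(). The names msg, ready, and poll_msg() are hypothetical.

```c
#include <signal.h>
#include <stdatomic.h>

/* Userspace stand-ins for the kernel primitives (assumptions of this sketch). */
#define READ_ONCE(x)     (*(volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, v) (*(volatile __typeof__(x) *)&(x) = (v))
#define barrier()        atomic_signal_fence(memory_order_seq_cst)

static volatile sig_atomic_t msg;    /* payload written by the handler */
static volatile sig_atomic_t ready;  /* nonzero once the payload is valid */

static void handler(int sig)
{
	(void)sig;
	WRITE_ONCE(msg, 42);
	barrier();               /* counterpart of a release store: */
	WRITE_ONCE(ready, 1);    /* barrier() followed by WRITE_ONCE() */
}

/* Returns the payload, or -1 if the handler has not yet run. */
static int poll_msg(void)
{
	if (!READ_ONCE(ready))   /* counterpart of an acquire load: */
		return -1;
	barrier();               /* READ_ONCE() followed by barrier() */
	return READ_ONCE(msg);
}
```

No hardware memory-barrier instruction is needed because the handler runs in the context of the thread that it interrupts. Only the compiler must be constrained.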
Ordering has conditional if-then semantics. Figure 15.5 illustrates this for memory
barriers. Assuming that both memory barriers are strong enough, if CPU 1’s access Y1
happens after CPU 0’s access Y0, then CPU 1’s access X1 is guaranteed to happen after
CPU 0’s access X0. When in doubt as to which memory barriers are strong enough,
smp_mb() will always do the job, albeit at a price.
Quick Quiz 15.8: How can you tell which memory barriers are strong enough for a given use
case?
4 Many compiler writers prefer the word “optimize”.
Listing 15.2 is a case in point. The smp_mb() invocations on lines 10 and 19 serve as the barriers,
the store to x0 on line 9 as X0, the load from x1 on line 11 as Y0, the store to x1 on
line 18 as Y1, and the load from x0 on line 20 as X1. Applying the if-then rule step by
step, we know that the store to x1 on line 18 happens after the load from x1 on line 11
if P0()’s local variable r2 is set to the value zero. The if-then rule would then state
that the load from x0 on line 20 happens after the store to x0 on line 9. In other words,
P1()’s local variable r2 is guaranteed to end up with the value two only if P0()’s
local variable r2 ends up with the value zero. This underscores the point that memory
ordering guarantees are conditional, not absolute.
Although Figure 15.5 specifically mentions memory barriers, this same if-then rule
applies to the rest of the Linux kernel’s ordering operations.
Ordering operations must be paired. If you carefully order the operations in one
thread, but then fail to do so in another thread, then there is no ordering. Both threads
must provide ordering for the if-then rule to apply.5
Ordering operations almost never speed things up. If you find yourself tempted to
add a memory barrier in an attempt to force a prior store to be flushed to memory faster,
resist! Adding ordering usually slows things down. Of course, there are situations
where adding instructions speeds things up, as was shown by Figure 9.22 on page 254,
but careful benchmarking is required in such cases. And even then, it is quite possible
that although you sped things up a little bit on your system, you might well have slowed
things down significantly on your users’ systems. Or on your future system.
Ordering operations are not magic. When your program is failing due to some
race condition, it is often tempting to toss in a few memory-ordering operations in an
attempt to barrier your bugs out of existence. A far better reaction is to use higher-level
primitives in a carefully designed manner. With concurrent programming, it is almost
always better to design your bugs out of existence than to hack them down to lower
probabilities.
These are only rough rules of thumb. Although these rules of thumb cover the vast
majority of situations seen in actual practice, as with any set of rules of thumb, they do
have their limits. The next section will demonstrate some of these limits by introducing
trick-and-trap litmus tests that are intended to insult your intuition while increasing your
understanding. These litmus tests will also illuminate many of the concepts represented
by the Linux-kernel memory-ordering cheat sheet shown in Table 15.3, and can be
automatically analyzed given proper tooling [AMM+18]. Section 15.6 will circle back
to this cheat sheet, presenting a more sophisticated set of rules of thumb in light of
learnings from all the intervening tricks and traps.
Quick Quiz 15.9: Wait!!! Where do I find this tooling that automatically analyzes litmus
tests???
Now that you know that hardware can reorder memory accesses and that you can prevent
it from doing so, the next step is to get you to admit that your intuition has a problem.
This painful task is taken up by Section 15.2.1, which presents some code demonstrating
that scalar variables can take on multiple values simultaneously, and by Sections 15.2.2
through 15.2.7, which show a series of intuitively correct code fragments that fail
miserably on real hardware. Once your intuition has made it through the grieving
process, later sections will summarize the basic rules that memory ordering follows.
But first, let’s take a quick look at just how many values a single variable might have
at a single point in time.
Upon exit from the loop, firsttb will hold a timestamp taken shortly after the
assignment and lasttb will hold a timestamp taken before the last sampling of the
shared variable that still retained the assigned value, or a value equal to firsttb if
the shared variable had changed before entry into the loop. This allows us to plot each
CPU’s view of the value of state.variable over a 532-nanosecond time period, as
shown in Figure 15.6. This data was collected in 2006 on a 1.5 GHz POWER5 system
with 8 cores, each containing a pair of hardware threads. CPUs 1, 2, 3, and 4 recorded
the values, while CPU 0 controlled the test. The timebase counter period was about
5.32 ns, sufficiently fine-grained to allow observations of intermediate cache states.
[Figure 15.6: values of the variable seen over time by each CPU: CPU 1 sees 1 then 2, CPU 2 sees 2, CPU 3 sees 3 then 2, and CPU 4 sees 4 then 2]
Each horizontal bar represents the observations of a given CPU over time, with
the gray regions to the left indicating the time before the corresponding CPU’s first
measurement. During the first 5 ns, only CPU 3 has an opinion about the value of the
variable. During the next 10 ns, CPUs 2 and 3 disagree on the value of the variable,
but thereafter agree that the value is “2”, which is in fact the final agreed-upon value.
However, CPU 1 believes that the value is “1” for almost 300 ns, and CPU 4 believes
that the value is “4” for almost 500 ns.
Quick Quiz 15.11: How could CPUs possibly have different views of the value of a single
variable at the same time?
Quick Quiz 15.12: Why do CPUs 2 and 3 come to agreement so quickly, when it takes so
long for CPUs 1 and 4 to come to the party?
And if you think that the situation with four CPUs was intriguing, consider Figure 15.7,
which shows the same situation, but with 15 CPUs each assigning their number to a
single shared variable at time 𝑡 = 0. Both diagrams in the figure are drawn in the same
way as Figure 15.6. The only difference is that the unit of the horizontal axis is timebase
ticks, with each tick lasting about 5.3 nanoseconds. The entire sequence therefore lasts
a bit longer than the events recorded in Figure 15.6, consistent with the increase in
the number of CPUs. The upper diagram shows the overall picture, while the lower one
zooms in on the first 50 timebase ticks. Again, CPU 0 coordinates the test, so does not
record any values.
All CPUs eventually agree on the final value of 9, but not before the values 15 and 12
take early leads. Note that there are fourteen different opinions on the variable’s value at
time 21 indicated by the vertical line in the lower diagram. Note also that all CPUs see
sequences whose orderings are consistent with the directed graph shown in Figure 15.8.
Nevertheless, these figures underscore the importance of proper use of memory-ordering
operations.
How many values can a single variable take on at a single point in time? As many as
one per store buffer in the system! We have therefore entered a regime where we must
bid a fond farewell to comfortable intuitions about values of variables and the passage
of time. This is the regime where memory-ordering operations are needed.
But remember well the lessons from Chapters 3 and 6. Having all CPUs store
concurrently to the same variable is no way to design a parallel program, at least not if
performance and scalability are at all important to you.
Unfortunately, memory ordering has many other ways of insulting your intuition,
and not all of these ways conflict with performance and scalability. The next section
gives an overview of the reordering of unrelated memory references.
[Figure 15.7: values of the variable seen over time by CPUs 1 through 15, in timebase ticks; the upper diagram spans 0 to 500 ticks and the lower diagram zooms in on the first 50 ticks]
[Figure 15.8: directed graph of the orderings of values seen by the CPUs]
One rationale for reordering loads from different locations is that doing so allows
execution to proceed when an earlier load misses the cache, but the values for later loads
are already present.
Quick Quiz 15.13: But why make load-load reordering visible to the user? Why not just use
speculative execution to allow execution to proceed in the common case where there are no
intervening stores, in which case the reordering cannot be visible anyway?
Thus, portable code relying on ordered loads must add explicit ordering, for example,
the smp_rmb() shown on line 16 of Listing 15.5 (C-MP+o-wmb-o+o-rmb-o.litmus),
which prevents the exists clause from triggering.
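For readers working in userspace, the same message-passing pattern can be sketched with C11 atomics. This is an illustration only, not the litmus test from the listing: the names payload, flag, mp_writer(), mp_reader(), and mp_run_once() are hypothetical, and the release and acquire fences are assumed stand-ins for smp_wmb() and smp_rmb() (they provide somewhat more ordering than the kernel primitives do).

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

static atomic_int payload;	/* written first by the writer */
static atomic_int flag;		/* written second by the writer */

struct mp_regs {
	int r_flag;
	int r_payload;
};

static void *mp_writer(void *arg)
{
	(void)arg;
	atomic_store_explicit(&payload, 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_release);	/* assumed analog of smp_wmb() */
	atomic_store_explicit(&flag, 1, memory_order_relaxed);
	return NULL;
}

static void *mp_reader(void *arg)
{
	struct mp_regs *r = arg;

	r->r_flag = atomic_load_explicit(&flag, memory_order_relaxed);
	atomic_thread_fence(memory_order_acquire);	/* assumed analog of smp_rmb() */
	r->r_payload = atomic_load_explicit(&payload, memory_order_relaxed);
	return NULL;
}

/* Run both threads once; return true if the reader saw the flag but not the payload. */
static bool mp_run_once(void)
{
	pthread_t writer, reader;
	struct mp_regs regs = { 0, 0 };
	int rc;

	atomic_store(&payload, 0);
	atomic_store(&flag, 0);
	rc = pthread_create(&writer, NULL, mp_writer, NULL);
	assert(rc == 0);
	rc = pthread_create(&reader, NULL, mp_reader, &regs);
	assert(rc == 0);
	pthread_join(writer, NULL);
	pthread_join(reader, NULL);
	return regs.r_flag == 1 && regs.r_payload == 0;
}
```

Calling mp_run_once() in a loop should never return true. With either fence removed, the C11 memory model permits that outcome, although whether it is ever observed depends on the compiler and the hardware.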
allow such reordering [AMP+11]. Therefore, the exists clause on line 21 really can
trigger.
Although it is rare for actual hardware to exhibit this reordering [Mar17], one situation
where it might be desirable to do so is when a load misses the cache, the store buffer
is nearly full, and the cacheline for a subsequent store is ready at hand. Therefore,
portable code must enforce any required ordering, for example, as shown in Listing 15.7
(C-LB+o-r+a-o.litmus). The smp_store_release() and smp_load_acquire()
guarantee that the exists clause on line 21 never triggers.
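The same load-buffering shape can be sketched in userspace C11, under the assumption that memory_order_release stores and memory_order_acquire loads are reasonable stand-ins for smp_store_release() and smp_load_acquire(), and that relaxed atomics stand in for READ_ONCE() and WRITE_ONCE(). The variable and function names below are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

static atomic_int a;
static atomic_int b;
static int lb_r0;	/* thread 0's loaded value, read only after pthread_join() */
static int lb_r1;	/* thread 1's loaded value, read only after pthread_join() */

static void *lb_thread0(void *arg)
{
	(void)arg;
	lb_r0 = atomic_load_explicit(&b, memory_order_relaxed);
	atomic_store_explicit(&a, 1, memory_order_release);	/* orders the prior load */
	return NULL;
}

static void *lb_thread1(void *arg)
{
	(void)arg;
	lb_r1 = atomic_load_explicit(&a, memory_order_acquire);	/* orders the later store */
	atomic_store_explicit(&b, 1, memory_order_relaxed);
	return NULL;
}

/* Run both threads once; return true if each load saw the other thread's store. */
static bool lb_run_once(void)
{
	pthread_t t0, t1;
	int rc;

	atomic_store(&a, 0);
	atomic_store(&b, 0);
	rc = pthread_create(&t0, NULL, lb_thread0, NULL);
	assert(rc == 0);
	rc = pthread_create(&t1, NULL, lb_thread1, NULL);
	assert(rc == 0);
	pthread_join(t0, NULL);
	pthread_join(t1, NULL);
	return lb_r0 == 1 && lb_r1 == 1;
}
```

If thread 1's acquire load returns 1, then thread 0's load from b happens before thread 1's store to b, so that load cannot return 1. The function should therefore never return true.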
Listing 15.9: Message-Passing Address-Dependency Litmus Test (No Ordering Before v4.15)
1 C C-MP+o-wmb-o+o-ad-o
2
3 {
4 y=1;
5 x1=y;
6 }
7
8 P0(int* x0, int** x1) {
9 WRITE_ONCE(*x0, 2);
10 smp_wmb();
11 WRITE_ONCE(*x1, x0);
12 }
13
14 P1(int** x1) {
15 int *r2;
16 int r3;
17
18 r2 = READ_ONCE(*x1);
19 r3 = READ_ONCE(*r2);
20 }
21
22 exists (1:r2=x0 /\ 1:r3=1)
allowing stores to complete out of order would allow execution to proceed. Therefore,
portable code must explicitly order the stores, for example, as shown in Listing 15.5,
thus preventing the exists clause from triggering.
Quick Quiz 15.14: Why should strongly ordered systems pay the performance price of
unnecessary smp_rmb() and smp_wmb() invocations? Shouldn’t weakly ordered systems
shoulder the full cost of their misordering choices???
7 Note that lockless_dereference() is not needed on v4.15 and later, and therefore
is not available in these later Linux kernels. Nor is it needed in versions of this book containing
this footnote.
Quick Quiz 15.15: But how do we know that all platforms really avoid triggering the exists
clauses in Listings 15.10 and 15.11?
Quick Quiz 15.16: Why the use of smp_wmb() in Listings 15.10 and 15.11? Wouldn’t
smp_store_release() be a better choice?
Quick Quiz 15.17: SP, MP, LB, and now S. Where do all these litmus-test abbreviations come
from and how can anyone keep track of them?
However, it is important to note that address dependencies can be fragile and easily
broken by compiler optimizations, as discussed in Section 15.3.2.
know that the result was zero, and could therefore substitute the constant zero for the
value loaded, thus breaking the dependency.
Quick Quiz 15.18: But wait!!! Line 17 of Listing 15.12 uses READ_ONCE(), which marks
the load as volatile, which means that the compiler absolutely must emit the load instruction
even if the value is later multiplied by zero. So how can the compiler possibly break this data
dependency?
In short, you can rely on data dependencies only if you prevent the compiler from
breaking them.
Quick Quiz 15.19: Wouldn’t control dependencies be more robust if they were mandated by
language standards???
8 Recall that SC stands for sequentially consistent.
9 There is reason to believe that using atomic RMW operations (for example, xchg())
for all the stores will provide sequentially consistent ordering, but this has not yet been proven
either way.
Adding more variables and threads increases the scope for reordering and other
counter-intuitive behavior, as discussed in the next section.
requirement that all CPUs agree on the order of all stores.10 This means that if only a
subset of CPUs are doing stores, the other CPUs will agree on the order of stores, hence
the “other” in “other-multicopy atomicity”. Unlike multicopy-atomic platforms, within
other-multicopy-atomic platforms, the CPU doing the store is permitted to observe its
store early, which allows its later loads to obtain the newly stored value directly from
the store buffer, which improves performance.
Quick Quiz 15.21: Can you give a specific example showing different behavior for multicopy
atomic on the one hand and other-multicopy atomic on the other?
Perhaps there will come a day when all platforms provide some flavor of multicopy
atomicity, but in the meantime, non-multicopy-atomic platforms do exist, and so software
must deal with them.
Listing 15.16 (C-WRC+o+o-data-o+o-rmb-o.litmus) demonstrates multicopy
atomicity, that is, on a multicopy-atomic platform, the exists clause on line 28 cannot
trigger. In contrast, on a non-multicopy-atomic platform this exists clause can trigger,
despite P1()’s accesses being ordered by a data dependency and P2()’s accesses being
ordered by an smp_rmb(). Recall that the definition of multicopy atomicity requires
that all threads agree on the order of stores, which can be thought of as all stores reaching
all threads at the same time. Therefore, a non-multicopy-atomic platform can have a
store reach different threads at different times. In particular, P0()’s store might reach
P1() long before it reaches P2(), which raises the possibility that P1()’s store might
reach P2() before P0()’s store does.
10 As of early 2021, Armv8 and x86 provide other-multicopy atomicity, IBM mainframe
provides full multicopy atomicity, and PPC provides no multicopy atomicity at all. More
detail is shown in Table 15.5 on page 550.
This leads to the question of why a real system constrained by the usual laws of
physics would ever trigger the exists clause of Listing 15.16. A cartoonish diagram
of such a real system is shown in Figure 15.10. CPU 0 and CPU 1 share a store buffer,
as do CPUs 2 and 3. This means that CPU 1 can load a value out of the store buffer,
thus potentially immediately seeing a value stored by CPU 0. In contrast, CPUs 2 and 3
will have to wait for the corresponding cache line to carry this new value to them.
[Figure 15.10: CPUs 0 and 1 share one store buffer and one cache, CPUs 2 and 3 share another store buffer and cache, and the caches connect to memory.]
Quick Quiz 15.22: Then who would even think of designing a system with shared store
buffers???
Table 15.4 shows one sequence of events that can result in the exists clause in
Listing 15.16 triggering. This sequence of events will depend critically on P0() and
P1() sharing both cache and a store buffer in the manner shown in Figure 15.10.
Quick Quiz 15.23: But just how is it fair that P0() and P1() must share a store buffer and a
cache, but P2() gets one each of its very own???
Row 1 shows the initial state, with the initial value of y in P0()’s and P1()’s shared
cache, and the initial value of x in P2()’s cache.
Row 2 shows the immediate effect of P0() executing its store on line 7. Because the
cacheline containing x is not in P0()’s and P1()’s shared cache, the new value (1) is
stored in the shared store buffer.
Row 3 shows two transitions. First, P0() issues a read-invalidate operation to fetch
the cacheline containing x so that it can flush the new value for x out of the shared store
buffer. Second, P1() loads from x (line 14), an operation that completes immediately
because the new value of x is immediately available from the shared store buffer.
Row 4 also shows two transitions. First, it shows the immediate effect of P1()
executing its store to y (line 15), placing the new value into the shared store buffer.
Second, it shows the start of P2()’s load from y (line 23).
Row 5 continues the tradition of showing two transitions. First, it shows P1()
complete its store to y, flushing from the shared store buffer to the cache. Second, it
shows P2() request the cacheline containing y.
Row 6 shows P2() receive the cacheline containing y, allowing it to finish its load
into r2, which takes on the value 1.
Row 7 shows P2() execute its smp_rmb() (line 24), thus keeping its two loads
ordered.
Row 8 shows P2() execute its load from x, which immediately returns with the value
zero from P2()’s cache.
Row 9 shows P2() finally responding to P0()’s request for the cacheline containing
x, which was made way back up on row 3.
Finally, row 10 shows P0() finish its store, flushing its value of x from the shared
store buffer to the shared cache.
Note well that the exists clause on line 28 has triggered. The values of r1 and r2
are both one, and the final value of r3 is zero. This strange result
occurred because P0()’s new value of x was communicated to P1() long before it was
communicated to P2().
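The sequence of events in Table 15.4 can be replayed in a toy single-threaded model. The sketch below is deliberately cartoonish and rests on several assumptions of its own: there are exactly two cores, each cacheline lives in exactly one of the two caches at a time, every store first lands in its core's store buffer, and smp_rmb() is modeled simply by program order. All names (struct core, sim_store(), sim_flush(), sim_load(), replay_wrc()) are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

enum { VAR_X, VAR_Y, NVARS };
enum { CORE_01, CORE_2, NCORES };	/* CORE_01 hosts P0() and P1(); CORE_2 hosts P2() */

struct core {
	bool sb_valid[NVARS];	/* store-buffer entry present? */
	int sb_val[NVARS];
	bool line_valid[NVARS];	/* cacheline present in this core's cache? */
	int line_val[NVARS];
};

static struct core cores[NCORES];

/* A store lands in the core's store buffer, visible only to that core's CPUs. */
static void sim_store(int c, int var, int val)
{
	assert(c >= 0 && c < NCORES && var >= 0 && var < NVARS);
	cores[c].sb_valid[var] = true;
	cores[c].sb_val[var] = val;
}

/* Flush: take the cacheline away from the other core, then write the buffered value. */
static void sim_flush(int c, int var)
{
	if (!cores[c].sb_valid[var])
		return;
	cores[1 - c].line_valid[var] = false;
	cores[c].line_valid[var] = true;
	cores[c].line_val[var] = cores[c].sb_val[var];
	cores[c].sb_valid[var] = false;
}

/* A load checks the local store buffer, then the local cache, then the other cache. */
static int sim_load(int c, int var)
{
	if (cores[c].sb_valid[var])
		return cores[c].sb_val[var];
	if (!cores[c].line_valid[var]) {
		cores[c].line_val[var] = cores[1 - c].line_val[var];
		cores[c].line_valid[var] = true;
		cores[1 - c].line_valid[var] = false;
	}
	return cores[c].line_val[var];
}

struct wrc_regs {
	int r1, r2, r3;
};

/* Replay the rows of the table in order and return the three loaded values. */
static struct wrc_regs replay_wrc(void)
{
	struct wrc_regs r;

	memset(cores, 0, sizeof(cores));	/* Row 1: x == 0 and y == 0 */
	cores[CORE_01].line_valid[VAR_Y] = true;	/* y starts in the shared cache */
	cores[CORE_2].line_valid[VAR_X] = true;	/* x starts in P2()'s cache */

	sim_store(CORE_01, VAR_X, 1);		/* Row 2: P0() stores to x */
	r.r1 = sim_load(CORE_01, VAR_X);	/* Row 3: P1() loads x from the store buffer */
	sim_store(CORE_01, VAR_Y, r.r1);	/* Row 4: P1() stores to y */
	sim_flush(CORE_01, VAR_Y);		/* Row 5: y reaches the shared cache */
	r.r2 = sim_load(CORE_2, VAR_Y);		/* Row 6: P2() receives y's cacheline */
	/* Row 7: smp_rmb(), modeled here by program order */
	r.r3 = sim_load(CORE_2, VAR_X);		/* Row 8: P2() loads stale x from its cache */
	sim_flush(CORE_01, VAR_X);		/* Rows 9 and 10: P0()'s store finally completes */
	return r;
}
```

In this model, replay_wrc() yields r1 and r2 equal to one and r3 equal to zero, and a later sim_load(CORE_2, VAR_X) sees the value one. The model illustrates only the bookkeeping; it says nothing about how likely such an interleaving is on real hardware.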
Quick Quiz 15.24: Referring to Table 15.4, why on earth would P0()’s store take so long to
complete when P1()’s store completes so quickly? In other words, does the exists clause on
line 28 of Listing 15.16 really trigger on real systems?
15.2.7.1 Cumulativity
The three-thread example shown in Listing 15.16 requires cumulative ordering, or
cumulativity. A cumulative memory-ordering operation orders not just any given access
preceding it, but also earlier accesses by any thread to that same variable.
Dependencies do not provide cumulativity, which is why the “C” column is blank for
the READ_ONCE() row of Table 15.3 on page 492. However, as indicated by the “C” in
their “C” column, release operations do provide cumulativity. Therefore, Listing 15.17
(C-WRC+o+o-r+a-o.litmus) substitutes a release operation for Listing 15.16’s data
dependency. Because the release operation is cumulative, its ordering applies not only
to Listing 15.17’s load from x by P1() on line 14, but also to the store to x by P0()
on line 7—but only if that load returns the value stored, which matches the 1:r1=1 in
the exists clause on line 27. This means that P2()’s load-acquire suffices to force
the load from x on line 24 to happen after the store on line 7, so the value returned is
one, which does not match 2:r3=0, which in turn prevents the exists clause from
triggering.
These ordering constraints are depicted graphically in Figure 15.11. Note also that
cumulativity is not limited to a single step back in time. If there were another load from
x or store to x from any thread that came before the store on line 7, that prior load or
store would also be ordered before the load on line 24, though only if r1 and r2
both end up containing the value 1.
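A userspace approximation of the release-acquire version of this three-thread test can be written with C11 atomics. The sketch assumes that a memory_order_release store and a memory_order_acquire load are adequate stand-ins for smp_store_release() and smp_load_acquire(), with relaxed atomics in place of READ_ONCE() and WRITE_ONCE(). The function names are hypothetical, and the code is not a transcription of the litmus test.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

static atomic_int x;
static atomic_int y;
static int r1, r2, r3;	/* each written by one thread, read only after pthread_join() */

static void *wrc_p0(void *arg)
{
	(void)arg;
	atomic_store_explicit(&x, 1, memory_order_relaxed);
	return NULL;
}

static void *wrc_p1(void *arg)
{
	(void)arg;
	r1 = atomic_load_explicit(&x, memory_order_relaxed);
	atomic_store_explicit(&y, r1, memory_order_release);	/* release store of y=r1 */
	return NULL;
}

static void *wrc_p2(void *arg)
{
	(void)arg;
	r2 = atomic_load_explicit(&y, memory_order_acquire);	/* acquire load of y */
	r3 = atomic_load_explicit(&x, memory_order_relaxed);
	return NULL;
}

/* Run the three threads once; return true if r1 == 1, r2 == 1, and r3 == 0. */
static bool wrc_run_once(void)
{
	void *(*fn[3])(void *) = { wrc_p0, wrc_p1, wrc_p2 };
	pthread_t tid[3];
	int i, rc;

	atomic_store(&x, 0);
	atomic_store(&y, 0);
	for (i = 0; i < 3; i++) {
		rc = pthread_create(&tid[i], NULL, fn[i], NULL);
		assert(rc == 0);
	}
	for (i = 0; i < 3; i++)
		pthread_join(tid[i], NULL);
	return r1 == 1 && r2 == 1 && r3 == 0;
}
```

In C11 terms, when r2 is one the release store synchronizes with the acquire load, so the load of x in wrc_p1() happens before the load of x in wrc_p2(). Read-read coherence then forbids the later load from returning an older value, so wrc_run_once() should never return true.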
[Figure 15.11: CPU 0's store x=1 is read by CPU 1's load r1=x, and CPU 1's release store y=r1 is read by CPU 2's acquire load r2=y, which is followed by CPU 2's load r3=x. Given these two links, and given that memory barriers guarantee the order of the accesses within CPU 1 and within CPU 2, cumulativity guarantees that CPU 0's store is ordered before CPU 1's store.]
15.2.7.2 Propagation
Listing 15.18 (C-W+RWC+o-r+a-o+o-mb-o.litmus) shows the limitations of cumulativity
and store-release, even with a full memory barrier. The problem is that although
the smp_store_release() on line 8 has cumulativity, and although that cumulativity
does order P2()’s load on line 26, the smp_store_release()’s ordering cannot
propagate through the combination of P1()’s load (line 17) and P2()’s store (line 24).
This means that the exists clause on line 29 really can trigger.
Quick Quiz 15.25: But it is not necessary to worry about propagation unless there are at least
three threads in the litmus test, right?
This situation might seem completely counter-intuitive, but keep in mind that the
speed of light is finite and computers are of non-zero size. It therefore takes time for
the effect of P2()’s store to z to propagate to P1(), which in turn means that it is
possible that P1()’s read from z happens much later in time, but nevertheless still sees
the old value of zero. This situation is depicted in Figure 15.12: Just because a load
sees the old value does not mean that this load executed at an earlier time than did the
store of the new value.
Note that Listing 15.18 also shows the limitations of memory-barrier pairing, given
that there are not two but three processes. These more complex litmus tests can instead
be said to have cycles, where memory-barrier pairing is the special case of a two-thread
cycle. The cycle in Listing 15.18 goes through P0() (lines 7 and 8), P1() (lines 16
and 17), P2() (lines 24, 25, and 26), and back to P0() (line 7). The exists clause
delineates this cycle: The 1:r1=1 indicates that the smp_load_acquire() on line 16
returned the value stored by the smp_store_release() on line 8, the 1:r2=0 indicates
that the WRITE_ONCE() on line 24 came too late to affect the value returned by the
READ_ONCE() on line 17, and finally the 2:r3=0 indicates that the WRITE_ONCE() on
line 7 came too late to affect the value returned by the READ_ONCE() on line 26. In
this case, the fact that the exists clause can trigger means that the cycle is said to be
allowed. In contrast, in cases where the exists clause cannot trigger, the cycle is said
to be prohibited.
But what if we need to prohibit the cycle corresponding to the exists clause on line 29
of Listing 15.18? One solution is to replace P0()’s smp_store_release() with an
smp_mb(), which Table 15.3 shows to have not only cumulativity, but also propagation.
The result is shown in Listing 15.19 (C-W+RWC+o-mb-o+a-o+o-mb-o.litmus).
Quick Quiz 15.26: But given that smp_mb() has the propagation property, why doesn’t the
smp_mb() on line 25 of Listing 15.18 prevent the exists clause from triggering?
For completeness, Figure 15.13 shows that the “winning” store among a group of
stores to the same variable is not necessarily the store that started last. This should
not come as a surprise to anyone who carefully examined Figure 15.7 on page 497.
One way to rationalize the counter-temporal properties of both load-to-store and store-
to-store ordering is to clearly distinguish between the temporal order in which the
store instructions executed on the one hand, and the order in which the corresponding
cacheline visited the CPUs that executed those instructions on the other. It is the
cacheline-visitation order that defines the externally visible ordering of the actual
stores. This cacheline-visitation order is not directly visible to the code executing the
store instructions, which results in the counter-intuitive counter-temporal nature of
load-to-store and store-to-store ordering.11
Quick Quiz 15.27: But for litmus tests having only ordered stores, as shown in Listing 15.20
(C-2+2W+o-wmb-o+o-wmb-o.litmus), research shows that the cycle is prohibited, even in
weakly ordered systems such as Arm and Power [SSA+11]. Given that, are store-to-store links really
always counter-temporal???
15.2.7.3 Happens-Before
As shown in Figure 15.14, on platforms without user-visible speculation, if a load
returns the value from a particular store, then, courtesy of the finite speed of light
and the non-zero size of modern computing systems, the store absolutely has to have
executed at an earlier time than did the load. This means that carefully constructed
programs can rely on the passage of time itself as a memory-ordering operation.
CPUs in that same core as soon as the store reached the shared store buffer. As a result, such
systems are non-multicopy atomic.
Of course, just the passage of time by itself is not enough, as was seen in Listing 15.6 on
page 499, which has nothing but store-to-load links and, because it provides absolutely
no ordering, still can trigger its exists clause. However, as long as each thread
provides even the weakest possible ordering, the exists clause would not be able to trigger.
For example, Listing 15.21 (C-LB+a-o+o-data-o+o-data-o.litmus) shows P0()
ordered with an smp_load_acquire() and both P1() and P2() ordered with data
dependencies. These orderings, which are close to the top of Table 15.3, suffice to
prevent the exists clause from triggering.
Quick Quiz 15.28: Can you construct a litmus test like that in Listing 15.21 that uses only
dependencies?
An important use of time for ordering memory accesses is covered in the next section.
A minimal release-acquire chain was shown in Listing 15.7 on page 500, but these chains
can be much longer, as shown in Listing 15.22 (C-LB+a-r+a-r+a-r+a-r.litmus).
The longer the release-acquire chain, the more ordering is gained from the passage
of time, so that no matter how many threads are involved, the corresponding exists
clause cannot trigger.
Although release-acquire chains are inherently store-to-load creatures, it turns out that
they can tolerate one load-to-store step, despite such steps being counter-temporal, as
shown in Figure 15.12 on page 511. For example, Listing 15.23
(C-ISA2+o-r+a-r+a-r+a-o.litmus) shows a three-step release-acquire chain, but where P3()’s final access
is a READ_ONCE() from x0, which is accessed via WRITE_ONCE() by P0(), forming a
non-temporal load-to-store link between these two processes. However, because P0()’s
smp_store_release() (line 8) is cumulative, if P3()’s READ_ONCE() returns zero,
this cumulativity will force the READ_ONCE() to be ordered before P0()’s
smp_store_release(). In addition, the release-acquire chain (lines 8, 15, 16, 23, 24, and 32) forces
P3()’s READ_ONCE() to be ordered after P0()’s smp_store_release(). Because
P3()’s READ_ONCE() cannot be both before and after P0()’s smp_store_release(),
either or both of two things must be true:
2. The release-acquire chain did not form, that is, one or more of the exists clause’s
1:r2=2, 2:r2=2, or 3:r1=2 is false.
Either way, the exists clause cannot trigger, despite this litmus test containing a
notorious load-to-store link between P3() and P0(). But never forget that release-
acquire chains can tolerate only one load-to-store link, as was seen in Listing 15.18.
Release-acquire chains can also tolerate a single store-to-store step, as shown in
Listing 15.24 (C-Z6.2+o-r+a-r+a-r+a-o.litmus). As with the previous example,
smp_store_release()’s cumulativity combined with the temporal nature of the
release-acquire chain prevents the exists clause on line 35 from triggering.
Quick Quiz 15.29: Suppose we have a short release-acquire chain along with one load-to-store
link and one store-to-store link, like that shown in Listing 15.25. Given that there is only one of
each type of non-store-to-load link, the exists cannot trigger, right?
But beware: Adding a second store-to-store link allows the correspondingly updated
exists clause to trigger. To see this, review Listings 15.26 and 15.27, which have
identical P0() and P1() processes. The only code difference is that Listing 15.27 has
an additional P2() that does an smp_store_release() to the x2 variable that P0()
releases and P1() acquires. The exists clause is also adjusted to exclude executions
in which P2()’s smp_store_release() precedes that of P0().
Running the litmus test in Listing 15.27 shows that the addition of P2() can totally
destroy the ordering from the release-acquire chain. Therefore, when constructing
release-acquire chains, please take care to construct them properly.
Quick Quiz 15.30: There are store-to-load links, load-to-store links, and store-to-store links.
But what about load-to-load links?
1 PPC R+lwsync+sync
2 {
3 0:r1=1; 0:r2=x; 0:r4=y;
4 1:r1=2; 1:r2=y; 1:r4=x;
5 }
6 P0 | P1 ;
7 stw r1,0(r2) | stw r1,0(r2) ;
8 lwsync | sync ;
9 stw r1,0(r4) | lwz r3,0(r4) ;
10 exists (y=2 /\ 1:r3=0)
The first line identifies the type of test (PPC) and gives the test’s name. Lines 3
and 4 initialize P0()’s and P1()’s registers, respectively. Lines 6–9 show the PowerPC
assembly statements corresponding to the C code from Listing E.12, with the first
column being the code for P0() and the second column being the code for P1().
Line 7 shows the initial WRITE_ONCE() calls in both columns; the columns of line 8
show the smp_wmb() and smp_mb() for P0() and P1(), respectively; the columns
of line 9 show P0()’s WRITE_ONCE() and P1()’s READ_ONCE(), respectively; and
finally line 10 shows the exists clause.
In order for this exists clause to be satisfied, P0()’s stw to y must precede that of
P1(), but P1()’s later lwz from x must precede P0()’s stw to x. Seeing how this can
happen requires a rough understanding of the following PowerPC terminology.
Instruction commit:
This can be thought of as the execution of that instruction as opposed to the
memory-system consequences of having executed that instruction.
We are now ready to step through the PowerPC sequence of events that satisfies the
above exists clause.
To best understand this, please follow along at
https://www.cl.cam.ac.uk/~pes20/ppcmem/index.html, carefully copying the above assembly-language litmus
test into the pane. The result should look as shown in Figure 15.15, give or take space
characters. Click on the “Interactive” button in the lower left, which, after a short delay,
should produce a display as shown in Figure 15.16. If the “Interactive” button refuses
to do anything, this usually means that there is a syntax error, for example, a spurious
newline character might have been introduced during the copy-paste operation.
This display has one clickable link in each section displaying thread state, and
as the “Commit” in each link suggests, these links commit each thread’s first stw
instruction. If you prefer, you can instead click on the corresponding links listed under
“Enabled transitions” near the bottom of the screen. Note well that some of the later
memory-system transitions will appear in the upper “Storage subsystem state” section
of this display.
The following sequence of clicks demonstrates how the exists clause can be
satisfied:
6. At this point, there should be no clickable links in either of the two sections
displaying thread state, but there should be quite a few of them up in the “Storage
subsystem state”. The following steps tell you which of them to click on.
7. Partial coherence commit: c:W y=1 ->d:W y=2. This commits the system
to processing P0()’s store to y before P1()’s store even though neither store
has reached either the coherence point or any other thread. One might imagine
partial coherence commits happening within a store buffer that is shared by multiple
hardware threads that are writing to the same variable.
At this point, you should see something like Figure 15.17. Note that the satisfied
exists clause is shown in blue near the bottom, confirming that this counter-intuitive
outcome really can happen. If you wish, you can click on “Undo” to explore other options or
click on “Reset” to start over. It can be very helpful to carry out these steps in different
orders to better understand how a non-multicopy-atomic architecture operates.
Quick Quiz 15.31: What happens if that lwsync instruction is instead a sync instruction?
15.3. COMPILE-TIME CONSTERNATION 523
provide some helpful intuitions. Or perhaps more accurately, destroy some counter-
productive intuitions.
1. Plain accesses can tear, for example, the compiler could choose to access an
eight-byte pointer one byte at a time. Tearing of aligned machine-sized accesses
can be prevented by using READ_ONCE() and WRITE_ONCE().
2. Plain loads can fuse, for example, if the results of an earlier load from that same
object are still in a machine register, the compiler might opt to reuse the value
in that register instead of reloading from memory. Load fusing can be prevented
by using READ_ONCE() or by enforcing ordering between the two loads using
barrier(), smp_rmb(), and other means shown in Table 15.3.
3. Plain stores can fuse, so that a store can be omitted entirely if there is a later store
to that same variable. Store fusing can be prevented by using WRITE_ONCE() or
by enforcing ordering between the two stores using barrier(), smp_wmb(), and
other means shown in Table 15.3.
5. Plain loads can be invented, for example, register pressure might cause the compiler
to discard a previously loaded value from its register, and then reload it later on.
Invented loads can be prevented by using READ_ONCE() or by enforcing ordering
as called out above between the load and a later use of its value using barrier().
6. Stores can be invented before a plain store, for example, by using the stored-to
location as temporary storage. This can be prevented by use of WRITE_ONCE().
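As a concrete illustration of load fusing and store fusing, the following userspace sketch defines simplified stand-ins for READ_ONCE() and WRITE_ONCE() as volatile casts. These two macros are approximations written for this example, not the Linux kernel's definitions, and they assume a compiler that supports the GNU __typeof__ extension. The variables need_to_stop and status and the functions wait_for_stop() and report_progress() are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-ins: access x exactly once through a volatile-qualified lvalue. */
#define READ_ONCE(x)		(*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, val)	(*(volatile __typeof__(x) *)&(x) = (val))

static int need_to_stop;	/* imagined to be set by some other thread */
static int status;		/* imagined to be polled by some other thread */

/*
 * Poll the flag up to max_spins times.  With a plain load, the compiler
 * could fuse the loads, checking need_to_stop once and then spinning on
 * a register copy.  READ_ONCE() forces a fresh load on each pass.
 */
static bool wait_for_stop(long max_spins)
{
	long i;

	for (i = 0; i < max_spins; i++)
		if (READ_ONCE(need_to_stop))
			return true;
	return false;
}

/*
 * Publish two status values.  With plain stores, the compiler could fuse
 * them by omitting the first store.  WRITE_ONCE() keeps both.
 */
static void report_progress(void)
{
	WRITE_ONCE(status, 1);	/* in progress */
	/* ... the work being reported on would go here ... */
	WRITE_ONCE(status, 2);	/* done */
}
```

These macros constrain only the compiler. They provide no ordering against other variables at the hardware level, which is the job of the barriers and other primitives discussed earlier.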
Quick Quiz 15.32: Why not place a barrier() call immediately before a plain store to
prevent the compiler from inventing stores?
Please note that all of these shared-memory shenanigans can instead be avoided by
avoiding data races on plain accesses, as described in Section 4.3.4.4. After all, if there
are no data races, then each and every one of the compiler optimizations mentioned
above is perfectly safe. But for code containing data races, this list is subject to change
without notice as compiler optimizations continue becoming increasingly aggressive.
In short, READ_ONCE(), WRITE_ONCE(), barrier(), volatile, and the other
primitives called out in Table 15.3 on page 492 are valuable tools in preventing the
compiler from optimizing your parallel algorithm out of existence. Compilers are starting
to provide other mechanisms for avoiding load and store tearing, for example,
memory_order_relaxed atomic loads and stores. However, work is still needed [Cor16b]. In
addition, compiler issues aside, volatile is still needed to avoid fusing and invention
of accesses, including C11 atomic accesses.
Please note that it is possible to overdo use of READ_ONCE() and WRITE_ONCE().
For example, if you have prevented a given variable from changing (perhaps by holding
the lock guarding all updates to that variable), there is no point in using READ_ONCE().
Similarly, if you have prevented any other CPUs or threads from reading a given variable
(perhaps because you are initializing that variable before any other CPU or thread has
access to it), there is no point in using WRITE_ONCE(). However, in my experience,
developers need to use things like READ_ONCE() and WRITE_ONCE() more often than
they think that they do, and the overhead of unnecessary uses is quite low. In contrast,
the penalty for failing to use them when needed can be quite high.
1. On DEC Alpha, a dependent load might not be ordered with the load heading the
dependency chain, as described in Section 15.5.1.
2. If the load heading the dependency chain is a C11 non-volatile memory_order_relaxed
load, the compiler could omit the load, for example, by using a value that
it loaded in the past.
3. If the load heading the dependency chain is a plain load, the compiler can omit
the load, again by using a value that it loaded in the past. Worse yet, it could load
twice instead of once, so that different parts of your code use different values—and
compilers really do this, especially when under register pressure.
4. The value loaded by the head of the dependency chain must be a pointer. In theory,
yes, you could load an integer, perhaps to use it as an array index. In practice, the
compiler knows too much about integers, and thus has way too many opportunities
to break your dependency chain [MWB+17].
1. Although it is permissible to compute offsets from a pointer, these offsets must not
result in total cancellation. For example, given a char pointer cp, cp-(uintptr_t)cp
will cancel and can allow the compiler to break your dependency chain.
On the other hand, canceling offset values with each other is perfectly safe and
legal. For example, if a and b are equal, cp+a-b is an identity function, including
preserving the dependency.
2. Comparisons can break dependencies. Listing 15.28 shows how this can happen.
Here global pointer gp points to a dynamically allocated integer, but if memory
is low, it might instead point to the reserve_int variable. This reserve_int
case might need special handling, as shown on lines 6 and 7 of the listing. But the
compiler could reasonably transform this code into the form shown in Listing 15.29,
especially on systems where instructions with absolute addresses run faster than
instructions using addresses supplied in registers. However, there is clearly no
ordering between the pointer load on line 5 and the dereference on line 8. Please
note that this is simply an example: There are a great many other ways to break
dependency chains with comparisons.
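The kind of transformation described in the preceding item can be sketched as follows. This is an illustration with made-up names rather than a reproduction of the listings cited above: heap_int stands in for the dynamically allocated integer, reserve_uses stands in for the special handling, and READ_ONCE() is a simplified volatile-cast stand-in (assuming the GNU __typeof__ extension) for whatever primitive heads the dependency chain.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in, not the Linux kernel's definition. */
#define READ_ONCE(x)	(*(const volatile __typeof__(x) *)&(x))

static int reserve_int = 7;	/* fallback used when memory is low */
static int heap_int = 42;	/* stands in for a dynamically allocated integer */
static int *gp = &heap_int;	/* global pointer published by some updater */
static int reserve_uses;	/* stands in for the special handling */

/* As written: the dereference carries an address dependency from the load of gp. */
static int reader_as_written(void)
{
	int *p = READ_ONCE(gp);

	if (p == &reserve_int)
		reserve_uses++;
	return *p;
}

/*
 * One transformation the compiler is permitted to make: on the equal
 * branch it knows that p == &reserve_int, so it can access reserve_int
 * by its absolute address.  That load no longer depends on the load of
 * gp, so a weakly ordered CPU is free to execute it first.
 */
static int reader_as_transformed(void)
{
	int *p = READ_ONCE(gp);

	if (p == &reserve_int) {
		reserve_uses++;
		return reserve_int;	/* no dependency on p */
	}
	return *p;
}
```

Run single-threaded, the two functions return identical results, which is exactly why the compiler considers the transformation legitimate. The difference shows up only when another thread updates reserve_int and then publishes its address through gp.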
Quick Quiz 15.33: Why can’t you simply dereference the pointer before comparing it to
&reserve_int on line 6 of Listing 15.28?
Quick Quiz 15.34: But it should be safe to compare two pointer variables, right? After all,
the compiler doesn’t know the value of either, so how can it possibly learn anything from the
comparison?
Note that a series of inequality comparisons might, when taken together, give the
compiler enough information to determine the exact value of the pointer, at which
point the dependency is broken. Furthermore, the compiler might be able to combine
information from even a single inequality comparison with other information to learn
the exact value, again breaking the dependency. Pointers to elements in arrays are
especially susceptible to this latter form of dependency breakage.
1. Comparisons against the NULL pointer. In this case, all the compiler can learn is
that the pointer is NULL, in which case you are not allowed to dereference it anyway.
2. The dependent pointer is never dereferenced, whether before or after the comparison.
3. The dependent pointer is compared to a pointer that references objects that were
last modified a very long time ago, where the only unconditionally safe value of “a
very long time ago” is “at compile time”. The key point is that something other
than the address or data dependency guarantees ordering.
5. The comparison is not-equal, and the compiler does not have enough other
information to deduce the value of the pointer carrying the dependency.
Pointer comparisons can be quite tricky, and so it is well worth working through the
example shown in Listing 15.30. This example uses a simple struct foo shown on
lines 1–5 and two global pointers, gp1 and gp2, shown on lines 6 and 7, respectively.
This example uses two threads, namely updater() on lines 9–22 and reader() on
lines 24–39.
The updater() thread allocates memory on line 13, and complains bitterly on line 14
if none is available. Lines 15–17 initialize the newly allocated structure, and then
line 18 assigns the pointer to gp1. Lines 19 and 20 then update two of the structure’s
fields, and do so after line 18 has made those fields visible to readers. Please note
that unsynchronized update of reader-visible fields often constitutes a bug. Although
there are legitimate use cases doing just this, such use cases require more care than is
exercised in this example.
Finally, line 21 assigns the pointer to gp2.
The reader() thread first fetches gp2 on line 30, with lines 31 and 32 checking for
NULL and returning if so. Line 33 fetches field ->b and line 34 fetches gp1. If line 35
sees that the pointers fetched on lines 30 and 34 are equal, line 36 fetches p->c. Note
that line 36 uses pointer p fetched on line 30, not pointer q fetched on line 34.
But this difference might not matter. An equals comparison on line 35 might lead
the compiler to (incorrectly) conclude that both pointers are equivalent, when in fact
they carry different dependencies. This means that the compiler might well transform
528 CHAPTER 15. ADVANCED SYNCHRONIZATION: MEMORY ORDERING
In short, great care is required to ensure that dependency chains in your source code
are still dependency chains in the compiler-generated assembly code.
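The updater() and reader() pair discussed above might be sketched in userspace C roughly as follows. This is an approximation under stated assumptions rather than the listing the text refers to: C11 release stores and consume loads stand in for the kernel's rcu_assign_pointer() and rcu_dereference(), READ_ONCE() and WRITE_ONCE() are volatile-cast stand-ins that rely on the GCC/Clang __typeof__ extension, the field values are arbitrary, reader() hands its results back through out-parameters, and the line numbers cited in the text do not apply to this sketch.

```c
#include <stdatomic.h>
#include <stdlib.h>

/* Assumed userspace stand-ins for the kernel's marked accesses. */
#define READ_ONCE(x)     (*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, v) (*(volatile __typeof__(x) *)&(x) = (v))

struct foo {
	int a;
	int b;
	int c;
};

_Atomic(struct foo *) gp1;
_Atomic(struct foo *) gp2;

void updater(void)
{
	struct foo *p = malloc(sizeof(*p));

	if (!p)
		abort();	/* complain bitterly */
	p->a = 42;	/* illustrative values */
	p->b = 43;
	p->c = 44;
	atomic_store_explicit(&gp1, p, memory_order_release);
	/* Updates made after the structure is already reader-visible. */
	WRITE_ONCE(p->b, 143);
	WRITE_ONCE(p->c, 144);
	atomic_store_explicit(&gp2, p, memory_order_release);
}

/* Returns 0 if gp2 was still NULL, otherwise 1 with *r1 and *r2 filled in. */
int reader(int *r1, int *r2)
{
	struct foo *p;
	struct foo *q;

	*r1 = 0;
	*r2 = 0;
	p = atomic_load_explicit(&gp2, memory_order_consume);
	if (p == NULL)
		return 0;
	*r1 = READ_ONCE(p->b);
	q = atomic_load_explicit(&gp1, memory_order_consume);
	if (p == q)
		*r2 = READ_ONCE(p->c);	/* deliberately p, not q */
	return 1;
}
```

The hazard described in the text is that, once p == q is known, a compiler may treat the two pointers as interchangeable and dereference q instead of p, even though only p carries the dependency from the load of gp2.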
    q = READ_ONCE(x);
    if (q) {
        <data dependency barrier>
        q = READ_ONCE(y);
    }
This will not have the desired effect because there is no actual data dependency, but
rather a control dependency that the CPU may short-circuit by attempting to predict the
outcome in advance, so that other CPUs see the load from y as having happened before
the load from x. In such a case what’s actually required is:
    q = READ_ONCE(x);
    if (q) {
        <read barrier>
        q = READ_ONCE(y);
    }
However, stores are not speculated. This means that ordering is provided for load-store
control dependencies, as in the following example:
    q = READ_ONCE(x);
    if (q)
        WRITE_ONCE(y, 1);
Control dependencies pair normally with other types of ordering operations. That
said, please note that neither READ_ONCE() nor WRITE_ONCE() is optional! Without
the READ_ONCE(), the compiler might fuse the load from x with other loads from x.
Without the WRITE_ONCE(), the compiler might fuse the store to y with other stores
to y, or, worse yet, read the value, compare it, and only conditionally do the store. Any
of these can result in highly counter-intuitive effects on ordering.
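For readers experimenting outside the kernel, the load-store control dependency above can be sketched as follows. The READ_ONCE() and WRITE_ONCE() definitions here are assumed volatile-cast stand-ins (using the GCC/Clang __typeof__ extension), not the kernel's definitions, and load_store_ctrl_dep() is a hypothetical name chosen for illustration.

```c
/* Assumed userspace stand-ins for the kernel's marked accesses. */
#define READ_ONCE(x)     (*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, v) (*(volatile __typeof__(x) *)&(x) = (v))

/*
 * Load-store control dependency: the store to *y executes only if the
 * value loaded from *x is non-zero.  Returns the value loaded.
 */
int load_store_ctrl_dep(int *x, int *y)
{
	int q = READ_ONCE(*x);	/* volatile load: cannot be fused with other loads */

	if (q)
		WRITE_ONCE(*y, 1);	/* volatile store: cannot be fused or replaced by a read-compare-store */
	return q;
}
```

The volatile casts constrain only the compiler. Whether the hardware then respects the resulting load-to-store ordering depends on the conditional branch surviving into the generated code, as discussed in the text that follows.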
Worse yet, if the compiler is able to prove (say) that the value of variable x is
always non-zero, it would be well within its rights to optimize the original example by
eliminating the “if” statement as follows:
    q = READ_ONCE(x);
    WRITE_ONCE(y, 1); /* BUG: CPU can reorder!!! */
Quick Quiz 15.36: But there is a READ_ONCE(), so how can the compiler prove anything
about the value of q?
Now there is no conditional between the load from x and the store to y, which
means that the CPU is within its rights to reorder them: The conditional is absolutely
required, and must be present in the assembly code even after all compiler optimizations
have been applied. Therefore, if you need ordering in this example, you need explicit
memory-ordering operations, for example, a release store:
    q = READ_ONCE(x);
    if (q) {
        smp_store_release(&y, 1);
        do_something();
    } else {
        smp_store_release(&y, 1);
        do_something_else();
    }
The initial READ_ONCE() is still required to prevent the compiler from guessing the
value of x. In addition, you need to be careful what you do with the local variable q,
otherwise the compiler might be able to guess its value and again remove the needed
conditional. For example:
    q = READ_ONCE(x);
    if (q % MAX) {
        WRITE_ONCE(y, 1);
        do_something();
    } else {
        WRITE_ONCE(y, 2);
        do_something_else();
    }
If MAX is defined to be 1, then the compiler knows that (q%MAX) is equal to zero,
in which case the compiler is within its rights to transform the above code into the
following:
    q = READ_ONCE(x);
    WRITE_ONCE(y, 2);
    do_something_else();
Given this transformation, the CPU is not required to respect the ordering between
the load from variable x and the store to variable y. It is tempting to add a barrier()
to constrain the compiler, but this does not help. The conditional is gone, and the
barrier() won’t bring it back. Therefore, if you are relying on this ordering, you
should make sure that MAX is greater than one, perhaps as follows:
    q = READ_ONCE(x);
    BUILD_BUG_ON(MAX <= 1);
    if (q % MAX) {
        WRITE_ONCE(y, 1);
        do_something();
    } else {
        WRITE_ONCE(y, 2);
        do_something_else();
    }
Please note once again that the stores to y differ. If they were identical, as noted
earlier, the compiler could pull this store outside of the “if” statement.
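Outside the kernel, the build-time check above might be approximated with C11's _Static_assert. The sketch below is an illustration under assumptions: BUILD_BUG_ON() is a hypothetical userspace stand-in rather than the kernel's macro, READ_ONCE() and WRITE_ONCE() are volatile-cast stand-ins using the GCC/Clang __typeof__ extension, MAX is an arbitrary value, and store_by_remainder() is a name invented for this example.

```c
/* Assumed userspace stand-ins for the kernel's marked accesses. */
#define READ_ONCE(x)     (*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, v) (*(volatile __typeof__(x) *)&(x) = (v))

/* Hypothetical stand-in: the build fails if cond is true. */
#define BUILD_BUG_ON(cond) _Static_assert(!(cond), #cond)

#define MAX 4	/* arbitrary; a value of 1 would fail the build below */

/* Stores 1 to *y if *x is not a multiple of MAX, otherwise stores 2. */
int store_by_remainder(int *x, int *y)
{
	int q = READ_ONCE(*x);

	BUILD_BUG_ON(MAX <= 1);	/* q % 1 is always 0, which would let the compiler drop the branch */
	if (q % MAX)
		WRITE_ONCE(*y, 1);
	else
		WRITE_ONCE(*y, 2);
	return q;
}
```

The check does not itself create ordering. It only refuses to build a configuration in which the compiler could prove the condition constant and thereby remove the conditional.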
You must also avoid excessive reliance on boolean short-circuit evaluation. Consider
this example:
    q = READ_ONCE(x);
    if (q || 1 > 0)
        WRITE_ONCE(y, 1);
Because the first condition cannot fault and the second condition is always true, the
compiler can transform this example as follows, defeating the control dependency:
    q = READ_ONCE(x);
    WRITE_ONCE(y, 1);
This example underscores the need to ensure that the compiler cannot out-guess your
code. Never forget that, although READ_ONCE() does force the compiler to actually
emit code for a given load, it does not force the compiler to use the value loaded.
In addition, control dependencies apply only to the then-clause and else-clause of the
if-statement in question. In particular, they do not necessarily apply to code following
the if-statement:
    q = READ_ONCE(x);
    if (q) {
        WRITE_ONCE(y, 1);
    } else {
        WRITE_ONCE(y, 2);
    }
    WRITE_ONCE(z, 1); /* BUG: No ordering. */
It is tempting to argue that there in fact is ordering because the compiler cannot
reorder volatile accesses and also cannot reorder the writes to y with the condition.
Unfortunately for this line of reasoning, the compiler might compile the two writes to y
as conditional-move instructions, as in this fanciful pseudo-assembly language:
    ld r1,x
    cmp r1,$0
    cmov,ne r4,$1
    cmov,eq r4,$2
    st r4,y
    st $1,z
A weakly ordered CPU would have no dependency of any sort between the load
from x and the store to z. The control dependencies would extend only to the pair of
cmov instructions and the store depending on them. In short, control dependencies apply
only to the stores in the “then” and “else” of the “if” in question (including functions
invoked by those two clauses), and not necessarily to code following that “if”.
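If the store following the “if” does need to be ordered after the initial load, one option is to stop relying on the control dependency for that store and use an explicit release store instead. The following is a userspace sketch under assumptions: C11 relaxed atomics stand in for READ_ONCE() and WRITE_ONCE(), a C11 release store stands in for smp_store_release(), and cond_store_then_release() is a hypothetical name. Note that C11 itself makes no promises about control dependencies, so only the release store is relied upon for ordering here.

```c
#include <stdatomic.h>

/*
 * Loads *x, stores 1 or 2 to *y depending on the value loaded, then
 * stores 1 to *z with release semantics.
 */
void cond_store_then_release(atomic_int *x, atomic_int *y, atomic_int *z)
{
	int q = atomic_load_explicit(x, memory_order_relaxed);

	if (q)
		atomic_store_explicit(y, 1, memory_order_relaxed);
	else
		atomic_store_explicit(y, 2, memory_order_relaxed);
	/*
	 * The release store is ordered after the earlier load from *x no
	 * matter how the compiler implements the branch above, including
	 * as a pair of conditional moves.
	 */
	atomic_store_explicit(z, 1, memory_order_release);
}
```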
Finally, control dependencies do not provide cumulativity.13 This is demonstrated by
two related litmus tests, namely Listings 15.31 and 15.32 with the initial values of x
and y both being zero.
The exists clause in the two-thread example of Listing 15.31
(C-LB+o-cgt-o+o-cgt-o.litmus) will never trigger. If control dependencies guaranteed
cumulativity (which they do not), then adding a thread to the example as in Listing 15.32
(C-WWC+o-cgt-o+o-cgt-o+o.litmus) would guarantee the related exists clause never to
trigger.
But because control dependencies do not provide cumulativity, the exists clause in
the three-thread litmus test can trigger. If you need the three-thread example to provide
ordering, you will need smp_mb() between the load and store in P0(), that is, just
before or just after the “if” statements. Furthermore, the original two-thread example
is very fragile and should be avoided.
Quick Quiz 15.37: Can’t you instead add an smp_mb() to P1() in Listing 15.32?
2. Control dependencies can order prior loads against later stores. However, they do
not guarantee any other sort of ordering: Not prior loads against later loads, nor
prior stores against later anything. If you need these other forms of ordering, use
smp_rmb(), smp_wmb(), or, in the case of prior stores and later loads, smp_mb().
3. If both legs of the “if” statement begin with identical stores to the same variable,
then the control dependency will not order those stores. If ordering is needed,
precede both of them with smp_mb() or use smp_store_release(). Please
note that it is not sufficient to use barrier() at the beginning of each leg of the “if”
statement because, as shown by the example above, optimizing compilers can
destroy the control dependency while respecting the letter of the barrier() law.
4. Control dependencies require at least one run-time conditional between the prior
load and the subsequent store, and this conditional must involve the prior load. If
the compiler is able to optimize the conditional away, it will have also optimized
away the ordering. Careful use of READ_ONCE() and WRITE_ONCE() can help to
preserve the needed conditional.
5. Control dependencies require that the compiler avoid reordering the dependency
into nonexistence. Careful use of READ_ONCE(), atomic_read(), or
atomic64_read() can help to preserve your control dependency.
6. Control dependencies apply only to the “then” and “else” of the “if” containing
the control dependency, including any functions that these two clauses call.
Control dependencies do not apply to code following the end of the “if” statement
containing the control dependency.
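As the list above notes, a control dependency does not order a prior load against a later load. A userspace sketch of supplying that ordering explicitly appears below. It rests on assumptions: a C11 acquire fence is used as a rough analog of smp_rmb(), relaxed atomic loads stand in for READ_ONCE(), and read_if_ready() is a hypothetical name.

```c
#include <stdatomic.h>

/*
 * If *flag is set, loads *data into *out and returns 1.
 * Otherwise leaves *out untouched and returns 0.
 */
int read_if_ready(atomic_int *flag, atomic_int *data, int *out)
{
	if (atomic_load_explicit(flag, memory_order_relaxed)) {
		/*
		 * Without this fence, the branch alone would not prevent
		 * the load from *data from being satisfied before the
		 * load from *flag.
		 */
		atomic_thread_fence(memory_order_acquire);
		*out = atomic_load_explicit(data, memory_order_relaxed);
		return 1;
	}
	return 0;
}
```

For the fence to be useful, the writer must order its store to the data before its store to the flag, for example with a release store to the flag.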
Again, many popular languages were designed with single-threaded use in mind.
Successful multithreaded use of these languages requires you to pay special attention to
your memory references and dependencies.
15.4 Higher-Level Primitives
The answer to one of the quick quizzes in Section 12.3.1 demonstrated exponential
speedups due to verifying programs modeled at higher levels of abstraction. This section
will look into how higher levels of abstraction can also provide a deeper understanding
of the synchronization primitives themselves. Section 15.4.1 takes a look at memory
allocation, Section 15.4.2 examines the surprisingly varied semantics of locking, and
Section 15.4.3 digs more deeply into RCU.
Quick Quiz 15.38: But doesn’t PowerPC have weak unlock-lock ordering properties within the
Linux kernel, allowing a write before the unlock to be reordered with a read after the lock?
15.4.2 Locking
Locking is a well-known synchronization primitive with which the parallel-programming
community has had decades of experience. As such, locking’s semantics are quite
simple.
That is, they are quite simple until you start trying to mathematically model them.
The simple part is that any CPU or thread holding a given lock is guaranteed to see
any accesses executed by CPUs or threads while they were previously holding that
same lock. Similarly, any CPU or thread holding a given lock is guaranteed not to see
accesses that will be executed by other CPUs or threads while subsequently holding that
same lock. And what else is there?
As it turns out, quite a bit:
1. Are CPUs, threads, or compilers allowed to pull memory accesses into a given
lock-based critical section?
2. Will a CPU or thread holding a given lock also be guaranteed to see accesses
executed by CPUs and threads before they last acquired that same lock, and vice
versa?
3. Suppose that a given CPU or thread executes one access (call it “A”), releases a
lock, reacquires that same lock, then executes another access (call it “B”). Is some
other CPU or thread not holding that lock guaranteed to see A and B in order?
4. As above, but with the lock reacquisition carried out by some other CPU or thread?
5. As above, but with the lock reacquisition being some other lock?
The reaction to some or even all of these questions might well be “Why would anyone
do that?” However, any complete mathematical definition of locking must have answers
to all of these questions. Therefore, the following sections address these questions in
the context of the Linux kernel.
However, other environments might make other choices. For example, locking
implementations that run only on the x86 CPU family will have lock-acquisition
primitives that fully order the lock acquisition with any prior and any subsequent
accesses. Therefore, on such systems the ordering shown in Listing 15.33 comes for
free. There are x86 lock-release implementations that are weakly ordered, thus failing to
provide the ordering shown in Listing 15.34, but an implementation could nevertheless
choose to guarantee this ordering.
For their part, weakly ordered systems might well choose to execute the memory-
barrier instructions required to guarantee both orderings, possibly simplifying code
making advanced use of combinations of locked and lockless accesses. However, as
noted earlier, LKMM chooses not to provide these additional orderings, in part to avoid
imposing performance penalties on the simpler and more prevalent locking use cases.
Instead, the smp_mb__after_spinlock() and smp_mb__after_unlock_lock()
primitives are provided for those more complex use cases, as discussed in Section 15.5.
Thus far, this section has discussed only hardware reordering. Can the compiler also
reorder memory references into lock-based critical sections?
The answer to this question in the context of the Linux kernel is a resounding “No!”
One reason for this otherwise inexplicable favoring of hardware reordering over compiler
optimizations is that the hardware will avoid reordering a page-faulting access into a
lock-based critical section. In contrast, compilers have no clue about page faults, and
would therefore happily reorder a page fault into a critical section, which could crash
the kernel. The compiler is also unable to reliably determine which accesses will result
in cache misses, so that compiler reordering into critical sections could also result in
excessive lock contention. Therefore, the Linux kernel prohibits the compiler (but not
the CPU) from moving accesses into lock-based critical sections.
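The distinction between compiler and CPU reordering can be illustrated with a toy userspace spinlock. This is a sketch under assumptions and is not how the Linux kernel implements its locks: barrier() is defined here as the usual GCC/Clang empty asm with a memory clobber, the lock itself is a C11 atomic_flag, and the toy_lock names are invented for this example.

```c
#include <stdatomic.h>

/* Compiler-only barrier (GCC/Clang): emits no instructions. */
#define barrier() __asm__ __volatile__("" ::: "memory")

struct toy_lock {
	atomic_flag held;
};

void toy_lock_acquire(struct toy_lock *l)
{
	barrier();	/* compiler may not sink earlier accesses into the critical section */
	while (atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire))
		continue;	/* spin until the flag was previously clear */
}

void toy_lock_release(struct toy_lock *l)
{
	atomic_flag_clear_explicit(&l->held, memory_order_release);
	barrier();	/* compiler may not hoist later accesses into the critical section */
}
```

The acquire and release orderings keep accesses from leaking out of the critical section, while the two barrier() calls keep the compiler from pulling surrounding accesses into it. The CPU remains free to do so, because barrier() generates no machine instructions.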
If a given CPU or thread holds a given lock, it is guaranteed to see accesses executed
during all prior critical sections for that same lock. Similarly, such a CPU or thread
is guaranteed not to see accesses that will be executed during all subsequent critical
sections for that same lock.
But what about accesses preceding prior critical sections and following subsequent
critical sections?
This question can be answered for the Linux kernel by referring to Listing 15.35
(C-Lock-outside-across.litmus). Running this litmus test yields the Never result,
which means that accesses in code leading up to a prior critical section are also visible to
the current CPU or thread holding that same lock. Similarly, accesses in code placed after a
subsequent critical section are never visible to the current CPU or thread holding that
same lock.
As a result, the Linux kernel cannot allow accesses to be moved across the entirety of
a given critical section. Other environments might well wish to allow such code motion,
but please be advised that doing so is likely to yield profoundly counter-intuitive results.
In short, the ordering provided by spin_lock() extends not only throughout the
critical section, but also indefinitely beyond the end of that critical section. Similarly, the
ordering provided by spin_unlock() extends not only throughout the critical section,
but also indefinitely beyond the beginning of that critical section.
Does a CPU or thread that is not holding a given lock see that lock’s critical sections as
being ordered?
This question can be answered for the Linux kernel by referring to Listing 15.36
(C-Lock-across-unlock-lock-1.litmus), which shows an example where P0()
places its write and read in two different critical sections for the same lock. Running
this litmus test shows that the exists can be satisfied, which means that the answer
is “no”, and that CPUs can reorder accesses across consecutive critical sections. In
other words, not only are spin_lock() and spin_unlock() weaker than a full barrier
when considered separately, they are also weaker than a full barrier when taken together.
If the ordering of a given lock’s critical sections is to be observed, then either the
observer must hold that lock, or smp_mb__after_spinlock() or
smp_mb__after_unlock_lock() must be executed just after the second lock
acquisition.
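A userspace analog of this pattern, including the added full barrier, might look as follows. This is a sketch under assumptions and not the LKMM litmus test itself: a pthread mutex stands in for the spinlock, a C11 sequentially consistent fence is used as a rough analog of both smp_mb() and smp_mb__after_spinlock(), relaxed atomics stand in for READ_ONCE() and WRITE_ONCE(), and the names demo_lock, p0(), and p1() are chosen for illustration.

```c
#include <pthread.h>
#include <stdatomic.h>

pthread_mutex_t demo_lock = PTHREAD_MUTEX_INITIALIZER;
atomic_int x, y;

/* Writes x in one critical section, reads y in the next. */
int p0(void)
{
	int r1;

	pthread_mutex_lock(&demo_lock);
	atomic_store_explicit(&x, 1, memory_order_relaxed);
	pthread_mutex_unlock(&demo_lock);

	pthread_mutex_lock(&demo_lock);
	/* Assumed analog of smp_mb__after_spinlock(). */
	atomic_thread_fence(memory_order_seq_cst);
	r1 = atomic_load_explicit(&y, memory_order_relaxed);
	pthread_mutex_unlock(&demo_lock);
	return r1;
}

/* Observer that never acquires the lock. */
int p1(void)
{
	atomic_store_explicit(&y, 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);	/* analog of smp_mb() */
	return atomic_load_explicit(&x, memory_order_relaxed);
}
```

With a full fence between the store and the load in both functions, C11 rules out the outcome in which both functions return zero when they run concurrently. Removing the fence from p0() leaves only the unlock-lock pair between its accesses, which is the situation that the text describes as weaker than a full barrier.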
But what if the two critical sections run on different CPUs or threads?
This question is answered for the Linux kernel by referring to Listing 15.37
(C-Lock-across-unlock-lock-2.litmus), in which the first lock acquisition is executed by
P0() and the second lock acquisition is executed by P1(). Note that P1() must read x
to reject executions in which P1() executes before P0() does. Running this litmus test
shows that the exists can be satisfied, which means that the answer is “no”, and that
CPUs can reorder accesses across consecutive critical sections, even if each of those
critical sections runs on a different CPU or thread.
Quick Quiz 15.39: But if there are three critical sections, isn’t it true that CPUs not holding
the lock will observe the accesses from the first and the third critical section as being ordered?
As before, if the ordering of a given lock’s critical sections is to be observed, then
either the observer must hold that lock, or smp_mb__after_spinlock() or
smp_mb__after_unlock_lock() must be executed just after P1()’s lock acquisition.
Given that ordering is not guaranteed when both critical sections are protected by the
same lock, there is no hope of any ordering guarantee when different locks are used.
However, readers are encouraged to construct the corresponding litmus test and see this
for themselves.
This situation can seem counter-intuitive, but it is rare for code to care. This approach
also allows certain weakly ordered systems to implement locks more efficiently.
15.4.3 RCU
As described in Section 9.5.2, the fundamental property of RCU grace periods is this
straightforward two-part guarantee: (1) If any part of a given RCU read-side critical
section precedes the beginning of a given grace period, then the entirety of that critical
section precedes the end of that grace period. (2) If any part of a given RCU read-side
critical section follows the end of a given grace period, then the entirety of that critical
section follows the beginning of that grace period. These guarantees are summarized
in Figure 15.18, where the grace period is denoted by the dashed arrow between
the call_rcu() invocation in the upper right and the corresponding RCU callback
invocation in the lower left.14
[Figure 15.18: diagram relating rcu_read_lock(), rcu_read_unlock(), and call_rcu()]
In short, an RCU read-side critical section is guaranteed never to completely overlap
an RCU grace period, as demonstrated by Listing 15.38
(C-SB+o-rcusync-o+rl-o-o-rul.litmus). Either or neither of the r2 registers can have the
final value of zero, but at least one of them must be non-zero (that is, the cycle identified
by the exists clause is prohibited), courtesy of RCU’s fundamental grace-period guarantee,
as can be seen by running herd on this litmus test. Note that this guarantee is insensitive to
the ordering of the accesses within P1()’s critical section, so the litmus test shown in
Listing 15.39 also forbids this same cycle.
14 For more detail, please see Figures 9.11–9.13 starting on page 231.
However, this definition is incomplete, as can be seen from the following list of
questions:16
critical sections.
16 Several of which were introduced to Paul by Jade Alglave during early work on LKMM,
and a few more of which came from other LKMM participants [AMM+ 18].
Nor do these primitives have barrier-like ordering properties, at least not unless there
is a grace period in the mix, as can be seen in Listing 15.41
(C-LB+o-rl-rul-o+o-rl-rul-o.litmus). This litmus test’s cycle is also allowed. (Try it!)
Of course, lack of ordering in both these litmus tests should be absolutely no surprise,
given that both rcu_read_lock() and rcu_read_unlock() are no-ops in the QSBR
implementation of RCU.
period due to the fact that interrupts take place at a precise location in the execution of
the interrupted code.
This in turn means that if the WRITE_ONCE() follows the end of a given RCU grace
period, then the accesses within and following that RCU read-side critical section must
follow the beginning of that same grace period. Similarly, if the READ_ONCE() precedes
the beginning of the grace period, everything within and preceding that critical section
must precede the end of that same grace period.
Listing 15.44 (C-SB+o-rcusync-o+rl-o-rul-o.litmus) is similar, but instead
looks at accesses after the RCU read-side critical section. This test’s cycle is also
forbidden, as can be checked with the herd tool. The reasoning is similar to that for
Listing 15.43, and is left as an exercise for the reader.
Listing 15.45 (C-SB+o-rcusync-o+o-rl-rul-o.litmus) takes things one step
farther, moving P1()’s WRITE_ONCE() to precede the RCU read-side critical section
and moving P1()’s READ_ONCE() to follow it, resulting in an empty RCU read-side
critical section.
Perhaps surprisingly, despite the empty critical section, RCU nevertheless
manages to forbid the cycle. This can again be checked using the herd tool. Furthermore,
the reasoning is once again similar to that for Listing 15.43. Recapping, if P1()’s
WRITE_ONCE() follows the end of a given grace period, then P1()’s RCU read-side
critical section, and everything following it, must follow the beginning of that same
grace period. Similarly, if P1()’s READ_ONCE() precedes the beginning of a given
grace period, then P1()’s RCU read-side critical section, and everything preceding
it, must precede the end of that same grace period. In both cases, the critical section’s
emptiness is irrelevant.
Quick Quiz 15.41: Wait a minute! In QSBR implementations of RCU, no code is emitted
for rcu_read_lock() and rcu_read_unlock(). This means that the RCU read-side critical
section in Listing 15.45 isn’t just empty, it is completely nonexistent!!! So how can something
that doesn’t exist at all possibly have any effect whatsoever on ordering???
This situation leads to the question of what happens if rcu_read_lock() and
rcu_read_unlock() are omitted entirely, as shown in Listing 15.46
(C-SB+o-rcusync-o+o-o.litmus). As can be checked with herd, this litmus test’s cycle is
allowed, that is, both instances of r2 can have final values of zero.
This might seem strange in light of the fact that empty RCU read-side critical sections
can provide ordering. And it is true that QSBR implementations of RCU would in fact
forbid this outcome, due to the fact that there is no quiescent state anywhere in P1()’s
function body, so that P1() would run within an implicit RCU read-side critical section.
However, RCU also has non-QSBR implementations, which have no implied RCU
read-side critical section, and in turn no way for RCU to enforce ordering. Therefore,
this litmus test’s cycle is allowed.
Quick Quiz 15.42: Can P1()’s accesses be reordered in the litmus tests shown in Listings 15.43,
15.44, and 15.45 in the same way that they were reordered going from Listing 15.38 to
Listing 15.39?
1. P0()’s read from x1 precedes P1()’s write, as depicted by the dashed arrow near
the bottom of the diagram.
2. Because P1()’s write follows the end of P0()’s grace period, P1()’s read from
x2 cannot precede the beginning of P0()’s grace period.
3. P1()’s read from x2 precedes P2()’s write.
4. Because P2()’s write to x2 precedes the end of P0()’s grace period, it is completely
legal for P2()’s read from x0 to precede the beginning of P0()’s grace period.
5. Therefore, P2()’s read from x0 can precede P0()’s write, thus allowing the cycle
to form.
[Figure 15.19: Cycle for One RCU Grace Period and Two RCU Readers]
But what happens when another grace period is added? This situation is shown in
Listing 15.48, an SB litmus test in which P0() and P1() have RCU grace periods and
P2() and P3() have RCU readers. Again, the CPUs can reorder the accesses within
RCU read-side critical sections, as shown in Figure 15.20. For this cycle to form,
P2()’s critical section must end after P1()’s grace period and P3()’s must end after the
beginning of that same grace period, which happens to also be after the end of P0()’s
grace period. Therefore, P3()’s critical section must start after the beginning of P0()’s
grace period, which in turn means that P3()’s read from x0 cannot possibly precede
P0()’s write. Therefore, the cycle is forbidden because RCU read-side critical sections
cannot span full RCU grace periods.
[Figure 15.20: No Cycle for Two RCU Grace Periods and Two RCU Readers]
18 Especially given that Paul changed his mind several times about this particular litmus
test when working with Jade Alglave to generalize RCU ordering semantics.
However, a closer look at Figure 15.20 makes it clear that adding a third reader would
allow the cycle. This is because this third reader could end before the end of P0()’s
grace period, and thus start before the beginning of that same grace period. This in turn
suggests the general rule, which is: In these sorts of RCU-only litmus tests, if there are
at least as many RCU grace periods as there are RCU read-side critical sections, the
cycle is forbidden.19
19 Interestingly enough, Alan Stern proved that within the context of LKMM, the two-part
fundamental property of RCU expressed in Section 9.5.2 actually implies this seemingly more
general result, which is called the RCU axiom [AMM+ 18].
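The counting rule just stated can be written down as a one-line predicate. This is merely a restatement of the rule for the RCU-only litmus tests discussed here, in which each process contains either one grace period or one read-side critical section. The function name is hypothetical, and the predicate says nothing about tests that mix in other ordering mechanisms.

```c
#include <stdbool.h>

/*
 * Rule of thumb for RCU-only litmus tests: the cycle is forbidden if
 * there are at least as many grace periods as read-side critical
 * sections taking part in it.
 */
bool rcu_only_cycle_forbidden(int n_grace_periods, int n_critical_sections)
{
	return n_grace_periods >= n_critical_sections;
}
```

Applied to the cases in the text: one grace period with two readers gives an allowed cycle, two grace periods with two readers gives a forbidden cycle, and adding a third reader to the latter allows the cycle again.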
[Figure 15.21: Cycle for One RCU Grace Period, Two RCU Readers, and Memory Barrier]
In short, where RCU’s semantics were once purely pragmatic, they are now fully
formalized [MW05, DMS+ 12, GRY13, AMM+ 18].
Table 15.5 (instruction-property rows; columns are CPU families)
Property                                  Alpha  Armv7-A/R  Armv8  Itanium  MIPS  POWER  SPARC TSO  x86  z Systems
Instructions
  Load-Acquire/Store-Release?             F      F          i      I        F     b
  Atomic RMW Instruction Type?            L      L          L      C        L     L      C          C    C
  Incoherent Instruction Cache/Pipeline?  Y      Y          Y      Y        Y     Y      Y          Y    Y
Key: Load-Acquire/Store-Release?
  b: Lightweight memory barrier
  F: Full memory barrier
  i: Instruction with lightweight ordering
  I: Instruction with heavyweight ordering
Key: Atomic RMW Instruction Type?
  C: Compare-and-exchange instruction
  L: Load-linked/store-conditional instruction
It is hoped that verifying against detailed semantics for higher-level primitives will
greatly increase the effectiveness of static analysis and model checking.
15.5 Hardware Specifics
Each CPU family has its own peculiar approach to memory ordering, which can make
portability a challenge, as you can see in Table 15.5.
In fact, some software environments simply prohibit direct use of memory-ordering
operations, restricting the programmer to mutual-exclusion primitives that incorporate
them to the extent that they are required. Please note that this section is not intended to
be a reference manual covering all (or even most) aspects of each CPU family, but rather
a high-level overview providing a rough comparison. For full details, see the reference
manual for the CPU of interest.
Getting back to Table 15.5, the first group of rows look at memory-ordering properties
and the second group looks at instruction properties. Please note that these properties
hold at the machine-instruction level. Compilers can and do reorder far more aggressively
than does hardware. Use marked accesses such as READ_ONCE() and WRITE_ONCE()
to constrain the compiler’s optimizations and prevent undesirable reordering.
The first three rows indicate whether a given CPU allows the four possible combinations
of loads and stores to be reordered, as discussed in Section 15.1 and
Sections 15.2.2.1–15.2.2.3. The next row (“Atomic Instructions Reordered With Loads or Stores?”)
indicates whether a given CPU allows loads and stores to be reordered with atomic
instructions.
The fifth and sixth rows cover reordering and dependencies, which were covered in
Sections 15.2.3–15.2.5 and which are explained in more detail in Section 15.5.1. The
short version is that Alpha requires memory barriers for readers as well as updaters
of linked data structures. However, these memory barriers are provided by the Alpha
architecture-specific code in v4.15 and later Linux kernels.
The next row, “Non-Sequentially Consistent”, indicates whether the CPU’s normal
load and store instructions are constrained by sequential consistency. Performance
considerations have dictated that no modern mainstream system is sequentially consistent.
The next three rows cover multicopy atomicity, which was defined in Section 15.2.7.
The first is full-up (and rare) multicopy atomicity, the second is the weaker other-
multicopy atomicity, and the third is the weakest non-multicopy atomicity.
The next row, “Non-Cache Coherent”, covers accesses from multiple threads to a
single variable, which was discussed in Section 15.2.6.
The final three rows cover instruction-level choices and issues. The first row indicates
how each CPU implements load-acquire and store-release, the second row classifies
CPUs by atomic-instruction type, and the third and final row indicates whether a given
CPU has an incoherent instruction cache and pipeline. Such CPUs require that special
instructions be executed for self-modifying code.
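Portable code rarely needs to know which of these instruction-level choices a given CPU makes, because the compiler selects the instruction or barrier. A userspace message-passing sketch using C11 store-release and load-acquire appears below. It is an illustration under assumptions: the C11 operations are used as analogs of smp_store_release() and smp_load_acquire(), and the mailbox type and function names are invented for this example.

```c
#include <stdatomic.h>

struct mailbox {
	int payload;		/* plain data, published via the ready flag */
	atomic_int ready;	/* 0 = empty, 1 = payload is valid */
};

/* Writer: fill in the payload, then publish it with a store-release. */
void mailbox_publish(struct mailbox *m, int value)
{
	m->payload = value;
	atomic_store_explicit(&m->ready, 1, memory_order_release);
}

/* Reader: returns 1 and copies the payload if published, else returns 0. */
int mailbox_try_read(struct mailbox *m, int *out)
{
	if (!atomic_load_explicit(&m->ready, memory_order_acquire))
		return 0;
	*out = m->payload;	/* ordered after the load-acquire of ready */
	return 1;
}
```

On a CPU marked “F” in the table, the compiler would be expected to emit a full memory barrier for these operations, while on one marked “i” or “I” it can use a dedicated instruction. The source code is the same either way.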
The common “just say no” approach to memory-ordering operations can be eminently
reasonable where it applies, but there are environments, such as the Linux kernel, where
direct use of memory-ordering operations is required. Therefore, Linux provides a
carefully chosen least-common-denominator set of memory-ordering primitives, which
are as follows:
smp_mb() (full memory barrier) that orders both loads and stores. This means that
loads and stores preceding the memory barrier will be committed to memory
before any loads and stores following the memory barrier.
These primitives generate code only in SMP kernels; however, several have UP
versions (mb(), rmb(), and wmb(), respectively) that generate a memory barrier even
in UP kernels. The smp_ versions should be used in most cases. However, the UP
versions are useful when writing drivers, because MMIO accesses must remain
ordered even in UP kernels. In the absence of memory-ordering operations, both CPUs and
compilers would happily rearrange these accesses, which at best would make the device
act strangely, and could crash your kernel or even damage your hardware.
So most kernel programmers need not worry about the memory-ordering peculiarities
of each and every CPU, as long as they stick to these interfaces and to the fully ordered
atomic operations.20 If you are working deep in a given CPU’s architecture-specific
code, of course, all bets are off.
Furthermore, all of Linux’s locking primitives (spinlocks, reader-writer locks, sem-
aphores, RCU, . . .) include any needed ordering primitives. So if you are working
with code that uses these primitives properly, you need not worry about Linux’s
memory-ordering primitives.
That said, deep knowledge of each CPU’s memory-consistency model can be very
helpful when debugging, to say nothing of when writing architecture-specific code or
synchronization primitives.
Besides, they say that a little knowledge is a very dangerous thing. Just imagine the
damage you could do with a lot of knowledge! For those who wish to understand more
about individual CPUs’ memory consistency models, the next sections describe those of
a few popular and prominent CPUs. Although there is no substitute for actually reading
a given CPU’s documentation, these sections do give a good overview.
15.5.1 Alpha
It may seem strange to say much of anything about a CPU whose end of life has long
since passed, but Alpha is interesting because it is the only mainstream CPU that reorders
dependent loads, and has thus had outsized influence on concurrency APIs, including
within the Linux kernel. The need for core Linux-kernel code to accommodate Alpha
ended with version v4.15 of the Linux kernel, and all traces of this accommodation
were removed in v5.9 with the removal of the smp_read_barrier_depends() and
read_barrier_depends() APIs. This section is nevertheless retained in the Third
Edition because here in early 2023 there are still a few Linux kernel hackers
working on pre-v4.15 versions of the Linux kernel. In addition, the modifications to
READ_ONCE() that permitted these APIs to be removed have not necessarily propagated
to all userspace projects that might still support Alpha.
The dependent-load difference between Alpha and the other CPUs is illustrated by
the code shown in Listing 15.49. This smp_store_release() guarantees that the
element initialization in lines 6–8 is executed before the element is added to the list on
line 9, so that the lock-free search will work correctly. That is, it makes this guarantee
on all CPUs except Alpha.
Given the pre-v4.15 implementation of READ_ONCE(), indicated by READ_ONCE_OLD()
in the listing, Alpha actually allows the code on line 19 of Listing 15.49 to see
the old garbage values that were present before the initialization on lines 6–8.
Figure 15.22 shows how this can happen on an aggressively parallel machine with
partitioned caches, so that alternating cache lines are processed by the different partitions
of the caches. For example, the load of head.next on line 16 of Listing 15.49 might
access cache bank 0, and the load of p->key on line 19 and of p->next on line 22
might access cache bank 1. On Alpha, the smp_store_release() will guarantee that
the cache invalidations performed by lines 6–8 of Listing 15.49 (for p->next, p->key,
and p->data) will reach the interconnect before that of line 9 (for head.next), but
makes absolutely no guarantee about the order of propagation through the reading
CPU’s cache banks. For example, it is possible that the reading CPU’s cache bank 1 is
very busy, but cache bank 0 is idle. This could result in the cache invalidations for the
new element (p->next, p->key, and p->data) being delayed, so that the reading CPU
loads the new value for head.next, but loads the old cached values for p->key and
p->next. Yes, this does mean that Alpha can in effect fetch the data pointed to before
it fetches the pointer itself, strange but true. See the documentation [Com01, Pug00]
called out earlier for more information, or if you think that I am just making all this
up.21 The benefit of this unusual approach to ordering is that Alpha can use simpler
cache hardware, which in turn permitted higher clock frequencies in Alpha’s heyday.
21 Of course, the astute reader will have already recognized that Alpha is nowhere near as
mean and nasty as it could be, the (thankfully) mythical architecture in Appendix C.6.1 being
a case in point.
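[Figure 15.22 contents. Writing CPU: p->data = key; smp_wmb(); head.next = p;
Reading CPU: p = READ_ONCE_OLD(head.next); BUG_ON(p && p->key != key);
The reading CPU's cache bank 0 holds head.next, and its cache bank 1 holds p->key,
p->data, and p->next.]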
One could place an smp_rmb() primitive between the pointer fetch and dereference
in order to force Alpha to order the pointer fetch with the later dependent load. However,
this imposes unneeded overhead on systems (such as Arm, Itanium, PPC, and SPARC)
that respect data dependencies on the read side. An smp_read_barrier_depends()
primitive was therefore added to the Linux kernel to eliminate overhead on these systems,
but was removed in v5.9 of the Linux kernel in favor of augmenting Alpha’s definition
of READ_ONCE(). Thus, as of v5.9, core kernel code no longer needs to concern itself
with this aspect of DEC Alpha. However, it is better to use rcu_dereference() as
shown on lines 16 and 21 of Listing 15.50, which works safely and efficiently for all
recent kernel versions.
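For readers without the listings at hand, the initialize-then-publish pattern under discussion can be sketched in userspace C11. This is a minimal sketch under stated assumptions rather than the kernel's code: struct el, list_head, el_insert(), and el_search() are hypothetical names, a single updater is assumed, memory reclamation is ignored, and acquire loads serve as a conservative stand-in for rcu_dereference(), which instead relies on dependency ordering.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical list element, loosely modeled on the text's key/data/next. */
struct el {
	_Atomic(struct el *) next;
	long key;
	long data;
};

static _Atomic(struct el *) list_head; /* NULL when the list is empty */

/*
 * Initialize the element, then publish it at the head of the list.
 * Assumes a single updater (or an external lock serializing updaters).
 */
static void el_insert(struct el *p, long key, long data)
{
	assert(p != NULL);
	p->key = key;
	p->data = data;
	atomic_store_explicit(&p->next,
			      atomic_load_explicit(&list_head, memory_order_relaxed),
			      memory_order_relaxed);
	/* Release: the initialization above is ordered before publication. */
	atomic_store_explicit(&list_head, p, memory_order_release);
}

/*
 * Lock-free search.  Each pointer load is an acquire, so the fields of
 * any element reached through it are seen fully initialized.
 */
static struct el *el_search(long key)
{
	struct el *p = atomic_load_explicit(&list_head, memory_order_acquire);

	while (p != NULL) {
		if (p->key == key)
			return p;
		p = atomic_load_explicit(&p->next, memory_order_acquire);
	}
	return NULL;
}
```

The acquire loads cost more than dependency ordering on weakly ordered CPUs, which is the overhead that the kernel's approach is designed to avoid.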
It is also possible to implement a software mechanism that could be used in place
of smp_store_release() to force all reading CPUs to see the writing CPU’s writes
in order. This software barrier could be implemented by sending inter-processor
interrupts (IPIs) to all other CPUs. Upon receipt of such an IPI, a CPU would execute a
memory-barrier instruction, implementing a system-wide memory barrier similar to
that provided by the Linux kernel’s sys_membarrier() system call. Additional logic
is required to avoid deadlocks. Of course, CPUs that respect data dependencies would
define such a barrier to simply be smp_store_release(). However, this approach
was deemed by the Linux community to impose excessive overhead [McK01], and to
their point would be completely inappropriate for systems having aggressive real-time
response requirements.
The Linux memory-barrier primitives took their names from the Alpha instructions,
so smp_mb() is mb, smp_rmb() is rmb, and smp_wmb() is wmb. Alpha is the only CPU
whose READ_ONCE() includes an smp_mb().
Quick Quiz 15.45: Why does Alpha’s READ_ONCE() include an mb() rather than rmb()?
Quick Quiz 15.46: Isn’t DEC Alpha significant as having the weakest possible memory
ordering?
15.5.2 Armv7-A/R
The Arm family of CPUs is popular in deep embedded applications, particularly for
power-constrained microcontrollers. Its memory model is similar to that of POWER (see
Section 15.5.6), but Arm uses a different set of memory-barrier instructions [ARM10]:
DMB (data memory barrier) causes the specified type of operations to appear to have
completed before any subsequent operations of the same type. The “type” of
operations can be all operations or can be restricted to only writes (similar to the
Alpha wmb and the POWER eieio instructions). In addition, Arm allows cache
coherence to have one of three scopes: Single processor, a subset of the processors
(“inner”) and global (“outer”).
DSB (data synchronization barrier) causes the specified type of operations to actually
complete before any subsequent operations (of any type) are executed. The “type”
of operations is the same as that of DMB. The DSB instruction was called DWB
(drain write buffer or data write barrier, your choice) in early versions of the Arm
architecture.
ISB (instruction synchronization barrier) flushes the CPU pipeline, so that all instruc-
tions following the ISB are fetched only after the ISB completes. For example, if
you are writing a self-modifying program (such as a JIT), you should execute an
ISB between generating the code and executing it.
None of these instructions exactly match the semantics of Linux’s rmb() primitive,
which must therefore be implemented as a full DMB. The DMB and DSB instructions
have a recursive definition of accesses ordered before and after the barrier, which has
an effect similar to that of POWER’s cumulativity, both of which are stronger than
LKMM’s cumulativity described in Section 15.2.7.1.
Arm also implements control dependencies, so that if a conditional branch depends
on a load, then any store executed after that conditional branch will be ordered after
the load. However, loads following the conditional branch will not be guaranteed to be
ordered unless there is an ISB instruction between the branch and the load. Consider
the following example:
1 r1 = x;
2 if (r1 == 0)
3 nop();
4 y = 1;
5 r2 = z;
6 ISB();
7 r3 = z;
In this example, load-store control dependency ordering causes the load from x on
line 1 to be ordered before the store to y on line 4. However, Arm does not respect
load-load control dependencies, so that the load on line 1 might well happen after the
load on line 5. On the other hand, the combination of the conditional branch on line 2
and the ISB instruction on line 6 ensures that the load on line 7 happens after the load
on line 1. Note that inserting an additional ISB instruction somewhere between lines 2
and 5 would enforce ordering between lines 1 and 5.
15.5.3 Armv8
Arm’s Armv8 CPU family [ARM17] includes 64-bit capabilities, in contrast to their
32-bit-only CPU described in Section 15.5.2. Armv8’s memory model closely resembles
its Armv7 counterpart, but adds load-acquire (LDARB, LDARH, and LDAR) and store-release
(STLRB, STLRH, and STLR) instructions. These instructions act as “half
memory barriers”, so that Armv8 CPUs can reorder previous accesses with a later LDAR
instruction, but are prohibited from reordering an earlier LDAR instruction with later
accesses, as fancifully depicted in Figure 15.23. Similarly, Armv8 CPUs can reorder an
earlier STLR instruction with a subsequent access, but are prohibited from reordering
previous accesses with a later STLR instruction. As one might expect, this means that
these instructions directly support the C11 notion of load-acquire and store-release.
However, Armv8 goes well beyond the C11 memory model by mandating that the
combination of a store-release and load-acquire act as a full barrier under certain
15.5.4 Itanium
Itanium offers a weak consistency model, so that in the absence of explicit memory-barrier
instructions or dependencies, Itanium is within its rights to arbitrarily reorder
memory references [Int02a]. Itanium has a memory-fence instruction named mf,
but also has “half-memory fence” modifiers to loads, stores, and to some of its
atomic instructions [Int02b]. The acq modifier prevents subsequent memory-reference
instructions from being reordered before the acq, but permits prior memory-reference
instructions to be reordered after the acq, similar to the Armv8 load-acquire instructions.
Similarly, the rel modifier prevents prior memory-reference instructions from being
reordered after the rel, but allows subsequent memory-reference instructions to be
reordered before the rel.
These half-memory fences are useful for critical sections, since it is safe to push
operations into a critical section, but can be fatal to allow them to bleed out. However, as
one of the few CPUs with this property, Itanium at one time defined Linux’s semantics
of memory ordering associated with lock acquisition and release.22 Oddly enough,
actual Itanium hardware is rumored to implement both load-acquire and store-release
instructions as full barriers. Nevertheless, Itanium was the first mainstream CPU
to introduce the concept (if not the reality) of load-acquire and store-release into its
instruction set.
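The usefulness of half-memory fences for critical sections can be seen in a toy test-and-set spinlock written with C11 atomics. This is only a sketch: struct tas_spinlock and its functions are hypothetical names, and it bears no resemblance to the Linux kernel's actual locking primitives. Acquisition uses acquire ordering and release uses release ordering, so accesses may be pulled into the critical section but may not bleed out of it.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical test-and-set spinlock; not the Linux kernel's implementation. */
struct tas_spinlock {
	atomic_flag held;
};

#define TAS_SPINLOCK_INIT { ATOMIC_FLAG_INIT }

/*
 * Returns true if the lock was acquired.  Acquire ordering keeps the
 * critical section's accesses from moving before the acquisition.
 */
static bool tas_trylock(struct tas_spinlock *l)
{
	return !atomic_flag_test_and_set_explicit(&l->held,
						  memory_order_acquire);
}

/* Spin until the current holder releases the lock. */
static void tas_lock(struct tas_spinlock *l)
{
	while (!tas_trylock(l))
		continue;
}

/*
 * Release ordering keeps the critical section's accesses from moving
 * after the release.
 */
static void tas_unlock(struct tas_spinlock *l)
{
	atomic_flag_clear_explicit(&l->held, memory_order_release);
}
```

A caller would declare a lock with "struct tas_spinlock l = TAS_SPINLOCK_INIT;" and bracket each critical section with tas_lock(&l) and tas_unlock(&l).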
Quick Quiz 15.47: Given that hardware can have a half memory barrier, why don’t locking
primitives allow the compiler to move memory-reference instructions into lock-based critical
sections?
The Itanium mf instruction is used for the smp_rmb(), smp_mb(), and smp_wmb()
primitives in the Linux kernel. Despite persistent rumors to the contrary, the “mf”
mnemonic stands for “memory fence”.
Itanium also offers a global total order for release operations, including the mf
instruction. This provides the notion of transitivity, where if a given code fragment sees
a given access as having happened, any later code fragment will also see that earlier
access as having happened. Assuming, that is, that all the code fragments involved
correctly use memory barriers.
Finally, Itanium is the only architecture supporting the Linux kernel that can reorder
normal loads to the same variable. The Linux kernel avoids this issue because READ_ONCE()
emits a volatile load, which is compiled as a ld,acq instruction, which
forces ordering of all READ_ONCE() invocations by a given CPU, including those to the
same variable.
15.5.5 MIPS
The MIPS memory model [Wav16, page 479] appears to resemble that of Arm, Itanium,
and POWER, being weakly ordered by default, but respecting dependencies. MIPS
has a wide variety of memory-barrier instructions, but ties them not to hardware
considerations, but rather to the use cases provided by the Linux kernel and the C++11
standard [Smi19] in a manner similar to the Armv8 additions:
SYNC
Full barrier for a number of hardware operations in addition to memory references,
which is used to implement the v4.13 Linux kernel’s smp_mb() for OCTEON
systems.
SYNC_WMB
Write memory barrier, which can be used on OCTEON systems to implement the
smp_wmb() primitive in the v4.13 Linux kernel via the syncw mnemonic. Other
systems use plain sync.
SYNC_MB
Full memory barrier, but only for memory operations. This may be used to
implement the C++ atomic_thread_fence(memory_order_seq_cst).
SYNC_ACQUIRE
Acquire memory barrier, which could be used to implement C++’s
atomic_thread_fence(memory_order_acquire). In theory, it could also be used
to implement the v4.13 Linux-kernel smp_load_acquire() primitive, but in
practice sync is used instead.
SYNC_RELEASE
Release memory barrier, which may be used to implement C++’s
atomic_thread_fence(memory_order_release). In theory, it could also be used to implement
the v4.13 Linux-kernel smp_store_release() primitive, but in practice sync is
used instead.
SYNC_RMB
Read memory barrier, which could in theory be used to implement the smp_rmb()
primitive in the Linux kernel, except that current MIPS implementations supported
by the v4.13 Linux kernel do not need an explicit instruction to force ordering.
Therefore, smp_rmb() instead simply constrains the compiler.
SYNCI
Instruction-cache synchronization, which is used in conjunction with other instruc-
tions to allow self-modifying code, such as that produced by just-in-time (JIT)
compilers.
Informal discussions with MIPS architects indicate that MIPS has a definition of
transitivity or cumulativity similar to that of Arm and POWER. However, it appears
that different MIPS implementations can have different memory-ordering properties, so
it is important to consult the documentation for the specific MIPS implementation you
are using.
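The fence-based use cases named in the list above can be exercised from C11, which spells its fences the same way that C++11 does. The following sketch passes a message using relaxed accesses plus standalone release and acquire fences. The names payload, ready, fence_publish(), and fence_try_consume() are hypothetical, and the sketch makes no assumption about which MIPS instruction a given compiler emits for each fence.

```c
#include <stdatomic.h>
#include <stdbool.h>

static atomic_int payload; /* hypothetical shared data */
static atomic_int ready;   /* hypothetical "data is valid" flag */

/* Producer: relaxed stores separated by a release fence. */
static void fence_publish(int value)
{
	atomic_store_explicit(&payload, value, memory_order_relaxed);
	atomic_thread_fence(memory_order_release);
	atomic_store_explicit(&ready, 1, memory_order_relaxed);
}

/*
 * Consumer: returns false if nothing has been published yet.  Otherwise
 * the acquire fence orders the flag load before the payload load, so
 * *value receives the published payload.
 */
static bool fence_try_consume(int *value)
{
	if (!atomic_load_explicit(&ready, memory_order_relaxed))
		return false;
	atomic_thread_fence(memory_order_acquire);
	*value = atomic_load_explicit(&payload, memory_order_relaxed);
	return true;
}
```

In a real program fence_publish() and fence_try_consume() would run in different threads, with the consumer polling until it returns true.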
be read from that register. However, the heavier-weight “membar #MemIssue” must
be used when a write to a given MMIO register affects the value that will next be read
from some other MMIO register.
SPARC requires a flush instruction be used between the time that the instruction
stream is modified and the time that any of these instructions are executed [SPA94].
This is needed to flush any prior value for that location from the SPARC’s instruction
cache. Note that flush takes an address, and will flush only that address from the
instruction cache. On SMP systems, all CPUs’ caches are flushed, but there is no
convenient way to determine when the off-CPU flushes complete, though there is a
reference to an implementation note.
But again, the Linux kernel runs SPARC in TSO mode, so all of the above membar
variants are strictly of historical interest. In particular, the smp_mb() primitive only
needs to use #StoreLoad because the other three reorderings are prohibited by TSO.
15.5.8 x86
Historically, the x86 CPUs provided “processor ordering” so that all CPUs agreed on the
order of a given CPU’s writes to memory. This allowed the smp_wmb() primitive to
be a no-op for the CPU [Int04b]. Of course, a compiler directive was also required to
prevent optimizations that would reorder across the smp_wmb() primitive. In ancient
times, certain x86 CPUs gave no ordering guarantees for loads, so the smp_mb() and
smp_rmb() primitives expanded to lock;addl. This atomic instruction acts as a
barrier to both loads and stores.
But those were ancient times. More recently, Intel has published a memory model
for x86 [Int07]. It turns out that Intel’s modern CPUs enforce tighter ordering than
was claimed in the previous specifications, so this model simply mandates this modern
behavior. Even more recently, Intel published an updated memory model for x86 [Int11,
Section 8.2], which mandates a total global order for stores, although individual CPUs
are still permitted to see their own stores as having happened earlier than this total
global order would indicate. This exception to the total ordering is needed to allow
important hardware optimizations involving store buffers. In addition, x86 provides
other-multicopy atomicity, for example, so that if CPU 0 sees a store by CPU 1, then
CPU 0 is guaranteed to see all stores that CPU 1 saw prior to its store. Software may
use atomic operations to override these hardware optimizations, which is one reason
that atomic operations tend to be more expensive than their non-atomic counterparts.
It is also important to note that atomic instructions operating on a given memory
location should all be of the same size [Int16, Section 8.1.2.2]. For example, if you write
a program where one CPU atomically increments a byte while another CPU executes a
4-byte atomic increment on that same location, you are on your own.
Some SSE instructions are weakly ordered (clflush and non-temporal move
instructions [Int04a]). Code that uses these non-temporal move instructions can
also use mfence for smp_mb(), lfence for smp_rmb(), and sfence for smp_wmb().
A few older variants of the x86 CPU have a mode bit that enables out-of-order stores,
and for these CPUs, smp_wmb() must also be defined to be lock;addl.
Although newer x86 implementations accommodate self-modifying code without
any special instructions, to be fully compatible with past and potential future x86
implementations, a given CPU must execute a jump instruction or a serializing instruction
(e.g., cpuid) between modifying the code and executing it [Int11, Section 8.1.3].
15.5.9 z Systems
The z Systems machines make up the IBM mainframe family, previously known as
the 360, 370, 390 and zSeries [Int04c]. Parallelism came late to z Systems, but
given that these mainframes first shipped in the mid-1960s, this is not saying much.
The “bcr 15,0” instruction is used for the Linux smp_mb() primitive, but compiler
constraints suffice for both the smp_rmb() and smp_wmb() primitives. The z Systems
architecture also has strong memory-ordering semantics, as shown in Table 15.5. In particular, all CPUs will
agree on the order of unrelated stores from different CPUs, that is, the z Systems CPU
family is fully multicopy atomic, and is the only commercially available system with
this property.
As with most CPUs, the z Systems architecture does not guarantee a cache-coherent
instruction stream, hence, self-modifying code must execute a serializing instruction
between updating the instructions and executing them. That said, many actual z Systems
machines do in fact accommodate self-modifying code without serializing instruc-
tions. The z Systems instruction set provides a large set of serializing instructions,
including compare-and-swap, some types of branches (for example, the aforementioned
“bcr 15,0” instruction), and test-and-set.
15.6 Memory-Model Intuitions
This section revisits Table 15.3 and Section 15.1.3, summarizing the intervening
discussion with some appeals to transitive intuitions and with more sophisticated rules
of thumb.
But first, it is necessary to review the temporal and non-temporal nature of communi-
cation from one thread to another when using memory as the communications medium,
as was discussed in detail in Section 15.2.7. The key point is that although loads and
stores are conceptually simple, on real multicore hardware significant periods of time
are required for their effects to become visible to all other threads.
The simple and intuitive case occurs when one thread loads a value that some other
thread stored. This straightforward cause-and-effect case exhibits temporal behavior, so
that the software can safely assume that the store instruction completed before the load
instruction started. In real life, the load instruction might well have started quite some
time before the store instruction did, but all modern hardware must carefully hide such
cases from the software. Software will thus see the expected temporal cause-and-effect
behavior when one thread loads a value that some other thread stores, as discussed in
Section 15.2.7.3.
Figure 15.24: Locking Intuitions (timeline for CPU 0, CPU 1, and CPU 2, each showing
the code before its critical section, its lock acquisition, its critical section, its
unlock, and the code after its critical section)
This temporal behavior provides the basis for the next section’s transitive intuitions.
unlock-lock ordering. The dotted lines emanating from them to the wide green arrows
show the effects on ordering. In particular:
1. The fact that CPU 0’s unlock precedes CPU 1’s lock ensures that any access
executed by CPU 0 within or before its critical section will be seen by accesses
executed by CPU 1 within and after its critical section.
2. The fact that CPU 0’s unlock precedes CPU 2’s lock ensures that any access
executed by CPU 0 within or before its critical section will be seen by accesses
executed by CPU 2 within and after its critical section.
3. The fact that CPU 1’s unlock precedes CPU 2’s lock ensures that any access
executed by CPU 1 within or before its critical section will be seen by accesses
executed by CPU 2 within and after its critical section.
In short, lock-based ordering is transitive through CPUs 0, 1, and 2. A key point
is that this ordering extends beyond the critical sections, so that everything before an
earlier lock release is seen by everything after a later lock acquisition.
For those who prefer words to diagrams, code holding a given lock will see the
accesses in all prior critical sections for that same lock, transitively. And if such code
sees the accesses in a given critical section, it will also see the accesses in all of that
CPU’s code preceding that critical section. In other words, when a CPU releases a given
lock, all of that lock’s subsequent critical sections will see the accesses in all of that
CPU’s code preceding that lock release.
Inversely, code holding a given lock will be protected from seeing the accesses in any
subsequent critical sections for that same lock, again, transitively. And if such code is
protected against seeing the accesses in a given critical section, it will also be protected
against seeing the accesses in all of that CPU’s code following that critical section.
In other words, when a CPU acquires a given lock, all of that lock’s previous critical
sections will be protected from seeing the accesses in all of that CPU’s code following
that lock acquisition.
But what does it mean to “see accesses” and exactly what accesses are seen?
To start, an access is either a load or a store, possibly occurring as part of a
read-modify-write operation.
If a CPU’s code prior to its release of a given lock contains an access A to a given
variable, then for an access B to that same variable contained in any CPU’s code
following a later acquisition of that same lock:
1. If A and B are both loads, then B will return either the same value that A did or
some later value.
2. If A is a load and B is a store, then B will overwrite either the value loaded by A or
some later value.
3. If A is a store and B is a load, then B will return either the value stored by A or
some later value.
4. If A and B are both stores, then B will overwrite either the value stored by A or
some later value.
Here, “some later value” is shorthand for “the value stored by some intervening
access”.
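The four cases above can be exercised with a small userspace sketch in which several threads increment a plain counter under a POSIX mutex. Each increment is a load followed by a store, so every combination of A and B arises across successive critical sections, and the final count is exact only because each critical section sees the stores of all earlier ones. The names counter_lock, worker(), and run_locked_counter() are hypothetical, as is the cap on the number of threads.

```c
#include <pthread.h>
#include <stddef.h>

#define MAX_WORKERS 16 /* arbitrary cap for this sketch */

static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;
static long counter;          /* protected by counter_lock */
static long iters_per_worker; /* set before any worker starts */

static void *worker(void *arg)
{
	(void)arg;
	for (long i = 0; i < iters_per_worker; i++) {
		pthread_mutex_lock(&counter_lock);
		counter++; /* load, then store, both inside the critical section */
		pthread_mutex_unlock(&counter_lock);
	}
	return NULL;
}

/* Runs nworkers threads and returns the final count, or -1 on error. */
static long run_locked_counter(int nworkers, long iters)
{
	pthread_t tid[MAX_WORKERS];
	int started = 0;

	if (nworkers < 1 || nworkers > MAX_WORKERS)
		return -1;
	counter = 0;
	iters_per_worker = iters;
	for (int i = 0; i < nworkers; i++) {
		if (pthread_create(&tid[i], NULL, worker, NULL) != 0)
			break;
		started++;
	}
	for (int i = 0; i < started; i++)
		pthread_join(tid[i], NULL);
	return started == nworkers ? counter : -1;
}
```

For example, run_locked_counter(4, 10000) should return 40000. Older toolchains may need the -pthread option when building.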
Locking is strongly intuitive, which is one reason why it has survived so many
attempts to eliminate it. This is also one reason why you should use it where it applies.
Figure 15.25: Release-Acquire Intuitions (timeline in which CPU 0 releases A, CPU 1
acquires A and later releases B, and CPU 2 acquires B, with the code before and after
each release and acquire marked)
1. The fact that CPU 0’s release of A is read by CPU 1’s acquire of A ensures that
any accesses executed by CPU 0 prior to its release will be seen by any accesses
executed by CPU 1 after its acquire.
2. The fact that CPU 1’s release of B is read by CPU 2’s acquire of B ensures that
any accesses executed by CPU 1 prior to its release will be seen by any accesses
executed by CPU 2 after its acquire.
3. Note also that CPU 0’s release of A is read by CPU 1’s acquire of A, which
precedes CPU 1’s release of B, which is read by CPU 2’s acquire of B. Taken
together, all this ensures that any accesses executed by CPU 0 prior to its release
will be seen by any accesses executed by CPU 2 after its acquire.
For those who prefer words to diagrams, when an acquire loads the value stored by a
release, discussed in Section 15.2.7.4, then the code following that acquire will see all
accesses preceding the release. More precisely, if CPU 0 does an acquire that loads the
value stored by CPU 1’s release, then all the subsequent accesses executed by CPU 0
will see all of CPU 1’s accesses prior to its release.
Similarly, the accesses preceding that release access will be protected from seeing
the accesses following the acquire access. (More precision is left as an exercise to the
reader.)
Releases and acquires can be chained, for example, CPU 0’s release stores the value
loaded by CPU 1’s acquire, a later release by CPU 1 stores the value loaded by CPU 2’s
acquire, and so on. The accesses following a given acquire will see the accesses
preceding each prior release in the chain, and, inversely, the accesses preceding a given
release will be protected from seeing the accesses following each later acquire in the
chain. Some long-chain examples are illustrated by Listings 15.22, 15.23, and 15.24.
The seeing and not seeing of accesses works the same way as described in Sec-
tion 15.6.1.2.
However, as illustrated by Listing 15.27, the acquire access must load exactly what
was stored by the release access. Any intervening store that is not itself part of that
same release-acquire chain will break the chain.
Nevertheless, properly constructed release-acquire chains are transitive, intuitive, and
useful.
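A three-thread chain of this kind can be sketched with C11 atomics and POSIX threads. The first thread writes a plain variable and releases a, the second acquires a and releases b, and the third acquires b and then reads the plain variable, which the chain guarantees to be the first thread's value. All names here (x, a, b, seen, the chain_t functions, and run_chain()) are hypothetical, and the spin loops are kept deliberately naive.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

static int x;        /* plain variable written before the first release */
static atomic_int a; /* first link:  thread 0 to thread 1 */
static atomic_int b; /* second link: thread 1 to thread 2 */
static int seen;     /* what thread 2 read from x */

static void *chain_t0(void *arg)
{
	(void)arg;
	x = 1;
	atomic_store_explicit(&a, 1, memory_order_release);
	return NULL;
}

static void *chain_t1(void *arg)
{
	(void)arg;
	while (!atomic_load_explicit(&a, memory_order_acquire))
		continue; /* wait for thread 0's release */
	atomic_store_explicit(&b, 1, memory_order_release);
	return NULL;
}

static void *chain_t2(void *arg)
{
	(void)arg;
	while (!atomic_load_explicit(&b, memory_order_acquire))
		continue; /* wait for thread 1's release */
	seen = x; /* ordered after thread 0's store to x by the chain */
	return NULL;
}

/* Returns the value of x seen at the end of the chain, or -1 on error. */
static int run_chain(void)
{
	/* Start the waiters first so that they really do have to wait. */
	void *(*fn[3])(void *) = { chain_t2, chain_t1, chain_t0 };
	pthread_t tid[3];
	int started = 0;

	x = 0;
	seen = -1;
	atomic_store(&a, 0);
	atomic_store(&b, 0);
	for (int i = 0; i < 3; i++) {
		if (pthread_create(&tid[i], NULL, fn[i], NULL) != 0)
			break;
		started++;
	}
	if (started < 3) { /* unblock any spinners before joining them */
		atomic_store(&a, 1);
		atomic_store(&b, 1);
	}
	for (int i = 0; i < started; i++)
		pthread_join(tid[i], NULL);
	return started == 3 ? seen : -1;
}
```

Here run_chain() should return 1. Replacing either acquire load with a relaxed load would break the chain, after which the read of x would be a data race.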
The resulting program will be fully ordered, if somewhat slow. Such programs will
be sequentially consistent and much loved by formal-verification experts who specialize
in tried-and-true 1980s proof techniques. But slow or not, smp_mb() is always there
when you need it!
Nevertheless, there are situations that cannot be addressed by these intuitive ap-
proaches. The next section therefore presents a more complete, if less transitive, set of
rules of thumb.
24 Hobbyists and researchers should of course feel free to ignore this and many other
cautions.
your compiler will not order anything. The two threads sharing the sole non-store-to-load
link can sometimes substitute WRITE_ONCE() plus smp_wmb() for smp_store_release()
on the one hand, and READ_ONCE() plus smp_rmb() for smp_load_acquire()
on the other. However, the wise developer will check such substitutions
carefully, for example, using the herd tool as described in Section 12.3.
Quick Quiz 15.48: Why is it necessary to use heavier-weight ordering for load-to-store and
store-to-store links, but not for store-to-load links? What on earth makes store-to-load links so
special???
The fourth and final rule of thumb identifies where full memory barriers (or stronger)
are required: If a given cycle contains two or more non-store-to-load links (that is,
a total of two or more links that are either load-to-store or store-to-store links), you
will need at least one full barrier between each pair of non-store-to-load links in that
cycle, as illustrated by Listing 15.19 as well as in the answer to Quick Quiz 15.25.
Full barriers include smp_mb(), successful full-strength non-void atomic RMW
operations, and other atomic RMW operations in conjunction with either
smp_mb__before_atomic() or smp_mb__after_atomic(). Any of RCU’s grace-period-wait
primitives (synchronize_rcu() and friends) also act as full barriers, but at far greater
expense than smp_mb(). With strength comes expense, though full barriers usually
hurt performance more than they hurt scalability. The extreme logical endpoint of this
rule of thumb underlies the fully ordered intuitions presented in Section 15.6.1.5.
Recapping the rules:
1. Memory-ordering operations are required only if at least two variables are shared
by at least two threads.
2. If all links in a cycle are store-to-load links, then minimal ordering suffices.
3. If all but one of the links in a cycle are store-to-load links, then each store-to-load
link may use a release-acquire pair.
4. Otherwise, at least one full barrier is required between each pair of non-store-to-load
links.
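The fourth rule can be illustrated by the classic store-buffering pattern, in which each of two threads stores to one variable and then loads from the other. If both loads returned zero, each load would have been overwritten by the other thread's store, giving a cycle with two load-to-store links, so each thread needs a full barrier between its store and its load. The userspace sketch below uses a sequentially consistent C11 fence as a rough analog of smp_mb(). The names sb_x, sb_y, sb_t0(), sb_t1(), and sb_count_both_zero() are hypothetical.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

static atomic_int sb_x, sb_y;
static int sb_r0, sb_r1; /* each written by one thread, read after join */

static void *sb_t0(void *arg)
{
	(void)arg;
	atomic_store_explicit(&sb_x, 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst); /* analog of smp_mb() */
	sb_r0 = atomic_load_explicit(&sb_y, memory_order_relaxed);
	return NULL;
}

static void *sb_t1(void *arg)
{
	(void)arg;
	atomic_store_explicit(&sb_y, 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst); /* analog of smp_mb() */
	sb_r1 = atomic_load_explicit(&sb_x, memory_order_relaxed);
	return NULL;
}

/*
 * Runs the test "rounds" times.  Returns the number of rounds in which
 * both loads returned zero, or -1 if a thread could not be created.
 */
static int sb_count_both_zero(int rounds)
{
	int both_zero = 0;

	for (int i = 0; i < rounds; i++) {
		pthread_t t0, t1;

		atomic_store(&sb_x, 0);
		atomic_store(&sb_y, 0);
		if (pthread_create(&t0, NULL, sb_t0, NULL) != 0)
			return -1;
		if (pthread_create(&t1, NULL, sb_t1, NULL) != 0) {
			pthread_join(t0, NULL);
			return -1;
		}
		pthread_join(t0, NULL);
		pthread_join(t1, NULL);
		if (sb_r0 == 0 && sb_r1 == 0)
			both_zero++;
	}
	return both_zero;
}
```

With both fences in place, sb_count_both_zero() should return zero for any number of rounds. Removing either fence permits the both-zero outcome, even on x86, courtesy of its store buffers.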
Creating a perfect API is like committing the perfect
crime. There are at least fifty things that can go
wrong, and if you are a genius, you might be able to
anticipate twenty-five of them.
With apologies to any Kathleen Turner fans who
might still be alive.
Chapter 16
Ease of Use
If you are tempted to look down on ease-of-use requirements, please consider that an
ease-of-use bug in Linux-kernel RCU resulted in an exploitable Linux-kernel security
bug in a use of RCU [McK19a]. It is therefore clearly important that even in-kernel
APIs be easy to use.
Unfortunately, “easy” is a relative term. For example, many people would consider
a 15-hour airplane flight to be a bit of an ordeal—unless they stopped to consider
alternative modes of transportation, especially swimming. This means that creating an
easy-to-use API requires that you understand your intended users well enough to know
what is easy for them. Which might or might not have anything to do with what is easy
for you.
The following question illustrates this point: “Given a randomly chosen person among
everyone alive today, what one change would improve that person’s life?”
There is no single change that would be guaranteed to help everyone’s life. After all,
there is an extremely wide range of people, with a correspondingly wide range of needs,
wants, desires, and aspirations. A starving person might need food, but additional food
might well hasten the death of a morbidly obese person. The high level of excitement so
fervently desired by many young people might well be fatal to someone recovering from
a heart attack. Information critical to the success of one person might contribute to the
failure of someone suffering from information overload. In short, if you are working on
a software project that is intended to help people you know nothing about, you should
not be surprised when those people find fault with your project.
If you really want to help a given group of people, there is simply no substitute for
working closely with them over an extended period of time, as in years. Nevertheless,
there are some simple things that you can do to increase the odds of your users being
happy with your software, and some of these things are covered in the next section.
This section is adapted from portions of Rusty Russell’s 2003 Ottawa Linux Symposium
keynote address [Rus03, Slides 39–57]. Rusty’s key point is that the goal should not
be merely to make an API easy to use, but rather to make the API hard to misuse.
To that end, Rusty proposed his “Rusty Scale” in decreasing order of this important
hard-to-misuse property.
The following list attempts to generalize the Rusty Scale beyond the Linux kernel:
1. It is impossible to get wrong. Although this is the standard to which all API
designers should strive, only the mythical dwim()1 command manages to come
close.
2. The compiler or linker won’t let you get it wrong.
3. The compiler or linker will warn you if you get it wrong. BUILD_BUG_ON() is
your users’ friend.
4. The simplest use is the correct one.
5. The name tells you how to use it. But names can be two-edged swords. Although
rcu_read_lock() is plain enough for someone converting code from reader-
writer locking, it might cause some consternation for someone converting code
from reference counting.
6. Do it right or it will always break at runtime. WARN_ON_ONCE() is your users’
friend.
7. Follow common convention and you will get it right. The malloc() library
function is a good example. Although it is easy to get memory allocation wrong,
a great many projects do manage to get it right, at least most of the time. Using
malloc() in conjunction with Valgrind [The11] moves malloc() almost up to
the “do it right or it will always break at runtime” point on the scale.
8. Read the documentation and you will get it right.
12. Read the implementation and you will get it wrong. The original non-CONFIG_
PREEMPT implementation of rcu_read_lock() [McK07a] is an infamous exam-
ple of this point on the scale.
16.3. SHAVING THE MANDELBROT SET 571
13. Read the documentation and you will get it wrong. For example, the DEC Alpha
wmb instruction’s documentation [Cor02] fooled a number of developers into
thinking that this instruction had much stronger memory-order semantics than it
actually does. Later documentation clarified this point [Com01, Pug00], moving
the wmb instruction up to the “read the documentation and you will get it right”
point on the scale.
14. Follow common convention and you will get it wrong. The printf() statement
is an example of this point on the scale because developers almost always fail to
check printf()’s error return.
20. It is impossible to get right. The gets() function is a famous example of this point
on the scale. In fact, gets() can perhaps best be described as an unconditional
buffer-overflow security hole.
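A few points on the scale can be illustrated in user-space C. The following is only a minimal sketch under stated assumptions, not kernel code: C11 `_Static_assert` stands in for BUILD_BUG_ON(), and the names `RING_SIZE`, `ring_index()`, and `read_line()` are hypothetical, introduced purely for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical ring size: the mask trick below needs a power of two. */
#define RING_SIZE 64

/*
 * "The compiler or linker won't let you get it wrong": the build
 * fails if someone changes RING_SIZE to a non-power-of-two value.
 */
_Static_assert((RING_SIZE & (RING_SIZE - 1)) == 0,
               "RING_SIZE must be a power of two");

/* Map a sequence number to a ring slot. */
unsigned ring_index(unsigned seq)
{
    return seq & (RING_SIZE - 1);
}

/*
 * Unlike gets(), fgets() is told the buffer size, so an overlong
 * input line is truncated instead of overflowing the buffer.
 * Returns 0 on success, -1 on EOF, error, or a zero-sized buffer.
 */
int read_line(FILE *fp, char *buf, size_t size)
{
    if (size == 0 || fgets(buf, (int)size, fp) == NULL)
        return -1;
    buf[strcspn(buf, "\n")] = '\0';  /* Strip the newline, if any. */
    return 0;
}
```

The compile-time check costs nothing at runtime, which is part of why the upper end of the scale is so attractive when it is reachable.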
The set of useful programs resembles the Mandelbrot set (shown in Figure 16.1) in
that it does not have a clear-cut smooth boundary—if it did, the halting problem would
be solvable. But we need APIs that real people can use, not ones that require a Ph.D.
dissertation be completed for each and every potential use. So, we “shave the Mandelbrot
set”,2 restricting the use of the API to an easily described subset of the full set of
potential uses.
Such shaving may seem counterproductive. After all, if an algorithm works, why
shouldn’t it be used?
To see why at least some shaving is absolutely necessary, consider a locking design
that avoids deadlock, but in perhaps the worst possible way. This design uses a circular
doubly linked list, which contains one element for each thread in the system along with
a header element. When a new thread is spawned, the parent thread must insert a new
element into this list, which requires some sort of synchronization.
One way to protect the list is to use a global lock. However, this might be a bottleneck
if threads were being created and deleted frequently.3 Another approach would be to
use a hash table and to lock the individual hash buckets, but this can perform poorly
when scanning the list in order.
A third approach is to lock the individual list elements, and to require the locks for
both the predecessor and successor to be held during the insertion. Since both locks
must be acquired, we need to decide which order to acquire them in. Two conventional
approaches would be to acquire the locks in address order, or to acquire them in the
order that they appear in the list, so that the header is always acquired first when it is
one of the two elements being locked. However, both of these methods require special
checks and branches.
The to-be-shaven solution is to unconditionally acquire the locks in list order. But
what about deadlock?
Deadlock cannot occur.
To see this, number the elements in the list starting with zero for the header up to
𝑁 for the last element in the list (the one preceding the header, given that the list is
circular). Similarly, number the threads from zero to 𝑁 − 1. If each thread attempts to
lock some consecutive pair of elements, at least one of the threads is guaranteed to be
able to acquire both locks.
Why?
Because there are not enough threads to reach all the way around the list. Suppose
thread 0 acquires element 0’s lock. To be blocked, some other thread must have already
acquired element 1’s lock, so let us assume that thread 1 has done so. Similarly, for
thread 1 to be blocked, some other thread must have acquired element 2’s lock, and so
on, up through thread 𝑁 − 1, which acquires element 𝑁 − 1’s lock. For thread 𝑁 − 1 to be
blocked, some other thread must have acquired element 𝑁’s lock. But there are no more
threads, and so thread 𝑁 − 1 cannot be blocked. Therefore, deadlock cannot occur.
So why should we prohibit use of this delightful little algorithm?
The fact is that if you really want to use it, we cannot stop you. We can, however,
recommend against such code being included in any project that we care about.
But, before you use this algorithm, please think through the following Quick Quiz.
Quick Quiz 16.1: Can a similar algorithm be used when deleting elements?
The fact is that this algorithm is extremely specialized (it only works on lists of certain
sizes), and also quite fragile. Any bug that accidentally failed to add a node to the list
could result in deadlock. In fact, simply adding the node a bit too late could result in
deadlock, as could increasing the number of threads.
In addition, the other algorithms described above are “good and sufficient”. For
example, simply acquiring the locks in address order is fairly simple and quick, while
allowing the use of lists of any size. Just be careful of the special cases presented by
empty lists and lists containing only one element!
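As a rough illustration of the address-order alternative, here is a minimal sketch using POSIX mutexes. The helper names `lock_pair()` and `unlock_pair()` are hypothetical, and the equal-pointer check is one assumed way of handling the single-element case, in which predecessor and successor are the same element.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

/*
 * Acquire two locks in address order, so that any two threads
 * locking the same pair agree on the acquisition order.
 * Returns 0 on success, otherwise the pthread error code.
 */
int lock_pair(pthread_mutex_t *a, pthread_mutex_t *b)
{
    pthread_mutex_t *first = a;
    pthread_mutex_t *second = b;
    int ret;

    assert(a != NULL && b != NULL);
    if (a == b)  /* Special case: both neighbors are the same element. */
        return pthread_mutex_lock(a);
    if ((uintptr_t)a > (uintptr_t)b) {
        first = b;
        second = a;
    }
    ret = pthread_mutex_lock(first);
    if (ret != 0)
        return ret;
    ret = pthread_mutex_lock(second);
    if (ret != 0)
        pthread_mutex_unlock(first);  /* Back out on failure. */
    return ret;
}

/* Release order does not matter for deadlock avoidance. */
void unlock_pair(pthread_mutex_t *a, pthread_mutex_t *b)
{
    pthread_mutex_unlock(a);
    if (a != b)
        pthread_mutex_unlock(b);
}
```

The comparison and swap are the "special checks and branches" mentioned earlier, but they buy an algorithm that works for any list size and any number of threads.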
Quick Quiz 16.2: Yetch! What ever possessed someone to come up with an algorithm that
deserves to be shaved as much as this one does???
Exceptions aside, we must continue to shave the software “Mandelbrot set” so that
our programs remain maintainable, as shown in Figure 16.2.
Prediction is very difficult, especially about the
future.
Niels Bohr
Chapter 17
This chapter presents some conflicting visions of the future of parallel programming. It
is not clear which of these will come to pass; in fact, it is not clear that any of them will.
They are nevertheless important because each vision has its devoted adherents, and if
enough people believe in something fervently enough, you will need to deal with that
thing’s existence in the form of its influence on the thoughts, words, and deeds of its
adherents. Besides which, one or more of these visions will actually come to pass. But
most are bogus. Tell which is which and you’ll be rich [Spi77]!
Therefore, the following sections give an overview of transactional memory, hardware
transactional memory, formal verification in regression testing, and parallel functional
programming. But first, a cautionary tale on prognostication taken from the early 2000s.
Years past always seem so simple and innocent when viewed through the lens of many
years of experience. And the early 2000s were for the most part innocent of the
impending failure of Moore’s Law to continue delivering the then-traditional increases
in CPU clock frequency. Oh, there were the occasional warnings about the limits
of technology, but such warnings had been sounded for decades. With that in mind,
consider the following scenarios:
576 CHAPTER 17. CONFLICTING VISIONS OF THE FUTURE
17.1. THE FUTURE OF CPU TECHNOLOGY AIN’T WHAT IT USED TO BE 577
Unlikely indeed! But the larger software community was reluctant to accept the
fact that they would need to embrace parallelism, and so it was some time before this
community concluded that the “free lunch” of Moore’s-Law-induced CPU core-clock
frequency increases was well and truly finished. Never forget: Belief is an emotion, not
necessarily the result of a rational technical thought process!
And we all know how this story has played out, with multiple multi-threaded cores
on a single die plugged into a single socket, with varying degrees of optimization for
lower numbers of active threads per core. The question then becomes whether or not
future shared-memory systems will always fit into a single socket.
And the change has been the ever-increasing levels of integration that Moore’s Law is
still providing. But longer term, which will it be? More CPUs per die? Or more I/O,
cache, and memory?
Servers seem to be choosing the former, while embedded systems on a chip (SoCs)
continue choosing the latter.
Figure 17.6: Instructions per Local Memory Reference for Sequent Computers (horizontal axis: Year, 82 through 02)
(Figure, caption not recovered: Breakeven Update Fraction versus Memory-Latency Ratio, with curves for spinlock and RCU)
(Figure, caption not recovered: curves for spinlock, drw, and RCU versus Memory-Latency Ratio)
On the one hand, this passage failed to anticipate the cache-warmth issues that
RCU can suffer from in workloads with significant update intensity, in part because it
seemed unlikely that RCU would really be used for such workloads. In the event,
SLAB_TYPESAFE_BY_RCU has been pressed into service in a number of instances where
these cache-warmth issues would otherwise be problematic, as has sequence locking.
On the other hand, this passage also failed to anticipate that RCU would be used to
reduce scheduling latency or for security.
Much of the data generated for this book was collected on an eight-socket system
with 28 cores per socket and two hardware threads per core, for a total of 448 hardware
threads. The idle-system memory latencies are less than one microsecond, which are
no worse than those of similar-sized systems of the year 2004. Some claim that these
latencies approach a microsecond only because of the x86 CPU family’s relatively
strong memory ordering, but it may be some time before that particular argument is
settled.
The idea of using transactions outside of databases goes back many decades [Lom77,
Kni86, HM93], with the key difference between database and non-database transactions
being that non-database transactions drop the “D” in the “ACID”1 properties defining
database transactions. The idea of supporting memory-based transactions, or “transac-
tional memory” (TM), in hardware is more recent [HM93], but unfortunately, support
for such transactions in commodity hardware was not immediately forthcoming, despite
other somewhat similar proposals being put forward [SSHT93]. Not long after, Shavit
and Touitou proposed a software-only implementation of transactional memory (STM)
that was capable of running on commodity hardware, give or take memory-ordering
issues [ST95]. This proposal languished for many years, perhaps due to the fact that the
research community’s attention was absorbed by non-blocking synchronization (see
Section 14.2).
But by the turn of the century, TM started receiving more attention [MT01, RG01],
and by the middle of the decade, the level of interest can only be termed “incandes-
cent” [Her05, Gro07], with only a few voices of caution [BLM05, MMW07].
The basic idea behind TM is to execute a section of code atomically, so that other
threads see no intermediate state. As such, the semantics of TM could be implemented
by simply replacing each transaction with a recursively acquirable global lock acquisition
and release, albeit with abysmal performance and scalability. Much of the complexity
inherent in TM implementations, whether hardware or software, lies in efficiently detecting
when concurrent transactions can safely run in parallel. Because this detection is done
dynamically, conflicting transactions can be aborted or “rolled back”, and in some
implementations, this failure mode is visible to the programmer.
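The global-lock equivalence mentioned above can be sketched directly. This is a minimal illustration of the semantics only, with the expected abysmal scalability, and is not how any real TM implementation works. The names begin_trans() and end_trans() are borrowed from this chapter's pseudo-code, while `transfer()`, `transfer_twice()`, and the two balances are hypothetical.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <pthread.h>

static pthread_mutex_t trans_lock;
static pthread_once_t trans_once = PTHREAD_ONCE_INIT;

/* Make the global lock recursive so that transactions can nest. */
static void trans_lock_init(void)
{
    pthread_mutexattr_t attr;

    pthread_mutexattr_init(&attr);
    pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
    pthread_mutex_init(&trans_lock, &attr);
    pthread_mutexattr_destroy(&attr);
}

void begin_trans(void)
{
    int ret;

    pthread_once(&trans_once, trans_lock_init);
    ret = pthread_mutex_lock(&trans_lock);
    assert(ret == 0);
    (void)ret;
}

void end_trans(void)
{
    int ret = pthread_mutex_unlock(&trans_lock);

    assert(ret == 0);
    (void)ret;
}

static long balance_a = 100;
static long balance_b;

/* Other threads using begin_trans() never see a partial transfer. */
void transfer(long amount)
{
    begin_trans();
    balance_a -= amount;
    balance_b += amount;
    end_trans();
}

/* Nested transactions: the recursive lock is simply re-acquired. */
void transfer_twice(long amount)
{
    begin_trans();
    transfer(amount);
    transfer(amount);
    end_trans();
}
```

Because every transaction serializes on one lock, this sketch never needs to roll anything back, which also means it never runs two transactions in parallel.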
Because transaction roll-back is increasingly unlikely as transaction size decreases,
TM might become quite attractive for small memory-based operations, such as linked-list
manipulations used for stacks, queues, hash tables, and search trees. However, it is
currently much more difficult to make the case for large transactions, particularly those
containing non-memory operations such as I/O and process creation. The following
sections look at current challenges to the grand vision of “Transactional Memory
Everywhere” [McK09b]. Section 17.2.1 examines the challenges faced interacting
with the outside world, Section 17.2.2 looks at interactions with process modification
1 Atomicity, consistency, isolation, and durability.
17.2. TRANSACTIONAL MEMORY 583
Many computer users feel that input and output are not actually part of “real
programming,” they are merely things that (unfortunately) must be done in
order to get information in and out of the machine.
Whether or not we believe that input and output are “real programming”, the fact
is that software absolutely must deal with the outside world. This section therefore
critiques transactional memory’s outside-world capabilities, focusing on I/O operations,
time delays, and persistent storage.
1. Restrict I/O within transactions to buffered I/O with in-memory buffers. These
buffers may then be included in the transaction in the same way that any other
memory location might be included. This seems to be the mechanism of choice,
and it does work well in many common situations, such as stream I/O
and mass-storage I/O. However, special handling is required in cases where
multiple record-oriented output streams are merged onto a single file from multiple
processes, as might be done using the “a+” option to fopen() or the O_APPEND
flag to open(). In addition, as will be seen in the next section, common networking
operations cannot be handled via buffering.
2. Prohibit I/O within transactions, so that any attempt to execute an I/O operation
aborts the enclosing transaction (and perhaps multiple nested transactions). This
approach seems to be the conventional TM approach for unbuffered I/O, but
requires that TM interoperate with other synchronization primitives tolerating I/O.
3. Prohibit I/O within transactions, but enlist the compiler’s aid in enforcing this
prohibition.
4. Permit only one special irrevocable transaction [SMS08] to proceed at any given
time, thus allowing irrevocable transactions to contain I/O operations.2 This works
2 In earlier literature, irrevocable transactions are termed inevitable transactions.
in general, but severely limits the scalability and performance of I/O operations.
Given that scalability and performance are first-class goals of parallelism, this
approach’s generality seems a bit self-limiting. Worse yet, use of irrevocability to
tolerate I/O operations seems to greatly restrict use of manual transaction-abort
operations.3 Finally, if there is an irrevocable transaction manipulating a given
data item, any other transaction manipulating that same data item cannot have
non-blocking semantics.
5. Create new hardware and protocols such that I/O operations can be pulled into the
transactional substrate. In the case of input operations, the hardware would need
to correctly predict the result of the operation, and to abort the transaction if the
prediction failed.
I/O operations are a well-known weakness of TM, and it is not clear that the problem
of supporting I/O in transactions has a reasonable general solution, at least if “reasonable”
is to include usable performance and scalability. Nevertheless, continued time and
attention to this problem will likely produce additional progress.
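The buffered-I/O option (the first in the list above) might look something like the following sketch. It assumes that the buffer is ordinary memory that the TM system would roll back along with everything else. Here the rollback is modeled by an explicit abort function, and all names (`struct txn_out`, `txn_out_write()`, and so on) are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define TXN_BUF_SIZE 256

/* Output produced inside a transaction accumulates here. */
struct txn_out {
    char buf[TXN_BUF_SIZE];
    size_t len;
};

void txn_out_begin(struct txn_out *t)
{
    t->len = 0;
}

/* Append to the in-memory buffer: 0 on success, -1 if it is full. */
int txn_out_write(struct txn_out *t, const char *s)
{
    size_t n = strlen(s);

    if (n > TXN_BUF_SIZE - t->len)
        return -1;
    memcpy(t->buf + t->len, s, n);
    t->len += n;
    return 0;
}

/* Abort: discard the buffered output, so no I/O ever happened. */
void txn_out_abort(struct txn_out *t)
{
    t->len = 0;
}

/* Commit: only now is the real I/O operation carried out. */
int txn_out_commit(struct txn_out *t, FILE *fp)
{
    size_t n;
    int ok;

    assert(fp != NULL);
    n = fwrite(t->buf, 1, t->len, fp);
    ok = (n == t->len) && (fflush(fp) == 0);
    t->len = 0;
    return ok ? 0 : -1;
}
```

Note that this sketch covers output only. Input whose value the transaction depends on cannot be deferred in this way, which is where the more troublesome cases discussed next come from.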
begin_trans();
rpc_request();
i = rpc_response();
a[i]++;
end_trans();
The transaction’s memory footprint cannot be determined until after the RPC
response is received, and until the transaction’s memory footprint can be determined, it
is impossible to determine whether the transaction can be allowed to commit. The only
action consistent with transactional semantics is therefore to unconditionally abort the
transaction, which is, to say the least, unhelpful.
Here are some options available to TM:
1. Prohibit RPC within transactions, so that any attempt to execute an RPC operation
aborts the enclosing transaction (and perhaps multiple nested transactions). Alter-
natively, enlist the compiler to enforce RPC-free transactions. This approach does
work, but will require TM to interact with other synchronization primitives.
3 This difficulty was pointed out by Michael Factor. To see the problem, think through
2. Permit only one special irrevocable transaction [SMS08] to proceed at any given
time, thus allowing irrevocable transactions to contain RPC operations. This works
in general, but severely limits the scalability and performance of RPC operations.
Given that scalability and performance are first-class goals of parallelism, this
approach’s generality seems a bit self-limiting. Furthermore, use of irrevocable
transactions to permit RPC operations restricts manual transaction-abort operations
once the RPC operation has started. Finally, if there is an irrevocable transaction
manipulating a given data item, any other transaction manipulating that same data
item must have blocking semantics.
3. Identify special cases where the success of the transaction may be determined
before the RPC response is received, and automatically convert these to irrevocable
transactions immediately before sending the RPC request. Of course, if several
concurrent transactions attempt RPC calls in this manner, it might be necessary
to roll all but one of them back, with consequent degradation of performance
and scalability. This approach nevertheless might be valuable given long-running
transactions ending with an RPC. This approach must still restrict manual
transaction-abort operations.
4. Identify special cases where the RPC response may be moved out of the transaction,
and then proceed using techniques similar to those used for buffered I/O.
5. Extend the transactional substrate to include the RPC server as well as its client. This
is in theory possible, as has been demonstrated by distributed databases. However,
it is unclear whether the requisite performance and scalability requirements can be
met by distributed-database techniques, given that memory-based TM has no slow
disk drives behind which to hide such latencies. Of course, given the advent of
solid-state disks, it is also quite possible that databases will need to redesign their
approach to latency hiding.
As noted in the prior section, I/O is a known weakness of TM, and RPC is simply an
especially problematic case of I/O.
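For the special case in which the RPC can be moved out of the transaction (the fourth option in the list above), the earlier pseudo-code might be restructured roughly as follows. This is only a sketch under several assumptions: a plain mutex stands in for the TM system, `rpc_round_trip()` is a hypothetical local stub rather than a real RPC, and the array size `NELEM` is invented for illustration.

```c
#include <assert.h>
#include <pthread.h>

#define NELEM 8

static int a[NELEM];

/* Stand-in for a real TM system: one global lock. */
static pthread_mutex_t trans_lock = PTHREAD_MUTEX_INITIALIZER;

void begin_trans(void)
{
    int ret = pthread_mutex_lock(&trans_lock);

    assert(ret == 0);
    (void)ret;
}

void end_trans(void)
{
    int ret = pthread_mutex_unlock(&trans_lock);

    assert(ret == 0);
    (void)ret;
}

/* Hypothetical stub: a real version would send a request and wait. */
int rpc_round_trip(int request)
{
    return request % NELEM;
}

/*
 * The RPC completes before the transaction begins, so the
 * transaction's memory footprint (a[i]) is known up front.
 * Returns the index updated, or -1 if the response was out of range.
 */
int update_from_rpc(int request)
{
    int i = rpc_round_trip(request);  /* Outside the transaction. */

    if (i < 0 || i >= NELEM)
        return -1;
    begin_trans();
    a[i]++;
    end_trans();
    return i;
}
```

The price is that the RPC and the update are no longer a single atomic unit, so this restructuring applies only when the application can tolerate that.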
1. Ignore time delays within transactions. This has an appearance of elegance, but
like too many other “elegant” solutions, fails to survive first contact with legacy
code. Such code, which might well have important time delays in critical sections,
would fail upon being transactionalized.
17.2.1.4 Persistence
There are many different types of locking primitives. One interesting distinction is
persistence, in other words, whether the lock can exist independently of the address
space of the process using the lock.
Non-persistent locks include pthread_mutex_lock(), pthread_rwlock_
rdlock(), and most kernel-level locking primitives. If the memory locations in-
stantiating a non-persistent lock’s data structures disappear, so does the lock. For typical
use of pthread_mutex_lock(), this means that when the process exits, all of its locks
vanish. This property can be exploited in order to trivialize lock cleanup at program
shutdown time, but makes it more difficult for unrelated applications to share locks, as
such sharing requires the applications to share memory.
Quick Quiz 17.1: But suppose that an application exits while holding a pthread_mutex_
lock() that happens to be located in a file-mapped region of memory?
Persistent locks help avoid the need to share memory among unrelated applications.
Persistent locking APIs include the flock family, lockf(), System V semaphores, and
the O_CREAT flag to open(). These persistent APIs can be used to protect large-scale
operations spanning runs of multiple applications and, in the case of O_CREAT, can even
survive operating-system reboot. If need be, locks can even span multiple computer
systems via distributed lock managers and distributed filesystems—and persist across
reboots of any or all of those computer systems.
Persistent locks can be used by any application, including applications written using
multiple languages and software environments. In fact, a persistent lock might well be
acquired by an application written in C and released by an application written in Python.
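As a concrete illustration of a lock that is named by a file rather than by a memory address, here is a minimal sketch using flock(). The helper names `persistent_lock()` and `persistent_unlock()` are hypothetical. Note also that although the lock file outlives the process, the kernel drops an flock() lock when its holder closes the file or exits, a subtlety that this sketch does not explore.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/file.h>
#include <unistd.h>

/*
 * Acquire an exclusive lock named by a file path, creating the
 * file if needed. Any cooperating program, in any language, that
 * can open the same path can contend for the same lock.
 * Returns a file descriptor on success, -1 on failure.
 */
int persistent_lock(const char *path)
{
    int fd;

    assert(path != NULL);
    fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0)
        return -1;
    if (flock(fd, LOCK_EX) != 0) {  /* Blocks until available. */
        close(fd);
        return -1;
    }
    return fd;
}

/* Release the lock and close the descriptor: 0 on success. */
int persistent_unlock(int fd)
{
    int ret = flock(fd, LOCK_UN);

    if (close(fd) != 0)
        ret = -1;
    return ret;
}
```

No shared memory is involved: the only thing the cooperating applications need to agree on is the path.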
How could a similar persistent functionality be provided for TM?
Of course, the fact that it is called transactional memory should give us pause, as
the name itself conflicts with the concept of a persistent transaction. It is nevertheless
worthwhile to consider this possibility as an important test case probing the inherent
limitations of transactional memory.
pthread_mutex_lock(...);
for (i = 0; i < ncpus; i++)
    pthread_create(&tid[i], ...);
for (i = 0; i < ncpus; i++)
    pthread_join(tid[i], ...);
pthread_mutex_unlock(...);
This pseudo-code fragment uses pthread_create() to spawn one thread per CPU,
then uses pthread_join() to wait for each to complete, all under the protection of
pthread_mutex_lock(). The effect is to execute a lock-based critical section in
parallel, and one could obtain a similar effect using fork() and wait(). Of course,
the critical section would need to be quite large to justify the thread-spawning overhead,
but there are many examples of large critical sections in production software.
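A compilable version of this pattern might look like the following sketch. The names `big_lock`, `partial[]`, `worker()`, and `parallel_critical_section()` are hypothetical, and the caller passes the thread count instead of querying the number of CPUs. The children must not themselves acquire `big_lock`, because the parent holds it until they have all been joined.

```c
#include <assert.h>
#include <pthread.h>

#define MAX_THREADS 64

static pthread_mutex_t big_lock = PTHREAD_MUTEX_INITIALIZER;
static long partial[MAX_THREADS];  /* Protected by big_lock. */

/* Each child touches only its own slot, so children do not race. */
static void *worker(void *arg)
{
    long *slot = arg;

    *slot += 1;
    return NULL;
}

/*
 * Run one critical section in parallel: the parent holds big_lock
 * for the whole time that its children are running.
 * Returns the sum of the children's contributions.
 */
long parallel_critical_section(int nthreads)
{
    pthread_t tid[MAX_THREADS];
    long sum = 0;
    int spawned = 0;
    int i;

    assert(nthreads > 0 && nthreads <= MAX_THREADS);
    pthread_mutex_lock(&big_lock);
    for (i = 0; i < nthreads; i++) {
        partial[i] = 0;
        if (pthread_create(&tid[i], NULL, worker, &partial[i]) != 0)
            break;  /* Join only the threads actually created. */
        spawned++;
    }
    for (i = 0; i < spawned; i++) {
        pthread_join(tid[i], NULL);
        sum += partial[i];
    }
    pthread_mutex_unlock(&big_lock);
    return sum;
}
```

With locking, the meaning is clear: other threads are excluded until the parent releases the lock, regardless of how many helpers ran in the meantime.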
What might TM do about thread spawning within a transaction?
4. Extend the transaction to cover the parent and all child threads. This approach raises
interesting questions about the nature of conflicting accesses, given that the parent
and children are presumably permitted to conflict with each other, but not with other
threads. It also raises interesting questions as to what should happen if the parent
thread does not wait for its children before committing the transaction. Even more
interesting, what happens if the parent conditionally executes pthread_join()
based on the values of variables participating in the transaction? The answers to
these questions are reasonably straightforward in the case of locking. The answers
for TM are left as an exercise for the reader.
What happens when you attempt to execute an exec() system call from within a
transaction?
2. Disallow exec() within transactions, with the compiler enforcing this prohibition.
There is a draft specification for TM in C++ that takes this approach, allowing func-
tions to be decorated with the transaction_safe and transaction_unsafe
attributes.4 This approach has some advantages over aborting the transaction at
runtime, but again requires non-TM synchronization primitives for use in conjunc-
tion with exec(). One disadvantage is the need to decorate a great many library
functions with transaction_safe and transaction_unsafe attributes.
3. Treat the transaction in a manner similar to non-persistent locking primitives, so
that the transaction survives if exec() fails, and silently commits if the exec()
succeeds. The case where only some of the variables affected by the transaction
reside in mmap()ed memory (and thus could survive a successful exec() system
call) is left as an exercise for the reader.
4. Abort the transaction (and the exec() system call) if the exec() system call
would have succeeded, but allow the transaction to continue if the exec() system
call would fail. This is in some sense the “correct” approach, but it would require
considerable work for a rather unsatisfying result.
The exec() system call is perhaps the strangest example of an obstacle to universal
TM applicability, as it is not completely clear what approach makes sense, and some
might argue that this is merely a reflection of the perils of real-life interaction with
exec(). That said, the two options prohibiting exec() within transactions are perhaps
the most logical of the group.
Similar issues surround the exit() and kill() system calls, as well as a longjmp()
or an exception that would exit the transaction. (Where did the longjmp() or exception
come from?)
1. Treat the dynamic linking and loading in a manner similar to a page fault, so that
the function is loaded and linked, possibly aborting the transaction in the process.
If the transaction is aborted, the retry will find the function already present, and
the transaction can thus be expected to proceed normally.
4 Thanks to Mark Moir for pointing me at this spec, and to Michael Wong for having
Options for part (b), the inability to detect TM-unfriendly operations in a
not-yet-loaded function, include the following:
1. Just execute the code: If there are any TM-unfriendly operations in the function,
simply abort the transaction. Unfortunately, this approach makes it impossible for
the compiler to determine whether a given group of transactions may be safely
composed. One way to permit composability regardless is irrevocable transactions.
However, current implementations permit only a single irrevocable transaction to
proceed at any given time, which can severely limit performance and scalability.
Irrevocable transactions also restrict use of manual transaction-abort operations.
Finally, if there is an irrevocable transaction manipulating a given data item, any
other transaction manipulating that same data item cannot have non-blocking
semantics.
2. Decorate the function declarations indicating which functions are TM-friendly.
These decorations can then be enforced by the compiler’s type system. Of
course, for many languages, this requires language extensions to be proposed,
standardized, and implemented, with the corresponding time delays, and also
with the corresponding decoration of a great many otherwise uninvolved library
functions. That said, the standardization effort is already in progress [ATS09].
3. As above, disallow dynamic linking and loading of functions from within transac-
tions.
I/O operations are of course a known weakness of TM, and dynamic linking and
loading can be thought of as yet another special case of I/O. Nevertheless, the proponents
of TM must either solve this problem, or resign themselves to a world where TM is but
one tool of several in the parallel programmer’s toolbox. (To be fair, a number of TM
proponents have long since resigned themselves to a world containing more than just
TM.)
1. Memory remapping is illegal within a transaction, and will result in all enclosing
transactions being aborted. This does simplify things somewhat, but also requires
that TM interoperate with synchronization primitives that do tolerate remapping
from within their critical sections.
4. Memory mapping is legal within a transaction, but the mapping operation will fail
if the region being mapped overlaps with the current transaction’s footprint.
5. All memory-mapping operations, whether within or outside a transaction, check
the region being mapped against the memory footprint of all transactions in the
system. If there is overlap, then the memory-mapping operation fails.
It is interesting to note that munmap() leaves the relevant region of memory unmapped,
which could have additional interesting implications.5
17.2.2.5 Debugging
The usual debugging operations such as breakpoints work normally within lock-based
critical sections and within userspace-RCU read-side critical sections. However, in initial
transactional-memory hardware implementations [DLMN09] an exception within a
transaction will abort that transaction, which in turn means that breakpoints abort all
enclosing transactions.
So how can transactions be debugged?
4. Program more carefully, so as to avoid having bugs in the transactions in the first
place. As soon as you figure out how to do this, please do let everyone know the
secret!
5 This difference between mapping and unmapping was noted by Josh Triplett.
There is some reason to believe that transactional memory will deliver productivity
improvements compared to other synchronization mechanisms, but it does seem quite
possible that these improvements could easily be lost if traditional debugging techniques
cannot be applied to transactions. This seems especially true if transactional memory is
to be used by novices on large transactions. In contrast, macho “top-gun” programmers
might be able to dispense with such debugging aids, especially for small transactions.
Therefore, if transactional memory is to deliver on its productivity promises to novice
programmers, the debugging problem does need to be solved.
17.2.3 Synchronization
If transactional memory someday proves that it can be everything to everyone, it will
not need to interact with any other synchronization mechanism. Until then, it will need
to work with synchronization mechanisms that can do what it cannot, or that work more
naturally in a given situation. The following sections outline the current challenges in
this area.
17.2.3.1 Locking
It is commonplace to acquire locks while holding other locks, which works quite well,
at least as long as the usual well-known software-engineering techniques are employed
to avoid deadlock. It is not unusual to acquire locks from within RCU read-side critical
sections, which eases deadlock concerns because RCU read-side primitives cannot
participate in lock-based deadlock cycles. It is also possible to acquire locks while
holding hazard pointers and within sequence-lock read-side critical sections. But what
happens when you attempt to acquire a lock from within a transaction?
In theory, the answer is trivial: Simply manipulate the data structure representing the
lock as part of the transaction, and everything works out perfectly. In practice, a number
of non-obvious complications [VGS08] can arise, depending on implementation details
of the TM system. These complications can be resolved, but at the cost of a 45 %
increase in overhead for locks acquired outside of transactions and a 300 % increase in
overhead for locks acquired within transactions. Although these overheads might be
acceptable for transactional programs containing small amounts of locking, they are
often completely unacceptable for production-quality lock-based programs wishing to
use the occasional transaction.
1. Use only locking-friendly TM implementations. Unfortunately, the locking-
unfriendly implementations have some attractive properties, including low over-
head for successful transactions and the ability to accommodate extremely large
transactions.
2. Use TM only “in the small” when introducing TM to lock-based programs, thereby
accommodating the limitations of locking-friendly TM implementations.
3. Set aside locking-based legacy systems entirely, re-implementing everything in
terms of transactions. This approach has no shortage of advocates, but it requires
that all the issues described in this series be resolved. During the time it takes to
resolve these issues, competing synchronization mechanisms will of course also
have the opportunity to improve.
4. Use TM strictly as an optimization in lock-based systems, as was done by
the TxLinux [RHP+07] group and by a great many transactional lock elision
projects [PD11, Kle14, FIMR16, PMDY20]. This approach seems sound, but
leaves the locking design constraints (such as the need to avoid deadlock) firmly in
place.
5. Strive to reduce the overhead imposed on locking primitives.
The fact that there could possibly be a problem interfacing TM and locking came as a
surprise to many, which underscores the need to try out new mechanisms and primitives
in real-world production software. Fortunately, the advent of open source means that a
huge quantity of such software is now freely available to everyone, including researchers.
POWER8 CPUs [LGW+15], but leaves the locking design constraints (such as the
need to avoid deadlock) firmly in place.
1. RCU readers abort concurrent conflicting TM updates. This is in fact the approach
taken by the TxLinux project. This approach does preserve RCU semantics, and
also preserves RCU’s read-side performance, scalability, and real-time-response
properties, but it does have the unfortunate side-effect of unnecessarily aborting
conflicting updates. In the worst case, a long sequence of RCU readers could
potentially starve all updaters, which could in theory result in system hangs.
In addition, not all TM implementations offer the strong atomicity required to
implement this approach, and for good reasons.
2. RCU readers that run concurrently with conflicting TM updates get old (pre-
transaction) values from any conflicting RCU loads. This preserves RCU semantics
and performance, and also prevents RCU-update starvation. However, not all TM
implementations can provide timely access to old values of variables that have
been tentatively updated by an in-flight transaction. In particular, log-based TM
implementations that maintain old values in the log (thus providing excellent TM
commit performance) are not likely to be happy with this approach. Perhaps the
3. If an RCU reader executes an access that conflicts with an in-flight transaction, then
that RCU access is delayed until the conflicting transaction either commits or aborts.
This approach preserves RCU semantics, but not RCU’s performance or real-time
response, particularly in the presence of long-running transactions. In addition, not all
TM implementations are capable of delaying conflicting accesses. Nevertheless,
this approach seems eminently reasonable for hardware TM implementations that
support only small transactions.
4. RCU readers are converted to transactions. This approach pretty much guarantees
that RCU is compatible with any TM implementation, but it also imposes TM’s
rollbacks on RCU read-side critical sections, destroying RCU’s real-time response
guarantees, and also degrading RCU’s read-side performance. Furthermore, this
approach is infeasible in cases where any of the RCU read-side critical sections
contains operations that the TM implementation in question is incapable of handling.
This approach is more difficult to apply to hazard pointers and reference counters,
which do not have a sharply defined notion of a reader as a section of code.
5. Many update-side uses of RCU modify a single pointer to publish a new data
structure. In some of these cases, RCU can safely be permitted to see a transactional
pointer update that is subsequently rolled back, as long as the transaction respects
memory ordering and as long as the roll-back process uses call_rcu() to free up
the corresponding structure. Unfortunately, not all TM implementations respect
memory barriers within a transaction. Apparently, the thought is that because
transactions are supposed to be atomic, the ordering of the accesses within the
transaction is not supposed to matter.
6. Prohibit use of TM in RCU updates. This is guaranteed to work, but restricts use
of TM.
It seems likely that additional approaches will be uncovered, especially given the
advent of user-level RCU and hazard-pointer implementations.6 It is interesting to note
that many of the better performing and scaling STM implementations make use of
RCU-like techniques internally [Fra04, FH07, GYW+19, KMK+19].
Quick Quiz 17.3: MV-RLU looks pretty good! Doesn’t it beat RCU hands down?
6 Kudos to the TxLinux group, Maged Michael, and Josh Triplett for coming up with a
Given mechanisms such as the so-called “dirty reads” that are prevalent in production
database systems, it is not surprising that extra-transactional accesses have received seri-
ous attention from the proponents of TM, with the concept of weak atomicity [BLM06]
being but one case in point.
Here are some extra-transactional options:
4. Produce hardware extensions that permit some operations (for example, addition)
to be carried out concurrently on a single variable by multiple transactions.
17.2.4 Discussion
The obstacles to universal TM adoption lead to the following conclusions:
1. One interesting property of TM is the fact that transactions are subject to rollback
and retry. This property underlies TM’s difficulties with irreversible operations,
including unbuffered I/O, RPCs, memory-mapping operations, time delays, and
the exec() system call. This property also has the unfortunate consequence of
introducing all the complexities inherent in the possibility of failure, often in a
developer-visible manner.
3. One of the stated goals of many workers in the TM area is to ease parallelization
of large sequential programs. As such, individual transactions are commonly
expected to execute serially, which might do much to explain TM’s issues with
multithreaded transactions.
Quick Quiz 17.4: Given things like spin_trylock(), how does it make any sense at all to
claim that TM introduces the concept of failure???
But for the moment, the current state of STM can best be summarized with a series
of cartoons. First, Figure 17.9 shows the STM vision. As always, the reality is a bit
more nuanced, as fancifully depicted by Figures 17.10, 17.11, and 17.12.7 Less fanciful
STM retrospectives are also available [Duf10a, Duf10b].
Some commercially available hardware supports restricted variants of HTM, which
are addressed in the following section.
As of 2021, hardware transactional memory (HTM) has been available for many years
on several types of commercially available commodity computer systems [YHLR13,
Mer11, JSG12, Hay20]. This section makes an attempt to identify HTM’s place in the
parallel programmer’s toolbox.
From a conceptual viewpoint, HTM uses processor caches and speculative execution
to make a designated group of statements (a “transaction”) take effect atomically from
the viewpoint of any other transactions running on other processors. This transaction
is initiated by a begin-transaction machine instruction and completed by a commit-
transaction machine instruction. There is typically also an abort-transaction machine
instruction, which squashes the speculation (as if the begin-transaction instruction
and all following instructions had not executed) and commences execution at a failure
handler. The location of the failure handler is typically specified by the begin-transaction
7 Recent academic work-in-progress has investigated lock-based STM systems for real-
time use [And19, NA18], albeit without any performance results, and with some indications
that real-time hybrid STM/HTM systems must choose between fast common-case performance
and worst-case forward-progress guarantees [AKK+14, SBV10].
8 I gratefully acknowledge many stimulating discussions with the other authors, Maged
Michael, Josh Triplett, and Jonathan Walpole, as well as with Andi Kleen.
Quick Quiz 17.6: Why would it matter that oft-written variables shared the cache line with
the lock variable?
1. Lock elision for in-memory data access and update [MT01, RG02].
2. Concurrent access and small random updates to large non-partitionable data
structures.
However, HTM also has some very real shortcomings, which will be discussed in the
next section.
1. Transaction-size limitations.
2. Conflict handling.
9 And it is also easy to extend this scheme to operations accessing multiple hash chains
by having such operations acquire the locks for all relevant chains in hash order.
5. Irrevocable operations.
6. Semantic differences.
x = 1; y = 2;
y = 3; x = 4;
10 Liu’s and Spear’s paper entitled “Toxic Transactions” [LS11] is particularly instructive.
11 The need to update the count would result in additions to and deletions from the tree
conflicting with each other, resulting in strong non-commutativity [AGH+11a, AGH+11b,
McK11b].
Of course, aborts and rollbacks raise the question of whether HTM can be useful
for hard real-time systems. Do the performance benefits of HTM outweigh the costs
of the aborts and rollbacks, and if so under what conditions? Can transactions use
priority boosting? Or should transactions for high-priority threads instead preferentially
abort those of low-priority threads? If so, how is the hardware efficiently informed of
priorities? The literature on real-time use of HTM is quite sparse, perhaps because
there are more than enough problems in making HTM work well in non-real-time
environments.
Because current HTM implementations might deterministically abort a given trans-
action, software must provide fallback code. This fallback code must use some other
form of synchronization, for example, locking. If a lock-based fallback is ever used,
then all the limitations of locking, including the possibility of deadlock, reappear. One
can of course hope that the fallback isn’t used often, which might allow simpler and
less deadlock-prone locking designs to be used. But this raises the question of how
the system transitions from using the lock-based fallbacks back to transactions.12 One
approach is to use a test-and-test-and-set discipline [MT02], so that everyone holds off
until the lock is released, allowing the system to start from a clean slate in transactional
mode at that point. However, this could result in quite a bit of spinning, which might
not be wise if the lock holder has blocked or been preempted. Another approach is to
allow transactions to proceed in parallel with a thread holding a lock [MT02], but this
raises difficulties in maintaining atomicity, especially if the reason that the thread is
holding the lock is because the corresponding transaction would not fit into cache.
Finally, dealing with the possibility of aborts and rollbacks seems to put an additional
burden on the developer, who must correctly handle all combinations of possible error
conditions.
It is clear that users of HTM must put considerable validation effort into testing both
the fallback code paths and transition from fallback code back to transactional code.
Nor is there any reason to believe that the validation requirements of HTM hardware
are any less daunting.
12 The possibility of an application getting stuck in fallback mode has been termed the
“lemming effect”, a term that Dave Dice has been credited with coining.
Even though transaction size, conflicts, and aborts/rollbacks can all cause transactions
to abort, one might hope that sufficiently small and short-duration transactions could be
guaranteed to eventually succeed. This would permit a transaction to be unconditionally
retried, in the same way that compare-and-swap (CAS) and load-linked/store-conditional
(LL/SC) operations are unconditionally retried in code that uses these instructions to
implement atomic operations.
Unfortunately, other than low-clock-rate academic research prototypes [SBV10],
currently available HTM implementations refuse to make any sort of forward-progress
guarantee. As noted earlier, HTM therefore cannot be used to avoid deadlock on
those systems. Hopefully future implementations of HTM will provide some sort of
forward-progress guarantees. Until that time, HTM must be used with extreme caution
in real-time applications.
The one exception to this gloomy picture as of 2021 is the IBM mainframe, which
provides constrained transactions [JSG12]. The constraints are quite severe, and are
presented in Section 17.3.5.1. It will be interesting to see if HTM forward-progress
guarantees migrate from the mainframe to commodity CPU families.
Another consequence of aborts and rollbacks is that HTM transactions cannot accom-
modate irrevocable operations. Current HTM implementations typically enforce this
limitation by requiring that all of the accesses in the transaction be to cacheable memory
(thus prohibiting MMIO accesses) and aborting transactions on interrupts, traps, and
exceptions (thus prohibiting system calls).
Note that buffered I/O can be accommodated by HTM transactions as long as the
buffer fill/flush operations occur extra-transactionally. The reason that this works is that
adding data to and removing data from the buffer is revocable: Only the actual buffer
fill/flush operations are irrevocable. Of course, this buffered-I/O approach has the effect
of including the I/O in the transaction’s footprint, increasing the size of the transaction
and thus increasing the probability of failure.
Although HTM can in many cases be used as a drop-in replacement for locking (hence
the name transactional lock elision (TLE) [DHL+08]), there are subtle differences
in semantics. A particularly nasty example involving coordinated lock-based critical
sections that results in deadlock or livelock when executed transactionally was given by
Blundell [BLM06], but a much simpler example is the empty critical section.
In a lock-based program, an empty critical section will guarantee that all processes
that had previously been holding that lock have now released it. This idiom was used
by the 2.4 Linux kernel’s networking stack to coordinate changes in configuration.
But if this empty critical section is translated to a transaction, the result is a no-op.
The guarantee that all prior critical sections have terminated is lost. In other words,
transactional lock elision preserves the data-protection semantics of locking, but loses
locking’s time-based messaging semantics.
Quick Quiz 17.10: But why would anyone need an empty lock-based critical section???
Quick Quiz 17.11: Can’t transactional lock elision trivially handle locking’s time-based
messaging semantics by simply choosing not to elide empty lock-based critical sections?
Quick Quiz 17.12: Given modern hardware [MOZ09], how can anyone possibly expect
parallel software relying on timing to work?
One important semantic difference between locking and transactions is the priority
boosting that is used to avoid priority inversion in lock-based real-time programs. One
way in which priority inversion can occur is when a low-priority thread holding a lock
is preempted by a medium-priority CPU-bound thread. If there is at least one such
medium-priority thread per CPU, the low-priority thread will never get a chance to run.
If a high-priority thread now attempts to acquire the lock, it will block. It cannot acquire
the lock until the low-priority thread releases it, the low-priority thread cannot release
the lock until it gets a chance to run, and it cannot get a chance to run until one of the
medium-priority threads gives up its CPU. Therefore, the medium-priority threads are
in effect blocking the high-priority thread, which is the rationale for the name “priority
inversion.”
One way to avoid priority inversion is priority inheritance, in which a high-priority
thread blocked on a lock temporarily donates its priority to the lock’s holder, which is
also called priority boosting. However, priority boosting can be used for things other
than avoiding priority inversion, as shown in Listing 17.1. Lines 1–12 of this listing
show a low-priority process that must nevertheless run every millisecond or so, while
lines 14–24 of this same listing show a high-priority process that uses priority boosting
to ensure that boostee() runs periodically as needed.
The boostee() function arranges this by always holding one of the two
boost_lock[] locks, so that lines 20–21 of booster() can boost priority as needed.
Quick Quiz 17.13: But the boostee() function in Listing 17.1 alternately acquires its locks
in reverse order! Won’t this result in deadlock?
This arrangement requires that boostee() acquire its first lock on line 5 before the
system becomes busy, but this is easily arranged, even on modern hardware.
17.3.2.7 Summary
Although it seems likely that HTM will have compelling use cases, current imple-
mentations have serious transaction-size limitations, conflict-handling complications,
abort-and-rollback issues, and semantic differences that will require careful handling.
HTM’s current situation relative to locking is summarized in Table 17.1. As can be seen,
although the current state of HTM alleviates some serious shortcomings of locking,13
it does so by introducing a significant number of shortcomings of its own. These
shortcomings are acknowledged by leaders in the TM community [MS12].14
In addition, this is not the whole story. Locking is not normally used by itself, but is
instead typically augmented by other synchronization mechanisms, including reference
counting, atomic operations, non-blocking data structures, hazard pointers [Mic04a,
HLM02], and RCU [MS98a, MAK+01, HMBW07, McK12b]. The next section looks
at how such augmentation changes the equation.
and heavily used engineering solutions, including deadlock detectors [Cor06a], a wealth of
data structures that have been adapted to locking, and a long history of augmentation, as
discussed in Section 17.3.3. In addition, if locking really were as horrible as a quick skim of
many academic papers might reasonably lead one to believe, where did all the large lock-based
parallel programs (both FOSS and proprietary) come from, anyway?
14 In addition, in early 2011, I was invited to deliver a critique of some of the assumptions
For example, deadlock can be avoided in many cases by using reference counts,
hazard pointers, or RCU to protect data structures, particularly for read-only critical
sections [Mic04a, HLM02, DMS+12, GMTW08, HMBW07]. These approaches also
reduce the need to partition data structures, as was seen in Chapter 10. RCU further
provides contention-free bounded wait-free read-side primitives [MS98a, DMS+12],
while hazard pointers provide lock-free read-side primitives [Mic02, HLM02, Mic04a].
Adding these considerations to Table 17.1 results in the updated comparison between
augmented locking and HTM shown in Table 17.2. A summary of the differences
between the two tables is as follows:
2. Read-side mechanisms such as hazard pointers and RCU can operate efficiently on
non-partitionable data.
3. Hazard pointers and RCU do not contend with each other or with updaters, allowing
excellent performance and scalability for read-mostly workloads.
4. Hazard pointers and RCU provide forward-progress guarantees (lock freedom and
bounded wait-freedom, respectively).
For those with good eyesight, Table 17.3 combines Tables 17.1 and 17.2.
Quick Quiz 17.15: Tables 17.1 and 17.2 state that hardware is only starting to become
available. But hasn’t HTM hardware support been widely available for almost a full decade?
Table 17.3: Comparison of Locking (Plain and Augmented) and HTM (Advantage, Disadvantage, Strong Disadvantage)

Basic Idea
• Locking: Allow only one thread at a time to access a given set of objects.
• Locking with Userspace RCU or Hazard Pointers: Allow only one thread at a time to access a given set of objects.
• Hardware Transactional Memory: Cause a given operation over a set of objects to execute atomically.

Scope
• Locking: Handles all operations.
• Locking with Userspace RCU or Hazard Pointers: Handles all operations.
• Hardware Transactional Memory: Handles revocable operations. Irrevocable operations force fallback (typically to locking).

Composability
• Locking: Limited by deadlock.
• Locking with Userspace RCU or Hazard Pointers: Readers limited only by grace-period-wait operations. Updaters limited by deadlock. Readers reduce deadlock.
• Hardware Transactional Memory: Limited by irrevocable operations, transaction size, and deadlock. (Assuming lock-based fallback code.)

Scalability & Performance
• Locking: Data must be partitionable to avoid lock contention. Partitioning must typically be fixed at design time. Locking primitives typically result in expensive cache misses and memory-barrier instructions. Contention effects are focused on acquisition and release, so that the critical section runs at full speed. Privatization operations are simple, intuitive, performant, and scalable.
• Locking with Userspace RCU or Hazard Pointers: Data must be partitionable to avoid lock contention among updaters. Partitioning not needed for readers. Partitioning for updaters must typically be fixed at design time. Partitioning not needed for readers. Updater locking primitives typically result in expensive cache misses and memory-barrier instructions. Update-side contention effects are focused on acquisition and release, so that the critical section runs at full speed. Readers do not contend with updaters or with each other. Read-side primitives are typically bounded wait-free with low overhead. (Lock-free with low overhead for hazard pointers.) Privatization operations are simple, intuitive, performant, and scalable when data is visible only to updaters. Privatization operations are expensive (though still intuitive and scalable) for reader-visible data.
• Hardware Transactional Memory: Data must be partitionable to avoid conflicts. Dynamic adjustment of partitioning carried out automatically down to cacheline boundaries. Partitioning required for fallbacks (less important for rare fallbacks). Transactions begin/end instructions typically do not result in cache misses, but do have memory-ordering and overhead consequences. Contention aborts conflicting transactions, even if they have been running for a long time. Read-only transactions subject to conflicts and rollbacks. No forward-progress guarantees other than those supplied by fallback code. Privatized data contributes to transaction size.

Hardware Support
• Locking: Commodity hardware suffices. Performance is insensitive to cache-geometry details.
• Locking with Userspace RCU or Hazard Pointers: Commodity hardware suffices. Performance is insensitive to cache-geometry details.
• Hardware Transactional Memory: New hardware required (and is starting to become available). Performance depends critically on cache geometry.

Software Support
• Locking: APIs exist, large body of code and experience, debuggers operate naturally.
• Locking with Userspace RCU or Hazard Pointers: APIs exist, large body of code and experience, debuggers operate naturally.
• Hardware Transactional Memory: APIs emerging, little experience outside of DBMS, breakpoints mid-transaction can be problematic.

Interaction With Other Mechanisms
• Locking: Long experience of successful interaction.
• Locking with Userspace RCU or Hazard Pointers: Long experience of successful interaction.
• Hardware Transactional Memory: Just beginning investigation of interaction.

Practical Apps
• Locking: Yes.
• Locking with Userspace RCU or Hazard Pointers: Yes.
• Hardware Transactional Memory: Yes.

Wide Applicability
• Locking: Yes.
• Locking with Userspace RCU or Hazard Pointers: Yes.
• Hardware Transactional Memory: Jury still out.
1. Forward-progress guarantees.
2. Transaction-size increases.
3. Improved debugging support.
4. Weak atomicity.
systems at just about the time that NoSQL databases are relaxing the traditional database-
application reliance on strict transactions. Nevertheless, HTM has in fact realized the
ease-of-use promise of TM, albeit for black-hat attacks on the Linux kernel’s address-space
randomization defense mechanism [JLK16a, JLK16b].
1. The maximum data footprint is four blocks of memory, where each block can be
no larger than 32 bytes.
3. If a given 4K page contains a constrained transaction’s code, then that page may
not contain that transaction’s data.
Int20b]. Worse yet, some cutting-edge debugging facilities are incompatible with
HTM [OHOC20].
Given that HTM is likely to face some sort of size limitations for the foreseeable future,
it will be necessary for HTM to interoperate smoothly with other mechanisms. HTM’s
interoperability with read-mostly mechanisms such as hazard pointers and RCU would
be improved if extra-transactional reads did not unconditionally abort transactions with
conflicting writes—instead, the read could simply be provided with the pre-transaction
value. In this way, hazard pointers and RCU could be used to allow HTM to handle
larger data structures and to reduce conflict probabilities.
This is not necessarily simple, however. The most straightforward way of imple-
menting this requires an additional state in each cache line and on the bus, which is a
non-trivial added expense. The benefit that goes along with this expense is permitting
large-footprint readers without the risk of starving updaters due to continual conflicts.
An alternative approach, applied to great effect to binary search trees by Siakavaras et
al. [SNGK17], is to use RCU for read-only traversals and HTM only for the actual updates
themselves. This combination outperformed other transactional-memory techniques
by up to 220 %, a speedup similar to that observed by Howard and Walpole [HW11]
when they combined RCU with STM. In both cases, the weak atomicity is implemented
in software rather than in hardware. It would nevertheless be interesting to see what
additional speedups could be obtained by implementing weak atomicity in both hardware
and software.
17.3.6 Conclusions
5. Information beyond the source code and inputs must be modest in scope.
This list builds on, but is somewhat more modest than, Richard Bornat’s dictum:
“Formal-verification researchers should verify the code that developers write, in the
language they write it in, running in the environment that it runs in, as they write it.”
The following sections discuss each of the above requirements, followed by a section
presenting a scorecard of how well a few tools stack up against these requirements.
Quick Quiz 17.16: This list is ridiculously utopian! Why not stick to the current state of the
formal-verification art?
PPCMEM and herd are extremely useful, but they are not well-suited for regression
suites.
In contrast, cbmc and Nidhugg can input C programs of reasonable (though still
quite limited) size, and if their capabilities continue to grow, could well become
excellent additions to regression suites. The Coverity static-analysis tool also inputs
C programs, and of very large size, including the Linux kernel. Of course, Coverity’s
static analysis is quite simple compared to that of cbmc and Nidhugg. On the other
hand, Coverity had an all-encompassing definition of “C program” that posed special
challenges [BBC+10]. Amazon Web Services uses a variety of formal-verification tools,
including cbmc, and applies some of these tools to regression testing [Coo18]. Google
uses a number of relatively simple static analysis tools directly on large Java code bases,
which are arguably less diverse than C code bases [SAE+18]. Facebook uses more
aggressive forms of formal verification against its code bases, including analysis of
concurrency [DFLO19, O’H19], though not yet on the Linux kernel. Finally, Microsoft
has long used static analysis on its code bases [LBD+04].
Given this list, it is clearly possible to create sophisticated formal-verification tools
that directly consume production-quality source code.
However, one shortcoming of taking C code as input is that it assumes that the
compiler is correct. An alternative approach is to take the binary produced by the C
compiler as input, thereby accounting for any relevant compiler bugs. This approach
has been used in a number of verification efforts, perhaps most notably by the seL4
project [SM13].
Quick Quiz 17.17: Given the groundbreaking nature of the various verifiers used in the seL4
project, why doesn’t this chapter cover them in more depth?
However, verifying directly from either the source or the binary has the advantage of
eliminating human translation errors, which is critically important for reliable regression
testing.
This is not to say that tools with special-purpose languages are useless. On the
contrary, they can be quite helpful for design-time verification, as was discussed in
Chapter 12. However, such tools are not particularly helpful for automated regression
testing, which is in fact the topic of this section.
17.4.2 Environment
It is critically important that formal-verification tools correctly model their environment.
One all-too-common omission is the memory model, where a great many formal-
verification tools, including Promela/spin, are restricted to sequential consistency. The
QRCU experience related in Section 12.1.4.6 is an important cautionary tale.
Promela and spin assume sequential consistency, which is not a good match for
modern computer systems, as was seen in Chapter 15. In contrast, one of the great
strengths of PPCMEM and herd is their detailed modeling of the memory models of
various CPU families, including x86, Arm, Power, and, in the case of herd, a Linux-kernel
memory model [AMM+18], which was accepted into version v4.17 of the Linux kernel.
The cbmc and Nidhugg tools provide some ability to select memory models, but
do not provide the variety that PPCMEM and herd do. However, it is likely that the
larger-scale tools will adopt a greater variety of memory models as time goes on.
In the longer term, it would be helpful for formal-verification tools to include
I/O [MDR16], but it may be some time before this comes to pass.
Nevertheless, tools that fail to match the environment can still be useful. For example,
a great many concurrency bugs would still be bugs on a mythical sequentially consistent
system, and these bugs could be located by a tool that over-approximates the system’s
memory model with sequential consistency. However, such tools will fail to find
bugs involving missing memory-ordering directives, as noted in the aforementioned
cautionary tale of Section 12.1.4.6.
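To make the limitation concrete, here is a minimal sketch of a store-buffering test written with C11 relaxed atomics and POSIX threads. The function and variable names are illustrative assumptions rather than anything from the book's CodeSamples. A tool restricted to sequential consistency would conclude that both loads returning zero is impossible, yet weakly ordered hardware and C11 relaxed atomics permit that outcome. Whether a given run actually observes it depends on the hardware and on timing.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

/* Shared variables for a store-buffering test; all names are illustrative. */
static atomic_int x, y;
static int r0, r1;

static void *thread0(void *arg)
{
    (void)arg;
    atomic_store_explicit(&x, 1, memory_order_relaxed);
    /* No ordering directive here: the load may be satisfied before the store is visible. */
    r0 = atomic_load_explicit(&y, memory_order_relaxed);
    return NULL;
}

static void *thread1(void *arg)
{
    (void)arg;
    atomic_store_explicit(&y, 1, memory_order_relaxed);
    r1 = atomic_load_explicit(&x, memory_order_relaxed);
    return NULL;
}

/*
 * Run the test nruns times.  Returns the number of runs in which both
 * loads returned zero, or -1 if a thread could not be created.
 */
int sb_count_both_zero(int nruns)
{
    pthread_t t0, t1;
    int i, hits = 0;

    for (i = 0; i < nruns; i++) {
        atomic_store(&x, 0);
        atomic_store(&y, 0);
        if (pthread_create(&t0, NULL, thread0, NULL) != 0)
            return -1;
        if (pthread_create(&t1, NULL, thread1, NULL) != 0) {
            pthread_join(t0, NULL);
            return -1;
        }
        pthread_join(t0, NULL);
        pthread_join(t1, NULL);
        if (r0 == 0 && r1 == 0)
            hits++; /* Impossible under sequential consistency. */
    }
    return hits;
}
```

Because thread-creation overhead tends to keep the two threads from overlapping, a testing-based harness like this one may report zero hits even on hardware that permits the outcome, which is one reason tools that model the memory model directly are attractive.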
17.4.3 Overhead
Almost all hard-core formal-verification tools are exponential in nature, which might
seem discouraging until you consider that many of the most interesting software
questions are in fact undecidable. However, there are differences in degree, even among
exponentials.
PPCMEM by design is unoptimized, in order to provide greater assurance that the
memory models of interest are accurately represented. The herd tool optimizes more
aggressively, as described in Section 12.3, and is thus orders of magnitude faster than
PPCMEM. Nevertheless, both PPCMEM and herd target very small litmus tests rather
than larger bodies of code.
In contrast, Promela/spin, cbmc, and Nidhugg are designed for (somewhat) larger bod-
ies of code. Promela/spin was used to verify the Curiosity rover’s filesystem [GHH+ 14]
and, as noted earlier, both cbmc and Nidhugg were applied to Linux-kernel RCU.
If advances in heuristics continue at the rate of the past three decades, we can look
forward to large reductions in overhead for formal verification. That said, combinatorial
explosion is still combinatorial explosion, which would be expected to sharply limit
the size of programs that could be verified, with or without continued improvements in
heuristics.
However, the flip side of combinatorial explosion is Philip II of Macedon’s timeless
advice: “Divide and rule.” If a large program can be divided and the pieces verified,
the result can be combinatorial implosion [McK11e]. One natural place to divide is on
API boundaries, for example, those of locking primitives. One verification pass can
then verify that the locking implementation is correct, and additional verification passes
can verify correct use of the locking APIs.
The performance benefits of this approach can be demonstrated using the Linux-kernel
memory model [AMM+ 18]. This model provides spin_lock() and spin_unlock()
primitives, but these primitives can also be emulated using cmpxchg_acquire()
and smp_store_release(), as shown in Listing 17.2
(C-SB+l-o-o-u+l-o-o-*u.litmus and C-SB+l-o-o-u+l-o-o-u*-C.litmus). Table 17.4 compares the
performance and scalability of using the model’s spin_lock() and spin_unlock()
against emulating these primitives as shown in the listing. The difference is not
insignificant: At four processes, the model is more than two orders of magnitude faster
than emulation!
Quick Quiz 17.18: Why bother with a separate filter command on line 27 of Listing 17.2
instead of just adding the condition to the exists clause? And wouldn’t it be simpler to use
xchg_acquire() instead of cmpxchg_acquire()?
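The emulation idea can be sketched in userspace C. This is only an analogy, under the assumption that C11's acquire compare-and-exchange and release store stand in for the kernel's cmpxchg_acquire() and smp_store_release(). It is not the litmus test from the listing, and the emu_ names are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

#define EMU_MAX_THREADS 16

/* Hypothetical emulated lock: 0 means unlocked, 1 means locked. */
static atomic_int emu_lock;
static long shared_counter; /* Protected by emu_lock. */

static void emu_spin_lock(atomic_int *l)
{
    int expected;

    do {
        expected = 0; /* A failed exchange overwrites expected, so reset it. */
    } while (!atomic_compare_exchange_weak_explicit(l, &expected, 1,
                                                    memory_order_acquire,
                                                    memory_order_relaxed));
}

static void emu_spin_unlock(atomic_int *l)
{
    atomic_store_explicit(l, 0, memory_order_release);
}

static void *incrementer(void *arg)
{
    long i, n = *(long *)arg;

    for (i = 0; i < n; i++) {
        emu_spin_lock(&emu_lock);
        shared_counter++;
        emu_spin_unlock(&emu_lock);
    }
    return NULL;
}

/* Run up to EMU_MAX_THREADS threads, each doing n locked increments. */
long emu_lock_demo(int nthreads, long n)
{
    pthread_t tid[EMU_MAX_THREADS];
    int i, started = 0;

    if (nthreads > EMU_MAX_THREADS)
        nthreads = EMU_MAX_THREADS;
    shared_counter = 0;
    for (i = 0; i < nthreads; i++) {
        if (pthread_create(&tid[i], NULL, incrementer, &n) != 0)
            break;
        started++;
    }
    for (i = 0; i < started; i++)
        pthread_join(tid[i], NULL);
    return shared_counter;
}
```

A verifier that must reason about the compare-and-exchange and the release store has many more interleavings to explore than one that treats lock acquisition and release as single modeled primitives, which is the source of the performance difference described above.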
It would of course be quite useful for tools to automatically divide up large programs,
verify the pieces, and then verify the combinations of pieces. In the meantime,
verification of large programs will require significant manual intervention. This
intervention will preferably be mediated by scripting, the better to reliably carry out
repeated verifications on each release, and preferably eventually in a manner well-suited
for continuous integration. And Facebook’s Infer tool has taken important steps towards
doing just that, via compositionality and abstraction [BGOS18, DFLO19].
In any case, we can expect formal-verification capabilities to continue to increase over
time, and any such increases will in turn increase the applicability of formal verification
to regression testing.
Quick Quiz 17.20: But the formal-verification tools should immediately find all the bugs
introduced by the fixes, so why is this a problem?
Worse yet, imagine another software artifact with one bug that fails once every day on
average and 99 more that fail every million years each. Suppose that a formal-verification
tool located the 99 million-year bugs, but failed to find the one-day bug. Fixing the 99
bugs located will take time and effort, decrease reliability, and do nothing at all about
the pressing each-day failure that is likely causing embarrassment and perhaps much
worse besides.
Therefore, it would be best to have a validation tool that preferentially located the
most troublesome bugs. However, as noted in Section 17.4.4, it is permissible to
leverage additional tools. One powerful tool is none other than plain old testing. Given
knowledge of the bug, it should be possible to construct specific tests for it, possibly
also using some of the techniques described in Section 11.6.4 to increase the probability
of the bug manifesting. These techniques should allow calculation of a rough estimate
of the bug’s raw failure rate, which could in turn be used to prioritize bug-fix efforts.
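As a rough worked example of such prioritization, the following sketch plugs in the numbers from the scenario above: one bug failing about once per day and 99 bugs failing about once per million years each. The function names and the 365.25-day year are assumptions made for illustration.

```c
#include <assert.h>
#include <stdio.h>

#define DAYS_PER_YEAR 365.25 /* Assumed average year length. */

/* Failure rate, in failures per day, of a bug with the given mean time between failures. */
double failures_per_day(double mtbf_days)
{
    return 1.0 / mtbf_days;
}

/* Fraction of all failures that would be removed by fixing only the 99 rare bugs. */
double fraction_removed_by_rare_fixes(void)
{
    double daily_bug = failures_per_day(1.0);
    double rare_bugs = 99.0 * failures_per_day(1.0e6 * DAYS_PER_YEAR);

    return rare_bugs / (daily_bug + rare_bugs);
}

/* Print both rates so that the bug-fix effort can be ordered by impact. */
void print_failure_rates(void)
{
    printf("daily bug:    %g failures/day\n", failures_per_day(1.0));
    printf("99 rare bugs: %g failures/day\n",
           99.0 * failures_per_day(1.0e6 * DAYS_PER_YEAR));
    printf("fraction removed by fixing the rare bugs: %g\n",
           fraction_removed_by_rare_fixes());
}
```

With these inputs, fixing all 99 rare bugs removes well under one part in a million of the observed failures, so the one-day bug dominates any sensible priority list.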
Quick Quiz 17.21: But many formal-verification tools can only find one bug at a time, so
that each bug must be fixed before the tool can locate the next. How can bug-fix efforts be
prioritized given such a tool?
There has been some recent formal-verification work that prioritizes executions
having fewer preemptions, under the reasonable assumption that smaller numbers of
preemptions are more likely.
Identifying relevant bugs might sound like too much to ask, but it is what is really
required if we are to actually increase software reliability.
Promela requires hand translation and supports only sequential consistency, so its
first two cells are red. It has reasonable overhead (for formal verification, anyway) and
provides a traceback, so its next two cells are yellow. Despite requiring hand translation,
Promela handles assertions in a natural way, so its fifth cell is green.
PPCMEM usually requires hand translation due to the small size of litmus tests that
it supports, so its first cell is orange. It handles several memory models, so its second
cell is green. Its overhead is quite high, so its third cell is red. It provides a graphical
display of relations among operations, which is not as helpful as a traceback, but is still
quite useful, so its fourth cell is yellow. It requires constructing an exists clause and
cannot take intra-process assertions, so its fifth cell is also yellow.
The herd tool has size restrictions similar to those of PPCMEM, so herd’s first cell
is also orange. It supports a wide variety of memory models, so its second cell is blue.
It has reasonable overhead, so its third cell is yellow. Its bug-location and assertion
capabilities are quite similar to those of PPCMEM, so herd also gets yellow for the
next two cells.
The cbmc tool inputs C code directly, so its first cell is blue. It supports a few memory
models, so its second cell is yellow. It has reasonable overhead, so its third cell is also
yellow. However, perhaps SAT-solver performance will continue improving. It provides
a traceback, so its fourth cell is green. It takes assertions directly from the C code, so its
fifth cell is blue.
Nidhugg also inputs C code directly, so its first cell is also blue. It supports only a
couple of memory models, so its second cell is orange. Its overhead is quite low (for
formal verification), so its third cell is green. It provides a traceback, so its fourth cell is
green. It takes assertions directly from the C code, so its fifth cell is blue.
So what about the sixth and final row? It is too early to tell how any of the tools do at
finding the right bugs, so they are all yellow with question marks.
Quick Quiz 17.22: How would testing stack up in the scorecard shown in Table 17.5?
Quick Quiz 17.23: But aren’t there a great many more formal-verification systems than are
shown in Table 17.5?
Once again, please note that this table rates these tools for use in regression testing.
Just because many of them are a poor fit for regression testing does not at all mean that
they are useless. In fact, many of them have proven their worth many times over.18 Just
not for regression testing.
18 For but one example, Promela was used to verify the file system of none other than the
Curiosity Rover. Was your formal verification tool used on software that currently runs on
Mars???
However, this might well change. After all, formal verification tools made impressive
strides in the 2010s. If that progress continues, formal verification might well become
an indispensable tool in the parallel programmer’s validation toolbox.
17.5 Functional Programming for Parallelism
When I took my first-ever functional-programming class in the early 1980s, the professor
asserted that the side-effect-free functional-programming style was well-suited to trivial
parallelization and analysis. Thirty years later, this assertion remains, but mainstream
production use of parallel functional languages is minimal, a state of affairs that might
not be entirely unrelated to the professor’s additional assertion that programs should neither
maintain state nor do I/O. There is niche use of functional languages such as Erlang,
and multithreaded support has been added to several other functional languages, but
mainstream production usage remains the province of procedural languages such as C,
C++, Java, and Fortran (usually augmented with OpenMP, MPI, or coarrays).
This situation naturally leads to the question “If analysis is the goal, why not transform
the procedural language into a functional language before doing the analysis?” There
are of course a number of objections to this approach, of which I list but three:
1. Procedural languages often make heavy use of global variables, which can be
updated independently by different functions, or, worse yet, by multiple threads.
Note that Haskell’s monads were invented to deal with single-threaded global state,
and that multi-threaded access to global state inflicts additional violence on the
functional model.
17.6 Summary
This chapter has taken a quick tour of a number of possible futures, including multicore,
transactional memory, formal verification as a regression test, and concurrent functional
programming. Any of these futures might come true, but it is more likely that, as in the
past, the future will be far stranger than we can possibly imagine.
History is the sum total of things that could have
been avoided.
Konrad Adenauer
Chapter 18
Looking Forward and Back
You have arrived at the end of this book, well done! I hope that your journey was a
pleasant but challenging and worthwhile one.
For your editor and contributors, this is the end of the journey to the Second Edition,
but for those willing to join in, it is also the start of the journey to the Third Edition.
Either way, it is good to recap this past journey.
Chapter 1 covered what this book is about, along with some alternatives for those
interested in something other than low-level parallel programming.
Chapter 2 covered parallel-programming challenges and high-level approaches for
addressing them. It also touched on ways of avoiding these challenges while nevertheless
still gaining most of the benefits of parallelism.
Chapter 3 gave a high-level overview of multicore hardware, especially those aspects
that pose challenges for concurrent software. This chapter puts the blame for these
challenges where it belongs, very much on the laws of physics and rather less on
intransigent hardware architects and designers. However, there might be some things
that hardware architects and engineers can do, and this chapter discusses a few of them.
In the meantime, software architects and engineers must do their part to meet these
challenges, as discussed in the rest of the book.
Chapter 4 gave a quick overview of the tools of the low-level concurrency trade.
Chapter 5 then demonstrated use of those tools—and, more importantly, use of parallel-
programming design techniques—on the simple but surprisingly challenging task of
concurrent counting. So challenging, in fact, that a number of concurrent counting
algorithms are in common use, each specialized for a different use case.
Chapter 6 dug more deeply into the most important parallel-programming design
technique, namely partitioning the problem at the highest possible level. This chapter
also overviewed a number of points in this design space.
Chapter 7 expounded on that parallel-programming workhorse (and villain), locking.
This chapter covered a number of types of locking and presented some engineering
solutions to many well-known and aggressively advertised shortcomings of locking.
Chapter 8 discussed the uses of data ownership, where synchronization is supplied
by the association of a given data item with a specific thread. Where it applies, this
approach combines excellent performance and scalability with profound simplicity.
Chapter 9 showed how a little procrastination can greatly improve performance and
scalability, while in a surprisingly large number of cases also simplifying the code. A
number of the mechanisms presented in this chapter take advantage of the ability of
CPU caches to replicate read-only data, thus sidestepping the laws of physics that cruelly
limit the speed of light and the smallness of atoms. Chapter 10 looked at concurrent
data structures, with emphasis on hash tables, which have a long and honorable history
in parallel programs.
Chapter 11 dug into code-review and testing methods, and Chapter 12 overviewed
formal verification. Whichever side of the formal-verification/testing divide you might
be on, if code has not been thoroughly validated, it does not work. And that goes at
least double for concurrent code.
Chapter 13 presented a number of situations where combining concurrency mecha-
nisms with each other or with other design tricks can greatly ease parallel programmers’
lives. Chapter 14 looked at advanced synchronization methods, including lockless pro-
gramming, non-blocking synchronization, and parallel real-time computing. Chapter 15
dug into the critically important topic of memory ordering, presenting techniques and
tools to help you not only solve memory-ordering problems, but also to avoid them
completely. Chapter 16 presented a brief overview of the surprisingly important topic
of ease of use.
Last, but definitely not least, Chapter 17 expounded on a number of conflicting
visions of the future, including CPU-technology trends, transactional memory, hardware
transactional memory, use of formal verification in regression testing, and the long-
standing prediction that the future of parallel programming belongs to functional-
programming languages.
But now that we have recapped the contents of this Second Edition, how did this book
get started?
Paul’s parallel-programming journey started in earnest in 1990, when he joined
Sequent Computer Systems, Inc. Sequent used an apprenticeship-like program in which
newly hired engineers were placed in cubicles surrounded by experienced engineers,
who mentored them, reviewed their code, and gave copious quantities of advice on a
variety of topics. A few of the newly hired engineers were greatly helped by the fact
that there were no on-chip caches in those days, which meant that logic analyzers could
easily display a given CPU’s instruction stream and memory accesses, complete with
accurate timing information. Of course, the downside of this transparency was that
CPU core clock frequencies were 100 times slower than those of the twenty-first century.
Between apprenticeship and hardware performance transparency, these newly hired
engineers became productive parallel programmers within two or three months, and
some were doing ground-breaking work within a couple of years.
Sequent understood that its ability to quickly train new engineers in the mysteries of
parallelism was unusual, so it produced a slim volume that crystalized the company’s
parallel-programming wisdom [Seq88], which joined a pair of groundbreaking papers
that had been written a few years earlier [BK85, Inm85]. People already steeped in
these mysteries saluted this book and these papers, but novices were usually unable to
benefit much from them, invariably making highly creative and quite destructive errors
that were not explicitly prohibited by either the book or the papers.1 This situation of
course caused Paul to start thinking in terms of writing an improved book, but his efforts
during this time were limited to internal training materials and to published papers.
By the time Sequent was acquired by IBM in 1999, many of the world’s largest
database instances ran on Sequent hardware. But times change, and by 2001 many
of Sequent’s parallel programmers had shifted their focus to the Linux kernel. After
some initial reluctance, the Linux kernel community embraced concurrency both
enthusiastically and effectively [BWCM+ 10, McK12a], with many excellent innovations
and improvements from throughout the community. The thought of writing a book
occurred to Paul from time to time, but life was flowing fast, so he made no progress on
this project.
1 “But why on earth would you do that???” “Well, why not?”
In 2006, Paul was invited to a conference on Linux scalability, and was granted the
privilege of asking the last question of a panel of esteemed parallel-programming experts.
Paul began his question by noting that in the 15 years from 1991 to 2006, the price of a
parallel system had dropped from that of a house to that of a mid-range bicycle, and it was
clear that there was much more room for additional dramatic price decreases over the next
15 years extending to the year 2021. He also noted that decreasing price should result in
greater familiarity and faster progress in solving parallel-programming problems. This
led to his question: “In the year 2021, why wouldn’t parallel programming have become
routine?”
The first panelist seemed quite disdainful of anyone who would ask such an absurd
question, and quickly responded with a soundbite answer. To which Paul gave a
soundbite response. They went back and forth for some time. For example, the panelist’s
soundbite answer “Deadlock” provoked Paul’s soundbite response “Lock dependency
checker”.
The panelist eventually ran out of soundbites, improvising a final “People like you
should be hit over the head with a hammer!”
Paul’s response was of course “You will have to get in line for that!”
Paul turned his attention to the next panelist, who seemed torn between agreeing with
the first panelist and not wishing to have to deal with Paul’s series of responses. He
therefore gave a short non-committal speech. And so it went through the rest of the
panel.
Until it was the turn of the last panelist, who was someone you might have heard of
who goes by the name of Linus Torvalds. Linus noted that three years earlier (that is,
2003), the initial version of any concurrency-related patch was usually quite poor, having
design flaws and many bugs. And even when it was cleaned up enough to be accepted,
bugs still remained. Linus contrasted this with the then-current situation in 2006, in
which he said that it was not unusual for the first version of a concurrency-related patch
to be well-designed with few or even no bugs. He then suggested that if tools continued
to improve, then maybe parallel programming would become routine by the year 2021.2
The conference then concluded. Paul was not surprised to be given a wide berth by
many audience members, especially those who saw the world in the same way as did
the first panelist. Paul was also not surprised that a few audience members thanked him
for the question. However, he was quite surprised when one man came up to say “thank
you” with tears streaming down his face, sobbing so hard that he could barely speak.
You see, this man had worked several years at Sequent, and thus very well understood
parallel programming. Furthermore, he was currently assigned to a group whose job it
was to write parallel code. Which was not going well. You see, it wasn’t that they had
trouble understanding his explanations of parallel programming.
It was that they refused to listen to him at all.
In short, his group was treating this man in the same way that the first panelist
attempted to treat Paul. And so in that moment, Paul went from “I should write a book
some day” to “I will do whatever it takes to write this book”. Paul is embarrassed to
admit that he does not remember the man’s name, if in fact he ever knew it.
2 Tools have in fact continued to improve, including fuzzers, lock dependency checkers,
static analyzers, formal verification, memory models, and code-modification tools such as
coccinelle. Therefore, those who wish to assert that year-2021 parallel programming is not
routine should refer to Chapter 2’s epigraph.
Ask me no questions, and I’ll tell you no fibs.
She Stoops to Conquer, Oliver Goldsmith
Appendix A
Important Questions
The following sections discuss some important questions relating to SMP programming.
Each section also shows how to avoid worrying about the corresponding question, which
can be extremely important if your goal is to simply get your SMP code working as
quickly and painlessly as possible—which is an excellent goal, by the way!
Although the answers to these questions are often less intuitive than they would be in a
single-threaded setting, with a bit of work, they are not that difficult to understand. If you
managed to master recursion, there is nothing here that should pose an overwhelming
challenge.
With that, here are the questions:
1. Why aren’t parallel programs always faster? (Appendix A.1)
2. Why not remove locking? (Appendix A.2)
3. What time is it? (Appendix A.3)
4. What does “after” mean? (Appendix A.4)
5. How much ordering is needed? (Appendix A.5)
6. What is the difference between “concurrent” and “parallel”? (Appendix A.6)
7. Why is software buggy? (Appendix A.7)
Read on to learn some answers. Improve upon these answers if you can!
[Figure A.2: histogram. Horizontal axis: Nanoseconds Deviation, from -100 to 60. Vertical axis: Frequency, from 0 to 300.]
between the two readings is then a measure of uncertainty of the time at which the
intervening operation occurred.
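A minimal sketch of this bracketing technique follows, assuming POSIX clock_gettime() with CLOCK_MONOTONIC. The structure and function names are hypothetical and chosen only for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <time.h>

/* Hypothetical record of the two clock readings surrounding an operation. */
struct time_bracket {
    struct timespec before;
    struct timespec after;
};

/* Nanoseconds elapsed from *a to *b. */
static long long ts_diff_ns(const struct timespec *a, const struct timespec *b)
{
    return (long long)(b->tv_sec - a->tv_sec) * 1000000000LL +
           (b->tv_nsec - a->tv_nsec);
}

/*
 * Invoke op(arg) between two clock readings.  Returns the width of the
 * uncertainty window in nanoseconds, or -1 if the clock could not be read.
 */
long long bracket_operation(void (*op)(void *), void *arg,
                            struct time_bracket *tb)
{
    if (clock_gettime(CLOCK_MONOTONIC, &tb->before) != 0)
        return -1;
    op(arg);
    if (clock_gettime(CLOCK_MONOTONIC, &tb->after) != 0)
        return -1;
    return ts_diff_ns(&tb->before, &tb->after);
}

/* Example operation: increment the int that arg points to. */
void sample_op(void *arg)
{
    (*(int *)arg)++;
}
```

The operation is known only to have happened somewhere between the two readings, so the returned width bounds how precisely its time can be stated.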
Of course, in many cases, the exact time is not necessary. For example, when
printing the time for the benefit of a human user, we can rely on slow human reflexes to
render internal hardware and software delays irrelevant. Similarly, if a server needs to
timestamp the response to a client, any time between the reception of the request and
the transmission of the response will do equally well.
There is an old saying that those who have but one clock always know the time, but
those who have several clocks can never be sure. And there was a time when the typical
low-end computer’s sole software-visible clock was its program counter, but those days
are long gone. This is not a bad thing, considering that on modern computer systems,
the program counter is a truly horrible clock [MOZ09].
In addition, different clocks provide different tradeoffs of performance, accuracy,
precision, and ordering. For example, in the Linux kernel, the jiffies counter1 provides
high-speed access to a coarse-grained counter (at best one-millisecond accuracy and
precision) that imposes very little ordering on either the compiler or the hardware. In
contrast, the x86 HPET hardware provides an accurate and precise clock, but at the
price of slow access. The x86 time-stamp counter (TSC) has a checkered past, but is
more recently held out as providing a good combination of precision, accuracy, and
performance. Unfortunately, for all of these counters, ordering against all effects of
prior and subsequent code requires expensive memory-barrier instructions. And this
expense appears to be an unavoidable consequence of the complex superscalar nature of
modern computer systems.
1 The jiffies variable is a location in normal memory that is incremented by software
In addition, each clock source provides its own timebase. Figure A.2 shows a
histogram of the value returned by a call to clock_gettime(CLOCK_MONOTONIC)
subtracted from that returned by an immediately following
clock_gettime(CLOCK_REALTIME) (timeskew.c). Because some time passes between these two function
calls, it is no surprise that there are positive deviations, but the negative deviations
should give us some pause. Nevertheless, such deviations are possible, if for no other
reason than the machinations of network time protocol (NTP) [Wei22f].
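The following is a sketch of how such deviations might be collected. It is not the book's timeskew.c. In particular, reporting each sample's offset relative to the first sample is an assumption made here so that the results are small numbers, and the function names are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <time.h>

/* Convert a timespec to nanoseconds. */
static long long ts_to_ns(const struct timespec *ts)
{
    return (long long)ts->tv_sec * 1000000000LL + ts->tv_nsec;
}

/*
 * Sample the offset between CLOCK_REALTIME and CLOCK_MONOTONIC n times,
 * storing each offset's deviation from the first sample in dev_ns[].
 * Returns the number of samples actually collected.
 */
int sample_clock_deviation(long long *dev_ns, int n)
{
    struct timespec mono, real;
    long long offset, first_offset = 0;
    int i;

    for (i = 0; i < n; i++) {
        if (clock_gettime(CLOCK_MONOTONIC, &mono) != 0 ||
            clock_gettime(CLOCK_REALTIME, &real) != 0)
            break;
        offset = ts_to_ns(&real) - ts_to_ns(&mono);
        if (i == 0)
            first_offset = offset;
        dev_ns[i] = offset - first_offset;
    }
    return i;
}
```

Feeding the resulting array into a histogram would give a picture of how far the two timebases wander relative to each other over the sampling period.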
Worse yet, identical clock sources on different systems are not necessarily compatible
with one another. For example, the jiffies counters on a pair of systems very likely
started counting at different times, and worse yet might well be counting at different
rates. This brings up the topic of synchronizing a given system’s counters with some
real-world notion of time such as the aforementioned NTP, but that topic is beyond the
scope of this book.
In short, time is a slippery topic that causes untold confusion to parallel programmers
and to their code.
One might intuitively expect that the difference between the producer and consumer
timestamps would be quite small, as it should not take much time for the producer to
record the timestamps or the values. An excerpt of some sample output on a dual-core
1 GHz x86 is shown in Table A.1. Here, the “seq” column is the number of times
through the loop, the “time” column is the time of the anomaly in seconds, the “delta”
column is the number of seconds the consumer’s timestamp follows that of the producer
(where a negative value indicates that the consumer has collected its timestamp before
the producer did), and the columns labelled “a”, “b”, and “c” show the amount that
these variables increased since the prior snapshot collected by the consumer.
Why is time going backwards? The number in parentheses is the difference in
microseconds, with a large number exceeding 10 microseconds, and one exceeding even
100 microseconds! Please note that this CPU can potentially execute more than 100,000
instructions in that time.
One possible reason is given by the following sequence of events:
2. Consumer is preempted.
5. Consumer starts running again, and picks up the producer’s timestamp (Listing A.2,
line 14).
In this scenario, the producer’s timestamp might be an arbitrary amount of time after
the consumer’s timestamp.
How do you avoid agonizing over the meaning of “after” in your SMP code?
Simply use SMP primitives as designed.
In this example, the easiest fix is to use locking, for example, acquire a lock in the
producer before line 10 in Listing A.1 and in the consumer before line 13 in Listing A.2.
This lock must also be released after line 13 in Listing A.1 and after line 17 in Listing A.2.
These locks cause the code segments in lines 10–13 of Listing A.1 and in lines 13–17 of
Listing A.2 to exclude each other, in other words, to run atomically with respect to each
other. This is represented in Figure A.3: The locking prevents any of the boxes of code
from overlapping in time, so that the consumer’s timestamp must be collected after the
prior producer’s timestamp. The segments of code in each box in this figure are termed
“critical sections”; only one such critical section may be executing at a given time.
This addition of locking results in output as shown in Table A.2. Here there are no
instances of time going backwards. Instead, there are only cases with more than 1,000
counts difference between consecutive reads by the consumer.
Quick Quiz A.2: How could there be such a large gap between successive consumer reads?
See timelocked.c for full code.
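For readers without the CodeSamples at hand, here is a minimal sketch of the locked producer and consumer. It is not timelocked.c itself: it assumes a pthread mutex as the lock and CLOCK_MONOTONIC as the time source, and every name other than the ss fields is hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <time.h>

struct snapshot {
    double t;     /* Producer's timestamp, in seconds. */
    long a, b, c; /* Values updated by the producer. */
};

static struct snapshot ss;
static pthread_mutex_t ss_lock = PTHREAD_MUTEX_INITIALIZER;

/* Assumption: CLOCK_MONOTONIC stands in for the book's time source. */
static double now_seconds(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

/* Producer critical section: timestamp, then update the three values. */
void producer_update(void)
{
    pthread_mutex_lock(&ss_lock);
    ss.t = now_seconds();
    ss.a = ss.c + 1;
    ss.b = ss.a + 1;
    ss.c = ss.b + 1;
    pthread_mutex_unlock(&ss_lock);
}

/* Consumer critical section: returns consumer time minus producer time. */
double consumer_snapshot(struct snapshot *cur)
{
    double tc;

    pthread_mutex_lock(&ss_lock);
    tc = now_seconds();
    *cur = ss;
    pthread_mutex_unlock(&ss_lock);
    return tc - cur->t;
}

static void *producer(void *arg)
{
    long i, n = *(long *)arg;

    for (i = 0; i < n; i++)
        producer_update();
    return NULL;
}

/* Returns the number of snapshots in which time appeared to go backwards, or -1 on error. */
int count_backwards(long n)
{
    pthread_t tid;
    struct snapshot cur;
    int backwards = 0;
    long i;

    if (pthread_create(&tid, NULL, producer, &n) != 0)
        return -1;
    for (i = 0; i < n; i++)
        if (consumer_snapshot(&cur) < 0.0)
            backwards++;
    pthread_join(tid, NULL);
    return backwards;
}
```

Because both timestamps are collected while holding the lock, each consumer timestamp is taken after the most recent producer critical section has completed, so count_backwards() is expected to return zero.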
Figure A.3 (boxes of code along the time axis):
Producer
    ss.t = dgettimeofday();
    ss.a = ss.c + 1;
    ss.b = ss.a + 1;
    ss.c = ss.b + 1;
Consumer
    curssc.tc = gettimeofday();
    curssc.t = ss.t;
    curssc.a = ss.a;
    curssc.b = ss.b;
    curssc.c = ss.c;
Producer
    ss.t = dgettimeofday();
    ss.a = ss.c + 1;
    ss.b = ss.a + 1;
    ss.c = ss.b + 1;
In summary, if you acquire an exclusive lock, you know that anything you do while
holding that lock will appear to happen after anything done by any prior holder of that
lock, at least give or take transactional lock elision (see Section 17.3.2.6). No need to
worry about which CPU did or did not execute a memory barrier, no need to worry about
the CPU or compiler reordering operations—life is simple. Of course, the fact that this
locking prevents these two pieces of code from running concurrently might limit the
program’s ability to gain increased performance on multiprocessors, possibly resulting
in a “safe but slow” situation. Chapter 6 describes ways of gaining performance and
scalability in many situations.
In short, in many parallel programs, the really important definition of “after” is
ordering of operations, which is covered in dazzling detail in Chapter 15.
However, in most cases, if you find yourself worrying about what happens before
or after a given piece of code, you should take this as a hint to make better use of the
standard primitives. Let these primitives do the worrying for you.
One approach is to construct a strongly ordered system, then examine its performance
and scalability. If these suffice, the system is good and sufficient, and no more need
be done. Otherwise, undertake careful analysis (see Section 11.7) and attack each
bottleneck until the system’s performance is good and sufficient.
This approach can work very well, especially in contrast to the all-too-common
approach of optimizing random components of the system in the hope of achieving
significant system-wide benefits. However, starting with strong ordering can also be
quite wasteful, given that weakening ordering of the system’s bottleneck can require
that large portions of the rest of the system be redesigned and rewritten to accommodate
the weakening. Worse yet, eliminating one bottleneck often exposes another, which
in turn needs to be weakened and which in turn can result in wholesale redesigns and
rewrites of other parts of the system. Perhaps even worse is the approach, also common,
of starting with a fast but unreliable system and then playing whack-a-mole with an
endless succession of concurrency bugs, though in the latter case, Chapters 11 and 12
are always there for you.
It would be better to have design-time tools to determine which portions of the system
could use weak ordering, and at the same time, which portions actually benefit from
weak ordering. These tasks are taken up by the following sections.
1. Some value between the conceptual value at the time of the call to the function and
the conceptual value at the time of the return from that function. For example, see
the statistical counters discussed in Section 5.2, keeping in mind that such counters
are normally monotonic, at least between consecutive overflows.
2. The actual value at some time between the call to and the return from that function.
For example, see the single-variable atomic counter shown in Listing 5.2.
3. If the values used by that function remain unchanged during the time between
that function’s call and return, the expected value, otherwise some approximation
to the expected value. Precise specification of the bounds on the approximation
can be quite challenging. For example, consider a function combining values
from different elements of an RCU-protected linked data structure, as described in
Section 10.3.
Weaker ordering usually implies weaker semantics, and you should be able to give
some sort of promise to your users as to how this weakening affects them. At the same
time, unless the caller holds a lock across both the function call and the use of any
values computed by that function, even fully ordered implementations normally cannot
do any better than the semantics given by the options above.
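The first option above can be illustrated with a minimal sketch of a statistical counter. It assumes a fixed number of threads and C11 atomics, and the names NR_THREADS, inc_count(), and read_count() are chosen for illustration rather than taken from the code in Section 5.2.

```c
#include <stdatomic.h>

#define NR_THREADS 4	/* Assumed fixed thread count, for illustration only. */

/* One counter per thread, so that updaters never contend with each other. */
static atomic_ulong counter[NR_THREADS];

/* Increment the calling thread's counter; "me" is that thread's index. */
static void inc_count(int me)
{
	atomic_fetch_add_explicit(&counter[me], 1, memory_order_relaxed);
}

/*
 * Sum the per-thread counters.  Increments that run concurrently with
 * the summation may or may not be included, so for a monotonic counter
 * the result lies between the aggregate value at the time of the call
 * and the aggregate value at the time of the return.
 */
static unsigned long read_count(void)
{
	unsigned long sum = 0;

	for (int t = 0; t < NR_THREADS; t++)
		sum += atomic_load_explicit(&counter[t], memory_order_relaxed);
	return sum;
}
```

The relaxed accesses are deliberate: nothing orders one thread's increments against another's, which is exactly the weakening that the promised semantics must account for.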
Quick Quiz A.3: But if fully ordered implementations cannot offer stronger guarantees than
the better performing and more scalable weakly ordered implementations, why bother with full
ordering?
Some might argue that useful computing deals only with the outside world, and
therefore that all computing can use weak ordering. Such arguments are incorrect. For
example, the value of your bank account is defined within your bank’s computers, and
people often prefer exact computations involving their account balances, especially
those who might suspect that any such approximations would be in the bank’s favor.
In short, although data tracking external state can be an attractive candidate for weakly
ordered access, please think carefully about exactly what is being tracked and what is
doing the tracking.
This of course begs the question of why such a distinction matters, which brings
us to the second perspective, that of the underlying scheduler. Schedulers come in a
wide range of complexities and capabilities, and as a rough rule of thumb, the more
tightly and irregularly a set of parallel processes communicate, the higher the level of
sophistication required from the scheduler. As such, parallel computing’s avoidance
of interdependencies means that parallel-computing programs run well on the least-
capable schedulers. In fact, a pure parallel-computing program can run successfully
after being arbitrarily subdivided and interleaved onto a uniprocessor.3 In contrast,
concurrent-computing programs might well require extreme subtlety on the part of the
scheduler.
One could argue that we should simply demand a reasonable level of competence from
the scheduler, so that we could simply ignore any distinctions between parallelism and
concurrency. Although this is often a good strategy, there are important situations where
efficiency, performance, and scalability concerns sharply limit the level of competence
that the scheduler can reasonably offer. One important example is when the scheduler is
implemented in hardware, as it often is in SIMD units or GPGPUs. Another example
is a workload where the units of work are quite short, so that even a software-based
scheduler must make hard choices between subtlety on the one hand and efficiency on
the other.
Now, this second perspective can be thought of as making the workload match the
available scheduler, with parallel workloads able to use simple schedulers and concurrent
workloads requiring sophisticated schedulers.
Unfortunately, this perspective does not always align with the dependency-based
distinction put forth by the first perspective. For example, a highly interdependent
lock-based workload with one thread per CPU can make do with a trivial scheduler
because no scheduler decisions are required. In fact, some workloads of this type can
even be run one after another on a sequential machine. Therefore, such a workload
would be labeled “concurrent” by the first perspective and “parallel” by many taking the
second perspective.
3 Yes, this does mean that data-parallel-computing programs are best-suited for sequential
Quick Quiz A.5: In what part of the second (scheduler-based) perspective would the lock-based
single-thread-per-CPU workload be considered “concurrent”?
Which is just fine. No rule that humankind writes carries any weight against the
objective universe, not even rules dividing multiprocessor programs into categories
such as “concurrent” and “parallel”.
This categorization failure does not mean such rules are useless, but rather that you
should take on a suitably skeptical frame of mind when attempting to apply them to new
situations. As always, use such rules where they apply and ignore them otherwise.
In fact, it is likely that new categories will arise in addition to parallel, concurrent,
map-reduce, task-based, and so on. Some will stand the test of time, but good luck
guessing which!
The only difference between men and boys is the
price of their toys.
M. Hébert
Appendix B
“Toy” RCU Implementations
The toy RCU implementations in this appendix are designed not for high performance,
practicality, or any kind of production use,1 but rather for clarity. Nevertheless, you will
need a thorough understanding of Chapters 2, 3, 4, 6, and 9 for even these toy RCU
implementations to be easily understandable.
This appendix provides a series of RCU implementations in order of increasing
sophistication, from the viewpoint of solving the existence-guarantee problem. Appen-
dix B.1 presents a rudimentary RCU implementation based on simple locking, while
Appendices B.2 through B.9 present a series of simple RCU implementations based on
locking, reference counters, and free-running counters. Finally, Appendix B.10 provides
a summary and a list of desirable RCU properties.
DMS+ 12].
Because synchronize_rcu() does not return until it has acquired (and released)
the lock, it cannot return until all prior RCU read-side critical sections have completed,
thus faithfully implementing RCU semantics. Of course, only one RCU reader may be in
its read-side critical section at a time, which almost entirely defeats the purpose of RCU.
In addition, the lock operations in rcu_read_lock() and rcu_read_unlock() are
extremely heavyweight, with read-side overhead ranging from about 100 nanoseconds
on a single POWER5 CPU up to more than 17 microseconds on a 64-CPU system.
Worse yet, these same lock operations permit rcu_read_lock() to participate in
deadlock cycles. Furthermore, in the absence of recursive locks, RCU read-side critical
sections cannot be nested, and, finally, although concurrent RCU updates could in
principle be satisfied by a common grace period, this implementation serializes grace
periods, preventing grace-period sharing.
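A lock-based scheme of this kind might look roughly like the following minimal sketch, which assumes a POSIX mutex in place of whatever lock primitive the original listing uses. It is meant only to make the reasoning above concrete.

```c
#include <pthread.h>

/* A single global lock stands in for all of the RCU machinery. */
static pthread_mutex_t rcu_gp_lock = PTHREAD_MUTEX_INITIALIZER;

/* Readers hold the lock for the duration of their critical section. */
static void rcu_read_lock(void)
{
	pthread_mutex_lock(&rcu_gp_lock);
}

static void rcu_read_unlock(void)
{
	pthread_mutex_unlock(&rcu_gp_lock);
}

/*
 * Acquiring the lock cannot succeed until any reader holding it has
 * left its critical section, so returning after acquire-then-release
 * implies that all pre-existing readers have completed.
 */
static void synchronize_rcu(void)
{
	pthread_mutex_lock(&rcu_gp_lock);
	pthread_mutex_unlock(&rcu_gp_lock);
}
```

Because the mutex is not recursive, calling rcu_read_lock() twice in a row from the same thread would self-deadlock, which is the nesting restriction noted above.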
Quick Quiz B.1: Why wouldn’t any deadlock in the RCU implementation in Listing B.1 also
be a deadlock in any other RCU implementation?
Quick Quiz B.2: Why not simply use reader-writer locks in the RCU implementation in
Listing B.1 in order to allow RCU readers to proceed in parallel?
This implementation does have the virtue of permitting concurrent RCU readers, and
does avoid the deadlock condition that can arise with a single global lock. Furthermore,
the read-side overhead, though high at roughly 140 nanoseconds, remains at about 140
nanoseconds regardless of the number of CPUs. However, the update-side overhead
ranges from about 600 nanoseconds on a single POWER5 CPU up to more than 100
microseconds on 64 CPUs.
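The per-thread-lock idea can be sketched as follows, assuming POSIX mutexes, a fixed NR_THREADS, and an explicit thread index "me" passed by the caller. All three are assumptions made for this illustration.

```c
#include <pthread.h>

#define NR_THREADS 4	/* Assumed fixed thread count, for illustration only. */

/* One lock per thread, so that readers never contend with one another. */
static pthread_mutex_t rcu_gp_lock[NR_THREADS] = {
	PTHREAD_MUTEX_INITIALIZER, PTHREAD_MUTEX_INITIALIZER,
	PTHREAD_MUTEX_INITIALIZER, PTHREAD_MUTEX_INITIALIZER,
};

/* Each reader touches only its own lock, hence the constant read-side cost. */
static void rcu_read_lock(int me)
{
	pthread_mutex_lock(&rcu_gp_lock[me]);
}

static void rcu_read_unlock(int me)
{
	pthread_mutex_unlock(&rcu_gp_lock[me]);
}

/*
 * The updater acquires and releases every thread's lock in turn, so
 * the update-side cost grows with the number of threads.
 */
static void synchronize_rcu(void)
{
	for (int t = 0; t < NR_THREADS; t++) {
		pthread_mutex_lock(&rcu_gp_lock[t]);
		pthread_mutex_unlock(&rcu_gp_lock[t]);
	}
}
```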
Quick Quiz B.3: Wouldn’t it be cleaner to acquire all the locks, and then release them all in
the loop from lines 15–18 of Listing B.2? After all, with this change, there would be a point in
time when there were no readers, simplifying things greatly.
Quick Quiz B.4: Is the implementation shown in Listing B.2 free from deadlocks? Why or
why not?
Quick Quiz B.5: Isn’t one advantage of the RCU algorithm shown in Listing B.2 that it uses
only primitives that are widely available, for example, in POSIX pthreads?
This approach could be useful in some situations, given that a similar approach was
used in the Linux 2.4 kernel [MM00].
The counter-based RCU implementation described next overcomes some of the
shortcomings of the lock-based implementation.
However, this implementation still has some serious shortcomings. First, the atomic
operations in rcu_read_lock() and rcu_read_unlock() are still quite heavyweight,
with read-side overhead ranging from about 100 nanoseconds on a single POWER5
CPU up to almost 40 microseconds on a 64-CPU system. This means that the RCU
read-side critical sections have to be extremely long in order to get any real read-
side parallelism. On the other hand, in the absence of readers, grace periods elapse
in about 40 nanoseconds, many orders of magnitude faster than production-quality
implementations in the Linux kernel.
Quick Quiz B.7: How can the grace period possibly elapse in 40 nanoseconds when
synchronize_rcu() contains a 10-millisecond delay?
Design It is the two-element rcu_refcnt[] array that provides the freedom from
starvation. The key point is that synchronize_rcu() is only required to wait for
2 There is a race condition that this “monotonically decreasing” statement ignores. This
race condition will be dealt with by the code for synchronize_rcu(). In the meantime, I
suggest suspending disbelief.
Quick Quiz B.9: Why the memory barrier on line 5 of synchronize_rcu() in Listing B.6
given that there is a spin-lock acquisition immediately after?
Quick Quiz B.10: Why is the counter flipped twice in Listing B.6? Shouldn’t a single
flip-and-wait cycle be sufficient?
This implementation avoids the update-starvation issues that could occur in the
single-counter implementation shown in Listing B.3.
Discussion There are still some serious shortcomings. First, the atomic operations in
rcu_read_lock() and rcu_read_unlock() are still quite heavyweight. In fact, they
are more complex than those of the single-counter variant shown in Listing B.3, with the
read-side primitives consuming about 150 nanoseconds on a single POWER5 CPU and
almost 40 microseconds on a 64-CPU system. The update-side synchronize_rcu()
primitive is more costly as well, ranging from about 200 nanoseconds on a single
POWER5 CPU to more than 40 microseconds on a 64-CPU system. This means that
the RCU read-side critical sections have to be extremely long in order to get any real
read-side parallelism.
Second, if there are many concurrent rcu_read_lock() and rcu_read_unlock()
operations, there will be extreme memory contention on the rcu_refcnt elements, re-
sulting in expensive cache misses. This further extends the RCU read-side critical-section
duration required to provide parallel read-side access. These first two shortcomings
defeat the purpose of RCU in most situations.
Third, the need to flip rcu_idx twice imposes substantial overhead on updates,
especially if there are large numbers of threads.
Finally, despite the fact that concurrent RCU updates could in principle be satisfied
by a common grace period, this implementation serializes grace periods, preventing
grace-period sharing.
Quick Quiz B.11: Given that atomic increment and decrement are so expensive, why not just
use non-atomic increment on line 10 and a non-atomic decrement on line 25 of Listing B.5?
Despite these shortcomings, one could imagine this variant of RCU being used on
small tightly coupled multiprocessors, perhaps as a memory-conserving implementation
that maintains API compatibility with more complex implementations. However, it
would not likely scale well beyond a few CPUs.
The next section describes yet another variation on the reference-counting scheme
that provides greatly improved read-side performance and scalability.
Quick Quiz B.12: Come off it! We can see the atomic_read() primitive in
rcu_read_lock()!!! So why are you trying to pretend that rcu_read_lock() contains no
atomic operations???
Quick Quiz B.13: Great, if we have 𝑁 threads, we can have 2𝑁 ten-millisecond waits (one set
per flip_counter_and_wait() invocation, and even that assumes that we wait only once for
each thread). Don’t we need the grace period to complete much more quickly?
This implementation still has several shortcomings. First, the need to flip rcu_idx
twice imposes substantial overhead on updates, especially if there are large numbers of
threads.
Finally, as noted in the text, the need for per-thread variables and for enumerating
threads may be problematic in some software environments.
That said, the read-side primitives scale very nicely, requiring about 115 nanoseconds
regardless of whether running on a single-CPU or a 64-CPU POWER5 system. As
noted above, the synchronize_rcu() primitive does not scale, ranging in overhead
from almost a microsecond on a single POWER5 CPU up to almost 200 microseconds
on a 64-CPU system. This implementation could conceivably form the basis for a
production-quality user-level RCU implementation.
The next section describes an algorithm permitting more efficient concurrent RCU
updates.
Listing B.10: RCU Read-Side Using Per-Thread Reference-Count Pair and Shared Update Data
DEFINE_SPINLOCK(rcu_gp_lock);
DEFINE_PER_THREAD(int [2], rcu_refcnt);
long rcu_idx;
DEFINE_PER_THREAD(int, rcu_nesting);
DEFINE_PER_THREAD(int, rcu_read_idx);
Listing B.11: RCU Read-Side Using Per-Thread Reference-Count Pair and Shared Update
static void rcu_read_lock(void)
{
	int i;
	int n;

	n = __get_thread_var(rcu_nesting);
	if (n == 0) {
		/* Outermost reader: select the current counter and count ourselves in. */
		i = READ_ONCE(rcu_idx) & 0x1;
		__get_thread_var(rcu_read_idx) = i;
		__get_thread_var(rcu_refcnt)[i]++;
	}
	__get_thread_var(rcu_nesting) = n + 1;
	smp_mb(); /* Keep the critical section from leaking out ahead of the increment. */
}

static void rcu_read_unlock(void)
{
	int i;
	int n;

	smp_mb(); /* Keep the critical section from leaking out past the decrement. */
	n = __get_thread_var(rcu_nesting);
	if (n == 1) {
		/* Outermost reader exiting: decrement the counter chosen at entry. */
		i = __get_thread_var(rcu_read_idx);
		__get_thread_var(rcu_refcnt)[i]--;
	}
	__get_thread_var(rcu_nesting) = n - 1;
}
1. There is a new oldctr local variable that captures the pre-lock-acquisition value
of rcu_idx on line 20.
3. Lines 27–30 check to see if at least three counter flips were performed by other
threads while the lock was being acquired, and, if so, release the lock, execute a
memory barrier, and return. In this case, there were two full waits for the counters
to go to zero, so those other threads already did all the required work.
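The three-flip check can be sketched as a small helper. The function name enough_flips_elapsed() and the use of unsigned arithmetic are assumptions made for this illustration and are not taken from the listing itself.

```c
#include <stdbool.h>

/*
 * Return true if at least three counter flips occurred between the
 * snapshot "oldctr", taken before lock acquisition, and the value
 * "ctr", read after the lock was acquired.  Unsigned subtraction
 * gives the number of flips modulo the counter width, so a counter
 * that wraps between the two reads is still handled sensibly.
 */
static bool enough_flips_elapsed(unsigned long oldctr, unsigned long ctr)
{
	return ctr - oldctr >= 3;
}
```

A caller for which this returns true can drop the lock, execute a memory barrier, and return, because other updaters have already done the two required waits on its behalf.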
Third, this implementation requires per-thread variables and the ability to enumerate
threads, which again can be problematic in some software environments.
Finally, on 32-bit machines, a given update thread might be preempted long enough
for the rcu_idx counter to overflow. This could cause such a thread to force an
unnecessary pair of counter flips. However, even if each grace period took only one
microsecond, the offending thread would need to be preempted for more than an hour,
in which case an extra pair of counter flips is likely the least of your worries.
As with the implementation described in Appendix B.5, the read-side primitives
scale extremely well, incurring roughly 115 nanoseconds of overhead regardless of the
number of CPUs. The synchronize_rcu() primitive is still expensive, ranging from
about one microsecond up to about 16 microseconds. This is nevertheless much cheaper
than the roughly 200 microseconds incurred by the implementation in Appendix B.5.
So, despite its shortcomings, one could imagine this RCU implementation being used
in production in real-life applications.
Quick Quiz B.14: All of these toy RCU implementations have either atomic operations in
rcu_read_lock() and rcu_read_unlock(), or synchronize_rcu() overhead that increases
linearly with the number of threads. Under what circumstances could an RCU implementation
enjoy lightweight implementations for all three of these primitives, all having deterministic
(O(1)) overheads and latencies?
Referring back to Listing B.11, we see that there is one global-variable access and
no fewer than four accesses to thread-local variables. Given the relatively high cost of
thread-local accesses on systems implementing POSIX threads, it is tempting to collapse
the three thread-local variables into a single structure, permitting rcu_read_lock()
and rcu_read_unlock() to access their thread-local data with a single thread-local-
storage access. However, an even better approach would be to reduce the number of
thread-local accesses to one, as is done in the next section.
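The first idea, collapsing the three thread-local variables into a single structure, might look like the following read-side sketch. It assumes C11 _Thread_local storage and atomics. The structure and field names are hypothetical, and the update side, which would need a way to find and safely read every thread's structure, is omitted.

```c
#include <stdatomic.h>

static atomic_long rcu_idx;	/* Global counter flipped by updaters. */

/* All of a reader's state in one place, reachable via one TLS lookup. */
struct rcu_reader_state {
	int refcnt[2];
	int nesting;
	int read_idx;
};

static _Thread_local struct rcu_reader_state rcu_reader;

static void rcu_read_lock(void)
{
	struct rcu_reader_state *rs = &rcu_reader;	/* The only TLS access. */

	if (rs->nesting == 0) {
		rs->read_idx = atomic_load_explicit(&rcu_idx,
						    memory_order_relaxed) & 0x1;
		rs->refcnt[rs->read_idx]++;
	}
	rs->nesting++;
	atomic_thread_fence(memory_order_seq_cst);	/* Stand-in for smp_mb(). */
}

static void rcu_read_unlock(void)
{
	struct rcu_reader_state *rs = &rcu_reader;	/* The only TLS access. */

	atomic_thread_fence(memory_order_seq_cst);
	if (rs->nesting == 1)
		rs->refcnt[rs->read_idx]--;
	rs->nesting--;
}
```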
Quick Quiz B.15: If any even value is sufficient to tell synchronize_rcu() to ignore a given
task, why don’t lines 11 and 12 of Listing B.14 simply assign zero to rcu_reader_gp?
incur more overhead, ranging from about 500 nanoseconds on a single POWER5 CPU
to more than 100 microseconds on 64 such CPUs.
Quick Quiz B.17: Couldn’t the update-side batching optimization described in Appendix B.6
be applied to the implementation shown in Listing B.14?
is the outermost rcu_read_lock(). If so, line 9 places the global rcu_gp_ctr into
tmp because the current value previously fetched by line 7 is likely to be obsolete. In
either case, line 10 increments the nesting depth, which you will recall is stored in the
seven low-order bits of the counter. Line 11 stores the updated counter back into this
thread’s instance of rcu_reader_gp, and, finally, line 12 executes a memory barrier to
prevent the RCU read-side critical section from bleeding out into the code preceding
the call to rcu_read_lock().
In other words, this implementation of rcu_read_lock() picks up a copy of the
global rcu_gp_ctr unless the current invocation of rcu_read_lock() is nested
within an RCU read-side critical section, in which case it instead fetches the contents of
the current thread’s instance of rcu_reader_gp. Either way, it increments whatever
value it fetched in order to record an additional nesting level, and stores the result in the
current thread’s instance of rcu_reader_gp.
Interestingly enough, despite their rcu_read_lock() differences, the implementa-
tion of rcu_read_unlock() is broadly similar to that shown in Appendix B.7. Line 17
executes a memory barrier in order to prevent the RCU read-side critical section from
bleeding out into code following the call to rcu_read_unlock(), and line 18 decre-
ments this thread’s instance of rcu_reader_gp, which has the effect of decrementing
the nesting count contained in rcu_reader_gp’s low-order bits. Debugging versions
of this primitive would check (before decrementing!) that these low-order bits were
non-zero.
The implementation of synchronize_rcu() is quite similar to that shown in
Appendix B.7. There are two differences. The first is that lines 27 and 28 add
RCU_GP_CTR_BOTTOM_BIT to the global rcu_gp_ctr instead of adding the constant
“2”, and the second is that the comparison on line 31 has been abstracted out to a separate
function, where it checks the bits indicated by RCU_GP_CTR_NEST_MASK instead of
unconditionally checking the low-order bit.
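The read-side bit manipulation described above can be sketched as follows. The sketch assumes C11 atomics and _Thread_local storage. RCU_GP_CTR_SHIFT and rcu_read_nesting() are names introduced for this illustration, and the update side is omitted because it would need a way to scan every thread's rcu_reader_gp.

```c
#include <stdatomic.h>

#define RCU_GP_CTR_SHIFT	7	/* Nesting depth occupies the seven low-order bits. */
#define RCU_GP_CTR_BOTTOM_BIT	(1L << RCU_GP_CTR_SHIFT)
#define RCU_GP_CTR_NEST_MASK	(RCU_GP_CTR_BOTTOM_BIT - 1)

static atomic_long rcu_gp_ctr;			/* Global grace-period counter. */
static _Thread_local long rcu_reader_gp;	/* Counter snapshot plus nesting depth. */

/* Current nesting depth of the calling thread (illustrative helper). */
static long rcu_read_nesting(void)
{
	return rcu_reader_gp & RCU_GP_CTR_NEST_MASK;
}

static void rcu_read_lock(void)
{
	long tmp = rcu_reader_gp;

	/* Outermost reader: the old snapshot is stale, so take a fresh one. */
	if ((tmp & RCU_GP_CTR_NEST_MASK) == 0)
		tmp = atomic_load_explicit(&rcu_gp_ctr, memory_order_relaxed);
	tmp++;					/* Record one more nesting level. */
	rcu_reader_gp = tmp;
	atomic_thread_fence(memory_order_seq_cst);	/* Stand-in for smp_mb(). */
}

static void rcu_read_unlock(void)
{
	atomic_thread_fence(memory_order_seq_cst);
	rcu_reader_gp--;			/* Drop one nesting level. */
}
```

A nested rcu_read_lock() keeps the snapshot taken by the outermost one, even if the global counter has advanced by RCU_GP_CTR_BOTTOM_BIT in the meantime.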
This approach achieves read-side performance almost equal to that shown in Appen-
dix B.7, incurring roughly 65 nanoseconds of overhead regardless of the number of
POWER5 CPUs. Updates again incur more overhead, ranging from about 600 nanosec-
onds on a single POWER5 CPU to more than 100 microseconds on 64 such CPUs.
Quick Quiz B.19: Why not simply maintain a separate per-thread nesting-level variable, as
was done in the previous section, rather than having all this complicated bit manipulation?
This implementation suffers from the same shortcomings as does that of Appendix B.7,
except that nesting of RCU read-side critical sections is now permitted. In addition,
on 32-bit systems, this approach shortens the time required to overflow the global
rcu_gp_ctr variable. The following section shows one way to greatly increase the
time required for overflow to occur, while greatly reducing read-side overhead.
Quick Quiz B.20: Given the algorithm shown in Listing B.16, how could you double the time
required to overflow the global rcu_gp_ctr?
Quick Quiz B.21: Again, given the algorithm shown in Listing B.16, is counter overflow
fatal? Why or why not? If it is fatal, what can be done to fix it?
Quick Quiz B.22: Doesn’t the additional memory barrier shown on line 14 of Listing B.18
greatly increase the overhead of rcu_quiescent_state?
Some applications might use RCU only occasionally, but use it very heavily when
they do use it. Such applications might choose to use rcu_thread_online() when
starting to use RCU and rcu_thread_offline() when no longer using RCU. The
time between a call to rcu_thread_offline() and a subsequent call to
rcu_thread_online() is an extended quiescent state, so that RCU will not expect
explicit quiescent states to be registered during this time.
However, this implementation requires that each thread either invoke
rcu_quiescent_state() periodically or invoke rcu_thread_offline() for extended
quiescent states. The need to invoke these functions periodically can make this
implementation difficult to use in some situations, such as for certain types of
library functions.
Quick Quiz B.25: Why would the fact that the code is in a library make any difference for
how easy it is to use the RCU implementation shown in Listings B.18 and B.19?
Quick Quiz B.26: But what if you hold a lock across a call to synchronize_rcu(), and then
acquire that same lock within an RCU read-side critical section? This should be a deadlock, but
how can a primitive that generates absolutely no code possibly participate in a deadlock cycle?
9. RCU grace periods should not be blocked by threads that halt outside of RCU read-
side critical sections. (But note that most quiescent-state-based implementations
violate this desideratum.)
Quick Quiz B.27: Given that grace periods are prohibited within RCU read-side critical
sections, how can an RCU data structure possibly be updated while in an RCU read-side critical
section?
Order! Order in the court!
Unknown
Appendix C
Why Memory Barriers?
So what possessed CPU designers to cause them to inflict memory barriers on poor
unsuspecting SMP software designers?
In short, because reordering memory references allows much better performance,
courtesy of the finite speed of light and the non-zero size of atoms noted in Section 3.2,
and particularly in the hardware-performance question posed by Quick Quiz 3.7.
Therefore, memory barriers are needed to force ordering in things like synchronization
primitives whose correct operation depends on ordered memory references.
Getting a more detailed answer to this question requires a good understanding of how
CPU caches work, and especially what is required to make caches really work well. The
following sections:
1. Present the structure of a cache,
2. Describe how cache-coherency protocols ensure that CPUs agree on the value of
each location in memory, and, finally,
3. Outline how store buffers and invalidate queues help caches and cache-coherency
protocols achieve high performance.
We will see that memory barriers are a necessary evil that is required to enable good
performance and scalability, an evil that stems from the fact that CPUs are orders of
magnitude faster than are both the interconnects between them and the memory they are
attempting to access.
to the CPU with single-cycle access time, and a larger level-two cache with a longer access
time, perhaps roughly ten clock cycles. Higher-performance CPUs often have three or even
four levels of cache.
[Figure: CPU 0 and CPU 1, each with its own cache, connected through an interconnect to memory]
Data flows among the CPUs’ caches and memory in fixed-length blocks called “cache
lines”, which are normally a power of two in size, ranging from 16 to 256 bytes. When a
given data item is first accessed by a given CPU, it will be absent from that CPU’s cache,
meaning that a “cache miss” (or, more specifically, a “startup” or “warmup” cache miss)
has occurred. The cache miss means that the CPU will have to wait (or be “stalled”) for
hundreds of cycles while the item is fetched from memory. However, the item will be
loaded into that CPU’s cache, so that subsequent accesses will find it in the cache and
therefore run at full speed.
After some time, the CPU’s cache will fill, and subsequent misses will likely need to
eject an item from the cache in order to make room for the newly fetched item. Such
a cache miss is termed a “capacity miss”, because it is caused by the cache’s limited
capacity. However, most caches can be forced to eject an old item to make room for a
new item even when they are not yet full. This is due to the fact that large caches are
implemented as hardware hash tables with fixed-size hash buckets (or “sets”, as CPU
designers call them) and no chaining, as shown in Figure C.2.
This cache has sixteen “sets” and two “ways” for a total of 32 “lines”, each entry
containing a single 256-byte “cache line”, which is a 256-byte-aligned block of memory.
This cache line size is a little on the large side, but makes the hexadecimal arithmetic
much simpler. In hardware parlance, this is a two-way set-associative cache, and is
analogous to a software hash table with sixteen buckets, where each bucket’s hash
chain is limited to at most two elements. The size (32 cache lines in this case) and the
associativity (two in this case) are collectively called the cache’s “geometry”. Since this
cache is implemented in hardware, the hash function is extremely simple: Extract four
bits from the memory address.
In Figure C.2, each box corresponds to a cache entry, which can contain a 256-byte
cache line. However, a cache entry can be empty, as indicated by the empty boxes in the
figure. The rest of the boxes are flagged with the memory address of the cache line that
they contain. Since the cache lines must be 256-byte aligned, the low eight bits of each
address are zero, and the choice of hardware hash function means that the next-higher
four bits match the hash line number.
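The hardware hash function just described amounts to a shift and a mask. The following sketch assumes 32-bit addresses, and the function and macro names are chosen for illustration.

```c
#include <stdint.h>

#define CACHE_LINE_BITS	8	/* 256-byte lines: the low eight bits are the offset. */
#define CACHE_SET_BITS	4	/* Sixteen sets: the next four bits select the set. */

/* Extract the four address bits that select one of the sixteen sets. */
static unsigned int cache_set_index(uint32_t addr)
{
	return (addr >> CACHE_LINE_BITS) & ((1u << CACHE_SET_BITS) - 1);
}

/* Round an address down to the start of its 256-byte cache line. */
static uint32_t cache_line_base(uint32_t addr)
{
	return addr & ~((1u << CACHE_LINE_BITS) - 1);
}
```

For example, cache_set_index(0x12345E00) and cache_set_index(0x43210E00) both yield 0xE, which is why those two lines share a set.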
Figure C.2 (cache contents by set and way):

Set   Way 0        Way 1
0x0   0x12345000   (empty)
0x1   0x12345100   (empty)
0x2   0x12345200   (empty)
0x3   0x12345300   (empty)
0x4   0x12345400   (empty)
0x5   0x12345500   (empty)
0x6   0x12345600   (empty)
0x7   0x12345700   (empty)
0x8   0x12345800   (empty)
0x9   0x12345900   (empty)
0xA   0x12345A00   (empty)
0xB   0x12345B00   (empty)
0xC   0x12345C00   (empty)
0xD   0x12345D00   (empty)
0xE   0x12345E00   0x43210E00
0xF   (empty)      (empty)

The situation depicted in the figure might arise if the program’s code were located at
address 0x43210E00 through 0x43210EFF, and this program accessed data sequentially
from 0x12345000 through 0x12345EFF. Suppose that the program were now to access
location 0x12345F00. This location hashes to line 0xF, and both ways of this line are
empty, so the corresponding 256-byte line can be accommodated. If the program were
to access location 0x1233000, which hashes to line 0x0, the corresponding 256-byte
cache line can be accommodated in way 1. However, if the program were to access
location 0x1233E00, which hashes to line 0xE, one of the existing lines must be ejected
from the cache to make room for the new cache line. If this ejected line were accessed
later, a cache miss would result. Such a cache miss is termed an “associativity miss”.
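The access sequence just described can be replayed against a small software model of a two-way set-associative cache with sixteen sets. This is a sketch for illustration only: the names are hypothetical, and the policy of always ejecting way 0 is an arbitrary assumption, since the text does not say how the victim is chosen.

```c
#include <stdbool.h>
#include <stdint.h>

#define NR_SETS	16
#define NR_WAYS	2

struct cache_entry {
	bool valid;
	uint32_t line_addr;	/* 256-byte-aligned address of the cached line. */
};

static struct cache_entry cache[NR_SETS][NR_WAYS];

enum access_result { CACHE_HIT, CACHE_MISS_FILLED, CACHE_MISS_EVICTED };

/* Look up an address, filling an empty way or ejecting a line on a miss. */
static enum access_result cache_access(uint32_t addr)
{
	uint32_t line = addr & ~UINT32_C(0xFF);
	unsigned int set = (addr >> 8) & (NR_SETS - 1);
	int w;

	for (w = 0; w < NR_WAYS; w++)
		if (cache[set][w].valid && cache[set][w].line_addr == line)
			return CACHE_HIT;
	for (w = 0; w < NR_WAYS; w++) {
		if (!cache[set][w].valid) {
			cache[set][w].valid = true;
			cache[set][w].line_addr = line;
			return CACHE_MISS_FILLED;
		}
	}
	/* Both ways are occupied: eject way 0 (arbitrary policy for this sketch). */
	cache[set][0].line_addr = line;
	return CACHE_MISS_EVICTED;
}
```

After loading the lines shown in the figure, accessing 0x12345F00 and 0x1233000 fills empty ways, while accessing 0x1233E00 forces an ejection from set 0xE. A later access to the ejected line then misses again, which is the associativity miss.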
Thus far, we have been considering only cases where a CPU reads a data item. What
happens when it does a write? Because it is important that all CPUs agree on the value
of a given data item, before a given CPU writes to that data item, it must first cause
it to be removed, or “invalidated”, from other CPUs’ caches. Once this invalidation
has completed, the CPU may safely modify the data item. If the data item was present
in this CPU’s cache, but was read-only, this process is termed a “write miss”. Once a
given CPU has completed invalidating a given data item from other CPUs’ caches, that
CPU may repeatedly write (and read) that data item.
Later, if one of the other CPUs attempts to access the data item, it will incur a cache
miss, this time because the first CPU invalidated the item in order to write to it. This
type of cache miss is termed a “communication miss”, since it is usually due to several
CPUs using the data items to communicate (for example, a lock is a data item that is
used to communicate among CPUs using a mutual-exclusion algorithm).
Clearly, much care must be taken to ensure that all CPUs maintain a coherent view
of the data. With all this fetching, invalidating, and writing, it is easy to imagine data
being lost or (perhaps worse) different CPUs having conflicting values for the same data
item in their respective caches. These problems are prevented by “cache-coherency
protocols”, described in the next section.
for SGI Origin2000 and Sequent (now IBM) NUMA-Q, respectively. Both diagrams are
significantly simpler than real life.
Read Response:
The “read response” message contains the data requested by an earlier “read”
message. This “read response” message might be supplied either by memory or by
one of the other caches. For example, if one of the caches has the desired data in
“modified” state, that cache must supply the “read response” message.
Invalidate:
The “invalidate” message contains the physical address of the cache line to be
invalidated. All other caches must remove the corresponding data from their caches
and respond.
Invalidate Acknowledge:
A CPU receiving an “invalidate” message must respond with an “invalidate
acknowledge” message after removing the specified data from its cache.
Read Invalidate:
The “read invalidate” message contains the physical address of the cache line to
be read, while at the same time directing other caches to remove the data. Hence,
it is a combination of a “read” and an “invalidate”, as indicated by its name. A
“read invalidate” message requires both a “read response” and a set of “invalidate
acknowledge” messages in reply.
Writeback:
The “writeback” message contains both the address and the data to be written back
to memory (and perhaps “snooped” into other CPUs’ caches along the way). This
message permits caches to eject lines in the “modified” state as needed to make
room for other data.
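The request and reply relationships among these messages can be summarized in a small sketch. The enum, structure, and function names are hypothetical, and the sketch records only which replies a sender waits for, not how a real interconnect delivers them.

```c
#include <stdbool.h>

enum mesi_msg {
	MSG_READ,
	MSG_READ_RESPONSE,
	MSG_INVALIDATE,
	MSG_INVALIDATE_ACK,
	MSG_READ_INVALIDATE,
	MSG_WRITEBACK,
};

/* Which replies the sender of a given message must wait for. */
struct mesi_replies {
	bool read_response;	/* One "read response" carrying the data. */
	bool invalidate_acks;	/* A full set of "invalidate acknowledge" messages. */
};

static struct mesi_replies mesi_expected_replies(enum mesi_msg msg)
{
	struct mesi_replies r = { false, false };

	switch (msg) {
	case MSG_READ:
		r.read_response = true;
		break;
	case MSG_INVALIDATE:
		r.invalidate_acks = true;
		break;
	case MSG_READ_INVALIDATE:
		/* A combination of "read" and "invalidate", so both kinds of reply. */
		r.read_response = true;
		r.invalidate_acks = true;
		break;
	default:
		/* Responses, acknowledgments, and writebacks expect no reply here. */
		break;
	}
	return r;
}
```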
Quick Quiz C.1: Where does a writeback message originate from and where does it go to?
Quick Quiz C.3: When an “invalidate” message appears in a large multiprocessor, every CPU
must give an “invalidate acknowledge” response. Wouldn’t the resulting “storm” of “invalidate
acknowledge” responses totally saturate the system bus?
Quick Quiz C.4: If SMP machines are really using message passing anyway, why bother with
SMP at all?
Figure C.3: MESI Cache-Coherency State Diagram (states M, E, S, and I; transitions labeled (a) through (l))
Transition (a):
A cache line is written back to memory, but the CPU retains it in its cache and
further retains the right to modify it. This transition requires a “writeback” message.
Transition (b):
The CPU writes to the cache line that it already had exclusive access to. This
transition does not require any messages to be sent or received.
Transition (c):
The CPU receives a “read invalidate” message for a cache line that it has modified.
The CPU must invalidate its local copy, then respond with both a “read response”
and an “invalidate acknowledge” message, both sending the data to the requesting
CPU and indicating that it no longer has a local copy.
Transition (d):
The CPU does an atomic read-modify-write operation on a data item that was not
present in its cache. It transmits a “read invalidate”, receiving the data via a “read
response”. The CPU can complete the transition once it has also received a full set
of “invalidate acknowledge” responses.
Transition (e):
The CPU does an atomic read-modify-write operation on a data item that was
previously read-only in its cache. It must transmit “invalidate” messages, and must
wait for a full set of “invalidate acknowledge” responses before completing the
transition.
Transition (f):
Some other CPU reads the cache line, and it is supplied from this CPU’s cache,
which retains a read-only copy, possibly also writing it back to memory. This
transition is initiated by the reception of a “read” message, and this CPU responds
with a “read response” message containing the requested data.
Transition (g):
Some other CPU reads a data item in this cache line, and it is supplied either from
this CPU’s cache or from memory. In either case, this CPU retains a read-only
copy. This transition is initiated by the reception of a “read” message, and this
CPU responds with a “read response” message containing the requested data.
Transition (h):
This CPU realizes that it will soon need to write to some data item in this cache
line, and thus transmits an “invalidate” message. The CPU cannot complete
the transition until it receives a full set of “invalidate acknowledge” responses,
indicating that no other CPU has this cacheline in its cache. In other words, this
CPU is the only CPU caching it.
Transition (i):
Some other CPU does an atomic read-modify-write operation on a data item in a
cache line held only in this CPU’s cache, so this CPU invalidates it from its cache.
This transition is initiated by the reception of a “read invalidate” message, and
this CPU responds with both a “read response” and an “invalidate acknowledge”
message.
Transition (j):
This CPU does a store to a data item in a cache line that was not in its cache, and
thus transmits a “read invalidate” message. The CPU cannot complete the transition
until it receives the “read response” and a full set of “invalidate acknowledge”
messages. The cache line will presumably transition to “modified” state via
transition (b) as soon as the actual store completes.
Transition (k):
This CPU loads a data item in a cache line that was not in its cache. The CPU
transmits a “read” message, and completes the transition upon receiving the
corresponding “read response”.
Transition (l):
Some other CPU does a store to a data item in this cache line, but holds this cache
line in read-only state due to its being held in other CPUs’ caches (such as the
current CPU’s cache). This transition is initiated by the reception of an “invalidate”
message, and this CPU responds with an “invalidate acknowledge” message.
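Transitions (a) through (l) can be collected into a single next-state function. The following is a minimal sketch rather than a hardware description: the event names are hypothetical, and the function assumes that any required "read response" and "invalidate acknowledge" messages have already arrived by the time it is called.

```c
#include <assert.h>

enum mesi_state { MESI_MODIFIED, MESI_EXCLUSIVE, MESI_SHARED, MESI_INVALID };

/* Hypothetical event names: what this CPU does, or what it receives. */
enum mesi_event {
	EV_WRITEBACK,			/* write the line back, but keep it */
	EV_LOCAL_LOAD,			/* this CPU loads from the line */
	EV_LOCAL_STORE,			/* this CPU stores to the line */
	EV_LOCAL_RMW,			/* atomic read-modify-write */
	EV_UPGRADE,			/* this CPU expects to store soon */
	EV_RECV_READ,			/* "read" arrives */
	EV_RECV_READ_INVALIDATE,	/* "read invalidate" arrives */
	EV_RECV_INVALIDATE,		/* "invalidate" arrives */
};

/* Combinations not covered by transitions (a)-(l) leave the state unchanged. */
enum mesi_state mesi_next(enum mesi_state s, enum mesi_event ev)
{
	switch (s) {
	case MESI_MODIFIED:
		if (ev == EV_WRITEBACK) return MESI_EXCLUSIVE;		/* (a) */
		if (ev == EV_RECV_READ_INVALIDATE) return MESI_INVALID;	/* (c) */
		if (ev == EV_RECV_READ) return MESI_SHARED;		/* (f) */
		break;
	case MESI_EXCLUSIVE:
		if (ev == EV_LOCAL_STORE) return MESI_MODIFIED;		/* (b) */
		if (ev == EV_RECV_READ) return MESI_SHARED;		/* (g) */
		if (ev == EV_RECV_READ_INVALIDATE) return MESI_INVALID;	/* (i) */
		break;
	case MESI_SHARED:
		if (ev == EV_LOCAL_RMW) return MESI_MODIFIED;		/* (e) */
		if (ev == EV_UPGRADE) return MESI_EXCLUSIVE;		/* (h) */
		if (ev == EV_RECV_INVALIDATE) return MESI_INVALID;	/* (l) */
		break;
	case MESI_INVALID:
		if (ev == EV_LOCAL_RMW) return MESI_MODIFIED;		/* (d) */
		if (ev == EV_LOCAL_STORE) return MESI_EXCLUSIVE;	/* (j) */
		if (ev == EV_LOCAL_LOAD) return MESI_SHARED;		/* (k) */
		break;
	}
	return s;
}
```

In this sketch, a store to a line that is not in the cache takes transition (j) to "exclusive", and applying the same store event again takes transition (b) to "modified", matching the two-step path described for transition (j).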
Quick Quiz C.5: How does the hardware handle the delayed transitions described above?
at address 0 out of its cache via an invalidation, replacing it with the data at address 8.
CPU 2 now does a load from address 0, but this CPU realizes that it will soon need to
store to it, and so it uses a “read invalidate” message in order to gain an exclusive copy,
invalidating it from CPU 3’s cache (though the copy in memory remains up to date).
Next CPU 2 does its anticipated store, changing the state to “modified”. The copy of
the data in memory is now out of date. CPU 1 does an atomic increment, using a “read
invalidate” to snoop the data from CPU 2’s cache and invalidate it, so that the copy in
CPU 1’s cache is in the “modified” state (and the copy in memory remains out of date).
Finally, CPU 1 reads the cache line at address 8, which uses a “writeback” message to
push address 0’s data back out to memory.
Note that we end with data in some of the CPUs’ caches.
Quick Quiz C.6: What sequence of operations would put the CPUs’ caches all back into the
“invalid” state?
a few orders of magnitude more than that required to execute a simple register-to-register
instruction.
[Figure: timeline for CPU 0 and CPU 1 with the labels Write, Invalidate, Stall, and Acknowledgement]
CPU 0 can simply record its write in its store buffer and continue executing. When the
cache line does finally make its way from CPU 1 to CPU 0, the data will be moved from
the store buffer to the cache line.
Quick Quiz C.7: But then why do uniprocessors also have store buffers?
Please note that the store buffer does not necessarily operate on full cache lines. The
reason for this is that a given store-buffer entry need only contain the value stored,
not the other data contained in the corresponding cache line. Which is a good thing,
because the CPU doing the store has no idea what that other data might be! But once
the corresponding cache line arrives, any values from the store buffer that update that
cache line can be merged into it, and the corresponding entries can then be removed
from the store buffer. Any other data in that cache line is of course left intact.
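The merge step can be sketched as follows. This is an illustrative model only: the structure names, the 16-byte line size, the eight-entry capacity, and the restriction to one-byte stores are all assumptions made to keep the example short.

```c
#include <assert.h>
#include <stdint.h>

#define LINE_BYTES 16	/* hypothetical cache-line size */
#define SB_ENTRIES 8	/* hypothetical store-buffer capacity */

struct sb_entry {
	uintptr_t addr;	/* byte address that was stored to */
	uint8_t value;	/* only the value stored, not the rest of the line */
};

struct store_buffer {
	struct sb_entry e[SB_ENTRIES];	/* oldest entry first */
	int n;
};

/* Record a store without needing the cache line to be present. */
int sb_store(struct store_buffer *sb, uintptr_t addr, uint8_t value)
{
	if (sb->n == SB_ENTRIES)
		return -1;	/* buffer full: a real CPU would stall */
	sb->e[sb->n].addr = addr;
	sb->e[sb->n].value = value;
	sb->n++;
	return 0;
}

/*
 * The line starting at line_addr has arrived: merge the buffered
 * stores that hit it, oldest first, and remove them from the buffer.
 * All other bytes of the line are left intact.
 */
void sb_merge_line(struct store_buffer *sb, uintptr_t line_addr,
		   uint8_t line[LINE_BYTES])
{
	int kept = 0;

	for (int i = 0; i < sb->n; i++) {
		struct sb_entry *e = &sb->e[i];

		if (e->addr >= line_addr && e->addr - line_addr < LINE_BYTES)
			line[e->addr - line_addr] = e->value;
		else
			sb->e[kept++] = *e;	/* belongs to another line */
	}
	sb->n = kept;
}
```

Keeping the entries in age order means that when two buffered stores hit the same byte, the younger one is applied last and therefore wins.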
Quick Quiz C.8: So store-buffer entries are variable length? Isn’t that difficult to implement
in hardware?
These store buffers are local to a given CPU or, on systems with hardware multithread-
ing, local to a given core. Either way, a given CPU is permitted to access only the store
buffer assigned to it. For example, in Figure C.5, CPU 0 cannot access CPU 1’s store
buffer and vice versa. This restriction simplifies the hardware by separating concerns:
The store buffer improves performance for consecutive writes, while the responsibility
for communicating among CPUs (or cores, as the case may be) is fully shouldered by the
cache-coherence protocol. However, even given this restriction, there are complications
that must be addressed, which are covered in the next two sections.
[Figure C.5: CPU 0 and CPU 1, each with its own store buffer and cache, connected to memory through the interconnect]
    a = 1;
    b = a + 1;
    assert(b == 2);
One would not expect the assertion to fail. However, if one were foolish enough to
use the very simple architecture shown in Figure C.5, one would be surprised. Such a
system could potentially see the following sequence of events:
5 CPU 1 receives the “read invalidate” message, and responds by transmitting the
cache line and removing that cacheline from its cache.
7 CPU 0 receives the cache line from CPU 1, which still has a value of zero for “a”.
8 CPU 0 loads “a” from its cache, finding the value zero.
9 CPU 0 applies the entry from its store buffer to the newly arrived cache line, setting
the value of “a” in its cache to one.
10 CPU 0 adds one to the value zero loaded for “a” above, and stores it into the cache
line containing “b” (which we will assume is already owned by CPU 0).
[Figure C.6: CPU 0 and CPU 1, each with its own store buffer and cache, connected to memory through the interconnect]
The problem is that we have two copies of “a”, one in the cache and the other in the
store buffer.
This example breaks a very important guarantee, namely that each CPU will always
see its own operations as if they happened in program order. Breaking this guarantee is
violently counter-intuitive to software types, so much so that the hardware guys took
pity and implemented “store forwarding”, where each CPU refers to (or “snoops”) its
store buffer as well as its cache when performing loads, as shown in Figure C.6. In other
words, a given CPU’s stores are directly forwarded to its subsequent loads, without
having to pass through the cache.
With store forwarding in place, item 8 in the above sequence would have found the
correct value of 1 for “a” in the store buffer, so that the final value of “b” would have
been 2, as one would hope.
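The effect of store forwarding on this fragment can be modeled in a few lines of C. The model below is a deliberately tiny sketch with hypothetical names: it tracks only the variable “a” and at most one buffered store to it.

```c
#include <assert.h>
#include <stdbool.h>

/* One-variable model of CPU 0. */
struct cpu_model {
	int cached_a;		/* value of "a" in the cache line */
	bool store_pending;	/* is a store to "a" still buffered? */
	int buffered_a;		/* value waiting in the store buffer */
};

/* Load "a", optionally snooping the store buffer before the cache. */
int load_a(const struct cpu_model *cpu, bool store_forwarding)
{
	if (store_forwarding && cpu->store_pending)
		return cpu->buffered_a;
	return cpu->cached_a;
}

/* Run "a = 1; b = a + 1;" and return the resulting value of "b". */
int run_fragment(bool store_forwarding)
{
	struct cpu_model cpu = { .cached_a = 0 };
	int r;

	cpu.buffered_a = 1;			/* a = 1 enters the store buffer */
	cpu.store_pending = true;
	r = load_a(&cpu, store_forwarding);	/* load of "a" for b = a + 1 */
	cpu.cached_a = cpu.buffered_a;		/* buffered store reaches the cache */
	cpu.store_pending = false;
	return r + 1;
}
```

Without forwarding the load sees the stale cached zero and “b” ends up as 1; with forwarding the load sees the buffered 1 and “b” ends up as 2.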
Suppose CPU 0 executes foo() and CPU 1 executes bar(). Suppose further that
the cache line containing “a” resides only in CPU 1’s cache, and that the cache line
containing “b” is owned by CPU 0. Then the sequence of operations might be as follows:
1 CPU 0 executes a = 1. The cache line is not in CPU 0’s cache, so CPU 0 places
the new value of “a” in its store buffer and transmits a “read invalidate” message.
2 CPU 1 executes while (b == 0) continue, but the cache line containing “b” is
not in its cache. It therefore transmits a “read” message.
3 CPU 0 executes b = 1. It already owns this cache line (in other words, the cache
line is already in either the “modified” or the “exclusive” state), so it stores the
new value of “b” in its cache line.
4 CPU 0 receives the “read” message, and transmits the cache line containing the
now-updated value of “b” to CPU 1, also marking the line as “shared” in its own
cache (but only after writing back the line containing “b” to main memory).
5 CPU 1 receives the cache line containing “b” and installs it in its cache.
6 CPU 1 can now finish executing while (b == 0) continue, and since it finds
that the value of “b” is 1, it proceeds to the next statement.
7 CPU 1 executes the assert(a == 1), and, since CPU 1 is working with the old
value of “a”, this assertion fails.
8 CPU 1 receives the “read invalidate” message, and transmits the cache line
containing “a” to CPU 0 and invalidates this cache line from its own cache. But it
is too late.
9 CPU 0 receives the cache line containing “a” and applies the buffered store just in
time to fall victim to CPU 1’s failed assertion.
Quick Quiz C.9: In step 1 above, why does CPU 0 need to issue a “read invalidate” rather
than a simple “invalidate”? After all, foo() will overwrite the variable a in any case, so why
should it care about the old value of a?
Quick Quiz C.10: In step 4 above, don’t systems avoid that store to memory?
Quick Quiz C.11: In step 9 above, did bar() read a stale value from a, or did its reads of b
and a get reordered?
The hardware designers cannot help directly here, since the CPUs have no idea which
variables are related, let alone how they might be related. Therefore, the hardware
designers provide memory-barrier instructions to allow the software to tell the CPU
about such relations. The program fragment must be updated to contain the memory
barrier:
    void foo(void)
    {
        a = 1;
        smp_mb();
        b = 1;
    }

    void bar(void)
    {
        while (b == 0) continue;
        assert(a == 1);
    }
The memory barrier smp_mb() will cause the CPU to flush its store buffer before
applying each subsequent store to its variable’s cache line. The CPU could either simply
stall until the store buffer was empty before proceeding, or it could use the store buffer to
hold subsequent stores until all of the prior entries in the store buffer had been applied.
With this latter approach the sequence of operations might be as follows:
1 CPU 0 executes a = 1. The cache line is not in CPU 0’s cache, so CPU 0 places
the new value of “a” in its store buffer and transmits a “read invalidate” message.
2 CPU 1 executes while (b == 0) continue, but the cache line containing “b” is
not in its cache. It therefore transmits a “read” message.
3 CPU 0 executes smp_mb(), and marks all current store-buffer entries (namely, the
a = 1).
4 CPU 0 executes b = 1. It already owns this cache line (in other words, the cache
line is already in either the “modified” or the “exclusive” state), but there is a
marked entry in the store buffer. Therefore, rather than store the new value of “b”
in the cache line, it instead places it in the store buffer (but in an unmarked entry).
5 CPU 0 receives the “read” message, and transmits the cache line containing the
original value of “b” to CPU 1. It also marks its own copy of this cache line as
“shared”.
6 CPU 1 receives the cache line containing “b” and installs it in its cache.
7 CPU 1 can now load the value of “b”, but since it finds that the value of “b” is
still 0, it repeats the while statement. The new value of “b” is safely hidden in
CPU 0’s store buffer.
8 CPU 1 receives the “read invalidate” message, and transmits the cache line
containing “a” to CPU 0 and invalidates this cache line from its own cache.
9 CPU 0 receives the cache line containing “a” and applies the buffered store, placing
this line into the “modified” state.
10 Since the store to “a” was the only entry in the store buffer that was marked by the
smp_mb(), CPU 0 can also store the new value of “b”—except for the fact that the
cache line containing “b” is now in “shared” state.
11 CPU 0 therefore sends an “invalidate” message to CPU 1.
12 CPU 1 receives the “invalidate” message, invalidates the cache line containing “b”
from its cache, and sends an “acknowledgement” message to CPU 0.
13 CPU 1 executes while (b == 0) continue, but the cache line containing “b” is
not in its cache. It therefore transmits a “read” message to CPU 0.
14 CPU 0 receives the “acknowledgement” message, and puts the cache line containing
“b” into the “exclusive” state. CPU 0 now stores the new value of “b” into the cache
line.
15 CPU 0 receives the “read” message, and transmits the cache line containing the new
value of “b” to CPU 1. It also marks its own copy of this cache line as “shared”.
16 CPU 1 receives the cache line containing “b” and installs it in its cache.
17 CPU 1 can now load the value of “b”, and since it finds that the value of “b” is 1, it
exits the while loop and proceeds to the next statement.
18 CPU 1 executes the assert(a == 1), but the cache line containing “a” is no
longer in its cache. Once it gets this cache line from CPU 0, it will be working with the
up-to-date value of “a”, and the assertion therefore passes.
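The marking scheme used in the steps above can be sketched as a small model of CPU 0. All names here are hypothetical, the buffer capacity is arbitrary, and the model ignores everything except the two variables and their ownership.

```c
#include <assert.h>
#include <stdbool.h>

enum { VAR_A, VAR_B, NVARS };
#define SB_MAX 4	/* hypothetical store-buffer capacity */

struct buffered_store {
	int var;
	int value;
	bool marked;	/* set by a later smp_mb() */
};

struct sb_cpu {
	int cache[NVARS];	/* values that other CPUs can see */
	bool owned[NVARS];	/* line "modified" or "exclusive"? */
	struct buffered_store sb[SB_MAX];	/* oldest first */
	int nsb;
};

/* smp_mb(): mark every store currently in the buffer. */
void sb_cpu_mb(struct sb_cpu *c)
{
	for (int i = 0; i < c->nsb; i++)
		c->sb[i].marked = true;
}

/*
 * A store goes straight to the cache only if its line is owned and
 * no marked entry is still waiting; otherwise it is buffered unmarked.
 */
void sb_cpu_store(struct sb_cpu *c, int var, int value)
{
	bool marked_pending = false;

	for (int i = 0; i < c->nsb; i++)
		marked_pending |= c->sb[i].marked;
	if (c->owned[var] && !marked_pending) {
		c->cache[var] = value;
		return;
	}
	assert(c->nsb < SB_MAX);	/* a real CPU would stall instead */
	c->sb[c->nsb++] = (struct buffered_store){ var, value, false };
}

/*
 * The line for "var" arrives: apply buffered stores oldest first,
 * stopping at the first one whose line is not owned.
 */
void sb_cpu_line_arrives(struct sb_cpu *c, int var)
{
	int done = 0, kept = 0;

	c->owned[var] = true;
	while (done < c->nsb && c->owned[c->sb[done].var]) {
		c->cache[c->sb[done].var] = c->sb[done].value;
		done++;
	}
	for (int i = done; i < c->nsb; i++)
		c->sb[kept++] = c->sb[i];
	c->nsb = kept;
}
```

With the barrier, the store to “b” waits in the buffer behind the marked store to “a”, so no other CPU can see “b” change first. Without the barrier, the store to “b” bypasses the buffer and becomes visible while “a” is still hidden.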
Quick Quiz C.12: After step 15 in Appendix C.3.3 on page 673, both CPUs might drop the
cache line containing the new value of “b”. Wouldn’t that cause this new value to be lost?
As you can see, this process involves no small amount of bookkeeping. Even
something intuitively simple, like “load the value of a” can involve lots of complex steps
in silicon.
[Figure: CPU 0 and CPU 1, each with its own store buffer, cache, and invalidate queue, connected to memory through the interconnect]
transmit the invalidate message; it must instead wait until the invalidate-queue entry has
been processed.
Placing an entry into the invalidate queue is essentially a promise by the CPU to
process that entry before transmitting any MESI protocol messages regarding that cache
line. As long as the corresponding data structures are not highly contended, the CPU
will rarely be inconvenienced by such a promise.
However, the fact that invalidate messages can be buffered in the invalidate queue
provides additional opportunity for memory-misordering, as discussed in the next
section.
Let us suppose that CPUs queue invalidation requests, but respond to them immediately.
This approach minimizes the cache-invalidation latency seen by CPUs doing stores, but
can defeat memory barriers, as seen in the following example.
Suppose the values of “a” and “b” are initially zero, that “a” is replicated read-only
(MESI “shared” state), and that “b” is owned by CPU 0 (MESI “exclusive” or “modified”
state). Then suppose that CPU 0 executes foo() while CPU 1 executes function bar()
in the following code fragment:
    void foo(void)
    {
        a = 1;
        smp_mb();
        b = 1;
    }

    void bar(void)
    {
        while (b == 0) continue;
        assert(a == 1);
    }
Quick Quiz C.13: In step 1 of the first scenario in Appendix C.4.3, why is an “invalidate” sent
instead of a “read invalidate” message? Doesn’t CPU 0 need the values of the other variables
that share this cache line with “a”?
instructions can interact with the invalidate queue, so that when a given CPU executes a
memory barrier, it marks all the entries currently in its invalidate queue, and forces any
subsequent load to wait until all marked entries have been applied to the CPU’s cache.
Therefore, we can add a memory barrier to function bar as follows:
1 void foo(void)
2 {
3 a = 1;
4 smp_mb();
5 b = 1;
6 }
7
8 void bar(void)
9 {
10 while (b == 0) continue;
11 smp_mb();
12 assert(a == 1);
13 }
Quick Quiz C.14: Say what??? Why do we need a memory barrier here, given that the CPU
cannot possibly execute the assert() until after the while loop completes?
2 CPU 1 executes while (b == 0) continue, but the cache line containing “b” is
not in its cache. It therefore transmits a “read” message.
3 CPU 1 receives CPU 0’s “invalidate” message, queues it, and immediately responds
to it.
4 CPU 0 receives the response from CPU 1, and is therefore free to proceed past
the smp_mb() on line 4 above, moving the value of “a” from its store buffer to its
cache line.
5 CPU 0 executes b = 1. It already owns this cache line (in other words, the cache
line is already in either the “modified” or the “exclusive” state), so it stores the
new value of “b” in its cache line.
6 CPU 0 receives the “read” message, and transmits the cache line containing the
now-updated value of “b” to CPU 1, also marking the line as “shared” in its own
cache.
7 CPU 1 receives the cache line containing “b” and installs it in its cache.
8 CPU 1 can now finish executing while (b == 0) continue, and since it finds
that the value of “b” is 1, it proceeds to the next statement, which is now a memory
barrier.
9 CPU 1 must now stall until it processes all pre-existing messages in its invalidation
queue.
10 CPU 1 now processes the queued “invalidate” message, and invalidates the cache
line containing “a” from its own cache.
11 CPU 1 executes the assert(a == 1), and, since the cache line containing “a” is
no longer in CPU 1’s cache, it transmits a “read” message.
12 CPU 0 responds to this “read” message with the cache line containing the new
value of “a”.
13 CPU 1 receives this cache line, which contains a value of 1 for “a”, so that the
assertion does not trigger.
With much passing of MESI messages, the CPUs arrive at the correct answer. This
section illustrates why CPU designers must be extremely careful with their cache-
coherence optimizations. The key requirement is that the memory barriers provide the
appearance of ordering to the software. As long as these appearances are maintained, the
hardware can carry out whatever queueing, buffering, marking, stalling, and flushing
optimizations it likes.
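The interaction between memory barriers and the invalidate queue can be sketched as a small model of CPU 1. The names are hypothetical, and the `latest` array is an assumption standing in for the up-to-date values that a “read” message would fetch from CPU 0.

```c
#include <assert.h>
#include <stdbool.h>

enum { VAR_A, VAR_B, NVARS };
#define IQ_MAX 4	/* hypothetical invalidate-queue capacity */

struct queued_invalidate {
	int var;
	bool marked;	/* set by a later smp_mb() */
};

struct iq_cpu {
	int cache[NVARS];
	bool valid[NVARS];	/* is the line present in the cache? */
	struct queued_invalidate iq[IQ_MAX];
	int niq;
};

/* Queue an incoming "invalidate" and acknowledge it immediately. */
bool iq_cpu_recv_invalidate(struct iq_cpu *c, int var)
{
	assert(c->niq < IQ_MAX);
	c->iq[c->niq++] = (struct queued_invalidate){ var, false };
	return true;	/* acknowledged before the line is actually removed */
}

/* smp_mb(): mark the entries currently in the invalidate queue. */
void iq_cpu_mb(struct iq_cpu *c)
{
	for (int i = 0; i < c->niq; i++)
		c->iq[i].marked = true;
}

/* A load first applies all marked entries, then reads the cache. */
int iq_cpu_load(struct iq_cpu *c, int var, const int *latest)
{
	int kept = 0;

	for (int i = 0; i < c->niq; i++) {
		if (c->iq[i].marked)
			c->valid[c->iq[i].var] = false;
		else
			c->iq[kept++] = c->iq[i];
	}
	c->niq = kept;
	if (!c->valid[var]) {	/* miss: models "read" and "read response" */
		c->cache[var] = latest[var];
		c->valid[var] = true;
	}
	return c->cache[var];
}
```

Without a barrier between the two loads, the queued invalidation of “a” is still pending and the load returns the stale cached value. With the barrier, the marked entry is applied first, the load misses, and the new value is fetched.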
Quick Quiz C.15: Instead of all of this marking of invalidation-queue entries and stalling of
loads, why not simply force an immediate flush of the invalidation queue?
Quick Quiz C.16: But can’t full memory barriers impose global ordering? After all, isn’t that
needed to provide the ordering shown in Listing 12.27?
If we update foo and bar to use read and write memory barriers, they appear as
follows:
    void foo(void)
    {
        a = 1;
        smp_wmb();
        b = 1;
    }

    void bar(void)
    {
        while (b == 0) continue;
        smp_rmb();
        assert(a == 1);
    }
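For readers who want to run this pattern in user space, the following is a rough analog using C11 atomics and POSIX threads. It is an approximation under stated assumptions, not the kernel's implementation: the release and acquire fences stand in for smp_wmb() and smp_rmb() but provide somewhat stronger ordering, and run_foo_bar() is a hypothetical helper.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

static atomic_int a, b;

static void *foo(void *arg)
{
	(void)arg;
	atomic_store_explicit(&a, 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_release);	/* role of smp_wmb() */
	atomic_store_explicit(&b, 1, memory_order_relaxed);
	return NULL;
}

static void *bar(void *arg)
{
	int *seen = arg;

	while (atomic_load_explicit(&b, memory_order_relaxed) == 0)
		continue;
	atomic_thread_fence(memory_order_acquire);	/* role of smp_rmb() */
	*seen = atomic_load_explicit(&a, memory_order_relaxed);
	return NULL;
}

/* Run foo() and bar() concurrently; return the value of "a" seen by bar(). */
int run_foo_bar(void)
{
	pthread_t t0, t1;
	int seen = -1;

	atomic_store(&a, 0);
	atomic_store(&b, 0);
	if (pthread_create(&t1, NULL, bar, &seen))
		return -1;
	if (pthread_create(&t0, NULL, foo, NULL)) {
		atomic_store(&b, 1);	/* release bar() so it can be joined */
		pthread_join(t1, NULL);
		return -1;
	}
	pthread_join(t0, NULL);
	pthread_join(t1, NULL);
	return seen;
}
```

Because the fences pair up, bar() is guaranteed to observe “a” equal to 1 once it has seen “b” become nonzero, so run_foo_bar() returns 1 whenever both threads are created successfully.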
Some computers have even more flavors of memory barriers, but understanding these
three variants will provide a good introduction to memory barriers in general.
consult CPU vendors’ manuals [SW95, Adv02, Int02a, IBM94, LHF05, SPA94, Int04b,
Int04a, Int04c], Gharachorloo’s dissertation [Gha95], Peter Sewell’s work [Sew], or the
excellent hardware-oriented primer by Sorin, Hill, and Wood [SHW11].
[Figure: Node 0 contains CPU 0 and CPU 1, and Node 1 contains CPU 2 and CPU 3; each node has a cache, and the caches connect to memory through the interconnect]
5. All of a given CPU’s accesses (loads and stores) preceding a full memory barrier
(smp_mb()) will be perceived by all CPUs to precede any accesses following that
memory barrier.
Quick Quiz C.17: Does the guarantee that each CPU sees its own memory accesses in order
also guarantee that each user-level thread will see its own memory accesses in order? Why or
why not?
C.6.2 Example 1
Listing C.1 shows three code fragments, executed concurrently by CPUs 0, 1, and 2.
Each of “a”, “b”, and “c” is initially zero.
Suppose CPU 0 recently experienced many cache misses, so that its message queue is
full, but that CPU 1 has been running exclusively within the cache, so that its message
queue is empty. Then CPU 0’s assignments to “a” and “b” will appear in Node 0’s cache
immediately (and thus be visible to CPU 1), but will be blocked behind CPU 0’s prior
traffic. In contrast, CPU 1’s assignment to “c” will sail through CPU 1’s previously
empty queue. Therefore, CPU 2 might well see CPU 1’s assignment to “c” before it sees
CPU 0’s assignment to “a”, causing the assertion to fire, despite the memory barriers.
Therefore, portable code cannot rely on this assertion not firing, as both the compiler
and the CPU can reorder the code so as to trip the assertion.
5 Any real hardware architect or designer will no doubt be objecting strenuously, as they
just might be a bit upset about the prospect of working out which queue should handle a
message involving a cache line that both CPUs accessed, to say nothing of the many races
that this example poses. All I can say is “Give me a better example”.
Quick Quiz C.18: Could this code be fixed by inserting a memory barrier between CPU 1’s
“while” and assignment to “c”? Why or why not?
C.6.3 Example 2
Listing C.2 shows three code fragments, executed concurrently by CPUs 0, 1, and 2.
Both “a” and “b” are initially zero.
Again, suppose CPU 0 recently experienced many cache misses, so that its message
queue is full, but that CPU 1 has been running exclusively within the cache, so that its
message queue is empty. Then CPU 0’s assignment to “a” will appear in Node 0’s cache
immediately (and thus be visible to CPU 1), but will be blocked behind CPU 0’s prior
traffic. In contrast, CPU 1’s assignment to “b” will sail through CPU 1’s previously
empty queue. Therefore, CPU 2 might well see CPU 1’s assignment to “b” before it sees
CPU 0’s assignment to “a”, causing the assertion to fire, despite the memory barriers.
In theory, portable code should not rely on this example code fragment. However, as
before, in practice it actually does work on most mainstream computer systems.
C.6.4 Example 3
Listing C.3 shows three code fragments, executed concurrently by CPUs 0, 1, and 2.
All variables are initially zero.
Note that neither CPU 1 nor CPU 2 can proceed to line 5 until they see CPU 0’s
assignment to “b” on line 3. Once CPU 1 and 2 have executed their memory barriers on
line 4, they are both guaranteed to see all assignments by CPU 0 preceding its memory
barrier on line 2. Similarly, CPU 0’s memory barrier on line 8 pairs with those of
CPUs 1 and 2 on line 4, so that CPU 0 will not execute the assignment to “e” on line 9
until after its assignment to “b” is visible to both of the other CPUs. Therefore, CPU 2’s
assertion on line 9 is guaranteed not to fire.
Quick Quiz C.19: Suppose that lines 3–5 for CPUs 1 and 2 in Listing C.3 are in an interrupt
handler, and that CPU 2’s line 9 runs at process level. In other words, the code in all three
columns of the table runs on the same CPU, but the first two columns run in an interrupt handler,
and the third column runs at process level, so that the code in the third column can be interrupted
by the code in the first two columns. What changes, if any, are required to enable the code to
work correctly, in other words, to prevent the assertion from firing?
Quick Quiz C.20: If CPU 2 executed an assert(e==0||c==1) in the example in Listing C.3,
would this assert ever trigger?
This charming misfeature can result in DMAs from memory missing recent changes
to the output buffer, or, just as bad, cause input buffers to be overwritten by the
contents of CPU caches just after the DMA completes. To make your system
work in the face of such misbehavior, you must carefully flush the CPU caches of
any location in any DMA buffer before presenting that buffer to the I/O device.
Otherwise, a store from one of the CPUs might not be accounted for in the data
DMAed out through the device. This is a form of data corruption, which is an
extremely serious bug.
Similarly, you need to invalidate6 the CPU caches corresponding to any location in
any DMA buffer after DMA to that buffer completes. Otherwise, a given CPU
might see the old data still residing in its cache instead of the newly DMAed data
that it was supposed to see. This is another form of data corruption.
And even then, you need to be very careful to avoid pointer bugs, as even a
misplaced read to an input buffer can result in corrupting the data input! One way
to avoid this is to invalidate all of the caches of all of the CPUs once the DMA
completes, but it is much easier and more efficient if the device DMA participates
in the cache-coherence protocol, making all of this flushing and invalidating
unnecessary.
2. External busses that fail to transmit cache-coherence data.
This is an even more painful variant of the above problem, but causes groups
of devices—and even memory itself—to fail to respect cache coherence. It is
my painful duty to inform you that as embedded systems move to multicore
architectures, we will no doubt see a fair number of such problems arise. By
the year 2021, there were some efforts to address these problems with new
interconnect standards, with some debate as to how effective these standards will
really be [Won19].
6 Why not flush? If there is a difference, then a CPU must have incorrectly stored to the
the task could easily see the corresponding variables revert to prior values, which
can fatally confuse most algorithms.
6. Overly kind simulators and emulators.
It is difficult to write simulators or emulators that force memory re-ordering, so
software that runs just fine in these environments can get a nasty surprise when it
first runs on the real hardware. Unfortunately, it is still the rule that the hardware is
more devious than are the simulators and emulators, but we hope that this situation
changes.
Again, we encourage hardware designers to avoid these practices!
De gustibus non est disputandum.
Latin maxim
Appendix D
Style Guide
• (On punctuations and quotations) Despite being American myself, for this sort of
book, the UK approach is better because it removes ambiguities like the following:
Type “ls -a,” look for the file “.,” and file a bug if you don’t see it.
Type “ls -a”, look for the file “.”, and file a bug if you don’t see it.
• Oxford comma: “a, b, and c” rather than “a, b and c”. This is arbitrary. Cases
where the Oxford comma results in ambiguity should be reworded, for example,
by introducing numbering: “a, b, and c and d” should be “(1) a, (2) b, and (3) c
and d”.
• North American rules on periods and abbreviations. For example, neither of the
following can reasonably be interpreted as two sentences:
\DefineVerbatimEnvironment{VerbatimL}{Verbatim}%
{fontsize=\scriptsize,numbers=left,numbersep=5pt,%
xleftmargin=9pt,obeytabs=true,tabsize=2}
\AfterEndEnvironment{VerbatimL}{\vspace*{-9pt}}
\DefineVerbatimEnvironment{VerbatimN}{Verbatim}%
{fontsize=\scriptsize,numbers=left,numbersep=3pt,%
xleftmargin=5pt,xrightmargin=5pt,obeytabs=true,%
tabsize=2,frame=single}
\DefineVerbatimEnvironment{VerbatimU}{Verbatim}%
{fontsize=\scriptsize,numbers=none,xleftmargin=5pt,%
xrightmargin=5pt,obeytabs=true,tabsize=2,%
samepage=true,frame=single}
The LaTeX source of a sample code snippet is shown in Listing D.1 and is typeset as
shown in Listing D.2.
Line labels are specified with the “$lnlbl[]” command. The characters specified by
the “commandchars” option to the VerbatimL environment are used by the fancyvrb package
to substitute “\lnlbl{}” for “$lnlbl[]”. Those characters should be selected so that
they don’t appear elsewhere in the code snippet.
Labels “printf” and “return” in Listing D.2 can be referred to as shown below:
\begin{fcvref}[ln:base1]
\Clnref{printf, return} can be referred
to from text.
\end{fcvref}
\newcommand{\lnlblbase}{}
\newcommand{\lnlbl}[1]{%
\phantomsection\label{\lnlblbase:#1}}
\newcommand{\lnrefbase}{}
\newcommand{\lnref}[1]{\ref{\lnrefbase:#1}}
The main part of the LaTeX source shown on lines 2–14 in Listing D.1 can be extracted
from the code sample in Listing D.3 by the Perl script utilities/fcvextract.pl. All
the relevant extraction rules are described as recipes in the top-level Makefile and in a
script that generates dependencies (utilities/gen_snippet_d.pl).
As you can see, Listing D.3 has meta commands in C comments (C++ style). Those
meta commands are interpreted by utilities/fcvextract.pl, which determines
the comment style from the suffix of the code sample’s file name.
Meta commands which can be used in code samples are listed below:
• \begin{snippet}[<options>]
• \end{snippet}
• \lnlbl{<label string>}
• \fcvexclude
• \fcvblank
“<options>” to the \begin{snippet} meta command is a comma-separated list of
the options shown below:
• labelbase=<label base string>
• keepcomment=yes
• gobbleblank=yes
• commandchars=\X\Y\Z
The “labelbase” option is mandatory and the string given to it will be passed to
the “\begin{fcvlabel}[<label base string>]” command as shown on line 2
of Listing D.1. The “keepcomment=yes” option tells fcvextract.pl to keep com-
ment blocks. Otherwise, comment blocks in C source code will be omitted. The
“gobbleblank=yes” option will remove empty or blank lines in the resulting snippet.
The “commandchars” option is given to the VerbatimL environment as is. At the
moment, it is also mandatory and must come at the end of options listed above. Other
types of options, if any, are also passed to the VerbatimL environment.
The “\lnlbl” commands are converted along the way to reflect the escape-character
choice.1 Source lines with “\fcvexclude” are removed. “\fcvblank” can be used to
keep blank lines when the “gobbleblank=yes” option is specified.
There can be multiple pairs of \begin{snippet} and \end{snippet} as long as
they have unique “labelbase” strings.
Our naming scheme of “labelbase” for unique name space is as follows:
ln:<Chapter/Subdirectory>:<File Name>:<Function Name>
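As an illustration of how these meta commands might be combined, here is a small C code sample. The function, the label names, and the “labelbase” string `ln:styleguide:sample:add_one` are all hypothetical, and the sketch assumes that fcvextract.pl treats C files as described above.

```c
/*
 * Hypothetical code sample, for illustration only.
 */
#include <assert.h>

//\begin{snippet}[labelbase=ln:styleguide:sample:add_one,commandchars=\$\[\]]
int add_one(int x)			//\lnlbl{b}
{
	assert(x >= 0);			//\fcvexclude
	return x + 1;			//\lnlbl{ret}
}					//\lnlbl{e}
//\end{snippet}
```

Here “labelbase” comes first and “commandchars” comes last, following the ordering rule given above. The line tagged with \fcvexclude would be dropped from the extracted snippet, and the three \lnlbl labels would be available for line references from the text.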
// Comment at first
C C-sample
// Comment with { and } characters
{
    x=2; // C style comment in initialization
}

P0(int *x)
{
    int r1;

    r1 = READ_ONCE(*x); // Comment with "exists"
}

[...]

exists (0:r1=0) // C++ style comment after test body
To avoid parse errors, meta commands in litmus tests (C flavor) are embedded in the
following way.
C C-SB+o-o+o-o
//\begin[snippet][labelbase=ln:base,commandchars=\%\@\$]

{
    1:r2=0 (*\lnlbl[initr2]*)
}

P0(int *x0, int *x1) //\lnlbl[P0:b]
{
    int r2;

    WRITE_ONCE(*x0, 2);
    r2 = READ_ONCE(*x1);
} //\lnlbl[P0:e]

P1(int *x0, int *x1)
{
    int r2;

    WRITE_ONCE(*x1, 2);
    r2 = READ_ONCE(*x0);
}

//\end[snippet]
exists (1:r2=0 /\ 0:r2=0) (* \lnlbl[exists_] *)

// Do not edit!
// Generated by utillities/reorder_ltms.pl
//\begin{snippet}[labelbase=ln:base,commandchars=\%\@\$]
C C-SB+o-o+o-o

{
    1:r2=0 //\lnlbl{initr2}
}

P0(int *x0, int *x1) //\lnlbl{P0:b}
{
    int r2;

    WRITE_ONCE(*x0, 2);
    r2 = READ_ONCE(*x1);
} //\lnlbl{P0:e}

P1(int *x0, int *x1)
{
    int r2;

    WRITE_ONCE(*x1, 2);
    r2 = READ_ONCE(*x0);
}

exists (1:r2=0 /\ 0:r2=0) \lnlbl{exists_}
//\end{snippet}
Note that each litmus test’s source file can contain at most one pair of
\begin[snippet] and \end[snippet] because of restrictions on comments.
The “verbatim” environment is used for listings with too many lines to fit in a
column. It is also used to avoid overwhelming LaTeX with a lot of floating objects. Such
listings are being converted to the scheme using the VerbatimN environment.
Table D.2:
Macro        Need Escape    Should Avoid
\co, \nbco   \, %, {, }
\tco         #              %, {, }, \
D.3.1.3 Identifier
We use the “\co{}” macro for inline identifiers. (“co” stands for “code”.)
Inside \co{}, underscore characters in identifier names need no escaping in the
LaTeX source, which makes the identifiers easy to search for in source files. The \co{}
macro also permits line breaks at particular character sequences. The current definition
permits a line break at an underscore (_), two consecutive underscores (__), a white
space, or an operator ->.
D.3.1.6 Limitations
There are a few cases where the macros introduced in this section do not work as
expected. Table D.2 lists such limitations.
While \co{} requires some characters to be escaped, it can contain any character.
On the other hand, \tco{} cannot handle “%”, “{”, “}”, or “\” properly. If they are
escaped by a “\”, they appear in the end result with the escape character. The “\verb”
command can be used in running text if you need to use monospace font for a string
which contains many characters to escape.4
3 Overfill can be a problem if the URL or the path name contains long runs of unbreakable
characters.
4 The \verb command is not a cure-all, though. For example, you can’t use it within a
footnote. If you do so, you will see a fatal LaTeX error. A workaround would be a macro
D.3.2 Cross-reference
Cross-references to Chapters, Sections, Listings, etc. have been expressed by combina-
tions of names and bare \ref{} commands in the following way:
Chapter~\ref{chp:Introduction},
Table~\ref{tab:app:styleguide:Digit-Grouping Style}
5 In exchange for enabling the shortcut, we can’t use plain LaTeX’s shortcut “\-” to
specify hyphenation points. Use pfhyphex.tex to add such exceptions.
Note that “\=/” enables hyphenation in elements of compound words in the same way
as “\-/” does.
D.3.4.3 Em Dash
Em dashes are used to indicate parenthetic expressions. In perfbook, em dashes are
placed without spaces around them. In LaTeX source, an em dash is represented by “---”.
Example (quote from Appendix C.1):
This disparity in speed—more than two orders of magnitude—has resulted in
the multi-megabyte caches found on modern CPUs.
D.3.4.4 En Dash
In LaTeX convention, en dashes (–) are used for ranges of (mostly) numbers. Past
revisions of perfbook didn’t follow this rule and used plain dashes (-) for such cases.
Now that \clnrefrange, \crefrange, and their variants, which generate en dashes,
are used for ranges of cross-references, the remaining few dozen plain dashes in
other types of ranges have been converted to en dashes for consistency.
Example with a simple dash:
Lines 4-12 in Listing D.4 are the contents of the verbbox environment. The
box is output by the \theverbbox macro on line 16.
Example with an en dash:
Lines 4–12 in Listing D.4 are the contents of the verbbox environment. The
box is output by the \theverbbox macro on line 16.
D.3.5 Punctuation
D.3.5.1 Ellipsis
In monospace fonts, ellipses can be expressed by series of periods. For example:
Great ... So how do I fix it?
However, in proportional fonts, the series of periods is printed with tight spaces as
follows:
Great ... So how do I fix it?
Standard LaTeX defines the \dots macro for this purpose. However, it spaces the
periods unevenly. The “ellipsis” package redefines the \dots macro to fix the
issue.7 By using \dots, the above example is typeset as follows:
6 This rule assumes that math mode uses the same upright glyph as text mode. Our default
macro in math mode is not affected. The “amsmath” package has another definition of \dots.
It is not used in perfbook at the moment.
8 https://www.inf.ethz.ch/personal/markusp/teaching/guides/guide-tables.pdf
9 There is another package named “arydshln” which provides dashed lines to be used in
Figure D.1: Timer Wheel at 1 kHz
Figure D.2: Timer Wheel at 100 kHz
                   Release
Acquisition        Locks   Reference Counts   Hazard Pointers   RCU
Locks              -       CAM_R              M                 CA
Reference Counts   A       AM_R               M                 A
Hazard Pointers    M       M                  M                 M
RCU                CA      M_A CA             M                 CA
box appended here explain the abbreviations used in the matrix. Two types of memory
barrier are denoted by subscripts here. The legends and subscripts are not present in
Table 13.1 since they are redundant there.
Table D.6 (corresponding to Table C.1) is a sequence diagram drawn as a table.
Table D.7 is a tweaked version of Table 9.3. Here, the “Category” column in the
original is removed and the categories are indicated in rows of bold-face font just below
the mid-rules. This change makes it easier for the \rowcolors{} command of the “xcolor”
package to work properly.
Table D.8 is another version which keeps original columns and colors rows only
where a category has multiple rows. This is done by combining \rowcolors{} of
“xcolor” and \cellcolor{} commands of the “colortbl” package (\cellcolor{}
overrides \rowcolors{}).
In Table 9.3, the latter layout without partial row coloring has been chosen for
simplicity.
    CPU 0                                      CPU 1
    Instruction        Store Buffer  Cache     Instruction        Store Buffer  Cache
1   (Initial state)                  x1==0     (Initial state)                  x0==0
2   x0 = 2;            x0==2         x1==0     x1 = 2;            x1==2         x0==0
3   r2 = x1; (0)       x0==2         x1==0     r2 = x0; (0)       x1==2         x0==0
4   (Read-invalidate)  x0==2         x0==0     (Read-invalidate)  x1==2         x1==0
5   (Finish store)                   x0==2     (Finish store)                   x1==2
Table D.9 (corresponding to Table 15.1) is also a sequence diagram drawn as a tabular
object.
Table D.10 shows another version of Table D.3 with dashed horizontal and vertical
rules of the arydshln package.
In this case, the vertical dashed rules seem unnecessary. The one without the vertical
rules is shown in Table D.11.
The Answer to the Ultimate Question of Life, The
Universe, and Everything.
The Hitchhiker’s Guide to the Galaxy, Douglas Adams
Appendix E
Answer:
In Appendix E starting on page 705. Hey, I thought I owed you an easy one! ❑
Answer:
Indeed it is! Many are questions that Paul E. McKenney would probably have asked if
he was a novice student in a class covering this material. It is worth noting that Paul was
taught most of this material by parallel hardware and software, not by professors. In
Paul’s experience, professors are much more likely to provide answers to verbal questions
than are parallel systems, recent advances in voice-activated assistants notwithstanding.
Of course, we could have a lengthy debate over which of professors or parallel systems
provide the most useful answers to these sorts of questions, but for the time being
let’s just agree that usefulness of answers varies widely across the population both of
professors and of parallel systems.
Other quizzes are quite similar to actual questions that have been asked during
conference presentations and lectures covering the material in this book. A few others
are from the viewpoint of the author. ❑
Answer:
Here are a few possible strategies:
1. Just ignore the Quick Quizzes and read the rest of the book. You might miss out
on the interesting material in some of the Quick Quizzes, but the rest of the book
has lots of good material as well. This is an eminently reasonable approach if your
main goal is to gain a general understanding of the material or if you are skimming
through the book to find a solution to a specific problem.
2. Look at the answer immediately rather than investing a large amount of time in
coming up with your own answer. This approach is reasonable when a given
Quick Quiz’s answer holds the key to a specific problem you are trying to solve.
This approach is also reasonable if you want a somewhat deeper understanding
of the material, but when you do not expect to be called upon to generate parallel
solutions given only a blank sheet of paper.
3. If you find the Quick Quizzes distracting but impossible to ignore, you can always
clone the LaTeX source for this book from the git archive. You can then run the
command make nq, which will produce a perfbook-nq.pdf. This PDF contains
unobtrusive boxed tags where the Quick Quizzes would otherwise be, and gathers
each chapter’s Quick Quizzes at the end of that chapter in the classic textbook style.
4. Learn to like (or at least tolerate) the Quick Quizzes. Experience indicates that
quizzing yourself periodically while reading greatly increases comprehension and
depth of understanding.
Note that the quick quizzes are hyperlinked to the answers and vice versa. Click
either the “Quick Quiz” heading or the small black square to move to the beginning of
the answer. From the answer, click on the heading or the small black square to move to
the beginning of the quiz, or, alternatively, click on the small white square at the end of
the answer to move to the end of the corresponding quiz. ❑
Answer:
For those preferring analogies, coding concurrent software is similar to playing music
in that there are good uses for many different levels of talent and skill. Not everyone
needs to devote their entire life to becoming a concert pianist. In fact, for every such
virtuoso, there are a great many lesser pianists whose music is welcomed by their
friends and families. But these lesser pianists are probably doing something else to
support themselves, and so it is with concurrent coding.
One potential benefit of passively reading this book is the ability to read and understand
modern concurrent code. This ability might in turn permit you to:
1. See what the kernel does so that you can check to see if a proposed use case is
valid.
3. Use information in the kernel to more easily chase down a userspace bug.
5. Create a straightforward kernel feature, whether from scratch or using the modern
copy-pasta development methodology.
If you are proficient with straightforward uses of locks and atomic operations,
passively reading this book should enable you to successfully apply modern concurrency
techniques.
And finally, if your job is to coordinate the activities of developers making use of
modern concurrency techniques, passively reading this book might help you understand
what on earth they are talking about. ❑
E.2 Introduction
Answer:
If you really believe that parallel programming is exceedingly hard, then you should have
a ready answer to the question “Why is parallel programming hard?” One could list any
number of reasons, ranging from deadlocks to race conditions to testing coverage, but
the real answer is that it is not really all that hard. After all, if parallel programming was
really so horribly difficult, how could a large number of open-source projects, ranging
from Apache to MySQL to the Linux kernel, have managed to master it?
A better question might be: “Why is parallel programming perceived to be so
difficult?” To see the answer, let’s go back to the year 1991. Paul McKenney was
walking across the parking lot to Sequent’s benchmarking center carrying six dual-80486
Sequent Symmetry CPU boards, when he suddenly realized that he was carrying several
times the price of the house he had just purchased.1 This high cost of parallel systems
meant that parallel programming was restricted to a privileged few who worked for
an employer who either manufactured or could afford to purchase machines costing
upwards of $100,000—in 1991 dollars US.
In contrast, in 2020, Paul finds himself typing these words on a six-core x86 laptop.
Unlike the dual-80486 CPU boards, this laptop also contains 64 GB of main memory, a
1 TB solid-state disk, a display, Ethernet, USB ports, wireless, and Bluetooth. And the
laptop is more than an order of magnitude cheaper than even one of those dual-80486
CPU boards, even before taking inflation into account.
Parallel systems have truly arrived. They are no longer the sole domain of a privileged
few, but something available to almost everyone.
The earlier restricted availability of parallel hardware is the real reason that parallel
programming is considered so difficult. After all, it is quite difficult to learn to program
even the simplest machine if you have no access to it. Since the age of rare and
expensive parallel machines is for the most part behind us, the age during which parallel
programming is perceived to be mind-crushingly difficult is coming to a close.2 ❑
1 Yes, this sudden realization did cause him to walk quite a bit more carefully. Why do
you ask?
2 Parallel programming is in some ways more difficult than sequential programming, for
Answer:
It depends on the programming environment. SQL [Int92] is an underappreciated
success story, as it permits programmers who know nothing about parallelism to keep a
large parallel system productively busy. We can expect more variations on this theme as
parallel computers continue to become cheaper and more readily available. For example,
one possible contender in the scientific and technical computing arena is MATLAB*P,
which is an attempt to automatically parallelize common matrix operations.
Finally, on Linux and UNIX systems, consider the following shell command:
This shell pipeline runs the get_input, grep, and sort processes in parallel. There,
that wasn’t so hard, now was it?
In short, parallel programming is just as easy as sequential programming—at least in
those environments that hide the parallelism from the user! ❑
Answer:
These are important goals, but they are just as important for sequential programs as they
are for parallel programs. Therefore, important though they are, they do not belong on a
list specific to parallel programming. ❑
Answer:
Given that parallel programming is perceived to be much harder than sequential
programming, productivity is paramount and therefore must not be omitted. Furthermore,
high-productivity parallel-programming environments such as SQL serve a specific
purpose, hence generality must also be added to the list. ❑
Answer:
From an engineering standpoint, the difficulty in proving correctness, either formally or
informally, would be important insofar as it impacts the primary goal of productivity.
So, in cases where correctness proofs are important, they are subsumed under the
“productivity” rubric. ❑
Answer:
Having fun is important as well, but, unless you are a hobbyist, would not normally be a
primary goal. On the other hand, if you are a hobbyist, go wild! ❑
Answer:
There certainly are cases where the problem to be solved is inherently parallel, for
example, Monte Carlo methods and some numerical computations. Even in these cases,
however, there will be some amount of extra work managing the parallelism.
Parallelism is also sometimes used for reliability. For but one example, triple modular
redundancy has three systems run in parallel and vote on the result. In extreme cases,
the three systems will be independently implemented using different algorithms and
technologies. ❑
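The voting step of triple modular redundancy can be sketched in a few lines of C. This is only an illustration under stated assumptions: the function name tmr_vote() is hypothetical, the three results are assumed to be plain integers, and a real system would also need to decide what to do when no two results agree.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical majority voter for three redundant results.
 * Returns true and stores the agreed-upon value in *out if at
 * least two of the three inputs match; returns false otherwise.
 */
bool tmr_vote(int a, int b, int c, int *out)
{
	assert(out != NULL);
	if (a == b || a == c) {
		*out = a;	/* a agrees with at least one other result. */
		return true;
	}
	if (b == c) {
		*out = b;	/* a is the odd one out. */
		return true;
	}
	return false;		/* All three disagree: no majority. */
}
```

A caller would run the three computations in parallel, pass their results to tmr_vote(), and treat a false return as a detected but uncorrectable fault.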
Answer:
If the developers, budget, and time are available for such a rewrite, and if the result will
attain the required levels of performance on a single CPU, this can be a reasonable
approach. ❑
Answer:
If you are a pure hobbyist, perhaps you don’t need to care. But even pure hobbyists
will often care about how much they can get done, and how quickly. After all, the
most popular hobbyist tools are usually those that are the best suited for the job, and an
important part of the definition of “best suited” involves productivity. And if someone
is paying you to write parallel code, they will very likely care deeply about your
productivity. And if the person paying you cares about something, you would be most
wise to pay at least some attention to it!
Besides, if you really didn’t care about productivity, you would be doing it by hand
rather than using a computer! ❑
Answer:
There are a number of answers to this question:
1. Given a large computational cluster of parallel machines, the aggregate cost of the
cluster can easily justify substantial developer effort, because the development cost
can be spread over the large number of machines.
2. Popular software that is run by tens of millions of users can easily justify substantial
developer effort, as the cost of this development can be spread over the tens of
millions of users. Note that this includes things like kernels and system libraries.
4. If the software for the low-cost parallel machine produces an extremely valuable
result (e.g., energy savings), then this valuable result might again justify substantial
developer cost.
5. Safety-critical systems protect lives, which can clearly justify very large developer
effort.
6. Hobbyists and researchers might instead seek knowledge, experience, fun, or glory.
So it is not the case that the decreasing cost of hardware renders software worthless, but
rather that it is no longer possible to “hide” the cost of software development within the
cost of the hardware, at least not unless there are extremely large quantities of hardware.
❑
Answer:
This is eminently achievable. The cellphone is a computer that can be used to make
phone calls and to send and receive text messages with little or no programming or
configuration on the part of the end user.
This might seem to be a trivial example at first glance, but if you consider it carefully
you will see that it is both simple and profound. When we are willing to sacrifice
generality, we can achieve truly astounding increases in productivity. Those who indulge
in excessive generality will therefore fail to set the productivity bar high enough to
succeed near the top of the software stack. This fact of life even has its own acronym:
YAGNI, or “You Ain’t Gonna Need It.” ❑
Answer:
Exactly! And that is the whole point of using existing software. One team’s work can
be used by many other teams, resulting in a large decrease in overall effort compared to
all teams needlessly reinventing the wheel. ❑
Answer:
There are any number of potential bottlenecks:
1. Main memory. If a single thread consumes all available memory, additional threads
will simply page themselves silly.
2. Cache. If a single thread’s cache footprint completely fills any shared CPU cache(s),
then adding more threads will simply thrash those affected caches, as will be seen
in Chapter 10.
4. I/O bandwidth. If a single thread is I/O bound, adding more threads will simply
result in them all waiting in line for the affected I/O resource.
Specific hardware systems might have any number of additional bottlenecks. The fact
is that every resource which is shared between multiple CPUs or threads is a potential
bottleneck. ❑
Answer:
There are any number of potential limits on the number of threads:
1. Main memory. Each thread consumes some memory (for its stack if nothing else),
so that excessive numbers of threads can exhaust memory, resulting in excessive
paging or memory-allocation failures.
Specific applications and platforms may have any number of additional limiting
factors. ❑
Answer:
Where each thread is given access to some set of resources during an agreed-to slot
of time. For example, a parallel program with eight threads might be organized into
eight-millisecond time intervals, so that the first thread is given access during the first
millisecond of each interval, the second thread during the second millisecond, and so
on. This approach clearly requires carefully synchronized clocks and careful control of
execution times, and therefore should be used with considerable caution.
In fact, outside of hard realtime environments, you almost certainly want to use
something else instead. Explicit timing is nevertheless worth a mention, as it is always
there when you need it. ❑
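The slot arithmetic in the eight-thread example can be sketched as follows. This is a minimal illustration, not a realtime implementation: the names slot_owner() and my_turn() are hypothetical, time is assumed to be a nanosecond count taken from clocks that are already synchronized, and nothing here bounds how long a thread actually runs once it decides that it owns the slot.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Return the index of the thread that owns the time slot
 * containing now_ns, given nthreads slots of slot_ns each
 * that repeat forever.
 */
int slot_owner(uint64_t now_ns, int nthreads, uint64_t slot_ns)
{
	assert(nthreads > 0 && slot_ns > 0);
	return (int)((now_ns / slot_ns) % (uint64_t)nthreads);
}

/* Return nonzero if thread "me" may access the resource at now_ns. */
int my_turn(uint64_t now_ns, int me, int nthreads, uint64_t slot_ns)
{
	return slot_owner(now_ns, nthreads, slot_ns) == me;
}
```

With eight threads and one-millisecond slots, thread 0 owns the first millisecond of each eight-millisecond interval, thread 1 the second, and so on.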
Answer:
There are a great many other potential obstacles to parallel programming. Here are a
few of them:
1. The only known algorithms for a given project might be inherently sequential in
nature. In this case, either avoid parallel programming (there being no law saying
that your project has to run in parallel) or invent a new parallel algorithm.
2. The project allows binary-only plugins that share the same address space, such
that no one developer has access to all of the source code for the project. Because
many parallel bugs, including deadlocks, are global in nature, such binary-only
plugins pose a severe challenge to current software development methodologies.
This might well change, but for the time being, all developers of parallel code
sharing a given address space need to be able to see all of the code running in that
address space.
3. The project contains heavily used APIs that were designed without regard to
parallelism [AGH+ 11a, CKZ+ 13]. Some of the more ornate features of the System
V message-queue API form a case in point. Of course, if your project has been
around for a few decades, and its developers did not have access to parallel hardware,
it undoubtedly has at least its share of such APIs.
4. The project was implemented without regard to parallelism. Given that there are a
great many techniques that work extremely well in a sequential environment, but
that fail miserably in parallel environments, if your project ran only on sequential
hardware for most of its lifetime, then your project undoubtedly has at least its
share of parallel-unfriendly code.
6. The people who originally did the development on your project have since moved
on, and the people remaining, while well able to maintain it or add small features,
are unable to make “big animal” changes. In this case, unless you can work out a
very simple way to parallelize your project, you will probably be best off leaving
it sequential. That said, there are a number of simple approaches that you might
use to parallelize your project, including running multiple instances of it, using a
parallel implementation of some heavily used library function, or making use of
some other parallel project, such as a database.
One can argue that many of these obstacles are non-technical in nature, but that does
not make them any less real. In short, parallelization of a large body of code can be a
large and complex effort. As with any large and complex effort, it makes sense to do
your homework beforehand. ❑
Answer:
It might well be easier to ignore the detailed properties of the hardware, but in most cases
it would be quite foolish to do so. If you accept that the only purpose of parallelism is
to increase performance, and if you further accept that performance depends on detailed
properties of the hardware, then it logically follows that parallel programmers are going
to need to know at least a few hardware properties.
This is the case in most engineering disciplines. Would you want to use a bridge
designed by an engineer who did not understand the properties of the concrete and
steel making up that bridge? If not, why would you expect a parallel programmer to be
able to develop competent parallel software without at least some understanding of the
underlying hardware? ❑
Answer:
One answer to this question is that it is often possible to pack multiple elements of data
into a single machine word, which can then be manipulated atomically.
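As a rough sketch of this packing approach, the following C11 code keeps a 32-bit counter and 32 bits of flags in one 64-bit word and updates both with a single compare-and-swap loop. The field layout and the names pack(), unpack_count(), unpack_flags(), and inc_and_flag() are assumptions made for illustration, and the sketch assumes a platform on which 64-bit atomic operations are available.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout: upper 32 bits hold a count, lower 32 bits hold flags. */
static inline uint64_t pack(uint32_t count, uint32_t flags)
{
	return ((uint64_t)count << 32) | flags;
}

static inline uint32_t unpack_count(uint64_t w)
{
	return (uint32_t)(w >> 32);
}

static inline uint32_t unpack_flags(uint64_t w)
{
	return (uint32_t)w;
}

/*
 * Atomically increment the count and set the given flag bits.
 * Both fields change in one compare-and-swap, so no other thread
 * can observe one field updated without the other.
 * Returns the new packed value.
 */
uint64_t inc_and_flag(_Atomic uint64_t *p, uint32_t setbits)
{
	uint64_t oldval;
	uint64_t newval;

	assert(p != NULL);
	oldval = atomic_load(p);
	do {
		newval = pack(unpack_count(oldval) + 1,
			      unpack_flags(oldval) | setbits);
		/* On failure, oldval is refreshed with the current value. */
	} while (!atomic_compare_exchange_weak(p, &oldval, newval));
	return newval;
}
```

The obvious limitation is that everything must fit in one machine word, which is part of why the text goes on to mention transactional memory.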
A more trendy answer would be machines supporting transactional memory [Lom77,
Kni86, HM93]. By early 2014, several mainstream systems provided limited hardware
transactional memory implementations, which are covered in more detail in Section 17.3.
The jury is still out on the applicability of software transactional memory [MMW07,
PW07, RHP+ 07, CBM+ 08, DFGG11, MS12], which is covered in Section 17.2. ❑
Answer:
Unfortunately, not so much. There has been some reduction given constant numbers of
CPUs, but the finite speed of light and the atomic nature of matter limit how much
cache-miss overhead can be reduced for larger systems. Section 3.3 discusses some
possible avenues for future progress. ❑
Answer:
This sequence ignored a number of possible complications, including:
2. The cacheline might have been replicated read-only in several CPUs’ caches, in
which case, it would need to be flushed from their caches.
3. CPU 7 might have been operating on the cache line when the request for it arrived,
in which case CPU 7 might need to hold off the request until its own operation
completed.
4. CPU 7 might have ejected the cacheline from its cache (for example, in order to
make room for other data), so that by the time that the request arrived, the cacheline
was on its way to memory.
5. A correctable error might have occurred in the cacheline, which would then need
to be corrected at some point before the data was used.
Answer:
If the cacheline was not flushed from CPU 7’s cache, then CPUs 0 and 7 might have
different values for the same set of variables in the cacheline. This sort of incoherence
greatly complicates parallel software, which is why wise hardware architects avoid it. ❑
Answer:
The hardware designers have been working on this problem, and have consulted with
no less a luminary than the late physicist Stephen Hawking. Hawking’s observation was
that the hardware designers have two basic problems [Gar07]:
1. The finite speed of light, and
2. The atomic nature of matter.
The first problem limits raw speed, and the second limits miniaturization, which
in turn limits frequency. And even this sidesteps the power-consumption issue that is
currently limiting production frequencies to well below 10 GHz.
In addition, Table 3.1 on page 36 represents a reasonably large system with no fewer
than 448 hardware threads. Smaller systems often achieve better latency, as may be seen
in Table E.1, which represents a much smaller system with only 16 hardware threads.
A similar view is provided by the rows of Table 3.1 down to and including the two
“Off-Core” rows.
Furthermore, newer small-scale single-socket systems such as the laptop on which I
am typing this also have more reasonable latencies, as can be seen in Table E.2.
Alternatively, a 64-CPU system in the mid 1990s had cross-interconnect latencies
in excess of five microseconds, so even the eight-socket 448-hardware-thread monster
shown in Table 3.1 represents more than a five-fold improvement over its 25-years-prior
counterparts.
Integration of hardware threads in a single core and multiple cores on a die has
improved latencies greatly, at least within the confines of a single core or single die.
There has been some improvement in overall system latency, but only by about a factor
of two. Unfortunately, neither the speed of light nor the atomic nature of matter has
changed much in the past few years [Har16]. Therefore, spatial and temporal locality
are first-class concerns for concurrent software, even when running on relatively small
systems.
Section 3.3 looks at what else hardware designers might be able to do to ease the
plight of parallel programmers. ❑
Answer:
I was surprised by the data I obtained and did a rigorous check of their validity. I got the
same result persistently. One theory that might explain the observation would be: The
two threads in the core are able to overlap their accesses, while the single CPU must
do everything sequentially. Unfortunately, there seems to be no public documentation
explaining why the Intel X5550 (Nehalem) system behaved like that. ❑
Answer:
Get a roll of toilet paper. In the USA, each roll will normally have somewhere around
350–500 sheets. Tear off one sheet to represent a single clock cycle, setting it aside.
Now unroll the rest of the roll.
The resulting pile of toilet paper will likely represent a single CAS cache miss.
For the more-expensive inter-system communications latencies, use several rolls (or
multiple cases) of toilet paper to represent the communications latency.
Important safety tip: Make sure to account for the needs of those you live with when
appropriating toilet paper, especially in 2020 or during a similar time when store shelves
are free of toilet paper and much else besides.
Furthermore, for those working on kernel code, a CPU disabling interrupts across a
cache miss is analogous to you holding your breath while unrolling a roll of toilet paper.
How many rolls of toilet paper can you unroll while holding your breath? You might
wish to avoid disabling interrupts across that many cache misses.3 ❑
Answer:
Electron drift velocity tracks the long-term movement of individual electrons. It turns
out that individual electrons bounce around quite randomly, so that their instantaneous
speed is very high, but over the long term, they don’t move very far. In this, electrons
resemble long-distance commuters, who might spend most of their time traveling at full
highway speed, but over the long term go nowhere. These commuters’ speed might be
70 miles per hour (113 kilometers per hour), but their long-term drift velocity relative
to the planet’s surface is zero.
Therefore, we should pay attention not to the electrons’ drift velocity, but to their
instantaneous velocities. However, even their instantaneous velocities are nowhere
near a significant fraction of the speed of light. Nevertheless, the measured velocity of
electric waves in conductors is a substantial fraction of the speed of light, so we still
have a mystery on our hands.
The other trick is that electrons interact with each other at significant distances (from
an atomic perspective, anyway), courtesy of their negative charge. This interaction is
carried out by photons, which do move at the speed of light. So even with electricity’s
electrons, it is photons doing most of the fast footwork.
Extending the commuter analogy, a driver might use a smartphone to inform other
drivers of an accident or congestion, thus allowing a change in traffic flow to propagate
much faster than the instantaneous velocity of the individual cars. Summarizing the
analogy between electricity and traffic flow:
1. The (very low) drift velocity of an electron is similar to the long-term velocity of a
commuter, both being very nearly zero.
Answer:
There are a number of reasons:
1. Shared-memory multiprocessor systems have strict size limits. If you need more
than a few thousand CPUs, you have no choice but to use a distributed system.
3. Large shared-memory systems tend to have much longer cache-miss latencies than
do smaller systems. To see this, compare Table 3.1 on page 36 with Table E.2.
Thus, large shared-memory systems tend to be used for applications that benefit from
faster latencies than can be provided by distributed computing, and particularly for those
applications that benefit from a large shared memory.
It is likely that continued work on parallel applications will increase the number
of embarrassingly parallel applications that can run well on machines and/or clusters
having long communications latencies, reductions in cost being the driving force that
it is. That said, greatly reduced hardware latencies would be an extremely welcome
development, both for single-system and for distributed computing. ❑
Answer:
Because it is often the case that only a small fraction of the program is performance-
critical. Shared-memory parallelism allows us to focus distributed-programming
techniques on that small fraction, allowing simpler shared-memory techniques to be
used on the non-performance-critical bulk of the program. ❑
Answer:
They look that way because they are in fact low-level synchronization primitives. And
they are in fact the fundamental tools for building low-level concurrent software. ❑
Answer:
Because you should never forget the simple stuff!
Please keep in mind that the title of this book is “Is Parallel Programming Hard,
And, If So, What Can You Do About It?”. One of the most effective things you can do
about it is to avoid forgetting the simple stuff! After all, if you choose to do parallel
programming the hard way, you have no one but yourself to blame. ❑
Answer:
One straightforward approach is the shell pipeline:
For a sufficiently large input file, grep will pattern-match in parallel with sed editing
and with the input processing of sort. See the file parallel.sh for a demonstration
of shell-script parallelism and pipelining. ❑
Answer:
In fact, it is quite likely that a very large fraction of parallel programs in use today are
script-based. However, script-based parallelism does have its limitations:
4. Scripting languages are often too slow, but are often quite useful when coordinating
execution of long-running programs written in lower-level programming languages.
Answer:
Some parallel applications need to take special action when specific children exit, and
therefore need to wait for each child individually. In addition, some parallel applications
need to detect the reason that the child died. As we saw in Listing 4.2, it is not hard to
build a waitall() function out of the wait() function, but it would be impossible to
do the reverse. Once the information about a specific child is lost, it is lost. ❑
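The asymmetry described above can be sketched as follows. This is a minimal illustration rather than the book's Listing 4.2: the spawn_children() helper and the nfailed parameter are assumptions made for this example. Note that wait() hands back each child's PID and exit status, which this waitall() reduces to two counts before discarding.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: fork n children, with child i exiting with status i. */
static int spawn_children(int n)
{
	for (int i = 0; i < n; i++) {
		pid_t pid = fork();

		if (pid < 0)
			return -1;	/* fork() failed */
		if (pid == 0)
			_exit(i);	/* child: exit immediately */
	}
	return 0;
}

/*
 * Reap all children using only wait().  Returns the number of children
 * reaped, or -1 on an unexpected error.  If nfailed is non-NULL, it
 * receives the number of children that did not exit with status zero.
 */
static int waitall(int *nfailed)
{
	int status;
	int nreaped = 0;
	int nbad = 0;

	for (;;) {
		pid_t pid = wait(&status);

		if (pid == -1) {
			if (errno == ECHILD)
				break;		/* no children remain */
			if (errno == EINTR)
				continue;	/* interrupted by a signal: retry */
			return -1;
		}
		nreaped++;
		if (!WIFEXITED(status) || WEXITSTATUS(status) != 0)
			nbad++;
	}
	if (nfailed)
		*nfailed = nbad;
	return nreaped;
}
```

A caller that needs to react to one particular child would instead compare the PID returned by wait() against the PIDs it recorded at fork() time, which is exactly the information that a waitall()-only API cannot supply.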
Answer:
Indeed there is, and it is quite possible that this section will be expanded in future
versions to include messaging features (such as UNIX pipes, TCP/IP, and shared file
I/O) and memory mapping (such as mmap() and shmget()). In the meantime, there
are any number of textbooks that cover these primitives in great detail, and the truly
motivated can read manpages, existing parallel applications using these primitives, as
well as the source code of the Linux-kernel implementations themselves.
It is important to note that the parent process in Listing 4.3 waits until after the child
terminates to do its printf(). Using printf()’s buffered I/O concurrently to the
same file from multiple processes is non-trivial, and is best avoided. If you really need
to do concurrent buffered I/O, consult the documentation for your OS. For UNIX/Linux
systems, Stewart Weiss’s lecture notes provide a good introduction with informative
examples [Wei13]. ❑
Answer:
In this simple example, there is no reason whatsoever. However, imagine a more
complex example, where mythread() invokes other functions, possibly separately
compiled. In such a case, pthread_exit() allows these other functions to end the
thread’s execution without having to pass some sort of error return all the way back up
to mythread(). ❑
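The more complex case can be sketched as follows. The check_arg() helper and the run_mythread() driver are hypothetical names introduced for this illustration, and the error encoding (-1) is likewise an assumption.

```c
#include <pthread.h>
#include <stdint.h>

/* Hypothetical helper, possibly separately compiled in real code. */
static void check_arg(int v)
{
	if (v < 0)
		pthread_exit((void *)(intptr_t)-1);	/* end the thread from here */
}

static void *mythread(void *arg)
{
	int v = (int)(intptr_t)arg;

	check_arg(v);		/* does not return if v is negative */
	return (void *)(intptr_t)(v * 2);
}

/* Run mythread() on v and return the value that the thread exited with. */
static intptr_t run_mythread(int v)
{
	pthread_t tid;
	void *ret;

	if (pthread_create(&tid, NULL, mythread, (void *)(intptr_t)v) != 0)
		return -2;
	if (pthread_join(tid, &ret) != 0)
		return -2;
	return (intptr_t)ret;
}
```

Here pthread_join() sees the same kind of value whether the thread returned from mythread() or called pthread_exit() several frames down, so check_arg() needs no error-return path of its own.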
Answer:
Ah, but the Linux kernel is written in a carefully selected superset of the C language
that includes special GNU extensions, such as asms, that permit safe execution even
in the presence of data races. In addition, the Linux kernel does not run on a number of
platforms where data races would be especially problematic. For an example, consider
embedded systems with 32-bit pointers and 16-bit busses. On such a system, a data
race involving a store to and a load from a given pointer might well result in the load
returning the low-order 16 bits of the old value of the pointer concatenated with the
high-order 16 bits of the new value of the pointer.
Nevertheless, even in the Linux kernel, data races can be quite dangerous and should
be avoided where feasible [Cor12]. ❑
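The torn load described above can be modeled with plain integer arithmetic. This is only a single-threaded illustration of the resulting value, with torn_load() being a hypothetical name; it does not itself contain a data race.

```c
#include <stdint.h>

/*
 * Model of a 32-bit load torn by a 16-bit bus: the low-order half is
 * read before the concurrent store, the high-order half after it.
 */
static uint32_t torn_load(uint32_t old_val, uint32_t new_val)
{
	return (new_val & 0xffff0000u) | (old_val & 0x0000ffffu);
}
```

For example, with an old pointer value of 0x1234abcd and a new value of 0x5678ef01, the load would return 0x5678abcd, which matches neither the old nor the new pointer.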
Answer:
The first thing you should do is to ask yourself why you would want to do such a thing.
If the answer is “because I have a lot of data that is read by many threads, and only
occasionally updated”, then POSIX reader-writer locks might be what you are looking
for. These are introduced in Section 4.2.4.
Another way to get the effect of multiple threads holding the same lock is for one
thread to acquire the lock, and then use pthread_create() to create the other threads.
The question of why this would ever be a good idea is left to the reader. ❑
Answer:
Because we will need to pass lock_reader() to pthread_create(). Although we
could cast the function when passing it to pthread_create(), function casts are quite
a bit uglier and harder to get right than are simple pointer casts. ❑
Answer:
These macros constrain the compiler so as to prevent it from carrying out optimizations
that would be problematic for concurrently accessed shared variables. They don’t
constrain the CPU at all, other than by preventing reordering of accesses to a given
single variable. Note that this single-variable constraint does apply to the code shown
in Listing 4.5 because only the variable x is accessed.
For more information on READ_ONCE() and WRITE_ONCE(), please see Section 4.2.5.
For more information on ordering accesses to multiple variables by multiple threads,
please see Chapter 15. In the meantime, READ_ONCE(x) has much in common with the
GCC intrinsic __atomic_load_n(&x, __ATOMIC_RELAXED) and WRITE_ONCE(x,
v) has much in common with the GCC intrinsic __atomic_store_n(&x, v, __
ATOMIC_RELAXED). ❑
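The similarity can be sketched by placing the two forms side by side. The macro names below carry a _SKETCH suffix to make clear that they are simplified stand-ins written for this illustration, not the Linux kernel's definitions; they assume a GCC-compatible compiler and a machine-word-sized variable.

```c
/* Simplified volatile-cast forms (assumption: x is machine-word sized). */
#define READ_ONCE_SKETCH(x)	(*(volatile __typeof__(x) *)&(x))
#define WRITE_ONCE_SKETCH(x, v)	(*(volatile __typeof__(x) *)&(x) = (v))

static int x;

/* Accessors using the volatile-cast forms. */
static void set_x_volatile(int v)
{
	WRITE_ONCE_SKETCH(x, v);
}

static int get_x_volatile(void)
{
	return READ_ONCE_SKETCH(x);
}

/* Accessors using the GCC relaxed-atomic intrinsics. */
static void set_x_relaxed(int v)
{
	__atomic_store_n(&x, v, __ATOMIC_RELAXED);
}

static int get_x_relaxed(void)
{
	return __atomic_load_n(&x, __ATOMIC_RELAXED);
}
```

Both pairs give single-variable access without imposing ordering on accesses to other variables. They are not identical, however: the volatile forms forbid the compiler from fusing or omitting accesses, while the relaxed-atomic forms additionally rule out tearing for the types they support.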
Answer:
Indeed! And for that reason, the pthread_mutex_lock() and pthread_mutex_
unlock() primitives are normally wrapped in functions that do this error checking.
Later on, we will wrap them with the Linux kernel spin_lock() and spin_unlock()
APIs. ❑
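Such wrappers might look something like the following sketch. The names xlock(), xunlock(), and locked_inc() are assumptions made for this illustration and are not the wrappers used in the book's CodeSamples directory.

```c
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Acquire the lock, treating any error as fatal. */
static void xlock(pthread_mutex_t *mp)
{
	int en = pthread_mutex_lock(mp);	/* returns an error number */

	if (en != 0) {
		fprintf(stderr, "pthread_mutex_lock: %s\n", strerror(en));
		exit(EXIT_FAILURE);
	}
}

/* Release the lock, treating any error as fatal. */
static void xunlock(pthread_mutex_t *mp)
{
	int en = pthread_mutex_unlock(mp);

	if (en != 0) {
		fprintf(stderr, "pthread_mutex_unlock: %s\n", strerror(en));
		exit(EXIT_FAILURE);
	}
}

static pthread_mutex_t count_lock = PTHREAD_MUTEX_INITIALIZER;
static long count;

/* Example caller: increment a lock-protected counter, return the new value. */
static long locked_inc(void)
{
	long ret;

	xlock(&count_lock);
	ret = ++count;
	xunlock(&count_lock);
	return ret;
}
```

Note that the pthread functions return the error number directly rather than setting errno, which is why the sketch passes the return value to strerror() instead of calling perror().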
Answer:
No. The reason that “x = 0” was output was that lock_reader() acquired the lock
first. Had lock_writer() instead acquired the lock first, then the output would have
been “x = 3”. However, because the code fragment started lock_reader() first and
because this run was performed on a multiprocessor, one would normally expect lock_
reader() to acquire the lock first. Nevertheless, there are no guarantees, especially on
a busy system. ❑
Answer:
Although it is sometimes possible to write a program using a single global lock that both
performs and scales well, such programs are exceptions to the rule. You will normally
need to use multiple locks to attain good performance and scalability.
One possible exception to this rule is “transactional memory”, which is currently a
research topic. Transactional-memory semantics can be loosely thought of as those of a
single global lock with optimizations permitted and with the addition of rollback [Boe09].
❑
Answer:
No. On a busy system, lock_reader() might be preempted for the entire duration of
lock_writer()’s execution, in which case it would not see any of lock_writer()’s
intermediate states for x. ❑
Answer:
See line 4 of Listing 4.5. Because the code in Listing 4.6 ran first, it could rely on
the compile-time initialization of x. The code in Listing 4.7 ran next, so it had to
re-initialize x. ❑
Answer:
A volatile declaration is in fact a reasonable alternative in this particular case.
However, use of READ_ONCE() has the benefit of clearly flagging to the reader that
goflag is subject to concurrent reads and updates. Note that READ_ONCE() is especially
useful in cases where most of the accesses are protected by a lock (and thus not subject
to change), but where a few of the accesses are made outside of the lock. Using a
volatile declaration in this case would make it harder for the reader to note the special
accesses outside of the lock, and would also make it harder for the compiler to generate
good code under the lock. ❑
Answer:
No, memory barriers are not needed and won’t help here. Memory barriers only enforce
ordering among multiple memory references: They absolutely do not guarantee to
expedite the propagation of data from one part of the system to another.4 This leads to a
quick rule of thumb: You do not need memory barriers unless you are using more than
one variable to communicate between multiple threads.
But what about nreadersrunning? Isn’t that a second variable used for communi-
cation? Indeed it is, and there really are the needed memory-barrier instructions buried
in __sync_fetch_and_add(), which make sure that the thread proclaims its presence
before checking to see if it should start. ❑
Answer:
It depends. If the per-thread variable was accessed only from its thread, and never from
a signal handler, then no. Otherwise, it is quite possible that READ_ONCE() is needed.
We will see examples of both situations in Section 5.4.4.
This leads to the question of how one thread can gain access to another thread’s
__thread variable, and the answer is that the second thread must store a pointer to
its __thread variable somewhere that the first thread has access to. One common
approach is to maintain a linked list with one element per thread, and to store the address
of each thread’s __thread variable in the corresponding element. ❑
4 There have been persistent rumors of hardware in which memory barriers actually do
Answer:
Not at all. In fact, this comparison was, if anything, overly lenient. A more balanced
comparison would be against single-CPU throughput with the locking primitives
commented out. ❑
Answer:
If the data being read never changes, then you do not need to hold any locks while
accessing it. If the data changes sufficiently infrequently, you might be able to checkpoint
execution, terminate all threads, change the data, then restart at the checkpoint.
Another approach is to keep a single exclusive lock per thread, so that a thread
read-acquires the larger aggregate reader-writer lock by acquiring its own lock, and
write-acquires by acquiring all the per-thread locks [HW92]. This can work quite well
for readers, but causes writers to incur increasingly large overheads as the number of
threads increases.
Some other ways of efficiently handling very small critical sections are described in
Chapter 9. ❑
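The per-thread-lock approach can be sketched with pthread mutexes as follows. The br_ names and the fixed NR_THREADS value are assumptions made for this illustration, and production code would normally place each lock in its own cache line to avoid false sharing.

```c
#include <errno.h>
#include <pthread.h>

#define NR_THREADS 4

/* One exclusive lock per thread. */
static pthread_mutex_t thread_lock[NR_THREADS] = {
	PTHREAD_MUTEX_INITIALIZER, PTHREAD_MUTEX_INITIALIZER,
	PTHREAD_MUTEX_INITIALIZER, PTHREAD_MUTEX_INITIALIZER,
};

/* Readers acquire only their own lock: cheap and uncontended. */
static void br_read_lock(int me)
{
	pthread_mutex_lock(&thread_lock[me]);
}

static void br_read_unlock(int me)
{
	pthread_mutex_unlock(&thread_lock[me]);
}

/* Writers acquire every lock, always in index order to avoid deadlock. */
static void br_write_lock(void)
{
	for (int i = 0; i < NR_THREADS; i++)
		pthread_mutex_lock(&thread_lock[i]);
}

static void br_write_unlock(void)
{
	for (int i = NR_THREADS - 1; i >= 0; i--)
		pthread_mutex_unlock(&thread_lock[i]);
}
```

The loop in br_write_lock() makes the tradeoff visible: read-side cost is independent of the number of threads, while write-side cost grows linearly with it.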
Answer:
In general, newer hardware is improving. However, it will need to improve several
orders of magnitude to permit reader-writer locking to achieve ideal performance on 448
CPUs. Worse yet, the greater the number of CPUs, the larger the required performance
improvement. The performance problems of reader-writer locking are therefore very
likely to be with us for quite some time to come. ❑
Answer:
Strictly speaking, no. One could implement any member of the second set using the
corresponding member of the first set. For example, one could implement __sync_
nand_and_fetch() in terms of __sync_fetch_and_nand() as follows:
tmp = v;
ret = __sync_fetch_and_nand(p, tmp);
ret = ~(ret & tmp); /* GCC 4.4 and later define nand as ~(old & v) */
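The following sketch wraps this idea in a function so that it can be compared against the built-in. It assumes GCC 4.4 or later (or a compatible compiler), where the nand primitives compute ~(old & v); the function name is hypothetical.

```c
/*
 * Emulate __sync_nand_and_fetch() using __sync_fetch_and_nand().
 * Assumes GCC 4.4+ semantics, where the new value is ~(old & v).
 */
static unsigned int nand_and_fetch_sketch(unsigned int *p, unsigned int v)
{
	unsigned int tmp = v;
	unsigned int ret = __sync_fetch_and_nand(p, tmp);	/* old value */

	return ~(ret & tmp);	/* recompute the value that was stored */
}
```

Recomputing the new value from the old one is safe because the atomic operation has already completed: the returned old value and the local copy of v fully determine what was stored, regardless of what other threads do afterwards.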
Answer:
Unfortunately, no. See Chapter 5 for some stark counterexamples. ❑
Answer:
In the 2018 v4.15 release, the Linux kernel’s ACCESS_ONCE() was replaced by READ_
ONCE() and WRITE_ONCE() for reads and writes, respectively [Cor12, Cor14a, Rut17].
ACCESS_ONCE() was introduced as a helper in RCU code, but was promoted to core API
soon afterward [McK07b, Tor08]. Linux kernel’s READ_ONCE() and WRITE_ONCE()
have evolved into complex forms that look quite different than the original ACCESS_
ONCE() implementation due to the need to support access-once semantics for large
structures, but with the possibility of load/store tearing if the structure cannot be loaded
and stored with a single machine instruction. ❑
Answer:
They don’t really exist. All tasks executing within the Linux kernel share memory, at
least unless you want to do a huge amount of memory-mapping work by hand. ❑
Answer:
On CPUs with load-store architectures, incrementing counter might compile into
something like the following:
LOAD counter,r0
INC r0
STORE r0,counter
On such machines, two threads might simultaneously load the value of counter,
each increment it, and each store the result. The new value of counter will then only
be one greater than before, despite two threads each incrementing it. ❑
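The unlucky interleaving can be replayed deterministically in a single thread. In this sketch, the two local variables stand in for each thread's copy of register r0; the function name is hypothetical.

```c
/* Replay the interleaving in which both threads load before either stores. */
static int lost_update(int counter)
{
	int r0_a = counter;	/* thread A: LOAD counter,r0 */
	int r0_b = counter;	/* thread B: LOAD counter,r0 */

	r0_a++;			/* thread A: INC r0 */
	r0_b++;			/* thread B: INC r0 */
	counter = r0_a;		/* thread A: STORE r0,counter */
	counter = r0_b;		/* thread B: STORE r0,counter */
	return counter;
}
```

Starting from 5, this sequence leaves counter at 6 rather than 7, because thread B's store overwrites thread A's with a value computed from the same starting point.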
Answer:
Suppose that global_ptr is initially non-NULL, but that some other thread sets global_
ptr to NULL. Suppose further that line 1 of the transformed code (Listing 4.15) executes
just before global_ptr is set to NULL and line 2 just after. Then line 1 will conclude
that global_ptr is non-NULL, line 2 will conclude that it is less than high_address,
so that line 3 passes do_low() a NULL pointer, which do_low() just might not be
prepared to deal with.
Your editor made exactly this mistake in the DYNIX/ptx kernel’s memory allocator
in the early 1990s. Tracking down the bug consumed a holiday weekend not just for
your editor, but also for several of his colleagues. In short, this is not a new problem,
nor is it likely to go away on its own. ❑
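One way to avoid this failure is to load global_ptr exactly once into a local variable, so that both checks and the call all use the same value. The following is a minimal sketch: the arena array, the call counters, the handle_ptr() wrapper, and the simplified READ_ONCE_SKETCH() macro are all assumptions made for this illustration.

```c
#include <stddef.h>

/* Simplified stand-in for READ_ONCE(), assuming a GCC-compatible compiler. */
#define READ_ONCE_SKETCH(x)	(*(volatile __typeof__(x) *)&(x))

static char arena[64];				/* hypothetical address range */
static char *high_address = &arena[32];
static char *global_ptr;			/* may be changed by other threads */
static int low_calls;				/* do_low() calls with valid pointer */
static int null_calls;				/* do_low() calls with NULL pointer */

static void do_low(char *p)
{
	if (p == NULL)
		null_calls++;
	else
		low_calls++;
}

static void handle_ptr(void)
{
	char *p = READ_ONCE_SKETCH(global_ptr);	/* exactly one load */

	if (p != NULL && p < high_address)
		do_low(p);	/* p cannot have changed since the checks */
}
```

Because the compiler may not duplicate the volatile load, a concurrent store of NULL to global_ptr can affect only which value p receives, not the consistency of the checks that follow.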
Answer:
Because gp is not a static variable, if either do_something() or do_something_
else() were separately compiled, the compiler would have to assume that either or
both of these two functions might change the value of gp. This possibility would force
the compiler to reload gp on line 15, thus avoiding the NULL-pointer dereference. ❑
Answer:
Thankfully, the answer is no. This is because the compiler is forbidden from introducing
data races. The case of inventing a store just before a normal store is quite special:
It is not possible for some other entity, be it CPU, thread, signal handler, or interrupt
handler, to be able to see the invented store unless the code already has a data race, even
without the invented store. And if the code already has a data race, it already invokes
the dreaded spectre of undefined behavior, which allows the compiler to generate pretty
much whatever code it wants, regardless of the wishes of the developer.
But if the original store is volatile, as in WRITE_ONCE(), for all the compiler knows,
there might be a side effect associated with the store that could signal some other thread,
allowing data-race-free access to the variable. By inventing the store, the compiler
might be introducing a data race, which it is not permitted to do.
Furthermore, in Listing 4.21, the address of that variable is passed to do_a_bunch_
of_stuff(). If the compiler can see this function’s definition, and notes that a is
unconditionally stored to without any synchronization operations, then the compiler can
be quite sure that it is not introducing a data race in this case.
In the case of volatile and atomic variables, the compiler is specifically forbidden
from inventing writes. ❑
Answer:
As is often the case, the answer is “it depends”. However, if only two threads are accessing
the status and other_task_ready variables, then the smp_store_release() and
smp_load_acquire() functions discussed in Section 4.3.5 will suffice. ❑
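For the two-thread case, a user-space approximation can be sketched with GCC's __atomic intrinsics standing in for smp_store_release() and smp_load_acquire(). The payload and ready variables and the function names are hypothetical, chosen for this illustration rather than taken from the listing under discussion.

```c
#include <pthread.h>
#include <stddef.h>

static int payload;	/* plain data, ordered by the flag */
static int ready;	/* flag written with release, read with acquire */

static void *producer(void *arg)
{
	(void)arg;
	payload = 42;					/* plain store */
	__atomic_store_n(&ready, 1, __ATOMIC_RELEASE);	/* publish */
	return NULL;
}

static int consumer(void)
{
	while (!__atomic_load_n(&ready, __ATOMIC_ACQUIRE))
		continue;				/* wait for publication */
	return payload;					/* guaranteed to see 42 */
}

/* Run one producer thread and consume its value in the calling thread. */
static int handoff(void)
{
	pthread_t tid;
	int v;

	payload = 0;
	ready = 0;
	if (pthread_create(&tid, NULL, producer, NULL) != 0)
		return -1;
	v = consumer();
	pthread_join(tid, NULL);
	return v;
}
```

The release store orders the earlier plain store to payload before the flag update, and the acquire load orders the flag check before the later plain load, so the pair suffices when only these two threads are involved.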
Answer:
Then that interrupt handler must follow the same rules that are followed by other
interrupted code. Only those handlers that cannot be themselves interrupted or that
access no variables shared with an interrupting handler may safely use plain accesses,
and even then only if those variables cannot be concurrently accessed by some other
CPU or thread. ❑
Answer:
One approach would be to create an array indexed by smp_thread_id(), and another
would be to use a hash table to map from smp_thread_id() to an array index—which
is in fact what this set of APIs does in pthread environments.
Another approach would be for the parent to allocate a structure containing fields
for each desired per-thread variable, then pass this to the child during thread creation.
However, this approach can impose large software-engineering costs in large systems.
To see this, imagine if all global variables in a large system had to be declared in a
single file, regardless of whether or not they were C static variables! ❑
Answer:
First, needing a per-thread variable is less likely than you might think. Per-CPU
variables can often do a per-thread variable’s job. For example, if you only need to
do addition, bitwise AND, bitwise OR, exchange, or compare-and-exchange, then the
this_cpu_add(), this_cpu_add_return(), this_cpu_and(), this_cpu_or(),
this_cpu_xchg(), this_cpu_cmpxchg(), and this_cpu_cmpxchg_double() op-
erations, respectively, will do the job cheaply and atomically with respect to context
switches, interrupt handlers, and softirq handlers, but not non-maskable interrupts.
Second, within a preemption-disabled region of code, for example, one surrounded
by the preempt_disable() and preempt_enable() macros, the current task is
guaranteed to remain executing on the current CPU. Therefore, while within one
such region, any series of accesses to per-CPU variables is atomic with respect to
context switches, though not with respect to interrupt handlers, softirq handlers, and
non-maskable interrupts. But please be aware that a preemption-disabled region of
code that runs for more than a few microseconds will not be looked upon with favor by
people attempting to construct real-time systems.
Third, a field added to the task_struct structure acts as a set of per-task variables.
However, there are those who keep a close eye on the size of this structure, and these
people are likely to ask hard questions about the need for any added fields. Therefore,
if your field is being added for some facility that is only built into some kernels, you
should definitely place your new task_struct fields under an appropriate #ifdef.
Fourth and finally, your per-task variable might instead be located in some other
structure and protected by some synchronization mechanism that is already in use. For
example, if your code must hold a given lock, can accesses to this storage instead be
protected by that lock? The fact that this is at the end of the list notwithstanding, you
should look into this possibility first, not last! ❑
Answer:
It might well do that; however, checking is left as an exercise for the reader. But in the
meantime, I hope that we can agree that vfork() is a variant of fork(), so that we
can use fork() as a generic term covering both. ❑
E.5 Counting
Answer:
Because the straightforward counting algorithms, for example, atomic operations on
a shared counter, either are slow and scale badly, or are inaccurate, as will be seen in
Section 5.1. ❑
Answer:
Hint: The act of updating the counter must be blazingly fast, but because the counter is
read out only about once in five million updates, the act of reading out the counter can be
quite slow. In addition, the value read out normally need not be all that accurate—after
all, since the counter is updated a thousand times per millisecond, we should be able to
work with a value that is within a few thousand counts of the “true value”, whatever
“true value” might mean in this context. However, the value read out should maintain
roughly the same absolute error over time. For example, a 1 % error might be just fine
when the count is on the order of a million or so, but might be absolutely unacceptable
once the count reaches a trillion. See Section 5.2. ❑
Answer:
Hint: The act of updating the counter must again be blazingly fast, but the counter is
read out each time that the counter is increased. However, the value read out need not
be accurate except that it must distinguish approximately between values below the limit
and values greater than or equal to the limit. See Section 5.3. ❑
Answer:
Hint: The act of updating the counter must once again be blazingly fast, but the counter
is read out each time that the counter is increased. However, the value read out need not
be accurate except that it absolutely must distinguish perfectly between values between
the limit and zero on the one hand, and values that either are less than or equal to zero
or are greater than or equal to the limit on the other hand. See Section 5.4. ❑
Answer:
Hint: Yet again, the act of updating the counter must be blazingly fast and scalable
in order to avoid slowing down I/O operations, but because the counter is read out
only when the user wishes to remove the device, the counter read-out operation can be
extremely slow. Furthermore, there is no need to be able to read out the counter at all
unless the user has already indicated a desire to remove the device. In addition, the
value read out need not be accurate except that it absolutely must distinguish perfectly
between non-zero and zero values, and even then only when the device is in the process
of being removed. However, once it has read out a zero value, it must act to keep the
value at zero until it has taken some action to prevent subsequent threads from gaining
access to the device being removed. See Section 5.4.6. ❑
Answer:
See Section 4.3.4.1 on page 63 for more information on how the compiler can cause
trouble, as well as how READ_ONCE() and WRITE_ONCE() can avoid this trouble. ❑
Answer:
Although the ++ operator could be atomic, there is no requirement that it be so unless it
is applied to a C11 _Atomic variable. And indeed, in the absence of _Atomic, GCC
often chooses to load the value to a register, increment the register, then store the value
to memory, which is decidedly non-atomic.
Furthermore, note the volatile casts in READ_ONCE() and WRITE_ONCE(), which
tell the compiler that the location might well be an MMIO device register. Because
MMIO registers are not cached, it would be unwise for the compiler to assume that the
increment operation is atomic. ❑
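For contrast, the following sketch shows an increment that is required to be atomic, using C11's <stdatomic.h>. The function and variable names and the iteration count are assumptions made for this illustration.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>

#define NINCS 100000

static atomic_ulong atomic_ctr;

static void *incrementer(void *arg)
{
	(void)arg;
	for (int i = 0; i < NINCS; i++)
		atomic_fetch_add(&atomic_ctr, 1);	/* atomic read-modify-write */
	return NULL;
}

/* Have two threads each increment the counter NINCS times. */
static unsigned long count_with_two_threads(void)
{
	pthread_t tid[2];

	atomic_store(&atomic_ctr, 0);
	for (int i = 0; i < 2; i++)
		if (pthread_create(&tid[i], NULL, incrementer, NULL) != 0)
			abort();
	for (int i = 0; i < 2; i++)
		pthread_join(tid[i], NULL);
	return atomic_load(&atomic_ctr);
}
```

With the atomic type, no increments are lost and the final count is exactly 2 * NINCS. Replacing atomic_ulong with a plain unsigned long and atomic_fetch_add() with ++ would introduce a data race, after which the result could fall short.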
Answer:
Not only are there very few trivial parallel programs, but most days I am not so sure
that there are many trivial sequential programs, either.
No matter how small or simple the program, if you haven’t tested it, it does not work.
And even if you have tested it, Murphy’s Law says that there will be at least a few bugs
still lurking.
Furthermore, while proofs of correctness certainly do have their place, they never will
replace testing, including the counttorture.h test setup used here. After all, proofs
are only as good as the assumptions that they are based on. Finally, proofs can be every
bit as buggy as are programs! ❑
Answer:
Because of the overhead of the atomic operation. The dashed line on the x axis
represents the overhead of a single non-atomic increment. After all, an ideal algorithm
would not only scale linearly, it would also incur no performance penalty compared to
single-threaded code.
This level of idealism may seem severe, but if it is good enough for Linus Torvalds, it
is good enough for you. ❑
Answer:
In many cases, atomic increment will in fact be fast enough for you. In those cases,
you should by all means use atomic increment. That said, there are many real-world
situations where more elaborate counting algorithms are required. The canonical
example of such a situation is counting packets and bytes in highly optimized networking
stacks, where it is all too easy to find much of the execution time going into these sorts
of accounting tasks, especially on large multiprocessors.
In addition, as noted at the beginning of this chapter, counting provides an excellent
view of the issues encountered in shared-memory parallel programs. ❑
Answer:
It might well be possible to do this in some cases. However, there are a few complications:
1. If the value of the variable is required, then the thread will be forced to wait for the
operation to be shipped to the data, and then for the result to be shipped back.
2. If the atomic increment must be ordered with respect to prior and/or subsequent
operations, then the thread will be forced to wait for the operation to be shipped to
the data, and for an indication that the operation completed to be shipped back.
3. Shipping operations among CPUs will likely require more lines in the system
interconnect, which will consume more die area and more electrical power.
But what if neither of the first two conditions holds? Then you should think carefully
about the algorithms discussed in Section 5.2, which achieve near-ideal performance on
commodity hardware.
If either or both of the first two conditions hold, there is some hope for improved
hardware. One could imagine the hardware implementing a combining tree, so that the
increment requests from multiple CPUs are combined by the hardware into a single
addition when the combined request reaches the hardware. The hardware could also
apply an order to the requests, thus returning to each CPU the return value corresponding
to its particular atomic increment. This results in instruction latency that varies as
O (log 𝑁), where 𝑁 is the number of CPUs, as shown in Figure E.1. And CPUs with
this sort of hardware optimization started to appear in 2011.
[Figure E.1 (not reproduced): CPUs 4 through 7, each with its own cache, attached to interconnects.]
This is a great improvement over the O (𝑁) performance of current hardware shown in
Figure 5.2, and it is possible that hardware latencies might decrease further if innovations
such as three-dimensional fabrication prove practical. Nevertheless, we will see that in
some important special cases, software can do much better. ❑
Answer:
No, because modulo addition is still commutative and associative. At least as long as
you use unsigned integers. Recall that in the C standard, overflow of signed integers
results in undefined behavior, never mind the fact that machines that do anything other
than wrap on overflow are quite rare these days. Unfortunately, compilers frequently
carry out optimizations that assume that signed integers will not overflow, so if your
code allows signed integers to overflow, you can run into trouble even on modern
twos-complement hardware.
That said, one potential source of additional complexity arises when attempting to
gather (say) a 64-bit sum from 32-bit per-thread counters. Dealing with this added
complexity is left as an exercise for the reader, for whom some of the techniques
introduced later in this chapter could be quite helpful. ❑
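The modulo-arithmetic property can be illustrated with fixed-width unsigned counters. In this sketch, sum_counters() is a hypothetical stand-in for a read-side summation over an array of per-thread counters.

```c
#include <stdint.h>

/* Sum n per-thread counters using well-defined modulo-2^32 arithmetic. */
static uint32_t sum_counters(const uint32_t *ctr, int n)
{
	uint32_t sum = 0;

	for (int i = 0; i < n; i++)
		sum += ctr[i];	/* unsigned addition wraps on overflow */
	return sum;
}
```

For example, if two counters hold 0xfffffffe and 1, the sum is 0xffffffff. After each counter is incremented twice, the first has wrapped to 0 and the second holds 3, for a sum of 3, and the unsigned difference between the two sums is still the expected 4.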
Answer:
It can, and in this toy implementation, it does. But it is not that hard to come up with an
alternative implementation that permits an arbitrary number of threads, for example,
using C11’s _Thread_local facility, as shown in Section 5.2.3. ❑
Answer:
See Sections 4.3.4.1 and 15.3 for more information. One nasty optimization would be
to apply common subexpression elimination to successive calls to the read_count()
function, which might come as a surprise to code expecting changes in the values
returned from successive calls to that function. ❑
Answer:
The C standard specifies that the initial value of global variables is zero, unless they are
explicitly initialized, thus implicitly initializing all the instances of counter to zero.
Besides, in the common case where the user is interested only in differences between
consecutive reads from statistical counters, the initial value is irrelevant. ❑
Answer:
Indeed, this toy example does not support more than one counter. Modifying it so that it
can provide multiple counters is left as an exercise to the reader. ❑
Answer:
Let’s do worst-case analysis first, followed by a less conservative analysis.
In the worst case, the read operation completes immediately, but is then delayed for 𝛥
time units before returning, in which case the worst-case error is simply 𝑟 𝛥.
This worst-case behavior is rather unlikely, so let us instead consider the case where
the reads from each of the 𝑁 counters are spaced equally over the time period 𝛥. There
will be 𝑁 + 1 intervals of duration 𝛥/(𝑁 + 1) between the 𝑁 reads. The rate 𝑟 of increments
is expected to be spread evenly over the 𝑁 counters, for 𝑟/𝑁 increments per unit time for
each individual counter. The error due to the delay after the read from the last thread’s
counter will be given by 𝑟𝛥/(𝑁(𝑁 + 1)), the second-to-last thread’s counter by 2𝑟𝛥/(𝑁(𝑁 + 1)), the
third-to-last by 3𝑟𝛥/(𝑁(𝑁 + 1)), and so on. The total error is given by the sum of the errors due
to the reads from each thread’s counter, which is:

\[ \frac{r \Delta}{N (N + 1)} \sum_{i=1}^{N} i \tag{E.1} \]

Expressing the summation in closed form yields:

\[ \frac{r \Delta}{N (N + 1)} \cdot \frac{N (N + 1)}{2} \tag{E.2} \]

Canceling the common factor 𝑁(𝑁 + 1) yields:

\[ \frac{r \Delta}{2} \tag{E.3} \]

\[ r \left( \frac{\Delta}{2} + t \right) \tag{E.4} \]

\[ \frac{(1 - 2 f) \, r \Delta}{2} \tag{E.5} \]
All that aside, in most uses of statistical counters, the error in the value returned by
read_count() is irrelevant. This irrelevance is due to the fact that the time required for
read_count() to execute is normally extremely small compared to the time interval
between successive calls to read_count(). ❑
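The equally-spaced-reads analysis in this answer can be checked numerically. The following sketch sums the per-counter error terms 𝑖𝑟𝛥/(𝑁(𝑁 + 1)) directly, so that the result can be compared against the closed form 𝑟𝛥/2; the function name is hypothetical.

```c
/*
 * Sum the per-counter errors i * r * delta / (n * (n + 1)) for i = 1..n,
 * where r is the increment rate, delta the read duration, and n the
 * number of counters.
 */
static double spaced_read_error(double r, double delta, int n)
{
	double per_step = r * delta / ((double)n * (n + 1));
	double total = 0.0;

	for (int i = 1; i <= n; i++)
		total += i * per_step;
	return total;
}
```

For any positive number of counters, the result should agree with 𝑟𝛥/2 to within floating-point rounding, which confirms that the expected error under this model does not depend on 𝑁.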
Answer:
Why indeed?
To be fair, user-mode thread-local storage faces some challenges that the Linux kernel
gets to ignore. When a user-level thread exits, its per-thread variables all disappear,
which complicates the problem of per-thread-variable access, particularly before the
advent of user-level RCU (see Section 9.5). In contrast, in the Linux kernel, when a
CPU goes offline, that CPU’s per-CPU variables remain mapped and accessible.
Similarly, when a new user-level thread is created, its per-thread variables suddenly
come into existence. In contrast, in the Linux kernel, all per-CPU variables are mapped
and initialized at boot time, regardless of whether the corresponding CPU exists yet, or
indeed, whether the corresponding CPU will ever exist.
A key limitation that the Linux kernel imposes is a compile-time maximum bound
on the number of CPUs, namely, CONFIG_NR_CPUS, along with a typically tighter
boot-time bound of nr_cpu_ids. In contrast, in user space, there is not necessarily a
hard-coded upper limit on the number of threads.
Of course, both environments must handle dynamically loaded code (dynamic libraries
in user space, kernel modules in the Linux kernel), which increases the complexity of
per-thread variables.
These complications make it significantly harder for user-space environments to
provide access to other threads’ per-thread variables. Nevertheless, such access is highly
useful, and it is hoped that it will someday appear.
In the meantime, textbook examples such as this one can use arrays whose limits can
be easily adjusted by the user. Alternatively, such arrays can be dynamically allocated
and expanded as needed at runtime. Finally, variable-length data structures such as
linked lists can be used, as is done in the userspace RCU library [Des09b, DMS+ 12].
This last approach can also reduce false sharing in some cases. ❑
Answer:
This is a reasonable strategy. Checking for the performance difference is left as
an exercise for the reader. However, please keep in mind that the fastpath is not
read_count(), but rather inc_count(). ❑
Answer:
Remember, when a thread exits, its per-thread variables disappear. Therefore, if we
attempt to access a given thread’s per-thread variables after that thread exits, we will get
a segmentation fault. The lock coordinates summation and thread exit, preventing this
scenario.
Of course, we could instead read-acquire a reader-writer lock, but Chapter 9 will
introduce even lighter-weight mechanisms for implementing the required coordination.
Another approach would be to use an array instead of a per-thread variable, which, as
Alexey Roytman notes, would eliminate the tests against NULL. However, array accesses
are often slower than accesses to per-thread variables, and use of an array would imply
a fixed upper bound on the number of threads. Also, note that neither tests nor locks are
needed on the inc_count() fastpath. ❑
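The coordination described above can be sketched as follows. This is a minimal illustration rather than the book's listing: the MAX_THREADS bound, the caller-supplied slot index, and the use of C11 relaxed atomics in place of READ_ONCE() and WRITE_ONCE() are all assumptions made to keep the example self-contained.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

#define MAX_THREADS 128	/* assumed bound for this sketch */

static _Thread_local atomic_ulong counter;	/* per-thread count */
static atomic_ulong *counterp[MAX_THREADS];	/* NULL if slot has no thread */
static unsigned long finalcount;		/* counts from exited threads */
static pthread_mutex_t final_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Fastpath: no lock, no NULL test, no atomic read-modify-write. */
void inc_count(void)
{
	atomic_store_explicit(&counter,
		atomic_load_explicit(&counter, memory_order_relaxed) + 1,
		memory_order_relaxed);
}

/* Each thread calls this at startup with its own slot index. */
void count_register_thread(int idx)
{
	pthread_mutex_lock(&final_mutex);
	counterp[idx] = &counter;
	pthread_mutex_unlock(&final_mutex);
}

/* Each thread calls this just before exiting. */
void count_unregister_thread(int idx)
{
	pthread_mutex_lock(&final_mutex);
	finalcount += atomic_load_explicit(&counter, memory_order_relaxed);
	counterp[idx] = NULL;	/* readers will no longer dereference it */
	pthread_mutex_unlock(&final_mutex);
}

/* The lock keeps a thread from exiting in the middle of the summation. */
unsigned long read_count(void)
{
	unsigned long sum;
	int t;

	pthread_mutex_lock(&final_mutex);
	sum = finalcount;
	for (t = 0; t < MAX_THREADS; t++)
		if (counterp[t] != NULL)
			sum += atomic_load_explicit(counterp[t],
						    memory_order_relaxed);
	pthread_mutex_unlock(&final_mutex);
	return sum;
}
```

Because count_unregister_thread() folds the exiting thread's count into finalcount under the same lock that read_count() holds, a reader sees each thread's contribution exactly once, either through the pointer or through finalcount.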
Answer:
This lock could in fact be omitted, but better safe than sorry, especially given that this
function is executed only at thread startup, and is therefore not on any critical path. Now,
if we were testing on machines with thousands of CPUs, we might need to omit the lock,
but on machines with “only” a hundred or so CPUs, there is no need to get fancy. ❑
Answer:
Remember, the Linux kernel’s per-CPU variables are always accessible, even if the
corresponding CPU is offline—even if the corresponding CPU never existed and never
will exist.
One workaround is to ensure that each thread continues to exist until all threads
are finished, as shown in Listing E.1 (count_tstat.c). Analysis of this code is
left as an exercise to the reader. However, please note that it requires tweaks in
the counttorture.h counter-evaluation scheme. (Hint: See #ifndef KEEP_GCC_THREAD_LOCAL.)
Chapter 9 will introduce synchronization mechanisms that handle
this situation in a much more graceful manner. ❑
Answer:
Because one of the two threads only reads, and because the variable is aligned and
machine-sized, non-atomic instructions suffice. That said, the READ_ONCE() macro is
used to prevent compiler optimizations that might otherwise prevent the counter updates
from becoming visible to eventual().5
An older version of this algorithm did in fact use atomic instructions; kudos to
Ersoy Bayramoglu for pointing out that they are in fact unnecessary. However, note
that on a 32-bit system, the per-thread counter variables might need to be limited to
32 bits in order to sum them accurately, but with a 64-bit global_count variable to
avoid overflow. In this case, it is necessary to zero the per-thread counter variables
periodically in order to avoid overflow, which does require atomic instructions. It is
extremely important to note that this zeroing cannot be delayed too long or overflow of
the smaller per-thread variables will result. This approach therefore imposes real-time
requirements on the underlying system, and in turn must be used with extreme care.
In contrast, if all variables are the same size, overflow of any variable is harmless
because the eventual sum will be modulo the word size. ❑
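The harmlessness of same-size overflow can be checked with a few lines of C. This is only a sketch: the helper names sum_counters() and bump() are hypothetical, and unsigned long stands in for whatever common type the counters share.

```c
#include <limits.h>
#include <stddef.h>

/* Sum the per-thread counters; unsigned arithmetic wraps modulo 2^N. */
unsigned long sum_counters(const unsigned long *c, size_t n)
{
	unsigned long sum = 0;
	size_t i;

	for (i = 0; i < n; i++)
		sum += c[i];
	return sum;
}

/* Apply "incs" increments to a counter holding "start", wrapping freely. */
unsigned long bump(unsigned long start, unsigned long incs)
{
	return start + incs;
}
```

For example, a counter that starts at ULONG_MAX - 1 and is incremented five times wraps around to 3. Adding a second counter holding 7 yields 10, which is exactly what the un-wrapped total would be modulo the word size.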
Answer:
In this case, no. What will happen instead is that as the number of threads increases, the
estimate of the counter value returned by read_count() will become more inaccurate.
❑
Answer:
Yes. If this proves problematic, one fix is to provide multiple eventual() threads,
each covering its own subset of the other threads. In more extreme cases, a tree-like
hierarchy of eventual() threads might be required. ❑
Answer:
The thread executing eventual() consumes CPU time. As more of these eventually-
consistent counters are added, the resulting eventual() threads will eventually consume
all available CPUs. This implementation therefore suffers a different sort of scalability
limitation, with the scalability limit being in terms of the number of eventually consistent
counters rather than in terms of the number of threads or CPUs.
Of course, it is possible to make other tradeoffs. For example, a single thread could
be created to handle all eventually-consistent counters, which would limit the overhead
to a single CPU, but would result in increasing update-to-read latencies as the number
of counters increased. Alternatively, that single thread could track the update rates
of the counters, visiting the frequently-updated counters more frequently. In addition,
the number of threads handling the counters could be set to some fraction of the total
number of CPUs, and perhaps also adjusted at runtime. Finally, each counter could
specify its latency, and deadline-scheduling techniques could be used to provide the
required latencies to each counter.
There are no doubt many other tradeoffs that could be made. ❑
Answer:
A straightforward way to evaluate this estimate is to use the analysis derived in Quick
Quiz 5.17, but set 𝛥 to the interval between the beginnings of successive runs of the
eventual() thread. Handling the case where a given counter has multiple eventual()
threads is left as an exercise for the reader. ❑
Answer:
When counting packets, the counter is only incremented by the value one. On the other
hand, when counting bytes, the counter might be incremented by largish numbers.
Why does this matter? Because in the increment-by-one case, the value returned
will be exact in the sense that the counter must necessarily have taken on that value at
some point in time, even if it is impossible to say precisely when that point occurred.
In contrast, when counting bytes, two different threads might return values that are
inconsistent with any global ordering of operations.
To see this, suppose that thread 0 adds the value three to its counter, thread 1 adds the
value five to its counter, and threads 2 and 3 sum the counters. If the system is “weakly
ordered” or if the compiler uses aggressive optimizations, thread 2 might find the sum
to be three and thread 3 might find the sum to be five. The only possible global orders of
the sequence of values of the counter are 0,3,8 and 0,5,8, and neither order is consistent
with the results obtained.
If you missed this one, you are not alone. Michael Scott used this question to stump
Paul E. McKenney during Paul’s Ph.D. defense. ❑
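The argument above can be made mechanical. The following sketch, whose function names are hypothetical, asks whether two observed sums could both have been values of a single globally ordered counter to which thread 0 adds a and thread 1 adds b.

```c
#include <stdbool.h>
#include <stddef.h>

/* Return true if v appears somewhere in seq[0..n-1]. */
static bool in_seq(const unsigned long *seq, size_t n, unsigned long v)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (seq[i] == v)
			return true;
	return false;
}

/*
 * The counter starts at zero and receives one add of "a" and one add
 * of "b".  Check both possible global orders of the two additions.
 */
bool consistent(unsigned long a, unsigned long b,
		unsigned long seen2, unsigned long seen3)
{
	const unsigned long a_first[] = { 0, a, a + b };
	const unsigned long b_first[] = { 0, b, a + b };

	return (in_seq(a_first, 3, seen2) && in_seq(a_first, 3, seen3)) ||
	       (in_seq(b_first, 3, seen2) && in_seq(b_first, 3, seen3));
}
```

With a = 3 and b = 5, the observations 3 and 5 fail this check, matching the example in the text. With a = b = 1, both orders produce the same sequence 0,1,2, which is why the increment-by-one case cannot go wrong in this way.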
Answer:
One approach would be to maintain a global approximation to the value, similar to the
approach described in Section 5.2.4. Updaters would increment their per-thread variable,
but when it reached some predefined limit, atomically add it to a global variable, then
zero their per-thread variable. This would permit a tradeoff between average increment
overhead and accuracy of the value read out. In particular, it would allow sharp bounds
on the read-side inaccuracy.
Another approach makes use of the fact that readers often care only about certain
transitions in value, not in the exact value. This approach is examined in Section 5.3.
The reader is encouraged to think up and try out other approaches, for example, using
a combining tree. ❑
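The first approach above might look something like the following sketch. The names global_approx, local_cnt, and BATCH_LIMIT are assumptions chosen for illustration, as is the choice of relaxed C11 atomics.

```c
#include <stdatomic.h>

#define BATCH_LIMIT 4UL	/* assumed; larger is cheaper to update, looser to read */

static atomic_ulong global_approx;		/* readers use a single load */
static _Thread_local unsigned long local_cnt;	/* private to each thread */

/* Update side: usually a plain increment, occasionally one atomic add. */
void inc_count(void)
{
	if (++local_cnt >= BATCH_LIMIT) {
		atomic_fetch_add_explicit(&global_approx, local_cnt,
					  memory_order_relaxed);
		local_cnt = 0;
	}
}

/* Read side: one load, low by at most BATCH_LIMIT - 1 per updating thread. */
unsigned long read_count(void)
{
	return atomic_load_explicit(&global_approx, memory_order_relaxed);
}
```

Raising BATCH_LIMIT reduces how often updaters touch the shared variable, at the cost of a wider (but still sharply bounded) read-side error.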
Answer:
Because structures come in different sizes. Of course, a limit counter corresponding to a
specific size of structure might still be able to use inc_count() and dec_count(). ❑
Answer:
Two words. “Integer overflow.”
Try the formulation in Listing 5.8 with counter equal to 10 and delta equal to
ULONG_MAX. Then try it again with the code shown in Listing 5.7.
A good understanding of integer overflow will be required for the rest of this example,
so if you have never dealt with integer overflow before, please try several examples to
get the hang of it. Integer overflow can sometimes be more difficult to get right than
parallel algorithms! ❑
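For readers who would like to try the experiment without the listings at hand, here is a sketch. It assumes that the two formulations differ in that one adds delta to counter before comparing against countermax, while the other subtracts counter from countermax before comparing against delta; the function names are hypothetical.

```c
#include <limits.h>
#include <stdbool.h>

/* Additive form: counter + delta can wrap around to a small value. */
bool fits_additive(unsigned long counter, unsigned long countermax,
		   unsigned long delta)
{
	return counter + delta <= countermax;
}

/* Subtractive form: cannot wrap as long as counter <= countermax. */
bool fits_subtractive(unsigned long counter, unsigned long countermax,
		      unsigned long delta)
{
	return countermax - counter >= delta;
}
```

With counter equal to 10, countermax equal to 100, and delta equal to ULONG_MAX, the additive form computes 10 + ULONG_MAX, which wraps to 9 and therefore wrongly reports that the request fits. The subtractive form compares 90 against ULONG_MAX and correctly rejects it.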
Answer:
The globalreserve variable tracks the sum of all threads’ countermax variables.
The sum of these threads’ counter variables might be anywhere from zero to
globalreserve. We must therefore take a conservative approach, assuming that
all threads’ counter variables are full in add_count() and that they are all empty in
sub_count().
But remember this question, as we will come back to it later. ❑
Answer:
Given that add_count() takes an unsigned long as its argument, it is going to be a
bit tough to pass it a negative number. And unless you have some anti-matter memory,
there is little point in allowing negative numbers when counting the number of structures
in use!
All kidding aside, it would of course be possible to combine add_count() and
sub_count(). However, the if conditions on the combined function would be more
complex than in the current pair of functions, which would in turn mean slower execution
of these fast paths. ❑
Answer:
First, it really is reserving countermax counts (see line 14); however, it adjusts so that
only half of these are actually in use by the thread at the moment. This allows the thread
to carry out at least countermax / 2 increments or decrements before having to refer
back to globalcount again.
Note that the accounting in globalcount remains accurate, thanks to the adjustment
in line 18. ❑
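The accounting can be followed in a single-threaded sketch such as the one below. The struct layout and the explicit nthreads argument are assumptions made so that the example stands alone; in the real code the caller would hold gblcnt_mutex and the per-thread fields would be per-thread variables.

```c
/* Global state, assumed to be protected by the caller's lock. */
struct limctr {
	unsigned long globalcountmax;
	unsigned long globalcount;
	unsigned long globalreserve;
};

/* One thread's share. */
struct thr {
	unsigned long counter;
	unsigned long countermax;
};

void balance_count(struct limctr *g, struct thr *t, unsigned long nthreads)
{
	/* Reserve this thread's share of the remaining headroom. */
	t->countermax = (g->globalcountmax - g->globalcount -
			 g->globalreserve) / nthreads;
	g->globalreserve += t->countermax;

	/* Start half full, leaving room to both increment and decrement. */
	t->counter = t->countermax / 2;
	if (t->counter > g->globalcount)
		t->counter = g->globalcount;

	/* Move the counts from the global to the thread: total is unchanged. */
	g->globalcount -= t->counter;
}
```

Starting from a limit of 100, a global count of 50, and five threads, the thread reserves 10 counts and takes 5 of them, leaving the global count at 45. The sum of the global and per-thread counts is still 50.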
Answer:
This might well be possible, but great care is required. Note that removing counter
without first zeroing countermax could result in the corresponding thread increasing
counter immediately after it was zeroed, completely negating the effect of zeroing the
counter.
The opposite ordering, namely zeroing countermax and then removing counter,
can also result in a non-zero counter. To see this, consider the following sequence of
events:
4. Thread A, having found that its countermax is non-zero, proceeds to add to its
counter, resulting in a non-zero value for counter.
Answer:
It assumes eight bits per byte. This assumption does hold for all current commodity
microprocessors that can be easily assembled into shared-memory multiprocessors, but
certainly does not hold for all computer systems that have ever run C code. (What could
you do instead in order to comply with the C standard? What drawbacks would it have?)
❑
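One candidate answer to the parenthetical question is to use CHAR_BIT from <limits.h>, as in the sketch below. The use of unsigned int rather than the book's atomic_t, and the function names, are assumptions made for illustration.

```c
#include <limits.h>

/* Half the bits of an unsigned int, without assuming eight-bit bytes. */
#define CM_BITS		(sizeof(unsigned int) * CHAR_BIT / 2)
#define MAX_COUNTERMAX	((1U << CM_BITS) - 1)

/* Pack counter into the upper half and countermax into the lower half. */
unsigned int merge_counterandmax(unsigned int c, unsigned int cm)
{
	return (c << CM_BITS) | cm;
}

/* Unpack a combined value into its two halves. */
void split_counterandmax_int(unsigned int cami,
			     unsigned int *c, unsigned int *cm)
{
	*c = (cami >> CM_BITS) & MAX_COUNTERMAX;
	*cm = cami & MAX_COUNTERMAX;
}
```

One possible drawback is that the C standard permits padding bits in integer types other than unsigned char, so sizeof() times CHAR_BIT can in principle overstate the number of usable bits.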
Answer:
There is only one counterandmax variable per thread. Later, we will see code that
needs to pass other threads’ counterandmax variables to split_counterandmax().
❑
Answer:
Later, we will see that we need the int return to pass to the atomic_cmpxchg()
primitive. ❑
Answer:
Replacing the goto with a break would require keeping a flag to determine whether
or not line 15 should return, which is not the sort of thing you want on a fastpath. If
you really hate the goto that much, your best bet would be to pull the fastpath into a
separate function that returned success or failure, with “failure” indicating a need for
the slowpath. This is left as an exercise for goto-hating readers. ❑
Answer:
Later, we will see how the flush_local_count() function in Listing 5.15 might
update this thread’s counterandmax variable concurrently with the execution of the
fastpath on lines 8–14 of Listing 5.13. ❑
Answer:
This other thread cannot refill its counterandmax until the caller of
flush_local_count() releases the gblcnt_mutex. By that time, the caller of
flush_local_count() will have finished making use of the counts, so there will be no problem with
this other thread refilling—assuming that the value of globalcount is large enough to
permit a refill. ❑
Answer:
Nothing. Consider the following three cases:
Answer:
The callers of both balance_count() and flush_local_count() hold gblcnt_mutex,
so only one may be executing at a given time. ❑
Answer:
No. If the signal handler is migrated to another CPU, then the interrupted thread is also
migrated along with it. ❑
Answer:
To indicate that only the fastpath is permitted to change the theft state, and that if the
thread remains in this state for too long, the thread running the slowpath will resend the
POSIX signal. ❑
Answer:
Reasons why collapsing the REQ and ACK states would be a very bad idea include:
1. The slowpath uses the REQ and ACK states to determine whether the signal should
be retransmitted. If the states were collapsed, the slowpath would have no choice
but to send redundant signals, which would have the unhelpful effect of needlessly
slowing down the fastpath.
(e) The fastpath sets the state to READY, disabling further fastpath execution for
this thread.
The basic problem here is that the combined REQACK state can be referenced by
both the signal handler and the fastpath. The clear separation maintained by the
four-state setup ensures orderly state transitions.
That said, you might well be able to make a three-state setup work correctly. If you
do succeed, compare carefully to the four-state setup. Is the three-state solution really
preferable, and why or why not? ❑
Answer:
No, that smp_store_release() suffices because this code communicates only with
flush_local_count(), and there is no need for store-to-load ordering. ❑
Answer:
Because the other thread is not permitted to change the value of its countermax variable
unless it holds the gblcnt_mutex lock. But the caller has acquired this lock, so it is not
possible for the other thread to hold it, and therefore the other thread is not permitted to
change its countermax variable. We can therefore safely access it—but not change it.
❑
Answer:
There is no need for an additional check. The caller of flush_local_count() has
already invoked globalize_count(), so the check on line 25 will have succeeded,
skipping the later pthread_kill(). ❑
Answer:
The theft variable must be of type sig_atomic_t to guarantee that it can be safely
shared between the signal handler and the code interrupted by the signal. ❑
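A stripped-down illustration of such sharing appears below. It is not the book's handler: the state names are reduced to three, the helper functions are hypothetical, and signal() is used only to keep the sketch short.

```c
#include <signal.h>

#define THEFT_IDLE	0
#define THEFT_REQ	1
#define THEFT_ACK	2

/* Shared between the handler and the code that the signal interrupts. */
static volatile sig_atomic_t theft = THEFT_IDLE;

/* Handler: acknowledge a pending request, otherwise do nothing. */
static void theft_handler(int unused)
{
	(void)unused;
	if (theft == THEFT_REQ)
		theft = THEFT_ACK;
}

/* Returns 0 on success, -1 if the handler could not be installed. */
int install_handler(void)
{
	return signal(SIGUSR1, theft_handler) == SIG_ERR ? -1 : 0;
}

void request_theft(void)
{
	theft = THEFT_REQ;
}

int theft_state(void)
{
	return theft;
}
```

Reads and writes of a volatile sig_atomic_t object are the accesses that the C standard guarantees to be safe from within a signal handler, which is what makes this handshake well defined.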
Answer:
Because many operating systems over several decades have had the property of losing
the occasional signal. Whether this is a feature or a bug is debatable, but irrelevant. The
obvious symptom from the user’s viewpoint will not be a kernel bug, but rather a user
application hanging.
Your user application hanging! ❑
Answer:
One approach is to use the techniques shown in Section 5.2.4, summarizing an
approximation to the overall counter value in a single variable. Another approach would
be to use multiple threads to carry out the reads, with each such thread interacting with
a specific subset of the updating threads. ❑
Answer:
One simple solution is to overstate the upper limit by the desired amount. The limiting
case of such overstatement results in the upper limit being set to the largest value that
the counter is capable of representing. ❑
Answer:
You had better have set the upper limit to be large enough to accommodate the bias, the
expected maximum number of accesses, and enough “slop” to allow the counter to work
efficiently even when the number of accesses is at its maximum. ❑
Answer:
Strange, perhaps, but true! Almost enough to make you think that the name “reader-writer
lock” was poorly chosen, isn’t it? ❑
Answer:
A huge number!
Here are a few to start with:
1. There could be any number of devices, so that the global variables are inappropriate,
as are the lack of arguments to functions like do_io().
2. Polling loops can be problematic in real systems, wasting CPU time and energy.
In many cases, an event-driven design is far better, for example, where the last
completing I/O wakes up the device-removal thread.
3. The I/O might fail, and so do_io() will likely need a return value.
4. If the device fails, the last I/O might never complete. In such cases, there might
need to be some sort of timeout to allow error recovery.
5. Both add_count() and sub_count() can fail, but their return values are not
checked.
6. Reader-writer locks do not scale well. One way of avoiding the high read-acquisition
costs of reader-writer locks is presented in Chapters 7 and 9. ❑
Answer:
The read-side code must scan the entire fixed-size array, regardless of the number of
threads, so there is no difference in performance. In contrast, in the last two algorithms,
readers must do more work when there are more threads. In addition, the last two
algorithms interpose an additional level of indirection because they map from integer
thread ID to the corresponding _Thread_local variable. ❑
Answer:
“Use the right tool for the job.”
As can be seen from Figure 5.1, single-variable atomic increment need not apply for
any job involving heavy use of parallel updates. In contrast, the algorithms shown in
the top half of Table 5.1 do an excellent job of handling update-heavy situations. Of
course, if you have a read-mostly situation, you should use something else, for example,
an eventually consistent design featuring a single atomically incremented variable that
can be read out using a single load, similar to the approach used in Section 5.2.4. ❑
Answer:
That depends on the workload. Note that on a 64-core system, you need more than one
hundred non-atomic operations (with roughly a 40-nanosecond performance gain) to
make up for even one signal (with almost a 5-microsecond performance loss). Although
there is no shortage of workloads with far greater read intensity, you will need to
consider your particular workload.
In addition, although memory barriers have historically been expensive compared
to ordinary instructions, you should check this on the specific hardware you will be
running. The properties of computer hardware do change over time, and algorithms
must change accordingly. ❑
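The break-even point quoted at the start of this answer follows from dividing the two figures given there, rounding "almost 5 microseconds" to 5 µs and "roughly 40 nanoseconds" to 40 ns:

```latex
\[
  \frac{5\,\mu\mathrm{s}}{40\,\mathrm{ns}}
  = \frac{5000\,\mathrm{ns}}{40\,\mathrm{ns}}
  = 125
\]
```

On these rounded figures, each signal costs about as much as 125 fastpath operations save, which is consistent with the "more than one hundred" stated above.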
Answer:
One approach is to give up some update-side performance, as is done with scalable
non-zero indicators (SNZI) [ELLM07]. There are a number of other ways one might
go about this, and these are left as exercises for the reader. Any number of approaches
that apply hierarchy, which replace frequent global-lock acquisitions with local lock
acquisitions corresponding to lower levels of the hierarchy, should work quite well. ❑
Answer:
In the C++ language, you might well be able to use ++ on a 1,000-digit number, assuming
that you had access to a class implementing such numbers. But as of 2021, the C
language does not permit operator overloading. ❑
Answer:
Indeed, multiple processes with separate address spaces can be an excellent way to
exploit parallelism, as the proponents of the fork-join methodology and the Erlang
language would be very quick to tell you. However, there are also some advantages to
shared-memory parallelism:
[Figure E.2: philosophers P1 through P5 seated around the table]
2. Although cache misses are quite slow compared to individual register-to-register
instructions, they are typically considerably faster than inter-process-communication
primitives, which in turn are considerably faster than things like TCP/IP networking.
3. Shared-memory multiprocessors are readily available and quite inexpensive, so, in
stark contrast to the 1990s, there is little cost penalty for use of shared-memory
parallelism.
As always, use the right tool for the job! ❑
Answer:
One such improved solution is shown in Figure E.2, where the philosophers are simply
provided with an additional five forks. All five philosophers may now eat simultaneously,
and there is never any need for philosophers to wait on one another. In addition, this
approach offers greatly improved disease control.
This solution might seem like cheating to some, but such “cheating” is key to finding
good solutions to many concurrency problems, as any hungry philosopher would agree.
And this is one solution to the Dining Philosophers concurrent-consumption problem
called out on page 113. ❑
First, for algorithms in which picking up left-hand and right-hand forks are separate
operations, start with all forks on the table. Then have all philosophers attempt to pick
up their first fork. Once all philosophers either have their first fork or are waiting for
someone to put down their first fork, have each non-waiting philosopher pick up their
second fork. At this point in any starvation-free solution, at least one philosopher will
be eating. If there were any waiting philosophers, repeat this test, preferably imposing
random variations in timing.
Second, create a stress test in which philosophers start and stop eating at random times.
Generate starvation and fairness conditions and verify that these conditions are met.
Here are a couple of example starvation and fairness conditions:
1. If all other philosophers have stopped eating 𝑁 times since a given philosopher
attempted to pick up a given fork, that philosopher should have succeeded in picking
up that fork. For high-quality solutions using high-quality locking primitives (or
high-quality atomic operations), 𝑁 = 1 is doable.
2. Given an upper bound 𝑇 on the time any philosopher holds onto both forks
before putting them down, the maximum waiting time for any philosopher should
be bounded by 𝑁𝑇 for some 𝑁 that is not hugely larger than the number of
philosophers.
3. Generate some statistic representing the time from when philosophers attempt to
pick up their first fork to the time when they start eating. The smaller this statistic,
the better the solution. Mean, median, and maximum are all useful statistics, but
examining the full distribution can also be enlightening.
Readers are encouraged to actually try testing any of the solutions presented in this
book, and especially testing solutions of their own devising. ❑
Answer:
Inman was working with protocol stacks, which are normally depicted vertically, with
the application on top and the hardware interconnect on the bottom. Data flows up and
down this stack. “Horizontal parallelism” processes packets from different network
connections in parallel, while “vertical parallelism” handles different protocol-processing
steps for a given packet in parallel.
“Vertical parallelism” is also called “pipelining”. ❑
Answer:
In this case, simply dequeue an item from the non-empty queue, release both locks, and
return. ❑
Answer:
The best way to answer this is to run lockhdeq.c on a number of different multiprocessor
systems, and you are encouraged to do so in the strongest possible terms. One reason
for concern is that each operation on this implementation must acquire not one but two
locks.
The first well-designed performance study will be cited.6 Do not forget to compare
to a sequential implementation! ❑
Answer:
It is optimal in the case where data flow switches direction only rarely. It would of
course be an extremely poor choice if the double-ended queue was being emptied from
both ends concurrently. This of course raises another question, namely, in what possible
universe emptying from both ends concurrently would be a reasonable thing to do.
Work-stealing queues are one possible answer to this question. ❑
Answer:
The need to avoid deadlock by imposing a lock hierarchy forces the asymmetry, just
as it does in the fork-numbering solution to the Dining Philosophers Problem (see
Section 6.1.1). ❑
Answer:
This retry is necessary because some other thread might have enqueued an element
between the time that this thread dropped d->rlock on line 25 and the time that it
reacquired this same lock on line 27. ❑
6 The studies by Dalessandro et al. [DCW+ 11] and Dice et al. [DLM+ 10] are excellent
starting points.
Answer:
It would be possible to use spin_trylock() to attempt to acquire the left-hand lock
when it was available. However, the failure case would still need to drop the right-hand
lock and then re-acquire the two locks in order. Making this transformation (and
determining whether or not it is worthwhile) is left as an exercise for the reader. ❑
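For readers who want a starting point for this exercise, the lock-acquisition portion might be sketched as follows. This sketch substitutes pthread_mutex_trylock() for spin_trylock(), and the function name is hypothetical.

```c
#include <pthread.h>

/*
 * Called with the right-hand lock held; returns with both locks held.
 * The lock hierarchy is left-hand lock first, then right-hand lock.
 * Returns 1 if the trylock succeeded and the right-hand lock was held
 * throughout, or 0 if the right-hand lock had to be dropped and
 * re-acquired.
 */
int acquire_left_holding_right(pthread_mutex_t *llock, pthread_mutex_t *rlock)
{
	if (pthread_mutex_trylock(llock) == 0)
		return 1;		/* got it without blocking */

	/* Failure case: fall back to acquiring in hierarchy order. */
	pthread_mutex_unlock(rlock);
	pthread_mutex_lock(llock);
	pthread_mutex_lock(rlock);
	return 0;
}
```

A return value of zero tells the caller that the right-hand lock was released for a time, so that any state examined under that lock must be rechecked. Trylock cannot deadlock because it never blocks, which is why it may be used against the hierarchy.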
Answer:
Indeed it does!
But the same is true of other algorithms claiming this property. For example, in
solutions using software transactional memory mechanisms based on hashed arrays of
locks, the leftmost and rightmost elements’ addresses will sometimes happen to hash to
the same lock. These hash collisions will also prevent concurrent access. For another
example, solutions using hardware transactional memory mechanisms with software
fallbacks [YHLR13, Mer11, JSG12] often use locking within those software fallbacks,
and thus suffer (albeit hopefully rarely) from whatever concurrency limitations
these locking solutions have.
Therefore, as of 2021, all practical solutions to the concurrent double-ended queue
problem fail to provide full concurrency in at least some circumstances, including the
compound double-ended queue. ❑
Answer:
There are actually at least three. The third, by Dominik Dingel, makes interesting use of
reader-writer locking, and may be found in lockrwdeq.c.
And so there is not one, but rather three solutions to the lock-based double-ended
queue problem on page 113! ❑
Answer:
The hashed double-ended queue’s locking design only permits one thread at a time at
each end, and further requires two lock acquisitions for each operation. The tandem
double-ended queue also permits one thread at a time at each end, and in the common case
requires only one lock acquisition per operation. Therefore, the tandem double-ended
queue should be expected to outperform the hashed double-ended queue.
Can you create a double-ended queue that allows multiple concurrent operations at
each end? If so, how? If not, why not? ❑
Answer:
One approach is to transform the problem to be solved so that multiple double-ended
queues can be used in parallel, allowing the simpler single-lock double-ended queue to
be used, and perhaps also replace each double-ended queue with a pair of conventional
single-ended queues. Without such “horizontal scaling”, the speedup is limited to 2.0. In
contrast, horizontal-scaling designs can achieve very large speedups, and are especially
attractive if there are multiple threads working either end of the queue, because in this
multiple-thread case the dequeue simply cannot provide strong ordering guarantees.
After all, the fact that a given thread removed an item first in no way implies that it will
process that item first [HKLP12]. And if there are no guarantees, we may as well obtain
the performance benefits that come with refusing to provide these guarantees.
Regardless of whether or not the problem can be transformed to use multiple queues, it
is worth asking whether work can be batched so that each enqueue and dequeue operation
corresponds to larger units of work. This batching approach decreases contention on
the queue data structures, which increases both performance and scalability, as will be
seen in Section 6.3. After all, if you must incur high synchronization overheads, be sure
you are getting your money’s worth.
Other researchers are working on other ways to take advantage of limited ordering
guarantees in queues [KLP12]. ❑
Answer:
Although non-blocking synchronization can be very useful in some situations, it is
no panacea, as discussed in Section 14.2. Also, non-blocking synchronization really
does have critical sections, as noted by Josh Triplett. For example, in a non-blocking
algorithm based on compare-and-swap operations, the code starting at the initial load
and continuing to the compare-and-swap is analogous to a lock-based critical section. ❑
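The analogy can be seen in a small compare-and-swap loop such as the following sketch, which uses C11 atomics. The function name and the bounded-add operation are hypothetical choices made for illustration.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Add delta to *v, but only if the result would not exceed limit. */
bool add_unless_over(atomic_ulong *v, unsigned long delta,
		     unsigned long limit)
{
	unsigned long old = atomic_load(v);	/* "critical section" begins */
	unsigned long new;

	do {
		if (delta > limit || old > limit - delta)
			return false;		/* would exceed the limit */
		new = old + delta;
		/* On failure, old is refreshed and the section is re-run. */
	} while (!atomic_compare_exchange_weak(v, &old, new));
	return true;				/* "critical section" ended */
}
```

Everything between the initial load and the successful compare-and-swap takes effect as if atomically, much as if it had run under a lock. The difference is that interference causes a retry instead of a wait.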
Answer:
Quite a bit, actually.
See Section 10.3.2 for a good starting point. ❑
Answer:
Perhaps so.
But in the next section we will be partitioning space (that is, address space) as well
as time. This nomenclature will permit us to partition spacetime, as opposed to (say)
partitioning space but segmenting time. ❑
Answer:
Here are a few possible solutions to this existence guarantee problem:
1. Provide a statically allocated lock that is held while the per-structure lock is being
acquired, which is an example of hierarchical locking (see Section 6.4.2). Of
course, using a single global lock for this purpose can result in unacceptably high
levels of lock contention, dramatically reducing performance and scalability.
6. Use transactional memory (TM) [HM93, Lom77, ST95], so that each reference and
modification to the data structure in question is performed atomically. Although TM
has engendered much excitement in recent years, and seems likely to be of some use
in production software, developers should exercise some caution [BLM05, BLM06,
MMW07], particularly in performance-critical code. In particular, existence
guarantees require that the transaction covers the full path from a global reference
to the data elements being updated. For more on TM, including ways to overcome
some of its weaknesses by combining it with other synchronization mechanisms,
see Sections 17.2 and 17.3.
Answer:
You can indeed think in these terms.
And if you are working on a persistent data store where state survives shutdown,
thinking in these terms might even be useful. ❑
Answer:
The matmul.c program creates the specified number of worker threads, so even the
single-worker-thread case incurs thread-creation overhead. Making the changes required
to optimize away thread-creation overhead in the single-worker-thread case is left as an
exercise to the reader. ❑
Answer:
I am glad that you are paying attention! This example serves to show that although
data parallelism can be a very good thing, it is not some magic wand that automatically
wards off any and all sources of inefficiency. Linear scaling at full performance, even to
“only” 64 threads, requires care at all phases of design and implementation.
In particular, you need to pay careful attention to the size of the partitions. For
example, if you split a 64-by-64 matrix multiply across 64 threads, each thread gets
only 64 floating-point multiplies. The cost of a floating-point multiply is minuscule
compared to the overhead of thread creation, and cache-miss overhead also plays a role
in spoiling the theoretically perfect scalability (and also in making the traces so jagged).
The full 448 hardware threads would require a matrix with hundreds of thousands of
rows and columns to attain good scalability, but by that point GPGPUs become quite
attractive, especially from a price/performance viewpoint.
Moral: If you have a parallel program with variable input, always include a check for
the input size being too small to be worth parallelizing. And when it is not helpful to
parallelize, it is not helpful to incur the overhead required to spawn a thread, now is it?
❑
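The moral above might be applied with a check along the following lines. The threshold MIN_WORK_PER_THREAD and the function name are assumptions; a sensible threshold can only come from measurement on the system in question.

```c
/* Assumed threshold: work units needed to repay one thread creation. */
#define MIN_WORK_PER_THREAD 100000UL

/*
 * Choose how many threads should share "work_units" of work, given a
 * ceiling of "max_threads".  A result of 1 means "run the work in the
 * calling thread and spawn nothing".
 */
unsigned long choose_nthreads(unsigned long work_units,
			      unsigned long max_threads)
{
	unsigned long t = work_units / MIN_WORK_PER_THREAD;

	if (t > max_threads)
		t = max_threads;
	if (t < 1)
		t = 1;
	return t;
}
```

A caller would then create threads only when the result exceeds one, so that small inputs pay no thread-creation overhead at all.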
Answer:
For this simple approach, very little.
However, the validation of production-quality matrix multiply requires great care
and attention. Some cases require careful handling of floating-point rounding errors,
others involve complex sparse-matrix data structures, and still others make use of
special-purpose arithmetic hardware such as vector units or GPGPUs. Adequate tests
for handling of floating-point rounding errors can be especially challenging. ❑
Answer:
If the comparison on line 31 of Listing 6.8 were replaced by a much heavier-weight
operation, then releasing bp->bucket_lock might reduce lock contention enough to
outweigh the overhead of the extra acquisition and release of cur->node_lock. ❑
Answer:
Indeed it does! We are used to thinking of allocating and freeing memory, but the
algorithms in Section 5.3 are taking very similar actions to allocate and free “count”. ❑
Answer:
This is due to the per-CPU target value being three. A run length of 12 must acquire the
global-pool lock twice, while a run length of 13 must acquire the global-pool lock three
times. ❑
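[Figure E.3: global pool above pairs of per-thread pool and per-thread allocation boxes; the per-thread allocations are 0, 0, 𝑚, and 𝑚]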
Answer:
This solution is adapted from one put forward by Alexey Roytman. It is based on the
following definitions:
The values 𝑔, 𝑚, and 𝑛 are given. The value for 𝑝 is 𝑚 rounded up to the next multiple
of 𝑠, as follows:
p = s \left\lceil \frac{m}{s} \right\rceil    (E.6)
The value for 𝑖 is as follows:
i = \begin{cases} 2s & \text{if } g \bmod 2s = 0 \\ g \bmod 2s & \text{if } g \bmod 2s \neq 0 \end{cases}    (E.7)
The relationships between these quantities are shown in Figure E.3. The global pool
is shown on the top of this figure, and the “extra” initializer thread’s per-thread pool
and per-thread allocations are the left-most pair of boxes. The initializer thread has no
blocks allocated, but has 𝑖 blocks stranded in its per-thread pool. The rightmost two
pairs of boxes are the per-thread pools and per-thread allocations of threads holding
the maximum possible number of blocks, while the second-from-left pair of boxes
represents the thread currently trying to allocate.
The total number of blocks is 𝑔, and adding up the per-thread allocations and per-
thread pools, we see that the global pool contains 𝑔 −𝑖 − 𝑝(𝑛 − 1) blocks. If the allocating
thread is to be successful, it needs at least 𝑚 blocks in the global pool, in other words:
𝑔 − 𝑖 − 𝑝(𝑛 − 1) ≥ 𝑚 (E.8)
The question has 𝑔 = 40, 𝑠 = 3, and 𝑛 = 2. Equation E.7 gives 𝑖 = 4, and Eq. E.6
gives 𝑝 = 18 for 𝑚 = 18 and 𝑝 = 21 for 𝑚 = 19. Plugging these into Eq. E.8 shows that
𝑚 = 18 will not overflow, but that 𝑚 = 19 might well do so.
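The arithmetic above can be checked mechanically. The following is a minimal sketch that evaluates Eqs. E.6, E.7, and E.8 directly; the function names are hypothetical and do not appear in the book's allocator code.

```c
#include <stdbool.h>

/* Eq. E.6: p is m rounded up to the next multiple of s. */
long per_thread_max(long m, long s)
{
	return s * ((m + s - 1) / s);
}

/* Eq. E.7: blocks stranded in the initializer thread's per-thread pool. */
long stranded_blocks(long g, long s)
{
	long r = g % (2 * s);

	return r == 0 ? 2 * s : r;
}

/* Eq. E.8: can the allocating thread still find m blocks in the global pool? */
bool allocation_succeeds(long g, long s, long n, long m)
{
	long i = stranded_blocks(g, s);
	long p = per_thread_max(m, s);

	return g - i - p * (n - 1) >= m;
}
```

With 𝑔 = 40, 𝑠 = 3, and 𝑛 = 2, this sketch reports success for 𝑚 = 18 and possible failure for 𝑚 = 19, matching the text.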
The presence of 𝑖 could be considered to be a bug. After all, why allocate memory
only to have it stranded in the initialization thread’s cache? One way of fixing this
would be to provide a memblock_flush() function that flushed the current thread’s
pool into the global pool. The initialization thread could then invoke this function after
freeing all of the blocks. ❑
Answer:
This is an excellent question that is left to a suitably interested and industrious reader. ❑
Answer:
There are indeed a great many ways to distribute the extra threads. Evaluation of
distribution strategies is left to a suitably interested and industrious reader. ❑
E.7 Locking
Answer:
Locking serves as a research-paper whipping boy because it is heavily used
in practice. In contrast, if no one used or cared about locking, most research papers
would not bother even mentioning it. ❑
Answer:
Suppose that there is no cycle in the graph. We would then have a directed acyclic graph
(DAG), which would have at least one leaf node.
If this leaf node was a lock, then we would have a thread that was waiting on a lock
that wasn’t held by any thread, counter to the definition. In this case the thread would
immediately acquire the lock.
On the other hand, if this leaf node was a thread, then we would have a thread that
was not waiting on any lock, again counter to the definition. And in this case, the thread
would either be running or be blocked on something that is not a lock. In the first case,
in the absence of infinite-loop bugs, the thread will eventually release the lock. In the
second case, in the absence of a failure-to-wake bug, the thread will eventually wake up
and release the lock.7
Therefore, given this definition of lock-based deadlock, there must be a cycle in the
corresponding graph. ❑
Answer:
Indeed there are! Here are a few of them:
1. If one of the library function’s arguments is a pointer to a lock that this library
function acquires, and if the library function holds one of its locks while acquiring
the caller’s lock, then we could have a deadlock cycle involving both caller and
library locks.
2. If one of the library functions returns a pointer to a lock that is acquired by the
caller, and if the caller acquires one of its locks while holding the library’s lock,
we could again have a deadlock cycle involving both caller and library locks.
3. If one of the library functions acquires a lock and then returns while still holding
it, and if the caller acquires one of its locks, we have yet another way to create a
deadlock cycle involving both caller and library locks.
4. If the caller has a signal handler that acquires locks, then the deadlock cycle can
involve both caller and library locks. In this case, however, the library’s locks are
innocent bystanders in the deadlock cycle. That said, please note that acquiring a
lock from within a signal handler is a no-no in many environments—it is not just a
bad idea, it is unsupported. But if you absolutely must acquire a lock in a signal
handler, be sure to block that signal while holding that same lock in thread context,
and also while holding any other locks acquired while that same lock is held. ❑
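The advice in the last item can be sketched as follows for POSIX threads. This is only an illustration under stated assumptions: handler_lock and both function names are hypothetical, and the signal handler that also acquires handler_lock is not shown. The sketch blocks the signal before acquiring the lock in thread context and restores the previous mask only after releasing it.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <signal.h>

/* Hypothetical lock that is also acquired by a handler for signo. */
static pthread_mutex_t handler_lock = PTHREAD_MUTEX_INITIALIZER;

/* Block signo, then acquire the lock. The prior mask is saved in *oldmask. */
int lock_with_signal_blocked(int signo, sigset_t *oldmask)
{
	sigset_t block;
	int ret;

	sigemptyset(&block);
	sigaddset(&block, signo);
	ret = pthread_sigmask(SIG_BLOCK, &block, oldmask);
	if (ret != 0)
		return ret;
	ret = pthread_mutex_lock(&handler_lock);
	if (ret != 0)
		pthread_sigmask(SIG_SETMASK, oldmask, NULL);
	return ret;
}

/* Release the lock first, and only then allow the signal again. */
int unlock_and_restore_signals(const sigset_t *oldmask)
{
	int ret = pthread_mutex_unlock(&handler_lock);

	pthread_sigmask(SIG_SETMASK, oldmask, NULL);
	return ret;
}
```

Any other lock acquired while handler_lock is held would need to be acquired and released between these two calls, so that the signal remains blocked for as long as that lock is held.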
Answer:
By privatizing the data elements being compared (as discussed in Chapter 8) or through
use of deferral mechanisms such as reference counting (as discussed in Chapter 9). Or
through use of layered locking hierarchies, as described in Section 7.1.1.3.
On the other hand, changing a key in a list that is currently being sorted is at best
rather brave. ❑
7 Of course, one type of failure-to-wake bug is a deadlock that involves not only locks,
but also non-lock resources. But the question really did say “lock-based deadlock”!
Answer:
There are at least two hazards in this situation.
One is indeed that the number of children may or may not be observed to have changed.
While that would be consistent with tree_add() being called either before or after
the iterator started, it is better not left to the vagaries of the compiler. A more serious
problem is that realloc() may not be able to extend the array in place, causing the
heap to free the one used by the iterator and replace it with another block of memory. If
the children pointer is not re-read then the iterating thread will access invalid memory
(either free or reclaimed). ❑
Answer:
It really does depend.
The scheduler locks are always held with interrupts disabled. Therefore, if call_
rcu() is invoked with interrupts enabled, no scheduler locks are held, and call_rcu()
can safely call into the scheduler. Otherwise, if interrupts are disabled, one of the
scheduler locks might be held, so call_rcu() must play it safe and refrain from calling
into the scheduler. ❑
Answer:
Locking primitives, of course! ❑
Answer:
Absolutely not!
Consider a program that acquires mutex_a, and then mutex_b, in that order, and
then passes mutex_a to pthread_cond_wait(). Now, pthread_cond_wait() will
release mutex_a, but will re-acquire it before returning. If some other thread acquires
mutex_a in the meantime and then blocks on mutex_b, the program will deadlock. ❑
Answer:
Absolutely not!
This transformation assumes that the layer_2_processing() function is idem-
potent, given that it might be executed multiple times on the same packet when the
layer_1() routing decision changes. Therefore, in real life, this transformation can
become arbitrarily complex. ❑
Answer:
Maybe.
If the routing decision in layer_1() changes often enough, the code will always
retry, never making forward progress. This is termed “livelock” if no thread makes any
forward progress or “starvation” if some threads make forward progress but others do
not (see Section 7.1.2). ❑
Answer:
Provide an additional global lock. If a given thread has repeatedly tried and failed to
acquire the needed locks, then have that thread unconditionally acquire the new global
lock, and then unconditionally acquire any needed locks. (Suggested by Doug Lea.) ❑
Answer:
Because this would lead to deadlock. Given that Lock A is sometimes held outside of a
signal handler without blocking signals, a signal might be handled while holding this
lock. The corresponding signal handler might then acquire Lock B, so that Lock B is
acquired while holding Lock A. Therefore, if we also acquire Lock A while holding
Lock B, we will have a deadlock cycle. Note that this problem exists even if signals are
blocked while holding Lock B.
This is another reason to be very careful with locks that are acquired within interrupt
or signal handlers. But the Linux kernel’s lock dependency checker knows about this
situation and many others as well, so please do make full use of it! ❑
Answer:
One of the simplest and fastest ways to do so is to use the sa_mask field of the
struct sigaction that you pass to sigaction() when setting up the signal. ❑
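As a minimal sketch of this use of sa_mask, the following installs a SIGUSR1 handler during which SIGUSR2 is blocked. The choice of signals, the handler, and the function name are assumptions made for illustration only.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>

static volatile sig_atomic_t usr1_count;

/* Runs with SIGUSR2 (and, by default, SIGUSR1 itself) blocked. */
static void usr1_handler(int signo)
{
	(void)signo;
	usr1_count++;
}

/* Install usr1_handler for SIGUSR1, blocking SIGUSR2 while it runs. */
int install_usr1_handler(void)
{
	struct sigaction sa;

	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = usr1_handler;
	sigemptyset(&sa.sa_mask);
	sigaddset(&sa.sa_mask, SIGUSR2);
	return sigaction(SIGUSR1, &sa, NULL);
}
```

The kernel applies the mask on handler entry and restores the previous mask on return, so the handler itself needs no explicit calls to change the signal mask.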
Answer:
Because these same rules apply to the interrupt handlers used in operating-system
kernels and in some embedded applications.
In many application environments, acquiring locks in signal handlers is frowned
upon [Ope97]. However, that does not stop clever developers from (perhaps unwisely)
fashioning home-brew locks out of atomic operations. And atomic operations are in
many cases perfectly legal in signal handlers. ❑
Answer:
There are a number of approaches:
1. In the case of parametric search via simulation, where a large number of simulations
will be run in order to converge on (for example) a good design for a mechanical or
electrical device, leave the simulation single-threaded, but run many instances of the
simulation in parallel. This retains the object-oriented design, and gains parallelism
at a higher level, and likely also avoids both deadlocks and synchronization overhead.
2. Partition the objects into groups such that there is no need to operate on objects in
more than one group at a given time. Then associate a lock with each group. This
is an example of a single-lock-at-a-time design, which is discussed in Section 7.1.1.8.
3. Partition the objects into groups such that threads can all operate on objects in the
groups in some groupwise ordering. Then associate a lock with each group, and
impose a locking hierarchy over the groups.
4. Impose an arbitrarily selected hierarchy on the locks, and then use conditional
locking if it is necessary to acquire a lock out of order, as was discussed in
Section 7.1.1.6.
5. Before carrying out a given group of operations, predict which locks will be
acquired, and attempt to acquire them before actually carrying out any updates. If
the prediction turns out to be incorrect, drop all the locks and retry with an updated
prediction that includes the benefit of experience. This approach was discussed in
Section 7.1.1.7.
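The third approach above, a lock per group plus a locking hierarchy over the groups, might look like the following sketch. The number of groups, the mapping from object ID to group, and all names are hypothetical. The hierarchy here is simply ascending group index.

```c
#include <pthread.h>

#define NGROUPS 4	/* Hypothetical number of object groups. */

static pthread_mutex_t group_lock[NGROUPS] = {
	PTHREAD_MUTEX_INITIALIZER, PTHREAD_MUTEX_INITIALIZER,
	PTHREAD_MUTEX_INITIALIZER, PTHREAD_MUTEX_INITIALIZER,
};

/* Hypothetical partitioning function mapping an object ID to its group. */
int group_of(unsigned long obj_id)
{
	return (int)(obj_id % NGROUPS);
}

/* Acquire the locks for two objects, always lowest-numbered group first. */
void lock_two_objects(unsigned long id1, unsigned long id2)
{
	int g1 = group_of(id1);
	int g2 = group_of(id2);
	int lo = g1 < g2 ? g1 : g2;
	int hi = g1 < g2 ? g2 : g1;

	pthread_mutex_lock(&group_lock[lo]);
	if (hi != lo)	/* Same group: one lock covers both objects. */
		pthread_mutex_lock(&group_lock[hi]);
}

/* Release in the reverse order of acquisition. */
void unlock_two_objects(unsigned long id1, unsigned long id2)
{
	int g1 = group_of(id1);
	int g2 = group_of(id2);
	int lo = g1 < g2 ? g1 : g2;
	int hi = g1 < g2 ? g2 : g1;

	if (hi != lo)
		pthread_mutex_unlock(&group_lock[hi]);
	pthread_mutex_unlock(&group_lock[lo]);
}
```

Because every thread acquires group locks in the same order, no cycle can form among them, regardless of the order in which callers name the two objects.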
Answer:
Listing 7.5 provides some good hints. In many cases, livelocks are a hint that you
should revisit your locking design. Or visit it in the first place if your locking design
“just grew”.
That said, one good-and-sufficient approach due to Doug Lea is to use conditional
locking as described in Section 7.1.1.6, but combine this with acquiring all needed
locks first, before modifying shared data, as described in Section 7.1.1.7. If a given
critical section retries too many times, unconditionally acquire a global lock, then
unconditionally acquire all the needed locks. This avoids both deadlock and livelock,
and scales reasonably assuming that the global lock need not be acquired too often. ❑
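A minimal sketch of this combination for the two-lock case appears below. The retry limit, the global_lock variable, and the function names are hypothetical, and the sketch assumes that every thread acquiring these locks follows the same protocol.

```c
#include <pthread.h>
#include <sched.h>

#define MAX_TRIES 8	/* Hypothetical retry limit before falling back. */

static pthread_mutex_t global_lock = PTHREAD_MUTEX_INITIALIZER;

/*
 * Acquire both locks before touching shared data.
 * Returns 1 if the global fallback lock was needed, 0 otherwise.
 */
int acquire_pair(pthread_mutex_t *a, pthread_mutex_t *b)
{
	int i;

	for (i = 0; i < MAX_TRIES; i++) {
		pthread_mutex_lock(a);
		if (pthread_mutex_trylock(b) == 0)
			return 0;	/* Conditional locking succeeded. */
		pthread_mutex_unlock(a);	/* Back off holding nothing. */
		sched_yield();
	}
	/* Too many retries: serialize on the global lock, then block. */
	pthread_mutex_lock(&global_lock);
	pthread_mutex_lock(a);
	pthread_mutex_lock(b);
	return 1;
}

/* Release both locks, and the global lock if it was acquired. */
void release_pair(pthread_mutex_t *a, pthread_mutex_t *b, int used_global)
{
	pthread_mutex_unlock(b);
	pthread_mutex_unlock(a);
	if (used_global)
		pthread_mutex_unlock(&global_lock);
}
```

At most one thread at a time blocks unconditionally while holding locks, and every other thread gives up its first lock whenever its trylock fails, so the global-lock holder is eventually able to proceed.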
Answer:
Here are a couple:
1. A one-second wait is way too long for most uses. Wait intervals should begin with
roughly the time required to execute the critical section, which will normally be in
the microsecond or millisecond range.
2. The code does not check for overflow. On the other hand, this bug is nullified by
the previous bug: 32 bits worth of seconds is more than 50 years. ❑
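Both issues can be addressed together, as in the following sketch of conditional locking with exponential backoff. The initial and maximum wait intervals are assumptions chosen for illustration, as are the function names. The wait starts at one microsecond and is capped at one millisecond, so the doubling can never overflow.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <time.h>

#define BACKOFF_MIN_NS 1000L	/* 1 us: assumed critical-section scale. */
#define BACKOFF_MAX_NS 1000000L	/* 1 ms: assumed upper bound on waits. */

/* Double the wait interval, saturating at BACKOFF_MAX_NS. */
long backoff_next(long ns)
{
	if (ns >= BACKOFF_MAX_NS / 2)
		return BACKOFF_MAX_NS;
	return ns * 2;
}

/* Acquire a, then conditionally acquire b, backing off on failure. */
void lock_pair_with_backoff(pthread_mutex_t *a, pthread_mutex_t *b)
{
	long wait_ns = BACKOFF_MIN_NS;

	for (;;) {
		struct timespec ts = { .tv_sec = 0, .tv_nsec = wait_ns };

		pthread_mutex_lock(a);
		if (pthread_mutex_trylock(b) == 0)
			return;
		pthread_mutex_unlock(a);
		nanosleep(&ts, NULL);
		wait_ns = backoff_next(wait_ns);
	}
}
```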
Answer:
It would be better in some sense, but there are situations where it can be appropriate to
use designs that sometimes result in high lock contention.
For example, imagine a system that is subject to a rare error condition. It might
well be best to have a simple error-handling design that has poor performance and
scalability for the duration of the rare error condition, as opposed to a complex and
difficult-to-debug design that is helpful only when one of those rare error conditions is
in effect.
That said, it is usually worth putting some effort into attempting to produce a design
that is both simple and efficient during error conditions, for example by partitioning
the problem. ❑
Answer:
If the data protected by the lock is in the same cache line as the lock itself, then attempts
by other CPUs to acquire the lock will result in expensive cache misses on the part of
the CPU holding the lock. This is a special case of false sharing, which can also occur if
a pair of variables protected by different locks happen to share a cache line. In contrast,
if the lock is in a different cache line than the data that it protects, the CPU holding the
lock will usually suffer a cache miss only on first access to a given variable.
Of course, the downside of placing the lock and data into separate cache lines is that
the code will incur two cache misses rather than only one in the uncontended case. As
always, choose wisely! ❑
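The two layouts can be expressed in C11 as shown in the sketch below. The 64-byte cache-line size is an assumption that holds on many but not all systems, and the structure names are hypothetical.

```c
#include <pthread.h>
#include <stdalign.h>
#include <stddef.h>

#define CACHE_LINE_SIZE 64	/* Assumption: check your hardware. */

/* Lock and data adjacent: they will often share a cache line. */
struct colocated_counter {
	pthread_mutex_t lock;
	unsigned long count;
};

/* Lock and data each start on their own cache-line boundary. */
struct split_counter {
	alignas(CACHE_LINE_SIZE) pthread_mutex_t lock;
	alignas(CACHE_LINE_SIZE) unsigned long count;
};
```

The split layout trades memory and an extra uncontended cache miss for the guarantee that spinning acquirers do not disturb the line holding count.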
Answer:
Empty lock-based critical sections are rarely used, but they do have their uses. The
point is that the semantics of exclusive locks have two components: (1) The familiar
data-protection semantic and (2) A messaging semantic, where releasing a given lock
notifies a waiting acquisition of that same lock. An empty critical section uses the
messaging component without the data-protection component.
The rest of this answer provides some example uses of empty critical sections.
However, these examples should be considered “gray magic.”8 As such, empty critical
sections are almost never used in practice. Nevertheless, pressing on into this gray area
...
One historical use of empty critical sections appeared in the networking stack of the
2.4 Linux kernel through use of a read-side-scalable reader-writer lock called brlock
for “big reader lock”. This use case is a way of approximating the semantics of read-copy
update (RCU), which is discussed in Section 9.5. And in fact this Linux-kernel use case
has been replaced with RCU.
The empty-lock-critical-section idiom can also be used to reduce lock contention in
some situations. For example, consider a multithreaded user-space application where
each thread processes units of work maintained in a per-thread list, where threads are
prohibited from touching each other’s lists [McK12e]. There could also be updates that
require that all previously scheduled units of work have completed before the update
can progress. One way to handle this is to schedule a unit of work on each thread, so
that when all of these units of work complete, the update may proceed.
In some applications, threads can come and go. For example, each thread might
correspond to one user of the application, and thus be removed when that user logs
out or otherwise disconnects. In many applications, threads cannot depart atomically:
They must instead explicitly unravel themselves from various portions of the application
using a specific sequence of actions. One specific action will be refusing to accept
further requests from other threads, and another specific action will be disposing of any
remaining units of work on its list, for example, by placing these units of work in a
global work-item-disposal list to be taken by one of the remaining threads. (Why not
just drain the thread’s work-item list by executing each item? Because a given work
item might generate more work items, so that the list could not be drained in a timely
fashion.)
If the application is to perform and scale well, a good locking design is required.
One common solution is to have a global lock (call it G) protecting the entire process
of departing (and perhaps other things as well), with finer-grained locks protecting the
individual unraveling operations.
Now, a departing thread must clearly refuse to accept further requests before disposing
of the work on its list, because otherwise additional work might arrive after the disposal
action, which would render that disposal action ineffective. So simplified pseudocode
for a departing thread might be as follows:
1. Acquire lock G.
2. Acquire the lock guarding communications.
3. Refuse further communications from other threads.
4. Release the lock guarding communications.
5. Acquire the lock guarding the global work-item-disposal list.
6. Move all pending work items to the global work-item-disposal list.
7. Release the lock guarding the global work-item-disposal list.
8. Release lock G.
Of course, a thread that needs to wait for all pre-existing work items will need to take
departing threads into account. To see this, suppose that this thread starts waiting for all
pre-existing work items just after a departing thread has refused further communications
from other threads. How can this thread wait for the departing thread’s work items to
complete, keeping in mind that threads are not allowed to access each other’s lists of
work items?
One straightforward approach is for this thread to acquire G and then the lock guarding
the global work-item-disposal list, then move the work items to its own list. The thread
then releases both locks, places a work item on the end of its own list, and then waits for
all of the work items that it placed on each thread’s list (including its own) to complete.
This approach does work well in many cases, but if special processing is required
for each work item as it is pulled in from the global work-item-disposal list, the result
could be excessive contention on G. One way to avoid that contention is to acquire G
and then immediately release it. Then the process of waiting for all prior work items
looks something like the following:
4. Release G.
5. Acquire the lock guarding the global work-item-disposal list.
6. Move all work items from the global work-item-disposal list to this thread’s list,
processing them as needed along the way.
7. Release the lock guarding the global work-item-disposal list.
8. Enqueue an additional work item onto this thread’s list. (As before, this work item
will atomically decrement the global counter, and if the result is zero, it will set a
condition variable to one.)
9. Wait for the condition variable to take on the value one.
Once this procedure completes, all pre-existing work items are guaranteed to have
completed. The empty critical sections are using locking for messaging as well as for
protection of data. ❑
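The messaging role of an empty critical section can be seen in isolation in the following sketch, which is far simpler than the procedure above. All names other than G are hypothetical, and the departing thread's work is reduced to setting a flag. Once the waiter's acquire-then-release of G completes, any thread that held G beforehand must have finished its critical section.

```c
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>

static pthread_mutex_t G = PTHREAD_MUTEX_INITIALIZER;
static atomic_int departure_started;	/* Set once G is held. */
static int items_disposed;		/* Written only while holding G. */

/* Stand-in for a departing thread: does its unraveling under G. */
static void *departing_thread(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&G);
	atomic_store(&departure_started, 1);
	items_disposed = 1;	/* "Move work items to the disposal list." */
	pthread_mutex_unlock(&G);
	return NULL;
}

/* Returns the value of items_disposed seen after the empty critical section. */
int wait_for_departing_thread(void)
{
	pthread_t tid;
	int seen;

	pthread_create(&tid, NULL, departing_thread, NULL);
	while (!atomic_load(&departure_started))
		sched_yield();	/* Ensure the departure is already in progress. */

	pthread_mutex_lock(&G);		/* Empty critical section: */
	pthread_mutex_unlock(&G);	/* waits out the departing thread. */

	seen = items_disposed;	/* Ordered after the departing thread's unlock. */
	pthread_join(tid, NULL);
	return seen;
}
```

In this sketch the waiter always observes the completed disposal, even though it never holds G while reading the flag.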
Answer:
There are in fact several. One way would be to use the null, protected-read, and exclusive
modes. Another way would be to use the null, protected-read, and concurrent-write
modes. A third way would be to use the null, concurrent-read, and exclusive modes. ❑
Answer:
Conditionally acquiring a single global lock does work very well, but only for relatively
small numbers of CPUs. To see why it is problematic in systems with many hundreds
of CPUs, look at Figure 5.1. ❑
Answer:
How indeed? This just shows that in concurrency, just as in life, one should take care to
learn exactly what winning entails before playing the game. ❑
Answer:
Because this default initialization does not apply to locks allocated as auto variables
within the scope of a function. ❑
Answer:
Suppose that the lock is held and that several threads are attempting to acquire the lock.
In this situation, if these threads all loop on the atomic exchange operation, they will
ping-pong the cache line containing the lock among themselves, imposing load on the
interconnect. In contrast, if these threads are spinning in the inner loop on lines 7–8,
they will each spin within their own caches, placing negligible load on the interconnect.
❑
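A user-space rendition of such a lock is sketched below using C11 atomics as stand-ins for the xchg() and READ_ONCE() primitives. It is not the book's listing, and the line numbers cited in the answer above do not refer to it. The outer loop performs the atomic exchange, and the inner loop spins on plain loads that can be satisfied from the local cache.

```c
#include <stdatomic.h>

typedef atomic_int xchglock_t;	/* 0: unlocked, 1: locked. */

void xchg_lock(xchglock_t *lp)
{
	/* Outer loop: the exchange needs exclusive access to the cache line. */
	while (atomic_exchange_explicit(lp, 1, memory_order_acquire) == 1) {
		/* Inner loop: read-only spinning stays in the local cache. */
		while (atomic_load_explicit(lp, memory_order_relaxed) == 1)
			continue;
	}
}

void xchg_unlock(xchglock_t *lp)
{
	/* The release store orders the critical section before the unlock. */
	atomic_store_explicit(lp, 0, memory_order_release);
}
```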
Answer:
This can be a legitimate implementation, but only if this store is preceded by a memory
barrier and makes use of WRITE_ONCE(). The memory barrier is not required when the
xchg() operation is used because this operation implies a full memory barrier due to
the fact that it returns a value. ❑
Answer:
In the C language, the following macro correctly handles this:
#define ULONG_CMP_LT(a, b) \
(ULONG_MAX / 2 < (a) - (b))
Although it is tempting to simply subtract two signed integers, this should be avoided
because signed overflow is undefined in the C language. For example, if the compiler
knows that one of the values is positive and the other negative, it is within its rights
to simply assume that the positive number is greater than the negative number, even
though subtracting the negative number from the positive number might well result in
overflow and thus a negative number.
How could the compiler know the signs of the two numbers? It might be able to
deduce it based on prior assignments and comparisons. In this case, if the per-CPU
counters were signed, the compiler could deduce that they were always increasing in
value, and then might assume that they would never go negative. This assumption could
well lead the compiler to generate unfortunate code [McK12d, Reg10]. ❑
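To see the macro handle wraparound, consider the following sketch, which wraps it in a hypothetical counter_before() function so that both operands are certain to have type unsigned long.

```c
#include <limits.h>
#include <stdbool.h>

#define ULONG_CMP_LT(a, b) \
	(ULONG_MAX / 2 < (a) - (b))

/* True if counter value a comes before b, modulo wraparound. */
bool counter_before(unsigned long a, unsigned long b)
{
	return ULONG_CMP_LT(a, b);
}
```

For example, counter_before(3, 7) is true, as is counter_before(ULONG_MAX, 1), because the counter reaches 1 shortly after wrapping past ULONG_MAX. In contrast, counter_before(1, ULONG_MAX) and counter_before(5, 5) are both false. Unsigned subtraction wraps by definition in C, so none of these cases invokes undefined behavior.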
Answer:
The flag approach will normally suffer fewer cache misses, but a better answer is to try
both and see which works best for your particular workload. ❑
E.8. DATA OWNERSHIP 769
Answer:
Here are some bugs resulting from improper use of implicit existence guarantees:
1. A program writes the address of a global variable to a file, then a later instance
of that same program reads that address and attempts to dereference it. This can
fail due to address-space randomization, to say nothing of recompilation of the
program.
2. A module can record the address of one of its variables in a pointer located in some
other module, then attempt to dereference that pointer after the module has been
unloaded.
3. A function can record the address of one of its on-stack variables into a global
pointer, which some other function might attempt to dereference after that function
has returned.
I am sure that you can come up with additional possibilities. ❑
Answer:
This is a very simple hash table with no chaining, so the only element in a given bucket
is the first element. The reader is invited to adapt this example to a hash table with full
chaining. ❑
Answer:
Use of auto variables in functions. By default, these are private to the thread executing
the current function. ❑
Answer:
The creation of the threads via the sh & operator and the joining of threads via the sh
wait command.
Of course, if the processes explicitly share memory, for example, using the shmget()
or mmap() system calls, explicit synchronization might well be needed when accessing
or updating the shared memory. The processes might also synchronize using any of the
following interprocess communications mechanisms:
1. System V semaphores.
3. UNIX-domain sockets.
5. File locking.
6. Use of the open() system call with the O_CREAT and O_EXCL flags.
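The last item in this list can be sketched as a simple lock file. The function names are hypothetical, and the sketch assumes a local filesystem on which the O_CREAT and O_EXCL combination is atomic.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Try to acquire the lock by creating the file exclusively.
 * Returns 0 on success, or -1 with errno set (EEXIST if already held).
 */
int lockfile_acquire(const char *path)
{
	int fd = open(path, O_CREAT | O_EXCL | O_WRONLY, 0600);

	if (fd < 0)
		return -1;
	close(fd);
	return 0;
}

/* Release the lock by removing the file. */
int lockfile_release(const char *path)
{
	return unlink(path);
}
```

Only one process can win the race to create the file, so the file's existence acts as the lock. A process that exits without calling lockfile_release() leaves the lock held, which is one reason production uses of this idiom tend to record the owner's PID in the file.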
Answer:
That is a philosophical question.
Those wishing the answer “no” might argue that processes by definition do not share
memory.
Those wishing to answer “yes” might list a large number of synchronization mecha-
nisms that do not require shared memory, note that the kernel will have some shared
state, and perhaps even argue that the assignment of process IDs (PIDs) constitutes
shared data.
Such arguments are excellent intellectual exercise, and are also a wonderful way of
feeling intelligent and scoring points against hapless classmates or colleagues, but are
mostly a way of avoiding getting anything useful done. ❑
Answer:
Amazingly enough, yes. One example is a simple message-passing system where threads
post messages to other threads’ mailboxes, and where each thread is responsible for
removing any message it sent once that message has been acted on. Implementation of
such an algorithm is left as an exercise for the reader, as is identifying other algorithms
with similar ownership patterns. ❑
Answer:
There is a very large number of such mechanisms, including:
1. System V message queues.
Answer:
The key phrase is “owns the rights to the data”. In this case, the rights in question are
the rights to access the per-thread counter variable defined on line 1 of the listing.
This situation is similar to that described in Section 8.2.
However, there really is data that is owned by the eventual() thread, namely the t
and sum variables defined on lines 19 and 20 of the listing.
For other examples of designated threads, look at the kernel threads in the Linux
kernel, for example, those created by kthread_create() and kthread_run(). ❑
Answer:
Yes. One approach is for read_count() to add the value of its own per-thread variable.
This maintains full ownership and performance, but only a slight improvement in
accuracy, particularly on systems with very large numbers of threads.
Another approach is for read_count() to use function shipping, for example, in
the form of per-thread signals. This greatly improves accuracy, but at a significant
performance cost for read_count().
However, both of these methods have the advantage of eliminating cache thrashing
for the common case of updating counters. ❑
Answer:
To greatly increase the probability of finding bugs. A small torture-test program
(routetorture.h) that allocates and frees only one type of structure can tolerate a
surprisingly large amount of use-after-free misbehavior. See Figure 11.4 on page 339
and the related discussion in Section 11.6.4 starting on page 341 for more on the
importance of increasing the probability of finding bugs. ❑
Answer:
Because the traversal is already protected by the lock, so no additional protection is
required. ❑
Answer:
The break is due to hyperthreading. On this particular system, the first hardware threads
in each core within a socket have consecutive CPU numbers, followed by the first
hardware threads in each core for the other sockets, and finally followed by the second
hardware thread in each core on all the sockets. On this particular system, CPU numbers
0–27 are the first hardware threads in each of the 28 cores in the first socket, numbers
28–55 are the first hardware threads in each of the 28 cores in the second socket, and so
on, so that numbers 196–223 are the first hardware threads in each of the 28 cores in the
eighth socket. Then CPU numbers 224–251 are the second hardware threads in each
of the 28 cores of the first socket, numbers 252–279 are the second hardware threads
in each of the 28 cores of the second socket, and so on until numbers 420–447 are the
second hardware threads in each of the 28 cores of the eighth socket.
Why does this matter?
Because the two hardware threads of a given core share resources, and this workload
seems to allow a single hardware thread to consume more than half of the relevant
resources within its core. Therefore, adding the second hardware thread of that core
adds less than one might hope. Other workloads might gain greater benefit from each
core’s second hardware thread, but much depends on the details of both the hardware
and the workload. ❑
E.9. DEFERRED PROCESSING 773
Figure E.4: Pre-BSD Routing Table Protected by Reference Counting, Log Scale. [Log-log plot; x-axis: Number of CPUs (Threads), from 1 to 100; traces: ideal and refcnt.]
Answer:
Define “a little bit.”
Figure E.4 shows the same data, but on a log-log plot. As you can see, the refcnt line
drops below 5,000 at two CPUs. This means that the refcnt performance at two CPUs is
more than one thousand times smaller than the first y-axis tick of 5 × 10⁶ in Figure 9.2.
Therefore, the depiction of the performance of reference counting shown in Figure 9.2
is all too accurate. ❑
Answer:
That sentence did say “reduced the usefulness”, not “eliminated the usefulness”, now
didn’t it?
Please see Section 13.2, which discusses some of the techniques that the Linux kernel
uses to take advantage of reference counting in a highly concurrent environment. ❑
Answer:
The published implementations of hazard pointers used non-blocking synchronization
techniques for insertion and deletion. These techniques require that readers traversing
the data structure “help” updaters complete their updates, which in turn means that
readers need to look at the successor of a deleted element.
In contrast, we will be using locking to synchronize updates, which does away with
the need for readers to help updaters complete their updates, which in turn allows us to
leave pointers’ bottom bits alone. This approach allows read-side code to be simpler
and faster. ❑
Answer:
Because hp_try_record() must check for concurrent modifications. To do that job,
it needs a pointer to a pointer to the element, so that it can check for a modification to
the pointer to the element. ❑
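The need for the pointer-to-pointer argument is visible in the following sketch, which uses C11 atomics and hypothetical names. It is not the book's hp_try_record() implementation, but it follows the same record-then-recheck pattern: the second load through pp is what detects a concurrent modification.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical hazard-pointer slot owned by the calling thread. */
typedef struct {
	_Atomic(void *) p;
} hazard_slot_t;

/* Distinguished return value meaning "raced with an update, retry". */
static char hp_retry_marker;
#define HP_RETRY ((void *)&hp_retry_marker)

/*
 * Try to record a hazard pointer for the element referenced by *pp.
 * Returns the element, NULL if *pp was NULL, or HP_RETRY on a race.
 */
void *hp_try_record_sketch(_Atomic(void *) *pp, hazard_slot_t *hp)
{
	void *tmp = atomic_load(pp);

	if (tmp == NULL)
		return NULL;
	atomic_store(&hp->p, tmp);	/* Publish the hazard pointer... */
	if (atomic_load(pp) == tmp)	/* ...then recheck the source pointer. */
		return tmp;
	atomic_store(&hp->p, NULL);	/* Changed underneath us: back out. */
	return HP_RETRY;
}
```

Given only the element pointer rather than pp, the function would have nothing to recheck, and could not tell whether the element had been removed before the hazard pointer became visible.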
Answer:
It might be easier in some sense, but as will be seen in the Pre-BSD routing example,
there are situations for which hp_record() simply does not work. ❑
Answer:
If the pointer emanates from a global variable or is otherwise not subject to being freed,
then hp_record() may be used to repeatedly attempt to record the hazard pointer, even
in the face of concurrent deletions.
In certain cases, restart can be avoided by using link counting as exemplified by
the UnboundedQueue and ConcurrentHashMap data structures implemented in the Folly
open-source library.9 ❑
Answer:
Yes and no. These restrictions apply only to reference-counting mechanisms whose
reference acquisition can fail. ❑
would make time available to the other hardware thread sharing the core, resulting in
greater scalability at the expense of per-hardware-thread performance. ❑
Answer:
First, Figure 9.3 has a linear y-axis, while most of the graphs in the “Structured Deferral”
paper have logscale y-axes. Next, that paper uses lightly-loaded hash tables, while
Figure 9.3 uses a 10-element simple linked list, which means that hazard pointers
face a larger memory-barrier penalty in this workload than in that of the “Structured
Deferral” paper. Finally, that paper used an older modest-sized x86 system, while a
much newer and larger system was used to generate the data shown in Figure 9.3.
In addition, use of pairwise asymmetric barriers [Mic08, Cor10b, Cor18] has been
proposed to eliminate the read-side hazard-pointer memory barriers on systems sup-
porting this notion [Gol18b], which might improve the performance of hazard pointers
beyond what is shown in the figure.
As always, your mileage may vary. Given the difference in performance, it is clear
that hazard pointers give you the best performance either for very large data structures
(where the memory-barrier overhead will at least partially overlap cache-miss penalties)
or for data structures such as hash tables where a lookup operation needs a minimal
number of hazard pointers. ❑
Answer:
The sequence-lock mechanism is really a combination of two separate synchronization
mechanisms, sequence counts and locking. In fact, the sequence-count mechanism
is available separately in the Linux kernel via the write_seqcount_begin() and
write_seqcount_end() primitives.
However, the combined write_seqlock() and write_sequnlock() primitives
are used much more heavily in the Linux kernel. More importantly, many more people
will understand what you mean if you say “sequence lock” than if you say “sequence
count”.
So this section is entitled “Sequence Locks” so that people will understand what it
is about just from the title, and it appears in the “Deferred Processing” chapter because of
(1) the emphasis on the “sequence count” aspect of “sequence locks” and (2) the fact that a
“sequence lock” is much more than merely a lock. ❑
Answer:
That would be a legitimate implementation. However, if the workload is read-mostly, it
would likely increase the overhead of the common-case successful read, which could
Answer:
If it was omitted, both the compiler and the CPU would be within their rights to move the
critical section preceding the call to read_seqretry() down below this function. This
would prevent the sequence lock from protecting the critical section. The smp_mb()
primitive prevents such reordering. ❑
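The placement of these barriers can be seen in the following user-space sketch of a sequence lock. It is not the book's listing: C11 sequentially consistent fences stand in for smp_mb(), a pthread mutex stands in for the spinlock, and the struct pair example and its helper functions are hypothetical. The protected fields are accessed with relaxed atomics to avoid formal data races between readers and the writer.

```c
#include <pthread.h>
#include <stdatomic.h>

typedef struct {
	atomic_ulong seq;
	pthread_mutex_t lock;
} seqlock_t;

struct pair {
	seqlock_t sl;
	atomic_long a;
	atomic_long b;
};

unsigned long read_seqbegin(seqlock_t *slp)
{
	unsigned long s = atomic_load_explicit(&slp->seq, memory_order_relaxed);

	atomic_thread_fence(memory_order_seq_cst);	/* Keep reads after this. */
	return s & ~0x1UL;	/* An odd value forces a later retry. */
}

int read_seqretry(seqlock_t *slp, unsigned long oldseq)
{
	atomic_thread_fence(memory_order_seq_cst);	/* Keep reads before this. */
	return atomic_load_explicit(&slp->seq, memory_order_relaxed) != oldseq;
}

void write_seqlock(seqlock_t *slp)
{
	pthread_mutex_lock(&slp->lock);
	atomic_fetch_add_explicit(&slp->seq, 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);
}

void write_sequnlock(seqlock_t *slp)
{
	atomic_thread_fence(memory_order_seq_cst);
	atomic_fetch_add_explicit(&slp->seq, 1, memory_order_relaxed);
	pthread_mutex_unlock(&slp->lock);
}

void pair_init(struct pair *p)
{
	atomic_init(&p->sl.seq, 0);
	pthread_mutex_init(&p->sl.lock, NULL);
	atomic_init(&p->a, 0);
	atomic_init(&p->b, 0);
}

void pair_write(struct pair *p, long a, long b)
{
	write_seqlock(&p->sl);
	atomic_store_explicit(&p->a, a, memory_order_relaxed);
	atomic_store_explicit(&p->b, b, memory_order_relaxed);
	write_sequnlock(&p->sl);
}

/* Read both fields consistently, retrying if a writer intervened. */
long pair_sum(struct pair *p)
{
	unsigned long s;
	long a, b;

	do {
		s = read_seqbegin(&p->sl);
		a = atomic_load_explicit(&p->a, memory_order_relaxed);
		b = atomic_load_explicit(&p->b, memory_order_relaxed);
	} while (read_seqretry(&p->sl, s));
	return a + b;
}
```

Without the fence at the start of read_seqretry(), the loads of a and b could be reordered after the final load of the sequence number, which would allow a concurrent update to go undetected.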
Answer:
In older versions of the Linux kernel, no.
In very new versions of the Linux kernel, line 16 could use smp_load_acquire()
instead of READ_ONCE(), which in turn would allow the smp_mb() on line 17 to be
dropped. Similarly, line 41 could use an smp_store_release(), for example, as
follows:
Answer:
Nothing. This is one of the weaknesses of sequence locking, and as a result, you should
use sequence locking only in read-mostly situations. Unless of course read-side starvation
is acceptable in your situation, in which case, go wild with the sequence-locking updates!
❑
Answer:
In this case, the ->lock field could be omitted, as it is in seqcount_t in the Linux
kernel. ❑
Answer:
Not at all. The Linux kernel has a number of special attributes that allow it to ignore
the following sequence of events:
2. Thread 0 starts executing its read-side critical section, but is then preempted for a
long time.
4. Thread 0 resumes execution, completing its read-side critical section with
inconsistent data.
The Linux kernel uses sequence locking for things that are updated rarely, with
time-of-day information being a case in point. This information is updated at most
once per millisecond, so that seven weeks would be required to overflow the counter. If
a kernel thread was preempted for seven weeks, the Linux kernel’s soft-lockup code
would be emitting warnings every two minutes for that entire time.
In contrast, with a 64-bit counter, more than five centuries would be required to
overflow, even given an update every nanosecond. Therefore, this implementation uses
a type for ->seq that is 64 bits on 64-bit systems. ❑
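The overflow times quoted above can be checked with a few lines of arithmetic. This sketch follows the text in assuming one counter increment per update. The helper names are hypothetical, and the year length of 365.25 days is an assumption of the sketch.

```c
/* Seconds until a counter of the given width wraps at a given update rate. */
static double wrap_seconds(int bits, double updates_per_second)
{
	double counts = 1.0;

	for (int i = 0; i < bits; i++)
		counts *= 2.0;	/* counts = 2^bits */
	return counts / updates_per_second;
}

static double seconds_to_weeks(double s)
{
	return s / (7.0 * 24.0 * 3600.0);
}

static double seconds_to_years(double s)
{
	return s / (365.25 * 24.0 * 3600.0);
}
```

With these helpers, seconds_to_weeks(wrap_seconds(32, 1000.0)) covers the 32-bit counter updated once per millisecond, and seconds_to_years(wrap_seconds(64, 1e9)) covers the 64-bit counter updated once per nanosecond.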
Answer:
One trivial way of accomplishing this is to surround all accesses, including the read-
only accesses, with write_seqlock() and write_sequnlock(). Of course, this
solution also prohibits all read-side parallelism, resulting in massive lock contention,
and furthermore could just as easily be implemented using simple locking.
If you do come up with a solution that uses read_seqbegin() and
read_seqretry() to protect read-side accesses, make sure that you correctly handle the
following sequence of events:
1. CPU 0 is traversing the linked list, and picks up a pointer to list element A.
3. CPU 2 allocates an unrelated data structure, and gets the memory formerly occupied
by element A. In this unrelated data structure, the memory previously used for
element A’s ->next pointer is now occupied by a floating-point number.
4. CPU 0 picks up what used to be element A’s ->next pointer, gets random bits,
and therefore gets a segmentation fault.
One way to protect against this sort of problem requires use of “type-safe memory”,
which will be discussed in Section 9.5.4.5. Roughly similar solutions are possible using
the hazard pointers discussed in Section 9.3. But in either case, you would be using
some other synchronization mechanism in addition to sequence locks! ❑
Answer:
Yes, it would.
Because a NULL pointer is being assigned, there is nothing to order against, so there is
no need for smp_store_release(). In contrast, when assigning a non-NULL pointer,
it is necessary to use smp_store_release() in order to ensure that initialization of
the pointed-to structure is carried out before assignment of the pointer.
In short, WRITE_ONCE() would work, and would save a little bit of CPU time on some
architectures. However, as we will see, software-engineering concerns will motivate use
of a special rcu_assign_pointer() that is quite similar to smp_store_release().
❑
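The distinction between the two kinds of store can be sketched with C11 atomics. This is a userspace analogy rather than Linux-kernel code: memory_order_release stands in for smp_store_release(), memory_order_relaxed stands in for WRITE_ONCE(), and the names struct foo, gp, publish(), retract(), and reader_get() are all hypothetical.

```c
#include <stdatomic.h>
#include <stddef.h>

struct foo {
	int a;
	int b;
};

static _Atomic(struct foo *) gp;	/* Global pointer visible to readers. */

/* Publish a non-NULL pointer: initialization must be ordered first. */
static void publish(struct foo *p, int a, int b)
{
	p->a = a;
	p->b = b;
	/* Release store: the initialization above becomes visible first. */
	atomic_store_explicit(&gp, p, memory_order_release);
}

/* Retract the pointer: there is no initialization to order against. */
static void retract(void)
{
	/* A relaxed store suffices when storing NULL. */
	atomic_store_explicit(&gp, NULL, memory_order_relaxed);
}

/* Reader side: the acquire load pairs with the release store in publish(). */
static struct foo *reader_get(void)
{
	return atomic_load_explicit(&gp, memory_order_acquire);
}
```

Note that retract() says nothing about when the old structure may be freed. That is a separate problem, and it is the one that deferred reclamation addresses.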
Answer:
Not necessarily.
As hinted at in Sections 3.2.3 and 3.3, speed-of-light delays mean that a computer’s
data is always stale compared to whatever external reality that data is intended to model.
Real-world algorithms therefore absolutely must tolerate inconsistencies between
external reality and the in-computer data reflecting that reality. Many of those algorithms
are also able to tolerate some degree of inconsistency within the in-computer data.
Section 10.3.4 discusses this point in more detail.
Please note that this need to tolerate inconsistent and stale data is not limited to RCU.
It also applies to reference counting, hazard pointers, sequence locks, and even to some
locking use cases. For example, if you compute some quantity while holding a lock, but
use that quantity after releasing that lock, you might well be using stale data. After all,
the data that quantity is based on might change arbitrarily as soon as the lock is released.
So yes, RCU readers can see stale and inconsistent data, but no, this is not necessarily
problematic. And, when needed, there are RCU usage patterns that avoid both staleness
and inconsistency [ACMS03]. ❑
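The locking case mentioned above can be illustrated in a few lines. This is a minimal sketch assuming POSIX mutexes. The balance variable and both function names are hypothetical.

```c
#include <pthread.h>

static pthread_mutex_t balance_lock = PTHREAD_MUTEX_INITIALIZER;
static long balance = 100;	/* Hypothetical lock-protected quantity. */

/* Compute a quantity under the lock, but return it for later use. */
static long snapshot_balance(void)
{
	long snap;

	pthread_mutex_lock(&balance_lock);
	snap = balance;		/* Exact while the lock is held... */
	pthread_mutex_unlock(&balance_lock);
	return snap;		/* ...but possibly stale from here on. */
}

/* An update that may run as soon as the lock is released. */
static void deposit(long amount)
{
	pthread_mutex_lock(&balance_lock);
	balance += amount;
	pthread_mutex_unlock(&balance_lock);
}
```

Any caller that uses the value returned by snapshot_balance() after a later deposit() is working with stale data, despite the fact that every access was fully protected by the lock.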
Answer:
A pointer to RCU-protected data. RCU-protected data is in turn a block of dynamically
allocated memory whose freeing will be deferred such that an RCU grace period will
elapse between the time that there were no longer any RCU-reader-accessible pointers
to that block and the time that that block is freed. This ensures that no RCU readers will
have access to that block at the time that it is freed.
RCU-protected pointers must be handled carefully. For example, any reader that
intends to dereference an RCU-protected pointer must use rcu_dereference() (or
stronger) to load that pointer. In addition, any updater must use
rcu_assign_pointer() (or stronger) to store to that pointer. ❑
Answer:
If a synchronize_rcu() cannot prove that it started before a given rcu_read_lock(),
then it must wait for the corresponding rcu_read_unlock(). ❑
Answer:
Because that waiting is exactly what enables readers to use the same sequence of
instructions that is appropriate for single-threaded situations. In other words, this
additional “redundant” waiting enables excellent read-side performance, scalability, and
real-time response. ❑
Answer:
Recall that readers are not permitted to pass through a quiescent state. For example,
within the Linux kernel, RCU readers are not permitted to execute a context switch. Use
of rcu_read_lock() and rcu_read_unlock() enables debug checks for improperly
placed quiescent states, making it easy to find bugs that would otherwise be difficult to
find, intermittent, and quite destructive. ❑
Answer:
The RCU-specific APIs do have similar semantics to the suggested replacements, but
also enable static-analysis debugging checks that complain if an RCU-specific API is
invoked on a non-RCU pointer and vice versa. ❑
Answer:
A call_rcu() function, which is described in Section 9.5.2.2, permits asynchronous
grace-period waits. ❑
Answer:
Yes and no. Although seqlock readers can run concurrently with seqlock writers,
whenever this happens, the read_seqretry() primitive will force the reader to retry.
This means that any work done by a seqlock reader running concurrently with a seqlock
updater will be discarded and then redone upon retry. So seqlock readers can run
concurrently with updaters, but they cannot actually get any work done in this case.
In contrast, RCU readers can perform useful work even in presence of concurrent
RCU updaters.
However, both reference counters (Section 9.2) and hazard pointers (Section 9.3)
really do permit useful concurrent forward progress for both updaters and readers, just
at somewhat greater cost. Please see Section 9.6 for a comparison of these different
solutions to the deferred-reclamation problem. ❑
Answer:
Sometimes, for example, on TSO systems such as x86 or the IBM mainframe where
a store-release operation emits a single store instruction. However, weakly ordered
systems must also emit a memory barrier or perhaps a store-release instruction. In
addition, removing data requires quite a bit of additional work because it is necessary to
wait for pre-existing readers before freeing the removed data. ❑
Answer:
In the universe where an iterating reader is only required to traverse elements that were
present throughout the full duration of the iteration. In the example, that would be
elements B and C. Because elements A and D were each present for only part of the
iteration, the reader is permitted to iterate over them, but not obliged to. Note that this
supports the common case where the reader is simply looking up a single item, and
does not know or care about the presence or absence of other items.
If stronger consistency is required, then higher-cost synchronization mechanisms
are required, for example, sequence locking or reader-writer locking. But if stronger
consistency is not required (and it very often is not), then why pay the higher cost? ❑
Answer:
The r1 == 0 && r2 == 0 possibility was called out in the text. Given that r1 == 0
implies r2 == 0, we know that r1 == 0 && r2 == 1 is forbidden. The following
discussion will show that both r1 == 1 && r2 == 1 and r1 == 1 && r2 == 0 are
possible. ❑
Answer:
Absolutely nothing would change. The fact that P0()’s loads from x and y are in the
same RCU read-side critical section suffices; their order is irrelevant. ❑
Answer:
The exact same ordering rules would apply, that is, (1) If any part of P0()’s RCU
read-side critical section preceded the beginning of P1()’s grace period, all of P0()’s
RCU read-side critical section would precede the end of P1()’s grace period, and (2) If
any part of P0()’s RCU read-side critical section followed the end of P1()’s grace
period, all of P0()’s RCU read-side critical section would follow the beginning of
P1()’s grace period.
It might seem strange to have RCU read-side critical sections containing writes, but
this capability is not only permitted, but also highly useful. For example, the Linux
kernel frequently carries out an RCU-protected traversal of a linked data structure and
then acquires a reference to the destination data element. Because this data element
must not be freed in the meantime, that element’s reference counter must necessarily
be incremented within the traversal’s RCU read-side critical section. However, that
increment entails a write to memory. Therefore, it is a very good thing that memory
writes are permitted within RCU read-side critical sections.
If having writes in RCU read-side critical sections still seems strange, please review
Section 5.4.6, which presented a use case for writes in reader-writer locking read-side
critical sections. ❑
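The traverse-then-acquire-reference pattern might look roughly as follows in userspace. This is a sketch under several assumptions: rcu_read_lock() and rcu_read_unlock() are empty placeholders standing in for a real RCU implementation, the list is a simple singly linked list of hypothetical struct elem, and the take-a-reference-unless-zero check is a choice made for this sketch rather than something stated in the text.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Placeholder read-side markers: stand-ins for a real RCU implementation. */
static void rcu_read_lock(void) { }
static void rcu_read_unlock(void) { }

struct elem {
	int key;
	atomic_int refcnt;		/* Zero means "on its way to being freed". */
	struct elem *_Atomic next;
};

static struct elem *_Atomic head;

/* Take a reference unless the count has already dropped to zero. */
static bool elem_tryget(struct elem *e)
{
	int old = atomic_load_explicit(&e->refcnt, memory_order_relaxed);

	while (old != 0) {
		/* On failure, old is refreshed and the loop rechecks it. */
		if (atomic_compare_exchange_weak_explicit(&e->refcnt, &old,
				old + 1, memory_order_acquire,
				memory_order_relaxed))
			return true;
	}
	return false;
}

/* Find an element and return it with a reference held, or NULL. */
static struct elem *elem_lookup_get(int key)
{
	struct elem *e;

	rcu_read_lock();
	for (e = atomic_load_explicit(&head, memory_order_acquire);
	     e != NULL;
	     e = atomic_load_explicit(&e->next, memory_order_acquire)) {
		if (e->key == key) {
			if (!elem_tryget(e))	/* A write within the reader. */
				e = NULL;
			break;
		}
	}
	rcu_read_unlock();
	return e;	/* The reference, not RCU, now keeps e alive. */
}
```

The key point is that elem_tryget() writes to memory while still inside the read-side critical section, which is exactly what keeps the element from being freed between the lookup and the increment.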
Answer:
That depends on the synchronization design. If a semaphore protecting the update is
held across the grace period, then there can be at most two versions, the old and the new.
However, suppose that only the search, the update, and the list_replace_rcu()
were protected by a lock, so that the synchronize_rcu() was outside of that lock,
similar to the code shown in Listing E.2. Suppose further that a large number of
threads undertook an RCU replacement at about the same time, and that readers are
also constantly traversing the data structure.
Then the following sequence of events could occur, starting from the end state of
Figure 9.15:
2. Thread B replaces Element C with a new Element F, then waits for its
synchronize_rcu() call to return.
4. Thread D replaces Element F with a new Element G, then waits for its
synchronize_rcu() call to return.
6. Thread F replaces Element G with a new Element H, then waits for its
synchronize_rcu() call to return.
8. And the previous two steps repeat quickly with additional new elements, so that all
of them happen before any of the synchronize_rcu() calls return.
Thus, there can be an arbitrary number of versions active, limited only by memory
and by how many updates could be completed within a grace period. But please note
that data structures that are updated so frequently are not likely to be good candidates
for RCU. Nevertheless, RCU can handle high update rates when necessary. ❑
Answer:
The most effective way to reduce the per-update overhead of RCU is to increase the
number of updates served by a given grace period. This works because the per-grace
period overhead is nearly independent of the number of updates served by that grace
period.
One way to do this is to delay the start of a given grace period in the hope that more
updates requiring that grace period appear in the meantime. Another way is to slow
down execution of the grace period in the hope that more updates requiring an additional
grace period will accumulate in the meantime.
There are many other possible optimizations, and fanatically devoted readers are
referred to the Linux-kernel RCU implementation. ❑
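The batching idea can be sketched with a deliberately simplified single-threaded model. Everything here is hypothetical: wait_for_grace_period() merely counts invocations instead of waiting for readers, and there is no synchronization among updaters. The only point is that one grace period serves every update queued before it.

```c
#include <stddef.h>

struct cb {
	struct cb *next;
	void (*func)(struct cb *);
};

static struct cb *pending;	/* Callbacks awaiting a grace period. */
static unsigned long gp_count;	/* Number of grace periods "waited for". */
static unsigned long invoked;	/* Number of callbacks invoked. */

/* Stand-in for an expensive grace-period wait. */
static void wait_for_grace_period(void)
{
	gp_count++;
}

/* Sample callback that just counts its invocations. */
static void count_cb(struct cb *c)
{
	(void)c;
	invoked++;
}

/* Queue an update's cleanup instead of waiting immediately. */
static void defer(struct cb *c, void (*func)(struct cb *))
{
	c->func = func;
	c->next = pending;
	pending = c;
}

/* One grace period serves every callback queued so far. */
static unsigned long flush_deferred(void)
{
	struct cb *list = pending;
	unsigned long n = 0;

	pending = NULL;
	if (list == NULL)
		return 0;	/* Nothing queued, so no grace period needed. */
	wait_for_grace_period();
	while (list != NULL) {
		struct cb *next = list->next;

		list->func(list);
		list = next;
		n++;
	}
	return n;
}
```

In this model, delaying the call to flush_deferred() lets more callbacks accumulate, so that the fixed cost of wait_for_grace_period() is divided among more updates.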
Answer:
The modifications undertaken by a given RCU updater will cause the corresponding CPU
to invalidate cache lines containing the data, forcing the CPUs running concurrent RCU
readers to incur expensive cache misses. (Can you design an algorithm that changes
a data structure without inflicting expensive cache misses on concurrent readers? On
subsequent readers?) ❑
Answer:
The API members with exclamation marks (rcu_read_lock(), rcu_read_unlock(),
and call_rcu()) were the only members of the Linux RCU API that Paul E. McKenney
was aware of back in the mid-90s. During this timeframe, he was under the mistaken
impression that he knew all that there is to know about RCU. ❑
Answer:
There is no need to do anything to prevent RCU read-side critical sections from
indefinitely blocking a synchronize_rcu() invocation, because the
synchronize_rcu() invocation need wait only for pre-existing RCU read-side critical sections. So as
long as each RCU read-side critical section is of finite duration, RCU grace periods will
also remain finite. ❑
Answer:
In v4.20 and later Linux kernels, yes [McK19c, McK19a].
But not in earlier kernels, and especially not when using preemptible RCU! You
instead want synchronize_irq(). Alternatively, you can place calls to
rcu_read_lock() and rcu_read_unlock() in the specific interrupt handlers that you want
synchronize_rcu() to wait for. But even then, be careful, as preemptible RCU
will not be guaranteed to wait for that portion of the interrupt handler preceding the
rcu_read_lock() or following the rcu_read_unlock(). ❑
Answer:
They wait on different things. While synchronize_rcu() waits for pre-existing RCU
read-side critical sections to complete, rcu_barrier() instead waits for callbacks
from prior calls to call_rcu() to be invoked.
This distinction is illustrated by Listing E.3, which shows code being executed by a
given CPU. For simplicity, assume that no other CPU is executing rcu_read_lock(),
rcu_read_unlock(), or call_rcu().
Table E.3 shows how long each primitive must wait if invoked concurrently with each
of the do_something_*() functions, with empty cells indicating that no waiting is
necessary. As you can see, synchronize_rcu() need not wait unless it is invoked concurrently
with an RCU read-side critical section, in which case it must wait for the rcu_read_unlock() that
ends that critical section. In contrast, RCU read-side critical sections have no effect
Invoked during      synchronize_rcu() waits for    rcu_barrier() waits for
do_something_1()
do_something_2()    rcu_read_unlock() (line 6)
do_something_3()    rcu_read_unlock() (line 6)     f(&p->rh) (line 8)
do_something_4()                                   f(&p->rh) (line 8)
do_something_5()
Answer:
In principle, you can use either synchronize_srcu() or
synchronize_srcu_expedited() with a given srcu_struct within an SRCU read-side critical section
that uses some other srcu_struct. In practice, however, doing this is almost certainly
a bad idea. In particular, the code shown in Listing E.4 could still result in deadlock. ❑
Answer:
You are quite right!
In fact, in nonpreemptible kernels, synchronize_rcu_tasks() is a wrapper around
synchronize_rcu(). ❑
Answer:
One such exception is when a multi-element linked data structure is initialized as a unit
while inaccessible to other CPUs, and then a single rcu_assign_pointer() is used to
plant a global pointer to this data structure. The initialization-time pointer assignments
need not use rcu_assign_pointer(), though any such assignments that happen after
the structure is globally visible must use rcu_assign_pointer().
However, unless this initialization code is on an impressively hot code-path, it is
probably wise to use rcu_assign_pointer() anyway, even though it is in theory
unnecessary. It is all too easy for a “minor” change to invalidate your cherished
assumptions about the initialization happening privately. ❑
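The initialize-privately-then-publish pattern can be sketched in userspace with C11 atomics. This is an analogy rather than kernel code: the release store plays the role of the single rcu_assign_pointer(), and struct node, list_head, build_and_publish(), and list_sum() are hypothetical names.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>

struct node {
	int val;
	struct node *next;
};

static _Atomic(struct node *) list_head;	/* Global publication point. */

/* Build an n-element list privately, then publish it with one store. */
static struct node *build_and_publish(int n)
{
	struct node *head = NULL;

	for (int i = n; i >= 1; i--) {
		struct node *p = malloc(sizeof(*p));

		if (p == NULL) {	/* Undo the partial list on failure. */
			while (head != NULL) {
				struct node *next = head->next;

				free(head);
				head = next;
			}
			return NULL;
		}
		p->val = i;
		p->next = head;	/* Plain store: no reader can see p yet. */
		head = p;
	}
	/* Single release store makes the whole structure visible at once. */
	atomic_store_explicit(&list_head, head, memory_order_release);
	return head;
}

/* Reader: the acquire load pairs with the release store above. */
static long list_sum(void)
{
	long sum = 0;
	struct node *p;

	for (p = atomic_load_explicit(&list_head, memory_order_acquire);
	     p != NULL; p = p->next)
		sum += p->val;
	return sum;
}
```

The plain stores to ->next are safe only because they precede publication. Any later modification of a published ->next pointer would need release semantics of its own, which is the "minor" change that the text warns about.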
Answer:
It can sometimes be difficult for automated code checkers such as “sparse” (or indeed for
human beings) to work out which type of RCU read-side critical section a given RCU
traversal primitive corresponds to. For example, consider the code shown in Listing E.5.
Is the rcu_dereference() primitive in a vanilla RCU critical section or an RCU
Sched critical section? What would you have to do to figure this out?
But perhaps after the consolidation of the RCU flavors in the v4.20 Linux kernel we
no longer need to care! ❑
Answer:
One way to handle this is to always move nodes to the beginning of the destination
bucket, ensuring that when the reader reaches the end of the list having a matching NULL
pointer, it will have searched the entire list.
Of course, if there are too many move operations in a hash table with many elements
per bucket, the reader might never reach the end of a list. One way of avoiding this
in the common case is to keep hash tables well-tuned, thus with short lists. One way
of detecting the problem and handling it is for the reader to terminate the search after
traversing some large number of nodes, acquire the update-side lock, and redo the search,
but this might introduce deadlocks. Another way of avoiding the problem entirely is for
readers to search within RCU read-side critical sections, and to wait for an RCU grace
period between successive updates. An intermediate position might wait for an RCU
grace period every 𝑁 updates, for some suitable value of 𝑁. ❑
Answer:
Because Tasks RCU does not have read-side markers. Instead, Tasks RCU read-side
critical sections are bounded by voluntary context switches. ❑
Answer:
This is an excellent question, and the answer is that modern CPUs and compilers are
extremely complex. But before getting into that, it is well worth noting that RCU QSBR’s
performance advantage appears only in the one-hardware-thread-per-core regime. Once
the system is fully loaded, RCU QSBR’s performance drops back to ideal.
The RCU variant of the route_lookup() search loop actually has one more x86
instruction than does the sequential version, namely the lea in the sequence cmp, je,
mov, cmp, lea, and jne. This extra instruction is due to the rcu_head structure at the
beginning of the RCU variant’s route_entry structure, so that, unlike the sequential
variant, the RCU variant’s ->re_next.next pointer has a non-zero offset. Back in the
1980s, this additional lea instruction might have reliably resulted in the RCU variant
being slower, but we are now in the 21st century, and the 1980s are long gone.
But those of you who read Section 3.1.1 carefully already knew all of this!
These counter-intuitive results of course mean that any performance result on modern
microprocessors must be subject to some skepticism. In theory, it really does not make
sense to obtain performance results that are better than ideal, but it really can happen on
modern microprocessors. Such results can be thought of as similar to the celebrated
super-linear speedups (see Section 6.5 for one such example), that is, of interest but
also of limited practical importance. Nevertheless, one of the strengths of RCU is that
its read-side overhead is so low that tiny effects such as this one are visible in real
performance measurements.
This raises the question as to what would happen if the rcu_head structure were to
be moved so that RCU’s ->re_next.next pointer also had zero offset, just the same
as the sequential variant. And the answer, as can be seen in Figure E.5, is that this
causes RCU QSBR’s performance to decrease to where it is still very nearly ideal, but
no longer super-ideal. ❑
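The effect of structure layout on the link pointer's offset can be checked with offsetof(). The structures below are hypothetical stand-ins, not the actual route_entry, rcu_head, or list-link definitions. Only the relative placement of the rcu_head-like field and the ->re_next link matters for this sketch.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the real rcu_head and list-link types. */
struct rcu_head_sketch {
	void *next;
	void (*func)(void *);
};

struct list_link_sketch {
	struct list_link_sketch *next;
	struct list_link_sketch *prev;
};

/* rcu_head first: the link pointer lands at a non-zero offset. */
struct route_entry_rh_first {
	struct rcu_head_sketch rh;
	struct list_link_sketch re_next;
	unsigned long addr;
	unsigned long iface;
};

/* rcu_head moved after the link: the link pointer is at offset zero. */
struct route_entry_rh_later {
	struct list_link_sketch re_next;
	struct rcu_head_sketch rh;
	unsigned long addr;
	unsigned long iface;
};

static size_t link_offset_rh_first(void)
{
	return offsetof(struct route_entry_rh_first, re_next.next);
}

static size_t link_offset_rh_later(void)
{
	return offsetof(struct route_entry_rh_later, re_next.next);
}
```

A non-zero offset is what can cause the compiler to emit an additional address-computation instruction in the search loop, while the zero-offset layout lets the element pointer double as the link pointer.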
[Plot omitted. Horizontal axis: Number of CPUs (Threads). Labeled traces include ideal, seqlock, and hazptr.]
Figure E.5: Pre-BSD Routing Table Protected by RCU QSBR With Non-Initial rcu_head
Answer:
Because RCU QSBR places constraints on the overall application that might not be
tolerable, for example, requiring that each and every thread in the application regularly
pass through a quiescent state. Among other things, this means that RCU QSBR is
not helpful to library writers, who might be better served by other flavors of userspace
RCU [MDJ13f]. ❑
Answer:
One approach would be to use rcu_read_lock() and rcu_read_unlock() in
nmi_profile(), and to replace the synchronize_sched() with synchronize_rcu(),
perhaps as shown in Listing E.6.
But why on earth would an NMI handler be preemptible??? ❑
Answer:
The problem is that there is no ordering between the cco() function’s load from
be_careful and any memory loads executed by the cco_quickly() function. Because
there is no ordering, without that second call to synchronize_rcu(), memory ordering
could cause loads in cco_quickly() to overlap with stores by do_maint().
Listing E.6: Using RCU to Wait for Mythical Preemptible NMIs to Finish
struct profile_buffer {
	long size;
	atomic_t entry[0];
};
static struct profile_buffer *buf = NULL;

void nmi_profile(unsigned long pcvalue)
{
	struct profile_buffer *p;

	rcu_read_lock();
	p = rcu_dereference(buf);
	if (p == NULL) {
		rcu_read_unlock();
		return;
	}
	if (pcvalue >= p->size) {
		rcu_read_unlock();
		return;
	}
	atomic_inc(&p->entry[pcvalue]);
	rcu_read_unlock();
}

void nmi_stop(void)
{
	struct profile_buffer *p = buf;

	if (p == NULL)
		return;
	rcu_assign_pointer(buf, NULL);
	synchronize_rcu();
	kfree(p);
}
Another alternative would be to compensate for the removal of that second call to
synchronize_rcu() by changing the READ_ONCE() to smp_load_acquire() and
the WRITE_ONCE() to smp_store_release(), thus restoring the needed ordering. ❑
Answer:
By one popular school of thought, you cannot.
But in this case, those willing to jump ahead to Chapter 12 and Chapter 15 might find
a couple of LKMM litmus tests to be interesting (C-RCU-phased-state-change-1.litmus
and C-RCU-phased-state-change-2.litmus). These tests could be
argued to demonstrate that this code and a variant of it really do work. ❑
Answer:
There could certainly be an arbitrarily long period of time during which at least one
thread is always in an RCU read-side critical section. However, the key words in the
description in Section 9.5.4.5 are “in-use” and “pre-existing”. Keep in mind that a
given RCU read-side critical section is conceptually only permitted to gain references
to data elements that were visible to readers during that critical section. Furthermore,
remember that a slab cannot be returned to the system until all of its data elements have
been freed. In fact, the RCU grace period cannot start until after they have all been freed.
Therefore, the slab cache need only wait for those RCU read-side critical sections
that started before the freeing of the last element of the slab. This in turn means that
any RCU grace period that begins after the freeing of the last element will do—the slab
may be returned to the system after that grace period ends. ❑
Answer:
As with the (bug-ridden) Listing 7.10, this is a very simple hash table with no chaining,
so the only element in a given bucket is the first element. The reader is again invited to
adapt this example to a hash table with full chaining. Less energetic readers might wish
to refer to Chapter 10. ❑
Answer:
First, please note that the second check on line 14 is necessary because some other CPU
might have removed this element while we were waiting to acquire the lock. However,
the fact that we were in an RCU read-side critical section while acquiring the lock
guarantees that this element could not possibly have been re-allocated and re-inserted
into this hash table. Furthermore, once we acquire the lock, the lock itself guarantees
the element’s existence, so we no longer need to be in an RCU read-side critical section.
The question as to whether it is necessary to re-check the element’s key is left as an
exercise to the reader. ❑
Answer:
Suppose we reverse the order of these two lines. Then this code is vulnerable to the
following sequence of events:
1. CPU 0 invokes delete(), and finds the element to be deleted, executing through
line 15. It has not yet actually deleted the element, but is about to do so.
3. CPU 0 executes lines 16 and 17, and blocks at line 18 waiting for CPU 1 to exit its
RCU read-side critical section.
4. CPU 1 now acquires the lock, but the test on line 14 fails because CPU 0 has
already removed the element. CPU 1 now executes line 22 (which we switched
with line 23 for the purposes of this Quick Quiz) and exits its RCU read-side
critical section.
5. CPU 0 can now return from synchronize_rcu(), and thus executes line 19,
sending the element to the freelist.
6. CPU 1 now attempts to release a lock for an element that has been freed, and,
worse yet, possibly reallocated as some other type of data structure. This is a fatal
memory-corruption error. ❑
Answer:
Listing 9.18 replaces the per-element spin_lock() and spin_unlock() shown in
Listing 7.11 with a much cheaper rcu_read_lock() and rcu_read_unlock(), thus
greatly improving both performance and scalability. For more detail, please see
Section 10.3.3. ❑
Answer:
First, consider that the inner loop used to take this measurement is as follows:
So the “measurement” of 267 picoseconds is simply the fixed overhead of the timing
measurements divided by the number of passes through the inner loop containing the
calls to rcu_read_lock() and rcu_read_unlock(), plus the code to manipulate i
divided by the loop-unrolling factor. And therefore, this measurement really is in error.
In fact, it exaggerates the overhead by an arbitrary number of orders of magnitude. After
all, in terms of machine instructions emitted, the actual overheads of rcu_read_lock()
and of rcu_read_unlock() are each precisely zero.
It is not every day that a timing measurement of 267 picoseconds turns out to be
an overestimate! ❑
Answer:
Excellent memory!!! The overhead in some early releases was in fact roughly
100 femtoseconds.
What happened was that RCU usage spread more broadly through the Linux kernel,
including into code that takes page faults. Back at that time, rcu_read_lock() and
rcu_read_unlock() were complete no-ops in CONFIG_PREEMPT=n kernels.
Unfortunately, that situation allowed the compiler to reorder page-faulting memory accesses
into RCU read-side critical sections. Of course, page faults can block, which destroys
those critical sections.
Nor was this a theoretical problem: A failure actually manifested in 2019. Herbert
Xu tracked this failure down and Linus Torvalds therefore queued a commit to
upgrade rcu_read_lock() and rcu_read_unlock() to unconditionally include a
call to barrier() [Tor19]. And although barrier() emits no code, it does constrain
compiler optimizations. And so the price of widespread RCU usage is slightly higher
rcu_read_lock() and rcu_read_unlock() overhead. As such, Linux-kernel RCU
has proven to be a victim of its own success.
Of course, it is also the case that the older results were obtained on a different system
than were those shown in Figure 9.25. So which change had the most effect, Linus’s
commit or the change in the system? This question is left as an exercise to the reader. ❑
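The notion of a primitive that emits no code yet still constrains the compiler can be sketched as follows. The barrier() macro is written in the GCC-style inline-assembly form, which assumes a GCC-compatible compiler. The sketch_rcu_read_lock() and sketch_rcu_read_unlock() wrappers and the shared_value variable are hypothetical illustrations, not the kernel's definitions.

```c
/* GCC/Clang-style compiler barrier: emits no instructions. */
#define barrier() __asm__ __volatile__("" : : : "memory")

/* Sketch of read-side markers that do nothing but constrain the compiler. */
static inline void sketch_rcu_read_lock(void)
{
	barrier();
}

static inline void sketch_rcu_read_unlock(void)
{
	barrier();
}

static int shared_value = 42;	/* Hypothetical data accessed by readers. */

static int read_shared(void)
{
	int v;

	sketch_rcu_read_lock();
	v = shared_value;	/* The compiler may not move this memory */
				/* access across either barrier(). */
	sketch_rcu_read_unlock();
	return v;
}
```

The "memory" clobber tells the compiler that memory might have changed, which prevents it from moving memory accesses across the marker. The cost is whatever optimizations the compiler must forgo as a result.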
Answer:
Keep in mind that this is a log-log plot, so those large-seeming RCU variances in reality
span only a few hundred picoseconds. And that is such a short time that anything could
cause it. However, given that the variance decreases with both small and large numbers
of CPUs, one hypothesis is that the variation is due to migrations from one CPU to
another.
Yes, these measurements were taken with interrupts disabled, but they were also
taken within a guest OS, so that preemption was still possible at the hypervisor level. In
addition, the system featured hyperthreading and a single hardware thread running this
RCU workload is able to consume more than half of the core’s resources. Therefore, the
overall throughput varies depending on how many of a given guest OS’s CPUs share
cores. Attempting to reduce these variations by running the guest OSes at real-time
priority (as suggested by Joel Fernandes) is left as an exercise for the reader. ❑
Answer:
Because the script (rcuscale.sh) that generates this data spawns a guest operating
system for each set of points gathered, and on this particular system, both qemu and
KVM limit the number of CPUs that may be configured into a given guest OS. Yes, it
would have been possible to run a few more CPUs, but 192 is a nice round number from
a binary perspective, given that 256 is infeasible. ❑
Answer:
Because smaller disturbances result in greater relative errors for smaller measurements.
Also, the Linux kernel’s ndelay() nanosecond-scale primitive is (as of 2020) less
accurate than is the udelay() primitive used for the data for durations of a microsecond
or more. It is instructive to compare to the zero-length case shown in Figure 9.25. ❑
rcu_read_lock();
synchronize_rcu();
rcu_read_unlock();
The synchronize_rcu() cannot return until all pre-existing RCU read-side critical
sections complete, but is enclosed in an RCU read-side critical section that cannot
complete until the synchronize_rcu() returns. The result is a classic self-deadlock—
you get the same effect when attempting to write-acquire a reader-writer lock while
read-holding it.
Note that this self-deadlock scenario does not apply to RCU QSBR, because the
context switch performed by the synchronize_rcu() would act as a quiescent state
for this CPU, allowing a grace period to complete. However, this is if anything even
worse, because data used by the RCU read-side critical section might be freed as a result
of the grace period completing. Plus, the Linux kernel’s lockdep facility will yell at you.
In short, do not invoke synchronous RCU update-side primitives, which are listed in
Table 9.2, from within an RCU read-side critical section.
In addition, within the Linux kernel, RCU uses the scheduler and the scheduler uses
RCU. In some cases, both RCU and the scheduler must take care to avoid deadlock. ❑
Answer:
It really does work. After all, if it didn’t work, the Linux kernel would not run. ❑
Answer:
Quite a few!
Please keep in mind that the finite speed of light means that data reaching a given
computer system is at least slightly stale at the time that it arrives, and extremely stale
in the case of astronomical data. The finite speed of light also places a sharp limit on
the consistency of data arriving from different sources or via different paths.
You might as well face the fact that the laws of physics are incompatible with naive
notions of perfect freshness and consistency. ❑
Answer:
Maybe, but these are less likely.
In the case of Tasks RCU, recall that the quiescent state is a voluntary context switch.
Thus, all tasks not blocked after a voluntary context switch might need to be boosted,
and the mechanics of deboosting would not likely be at all pretty.
In the case of Tasks RCU Rude, as was the case with the old RCU Sched, any
preemptible region of code is a quiescent state. Thus, the only tasks that might need
boosting are those currently running with preemption disabled. But boosting the priority
of a preemption-disabled task has no effect. It therefore seems doubly unlikely that
priority boosting will ever be introduced to Tasks RCU Rude, at least in its current form.
❑
Answer:
Which grace period, exactly?
The updater is required to wait for at least one grace period that starts at or some
time after the removal, in this case, the xchg(). So in Figure 9.29, the indicated
grace period starts as early as theoretically possible and extends to the return from
synchronize_rcu(). This is a perfectly legal grace period corresponding to the
change carried out by that xchg() statement. ❑
Answer:
Not at all.
Hazard pointers can be considered to combine temporal and spatial synchronization in
a similar manner. Referring to Listing 9.4, the hp_record() function’s acquisition of a
reference provides both spatial and temporal synchronization, subscribing to a version and
marking the start of a reference, respectively. This function therefore combines the effects
of RCU’s rcu_read_lock() and rcu_dereference(). Referring now to Listing 9.5,
the hp_clear() function’s release of a reference provides temporal synchronization
marking the end of a reference, and is thus similar to RCU’s rcu_read_unlock().
The hazptr_free_later() function’s retiring of a hazard-pointer-protected object
provides temporal synchronization, similar to RCU’s call_rcu(). The primitives used
to mutate a hazard-pointer-protected structure provide spatial synchronization, similar
to RCU’s rcu_assign_pointer().
Alternatively, one could instead come at hazard pointers by analogy with reference
counting. ❑
Answer:
This is an effect of the Law of Toy Examples: Beyond a certain point, the code fragments
look the same. The only difference is in how we think about the code. For example,
what does an atomic_inc() operation do? It might be acquiring another explicit
reference to an object to which we already have a reference, it might be incrementing an
often-read/seldom-updated statistical counter, it might be checking into an HPC-style
barrier, or any of a number of other things.
However, these differences can be extremely important. For but one example of
the importance, consider that if we think of RCU as a restricted reference counting
scheme, we would never be fooled into thinking that the updates would exclude the
RCU read-side critical sections.
It nevertheless is often useful to think of RCU as a replacement for reader-writer
locking, for example, when you are replacing reader-writer locking with RCU. ❑
Answer:
Pre-BSD routing could be argued to fit into either quasi reader-writer lock, quasi
reference count, or quasi multi-version concurrency control. The code is the same either
way. This is similar to things like atomic_inc(), another tool that can be put to a great
many uses. ❑
Answer:
There were multiple independent inventions of mechanisms vaguely resembling RCU.
Each group of inventors was unaware of the others, so each made up its own terminology
as a matter of course. And the different terminology made it quite difficult for any one
group to find any of the others.
Sorry, but life is like that sometimes! ❑
Answer:
One reason is that Kung and Lehman were simply ahead of their time. Another reason
was that their approach, ground-breaking though it was, did not take a number of
software-engineering and performance issues into account.
To see that they were ahead of their time, consider that three years after their
paper was published, Paul was working on a PDP-11 system running BSD 2.8. This
system lacked any sort of automatic configuration, which meant that any hardware
modification, including adding a new disk drive, required hand-editing and rebuilding
the kernel. Furthermore, this was a single-CPU system, which meant that full-system
synchronization was a simple matter of disabling interrupts.
Fast-forward a number of years, and multicore systems permitting runtime changes in
hardware configuration were commonplace. This meant that the hardware configuration
data that was implicitly represented in 1980s kernel source code was now a mutable
data structure that was accessed on every I/O. Such data structures rarely change, but
could change at any time. And this read-mostly property applies to many other new-age
data structures, including those concerning networking (rare in the 1980s), security
policies (physical locks in the 1980s), software configuration (immutable at runtime in
the 1980s), and much else besides. There was thus much more opportunity for RCU to
demonstrate its benefits in the 1990s and 2000s than there was in the 1980s.
Kung’s and Lehman’s software-engineering sins included failing to mark readers (thus
presenting debugging difficulties), failing to provide a clean RCU API (thus tying their
mechanism to a specific data structure), and failing to allow for any post-grace-period
operation other than freeing memory (thus disallowing a number of RCU use cases).
Kung and Lehman presented two garbage-collection strategies. The first waited for
all processes running at a given time to terminate, which represented another
software-engineering sin that ruled out their mechanism’s use in software that runs indefinitely.
The second used per-object reference counting, which greatly complicates their read-side
code (thus representing yet another software-engineering sin), and, on modern hardware,
results in severe cache-miss overhead (thus representing a performance sin, see for
example Figures 9.30 and 9.31).
Despite this long list of software-engineering and performance sins, Kung’s and
Lehman’s paper remains a truly impressive piece of work, especially considering that
much of the later work (both independent and not) committed these same sins, plus
others as well. ❑
Answer:
The authors wished to support linearizable tree operations, so that concurrent additions
to, deletions from, and searches of the tree would appear to execute in some globally
agreed-upon order. In their search trees, this requires holding locks across grace
periods. (It is probably better to drop linearizability as a requirement in most cases, but
linearizability is a surprisingly popular (and costly!) requirement.) ❑
Answer:
They can, but at the expense of additional reader-traversal overhead and, in some
environments, the need to handle memory-allocation failure. ❑
Answer:
Yes they do, but the guarantee only applies unconditionally in cases where a reference
is already held. With this in mind, please review the paragraph at the beginning of
Section 9.6, especially the part saying “large enough that readers do not hold references
from one traversal to another”. ❑
Answer:
Yes, it did. However, doing this could be argued to change the hazard-pointers “Reclamation
Forward Progress” row (discussed later) from lock-free to blocking because a CPU
spinning with interrupts disabled in the kernel would prevent the update-side portion
of the asymmetric barrier from completing. In the Linux kernel, such blocking could
in theory be prevented by building the kernel with CONFIG_NO_HZ_FULL, designating
the relevant CPUs as nohz_full at boot time, ensuring that only one thread was ever
runnable on a given CPU at a given time, and avoiding ever calling into the kernel.
Alternatively, you could ensure that the kernel was free of any bugs that might cause
CPUs to spin with interrupts disabled.
Given that CPUs spinning in the Linux kernel with interrupts disabled seems to be
rather rare, one might counter-argue that asymmetric-barrier hazard-pointer updates are
non-blocking in practice, if not in theory. ❑
Answer:
Indeed it is! However, hash tables quite frequently store information with keys such
as character strings that do not necessarily fit into an unsigned long. Simplifying the
hash-table implementation for the case where keys always fit into unsigned longs is left
as an exercise for the reader. ❑
Answer:
The answer depends on a great many things. If the hash table has a large number of
elements per bucket, it would clearly be better to increase the number of hash buckets.
On the other hand, if the hash table is lightly loaded, the answer depends on the
hardware, the effectiveness of the hash function, and the workload. Interested readers
are encouraged to experiment. ❑
Answer:
You can do just that! In fact, you can extend this idea to large clustered systems,
running one copy of the application on each node of the cluster. This practice is called
“sharding”, and is heavily used in practice by large web-based retailers [DHJ+07].
However, if you are going to shard on a per-socket basis within a multisocket system,
why not buy separate smaller and cheaper single-socket systems, and then run one shard
of the database on each of those systems? ❑
E.10. DATA STRUCTURES 799
that mean that a lookup could return a reference to a data element that was removed
immediately after it was looked up?
Answer:
Yes it can! This is why hashtab_lookup() must be invoked within an RCU read-side
critical section, and it is why hashtab_add() and hashtab_del() must also use
RCU-aware list-manipulation primitives. Finally, this is why the caller of
hashtab_del() must wait for a grace period (e.g., by calling synchronize_rcu()) before
freeing the removed element. This will ensure that all RCU readers that might reference
the newly removed element have completed before that element is freed. ❑
1. Have some subset of elements that always reside in the table, and verify that
lookups always find these elements regardless of the number and type of concurrent
updates in flight.
2. Pair an updater with one or more readers, verifying that after an element is added,
once a reader successfully looks up that element, all later lookups succeed. The
definition of “later” will depend on the table’s consistency requirements.
3. Pair an updater with one or more readers, verifying that after an element is deleted,
once a reader’s lookup of that element fails, all later lookups also fail.
There are many more tests where those came from, the exact nature of which depends
on the details of the requirements on your particular hash table. ❑
Answer:
Excellent question!
False sharing requires writes, which are not featured in the unsynchronized and RCU
runs of this lookup-only benchmark. The problem is therefore not false sharing.
Still unconvinced? Then look at the log-log plot in Figure E.6, which shows
performance for 448 CPUs as a function of the hash-table size, that is, number of buckets
and maximum number of elements. A hash-table of size 1,024 has 1,024 buckets
and contains at most 1,024 elements, with the average occupancy being 512 elements.
Because this is a read-only benchmark, the actual occupancy is always equal to the
average occupancy.
[Figure E.6: Log-log plot of performance versus hash table size (buckets and maximum elements); trace labeled QSBR, RCU, hazptr]
This figure shows near-ideal performance below about 8,000 elements, that is, when
the hash table comprises less than 1 MB of data. This near-ideal performance is
consistent with that for the pre-BSD routing table shown in Figure 9.21 on page 252,
even at 448 CPUs. However, the performance drops significantly (this is a log-log
plot) at about 8,000 elements, which is where the 1,048,576-byte L2 cache overflows.
Performance falls off a cliff (even on this log-log plot) at about 300,000 elements, where
the 40,370,176-byte L3 cache overflows. This demonstrates that the memory-system
bottleneck is profound, degrading performance by well in excess of an order of magnitude
for the large hash tables. This should not be a surprise, as the size-8,388,608 hash table
occupies about 1 GB of memory, overflowing the L3 caches by a factor of 25.
The reason that Figure 10.4 on page 295 shows little effect is that its data was gathered
from bucket-locked hash tables, where locking overhead and contention drowned out
cache-capacity effects. In contrast, both RCU and hazard-pointers readers avoid stores
to shared data, which means that the cache-capacity effects come to the fore.
Still not satisfied? Find a multi-socket system and run this code, making use of
whatever performance-counter hardware is available. This hardware should allow
you to track down the precise cause of any slowdowns exhibited on your particular
system. The experience gained by doing this exercise will be extremely valuable, giving
you a significant advantage over those whose understanding of this issue is strictly
theoretical.10 ❑
than performed by the benchmarks in this chapter. For example, such systems might
be processing images or video streams stored in each element, providing further
performance benefits due to the fact that the resulting sequential memory accesses will
make better use of the available memory bandwidth than will a pure pointer-following
workload.
But let this be a lesson to you. Modern computer systems come in a great many
shapes and sizes, and great care is frequently required to select one that suits your
application. And perhaps even more frequently, significant care and work is required to
adjust your application to the specific computer systems at hand. ❑
Answer:
In theory, no, it isn’t any safer, and a useful exercise would be to run these programs on
larger systems. In practice, there are only a very few systems with more than 448 CPUs,
in contrast to the huge number having more than 28 CPUs. This means that although it
is dangerous to extrapolate beyond 448 CPUs, there is very little need to do so.
In addition, other testing has shown that RCU read-side primitives offer consistent
performance and scalability up to at least 1024 CPUs. However, it is useful to review
Figure E.6 and its associated commentary. You see, unlike the 448-CPU system that
provided this data, the system enjoying linear scalability up to 1024 CPUs boasted
excellent memory bandwidth. ❑
Answer:
It does not provide any such protection. That is instead the job of the update-side
concurrency-control functions described next. ❑
Answer:
The second resize operation will not be able to move beyond the bucket into which the
insertion is taking place due to the insertion holding the lock(s) on one or both of the
hash buckets in the hash tables. Furthermore, the insertion operation takes place within
an RCU read-side critical section. As we will see when we examine the
hashtab_resize() function, this means that each resize operation uses synchronize_rcu()
invocations to wait for the insertion’s read-side critical section to complete. ❑
tions. Doesn’t this mean that readers might miss an element that was previously added
during a resize operation?
Answer:
No. As we will see soon, the hashtab_add() and hashtab_del() functions keep the
old hash table up-to-date while a resize operation is in progress. ❑
Answer:
Yes, at least assuming that a slight increase in the cost of hashtab_lookup() is
acceptable. One approach is shown in Listings E.7 and E.8 (hash_resize_s.c).
This version of hashtab_add() adds an element to either the old bucket if it is not
resized yet, or to the new bucket if it has been resized, and hashtab_del() removes
the specified element from any buckets into which it has been inserted. The
hashtab_lookup() function searches the new bucket if the search of the old bucket fails,
which has the disadvantage of adding overhead to the lookup fastpath. The alternative
hashtab_lock_mod() returns the locking state of the new bucket in ->hbp[0] and
->hls_idx[0] if a resize operation is in progress, instead of the perhaps more natural
choice of ->hbp[1] and ->hls_idx[1]. However, this less-natural choice has the
advantage of simplifying hashtab_add().
Further analysis of the code is left as an exercise for the reader. ❑
Answer:
The synchronize_rcu() on line 30 of Listing 10.13 ensures that all pre-existing RCU
readers have completed between the time that we install the new hash-table reference on
line 29 and the time that we update ->ht_resize_cur on line 40. This means that any
reader that sees a non-negative value of ->ht_resize_cur cannot have started before
the assignment to ->ht_new, and thus must be able to see the reference to the new hash
table.
And this is why the update-side hashtab_add() and hashtab_del() functions
must be enclosed in RCU read-side critical sections, courtesy of hashtab_lock_mod()
and hashtab_unlock_mod() in Listing 10.11. ❑
[Figure E.7: Log-log plot of performance versus number of CPUs (threads); includes a trace labeled 2,097,152]
Answer:
Together with the READ_ONCE() on line 16 in hashtab_lock_mod() of Listing 10.11,
it tells the compiler that the non-initialization accesses to ->ht_resize_cur must
remain because reads from ->ht_resize_cur really can race with writes, just not in a
way to change the “if” conditions. ❑
Answer:
The easy way to answer this question is to do another run with 2,097,152 elements, but
this time also with 2,097,152 buckets, thus bringing the average number of elements per
bucket back down to unity.
The results are shown by the triple-dashed new trace in the middle of Figure E.7. The
other six traces are identical to their counterparts in Figure 10.19 on page 313. The gap
between this new trace and the lower set of three traces is a rough measure of how much
of the difference in performance was due to hash-chain length, and the gap between
the new trace and the upper set of three traces is a rough measure of how much of that
difference was due to memory-system bottlenecks. The new trace starts out slightly
below its 262,144-element counterpart at a single CPU, showing that cache capacity
is degrading performance slightly even on that single CPU.11 This is to be expected,
given that unlike its smaller counterpart, the 2,097,152-bucket hash table does not fit
into the L3 cache. This new trace rises just past 28 CPUs, which is also to be expected.
This rise is due to the fact that the 29th CPU is on another socket, which brings with it
an additional 39 MB of cache as well as additional memory bandwidth.
But the large hash table’s advantage over that of the hash table with 524,288 buckets
(but still 2,097,152 elements) decreases with additional CPUs, which is consistent
with the bottleneck residing in the memory system. Above about 400 CPUs, the
11 Yes, as far as hardware architects are concerned, caches are part of the memory system.
Answer:
The answer to the first question is left as an exercise to the reader. Try specializing the
resizable hash table and see how much performance improvement results. The second
question cannot be answered in general, but must instead be answered with respect to a
specific use case. Some use cases are extremely sensitive to performance and scalability,
while others are less so. ❑
E.11 Validation
Answer:
There are any number of situations, but perhaps the most important situation is when no
one has ever created anything resembling the program to be developed. In this case, the
only way to create a credible plan is to implement the program, create the plan, and
implement it a second time. But whoever implements the program for the first time
has no choice but to follow a fragmentary plan because any detailed plan created in
ignorance cannot survive first contact with the real world.
And perhaps this is one reason why evolution has favored insanely optimistic human
beings who are happy to follow fragmentary plans! ❑
Answer:
Yes, projects are important, but if you like being paid for your work, you need
organizations as well as projects. ❑
real 0m0.132s
user 0m0.040s
sys 0m0.008s
The script is required to check its input for errors, and to give appropriate diagnostics
if fed erroneous time output. What test inputs should you provide to this program to
test it for use with time output generated by single-threaded programs?
Answer:
Can you say “Yes” to all the following questions?
1. Do you have a test case in which all the time is consumed in user mode by a
CPU-bound program?
2. Do you have a test case in which all the time is consumed in system mode by a
CPU-bound program?
3. Do you have a test case in which all three times are zero?
4. Do you have a test case in which the “user” and “sys” times sum to more than the
“real” time? (This would of course be completely legitimate in a multithreaded
program.)
5. Do you have a set of test cases in which one of the times uses more than one
second?
6. Do you have a set of test cases in which one of the times uses more than ten
seconds?
7. Do you have a set of test cases in which one of the times has non-zero minutes?
(For example, “15m36.342s”.)
8. Do you have a set of test cases in which one of the times has a seconds value of
greater than 60?
9. Do you have a set of test cases in which one of the times overflows 32 bits of
milliseconds? 64 bits of milliseconds?
10. Do you have a set of test cases in which one of the times is negative?
11. Do you have a set of test cases in which one of the times has a positive minutes
value but a negative seconds value?
12. Do you have a set of test cases in which one of the times omits the “m” or the “s”?
13. Do you have a set of test cases in which one of the times is non-numeric? (For
example, “Go Fish”.)
14. Do you have a set of test cases in which one of the lines is omitted? (For example,
where there is a “real” value and a “sys” value, but no “user” value.)
15. Do you have a set of test cases where one of the lines is duplicated? Or duplicated,
but with a different time value for the duplicate?
16. Do you have a set of test cases where a given line has more than one time value?
(For example, “real 0m0.132s 0m0.008s”.)
18. In all test cases involving invalid input, did you generate all permutations?
19. For each test case, do you have an expected outcome for that test?
If you did not generate test data for a substantial number of the above cases, you will
need to cultivate a more destructive attitude in order to have a chance of generating
high-quality tests.
Of course, one way to economize on destructiveness is to generate the tests with
the to-be-tested source code at hand, which is called white-box testing (as opposed to
black-box testing). However, this is no panacea: You will find that it is all too easy to
find your thinking limited by what the program can handle, thus failing to generate truly
destructive inputs. ❑
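To make a few of the checks above concrete, here is a minimal sketch of a strict parser for a single time field such as “0m0.132s”. The function name parse_time_field() and its accept/reject policy (for example, rejecting a seconds value of 60 or more) are assumptions made for illustration. They are not taken from the script described in the question.

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: parse "<minutes>m<seconds>s", e.g. "15m36.342s".
 * Returns 0 and stores the total number of seconds in *secs on success,
 * or -1 if the field is malformed.
 */
int parse_time_field(const char *s, double *secs)
{
	char *end;
	long min;
	double sec;
	size_t n;

	/* Minutes: one or more digits followed by "m". */
	n = strspn(s, "0123456789");
	if (n == 0 || s[n] != 'm')
		return -1;	/* Rejects "-1m0s", "Go Fish", and "". */
	errno = 0;
	min = strtol(s, NULL, 10);
	if (errno)
		return -1;	/* Minutes overflow a long. */

	/* Seconds: digits and decimal point, then "s", then end of string. */
	s += n + 1;
	n = strspn(s, "0123456789.");
	if (n == 0 || s[n] != 's' || s[n + 1] != '\0')
		return -1;	/* Negative, missing "s", or trailing junk. */
	sec = strtod(s, &end);
	if (end != s + n)
		return -1;	/* For example "1.2.3" or a lone ".". */
	if (sec >= 60.0)
		return -1;	/* Policy choice: seconds must be below 60. */

	*secs = min * 60.0 + sec;
	return 0;
}
```

Each rejected case corresponds to one of the questions in the list, which is also a reminder that a parser written with the list in hand will tend to handle exactly those cases and no others.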
Answer:
If it is your project, for example, a hobby, do what you like. Any time you waste will
be your own, and you have no one else to answer to for it. And there is a good chance
that the time will not be completely wasted. For example, if you are embarking on a
first-of-a-kind project, the requirements are in some sense unknowable anyway. In this
case, the best approach might be to quickly prototype a number of rough solutions, try
them out, and see what works best.
On the other hand, if you are being paid to produce a system that is broadly similar
to existing systems, you owe it to your users, your employer, and your future self to
validate early and often. ❑
Answer:
Please note that the text used the word “validation” rather than the word “testing”. The
word “validation” includes formal methods as well as testing, for more on which please
see Chapter 12.
But as long as we are bringing up things that everyone should know, let’s remind
ourselves that Darwinian evolution is not about correctness, but rather about survival.
As is software. My goal as a developer is not that my software be attractive from a
theoretical viewpoint, but rather that it survive whatever its users throw at it.
Although the notion of correctness does have its uses, its fundamental limitation is
that the specification against which correctness is judged will also have bugs. This
means nothing more nor less than that traditional correctness proofs prove that the code
in question contains the intended set of bugs!
Alternative definitions of correctness instead focus on the lack of problematic
properties, for example, proving that the software has no use-after-free bugs, no NULL
pointer dereferences, no array-out-of-bounds references, and so on. Make no mistake,
finding and eliminating such classes of bugs can be highly useful. But the fact remains
that the lack of certain classes of bugs does nothing to demonstrate fitness for any
specific purpose.
Therefore, usage-driven validation remains critically important.
Besides, it is also impossible to verify correctness into your software, especially given
the problematic need to verify both the verifier and the specification. ❑
Answer:
If you don’t mind WARN_ON_ONCE() sometimes warning more than once, simply
maintain a static variable that is initialized to zero. If the condition triggers, check the
variable, and if it is non-zero, return. Otherwise, set it to one, print the message, and
return.
If you really need the message to never appear more than once, you can use an atomic
exchange operation in place of “set it to one” above. Print the message only if the atomic
exchange operation returns zero. ❑
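Both variants can be sketched in userspace C11 as follows. The function names are hypothetical, the flag uses C11 atomics rather than kernel primitives, and this is not the Linux kernel's WARN_ON_ONCE() implementation. Each function returns 1 if it printed the message and 0 otherwise.

```c
#include <stdatomic.h>
#include <stdio.h>

/* Relaxed variant: two racing callers might both see zero and both print. */
int warn_once_relaxed(int cond, const char *msg)
{
	static atomic_int warned;	/* Static storage, so initially zero. */

	if (!cond)
		return 0;
	if (atomic_load_explicit(&warned, memory_order_relaxed))
		return 0;		/* Already warned. */
	atomic_store_explicit(&warned, 1, memory_order_relaxed);
	fprintf(stderr, "WARNING: %s\n", msg);
	return 1;
}

/* Strict variant: the atomic exchange returns zero to exactly one caller. */
int warn_once_strict(int cond, const char *msg)
{
	static atomic_int warned;

	if (!cond)
		return 0;
	if (atomic_exchange(&warned, 1) != 0)
		return 0;		/* Someone else already printed. */
	fprintf(stderr, "WARNING: %s\n", msg);
	return 1;
}
```

The strict variant pays for an atomic read-modify-write operation each time the condition triggers, which is usually acceptable given that the condition is supposed to be rare.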
Answer:
Those wishing a complete answer to this question are encouraged to search the Linux
kernel git repository for commits containing the string “Fixes:”. There were many
thousands of them just in the year 2020, including fixes for the following invalid
assumptions:
2. Userspace can be trusted to zero out versioned data structures used to communicate
with the kernel. (Hint: Sometimes userspace has no idea how large the data
structure is.)
5. Once a data structure is no longer needed, all of its memory may be immediately
freed.
Those who look at these commits in greater detail will conclude that invalid
assumptions are the rule, not the exception. ❑
Answer:
If you are worried about transcription errors, please allow me to be the first to introduce
you to a really cool tool named diff. In addition, carrying out the copying can be quite
valuable:
1. If you are copying a lot of code, you are probably failing to take advantage of an
opportunity for abstraction. The act of copying code can provide great motivation
for abstraction.
2. Copying the code gives you an opportunity to think about whether the code really
works in its new setting. Is there some non-obvious constraint, such as the need to
disable interrupts or to hold some lock?
3. Copying the code also gives you time to consider whether there is some better way
to get the job done.
So, yes, copy the code! ❑
Answer:
Indeed, repeatedly copying code by hand is laborious and slow. However, when
combined with heavy-duty stress testing and proofs of correctness, this approach is also
extremely effective for complex parallel code where ultimate performance and reliability
are required and where debugging is difficult. The Linux-kernel RCU implementation
is a case in point.
On the other hand, if you are writing a simple single-threaded shell script, then you
would be best-served by a different methodology. For example, enter each command
one at a time into an interactive shell with a test data set to make sure that it does what
you want, then copy-and-paste the successful commands into your script. Finally, test
the script as a whole.
If you have a friend or colleague who is willing to help out, pair programming can
work very well, as can any number of formal design- and code-review processes.
And if you are writing code as a hobby, then do whatever you like.
In short, different types of software need different development methodologies. ❑
Answer:
The answer, as is often the case, is “it depends”. If the bug is a simple typo, fix that
typo and continue typing. However, if the bug indicates a design flaw, go back to pen
and paper. ❑
Answer:
Because complexity and concurrency can produce results that are indistinguishable
from randomness [MOZ09]. For example, a bug in Linux-kernel RCU required the
following to hold before that bug would manifest:
1. The kernel was built for HPC or real-time use, so that a given CPU’s RCU work
could be offloaded to some other CPU.
2. An offloaded CPU went offline just after generating a large quantity of RCU work.
3. A special rcu_barrier() API was invoked just at this time.
4. The RCU work from the newly offlined CPU was still being processed after
rcu_barrier() returned.
5. One of these remaining RCU work items was related to the code invoking the
rcu_barrier().
Making this bug manifest therefore required considerable luck or great testing skill. But
the testing skill could be effective only if the bug was known, which of course it was
not. Therefore, the manifesting of this bug was very well modeled as a probabilistic
process. ❑
Answer:
This approach might well be a valuable addition to your validation arsenal. But it does
have limitations that rule out “for all practical purposes”:
1. Some bugs have extremely low probabilities of occurrence, but nevertheless need
to be fixed. For example, suppose that the Linux kernel’s RCU implementation
had a bug that is triggered only once per million years of machine time on average.
A million years of CPU time is hugely expensive even on the cheapest cloud
platforms, but we could expect this bug to result in more than 50 failures per day
on the more than 20 billion Linux instances in the world as of 2017.
2. The bug might well have zero probability of occurrence on your particular cloud-
computing test setup, which means that you won’t see it no matter how much
machine time you burn testing it. For but one example, there are RCU bugs that
appear only in preemptible kernels, and also other RCU bugs that appear only in
non-preemptible kernels.
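The arithmetic behind the first item can be checked as follows, assuming that all 20 billion instances run continuously and fail independently at the stated rate:

```latex
\[
\frac{2 \times 10^{10}\ \text{instances}}{10^{6}\ \text{machine-years per failure}}
  = 2 \times 10^{4}\ \text{failures per year}
  \approx \frac{20{,}000}{365}\ \text{failures per day}
  \approx 55\ \text{failures per day}
\]
```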
Of course, if your code is small enough, formal validation may be helpful, as discussed
in Chapter 12. But beware: Formal validation of your code will not find errors in your
assumptions, misunderstanding of the requirements, misunderstanding of the software
or hardware primitives you use, or errors that you did not think to construct a proof for.
❑
Answer:
You are right, that makes no sense at all.
Remember that a probability is a number between zero and one, so that you need to
divide a percentage by 100 to get a probability. So 10 % is a probability of 0.1, which
yields a result of 0.4095, which rounds to 41 %, which quite sensibly matches the
earlier result. ❑
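As a worked check, assume that the earlier result came from the usual formula for at least one failure in n independent runs, with n = 5. The value of n is inferred here from the 0.4095 figure and is not stated in this answer.

```latex
\[
1 - (1 - 0.1)^{5} = 1 - 0.9^{5} = 1 - 0.59049 = 0.40951 \approx 41\,\%
\]
```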
Answer:
It does not matter. You will get the same answer no matter what base of logarithms you
use because the result is a pure ratio of logarithms. The only constraint is that you use
the same base for both the numerator and the denominator. ❑
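This follows from the change-of-base identity. For any valid base b, and writing x and y for the arguments of the two logarithms (placeholder symbols, not names used in the chapter):

```latex
\[
\frac{\log_b x}{\log_b y}
  = \frac{\ln x / \ln b}{\ln y / \ln b}
  = \frac{\ln x}{\ln y}
\]
```

The factor of ln b appears in both numerator and denominator and therefore cancels.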
Answer:
We set 𝑛 to 3 and 𝑃 to 99.9 in Eq. 11.11, resulting in:
\[ T = -\frac{1}{3} \ln \frac{100 - 99.9}{100} = 2.3 \tag{E.9} \]
If the test runs without failure for 2.3 hours, we can be 99.9 % certain that the fix
reduced the probability of failure. ❑
Answer:
One approach is to use the open-source symbolic manipulation program named “maxima”.
Once you have installed this program, which is a part of many Linux distributions,
you can run it and give the load(distrib); command followed by any number of
bfloat(cdf_poisson(m,l)); commands, where the m is replaced by the desired
value of 𝑚 (the actual number of failures in the actual test) and the l is replaced by the
desired value of 𝜆 (the expected number of failures in the actual test).
In particular, the bfloat(cdf_poisson(2,24)); command results in
1.181617112359357b-8, which matches the value given by Eq. 11.13.
Another approach is to recognize that in the real world, it is not all that useful to
compute (say) the duration of a test having two or fewer errors that would give a 76.8 %
confidence of a 349.2x improvement in reliability. Instead, human beings tend to focus
on specific values, for example, a 95 % confidence of a 10x improvement. People also
greatly prefer error-free test runs, and so should you because doing so reduces your
required test durations. Therefore, it is quite possible that the values in Table E.4 will
suffice. Simply look up the desired confidence and degree of improvement, and the
resulting number will give you the required error-free test duration in terms of the
expected time for a single error to appear. So if your pre-fix testing suffered one failure
per hour, and the powers that be require a 95 % confidence of a 10x improvement, you
need a 30-hour error-free run.
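The 30-hour figure can be reproduced with a short calculation. The sketch below rests on an assumption not stated in this answer: failures are Poisson-distributed, so that if the improvement were only k-fold, an error-free run of duration T (measured in pre-fix mean times between failures) would have probability exp(-T/k). Requiring that probability to be no more than 1 - C gives T = -k ln(1 - C). The function name is hypothetical.

```python
import math

def error_free_duration(confidence_pct, improvement):
    """Error-free test duration, in units of the pre-fix expected time
    to a single error, under the Poisson assumption described above."""
    return -improvement * math.log(1.0 - confidence_pct / 100.0)

# One failure per hour before the fix, 95 % confidence, 10x improvement.
hours = error_free_duration(95, 10)
print(f"{hours:.1f}")
```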
Alternatively, you can use the rough-and-ready method described in Section 11.6.2.
❑
Answer:
Indeed it should. And it does.
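To see this, note that $e^{-\lambda}$ does not depend on $i$, which means that it can be pulled out
of the summation as follows:

\[ e^{-\lambda} \sum_{i=0}^{\infty} \frac{\lambda^i}{i!} \tag{E.10} \]

The remaining summation is the Taylor series for $e^{\lambda}$, yielding:

\[ e^{-\lambda} e^{\lambda} \tag{E.11} \]

The two exponentials are reciprocals, and therefore cancel, resulting in exactly 1, as
required. ❑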
Answer:
Indeed, that can happen. Many CPUs have hardware-debugging facilities that can help
you locate that unrelated pointer. Furthermore, if you have a core dump, you can search
the core dump for pointers referencing the corrupted region of memory. You can also
look at the data layout of the corruption, and check pointers whose type matches that
layout.
You can also step back and test the modules making up your program more intensively,
which will likely confine the corruption to the module responsible for it. If this makes
the corruption vanish, consider adding additional argument checking to the functions
exported from each module.
Nevertheless, this is a hard problem, which is why I used the words “a bit of a dark
art”. ❑
Answer:
A huge commit? Shame on you! This is but one reason why you are supposed to keep
the commits small.
And that is your answer: Break up the commit into bite-sized pieces and bisect the
pieces. In my experience, the act of breaking up the commit is often sufficient to make
the bug painfully obvious. ❑
Answer:
There are locking algorithms that depend on conditional-locking primitives telling them
the truth. For example, if conditional-lock failure signals that some other thread is
already working on a given job, spurious failure might cause that job to never get done,
possibly resulting in a hang. ❑
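The pattern can be sketched as follows, using Python's `threading.Lock.acquire(blocking=False)` as a stand-in for a conditional-locking primitive. The function and variable names are hypothetical, and the sketch shows only the case where the primitive tells the truth.

```python
import threading

job_lock = threading.Lock()
completed = []

def try_do_job(name):
    """Run the job only if nobody else appears to be running it."""
    if not job_lock.acquire(blocking=False):
        # Failure is taken to mean "some other thread is already
        # doing this job".  A spurious failure here would leave
        # the job undone, with nobody left to do it.
        return False
    try:
        completed.append(name)
        return True
    finally:
        job_lock.release()

# Lock is free: the job runs.
ran_when_free = try_do_job("first")

# Lock is genuinely held elsewhere: skipping the job is correct.
job_lock.acquire()
ran_when_held = try_do_job("second")
job_lock.release()

print(ran_when_free, ran_when_held, completed)
```

The caller never retries after a failed acquisition, which is exactly why the algorithm depends on the primitive failing only when the lock really is held.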
Answer:
This question fails to consider the option of choosing not to compute the answer at all,
and in doing so, also fails to consider the costs of computing the answer. For example,
consider short-term weather forecasting, for which accurate models exist, but which
require large (and expensive) clustered supercomputers, at least if you want to actually
run the model faster than the weather.
And in this case, any performance bug that prevents the model from running faster
than the actual weather prevents any forecasting. Given that the whole purpose of
purchasing the large clustered supercomputers was to forecast weather, if you cannot run
the model faster than the weather, you would be better off not running the model at all.
More severe examples may be found in the area of safety-critical real-time computing.
❑
Answer:
Although I do heartily salute your spirit and aspirations, you are forgetting that there
may be high costs due to delays in the program’s completion. For an extreme example,
suppose that a 40 % performance shortfall from a single-threaded application is causing
one person to die each day. Suppose further that in a day you could hack together a
quick and dirty parallel program that ran 50 % faster on an eight-CPU system than the
sequential version, but that an optimal parallel program would require four months of
painstaking design, coding, debugging, and tuning.
It is safe to say that more than 100 people would prefer the quick and dirty version. ❑
Answer:
Changes in memory layout can indeed result in unrealistic decreases in execution time.
For example, suppose that a given microbenchmark almost always overflows the L0
cache’s associativity, but with just the right memory layout, it all fits. If this is a real
concern, consider running your microbenchmark using huge pages (or within the kernel
or on bare metal) in order to completely control the memory layout.
But note that there are many different possible memory-layout bottlenecks. Benchmarks
sensitive to memory bandwidth (such as those involving matrix arithmetic) should
spread the running threads across the available cores and sockets to maximize memory
parallelism. They should also spread the data across NUMA nodes, memory controllers,
and DRAM chips to the extent possible. In contrast, benchmarks sensitive to memory
latency (including most poorly scaling applications) should instead maximize locality,
filling each core and socket in turn before adding another one. ❑
Answer:
Indeed it might, although in most microbenchmarking efforts you would extract the
code under test from the enclosing application. Nevertheless, if for some reason you
must keep the code under test within the application, you will very likely need to use
the techniques discussed in Section 11.7.6. ❑
Answer:
Because mean and standard deviation were not designed to do this job. To see this, try
E.12. FORMAL VERIFICATION 815
applying mean and standard deviation to the following data set, given a 1 % relative
error in measurement:
The problem is that mean and standard deviation do not rest on any sort of measurement-
error assumption, and they will therefore see the difference between the values near
49,500 and those near 49,900 as being statistically significant, when in fact they are
well within the bounds of estimated measurement error.
Of course, it is possible to create a script similar to that in Listing 11.2 that uses
standard deviation rather than absolute difference to get a similar effect, and this is left
as an exercise for the interested reader. Be careful to avoid divide-by-zero errors arising
from strings of identical data values! ❑
Answer:
Indeed it will! But if your performance measurements often produce a value of exactly
zero, perhaps you need to take a closer look at your performance-measurement code.
Note that many approaches based on mean and standard deviation will have similar
problems with this sort of dataset. ❑
Answer:
There are several:
1. The declaration of sum should be moved to within the init block, since it is not
used anywhere else.
2. The assertion code should be moved outside of the initialization loop. The
initialization loop can then be placed in an atomic block, greatly reducing the state
space (by how much?).
3. The atomic block covering the assertion code should be extended to include the
initialization of sum and j, and also to cover the assertion. This also reduces the
state space (again, by how much?). ❑
Answer:
Yes. Replace it with if-fi and remove the two break statements. ❑
Answer:
Because those operations are for the benefit of the assertion only. They are not part of
the algorithm itself. There is therefore no harm in marking them atomic, and so marking
them greatly reduces the state space that must be searched by the Promela model. ❑
Answer:
Yes. To see this, delete these lines and run the model.
Alternatively, consider the following sequence of steps:
1. One process is within its RCU read-side critical section, so that the value of ctr[0]
is zero and the value of ctr[1] is two.
2. An updater starts executing, and sees that the sum of the counters is two so that the
fastpath cannot be executed. It therefore acquires the lock.
3. A second updater starts executing, and fetches the value of ctr[0], which is zero.
4. The first updater adds one to ctr[0], flips the index (which now becomes zero),
then subtracts one from ctr[1] (which now becomes one).
5. The second updater fetches the value of ctr[1], which is now one.
6. The second updater now incorrectly concludes that it is safe to proceed on the
fastpath, despite the fact that the original reader has not yet completed. ❑
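The six steps above can be replayed as straight-line Python, with no real threads involved. This is only an illustration of the interleaving: it assumes that the updater fastpath is taken when the sum of the two counters is at most one, and the variable names are hypothetical.

```python
# Step 1: one reader is in its read-side critical section.
ctr = [0, 2]
idx = 1  # current index, so that the flip in step 4 makes it zero

# Step 2: the first updater sums the counters and cannot use the fastpath.
first_sum = ctr[0] + ctr[1]
first_uses_fastpath = first_sum <= 1  # False, so it acquires the lock

# Step 3: the second updater fetches ctr[0], which is zero.
second_c0 = ctr[0]

# Step 4: the first updater adds one to ctr[0], flips the index,
# then subtracts one from ctr[1].
ctr[0] += 1
idx = 0
ctr[1] -= 1

# Step 5: the second updater fetches ctr[1], which is now one.
second_c1 = ctr[1]

# Step 6: the second updater combines a stale ctr[0] with a fresh ctr[1].
second_sum = second_c0 + second_c1
second_uses_fastpath = second_sum <= 1

print(first_uses_fastpath, second_uses_fastpath, ctr)
```

The second updater's sum is one even though the counters actually sum to two at the end, because its two fetches straddle the first updater's changes.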
Answer:
According to Spin’s documentation, yes, it is.
As indirect evidence, let’s compare the results of runs with -DCOLLAPSE and with
-DMA=88 (two readers and three updaters). The diff of outputs from those runs is shown
in Listing E.9. As you can see, they agree on the numbers of states (stored and matched).
❑
Answer:
It is certainly true that many formal-verification tools are specialized in some way.
For example, Promela does not handle realistic memory models (though they can be
programmed into Promela [DMD13]), CBMC [CKL04] does not detect probabilistic
hangs and deadlocks, and Nidhugg [LSLK14] does not detect bugs involving data
nondeterminism. But this means that these tools cannot be trusted to find bugs that they
are not designed to locate.
And therefore people creating formal-verification tools should “tell the truth on
the label”, clearly calling out what classes of bugs their tools can and cannot detect.
Otherwise, the first time a practitioner finds a tool failing to detect a bug, that practitioner
is likely to make extremely harsh and extremely public denunciations of that tool. Yes,
yes, there is something to be said for putting your best foot forward, but putting it too
far forward without appropriate disclaimers can easily trigger a land mine of negative
reaction that your tool might or might not be able to recover from.
You have been warned! ❑
Answer:
There is always room for doubt. In this case, it is important to keep in mind that the
two proofs of correctness preceded the formalization of real-world memory models,
raising the possibility that these two proofs are based on incorrect memory-ordering
assumptions. Furthermore, since both proofs were constructed by the same person, it is
quite possible that they contain a common error. Again, there is always room for doubt.
❑
Answer:
Relax, there are a number of lawful answers to this question:
1. Try compiler flags -DCOLLAPSE and -DMA=N to reduce memory consumption. See
Section 12.1.4.1.
2. Further optimize the model, reducing its memory consumption.
3. Work out a pencil-and-paper proof, perhaps starting with the comments in the code
in the Linux kernel.
4. Devise careful torture tests, which, though they cannot prove the code correct, can
find hidden bugs.
5. There is some movement towards tools that do model checking on clusters of
smaller machines. However, please note that we have not actually used such tools
ourselves, courtesy of some large machines that Paul has occasional access to.
6. Wait for memory sizes of affordable systems to expand to fit your problem.
7. Use one of a number of cloud-computing services to rent a large system for a short
time period. ❑
Answer:
This fails in the presence of NMIs. To see this, suppose an NMI was received just
after rcu_irq_enter() incremented rcu_update_flag, but before it incremented
dynticks_progress_counter. The instance of rcu_irq_enter() invoked by the
NMI would see that the original value of rcu_update_flag was non-zero, and would
therefore refrain from incrementing dynticks_progress_counter. This would leave
the RCU grace-period machinery no clue that the NMI handler was executing on this
CPU, so that any RCU read-side critical sections in the NMI handler would lose their
RCU protection.
The possibility of NMI handlers, which, by definition, cannot be masked, does
complicate this code. ❑
Answer:
Not if we interrupted a running task! In that case, dynticks_progress_counter
would have already been incremented by rcu_exit_nohz(), and there would be no
need to increment it again. ❑
Answer:
Read the next section to see if you were correct. ❑
Answer:
Promela assumes sequential consistency, so it is not necessary to model memory barriers.
In fact, one must instead explicitly model lack of memory barriers, for example, as
shown in Listing 12.13 on page 369. ❑
Answer:
It probably would be more natural, but we will need this particular order for the liveness
checks that we will add later. ❑
Answer:
Because the grace-period code processes each CPU’s dynticks_progress_counter
and rcu_dyntick_snapshot variables separately, we can collapse the state onto a
single CPU. If the grace-period code were instead to do something special given
specific values on specific CPUs, then we would indeed need to model multiple CPUs.
But fortunately, we can safely confine ourselves to two CPUs, the one running the
grace-period processing and the one entering and leaving dynticks-idle mode. ❑
Answer:
Recall that Promela and Spin trace out every possible sequence of state changes.
Therefore, timing is irrelevant: Promela/Spin will be quite happy to jam the entire
rest of the model between those two statements unless some state variable specifically
prohibits doing so. ❑
Answer:
The easiest thing to do would be to put each such statement in its own
EXECUTE_MAINLINE() statement. ❑
Answer:
One approach, as we will see in a later section, is to use explicit labels and “goto”
statements. For example, the construct:

    if
    :: i == 0 -> a = -1;
    :: else -> a = -2;
    fi;

could be modeled something like this:

    EXECUTE_MAINLINE(stmt1,
            if
            :: i == 0 -> goto stmt1_then;
            :: else -> goto stmt1_else;
            fi)
    stmt1_then: skip;
    EXECUTE_MAINLINE(stmt1_then1, a = -1; goto stmt1_end)
    stmt1_else: skip;
    EXECUTE_MAINLINE(stmt1_else1, a = -2)
    stmt1_end: skip;

However, it is not clear that the macro is helping much in the case of the “if”
statement, so these sorts of situations will be open-coded in the following sections. ❑
Answer:
These lines of code pertain to controlling the model, not to the code being modeled, so
there is no reason to model them non-atomically. The motivation for modeling them
atomically is to reduce the size of the state space. ❑
Answer:
One such property is nested interrupts, which are handled in the following section. ❑
Answer:
Not always, but more and more frequently. In this case, Paul started with the smallest
slice of code that included an interrupt handler, because he was not sure how best to
model interrupts in Promela. Once he got that working, he added other features. (But if
he were doing it again, he would start with a “toy” handler. For example, he might have
the handler increment a variable twice and have the mainline code verify that the value
was always even.)
Why the incremental approach? Consider the following, attributed to Brian W.
Kernighan:
Debugging is twice as hard as writing the code in the first place. Therefore, if
you write the code as cleverly as possible, you are, by definition, not smart
enough to debug it.
This means that any attempt to optimize the production of code should place at least
66 % of its emphasis on optimizing the debugging process, even at the expense of
increasing the time and effort spent coding. Incremental coding and testing is one way
to optimize the debugging process, at the expense of some increase in coding effort.
Paul uses this approach because he rarely has the luxury of devoting full days (let alone
weeks) to coding and debugging. ❑
Answer:
This cannot happen within the confines of a single CPU. The first IRQ handler cannot
complete until the NMI handler returns. Therefore, if each of the dynticks and
dynticks_nmi variables has taken on an even value during a given time interval, the
corresponding CPU really was in a quiescent state at some time during that interval. ❑
Answer:
Although this approach would be functionally correct, it would result in excessive IRQ
entry/exit overhead on large machines. In contrast, the approach laid out in this section
allows each CPU to touch only per-CPU data on IRQ and NMI entry/exit, resulting in
much lower IRQ entry/exit overhead, especially on large machines. ❑
Answer:
Actually, academics consider the x86 memory model to be weak because it can allow
prior stores to be reordered with subsequent loads. From an academic viewpoint, a
strong memory model is one that allows absolutely no reordering, so that all threads
agree on the order of all operations visible to them.
Plus it really is the case that developers are sometimes confused about x86 memory
ordering. ❑
Answer:
Either way works. However, in general, it is better to use initialization than explicit
instructions. The explicit instructions are used in this example to demonstrate their use.
In addition, many of the litmus tests available on the tool’s web site
(https://www.cl.cam.ac.uk/~pes20/ppcmem/) were automatically generated, and that
automatic generation produces explicit initialization instructions. ❑
Answer:
The implementation of the PowerPC version of atomic_add_return() loops when the
stwcx instruction fails, which it communicates by setting non-zero status in the
condition-code register, which in turn is tested by the bne instruction. Because actually
modeling the loop would result in state-space explosion, we instead branch to the
Fail1: label, terminating the model with the initial value of 2 in P0’s r3 register, which
will not trigger the exists assertion.
There is some debate about whether this trick is universally applicable, but I have not
seen an example where it fails. ❑
Answer:
Arm does not have this particular bug because it places smp_mb() before and after the
atomic_add_return() function’s assembly-language implementation. PowerPC no
longer has this bug; it has long since been fixed [Her11]. ❑
Answer:
It depends on the semantics required. The rest of this answer assumes that the assembly
language for P0 in Listing 12.23 is supposed to implement a value-returning atomic
operation.
As is discussed in Chapter 15, the Linux kernel’s memory consistency model requires
value-returning atomic RMW operations to be fully ordered on both sides. The ordering
provided by lwsync is insufficient for this purpose, and so sync should be used instead.
This change has since been made [Fen15] in response to an email thread discussing a
couple of other litmus tests [McK15g]. Finding any other bugs that the Linux kernel
might have is left as an exercise for the reader.
In other environments providing weaker semantics, lwsync might be sufficient. But
not for the Linux kernel’s value-returning atomic operations! ❑
Answer:
Get version v4.17 (or later) of the Linux-kernel source code, then follow the instructions
in tools/memory-model/README to install the needed tools. Then follow the further
instructions to run these tools on the litmus test of your choice. ❑
Answer:
In a word, performance, as can be seen in Table E.5. The first column shows the number
of herd processes modeled. The second column shows the herd runtime when modeling
spin_lock() and spin_unlock() directly in herd’s cat language. The third column
shows the herd runtime when emulating spin_lock() with cmpxchg_acquire()
and spin_unlock() with smp_store_release(), using the herd filter clause
to reject executions that fail to acquire the lock. The fourth column is like the third,
but using xchg_acquire() instead of cmpxchg_acquire(). The fifth and sixth
columns are like the third and fourth, but instead using the herd exists clause to reject
executions that fail to acquire the lock.
Note also that use of the filter clause is about twice as fast as is use of the exists
clause. This is no surprise because the filter clause allows early abandoning of
excluded executions, where the executions that are excluded are the ones in which the
lock is concurrently held by more than one process.
More important, modeling spin_lock() and spin_unlock() directly ranges from
five times faster to more than two orders of magnitude faster than modeling emulated
locking. This should also be no surprise, as direct modeling raises the level of abstraction,
thus reducing the number of events that herd must model. Because almost everything
that herd does is of exponential computational complexity, modest reductions in the
number of events produce exponentially large reductions in runtime.
Table E.5

                       Emulate
                       filter                exists
# Proc.    Model       cmpxchg     xchg      cmpxchg     xchg
2          0.004       0.022       0.027     0.039       0.058
3          0.041       0.743       0.968     1.653       3.203
4          0.374       59.565      74.818    151.962     500.960
5          4.905

Thus, in formal verification even more than in parallel programming itself, divide
and conquer!!! ❑
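The ratios quoted above can be recomputed from the Table E.5 values. The following Python sketch simply divides the relevant columns; the dictionary layout and variable names are illustrative, and the five-process row is omitted because it has no emulation entries.

```python
# Runtimes from Table E.5, keyed by number of processes.
# Columns: model, filter/cmpxchg, filter/xchg, exists/cmpxchg, exists/xchg.
table = {
    2: (0.004, 0.022, 0.027, 0.039, 0.058),
    3: (0.041, 0.743, 0.968, 1.653, 3.203),
    4: (0.374, 59.565, 74.818, 151.962, 500.960),
}

model_speedup = {}     # direct modeling vs. the fastest emulation
exists_vs_filter = {}  # exists clause vs. filter clause, cmpxchg emulation

for nproc, (model, f_cmp, f_xchg, e_cmp, e_xchg) in table.items():
    model_speedup[nproc] = f_cmp / model
    exists_vs_filter[nproc] = e_cmp / f_cmp
    print(nproc, round(model_speedup[nproc], 1),
          round(exists_vs_filter[nproc], 2))
```

The first ratio grows from roughly five at two processes to more than one hundred at four, while the second stays in the neighborhood of two.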
Answer:
Yes, it usually is a critical bug. However, in this case, the updater has been cleverly
constructed to properly handle such pointer leaks. But please don’t make a habit of
doing this sort of thing, and especially don’t do this without having put a lot of thought
into making some more conventional approach work. ❑
Answer:
Because the reader advances to the next element on line 24, thus avoiding storing a
pointer to the same element as was fetched. ❑
Answer:
That is an excellent question. As of late 2021, the answer is “no one knows”. Much
depends on the semantics of Armv8’s conditional-move instruction. While awaiting
clarity on these semantics, smp_store_release() is the safe choice. ❑
Answer:
Unfortunately, no.
At one time, Paul E. McKenney felt that Linux-kernel RCU was immune to such
exploits, but the advent of Row Hammer showed him otherwise. After all, if the black
hats can hit the system’s DRAM, they can hit any and all low-level software, even
including RCU.
And in 2018, this possibility passed from the realm of theoretical speculation into the
hard and fast realm of objective reality [McK19a]. ❑
Answer:
Unfortunately, no.
The first full verification of the L4 microkernel was a tour de force, with a large
number of Ph.D. students hand-verifying code at a very slow per-student rate. This level
of effort could not be applied to most software projects because the rate of change is
just too great. Furthermore, although the L4 microkernel is a large software artifact
from the viewpoint of formal verification, it is tiny compared to a great number of
projects, including LLVM, GCC, the Linux kernel, Hadoop, MongoDB, and a great
many others. In addition, this verification did have limits, as the researchers freely admit,
to their credit: https://docs.sel4.systems/projects/sel4/frequently-asked-questions.html#does-sel4-have-zero-bugs.
Although formal verification is finally starting to show some promise, including
more-recent L4 verifications involving greater levels of automation, it currently has no
chance of completely displacing testing in the foreseeable future. And although I would
dearly love to be proven wrong on this point, please note that such proof will be in the
form of a real tool that verifies real software, not in the form of a large body of rousing
rhetoric.
Perhaps someday formal verification will be used heavily for validation, including for
what is now known as regression testing. Section 17.4 looks at what would be required
to make this possibility a reality. ❑
Answer:
Although this can resolve the race between the release of the last reference and acquisition
of a new reference, it does absolutely nothing to prevent the data structure from being
E.13. PUTTING IT ALL TOGETHER 827
freed and reallocated, possibly as some completely different type of structure. It is quite
likely that the “simple compare-and-swap operation” would give undefined results if
applied to the differently typed structure.
In short, use of atomic operations such as compare-and-swap absolutely requires
either type-safety or existence guarantees.
But what if it is absolutely necessary to let the type change?
One approach is for each such type to have the reference counter at the same location,
so that as long as the reallocation results in an object from this group of types, all is
well. If you do this in C, make sure you comment the reference counter in each structure
in which it appears. In C++, use inheritance and templates. ❑
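One way to check the same-location requirement mechanically is sketched below, using Python's ctypes module to describe C-compatible structure layouts. The two structure types and their fields are hypothetical; only the placement of the reference counter matters.

```python
import ctypes

class Animal(ctypes.Structure):
    _fields_ = [
        ("refcnt", ctypes.c_int),     # reference counter: must stay first
        ("weight", ctypes.c_double),
    ]

class Building(ctypes.Structure):
    _fields_ = [
        ("refcnt", ctypes.c_int),     # reference counter: must stay first
        ("name", ctypes.c_char * 16),
    ]

# Byte offset of the reference counter within each structure type.
offsets = {cls.__name__: cls.refcnt.offset for cls in (Animal, Building)}
same_location = len(set(offsets.values())) == 1

print(offsets, same_location)
```

In C itself, a compile-time assertion on the field offsets would serve the same purpose as the comments recommended above.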
Answer:
Because a CPU must already hold a reference in order to legally acquire another
reference. Therefore, if one CPU releases the last reference, there had better not be any
CPU acquiring a new reference! ❑
Answer:
This cannot happen if these functions are used correctly. It is illegal to invoke
kref_get() unless you already hold a reference, in which case the kref_sub() could not
possibly have decremented the counter to zero. ❑
Answer:
The caller cannot rely on the continued existence of the object unless it knows that at
least one reference will continue to exist. Normally, the caller will have no way of
knowing this, and must therefore carefully avoid referencing the object after the call to
kref_sub().
Interested readers are encouraged to work around this limitation using RCU, in
particular, call_rcu(). ❑
Answer:
Because the kref structure normally is embedded in a larger structure, and it is necessary
to free the entire structure, not just the kref field. This is normally accomplished by
defining a wrapper function that does a container_of() and then a kfree(). ❑
Answer:
Suppose that the “if” condition completed, finding the reference counter value equal
to one. Suppose that a release operation executes, decrementing the reference counter
to zero and therefore starting cleanup operations. But now the “then” clause can
increment the counter back to a value of one, allowing the object to be used after it has
been cleaned up.
This use-after-cleanup bug is every bit as bad as a full-fledged use-after-free bug. ❑
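The interleaving can be replayed in straight-line Python, with no real concurrency. This is an illustrative sketch only: the class, the `release()` helper, and the `cleaned_up` flag are hypothetical stand-ins for the real object and its cleanup operations.

```python
class Obj:
    def __init__(self):
        self.refcnt = 1          # one reference outstanding
        self.cleaned_up = False

def release(obj):
    """Drop a reference, cleaning up when the counter reaches zero."""
    obj.refcnt -= 1
    if obj.refcnt == 0:
        obj.cleaned_up = True    # stands in for the cleanup operations

obj = Obj()

# Acquirer: the "if" condition completes, finding a counter value of one.
saw_nonzero = obj.refcnt != 0

# Releaser: runs between the check and the increment.
release(obj)

# Acquirer: the "then" clause increments the counter back to one.
if saw_nonzero:
    obj.refcnt += 1

use_after_cleanup = obj.cleaned_up and obj.refcnt == 1
print(use_after_cleanup)
```

The check and the increment are separate steps, so nothing prevents the release from slipping in between them.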
Answer:
Such replication is impractical if the data is too large, as it might be in the Schrödinger’s-
zoo example described in Section 13.4.2.
Such replication is unnecessary if delays are prevented, for example, when updaters
disable interrupts when running on bare-metal hardware (that is, without the use of a
vCPU-preemption-prone hypervisor).
Alternatively, if readers can tolerate the occasional delay, then replication is again
unnecessary. Consider the example of reader-writer locking, where writers always delay
readers and vice versa.
However, if the data to be replicated is reasonably small, if delays are possible, and if
readers cannot tolerate these delays, replicating the data is an excellent approach. ❑
Answer:
Yes, and the details are left as an exercise to the reader.
The term tombstone is sometimes used to refer to the element with the old name after
its sequence lock is acquired. Similarly, the term birthstone is sometimes used to refer
to the element with the new name while its sequence lock is still held. ❑
Answer:
Yes, and one way to do this would be to use per-hash-chain locks. The updater could
acquire lock(s) corresponding to both the old and the new element, acquiring them in
address order. In this case, the insertion and removal operations would of course need to
refrain from acquiring and releasing these same per-hash-chain locks. This complexity
can be worthwhile if rename operations are frequent, and of course can allow rename
operations to execute concurrently. ❑
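The lock-acquisition order might be sketched as follows. This Python fragment shows only the ordering discipline, not a full rename operation: `id()` stands in for the address of each lock, and the helper names are hypothetical.

```python
import threading

def lock_chains_in_order(lock_a, lock_b):
    """Acquire the lock(s) for two hash chains in a canonical order.
    If both elements hash to the same chain, acquire its lock once."""
    by_address = {id(lock_a): lock_a, id(lock_b): lock_b}
    ordered = [lock for _, lock in sorted(by_address.items())]
    for lock in ordered:
        lock.acquire()
    return ordered

def unlock_chains(ordered):
    for lock in reversed(ordered):
        lock.release()

chain_locks = [threading.Lock() for _ in range(4)]

# The acquisition order does not depend on the argument order.
held = lock_chains_in_order(chain_locks[2], chain_locks[0])
order_ab = [id(lock) for lock in held]
unlock_chains(held)

held = lock_chains_in_order(chain_locks[0], chain_locks[2])
order_ba = [id(lock) for lock in held]
unlock_chains(held)

# Old and new names in the same chain: only one lock is taken.
held = lock_chains_in_order(chain_locks[1], chain_locks[1])
same_chain_count = len(held)
unlock_chains(held)

print(order_ab == order_ba, same_chain_count)
```

Because every updater sorts the same way, two concurrent renames touching the same pair of chains cannot deadlock on each other.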
Answer:
A given thread’s __thread variables vanish when that thread exits. It is therefore
necessary to synchronize any operation that accesses other threads’ __thread variables
with thread exit. Without such synchronization, accesses to the __thread variables of a
just-exited thread will result in segmentation faults. ❑
Answer:
Indeed I did say that. And it would be possible to make count_register_thread()
allocate a new structure, much as count_unregister_thread() currently does.
But this is unnecessary. Recall the derivation of the error bounds of read_count()
that was based on the snapshots of memory. Because new threads start with initial
counter values of zero, the derivation holds even if we add a new thread partway
through read_count()’s execution. So, interestingly enough, when adding a new
thread, this implementation gets the effect of allocating a new structure, but without
actually having to do the allocation. ❑
Answer:
You are quite right, that array does in fact reimpose the fixed upper limit. This limit may
be avoided by tracking threads with a linked list, as is done in userspace RCU [DMS+12].
Doing something similar for this code is left as an exercise for the reader. ❑
Answer:
This of course needs to be decided on a case-by-case basis. If you need an implementation
of read_count() that scales linearly, then the lock-based implementation shown in
Listing 5.4 simply will not work for you. On the other hand, if calls to read_count()
are sufficiently rare, then the lock-based version is simpler and might thus be better,
although much of the size difference is due to the structure definition, memory allocation,
and NULL return checking.
Of course, a better question is “Why doesn’t the language implement cross-thread
access to __thread variables?” After all, such an implementation would make both the
locking and the use of RCU unnecessary. This would in turn enable an implementation
that was even simpler than the one shown in Listing 5.4, but with all the scalability and
performance benefits of the implementation shown in Listing 13.5! ❑
Answer:
Indeed it can.
One way to avoid this cache-miss overhead is shown in Listing E.10: Simply embed
an instance of a measurement structure named meas into the animal structure, and
point the ->mp field at this ->meas field.
Measurement updates can then be carried out as follows:
1. Allocate a new measurement structure and place the new measurements into it.
3. Wait for a grace period to elapse, for example using either synchronize_rcu()
or call_rcu().
4. Copy the measurements from the new measurement structure into the embedded
->meas field.
This approach uses a heavier weight update procedure to eliminate the extra cache
miss in the common case. The extra cache miss will be incurred only while an update is
actually in progress. ❑
Answer:
Given the surprisingly limited scalability of any number of NBS algorithms, population
obliviousness can be surprisingly useful. Nevertheless, the overall point of the question
is valid. It is not normally helpful for an algorithm to scale beyond the size of the largest
system it is ever going to run on. ❑
Answer:
One pointer at a time!
First, atomically exchange the ->head pointer with NULL. If the return value from
the atomic exchange operation is NULL, the queue was empty and you are done. And if
someone else attempts a dequeue-all at this point, they will get back a NULL pointer.
Otherwise, atomically exchange the ->tail pointer with a pointer to the now-NULL
->head pointer. The return value from the atomic exchange operation is a pointer to
the ->next field of the eventual last element on the list.
Producing and testing actual code is left as an exercise for the interested and
enthusiastic reader, as are strategies for handling half-enqueued elements. ❑
Answer:
That won’t help unless the proponents of more-modern languages are energetic enough to
write their own compiler backends. The usual practice of reusing existing backends
also reuses charming properties such as refusal to support pointers to lifetime-ended
objects. ❑
Answer:
A demonic scheduler is one way to model an insanely overloaded system. After all, if
you have an algorithm that you can prove runs reasonably given a demonic scheduler,
mere overload should be no problem, right?
On the other hand, it is only reasonable to ask if a demonic scheduler is really the
best way to model overload conditions. And perhaps it is time for more accurate models.
For one thing, a system might be overloaded in any of a number of ways. After all, an
NBS algorithm that works fine on a demonic scheduler might or might not do well in
out-of-memory conditions, when mass storage fills, or when the network is congested.
Except that systems’ core counts have been increasing, which means that an overloaded
system is quite likely to be running more than one concurrent program. In that case,
even if a demonic scheduler is not so demonic as to inject idle cycles while there are
runnable tasks, it is easy to imagine such a scheduler consistently favoring the other
program over yours. If both programs could consume all available CPU, then this
scheduler might not run your program at all.
One way to avoid these issues is to simply avoid overload conditions. This is often
the preferred approach in production, where load balancers direct traffic away from
overloaded systems. And if all systems are overloaded, it is not unheard of to simply
shed load, that is, to drop the low-priority incoming requests. Nor is this approach
limited to computing, as those who have suffered through a rolling blackout can attest.
But load-shedding is often considered a bad thing by those whose load is being shed.
As always, choose wisely! ❑
Answer:
Sooner or later, the battery must be recharged, which requires energy to flow into the
system. ❑
Answer:
Yes, but...
Those queueing-theory results assume infinite “calling populations”, which in the
Linux kernel might correspond to an infinite number of tasks. As of early 2021, no
real system supports an infinite number of tasks, so results assuming infinite calling
populations should be expected to have less-than-infinite applicability.
Other queueing-theory results have finite calling populations, which feature sharply
bounded response times [HL86]. These results better model real systems, and these
models do predict reductions in both average and worst-case response times as utilizations
decrease. These results can be extended to model concurrent systems that use
synchronization mechanisms such as locking [Bra11, SM04a].
In short, queueing-theory results that accurately describe real-world real-time systems
show that worst-case response time decreases with decreasing utilization. ❑
Answer:
Perhaps this situation is just a theoretician’s excuse to avoid diving into the messy world
of real software? Perhaps more constructively, the following advances are required:
1. Formal verification needs to handle larger software artifacts. The largest verification
efforts have been for systems of only about 10,000 lines of code, and those have
been verifying much simpler properties than real-time latencies.
2. Hardware vendors will need to publish formal timing guarantees. This used to be
common practice back when hardware was much simpler, but today’s complex
hardware results in excessively complex expressions for worst-case performance.
Unfortunately, energy-efficiency concerns are pushing vendors in the direction of
even more complexity.
All that said, there is hope, given recent work formalizing the memory models of real
computer systems [AMP+ 11, AKNT13]. On the other hand, formal verification has just
as much trouble as does testing with the astronomical number of variants of the Linux
kernel that can be constructed from different combinations of its tens of thousands of
Kconfig options. Sometimes life is hard! ❑
Answer:
This distinction is admittedly unsatisfying from a strictly theoretical perspective. But
on the other hand, it is exactly what the developer needs in order to decide whether the
application can be cheaply and easily developed using standard non-real-time approaches,
or whether the more difficult and expensive real-time approaches are required. In other
words, although theory is quite important, for those of us called upon to complete
practical projects, theory supports practice, never the other way around. ❑
Answer:
Indeed it is, other than the API. And the API is important because it allows the Linux
kernel to offer real-time capabilities without having the -rt patchset grow to ridiculous
sizes.
However, this approach clearly and severely limits read-side scalability. The Linux
kernel’s -rt patchset was long able to live with this limitation for several reasons:
(1) Real-time systems have traditionally been relatively small, (2) Real-time systems
have generally focused on process control, thus being unaffected by scalability limitations
in the I/O subsystems, and (3) Many of the Linux kernel’s reader-writer locks have been
converted to RCU.
However, the day came when it was absolutely necessary to permit concurrent readers,
as described in the text following this quiz. ❑
Answer:
That is a real problem, and it is solved in RCU’s scheduler hook. If that scheduler
hook sees that the value of t->rcu_read_lock_nesting is negative, it invokes
rcu_read_unlock_special() if needed before allowing the context switch to complete.
❑
Answer:
Yes and no.
Yes in that non-blocking algorithms can provide fault tolerance in the face of fail-stop
bugs, but no in that this is grossly insufficient for practical fault tolerance. For example,
suppose you had a wait-free queue, and further suppose that a thread has just dequeued an
element. If that thread now succumbs to a fail-stop bug, the element it has just dequeued
is effectively lost. True fault tolerance requires way more than mere non-blocking
properties, and is beyond the scope of this book. ❑
Answer:
Indeed there are, and lots of them. However, they tend to be specific to a given situation,
and many of them can be thought of as refinements of some of the constraints listed
above. For example, the many constraints on choices of data structure will help meet
the “Bounded time spent in any given critical section” constraint. ❑
Answer:
In early 2016, projects forbidding runtime memory allocation were also not at all
interested in multithreaded computing. So the runtime memory allocation is not an
additional obstacle to safety criticality.
However, by 2020 runtime memory allocation in multi-core real-time systems was
gaining some traction. ❑
Answer:
Indeed you do, and you could use any of a number of techniques discussed earlier in
this book. One of those techniques is use of a single updater thread, which would result
in exactly the code shown in update_cal() in Listing 14.6. ❑
Answer:
The earlier memory-ordering section had its roots in a pair of Linux Journal articles
[McK05a, McK05b] dating back to 2005. Since then, the C and C++ memory
models [Bec11] have been formalized (and critiqued [BS14, BD14, VBC+ 15, BMN+ 15,
LVK+ 17, BGV17]), executable formal memory models for computer systems have
become the norm [MSS12, McK11d, SSA+ 11, AMP+ 11, AKNT13, AKT13, AMT14,
MS14, FSP+ 17, ARM17], and there is even a memory model for the Linux kernel
[AMM+ 17a, AMM+ 17b, AMM+ 18], along with a paper describing differences
between the C11 and Linux memory models [MWPF18].
The kernel concurrency sanitizer (KCSAN) [EMV+ 20a, EMV+ 20b], based in part on
RacerD [BGOS18] and implementing LKMM, has also been added to the Linux kernel
and is now heavily used.
Finally, there are now better ways of describing LKMM.
Given all this progress, substantial change was required. ❑
Answer:
In general, compiler optimizations carry out more extensive and profound reorderings
than CPUs can. However, in this case, the volatile accesses in READ_ONCE() and
WRITE_ONCE() prevent the compiler from reordering. And also from doing much else
as well, so the examples in this section will be making heavy use of READ_ONCE() and
WRITE_ONCE(). See Section 15.3 for more detail on the need for READ_ONCE() and
WRITE_ONCE(). ❑
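To make the volatile accesses mentioned above concrete, here is a minimal user-space sketch of READ_ONCE() and WRITE_ONCE() as volatile casts. It assumes a GCC- or Clang-compatible compiler (for __typeof__) and machine-word-sized scalar variables. The Linux kernel's actual definitions are more elaborate, and the flag variable and the set_flag() and poll_flag() functions are hypothetical.

```c
#include <assert.h>

/* Assumption: GNU C __typeof__, and scalar machine-word-sized variables. */
#define READ_ONCE(x)     (*(volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, v) (*(volatile __typeof__(x) *)&(x) = (v))

static int flag;	/* Hypothetical shared variable. */

/* Exactly one store, which the compiler may neither omit nor split. */
static void set_flag(int v)
{
	WRITE_ONCE(flag, v);
}

/*
 * Poll the flag at most max_polls times.  Each iteration does a real load:
 * with a plain access, the compiler could hoist the load out of the loop.
 * Returns the index of the first poll that saw a non-zero flag, or -1.
 */
static int poll_flag(int max_polls)
{
	assert(max_polls >= 0);
	for (int i = 0; i < max_polls; i++)
		if (READ_ONCE(flag))
			return i;
	return -1;
}
```

Note that these macros constrain only the compiler. Any ordering required from the CPU must still come from the operations discussed in the remainder of this chapter.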
Answer:
There is an underlying cache-coherence protocol that straightens things out, which is
discussed in Appendix C.2. But if you think that a given variable having two values at
the same time is surprising, just wait until you get to Section 15.2.1! ❑
Answer:
Perhaps surprisingly, not necessarily! On some systems, if the two variables are being
used heavily, they might be bounced back and forth between the CPUs’ caches and
never land in main memory. ❑
The READ_ONCE() row captures the fact that (as of 2021) compilers and CPUs do
not indulge in user-visible speculative stores, so that any store whose address, data, or
execution depends on a prior load is guaranteed to happen after that load completes.
However, this guarantee assumes that these dependencies have been constructed carefully,
as described in Sections 15.3.2 and 15.3.3.
The “_relaxed() RMW operation” row captures the fact that a value-returning
_relaxed() RMW has done a load and a store, which are every bit as good as a
READ_ONCE() and a WRITE_ONCE(), respectively.
The *_dereference() row captures the address and data dependency ordering
provided by rcu_dereference() and friends. Again, these dependencies must be
constructed carefully, as described in Section 15.3.2.
The “Successful *_acquire()” row captures the fact that many CPUs have special
“acquire” forms of loads and of atomic RMW instructions, and that many other CPUs
have lightweight memory-barrier instructions that order prior loads against subsequent
loads and stores.
The “Successful *_release()” row captures the fact that many CPUs have special
“release” forms of stores and of atomic RMW instructions, and that many other CPUs
have lightweight memory-barrier instructions that order prior loads and stores against
subsequent stores.
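As a rough user-space analogy for the acquire and release rows, the following sketch uses C11 atomics rather than the Linux-kernel primitives themselves. It assumes that memory_order_release and memory_order_acquire stand in for smp_store_release() and smp_load_acquire(), and the payload and ready variables and the publish() and try_consume() functions are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

static int payload;		/* Plain data, published via the flag. */
static atomic_int ready;	/* Zero-initialized: nothing published yet. */

/* Producer: the release store orders the prior payload store before it. */
static void publish(int v)
{
	payload = v;
	atomic_store_explicit(&ready, 1, memory_order_release);
}

/*
 * Consumer: the acquire load orders itself before the later payload load.
 * Returns 1 and fills in *out if the payload was published, otherwise 0.
 */
static int try_consume(int *out)
{
	assert(out != NULL);
	if (!atomic_load_explicit(&ready, memory_order_acquire))
		return 0;
	*out = payload;
	return 1;
}
```

If one thread calls publish() and another later sees try_consume() return 1, the consumer is guaranteed to see the published payload, without either thread needing to order prior stores against subsequent loads.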
The smp_rmb() row captures the fact that many CPUs have lightweight memory-barrier
instructions that order prior loads against subsequent loads. Similarly, the
smp_wmb() row captures the fact that many CPUs have lightweight memory-barrier
instructions that order prior stores against subsequent stores.
None of the ordering operations thus far require prior stores to be ordered against
subsequent loads, which means that these operations need not interfere with store buffers,
whose main purpose in life is in fact to reorder prior stores against subsequent loads.
The lightweight nature of these operations is precisely due to their policy of store-buffer
non-interference. However, as noted earlier, it is sometimes necessary to interfere with
the store buffer in order to prevent prior stores from being reordered against later loads,
which brings us to the remaining rows in this table.
The smp_mb() row corresponds to the full memory barrier available on most platforms,
with Itanium being the exception that proves the rule. However, even on Itanium,
smp_mb() provides full ordering with respect to READ_ONCE() and WRITE_ONCE(), as
discussed in Section 15.5.4.
The “Successful full-strength non-void RMW” row captures the fact that on some
platforms (such as x86) atomic RMW instructions provide full ordering both before and
after. The Linux kernel therefore requires that full-strength non-void atomic RMW
operations provide full ordering in cases where these operations succeed. (Full-strength
atomic RMW operations’ names do not end in _relaxed, _acquire, or _release.)
As noted earlier, the case where these operations do not succeed is covered by the
“_relaxed() RMW operation” row.
However, the Linux kernel does not require that either void or _relaxed() atomic
RMW operations provide any ordering whatsoever, with the canonical example being
atomic_inc(). Therefore, these operations, along with failing non-void atomic
RMW operations, may be preceded by smp_mb__before_atomic() and followed by
smp_mb__after_atomic() to provide full ordering for any accesses preceding or
following both. No ordering need be provided for accesses between the
smp_mb__before_atomic() (or, similarly, the smp_mb__after_atomic()) and the atomic
Answer:
These two primitives are rather specialized, and at present seem difficult to fit into
Table 15.3. The smp_mb__after_unlock_lock() primitive is intended to be placed
immediately after a lock acquisition, and ensures that all CPUs see all accesses in
prior critical sections as happening before all accesses following the
smp_mb__after_unlock_lock() and also before all accesses in later critical sections. Here “all CPUs”
includes those CPUs not holding that lock, and “prior critical sections” includes all
prior critical sections for the lock in question as well as all prior critical sections for all
other locks that were released by the same CPU that executed the
smp_mb__after_unlock_lock().
The smp_mb__after_spinlock() provides the same guarantees as does
smp_mb__after_unlock_lock(), but also provides additional visibility guarantees for
other accesses performed by the CPU that executed the smp_mb__after_spinlock().
Given any store S performed prior to any earlier lock acquisition and any load L
performed after the smp_mb__after_spinlock(), all CPUs will see S as happening
before L. In other words, if a CPU performs a store S, acquires a lock, executes
an smp_mb__after_spinlock(), then performs a load L, all CPUs will see S as
happening before L. ❑
Answer:
Much of the purpose of the remainder of this chapter is to answer exactly that question!
❑
Answer:
Ah, that is a deep question whose answer requires most of the rest of this chapter. But
the short answer is that smp_mb() is almost always strong enough, albeit at some cost.
❑
Answer:
Get version v4.17 (or later) of the Linux-kernel source code, then follow the instructions
in tools/memory-model/README to install the needed tools. Then follow the further
instructions to run these tools on the litmus test of your choice. ❑
Answer:
CPUs 2 and 3 are a pair of hardware threads on the same core, sharing the same cache
hierarchy, and therefore have very low communications latencies. This is a NUMA, or,
more accurately, a NUCA effect.
This leads to the question of why CPUs 2 and 3 ever disagree at all. One possible
reason is that they each might have a small amount of private cache in addition to a
larger shared cache. Another possible reason is instruction reordering, given the short
10-nanosecond duration of the disagreement and the total lack of memory-ordering
operations in the code fragment. ❑
execution to allow execution to proceed in the common case where there are no
intervening stores, in which case the reordering cannot be visible anyway?
Answer:
They can and many do, otherwise systems containing strongly ordered CPUs would
be slow indeed. However, speculative execution does have its downsides, especially if
speculation must be rolled back frequently, particularly on battery-powered systems.
Speculative execution can also introduce side channels, which might in turn be exploited
to exfiltrate information. But perhaps future systems will be able to overcome these
disadvantages. Until then, we can expect vendors to continue producing weakly ordered
CPUs. ❑
Answer:
That is in fact exactly what happens. On strongly ordered systems, smp_rmb() and
smp_wmb() emit no instructions, but instead just constrain the compiler. Thus, in this
case, weakly ordered systems do in fact shoulder the full cost of their memory-ordering
choices. ❑
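The following user-space sketch shows roughly how this can look in code. It assumes a GCC- or Clang-compatible compiler, the my_smp_rmb() and my_smp_wmb() names are hypothetical stand-ins for the kernel primitives, and the non-x86 fallback uses C11 fences as a conservative approximation rather than the instructions that any particular weakly ordered architecture would actually select.

```c
#include <assert.h>
#include <stdatomic.h>

/* Compiler-only barrier: emits no instructions (GNU C inline assembly). */
#define barrier() __asm__ __volatile__("" : : : "memory")

#if defined(__x86_64__) || defined(__i386__)
/* Strongly ordered (TSO) CPU: only the compiler needs to be constrained. */
#define my_smp_rmb() barrier()
#define my_smp_wmb() barrier()
#else
/* Assumption: conservative fallback that emits real fences elsewhere. */
#define my_smp_rmb() atomic_thread_fence(memory_order_acquire)
#define my_smp_wmb() atomic_thread_fence(memory_order_release)
#endif

static volatile int data;	/* Hypothetical message payload. */
static volatile int flag;	/* Hypothetical "message ready" flag. */

static void writer(int v)
{
	data = v;
	my_smp_wmb();	/* Order the data store before the flag store. */
	flag = 1;
}

/* Returns 1 and fills in *out if the message was ready, otherwise 0. */
static int reader(int *out)
{
	assert(out);
	if (!flag)
		return 0;
	my_smp_rmb();	/* Order the flag load before the data load. */
	*out = data;
	return 1;
}
```

On x86, both barriers in this sketch compile to nothing at all, so only the weakly ordered build pays for additional instructions.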
Answer:
Answering this requires identifying three major groups of platforms: (1) Total-store-order
(TSO) platforms, (2) Weakly ordered platforms, and (3) DEC Alpha.
The TSO platforms order all pairs of memory references except for prior stores against
later loads. Because the address dependency on lines 18 and 19 of Listing 15.10 is
instead a load followed by another load, TSO platforms preserve this address dependency.
They also preserve the address dependency on lines 17 and 18 of Listing 15.11 because
this is a load followed by a store. Because address dependencies must start with a
load, TSO platforms implicitly but completely respect them, give or take compiler
optimizations, hence the need for READ_ONCE().
Weakly ordered platforms don’t necessarily maintain ordering of unrelated accesses.
However, the address dependencies in Listings 15.10 and 15.11 are not unrelated: There
is an address dependency. The hardware tracks dependencies and maintains the needed
ordering.
There is one (famous) exception to this rule for weakly ordered platforms, and that
exception is DEC Alpha for load-to-load address dependencies. And this is why, in
Linux kernels predating v4.15, DEC Alpha requires the explicit memory barrier supplied
for it by the now-obsolete lockless_dereference() on line 18 of Listing 15.10.
However, DEC Alpha does track load-to-store address dependencies, which is why
line 17 of Listing 15.11 does not need a lockless_dereference(), even in Linux
kernels predating v4.15.
Answer:
In most cases, smp_store_release() is indeed a better choice. However, smp_wmb()
was there first in the Linux kernel, so it is still good to understand how to use it. ❑
Answer:
The best scorecard is the infamous test6.pdf [SSA+ 11]. Unfortunately, not all of the
abbreviations have catchy expansions like SB (store buffering), MP (message passing),
and LB (load buffering), but at least the list of abbreviations is readily available. ❑
Answer:
Yes, the compiler absolutely must emit a load instruction for a volatile load. But if you
multiply the value loaded by zero, the compiler is well within its rights to substitute a
constant zero for the result of that multiplication, which will break the data dependency
on many platforms.
Worse yet, if the dependent store does not use WRITE_ONCE(), the compiler could
hoist it above the load, which would cause even TSO platforms to fail to provide ordering.
❑
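The multiply-by-zero hazard described in this answer can be sketched as follows. The READ_ONCE() and WRITE_ONCE() definitions are simplified user-space assumptions (GNU C __typeof__), and the functions dep_as_written() and dep_as_optimized() are hypothetical. The second function shows code that a compiler would be within its rights to generate for the first.

```c
#include <assert.h>

/* Assumption: simplified user-space definitions using GNU C __typeof__. */
#define READ_ONCE(x)     (*(volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, v) (*(volatile __typeof__(x) *)&(x) = (v))

/* As written: the stored value nominally depends on the loaded value. */
static void dep_as_written(int *x, int *y)
{
	int r1;

	assert(x && y);
	r1 = READ_ONCE(*x);	/* This volatile load must be emitted... */
	WRITE_ONCE(*y, r1 * 0);	/* ...but r1 * 0 is always zero. */
}

/* What the compiler may emit: the store no longer uses r1 at all. */
static void dep_as_optimized(int *x, int *y)
{
	int r1;

	assert(x && y);
	r1 = READ_ONCE(*x);	/* Still loaded, but the value is now unused. */
	(void)r1;
	WRITE_ONCE(*y, 0);	/* Constant store: the data dependency is gone. */
}
```

Both functions always store zero to *y, which is exactly why the compiler is free to substitute the second for the first, leaving weakly ordered hardware with no dependency to track.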
Answer:
But of course! And perhaps in the fullness of time they will be so mandated. ❑
Listing E.11: Litmus Test Distinguishing Multicopy Atomic From Other Multicopy Atomic
 1 C C-MP-OMCA+o-o-o+o-rmb-o
 2
 3 {}
 4
 5 P0(int *x, int *y)
 6 {
 7     int r0;
 8
 9     WRITE_ONCE(*x, 1);
10     r0 = READ_ONCE(*x);
11     WRITE_ONCE(*y, r0);
12 }
13
14 P1(int *x, int *y)
15 {
16     int r1;
17     int r2;
18
19     r1 = READ_ONCE(*y);
20     smp_rmb();
21     r2 = READ_ONCE(*x);
22 }
23
24 exists (1:r1=1 /\ 1:r2=0)
Answer:
Yes, it would. Feel free to modify the exists clause to check for that outcome and see
what happens. ❑
Answer:
Listing E.11 (C-MP-OMCA+o-o-o+o-rmb-o.litmus) shows such a test.
On a multicopy-atomic platform, P0()’s store to x on line 9 must become visible
to both P0() and P1() simultaneously. Because this store becomes visible to P0()
on line 10, before P0()’s store to y on line 11, P0()’s store to x must become visible
before its store to y everywhere, including P1(). Therefore, if P1()’s load from y on
line 19 returns the value 1, so must its load from x on line 21, given that the smp_rmb()
on line 20 forces these two loads to execute in order. Therefore, the exists clause on
line 24 cannot trigger on a multicopy-atomic platform.
In contrast, on an other-multicopy-atomic platform, P0() could see its own store
early, so that there would be no constraint on the order of visibility of the two stores
from P1(), which in turn allows the exists clause to trigger. ❑
Answer:
This is in fact a very natural design for any system having multiple hardware threads per
core. Natural from a hardware point of view, that is! ❑
Answer:
Presumably there is a P3(), as is in fact shown in Figure 15.10, that shares P2()’s
store buffer and cache. But not necessarily. Some platforms allow different cores to
disable different numbers of threads, allowing the hardware to adjust to the needs of the
workload at hand. For example, a single-threaded critical-path portion of the workload
might be assigned to a core with only one thread enabled, thus allowing the single
thread running that portion of the workload to use the entire capabilities of that core.
Other more highly parallel but cache-miss-prone portions of the workload might be
assigned to cores with all hardware threads enabled to provide improved throughput.
This improved throughput could be due to the fact that while one hardware thread is
stalled on a cache miss, the other hardware threads can make forward progress.
In such cases, performance requirements override quaint human notions of fairness.
❑
Answer:
You need to face the fact that it really can trigger. Akira Yokosawa used the litmus7
tool to run this litmus test on a POWER8 system. Out of 1,000,000,000 runs, 4 triggered
the exists clause. Thus, triggering the exists clause is not merely a one-in-a-million
occurrence, but rather a one-in-a-hundred-million occurrence. But it nevertheless really
does trigger on real systems. ❑
Answer:
Wrong.
Listing E.12 (C-R+o-wmb-o+o-mb-o.litmus) shows a two-thread litmus test that
requires propagation due to the fact that it only has store-to-store and load-to-store links
between its pair of threads. Even though P0() is fully ordered by the smp_wmb() and
P1() is fully ordered by the smp_mb(), the counter-temporal nature of the links means
that the exists clause on line 21 really can trigger. To prevent this triggering, the
smp_wmb() on line 8 must become an smp_mb(), bringing propagation into play twice,
once for each non-temporal link. ❑
Listing E.12: R Litmus Test With Write Memory Barrier (No Ordering)
 1 C C-R+o-wmb-o+o-mb-o
 2
 3 {}
 4
 5 P0(int *x0, int *x1)
 6 {
 7     WRITE_ONCE(*x0, 1);
 8     smp_wmb();
 9     WRITE_ONCE(*x1, 1);
10 }
11
12 P1(int *x0, int *x1)
13 {
14     int r2;
15
16     WRITE_ONCE(*x1, 2);
17     smp_mb();
18     r2 = READ_ONCE(*x0);
19 }
20
21 exists (1:r2=0 /\ x1=2)
Answer:
As a rough rule of thumb, the smp_mb() barrier’s propagation property is sufficient to
maintain ordering through only one load-to-store link between processes. Unfortunately,
Listing 15.18 has not one but two load-to-store links, with the first being from the
READ_ONCE() on line 17 to the WRITE_ONCE() on line 24 and the second being from
the READ_ONCE() on line 26 to the WRITE_ONCE() on line 7. Therefore, preventing the
exists clause from triggering should be expected to require not one but two instances
of smp_mb().
As a special exception to this rule of thumb, a release-acquire chain can have one
load-to-store link between processes and still prohibit the cycle. ❑
Answer:
This litmus test is indeed a very interesting curiosity. Its ordering apparently occurs
naturally given typical weakly ordered hardware design, which would normally be
considered a great gift from the relevant laws of physics and cache-coherency-protocol
mathematics.
Unfortunately, no one has been able to come up with a software use case for this
gift that does not have a much better alternative implementation. Therefore, neither
the C11 nor the Linux kernel memory models provide any guarantee corresponding to
Listing 15.20. This means that the exists clause on line 19 can trigger.
Of course, without the barrier, there are no ordering guarantees, even on real weakly
ordered hardware, as shown in Listing E.13 (C-2+2W+o-o+o-o.litmus). ❑
Answer:
Listing E.14 shows a somewhat nonsensical but very real example. Creating a more
useful (but still real) litmus test is left as an exercise for the reader. ❑
one store-to-store link, like that shown in Listing 15.25. Given that there is only one
of each type of non-store-to-load link, the exists cannot trigger, right?
Answer:
Wrong. It is the number of non-store-to-load links that matters. If there is only one
non-store-to-load link, a release-acquire chain can prevent the exists clause from
triggering. However, if there is more than one non-store-to-load link, be they
store-to-store, load-to-store, or any combination thereof, it is necessary to have at least one
full barrier (smp_mb() or better) between each non-store-to-load link. In Listing 15.25,
preventing the exists clause from triggering therefore requires an additional full barrier
between either P0()’s or P1()’s accesses. ❑
Answer:
The counter-intuitive outcome cannot happen. (Try it!) ❑
Answer:
Because it would not work. Although the compiler would be prevented from inventing a
store prior to the barrier(), nothing would prevent it from inventing a store between
that barrier() and the plain store. ❑
Answer:
First, it might be necessary to invoke handle_reserve() before
do_something_with().
But more relevant to memory ordering, the compiler is often within its rights to
hoist the comparison ahead of the dereferences, which would allow the compiler to use
&reserve_int instead of the variable p that the hardware has tagged with a dependency.
❑
Answer:
Unfortunately, the compiler really can learn enough to break your dependency chain,
for example, as shown in Listing E.15. The compiler is within its rights to transform
this code into that shown in Listing E.16, and might well make this transformation due
to register pressure if handle_equality() was inlined and needed a lot of registers.
Line 9 of this transformed code uses q, which although equal to p, is not necessarily
tagged by the hardware as carrying a dependency. Therefore, this transformed code
does not necessarily guarantee that line 9 is ordered after line 5. ❑
Answer:
Given the simple if statement comparing against zero, it is hard to imagine the compiler
proving anything. But suppose that later code executed a division by q. Because division
by zero is undefined behavior, as of 2023, many compilers will assume that the value
of q must be non-zero, and will thus remove that if statement, thus unconditionally
executing the WRITE_ONCE(), in turn destroying the control dependency.
There are some who argue (correctly, in Paul’s view) that back-propagating undefined
behavior across volatile accesses constitutes a compiler bug, but many compiler writers
insist that this is not a bug, but rather a valuable optimization. ❑
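The scenario in this answer might be sketched as follows. The READ_ONCE() and WRITE_ONCE() definitions are simplified user-space assumptions (GNU C __typeof__), and the functions ctl_as_written() and ctl_as_optimized() are hypothetical. The second function shows what a compiler that back-propagates the undefined behavior might emit for the first.

```c
#include <assert.h>

/* Assumption: simplified user-space definitions using GNU C __typeof__. */
#define READ_ONCE(x)     (*(volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, v) (*(volatile __typeof__(x) *)&(x) = (v))

/* As written: the store is control-dependent on the loaded value. */
static int ctl_as_written(int *x, int *y)
{
	int q;

	assert(x && y);
	q = READ_ONCE(*x);
	if (q)
		WRITE_ONCE(*y, 1);	/* Ordered after the load by the branch. */
	return 100 / q;		/* Undefined behavior if q is zero. */
}

/* Possible result: the compiler assumes q != 0 and drops the test. */
static int ctl_as_optimized(int *x, int *y)
{
	int q;

	assert(x && y);
	q = READ_ONCE(*x);
	WRITE_ONCE(*y, 1);	/* Unconditional: no control dependency left. */
	return 100 / q;
}
```

The two functions behave identically whenever *x is non-zero, and the as-written version has undefined behavior otherwise, which is the basis for the compiler's claimed license to make this transformation.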
Answer:
Not given the Linux kernel memory model. (Try it!) However, you can instead replace
P0()’s WRITE_ONCE() with smp_store_release(), which usually has less overhead
than does adding an smp_mb(). ❑
Answer:
Yes, but only from the perspective of a third thread not holding that lock. In contrast,
memory allocators need only concern themselves with the two threads migrating the
memory. It is after all the developer’s responsibility to properly synchronize with any
other threads that need access to the newly migrated block of memory. ❑
Answer:
No.
Listing E.17 shows an example three-critical-section chain
(Lock-across-unlock-lock-3.litmus). Running this litmus test shows that the exists clause can still be
satisfied, so this additional critical section is still not sufficient to force ordering.
However, as the reader can verify, placing an smp_mb__after_spinlock() after
either P1()’s or P2()’s lock acquisition does suffice to force ordering. ❑
Answer:
No. By the time that the code inspects the return value from spin_is_locked(), some
other CPU or thread might well have acquired the corresponding lock. ❑
Answer:
Because in QSBR, RCU read-side critical sections don’t actually disappear. Instead, they
are extended in both directions until a quiescent state is encountered. For example, in the
Linux kernel, the critical section might be extended back to the most recent schedule()
call and ahead to the next schedule() call. Of course, in non-QSBR implementations,
rcu_read_lock() and rcu_read_unlock() really do emit code, which can clearly
provide ordering. And within the Linux kernel, even the QSBR implementation has
a compiler barrier() in rcu_read_lock() and rcu_read_unlock(), which is
necessary to prevent the compiler from moving memory accesses that might result in
page faults into the RCU read-side critical section.
Therefore, strange though it might seem, empty RCU read-side critical sections really
can and do provide some degree of ordering. ❑
Answer:
No, because none of these later litmus tests have more than one access within their
RCU read-side critical sections. But what about swapping the accesses, for example,
in Listing 15.43, placing P1()’s WRITE_ONCE() within its critical section and the
READ_ONCE() before its critical section?
Swapping the accesses allows both instances of r2 to have a final value of zero. In
other words, although RCU read-side critical sections’ ordering properties can extend
outside of those critical sections, the same is not true of their reordering properties.
Checking this with herd and explaining why is left as an exercise for the reader. ❑
Answer:
The cycle would again be forbidden. Further analysis is left as an exercise for the reader.
❑
Answer:
Alpha has only mb and wmb instructions, so smp_rmb() would be implemented by the
Alpha mb instruction in either case. In addition, at the time that the Linux kernel started
relying on dependency ordering, it was not clear that Alpha ordered dependent stores,
and smp_mb() was therefore the safe choice.
However, given the aforementioned v5.9 changes to READ_ONCE() and a few of
Alpha’s atomic read-modify-write operations, no Linux-kernel core code need concern
itself with DEC Alpha, thus greatly reducing Paul E. McKenney’s incentive to remove
Alpha support from the kernel. ❑
Answer:
Although DEC Alpha does take considerable flak, it does avoid reordering reads from
the same CPU to the same variable. It also avoids the out-of-thin-air problem that
plagues the Java and C11 memory models [BD14, BMN+ 15, BS14, Boe20, Gol19,
Jef14, MB20, MJST16, Š11, VBC+ 15]. ❑
Suppose that the compiler reordered lines 27 and 28 into the critical section starting
at line 29. Now suppose that two updaters start executing synchronize_rcu() at
about the same time. Then consider the following sequence of events:
1. CPU 0 acquires the lock at line 29.
2. Line 27 determines that CPU 0 was online, so it clears its own counter at line 28.
(Recall that lines 27 and 28 have been reordered by the compiler to follow line 29).
3. CPU 0 invokes update_counter_and_wait() from line 30.
4. CPU 0 invokes rcu_gp_ongoing() on itself at line 16, and line 5 sees that CPU 0 is
in a quiescent state. Control therefore returns to update_counter_and_wait(),
and line 15 advances to CPU 1.
5. CPU 1 invokes synchronize_rcu(), but because CPU 0 already holds the lock,
CPU 1 blocks waiting for this lock to become available. Because the compiler
reordered lines 27 and 28 to follow line 29, CPU 1 does not clear its own counter,
despite having been online.
6. CPU 0 invokes rcu_gp_ongoing() on CPU 1 at line 16, and line 5 sees that
CPU 1 is not in a quiescent state. The while loop at line 16 therefore never exits.
So the compiler’s reordering results in a deadlock. In contrast, hardware reordering
is temporary, so that CPU 1 might undertake its first attempt to acquire the mutex on
line 29 before executing lines 27 and 28, but it will eventually execute lines 27 and 28.
Because hardware reordering only results in a short delay, it can be tolerated. On the
other hand, because compiler reordering results in a deadlock, it must be prohibited.
Some research efforts have used hardware transactional memory to allow compilers
to safely reorder more aggressively, but the overhead of hardware transactions has thus
far made such optimizations unattractive. ❑
Answer:
Recall that load-to-store and store-to-store links can be counter-temporal, as illustrated
by Figures 15.12 and 15.13 in Section 15.2.7.2. This counter-temporal nature of
load-to-store and store-to-store links necessitates strong ordering.
In contrast, store-to-load links are temporal, as illustrated by Listings 15.12 and 15.13.
This temporal nature of store-to-load links permits use of minimal ordering. ❑
Answer:
Yes. However, since each thread must hold the locks of three consecutive elements to
delete the middle one, if there are 𝑁 threads, there must be 2𝑁 + 1 elements (rather than
just 𝑁 + 1) in order to avoid deadlock. ❑
Answer:
One exception would be a difficult and complex algorithm that was the only one known
to work in a given situation. Another exception would be a difficult and complex
algorithm that was nonetheless the simplest of the set known to work in a given situation.
However, even in these cases, it may be very worthwhile to spend a little time trying to
come up with a simpler algorithm! After all, if you managed to invent the first algorithm
to do some task, it shouldn’t be that hard to go on to invent a simpler one. ❑
Answer:
Indeed, in this case the lock would persist, much to the consternation of other processes
attempting to acquire this lock that is held by a process that no longer exists. Which is
why great care is required when using pthread_mutex objects located in file-mapped
memory regions. ❑
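POSIX offers one mitigation that the answer above does not discuss: robust mutexes. The following is a minimal sketch rather than a recommendation. It assumes a Linux-like system that supports PTHREAD_MUTEX_ROBUST and anonymous shared mappings, the function name lock_after_owner_death() is hypothetical, and the anonymous mapping merely stands in for a file-mapped region.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <pthread.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: returns what pthread_mutex_lock() reports to the
 * parent after a child process exits while holding a process-shared
 * robust mutex.  Returns -1 on setup failure.
 */
int lock_after_owner_death(void)
{
    pthread_mutexattr_t attr;
    pthread_mutex_t *m;
    pid_t pid;
    int ret;

    /* Assumption: an anonymous shared mapping stands in for a mapped file. */
    m = mmap(NULL, sizeof(*m), PROT_READ | PROT_WRITE,
             MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (m == MAP_FAILED)
        return -1;

    pthread_mutexattr_init(&attr);
    pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
    pthread_mutexattr_setrobust(&attr, PTHREAD_MUTEX_ROBUST);
    pthread_mutex_init(m, &attr);
    pthread_mutexattr_destroy(&attr);

    pid = fork();
    if (pid < 0) {
        munmap(m, sizeof(*m));
        return -1;
    }
    if (pid == 0) {
        pthread_mutex_lock(m);  /* Child exits while holding the lock. */
        _exit(0);
    }
    waitpid(pid, NULL, 0);

    /* A robust mutex reports the dead owner instead of blocking forever. */
    ret = pthread_mutex_lock(m);
    if (ret == EOWNERDEAD)
        pthread_mutex_consistent(m);  /* Repair protected data first. */
    if (ret == 0 || ret == EOWNERDEAD)
        pthread_mutex_unlock(m);

    pthread_mutex_destroy(m);
    munmap(m, sizeof(*m));
    return ret;
}
```

Where these assumptions hold, the parent's acquisition should report EOWNERDEAD rather than hang. Repairing whatever data the dead owner left half-updated remains the application's problem, which is the "great care" the answer refers to.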
E.17. CONFLICTING VISIONS OF THE FUTURE 855
Answer:
If the exec()ed program maps those same regions of memory, then this program could
in principle simply release the lock. The question as to whether this approach is sound
from a software-engineering viewpoint is left as an exercise for the reader. ❑
Answer:
One might get that impression from a quick read of the abstract, but more careful readers
will notice the “for a wide range of workloads” phrase in the last sentence. It turns out
that this phrase is quite important:
1. Their RCU evaluation uses synchronous grace periods, which needlessly throttle
updates, as noted in their Section 6.2.1. See Figure 10.11 (page 302) of this book
to see that the venerable asynchronous call_rcu() primitive enables RCU to
perform and scale quite well with large numbers of updaters. Furthermore, in
Section 3.7 of their paper, the authors admit that asynchronous grace periods are
important to MV-RLU scalability. A fair comparison would also allow RCU the
benefits of asynchrony.
2. They use a poorly tuned 1,000-bucket hash table containing 10,000 elements. In
addition, their 448 hardware threads need considerably more than 1,000 buckets to
avoid the lock contention that they correctly state limits RCU performance in their
benchmarks. A useful comparison would feature a properly tuned hash table.
3. Their RCU hash table used per-bucket locks, which they call out as a bottleneck,
which is not a surprise given the long hash chains and small ratio of buckets to
threads. A number of their competing mechanisms instead use lockfree techniques,
thus avoiding the per-bucket-lock bottleneck, which cynics might claim sheds some
light on the authors’ otherwise inexplicable choice of poorly tuned hash tables.
The first graph in the middle row of the authors’ Figure 4 shows what RCU can
achieve if not hobbled by artificial bottlenecks, as does the first portion of the
second graph in that same row.
4. Their linked-list operation permits RLU to do concurrent modifications of different
elements in the list, while RCU is forced to serialize updates. Again, RCU has
always worked just fine in conjunction with lockless updaters, a fact that has been
set forth in academic literature that the authors cited [DMS+12]. A fair comparison
would use the same style of update for RCU as it does for MV-RLU.
5. The authors fail to consider combining RCU and sequence locking, which is used
in the Linux kernel to give readers coherent views of multi-pointer updates.
6. The authors fail to consider RCU-based solutions to the Issaquah Challenge [McK16a],
which also give readers a coherent view of multi-pointer
updates, albeit with a weaker view of “coherent”.
It is surprising that the anonymous reviewers of this paper did not demand an
apples-to-apples comparison of MV-RLU and RCU. Nevertheless, the authors should
be congratulated on producing an academic paper that presents an all-too-rare example
of good scalability combined with strong read-side coherence. They are also to be
congratulated on overcoming the traditional academic prejudice against asynchronous
grace periods, which greatly aided their scalability.
Interestingly enough, RLU and RCU take different approaches to avoid the inherent
limitations of STM noted by Hagit Attiya et al. [AHM09]. RCU avoids providing strict
serializability and RLU avoids providing invisible read-only transactions, both thus
avoiding the limitations. ❑
Answer:
When using locking, spin_trylock() is a choice, with a corresponding failure-free
choice being spin_lock(), which is used in the common case: there are more
than 100 times as many calls to spin_lock() as to spin_trylock() in the v5.11
Linux kernel. When using TM, the only failure-free choice is the irrevocable transaction,
which is not used in the common case. In fact, the irrevocable transaction is not even
available in all TM implementations. ❑
Answer:
The year 2005 just called, and it says that it wants its incandescent TM marketing hype
back.
In the year 2021, TM still has significant proving to do, even with the advent of HTM,
which is covered in the upcoming Section 17.3. ❑
Answer:
The larger the updates, the greater the probability of conflict, and thus the greater
probability of retries, which degrade performance. ❑
Answer:
In many cases, the enumeration need not be exact. In these cases, hazard pointers or
RCU may be used to protect readers, which provides low probability of conflict with
any given insertion or deletion. ❑
Answer:
This scheme might work with reasonably high probability, but it can fail in ways that
would be quite surprising to most users. To see this, consider the following transaction:
1 begin_trans();
2 if (a) {
3   do_one_thing();
4   do_another_thing();
5 } else {
6   do_a_third_thing();
7   do_a_fourth_thing();
8 }
9 end_trans();
Suppose that the user sets a breakpoint at line 4, which triggers, aborting the transaction
and entering the debugger. Suppose that between the time that the breakpoint triggers
and the debugger gets around to stopping all the threads, some other thread sets the
value of a to zero. When the poor user attempts to single-step the program, surprise!
The program is now in the else-clause instead of the then-clause.
This is not what I call an easy-to-use debugger. ❑
Answer:
See the answer to Quick Quiz 7.20 in Section 7.2.1.
However, it is claimed that given a strongly atomic HTM implementation without
forward-progress guarantees, any memory-based locking design based on empty critical
sections will operate correctly in the presence of transactional lock elision. Although I
have not seen a proof of this statement, there is a straightforward rationale for this claim.
The main idea is that in a strongly atomic HTM implementation, the results of a given
transaction are not visible until after the transaction completes successfully. Therefore,
if you can see that a transaction has started, it is guaranteed to have already completed,
which means that a subsequent empty lock-based critical section will successfully “wait”
on it—after all, there is no waiting required.
This line of reasoning does not apply to weakly atomic systems (including many STM
implementations), and it also does not apply to lock-based programs that use means other
than memory to communicate. Two such means are the passage of time (for example, in
hard real-time systems) and the flow of priority (for example, in soft real-time systems).
Locking designs that rely on priority boosting are of particular interest. ❑
Answer:
It could do so, but this would be both unnecessary and insufficient.
It would be unnecessary in cases where the empty critical section was due to
conditional compilation. Here, it might well be that the only purpose of the lock was to
protect data, so eliding it completely would be the right thing to do. In fact, leaving the
empty lock-based critical section would degrade performance and scalability.
On the other hand, it is possible for a non-empty lock-based critical section to be
relying on the data-protection semantics of locking as well as its time-based and messaging semantics.
Using transactional lock elision in such a case would be incorrect, and would result in
bugs. ❑
Answer:
The short answer is that on commonplace commodity hardware, synchronization designs
based on any sort of fine-grained timing are foolhardy and cannot be expected to operate
correctly under all conditions.
That said, there are systems designed for hard real-time use that are much more
deterministic. In the (very unlikely) event that you are using such a system, here is
a toy example showing how time-based synchronization can work. Again, do not try
this on commodity microprocessors, as they have highly nondeterministic performance
characteristics.
This example uses multiple worker threads along with a control thread. Each worker
thread corresponds to an outbound data feed, and records the current time (for example,
from the clock_gettime() system call) in a per-thread my_timestamp variable after
executing each unit of work. The real-time nature of this example results in the following
set of constraints:
1. It is a fatal error for a given worker thread to fail to update its timestamp for a time
period of more than MAX_LOOP_TIME.
3. Locks are granted in strict FIFO order within a given thread priority.
When worker threads complete their feed, they must disentangle themselves from
the rest of the application and place a status value in a per-thread my_status variable
that is initialized to −1. Threads do not exit; they instead are placed on a thread pool to
accommodate later processing requirements. The control thread assigns (and re-assigns)
worker threads as needed, and also maintains a histogram of thread statuses. The control
thread runs at a real-time priority no higher than that of the worker threads.
The control thread’s code is as follows:
 1 for (;;) {
 2   for_each_thread(t) {
 3     ct = clock_gettime(...);
 4     d = ct - per_thread(my_timestamp, t);
 5     if (d >= MAX_LOOP_TIME) {
 6       /* thread departing. */
 7       acquire_lock(&departing_thread_lock);
 8       release_lock(&departing_thread_lock);
 9       i = per_thread(my_status, t);
10       status_hist[i]++; /* Bug if TLE! */
11     }
12   }
13   /* Repurpose threads as needed. */
14 }
Line 5 uses the passage of time to deduce that the thread has exited, executing lines 6
through 10 if so. The empty lock-based critical section on lines 7 and 8 guarantees that any
thread in the process of exiting completes (remember that locks are granted in FIFO
order!).
Once again, do not try this sort of thing on commodity microprocessors. After all, it
is difficult enough to get this right on systems specifically designed for hard real-time
use! ❑
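The staleness check on lines 3 through 5 of the listing above can be made concrete. This is a minimal sketch that assumes timestamps are kept as struct timespec values, as returned by clock_gettime(). The 5 ms value chosen for MAX_LOOP_TIME_NS and the helper names elapsed_ns() and presumed_departed() are hypothetical and do not come from the text.

```c
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical limit: 5 ms, expressed in nanoseconds. */
#define MAX_LOOP_TIME_NS INT64_C(5000000)

/* Elapsed nanoseconds from *then to *now; tv_nsec may "borrow" from tv_sec. */
int64_t elapsed_ns(const struct timespec *then, const struct timespec *now)
{
    return (int64_t)(now->tv_sec - then->tv_sec) * INT64_C(1000000000)
           + (int64_t)(now->tv_nsec - then->tv_nsec);
}

/* True if the timestamp is old enough for its thread to be presumed departed. */
bool presumed_departed(const struct timespec *my_timestamp,
                       const struct timespec *now)
{
    return elapsed_ns(my_timestamp, now) >= MAX_LOOP_TIME_NS;
}
```

A control thread might obtain the current time with clock_gettime(CLOCK_MONOTONIC, &now) and then call presumed_departed() on each worker's timestamp. As the answer stresses, a deduction of this kind is only sound on hardware and software designed for hard real-time use.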
Answer:
No deadlock will result. To arrive at deadlock, two different threads must each acquire
the two locks in opposite orders, which does not happen in this example. However,
deadlock detectors such as lockdep [Cor06a] will flag this as a false positive. ❑
Answer:
At least they accomplished something useful! And perhaps there will continue to be
additional HTM progress over time [SNGK17, SBN+20, GGK18, PMDY20]. ❑
Answer:
Yes and no. It appears that implementing even the HTM subset of TM in real hardware
is a bit trickier than it appears [JSG12, Was14, Int20a, Int21, Lar21]. Therefore, the sad
fact is that “starting to become available” is all too accurate as of 2021. In fact, vendors
are beginning to deprecate their HTM implementations [Int20c, Book III Appendix A].
❑
Answer:
You are welcome to your opinion on what is and is not utopian, but I will be paying more
attention to people actually making progress on the items in that list than to anyone who
might be objecting to them. This might have something to do with my long experience
with people attempting to talk me out of specific things that their favorite tools cannot
handle.
In the meantime, please feel free to read the papers written by the people who are
actually making progress, for example, this one [DFLO19]. ❑
Answer:
There can be no doubt that the verifiers used by the seL4 project are quite capable.
However, seL4 started as a single-CPU project. And although seL4 has gained
multi-processor capabilities, it is currently using very coarse-grained locking that is
similar to the Linux kernel’s old Big Kernel Lock (BKL). There will hopefully come a
day when it makes sense to add seL4’s verifiers to a book on parallel programming,
but this is not yet that day. ❑
Table E.6
              cmpxchg_acquire()      xchg_acquire()
#    Lock     filter     exists      filter     exists
2    0.004     0.022      0.039       0.027      0.058
3    0.041     0.743      1.653       0.968      3.203
4    0.374    59.565    151.962      74.818    500.96
5    4.905
Answer:
The filter clause causes the herd tool to discard executions at an earlier stage of
processing than does the exists clause, which provides significant speedups.
As for xchg_acquire(), this atomic operation will do a write whether or not
lock acquisition succeeds, which means that a model using xchg_acquire() will
have more operations than one using cmpxchg_acquire(), which won’t do a write
in the failed-acquisition case. More writes mean a larger combinatorial explosion,
as shown in Table E.6 (C-SB+l-o-o-u+l-o-o-*u.litmus,
C-SB+l-o-o-u+l-o-o-u*-C.litmus, C-SB+l-o-o-u+l-o-o-u*-CE.litmus,
C-SB+l-o-o-u+l-o-o-u*-X.litmus, and C-SB+l-o-o-u+l-o-o-u*-XE.litmus). This table clearly
shows that cmpxchg_acquire() outperforms xchg_acquire() and that use of the
filter clause outperforms use of the exists clause. ❑
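The difference in write behavior can be seen outside of the herd tool. The following is a minimal sketch that uses C11 atomics as a stand-in for the kernel's xchg_acquire() and cmpxchg_acquire(). The function names and the use of the value 2 to mark a lock held by someone else are assumptions made purely for illustration.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Trylock via exchange: stores 1 whether or not the lock was free. */
bool xchg_trylock(atomic_int *lock)
{
    return atomic_exchange_explicit(lock, 1, memory_order_acquire) == 0;
}

/* Trylock via compare-and-exchange: stores 1 only if the lock was free. */
bool cmpxchg_trylock(atomic_int *lock)
{
    int expected = 0;

    return atomic_compare_exchange_strong_explicit(lock, &expected, 1,
                                                   memory_order_acquire,
                                                   memory_order_relaxed);
}
```

If the lock word starts out holding 2, a failed cmpxchg_trylock() leaves it at 2, while a failed xchg_trylock() overwrites it with 1. That extra store on the failure path is the additional write that a model checker must account for.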
Answer:
We don’t, but it does not matter.
To see this, note that the 7 % figure only applies to injected bugs that were subsequently
located: It necessarily ignores any injected bugs that were never found. Therefore,
the MTBF statistics of known bugs are likely to be a good approximation of those of the
injected bugs that are subsequently located.
A key point in this whole section is that we should be more concerned about bugs that
inconvenience users than about other bugs that never actually manifest. This of course
is not to say that we should completely ignore bugs that have not yet inconvenienced
users, just that we should properly prioritize our efforts so as to fix the most important
and urgent bugs first. ❑
Answer:
It is a problem because real-world formal-verification tools (as opposed to those that
exist only in the imaginations of the more vociferous proponents of formal verification)
are not omniscient, and thus are only able to locate certain types of bugs. For but
one example, formal-verification tools are unlikely to spot a bug corresponding to an
omitted assertion or, equivalently, a bug corresponding to an undiscovered portion of
the specification. ❑
Answer:
One approach is to provide a simple fix that might not be suitable for a production
environment, but which allows the tool to locate the next bug. Another approach is to
restrict configuration or inputs so that the bugs located thus far cannot occur. There
are a number of similar approaches, but the common theme is that fixing the bug
from the tool’s viewpoint is usually much easier than constructing and validating a
production-quality fix, and the key point is to prioritize the larger efforts required to
construct and validate the production-quality fixes. ❑
Answer:
It would be blue all the way down, with the possible exception of the third row (overhead)
which might well be marked down for testing’s difficulty finding improbable bugs.
On the other hand, improbable bugs are often also irrelevant bugs, so your mileage
may vary.
Much depends on the size of your installed base. If your code is only ever going to
run on (say) 10,000 systems, Murphy can actually be a really nice guy. Everything that
can go wrong, will. Eventually. Perhaps in geologic time.
But if your code is running on 20 billion systems, like the Linux kernel was said to
be by late 2017, Murphy can be a real jerk! Everything that can go wrong, will, and it
can go wrong really quickly!!! ❑
Answer:
Indeed there are! This table focuses on those that Paul has used, but others are proving
to be useful. Formal verification has been heavily used in the seL4 project [SM13], and
its tools can now handle modest levels of concurrency. More recently, Catalin Marinas
used Lamport’s TLA tool [Lam02] to locate some forward-progress bugs in the Linux
kernel’s queued spinlock implementation. Will Deacon fixed these bugs [Dea18], and
Catalin verified Will’s fixes [Mar18].
E.18. IMPORTANT QUESTIONS 863
Answer:
Here are errors you might have found:
1. Missing barrier() or volatile on tight loops.
2. Missing memory barriers on update side.
3. Lack of synchronization between producer and consumer. ❑
Answer:
Because strongly ordered implementations are sometimes able to provide greater
consistency among sets of calls to functions accessing a given data structure. For
example, compare the atomic counter of Listing 5.2 to the statistical counter of
Section 5.2. Suppose that one thread is adding the value 3 and another is adding the
value 5, while two other threads are concurrently reading the counter’s value. With
atomic counters, it is not possible for one of the readers to obtain the value 3 while the
other obtains the value 5. With statistical counters, this outcome really can happen.
In fact, in some computing environments, this outcome can happen even on relatively
strongly ordered hardware such as x86.
Therefore, if your users happen to need this admittedly unusual level of consistency,
you should avoid weakly ordered statistical counters. ❑
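To see how two readers can disagree, note that a single atomic counter passes through the values 0, then 3 or 5, then 8, so it cannot show 3 to one reader and 5 to the other. A statistical counter has no such single sequence of values. The following is a minimal single-threaded sketch that replays one legal interleaving. It assumes a two-slot per-thread counter that readers sum in slot order, and the structure and function names are hypothetical.

```c
/* Toy statistical counter: one slot per updater thread. */
struct stat_counter {
    long per_thread[2];
};

/* Replays one legal interleaving and reports what each reader summed. */
void replay_interleaving(long *reader_x, long *reader_y)
{
    struct stat_counter c = { { 0, 0 } };
    long x = 0, y = 0;

    x += c.per_thread[0];  /* Reader X reads slot 0: sees 0. */
    c.per_thread[0] += 3;  /* Updater 0 adds 3. */
    y += c.per_thread[0];  /* Reader Y reads slot 0: sees 3. */
    y += c.per_thread[1];  /* Reader Y reads slot 1: sees 0. */
    c.per_thread[1] += 5;  /* Updater 1 adds 5. */
    x += c.per_thread[1];  /* Reader X reads slot 1: sees 5. */

    *reader_x = x;  /* Saw the +5 but not the +3. */
    *reader_y = y;  /* Saw the +3 but not the +5. */
}
```

Reader X ends up with 5 and reader Y with 3, even though each step is an ordinary sequentially consistent access. The disagreement comes from summing separate variables at different times, not from hardware reordering.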
Answer:
Yes. ❑
Answer:
The people who would like to arbitrarily subdivide and interleave the workload. Of
course, an arbitrary subdivision might end up separating a lock acquisition from the
corresponding lock release, which would prevent any other thread from acquiring that
lock. If the locks were pure spinlocks, this could even result in deadlock. ❑
Answer:
Suppose the functions foo() and bar() in Listing E.19 are invoked concurrently from
different CPUs. Then foo() will acquire my_lock on line 3, while bar() will
acquire rcu_gp_lock on line 13.
When foo() advances to line 4, it will attempt to acquire rcu_gp_lock, which
is held by bar(). Then when bar() advances to line 14, it will attempt to acquire
my_lock, which is held by foo().
Each function is then waiting for a lock that the other holds, a classic deadlock.
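The stuck state described above can be reproduced without actually hanging. The following is a minimal sketch, assuming default (non-recursive) POSIX mutexes. It is a single-threaded stand-in for the two CPUs, it uses pthread_mutex_trylock() in place of the blocking acquisitions, and the function name would_deadlock() is hypothetical.

```c
#include <errno.h>
#include <pthread.h>

pthread_mutex_t my_lock = PTHREAD_MUTEX_INITIALIZER;
pthread_mutex_t rcu_gp_lock = PTHREAD_MUTEX_INITIALIZER;

/*
 * Recreates the state in which foo() holds my_lock and bar() holds
 * rcu_gp_lock, then probes each second acquisition with trylock.
 * Returns 1 if both probes report EBUSY, meaning that the blocking
 * versions would each wait on the other forever.
 */
int would_deadlock(void)
{
    int foo_probe, bar_probe;

    pthread_mutex_lock(&my_lock);      /* foo()'s first acquisition. */
    pthread_mutex_lock(&rcu_gp_lock);  /* bar()'s first acquisition. */

    foo_probe = pthread_mutex_trylock(&rcu_gp_lock);  /* foo()'s second. */
    bar_probe = pthread_mutex_trylock(&my_lock);      /* bar()'s second. */

    pthread_mutex_unlock(&rcu_gp_lock);
    pthread_mutex_unlock(&my_lock);
    return foo_probe == EBUSY && bar_probe == EBUSY;
}
```

Both probes fail because each lock is already held, so neither party can make progress until the other releases, and neither will release until it makes progress.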
E.19. “TOY” RCU IMPLEMENTATIONS 865
Answer:
One could in fact use reader-writer locks in this manner. However, textbook reader-writer
locks suffer from memory contention, so that the RCU read-side critical sections would
need to be quite long to actually permit parallel execution [McK03].
On the other hand, use of a reader-writer lock that is read-acquired in
rcu_read_lock() would avoid the deadlock condition noted above. ❑
Answer:
Making this change would re-introduce the deadlock, so no, it would not be cleaner. ❑
Answer:
One deadlock is where a lock is held across synchronize_rcu(), and that same lock
is acquired within an RCU read-side critical section. However, this situation could
deadlock any correctly designed RCU implementation. After all, the
synchronize_rcu() primitive must wait for all pre-existing RCU read-side critical sections to
complete, but if one of those critical sections is spinning on a lock held by the thread
executing the synchronize_rcu(), we have a deadlock inherent in the definition of
RCU.
Another deadlock happens when attempting to nest RCU read-side critical sections.
This deadlock is peculiar to this implementation, and might be avoided by using recursive
locks, or by using reader-writer locks that are read-acquired by rcu_read_lock() and
write-acquired by synchronize_rcu().
However, if we exclude the above two cases, this implementation of RCU does
not introduce any deadlock situations. This is because the only time some other thread’s
lock is acquired is when executing synchronize_rcu(), and in that case, the lock is
immediately released, prohibiting a deadlock cycle that does not involve a lock held
across the synchronize_rcu(), which is the first case above. ❑
Answer:
This is indeed an advantage, but do not forget that rcu_dereference() and
rcu_assign_pointer() are still required, which means volatile manipulation for
rcu_dereference() and memory barriers for rcu_assign_pointer(). Of course, many
Alpha CPUs require memory barriers for both primitives. ❑
Answer:
Indeed, this would deadlock any legal RCU implementation. But is rcu_read_lock()
really participating in the deadlock cycle? If you believe that it is, then please ask
yourself this same question when looking at the RCU implementation in Appendix B.9.
❑
Answer:
The update-side test was run in the absence of readers, so the poll() system call was never
invoked. In addition, the actual code has this poll() system call commented out, the
better to evaluate the true overhead of the update-side code. Any production uses of this
code would be better served by using the poll() system call, but then again, production
uses would be even better served by other implementations shown later in this section.
❑
Answer:
Although this would in fact eliminate the starvation, it would also mean that
rcu_read_lock() would spin or block waiting for the writer, which is in turn waiting on
readers. If one of these readers is attempting to acquire a lock that the spinning/blocking
rcu_read_lock() holds, we again have deadlock.
In short, the cure is worse than the disease. See Appendix B.4 for a proper cure. ❑
Answer:
The spin-lock acquisition only guarantees that the spin-lock’s critical section will not
“bleed out” to precede the acquisition. It in no way guarantees that code preceding the
spin-lock acquisition won’t be reordered into the critical section. Such reordering could
cause a removal from an RCU-protected list to be reordered to follow the complementing
of rcu_idx, which could allow a newly starting RCU read-side critical section to see
the recently removed data element.
Exercise for the reader: Use a tool such as Promela/spin to determine which (if any)
of the memory barriers in Listing B.6 are really needed. See Chapter 12 for information
on using these tools. The first correct and complete response will be credited. ❑
Answer:
Both flips are absolutely required. To see this, consider the following sequence of
events:
4. Lines 9 and 10 of rcu_read_lock() store the value zero to this thread’s instance
of rcu_read_idx and increment rcu_refcnt[0], respectively. Execution then
proceeds into the RCU read-side critical section.
6. The grace period that started in step 5 has been allowed to end, despite the fact
that the RCU read-side critical section that started beforehand in step 4 has not
completed. This violates RCU semantics, and could allow the update to free a data
element that the RCU read-side critical section was still referencing.
Exercise for the reader: What happens if rcu_read_lock() is preempted for a very
long time (hours!) just after line 8? Does this implementation operate correctly in that
case? Why or why not? The first correct and complete response will be credited. ❑
Answer:
Using non-atomic operations would cause increments and decrements to be lost, in turn
causing the implementation to fail. See Appendix B.5 for a safe way to use non-atomic
operations in rcu_read_lock() and rcu_read_unlock(). ❑
Answer:
The atomic_read() primitive does not actually execute atomic machine instructions,
but rather does a normal load from an atomic_t. Its sole purpose is to keep the
compiler’s type-checking happy. If the Linux kernel ran on 8-bit CPUs, it would also
need to prevent “store tearing”, which could happen due to the need to store a 16-bit
pointer with two eight-bit accesses on some 8-bit systems. But thankfully, it seems that
no one runs Linux on 8-bit systems. ❑
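A userspace sketch may make the "normal load" point concrete. The kernel's actual definition is not reproduced here. The structure layout, the names toy_atomic_t and toy_atomic_read(), and the use of a volatile cast (to keep the compiler from fusing or repeating the load) are all assumptions made for illustration.

```c
/* Userspace stand-in for an atomic_t-style wrapper; layout is assumed. */
typedef struct {
    int counter;
} toy_atomic_t;

/*
 * An ordinary load with no atomic read-modify-write instruction.  The
 * wrapper type makes the compiler reject accidental plain accesses to
 * the counter, and the volatile cast asks for exactly one load.
 */
static inline int toy_atomic_read(const toy_atomic_t *v)
{
    return *(const volatile int *)&v->counter;
}
```

On typical 32-bit and 64-bit systems an aligned int load of this kind is a single machine access, which is why no special instruction is needed.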
Answer:
Keep in mind that we only wait for a given thread if that thread is still in a pre-existing
RCU read-side critical section, and that waiting for one hold-out thread gives all the
other threads a chance to complete any pre-existing RCU read-side critical sections
that they might still be executing. So the only way that we would wait for 2𝑁 intervals
would be if the last thread still remained in a pre-existing RCU read-side critical section
despite all the waiting for all the prior threads. In short, this implementation will not
wait unnecessarily.
However, if you are stress-testing code that uses RCU, you might want to comment
out the poll() statement in order to better catch bugs that incorrectly retain a reference
to an RCU-protected data element outside of an RCU read-side critical section. ❑
Answer:
Special-purpose uniprocessor implementations of RCU can attain this ideal [McK09a].
❑
Answer:
Assigning zero (or any other even-numbered constant) would in fact work, but assigning
the value of rcu_gp_ctr can provide a valuable debugging aid, as it gives the developer
an idea of when the corresponding thread last exited an RCU read-side critical section.
❑
Answer:
These memory barriers are required because the locking primitives are only guaranteed
to confine the critical section. The locking primitives are under absolutely no obligation
to keep other code from bleeding into the critical section. The pair of memory barriers
is therefore required to prevent this sort of code motion, whether performed by the
compiler or by the CPU. ❑
Answer:
Indeed it could, with a few modifications. This work is left as an exercise for the reader.
❑
Answer:
It is a real problem, there is a sequence of events leading to failure, and there are a
number of possible ways of addressing it. For more details, see the Quick Quizzes near
the end of Appendix B.8. The discussion is located there both (1) to give you
more time to think about it and (2) because the nesting support added in that section
greatly reduces the time required to overflow the counter. ❑
Answer:
The apparent simplicity of the separate per-thread variable is a red herring. This
approach incurs much greater complexity in the guise of careful ordering of operations,
especially if signal handlers are to be permitted to contain RCU read-side critical
sections. But don’t take my word for it, code it up and see what you end up with! ❑
Answer:
One way would be to replace the magnitude comparison on lines 32 and 33 with an
inequality check of the per-thread rcu_reader_gp variable against
rcu_gp_ctr+RCU_GP_CTR_BOTTOM_BIT. ❑
Answer:
It can indeed be fatal. To see this, consider the following sequence of events:
3. Thread 0 now starts running again, and stores into its per-thread rcu_reader_gp
variable. The value it stores is RCU_GP_CTR_BOTTOM_BIT+1 greater than that of
the global rcu_gp_ctr.
5. Thread 1 now removes the data element A that thread 0 just acquired a reference to.
Note that this scenario can also occur in the implementation presented in Appendix B.7.
One strategy for fixing this problem is to use 64-bit counters so that the time required
to overflow them would exceed the useful lifetime of the computer system. Note that
non-antique members of the 32-bit x86 CPU family allow atomic manipulation of 64-bit
counters via the cmpxchg8b instruction.
Another strategy is to limit the rate at which grace periods are permitted to occur in
order to achieve a similar effect. For example, synchronize_rcu() could record the
last time that it was invoked, and any subsequent invocation would then check this time
and block as needed to force the desired spacing. For example, if the low-order four bits
of the counter were reserved for nesting, and if grace periods were permitted to occur at
most ten times per second, then it would take more than 300 days for the counter to
overflow. However, this approach is not helpful if there is any possibility that the system
will be fully loaded with CPU-bound high-priority real-time threads for the full 300
days. (A remote possibility, perhaps, but best to consider it ahead of time.)
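The 300-day figure can be checked with a short calculation. This is a minimal sketch that assumes a 32-bit counter, which the text implies by offering 64-bit counters as a fix. The function name days_to_overflow() is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Days until the grace-period portion of a counter wraps, given the
 * number of low-order bits reserved for nesting and a cap on the
 * grace-period rate.
 */
double days_to_overflow(unsigned counter_bits, unsigned nesting_bits,
                        double gps_per_second)
{
    uint64_t grace_periods;

    /* The shift below is only defined for differences of 1 to 63 bits. */
    assert(counter_bits > nesting_bits && counter_bits - nesting_bits < 64);
    grace_periods = UINT64_C(1) << (counter_bits - nesting_bits);
    return (double)grace_periods / gps_per_second / 86400.0;
}
```

With 32 counter bits, 4 nesting bits, and 10 grace periods per second, this gives 2^28 = 268,435,456 grace periods, or about 26.8 million seconds, which is roughly 310 days and consistent with "more than 300 days".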
A third approach is to administratively abolish real-time threads from the system in
question. In this case, the preempted process will age up in priority, thus getting to run
long before the counter had a chance to overflow. Of course, this approach is less than
helpful for real-time applications.
A fourth approach would be for rcu_read_lock() to recheck the value of the global
rcu_gp_ctr after storing to its per-thread rcu_reader_gp counter, retrying if the
new value of the global rcu_gp_ctr is inappropriate. This works, but introduces
non-deterministic execution time into rcu_read_lock(). On the other hand, if your
application is being preempted long enough for the counter to overflow, you have no
hope of deterministic execution time in any case!
A fifth approach is for the grace period process to wait for all readers to become
aware of the new grace period. This works nicely in theory, but hangs if a reader blocks
indefinitely outside of an RCU read-side critical section.
A final approach is, oddly enough, to use a single-bit grace-period counter and for
each call to synchronize_rcu() to take two passes through its algorithm. This is the
approach used by userspace RCU [Des09b], and is described in detail in the journal
article and supplementary materials [DMS+12, Appendix D]. ❑
Answer:
Indeed it does! An application using this implementation of RCU should therefore
invoke rcu_quiescent_state() sparingly, instead using rcu_read_lock() and
rcu_read_unlock() most of the time.
However, this memory barrier is absolutely required so that other threads will see the
store on lines 12–13 before any subsequent RCU read-side critical sections executed by
the caller. ❑
Answer:
The memory barrier on line 11 prevents any RCU read-side critical sections that might
Answer:
Since the measurement loop contains a pair of empty functions, the compiler optimizes it
away. The measurement loop takes 1,000 passes between each call to
rcu_quiescent_state(), so this measurement is roughly one thousandth of the overhead of a single
call to rcu_quiescent_state(). ❑
Answer:
A library function has absolutely no control over the caller, and thus cannot force the
caller to invoke rcu_quiescent_state() periodically. On the other hand, a library
function that made many references to a given RCU-protected data structure might
be able to invoke rcu_thread_online() upon entry, rcu_quiescent_state()
periodically, and rcu_thread_offline() upon exit. ❑
Answer:
Please note that the RCU read-side critical section is in effect extended beyond the
enclosing rcu_read_lock() and rcu_read_unlock(), out to the previous and next
call to rcu_quiescent_state(). This rcu_quiescent_state() can be thought of as
an rcu_read_unlock() immediately followed by an rcu_read_lock().
Even so, the actual deadlock itself will involve the lock acquisition in the RCU
read-side critical section and the synchronize_rcu(), never the
rcu_quiescent_state(). ❑
such as call_rcu(). This primitive may be invoked within an RCU read-side critical
section, and the specified RCU callback will in turn be invoked at a later time, after a
grace period has elapsed.
The ability to perform an RCU update while within an RCU read-side critical
section can be extremely convenient, and is analogous to a (mythical) unconditional
read-to-write upgrade for reader-writer locking. ❑
Answer:
The writeback message originates from a given CPU, or in some designs from a given
level of a given CPU’s cache—or even from a cache that might be shared among several
CPUs. The key point is that a given cache does not have room for a given data item, so
some other piece of data must be ejected from the cache to make room. If there is some
other piece of data that is duplicated in some other cache or in memory, then that piece
of data may be simply discarded, with no writeback message required.
On the other hand, if every piece of data that might be ejected has been modified so
that the only up-to-date copy is in this cache, then one of those data items must be copied
somewhere else. This copy operation is undertaken using a “writeback message”.
The destination of the writeback message has to be something that is able to store
the new value. This might be main memory, but it also might be some other cache.
If it is a cache, it is normally a higher-level cache for the same CPU, for example, a
level-1 cache might write back to a level-2 cache. However, some hardware designs
permit cross-CPU writebacks, so that CPU 0’s cache might send a writeback message
to CPU 1. This would normally be done if CPU 1 had somehow indicated an interest in
the data, for example, by having recently issued a read request.
In short, a writeback message is sent from some part of the system that is short of
space, and is received by some other part of the system that can accommodate the data.
❑
Answer:
One of the CPUs gains access to the shared bus first, and that CPU “wins”. The other
CPU must invalidate its copy of the cache line and transmit an “invalidate acknowledge”
message to the winning CPU.
Of course, the losing CPU can be expected to immediately issue a “read invalidate”
transaction, so the winning CPU’s victory will be quite ephemeral. ❑
Answer:
It might, if large-scale multiprocessors were in fact implemented that way. Larger
multiprocessors, particularly NUMA machines, tend to use so-called “directory-based”
cache-coherence protocols to avoid this and other problems. ❑
Answer:
Usually by adding additional states, though these additional states need not be actually
stored with the cache line, due to the fact that only a few lines at a time will be
transitioning. The need to delay transitions is but one issue that results in real-world
cache coherence protocols being much more complex than the over-simplified MESI
protocol described in this appendix. Hennessy and Patterson’s classic introduction to
computer architecture [HP95] covers many of these issues. ❑
Answer:
Because the purpose of store buffers is not just to hide acknowledgement latencies in
multiprocessor cache-coherence protocols, but to hide memory latencies in general.
Answer:
Here are two ways for hardware to easily handle variable-length stores.
First, each store-buffer entry could be a single byte wide. Then a 64-bit store
would consume eight store-buffer entries. This approach is simple and flexible, but one
disadvantage is that each entry would need to replicate much of the address that was
stored to.
Second, each store-buffer entry could be double the size of a cache line, with half
of the bits containing the values stored, and the other half indicating which bits had
been stored to. So, assuming a 32-bit cache line, a single-byte store of 0x5a to the
low-order byte of a given cache line would result in 0xXXXXXX5a for the first half and
0x000000ff for the second half, where the values labeled X are arbitrary because they
would be ignored. This approach allows multiple consecutive stores corresponding to a
given cache line to be merged into a single store-buffer entry, but is space-inefficient for
random stores of single bytes.
Much more complex and efficient schemes are of course used by actual hardware
designers. ❑
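The second scheme can be sketched in C as follows, keeping the answer's toy 32-bit cache line. The names sb_entry, sb_store_byte(), and sb_apply() are illustrative assumptions, and real hardware does this in logic rather than software.

```c
#include <assert.h>
#include <stdint.h>

/* One store-buffer entry covering a toy 32-bit cache line. */
struct sb_entry {
	uint32_t data;	/* values stored; bits outside mask are ignored */
	uint32_t mask;	/* 1 bits mark the bits that have been stored to */
};

/* Merge a single-byte store at byte offset 0..3 into the entry. */
void sb_store_byte(struct sb_entry *e, unsigned int offset, uint8_t val)
{
	unsigned int shift = 8 * offset;
	uint32_t m = (uint32_t)0xff << shift;

	assert(offset < 4);
	e->data = (e->data & ~m) | ((uint32_t)val << shift);
	e->mask |= m;
}

/* Apply the buffered stores to the cache line once it arrives. */
uint32_t sb_apply(const struct sb_entry *e, uint32_t line)
{
	return (line & ~e->mask) | (e->data & e->mask);
}
```

Storing 0x5a at offset 0 into an empty entry leaves the mask at 0x000000ff, matching the example in the text, and a later store to another byte of the same line merges into the same entry.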
Answer:
Yes, they do. But to do so, they add states beyond the MESI quadruple that this example
is working within. ❑
Answer:
It might, and that is why real hardware takes steps to avoid this problem. A traditional
approach, pointed out by Vasilevsky Alexander, is to write this cache line back to main
memory before marking the cache line as “shared”. A more efficient (though more
complex) approach is to use additional state to indicate whether or not the cache line is
“dirty”, allowing the writeback to happen. Year-2000 systems went further, using much
more state in order to avoid redundant writebacks [CSG99, Figure 8.42]. It would be
reasonable to assume that complexity has not decreased in the meantime. ❑
Answer:
CPU 0 already has the values of these variables, given that it has a read-only copy of
the cache line containing “a”. Therefore, all CPU 0 need do is to cause the other CPUs
to discard their copies of this cache line. An “invalidate” message therefore suffices. ❑
Answer:
Suppose that memory barrier was omitted.
Keep in mind that CPUs are free to speculatively execute later loads, which can have
the effect of executing the assertion before the while loop completes. Furthermore,
compilers assume that only the currently executing thread is updating the variables, and
this assumption allows the compiler to hoist the load of a to precede the loop.
In fact, some compilers would transform the loop to a branch around an infinite loop
as follows:
    void foo(void)
    {
        a = 1;
        smp_mb();
        b = 1;
    }

    void bar(void)
    {
        if (b == 0)
            for (;;)
                continue;
        assert(a == 1);
    }
Given this optimization, the code would behave in a completely different way than
the original code. If bar() observed “b == 0”, the assertion could of course not be
reached at all due to the infinite loop. However, if bar() loaded the value “1” just as
“foo()” stored it, the CPU might still have the old zero value of “a” in its cache, which
would cause the assertion to fire. You should of course use volatile casts (for example,
those volatile casts implied by the C11 relaxed atomic load operation) to prevent the
compiler from optimizing your parallel code into oblivion. But volatile casts would not
prevent a weakly ordered CPU from loading the old value for “a” from its cache, which
means that this code also requires the explicit memory barrier in “bar()”.
In short, both compilers and CPUs aggressively apply code-reordering optimizations,
so you must clearly communicate your constraints using the compiler directives and
memory barriers provided for this purpose. ❑
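A minimal sketch of the constrained version follows. It assumes C11 atomics rather than the Linux-kernel primitives: relaxed atomic accesses stand in for the volatile casts, and atomic_thread_fence(memory_order_seq_cst) stands in for smp_mb().

```c
#include <assert.h>
#include <stdatomic.h>

static atomic_int a, b;

void foo(void)
{
	atomic_store_explicit(&a, 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);	/* stands in for smp_mb() */
	atomic_store_explicit(&b, 1, memory_order_relaxed);
}

void bar(void)
{
	/* The relaxed atomic load forces a fresh load of b on each pass. */
	while (atomic_load_explicit(&b, memory_order_relaxed) == 0)
		continue;
	atomic_thread_fence(memory_order_seq_cst);	/* the barrier in bar() */
	assert(atomic_load_explicit(&a, memory_order_relaxed) == 1);
}
```

The atomic load keeps the compiler from hoisting the load of b out of the loop, and the fence in bar() keeps a weakly ordered CPU from satisfying the load of a before the loop has completed.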
Answer:
An immediate flush of the invalidation queue would do the trick. Except that the common-
case super-scalar CPU is executing many instructions at once, and not necessarily even
in the expected order. So what would “immediate” even mean? The answer is clearly
“not much”.
Nevertheless, for simpler CPUs that execute instructions serially, flushing the
invalidation queue might be a reasonable implementation strategy. ❑
Answer:
Sort of.
Note well that this litmus test has not one but two full memory-barrier instructions,
namely the two sync instructions executed by P2 and P3.
It is the interaction of those two instructions that provides the global ordering, not just
their individual execution. For example, each of those two sync instructions might stall
waiting for all CPUs to process their invalidation queues before allowing subsequent
instructions to execute.14 ❑
Answer:
No. Consider the case where a thread migrates from one CPU to another, and where
the destination CPU perceives the source CPU’s recent memory operations out of
order. To preserve user-mode sanity, kernel hackers must use memory barriers in the
context-switch path. However, the locking already required to safely do a context switch
should automatically provide the memory barriers needed to cause the user-level task
to see its own accesses in order. That said, if you are designing a super-optimized
scheduler, either in the kernel or at user level, please keep this scenario in mind! ❑
Answer:
No. Such a memory barrier would only force ordering local to CPU 1. It would have no
effect on the relative ordering of CPU 0’s and CPU 1’s accesses, so the assertion could
still fail. However, all mainstream computer systems provide one mechanism or another
to provide “transitivity”, which provides intuitive causal ordering: If B saw the effects
of A’s accesses, and C saw the effects of B’s accesses, then C must also see the effects
of A’s accesses. In short, hardware designers have taken at least a little pity on software
developers. ❑
Answer:
The assertion must ensure that the load of “e” precedes that of “a”. In the Linux kernel,
the barrier() primitive may be used to accomplish this in much the same way that
the memory barrier was used in the assertions in the previous examples. For example,
the assertion can be modified as follows:
r1 = e;
barrier();
assert(r1 == 0 || a == 1);
No changes are needed to the code in the first two columns, because interrupt handlers
run atomically from the perspective of the interrupted code. ❑
Answer:
The result depends on whether the CPU supports “transitivity”. In other words, CPU 0
stored to “e” after seeing CPU 1’s store to “c”, with a memory barrier between CPU 0’s
load from “c” and store to “e”. If some other CPU sees CPU 0’s store to “e”, is it also
guaranteed to see CPU 1’s store?
All CPUs I am aware of claim to provide transitivity. ❑
Dictionaries are inherently circular in nature.
Self Reference in word definitions, David Levary et al.
Glossary
Acquire Load: A read from memory that has acquire semantics. Normal use cases
pair an acquire load with a release store, in which case if the load returns the
value stored, then all code executed by the loading CPU after that acquire load will
see the effects of all memory-reference instructions executed by the storing CPU
prior to that release store. Acquiring a lock provides similar memory-ordering
semantics, hence the “acquire” in “acquire load”. (See also “memory barrier” and
“release store”.)
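A minimal sketch of this pairing, assuming C11 atomics and the hypothetical names payload, ready, producer(), and consumer():

```c
#include <assert.h>
#include <stdatomic.h>

static int payload;		/* plain data published by the storing thread */
static atomic_int ready;	/* flag written with release, read with acquire */

void producer(void)
{
	payload = 42;						/* prior to the release store */
	atomic_store_explicit(&ready, 1, memory_order_release);	/* release store */
}

/* Returns 1 and fills in *out if the flag has been seen, otherwise returns 0. */
int consumer(int *out)
{
	if (!atomic_load_explicit(&ready, memory_order_acquire))	/* acquire load */
		return 0;
	*out = payload;		/* sees 42 because the load returned the value stored */
	return 1;
}
```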
Amdahl’s Law: If sufficient numbers of CPUs are used to run a job that has both a
sequential portion and a concurrent portion, performance and scalability will be
limited by the overhead of the sequential portion.
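The usual textbook formulation of this limit can be sketched as follows. The formula is an assumption here, since this glossary entry does not state one, and amdahl_speedup() is an illustrative name.

```c
#include <assert.h>

/* Speedup on n CPUs when a fraction seq_frac of the job is sequential. */
double amdahl_speedup(double seq_frac, unsigned int n)
{
	assert(n > 0 && seq_frac >= 0.0 && seq_frac <= 1.0);
	return 1.0 / (seq_frac + (1.0 - seq_frac) / n);
}
```

Under this formulation, a job that is 10% sequential cannot be sped up by a factor of ten or more, no matter how many CPUs are used.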
Associativity: The number of cache lines that can be held simultaneously in a given
cache, when all of these cache lines hash identically in that cache. A cache
that could hold four cache lines for each possible hash value would be termed a
“four-way set-associative” cache, while a cache that could hold only one cache
line for each possible hash value would be termed a “direct-mapped” cache. A
cache whose associativity was equal to its capacity would be termed a “fully
associative” cache. Fully associative caches have the advantage of eliminating
associativity misses, but, due to hardware limitations, fully associative caches
are normally quite limited in size. The associativity of the large caches found on
modern microprocessors typically ranges from two-way to eight-way.
Associativity Miss: A cache miss incurred because the corresponding CPU has recently
accessed more data hashing to a given set of the cache than will fit in that set. Fully
associative caches are not subject to associativity misses (or, equivalently, in fully
associative caches, associativity and capacity misses are identical).
Atomic: An operation is considered “atomic” if it is not possible to observe any
intermediate state. For example, on most CPUs, a store to a properly aligned
pointer is atomic, because other CPUs will see either the old value or the new
value, but are guaranteed not to see some mixed value containing some pieces of
the new and old values.
Atomic Read-Modify-Write Operation: An atomic operation that both reads and
writes memory is considered an atomic read-modify-write operation, or atomic
RMW operation for short. Although the value written usually depends on the value
read, atomic_xchg() is the exception that proves this rule.
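The distinction can be sketched with C11 atomics, where atomic_exchange() is used as a user-space analogue of the Linux kernel's atomic_xchg(). The names counter, bump(), and replace() are illustrative.

```c
#include <assert.h>
#include <stdatomic.h>

static atomic_int counter;

/* Typical RMW: the value written depends on the value read. */
int bump(void)
{
	return atomic_fetch_add(&counter, 1);	/* returns the old value */
}

/* Exchange: the value written ignores the value read. */
int replace(int newval)
{
	return atomic_exchange(&counter, newval);	/* returns the old value */
}
```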
Bounded Wait Free: A forward-progress guarantee in which every thread makes
progress within a specific finite period of time, the specific time being the bound.
Cache Coherence: A property of most modern SMP machines where all CPUs will
observe a sequence of values for a given variable that is consistent with at least
one global order of values for that variable. Cache coherence also guarantees that
at the end of a group of stores to a given variable, all CPUs will agree on the final
value for that variable. Note that cache coherence applies only to the series of
values taken on by a single variable. In contrast, the memory consistency model
for a given machine describes the order in which loads and stores to groups of
variables will appear to occur. See Section 15.2.6 for more information.
Cache-Coherence Protocol: A communications protocol, normally implemented in
hardware, that enforces memory consistency and ordering, preventing different
CPUs from seeing inconsistent views of data held in their caches.
Cache Geometry: The size and associativity of a cache is termed its geometry. Each
cache may be thought of as a two-dimensional array, with rows of cache lines
(“sets”) that have the same hash value, and columns of cache lines (“ways”) in
which every cache line has a different hash value. The associativity of a given
cache is its number of columns (hence the name “way”—a two-way set-associative
cache has two “ways”), and the size of the cache is its number of rows multiplied
by its number of columns.
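A small sketch of this two-dimensional view follows. It assumes that the hash is simply the cache-line number modulo the number of sets, which real hardware need not use, and the structure and function names are illustrative.

```c
#include <assert.h>
#include <stdint.h>

struct cache_geometry {
	unsigned int line_size;	/* bytes per cache line */
	unsigned int sets;	/* rows */
	unsigned int ways;	/* columns, that is, the associativity */
};

/* Cache size in cache lines: rows multiplied by columns. */
unsigned long cache_lines(const struct cache_geometry *g)
{
	return (unsigned long)g->sets * g->ways;
}

/* Row in which the line containing addr must be placed (assumed modulo hash). */
unsigned int cache_set_index(const struct cache_geometry *g, uintptr_t addr)
{
	return (unsigned int)((addr / g->line_size) % g->sets);
}
```

Two addresses that map to the same row compete for that row's ways, which is where associativity misses come from.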
Cache Line: (1) The unit of data that circulates among the CPUs and memory, usually
a moderate power of two in size. Typical cache-line sizes range from 16 to 256
bytes.
(2) A physical location in a CPU cache capable of holding one cache-line unit of
data.
(3) A physical location in memory capable of holding one cache-line unit of data,
but that is also aligned on a cache-line boundary. For example, the address of the
first word of a cache line in memory will end in 0x00 on systems with 256-byte
cache lines.
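The alignment property in definition (3) can be checked with a pair of helpers such as the following sketch, whose function names are illustrative and which assumes a power-of-two line size.

```c
#include <assert.h>
#include <stdint.h>

/* Address of the first byte of the cache line containing addr. */
uintptr_t cache_line_base(uintptr_t addr, uintptr_t line_size)
{
	/* line_size must be a nonzero power of two */
	assert(line_size != 0 && (line_size & (line_size - 1)) == 0);
	return addr & ~(line_size - 1);
}

/* Nonzero if addr sits on a cache-line boundary. */
int cache_line_aligned(uintptr_t addr, uintptr_t line_size)
{
	return cache_line_base(addr, line_size) == addr;
}
```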
Cache Miss: A cache miss occurs when data needed by the CPU is not in that CPU’s
cache. The data might be missing because of a number of reasons, including:
(1) This CPU has never accessed the data before (“startup” or “warmup” miss),
(2) This CPU has recently accessed more data than would fit in its cache, so that
some of the older data had to be removed (“capacity” miss), (3) This CPU has
recently accessed more data in a given set1 than that set could hold (“associativity”
1 In hardware-cache terminology, the word “set” is used in the same way that the word
miss), (4) Some other CPU has written to the data (or some other data in the same
cache line) since this CPU has accessed it (“communication miss”), or (5) This
CPU attempted to write to a cache line that is currently read-only, possibly due to
that line being replicated in other CPUs’ caches.
Capacity Miss: A cache miss incurred because the corresponding CPU has recently
accessed more data than will fit into the cache.
Code Locking: A simple locking design in which a “global lock” is used to protect
a set of critical sections, so that access by a given thread to that set is granted
or denied based only on the set of threads currently occupying the set of critical
sections, not based on what data the thread intends to access. The scalability of
a code-locked program is limited by the code; increasing the size of the data set
will normally not increase scalability (in fact, will typically decrease scalability by
increasing “lock contention”). Contrast with “data locking”.
Communication Miss: A cache miss incurred because some other CPU has written to
the cache line since the last time this CPU accessed it.
Concurrent: In this book, a synonym of parallel. Please see Appendix A.6 on page 638
for a discussion of the recent distinction between these two terms.
Data Locking: A scalable locking design in which each instance of a given data
structure has its own lock. If each thread is using a different instance of the
data structure, then all of the threads may be executing in the set of critical
sections simultaneously. Data locking has the advantage of automatically scaling
to increasing numbers of CPUs as the number of instances of data grows. Contrast
with “code locking”.
Data Race: A race condition in which several CPUs or threads access a variable
concurrently, and in which at least one of those accesses is a store and at least one
of those accesses is a plain access. It is important to note that while the presence
of data races often indicates the presence of bugs, the absence of data races in no
way implies the absence of bugs. (See “Plain access” and “Race condition”.)
Deadlock: A failure mode in which each of several threads is unable to make progress
until some other thread makes progress. For example, if two threads acquire a pair
of locks in opposite orders, deadlock can result. More information is provided in
Section 7.1.1.
Deadlock Free: A forward-progress guarantee in which, in the absence of failures, at
least one thread makes progress within a finite period of time.
Direct-Mapped Cache: A cache with only one way, so that it may hold only one cache
line with a given hash value.
Efficiency: A measure of effectiveness normally expressed as a ratio of some metric
actually achieved to some maximum value. The maximum value might be a
theoretical maximum, but in parallel programming is often based on the corresponding
measured single-threaded metric.
Embarrassingly Parallel: A problem or algorithm where adding threads does not
significantly increase the overall cost of the computation, resulting in linear
speedups as threads are added (assuming sufficient CPUs are available).
Energy Efficiency: Shorthand for “energy-efficient use” in which the goal is to carry
out a given computation with reduced energy consumption. Sublinear scalability
can be an obstacle to energy-efficient use of a multicore system.
Epoch-Based Reclamation (EBR): An RCU implementation style put forward by
Keir Fraser [Fra03, Fra04, FH07].
Existence Guarantee: An existence guarantee is provided by a synchronization
mechanism that prevents a given dynamically allocated object from being freed for
the duration of that guarantee. For example, RCU provides existence guarantees
for the duration of RCU read-side critical sections. A similar but strictly weaker
guarantee is provided by type-safe memory.
Exclusive Lock: An exclusive lock is a mutual-exclusion mechanism that permits only
one thread at a time into the set of critical sections guarded by that lock.
False Sharing: If two CPUs each frequently write to one of a pair of data items, but
the pair of data items are located in the same cache line, this cache line will be
repeatedly invalidated, “ping-ponging” back and forth between the two CPUs’
caches. This is a common cause of “cache thrashing”, also called “cacheline
bouncing” (the latter most commonly in the Linux community). False sharing can
dramatically reduce both performance and scalability.
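One common remedy is to place each frequently written item in its own cache line. The sketch below assumes a 64-byte cache line, which must be checked against the actual hardware, and uses C11 alignas. The structure names are illustrative.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>

#define CACHE_LINE_SIZE 64	/* assumed; check your hardware */

/* Both counters will most likely land in the same cache line. */
struct counters_shared {
	unsigned long c0;	/* written by one CPU */
	unsigned long c1;	/* written by another CPU */
};

/* Each counter gets a cache line to itself, at the cost of extra memory. */
struct counters_padded {
	alignas(CACHE_LINE_SIZE) unsigned long c0;
	alignas(CACHE_LINE_SIZE) unsigned long c1;
};
```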
Forward-Progress Guarantee: Algorithms or programs that guarantee that execution
will progress at some rate under specified conditions. Academic forward-progress
guarantees are grouped into a formal hierarchy shown in Section 14.2. A wide
variety of practical forward-progress guarantees are provided by real-time systems,
as discussed in Section 14.3.
Fragmentation: A memory pool that has a large amount of unused memory, but whose
unused memory is not laid out so as to permit satisfying a relatively small
request, is said to be fragmented.
External fragmentation occurs when the space is divided up into small fragments
lying between allocated blocks of memory, while internal fragmentation occurs
when specific requests or types of requests have been allotted more memory than
they actually requested.
Fully Associative Cache: A fully associative cache contains only one set, so that it can
hold any subset of memory that fits within its capacity.
Grace Period: A grace period is any contiguous time interval such that any RCU
read-side critical section that began before the start of that interval has completed
before the end of that same interval. Many RCU implementations define a grace
period to be a time interval during which each thread has passed through at least
one quiescent state. Since RCU read-side critical sections by definition cannot
contain quiescent states, these two definitions are almost always interchangeable.
Heisenbug: A timing-sensitive bug that disappears from sight when you add print
statements or tracing in an attempt to track it down.
Hot Spot: Data structure that is very heavily used, resulting in high levels of contention
on the corresponding lock. One example of this situation would be a hash table
with a poorly chosen hash function.
Invalidation: When a CPU wishes to write to a data item, it must first ensure that this
data item is not present in any other CPU’s cache. If necessary, the item is removed
from the other CPUs’ caches via “invalidation” messages from the writing CPU
to any CPUs having a copy in their caches.
IPI: Inter-processor interrupt, which is an interrupt sent from one CPU to another.
IPIs are used heavily in the Linux kernel, for example, within the scheduler to alert
CPUs that a high-priority process is now runnable.
IRQ: Interrupt request, often used as an abbreviation for “interrupt” within the Linux
kernel community, as in “irq handler”.
Livelock: A failure mode in which each of several threads is able to execute, but in
which a repeating series of failed operations prevents any of the threads from
making any useful forward progress. For example, incorrect use of conditional
locking (for example, spin_trylock() in the Linux kernel) can result in livelock.
More information is provided in Section 7.1.2.
Lock: A software abstraction that can be used to guard critical sections, and, as such, an
example of a “mutual exclusion mechanism”. An “exclusive lock” permits only
one thread at a time into the set of critical sections guarded by that lock, while
a “reader-writer lock” permits any number of reading threads, or but one writing
thread, into the set of critical sections guarded by that lock. (Just to be clear, the
presence of a writer thread in any of a given reader-writer lock’s critical sections
will prevent any reader from entering any of that lock’s critical sections and vice
versa.)
Lock Free: A forward-progress guarantee in which at least one thread makes progress
within a finite period of time.
Marked Access: A source-code memory access that uses a special function or macro,
such as READ_ONCE(), WRITE_ONCE(), atomic_inc(), and so on, in order to
protect that access from compiler and/or hardware optimizations. In contrast, a
plain access simply mentions the name of the object being accessed, so that in the
following, line 2 is the plain-access equivalent of line 1:
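One pair of statements fitting this description can be sketched as follows, with marked_update() holding the marked form and plain_update() holding its plain-access equivalent. The READ_ONCE() and WRITE_ONCE() definitions here are simplified volatile-cast stand-ins (the Linux kernel's definitions are more elaborate), and the variables a, b, and c are hypothetical.

```c
#include <assert.h>

/* Simplified stand-ins for the Linux-kernel macros; need GCC/Clang __typeof__. */
#define READ_ONCE(x)     (*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, v) (*(volatile __typeof__(x) *)&(x) = (v))

static int a, b = 2, c = 3;

void marked_update(void)
{
	WRITE_ONCE(a, READ_ONCE(b) + READ_ONCE(c));	/* marked accesses */
}

void plain_update(void)
{
	a = b + c;					/* plain accesses */
}
```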
Memory: From the viewpoint of memory models, the main memory, caches, and store
buffers in which values might be stored. However, this term is often used to denote
the main memory itself, excluding caches and store buffers.
Memory Barrier: A compiler directive that might also include a special memory-
barrier instruction. The purpose of a memory barrier is to order memory-reference
instructions that executed before the memory barrier to precede those that will
execute following that memory barrier. (See also “read memory barrier” and
“write memory barrier”.)
Memory Consistency: A set of properties that impose constraints on the order in which
accesses to groups of variables appear to occur. Memory consistency models
range from sequential consistency, a very constraining model popular in academic
circles, through process consistency, release consistency, and weak consistency.
Moore’s Law: A 1965 empirical projection by Gordon Moore that transistor density
increases exponentially over time [Moo65].
NUCA: Non-uniform cache architecture, where groups of CPUs share caches and/or
store buffers. CPUs in a group can therefore exchange cache lines with each other
much more quickly than they can with CPUs in other groups. Systems comprised
of CPUs with hardware threads will generally have a NUCA architecture.
NUMA: Non-uniform memory architecture, where memory is split into banks and
each such bank is “close” to a group of CPUs, the group being termed a “NUMA
node”. An example NUMA machine is Sequent’s NUMA-Q system, where each
group of four CPUs had a bank of memory nearby. The CPUs in a given group can
access their memory much more quickly than another group’s memory.
NUMA Node: A group of closely placed CPUs and associated memory within a larger
NUMA machine.
Overhead: Operations that must be executed, but which do not contribute directly to
the work that must be accomplished. For example, lock acquisition and release
is normally considered to be overhead, and specifically to be synchronization
overhead.
Parallel: In this book, a synonym of concurrent. Please see Appendix A.6 on page 638
for a discussion of the recent distinction between these two terms.
Performance: Rate at which work is done, expressed as work per unit time. If this
work is fully serialized, then the performance will be the reciprocal of the mean
latency of the work items.
Plain Access: A source-code memory access that simply mentions the name of the
object being accessed. (See “Marked access”.)
Program Order: The order in which a given thread’s instructions would be executed by
a now-mythical “in-order” CPU that completely executed each instruction before
proceeding to the next instruction. (The reason such CPUs are now the stuff of
ancient myths and legends is that they were extremely slow. These dinosaurs were
one of the many victims of Moore’s-Law-driven increases in CPU clock frequency.
Some claim that these beasts will roam the earth once again, others vehemently
disagree.)
Quiescent State: In RCU, a point in the code where there can be no references held to
RCU-protected data structures, which is normally any point outside of an RCU
read-side critical section. Any interval of time during which all threads pass
through at least one quiescent state each is termed a “grace period”.
Race Condition: Any situation where multiple CPUs or threads can interact, though
this term is often used in cases where such interaction is undesirable. (See “Data
race”.)
RCU Read-Side Critical Section: A section of code protected by RCU, for example,
beginning with rcu_read_lock() and ending with rcu_read_unlock(). (See
“Read-side critical section”.)
Read Memory Barrier: A memory barrier that is only guaranteed to affect the ordering
of load instructions, that is, reads from memory. (See also “memory barrier” and
“write memory barrier”.)
Read Mostly: Read-mostly data is (again, as the name implies) rarely updated. However,
it might be updated at any time.
Read Only: Read-only data is, as the name implies, never updated except by beginning-
of-time initialization. In this book, a synonym for immutable.
Real Time: A situation in which getting the correct result is not sufficient, but where
this result must also be obtained within a given amount of time.
Reference Count: A counter that tracks the number of users of a given object or
entity. Reference counters provide existence guarantees and are sometimes used to
implement garbage collectors.
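A minimal reference-counting sketch follows, assuming C11 atomics and the hypothetical names obj, obj_alloc(), obj_get(), and obj_put(). The caller of obj_get() must already hold a reference, because otherwise nothing guarantees that the object still exists.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

struct obj {
	atomic_int refcnt;	/* number of current users */
	int value;
};

/* Allocate an object holding one reference on behalf of the caller. */
struct obj *obj_alloc(int value)
{
	struct obj *p = malloc(sizeof(*p));

	if (p) {
		atomic_init(&p->refcnt, 1);
		p->value = value;
	}
	return p;
}

/* Acquire an additional reference; caller must already hold one. */
void obj_get(struct obj *p)
{
	atomic_fetch_add(&p->refcnt, 1);
}

/* Drop a reference, freeing on the last one; returns 1 if the object was freed. */
int obj_put(struct obj *p)
{
	if (atomic_fetch_sub(&p->refcnt, 1) == 1) {
		free(p);
		return 1;
	}
	return 0;
}
```

The object remains valid for as long as any holder has yet to call obj_put(), which is the existence guarantee mentioned above.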
Release Store: A write to memory that has release semantics. Normal use cases pair
an acquire load with a release store, in which case if the load returns the value
stored, then all code executed by the loading CPU after that acquire load will see
the effects of all memory-reference instructions executed by the storing CPU prior
to that release store. Releasing a lock provides similar memory-ordering semantics,
hence the “release” in “release store”. (See also “acquire load” and “memory
barrier”.)
Starvation: A condition where at least one CPU or thread is unable to make progress
due to an unfortunate series of resource-allocation decisions, as discussed in
Section 7.1.2. For example, in a multisocket system, CPUs on one socket having
privileged access to the data structure implementing a given lock could prevent
CPUs on other sockets from ever acquiring that lock.
Store Buffer: A small set of internal registers used by a given CPU to record pending
stores while the corresponding cache lines are making their way to that CPU. Also
called “store queue”.
Store Forwarding: An arrangement where a given CPU refers to its store buffer as
well as its cache so as to ensure that the software sees the memory operations
performed by this CPU as if they were carried out in program order.
Teachable: A topic, concept, method, or mechanism that teachers believe that they
understand completely and are therefore comfortable teaching.
Throughput: A performance metric featuring work items completed per unit time.
Unfairness: A condition where the progress of at least one CPU or thread is impeded by
an unfortunate series of resource-allocation decisions, as discussed in Section 7.1.2.
Extreme levels of unfairness are termed “starvation”.
Unteachable: A topic, concept, method, or mechanism that the teacher does not
understand well and is therefore uncomfortable teaching.
Vector CPU: A CPU that can apply a single instruction to multiple items of data
concurrently. In the 1960s through the 1980s, only supercomputers had vector
capabilities, but the advent of MMX in x86 CPUs and VMX in PowerPC CPUs
brought vector processing to the masses.
Wait Free: A forward-progress guarantee in which every thread makes progress within
a finite period of time.
Write Memory Barrier: A memory barrier that is only guaranteed to affect the ordering
of store instructions, that is, writes to memory. (See also “memory barrier” and
“read memory barrier”.)
Write Miss: A cache miss incurred because the corresponding CPU attempted to write
to a cache line that is read-only, most likely due to its being replicated in other
CPUs’ caches.
Write Mostly: Write-mostly data is (yet again, as the name implies) frequently updated.
Bibliography
[AA14] Maya Arbel and Hagit Attiya. Concurrent updates with RCU: Search
tree as an example. In Proceedings of the 2014 ACM Symposium on
Principles of Distributed Computing, PODC ’14, page 196–205, Paris,
France, 2014. ACM.
[AAKL06] C. Scott Ananian, Krste Asanovic, Bradley C. Kuszmaul, and Charles E.
Leiserson. Unbounded transactional memory. IEEE Micro, pages 59–69,
January-February 2006.
[AB13] Samy Al Bahra. Nonblocking algorithms and scalable multicore
programming. Commun. ACM, 56(7):50–61, July 2013.
[ABD+ 97] Jennifer M. Anderson, Lance M. Berc, Jeffrey Dean, Sanjay Ghemawat,
Monika R. Henzinger, Shun-Tak A. Leung, Richard L. Sites, Mark T.
Vandevoorde, Carl A. Waldspurger, and William E. Weihl. Continuous
profiling: Where have all the cycles gone? In Proceedings of the 16th
ACM Symposium on Operating Systems Principles, pages 1–14, New
York, NY, October 1997.
[ACA+ 18] A. Aljuhni, C. E. Chow, A. Aljaedi, S. Yusuf, and F. Torres-Reyes.
Towards understanding application performance and system behavior
with the full dynticks feature. In 2018 IEEE 8th Annual Computing
and Communication Workshop and Conference (CCWC), pages 394–401,
2018.
[ACHS13] Dan Alistarh, Keren Censor-Hillel, and Nir Shavit. Are lock-
free concurrent algorithms practically wait-free?, December 2013.
ArXiv:1311.3200v2.
[ACMS03] Andrea Arcangeli, Mingming Cao, Paul E. McKenney, and Dipankar
Sarma. Using read-copy update techniques for System V IPC in the
Linux 2.5 kernel. In Proceedings of the 2003 USENIX Annual Technical
Conference (FREENIX Track), pages 297–310, San Antonio, Texas, USA,
June 2003. USENIX Association.
[Ada11] Andrew Adamatzky. Slime mould solves maze in one pass . . . assisted
by gradient of chemo-attractants, August 2011. arXiv:1108.4956.
[ADF+ 19] Jade Alglave, Will Deacon, Boqun Feng, David Howells, Daniel Lustig,
Luc Maranget, Paul E. McKenney, Andrea Parri, Nicholas Piggin, Alan
Stern, Akira Yokosawa, and Peter Zijlstra. Who’s afraid of a big bad
optimizing compiler?, July 2019. Linux Weekly News.
[ATC+ 11] Ege Akpinar, Sasa Tomic, Adrian Cristal, Osman Unsal, and Mateo
Valero. A comprehensive study of conflict resolution policies in hardware
transactional memory. In TRANSACT 2011, San Jose, CA, USA,
June 2011. ACM SIGPLAN.
[ATS09] Ali-Reza Adl-Tabatabai and Tatiana Shpeisman. Draft specification of
transactional language constructs for C++, August 2009. URL: https://
software.intel.com/sites/default/files/ee/47/21569 (may
need to append .pdf to view after download).
[Att10] Hagit Attiya. The inherent complexity of transactional memory and
what to do about it. In Proceedings of the 29th ACM SIGACT-SIGOPS
Symposium on Principles of Distributed Computing, PODC ’10, pages
1–5, Zurich, Switzerland, 2010. ACM.
[BA01] Jeff Bonwick and Jonathan Adams. Magazines and vmem: Extending the
slab allocator to many CPUs and arbitrary resources. In USENIX Annual
Technical Conference, General Track 2001, pages 15–33, 2001.
[Bah11a] Samy Al Bahra. ck_epoch: Support per-object destructors, Oc-
tober 2011. https://github.com/concurrencykit/ck/commit/
10ffb2e6f1737a30e2dcf3862d105ad45fcd60a4.
[Bah11b] Samy Al Bahra. ck_hp.c, February 2011. Hazard pointers: https:
//github.com/concurrencykit/ck/blob/master/src/ck_hp.c.
[Bah11c] Samy Al Bahra. ck_sequence.h, February 2011. Sequence lock-
ing: https://github.com/concurrencykit/ck/blob/master/
include/ck_sequence.h.
[Bas18] JF Bastien. P1152R0: Deprecating volatile, October 2018.
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/
2018/p1152r0.html.
[BBC+ 10] Al Bessey, Ken Block, Ben Chelf, Andy Chou, Bryan Fulton, Seth Hallem,
Charles Henri-Gros, Asya Kamsky, Scott McPeak, and Dawson Engler.
A few billion lines of code later: Using static analysis to find bugs in the
real world. Commun. ACM, 53(2):66–75, February 2010.
[BCR03] David F. Bacon, Perry Cheng, and V. T. Rajan. A real-time garbage
collector with low overhead and consistent utilization. SIGPLAN Not.,
38(1):285–298, 2003.
[BD13] Paolo Bonzini and Mike Day. RCU implementation for Qemu, Au-
gust 2013. https://lists.gnu.org/archive/html/qemu-devel/
2013-08/msg02055.html.
[BD14] Hans-J. Boehm and Brian Demsky. Outlawing ghosts: Avoiding out-
of-thin-air results. In Proceedings of the Workshop on Memory Systems
Performance and Correctness, MSPC ’14, pages 7:1–7:6, Edinburgh,
United Kingdom, 2014. ACM.
[Bec11] Pete Becker. Working draft, standard for programming language C++, Feb-
ruary 2011. http://www.open-std.org/jtc1/sc22/wg21/docs/
papers/2011/n3242.pdf.
[BGHZ16] Oana Balmau, Rachid Guerraoui, Maurice Herlihy, and Igor Zablotchi.
Fast and robust memory reclamation for concurrent data structures. In
Proceedings of the 28th ACM Symposium on Parallelism in Algorithms
and Architectures, SPAA ’16, pages 349–359, Pacific Grove, California,
USA, 2016. ACM.
[BGOS18] Sam Blackshear, Nikos Gorogiannis, Peter W. O’Hearn, and Ilya Sergey.
Racerd: Compositional static race detection. Proc. ACM Program. Lang.,
2(OOPSLA), October 2018.
[BGV17] Hans-J. Boehm, Olivier Giroux, and Viktor Vafeiadis. P0668r1: Revising
the C++ memory model, July 2017. http://www.open-std.org/
jtc1/sc22/wg21/docs/papers/2017/p0668r1.html.
[BJ12] Rex Black and Capers Jones. Economics of software quality: An interview
with Capers Jones, part 1 of 2 (podcast transcript), January 2012. https:
//www.informit.com/articles/article.aspx?p=1824791.
[BK85] Bob Beck and Bob Kasten. VLSI assist in building a multiprocessor
UNIX system. In USENIX Conference Proceedings, pages 255–275,
Portland, OR, June 1985. USENIX Association.
[BMMM05] Luke Browning, Thomas Mathews, Paul E. McKenney, and James Moody.
Apparatus, method, and computer program product for converting simple
locks in a multiprocessor system. US Patent 6,842,809, Assigned to
International Business Machines Corporation, Washington, DC, January
2005.
[BMN+ 15] Mark Batty, Kayvan Memarian, Kyndylan Nienhuis, Jean Pichon-
Pharabod, and Peter Sewell. The problem of programming language
concurrency semantics. In Jan Vitek, editor, Programming Languages
and Systems, volume 9032 of Lecture Notes in Computer Science, pages
283–307. Springer Berlin Heidelberg, 2015.
[BMP08] R. F. Berry, P. E. McKenney, and F. N. Parr. Responsive systems: An
introduction. IBM Systems Journal, 47(2):197–206, April 2008.
[Boe05] Hans-J. Boehm. Threads cannot be implemented as a library. SIGPLAN
Not., 40(6):261–268, June 2005.
[Boe09] Hans-J. Boehm. Transactional memory should be an implementation
technique, not a programming interface. In HOTPAR 2009, page 6, Berke-
ley, CA, USA, March 2009. Available: https://www.usenix.org/
event/hotpar09/tech/full_papers/boehm/boehm.pdf [Viewed
May 24, 2009].
[Boe20] Hans Boehm. “Undefined behavior” and the concurrency memory model,
August 2020. http://www.open-std.org/jtc1/sc22/wg21/docs/
papers/2020/p2215r0.pdf.
[Boh01] Kristoffer Bohmann. Response time still matters, July
2001. URL: http://www.bohmann.dk/articles/response_time_
still_matters.html [broken, November 2016].
[Bon13] Paolo Bonzini. seqlock: introduce read-write seqlock, Septem-
ber 2013. https://git.qemu.org/?p=qemu.git;a=commit;h=
ea753d81e8b085d679f13e4a6023e003e9854d51.
[Bon15] Paolo Bonzini. rcu: add rcu library, February 2015.
https://git.qemu.org/?p=qemu.git;a=commit;h=
7911747bd46123ef8d8eef2ee49422bb8a4b274f.
[Bon21a] Paolo Bonzini. An introduction to lockless algorithms, February 2021.
Available: https://lwn.net/Articles/844224/ [Viewed February
19, 2021].
[Bon21b] Paolo Bonzini. Lockless patterns: an introduction to compare-and-
swap, March 2021. Available: https://lwn.net/Articles/847973/
[Viewed March 13, 2021].
[Bon21c] Paolo Bonzini. Lockless patterns: full memory barriers, March 2021.
Available: https://lwn.net/Articles/847481/ [Viewed March 8,
2021].
[Bon21d] Paolo Bonzini. Lockless patterns: more read-modify-write opera-
tions, March 2021. Available: https://lwn.net/Articles/849237/
[Viewed March 19, 2021].
[Bon21e] Paolo Bonzini. Lockless patterns: relaxed access and partial memory
barriers, February 2021. Available: https://lwn.net/Articles/
846700/ [Viewed February 27, 2021].
[Bon21f] Paolo Bonzini. Lockless patterns: some final topics, March 2021.
Available: https://lwn.net/Articles/850202/ [Viewed March 19,
2021].
[Bor06] Richard Bornat. Dividing the sheep from the goats, Jan-
uary 2006. Seminar at School of Computing, Univ. of
Kent. Abstract is available at https://www.cs.kent.ac.uk/
seminar_archive/2005_06/abs_2006_01_24.html. Retracted in
July 2014: http://www.eis.mdx.ac.uk/staffpages/r_bornat/
papers/camel_hump_retraction.pdf.
[Bos10] Keith Bostic. Switch lockless programming style from
epoch to hazard references, January 2010. https:
//github.com/wiredtiger/wiredtiger/commit/
dddc21014fc494a956778360a14d96c762495e09.
[Bos23] Mara Bos. Rust Atomics and Locks. O’Reilly Media, Inc., Sebastopol,
CA, USA, 2023.
[BPP+ 16] Adam Belay, George Prekas, Mia Primorac, Ana Klimovic, Samuel
Grossman, Christos Kozyrakis, and Edouard Bugnion. The IX operating
system: Combining low latency, high throughput, and efficiency in
a protected dataplane. ACM Trans. Comput. Syst., 34(4):11:1–11:39,
December 2016.
[Bra07] Reg Braithwaite. Don’t overthink fizzbuzz, January 2007.
http://weblog.raganwald.com/2007/01/dont-overthink-
fizzbuzz.html.
[Bra11] Björn Brandenburg. Scheduling and Locking in Multiprocessor Real-
Time Operating Systems. PhD thesis, The University of North Carolina
at Chapel Hill, 2011. URL: https://www.cs.unc.edu/~anderson/
diss/bbbdiss.pdf.
[Bro15a] Neil Brown. Pathname lookup in Linux, June 2015. https://lwn.
net/Articles/649115/.
[Bro15b] Neil Brown. RCU-walk: faster pathname lookup in Linux, July 2015.
https://lwn.net/Articles/649729/.
[Bro15c] Neil Brown. A walk among the symlinks, July 2015. https://lwn.
net/Articles/650786/.
[BS75] Paul J. Brown and Ronald M. Smith. Shared data controlled by a plurality
of users, May 1975. US Patent 3,886,525, filed June 29, 1973.
[BS14] Mark Batty and Peter Sewell. The thin-air problem, February 2014.
https://www.cl.cam.ac.uk/~pes20/cpp/notes42.html.
[But97] David Butenhof. Programming with POSIX Threads. Addison-Wesley,
Boston, MA, USA, 1997.
[BWCM+ 10] Silas Boyd-Wickizer, Austin T. Clements, Yandong Mao, Aleksey Pesterev,
M. Frans Kaashoek, Robert Morris, and Nickolai Zeldovich. An analysis
of Linux scalability to many cores. In 9th USENIX Symposium on
Operating System Design and Implementation, pages 1–16, Vancouver,
BC, Canada, October 2010. USENIX.
[CAK+ 96] Crispin Cowan, Tito Autrey, Charles Krasic, Calton Pu, and Jonathan
Walpole. Fast concurrent dynamic linking for an adaptive operating
system. In International Conference on Configurable Distributed Systems
(ICCDS’96), pages 108–115, Annapolis, MD, May 1996.
[CBF13] UPC Consortium, Dan Bonachea, and Gary Funck. UPC language and
library specifications, version 1.3. Technical report, UPC Consortium,
November 2013.
[CBM+ 08] Calin Cascaval, Colin Blundell, Maged Michael, Harold W. Cain, Peng
Wu, Stefanie Chiras, and Siddhartha Chatterjee. Software transactional
memory: Why is it only a research toy? ACM Queue, September 2008.
[CKL04] Edmund Clarke, Daniel Kroening, and Flavio Lerda. A tool for checking
ANSI-C programs. In Kurt Jensen and Andreas Podelski, editors, Tools
and Algorithms for the Construction and Analysis of Systems (TACAS
2004), volume 2988 of Lecture Notes in Computer Science, pages 168–176.
Springer, 2004.
[Cli09] Cliff Click. And now some hardware transactional memory comments...,
February 2009. URL: http://www.cliffc.org/blog/2009/02/25/
and-now-some-hardware-transactional-memory-comments/.
[CLRS01] T.H. Cormen, C.E. Leiserson, R.L. Rivest, and C. Stein. Introduction to
Algorithms, Second Edition. MIT electrical engineering and computer
science series. MIT Press, 2001.
[CnRR18] Armando Castañeda, Sergio Rajsbaum, and Michel Raynal. Unifying
concurrent objects and distributed tasks: Interval-linearizability. J. ACM,
65(6), November 2018.
[Com01] Compaq Computer Corporation. Shared memory, threads, interpro-
cess communication, August 2001. Zipped archive: wiz_2637.
txt in https://www.digiater.nl/openvms/freeware/v70/ask_
the_wizard/wizard.zip.
[Coo18] Byron Cook. Formal reasoning about the security of amazon web services.
In Hana Chockler and Georg Weissenbacher, editors, Computer Aided
Verification, pages 38–47, Cham, 2018. Springer International Publishing.
[Cor02] Compaq Computer Corporation. Alpha Architecture Reference Manual.
Digital Press, fourth edition, 2002.
[Cor03] Jonathan Corbet. Driver porting: mutual exclusion with seqlocks, Febru-
ary 2003. https://lwn.net/Articles/22818/.
[Cor04a] Jonathan Corbet. Approaches to realtime Linux, October 2004. URL:
https://lwn.net/Articles/106010/.
[Cor04b] Jonathan Corbet. Finding kernel problems automatically, June 2004.
https://lwn.net/Articles/87538/.
[Cor04c] Jonathan Corbet. Realtime preemption, part 2, October 2004. URL:
https://lwn.net/Articles/107269/.
[Cor06a] Jonathan Corbet. The kernel lock validator, May 2006. Available:
https://lwn.net/Articles/185666/ [Viewed: March 26, 2010].
[Cor06b] Jonathan Corbet. Priority inheritance in the kernel, April 2006. Available:
https://lwn.net/Articles/178253/ [Viewed June 29, 2009].
[Cor10a] Jonathan Corbet. Dcache scalability and RCU-walk, December 2010.
Available: https://lwn.net/Articles/419811/ [Viewed May 29,
2017].
[Cor10b] Jonathan Corbet. sys_membarrier(), January 2010. https://lwn.net/
Articles/369567/.
[Cor11] Jonathan Corbet. How to ruin linus’s vacation, July 2011. Available:
https://lwn.net/Articles/452117/ [Viewed May 29, 2017].
[Cor12] Jonathan Corbet. ACCESS_ONCE(), August 2012. https://lwn.net/
Articles/508991/.
[Cor13] Jonathan Corbet. (Nearly) full tickless operation in 3.10, May 2013.
https://lwn.net/Articles/549580/.
[Cor14a] Jonathan Corbet. ACCESS_ONCE() and compiler bugs, December 2014.
https://lwn.net/Articles/624126/.
[Cor14b] Jonathan Corbet. MCS locks and qspinlocks, March 2014. https:
//lwn.net/Articles/590243/.
[Cor14c] Jonathan Corbet. Relativistic hash tables, part 1: Algorithms, September
2014. https://lwn.net/Articles/612021/.
[Cor14d] Jonathan Corbet. Relativistic hash tables, part 2: Implementation,
September 2014. https://lwn.net/Articles/612100/.
[Cor16a] Jonathan Corbet. Finding race conditions with KCSAN, June 2016.
https://lwn.net/Articles/691128/.
[Cor16b] Jonathan Corbet. Time to move to C11 atomics?, June 2016. https:
//lwn.net/Articles/691128/.
[Cor18] Jonathan Corbet. membarrier(2), October 2018. https://man7.org/
linux/man-pages/man2/membarrier.2.html.
[Cra93] Travis Craig. Building FIFO and priority-queuing spin locks from atomic
swap. Technical Report 93-02-02, University of Washington, Seattle,
Washington, February 1993.
[CRKH05] Jonathan Corbet, Alessandro Rubini, and Greg Kroah-Hartman. Linux
Device Drivers. O’Reilly Media, Inc., third edition, 2005. URL: https:
//lwn.net/Kernel/LDD3/.
[CSG99] David E. Culler, Jaswinder Pal Singh, and Anoop Gupta. Parallel Computer
Architecture: a Hardware/Software Approach. Morgan Kaufmann,
1999.
[cut17] crates.io user ticki. conc v0.5.0: Hazard-pointer-based concurrent memory
reclamation, August 2017. https://crates.io/crates/conc.
[Dat82] C. J. Date. An Introduction to Database Systems, volume 1. Addison-
Wesley Publishing Company, 1982.
[DBA09] Saeed Dehnadi, Richard Bornat, and Ray Adams. Meta-analysis of the
effect of consistency on success in early learning of programming. In
PPIG 2009, pages 1–13, University of Limerick, Ireland, June 2009.
Psychology of Programming Interest Group.
[DCW+ 11] Luke Dalessandro, Francois Carouge, Sean White, Yossi Lev, Mark
Moir, Michael L. Scott, and Michael F. Spear. Hybrid NOrec: A case
study in the effectiveness of best effort hardware transactional memory.
In Proceedings of the 16th International Conference on Architectural
Support for Programming Languages and Operating Systems (ASPLOS),
ASPLOS ’11, pages 39–52, Newport Beach, CA, USA, 2011. ACM.
[Dea18] Will Deacon. [PATCH 00/10] kernel/locking: qspinlock improvements,
April 2018. https://lkml.kernel.org/r/1522947547-24081-1-
git-send-email-will.deacon@arm.com.
[Dea19] Will Deacon. Re: [PATCH 1/1] Fix: trace sched switch start/stop
racy updates, August 2019. https://lore.kernel.org/lkml/
20190821103200.kpufwtviqhpbuv2n@willie-the-truck/.
[Des09b] Mathieu Desnoyers. [RFC git tree] userspace RCU (urcu) for Linux,
February 2009. https://liburcu.org.
[DHK12] Vijay D’Silva, Leopold Haller, and Daniel Kroening. Satisfiability solvers
are static analyzers. In Static Analysis Symposium (SAS), volume 7460 of
LNCS, pages 317–333. Springer, 2012.
[DHL+ 08] Dave Dice, Maurice Herlihy, Doug Lea, Yossi Lev, Victor Luchangco,
Wayne Mesard, Mark Moir, Kevin Moore, and Dan Nussbaum. Appli-
cations of the adaptive transactional memory test platform. In 3rd ACM
SIGPLAN Workshop on Transactional Computing, pages 1–10, Salt Lake
City, UT, USA, February 2008.
[DKS89] Alan Demers, Srinivasan Keshav, and Scott Shenker. Analysis and
simulation of a fair queuing algorithm. SIGCOMM ’89, pages 1–12,
1989.
[DLM+ 10] Dave Dice, Yossi Lev, Virendra J. Marathe, Mark Moir, Dan Nussbaum,
and Marek Olszewski. Simplifying concurrent algorithms by exploiting
hardware transactional memory. In Proceedings of the 22nd ACM
symposium on Parallelism in algorithms and architectures, SPAA ’10,
pages 325–334, Thira, Santorini, Greece, 2010. ACM.
[DLMN09] Dave Dice, Yossi Lev, Mark Moir, and Dan Nussbaum. Early experi-
ence with a commercial hardware transactional memory implementation.
In Fourteenth International Conference on Architectural Support for
Programming Languages and Operating Systems (ASPLOS ’09), pages
157–168, Washington, DC, USA, March 2009.
[DMLP79] Richard A. De Millo, Richard J. Lipton, and Alan J. Perlis. Social processes
and proofs of theorems and programs. Commun. ACM, 22(5):271–280,
May 1979.
[DMS+ 12] Mathieu Desnoyers, Paul E. McKenney, Alan Stern, Michel R. Dagenais,
and Jonathan Walpole. User-level implementations of read-copy update.
IEEE Transactions on Parallel and Distributed Systems, 23:375–382,
2012.
[dO18a] Daniel Bristot de Oliveira. Deadline scheduler part 2 – details and usage,
January 2018. URL: https://lwn.net/Articles/743946/.
[Dre11] Ulrich Drepper. Futexes are tricky. Technical Report FAT2011, Red Hat,
Inc., Raleigh, NC, USA, November 2011.
[DSS06] Dave Dice, Ori Shalev, and Nir Shavit. Transactional locking II. In Proc.
International Symposium on Distributed Computing. Springer Verlag,
2006.
[ENS05] Ryan Eccles, Blair Nonneck, and Deborah A. Stacey. Exploring parallel
programming knowledge in the novice. In HPCS ’05: Proceedings of the
19th International Symposium on High Performance Computing Systems
and Applications, pages 97–102, Guelph, Ontario, Canada, 2005. IEEE
Computer Society.
[Eri08] Christer Ericson. Aiding pathfinding with cellular automata, June 2008.
http://realtimecollisiondetection.net/blog/?p=57.
[ES90] Margaret A. Ellis and Bjarne Stroustrup. The Annotated C++ Reference
Manual. Addison Wesley, 1990.
[ES05] Ryan Eccles and Deborah A. Stacey. Understanding the parallel program-
mer. In HPCS ’05: Proceedings of the 19th International Symposium on
High Performance Computing Systems and Applications, pages 156–160,
Guelph, Ontario, Canada, 2005. IEEE Computer Society.
[FH07] Keir Fraser and Tim Harris. Concurrent programming without locks.
ACM Trans. Comput. Syst., 25(2):1–61, 2007.
[FIMR16] Pascal Felber, Shady Issa, Alexander Matveev, and Paolo Romano. Hard-
ware read-write lock elision. In Proceedings of the Eleventh European
Conference on Computer Systems, EuroSys ’16, London, United Kingdom,
2016. Association for Computing Machinery.
[Fos10] Ron Fosner. Scalable multithreaded programming with tasks. MSDN Mag-
azine, 2010(11):60–69, November 2010. http://msdn.microsoft.
com/en-us/magazine/gg309176.aspx.
[FRK02] Hubertus Franke, Rusty Russell, and Matthew Kirkwood. Fuss, futexes
and furwocks: Fast userlevel locking in linux. In Ottawa Linux Sympo-
sium, pages 479–495, June 2002. Available: https://www.kernel.
org/doc/ols/2002/ols2002-pages-479-495.pdf [Viewed May
22, 2011].
[FSP+ 17] Shaked Flur, Susmit Sarkar, Christopher Pulte, Kyndylan Nienhuis, Luc
Maranget, Kathryn E. Gray, Ali Sezgin, Mark Batty, and Peter Sewell.
Mixed-size concurrency: ARM, POWER, C/C++11, and SC. SIGPLAN
Not., 52(1):429–442, January 2017.
[GAJM15] Alex Groce, Iftekhar Ahmed, Carlos Jensen, and Paul E. McKenney. How
verified is my code? falsification-driven verification (t). In Proceedings
of the 2015 30th IEEE/ACM International Conference on Automated
Software Engineering (ASE), ASE ’15, pages 737–748, Washington, DC,
USA, 2015. IEEE Computer Society.
[Gar07] Bryan Gardiner. IDF: Gordon Moore predicts end of Moore’s law
(again), September 2007. Available: https://www.wired.com/2007/
09/idf-gordon-mo-1/ [Viewed: February 27, 2021].
[GC96] Michael Greenwald and David R. Cheriton. The synergy between non-
blocking synchronization and operating system structure. In Proceedings
of the Second Symposium on Operating Systems Design and Implementa-
tion, pages 123–136, Seattle, WA, October 1996. USENIX Association.
[GDZE10] Olga Golovanevsky, Alon Dayan, Ayal Zaks, and David Edelsohn. Trace-
based data layout optimizations for multi-core processors. In Proceedings
of the 5th International Conference on High Performance Embedded
Architectures and Compilers, HiPEAC’10, pages 81–95, Pisa, Italy, 2010.
Springer-Verlag.
[GGL+ 19] Rachid Guerraoui, Hugo Guiroux, Renaud Lachaize, Vivien Quéma, and
Vasileios Trigonakis. Lock–unlock: Is that all? a pragmatic analysis of
locking in software systems. ACM Trans. Comput. Syst., 36(1):1:1–1:149,
March 2019.
[Gha95] Kourosh Gharachorloo. Memory consistency models for shared-memory
multiprocessors. Technical Report CSL-TR-95-685, Computer Sys-
tems Laboratory, Departments of Electrical Engineering and Computer
Science, Stanford University, Stanford, CA, December 1995. Avail-
able: https://www.hpl.hp.com/techreports/Compaq-DEC/WRL-
95-9.pdf [Viewed: October 11, 2004].
[GHH+ 14] Alex Groce, Klaus Havelund, Gerard J. Holzmann, Rajeev Joshi, and
Ru-Gang Xu. Establishing flight software reliability: testing, model
checking, constraint-solving, monitoring and learning. Ann. Math. Artif.
Intell., 70(4):315–349, 2014.
[GHJV95] Erich Gamma, Richard Helm, Ralph Johnson, and John Vlissides. Design
Patterns: Elements of Reusable Object-Oriented Software. Addison-
Wesley, 1995.
[GKAS99] Ben Gamsa, Orran Krieger, Jonathan Appavoo, and Michael Stumm.
Tornado: Maximizing locality and concurrency in a shared memory
multiprocessor operating system. In Proceedings of the 3rd Symposium
on Operating System Design and Implementation, pages 87–100, New
Orleans, LA, February 1999.
[GKP13] Justin Gottschlich, Rob Knauerhase, and Gilles Pokam. But how do we
really debug transactional memory? In 5th USENIX Workshop on Hot
Topics in Parallelism (HotPar 2013), San Jose, CA, USA, June 2013.
[GKPS95] Ben Gamsa, Orran Krieger, E. Parsons, and Michael Stumm. Performance
issues for multiprocessor operating systems, November 1995. Techni-
cal Report CSRI-339, Available: ftp://ftp.cs.toronto.edu/pub/
reports/csri/339/339.ps.
[Gla18] Stjepan Glavina. Merge remaining subcrates, November
2018. https://github.com/crossbeam-rs/crossbeam/commit/
d9b1e3429450a64b490f68c08bd191417e68f00c.
[Gle10] Thomas Gleixner. Realtime linux: academia v. reality, July 2010. URL:
https://lwn.net/Articles/397422/.
[Gle12] Thomas Gleixner. Linux -rt kvm guest demo, December 2012. Personal
communication.
[GMTW08] D. Guniguntala, P. E. McKenney, J. Triplett, and J. Walpole. The
read-copy-update mechanism for supporting real-time applications on
shared-memory multiprocessor systems with Linux. IBM Systems Journal,
47(2):221–236, May 2008.
[Gol18a] David Goldblatt. Add the Seq module, a simple seqlock implementa-
tion, April 2018. https://github.com/jemalloc/jemalloc/tree/
06a8c40b36403e902748d3f2a14e6dd43488ae89.
[Gol19] David Goldblatt. There might not be an elegant OOTA fix, Oc-
tober 2019. http://www.open-std.org/jtc1/sc22/wg21/docs/
papers/2019/p1916r0.pdf.
[GPB+ 07] Brian Goetz, Tim Peierls, Joshua Bloch, Joseph Bowbeer, David Holmes,
and Doug Lea. Java Concurrency in Practice. Addison Wesley, Upper
Saddle River, NJ, USA, 2007.
[Gra91] Jim Gray. The Benchmark Handbook for Database and Transaction
Processing Systems. Morgan Kaufmann, 1991.
[Gre19] Brendan Gregg. BPF Performance Tools: Linux System and Application
Observability. Addison-Wesley Professional, 1st edition, 2019.
[Gri00] Scott Griffen. Internet pioneers: Doug englebart, May 2000. Available:
https://www.ibiblio.org/pioneers/englebart.html [Viewed
November 28, 2008].
[Gro01] The Open Group. Single UNIX specification, July 2001. http://www.
opengroup.org/onlinepubs/007908799/index.html.
[HOS89] James P. Hennessy, Damian L. Osisek, and Joseph W. Seigh II. Pas-
sive serialization in a multitasking environment. Technical Report US
Patent 4,809,168, Assigned to International Business Machines Corp,
Washington, DC, February 1989.
[HS08] Maurice Herlihy and Nir Shavit. The Art of Multiprocessor Programming.
Morgan Kaufmann, Burlington, MA, USA, 2008.
[HSLS20] Maurice Herlihy, Nir Shavit, Victor Luchangco, and Michael Spear. The
Art of Multiprocessor Programming, 2nd Edition. Morgan Kaufmann,
Burlington, MA, USA, 2020.
[Inm07] Bill Inmon. Time value of information, January 2007. URL: http:
//www.b-eye-network.com/view/3365 [broken, February 2021].
[Int20a] Intel. Desktop 4th Generation Intel® Core™ Processor Family, Desktop
Intel® Pentium® Processor Family, and Desktop Intel® Celeron® Pro-
cessor Family, April 2020. http://www.intel.com/content/dam/
www/public/us/en/documents/specification-updates/4th-
gen-core-family-desktop-specification-update.pdf.
[Jac93] Van Jacobson. Avoid read-side locking via delayed free, September 1993.
private communication.
[JED] JEDEC. mega (M) (as a prefix to units of semiconductor storage capacity)
[online].
[JJKD21] Ralf Jung, Jacques-Henri Jourdan, Robbert Krebbers, and Derek Dreyer.
Safe systems programming in Rust. Commun. ACM, 64(4):144–152,
March 2021.
[JLK16a] Yeongjin Jang, Sangho Lee, and Taesoo Kim. Breaking kernel
address space layout randomization (KASLR) with Intel TSX, July
2016. Black Hat USA 2016 https://www.blackhat.com/us-
16/briefings.html#breaking-kernel-address-space-
layout-randomization-kaslr-with-intel-tsx.
[JLK16b] Yeongjin Jang, Sangho Lee, and Taesoo Kim. Breaking kernel address
space layout randomization with Intel TSX. In Proceedings of the 2016
ACM SIGSAC Conference on Computer and Communications Security,
CCS ’16, pages 380–392, Vienna, Austria, 2016. ACM.
[Jon11] Dave Jones. Trinity: A system call fuzzer. In 13th Ottawa Linux
Symposium, Ottawa, Canada, June 2011. Project repository: https:
//github.com/kernelslacker/trinity.
[JSG12] Christian Jacobi, Timothy Slegel, and Dan Greiner. Transactional memory
architecture and implementation for IBM System z. In Proceedings of the
45th Annual IEEE/ACM International Symposium on Microarchitecture,
MICRO 45, pages 25–36, Vancouver B.C. Canada, December 2012. Pre-
sentation slides: https://www.microarch.org/micro45/talks-
posters/3-jacobi-presentation.pdf.
[Kaa15] Frans Kaashoek. Parallel computing and the os. In SOSP History Day,
October 2015.
[KCH+ 06] Sanjeev Kumar, Michael Chu, Christopher J. Hughes, Partha Kundu,
and Anthony Nguyen. Hybrid transactional memory. In Proceedings
of the ACM SIGPLAN 2006 Symposium on Principles and Practice of
Parallel Programming, New York, New York, United States, 2006. ACM
SIGPLAN.
[KDI20] Alex Kogan, Dave Dice, and Shady Issa. Scalable range locks for scalable
address spaces and beyond. In Proceedings of the Fifteenth European
Conference on Computer Systems, EuroSys ’20, Heraklion, Greece, 2020.
Association for Computing Machinery.
[LLO09] Yossi Lev, Victor Luchangco, and Marek Olszewski. Scalable reader-
writer locks. In SPAA ’09: Proceedings of the twenty-first annual
symposium on Parallelism in algorithms and architectures, pages 101–
110, Calgary, AB, Canada, 2009. ACM.
[LMKM16] Lihao Liang, Paul E. McKenney, Daniel Kroening, and Tom Melham.
Verification of the tree-based hierarchical read-copy update in the Linux
kernel. Technical report, Cornell University Library, October 2016.
https://arxiv.org/abs/1610.03052.
[LMKM18] Lihao Liang, Paul E. McKenney, Daniel Kroening, and Tom Melham.
Verification of tree-based hierarchical Read-Copy Update in the Linux
Kernel. In 2018 Design, Automation & Test in Europe Conference &
Exhibition (DATE), Dresden, Germany, March 2018.
[Loc02] Doug Locke. Priority inheritance: The real story, July 2002.
URL: http://www.linuxdevices.com/articles/AT5698775833.
html [broken, November 2016], page capture available at https://www.
math.unipd.it/%7Etullio/SCD/2007/Materiale/Locke.pdf.
[LR80] Butler W. Lampson and David D. Redell. Experience with processes and
monitors in Mesa. Communications of the ACM, 23(2):105–117, 1980.
[LS11] Yujie Liu and Michael Spear. Toxic transactions. In TRANSACT 2011,
San Jose, CA, USA, June 2011. ACM SIGPLAN.
[LSLK14] Carl Leonardsson, Kostis Sagonas, Truc Nguyen Lam, and Michalis
Kokologiannakis. Nidhugg, July 2014. https://github.com/
nidhugg/nidhugg.
[LVK+ 17] Ori Lahav, Viktor Vafeiadis, Jeehoon Kang, Chung-Kil Hur, and Derek
Dreyer. Repairing sequential consistency in C/C++11. SIGPLAN Not.,
52(6):618–632, June 2017.
[LZC14] Ran Liu, Heng Zhang, and Haibo Chen. Scalable read-mostly synchro-
nization using passive reader-writer locks. In 2014 USENIX Annual
Technical Conference (USENIX ATC 14), pages 219–230, Philadelphia,
PA, June 2014. USENIX Association.
[MAK+ 01] Paul E. McKenney, Jonathan Appavoo, Andi Kleen, Orran Krieger, Rusty
Russell, Dipankar Sarma, and Maneesh Soni. Read-copy update. In Ot-
tawa Linux Symposium, July 2001. URL: https://www.kernel.org/
doc/ols/2001/read-copy.pdf, http://www.rdrop.com/users/
paulmck/RCU/rclock_OLS.2001.05.01c.pdf.
[Mar17] Luc Maranget. Aarch64 model vs. hardware, May 2017.
http://pauillac.inria.fr/~maranget/cats7/model-
aarch64/specific.html.
[Mar18] Catalin Marinas. Queued spinlocks model, March 2018.
https://git.kernel.org/pub/scm/linux/kernel/git/
cmarinas/kernel-tla.git.
[Mas92] H. Massalin. Synthesis: An Efficient Implementation of Fundamental
Operating System Services. PhD thesis, Columbia University, New York,
NY, 1992.
[Mat17] Norm Matloff. Programming on Parallel Machines. University of
California, Davis, Davis, CA, USA, 2017.
[MB20] Paul E. McKenney and Hans Boehm. P2055R0: A relaxed guide to
memory_order_relaxed, January 2020. http://www.open-std.org/
jtc1/sc22/wg21/docs/papers/2020/p2055r0.pdf.
[MBM+ 06] Kevin E. Moore, Jayaram Bobba, Michelle J. Moravan, Mark D. Hill, and
David A. Wood. LogTM: Log-based transactional memory. In Proceed-
ings of the 12th Annual International Symposium on High Performance
Computer Architecture (HPCA-12), Austin, Texas, United States, 2006.
IEEE. Available: http://www.cs.wisc.edu/multifacet/papers/
hpca06_logtm.pdf [Viewed December 21, 2006].
[MBWW12] Paul E. McKenney, Silas Boyd-Wickizer, and Jonathan Walpole. RCU
usage in the linux kernel: One decade later, September 2012. Techni-
cal report paulmck.2012.09.17, http://rdrop.com/users/paulmck/
techreports/survey.2012.09.17a.pdf.
[McK90] Paul E. McKenney. Stochastic fairness queuing. In IEEE IN-
FOCOM’90 Proceedings, pages 733–740, San Francisco, June
1990. The Institute of Electrical and Electronics Engineers,
Inc. Revision available: http://www.rdrop.com/users/paulmck/
scalability/paper/sfq.2002.06.04.pdf [Viewed May 26, 2008].
[McK95] Paul E. McKenney. Differential profiling. In MASCOTS 1995, pages
237–241, Toronto, Canada, January 1995.
[McK96a] Paul E. McKenney. Pattern Languages of Program Design, vol-
ume 2, chapter 31: Selecting Locking Designs for Parallel Pro-
grams, pages 501–531. Addison-Wesley, June 1996. Avail-
able: http://www.rdrop.com/users/paulmck/scalability/
paper/mutexdesignpat.pdf [Viewed February 17, 2005].
[McK96b] Paul E. McKenney. Selecting locking primitives for parallel programs.
Communications of the ACM, 39(10):75–82, October 1996.
[McK07f] Paul E. McKenney. Using Promela and Spin to verify parallel algorithms,
August 2007. Available: https://lwn.net/Articles/243851/
[Viewed September 8, 2007].
[McK08a] Paul E. McKenney. Efficient support of consistent cyclic search with read-
copy update (lapsed). Technical Report US Patent 7,426,511, Assigned
to International Business Machines Corp, Washington, DC, September
2008.
[McK08e] Paul E. McKenney. RCU part 3: the RCU API, January 2008. Available:
https://lwn.net/Articles/264090/ [Viewed January 10, 2008].
[McK08g] Paul E. McKenney. What is RCU? part 2: Usage, January 2008. Available:
https://lwn.net/Articles/263130/ [Viewed January 4, 2008].
[McK09a] Paul E. McKenney. Re: [PATCH fyi] RCU: the bloatwatch edition,
January 2009. Available: https://lkml.org/lkml/2009/1/14/449
[Viewed January 15, 2009].
[McK10] Paul E. McKenney. Efficient support of consistent cyclic search with read-
copy update (lapsed). Technical Report US Patent 7,814,082, Assigned to
International Business Machines Corp, Washington, DC, October 2010.
[McK11a] Paul E. McKenney. 3.0 and RCU: what went wrong, July 2011. https:
//lwn.net/Articles/453002/.
[McK12b] Paul E. McKenney. Making RCU safe for battery-powered devices, Feb-
ruary 2012. Available: http://www.rdrop.com/users/paulmck/
RCU/RCUdynticks.2012.02.15b.pdf [Viewed March 1, 2012].
[McK14a] Paul E. McKenney. C++ memory model meets high-update-rate data struc-
tures, September 2014. http://www2.rdrop.com/users/paulmck/
RCU/C++Updates.2014.09.11a.pdf.
[McK14b] Paul E. McKenney. Efficient support of consistent cyclic search with read-
copy update (lapsed). Technical Report US Patent 8,874,535, Assigned to
International Business Machines Corp, Washington, DC, October 2014.
[McK14e] Paul E. McKenney. Proper care and feeding of return values from
rcu_dereference(), February 2014. https://www.kernel.org/
doc/Documentation/RCU/rcu_dereference.txt.
[McK14f] Paul E. McKenney. The RCU API, 2014 edition, September 2014.
https://lwn.net/Articles/609904/.
[MDSS20] Hans Meuer, Jack Dongarra, Erich Strohmaier, and Horst Simon. Top 500:
The list, November 2020. Available: https://top500.org/lists/
[Viewed March 6, 2021].
[Mer11] Rick Merritt. IBM plants transactional memory in CPU, August 2011. EE
Times https://www.eetimes.com/ibm-plants-transactional-
memory-in-cpu/.
[MGM+ 09] Paul E. McKenney, Manish Gupta, Maged M. Michael, Phil Howard,
Joshua Triplett, and Jonathan Walpole. Is parallel programming hard,
and if so, why? Technical Report TR-09-02, Portland State University,
Portland, OR, USA, February 2009. URL: https://archives.pdx.
edu/ds/psu/10386 [Viewed February 13, 2021].
[MHS12] Milo M. K. Martin, Mark D. Hill, and Daniel J. Sorin. Why on-chip
coherence is here to stay. Communications of the ACM, 55(7):78–89,
July 2012.
api/processthreadsapi/nf-processthreadsapi-
flushprocesswritebuffers.
[Mil06] David S. Miller. Re: [PATCH, RFC] RCU : OOM avoidance and lower
latency, January 2006. Available: https://lkml.org/lkml/2006/1/
7/22 [Viewed February 29, 2012].
[MJST16] Paul E. McKenney, Alan Jeffrey, Ali Sezgin, and Tony Tye. Out-of-thin-air
execution is vacuous, July 2016. http://www.open-std.org/jtc1/
sc22/wg21/docs/papers/2016/p0422r0.html.
[MKM12] Yandong Mao, Eddie Kohler, and Robert Tappan Morris. Cache craftiness
for fast multicore key-value storage. In Proceedings of the 7th ACM
European Conference on Computer Systems, EuroSys ’12, pages 183–196,
Bern, Switzerland, 2012. ACM.
[MLH94] Peter Magnusson, Anders Landin, and Erik Hagersten. Efficient software
synchronization on large cache coherent multiprocessors. Technical
Report T94:07, Swedish Institute of Computer Science, Kista, Sweden,
February 1994.
[MM00] Ingo Molnar and David S. Miller. brlock, March 2000. URL: http:
//kernel.nic.funet.fi/pub/linux/kernel/v2.3/patch-
html/patch-2.3.49/linux_include_linux_brlock.h.html.
[MMM+ 20] Paul E. McKenney, Maged Michael, Jens Maurer, Peter Sewell, Mar-
tin Uecker, Hans Boehm, Hubert Tong, Niall Douglas, Thomas
Rodgers, Will Deacon, Michael Wong, David Goldblatt, Kostya Sere-
bryany, and Anthony Williams. P1726R4: Pointer lifetime-end zap,
July 2020. http://www.open-std.org/jtc1/sc22/wg21/docs/
papers/2020/p1726r4.pdf.
[MMS19] Paul E. McKenney, Maged Michael, and Peter Sewell. N2369: Pointer
lifetime-end zap, April 2019. http://www.open-std.org/jtc1/
sc22/wg14/www/docs/n2369.pdf.
[MMW07] Paul E. McKenney, Maged Michael, and Jonathan Walpole. Why the
grass may not be greener on the other side: A comparison of locking
vs. transactional memory. In Programming Languages and Operating
Systems, pages 1–5, Stevenson, Washington, USA, October 2007. ACM
SIGOPS.
[Mor07] Richard Morris. Sir Tony Hoare: Geek of the week, Au-
gust 2007. https://www.red-gate.com/simple-talk/opinion/
geek-of-the-week/sir-tony-hoare-geek-of-the-week/.
[MOZ09] Nicholas Mc Guire, Peter Odhiambo Okech, and Qingguo Zhou. Analysis
of inherent randomness of the Linux kernel. In Eleventh Real Time Linux
Workshop, Dresden, Germany, September 2009.
[MP15b] Paul E. McKenney and Aravinda Prasad. Some more details on read-log-
update, December 2015. https://lwn.net/Articles/667720/.
[MPA+ 06] Paul E. McKenney, Chris Purcell, Algae, Ben Schumin, Gaius Cornelius,
Qwertyus, Neil Conway, Sbw, Blainster, Canis Rufus, Zoicon5, Anome,
and Hal Eisen. Read-copy update, July 2006. https://en.wikipedia.
org/wiki/Read-copy-update.
[MPI08] MPI Forum. Message passing interface forum, September 2008. Available:
http://www.mpi-forum.org/ [Viewed September 9, 2008].
[MRP+ 17] Paul E. McKenney, Torvald Riegel, Jeff Preshing, Hans Boehm, Clark
Nelson, Olivier Giroux, Lawrence Crowl, JF Bastien, and Michael
[MS93] Paul E. McKenney and Jack Slingwine. Efficient kernel memory al-
location on shared-memory multiprocessors. In USENIX Conference
Proceedings, pages 295–306, Berkeley CA, February 1993. USENIX
Association. Available: http://www.rdrop.com/users/paulmck/
scalability/paper/mpalloc.pdf [Viewed January 30, 2005].
[MS96] M. M. Michael and M. L. Scott. Simple, fast, and practical non-blocking
and blocking concurrent queue algorithms. In Proc. of the Fifteenth ACM
Symposium on Principles of Distributed Computing, pages 267–275, May
1996.
[MS98a] Paul E. McKenney and John D. Slingwine. Read-copy update: Using exe-
cution history to solve concurrency problems. In Parallel and Distributed
Computing and Systems, pages 509–518, Las Vegas, NV, October 1998.
[MS01] Paul E. McKenney and Dipankar Sarma. Read-copy update mutual exclu-
sion in Linux, February 2001. Available: http://lse.sourceforge.
net/locking/rcu/rcupdate_doc.html [Viewed October 18, 2004].
[MS12] Alexander Matveev and Nir Shavit. Towards a fully pessimistic STM
model. In TRANSACT 2012, San Jose, CA, USA, February 2012. ACM
SIGPLAN.
[MS18] Luc Maranget and Alan Stern. lock.cat, May 2018. https:
//github.com/torvalds/linux/blob/master/tools/memory-
model/lock.cat.
[MSA+ 02] Paul E. McKenney, Dipankar Sarma, Andrea Arcangeli, Andi Kleen,
Orran Krieger, and Rusty Russell. Read-copy update. In Ottawa Linux Sym-
posium, pages 338–367, June 2002. Available: https://www.kernel.
org/doc/ols/2002/ols2002-pages-338-367.pdf [Viewed Febru-
ary 14, 2021].
[MSFM15] Alexander Matveev, Nir Shavit, Pascal Felber, and Patrick Marlier. Read-
log-update: A lightweight synchronization mechanism for concurrent
programming. In Proceedings of the 25th Symposium on Operating
Systems Principles, SOSP ’15, pages 168–183, Monterey, California,
2015. ACM.
[MSK01] Paul E. McKenney, Jack Slingwine, and Phil Krueger. Experience with
an efficient parallel kernel memory allocator. Software – Practice and
Experience, 31(3):235–257, March 2001.
[MSS04] Paul E. McKenney, Dipankar Sarma, and Maneesh Soni. Scaling dcache
with RCU. Linux Journal, 1(118):38–46, January 2004.
[MSS12] Luc Maranget, Susmit Sarkar, and Peter Sewell. A tutorial introduction to
the ARM and POWER relaxed memory models, October 2012. https:
//www.cl.cam.ac.uk/~pes20/ppc-supplemental/test7.pdf.
[MT01] Jose F. Martinez and Josep Torrellas. Speculative locks for concurrent exe-
cution of critical sections in shared-memory multiprocessors. In Workshop
on Memory Performance Issues, International Symposium on Computer
Architecture, Gothenburg, Sweden, June 2001. Available: https://
iacoma.cs.uiuc.edu/iacoma-papers/wmpi_locks.pdf [Viewed
June 23, 2004].
[MW05] Paul E. McKenney and Jonathan Walpole. RCU semantics: A first attempt,
January 2005. Available: http://www.rdrop.com/users/paulmck/
RCU/rcu-semantics.2005.01.30a.pdf [Viewed December 6, 2009].
[MWB+ 17] Paul E. McKenney, Michael Wong, Hans Boehm, Jens Maurer,
Jeffrey Yasskin, and JF Bastien. P0190R4: Proposal for new
memory_order_consume definition, July 2017. http://www.open-
std.org/jtc1/sc22/wg21/docs/papers/2017/p0190r4.pdf.
[MWPF18] Paul E. McKenney, Ulrich Weigand, Andrea Parri, and Boqun Feng.
Linux-kernel memory model, September 2018. http://www.open-
std.org/jtc1/sc22/wg21/docs/papers/2018/p0124r6.html.
[Nata] National Institute of Standards and Technology. SI Unit rules and style
conventions [online].
[Nes06b] Oleg Nesterov. Re: [rfc, patch 1/2] qrcu: "quick" srcu implementation,
November 2006. Available: https://lkml.org/lkml/2006/11/29/
330 [Viewed November 26, 2008].
[NSHW20] Vijay Nagarajan, Daniel J. Sorin, Mark D. Hill, and David A. Wood. A
Primer on Memory Consistency and Cache Coherence, Second Edition.
Synthesis Lectures on Computer Architecture. Morgan & Claypool, 2020.
[NZ13] Oleg Nesterov and Peter Zijlstra. rcu: Create rcu_sync infrastructure, Oc-
tober 2013. https://lore.kernel.org/lkml/20131002150518.
675931976@infradead.org/.
[O’H19] Peter W. O’Hearn. Incorrectness logic. Proc. ACM Program. Lang.,
4(POPL), December 2019.
[OHOC20] Robert O’Callahan, Kyle Huey, Devon O’Dell, and Terry Coatta. To catch
a failure: The record-and-replay approach to debugging: A discussion
with Robert O’Callahan, Kyle Huey, Devon O’Dell, and Terry Coatta. Queue,
18(1):61–79, February 2020.
[ON07] Robert Olsson and Stefan Nilsson. TRASH: A dynamic LC-trie and hash
data structure. In Workshop on High Performance Switching and Routing
(HPSR’07), May 2007.
[ONH+ 96] Kunle Olukotun, Basem A. Nayfeh, Lance Hammond, Ken Wilson, and
Kunyung Chang. The case for a single-chip multiprocessor. In ASPLOS
VII, Cambridge, MA, USA, October 1996.
[Ope97] Open Group. The single UNIX specification, version 2: Threads, 1997.
Available: http://www.opengroup.org/onlinepubs/007908799/
xsh/threads.html [Viewed September 19, 2008].
[ORY01] Peter W. O’Hearn, John C. Reynolds, and Hongseok Yang. Local
reasoning about programs that alter data structures. In Proceedings of
the 15th International Workshop on Computer Science Logic, CSL ’01,
pages 1–19, Berlin, Heidelberg, 2001. Springer-Verlag.
[PAB+ 95] Calton Pu, Tito Autrey, Andrew Black, Charles Consel, Crispin Cowan,
Jon Inouye, Lakshmi Kethana, Jonathan Walpole, and Ke Zhang. Opti-
mistic incremental specialization: Streamlining a commercial operating
system. In 15th ACM Symposium on Operating Systems Principles
(SOSP’95), pages 314–321, Copper Mountain, CO, December 1995.
[Pat10] David Patterson. The trouble with multicore. IEEE Spectrum, 2010:28–32,
52–53, July 2010.
[PAT11] V. Pankratius and A.-R. Adl-Tabatabai. A study of transactional memory
vs. locks in practice. In Proceedings of the 23rd ACM symposium on
Parallelism in algorithms and architectures (2011), SPAA ’11, pages
43–52, San Jose, CA, USA, 2011. ACM.
[PBCE20] Elizabeth Patitsas, Jesse Berlin, Michelle Craig, and Steve Easterbrook.
Evidence that computer science grades are not bimodal. Commun. ACM,
63(1):91–98, January 2020.
[PD11] Martin Pohlack and Stephan Diestelhorst. From lightweight hardware
transactional memory to lightweight lock elision. In TRANSACT 2011,
San Jose, CA, USA, June 2011. ACM SIGPLAN.
[Pen18] Roman Penyaev. [PATCH v2 01/26] introduce list_next_or_null_rr_rcu(),
May 2018. https://lkml.kernel.org/r/20180518130413.
16997-2-roman.penyaev@profitbricks.com.
[Pet06] Jeremy Peters. From Reuters, automatic trading linked to news events,
December 2006. URL: http://www.nytimes.com/2006/12/11/
technology/11reuters.html?ei=5088&en=e5e9416415a9eeb2&
ex=1323493200...
[Pig06] Nick Piggin. [patch 3/3] radix-tree: RCU lockless readside, June
2006. Available: https://lkml.org/lkml/2006/6/20/238 [Viewed
March 25, 2008].
[Pik17] Fedor G. Pikus. Read, copy, update... Then what?, September 2017.
https://www.youtube.com/watch?v=rxQ5K9lo034.
[PMDY20] SeongJae Park, Paul E. McKenney, Laurent Dufour, and Heon Y. Yeom.
An HTM-based update-side synchronization for RCU on NUMA systems. In
Proceedings of the Fifteenth European Conference on Computer Systems,
EuroSys ’20, Heraklion, Greece, 2020. Association for Computing
Machinery.
[Pug90] William Pugh. Concurrent maintenance of skip lists. Technical Report CS-
TR-2222.1, Institute of Advanced Computer Science Studies, Department
of Computer Science, University of Maryland, College Park, Maryland,
June 1990.
[Pul00] Geoffrey K. Pullum. How Dr. Seuss would prove the halting problem
undecidable. Mathematics Magazine, 73(4):319–320, 2000. http:
//www.lel.ed.ac.uk/~gpullum/loopsnoop.html.
[Ray99] Eric S. Raymond. The Cathedral and the Bazaar: Musings on Linux and
Open Source by an Accidental Revolutionary. O’Reilly, 1999.
[RC15] Pedro Ramalhete and Andreia Correia. Poor man’s URCU, Au-
gust 2015. https://github.com/pramalhe/ConcurrencyFreaks/
blob/master/papers/poormanurcu-2015.pdf.
[RD12] Ravi Rajwar and Martin Dixon. Intel transactional synchronization exten-
sions, September 2012. Intel Developer Forum (IDF) 2012 ARCS004.
[Reg10] John Regehr. A guide to undefined behavior in C and C++, part 1, July
2010. https://blog.regehr.org/archives/213.
[RG01] Ravi Rajwar and James R. Goodman. Speculative lock elision: Enabling
highly concurrent multithreaded execution. In Proceedings of the 34th
Annual ACM/IEEE International Symposium on Microarchitecture, MI-
CRO 34, pages 294–305, Austin, TX, December 2001. The Institute of
Electrical and Electronics Engineers, Inc.
[RH02] Zoran Radović and Erik Hagersten. Efficient synchronization for nonuni-
form communication architectures. In Proceedings of the 2002 ACM/IEEE
Conference on Supercomputing, pages 1–13, Baltimore, Maryland, USA,
November 2002. The Institute of Electrical and Electronics Engineers,
Inc.
[RH03] Zoran Radović and Erik Hagersten. Hierarchical backoff locks for
nonuniform communication architectures. In Proceedings of the Ninth
International Symposium on High Performance Computer Architecture
(HPCA-9), pages 241–252, Anaheim, California, USA, February 2003.
[RH18] Geoff Romer and Andrew Hunter. An RAII interface for deferred
reclamation, March 2018. http://www.open-std.org/jtc1/sc22/
wg21/docs/papers/2018/p0561r4.html.
[RKM+ 10] Arun Raman, Hanjun Kim, Thomas R. Mason, Thomas B. Jablin, and
David I. August. Speculative parallelization using software multi-threaded
transactions. SIGARCH Comput. Archit. News, 38(1):65–76, 2010.
[RLPB18] Yuxin Ren, Guyue Liu, Gabriel Parmer, and Björn Brandenburg. Scalable
memory reclamation for multi-core, real-time systems. In Proceedings of
the 2018 IEEE Real-Time and Embedded Technology and Applications
Symposium (RTAS), page 12, Porto, Portugal, April 2018. IEEE.
[RMF19] Federico Reghenzani, Giuseppe Massari, and William Fornaciari. The
real-time Linux kernel: A survey on PREEMPT_RT. ACM Comput.
Surv., 52(1):18:1–18:36, February 2019.
[Ros06] Steven Rostedt. Lightweight PI-futexes, June 2006. Avail-
able: https://www.kernel.org/doc/html/latest/locking/pi-
futex.html [Viewed February 14, 2021].
[Ros10a] Steven Rostedt. tracing: Harry Potter and the Deathly Macros, December
2010. Available: https://lwn.net/Articles/418710/ [Viewed:
August 28, 2011].
[Ros10b] Steven Rostedt. Using the TRACE_EVENT() macro (part 1), March 2010.
Available: https://lwn.net/Articles/379903/ [Viewed: August
28, 2011].
[Ros10c] Steven Rostedt. Using the TRACE_EVENT() macro (part 2), March 2010.
Available: https://lwn.net/Articles/381064/ [Viewed: August
28, 2011].
[Ros10d] Steven Rostedt. Using the TRACE_EVENT() macro (part 3), April 2010.
Available: https://lwn.net/Articles/383362/ [Viewed: August
28, 2011].
[Ros11] Steven Rostedt. lockdep: How to read its cryptic output, September 2011.
http://www.linuxplumbersconf.org/2011/ocw/sessions/153.
[Roy17] Lance Roy. rcutorture: Add CBMC-based formal verification for SRCU,
January 2017. URL: https://www.spinics.net/lists/kernel/
msg2421833.html.
[RR20] Sergio Rajsbaum and Michel Raynal. Mastering concurrent computing
through sequential thinking. Commun. ACM, 63(1):78–87, January 2020.
[RSB+ 97] Rajeev Rastogi, S. Seshadri, Philip Bohannon, Dennis W. Leinbaugh,
Abraham Silberschatz, and S. Sudarshan. Logical and physical versioning
in main memory databases. In Proceedings of the 23rd International
Conference on Very Large Data Bases, VLDB ’97, pages 86–95, San
Francisco, CA, USA, August 1997. Morgan Kaufmann Publishers Inc.
[RTY+ 87] Richard Rashid, Avadis Tevanian, Michael Young, David Golub, Robert
Baron, David Black, William Bolosky, and Jonathan Chew. Machine-
independent virtual memory management for paged uniprocessor and
multiprocessor architectures. In 2nd Symposium on Architectural Support
for Programming Languages and Operating Systems, pages 31–39, Palo
Alto, CA, October 1987. Association for Computing Machinery.
[Rus03] Rusty Russell. Hanging out with smart people: or... things I learned
being a kernel monkey, July 2003. 2003 Ottawa Linux Symposium
Keynote https://ozlabs.org/~rusty/ols-2003-keynote/ols-
keynote-2003.html.
[SAE+ 18] Caitlin Sadowski, Edward Aftandilian, Alex Eagle, Liam Miller-Cushon,
and Ciera Jaspan. Lessons from building static analysis tools at Google.
Commun. ACM, 61(4):58–66, March 2018.
[SAH+ 03] Craig A. N. Soules, Jonathan Appavoo, Kevin Hui, Dilma Da Silva, Gre-
gory R. Ganger, Orran Krieger, Michael Stumm, Robert W. Wisniewski,
Marc Auslander, Michal Ostrowski, Bryan Rosenburg, and Jimi Xenidis.
System support for online reconfiguration. In Proceedings of the 2003
USENIX Annual Technical Conference, pages 141–154, San Antonio,
Texas, USA, June 2003. USENIX Association.
[SATG+ 09] Tatiana Shpeisman, Ali-Reza Adl-Tabatabai, Robert Geva, Yang Ni, and
Adam Welc. Towards transactional memory semantics for C++. In SPAA
’09: Proceedings of the twenty-first annual symposium on Parallelism in
algorithms and architectures, pages 49–58, Calgary, AB, Canada, 2009.
ACM.
[SBV10] Martin Schoeberl, Florian Brandner, and Jan Vitek. RTTM: Real-time
transactional memory. In Proceedings of the 2010 ACM Symposium on
Applied Computing, pages 326–333, 01 2010.
[SM04b] Dipankar Sarma and Paul E. McKenney. Making RCU safe for deep
sub-millisecond response realtime applications. In Proceedings of the
2004 USENIX Annual Technical Conference (FREENIX Track), pages
182–191, Boston, MA, USA, June 2004. USENIX Association.
[SM13] Thomas Sewell and Toby Murray. Above and beyond: seL4 noninterfer-
ence and binary verification, May 2013. https://cps-vo.org/node/
7706.
[Smi19] Richard Smith. Working draft, standard for programming language
C++, January 2019. http://www.open-std.org/jtc1/sc22/wg21/
docs/papers/2019/n4800.pdf.
[SMS08] Michael Spear, Maged Michael, and Michael Scott. Inevitability mech-
anisms for software transactional memory. In 3rd ACM SIGPLAN
Workshop on Transactional Computing, Salt Lake City, Utah, February
2008. ACM. Available: http://www.cs.rochester.edu/u/scott/
papers/2008_TRANSACT_inevitability.pdf [Viewed January 10,
2009].
[SNGK17] Dimitrios Siakavaras, Konstantinos Nikas, Georgios Goumas, and Nec-
tarios Koziris. Combining HTM and RCU to implement highly efficient
balanced binary search trees. In 12th ACM SIGPLAN Workshop on
Transactional Computing, Austin, TX, USA, February 2017.
[SPA94] SPARC International. The SPARC Architecture Manual, 1994.
Available: https://sparc.org/wp-content/uploads/2014/01/
SPARCV9.pdf.gz.
[Spi77] Keith R. Spitz. Tell which is which and you’ll be rich, 1977. Inscription
on wall of dungeon.
[Spr01] Manfred Spraul. Re: RFC: patch to allow lock-free traversal of lists with
insertion, October 2001. URL: http://lkml.iu.edu/hypermail/
linux/kernel/0110.1/0410.html.
[Spr08] Manfred Spraul. [RFC, PATCH] state machine based rcu, August
2008. Available: https://lkml.org/lkml/2008/8/21/336 [Viewed
December 8, 2008].
[SR84] Z. Segall and L. Rudolph. Dynamic decentralized cache schemes for
MIMD parallel processors. In 11th Annual International Symposium on
Computer Architecture, pages 340–347, June 1984.
[SRK+ 11] Justin Seyster, Prabakar Radhakrishnan, Samriti Katoch, Abhinav Duggal,
Scott D. Stoller, and Erez Zadok. Redflag: a framework for analysis
of kernel-level concurrency. In Proceedings of the 11th international
conference on Algorithms and architectures for parallel processing -
Volume Part I, ICA3PP’11, pages 66–79, Melbourne, Australia, 2011.
Springer-Verlag.
[SRL90] Lui Sha, Ragunathan Rajkumar, and John P. Lehoczky. Priority in-
heritance protocols: An approach to real-time synchronization. IEEE
Transactions on Computers, 39(9):1175–1185, 1990.
[SS06] Ori Shalev and Nir Shavit. Split-ordered lists: Lock-free extensible hash
tables. J. ACM, 53(3):379–405, May 2006.
[SSA+ 11] Susmit Sarkar, Peter Sewell, Jade Alglave, Luc Maranget, and Derek
Williams. POWER and ARM litmus tests, 2011. https://www.cl.
cam.ac.uk/~pes20/ppc-supplemental/test6.pdf.
[SSHT93] Janice S. Stone, Harold S. Stone, Philip Heidelberger, and John Turek.
Multiple reservations and the Oklahoma update. IEEE Parallel and
Distributed Technology Systems and Applications, 1(4):58–71, November
1993.
[SSRB00] Douglas C. Schmidt, Michael Stal, Hans Rohnert, and Frank Buschmann.
Pattern-Oriented Software Architecture Volume 2: Patterns for Concur-
rent and Networked Objects. Wiley, Chichester, West Sussex, England,
2000.
[SSVM02] S. Swaminathan, John Stultz, Jack Vogel, and Paul E. McKenney. Fairlocks
– a high performance fair locking scheme. In Proceedings of the 14th
IASTED International Conference on Parallel and Distributed Computing
and Systems, pages 246–251, Cambridge, MA, USA, November 2002.
[ST87] William E. Snaman and David W. Thiel. The VAX/VMS distributed lock
manager. Digital Technical Journal, 5:29–44, September 1987.
[ST95] Nir Shavit and Dan Touitou. Software transactional memory. In Proceed-
ings of the 14th Annual ACM Symposium on Principles of Distributed
Computing, pages 204–213, Ottawa, Ontario, Canada, August 1995.
[Sut08] Herb Sutter. Effective concurrency, 2008. Series in Dr. Dobbs Journal.
[Sut13] Adrian Sutton. Concurrent programming with the Disruptor, January 2013.
Presentation at Linux.conf.au 2013, URL: https://www.youtube.
com/watch?v=ItpT_vmRHyI.
[SW95] Richard L. Sites and Richard T. Witek. Alpha AXP Architecture. Digital
Press, second edition, 1995.
[SWS16] Harshal Sheth, Aashish Welling, and Nihar Sheth. Read-copy up-
date in a garbage collected environment, 2016. MIT PRIMES pro-
gram: https://math.mit.edu/research/highschool/primes/
materials/2016/conf/10-1%20Sheth-Welling-Sheth.pdf.
[Tal07] Nassim Nicholas Taleb. The Black Swan. Random House, 2007.
[TDV15] Joseph Tassarotti, Derek Dreyer, and Viktor Vafeiadis. Verifying
read-copy-update in a logic for weak memory. In Proceedings of the 36th
annual ACM SIGPLAN conference on Programming Language Design and
Implementation, PLDI ’15, pages 110–120,
New York, NY, USA, June 2015. ACM.
[The08] The Open MPI Project. Open MPI, November 2008. Available: http:
//www.open-mpi.org/software/ [Viewed November 26, 2008].
[Tor01] Linus Torvalds. Re: [Lse-tech] Re: RFC: patch to allow lock-free traversal
of lists with insertion, October 2001. URL: https://lkml.org/lkml/
2001/10/13/105, https://lkml.org/lkml/2001/10/13/82.
[Tor19] Linus Torvalds. rcu: locking and unlocking need to always be at least
barriers, June 2019. Git commit: https://git.kernel.org/linus/
66be4e66a7f4.
[Tri12] Josh Triplett. Relativistic Causal Ordering: A Memory Model for Scalable
Concurrent Data Structures. PhD thesis, Portland State University, 2012.
[Tri22] Josh Triplett. Spawning processes faster and easier with io_uring, Sep-
tember 2022. https://www.youtube.com/watch?v=_h-kV8AYYqM&
t=4074s.
[TS93] Hiroaki Takada and Ken Sakamura. A bounded spin lock algorithm with
preemption. Technical Report 93-02, University of Tokyo, Tokyo, Japan,
1993.
[TS95] H. Takada and K. Sakamura. Real-time scalability of nested spin locks. In
Proceedings of the 2nd International Workshop on Real-Time Computing
Systems and Applications, RTCSA ’95, pages 160–167, Tokyo, Japan,
1995. IEEE Computer Society.
[Tur37] Alan M. Turing. On computable numbers, with an application to the
Entscheidungsproblem. In Proceedings of the London Mathematical
Society, volume 42 of 2, pages 230–265, 1937.
[TZK+ 13] Stephen Tu, Wenting Zheng, Eddie Kohler, Barbara Liskov, and Samuel
Madden. Speedy transactions in multicore in-memory databases. In
Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems
Principles, SOSP ’13, pages 18–32, Farmington, Pennsylvania, 2013.
ACM.
[Ung11] David Ungar. Everything you know (about parallel programming) is
wrong!: A wild screed about the future. In Dynamic Languages Sympo-
sium 2011, Portland, OR, USA, October 2011. Invited talk presentation.
[Uni08a] University of California, Berkeley. BOINC: compute for science, October
2008. Available: http://boinc.berkeley.edu/ [Viewed January 31,
2008].
[Uni08b] University of California, Berkeley. SETI@HOME, December 2008.
Available: http://setiathome.berkeley.edu/ [Viewed January 31,
2008].
[Uni10] University of Maryland. Parallel maze solving, November 2010. URL:
http://www.cs.umd.edu/class/fall2010/cmsc433/p3/ [broken,
February 2021].
[Val95] John D. Valois. Lock-free linked lists using compare-and-swap. In
Proceedings of the Fourteenth Annual ACM Symposium on Principles
of Distributed Computing, PODC ’95, pages 214–222, Ottawa, Ontario,
Canada, 1995. ACM.
[Van18] Michal Vaner. ArcSwap, April 2018. https://crates.io/crates/
arc-swap.
[VBC+ 15] Viktor Vafeiadis, Thibaut Balabonski, Soham Chakraborty, Robin Moris-
set, and Francesco Zappa Nardelli. Common compiler optimisations are
invalid in the C11 memory model and what we can do about it. SIGPLAN
Not., 50(1):209–220, January 2015.
[VGS08] Haris Volos, Neelam Goyal, and Michael M. Swift. Pathological interac-
tion of locks with transactional memory. In 3rd ACM SIGPLAN Workshop
on Transactional Computing, Salt Lake City, Utah, USA, February 2008.
ACM. Available: http://www.cs.wisc.edu/multifacet/papers/
transact08_txlock.pdf [Viewed September 7, 2009].
[WTS96] Cai-Dong Wang, Hiroaki Takada, and Ken Sakamura. Priority inheritance
spin locks for multiprocessor real-time systems. In Proceedings of the
2nd International Symposium on Parallel Architectures, Algorithms, and
Networks, ISPAN ’96, pages 70–76, Beijing, China, 1996. IEEE Computer
Society.
[Xu10] Herbert Xu. bridge: Add core IGMP snooping support, February
2010. Available: https://marc.info/?t=126719855400006&r=1&
w=2 [Viewed March 20, 2011].
[YHLR13] Richard M. Yoo, Christopher J. Hughes, Konrad Lai, and Ravi Rajwar. Per-
formance evaluation of Intel® Transactional Synchronization Extensions
for high-performance computing. In Proceedings of SC13: International
Conference for High Performance Computing, Networking, Storage and
Analysis, SC ’13, pages 19:1–19:11, Denver, Colorado, 2013. ACM.
[Zel11] Cyril Zeller. CUDA C/C++ basics: Supercomputing 2011 tutorial, No-
vember 2011. https://www.nvidia.com/docs/IO/116711/sc11-
cuda-c-basics.pdf.
[Zha89] Lixia Zhang. A New Architecture for Packet Switching Network Protocols.
PhD thesis, Massachusetts Institute of Technology, July 1989.
[Zij14] Peter Zijlstra. Another go at speculative page faults, October 2014.
https://lkml.org/lkml/2014/10/20/620.
If I have seen further it is by standing on the
shoulders of giants.
Isaac Newton, modernized
Credits
LaTeX Advisor
Akira Yokosawa is this book’s LaTeX advisor, a role that perhaps most notably includes the
care and feeding of the style guide laid out in Appendix D. This work includes table
layout, listings, fonts, rendering of math, acronyms, bibliography formatting, epigraphs,
hyperlinks, and paper size. Akira also perfected the cross-referencing of quick quizzes,
allowing easy and exact navigation between quick quizzes and their answers. He also
added build options that permit quick quizzes to be hidden and to be gathered at the end
of each chapter, textbook style.
This role also includes the build system, which Akira has optimized and made
much more user-friendly. His enhancements have included automating response to
bibliography changes, automatically determining which source files are present, and
automatically generating listings (with automatically generated hyperlinked line-number
references) from the source files.
Reviewers
• Alan Stern (Chapter 15).
• Andy Whitcroft (Section 9.5.2, Section 9.5.3).
• Artem Bityutskiy (Chapter 15, Appendix C).
• Dave Keck (Appendix C).
• David S. Horner (Section 12.1.5).
• Gautham Shenoy (Section 9.5.2, Section 9.5.3).
• “jarkao2”, AKA LWN guest #41960 (Section 9.5.3).
• Jonathan Walpole (Section 9.5.3).
• Josh Triplett (Chapter 12).
• Michael Factor (Section 17.2).
• Mike Fulton (Section 9.5.2).
• Peter Zijlstra (Section 9.5.4).
• Richard Woodruff (Appendix C).
Reviewers whose feedback took the extremely welcome form of a patch are credited
in the git logs.
Machine Owners
Readers might have noticed some graphs showing scalability data out to several hundred
CPUs, courtesy of my current employer, with special thanks to Paul Saab, Yashar
Bayani, Joe Boyd, and Kyle McMartin.
From back in my time at IBM, a great debt of thanks goes to Martin Bligh, who
originated the Advanced Build and Test (ABAT) system at IBM’s Linux Technology
Center, as well as to Andy Whitcroft, Dustin Kirkland, and many others who extended
this system. Many thanks go also to a great number of machine owners: Andrew
Theurer, Andy Whitcroft, Anton Blanchard, Chris McDermott, Cody Schaefer, Darrick
Wong, David “Shaggy” Kleikamp, Jon M. Tollefson, Jose R. Santos, Marvin Heffler,
Nathan Lynch, Nishanth Aravamudan, Tim Pepper, and Tony Breeds.
Original Publications
1. Section 2.4 (“What Makes Parallel Programming Hard?”) on page 19 originally
appeared in a Portland State University Technical Report [MGM+ 09].
2. Section 4.3.4.1 (“Shared-Variable Shenanigans”) on page 63 originally appeared
in Linux Weekly News [ADF+ 19].
3. Section 6.5 (“Retrofitted Parallelism Considered Grossly Sub-Optimal”) on
page 147 originally appeared in 4th USENIX Workshop on Hot Topics on Parallel-
ism [McK12c].
4. Section 9.5.2 (“RCU Fundamentals”) on page 228 originally appeared in Linux
Weekly News [MW07].
5. Section 9.5.3 (“RCU Linux-Kernel API”) on page 238 originally appeared in Linux
Weekly News [McK08e].
6. Section 9.5.4 (“RCU Usage”) on page 251 originally appeared in Linux Weekly
News [McK08g].
7. Section 9.5.5 (“RCU Related Work”) on page 277 originally appeared in Linux
Weekly News [McK14g].
8. Section 9.5.5 (“RCU Related Work”) on page 277 originally appeared in Linux
Weekly News [MP15a].
9. Chapter 12 (“Formal Verification”) on page 357 originally appeared in Linux
Weekly News [McK07f, MR08, McK11d].
10. Section 12.3 (“Axiomatic Approaches”) on page 407 originally appeared in Linux
Weekly News [MS14].
11. Section 13.5.4 (“Correlated Fields”) on page 438 originally appeared in Oregon
Graduate Institute [McK04].
12. Chapter 15 (“Advanced Synchronization: Memory Ordering”) on page 485
originally appeared in the Linux kernel [HMDZ06].
13. Chapter 15 (“Advanced Synchronization: Memory Ordering”) on page 485
originally appeared in Linux Weekly News [AMM+17a, AMM+17b].
14. Chapter 15 (“Advanced Synchronization: Memory Ordering”) on page 485
originally appeared in ASPLOS ’18 [AMM+18].
15. Section 15.3.2 (“Address- and Data-Dependency Difficulties”) on page 524
originally appeared in the Linux kernel [McK14e].
16. Section 15.5 (“Memory-Barrier Instructions For Specific CPUs”) on page 550
originally appeared in Linux Journal [McK05a, McK05b].
Figure Credits
1. Figure 3.1 (p 26) by Melissa Broussard.
2. Figure 3.2 (p 26) by Melissa Broussard.
3. Figure 3.3 (p 27) by Melissa Broussard.
4. Figure 3.5 (p 29) by Melissa Broussard.
5. Figure 3.6 (p 30) by Melissa Broussard.
6. Figure 3.7 (p 31) by Melissa Broussard.
7. Figure 3.8 (p 31) by Melissa Broussard, remixed.
8. Figure 3.9 (p 32) by Melissa Broussard.
9. Figure 3.10 (p 33) by Melissa Broussard.
10. Figure 3.12 (p 39) by Melissa Broussard.
11. Figure 5.3 (p 80) by Melissa Broussard.
12. Figure 6.1 (p 114) by Kornilios Kourtis.
13. Figure 6.2 (p 115) by Melissa Broussard.
14. Figure 6.3 (p 115) by Kornilios Kourtis.
15. Figure 6.4 (p 117) by Kornilios Kourtis.
16. Figure 6.13 (p 133) by Melissa Broussard.
17. Figure 6.14 (p 134) by Melissa Broussard.
18. Figure 6.15 (p 135) by Melissa Broussard.
19. Figure 7.1 (p 160) by Melissa Broussard.
35. Figure 15.1 (p 487) by Wikipedia user “I, Appaloosa” CC BY-SA 3.0, reformatted.
36. Figure 15.2 (p 487) by Wikipedia user “I, Appaloosa” CC BY-SA 3.0, reformatted.
Figure 9.33 was adapted from Fedor Pikus’s “When to use RCU” slide [Pik17]. The
discussion of mechanical reference counters in Section 9.2 stemmed from a private
conversation with Dave Regan.
Other Support
We owe thanks to many CPU architects for patiently explaining the instruction- and
memory-reordering features of their CPUs, particularly Wayne Cardoza, Ed Silha,
Anton Blanchard, Tim Slegel, Juergen Probst, Ingo Adlung, Ravi Arimilli, Cathy May,
Derek Williams, H. Peter Anvin, Andy Glew, Leonid Yegoshin, Richard Grisenthwaite,
and Will Deacon. Wayne deserves special thanks for his patience in explaining Alpha’s
reordering of dependent loads, a lesson that Paul resisted quite strenuously!
The bibtex-generation service of the Association for Computing Machinery has saved
us a huge amount of time and effort compiling the bibliography, for which we are
grateful. Thanks are also due to Stamatis Karnouskos, who convinced me to drag
my antique bibliography database kicking and screaming into the 21st century. Any
technical work of this sort owes thanks to the many individuals and organizations that
keep the Internet and the World Wide Web up and running, and this one is no exception.
Portions of this material are based upon work supported by the National Science
Foundation under Grant No. CNS-0719851.
Acronyms
CAS compare and swap, 35, 43, 57, 72, 404, 423, 605, 717, 826, 883
CBMC C bounded model checker, 280, 413, 617, 818
EBR epoch-based reclamation, 4, 279, 286, 884
TLE transactional lock elision, 605, 635, 891
TM transactional memory, 891
UTM unbounded transactional memory, 602, 891
Index
Bold: Major reference.
Underline: Definition.
Acquire load, 72, 230, 503, 881
Ahmed, Iftekhar, 280
Alglave, Jade, 376, 402, 408, 541, 546
Amdahl’s Law, 9, 128, 154, 881
Anti-Heisenbug, see Heisenbug, anti-
Arbel, Maya, 278, 279
Ash, Mike, 279
Associativity, see Cache associativity
Associativity miss, see Cache miss, associativity
Atomic, 29, 43, 56–58, 72, 78, 87, 95, 881
Atomic read-modify-write operation, 491, 492, 666, 881
Attiya, Hagit, 278, 856
Belay, Adam, 280
Bhat, Srivatsa, 279
Bonzini, Paolo, 5
Bornat, Richard, 4
Bos, Mara, 5
Bounded population-oblivious wait free, see Wait free, bounded population-oblivious
Bounded wait free, see Wait free, bounded
Butenhof, David R., 5
C bounded model checker (CBMC), 280, 413, 617, 818
Cache, 882
  direct-mapped, 667, 884
  fully associative, 602, 885
Cache associativity, 602, 662, 881
Cache coherence, 505, 555, 602, 882
Cache geometry, 662, 882
Cache line, 34, 80, 182, 320, 489, 508, 553, 600, 662, 882
Cache miss, 882
  associativity, 663, 881
  capacity, 662, 883
  communication, 663, 883
  write, 663, 891
Cache-coherence protocol, 664, 882
Cache-invalidation latency, see Latency, cache-invalidation
Cache-miss latency, see Latency, cache-miss
Capacity miss, see Cache miss, capacity
Chen, Haibo, 279
Chien, Andrew, 5
Clash free, 446, 883
Clements, Austin, 278
Code locking, see Locking, code
Combinatorial explosion, 883
Combinatorial implosion, 883
Communication miss, see Cache miss, communication
Compare and swap (CAS), 35, 43, 57, 404, 423, 605, 717, 826, 883
Concurrent, 638, 883
Consistency
  memory, 552, 886
  process, 888
  sequential, 431, 617, 890
  weak, 557
Corbet, Jonathan, 5
Correia, Andreia, 279
Critical section, 30, 56, 127, 131, 132, 140, 174, 184, 883
  RCU read-side, 222, 231, 889
  read-side, 176, 214, 889
  write-side, 892
Data locking, see Locking, data
Data race, 49, 63, 159, 332, 524, 884
Deacon, Will, 64
Deadlock, 9, 21, 119, 161, 224, 309, 475, 526, 571, 589, 604, 884
Deadlock cycle, 642, 643
Deadlock free, 446, 884
Desnoyers, Mathieu, 277, 279
Dijkstra, Edsger W., 2, 114
Dining philosophers problem, 114
Direct-mapped cache, see Cache, direct-mapped
Dreyer, Derek, 280
Dufour, Laurent, 278
Efficiency, 12, 126, 136, 183, 638, 884
  energy, 39, 347, 884
Embarrassingly parallel, 17, 136, 147, 884
Epoch-based reclamation (EBR), 279, 286, 884
Exclusive lock, see Lock, exclusive
Existence guarantee, 184, 259, 260, 282, 422, 755, 884
False sharing, 38, 121, 154, 301, 320, 736, 765, 799, 884
Felber, Pascal, 280
Forward-progress guarantee, 192, 279, 284, 446, 884
Fragmentation, 146, 885
Fraser, Keir, 279, 884
Full memory barrier, see Memory barrier, full
Fully associative cache, see Cache, fully associative
Generality, 11, 14, 42, 126
Giannoula, Christina, 280, 594
Gotsman, Alexey, 280
Grace period, 223, 239, 284, 297, 331, 376, 410, 428, 474, 539, 581, 642, 885
Grace-period latency, see Latency, grace-period
Groce, Alex, 280
Hardware transactional memory (HTM), 599, 601, 856, 857, 860, 885
Harris, Timothy, 278, 279
Hawking, Stephen, 10
Hazard pointer, 207, 227, 236, 281, 295, 321, 429, 481, 583, 607, 755, 885
Heisenberg, Werner, 304, 341
Heisenbug, 341, 885
  anti-, 341
Hennessy, John L., 5, 25
Herlihy, Maurice P., 4
Hot spot, 135, 302, 885
Howard, Phil, 277
Howlett, Liam, 278
Hraska, Adam, 279
Humiliatingly parallel, 152, 885
Hunter, Andrew, 280
Immutable, 885
Inter-processor interrupt (IPI), 219, 554, 683, 885
Interrupt request (IRQ), 397, 471, 885
Invalidation, 663, 674, 856, 885
Jensen, Carlos, 280
Kaashoek, Frans, 278
Kernel concurrency sanitizer (KCSAN), 331, 837
Kim, Jaeho, 280
Knuth, Donald, 4, 277, 583
Kogan, Alex, 280
Kohler, Eddie, 278
Kokologiannakis, Michalis, 280
Kroah-Hartman, Greg, 5
Kroening, Daniel, 280
Kung, H. T., 4, 277
Latency, 28, 38, 461, 885
  cache-invalidation, 675
  cache-miss, 39
  grace-period, 239, 653
  memory, 579
  memory-barrier, 299
  message, 127
  scheduling, 450
Lea, Doug, 5
Lehman, Philip L., 4, 277
Lespinasse, Michel, 278
Liang, Lihao, 280
Linearizable, 278, 446, 797, 886
Linux kernel memory consistency model (LKMM), 409, 536, 555, 837
Liskov, Barbara, 278
Liu, Ran, 279
Liu, Yujie, 279
Livelock, 9, 21, 161, 171, 362, 605, 762, 886
Lock, 886
  exclusive, 53, 175, 634, 884
  reader-writer, 53, 176, 279, 889
  sequence, 890
Lock contention, 88, 109, 121, 127, 132, 142, 172, 886
Lock free, 279, 446, 886
Locking, 159
  code, 128, 131, 142, 883
  data, 21, 128, 146, 883
Luchangco, Victor, 4, 279
Madden, Samuel, 278
Mao, Yandong, 278
Maranget, Luc, 402
Marked access, 886
Marlier, Patrick, 280
Matloff, Norm, 5
Mattson, Timothy G., 5
Matveev, Alexander, 280
McKenney, Paul E., 280
Melham, Tom, 280
Memory, 886
Memory barrier, 30, 57, 127, 174, 210, 283, 299, 364, 423, 490, 578, 635, 643, 652, 661, 886
  full, 219, 490, 534, 549, 551, 838
  read, 528, 551, 678, 889
  write, 551, 678, 891
Memory consistency, see Consistency, memory
Memory latency, see Latency, memory
Memory-barrier latency, see Latency, memory-barrier
Memory-barrier overhead, see Overhead, memory-barrier
MESI protocol, 664, 887
Message latency, see Latency, message
Moore’s Law, 10, 12, 18, 25, 28, 39, 42, 129, 575, 578, 579, 887
Morris, Robert, 278
Morrison, Adam, 279
Mutual-exclusion mechanism, 887
Nardelli, Francesco Zappa, 402
Nidhugg, 414, 617, 818
Non-blocking, 887
Non-blocking synchronization (NBS), 125, 188, 277, 446, 582, 626, 630, 887
Non-maskable interrupt (NMI), 274, 377, 578, 887
Non-uniform cache architecture (NUCA), 680, 840, 887
Non-uniform memory architecture (NUMA), 174, 280, 295, 594, 840, 887
NUMA node, 21, 814, 887
Obstruction free, 446, 887
Overhead, 10, 33, 888
  memory-barrier, 30
Parallel, 638, 888
Park, SeongJae, 280, 594
Patterson, David A., 5, 25
Pawan, Pankaj, 402
Penyaev, Roman, 410
Performance, 11, 126, 638, 888
Pikus, Fedor, 949
Pipelined CPU, 888
Plain access, 63, 75, 229, 523, 888
Podzimek, Andrej, 279
Process consistency, see Consistency, process
Productivity, 11, 13, 126, 478, 592
Program order, 888
Promela, 358, 818
Quiescent state, 223, 398, 589, 656, 888
Quiescent-state-based reclamation (QSBR), 224, 254, 280, 286, 299, 542, 888
Race condition, 9, 185, 342, 357, 358, 437, 494, 645, 888
Ramalhete, Pedro, 279
RCU read-side critical section, see Critical section, RCU read-side
RCU-protected data, 779, 888
RCU-protected pointer, 222, 889
Read memory barrier, see Memory barrier, read
Read mostly, 889
Read only, 889
Read-copy update (RCU), 219, 857, 889
Read-side critical section, see Critical section, read-side
Reader-writer lock, see Lock, reader-writer
Real time, 889
Reference count, 72, 77, 204, 272, 281, 422, 439, 594, 641, 730, 889
Regan, Dave, 949
API Index
(c): Cxx standard, (g): GCC extension, (k): Linux kernel,
(kh): Linux kernel historic, (pf): perfbook CodeSamples,
(px): POSIX, (ur): userspace RCU.